Improvements in Holistic Recommender Systems Research

A DISSERTATION SUBMITTED TO THE FACULTY OF THE UNIVERSITY OF MINNESOTA BY

Daniel Allen Kluver

IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

Adviser: Joseph A. Konstan

August, 2018

© Daniel Allen Kluver 2018
ALL RIGHTS RESERVED

Dedication

This dissertation is dedicated to my family, my friends, my advisers John Riedl and Joseph Konstan, my colleagues, both at GroupLens research and at Macalester College, and everyone else who believed in me and supported me along the way. Your support meant everything when I couldn’t support myself. Your belief meant everything when I couldn’t believe in myself. I couldn’t have done this without your help.

Abstract

Since the mid 1990s, recommender systems have grown to be a major area of deployment in industry and research in academia. A through-line in this research has been the pursuit, above all else, of the perfect algorithm. With this admirable focus has come a neglect of the full scope of building, maintaining, and improving recommender systems. In this work I outline a system deployment and a series of offline and online experiments dedicated to improving our holistic understanding of recommender systems. This work explores the design, algorithms, early performance, and interfaces of recommender systems, examining many individual aspects of a recommender system while keeping in mind how each is interconnected with the other aspects of the system.

The contributions of this thesis are: an exploration of the design of the BookLens system, a prototype recommender system for library-item recommendation; a methodology and exploration of algorithm performance for users with very few ratings, which shows that the popular Item-Item recommendation algorithm performs very poorly in this context; an exploration of the issues faced by Item-Item, as well as fixes for these issues confirmed by both an offline and an online analysis; and finally, the preference bits model for measuring the amount of noise and information contained in user ratings, as well as a rating support interface capable of reducing the noise in user ratings, leading to superior algorithm performance.

Supporting these contributions are the following specific methodological improvements: a bias-free methodology for measuring algorithm performance over a range of profile sizes; a prototype user-study design for investigating new-user recommendation through Mechanical Turk; the preference bits model, along with derived measurements of preference bits per rating, per impression, and per second; and finally a sound experimental design that can be used to empirically measure preference bits values for a given interface. It is our hope that these methodological contributions can help researchers in the recommender systems field ask new questions and further the holistic study of recommender systems.

Contents

Dedication i

Abstract ii

List of Tables vii

List of Figures ix

1 Introduction 1

2 Rating-Based Collaborative Filtering: Algorithms and Evaluation 5
   2.1 Introduction ...... 5
      2.1.1 Examples of recommender systems ...... 5
      2.1.2 Taxonomies of Recommendation Algorithms ...... 10
   2.2 Concepts and Notation ...... 12
   2.3 Baseline Predictors ...... 15
   2.4 Nearest Neighbor Algorithms ...... 17
   2.5 Matrix Factorization Algorithms ...... 24
   2.6 Learning to Rank ...... 29
   2.7 Other Algorithms ...... 33
   2.8 Combining Algorithms ...... 35
      2.8.1 Ensemble Recommendation ...... 36
      2.8.2 Recommending for Novelty and Diversity ...... 38
   2.9 Metrics and Evaluation ...... 40

3 A Case Study: the BookLens Recommender System 58
   3.1 Past Work in Library Item Recommendation ...... 61
   3.2 Development of the BookLens System ...... 63
      3.2.1 Design Constraints ...... 65
   3.3 The Federated Recommender ...... 70
      3.3.1 The Centralized Recommender ...... 71
      3.3.2 The Decentralized Recommender ...... 72
      3.3.3 The Federated Recommender ...... 73
   3.4 Case Study: The Design of BookLens ...... 74
      3.4.1 Data Organization ...... 74
      3.4.2 Structural Organization ...... 79
   3.5 Case Study: Exploration of the BookLens Dataset ...... 82
      3.5.1 Data on early use ...... 84
      3.5.2 Investigation of catalog overlap ...... 88
      3.5.3 What is the effect of rating pooling? ...... 90
      3.5.4 Comparison of BookLens and MovieLens Rating Data ...... 94
   3.6 Conclusions and Future work ...... 98

4 Understanding Recommender Behavior for Users With Few Ratings 102
   4.1 Evaluation Methodologies ...... 104
      4.1.1 Past Approaches ...... 105
      4.1.2 Iterated Retain-n method ...... 106
      4.1.3 Subsetting Retain-n method ...... 109
   4.2 Evaluation of Common Algorithms ...... 111
      4.2.1 Dataset ...... 111
      4.2.2 Algorithms ...... 112
      4.2.3 Metrics ...... 113
      4.2.4 Results ...... 117
   4.3 Conclusion ...... 125

5 Improving Item-Item CF for Small Rating Counts 128
   5.1 Introduction ...... 128
   5.2 Improving Item-Item - the Frequency Optimized Model ...... 129

   5.3 Improving Item-Item - Damped Item-Item ...... 131
   5.4 Offline Evaluation of the Improved Item-Item ...... 133
      5.4.1 Small Profile Evaluation Results ...... 135
      5.4.2 Large Profile Evaluation Results ...... 143
      5.4.3 Damping Parameter Tuning Analysis ...... 146
      5.4.4 Discussion ...... 148
   5.5 A User Evaluation of Damped Item-Item Algorithms ...... 150
      5.5.1 Experiment Design ...... 152
      5.5.2 Results ...... 161
   5.6 Summary ...... 169

6 Preference Bits and the Study of Rating Interfaces 173
   6.1 Background ...... 175
      6.1.1 ...... 175
      6.1.2 Recommender Systems Research on Rating ...... 176
      6.1.3 Alternate Designs for Preference Elicitation ...... 178
      6.1.4 Information Theory ...... 179
   6.2 Measuring Information Contained in Ratings ...... 182
      6.2.1 Related Measurements ...... 188
   6.3 Studies of Rating Scales ...... 189
      6.3.1 Study 1: Simulated Rating System ...... 190
      6.3.2 Study 2: Re-analysis of a past re-rating experiment ...... 194
   6.4 Study 3: An Experiment With Rating Support Interfaces ...... 197
      6.4.1 Rating Interfaces ...... 199
      6.4.2 Measurements ...... 203
      6.4.3 Experiment Structure ...... 205
      6.4.4 Results ...... 205
      6.4.5 Discussion ...... 213
   6.5 Conclusions ...... 213

7 Conclusion 217

References 224

Appendix A. Output of SEM model for user-centered evaluation of Item-Item improvements 244

Appendix B. Outline of the BookLens Core Web server API 251
   B.1 API overview ...... 251
   B.2 Authentication ...... 252
      B.2.1 Tokens ...... 253
      B.2.2 Signing Requests ...... 254
   B.3 Book ...... 255
      B.3.1 Web Requests ...... 257
   B.4 Opus ...... 259
      B.4.1 Web Requests ...... 260
   B.5 User ...... 262
      B.5.1 Web Requests ...... 263
   B.6 Review ...... 267
      B.6.1 Web Requests ...... 269
   B.7 Batch Requests ...... 272
   B.8 Web Requests ...... 273
   B.9 User Login ...... 274

List of Tables

2.1 Summary of mathematical notation ...... 15
2.2 A confusion matrix ...... 47
2.3 Information about the MovieLens datasets ...... 57
3.1 An overview of the MELSA partner libraries ...... 65
3.2 A comparison of multi-community recommender designs ...... 73
3.3 Overall usage statistics for the BookLens system, broken out by source of data ...... 85
3.4 Measurement of catalog overlap ...... 89
3.5 Measurements of catalog 1-coverage (percent of opuses at each library with at least one rating) with and without pooled ratings for each library ...... 91
3.6 Average increase in monthly ratings at each library ...... 93
4.1 Summary of algorithms ...... 113
4.2 Summary of metrics ...... 114
4.3 Summary of algorithm behavior for new users ...... 125
5.1 Summary of algorithms including reason for inclusion in this experiment ...... 134
5.2 AverageRating@20, AILS@20, and recommendation time in seconds for the algorithms in a standard large-profile evaluation ...... 145
5.3 Summary of results of our analysis of improvements to the Item-Item Algorithm ...... 149
5.4 The list of movies used in the user evaluation of the damped Item-Item algorithm ...... 155
5.5 The questions in the first phase of our survey ...... 159
5.6 The questions in our survey and their factor loadings ...... 164

5.7 Regression coefficients between factors, and for conditions (relative to random recommendation) on goodness and diversity factors ...... 165
5.8 Conditional probability of item response based on seen or unseen ...... 167
5.9 Conditional probability of item response based on user condition ...... 167
6.1 The probability distribution of ratings on the movie Titanic (1997) by users with known gender on MovieLens ...... 181
6.2 Results from the re-analysis of the SCALES dataset ...... 195
6.3 Mean rating time per item in seconds for each scale and domain based on Sparling and Sen's study ...... 196
6.4 Our quantitative results for four interfaces ...... 206
6.5 Items presented in the questionnaires ...... 210

List of Figures

2.1 Screenshot of the Pandora music streaming service ...... 6
2.2 Screenshot of the Jester joke recommender ...... 7
2.3 Screenshot of the MovieLens home page ...... 8
2.4 Screenshot of the Amazon home page ...... 9
2.5 Examples of various rating scales used in the wild ...... 13
2.6 An example ROC curve. Image used with permission from [41] ...... 48
3.1 A timeline of the development of the BookLens system ...... 64
3.2 Three recommender system designs capable of working with multiple communities of interest ...... 71
3.3 A simplified data model for the BookLens system ...... 75
3.4 The architecture of the BookLens System ...... 79
3.5 The BookLens system (as deployed in one of our catalogs) ...... 81
3.6 A time-line of the deployments of the BookLens system ...... 82
3.7 Number of opuses with each rating count ...... 88
3.8 Number of monthly rates per client with and without rating pooling ...... 92
3.9 Number of ratings made by users of MovieLens and BookLens ...... 95
3.10 Distribution of ratings in MovieLens and BookLens ...... 97
4.1 The iterated-retain-n methodology applied to the item baseline in the MovieLens 1m dataset ...... 108
4.2 The subsetting-retain-n methodology applied to the item baseline in the MovieLens 1m dataset ...... 110
4.3 Accuracy and Ranking Metrics ...... 118
4.4 Information Retrieval Recommendation Quality Metrics ...... 120
4.5 Recommendation Quality Metrics ...... 121

4.6 Non-monotonic Recommendation Quality Metrics ...... 123
5.1 A plot of RMSE by the number of neighbors used when making a prediction ...... 131
5.2 Results of the small profile evaluation for the Full Item-Item and the Frequency Optimized Item-Item algorithms ...... 136
5.3 Results of the small profile evaluation for the damped Item-Item and the frequency optimized + damped (or combined) Item-Item algorithms ...... 140
5.4 The effect of varying the damping parameter in the damped Item-Item algorithm for users with 10 ratings ...... 147
5.5 The ratings page of our user survey ...... 153
5.6 The recommendations part of the third page of our user survey ...... 157
6.1 Graphical model of input preference bit measurement ...... 185
6.2 Our rating system model ...... 190
6.3 Preference bits per rating increases as the size of the rating scale increases ...... 191
6.4 Our four interface designs to provide memory-support, rating support and joint-evaluation support ...... 200
6.5 The structural equation model fitted on the Subject constructs difficulty and usefulness and their relations to the 4 interfaces ...... 211
6.6 Total effects of the interfaces on usefulness: error bars are 1 standard error of the mean. Comparisons are made against the baseline interface ...... 212

Chapter 1

Introduction

Since the mid 1990s, automated recommender systems have become very popular in industry. Companies like Amazon, Netflix, Google, and Facebook have deployed recommendation algorithms in their systems to huge success. By making personal recommendations to each user, these systems narrow down the thousands of possible choices that a user could make into a small list of the best options. The popularity of these deployments has pushed the field of recommender systems forward, resulting in a thriving and active research community.

While there are many ways to make automated, personalized recommendations, the most popular is so-called collaborative filtering recommendation. Collaborative filtering recommender systems work on the assumption that people have relatively stable tastes. Therefore, if two people have agreed in the past they will likely continue to agree in the future. One powerful benefit of this formulation is that it does not matter why people agree. Two users could share tastes in books because they both like the same style of book binding, or they could share taste in movies due to a nuance of directing; a collaborative filtering recommender system does not care. So long as the two users continue agreeing, we can use the stated preferences of one user to predict the preferences of the other.

The majority of innovation in the field of recommender systems has focused exclusively on recommendation algorithms. Usually operating on existing datasets, this area of work asks "how can we better model the patterns that exist in preference data to make more accurate predictions, or more useful recommendations, for a user?" This

focus in the field is not surprising: the algorithm is the most obvious component responsible for how well the system serves its goal of helping users find the best items on offer. Likewise, this is the work that is best supported by existing resources and methodologies.

Due to this focus, many scholars conflate the study of recommender systems with the study of recommendation algorithms, ignoring the broader system in which these algorithms operate. Considered holistically, however, recommender systems are complicated social and technical systems, and this focus ignores important aspects of the system. To completely understand and optimize a recommender system you must consider not only the algorithm that operates on preference data, but also how that data is collected from users, how the output of the algorithm is presented to users, and the behavioral properties of the users themselves. Unfortunately, while there is good support for experimentation into new recommendation algorithms, there is a lack of standard support for studying the surrounding context of these algorithms.

While it is hard to say objectively that historic research on recommender systems is over-focused on algorithmic improvements, there are certainly interesting system- and user-focused recommender systems questions that have gone unanswered. Only recently has the field started to question, for example, whether rating, ranking, or some other form of comparison is the best way to express a preference on an item. Likewise, while historically popular application domains like movies and music have seen many publications, equally interesting domains such as book recommendation have seen much less research.

While current advances have pushed the field of recommender systems fantastically far in a relatively short amount of time, there is still room for improvement. To reach these improvements, however, it has become increasingly important to take a holistic, system- and user-focused view of the system. While there are still algorithm-driven improvements to be made, it has become increasingly hard to make a meaningful impact with an algorithm-centered perspective, instead of viewing the algorithm as but one piece in a larger system. Outside of algorithm-centered work, there are also interesting questions left unanswered about how best to elicit user preferences, what users want from their recommendations at various points in the user life-cycle, and how we can best use the knowledge generated by our algorithms.

This dissertation seeks to answer some of these questions by taking a holistic, system-focused perspective on recommender systems. The heart of this perspective is the BookLens system, a library item recommendation system developed over the course of my PhD education. By designing, building, and maintaining a recommender system for a novel domain, we found ourselves asking novel questions about recommendation systems. Through a series of experiments we then arrived at the following surprising results:

• The popular Item-Item recommendation algorithm drastically underperforms for users with few ratings.

• An improved Item-Item algorithm performs substantially better in offline analysis, but fails to outperform baseline approaches when evaluated by actual people.

• The size of the rating interface can be optimized to best capture user ratings, and changes in these interfaces can directly relate to better algorithm performance as measured using standard algorithm-centered metrics.

An important part of the contribution of this work, however, is not the direct conclusions of our studies. Indeed, much of the work involved in this thesis was not in finding the above results, but in asking how we can scientifically study these problems. Therefore, a secondary contribution of this thesis is a series of new means and methodologies for further understanding recommender systems. By enhancing our ways of knowing, this work can serve to support future non-algorithm-centered recommender systems research. Our report on the BookLens system, for example, can be used by others seeking to navigate the novel constraints inherent in deploying recommendations at local institutions such as public libraries. This dissertation makes the following methodological contributions:

• An evaluation methodology that allows accurately estimating how the behavior of an algorithm changes as the algorithm gains initial ratings

• A prototype user-centric evaluation procedure that can easily be deployed to Amazon Mechanical Turk to study new-user recommendation

• A framework and related measurements that allow us to measure the amount of information, and noise, in user ratings, as well as an experimental methodology for comparing this between various interfaces.

While there are certainly many questions left to answer about recommender systems, these contributions allow other researchers to build upon this dissertation, asking new and interesting questions about recommender systems.

We will begin in the next chapter by reviewing the general field of algorithms and evaluations for rating-based collaborative filtering. While not explicitly focused on the broader system, this will serve as useful background knowledge for the next four chapters, as well as a tutorial on the technical aspects of these algorithms for the interested reader. Then in chapter 3 we will explore the BookLens system, which focuses on a novel domain for recommendation (library items) and describes a novel design for recommender systems (the federated recommender). This chapter will end with a comparison of data from BookLens with data from MovieLens, whose dataset has inspired dozens of explorations into recommender systems. This comparison will inspire two further research questions. In chapter 4 we will explore the first question: "how do algorithms perform when given a very small number of ratings?". We will dig deeper into this question in chapter 5, seeking to understand why one common algorithm drastically underperforms when given only a small amount of user data, and what can be done to improve it. We will then explore the second research question in chapter 6: "How can we measure the amount of preference information gathered through a rating interface?". That chapter will end with an experiment showing that we can not only improve the quality of user ratings, but that these improvements can have a measurable effect on the quality of predictions made to system users. Finally, we will conclude this dissertation in chapter 7, which will pull the previous chapters together, discussing shared themes, common limitations, specific take-aways for recommender systems practitioners, and promising directions for future work.

Chapter 2

Rating-Based Collaborative Filtering: Algorithms and Evaluation

2.1 Introduction

While the contributions of this dissertation are varied in focus over a range of aspects of the recommender system, they build upon a shared body of prior work in the design and evaluation of recommendation algorithms. In this chapter we will explore this prior work.1 This chapter focuses upon the foundational ratings-based collaborative filtering algorithms and methods for evaluating these algorithms for use in a given application. On top of the work discussed in this chapter, each following chapter will have a smaller section addressing any related research that is relevant only to the work in that chapter.

2.1.1 Examples of recommender systems

There are many different ways recommendation algorithms can be incorporated into an online service. The simplest is the "streaming" style of service, which is oriented around a stream of recommended content. Two examples of this are streaming music services

1 A version of this chapter appeared under the same name previously as [64]. This chapter was co-authored with Joseph Konstan and Michael Ekstrand. As primary author I was primarily responsible for the text of that chapter, and this one.


like Pandora2 and the Jester joke recommender3. Screenshots of these services are shown in Figures 2.1 and 2.2. Both services share the same design: the user is presented with content (music or jokes). After each item the user is given the opportunity to evaluate the item. These evaluations influence the algorithm, which then picks the next song or joke. This process repeats until the user leaves. Jester is known to use a collaborative filtering algorithm [47]. Interested readers can find more information about Jester and even download a rating dataset for experimentation from the Jester web page. As a commercial product, less is known about Pandora's algorithm. However, it is reasonable to assume that they are using a hybrid algorithm that combines collaborative filtering information with their catalog of song metadata.

Figure 2.1: Screenshot of the Pandora music streaming service

2 https://www.pandora.com/
3 http://eigentaste.berkeley.edu/

Figure 2.2: Screenshot of the Jester joke recommender

A quite different way to use recommendation algorithms can be seen in catalog-based websites like MovieLens4. MovieLens is a movie recommender developed by the GroupLens research lab. On the surface MovieLens is similar to other movie catalog websites such as the Internet Movie Database (IMDB) or The Movie Database (TMDB). All three have pages dedicated to each movie detailing information about that movie, and search features to help users find information about a given movie. MovieLens goes further, however, by employing a collaborative filtering algorithm. MovieLens encourages users to rate any movie they have seen; MovieLens then uses these ratings to provide personalized predicted ratings, which it shows alongside a movie's cover art in both movie search and detail pages. These predictions can help users rapidly decide if it is worth learning more about a movie. Users can also ask MovieLens to produce a list of recommended movies, with the top 8 most recommended movies for a user being centrally positioned on the MovieLens home page, as can be seen in Figure 2.3.

4 https://movielens.org/

Figure 2.3: Screenshot of the MovieLens home page

A third common way to use recommendation algorithms is in e-commerce systems, perhaps the most notable being Amazon5. Amazon is an online store which started as a bookstore, but has since diversified into a general purpose online storefront. While the average user may not notice the recommendations in Amazon (or at the very least may think little of them), much of the Amazon storefront is determined by recommendation algorithms. A screenshot of the Amazon main page for one author is shown in Figure 2.4 with recommendation features highlighted. Since only a small proportion of users use reviews on Amazon, it is likely that Amazon uses data beyond ratings in their collaborative filtering algorithm. Unlike MovieLens, getting information and recommendations is not the primary motivation of Amazon users. Therefore, while the basic interfaces may be similar, the way recommendations are used, and the algorithm properties that a system designer might look for, will be different.

As these examples show, recommendation algorithms can be useful in a wide range of situations. That said, there are some commonalities: each service has some way of learning what users like. In MovieLens and the streaming services, users can explicitly rate how much they like a movie, joke, or song. In Amazon, purchase records and browsing history can be used to infer user interests. Each service also has some way of suggesting one or more items to the user based on its recommendation algorithm. It will be helpful to keep these examples in mind, as they will help anchor the more abstract algorithm details covered in this chapter to a specific context of use.

5 https://www.amazon.com/

Figure 2.4: Screenshot of the Amazon home page

2.1.2 Taxonomies of Recommendation Algorithms

Every year dozens of new recommendation algorithms are introduced. It should be no surprise, therefore, that there have been various attempts to organize these algorithms into a taxonomy or classification scheme for collaborative filtering algorithms. The purpose of any such organization is to allow better communication about how an algorithm works, and what other algorithms it is similar to, by describing where that algorithm sits in a taxonomy. To some degree these classifications have been useful, providing a framework to help understand the field. Some of the distinctions that have been made are no longer ideal, either because algorithms have advanced to the point where a distinction has no meaning, or because the classification itself has been used inconsistently.

In this chapter we will restrict ourselves to categorizations that we feel are useful for communication. That said, we note that other works on recommender systems that a reader might explore are still organized under some of these traditional taxonomies. Therefore we will introduce some of these distinctions now so the reader can be aware of them if they wish to read other resources on recommendation algorithms.

One important distinction that has been made between algorithms is between collaborative filtering algorithms (the focus of this thesis) and content-based algorithms. Collaborative filtering algorithms, as was described earlier, operate by finding patterns in user behavior that can be used to predict future behavior. The traditional example of this would be that two users tend to like the same things; therefore, when one user likes something, we can predict the other user will as well. Content-based algorithms, on the other hand, focus on relationships between users and the content they like. A traditional example of this would be an algorithm that learns which genres of music a user likes and recommends songs from those genres. While still meaningful, the line between collaborative and content-based filtering has become somewhat blurry as modern algorithms have sought to combine the strengths of both approaches. Readers can still expect to see this distinction made in new publications, as algorithms that are both content-based and collaborative filtering algorithms are still in the minority.

Within the specific range of collaborative filtering algorithms, the most common taxonomy separates so-called model-based algorithms and memory-based algorithms. The division was first made in a 1998 paper [18] where memory-based algorithms were defined as those that operate over the entire dataset, while model-based algorithms are those that use the dataset to estimate a model which can then be used for predictions. For recommender systems this split is problematic, as many algorithms can be described sufficiently as a memory-based algorithm or a model-based algorithm depending on how the algorithm is optimized and deployed.

More recently this same distinction has been used more usefully to separate algorithms based on their basic design [123]. Model-based algorithms are those that use machine learning techniques to fit a parameterized model, while memory-based algorithms search through the training data to find similar examples (users or items). These examples are then aggregated to compute recommendations. While still common, we find this latter separation does not do a great job of communicating the distinctions between algorithms. Therefore we will eschew this taxonomy and present algorithms grouped, and labeled, by their mathematical structure or motivation.

In the next section we will cover the basic concepts and mathematical notation that will be used throughout this chapter. The section after that will describe baseline algorithms: simple algorithms which seek to capture broad trends in rating data. Section 2.4 will describe nearest neighbor algorithms: the group of algorithms that have historically been called memory-based algorithms, which work by finding similar examples which are used in computing recommendations. Section 2.5 will describe matrix factorization algorithms: a group of algorithms that share a common and powerful mathematical model inspired by matrix factorization. Section 2.6 will describe learning-to-rank algorithms: algorithms that focus on ranking possible recommendations, instead of predicting what score a user will give a particular item. Section 2.7 will briefly mention other groups of algorithms which we do not explore in depth: graph-based algorithms, linear-regression-based algorithms, and probabilistic algorithms. Section 2.8 will describe ensemble methods: ways to combine multiple recommenders. Finally, Section 2.9 will explore metrics and evaluation procedures for collaborative filtering algorithms.

2.2 Concepts and Notation

In this section we will discuss the core concepts and mathematical notation that will be used in our discussion of recommendation algorithms. The two most central objects in a recommender system are the users the system recommends to and the items the system might recommend. These terms are purposely domain neutral, as different domains often have domain-specific terms for these concepts.

One user represents one independently tracked account for recommendation. Typically (but not always) this represents one system account, and is assumed to represent one person's tastes. We will denote the set of all users as U, with $u, v, w \in U$ being individual users from the set.

One item represents one independently tracked thing that can be recommended. In most systems it is obvious what services or products should map to an item in the recommendation algorithm: in an e-commerce system like Amazon or eBay, each product should be an item; in a movie recommender, each movie should be an item. In other domains there might be more uncertainty; in a music recommender, should each song be an item (and recommended individually), or should each album be an item? We will denote the set of all items as I, with $i, j, k \in I$ being individual items from the set.

Most traditional collaborative filtering recommender systems are based on ratings: numeric measures of a user's preference for an item. Ratings are collected from users on a given rating scale, such as the 1-to-5 star scale used in MovieLens, Amazon, and many other websites, or the ten-star scale used by IMDB6 (see Fig. 2.5c for examples). Rating scales are almost always designed so that larger numbers indicate more preference: a user should like a movie they rated 5/5 stars more than they like a movie they rated

4/5 stars. The rating user u assigns to item i will be denoted $r_{ui}$. While there are many different rating scales that have been used, the choice of rating scale is often not relevant for a recommendation algorithm. Some interfaces, however, deserve special consideration. Small scales which can only take one or two values, such as a binary (thumbs up, thumbs down) scale (see Fig. 2.5b) or a unary "like" scale (see Fig. 2.5a), may require special adaptations when applying algorithms designed with a larger scale in mind. For example, when working with the unary "like" scale it can be important to explicitly treat non-response as a form of rating feedback.

6 http://www.imdb.com/

(a) Unary rating scales from Facebook7 (left) and Twitter8 (right)
(b) Binary rating scale from YouTube9
(c) 5 point rating scale from MovieLens, and a larger, 10 point scale from IMDB
(d) A very large continuous rating scale used in the Jester joke recommender

Figure 2.5: Examples of various rating scales used in the wild.

While the algorithms in this chapter are focused on rating-based approaches, it is important to understand that they only need a numeric measure of preference. In this way these algorithms can be used with many different forms of preference feedback. For example, one simple strategy for handling implicit feedback (broadly, measurements of user preference that are not ratings, such as purchase history in e-commerce systems, or viewing time in online video streaming services) is to convert it into pseudo-ratings that can be used with traditional algorithms.

Recommender systems often organize the set of all ratings as a sparse rating matrix

R, which is a $|U| \times |I|$ matrix. R only has values $r_{ui} \in R$ where user u has rated item i; for all other pairs of u and i, $r_{ui}$ is blank. The set of all items rated by a user u is $I_u \subset I$, and the collection of all ratings by one user can also be expressed as a sparse vector $\vec{r}_u$. Similarly, the set of all users who have rated one item i is $U_i \subset U$, and the collection of these ratings can be expressed as a sparse vector $\vec{r}_i$.

As the rating matrix R is sparsely observed, one of the ways collaborative filtering can be viewed is as matrix completion. Matrix completion is the task of filling

7 https://www.facebook.com/
8 https://twitter.com/
9 https://www.youtube.com

in the missing values in a sparse matrix. In the recommendation domain this is also called the prediction task, as filling in unobserved ratings is equivalent to predicting what a target user u would rate an item i. Rating predictions can be used by users to quickly evaluate an unknown item: if the item is predicted highly it might be worth further consideration. As we described for MovieLens, one use of predictions is to sort items by predicted user preference. This leads to the view of a recommender algorithm as generating a ranking score. The ranking task is to generate a personalized ranking of the item set for each user. Any prediction algorithm can be used to generate rankings, but not all ranking algorithms produce scores that can be thought of as a prediction. Algorithms that focus specifically on the ranking task are known as learning-to-rank recommenders and will be discussed towards the end of this chapter. Both prediction- and ranking-oriented algorithms produce a score for each user and item. Therefore, we will use the syntax S(u,i) to represent the output of both types of algorithm. Almost every algorithm we discuss in this chapter will produce output as a score for each user and item.

One of the most interesting applications of collaborative filtering recommendation technology is the recommendation task. The recommendation task is to generate a small list of items that a target user is likely to want. The simplest approach to this is Top-N recommendation, which takes the N highest ranked items by a ranking or prediction algorithm. More advanced approaches involve combining ranking scores with other factors to change the properties of the list of recommendations.

For each task there are associated metrics which can be used to evaluate the algorithm. Prediction algorithms can be evaluated by the accuracy of their predictions, and ranking algorithms can be compared to how users rank items. Recommendation algorithms are a particular challenge to evaluate, however, as users are sensitive to properties such as the diversity of the recommendations, or how novel the recommended items are. Evaluation is an important concept in the study of recommender systems, especially as some algorithms are partially defined by specific evaluation metrics. We will discuss evaluation approaches in more detail in Section 2.9.

Table 2.1: Summary of mathematical notation

2.3 Baseline Predictors

Before describing true collaborative filtering approaches, we will first discuss baseline predictors. Baseline predictors are the most simple approaches for rating. While a base- line predictor is rarely the primary prediction algorithm for a recommender system, the baseline algorithms do have their uses. Due to their simplicity, baseline predictors are often the most reliable algorithms in extreme conditions such as new users [65]. Be- cause of this, baseline predictions are often used as a fallback algorithm in cases where a more advanced algorithm might fail. Baseline predictions can also be used to estab- lish a minimum standard of performance when comparing new algorithms and domains. Finally, baseline prediction algorithms are often incorporated into more advanced al- gorithms, allowing the advanced algorithms to focus on modeling deviations from the basic, expected patterns that are already well captured by a baseline prediction. The most basic baseline is the global baseline, in which one value is taken as the prediction for every user and item S(u,i)= µ. While any value of µ is possible, taking µ as the average rating minimizes prediction error and is the standard choice. The global baseline can be trivially improved by using a different constant for every item or user leading to the item baseline S(u,i)= µi and the user baseline S(u,i)= µu respectively.

In the item baseline µi is an estimate of the item i’s average rating. This allows the item baseline to captures differences between different items. In particular, some items are widely considered to be good, while others are generally considered to be bad. In the user baseline µu is an estimate of the user u’s average rating. This allows the user 16 baseline to capture differences between how users tend to use the rating scale. Because most rating scales are not well anchored, two users might use different rating values to express the same preference for an item. This discussion leads us to the generic form of the baseline algorithm, the user-item baseline, given in Equation 2.1.

S(u,i)= µ + bu + bi (2.1)

Equation 2.1 has three variables: µ, the average rating in the system; bi, the item bias representing if an item is, on average, rated better or worse than average; and bu the user bias representing whether the user tends to rate high or low on average. By combining all three baseline models we are able to simultaneously account for differences between users and items, albeit in a very naive way. This equation is sometimes referred to as the personalized mean baseline as it is technically a personalized prediction algorithm, even though the personalization is very minimal. This model can be learned many ways [68], but are most easily learned with a series of averages, with µ being the average rating, bi being the item’s average rating after subtracting out µ and bu being the user’s average rating after subtracting out µ and bi [41]. The following equations can be used to compute µ, bi and bu:

rui µ = rui∈R (2.2) P |R|

u∈Ui (rui − µ) bi = (2.3) P |Ui|

i∈Iu (rui − bi − µ) bu = (2.4) P |Iu| One problem that can lead to poor performance from the user-item baseline is when a user or item has very few ratings. Predictions made on only a few ratings can be very unreliable, especially if the prediction is extremely high or low. One way to fix this is to introduce a damping term β to the numerator of the computation. Motivated by Bayesian statistics, this term will shrink the bias terms towards zero when the number of ratings for an item is small while having a negligible effect when the number of ratings 17 is large.

u∈Ui (rui − µ) bi = (2.5) P |Ui| + β

i∈Iu (rui − bi − µ) bu = (2.6) P |Iu| + β Damping parameter values of 5 to 25 have been used in the past [44, 65], but for best results β should be re-tuned for any given system.

2.4 Nearest Neighbor Algorithms

The first collaborative filtering algorithms were nearest neighbor algorithms. These algorithms work by finding similar items or users to the user or item we wish to make predictions for, and then uses ratings on these items, or by these users, to make a prediction. While newer algorithms have been designed, these algorithms are still in use in many live systems. The simplicity and flexibility of these basic approaches, combined with their competitive performance, makes them still important algorithms to understand. Readers with a background in general machine learning approaches may be famil- iar with nearest-neighbor algorithms, as these algorithms are a standard technique in machine learning. That said, there are many important details in how recommender systems experts have deployed the nearest neighbor algorithm in the past. These details are the result of careful study in how to best predict ratings in the recommender system domain.

User-User

Historically, the first collaborative filtering algorithms were user-based nearest neighbor algorithm, sometimes called the user to user algorithm or user-user for short [107]. This is the most direct implementations of the idea behind collaborative filtering, simply find users who have agreed with the current user in the past and use their ratings to make predictions for the current user. User-based nearest neighbor algorithms were quite popular in early recommender systems, but they have fallen out of favor due to 18 scalability concerns in systems with many users. The first step when making predictions under the user-user algorithm for a given user u and item i is to generate a set of users who are similar to u and have rated item i. The set of similar users is normally referred to as the user’s neighborhood Nu, with the subset who have rated an item i being Nui. Once we have the set of neighbors we can take a weighted average of their ratings on an item as the prediction. Therefore the most important detail in the user-based nearest neighbor algorithm is the similarity function sim(u,v). A natural and widely-used choice for sim(u,v) is a measurement of the correlation between the ratings of the two users; usually, this takes the form of Pearson’s r [50]:

(r − µ )(r − µ ) sim(u,v)= i ui u vi v (2.7) 2 2 iP(rui − µu) i(rvi − µv) Alternatively, the rank correlationpP in the formp ofP Spearman’s ρ (the Pearson correla- tion of ranks rather than values [87]) or Kendall’s τ (based on the number of concordant and discordant pairs) can be used. In addition to statistical measures, vector space mea- sures such as the cosine of the angle between the users’ rating vectors can be used:

~r · ~r sim(u,v)= u v (2.8) k~ruk2k~rvk2 In non-rating-based systems, the Jaccard coefficient between the two users’ pur- chased items is a reasonable choice. In most published work, Pearson correlation has produced better results than either rank-based or vector space similarity measures for rating-based systems [18, 50]. How- ever, Pearson correlation does have a significant weakness for rating data: it ignores items that only one the users has rated. In the extreme, if two users have only rated two items in common their correlation would be 1. Unfortunately, its unlikely that two users who do not watch the same things would truly be that similar. In general, corre- lations based on a small number of common ratings trend artificially towards extreme values. While these similar neighbors may have high similarity scores, they often do not perform well as neighbors whose similarity scores are based on a larger number of ratings. Significance weighting [50] addresses this problem by introducing a multiplier to 19 reduce the measured similarities between between users who have not rated many of the same items. The significance weighting strategy is to multiply the similarity by min(|Iu∩Iv|,T ) T , where T is a threshold of ”enough” co-rated items. This causes the simi- larity to linearly decrease between users with fewer than T common rated items. Past work has found T = 50 to be reasonable, with larger values showing no improvement [50]. There is a natural, parameter-free way to dampen similarity scores for users with few co-occurring items. If items a user has not rated are treated as having the user’s average rating, rather than discarded, the Pearson correlation can be computed over all items. When computing with sparse vectors, this can be realized by subtracting each user’s mean rating from their rating vectors, then comparing users by taking the cosine between their centered rating vectors and assuming missing values to be 0. This results in the following formula:

rˆ · rˆ sim(u,v)= u v (2.9) krˆuk2krˆvk2 (rui − µu)(rvi − µv) = i∈Iu∩Iv (2.10) 2 2 iP∈Iu (rui − µu) i∈Iv (rvi − µv) qP qP This is equivalent to the Pearson correlation, except that all of each users’ ratings contributes to their term in the denominator, while only the common ratings are counted in the numerator (due to the normalized value for missing ratings being 0). The result is similar to significance weighting. Similarity scores are damped based on the fraction of rated items that are in common; if the users have 50% overlap in their rated items, their resulting similarity will be greater than the similarity between users with the same common ratings but only 20% overlap due to additional ratings of other items. This method has the advantage of being parameter-free, and has been seen to perform at least as well [35, 36, 40]. After picking a similarity function, the next step in predicting for a user u and item i is to compute the similarity between user u and every other user. For systems with many users approximations such as randomly sampling users [53] can improve performance, possibly at a tradeoff of prediction accuracy. Once similarities have been computed the system must choose a set of similar users Nu. Various approaches can be taken here, 20 from using all users to a limited number or only those that are sufficiently similar. Past evaluations have suggested that using the 20 to 60 most similar users performs well and avoids excessive computations [50]. Additionally, by filtering the users the algorithm can avoid the noise that would be introduced by the lower quality neighbors.

Once the algorithm has a set of neighbors Nu the prediction for item i is simply the weighted average of the neighboring users’ ratings. Direct averages, without the weighting term, have been used in the past, but tend to perform worse. Let Nui refer to the subset of Nu containing all users in Nu who have rated item i.

sim(u,v) ∗ rvi S(u,i)= v∈Nui (2.11) |sim(u,v)| P v∈Nui These predictions can often be improvedP by incorporating basic normalization into the algorithm. For example, since users have different average ratings we can take a weighted average of the item’s offset from the user’s average rating.

sim(u,v) ∗ (rvi − µv) S(u,i)= µ + v∈Nui (2.12) u |sim(u,v)| P v∈Nui P More advanced normalization is also possible by using z score normalization in which all ratings are first reduced by the user’s average rating, and then divided by the standard deviation in the users rating. The user-based nearest neighbors approach tends to produce good predictions, but is often outperformed by newer algorithms. In this regard, the algorithm is listed here mostly for reference value. Other than its accuracy, one core issue with the performance of the user-based recommender is its slow predict time performance. Most modern rec- ommender systems have a very large number of users which makes finding neighbor- hoods of users expensive. For good results neighborhood finding should be done online and cannot be extensively cached for performance improvements. This computation makes user-based nearest neighbors very slow for large scale recommender systems, but it remains viable option for a recommender system with many items but relatively few users. 21 Item-Item

The item-based nearest neighbor algorithm (sometimes called the item to item or item- item algorithm for short) is closely related to user-user [112]. Where user-user works by finding users similar to the given user and recommending items they liked, item-item finds items similar to the items the given user has previously liked and uses those to make recommendations. Instead of computing similarities between users, item-item computes similarities between items, and uses an average rating over the item neighborhood to make predictions. Although other methods are possible, Item-Item is typically deployed with similarities estimated from the rating data itself, such that two items are more similar if they are likely to be rated the same thing. Unlike user-user, item-item is well suited to modern systems which have many more users than items. This allows item-item some key performance optimizations over user-user which we will address shortly. To make a prediction for user u and item i item-item first computers a neighborhood of similar items Nui. In practice it is common to limit this neighborhood to only the k most similar items. k = 30 is a common value from the academic research, however, different systems may require different settings for optimal performance [112]. Item-item then takes the weighted average of user u’s ratings on the items in this neighborhood, and uses that as a prediction.

sim(i,j) ∗ ruj S(u,i)= j∈Nui (2.13) |sim(i,j)| P j∈Nui P This equation can be enhanced by subtracting a baseline predictor from the ratings rui so the algorithm is only predicting deviation from baseline. If this is done, the baseline should be added back in after the fact to make a prediction. Note this will have no effect if the global or per-user baselines are used. The following equation shows how a baseline predictor could be subtracted, using B(u,i) to represent the baseline

sim(i,j) ∗ (ruj − B(u,j)) S(u,i)= B(u,i)+ j∈Nui (2.14) |sim(i,j)| P j∈Nui Generally item-item uses the same similarityP functions as user-user, simply replacing the user ratings with item ratings. 22 The cosine similarity metric is the most popular similarity metric for item-item recommendation. Past work has shown that cosine similarity performs better than other traditionally studied similarity functions [112]. The key to getting the best quality predictions using cosine similarity is normalizing the ratings [112]. Evaluations have shown that subtracting the user’s average rating from the ratings before computing similarity leads to substantially better recommendations. In practice we have found that subtracting the item baseline or the user-item baseline leads to improvements in performance [40].

(rui − B(u,i)) ∗ (ruj − B(u,j)) sim(i,j)= u∈U (2.15) 2 2 u∈PU (rui − B(u,i)) ∗ u∈U (ruj − B(u,j)) Both Pearson and SpearmanpP similarities havep beenP tried for item-item prediction, but do not tend to perform better than cosine similarity [112]. Just like Pearson simi- larity for the user-user algorithm, significance weighting can improve prediction quality when using Pearson similarity with item-item. One key advantage of the item-item algorithm is that item similarities and neigh- borhoods can be shared between users. Since no information about the given user is used in computing the list of similar items, there is no reason that the values cannot be cached and re-used with other users. Furthermore, in systems where the set of users is much larger than the set of items, we would statistically expect most items to have very many ratings. In this situation, the only items with few ratings are likely those of little interest to the user population in general. Many ratings per items leads to relatively stable item similarity scores, meaning that these can be cached for a much larger amount of time than user-user similarity scores. This insight has led to the common practice of precomputing the item-item similarity matrix. With a precomputed list of similar items the specific item neighborhood used for prediction can be found with a quick linear scan, using only information about the given user’s past ratings. This speeds up predict-time computation drastically, making item-item more suitable for modern interactive systems than the user-user algorithm. The cost of this speed-up is a regularly-occurring “model build” in which the sim- ilarity model is recomputed. This frequently is done nightly to ensure that new items are included in the model and are available to recommend. Since this model build is 23 not interactive, it can run on a separate bank of machines from the live system and be scheduled to avoid peak system use. The improvements from precomputing similarities can be made even larger by trun- cating the stored model. For reasonably large systems storing the whole item-item similarity matrix can take a lot of space. Many items have low to no similarity with all but a small percent of the system. Since these dissimilar items will almost never be used in an actual item neighborhood, there is no point to store them. By keeping only the most useful potential neighbors, the model size on disk and in memory can be reduced and predict time performance can be increased. Therefore it is common to keep only the n most similar items for any item in the model as “potential neighbors”. Past work has shown that larger models do perform slightly better than smaller models, but that the advantages disappear after some point. In the original work on item-item, the point at which a larger model has no benefit is around 100 to 200 items [112]. Work based on larger datasets have also found larger models (500 items or more) to more effective; suggesting that, like all other parameters, the best value for the model size will vary from system to system [37, 65]. With the various tweaks and optimizations the research community has found since item-item was first published, item-item can be a strong algorithm for recommendation. While it is slightly outperformed by newer algorithms, it is still very competitive when well tuned [38,39]. Furthermore, item-item is easier to implement, modify, and explain to users than most other recommendation algorithms. 
For these reasons, item-item is still a competitive algorithm for large scale recommender systems, and still sees modern deployment despite more recent, slightly more accurate, algorithms being offered [108].

Variants

Nearest-neighbor algorithms are the best-known approaches for collaborative filtering recommendation. Because of this, they have been modified in many interesting ways. One variant is an inversion of the user-user algorithm: the K-furthest neighbors algorithm by Said et al. [110]. The K-furthest neighbors algorithm makes neighborhoods based on the least similar users, instead of the most similar users. The idea behind this is to enhance the diversity of the recommendations made. User evaluations comparing nearest neighbor recommendation to furthest neighbor recommendation show that the two are relatively close in user satisfaction, even if the predictions made by nearest neighbor recommendation are much more accurate.

Another interesting variant of nearest neighbor recommendation is Bell and Koren's Jointly Derived Neighborhood Interpolation Weights approach [9]. The key insight of this approach is that the quality of the similarity function directly determines the quality of the recommendations in a neighborhood model. Therefore, these similarity scores should be directly optimized, instead of relying on ad-hoc similarity metrics. One key advantage of this is that the similarity scores can be jointly optimized, which makes the algorithm more robust to interactions involving multiple neighbors. A similar approach has also been taken by Ning and Karypis' SLIM algorithm [90].

Many variants of nearest-neighbor algorithms use some external source of information to inform the similarity function used. One example of this is the trust-aware recommendation framework [81], which re-weights user similarity scores based on an estimated degree of trust between two users. In this way the algorithm bases predictions on more trusted users. This same approach could be used with other forms of information, such as content-based similarity information.

2.5 Matrix Factorization Algorithms

Nearest-neighbor algorithms are good at capturing pairwise relationships between users or items. They cannot, however, take advantage of broader structure in the data, such as the idea that five different items share a common topic, or that a user’s ratings can be explained by their interest in a particular feature. To explicitly represent this type of relationship requires a fundamentally different approach to recommendation. One such approach is the use of latent feature models such as the popular family of matrix factorization algorithms. Rather than modeling individual relationships between users or items, latent feature models represent each user’s preference for items in terms of an underlying set of k features. Each user can then be described in terms of their preference for each latent feature, and each item can be described in terms of its relevance to each feature. These item and user feature scores can then be combined to predict the user’s preference for future items. All matrix factorization algorithms encode each user’s preference numerically in k-dimensional vectors ~pu and each item’s relevance to features in k-dimensional vectors

~qi. We will use puf to indicate the value representing a user’s preference for a given feature f and qif to indicate the value representing an item’s relevance to feature f. Once these vectors are computed, we can compute a user u’s preference for a particular item i as the dot product of the user feature vector and the item feature vector.

S(u,i) = \vec{p}_u \cdot \vec{q}_i = \sum_{f=1}^{k} p_{uf} q_{if}   (2.16)

Under this equation, S(u,i) will be high if and only if the features u prefers (with high scores in ~pu) are also the features i is relevant to (with high scores in ~qi).

It is common to organize the vectors ~pu into a |U| × k matrix named P and the vectors ~qi into a |I| × k matrix named Q. This allows all scores for a given user to be computed in a single matrix operation ~su = ~pu × Q^T. Likewise, all scores for every user can be computed as S = P × Q^T. These operations may be more efficient than repeatedly computing S(u,i) in some linear algebra packages. As with neighborhood-based algorithms, this approach can easily be improved by directly accounting for a user’s average rating and an item’s average rating. This can be done as before, by normalizing ratings against a baseline predictor. However, it is much more common to introduce the bias terms directly into the model, and to learn these values simultaneously with learning matrices P and Q. This results in a biased matrix factorization model [68]:

S(u,i) = \mu + b_u + b_i + \vec{p}_u \cdot \vec{q}_i   (2.17)

The goal in matrix factorization algorithms is to find the vectors ~pu and ~qi (as well as extra terms like µ, bu, bi) that lead to the best scoring function for a given metric. One interesting difference between this and a nearest neighbor style algorithm is that the same core model and scoring equation can lead to many different algorithms depending on how ~pu and ~qi are learned. We will present three approaches for learning ~pu and ~qi. In this section we will see how to optimize ~pu and ~qi for prediction accuracy; in the following section we will address an algorithm that learns ~pu and ~qi to optimize how accurately the algorithm ranks pairs of items.

The algorithms we describe are a few of many possible matrix factorization algorithms. One of the interesting aspects of this model is that it has become a standard starting point for many novel algorithm modifications. SVD++ [68] and SVDFeature [23], for example, extend Equation 2.17 by adding terms to incorporate implicit feedback and additional user or item feature information. By combining Equation 2.17 with new terms to accommodate new data, and new ways of optimizing the model for different goals, many interesting algorithm variants are possible.
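To make the scoring model concrete, here is a minimal sketch (Python/NumPy) of Equation 2.17; it assumes the factor matrices and bias terms have already been learned by one of the training procedures discussed below, and the function names are illustrative:

    import numpy as np

    def score(mu, b_u, b_i, P, Q, u, i):
        """Biased matrix factorization score: S(u,i) = mu + b_u + b_i + p_u . q_i."""
        return mu + b_u[u] + b_i[i] + P[u] @ Q[i]

    def score_all_items(mu, b_u, b_i, P, Q, u):
        """Score every item for user u in one vectorized operation (s_u = p_u x Q^T)."""
        return mu + b_u[u] + b_i + Q @ P[u]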

Training Matrix Decomposition Models With Singular Value Decomposition

One way to train a matrix factorization model for predictive accuracy, and the reason these are often called SVD algorithms, is with a truncated singular value decomposition (SVD) of the ratings matrix R:

R \approx P \Sigma Q^T   (2.18)

Where P is a |U| × k matrix of user-feature preference scores, Q is a |I| × k matrix of item-feature relevance scores, and Σ is a k × k diagonal matrix of global feature weights, called singular values. In a true algebraic SVD, P and Q are orthogonal, and this product is the best rank-k approximation of the original matrix R. This means that the matrices P and Q can be used to produce scores that are optimized to make accurate predictions of unknown ratings. Singular value decompositions are not bound to a particular k; the number of non-zero singular values will be equal to the rank of the matrix. However, we can truncate the decomposition by only retaining the k largest singular values and their corresponding columns of P and Q. This accomplishes two things: first, it greatly reduces the size of the model, and second, it reduces noise. Ratings are known to contain both signal about user preferences and random noise [66]. If the ratings matrix is a combination of signal and noise, then consistent and useful signals will contribute primarily to the high-weight features while the random noise will primarily contribute to the lower-weight features. For convenience, the columns of P and Q are often stored in a pre-weighted form so that Σ is not needed as a separate matrix. With this we see that the scoring function is simply S(u,i) = ~pu · ~qi.

There are two important and related difficulties with using the singular value decomposition to train a matrix factorization model. First, it is only defined over complete matrices, but most of R is unknown. In a rating-based system, this problem can be addressed by imputation, or assuming a default value (e.g. the item’s mean rating) for unknown values [113]. If the ratings matrix is normalized by subtracting a baseline prediction before being decomposed, then the unknown values can be left as 0’s and the normalized matrix can be directly decomposed with standard sparse matrix methods. The second difficulty is that the process of computing a singular value decomposition is very computationally intensive and does not scale well to large matrices. Unlike the first problem, there is no natural solution to this. Because of this problem it is uncommon for matrix decomposition algorithms to operate based on a pure singular value decomposition.

Despite these limitations, using a singular value decomposition to compute ~pu and ~qi is still an easy way to build a basic collaborative filtering algorithm for experimentation. Optimized algorithms for computing matrix decompositions can be found in mathematical computing packages such as MATLAB. However, for the reasons mentioned above this is not a reasonable approach for production-scale recommender systems.
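As a sketch of this experimentation-oriented approach (Python, assuming SciPy is available; the tiny sparse matrix of baseline-normalized ratings, the choice of k = 2, and the baseline function passed to score are all illustrative assumptions):

    import numpy as np
    from scipy.sparse import csr_matrix
    from scipy.sparse.linalg import svds

    # rows, cols, vals hold user indices, item indices, and ratings with a
    # baseline prediction already subtracted, so missing entries can stay 0.
    rows = np.array([0, 0, 1, 2, 2, 3])
    cols = np.array([0, 2, 1, 1, 3, 0])
    vals = np.array([0.5, -0.3, 0.8, 0.2, -0.1, -0.6])
    R_norm = csr_matrix((vals, (rows, cols)), shape=(4, 4))

    k = 2                              # number of latent features to keep
    P, sigma, Qt = svds(R_norm, k=k)   # truncated SVD: R_norm ~ P * diag(sigma) * Qt

    # Fold the singular values into the factors so that S(u,i) = p_u . q_i.
    P = P * np.sqrt(sigma)
    Q = Qt.T * np.sqrt(sigma)

    def score(baseline, u, i):
        """Predicted rating: the baseline predictor (user-supplied, hypothetical
        function) plus the reconstructed normalized rating."""
        return baseline(u, i) + P[u] @ Q[i]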

Training Matrix Decomposition Models With Gradient Descent

The goal of matrix factorization is to produce an effective model of user preference. Therefore the algebraic structure (singular value decomposition) is more of a means to an end rather than the end itself. In practice it is common to sidestep the problems inherent in using a singular value decomposition and instead directly optimize P and Q against some metric over our training data. This way we can simply ignore missing data, opening up a range of speed-ups over a singular value decomposition. Simon Funk pioneered this approach to great effect by using gradient descent to train P and Q to optimize the popular mean squared error accuracy metric [44]. Similar algorithms are now a common strategy for latent factor style recommendation algorithms [70]. Since our goal is a matrix decomposition with minimal mean squared error, we can learn a decomposition by treating the problem as an optimization problem: learn matrices P (m × k) and Q (n × k) such that predicting the known ratings in R with the multiplication PQ^T has minimal (squared) error. As mean squared error is easily differentiable, optimization is normally done via either stochastic gradient descent or alternating least squares.

Stochastic gradient descent is a general purpose optimization approach used in machine learning to optimize a mathematical model for a given loss function or metric, so long as the metric is easy to take the derivative of. First the computer starts with an arbitrary initial value for the model parameters, in this case the matrices P and Q as well as the bias terms. Then, it iterates through each training point, in our case a user, item, and rating: (u,i,rui). Based on this point it computes an update to the model parameters that will reduce the error made on that training point. The specific update rules are derived by taking the derivative of the error function with respect to the model parameters; in this case the error function is the squared error. These updates are repeated many times until the algorithm converges upon a local optimum. The update rules to train a biased matrix factorization model to minimize squared error using gradient descent are:

\epsilon_{ui} = r_{ui} - (\mu + b_u + b_i + \vec{p}_u \cdot \vec{q}_i)   (2.19)

\mu = \mu + \lambda(\epsilon_{ui} - \gamma \mu)   (2.20)

b_u = b_u + \lambda(\epsilon_{ui} - \gamma b_u)   (2.21)

b_i = b_i + \lambda(\epsilon_{ui} - \gamma b_i)   (2.22)

P_{uf} = P_{uf} + \lambda(\epsilon_{ui} Q_{if} - \gamma P_{uf})   (2.23)

Q_{if} = Q_{if} + \lambda(\epsilon_{ui} P_{uf} - \gamma Q_{if})   (2.24)

To apply these update rules we first compute ǫui, which represents the prediction error rui − S(u,i), so that each update moves the parameters in the direction that reduces the error. Then the update for each variable can be computed and applied.

Note that, as Puf and Qif have a mutual dependency in their updates, it is conventional to compute both updates before applying either of them. The gradient descent process uses a learning rate λ that controls the rate of optimization (0.001 is a common value), and γ is a regularization term (0.02 is a common value). This regularization term penalizes excessively large user-feature and item-feature values to avoid overfitting. These update rules should be applied until some stopping condition is reached, the most common stopping condition being a specified number of iterations. To get the best performance, k, λ, γ, and the stopping condition should be hand-tuned using the evaluation methodologies discussed in Section 2.9.
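The following sketch (Python/NumPy) implements these update rules over a list of (user, item, rating) training points. The initialization scale, learning rate, regularization strength, and epoch count are illustrative defaults, not tuned values:

    import numpy as np

    def train_biased_mf(ratings, n_users, n_items, k=20,
                        lr=0.001, reg=0.02, n_epochs=20, seed=0):
        """Train a biased matrix factorization model with stochastic gradient descent.

        ratings is a list of (u, i, r_ui) tuples; returns (mu, b_u, b_i, P, Q)."""
        rng = np.random.default_rng(seed)
        # mu is initialized to the global mean; Equation 2.20 optionally refines it.
        mu = np.mean([r for _, _, r in ratings])
        b_u = np.zeros(n_users)
        b_i = np.zeros(n_items)
        P = rng.normal(0, 0.1, (n_users, k))
        Q = rng.normal(0, 0.1, (n_items, k))
        for _ in range(n_epochs):
            for u, i, r in ratings:
                pred = mu + b_u[u] + b_i[i] + P[u] @ Q[i]
                err = r - pred                           # epsilon_ui (Equation 2.19)
                b_u[u] += lr * (err - reg * b_u[u])      # Equation 2.21
                b_i[i] += lr * (err - reg * b_i[i])      # Equation 2.22
                p_u = P[u].copy()                        # compute both updates before applying
                P[u] += lr * (err * Q[i] - reg * P[u])   # Equation 2.23
                Q[i] += lr * (err * p_u - reg * Q[i])    # Equation 2.24
        return mu, b_u, b_i, P, Q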

Once learned, the set of variables µ, bu, bi, P, and Q serves as the model for the algorithm. Given these values, creating a prediction is as easy as applying Equation 2.17. As with item-item, this model is normally computed offline. Traditionally the model is rebuilt daily or weekly (depending on how long it takes to rebuild a model and how actively new ratings, items, and users are added to the system). With this type of recommender model it is also possible to perform online updates [106], which allow a model to account for ratings, items, and users added after the model is built with a minimal loss of accuracy. In practice, online and offline model updates can be combined to balance between complete optimization and interactive data use.

Matrix factorization approaches provide a memory-efficient and, at recommendation time, computationally efficient means of producing recommendations. Computing a score in a matrix factorization algorithm requires only O(k) work (assuming that the factors are precomputed and stored for O(1) lookup); this is true no matter how many items, users, or ratings are involved. Furthermore, by taking account of the underlying commonalities between users and items that are reflected in users’ preferences and behaviors, it makes very accurate predictions. Because of this, matrix factorization algorithms are very popular and are among the most common algorithms in the research literature.

2.6 Learning to Rank

As the original collaborative filtering algorithms focused on the prediction task, most of the research into the recommendation task has been designed around how we can use predictions to make good quality recommendations. The most common approach to doing this is also the most obvious: sort items by prediction. This approach is called “Top-N recommendation”; it is still used by systems like MovieLens and remains quite popular today. That said, other approaches have been developed for directly targeting the quality of a recommendation list.

Learning-to-rank algorithms are a recently popular class of algorithms from the broader field of machine learning. As their name suggests, these algorithms directly learn how to produce good rankings of items, instead of the indirect approach taken by the Top-N recommendation strategy. While there are several approaches, most learning-to-rank algorithms produce rankings by learning a ranking function. Like a prediction function, a ranking function produces a score for each pairing of user and item. Unlike predictions, however, the ranking score has no deliberate relationship with rating values, and is only interesting for its ranked order.

Because learning-to-rank algorithms are designed around ranking and recommendation tasks, instead of prediction, they often outperform prediction algorithms at the recommendation task. However, because their output has no direct relation to the prediction task, they are incapable of making predictions without further post-processing, and are unlikely to out-predict algorithms that make prediction their goal. Many modern recommender systems do not display predicted ratings to users; in such systems a learning-to-rank algorithm can lead to a much more useful recommender system.

The heart of most learning-to-rank algorithms is a specific way to define ranking or recommendation quality. Unlike prediction algorithms, where “accuracy” is easy to define, there are many ways to define ranking and recommendation quality. Furthermore, unlike prediction errors, ranking and recommendation errors are poorly suited for use in optimization. A small change to a model might lead to a small, but measurable, prediction accuracy change but have no effect on the output ranking. Therefore the core work in many learning-to-rank algorithms is in designing easy-to-optimize measurements that approximate common ranking metrics. Once these new metrics are defined, standard optimization techniques can be applied to standard recommendation models such as the biased matrix factorization model from Equation 2.17.

Learning-to-rank is an active area for research into recommender system algorithms, with new algorithms being developed every year. To get a taste of this type of algorithm we will explain the Bayesian Personalized Ranking (BPR) algorithm [105]. First published in 2009, BPR is one of the earliest and most influential learning-to-rank algorithms for collaborative filtering recommendation. While BPR was designed for implicit preference information, we discuss it here as it is trivial to modify for use with rating data, and the structure and development of the BPR algorithm serves as a good example of learning-to-rank algorithms in general.

BPR

BPR is a pairwise learning-to-rank algorithm, which means that it tries to predict which of two items a user will prefer. If BPR can accurately predict that a user will prefer one item over another, we can use that prediction strategy to rank items and form recommendations. This approach has two advantages over prediction-based training: first, BPR tends to produce better recommendations than prediction algorithms. Secondly, BPR can use a much wider range of training data. As long as we can deduce from user behavior that one item is preferred over another, we can use that as a training point. BPR was originally designed for use with implicit, unary forms of preference feedback, instead of ratings. For example, with unary data such as past purchases we can generate pairs by assuming that all purchased items are liked better than all other items. With a traditional ratings dataset we can generate training points by taking pairs of items that the user rated, but assigned different ratings to. To predict that a user will prefer one item over another, BPR tries to learn a function

P (i >u j) – the probability that user u prefers item i to item j. If P (i >u j) > 0.5 then, according to the model, user u is more likely to prefer i over j than they are to prefer j over i. Therefore if P (i >u j) > 0.5 we would want to rank item i above item j. There are many different functions that could be used for P; BPR uses the popular logistic function. The logistic function allows us to shift focus from computing a probability to computing any number xuij which represents a user’s relative preference for i over j (or j over i if xuij happens to be negative). The logistic transformation then defines P as

P (i >u j) = \frac{1}{1 + e^{-x_{uij}}}

Taking xuij to be the difference in scores, S(u,i) − S(u,j), under some scoring function S, we get P (i >u j) > 0.5 if and only if S(u,i) > S(u,j). Furthermore, the probability P will be more confident (closer to 0 or 1) if S(u,i) is substantially greater or smaller than S(u,j). Therefore to optimize P (i >u j) for predictive accuracy we need to optimize S so that it ranks items correctly. For the same reason, once we train S we can use S directly for ranking.

Based on this formalization we arrive at the BPR optimization criterion, which is the function BPR seeks to optimize. The optimization criterion depends on some scoring function S(u,i), and a collection of training points (u,i,j) which represent that u has expressed a preference for item i over item j. Given these, the optimization criterion is the product of the probabilities BPR assigns to each observed preference, P (i >u j) = 1/(1 + e^{-(x_{ui} - x_{uj})}). A good ranking function should maximize these probabilities; therefore we seek to maximize performance against this criterion. The full derivation of this, as well as the complete optimization criterion, can be found in the original BPR paper by Rendle et al. [105].

Almost any model can be used for the scoring function S. All that is required is that the derivative of S with respect to its model parameters can be found. Therefore any algorithm that can be trained for accuracy using gradient descent can also be trained using BPR’s optimization criterion to effectively rank items. We will cover BPR-MF, which uses a matrix factorization model. In particular, we will give update rules for the non-biased matrix factorization seen in Equation 2.16. Only minor modifications would be needed to derive update rules for a biased matrix factorization model. As with all matrix factorization models we have two matrices P and Q representing user and item factor values which need to be optimized so that S(u,i) = ~pu · ~qi provides a good ranking. P and Q can be optimized for the BPR optimization criterion using stochastic gradient descent by applying the following update rules for a given training sample (u,i,j), representing the knowledge that user u prefers item i over item j:

\epsilon_{uij} = \frac{e^{-(S(u,i) - S(u,j))}}{1 + e^{-(S(u,i) - S(u,j))}}   (2.25)

P_{uf} = P_{uf} + \lambda (\epsilon_{uij} (Q_{if} - Q_{jf}) + \gamma P_{uf})   (2.26)

Q_{if} = Q_{if} + \lambda (\epsilon_{uij} P_{uf} + \gamma Q_{if})   (2.27)

Q_{jf} = Q_{jf} + \lambda (-\epsilon_{uij} P_{uf} + \gamma Q_{jf})   (2.28)

Where λ is the learning rate and γ is the regularization term. Like the equations for updating a traditional matrix factorization algorithm, ǫuij in these equations represents the degree to which S does or does not correctly rank items i and j. When training BPR algorithms, the order in which training points are taken can have a drastic impact on the rate at which the algorithm converges. Rendle et al. [105] showed that the naive approach of taking training points grouped by user can be orders of magnitude slower than an approach that takes randomly chosen training points. Therefore, for simplicity, Rendle et al. recommend training the algorithm by selecting random training points (with replacement) and applying the update rules above. This process can be repeated until any preferred stopping condition, such as an iteration count, has been reached.

Unsurprisingly, the BPR-MF algorithm is much better than classic algorithms at ordering pairs of items under the AUC metric. On other recommender metrics BPR-MF only shows modest improvements over traditional algorithms. The trade-off, however, is that BPR-MF, like most learning-to-rank algorithms, cannot make predictions. Theoretically, advanced techniques could be used to turn the ranking score into a prediction; in practice, however, we find this does not lead to prediction improvements over prediction-centered algorithms. Therefore, for a website that uses both recommendations and predictions, using separate algorithms for the two tasks might be essential.
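A sketch of BPR-MF training (Python/NumPy) in the spirit of the update rules above; the random sampling of (u, i, j) triples follows the recommendation of Rendle et al., while the hyperparameter values and the convention of subtracting the regularization term are illustrative assumptions:

    import numpy as np

    def bpr_step(P, Q, u, i, j, lr=0.05, reg=0.002):
        """Apply one BPR-MF update for a sample (u, i, j): user u prefers item i over j."""
        x_uij = P[u] @ Q[i] - P[u] @ Q[j]          # S(u,i) - S(u,j)
        eps = 1.0 / (1.0 + np.exp(x_uij))          # equals Equation 2.25, written stably
        p_u = P[u].copy()                          # use the old user factors for both item updates
        P[u] += lr * (eps * (Q[i] - Q[j]) - reg * P[u])
        Q[i] += lr * (eps * p_u - reg * Q[i])
        Q[j] += lr * (-eps * p_u - reg * Q[j])

    def train_bpr(pairs, n_users, n_items, k=20, n_samples=100000, seed=0):
        """Train by repeatedly sampling random (u, i, j) triples with replacement.

        pairs is an assumed pre-built list of preference triples (u, i, j)."""
        rng = np.random.default_rng(seed)
        P = rng.normal(0, 0.1, (n_users, k))
        Q = rng.normal(0, 0.1, (n_items, k))
        for _ in range(n_samples):
            u, i, j = pairs[rng.integers(len(pairs))]
            bpr_step(P, Q, u, i, j)
        return P, Q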

2.7 Other Algorithms

The algorithms highlighted in this chapter provide an overview of the most influential and important recommender systems algorithms. While these few algorithms provide a basic grounding in most recommender algorithms, there are many more algorithms than we can cover in this chapter. Briefly, we want to mention a few other key approaches and techniques for generating personalized recommendations that have proved successful in the past.

Probabilistic Models

Probabilistic algorithms, such as those based on a Bayesian belief network, are a popular class of algorithms in the machine learning field. These algorithms have also seen increasing popularity in the recommender systems field recently. Many probabilistic algorithms are influenced by the PLSI (Probabilistic LSI) [54] and LDA (latent Dirichlet allocation) [15] algorithms. The basic structure of these models is to assume that there are k distinct clusters or profiles. Each profile has a distinct probability distribution over items describing which items users in that cluster tend to watch and, for each item, a probability distribution over ratings for that cluster. Instead of directly trying to cast a user into only one cluster, each user is a probabilistic mixture of all clusters [55]. This can be thought of as a type of latent feature model: each user has a value for each of the k clusters (features) and each item has a preference score associated with each of the k clusters (features). One of the key advantages of these probabilistic models is that they are easier to update with new user or item information, due to the wealth of standard training approaches for probabilistic models [114].

Linear Regression Approaches

Many algorithms have incorporated linear regression techniques into their formulation. For example, the original work introducing the item-item algorithm experimented with using regression techniques in addition to the similarity computations. For each pair of items a linear regression is used to find the best linear transformation between the two items. This transformation is then applied to get an adjusted rating to be used in the weighted average. While this showed some promise for very sparse systems, the idea offered little benefit for more traditional recommender systems.

A more recent implementation of this idea is the Slope One recommendation algorithm [73]. In Slope One we compute an average offset between all pairs of items. We then predict an item i by applying the offset to every other rating by that user, and performing a weighted average. The Slope One algorithm has some popularity, especially as a reference algorithm, due to how simple it is to implement and motivate; a minimal sketch of the idea is given below.

Outside of nearest neighbor approaches, linear regression approaches are also a common way to combine multiple scoring functions together, or to combine collaborative filtering output with other factors to create ensemble recommenders.
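The sketch below (Python) illustrates the Slope One idea described above, weighting each offset by the number of users who co-rated the pair; the dictionary-of-dictionaries data layout and the toy ratings are illustrative assumptions:

    from collections import defaultdict

    def train_slope_one(user_ratings):
        """Compute average offsets dev[i][j] = average of (r_ui - r_uj) over co-rating users."""
        diff_sum = defaultdict(lambda: defaultdict(float))
        count = defaultdict(lambda: defaultdict(int))
        for ratings in user_ratings.values():
            for i, r_i in ratings.items():
                for j, r_j in ratings.items():
                    if i != j:
                        diff_sum[i][j] += r_i - r_j
                        count[i][j] += 1
        dev = {i: {j: diff_sum[i][j] / count[i][j] for j in diff_sum[i]} for i in diff_sum}
        return dev, count

    def predict_slope_one(dev, count, ratings_u, i):
        """Predict item i by applying each offset to the user's other ratings (weighted average)."""
        num = den = 0.0
        for j, r_j in ratings_u.items():
            if j in dev.get(i, {}):
                num += (dev[i][j] + r_j) * count[i][j]
                den += count[i][j]
        return num / den if den > 0 else None

    user_ratings = {"a": {"x": 5, "y": 3, "z": 2}, "b": {"x": 3, "y": 4}, "c": {"y": 2, "z": 5}}
    dev, count = train_slope_one(user_ratings)
    print(predict_slope_one(dev, count, user_ratings["b"], "z"))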

Graph-Based Approaches

Graph-based recommender algorithms leverage graph-theoretic techniques and algorithms to build better recommender systems. Although uncommon, these algorithms have been a part of the recommender systems research community since the early days [2]. In traditional recommender system websites like MovieLens or Netflix there is rarely a natural graph to consider; therefore these algorithms tend to impose a graph by connecting users to items they have rated. By also connecting items using content information, you can use graph-based algorithms to combine content and collaborative filtering approaches [60, 101]. Some services, however, have both recommendation features and social network features. In these websites it is natural to assume that a person-to-person connection is an indicator of trust. This leads to the set of trust-based recommendation algorithms in which a person-to-person trust network is used as part of the recommendation process. Many of these algorithms use graph-based propagation of trust as a core part of the recommendation algorithm [61, 81].

2.8 Combining Algorithms

Most recommender system deployments do not directly tie the scores output by one of the above algorithms to the recommendations they display. While these algorithms perform well, there usually are further improvements that can be made by combining the output of these algorithms with other algorithms or scores. Generally there are three reasons to perform these modifications: business logic, algorithm accuracy/precision, and recommendation quality.

The first of these reasons – business logic – is both the simplest and the most ubiquitous. Many recommender systems modify the output of their algorithm to serve business purposes such as “do not recommend items that are out of stock” or “promote items that are on sale”.

The second of these reasons – algorithm accuracy – leads researchers to develop and deploy ensemble algorithms. Ensemble algorithms are techniques for combining arbitrary algorithms into one comprehensive final algorithm. The final algorithm normally performs better than any of its constituent algorithms independently. The design, development, and training of ensemble algorithms is a large topic in the broader machine learning field. As a comprehensive discussion of ensemble algorithms is out of scope for this chapter, we will give a brief overview of the application of ensembles in the recommender systems field.

The final reason to combine algorithms – recommendation quality – is more complicated. Properties like novelty and diversity have a large impact on how well users like recommendations. These properties are very hard to deliberately induce in collaborative filtering algorithms, as they can be in tension with recommending the best items to a user. Several algorithms have been developed to modify a recommendation algorithm’s output specifically for these properties. While these algorithms are not ensembles in the traditional sense, they are another way to moderate the behavior of a recommendation algorithm based on some other measure. These algorithms will be discussed after describing strategies for ensemble recommendation.

2.8.1 Ensemble Recommendation

The most basic approach to an ensemble algorithm is a simple weighted linear combination between two algorithms Sa and Sb:

S(u,i) = \alpha + \beta_a S_a(u,i) + \beta_b S_b(u,i)   (2.29)

The simplest way to learn this is to have the developer directly specify the weightings. While this may sound naive, there are several places where this can be appropriate, especially when the parameters are picked based on difficult-to-optimize metrics such as a user survey.

Linear Regression

A more attractive technique to train a linear model may be to use traditional linear regression to learn the best α and β parameters to optimize accuracy. You could imagine simply training Sa and Sb on all available training ratings and then using their predictions and the same training ratings to fit α and β. Unfortunately, this is not recommended – the core issue is that the same ratings should not be used when training the sub-algorithms and when training the ensemble, as this leads to overfitting. Ideally you want to train the ensemble on ratings that the sub-algorithms have not seen, so that the ensemble is trained based on the out-of-sample error of each algorithm. The easiest way to do this would be to randomly hold out some small percent of training data (say 10%) and to use that withheld data to train the regression to minimize squared error. The problem with this approach to training a linear ensemble, however, is that it withholds a large amount of data, which might affect the overall algorithm performance.
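For example, the held-out fit just described might look like the following sketch (Python/NumPy); predict_a and predict_b stand in for two already-trained sub-algorithms, and the names are illustrative assumptions:

    import numpy as np

    def fit_blend(holdout, predict_a, predict_b):
        """Fit alpha, beta_a, beta_b (Equation 2.29) on ratings withheld from both sub-algorithms.

        holdout is a list of (u, i, r_ui); predict_a and predict_b are score functions."""
        X = np.array([[1.0, predict_a(u, i), predict_b(u, i)] for u, i, _ in holdout])
        y = np.array([r for _, _, r in holdout])
        coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)   # least-squares fit of the weights
        alpha, beta_a, beta_b = coeffs
        return alpha, beta_a, beta_b

    def blended_score(alpha, beta_a, beta_b, predict_a, predict_b, u, i):
        """Apply Equation 2.29 with the fitted weights."""
        return alpha + beta_a * predict_a(u, i) + beta_b * predict_b(u, i)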

Stacked regression

An alternate approach, without this problem, is Breiman’s stacked regression algorithm [19]. Breiman’s algorithm trains regression parameters by first producing k subsets of the dataset by traditional (ratings-based) crossfolding. Then each sub-algorithm is trained independently on each of the k folds. Due to the way crossfold validations are made, this means that each of the original training ratings has an associated algorithm which has never seen that rating. When generating sub-algorithm predictions for any given training point (as needed to learn the linear regression), the algorithm which has never seen that training point is used. Once the linear regression has been trained, the sub-algorithms should then be re-trained using the overall set of data. For a more comprehensive description of this procedure, consult Breiman’s original paper [19].

All ensemble algorithms work best when very different algorithms are being combined. So an ensemble between item-item (with cosine similarities) and item-item (with Pearson similarities) is unlikely to show the same improvement as an ensemble between item-item and a content-based algorithm [37].

In MovieLens, for example, we could imagine making a simple content-based algorithm by computing each user’s average rating for each actor and director. While not necessarily the best algorithm, this content-based algorithm would likely outperform a collaborative filtering algorithm for movies with very few ratings. Therefore, an ensemble of a content-based algorithm and a collaborative filtering algorithm might show improved accuracy over either algorithm on its own. This example does lead to one interesting observation: there are many cases where we want to create an ensemble and already know the conditions under which one algorithm might perform better than another. The actor-based recommender would be our best recommender when we have next to no ratings for a movie; as we get more ratings, however, we should trust the collaborative filtering algorithm more. A linear weighting scheme does not allow for this type of adjustment.

Feature Weighted Linear Stacking

The feature weighted linear stacking algorithm, introduced by Sill et al., is an extension to Breiman’s linear stacked regression algorithm [118]. Feature weighted linear stacking allows the relative weights between algorithms to vary based on features like the number of ratings on an item. This algorithm is most notable for being very popular during the Netflix prize competition, a major collaborative filtering accuracy competition, where it was the key facet of the second-best algorithm. In feature weighted linear stacking, algorithms are linearly combined as in Equation 2.29; the difference is that the weights β are themselves a linear combination of the extra features. Sill et al. show that this model can be learned by solving a system of linear equations using any standard toolkit for solving systems of linear equations. Details of this solution, including information to assist in scaling, are provided in the paper by Sill et al. [118].

While ensemble methods have provided much better predictive accuracy than single algorithm solutions, and could theoretically be applied to learning-to-rank problems as well, it should be noted that they can also become much more complex than a traditional recommender. While it is easy to overlook technical complexity when designing an algorithm, technical complexity can be a significant barrier to actually deploying large ensembles in the field. Notably, after receiving code for a 107-algorithm ensemble, Netflix went on to actually implement only two of these sub-algorithms [3]. Ensemble methods can be much more complex and time-consuming to keep up to date, and can require much more processing when making predictions, which can lead to slower responses to users. Therefore, when considering ensembles, especially very large ones, designers should consider whether the improved algorithm accuracy is worth the increased system complexity.

2.8.2 Recommending for Novelty and Diversity

This brings us to the third reason that a recommender system might modify the scores output from a recommendation algorithm: to increase the quality of recommendations as reported by users. Research into recommender systems has shown that selecting only good items is not enough to ensure that a user will find a recommendation useful [143, 147]. Many other properties can affect how useful a recommendation is to a user. Two that have been shown to be important, and have been the focus of some research, are novelty and diversity.

Novelty and diversity are properties of a recommendation that measure how the items relate to each other or to the user. Novelty refers to how unexpected or unfamiliar a recommendation is to the user [102]. If a list contains only obvious recommendations, it is neither novel nor useful at helping a user find new items. Diverse recommendations cover a large range of different items. One flaw with many recommender systems is that their recommendations are all very similar to each other, which limits how useful the recommendations are. For example, a top-8 recommendation consisting of only Harry Potter movies would neither be novel (as those movies are quite well known), nor would it be diverse (since the recommendation only represents a small niche of the user’s presumably broader interests).

There have been several different approaches to modifying an algorithm’s score to favor (or avoid) properties like novelty or diversity. We will focus on two broad strategies: the first is well suited to combining algorithms with item-specific metrics such as how novel an item is; the second is well suited to metrics measured over the entire recommendation list. While these strategies have been pioneered for use with novelty and diversity, they can be applied with any metric. For example, when recommending library items, users may be disappointed by recommendations of items that have a waitlist and cannot be borrowed immediately. These strategies could be used to modify recommendations to favor items without a waitlist, increasing user satisfaction. Specific metrics for novelty and diversity will be discussed alongside algorithm evaluation in Section 2.9.

The most common way to combine algorithms with some metric to increase user satisfaction is to use a simple linear combination between the original algorithm score Sa and some item-level measurement of interest [14, 134, 143].

For example, the number of users who have rated an item, |Ui| (or its inverse 1/|Ui|), is commonly used to allow manipulating novelty. By blending the score from an algorithm with 1/|Ui| we can promote items that have fewer ratings and enhance the novelty of recommended items. These blended scores are normally used only for ranking; typically an unmodified rating prediction is still used even when the prediction is blended for recommendation.

The other major approach for modifying recommendations is an iterative re-ranking approach. In these approaches items are added to a recommendation set one at a time, with the ranking score recomputed after each step. This approach has two advantages. First, re-ranking allows hard constraints when selecting items. For example, the iterative function could reject adding more than two movies by a given director. Secondly, the iterative approach allows measurements such as the average similarity of an item with the other items already chosen for recommending. This is often necessary when manipulating diversity, as diversity is a property of the recommendation, not of one specific item. This was in fact the approach taken by the first paper to address diversity in recommendation [120]. The cost of this approach over re-scoring is a slightly higher recommend-time cost.

A primary example of the iterative re-ranking approach is the diversity-adding algorithm introduced by Ziegler et al. [147]. To generate a top-10 recommendation list, this algorithm first picks the top 50 items for a user as candidate items. The size of the candidate set represents a trade-off between run-time cost and the flexibility of the algorithm to find more diverse items. From the top 50 items, the best predicted item is immediately added to the recommendation. After that, for each remaining candidate the algorithm computes a sum of the similarity between that candidate and each item in the recommendation. The algorithm then sorts by inverse similarity sum to get a dissimilarity rank for each potential item. The overall item weight is then a weighted average of the prediction rank and the dissimilarity rank (with weights chosen beforehand). The item with the smallest score by this weight is added to the recommendation. This process repeats, with updated similarity scores, until the desired ten-item recommendation list has been made.

Re-scoring and re-ranking algorithms are an easy way to promote certain properties in recommendations. Both can be relatively inexpensive to run, and can be added on top of an existing recommender. Additionally, these approaches are very easy to re-use for a large variety of different recommendation metrics beyond just novelty and diversity.
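The following sketch (Python) illustrates this style of greedy re-ranking in the spirit of the approach just described; the equal weighting of prediction rank and dissimilarity rank, the candidate list, and the sim function are all illustrative assumptions:

    def diversified_top_n(candidates, sim, n=10, weight=0.5):
        """Greedy re-ranking for diversity.

        candidates: item ids already sorted by predicted rating (best first).
        sim: function sim(i, j) returning an item-item similarity.
        weight: blend between prediction rank and dissimilarity rank."""
        rec = [candidates[0]]                       # the best predicted item goes first
        remaining = list(candidates[1:])
        while remaining and len(rec) < n:
            # Rank remaining candidates by their total similarity to the items already picked.
            sim_sums = {c: sum(sim(c, r) for r in rec) for c in remaining}
            by_dissim = sorted(remaining, key=lambda c: sim_sums[c])    # least similar first
            dissim_rank = {c: k for k, c in enumerate(by_dissim)}
            pred_rank = {c: k for k, c in enumerate(remaining)}         # keeps prediction order
            combined = {c: weight * pred_rank[c] + (1 - weight) * dissim_rank[c]
                        for c in remaining}
            best = min(remaining, key=lambda c: combined[c])            # smallest combined rank wins
            rec.append(best)
            remaining.remove(best)
        return rec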

2.9 Metrics and Evaluation

The last topic we will discuss in this chapter is how to evaluate a recommendation algorithm. This is an important concept in the recommender systems field as it gives us a way to compare and contrast multiple algorithms for a given problem. Given that there is no one “best” recommendation algorithm, it is important to have a way to compare algorithms and see which one will work best for a given purpose. This is true both when arguing that a new algorithm is better than previous algorithms, and when selecting algorithms for a recommender system.

One key application of evaluations in recommender systems is parameter tuning. Most recommendation algorithms have variables, called parameters, which are not optimized as part of the algorithm and must be set by the system designer. Parameter tuning is the process of tweaking these parameters to get the best performing version of an algorithm. For example, when deploying a matrix factorization algorithm the system designer must choose the best number of features for their system. While we have tried to list reasonable starting points for each parameter of the algorithms we have discussed in this chapter, these are at best a guideline, and readers should carefully tune each parameter before relying on a recommendation algorithm.

In this section we will mostly focus on offline evaluation methodologies. Offline evaluations can be done based only on a dataset and without direct intervention from users. This is opposed to online evaluation, which is a term for evaluations done against actual users of the system, usually over the internet. Offline evaluations are a common evaluation strategy from machine learning and information retrieval. In an offline evaluation we take the entire ratings dataset and split it into two pieces: a training dataset (Train) and a test dataset (Test); there are several ways of doing this, which we describe in more detail shortly. Algorithms are trained using only the ratings in the training set and asked to predict ratings, rank items, or make a recommendation for each user. These outputs are then evaluated based on the ratings in the test dataset using a variety of metrics to assess the algorithm’s performance.

Just as there are several different goals for recommendation (prediction, ranking items, recommending items), there are many different ways to evaluate recommendations. Furthermore, while evaluating prediction quality may be straightforward (how well does the prediction match the rating), there are many different ways to evaluate whether a ranking or recommendation is correct. Therefore, there are a great number of different evaluation metrics which score different aspects of recommendation quality. These can be broadly grouped into prediction metrics, which evaluate how well the algorithm serves as a predictor; ranking quality metrics, which focus on the ranking of items produced by the recommender; decision support metrics, which evaluate how well the algorithm separates good items from bad; and metrics of novelty and diversity, which evaluate how novel and diverse the recommendations might appear to users. Depending on how an algorithm will be used, metrics from one or all of these groups might be used in an evaluation, and performance on several metrics might need to be balanced when deciding on a best algorithm.

Prediction Metrics

The most basic measurement we can take of an algorithm is the fraction of items it can score, known as the prediction coverage metric, or simply the coverage metric for short. The coverage metric is simply the percent of user-item pairs in the whole system that can be predicted. In some evaluations coverage is only computed over the test set. Beyond convenience of computation, this modification focuses more explicitly on how often the algorithm cannot produce a score for items that the user might be interested in (as evidenced by the user rating that item) [50]. This metric is of predominantly historic interest, as most modern algorithm deployments use a series of increasingly general baseline algorithms as a fallback strategy to ensure 100% coverage. That said, coverage can still be useful when comparing older algorithms, or looking at just one algorithm component in isolation.

Assuming that an algorithm is producing predictions, the next most obvious measurement question is how well its predictions match actual user ratings. To answer this we have two metrics: Mean Absolute Error (MAE) and Root Mean Squared Error

(RMSE). In the following two equations Test is a set containing the test ratings rui and the associated users and items.

MAE(Test) = \frac{\sum_{(u,i,r_{ui}) \in Test} |S(u,i) - r_{ui}|}{|Test|}   (2.30)

RMSE(Test) = \sqrt{\frac{\sum_{(u,i,r_{ui}) \in Test} (S(u,i) - r_{ui})^2}{|Test|}}   (2.31)

Both MAE and RMSE measure the amount of error made when predicting for a user, and are on the same scale as the ratings. The biggest difference between these two metrics is that RMSE assigns a larger penalty to large prediction errors when compared with MAE. Since large prediction errors are likely to be the most problematic, RMSE is generally preferred.

Both the RMSE and MAE metrics measure accuracy on the same scale as predictions. This can be normalized to a uniform scale by dividing the metric value by the size of the rating scale (maxRating − minRating), yielding the normalized mean absolute error (nMAE) and normalized root mean squared error (nRMSE) metrics. This is rarely done, however, as comparisons across different recommender systems are often hard to correctly interpret due to differences in how users use those systems.

Prediction metrics can be computed for an entire system or individually for each user. RMSE, for example, can be computed for the system by averaging over all test ratings, or computed per user by averaging over each user’s test ratings. Due to the differences between people, most algorithms work better for some users than they do for others. Measuring per-user error scores lets us understand and measure how accurate the system is for each user. By averaging the per-user metric values we can then get a second measure of the overall system. While the difference between per-user error and system-wide (or by-rating) error may seem trivial, it can be very important. Most deployed systems have power users who have rated many more items than the average user. Because they have more ratings, these users tend to be overrepresented in the test set, leading to these users being given more weight when estimating system accuracy. Averaging the per-user accuracy scores avoids this issue and allows the system to be evaluated based on its performance for all users.
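A sketch of these two metrics, computed both over all test ratings and averaged per user (Python; test is assumed to be a list of (u, i, r_ui) tuples and score a trained prediction function):

    import math
    from collections import defaultdict

    def mae_rmse(test, score):
        """Overall MAE and RMSE over a list of (u, i, r_ui) test ratings (Equations 2.30-2.31)."""
        errors = [score(u, i) - r for u, i, r in test]
        mae = sum(abs(e) for e in errors) / len(errors)
        rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
        return mae, rmse

    def per_user_rmse(test, score):
        """RMSE computed separately for each user, then averaged over users."""
        by_user = defaultdict(list)
        for u, i, r in test:
            by_user[u].append((score(u, i) - r) ** 2)
        return sum(math.sqrt(sum(sq) / len(sq)) for sq in by_user.values()) / len(by_user)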

Ranking Quality

A more advanced way of evaluating an algorithm is to ask if the way it orders or ranks items is consistent with user preferences. There are several ways of approaching this, the most basic being to simply compute the Spearman ρ or Pearson r correlation coefficients between the predicted ratings and test set ratings for a user. As noted by Herlocker et al. [50], Spearman’s ρ is imperfect because it works poorly when the user rates many items at the same level. Additionally, both metrics assign equal importance to accuracy at the beginning of the list (the best items) and the end (the worst items). Realistically, however, we want a ranking metric that is most sensitive at the beginning of the ranking, where users are likely to look, and less sensitive to errors towards the end of the list, where users are unlikely to look.

A more elegant metric for evaluating ranking quality is the discounted cumulative gain (DCG) metric. DCG tries to estimate the value a user will receive from a list. It does this by assuming each item gives a value represented by its rating in the test set, or no value if unrated. To make DCG focus on the beginning of the list, these values are discounted logarithmically by their rank, so the maximum gain of items later in the list is smaller than the potential gain early in the list. The DCG metric is defined as follows:

dcg(u, Rec) = \sum_{i \in Rec} \frac{r_{ui}}{\max(1, \log_b(k_i))}   (2.32)

Where k_i refers to the rank order of i, Rec is an ordered list of items representing an algorithm’s recommendations for the user u, and b is the base of the logarithm. While different values are possible, DCG is traditionally computed with b = 2. Other values of b have not been shown to yield meaningfully different results [63]. The DCG metric is almost always reported normalized, as the normalized discounted cumulative gain metric nDCG. nDCG is simply the DCG value normalized to the 0-1 range by dividing by the “optimal” discounted cumulative gain value which would be given by any optimal ranking:

ndcg(u) = \frac{dcg(u, prediction)}{dcg(u, ratings)}   (2.33)

A similar metric is the half-life utility metric [18]. Half-life utility uses a faster, exponential discounting function. The half-life utility is defined as follows:

HalfLife(u) = \sum_{i \in Rec} \frac{\max(r_{ui} - d, 0)}{2^{(k_i - 1)/(\alpha - 1)}}   (2.34)

Half-life utility has two parameters. The first is d, which is a score that should represent the neutral rating value. A recommendation for an item with score d should neither help nor hurt the user, while any item rated above d should be a good recommendation. d should also be used as the “default” rating value for rui where a user does not have a rating for that item; in this way unrated items are assigned a value of 0. The α variable controls the speed of exponential decay and should be set so that an item at rank α has a roughly 50% chance of actually being seen by the user.

One common modification of these metrics is to limit the recommendation list size. For example, the nDCG@n metric is taken by computing the nDCG over the top-n recommendations only. Any item in the test set but not recommended is ignored in the computation. This is reasonable if you know there is a hard limit to how many items users can view, or if you only care about the first n elements of the list.
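As a sketch, the nDCG@n computation might look like this (Python; rec is the ordered list of recommended item ids, test_ratings a dict of the user's test-set ratings, and log base 2 is assumed):

    import math

    def dcg(items, test_ratings, b=2):
        """Discounted cumulative gain (Equation 2.32); unrated items contribute nothing."""
        return sum(test_ratings.get(item, 0.0) / max(1.0, math.log(rank, b))
                   for rank, item in enumerate(items, start=1))

    def ndcg_at_n(rec, test_ratings, n=10):
        """nDCG@n: DCG of the top-n recommendation divided by the DCG of an ideal ordering."""
        top_n = rec[:n]
        ideal = sorted(test_ratings, key=test_ratings.get, reverse=True)[:n]
        ideal_dcg = dcg(ideal, test_ratings)
        return dcg(top_n, test_ratings) / ideal_dcg if ideal_dcg > 0 else 0.0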

Decision Support Metrics

Another common approach to evaluating recommender systems is to use metrics from the information retrieval domain such as precision and recall [29]. These metrics treat the recommender as a classifier with the goal of separating good items from other items. For example, a good algorithm should only recommend good items (precision) and should be able to find all good items (recall). Decision support metrics give us a way of understanding how well the recommendation could support a user in deciding which items to consume.

The basic workflow of all decision support metrics is to first perform a recommendation, then compare that recommendation against a previously selected “relevant item set”. The relevant item set represents those items that we know to be good items to recommend to a user. You then count how many of the recommended items were relevant and how many were not.

In a larger scale system, a choice needs to be made about which items to consider relevant. The easiest choice would be to take all rated items in the test set as good items, which evaluates an algorithm on its ability to select items that the user will see. More useful, however, is the practice of choosing a cutoff, such as four out of five stars, at which we consider a recommendation “good enough”. Best practices recommend testing with multiple similar cutoffs to ensure that results are robust across various choices for defining relevant items. Evaluation results that favor one algorithm with items rated 4 and above as “good”, but another algorithm if 4.5 and above are “good”, deserve more careful consideration.

Once this decision has been made there remains an issue of how to treat the remaining “non relevant” items. In traditional information retrieval work it is often reasonable to assume that every item that is not known to be relevant can be considered not relevant, and therefore bad to recommend. This assumption is much less reasonable in the recommender system domain: while some of these items are known to be rated poorly, many more have simply never been rated. There are likely many good items for each user that have not been rated and would therefore be considered not relevant. It has been argued that not having complete knowledge of which items a user would like may make these metrics inaccurate or suffer from a bias [10, 28, 52]. Ultimately, this problem has not been solved, and most evaluations settle for the assumption that most non-rated items are not good to recommend, and that the evaluation bias caused by this will be minor.

Related to the above issue of how to treat not relevant items is the question of how to compute recommendations. There have been various different approaches taken, and these have been shown to lead to different outcomes in the evaluation [11]. Commonly, recommendations are done by taking the top-n predicted items, in which case these metrics are labeled with that n, such as precision@20 for precision computed over the top-20 list. n should be picked to match interface practices, so if only eight items are shown to a user, algorithms should be evaluated by their precision@8. The other important consideration is which candidate set of items the recommendations should be drawn from. Many different candidate set options have been used, but the most common options are either all items, or the relevant item set plus a random subsample of not relevant items.
Some work has used the set of items that the user rated in the test set as a recommendation candidate set; while this does avoid the issue of how to treat unrated items, this evaluation methodology also provides different results, which are believed to be less indicative of user satisfaction with a recommender [11].

Once the set of good items has been picked and recommendations have been generated, the next step is to compute a confusion matrix for each user. A confusion matrix is a two by two matrix counting how many of the relevant items were recommended, how many of the relevant items were not recommended, and so forth. There are several metrics to compute based on this confusion matrix.

                      Good Items             Not Good Items
Recommended Items     True positives (tp)    False positives (fp)
Other Items           False negatives (fn)   True negatives (tn)

Table 2.2: A confusion matrix

• precision = tp / (tp + fp) – the percent of recommended items that are good

• recall (also known as sensitivity) = tp / (tp + fn) – the percent of good items that are recommended

• false positive rate = fp / (fp + tn) – the percent of not-good items that are recommended

• specificity = tn / (fp + tn) – the percent of not-good items that are not recommended

These metrics, especially precision and recall, are traditionally reported and analyzed together. This is because precision and recall tend to have an inverse relationship. An algorithm can optimize for precision by making very few recommendations, but doing that would lead to a low recall. Likewise, an algorithm might get high recall by making very many recommendations, but this would lead to low precision. An ideal algorithm therefore balances these two properties, recommending predominantly good items while still recommending almost all of the good items. To make finding this balance easier, researchers often look at the F-score, which is the harmonic mean of precision and recall:

F = \frac{2 \cdot precision \cdot recall}{precision + recall}   (2.35)
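A sketch of these decision support metrics for a single user (Python; recommended is the top-n list and relevant the user's relevant item set, chosen by whatever rating cutoff the evaluation uses):

    def precision_recall_f(recommended, relevant):
        """Precision, recall, and F-score for one user's top-n recommendation."""
        rec_set, rel_set = set(recommended), set(relevant)
        tp = len(rec_set & rel_set)                  # relevant items that were recommended
        precision = tp / len(rec_set) if rec_set else 0.0
        recall = tp / len(rel_set) if rel_set else 0.0
        f = (2 * precision * recall / (precision + recall)
             if precision + recall > 0 else 0.0)
        return precision, recall, f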

Another way to summarize this information is with the use of an ROC curve. An ROC curve is a plot of the recall on the y-axis against the false positive rate on the x-axis. An example ROC curve is given in Figure 2.6. The ROC curve will have one point for every recommendation list length, from recommending only one item to recommending all items. When the recommendation list is small, we expect a small false positive rate but also a small recall (hitting the point (0, 0)). Alternatively, when all items are recommended the false positive rate and recall will both be 1. Therefore the ROC curve normally connects the point (0, 0) to (1, 1). The ROC curve can be used to evaluate an algorithm broadly, with a good algorithm approaching the point (0, 1), which means it has almost no false positives while still recalling almost every good item.

Figure 2.6: An example ROC curve, plotting sensitivity/recall (true positive rate) against 1 − specificity (false positive rate) for a random baseline and two example curves. Image used with permission from [41].

To make this property easier to compare numerically, it is also common to compute the area under the ROC curve, referred to as the AUC metric. It has been pointed out [105] that the AUC metric also measures the fraction of pairs of items that are ranked correctly.

As mentioned earlier, it is often incorrect to assume that items that are not rated highly by a user are by definition bad items to recommend. That does not mean, however, that there are no bad recommendations: just as we can say highly rated items (4 or more stars out of 5) are clearly good, we can say poorly rated items (one or two stars out of five) are clearly bad. Using this insight, we can define the fallout metric. In a typical information retrieval evaluation fallout is the same as the false positive rate. In a recommender system evaluation, however, we can explicitly focus on how often bad items are recommended, and compute the percent of recommended items that are known to be bad. If one algorithm has a significantly higher fallout than another, we can assume that it is making significant mistakes at a higher rate, and should be avoided.

One issue with the precision metric is that, while it rewards an algorithm for recommending good items, it does not care where those items appear in a recommendation. Generally, we want good items recommended as early in the list as possible. To evaluate this we can use the mean average precision metric (MAP), which is the mean of the average precision over every user. The average precision metric takes the average of the precision at each of the relevant items in the recommendation. If an item is not recommended then it contributes a precision of 0.

MAP@N = \frac{\sum_{u \in U} averagePrecision@N(u)}{|U|}   (2.36)

averagePrecision@N(u) = \frac{1}{|goodItems|} \sum_{i \in goodItems} precision@rank(i)   (2.37)

By taking the mean average precision we place more importance on the early items in the list than the later items, as the first item is used in all N precision computations, while the last one is only used in one.

Another approach for checking that good items are early in the recommendation is the mean reciprocal rank metric (MRR). Instead of looking at how many good items or bad items an algorithm returns, mean reciprocal rank looks at how many items the user has to consider before finding a good item. For any user, their rank (rank_u) is the position in the recommendation of the first relevant item. Based on this we can take reciprocal rank as 1/rank_u, and mean reciprocal rank is the average reciprocal rank over all users. A larger mean reciprocal rank means that the average user should have to look at fewer items before finding an item they will enjoy.

MRR = \frac{\sum_{u \in U} 1/rank_u}{|U|}   (2.38)

Novelty and Diversity

There are several other properties of a recommendation that can be measured. The most commonly discussed are novelty and diversity. These properties are believed to be very important in determining whether a user will find a set of recommendations useful, even if they are unrelated to the pure quality of the recommendation. Understanding the effect these properties have on user satisfaction is still one of the ongoing directions in recommender systems research [38].

It is important to compare these metrics along with other metrics, such as accuracy or decision support metrics, as large values for these metrics are often seen along with large losses in quality. At the extreme, a random recommender would have very high novelty and diversity, but would score badly on all other metrics. Generally speaking, good algorithms are those that increase novelty or diversity without meaningfully decreasing other measures of quality.

Novelty refers to how unexpected or unfamiliar a recommendation is to the user [102]. Recommendations that mostly contain familiar items are not considered novel recommendations. Since the goal of a recommender is to help its users find items they would not otherwise see, we expect that a good recommender should have higher novelty. The most obvious, and common, way to estimate novelty offline is to rely on some estimate of how well known an item is. The count of users who have rated an item, referred to as the item’s popularity, is commonly used for this. More popular items are assumed to be better-known and less novel to recommend [21].

Diverse recommendations cover a large range of different items. One flaw with many recommender systems is that they focus too heavily on some small set of items for a given user [62]. So knowing that a user liked a movie from the Star Wars franchise, for example, might lead the algorithm to only recommend science fiction to that user, even if that user likes adventure films in general. The most common way of understanding how diverse a set of recommendations is, is to measure the total or average similarity between all pairs of items in the recommendation list [142, 147]. This intra-list similarity (ILS) measure can be seen as an inverse of diversity: the more similar the recommended items are, the less diverse the recommendation is. Ideally a similarity function that is based on the item itself, or on item meta-data, is used, as it allows diversity to be measured independently of properties of the ratings and predictions. Where sim is the similarity function, the ILS (an inverse metric of diversity) of a recommendation list Rec can be defined as:

\[
\mathrm{ILS}(\mathit{Rec}) = \frac{\sum_{i \in \mathit{Rec}} \sum_{j \in \mathit{Rec}, j \neq i} \mathit{sim}(i,j)}{2} \tag{2.39}
\]
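As an illustration, the following sketch assumes item popularity counts and a pairwise similarity function are available as plain Python objects; it computes the popularity-based novelty estimate discussed above and the ILS of equation 2.39.

```python
from itertools import combinations


def mean_popularity(rec_list, rating_counts):
    """Mean rating count of recommended items; lower suggests more novel recommendations."""
    return sum(rating_counts.get(i, 0) for i in rec_list) / len(rec_list)


def intra_list_similarity(rec_list, sim):
    """ILS of equation 2.39.

    Summing sim(i, j) over unordered pairs equals summing over all ordered
    pairs with i != j and dividing by 2, since sim is symmetric.
    Higher ILS means a less diverse recommendation list.
    """
    return sum(sim(i, j) for i, j in combinations(rec_list, 2))
```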

Structuring an Offline Evaluation

The heart of a good offline evaluation is how the train and test datasets are generated. Without a good process for splitting train and test datasets, the evaluation can fail to produce results, produce misleading results, or produce statistically insignificant results. To avoid this, simple standard approaches for generating train and test datasets have been developed. The standard approach to producing train and test datasets is user-based K-fold crossfolding (a short sketch of this procedure appears at the end of this subsection). In user-based K-fold crossfolding the users are split into K groups, where typical values of K are 5 or 10. For each of the K groups we generate a new train and test dataset split. For dataset split n, all users not in fold n are considered train users and all their ratings are allocated to the training dataset. The users in group n are then test users, and their ratings are split so that some can be in the train dataset (to inform the algorithm about that user's tastes) and the rest go to the test dataset. Typically either a constant number of ratings, such as ten per user, or a constant fraction, such as ten percent of each user's ratings, is allocated for testing. Test items can either be chosen randomly, or the most recent items can be chosen to emphasize the importance of the order in which a user makes ratings.

User-based K-fold crossfolding has several benefits. First, by performing a user-based evaluation we know that each user will be evaluated once and only once. This ensures that our conclusions give equal weight to each user, with no user evaluated more or less than the others. Secondly, by ensuring that there are a large number of training users we know that we are evaluating the algorithm under a reasonably realistic condition, with a reasonable amount of training data. Finally, through replication we can measure statistical confidence around our metric values, as each train and test pair can be considered relatively independent. With user-based crossfolding it is common to treat each user as its own independent sample of the per-user error when computing statistical significance.

There are several other approaches for structuring an offline evaluation that have been used in the past. These approaches are typically designed to focus the evaluation on a single factor, or to support new and interesting metrics. One interesting approach is the temporal evaluation [71, 72]. Temporal evaluations have been used to look at properties of the recommender system over time as the underlying collaborative filtering data changes. In a temporal evaluation the dataset is split into N equal-sized temporal windows. This allows N − 1 evaluations to be done, one for each window after the first, where the train set is all windows before the given window and the test set is the target window. Applying normal metrics this way can give you an understanding of how an algorithm might behave over time in a real deployment [71]. Temporal analysis also allows for interesting new metrics, such as temporal diversity [72], that measure how frequently recommendations change over time.
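The sketch referenced above might look as follows; it assumes ratings arrive as (user, item, rating) tuples and holds out a fixed number of ratings per test user, which is one of the variants described in the text.

```python
import random
from collections import defaultdict


def user_based_crossfold(ratings, k=5, holdout=10, seed=42):
    """Yield (train, test) rating lists using user-based k-fold crossfolding.

    ratings: iterable of (user, item, rating) tuples.
    For each fold, users in the fold are test users; `holdout` of their ratings
    go to the test set and the rest stay in the training set as their profile.
    A production tool would also require a minimum profile size per test user.
    """
    rng = random.Random(seed)
    by_user = defaultdict(list)
    for r in ratings:
        by_user[r[0]].append(r)

    users = list(by_user)
    rng.shuffle(users)
    folds = [users[i::k] for i in range(k)]

    for fold in folds:
        fold_set = set(fold)
        train, test = [], []
        for user, user_ratings in by_user.items():
            if user not in fold_set:
                train.extend(user_ratings)            # pure training user
            else:
                rng.shuffle(user_ratings)
                test.extend(user_ratings[:holdout])   # held-out test ratings
                train.extend(user_ratings[holdout:])  # profile ratings
        yield train, test
```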

Online Evaluations

Not every comparison can be done without users. While it may one day be possible to directly optimize a recommender system for its impact on users, our current understanding of offline metrics, and indeed human psychology, falls short of this goal. Many properties of the user experience of a recommender system can only be loosely estimated with offline metrics. This is especially true of properties like novelty and serendipity that explicitly relate to what is not known by the user, and could never be encoded into a training dataset. Since it is these user-experiential properties that determine which algorithm makes users most happy, we recommend that offline evaluation be used to choose a small set of algorithms and tunings that are then compared using a final online evaluation. There are several ways to perform an online evaluation, each with its own benefits and drawbacks:

Lab Study - In a lab study users are brought into a computer lab and asked to go through a series of steps. These steps may involve using a real version of the website, an interactive survey, or an in-depth interview. An example lab study might bring users into the lab so eye tracking can be performed to evaluate how well a recommender chooses eye-catching content for the home page. Lab studies give the experimenter a large amount of control over what the user does. The drawback of this control, however, is that lab studies often do not create a realistic environment for how a recommender system might be used. Additionally, lab studies are typically limited in how many users they can involve, as they often require space and supervision from an experimenter.

Virtual Lab Study - Virtual lab studies are similar to lab studies but are performed entirely online and without the direct supervision of the experimenter. This deployment trades some of the control of a lab study for much larger scale, as virtual lab studies can involve many more users in the same amount of time as a lab study. These normally take the form of purpose-built web services that interact with the recommender and guide the user through a series of actions and questions. While almost any form of data about user behavior and preference can be used with a virtual lab study, surveys are particularly popular. An example virtual lab study might guide users through rating on several different interfaces and then survey users to find out which they preferred. A well-designed survey can be easy for a user to complete remotely, and very informative about how users evaluate the recommender system.

Online Field Study - Online field studies focus on studying how people use a deployed recommender system. Generally there are two approaches to online field studies. In the first, existing log data from a deployed system is used to try to understand how users have been behaving on the system. For example, rating data from a system could be analyzed to understand how often users enter a rating under the current system. Alternatively, a change can be made to a deployed system to answer a specific question. New forms of logging could be introduced, or new experimental features deployed for a trial period. This can allow answering more specific questions about user behavior. For example, a book recommender system might set the goal that 80% or more of users find a book within 5 minutes of accessing the service. Logging could be added to the system to measure how often users find books, and how long it takes, so that performance can be directly compared with this goal. Online field studies are ideal for understanding how a system is used or for measuring progress against some goal for the recommender system as a whole. Online field studies are not as well suited for comparing different options. Likewise, online field studies are limited by their connection to a deployed live system, which might preclude studying possibly disruptive changes. Finally, online field studies often do not allow asking follow-up questions without resorting to a secondary evaluation; for example, while a researcher might know what users do in a situation, they will not be able to ask why.

A-B Test - A-B tests can be seen as an extension of an online field study. In an A-B test two or more versions of an algorithm or interface are deployed to a given website, with any user seeing only one of these versions (a sketch of a simple way to assign users to versions appears at the end of this section). By tracking the behavior of these users, a researcher can identify differences in how users interact with the algorithm in a realistic setting. An example A-B test might deploy two algorithms to a recommender service for two months and then look at user retention rates. If one algorithm leads to fewer users returning for another visit, then we can say that algorithm is likely worse. Being an experimental extension of online field studies, A-B tests share the same weaknesses: it can be hard to understand what is causing the results they find.

Within these options virtual lab studies are most common when the goal is to understand why users perceive algorithms differently, and A-B tests are preferred when the goal is simply to pick the “best” algorithm by one or more user behavior metrics. Online evaluations are a complicated subject, and the description here only scratches the surface. Several texts are available that go into depth on the various ways to design a user experiment to answer any number of questions, including those of algorithm performance. In particular, we recommend the book “Ways of Knowing in HCI” [93], which contains in-depth coverage of a broad range of research methods applicable in this scenario. With that said, there are some considerations that are specific to recommender systems which we will discuss.

Since all recommender systems require a history of user data to work with, ensuring that this information is available is important when considering the design of a study. This decision is closely tied to how participants for the study will be recruited. If recruiting from an existing recommender system, a lab study or virtual lab study can simply request user account information and use that to access ratings. Better yet, in a virtual lab study, links to the study could be pre-generated with a user-specific code so that users do not need to log in. If recruitment for a lab study does not come from an existing recommender system, then typically the first stage of the study is to collect enough ratings to make useful recommendations. As this is the same problem faced when on-boarding users to a recommender system, it is well studied. The common solution is to present users with a list of items to rate [46, 103, 104]. There are many different ways to pick which items to show, and in what order, to optimize how useful the ratings are for recommending, the time that it takes to get a certain number of ratings, or both. For a good review of this field of work see the 2011 study by Golbandi et al. [46], which also includes an example of an adaptive solution to collecting useful user ratings. If time is not a major factor, however, a design where users search to pick items to represent their tastes may have some benefits [83].

Just as with offline evaluations, comparing reasonable algorithms is essential to producing useful results. Starting with offline evaluation methodologies can be a good way to make sure only the best algorithms are compared. When comparing novel algorithms, it can often help to include a baseline algorithm as a comparison point whose behavior is relatively well known. When using a survey, often as part of a virtual lab study, it is important to choose well-written questions.
Some questions can be ambiguous, making it hard to assign meaning to their answers. A survey question is useless to an experimenter if many users do not understand it. To support experimenters, several researchers have put together frameworks to support online, user-centered evaluations of recommender systems [67, 102]. These frameworks split the broader question of how well an algorithm works into smaller factors tuned to specific properties of a recommender system. Each of these factors is also associated with well-designed survey questions which are known to accurately capture a user's opinion. When possible we recommend using one of these frameworks as a resource when designing an online evaluation.
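As a concrete illustration of the A-B assignment mentioned above, the following sketch uses a hash of the user identifier and an experiment name to give each user a stable condition across visits; the function and experiment names are hypothetical, not part of any system described in this thesis.

```python
import hashlib


def assign_condition(user_id, experiment, conditions):
    """Deterministically assign a user to one experimental condition.

    Hashing the user id together with an experiment name keeps each user in
    the same condition across visits while keeping assignments independent
    between different experiments.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    return conditions[int(digest, 16) % len(conditions)]


# Example: roughly half of users see each algorithm.
# assign_condition(1234, "rec-algo-2018", ["item-item", "funk-svd"])
```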

Resources for Algorithm Evaluation

There are plenty of resources available to help explore ratings-based collaborative filtering recommender systems. In particular, there are open source recommender algorithm implementations, and rating datasets available for non-commercial use. These resources have been developed in an effort to help reduce the cost of research and development in collaborative filtering. In recent years there has been a push in the recommender systems community to support reproducible recommender systems research. One major result of this call has been several open source collections of recommender systems algorithms such as LensKit10, Mahout11, and MyMediaLite12. By publishing working code for an algorithm the researchers can ensure that every detail of an algorithm is public, and that two people comparing the same algorithm don't get different results due to implementation details. More importantly, however, these toolkits allow programmers who are not recommender systems experts to use and learn about recommender systems algorithms and benefit from the work of the research community.

10https://lenskit.org/
11https://mahout.apache.org/
12http://mymedialite.net/

The other noteworthy resource for exploring and evaluating recommender systems is the availability of public ratings datasets. These datasets are released by live recommender systems to allow people without direct access to a recommender service and its user base to participate in recommender system development. Some notable examples of datasets are the MovieLens movie rating datasets13, described in detail in Harper and Konstan's 2015 paper [49], the Jester joke rating dataset14 described in the 2001 paper by Goldberg et al. [47], the Book-Crossing book rating dataset15 introduced in a 2005 paper by Ziegler et al. [147], and the Amazon product rating and review dataset16 first introduced in McAuley and Leskovec's 2013 paper [82]. As many of the experiments in later chapters of this thesis use one of the MovieLens datasets, we include summary information on these datasets here, in table 2.3. These datasets, and many more available online, are available for anyone to download and use to learn about, and experiment with, recommender system algorithms. Most of the recommender systems toolkits have code for loading these datasets and performing offline evaluations already built.

While most directly applicable to offline evaluations, these resources can also help with online evaluations. Open source libraries can be used to quickly prototype recommender systems either for incorporation in an existing live system or for a lab or virtual lab study. Likewise, research datasets can be used to power a collaborative filtering algorithm for use in a lab or virtual lab study design in which new ratings will be collected from research participants. This can allow anyone with access to research participants to perform online evaluations to deeply understand how users react to collaborative filtering technologies.
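For example, the ML-100K dataset can be loaded with a few lines of Python; this sketch assumes the standard `u.data` layout of tab-separated user, item, rating, and timestamp fields (consult the dataset README if your copy differs).

```python
import pandas as pd

# ML-100K stores ratings in "u.data" as tab-separated
# user id, item id, rating, timestamp records.
ratings = pd.read_csv(
    "ml-100k/u.data",
    sep="\t",
    names=["user", "item", "rating", "timestamp"],
)

print(ratings["user"].nunique(), "users,",
      ratings["item"].nunique(), "items,",
      len(ratings), "ratings")
print("mean rating:", round(ratings["rating"].mean(), 3))
```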

13available at http://grouplens.org/datasets/movielens/
14available (with updates) at http://eigentaste.berkeley.edu/dataset/
15available at http://www2.informatik.uni-freiburg.de/~cziegler/BX/
16available at http://jmcauley.ucsd.edu/data/amazon/

Name     Date Range       Rating Scale        Users     Movies   Ratings
ML 100K  9/1997-4/1998    1-5, stars          943       1,682    100,000
ML 1M    4/2000-2/2003    1-5, stars          6,040     3,706    1,000,209
ML 10M   1/1995-1/2009    0.5-5, half stars   68,878    10,681   10,000,054
ML 20M   1/1995-3/2015    0.5-5, half stars   138,493   27,278   20,000,263

Table 2.3: Information about the MovieLens datasets. Information from Harper and Konstan's 2015 report. See that report for more information [49].

Chapter 3

A Case Study: the BookLens Recommender System

Personalized recommender systems have created new ways for large-scale systems to help users find things they would like. Systems like Amazon and Netflix have shown that these algorithms provide a significant competitive advantage. One of the fundamental issues with these algorithms is that they require a very large amount of information to operate. Without some minimal amount of information a recommender system cannot provide useful insights to its users.

This data requirement creates a dilemma for smaller systems. Without access to recommendation technologies, a small system can be at a significant disadvantage compared to a larger competitor. For some systems, the clear solution is to continue to grow until a recommender becomes reasonable. Other systems, however, are inherently small. These inherently small systems can be locked out of the benefits of recommender technology.

Local public libraries are a great example of an inherently small system. Any given library1 tends to be a very small local institution whose sole goal is serving a specific community. According to a 2013 survey of US libraries conducted by the Institute of Museum and Library Services (IMLS), half of all libraries have fewer than 4018 registered patrons, and most (70.6%) have fewer than 10,000 registered patrons [124].

1For the purpose of this chapter, the term “library” refers to a library organization as opposed to one physical library branch.

For comparison, Amazon has been estimated to get 188 million visitors in a month [1]. While this comparison is a bit unfair (Amazon is a multinational company, and your average library serves one or two cities), it serves to demonstrate the disparity of scale between a local public library and the systems it might compete with.

Despite being small, libraries are a large part of their local community. In a 2012 library poll, Pew Research found that 91% of respondents thought libraries were important to their community [146]. In a more recent 2015 poll, 65% of respondents said that closing their local public library would have a major impact on their community [56]. This should not be surprising: serving community needs has always been a primary goal of public libraries [30]. While each library may be small, taken as a whole libraries are quite large. There are at least 9300 libraries with at least 16,000 branches and a total of 172 million registered users in the United States alone [124]. All of this taken together makes the library domain a very interesting domain for technical intervention.

While there are many interesting technological questions in the library domain, our focus in this work is how we can help library patrons find items through a catalog. Especially with the advent of e-books and other digital loanable content, even small libraries catalog an extensive number of items. While navigation techniques (both online and offline) support targeted dives into the library catalog, much less support is available to users who don't know exactly what they want.

Patrons in physical library branches have many ways to find interesting books. Beyond simply browsing the shelves looking for an interesting read, library patrons can request readers' advisory from library staff. In readers' advisory a trained librarian asks a patron about their preferences, and then makes personalized recommendations around that patron's interests. In an online catalog, navigation support is often limited to standard search tools. While the majority of library patrons would be interested in further support, such as personalized book recommendations [145], this is not available on most library catalogs.

Despite worse support for casually discovering books, online catalogs have become the primary way library patrons navigate their local library's collection. This may be partially due to the onset of e-books and other forms of digital content that can only be accessed through the digital catalog. Even when a library patron decides to borrow a physical book it is increasingly common for them to place a hold online and only spend enough time in the library to pick up their items. This trend has become so significant that some library systems are deploying express libraries: small stations deployed around the community where you can pick up reserved books without ever setting foot in a library. While convenient, these new technologies make it harder for libraries to help their patrons find interesting content.

In this chapter, we present the BookLens system. The BookLens system brings personalized recommender system technology (and related social content such as book reviews) to local public libraries. These features can help to bring services like readers' advisory, which once were only available in person, into the online catalog, without sacrificing libraries' traditional privacy promises and protections. To address the inherent limits of population size, we explore a federated recommender system architecture.
Just like inter-library loan allows libraries to band together to expand their catalog offerings while retaining the ability to personalize their collections to serve their community, the federated recommender allows libraries to pool their ratings while retaining the ability to customize recommendations to their unique community. We've found that this design substantially increases the amount of information available for use when making recommendations to any individual library system.

This chapter presents a summary of our work on the BookLens system. We will begin with an overview of past work in the areas of library recommendation services and book recommendation algorithms. In section 3.2 we will provide a brief overview of the development of the BookLens system, and a discussion of the design constraints we discovered through our collaboration with the library systems. After this we will present our vision of a federated recommender system, both in the abstract (section 3.3) and in our specific implementation of it for library recommendation (section 3.4). A key part of the federated design is using rating pooling to improve the amount of preference information available to any individual library. To our knowledge there has been no prior report on the effectiveness of such a rating pooling scheme across multiple communities with partially overlapping item sets. Therefore in section 3.5 we will report on the usage of the BookLens system, investigating how pooling affects item coverage (measured as the number of items with at least 1 user rating), rating volume (measured in ratings per month), and overall rates of catalog overlap. Our findings show that rating pooling can substantially increase the amount of information available to libraries of any size. We conclude section 3.5 with a brief comparison of several key rating properties between MovieLens and BookLens. This comparison serves to highlight properties of the data that vary quite substantially from what is seen in the well studied MovieLens dataset, and motivates later chapters of this thesis. Finally, in section 3.6 we will conclude with a discussion of future work, as well as a reflection on the current state of the BookLens system.

While this description of the project, discussion of its design, and analysis of the data from the BookLens system represents my sole work, the BookLens project as a whole has been a close collaboration with the broader BookLens team. At the University of Minnesota, this project was a collaboration with Michael Ludwig for all of the programming, design, and technical problem solving up through 2016, at which point we had reached feature-complete deployment of the BookLens system. After this point Michael reduced his involvement in the project and I became the primary point of contact for advancing and maintaining the BookLens project, as well as for pursuing research opportunities surrounding this system. Additionally, we were supported at various points by Rich Davies, John Riedl, and Joseph Konstan, all of whom served a primarily advisory role. Finally, it is important to acknowledge the support of our MELSA partners in the design, development, and deployment of the BookLens system. More details on the library partners will be provided with the discussion of the development of BookLens in section 3.2.

3.1 Past Work in Library Item Recommendation

The vast majority of past research literature on the book or library-item recommendation problems focuses purely on building algorithms for the recommendation domain. This body of research does a great job of identifying data that can be used to improve recommendation performance, such as social data [98, 99], author information [131], book subjects and meta-data [79, 100, 141], and grade level appropriateness [79, 100]. Unfortunately, this body of research does not address the core issues of deployment and design such as patron privacy or data quality/quantity. Many of these algorithms are explicitly designed for use with records of catalog navigation or book checkouts/renewals [74,79,85,89,96,126–128,135,139,144] which are not generally available at public libraries for privacy reasons. One way to view this work is the hope that if an algorithm can be shown to work well enough, then the other problems of deployment will be solved. This runs counter to our approach: if a stable system can be designed, then algorithms can be designed and optimized for that system.

While much of past work has focused on book recommendation algorithms, several reports of experimental library-item recommenders comparable to BookLens have been published. Geyer-Schulz et al. describe a system deployed at the Universität Karlsruhe in 2002 [89]. Their system uses a separate recommender server which communicates sporadically with core catalog servers about item holdings. Their recommendations are based on server navigation logs, although their design does allow for recommendation feedback which could be used in recommendations in the future. Geyer-Schulz et al. evaluate their service with a single-question survey and found that users generally found their recommendations helpful.

Geyer-Schulz et al. also describe several barriers that they identified. The barriers they describe range from low-level technical barriers, such as weaknesses in their catalog's session tracking behavior, to broader theoretical issues. In particular, they briefly mention hearing from librarians that privacy issues might be a barrier to the deployment of these technologies, although they offer no particular solution. Other than privacy and technical issues they also highlight library budgets and algorithmic sparsity concerns as other barriers to broader deployment of these technologies. Both the design of this system and the barriers they identify share many similarities with what we've found in our work.

Luo et al. present the preliminary design work of their privacy-preserving book recommendation system (PPBRS) [78]. Their system is also based on monitoring online behavior for the user data needed to make recommendations. What is interesting about the PPBRS design, however, is that it has an explicit module for enforcing privacy. This module allows for implementation of privacy-preserving measures after data is transferred to the core system, but before it has been stored.

Mönnich and Spierling describe the BibTip system [85]. Like the previous two systems, BibTip is based on observing user actions on the library service for making recommendations. Likewise, their design also includes an explicit separation of traditional library catalog servers from recommendation servers. BibTip is interesting because it is designed to be used by multiple libraries.
BibTip relies on the ability of a library catalog to be modified to include an arbitrary JavaScript file, and to create specific HTML sections for recording the ISBN of an item (which is used as the primary item identifier by BibTip). No mention is made of how BibTip would deal with items that do not have an ISBN, and in our experience some libraries may have trouble making the required changes to integrate with BibTip. Mönnich and Spierling's 2008 report also highlights the importance of having sufficient behavioral data to generate recommendations, and recommends a pooling-based strategy such as the one used by BookLens.

We designed the BookLens system independently of this past work, but interestingly arrived at many similar solutions. The BookLens design features a centralized recommendation server which runs independently from library catalog servers. We have a server (the bridge) which acts as a go-between for catalog and recommender servers, implementing any necessary privacy-preserving tactics, similarly to the privacy module from Geyer-Schulz et al. There were, however, several differences that are a direct reflection of limiting constraints in library-item recommendation we encountered in our collaboration with MELSA, which will be detailed in the next section.

3.2 Development of the BookLens System

The BookLens project began in 2008 with the idea of building “MovieLens, but for books”. At the time, this was a common request from users of the MovieLens service, who wanted a website with social features surrounding books. (In 2008, many of the now-popular, book-centered social network services, such as GoodReads2 and LibraryThing3, were still in their infancy.) A timeline of the development phase of BookLens can be seen in figure 3.1. This timeline ends in 2015 after what is considered the first deployment of a full BookLens system. This deployment marks a major shift in our focus from the design and development of BookLens to its deployment in library systems. A second timeline covering these deployments will be provided in section 3.5 when we evaluate data from these deployments.

2https://www.goodreads.com/
3https://www.librarything.com/

Figure 3.1: A timeline of the development of the BookLens system

The original BookLens team was Michael Ludwig and me, both University of Minnesota undergraduates at the time. From 2008 through 2010 Michael and I designed and developed the BookLens system, rapidly iterating from a user-facing website to a service-oriented architecture that could integrate with book stores and library services. In 2010 our system was ready to present to potential partners, and in mid-2010 we began working with Saint Paul Public Library to deploy what would be BookLens Beta. In September 2011 we deployed our original BookLens system to Saint Paul Public Library. BookLens Beta was an early design prototype compared to the full BookLens system we will discuss in this chapter. While it had many of the same features as the later BookLens system, many lacked refinement. In 2012, Michael presented this system at a local library conference, which led to the beginning of our partnership with the Metropolitan Library Service Agency (MELSA).

MELSA is an alliance of the eight public library systems in the seven-county Minneapolis/Saint Paul metropolitan area. Basic information on these systems can be seen in table 3.1. As can be seen, MELSA libraries cover quite a range of sizes, from Hennepin County Library, which is one of the largest libraries in America, to the relatively modest Carver County Library. MELSA's major goal is to organize collaborative action among these library systems, such as organizing summer reading programs, and allowing library cards from any library to work at any other library. Within MELSA we regularly met with the tech team, which contains representatives of the technical staff from each library. The members of the tech team served as our primary contact at each library.

Through our collaboration with MELSA we substantially improved the BookLens system, improving its usability, integrating it more closely with library systems, and adding book reviewing features. This development continued through November 2014, when we deployed our first full BookLens system at Saint Paul, Dakota, and Ramsey libraries.

Library                     Registered Patrons   Size of Book Collection   Size of E-book Collection
Anoka County Library        281,092              493,866                   49,168
Carver County Library       79,820               213,917                   41,091
Dakota County Library       342,928              673,382                   60,022
Hennepin County Library     843,674              4,271,882                 142,861
Ramsey County Library       236,344              514,176                   54,956
Saint Paul Public Library   342,844              783,131                   68,069
Scott County Library        108,820              214,787                   41,308
Washington County Library   163,280              386,519                   47,979

Table 3.1: An overview of the MELSA partner libraries. All data from the 2013 IMLS survey [124]. Note that the book counts here count copies of books, and therefore should not be compared to the item counts provided later in this chapter.

While minor development continued after this, our priorities at this point shifted towards resolving integration issues with the other MELSA libraries. While the software engineering process of the BookLens system may be interesting owing to our collaboration with multiple community partners, it likely is not novel. Many systems have been built by similar collaborations, and ultimately we do not concern ourselves in this report with the minutiae of the software engineering process that led to the BookLens design reported here. Through this process, however, we did learn much about the library domain that was not clear to us in advance. Some of what we learned served as interesting opportunities for future research, such as the possibility of collaborating with librarians who provide personalized item recommendations based only on personal interactions. Other properties of the library domain, however, served as unexpected barriers that became constraints on our design. We detail these design constraints here.

3.2.1 Design Constraints

While we have worked with a large group of library systems, we will note that our library partners can hardly be seen as fully representative of all library systems. Library systems exist that are larger than our largest library partner, and many libraries exist that are smaller than our smallest partner. Likewise, some of our findings are likely a consequence of the culture of public libraries in the USA, and might not be representative of libraries in other nations. With that said, the following is a list of design constraints that we felt shaped the design of the BookLens system, and that may need to be considered in any other attempt to bring recommendation technologies into the library domain.

Lack of Expertise, Staff, and/or Funding Perhaps the most direct barrier which we see as preventing libraries from leveraging big-data algorithms such as collaborative filtering recommender systems is a lack of knowledge, staff, and funding for the project. Librarians are not, and arguably should not be, trained as experts in social computing and personalization technologies, nor do most library organizations have resources to hire such an expert. It is possible some aspects of this may change going into the future, as libraries become more interested in how they can educate patrons to better understand today's new information environment, and patrons increasingly ask for education programs on the computer and the internet [57]. At present, however, library leadership may not even be fully aware of the advances made by the research community that might help their patrons. Even if this does become seen as a key target for strategic development, it is not clear that any but the largest libraries currently possess the funds necessary to explore these technologies.

This limitation is certainly not unique to the library domain, or to recommendation technology. A lack of education about recent advances in social computing, and the limited capacity of small community institutions to pursue these advances, is likely ubiquitous. Even though many websites people use every day leverage these powerful technologies, it is easy to overlook them, and it is not a trivial task to learn even the basic concepts of what these algorithms can in principle do. In the recommender systems domain, algorithm toolkits like LensKit [40] go a long way towards making cutting-edge algorithms more accessible; however, until algorithm deployment and tuning become better understood, this alone is not enough to bridge the gap.

Given the current state of educational resources about recommender systems, it is unlikely that organizations such as libraries will independently develop recommender systems integrated into their services. Therefore, the best way to overcome this constraint is for experts to embed themselves into community organizations, such as libraries. By learning about library systems and working either with libraries directly or with software vendors in the library domain, it is possible for the advances of the research community to spread to these smaller domains. This can be seen as an outreach activity for the expert, or an investment in a future research platform.

Domain Specific Technology While libraries may not be software specialists, they do use a large amount of specialized software. Most libraries have a multitude of systems managing their catalog, holdings information, patron information, and their online catalog. Most, if not all, of this software is purchased from one of a small set of vendors and provided with very limited customization. There is also a significant degree of variety in this from library to library, with our 8 libraries having four different online catalogs. For most of these catalogs, the extent of customization available is to add JavaScript to their library catalog page. In none of the purchased catalog software did we find sufficient customizability to add direct and secure server-to-server communication around recommendation.

It is currently quite uncertain exactly how this might change in the future. It is likely that libraries will continue using low-maintenance “turn-key” solutions for their library catalog and other administrative systems. The deployment of novel patron-facing features, such as automated recommendation, to these systems is subject to market forces, and therefore rather hard to predict or manipulate.

Domain-specific technology is a standard issue when investigating new domains. The specific combination of a high dependence on technology with little access or control, however, is interesting. As a research community of system builders, it is easy to forget that there are domains where “simply adding an algorithm” is a seemingly insurmountable task. Hospitals and small stores are likely situated in a similar situation of dependence on technology over which they have little voluntary control. In the digital space there are also likely many small websites, blogs, and podcasts built on helpful website builders that, while convenient, may also serve to prevent some forms of customization.

Libraries have one additional constraint beyond traditional domain-specific technology limitations. Libraries often act as a sub-unit of local government. This can place libraries under unique technological constraints that would not be present if they acted as independent agencies. In our experience this often took the form of limitations on newly hosted web servers and restrictions on the capability of any custom code to be operated by the libraries themselves.

Fundamentally Limited User Base This issue was explored to some degree in the introduction to this chapter. As community-centered organizations, libraries have a fundamental limit to the size of their user base. While a library can grow, to a degree, substantially increased usage is not expected. On its own this would not be an issue, but for any given library the library's user base alone may not be enough to support a recommendation engine, especially given the rate at which new books might be introduced.

Far from seeing their user population as a limitation, as an online store might, libraries see this as a strength. From their start, libraries have existed to serve community needs [30]. By only serving one specific community, libraries can better focus on their community's unique needs and identity. Our library partners have told us that each library has its own unique culture and they believe that each library's patrons have systematically different tastes in library items. For example, one library branch may have an unusually high number of readers who like a certain genre such as “cat mysteries” (mystery novels in which the protagonist is in some way aided by their pet cat), while another library branch may have an extensive youth collection to allow it to serve as the local school library.

This constraint is likely to be true of almost any geographically specific service. A system for local event recommendation likely cannot rely on out-of-town visitors to rate its local events. Instead any local system will either need to find new algorithms adapted to its dataset size, or find a way to aggregate information across multiple geographic regions.

Strong Privacy Protection Built upon their strong community focus, libraries also place a very strong emphasis on their role as protectors of patron privacy. Their dedication is best stated by one library's privacy policy: “We value intellectual freedom and a patron's right to open inquiry without having the subject of one's interest examined or scrutinized by others.” [16]. In our experience libraries want to protect the privacy of their patrons' data from both lawful and unlawful access. Therefore, many libraries opt to delete all access and checkout logs after a short delay [6]. This not only protects against accidental disclosure of protected information, but also against any deliberate or legally mandated disclosure.

While this is a commendable dedication to a service owner's role as steward of patron privacy, it does run contrary to the data storage ethos that has, in part, powered the big data revolution. This places very particular limits on what data we can and cannot use for recommendation. While a recommendation engine based on search logs or circulation data might be possible, it is not an option for the BookLens system, and likely is not an option for any third-party recommendation solution. Storing and leveraging search or circulation logs involves a fundamental shift in behaviors that it is not our right (as computer scientists) to require of library systems. Therefore, we built BookLens to operate only on explicit preference information with a strict opt-in requirement.

Many organizations have this type of dedication to patron privacy. Education, child care, and medical organizations, for example, may have a legal obligation to their patrons' privacy. While traditional algorithms may still work in these domains, it is important to design and develop solutions and algorithms for this area that are sensitive to legally enforced privacy protections.

Lack of unified item identifiers. The identification of a library resource is a surprisingly complicated subject. The MARC21 standard for digital representations of library catalog data lists fifty separate fields for storing different types of “numbers and codes” for identifying works [92]. The most obvious approach would be the use of International Standard Book Numbers (ISBNs). However, not every library item has an ISBN. Further exacerbating this problem, some ISBNs have been assigned more than once. While there are centralized organizations like the Library of Congress or the Online Computer Library Center (OCLC) that provide identifying numbers and canonical catalog information, these too cannot be expected to cover every item in a library's catalog.

Despite their flaws, a system based on ISBNs or Library of Congress identifiers may be a perfectly sound design if we accept being limited to most, but not all, of the item catalog. That said, there is a second problem with these identifiers for the purpose of a recommender system. ISBNs are allocated to each edition or printing of a book. Because of this, one piece of written work might have three ISBNs: one for the paperback, one for the hardcover, and a third for the ebook. This might be quite reasonable for a store tracking its inventory, but it is not helpful to a recommender system which might want to have a unique identifier for each piece of work, regardless of how it was published.

Ultimately, we found that our library partners often maintained their own internal identifier for each item in their catalog. This is not terribly surprising; data storage and management relies on having a good unique item identifier and no external source of this seems to exist. Ultimately, this client-specific identifier became the identifier of choice for the BookLens system, due in part to the technological limitations of library software mentioned above. We will discuss later a solution for resolving multiple editions of one book to a single recommendation item.

In practice this is often a general problem. Many domains do not have a unique, shared identifier for all items, and certainly not for all items at the appropriate level for recommendation. Movie data, for example, has no uniform identifier, as some systems might use community standards like IMDB numbers, and others might track movies by their product code.

3.3 The Federated Recommender

The core of the BookLens recommender design is the idea of a federated recommender system. The federated recommender system design is one of many responses to the problem of building a recommender system to support multiple communities of interest, each recommending over some subset of a shared item domain. For the purpose of terminology, we will use the term client to refer to a community of interest; while not the most neutral term, it matches terminology used later in this chapter. The problem, then, is how to design not just the algorithm, but the inter-system interactions and even the user interfaces, for a recommender with some form of data sharing between clients, given that each client may only track a separate subset of the item domain. The federated design sits between two more conventional designs, the fully-centralized recommender and the fully-decentralized recommender, and shares some features of both. We will describe these more familiar designs first, before exploring the key properties of the federated system.

(a) centralized system (b) federated system (c) decentralized system

Figure 3.2: Three recommender system designs capable of working with multiple communities of interest.

3.3.1 The Centralized Recommender

The centralized design can be seen in figure 3.2a. In a centralized design users of each client are modeled uniformly as direct users of a centralized recommender system. It should be noted that, while a centralized system commonly communicates following the patterns seen in figure 3.2a, we would still consider figure 3.2b a centralized design if the recommender logically thinks of each user as a direct user of the system. A centralized recommender will do no recommendation-level customization for each client; this means that differences between client communities are ignored. Furthermore, the recommendations will be made over the item set of the central recommender, not the client's item set. This means some items may not be recommended, and it may be possible for an item not recognized by a client to be recommended to that client's users. While client-side post-processing or filtering is technically possible to alleviate these issues, this would defeat the major strength of a centralized design: ease of implementation for the clients. Interacting with a centralized recommender system can be quite easy as the client simply needs to connect the user to the central server. This is often as easy as adding a JavaScript file to the client webpage. Since the client has little involvement in the recommendation process, this comes at a substantial cost in control, customization, and possibly end-user privacy.

We see centralized recommending as the typical “straightforward” approach to solving the technical problem of multiple-client recommendation. The recommendation technologies involved would be standard off-the-shelf algorithms trained, tuned, and maintained by the centralized server. As an example of how this might work in the library domain, consider if each library included a widget that linked directly into Amazon's book recommendations and storefront. These users would be seen as “just another Amazon user”, and their viewing would likely be tracked with a cookie regardless of log-in status at the catalog. Recommendations would likely be made for items that Amazon sells, quite possibly with links to the Amazon store page for each item.

3.3.2 The Decentralized Recommender

At the far opposite extreme of this design is the totally decentralized recommender system seen in figure 3.2c. As the name might suggest, this design has no centralized server to support the rating pooling and collaboration necessary to achieve high quality recommendation. Instead, in this design each server collaborates in a cryptographically secure, decentralized computation of a recommendation model. A proof of concept of such a design has been published recently [117], showing that such a design can implement Item-Item collaborative filtering without any client revealing any private ratings. This specific design relied on a mediator to help manage information sharing, and a known-in-advance item domain and identification mapping. Likewise, the response time of this design was not practical for a large-scale system, taking several seconds to make top-n recommendations for each user. That said, this is only the current state of the art, and it is probable that more algorithms, and faster designs, will be found at some point in the future.

The decentralized design gives the client ultimate control over privacy and customization of results. Customization of recommendations will necessarily have to happen after core recommendations are made under current designs; however, there is no reason to assume that in-algorithm client personalization is impossible. The major weakness of this design is that deploying it as a client would carry a substantial technological cost. Such algorithms are still state of the art, and the tuning of these algorithms on an on-going basis is likely quite complicated. For this complexity, however, the client would get complete control, and the strongest possible promises of privacy for their users.

                                 Centralized     Federated       Decentralized
Support for multiple item sets   Not supported   Supported       Varies
Client technological demands     Low             Low to medium   High
Client control and privacy       Low             Medium          High

Table 3.2: A comparison of multi-community recommender designs

3.3.3 The Federated Recommender

Between these two recommender designs is our choice: the federated recommender system. The federated design retains the core recommendation service as seen in the centralized design. However, rather than interacting with users directly, all interaction with users is mediated by the client. As such, all user interaction is recorded within the context of a client, and all recommendations are made explicitly for a user at a given client. Again, we note that this goes deeper than simply the communication patterns involved; there is no theoretical reason why either centralized or decentralized communication patterns couldn't be used to represent a federated recommendation model. What is important is that the data is modeled, and the recommendation task is designed, relative to a given client. In this way we can get many of the ease-of-use benefits of centralized recommending, with many of the customization and privacy benefits of decentralized recommendation.

In practice, the algorithm leveraged in a federated design should be client-aware both when making personalized recommendations, and when making non-personalized recommendations for users who have not logged into the client service. Due to the state of BookLens deployment, which will be discussed in later sections, we do not have an example of this. As mentioned earlier, our research plan for library recommendation and federated recommendation is to first build a stable system. Other researchers have already begun working on algorithms for book recommendation; our goal is to collect a first-class dataset that can be used to forward research on this item domain well into the future. Likewise, while we are not aware of any specific client-aware recommendation algorithms, it is likely that these will be quite similar to existing advancements in context-aware recommendation, treating the client as a user-stable form of context.

Table 3.2 summarizes and compares these three designs. For our purposes, three constraints drove us towards a federated design: Lack of Expertise, Staff, and/or Funding; Fundamentally Limited User Base; and Strong Privacy Protection. To best serve the libraries' desire for strong privacy of patron data a decentralized approach would be ideal. Unfortunately, until substantially more development of these algorithms occurs, these systems are not suitable for deployment in the library domain, as the technological demands would likely be too high. While a centralized design is perhaps an option, and could certainly be deployed with the lowest technological overhead, it would not be able to capture the unique culture that our library partners have told us is important to their userbase. Therefore we conclude the federated design is best for our purposes.
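To illustrate what “recommending relative to a client” might mean in code, the following is a minimal sketch rather than the BookLens API; the class and field names are hypothetical, and the scorer is assumed to fall back to non-personalized scores when no user is given.

```python
from dataclasses import dataclass
from typing import Optional, Set


@dataclass
class RecommendationRequest:
    """A recommendation request in a federated design.

    Unlike a centralized design, every request names the client so that
    results can be restricted to that client's catalog and, eventually,
    personalized to that client's community.
    """
    client_id: str
    user_id: Optional[str]      # None for anonymous, non-personalized requests
    candidate_items: Set[str]   # the client's own item set (or a subset of it)
    n: int = 10


def recommend(request, scorer):
    """Score only the client's candidate items and return the top n item ids."""
    scores = {item: scorer(request.user_id, item)
              for item in request.candidate_items}
    return sorted(scores, key=scores.get, reverse=True)[:request.n]
```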

3.4 Case Study: The Design of BookLens

We separate our discussion of the design of BookLens into two parts: the data organization and the structural organization. First we will discuss the data organization, in which we define several key concepts on top of the traditional item-user-rating model of a centralized recommender design. Then we will discuss the structural organization, which focuses on the series of servers and communications that we use to deploy BookLens with low technical cost to our clients while remaining sensitive to their privacy needs.

3.4.1 Data Organization

Clients As mentioned in the previous section, the top-level resource tracked by BookLens is a Client. A client is any single library system which BookLens interacts with. Currently, we have eight clients, one for each partner library through MELSA. We also have several artificial client resources which help us track various data sources we have imported into our database.

Each user and book is associated with a client, and a client-specific identifier, which allows libraries to offload item-set matching and management to BookLens. Through these associations we can associate each action with a given client as well, allowing us to support the individuality of our library partners. Finally, the client records play an important role in our authorization and authentication scheme, storing public and private keys which operate similarly to a username and password for each client. These are used in an OAuth 1.0a authentication scheme [7] to prevent inappropriate access to private data against a range of common attack strategies.

Figure 3.3: A simplified data model for the BookLens system

Users A BookLens user represents one library patron. Each user account is associated with a research and dataset opt-out, a privacy opt-in, and a display name. The display name is set upon making a review, as public reviews are displayed with associated usernames. Our data model retains fields for username and password from previous iterations of BookLens, although we note that this feature is not used in the current deployment. Likewise, the current data model supports a user being associated with multiple clients (with each client getting separate user-client public and private keys for the OAuth scheme). Currently each user account is only associated with one library; however, upon request by patrons who interact with multiple library clients we can “merge” accounts using this feature. As BookLens users are tied to library accounts we cannot and do not track users who have not logged into their library site; we do, however, allow clients to request non-personalized information for browsing sessions with anonymous or unknown users.

Books A book record in BookLens represents a single version of a library item located at one specific library. The term book is slightly misleading as these records can be any item carried by libraries, including music, movies, and video games. Book records contain all relevant meta-data about an item including title, authors (or creators more generally), and publication information. This information is provided by our library partners to help us connect ratings made across libraries, and is synchronized once daily. We also allow storing an arbitrary meta-data “string”; this feature is used to help collect data such as reading level or genre that is sporadically encoded in source data. Data stored in this field is primarily intended for future investigation into extensions of this data model.

Ideally, we would also store availability information on book records. For example, it might be helpful when recommending to know how many copies of a given book each library has, or whether a given book is currently available or has a long wait list. While this information is tracked meticulously by libraries, we found that most systems did not have a simple way for that information to be shared with an external system such as ours. Therefore, we have no specific record of physical book copies.

If two libraries have copies of the same edition of the same book we will store two book records, one for each library. This may seem redundant, but it is necessary as different libraries may have different information about each book. Moreover, due to these differences, it can be difficult to know which item at one library is the same as which item at another library. As discussed previously, we know of no comprehensive identifying scheme that we can rely upon for our item domain. Ultimately, this problem is solved by assigning each book record to one opus record.

Opus An opus record represents a single piece of creative work and is equivalent to an item in a general recommender system design. For example, a book record might be ISBN # 0345253426 - the 1966 paperback edition of “The Hobbit” by J. R. R. Tolkien published by Ballantine Books and owned by Ramsey County Public Library. An opus record, on the other hand, would be “The Hobbit” by J. R. R. Tolkien. Opus records store basic title and author information for internal use, but an opus is primarily represented as a collection of all related book records - both common editions across many libraries, and separate printings and editions of the same work. Creating and identifying opuses is a difficult problem. There is no standard way of recording that two publications contain the same work. Therefore, opus identification occurs through a series of heuristic rules. First, if two items have the same ISBN, we consider them part of the same opus. Failing that, we match book records using a fuzzy search over title and authors⁴ through the Lucene⁵ software library. If the title and the set of authors match closely enough between a new or updated book record and an existing one, we mark these as being part of the same opus. If multiple records match well enough, we assign the book to the opus with the most matching books. The specific values used for “close enough” were manually tuned to maximize performance on a small hand-coded dataset of opuses.⁶ Our opus model is closely related to the FRBR (Functional Requirements for Bibliographic Records) model [12]. FRBR is a hierarchy that has four levels, with the top level (work) being essentially our opus records, and the third level (manifestation) being essentially our book records.

FRBRization is the act of organizing library records into the FRBR hierarchy and is still a matter of ongoing research. Our goals lead to a flatter hierarchy and a relatively simple process, as many distinctions made in FRBRization are not relevant to our task.

⁴We do not use copyright year, as we have found that this field, in both our data and our partners’ data, does not necessarily encode the earliest copyright year.
⁵http://lucene.apache.org/
⁶The creation and tuning of these threshold values occurred quite some time ago, and further details about it are no longer available.
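To make the opus-identification heuristic described above concrete, the sketch below mirrors its overall shape: an exact ISBN match first, then a fuzzy title/author comparison, preferring the largest matching opus. BookLens performs the fuzzy search with Lucene; here difflib from the Python standard library stands in for it, the 0.9 threshold is an illustrative placeholder rather than the tuned production value, and the dictionary layout is an assumption for illustration only.

```python
from difflib import SequenceMatcher


def text_similarity(a: str, b: str) -> float:
    """Crude stand-in for a Lucene fuzzy match: 0.0 (different) to 1.0 (identical)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def assign_opus(new_book: dict, opuses: list[dict], threshold: float = 0.9) -> dict:
    """Assign a book record to an existing opus, or create a new one.

    Each opus is a dict with 'title', 'authors' (a set of names), and
    'books' (a list of book records, each of which may carry an 'isbn').
    """
    best, best_size = None, 0
    for opus in opuses:
        # Rule 1: an exact ISBN match immediately ties the records together.
        isbn_match = new_book.get("isbn") and any(
            b.get("isbn") == new_book["isbn"] for b in opus["books"])
        # Rule 2: otherwise, title and author set must both match closely enough.
        fuzzy_match = (
            text_similarity(new_book["title"], opus["title"]) >= threshold
            and text_similarity(" ".join(sorted(new_book["authors"])),
                                " ".join(sorted(opus["authors"]))) >= threshold)
        # Prefer the largest matching opus, as a simple proxy for the opus
        # with the most matching books.
        if (isbn_match or fuzzy_match) and len(opus["books"]) >= best_size:
            best, best_size = opus, len(opus["books"])
    if best is None:
        best = {"title": new_book["title"],
                "authors": set(new_book["authors"]),
                "books": []}
        opuses.append(best)
    best["books"].append(new_book)
    return best
```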

Ratings and Reviews The BookLens server allows any user to rate (assign a score from 0 to 5) or, separately, review (long-form text) any book. All ratings and reviews are stored internally as a relationship between a user and an opus. The client for a given rating can be determined by checking which client is associated with the user record. Ratings and reviews can only be entered by users who have opted into the BookLens service. The first time a user goes to rate or review, a brief pop-up is shown. This pop-up asks the user to opt in to having their data stored on a third-party server (BookLens) for use in personalization. Data is only sent to the BookLens server if the user approves.
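As a compact summary of the resources described above (and simplified in figure 3.3), the sketch below expresses the main records and their relationships as Python dataclasses. The field names are illustrative rather than the actual BookLens schema, and many details (opt-out flags, per-client keys, metadata strings) are reduced to their essentials.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Client:
    """One partner library system (or an artificial client for an imported dataset)."""
    name: str
    public_key: str
    private_key: str


@dataclass
class User:
    """One library patron, tied to a client and a client-specific identifier."""
    client: Client
    client_user_id: str          # e.g. a hashed barcode
    display_name: Optional[str] = None
    opted_in: bool = False


@dataclass
class Opus:
    """A single creative work; the unit of recommendation."""
    title: str
    authors: list[str] = field(default_factory=list)


@dataclass
class Book:
    """One edition of an item at one specific library, linked to exactly one opus."""
    client: Client
    client_item_id: str
    opus: Opus
    title: str
    isbn: Optional[str] = None


@dataclass
class Rating:
    """A user's score for an opus, stored internally on a normalized scale."""
    user: User
    opus: Opus
    value: float                 # normalized to [-1, +1] internally
```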

Prediction and Recommendation The BookLens API supports three main recommendation tasks: rating prediction, personalized item recommendation, and personalized item-association recommendation. If no user is present, the API can also generate non-personalized recommendations and predictions. These are generated using the LensKit [40] recommendation toolkit. By using the LensKit toolkit we can easily apply a large range of recommendation algorithms to power the BookLens system, and can easily change implementations as the system evolves. Further details about the BookLens API can be found in Appendix B. We currently use a simple item-to-item recommendation algorithm using a content-based similarity metric. This algorithm is neither novel nor of particular interest to this report. Ideally, this report would include a comparison of various algorithms over the BookLens dataset. As will be addressed later in this chapter, the number of ratings available, both per user and in relation to the size of our item domain, makes algorithm-centered evaluation difficult at this time. Therefore it is unlikely that an analysis of algorithms on our current dataset would be a good predictor of the best algorithm, either for the book domain in general or for the BookLens system in particular.
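For readers unfamiliar with the approach, the sketch below illustrates the general shape of item-to-item scoring with a content-based similarity metric: each opus is represented by a bag of metadata terms, similarity is cosine similarity over those term vectors, and a candidate is scored by its similarity to the items a user has rated, weighted by the ratings. This is an illustrative sketch only, with hypothetical data structures; the deployed system uses LensKit's implementation, not this code.

```python
import math
from collections import Counter


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def metadata_terms(opus: dict) -> Counter:
    """Bag of words over title, authors, and any extra metadata strings."""
    text = " ".join([opus.get("title", "")]
                    + opus.get("authors", [])
                    + opus.get("metadata", []))
    return Counter(text.lower().split())


def recommend(user_ratings: dict, catalog: dict, n: int = 10) -> list:
    """Score each unrated opus by its content similarity to the user's rated
    opuses, weighted by the ratings, and return the top-n opus ids."""
    vectors = {oid: metadata_terms(meta) for oid, meta in catalog.items()}
    scores = {}
    for oid in catalog:
        if oid in user_ratings:
            continue
        scores[oid] = sum(cosine(vectors[oid], vectors[rated]) * rating
                          for rated, rating in user_ratings.items())
    return sorted(scores, key=scores.get, reverse=True)[:n]


# Hypothetical example data.
catalog = {
    "hobbit": {"title": "The Hobbit", "authors": ["J. R. R. Tolkien"]},
    "fellowship": {"title": "The Fellowship of the Ring",
                   "authors": ["J. R. R. Tolkien"]},
    "dune": {"title": "Dune", "authors": ["Frank Herbert"]},
}
print(recommend({"hobbit": 1.0}, catalog, n=2))
```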

Figure 3.4: The architecture of the BookLens System

3.4.2 Structural Organization

So far we have only detailed one component of the BookLens system - the core server. Figure 3.4 shows the complete BookLens system architecture.

The core of the BookLens system design is the BookLens core server. The core server exposes a RESTful API that allows clients to manage any of the resources described above using standard HTTP requests. This API is only served over the HTTPS protocol and is further protected by a modified OAuth 1.0a authentication scheme [7]. The primary modification to traditional OAuth 1.0a is that clients can create users who will be automatically authorized for that client, and user authorizations do not expire and must be directly revoked. This proved necessary to provide a seamless interaction with users and avoid the traditional OAuth three-part “handshakes” in which the user repeatedly authorizes this interaction. This has been replaced with a one-time opt-in procedure.

Ideally, library systems would communicate directly with the BookLens API. Unfortunately, libraries often have little to no control over the inner workings of their software, at most being able to inject custom JavaScript into their webpages. As RESTful APIs are not necessarily “light-weight” and the OAuth authentication scheme relies on a private authorization token, this cannot be done with a JavaScript-only modification. Therefore, without programming an entire library catalog, a direct integration with library systems is impossible at most libraries.

Our solution to integrating with library partners is the bridge server. The bridge server is a small Java web server designed and developed by the University of Minnesota team, but owned and operated by each individual library. Each bridge server is customized to its library’s specific catalog software upon setup, and requires little-to-no maintenance. The core purpose of the bridge server is to manage the interaction between our OAuth authentication scheme and each library’s specific authorization scheme.

While born out of a technical necessity, the bridge server has proven to be a convenient design feature. The bridge server is an ideal location for all JavaScript and CSS customizations required to make BookLens feel native in each library’s interface. Furthermore, having each library own and operate their own bridge server gives each library ultimate control over their use of BookLens. Each library can independently choose which features they want to use and, if they see fit, disable BookLens features entirely without the intervention of the University of Minnesota team. The bridge server also serves as a convenient place to put library-facing administrative interfaces such as review moderation and catalog data synchronization.

Finally, the bridge server plays an important role in protecting patron privacy. All user interactions go through the bridge server before being submitted to the core BookLens server. This design makes the bridge server an ideal place for protecting patron anonymity. Currently, we only run user identifiers (barcodes) through a one-way hash to prevent barcodes from being stored with ratings. In principle, however, this could serve as a place for other data policies, such as perturbing patron ratings or even mild temporal randomization to protect against de-anonymization attacks.

An example of the BookLens interface at Saint Paul Public Library’s catalog is shown in figure 3.5. Not shown in this figure is the interface for viewing associated recommendations. Likewise, while the core server supports top-n recommendation, we do not currently have an interface to expose this feature to users. These interface modifications are designed to fit naturally into the host catalog interface. Users should think of BookLens as just another feature of their library, not an independent system.
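As an illustration of the anonymization step mentioned above, the sketch below one-way hashes a patron barcode, together with a per-library secret so that barcodes cannot simply be re-hashed and matched by an outsider, before it is sent to the core server. The use of SHA-256 and a keyed hash here is an assumption for illustration; the text above only states that a one-way hash is used.

```python
import hashlib
import hmac


def anonymize_barcode(barcode: str, library_secret: bytes) -> str:
    """Derive a stable, non-reversible user identifier from a patron barcode.

    A keyed hash (HMAC) is used so that someone who obtains the stored
    identifiers cannot recover barcodes by hashing guessed values without
    also knowing the library's secret.
    """
    return hmac.new(library_secret, barcode.encode("utf-8"),
                    hashlib.sha256).hexdigest()


# Hypothetical example: the same barcode always maps to the same identifier,
# so ratings can be associated with a user without storing the barcode itself.
secret = b"per-library-secret-kept-on-the-bridge-server"
print(anonymize_barcode("21234000123456", secret))
```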

Figure 3.5: The BookLens system (as deployed in one of our catalogs). BookLens features are highlighted. The highlighted feature on the top left is shown mid-rate. The stars interface and tooltip show what rating would be entered if the user were to click.

Figure 3.6: A time-line of the deployments of the BookLens system.

3.5 Case Study: Exploration of the BookLens Dataset

Since the first deployment of an early prototype of BookLens in 2011, BookLens has been deployed at up to seven library systems. Initially, only a beta version of BookLens (with only rating and prediction features) was deployed at Saint Paul Public Library. Through that deployment we found a partnership with MELSA, and after iterating on the design of BookLens and expanding our set of features, we then deployed at six of the other seven MELSA libraries. A timeline of these deployments can be seen in figure 3.6. Carver County Public Library, the eighth MELSA library, was set up for a public deployment; however, a persistent, county-level networking issue prevented this deployment. The networking issue prevented computers on the Carver County Library’s network from accessing the Carver bridge server, meaning that BookLens features would have been missing from the catalog when viewed on library-owned computers.

As can be seen, several libraries have decided to stop using BookLens. Both Hennepin and Saint Paul libraries have switched from their previous catalog software to the BiblioCommons catalog software, which has a partial overlap with BookLens features. When this switch was made, both libraries decided to switch to the BiblioCommons version of these features to avoid redundancy and patron confusion. Scott and Washington counties chose to remove their BookLens integrations following a breach of our core server in April 2017.⁷ This was, in part, a consequence of changing leadership and strategic priorities making BookLens no longer a worthwhile investment of time for these libraries. While the loss of participation from these libraries is regrettable, we see it as mostly a reflection of the political processes involved in the public libraries, rather than a direct reflection on the design of the BookLens system. Although we must admit that, to some degree, had our server provided more value to the libraries and their patrons, then more libraries might be participating to this day.

While the current version of BookLens has been in continuous deployment for more than 4 years, we still consider BookLens to be in a soft-launch state. As our usage data will show, only a small proportion of library patrons have opted to use any BookLens features. While a low usage rate is to be expected, we note that there has been no attempt at advertising BookLens to library patrons, or at informing library staff about this feature. While a centralized publicity campaign was planned, it was repeatedly delayed, and ultimately canceled once Hennepin left the BookLens system. Because of this lack of advertising most library patrons do not use, and may not even be aware of, BookLens features. Based on summer reading efforts run by the libraries focusing on gathering book reviews, we believe that users would be interested in this feature if advertising were pursued.

Due to this soft-launch state, we will not be exploring recommendation algorithms over the BookLens dataset, as might be conventional in recommender systems research. Instead we are going to focus on summarizing current usage of the BookLens system (as of June 27, 2017), and on measuring the effect of rating pooling on the amount of information available to each library. To our knowledge there has been no prior report on the effectiveness of rating pooling strategies.

⁷On April 6, 2017 we were notified that suspicious behavior was occurring on the main BookLens server. Records indicate that the suspicious behavior started on April 4, 2017. We do not currently know exactly what the cause of the intrusion was; however, the current belief is that someone took advantage of a vulnerability in an out-of-date software dependency. Due to the design of the BookLens system, no personally identifiable information could have been accessed (we store no personally identifiable information in the main recommender server). Furthermore, due to the nature of communication between the core server, bridge servers, and library servers, the breach at the main recommender server had no effect on the bridge servers or library servers. For safety we decommissioned the VM that ran the previous BookLens server, and set up a new BookLens server from source code and database backups on a newly provisioned VM.

If each library operated over an identical set of items then obviously rating pooling would meaningfully increase the amount of available information. However, we have observed that library catalogs only partially overlap, which might substantially limit the effectiveness of rating pooling. Therefore, we consider it an open question how effective rating pooling will be in this domain. Likewise, while it is obvious that rating pooling will be beneficial for small library partners, it is less obvious how much value large libraries will get out of this type of rating pooling.

3.5.1 Data on early use

Table 3.3 presents basic information about the early use of the BookLens system through the end of June 27, 2017. Table 3.3 is broken down into the various sources of information that are included in the BookLens system. This includes the seven libraries with historic or ongoing library deployments, and four datasets that have been imported into the system in an effort to reduce the cold-start problem. Each category of data is included with a subtotal row, and a total row is provided to describe the overall state of the BookLens system.

The imported datasets can be separated into the BookCrossing dataset, publicly available for research purposes [147], and three years of data from the Bookawocky program, which is a youth-focused summer reading effort in which children are encouraged to read books and record their reading by making a rating and review online. Bookawocky ratings, but not reviews, were imported under anonymous pseudonyms into the BookLens system with permission of MELSA, who had organized these efforts. BookCrossing ratings were entered into the system as another client with associated book records, with matching done directly through the opusing heuristics discussed previously. Implicit ratings in the BookCrossing dataset were assigned the value of the average rating from this dataset. While the BookCrossing dataset is quite large, as will be discussed shortly, we have not found it to overlap well with client libraries, substantially reducing its value as a cold-start mitigation tool.

Bookawocky ratings were associated with hand-typed title and author information. As these were typed by children and teenagers, we opted to use a separate scheme to match these records with existing book records. The final scheme was similar to our Lucene-based fuzzy matching system, but differently tuned, based on a small hand-matched subset of records.

[Table 3.3 appears here. For each data source it reports months of deployment, registered patrons, users, items, opuses, rated opuses, ratings, and reviews, with subtotals for the partner public libraries (Saint Paul, Ramsey County, Dakota County, Hennepin County, Anoka County, Scott County, Washington County) and for the imported datasets (Bookawocky 2014-2016 and BookCrossing), plus an overall total.]

Table 3.3: Overall usage statistics for the BookLens system, broken out by source of data. Patron count data is from the 2013 library survey data. (a) Users are counted as the number of accounts that have explicitly agreed to have rating data sent to BookLens and have entered at least one rating. (b) Rated Opuses only counts the number of opuses rated by patrons of that particular client. (c) Hennepin County Public Library is the only library that was able to import historic reviews from their old system into BookLens; the reported number is the total number of reviews. 5,662 of these reviews were entered after the introduction of BookLens.
Unlike BookCrossing, only Bookawocky items with a known opus match to a library-owned item were imported.

For each library partner, table 3.3 contains several pieces of information. First, the table shows the number of library patrons according to the 2013 IMLS library survey [124], as well as the number of these patrons who have chosen to use the BookLens service. Secondly, it shows how many discrete items (book records) the library has, how many opus records these books are organized into, and how many of these opuses are rated, considering only ratings from that library. Finally, the table shows the number of ratings and reviews. Hennepin County Library’s review count is larger than might be expected, as Hennepin County Library was able to import reviews from their previous review service, which no other library was able to do. Only 5,662 of those reviews were entered through the BookLens system.

From this table, we can see that, as of June 27, 2017, BookLens has had over four thousand users who have entered over twenty-nine thousand ratings on over 1.7 million possible opuses. Interestingly, when comparing the library sources with the Bookawocky datasets, we see that our two and a half years of soft-launch data is comparable to three summers’ worth of reading programs. This supports our assumption that significant improvements in data usage might be possible with advertising related to BookLens features. This is further backed up by noting that 5,842 users provided data to Bookawocky, while only 4,396 users provided data to BookLens, despite BookLens being available to a much larger population.

One important consideration when comparing usage rates between libraries is that not all BookLens libraries are of equal size. In particular, Hennepin County Public Library is our largest library (in fact it is one of the 25 largest libraries in the United States) and has more than twice the registered patrons of our next largest library and almost eight times the registered patrons of our smallest [124]. It is clear from table 3.3 that larger libraries tend to produce more ratings and reviews. That said, library size is unlikely to be entirely responsible for usage rates; this can be seen by comparing Saint Paul Public Library and Dakota County Public Library. Saint Paul Public Library has roughly the same number of patrons as Dakota County Public Library, but has slightly more than half the users, and less than a quarter of the total ratings. Likewise, Washington County Public Library has slightly more than half the patrons of Anoka County Library, but very nearly the same number of users. More interestingly though, Washington County patrons have made approximately 60% of the number of ratings that Anoka County patrons have, but Anoka County patrons have made approximately 60% of the number of reviews.

These differences between libraries point to different and complementary usage patterns at each library. The most likely cause of these differences is the differences in library catalog interfaces. Within our seven libraries, there are four different library catalogs, each of which provides a different user experience, likely affecting rates of online use. A careful study of these interfaces and of what factors have led to these differences in both rating and reviewing activity is future work.

One final note on table 3.3: we can see that most of the opuses in the BookLens system have never been rated. In part this is a side effect of the small number of actual ratings in the system.
This is also, in some part, a consequence of our decision to accept ratings on all cataloged library items. This means that there are a large number of reference items and cataloged government documents that are likely to never receive ratings in the BookLens system. Given that many opuses in BookLens may never be rated, the proportion of rated books may be a misleading estimate of how ratings are distributed over the set of items. Looking at table 3.3 we can see that on average a rated opus has only 1.26 ratings from BookLens users, and only 3.51 ratings if we look at the system as a whole. To better understand these numbers we plot a histogram of rating counts in figure 3.7.⁸ As expected, we see evidence of a strong power-law distribution, with most opuses having only one or two ratings, if they have any at all, while some opuses have hundreds or even thousands of ratings. When considering only BookLens ratings we see that only 600 (2%) of rated opuses have more than 3 ratings, and only 4 have more than 20.

⁸Due to data limitations, the data for figure 3.7 is based on the state of the database on July 10, 2018. Other than a longer period of data collection, the only meaningful difference from the data collected for the other analyses in this chapter (June 2017) is the inclusion of two “update” datasets from Hennepin and Saint Paul public libraries representing ratings gathered through BiblioCommons. These are not included as “BookLens ratings” in deference to interface and user-experience differences in these systems. We do not believe that this difference meaningfully affects these results, other than inflating the “all sources” estimate.

Figure 3.7: Number of opuses with each rating count. Note the log scale on the x-axis; this scale starts at 1. The two lines represent ratings collected only through BookLens, or from all sources.

Unsurprisingly, when we include all sources, we see first that many more opuses have been rated at all. Furthermore, those opuses that are rated tend to have more ratings on average, with 33,270 (24%) of rated opuses having more than 3 ratings and 6,605 (5%) having more than 20 ratings. While there is a common core of opuses with enough ratings that recommendation may be possible, these represent only a small handful of the most popular library items. It is unclear whether this rating-count curve is a necessary property of the library item domain, or a consequence of our prolonged soft-launch state.

3.5.2 Investigation of catalog overlap

One of the important features of the BookLens system is the opus abstraction. This allows multiple editions of one book, or one book owned by multiple libraries, to be treated as a single unit for the purpose of recommendation. Without some gold standard against which to compare, it is impossible to know how accurate our opusing algorithm is; however, this does not have to stop us from considering the value it adds to our system. Looking at table 3.3 we can see that most libraries have roughly the same number of opuses as books, indicating that multiple editions of one item are an uncommon occurrence within a library. That said, looking at the system as a whole we do see a substantial difference between the number of book records and the number of opus records, indicating that the primary value of the opus record is in combining book records across libraries.

Overlap     Sppl.   Ramsey  Dak.    Henn.   Anoka   Scott   Wash.   Avg.    Overall
Sppl.       100%    22.8%   19.9%   65.8%   19.6%   12.0%   17.3%   26.2%   73.5%
Ramsey      43.9%   100%    42.4%   69.3%   41.6%   26.0%   34.1%   42.9%   83.7%
Dakota      42.0%   46.4%   100%    67.4%   43.9%   27.9%   34.9%   43.7%   82.1%
Hennepin    19.4%   10.6%   9.4%    100%    9.2%    5.4%    7.5%    10.3%   23.4%
Anoka       46.0%   50.6%   48.9%   73.1%   100%    29.9%   38.1%   47.8%   87.3%
Scott       47.7%   53.6%   52.6%   72.8%   50.5%   100%    44.4%   53.6%   87.2%
Wash.       50.1%   51.3%   48.0%   73.8%   47.1%   32.5%   100%    50.5%   87.3%

Table 3.4: Measurement of catalog overlap. Note that library names have been abbreviated for space.

To better understand the properties of cross-library opusing and catalog overlap, we computed the catalog overlap between each pair of libraries. If $I_{lib1}$ and $I_{lib2}$ are the sets of opuses connected to books at libraries $lib1$ and $lib2$ respectively, we compute the catalog overlap of $lib1$ from $lib2$ as the fraction of opuses connected to library 1 that are also in library 2:

$$\mathrm{overlap}(lib1, lib2) = \frac{|I_{lib1} \cap I_{lib2}|}{|I_{lib1}|}$$

We can further reduce this to two major features. The first is the average catalog overlap:

$$\mathrm{averageOverlap}(lib) = \frac{\sum_{lib2 \neq lib \in Libraries} \mathrm{overlap}(lib, lib2)}{|Libraries| - 1}$$

The second is the overall catalog overlap, which is simply the overlap of a library with the union of all other libraries. These measurements can be seen in table 3.4.

As can be seen, there is substantial variance in overlap by library, with larger libraries, most notably Hennepin, tending to overlap less due simply to their larger overall catalog. The average overlap between any two libraries is 39.28%, and more than half of library pairs overlap less than 50%. By considering the overall column (the percent of opuses at a library that are in any other library) we see that most libraries have quite a high overall overlap, despite having a much smaller average overlap. With the exception of Hennepin (23.4%) and Saint Paul (73.5%), all libraries have an over 80% overall overlap, meaning that 4 out of 5 items in their catalog appear in another catalog, and will be subject to rating pooling. Certainly some of this is due to the large scale of Hennepin, but at most libraries we see an overall overlap 5 to 10% higher than that library’s Hennepin overlap. This suggests that the catalogs, while not fully independent, do show relative independence. Furthermore, with each additional library added to the system, we should expect to see increases in the overall overlap at most libraries.

For comparison, it is worth returning to the BookCrossing dataset. The BookCrossing dataset has one million ratings, which should make a significant dent in resolving the cold-start problem. However, the books in BookCrossing cover, on average, only 10.2% of the opuses in the public libraries. This overlap is much lower than that between two random libraries, likely due to the age of the BookCrossing dataset.
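The overlap measures above are straightforward to compute once each library's catalog is reduced to the set of opus identifiers connected to its book records. The sketch below shows one way to do so; the data structure (a dict mapping library names to sets of opus ids) is an assumption for illustration.

```python
def overlap(catalogs: dict, lib1: str, lib2: str) -> float:
    """Fraction of lib1's opuses that also appear at lib2."""
    a, b = catalogs[lib1], catalogs[lib2]
    return len(a & b) / len(a) if a else 0.0


def average_overlap(catalogs: dict, lib: str) -> float:
    """Mean pairwise overlap of lib with every other library."""
    others = [name for name in catalogs if name != lib]
    return sum(overlap(catalogs, lib, other) for other in others) / len(others)


def overall_overlap(catalogs: dict, lib: str) -> float:
    """Overlap of lib with the union of all other libraries' catalogs."""
    union_of_others = set().union(*(catalogs[name] for name in catalogs
                                    if name != lib))
    return len(catalogs[lib] & union_of_others) / len(catalogs[lib])


# Hypothetical example with toy opus-id sets.
catalogs = {
    "A": {1, 2, 3, 4},
    "B": {3, 4, 5},
    "C": {4, 6},
}
print(overlap(catalogs, "A", "B"))        # 0.5
print(average_overlap(catalogs, "A"))     # mean of 0.5 and 0.25
print(overall_overlap(catalogs, "A"))     # 2 of 4 opuses appear elsewhere
```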

3.5.3 What is the effect of rating pooling?

With this basic understanding of the data of the BookLens deployment, and of catalog overlap, we turn our focus to answering one of the primary questions of the BookLens design: “To what degree does rating pooling work to increase the number of ratings available to library partners?” The simplest way to answer this question is to compare each library with the aggregate size of BookLens (i.e., ratings in BookLens divided by ratings from one given library). However, this answer would ignore the degree to which library collections do or do not overlap, as well as user rating patterns relative to this overlap. Therefore, we will answer this question in two ways. First, we will compute what fraction of opuses have at least one rating (the 1-coverage) under various pooling scenarios. Then we will estimate the average number of extra ratings available every month under various pooling scenarios. These two measurements give us two separate ways of understanding the impact of rating pooling on each of our libraries.

1-Coverage Table 3.5 reports the 1-coverage (fraction of opuses at a library with at least one rating) under a variety of rating pooling scenarios. We choose 1-coverage for this metric as most modern recommendation algorithms require at least one rating for an item before it is modeled in the algorithm, and therefore possible to predict or recommend. Therefore, this measurement serves not only to estimate how many items gain ratings through pooling, but also as a direct measure of how many items can be included in the recommender. The partial-pooling column reports 1-coverage when only other library sources are considered, and full pooling reports both library sources and imported datasets.

Name         1-Coverage (no pooling)   1-Coverage (partial pooling)   1-Coverage (full pooling)
Saint Paul   0.27%                     2.34%                          12.41%
Ramsey       2.49%                     7.15%                          19.89%
Dakota       2.24%                     7.16%                          19.03%
Hennepin     0.91%                     1.39%                          6.54%
Anoka        0.50%                     7.32%                          19.22%
Scott        0.74%                     8.27%                          24.40%
Washington   0.38%                     7.15%                          23.33%

Table 3.5: Measurements of catalog 1-coverage (percent of opuses at each library with at least one rating) with and without pooled ratings for each library. Partial pooling only includes the library sources, whereas full pooling indicates all data sources.

The pooling strategy substantially increases the number of opuses that have ratings at any given library. The imported datasets further improve the 1-coverage, in part justifying the importance of these extra datasets. While the effect is smallest for Hennepin County Library, it should be noted that even Hennepin sees a 50% improvement in 1-coverage from library sources, and more than a 600% increase in 1-coverage when considering the full data pooling.
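Computing 1-coverage under each pooling scenario amounts to counting how many of a library's opuses appear in a given pool of rated opuses. A minimal sketch, assuming each library's catalog and each rating source are available as sets of opus ids:

```python
def one_coverage(catalog: set, rated_opuses: set) -> float:
    """Fraction of a library's opuses that have at least one rating in the pool."""
    return len(catalog & rated_opuses) / len(catalog)


def coverage_report(catalogs: dict, library_ratings: dict,
                    imported_ratings: set) -> dict:
    """1-coverage per library under no pooling, partial pooling, and full pooling."""
    report = {}
    for lib, catalog in catalogs.items():
        own = library_ratings[lib]
        # Partial pooling: rated opuses from all partner libraries.
        partial = set().union(*library_ratings.values())
        # Full pooling: partner libraries plus imported datasets.
        full = partial | imported_ratings
        report[lib] = {
            "no pooling": one_coverage(catalog, own),
            "partial pooling": one_coverage(catalog, partial),
            "full pooling": one_coverage(catalog, full),
        }
    return report
```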

Ratings Per Month The second way we will evaluate rating pooling is by considering the new ratings available to a library system each month. Figure 3.8 shows the monthly ratings for each library, considering only the ratings made by patrons of that library or the ratings made by patrons of any library. Simply looking at the figure, we can see that most libraries see a massive increase in the number of ratings added to items in their catalog each month. This is especially true for the months when Hennepin County Library was part of the BookLens service.

To quantify these results we will look at the average monthly number of ratings, and the percent increase from pooling. Table 3.6 reports the average monthly ratings on items in the catalog of each library with and without pooling, as well as the average percent increase in monthly ratings. The average monthly ratings is taken as the average of the data plotted in figure 3.8. For each month we can also compute the percent increase in ratings, (pooled − notPooled) / notPooled, with table 3.6 reporting the average monthly percent increase. As the first and last calendar months of ratings for any catalog are likely to be incomplete, we exclude these months from the analysis. For the basic pooling (including Hennepin) rows we only report months where Hennepin was active, as it had a very large effect on rating pooling. For each value we also report a confidence interval, to allow us to better understand how confident we can be about this average given the small number of months of data available. As monthly rating rates do not follow a normal distribution, we turn to bootstrapping-based confidence intervals. In particular, we compute the 95% adjusted bootstrap percentile (BCa) confidence interval over 100,000 bootstrapped samples using the boot package in R [20, 31].
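The analysis above computes BCa intervals with R's boot package; as a simpler illustration of the same idea, the sketch below computes a plain percentile bootstrap confidence interval for a mean monthly rating count in Python. The monthly counts used here are hypothetical.

```python
import random
import statistics


def bootstrap_ci(values, stat=statistics.mean, n_boot=100_000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for a statistic of `values`.

    Note: this is the simple percentile method, not the bias-corrected and
    accelerated (BCa) interval used in the analysis above.
    """
    rng = random.Random(seed)
    boot_stats = sorted(stat(rng.choices(values, k=len(values)))
                        for _ in range(n_boot))
    lo = boot_stats[int((alpha / 2) * n_boot)]
    hi = boot_stats[int((1 - alpha / 2) * n_boot) - 1]
    return stat(values), (lo, hi)


# Hypothetical monthly rating counts for one library, without pooling.
monthly_ratings = [31, 44, 25, 38, 52, 29, 41, 36, 47, 33, 28, 40]
mean, (low, high) = bootstrap_ci(monthly_ratings, n_boot=10_000)
print(f"mean = {mean:.1f}, 95% CI = ({low:.1f}, {high:.1f})")
```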

[Figure 3.8 here: one panel per library (Anoka, Dakota, Hennepin, Ramsey, Saint Paul, Scott, and Washington County), plotting monthly ratings (0-2000) against month (2015-2017), with one line for ratings with pooling and one for ratings without pooling.]

Figure 3.8: Number of monthly ratings per client with and without rating pooling.

Name        ratings (no pooling)    ratings (pooling, Henn.)   % increase (Henn.)   ratings (pooling, no Henn.)   % increase (no Henn.)
Saint Paul  38.2 (30.6, 52.1)       719.0 (635.3, 787.9)       2691% (1648, 6300)   197.6 (174.1, 221.4)          1046% (601, 2544)
Ramsey      189.0 (160.2, 218.0)    1143.8 (1023.2, 1246.1)    525% (429, 615)      365.3 (322.4, 406.8)          111% (92, 133)
Dakota      155.7 (124.0, 199.6)    1077.3 (967.8, 1161.0)     741% (551, 1016)     331.7 (289.4, 377.4)          184% (137, 303)
Hennepin    1157.3 (942.5, 1279.8)  1538.4 (1346.2, 1660.5)    39% (29, 68)         -                             -
Anoka       37.5 (30.3, 46.6)       1044.4 (966.5, 1094.5)     4172% (3117, 5516)   278.1 (238.4, 326.0)          877% (648, 1372)
Scott       47.4 (29.7, 91.4)       755.1 (703.0, 804.3)       1359% (912, 1723)    208.5 (180.8, 255.1)          982% (611, 1539)
Washington  28.3 (18.3, 56.7)       879.0 (819.4, 978.8)       4309% (293, 6443)    235.5 (205.4, 276.8)          1595% (1134, 2185)

Table 3.6: Average increase in monthly ratings at each library, with 95% confidence intervals computed through bootstrapping. Results are shown with and without Hennepin County ratings.

All libraries see a substantial increase in average monthly ratings, with our smallest libraries receiving the largest increase (at least 3000%). Even for Hennepin, our most active rater, we see a nice improvement of at least 29%, which would be an additional 336 ratings each month.

As can be seen in table 3.6, Hennepin is by far our most active rater. Therefore the question could be asked: what amount of these reported gains comes through partnership with Hennepin, as opposed to rating pooling more generally? To answer this we also computed monthly ratings without Hennepin (for all months). Looking at this data we see that each library is still receiving a good deal more ratings. The smallest library is getting at least 10 times more ratings, and the larger libraries are seeing at least 90 to 100% improvements in the number of ratings. Doubling the number of available ratings will substantially affect the ability of a library to collect sufficient ratings to maintain a collaborative filtering algorithm.

Based on these analyses we see that most libraries have a meaningful overlap with other libraries, and a high overlap with the library network as a whole. Ratings from other libraries substantially increase both the number of opuses with any rating, and the total number of usable ratings available at each system. Therefore we conclude that rating pooling provides a substantial benefit in our system.

3.5.4 Comparison of BookLens and MovieLens Rating Data

The final analysis we will do on the early usage of BookLens is to compare it against MovieLens. There is no “typical” recommender system, but due to its role in the recommender systems community, MovieLens is as good an example as any of the standard data properties and assumptions that are made in collaborative filtering research. Therefore, by identifying ways in which BookLens and MovieLens differ, we are also identifying common assumptions about the properties of rating data that may not be true in general. In particular, we are going to look into two properties: the number of ratings made by users, and the distribution of ratings made by users.

Ratings per user For a fair comparison we do not use MovieLens ratings from a public dataset, as these datasets are filtered to only users with a sufficient number of ratings. Instead we will consider only users in MovieLens who had their first log-in between the beginning of March 2016 and the beginning of March 2017. During this time MovieLens did not have a mandatory number of ratings during the new-user sign-up procedure. For BookLens we will consider patrons who opted in to having data stored at the U of M BookLens server during the same time frame. We use this rather than first sign-in or account creation as this is the first time the user interacted with a BookLens widget, and it often coincides with their first rating or review. The end of this time range was chosen so that users have had at least three months to use the system and accrue ratings. In this one-year period BookLens had 1,237 users enter 3,819 ratings, while 15,207 MovieLens users entered 1,905,538 ratings.


Figure 3.9: Number of ratings made by users of MovieLens and BookLens. Note that the x-axis is on a log scale, and rating counts have been bucketed logarithmically.

The number of ratings made by these users is shown in figure 3.9. Note that the x-axis on this plot is on a log scale, and points represent logarithmic buckets. The first point represents users with 1 rating, the second users with 2 or 3 ratings, the third users with 4-7 ratings, and so on. While this visualization distorts some properties (both MovieLens and BookLens have 1 rating as the most common number of ratings), it also makes the broader trends visible and allows better comparisons.

In both MovieLens and BookLens the most common number of ratings in this analysis is 1. Given the shape of these distributions it is likely that the true most common value is 0 ratings; however, it is impossible to track such users in BookLens, so 0-rating users were not considered in this analysis. The average number of ratings for a BookLens user in this time range is 3.1 ratings, while the average number of ratings for a MovieLens user is 125.3. While the average is a poor descriptor of skewed distributions, this does serve to illustrate the scale of the differences between these distributions. BookLens users are more likely to have 4 ratings or fewer, while MovieLens users are more likely to have 5 or more ratings.

There are various reasons that these distributions might differ. Books, for example, often take much longer to read than a movie does to watch. Likewise, the focus of the MovieLens system is on rating movies and receiving recommendations, while the focus of the BookLens system is on identifying which books that you have not read might be available at the library. Finally, as mentioned earlier, the BookLens system has no particular advertising requesting users to enter historic ratings, while the MovieLens system makes it quite clear that adding these ratings will benefit the user. None of these reasons for smaller rating counts is necessarily unique to BookLens; e-commerce systems, for example, may frequently operate in slow-to-consume domains, with rating, especially historic rating, being a rare event at best. Since the majority of algorithm development is done on datasets that, like MovieLens, have a minimum number of ratings per user and a large average number of ratings, we must wonder if standard algorithms will behave reasonably when only given 3 or 4 ratings. We consider this question in depth in chapter 4.
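The logarithmic bucketing used for figure 3.9 groups users by powers of two of their rating count (1, 2-3, 4-7, 8-15, and so on). A minimal sketch of that grouping, assuming a list of per-user rating counts is already available:

```python
import math
from collections import Counter


def log_bucket(count: int) -> range:
    """Map a rating count to its power-of-two bucket: 1, 2-3, 4-7, 8-15, ..."""
    low = 2 ** int(math.log2(count))
    return range(low, 2 * low)


def bucketed_distribution(ratings_per_user: list) -> dict:
    """Probability that a user falls into each logarithmic bucket."""
    buckets = Counter((log_bucket(c).start, log_bucket(c).stop - 1)
                      for c in ratings_per_user if c >= 1)
    total = sum(buckets.values())
    return {b: n / total for b, n in sorted(buckets.items())}


# Hypothetical per-user rating counts.
print(bucketed_distribution([1, 1, 2, 3, 4, 6, 9, 17, 130]))
# e.g. {(1, 1): 0.22..., (2, 3): 0.22..., (4, 7): 0.22..., (8, 15): 0.11..., ...}
```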

How do users use the rating scale? The second comparison we wish to make is in how users of these two systems use the rating scale. The properties of ratings themselves are an often-overlooked feature of a system that can substantially affect the quality of recommendations. When designing BookLens we opted to store all ratings normalized to the −1 to +1 range, giving us the freedom to use whatever rating scale we want.


Figure 3.10: Distribution of ratings in MovieLens and BookLens.

However, due to our familiarity with the interface and its commonness, we opted for a five-star rating interface with half-star increments. The only major difference between the rating scales of MovieLens and BookLens is that the BookLens interface allows a 0-star rating, while the MovieLens scale ends at 0.5 stars. Beyond that, the two interfaces are effectively the same, differing only in color and icon choices. That said, we have seen that other contextual changes, as well as differences in the domain, may lead to different rating properties.

The rating distributions for BookLens and MovieLens are shown in figure 3.10. The MovieLens rating distribution is based on the MovieLens 20M rating dataset. If only because BookLens allows 0-value ratings, we should expect these distributions to be different. However, even if we remove 0-valued ratings, we see that the rest of the scale is used statistically significantly differently as well (p < 2.2e−16, chi-squared test). Looking qualitatively at these distributions we see some similarities and differences. The BookLens rating distribution appears to exhibit a strong positivity bias. The average BookLens rating is 3.99, whereas the average MovieLens rating is 3.53, a difference of means of almost half a star. Interestingly, the positivity bias does not lead to a reduced standard deviation in the rating distribution, despite the ratings appearing visually more concentrated. The standard deviation of MovieLens ratings is 1.05, whereas the standard deviation of BookLens ratings is 1.18. Since the standard deviation also represents the RMSE of an optimal global baseline recommender, this suggests that BookLens ratings may be harder to predict. On the other hand, the concentration of ratings around 5 stars may serve to make BookLens easier to recommend for.

Given these differences, we must ask: is a five-star rating scale right for BookLens? Perhaps an asymmetric scale would be better, given the stronger positivity bias in our data. To answer this question we will first need an understanding of the tradeoffs inherent in our choice of rating scale, and a means of measuring these tradeoffs empirically.
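The chi-squared comparison reported above can be reproduced in outline from per-value rating counts. The sketch below, using scipy, builds a 2 × k contingency table of rating-value counts for the two systems (with 0-valued ratings already removed) and tests whether the two systems use the scale differently. The counts shown are hypothetical placeholders, not the BookLens or MovieLens data.

```python
from scipy.stats import chi2_contingency

# Hypothetical counts of ratings at each half-star value from 0.5 to 5.0,
# after removing BookLens's 0-valued ratings.
rating_values = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0]
booklens_counts = [120, 250, 180, 400, 350, 1500, 1400, 4200, 2600, 7500]
movielens_counts = [320_000, 680_000, 540_000, 1_400_000, 1_100_000,
                    4_300_000, 2_900_000, 5_600_000, 1_700_000, 2_900_000]

# Rows: systems; columns: rating values.
table = [booklens_counts, movielens_counts]
chi2, p_value, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.1f}, dof = {dof}, p = {p_value:.3g}")
```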

3.6 Conclusions and Future work

In this chapter we have presented the motivations, design, and an analysis of the use of the BookLens system. The ultimate goal of the BookLens system is to bring collaborative filtering recommendation to our library partners. While doing so involved bridging many technical and social gaps, this is typical for any collaborative project of this scale. What is novel about BookLens is the federated approach for bringing recommendation algorithms to locally bounded, inherently small, community partners - in this case libraries. Our usage has not been enough to fully solve the cold-start problem inherent to new systems, and in that way prove our solution. Nevertheless, through our system we were able to show the potential of data pooling to mitigate the inherent issues of size faced by our library partners. Our measurements show that partners big and small have many more ratings available on items in their catalog with rating pooling, that many more items are rated when rating pooling is used, and that this benefit should grow as more libraries are pooled together, leading to even further overlap between a library and the pooled library catalog.

Ultimately, the BookLens system, its deployment, and our study of the federated design are not without limitations. While we would like to be able to claim that this system design would work for any library, we have only deployed with seven library systems. These systems only expose us to a sample of the possible variety of technology combinations and possible technical constraints. Likewise, while our libraries are small (some more so than others), they are all in the Minneapolis/Saint Paul metropolitan area, and are therefore not nearly as small as the large proliferation of rural library systems. Therefore, while we have done our best to understand and design for the library domain as represented by our partners, our perspective may be skewed towards the concerns of larger, more urban libraries.

One of the major factors that might have limited the success of BookLens was our slow development and deployment schedule. Our design process was highly collaborative and iterative. While this helps us be confident that we built the right system, and that we built it the right way, it also led to a relatively slow release cycle. When we started the BookLens project in 2008, social ratings, reviews, and recommendations around books were uncommon features. However, as we were building and improving our system and its set of features, so were services like GoodReads and LibraryThing. While the BookLens service is still working to reach a full deployment, these services have grown greatly in popularity to better fill the gap in social features around book reading, reviewing, and recommending [25, 121]. Therefore, there is much less of a niche for BookLens now than when we started the system.

While some of the factors that limited the success of the BookLens system fell within the realm of computer science and software engineering practices, some did not, and fell outside of the control of the BookLens team. Most notable here was our inability to deploy at Carver County due to a minor network configuration issue that neither the BookLens team nor the library itself had the power to change. Likewise, a major reason that Hennepin County Public Library stopped using BookLens was their placement within county government. As far as we are aware, a major deciding factor toward Hennepin switching from a self-built library catalog to BiblioCommons (and part of why BookLens wasn’t considered during this transition) was a county-level ban on the use of custom software. As libraries operate as a sub-unit of their county-level government, they can be affected by budgeting and technical constraints that they have little say in. Even with full cooperation and enthusiasm from our library partners it is possible that outside forces might block the deployment of a system like BookLens. Likewise, political issues, such as changes in leadership, funding, and priorities for the libraries, can be hard to predict and can strongly affect the momentum of long-term collaborative projects such as this.

Despite these limitations, and the loss of the data provided by several of our library partners, BookLens remains in active deployment at three libraries. This is a clear indication that our partners value the BookLens system and the services it provides to their patrons. Current work on the BookLens system is focused on maintenance, and on seeking collaboration with industry partners such as BiblioCommons. Both Saint Paul and Hennepin County Libraries have negotiated with BiblioCommons to seek a data sharing agreement so that data generated through their new catalogs can continue being part of the BookLens research agenda, even if they no longer gain immediate benefit from these ratings through BookLens system features.

Separate from the value of BookLens-the-system to our library partners are the scientific contributions of the BookLens project. Through our collaborative process we have identified many constraints that are likely general to the library domain, and that may serve to guide researchers in similar small community domains as well. Likewise, our design, both specifically as presented in section 3.4 and more generally the federated recommender design, can serve as inspiration to recommender system practitioners designing for similar domains. Our results show that this design meaningfully improves the available rating data at libraries big and small, and our measurements allow for better understanding of the catalog pooling and rating pooling properties of such a federated system.

Despite the increased popularity of social reviewing of books, much work remains to be done in the book domain in general, and the library domain more specifically. Our current strategy for identifying works to be merged into one opus relies on metadata-based heuristics. Future work should better explore the question of opus identification, both as a special case of the more general duplicate record identification problem, and also as a recommendation question in its own right. Patterns of rating data might serve as an excellent indicator of both opuses that should be considered for merging, and also books within an opus that should be considered for separation.

Furthermore, the opusing strategy is only one of possibly many layers of hierarchical item representation that are possible. NoveList, a popular service that makes non-personalized “read-alike” recommendations based on expert recommendations, makes recommendations for items at a book, series, and author level. Methods of representing this information within a personalized recommendation algorithm, and intelligent heuristics for when to recommend items at aggregate levels, would be an interesting direction for future exploration.
Additionally, while it is known that readers will tend to focus on specific series or authors at once, it is an open question whether similarly strong consumption patterns can be detected, and leveraged, in other item domains.

Early in the design of BookLens we made the decision to focus only on explicit, rating-based preference information. It is quite likely that a system built upon implicit signals, namely circulation or navigational logs, would have been able to capture a higher volume of usable information. It is important to remember, however, that libraries delete this information for a reason. It is a core value of the libraries we worked with that a patron has the right to decide how their personal information is used. Therefore, no system in which circulation or navigation logs are retained or transferred can be possible. That said, there is no inherent reason that this information must be stored persistently and identifiably. Future work should explore algorithms that allow training and maintaining high-quality recommender system models without persistent storage of non-anonymized training data. Such an algorithm could create an interesting point of trade-off between privacy and personalization.

Finally, the library domain has the interesting property that we have a ready supply of human experts (librarians) that our algorithm can learn from when making recommendations. How best to incorporate librarians into recommendations is an open question. Future work should explore how best to allow librarians editorial control over algorithmic recommendations, how best our algorithms can support librarians in face-to-face recommendations, and when to recommend that a user shift from automatic to manual recommendation. While framed within the domain of library item recommendation, these questions are relevant to any system in which human and algorithmic experts can collaborate to make recommendations.

Chapter 4

Understanding Recommender Behavior for Users With Few Ratings

In the last chapter we showed that BookLens users tend to rate many fewer items than MovieLens users. The average BookLens user has a number of ratings that you would only expect to see in a system like MovieLens during a user’s first visit. While there is extensive research on what algorithms perform best for users typical of systems like MovieLens, those with many ratings, much less work has gone into understanding what works best for users typical of a system like BookLens. This is problematic, as every user of every system has a small profile at some point, and without evaluation we can’t know how an algorithm will behave when given limited information about a user. We explore this issue in depth in this chapter and seek to answer the question: how do different algorithms behave for users with few ratings, and why?

As recommendation algorithms control ever larger amounts of what is shown to users, these algorithms have come to be very important to the user experience of a system. If an algorithm provides good recommendations it can lead to a great user experience. When the algorithm fails, however, a user’s experience will suffer, which may cause them to leave the system. This is particularly important for new users who are deciding whether the system provides enough value to use again. Without much information about the user, the algorithm has to provide recommendations that are not only accurate, but contain the perfect amount of novelty and diversity. Too many novel items and the user will have no way of knowing that the recommendations are worthy. Not enough novel items and the user may see no value in the system and leave.

Past work on the new-user experience in recommender systems has tended to focus on the “cold start problem”. Cold start refers to the problem that no algorithm can make highly personal recommendations with very limited amounts of information. Unfortunately this means that, regardless of the algorithm, no recommender system can make high quality recommendations for users who are still joining the system. The most studied strategy for gathering more information for new users is to include a new-user rating survey. These can take several forms, but the basic principle is to not allow users to receive recommendations until the user has provided enough information. Historically this has been in the form of a series of pages presenting users with items to rate until some fixed number of ratings has been reached. That said, surveys where users choose the movies to rate have shown promise [83], as have interfaces where groups of items are compared to build an understanding of user tastes [22, 77]. Whatever the approach, the idea is that once users leave the new-user survey their ratings will be “good enough” and the behavior of an algorithm for new users will be irrelevant. For a good discussion of the prior work on ratings-based new-user surveys see the 2011 report on bootstrapping in recommender systems by Golbandi et al. [46].

While past work has carefully covered new-user rating surveys, it has not carefully evaluated which algorithms are best at recommending to new users. New-user surveys sidestep the need to understand how algorithms behave for new users; if the survey works well, the users will have typical algorithm behavior. Given this, it is not surprising that systems like MovieLens, where a user’s goal is interacting with a recommender, choose to use new-user surveys. Likewise, most research datasets maintain a minimum number of ratings per user. As these are the major test-beds of research, it is possible that there are significant differences in how algorithms behave for users with few ratings that have not been previously observed.

Unfortunately, not all systems can use a new-user survey or guarantee a minimum number of ratings. In systems like BookLens, or in e-commerce, it is not reasonable to ask a user to take time away from their goal (finding items to borrow or purchase) to describe themselves to the system, and it would be foolish to only provide personalization to those rare users who choose to enter ratings. Because of this, these systems tend to make heavy use of IP-based location and demographic information, as well as implicit feedback, in an effort to learn as much as possible from what little information about the user is available. Some systems will even go as far as to purchase user profile information from third-party data vendors that track users across many services (often without the explicit permission of the user). Given that most recommender systems are deployed in systems like this, not in the dedicated research recommenders where research is typically done, it is all the more important to understand how algorithms behave when given few ratings.
This brings us to our major goal for this chapter: to understand how collaborative filtering algorithms behave for users with few ratings, and why. We will first explore evaluation methodologies that can be used to answer this question. Then we will provide a careful analysis of the performance of three common algorithms for new users on the MovieLens 1M dataset. We will be answering this question in two parts.

4.1 Evaluation Methodologies

The ideal way to study algorithm behavior for users with few ratings would be a live study conducted on real people who have only entered a few ratings, such as new users of a recommender system. As is often the case with online experiments, this might be suitable for comparing a limited number of algorithms and dataset sizes, but it is not scalable for comparing many algorithms and understanding how algorithm behavior changes as users get more ratings. Therefore we will develop an offline evaluation methodology, in particular one that is compatible with the large range of traditional evaluation metrics described in section 2.9.

Before describing and comparing different methodologies, we should first establish goals. Different systems will have different distributions of user profile sizes, both for current users and new users. As we seek general knowledge about algorithm behavior, we do not want to assume any single profile size for cold start. Therefore, we seek an evaluation that allows us not only to compare algorithms for a single profile size, but to understand how algorithm behavior changes across multiple profile sizes. This should allow fair comparison between any two algorithms at any given profile size, and between multiple profile sizes for any given algorithm. Our ultimate goal will be to provide plots of metric value against profile size across a range of algorithms and metrics.

4.1.1 Past Approaches

To the best of our knowledge only four other researchers have published research with offline evaluations comparable to what we seek. Each of these reports presents a figure like the one we wish to generate, plotting some metric against profile size. Among these reports only one puts any focus on the question of how best to perform such an analysis. The rest do not focus on this evaluation, using it as motivation for some broader result. The 2011 report on adaptive bootstrapping methods for recommender systems by Golbandi et al. [46] is a clear example of this: RMSE is plotted against profile size but no description of how the plot was generated is ever provided.

Of the four published reports, possibly the most similar is the 2012 report by Cremonesi et al. [28]. In this work the authors split their dataset with 70% of users set aside for training and 30% of users for testing. Of those test users, 30% of their ratings are held back as a probe set. No details are given about how the remaining ratings are used to simulate different profile sizes. Given this lack of detail it is possible that their approach is very similar to our method; it is likewise possible that there are substantial differences. In this light we seek a well described and repeatable methodology so that this type of evaluation can be performed by others in the future.

The other major approach to generating an understanding of how metrics operate over time is referred to as temporal evaluation. Temporal evaluation attempts to replicate the natural processes that occurred in a system such that (say) the 50th rating by a user might be predicted based only on the 49 ratings preceding it. This approach was taken by Drenner et al. [33], who created a graph of prediction accuracy by taking test users and, for each user and each rating (in order), predicting the next 50 ratings using only the preceding ratings. Since the algorithm was not listed for this evaluation it is unclear whether this process involved retraining the algorithm for each user and profile size.

A more thorough version of this analysis is provided in the 2010 report by Lathia et al. [72]. In this work the authors describe a system oriented approach to temporal evaluation. The key to this is a sliding time window. The analysis starts at some time t, and an algorithm is trained using only those ratings made before time t. For each user, a set of predicted ratings and recommendations can be generated for time t, which can be evaluated using any rating that occurred after time t. Then the time t is advanced by a standard amount, say 1 week, and the process repeats. While this process is focused on the behavior of a system as a whole, by taking predictions or rankings from users when they happened to have a given number of ratings, a profile size comparison can be made.

The temporal evaluation procedure from Lathia et al. is well suited to understanding the total of many effects over the life of a system. It is less well suited for understanding an algorithm in isolation. The methodology was designed with specific temporal metrics in mind, such as the temporal diversity of recommendations (how stable or dynamic a top-n list is). For traditional metrics, however, this evaluation may suffer biases based on ordering effects of ratings specific to one dataset or service.
While it is unclear if the Netflix dataset used in their paper has such order effects, the dataset we will use (MovieLens) likely does have these effects due to the use of a new user survey. Furthermore, as we will discuss shortly, taking all unrated items as a test set can lead to biases in the evaluation when looking at some metrics over profile size.

Ultimately, none of these approaches meet our goals. Most of these approaches lack the detail needed for complete replication. The exception, temporal evaluation, is a plausible candidate, but its goals differ from ours. Therefore we will explore options from first principles, with an eye towards carefully documenting our methodology for future use. Before describing our methodology, however, we will first describe a simpler methodology. This methodology is well suited to current algorithm evaluation toolkits, but has an important drawback. Understanding this issue will help clarify the details in the methodology we will ultimately choose to use.

4.1.2 Iterated Retain-n method

The natural way to understand the behavior of a recommender system for a given profile size is the retain-n evaluation methodology. In a traditional k-fold crossfold evaluation, as described in Section 2.9, the set of users is split into k groups of users. In each fold of the evaluation, one group of users is designated as test users and the rest are training users. Every rating from training users is retained for training. There are various ways to split each test user's ratings into the training and test set. In a retain-n evaluation a specific number (n) of each test user's ratings are retained in the training set, with the rest being used in the test set. Once k pairs of train and test rating sets are created, k iterations of the algorithm are trained (one with each train set) and then evaluated using standard offline metrics and the test ratings. The process of generating these train and test sets is given in pseudocode in Algorithm 1.

Algorithm 1 Retain-n crossfolding strategy. Returns a list of pairs of train and test rating sets for use in a standard offline analysis.

procedure Retain-n-crossfold(Ratings, Users, k, n)
    randomly split Users into k groups: U1, U2, ..., Uk
    for i in 1...k do
        Train_i <- {}
        Test_i <- {}
        for user u in Users do
            Let R_u be the ratings by user u
            if u in U_i then
                Let R_u^Train contain n random ratings from R_u
                Let R_u^Test contain the remaining ratings from R_u
                Train_i <- Train_i ∪ R_u^Train
                Test_i <- Test_i ∪ R_u^Test
            else
                Train_i <- Train_i ∪ R_u
    return [(Train_1, Test_1), (Train_2, Test_2), ..., (Train_k, Test_k)]
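To make the pseudocode concrete, the following is a minimal Python sketch of the retain-n crossfolding strategy. The function name and the in-memory data layout (a dictionary mapping each user id to a list of (item, rating) pairs) are our own illustration, not part of any toolkit.

import random

def retain_n_crossfold(ratings_by_user, k, n, seed=0):
    # Sketch of Algorithm 1: returns k (train, test) pairs, each a list of
    # (user, item, rating) triples.
    rng = random.Random(seed)
    users = list(ratings_by_user)
    rng.shuffle(users)
    groups = [set(users[j::k]) for j in range(k)]  # k roughly equal user groups
    folds = []
    for test_users in groups:
        train, test = [], []
        for u, r_u in ratings_by_user.items():
            triples = [(u, item, value) for item, value in r_u]
            if u in test_users:
                rng.shuffle(triples)
                train.extend(triples[:n])  # retain n random ratings for training
                test.extend(triples[n:])   # remaining ratings become test ratings
            else:
                train.extend(triples)      # training users keep every rating
        folds.append((train, test))
    return folds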

For example, to evaluate how the item-item and user-user algorithms perform on the MovieLens dataset for users with only seven ratings we could create a retain-7 evaluation. Each test user would have seven training ratings, with the rest of their ratings being test ratings. Since the first seven movies rated on the MovieLens system will be highly dependent on the new-user survey shown, we pick the seven items to retain randomly. This helps our results generalize to circumstances other than MovieLens under its historic start-up surveys.

If the retain-n evaluation is the natural approach for any specific profile size n, then to look at a range of profile sizes we can use the iterated-retain-n evaluation. The iterated-retain-n strategy simply uses the retain-n evaluation methodology to evaluate at each profile size we are interested in. As many evaluation toolkits have built-in support for a retain-n crossfolding strategy, this approach is easy to implement.

[Figure 4.1 panels: RMSE, nDCG, MAP, and Precision plotted against the number of retained ratings.]

Figure 4.1: The iterated-retain-n methodology applied to the item baseline in the MovieLens 1m dataset. Each metric is shown with the average user value and a 95% confidence interval, based on a normal approximation.

While simple to understand and implement, the iterated retain-n methodology is not perfect. While the results for any particular profile size are sound, and can be safely compared, it may not be reasonable to compare values between profile sizes. Due to their construction, the size of the test set (and indeed, the test set itself) varies between the evaluations for each profile size. We have observed in practice that some metrics are meaningfully impacted by changes in the size of the test set. While the size of this effect varies widely between metric and dataset, it is still problematic.

Figure 4.1 shows the result of using the iterated retain-n methodology to evaluate the item baseline algorithm on MovieLens 1m. The item baseline algorithm has no user personalization and therefore should be unaffected by more ratings from any given user. Therefore, if we see any trends over profile size in this evaluation, it is an indication of a bias inherent to our methodology. Under the iterated retain-n methodology we see substantial changes in reported performance with more ratings on several metrics (nDCG, MAP, Precision). These effects are not uniform, but vary metric-by-metric, with RMSE and nDCG showing inappropriate improvements, while MAP and Precision show inappropriate performance loss. While these effects are small in magnitude, they are larger than can be explained by random noise and show a consistent directional trend. As these biases might lead to incorrect results, or inaccurate claims of statistical significance, this behavior is not acceptable.

Algorithm 2 Subsetting Retain-n crossfolding strategy. Returns several lists of pairs of train and test rating sets for use in a standard offline analysis, one for each simulated profile size.

procedure Subsetting-Retain-n-crossfolds(Ratings, Users, k, n)
    maxCrossfolds <- Retain-n-crossfold(Ratings, Users, k, n)
    for i in 1...n do
        for j in 1...k do
            Train_j, Test_j <- maxCrossfolds[j]
            Let Train_j^i be a subset of Train_j where each test user only retains i random training ratings
        Crossfolds_i <- [(Train_1^i, Test_1), (Train_2^i, Test_2), ..., (Train_k^i, Test_k)]
    return Crossfolds_1, Crossfolds_2, ..., Crossfolds_n

4.1.3 Subsetting Retain-n method

Ultimately, the problem with the iterated retain-n method is that the test set was not held constant. For two evaluation results to be comparable, their test sets should not differ in any systematic way. Ideally, the test set between any two points of comparison should be identical. The subsetting retain-n method solves this problem by holding the test set constant across all profile sizes.

This leads us to the subsetting retain-n method seen in Algorithm 2. The subsetting method begins similarly to the iterated method: the normal retain-n crossfolding strategy is used to create train and test splits for the maximum profile size. However, for any smaller profile size, instead of creating a new train and test split, the maximum training data is simply reduced so that each test user has fewer ratings. For example, to generate train and test datasets for profile sizes from 1 to 19, we first generate a set of test and training splits so that test users have 19 training items, using the crossfold strategy outlined above. Then, to generate any other number of ratings, we randomly down-sample the 19-item training set. This allows us to generate training sets with a given number of ratings for test users, such that each profile size is evaluated with the same test set.
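As an illustration, here is a minimal Python sketch of the subsetting strategy in the same spirit as the earlier retain-n sketch; the data layout and names are again our own, and down-sampling takes prefixes of each test user's shuffled retained ratings so that smaller profiles are nested inside larger ones.

import random

def subsetting_retain_n_crossfolds(ratings_by_user, k, n, seed=0):
    # Sketch of Algorithm 2: build the retain-n split once at the maximum
    # profile size n, then down-sample each test user's retained ratings to
    # every smaller size while leaving the test sets untouched.
    rng = random.Random(seed)
    users = list(ratings_by_user)
    rng.shuffle(users)
    groups = [set(users[j::k]) for j in range(k)]

    folds_by_size = {size: [] for size in range(1, n + 1)}
    for test_users in groups:
        base_train, retained, test = [], {}, []
        for u, r_u in ratings_by_user.items():
            triples = [(u, item, value) for item, value in r_u]
            if u in test_users:
                rng.shuffle(triples)
                retained[u] = triples[:n]   # the n retained training ratings
                test.extend(triples[n:])    # test ratings, fixed for all sizes
            else:
                base_train.extend(triples)
        for size in range(1, n + 1):
            train = list(base_train)
            for kept in retained.values():
                train.extend(kept[:size])   # down-sample to `size` ratings
            folds_by_size[size].append((train, test))
    return folds_by_size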

[Figure 4.2 panels: RMSE, nDCG, MAP, and Precision plotted against the number of retained ratings.]

Figure 4.2: The subsetting-retain-n methodology applied to the item baseline in the MovieLens 1m dataset. Each metric is shown with the average user value and a 95% confidence interval, based on a normal approximation.

While this process is certainly more complicated to implement, this methodology does have one key benefit over the iterated retain-n approach. At every profile size, each user is evaluated on the same test ratings. This means that not only are different algorithms comparable within one profile size, they are also comparable between profile sizes. This also leads to less random noise across profile sizes, making trends in behavior easier to identify.

Figure 4.2 shows the same analysis from the last subsection re-done with the subsetting retain-n methodology. As can be seen, there is substantially less noise in the evaluation, with RMSE and nDCG being essentially straight lines (as should be expected). While we still see slight trends for Precision@20 and MAP@20¹, looking at the scales on the plots these trends are substantially smaller, and no longer larger than the confidence intervals on the evaluation. Given that our expected behavior is no meaningful change, this is a clear improvement over the iterated retain-n methodology.

¹ One explanation for these trends is that rated items are removed from the recommendation list. This is seen as an improvement by our metrics because the items added when moving from small profile sizes to large ones do not appear in the test set and are therefore assumed to be low-quality for the user.

This methodology is not without limitations. Most importantly, this method is more complicated to implement than the iterated retain-n approach. The random sub-sampling of the training set, in particular, is unlikely to be supported in current evaluation toolkits. Secondly, this methodology can only study performance for users who have at least the maximum number of ratings; it can only identify how an algorithm behaves for users who reach that minimum number of ratings. Different methodologies will be needed to look for differences between users who only make 4 ratings and those that make 400, as well as to ask whether these users received different recommendations when they joined the system.

4.2 Evaluation of Common Algorithms

Using the subsetting retain-n methodology we performed a careful study of how common algorithms behave for new users. As "algorithm behavior" is a vague term, we will break this down into three subgoals:

1. How well can different algorithms predict future ratings?
2. How well can different algorithms rank and recommend good items to new users?
3. How do algorithms behave for new users as measured by other metrics, such as popularity and diversity?

Within the subsetting retain-n methodology we will use user-based 5-fold crossvalidation.

4.2.1 Dataset

For this evaluation we will use the MovieLens 1M dataset. Details on this dataset can be seen in table 2.3 in section 2.9. These ratings were collected from the MovieLens system, which, when this dataset was released, required 15 ratings in a new user survey before users could enter the system. To the best of our knowledge, at the time this data was collected the new user survey preferred items that were very popular. As mentioned earlier, rating order is randomized to avoid measuring the effect of these initial ratings.

The dataset is designed such that each user in the dataset has at least 20 ratings. Therefore we will evaluate user profiles with up to 19 ratings. This will allow us to ensure that each user in the dataset has at least one test rating. As the average number of ratings per user in the MovieLens 1m dataset is 166 and the median is 96, most users will have far more than one test rating; 88% (5289) of users will have 10 or more test ratings.
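For readers reproducing this setup, the following is a small, hypothetical pandas sketch that loads the MovieLens 1M ratings file (assuming the standard ml-1m/ratings.dat layout and path) and checks the per-user rating counts discussed above.

import pandas as pd

# Path and column names are illustrative; the ml-1m archive ships ratings as
# user::item::rating::timestamp.
ratings = pd.read_csv("ml-1m/ratings.dat", sep="::", engine="python",
                      names=["user", "item", "rating", "timestamp"])
counts = ratings.groupby("user").size()
print("minimum ratings per user:", counts.min())   # expected to be >= 20
print("mean per user:", counts.mean(), "median:", counts.median())
# Users with at least 29 ratings keep 10+ test ratings after retaining 19.
print("users with 10+ test ratings:", (counts >= 29).sum(), "of", counts.size)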

4.2.2 Algorithms

We seek to develop an understanding of how a range of standard algorithms perform for new users. Rather than picking advanced algorithms that are tuned for new user performance, we decided to first understand how representatives of three common types of algorithms perform. Therefore we will compare user-user [107], item-item [112], and Simon Funk's SVD [44] against the item baseline and the user-item baseline. As the user-item baseline is rank-equivalent to the item baseline, we will only report the user-item baseline for metrics that directly use the prediction score; on all other metrics the two baselines are equivalent. These algorithms are described in depth in chapter 2.

Due to the low profile sizes involved in our evaluation, not every algorithm was able to make predictions for every user-item pair. In these cases, we use the UserItemBaseline prediction as a fallback. This ensured that our prediction based metrics were always evaluated over the same number of items. We found this unnecessary for recommendations, as each algorithm was able to make a complete recommendation with only one rating. Therefore we form recommendations as the top 20 items with predictions from the algorithm which were not in the training set. More information about these algorithms can be found in table 4.1. Code for our evaluation, using the Lenskit evaluation framework [40], is available at https://bitbucket.org/kluver/coldstartrecommendation.

Each algorithm was tuned based on the 2012 report by Ekstrand and Riedl [37]. Since this evaluation is on a different dataset (MovieLens 10M vs. MovieLens 1M), we performed minor tuning using the previous configuration as a starting point. As we had this starting value, our tuning was done one parameter at a time, looking at RMSE performance on cold start train/test splits. As most parameter changes had negligible effects on performance, the starting configuration was used in almost every case. The only exception to this was the damping parameter. While tuning, we found the damping parameter previously used for the baselines to be too large; a value of 5 gave much better results for new user recommendation. Based on trial runs of our evaluation with different parameters, our key results are not sensitive to minor changes of algorithm parameters.

ItemBaseline: Item's average rating, with mild Bayesian damping towards the global mean [44].
UserItemBaseline: ItemBaseline adjusted by the user's average offset from the ItemBaseline. Mild Bayesian damping is applied [44].
UserUser: User based nearest neighbor collaborative filtering [107] with a neighborhood size of 30 and ratings normalized by subtracting the UserItemBaseline score.
ItemItem: Item based nearest neighbor collaborative filtering [112] with a neighborhood size of 30 and ratings normalized by subtracting the ItemBaseline score.
SVD: Simon Funk's SVD based collaborative filtering approach [44] with 30 features and 150 training iterations per feature.

Table 4.1: Summary of algorithms. Note, we assign names in upper camel case to these specific configurations of these algorithms. This is different than the hyphenated names used in chapter 2 to refer to the generic forms of these algorithms.
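The following is a minimal sketch of the two baselines, under the assumption that "mild Bayesian damping" means blending each average with a fixed number of pseudo-ratings at the reference value (the global mean for items, a zero offset for users); Lenskit's actual implementation may differ in details, and all names here are our own.

from collections import defaultdict

def item_baseline(train, damping=5.0):
    # Damped item means: each item's ratings are averaged together with
    # `damping` pseudo-ratings at the global mean.
    global_mean = sum(r for _, _, r in train) / len(train)
    sums, counts = defaultdict(float), defaultdict(int)
    for _, item, r in train:
        sums[item] += r
        counts[item] += 1
    b_item = {i: (sums[i] + damping * global_mean) / (counts[i] + damping)
              for i in counts}
    return global_mean, b_item

def user_offsets(train, global_mean, b_item, damping=5.0):
    # Damped per-user offsets from the item baseline (the UserItemBaseline).
    sums, counts = defaultdict(float), defaultdict(int)
    for user, item, r in train:
        sums[user] += r - b_item.get(item, global_mean)
        counts[user] += 1
    return {u: sums[u] / (counts[u] + damping) for u in counts}

A UserItemBaseline prediction for user u and item i is then the item baseline plus the user's offset, falling back to the global mean (and a zero offset) when the item or user has not been seen in training.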

4.2.3 Metrics

We present results from eleven metrics. As has been seen in previous studies [52], we find that some of our metrics are redundant; future work should be able to use a smaller list of metrics. The metrics are described in table 4.2. Most of these metrics are described in more detail in section 2.9. We split our metrics into three groups, roughly along our three sub research questions for this section.

Accuracy and Ranking Metrics
  Coverage: Measure of the percent of test items with a prediction; a standard measure of a prediction algorithm's ability to make predictions for all items [115]. Range: 0.0 to 1.0. Larger is better.
  RMSE: Standard prediction accuracy metric [115]. Range: 0 to 5 stars. Smaller is better.
  nDCG: Standard ranking quality metric [115]. Range: 0 to 1. Larger is better.

Monotonic Recommendation Metrics
  Precision@20: Count of recommended items in the test set and rated 4.0 or higher [115].ᵃ Range: 0 to 1. Larger is better.
  MAP@20: Average of precision at each rank one through twenty [80]. Range: 0 to 1. Larger is better.
  Fallout@20: Count of recommended items in the test set and rated 2.0 or lower [28].ᵃ Range: 0 to 1. Smaller is better.
  MeanRating@20: Average rating of recommended items that are in the test set. Range: 0 to 5 stars. Larger is better.
  RMSE@20: RMSE computed only over those items that were recommended. Range: 0 to 5 stars. Smaller is better.

Non-monotonic Recommender Metrics
  SeenItems@20: Count of recommended items in the test set. Range: 0 to 20 items. Moderate values are better.
  AveragePopularity@20: The average number of users that have seen the recommended items in the training set [147]. Range: 1 to 6000 users. Moderate values are better.
  AILS@20: The average pairwise similarity between recommended items [147]. Range: -1 to 1. Moderate values are better.
  Spread@20: The Shannon entropy of the distribution of recommended items for users in the test set. Range: 0 to 12 bits. Moderate values are better.

ᵃ We also tried 5.0 and 1.0 as cutoffs respectively, and found no meaningful difference.

Table 4.2: Summary of metrics

Accuracy and Ranking Metrics RMSE and nDCG evaluate the recommender based on its prediction for each item. These evaluate the accuracy of the predictions and the ability of the recommender to rank the list of all items. We find that these two metrics perform quite similarly in our evaluation.

Additionally we will measure Coverage – specifically, we measure prediction coverage, the percent of items on which the prediction algorithm can make a prediction. While this is not typically viewed as an accuracy metric, it shares a focus on algorithm predictions, rather than top-n recommendations, with RMSE and nDCG. We use a common simplification in our computation of this and report the percent of test items on which the prediction algorithm was unable to make predictions. While slightly more nuanced to interpret, this variant is easier to compute (it can be computed at essentially no cost while computing other prediction metrics) and makes the metric more sensitive to failures to predict among items the user is interested in (as evidenced by their having rated the item). This is not a typical metric among modern algorithm evaluations simply because all algorithms, given a moderate number of ratings, tend to achieve full coverage. We will see, however, that this is not the case in this evaluation.

Monotonic Recommendation Metrics These metrics evaluate the top 20 recommendations produced by the recommender. The choice of 20 items for the evaluations was arbitrary; testing with other values showed similar results, except on one metric. This issue will be addressed later in the results subsection. We refer to these as monotonic metrics as there is a clear good direction for the metric.

Precision@20, MAP@20, and Fallout@20 are standard information retrieval metrics which measure the quality of the recommendations. As we show later, these metrics were not useful for our evaluation. Therefore, to help understand the quality of our recommendations, we also include MeanRating@20 and RMSE@20. MeanRating@20 tries to establish how much a user likes their recommendations by measuring each user's average satisfaction with recommended items based on withheld ratings. RMSE@20 accompanies this metric and lets us know whether the recommended items are likely to be incorrectly predicted. This can help us understand why an algorithm scores well on MeanRating@20.
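The following is a minimal per-user sketch of MeanRating@20 and RMSE@20 under the definitions above: only recommended items that also appear in the user's withheld test ratings contribute, and the function and argument names are our own.

import math

def mean_rating_and_rmse_at_n(recommended, predicted, test_ratings):
    # recommended: ordered list of item ids in the user's top-N list
    # predicted: dict item -> predicted rating for this user
    # test_ratings: dict item -> the user's withheld rating
    seen = [i for i in recommended if i in test_ratings]
    if not seen:
        return None, None  # undefined when no recommended item was rated
    mean_rating = sum(test_ratings[i] for i in seen) / len(seen)
    rmse = math.sqrt(sum((predicted[i] - test_ratings[i]) ** 2
                         for i in seen) / len(seen))
    return mean_rating, rmse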

Non-monotonic Recommendation Metrics SeenItems@20, AveragePopularity@20, AILS@20, and Spread@20 are all non-monotonic measures of the user's recommendations. These metrics are harder to interpret than the monotonic measures of recommendation quality. Unlike the previous metrics, we expect that users will be most satisfied by moderate values of these metrics, and that both extremes can lead to a negative user experience. We use these metrics as they can help us understand important properties of how users perceive their recommendations. As mentioned in the introduction, the number of recommended items that the user has seen (estimated by SeenItems@20) can have a major impact on how much users trust the recommendations. If an algorithm recommends too few items that the user has seen, then we expect the user will have trouble evaluating whether the recommendations are trustworthy. Too many, and the user may feel that the recommendations are not interesting or useful. Unfortunately, the optimal value for these metrics will vary by domain, user, and context. That said, we feel it is useful to understand how these metrics vary across algorithms for new users, as this knowledge can help inform default algorithm selection within a particular domain and context.

AveragePopularity@20 is a crude metric of how novel the recommendations are to the user [147]. We measure the popularity of a recommended item as how many users have rated that item in the training set. Broadly speaking, we expect users will prefer recommendation lists containing more novel (less popular) items. However, if the recommended items are too novel the user is unlikely to have heard of them and will have no point of reference for understanding their recommendations.

Past work has shown that users like their recommendations to be more diverse [137, 147]. The belief is that by providing more diverse lists we are giving the user a larger range of items to choose from. This can make choosing an item easier as the user is less likely to have to compare very similar options. It has also been shown, however, that too much diversity can lead to less user satisfaction [147]. This is probably due to a tradeoff between how diverse a list is and how well it captures a user's tastes. At some point, a list cannot become more diverse without including items that do not interest the user. We will measure diversity with the AILS@20 metric. The AILS@20 metric is based on the Intra-List Similarity (ILS) metric introduced by Ziegler et al. [147]. AILS@N is simply a rescaling of ILS to be scale free with regard to N; this was done to make the metric easier to interpret by putting it on the same scale as similarity values. We used the same item similarity metric for AILS@20 as we used in the ItemItem recommender.

The Spread@20 metric is designed to measure how well the recommender spreads its attention across many items. We expect that algorithms with a good understanding of their users will be able to recommend different items to different users. Like diversity, however, we expect that it is not possible to reach complete spread (recommending each item an equal number of times) without making avoidably bad recommendations. When Spread is large we know that the recommender is recommending many items with a relatively even distribution. When Spread is small, the recommender is focusing on a small set of items which it recommends to every user.
In this way Spread is similar to an item-coverage metric (how many of the items does the algorithm ever recommend), but provides a more nuanced measure of item coverage as it considers how often each item is recommended.

Spread is computed from the recommendations generated for each test user. Let $C(i)$ be the count of how many times item $i$ showed up in the recommendations, and let $P(i) = C(i) / \sum_{j \in I} C(j)$ be the probability that a random recommendation is for item $i$. Let $X$ be a random variable following $P(i)$ as above. We define Spread as the Shannon entropy of $X$:

$$ \mathrm{Spread} = - \sum_{i \in I} P(i) \log P(i) $$

In essence, this measures how hard it is to predict a randomly recommended item. If recommendations are better spread over the item set, they will be harder to predict.
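As a worked illustration, here is a minimal Python sketch of Spread@N (entropy in bits, matching the 0 to 12 bit range in table 4.2); the function name and input shape are our own.

import math
from collections import Counter

def spread(recommendation_lists):
    # recommendation_lists: one top-N list of item ids per test user.
    counts = Counter(item for recs in recommendation_lists for item in recs)
    total = sum(counts.values())
    # Shannon entropy of the distribution of recommended items.
    return -sum((c / total) * math.log2(c / total) for c in counts.values())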

4.2.4 Results

Using the methodology described in section 4.1 we evaluated each of the above metrics, for each of the above algorithms, at each profile size from 1 to 19. We present these results in three ways. First we will present plots of the per-user average of each of these metrics (the only exception being Spread, which is not measured per user). Secondly, we will give textual descriptions of the overall results. Finally, we will provide rough statistical measurements of which algorithms differ.

To compute statistical significance for the differences between any two algorithms we first perform separate Wilcoxon tests for each simulated profile size between the two algorithms. Wilcoxon tests are chosen because most metrics do not strongly follow any typical statistical distribution over which a parametric test might be more appropriate. Since this process generates 19 separate p-values, we then use a Holm–Bonferroni correction to correct for multiple comparisons. In the name of brevity we will simply report trends of significance (at the p < 0.05 level) among these comparisons.

The major limitation of this approach is that it does not allow for meta-analysis across multiple sample sizes. Additionally, it does not correct for multiple comparisons within the multiple pairs of algorithms that we compare. That said, since we are largely looking for a descriptive sense of behavior, not confidence in any single algorithmic ranking, this statistical method should be sufficient for our purposes. We can use this method to understand when two algorithms differ by a large enough degree to be considered meaningful given the between-user variance in metric values, which is poorly captured by our mean-centered descriptive statistics. Finally, this approach is reasonably easy to replicate in statistical toolkits such as R.

As this section goes over a substantial amount of detail, we will conclude with a summary of these results before moving on to our next analysis.
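A minimal sketch of this testing procedure in Python, using SciPy's paired Wilcoxon signed-rank test and a hand-rolled Holm–Bonferroni step-down; the data layout and function name are our own, and a statistics package such as R would work equally well.

import numpy as np
from scipy.stats import wilcoxon

def compare_algorithms(metric_a, metric_b, alpha=0.05):
    # metric_a, metric_b: dicts mapping profile size -> per-user metric values,
    # with users in the same order so the test is paired.
    sizes = sorted(metric_a)
    pvalues = []
    for n in sizes:
        _, p = wilcoxon(np.asarray(metric_a[n]), np.asarray(metric_b[n]))
        pvalues.append(p)
    # Holm-Bonferroni step-down correction over the simulated profile sizes.
    order = np.argsort(pvalues)
    significant = {}
    still_rejecting = True
    for rank, idx in enumerate(order):
        threshold = alpha / (len(pvalues) - rank)
        still_rejecting = still_rejecting and pvalues[idx] <= threshold
        significant[sizes[idx]] = still_rejecting
    return dict(zip(sizes, pvalues)), significant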

[Figure 4.3 panels: Coverage, nDCG, and RMSE plotted against simulated profile size for ItemItem, UserUser, SVD, ItemBaseline, and UserItemBaseline.]

Figure 4.3: Accuracy and Ranking Metrics. Note, the Coverage plot is heavily over-plotted as only Item-Item has substantial coverage issues. Note that UserItemBaseline is not plotted on nDCG as it is rank-equivalent with ItemBaseline.

Accuracy and Ranking Metrics

RMSE Figure 4.3 shows how accurate our algorithms are as measured by RMSE. UserUser and SVD both behaved essentially as expected. SVD shows no improvement with only one rating, and UserUser actually gets worse with the first rating. Other than that, these algorithms outperform ItemBaseline, with both statistically significantly outperforming it after 4 ratings. Interestingly, UserItemBaseline outperforms both algorithms until around 8 ratings, at which point both algorithms start to perform better.

Given the size of the differences between UserItemBaseline, UserUser, and SVD we conclude that all three are equally accurate for new users. In particular, UserUser does not perform statistically significantly better than the UserItemBaseline within 19 ratings, and SVD only statistically significantly outperforms UserItemBaseline with 17 or more ratings. That such a simple algorithm can outperform more complicated algorithms for small profile sizes suggests that recommending for users with very few ratings is a hard problem.

ItemItem performs quite poorly for new users. ItemItem performs statistically significantly worse than the ItemBaseline given 11 or fewer ratings. Between 11 and 13 ratings the differences between ItemItem and ItemBaseline are not statistically significant; however, ItemItem has a higher RMSE until we have 13 ratings. Furthermore, ItemItem is statistically significantly worse than UserItemBaseline, SVD, and UserUser for all datapoints tested.

Interestingly, for the first two ratings ItemItem appears to perform worse as more ratings are added. This trend is a result of the limited coverage of the ItemItem algorithm. The coverage of our algorithms can be seen in figure 4.3. Note that this figure is heavily overplotted as ItemItem is the only algorithm with meaningful coverage issues. Considering the accuracy of RMSE on only those items where ItemItem makes a personalized prediction, it shows a monotonic increase in accuracy similar to the other algorithms. However, for small rating counts ItemItem can only predict for some percent of the items (around 60% for users with two ratings, and around 75% for users with four ratings). Because of this, many of the predictions come from the better performing UserItemBaseline. As we get our first few ratings, the predictions become less likely to be from the UserItemBaseline, leading to an upward trend in RMSE despite the fact that the ItemItem predictions are getting monotonically more accurate.

nDCG Figure 4.3 shows the algorithms' performance on the nDCG metric. The results for nDCG are very similar to the RMSE results. Both UserUser and SVD are less accurate than the baseline as measured by nDCG with only one rating. UserUser takes much longer to recover, is statistically significantly worse given 3 or fewer ratings, and is never statistically significantly better. SVD is statistically significantly better than the baseline given 11 or more ratings, and statistically significantly better than UserUser for all rating sizes except 6 and 7. Again ItemItem performs quite poorly, and is statistically significantly worse than all algorithms given 2 or more ratings; given only one rating ItemItem is not statistically significantly worse than UserUser, but is worse than the others. Just like with RMSE, SVD performs the best, but still requires a fair number of ratings to significantly outperform the baseline algorithms.

[Figure 4.4 panels: Precision@20, MAP@20, and Fallout@20 plotted against simulated profile size for ItemItem, UserUser, SVD, and ItemBaseline.]

Figure 4.4: Information Retrieval Recommendation Quality Metrics. Note that UserItemBaseline is not plotted as it makes the same recommendations as ItemBaseline.

Monotonic Recommendation Quality Metrics

MAP@20, Precision@20, Fallout@20 Figure 4.4 shows MAP@20, Precision@20, and Fallout@20. All three metrics show essentially the same trend: UserUser gets the lowest score (statistically significant at all points and against all algorithms), and the baseline gets the highest score. The baseline is statistically significantly higher than ItemItem at all points on precision and recall, and for all profile sizes 3 or greater on fallout. On both MAP@20 and Precision@20, SVD gets a statistically significantly higher score than ItemItem throughout, performing almost as well as the baseline. On Fallout@20, ItemItem and SVD perform about as well as each other, with SVD showing a very slight lead (statistically significant after 5 ratings).

We also tested recall and mean reciprocal rank metrics, but found them to exhibit the same trend, so they were not explored further. We tested Precision@20 and Recall@20 with 4 and 5 stars as the cutoff for good and they showed essentially the same results (ItemBaseline and SVD were not statistically significantly different at 2 ratings). We tested Fallout@20 with 2 and 1 star cutoffs for bad. The results were very similar; however, Fallout@20 seems to have a larger amount of variance per user, as would be expected from measurements of frequencies that small. While the two plots for Fallout@20 are visually equivalent, the one star version has fewer statistically significant results (ItemItem needs 6 ratings to be statistically different from the baseline, and SVD needs 16 or more).

These results are rather strange, as MAP and Precision are metrics where larger scores imply better performance, whereas Fallout is a metric where larger scores imply poor performance. Therefore we would not expect to see rank-equivalent behavior from these metrics. Unfortunately this means that these metrics contradict each other. One possible reason for this is that these top-N metrics are known to be biased toward algorithms that recommend more popular items [10]. Looking forward to the average popularity of recommended items (figure 4.6), we see that the average popularity of the recommendations closely replicates the previous three plots. As a result of this, we discard these metrics, and consider less biased metrics to help us analyze the quality of the recommendations.


[Figure 4.5 panels: SeenItems@20, MeanRating@20, and RMSE@20 plotted against simulated profile size.]

Figure 4.5: Recommendation Quality Metrics. Note that SeenItems@20 is discussed out-of-order here, as it provides important context for the interpretation of other results. Note that UserItemBaseline is not plotted on SeenItems@20 and MeanRating@20 as it recommends the same items as ItemBaseline. Note that UserUser is not plotted for RMSE@20 due to instability in its estimate caused by the low SeenItems@20.

SeenItems@20 We address this non-monotonic metric out of turn here as it provides important context for the next two metrics. The first graph in figure 4.5 shows the average number of items in the top 20 recommendations for each algorithm that were also in the test set. The average number of test set items in recommendations shows a similar trend to the MAP and precision metrics, with UserUser getting by far the fewest seen items, and the baseline getting the most. UserUser is statistically significantly different from each other algorithm at all datapoints. The other three algorithms are each statistically significantly different with three or more ratings.

As UserUser gets less than one seen item on average, we found that UserUser was not appropriate for the MeanRating@20 and RMSE@20 metrics. In particular, we found that the UserUser results for these metrics are highly sensitive to recommendation list size. Because of this issue we will not include UserUser in our next two metrics and consider the next two metrics inconclusive for UserUser. That said, we consider the small number of seen items to be a bad result for UserUser. Put simply, if we cannot reliably estimate whether the recommendations are reasonable, how can we expect a user to do so?

MeanRating@20 The second graph in figure 4.5 shows the average rating for those items in the top 20 recommendations for each algorithm. The first thing to note about this graph is that all of these algorithms are doing a reasonable job of recommending items the user might like to watch, with all algorithms averaging above 4 stars (implying that the recommended items are, at least on average, enjoyable). SVD performs as expected, starting around the baseline for a few ratings, and then steadily improving to be the best algorithm after 3 ratings. SVD is statistically significantly better than the baseline with 6 or more ratings, and better than ItemItem for all profile sizes tested. Again, we find that ItemItem performs poorly until quite a few ratings are added, and does worse than the baseline throughout; this result is statistically significant for all profile sizes 17 or lower.

RMSE@20 The third graph in figure 4.5 shows the accuracy of our algorithms on items in the top 20 recommendations. Not surprisingly, this figure shows the same trends as the last one, meaning that the algorithms whose recommendations were (on average) most likable were those whose recommendations were most accurate. Again, SVD performs quite well, easily beating the UserItemBaseline after only 1 rating (statistically significant at all profile sizes). ItemItem does the worst at all profile sizes, although this is only statistically significant for profile sizes 11 and lower.

It is interesting to compare RMSE and RMSE@20. While SVD performs well on both RMSE and RMSE@20 in comparison to the baselines, its edge over UserItemBaseline appears greater on RMSE@20. SVD seems to be particularly accurate in its scores for its top 20 items. This stands in contrast to ItemItem. While ItemItem eventually has reasonable performance on RMSE, it never seems to achieve good RMSE@20 scores in comparison to the baselines. This agrees with the MeanRating@20 scores, indicating that RMSE@20 could be a more useful metric for understanding prediction accuracy within the range of predictions the user is most likely to see and use when evaluating the system.


[Figure 4.6 panels: AveragePopularity@20, AILS@20, and Spread@20 plotted against simulated profile size for ItemItem, UserUser, SVD, and ItemBaseline.]

Figure 4.6: Non-monotonic Recommendation Quality Metrics.

Non-monotonic Recommender Quality Metrics

AveragePopularity@20 The first graph in figure 4.6 shows the average popularity of items recommended by each algorithm. As expected, the baseline algorithm makes the most popular recommendations, followed by SVD, ItemItem, and then UserUser in a distant last. All of these differences are statistically significant under our tests. As stated earlier, we expect that users will feel the baseline's popularity (around 1000) is too high. We also expect that users will perceive UserUser's popularity (around 50) as too low. SVD and ItemItem both seem to perform well, with SVD generating more popular recommendations than ItemItem. SVD and ItemItem both show a trend of decreasing popularity as more ratings are entered, suggesting that these algorithms will make more nuanced, less popular recommendations for users they know more about.

AILS@20 The second graph in figure 4.6 shows a similar trend to the popularity trend. Again, we expect that the baseline, which provides the least diverse recommendations, would be perceived as not very diverse by users. Quite possibly due to its preference for obscure movies, UserUser seems to produce the most diverse lists of any algorithm; we expect that the UserUser recommendation lists might be too diverse. Unlike popularity, SVD and ItemItem seem to perform equivalently in terms of diversity. We also see that all three collaborative filtering algorithms show a trend of increasing diversity (decreasing intra-list similarity) as more ratings are entered.

As with popularity, we see (almost) all differences as statistically significant. Since SVD and ItemItem are very similar we looked into this result further and found that these distributions had slightly different behavior, with ItemItem having slightly higher AILS@20 scores in the top quartile. This suggests that while there may not be a meaningful mean shift, there may be a small but significant median shift (which the Wilcoxon test is sensitive to). Given the size of the differences between metric values we do not see this change as practically significant. This also suggests that future work should investigate statistical measurements that are better suited to the task at hand. Ideally a regression based technique might be used, but it is not obvious what equation would best fit the large range of possible shapes that metric growth with profile size can take, or how to best handle non-normality in the underlying scores. Likewise, an ideal methodology should take advantage of the fact that users are repeated, allowing paired testing methodologies.

Spread@20 The third graph in figure 4.6 shows the spread of the recommendations from each algorithm. As the Spread metric cannot be measured per user, we will not report statistical significance here. As expected, the baseline performs worst on Spread, as it recommends essentially the same items to each user. SVD performs mildly better than the baseline, and shows definite improvement as more ratings are added. UserUser and ItemItem start at around the same value as each other; however, after only two ratings they have crossed, as ItemItem quickly increases to have the highest spread of all the algorithms, and UserUser quickly decreases into second place. Interestingly, only UserUser shows a trend of less spread as it learns more about the user. ItemItem's and SVD's increasing Spread matches a model of algorithms that are conservative, making safe recommendations given little information, and then recommending more nuanced and personal items as we learn more about the user. UserUser's trend of decreasing spread is compatible with a model of making somewhat random recommendations given little information, and then narrowing in as it gets more information. More research will be needed to fully understand these trends.

Algorithm    Prediction accuracy    Recommendation quality    Other properties
ItemItem     Poor                   Poor                      Good
UserUser     Good                   Inconclusive              Poor
SVD          Good                   Good                      Good

Table 4.3: Summary of algorithm behavior for new users

4.3 Conclusion

The high level results of our evaluation are summarized in table 4.3. Overall we see that SVD seems to be the best performing algorithm for cold start use. SVD was the only algorithm that consistently outperformed the baseline algorithms on prediction and recommendation tasks. This is, perhaps, due to its design as an algorithm that produces offsets from a baseline algorithm. Due to regularization in the learning process, this offset will likely stay small when only limited information is available. While this leads to good prediction and recommendation performance, it does lead to arguably worse performance on the average popularity and spread metrics than ItemItem. Unfortunately, ItemItem also performed surprisingly poorly at both prediction and recommendation, which makes it less suitable for new users.

UserUser performed quite well at prediction, but tended to favor unpopular items in its recommendations. While some novelty is good, we believe that UserUser goes too far with obscure movies in its recommendations. Therefore we conclude that UserUser alone is not well suited for new users. It is possible that careful tuning or hybridizing with another algorithm may help UserUser perform well for new users. Exploring this issue further is left as future work.

Interestingly, we repeatedly see that for very small numbers of ratings the UserItemBaseline is our strongest algorithm, both in terms of accuracy and recommendation quality. This suggests a switching strategy for live systems, in which the UserItemBaseline is deployed until a user has enough ratings, after which they are switched to a better recommender. The switching point between algorithms can be determined with our methodology. The only downside of this approach is that the baseline performed poorly on our non-monotonic metrics, which may affect users as they first join the system.

Outside of what we learned about algorithm behavior, this chapter has also introduced the MeanRating@20, RMSE@20, and Spread@20 metrics. MeanRating@20 and RMSE@20 worked well in this study to help us understand the quality of recommendations without being subject to the popularity bias that affected traditional recommendation metrics. Future work should explicitly explore these metrics, comparing them with more conventional recommendation quality metrics to see if they are a better predictor of user satisfaction. The Spread@20 metric served as a useful complement to popularity and diversity to better understand how the algorithms were behaving.

While the subsetting-retain-n methodology explored here is somewhat cumbersome to implement, we found that it allowed a much more in-depth understanding of algorithm behavior and tradeoffs for users with few ratings. This depth of analysis may not be necessary for every new algorithm, but our results certainly show its value. We find it surprising that no algorithm we compared was able to outperform a baseline algorithm in the very-small-rating-count (1-4) situation. We further find it surprising that the poor performance of Item-Item seems to have gone unnoticed before this study. To avoid such surprises it is important that a well designed methodology for evaluating algorithms under small-rating-count situations is available. Even if the complete subsetting-retain-n methodology is too cumbersome for an initial report of an algorithm, some attention should be paid to small (<10) rating counts to ensure an algorithm doesn't have degenerate behavior for new users.
Future work on the subsetting-retain-n methodology should extend and demonstrate this methodology for non-rating data. As the recommender systems research community moves towards implicit or ranking based forms of preference elicitation, it is important that we have comparable methodologies to evaluate how algorithms perform under these use cases. Such extensions, combined with some estimate of the quantity tradeoff between ratings and implicit information, could even allow rough comparisons of different paradigms of preference elicitation for new users.

Ultimately, however, identifying weaknesses in algorithms is not the final step in algorithm development and deployment, and simply reporting on the poor performance of Item-Item is not a reasonable last step. Instead, once identified, the fundamental failures of algorithms should be understood and, if possible, fixed. In this way we can advance our knowledge as a field of the features of effective recommendation algorithms for users with few ratings. The next chapter will explore the failings of the Item-Item algorithm, looking at why it performs poorly, and how to fix it. After that we will present an online evaluation of the performance of our fixed Item-Item algorithm. This evaluation can serve as a prototype for online evaluations of an algorithm's new-user performance to complement the offline evaluation presented here.

Chapter 5

Improving Item-Item CF for Small Rating Counts

5.1 Introduction

Item-Item is a popular algorithm: while it may not perform quite as well as the most recent cutting edge machine learning algorithms, it is easy to implement, explain [51], and modify for features such as alternate weighting schemes [8, 69]. Perhaps due to these advantages, Item-Item is still quite popular in production recommender systems, including those of large websites like Hulu [138] and Amazon [76]. Therefore, it is quite surprising that Item-Item would perform so badly for new users. Item-Item has been the subject of no small amount of scientific development and scrutiny, and to our knowledge our 2014 publication [65] (13 years after the algorithm was initially described) was the first to note this weakness.

In this section we seek to use Item-Item as a case study. We will explain why Item-Item behaves poorly for new users and evaluate modifications to mitigate these issues. Exploring and documenting the weaknesses of Item-Item will help us understand what properties of an algorithm lead to good or bad behavior for users with few ratings. Through our investigation we are able to isolate specific factors that lead to Item-Item's poor performance, as well as identify two improvements to the Item-Item algorithm that target these issues. Our improvements are able to resolve the issue, which we demonstrate with both an offline and an online user-centered analysis showing that our improved Item-Item outperforms the unmodified variant.

Interestingly, we find that our users do not rate the improved Item-Item algorithm as superior to a baseline recommendation strategy, suggesting that more research is needed into the broader area of what algorithm properties are desirable for new users.

In the next two sections we explore two separate, but compounding, issues that affect the Item-Item algorithm. We will conclude each section by isolating a modification to Item-Item that should mitigate the issue we identify. First we will look at how model truncation reduces the amount of user rating data that can be used when making predictions. Then we will look at how the prediction algorithm itself under-performs when given few, or poor quality, neighboring item ratings.

5.2 Improving Item-Item - the Frequency Optimized Model

The key to making a prediction $S(u,i)$ for user $u$ and item $i$ in the Item-Item algorithm is taking the average of user $u$'s past ratings on other items. This average is weighted based on how similar each other rated item $j$ is to the target item $i$, based on some measurement of similarity $sim(i,j)$. In principle, every rating could be included in this average; however, that is often unnecessary, and a user's ratings on a handful of similar items work just as well. Therefore, to avoid unnecessary computations and storing unnecessary information, we often truncate the Item-Item model to store the $k$ most similar neighbors for any item, and use only ratings on these items in a prediction.

Quite deliberately, model truncation limits the number of ratings that are actually applied to a prediction in any given circumstance. While this is quite reasonable given a large collection of ratings, it can be problematic for users with few ratings. We have already seen evidence of this issue in figure 4.3, where we observed a significant coverage issue. The only reason Item-Item would fail to make a prediction, given our configuration of the algorithm, is if no neighboring ratings were available. Especially for users with very few ratings, this happens at an alarming rate. To get a better estimate of the effect of truncation we can look at how many ratings were used when making predictions (based on data output from the evaluations in the last chapter). On average, a user with eight ratings only used three in any given prediction, and a user with fourteen ratings only used five in any given prediction.

While some amount of difference between available and used ratings is expected, this degree is excessive. Since model truncation seems to be a core issue, one solution would be to simply not do it. While this will likely make the algorithm much slower, it should also allow all of a user's ratings to be used. We will test this solution against more elegant solutions. We do note, however, that model truncation was invented for a reason and we do not expect this solution to be feasible in terms of memory or computation speed for most Item-Item deployments.

Ideally we should be able to increase the number of used neighbors without increasing the number of stored neighbors. One problem here is that many of the best neighbors for an item might not be very frequently rated. While very useful when available, these items are not often available in a user's profile and are rarely used. When truncating the similarity matrix we should strike a balance between the value of a neighbor when used (how similar it is to the target item) and how frequently that neighbor will be used.

This insight leads us to the frequency optimized model. Normal Item-Item model truncation chooses the $K$ most similar neighbors without regard for how often they are rated. The frequency optimized model chooses items based on both the number of users who have rated an item and the similarity of the item to the target item. The sum of similarity used when making a prediction closely relates to how accurate we expect that prediction to be. Therefore our goal with this model is to optimize for the expected sum of similarity available when making predictions. The expected sum of similarity for an item $i$ is $\sum_{j \in N_i} P(j) \cdot sim(i,j)$, where $P(j)$ is the probability that item $j$ will be available for use in a random prediction. We will model this as simply the proportion of users who have rated item $j$: $P(j) = \frac{|U_j|}{|U|}$.
Therefore the expected similarity contribution of an item $j$ to predictions for an item $i$ is $\frac{|U_j|}{|U|} \cdot sim(i,j)$. Since we only intend to use this term for ranking purposes, we simplify it to $|U_j| \cdot sim(i,j)$. Selecting the $K$ items that maximize this term will select items with an optimal balance between their value as a neighbor and their likelihood of being available when making a prediction.

This model makes two key assumptions. First, this model assumes that items are rated independently of each other. This assumption is trivially wrong, and accounting for it may allow interesting improvements by accounting for item co-rating patterns when choosing neighbors. We do not explore these improvements; they would likely involve a substantial increase in the complexity of model truncation algorithms. The second assumption we make is that all users are equally likely to request a prediction. This assumption is quite reasonable for our evaluation case. In a deployed system, however, it may be worth focusing the algorithm on the rating patterns of recently active users.
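A minimal sketch of this truncation rule, with illustrative function and argument names of our own: for each target item, candidate neighbors are ranked by |U_j| * sim(i, j) instead of sim(i, j) alone.

def frequency_optimized_neighbors(similarities, rating_counts, k):
    # similarities: dict mapping candidate neighbor j -> sim(i, j) for one
    # target item i; rating_counts: dict item -> number of users who rated it.
    scored = sorted(similarities.items(),
                    key=lambda pair: rating_counts.get(pair[0], 0) * pair[1],
                    reverse=True)
    return dict(scored[:k])  # the k neighbors kept in the truncated model

Standard truncation corresponds to dropping the rating-count factor from the sort key.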


Figure 5.1: A plot of RMSE by the number of neighbors used when making a prediction. RMSE was aggregated by user, with the average RMSE by user plotted and 95% confidence intervals based on a normal approximation shown as error bars. The horizontal line shows the ItemBaseline's average RMSE.

5.3 Improving Item-Item - Damped Item-Item

Small item neighborhoods alone cannot account for the poor performance of Item-Item. To see this, consider figure 5.1, which plots RMSE against the size of the neighborhood used in each prediction. Item-Item needs three or more neighbors to outperform the ItemBaseline (which, it should be noted, uses none of the user's ratings). If Item-Item's prediction equation were accurate we would expect it to perform at least as well as ItemBaseline under every circumstance. As this is not true, we will investigate the prediction equation as another source of error. It is well known that averages based on few values are unreliable. Phrased differently, when there are few neighbors we simply do not have enough evidence on which to base our prediction, and our predictions are therefore unreliable. One common solution is to shrink the prediction towards a more reliable baseline when there are few neighbors. This can be accomplished by adding a damping parameter d to the weighted average, as seen in equation 5.1.

$S(u,i) = B(u,i) + \frac{\sum_{j \in N_{ui}} sim(i,j)\,(r_{uj} - B(u,j))}{d + \sum_{j \in N_{ui}} |sim(i,j)|}$   (5.1)

Adding a small constant d to the denominator of the weighted average has a negligible effect when the sum of similarity of the used neighbors is high. However, when the sum of similarity of the used neighbors is low, the d term serves to reduce the offset from the baseline. In this way, predictions that deviate greatly from the baseline can only happen when there is sufficient evidence contained in the user's ratings.

The idea of using a small damping parameter to control the impact of limited data is not new. It is essentially the same idea used in some forms of regularization and most forms of Bayesian damping. Damping has been used to improve baseline predictions [44] and similarity estimates [8, 32]. Our specific approach has even been mentioned before as a way to improve Item-Item [69]. However, to our knowledge this technique has not been evaluated or discussed in the context of the new user problem. Since this improvement should significantly improve Item-Item's new user behavior, we feel it is vital to consider it in our analysis.

As model truncation leads to smaller neighborhood sizes, and in turn less accurate predictions, we expect that these two factors compound to cause the massive prediction errors observed in the last chapter. To understand the role these factors play together we will also evaluate an Item-Item that has both improvements. We will call this the frequency optimized + damped Item-Item algorithm, or combined Item-Item for short. We expect this algorithm to be the best of both worlds, as it should both have larger item neighborhoods and make better predictions given those neighborhoods.
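The following is a minimal Python sketch of equation 5.1, assuming the user's ratings, the per-user baseline predictor B, and the (truncated) neighbor row for the target item are already available; the names are ours, and setting d = 0 recovers the ordinary Item-Item weighted average.

    def damped_prediction(user_ratings, baseline, target_item, neighbors, d=0.3):
        """Equation 5.1: a baseline-offset weighted average damped by d.

        user_ratings: dict item -> the user's rating r_uj
        baseline:     callable item -> B(u, item) for this user
        neighbors:    dict item -> sim(target_item, item) from the truncated model
        """
        numerator = 0.0    # sum of sim(i, j) * (r_uj - B(u, j)) over usable neighbors
        denominator = d    # the damping term pulls low-evidence predictions to baseline
        for j, sim in neighbors.items():
            if j in user_ratings:
                numerator += sim * (user_ratings[j] - baseline(j))
                denominator += abs(sim)
        if denominator == 0.0:          # no usable neighbors and no damping
            return baseline(target_item)
        return baseline(target_item) + numerator / denominator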

5.4 Offline Evaluation of the Improved Item-Item

To measure the impact of our algorithm improvements we will use the same methodology as in section 4.2. We will use the same dataset (MovieLens 1M) with the algorithms configured the same way and, furthermore, a subset of the same metrics. In addition to this evaluation we will also use a more traditional user-based five-fold crossvalidation evaluation with 20% of each test user's ratings retained for the test set. This allows us to measure what impact, if any, our algorithms have on an average user. Just as it was problematic in the past to focus only on the performance of algorithms on established users, it would be problematic if we ignored how our modified algorithms behave for normal system users. In particular, it is not clear that an algorithm improvement should be adopted if a small gain for new users comes at the cost of accuracy or efficiency for users with larger collections of ratings. We will call the first evaluation the small profile evaluation results and the second the large profile evaluation results.

We will test the algorithms shown in table 5.1. We include the ItemBaseline, UserItemBaseline, and SVD algorithms as comparison points, and compare these against unmodified Item-Item, Item-Item with no model truncation, Item-Item with frequency optimized model truncation, Item-Item with damping, and frequency optimized + damped Item-Item. Note that we leave UserUser out of this analysis, as it was dominated by SVD and is therefore not an important comparison point for our improvements.

Tuning of the damping parameter for the damped Item-Item algorithms was done through a grid search over RMSE. Since we anticipated that the damping parameter would be difficult to tune under a traditional large profile offline evaluation, we tuned using a 9-rating training set: to avoid overfitting by tuning on our test set, we randomly split the 19 training ratings in the primary crossfold for our analysis into 9 training ratings and 10 test ratings. Our grid size for the search was 0.1. We found that values between 0.2 and 0.5 seemed optimal, with the exact optimal value being highly sensitive to chance in the crossfolding procedure. By additionally considering the nDCG and MeanRating@20 metrics we were able to select 0.3 as an optimal damping value.

Baseline comparisons
  ItemBaseline: The item's average rating with mild Bayesian damping [44]. Included as minimum acceptable performance.
  Item-Item: Standard Item-Item algorithm [112]. Included to measure the effect of our improvements.
Best new-user algorithms from past work
  UserItemBaseline: ItemBaseline adjusted by the user's average offset from baseline. Mild Bayesian damping is applied [44].
  SVD: Simon Funk's SVD-based collaborative filtering algorithm with 30 features and 150 training iterations per feature [44].
Our improvements
  Full Item-Item: Item-Item with no model truncation.
  Frequency Optimized: Item-Item with a modified similarity model which prefers items that have been rated by many users.
  Damped Item-Item: Modified Item-Item which includes the ItemBaseline in its predictions to improve performance for new users.
  Combined: Damped Item-Item algorithm using the frequency optimized model.

Table 5.1: Summary of algorithms, including the reason for inclusion in this experiment.

We will use a subset of the metrics from the past analysis. In particular we will use RMSE, nDCG, coverage, MeanRating@20, and the popularity and diversity of recommendations. As precision, recall, and mean average precision were found to be not useful in the last evaluation, they were excluded from this evaluation.

As some of the modifications we are testing could affect the amount of time required to make predictions and recommendations, we will also look at the recommendation time: the average time for the system to make a prediction for every item and then form a top-20 list. We measured this on a separate run of the small and large profile evaluations, run single threaded with LensKit configured to avoid any resource sharing between algorithms.¹ We configured these separate runs to have only one metric, a custom "just recommend" metric which simply requests a recommendation for the user and does no further computation. In this way we can measure the test time as the recommendation time, plus possibly a constant (between algorithms) overhead associated with LensKit's testing framework. As we found the timing results to be constant between the small and large profile evaluations, and largely independent of profile size, we will only report this metric for the large profile evaluation.

¹ LensKit uses a dependency injection framework, Grapht, which has support for optionally sharing Java objects involved in recommendation between algorithms (so long as the component is identical between algorithms). This process can speed up evaluations, but in doing so invalidates algorithm timing experiments unless disabled.
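The RMSE portion of this grid search is simple enough to sketch. In the illustration below, evaluate_rmse is a hypothetical helper that trains a damped Item-Item model with damping d on the 9-rating training split and returns its RMSE on the held-out ratings; everything else is ours.

    import numpy as np

    def tune_damping(train, test, evaluate_rmse, grid=np.arange(0.0, 3.05, 0.1)):
        """Return the damping value with the lowest held-out RMSE plus all scores.

        evaluate_rmse(d, train, test) is an assumed helper, not part of any
        library; the 0.1 grid spacing follows the tuning described in the text.
        """
        scores = {round(float(d), 2): evaluate_rmse(d, train, test) for d in grid}
        best = min(scores, key=scores.get)
        return best, scores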

5.4.1 Small Profile Evaluation Results

Since we are comparing eight algorithms, five of which are Item-Item variants, joint plots of our results show a nontrivial amount of overplotting. Therefore, to make the results of our small profile analysis easier to follow, we will split our discussion into two parts. The first part will cover the full Item-Item model and the frequency optimized model, and will focus on the changes brought about by simply increasing the size of the item neighborhoods. Then we will cover the damped Item-Item and the frequency optimized + damped Item-Item variants.

Just like the last evaluation, we will adopt a strategy of using point-wise Wilcoxon tests for each simulated profile size and then a post-hoc Holm-Bonferroni correction to establish the statistical significance of differences between algorithm results. We will also run a Student's t-test to compute confidence intervals for these differences, which we will use to comment on how large a difference is possible between metric values for two algorithms given the statistical power in our experiment. By taking the max (or min) of the relevant ends of the confidence intervals we will make claims of the type "algorithm 1 is no more than Y worse than algorithm 2".
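As an illustration of this testing procedure (not the exact tooling used for the analysis), the sketch below assumes per-user metric values, paired by user, are available for each simulated profile size, and uses SciPy and statsmodels for the Wilcoxon tests and the Holm correction.

    from scipy.stats import wilcoxon
    from statsmodels.stats.multitest import multipletests

    def pointwise_significance(metric_a, metric_b, alpha=0.05):
        """Paired Wilcoxon test at each profile size, Holm-corrected across sizes.

        metric_a, metric_b: dict profile_size -> list of per-user metric values
        for two algorithms, paired by user (a hypothetical data layout).
        Returns profile_size -> True if the difference is significant after
        the Holm-Bonferroni correction.
        """
        sizes = sorted(metric_a)
        p_values = [wilcoxon(metric_a[n], metric_b[n]).pvalue for n in sizes]
        reject, _, _, _ = multipletests(p_values, alpha=alpha, method="holm")
        return dict(zip(sizes, reject))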

Frequency Optimized Item-Item

The results of the small profile evaluation for the full model and frequency optimized model Item-Item algorithms can be seen in figure 5.2.

Coverage While this cannot be seen due to overplotting, the full Item-Item model completely resolves the coverage issue seen in unmodified Item-Item. There are still a few rare occasions where no prediction can be made: given one rating the average coverage is 99.82%. While this is statistically significantly lower than the 100% coverage of our baseline algorithms, it is high enough to consider the issue effectively fixed. The frequency optimized model makes a clear and statistically significant improvement over Item-Item in terms of coverage at all data points. The frequency optimized model reaches a coverage of 95% given only 5 ratings, a large improvement over the 11 ratings needed by a normal Item-Item algorithm.

Figure 5.2: Results of the small profile evaluation for the Full Item-Item and the Frequency Optimized Item-Item algorithms.

RMSE Looking at RMSE we see that both the full model and frequency optimized model Item-Item algorithms have worse RMSE for the first few ratings. This is likely an artifact of the increased coverage, as that means the fall-back algorithm (ItemBaseline) is making fewer predictions for these algorithms. Both improved algorithms are statistically significantly worse than Item-Item for profiles of size two or smaller. However, both algorithms improve substantially as we get more ratings. The full model is statistically significantly better than Item-Item given three or more ratings, and the frequency optimized model is statistically significantly better given five or more.

Comparing against the baseline algorithms, Item-Item needs 16 ratings to statistically significantly outperform the ItemBaseline, and never statistically significantly outperforms the UserItemBaseline in our sample. The frequency optimized Item-Item variant requires 11 ratings to statistically significantly outperform the ItemBaseline and also never statistically significantly outperforms the UserItemBaseline, although its behavior is roughly equal given 14 or more ratings, where it is no more than 0.01 RMSE worse. The full Item-Item model needs only eight ratings to outperform the ItemBaseline and outperforms the UserItemBaseline given 16 ratings. Comparing to the best cold-start algorithm, SVD, the frequency optimized model performs worse for all profile sizes less than 16 and the full model performs statistically significantly worse for profiles with seven ratings or fewer. Given nine or more ratings the full model and SVD are not statistically significantly different, and SVD is no more than 0.02 RMSE better.

nDCG Looking at nDCG we see similar behavior as in RMSE. For the first few ratings Item-Item outperforms full model Item-Item and frequency optimized Item-Item. However, after four ratings both variants statistically significantly outperform unmodified Item-Item. While unmodified Item-Item never outperforms the baseline algorithms, the full model performs statistically significantly better than the baseline given 14 ratings, and the frequency optimized algorithm statistically significantly outperforms the baselines given 19 ratings. Neither algorithm outperforms SVD for the profile sizes tested; however, given 10 or more ratings the full model is not statistically significantly worse than SVD and is no more than 0.002 nDCG worse.

meanRating@20 We again see similar behavior from meanRating@20, with Item-Item and frequency optimized Item-Item performing similarly for the first few ratings. Frequency optimized Item-Item's meanRating@20 is statistically significantly better than Item-Item's after seven ratings. The full model performs better than Item-Item at almost every sample size, and is statistically significantly better after three ratings.

Like with nDCG, we see that the improved Item-Item algorithms are able to outperform the baseline recommendations. The full Item-Item algorithm needs six ratings to statistically significantly outperform the baseline and the frequency optimized algorithm needs thirteen. Unlike with nDCG, the full model is able to outperform SVD, giving statistically significantly better recommendations given seven or more ratings.

AveragePopularity@20 The average popularity of items recommended by the variant Item-Item models behaves interestingly compared to Item-Item. Both variant models initially recommend more obscure items than Item-Item (statistically significant for 2-7 ratings for the full model and 3-9 ratings for the frequency optimized model). However, with more ratings the variant models level out to recommending more well known items than Item-Item for users with a reasonable number of ratings. The full model Item-Item recommends statistically significantly more popular items than Item-Item for users with ten or more ratings, and the frequency optimized model recommends statistically significantly more popular items given twelve or more ratings. Both algorithms maintain Item-Item's property of recommending statistically significantly more obscure items than the baseline and SVD algorithms for all sample sizes.

AILS@20 Given only a few ratings, full model Item-Item recommends less diverse items than both Item-Item (statistically significant for only one rating) and SVD (statistically significant for one or two ratings). Likewise the frequency optimized algorithm recommends less diverse items than Item-Item and SVD, both statistically significant for profiles with up to four ratings. After these initial ratings, however, both algorithms level out to recommending more diverse items than SVD (statistically significant after three ratings for the full model and six ratings for the frequency optimized model). After seven ratings the frequency optimized model is statistically significantly more diverse than Item-Item. The full model gives more diverse recommendations than Item-Item for profile sizes two to thirteen, after which Item-Item and full model Item-Item are not statistically significantly different.

Summary In general, both model-based improvements behave roughly as expected. We see a slight decrease in metric values relative to Item-Item for the first few ratings, caused by the increased coverage leading to more Item-Item predictions and fewer ItemBaseline predictions. After four or five ratings, however, both improvements outperform traditional Item-Item, with the full model usually performing better than the frequency optimized model. Typically these variants are still outperformed by the baseline algorithms until ten to fifteen ratings have been collected. Likewise, on most metrics these algorithms do not outperform SVD, although given enough ratings the full model Item-Item can draw equal.

Damped Item-Item

The results of the small profile evaluation for the damped Item-Item and the frequency optimized + damped (or combined) Item-Item algorithms can be seen in figure 5.3.

Coverage As the damping modification does not affect the neighborhood of items, the coverage for these algorithms is the same as for the previous modifications. In particular, the damped Item-Item has the same coverage issues as an unmodified Item-Item, and the combined algorithm has the same improved performance as the frequency optimized algorithm.

Figure 5.3: Results of the small profile evaluation for the damped Item-Item and the frequency optimized + damped (or combined) Item-Item algorithms.

RMSE As can be seen in figure 5.3, the damped Item-Item variants perform much better in terms of RMSE. The damped Item-Item variant performs statistically significantly worse than normal Item-Item for the first rating, but performs statistically significantly better at every other profile size. Likewise the combined algorithm (which overplots with the UserItemBaseline and SVD) performs statistically significantly better than Item-Item for every profile size. After five ratings the damped Item-Item modification statistically significantly outperforms the ItemBaseline. The combined algorithm statistically significantly outperforms the ItemBaseline after only two ratings. Both damped algorithms need only fourteen ratings to outperform the UserItemBaseline. Neither algorithm statistically significantly outperforms the SVD algorithm at any point tested. That said, the SVD algorithm does not statistically significantly outperform the combined algorithm at any point either, with their RMSE within 0.0175 of each other at all points. This places SVD and the combined Item-Item algorithm on even footing, tied for best predictions at all sample sizes, with neither being statistically significantly worse than any tested algorithm at any tested profile size.

nDCG With nDCG we see a similar trend as we saw with RMSE. The combined algorithm is statistically significantly better than Item-Item at all sample sizes tested, and the damped Item-Item algorithm is better after three ratings and worse given exactly one rating. Damped Item-Item is statistically significantly worse than the baseline recommendations given five or fewer ratings, and statistically significantly better given 13 ratings or more. The combined algorithm is never statistically significantly worse than baseline, and is better given eight or more ratings. Damped Item-Item is statistically significantly worse than SVD given five or fewer ratings, and never statistically significantly better. The combined algorithm is again never statistically significantly different from SVD, with any difference being less than 0.0015 nDCG. Both the combined algorithm and SVD are never statistically significantly worse than any tested algorithm at any tested sample size.

meanRating@20 Like before, the damped Item-Item algorithm performs worse than Item-Item when given one rating, although this is not statistically significant. Given four or more ratings damped Item-Item statistically significantly outperforms Item-Item. Given one or two ratings damped Item-Item performs statistically significantly worse than the baseline, and given twelve or more ratings it performs better. The damped Item-Item algorithm never outperforms the recommendations of the SVD algorithm.

The combined algorithm outperforms Item-Item on meanRating@20 given two or more ratings and the ItemBaseline given seven or more ratings. Likewise, the combined algorithm gets statistically significantly worse recommendations than SVD given only one rating, with no statistically significant difference after that. After two ratings SVD and the combined Item-Item algorithm are never more than 0.047 meanRating@20 apart. However, given that SVD non-significantly improves on the combined algorithm at almost every tested point, it likely has a small benefit that is not measurable by this experiment.

Unlike RMSE and nDCG, SVD and frequency optimized + damped Item-Item are not tied as the best algorithm for meanRating@20. While this comparison is not featured on a plot, the full model is the best algorithm for meanRating@20 given seven or more ratings, with statistically significantly better recommendations than both SVD and the combined algorithm. The frequency optimized Item-Item algorithm, however, never outperforms SVD and the combined algorithm.

AveragePopularity@20 The damped modifications have a much larger average popularity than Item-Item, statistically significant at all sizes tested. In general we find that the damped Item-Item has higher AveragePopularity@20 than the combined algorithm, although this is only statistically significant given thirteen or fewer ratings. The combined algorithm has less average popularity than the baseline at all sample sizes (statistically significant given any number of ratings but four). The damped Item-Item has statistically significantly higher AveragePopularity@20 than the baseline algorithm given 2-5 ratings, and statistically significantly lower AveragePopularity@20 given 10 or more ratings. Given three or more ratings both algorithms have a statistically significantly higher AveragePopularity@20 than SVD.

AILS@20 Just like with popularity, the damped algorithms show a very substantial increase in AILS@20 over Item-Item. In this case both algorithms have statistically significantly less diverse recommendations than Item-Item and SVD for all profile sizes tested. Interestingly, this decrease in diversity is large enough for damped Item-Item to actually be statistically significantly less diverse than the baseline recommendations for profile sizes 2-18. The combined algorithm has statistically significantly less diverse recommendations than the baseline for profile sizes 3-19.

Summary of small profile results

Overall, looking at coverage, we see that the full model and frequency optimized model both help to increase coverage, indicating that they serve their purpose of increasing the neighborhood size available for recommending. Likewise, looking at RMSE we see that the damping parameter makes a significant improvement in prediction quality. Comparing the full model with the frequency optimized model, we see that the full model makes a larger difference in algorithm behavior, which translates to a larger improvement in most circumstances. While the frequency optimized model only gives part of the benefit of the full model, it does so without increasing model size, which means no added memory cost to store the model and, as we will see, no impact on computation time.

Comparing the model size modifications with the damping parameter modification, we see that the damping parameter tends to have a more pronounced effect on performance. This is most evident when considering the popularity and diversity of recommendations: the full model has a small but meaningful effect on these measures of the nature of recommendations, whereas the damping parameter modification has a very pronounced effect.

Considering these modifications as a whole, the best modification, not surprisingly, is the frequency optimized + damped Item-Item algorithm. While we were not able to show an improvement over SVD, the best cold-start algorithm tested, we did show approximately equivalent performance on most metrics. The only interesting exception is that the full model Item-Item performed very well on meanRating@20; it is not clear if the cost associated with running this version of Item-Item is worth this specific improvement. Outside of this, the frequency optimized + damped Item-Item algorithm seems to be an equally strong competitor to SVD for new user recommendation.

5.4.2 Large Profile Evaluation Results

While the focus of this work is on users with small profiles, as we explore new algorithms explicitly for small profile performance it is important that we do not forget about the more standard large profile use case. In this light we carried out the large profile evaluation. The large profile evaluation is based on the same metrics and algorithms, but uses a more common user-based five-fold crossvalidation evaluation in which 20% of each test user's ratings are retained in the test set. This crossfolding strategy simulates the large profile sizes seen in MovieLens-like systems.

To test for statistical differences we will carry out pairwise Wilcoxon tests on the per-user metric values. To correct for multiple comparisons within the pairwise comparisons of any given metric we will use the Holm-Bonferroni method. As before, we will simplify these results to significant or not significant at the p < 0.05 level. Likewise, we will only report specific metric values where there are differences between our personalized algorithms.
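A sketch of this crossfolding strategy is shown below, assuming ratings arrive as (user, item, rating) tuples; the five folds and the 20% per-user holdout follow the text, while the function name and random seeding are ours.

    import random
    from collections import defaultdict

    def user_based_crossfold(ratings, n_folds=5, test_fraction=0.2, seed=42):
        """Yield (train, test) splits for a user-based k-fold evaluation.

        Users are partitioned into n_folds groups; for each fold, a fraction of
        every test user's ratings is held out for testing and the rest (plus
        all other users' ratings) are used for training.
        """
        rng = random.Random(seed)
        by_user = defaultdict(list)
        for r in ratings:
            by_user[r[0]].append(r)
        users = list(by_user)
        rng.shuffle(users)
        for f in range(n_folds):
            test_users = set(users[f::n_folds])
            train, test = [], []
            for user, user_ratings in by_user.items():
                if user in test_users:
                    held = user_ratings[:]
                    rng.shuffle(held)
                    cut = max(1, int(len(held) * test_fraction))
                    test.extend(held[:cut])
                    train.extend(held[cut:])
                else:
                    train.extend(user_ratings)
            yield train, test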

Coverage All algorithms have a coverage over 99%. While there are statistically significant differences (equivalent to those seen in the small profile evaluation), they are small enough that we do not consider them meaningful.

RMSE, nDCG We see no statistically significant difference in RMSE or nDCG among the Item-Item variants and the SVD algorithm. As might be expected, the baseline algorithms predict statistically significantly less accurately than the personalized algorithms.

MeanRating@20 Other than between baselines, we see only two statistically significant differences among the personalized algorithms: SVD has statistically significantly lower meanRating@20 than damped Item-Item (4.58 vs. 4.62) and Item-Item (4.63).

AveragePopularity@20 Metric values for the mean popularity of recommended items can be found in table 5.2. All pairwise differences are statistically significant. The metric values, and the resulting algorithm ranking, are rank-equivalent with the relative ranking at profile size 19. The baseline algorithms have the most popular recommendations, followed by the damped Item-Item variants, SVD, the non-standard Item-Item models, and finally Item-Item, which recommends the most obscure items.

AILS@20 Metric values for the average inter-list similarity of recommended items can be found in table 5.2. With one exception, all pairwise differences are significant. The one exception is the damped Item-Item algorithm and the Frequency Optimized + Damped Item-Item, which are not statistically significantly different. The baseline algorithms make the least diverse recommendations, with the damped Item-Item algorithms coming in second and third. The full model gives the fourth least diverse recommendations, followed by the Frequency Optimized Item-Item in fifth (third most diverse). SVD gives the second most diverse recommendations, and Item-Item gives the most diverse.

Algorithm                                  AveragePopularity@20   AILS@20   Recommendation Time (ms)
ItemBaseline                               1074.1                 0.0587    0.5
Frequency Optimized + Damped Item-Item      922.3                 0.0536    75.2
Damped Item-Item                            877.0                 0.0536    82.7
SVD                                         748.7                 0.0362    1.1
Full Model Item-Item                        702.6                 0.0378    249.0
Frequency Optimized Item-Item               662.8                 0.0368    75.1
Item-Item                                   439.3                 0.0347    82.7

Table 5.2: AveragePopularity@20, AILS@20, and recommendation time in milliseconds for the algorithms in a standard large-profile evaluation.

Recommendation Time Measurements of the average time (in milliseconds) to predict for every item and then rank these to form top-20 recommendation lists are provided in table 5.2. As applying the damping parameter to a prediction requires a negligible amount of time, Damped Item-Item and Item-Item do not show statistically significant differences. Likewise, Frequency Optimized + Damped Item-Item is not statistically significantly different from the Frequency Optimized Item-Item algorithm. All other pairwise differences are statistically significant.

Unsurprisingly, the ItemBaseline algorithm is by far the fastest, followed by SVD. Within the Item-Item algorithms we see a small improvement in recommendation time from using a frequency optimized model. That said, this difference is rather small, and may be subject to implementation differences in how item neighborhoods are identified from rated items in the LensKit code. As predicted, we also see that using a full Item-Item model leads to a substantial slowdown in recommendations, with the full model being three times slower. While it is possible that a different algorithm for building item neighborhoods could reduce this cost to some degree, it is unlikely that this cost will go away entirely with enhanced algorithms. For this reason we do not see a full model Item-Item as a reasonable algorithm for deployment.

Summary of results Based on these results we find that our modified algorithms do not appear to make any meaningful difference to the performance of the Item-Item algorithm for normal users, as measured by standard metrics of algorithm quality. This is good, as it implies that our algorithms do not trade quality on large profiles for quality on small profiles. That said, we also see that the modified Item-Item algorithms do lead to a substantial difference in the properties of the items recommended. Every modification leads to more popular, and less diverse, items being recommended. If only the frequency optimization modification is used this is a small effect. With the damping modification, however, this is a very large effect, with recommendations qualitatively more representative of the baseline than of the Item-Item algorithm. As the damping modification makes Item-Item more conservative about what predictions and recommendations it makes, it is possible that popular, non-diverse recommendations are an inherent property of conservative recommendations. This outcome could be problematic, as diverse, novel recommendations are typically a goal when designing recommendation algorithms.

5.4.3 Damping Parameter Tuning Analysis

The damping parameter is a new algorithm parameter introduced into Item-Item by our modifications. To understand how the algorithm changes with this parameter, how sensitive the parameter is to various misconfigurations, and what parameter values might be ideal, we carried out a tuning evaluation. This evaluation was performed using the train/test split used for the 8-rating point in the previous evaluation; thus each user in this evaluation has 8 ratings. Results of this analysis were generally consistent between different profile sizes, although a careful study of changes in these results as a function of profile size was not performed.

Figure 5.4 shows how the damped Item-Item algorithm's performance changes for values of the damping parameter ranging from 0 (pure Item-Item) to 3 (very aggressively damped) in steps of 0.05. The analysis was performed for damping parameters up to 5, but as there were no interesting trends between 3 and 5 on most plots we present only the range 0-3 to simplify the graphs.

Figure 5.4: The effect of varying the damping parameter in the damped Item-Item algorithm for users with 10 ratings.

Each graph is also shown with constant lines for an unmodified Item-Item and for both baseline algorithms. Only RMSE shows different results for the ItemBaseline and the UserItemBaseline; on all other plots these two algorithms appear as a single line.

On all metrics we see a very steep initial change from the behavior of Item-Item to the behavior of the damped Item-Item. On most metrics this trend terminates in a peak at a damping parameter between 0.25 and 0.45, after which the metric value either stays steady or declines to a new value. On RMSE and nDCG this peak is quite pronounced, and with higher values there is a sharp falloff. With very high damping, damped Item-Item's RMSE asymptotically approaches the ItemBaseline's performance (which is the value that Item-Item is damped towards). We see a similar trend in nDCG, although around a damping of 4 (not shown in the figure) the damped Item-Item algorithm crosses over and has worse nDCG than the baseline algorithms.

Of the recommendation metrics, we see the strongest peak from the AILS@20 metric. After this peak AILS@20 slowly drops towards a new value. Interestingly, this value is higher than that of the baseline algorithms, implying that even the small nudge that the Item-Item algorithm can make under very heavy damping is still enough to change the nature of recommended items towards being less diverse. The average popularity shows a similar trend, albeit with no significant peak, simply a rapid growth towards a new, very high value. Just like with AILS@20, the new value is a fair deal higher (more popular, less diverse) than the baseline algorithms. After the peak performance on MeanRating@20, the damped Item-Item algorithm does drop in quality with higher damping, albeit at a much smaller rate than we see in RMSE and nDCG.

Given this, we recommend carefully tuning the damping parameter with an eye towards these peak performance areas. Underdamping is likely rare, as optimal damping values are quite small and behavior changes rapidly up to the optimal value. Overdamping is characterized by behavior much like the baseline algorithms, with less diversity and more popular recommendations.

5.4.4 Discussion

Table 5.3 contains a summary of the findings of our evaluation of the Item-Item modifications. As can be seen, the modifications to our algorithm do make a meaningful difference in cold start performance. And, as we saw in the large profile evaluation, this does not come at a cost in prediction accuracy or recommendation quality for users with larger profile sizes.

Algorithm                       Coverage   Predict Time   Prediction accuracy     Recommendations         Popularity & Diversity
Item-Item                       Bad        Good           Bad                     Bad                     OK
Full Model                      Great      Bad            Good after 8 ratings    Great after 7 ratings   more popular, less diverse
Frequency Optimized             Good       Good           Good after 11 ratings   Good after 13 ratings   more popular, less diverse
Damped Item-Item                Bad        Good           Good after 5 ratings    Good after 12 ratings   much more popular, less diverse
Frequency Optimized + Damped    Good       Good           Great                   Good after 7 ratings    much more popular, less diverse

Table 5.3: Summary of results of our analysis of improvements to the Item-Item algorithm.

While none of our modifications is reliably better than the other leading approach for new user recommendation, the SVD algorithm, we did find the Frequency Optimized + Damped algorithm to be comparable in almost every comparison. This leaves Frequency Optimized + Damped Item-Item solidly tied for best among recommendation algorithms for new users. Unfortunately, just as we saw in the evaluation in section 4.2, none of our algorithms reliably outperform the UserItemBaseline for users with very small profiles.

While we do see some small advantage to using a full model Item-Item recommender over the Frequency Optimized + Damped Item-Item (better recommendations), we find that this comes with a very substantial cost in terms of added computation. Therefore, we do not consider this a feasible strategy. Our other strategies, however, both served their purposes. Using a frequency optimized Item-Item model leads to better coverage, a clear sign that more of each user's ratings are usable on average. Likewise, the damping parameter leads to much more accurate predictions, and consequently better recommendations.

The one unexpected outcome of these changes is that every modification had an effect on the popularity and diversity of recommended items. Just as damping has a larger impact on prediction and recommendation quality, it also has a larger effect on popularity and diversity. It is possible that both of these effects are driven by the algorithm being more conservative. The idea behind the damping strategy is to prevent Item-Item from making extreme recommendations without evidence. Therefore, the algorithm behaves much more conservatively, only making "safe" recommendations, either for items that are widely regarded as good, or for items for which there is particularly compelling evidence that the user will like them. It is easy to see how this behavior could lead to an increase in popularity and a decrease in diversity.

While it is commonly assumed that users will want more diverse recommendations for less popular items, it is not clear at this point whether that is true for users with small profiles, especially users who are new to a system. The impact of this enhanced popularity and reduced diversity among recommended items will vary from system to system and user to user. However, we have good reason to believe from past work that, in the general case, more diversity, not less, would be preferred [137, 147]. Exploring post-algorithm diversification alongside the Item-Item modifications is left for future work. If post-algorithm diversification does not produce meaningful improvements, it is possible that this property alone could be reason not to deploy the modified Item-Item algorithms for users who have "enough" ratings to use an algorithm that is not modified for new-user performance.

5.5 A User Evaluation of Damped Item-Item Algorithms

Offline evaluations are limited in how accurately they can predict the behavior and satisfaction of actual users when faced with a novel algorithm. While we can approximate this to some degree with the careful application of offline test methodologies and several well chosen metrics, there will always be a limit to this approximation. Therefore, both to confirm our finding that Item-Item is a poor choice for new user recommendation and to confirm that our modifications can ameliorate this issue, we will finish this chapter with a user evaluation.

In previous sections of this chapter we have focused on two separate but important conditions under which users may have small profiles: new users, and users in systems like BookLens whose different rating dynamics lead to persistently small profile sizes. While both of these situations lead to the same algorithmic focus, predicting and recommending given almost no information, the user needs and motivations involved are very different. This experiment will therefore focus only on new users. Not only are the recommendations received likely more important to long-term use of the system for new users, it is also easier to recruit people who do not use a system than it is to recruit from a system like BookLens.

Furthermore, this section focuses on the damping modification to the Item-Item algorithm. While the frequency optimized Item-Item algorithm does make meaningful improvements to algorithm quality, these improvements are largely focused on coverage and small improvements to algorithm quality, and would be less visible in a recommendation-centered evaluation. Damping, on the other hand, has three interesting properties:

• Damped Item-Item allows for a smooth transition from the behavior of an unmodified Item-Item, through the superior damped Item-Item behavior, to the behavior of a simple baseline as the damping becomes larger.

• According to our offline analysis, damped Item-Item substantially changes recommendation properties (both quality as measured by meanRating@20, and novelty/diversity properties) compared to Item-Item.

• The damping parameter used in damped Item-Item explicitly controls a trade-off between more personalized recommendations (even if those recommendations are guesswork) and non-personalized recommendations.

This final property has not been well studied to our knowledge, but may be of interest to users who are exploring a new recommender system. As we will explain in this section, our findings are broadly consistent with the offline analysis: unmodified Item-Item performs poorly and is not separable from random recommendations on most survey questions. Unlike in our offline analysis, however, we find that we are not able to outperform the baseline algorithm (recommending items by popularity in this case). Possibly because of this finding, we also find that much larger damping parameters than expected achieve near-optimal performance in an online setting.

5.5.1 Experiment Design

The experiment ran from August 29, 2016 to September 07, 2016, with recruiting done through the Amazon Mechanical Turk service. The experiment was broken down into three phases. After participants agreed to participate, they were taken to the first phase of the experiment, where we elicited ratings from them using a standard new-user rating survey. Then we showed the participant their recommendations, ostensibly based on their ratings (although not all algorithms actually used their ratings). Finally, after the participant had seen their recommendations, we had a two-page survey for them to take regarding how well they liked their recommendations. We will present the details of each of these steps in turn.

Recruitment

Participants were recruited through the Amazon Mechanical Turk service. Amazon Mechanical Turk is a platform for posting "human intelligence tasks", short tasks which cannot be done reliably with computer intelligence but can be done easily by humans. Workers are offered a fixed payment before accepting the task, which is granted upon completion of the task. While typical tasks might involve labeling images, Amazon Mechanical Turk has become an increasingly popular method of recruitment for virtual lab experiments.

We opted not to recruit users from a previous recommender system for this experiment, as we explicitly want participants who are new to our recommender system and do not have a relationship of trust with our recommendation technologies. It is likely that users of an existing live system would have a background that could alter how they interpret our recommendations compared to a user who might be using a recommender system for the first time.

We offered participants a payment of $0.75 based on an estimate that our experiment would take approximately 5 to 10 minutes to complete. This was calculated so that a worker who took around 5 minutes would be earning $9 per hour, which was the Minnesota minimum wage during the planning phases of this experiment. In practice we found this estimate to be approximately accurate: the median completion time among our participants was 5.4 minutes. To reduce the likelihood of recruiting workers who might not engage with the survey in good faith, we restricted our experiment to workers who have had 1000 tasks approved in the past and have a 97% (or higher) approval rate. We also restricted to workers located in the United States of America. These types of restrictions are commonplace, especially when recruiting online participants [97, 125]. These restrictions are particularly important as ratings and algorithm evaluations are inherently subjective and therefore hard to verify for accuracy.

Figure 5.5: The ratings page of our user survey

Eliciting Ratings

The first part of the survey was a standard informed consent document. The basic information contained in this document was also available to Mechanical Turk workers before they chose to accept the task on Mechanical Turk. After agreeing to participate in the experiment, a participant would be taken to the page seen in figure 5.5. This page presents 24 movies for the participant to rate on a five-star interface with half stars. We ask the participants to rate as many of these movies as they can. If a user does not rate any movies, we ask them to rate movies they think they would like; when this happens we record that the ratings are not authentic. This only happened for 3 participants. As we do not have enough data to estimate whether there are differences between these modes of preference formation, we exclude these three participants from analysis.

The movie and rating data in our survey is based on the MovieLens 20m dataset; details on this dataset are available in table 2.3 in section 2.9. Images for each movie, as well as a link to more information, are provided through TMDB, based on the linking between MovieLens movie ids and TMDB movie ids maintained in the MovieLens system. While using MovieLens 20m does mean the data was slightly out of date, likely not including the newest movies, it also allowed the survey to be distributed open source with all required resources for independent replication.²

The specific list of 24 movies was chosen using the Pop*Ent strategy from the 2002 report on learning new user preferences by Rashid et al. [103]. This approach was the best performing item selection strategy in terms of balancing the selection of items that users can reliably rate (so that users can be onboarded with a minimum amount of effort) against creating a user profile that leads to accurate predictions. In this approach each item is scored for its popularity (the number of users who have rated the item) and the information value of its ratings (the entropy of ratings on that item). By multiplying the log of the popularity with the entropy we can select items that are both informative to have rated and that the user is likely to be able to rate. While other strategies have certainly been developed in other studies, this approach seemed to give a reasonable selection of items and is relatively typical of the balanced approach that a traditional ratings-based recommender system might use when building a new user survey to onboard new users.

To avoid biasing towards older movies, which participants might be less likely to be able to rate, we only used ratings made within a year of the last rating in the MovieLens 20m dataset for popularity calculations. This modification led to a selection of more recent movies, with older, but still well known, movies included. We also found that the unmodified algorithm tended to favor multiple movies in any given series; presumably, movies in a series have similar rating properties and popularity, leading to co-selection.
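A sketch of the Pop*Ent scoring as described above, assuming we have the list of star ratings each candidate movie received in the recent slice of the dataset; the names are ours and the choice of logarithm bases is illustrative rather than taken from Rashid et al.

    import math
    from collections import Counter

    def pop_ent_scores(ratings_by_item):
        """Score items by log(popularity) * rating entropy (the Pop*Ent idea).

        ratings_by_item: dict movie_id -> list of star ratings from the recent
        slice of the dataset (a hypothetical layout). Higher scores favor
        movies many users can rate and whose ratings are informative.
        """
        scores = {}
        for item, ratings in ratings_by_item.items():
            if not ratings:
                continue
            n = len(ratings)
            counts = Counter(ratings)
            entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
            scores[item] = math.log(n) * entropy
        return scores

    # e.g. the 24 highest-scoring movies become the rating-survey candidates:
    # survey_items = sorted(scores, key=scores.get, reverse=True)[:24]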

² The sources, including all data needed to run the web server for this evaluation, can be found at https://bitbucket.org/Kluver/recommender-sterotyping

Guardians of the Galaxy (2014)                                   Avatar (2009)
Gravity (2013)                                                   The Matrix Revolutions (2003)
The Wolf of Wall Street (2013)                                   The Lord of the Rings: The Return of the King (2003)
The Grand Budapest Hotel (2014)                                  Titanic (1997)
Inception (2010)                                                 Kill Bill: Vol. 1 (2003)
Star Wars: Episode II - Attack of the Clones (2002)              Interstellar (2014)
Pirates of the Caribbean: The Curse of the Black Pearl (2003)    Shrek (2001)
The Hobbit: The Desolation of Smaug (2013)                       The Avengers (2012)
Gladiator (2000)                                                 Her (2013)
The Hunger Games: Catching Fire (2013)                           300 (2006)
Django Unchained (2012)                                          Captain America: The Winter Soldier (2014)
The Lego Movie (2014)                                            Boyhood (2014)

Table 5.4: The list of movies used in the user evaluation of the damped Item-Item algorithm.

We believed that multiple ratings on items in one series might be less useful than a more diverse set of movies to rate, and therefore we manually removed movie series, retaining the first movie selected in any series. We coded the list of series manually based on repeated inspection of the list of movies that would be shown to participants. For this coding we chose to consider the Lord of the Rings series and the Hobbit series as separate series, despite their being closely connected and both being based on the works of J. R. R. Tolkien. Likewise, we chose to consider the Avengers series and the Captain America series as independent despite both being part of the broader Marvel cinematic universe. The final list of movies selected for this interface is shown in table 5.4.

It is important to the goal of this study that users are not only new, but have reasonably small user profiles. Therefore we performed a simple simulation, both to compare possible new user item selection scenarios and to evaluate how many ratings we can expect users to have under this scenario. We picked 1000 random users from the MovieLens 20m dataset who had rated movies within one year of the most recent rating in the dataset. We then counted how many of our 24 movies they had previously rated as a proxy for how many movies experimental users would be able to rate. We found that most (92.1%) of users would be able to rate at least one movie, with a median of 7, a first quartile of 3, and a third quartile of 12. While this means that some users (the top quartile) will be entering 12 or more ratings, on the high side for our definition of a small user profile, most users would indeed be in a good range of ratings for our experiment.
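A sketch of that simulation, assuming timestamped (user, item, rating, timestamp) tuples from the MovieLens 20m dataset and the 24 selected survey movies; the helper name and exact windowing are ours.

    import random
    import statistics
    from collections import defaultdict

    def simulate_rateable_counts(ratings, survey_items, n_users=1000, seed=1,
                                 window_seconds=365 * 24 * 3600):
        """Estimate how many survey movies a typical recently-active user could rate."""
        survey = set(survey_items)
        latest = max(ts for _, _, _, ts in ratings)
        rated = defaultdict(set)          # user -> set of items they have rated
        recent_users = set()              # users active within the last year of data
        for user, item, _, ts in ratings:
            rated[user].add(item)
            if ts >= latest - window_seconds:
                recent_users.add(user)
        sample = random.Random(seed).sample(sorted(recent_users), n_users)
        counts = [len(rated[u] & survey) for u in sample]
        return statistics.median(counts), statistics.quantiles(counts, n=4)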

Recommendation

After entering ratings, participants were taken to a page that showed a list of 12 recommended items. An example of what this looked like to a participant can be seen in figure 5.6. These items were selected using a top-12 strategy based on the underlying algorithm. As the algorithm was the manipulation for this experiment, each user was randomly assigned to one of six conditions, each with a different associated algorithm:

• Random - Recommend random items

• Item-Item - A traditional item-item algorithm

• Damped Item-Item (0.05) - a damped Item-Item with a very small damping value

• Damped Item-Item (0.3) - a damped Item-Item with a reasonably optimal damping value

• Damped Item-Item (2.5) - a damped Item-Item with a very large damping value

• Popularity - recommend the most-rated items that the user has not already rated. As before, we only count ratings made within one year of the last rating in the MovieLens 20m dataset (this selection is sketched immediately after this list).
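A sketch of the popularity condition, again assuming timestamped rating tuples; the one-year window and the exclusion of already-rated items follow the description above, while the rest is ours.

    from collections import Counter

    def popularity_recommendations(ratings, user_rated, n=12,
                                   window_seconds=365 * 24 * 3600):
        """Most-rated items in the final year of the dataset, excluding items
        the participant has just rated, truncated to a top-n list.

        ratings: iterable of (user, item, rating, timestamp) training tuples;
        user_rated: set of items the participant rated on the survey page.
        """
        latest = max(ts for _, _, _, ts in ratings)
        counts = Counter(item for _, item, _, ts in ratings
                         if ts >= latest - window_seconds and item not in user_rated)
        return [item for item, _ in counts.most_common(n)]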

Figure 5.6: The recommendations part of the third page of our user survey

As would be typical in a live system, we trained the Item-Item model once on the MovieLens 20m dataset, meaning that new user ratings did not affect item-item similarity estimates. While we would have liked to include an SVD algorithm for comparison, when this study was run LensKit did not have reasonable-quality support for training SVD vectors for new users in real time. When we tested SVD as a condition we therefore found very poor new user performance that did not seem indicative of what a well implemented SVD for new users should be able to accomplish based on our earlier analysis.

The damping values for the parameters were chosen as follows. The middle damping value (0.3) was chosen based on the optimal tuning value from the previous section. The small and large damping values were chosen somewhat arbitrarily to begin with. After that, using the same simulation as we used in the previous section (1000 random users from MovieLens 1m), we also tested which items would be recommended to the simulated users by each condition. We then re-tuned our small and large damping values to ensure that there was a reasonable number (at least 3 or 4 on average) of different items recommended by each condition. With too small a damping value, Item-Item and minimally damped Item-Item would recommend essentially the same items, making these conditions redundant. On the other hand, with too large a damping value, the minimally damped Item-Item and the "optimal" (RMSE-tuned) damped Item-Item would be redundant. Likewise, with too large a damping value for the highly damped Item-Item, we would see popularity and highly damped Item-Item behaving redundantly. We found the chosen values to be suitable midpoints between these extreme behaviors, allowing a relatively comprehensive study of the continuous space from unmodified Item-Item, through an optimal damping, to a baseline algorithm's behavior.

As mentioned in the introduction to this section, we chose to focus this experiment on the larger changes and tradeoffs involved in the damped Item-Item algorithm. Therefore, to limit the number of experimental conditions and reduce the recruitment costs of the experiment, we do not have an experimental condition for frequency optimized variants of the Item-Item algorithm. User-centered evaluation of these algorithm changes is left for future work. As the changes in recommendation properties brought about by frequency optimized Item-Item models are quite small, we note that a more sensitive study based on this design may be necessary to fully understand the differences from the frequency-optimized model.

Survey

The user survey for this experiment is separated into two parts. The first part is a 23-question survey on a five point Likert scale from "Strongly Disagree" to "Strongly Agree". In deference to the privacy of our participants we also included a "Prefer not to answer" option for each question. The questions can be seen in table 5.5. These questions were designed with the hope of eliciting several different aspects of how the user perceived the recommendations:

• Did the participant like the recommended movies?

• Did the participant perceive the movies as novel?

• Did the participant perceive the recommended movies as diverse?

• Did the participant feel like the recommendations were personalized?

• Was the participant satisfied with their recommendations?

• Would the participant use a recommender system based on this algorithm?

• How experienced is the user with online movie communities like MovieLens?

These questions are about the list of movies presented above.
  The list of movies matched my preferences.
  The recommended movies look like they would be fun to watch.
  The list of movies contains too many bad movies.
  The list of movies contains many movies that I did not expect.
  The list of movies contains many movies that are familiar to me.
  The movies recommended to me are novel.
  The list of movies is varied.
  The list of movies has many kinds of movies in it.
  Many of the movies are similar to each other.
  I see how the recommended movies related to my rated movies.
  These movies don't make sense given what I've already rated.
  These movies seem personalized for me.
  The list of movies could reasonably be recommended to anyone.
We are considering building a movie recommendation system based on the process we used to generate the above recommendations. These questions are about your interest in such a movie recommender.
  I would enjoy using a movie recommender that gave recommendations like these.
  If a recommender like this were available to me, I would use it frequently.
  I would not use a recommender that gives recommendations like this one.
  Overall, I am satisfied with the list of recommended movies.
  I think a recommender system like this could help me find movies to watch.
  If a recommender system like this existed I would recommend it to my friends.
These questions are about you.
  I'm a movie lover.
  Compared to people I know, I read a lot about movies.
  I regularly use the internet to learn more about movies.
  Compared to people I know, I'm not an expert on movies.

Table 5.5: The questions in the first phase of our survey

We acknowledge that this is a large and varied list of factors for this study, and that it was unlikely we would be able to elicit all of these separate factors. This list of factors was driven by two goals for this experiment. The first goal was to confirm our improvements over Item-Item; this could have been done with a simpler experiment than we chose to run. The second goal, however, was to explore the ways in which users are sensitive to the various changes in degree of personalization, popularity, and diversity that can be smoothly tuned through the damping parameter. As there was little extra cost in exploring this behavior, we chose to pursue the full list of questions, despite the increased risk of failing to elicit some factors.

The specific questions are a combination of novel questions and, where possible, adaptations of questions from several previously published surveys [38, 88, 102, 136]. We looked specifically for questions that had proved successful in the past at eliciting the factors we were interested in. Several of the questions were slightly adapted to the format of our experiment.

The second part of the survey was on the following page and, for each recommended item, asked the participant to label whether they thought the item was a good, bad, or neutral recommendation for them. Additionally, each item had a check box asking whether the participant had seen that item before. After this we had three free-text questions: "If you marked any movies as good recommendations, please explain why. Otherwise, please explain what would have been a good movie for you.", "If you marked any movies as bad recommendations, please explain why. Otherwise, please explain what would have been a bad movie for you. You can choose to leave this blank.", and "Do you have any other feedback to help us improve our recommendations? You can choose to leave this blank." The purpose of these questions was to collect a dataset allowing more open discovery of the factors that seem important to new users in whether they accept or reject a recommendation.

Final page

After both pages of the survey the participant was taken to a final page that contained a randomly generated code. Each survey was associated with a separate code. Participants were instructed to copy this code into Mechanical Turk as proof of completion, allowing us to be sure that only people who had taken the survey would be paid for participation. We also used this code, and records of its use, to help filter out potential cases of abuse where a code was used more than once.3

3 It was our understanding at the time that we had configured Mechanical Turk to not allow this situation; however, this did occur in our dataset. Future experiments should add precautions against this.

5.5.2 Results

Overall, we had 450 participants. We filtered these in several ways to avoid possibly dishonest or inaccurate records. First, we removed any survey result whose associated code was claimed by multiple Mechanical Turk workers. Secondly, we removed survey results from participants who, contrary to our posting, conducted the experiment more than once. Finally, as mentioned earlier, we also removed survey results where the user did not make any ratings before attempting to move on to the rest of the survey (these users were instructed to enter fake ratings). After removing these results, we have a collection of 427 survey results.

The first property we checked was the distribution of the number of ratings per user. Ideally this should be on the low side (4-8 ratings): few enough ratings that Item-Item should perform much worse than we would otherwise expect, but enough ratings that we shouldn't expect the baseline to be dominant. We found more ratings per user than we were expecting based on our simulation with the ML20M dataset. The mean number of ratings per user was 14.6, with a median of 15. 95% of surveys had 5 or more ratings, 72% had more than 10, and 19% had more than 20. 46 survey results (10.5%) rated every item.

There are several reasons why our participants might have rated more movies than our prior analysis, based on MovieLens users, led us to expect. One explanation is that our item selection method was biased towards newer movies, which might have continued growing in how many people have viewed them after the MovieLens 20M dataset was created. Likewise, it is possible this is simply an outcome of being explicitly requested to rate this set of items, whereas MovieLens users may never have been prompted to rate some of these items. Alternatively, it is possible that Mechanical Turk users and MovieLens users differ in some meaningful way in their movie viewing or rating behavior. Ultimately, this is ancillary to what we sought to understand. While this is a higher

range of ratings per user than we were expecting, we still can and do expect to see differences between algorithms.

Next we analyze the Likert-style questions. There are several ways that a survey of this style could be analyzed. We explored two methods: analyzing each survey question independently, or analyzing them together using a factor analysis / structural equation modeling (SEM) approach. In the SEM approach we assume that answers to our 23 Likert-style questions were driven by a much smaller number of latent perceptions held by the user. We performed both analyses and found that they generally agree. Therefore, for the sake of brevity, and because the SEM analysis is slightly more sensitive, we will focus on the SEM-style analysis.

For all of the following CFA and SEM analyses we used the lavaan package in R [109]. The model fitting methods in lavaan were able to treat all survey answers as ordinal data (not numeric data). Unfortunately, the model fitting was not able to use partial observations; therefore we were not able to include users who chose not to answer one of our survey questions in this analysis. The following models were therefore trained on the 384 survey responses in which the participant did not respond to any question with "Prefer not to answer". All models were fit with a least squares estimator and with the standard deviation of latent factors constrained to 1.

Ideally the latent factors driving survey responses should line up with the perceptions we designed the survey to collect. Unfortunately, that was not the case. To see this we performed a confirmatory factor analysis based on our intent for the survey questions. We found that this model was a poor fit. In particular, the RMSEA model fit/error metric was 0.1333. For reference, Hu and Bentler propose RMSEA < 0.05 as a cutoff for a good fit, and ideally the upper bound of a confidence interval for RMSEA should be below 0.1 [59]. Furthermore, we found very large covariances between latent factors, which was further indicative of a problem.

Therefore, to understand what problems were causing the poor fit, and to begin developing a new explanatory model for our survey results, we turned to an exploratory factor analysis. This was done using the fa method in R, with our Likert answers coded as numbers 1-5. Based on this approach we found that 3 latent factors was the best number to explain this data. With any more factors, the new factors were not well measured by any question. Roughly speaking, we found that our factor for the user's expertise / past experience with this type of system remained as intended. Likewise, a subset of our diversity questions seemed to form a second factor. The remaining questions all correlated highly with each other and formed the major factor of our survey, which roughly speaking captures the "goodness" of the algorithm. At this phase we also added the number of items rated in the rating phase, and the number of recommendations that had been marked as seen before, good recommendations, or bad recommendations on the second survey page. Broadly, we found that the number of recommendations marked as seen, good, or bad all cluster with "goodness", and the number of ratings entered clusters with expertise. With this knowledge in hand we returned to a CFA analysis to test and refine this measurement model.
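To make this workflow concrete, the following is a minimal sketch of how an ordinal CFA of this kind can be fit with lavaan. It is not the analysis script used for the thesis: the data frame `responses` and the q_* column names are hypothetical placeholders, and the estimator shown is only one of lavaan's least-squares options (the text above only says a least squares estimator was used).

```r
# Minimal sketch of an ordinal CFA in lavaan; `responses` and the q_* names are
# hypothetical placeholders for the survey data and items.
library(lavaan)

measurement_model <- '
  goodness  =~ q_matched + q_fun + q_familiar + q_related + q_satisfied
  diversity =~ q_varied + q_kinds
  expertise =~ q_internet + q_lover + q_not_expert
'

fit <- cfa(measurement_model,
           data      = responses,
           ordered   = names(responses),  # treat the Likert answers as ordinal
           estimator = "ULSMV",           # one of lavaan's least-squares estimators
           std.lv    = TRUE)              # constrain latent variances to 1

fitMeasures(fit, c("cfi", "tli", "rmsea", "rmsea.ci.upper"))
modindices(fit)   # modification indices, used to guide model refinement
```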
Model refinement was guided by modification indices. We removed any survey question that appeared to cross-load on multiple factors, or to not load well on any of our current factors. After this refinement we returned to SEM. We began with an SEM model allowing perception of diversity and self-evaluated expertise to affect both perception of goodness and each other. While we found statistically significant trends in our analysis for diversity and expertise to impact goodness, we did not find any statistically significant relationship between the two. Therefore we do not include such a relationship in our final SEM model.

Our final model is a marginal fit (CFI = 0.979, TLI = 0.976, RMSEA = 0.098, 90% CI: [0.092, 0.104]). While this is somewhat at the extremes of Hu and Bentler's recommended cutoffs, this is likely sufficient for our more exploratory purposes [59]. Question loadings are summarized in Table 5.6, and relationships between conditions and factors can be found in Table 5.7. For our analysis we dummy-coded the experimental conditions, using the random algorithm as a baseline. As such, our tables report only statistical significance against the random baseline. By coding other conditions as the baseline we can measure significance across other pairs of algorithms. We report statistical significance for these other comparisons as needed. Unless otherwise noted, we will use a p < 0.05 cutoff for statistical significance.

As expected, we found that diversity has a positive effect on perceived goodness of the algorithm. This is consistent with past research on diversity. We also found that self-reported expertise has a positive relationship with perceived goodness. Given that our participants were recruited from Amazon Mechanical Turk, it is likely that those with a higher expertise rating were also generally more interested in movies and movie recommendation.

"Goodness" factor
  The list of movies matched my preferences. [0.726]
  The recommended movies look like they would be fun to watch. [0.722]
  The list of movies contains many movies that are familiar to me. [0.565]
  I see how the recommended movies related to my rated movies. [0.695]
  These movies don't make sense given what I've already rated. [-0.704]
  I would not use a recommender that gives recommendations like this one. [-0.583]
  Overall, I am satisfied with the list of recommended movies. [0.756]
  I think a recommender system like this could help me find movies to watch. [0.724]
  (The number of recommended items marked as good to recommend, not normalized) [1.799]
  (The number of recommended items marked as bad to recommend, not normalized) [-1.889]
  (The number of recommended items marked as seen, not normalized) [0.927]

Expertise factor
  I regularly use the internet to learn more about movies. [0.747]
  Compared to people I know, I'm not an expert on movies. [-0.443]
  I'm a movie lover. [0.905]
  (The number of movies the user rated during the rating phase, not normalized) [1.684]

Diversity factor
  The list of movies is varied. [0.930]
  The list of movies has many kinds of movies in it. [0.855]

Unused questions
  The list of movies contains too many bad movies.
  The list of movies contains many movies that I did not expect.
  The movies recommended to me are novel.
  Many of the movies are similar to each other.
  These movies seem personalized for me.
  The list of movies could reasonably be recommended to anyone.
  I would enjoy using a movie recommender that gave recommendations like these.
  If a recommender like this were available to me, I would use it frequently.
  If a recommender system like this existed I would recommend it to my friends.
  Compared to people I know, I read a lot about movies.

Table 5.6: The questions in our survey and their factor loadings (shown in brackets). The factor loadings for all used questions are statistically significant (p < 0.01).

                 Goodness   Diversity
  Expertise        0.431        -
  Diversity        0.680        -
  Item-Item        0.684      (n.s.)
  Damped (0.05)    1.202      (n.s.)
  Damped (0.3)     2.339      -0.583
  Damped (2.5)     2.045      -0.381
  Popularity       2.209      -0.258

Table 5.7: Regression coefficients between factors, and for conditions (relative to random recommendation), on the goodness and diversity factors. All results shown are significant at a p < 0.05 level.

Looking at the experimental conditions we see results similar to those from our offline analysis. Unsurprisingly, random recommendations were seen as more diverse than all other conditions, although undamped and minimally damped Item-Item are not statistically significantly less diverse than random. As predicted in our offline analysis, we see the least diversity with our "optimal" (0.3) damping, with more diversity with either more or less damping (this is statistically significant, except in comparison to the highly damped (2.5) algorithm).

Looking at the goodness factor, we see that all algorithms were better than random recommendations. As we would expect from our offline analysis, the undamped Item-Item performs worse than all damped varieties and the popularity baseline. Additionally, we found that the under-damped Item-Item (0.05) is worse than the three best algorithms: Item-Item with a moderate (0.3) or very high (2.5) damping parameter, and the popularity baseline. The three best algorithms are not statistically significantly different from each other; all other pairwise differences are statistically significant. Looking at the overall effect on goodness (both direct and indirect, through changes in diversity influencing goodness) we see equivalent results. The only difference is that the popularity algorithm is evaluated as (not statistically significantly) more good than the moderately damped Item-Item algorithm. This ranking switch is due to the fact that the popularity algorithm generates more diverse recommendations than the moderately damped Item-Item algorithm.

At a high level the goodness results follow our offline evaluation; however, we see the higher-damped algorithms performing better than expected relative to the 0.3-damped Item-Item. Furthermore, the popularity baseline condition performs much better than we would expect. This may serve as evidence that traditional offline evaluations not only have missed detectable failings in standard algorithms, but also are not measuring the right properties to let us accurately predict which algorithms new users will like.

Item Level Feedback

In an attempt to dig somewhat deeper into the causes of our findings, we explored the item-level feedback we got on the last page of our survey. On the last page of the survey we asked users to label each recommended item as good, bad, or neutral to recommend, as well as to mention if they had previously seen those items. Additionally, we gave users the opportunity to further explain themselves in free text.

We will not be exploring the free-text responses in this evaluation. While there may be useful insights in this data, without the application of proper qualitative coding methodologies (such as grounded theory coding) such an exploration would be anecdotal at best. Anecdotally, many users were not able to disambiguate what attracted them to certain items on their own, often citing that the movies were interesting genres or that the good movies simply looked interesting. This may indicate that free-text methodologies are not suitable for eliciting what attracts a user to certain recommendations, that our prompt was insufficient, or that users are not often aware of what makes them think an item looks good. Regardless, we consider further exploration of the free-text data out of scope for this project.

The recommendation rating data was already considered, in aggregate, as part of the SEM model presented in the previous sub-section. Those results preview the results we will see here. Having seen recommended items, and having marked them as good, was correlated with liking the recommendations as a whole. Additionally, the number of recommended items marked as bad was negatively correlated with liking the recommendations as a whole.

Digging deeper into these results, we might expect a relationship between whether an item is seen or unseen and a user's evaluation of that item. Table 5.8 shows the conditional probability of item responses based on whether the user claimed to have seen the item.

            Unseen    Seen
  Bad        33.1%    5.8%
  Neutral    39.7%   12.8%
  Good       27.2%   81.4%

Table 5.8: Conditional probability of item response based on seen or unseen

               Random   Item-Item   Damped (0.05)   Damped (0.3)   Damped (2.5)   Popularity
  Bad           48.5%      36.8%        28.7%           11.6%           9.0%         12.2%
  Neutral       37.7%      38.8%        33.4%           28.5%          26.8%         23.7%
  Good          12.7%      24.5%        37.9%           60.0%          64.2%         64.1%
  Seen           3.9%      12.7%        26.6%           44.7%          41.9%         55.4%
  Seen Bad       0.6%       0.5%         1.8%            2.5%           1.5%          3.6%
  Unseen Good   11.5%      15.3%        17.2%           20.6%          27.7%         21.1%

Table 5.9: Conditional probability of item response based on user condition. This table also displays the probability of having seen the item in question given the condition.

As you can see, there is a drastic difference, with seen movies being much more likely to be evaluated as good. Note that we do not report statistical significance on these numbers: as they are averaged over each recommendation, not each user, statistics directly based on the number of such recommendations would be deceptively over-confident. Given that neither the rate of having seen the item nor the user's evaluation was directly manipulated by our experimental conditions, it is neither obvious how to evaluate statistical significance of this trend, nor what value such a p-value might have. Suffice to say that, from what data we do have, and how large a difference we see in these distributions, this is unlikely to be an artifact of random chance.

To better understand how this effect plays out in practice, we looked at the probability of having seen recommended items, and the user's evaluation of those recommendations, conditional on our assigned condition (Table 5.9). As can be seen, these results follow what was found in the SEM analysis: random and non-damped algorithms are evaluated poorly, with higher damping and popularity being evaluated well. Additionally, the rate of having seen a recommended movie goes up with increased damping, and is highest when sorted by popularity.

We evaluated statistical significance on these values using a resampling-based permutation test (a minimal sketch of this procedure is given at the end of this subsection). Under the null hypothesis our condition labels are unrelated to the users' evaluations; therefore, for each comparison we tried 1000 random permutations of the condition labels to users and recomputed the table above. We only consider two probabilities different if fewer than 5% (p = 0.05) of the randomly relabeled iterations displayed a difference in conditional probabilities as large as that observed under the actual condition labels. Given the number of comparisons made, this p = 0.05 cutoff should be viewed as marginal evidence more than strong evidence, which is in line with the exploratory nature of this part of the analysis.

On all three metrics (rate of good item recommendation, rate of bad item recommendation, and rate of seen item recommendation), the moderately and highly damped Item-Item do not statistically significantly differ from each other or from the popular item recommendations, and all three are statistically better than the other algorithms. The minimally damped Item-Item algorithm generates more seen and more good recommendations than both random and undamped Item-Item, and produces fewer bad recommendations than random. The undamped Item-Item produces more good, and fewer bad, recommendations than random, but its rate of seen item generation is not statistically different.

Finally, we computed the chance of recommending an item that the user had already seen, but disliked, or that the user had not seen, but thought they would like. These conditions represent the ultimate failure of the algorithm (recommending something the user knows they didn't like) and one version of the ultimate success of the algorithm (helping someone find something new that they might like).
The rate of seen bad movies is low for all algorithms, being largest for the moderately damped Item-Item and the popularity algorithm, which is largely consistent with the rates of recommending seen items. Due in part to the low rates, only the most extreme comparisons (undamped Item-Item or random recommendations compared with popularity) are statistically significant. The rate of unseen good movies is moderate, highest for the heavily damped Item-Item and lowest for random recommendations. Random makes statistically significantly fewer unseen good recommendations than the moderately and highly damped Item-Item variants and the popular item recommendation. The highly damped Item-Item algorithm is also statistically significantly different from both the undamped and minimally damped Item-Item algorithms.
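The following is the minimal sketch of the user-level permutation test referenced above. It is illustrative rather than the script we used: the data frame `items` and its columns (user, condition, good) are hypothetical placeholders, and only the rate of good recommendations is shown; the other metrics follow the same pattern.

```r
# Sketch of a resampling permutation test for the difference in the rate of "good"
# recommendations between two conditions. `items` has one row per recommended item,
# with hypothetical columns: user, condition, good (logical).
set.seed(42)

good_rate_diff <- function(items, cond_a, cond_b) {
  abs(mean(items$good[items$condition == cond_a]) -
      mean(items$good[items$condition == cond_b]))
}

permutation_p <- function(items, cond_a, cond_b, n_perm = 1000) {
  observed  <- good_rate_diff(items, cond_a, cond_b)
  users     <- unique(items$user)
  user_cond <- items$condition[match(users, items$user)]
  exceed <- 0
  for (i in seq_len(n_perm)) {
    # shuffle condition labels across users, keeping each user's items together
    relabeled <- setNames(sample(user_cond), users)
    shuffled  <- items
    shuffled$condition <- relabeled[as.character(shuffled$user)]
    if (good_rate_diff(shuffled, cond_a, cond_b) >= observed) exceed <- exceed + 1
  }
  exceed / n_perm   # fraction of permutations at least as extreme as the observed difference
}
```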

5.6 Summary

In this chapter we set out to understand and fix Item-Item collaborative filtering for use with small rating counts. In the previous chapter we showed that Item-Item performs very poorly for users with few ratings. Here we discovered this was due to two factors. First, standard model truncation tactics substantially reduce the number of ratings that can be leveraged when making predictions, leading to reduced prediction accuracy and substantial coverage issues. Secondly, the undamped weighted average scoring equation is unsuited for use when the number of neighboring ratings is small, accounting for much of the accuracy and recommendation quality issues.

We presented two ways to fix these issues. The frequency optimized model truncation strategy favors retaining more commonly rated items, allowing larger pools of neighboring items to be used and reducing the coverage issues. Meanwhile, using a damped average rating scoring equation fixes the weakness of the previous scoring equation and improves the prediction and recommendation quality of the Item-Item algorithm. Taken together, these fixes were able to substantially improve the performance of the Item-Item algorithm in offline analysis. While no Item-Item variant beat the baseline algorithms for users with very few ratings, or the SVD algorithm for users with more ratings, the frequency optimized + damped Item-Item is no longer dominated by these approaches.

These algorithm improvements are relatively simple, but make a substantial difference to the behavior of the Item-Item algorithm for users with few ratings. There is still room for future exploration, particularly of the frequency optimized Item-Item model. While the weighting strategy between popularity and similarity is theoretically grounded (multiplying popularity by similarity allows optimizing for the expected sum of similarity when making future predictions), there is no reason to believe that it must be optimal. This is particularly worth exploring relative to content-based similarity metrics. It is possible that the use of both a rating-based measure of similarity and a rating-based measure of popularity led to a particularly positive interaction. A particularly extreme form of this algorithm might choose to automatically optimize the set of retained neighbors against some definition of quality.

To verify our offline results we used an online virtual-lab experiment in which users with no prior relationship with our recommender were asked to enter ratings and evaluate recommendations. While our analysis model was not an ideal fit, especially relative to the questions we originally launched the experiment with, we were still able to glean some useful results. In particular, our experimental results agree with our offline results, showing that the improved Item-Item algorithm is a meaningful improvement over the traditional Item-Item formulation. Interestingly, however, we were not able to outperform an unpersonalized list of popular items.

This series of experiments is not without limitations. Like the previous chapter, we face a serious limitation of having used only one dataset. This limits the degree to which we can confidently assert that these are necessary properties of the Item-Item algorithm deployed on all datasets, as opposed to only the MovieLens domain.
That said, the issues in Item-Item that we identified, and our fixes to these issues, are consistent with our understanding of general K-nearest neighbor algorithm behavior, and make no use of any special properties of the MovieLens domain. Therefore our improvements should be robust across recommender system domains.

One limitation of the online experiment is that it only focused on the damping improvement and does not evaluate the model truncation improvement. The goal of this experiment was to, hopefully, learn more about how the changes in properties such as popularity and diversity in Item-Item impact new-user evaluations of the algorithm. We chose to focus on the damping modification as it has a large effect on popularity and diversity, and does so over a range of damping value settings. More conditions were not considered, to reduce the cost and complexity of the study. Future explorations of this topic should look at a much larger range of algorithms, including SVD variants (subject to best practices on adding new users to SVD models).

Another limitation of the online experiment is that our survey was unable to capture some of the properties we sought to study. While it's possible that properties such as "degree of personalization" may be too abstract to be interpreted by first-time system users in a new-user circumstance, other properties like novelty of recommendations have been measured before, and could not be separately elicited in this survey. More work is likely needed to improve the quality of this user survey, and therefore increase the sensitivity of this methodology. This will be especially important if smaller algorithm behavior changes are to be studied with this experimental design.

While Item-Item seems to be fading in popularity in the recommender systems research community, it is still a popularly deployed algorithm, and careful, replicable analysis of these performance issues may be of use to general practitioners, especially those without the background necessary to independently pursue more advanced algorithms. Our findings, however, are not limited to this primary outcome. Viewing this as a case study in understanding algorithm performance factors for new users also highlights some interesting findings and trends in our results.

Throughout this study, we have found that making the algorithm more conservative in its recommendations sits in opposition to diversity and novelty of recommended items. While this is true for both modifications, the damping parameter in particular forces the Item-Item algorithm to only take risks on items that are not generally evaluated well (as measured by the baseline) when it has enough supporting evidence. While this damping is a benefit for performance, it substantially reduces the novelty and diversity of recommended items. We see similar behavioral patterns with the SVD algorithm, whose regularized training would have a similar effect: the algorithm performs well, but tends to have less diverse recommendations than the other algorithms we tested. It is therefore important to measure several properties of an algorithm (conservativeness, diversity, and novelty) when making adjustments targeting any one of them, as the properties likely cannot be manipulated independently. In our work, while we were able to make the algorithm perform well by making it more conservative, there was a significant tradeoff in the novelty and diversity of recommended items.
Interestingly, in our online analysis, we found that recommending items the user had already seen was a good thing. This sits contrary to perspectives that users would be seeking a "value-add" from recommendations. In a domain like movies, recommending an item the user has already seen is often seen as a "useless" recommendation, as it doesn't help the user find something new. However, we found positive correlations between the overall user evaluation of algorithms and how often they recommended seen movies. Likewise, a movie was much more likely to be evaluated as a good recommendation if it was also a seen movie. This suggests that, at least for the new users in our experiment, users were not looking for novel items as evidence that the algorithm will help them find movies, but instead were looking for evidence that the algorithm understood them well. These recommendations likely served to establish trust in the algorithm and may have made new items in the recommendation list seem more trustworthy. Future research should evaluate the role of non-novel recommended items in building trust with users who are new to a recommender system, especially in contrast with the role of novelty in demonstrating value to long-time users.

Looking forward, it seems that there is still much work to be done in understanding the needs and wants of users who are new to a recommender system. While there are general purpose models of users evaluating online systems, these models need to be mapped to known, measurable properties of the recommender system domain. A long-term goal in this field of research would be an in-depth understanding and re-evaluation of traditional algorithm design and evaluation methodologies to understand their effectiveness at capturing the needs of users who are new to a system. The unexpected success of the non-personalized popularity algorithm with our experimental users suggests that traditional offline measures of quality may not accurately predict online performance for new users.

Chapter 6

Preference Bits and the Study of Rating Interfaces

The previous two chapters have focused on the aggregation and utilization of ratings from system users. In this chapter we will dig deeper into the quality of these ratings themselves. In particular, we will focus on how users make ratings, how we can measure the information content of these ratings, and what can be done to optimize a user experience to gather the most informative ratings.

Not all rating scales are equally informative. It is natural to expect, for example, that ratings on a thumbs up / thumbs down scale will be less informative than ratings on a five star rating scale. By the same right, carefully considered five star scale ratings might be better than ratings haphazardly entered on a 100 point "continuous" scale. In practice, you can also see differences in rating quality between two deployed recommendation systems using the same scale. The users of one system may have different rating patterns than those of the other system, leading to measurably different rating properties. For example, ratings in the BookLens system exhibit a stronger positivity bias than MovieLens ratings, even though the two systems use the same rating scale. This increased positivity bias may mean that ratings in that system are fundamentally less informative.

It can be hard to choose the right rating interface for a given system. While it is natural to assume that a larger scale is best, this is not necessarily the case. YouTube, for example, switched from a five star rating scale to a thumbs up / thumbs

down rating scale in 2010. This change was made after it was observed that system users rarely used any rating value other than 1 and 5 stars. If users are going to enter binary ratings anyway, it was probably best to reduce the complexity of the interface and just give them a binary rating interface [140].

YouTube's scenario is a simple case where users are clearly ignoring much of the rating scale. Even when the full rating scale is used, there might be good cause to switch to a smaller rating scale. Smaller rating scales take less time to rate on, and require less cognitive effort [122]. The smaller effort might also make smaller scales easier for novice users or those with limited computer skills. Small rating scales can even be easier to summarize for users. The best argument for smaller rating scales, however, is that smaller rating scales are likely to be used more consistently. A user is more likely to be unsure whether a movie is a 2 star or 2.5 star movie than they are to be unsure whether it's a thumbs up or thumbs down movie.

While most recommender system algorithms model user ratings as a perfect reflection of each user's stable tastes, this simply isn't true. Research into ratings has found that there are inconsistencies in user ratings [4, 5, 26, 53, 94]. These inconsistencies can lead to ratings that do not accurately capture a user's preference and therefore negatively affect the predictions and recommendations made by an algorithm. These inconsistencies are traditionally called rating noise and are split into two groups: malicious noise, biased ratings entered to affect recommendations, and natural noise, which is noise caused by natural rating processes [94]. In this chapter we will focus only on natural noise, and will use the terms natural noise and rating noise interchangeably.

Rating noise negatively impacts algorithm accuracy and is believed to be the cause of the so-called magic barrier. First suggested by Hill et al. [53], the magic barrier is a limit on the possible accuracy of any recommendation algorithm caused by inconsistencies in user ratings. The existence of a magic barrier calls into question the utility of extensive algorithm-driven development of recommender systems. If our algorithms are approaching this magic barrier then there is little to be gained by more elaborate algorithm development, especially work focusing on well understood recommendation problems. On the other hand, if we can decrease the amount of rating noise, or increase the amount of information contained in ratings, then we should be able to increase prediction accuracy without changing our algorithms.

Therefore, our focus in the rest of this chapter will be understanding how users make ratings, how we can measure the information content of these ratings, and what can be done to optimize a user experience to gather the most informative ratings. In the rest of this chapter we will provide a brief background on decision making, research into rating interfaces, and the mathematical field of information theory. Then we will describe the preference bits model, which we will use to derive a measure of the information content of ratings. After that I will describe three studies performed looking into how rating scales and rating support can affect the information content of ratings.
The work in this section has been published in two papers, "How Many Bits Per Rating?" [66] and "Rating Support Interfaces to Improve User Experience and Recommender Accuracy" [88], and represents collaborative work with a team of other authors, most notably Tien T. Nguyen. Tien and I collaborated closely in our work in this area. My primary focus in this work, and for this chapter, is the preference bits model used to measure information theoretic properties of ratings, and the methodologies for performing experiments based on this data. In the third, most substantial, study on rating support interfaces, Tien performed much of the algorithmic work for deciding which content would go into the interfaces, while I was in charge of the interfaces themselves. That said, both of us worked in close collaboration with the whole team for much, if not all, of this work and all major decisions were made collaboratively by the team.

6.1 Background

To understand the source of rating noise we first turn to research on the psychology of preference elicitation. This understanding of preference elicitation, both from the fields of psychology and recommender systems, will serve as a basis for much of the following chapter.

6.1.1 Preference Elicitation

The field of decision making shows an 'emerging consensus' [119] towards the idea that decision makers do not have well defined preferences [13, 43]. Preferences are not easily retrieved from some master list in memory, but rather are constructed while making a decision [75] and depend strongly on the situational context and task characteristics. The observed stability of preferences comes from the fact that a person will usually follow approximately the same steps when constructing a preference and therefore tend to arrive at approximately the same preference. Rating noise, therefore, comes from situational differences that might affect this process.

The rating elicitation process is linked first to memory processes [133]. When a person is asked to evaluate an item, a memory query will be activated to recall information that is needed for the evaluation. This query will activate knowledge and associations of the particular item that will be used to form a preference for that item. After this memory process, the memory of the event and a person's values are used to settle on a preference, which is then compared against the person's understanding of rating values to pick the most appropriate rating value. In other words, the evaluation of an item will depend on how the preference for that item is (re)constructed from memory, information provided by the interface that might bias this reconstruction, a person's values, and their situational understanding of the rating options.

In practice, the preference elicitation process can be led astray. Different aspects of a memory may come to mind at different times and in different situations due to how memory is reconstructed [133]. Memories can also become less detailed and precise over time, which has been observed to lead to rating changes [17]. Anchoring effects [26, 129] and task characteristics [58, 130] have been observed to strongly affect rating processes, especially when a user lacks a good internal representation of the rating scale. Perhaps most directly, many re-rating studies have directly observed inconsistent ratings and rating biases in the past [4, 5, 26, 53, 88, 111].

6.1.2 Recommender Systems Research on Rating

One of the most interesting tools that has been used to study how users rate is the re-rating study. The basic design of a re-rating study has users enter ratings on a given set of items. Then, after some delay, the same users are asked to rate the same items again, so that two sets of ratings for the same users and items can be collected and compared. This methodology has led to several interesting results.

Cosley et al. use a re-rating study with ratings taken over a range of rating interfaces to establish that users are reasonably consistent across different rating scales [26]. However, by manipulating predictions shown in the rating interface on the second rating, Cosley et al. also showed that ratings can be biased by anchoring effects. While this could be considered good reason not to show predictions for un-rated items, the major value of rating is to enable predictions to be made, therefore this suggestion has not been well adopted.

Using a carefully designed re-rating study with 3 re-ratings, Amatriain et al. applied classical test theory to characterize the size and frequency of these inconsistencies [4]. Broadly, Amatriain et al. conclude that extreme ratings are more consistent than middle ratings, that learning factors do seem to apply to rating consistency, and that, despite occasional inconsistencies, elicited ratings would still be considered a good measure by test-retest reliability measurements from classical test theory. In follow-up work Amatriain et al. show that the inconsistencies, and hence the magic barrier, can be reduced by combining multiple ratings from a user to create a "denoised" dataset.

Recommender systems researchers have found other interesting psychological properties of rating interfaces without using re-rating studies. Bollen et al. showed, using an analysis of MovieLens data, that older movies tend to have higher mean ratings over time, possibly due to a psychological memory bias [17]. However, in the same paper Bollen et al. report on a more controlled survey which asked for movie ratings as well as an estimate of how long ago the participant last watched the movie. In this survey Bollen et al. found a negativity trend: when forced to make a rating, ratings based on older memories tended to be more negative. While the authors were not able to fully disentangle the various ways in which time biases user ratings, this work does show that time has a measurable effect on ratings.

In a somewhat different direction, Sparling and Sen performed a study investigating the cognitive effort involved in rating [122]. In their study Sparling and Sen used a dual-task paradigm where participants were asked to enter ratings while also managing a secondary reaction task. While their secondary task proved inconclusive at measuring the cognitive load of different rating scales, they were able to measure several interesting results. They found that users chose to rate fewer items on larger scales, and took longer to make these ratings. Since the secondary task results were inconclusive, they also conclude that rating time is a suitable approach for measuring the cognitive load of making a rating.

6.1.3 Alternate Designs for Preference Elicitation

While there are only a few distinct rating interfaces currently in deployment for preference elicitation, the research community has explored a surprisingly large variety of preference elicitation techniques. Nobarany et al. built a classification system to explore the design space of preference elicitation methods [91]. Their design space focuses on two major dimensions: whether users rate a single item or compare multiple items, and how many items are shown during one preference interaction (which they refer to as recall support). In the space of absolute rating they have traditional rating interfaces and their "stars+history" interface, in which each rating choice is associated with past ratings to help explain the rating scale. The stars+history interface is the inspiration for the exemplar interface discussed later in this chapter. In the multiple item comparison space they explored several different options. Broadly, they found that the stars+history interface was liked the most. They also comment that direct comparison methods, especially those that assume items can be absolutely ordered, may not match how humans think about items.

Further exploring the space of item comparison, multiple researchers have explored comparison-based interfaces for the initial process of learning about user preferences [22, 48, 77]. Generally speaking, the comparisons elicited in these interfaces are not taken as an absolute ranking choice, but instead as an indicator of how a latent feature model of user preference should change. These interfaces have all shown great promise in comparison-based preference elicitation, as they tend to create better user profiles in less time than a profile built through traditional rating-based profile building. Overall, this body of work builds a strong case that showing multiple items in a rating interface (although not necessarily eliciting direct comparisons) may be a useful strategy when designing a preference elicitation interface.

Indeed, research on preference elicitation has investigated the differences between evaluating items on their own (single evaluation) and comparing items against each other (joint evaluation) [58]. When items are evaluated in isolation, the evaluation is absolute and will be predominantly based on aspects that are easy to evaluate. Using joint evaluation allows users to be more sensitive to dimensions on which items vary. When discussing the third study in this chapter we will show that there are measurable benefits in the quality of ratings when using an interface that can leverage joint evaluation when entering ratings.

6.1.4 Information Theory

The model we will explore shortly has a specific goal of measuring the amount of preference information contained in noisy user ratings. As the goal of the model is to measure information it will rely heavily on information theory; therefore this subsection is included as a brief overview of the fundamental measurements of information theory. Claude Shannon developed information theory as a way of reasoning about information [116]. One of the fundamental measures of information is called entropy, which measures how hard a random variable is to predict. Entropy is defined as follows, where X is a random variable and P(x) denotes the probability mass function of X:

H(X) = -\sum_{x \in X} P(x) \log_2 P(x)        (6.1)

We measure entropy with bits, where one bit measures how hard it is to predict a fair coin flip. As an example, consider a five star interface. Let R be the value of a random rating on this interface. If we assume that all ratings are equally likely, then P(r = x) = 1/5 for all x in our rating domain. We can compute H(R) as:

H(R) = -\sum_{r=1}^{5} \frac{1}{5} \log_2 \frac{1}{5} = \log_2 5 \approx 2.32 \text{ bits}        (6.2)

The assumption that all ratings are equally likely is rarely true; however, this estimate remains of interest as entropy is maximized when all possible outcomes are equally likely.

Therefore \log_2 5 is an upper bound on the entropy of ratings from a five star interface. More generally, for a rating interface with m categories, \log_2 m is the maximum possible entropy.
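As a quick illustration of these definitions (not part of the original text), the entropy of a discrete rating distribution can be computed directly from its probability vector; the snippet below reproduces the equal-probability five star bound:

```r
# Entropy, in bits, of a discrete distribution given as a probability vector.
entropy_bits <- function(p) {
  p <- p[p > 0]            # terms with zero probability contribute nothing
  -sum(p * log2(p))
}

entropy_bits(rep(1/5, 5))  # log2(5) ~ 2.32 bits: the five star upper bound
entropy_bits(c(0.5, 0.5))  # 1 bit: a fair coin flip
```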

Relative to any other random event Y, the randomness in X can be split into two parts: the signal and the noise. The signal is the degree to which variable X helps predict variable Y, and is measured with the mutual information, denoted I(X; Y). The noise, on the other hand, is the entropy in X that would remain even if we knew the value of Y, and is measured with the conditional entropy H(X|Y). These properties are related through the following relationship:

H(X) = I(X; Y) + H(X|Y)        (6.3)

The mutual information between two random variables is a symmetric measure of how much information the value of one random variable gives us about the possible value of the other. Phrased another way, mutual information tells us how much one variable can serve to reduce the entropy in another variable, making it easier to predict. Mutual information is defined as follows where X and Y are both random variables:

I(X; Y) = \sum_{x \in X} \sum_{y \in Y} P(x,y) \log_2 \frac{P(x,y)}{P(x)P(y)}        (6.4)

We measure mutual information with bits, where one bit of mutual information between two variables X and Y means we expect knowing X to reduce the entropy in Y by one bit. Mutual information ranges from zero, when the two variables are independent, to the smaller of the two variables' entropies, when one variable entirely explains the other. Note that entropy serves as a maximum for mutual information, giving us another interpretation of entropy: the maximum capacity for a variable to carry information. The conditional entropy of one variable is the remaining entropy in that variable if we are given the value of another variable. Phrased another way, the conditional entropy tells us how much of the randomness in one variable is independent of the value of the other variable. Conditional entropy can be computed using equations 6.3 and 6.4, or directly through the following equation:

H(X|Y) = \sum_{x \in X} \sum_{y \in Y} P(x,y) \log_2 \frac{P(y)}{P(x,y)}        (6.5)

Conditional entropy ranges from 0, when the value of Y can completely predict the value of X, to the entropy of X, when the two variables are independent. As an example, consider the relationship between ratings (R) on the movie Titanic (1997) and the gender (G) of the rater, based on data from the MovieLens system.1 Table 6.1 contains the probability distribution P(R, G), as well as the marginals of that distribution: P(R) and P(G).

1 This example only uses data from users who have rated Titanic and have a labeled gender. Additionally, this was simplified by taking only whole star ratings.

  P(G,R)      1        2        3        4        5        P(G)
  Female    2.35%    3.73%    7.93%   10.60%   10.61%    35.23%
  Male      5.07%    8.78%   16.05%   20.21%   14.67%    64.77%
  P(R)      7.43%   12.51%   23.98%   30.81%   25.28%

Table 6.1: The probability distribution of ratings on the movie Titanic (1997) by users with known gender on MovieLens.

From this we can compute that H(G) = 0.936 bits, meaning that gender is easier to guess than a random coin flip (1 bit), but only by a little. We can split this 0.936 bits into the mutual information I(R; G) = 0.005 and the conditional entropy H(G|R) = 0.931. The small mutual information and large conditional entropy mean that the gender of a rater and the rating they give to Titanic are practically independent. This would be an issue if a system had the goal of using the rating value on this movie to predict gender.
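For concreteness, the quantities in this example can be computed directly from the joint distribution in Table 6.1. The following sketch (not from the thesis) approximately reproduces the values reported above, up to rounding in the table:

```r
# Compute H(G), I(R; G), and H(G|R) from the joint distribution in Table 6.1.
entropy_bits <- function(p) { p <- p[p > 0]; -sum(p * log2(p)) }

joint <- rbind(Female = c(0.0235, 0.0373, 0.0793, 0.1060, 0.1061),
               Male   = c(0.0507, 0.0878, 0.1605, 0.2021, 0.1467))
joint <- joint / sum(joint)                 # renormalize away rounding error

H_G  <- entropy_bits(rowSums(joint))        # entropy of gender, ~0.936 bits
H_R  <- entropy_bits(colSums(joint))        # entropy of the rating
I_RG <- H_G + H_R - entropy_bits(joint)     # mutual information, ~0.005 bits
H_G_given_R <- H_G - I_RG                   # conditional entropy, ~0.93 bits
```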

One important result about mutual information is the data processing inequality [27]. The data processing inequality applies for any three random variables X, Y, and Z such that X and Z are conditionally independent given Y. This means that knowing Y and Z gives us no more information about X than just knowing Y; formally, P(X|Y,Z) = P(X|Y). When this property holds, the data processing inequality shows:

I(X; Z) ≤ I(X; Y)        (6.6)

We will use the data processing inequality to derive measures of the amount of information about user preferences that enters and exits the recommender.

Interestingly, information theory has been applied to rating systems before. In 1960, W. R. Garner applied information theory to measure the mutual information between ratings given to items and the items themselves, to see how discriminable items are given ratings. Garner found that larger rating scales do indeed have greater discriminating power, and concluded that the optimum number of rating points must depend in some way on the underlying discriminability of the rated items. This can be seen as a precursor to our model; while it may sound similar on the surface, it is quite different as it does not focus on individualized ratings in any way.

6.2 Measuring Information Contained in Ratings

The core question of this chapter is: how can we measure the quality of ratings? One simple approach would be to collect enough ratings to build an algorithm and then measure how well the algorithm can predict future ratings on a metric such as RMSE. Better ratings should lead to better prediction accuracy. This approach, however, is dependent on the implementation of an algorithm and would conflate rating patterns that an algorithm cannot predict with rating noise.

A better approach would be estimating the minimum possible RMSE from any given algorithm. This is exactly the approach suggested by Said et al.'s research into the magic barrier [111]. In this work Said et al. derive a method to estimate the minimum possible RMSE based on empirical risk minimization from statistical learning theory. Their approach is based on performing a re-rating study and using the multiple ratings to assess what an optimal prediction would be, and what types of errors it would make. This approach is very well situated to answering the question "what is the magic barrier of a given recommendation system", to help temper expectations about possible future developments. However, this approach is not as well situated to comparing different interfaces for preference elicitation, as it doesn't separate improvements in information from reductions in noise, and is not well situated to comparing interfaces that lack a common scale.

Ultimately, the goal is an algorithm-, metric-, and rating-scale-independent way of measuring both the amount of information and the amount of noise in user ratings. As the goal is to measure an amount of information it is only natural to apply information theory. Therefore, in this section we will derive a measurement of the amount of preference information contained in any preference measure: the preference bits model.

Since we are building on top of a statistical measure of information, we will require slightly different notation than used in previous chapters. First, define the random variable Π(u,i) to represent user u's unknown preference for an item i. Where we are only discussing one item and user, or considering a system averaged over all items and users, we will drop the parameters and simply refer to the variable as Π. We will assume that this preference Π is stable over time, but that there is a random process through which this stable preference leads to ratings.

It is worth noting here that we know from our earlier discussion of preference formation that preference is constructed, and no stable preference exists in a human brain. It is undeniable, however, that while preference is constructed each time, there are also factors (memories and values) in a human mind that can be reasonably modeled as stable for our purposes. A much more accurate model could be derived in which there are stable parts of a mind which lead to random variables for constructed preferences, which in turn lead to ratings and other behavior. The natural drift of memory and values could even be modeled using a Markov structure of some kind. All that said, any of these "improved" models would ultimately have one stable, causal variable, which we would label Π (and assign an appropriate, if cumbersome, name to). Even if this causal variable is separated from ratings by many other variables, it would not substantially affect our ultimate measures, other than to make them much harder to communicate.
Therefore, with an acknowledgment that this model is an oversimplification of the mental processes at hand, we will continue with our construction of the preference bits model. We will return to the consequences of this oversimplification briefly.

Now let X be any random variable, with entropy H(X). Relative to preference, this entropy can be split into two parts by equation 6.3: the amount of information that X conveys about preference (the preference bits, I(X; Π)) and the amount of information in X that is independent of the preference (the noise, H(X|Π)). Note that in our model Π would have a set value for any user who has consumed an item and therefore can make a rating. To any outside observer, however, this preference value is unknown and can still be modeled as random relative to our uncertainty about it.

While the preference bits model is interesting, since Π is logically unmeasurable2 the preference bits remain unmeasurable. Therefore, we need a way of estimating the preference bits of a variable. For this, let R(u,i) be a random variable for user u's rating of item i. When discussing more than one possible re-rating we will index ratings as

R1, R2, .... With this we can now prove the following theorem.

2 If we had a cost effective and perfect way to measure a user's preference (or all causal factors thereof), we wouldn't bother with ratings.

Theorem 1. Let X ≠ R, Π be any random variable such that X and R are conditionally independent given Π. Then:

I(X; R) ≤ I(X; Π) (6.7)

Within our assumption that X and R are conditionally independent, this follows directly from the data processing inequality (equation 6.6). What this means is that any variable X can tell us no more about our ratings than it does about our preferences. Conditional independence can largely be assumed because, for the most part, we can reasonably assume that only a user's preference is causal of their rating. However, in cases where predictions are shown this assumption may be violated, as we know that predictions bias ratings [26]. When this assumption is violated, the measured preference bits can be overestimated. As the absolute value of preference bits has little meaning except for comparison, this may be acceptable for some purposes, so long as this detail is accounted for and held constant across any experimental comparison. Likewise the following theorem can be shown:

Theorem 2. Let X ≠ R, Π be any random variable such that X and R are conditionally independent given Π. Then:

H(X|R) ≥ H(X|Π) (6.8)

Theorem 2 follows directly from theorem 1 and equation 6.3:

H(X) = I(X; Π) + H(X|Π)
H(X|Π) = H(X) - I(X; Π)
H(X|Π) ≤ H(X) - I(X; R)
H(X|Π) ≤ H(X|R)
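Although the result is purely analytic, it can be handy to check it numerically. The toy model below is not from the thesis; all of its distributions are invented for illustration. It builds a preference Π (written PI) that causes both a rating R and another variable X, so that X and R are conditionally independent given Π, and confirms that I(X; R) ≤ I(X; Π) and H(X|R) ≥ H(X|Π):

```r
# Numeric sanity check of Theorems 1 and 2 on a hypothetical discrete model.
entropy_bits <- function(p) { p <- p[p > 0]; -sum(p * log2(p)) }
mi_bits <- function(joint) {   # mutual information from a joint distribution matrix
  entropy_bits(rowSums(joint)) + entropy_bits(colSums(joint)) - entropy_bits(joint)
}

p_pi   <- c(0.2, 0.5, 0.3)            # P(PI): three preference levels
p_r_pi <- rbind(c(0.7, 0.2, 0.1),     # P(R | PI): rows are PI, columns are R
                c(0.2, 0.6, 0.2),
                c(0.1, 0.2, 0.7))
p_x_pi <- rbind(c(0.6, 0.4),          # P(X | PI): rows are PI, columns are X
                c(0.5, 0.5),
                c(0.2, 0.8))

joint_pi_x <- p_pi * p_x_pi                    # P(PI, X): row i scaled by P(PI = i)
joint_x_r  <- t(p_x_pi) %*% (p_pi * p_r_pi)    # P(X, R) = sum_PI P(PI) P(X|PI) P(R|PI)

i_x_pi <- mi_bits(joint_pi_x)                  # I(X; PI)
i_x_r  <- mi_bits(joint_x_r)                   # I(X; R), always <= I(X; PI)
h_x    <- entropy_bits(colSums(joint_pi_x))    # H(X)
c(I_X_PI = i_x_pi, I_X_R = i_x_r,
  H_X_given_PI = h_x - i_x_pi, H_X_given_R = h_x - i_x_r)
```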

Therefore, just as the mutual information with ratings can be seen as a lower limit on the preference bits, the conditional entropy given ratings can be seen as an upper limit on the preference noise. While these theorems do not give us exact measures of the preference bits and noise, they do give us reason to say that X may serve as a reasonable proxy.

[Figure 6.1: Graphical model of input preference bit measurement. For a user u and item i, the stable preference Π(u,i) (the desired measurement) independently generates the two ratings R(u,i)1 and R(u,i)2 (the actual measurements).]

From these theorems we will define several measurements. Let R1 and R2 represent two ratings for the same item and user, drawn independently from P(R|Π) for preference Π. So long as both ratings are taken on the same scale, the labels R1 and R2 are theoretically irrelevant; however, by convention we assign R1 to whichever set of ratings was taken first. Figure 6.1 shows this model. We call such pairs of ratings re-ratings. The input preference bits per rating, which will normally be shortened to just preference bits, is defined as I(R1; R2), the mutual information between the two re-ratings, which estimates how much useful information is contained in ratings. The rating noise H(R1|R2) (or H(R2|R1)) estimates how much distracting noise is contained in ratings. More strictly, these measures tell us how much information a rating contains about another, future rating by the same user on the same item, and how hard it is to predict a rating given another rating by the same user on the same item.

To apply theorems 1 and 2 to re-ratings, we must rely on four assumptions that have bearing on how re-ratings should be collected.

First, our assumption that the two ratings were drawn independently implies that no known biasing factor (such as predictions) can be shown while ratings are collected, as these would cause a dependency between the ratings. As we likely do not know the complete set of biasing factors, and some biasing factors may be unavoidable, it is vital that as many factors as possible are kept constant between any conditions that are to be compared. So long as a biasing factor is consistent between conditions we can, at the very least, hope that it will affect all conditions approximately equally.

Secondly, our assumption that the two ratings were drawn independently implies that when a user is asked to re-rate an item, they should not remember their previous rating. This can be accomplished by waiting a period of time before asking a user to re-rate. The less time allowed between ratings, the higher the chance that some ratings are not independent because of user memory, which can lead to overestimation of the preference bits and underestimation of the rating noise.

Thirdly, as was mentioned before, we must assume that ratings made at different times are caused by the same preference. While any local differences in preference construction (on an hour-to-hour timescale) can be accounted for in our model of ratings as random even given a preference, this model has no good way of accounting for drift over time. It has been shown, however, that ratings do appear to drift over time [4, 17]. Violations of this assumption will lead to underestimated preference bits, as preference drift will be seen as another source of noise. To minimize this effect re-ratings should be collected in a relatively short time span. Again, in an experimental setting it is likewise important that this time span is carefully controlled between conditions to avoid introducing accidental differences between conditions.

Note that the consequences of two of our assumptions (ratings are independent given preference, and preference is constant between ratings) are at odds. As too little time between ratings will lead to an overestimation of preference bits and too much will lead to an underestimation, we expect that as the time delay between re-ratings increases the amount of measured preference bits per rating will decrease.3 For this reason it is particularly important to carefully control the amount of time between ratings across all experimental conditions.

3 We confirmed this empirically with Xavier Amatriain's dataset [4], where we see a decrease in preference bits per rating from 1.1227 with less than a 15 day delay to 0.9757 with more than a 15 day delay.

The final assumption we are making is that ratings are categorical, that is, there is a specific, finite set of possible rating values. While this is a benefit, in that it allows comparing traditional ratings with novel and unordered rating systems, it also means that the equations in theorems 1 and 2 are not suitable for continuous forms of preference elicitation.

3 We confirmed this empirically with Xavier Amatriain's dataset [4], where we see a decrease in preference bits per rating from 1.1227 with less than a 15-day delay to 0.9757 with more than a 15-day delay.

If ratings are collected on a continuous scale, two options are possible: first, the ratings can be rounded to some discrete scale and treated as being on a non-continuous scale; second, a smoothed model of the re-ratings could be fit and the information properties measured through integration. While the preference bits model assumes ratings are categorical, most rating scales are ordinal, that is, the scales have an inherent order. While we might implicitly know that a re-rating jump from 1 star to 5 stars is more problematic than a jump from 1 star to 2 stars, this information is not inherently captured in the preference bits model. In our experience, however, this information is sufficiently captured in the re-rating distribution itself: P(R1, R2) is largest when R1 = R2 and drops rapidly as |R1 − R2| increases, encoding the latent ordinal structure of ratings. That said, since the preference bits model does not assume ordinality, there are re-rating distributions that would cause misleading results. For example, consider a re-rating experiment in which users re-rate according to R2 = (R1 + 2) mod 5, that is, each re-rating is the original rating plus two (with the highest rating values wrapped around to the lowest values). Since there is no noise in these re-ratings, the preference bits model would consider these highly informative ratings, despite their deeply confusing structure. Investigating the availability and suitability of an ordinal theory of information remains future work. Until such an improvement can be made we recommend manually inspecting re-rating distributions to ensure that they do not exhibit strange non-ordinal patterns. As a final note, we can also define the output preference bits I(R; P) (where P represents predicted ratings on some prediction rating scale, rather than fully numeric predictions), which represents how much we expect an algorithm's predictions to reduce a user's uncertainty about what preference value they might form after consuming an item. Output preference bits can serve as a prediction accuracy metric that is self-normalizing and therefore suitable for comparing multiple prediction scales. It remains future work to establish whether this prediction accuracy metric better aligns with user satisfaction than RMSE; however, as prediction is no longer a primary focus of the recommender systems community, this work is unlikely to be carried out. For a further discussion of output preference bits, and an analysis of how input and output preference scales impact empirically measured output preference bits, I direct the reader to our paper [66]. As output preference bits is somewhat outside the scope of this chapter, the measure will not be discussed further.

6.2.1 Related Measurements

In our studies we will augment the preference bits and rating noise measurements with several other measurements. Not every study will use every metric, but we list them here for completeness. First and foremost we will use the magic barrier metric derived by Alan Said et al. [111]. As the information theoretic measurements and the magic barrier estimate both use data collected from a re-rating study, they can easily be applied and compared in our studies. Another measurement we will look at is the time to make ratings, as suggested by Sparling and Sen [122]. When possible we will measure this time directly in our experiments. When we do not have rating time available we will use the rating time estimates provided in Sparling and Sen's report. We will also take the rating time measurement one step further. As we have a measurement of the amount of time it takes to make a rating, as well as an estimate of how many bits each rating gives, we can combine these (dividing the bits per rating by the seconds per rating) to measure preference bits per second. The preference bits per second metric is a way of balancing the ease of use and cognitive load involved in rating with the information value of those ratings. Rating time and information content are often in conflict. This conflict is especially apparent when comparing rating scale sizes, where interfaces with small rating scales take less time to rate on, but also give less informative ratings than interfaces with larger scales.

Sparling and Sen's study also reports the fraction of displayed items that were rated. Sparling and Sen found that statistically significantly different numbers of ratings were made under different interfaces. Since the items available to rate were not varied systematically between conditions, we can conclude that the interfaces themselves led to more or fewer ratings being made. Therefore, when comparing rating interfaces it may be interesting to measure the ratings per impression, or how often a rating interface being shown to a user actually leads to a rating. Ratings per impression can be very hard to measure in a virtual-lab study, and Sparling and Sen's numbers are likely larger than what might be expected in a natural setting. That said, to my knowledge Sparling and Sen's estimates of ratings per impression are the only such estimates available that are controlled across multiple interfaces, so we will use them in our study. Just as we can compute preference bits per second by multiplying ratings per second with preference bits per rating, we can compute preference bits per impression by multiplying preference bits per rating with ratings per impression. Like preference bits per second, preference bits per impression seeks to balance the extra ratings that small rating scales might get with the extra value contained in large rating scales. Unfortunately, an experiment that is well situated to measuring preference bits (a re-rating experiment, ideally with a known rate-able item set) is not well situated to estimating ratings per impression. Therefore, we will reuse Sparling and Sen's measurements of ratings per impression together with our measured bits per rating when comparing multiple rating scales. Finally, we note that the best procedure for gathering rating information may not be the users' favorite interface. Therefore, we will also measure user satisfaction with interfaces.
If an interface is a substantially worse experience for the user, it will not matter whether it is information-theoretically optimal; it should not be deployed. We will measure user satisfaction through the use of a user survey, the details of which will be provided in a later section.
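To make the arithmetic behind these derived measurements concrete, the following minimal Python sketch computes preference bits per second and per impression from the three underlying quantities. The function names and the numbers in the example are illustrative only, not measured values from this chapter.

def bits_per_second(bits_per_rating: float, seconds_per_rating: float) -> float:
    """Preference bits gathered per second of rating effort."""
    return bits_per_rating / seconds_per_rating

def bits_per_impression(bits_per_rating: float, ratings_per_impression: float) -> float:
    """Preference bits gathered each time a rating widget is shown."""
    return bits_per_rating * ratings_per_impression

if __name__ == "__main__":
    # Hypothetical interface: 0.8 bits per rating, 4 seconds per rating,
    # and a rating made on 60% of impressions.
    print(bits_per_second(0.8, 4.0))        # 0.2 bits per second
    print(bits_per_impression(0.8, 0.6))    # 0.48 bits per impression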

6.3 Studies of Rating Scales

In this section and the next I will present a series of three studies that were performed with two goals: to better understand the preference bits framework, these new measurements, and their suitability for use in an experimental context; and to better understand rating interfaces for recommender systems. The two studies in this section focus on the size of the rating scale as a fundamental trade-off between small scales, which are easy to use, quick, and unambiguous, and larger scales, which are slower to use and possibly more ambiguous. The following section will present our third study, an experiment which focuses on a single rating scale and asks how rating support can be used to better assist raters in their rating process.


Figure 6.2: Our rating system model

6.3.1 Study 1: Simulated Rating System

To understand how the rating scale affects preference bits per rating, and to study statistical properties of these new metrics in a controlled environment, we built a probabilistic model of a rating system. Using our simulation we can easily simulate changes in the rating scale, the underlying preference scale, or the amount of noise between the rating and preference scales. The basic outline of our model is shown in figure 6.2. Ultimately, this model generates probabilities for P(R1, R2) or P(Π, R) given a set of parameters. In our simulation, the true preference Π of a user for an item is modeled as being drawn from a beta distribution with parameters α and β, bucketed into mΠ bins. The beta distribution is a natural choice for modeling a random variable over a fixed range. By varying α and β many qualitatively different distributions can be achieved, ranging from uniform, to distributions similar to a normal distribution, to those that show a realistic positive skew. Beta distributions can even represent bimodal “love it or hate it” distributions when α and β are both less than 1. The number of buckets, mΠ, represents the true scale upon which users internally evaluate an item. To pick a user's value for Π, we draw a random value from the beta distribution and take the value of the center of the bucket containing it.


Figure 6.3: Preference bits per rating increases as the size of the rating scale increases (α = 3, β = 1, mΠ = 50, σ = 0.05)

We model the input rating scale as the unit scale divided into mR buckets. To generate R from Π, we draw a value from a normal distribution N(µ = Π, σ) and choose the rating bucket (between 1 and mR) that contains the result. Because a normal distribution is not clamped to the unit range, we treat the smallest and highest ratings as extending to negative and positive infinity respectively. This allows us to partially account for the extra stability of extreme ratings noticed in prior research [5].

From P (Π) and P (R|Π) we can compute P (R, Π) and P (R1,R2) (where R1 and R2 are simply ratings drawn from P (R|Π) with the same value of Π).
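As an illustration of this generative model, the sketch below (Python with NumPy and SciPy; not the code used for the dissertation's simulations) builds P(Π) and P(R|Π) from the bucketed beta preference distribution and Gaussian rating noise, then computes I(R1; R2) exactly by summing over the joint distribution. The parameters match the caption of figure 6.3, except for mR = 10, which is chosen arbitrarily for the example.

import numpy as np
from scipy import stats

def rating_model(alpha, beta, m_pi, m_r, sigma):
    """Return P(pi) and P(r | pi) as arrays of shape (m_pi,) and (m_pi, m_r)."""
    pi_edges = np.linspace(0.0, 1.0, m_pi + 1)
    p_pi = np.diff(stats.beta.cdf(pi_edges, alpha, beta))      # mass of each preference bucket
    pi_centers = (pi_edges[:-1] + pi_edges[1:]) / 2.0

    r_edges = np.linspace(0.0, 1.0, m_r + 1)
    r_edges[0], r_edges[-1] = -np.inf, np.inf                   # end buckets extend to infinity
    # P(R = r | Pi = pi): mass of N(pi, sigma) falling in each rating bucket
    cdf = stats.norm.cdf(r_edges[None, :], loc=pi_centers[:, None], scale=sigma)
    p_r_given_pi = np.diff(cdf, axis=1)
    return p_pi, p_r_given_pi

def preference_bits(p_pi, p_r_given_pi):
    """I(R1; R2) where R1 and R2 are independent draws given the same Pi."""
    joint = np.einsum("p,pi,pj->ij", p_pi, p_r_given_pi, p_r_given_pi)  # P(R1, R2)
    marg = joint.sum(axis=1)                                            # P(R1) = P(R2)
    with np.errstate(divide="ignore", invalid="ignore"):
        ratio = joint / np.outer(marg, marg)
        terms = np.where(joint > 0, joint * np.log2(ratio), 0.0)
    return terms.sum()

# Example with the parameters from figure 6.3 and an illustrative 10-point scale.
p_pi, p_r_pi = rating_model(alpha=3, beta=1, m_pi=50, m_r=10, sigma=0.05)
print(preference_bits(p_pi, p_r_pi))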

What factors affect true preference bits per rating?

We first consider the relationship between the true preference bits per rating and the rating scale. Figure 6.3 shows that increasing the size of the rating scale leads to an asymptotic increase in preference bits per rating. By experimenting with different parameter values we observed that the asymptote reached is a function of the preference scale and the amount of noise introduced. As the true preference scale increases, we see a logarithmic increase in the maximum preference bits per rating. Likewise, when we increase noise the maximum preference bits per rating decreases. We found this result holds for a wide range of α and β values. From this result we conclude that for a given domain and group of users there is some number of rating options after which there is little value in increasing the resolution of a rating interface. This conclusion assumes that the rating scale a system presents does not substantially impact how deeply a user processes their own preferences. In our simulation, this could be modeled by increasing mΠ as a function of mR, which we did not test. Even should that be true, it is highly unlikely that the amount of extra thought would scale linearly with larger rating scales. Therefore there is likely still a point after which larger rating scales simply gather more noise, for very little extra preference information.

How well does I(R1; R2) proxy I(R; Π)?

Next we will use our model to assess the accuracy of I(R1; R2) as a measure of I(R; Π). For I(R1; R2) to be a useful measure of I(R; Π), we want I(R1; R2) to correlate with I(R; Π). To evaluate this we compute I(R1; R2) and I(R; Π) for a range of model parameters for the preference distribution (α, β ∈ {0.5, 1, 2, ..., 6}), internal and input scales (mΠ, mR ∈ {2, 3, 4, 5, 10, 15, ..., 60}), and noise (σ ∈ {0.01, 0.1, 0.2}). This let us consider a wide range of preference models and rating scales under low, medium, and heavy noise. Using this data we found a strong linear dependency between I(R1; R2) and I(R; Π) (R² = 0.984). Therefore, we expect I(R1; R2) to be a good proxy measure of the true preference bits per rating. Note that we do not separately evaluate the measurement error of rating noise H(R1|R2), as this is directly related to I(R1; R2) through equation 6.3.

How should we estimate I(R1; R2) from data?

Finally, we used our model to assess a strategy for measuring I(R1; R2) on sampled data. We use the Miller-Madow method to estimate mutual information [84]. We estimate the joint probability P(R1, R2) as the number of times the two rating values co-occurred divided by the number of ratings. Where mR1 and mR2 are the number of possible values for R1 and R2 (mR in our case), the Miller-Madow estimator is

$$\sum_{r_1}\sum_{r_2} \hat{P}(r_1,r_2)\,\log_2\frac{\hat{P}(r_1,r_2)}{\hat{P}(r_1)\,\hat{P}(r_2)} \;-\; \frac{(m_{R_1}-1)(m_{R_2}-1)}{2\,\log_e 2 \cdot N} \qquad (6.9)$$

The first term in this equation is simply the equation for mutual information. Unfortunately, directly measured mutual information suffers from a bias, meaning that the first term alone will tend to overestimate the mutual information. The second term corrects for the first order of magnitude of this error, substantially improving our estimate.4 Ideally, we would use an unbiased estimate of mutual information, instead of simply correcting for the bias in the estimator. Unfortunately, it has been shown [95] that there is no unbiased estimator for mutual information; therefore it is important that large sample sizes (N >> mR²) be used to avoid estimator bias. This requirement will make measuring input preference bits per rating difficult on a per-user basis. As the correction term shows, the bias is also a function of the size of the rating scale, meaning that very large rating scales may also be hard to accurately estimate. While more complex estimators for mutual information may reduce these issues [95], we will focus only on large datasets. Even with large datasets, it is important to apply the correction term when comparing information measures across different rating interface scales. When experimentally comparing interfaces of the same size, the correction term can be ignored, as it will cancel in pairwise comparisons. We used our model to perform a simulation study to evaluate the performance of the Miller-Madow estimator on re-rating data. We discovered that noise was consistently less than bias, and that the correction term can over-correct when noise is small and many re-rating pairs are never observed. We found that a sample size of 1000 was sufficient to reduce bias and standard deviation to less than 0.1 bits for rating scales with up to 20 points. This can be used as a guideline for a preferred sample size for experiments seeking differences on the scale of 0.1 bits. Since this many ratings are unlikely to come from one user in an experimental context, we recommend pooling ratings from multiple users to estimate P(R1, R2). To limit the effect that any one outlier user has, ratings from as many users as possible should be preferred over a very large number of ratings from any one user.

4 Note: the 2012 paper used a slightly simpler version of this correction term that did not account for the multiple rating scales we use later in this chapter. The numbers reported shortly have been recomputed using this version of the Miller-Madow correction term.
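A minimal sketch of the Miller-Madow-corrected estimate in equation 6.9, applied to re-rating pairs pooled over users, might look as follows. This is not the dissertation's analysis code, and the synthetic usage data at the bottom is purely illustrative; ratings are assumed to be supplied as 0-indexed integers.

import numpy as np

def miller_madow_mi(r1, r2, m_r1, m_r2):
    """Estimate I(R1; R2) in bits from paired rating indices in 0 .. m-1."""
    r1, r2 = np.asarray(r1), np.asarray(r2)
    n = len(r1)
    joint = np.zeros((m_r1, m_r2))
    np.add.at(joint, (r1, r2), 1.0)
    joint /= n                                    # empirical P(r1, r2)
    p1, p2 = joint.sum(axis=1), joint.sum(axis=0)
    with np.errstate(divide="ignore", invalid="ignore"):
        ratio = joint / np.outer(p1, p2)
        plug_in = np.where(joint > 0, joint * np.log2(ratio), 0.0).sum()
    correction = (m_r1 - 1) * (m_r2 - 1) / (2.0 * np.log(2) * n)
    return plug_in - correction

# Hypothetical usage: 1000 re-rating pairs on a 10-point scale, pooled over users.
rng = np.random.default_rng(0)
first = rng.integers(0, 10, size=1000)
second = np.clip(first + rng.integers(-1, 2, size=1000), 0, 9)
print(miller_madow_mi(first, second, 10, 10))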

6.3.2 Study 2: Re-analysis of a past re-rating experiment

In our second study we use a real-world dataset to evaluate the average amount of preference bits we get from ratings elicited using several different rating scales. Like the first study, the second study seeks to understand how the rating scale used in an interface affects the amount of information gathered. This study goes one step further by also considering the trade-offs involved in going to a larger scale, such as slower rating time and fewer overall ratings. For this study we re-analyzed the SCALES dataset from Cosley et al. [26]. The SCALES dataset contains 3226 movie ratings collected from 77 MovieLens users on 882 movies using a variety of rating scales. This is enough ratings to avoid a significant estimator bias, although the dataset is limited by a relatively small number of users.5 While the SCALES dataset is not perfect, it is large enough to allow us to perform a preliminary analysis. By re-analyzing past data we can get tentative results at little to no additional experimental cost. I will note, however, that these results should only be taken as tentative, pending a larger, more carefully controlled study of rating scales. Ratings in the SCALES dataset were entered on one of three different rating interfaces: a 2-point thumbs up / thumbs down, a 6-point scale from -3 to +3 with no zero, and a 10-point, 5-star scale with half stars. Re-ratings are made by pairing these users' original ratings entered on MovieLens (made on a five star interface) with their new ratings. While these ratings are not necessarily on the same rating scale, they can still be used to estimate preference bits per rating. The preference bits model places no restriction on the two ratings sharing a rating scale; therefore the mutual information between a previous 5-star rating and a 6-point -3 to +3 rating will still be a lower bound on the true input bits for an interface. That said, this may still add a confound if some scales map more closely to the original scale regardless of actual preference bits. Unfortunately, we have no way to control for this confound in this study.

5 The original analysis of the SCALES dataset implies that only 2795 ratings were collected [26]. We contacted the lead author; it appears that the remaining 431 ratings were excluded from that analysis.

Rating Scale          2 point   6 point   10 point
Preference Bits       0.434     0.813     0.785
Rating Noise          0.436     1.572     2.384
Bits per second       0.1110    N/A       0.1898
Bits per impression   0.3206    N/A       0.5016

Table 6.2: Results from the re-analysis of the SCALES dataset

Using the measurements introduced in section 6.2 we computed the results seen in table 6.2. As can be seen, there is a clear advantage to larger rating interfaces, as both the 6 and 10 point rating scales contain substantially more bits per rating. Surprisingly, the amount of information gained with the 10 point interface was less than the amount gained with the 6 point interface. This contradicts our previous results and suggests that the 6 point interface (with no zero) can gather information more efficiently than the five star with half star interface. This may be because the lack of a zero option forces users to form a more concrete opinion, or it could be an artifact of noise in our study. Because the re-ratings were not done on a single scale it is hard to be sure, and a follow-up study with a more rigorous methodology would be needed to settle the question.

Ideally these results should be quantified with a measure of statistical confidence. In general, statistical confidence around information theoretic values seems to be a poorly studied field; therefore there are no simple frequentist methods for hypothesis testing or generating confidence intervals. While general purpose statistical techniques can be tailored to this data, none are ideal for this purpose. Even if statistical measures could be provided, it might not be prudent to provide them here, as we have no way to account for the various issues mentioned earlier (re-ratings are not on the same interface, the time between rating and re-rating is not carefully controlled, etc.).

Looking at the rating noise we see a trend of increasing rating noise with increasing rating scales.6 This should not be surprising: larger rating scales almost certainly have more total entropy, and if the amount of preference bits does not substantially increase as rating scales get larger, then the rating noise will. In particular, from this we can see that the 2 point scale is about half information and half noise.

6 This noise measurement was computed with a bias correction term comparable to the one in equation 6.9.

Scale points             2       5       10*     100
Rating time (seconds)    3.91    4.09    4.14    5.13
Ratings per impression   0.665   0.641   0.639   0.601

Table 6.3: Mean rating time per item in seconds for each scale and domain based on Sparling and Sen’s study. Rating times with a ‘*’ indicate interpolated values.

The 6 point scale has a lower ratio of signal to noise than the 2 point scale, but also has almost twice the overall amount of preference bits. The 10 point scale has fewer preference bits than the 6 point scale, and more noise, meaning it is likely a worse interface for preference elicitation. While the larger interfaces provide more information, they also take longer to use than the 2 point interface. One reason the 6 point interface may have worked so well, for example, is that the lack of a zero option forced users to think more carefully before entering their ratings. To address this trade-off we use the preference bits per second and preference bits per impression measurements. We will use the time to enter a rating measured by Sparling and Sen, as seen in table 6.3. Because Sparling and Sen do not assess the 10 point scale, the listed values are estimated by linear interpolation.7 We do not estimate a time for the 6 point scale because it is not clear how this interface compares with a traditional six star rating scale. Combining the times per rating from table 6.3 with our measured preference bits per rating for the thumbs and five star with half star interfaces, we get 0.434/3.91 = 0.1110 and 0.785/4.14 = 0.1898 bits per second respectively. While this is a rough estimate, it suggests that, despite the faster ratings, the binary scale gives us information more slowly than the five star interface. This can be seen as evidence that the 2 point interface is too limited to capture the entirety of a user's preference in an efficient manner. Just as smaller rating scales tend to be faster, they also tend to collect more ratings per impression. This is presumably a factor of appearing to require less effort to use. The ratings per impression shown in table 6.3 are computed based on data from Sparling and Sen's study and represent the number of movies rated versus the number of movies shown.8

7 These numbers do not match those presented in [66]. That work claims that linear interpolation was used to compute values for 10 rating points, but the reported numbers are not what would be expected. At this time I cannot explain the deviation. The numbers shown here do follow from direct linear interpolation between the size 5 scale and the size 100 scale.

By multiplying the ratings per impression with the bits per rating we can arrive at an estimate of the amount of preference bits we collect with each impression, that is, each time the user is exposed to a rating interface. From table 6.2 we can see that the 10 point scale gets substantially more preference bits per impression. This suggests that, while a 2 point rating scale may get more ratings overall, the substantial loss of information contained in those ratings compared with the 10 point scale leads to collecting less overall information. From these results we can tentatively conclude that the thumbs up / thumbs down interface would not be the best interface for the movie domain used in the SCALES dataset. While the thumbs up / thumbs down interface seems to get more ratings, and faster, these ratings contain roughly half the preference information. Ratings on the five stars with half stars interface yield more preference bits per second and per impression (under controlled item rating scenarios) and therefore collect more total information, even if it takes slightly longer to collect those ratings. The interesting outlier of this analysis is the -3 to +3 scale with no zero. Ratings on this scale contained more preference information than either other interface, and less rating noise than the five star with half star interface. Unfortunately, we lack the data at this time to further explore this interesting interface option. Future work should revisit this interface to better understand its promising results in this study.

6.4 Study 3: An Experiment With Rating Support Interfaces

Our final study takes a slightly different approach to understanding the information properties of rating interfaces. Instead of focusing on how different rating scales affect usable information, this study seeks to understand how the support we give to users as they rate affects the quality of the ratings entered. To that end we created three rating support interfaces: rating interfaces with extra features designed to aid the preference formation and elicitation process. The hope is that one or more of these interfaces will enhance the amount of information users provide through rating interfaces, or at least reduce the amount of noise.

8 These numbers were computed from Sparling and Sen's study by the authors upon request. These numbers cannot be found in their report.

This work was first reported at RecSys 2013 [88]. Our experiment focuses on two major forms of rating support: memory support and rating scale support. Memory support interfaces target the memory processes involved in rating. Since preference construction relies on what a user can remember about an item, if we help a user remember those aspects of an item that might be important to them, we should help them form more consistent and more meaningful preferences. For memory support to be the most helpful, we believe it should be personalized to the user. Rating scale support focuses on the process by which a rating is selected after preference formation. Past research has shown that anchors [129] and task characteristics [58, 130] can strongly impact ratings, especially when rating scales themselves are abstract. Therefore, we hypothesize that the process of mapping preferences to rating scales may be a primary cause of noise in ratings, and that supporting this process may mitigate this noise. It is not uncommon to see websites provide some form of label for each rating value in an effort to provide this type of support. Prior work on rating interface design [91] has suggested using the users' past ratings to help them make future ratings. In particular, Nobarany et al. suggest an interface in which rating options are annotated with past ratings. The hope is that the past ratings will help people make ratings that are ordered consistently. Psychological research on preference elicitation has investigated the differences between evaluating items on their own (single evaluation) and comparing items against each other (joint evaluation) [58]. When items are evaluated in isolation, the evaluation is absolute and will be predominantly based on aspects that are easy to evaluate. Using joint evaluation allows users to be more sensitive to important dimensions on which items vary, which may be hard to evaluate in isolation. We will support joint evaluation in our interface by showing exemplars for each point on the rating scale. These exemplars will be other, similar items rated in the past that serve as personalized anchors on the scale, e.g., 'you previously rated this item 3 stars'. Research on comparative evaluation [86] suggests these exemplars should be similar to the item being rated to serve as good anchors; otherwise it might be hard to make any comparison (or even contrast effects can occur).

6.4.1 Rating Interfaces

With this motivation we will compare four different rating interfaces: one with minimalistic support that serves as the baseline, one that uses tags to try to remind the user of important properties of the item, one that provides exemplars of past ratings to anchor the rating scale, and a final one that combines the previous two features (figure 6.4). Figure 6.4a presents the baseline design allowing users to map their internal preferences into ratings. To assist our users in the rating process, we provide basic information such as posters and titles to ensure that the item we are asking the user to rate is unambiguous. However, posters and titles likely do not provide sufficient memory support. Therefore, we designed an interface that provides personalized memory support.

The Tag Interface

In our design process, we looked for a personalizable form of memory support that triggers and supports the recall process. While some users remember a movie because it is funny, others remember the movie because it is visually appealing. To provide personalized information about the movie, we use the tag genome [132] combined with users' previous ratings. The tag genome is a tag-movie relevance matrix, inferred from user-supplied tags, ratings, and external textual sources. The relevance of a tag t to a movie m determined by the tag genome is a value between 0 and 1, denoted rel(t, m). We remove two classes of tags from the genome prior to rendering our interfaces. The first is tags such as 'excellent' and 'great' that indicate a general quality judgment of the movie; these are redundant with ratings and do not provide any information about the movie. The second is binary metadata tags such as directors, actors, 'based on a book', etc. These tags describe properties that are objectively verifiable and either true or false; it is incorrect to infer them when they are not explicitly added to the item. The set of tags we removed this way was manually generated by inspection of the list of tags returned by our tag selection algorithms for any potential user. As tags were removed, we regenerated the list of tags and manually inspected it again. Future work should build a way to incorporate such metadata tags, but for now we restrict our use of the tag genome to tags for which inference makes sense. Figure 6.4b shows the tag interface.

Figure 6.4: Our four interface designs to provide memory support, rating support, and joint-evaluation support: (a) Baseline Interface, (b) Tag Interface, (c) Exemplar Interface, (d) Tag + Exemplar Interface

To personalize the interface, we select tags that best help users reconstruct their memory of the movies. Due to the limited space in the interface, only a set of 10 personalized tags from 1574 candidates can be displayed. To address this problem, we use Nguyen et al.'s approach [88] to compute how much each tag t tells us about a user u's preferences, denoted pMI(t, u). We hypothesize that the more information a tag can tell us about user u's preference, the more likely it is to be helpful to the user. Moreover, a useful tag must be sufficiently relevant to the movie. Therefore, we require that any selected tag have a minimum relevance of 0.25. Similar tags, such as 'political' and 'world politics', diminish the effectiveness of memory support if they are displayed together. To address this challenge, we place a similarity constraint on the tag selection algorithm. Our prototyping suggests that mutual information between tags works well as a measure of tag similarity.

Formally, let P(t1) be the probability that tag t1 is applied to a random movie, and P(t1, t2) be the joint probability that tags t1 and t2 are both applied to a random movie. Hence, the tag similarity, denoted dTagSim(t1, t2), is defined as dTagSim(t1, t2) = I(t1, t2) / H(t2), where I(t1, t2) is the mutual information of the two tags, and H(t2) is the entropy of tag t2. Note that dTagSim, or directed tag similarity, is asymmetric, i.e., dTagSim(t1, t2) ≠ dTagSim(t2, t1). After several trials, we found 1% (0.01) to be a reasonable cut-off. With these concepts laid out, our tag selection approach is a constrained optimization problem described as follows:

$$\begin{aligned}
\operatorname*{argmax}_{T^\star} \;& \sum_{t \in T^\star} rel(t,i) \times pMI(t,u) \\
\text{subject to}\;& |T^\star| = 10 \\
& dTagSim(t_i, t_j) < 1\% \quad \forall\, t_i, t_j \in T^\star,\ i \neq j \\
& rel(t,i) > 0.25 \quad \forall\, t \in T^\star
\end{aligned}$$

Since finding sets of perfect personalized tags is computationally expensive, we implement a greedy algorithm to find an approximate solution. Starting with an empty set of tags, the algorithm iteratively selects the tag with the highest rel(t, i) × pMI(t, u) such that, when added to the result, the constraints are still satisfied.
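A sketch of this greedy selection is shown below. It assumes precomputed lookups for rel, pMI, and dTagSim; the names and data structures are illustrative, not the system's actual implementation. Because dTagSim is asymmetric and the constraint above ranges over all ordered pairs, the similarity check is applied in both directions.

def select_tags(candidates, rel, pmi, dtagsim, k=10,
                min_rel=0.25, sim_cutoff=0.01):
    """Greedily pick up to k tags maximizing rel(t, i) * pMI(t, u)
    subject to the relevance and pairwise-similarity constraints."""
    scored = sorted(candidates, key=lambda t: rel[t] * pmi[t], reverse=True)
    chosen = []
    for t in scored:
        if rel[t] <= min_rel:
            continue  # not relevant enough to this movie
        # reject tags too similar (in either direction) to an already chosen tag
        if any(dtagsim[(t, c)] >= sim_cutoff or dtagsim[(c, t)] >= sim_cutoff
               for c in chosen):
            continue
        chosen.append(t)
        if len(chosen) == k:
            break
    return chosen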

The Exemplar Interface

The exemplar interface improves upon the baseline by showing exemplars. An exemplar for a rating-movie pair is a movie that the user previously gave the same rating. Figure 6.4c shows the design of the exemplar interface. When initially presented, the image on the right of the interface is blank. When the mouse moves over any rating value on the star rating widget, an exemplar with that rating is shown on the right. This exemplar changes if the mouse moves over another rating value, or returns to the blank image if the mouse leaves the rating widget. The exemplar at a specific rating option is the most similar movie to the movie to rate, based on the cosine similarity of the movies' rating vectors. If there is a particular rating value that the user has never assigned, a blank image is shown. Movies that are selected for rating are not used as exemplars.
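The exemplar choice described above can be sketched as follows. Variable names are illustrative, and the code assumes that the user's rating dictionary already excludes the movies selected for rating in the study.

import numpy as np

def cosine(a, b):
    """Cosine similarity between two rating vectors."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return 0.0 if denom == 0 else float(a @ b / denom)

def pick_exemplars(target_vec, user_ratings, movie_vecs, rating_values):
    """user_ratings: {movie_id: rating}; movie_vecs: {movie_id: rating vector}.
    Returns {rating value: exemplar movie_id, or None for a blank image}."""
    exemplars = {}
    for value in rating_values:
        candidates = [m for m, r in user_ratings.items() if r == value]
        exemplars[value] = max(
            candidates,
            key=lambda m: cosine(movie_vecs[m], target_vec),
            default=None,            # user never assigned this rating value
        )
    return exemplars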

The Tag + Exemplar Interface

To investigate the effect of combining 'memory support' and 'rating support', we develop the fourth interface, as seen in figure 6.4d. In this interface, users make use of personalized tags to compare their previously rated movies to the movie to rate. Initially, the interface shows a single movie and a list of tags. When users hover over a star, an exemplar is shown on the right of the interface. Each tag then moves either flush left or flush right to denote which movie the tag is more relevant to (the movie to rate or the exemplar, respectively). The tag moves to the center if it has approximately equal relevance to both movies (relevance values within 0.1 of each other). To help users understand the movement of the tags, a Venn diagram is shown in the background of the interface. This diagram reinforces the idea that tags on one side or the other best describe the associated movie, and tags in the center describe both movies equally well. The tag + exemplar interface uses the same exemplar selection strategy as the exemplar interface. Selecting tags, however, is more complicated in this interface. We want tags that not only are relevant to the user and movie, but also may help the user compare between movies. We measure this in two ways. First, for each tag t we measure σ(t), the standard deviation of the relevance values of each chosen exemplar to that tag. If a tag's relevance varies greatly between the exemplars then it can help differentiate them. Second, we measure how well a tag splits between being more relevant to the movie to rate and being more relevant to the exemplars. If a tag is always more relevant to the movie to rate, or always more relevant to the exemplars, then the tag does not help the user compare the different rating choices. To measure this we define Z(i, t) = max(Mt, Lt) / min(Mt, Lt), where Mt is the number of exemplars to which tag t has higher relevance than to the movie to rate, and Lt is the number of exemplars to which tag t has lower relevance than to the movie to rate. When the tag would appear with the exemplars as often as with the movie to rate, Z(i, t) = 1; as the balance becomes less equal, Z(i, t) becomes larger. Our algorithm for the tag + exemplar interface solves another constrained optimization problem, described as follows:

$$\begin{aligned}
\operatorname*{argmax}_{T^\star} \;& \sum_{t \in T^\star} rel(t,i) \times pMI(t,u) \times \sigma(t) \div Z(i,t) \\
\text{subject to}\;& |T^\star| = 10 \\
& dTagSim(t_i, t_j) < 1\% \quad \forall\, t_i, t_j \in T^\star,\ i \neq j
\end{aligned}$$

Since finding sets of perfect personalized tags is computationally expensive, we again implement a greedy algorithm to find an approximate solution. Starting with an empty set of tags, the algorithm iteratively selects the tag with the highest rel(t, i) × pMI(t, u) × σ(t) ÷ Z(i, t) such that, when added to the result, the constraints are still satisfied.
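For concreteness, the per-tag score used by this greedy step can be sketched as below. The epsilon guard against division by zero (e.g., when a tag is never less relevant to any exemplar) is an assumption of this sketch, since the text does not specify how such ties are handled.

import numpy as np

def tag_score(rel_ti, pmi_tu, exemplar_rels, eps=1e-9):
    """Score rel(t, i) * pMI(t, u) * sigma(t) / Z(i, t) for one tag.
    exemplar_rels: relevance of tag t to each chosen exemplar."""
    exemplar_rels = np.asarray(exemplar_rels, dtype=float)
    sigma = exemplar_rels.std()                 # spread over the exemplars
    m = np.sum(exemplar_rels > rel_ti)          # exemplars where the tag is more relevant
    l = np.sum(exemplar_rels < rel_ti)          # exemplars where the tag is less relevant
    z = max(m, l, eps) / max(min(m, l), eps)    # imbalance term Z(i, t)
    return rel_ti * pmi_tu * sigma / z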

6.4.2 Measurements

To compare our interfaces, we ask users to rate 50 movies, and to re-rate these movies after at least two weeks. Users are asked to answer 22 survey questions after finishing the second round. To help users understand the interfaces, we provide a tutorial for each interface explaining it and walking users through it step by step. Our analyses are based on the measurements described in section 6.2. Our primary measurements will be the preference bits and rating noise measurements introduced previously. We will augment these with the magic barrier (minimum possible RMSE) estimate from Said et al. [111]. Finally, to confirm these more theoretically oriented metrics, we will also build a simple item-item recommender for each interface and compare the RMSE of these interfaces directly. We expect that a good interface will either increase preference bits, decrease rating noise, or both. We further expect that an increase in preference bits or a decrease in rating noise will be associated with a decrease in prediction error according to our two RMSE-based metrics. As noted before, since this is a comparative study we will not be using a bias correction term for our measurements of information theoretic properties. The standard bias correction terms are a function of sample size and number of rating options, both of which are held near-constant in our experiment. Furthermore, our recruitment targets were large enough that, based on previous calculations, bias should be a minimal factor in our measurements.

Beyond these metrics we will also measure the amount of time it takes to enter ratings as a proxy for cognitive load, following Sparling and Sen's report. Along with a user survey, this will help us understand any trade-offs made in the pursuit of more informative ratings. We will augment the rating time measurement with a user survey to better understand users' subjective experience when using these interfaces. Our survey will follow Knijnenburg et al.'s [67] evaluation framework for user experience. The framework models how objective system aspects such as the user interface influence users' subjective perceptions and experience, controlling for other factors such as situational characteristics and personal characteristics. In particular, we measure how the usefulness of the system (experience) is influenced by the perceived difficulty of the system (subjective system aspect), and how each of these is affected differently by the 4 different interfaces, controlling for self-reported expertise as a personal characteristic. The survey consists of 22 seven-point Likert-scale statements, ranging from 'Strongly disagree' to 'Strongly agree', that query the theoretical constructs of interest: usefulness of the interface, difficulty of the interface, and self-reported movie expertise. Some questions are inverted to check users' responses for coherence and lack of effort. Participants have to answer all questions. They can change their answers before submission. Participants can provide extra feedback in a text box. We also ask for (but do not require) participants' age and gender.

6.4.3 Experiment Structure

We recruited MovieLens users for our study. In order to make our algorithms work on these interfaces, we select only users who have rated at least 100 movies that have both tag genome and Netflix title data available. We also require that these 100 ratings were provided after MovieLens changed to its current half-star rating scale.9 Finally, we require that users have logged in at least once in the four-year period from 2009-03-01 to 2013-03-01. Data for our experiment was prepared on 2013-03-01. This step involved selecting possible users, randomly assigning them to interfaces, and preparing the list of movies, as well as tags and exemplars, for each user. By pre-computing this information we were able to build highly responsive interfaces without necessarily seeking highly optimized implementations of our algorithms. After preparing data for the experiment, we randomly invited 3829 users via email. Our participants are asked to rate 50 movies in the first round, and to re-rate these 50 movies in the second round. To simulate the real rating experience, the participants are asked to rate the most recent 50 movies that they rated prior to the study. While this means that we do have past MovieLens ratings for these users, we do not use them to make re-rating pairs in our experiment, as they occurred over a large time range and were not made on our experimental interfaces. To avoid users recalling recently rated movies, we wait at least 14 days after data preparation before inviting users to participate in our study (round 1). This guarantees that at least two weeks have passed since the user last rated the selected movies in MovieLens. For the same reason, we also wait 14 days before starting round 2. After finishing the second round, participants are asked to complete a survey about their experience with the interfaces as well as the rating process. For every 20 participants, one randomly selected participant received a $50 Amazon gift card.

6.4.4 Results

Our experiment ran from March 21st 2013 until April 25th 2013, collecting 38,586 ratings10 from 386 users, of which 103 users were assigned to the baseline interface, 100 to the exemplar interface, 98 to the tag interface, and 85 to the tag + exemplar interface.

9 We estimate that MovieLens switched to a five-star rating scale with half-star increments on 2003-02-12.
10 Due to network issues, we had 14 incomplete ratings which were removed from all of the analyses.

Interface       Minimum RMSE   RMSE round 1   RMSE round 2   Preference Bits per Rating   Rating Noise   Mean Rating Time   Median Rating Time
Baseline        0.232          1.13           1.12           1.26                         1.71           4.26               3.71
Exemplar        0.209          0.95           0.95           1.22                         1.62           5.53               4.60
Tag             0.237          1.03           1.10           1.22                         1.72           4.49               4.07
Tag+Exemplar    0.227          1.04           1.01           1.25                         1.70           6.70               5.60

Table 6.4: Our quantitative results for the four interfaces

352 participants provided information about age (mean: 36, min: 18, max: 81, σ: 11). 304 were male, 72 female, and 10 did not report a gender. Our analyses do not reveal significant differences between the two genders.

RMSE

We follow the procedure described in [111] to estimate the minimum RMSE for each interface. The estimated minimum RMSEs for our interfaces are given in table 6.4. To test these differences for statistical significance we compute an estimated minimum RMSE for each user. Using an ANOVA on the per-user minimum RMSE estimates we find marginal significance that the average per-user minimum RMSE differs between interfaces (p = 0.0574). Using a Tukey HSD test we find marginal evidence that the exemplar interface has a lower average minimum RMSE than the tag and baseline interfaces (adjusted p = 0.0689 and 0.098 respectively). To control for possible error in the ANOVA due to violations of the normality assumption, we confirm these results with a pairwise Wilcoxon rank sum test over the per-user minimum RMSE, using the Holm method to correct for multiple comparisons. We find that the minimum RMSE is significantly different between the exemplar interface and the baseline (p < 0.005); however, we find only minimal support for the conclusion that the exemplar interface differs significantly from the tag interface (p = 0.208). Therefore we conservatively conclude that the only significant pairwise difference in our minimum RMSE estimates is between exemplar and baseline. To compare these results against actual predictive accuracy we use the LensKit toolkit [40] to train and evaluate an item-item collaborative filtering recommender over the ratings from the four interfaces. We use 10-fold cross validation with a 10-item holdout set for each test user. Table 6.4 shows the per-user RMSE of each condition for each round. The exemplar interface has the lowest RMSE in both rounds. The measured RMSEs are significantly different (p < 0.001 using an ANOVA on the user RMSEs); the exemplar interface had lower RMSE than the baseline in round 1 (p < 0.001 using a Tukey HSD post-hoc test), and the exemplar interface had lower RMSE than both the baseline and tag interfaces in round 2 (p < 0.05).

Preference Bits per Rating

We estimate preference bits using the technique described earlier in this chapter. The results are summarized in table 6.4. To our knowledge, significance testing on information theoretic values is not a well studied field, and there are no simple frequentist tests for the significance of differences. Therefore, to test for significance in our measured mutual information differences we use a resampling-based strategy to test each pairwise difference. In particular we use a permutation test executed at the per-user level. Under the null hypothesis, ratings from any two interfaces are drawn from the same distribution (and therefore have the same true mutual information), and all permutations of the assignments of users to conditions are equally likely. Therefore, to test for differences between any two conditions we randomly permute the assignments of users to conditions. It is important to permute interface labels over users rather than over individual ratings, to control for outlier users who rate with significantly more or less information or noise than the average user. For each permutation we compute the difference in means of measured mutual information between the two conditions. We reject our null hypothesis if fewer than 5% of our random trials have differences of means as large as or larger than observed. We performed this test using 10000 permutations and found that no pairwise difference in mutual information was statistically significant. Therefore we do not have sufficient evidence to conclude that our interface modifications had a significant effect on the amount of preference information extracted.
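One plausible implementation of this user-level permutation test is sketched below. It is not the study's analysis code: in particular, it uses the pooled (uncorrected) mutual information per condition as the test statistic, which is one reasonable reading of the "difference in means of measured mutual information" described above, and it assumes ratings are supplied as 0-indexed integers.

import numpy as np

def pooled_mi(pairs):
    """Plug-in mutual information (bits) over pooled (r1, r2) rating pairs."""
    r1, r2 = np.array([p[0] for p in pairs]), np.array([p[1] for p in pairs])
    m = max(r1.max(), r2.max()) + 1
    joint = np.zeros((m, m))
    np.add.at(joint, (r1, r2), 1.0)
    joint /= joint.sum()
    p1, p2 = joint.sum(axis=1), joint.sum(axis=0)
    nz = joint > 0
    return float((joint[nz] * np.log2(joint[nz] / np.outer(p1, p2)[nz])).sum())

def permutation_test(users_a, users_b, n_perm=10_000, seed=0):
    """users_a / users_b: {user_id: [(r1, r2), ...]} for each condition."""
    rng = np.random.default_rng(seed)
    all_users = list(users_a.items()) + list(users_b.items())
    n_a = len(users_a)
    observed = pooled_mi(sum(users_a.values(), [])) - pooled_mi(sum(users_b.values(), []))
    idx = np.arange(len(all_users))
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(idx)                                   # permute users, not ratings
        a = sum((all_users[i][1] for i in idx[:n_a]), [])
        b = sum((all_users[i][1] for i in idx[n_a:]), [])
        if abs(pooled_mi(a) - pooled_mi(b)) >= abs(observed):
            hits += 1
    return hits / n_perm                                   # two-sided permutation p-value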

Rating Noise The rating noise of an interface is estimated by the conditional entropy of the first rating given the second rating in a re-rating dataset. The conditional entropy estimates for each interface can be seen in table 6.4. The rating noise metric generally agrees with the RMSE-based metrics, especially in identifying the exemplar interface as having the least noise. To test these measurements for differences we use the same resampling strategy discussed above. Using 10000 random permutations we find that no pairwise difference in conditional entropy is significant. Therefore we do not have sufficient evidence to conclude that our interface modifications had a significant effect on rating noise. It is interesting to note that while we are not able to conclude that our interfaces have an effect on rating noise, we are still able to conclude effects on the RMSE of the ratings. This result is contrary to the hypothesis that noise in user ratings is the primary cause of the magic barrier. We expect that the difference in statistically significant findings between the RMSE-based measurements and the information theoretic measurements is largely due to differences in power between our statistical tests. Because statistics on information theoretic measures is a poorly explored area, it is possible that analysis with a more powerful technique would allow us to find significant differences in our rating noise metric that match the results found in the RMSE section. One limitation of the power of the statistical tests for every metric is the number of users. Since users are likely to rate with different patterns from one another, ratings from one user cannot be assumed to be independent of each other. Therefore the statistical methods for each metric must account for error at the user level. As this makes the number of users play a role similar to the sample size when considering statistical power, I would recommend that future experiments seek larger groups of users, even if this comes at the expense of fewer ratings per user.

Objective User Experience (Cognitive Load) Cognitive load is estimated by measuring the time it takes users to rate a movie. We assume that it will take users time to process the information presented by the interfaces. A good interface should decrease the cognitive load and time required to enter a rating. Before analyzing the rating times, we first exclude very long rating times. Since we could not monitor the users directly, it is possible that some users got distracted or took a break while entering ratings, leading to an artificially inflated rating time. Such long times introduce bias into our analysis since these rating times do not reflect real users' cognitive loads. We exclude the top 1% of the data points for each interface, resulting in cut-off points of 41 seconds for the baseline interface, 43 seconds for the tag interface, 55 seconds for the exemplar interface, and 79 seconds for the tag + exemplar interface. As users made 50 ratings in each round, we take the average rating time per round as an estimated measure of cognitive load. Table 6.4 shows the statistics of rating time per interface aggregated over both rounds. Our ANOVA analyses suggest that the required cognitive load for the baseline and tag interfaces is the smallest among the four (p < 0.01). The difference between these two interfaces is not significant (p = 0.29). This suggests that users do not take advantage of the memory support provided by the tag interface. Participants spent the most time on the tag + exemplar interface, followed by the exemplar interface (the difference is significant, p < 0.0001). All other pairwise comparisons are also significant (p < 0.0001). This suggests that users utilize the support provided by the two exemplar-empowered interfaces. At the same time, it also implies that taking advantage of exemplars when rating leads to an increased cognitive load, at least under our current interfaces. We also performed a multilevel linear regression analysis with random intercepts to take into account the 50 repeated observations per round. The results of this analysis were similar to those reported above. Likewise we also computed a separate ANOVA analysis for the two rounds. This found no statistically significant difference between rounds.
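The outlier trimming and per-interface aggregation described here amounts to a simple quantile cut. A minimal sketch (illustrative names, not the original analysis code) is given below.

import numpy as np

def trimmed_mean_times(times_by_interface, quantile=0.99):
    """times_by_interface: {interface name: per-rating times in seconds}.
    Drops the top 1% of times within each interface, then averages the rest."""
    out = {}
    for interface, times in times_by_interface.items():
        times = np.asarray(times, dtype=float)
        cutoff = np.quantile(times, quantile)
        out[interface] = times[times <= cutoff].mean()
    return out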

Subjective User Experience (Self-report) Our questionnaire was designed to measure the user experience constructs of usefulness and difficulty of the interface and the participant's expertise. The items in the questionnaires were submitted to a confirmatory factor analysis (CFA). The CFA used ordinal dependent variables and a weighted least squares estimator, estimating 3 factors. Items with low factor loadings, high cross-loadings, or high residual correlations were removed from the analysis. Factor loadings of the included items are shown in table 6.5, as well as Cronbach's alpha and average variance extracted (AVE) for each factor. The values of AVE and Cronbach's alpha are good, indicating convergent validity. The square roots of the AVEs are higher than any of the factor correlations, indicating good discriminant validity.

Table 6.5: Items presented in the questionnaires with their factor loadings for the three considered aspects: Usefulness (Alpha: 0.83, AVE: 0.546), Difficulty (Alpha: 0.84, AVE: 0.71), and Movie Expertise (Alpha: 0.68, AVE: 0.55). Items without a factor loading were excluded from the analysis.

The subjective constructs from the CFA (table 6.5) were organized into a path model using a confirmatory structural equation modeling (SEM) approach with ordinal dependent variables and a weighted least squares estimator.

Figure 6.5: The structural equation model fitted on the subjective constructs difficulty and usefulness and their relations to the 4 interfaces

In the resulting model, the subjective constructs are structurally related to each other and to the 4 interface types. In the final model, the expertise factor did not relate to any other variable, and was therefore removed from the analysis. The final model had a good model fit11 (χ2(52) = 3408, p < .001, CFI = 0.979, TLI = 0.971, RMSEA = .070, 90% CI: [.054, .085]). Figure 6.5 displays the effects found in this model. Factor scores in the final model are standardized; the numbers on the arrows (A → B) denote the estimated mean difference in B, measured in standard deviations, between participants that differ by one standard deviation in A. The number in parentheses denotes the standard error of this estimate, and the p-value below these two numbers denotes the statistical significance of the effect. As per convention, we exclude effects with p >= .05. The three experimental interfaces are compared to the baseline interface. The SEM model shows that all three interfaces are seen as more difficult than the baseline condition, with the exemplar + tag interface having roughly twice the impact (1.336) of the exemplar (0.655) and tag (0.649) interfaces. Both interfaces with exemplars are perceived as more useful than the baseline interface, with the exemplar + tag interface showing roughly twice the increase in usefulness over the baseline as the exemplar interface. The tag interface is not more useful than the baseline (no significant loading).

11 Hu and Bentler [59] propose cut-off values for the fit indices of CFI > .96, TLI > .95, and RMSEA < .05, with the upper bound of its 90% CI falling below 0.10.

Figure 6.6: Total effects of the interfaces on usefulness; error bars are 1 standard error of the mean. Comparisons are made against the baseline interface.

Reviewing the total structure of the model we observe that difficulty loads negatively on usefulness: the interfaces that are more difficult are also perceived as less useful. We should therefore inspect the total effects on usefulness, which are plotted in figure 6.6. Compared to the baseline interface, the two exemplar-powered interfaces are only slightly more useful (the differences are not significant) because their difficulty stands in the way of a much better user experience. This suggests that one could improve the user experience of these interfaces by making them less difficult (i.e., less confusing and easier to understand), which should increase their perceived usefulness. Likewise, it is possible that prolonged use would allow users to become experts in using these interfaces; it remains unclear whether that would similarly reduce the perceived difficulty associated with them. Note that the total effect on usefulness suggests that the tag interface is actually perceived to be less useful than the baseline (p < .05), mainly because it is more difficult than the baseline. Thus the tag interface is seen as worse than the baseline interface.

6.4.5 Discussion

Overall this experiment was a success. While the tag based interfaces did not show an improvement over the baseline interface, the exemplar interface did. Therefore we can conclude that exemplar based rating support is able to change how users rate and increase the overall quality of user ratings. Several participants even requested to see these interfaces implemented on MovieLens. The success of the exemplar interface, and the relative failure of the tag interface, suggests that understanding the traditional star based rating scale may be a bigger barrier than remembering the item to rate. More work will be required before we can be confident of this conclusion. For example, it may be that the tag interfaces were limited by our choice of algorithm for tag selection, or tags may be the wrong way to help users with preference formation. Likewise, this is only one domain; it is entirely possible that other item domains would show different trends. Another limitation of this experiment is that, while we were able to improve user ratings, we cannot conclude an improvement in users' evaluation of the usefulness of an interface. The major driving force for this is the increased sense of difficulty associated with rating in a rating support interface. This does, however, suggest that work improving these interfaces, and working them into a broader recommender system, might mitigate this weakness. The most difficult interface, for example, takes an entire browser window to rate a single item. While this is reasonable in a testing environment, it certainly would not be reasonable in a live recommender system, and would need to be streamlined before broader deployment.

6.5 Conclusions

Taken together, these three studies demonstrate that the preference bits framework can be applied to understand the impact of various design decisions for rating interfaces. We found that larger rating scales do lead to more information, despite also leading to more noise and cognitive load. Our evidence suggests that this extra information is worth the extra cognitive load, at least when comparing a thumbs up / thumbs down interface with a five star with half stars interface. Our evidence also suggests that a 6 point scale from -3 to +3 with no zero option might be a still better interface than either the thumbs or stars scales; however, we lack the data needed to really understand how this interface affects rating quality and user satisfaction.

Within the space of the five stars with half star interface we showed that factors other than rating scale can affect the information properties of ratings. In particular, our exemplar interface was able to reduce the RMSE of ratings from this interface, both actual and minimum possible RMSE. The same interface also had the lowest rating noise of those tested, although that result is not statistically significant. While we only explored our interfaces for the five star with half star rating scale, there is no reason to believe that these interface modifications wouldn't be applicable to a wide range of rating interfaces.

One of the core focuses of the recommender systems research community has always been algorithm performance. While the metrics may have changed with time, the focus on building algorithms that score better on offline metrics has not. Therefore it is an interesting finding of this work that increased metric values are not exclusive to algorithm-driven work. Our work demonstrates that the goal of improved recommender performance can be achieved through user-centered improvements to the design of the recommender system's interface, not just through improving algorithms. As interfaces for recommender systems have received comparatively less attention, I see this as an interesting direction for future work.

These three studies are not without limitations. First and foremost, all of the experimental work in this chapter has focused exclusively on movie rating and almost exclusively among the set of MovieLens users. It is highly unlikely that MovieLens users are typical of all humans in how they think about rating movies, and it is even less likely that the rating process for all item domains is the same. Future work should replicate this work in other domains to try to understand how general our results are, ideally building a general theory to predict which domains and communities will respond best to which interfaces. For example, it is possible that work in a more objective domain may benefit more from memory support than movie ratings.

The first two studies are also limited by the use of simulated, or reanalyzed, data. While this data is sufficient for us to identify some key trends and make some specific comparisons, much more research is needed before these key trends can be accepted as rating interface theory. A larger scale experiment, especially one looking at the -3 to +3 with no zero interface, remains an interesting direction for future work.
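For researchers planning such a replication, the sketch below illustrates the style of measurement involved. It assumes that the preference information shared between a rating and a re-rating of the same item can be summarized by their empirical mutual information in bits; this is a simplified reading of the preference bits measurements used in this chapter rather than the exact estimation procedure, and the rating values shown are hypothetical.

    import numpy as np
    from collections import Counter

    def mutual_information_bits(first, second):
        """Naive plug-in estimate of mutual information (in bits) between paired ratings.

        first and second are equal-length sequences of ratings given to the same items
        by the same users at two points in time (a re-rating study). The plug-in
        estimator is biased for small samples; see Miller [84] for a classic correction.
        """
        n = len(first)
        p_xy = {pair: count / n for pair, count in Counter(zip(first, second)).items()}
        p_x = {x: count / n for x, count in Counter(first).items()}
        p_y = {y: count / n for y, count in Counter(second).items()}
        return sum(p * np.log2(p / (p_x[x] * p_y[y])) for (x, y), p in p_xy.items())

    # Hypothetical half-star ratings and re-ratings for eight user-item pairs.
    first = [4.0, 3.5, 5.0, 2.0, 4.5, 3.0, 4.0, 1.5]
    second = [4.0, 3.0, 5.0, 2.5, 4.5, 3.5, 4.0, 2.0]
    print(mutual_information_bits(first, second))

Comparing this kind of quantity across interfaces, on much larger samples, is the sort of analysis a larger-scale replication would support.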
Finally, these studies, and indeed the preference bits model, are limited in their assumption that every user must use the same rating scale. Other than convenience for the system developer, there is no a priori reason to assume that users must always rate on the same rating scale. It is reasonable to assume that different users might rate best on different interfaces. A professional movie reviewer, for example, might want a larger rating scale with different forms of support than someone with less discerning tastes. Understanding the variation between users in their rating patterns remains future work.

More important than the limited results of our three studies, however, is the development of a methodology that allows researchers to ask, and answer, questions about the information content of ratings. The three studies presented in this chapter show that rating interfaces can be studied and provide a template for how to execute such a study. These studies also highlighted ways in which this methodology could be improved. While we were able to conclude useful results, the statistics around these measurements remain difficult to use and seem to lack power compared to less comprehensive techniques. Future work should continue to develop these methodologies as it seeks to further explore the information properties of rating interfaces.

One interesting direction that can be tackled with these methodologies is the relationship between algorithm metrics such as accuracy and the preference information and rating noise. Initially we hypothesized that preference bits is a key contributor to the magic barrier; however, the results of the third study suggest that rating noise, not preference bits, seems to have a larger impact on predictive accuracy. This one experiment, however, is not sufficient to understand this relationship. One way this could be studied is with a simulation study. While it is impossible to increase the preference bits of ratings, it is possible to decrease it and to affect the rating noise in various ways. Based on the data from our studies it would be possible to deliberately degrade ratings and measure the impact on preference bits and rating noise. With that it might be possible to begin understanding the relationship between rating noise, preference bits, and algorithm quality.

The final important direction for forwarding the preference bits model I will discuss is some way of extending the theory to allow comparisons with non-rating preference elicitation methods. The preference bits model is limited to processing signals of preference that apply to only one item. Recent developments in recommender systems interfaces, however, have explored multi-item, comparison-based interfaces [22, 48, 77]. A unified theory of ranking and rating preference information elicitation would allow these two methodologies to be compared on even terms. This is made difficult, however, by the lack of a clear unifying understanding of how to combine these two signals of preference into one unified model.

Chapter 7

Conclusion

This dissertation takes a holistic, system-focused perspective on recommender systems research. In this light, most of the work presented has focused specifically on one aspect of a recommender system, seeking to understand and improve that aspect while also keeping in mind its connections to the larger technical and social system. In taking this perspective we found many questions to which standard research methodologies in the field were not suited, and in answering those questions also created new measurements or methodologies. In doing so, we hope to make specific contributions to the field of recommender systems, as well as methodological contributions that forward the research community's ability to study and improve every aspect of the system. The work presented in this dissertation makes the following contributions:

• A documentation of the issues inherent in building big-data social computing systems for small library partners.

• Introduction of the federated design philosophy for recommender systems.

• The documentation of the BookLens system design, with particular focus on our solutions to the previously mentioned issues.

• Specific measurements of the rate of overlap between catalogs of items observed in the wild for library items.

• Specific measurements of the value of rating pooling at libraries showing benefit to both big and small libraries alike.

• A carefully designed, bias-free methodology for measuring algorithm performance over a range of possible rating profile sizes.

• A detailed evaluation of the behavior of three common collaborative filtering algorithms over a range of metrics for user profiles up to size 19.

• Proof that the Item-Item collaborative filtering algorithm is fundamentally flawed when given small amounts of profile information for any given user.

• Specific diagnosis of the issues in Item-Item as well as two algorithm modifications that remedy these behavioral issues.

• A prototype virtual-lab experiment protocol for studying new-user recommendation that can be deployed over Amazon Mechanical Turk by research teams without access to an independent user-recruitment pool.

• A specific study showing that the damped Item-Item algorithm is preferred over traditional Item-Item recommendations, but not popular-item recommendations.

• The preference bits model, as well as related measurements (preference bits per rating, impression, and second, as well as rating noise) that allow measuring the amount of absolute information contained in user ratings independent of algorithm.

• Evidence that, for movie rating, the 5-star with half star interface is superior to thumbs-up/thumbs-down rating.

• An experimental methodology for carefully controlled measurement of preference bits values.

• A demonstration of a rating support interface capable of reducing rating noise without also reducing preference bits.

• A demonstration that rating-interface changes alone are enough to positively impact traditional algorithm-focused metrics, highlighting the holistic connection between these system components and perspectives.

While the applicability of some of these contributions is situationally limited (large-scale recommender systems like Amazon and Netflix might never need to know the limitations inherent to building software for public libraries), many of these contributions can be broadly applied. Our work into recommendation under small profile sizes, for example, can serve to guide algorithm choice in a large range of systems. Even systems which choose to apply a new-user survey can use the prediction accuracy or recommendation quality curves produced to inform decisions about the tradeoff between user effort and number of ratings collected in a new-user survey. Likewise, our improvements to the Item-Item algorithm are useful across a large range of small profile sizes, improving algorithm quality for users with more than the 10 to 15 ratings typical of new-user surveys.

Many of the key contributions of this work are novel means and methods for recommender system experiments. While the applicability of these to the recommender systems research scientist should be clear, it may be less obvious how these methods can also be of use to the recommender system practitioner. In any moderate scale recommender system deployment, however, offline evaluation procedures are often leveraged to inform decisions about algorithm choice and tuning. Our cold-start evaluation can be directly leveraged by practitioners in the design of their recommender system. Likewise, larger systems may find value in deploying a variant of the preference bits and rating noise measurement procedure from chapter 6 to inform decisions about rating interface. Even for systems where an experiment of this sort may not be practical, our findings can still inform development of interfaces towards those that provide effective rating support to reduce rating noise and improve recommender performance.

One major limitation of these contributions is the systems we chose to use for our experiments on new-user recommendation and rating interfaces. While the BookLens system played a role of inspiration for these studies, its cold state meant that it was unsuitable for direct study; therefore, the experimental component of this work was done exclusively on the MovieLens system. This fundamentally calls into question the generalizability of these results. The ideal experiment used to conclude many of the results in the list above would involve studies on many different systems, user bases, and item domains to ensure that these results are robust across all places where the relevant interfaces and algorithms might be deployed. Unfortunately, such a study can be prohibitively complicated, especially for the user studies in our research. This is one of the key reasons behind the focus on clearly describing repeatable experimental methodologies, and on the design of novel methods and metrics: there is no reason that our team must be the one to perform the replications necessary for these results to be considered robust knowledge in the field of recommender science. The hope is that other researchers will adopt the methods and metrics, and by performing their own extensions and experiments to this work confirm these results over a range of item domains, user bases, and system deployments.
At the same time, we acknowledge that offline experiments such as those done when investigating the Item-Item algorithm's new-user performance could be extended to new datasets with relatively minimal effort, and we consider this an important next step towards comprehensive publication of those results.

In the research into algorithm performance for new users we created plots of various measures of algorithm performance by profile size. There is much work to be done further exploring these results. In particular, curves could be fit to these metric value plots, which could, perhaps, be used to understand and predict the performance of the algorithm given untested (or perhaps untestable) amounts of user data. For example, it is known that RMSE must decrease to a minimum viable value (the magic barrier) [111]. Given this, we could perhaps fit an asymptotically decreasing function to empirically measured RMSE curves and try to formulate an empirical estimate of this magic barrier (the asymptote of the fit equation), which could be compared against re-rating based estimates of the same. Likewise, by creating these estimates on a per-user basis we may be able to estimate both the maximum accuracy we will ever achieve for a given user, and how close to this maximum the user is with their current rating profile.
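As an illustration of that idea, the following is a minimal sketch of such a fit; it assumes hypothetical RMSE-by-profile-size measurements and an exponential-decay functional form, which is only one of several plausible asymptotically decreasing families.

    import numpy as np
    from scipy.optimize import curve_fit

    def rmse_model(n, asymptote, scale, rate):
        """Asymptotically decreasing RMSE as a function of profile size n."""
        return asymptote + scale * np.exp(-rate * n)

    # Hypothetical per-profile-size RMSE measurements (not real data).
    profile_sizes = np.arange(1, 20)
    rmse_values = (0.85 + 0.40 * np.exp(-0.25 * profile_sizes)
                   + np.random.normal(0, 0.01, len(profile_sizes)))

    params, _ = curve_fit(rmse_model, profile_sizes, rmse_values, p0=[0.8, 0.5, 0.1])
    print("estimated magic barrier (asymptote):", params[0])

The fitted asymptote could then be compared against re-rating based estimates of the magic barrier, as suggested above.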
Just as we showed these metrics can be studied over the lifetime of a user, similar methods could be trialed to explore system-level performance. Systems like BookLens are fundamentally limited in how useful they can be to any user based on how much total usage data they have. Similarly to the new-user study, a careful study of this, along with predictive regression analysis, might allow understanding how accurate a system could possibly become, and estimating how much information would be needed to bring it there. Understanding this could lead to interesting results in understanding the value of importing external datasets for system bootstrapping (a common strategy to try to avoid system cold-start) as well as methods for understanding the system-level processes that might lead to success or failure of a system.

A second major limitation of this dissertation is that we only studied collaborative filtering algorithms, and only as they applied to ratings-based sources of preference information. While this focus allowed gaining initial traction, it limits the applicability of the new-user evaluation methodologies and the preference bits model to a small slice of the possible range of recommendation algorithms. Future work should better explore how to extend these methods to systems which can directly elicit preference through content questions (what is your favorite music genre?), ranking type questions (which of these two books would you rather read?), or implicit forms of rating information. Some of these questions can be incorporated relatively directly (how consistent and high-entropy is favorite-genre as a user-based question?); others, particularly ranking, will require more care to understand what the common grounds of measurement might be. For example, it isn't clear that one rating should equal one ranking elicitation in a metric-value-by-rating graph comparison.

As hinted in the last paragraph, two of our contributions do have a joint theme. Both the new-user evaluation protocol and the preference bits model focus on measuring aspects of how the recommender system learns about persistent properties of the user's opinions. Applying both of these methodologies allows multiple ways of studying preference elicitation, either as a question of consistency and possible information value, or as a question of algorithm learning against a metric. Future work should investigate connections between these two approaches in an effort to arrive at a deeper theory of preference elicitation for recommender systems.

Finally, given that this work seeks a holistic perspective on recommender systems, it is a major limitation that at no point did we deeply study any of the ways to apply the learning inherent in a recommender system. Just as the means of preference elicitation are important to how well an algorithm performs, it stands to reason that the value derived from an algorithm will only ever be as good as the interface used to apply it. While there are many good questions to ask in this area, many are being well addressed by other members of the recommender systems community.

Outside of the extensions of current work addressed above, we see several interesting directions for future work. For example, while the means of leveraging knowledge from recommenders to assist content consumers is an active area of recommender systems research, little research has gone into how the knowledge from a recommender system can assist content creators. It has been speculated, but to my knowledge not confirmed, that large companies such as Netflix use data driven techniques to assist in the production of new series [24]. While recommending content-to-create to content creators is not an unstudied field [34, 45], there are likely great strides to be made here, especially by directly incorporating collaborative filtering algorithms.

One of the often overlooked aspects of the recommender system is its collection of preference data itself. Often it is taken as a core truth of the system that these ratings are available and might be stored forever. However, in our work with library systems we saw a much more restrictive stance taken on the storage of possibly sensitive preference data. Given this, we see an intriguing direction for future work in understanding how to build recommender systems that respect various data retention policies. The first step in researching this direction could simply be building a temporal evaluation that applies various time or quantity based data retention policies in an effort to see how this affects algorithms. In the quite likely scenario that, unmodified, algorithms behave poorly under a data retention policy, the next step would be an investigation into how to modify the algorithms to improve performance. More broadly, however, for a recommender system to follow a heavily restrictive data retention policy, broad changes in design and ways of work may be needed to answer questions such as how to tune parameters and improve algorithms without a pool of historic rating data to leverage.

We are also interested in the ways that social biases in user ratings may be captured and transmitted by a recommender system. Social biases against various groups of people may easily influence the ratings collected by a recommender system. Without any deliberate malice, these ratings could cause a recommender system to amplify and repeat these biases, negatively affecting recommendations. In collaboration with Michael Ekstrand et al. we have already demonstrated the possibility of this in book rating datasets in an upcoming research paper, to appear at RecSys 2018 [42].
Future work in this direction will look more closely into what types of bias recommender algorithms are sensitive to. By using a simulation methodology (starting with a dataset and deliberately manipulating ratings subject to various models of bias) we can directly measure how algorithms will react to various forms of statistical bias.
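As a concrete illustration of what such a simulation might look like, the sketch below applies one hypothetical bias model (systematically shifting the ratings of items associated with a flagged group) to a ratings table; the manipulated data would then be used to retrain and re-evaluate the recommender. The column names and the bias model are illustrative assumptions, not the design of the study in [42].

    import pandas as pd

    def apply_shift_bias(ratings, flagged_items, shift=-0.5, min_r=0.5, max_r=5.0):
        """Return a copy of the ratings with a systematic shift applied to flagged items.

        ratings: DataFrame with columns ['user', 'item', 'rating'] (illustrative schema).
        flagged_items: ids of the items targeted by this hypothetical bias model.
        """
        biased = ratings.copy()
        mask = biased["item"].isin(flagged_items)
        biased.loc[mask, "rating"] = (biased.loc[mask, "rating"] + shift).clip(min_r, max_r)
        return biased

    # Hypothetical usage: train the same algorithm on the original and the biased
    # data, then compare how often flagged items appear in recommendation lists.
    # biased_ratings = apply_shift_bias(ratings, flagged_items={"b1", "b2"})

Sweeping the size of the shift, or swapping in other bias models, would then trace out how sensitive each algorithm is to the injected bias.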

Ultimately, it is our belief that the next big improvement in recommender systems will not come from an algorithm focus, but instead from taking a holistic perspective on the recommender system. While we won't be so bold as to claim this work contains that next big step, it is our hope that our work can serve as the inspiration and foundation of further research into recommender systems. In that way, this can be one step forward towards a complete understanding of the design, implementation, and implications of recommender systems technologies.

References

[1] 2015. Most popular retail websites in the United States as of September 2015, ranked by visitors (in millions). (September 2015). http://www.statista.com/statistics/271450/monthly-unique-visitors-to-us-retail-websites/ Retrieved May 13, 2016.

[2] Charu C. Aggarwal, Joel L. Wolf, Kun-Lung Wu, and Philip S. Yu. Horting Hatches an Egg: A New Graph-theoretic Approach to Collaborative Filtering. In Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (1999) (KDD '99). ACM, 201–212. DOI:http://dx.doi.org/10.1145/312129.312230

[3] Xavier Amatriain and Justin Basilico. 2012. Netflix Recommendations: Beyond the 5 stars (Part 1). (April 2012). http://techblog.netflix.com/2012/04/ -recommendations-beyond-5-stars.html

[4] Xavier Amatriain, Josep Pujol, and Nuria Oliver. 2009a. I Like It... I Like It Not: Evaluating User Ratings Noise in Recommender Systems. In UMAP 2009. Springer (2009), 247–258.

[5] Xavier Amatriain, Josep M. Pujol, Nava Tintarev, and Nuria Oliver. 2009b. Rate it again: increasing recommendation accuracy by user re-rating. In RecSys 09. ACM.

[6] American Library Association. 2006. Resolution on the Retention of Library Usage Records. (June 2006). http://www.ala.org/advocacy/intfreedom/statementspols/ifresolutions/libraryusagerecords

[7] Mark Atwood, Dirk Balfanz, Darren Bounds, Richard M. Conlan, Blaine Cook, Leah Culver, Breno de Medeiros, Brian Eaton, Kellan Elliott-McCrea, Larry Halff, Eran Hammer-Lahav, Ben Laurie, Chris Messina, John Panzer, Sam Quigley, David Recordon, Eran Sandler, Jonathan Sergent, Todd Sieling, Brian Slesinsky, and Andy Smith. 2009. OAuth Core 1.0 Revision A. (June 2009). http://oauth.net/core/1.0a/

[8] Robert Bell, Yehuda Koren, and Chris Volinsky. 2007. Modeling relationships at multiple scales to improve accuracy of large recommender systems. In KDD ’07. ACM, 95–104.

[9] Robert M. Bell and Yehuda Koren. 2007. Scalable Collaborative Filtering with Jointly Derived Neighborhood Interpolation Weights. In Proceedings of the 2007 Seventh IEEE International Conference on Data Mining (ICDM '07). IEEE Computer Society, Washington, DC, USA, 43–52. DOI:http://dx.doi.org/10.1109/ICDM.2007.90

[10] Alejandro Bellogin. 2012. Performance prediction and evaluation in Recommender Systems: an Information Retrieval perspective. Ph.D. Dissertation. Universidad Autónoma de Madrid.

[11] Alejandro Bellogin, Pablo Castells, and Ivan Cantador. 2011. Precision-oriented Evaluation of Recommender Systems: An Algorithmic Comparison. In Proceedings of the Fifth ACM Conference on Recommender Systems (RecSys ’11). ACM, New York, NY, USA, 333–336. DOI:http://dx.doi.org/10.1145/2043932.2043996

[12] Rick Bennett, Brian F. Lavoie, and Edward T. O'Neill. 2003. The concept of a work in WorldCat: an application of FRBR. Library Collections, Acquisitions, & Technical Services 27, 1 (2003), 45–59. DOI:http://dx.doi.org/10.1080/14649055.2003.10765895

[13] James R. Bettman, Mary Frances Luce, and John W. Payne. 1998. Constructive Consumer Choice Processes. Journal of Consumer Research 25, 3 (Dec. 1998), 187–217. http://www.jstor.org/stable/10.1086/209535

[14] P. Bieganski, J.A. Konstan, and J.T. Riedl. 2001. System, method and article of manufacture for making serendipity-weighted recommendations to a user. (Dec. 25 2001). http://google.com/patents/US6334127 US Patent 6,334,127.

[15] David M. Blei, Andrew Y. Ng, and Michael I. Jordan. 2003. Latent Dirichlet Allocation. The Journal of Machine Learning Research 3 (March 2003), 993–1022. http://dl.acm.org/citation.cfm?id=944919.944937

[16] Hennepin County Library Board. 2015. Patron Data Privacy Policy. (February 2015). http://www.hclib.org/about/policies/patron-data-privacy

[17] Dirk Bollen, Mark Graus, and Martijn C. Willemsen. 2012. Remembering the stars?: effect of time on preference retrieval from memory. In In Proc. of RecSys ’12. ACM, New York, NY, USA, 217 – 220. DOI:http://dx.doi.org/10.1145/ 2365952.2365998

[18] John S. Breese, David Heckerman, and Carl Kadie. 1998. Empirical Analysis of Predictive Algorithms for Collaborative Filtering. Technical Report MSR-TR-98-12. Microsoft Research. 18 pages. http://research.microsoft.com/apps/pubs/default.aspx?id=69656

[19] Leo Breiman. 1996. Stacked Regressions. Machine Learning 24, 1 (1996), 49–64. DOI:http://dx.doi.org/10.1023/A:1018046112532

[20] Angelo Canty and B. D. Ripley. 2016. boot: Bootstrap R (S-Plus) Functions. R package version 1.3-18.

[21] Òscar Celma and Perfecto Herrera. A New Approach to Evaluating Novel Recommendations. In Proceedings of the 2008 ACM Conference on Recommender Systems (2008) (RecSys '08). ACM, 179–186. DOI:http://dx.doi.org/10.1145/1454008.1454038

[22] Shuo Chang, F. Maxwell Harper, and Loren Terveen. 2015. Using Groups of Items for Preference Elicitation in Recommender Systems. In Proceedings of the 18th ACM Conference on Computer Supported Cooperative Work & Social Computing (CSCW '15). ACM, New York, NY, USA, 1258–1269. DOI:http://dx.doi.org/10.1145/2675133.2675210

[23] Tianqi Chen, Weinan Zhang, Qiuxia Lu, Kailong Chen, Zhao Zheng, and Yong Yu. 2012. SVDFeature: A Toolkit for Feature-based Collaborative Filtering. Journal of Machine Learning Research 13, 1 (Dec. 2012), 3619–3622. http://dl.acm.org/citation.cfm?id=2503308.2503357

[24] Toai Chowdhury. 2017. How Netflix uses Big Data Analytics to ensure success. (2017). https://upxacademy.com/netflix-data-analytics/

[25] Kyusik Chung. 2011. Announcing Goodreads Personalized Recommendations. (2011). http://www.goodreads.com/blog/show/303-announcing-goodreads-personalized-recommendations

[26] Dan Cosley, Shyong K. Lam, Istvan Albert, Joseph A. Konstan, and John Riedl. 2003. Is seeing believing?: how recommender system interfaces affect users’ opinions. In In Proc. of CHI ’03. ACM, New York, NY, USA, 585–592. DOI: http://dx.doi.org/10.1145/642611.642713

[27] Thomas M. Cover and Joy A. Thomas. 2006. Elements of Information Theory (Wiley Series in Telecommunications and Signal Processing). Wiley-Interscience.

[28] Paolo Cremonesi, Franca Garzottto, and Roberto Turrin. 2012. User Effort vs. Accuracy in Rating-based Elicitation. In Proceedings of the Sixth ACM Conference on Recommender Systems (RecSys ’12). ACM, New York, NY, USA, 27–34. DOI: http://dx.doi.org/10.1145/2365952.2365963

[29] Paolo Cremonesi, Yehuda Koren, and Roberto Turrin. 2010. Performance of Recommender Algorithms on Top-n Recommendation Tasks. In Proceedings of the Fourth ACM Conference on Recommender Systems (RecSys '10). ACM, New York, NY, USA, 39–46. DOI:http://dx.doi.org/10.1145/1864708.1864721

[30] John Cotton Dana. 1913. A library primer. Library bureau.

[31] A. C. Davison and D. V. Hinkley. 1997. Bootstrap Methods and Their Applications. Cambridge University Press, Cambridge. http://statwww.epfl.ch/davison/BMA/ ISBN 0-521-57391-2.

[32] Christian Desrosiers and George Karypis. 2011. A comprehensive survey of neighborhood-based recommendation methods. In Recommender systems handbook. Springer, 107–144.

[33] Sara Drenner, Shilad Sen, and Loren Terveen. 2008. Crafting the Initial User Experience to Achieve Community Goals. In RecSys ’08. ACM, 8. DOI:http: //dx.doi.org/10.1145/1454008.1454039

[34] Casey Dugan, Werner Geyer, and David R. Millen. 2010. Lessons Learned from Blog Muse: Audience-based Inspiration for Bloggers. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI ’10). ACM, New York, NY, USA, 1965–1974. DOI:http://dx.doi.org/10.1145/1753326.1753623

[35] Michael Ekstrand. 2013. Similarity Functions for User-User Collaborative Filtering. (October 2013). http://grouplens.org/blog/similarity-functions-for-user-user-collaborative-filtering/

[36] Michael Ekstrand. 2015. Similarity Functions in Item-Item CF. (June 2015). https://md.ekstrandom.net/blog/2015/06/item-similarity/

[37] Michael Ekstrand and John Riedl. 2012. When Recommenders Fail: Predicting Recommender Failure for Algorithm Selection and Combination. In Proceedings of the Sixth ACM Conference on Recommender Systems (RecSys ’12). ACM, New York, NY, USA, 233–236. DOI:http://dx.doi.org/10.1145/2365952.2366002

[38] Michael D. Ekstrand, F. Maxwell Harper, Martijn C. Willemsen, and Joseph A. Konstan. 2014. User Perception of Differences in Recommender Algorithms. In Proceedings of the 8th ACM Conference on Recommender Systems (RecSys ’14). ACM, New York, NY, USA, 161–168. DOI:http://dx.doi.org/10.1145/ 2645710.2645737

[39] Michael D. Ekstrand, Daniel Kluver, F. Maxwell Harper, and Joseph A. Konstan. 2015. Letting Users Choose Recommender Algorithms: An Experimental Study. In Proceedings of the 9th ACM Conference on Recommender Systems (RecSys '15). ACM, New York, NY, USA, 11–18. DOI:http://dx.doi.org/10.1145/2792838.2800195

[40] Michael D. Ekstrand, Michael Ludwig, Joseph A. Konstan, and John T. Riedl. 2011a. Rethinking the Recommender Research Ecosystem: Reproducibility, Openness, and LensKit. In Proceedings of the Fifth ACM Conference on Recommender Systems (RecSys '11). ACM, New York, NY, USA, 133–140. DOI:http://dx.doi.org/10.1145/2043932.2043958

[41] Michael D. Ekstrand, John T. Riedl, and Joseph A. Konstan. 2011b. Collaborative Filtering Recommender Systems. Foundations and Trends in Human-Computer Interaction 4, 2 (2011), 81–173. DOI:http://dx.doi.org/10.1561/1100000009

[42] Michael D. Ekstrand, Mucun Tian, Mohammed R. Imran Kazi, Hoda Mehrpouyan, and Daniel Kluver. upcoming. Exploring Author Gender in Book Rating and Recommendation. Proceedings of The 12th ACM Conference on Recommender Systems (RecSys 2018) (upcoming).

[43] Baruch Fischhoff. 1991. Value elicitation: Is there anything in there? Amer- ican Psychologist 46, 8 (1991), 835–847. DOI:http://dx.doi.org/10.1037/ 0003-066X.46.8.835

[44] Simon Funk. 2006. Netflix Update: Try This at Home. http://sifter.org/ ~simon/journal/20061211.html. (Dec. 2006). http://sifter.org/~simon/ journal/20061211.html

[45] Werner Geyer and Casey Dugan. 2010. Inspired by the Audience: A Topic Sug- gestion System for Blog Writers and Readers. In Proceedings of the 2010 ACM Conference on Computer Supported Cooperative Work (CSCW ’10). ACM, New York, NY, USA, 237–240. DOI:http://dx.doi.org/10.1145/1718918.1718964

[46] Nadav Golbandi, Yehuda Koren, and Ronny Lempel. 2011. Adaptive Bootstrapping of Recommender Systems Using Decision Trees. In Proceedings of the Fourth ACM International Conference on Web Search and Data Mining (WSDM '11). ACM, New York, NY, USA, 595–604. DOI:http://dx.doi.org/10.1145/1935826.1935910

[47] Ken Goldberg, Theresa Roeder, Dhruv Gupta, and Chris Perkins. 2001. Eigentaste: A Constant Time Collaborative Filtering Algorithm. Information Retrieval 4, 2 (2001), 133–151. DOI:http://dx.doi.org/10.1023/A:1011419012209

[48] Mark P. Graus and Martijn C. Willemsen. 2015. Improving the User Experience During Cold Start Through Choice-Based Preference Elicitation. In Proceedings of the 9th ACM Conference on Recommender Systems (RecSys ’15). ACM, New York, NY, USA, 273–276. DOI:http://dx.doi.org/10.1145/2792838.2799681

[49] F. Maxwell Harper and Joseph A. Konstan. 2015. The MovieLens Datasets: History and Context. ACM Transactions on Interactive Intelligent Systems 5, 4, Article 19 (Dec. 2015), 19 pages. DOI:http://dx.doi.org/10.1145/2827872

[50] Jon Herlocker, Joseph A. Konstan, and John Riedl. 2002. An Empirical Analysis of Design Choices in Neighborhood-Based Collaborative Filtering Algorithms. In- formation Retrieval 5, 4 (01 Oct 2002), 287–310. DOI:http://dx.doi.org/10. 1023/A:1020443909834

[51] Jonathan L. Herlocker, Joseph A. Konstan, and John Riedl. 2000. Explaining Collaborative Filtering Recommendations. In CSCW ’00. ACM, New York, NY, USA, 241–250. DOI:http://dx.doi.org/10.1145/358916.358995

[52] Jonathan L. Herlocker, Joseph A. Konstan, Loren G. Terveen, and John T. Riedl. 2004. Evaluating Collaborative Filtering Recommender Systems. ACM Transac- tions on Information Systems 22, 1 (Jan. 2004), 5–53. DOI:http://dx.doi.org/ 10.1145/963770.963772

[53] Will Hill, Larry Stead, Mark Rosenstein, and George Furnas. 1995. Recommending and evaluating choices in a of use. In In Proc. of CHI ’95. New York, NY, USA, 194–201. DOI:http://dx.doi.org/10.1145/223904.223929

[54] Thomas Hofmann. 1999. Probabilistic Latent Semantic Indexing. In Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '99). ACM, New York, NY, USA, 50–57. DOI:http://dx.doi.org/10.1145/312624.312649

[55] Thomas Hofmann. 2004. Latent Semantic Models for Collaborative Filtering. ACM Transactions on Information Systems 22, 1 (Jan. 2004), 89–115. DOI:http://dx.doi.org/10.1145/963770.963774

[56] John B. Horrigan, Lee Rainie, and Dana Page. 2015. Libraries at the crossroads. (September 2015). http://www.pewinternet.org/2015/09/15/ libraries-at-the-crossroads/

[57] John B. Horrigan, Lee Rainie, and Dana Page. 2016. Libraries 2016. (September 2016). http://www.pewinternet.org/2016/09/09/libraries-2016/

[58] Christopher K. Hsee. 1996. The Evaluability Hypothesis: An Explanation for Preference Reversals between Joint and Separate Evaluations of Alternatives. Or- ganizational Behavior and Human Decision Processes 67, 3 (Sept. 1996), 247–257. DOI:http://dx.doi.org/10.1006/obhd.1996.0077

[59] Li-tze Hu and Peter Bentler. 1999. Cutoff criteria for fit indexes in covari- ance structure analysis: Conventional criteria versus new alternatives. Struc- tural Equation Modeling: A Multidisciplinary Journal 6, 1 (1999), 1–55. DOI: http://dx.doi.org/10.1080/10705519909540118

[60] Zan Huang, Wingyan Chung, and Hsinchun Chen. 2004. A graph model for E- commerce recommender systems. Journal of the American Society for Informa- tion Science and Technology 55, 3 (2004), 259–274. DOI:http://dx.doi.org/10. 1002/asi.10372

[61] Mohsen Jamali and Martin Ester. 2009. TrustWalker: A Random Walk Model for Combining Trust-based and Item-based Recommendation. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD ’09). ACM, New York, NY, USA, 397–406. DOI:http://dx.doi. org/10.1145/1557019.1557067

[62] Dietmar Jannach, Lukas Lerche, Fatih Gedikli, and Geoffray Bonnin. 2013. What Recommenders Recommend – An Analysis of Accuracy, Popularity, and Sales Diversity Effects. In Proceedings of the 21st International Conference on User Modeling, Adaptation, and Personalization (UMAP 2013), Sandra Carberry, Stephan Weibelzahl, Alessandro Micarelli, and Giovanni Semeraro (Eds.). Springer Berlin Heidelberg, Berlin, Heidelberg, 25–37. DOI:http://dx.doi.org/10.1007/978-3-642-38844-6_3

[63] Kalervo Järvelin and Jaana Kekäläinen. 2002. Cumulated Gain-based Evaluation of IR Techniques. ACM Transactions on Information Systems 20, 4 (Oct. 2002), 422–446. DOI:http://dx.doi.org/10.1145/582415.582418

[64] Daniel Kluver, Michael D. Ekstrand, and Joseph A. Konstan. 2018. Rating- Based Collaborative Filtering: Algorithms and Evaluation. In Social Informa- tion Access: Systems and Technologies, Peter Brusilovsky and Daqing He (Eds.). Springer International Publishing, Cham, 344–390. DOI:http://dx.doi.org/10. 1007/978-3-319-90092-6_10

[65] Daniel Kluver and Joseph A. Konstan. 2014. Evaluating Recommender Behavior for New Users. In Proceedings of the 8th ACM Conference on Recommender Systems (RecSys ’14). ACM, New York, NY, USA, 121–128. DOI:http://dx.doi.org/10. 1145/2645710.2645742

[66] Daniel Kluver, Tien T. Nguyen, Michael Ekstrand, Shilad Sen, and John Riedl. 2012. How many bits per rating?. In In Proc. of RecSys ’12. ACM, New York, NY, USA, 99 – 106. DOI:http://dx.doi.org/10.1145/2365952.2365974

[67] Bart Knijnenburg, Martijn Willemsen, Zeno Gantner, Hakan Soncu, and Chris Newell. 2012. Explaining the user experience of recommender systems. User Modeling and User-Adapted Interaction 22, 4 (October 2012), 441–504. DOI: http://dx.doi.org/10.1007/s11257-011-9118-4

[68] Yehuda Koren. Factorization meets the neighborhood: a multifaceted collaborative filtering model. In Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining (2008) (ACM KDD ’08). ACM, 426–434. DOI:http://dx.doi.org/10.1145/1401890.1401944

[69] Yehuda Koren and Robert Bell. 2011. Advances in collaborative filtering. In Recommender systems handbook. Springer, 145–186.

[70] Yehuda Koren, Robert Bell, and Chris Volinsky. 2009. Matrix Factorization Techniques for Recommender Systems. Computer 42, 8 (Aug. 2009), 30–37. DOI:http://dx.doi.org/10.1109/MC.2009.263

[71] Neal Lathia, Stephen Hailes, and Licia Capra. 2009. Temporal Collaborative Filter- ing with Adaptive Neighbourhoods. In Proceedings of the 32nd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR ’09). ACM, New York, NY, USA, 796–797. DOI:http://dx.doi.org/10.1145/ 1571941.1572133

[72] Neal Lathia, Stephen Hailes, Licia Capra, and Xavier Amatriain. 2010. Temporal Diversity in Recommender Systems. In Proceedings of the 33rd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR ’10). ACM, New York, NY, USA, 210–217. DOI:http://dx.doi.org/10.1145/ 1835449.1835486

[73] Daniel Lemire and Anna Maclachlan. 2005. Slope One Predictors for Online Rating- Based Collaborative Filtering. In Proceedings of SIAM Data Mining (SDM’05). https://arxiv.org/abs/cs/0702144

[74] I-En Liao, WenChiao Hsu, MingShen Cheng, and LiPing Chen. 2010. A library recommender system based on a personal ontology model and collaborative filtering technique for English collections. The Electronic Library 28, 3 (2010), 386–400. DOI:http://dx.doi.org/10.1108/02640471011051972

[75] Sarah Lichtenstein and Paul Slovic. 2006. The Construction of Preference. Cam- bridge University Press.

[76] G. Linden, B. Smith, and J. York. 2003. Amazon.com recommendations: item- to-item collaborative filtering. Internet Computing, IEEE 7, 1 (Jan 2003), 76–80. DOI:http://dx.doi.org/10.1109/MIC.2003.1167344

[77] Benedikt Loepp, Tim Hussein, and Jürgen Ziegler. 2014. Choice-based Preference Elicitation for Collaborative Filtering Recommender Systems. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI '14). ACM, New York, NY, USA, 3085–3094. DOI:http://dx.doi.org/10.1145/2556288.2557069

[78] Y. Luo, J. Le, and H. Chen. 2009. A Privacy-Preserving Book Recommendation Model Based on Multi-agent. In Computer Science and Engineering, 2009. WCSE ’09. Second International Workshop on, Vol. 2. 323–327. DOI:http://dx.doi. org/10.1109/WCSE.2009.822

[79] S. Maneewongvatana and S. Maneewongvatana. 2010. A recommendation model for personalized book lists. In Communications and Information Technologies (ISCIT), 2010 International Symposium on. 389–394. DOI:http://dx.doi.org/10.1109/ ISCIT.2010.5664873

[80] Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. 2008. Introduction to information retrieval. Vol. 1. Cambridge University Press, Cambridge.

[81] Paolo Massa and Paolo Avesani. 2007. Trust-aware Recommender Systems. In Proceedings of the 2007 ACM Conference on Recommender Systems (RecSys ’07). ACM, New York, NY, USA, 17–24. DOI:http://dx.doi.org/10.1145/1297231. 1297235

[82] Julian McAuley and Jure Leskovec. 2013. Hidden Factors and Hidden Topics: Understanding Rating Dimensions with Review Text. In Proceedings of the 7th ACM Conference on Recommender Systems (RecSys ’13). ACM, New York, NY, USA, 165–172. DOI:http://dx.doi.org/10.1145/2507157.2507163

[83] Sean M. McNee, Shyong K. Lam, Joseph A. Konstan, and John Riedl. 2003. In- terfaces for Eliciting New User Preferences in Recommender Systems. In User Modeling 2003, Peter Brusilovsky, Albert Corbett, and Fiorella de Rosis (Eds.). Number 2702 in Lecture Notes in Computer Science. Springer Berlin Heidelberg, 178–187. http://link.springer.com/chapter/10.1007/3-540-44963-9_24

[84] G.A. Miller. 1955. Note on the bias of information estimates. Information theory in psychology: Problems and methods 2 (1955), 95–100.

[85] Michael Mönnich and Marcus Spiering. 2008. Adding value to the library catalog by implementing a recommendation system. D-Lib Magazine 14, 5 (2008), 4.

[86] Thomas Mussweiler. 2003. Comparison processes in social judgment: Mechanisms and consequences. Psychological Review 110, 3 (2003), 472–489. DOI:http://dx.doi.org/10.1037/0033-295X.110.3.472

[87] Jerome L. Myers and A. Well. 2003. Research Design and Statistical Analysis. Vol. 2nd ed. Routledge.

[88] Tien T. Nguyen, Daniel Kluver, Ting-Yu Wang, Pik-Mai Hui, Michael D. Ek- strand, Martijn C. Willemsen, and John Riedl. 2013. Rating Support Interfaces to Improve User Experience and Recommender Accuracy. In Proceedings of the 7th ACM Conference on Recommender Systems (RecSys ’13). ACM, New York, NY, USA, 149–156. DOI:http://dx.doi.org/10.1145/2507157.2507188

[89] David M. Nichols, Michael B. Twidale, and Chris D. Paice. 1997. Recommenda- tion and Usage in the Digital Library. Technical Report. Lancaster University, Computing Department.

[90] X. Ning and G. Karypis. 2011. SLIM: Sparse Linear Methods for Top-N Recom- mender Systems. In Proceedings of the IEEE 11th Internation Conference on Data Mining (ICDM 2011). 497–506. DOI:http://dx.doi.org/10.1109/ICDM.2011. 134

[91] Syavash Nobarany, Louise Oram, Vasanth Kumar Rajendran, Chi-Hsiang Chen, Joanna McGrenere, and Tamara Munzner. 2012. The design space of opinion measurement interfaces: exploring recall support for rating and ranking. In In Proc. of CHI ’12. ACM, New York, NY, USA, 2035 – 2044. DOI:http://dx.doi. org/10.1145/2207676.2208351

[92] Library of Congress Network Development and MARC Standards Office. 2017. MARC 21 format for bibliographic data. (May 2017). https://www.loc.gov/marc/bibliographic/

[93] Judith S. Olson and Wendy A. Kellogg. 2014. Ways of Knowing in HCI. Springer Publishing Company, Incorporated.

[94] Michael P. O'Mahony, Neil J. Hurley, and Guénolé C.M. Silvestre. 2006. Detecting noise in recommender system databases. In Proceedings of the 11th international conference on Intelligent user interfaces (IUI '06). ACM, New York, NY, USA, 109–115. DOI:http://dx.doi.org/10.1145/1111449.1111477

[95] L. Paninski. 2003. Estimation of entropy and mutual information. Neural Compu- tation 15, 6 (2003), 1191–1253.

[96] David Pattern. 2005. Using ”circ tran” to show borrowing suggestions in HIP. (November 2005). https://daveyp.wordpress.com/2005/11/17/using-circ_ tran-to-show-borrowing-suggestions-in-hip/

[97] Eyal Peer, Joachim Vosgerau, and Alessandro Acquisti. 2014. Reputation as a sufficient condition for data quality on Amazon Mechanical Turk. Behavior Re- search Methods 46, 4 (01 Dec 2014), 1023–1031. DOI:http://dx.doi.org/10. 3758/s13428-013-0434-y

[98] Maria Soledad Pera, Nicole Condie, and Yiu-Kai Ng. 2011. Personalized Book Recommendations Created by Using Social Media Data. In Proceedings of the 2010 International Conference on Web Information Systems Engineering (WISS’10). Springer-Verlag, Berlin, Heidelberg, 390–403. http://dl.acm.org/citation. cfm?id=2044492.2044531

[99] Maria Soledad Pera and Yiu-Kai Ng. 2011. With a Little Help from My Friends: Generating Personalized Book Recommendations Using Data Extracted from a Social Website. In Proceedings of the 2011 IEEE/WIC/ACM International Con- ferences on Web Intelligence and Intelligent Agent Technology - Volume 01 (WI- IAT ’11). IEEE Computer Society, Washington, DC, USA, 96–99. DOI:http: //dx.doi.org/10.1109/WI-IAT.2011.9

[100] Maria Soledad Pera and Yiu-Kai Ng. 2013. What to Read Next?: Making Per- sonalized Book Recommendations for K-12 Users. In Proceedings of the 7th ACM Conference on Recommender Systems (RecSys ’13). ACM, New York, NY, USA, 113–120. DOI:http://dx.doi.org/10.1145/2507157.2507181

[101] Nguyen Duy Phuong, Le Quang Thang, and Tu Minh Phuong. A Graph-Based Method for Combining Collaborative and Content-Based Filtering. In Proceedings of the 2008 Pacific Rim International Conference on Artificial Intelligence, Tu-Bao Ho and Zhi-Hua Zhou (Eds.). Springer Berlin Heidelberg, 859–869. http://link.springer.com/chapter/10.1007/978-3-540-89197-0_80

[102] Pearl Pu, Li Chen, and Rong Hu. 2011. A User-centric Evaluation Framework for Recommender Systems. In Proceedings of the Fifth ACM Conference on Rec- ommender Systems (RecSys ’11). ACM, New York, NY, USA, 157–164. DOI: http://dx.doi.org/10.1145/2043932.2043962

[103] Al Mamunur Rashid, Istvan Albert, Dan Cosley, Shyong K. Lam, Sean M. McNee, Joseph A. Konstan, and John Riedl. 2002. Getting to Know You: Learning New User Preferences in Recommender Systems. In Proceedings of the 7th International Conference on Intelligent User Interfaces (IUI ’02). ACM, New York, NY, USA, 127–134. DOI:http://dx.doi.org/10.1145/502716.502737

[104] Al Mamunur Rashid, George Karypis, and John Riedl. 2008. Learning Preferences of New Users in Recommender Systems: An Information Theoretic Approach. ACM SIGKDD Explorations Newsletter 10, 2 (Dec. 2008), 90–100. DOI:http://dx.doi. org/10.1145/1540276.1540302

[105] Steffen Rendle, Christoph Freudenthaler, Zeno Gantner, and Lars Schmidt- Thieme. BPR: Bayesian Personalized Ranking from Implicit Feedback. In Pro- ceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence (2009) (UAI ’09). AUAI Press, 452–461. http://dl.acm.org/citation.cfm?id= 1795114.1795167

[106] Steffen Rendle and Lars Schmidt-Thieme. 2008. Online-updating Regularized Ker- nel Matrix Factorization Models for Large-scale Recommender Systems. In Proceed- ings of the 2008 ACM Conference on Recommender Systems (RecSys ’08). ACM, New York, NY, USA, 251–258. DOI:http://dx.doi.org/10.1145/1454008. 1454047

[107] Paul Resnick, Neophytos Iacovou, Mitesh Suchak, Peter Bergstrom, and John Riedl. 1994. GroupLens: An Open Architecture for Collaborative Filtering of Netnews. In Proceedings of the 1994 ACM Conference on Computer Supported Cooperative Work (CSCW '94). ACM, New York, NY, USA, 175–186. DOI:http://dx.doi.org/10.1145/192844.192905

[108] Stephanie Kaye Rogers. 2016. Item-to-item Recommendations at Pinterest. In Proceedings of the 10th ACM Conference on Recommender Systems (RecSys ’16). ACM, New York, NY, USA, 393–393. DOI:http://dx.doi.org/10.1145/ 2959100.2959130

[109] Yves Rosseel. 2012. lavaan: An R Package for Structural Equation Modeling. Journal of Statistical Software, Articles 48, 2 (2012), 1–36. DOI:http://dx.doi. org/10.18637/jss.v048.i02

[110] Alan Said, Ben Fields, Brijnesh J. Jain, and Sahin Albayrak. 2013. User- centric Evaluation of a K-furthest Neighbor Collaborative Filtering Recommender Algorithm. In Proceedings of the 2013 Conference on Computer Supported Co- operative Work (CSCW ’13). ACM, New York, NY, USA, 1399–1408. DOI: http://dx.doi.org/10.1145/2441776.2441933

[111] Alan Said, Brijnesh J. Jain, Sascha Narr, and Till Plumbaum. 2012. Users and Noise: The Magic Barrier of Recommender Systems. In UMAP 2012. Springer, 237–248.

[112] Badrul Sarwar, George Karypis, Joseph Konstan, and John Riedl. Item-based Collaborative Filtering Recommendation Algorithms. In Proceedings of the 10th International Conference on World Wide Web (2001) (WWW ’01). ACM, 285– 295. DOI:http://dx.doi.org/10.1145/371920.372071

[113] Badrul M Sarwar, George Karypis, Joseph A Konstan, and John T Riedl. Ap- plication of in Recommender System – A Case Study. In WebKDD 2000 (2000). http://citeseerx.ist.psu.edu/viewdoc/summary? doi=10.1.1.29.8381

[114] Hanhuai Shan and Arindam Banerjee. Generalized Probabilistic Matrix Factorizations for Collaborative Filtering. In Data Mining, IEEE International Conference on (2010). IEEE Computer Society, 1025–1030. DOI:http://dx.doi.org/10.1109/ICDM.2010.116

[115] Guy Shani and Asela Gunawardana. 2011. Evaluating Recommendation Systems. In Recommender Systems Handbook, Francesco Ricci, Lior Rokach, Bracha Shapira, and Paul B. Kantor (Eds.). Springer US, 257–297. DOI:http://dx.doi.org/10.1007/978-0-387-85820-3_8

[116] C.E. Shannon. 2001. A mathematical theory of communication. ACM SIGMO- BILE Mobile Computing and Communications Review 5, 1 (2001), 3–55.

[117] Erez Shmueli and Tamir Tassa. 2017. Secure Multi-Party Protocols for Item- Based Collaborative Filtering. In Proceedings of the Eleventh ACM Conference on Recommender Systems (RecSys ’17). ACM, New York, NY, USA, 89–97. DOI: http://dx.doi.org/10.1145/3109859.3109881

[118] Joseph Sill, Gabor Takacs, Lester Mackey, and David Lin. 2009. Feature-Weighted Linear Stacking. arXiv:0911.0460 [cs] (Novemeber 2009). http://arxiv.org/ abs/0911.0460

[119] Itamar Simonson. 2005. Determinants of Customers’ Responses to Customized Of- fers: Conceptual Framework and Research Propositions. Journal of Marketing 69, 1 (Jan. 2005), 32–45. DOI:http://dx.doi.org/10.1509/jmkg.69.1.32.55512

[120] Barry Smyth and Paul McClave. Similarity vs. Diversity. In Proceedings of the 4th International Conference on Case-Based Reasoning (2001-07-30) (ICCBR 2001), David W. Aha and Ian Watson (Eds.). Springer Berlin Heidelberg, 347–361. http: //link.springer.com/chapter/10.1007/3-540-44593-5_25

[121] Tim Spalding. 2008. LibraryThing recommendations! (26 May 2008). http: //blog.librarything.com/main/2008/05/librarything-recommendations/

[122] E. Isaac Sparling and Shilad Sen. 2011. Rating: How Difficult is It?. In RecSys 11. ACM.

[123] Xiaoyuan Su and Taghi M. Khoshgoftaar. 2009. A Survey of Collaborative Filtering Techniques. Advances in Artificial Intelligence 2009 (2009). DOI:http://dx.doi.org/10.1155/2009/421425

[124] Deanne W. Swan, Justin Grimes, Timothy Owens, Kim A. Miller, J. Andrea Arroyo, Terri Craig, Suzanne Dorinski, Michael Freeman, Natasha Isaac, Patricia O'Shea, Regina Padgett, Peter Schilling, and C. Arturo Manjarrez. 2015. Public Libraries Survey Fiscal Year 2013. (Aug. 2015). https://www.imls.gov/research-evaluation/data-collection/public-libraries-united-states-survey/public-libraries-united

[125] Jacob Thebault-Spieker, Daniel Kluver, Maximilian A. Klein, Aaron Halfaker, Brent Hecht, Loren Terveen, and Joseph A. Konstan. 2017. Simulation Exper- iments on (the Absence of) Ratings Bias in Reputation Systems. Proc. ACM Hum.-Comput. Interact. 1, CSCW, Article 101 (Dec. 2017), 25 pages. DOI: http://dx.doi.org/10.1145/3134736

[126] K. Tsuji, E. Kuroo, S. Sato, U. Ikeuchi, A. Ikeuchi, F. Yoshikane, and H. Itsumura. 2012. Use of Library Loan Records for Book Recommendation. In Advanced Applied Informatics (IIAIAAI), 2012 IIAI International Conference on. 30–35. DOI:http: //dx.doi.org/10.1109/IIAI-AAI.2012.16

[127] Keita Tsuji, Nobuya Takizawa, Sho Sato, Ui Ikeuchi, Atsushi Ikeuchi, Fuyuki Yoshikane, and Hiroshi Itsumura. 2014a. Book Recommendation Based on Library Loan Records and Bibliographic Information. Procedia - Social and Behavioral Sci- ences 147 (2014), 478 – 486. DOI:http://dx.doi.org/10.1016/j.sbspro.2014. 07.142 3rd International Conference on Integrated Information (IC-ININFO).

[128] K. Tsuji, F. Yoshikane, S. Sato, and H. Itsumura. 2014b. Book Recommendation Using Machine Learning Methods Based on Library Loan Records and Biblio- graphic Information. In 2014 IIAI 3rd International Conference on Advanced Ap- plied Informatics. 76–79. DOI:http://dx.doi.org/10.1109/IIAI-AAI.2014.26

[129] A. Tversky and D. Kahneman. 1974. Judgment under Uncertainty: Heuristics and Biases. Science 185, 4157 (Sept. 1974), 1124–1131. DOI:http://dx.doi.org/ 10.1126/science.185.4157.1124

[130] Amos Tversky, Shmuel Sattath, and Paul Slovic. 1988. Contingent weighting in judgment and choice. Psychological Review 95, 3 (1988), 371–384. DOI:http://dx.doi.org/10.1037//0033-295X.95.3.371

[131] Paula Cristina Vaz, David Martins de Matos, Bruno Martins, and Pavel Calado. 2012. Improving a Hybrid Literary Book Recommendation System Through Author Ranking. In Proceedings of the 12th ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL '12). ACM, New York, NY, USA, 387–388. DOI:http://dx.doi.org/10.1145/2232817.2232904

[132] Jesse Vig, Shilad Sen, and John Riedl. 2012. The Tag Genome: Encoding Com- munity Knowledge to Support Novel Interaction. ACM Trans. Interact. Intell. Syst. 2, 3 (Sept. 2012), 13:1 – 13:44. DOI:http://dx.doi.org/10.1145/2362394. 2362395

[133] E. U. Weber and E. J. Johnson. 2009. Mindful judgment and Decision Making. In Annual Review of Psychology. Vol. 60. Annual Reviews, Palo Alto, 53–85.

[134] Li-Tung Weng, Yue Xu, Yuefeng Li, and R. Nayak. Improving Recommendation Novelty Based on Topic Taxonomy. In 2007 IEEE/WIC/ACM International Con- ferences on Web Intelligence and Intelligent Agent Technology Workshops (2007- 11). 115–118. DOI:http://dx.doi.org/10.1109/WI-IATW.2007.59

[135] Colleen Whitney and Lisa R Schiff. 2006. The Melvyl recommender project: Developing library recommendation services. California Digital Library (2006).

[136] Martijn C. Willemsen, Mark P. Graus, and Bart P. Knijnenburg. 2016. Under- standing the role of latent feature diversification on choice difficulty and satisfac- tion. User Modeling and User-Adapted Interaction 26, 4 (01 Oct 2016), 347–389. DOI:http://dx.doi.org/10.1007/s11257-016-9178-6

[137] Martijn C Willemsen, Bart P Knijnenburg, Mark P Graus, Linda CM Velter- Bremmers, and Kai Fu. 2011. Using latent features diversification to reduce choice difficulty in recommendation lists. RecSys ’11 (2011).

[138] Liang Xiang. Sept. 2011. Hulu’s Recommendation System. http://tech.hulu. com/blog/2011/09/19/recommendation-system/. (Sept. 2011). http://tech. hulu.com/blog/2011/09/19/recommendation-system/

[139] L. Xin, E. Haihong, S. Junde, S. Meina, and T. Junjie. 2013. Collaborative Book Recommendation Based on Readers' Borrowing Records. In Advanced Cloud and Big Data (CBD), 2013 International Conference on. 159–163. DOI:http://dx.doi.org/10.1109/CBD.2013.14

[140] Youtube. 2010. New video page launches for all users @http://youtube- global.blogspot.com/2010/03/new-video-page-launches-for-all-users.html. (March 2010). http://youtube-global.blogspot.com/2010/03/ new-video-page-launches-for-all-users.html

[141] Zhenming Yuan, Tianhao Yu, and Jia Zhang. 2011. A Social Tagging Based Collaborative Filtering Recommendation Algorithm for Digital Library. In Digital Libraries: For Cultural Heritage, Knowledge Dissemination, and Future Creation, Chunxiao Xing, Fabio Crestani, and Andreas Rauber (Eds.). Springer Berlin Hei- delberg, Berlin, Heidelberg, 192–201.

[142] Mi Zhang and Neil Hurley. Avoiding Monotony: Improving the Diversity of Rec- ommendation Lists. In Proceedings of the 2008 ACM Conference on Recommender Systems (2008) (RecSys ’08). ACM, 123–130. DOI:http://dx.doi.org/10.1145/ 1454008.1454030

[143] Yuan Cao Zhang, Diarmuid Ó Séaghdha, Daniele Quercia, and Tamas Jambor. Auralist: Introducing Serendipity into Music Recommendation. In Proceedings of the Fifth ACM International Conference on Web Search and Data Mining (2012) (WSDM '12). ACM, 13–22. DOI:http://dx.doi.org/10.1145/2124295.2124300

[144] Z. Zhu and J. Y. Wang. 2007. Book Recommendation Service by Improved As- sociation Rule Mining Algorithm. In 2007 International Conference on Machine Learning and Cybernetics, Vol. 7. 3864–3869. DOI:http://dx.doi.org/10.1109/ ICMLC.2007.4370820

[145] Kathryn Zickuhr, Lee Rainie, and Kristen Purcell. 2013a. Library Services in the Digital Age. (January 2013). http://libraries.pewinternet.org/2013/01/22/Library-services/

[146] Kathryn Zickuhr, Lee Rainie, and Kristen Purcell. 2013b. Younger Americans' library habits and expectations. (June 2013). http://libraries.pewinternet.org/2013/06/25/younger-americans-library-services/

[147] Cai-Nicolas Ziegler, Sean M. McNee, Joseph A. Konstan, and Georg Lausen. Improving Recommendation Lists Through Topic Diversification. In Proceedings of the 14th International Conference on World Wide Web (2005) (WWW '05). ACM, 22–32. DOI:http://dx.doi.org/10.1145/1060745.1060754

Appendix A

Output of SEM model for user-centered evaluation of Item-Item improvements

lavaan (0.5-23.1097) converged normally after 126 iterations

Number of observations 384

Estimator                                    WLS
Minimum Function Test Statistic          897.483
Degrees of freedom                           192
P-value (Chi-square)                       0.000

Model test baseline model:

Minimum Function Test Statistic        34625.982
Degrees of freedom                           221
P-value                                    0.000

User model versus baseline model:


Comparative Fit Index (CFI)                0.979
Tucker-Lewis Index (TLI)                   0.976

Root Mean Square Error of Approximation:

RMSEA                                      0.098
90 Percent Confidence Interval       0.092 0.104
P-value RMSEA <= 0.05                      0.000

Standardized Root Mean Square Residual:

SRMR 0.123

Weighted Root Mean Square Residual:

WRMR 1.787

Parameter Estimates:

Information                             Expected
Standard Errors                         Standard

Latent Variables:
                   Estimate  Std.Err  z-value  P(>|z|)
  good =~
    MATCH_PREF        0.726    0.018   39.791    0.000
    FUN_WATCH         0.722    0.018   39.157    0.000
    FAMILIAR          0.565    0.020   27.979    0.000
    RELATED           0.695    0.019   36.869    0.000
    NO_SENSE         -0.704    0.019  -37.592    0.000
    WOULD_NOT_USE    -0.583    0.018  -31.663    0.000
    SATISFIED         0.756    0.019   39.224    0.000
    HELP_ME           0.724    0.018   39.679    0.000
    numGood           1.799    0.076   23.830    0.000
    numBad           -1.889    0.082  -22.936    0.000
    numSeen           0.927    0.071   13.073    0.000
  expert =~
    INTERNET_RESEA    0.747    0.023   32.356    0.000
    NOT_EXPERT       -0.443    0.023  -18.983    0.000
    nRats             1.684    0.178    9.440    0.000
    MOVIE_LOVER       0.905    0.020   45.385    0.000
  diverse =~
    VARIED            0.930    0.018   50.602    0.000
    MANY_KINDS        0.855    0.017   49.681    0.000

Regressions:
                   Estimate  Std.Err  z-value  P(>|z|)
  good ~
    IINone            0.684    0.142    4.831    0.000
    IISmall           1.202    0.131    9.187    0.000
    IIOptimal         2.339    0.176   13.284    0.000
    IIVeryHigh        2.045    0.154   13.263    0.000
    Popularity        2.209    0.159   13.915    0.000
    expert            0.431    0.036   12.091    0.000
    diverse           0.680    0.050   13.547    0.000
  diverse ~
    IINone           -0.186    0.140   -1.322    0.186
    IISmall          -0.011    0.118   -0.091    0.928
    IIOptimal        -0.583    0.130   -4.482    0.000
    IIVeryHigh       -0.381    0.128   -2.976    0.003
    Popularity       -0.258    0.121   -2.134    0.033

Intercepts:
                   Estimate  Std.Err  z-value  P(>|z|)
   .MATCH_PREF        0.000
   .FUN_WATCH         0.000
   .FAMILIAR          0.000
   .RELATED           0.000
   .NO_SENSE          0.000
   .WOULD_NOT_USE     0.000
   .SATISFIED         0.000
   .HELP_ME           0.000
   .numGood           3.776    0.211   17.875    0.000
   .numBad            5.219    0.184   28.367    0.000
   .numSeen           3.879    0.179   21.629    0.000
   .INTERNET_RESEA    0.000
   .NOT_EXPERT        0.000
   .nRats            12.963    0.216   59.993    0.000
   .MOVIE_LOVER       0.000
   .VARIED            0.000
   .MANY_KINDS        0.000
   .good              0.000
    expert            0.000
   .diverse           0.000

Thresholds:
                   Estimate  Std.Err  z-value  P(>|z|)
    MATCH_PREF|t1    -0.543    0.082   -6.583    0.000
    MATCH_PREF|t2     0.274    0.082    3.340    0.001
    MATCH_PREF|t3     0.693    0.081    8.539    0.000
    MATCH_PREF|t4     1.733    0.082   21.027    0.000
    FUN_WATCH|t1     -0.810    0.083   -9.786    0.000
    FUN_WATCH|t2      0.001    0.077    0.018    0.985
    FUN_WATCH|t3      0.456    0.079    5.739    0.000
    FUN_WATCH|t4      1.633    0.079   20.649    0.000
    FAMILIAR|t1      -0.836    0.081  -10.312    0.000
    FAMILIAR|t2       0.007    0.072    0.103    0.918
    FAMILIAR|t3       0.331    0.072    4.596    0.000
    FAMILIAR|t4       1.194    0.069   17.349    0.000
    RELATED|t1       -0.381    0.080   -4.738    0.000
    RELATED|t2        0.520    0.078    6.666    0.000
    RELATED|t3        0.846    0.082   10.312    0.000
    RELATED|t4        1.771    0.079   22.487    0.000
    NO_SENSE|t1      -1.696    0.083  -20.395    0.000
    NO_SENSE|t2      -0.856    0.079  -10.798    0.000
    NO_SENSE|t3      -0.528    0.080   -6.631    0.000
    NO_SENSE|t4       0.089    0.086    1.038    0.299
    WOULD_NOT_USE|   -1.635    0.066  -24.830    0.000
    WOULD_NOT_USE|   -0.620    0.068   -9.170    0.000
    WOULD_NOT_USE|   -0.293    0.067   -4.354    0.000
    WOULD_NOT_USE|    0.196    0.068    2.889    0.004
    SATISFIED|t1     -0.180    0.080   -2.257    0.024
    SATISFIED|t2      0.405    0.081    5.022    0.000
    SATISFIED|t3      0.781    0.081    9.630    0.000
    SATISFIED|t4      1.846    0.080   23.090    0.000
    HELP_ME|t1       -0.441    0.078   -5.680    0.000
    HELP_ME|t2        0.137    0.083    1.649    0.099
    HELP_ME|t3        0.573    0.081    7.043    0.000
    HELP_ME|t4        1.775    0.080   22.320    0.000
    INTERNET_RESEA   -1.801    0.064  -27.982    0.000
    INTERNET_RESEA   -1.106    0.054  -20.557    0.000
    INTERNET_RESEA   -0.738    0.047  -15.851    0.000
    INTERNET_RESEA    0.604    0.047   12.941    0.000
    NOT_EXPERT|t1    -1.315    0.055  -23.844    0.000
    NOT_EXPERT|t2    -0.209    0.051   -4.128    0.000
    NOT_EXPERT|t3     0.580    0.049   11.762    0.000
    NOT_EXPERT|t4     1.391    0.058   24.117    0.000
    MOVIE_LOVER|t1   -2.479    0.098  -25.368    0.000
    MOVIE_LOVER|t2   -2.173    0.071  -30.548    0.000
    MOVIE_LOVER|t3   -0.945    0.053  -17.874    0.000
    MOVIE_LOVER|t4    0.325    0.046    7.144    0.000
    VARIED|t1        -2.565    0.108  -23.753    0.000
    VARIED|t2        -1.507    0.091  -16.605    0.000
    VARIED|t3        -0.780    0.086   -9.042    0.000
    VARIED|t4         0.513    0.087    5.923    0.000
    MANY_KINDS|t1    -2.277    0.092  -24.695    0.000
    MANY_KINDS|t2    -1.354    0.081  -16.771    0.000
    MANY_KINDS|t3    -0.577    0.076   -7.615    0.000
    MANY_KINDS|t4     0.547    0.075    7.302    0.000

Variances:
                   Estimate  Std.Err  z-value  P(>|z|)
   .good              1.000
   .diverse           1.000
    expert            1.000
   .MATCH_PREF        0.131
   .FUN_WATCH         0.140
   .FAMILIAR          0.474
   .RELATED           0.203
   .NO_SENSE          0.183
   .WOULD_NOT_USE     0.441
   .SATISFIED         0.057
   .HELP_ME           0.137
   .numGood           2.986    0.167   17.872    0.000
   .numBad            3.737    0.166   22.453    0.000
   .numSeen           5.153    0.317   16.232    0.000
   .INTERNET_RESEA    0.442
   .NOT_EXPERT        0.804
   .nRats            21.479    1.443   14.883    0.000
   .MOVIE_LOVER       0.181
   .VARIED            0.134
   .MANY_KINDS        0.269

Appendix B

Outline of the BookLens Core Web Server API

B.1 API overview

This section serves as an API guide for the BookLens system as deployed; it is up to date as of the date of publication. The core BookLens web server is organized into a number of separate resources that can be operated on through a RESTful API with an OAuth 1.0a [7] authentication scheme. A brief overview of OAuth authentication can be found in the next section; we leave the complete technical details of OAuth 1.0a authentication to its defining document [7]. Because the API is RESTful, identification of resources is typically done through the structure of the URL, actions are primarily disambiguated through standard HTTP verbs, and status is primarily communicated through HTTP response codes. For each recognized API command we list the appropriate URL pattern, HTTP verb, expected response format, and possible error codes. Generally speaking, each request returns or modifies some “resource” such as a Book, Opus, or User.

(This appendix is a transcription of the internally maintained API documentation for the BookLens project. Like most aspects of the BookLens system, this internal documentation was written in collaboration with Michael Ludwig.)

We will use JSON to encode these resources in responses, and we expect JSON formatting for descriptions of new books or users to create. While these resources share their names with the resources identified in chapter 3, in particular figure 3.3, various details differ. The data model presented in figure 3.3 represents the internal, multi-client view of BookLens maintained by the core server, and is presented with some unimportant details removed. The resources presented here are instead specific to a single client and contain full system details. While we list specific error codes as we cover each API command, some error codes are shared by all requests. To avoid redundancy, we list these shared codes here; a sketch of client-side handling for them follows the list.

• 401 - returned if the request fails OAuth authentication.

• 404 - returned if a requested URL does not match any valid pattern, or contains a resource identifier that cannot be resolved.

• 405 - returned if the HTTP method used is not valid for the resource requested.

• 409 - returned if multiple modifying methods (PUT, DELETE, POST) simultaneously edit a single resource. This will only be returned if we cannot internally sequence these simultaneous edits against the database in a predictable way.

• 500 - returned in the case of an unexpected server error. The body of the response will contain a stack trace which can be sent back to the BookLens team for debugging purposes.
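To make these shared codes concrete, the following is a minimal Python sketch of client-side handling for them. It is illustrative only: the base URL, the client key pair, the call_api helper name, and the use of the requests and requests_oauthlib libraries are assumptions for this sketch, not part of the BookLens implementation.

# Minimal sketch (not BookLens code) of handling the shared error codes above.
# BASE_URL and the client key pair are placeholders supplied out of band.
import requests
from requests_oauthlib import OAuth1

BASE_URL = "https://booklens.example.org"
CLIENT_KEY, CLIENT_SECRET = "client-key", "client-secret"

class BookLensError(Exception):
    """Raised for any of the shared error responses."""

def call_api(method, path, **kwargs):
    # 2-legged (client-only) authentication; user tokens are added for private data.
    auth = OAuth1(CLIENT_KEY, CLIENT_SECRET)
    resp = requests.request(method, BASE_URL + path, auth=auth, **kwargs)
    shared_errors = {
        401: "request failed OAuth authentication",
        404: "unknown URL pattern or unresolvable resource identifier",
        405: "HTTP method not valid for this resource",
        409: "conflicting concurrent modification; retry the request",
    }
    if resp.status_code in shared_errors:
        raise BookLensError(shared_errors[resp.status_code])
    if resp.status_code == 500:
        # The response body contains a stack trace for the BookLens team.
        raise BookLensError("unexpected server error:\n" + resp.text)
    return resp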

B.2 Authentication

All requests to URLs starting with “/api” must be authenticated via OAuth, either as a 2-legged (client only) or 3-legged (client and user) request. URLs containing private data will be clearly marked and must include 3-legged (client and user) authentication using a user-key pair specific to the user whose private data is being requested. BookLens supports OAuth signature information (signature, timestamp, nonce, etc.) being either in query parameters (for HTTP methods where that is appropriate) or in the Authorization header (on any request).

B.2.1 Tokens

OAuth requires token pairs to identify who is involved in a particular request. There are multiple token types, described below; depending on the type of token pair, they will be acquired differently. A token pair consists of a public key, which identifies the entity associated with the token, and a secret that is used in hash generation to prove the request was sent by that entity. Only the entity owning the public key and the core server have access to the secret, so if the hash is valid then the core server can trust that the request came from the entity. The secret is never sent by the entity as part of the request. When tokens are created and given to an entity, the core server may send the secret to the entity in the encrypted response body of a request.

Client tokens: Client tokens represent the identity of a client of BookLens (e.g. a library system) and must be included with every API request to identify the client making the request. A client will only have one client token and it does not expire (unless expiration is explicitly requested by the client, or BookLens manually revokes the token under unforeseen circumstances). There are no requests available to get a client token pair; that information is communicated outside of this API, as part of each library’s integration with BookLens.

User access tokens: User access tokens represent the identity of a BookLens user. Clients having access to the token pair of a user may make user-requiring API requests on the user’s behalf. This includes things like saving a review or rating, updating those resources, or changing user settings. While uncommon in the current BookLens deployment, we retain support for users who maintain a username and password for the core BookLens web server, as well as standard OAuth API endpoints for acquiring a user token through an explicit token request and user authorization handshake. Tokens created this way can expire, either after a timeout (default length 1 hour) or by explicit action by a user or BookLens administration. No information on expired tokens is stored, so using an expired token will return a 401 error as if an invalid token had been sent.

As users who maintain a username and password for the BookLens system are not common, most user authorization occurs through the creation of a “client-owned” user. These user accounts are created automatically by a client to represent one of its users (covered by the client’s own authentication system, most commonly barcode and PIN). Client-owned accounts are assigned a BookLens identifier, but are most commonly referred to by a client-specific identifier. As client-specific identifiers will be stored in the BookLens core server, they should not be reversible to a user’s identity; a one-way hash of the user’s barcode is appropriate. User access tokens for client-owned users never expire.

Temporary tokens: Temporary tokens are only used when going through the standard OAuth process of authorizing a user for a client. The client receives the temporary token and then redirects the user to a BookLens-owned login screen. If the user accepts, the user is redirected back to a client-specified URL that can upgrade the temporary token into an access token. If the user rejects, the temporary token cannot be upgraded. Temporary tokens cannot be used in place of user access tokens for making API requests.

B.2.2 Signing Requests

OAuth authentication information is typically placed in the Authorization header of the request. The value of this header starts with the text OAuth and is followed by comma-separated key-value pairs that specify the rest of the authentication configuration and parameters. Each key-value pair is formatted as key="value" (note the double quotes around the value). There are a number of OAuth parameters that can be included as key-value pairs (some of which are required):

• oauth_token: Specifies the public key for either the user access token or tempo- rary token, depending on the context of the request. This is optional; leaving it off will use 2-legged client-only authentication.

• oauth_callback: A URL that the user is redirected to when they complete the authorization needed to upgrade a temporary token to an access token. This is only necessary when initiating the access upgrade procedure.

• oauth_verifier: A value sent to the callback that must be included when up- grading to a full user access token. This is only necessary when completing the access upgrade procedure.

• oauth_consumer_key: Specifies the client’s public key. This is required.

• oauth_version: Specifies the version of the OAuth protocol. Should be set to 1.0.

• oauth_signature_method: Specifies the cryptographic method used to generate the signature. Should be set to HMAC-SHA1.

• oauth_timestamp: Specifies the timestamp of the request, as the number of sec- onds since January 1, 1970. If this is incorrectly specified (wrong timezone relative to the core server, etc.), authentication will fail. If the timestamp is more than approximately 5 minutes off from the server timestamp the request is rejected.

• oauth_nonce: Specifies a unique number for the request that will not be used for another request. This can simply be a random number generated by the requesting client. This feature prevents replay attacks against our API.

• oauth_signature: Specifies the signature that validates that the public keys spec- ified in oauth_token and oauth_consumer_key are who they say they are.

The specific algorithm for generating oauth_signature values is relatively straightforward, but it must be implemented exactly: any mismatch between the signature sent and the signature the core server generates from the request will cause the request to be rejected. Details of the signature generation process can be found in many places online. Because generic OAuth libraries are widely available and already tested against reference implementations of the OAuth standard, we recommend using a standard OAuth library to manage OAuth signatures instead of hand-written signature generation code.
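For orientation, the following Python sketch outlines the standard OAuth 1.0a HMAC-SHA1 signing procedure described by the public OAuth specification. It is not BookLens code, it omits some normalization details (duplicate parameter names, full URL normalization), and in practice a tested OAuth library should be used instead, as recommended above.

# Sketch of the generic OAuth 1.0a HMAC-SHA1 signing procedure (per the OAuth
# specification), for illustration only.
import base64, hashlib, hmac, secrets, time
from urllib.parse import quote

def _pct(value):
    # OAuth percent-encoding: only unreserved characters are left unescaped.
    return quote(str(value), safe="-._~")

def sign_request(method, url, params, consumer_key, consumer_secret,
                 token=None, token_secret=""):
    oauth = {
        "oauth_consumer_key": consumer_key,
        "oauth_nonce": secrets.token_hex(16),
        "oauth_signature_method": "HMAC-SHA1",
        "oauth_timestamp": str(int(time.time())),
        "oauth_version": "1.0",
    }
    if token is not None:
        oauth["oauth_token"] = token
    # The signature base string covers the request parameters plus the oauth_*
    # parameters, percent-encoded, sorted, and joined with the method and URL.
    all_params = {**params, **oauth}
    normalized = "&".join(f"{_pct(k)}={_pct(v)}" for k, v in sorted(all_params.items()))
    base_string = "&".join([method.upper(), _pct(url), _pct(normalized)])
    signing_key = f"{_pct(consumer_secret)}&{_pct(token_secret)}"
    digest = hmac.new(signing_key.encode(), base_string.encode(), hashlib.sha1).digest()
    oauth["oauth_signature"] = base64.b64encode(digest).decode()
    # The returned dict can be serialized into the Authorization header as
    # comma-separated key="value" pairs, as described above.
    return oauth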

B.3 Book

A Book resource in BookLens represents one edition of a book (or any other media type) as owned by one client. Each book record is owned and controlled by a single client, even though the data contained in that record is publicly available to all clients. This means the owning client can change the fields or delete the book at will, generally to match changes made to its own catalog. As BookLens does not intend to serve as a repository of book information, the information from this resource should not be used in place of locally owned data in displays. We ask, however, that clients generate book records with as much information as possible to assist in matching book records across libraries into a joint opus record. Like all resources in BookLens, a book has a generated UUID as a unique BookLens identifier. Additionally, the owning client can specify a custom ID to aid in matching a corresponding client record. Books can be identified either by the BookLens-generated ID, or by the owning client through their client-specific ID. A book’s JSON representation:

{"identifier": "string, UUID", "title": "string", "isbn": "string, ISBN-10 or ISBN-13", "authors": ["string", "string", "..."], "publisher": "string", "edition": 1, "copyrightYear": 1960, "tags": ["string", "string", "..."], "opus": "string, opus’ public ID", "sourceURL": "string, URL to client’s record", "client": "string, client name" }

The tags array holds metadata for the book record that does not fit within the rest of the defined book model. Generally this is data like the genre or subjects and is dependent on how the client stores their data. We recommend clients use a “key:value” structure for their tags. The opus property holds the ID of the opus the book belongs to. When updating a book, this can be set to the ID of another opus to move the book to that opus. The sourceURL and client properties are ignored on an update and are only included when viewing a book. The source URL is inferred from the client configuration and the book’s custom ID.

B.3.1 Web Requests

GET /api/book

Performs a search over all books in the database. The following query parameters are used to restrict the search. At least one of author, title, or isbn must be included:

• title: Book’s title must fuzzily match the queried title

• author: Book must include this author, fuzzily matched

• isbn: Book’s ISBN must match, after normalizing to ISBN-13

• minSimilarity: Numeral specifying how fuzzy the match can be; default is 0.5

• onlyOwned: Boolean to request only books owned by the calling client; default is false

• offset: Number of books to skip in the result before returning, useful for pagina- tion; default is 0

• length: Maximum number of books to return, useful for pagination; default is 100

On success this returns JSON {"books": []}, where each element in the array is a book block described above. The possible status codes are:

• 200 on success

• 400 if required query parameters are not supplied

GET /api/book/<id>

View a specific book record by the given UUID or client-specific ID. The ID is first checked against the custom IDs for the calling client, and then checked against the public UUIDs. On success this returns JSON {"book": {}} where the inner object is a book block described above. The possible status codes are:

• 200 on success

• 404 if the ID could not be found

PUT /api/book/<id>

Update a specific book record with the client-specific ID. If the ID does not exist for the calling client, a new book is created and owned by that client. If the ID is the public UUID of an existing book, the calling client must be the client that created the book; otherwise an error is raised, rather than creating a book with a client-specific ID that masks the public ID. The body of the request should be JSON {"book": {}}, where the inner object is a book block described above. The client and sourceURL parameters are ignored. The update will cause the book record to exactly match the provided JSON; e.g. a null parameter will null the field in the database. If the opus ID parameter is null, the book will be reprocessed to assign a new opus. Otherwise, it will be removed from its old opus and added to the specified opus. Specifying the same opus will result in no change. A request sketch follows the status codes below. Possible status codes are:

• 200 on a successful update

• 201 on a successful create (first time update)

• 400 if the request body could not be parsed, if it does not specify a title and authors or an ISBN, or if the opus could not be found

• 403 if book is not owned by the client
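As an illustration of the create-or-update behavior above (the request sketch referenced in the description), the following Python fragment creates a book record under a client-specific ID. The ID "local-123", all field values, and the call_api helper from the authentication sketch are hypothetical.

# Sketch: create or update a book record under a hypothetical client-specific ID.
# call_api is the illustrative helper from the authentication section.
book = {
    "title": "Example Title",
    "isbn": "9780000000000",
    "authors": ["Example Author"],
    "publisher": "Example Press",
    "edition": 1,
    "copyrightYear": 1960,
    "tags": ["genre:fiction", "subject:example"],  # recommended key:value tag style
    "opus": None,  # a null opus lets the core server assign one by matching
}
resp = call_api("PUT", "/api/book/local-123", json={"book": book})
if resp.status_code == 201:
    print("book created")   # first-time update creates and assigns ownership
elif resp.status_code == 200:
    print("book updated")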

DELETE /api/book/<id>

Delete the specified book record with the client-specific ID. The ID may also be the public UUID, but the caller must be the owning client. This will remove the book from its opus. If the opus no longer has any books after this delete, it may be deleted as well (this may happen immediately). Possible status codes are:

• 200 on a successful delete

• 403 if the book is not owned by the client

B.4 Opus

An opus in BookLens represents the canonical work that is then published in various book formats. In many cases a book is only published once, and so its opus contains a single book. For classics such as Dracula or the works of Shakespeare, many editions and published versions are all considered the same opus. BookLens accepts ratings, computes predictions, and gives recommendations at the opus level. An opus has a public UUID to identify it. Additionally, any valid book ID can be used in place of the opus UUID; it will select the opus of that book. Unlike books, opuses are not owned by a specific client. Books from multiple clients will be combined into a single opus through internal heuristics. An opus’s JSON representation:

{"identifier": "string, UUID", "clientBooks": ["string", "client book IDs of the opus’ books", "..."], "title": "string, representative title", "authors": ["string", "union of books’ authors", "..."], "tags": ["string", "union of books’ tags", "..."], "avgRating": 0.5, "reviewCount": 3, "blockedReviewCount": 1, "ratingCount": 97 }

The meaning of these fields is as follows:

• title - A normalized title that attempts to fix capitalization, ordering of articles, and anthology descriptors.

• authors - The union of the authors listed for every book within the opus.

• tags - The union of every tag listed on books within the opus.

• avgRating - The average of all ratings placed on this opus. A number from -1 to 1, where -1 represents the most negative opinion and 1 represents the most positive opinion. This can be converted to a discrete star scale at display time as needed (a sketch of one possible conversion follows this list).

• reviewCount - The number of reviews that have been placed on this opus.

• blockedReviewCount - The number of reviews that have been placed on this opus that would be blocked for the current user (or anonymously if no user is provided).

• ratingCount - The number of ratings that have been placed on this opus. This includes ratings made by other systems, and "fake" ratings that are in our database to help improve accuracy.
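The conversion mentioned in the avgRating description is left to the presentation layer. The sketch below shows one plausible linear mapping from the -1 to 1 scale onto a five-star display scale; the mapping and half-star rounding are illustrative choices, not part of the BookLens specification.

# One plausible display-time conversion of avgRating (-1..1) to a 1-5 star scale.
# The linear mapping and half-star rounding are illustrative assumptions.
def to_stars(avg_rating: float) -> float:
    stars = 1.0 + (avg_rating + 1.0) / 2.0 * 4.0  # -1 -> 1 star, 0 -> 3 stars, 1 -> 5 stars
    return round(stars * 2) / 2                   # round to the nearest half star

assert to_stars(-1.0) == 1.0
assert to_stars(0.0) == 3.0
assert to_stars(1.0) == 5.0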

B.4.1 Web Requests

GET /api/opus

Performs a search over all opuses in the database. The following query parameters are used to restrict the search. At least one of author, title, or isbn must be included:

• title: Opus or any of its books’ titles must fuzzily match the queried title

• author: Any of its books’ authors must fuzzily match the queried author

• isbn: Any of its books’ ISBNs must match, after normalizing to ISBN-13

• minSimilarity: Numeral specifying how fuzzy the match can be; default is 0.5

• onlyOwned: Boolean to request only opuses that contain at least one book owned by the calling client; default is false

• offset: Number of opuses to skip in the result before returning, useful for pagination; default is 0

• length: Maximum number of opuses to return, useful for pagination; default is 100

On success this returns JSON {"opuses": []} where each element in the array is an opus block described above. The possible status codes are:

• 200 on success

• 400 if required query parameters are not supplied

GET /api/opus/<id>

View a specific opus by the given UUID, one of its books’ UUIDs, or one of its books’ client-specific IDs. The ID is first checked against the public UUIDs of opuses, then client-specific book IDs, then book UUIDs. If a book-domain (client-specific or UUID) match is found, the opus for the matched book is returned. On success, this returns JSON {"opus": {}} where the inner object is an opus block as described above. The possible status codes are as follows. Note that error codes 300 and 301 only apply when a UUID for an out-of-date opus is used; for this reason we recommend referencing opuses by book IDs when possible to avoid opus disambiguation issues.

• 200 on success

• 300 if the opus has been split into multiple opuses; the result contains special JSON describing the possible opuses

• 301 if the opus has been merged into another; the Location header contains this new location

• 404 if the ID could not be found

GET /api/opus/<id>/book

Get all the books of the opus specified by ID. The ID specifies an opus identically to the plain GET /api/opus/<id> request. The returned result contains the complete book information for every book in the opus. On success, this returns JSON {"books": []} where each element in the array is a book block as described in the Book resource section. The possible status codes are:

• 200 on success

• 300 if the opus has been split into multiple opuses; the result contains special JSON describing the possible opuses

• 301 if the opus has been merged into another; the Location header contains this new location

• 404 if the ID could not be found

GET /api/opus/<id>/review

Get all reviews that have been written about the specified opus. The ID determines the opus identically to the plain GET /api/opus/<id> request. On success, this returns JSON {"reviews": []} where each element in the array is a review block as described in the Review resource section. The possible status codes are:

• 200 on success

• 300 if the opus has been split into multiple opuses; the result contains special JSON describing the possible opuses

• 301 if the opus has been merged into another; the Location header contains this new location

• 404 if the ID could not be found

B.5 User

A user represents one of two types of person using the BookLens service: either a registered user who took the time to create an account using the main web site, or an automatically created user forming a one-to-one mapping with the user accounts of a client. BookLens has different information about these types of users, and the authentication method is different. For a registered user, we have their username, email address, and password. A client is given access to such a user when they go through the standard OAuth workflow, in which the browser is redirected to our site so the user can log in with that username and password and explicitly consent to giving the client access to their account. For a client-created user, there is no username, email address, or password; however, the client can assign a private, unique alias to that user so it may identify them later. When creating a client-created user, the client is provided with non-expiring OAuth tokens that it uses to authorize its access to that particular user. Any time a user ID is required in a URL, BookLens accepts the special meta-ID ’me’, which represents the user specified by the OAuth tokens signing the request. If the request does not have user OAuth tokens, an error is raised. A user’s JSON representation:

{"identifier": "string, UUID", "userName": "string, public user name", "alias": "string, client-defined ID", "optIn": true or false, "token": {"key": "string, OAuth public key", "secret": "string, OAuth secret key"}, "opuses": [{"identifer": "opus object definition"}, {}] }

The token and alias fields are only provided to the owning client for client-created users. The opuses array contains the list of all opuses the user has rated. Each object in that array is a JSON object as defined in Opus with the additional properties rating and prediction. If rating is not present it means the user doesn’t actually have a rating for that opus. If prediction is not present, it means the recommender engine did not have enough information to compute one. This modified opus structure is used whenever an opus is included in one of the user APIs.
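To illustrate how a client might consume the opuses array and its optional rating and prediction fields, here is a small Python sketch; the sample data and the handling policy are assumptions, not prescribed by the API.

# Sketch: separate a user's opuses into rated items and prediction-only items,
# following the optional-field rules described above. Sample data is made up.
def split_opuses(user_block):
    rated, predicted_only = [], []
    for opus in user_block.get("opuses", []):
        if "rating" in opus:          # no rating field => the user has not rated it
            rated.append(opus)
        elif "prediction" in opus:    # no prediction => the engine lacked information
            predicted_only.append(opus)
    return rated, predicted_only

user = {"opuses": [
    {"identifier": "opus-1", "rating": 0.8, "prediction": 0.7},
    {"identifier": "opus-2", "prediction": 0.4},
]}
rated, predicted_only = split_opuses(user)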

B.5.1 Web Requests

GET /api/user/<user_id>

View the user specified by <user_id>. The user data is considered private, so it can only be viewed with user OAuth authentication. The following query parameters are available:

• includeOpuses - Boolean to request the opuses array of rated opuses for the user. Defaults to true. Setting it to false can be useful when performing user-logged-in validation with the meta-ID me.

On success, this returns JSON {"user": {}} where the inner object is defined from above. If includeOpuses is false, then the opuses array is excluded, which can be faster if the user ratings are not necessary. The possible status codes are:

• 200 on success

• 403 if the user OAuth tokens don’t match the user being requested

• 404 if the user doesn’t exist

PUT /api/user/<user_id>

Update a specific user by ID (client-specific, public, and meta IDs are all accepted). If no user with that ID exists, an error is raised. Currently the only fields that can be updated are optIn and userName. The body of the request should be JSON: {"user": {"optIn": true or false, "userName": "string"}}. If one of the supported fields is not specified, no modification is performed on that field (e.g. the username is not modified). However, at least one field must be specified; an empty request is an error. The new username must be unique. Possible status codes are:

• 200 on a successful update

• 400 if the request is not valid (no updatable field, or if the username is not unique)

• 401 if you make an unauthorized request (not signed by a user)

• 403 if the user OAuth token doesn’t match the user being requested (signed by the wrong user)

• 404 if the user doesn’t exist

POST /api/user

Create a new client-owned user. The request body should be JSON {"user": {}} where the inner object has a subset of the properties defined above in the user JSON structure. Namely, the created user takes its alias from the alias property. If the body is blank, the user is created with an empty alias, which is discouraged. All other properties are ignored. Attempting to create a user with an alias already in use by the client does nothing except return new OAuth tokens for the existing user. On success, the user block is returned with the token properties completed, which should be persisted for later authentication. The possible status codes are:

• 201 on success

GET /api/user/<user_id>/opus/<opus_id>

View the opus details of opus <opus_id> while including the rating and prediction of the user <user_id>. Just like with the plain opus requests, <opus_id> can be an opus UUID, book UUID, or custom ID of a book within the opus. <user_id> can be a UUID or the meta-ID ’me’. On success this returns JSON as defined in Opus with the additional properties rating, prediction, ratingDate, and ratedBook. The rating-prefixed properties are only present if the user has rated the opus. ratedBook is present only if the rating event recorded a specific book of the opus. The ratedBook is the client’s custom ID for that book if the client of the request owns the book; otherwise ratedBook is the public ID of the book. Possible status codes are:

• 200 on success

• 403 if the user OAuth tokens don’t match the user being requested

• 404 if the user or opus don’t exist

PUT /api/user/<user_id>/opus/<opus_id>

Save a rating between the user <user_id> and the opus <opus_id>. Just like with the plain opus requests, <opus_id> can be an opus UUID, book UUID, or custom ID of a book within the opus. <user_id> can be a UUID or the meta-ID ’me’. The rating value is sent in the body of the request as JSON: {"rating": {"value": 0.4, "book": "identifier", "created": "timestamp"}} where the floating point value is between -1 and 1. This overwrites any previous rating assigned to the opus by this user. The book identifier is an optional parameter that specifies the particular book of the opus the user was interacting with, and is recommended when available. created is an optional field that specifies the timestamp associated with the rating event; if not provided, it defaults to the current time as computed by our server. Specifying a timestamp is primarily useful when migrating existing rating data sets into BookLens. If provided, the date must be formatted as "yyyy-MM-dd HH:mm:ss Z" as defined by Java’s SimpleDateFormat class. A request sketch follows the status codes below. Possible status codes are:

• 200 on success

• 400 if the book was provided and does not exist, if the book does not belong to the opus, or if the created timestamp provided cannot be parsed

• 403 if the user OAuth tokens don’t match the user being requested

• 404 if the user or opus don’t exist
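The request sketch referenced above saves a rating on behalf of a user. All identifiers, tokens, the base URL, and the timestamp are placeholders; the request is signed with both the client token and the user's access token (3-legged).

# Sketch: save (or overwrite) a user's rating on an opus. Identifiers, tokens,
# URL, and timestamp are placeholders; authentication is 3-legged.
import requests
from requests_oauthlib import OAuth1

auth = OAuth1("client-key", "client-secret", "user-token", "user-token-secret")
rating = {
    "value": 0.4,                            # must lie between -1 and 1
    "book": "local-123",                     # optional: the book the user interacted with
    "created": "2018-06-01 12:00:00 +0000",  # optional; "yyyy-MM-dd HH:mm:ss Z" format
}
resp = requests.put(
    "https://booklens.example.org/api/user/me/opus/local-123",
    json={"rating": rating},
    auth=auth,
)
resp.raise_for_status()   # 400/403/404 handling omitted for brevity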

DELETE /api/user/<user_id>/opus/<opus_id>

Delete the rating between the user <user_id> and the opus <opus_id>. Just like with the plain opus requests, <opus_id> can be an opus UUID, book UUID, or custom ID of a book within the opus. <user_id> can be a UUID or the meta-ID ’me’. This request returns status code 200 and takes no action if there is no rating between the specified user and opus. Possible status codes are:

• 200 on success

• 403 if the user OAuth tokens don’t match the user being requested

• 404 if the user or opus don’t exist

GET /api/user/<user_id>/opus

Compute a recommendation list of opuses for the user <user_id>, which can be a UUID or the meta-ID ’me’. This supports the same filter and query parameters as the plain search/index API over opuses, except that the results form a personal recommendation list. Available query parameters are:

• title: Opus or any of its books’ titles must fuzzily match the queried title

• author: Any of its books’ authors must fuzzily match the queried author

• isbn: Any of its books’ ISBNs must match, after normalizing to ISBN-13

• minSimilarity: Numeral specifying how fuzzy the match can be; default is 0.5

• onlyOwned: Boolean to request only opuses that contain at least one book owned by the calling client; default is false

• offset: Number of opuses to skip in the result before returning, useful for pagination; default is 0

• length: Maximum number of opuses to return, useful for pagination; default is 100

• includeRated: Allow opuses already rated by the user to be included in the recommendation

On success this returns JSON {"opuses": []} where each element in the array is an opus block with the additional prediction and rating properties described above. The list is sorted by the predicted value. The possible status codes are:

• 200 on success

• 403 if the user OAuth tokens don’t match the user being requested

• 404 if the user doesn’t exist

GET /api/user/<user_id>/review

Get all reviews that have been written by the specified user. The ID determines the user identically to the plain GET /api/user/<user_id> request. On success, this returns JSON {"reviews": []} where each element in the array is a review block as described in the Review resource section. Unlike a user’s core properties or ratings, the user’s reviews are necessarily public and can be queried without that user’s OAuth authorization. The possible status codes are:

• 200 on success

• 404 if the ID could not be found

B.6 Review

A review is a textual review written by a user about an opus. Reviews are full-fledged resources, but both the user and opus APIs provide means to query relevant reviews. When viewing a review, the user’s rating is returned, so there is an implicit publishing of their rating for that particular opus. To make this clear to the user, it is strongly recommended to request a rating at the time of review creation. A review’s JSON representation:

{"identifier": "string, public UUID", "content": "string, text content of review", "opus": "string, opus’ UUID", "book": "string, book’s UUID", "opusData": {"identifier": "opus object definition"}, "user": "string, user’ UUID", "userName": "string, user’s public name", "rating": 0.0, "created": "string, formatted date of creation", "lastEdit": "string, formatted date of last edit", "client": "string, name of client review was created via", "block": true or false }

The userName property is the public user name associated with the review’s user. If it is not present, the user does not have an assigned public name, and an appropriate label, such as ’Anonymous’, should be used instead. Note that the opus field only contains the opus identifier; when viewing a review, it is redundant with the identifier present in the opusData field. The book field is optional, and will specify the specific book record on which the review was made, where possible. block is a boolean value that determines whether the review should be blocked due to inappropriate content or administrative action. Currently, this will be true if the authenticated user has explicitly blocked the review, if more than 5 users have blocked the review, or if a client administrator has blocked it. Blocked reviews are always returned in queries, so the presentation layer must filter them or present them in a hidden manner (such as “click to show blocked review”).

B.6.1 Web Requests

GET /api/review/<id>

View the review content specified by the ID, which must be the public UUID of a previously created review. The review can be viewed by any client, with or without any user authorization. On success it returns JSON {"review": {}} where the inner block is formatted as described above. Note that if a review has never been edited since its creation, its last-edit timestamp will be the same as its created timestamp. The timestamps are formatted in a reasonably presentable format localized to the core server’s location. If the review was created with an explicit book, that book’s ID is in the book field. If the authorizing client owns that book, the ID is the custom book ID; otherwise it’s the public ID. Response codes are:

• 200 on success

• 404 if there is no such review

PUT /api/review/<id>

Edit the review specified by the given ID, which must be the public UUID of an existing review. A review can only be edited by the user that created it, but it does not matter which client they work through. In the future, admin users may also have this right. Only the text content of the review can be changed. The body of the request must be JSON {"review": {"content": "new content"}}. Only the content property is accepted. If the opus or book property is provided, an explicit error is generated; all other properties are ignored. Response codes are:

• 200 on success

• 400 if the body could not be parsed, or if an opus ID was included

• 403 if the authorizing OAuth user is not the owner of the review

• 404 if the review does not exist

DELETE /api/review/<id>

Delete the review specified by the given ID, which must be the public UUID of an existing review. A review can only be deleted by the user that created it, but it does not matter which client they work through. In the future, admin users will also have this right. Response codes are:

• 200 on success

• 403 if the authorizing OAuth user is not the owner of the review

• 404 if the review does not exist

POST /api/review

Create a new review. The created review is owned by the user authorizing the request through OAuth, and the creating client is the client making the request. The opus and initial review content are included in the request body. The body must be JSON

{ "review": { "opus": "string, id", "book": "string, id", "content": "text", "created": "timestamp", "lastEdit": "timestamp" } }

i.e. the review data block described above, with only the opus and content fields required; the book, created, and lastEdit fields are allowed but optional. All others are ignored. The provided opus ID can be the public UUID, a book’s UUID, or a book’s custom ID. The book ID is not required, but is recommended if available. created and lastEdit are only useful when migrating existing reviews from another system into BookLens and you wish to preserve the timestamps. If the timestamps aren’t provided, they default to the current time on the core server. If provided, they must be formatted as "yyyy-MM-dd HH:mm:ss Z", using the conventions of Java’s SimpleDateFormat. Response codes are:

• 201 on success

• 400 if the body could not be parsed, if the review content is empty, if the opus could not be found, if the provided book could not be found or did not belong to the opus, or if the provided timestamps could not be parsed

• 403 if there is no authorizing OAuth user (i.e. a client cannot create reviews on its own)
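To show the review-creation body in use, the sketch below posts a review on behalf of an authorized user; the identifiers, tokens, base URL, and review text are placeholders.

# Sketch: create a review for an opus on behalf of the authorizing user.
# Identifiers, tokens, URL, and content are placeholders.
import requests
from requests_oauthlib import OAuth1

auth = OAuth1("client-key", "client-secret", "user-token", "user-token-secret")
review = {
    "opus": "local-123",    # opus UUID, book UUID, or client-specific book ID
    "book": "local-123",    # optional, but recommended when known
    "content": "A short example review.",
}
resp = requests.post("https://booklens.example.org/api/review",
                     json={"review": review}, auth=auth)
if resp.status_code == 201:
    print("review created")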

POST /api/review/<id>/report

Report the review specified by the given ID on behalf of the authenticating user. The review will be marked as blocked by the authorizing user, and the review will be shown in admin processing queries. There is no request body. This works even if there is no authorizing user, in order to support anonymous reports. However, at most one anonymous report is counted towards the automatic filter limit, and it is not possible to track anonymous users and hide reviews during their session (as is possible with authorized users). Response codes are:

• 200 on success

POST /api/review/<id>/block

Mark the review by the given ID as blocked. After this the review will have its block attribute default to true for all viewers. This is considered an admin action and should not be exposed to end-users. Response codes are:

• 200 on success

POST /api/review/<id>/unblock

Mark the review by the given ID as unblocked. After this the review will have its block attribute default to false for all viewers (except the ones who reported it). This is used when the admin decides the review is acceptable for the general public, despite having 5 or more user reports. This is considered an admin action and should not be exposed to end-users. Response codes are:

• 200 on success

GET /api/review/report/unprocessed

Retrieve all reported reviews that have not had an admin action (block or unblock) applied to them. The response content is a JSON object {"reviews": []} that contains review blobs as described above (this is equivalent to the review index requests for a user or opus’s reviews). The reviews are sorted so that the most recently reported review is first. The request accepts query parameters offset and length to support pagination. Response codes are:

• 200 on success

• 400 if the parameters are not integers

B.7 Batch Requests

Batch requests serve as a performance enhancement for querying the information of many opuses at once. Variants exist for querying bulk opus information in the context of a user as well. This is primarily useful when querying all of the records present on a client-generated search page. The underlying content of the batch request, and the IDs usable within batch requests, are specified in the Opus and User sections. Input batch JSON structure:

{ "requests": ["string, opus ID", "string, opus ID"] }

Output batch JSON structure:

{
  "responses": [
    {"request": "string, matches one of the requested IDs",
     "status": 200,
     "errorMessage": "string, only present if status isn't 200",
     "opus": {}}
  ]
}

B.8 Web Requests

Note: both of the requests described below use the HTTP POST method even though, from an idealized point of view, they have no side effects and represent many GET requests. This is done for convenience: the many requests can be described cleanly in the request body rather than in an unwieldy, lengthy URL, and GET requests should not have bodies.

POST /api/batch/opus

Evaluate the batch request that’s described in the POST’s body. This is equivalent to invoking multiple GET /api/opus/ requests for each opus listed in the POST’s body. The body must be json formatted like {"batch": {}} where the inner object follows the pattern described above for input batch structure. The request strings in the body must be opus IDs, book IDs, or custom-client book IDs for the authenti- cated client. The response, assuming the request authenticates, is a json formatted like {"batch": {}} where the inner object follows the output batch structure. For each requested opus, there will be a response that includes the request as a key, the HTTP status if it were invoked as a single request, and the response. If the status is 200, the response is a formatted opus structure. If it’s not 200, the errorMessage field provides more description. Possible status codes for the entire request are:

• 200 on success
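Below is a sketch of a batch opus lookup and per-item response handling, following the input and output batch structures defined above; the IDs and base URL are placeholders.

# Sketch: look up several opuses in one batch request and handle per-item results.
# IDs and the base URL are placeholders.
import requests
from requests_oauthlib import OAuth1

auth = OAuth1("client-key", "client-secret")   # 2-legged is sufficient here
batch = {"requests": ["local-123", "local-456"]}
resp = requests.post("https://booklens.example.org/api/batch/opus",
                     json={"batch": batch}, auth=auth)
for item in resp.json()["batch"]["responses"]:
    if item["status"] == 200:
        print(item["request"], "->", item["opus"]["title"])
    else:
        print(item["request"], "failed:", item["errorMessage"])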

POST /api/batch/user/<user_id>/opus

Evaluate the batch request described in the POST’s body. This is equivalent to invoking multiple GET /api/user/<user_id>/opus/<opus_id> requests. In this case, the user specified in the URL is used for every opus specified in the body. Other than including the user-specific opus fields for rating and prediction, this behaves the same as the above batch request. Possible status codes for the entire request are:

• 200 on success

• 401 if 3-legged OAuth is not provided

• 403 if the user specified in the URL is not the authorized user from OAuth

• 404 if the user doesn’t exist

B.9 User Login

There is a small set of user settings for which we do not support automatic management. In particular, no client can opt a user in or out of having their data used in a dataset or experiment, and no client can set a user’s client-independent username and password. These settings can only be changed directly by the user through the core server’s user-facing website. As this website is username- and password-protected, this creates an issue for client-owned users, as these users have no username and password of their own. To resolve this, a final “special” API endpoint is provided.

GET /account/oauthLogin

The purpose of this API endpoint is to allow clients to log a user into the user-facing website of the BookLens core server. It allows escalation of a user token in the OAuth scheme to a user-specific cookie in a more traditional user-facing web authentication scheme. Requests to this special OAuth login endpoint must be signed with both a client- and a user-specific access token. Requests to this URL should additionally carry all OAuth parameters as query parameters in the URL, rather than in a header, and should contain an additional “redirect” query parameter listing a URL to send the user to after authentication. This URL should only be accessed by user computers. The expected workflow is that the client will create and sign a request to this URL on behalf of its user, and then redirect the user to it. From that point, the core server can verify the request signature. If the request passes authentication, the server will issue an authorization cookie for the user in the user’s browser, and then redirect the user as specified. As this API endpoint is ultimately for the user’s browser, rather than direct computer-to-computer communication, a special error page will be rendered for the user should an error occur. Unfortunately, there is no direct way for a client server to be notified if this service fails for any given user. We acknowledge that it is technically possible for a client to use this API, along with traditional web-bot technology, to open its own simulated browser at this URL. By doing so a client could pretend to be a user browser and change user-specific settings without explicit user consent (for client-owned user accounts). Unfortunately, we are not aware of a simple resolution to this issue that still allows the client to log its client-owned users into our user-facing web service. As all BookLens clients are considered trusted partners, we do not address this possibility for inappropriate resource access further.