Stability of Collaborative Filtering Recommendation Algorithms
GEDIMINAS ADOMAVICIUS, University of Minnesota
JINGJING ZHANG, University of Minnesota
The paper explores stability as a new measure of recommender system performance. Stability is defined to measure the extent to which a recommendation algorithm provides predictions that are consistent with each other. Specifically, for a stable algorithm, adding some of the algorithm's own predictions to the algorithm's training data (for example, if these predictions were confirmed as accurate by users) would not invalidate or change the other predictions. While stability is an interesting theoretical property that can provide additional understanding about recommendation algorithms, we believe stability to be a desired practical property for recommender system designers as well, because unstable recommendations can potentially decrease users' trust in recommender systems and, as a result, reduce users' acceptance of recommendations. In this paper, we also provide an extensive empirical evaluation of stability for six popular recommendation algorithms on four real-world datasets. Our results suggest that the stability performance of individual recommendation algorithms is consistent across a variety of datasets and settings. In particular, we find that model-based recommendation algorithms consistently demonstrate higher stability than neighborhood-based collaborative filtering techniques. In addition, we perform a comprehensive empirical analysis of many important factors (e.g., the sparsity of the original rating data, normalization of input data, the number of new incoming ratings, the distribution of incoming ratings, the distribution of evaluation data) and report the impact they have on recommendation stability.

Categories and Subject Descriptors: H.3.3 [Information Search and Retrieval]: Information Filtering; H.2.8.d [Information Technology and Systems]: Database Applications – Data Mining; I.2.6 [Artificial Intelligence]: Learning.

General Terms: Algorithms, Measurement, Performance, Reliability.
Additional Key Words and Phrases: Evaluation of recommender systems, stability of recommendation algorithms, performance measures, collaborative filtering.
1. INTRODUCTION AND MOTIVATION

Recommender systems analyze patterns of user interest in order to recommend items that best fit users' personal preferences. In one of the more commonly used formulations of the recommendation problem, recommendation techniques aim to estimate users' ratings for items that have not yet been consumed, based on the known ratings users provided for consumed items in the past. Recommender systems then recommend to users the items with highly predicted ratings. Recommender systems can add value to both users and companies [Dias et al. 2008; Garfinkel et al. 2006; Legwinski 2010; Rigby 2011] and are becoming an increasingly important decision aid in many information-rich settings, including a number of e-commerce websites (e.g., Amazon and Netflix).

Much of the research in the recommender systems literature has focused on enhancing the predictive accuracy of recommendation algorithms, although in the last several years researchers have started to explore alternative measures for evaluating the performance of recommender systems. Such metrics include coverage, diversity, novelty, serendipity, and several others; comprehensive overviews of different metrics and key decisions in evaluating recommender systems can be found in [Herlocker et al. 2004; Shani and Gunawardana 2011]. Our work follows this general stream of research in that we believe that there are other important aspects of recommender system performance (i.e., aside from accuracy) that have been largely overlooked in the research literature.

In particular, in this paper we introduce and explore the notion of stability of recommendation algorithms, which we first illustrate and motivate with the following simple example (also depicted in Figure 1). Suppose we have a movie recommender system, where user Alice has entered into the system the ratings 4, 2, 1 (on a scale from 1 to 5) for the three
1 This paper substantially augments and improves its preliminary version which appeared in ACM RecSys 2010 [Adomavicius and Zhang 2010].
movies that she saw recently and, in return, the system provided for her the following rating predictions for four other movies: 4, 4, 5, and 1. Following the recommendations, Alice went to see the first three of these movies (because they were highly predicted by the system), but avoided the fourth one (because it was predicted only as 1 out of 5). It turns out that, in Alice's case, the system was extremely accurate for the three movies that were predicted highly, and Alice indeed rated them as 4, 4, and 5, respectively. Imagine Alice's surprise when, after she submitted these latest ratings (which were exactly the same as the system's own predictions), the recommender system started predicting the final movie for her (which earlier was predicted as 1 out of 5) as 5 out of 5.
Figure 1. Example scenario of unstable recommendations.
A regular user like Alice, who ended up providing ratings for the three movies exactly as predicted by the system, may initially be quite impressed by how well the system was able to understand and predict her preferences for these movies. However, the subsequently observed instability in rating predictions (i.e., the internal inconsistency of the recommendation algorithm) may seem especially odd and confusing.

Providing consistent predictions has important implications for users' trust in and acceptance of recommendations. For example, it has been shown in the consumer psychology literature that online advice-giving agents with greater fluctuations in past opinions are considered less informative than those with a more uniform pattern of opinions [Gershoff et al. 2003]. Also, advice that is inconsistent with past recommendations is considered less helpful than consistent advice [d'Astous and Touil 1999; Van Swol and Sniezek 2005]. Inconsistent recommendations provided by recommender systems may be discredited by users and, as a result, the drop in users' confidence can directly reduce users' perception of the system's personalization competence. Previous studies have found that users' trust and perception of personalization competence are key factors driving users' acceptance of recommendations [Komiak and Benbasat 2006; O'Donovan and Smyth 2006; O'Donovan and Smyth 2005; Wang and Benbasat 2005]. Hence, the instability of a recommender system is likely to have a negative impact on users' acceptance and, therefore, to harm the success of the system.

We would like to note that the recommendation example described in Figure 1 is not a purely hypothetical scenario – this can happen with some popular and commonly used recommendation algorithms, as will be discussed later in the paper. To define the notion of stability more precisely, we formalize the procedure used in the above example.
Specifically, we measure the stability of a recommender system as the degree to which the system’s predictions for the same items stay the same (i.e.,
stay stable) over time, when any and all new incoming ratings submitted to the recommender system are in complete agreement with the system's prior predictions. Intuitively, the (in)stability of a recommendation algorithm represents the degree of (in)consistency between the different predictions made by the algorithm. In other words, two predictions of the same recommender system can be viewed as "inconsistent" with each other if accepting one of them as the truth (i.e., if the system happens to be perfectly accurate in its prediction) and adding it to the training data for the system changes (invalidates) the other prediction. By our definition, stable systems provide "consistent" predictions and vice versa.

Figure 2 provides a very simple illustration of stable vs. unstable recommendation techniques, T1 and T2. Suppose, for a given user, the two techniques start with the exact same predictions for unknown items, but adding new ratings that are identical to the system's prior estimations leads to different predictions on the remaining set of unknown items. For the stable technique T1, adding new ratings does not change the estimations for other unknown items. However, for the unstable technique T2, the addition of new incoming ratings can lead to a dramatic change in the estimations for the remaining unknown ratings. The extent of this change provides an opportunity to measure the degree of instability (and not just to determine whether the technique is stable or not); the formal definition of this measure is provided in the next section.
Figure 2. Illustration of stable and unstable recommendation techniques.
Understanding stability is important from both theoretical and practical perspectives. The level of instability is an inherent property of recommendation algorithms, because the measured changes in predictions are due solely to the addition of new incoming ratings that are identical to the system's previous estimations. As will be shown in the paper, different recommendation algorithms can exhibit different levels of instability, even though they may have comparable prediction accuracy. Therefore, it is important to understand the inherent behavior of computational techniques that are used for predictive purposes in real-world settings. Also, it should be of value for recommender system designers to be aware of how (un)stable various recommendation techniques are, especially since stability may have implications for users' trust and confidence in recommender systems, as discussed earlier. Therefore, we believe that stability constitutes an important complementary measure of recommender system performance.

The remainder of the paper is organized as follows. Section 2 formally defines the notion of stability and describes the framework for measuring the stability of recommendation algorithms. Section 3 describes the experimental setup (datasets, recommendation techniques, etc.) used to obtain empirical results. Section 4 reports
and discusses the results of our comprehensive empirical stability-related investigations; in particular, we investigate the stability properties of several families of popular collaborative filtering recommendation algorithms and explore a number of data-related factors and their impact on stability. Section 5 discusses some related work, and the conclusions of the study are presented in Section 6.
2. DEFINITION OF STABILITY

Before discussing the notion of stability, we briefly introduce some of the notation used in the paper. Let Users be the set of all users in a recommender system and Items be the set of all items that can be recommended. Thus, we define a recommendation space S = Users × Items. The usefulness or utility of any item i to any user u can be denoted as rui, which usually is represented by a rating (on a numeric, ordinal, or binary scale) that indicates how much a particular user likes a particular item. Thus, the job of a recommender system typically is to use the available information (such as known ratings of items already consumed by users, content attributes of items, or demographic attributes of users) to predict or estimate ratings for items that the users have not yet consumed. In other words, the goal is to construct a predictive model f: S → Rating, where Rating is a totally ordered set representing the rating scale, in order to predict the unknown ratings, i.e., ratings for items yet to be consumed by the users.

Let D be some rating dataset, i.e., a dataset representing the ratings that have been submitted by users to a recommender system. For notational simplicity, let SD (SD ⊆ S) represent the portion of recommendation space S that is covered by D. In other words, D = {(u,i,rui) | (u,i)∈SD, rui∈Rating}. The predictive model is then constructed by the recommender system using some estimation technique T (e.g., neighborhood-based collaborative filtering [Herlocker et al. 1999; Sarwar et al. 2001] or matrix factorization [Koren 2010]) to make the best predictions of unknown ratings based on the input data D. We will denote such a predictive model as fT,D. For clarity, we will use rui to denote the actual rating that user u gave to item i, and r̂ui to denote the system-predicted rating for an item i that user u has not rated before, i.e., r̂ui = fT,D(u,i).
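To make the notation concrete, the following sketch represents a rating dataset D and a predictive model fT,D in Python, assuming a simple user-mean baseline as the estimation technique T (the data and all names here are purely illustrative, not from the paper):

```python
# D maps each (user, item) pair in S_D to a known rating r_ui (1-5 scale).
D = {
    ("Alice", "Movie1"): 4, ("Alice", "Movie2"): 2, ("Alice", "Movie3"): 1,
    ("Bob", "Movie1"): 5, ("Bob", "Movie2"): 3,
}

def build_model(D):
    """Build f_{T,D}: here T is a user-mean baseline, i.e., the model
    predicts each user's average known rating for any unrated item."""
    totals, counts = {}, {}
    for (u, _), r in D.items():
        totals[u] = totals.get(u, 0) + r
        counts[u] = counts.get(u, 0) + 1
    return lambda u, i: totals[u] / counts[u]

f = build_model(D)
print(round(f("Alice", "Movie4"), 2))  # → 2.33 (Alice's mean rating)
```

Any real estimation technique T (neighborhood-based, matrix factorization, etc.) would replace the baseline inside build_model; the interface of fT,D stays the same.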
To measure the performance of recommender systems, much of the research has primarily focused on the predictive accuracy of recommendations. Predictive accuracy metrics compare the estimated ratings against actual ratings and represent the closeness of the recommender system's predictions to the users' true ratings. A number of accuracy metrics have been proposed in the recommender systems literature, with Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE) being among the most popular and widely used. MAE measures the mean absolute deviation of the predicted ratings from the actual ratings, and RMSE measures the square root of the mean squared difference between predicted and actual ratings. Mathematically, the two accuracy metrics are computed as:

MAE = (1/|S*|) ∑(u,i)∈S* | r̂ui − rui |,    RMSE = √[ (1/|S*|) ∑(u,i)∈S* ( r̂ui − rui )² ]
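A minimal sketch of these two metrics in Python (`S_star` stands for the evaluation set S*; the sample ratings are made up for illustration):

```python
import math

def mae_rmse(predicted, actual, S_star):
    """Compute MAE and RMSE of predicted vs. actual ratings over the
    evaluation pairs in S_star (a held-out subset of the rated space)."""
    errors = [predicted[(u, i)] - actual[(u, i)] for (u, i) in S_star]
    mae = sum(abs(e) for e in errors) / len(errors)
    rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
    return mae, rmse

predicted = {("u1", "i1"): 4.0, ("u1", "i2"): 3.0}
actual = {("u1", "i1"): 5.0, ("u1", "i2"): 3.0}
print(mae_rmse(predicted, actual, [("u1", "i1"), ("u1", "i2")]))
# → (0.5, 0.7071067811865476)
```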
Here S* represents a subset of SD specifically reserved for evaluation purposes (i.e., a test dataset, not used during the model building process). We use these measures to report the accuracy performance of the investigated recommendation algorithms. As discussed earlier, we define a predictive model to be stable if its predictions are consistent with each other or, in other words, if its predictions for the same items are consistent over a period of time, assuming that any new ratings that have been submitted to the recommender system over that period are in complete agreement with the system's prior predictions. Thus, stability is an inherent property of a recommendation algorithm and can be quantified by the degree of change in
predicted ratings when rating data is extended with new incoming ratings. Formally, this extension is defined as follows:
Definition 1 [fT,D−extension]. Given recommendation algorithm T, a set of known ratings D, and the resulting predictive model fT,D, dataset D' is an fT,D−extension of D if it satisfies the following two conditions: (1) D' ⊇ D; (2) for any (u,i,r)∈D'\D, we have fT,D(u,i) = r.

In other words, D' is an extension of dataset D (i.e., D' contains all the known ratings of D) with additional incoming ratings submitted to the system that are identical to the estimations made by predictive model fT,D. Different predictive techniques can exhibit different levels of instability, and this motivates a measure that quantifies the level of instability.

Definition 2 [general (in)stability measure2]. The level of instability of predictive technique T on dataset D and its fT,D−extension D' is defined as the aggregate difference between predictions fT,D(u,i) and fT,D'(u,i) over all (u,i)∈U, where U is the set of unknown ratings that need to be estimated.

In other words, instability measures the aggregate difference between the predictions made by recommendation algorithm T on the original data D and the predictions made by the same algorithm T on the extended dataset D' (D' ⊇ D). Again, the only change from D to D' is that D' has some additional ratings that were perfectly accurately predicted by T. To obtain a comprehensive view of stability, this aggregate difference in predictions is calculated across the predictions for the entire unknown portion of the rating space U = S\SD'. According to the definition, the level of instability may depend on several factors, including the type of predictive technique T, the characteristics of the initial rating dataset D, the characteristics of the extension dataset D' (i.e., a given dataset can have a huge number of possible extension datasets), and the characteristics of the evaluation dataset U. Later in the paper we provide a comprehensive empirical examination of a number of important factors and their impact on recommendation stability.

Measuring the stability of a recommendation algorithm involves comparing predictions made at several points in time.
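The two definitions can be sketched in Python as follows, assuming models are plain functions over (user, item) pairs and using the mean absolute difference as one concrete choice of aggregate difference (all names and sample values are ours, for illustration only):

```python
def extend(D, f, pairs):
    """Build an f_{T,D}-extension D' (Definition 1): D plus the model's
    own predictions f(u, i) for the new (user, item) pairs. Condition (2)
    holds by construction, since each added rating equals f(u, i)."""
    D_prime = dict(D)
    for (u, i) in pairs:
        assert (u, i) not in D          # only genuinely new ratings
        D_prime[(u, i)] = f(u, i)
    return D_prime

def instability(f_D, f_D_prime, U):
    """Aggregate difference (Definition 2) between the predictions of the
    original and retrained models over the unknown ratings U, computed
    here as a mean absolute difference."""
    return sum(abs(f_D(u, i) - f_D_prime(u, i)) for (u, i) in U) / len(U)

# Stand-in models: f1 built on D, f2 rebuilt on the extension D'.
D = {("u1", "i0"): 4.0}
f1 = lambda u, i: 3.0
D_prime = extend(D, f1, [("u1", "i1")])
f2 = lambda u, i: 3.5
print(instability(f1, f2, [("u1", "i2"), ("u1", "i3")]))  # → 0.5
```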
Following our definition, we adopt a two-phase approach to computing the stability of a recommendation algorithm, which is illustrated in Figure 3. In Phase 1, given a set of known ratings D, a predictive model fT,D is built, and predictions for all unknown ratings are made and denoted as P1, where P1(u,i) represents the system-predicted rating for user u and item i. In other words, P1(u,i) = fT,D(u,i) for all (u,i)∈S\SD. Then, a set of hypothetical (simulated) incoming ratings is added to the original set of known ratings D by assuming that the recommender system's predictions P1 were identical to users' true preferences for the unknown rating set S\SD. Therefore, in Phase 2, some subset P of predictions P1 is added as the newly incoming known ratings, and the set of known ratings becomes D' = D ∪ P. In other words, D' is constructed as an fT,D−extension of dataset D. Based on D', a second predictive model fT,D' is built using the same recommendation technique T, and predictions for the remaining unknown ratings (i.e., for S\SD') are made and denoted as P2. In other words, P2(u,i) = fT,D'(u,i) for all (u,i)∈S\SD'.
2 In this paper, stability is measured on a non-negative continuous scale, with smaller values indicating more stable recommendation algorithms and larger values indicating less stable ones. Therefore, for convenience, we also sometimes refer to this measure as instability (i.e., larger values – larger instability, smaller values – smaller instability).
Stability is then measured by comparing the two predictions, i.e., P1 and P2, to compute their mean absolute difference or root mean squared difference, which we call mean absolute shift (MAS) or root mean squared shift (RMSS), respectively. More formally, for each user-item pair (u,i) in the prediction list, the prediction difference is denoted by:
Δui = P1(u,i) − P2(u,i).

Then the above-mentioned stability measures are computed as:

MAS = (1/|P2|) ∑(u,i)∈P2 | Δui |,    RMSS = √[ (1/|P2|) ∑(u,i)∈P2 Δui² ]
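A minimal sketch of this computation, assuming P1 and P2 are dictionaries of predicted ratings keyed by (user, item) pairs and that the shift is evaluated over the Phase 2 prediction set (the sample values are illustrative):

```python
import math

def mas_rmss(P1, P2):
    """Mean absolute shift and root mean squared shift between the
    Phase 1 and Phase 2 predictions, computed over the pairs still
    unknown in Phase 2 (the keys of P2)."""
    deltas = [P1[(u, i)] - P2[(u, i)] for (u, i) in P2]
    mas = sum(abs(d) for d in deltas) / len(deltas)
    rmss = math.sqrt(sum(d * d for d in deltas) / len(deltas))
    return mas, rmss

P1 = {("u1", "i1"): 4.0, ("u1", "i2"): 1.0, ("u1", "i3"): 4.0}
P2 = {("u1", "i1"): 4.0, ("u1", "i2"): 5.0}   # i3 became a known rating
print(mas_rmss(P1, P2))  # → (2.0, 2.8284271247461903)
```

The sample mirrors the Alice scenario from the introduction: one prediction shifts from 1 to 5 after the model is retrained, yielding a large shift on both measures.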
Figure 3. Two-phase approach for measuring stability (Phase 1: known ratings D → predicted ratings P1; Phase 2: known ratings D plus new ratings P → predicted ratings P2).