<<

arXiv:1207.5649v1 [stat.ME] 24 Jul 2012 nryFuregr uH n hsiKhatri Shashi and He Yu Feuerverger, Challenge Andrey Netflix of the Significance Statistical aaaMJ31e-mail: 3M1 Ontario, Toronto, M6J Avenue, Canada Macklem 1 Empirics, Speak China 210093, Nanjing, e-mail: Road Hankou 22 Nanjing University Mathematics, of Department Student, 02 o.2,N.2 202–231 2, DOI: No. 27, Vol. 2012, Science Statistical [email protected] oot,Otro aaaMS33e-mail: 3G3 M5S Canada Toronto, Ontario, of Toronto, University Statistics, of Department

nryFuregri rfso fStatistics, of Professor is Feuerverger Andrey c ttsia Science article Statistical original the the of by reprint published electronic an is This ern iesfo h rgnli aiainand pagination in detail. original typographic the from differs reprint nttt fMteaia Statistics Mathematical of Institute 10.1214/11-STS368 [email protected] nttt fMteaia Statistics Mathematical of Institute , etitdBlzanmcie,srnae iglrvalue singular shrinkage, tion. machines, Boltzmann recomme error, restricted prediction penalization, neighbors, networks, nearest neural ense factors, Bayes, latent empirical descent, gradient freedom, ods, of degrees of number phrases: fective data. and of words set Key remarkable less this the from of mac treatment learned statistical statisti and been primarily a science a address provide computer to to is the and here by goal a our out work communities, of much carried learning dimensions Although been millions. the has the problems when into reaches arise space which of rameter issues shrinkage s mission-critical a parameter the in discuss and models also existing and compare We models, well. new as othe discussed and are methods tions models; ensemble detai network issues, neural some cross-validation and neighbor) effects, in (nearest kNN decomposi discuss, as value We well singular as web. modeling, movies; “baseline” wide to to app world give proaches commercial the will significant on have users accurately that particularly this ratings do predict can 9 to which about is is matrix goal users ratings thousand The million associated 480 the 100 some movies; involving about thousand stars) 18 of 5 to consists tho 1 set (from and filteri ratings data efforts, collaborative The own of systems. our problems ommender learned—from the been others—concerning has of what of overview Abstract. 02 o.2,N.2 202–231 2, No. 27, Vol. 2012, uH sa Undergraduate an is He Yu . hsiKar sDirector, is Khatri Shashi . [email protected] 2012 , nprdb h eayo h eflxcnet epoiean provide we contest, Netflix the of legacy the by Inspired olbrtv leig rs-aiain ef- cross-validation, filtering, Collaborative This . . in 1 uso n vriwfo primarily a overview—from as and known cussion statisti- context of problem a filtering a in was modeling, contest this cal of heart com- the communities— the intelligence artificial within and many science from puter from notably attention most attracted quarters—and specification. contest within predictive this to defined data While precisely this certain modeling team a in or person succeed the could to Grand dollars who US a million offered one re- and of data, publicly Prize of 2006, set remarkable Los 2, a of October leased Inc. on Netflix California, community, Gatos, research the to tion nwa undott ea naubecontribu- invaluable an be to out turned what In .ITOUTO N SUMMARY AND INTRODUCTION 1. u oli hsppri opoieadis- a provide to is paper this in goal Our . eflxcontest, Netflix drsystems, nder n hthave that ons a audience, cal penalization in(SVD), tion considera- r gadrec- and ng bemeth- mble decomposi- %sparse. 9% n some and temporal lications, ac for earch systems nsuch on movie ,ap- l, hine pa- se collaborative statistical 2 A. FEUERVERGER, Y. HE AND S. KHATRI viewpoint—of some of the lessons for statistics which called Cinematch, which is known to be based on emerged from this contest and its data set. 
This computer-intensive but “straightforward linear sta- vantage will also allow us to search for alternative tistical models with a lot of data conditioning” is approaches for analyzing such data (while noting known to attain (after fitting on the training data) some open problems), as well as to attempt to un- an RMSE of either 0.9514 or 0.9525, on the quiz derstand the commonalities and interplay among and test sets, respectively. (See Bennett and Lan- the various methods that key contestants have pro- ning, 2007.) These values represent, approximately, 1 posed. a 9 2 % improvement over the naive movie-average Netflix, the world’s largest internet-based movie predictor. The contest’s Grand Prize of one million rental company, maintains a data base of ratings US dollars was offered to anyone who could first2 im- their users have assigned (from 1 “star” to 5 “stars”) prove the predictions so as to attain an RMSE value to movies they have seen. The intended use of this of not more than 90% of 0.9525, namely, 0.8572, or data is toward producing a system for recommend- better, on the test set. ing movies to users based on predicting how much The Netflix contest began on Oct 2, 2006, and was someone is going to like or dislike any particular mov- to run until at least Oct 2, 2011, or until the Grand ie. Such predictions can be carried out using infor- Prize was awarded. More than 50,000 contestants mation on how much a user liked or disliked other internationally participated in this contest. Yearly movies they have rated, together with information Progress Prizes of $50,000 US were offered for the on how much other users liked or disliked those same, best improvement of at least 1% over the previous as well as other, movies. Such recommender sys- year’s result. The Progress Prizes for 2007 and 2008 tems, when sufficiently accurate, have considerable were won, respectively, by teams named “BellKor” commercial value, particularly in the context of the and “BellKor in BigChaos.” Finally, on July 26, world wide web. 2009, the Grand Prize winning entry was submitted The precise specifications of the Netflix data are by the “BellKor’s Pragmatic Chaos” team, attain- a bit involved, and we postpone our description of ing RMSE values of 0.8554 and 0.8567 on the quiz it to Section 2. Briefly, however, the training data and test sets, respectively, with the latter value rep- consists of some 100 million ratings made by approx- resenting a 10.06% improvement over the contest’s imately 480,000 users, and involving some 18,000 baseline. Twenty minutes after that submission (and movies. (The corresponding “matrix” of user-by- in accordance with the Last Call rules of the con- movie ratings is thus almost 99% sparse.) A subset test) a competing submission was made by “The of about 1.5 million ratings of the training set, called Ensemble”—an amalgamation of many teams—who the probe subset, was identified. A further data set, attained an RMSE value of 0.8553 on the quiz set, called the qualifying data was also supplied; it was and an RMSE value of 0.8567 on the test set. To divided into two approximately equal halves, called two additional decimal places, the RMSE values at- the quiz and test subsets, each consisting of about tained on the test set were 0.856704 by the winners, 1.5 million cases, but with the ratings withheld. The and 0.856714 by the runners up. 
Since the contest probe, quiz and test sets were constructed to have rules were based on test set RMSE, and also were similar statistical properties.1 limited to four decimal places, these two submissions The Netflix contest was based on a root mean were in fact a tie. It is therefore the order in which squared error (RMSE) criterion applied to the three these submissions were received that determined the million predictions required for the qualifying data. winner; following the rules, the prize went to the ear- If one naively uses the overall average rating for each lier submission. Fearing legal consequences, a sec- movie on the training data (with the probe subset ond and related contest which Netflix had planned removed) to make the predictions, then the RMSE to hold was canceled when it was pointed out by attained is either 1.0104, 1.0528 or 1.0540, respec- a probabilistic argument (see Narayanan and Shma- tively, depending on whether it is evaluated in sam- tikov, 2008) that, in spite of the precautions taken ple (i.e., on the training set), on the probe set or to preserve anonymity, it might theoretically be pos- on the quiz set. Netflix’s own , sible to identify some users on the basis of the seem- ingly limited information in the data. 1Readers unfamiliar with the Netflix contest may find it helpful to consult the more detailed description of the data 2Strictly, our use of “first” here is slightly inaccurate owing given in Section 2. to a Last Call rule of the competition. CHALLENGE 3

Of course, nonrepeatability, and other vagaries of A fundamentally different paradigm is based on neu- ratings by humans, itself imposes some lower bound ral networks—in particular, the restricted Boltzman on the accuracy that can be expected from any rec- machines (RBM)—which we describe in Section 6. ommender system, regardless of how ingenious it Last of these stand-alone methods are the nearest may be. It now appears that the 10% improvement neighbor or kNN methods which are the subject of Netflix required to win the contest is close to the Section 7. best that can be attained for this data. It seems Most of the methods that have been devised for fair to say that Netflix technical staff possessed “for- collaborative filtering involve parameterizations of tuitous insight” in setting the contest bar precisely very high dimension. Furthermore, many of the mod- where it did (i.e., 0.8572 RMSE); they also were well els are based on subtle and substantive contextual aware that this goal, even if attainable, would not insight. This leads us, in Section 8, to undertake be easy to achieve. a discussion of the issues involved in dimension re- The Netflix contest has come and gone; in this duction, specifically penalization and parameter story, significant contributions were made by Yehuda shrinkage. In Section 9 we digress briefly to describe Koren and by “BellKor” (R. Bell, Y. Koren, C. Volin- certain temporal issues that arise, but we return sky), “BigChaos” (M. Jahrer, A. Toscher), larger to our main discussion in Section 10 where, after teams called “The Ensemble” and “Grand Prize,” exploring their comparative properties, and taking “Gravity” (G. Takacs, I. Pilaszy, B. Nemeth, stock of the lessons learned from the ANOVA, SVD, D. Tikk), “ML@UToronto” (G. Hinton, A. Mnih, RBM and kNN models, we speculate on the requi- R. Salakhutdinov; “ML” stands for “machine learn- site characteristics of effective models as we search ing”), lone contestant Arkadiusz Paterek, “Pragmat- for new model classes. ic Theory” (M. Chabbert, M. Piotte) and many oth- In response to the Netflix challenge, subtle, new ers. Noteworthy of the contest was the craftiness and imaginative models were proposed by many con- of some participants, and the open collaboration of test participants. A selection of those ideas is sum- others. Among such stories, one that stands out is marized in Section 11. At the end, however, winning that of Brandyn Webb, a “cybernetic epistemolo- the actual contest proved not to be possible with- gist” having the alias Simon Funk (see Piatetsky, out the use of many hybrid models, and without 2007). He was the first to publicly reveal use of the combining the results from many prediction meth- SVD model together with a simple algorithm for its ods. This ensemble aspect of combining many proce- implementation that allowed him to attain a good dures is discussed briefly in Section 11. Significant early score in the contest (0.8914 on the quiz set). computational issues are involved in a data set of He also maintains an engaging website at http:// this magnitude; some numerical issues are described sifter.org/~simon/journal. briefly in Section 13. Finally, in Section 14, we sum- Although inspired by it, our emphasis in this pa- marize some of the statistical lessons learned, and per is not on the contest itself, but on the funda- briefly note a few open problems. 
In large part be- mentally different individual techniques which con- cause of the Netflix contest, the research literature tribute to effective collaborative filtering systems on such problems is now sufficiently extensive that and, in particular, on the statistical ideas which un- a complete listing is not feasible; however, we do derpin them. Thus, in Section 2, we first provide include a broadly representative bibliography. For a careful description of the Netflix data, as well as earlier background and reviews, see, for example, a number of graphical displays. In Section 3 we es- ACM SIGKDD (2007), Adomavicius and Tuzhilin tablish the notation we will use consistently through- (2005), Bell et al. (2009), Hill et al. (1995), Hoffman out, and also include a table summarizing the perfor- (2001b), Marlin (2004), Netflix (2006/2010), Park mance of many of the methods discussed. Sections 4, and Pennock (2007), Pu et al. (2008), Resnick and 5, 6 and 7 then focus on four key “stand-alone” tech- Varian (1997) and Tuzhilin at al. (2008). niques applicable to the Netflix data. Specifically, in Section 4 we discuss ANOVA techniques which 2. THE NETFLIX DATA provide a baseline for most other methods. In Sec- In this section we provide a more detailed overview tion 5 we discuss the singular value decomposition of the Netflix data; these in fact consist of two key or SVD (also known as the latent factor model, or components, namely, a training set and a qualify- matrix factorization) which is arguably the most ef- ing set. The qualifying data set itself consists of fective single procedure for collaborative filtering. two halves, called the quiz set and the test set; fur- 4 A. FEUERVERGER, Y. HE AND S. KHATRI thermore, a particular subset of the training set, neutral perturbations (such as deletions, changes of called the probe set, was identified. The quiz, test dates and/or ratings) to try to protect the confiden- and probe subsets were produced by a random three tiality and proprietary nature of its client base. way split of a certain collection of data, and so were In our discussions, the term “training set” will intended to have identical statistical properties. generally refer to the training data, but with the The main component of the Netflix data—namely, probe subset removed; this terminology is in line the “training” set—can be thought of as a matrix of with common usage when a subset is held out dur- ratings consisting of 480,189 rows, corresponding to ing a statistical fitting process. Of course, for pro- randomly selected anonymous users from Netflix’s ducing predictions to submit to Netflix, contestants customer base, and 17,770 columns, corresponding would normally retrain their algorithms on the full to movie titles. This matrix is 98.8% sparse; out of training set (i.e., with probe subset included). As the a possible 480,189 17,770 = 8,532,958,530 entries, subject of our paper is more concerned with collabo- × only 100,480,507 ratings are actually available. Each rative filtering generally, rather than with the actual such rating is an integer value (a number of “stars”) contest, we will make only limited reference to the between 1 (worst) and 5 (best). The data were col- qualifying data set, and mainly in our discussion on lected between October 1998 and December 2005, implicit data in Section 11, or when indicating cer- and reflect the distribution of all ratings received tain scores that contestants achieved. by Netflix during that period. 
It is known that Net- Finally, we mention that Netflix also provided the flix’s own database consisted of over 1.9 billion rat- titles, as well as the release years, for all of the ings, on over 85,000 movies, from over 11.7 million movies. In producing predictions for its internal use, subscribers; see Bennett and Lanning (2007). Netflix’s Cinematch algorithm does make use of other In addition to the training set, a qualifying data data sources, such as (presumably) geographical, or set consisting of 2,817,131 user–movie pairs was also other information about its customers, and this al- provided, but with the ratings withheld. It consists of lows it to achieve substantial improvements in RMSE. two halves: a quiz set, consisting of 1,408,342 user– However, it is known that Cinematch does not use movie pairs, and a test set, consisting of 1,408,789 names of the movies, or dates of ratings. In any pairs; these subsets were not identified. Contestants case, to produce the RMSE values on which the con- were required to submit predicted ratings for the en- test was to be based, Cinematch was trained with- tire qualifying set. To provide feedback to all partici- out any such other data. Nevertheless, no restric- pants, each time a contestant submitted a set of pre- tions were placed on contestants from using external dictions Netflix made public the RMSE value they sources, as, for instance, other databases pertaining attained on a web-based Leaderboard, but only for to movies. Interestingly however, none of the top the quiz subset. Prizes, however, were to be awarded contestants made use of any such auxiliary informa- on the basis of RMSE values attained on the test sub- tion.3 set. The purpose of this was to prevent contestants Figures 1 through 6 provide some visualizations of from tuning their algorithms on the “answer oracle.” the data. Figure 1 gives histograms for the ratings Netflix also provided the dates on which each of in the training set, and in the probe set. Reflecting the ratings in the data sets were made. The rea- temporal effects to be discussed in Section 9 (but see son for this is that Netflix is more interested in also Figure 5), the overall mean rating, 3.6736, of predicting future ratings than in explaining those the probe set is significantly higher than the overall of the past. Consequently, the qualifying data set mean, 3.6033, of the training set. Figures 2 and 3 had been selected from among the most recent rat- are plots (in lieu of histograms) of the cumulative ings that were made. However, to allow contestants number of ratings in the training, probe and qual- to understand the sampling characteristics of the ifying sets. Figure 2 is cumulative by movies (hori- qualifying data set, Netflix identified the probe sub- zontal axis, and sorted from most to least rated in set of 1,408,395 user-movie pairs within the training the training set), while Figure 3 is cumulative by set (and hence with known ratings), whose distribu- tional properties were meant to match those of the 3 qualifying data set. (The quiz, test and probe sub- This is not to say they did not try. But perhaps surpris- ingly—with the possible exception of a more specific release sets were produced from the random three-way split date—such auxiliary data did not improve RMSE. One pos- already mentioned.) 
As a final point, prior to re- sible explanation for this is that the Netflix data set is large leasing their data, Netflix applied some statistically enough to proxy such auxiliary information internally. NETFLIX CHALLENGE 5

Fig. 1. Frequency histograms for ratings in the training set (white) and probe set (black). Fig. 3. Cumulative proportion of ratings, by users, for the training, probe and qualifying sets. Users are on the horizontal axis, sorted by number of movies rated (from most to least).

Fig. 2. Cumulative proportion of ratings, by movies, for the training, probe and qualifying sets. Movies are on the hori- Fig. 4. Histograms for mean movie ratings (bars) and mean zontal axis, sorted from most to least rated. user ratings (line with points) in the training set. users. The steep rise in Figure 2 indicates, for in- In Figure 3, the considerable mismatch between the stance, that the 100 and 1000 most rated movies curve for the training data, with the curves for the account for over 14.3% and 62.5% of the ratings, probe and qualifying sets which match closely, re- respectively. In fact, the most rated4 movie (Miss flects the fact that the representation of user view- Congeniality) was rated by almost half the users in ership in the training set is markedly different from the training set, while the least rated was rated only that of the cases for which predictions are required; 3 times. Figure 2 also evidences a slight—although clearly, Netflix constructed the qualifying data to statistically significant—difference in the profiles for have a much more uniform distribution of user view- the training and the qualifying (and probe) data. ership. Figure 4 provides histograms for the movie mean 4We remark here that rented movies can be rated without ratings and the user mean ratings. The reason for having been watched. the evident mismatch between these two histograms 6 A. FEUERVERGER, Y. HE AND S. KHATRI

Fig. 5. Temporal effects: Histograms for the dates (quar- terly) of ratings in the training set (white bars) and qualifying Fig. 6. Mean movie ratings versus (logarithm of) number of set (inlaid black bars). The line with points shows quarterly ratings (for a random subset of movies). mean ratings for the training data (scale on right). is that the best rated movies were watched by dis- racies with which user and movie parameters can proportionately large numbers of users. Finally, Fig- be estimated, a problem which is particularly severe ure 5 exemplifies some noteworthy temporal effects for the users. Such extreme variation is among the in the data. Histograms are shown for the number features of typical collaborative filtering data which of ratings, quarterly, in the training and in the qual- complicate their analyses. ifying data sets, with the dates in the qualifying set being much later than the dates in the training set. 3. NOTATION AND A SUMMARY TABLE Figure 5 also shows a graph of the quarterly mean In this section we establish the notation we will ratings for the training data (with the scale being adhere to in our discussions throughout this paper. at the right). Clearly, a significant rise in mean rat- We also include a table, which will be referred to in ing occurred starting around the beginning of 2004. the sequel, of RMSE performance for many of the Whether this occurred due to changes in the ratings fitting methods we will discuss. definitions, to the introduction of a recommender Turning to notation, we will let i = 1, 2,...,I range system, to a change in customer profile, or due to over the users (or their indices) and j = 1, 2,...,J some other reason, is not known. range over the movies (or their indices). For the Net- Finally, Figure 6 plots the mean movie ratings flix training data, I = 480,189 and J = 17,770. Next, against the (log) number of ratings. (Only a ran- we will let J(i) be the set of movies rated by user i dom sample of movies is used so as not to clutter and I(j) be the set of users who rated movie j. The the plot.) The more rated movies do tend to have cardinalities of these sets will be denoted variously the higher mean ratings but with some notable ex- as Ji J(i) and Ij I(j) . We shall also use the ceptions, particularly among the less rated movies notation≡ | for| the set≡ |of all| user-movie pairs (i, j) which can sometimes have very high mean ratings. whose ratingsC are given. Denoting the total number As a general comment, the layout of the Netflix of user-movie ratings in the training set by N, note data contains enormous variation. While the average that N = = I J = J I . The ratings are |C| i=1 i j=1 j number of ratings per user is 209, and the average made on an ordered scale (such scales are known number of ratings per movie is 5654.5 (over the en- as “Likert scales”)P and areP coded as integers hav- tire training set), five users rated over 10,000 movies ing values k = 1, 2,...,K; for Netflix, K =5. The each, while many rated fewer than 5 movies. Like- actual ratings themselves, for (i, j) , will be de- ∈C wise, while some movies were rated tens of thou- noted by ri,j. Averages of ri,j over i I(j), over sands of times, most were rated fewer than 1000 j J(i), or over (i.e., over movies,∈ or users, or times, and many less than 200 times. The extent of over∈ the entire trainingC data set) will be denoted this variation implies large differences in the accu- by r·,j, ri,· and r·,· respectively. Estimated values NETFLIX CHALLENGE 7

Table 1 RMSE values attained by various methods

Predictivemodel RMSE Remarksandreferences

rˆi,j = µ 1.1296 RMSE on probe set, using mean of training set rˆi,j = αi 1.0688 Predict by user’s training mean, on probe set rˆi,j = βj 1.0528 Predict by movie’s training mean, on probe set rˆi,j = µ + αi + βj , naive 0.9945 Two-way ANOVA, no iteraction rˆi,j = µ + αi + βj 0.9841 Two-way ANOVA, no iteraction “Global effects” 0.9657 Bell and Koren (2007a, 2007b, 2007c) Cinematch, on quiz set 0.9514 As reported by Netflix Cinematch, on test set 0.9525 Target is to beat this by 10% kNN 0.9174 Bell and Koren (2007a, 2007b, 2007c) “Global” + SVD 0.9167 Bell and Koren (2007a, 2007b, 2007c) SVD 0.9167 Bell and Koren (2007a, 2007b, 2007c), on probe set “Global” + SVD + “joint kNN” 0.9071 Bell and Koren (2007a, 2007b, 2007c), on probe set “Global” + SVD + “joint kNN” 0.8982 Bell and Koren (2007a, 2007b, 2007c), on quiz set Simon Funk 0.8914 An early submission; Leaderboard TemporalDynamics + SVD++ 0.8799 Koren (2009) Arkadiusz Paterek’s best score 0.8789 An ensemble of many methods; Leaderboard ML Team: RBM + SVD 0.8787 See Section 6; Leaderboard Gravity’s best score 0.8743 November 2007; Leaderboard Progress Prize, 2007, quiz 0.8712 Bell, Koren and Volinsky (2007a, 2007b, 2007c) Progress Prize, 2007, test 0.8723 As above, but on the test set Progress Prize, 2008, quiz 0.8616 Bell, Koren and Volinsky (2008), Toscher and Jahrer (2008) Progress Prize, 2008, test 0.8627 As above, but on the test set Grand Prize, target 0.8572 10 % below Cinematch’s RMSE on test set Grand Prize, runner up 0.8553 The Ensemble, 20 minutes too late; on quiz set Grand Prize, runner up 0.8567 As above, but on the test set Grand Prize, winner 0.8554 BellKor + BigChaos + PragmaticTheory, on quiz set Grand Prize, winner 0.8567 As above, but on the test set

Selected RMSE values, compiled from various sources. Except as noted, RMSE values shown are either for the probe set after fitting on the training data with the probe set held out, or for the quiz set (typically from the Netflix Leaderboard) after fitting on the training data with the probe set included.

are denoted by “hats” as inr ˆi,j, which may refer to training set; but exceptions to this are noted. Ref- the fitted value from a model when (i, j) , or to erences to the “Leaderboard” refer to performance a predicted value otherwise. Many of the procedures∈C on the quiz set publicly released by Neflix. We refer we discuss are typically fitted to the residuals from to Table 1 in our subsequent discussions. a baseline fit such as an ANOVA; where this causes no confusion, we continue using the notations ri,j 4. ANOVA BASELINES andr ˆ in referring to such residuals. Some proce- i,j ANOVA methods furnish baselines for many anal- dures, however, involve both the original ratings as yses. One basic approach—referred to as preprocess- well as their residuals from other fits; in such cases, ing—involves first removing global effects such as the residuals are denoted as e ande ˆ . Finally, the i,j i,j user and movie means, and using the residuals as in- notation I(j, j′) will refer to the set of all users who put to subsequent models. Alternatively, such “row” saw both movies j and j′, and J(i, i′) will refer to the and “column” effects can be incorporated directly set of movies that were seen by both users i and i′. into those models where they are sometimes referred Finally, we also include, in this section, a table which provides a summary, compiled from multiple to as biases. In any case, most models work best sources, of the RMSE values attained by many of the when global effects are explicitly accounted for. In methods discussed in this paper. The RMSE values this section we discuss minimizing the sum of squared shown in Table 1 are typically for the probe set, after errors criterion 2 fitting on the remainder of the training set; or where (4.1) (ri,j rˆi,j) known, on the quiz set, after fitting on the entire − X(i,j) X∈C 8 A. FEUERVERGER, Y. HE AND S. KHATRI using various ANOVA methods for the predictions We now consider two-factor models of the form rˆ of the user-movie ratings r . Due to the large i,j i,j (4.6) rˆ = µ + α + β . number of parameters, regularization (i.e., penaliza- i,j i j tion) would normally be used, but we reserve our Identifiability conditions, such as i αi =0 and discussions of regularization issues to Section 8. βj = 0, would normally be imposed, although j P We first note that the best fitting model of the they become unnecessary under typical regulariza- P form tion. If we were to proceed as in a balanced two-way layout (i.e., with no ratings missing), then we would (4.2) rˆ µ for all i, j, i,j ≡ first estimate µ as the mean of all available ratings; obtained by setting µ = 3.6033, the mean of all the values of αi and βj would then be estimated as user-movie ratings in the training set (with probe the row and column means, over the available rat- removed), results in an RMSE on the training set ings, after µ has been subtracted throughout. Doing equal to its standard deviation 1.0846; on the probe this results in RMSE values of 0.9244 and 0.9945 set, using this same µ results in an RMSE of 1.1296, on the training and probe sets. If we proceed se- although the actual mean and standard deviation quentially, the order of the operations for estimat- 5 for the probe set are 3.6736 and 1.1274. 
ing the αi’s and the βj ’s will matter: If we estimate Next, if we predict each rating by the mean rating the αi’s first and subtract their effects before esti- for that user on the training set, thus fitting the mating the βj ’s, the result will not be the same as model first estimating the βj’s and subtracting their effect before estimating the α ’s; these procedures result, (4.3) rˆ = µ + α , i i,j i respectively, in RMSE values of 0.9218 and 0.9177 we obtain an RMSE of 0.9923 on the training set, on the training set. and 1.0688 using the same values on the probe. If, The layout for the Netflix data is unbalanced, with instead, we predict each rating by the mean for that the vast majority of user-movie pairings not rated; movie, thus fitting we therefore seek to minimize 2 (4.4) rˆi,j = µ + βj, (4.7) (r µ α β ) i,j − − i − j we obtain RMSE values 1.0104 and 1.0528 on the (i,jX)∈C training and probe sets, respectively. The solutions over the training set. This quadratic criterion is con- for (4.2)–(4.4) are just the least squares fits associ- vex, however, standard methods for solving the “nor- ated with mal” equations, obtained by setting derivatives with (r µ)2, respect to µ, αi and βj to zero, involve matrix in- i,j − versions which are not feasible over such high di- (i,j)∈C X X mensions. The optimization of (4.7) may, however, (4.5) (r µ α )2 and be carried out using either an EM or a gradient de- i,j − − i (i,j)∈C scent algorithm. When no penalization is imposed, X X minimizing (4.7) results in an RMSE value of 0.9161 (r µ β )2, on the training set, and 0.9841 on the probe subset. i,j − − j (i,j)∈C A consideration when fitting (4.6) as well as other X X models is that some predicted ratingsr ˆ can fall respectively, where is the set of indices (i, j) over i,j outside the [1, 5] range. This can occur when highly the training set. HistogramsC of the user and movie rated movies are rated by users prone to giving high means were given in Figure 4; we note, for later use, ratings, or when poorly rated movies are rated by that the the variances of the user and movie means users prone to giving low ratings. Under optimiza- on the test set (with probe removed) are 0.23074 tion of (4.7) over the test set, approximately 5.1 mil- and 0.27630, corresponding to standard deviations lionr ˆ estimates fall below 1, and 19.4 million fall of 0.48035 and 0.52564, respectively.6 i,j above 5. Although we may Winsorize (clip) theser ˆi,j

5 to lie in [1, 5], clipping in advance need not be op- Note that the difference between the squares of the probe’s timal when residuals from a baseline fit are input 1.1296 and 1.1274 RMSE values must equal the squared dif- ference between the two means, 3.6736 and 3.6033. to other procedures. We do not consider here the 6These values are useful for assessing regularization issues; problem of minimizing (4.7) when µ + αi + βj there see Section 8. is replaced by a Winsorized version. NETFLIX CHALLENGE 9

′ Of course, not all is well here. The differences in columns Vj are orthonormal eigenvectors of A A, RMSE values between the training and the probe and where D is an m n “diagonal” matrix whose sets reflect temporal effects, some of which were al- diagonal entries may be× taken as the descending or- ready noted. Furthermore, these models have pa- der nonnegative values rameterizations of high-dimensions and have there- λ =+ eigval AA′ fore been overfit, resulting in inferior predictions. j { }j These issues will be dealt with in Sections 8 and 9. q Finally, we remark that interaction terms can be =+ eigval A′A , j = 1, 2,..., min(m,n), { }j added to (4.6). The standard approachr ˆ = µ + i,j called theq singular values of A. The columns of V α + β + γ will not be effective, although it could i j i,j and U provide natural bases for inputs to and out- possibly be combined with regularization. Alterna- puts from the linear transformation A. In partic- tively, interactions could be based on user movie ′ × ular, AVj = λjUj, A Uj = λjVj, so given an input groupings via “many to one” functions a(i) and b(j), n and models such as B = j=1 cjVj , the corresponding output is AB = min(m,n) λjcjUj . (4.8) rˆi,j = µ + αi + βj + γa(i),b(j). j=1P Given an SVD of A, the Eckart–Young Theorem P There are many possibilities for defining such groups; states that, for a given k< min(m,n), the best rank k for example, the covariates discussed in Section 11 reconstruction of A, in the sense of minimizing the or nearest neighbor methods (kNN) can be used Frobenius norm of the difference, is U (k)D(k)(V (k))′, to construct suitable a(i) and b(j). Some further where U (k) is the m k matrix formed from the interaction-type ANOVA models are considered in first k columns of U, V×(k) is the n k matrix formed Section 10. by the first k columns of V , and×D(k) is the upper 5. SVD METHODS left k k block of D. This reconstruction may be expressed× in the form F G′ where F is m k and G In statistics, the singular value decomposition is k n; the reconstruction is thus formed× from the (SVD) is best known for its connection to principal inner× products between the k-vectors comprising F ′ components: If X = (X1, X2,...,Xn) is a random with those comprising G. These k-vectors may be vector of means 0, and n n covariance matrix Σ, × thought of as associated, respectively, with the rows then one may represent Σ as a linear combination and the columns of A, and (in applications) the com- of mutually orthogonal rank 1 matrices, as in ponents of these vectors are often referred to as fea- n tures. A numerical consequence of the Eckart–Young ′ Σ= λjPjPj, Theorem is that “best” rank k approximations can j=1 be determined iteratively: given a best rank k 1 ap- X − where λ λ λ 0 are ordered eigenvalues proximation, F G′, say, a best rank k approximation 1 ≥ 2 ≥···≥ n ≥ of Σ, and Pj corresponding orthonormal (column) is obtained by attaching a column vector to each eigenvectors. The principal components are the ran- of F and G which provide a best fit to the residual dom variables P ′X. Less commonly known is that matrix A F G′. SVD algorithms can therefore be j − the n n matrix (k) = k λ P P ′ gives the best quite straightforward. Here, however, we are specif- × T j=1 j j j rank k reconstruction of Σ, in the sense of minimiz- ically concerned with algorithms applicable to ma- ing the Frobenius norm PΣ (k) , defined as the trices which are sparse. We briefly discuss two such square root of the sum ofk the−T squaresk of its entries. 
algorithms, useful in collaborative filtering, namely, These results generalize. If A is an arbitrary real- alternating least squares (ALS) and gradient de- valued m n matrix, its singular value decomposi- scent. Some relevant references are Bell and Ko- tion is given× by7 ren (2007c), Bell, Koren and Volinsky (2007a), Funk ′ (2006/2007), Koren, Bell and Volinsky (2009), Raiko, A = UDV , Ilin and Karhunen (2007), Srebro and Jaakkola (2003), where U = (U1,U2,...,Um) is an m m matrix whose Srebro, Rennie and Jaakkola (2005), Takacs et al. × ′ columns Uj are orthonormal eigenvectors of AA , (2007, 2008a, 2008b, 2008c), Wu (2007) and Zhou where V = (V1,V2,...,Vn) is an n n matrix whose et al. (2008). See also Hofmann (2001a, 2004), Hof- × mann and Puzicha (1999), Kim and Yum (2005), 7If A is complex-valued, these relations still hold, with con- Marlin and Zemel (2004), Rennie and Srebro (2005), jugate transposes replacing transposes. Sali (2008) and Zou et al. (2006). 10 A. FEUERVERGER, Y. HE AND S. KHATRI

The alternating least squares (ALS) method for with respect to the ui,k; these are just ridge regres- determining the best rank p reconstruction involves sion problems.10 expressing the summation in the objective function ALS can also be performed one feature at a time, in two ways: with the advantage of yielding factors in descending p 2 order of importance. To do this, we initialize as be- fore, and again arrange the order of summation in r u v i,j − i,k j,k the objective function in two different ways; for the C k=1 ! X X X first feature, this is J p 2 J (5.1) = ri,j ui,kvj,k 2 − (ri,j ui,1vj,1) j=1 i∈I(j) k=1 ! − X X X j=1 i∈I(j) X X  I p 2 (5.5) I = ri,j ui,kvj,k . 2 − = (ri,j ui,1vj,1) . i=1 j∈J(i) k=1 ! − X X X i=1 Xj∈XJ(i)  The u may be initialized using small independent i,k We then iterate between the least squares problems normal variables, say. Then, for each fixed j, we of the inner sums in (5.5), namely, carry out the least squares fit for the vj,k based on 2 the inner sum in the middle expression of (5.1). And (5.6) vˆj,1 = ui,1ri,j ui,1 then, for each fixed i, we carry out the least squares i∈I(j) . i∈I(j) fit for u based on the inner sum of the last ex- X X i,k for all j, and then pressions in (5.1). This procedure is iterated until 2 convergence; several dozen iterations typically suf- (5.7) uˆi,1 = vj,1ri,j vj,1 fice. j∈J(i) . j∈J(i) ALS for SVD with regularization8 proceeds simi- X X for all i, until convergence. After k 1 features have larly. For example, minimizing9 been fit, we compute the residuals− p 2 k−1 ri,j ui,kvj,k − ei,j = ri,j ui,ℓvj,ℓ C k=1 ! − (5.2) X X X Xℓ=1 I J and replace (5.6) and (5.7) by 2 2 + λ1 ui + λ2 vj k k k k vˆ = u e u2 Xi=1 Xj=1 j,k i,k i,j i,k leads to iterations which alternate between minimiz- i∈XI(j) . i∈XI(j) ing and 2 2 p uˆi,k = vj,kei,j vj,k, 2 (5.3) ri,j ui,kvj,k + λ1 vj j∈XJ(i) . j∈XJ(i) − ! k k i∈XI(j) Xk=1 ranging over all j and all i, respectively. with respect to the vj,k, and then minimizing Regularization in one-feature-at-a-time ALS can be effected in several ways. Bell, Koren and Volinsky p 2 2 (2007a) shrink the residuals ei,j via (5.4) ri,j ui,kvj,k + λ2 ui − k k ni,j j∈J(i) k=1 ! ei,j ei,j, X X ← ni,j + λk

8Although we prefer to postpone discussion of regulariza- 10Some contestants preferred the regularization tion to the unified treatment attempted in Section 8, it is convenient to lay out those SVD equations here. p 2 9 2 2 We prefer not to set λ1 = λ2 at the outset for reasons of ri,j − ui,kvj,k + λ(kuik + kvj k ) C " k ! # conceptual clarity; see Section 8. In fact, because a constant XX X=1 may pass freely between user and movie features, generality is instead of (5.2), which changes the λ1 and λ2 in (5.3) and(5.4) not lost by taking λ1 = λ2. Generality is lost, however, when into Ij λ and Jiλ, respectively. In Section 8 we argue that this these values are held constant across all features; see Section 8. modification is not theoretically optimal. NETFLIX CHALLENGE 11 where ni,j = min(Ij, Ji) measures the “support” the update equations become for ri,j, and they increase the shrinkage parame- new old old λ old ter λk with each feature k. Alternately, one could ui,k = ui,k + η 2ei,jvj,k ui,k and − Ji add a regularization term (5.10)   λ ( u 2 + v 2) new old old λ old k k k vj,k = vj,k + η 2ei,jui,k vj,k . k k k k − Ij when fitting the kth feature, choosing the λk by   cross-validation. We note that, as in ALS, there are other ways to Finally, we consider gradient descent approaches sequence the updating steps in gradient descent. Si- for fitting SVD models. For an SVD of dimension p, mon Funk (2006/2007), for instance, trained the fea- tures one at a time. To train the kth feature, one ini- say, we first initialize all ui,k and vj,k in tializes the ui,k and vj,k randomly, and then loops 2 p over all (i, j) , updating the kth feature for all r u v . ∈C i,j − i,k j,k users and all movies. The updating equations are as C k=1 ! X X X before [e.g., (5.10)] except based on residuals ei,j = Then write k ri,j ℓ=1 ui,ℓvj,ℓ. After convergence, one proceeds p to the− next feature. e = r u v , P i,j i,j − i,k j,k We remark that sparse SVD problems are known Xk=1 to be nonconvex and to have multiple local min- and note that ima; see, for example, Srebro and Jaakkola (2003). 2 Nevertheless, starting from different initial condi- ∂ei,j = 2e v tions, we found that SVD seldom settled into en- ∂u i,j j,k i,k − tirely unsatisfactory minima, although the minima and attained did vary slightly. The magnitude of these 2 differences was commensurate with the variation in- ∂ei,j = 2ei,jui,k. herent among the options available for regulariza- ∂v − j,k tion. We also found that averaging the results from Updating can then be done locally following the neg- several SVD fits started at different initial condi- ative gradients: tions could lead to better results than a single SVD new old old fit of a higher dimension. On this point, see also Wu ui,k = ui,k + 2ηei,j vj,k and (5.8) (2007). Finally, we note the recent surge of work on new old old vj,k = vj,k + 2ηei,j ui,k , a problem referred to as matrix completion; see, for where the learning rate η controls for overshoot. For example, Candes and Plan (2009). a given (i, j) , these equations are used to update ∈C 6. NEURAL NETWORKS AND RBMS the ui,k and vj,k for all k; we then cycle over the (i, j) until convergence. If we regularize11 the A restricted Boltzman machine (RBM) is a neu- problem,∈C as in ral network consisting of one layer of visible units, p 2 and one layer of invisible ones; there are no con- nections within r u v between units either of these lay- i,j − i,k j,k ers, but all units of one layer are connected to all C k=1 ! 
(5.9) X X X units of the other layer. To be an RBM, these con- nections must be bidirectional and symmetric; some + λ u 2 + v 2 , k ik k jk definitions require that the units only take on bi- Xi Xj  nary values, but this restriction is unnecessary. We remark that the symmetry condition is only needed 11 If, instead of (5.9), we regularized as so as to simplify the training process. See Figure 7; ′ 2 2 2 [(ri,j − uivj ) + λ(kuik + kvj k )], additional clarification will emerge from the discus- C XX sion below. The name for these networks derives then the gradient descent update equations (5.10) become from the fact that their governing probability dis- new old old old tributions are analogous to the Boltzman distribu- ui,k = ui,k + η(2ei,j vj,k − λui,k ) and

new old old old tions which arise in statistical mechanics. For fur- vj,k = vj,k + η(2ei,j ui,k − λvj,k ). ther background, see, for example, Hertz, Krogh and 12 A. FEUERVERGER, Y. HE AND S. KHATRI

We next specify the underlying stochastic model. First, the distributions of the (v, h) are assumed to be independent across users. We therefore only need to specify a probability distribution on the collec- tion (v, h) for the ith user. This distribution is de- termined by its two conditional distributions mod- eled as follows: The conditional distribution of the ith user’s observed ratings information v, given that user’s hidden features vector h, is modeled as a (one- trial) multinomial distribution k P (vj = 1 h) Fig. 7. The RBM model for a single user: Each of the user’s (6.1) | k F k hidden units is connected to every visible unit (a multinomial exp(bj + f=1 hf Wj,f ) observation) that represents a rating made by that user. Every = , K ℓ F ℓ user is associated with one RBM, and the RBM models for ℓ=1 exp(bjP+ f=1 hf Wj,f ) the different users are linked through the common symmetric where the denominator is just a normalizing factor. W k P P weight parameters j,f . Next, the conditional distributions of the ith user’s hidden features variables, given that user’s observed Palmer (1991), Section 7.1, Izenman (2008), Chap- ratings v, are modeled as conditionally independent ter 10, or Ripley (1996), Section 8.4. See also Bishop Bernoulli variables (1995, 2006). We will describe the RBM model that K k k has been applied to the Netflix data by Salakhut- (6.2) P (hf = 1 v)= σ bf + vj Wj,f , dinov, Mnih and Hinton (2007), whom we will also | ! j∈XJ(i) Xk=1 refer to as SMH. −x In the SMH model, to each user i, there corre- where σ(x) = 1/(1 + e ) is the sigmoidal function. Note that (6.2) is equivalent to the linear logit model sponds a length F vector of hidden (i.e., unobserved) units, or features, h = (h , h ,...,h ). These fea- P (hf = 1 v) 1 2 F log | tures, hf , for f = 1, 2,...,F , are random variables 1 P (hf = 1 v) posited to take on binary values, 0 or 1. Note that (6.3)  − |  K subscripting to indicate the dependence of h on the k k ith user has been suppressed. Next, instead of think- = bf + vj Wj,f ; ing of the ratings of the ith user as the collection of j∈XJ(i) Xk=1 values ri,j for j J(i), we think of this user’s ratings in effect, (6.2)/(6.3) models user features in terms of ∈ 1 2 K as the collection of vectors vj = (vj , vj , . . . , vj ), for the movies the user has rated, and the user’s ratings j J(i), that is, for each of the movies he or she has for them. Note that the weights (interaction parame- ∈ k seen. Each of these vectors is defined by setting all ters) Wj,f are assumed to act symmetrically in (6.1) k k of its elements to 0, except for one: namely, vj = 1, and (6.2). The parameters bj and bf are referred to corresponding to ri,j = k. Here K is the number of k as biases; the bj may be initialized to the logs of possible ratings; for Netflix, K = 5. The collection their respective sample proportions over all users. of these v vectors for our ith user [with j J(i)] j ∈ We remark that in this model there is no analogue will be denoted by v. Here again, the dependence for user biases. k of v, as well as of the vj and the vj , on the user i is To obtain the joint density of v and h from their suppressed. 
two conditional distributions, we make use of the We next introduce the symmetric weight param- following result: Suppose f(x,y) is a joint density eters W k for 1 j J, 1 f F and 1 k K, j,f ≤ ≤ ≤ ≤ ≤ ≤ for (X,Y ), and that f1(x y), f2(y x) are the corre- which link each of the F hidden features of a user sponding conditional density| functions| for X Y and with each of the J possible movies; these weights Y X. Then noting the elementary equalities | | also carry a dependence on the rating values k. f (y x∗) The W k are not dependent on the user; the same f(x,y)= f (x y) 2 | f (x∗) j,f 1 | × f (x∗ y) × X weights apply to all users, however, only weights for 1 | the movies he or she has rated will be relevant for f (x y∗) = f (y x) 1 | f (y∗), any particular user. 2 | × f (y∗ x) × Y 2 | NETFLIX CHALLENGE 13 we see that f(x,y) can be determined from f1 and f2 and since it is proportional to either of ∂Z ′ ′ ′ k′ ∗ ∗ = exp E(v , h ) hf vj . f2(y x ) f1(x y ) ∂W k {− } j,f v′ h′ f1(x y) |∗ and f2(y x) ∗| . | × f1(x y) | × f2(y x) X X | | Now Here fX and fY are the marginals of X and Y , and ∗ ∗ ∂ log p(v) ∂ log( exp( E(v, h))) the choices of x and y are arbitrary. It follows that = h − ∂W k ∂W k the joint density of (v, h) satisfies the proportional- j,f P j,f ity (6.5) ∂ log Z P (h v)P (v h∗) ; p(v, h) 2 | 1 | ; − ∂W k ∝ P (h∗ v) j,f 2 | with the choice h∗ = 0, this yields the first term on the right in (6.5) equals 1 p(v, h) exp E(v, h) , exp( E(v, h))h vk ∝ {− } f j h exp( E(v, h)) − where − Xh F K K P k = p(h v)hj vi , E(v, h)= W k h vk vkbk | − j,f f j − j j h j∈J(i) f=1 k=1 j∈J(i) k=1 X X X X X X while the second term on the right in (6.5) is F K k 1 ∂Z 1 ′ ′ ′ k′ hf bf + log bj . = exp( E(v , h ))h v − k f j f=1 j∈J(i) k=1 ! Z ∂Wj,f Z ′ ′ − X X X Xv Xh ′ The computations here just involve taking products = p(v′, h′)h′ vk . over the observed ratings using (6.1), and over the f j v′ h′ hidden features using (6.2). By analogy to formu- X X lae in statistical mechanics, E(v, h) is referred to as Hence, altogether, an energy; note that only movies whose ratings are k ∆Wj,f known contribute to it. The joint density of (v, h) ′ can therefore be expressed as = ε p(h v)h vk p(v′, h′)h′ vk , | f j − j i exp E(v, h)  h v′ h′  p(v, h)= {− ′} ′ , X X X ′ ′ exp E(v , h ) v ,h {− } or, expressed more concisely, so that the likelihood function (i.e., the marginal k k k P (6.6) ∆W = ε( vj hf data vj hf model). distribution for the observed data) is j,f h i − h i Similarly, we obtain the updating protocols h exp E(v, h) (6.4) p(v)= {− ′ }′ . ′ ′ exp E(v , h ) (6.7) ∆bf = ε( hf data hf model) Pv ,h {− } h i − h i We will use the notationP and Z = exp( E(v′, h′)) (6.8) ∆bk = ε( vk vk ). − j j data j model v′ h′ h i − h i X X Note that the gradients here are for a single user for the denominator term of (6.4). k only; therefore, the three updating equations (6.6), Now the updating protocol for the Wj,f is given (6.7) and (6.8) must first be averaged over all users. by The updating equations (6.6), (6.7) and (6.8) for k ∂ log p(v) implementing maximum likelihood “learning” in- ∆Wj,f ε k , ≡ ∂Wj,f volves two forms of averaging. The averaging over the “data,” that is, based on the p(h v), is rela- k | where ε is a “learning rate.” To determine ∆Wij, we tively straightforward. However, the averaging over will need the derivatives the “model” is impractical, as it requires Gibbs-type ∂E(v, h) k MCMC sampling from p(v, h) which involves iterat- k = hf vj ∂Wj,f − ing between (6.1) and (6.2). 
SMH instead suggest 14 A. FEUERVERGER, Y. HE AND S. KHATRI running this Gibbs sampler for only a small num- the contest, exploiting this information was critical. ber of steps at each stage, a procedure referred to as It turns out that RBM models can incorporate such “contrastive divergence” (Hinton, 2002). For further implicit information in a relatively straightforward details, we refer the reader to SMH. way; according to Bell, Koren and Volinsky (2007b), Numerous variations on the model defined by (6.1) this is a key strength of RBM models. For further and (6.2) are possible. In particular, the user fea- details we refer the reader to SMH. tures h may be modeled as Gaussian variables hav- SMH reported that, when they also incorporated ing, say, unit variances. In this case the model for this implicit information, RBMs slightly outperform- P (vk = 1 h) remains as at (6.1), but (6.2) becomes ed carefully-tuned SVD models. They also found j | P (hf = h v) that the errors made by these two types of models | were significantly different so that linearly combin- K 2 1 1 k k ing multiple RBM and SVD models, using coeffi- = exp h bf vj Wj,f . cients determined over the probe set, allowed them √2π (−2 − − ! ) j∈XJ(i) Xk=1 to achieve an error rate over 6% better than Cine- The marginal distribution p(v) remains as at (6.4) match. The ML@UToronto team’s Leaderboard score except with energy term ultimately attained an RMSE of 0.8787 on the quiz F K K set (see Table 1). k k k k E(v, h)= Wj,f hf vj vj bj − − 7. NEAREST NEIGHBOR (KNN) METHODS j∈XJ(i) Xf=1 Xk=1 j∈XJ(i) Xk=1 F K Early recommender systems were based on nearest 1 + (h b )2 + log bk . neighbors (kNN) methods, and have the advantage 2 f − f j f=1 j∈J(i) k=1 ! of conceptual and computational simplicity useful X X X for producing convincing explanations to users as to The parameter updating equations remain unchang- why particular recommendations are being made to ed. Salakhutdinov, Mnih and Hinton (2007) report them. Usually applied to the residuals from a prelim- that this Gaussian version does not perform as well inary fit, kNN tries to identify (pairwise) similarities as the binary one; perhaps the nonlinear structure among users or among movies and use these to make in (6.2) is useful for modeling the Netflix data. Bell, predictions. Although generally less accurate than Koren and Volinsky (2007b, 2008), on the other hand, SVD, kNN models capture local aspects of the data prefer the Gaussian model. SMH indicate that to contrast sufficiently among not fitted completely by SVD or other global models users, good models typically require the number of we have described. Key references include Bell and binary user features to be not less than about F = Koren (2007a, 2007b, 2007c), Bell, Koren and Volin- 100. Hence, the dimension of the weights W , which sky (2007a, 2008), Koren (2008, 2010), Sarwar et al. is J F K, can be upward of ten million. The (2001), Toscher, Jahrer and Legenstein (2008) and parameterization× × of W can be reduced somewhat Wang, de Vries and Reinders (2006). See also Her- by representing it as a product of matrices of lower locker et al. (2000), Tintarev and Masthoff (2007) k p k and Ungar and Foster (1998). rank, as in Wj,f = ℓ=1 AjℓBℓf . This approach re- duces the number of W parameters to J p K + While the kNN paradigm applies symmetrically p F , a factor of aboutP p/F . 
× × to movies and to users, we focus our discussion on ×There is a further point which we mention only movie nearest neighbors, as these are the more accu- briefly here, but return to in Section 11. While the rately estimable, however, both effects are actually Netflix qualifying data omits ratings, it does provide important. A basic kNN idea is to estimate the rat- implicit information in the form of which movies ing that user i would assign to movie j by means users chose to rate; this is particularly useful for of a weighted average of the ratings he or she has users having only a small number of ratings in the assigned to movies most similar to j among movies training set. In fact, the full binary matrix indicat- which that user has rated: ing which user-movie pairs were rated (regardless of j′∈N(j;i) sj,j′ ri,j′ whether or not the ratings are known) is an impor- (7.1) rˆi,j = . ′ s ′ tant information source. This information is valu- P j ∈N(j;i) j,j able because the values missing in the ratings ma- Here the sj,j′ are similarityP measures which act as trix are not “missing at random” and for purposes of weights, and N(j; i) is the set of, say, K movies, NETFLIX CHALLENGE 15 that i has seen and that are most similar to j. Let- placed by global weights having a relatively more ting I(j, j′)= I(j) I(j′) be the set of users who accurate global optimization, as in the model have seen both movies∩ j and j′, similarity between pairs of movies can be measured using Pearson’s cor- rˆi,j = bi,j + (ri,j′ bi,j′ )wj,j′ ′ k − relation j ∈NX(j;i) ′ ′ (7.3) i∈I(j,j′)(ri,j r·,j )(ri,j r·,j ) sj,j′ = − − , + Ci,j′ . 2 2 P′ (ri,j r·,j ) ′ (ri,j′ r·,j′ ) ′ k i∈I(j,j ) − i∈I(j,j ) − j ∈NX(j;i) q q or by theP variant P Here wj,j′ is the same for all users, and the neighbor- k k k ′ hoods are now N (j; i) J(i) N (j), where N (j) i∈I(j,j′ )(ri,j ri,·)(ri,j ri,·) ≡ ∩ sj,j′ = − − is the set of k movies most similar to j as deter- 2 2 ∈ P′ (ri,j ri,·) ∈ ′ (ri,j′ ri,·) mined by the similarity measure. The sum involving i I(j,j ) − i I(j,j ) − the C ′ is included in order to model implicit in- qP qP i,j in which centering is at the user instead of the movie formation inherent in the choice of movies a user means, or by cosine similarity rated; for purposes of the Netflix contest, this sum i∈I(j,j′) ri,jri,j′ would include the cases in the qualifying data. As sj,j′ = . 2 2 a further enhancement, the bi,j following the equal- i∈IP(j,j′) ri,j i∈I(j,j′) r ′ i,j ity and the bi,j′ within the sum could be decou- The similarityq measureP is usedq toP determine the near- pled, with the second of these remaining as the orig- est neighbors, as well as to provide the weights inal baseline values, and the first of these set to in (7.1). In practice, if an ANOVA, SVD and/or µ + ai + bj and then trained simultaneously with the other fit is carried out first, kNN would be applied model. Furthermore, the sums in (7.3) could each to the residuals from that fit; under such “center- be normalized, for instance, using coefficients such as N k(j; i) −1/2. ing” the behavior of the three similarity measures | | above would be very alike. As the common supports ′ 8. DIMENSIONALITY AND PARAMETER I(j, j ) vary greatly, it is usual to regularize the sj,j′ via a rule such as SHRINKAGE I(j, j′) The large number (often millions) of parameters s ′ | | s ′ . j,j ← I(j, j′) + λ j,j in the models discussed make them prone to over- | | fitting, affecting the accuracy of the prediction pro- A more data-responsive kNN procedure could be cess. 
8. DIMENSIONALITY AND PARAMETER SHRINKAGE

The large number (often millions) of parameters in the models discussed makes them prone to overfitting, affecting the accuracy of the prediction process. Reducing dimensionality through penalization therefore becomes mission-critical. This leads to considerations which are relatively recent in statistics, such as the effective number of degrees of freedom of a regularized model and its use in assessing predictive accuracy, as well as to the connections between that viewpoint and James–Stein shrinkage and empirical Bayes ideas. In this section we attempt to place such issues within the Netflix context. A difficulty which arises here stems from the distributional mismatch between the training and validation data; however, we will sidestep this issue so as to focus on key theoretical considerations. Our discussion draws from Casella (1985), Copas (1983), Efron (1975, 1983, 1986, 1996, 2004), Efron et al. (2004), Efron and Morris (1971, 1972a, 1972b, 1973a, 1973b, 1975, 1977), van Houwelingen (2001), Morris (1983), Stein (1981), Ye (1998) and Zou et al. (2007). See also Barbieri and Berger (2004), Baron (1984), Berger (1982), Breiman and Friedman (1997), Candes and Tao (2007), Carlin and Louis (1996), Fan and Li (2006), Friedman (1994), Greenshtein and Ritov (2004), Li (1985), Mallows (1973), Maritz and Lwin (1989), Moody (1992), Robbins (1956, 1964, 1983), Sarwar et al. (2000), Stone (1974) and Yuan and Lin (2005).

Prediction Optimism

To give some context to our discussion, suppose Y is an n \times 1 random vector with entries Y_i, for i = 1, 2, ..., n, all having finite second moment and, for simplicity, common variance \sigma^2, and suppose the mean of Y is modeled by a vector \mu(\beta) with entries \mu_i(\beta), where \beta is a p \times 1 vector of parameters. When \beta is estimated from the data Y, the resulting fitted error is typically smaller than the corresponding prediction error; in particular,

(8.2)   E\sum_{i=1}^n (Y_i - \mu_i(\hat\beta))^2 \;\le\; n\sigma^2 \;\le\; E\sum_{i=1}^n (Y_i^* - \mu_i(\hat\beta))^2,

it being understood that here the predictions \mu(\hat\beta) are for an independent repetition Y^* of the same random experiment. Efron (1983) refers to the difference between prediction error and fitted error, that is, between the right- and left-hand sides in (8.2)/(8.3), as the optimism.

It is helpful, for the sake of exposition, to examine the inequalities (8.2)/(8.3) for a linear model, where \mu(\beta) = X\beta, and X is n \times p. In that case, the leftmost and rightmost expressions in (8.2) are equidistant from the middle one, and (8.2)/(8.3) become

(8.4)   (n - p)\sigma^2 = E\sum_{i=1}^n (Y_i - \mu_i(\hat\beta))^2 \;\le\; n\sigma^2 \;\le\; E\sum_{i=1}^n (Y_i^* - \mu_i(\hat\beta))^2 = (n + p)\sigma^2.

Effective Degrees of Freedom

The distances p\sigma^2 across both boundaries in (8.4), as well as at (8.5) and (8.6), lead to a natural definition for the effective number of degrees of freedom of a statistical fitting procedure. In the linear case, \mu(\beta) = X\beta, using the least squares estimator \hat\beta = (X'X)^{-1}X'Y, we have \mu(\hat\beta) = X\hat\beta = HY, where H = X(X'X)^{-1}X'. Assuming the columns of X are not colinear, the matrices H and M = I - H project onto orthogonal subspaces of dimensions p and n - p. The occurrence of p at the left in (8.4) is usually viewed as connected with the decomposition Y'Y = Y'HY + Y'MY and the fact that the projection matrix H has rank p. For a projection matrix, however, rank and trace are identical, and it is the trace which actually matters.

To appreciate this, note that if \hat\mu_i is any quantity determined independently of Y^*, then

(8.7)   E(Y_i^* - \hat\mu_i)^2 = E(Y_i^* - \mu_i)^2 + E(\hat\mu_i - \mu_i)^2.

On the other hand,

(8.8)   E(Y_i - \hat\mu_i)^2 = E(Y_i - \mu_i)^2 + E(\hat\mu_i - \mu_i)^2 - 2\,\mathrm{Cov}(Y_i, \hat\mu_i).

Taken together, and remembering that E(Y_i^* - \mu_i)^2 = E(Y_i - \mu_i)^2, these give

(8.9)   E(Y_i^* - \hat\mu_i)^2 = E(Y_i - \hat\mu_i)^2 + 2\,\mathrm{Cov}(Y_i, \hat\mu_i),

and then summing over i shows that the difference between the right- and the left-hand sides of (8.2) is

2 \sum_{i=1}^n \mathrm{Cov}(Y_i, \hat\mu_i).

Equating this with 2p\sigma^2 leads to the definition

(8.10)   \text{effective d.f.} \equiv \frac{1}{\sigma^2} \sum_{i=1}^n \mathrm{Cov}(Y_i, \hat\mu_i).

The relations (8.7), (8.8) and (8.9) hold for any estimator. But if \hat\mu = HY, that is, for a linear estimator, the covariances \mathrm{Cov}(Y_i, \hat\mu_i) are just \sigma^2 times the diagonal elements of H, so that

(8.11)   \text{effective d.f.} = \mathrm{trace}(H).

For nonlinear models, the (approximate) effective number of degrees of freedom may be defined either via (8.10), or via (8.11) if we use the trace of its locally linear approximation \mu(\hat\beta) \simeq \mu(\beta_0) + H(Y - \mu(\beta_0)), with both of these definitions being justifiable asymptotically in view of (8.5) and (8.6), under the smoothness condition referred to there.
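The agreement between (8.10) and (8.11) is easy to verify numerically. The sketch below (ours) does so for a ridge-type linear smoother with arbitrary toy dimensions, estimating the covariances in (8.10) by Monte Carlo.

```python
import numpy as np

rng = np.random.default_rng(1)

# Effective degrees of freedom of a linear smoother mu_hat = H y:
# definition (8.10) sums Cov(Y_i, mu_hat_i)/sigma^2, which for a linear
# estimator equals trace(H), as at (8.11).  A ridge fit illustrates.
n, p, sigma, lam = 50, 10, 1.0, 5.0
X = rng.standard_normal((n, p))
beta = rng.standard_normal(p)
mu = X @ beta

H = X @ np.linalg.solve(X.T @ X + lam * np.eye(p), X.T)
print("trace(H) =", np.trace(H))

# Monte Carlo check of (8.10) over repeated draws of Y.
reps = 20000
Y = mu + sigma * rng.standard_normal((reps, n))
fits = Y @ H.T
cov_sum = ((Y - mu) * (fits - fits.mean(axis=0))).mean(axis=0).sum()
print("sum_i Cov(Y_i, mu_hat_i)/sigma^2 ~", cov_sum / sigma**2)
```

The two printed quantities agree up to Monte Carlo error, and both lie strictly below p, reflecting the shrinkage induced by the penalty.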
Example: I \times J ANOVA

To help fix ideas, it is instructive to consider the optimization problem for the (complete) quadratically penalized I \times J ANOVA (footnote 12)

(8.12)   \sum_{i=1}^I \sum_{j=1}^J (r_{i,j} - \mu - \alpha_i - \beta_j)^2 + \lambda_1 \sum_{i=1}^I \alpha_i^2 + \lambda_2 \sum_{j=1}^J \beta_j^2.

We deliberately do not penalize for \mu here because \mu is typically known to differ substantially from zero. We will use the identity

(8.13)   \sum_{i=1}^I \sum_{j=1}^J (r_{i,j} - \mu - \alpha_i - \beta_j)^2 = \sum_{i=1}^I \sum_{j=1}^J [r_{i,j} - \bar{r}_{\cdot,\cdot} - (\bar{r}_{i,\cdot} - \bar{r}_{\cdot,\cdot}) - (\bar{r}_{\cdot,j} - \bar{r}_{\cdot,\cdot})]^2 + IJ(\mu - \bar{r}_{\cdot,\cdot})^2 + J \sum_{i=1}^I [\alpha_i - (\bar{r}_{i,\cdot} - \bar{r}_{\cdot,\cdot})]^2 + I \sum_{j=1}^J [\beta_j - (\bar{r}_{\cdot,j} - \bar{r}_{\cdot,\cdot})]^2,

where the "dots" represent averaging. It differs from the standard ANOVA identity, but is derived similarly, although it requires \sum_{i=1}^I \alpha_i = 0 and \sum_{j=1}^J \beta_j = 0. Using (8.13), the optimization problem (8.12) separates, leading to the solutions

(8.14)   \hat\mu = \bar{r}_{\cdot,\cdot}, \quad \hat\alpha_i = \frac{J}{J + \lambda_1}(\bar{r}_{i,\cdot} - \bar{r}_{\cdot,\cdot}) \quad\text{and}\quad \hat\beta_j = \frac{I}{I + \lambda_2}(\bar{r}_{\cdot,j} - \bar{r}_{\cdot,\cdot}).

Optimal choices for the regularization parameters \lambda_1 and \lambda_2 in (8.12) are usually estimated by cross-validation; however, here we wish to understand these analytically. We can do this by minimizing Akaike's predictive information criterion (AIC),

\mathrm{AIC} = -2 \log(\mathcal{L}_\lambda) + 2\,\mathrm{df}(\lambda),

where \mathcal{L}_\lambda is the value (under \lambda-regularization) of the likelihood for the \{r_{i,j}\} at the MLE, and \mathrm{df}(\lambda) is the effective number of degrees of freedom; here \lambda \equiv (\lambda_1, \lambda_2). As we are in a Gaussian case, with an RMSE perspective, this is (except for additive constants) the same as Mallows' C_p statistic,

C_p = \frac{\{\text{residual sum of squares}\}_\lambda}{\sigma^2} + 2\,\mathrm{df}(\lambda).

Minimizing this will (for linear models) be equivalent to minimizing the expected squared prediction error, defined as the rightmost term in (8.2), or (for nonlinear models) to minimizing it asymptotically. For further discussion of these points, see Chapter 7 of Hastie et al. (2009).

Now, the effective number of degrees of freedom associated with (8.12) can be determined by viewing the minimizing solution to (8.12) as a linear transformation, \hat{r} = H_\lambda r, from the vector r consisting of the observations r_{i,j}, to the vector \hat{r} of fitted values \hat{r}_{i,j}. The entries of the matrix H_\lambda are determined from the relation \hat{r}_{i,j} = \hat\mu + \hat\alpha_i + \hat\beta_j, where \hat\mu, \hat\alpha_i and \hat\beta_j are given at (8.14). Thus, the effective number of degrees of freedom, when penalizing by (\lambda_1, \lambda_2), is found to be

(8.15)   \mathrm{df} = \mathrm{trace}(H_\lambda) = 1 + (I - 1)\frac{J}{J + \lambda_1} + (J - 1)\frac{I}{I + \lambda_2}.

Next, for a given \lambda_1 and \lambda_2, the residual sum of squares is

\sum_{i=1}^I \sum_{j=1}^J \Big[ r_{i,j} - \bar{r}_{\cdot,\cdot} - \frac{J}{J + \lambda_1}(\bar{r}_{i,\cdot} - \bar{r}_{\cdot,\cdot}) - \frac{I}{I + \lambda_2}(\bar{r}_{\cdot,j} - \bar{r}_{\cdot,\cdot}) \Big]^2,

and this may be expanded as

\sum_{i=1}^I \sum_{j=1}^J [(r_{i,j} - \bar{r}_{\cdot,\cdot}) - (\bar{r}_{i,\cdot} - \bar{r}_{\cdot,\cdot}) - (\bar{r}_{\cdot,j} - \bar{r}_{\cdot,\cdot})]^2 + \Big(1 - \frac{J}{J + \lambda_1}\Big)^2 J \sum_{i=1}^I (\bar{r}_{i,\cdot} - \bar{r}_{\cdot,\cdot})^2 + \Big(1 - \frac{I}{I + \lambda_2}\Big)^2 I \sum_{j=1}^J (\bar{r}_{\cdot,j} - \bar{r}_{\cdot,\cdot})^2,

where the first of the three terms here may subsequently be ignored. Hence, the C_p criterion we seek to minimize can be taken as

\frac{1}{\sigma^2} \sum_{i=1}^I \sum_{j=1}^J \Big[ r_{i,j} - \bar{r}_{\cdot,\cdot} - \frac{J}{J + \lambda_1}(\bar{r}_{i,\cdot} - \bar{r}_{\cdot,\cdot}) - \frac{I}{I + \lambda_2}(\bar{r}_{\cdot,j} - \bar{r}_{\cdot,\cdot}) \Big]^2 + 2\Big[ 1 + (I - 1)\frac{J}{J + \lambda_1} + (J - 1)\frac{I}{I + \lambda_2} \Big]

or, equivalently,

\frac{1}{\sigma^2} \Big\{ J\Big(1 - \frac{J}{J + \lambda_1}\Big)^2 \sum_{i=1}^I (\bar{r}_{i,\cdot} - \bar{r}_{\cdot,\cdot})^2 + I\Big(1 - \frac{I}{I + \lambda_2}\Big)^2 \sum_{j=1}^J (\bar{r}_{\cdot,j} - \bar{r}_{\cdot,\cdot})^2 \Big\} + 2\Big[ (I - 1)\frac{J}{J + \lambda_1} + (J - 1)\frac{I}{I + \lambda_2} \Big].

The minimizations with respect to J/(J + \lambda_1) and I/(I + \lambda_2) thus separate, and setting derivatives to zero leads to the approximate solutions

(8.16)   \lambda_1 = \frac{\sigma^2}{\sum_{i=1}^I (\bar{r}_{i,\cdot} - \bar{r}_{\cdot,\cdot})^2 / (I - 1)} \quad\text{and}\quad \lambda_2 = \frac{\sigma^2}{\sum_{j=1}^J (\bar{r}_{\cdot,j} - \bar{r}_{\cdot,\cdot})^2 / (J - 1)}.

On substituting these into (8.15), we also see that under the theoretically optimal regularization the effective number of degrees of freedom for the ANOVA becomes

(I + J - 1) - \Big\{ (I - 1)\,\frac{\sigma^2}{J\,[\sum_{i=1}^I (\bar{r}_{i,\cdot} - \bar{r}_{\cdot,\cdot})^2/(I - 1)]} + (J - 1)\,\frac{\sigma^2}{I\,[\sum_{j=1}^J (\bar{r}_{\cdot,j} - \bar{r}_{\cdot,\cdot})^2/(J - 1)]} \Big\};

the expression in braces gives the reduction in degrees of freedom which results under the optimal penalization. Equations (8.16) and (8.14) may be interpreted as saying that optimal penalization (or shrinkage) should be done differentially by parameter groupings, with each group of (centered) parameters shrunk in accordance with that group's variability (the variances of the row and column effects here) relative to the variability of error, and each parameter in accordance with its support base (i.e., with the information content of the data relevant to its estimation—here I and J).

Footnote 12. Unlike the SVD case, discussed in (5.2) and in footnote 8 of Section 5, using different values for \lambda_1 and \lambda_2 is essential here.
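In the complete-data case everything above is available in closed form. The following sketch (ours) simulates a small I x J table, applies the solutions (8.14) with the approximately optimal penalties (8.16), and reports the effective degrees of freedom (8.15); sigma is treated as known, matching the derivation.

```python
import numpy as np

rng = np.random.default_rng(2)

# Penalized complete I x J ANOVA: closed-form solutions (8.14) under the
# approximately optimal penalties (8.16), with effective df from (8.15).
I, J, sigma = 30, 40, 1.0
alpha = rng.normal(0.0, 0.5, I)
beta = rng.normal(0.0, 0.6, J)
r = 3.6 + alpha[:, None] + beta[None, :] + sigma * rng.standard_normal((I, J))

r_bar = r.mean()
row = r.mean(axis=1) - r_bar   # centered row means, r_i. - r..
col = r.mean(axis=0) - r_bar   # centered column means, r_.j - r..

# (8.16): shrink each parameter group relative to its own variability.
lam1 = sigma**2 / (np.sum(row**2) / (I - 1))
lam2 = sigma**2 / (np.sum(col**2) / (J - 1))

mu_hat = r_bar                          # (8.14)
alpha_hat = J / (J + lam1) * row
beta_hat = I / (I + lam2) * col

df = 1 + (I - 1) * J / (J + lam1) + (J - 1) * I / (I + lam2)   # (8.15)
print("lambda_1 =", lam1, " lambda_2 =", lam2, " effective df =", df)
```

Note how the two penalty values differ, reflecting the different group variances and supports, which is exactly the point of footnote 12.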
Empirical Bayes Viewpoint

The preceding computations may be compared with an empirical Bayes approach. For this we will assume that r_{i,j} = \mu + \alpha_i + \beta_j + e_{i,j}, with the e_{i,j} being independent N(0, \sigma^2) variables. For simplicity, we assume that \mu and \sigma^2 are known. On the parameters \alpha_i and \beta_j, respectively, we posit independent N(0, \sigma_1^2) and N(0, \sigma_2^2) priors, with \sigma_1^2 and \sigma_2^2 being hyperparameters. Multiplying up the I + J + IJ normal densities for the \alpha_i, \beta_j and r_{i,j}, and again using (8.13), we can complete squares and integrate out the \alpha_i and \beta_j. This leads to a likelihood function for \sigma_1^2 and \sigma_2^2 which, to within a factor not depending on \sigma_1^2 and \sigma_2^2, is given by

\Big( \frac{\sqrt{2\pi}\,\sigma}{\sqrt{J\sigma_1^2 + \sigma^2}} \Big)^I \exp\Big\{ -\frac{1}{2\sigma^2} \Big( J - \frac{J^2}{J + (\sigma^2/\sigma_1^2)} \Big) \sum_{i=1}^I (\bar{r}_{i,\cdot} - \bar{r}_{\cdot,\cdot})^2 \Big\} \cdot \Big( \frac{\sqrt{2\pi}\,\sigma}{\sqrt{I\sigma_2^2 + \sigma^2}} \Big)^J \exp\Big\{ -\frac{1}{2\sigma^2} \Big( I - \frac{I^2}{I + (\sigma^2/\sigma_2^2)} \Big) \sum_{j=1}^J (\bar{r}_{\cdot,j} - \bar{r}_{\cdot,\cdot})^2 \Big\},

and maximizing this leads to the estimates

\hat\sigma_1^2 = \frac{1}{I} \sum_{i=1}^I (\bar{r}_{i,\cdot} - \bar{r}_{\cdot,\cdot})^2 - \frac{\sigma^2}{J} \quad\text{and}\quad \hat\sigma_2^2 = \frac{1}{J} \sum_{j=1}^J (\bar{r}_{\cdot,j} - \bar{r}_{\cdot,\cdot})^2 - \frac{\sigma^2}{I}.

The resulting empirical Bayes Gaussian prior can thus be seen as being essentially equivalent to the quadratically penalized optimization (8.12) under the optimal choice (8.16) for the penalty parameters \lambda_1, \lambda_2.

Generalizing

We begin with a few remarks on the penalized sparse ANOVA

(8.17)   \sum_{(i,j) \in C} (r_{i,j} - \mu - \alpha_i - \beta_j)^2 + \lambda_1 \sum_{i=1}^I \alpha_i^2 + \lambda_2 \sum_{j=1}^J \beta_j^2,

where C denotes the set of user-movie pairs for which ratings are observed. This optimization can be carried out by EM or by gradient descent; it has no analytical solution, but analogy with the complete case suggests that the shrinkage rules

\hat\alpha_i^{\text{shrink}} = \frac{J_i}{J_i + \lambda_1}\,\hat\alpha_i \quad\text{and}\quad \hat\beta_j^{\text{shrink}} = \frac{I_j}{I_j + \lambda_2}\,\hat\beta_j,

where \hat\alpha_i and \hat\beta_j are the unpenalized estimates, will be approximately optimal provided we again take \lambda_1 and \lambda_2 as ratios of row and column variation relative to error as at (8.16). Koren (2010) proposed the less accurate but simpler penalization

\hat\beta_j = \frac{\sum_{i \in I(j)} (r_{i,j} - \hat\mu)}{I_j + \lambda_2}

first, and then

\hat\alpha_i = \frac{\sum_{j \in J(i)} (r_{i,j} - \hat\mu - \hat\beta_j)}{J_i + \lambda_1},

where \hat\mu is the overall mean; typical values he used (footnote 13) were \lambda_1 = 10 and \lambda_2 = 25.

Footnote 13. Koren's values were targeted to fit the probe set. If the probe and training sets had identical statistical properties, these values would likely have been smaller: recall that in Section 4 we obtained variances of 0.23 and 0.28 for the user and movie means, and RMSE values slightly below 1, suggesting the approximate values \lambda_1 \approx \lambda_2 \approx 4.
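In code, Koren's sequential baseline penalization amounts to two shrunken averages. The sketch below (ours) assumes, purely for illustration, that the ratings are held as (user, movie, rating) triples with contiguous integer ids.

```python
import numpy as np

def baselines(ratings, n_users, n_movies, lam1=10.0, lam2=25.0):
    """Koren-style sequential baselines; ratings: (i, j, r_ij) triples."""
    mu = np.mean([r for _, _, r in ratings])
    # Movie effects first: beta_j = sum_{i in I(j)} (r_ij - mu) / (I_j + lam2).
    num_b, cnt_b = np.zeros(n_movies), np.zeros(n_movies)
    for i, j, r in ratings:
        num_b[j] += r - mu
        cnt_b[j] += 1
    beta = num_b / (cnt_b + lam2)
    # Then user effects: alpha_i = sum_{j in J(i)} (r_ij - mu - beta_j) / (J_i + lam1).
    num_a, cnt_a = np.zeros(n_users), np.zeros(n_users)
    for i, j, r in ratings:
        num_a[i] += r - mu - beta[j]
        cnt_a[i] += 1
    alpha = num_a / (cnt_a + lam1)
    return mu, alpha, beta
```

The per-parameter supports I_j and J_i enter the denominators directly, so lightly supported users and movies are shrunk the hardest, in keeping with the discussion above.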

For more complex models, such as sparse SVD, the lessons here suggest that penalties on parameter groupings should correspond to priors which model the distributions of the groups. For Gaussian priors (quadratic regularization) we then need estimates for the group variances. For SVD we thus want estimates of the variances of each of the user and movie features. We experimented with fitting SVDs using minimal regularization—with features in descending order of importance—first removing low-usage users to better assess the true user variation. Because free constants can move between corresponding user and movie features, we examined products of the variances of corresponding features. These do tend toward zero (theoretically, this sequence must be summable) but appear to do so in small batches, settling down and staying near some small value, before settling still further, again staying a while, and so on. Our explanation for this is that there soon are no obvious features to be modeled, and that batches of features then contribute small, approximately equal amounts of explanatory power. Such considerations help suggest protocols for increasing regularization as we proceed along features. It is an important point that, in principle, the number of features may be allowed to become infinite, as long as their priors tend toward degeneracy sufficiently quickly.

Bell, Koren and Volinsky (2007a) proposed a particularly interesting empirical Bayes regularization for the feature parameters in SVD. They modeled user parameters as u_i \sim N(\mu, \Sigma_1), movie parameters as v_j \sim N(\nu, \Sigma_2), and individual SVD-based ratings as r_{i,j} \sim N(u_i'v_j, \sigma^2), with the natural assumptions on independence. They fitted such models using an EM and a Gibbs sampling procedure, alternating between fitting the SVD parameters and fitting the parameters of the prior. See also Lim and Teh (2007).

9. TEMPORAL CONSIDERATIONS

This section addresses the temporal discordances between the Netflix training and qualifying data sets. See, for example, Figures 1 and 5 of Section 2 for evidence of such effects. People's tastes—collectively and individually—change with time, and the movie "landscape" changes as well. The specific user who submits the ratings for an account may change, and day-of-week and seasonal effects occur as well. Furthermore, the introduction (and evolution) of a recommender system itself affects ratings. Here we provide a very brief overview of the main ideas which have been proposed for dealing with such issues, although to limit our scope, time effects are not emphasized in our subsequent discussions. Key references here are Koren (2008, 2009).

We first note that temporal effects can be entered into models in a "global" way. Specifically, the standard baseline ANOVA can be modified to read

r_{i,j} = \mu + \alpha_i(t) + \beta_j(t) + e_{i,j}.

Here all effects are shown as functions which depend on time, but the time arguments t can (variously) represent chronological time, or can represent a user-specific or a movie-specific time t_i or t_j, or even a jointly indexed time t_{i,j}.

Time effects can also be incorporated into both SVD and kNN type models. An example in the SVD case is the highly accurate model

\hat{r}_{ij}(t) = \mu + \alpha_i(t) + \beta_j(t) + v_j'\Big( u_i(t) + |J(i)|^{-1/2} \sum_{j' \in J(i)} C_{j'} \Big),

referred to as "SVD++" by Koren (2009), and fit using both regularization and cross-validation. Here the baseline values \alpha_i(t) and \beta_j(t), as well as the user effects u_i(t), are all allowed to vary over time but—on grounds that movies are more constant than users—the movie effects v_j are not. The last sum models feedback from the implicit information. Detailed proposals for temporal modeling of the user and movie biases, and for the user SVD factors u_i(t), as well as for modeling temporal effects in nearest neighbor models, may be found in Koren (2008, 2009).
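As one simple concrete instance of a time-varying baseline, the sketch below (ours) bins the rating dates and estimates a shrunken offset for each movie in each time bin. The binning scheme, the (i, j, t, r) tuple format and the penalty value are illustrative assumptions, not Koren's actual parameterization.

```python
import numpy as np

def temporal_movie_baseline(ratings, n_movies, n_bins, t_max, lam=25.0):
    """One way to let beta_j vary with time, as in
    r_ij = mu + alpha_i + beta_j(t) + e_ij: a shrunken offset per
    movie per time bin.  ratings: (i, j, t, r) with 0 <= t < t_max."""
    r = np.array([x[3] for x in ratings], dtype=float)
    mu = r.mean()
    num = np.zeros((n_movies, n_bins))
    cnt = np.zeros((n_movies, n_bins))
    for i, j, t, rij in ratings:
        b = min(int(n_bins * t / t_max), n_bins - 1)   # which time bin
        num[j, b] += rij - mu
        cnt[j, b] += 1
    return mu, num / (cnt + lam)   # shrunken per-bin movie effects
```

Finer bins track drift more closely but thin out the per-bin supports, so the shrinkage constant and the number of bins trade off against one another.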
10. IN SEARCH OF MODELS

Examining and contrasting such models as ANOVA, SVD, RBM and kNN is useful in a search for new model classes. We first remark that the best fitting models—such as SVD and RBM—have high-dimensional, simultaneously fitted parameterizations. On the other hand, useful models need not have such parameterizations, with ANOVA and kNN both suggestive of this. If a model has p parameters, and if it is viewed as spanning a p-dimensional submanifold of R^N, then we want p to not be too large, and yet we want this submanifold to contain a vector close to the expected N-dimensional vector of data to be fitted. For this to happen, the model will have to reflect some substantive aspect of the structure from whence the data arose. One striking feature of collaborative filtering data is the apparent absence of any single model that can explain most of the explainable variation observed. The reason for this may be that the available data are insufficient to reliably fit such a model. Were sufficient data available, it is tempting to think that some variation of SVD might be such a single model.

In this section we indicate some extensions to the models already discussed. Most of these were arrived at independently, although many do contain features resembling those in models proposed by others. It is to be understood that regularization is intended to be used with most of the procedures discussed.

Extending ANOVA

Likert scales, such as the Netflix stars system, are subjective, with each user choosing for themselves what rating (or ratings distribution) corresponds to an average movie, and just how much better (or worse) it needs to be to change that rating into a higher (or a lower) one. This not only speaks to a centering for each user, captured by the \alpha_i terms, but also to a scaling specific to each user, and suggests a variation of the usual ANOVA of the form

(10.1)   r_{i,j} = \mu + \alpha_i + \gamma_i \beta_j + \text{error}.

The scaling factors \gamma_i here are meant to be shrunk toward 1 in regularization; a fitting sketch appears below. This model is of an interaction type, and may be generalized to

(10.2)   r_{i,j} = \mu + \alpha_i + \beta_j + \text{Interact} + \text{error},

where the Interact term in (10.2) can vary among (a) \gamma_i \alpha_i \beta_j, (b) \gamma_j \alpha_i \beta_j, and related forms.

The usual SVD requirement that the user and movie feature vectors be of equal lengths is artificial. Furthermore, users are much more numerous than movies, so movie features are easier to estimate, while user features create the more severe overfitting. One might thus posit that each user is governed by p features u_i = (u_{i,1}, u_{i,2}, ..., u_{i,p}), and each movie by q features v_j = (v_{j,1}, v_{j,2}, ..., v_{j,q}), with p < q.
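Here is the promised sketch (ours) of fitting the user-scaling model (10.1) by regularized stochastic gradient descent, with the \gamma_i shrunk toward 1; the learning rate, penalty and (user, movie, rating) triple format are again illustrative assumptions.

```python
import numpy as np

def fit_scaling_anova(ratings, n_users, n_movies,
                      lr=0.01, lam=0.05, n_epochs=30):
    """SGD fit of r_ij = mu + alpha_i + gamma_i * beta_j (model 10.1);
    the penalty pulls gamma_i toward 1 rather than toward 0."""
    mu = np.mean([r for _, _, r in ratings])
    alpha = np.zeros(n_users)
    beta = np.zeros(n_movies)
    gamma = np.ones(n_users)
    for _ in range(n_epochs):
        for i, j, r in ratings:
            e = r - (mu + alpha[i] + gamma[i] * beta[j])
            alpha[i] += lr * (e - lam * alpha[i])
            beta[j]  += lr * (e * gamma[i] - lam * beta[j])
            gamma[i] += lr * (e * beta[j] - lam * (gamma[i] - 1.0))
    return mu, alpha, beta, gamma
```

The only departure from a standard baseline fit is the last update line, whose penalty term lam * (gamma[i] - 1.0) implements the shrinkage of the scale factors toward 1.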

Modeling Users via Movie Parameters

Parsimony of parameterization is a critical issue. Because users are 27 times more numerous than movies, assigning model parameters to users exacts a far greater price, in degrees of freedom, than assigning parameters to movies. This suggests that parameterization be arranged in such a way that its dimension is a multiple of the number of movies rather than of the number of users; we thus try to model user features in terms of parameters associated with movies. In the context of the Netflix contest, Paterek (2007) was the first to have publicly suggested this. However, users cannot be regarded as being alike merely for having seen the identical set of movies; their ratings for those movies must also be taken into account. Such considerations lead to three immediate possibilities:

(A) There is one feature vector, v_j, associated with each movie. The feature, u_i, for the ith user is modeled as a sum (variously weighted) of functions of the v_j for the movies he or she has seen, and the ratings he or she assigned to them.

(B) There are feature vectors, v_j and \tilde{v}_j, associated with each movie, and u_i is based on the \tilde{v}_j for the movies i has seen, together with the ratings assigned to them.

(C) There are six feature vectors, v_j and \tilde{v}_j(k) \equiv \tilde{v}_j^k, for k = 1, 2, ..., K (with K = 5), associated with each movie, and u_i is based on the \tilde{v}_j^k for the movies i has seen, and the ratings k assigned to them.

(There are possibilities beyond just these three.) These considerations lead to models such as

\hat{r}_{i,j} = \mu + \alpha_i + \beta_j + u_i'v_j,

where the v_j are free p-dimensional movie parameters, while the u_i are defined in terms of other (p-dimensional) movie parameters. For the approaches (A), (B) and (C) mentioned above, typical possibilities include

u_i = \gamma \times \sum_{j' \in J(i)} (r_{i,j'} - \bar{r}_{i,\cdot})\, v_{j'},

u_i = \gamma \times \sum_{j' \in J(i)} r_{i,j'}\, \tilde{v}_{j'} \quad\text{and}\quad u_i = \gamma \times \sum_{j' \in J(i)} \tilde{v}_{j'}(r_{i,j'}).

Substituting the last of these into the model above yields, for instance,

(10.11)   \hat{r}_{i,j} = \mu + \alpha_i + \beta_j + \gamma \sum_{\ell=1}^p \sum_{j' \in J(i)} \tilde{v}_{j',\ell}(r_{i,j'})\, v_{j,\ell}.

Note that (10.11) is essentially the same as the RBM-inspired model (10.10), but alternately arrived at. Here the weights \gamma might take the form |J(i)|^{-\delta}, with typical possibilities for \delta being 0, 1/2 or 1, or as determined by cross-validation.
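For concreteness, the following sketch (ours) computes a single prediction under approach (A), taking \gamma = |J(i)|^{-\delta} with \delta = 1/2; the dict-based storage is an illustrative assumption, and in practice the v_j, \alpha and \beta would be trained jointly by regularized gradient descent.

```python
import numpy as np

def predict_A(i, j, user_ratings, V, mu, alpha, beta, delta=0.5):
    """Approach (A): u_i = |J(i)|^(-delta) * sum_{j'} (r_ij' - rbar_i) v_j'.
    user_ratings[i] is a dict {movie j': rating}; V is n_movies x p."""
    rated = user_ratings[i]
    if not rated:
        return mu + alpha[i] + beta[j]      # no implicit profile available
    rbar = np.mean(list(rated.values()))
    gamma = len(rated) ** (-delta)
    u = gamma * sum((r - rbar) * V[jp] for jp, r in rated.items())
    return float(mu + alpha[i] + beta[j] + u @ V[j])
```

Since u_i is derived rather than stored, the model carries no free user feature vectors at all, which is the parsimony argument made above.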
11. FURTHER IDEAS AND METHODS

Whether driven by fortune or by fame, the tenacity of the contestants in the Netflix challenge has not often been surpassed. In this section we collect together a few ideas which have not yet been discussed elsewhere in this paper. A very few of these are our own (or at least were obtained independently), but the boundaries between those and the many other methods that have been proposed are necessarily blurred.

We start by noting that covariates can be included with many procedures, as in

\hat{r}_{i,j} = \mu + \alpha_i(t) + \beta_j(t) + \sum_{\ell=1}^p u_{i,\ell} v_{j,\ell} + \sum_{m=1}^M c_m X_{i,j}^m + \cdots,

where the X_{i,j}^m for m = 1, 2, ..., M are covariates. Such models can generally be fit using gradient descent, and since regularization is typically used as well, it would control automatically for collinearities among covariates. Covariates introduce very few additional parameters; hence, as a general rule (for this data), the more the better. A covariate will typically differ across both users and movies, unless it is viewed as being explanatory to the "row" or "column" effects. There are many possibilities for covariates. (See, e.g., Toscher and Jahrer, 2008.) A selection of these includes the following:

1. User and movie supports J_i, I_j and various functions (singly and jointly) of these.
2. Time between the movie's release date and the date the rating was made.
3. The number and/or proportion of movies user i has rated before and/or after rating movie j; the number and/or proportion of users who have rated movie j before and/or after user i has rated it; and functions of these.

4. The relative density of movie ratings by the user around the time the rating was made; also, the movie's density of being rated around the time of rating.
5. Seasonal and day-of-week effects.
6. Standard deviations and variances of the user's ratings and of the user's residuals. Same for movies.
7. Measures of "surge of interest" to detect "runaway movies."
8. The first few factors or principal components of the covariance matrix for the movie ratings.
9. Relationship between how frequently rated versus how highly rated the movie is, for example, (\log I_j - \text{Avg}) \times \beta_j.

We next remark that although they are the easiest to treat numerically, neither the imposed RMSE criterion, nor the widely used quadratic penalties, are sacrosanct for collaborative filtering. A mean absolute error criterion, for instance, penalizes large prediction errors less harshly or, equivalently, rewards correct estimates more generously, and penalties based on L1 regularization, as in the lasso (Tibshirani, 1996), produce models with fewer nonzero parameters. Although the lasso is geared more to model identification and parsimony than to optimal prediction, the additional regularization it offers can be useful in models with large parameterization. Other departures from RMSE are also of interest. For example, if we focus on estimating probability distributions for the ratings, a question of interest is: With what probability can we predict a rating value exactly? An objective function could be based on trying to predict the highest proportion of ratings exactly correctly. A yet different approach can be based on viewing the problem as one of ranking, as in Cohen, Schapire and Singer (1999). See also Popescul et al. (2001).

A natural question is whether or not it helps to shrink estimated ratings toward nearby integers. Takacs et al. (2007) considered this question from an RMSE viewpoint and argued that the answer is no. One can similarly ask whether it helps to shrink estimated ratings toward corresponding modes of estimated probability distributions; we would expect there too the answer to be negative.

Collaborative filtering contexts typically harbor substantial implicit information. In many contexts, a user's search history or even mouse-clicks can be useful. As mentioned several times previously, for Netflix, which movies a user rated carries information additional to the actual ratings. Here 99% of the data is "missing," but not "missing at random" (MAR). Marlin et al. (2007) discuss the impact of the MAR assumption for such data. Paterek (2007) introduced modified SVD models (called NSVD) of the type

\hat{r}_{i,j} = \mu + \alpha_i + \beta_j + v_j'\Big( u_i + |J_i|^{-1/2} \sum_{j' \in J(i)} y_{j'} \Big),

where the y_j are secondary movie features intended to model the implicit choices a user has made. The Netflix qualifying data set, for example, contains important information by identifying many cases of movies that users had rated, even though those rating values were not revealed. Corresponding to such information is an I \times J matrix which can be thought of as consisting of 0's and 1's, indicating which user-movie pairs had been rated, regardless of whether or not the actual ratings are known. This matrix is full, not sparse, and contains invaluable information. All leading contestants reported that including such implicit information in their ensembles and procedures made a vital difference. Hu, Koren and Volinsky (2008) treat the issue of implicit information in greater detail. See also Oard and Kim (1998).

Measures of similarity based on correlation-like quantities were discussed in Section 7. Alternate similarity measures can be constructed by defining distances between the feature vectors of SVD fits. Implicit versions of similarity may also be useful. For example, the proportion of users who have seen movie j is |I(j)|/I, and who have seen movie j' is |I(j')|/I. Under independence, the proportion who have seen both movies should be about |I(j)||I(j')|/I^2, but is actually |I(j,j')|/I. Significant differences between these two proportions are indicative of movies that appeal to rather different audiences.
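This implicit similarity is cheap to compute from the 0/1 rated-indicator matrix. The sketch below (ours) expresses the comparison as a log-ratio against independence, which is just one reasonable way of quantifying the discrepancy.

```python
import numpy as np

def implicit_similarity(B):
    """B: I x J binary matrix, B[i, j] = 1 if user i rated movie j.
    Compares the observed co-rating proportion |I(j,j')|/I with the
    product |I(j)||I(j')|/I^2 expected under independence."""
    I = B.shape[0]
    p_j = B.mean(axis=0)            # |I(j)| / I for each movie
    co = (B.T @ B) / I              # |I(j,j')| / I for each movie pair
    with np.errstate(divide="ignore", invalid="ignore"):
        lift = np.log(co / np.outer(p_j, p_j))
    return lift   # ~0 under independence; large |values| flag distinct audiences
```

Because the indicator matrix is full rather than 99% sparse, such similarities are available even for pairs of movies with very few common ratings.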
Among the most general models which have been suggested, Koren (2008) proposed combining the SVD and kNN methodologies while allowing for implicit information within each component, leading to models such as

\hat{r}_{i,j} = \mu + \alpha_i + \beta_j + v_j'\Big( u_i + |J(i)|^{-1/2} \sum_{j' \in J(i)} y_{j'} \Big) + |N^k(j;i)|^{-1/2} \sum_{j' \in N^k(j;i)} (r_{i,j'} - b_{i,j'})\, W_{j,j'} + |R^k(j;i)|^{-1/2} \sum_{j' \in R^k(j;i)} C_{j,j'},

where the b_{i,j'} are a baseline fit. Here the sum involving the y_{j'} makes the v_j'u_i SVD component "implicit information aware." The sets N^k(j;i) and R^k(j;i) represent neighborhoods based on the explicit and implicit information, respectively, while the last sum is the implicit neighborhood-based kNN term. This model is among the best that have been devised for the Netflix problem.

12. IN PURSUIT OF EPSILON: ENSEMBLE METHODS

Meeting the 10% RMSE reduction requirement of the Netflix contest proved to be impossible using any single statistical procedure, or even by combining only a small number of procedures. BellKor's 2007 Progress Prize submission, for instance, involved linear combinations of 107 different prediction methods. These were based on variations on themes, refitting with different tuning parameters, and different methods of regularization (quadratic, lasso and flexible normal priors). BellKor applied such variations to both movie- and user-oriented versions of kNN, both multinomial and Gaussian versions of RBM, as well as to various versions of SVD. Residuals from global and other fits were used, as were covariates as well as time effects. The 2008 Progress Prize submission involved more of the same, blending over 100 different fitting methods, and, in particular, modeling time effects in deeper detail; the individual models were all fit using gradient descent algorithms. Finally, the Grand Prize winning submission was based on a complex blending of no fewer than 800 models.

Several considerations underpin the logic of combining models. First, different methods pick up subtly different aspects of the data, so the nature of the errors made by different models differs; combining therefore improves predictions. Second, prediction methods fare differently across various strata of the data, and user behavior across such strata differs as well. For instance, users who rated thousands of movies differ from those who rated only a few. If regularized (e.g., ridge) regression is used to combine estimators, the data can be partitioned according (say) to support (i.e., based on the J_i and/or I_j), and separate regressions fit in each set, with the ridge parameters selected using cross-validation.

A third consideration is related to the absence of any unique way to approach estimation and prediction in complex highly parameterized models. In the literature, ways of combining predictions from many versions of many methods are referred to as ensemble methods and are known to be highly effective. (See, e.g., Chapter 16 of Hastie, Tibshirani and Friedman, 2009.) In fact, the Netflix problem provides a striking and quintessential demonstration of this phenomenon.

Various methods for blending (combining) models were used to significantly improve prediction performance in the Netflix contest and are described in Toscher and Jahrer (2008), Toscher, Jahrer and Bell (2009) and Toscher, Jahrer and Legenstein (2010). These include kernel ridge regression blending, kNN blending, bagged gradient boosted decision trees and neural net blending. Modeling the residuals from other models provided useful inputs, for example, applying kNN on RBM residuals. It was found that linear blending could be significantly outperformed and that neural net blending combined with bagging was among the more accurate of the proposed methods. Essentially, individual models were fit on the training data while blending was done on the probe set, as it represented the distribution of user/movie ratings to be optimized over. In their winning submission, Toscher, Jahrer and Bell (2009) noted that optimizing the RMSE of individual predictors is not optimal when only the RMSE of the ensemble counts; they implemented sequential fitting-while-blending procedures as well as ensembles-of-blends.
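As a minimal illustration of the simplest of these, regularized linear blending, the sketch below (ours) solves a ridge system for blend weights on probe-set predictions; the names P_probe, y_probe and P_quiz, and the penalty value, are illustrative assumptions.

```python
import numpy as np

def ridge_blend_weights(P, y, lam=1.0):
    """Ridge blending: P is n_probe x n_models (one column of predictions
    per model), y holds the true probe ratings; returns blend weights."""
    n_models = P.shape[1]
    A = P.T @ P + lam * np.eye(n_models)
    return np.linalg.solve(A, P.T @ y)

# Usage sketch: fit weights on the probe set, apply them to new cases.
# w = ridge_blend_weights(P_probe, y_probe)
# blended = P_quiz @ w
```

Partitioning the probe cases by support, as described above, simply means calling ridge_blend_weights separately within each stratum and selecting lam by cross-validation.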
Further details may be found in the cited papers. Some general discussion of ensemble methods is given in Hastie et al. (2009), Chapter 16. See also DeCoste (2006) and Toscher, Jahrer and Legenstein (2010).

It should be noted that although the Netflix contest required combining very large numbers of prediction methods, good collaborative filtering procedures do not. In fact, predictions of good quality can usually be obtained by combining a small number of judiciously chosen methods.

13. NUMERICAL ISSUES

The scale of the Netflix data demands attention to numerical issues and limits the range of algorithms that can be implemented; we make a few remarks concerning our computations. We used a PC with 8 GB of RAM, driven by a 3 GHz, four-core "Intel Core 2 Extreme X9650" processor. Our computations were mainly carried out using compiled C++ code called within a 64-bit version of MatLab running on a 64-bit Windows machine.

For speed of computation, storing all data in RAM was critical. Briefly, we did this by vectorizing the data in two ways: in the first, ratings were sorted by user and then by movie within user; and in the second, conversely. A separate vector carried the ratings dates. Two other vectors carried identifiers for the users and for the movies; those vectors were shortened considerably by only keeping track of the indices at which a new user or a new movie began. Ratings were stored as "single" (4 bytes per data point, so 400 MB for all ratings) and user numbers as "Int32" (long integer, using 400 MB). As not all variables are required by any particular algorithm, and the dates often were not needed in our work, we could often compress all required data into less than 1 GB of RAM. Takacs et al. (2007) also discuss methods for avoiding the swapping of data to and from disk.
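The sorted-plus-index layout described above is essentially compressed sparse row storage. The sketch below (ours) builds the user-major version from flat arrays; it assumes, for illustration, that user ids are contiguous integers starting at 0.

```python
import numpy as np

def build_user_index(user_ids, movie_ids, ratings):
    """Sort ratings by user (then by movie within user) and return a
    compressed pointer array: user_ptr[k] .. user_ptr[k+1] delimit the
    ratings of user k in the sorted movie and rating arrays."""
    order = np.lexsort((movie_ids, user_ids))   # by user, then movie
    u = user_ids[order]
    m = movie_ids[order].astype(np.int32)       # 4 bytes per entry
    r = ratings[order].astype(np.float32)       # 4 bytes per entry
    n_users = int(u.max()) + 1
    user_ptr = np.searchsorted(u, np.arange(n_users + 1))
    return user_ptr, m, r
```

Building the movie-major twin is the same call with the key order reversed; with both in RAM, every per-user and per-movie pass in the fitting algorithms becomes a contiguous scan.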
Except for the RBM, we implemented many of the methods discussed in this paper, as well as many others proposed in the literature. In general, we found that gradient descent methods (stopping when the RMSE on the probe set is minimized) worked effectively. For SVD, for example, one full pass through the training data using our setup took approximately 3 seconds; thus, fitting a regularized SVD of rank 40 (which required approximately 4000–6000 passes through the data) took approximately 4–6 hours.

14. CONCLUDING REMARKS

The Netflix challenge was unusual for the breadth of the statistical problems it raised and illustrated, and for how closely those problems lie at the frontiers of recent research. Few data sets are available, of such dimensions, that allow both theoretical ideas and applied methods to be developed and tested to quite this extent. This data set is also noteworthy for its potential to draw together such diverse research communities. It is to be hoped that similar contributions could be made by other such contests in the future.

In this paper, we discussed many key ideas that have been proposed by a large number of individuals and teams, and tried to contribute a few ideas and insights of our own. To provoke by trivializing, let us propose that there is one undercurrent which underlies most of what we have discussed. Thus, ANOVA/baseline values can all be produced by the features of an SVD. Likewise, fits from an SVD can be used to define kNN neighborhoods. And finally, the internal structure of an RBM is, in essence, analogous to a kind of SVD. Hence, if there is a single undercurrent, it surely is the SVD; that, plus covariates, plus a variety of other effects. What complicates this picture are the dimensions of the problem and of its parameterization, together with the ensuing requirements for regularization, and the difficulties (particularly the inaccuracies) of the estimation.

For the Netflix problem, an interesting question to speculate on is: What is the absolutely best attainable RMSE? At one time, the 10% improvement barrier seemed insurmountable. But the algorithms of the winning and runner-up teams ultimately tied to produce a 10.06% improvement (Test RMSE 0.8567) over the contest's baseline. When the prediction methods of these two top teams are combined using a 50/50 blend, the resulting improvement is 10.19% (Test RMSE 0.8555); see http://www.the-ensemble.com.

The Netflix challenge also raises new questions. Some of these have already been under active research in recent years, while others pose new questions about problems that had been thought of as having been understood. For example, in the context of data sets of this size, how can one deal most effectively with optimization under nonconvexity, as occurs, for instance, in very sparse SVD? Can better algorithms be devised for fitting RBM models, for having them converge to global optima, and for deciding on early stopping for regularization purposes? Furthermore, currently available theoretical results for determining optimal cross-validation parameters are based on contexts in which the distributions of the training data and of the cases for which predictions are required are the same. Can these theoretical results be effectively extended to cover cases in which the training and test sets are not identically distributed? The Netflix problem also highlights the value of further work to gain still deeper understanding of issues and methods surrounding penalization, shrinkage and regularization, general questions about bagging, boosting and ensemble methods, as well as of the trade-offs between model complexity and prediction accuracy. Related to this are questions about choosing effective priors in empirical Bayes contexts (particularly if the number of parameters is potentially infinite), and of the consequences of choosing them suboptimally. What, for example, are the trade-offs between using a regularized model having a very large number of parameters, as compared to using a model having still more parameters but stronger regularization? For instance, if two SVD models are fit using different numbers of features, but with penalization arranged so that the effective number of degrees of freedom of both models is the same, can one deal theoretically with questions concerning which model is better? And finally, can general guidelines be developed, with respect to producing effective ensembles of predictors, which apply to the modeling of large data sets requiring extensive parameterization? Such questions are among the legacies of the challenge unleashed by the Netflix contest.
ACKNOWLEDGMENTS

The authors thank Netflix Inc. for their scientific contribution in making this exceptional data set public, and for conducting a remarkable contest. This work was supported by grants from the Natural Sciences and Engineering Research Council of Canada. The work of Y.H. was additionally supported by a grant from Google Inc., and by a Fields-MITACS Undergraduate Summer Research Award. The authors thank the referees for their thoughtful feedback on our manuscript.

REFERENCES

ACM SIGKDD (2007). KDD Cup and Workshop 2007. Available at www.cs.uic.edu/~liub/Netflix-KDD-Cup-2007.html.
Adomavicius, G. and Tuzhilin, A. (2005). Towards the next generation of recommender systems: A survey of the state-of-the-art and possible extensions. IEEE Transactions on Knowledge and Data Engineering 17 634–749.
Barbieri, M. M. and Berger, J. O. (2004). Optimal predictive model selection. Ann. Statist. 32 870–897. MR2065192
Baron, A. (1984). Predicted squared error: A criterion for automatic model selection. In Self-Organizing Methods in Modeling (S. Farrow, ed.). Marcel Dekker, New York.
Bell, R. and Koren, Y. (2007a). Lessons from the Netflix Prize challenge. ACM SIGKDD Explorations Newsletter 9 75–79.
Bell, R. and Koren, Y. (2007b). Improved neighborhood-based collaborative filtering. In Proc. KDD Cup and Workshop 2007 7–14. ACM, New York.
Bell, R. and Koren, Y. (2007c). Scalable collaborative filtering with jointly derived neighborhood interpolation weights. In Proc. Seventh IEEE Int. Conf. on Data Mining 43–52. IEEE Computer Society, Los Alamitos, CA.
Bell, R., Koren, Y. and Volinsky, C. (2007a). Modeling relationships at multiple scales to improve accuracy of large recommender systems. In Proc. 13th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining 95–104. ACM, New York.
Bell, R., Koren, Y. and Volinsky, C. (2007b). The BellKor solution to the Netflix Prize. Available at http://www.research.att.com/~volinsky/netflix/ProgressPrizes2007BellKorSolution.pdf.
Bell, R., Koren, Y. and Volinsky, C. (2007c). Chasing $1,000,000: How we won the Netflix Progress Prize. ASA Statistical and Computing Graphics Newsletter 18 4–12.
Bell, R., Koren, Y. and Volinsky, C. (2008). The BellKor 2008 solution to the Netflix Prize. Available at http://www.netflixprize.com/assets/ProgressPrize2008BellKor.pdf.
Bell, R. M., Bennett, J., Koren, Y. and Volinsky, C. (2009). The million dollar programming prize. IEEE Spectrum 46 28–33.
Bennett, J. and Lanning, S. (2007). The Netflix Prize. In Proc. KDD Cup and Workshop 2007 3–6. ACM, New York.
Berger, J. (1982). Bayesian robustness and the Stein effect. J. Amer. Statist. Assoc. 77 358–368. MR0664679
Bishop, C. M. (1995). Neural Networks for Pattern Recognition. Clarendon Press, New York. MR1385195
Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer, New York. MR2247587
Breiman, L. (1996). Bagging predictors. Machine Learning 26 123–140.
Breiman, L. and Friedman, J. H. (1997). Predicting multivariate responses in multiple linear regression (with discussion). J. Roy. Statist. Soc. Ser. B 59 3–54. MR1436554
Burges, C. (1998). A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery 2 121–167.
Candes, E. and Plan, Y. (2009). Matrix completion with noise. Technical report, Caltech.
Candes, E. and Tao, T. (2007). The Dantzig selector: Statistical estimation when p is much larger than n. Ann. Statist. 35 2313–2351. MR2382644
Canny, J. F. (2002). Collaborative filtering with privacy via factor analysis. In Proc. 25th Annual Int. ACM SIGIR Conf. on Research and Development in Information Retrieval 238–245. ACM, New York.
Carlin, B. P. and Louis, T. A. (1996). Bayes and Empirical Bayes Methods for Data Analysis. Monogr. Statist. Appl. Probab. 69. Chapman & Hall, London. MR1427749
Casella, G. (1985). An introduction to empirical Bayes data analysis. Amer. Statist. 39 83–87. MR0789118
Chien, Y. H. and George, E. (1999). A Bayesian model for collaborative filtering. In Online Proc. 7th Int. Workshop on Artificial Intelligence and Statistics. Fort Lauderdale, FL.
Christianini, N. and Shawe-Taylor, J. (2000). An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods. Cambridge Univ. Press, Cambridge.
Cohen, W. W., Schapire, R. E. and Singer, Y. (1999). Learning to order things. J. Artificial Intelligence Res. 10 243–270 (electronic). MR1692680
Copas, J. B. (1983). Regression, prediction and shrinkage. J. Roy. Statist. Soc. Ser. B 45 311–354. MR0737642
DeCoste, D. (2006). Collaborative prediction using ensembles of maximum margin matrix factorizations. In Proc. 23rd Int. Conf. on Machine Learning 249–256. ACM, New York.
Deerwester, S. C., Dumais, S. T., Landauer, T. K., Furnas, G. W. and Harshman, R. A. (1990). Indexing by latent semantic analysis. Journal of the American Society of Information Science 41 391–407.
Dempster, A. P., Laird, N. M. and Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm (with discussion). J. Roy. Statist. Soc. Ser. B 39 1–38. MR0501537
Efron, B. (1975). Biased versus unbiased estimation. Advances in Math. 16 259–277. MR0375555

Efron, B. (1983). Estimating the error rate of a prediction rule: Improvement on cross-validation. J. Amer. Statist. Assoc. 78 316–331. MR0711106
Efron, B. (1986). How biased is the apparent error rate of a prediction rule? J. Amer. Statist. Assoc. 81 461–470. MR0845884
Efron, B. (1996). Empirical Bayes methods for combining likelihoods (with discussion). J. Amer. Statist. Assoc. 91 538–565. MR1395725
Efron, B. (2004). The estimation of prediction error: Covariance penalties and cross-validation (with discussion). J. Amer. Statist. Assoc. 99 619–642. MR2090899
Efron, B. and Morris, C. (1971). Limiting the risk of Bayes and empirical Bayes estimators. I. The Bayes case. J. Amer. Statist. Assoc. 66 807–815. MR0323014
Efron, B. and Morris, C. (1972a). Limiting the risk of Bayes and empirical Bayes estimators. II. The empirical Bayes case. J. Amer. Statist. Assoc. 67 130–139. MR0323015
Efron, B. and Morris, C. (1972b). Empirical Bayes on vector observations: An extension of Stein's method. Biometrika 59 335–347. MR0334386
Efron, B. and Morris, C. (1973a). Stein's estimation rule and its competitors—an empirical Bayes approach. J. Amer. Statist. Assoc. 68 117–130. MR0388597
Efron, B. and Morris, C. (1973b). Combining possibly related estimation problems (with discussion). J. Roy. Statist. Soc. Ser. B 35 379–421. MR0381112
Efron, B. and Morris, C. (1975). Data analysis using Stein's estimator and its generalization. J. Amer. Statist. Assoc. 70 311–319.
Efron, B. and Morris, C. (1977). Stein's paradox in statistics. Scientific American 236 119–127.
Efron, B., Hastie, T., Johnstone, I. and Tibshirani, R. (2004). Least angle regression (with discussion). Ann. Statist. 32 407–499. MR2060166
Fan, J. and Li, R. (2006). Statistical challenges with high dimensionality: Feature selection in knowledge discovery. In International Congress of Mathematicians III 595–622. Eur. Math. Soc., Zürich. MR2275698
Friedman, J. (1994). An overview of predictive learning and function approximation. In From Statistics to Neural Networks (V. Cherkassky, J. Friedman and H. Wechsler, eds.). NATO ISI Series F 136. Springer, New York.
Funk, S. (2006/2007). See Webb, B. (2006/2007).
Gorrell, G. and Webb, B. (2006). Generalized Hebbian algorithm for incremental latent semantic analysis. Technical report, Linköping Univ., Sweden.
Greenshtein, E. and Ritov, Y. (2004). Persistence in high-dimensional linear predictor selection and the virtue of overparametrization. Bernoulli 10 971–988. MR2108039
Hastie, T., Tibshirani, R. and Friedman, J. (2009). The Elements of Statistical Learning, 2nd ed. Springer, New York.
Herlocker, J. L., Konstan, J. A., Borchers, A. and Riedl, J. (1999). An algorithmic framework for performing collaborative filtering. In Proc. 22nd ACM SIGIR Conf. on Information Retrieval 230–237.
Herlocker, J. L., Konstan, J. A. and Riedl, J. T. (2000). Explaining collaborative filtering recommendations. In Proc. 2000 ACM Conf. on Computer Supported Cooperative Work 241–250. ACM, New York.
Herlocker, J. L., Konstan, J. A., Terveen, L. G. and Riedl, J. T. (2004). Evaluating collaborative filtering recommender systems. ACM Transactions on Information Systems 22 5–53.
Hertz, J., Krogh, A. and Palmer, R. G. (1991). Introduction to the Theory of Neural Computation. Addison-Wesley, Redwood City, CA. MR1096298
Hill, W., Stead, L., Rosenstein, M. and Furnas, G. (1995). Recommending and evaluating choices in a virtual community of use. In Proc. SIGCHI Conf. on Human Factors in Computing Systems 194–201. ACM, New York.
Hinton, G. E. (2002). Training products of experts by minimizing contrastive divergence. Neural Comput. 14 1771–1800.
Hofmann, T. (2001a). Unsupervised learning by probabilistic latent semantic analysis. Mach. Learn. 42 177–196.
Hofmann, T. (2001b). Learning what people (don't) want. In Proc. European Conf. on Machine Learning. Lect. Notes Comput. Sci. Eng. 2167 214–225. Springer, Berlin.
Hofmann, T. (2004). Latent semantic models for collaborative filtering. ACM Transactions on Information Systems 22 89–115.
Hofmann, T. and Puzicha, J. (1999). Latent class models for collaborative filtering. In Proc. Int. Joint Conf. on Artificial Intelligence 2 688–693. Morgan Kaufmann, San Francisco, CA.
Hu, Y., Koren, Y. and Volinsky, C. (2008). Collaborative filtering for implicit feedback datasets. Technical report, AT&T Labs—Research, Florham Park, NJ.
Izenman, A. J. (2008). Modern Multivariate Statistical Techniques: Regression, Classification, and Manifold Learning. Springer, New York. MR2445017
James, W. and Stein, C. (1961). Estimation with quadratic loss. In Proc. 4th Berkeley Sympos. Math. Statist. Probab. I 361–379. Univ. California Press, Berkeley, CA. MR0133191
Kim, D. and Yum, B. (2005). Collaborative filtering based on iterative principal component analysis. Expert Systems with Applications 28 823–830.
Koren, Y. (2008). Factorization meets the neighborhood: A multifaceted collaborative filtering model. In Proc. 14th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining 426–434. ACM, New York.
Koren, Y. (2009). Collaborative filtering with temporal dynamics. In Proc. 15th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining 447–456. ACM, New York.
Koren, Y. (2010). Factor in the neighbors: Scalable and accurate collaborative filtering. ACM Transactions on Knowledge Discovery from Data 4 Article 1.
Koren, Y., Bell, R. and Volinsky, C. (2009). Matrix factorization techniques for recommender systems. Computer 42 (8) 30–37.
Li, K.-C. (1985). From Stein's unbiased risk estimates to the method of generalized cross validation. Ann. Statist. 13 1352–1377. MR0811497
Lim, Y. J. and Teh, Y. W. (2007). Variational Bayesian approach to movie rating predictions. In Proc. KDD Cup and Workshop 2007 15–21. ACM, New York.

Little, R. J. A. and Rubin, D. B. (1987). Statistical Analysis with Missing Data. Wiley, New York. MR0890519
Mallows, C. (1973). Some comments on Cp. Technometrics 15 661–675.
Maritz, J. S. and Lwin, T. (1989). Empirical Bayes Methods, 2nd ed. Monogr. Statist. Appl. Probab. 35. Chapman & Hall, London. MR1019835
Marlin, B. (2004). Collaborative filtering: A machine learning perspective. M.Sc. thesis, Computer Science Dept., Univ. Toronto.
Marlin, B. and Zemel, R. S. (2004). The multiple multiplicative factor model for collaborative filtering. In Proc. 21st Int. Conf. on Machine Learning. ACM, New York.
Marlin, B., Zemel, R. S., Roweis, S. and Slaney, M. (2007). Collaborative filtering and the missing at random assumption. In Proc. 23rd Conf. on Uncertainty in Artificial Intelligence. ACM, New York.
Moguerza, J. M. and Muñoz, A. (2006). Support vector machines with applications. Statist. Sci. 21 322–336. MR2339130
Moody, J. E. (1992). The effective number of parameters: An analysis of generalization and regularization in nonlinear learning systems. In Advances in Neural Information Processing Systems 4. Morgan Kaufmann, San Francisco, CA.
Morris, C. N. (1983). Parametric empirical Bayes inference: Theory and applications (with discussion). J. Amer. Statist. Assoc. 78 47–65. MR0696849
Narayanan, A. and Shmatikov, V. (2008). Robust de-anonymization of large datasets (How to break anonymity of the Netflix Prize dataset). Preprint.
Neal, R. M. and Hinton, G. E. (1998). A view of the EM algorithm that justifies incremental, sparse and other variants. In Learning in Graphical Models (M. I. Jordan, ed.) 355–368. Kluwer.
Netflix Inc. (2006/2010). Netflix Prize webpage: http://www.netflixprize.com/. Netflix Prize Leaderboard: http://www.netflixprize.com/leaderboard/. Netflix Prize Forum: http://www.netflixprize.com/community/.
Oard, D. and Kim, J. (1998). Implicit feedback for recommender systems. In Proc. AAAI Workshop on Recommender Systems 31–36. AAAI, Menlo Park, CA.
Park, S. T. and Pennock, D. M. (2007). Applying collaborative filtering techniques to movie search for better ranking and browsing. In Proc. 13th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining 550–559. ACM, New York.
Paterek, A. (2007). Improving regularized singular value decomposition for collaborative filtering. In Proc. KDD Cup and Workshop 2007 39–42. ACM, New York.
Piatetsky, G. (2007). Interview with Simon Funk. SIGKDD Explorations Newsletter 9 38–40.
Popescul, A., Ungar, L., Pennock, D. and Lawrence, S. (2001). Probabilistic models for unified collaborative and content-based recommendation in sparse-data environments. In Proc. 17th Conf. on Uncertainty in Artificial Intelligence 437–444. Morgan Kaufmann, San Francisco, CA.
Pu, P., Bridge, D. G., Mobasher, B. and Ricci, F., eds. (2008). Proc. ACM Conf. on Recommender Systems 2008. ACM, New York.
Raiko, T., Ilin, A. and Karhunen, J. (2007). Principal component analysis for large scale problems with lots of missing values. In ECML 2007. Lecture Notes in Artificial Intelligence 4701 (J. N. Kok et al., eds.) 691–698. Springer, Berlin.
Rennie, J. D. M. and Srebro, N. (2005). Fast maximum margin matrix factorization for collaborative prediction. In Proc. 22nd Int. Conf. on Machine Learning 713–719. ACM, New York.
Resnick, P. and Varian, H. R. (1997). Recommender systems. Communications of the ACM 40 56–58.
Resnick, P., Iacovou, N., Suchak, M., Bergstrom, P. and Riedl, J. (1994). Grouplens: An open architecture for collaborative filtering of netnews. In Proc. ACM Conf. on Computer Supported Cooperative Work 175–186.
Ripley, B. D. (1996). Pattern Recognition and Neural Networks. Cambridge Univ. Press, Cambridge. MR1438788
Robbins, H. (1956). An empirical Bayes approach to statistics. In Proc. 3rd Berkeley Sympos. Math. Statist. Probab. I 157–163. Univ. California Press, Berkeley. MR0084919
Robbins, H. (1964). The empirical Bayes approach to statistical decision problems. Ann. Math. Statist. 35 1–20. MR0163407
Robbins, H. (1983). Some thoughts on empirical Bayes estimation. Ann. Statist. 11 713–723. MR0707923
Roweis, S. (1997). EM algorithms for PCA and SPCA. In Advances in Neural Information Processing Systems 10 626–632. MIT Press, Cambridge, MA.
Salakhutdinov, R. and Mnih, A. (2008a). Probabilistic matrix factorization. In Advances in Neural Information Processing Systems 20 1257–1264. MIT Press, Cambridge, MA.
Salakhutdinov, R. and Mnih, A. (2008b). Bayesian probabilistic matrix factorization using MCMC. In Proc. 25th Int. Conf. on Machine Learning.
Salakhutdinov, R., Mnih, A. and Hinton, G. (2007). Restricted Boltzmann machines for collaborative filtering. In Proc. 24th Int. Conf. on Machine Learning. ACM International Conference Proceeding Series 227 791–798. ACM, New York.
Sali, S. (2008). Movie rating prediction using singular value decomposition. Technical report, Univ. California, Santa Cruz.
Sarwar, B., Karypis, G., Konstan, J. and Riedl, J. T. (2000). Application of dimensionality reduction in recommender system—a case study. In Proc. ACM WebKDD Workshop. ACM, New York.
Sarwar, B., Karypis, G., Konstan, J. and Riedl, J. T. (2001). Item-based collaborative filtering recommendation algorithms. In Proc. 10th Int. Conf. on the World Wide Web 285–295. ACM, New York.
Srebro, N. and Jaakkola, T. (2003). Weighted low-rank approximations. In Proc. Twentieth Int. Conf. on Machine Learning (T. Fawcett and N. Mishra, eds.) 720–727. ACM, New York.
Srebro, N., Rennie, J. D. M. and Jaakkola, T. S. (2005). Maximum-margin matrix factorization. In Advances in Neural Information Processing Systems 17 1329–1336.
Stein, C. (1974). Estimation of the mean of a multivariate normal distribution. In Proceedings of the Prague Symposium on Asymptotic Statistics (Charles Univ., Prague, 1973) II 345–381. Charles Univ., Prague. MR0381062

Stein, C. M. (1981). Estimation of the mean of a multivariate normal distribution. Ann. Statist. 9 1135–1151. MR0630098
Stone, M. (1974). Cross-validatory choice and assessment of statistical predictions (with discussion). J. Roy. Statist. Soc. Ser. B 36 111–147. MR0356377
Takacs, G., Pilaszy, I., Nemeth, B. and Tikk, D. (2007). On the Gravity recommendation system. In Proc. KDD Cup and Workshop 2007 22–30. ACM, New York.
Takacs, G., Pilaszy, I., Nemeth, B. and Tikk, D. (2008a). Major components of the Gravity recommendation system. SIGKDD Explorations 9 80–83.
Takacs, G., Pilaszy, I., Nemeth, B. and Tikk, D. (2008b). Investigation of various matrix factorization methods for large recommender systems. In Proc. 2nd Netflix-KDD Workshop. ACM, New York.
Takacs, G., Pilaszy, I., Nemeth, B. and Tikk, D. (2008c). Matrix factorization and neighbor based algorithms for the Netflix Prize problem. In Proc. ACM Conf. on Recommender Systems 267–274. ACM, New York.
Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. J. Roy. Statist. Soc. Ser. B 58 267–288. MR1379242
Tintarev, N. and Masthoff, J. (2007). A survey of explanations in recommender systems. In Proc. 23rd Int. Conf. on Data Engineering Workshops 801–810. IEEE, New York.
Toscher, A. and Jahrer, M. (2008). The BigChaos solution to the Netflix Prize 2008. Technical report, commendo research and consulting, Köflach, Austria.
Toscher, A., Jahrer, M. and Bell, R. M. (2009). The BigChaos solution to the Netflix Grand Prize. Technical report, commendo research and consulting, Köflach, Austria.
Toscher, A., Jahrer, M. and Legenstein, R. (2008). Improved neighbourhood-based algorithms for large-scale recommender systems. In Proc. 2nd Netflix-KDD Workshop 2008. ACM, New York.
Toscher, A., Jahrer, M. and Legenstein, R. (2010). Combining predictions for accurate recommender systems. In Proc. 16th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining 693–701. ACM, Washington, DC.
Tuzhilin, A., Koren, Y., Bennett, C., Elkan, C. and Lemire, D. (2008). Proc. 2nd KDD Workshop on Large Scale Recommender Systems and the Netflix Prize Competition. ACM, New York.
Ungar, L. and Foster, D. (1998). Clustering methods for collaborative filtering. In Proc. Workshop on Recommendation Systems. AAAI Press, Menlo Park.
van Houwelingen, J. C. (2001). Shrinkage and penalized likelihood as methods to improve predictive accuracy. Statist. Neerlandica 55 17–34. MR1820605
Vapnik, V. N. (2000). The Nature of Statistical Learning Theory, 2nd ed. Springer, New York. MR1719582
Wang, J., de Vries, A. P. and Reinders, M. J. T. (2006). Unifying user-based and item-based collaborative filtering approaches by similarity fusion. In Proc. 29th Annual Int. ACM SIGIR Conf. on Research and Development in Information Retrieval 501–508. ACM, New York.
Webb, B. (aka Funk, S.) (2006/2007). "Blog" entries, 27 October 2006, 2 November 2006, 11 December 2006 and 17 August 2007. Available at http://sifter.org/~simon/journal/.
Wu, M. (2007). Collaborative filtering via ensembles of matrix factorizations. In Proc. KDD Cup and Workshop 2007 43–47. ACM, New York.
Ye, J. (1998). On measuring and correcting the effects of data mining and model selection. J. Amer. Statist. Assoc. 93 120–131. MR1614596
Yuan, M. and Lin, Y. (2005). Efficient empirical Bayes variable selection and estimation in linear models. J. Amer. Statist. Assoc. 100 1215–1225. MR2236436
Zhang, Y. and Koren, J. (2007). Efficient Bayesian hierarchical user modeling for recommendation systems. In Proc. 30th Int. ACM SIGIR Conf. on Research and Developments in Information Retrieval. ACM, New York.
Zhou, Y., Wilkinson, D., Schreiber, R. and Pan, R. (2008). Large scale parallel collaborative filtering for the Netflix Prize. In Proc. 4th Int. Conf. on Algorithmic Aspects in Information and Management. Lecture Notes in Comput. Sci. 5031 337–348. Springer, Berlin.
Zou, H., Hastie, T. and Tibshirani, R. (2006). Sparse principal component analysis. J. Comput. Graph. Statist. 15 265–286. MR2252527
Zou, H., Hastie, T. and Tibshirani, R. (2007). On the "degrees of freedom" of the lasso. Ann. Statist. 35 2173–2192. MR2363967