Binary Principal Component Analysis in the Netflix Collaborative Filtering Task
László Kozma, Alexander Ilin, Tapani Raiko
Helsinki University of Technology, Adaptive Informatics Research Center
P.O. Box 5400, FI-02015 TKK, Finland
firstname.lastname@tkk.fi

ABSTRACT

We propose an algorithm for binary principal component analysis (PCA) that scales well to very high-dimensional and very sparse data. Binary PCA finds components from data assuming Bernoulli distributions for the observations. The probabilistic approach allows for straightforward treatment of missing values. An example application is collaborative filtering using the Netflix data. The results are comparable with those reported for single methods in the literature, and through blending we are able to improve our previously obtained best result with PCA.

1. INTRODUCTION

Recommendation systems form an integral component of many modern e-commerce websites. Users often rate items on a numerical scale (e.g., from 1 to 5). If the system is capable of accurately predicting the rating a user would give to an item, it can use this knowledge to make recommendations to the user. The collected ratings can be represented in the form of a matrix in which each column contains the ratings given by one user and each row contains the ratings given to one item. Naturally, most of the entries in the matrix are missing (as any given user rates just a subset of the items), and the goal is to predict these missing values. This is known as the collaborative filtering task.

Obtaining accurate predictions is the central problem in research on recommendation systems. Although there are certain requirements that a practical recommendation system should fulfill (such as ease of interpretation, speed of response, and the ability to learn online), this work addresses only
the collaborative filtering problem.

Collaborative filtering is an interesting benchmark problem for the machine learning community because a great variety of methods is applicable to it. The competition recently organized by Netflix has stirred up the interest of many researchers with a one-million-dollar prize. A vast collection of approaches has been tried by the competing teams. Among the popular ones are nearest-neighbor methods on users, items, or both, and methods using matrix decomposition [1, 2, 3, 4, 5]. It seems that no single method can provide prediction accuracy comparable with the current leaders in the competition. The most successful teams use a blend of a large number of different algorithms [3]. Thus, designing methods that may not be extremely good in overall performance but which look at the data from a different angle may be beneficial.

The method presented in this work follows the matrix decomposition approach, which models the ratings using a linear combination of features (factors) characterizing the items and users. In contrast to other algorithms reported in the Netflix competition, we first binarize the ratings and then perform matrix decomposition on the binary data. The proposed model is closely related to the models presented in [6, 7, 8, 9]. Tipping [6] used a similar model for visualization of binary data, while Collins et al. [7] proposed a general method for the whole exponential family. Schein et al. [8] presented a practical algorithm for binary PCA and made extensive comparisons to show that it works better than traditional PCA in the case of binary data. Srebro and Jaakkola [9] extended the binary PCA model to the case where each element of the data has a weight, which includes the case with missing values (using a zero weight is the same as having a missing value in the data). The most important difference is that our algorithm is specially designed for high-dimensional data with many missing values, and therefore we use point estimates for the model parameters and extra regularization terms.

The paper is organized as follows. The proposed binary PCA algorithm is presented in Section 2. Its application to MovieLens and Netflix data is presented in Section 3, followed by the discussion.

2. METHOD

The approach deals with an unsupervised analysis of data in a matrix Y that contains three types of values: zeroes, ones, and missing values. We make the assumption that the data are missing at random, that is, we do not try to model when the values are observed but only whether the observed values are zeroes or ones.

2.1. Binarization

As a preprocessing step, the ratings are encoded with binary values according to the following scheme:

1 → 0000
2 → 0001
3 → 0011
4 → 0111
5 → 1111

With this scheme, each element in the data tells whether a rating is greater or smaller than a particular threshold. This kind of binarization can also be used for continuous-valued data.

Let us denote by X the sparsely populated matrix of ratings from 1 to 5 stars. X has dimensions m × p, where m and p are the number of movies and the number of people, respectively. After binarization, we obtain binary matrices X_1, X_2, X_3, X_4 (one for each digit of the rating values), which we summarize in one matrix Y with dimensions d × p, where d = 4m.

2.2. The model

We model the probability of each element y_ij of matrix Y being 1 using the following formula:

P(y_{ij} = 1) = \sigma(a_i^T s_j)    (1)

where a_i and s_j are column vectors with c elements and σ(·) is the logistic function:

\sigma(x) = \frac{1}{1 + e^{-x}}.    (2)

This can be understood as follows: the probability that the j-th person rates the i-th movie with a rating greater than l (assuming that y_ij is taken from the binary matrix X_l) depends on the combination of c features of the movie (collected in vector a_i) and c features of the person (collected in vector s_j).

The features of the movies and people are not known and have to be estimated to fit the available ratings. We summarize the features in two matrices

A = [a_1 \; \cdots \; a_d]^T    (3)
S = [s_1 \; \cdots \; s_p]    (4)

with dimensionalities d × c and c × p, respectively. In order to have a bias term b_i in the model σ(b_i + a_i^T s_j), we fix the last row of S to all ones. Then, the elements in the last column of A play the role of the bias terms b_i.

We express the likelihood of Y as a product of Bernoulli distributions:

L = \prod_{ij \in O} \sigma(a_i^T s_j)^{y_{ij}} \left(1 - \sigma(a_i^T s_j)\right)^{1 - y_{ij}},    (5)

where O denotes the set of elements in Y corresponding to observed ratings. We use Gaussian priors for A and S to make the solution less prone to overfitting:

P(s_{kj}) = \mathcal{N}(0, 1)    (6)
P(a_{ik}) = \mathcal{N}(0, v_k)    (7)
i = 1, \ldots, d; \quad k = 1, \ldots, c; \quad j = 1, \ldots, p.

Here we denote by N(µ, v) the Gaussian probability density function with mean µ and variance v. The prior variance for the elements in S is set to unity to fix the scaling indeterminacy of the model. We use a separate hyperparameter v_k for each column of A so that, if the number of components c is chosen to be too large, unnecessary components can go to zero when the corresponding v_k is close to zero. In practice we use a separate v_k for the different parts of Y corresponding to the matrices X_l, but we omit this detail here to simplify the notation.
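To make the preprocessing and the model concrete, here is a minimal NumPy sketch of the binarization of Section 2.1 and of Eq. (1). The function names, and the order in which the digit matrices X_1, ..., X_4 are stacked into Y, are our own choices; the paper does not prescribe an implementation.

```python
import numpy as np

# Encoding scheme of Section 2.1: each rating maps to four binary digits,
# and each digit tells whether the rating exceeds a particular threshold.
CODES = {1: (0, 0, 0, 0), 2: (0, 0, 0, 1), 3: (0, 0, 1, 1),
         4: (0, 1, 1, 1), 5: (1, 1, 1, 1)}

def binarize_ratings(X):
    """Stack the four digit matrices X_l into Y with d = 4m rows.

    X: m x p array of ratings in {1,...,5}, with np.nan for missing entries.
    Missing ratings remain np.nan in all four stacked blocks.
    """
    m, p = X.shape
    Y = np.full((4 * m, p), np.nan)
    for rating, code in CODES.items():
        rows, cols = np.where(X == rating)
        for l, digit in enumerate(code):
            Y[l * m + rows, cols] = digit
    return Y

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def rating_prob(A, S):
    """P(y_ij = 1) = sigma(a_i^T s_j), Eq. (1). A: d x c, S: c x p."""
    return sigmoid(A @ S)
```

Given Y, the matrices A and S (with the last row of S fixed to ones) are then estimated as described next.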
2.3. Learning

We use simple maximum a posteriori estimation for the model parameters A, S, and v_k to make the method scale well to high-dimensional problems. The logarithm of the posterior of the unknown parameters is

F = \sum_{ij \in O} \left[ y_{ij} \log \sigma(a_i^T s_j) + (1 - y_{ij}) \log\left(1 - \sigma(a_i^T s_j)\right) \right] - \sum_{i=1}^{d} \sum_{k=1}^{c} \left[ \frac{1}{2} \log 2\pi v_k + \frac{a_{ik}^2}{2 v_k} \right] - \sum_{k=1}^{c} \sum_{j=1}^{p} \left[ \frac{1}{2} \log 2\pi + \frac{s_{kj}^2}{2} \right].    (8)

We perform gradient-based maximization of F w.r.t. the parameters A and S. The required derivatives are

\frac{\partial F}{\partial a_i} = \sum_{j : ij \in O} s_j \left( y_{ij} - \sigma(a_i^T s_j) \right) - \mathrm{diag}(1/v_k) \, a_i    (9)

\frac{\partial F}{\partial s_j} = \sum_{i : ij \in O} a_i \left( y_{ij} - \sigma(a_i^T s_j) \right) - s_j,    (10)

where diag(1/v_k) is a diagonal matrix with 1/v_k on the main diagonal. Here we rely on the property of the sigmoid function σ′ = σ(1 − σ). We use the gradient-ascent update rules

a_i \leftarrow a_i + \alpha \frac{\partial F}{\partial a_i}    (11)

s_j \leftarrow s_j + \alpha \sqrt{\frac{p}{d}} \, \frac{\partial F}{\partial s_j}    (12)

where the scaling \sqrt{p/d} accounts for the fact that the number of rows and the number of columns in the data can differ significantly. Such scaling is a crude but effective approximation of Newton's method. We use a simple strategy for selecting the learning rate: α is halved if the update leads to a decrease of F, and α is increased by 20% on the next iteration if the update is successful.

[Fig. 1. Loss function − log P(y_ij) as a function of a_i^T s_j in the case where the observation is 1 (solid line) or not (dashed line). Left: binary PCA. Right: traditional PCA (where binarization −1, 1 is used for symmetry).]

The loss function of binary PCA is monotonic, whereas traditional PCA can generate predictions outside of the valid range: when the prediction a_i^T s_j is greater than 1, there is an incentive in the learning criterion to change the prediction in the negative direction.

The variance parameters v_k are point-estimated using a simple update rule which maximizes F:

v_k = \frac{1}{d} \sum_{i=1}^{d} a_{ik}^2,    (13)

which follows from setting \partial F / \partial v_k = 0 in Eq. (8).
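As an illustration, one sweep of the learning procedure above might be sketched in NumPy as follows. This is our own simplified rendering, not the authors' implementation: it tracks only the likelihood part of F for the step-size control, omits the per-block v_k bookkeeping mentioned in Section 2.2, and rejects a step that decreases the objective (the text does not specify whether such a step is kept).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_step(Y, A, S, v, alpha, F_prev):
    """One gradient-ascent update of A and S following Eqs. (9)-(12).

    Y: d x p binary matrix with np.nan for missing entries.
    A: d x c feature matrix; S: c x p (last row fixed to ones for the biases).
    v: length-c vector of prior variances v_k for the columns of A.
    """
    d, p = Y.shape
    observed = ~np.isnan(Y)                 # the set O of observed entries
    P = sigmoid(A @ S)                      # sigma(a_i^T s_j) for all i, j
    R = np.where(observed, Y - P, 0.0)      # y_ij - sigma(...), 0 if missing

    grad_A = R @ S.T - A / v                # Eq. (9), all rows at once
    grad_S = A.T @ R - S                    # Eq. (10), all columns at once

    A_new = A + alpha * grad_A                      # Eq. (11)
    S_new = S + alpha * np.sqrt(p / d) * grad_S     # Eq. (12)
    S_new[-1, :] = 1.0                      # keep the bias row of S fixed

    # Bernoulli log-likelihood part of F (priors omitted in this sketch)
    P_new = np.clip(sigmoid(A_new @ S_new), 1e-12, 1 - 1e-12)
    F = np.nansum(Y * np.log(P_new) + (1 - Y) * np.log(1 - P_new))

    if F < F_prev:
        return A, S, alpha / 2.0, F_prev    # failed step: halve alpha
    return A_new, S_new, alpha * 1.2, F     # success: grow alpha by 20%

def update_v(A):
    """Point update of the prior variances, v_k = (1/d) sum_i a_ik^2, Eq. (13)."""
    return np.mean(A ** 2, axis=0)
```

Working with the full residual matrix R keeps the updates as dense matrix products; on the actual Netflix data, where Y is extremely sparse, the sums in Eqs. (9) and (10) would instead be accumulated over the observed entries only.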