Lecture Notes 10: Matrix Factorization
Optimization-based data analysis, Fall 2017

1 Low-rank models

1.1 Rank-1 model

Consider the problem of modeling a quantity $y[i,j]$ that depends on two indices $i$ and $j$. To fix ideas, assume that $y[i,j]$ represents the rating assigned to a movie $i$ by a user $j$. If we have available a data set of such ratings, how can we predict new ratings for pairs $(i,j)$ that we have not seen before? A possible assumption is that some movies are more popular in general than others, and some users are more generous. This is captured by the following simple model
\[
y[i,j] \approx a[i]\, b[j]. \tag{1}
\]
The features $a$ and $b$ capture the respective contributions of the movie and the user to the rating. If $a[i]$ is large, movie $i$ receives good ratings in general. If $b[j]$ is large, user $j$ is generous: they give good ratings in general.

In the model (1) the two unknown parameters $a[i]$ and $b[j]$ are combined multiplicatively. As a result, if we fit these parameters to observed data by minimizing a least-squares cost function, the optimization problem is not convex. Figure 1 illustrates this for the function
\[
f(a,b) := (1 - ab)^2, \tag{2}
\]
which corresponds to an example where there is only one movie and one user and the rating equals one. Nonconvexity is problematic because if we use algorithms such as gradient descent to minimize the cost function, they may get stuck in local minima corresponding to parameter values that do not fit the data well. In addition, there is a scaling issue: the pairs $(a, b)$ and $(ac, b/c)$ yield the same cost for any constant $c \neq 0$. This motivates incorporating a constraint to restrict the magnitude of some of the parameters. For example, in Figure 1 we add the constraint $|a| = 1$, which is a nonconvex constraint set.

[Figure 1: Heat map and contour map of the function $(1-ab)^2$. The dashed lines correspond to the set $|a| = 1$.]

Assume that there are $m$ movies and $n$ users in the data set and that every user rates every movie. If we store the ratings in a matrix $Y$ such that $Y_{ij} := y[i,j]$, and the movie and user features in the vectors $\vec{a} \in \mathbb{R}^m$ and $\vec{b} \in \mathbb{R}^n$, then equation (1) is equivalent to
\[
Y \approx \vec{a}\,\vec{b}^{\,T}. \tag{3}
\]
Now consider the problem of fitting the model by solving the optimization problem
\[
\min_{\vec{a} \in \mathbb{R}^m,\, \vec{b} \in \mathbb{R}^n} \big\lVert Y - \vec{a}\,\vec{b}^{\,T} \big\rVert_F \quad \text{subject to} \quad \lVert \vec{a} \rVert_2 = 1. \tag{4}
\]
Note that $\vec{a}\,\vec{b}^{\,T}$ is a rank-1 matrix, and conversely any rank-1 matrix can be written in this form with $\lVert \vec{a} \rVert_2 = 1$ ($\vec{a}$ is equal to any of the columns normalized by its $\ell_2$ norm). The problem is consequently equivalent to
\[
\min_{X \in \mathbb{R}^{m \times n}} \lVert Y - X \rVert_F \quad \text{subject to} \quad \operatorname{rank}(X) = 1. \tag{5}
\]
By Theorem 2.10 in Lecture Notes 2, the solution $X_{\min}$ to this optimization problem is given by the truncated singular-value decomposition (SVD) of $Y$,
\[
X_{\min} = \sigma_1 \vec{u}_1 \vec{v}_1^{\,T}, \tag{6}
\]
where $\sigma_1$ is the largest singular value and $\vec{u}_1$ and $\vec{v}_1$ are the corresponding singular vectors. The corresponding solutions $\vec{a}_{\min}$ and $\vec{b}_{\min}$ to problem (4) are
\[
\vec{a}_{\min} = \vec{u}_1, \tag{7}
\]
\[
\vec{b}_{\min} = \sigma_1 \vec{v}_1. \tag{8}
\]
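The solution (6)-(8) can be computed directly with a standard SVD routine. The following is a minimal NumPy sketch; the synthetic ratings matrix and its dimensions are illustrative assumptions, not data from the notes. Keep in mind that the SVD determines $\vec{u}_1$ and $\vec{v}_1$ only up to a simultaneous sign flip.

```python
import numpy as np

# Synthetic fully observed ratings matrix Y (movies x users); the data
# and dimensions are illustrative, not taken from the notes.
rng = np.random.default_rng(0)
a_true = rng.uniform(1.0, 2.0, size=6)   # movie popularity
b_true = rng.uniform(1.0, 2.0, size=4)   # user generosity
Y = np.outer(a_true, b_true) + 0.1 * rng.standard_normal((6, 4))

# Best rank-1 approximation via the truncated SVD, as in (6)-(8).
U, s, Vt = np.linalg.svd(Y, full_matrices=False)
a_min = U[:, 0]            # unit-norm movie features, ||a||_2 = 1
b_min = s[0] * Vt[0, :]    # user features absorb the singular value
X_min = np.outer(a_min, b_min)

print(np.linalg.norm(Y - X_min, "fro"))  # Frobenius error of the rank-1 fit
```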
Example 1.1 (Movies). Bob, Molly, Mary and Larry rate the following six movies from 1 to 5:
\[
A := \begin{pmatrix}
1 & 1 & 5 & 4 \\
2 & 1 & 4 & 5 \\
4 & 5 & 2 & 1 \\
5 & 4 & 2 & 1 \\
4 & 5 & 1 & 2 \\
1 & 2 & 5 & 5
\end{pmatrix}
\begin{matrix}
\text{The Dark Knight} \\
\text{Spiderman 3} \\
\text{Love Actually} \\
\text{Bridget Jones's Diary} \\
\text{Pretty Woman} \\
\text{Superman 2}
\end{matrix}
\tag{9}
\]
where the columns correspond to Bob, Molly, Mary and Larry, in that order. To fit a low-rank model, we first subtract the average rating
\[
\mu := \frac{1}{mn} \sum_{i=1}^{m} \sum_{j=1}^{n} A_{ij} \tag{10}
\]
from each entry in the matrix to obtain a centered matrix $C$ and then compute its singular-value decomposition
\[
A - \mu \vec{1}\,\vec{1}^{\,T} = U S V^T = U \begin{pmatrix}
7.79 & 0 & 0 & 0 \\
0 & 1.62 & 0 & 0 \\
0 & 0 & 1.55 & 0 \\
0 & 0 & 0 & 0.62
\end{pmatrix} V^T, \tag{11}
\]
where $\vec{1}$ denotes a vector of ones (of dimension 6 on the left and 4 on the right). The fact that the first singular value is significantly larger than the rest suggests that the matrix may be well approximated by a rank-1 matrix. This is indeed the case:
\[
\mu \vec{1}\,\vec{1}^{\,T} + \sigma_1 \vec{u}_1 \vec{v}_1^{\,T} = \begin{pmatrix}
1.34\,(1) & 1.19\,(1) & 4.66\,(5) & 4.81\,(4) \\
1.55\,(2) & 1.42\,(1) & 4.45\,(4) & 4.58\,(5) \\
4.45\,(4) & 4.58\,(5) & 1.55\,(2) & 1.42\,(1) \\
4.43\,(5) & 4.56\,(4) & 1.57\,(2) & 1.44\,(1) \\
4.43\,(4) & 4.56\,(5) & 1.57\,(1) & 1.44\,(2) \\
1.34\,(1) & 1.19\,(2) & 4.66\,(5) & 4.81\,(5)
\end{pmatrix}
\begin{matrix}
\text{The Dark Knight} \\
\text{Spiderman 3} \\
\text{Love Actually} \\
\text{Bridget Jones's Diary} \\
\text{Pretty Woman} \\
\text{Superman 2}
\end{matrix}
\tag{12}
\]
For ease of comparison, the values of $A$ are shown in brackets. The first left singular vector is equal to
\[
\vec{u}_1 := \begin{pmatrix} -0.45 & -0.39 & 0.39 & 0.39 & 0.39 & -0.45 \end{pmatrix},
\]
with entries corresponding to The Dark Knight, Spiderman 3, Love Actually, Bridget Jones's Diary, Pretty Woman and Superman 2, respectively. This vector allows us to cluster the movies: movies with negative entries are similar (in this case action movies) and movies with positive entries are similar (in this case romantic movies). The first right singular vector is equal to
\[
\vec{v}_1 = \begin{pmatrix} 0.48 & 0.52 & -0.48 & -0.52 \end{pmatrix}, \tag{13}
\]
with entries corresponding to Bob, Molly, Mary and Larry. This vector allows us to cluster the users: negative entries indicate users that like action movies but hate romantic movies (Mary and Larry), whereas positive entries indicate the contrary (Bob and Molly).

1.2 Rank-r model

Our rank-1 matrix model is extremely simplistic: different people like different movies. In order to generalize it, we can consider $r$ factors that capture the dependence between the ratings and the movies/users,
\[
y[i,j] \approx \sum_{l=1}^{r} a_l[i]\, b_l[j]. \tag{14}
\]
The parameters of the model have the following interpretation:

• $a_l[i]$: movie $i$ is positively ($> 0$), negatively ($< 0$) or not ($\approx 0$) associated with factor $l$.
• $b_l[j]$: user $j$ likes ($> 0$), hates ($< 0$) or is indifferent ($\approx 0$) to factor $l$.

The model learns the factors directly from the data. In some cases, these factors may be interpretable (for example, they can be associated with the genre of the movie, as in Example 1.1, or the age of the user), but this is not always the case. The model (14) corresponds to a rank-r model
\[
Y \approx AB, \qquad A \in \mathbb{R}^{m \times r}, \; B \in \mathbb{R}^{r \times n}. \tag{15}
\]
We can fit such a model by computing the SVD of the data and truncating it. By Theorem 2.10 in Lecture Notes 2, the truncated SVD is the solution to
\[
\min_{A \in \mathbb{R}^{m \times r},\, B \in \mathbb{R}^{r \times n}} \lVert Y - AB \rVert_F \quad \text{subject to} \quad \lVert \vec{a}_1 \rVert_2 = 1, \ldots, \lVert \vec{a}_r \rVert_2 = 1. \tag{16}
\]
However, the SVD provides an estimate of the matrices $A$ and $B$ that constrains the columns of $A$ and the rows of $B$ to be orthogonal. As a result, these vectors do not necessarily correspond to interpretable factors.
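The truncated-SVD fit of (16) is straightforward to implement. The sketch below is a minimal NumPy illustration; `fit_rank_r` is a hypothetical helper, applied here to the ratings matrix of Example 1.1 so that with $r = 1$ it reproduces the singular values in (11) and, after adding back the mean, the rank-1 approximation (12).

```python
import numpy as np

def fit_rank_r(Y, r):
    """Fit Y ~ A @ B by truncating the SVD, as in (16)."""
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    return U[:, :r], s[:r, None] * Vt[:r, :]

# Ratings matrix from Example 1.1 (movies x users).
A = np.array([[1, 1, 5, 4],
              [2, 1, 4, 5],
              [4, 5, 2, 1],
              [5, 4, 2, 1],
              [4, 5, 1, 2],
              [1, 2, 5, 5]], dtype=float)
C = A - A.mean()                           # centered matrix, mu = 3
print(np.linalg.svd(C, compute_uv=False))  # approx. [7.79, 1.62, 1.55, 0.62]

A1, B1 = fit_rank_r(C, r=1)
print(A.mean() + A1 @ B1)                  # rank-1 approximation, cf. (12)
```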
2 Matrix completion

2.1 Missing data

The Netflix Prize was a contest organized by Netflix from 2007 to 2009 in which teams of data scientists tried to develop algorithms to improve the prediction of movie ratings. The problem of predicting ratings can be recast as that of completing a matrix from some of its entries, as illustrated in Figure 2. This problem is known as matrix completion.

[Figure 2: A depiction of the Netflix challenge in matrix form: a large array in which most entries are unobserved, marked "?". Each row corresponds to a user that ranks a subset of the movies, which correspond to the columns. The figure is due to Mahdi Soltanolkotabi.]

At first glance, the problem of completing a matrix such as this one
\[
\begin{pmatrix} 1 & ? & 5 \\ ? & 3 & 2 \end{pmatrix} \tag{17}
\]
may seem completely ill posed: we can just fill in the missing entries arbitrarily! In more mathematical terms, the completion problem is equivalent to an underdetermined system of equations,
\[
\begin{pmatrix}
1 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 1 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & 0 \\
0 & 0 & 0 & 0 & 0 & 1
\end{pmatrix}
\begin{pmatrix} M_{11} \\ M_{21} \\ M_{12} \\ M_{22} \\ M_{13} \\ M_{23} \end{pmatrix}
=
\begin{pmatrix} 1 \\ 3 \\ 5 \\ 2 \end{pmatrix}. \tag{18}
\]
In order to solve the problem, we need to make an assumption about the structure of the matrix that we aim to complete. In compressed sensing we make the assumption that the original signal is sparse. In the case of matrix completion, we make the assumption that the original matrix is low rank. This implies that there exists a high correlation between the entries of the matrix, which may make it possible to infer the missing entries from the observations. As a very simple example, consider the following matrix:
\[
\begin{pmatrix}
1 & 1 & 1 & 1 & ? & 1 \\
1 & 1 & 1 & 1 & 1 & 1 \\
1 & 1 & 1 & 1 & 1 & 1 \\
? & 1 & 1 & 1 & 1 & 1
\end{pmatrix} \tag{19}
\]
Setting the missing entries to 1 yields a rank-1 matrix, whereas setting them to any other number yields a rank-2 or rank-3 matrix.
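The rank argument for (19) is easy to verify numerically. The following is a small sketch, assuming the two missing entries may be filled independently; `completed_rank` is an illustrative helper built on `numpy.linalg.matrix_rank`.

```python
import numpy as np

def completed_rank(x, y):
    """Rank of the matrix in (19) with the two '?' entries set to x and y."""
    M = np.ones((4, 6))
    M[0, 4] = x   # the '?' in the first row
    M[3, 0] = y   # the '?' in the last row
    return np.linalg.matrix_rank(M)

print(completed_rank(1, 1))  # 1: the rank-1 completion
print(completed_rank(2, 1))  # 2: changing one entry raises the rank
print(completed_rank(2, 3))  # 3: changing both raises it further
```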