Matrix Factorization and Collaborative Filtering
10-601 Introduction to Machine Learning
Machine Learning Department, School of Computer Science
Carnegie Mellon University

Matt Gormley
Lecture 25, April 19, 2017
MF Readings: (Koren et al., 2009)

Reminders
• Homework 8: Graphical Models
  – Release: Mon, Apr. 17
  – Due: Mon, Apr. 24 at 11:59pm
• Homework 9: Applications of ML
  – Release: Mon, Apr. 24
  – Due: Wed, May 3 at 11:59pm

Outline
• Recommender Systems
  – Content Filtering
  – Collaborative Filtering (CF)
  – CF: Neighborhood Methods
  – CF: Latent Factor Methods
• Matrix Factorization
  – Background: Low-rank Factorizations
  – Residual matrix
  – Unconstrained Matrix Factorization
    • Optimization problem
    • Gradient Descent, SGD, Alternating Least Squares
    • User/item bias terms (matrix trick)
  – Singular Value Decomposition (SVD)
  – Non-negative Matrix Factorization
• Extra: Matrix Multiplication in ML
  – Matrix Factorization
  – Linear Regression
  – PCA
  – (Autoencoders)
  – K-means

RECOMMENDER SYSTEMS

Recommender Systems
A Common Challenge:
– Assume you are a company selling items of some sort: movies, songs, products, etc.
– The company collects millions of ratings from the users of its items.
– To maximize profit / user happiness, you want to recommend items that users are likely to want.

Recommender Systems: Problem Setup
• 500,000 users
• 20,000 movies
• 100 million ratings
• Goal: obtain a lower root mean squared error (RMSE) than Netflix's existing system on 3 million held-out ratings

Recommender Systems
• Setup:
  – Items: movies, songs, products, etc. (often many thousands)
  – Users: watchers, listeners, purchasers, etc. (often many millions)
  – Feedback: 5-star ratings, not clicking 'next', purchases, etc.
• Key Assumptions:
  – Ratings can be represented numerically as a user/item matrix
  – Users only rate a small number of items (the matrix is sparse)

Example user/item rating matrix (blank = not rated):

             Doctor Strange   Star Trek: Beyond   Zootopia
  Alice            1                                  5
  Bob              3                  4
  Charlie          3                  5                2

Two Types of Recommender Systems

Content Filtering
• Example: Pandora.com music recommendations (Music Genome Project)
• Con: Assumes access to side information about items (e.g. properties of a song)
• Pro: Got a new item to add? No problem, just be sure to include the side information

Collaborative Filtering
• Example: Netflix movie recommendations
• Pro: Does not assume access to side information about items (e.g. does not need to know about movie genres)
• Con: Does not work on new items that have no ratings

COLLABORATIVE FILTERING

Collaborative Filtering
• Everyday examples of collaborative filtering...
  – Bestseller lists
  – Top 40 music lists
  – The "recent returns" shelf at the library
  – Unmarked but well-used paths through the woods
  – The printer room at work
  – "Read any good books lately?"
  – ...
• Common insight: personal tastes are correlated
  – If Alice and Bob both like X and Alice likes Y, then Bob is more likely to like Y – especially (perhaps) if Bob knows Alice
Slide from William Cohen

Two Types of Collaborative Filtering
1. Neighborhood Methods
2. Latent Factor Methods
Figures from Koren et al. (2009)

Two Types of Collaborative Filtering
1. Neighborhood Methods
In the figure, assume that a green line indicates the movie was watched.
Algorithm:
1. Find neighbors based on similarity of movie preferences
2. Recommend movies that those neighbors watched
Figures from Koren et al. (2009)
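The two-step neighborhood algorithm is easy to sketch in code. Below is a minimal user-based version in Python/NumPy; the binary watched matrix, the cosine-similarity choice, and names like recommend and n_neighbors are illustrative assumptions, not details from the lecture.

    import numpy as np

    # Rows = users, columns = movies; 1 means "watched" (the green
    # lines in the Koren et al. figure), 0 means "not watched".
    watched = np.array([
        [1, 1, 0, 0, 1],
        [1, 1, 0, 1, 0],
        [0, 1, 1, 0, 1],
        [1, 0, 0, 1, 0],
    ], dtype=float)

    def recommend(user, watched, n_neighbors=2):
        # Step 1: find neighbors by cosine similarity of movie preferences.
        norms = np.linalg.norm(watched, axis=1) * np.linalg.norm(watched[user])
        sims = (watched @ watched[user]) / np.maximum(norms, 1e-12)
        sims[user] = -np.inf                       # a user is not their own neighbor
        neighbors = np.argsort(sims)[-n_neighbors:]
        # Step 2: score movies by how many neighbors watched them,
        # then drop movies the user has already seen.
        scores = watched[neighbors].sum(axis=0)
        scores[watched[user] > 0] = 0.0
        return np.argsort(scores)[::-1]

    print(recommend(0, watched))   # movie indices, most recommended first

Latent factor methods, next, replace this explicit neighbor search with learned low-dimensional representations of users and movies.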
Two Types of Collaborative Filtering
2. Latent Factor Methods
• Assume that both movies and users live in some low-dimensional space describing their properties
• Recommend a movie based on its proximity to the user in the latent space
Figures from Koren et al. (2009)

MATRIX FACTORIZATION

Matrix Factorization
• Many different ways of factorizing a matrix
• We'll consider three:
  1. Unconstrained Matrix Factorization
  2. Singular Value Decomposition
  3. Non-negative Matrix Factorization
• MF is just another example of a common recipe:
  1. define a model
  2. define an objective function
  3. optimize with SGD

Matrix Factorization
Whiteboard
– Background: Low-rank Factorizations
– Residual matrix

Example: MF for the Netflix Problem
[Figure 3.7 from Aggarwal (2016): (a) an example rank-2 factorization R ≈ U Vᵀ of a 7-user × 6-movie ratings matrix, where the movies (Julius Caesar, Nero, Cleopatra, Sleepless in Seattle, Pretty Woman, Casablanca) and the users align with two latent factors, "history" and "romance"; (b) the residual matrix R − U Vᵀ, which is zero except for the few entries the rank-2 model cannot explain.]
Figures from Aggarwal (2016)
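A small sketch of the low-rank factorization and residual-matrix idea from the whiteboard, using made-up data loosely modeled on Aggarwal's Figure 3.7. Using a truncated SVD to get the best rank-2 approximation is an illustrative shortcut here (SVD is treated properly later in the outline), and it requires a fully observed matrix.

    import numpy as np

    # Made-up 7x6 ratings matrix: rows 0-2 like history films, rows 4-6
    # like romance films (and dislike history), row 3 likes both.
    history = np.array([1., 1., 1., 0., 0., 0.])
    romance = np.array([0., 0., 0., 1., 1., 1.])
    R = np.vstack([history, history, history,
                   history + romance,
                   romance - history, romance - history, romance - history])
    R[2, 2] = 0.0   # one user deviates from their group's taste

    # Best rank-2 approximation R ≈ U @ Vt via truncated SVD:
    # U (7x2) acts as the user factors, Vt (2x6) as the item factors.
    U_full, s, Vt_full = np.linalg.svd(R, full_matrices=False)
    k = 2
    U, Vt = U_full[:, :k] * s[:k], Vt_full[:k, :]

    residual = R - U @ Vt          # the residual matrix
    print(np.round(residual, 2))   # nonzero only where rank 2 cannot fit
    print("reconstruction error:", np.linalg.norm(residual))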
Regression vs. Collaborative Filtering
[Figure 3.1 from Aggarwal (2016), revisiting Figure 1.4 of Chapter 1: (a) in classification there is a clear split between training rows and test rows, and between independent variables and the dependent variable; (b) in collaborative filtering there is no demarcation between training and test rows, nor between dependent and independent variables. Shaded entries are missing and need to be predicted.]
Figures from Aggarwal (2016)

In traditional classification, the data form an m × n matrix whose first (n − 1) columns are independent (feature) variables and whose nth column is the class variable (or dependent variable). All entries in the first (n − 1) columns are fully specified, whereas only a subset of the entries in the nth column is specified. Therefore, a subset of the rows in the matrix is fully specified, and these rows are referred to as the training data. The remaining rows are referred to as the test data. The values of the missing entries need to be learned for the test data. This scenario is illustrated in Figure 3.1(a), where the shaded values represent missing entries in the matrix.

Unlike data classification, any entry in the ratings matrix may be missing, as illustrated by the shaded entries in Figure 3.1(b). Thus, the matrix completion problem is a generalization of the classification (or regression modeling) problem. The crucial differences between these two problems may be summarized as follows:
1. In the data classification problem, there is a clear separation between feature (independent) variables and class (dependent) variables. In the matrix completion problem, this clear separation does not exist: each column is both a dependent and an independent variable, depending on which entries are being considered for predictive modeling at a given point.
2. In the data classification problem, there is a clear separation between the training and test data. In the matrix completion problem, this clear demarcation does not exist among the rows of the matrix. At best, one can consider the specified (observed) entries to be the training data, and the unspecified (missing) entries to be the test data.
3. In data classification, columns represent features and rows represent data instances. In collaborative filtering, however, the same approach can be applied to either the ratings matrix or its transpose, because of how the missing entries are distributed: user-based neighborhood models work directly with the rows of the ratings matrix, while item-based models apply the same idea to its transpose.

UNCONSTRAINED MATRIX FACTORIZATION

Unconstrained Matrix Factorization
Whiteboard
– Optimization problem
– SGD
– SGD with Regularization
– Alternating Least Squares
– User/item bias terms (matrix trick)

Unconstrained Matrix Factorization
In-Class Exercise: Derive a block coordinate descent algorithm for the Unconstrained Matrix Factorization problem (a worked sketch of the resulting updates appears at the end of this section).
• User vectors: $w_u \in \mathbb{R}^r$
• Item vectors: $h_i \in \mathbb{R}^r$
• Rating prediction: $v_{ui} = w_u^T h_i$
• Set of non-missing entries: $Z = \{(u, i) : v_{ui} \text{ is observed}\}$
• Objective: $\arg\min_{w,h} \sum_{(u,i) \in Z} (v_{ui} - w_u^T h_i)^2$

Matrix factorization as SGD – why does this work? Here's the key claim:
Figures from Gemulla et al. (2011)

Matrix Factorization (with matrices)
• User vectors: $(W_{u*})^T \in \mathbb{R}^r$
• Item vectors: $H_{*i} \in \mathbb{R}^r$
• Rating prediction: $V_{ui} = W_{u*} H_{*i} = [WH]_{ui}$
Figures from Koren et al. (2009)

Matrix Factorization (with vectors)
• User vectors: $w_u \in \mathbb{R}^r$
• Item vectors: $h_i \in \mathbb{R}^r$
• Rating prediction: $v_{ui} = w_u^T h_i$
Figures from Koren et al. (2009)

Matrix Factorization (with vectors)
• Set of non-missing entries: $Z$
• Objective: $\arg\min_{w,h} \sum_{(u,i) \in Z} (v_{ui} - w_u^T h_i)^2$
Figures from Koren et al. (2009)

Matrix Factorization (with vectors)
• Regularized Objective: $\arg\min_{w,h} \sum_{(u,i) \in Z} (v_{ui} - w_u^T h_i)^2 + \lambda \Big( \sum_u \|w_u\|^2 + \sum_i \|h_i\|^2 \Big)$
Figures from Koren et al. (2009)
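For the in-class exercise, here is one worked sketch (a derivation from the stated objective, not the official solution): holding all item vectors $h_i$ fixed, the objective decomposes into an independent least-squares problem for each user vector $w_u$, and symmetrically for each $h_i$. Setting the gradient to zero gives closed-form block updates:

\[
\frac{\partial}{\partial w_u} \sum_{i:(u,i) \in Z} (v_{ui} - w_u^T h_i)^2
  = -2 \sum_{i:(u,i) \in Z} (v_{ui} - w_u^T h_i)\, h_i = 0
\;\;\Longrightarrow\;\;
w_u \leftarrow \Big( \sum_{i:(u,i) \in Z} h_i h_i^T \Big)^{-1} \sum_{i:(u,i) \in Z} v_{ui}\, h_i
\]
\[
h_i \leftarrow \Big( \sum_{u:(u,i) \in Z} w_u w_u^T \Big)^{-1} \sum_{u:(u,i) \in Z} v_{ui}\, w_u
\]

Alternating between these two blocks is exactly the Alternating Least Squares method from the whiteboard outline; with the regularized objective, each inverse becomes, e.g., $\big(\sum_i h_i h_i^T + \lambda I\big)^{-1}$, which also keeps the updates well defined when a user has fewer than $r$ ratings.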
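Finally, a compact SGD sketch for the regularized objective, with user/item bias terms kept as explicit parameters for readability (the whiteboard's "matrix trick" instead folds them into $W$ and $H$ by fixing one column of each factor to 1). All hyperparameters, names, and the toy data here are illustrative.

    import numpy as np

    def mf_sgd(ratings, n_users, n_items, rank=2, lam=0.1,
               lr=0.02, epochs=200, seed=0):
        # ratings: list of (u, i, v_ui) triples -- the observed entries Z.
        rng = np.random.default_rng(seed)
        W = 0.1 * rng.standard_normal((n_users, rank))   # user vectors w_u
        H = 0.1 * rng.standard_normal((n_items, rank))   # item vectors h_i
        b_u, b_i = np.zeros(n_users), np.zeros(n_items)  # bias terms
        mu = np.mean([v for _, _, v in ratings])         # global mean rating

        for _ in range(epochs):
            for idx in rng.permutation(len(ratings)):    # SGD: one entry at a time
                u, i, v = ratings[idx]
                err = v - (mu + b_u[u] + b_i[i] + W[u] @ H[i])
                w_old = W[u].copy()                      # update both from old values
                # Gradient steps on squared error + L2 penalty (factor 2 folded into lr).
                W[u] += lr * (err * H[i] - lam * W[u])
                H[i] += lr * (err * w_old - lam * H[i])
                b_u[u] += lr * (err - lam * b_u[u])
                b_i[i] += lr * (err - lam * b_i[i])
        return mu, b_u, b_i, W, H

    # Toy usage on the Alice/Bob/Charlie matrix from earlier
    # (users 0-2; items: 0 = Doctor Strange, 1 = Star Trek, 2 = Zootopia):
    Z = [(0, 0, 1), (0, 2, 5), (1, 0, 3), (1, 1, 4),
         (2, 0, 3), (2, 1, 5), (2, 2, 2)]
    mu, b_u, b_i, W, H = mf_sgd(Z, n_users=3, n_items=3)
    print(mu + b_u[0] + b_i[1] + W[0] @ H[1])   # predicted rating: Alice on Star Trek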