ECS289: Scalable Machine Learning

Cho-Jui Hsieh UC Davis

Nov 3, 2015

Outline

PageRank
Semi-supervised learning: Label Propagation, Manifold Regularization

PageRank: Ranking Websites

Text-based ranking systems (the dominant approach in the early 90s): compute the similarity between the query and websites (documents)
Keywords are a very limited way to express a complex information need
Need to rank websites by "popularity", "authority", ...
PageRank: developed by Brin and Page (1999)
Determines authority and popularity from hyperlinks

PageRank

Main idea: estimate the ranking of websites from the link structure.

Topology of Websites

Transform the hyperlinks to a directed graph:

The adjacency matrix A is such that A_ij = 1 if page j points to page i

Transition Matrix

Normalize the adjacency matrix so that it becomes a stochastic matrix (each column sums to 1)

P_ij: probability of arriving at page i from page j
P: a stochastic matrix, or a transition matrix

Random walk: step 1

Random walk through the transition matrix
Start from [1, 0, 0, 0] (any initialization can be used)
x_{t+1} = P x_t

Random walk: step 2

x_{t+2} = P x_{t+1}

PageRank (convergence)

Start from an initial vector x with Σ_i x_i = 1 (the initial distribution)
For t = 1, 2, ...
    x_{t+1} = P x_t
Each x_t is a probability distribution (sums to 1)
Converges to a stationary distribution π such that

π = Pπ

if P satisfies the following two conditions:
1. P is irreducible: for all i, j, there exists some t such that (P^t)_{ij} > 0
2. P is aperiodic: for all i, j, we have gcd{t : (P^t)_{ij} > 0} = 1
π is the unique right eigenvector of P with eigenvalue 1.
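As a quick numerical check of this claim (not from the slides), the sketch below builds a small column-stochastic, irreducible, aperiodic transition matrix and compares the power-iteration limit with the eigenvector of eigenvalue 1; the 3-node matrix is made up purely for illustration.

```python
import numpy as np

# Toy column-stochastic transition matrix (made up for illustration);
# the chain it defines is irreducible and aperiodic.
P = np.array([[0.0, 0.5, 0.3],
              [0.5, 0.0, 0.7],
              [0.5, 0.5, 0.0]])

# Power iteration: x_{t+1} = P x_t, starting from any distribution.
x = np.ones(3) / 3
for _ in range(1000):
    x = P @ x

# Right eigenvector of P with eigenvalue 1, rescaled to sum to 1.
vals, vecs = np.linalg.eig(P)
pi = np.real(vecs[:, np.argmax(np.real(vals))])
pi = pi / pi.sum()

print(np.allclose(x, pi))   # both recover the stationary distribution
```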

PageRank

How to guarantee convergence? Add the possibility of jumping to a random node with a small probability (1 − α); this gives the commonly used PageRank:

π = (αP + (1 − α) v e^T) π

v = (1/n) e = [1/n, 1/n, ..., 1/n]^T is commonly used
Personalized PageRank: v = e_i

PageRank: Iterative Algorithm

Input: transition matrix P and personalization vector v
1. Initialize x_i^(0) = 1/n for i = 1, 2, ..., n
2. for t = 1, 2, ... do
       x^(t+1) ← αP x^(t) + (1 − α) v
3. end for
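A minimal Python sketch of this iteration with SciPy sparse matrices; the function name, the toy adjacency matrix, and the default α = 0.85 are illustrative assumptions rather than values from the slides. The sketch also builds P from the adjacency matrix by column normalization, as described earlier.

```python
import numpy as np
import scipy.sparse as sp

def pagerank(A, alpha=0.85, v=None, tol=1e-10, max_iter=1000):
    """Iterative PageRank: x <- alpha * P x + (1 - alpha) * v.

    A : sparse adjacency matrix with A[i, j] = 1 if page j links to page i
    v : personalization vector (uniform by default)
    """
    n = A.shape[0]
    # Column-normalize A to get the transition matrix P (each column sums to 1).
    out_deg = np.asarray(A.sum(axis=0)).ravel()
    out_deg[out_deg == 0] = 1.0            # avoid division by zero for dangling pages
    P = A @ sp.diags(1.0 / out_deg)
    if v is None:
        v = np.ones(n) / n                 # uniform teleportation vector
    x = np.ones(n) / n
    for _ in range(max_iter):
        x_new = alpha * (P @ x) + (1 - alpha) * v   # O(nnz(P)) work per iteration
        if np.abs(x_new - x).sum() < tol:
            return x_new
        x = x_new
    return x

# Usage on a made-up 4-page graph (entry [i, j] = 1 means page j links to page i):
A = sp.csr_matrix(np.array([[0, 1, 1, 0],
                            [1, 0, 0, 1],
                            [0, 1, 0, 1],
                            [1, 0, 1, 0]], dtype=float))
print(pagerank(A))
```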

Time complexity: O(nnz(P)) per iteration

Label Propagation

Semi-supervised Learning

Given both labeled and unlabeled data
Is unlabeled data useful?

(Figure from Wikipedia)

Semi-supervised Learning

Two approaches:
Graph-based algorithm (label propagation)
Graph-based regularization (manifold regularization)

Inductive vs Transductive

Inductive methods generalize to unseen data; transductive methods do not.
Supervised (labeled data): SVM, ... (inductive); X (transductive)
Semi-supervised (labeled + unlabeled): Manifold regularization (inductive); Label propagation (transductive)

Main Idea

Smoothness Assumption: if two data points are similar, then the output labels should also be similar

Measure the similarity between data points (similarity graph)
How to enforce the output labels to be similar?

Similarity Graph

Assume we have n data points x_1, ..., x_n
Define the similarity matrix S ∈ R^{n×n}

S_ij = similarity(x_i, x_j)

The similarity function can be defined in many ways, for example,

similarity(x_i, x_j) = exp(−γ‖x_i − x_j‖²)

S is a dense n × n matrix ⇒ high computational cost
Usually, a sparse similarity graph is preferred
Usually, a k-nearest-neighbor graph is used:

S_ij ≠ 0 only when j is in i's k-nearest neighbors
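A possible sketch of building such a sparse k-nearest-neighbor similarity graph with the Gaussian similarity above; the use of scikit-learn's kneighbors_graph and the specific values of k and γ are my own choices for illustration.

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph

def knn_similarity_graph(X, k=10, gamma=1.0):
    """Sparse S with S[i, j] = exp(-gamma * ||x_i - x_j||^2) when j is among
    the k nearest neighbors of i, and S[i, j] = 0 otherwise."""
    # Sparse graph holding Euclidean distances to the k nearest neighbors.
    S = kneighbors_graph(X, n_neighbors=k, mode='distance')
    # Turn distances into Gaussian similarities (only for the stored entries).
    S.data = np.exp(-gamma * S.data ** 2)
    # Symmetrize, since the kNN relation itself is not symmetric.
    return S.maximum(S.T)

# Usage on random data (for illustration only):
X = np.random.randn(1000, 20)
S = knn_similarity_graph(X, k=10, gamma=0.5)
print(S.shape, S.nnz)   # roughly n * k nonzeros instead of a dense n^2 matrix
```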

Transductive setting using label propagation

Input:

ℓ labeled training samples x_1, ..., x_ℓ with labels y_1, ..., y_ℓ (each y_i is a k-dimensional label vector for multiclass/multilabel problems)
u unlabeled training samples x_{ℓ+1}, ..., x_{ℓ+u}

Output: labels y_{ℓ+1}, ..., y_{ℓ+u} for all unlabeled samples
Main idea: propagate labels through the similarity matrix

Label Propagation

Proposed in Zhu et al., "Semi-Supervised Learning Using Gaussian Fields and Harmonic Functions", ICML 2003.
T ∈ R^{(ℓ+u)×(ℓ+u)}: transition matrix such that
    T_ij = P(j → i) = S_ij / Σ_{k=1}^{ℓ+u} S_kj

Y ∈ R^{(ℓ+u)×k}: the label matrix. The first ℓ rows are the labels y_1, ..., y_ℓ. The remaining u rows are initialized to 0 (the initialization does not affect the result).
The Algorithm: repeat the following steps until convergence:
Step 1: Propagate labels by the transition matrix: Y ← TY
Step 2: Normalize each row of Y
Step 3: Reset the first ℓ rows of Y to y_1, ..., y_ℓ
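A minimal sketch of these three steps, assuming the similarity matrix S is sparse and the labels are given as a one-hot (or multi-label) matrix; the helper below is illustrative, not code from the paper.

```python
import numpy as np
import scipy.sparse as sp

def label_propagation(S, labeled_idx, Y_labeled, max_iter=1000, tol=1e-6):
    """Iterative label propagation in the style of Zhu et al. (2003).

    S           : (l+u) x (l+u) sparse similarity matrix
    labeled_idx : indices of the l labeled points
    Y_labeled   : l x k label matrix for those points
    """
    n = S.shape[0]
    k = Y_labeled.shape[1]
    # Column-normalized transition matrix: T[i, j] = S[i, j] / sum_m S[m, j]
    col_sums = np.asarray(S.sum(axis=0)).ravel()
    col_sums[col_sums == 0] = 1.0
    T = S @ sp.diags(1.0 / col_sums)

    Y = np.zeros((n, k))
    Y[labeled_idx] = Y_labeled              # unlabeled rows start at 0
    for _ in range(max_iter):
        Y_new = T @ Y                        # Step 1: propagate labels
        row_sums = Y_new.sum(axis=1, keepdims=True)
        row_sums[row_sums == 0] = 1.0        # guard against all-zero rows
        Y_new = Y_new / row_sums             # Step 2: normalize each row
        Y_new[labeled_idx] = Y_labeled       # Step 3: reset the labeled rows
        if np.abs(Y_new - Y).max() < tol:
            return Y_new
        Y = Y_new
    return Y
```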

Label Propagation (convergence)

The algorithm will converge to a simple solution!
Step 1 and step 2 can be combined into

Y ← T̄ Y,

where T̄ is the row-normalized version of T

We focus on rows ℓ+1 to ℓ+u (denoted Y_U)

[ Y_L ]      [ T̄_ℓℓ  T̄_ℓu ] [ Y_L ]
[ Y_U ]  ←   [ T̄_uℓ  T̄_uu ] [ Y_U ]

So Y_U ← T̄_uℓ Y_L + T̄_uu Y_U
Running for an infinite number of iterations, we have

Y_U* = lim_{t→∞} ( T̄_uu^t Y_U^0 + Σ_{i=1}^{t} T̄_uu^{i−1} T̄_uℓ Y_L )   (the first term vanishes as t → ∞)
     = lim_{t→∞} Σ_{i=1}^{t} T̄_uu^{i−1} T̄_uℓ Y_L
     = (I − T̄_uu)^{−1} T̄_uℓ Y_L
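Equivalently, the fixed point can be computed with a sparse direct solver; a sketch below, assuming T_bar is the row-normalized transition matrix with the labeled points ordered first. The function name and arguments are illustrative.

```python
import scipy.sparse as sp
from scipy.sparse.linalg import splu

def label_propagation_closed_form(T_bar, Y_L, n_labeled):
    """Solve (I - T_uu) Y_U = T_ul Y_L directly for the unlabeled rows.

    T_bar     : (l+u) x (l+u) row-normalized transition matrix (sparse),
                with the l labeled points ordered first
    Y_L       : l x k label matrix for the labeled points
    n_labeled : number of labeled points l
    """
    T_uu = T_bar[n_labeled:, n_labeled:]
    T_ul = T_bar[n_labeled:, :n_labeled]
    A = (sp.identity(T_uu.shape[0]) - T_uu).tocsc()
    return splu(A).solve(T_ul @ Y_L)    # Y_U, shape u x k
```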

O(nnz(S) k T) for both power iteration and linear system solvers (T: number of iterations)
In the following, we show another interpretation.

Graph Laplacian

Graph Laplacian: L = D − S, where D is a diagonal matrix with D_ii = Σ_j S_ij

L is positive semi-definite (if S is nonnegative)
Main property: for any vector z,

z^T L z = (1/2) Σ_{i,j} S_ij (z_i − z_j)²

This measures the non-smoothness of z according to the similarity matrix S
Which z minimizes the non-smoothness? The eigenvector with the smallest eigenvalue.
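A small numerical check of the quadratic-form identity and of the smoothest vector; the random similarity matrix below is made up purely for illustration.

```python
import numpy as np

# Random symmetric nonnegative similarity matrix (for illustration only).
rng = np.random.default_rng(0)
S = rng.random((5, 5))
S = (S + S.T) / 2
np.fill_diagonal(S, 0)

D = np.diag(S.sum(axis=1))
L = D - S                                   # graph Laplacian

z = rng.standard_normal(5)
quad = z @ L @ z
pairwise = 0.5 * np.sum(S * (z[:, None] - z[None, :]) ** 2)
print(np.allclose(quad, pairwise))          # z^T L z = (1/2) sum_ij S_ij (z_i - z_j)^2

# The smoothest vector is the eigenvector with the smallest eigenvalue.
vals, vecs = np.linalg.eigh(L)
print(vals[0])                              # ~0: the constant vector is smoothest
```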

Another form of label propagation

Suppose we only have one label dimension (k = 1). Another equivalent form of label propagation:

argmin_ŷ (1/2) Σ_{i,j} S_ij (ŷ_i − ŷ_j)² = ŷ^T L ŷ =: f(ŷ)

s.t. ŷ_{1:ℓ} = y

where ŷ ∈ R^{ℓ+u} is the estimate of the labels.

Optimal solution: ∇_{ŷ_{ℓ+1:ℓ+u}} f(ŷ) = 0

⇒ [ D_ℓℓ − S_ℓℓ    −S_ℓu       ] [ ŷ_ℓ ]   =   [ ? ]
  [ −S_uℓ          D_uu − S_uu ] [ ŷ_u ]       [ 0 ]
⇒ (D_uu − S_uu) ŷ_u − S_uℓ ŷ_ℓ = 0
⇒ ŷ_u = (D_uu − S_uu)^{−1} S_uℓ ŷ_ℓ
⇒ ŷ_u = (I − D_uu^{−1} S_uu)^{−1} D_uu^{−1} S_uℓ ŷ_ℓ
⇒ ŷ_u = (I − T̄_uu)^{−1} T̄_uℓ ŷ_ℓ

Experimental Results (Zhu et al., 2003)

Manifold Regularization

Empirical Risk Minimization with Manifold Regularization

Setting: given labeled training data x_1, ..., x_ℓ with labels y_1, ..., y_ℓ and unlabeled training data x_{ℓ+1}, ..., x_{ℓ+u}, obtain a function f to predict the label of new instances.
We assume f is a linear function (f(x) = w^T x)
Empirical risk minimization:

min_w Σ_{i=1}^{ℓ} ℓ(y_i, w^T x_i) + λR(w),

where ℓ(·, ·) is the loss function and R(·) is the regularization. Only labeled data are used.

Empirical Risk Minimization with Manifold Regularization

Proposed in Belkin et al., "Manifold Regularization: A Geometric Framework for Learning from Labeled and Unlabeled Examples", JMLR 2006.
We assume f is a linear function (f(x) = w^T x)
Empirical risk minimization with manifold regularization:

min_w Σ_{i=1}^{ℓ} ℓ(y_i, w^T x_i) + λR(w) + (β/2) Σ_{i,j} S_ij (w^T x_i − w^T x_j)²
    = Σ_{i=1}^{ℓ} ℓ(y_i, w^T x_i) + λR(w) + β ŷ^T L ŷ,

where ŷ = [w^T x_1, w^T x_2, ..., w^T x_{ℓ+u}]^T
Use manifold regularization to ensure that similar data points have similar output labels
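For the special case of squared loss and R(w) = ‖w‖², the objective has a closed-form minimizer; the sketch below uses that special case as an illustrative assumption (the framework of Belkin et al. covers general losses and kernels), and the function and variable names are my own.

```python
import numpy as np

def manifold_regularized_least_squares(X, y, n_labeled, L, lam=1e-2, beta=1e-1):
    """Closed-form minimizer of
        sum_{i<=l} (y_i - w^T x_i)^2 + lam * ||w||^2 + beta * (Xw)^T L (Xw).

    X         : (l+u) x d data matrix, labeled points ordered first
    y         : length-l label vector for the labeled points
    n_labeled : number of labeled points l
    L         : (l+u) x (l+u) graph Laplacian (dense or sparse)
    """
    X_L = X[:n_labeled]
    d = X.shape[1]
    # Setting the gradient to zero gives (X_L^T X_L + lam*I + beta*X^T L X) w = X_L^T y.
    A = X_L.T @ X_L + lam * np.eye(d) + beta * (X.T @ (L @ X))
    b = X_L.T @ y
    return np.linalg.solve(A, b)    # weight vector w of the linear model f(x) = w^T x
```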

Manifold Regularization

Trains an "inductive" classifier: can generalize to unseen instances
Can be extended to other function classes (e.g., kernel methods)
Can be used in other ML problems: ranking, matrix completion, ...

Experimental Comparisons

Coming up

Next class: midterm exam

Questions?