ECS289: Scalable Machine Learning

Cho-Jui Hsieh UC Davis

Nov 3, 2015

Outline

PageRank
Semi-supervised learning: Label Propagation, Manifold Regularization

PageRank: Ranking Websites

Text-based ranking systems (the dominant approach in the early 90s): compute the similarity between the query and websites (documents)
Keywords are a very limited way to express a complex information need
Need to rank websites by "popularity", "authority", ...
PageRank: developed by Brin and Page (1999)
Determines authority and popularity from hyperlinks

PageRank

Main idea: estimate the ranking of websites from the link structure.

Topology of Websites

Transform the hyperlinks to a directed graph:

The adjacency matrix A is such that A_ij = 1 if page j points to page i

Transition Matrix

Normalize the adjacency matrix so that it becomes a stochastic matrix (each column sums to 1)

P_ij: probability of arriving at page i from page j
P: a stochastic matrix, or a transition matrix

Random walk: step 1

Random walk through the transition matrix
Start from [1, 0, 0, 0] (any initialization can be used)
x_{t+1} = P x_t

Random walk: step 2

x_{t+2} = P x_{t+1}

PageRank (convergence)

Start from an initial vector x with Σ_i x_i = 1 (the initial distribution)
For t = 1, 2, ...
    x_{t+1} = P x_t
Each x_t is a probability distribution (sums to 1)
Converges to a stationary distribution π such that

π = Pπ

if P satisfies the following two conditions:
1. P is irreducible: for all i, j, there exists some t such that (P^t)_{ij} > 0
2. P is aperiodic: for all i, j, we have gcd{t : (P^t)_{ij} > 0} = 1
π is the unique right eigenvector of P with eigenvalue 1.
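As a quick numerical check of this claim (not from the slides), the sketch below builds a small column-stochastic, irreducible, aperiodic transition matrix and compares the power-iteration limit with the eigenvector of eigenvalue 1; the 3-node matrix is made up purely for illustration.

```python
import numpy as np

# Toy column-stochastic transition matrix (made up for illustration);
# the chain it defines is irreducible and aperiodic.
P = np.array([[0.0, 0.5, 0.3],
              [0.5, 0.0, 0.7],
              [0.5, 0.5, 0.0]])

# Power iteration: x_{t+1} = P x_t, starting from any distribution.
x = np.ones(3) / 3
for _ in range(1000):
    x = P @ x

# Right eigenvector of P with eigenvalue 1, rescaled to sum to 1.
vals, vecs = np.linalg.eig(P)
pi = np.real(vecs[:, np.argmax(np.real(vals))])
pi = pi / pi.sum()

print(np.allclose(x, pi))   # both recover the stationary distribution
```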

PageRank

How to guarantee convergence? Add the possibility of jumping to a random node with a small probability (1 − α); this gives the commonly used PageRank:

π = (αP + (1 − α) v e^T) π

v = (1/n) e = [1/n, 1/n, ..., 1/n]^T is commonly used
Personalized PageRank: v = e_i

PageRank: Iterative Algorithm

Input: transition matrix P and personalization vector v
1. Initialize x_i^(0) = 1/n for i = 1, 2, ..., n
2. for t = 1, 2, ... do
       x^(t+1) ← αP x^(t) + (1 − α) v
3. end for
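A minimal Python sketch of this iteration with SciPy sparse matrices; the function name, the toy adjacency matrix, and the default α = 0.85 are illustrative assumptions rather than values from the slides. The sketch also builds P from the adjacency matrix by column normalization, as described earlier.

```python
import numpy as np
import scipy.sparse as sp

def pagerank(A, alpha=0.85, v=None, tol=1e-10, max_iter=1000):
    """Iterative PageRank: x <- alpha * P x + (1 - alpha) * v.

    A : sparse adjacency matrix with A[i, j] = 1 if page j links to page i
    v : personalization vector (uniform by default)
    """
    n = A.shape[0]
    # Column-normalize A to get the transition matrix P (each column sums to 1).
    out_deg = np.asarray(A.sum(axis=0)).ravel()
    out_deg[out_deg == 0] = 1.0            # avoid division by zero for dangling pages
    P = A @ sp.diags(1.0 / out_deg)
    if v is None:
        v = np.ones(n) / n                 # uniform teleportation vector
    x = np.ones(n) / n
    for _ in range(max_iter):
        x_new = alpha * (P @ x) + (1 - alpha) * v   # O(nnz(P)) work per iteration
        if np.abs(x_new - x).sum() < tol:
            return x_new
        x = x_new
    return x

# Usage on a made-up 4-page graph (entry [i, j] = 1 means page j links to page i):
A = sp.csr_matrix(np.array([[0, 1, 1, 0],
                            [1, 0, 0, 1],
                            [0, 1, 0, 1],
                            [1, 0, 1, 0]], dtype=float))
print(pagerank(A))
```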

Time complexity: O(nnz(P)) per iteration

Label Propagation

Semi-supervised Learning

Given both labeled and unlabeled data
Is unlabeled data useful?

(Figure from Wikipedia)

Semi-supervised Learning

Two approaches:
Graph-based algorithm (label propagation)
Graph-based regularization (manifold regularization)

Inductive vs Transductive

Inductive methods generalize to unseen data; transductive methods do not.
Supervised (labeled data): SVM, ... (inductive); X (transductive)
Semi-supervised (labeled + unlabeled): Manifold regularization (inductive); Label propagation (transductive)

Main Idea

Smoothness Assumption: if two data points are similar, then the output labels should also be similar

Measure the similarity between data points (similarity graph)
How to enforce the output labels to be similar?

Similarity Graph

Assume we have n data points x_1, ..., x_n
Define the similarity matrix S ∈ R^{n×n}

S_ij = similarity(x_i, x_j)

The similarity function can be defined in many ways, for example,

similarity(x_i, x_j) = exp(−γ‖x_i − x_j‖²)

S is a dense n × n matrix ⇒ high computational cost
Usually, a sparse similarity graph is preferred
Usually, a k-nearest-neighbor graph is used:

S_ij ≠ 0 only when j is in i's k-nearest neighbors
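A possible sketch of building such a sparse k-nearest-neighbor similarity graph with the Gaussian similarity above; the use of scikit-learn's kneighbors_graph and the specific values of k and γ are my own choices for illustration.

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph

def knn_similarity_graph(X, k=10, gamma=1.0):
    """Sparse S with S[i, j] = exp(-gamma * ||x_i - x_j||^2) when j is among
    the k nearest neighbors of i, and S[i, j] = 0 otherwise."""
    # Sparse graph holding Euclidean distances to the k nearest neighbors.
    S = kneighbors_graph(X, n_neighbors=k, mode='distance')
    # Turn distances into Gaussian similarities (only for the stored entries).
    S.data = np.exp(-gamma * S.data ** 2)
    # Symmetrize, since the kNN relation itself is not symmetric.
    return S.maximum(S.T)

# Usage on random data (for illustration only):
X = np.random.randn(1000, 20)
S = knn_similarity_graph(X, k=10, gamma=0.5)
print(S.shape, S.nnz)   # roughly n * k nonzeros instead of a dense n^2 matrix
```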

Transductive setting using label propagation

Input:

ℓ labeled training samples x_1, ..., x_ℓ with labels y_1, ..., y_ℓ (each y_i is a k-dimensional label vector for multiclass/multilabel problems)
u unlabeled training samples x_{ℓ+1}, ..., x_{ℓ+u}

Output: labels y_{ℓ+1}, ..., y_{ℓ+u} for all unlabeled samples
Main idea: propagate labels through the similarity matrix

Label Propagation

Proposed in Zhu et al., "Semi-Supervised Learning Using Gaussian Fields and Harmonic Functions", ICML 2003.
T ∈ R^{(ℓ+u)×(ℓ+u)}: transition matrix such that
    T_ij = P(j → i) = S_ij / Σ_{k=1}^{ℓ+u} S_kj

Y ∈ R^{(ℓ+u)×k}: the label matrix. The first ℓ rows are the labels y_1, ..., y_ℓ. The remaining u rows are initialized to 0 (the initialization does not affect the result).
The Algorithm: repeat the following steps until convergence:
Step 1: Propagate labels by the transition matrix: Y ← TY
Step 2: Normalize each row of Y
Step 3: Reset the first ℓ rows of Y to y_1, ..., y_ℓ
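A minimal sketch of these three steps, assuming the similarity matrix S is sparse and the labels are given as a one-hot (or multi-label) matrix; the helper below is illustrative, not code from the paper.

```python
import numpy as np
import scipy.sparse as sp

def label_propagation(S, labeled_idx, Y_labeled, max_iter=1000, tol=1e-6):
    """Iterative label propagation in the style of Zhu et al. (2003).

    S           : (l+u) x (l+u) sparse similarity matrix
    labeled_idx : indices of the l labeled points
    Y_labeled   : l x k label matrix for those points
    """
    n = S.shape[0]
    k = Y_labeled.shape[1]
    # Column-normalized transition matrix: T[i, j] = S[i, j] / sum_m S[m, j]
    col_sums = np.asarray(S.sum(axis=0)).ravel()
    col_sums[col_sums == 0] = 1.0
    T = S @ sp.diags(1.0 / col_sums)

    Y = np.zeros((n, k))
    Y[labeled_idx] = Y_labeled              # unlabeled rows start at 0
    for _ in range(max_iter):
        Y_new = T @ Y                        # Step 1: propagate labels
        row_sums = Y_new.sum(axis=1, keepdims=True)
        row_sums[row_sums == 0] = 1.0        # guard against all-zero rows
        Y_new = Y_new / row_sums             # Step 2: normalize each row
        Y_new[labeled_idx] = Y_labeled       # Step 3: reset the labeled rows
        if np.abs(Y_new - Y).max() < tol:
            return Y_new
        Y = Y_new
    return Y
```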

Label Propagation (convergence)

The algorithm will converge to a simple solution!
Step 1 and step 2 can be combined into

Y ← T̄ Y,

where T̄ is the row-normalized version of T

We focus on rows ℓ+1 to ℓ+u (denoted Y_U)

[ Y_L ]      [ T̄_ℓℓ  T̄_ℓu ] [ Y_L ]
[ Y_U ]  ←   [ T̄_uℓ  T̄_uu ] [ Y_U ]

So Y_U ← T̄_uℓ Y_L + T̄_uu Y_U
Running for an infinite number of iterations, we have

Y_U* = lim_{t→∞} ( T̄_uu^t Y_U^0 + Σ_{i=1}^{t} T̄_uu^{i−1} T̄_uℓ Y_L )   (the first term vanishes as t → ∞)
     = lim_{t→∞} Σ_{i=1}^{t} T̄_uu^{i−1} T̄_uℓ Y_L
     = (I − T̄_uu)^{−1} T̄_uℓ Y_L
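Equivalently, the fixed point can be computed with a sparse direct solver; a sketch below, assuming T_bar is the row-normalized transition matrix with the labeled points ordered first. The function name and arguments are illustrative.

```python
import scipy.sparse as sp
from scipy.sparse.linalg import splu

def label_propagation_closed_form(T_bar, Y_L, n_labeled):
    """Solve (I - T_uu) Y_U = T_ul Y_L directly for the unlabeled rows.

    T_bar     : (l+u) x (l+u) row-normalized transition matrix (sparse),
                with the l labeled points ordered first
    Y_L       : l x k label matrix for the labeled points
    n_labeled : number of labeled points l
    """
    T_uu = T_bar[n_labeled:, n_labeled:]
    T_ul = T_bar[n_labeled:, :n_labeled]
    A = (sp.identity(T_uu.shape[0]) - T_uu).tocsc()
    return splu(A).solve(T_ul @ Y_L)    # Y_U, shape u x k
```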

O(nnz(S) k T) for both power iteration and linear system solvers (T: number of iterations)
In the following, we show another interpretation.

Graph Laplacian

Graph Laplacian: L = D − S, where D is a diagonal matrix with D_ii = Σ_j S_ij

L is positive semi-definite (if S is nonnegative)
Main property: for any vector z,

z^T L z = (1/2) Σ_{i,j} S_ij (z_i − z_j)²

This measures the non-smoothness of z according to the similarity matrix S
Which z minimizes the non-smoothness? The eigenvector with the smallest eigenvalue.
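A small numerical check of the quadratic-form identity and of the smoothest vector; the random similarity matrix below is made up purely for illustration.

```python
import numpy as np

# Random symmetric nonnegative similarity matrix (for illustration only).
rng = np.random.default_rng(0)
S = rng.random((5, 5))
S = (S + S.T) / 2
np.fill_diagonal(S, 0)

D = np.diag(S.sum(axis=1))
L = D - S                                   # graph Laplacian

z = rng.standard_normal(5)
quad = z @ L @ z
pairwise = 0.5 * np.sum(S * (z[:, None] - z[None, :]) ** 2)
print(np.allclose(quad, pairwise))          # z^T L z = (1/2) sum_ij S_ij (z_i - z_j)^2

# The smoothest vector is the eigenvector with the smallest eigenvalue.
vals, vecs = np.linalg.eigh(L)
print(vals[0])                              # ~0: the constant vector is smoothest
```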

Another form of label propagation

Suppose we only have one label dimension (k = 1). Another equivalent form of label propagation:

argmin_ŷ (1/2) Σ_{i,j} S_ij (ŷ_i − ŷ_j)² = ŷ^T L ŷ =: f(ŷ)

s.t. ŷ_{1:ℓ} = y

where ŷ ∈ R^{ℓ+u} is the estimate of the labels.

Optimal solution: ∇_{ŷ_{ℓ+1:ℓ+u}} f(ŷ) = 0

⇒ [ D_ℓℓ − S_ℓℓ    −S_ℓu       ] [ ŷ_ℓ ]   =   [ ? ]
  [ −S_uℓ          D_uu − S_uu ] [ ŷ_u ]       [ 0 ]
⇒ (D_uu − S_uu) ŷ_u − S_uℓ ŷ_ℓ = 0
⇒ ŷ_u = (D_uu − S_uu)^{−1} S_uℓ ŷ_ℓ
⇒ ŷ_u = (I − D_uu^{−1} S_uu)^{−1} D_uu^{−1} S_uℓ ŷ_ℓ
⇒ ŷ_u = (I − T̄_uu)^{−1} T̄_uℓ ŷ_ℓ

Experimental Results (Zhu et al., 2003)

Manifold Regularization

Empirical Risk Minimization with Manifold Regularization

Setting: given labeled training data x_1, ..., x_ℓ with labels y_1, ..., y_ℓ and unlabeled training data x_{ℓ+1}, ..., x_{ℓ+u}, obtain a function f to predict the label of new instances.
We assume f is a linear function (f(x) = w^T x)
Empirical risk minimization:

min_w Σ_{i=1}^{ℓ} ℓ(y_i, w^T x_i) + λR(w),

where ℓ(·, ·) is the loss function and R(·) is the regularization. Only labeled data are used.

Empirical Risk Minimization with Manifold Regularization

Proposed in Belkin et al., "Manifold Regularization: A Geometric Framework for Learning from Labeled and Unlabeled Examples", JMLR 2006.
We assume f is a linear function (f(x) = w^T x)
Empirical risk minimization with manifold regularization:

min_w Σ_{i=1}^{ℓ} ℓ(y_i, w^T x_i) + λR(w) + (β/2) Σ_{i,j} S_ij (w^T x_i − w^T x_j)²
    = Σ_{i=1}^{ℓ} ℓ(y_i, w^T x_i) + λR(w) + β ŷ^T L ŷ,

where ŷ = [w^T x_1, w^T x_2, ..., w^T x_{ℓ+u}]^T
Use manifold regularization to ensure that similar data points have similar output labels
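For the special case of squared loss and R(w) = ‖w‖², the objective has a closed-form minimizer; the sketch below uses that special case as an illustrative assumption (the framework of Belkin et al. covers general losses and kernels), and the function and variable names are my own.

```python
import numpy as np

def manifold_regularized_least_squares(X, y, n_labeled, L, lam=1e-2, beta=1e-1):
    """Closed-form minimizer of
        sum_{i<=l} (y_i - w^T x_i)^2 + lam * ||w||^2 + beta * (Xw)^T L (Xw).

    X         : (l+u) x d data matrix, labeled points ordered first
    y         : length-l label vector for the labeled points
    n_labeled : number of labeled points l
    L         : (l+u) x (l+u) graph Laplacian (dense or sparse)
    """
    X_L = X[:n_labeled]
    d = X.shape[1]
    # Setting the gradient to zero gives (X_L^T X_L + lam*I + beta*X^T L X) w = X_L^T y.
    A = X_L.T @ X_L + lam * np.eye(d) + beta * (X.T @ (L @ X))
    b = X_L.T @ y
    return np.linalg.solve(A, b)    # weight vector w of the linear model f(x) = w^T x
```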

Manifold Regularization

Trains an "inductive" classifier: can generalize to unseen instances
Can be extended to other function classes (e.g., kernel methods)
Can be used in other ML problems: ranking, matrix completion, ...

Experimental Comparisons

Coming up

Next class: midterm exam

Questions?