Lecture 16. Manifold Learning
COMP90051 Statistical Machine Learning

Semester 2, 2017 Lecturer: Andrey Kan

Copyright: University of Melbourne
Swiss roll image: Evan-Amos, Wikimedia Commons, CC0

This lecture

• Introduction to manifold learning
  ∗ Motivation
  ∗ Focus on data transformation
• Unfolding the manifold
  ∗ Geodesic distances
  ∗ Isomap algorithm
• Spectral clustering
  ∗ Laplacian eigenmaps
  ∗ Spectral clustering pipeline


Manifold Learning

Recovering a low-dimensional data representation non-linearly embedded within a higher-dimensional space

The limitation of k-means and GMM
• The k-means algorithm can find spherical clusters
• GMM can find elliptical clusters
• These algorithms will struggle in cases like this

[Figure: k-means clustering result vs desired result; figure from Murphy]

Focusing on data geometry
• We are not dismissing the k-means algorithm yet, but we are going to put it aside for a moment
• One approach to address the problem on the previous slide would be to introduce improvements to algorithms such as k-means
• Instead, let's focus on the geometry of the data and see if we can transform the data to make it amenable to simpler algorithms
  ∗ Recall the "transform the data vs modify the model" discussion in supervised learning

Non-linear data embedding
• Recall the example with 3D GPS coordinates that denote a car's location on a 2D surface
• In a similar example, consider coordinates of items on a picnic blanket, which is approximately a plane
  ∗ In this example, the data resides on a plane embedded in 3D
• A low-dimensional surface can be quite curved in a higher-dimensional space
  ∗ A plane of dough (2D) in a Swiss roll (3D)

Picnic blanket image: Theo Wright, Flickr, CC2
Swiss roll image: Evan-Amos, Wikimedia Commons, CC0

Key assumption: It's simpler than it looks!

• Key assumption: high-dimensional data actually resides in a lower-dimensional space that is locally Euclidean
• Informally, the manifold is a subset of points in the high-dimensional space that locally looks like a low-dimensional space

Manifold example

[Figure: points A, B and C on an arc of a circle]

$AC \approx AB + BC$
• Informally, the manifold is a subset of points in the high-dimensional space that locally looks like a low-dimensional space
• Example: arc of a circle
  ∗ consider a tiny bit of a circumference (2D) → can treat it as a line (1D)
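To make the "locally looks like a line" intuition concrete, here is a tiny numerical check (my own illustration, assuming a unit circle and a small angular separation between A, B and C):

```python
import numpy as np

# Points A, B, C on the unit circle, separated by a small angle
theta = 0.05  # radians; a "tiny bit" of the circumference
A, B, C = (np.array([np.cos(t), np.sin(t)]) for t in (0.0, theta, 2 * theta))

chord_AB = np.linalg.norm(A - B)
chord_BC = np.linalg.norm(B - C)
chord_AC = np.linalg.norm(A - C)

# Nearly equal: over a small arc the circle behaves like a straight line
print(chord_AC, chord_AB + chord_BC)
```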

$l$-dimensional manifold

• Definition from Guillemin and Pollack, Differential Topology, 1974
• A mapping $f$ on an open set $U \subset \mathbf{R}^m$ is called smooth if it has continuous partial derivatives of all orders
• A map $f: X \rightarrow \mathbf{R}^l$ is called smooth if around each point $\mathbf{x} \in X$ there is an open set $U \subset \mathbf{R}^m$ and a smooth map $F: U \rightarrow \mathbf{R}^l$ such that $F$ equals $f$ on $U \cap X$
• A smooth map $f: X \rightarrow Y$ of subsets of two Euclidean spaces is a diffeomorphism if it is one to one and onto, and if the inverse map $f^{-1}: Y \rightarrow X$ is also smooth. $X$ and $Y$ are diffeomorphic if such a map exists
• Suppose that $X$ is a subset of some ambient Euclidean space $\mathbf{R}^m$. Then $X$ is an $l$-dimensional manifold if each point $\mathbf{x} \in X$ possesses a neighbourhood $V \subset X$ which is diffeomorphic to an open set $U \subset \mathbf{R}^l$

Manifold examples
• A few examples of manifolds are shown below
• In all cases, the idea is that (hopefully) once the manifold is "unfolded", the analysis, such as clustering, becomes easy
• How to "unfold" a manifold?
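As a worked instance of the definition above (an illustrative example, not from the slides), the unit circle is a 1-dimensional manifold in $\mathbf{R}^2$:

```latex
% The unit circle is a 1-dimensional manifold embedded in R^2
\[
S^1 = \{(x, y) \in \mathbf{R}^2 : x^2 + y^2 = 1\}
\]
% Take any point with y > 0 and the neighbourhood V of such points
\[
V = \{(x, y) \in S^1 : y > 0\}, \qquad
\varphi : V \to U = (-1, 1), \quad \varphi(x, y) = x, \quad
\varphi^{-1}(u) = \left(u, \sqrt{1 - u^2}\right)
\]
% \varphi is one-to-one, onto, and smooth in both directions, i.e. a
% diffeomorphism between V and the open set U \subset R^1. Covering S^1 with
% four such half-circle neighbourhoods shows it is a 1-dimensional manifold.
```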


Geodesic Distances and Isomap

A non-linear algorithm that preserves locality information using geodesic distances

General idea: Dimensionality reduction


• Find a lower-dimensional representation of the data that preserves distances between points (MDS)
• Do visualisation, clustering, etc. on the lower-dimensional representation
Problems?
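As a reminder of the MDS building block that the rest of this deck plugs into, here is a minimal classical-MDS sketch (the function name and defaults are mine; other MDS variants exist):

```python
import numpy as np

def classical_mds(D, l=2):
    """Classical MDS sketch: embed n points in l dimensions from an (n, n) distance matrix D."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n        # centring matrix
    B = -0.5 * J @ (D ** 2) @ J                # double-centred squared distances (Gram matrix)
    eigvals, eigvecs = np.linalg.eigh(B)       # eigenvalues in ascending order
    idx = np.argsort(eigvals)[::-1][:l]        # keep the l largest eigenpairs
    scale = np.sqrt(np.maximum(eigvals[idx], 0.0))  # clip small negatives due to noise
    return eigvecs[:, idx] * scale             # (n, l) coordinates of the mapped points
```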

"Global distances" vs geodesic distances

[Figure: points C and D on a manifold; straight-line distance vs geodesic distance]
• "Global distances" cause a problem: we may not want to preserve them
• We are interested in preserving distances along the manifold (geodesic distances)

Images: ClkerFreeVectorImages and Kaz @pixabay.com (CC0)

MDS and similarity matrix
• In essence, "unfolding" a manifold is achieved via dimensionality reduction, using methods such as MDS
• Recall that the input of an MDS algorithm is a similarity (aka proximity) matrix, where each element $w_{ij}$ denotes how similar data points $i$ and $j$ are
• Replacing distances with geodesic distances simply means constructing a different similarity matrix without changing the MDS algorithm
  ∗ Compare this to the idea of modular learning in kernel methods
• As you will see shortly, there is a close connection between similarity matrices and graphs, and in the next slide we review basic definitions from graph theory

Refresher on graph terminology
• A graph is a tuple $G = \{V, E\}$, where $V$ is a set of vertices and $E \subseteq V \times V$ is a set of edges. Each edge is a pair of vertices
  ∗ Undirected graph: pairs are unordered
  ∗ Directed graph: pairs are ordered
• Graphs model pairwise relations between objects
  ∗ Similarity or distance between the data points
• In a weighted graph, each edge has an associated weight $w_{ij}$
  ∗ Weights capture the strength of the relation between objects

Weighted adjacency matrix

• We will consider weighted undirected graphs with non-negative weights $w_{ij} \geq 0$. Moreover, we will assume that $w_{ij} = 0$ if and only if vertices $i$ and $j$ are not connected
• The degree of a vertex $i \in V$ is defined as

$\deg(i) \equiv \sum_{j=1}^{n} w_{ij}$
• A weighted undirected graph can be represented with an $n \times n$ weighted adjacency matrix $\mathbf{W}$ that contains the weights $w_{ij}$ as its elements

Similarity graph models data geometry
• Geodesic distances can be approximated using a graph in which vertices represent data points
• Let $d(i, j)$ be the Euclidean distance between the points in the original space
• Option 1: define some local radius $\varepsilon$. Connect vertices $i$ and $j$ with an edge if $d(i, j) \leq \varepsilon$
• Option 2: define a nearest-neighbour threshold $k$. Connect vertices $i$ and $j$ if $i$ is among the $k$ nearest neighbours of $j$ OR $j$ is among the $k$ nearest neighbours of $i$
• Set the weight of each edge to $d(i, j)$

Computing geodesic distances
• Given the similarity graph, compute shortest paths between each pair of points
  ∗ E.g., using the Floyd-Warshall algorithm in $O(n^3)$
• Set the geodesic distance between vertices $i$ and $j$ to the length (sum of weights) of the shortest path between them
• Define a new similarity matrix based on geodesic distances
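A rough sketch of the graph construction and shortest-path steps described above (the function name, the choice of option 2, and the default k are mine; the Floyd-Warshall loop is only practical for small n):

```python
import numpy as np

def geodesic_distances(X, k=5):
    """Sketch: approximate geodesic distances on a k-nearest-neighbour graph.

    X is an (n, m) array of points. Edges carry Euclidean lengths; the geodesic
    distance is the shortest-path length, computed with Floyd-Warshall in O(n^3).
    """
    n = X.shape[0]
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)  # pairwise Euclidean distances

    # Option 2: connect i and j if either is among the other's k nearest neighbours
    G = np.full((n, n), np.inf)
    np.fill_diagonal(G, 0.0)
    nn = np.argsort(d, axis=1)[:, 1:k + 1]       # k nearest neighbours of each point (skip self)
    for i in range(n):
        for j in nn[i]:
            G[i, j] = G[j, i] = d[i, j]          # edge weight = Euclidean distance

    # Floyd-Warshall: relax paths through every intermediate vertex
    for v in range(n):
        G = np.minimum(G, G[:, v:v + 1] + G[v:v + 1, :])
    return G                                      # inf entries mean the graph is disconnected
```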

Isomap: summary
1. Construct the similarity graph

2. Compute shortest paths
3. Geodesic distances are the lengths of the shortest paths
4. Construct a similarity matrix using geodesic distances
5. Apply MDS
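For reference, scikit-learn ships an Isomap implementation; assuming scikit-learn is available, a usage sketch on a Swiss-roll-like dataset might look as follows (parameter values are illustrative):

```python
import numpy as np
from sklearn.manifold import Isomap

rng = np.random.default_rng(0)
t = np.linspace(0, 3 * np.pi, 500)
# A 2D sheet rolled up in 3D: intrinsic coordinates are (t, height)
X = np.column_stack([t * np.cos(t), t * np.sin(t), rng.uniform(0, 5, size=500)])

# n_neighbors controls the similarity graph (option 2); n_components is the target dimension
Z = Isomap(n_neighbors=10, n_components=2).fit_transform(X)
print(Z.shape)   # (500, 2): the "unfolded" coordinates
```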


Spectral Clustering

A clustering approach that builds on non-linear dimensionality reduction

Data processing pipelines
• The Isomap algorithm can be considered a pipeline in the sense that it combines different processing blocks, such as graph construction and MDS
• Here MDS serves as a core sub-routine of Isomap
• Spectral clustering is similar to Isomap in that it also comprises a few standard blocks, including k-means clustering
• In contrast to Isomap, spectral clustering uses a different non-linear mapping technique called the Laplacian eigenmap

Spectral clustering algorithm
1. Construct a similarity graph, and use the corresponding adjacency matrix as a new similarity matrix
   ∗ Just as in Isomap, the graph captures local geometry and breaks long-distance relations
   ∗ Unlike Isomap, the adjacency matrix is used "as is"; shortest paths are not used
2. Map the data to a lower-dimensional space using Laplacian eigenmaps on the adjacency matrix
   ∗ This uses results from spectral graph theory
3. Apply k-means clustering to the mapped points

Similarity graph for spectral clustering

• Again, we start by constructing a similarity graph. This can be done in the same way as for Isomap (but there is no need to compute shortest paths)
• Recall that option 1 was to connect points that are closer than $\varepsilon$, and option 2 was to connect points within a $k$-nearest-neighbour neighbourhood

• There is also option 3, usually considered for spectral clustering. Here all points are connected to each other (the graph is fully connected). The weights are assigned using a Gaussian kernel (aka heat kernel) with width parameter $\sigma$

$w_{ij} = \exp\left(-\frac{1}{\sigma}\left\|\mathbf{x}_i - \mathbf{x}_j\right\|^2\right)$

Graph Laplacian
• Recall that $\mathbf{W}$ denotes the weighted adjacency matrix, which contains all weights $w_{ij}$
• Next, the degree matrix $\mathbf{D}$ is defined as a diagonal matrix with vertex degrees on the diagonal. Recall that a vertex degree is $\deg(i) = \sum_{j=1}^{n} w_{ij}$
• Finally, another special matrix associated with each graph is called the unnormalised graph Laplacian and is defined as

$\mathbf{L} \equiv \mathbf{D} - \mathbf{W}$
  ∗ For simplicity, here we introduce spectral clustering using the unnormalised Laplacian. In practice, it is common to use a Laplacian normalised in a certain way, e.g., $\mathbf{L}_{norm} \equiv \mathbf{I} - \mathbf{D}^{-1}\mathbf{W}$, where $\mathbf{I}$ is an identity matrix
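A small sketch of these definitions, assuming the fully connected Gaussian-kernel graph from option 3 (the function name and default σ are mine):

```python
import numpy as np

def graph_laplacian(X, sigma=1.0):
    """Sketch: Gaussian-kernel similarity graph (option 3) and its unnormalised Laplacian."""
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    W = np.exp(-sq_dists / sigma)          # heat-kernel weights w_ij
    np.fill_diagonal(W, 0.0)               # no self-loops
    D = np.diag(W.sum(axis=1))             # degree matrix
    L = D - W                              # unnormalised graph Laplacian
    return W, D, L
```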

Laplacian eigenmaps
• Laplacian eigenmaps, a central sub-routine of spectral clustering, is a non-linear dimensionality reduction method
• Similar to MDS, the idea is to map the original data points $\mathbf{x}_i \in \mathbf{R}^m$, $i = 1, \dots, n$, to a set of low-dimensional points $\mathbf{z}_i \in \mathbf{R}^l$, $l < m$, that "best represent" the original data
• Laplacian eigenmaps use a similarity matrix $\mathbf{W}$ rather than the original data coordinates as a starting point
  ∗ Here the similarity matrix $\mathbf{W}$ is the weighted adjacency matrix of the similarity graph
• Earlier, we've seen examples of how the "best represent" criterion is formalised in MDS methods
• Laplacian eigenmaps use a different criterion, namely the aim is to minimise (subject to some constraints)

$\sum_{i,j} \left\|\mathbf{z}_i - \mathbf{z}_j\right\|^2 w_{ij}$

Alternative representation of mapping

• This minimisation problem is solved using results from spectral graph theory
• Instead of the mapped points $\mathbf{z}_i$, the output can be viewed as a set of $n$-dimensional vectors $\mathbf{f}_j$, $j = 1, \dots, l$. The solution eigenmap is expressed in terms of these $\mathbf{f}_j$
  ∗ For example, if the mapping is onto a 1D line, $\mathbf{f} = \mathbf{f}_1$ is just a collection of coordinates for all $n$ points
  ∗ If the mapping is onto 2D, $\mathbf{f}_1$ is a collection of all the first coordinates, and $\mathbf{f}_2$ is a collection of all the second coordinates
• For illustrative purposes, we will consider a simple example of mapping to 1D

Problem formulation for 1D eigenmap

• Given an $n \times n$ similarity matrix $\mathbf{W}$, our aim is to find a 1D mapping $\mathbf{f}$, such that $f_i$ is the coordinate of the $i$-th mapped point. We are looking for a mapping that minimises $\frac{1}{2}\sum_{i,j}(f_i - f_j)^2 w_{ij}$
• Clearly, for any $\mathbf{f}$ this can be minimised by multiplying $\mathbf{f}$ by a small constant, so we need to introduce a scaling constraint, e.g., $\|\mathbf{f}\|^2 = \mathbf{f}'\mathbf{f} = 1$
• Next recall that $\mathbf{L} \equiv \mathbf{D} - \mathbf{W}$

Re-writing the objective in vector form

• $\frac{1}{2}\sum_{i,j}(f_i - f_j)^2 w_{ij}$
• $= \frac{1}{2}\sum_{i,j}\left(f_i^2 w_{ij} - 2 f_i f_j w_{ij} + f_j^2 w_{ij}\right)$
• $= \frac{1}{2}\left(\sum_{i=1}^{n} f_i^2 \sum_{j=1}^{n} w_{ij} - 2\sum_{i,j} f_i f_j w_{ij} + \sum_{j=1}^{n} f_j^2 \sum_{i=1}^{n} w_{ij}\right)$
• $= \frac{1}{2}\left(\sum_{i=1}^{n} f_i^2 \deg(i) - 2\sum_{i,j} f_i f_j w_{ij} + \sum_{j=1}^{n} f_j^2 \deg(j)\right)$
• $= \sum_{i=1}^{n} f_i^2 \deg(i) - \sum_{i,j} f_i f_j w_{ij}$
• $= \mathbf{f}'\mathbf{D}\mathbf{f} - \mathbf{f}'\mathbf{W}\mathbf{f}$
• $= \mathbf{f}'\mathbf{L}\mathbf{f}$

Laplace meets Lagrange
• Our problem becomes to minimise $\mathbf{f}'\mathbf{L}\mathbf{f}$ subject to $\mathbf{f}'\mathbf{f} = 1$. Recall the method of Lagrange multipliers: introduce a Lagrange multiplier $\lambda$, and set the derivatives of the Lagrangian to zero
• $\mathcal{L} = \mathbf{f}'\mathbf{L}\mathbf{f} - \lambda\left(\mathbf{f}'\mathbf{f} - 1\right)$
• $2\mathbf{f}'\mathbf{L} - 2\lambda\mathbf{f}' = \mathbf{0}'$ (déjà vu?)
• $\mathbf{L}\mathbf{f} = \lambda\mathbf{f}$
• The latter is precisely the definition of an eigenvector, with $\lambda$ being the corresponding eigenvalue!
• Critical points of our objective function $\mathbf{f}'\mathbf{L}\mathbf{f} = \frac{1}{2}\sum_{i,j}(f_i - f_j)^2 w_{ij}$ are eigenvectors of $\mathbf{L}$
• Note that the objective is actually minimised by the eigenvector $\mathbf{1}$, which is not useful. Therefore, for a 1D mapping we use the eigenvector with the second smallest eigenvalue
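A quick numerical sanity check of the identity $\frac{1}{2}\sum_{i,j}(f_i - f_j)^2 w_{ij} = \mathbf{f}'\mathbf{L}\mathbf{f}$ (random symmetric weights, my own example):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6
W = rng.random((n, n))
W = (W + W.T) / 2                      # symmetric non-negative weights
np.fill_diagonal(W, 0.0)               # no self-loops
D = np.diag(W.sum(axis=1))
L = D - W
f = rng.random(n)

lhs = 0.5 * sum((f[i] - f[j]) ** 2 * W[i, j] for i in range(n) for j in range(n))
rhs = f @ L @ f
print(np.isclose(lhs, rhs))            # True
```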

Laplacian eigenmaps: summary
• Start with points $\mathbf{x}_i \in \mathbf{R}^m$. Construct a similarity graph using one of the 3 options
• Construct the weighted adjacency matrix $\mathbf{W}$ (do not compute shortest paths) and the corresponding $\mathbf{L}$
• Compute eigenvectors of $\mathbf{L}$, and arrange them in the order of the corresponding eigenvalues $0 = \lambda_1 < \lambda_2 < \dots < \lambda_n$
• Take the eigenvectors corresponding to $\lambda_2$ to $\lambda_{l+1}$ as $\mathbf{f}_1, \dots, \mathbf{f}_l$, $l < m$, where each $\mathbf{f}_j$ corresponds to one of the new dimensions
• Combine all vectors into an $n \times l$ matrix, with the $\mathbf{f}_j$ in columns. The mapped points are the rows of this matrix
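A compact sketch of this summary, assuming the weighted adjacency matrix W has already been built (the function name is mine):

```python
import numpy as np

def laplacian_eigenmap(W, l=2):
    """Sketch: map points to l dimensions given a weighted adjacency matrix W."""
    D = np.diag(W.sum(axis=1))
    L = D - W                                # unnormalised graph Laplacian
    eigvals, eigvecs = np.linalg.eigh(L)     # eigenvalues in ascending order
    # Skip the trivial constant eigenvector (eigenvalue ~0); keep the next l
    Z = eigvecs[:, 1:l + 1]                  # rows are the mapped points z_i
    return Z
```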

Spectral clustering: summary

1. Construct a similarity graph
2. Map the data to a lower-dimensional space using Laplacian eigenmaps on the adjacency matrix
3. Apply k-means clustering to the mapped points

[Figure: spectral clustering result]
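A rough end-to-end sketch of the pipeline under the assumptions used in this deck (Gaussian-kernel graph, unnormalised Laplacian; the function name, parameter defaults and the choice of l are mine, and scikit-learn's sklearn.cluster.SpectralClustering provides a ready-made alternative):

```python
import numpy as np
from sklearn.cluster import KMeans

def spectral_clustering(X, n_clusters=2, l=None, sigma=1.0):
    """Sketch of the three-step pipeline: similarity graph -> Laplacian eigenmap -> k-means."""
    if l is None:
        l = n_clusters                       # number of eigenmap dimensions (a common heuristic)

    # 1. Fully connected similarity graph with Gaussian (heat) kernel weights (option 3)
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    W = np.exp(-sq_dists / sigma)
    np.fill_diagonal(W, 0.0)

    # 2. Laplacian eigenmap on the adjacency matrix: skip the constant eigenvector,
    #    keep the l eigenvectors with the next smallest eigenvalues
    L = np.diag(W.sum(axis=1)) - W
    _, eigvecs = np.linalg.eigh(L)
    Z = eigvecs[:, 1:l + 1]

    # 3. k-means on the mapped points
    return KMeans(n_clusters=n_clusters, n_init=10).fit_predict(Z)
```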

This lecture

• Introduction to manifold learning
  ∗ Motivation
  ∗ Focus on data transformation
• Unfolding the manifold
  ∗ Geodesic distances
  ∗ Isomap algorithm
• Spectral clustering
  ∗ Laplacian eigenmaps
  ∗ Spectral clustering pipeline
