
Preprocessing and Dimensionality Reduction

Jérémy Fix

CentraleSupélec, jeremy.fi[email protected]

2019


Where to get data You need datasets

You can use open datasets, for example to experiment with a new ML algorithm:
• UCI ML Repository: http://archive.ics.uci.edu/ml/
• Kaggle competitions, e.g. https://www.kaggle.com/c/diabetic-retinopathy-detection
• specific well-known datasets for specific ML problems


Where to get data Some available datasets

Face expression classification 48x48 pixel grayscale images of faces 0=Angry, 1=Disgust, 2=Fear, 3=Happy, 4=Sad, 5=Surprise, 6=Neutral 28K Train; 3K for public test, another 3K for final test.

Kaggle, ICML 2013


Where to get data Some available datasets

Object localization/detection. PascalVOC2012: 20 classes, 20000 train images, 20000 validation, 11000 test. Avg. image size: 469×387 pixels, RGB.

Classes : person/bird, cat, cow, dog, horse, sheep/aeroplane, bicycle, boat, bus, car, motorbike, train/bottle, chair, dining table, potted plant, sofa, tv-monitor

http://host.robots.ox.ac.uk/pascal/VOC/


Where to get data Some available datasets

Object localization/detection ImageNet, ILSVRC2014: 1000 classes, 1.2M Train images, 50K Valid, 100K Test Avg image size : 482x415 pixels, RGB

ImageNet Large Scale Visual Recognition Challenge, Russakovsky et al. (2015)


Where to get data Some available datasets

Object localization/detection

Open Images Dataset: https://github.com/openimages/dataset
• 9M automatically labelled images, 4M human validated
• ≈ 80M bounding boxes, 6000 classes
• both meta labels (e.g. vehicle) and fine-grained labels (e.g. honda nsx)


Where to get data Some available datasets

Object segmentation COCO 2017: 200K images, 80 classes, 500K masks

http://cocodataset.org/


Where to get data Some available datasets

Recommendation systems: MovieLens, Netflix Prize, Anime Recommendations Database. MovieLens 20M:
• 27K movies rated by 138K users
• 5-star ratings with 1/2 increments (0.0, 0.5, ...)
• 20M ratings
• metadata (e.g. genre)
• links to IMDb to enrich the metadata
https://grouplens.org/datasets/movielens/


Where to get data Some available datasets

Automatic speech recognition: TIMIT, VoxForge, ... TIMIT:
• 630 speakers, eight American English dialects
• time-aligned orthographic, phonetic and word transcriptions
• 16 kHz speech waveform file for each utterance
https://catalog.ldc.upenn.edu/ldc93s1


Where to get data Some available datasets
Sentiment analysis: Large Movie Review Dataset (IMDB)
• 25K reviews for training, 25K reviews for testing
• movie reviews (sentences), with a rating in [1, 10]
• aim: are the reviews on a given product positive or negative?
Maas (2011), Learning Word Vectors for Sentiment Analysis

Automatic translation: dataset from the European Parliament (Europarl)
• single-language datasets (language modelling)
• parallel corpora (translation), e.g. French-English (2M sentences), Czech-English (650K sentences), ...

Europarl: A Parallel Corpus for Statistical Machine Translation, Philipp Koehn, MT Summit 2005

Make your own dataset You need datasets

You have a specific problem: you may need to collect data on your own.
• crawl the web? (e.g. Twitter API, ...)
• if supervised learning: assign labels (Mechanical Turk, domain experts (e.g. classifying tumors))
• ensure you collected sufficient features


We need vectors, appropriately scaled, without missing values

Preprocessing


We need vectors, appropriately scaled, without missing values Preprocessing data

Data are not necessarily vectorial

• Ordinal or categorical: poor/fair/excellent; Male/Female
• Text documents: bag of words / word embeddings

Even if vectorial

• Missing data: check how missing values are indicated (-9, ' ', ...) → imputation of missing values


We need vectors, appropriately scaled, without missing values Your data might not be vectorial data

Ordinal and categorical features: ordinal values have an order.

Ordinal feature value: poor / fair / excellent
Numerical feature value: -1 / 0 / 1

Categorical values do not have an order (use a one-hot encoding):

Categorical value: American / Spanish / German / French
Numerical value: [1, 0, 0, 0] / [0, 1, 0, 0] / [0, 0, 1, 0] / [0, 0, 0, 1]
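As a rough illustration, here is how these two encodings might be done with scikit-learn; the feature values mirror the tables above, while the exact numeric codes produced by OrdinalEncoder (0/1/2 instead of -1/0/1) only need to preserve the order:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Ordinal feature: the order of the categories matters and is given explicitly.
quality = np.array([["poor"], ["fair"], ["excellent"], ["fair"]])
ordinal = OrdinalEncoder(categories=[["poor", "fair", "excellent"]])
print(ordinal.fit_transform(quality).ravel())          # [0. 1. 2. 1.]

# Categorical feature: no order, so one-hot encode instead.
nationality = np.array([["American"], ["Spanish"], ["German"], ["French"]])
onehot = OneHotEncoder()
print(onehot.fit_transform(nationality).toarray())     # one column per category
```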


We need vectors, appropriately scaled, without missing values Your data might not be vectorial data

Vectorial representation of text documents Bag Of Words define a vocabulary , = n • V |V| for each document, build a vector x so that x is the • i frequency of the word i V e.g. = I , in, love, metz, machinelearning, study V { } I love and love metz too. x = [1, 0, 2, 1, 1, 0] → I love studying machine learning in Metz. x = [1, 1, 1, 1, 1, 1] Does not take the order into account N→ gram, but this leads to sparser representations → −


We need vectors, appropriately scaled, without missing values Your data might not be vectorial data
Vectorial representation of text documents: word/sentence embeddings (e.g. word2vec, GloVe, fastText). Continuous Bag of Words (CBOW): predict a word given its context.

• input and output coded as one-hot vectors
• predict a word given its context
• hidden layer: the word representation
Captures some semantic information. For sentences: tweet2vec, sentence2vec, word vector averaging. See also: Bayesian approaches (e.g. Latent Dirichlet Allocation).

Pennington (2014), GloVe: Global Vectors for Word Representation; Mikolov (2013), Efficient Estimation of Word Representations in Vector Space; https://fasttext.cc/

We need vectors, appropriately scaled, without missing values Some features might be missing Missing features

• completely drop the samples with missing attributes, or the features that have missing values,
• or try to impute, i.e. set a value in place of the missing attributes.
For missing value imputation, there are plenty of methods (a sketch follows below):
• global: assign the mean, median or most frequent value of the attribute
• local: based on the k nearest neighbors, decide which value to impute
The bias you may introduce by imputing a value may depend on the causes of the missing values, see [Silva(2014)].

Silva (2014). A brief review of the main approaches for treatment of missing data.
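A sketch of the global and local imputation strategies listed above, using scikit-learn's imputers; the toy matrix and the choice of k = 2 neighbours are purely illustrative:

```python
import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer

X = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [7.0, np.nan],
              [8.0, 5.0]])

# Global strategy: replace a missing value by the column mean (or median / most_frequent).
print(SimpleImputer(strategy="mean").fit_transform(X))

# Local strategy: replace it by the average of the k nearest neighbours' values.
print(KNNImputer(n_neighbors=2).fit_transform(X))

# If missing values are encoded by a sentinel such as -9, declare it explicitly:
imp = SimpleImputer(missing_values=-9, strategy="median")
```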

We need vectors, appropriately scaled, without missing values Some vectorial data might not be appropriately scaled

Feature scaling

• dimensions with the largest variations will dominate Euclidean distances (e.g. nearest neighbors)
• when gradient descent is involved, feature scaling makes convergence faster (the loss landscape is closer to circularly symmetric)
• when regularization is involved, we would like to use a single regularization coefficient, independent of the scale of the features


We need vectors, appropriately scaled, without missing values Some vectorial data might not be appropriately scaled
Feature scaling: given $x_i \in \mathbb{R}^d$, you can normalize by:
• min/max scaling:
$$\forall i, \forall j \in [0, d-1],\quad x'_{i,j} = \frac{x_{i,j} - \min_k x_{k,j}}{\max_k x_{k,j} - \min_k x_{k,j}}$$
• z-score normalization:
$$\forall i, \forall j \in [0, d-1],\quad x'_{i,j} = \frac{x_{i,j} - \mu_j}{\sigma_j}, \qquad \mu_j = \frac{1}{N}\sum_k x_{k,j}, \qquad \sigma_j = \sqrt{\frac{1}{N}\sum_k (x_{k,j} - \mu_j)^2}$$

Your statistics must be computed from the training set and applied as-is to the test data (see the sketch below).
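A sketch of both normalizations with scikit-learn, fitting the statistics on the training split only and reusing them on the test split (the dataset and split are illustrative):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# min/max scaling: the per-column min and max are estimated on the training set
minmax = MinMaxScaler().fit(X_train)
X_train_mm, X_test_mm = minmax.transform(X_train), minmax.transform(X_test)

# z-score normalization: same principle with the per-column mean and std
zscore = StandardScaler().fit(X_train)
X_train_z, X_test_z = zscore.transform(X_train), zscore.transform(X_test)
```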

Dimensionality reduction


Dimensionality reduction : what/why/how ?
What: optimally transform $x_i \in \mathbb{R}^n$ into $z_i \in \mathbb{R}^d$ with $d \ll n$. It remains to define what "optimally transform" means.

Why

• visualization of the data
• interpretability of the predictor
• speed up of the algorithms whose complexity depends on n
• data may occupy a manifold of lower dimensionality than n
• curse of dimensionality: data quickly become sparse, models may overfit


Dimensionality reduction: why ? / Visualization
How are your data distributed? How intertwined are your classes? Do we have discriminative features?

[Figure: t-SNE embedding of MNIST, van der Maaten et al.]

Dimensionality reduction: why ?

Interpretability of the predictor e.g. Why does this predictor say the tumor is malignant ?

Real risk = 0.92 ± 0.06 and Real risk = 0.92 ± 0.05.
UCI ML Breast Cancer Wisconsin (Diagnostic) dataset; real risk estimated by 10-fold CV.


Dimensionality reduction: why ?

Speed up of the algorithms: decreasing the dimensionality decreases training/inference times. For example:
• linear regression: $\hat{y} = \theta^T x + b$
• logistic regression (classification): $P(y = 1 \mid x) = \frac{1}{1 + \exp(-\theta^T x)}$
Both training and inference are in $O(n)$, $x \in \mathbb{R}^n$.


Dimensionality reduction: why ?

The data may occupy a lower dimensional manifold

Swiss roll → you do not necessarily lose information by reducing the number of dimensions



Dimensionality reduction: why ?

The data may occupy a lower dimensional manifold. You want to classify facial expressions of a single person, under controlled illumination:
• suppose a huge image resolution, e.g. 1024 × 1024 RGB pixels, $x \in \mathbb{R}^{1024 \times 1024 \times 3}$
• what is the dimensionality of the data manifold? ≈ 50
→ you do not necessarily lose information by reducing the number of dimensions


Dimensionality reduction: why ?

You may even have better predictors: curse of dimensionality. The data become (exponentially) quickly sparse with respect to the number of dimensions.

Image from [Goodfellow, Bengio, Courville (2016): Deep Learning]. See also [Hastie et al. (2017), The Elements of Statistical Learning].


Dimensionality reduction : what/why/how ?

What: optimally transform $x_i \in \mathbb{R}^n$ into $z_i \in \mathbb{R}^d$ with $d \ll n$. It remains to define what "optimally transform" means.

How

• select a subset of the original features: feature selection
• compute new features from the original ones: feature extraction


Feature selection

Feature selection

Select a subset of the original features/attributes/dimensions: $x_i \in \mathbb{R}^n \rightarrow z \in \mathbb{R}^d$


Feature selection Feature selection

Overview

• Embedded: the ML algorithm is designed to select a subset of the features, e.g. linear regression with an L1 penalty
• Filters: dimensions are selected based on a heuristic
• Wrappers: dimensions are selected based on an estimation of the real risk

⇒ Notebook "Feature selection.ipynb"


Feature selection Feature selection: embedded

Embedded: the loss to minimize embeds a penalty promoting sparsity.
Least Absolute Shrinkage and Selection Operator (LASSO): given a regression problem $(x_i, y_i)$, $x_i \in \mathbb{R}^n$, $y_i \in \mathbb{R}$, optimize w.r.t. $\theta$:

$$\frac{1}{N} \sum_{i=0}^{N-1} (y_i - \theta^T x_i)^2 + \lambda \|\theta\|_1 \qquad (1)$$

Linear regression with an L1 penalty; the L1 penalty promotes sparse predictors (a scikit-learn sketch follows below).

Tibshirani (1996). Regression shrinkage and Selection via the Lasso
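A possible scikit-learn sketch of LASSO used as an embedded selector; the dataset and the regularization strength alpha are illustrative, and alpha would normally be tuned by cross-validation (note that sklearn's Lasso uses a 1/(2N) factor rather than the 1/N of Eq. (1)):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# Binary labels used as a regression target, purely for illustration.
X, y = load_breast_cancer(return_X_y=True)
X = StandardScaler().fit_transform(X)    # the L1 penalty is sensitive to feature scales

lasso = Lasso(alpha=0.05).fit(X, y)      # linear regression + L1 penalty
selected = np.flatnonzero(lasso.coef_)   # features with a nonzero coefficient survive
print(f"{len(selected)}/{X.shape[1]} features kept:", selected)
```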


Feature selection Feature selection: embedded LASSO example

$N = 30$ points, $y_i = 0.5 + 0.4 \sin(2\pi x_i) + \mathcal{N}(0, 0.01)$; 30 RBF features + a constant term:
$$\phi(x) = \Big[1,\ e^{-\frac{(x_0 - x)^2}{2\sigma^2}},\ \ldots,\ e^{-\frac{(x_{N-1} - x)^2}{2\sigma^2}}\Big]$$

[Figure: fitted curves (samples, true, lreg, lreg_l1) and the fitted parameters; with the L1 penalty, roughly 20%-33% of the features are selected]

Feature selection Feature selection: embedded Decision tree example

Decision Tree with gini impurity, max depth=2, 10-fold CV (0.92) UCI ML Breast Cancer Wisconsin dataset. 569 samples, binary classif, 30 continuous features.


Feature selection Feature selection: univariate filters
Principle: measure the correlation/dependency between each input feature, considered independently, and the target; e.g. chi-2, ANOVA test of independence, Pearson correlation, ...
Example (continuous feature → discrete target): ANOVA F-values on the breast cancer dataset.

[Figure: class-conditional distributions P(x14 | y) (lowest F-value) and P(x27 | y) (highest F-value)]
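A sketch of such a univariate ANOVA filter with scikit-learn, scoring each feature of the breast cancer dataset by its F-value and keeping the top k (k = 5 is arbitrary):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_breast_cancer(return_X_y=True)

selector = SelectKBest(score_func=f_classif, k=5).fit(X, y)   # ANOVA F-test per feature
print("lowest / highest F-value feature:",
      np.argmin(selector.scores_), np.argmax(selector.scores_))
X_reduced = selector.transform(X)        # keeps only the 5 top-scoring columns
```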

Feature selection Feature selection: multivariate filters and wrappers
Overview: denote $\chi$ a subset of the dimensions/attributes/features.
• suppose we are provided a measure $J(\chi)$ of how good this subset is
• we optimize $J(\chi)$ over the possible subsets $\chi$
If $x \in \mathbb{R}^n$, we have $2^n$ possible subsets: $\chi \in \{\emptyset, \{x_1\}, \{x_2\}, \cdots, \{x_1, x_2\}, \cdots\}$

http://featureselection.asu.edu/ : algorithms and datasets, Python package scikit-feature. John et al. (1994), Irrelevant Features and the Subset Selection Problem.

Feature selection Feature selection: optimizing J(χ) Tree search

Number of subsets at each level of the search tree:
• $\emptyset$ : 1
• $\{x_0\}, \{x_1\}, \cdots, \{x_{d-1}\}$ : $d$
• $\{x_0, x_1\}, \{x_0, x_2\}, \cdots, \{x_{d-2}, x_{d-1}\}$ : $d(d-1)/2$
• subsets of size $k$ : $\frac{d!}{k!(d-k)!}$
• the full set $X_d$ : 1
Heuristics: Sequential Forward Search (start from $\emptyset$ and add features) / Sequential Backward Search (start from the full set and remove features).

If you allow undoing steps: "Sequential Floating Forward Search" / "Sequential Floating Backward Search".

Variants and extensions: Somol et al. (2010), Efficient Feature Subset Selection and Subset Size Optimization.
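A sketch of a greedy sequential forward search using scikit-learn's SequentialFeatureSelector; here the quality of a subset is a cross-validated score, so this is really a wrapper-style instantiation of the search, and the estimator and n_features_to_select are illustrative:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=5000))

# Greedily add, one at a time, the feature that most improves the CV score.
sfs = SequentialFeatureSelector(clf, n_features_to_select=5,
                                direction="forward", cv=5)
sfs.fit(X, y)
print("Selected features:", sfs.get_support(indices=True))
```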

Feature selection Feature selection: quantifying the quality of a subset of features
We need to quantify the quality of a subset of features, $J(\chi)$. Filters use a heuristic to be maximized.
Filter heuristic, e.g. Correlation-based Feature Selection (CFS). Strategy: keep features correlated with the label, yet uncorrelated with each other. Given a training set $\{(x_i, y_i)\}$:

$$J_{CFS}(\chi) = \frac{k\,\bar r(\chi, y)}{\sqrt{k(k-1)\,\bar r(\chi, \chi)}}$$
$$\bar r(\chi, y) = \frac{1}{k} \sum_{j \in \chi} r(x_{.,j}, y), \qquad \bar r(\chi, \chi) = \frac{1}{k(k-1)} \sum_{(j_1, j_2) \in \chi,\, j_1 \neq j_2} r(x_{.,j_1}, x_{.,j_2})$$

with $k = |\chi|$ and $r$ a measure of correlation.

Feature selection Feature selection: quantifying the quality of a subset of features

We need to quantify the quality of a subset of features J(χ) Wrappers use an estimation of the real risk to be minimized. Wrappers

1. Train a predictor on the subset $\chi$
2. $J(\chi)$ = estimation of the real risk (e.g. by cross-validation)

More theoretically grounded, but more computationally expensive.
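A minimal sketch of the wrapper criterion itself: $J(\chi)$ estimated by 10-fold cross-validation for a candidate subset $\chi$ (the classifier and the two example subsets are arbitrary):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

def J(chi):
    """Wrapper criterion: 10-fold CV accuracy of a predictor trained on the subset chi."""
    return cross_val_score(DecisionTreeClassifier(max_depth=2),
                           X[:, sorted(chi)], y, cv=10).mean()

print(J({0, 7}), J({22, 27}))   # compare two candidate subsets
```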


Feature extraction

Feature extraction

Given N samples $x_i \in \mathbb{R}^d$, we compute $r \ll d$ new features from the original $d$ features.


Feature extraction Principal Component Analysis [Pearson(1901)]
Statement: find an affine transformation of the data minimizing the reconstruction error.
Intuition and formalisation

[Figure: 2D point cloud and a candidate line defined by (w0, w1), with red segments showing the orthogonal residuals]

In 1D, we seek a line $(w_0, w_1)$ minimizing the sum of the squared lengths of the red segments. It is not unique!

Feature extraction Principal Component Analysis [Pearson(1901)]

Statement : Find an affine transformation of the data minimizing the reconstruction error Formally :

$$\min_{w_0, w_1, \ldots, w_r \in \mathbb{R}^d} \sum_{i=0}^{N-1} \left\| x_i - \Big( w_0 + \sum_{j=1}^{r} \big(w_j^T (x_i - w_0)\big)\, w_j \Big) \right\|_2^2 \qquad (2)$$

subject to $w_i^T w_j = \delta_{i,j}$.

→ matrix form?


Feature extraction Principal Component Analysis [Pearson(1901)]

Matrix formulation of PCA

Introduce $W = (w_1 | \ldots | w_r) \in \mathcal{M}_{d \times r}(\mathbb{R})$:

$$(2) \Leftrightarrow \min_{w_0, w_1, \ldots, w_r \in \mathbb{R}^d} \sum_{i=0}^{N-1} \left\| (I_d - WW^T)(x_i - w_0) \right\|_2^2$$

subject to $W^T W = I_r$.


Feature extraction Principal Component Analysis [Pearson(1901)]

Simplification of the matrix formulation

• If $M$ is idempotent, so is $(I - M)$
• $(I_d - WW^T)$ is symmetric and idempotent

$$(2) \Leftrightarrow \min_{w_0, w_1, \ldots, w_r \in \mathbb{R}^d} \sum_{i=0}^{N-1} (x_i - w_0)^T (I_d - WW^T)(x_i - w_0)$$

subject to $W^T W = I_r$.


Feature extraction Principal Component Analysis [Pearson(1901)]
Remember: for $u : \mathbb{R}^n \mapsto \mathbb{R}^m$, $v : \mathbb{R}^n \mapsto \mathbb{R}^m$, $A \in \mathcal{M}_{m,m}(\mathbb{R})$:
$$\frac{d\, u^T A v}{dx} = \frac{du}{dx} A v + \frac{dv}{dx} A^T u$$

Finding w0

$$J = \sum_{i=0}^{N-1} (x_i - w_0)^T (I_d - WW^T)(x_i - w_0)$$
$$\frac{\partial J}{\partial w_0} = -2\, (I_d - WW^T) \sum_{i=0}^{N-1} (x_i - w_0)$$


Feature extraction Principal Component Analysis [Pearson(1901)]

Finding w0

$$J = \sum_{i=0}^{N-1} (x_i - w_0)^T (I_d - WW^T)(x_i - w_0)$$
$$\frac{\partial J}{\partial w_0} = -2\, (I_d - WW^T) \sum_{i=0}^{N-1} (x_i - w_0)$$
$$\frac{\partial J}{\partial w_0} = 0 \Leftrightarrow \exists h \in \mathrm{span}\{w_1, \ldots, w_r\},\ w_0 = h + \frac{1}{N}\sum_i x_i$$

$(I_d - WW^T)h$ is the residual of the orthogonal projection of $h$ on the column vectors of $W$: if $h \in \mathrm{span}\{w_1, \ldots, w_r\}$, the residual is 0; if $h \in \mathrm{span}\{w_1, \ldots, w_r\}^\perp$, the residual is $h$.

Feature extraction Principal Component Analysis [Pearson(1901)]

Finding w0

$$J = \sum_{i=0}^{N-1} (x_i - w_0)^T (I_d - WW^T)(x_i - w_0)$$

$$\arg\min_{w_0} J \Rightarrow w_0 = h + \frac{1}{N}\sum_i x_i, \quad h \in \mathrm{span}\{w_1, \ldots, w_r\},\ \text{e.g. } h = 0$$

The offset w0

The offset $w_0$ is the mean of the data points, up to a translation in the space spanned by the principal component vectors. Step 1: center the data.


Feature extraction Principal Component Analysis [Pearson(1901)]
Denote $\tilde{x}_i = x_i - \bar{x}$, with $\bar{x} = \frac{1}{N}\sum_i x_i$.
Deriving the first principal component

• $J = \sum_{i=0}^{N-1} \tilde{x}_i^T (I_d - WW^T)\, \tilde{x}_i$
• $\arg\min_{w_1, \ldots, w_r} J = \arg\max_{w_1, \ldots, w_r} \sum_{i=0}^{N-1} \tilde{x}_i^T W W^T \tilde{x}_i$
• $\tilde{X} = (\tilde{x}_0 | \ldots | \tilde{x}_{N-1})$
• $\arg\min_{w_1, \ldots, w_r} J = \arg\max_{w_1, \ldots, w_r} \sum_{j=1}^{r} w_j^T \tilde{X}\tilde{X}^T w_j$

Our optimization problem turns out to be :

$$\arg\max_{w_1, \ldots, w_r} \sum_{j=1}^{r} w_j^T \tilde{X}\tilde{X}^T w_j \quad \text{subject to } W^T W = I_r$$

We have a constrained optimization problem: use a Lagrangian.

Feature extraction Principal Component Analysis [Pearson(1901)] Deriving the first principal component : Lagrangian

$$\arg\max_{w_1} w_1^T \tilde{X}\tilde{X}^T w_1 \quad \text{subject to } w_1^T w_1 = 1$$

• $\mathcal{L}(w_1, \lambda_1) = w_1^T \tilde{X}\tilde{X}^T w_1 + \lambda_1 (1 - w_1^T w_1)$
• $\frac{\partial \mathcal{L}}{\partial w_1} = 0 \Rightarrow \tilde{X}\tilde{X}^T w_1 = \lambda_1 w_1$, so $w_1$ is an eigenvector; but which $\lambda_1$?
• $w_1^T \tilde{X}\tilde{X}^T w_1 = \lambda_1$, so $\lambda_1$ is the largest eigenvalue of $\tilde{X}\tilde{X}^T$

First principal component vector: the first principal component vector is a normalized eigenvector associated with the largest eigenvalue of the "sample matrix" $\tilde{X}\tilde{X}^T$.


Feature extraction Principal Component Analysis [Pearson(1901)]
Deriving the second principal component: greedy. Suppose we have $w_1$, a normalized eigenvector of $\tilde{X}\tilde{X}^T$ associated with its largest eigenvalue. Denote $\lambda_1 \geq \lambda_2 \geq \ldots \geq \lambda_d \geq 0$ the eigenvalues. We want to optimize:

$$\arg\max_{w_2} w_1^T \tilde{X}\tilde{X}^T w_1 + w_2^T \tilde{X}\tilde{X}^T w_2 = \arg\max_{w_2} \lambda_1 + w_2^T \tilde{X}\tilde{X}^T w_2 = \arg\max_{w_2} w_2^T \tilde{X}\tilde{X}^T w_2$$

with $w_i^T w_j = \delta_{i,j}$. And $w_2^T \tilde{X}\tilde{X}^T w_2 = w_2^T (\tilde{X}\tilde{X}^T - \lambda_1 w_1 w_1^T)\, w_2$, so $w_2$ is a normalized eigenvector associated with the largest eigenvalue of $(\tilde{X}\tilde{X}^T - \lambda_1 w_1 w_1^T)$, i.e. $\lambda_2$.

And so on. But does the greedy algorithm find the optimum?

Feature extraction Principal Component Analysis [Pearson(1901)]
Deriving the other principal components: greedy. Does it make sense to use a greedy algorithm? (proof in the lecture notes)
Theorem

For any symmetric positive semi-definite matrix $M \in \mathcal{M}_{d \times d}(\mathbb{R})$, denote $\{\lambda_i\}_{i=1..d}$ its eigenvalues, with $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_d \geq 0$. For any set of $r \in [1, d]$ orthogonal unit vectors $\{v_1, \ldots, v_r\}$, we have:

$$\sum_{j=1}^{r} v_j^T M v_j \leq \sum_{j=1}^{r} \lambda_j \qquad (3)$$

This upper bound is reached by eigenvectors associated with the largest eigenvalues of $M$.


Feature extraction Principal Component Analysis [Pearson(1901)]

PCA recipe: given $\{x_0, \ldots, x_{N-1}\} \subset \mathbb{R}^d$, to compute the r principal component vectors (a NumPy sketch follows below):

1. Center your data: $\tilde{x}_i = x_i - \bar{x}$
2. Build the matrix $\tilde{X} = [\tilde{x}_0 | \ldots | \tilde{x}_{N-1}]$
3. Compute r normalized eigenvectors associated with the r largest eigenvalues of $\tilde{X}\tilde{X}^T$
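A direct NumPy sketch of this recipe via the eigendecomposition of $\tilde{X}\tilde{X}^T$; in practice you would rather use the SVD route of the following slides:

```python
import numpy as np

def pca_eig(X, r):
    """PCA by eigendecomposition. X has shape (N, d), one sample per row."""
    x_bar = X.mean(axis=0)
    Xt = (X - x_bar).T                           # d x N matrix of centered samples (columns)
    eigval, eigvec = np.linalg.eigh(Xt @ Xt.T)   # symmetric matrix -> eigh
    order = np.argsort(eigval)[::-1][:r]         # indices of the r largest eigenvalues
    W = eigvec[:, order]                         # d x r principal vectors
    return W, x_bar

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
W, x_bar = pca_eig(X, r=2)
z = (X - x_bar) @ W                              # N x r principal components
```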


Feature extraction Principal Component Analysis [Pearson(1901)]

PCA is a projection method: given $x \in \mathbb{R}^d$, its principal components are its coordinates in the selected eigenspace:

$$x \rightarrow \big((x - \bar{x})^T w_1, (x - \bar{x})^T w_2, \ldots, (x - \bar{x})^T w_r\big)$$

If $x \in \{x_0, \ldots, x_{N-1}\}$, you had better use the SVD, which directly gives you the principal components.


Feature extraction Principal Component Analysis [Pearson(1901)]

Singular Value Decomposition

For any matrix $M \in \mathcal{M}_{d,N}(\mathbb{R})$, there exist an orthogonal matrix $U \in \mathcal{M}_{d,d}(\mathbb{R})$, a diagonal matrix $D \in \mathcal{M}_{d,N}(\mathbb{R})$, and an orthogonal matrix $V \in \mathcal{M}_{N,N}(\mathbb{R})$ such that:

$$M = U D V^T$$

Orthogonal matrix: $U^T = U^{-1}$.


Feature extraction Principal Component Analysis [Pearson(1901)]

PCA with SVD: given $\tilde{X} = U D V^T$:

$$\tilde{X}\tilde{X}^T = U D D^T U^{-1}$$

This is the diagonalization of $\tilde{X}\tilde{X}^T$. The projection vectors are the column vectors of $U$: $\{w_1, \ldots, w_r\} = \{u_1, \ldots, u_r\}$. The principal components of the training set are the first r rows of:

$$U^T \tilde{X} = U^T U D V^T = D V^T$$
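The same computation via NumPy's SVD, as a sketch; the principal components of the training set come out directly as the first r rows of $D V^T$:

```python
import numpy as np

X = np.random.default_rng(1).normal(size=(100, 5))    # N samples of dimension d
Xt = (X - X.mean(axis=0)).T                           # centered data, shape d x N

U, s, Vt = np.linalg.svd(Xt, full_matrices=False)     # Xt = U diag(s) Vt
r = 2
W = U[:, :r]                                          # principal vectors w_1 .. w_r
Z = s[:r, None] * Vt[:r, :]                           # r x N principal components (rows of D V^T)

# Sanity check: identical to projecting the centered data on W
assert np.allclose(Z, W.T @ Xt)
```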


Feature extraction Principal Component Analysis [Pearson(1901)]

What is $\tilde{X}\tilde{X}^T$?

$$\tilde{X}\tilde{X}^T = \sum_{i=0}^{N-1} \tilde{x}_i \tilde{x}_i^T = \sum_i \Big(x_i - \frac{1}{N}\sum_j x_j\Big)\Big(x_i - \frac{1}{N}\sum_j x_j\Big)^T = (N-1)\,\Sigma$$

with $\Sigma$ the sample covariance matrix. $\Sigma$ is symmetric positive semi-definite, i.e. its eigenvalues are all non-negative.


Feature extraction Principal Component Analysis [Pearson(1901)]

Equivalent formulations: there are two equivalent formulations of PCA:
• find an affine transformation minimizing the reconstruction error
• find an affine transformation maximizing the variance of the projections


Feature extraction Principal Component Analysis
Maximizing the variance of the projections: suppose your data are centered, i.e. $\frac{1}{N}\sum_i x_i = 0$. Denote $z_i \in \mathbb{R}^r$ the projection of $x_i$ over $w_1, \ldots, w_r$; we have $\bar{z} = 0$. The sample covariance matrix $\Sigma \in \mathcal{M}_{r,r}(\mathbb{R})$ is:

$$\Sigma = \frac{1}{N-1} \sum_i z_i z_i^T = \frac{1}{N-1} W^T \Big(\sum_i x_i x_i^T\Big) W$$

We want to maximize $\sum_{j=1}^{r} \Sigma_{j,j}$, and:

$$\sum_{j=1}^{r} \Sigma_{j,j} = \frac{1}{N-1} \sum_j \sum_i (w_j^T x_i)(x_i^T w_j) = \frac{1}{N-1} \sum_j w_j^T X X^T w_j$$

This is the same optimization problem as before.

Feature extraction Principal Component Analysis [Pearson(1901)]

What is the fraction of variance we keep ? For any matrix M, orthogonal matrix P :

$$\mathrm{Tr}\big(P^{-1} M P\big) = \mathrm{Tr}(M)$$

Therefore, $\mathrm{Tr}(\tilde{X}\tilde{X}^T) = \sum_{i=0}^{N-1} \lambda_i$. The variance of our data points is $\mathrm{Tr}\big(\frac{1}{N-1}\tilde{X}\tilde{X}^T\big) = \frac{1}{N-1}\sum_{i=0}^{N-1} \lambda_i$. If we keep r principal components, we keep a fraction of the variance equal to:

$$\frac{\sum_{i=0}^{r-1} \lambda_i}{\sum_{i=0}^{N-1} \lambda_i}$$
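With scikit-learn this fraction is available directly as explained_variance_ratio_; a short sketch on the 8×8 digits dataset (used here as a small stand-in for MNIST):

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, y = load_digits(return_X_y=True)     # 8x8 digit images, flattened to 64 features
pca = PCA(n_components=2).fit(X)
Z = pca.transform(X)                    # 2D projection, e.g. for visualization
print("Fraction of variance kept:", pca.explained_variance_ratio_.sum())
```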


Feature extraction Principal Component Analysis [Pearson(1901)]
PCA on MNIST (28 × 28 images)

[Figure: projection of MNIST on the first 2 principal vectors (17.05% of the total variance), colored by digit class 0-9, and the 10 first principal vectors displayed as images]

Feature extraction Sample covariance and Gram matrices
Definitions: the sample covariance matrix is:

$$\Sigma = \frac{1}{N-1} \sum_i (x_i - \bar{x})(x_i - \bar{x})^T = \frac{1}{N-1} X X^T$$

The Gram matrix is:

$$G = X^T X = \begin{pmatrix} x_0^T x_0 & x_0^T x_1 & \cdots & x_0^T x_{N-1} \\ \vdots & \vdots & & \vdots \\ x_{N-1}^T x_0 & x_{N-1}^T x_1 & \cdots & x_{N-1}^T x_{N-1} \end{pmatrix}$$

The Gram matrix is built up from dot products.

The eigenvectors/eigenvalues of G and Σ are related!

Feature extraction Sample covariance and Gram matrices

Lemma: $\forall A \in \mathbb{R}^{n \times m}$, $\ker(A) = \ker(A^T A)$

Theorem (rank-nullity): $\forall A \in \mathbb{R}^{n \times m}$, $\mathrm{rk}(A) + \dim(\ker(A)) = m$.

Theorem: $\forall A \in \mathbb{R}^{n \times m}$, $\mathrm{rk}(A^T A) = \mathrm{rk}(A A^T) \leq \min(n, m)$


Feature extraction

Lemma (eigenvalues of the covariance and Gram matrices): the nonzero eigenvalues of the scaled covariance matrix $(N-1)\Sigma = XX^T$ and of the Gram matrix $G = X^T X$ are the same:

$$\{\lambda \in \mathbb{R}^*,\ \exists v \neq 0,\ (N-1)\Sigma v = \lambda v\} = \{\lambda \in \mathbb{R}^*,\ \exists v \neq 0,\ G v = \lambda v\}$$

During the proof, we show that:
• if $(\lambda, v)$ is an eigenpair of $XX^T$, then $(\lambda, X^T v)$ is an eigenpair of $X^T X$
• if $(\lambda, w)$ is an eigenpair of $X^T X$, then $(\lambda, X w)$ is an eigenpair of $XX^T$
There are several applications of this property:
• the eigenface algorithm, used when $N \ll d$
• the nonlinear PCA called Kernel PCA [Schoelkopf, 1999]


Feature extraction What to do when $N \ll d$
$G \in \mathcal{M}_{N,N}(\mathbb{R})$, $\Sigma \in \mathcal{M}_{d,d}(\mathbb{R})$.
Eigenface: if $N \ll d$, it is much more efficient to "diagonalize" $G$ than $\Sigma$. In that case, the recipe is (a NumPy sketch follows below):

1. Center your data: $\tilde{x}_i = x_i - \bar{x}$
2. Build the matrix $\tilde{X} = [\tilde{x}_0 | \ldots | \tilde{x}_{N-1}]$
3. Compute the r normalized eigenvectors $w_j \in \mathbb{R}^N$ of $G$, with eigenvalues $\lambda_j$
4. Project your data on the r normalized eigenvectors of $\Sigma$ given by: $\frac{\tilde{X} w_j}{\|\tilde{X} w_j\|_2} = \frac{1}{\sqrt{\lambda_j}} \tilde{X} w_j$
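A NumPy sketch of this Gram-matrix trick, checking that the top eigenvalues of $G$ match those of $\tilde{X}\tilde{X}^T$ when $N \ll d$ (the sizes N = 20, d = 500, r = 3 are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, r = 20, 500, 3
X = rng.normal(size=(d, N))                  # one sample per column, d features
X -= X.mean(axis=1, keepdims=True)           # center the samples

G = X.T @ X                                  # N x N Gram matrix (cheap when N << d)
lam, W = np.linalg.eigh(G)                   # eigenvalues in ascending order
lam, W = lam[::-1][:r], W[:, ::-1][:, :r]    # keep the r largest
U = (X @ W) / np.sqrt(lam)                   # corresponding unit eigenvectors of X X^T (d x r)

# Sanity check: the top-r eigenvalues of the (much larger) d x d matrix X X^T are the same
assert np.allclose(lam, np.linalg.eigvalsh(X @ X.T)[::-1][:r])
```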

Feature extraction Toward a Kernel PCA We can reformulate the PCA using only dot products. PCA with only dot products

Computing the Gram matrix involves only dot products between the $x_i$. Projecting a vector $x$ on the vector $\frac{1}{\sqrt{\lambda_j}}\tilde{X} w_j$ reads:

$$\Big(\frac{1}{\sqrt{\lambda_j}} \tilde{X} w_j\Big)^T x = \frac{1}{\sqrt{\lambda_j}}\, w_j^T \begin{pmatrix} \langle x_0, x \rangle \\ \vdots \\ \langle x_{N-1}, x \rangle \end{pmatrix}$$

A linear algorithm involving only dot products can be rendered non-linear using the kernel trick (see SVM). The only remaining difficulty is that we must ensure the vectors in the feature space are centered.


Feature extraction Non linear PCA Kernel PCA [Scholkopf(1999)]
Consider a kernel $k : \mathbb{R}^N \times \mathbb{R}^N \mapsto \mathbb{R}$ with $\langle \phi(x), \phi(x') \rangle = k(x, x')$, e.g.
• RBF kernel: $k(x, x') = \exp\big(-\frac{\|x - x'\|_2^2}{2\sigma^2}\big)$
We perform a PCA in the feature space, the image of $\phi$: compute the Gram matrix and its eigenvectors/eigenvalues $w_j, \lambda_j$. For projecting a vector $x$, compute:

$$\frac{1}{\sqrt{\lambda_j}}\, w_j^T \begin{pmatrix} k(x_0, x) \\ \vdots \\ k(x_{N-1}, x) \end{pmatrix}$$

What about centering the φ(xi )?


Feature extraction Non linear PCA

Kernel PCA: centering in the feature space. It can be shown that the Gram matrix $\tilde{G}$ defined by:

$$\tilde{G} = \Big(I_N - \frac{1}{N}\mathbf{1}\Big)\, G\, \Big(I_N - \frac{1}{N}\mathbf{1}\Big)$$

(with $\mathbf{1}$ the $N \times N$ all-ones matrix) is the matrix of the dot products of the feature vectors centered in the feature space. The above transformation is called the double centering transformation.
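A sketch with scikit-learn's KernelPCA, which performs this double centering internally; the swiss-roll data and the RBF bandwidth gamma (with gamma = 1/(2σ²) in the notation above) are illustrative:

```python
from sklearn.datasets import make_swiss_roll
from sklearn.decomposition import KernelPCA, PCA

X, _ = make_swiss_roll(n_samples=1000, random_state=0)

linear = PCA(n_components=2).fit_transform(X)                       # linear projection
nonlin = KernelPCA(n_components=2, kernel="rbf",
                   gamma=0.05).fit_transform(X)                     # nonlinear projection
print(linear.shape, nonlin.shape)   # both (1000, 2), but the embeddings differ
```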


Feature extraction

Feature extraction : Manifold learning

Goal: for each $x_i \in \mathbb{R}^d$, associate $y_i \in \mathbb{R}^r$ so that the pairwise distances in $\mathbb{R}^d$ are as similar as possible to the pairwise distances in $\mathbb{R}^r$. Perfect for visualizing datasets in low dimensions. Examples: LLE, MDS, Isomap, SNE, t-SNE, ...


Feature extraction Manifold learning

Overview: $x_i \in \mathbb{R}^d$, $y_i \in \mathbb{R}^r$, $r \ll d$, e.g. $r = 2$

1. Quantify the similarity between pairs of points in $\mathbb{R}^d$
2. Quantify the similarity between pairs of points in $\mathbb{R}^r$
3. Quantify the discrepancy between these similarities
4. Optimize with respect to the $y_i$


Feature extraction Manifold learning

t-Stochastic Neighborhood Embedding (t-SNE) [van der Maaten(2008)]: focuses on preserving local distances, allowing larger distances in $\mathbb{R}^d$ to be even larger in $\mathbb{R}^r$.
• Similarity in $\mathbb{R}^d$:

$$\forall i, j,\ p_{i,j} = \frac{p_{i|j} + p_{j|i}}{2N}, \qquad \forall i, j,\ p_{i|j} = \frac{\exp\big(-\frac{\|x_i - x_j\|^2}{2\sigma_i^2}\big)}{\sum_{k \neq i} \exp\big(-\frac{\|x_i - x_k\|^2}{2\sigma_i^2}\big)}$$


Feature extraction Manifold learning t-Stochastic Neighborhood Embedding (t-SNE) [van der Maaten(2008)]

Focuses on preserving local distances, allowing larger distances in $\mathbb{R}^d$ to be even larger in $\mathbb{R}^r$.
• Similarity in $\mathbb{R}^r$:

$$\forall i, j,\ q_{i,j} = \frac{(1 + \|y_i - y_j\|_2^2)^{-1}}{\sum_{k \neq l} (1 + \|y_k - y_l\|_2^2)^{-1}}$$

• Match $q_{i,j}$ to $p_{i,j}$ by minimizing the Kullback-Leibler divergence:

$$C = \sum_{i,j} p_{i,j} \log\Big(\frac{p_{i,j}}{q_{i,j}}\Big)$$

Complexity: $O(N^2)$. Optimized with Barnes-Hut, bringing the complexity down to $O(N \log N)$.
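A sketch with scikit-learn's TSNE on the 8×8 digits (a small stand-in for MNIST); perplexity controls the per-point bandwidths σ_i and the Barnes-Hut approximation is the default method:

```python
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)
Z = TSNE(n_components=2, perplexity=30, init="pca",
         method="barnes_hut", random_state=0).fit_transform(X)
print(Z.shape)   # (1797, 2) embedding, typically scatter-plotted and colored by y
```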

Feature extraction Manifold learning t-Stochastic Neighborhood Embedding (t-SNE) [van der Maaten(2008)]

[Figure: t-SNE embedding example]
