Random Projections for Machine Learning and Data Mining


Theory and Applications

Robert J. Durrant & Ata Kabán
University of Birmingham
{r.j.durrant, a.kaban}@cs.bham.ac.uk
www.cs.bham.ac.uk/~durranrj    sites.google.com/site/rpforml

ECML-PKDD 2012, Friday 28th September 2012

Outline

1 Background and Preliminaries
2 Johnson-Lindenstrauss Lemma (JLL) and extensions
3 Applications of JLL (1): Approximate Nearest Neighbour Search; RP Perceptron; Mixtures of Gaussians; Random Features
4 Compressed Sensing: SVM from RP sparse data
5 Applications of JLL (2): RP LLS Regression; Randomized low-rank matrix approximation; Randomized approximate SVM solver
6 Beyond JLL and Compressed Sensing: Compressed FLD; Ensembles of RP

Motivation - Dimensionality Curse

The 'curse of dimensionality' is a collection of pervasive, and often counterintuitive, issues associated with working with high-dimensional data. Two typical problems:

- Very high-dimensional data (arity $\in O(1000)$) and very many observations ($N \in O(1000)$): computational (time and space complexity) issues.
- Very high-dimensional data (arity $\in O(1000)$) and hardly any observations ($N \in O(10)$): inference is a hard problem; bogus interactions between features.

Curse of Dimensionality

Comment: what constitutes 'high-dimensional' depends on the problem setting, but data vectors with arity in the thousands are very common in practice (e.g. medical images, gene activation arrays, text, time series, ...), and issues can start to show up when the data arity is in the tens! We will simply say that the observations are d-dimensional and there are N of them, $\mathcal{T} = \{x_i \in \mathbb{R}^d\}_{i=1}^N$, and we will assume that, for whatever reason, d is too large.

Mitigating the Curse of Dimensionality

An obvious solution: the dimensionality d is too large, so reduce d to $k \ll d$. How? There are dozens of methods: PCA, Factor Analysis, Projection Pursuit, Random Projection, ... We will be focusing on Random Projection, motivated (at first) by the following important result.

Johnson-Lindenstrauss Lemma

The JLL is the following rather surprising fact [DG02, Ach03]:

Theorem (Johnson and Lindenstrauss, 1984). Let $\epsilon \in (0,1)$. Let $N, k \in \mathbb{N}$ be such that $k \geq C\epsilon^{-2}\log N$, for a large enough absolute constant C. Let $V \subseteq \mathbb{R}^d$ be a set of N points. Then there exists a linear mapping $R : \mathbb{R}^d \to \mathbb{R}^k$ such that, for all $u, v \in V$:

    $(1-\epsilon)\,\|u-v\|^2 \;\leq\; \|Ru - Rv\|^2 \;\leq\; (1+\epsilon)\,\|u-v\|^2$

Dot products are also approximately preserved by R, since if the JLL holds then for unit vectors $u, v$:

    $u^T v - \epsilon \;\leq\; (Ru)^T Rv \;\leq\; u^T v + \epsilon$

(proof: parallelogram law - see appendix).

The scale of k is essentially sharp: $\forall N$, $\exists V$ s.t. $k \in \Omega(\epsilon^{-2}\log N / \log \epsilon^{-1})$ is required [Alo03]. We shall prove shortly that, with high probability, random projection implements a suitable linear R.

Intuition

The geometry of the data gets perturbed by random projection, but not too much.

Figure: original data vs. randomly projected data (schematic).
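As a concrete, informal illustration of the lemma, the following Python sketch projects an arbitrary point set with a scaled Gaussian matrix (of the kind constructed on the next slides) and measures how much squared pairwise distances are distorted. The sizes N, d, k are illustrative choices of mine, not values from the tutorial.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (not from the tutorial): N points in R^d projected to R^k.
N, d, k = 100, 10_000, 1_000

X = rng.normal(size=(N, d))               # an arbitrary point set V of N points
R = rng.normal(size=(k, d)) / np.sqrt(k)  # Gaussian RP matrix, scaled so E[||Rx||^2] = ||x||^2
Y = X @ R.T                               # the projected points, one per row

def sq_dists(A):
    """Squared Euclidean distances between all distinct pairs of rows of A."""
    G = A @ A.T
    n = np.diag(G)
    D = n[:, None] + n[None, :] - 2.0 * G
    return D[np.triu_indices(len(A), k=1)]

ratio = sq_dists(Y) / sq_dists(X)
print(f"distortion of squared distances: min {ratio.min():.3f}, max {ratio.max():.3f}")
# With these sizes the ratios typically stay within about 1 +/- 0.2,
# i.e. inside the (1 - eps), (1 + eps) band promised by the JLL.
```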
Applications

Random projections have been used for:

- Classification, e.g. [BM01, FM03, GBN05, SR09, CJS09, RR08, DK12b].
- Regression, e.g. [MM09, HWB07, BD09].
- Clustering and density estimation, e.g. [IM98, AC06, FB03, Das99, KMV12, AV09].
- Other related applications: structure-adaptive kd-trees [DF08], low-rank matrix approximation [Rec11, Sar06], sparse signal reconstruction (compressed sensing) [Don06, CT06], data stream computations [AMS96].

What is Random Projection? (1)

Canonical RP:

- Construct a (wide, flat) matrix $R \in \mathcal{M}_{k \times d}$ by picking the entries i.i.d. from a univariate Gaussian $\mathcal{N}(0, \sigma^2)$.
- Orthonormalize the rows of R, e.g. set $R' = (RR^T)^{-1/2} R$.
- To project a point $v \in \mathbb{R}^d$, pre-multiply the vector v with the RP matrix $R'$. Then $v \mapsto R'v \in R'(\mathbb{R}^d) \equiv \mathbb{R}^k$ is the projection of the d-dimensional data into a random k-dimensional projection space.

Comment (1)

If d is very large we can drop the orthonormalization in practice: the rows of R will be nearly orthogonal to each other and all nearly the same length.

Concentration in norms of rows of R

For example, for Gaussian ($\mathcal{N}(0,\sigma^2)$) R we have [DK12a]:

    $\Pr\left\{(1-\epsilon)\,d\sigma^2 \leq \|R_i\|_2^2 \leq (1+\epsilon)\,d\sigma^2\right\} \geq 1-\delta, \quad \forall\, \epsilon \in (0,1]$

where $R_i$ denotes the i-th row of R and $\delta = \exp\!\left(-(\sqrt{1+\epsilon}-1)^2 d/2\right) + \exp\!\left(-(\sqrt{1-\epsilon}-1)^2 d/2\right)$.

Figure: concentration of row norms for d = 100, 500 and 1000.

Near-orthogonality of rows of R

Similarly [Led01]:

    $\Pr\left\{|R_i^T R_j|/d\sigma^2 \leq \epsilon\right\} \geq 1 - 2\exp(-\epsilon^2 d/2), \quad \forall\, i \neq j$

Figure: the normalized dot product between distinct rows is concentrated about zero, $d \in \{100, 200, \ldots, 2500\}$.
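The following sketch (with illustrative sizes of my own choosing) builds such a Gaussian R, performs the orthonormalization step $R' = (RR^T)^{-1/2}R$, and checks numerically that for large d the raw rows already have nearly equal lengths and are nearly orthogonal, as the two concentration results above suggest.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative sizes (my choice): a k x d Gaussian RP matrix with N(0, sigma^2) entries.
d, k, sigma = 2_000, 20, 1.0
R = rng.normal(scale=sigma, size=(k, d))

# Canonical RP: orthonormalize the rows, R' = (R R^T)^{-1/2} R,
# computed here via an eigendecomposition of the k x k matrix R R^T.
w, V = np.linalg.eigh(R @ R.T)
R_prime = V @ np.diag(w ** -0.5) @ V.T @ R
print("rows of R' orthonormal:", np.allclose(R_prime @ R_prime.T, np.eye(k)))

# Comment (1) in action: for large d the raw rows already have squared norms
# close to d*sigma^2 and nearly vanishing normalized dot products.
row_norms_sq = np.sum(R ** 2, axis=1) / (d * sigma ** 2)
print("||R_i||^2 / (d sigma^2) in [%.3f, %.3f]" % (row_norms_sq.min(), row_norms_sq.max()))

G = (R @ R.T) / (d * sigma ** 2)
off_diag = np.abs(G[~np.eye(k, dtype=bool)])
print("max |R_i^T R_j| / (d sigma^2) over i != j: %.3f" % off_diag.max())
```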
Why Random Projection?

- Linear.
- Cheap.
- Universal: the JLL holds w.h.p. for any fixed finite point set.
- Oblivious to the data distribution.
- The target dimension does not depend on the data dimensionality (for the JLL).
- Interpretable: approximates an isometry (when d is large).
- Tractable to analyse.

Jargon

- 'With high probability' (w.h.p.) means with a probability as close to 1 as we choose to make it.
- 'Almost surely' (a.s.) or 'with probability 1' (w.p. 1) means so likely that we can pretend it always happens.
- 'With probability 0' (w.p. 0) means so unlikely that we can pretend it never happens.

Proof of JLL (1)

We will prove the following randomized version of the JLL, and then show that this implies the original theorem.

Theorem. Let $\epsilon \in (0,1)$. Let $k \in \mathbb{N}$ be such that $k \geq C\epsilon^{-2}\log \delta^{-1}$, for a large enough absolute constant C. Then there is a random linear mapping $P : \mathbb{R}^d \to \mathbb{R}^k$ such that, for any unit vector $x \in \mathbb{R}^d$:

    $\Pr\left\{(1-\epsilon) \leq \|Px\|^2 \leq (1+\epsilon)\right\} \geq 1-\delta$

There is no loss in taking $\|x\| = 1$, since P is linear. Note that this mapping is universal, and the projected dimension k depends only on $\epsilon$ and $\delta$. A matching lower bound $k \in \Omega(\epsilon^{-2}\log\delta^{-1})$ is known [JW11, KMN11].

Proof of JLL (2)

Consider the following simple mapping:

    $Px := \frac{1}{\sqrt{k}} Rx$

where $R \in \mathcal{M}_{k \times d}$ with entries $R_{ij} \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0,1)$. Let $x \in \mathbb{R}^d$ be an arbitrary unit vector. We are interested in quantifying:

    $\|Px\|^2 = \left\|\tfrac{1}{\sqrt{k}} Rx\right\|^2 := \left\|\tfrac{1}{\sqrt{k}}(Y_1, Y_2, \ldots, Y_k)^T\right\|^2 = \tfrac{1}{k}\sum_{i=1}^k Y_i^2 =: Z$

Proof of JLL (3)

Recall that if $W_i \sim \mathcal{N}(\mu_i, \sigma_i^2)$ and the $W_i$ are independent, then $\sum_i W_i \sim \mathcal{N}\left(\sum_i \mu_i, \sum_i \sigma_i^2\right)$. Hence, in our setting, we have:

    $Y_i = \sum_{j=1}^d R_{ij} x_j \sim \mathcal{N}\left(\sum_{j=1}^d \mathbb{E}[R_{ij} x_j],\; \sum_{j=1}^d \mathrm{Var}(R_{ij} x_j)\right) \equiv \mathcal{N}\left(0,\; \sum_{j=1}^d x_j^2\right)$

and since $\|x\|^2 = \sum_{j=1}^d x_j^2 = 1$, it follows that each of the $Y_i$ is a standard normal RV, $Y_i \sim \mathcal{N}(0,1)$ for all $i \in \{1, 2, \ldots, k\}$, and therefore $kZ = \sum_{i=1}^k Y_i^2 \sim \chi^2_k$.

Proof of JLL (4)

    $\Pr\{Z > 1+\epsilon\} = \Pr\{\exp(tkZ) > \exp(tk(1+\epsilon))\}, \quad \forall t > 0$
    $\leq \mathbb{E}[\exp(tkZ)] / \exp(tk(1+\epsilon))$    (Markov ineq.)
    $= \prod_{i=1}^k \mathbb{E}[\exp(t Y_i^2)] / \exp(tk(1+\epsilon))$    ($Y_i$ indep.)
    $= \left[\exp(t)\sqrt{1-2t}\right]^{-k} \exp(-kt\epsilon), \quad \forall t < 1/2$    (mgf of $\chi^2_k$)
    $\leq \exp\left(kt^2/(1-2t) - kt\epsilon\right)$    (next slide)
    $\leq e^{-\epsilon^2 k/8}$,  taking $t = \epsilon/4 < 1/2$.
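To see the bound in action, here is a small Monte Carlo sketch (dimensions, $\epsilon$ and trial count are my own illustrative choices) that draws many independent copies of $P = \frac{1}{\sqrt{k}}R$, computes $Z = \|Px\|^2$ for a fixed unit vector x, and compares the empirical upper-tail probability with the Chernoff bound $e^{-\epsilon^2 k/8}$ just derived.

```python
import numpy as np

rng = np.random.default_rng(2)

# Illustrative parameters (my choice): check Pr{Z > 1 + eps} <= exp(-eps^2 k / 8)
# for Z = ||Px||^2, P = R / sqrt(k), R_ij ~ N(0,1), and a fixed unit vector x.
d, k, eps, trials = 200, 100, 0.3, 5_000

x = rng.normal(size=d)
x /= np.linalg.norm(x)                    # arbitrary unit vector

Z = np.empty(trials)
for t in range(trials):
    R = rng.normal(size=(k, d))           # fresh RP matrix each trial
    Z[t] = np.sum((R @ x) ** 2) / k       # Z = (1/k) * sum_i Y_i^2, with kZ ~ chi^2_k

upper_tail = np.mean(Z > 1 + eps)
two_sided = np.mean((Z < 1 - eps) | (Z > 1 + eps))
bound = np.exp(-eps ** 2 * k / 8)
print(f"empirical Pr{{Z > 1+eps}} = {upper_tail:.4f}  (Chernoff bound {bound:.4f})")
print(f"empirical Pr{{|Z - 1| > eps}} = {two_sided:.4f}")
# Both empirical tail probabilities sit below the (loose) exponential bound.
```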
