
Addressing the Curse of Dimensionality with Convolutional Neural Networks

Joan Bruna (UC Berkeley → Courant Institute, NYU)

Collaborators: Ivan Dokmanic (UIUC), Stéphane Mallat (ENS, France), Pablo Sprechmann (NYU), Yann LeCun (NYU)

From Sanja Fidler (inspired by R. Urtasun, ICLR'16):
Computer Vision: what Neural Networks can do for me.
Machine Learning: what can I do for Neural Networks.
Theory: ??

Curse of Dimensionality

• Images, videos, audio and text are instances of high-dimensional data.

$x \in \mathbb{R}^d$, $d \sim 10^6$.

• Learning as a high-dimensional interpolation problem: observe $\{(x_i, y_i = f(x_i))\}_{i \le N}$ for some unknown $f : \mathbb{R}^d \to \mathbb{R}$.
• Goal: estimate $\hat{f}$ from the training data.
• "Pessimist" view: assume as little as possible on $f$; e.g. $f$ is Lipschitz: $|f(x) - f(x')| \le L\,\|x - x'\|$.
• Q: How many points do we need to observe to guarantee an error $|\hat{f}(x) - f(x)| < \epsilon$?
• A: $N$ should be $\sim \epsilon^{-d}$. We pay an exponential price in the input dimension $d$.
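A quick numerical illustration of this rate (a minimal Python/NumPy sketch; the uniform data model, sample sizes and dimensions are arbitrary illustrative choices): with $N$ points drawn in $[0,1]^d$, the distance from a query to its nearest training sample behaves like $N^{-1/d}$, so a Lipschitz interpolator needs $N \sim \epsilon^{-d}$ samples to reach accuracy $\epsilon$.

import numpy as np

rng = np.random.default_rng(0)

def mean_nn_distance(n_samples, dim, n_queries=100):
    # Average distance from a random query point to its nearest training
    # sample, both drawn uniformly in [0, 1]^dim.
    train = rng.random((n_samples, dim))
    queries = rng.random((n_queries, dim))
    dists = np.linalg.norm(queries[:, None, :] - train[None, :, :], axis=-1)
    return dists.min(axis=1).mean()

for dim in [1, 2, 5, 10, 20]:
    d_nn = mean_nn_distance(n_samples=2000, dim=dim)
    # For Lipschitz f, the interpolation error scales with this distance;
    # driving it below eps requires N ~ eps^(-dim) samples.
    print(f"d = {dim:2d}   mean nearest-neighbor distance ~ {d_nn:.3f}")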

• Therefore, in order to beat the curse of dimensionality, it is necessary to make assumptions about our data and exploit them in our models.
• Invariance and local stability perspective:
  – Supervised learning: $(x_i, y_i)_i$, $y_i \in \{1, \dots, K\}$ labels. $f(x) = p(y \mid x)$ satisfies $f(x_\tau) \approx f(x)$ if $\{x_\tau\}_\tau$ is a high-dimensional family of deformations of $x$.
  – Unsupervised learning: $(x_i)_i$. The density $f(x) = p(x)$ also satisfies $f(x_\tau) \approx f(x)$.

Ill-posed Inverse Problems

• Consider the following linear problem: $y = \Phi x + w$, with $x, y, w \in L^2(\mathbb{R}^d)$, where
  – $\Phi$ is singular,
  – $x$ is drawn from a certain distribution (e.g. natural images),
  – $w$ is independent of $x$.
• Examples: super-resolution, inpainting, deconvolution.
• Standard regularization route (sketched below): $\hat{x} = \arg\min_x \|\Phi x - y\|^2 + \lambda \mathcal{R}(x)$, with $\mathcal{R}(x)$ (typically) convex.
• Q: How to leverage training data?
• Q: Is this formulation always appropriate for estimating images?
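A minimal sketch of that regularization route on a toy 1-D deconvolution (Python/NumPy; the box blur, the noise level and the quadratic choice $\mathcal{R}(x) = \|Dx\|^2$ are illustrative assumptions, not the talk's setup):

import numpy as np

rng = np.random.default_rng(0)
n = 128

# Forward operator Phi: circular convolution with a box blur (ill-conditioned).
kernel = np.zeros(n)
kernel[:9] = 1.0 / 9.0
Phi = np.stack([np.roll(kernel, i) for i in range(n)])

# Piecewise-constant ground truth, blurred and noisy observation y = Phi x + w.
x_true = np.zeros(n)
x_true[30:60] = 1.0
x_true[80:90] = -0.5
y = Phi @ x_true + 0.02 * rng.standard_normal(n)

# Standard regularization route: x_hat = argmin ||Phi x - y||^2 + lam * R(x),
# with the convex choice R(x) = ||D x||^2, D = circular first differences.
lam = 0.5
D = np.eye(n) - np.roll(np.eye(n), 1, axis=1)
x_hat = np.linalg.solve(Phi.T @ Phi + lam * D.T @ D, Phi.T @ y)

x_naive = np.linalg.lstsq(Phi, y, rcond=None)[0]   # unregularized pseudo-inverse
rel = lambda z: np.linalg.norm(z - x_true) / np.linalg.norm(x_true)
print(f"relative error, pseudo-inverse : {rel(x_naive):.2f}")
print(f"relative error, regularized    : {rel(x_hat):.2f}")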

Regularization and Learning

• The inverse problem requires a high-dimensional model for $p(x \mid y)$.
• The underlying probabilistic model for regularized least squares is $p(x \mid y) \propto \mathcal{N}(\Phi x - y,\, \sigma I)\, e^{-\mathcal{R}(x)}$.
• Suppose $x_1, x_2$ are such that $p(x_1 \mid y) = p(x_2 \mid y)$. Since $-\log p(x \mid y)$ is convex, it results that $p(\alpha x_1 + (1-\alpha) x_2 \mid y) \ge p(x_1 \mid y)$, $\forall \alpha \in [0, 1]$.
• Jitter model (figure): two shifted explanations $x_1$, $x_2$ of the same observation $y$, and their average $\alpha x_1 + (1-\alpha) x_2$.
• The conditional distribution of images is not well modeled only with Gaussian noise. Non-convexity in general.
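A toy illustration of this jitter argument (a Python/NumPy sketch; the spike signals, blur width and shifts are hypothetical): two shifted copies $x_1, x_2$ explain a blurred observation $y$ equally well, and any log-concave posterior (Gaussian likelihood plus convex $\mathcal{R}$) assigns their average at least as much probability, even though the average looks like neither.

import numpy as np

n = 64

def blur(x, width=16):
    # Local average: circular convolution with a box filter of the given width.
    k = np.zeros(n)
    k[:width] = 1.0 / width
    return np.real(np.fft.ifft(np.fft.fft(x) * np.fft.fft(k)))

# Two "jittered" explanations: the same spike shifted by a few pixels.
x1 = np.zeros(n); x1[30] = 1.0
x2 = np.zeros(n); x2[33] = 1.0
x_avg = 0.5 * (x1 + x2)

# Symmetric observation, so that x1 and x2 are equally plausible a posteriori.
y = blur(x_avg)

for name, x in [("x1", x1), ("x2", x2), ("0.5*(x1+x2)", x_avg)]:
    fit = np.sum((blur(x) - y) ** 2)
    print(f"{name:12s}  data term ||Phi x - y||^2 = {fit:.4f}   max(x) = {x.max():.2f}")
# x1 and x2 have identical data terms; their average fits even better under the
# quadratic data term, yet it is two half-spikes rather than a single spike --
# the Gaussian/MSE model averages plausible explanations instead of choosing one.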

Plan

• CNNs and the curse of dimensionality on image/audio regression.

• Multiscale wavelet scattering convolutional networks.

• Generative Models using CNN sufficient statistics.

• Applications to inverse problems: image and texture super-resolution.

Joint work with P. Sprechmann (NYU), Yann LeCun (NYU/FAIR), Ivan Dokmanic (UIUC), S. Mallat (ENS) and M. de Hoop (Rice).

Learning, Features and Kernels

• Change of variable $\Phi(x) = \{\phi_k(x)\}_{k \le d'}$ to nearly linearize $f(x)$, which is approximated by a 1-D projection:
  $\tilde{f}(x) = \langle \Phi(x), w \rangle = \sum_{k \le d'} w_k\, \phi_k(x)$.
• Data: $x \in \mathbb{R}^d \ \mapsto\ \Phi(x) \in \mathbb{R}^{d'}$, followed by a linear classifier.
• Metric: $\|x - x'\|_\Phi = \|\Phi(x) - \Phi(x')\|$.
• How and when is it possible to find such a $\Phi$?
• What "regularity" of $f$ is needed?
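A minimal sketch of this "change of variable + linear classifier" recipe (Python/NumPy; random Fourier features stand in for $\Phi$ and a toy two-circles problem for the data — both are illustrative assumptions, not the representations discussed next):

import numpy as np

rng = np.random.default_rng(0)

# Toy data: two concentric circles in R^2; the label is not linearly separable in x.
n = 1000
radius = np.where(rng.random(n) < 0.5, 1.0, 2.0)
angle = 2 * np.pi * rng.random(n)
X = np.stack([radius * np.cos(angle), radius * np.sin(angle)], axis=1)
X += 0.1 * rng.standard_normal(X.shape)
labels = (radius > 1.5).astype(float)
X_tr, X_te, y_tr, y_te = X[:800], X[800:], labels[:800], labels[800:]

# Change of variable Phi(x): random Fourier features (approximate a Gaussian kernel).
d_feat = 200
W = rng.standard_normal((2, d_feat)) / 0.5      # frequencies; bandwidth 0.5 is arbitrary
b = 2 * np.pi * rng.random(d_feat)
phi = lambda Z: np.sqrt(2.0 / d_feat) * np.cos(Z @ W + b)

def linear_fit_accuracy(F_tr, F_te, lam=1e-3):
    # Ridge-regress the labels on the features, then threshold: f~(x) = <Phi(x), w>.
    w = np.linalg.solve(F_tr.T @ F_tr + lam * np.eye(F_tr.shape[1]),
                        F_tr.T @ (2 * y_tr - 1))
    return np.mean((F_te @ w > 0) == (y_te > 0.5))

print("linear model on raw x  :", linear_fit_accuracy(X_tr, X_te))
print("linear model on Phi(x) :", linear_fit_accuracy(phi(X_tr), phi(X_te)))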

Feature Representation Wish-List

• A change of variable $\Phi(x)$ must linearize the orbits $\{g.x\}_{g \in G}$.
• Linearize symmetries with the change of variable $\Phi(x)$ (figure: $x$, $x'$ and their transformed versions $g_1.x$, $g_1^p.x$, $g_1.x'$, $g_1^p.x'$).
• Lipschitz: $\forall x, g:\ \|\Phi(x) - \Phi(g.x)\| \le C\, \|g\|$.
• Discriminative: $\|\Phi(x) - \Phi(x')\| \ge C\, |f(x) - f(x')|$.

Geometric Variability Prior

$x(u)$, $u$: pixels, time samples, etc.
$\tau(u)$: deformation field.
$L_\tau(x)(u) = x(u - \tau(u))$: warping.

[Video of Philipp Scott Johnson]

• Deformation prior (see the warping sketch below): $\|\tau\| = \sup_u |\tau(u)| + \sup_u |\nabla \tau(u)|$.
  – Models changes in point of view in images.
  – Models frequency transpositions in sounds.
  – Consistent with local translation invariance.
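A small sketch of the warping operator $L_\tau x(u) = x(u - \tau(u))$ and a discrete estimate of the deformation norm above (Python/NumPy + SciPy; the blob image and the sinusoidal displacement field are arbitrary illustrations):

import numpy as np
from scipy.ndimage import map_coordinates

# Toy image x(u): a smooth blob on a 64x64 grid.
n = 64
u1, u2 = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
x = np.exp(-((u1 - 32) ** 2 + (u2 - 32) ** 2) / (2 * 6.0 ** 2))

# Smooth deformation field tau(u): a small sinusoidal displacement.
tau1 = 2.0 * np.sin(2 * np.pi * u2 / n)
tau2 = 1.5 * np.cos(2 * np.pi * u1 / n)

# Warping (L_tau x)(u) = x(u - tau(u)): interpolate x at the displaced grid points.
x_warped = map_coordinates(x, [u1 - tau1, u2 - tau2], order=1, mode="nearest")

# Deformation prior ||tau|| = sup_u |tau(u)| + sup_u |grad tau(u)| (finite differences).
amplitude = np.sqrt(tau1 ** 2 + tau2 ** 2).max()
gradients = np.gradient(tau1) + np.gradient(tau2)        # list of 4 partial derivatives
jacobian = max(np.abs(g).max() for g in gradients)
print(f"sup |tau|      ~ {amplitude:.2f} pixels")
print(f"sup |grad tau| ~ {jacobian:.2f}   (small: the warp is close to a local translation)")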

Rotation and Scaling Variability

• Rotation and deformations. Group: $SO(2) \times \mathrm{Diff}(SO(2))$.
• Scaling and deformations. Group: $\mathbb{R} \times \mathrm{Diff}(\mathbb{R})$.

Geometric Variability Prior

• Blur operator: $Ax = x \star \phi$, with $\phi$ a local average.
  – The only linear operator $A$ stable to deformations: $\|A L_\tau x - A x\| \le \|\tau\|\, \|x\|$. [Bruna'12]

[Excerpt from [Bruna'12], Chapter 2 (Invariant Scattering Representations): the Fourier modulus and the autocorrelation are translation invariant but unstable to deformations and lose too much information; canonical invariants registered to an anchor point inherit the same high-frequency instability; SIFT/HoG descriptors are recalled. Figure 2.1: dilation of a complex bandpass window — if $\xi \sigma_x \gg s^{-1}$, the supports are nearly disjoint.]

• Wavelet filter bank: $Wx = \{x \star \psi_k\}_k$, $\psi_k(u) = 2^{-j}\, \psi(2^{-j} R_\theta\, u)$, $k = (j, \theta)$, with $\psi$ a spatially localized band-pass filter.
  – $W$ recovers the information lost by $A$.
  – $W$ nearly commutes with deformations: $\|W L_\tau - L_\tau W\| \le C\, |\nabla \tau|_\infty$.
• Point-wise non-linearity $\rho(x) = |x|$:
  – Commutes with deformations: $\rho L_\tau x = L_\tau \rho x$. [Bruna'12]
  – Demodulates wavelet coefficients, preserves energy.

Scattering Convolutional Network

$S_J f = \big\{\, f \star \phi_J,\ \ |f \star \psi_{j_1,\gamma_1}| \star \phi_J,\ \ \big|\,|f \star \psi_{j_1,\gamma_1}| \star \psi_{j_2,\gamma_2}\,\big| \star \phi_J,\ \dots \,\big\}_{j_1, j_2, \dots,\ \gamma_1, \gamma_2, \dots}$

• Each layer applies $|W_J|$: a wavelet decomposition followed by a modulus, with the low-pass output $\star\, \phi_J$ emitted at every depth.
• Cascade of contractive operators (see the 1-D sketch below).

Image Examples [Bruna, Mallat, '11, '12]

[Figure: images alongside their Fourier modulus $|\hat{f}|$, first-order scattering $|f \star \psi_{j_1,\gamma_1}| \star \phi$ (comparable to SIFT descriptors), and second-order scattering $\big||f \star \psi_{j_1,\gamma_1}| \star \psi_{j_2,\gamma_2}\big| \star \phi$; here the window size equals the image size.]
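A minimal 1-D rendering of this cascade (Python/NumPy; hand-built Gabor-like analytic filters, global averaging and two layers — a toy stand-in for the construction of [Bruna, Mallat], not a reference implementation):

import numpy as np

def filter_bank(n, n_scales=6):
    # Analytic band-pass filters psi_j and a low-pass phi, all in the Fourier domain.
    omega = np.fft.fftfreq(n)
    psis = [np.exp(-((omega - 0.4 / 2 ** j) ** 2) / (2 * (0.15 / 2 ** j) ** 2))
            for j in range(n_scales)]
    phi = np.exp(-(omega ** 2) / (2 * (0.15 / 2 ** n_scales) ** 2))
    return psis, phi

def scattering(x, n_scales=6):
    # Order 0, 1 and 2 coefficients with a global average (window = signal length).
    n = len(x)
    psis, phi = filter_bank(n, n_scales)
    conv = lambda F, g: np.fft.ifft(F * g)          # filtering in the Fourier domain
    X = np.fft.fft(x)
    coeffs = [np.abs(conv(X, phi)).mean()]          # S0: x * phi
    for j1, psi1 in enumerate(psis):
        u1 = np.abs(conv(X, psi1))                  # |x * psi_{j1}|
        coeffs.append(u1.mean())                    # S1: |x * psi_{j1}| * phi
        U1 = np.fft.fft(u1)
        for j2 in range(j1 + 1, n_scales):          # only coarser scales carry energy
            coeffs.append(np.abs(conv(U1, psis[j2])).mean())   # S2 coefficients
    return np.array(coeffs)

# Sanity check: a (circular) translation barely changes the scattering vector.
rng = np.random.default_rng(0)
x = np.convolve(rng.standard_normal(512), np.ones(8) / 8, mode="same")
x_shift = np.roll(x, 37)
rel = lambda a, b: np.linalg.norm(a - b) / np.linalg.norm(b)
print("relative change of the signal     :", round(rel(x_shift, x), 3))
print("relative change of its scattering :", round(rel(scattering(x_shift), scattering(x)), 6))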

Theorem: [Mallat ’10] With appropriate wavelets, SJ is stable to additive noise, S (x + n) S x n , k J J kk k unitary, S x = x , and stable to deformations k J k k k S x S x C x ⌧ . k J ⌧ J k k kkr k

Representation of Stationary Processes

• $x(u)$: realizations of a stationary process $X(u)$ (not Gaussian).
• $\Phi(X) = \{E(f_i(X))\}_i$; estimation from samples $x(n)$: $\widehat{\Phi}(X) = \Big\{\frac{1}{N}\sum_n f_i(x)(n)\Big\}_i$.
• Discriminability: need to capture high-order moments.
• Stability: $E\big(\|\Phi(X) - \widehat{\Phi}(X)\|^2\big)$ small.

Scattering Moments

$SX = \Big\{\, E(X),\ E\big(|X \star \psi_{j_1,\gamma_1}|\big),\ E\big(\big||X \star \psi_{j_1,\gamma_1}| \star \psi_{j_2,\gamma_2}\big|\big),\ \dots \,\Big\}$, for all scales and orientations $(j_i, \gamma_i)$.

Properties of Scattering Moments

• Captures high-order moments. [Bruna, Mallat, '11, '12]
[Figure: power spectrum vs. scattering representation $S_J[p]X$ at orders $m = 1$ and $m = 2$.]

• Cascading non-linearities is necessary to reveal higher-order moments.

Consistency of Scattering Moments

Theorem: [B’15] If is a wavelet such that 1 1, and X(t)isa linear, stationary process with finite energy, then k k 

2 lim E( SˆN X SX )=0. N k k !1 Consistency of Scattering Moments

Theorem: [B’15] If is a wavelet such that 1 1, and X(t)isa linear, stationary process with finite energy, then k k 

2 lim E( SˆN X SX )=0. N k k !1

Corollary: If moreover X(t) is bounded, then

2 2 X E( SˆN X SX ) C | |1 . k k  pN

• Although we extract a growing number of features, their global variance goes to 0.
• No variance blow-up due to high-order moments.
• Adding layers is critical (here depth is $\log N$).
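A small empirical check of this decay, restricted to first-order moments $E(|X \star \psi_j|)$ for simplicity (Python/NumPy; the AR(1) process, Haar-like wavelets and sample sizes are illustrative choices): the variance of the empirical scattering moments shrinks as the observation length $N$ grows.

import numpy as np

rng = np.random.default_rng(0)

def ar1(n, a=0.8):
    # A linear, stationary process: X(t) = a X(t-1) + white noise.
    x = np.zeros(n)
    for t in range(1, n):
        x[t] = a * x[t - 1] + rng.standard_normal()
    return x

def first_order_moments(x, n_scales=5):
    # Empirical estimates of E(|X * psi_j|) with Haar-like wavelets at scales 2^j.
    out = []
    for j in range(n_scales):
        w = 2 ** j
        psi = np.concatenate([np.ones(w), -np.ones(w)]) / (2 * w)
        out.append(np.abs(np.convolve(x, psi, mode="valid")).mean())
    return np.array(out)

# The estimator's variance across realizations decays as N grows.
for N in [2 ** 10, 2 ** 13, 2 ** 16]:
    estimates = np.array([first_order_moments(ar1(N)) for _ in range(20)])
    print(f"N = {N:6d}   mean variance of the moment estimates = {estimates.var(axis=0).mean():.2e}")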

Classification with Scattering

• State-of-the-art on pattern and texture recognition:
  – MNIST [PAMI'13]

  – Texture (CUReT) [PAMI'13]

• Object Recognition:

  – 17% error on CIFAR-10 [Oyallon, Mallat, CVPR'15], using better second-layer wavelets that recombine channels.
  – How good is that? Not good enough: the best models do < 5%.

Scattering Limitations

• Modeling intra-class variability as geometric variability/texture is not sufficient.
  – Adapting wavelets to the dataset, or even to the class.

• No simple mechanism to reduce the number of feature maps without learning.
  – The scattering construction increases the number of feature maps at an exponential rate.

• Extend the group formalism to understand the behavior of deeper layers.
  – Although extensions to the roto-translation group yield state-of-the-art results in general object recognition ["Group Equivariant CNNs", Cohen & Welling, ICML'16].

Signal and Texture Recovery Challenge

$S_J x = \big\{\, x \star \phi_J,\ |x \star \psi_{j_1}| \star \phi_J,\ \big||x \star \psi_{j_1}| \star \psi_{j_2}\big| \star \phi_J,\ \dots \,\big\}_{j_i}$

• [Q1] Given $S_J x$ computed with $m$ layers, under what conditions can we recover $x$ (up to global symmetry)? Using what algorithm? As a function of the localization scale $J$?

$SX = \big\{\, E(X),\ E(|X \star \psi_{j_1}|),\ E\big(\big||X \star \psi_{j_1}| \star \psi_{j_2}\big|\big),\ \dots \,\big\}$

• [Q2] Given $SX$, how can we characterize interesting processes? How to sample from such distributions?

Sparse Signal Recovery

Theorem [B,M’14]: Suppose x (t)= a (t b )with b b , 0 n n n | n n+1| and SJ x0 = SJ x with m = 1 and J = .If has compact support, then 1 P x(t)= c (t e ) , with e e . n n | n n+1| & n X Sparse Signal Recovery

Theorem [B,M’14]: Suppose x (t)= a (t b )with b b , 0 n n n | n n+1| and SJ x0 = SJ x with m = 1 and J = .If has compact support, then 1 P x(t)= c (t e ) , with e e . n n | n n+1| & n X

• Sx essentially identifies sparse measures, up to log spacing factors. • Here, sparsity is encoded in the measurements themselves. • In 2D, singular measures (ie curves) require m =2 to be well characterized. Oscillatory Signal Recovery

Theorem [B,M’14]: Suppose x (⇠)= a (⇠ b )with log b 0 n n n | n log b , and S x = S x with m = 2 and J = log N.If has com- n+1| J J 0 P pact support K ,then c  b x(⇠)= c (⇠ e ) , with log e log e . n n | n n+1| & n X b

• Oscillatory, lacunary signals are also well captured with the same measurements. • It is the opposite set of extremal points from previous result. Scattering Reconstruction Algorithm x (0, I) 0 ⇠ N S = xs.t.Sx = S S { 0} b b

$\min_x\ \|\hat{S}x - \hat{S}_0\|^2$

• Non-linear least squares: a non-convex optimization problem.
• Levenberg-Marquardt gradient descent: $x_{n+1} = x_n - (D\hat{S}x_n)^{\dagger}(\hat{S}x_n - \hat{S}_0)$.
• (Weak) global convergence guarantees using complex wavelets: $D\hat{S}x$ is full rank for $m = 2$ if $x$ has compact support.
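A gradient-descent sketch of this reconstruction idea in 1-D (Python/PyTorch; a small differentiable first-order scattering stands in for $\hat{S}$, and plain Adam replaces the Levenberg-Marquardt iteration — both are simplifying assumptions, not the algorithm analyzed above):

import torch

torch.manual_seed(0)
n, n_scales = 256, 6

# Differentiable scattering-type measurements S(x) = { |x * psi_j| * phi }_j.
omega = torch.fft.fftfreq(n)
psis = torch.stack([torch.exp(-((omega - 0.4 / 2 ** j) ** 2) / (2 * (0.15 / 2 ** j) ** 2))
                    for j in range(n_scales)])          # analytic band-pass filters
phi = torch.exp(-(omega ** 2) / (2 * 0.01 ** 2))        # low-pass: local average

def S(x):
    X = torch.fft.fft(x)
    u = torch.abs(torch.fft.ifft(X[None, :] * psis))    # |x * psi_j|, shape (J, n)
    return torch.real(torch.fft.ifft(torch.fft.fft(u) * phi))   # local averages

# Target measurements S0 from a sparse spike train x_target.
x_target = torch.zeros(n)
x_target[[40, 90, 170]] = torch.tensor([1.0, -0.7, 0.5])
S0 = S(x_target).detach()

# Initialize x0 ~ N(0, sigma I) and descend on || S(x) - S0 ||^2.
x = (0.1 * torch.randn(n)).requires_grad_(True)
opt = torch.optim.Adam([x], lr=0.05)
for it in range(2000):
    opt.zero_grad()
    loss = torch.sum((S(x) - S0) ** 2)
    loss.backward()
    opt.step()

print("misfit ||S(x) - S0||^2 at the last iteration:", float(loss))
print("largest |x| entries located at              :", sorted(torch.topk(x.detach().abs(), 3).indices.tolist()))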

Sparse Shape Reconstructions

[Figure: original images of $N^2$ pixels; with $m = 1$, $2^J = N$: reconstruction from $O(\log_2 N)$ scattering coefficients; with $m = 2$, $2^J = N$: reconstruction from $O(\log_2^2 N)$ scattering coefficients.]

Ergodic Texture Reconstruction

[Figure: original textures; Gaussian process model with the same second-order moments; $m = 2$, $2^J = N$: reconstruction from $O(\log_2^2 N)$ scattering coefficients.]

Ergodic Texture Reconstruction using CNNs

• Results using a deep CNN from [Gatys et al., NIPS'15].
• Uses a much larger feature vector than scattering ($O((\log N)^2)$).

Application: Super-Resolution

[Figure: super-resolution operator $F$ mapping the low-resolution input $x$ to the high-resolution image $y$.]

• Best linear method: least-squares estimate (linear interpolation): $\hat{y} = (\Sigma_x^{\dagger}\, \Sigma_{xy})\, x$.
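A sketch of fitting this best-linear estimate from training pairs (Python/NumPy; the 1-D signals, average-pooling degradation and sizes are illustrative assumptions): with centered data, ordinary least squares over $(x_i, y_i)$ pairs computes exactly the empirical $\Sigma_x^{\dagger}\Sigma_{xy}$.

import numpy as np

rng = np.random.default_rng(0)

def downsample(y, factor=2):
    # Toy degradation: average pooling by `factor` gives the low-resolution view.
    return y.reshape(-1, factor).mean(axis=1)

# Synthetic training pairs: smooth-ish high-res signals y_i and low-res x_i.
n_hi, n_train = 32, 5000
Y = np.cumsum(rng.standard_normal((n_train, n_hi)), axis=1)
Y -= Y.mean(axis=1, keepdims=True)
X = np.stack([downsample(y) for y in Y])

# Least-squares linear estimator y_hat = W x; empirically W = Sigma_x^+ Sigma_xy.
W, *_ = np.linalg.lstsq(X, Y, rcond=None)

# Held-out example, compared with naive sample-replication upsampling.
y_test = np.cumsum(rng.standard_normal(n_hi)); y_test -= y_test.mean()
x_test = downsample(y_test)
rel = lambda z: np.linalg.norm(z - y_test) / np.linalg.norm(y_test)
print("relative error, linear LS estimate :", round(rel(x_test @ W), 3))
print("relative error, naive upsampling   :", round(rel(np.repeat(x_test, 2)), 3))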

• State-of-the-art methods:
  – Dictionary-learning super-resolution.
  – CNN-based: just train a CNN to regress from low-res to high-res.
  – They cleverly optimize a fundamentally unstable metric criterion:
    $\Theta^* = \arg\min_\Theta \sum_i \|F(x_i, \Theta) - y_i\|^2, \qquad \hat{y} = F(x, \Theta^*).$
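A condensed sketch of that CNN regression criterion (Python/PyTorch; the tiny three-layer network, 1-D signals and synthetic data are placeholders, loosely in the spirit of SRCNN-style models rather than any particular published architecture):

import torch
import torch.nn as nn
import torch.nn.functional as nnF

torch.manual_seed(0)

# F(x; Theta): a small convolutional regressor from upsampled low-res to high-res.
model = nn.Sequential(
    nn.Conv1d(1, 16, kernel_size=9, padding=4), nn.ReLU(),
    nn.Conv1d(16, 16, kernel_size=5, padding=2), nn.ReLU(),
    nn.Conv1d(16, 1, kernel_size=5, padding=2),
)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

def make_batch(batch=32, n=64):
    # Synthetic pairs: high-res y, and x = downsampled-then-upsampled version of y.
    y = torch.cumsum(torch.randn(batch, 1, n), dim=-1)
    y = y - y.mean(dim=-1, keepdim=True)
    x = nnF.interpolate(nnF.avg_pool1d(y, 2), size=n, mode="linear", align_corners=False)
    return x, y

# Theta* = argmin_Theta sum_i || F(x_i; Theta) - y_i ||^2
for step in range(500):
    x, y = make_batch()
    loss = ((model(x) - y) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
    if step % 100 == 0:
        print(f"step {step:4d}   pixel MSE = {loss.item():.4f}")
# y_hat = F(x; Theta*): apply `model` to a new upsampled low-res input.

The pixel-wise MSE in this loop is exactly the metric criterion called unstable above; the scattering approach that follows relaxes it by regressing in a deformation-stable transformed domain.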

Scattering Approach

• Relax the metric: estimate in the scattering domain instead of regressing $F : x \to y$ directly (diagram: $x \to Sx \to \widehat{Sy} \to \hat{y} = S^{-1}\widehat{Sy}$).
  – Start with simple linear estimation in the scattering domain.
  – Deformation stability gives more approximation power in the transformed domain via locally linear methods.
  – The method is not necessarily better in terms of PSNR!

Some Numerical Results

[Figures: original image, linear estimate, best state-of-the-art estimate, scattering estimate.]

Sparse Spike Super-Resolution (with I. Dokmanic (ENS/UIUC), S. Mallat)

[Examples with Cox processes (inhomogeneous Poisson point processes).]

Conclusions and Open Problems

• CNNs: geometric encoding with built-in deformation stability.
  – Equipped to break the curse of dimensionality.
• This statistical advantage is useful both in supervised and unsupervised learning.
  – Gibbs CNN distributions are stable to deformations.
  – Exploited in high-dimensional inverse problems.
• Challenges ahead:
  – Decode the geometry learnt by CNNs: what is the role of higher layers?
  – CNNs and unsupervised learning: we need better inference.
  – Optimization in CNNs: exploit the layerwise/convolutional model.
  – Non-Euclidean domains (text, genomics, n-body dynamical systems).

Thank you!