Addressing the Curse of Dimensionality with Convolutional Neural Networks
Joan Bruna, UC Berkeley → Courant Institute, NYU
Collaborators: Ivan Dokmanic (UIUC), Stephane Mallat (ENS, France), Pablo Sprechmann (NYU), Yann LeCun (NYU)

From Sanja Fidler (inspired by R. Urtasun, ICLR'16):
  Computer Vision: what Neural Networks can do for me.
  Machine Learning: what can I do for Neural Networks.
  Theory??

Curse of Dimensionality
• Images, videos, audio and text are instances of high-dimensional data: x ∈ R^d, d ∼ 10^6.
• Learning as a high-dimensional interpolation problem: observe {(x_i, y_i = f(x_i))}_{i ≤ N} for some unknown f : R^d → R.
• Goal: estimate f̂ from the training data.
• "Pessimist" view: assume as little as possible about f, e.g. f is Lipschitz: |f(x) − f(x′)| ≤ L ‖x − x′‖.
• Q: How many points do we need to observe to guarantee error |f̂(x) − f(x)| < ε?
• N should be ∼ ε^(−d): we pay an exponential price in the input dimension.
• Therefore, in order to beat the curse of dimensionality, it is necessary to make assumptions about our data and exploit them in our models.
• Invariance and local stability perspective:
  – Supervised learning: (x_i, y_i)_i with labels y_i ∈ {1, …, K}. f(x) = p(y|x) satisfies f(x_τ) ≈ f(x) if {x_τ}_τ is a high-dimensional family of deformations of x.
  – Unsupervised learning: (x_i)_i. The density f(x) = p(x) also satisfies f(x_τ) ≈ f(x).

Ill-posed Inverse Problems
• Consider the following linear problem: y = Γx + w, with x, y, w ∈ L²(R^d), where
  – Γ is singular,
  – x is drawn from a certain distribution (e.g. natural images),
  – w is independent of x.
• Examples: super-resolution, inpainting, deconvolution.
• Standard regularization route: x̂ = arg min_x ‖Γx − y‖² + R(x), with R(x) (typically) convex (a minimal numerical sketch follows below).
• Q: How to leverage training data?
• Q: Is this formulation always appropriate to estimate images?
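As an illustration of this standard regularization route, the following is a minimal NumPy sketch (not from the talk) that solves x̂ = arg min_x ‖Γx − y‖² + λ‖x‖² for a toy 1-D deconvolution, taking Γ to be a circular Gaussian blur and R(x) = λ‖x‖² a quadratic (Tikhonov) regularizer; the test signal, blur width and λ are illustrative choices.

```python
import numpy as np

# Toy ill-posed inverse problem: y = Gamma x + w, with Gamma a circular Gaussian blur.
d = 256
u = np.arange(d)
x_true = (np.abs(u - 96) < 20).astype(float) + 0.5 * (np.abs(u - 180) < 10)

# Circulant blur matrix Gamma (strongly low-pass, hence nearly singular).
h = np.exp(-0.5 * ((u - d // 2) / 4.0) ** 2)
h /= h.sum()
Gamma = np.stack([np.roll(h, k - d // 2) for k in range(d)], axis=0)

rng = np.random.default_rng(0)
y = Gamma @ x_true + 0.01 * rng.standard_normal(d)    # noisy, blurred observation

# Regularized least squares with R(x) = lam * ||x||^2 (Tikhonov):
#   x_hat = argmin ||Gamma x - y||^2 + lam ||x||^2
#   =>  (Gamma^T Gamma + lam I) x_hat = Gamma^T y
lam = 1e-2
x_hat = np.linalg.solve(Gamma.T @ Gamma + lam * np.eye(d), Gamma.T @ y)
x_ls = np.linalg.lstsq(Gamma, y, rcond=None)[0]       # unregularized solution amplifies the noise

print("relative error, no regularization:", np.linalg.norm(x_ls - x_true) / np.linalg.norm(x_true))
print("relative error, Tikhonov         :", np.linalg.norm(x_hat - x_true) / np.linalg.norm(x_true))
```

Because this model is log-concave, any convex combination of two equally plausible reconstructions is at least as plausible, which is precisely the limitation discussed on the next slide.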
Regularization and Learning
• The inverse problem requires a high-dimensional model for p(x|y).
• The underlying probabilistic model for regularized least squares is p(x|y) ∝ N(Γx − y, I) · e^(−R(x)).
• Suppose x_1, x_2 are such that p(x_1|y) = p(x_2|y).
• Since −log p(x|y) is convex, it results that p(αx_1 + (1−α)x_2 | y) ≥ p(x_1|y) for all α ∈ [0, 1].
• Jitter model (figure): two shifted versions x_1 and x_2 are equally plausible reconstructions of the same y, so their convex combination αx_1 + (1−α)x_2 must be at least as plausible under this model.
• The conditional distribution of images is not well modeled with Gaussian noise alone; it is non-convex in general.

Plan
• CNNs and the curse of dimensionality in image/audio regression.
• Multi-scale wavelet scattering convolutional networks.
• Generative models using CNN sufficient statistics.
• Applications to inverse problems: image and texture super-resolution.
Joint work with P. Sprechmann (NYU), Yann LeCun (NYU/FAIR), Ivan Dokmanic (UIUC), S. Mallat (ENS) and M. de Hoop (Rice).

Learning, Features and Kernels
• Change of variable Φ(x) = {φ_k(x)}_{k ≤ d′} to nearly linearize f(x), which is then approximated by the 1-D projection f̃(x) = ⟨Φ(x), w⟩ = Σ_{k ≤ d′} w_k φ_k(x).
• Data: x ∈ R^d, features Φ(x) ∈ R^{d′}, linear classifier w.
• Metric: ‖x − x′‖ becomes ‖Φ(x) − Φ(x′)‖.
• How and when is it possible to find such a Φ?
• What "regularity" of f is needed?

Feature Representation Wish-List
• A change of variable Φ(x) must linearize the orbits {g·x}_{g ∈ G}: it linearizes symmetries, mapping x, x′ and their transformed versions g_1^p·x, g_1^p·x′ onto nearly flat trajectories.
• Lipschitz: ∀ x, g: ‖Φ(x) − Φ(g·x)‖ ≤ C ‖g‖.
• Discriminative: ‖Φ(x) − Φ(x′)‖ ≥ C |f(x) − f(x′)|.

Geometric Variability Prior
• x(u), with u indexing pixels, time samples, etc.
• τ(u): deformation field; L_τ(x)(u) = x(u − τ(u)): warping. (Illustrated with a video of Philipp Scott Johnson.)
• Deformation prior: ‖τ‖ = λ sup_u |τ(u)| + sup_u |∇τ(u)|.
  – Models changes in point of view in images
  – Models frequency transpositions in sounds
  – Consistent with local translation invariance

Rotation and Scaling Variability
• Rotation and deformations: group SO(2) × Diff(SO(2)).
• Scaling and deformations: group R × Diff(R).

Geometric Variability Prior
• Blur operator: Ax = x ∗ φ, with φ(u) a local average.
  – The only linear operator A stable to deformations: ‖A L_τ x − A x‖ ≤ ‖τ‖ ‖x‖. [Bruna'12] (Sketched numerically below.)
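To make the warping model and this stability claim concrete, here is a small NumPy sketch (illustrative, not from the talk): it warps a 1-D signal by a smooth deformation field τ and compares how much the raw signal and its local average Ax = x ∗ φ change under the warp. The test signal, deformation field and window width are arbitrary choices.

```python
import numpy as np

d = 512
u = np.arange(d, dtype=float)

# Test signal with fine-scale oscillations, which are sensitive to small deformations.
x = np.sin(2 * np.pi * u / 16) * np.exp(-0.5 * ((u - 256) / 80) ** 2)

# Smooth deformation field tau(u) and the warping L_tau x(u) = x(u - tau(u)).
tau = 3.0 * np.sin(2 * np.pi * u / d)          # small sup|tau| and sup|grad tau|
x_warp = np.interp(u - tau, u, x)              # evaluate x at the displaced positions u - tau(u)

# Local averaging (blur) operator A x = x * phi, with phi a Gaussian window.
t = np.arange(-64.0, 65.0)
phi = np.exp(-0.5 * (t / 16) ** 2)
phi /= phi.sum()
A = lambda s: np.convolve(s, phi, mode="same")

# The raw signal changes a lot under the warp; its local average barely moves.
print("||L_tau x - x||    / ||x|| =", np.linalg.norm(x_warp - x) / np.linalg.norm(x))
print("||A L_tau x - A x|| / ||x|| =", np.linalg.norm(A(x_warp) - A(x)) / np.linalg.norm(x))
```

The price of this stability is that A discards high frequencies; the wavelet filter bank introduced below is the way the talk recovers that information while remaining stable.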
Excerpt from Chapter 2, "Invariant Scattering Representations" (Figure 2.1: dilation of a complex bandpass window; if ξ ≫ σ_x s^(−1), then the supports of x̂ and x̂_τ are nearly disjoint):

Besides deformation instabilities, the Fourier modulus and the autocorrelation lose too much information. For example, a Dirac δ(u) and a linear chirp e^(iu²) are two signals having Fourier transforms whose moduli are equal and constant. Very different signals may not be discriminated from their Fourier modulus.

A canonical invariant [KDGH07; Soa09] Φ(x) = x(u − a(x)) registers x ∈ L²(R^d) with an anchor point a(x), which is translated when x is translated: a(x_c) = a(x) + c. It thus defines a translation-invariant representation: Φx_c = Φx. For example, the anchor point may be a filtered maximum a(x) = arg max_u |x ⋆ h(u)|, for some filter h(u). A canonical invariant Φx(u) = x(u − a(x)) carries more information than a Fourier modulus, and characterizes x up to a global absolute position [Soa09]. However, it has the same high-frequency instability as a Fourier modulus transform. Indeed, for any choice of anchor point a(x), applying the Plancherel formula proves that
‖x(u − a(x)) − x′(u − a(x′))‖ ≥ (2π)^(−1) ‖ |x̂(ω)| − |x̂′(ω)| ‖.
If x′ = x_τ, the Fourier transform instability at high frequencies implies that Φx = x(u − a(x)) is also unstable with respect to deformations.

2.2.5 SIFT and HoG
SIFT (Scale Invariant Feature Transform) is a local image descriptor introduced by Lowe in [Low04], which achieved huge popularity thanks to its invariance and discriminability properties. The SIFT method originally consists of a keypoint detection phase, using a Differences-of-Gaussians pyramid, followed by a local description around each detected keypoint. The keypoint detection computes local maxima on a scale space generated by isotropic Gaussian differences, which induces invariance to translations and rotations, and partially to scaling. The descriptor then computes histograms of image gradient amplitudes, using 8 orientation bins on a 4×4 grid around each keypoint, as shown in Figure 2.2.

Geometric Variability Prior
• Blur operator Ax = x ∗ φ (as above): the only linear operator stable to deformations.
• Wavelet filter bank: Wx = {x ∗ ψ_k}_k, with ψ_k(u) = 2^(−j) ψ(2^(−j) R_{θ_k} u), where ψ is a spatially localized band-pass filter.
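As a companion to this last definition, here is a minimal NumPy sketch (illustrative, not the implementation used in the talk) of an oriented wavelet filter bank Wx = {x ∗ ψ_k}: Morlet-like wavelets at a few dyadic scales j and orientations θ_k, applied by multiplication in the Fourier domain. The mother-wavelet parameters and the grid size are arbitrary choices.

```python
import numpy as np

def morlet_2d(n, j, theta, xi=3 * np.pi / 4, sigma=0.8):
    """Oriented band-pass filter psi_k(u) = 2^(-j) psi(2^(-j) R_theta u) on an n x n grid.
    xi (center frequency) and sigma (envelope width) are illustrative mother-wavelet parameters."""
    g = np.mgrid[-(n // 2):n // 2, -(n // 2):n // 2].astype(float)
    rot = np.array([[np.cos(theta), -np.sin(theta)],
                    [np.sin(theta),  np.cos(theta)]])
    v = np.tensordot(rot, g, axes=1) * 2.0 ** (-j)       # v = 2^(-j) R_theta u
    envelope = np.exp(-(v[0] ** 2 + v[1] ** 2) / (2 * sigma ** 2))
    wave = np.exp(1j * xi * v[0])
    psi = envelope * (wave - wave.mean())                 # rough zero-mean correction (band-pass)
    return 2.0 ** (-j) * np.fft.ifftshift(psi)            # 2^(-j) dilation factor, as on the slide

def wavelet_filter_bank(x, scales=(0, 1, 2), n_orient=4):
    """W x = { x * psi_{j,theta_k} }, computed as pointwise products in the Fourier domain."""
    n = x.shape[0]
    X = np.fft.fft2(x)
    out = {}
    for j in scales:
        for k in range(n_orient):
            theta = np.pi * k / n_orient
            psi_hat = np.fft.fft2(morlet_2d(n, j, theta))
            out[(j, k)] = np.fft.ifft2(X * psi_hat)       # circular convolution x * psi_{j,k}
    return out

# Example: each output channel is a band-pass, oriented view of the input image.
x = np.random.default_rng(0).standard_normal((64, 64))
coeffs = wavelet_filter_bank(x)
print(len(coeffs), coeffs[(0, 0)].shape)
```

In the scattering construction listed in the Plan, these band-pass channels are combined with a modulus and the local average φ to obtain measurements that are stable to deformations yet remain discriminative.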