
Addressing the Curse of Dimensionality with Convolutional Neural Networks

Joan Bruna (UC Berkeley → Courant Institute, NYU)

Collaborators: Ivan Dokmanic (UIUC), Stéphane Mallat (ENS, France), Pablo Sprechmann (NYU), Yann LeCun (NYU)

From Sanja Fidler (inspired by R. Urtasun, ICLR'16):
Computer Vision: what Neural Networks can do for me.
Machine Learning: what can I do for Neural Networks.
Theory: ??

Curse of Dimensionality

• Images, videos, audio and text are instances of high-dimensional data.

$x \in \mathbb{R}^d$, $d \sim 10^6$.

• Learning as a high-dimensional interpolation problem: observe $\{(x_i, y_i = f(x_i))\}_{i \le N}$ for some unknown $f : \mathbb{R}^d \to \mathbb{R}$.
• Goal: estimate $\hat{f}$ from the training data.
• "Pessimist" view: assume as little as possible on $f$; e.g. $f$ is Lipschitz: $|f(x) - f(x')| \le L\,\|x - x'\|$.
• Q: How many points do we need to observe to guarantee an error $|\hat{f}(x) - f(x)| < \epsilon$?
• A: $N$ should be $\sim \epsilon^{-d}$. We pay an exponential price in the input dimension $d$.
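A quick numerical illustration of this rate (a minimal Python/NumPy sketch; the uniform data model, sample sizes and dimensions are arbitrary illustrative choices): with $N$ points drawn in $[0,1]^d$, the distance from a query to its nearest training sample behaves like $N^{-1/d}$, so a Lipschitz interpolator needs $N \sim \epsilon^{-d}$ samples to reach accuracy $\epsilon$.

import numpy as np

rng = np.random.default_rng(0)

def mean_nn_distance(n_samples, dim, n_queries=100):
    # Average distance from a random query point to its nearest training
    # sample, both drawn uniformly in [0, 1]^dim.
    train = rng.random((n_samples, dim))
    queries = rng.random((n_queries, dim))
    dists = np.linalg.norm(queries[:, None, :] - train[None, :, :], axis=-1)
    return dists.min(axis=1).mean()

for dim in [1, 2, 5, 10, 20]:
    d_nn = mean_nn_distance(n_samples=2000, dim=dim)
    # For Lipschitz f, the interpolation error scales with this distance;
    # driving it below eps requires N ~ eps^(-dim) samples.
    print(f"d = {dim:2d}   mean nearest-neighbor distance ~ {d_nn:.3f}")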

• Therefore, in order to beat the curse of dimensionality, it is necessary to make assumptions about our data and exploit them in our models.
• Invariance and local stability perspective:
  – Supervised learning: $(x_i, y_i)_i$, $y_i \in \{1, \dots, K\}$ labels. $f(x) = p(y \mid x)$ satisfies $f(x_\tau) \approx f(x)$ if $\{x_\tau\}_\tau$ is a high-dimensional family of deformations of $x$.
  – Unsupervised learning: $(x_i)_i$. The density $f(x) = p(x)$ also satisfies $f(x_\tau) \approx f(x)$.

Ill-posed Inverse Problems

• Consider the following linear problem: $y = \Phi x + w$, with $x, y, w \in L^2(\mathbb{R}^d)$, where
  – $\Phi$ is singular,
  – $x$ is drawn from a certain distribution (e.g. natural images),
  – $w$ is independent of $x$.
• Examples: super-resolution, inpainting, deconvolution.
• Standard regularization route (sketched below): $\hat{x} = \arg\min_x \|\Phi x - y\|^2 + \lambda \mathcal{R}(x)$, with $\mathcal{R}(x)$ (typically) convex.
• Q: How to leverage training data?
• Q: Is this formulation always appropriate for estimating images?
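A minimal sketch of that regularization route on a toy 1-D deconvolution (Python/NumPy; the box blur, the noise level and the quadratic choice $\mathcal{R}(x) = \|Dx\|^2$ are illustrative assumptions, not the talk's setup):

import numpy as np

rng = np.random.default_rng(0)
n = 128

# Forward operator Phi: circular convolution with a box blur (ill-conditioned).
kernel = np.zeros(n)
kernel[:9] = 1.0 / 9.0
Phi = np.stack([np.roll(kernel, i) for i in range(n)])

# Piecewise-constant ground truth, blurred and noisy observation y = Phi x + w.
x_true = np.zeros(n)
x_true[30:60] = 1.0
x_true[80:90] = -0.5
y = Phi @ x_true + 0.02 * rng.standard_normal(n)

# Standard regularization route: x_hat = argmin ||Phi x - y||^2 + lam * R(x),
# with the convex choice R(x) = ||D x||^2, D = circular first differences.
lam = 0.5
D = np.eye(n) - np.roll(np.eye(n), 1, axis=1)
x_hat = np.linalg.solve(Phi.T @ Phi + lam * D.T @ D, Phi.T @ y)

x_naive = np.linalg.lstsq(Phi, y, rcond=None)[0]   # unregularized pseudo-inverse
rel = lambda z: np.linalg.norm(z - x_true) / np.linalg.norm(x_true)
print(f"relative error, pseudo-inverse : {rel(x_naive):.2f}")
print(f"relative error, regularized    : {rel(x_hat):.2f}")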

Regularization and Learning

• The inverse problem requires a high-dimensional model for $p(x \mid y)$.
• The underlying probabilistic model for regularized least squares is $p(x \mid y) \propto \mathcal{N}(\Phi x - y,\, \sigma I)\, e^{-\mathcal{R}(x)}$.
• Suppose $x_1, x_2$ are such that $p(x_1 \mid y) = p(x_2 \mid y)$. Since $-\log p(x \mid y)$ is convex, it results that $p(\alpha x_1 + (1-\alpha) x_2 \mid y) \ge p(x_1 \mid y)$, $\forall \alpha \in [0, 1]$.
• Jitter model (figure): two shifted explanations $x_1$, $x_2$ of the same observation $y$, and their average $\alpha x_1 + (1-\alpha) x_2$.
• The conditional distribution of images is not well modeled only with Gaussian noise. Non-convexity in general.
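A toy illustration of this jitter argument (a Python/NumPy sketch; the spike signals, blur width and shifts are hypothetical): two shifted copies $x_1, x_2$ explain a blurred observation $y$ equally well, and any log-concave posterior (Gaussian likelihood plus convex $\mathcal{R}$) assigns their average at least as much probability, even though the average looks like neither.

import numpy as np

n = 64

def blur(x, width=16):
    # Local average: circular convolution with a box filter of the given width.
    k = np.zeros(n)
    k[:width] = 1.0 / width
    return np.real(np.fft.ifft(np.fft.fft(x) * np.fft.fft(k)))

# Two "jittered" explanations: the same spike shifted by a few pixels.
x1 = np.zeros(n); x1[30] = 1.0
x2 = np.zeros(n); x2[33] = 1.0
x_avg = 0.5 * (x1 + x2)

# Symmetric observation, so that x1 and x2 are equally plausible a posteriori.
y = blur(x_avg)

for name, x in [("x1", x1), ("x2", x2), ("0.5*(x1+x2)", x_avg)]:
    fit = np.sum((blur(x) - y) ** 2)
    print(f"{name:12s}  data term ||Phi x - y||^2 = {fit:.4f}   max(x) = {x.max():.2f}")
# x1 and x2 have identical data terms; their average fits even better under the
# quadratic data term, yet it is two half-spikes rather than a single spike --
# the Gaussian/MSE model averages plausible explanations instead of choosing one.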

Plan

• CNNs and the curse of dimensionality on image/audio regression.

• Multiscale wavelet scattering convolutional networks.

• Generative Models using CNN sufficient statistics.

• Applications to inverse problems: image and texture super-resolution.

Joint work with P. Sprechmann (NYU), Yann LeCun (NYU/FAIR), Ivan Dokmanic (UIUC), S. Mallat (ENS) and M. de Hoop (Rice).

Learning, Features and Kernels

• Change of variable $\Phi(x) = \{\phi_k(x)\}_{k \le d'}$ to nearly linearize $f(x)$, which is approximated by a 1-D projection:
  $\tilde{f}(x) = \langle \Phi(x), w \rangle = \sum_{k \le d'} w_k\, \phi_k(x)$.
• Data: $x \in \mathbb{R}^d \ \mapsto\ \Phi(x) \in \mathbb{R}^{d'}$, followed by a linear classifier.
• Metric: $\|x - x'\|_\Phi = \|\Phi(x) - \Phi(x')\|$.
• How and when is it possible to find such a $\Phi$?
• What "regularity" of $f$ is needed?
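A minimal sketch of this "change of variable + linear classifier" recipe (Python/NumPy; random Fourier features stand in for $\Phi$ and a toy two-circles problem for the data — both are illustrative assumptions, not the representations discussed next):

import numpy as np

rng = np.random.default_rng(0)

# Toy data: two concentric circles in R^2; the label is not linearly separable in x.
n = 1000
radius = np.where(rng.random(n) < 0.5, 1.0, 2.0)
angle = 2 * np.pi * rng.random(n)
X = np.stack([radius * np.cos(angle), radius * np.sin(angle)], axis=1)
X += 0.1 * rng.standard_normal(X.shape)
labels = (radius > 1.5).astype(float)
X_tr, X_te, y_tr, y_te = X[:800], X[800:], labels[:800], labels[800:]

# Change of variable Phi(x): random Fourier features (approximate a Gaussian kernel).
d_feat = 200
W = rng.standard_normal((2, d_feat)) / 0.5      # frequencies; bandwidth 0.5 is arbitrary
b = 2 * np.pi * rng.random(d_feat)
phi = lambda Z: np.sqrt(2.0 / d_feat) * np.cos(Z @ W + b)

def linear_fit_accuracy(F_tr, F_te, lam=1e-3):
    # Ridge-regress the labels on the features, then threshold: f~(x) = <Phi(x), w>.
    w = np.linalg.solve(F_tr.T @ F_tr + lam * np.eye(F_tr.shape[1]),
                        F_tr.T @ (2 * y_tr - 1))
    return np.mean((F_te @ w > 0) == (y_te > 0.5))

print("linear model on raw x  :", linear_fit_accuracy(X_tr, X_te))
print("linear model on Phi(x) :", linear_fit_accuracy(phi(X_tr), phi(X_te)))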

Feature Representation Wish-List

• A change of variable $\Phi(x)$ must linearize the orbits $\{g.x\}_{g \in G}$.
• Linearize symmetries with the change of variable $\Phi(x)$ (figure: $x$, $x'$ and their transformed versions $g_1.x$, $g_1^p.x$, $g_1.x'$, $g_1^p.x'$).
• Lipschitz: $\forall x, g:\ \|\Phi(x) - \Phi(g.x)\| \le C\, \|g\|$.
• Discriminative: $\|\Phi(x) - \Phi(x')\| \ge C\, |f(x) - f(x')|$.

Geometric Variability Prior

$x(u)$, $u$: pixels, time samples, etc.
$\tau(u)$: deformation field.
$L_\tau(x)(u) = x(u - \tau(u))$: warping.

[Video of Philipp Scott Johnson]

• Deformation prior (see the warping sketch below): $\|\tau\| = \sup_u |\tau(u)| + \sup_u |\nabla \tau(u)|$.
  – Models changes in point of view in images.
  – Models frequency transpositions in sounds.
  – Consistent with local translation invariance.
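A small sketch of the warping operator $L_\tau x(u) = x(u - \tau(u))$ and a discrete estimate of the deformation norm above (Python/NumPy + SciPy; the blob image and the sinusoidal displacement field are arbitrary illustrations):

import numpy as np
from scipy.ndimage import map_coordinates

# Toy image x(u): a smooth blob on a 64x64 grid.
n = 64
u1, u2 = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
x = np.exp(-((u1 - 32) ** 2 + (u2 - 32) ** 2) / (2 * 6.0 ** 2))

# Smooth deformation field tau(u): a small sinusoidal displacement.
tau1 = 2.0 * np.sin(2 * np.pi * u2 / n)
tau2 = 1.5 * np.cos(2 * np.pi * u1 / n)

# Warping (L_tau x)(u) = x(u - tau(u)): interpolate x at the displaced grid points.
x_warped = map_coordinates(x, [u1 - tau1, u2 - tau2], order=1, mode="nearest")

# Deformation prior ||tau|| = sup_u |tau(u)| + sup_u |grad tau(u)| (finite differences).
amplitude = np.sqrt(tau1 ** 2 + tau2 ** 2).max()
gradients = np.gradient(tau1) + np.gradient(tau2)        # list of 4 partial derivatives
jacobian = max(np.abs(g).max() for g in gradients)
print(f"sup |tau|      ~ {amplitude:.2f} pixels")
print(f"sup |grad tau| ~ {jacobian:.2f}   (small: the warp is close to a local translation)")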

Rotation and Scaling Variability

• Rotation and deformations. Group: $SO(2) \times \mathrm{Diff}(SO(2))$.
• Scaling and deformations. Group: $\mathbb{R} \times \mathrm{Diff}(\mathbb{R})$.

Geometric Variability Prior

• Blur operator: $Ax = x \star \phi$, with $\phi$ a local average.
  – The only linear operator $A$ stable to deformations: $\|A L_\tau x - A x\| \le \|\tau\|\, \|x\|$. [Bruna'12]

[Excerpt from [Bruna'12], Chapter 2 (Invariant Scattering Representations): the Fourier modulus and the autocorrelation are translation invariant but unstable to deformations and lose too much information; canonical invariants registered to an anchor point inherit the same high-frequency instability; SIFT/HoG descriptors are recalled. Figure 2.1: dilation of a complex bandpass window — if $\xi \sigma_x \gg s^{-1}$, the supports are nearly disjoint.]

• Wavelet filter bank: $Wx = \{x \star \psi_k\}_k$, $\psi_k(u) = 2^{-j}\, \psi(2^{-j} R_\theta\, u)$, $k = (j, \theta)$, with $\psi$ a spatially localized band-pass filter.
  – $W$ recovers the information lost by $A$.
  – $W$ nearly commutes with deformations: $\|W L_\tau - L_\tau W\| \le C\, |\nabla \tau|_\infty$.
• Point-wise non-linearity $\rho(x) = |x|$:
  – Commutes with deformations: $\rho L_\tau x = L_\tau \rho x$. [Bruna'12]
  – Demodulates wavelet coefficients, preserves energy.

Scattering Convolutional Network

$S_J f = \big\{\, f \star \phi_J,\ \ |f \star \psi_{j_1,\gamma_1}| \star \phi_J,\ \ \big|\,|f \star \psi_{j_1,\gamma_1}| \star \psi_{j_2,\gamma_2}\,\big| \star \phi_J,\ \dots \,\big\}_{j_1, j_2, \dots,\ \gamma_1, \gamma_2, \dots}$

• Each layer applies $|W_J|$: a wavelet decomposition followed by a modulus, with the low-pass output $\star\, \phi_J$ emitted at every depth.
• Cascade of contractive operators (see the 1-D sketch below).

Image Examples [Bruna, Mallat, '11, '12]

[Figure: images alongside their Fourier modulus $|\hat{f}|$, first-order scattering $|f \star \psi_{j_1,\gamma_1}| \star \phi$ (comparable to SIFT descriptors), and second-order scattering $\big||f \star \psi_{j_1,\gamma_1}| \star \psi_{j_2,\gamma_2}\big| \star \phi$; here the window size equals the image size.]
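A minimal 1-D rendering of this cascade (Python/NumPy; hand-built Gabor-like analytic filters, global averaging and two layers — a toy stand-in for the construction of [Bruna, Mallat], not a reference implementation):

import numpy as np

def filter_bank(n, n_scales=6):
    # Analytic band-pass filters psi_j and a low-pass phi, all in the Fourier domain.
    omega = np.fft.fftfreq(n)
    psis = [np.exp(-((omega - 0.4 / 2 ** j) ** 2) / (2 * (0.15 / 2 ** j) ** 2))
            for j in range(n_scales)]
    phi = np.exp(-(omega ** 2) / (2 * (0.15 / 2 ** n_scales) ** 2))
    return psis, phi

def scattering(x, n_scales=6):
    # Order 0, 1 and 2 coefficients with a global average (window = signal length).
    n = len(x)
    psis, phi = filter_bank(n, n_scales)
    conv = lambda F, g: np.fft.ifft(F * g)          # filtering in the Fourier domain
    X = np.fft.fft(x)
    coeffs = [np.abs(conv(X, phi)).mean()]          # S0: x * phi
    for j1, psi1 in enumerate(psis):
        u1 = np.abs(conv(X, psi1))                  # |x * psi_{j1}|
        coeffs.append(u1.mean())                    # S1: |x * psi_{j1}| * phi
        U1 = np.fft.fft(u1)
        for j2 in range(j1 + 1, n_scales):          # only coarser scales carry energy
            coeffs.append(np.abs(conv(U1, psis[j2])).mean())   # S2 coefficients
    return np.array(coeffs)

# Sanity check: a (circular) translation barely changes the scattering vector.
rng = np.random.default_rng(0)
x = np.convolve(rng.standard_normal(512), np.ones(8) / 8, mode="same")
x_shift = np.roll(x, 37)
rel = lambda a, b: np.linalg.norm(a - b) / np.linalg.norm(b)
print("relative change of the signal     :", round(rel(x_shift, x), 3))
print("relative change of its scattering :", round(rel(scattering(x_shift), scattering(x)), 6))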

Theorem: [Mallat ’10] With appropriate wavelets, SJ is stable to additive noise, S (x + n) S x n , k J J kk k unitary, S x = x , and stable to deformations k J k k k S x S x C x ⌧ . k J ⌧ J k k kkr k

Representation of Stationary Processes

• $x(u)$: realizations of a stationary process $X(u)$ (not Gaussian).
• $\Phi(X) = \{E(f_i(X))\}_i$; estimation from samples $x(n)$: $\widehat{\Phi}(X) = \Big\{\frac{1}{N}\sum_n f_i(x)(n)\Big\}_i$.
• Discriminability: need to capture high-order moments.
• Stability: $E\big(\|\Phi(X) - \widehat{\Phi}(X)\|^2\big)$ small.

Scattering Moments

$SX = \Big\{\, E(X),\ E\big(|X \star \psi_{j_1,\gamma_1}|\big),\ E\big(\big||X \star \psi_{j_1,\gamma_1}| \star \psi_{j_2,\gamma_2}\big|\big),\ \dots \,\Big\}$, for all scales and orientations $(j_i, \gamma_i)$.

Properties of Scattering Moments

• Captures high-order moments. [Bruna, Mallat, '11, '12]
[Figure: power spectrum vs. scattering representation $S_J[p]X$ at orders $m = 1$ and $m = 2$.]

• Cascading non-linearities is necessary to reveal higher-order moments.

Consistency of Scattering Moments

Theorem: [B’15] If is a wavelet such that 1 1, and X(t)isa linear, stationary process with finite energy, then k k 

2 lim E( SˆN X SX )=0. N k k !1 Consistency of Scattering Moments

Theorem: [B’15] If is a wavelet such that 1 1, and X(t)isa linear, stationary process with finite energy, then k k 

2 lim E( SˆN X SX )=0. N k k !1

Corollary: If moreover X(t) is bounded, then

2 2 X E( SˆN X SX ) C | |1 . k k  pN

• Although we extract a growing number of features, their global variance goes to 0.
• No variance blow-up due to high-order moments.
• Adding layers is critical (here depth is $\log N$).
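A small empirical check of this decay, restricted to first-order moments $E(|X \star \psi_j|)$ for simplicity (Python/NumPy; the AR(1) process, Haar-like wavelets and sample sizes are illustrative choices): the variance of the empirical scattering moments shrinks as the observation length $N$ grows.

import numpy as np

rng = np.random.default_rng(0)

def ar1(n, a=0.8):
    # A linear, stationary process: X(t) = a X(t-1) + white noise.
    x = np.zeros(n)
    for t in range(1, n):
        x[t] = a * x[t - 1] + rng.standard_normal()
    return x

def first_order_moments(x, n_scales=5):
    # Empirical estimates of E(|X * psi_j|) with Haar-like wavelets at scales 2^j.
    out = []
    for j in range(n_scales):
        w = 2 ** j
        psi = np.concatenate([np.ones(w), -np.ones(w)]) / (2 * w)
        out.append(np.abs(np.convolve(x, psi, mode="valid")).mean())
    return np.array(out)

# The estimator's variance across realizations decays as N grows.
for N in [2 ** 10, 2 ** 13, 2 ** 16]:
    estimates = np.array([first_order_moments(ar1(N)) for _ in range(20)])
    print(f"N = {N:6d}   mean variance of the moment estimates = {estimates.var(axis=0).mean():.2e}")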

Classification with Scattering

• State-of-the-art on pattern and texture recognition:
  – MNIST [PAMI'13]

  – Texture (CUReT) [PAMI'13]

• Object Recognition:

  – 17% error on CIFAR-10 [Oyallon, Mallat, CVPR'15], using better second-layer wavelets that recombine channels.
  – How good is that? Not good enough: the best models do < 5%.

Scattering Limitations

• Modeling intra-class variability as geometric variability/texture is not sufficient.
  – Adapting wavelets to the dataset, or even to the class.

• No simple mechanism to reduce the number of feature maps without learning.
  – The scattering construction increases the number of feature maps at an exponential rate.

• Extend the group formalism to understand the behavior of deeper layers.
  – Although extensions to the roto-translation group yield state-of-the-art results in general object recognition ["Group Equivariant CNNs", Cohen & Welling, ICML'16].

Signal and Texture Recovery Challenge

$S_J x = \big\{\, x \star \phi_J,\ |x \star \psi_{j_1}| \star \phi_J,\ \big||x \star \psi_{j_1}| \star \psi_{j_2}\big| \star \phi_J,\ \dots \,\big\}_{j_i}$

• [Q1] Given $S_J x$ computed with $m$ layers, under what conditions can we recover $x$ (up to global symmetry)? Using what algorithm? As a function of the localization scale $J$?

$SX = \big\{\, E(X),\ E(|X \star \psi_{j_1}|),\ E\big(\big||X \star \psi_{j_1}| \star \psi_{j_2}\big|\big),\ \dots \,\big\}$

• [Q2] Given $SX$, how can we characterize interesting processes? How to sample from such distributions?

Sparse Signal Recovery

Theorem [B,M’14]: Suppose x (t)= a (t b )with b b , 0 n n n | n n+1| and SJ x0 = SJ x with m = 1 and J = .If has compact support, then 1 P x(t)= c (t e ) , with e e . n n | n n+1| & n X Sparse Signal Recovery

Theorem [B,M’14]: Suppose x (t)= a (t b )with b b , 0 n n n | n n+1| and SJ x0 = SJ x with m = 1 and J = .If has compact support, then 1 P x(t)= c (t e ) , with e e . n n | n n+1| & n X

• Sx essentially identifies sparse measures, up to log spacing factors. • Here, sparsity is encoded in the measurements themselves. • In 2D, singular measures (ie curves) require m =2 to be well characterized. Oscillatory Signal Recovery

Theorem [B,M’14]: Suppose x (⇠)= a (⇠ b )with log b 0 n n n | n log b , and S x = S x with m = 2 and J = log N.If has com- n+1| J J 0 P pact support K ,then c  b x(⇠)= c (⇠ e ) , with log e log e . n n | n n+1| & n X b

• Oscillatory, lacunary signals are also well captured with the same measurements. • It is the opposite set of extremal points from previous result. Scattering Reconstruction Algorithm x (0, I) 0 ⇠ N S = xs.t.Sx = S S { 0} b b

$\min_x\ \|\hat{S}x - \hat{S}_0\|^2$

• Non-linear least squares: a non-convex optimization problem.
• Levenberg-Marquardt gradient descent: $x_{n+1} = x_n - (D\hat{S}x_n)^{\dagger}(\hat{S}x_n - \hat{S}_0)$.
• (Weak) global convergence guarantees using complex wavelets: $D\hat{S}x$ is full rank for $m = 2$ if $x$ has compact support.
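A gradient-descent sketch of this reconstruction idea in 1-D (Python/PyTorch; a small differentiable first-order scattering stands in for $\hat{S}$, and plain Adam replaces the Levenberg-Marquardt iteration — both are simplifying assumptions, not the algorithm analyzed above):

import torch

torch.manual_seed(0)
n, n_scales = 256, 6

# Differentiable scattering-type measurements S(x) = { |x * psi_j| * phi }_j.
omega = torch.fft.fftfreq(n)
psis = torch.stack([torch.exp(-((omega - 0.4 / 2 ** j) ** 2) / (2 * (0.15 / 2 ** j) ** 2))
                    for j in range(n_scales)])          # analytic band-pass filters
phi = torch.exp(-(omega ** 2) / (2 * 0.01 ** 2))        # low-pass: local average

def S(x):
    X = torch.fft.fft(x)
    u = torch.abs(torch.fft.ifft(X[None, :] * psis))    # |x * psi_j|, shape (J, n)
    return torch.real(torch.fft.ifft(torch.fft.fft(u) * phi))   # local averages

# Target measurements S0 from a sparse spike train x_target.
x_target = torch.zeros(n)
x_target[[40, 90, 170]] = torch.tensor([1.0, -0.7, 0.5])
S0 = S(x_target).detach()

# Initialize x0 ~ N(0, sigma I) and descend on || S(x) - S0 ||^2.
x = (0.1 * torch.randn(n)).requires_grad_(True)
opt = torch.optim.Adam([x], lr=0.05)
for it in range(2000):
    opt.zero_grad()
    loss = torch.sum((S(x) - S0) ** 2)
    loss.backward()
    opt.step()

print("misfit ||S(x) - S0||^2 at the last iteration:", float(loss))
print("largest |x| entries located at              :", sorted(torch.topk(x.detach().abs(), 3).indices.tolist()))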

Sparse Shape Reconstructions

[Figure: original images of $N^2$ pixels; with $m = 1$, $2^J = N$: reconstruction from $O(\log_2 N)$ scattering coefficients; with $m = 2$, $2^J = N$: reconstruction from $O(\log_2^2 N)$ scattering coefficients.]

Ergodic Texture Reconstruction

[Figure: original textures; Gaussian process model with the same second-order moments; $m = 2$, $2^J = N$: reconstruction from $O(\log_2^2 N)$ scattering coefficients.]

Ergodic Texture Reconstruction using CNNs

• Results using a deep CNN from [Gatys et al., NIPS'15].
• Uses a much larger feature vector than scattering ($O((\log N)^2)$).

Application: Super-Resolution

[Figure: super-resolution operator $F$ mapping the low-resolution input $x$ to the high-resolution image $y$.]

• Best linear method: least-squares estimate (linear interpolation): $\hat{y} = (\Sigma_x^{\dagger}\, \Sigma_{xy})\, x$.
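A sketch of fitting this best-linear estimate from training pairs (Python/NumPy; the 1-D signals, average-pooling degradation and sizes are illustrative assumptions): with centered data, ordinary least squares over $(x_i, y_i)$ pairs computes exactly the empirical $\Sigma_x^{\dagger}\Sigma_{xy}$.

import numpy as np

rng = np.random.default_rng(0)

def downsample(y, factor=2):
    # Toy degradation: average pooling by `factor` gives the low-resolution view.
    return y.reshape(-1, factor).mean(axis=1)

# Synthetic training pairs: smooth-ish high-res signals y_i and low-res x_i.
n_hi, n_train = 32, 5000
Y = np.cumsum(rng.standard_normal((n_train, n_hi)), axis=1)
Y -= Y.mean(axis=1, keepdims=True)
X = np.stack([downsample(y) for y in Y])

# Least-squares linear estimator y_hat = W x; empirically W = Sigma_x^+ Sigma_xy.
W, *_ = np.linalg.lstsq(X, Y, rcond=None)

# Held-out example, compared with naive sample-replication upsampling.
y_test = np.cumsum(rng.standard_normal(n_hi)); y_test -= y_test.mean()
x_test = downsample(y_test)
rel = lambda z: np.linalg.norm(z - y_test) / np.linalg.norm(y_test)
print("relative error, linear LS estimate :", round(rel(x_test @ W), 3))
print("relative error, naive upsampling   :", round(rel(np.repeat(x_test, 2)), 3))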

• State-of-the-art methods:
  – Dictionary-learning super-resolution.
  – CNN-based: just train a CNN to regress from low-res to high-res.
  – They cleverly optimize a fundamentally unstable metric criterion:
    $\Theta^* = \arg\min_\Theta \sum_i \|F(x_i, \Theta) - y_i\|^2, \qquad \hat{y} = F(x, \Theta^*).$
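A condensed sketch of that CNN regression criterion (Python/PyTorch; the tiny three-layer network, 1-D signals and synthetic data are placeholders, loosely in the spirit of SRCNN-style models rather than any particular published architecture):

import torch
import torch.nn as nn
import torch.nn.functional as nnF

torch.manual_seed(0)

# F(x; Theta): a small convolutional regressor from upsampled low-res to high-res.
model = nn.Sequential(
    nn.Conv1d(1, 16, kernel_size=9, padding=4), nn.ReLU(),
    nn.Conv1d(16, 16, kernel_size=5, padding=2), nn.ReLU(),
    nn.Conv1d(16, 1, kernel_size=5, padding=2),
)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

def make_batch(batch=32, n=64):
    # Synthetic pairs: high-res y, and x = downsampled-then-upsampled version of y.
    y = torch.cumsum(torch.randn(batch, 1, n), dim=-1)
    y = y - y.mean(dim=-1, keepdim=True)
    x = nnF.interpolate(nnF.avg_pool1d(y, 2), size=n, mode="linear", align_corners=False)
    return x, y

# Theta* = argmin_Theta sum_i || F(x_i; Theta) - y_i ||^2
for step in range(500):
    x, y = make_batch()
    loss = ((model(x) - y) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
    if step % 100 == 0:
        print(f"step {step:4d}   pixel MSE = {loss.item():.4f}")
# y_hat = F(x; Theta*): apply `model` to a new upsampled low-res input.

The pixel-wise MSE in this loop is exactly the metric criterion called unstable above; the scattering approach that follows relaxes it by regressing in a deformation-stable transformed domain.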

Scattering Approach

• Relax the metric: estimate in the scattering domain instead of regressing $F : x \to y$ directly (diagram: $x \to Sx \to \widehat{Sy} \to \hat{y} = S^{-1}\widehat{Sy}$).
  – Start with simple linear estimation in the scattering domain.
  – Deformation stability gives more approximation power in the transformed domain via locally linear methods.
  – The method is not necessarily better in terms of PSNR!

Some Numerical Results

[Figures: original image, linear estimate, best state-of-the-art estimate, scattering estimate.]

Sparse Spike Super-Resolution (with I. Dokmanic (ENS/UIUC), S. Mallat)

[Examples with Cox processes (inhomogeneous Poisson point processes).]

Conclusions and Open Problems

• CNNs: geometric encoding with built-in deformation stability.
  – Equipped to break the curse of dimensionality.
• This statistical advantage is useful both in supervised and unsupervised learning.
  – Gibbs CNN distributions are stable to deformations.
  – Exploited in high-dimensional inverse problems.
• Challenges ahead:
  – Decode the geometry learnt by CNNs: what is the role of higher layers?
  – CNNs and unsupervised learning: we need better inference.
  – Optimization in CNNs: exploit the layerwise/convolutional model.
  – Non-Euclidean domains (text, genomics, n-body dynamical systems).

Thank you!