of Deep Convolutional Neural Networks

Helmut Bölcskei

Department of Information Technology and Electrical Engineering

November 2017

Joint work with Thomas Wiatowski and Philipp Grohs

ImageNet

[Figure: sample ImageNet images with labels ski, rock, plant, coffee]

CNNs win the ImageNet 2015 challenge [He et al., 2015]

Go!

CNNs beat Go champion Lee Sedol [Silver et al., 2016]

Atari games

CNNs beat professional human Atari players [Mnih et al., 2015]

Describing the content of an image

CNNs generate sentences describing the content of an image [Vinyals et al., 2015]

“Carlos Kleiber conducting the Vienna Philharmonic’s New Year’s Concert 1989.”

Feature extraction and classification

input: f = [image]

non-linear feature extraction

feature vector Φ(f)

linear classifier

output: ⟨w, Φ(f)⟩ > 0 ⇒ Shannon,   ⟨w, Φ(f)⟩ < 0 ⇒ von Neumann

Mathematics for CNNs

“It is the guiding principle of many applied mathematicians that if something mathematical works really well, there must be a good underlying mathematical reason for it, and we ought to be able to understand it.” [I. Daubechies, 2015]

Why non-linear feature extractors?

Task: Separate two categories of data through a linear classifier

[Figure: two classes of data points arranged on concentric circles]

Directly on the data: ⟨w, f⟩ > 0 vs. ⟨w, f⟩ < 0  ⇒ not possible!

On the features Φ(f) = (‖f‖, 1)ᵀ: ⟨w, Φ(f)⟩ > 0 vs. ⟨w, Φ(f)⟩ < 0  ⇒ possible with w = (1, −1)ᵀ

⇒ Φ is invariant to the angular component of the data
⇒ Linear separability in feature space!
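A minimal numerical illustration of this toy example (an added sketch, not from the talk): the feature map Φ(f) = (‖f‖, 1)ᵀ together with the weight vector w = (1, −1)ᵀ separates two classes that are radially, but not linearly, separable.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two classes on concentric circles: radius 0.5 (class -1) and radius 2 (class +1).
def sample_circle(radius, n):
    angles = rng.uniform(0, 2 * np.pi, n)
    return radius * np.stack([np.cos(angles), np.sin(angles)], axis=1)

f_inner = sample_circle(0.5, 200)   # should end up with <w, Phi(f)> < 0
f_outer = sample_circle(2.0, 200)   # should end up with <w, Phi(f)> > 0

# Non-linear feature map Phi(f) = (||f||, 1)^T and linear classifier w = (1, -1)^T.
def phi(f):
    return np.stack([np.linalg.norm(f, axis=1), np.ones(len(f))], axis=1)

w = np.array([1.0, -1.0])

print("inner class, max <w, Phi(f)>:", (phi(f_inner) @ w).max())   # < 0
print("outer class, min <w, Phi(f)>:", (phi(f_outer) @ w).min())   # > 0
```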

Why non-linear feature extractors?

Let T ∈ ℝ^{m×m} be the cyclic shift matrix, i.e.,

T = [ 0  I_{m−1}
      1  0 ⋯ 0 ],

and let Φ ∈ ℂ^{m×m} be translation-invariant, i.e., Φ f = Φ T^l f, ∀ f, l.

What is the structure of Φ?

Φ = [φ_1 φ_2 … φ_m] = [φ_2 φ_3 … φ_1] = [φ_m φ_1 … φ_{m−1}]

⇒ Φ = [φ_1 φ_1 … φ_1]  ⇒ rank(Φ) = 1!
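A quick numerical check of this rank argument (an added illustration, not part of the slides): projecting an arbitrary matrix onto the set of translation-invariant maps, by averaging over all cyclic shifts, always yields a rank-one matrix with identical columns.

```python
import numpy as np

rng = np.random.default_rng(1)
m = 8

# Cyclic shift matrix T: (T f)[i] = f[i-1 mod m].
T = np.roll(np.eye(m), 1, axis=0)

Phi = rng.standard_normal((m, m))

# Enforce translation invariance Phi = Phi T^l for all l by averaging over shifts.
Phi_inv = sum(Phi @ np.linalg.matrix_power(T, l) for l in range(m)) / m

# All columns coincide and the rank collapses to 1.
print("rank:", np.linalg.matrix_rank(Phi_inv))                      # 1
print("columns identical:", np.allclose(Phi_inv, Phi_inv[:, [0]]))  # True
print("invariant:", np.allclose(Phi_inv @ T, Phi_inv))              # True
```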

Why non-linear feature extractors?

Want Φ to have a trivial null-space, i.e., Φ(f) = 0 ⇔ f = 0, to avoid “losing” features.

Non-linear Φ needed to avoid having

‖Φ(f_1) − Φ(f_2)‖² = ‖Φ(f_1 − f_2)‖² ≥ A ‖f_1 − f_2‖₂²,

for some A > 0 (a lower bound that is incompatible with the desired invariance).

Why non-linear feature extractors?

e.g.: Fourier-modulus Φ = |F_n|, where F_n ∈ ℂ^{n×n} is the DFT matrix, and |F_n| refers to multiplication by F_n followed by componentwise application of | · |.

⇒ translation-invariant, i.e., Φ f = Φ T^l f, ∀ f, l

BUT: In practice, the Fourier-modulus has poor feature extraction performance due to its global nature!
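A small added sanity check (not from the slides): the modulus of the DFT is indeed invariant to cyclic shifts, since a shift only multiplies each DFT coefficient by a unit-modulus phase factor.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 64

f = rng.standard_normal(n)
f_shifted = np.roll(f, 5)              # cyclic translation T^5 f

phi_f = np.abs(np.fft.fft(f))          # Phi = |F_n|
phi_shift = np.abs(np.fft.fft(f_shifted))

# Translation invariance: identical features up to numerical precision.
print(np.max(np.abs(phi_f - phi_shift)))   # ~1e-14
```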

Invariance/Covariance

Classification vs. regression: want to explicitly control the amount of invariance induced; less/more invariance ⇔ more/less covariance.

Deformation insensitivity

Feature vector should be independent of cameras (of different resolutions), and insensitive to small acquisition jitters.

Scattering networks ([Mallat, 2012], [Wiatowski and HB, 2015])

[Figure: tree-structured network. The root is the signal f; the nodes in layer n are the feature maps |···|f ∗ g_{λ_1}^{(k)}| ∗ g_{λ_2}^{(l)}| ∗ ···|; the output of every node is filtered by χ_n and collected into the feature vector Φ(f).]

General scattering networks guarantee [Wiatowski & HB, 2015]
- (vertical) translation invariance
- small deformation sensitivity
essentially irrespective of filters, non-linearities, and poolings!

Building blocks

Basic operations in the n-th network layer:

[Figure: f → g_{λ_n} → non-linearity → pooling, for each λ_n ∈ Λ_n]

Filters: Semi-discrete frame Ψ_n := {χ_n} ∪ {g_{λ_n}}_{λ_n ∈ Λ_n}:

A_n ‖f‖₂² ≤ ‖f ∗ χ_n‖₂² + Σ_{λ_n ∈ Λ_n} ‖f ∗ g_{λ_n}‖₂² ≤ B_n ‖f‖₂²,   ∀ f ∈ L²(ℝ^d)

e.g.: structured filters, unstructured filters, learned filters
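To make the frame condition concrete, here is a small numerical check (my own sketch, not from the talk): for an ideal (brick-wall) low-pass χ and band-pass filters g_λ whose squared frequency responses partition the spectrum, the Littlewood-Paley sum equals ‖f‖₂² exactly, i.e., A_n = B_n = 1 (a Parseval frame). The filter construction below is a hypothetical toy example.

```python
import numpy as np

rng = np.random.default_rng(3)
N = 256
f = rng.standard_normal(N)
f_hat = np.fft.fft(f)
freqs = np.abs(np.fft.fftfreq(N))          # normalized |frequency| in [0, 0.5]

# Brick-wall filter bank in the DFT domain: one low-pass chi and three band-pass g's
# whose squared magnitude responses partition the spectrum (so A = B = 1).
bands = [(0.0, 0.0625), (0.0625, 0.125), (0.125, 0.25), (0.25, 0.51)]
masks = [(freqs >= lo) & (freqs < hi) for lo, hi in bands]

def filter_energy(mask):
    # ||f * g||_2^2 computed via Parseval: (1/N) * sum |f_hat * g_hat|^2
    return np.sum(np.abs(f_hat * mask) ** 2) / N

lp_energy = filter_energy(masks[0])                     # ||f * chi||^2
bp_energy = sum(filter_energy(m) for m in masks[1:])    # sum over lambda of ||f * g_lambda||^2

print(lp_energy + bp_energy, np.sum(f ** 2))            # equal up to numerical precision
```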

Building blocks

Non-linearities: Point-wise and Lipschitz-continuous,

‖M_n(f) − M_n(h)‖₂ ≤ L_n ‖f − h‖₂,   ∀ f, h ∈ L²(ℝ^d)

⇒ Satisfied by virtually all non-linearities used in the deep learning literature!

e.g.: ReLU: L_n = 1; modulus: L_n = 1; logistic sigmoid: L_n = 1/4; ...
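A quick empirical check of these Lipschitz constants (added illustration): for random scalar pairs, the ratio |σ(x) − σ(y)| / |x − y| never exceeds 1 for ReLU and modulus, and never exceeds 1/4 for the logistic sigmoid.

```python
import numpy as np

rng = np.random.default_rng(4)
x, y = rng.standard_normal(10**6), rng.standard_normal(10**6)

nonlinearities = {
    "ReLU": lambda t: np.maximum(t, 0.0),
    "modulus": np.abs,
    "logistic sigmoid": lambda t: 1.0 / (1.0 + np.exp(-t)),
}

for name, sigma in nonlinearities.items():
    ratio = np.abs(sigma(x) - sigma(y)) / np.abs(x - y)
    print(name, "empirical Lipschitz constant ~", ratio.max())
# ReLU ~ 1, modulus ~ 1, logistic sigmoid ~ 0.25
```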

Building blocks

Pooling: In continuous time according to

f ↦ S_n^{d/2} P_n(f)(S_n ·),

where S_n ≥ 1 is the pooling factor and P_n : L²(ℝ^d) → L²(ℝ^d) is R_n-Lipschitz-continuous.

⇒ Emulates most poolings used in the deep learning literature!

e.g.: pooling by sub-sampling P_n(f) = f with R_n = 1; pooling by averaging P_n(f) = f ∗ φ_n with R_n = ‖φ_n‖₁
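A discrete-time analogue of the averaging example, as an added sanity check (assumed toy setup): for P(f) = f ∗ φ, Young's inequality gives ‖P(f) − P(h)‖₂ ≤ ‖φ‖₁ ‖f − h‖₂, so the empirical ratio stays below ‖φ‖₁.

```python
import numpy as np

rng = np.random.default_rng(5)
N = 512
phi = np.array([0.25, 0.5, 0.25])          # averaging kernel, ||phi||_1 = 1

def pool_avg(f):
    # circular convolution with the averaging kernel phi
    return np.real(np.fft.ifft(np.fft.fft(f) * np.fft.fft(phi, N)))

ratios = []
for _ in range(200):
    f, h = rng.standard_normal(N), rng.standard_normal(N)
    ratios.append(np.linalg.norm(pool_avg(f) - pool_avg(h)) / np.linalg.norm(f - h))

print("max ratio:", max(ratios), "<= ||phi||_1 =", np.sum(np.abs(phi)))
```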

Vertical translation invariance

Theorem (Wiatowski and HB, 2015)
Assume that the filters, non-linearities, and poolings satisfy

B_n ≤ min{1, L_n^{−2} R_n^{−2}},   ∀ n ∈ ℕ.

Let the pooling factors be S_n ≥ 1, n ∈ ℕ. Then,

|||Φ^n(T_t f) − Φ^n(f)||| = O( ‖t‖ / (S_1 ⋯ S_n) ),

for all f ∈ L²(ℝ^d), t ∈ ℝ^d, n ∈ ℕ.

⇒ Features become more invariant with increasing network depth!

Full translation invariance: If lim_{n→∞} S_1 · S_2 · … · S_n = ∞, then

lim_{n→∞} |||Φ^n(T_t f) − Φ^n(f)||| = 0.

The condition B_n ≤ min{1, L_n^{−2} R_n^{−2}}, ∀ n ∈ ℕ, is easily satisfied by normalizing the filters {g_{λ_n}}_{λ_n ∈ Λ_n}.

⇒ Applies to general filters, non-linearities, and poolings.
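As an added worked instance of this bound (my own specialization, assuming a constant pooling factor S_n = S > 1 in every layer), the decay is geometric in the depth:

```latex
\left\|\Phi^{n}(T_t f)-\Phi^{n}(f)\right\|
  = O\!\left(\frac{\|t\|}{S_1\cdots S_n}\right)
  = O\!\left(\|t\|\,S^{-n}\right)\longrightarrow 0 \quad (n\to\infty);
\qquad \text{e.g., } S=2,\ \|t\|=8:\ \frac{\|t\|}{S^{5}}=\frac{8}{32}=\frac{1}{4}.
```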

Philosophy behind invariance results

Mallat’s “horizontal” translation invariance [Mallat, 2012]:

lim_{J→∞} |||Φ_W(T_t f) − Φ_W(f)||| = 0,   ∀ f ∈ L²(ℝ^d), ∀ t ∈ ℝ^d

- features become invariant in every network layer, but needs J → ∞
- applies to wavelet transform and modulus non-linearity without pooling

“Vertical” translation invariance:

lim_{n→∞} |||Φ^n(T_t f) − Φ^n(f)||| = 0,   ∀ f ∈ L²(ℝ^d), ∀ t ∈ ℝ^d

- features become more invariant with increasing network depth
- applies to general filters, general non-linearities, and general poolings

Translation covariance

Theorem (Wiatowski and HB, 2015)
Assume that the filters, non-linearities, and poolings satisfy

B_n ≤ min{1, L_n^{−2} R_n^{−2}},   ∀ n ∈ ℕ.

Let the pooling factors be S_n ≥ 1, n ∈ ℕ. Then,

|||Φ^n(T_t f) − T_t Φ^n(f)||| = O( ‖t‖ · | 1/(S_1 ⋯ S_n) − 1 | ),

for all f ∈ L²(ℝ^d), t ∈ ℝ^d, n ∈ ℕ.

⇒ No pooling (i.e., S_n = 1, for all n ∈ ℕ) leads to full translation covariance!

Experiment: Feature importance evaluation

Question: Which features are important in
- handwritten digit classification?
- detection of facial landmarks (eyes, nose, mouth) through regression?

Compare importance of features corresponding to (i) different layers, (ii) wavelet scales, and (iii) wavelet directions.

Setup for Φ: D = 4 layers; Haar wavelets with J = 3 scales and modulus non-linearity in every network layer; no pooling in the first layer, average pooling with uniform weights in the second and third layers (S = 2)

Handwritten digit classification:
- Dataset: MNIST database (10,000 training and 10,000 test images)
- Random forest classifier [Breiman, 2001] with 30 trees
- Feature importance: Gini importance [Breiman, 1984]

Facial landmark detection:
- Dataset: Caltech Web Faces database (7092 images; 80% for training, 20% for testing)
- Random forest regressor [Breiman, 2001] with 30 trees
- Feature importance: Gini importance [Breiman, 1984]

Experiment: Feature importance evaluation

[Figure: average cumulative feature importance for digit classification (digits and displaced digits) and for facial landmarks (left eye, nose), broken down by layer/direction pairs 0/-, 1/0, …, 3/2 and by wavelet scale j = 0, 1, 2; a triplet of bars d/r corresponds to horizontal (r = 0), vertical (r = 1), and diagonal (r = 2) features in layer d.]

Experiment: Feature importance evaluation

Average cumulative feature importance per layer:

          left eye   right eye   nose    mouth   digits
Layer 0   0.020      0.023       0.016   0.014   0.004
Layer 1   0.629      0.646       0.576   0.490   0.094
Layer 2   0.261      0.236       0.298   0.388   0.280
Layer 3   0.090      0.095       0.110   0.108   0.622

Non-linear deformations

Non-linear deformation (F_τ f)(x) = f(x − τ(x)), where τ : ℝ^d → ℝ^d

[Figure: effect of F_τ on an image for “small” τ and for “large” τ]

Deformation sensitivity for signal classes

Consider (F_τ f)(x) = f(x − τ(x)) = f(x − e^{−x²})

[Figure: f_1(x) and (F_τ f_1)(x); f_2(x) and (F_τ f_2)(x)]

For given τ, the amount of deformation induced can depend drastically on f ∈ L²(ℝ^d).

Philosophy behind deformation stability/sensitivity bounds

Mallat’s deformation stability bound [Mallat, 2012]:

|||Φ_W(F_τ f) − Φ_W(f)||| ≤ C ( 2^{−J} ‖τ‖_∞ + J ‖Dτ‖_∞ + ‖D²τ‖_∞ ) ‖f‖_W,

for all f ∈ H_W ⊆ L²(ℝ^d)

- the signal class H_W and the corresponding ‖·‖_W depend on the mother wavelet (and hence the network)
- signal class description complexity implicit via the norm ‖·‖_W
- the bound depends explicitly on higher-order derivatives of τ
- the bound is coupled to horizontal translation invariance
  lim_{J→∞} |||Φ_W(T_t f) − Φ_W(f)||| = 0,   ∀ f ∈ L²(ℝ^d), ∀ t ∈ ℝ^d

Our deformation sensitivity bound:

|||Φ(F_τ f) − Φ(f)||| ≤ C_C ‖τ‖_∞^α,   ∀ f ∈ C ⊆ L²(ℝ^d)

- the signal class C (band-limited functions, cartoon functions, or Lipschitz functions) is independent of the network
- signal class description complexity explicit via C_C:
  L-band-limited functions: C_C = O(L); cartoon functions of size K: C_C = O(K^{3/2}); M-Lipschitz functions: C_C = O(M)
- the decay rate α > 0 of the deformation error is signal-class-specific (band-limited functions: α = 1, cartoon functions: α = 1/2, Lipschitz functions: α = 1)
- the bound implicitly depends on derivatives of τ via the condition ‖Dτ‖_∞ ≤ 1/(2d)
- the bound is decoupled from vertical translation invariance
  lim_{n→∞} |||Φ^n(T_t f) − Φ^n(f)||| = 0,   ∀ f ∈ L²(ℝ^d), ∀ t ∈ ℝ^d

Proof sketch: Decoupling

|||Φ(F_τ f) − Φ(f)||| ≤ C_C ‖τ‖_∞^α,   ∀ f ∈ C ⊆ L²(ℝ^d)

1) Lipschitz continuity:

|||Φ(f) − Φ(h)||| ≤ ‖f − h‖₂,   ∀ f, h ∈ L²(ℝ^d),

established through (i) the frame property of Ψ_n, (ii) Lipschitz continuity of the non-linearities, and (iii) the admissibility condition B_n ≤ min{1, L_n^{−2} R_n^{−2}}.

2) Signal-class-specific deformation sensitivity bound:

‖F_τ f − f‖₂ ≤ C_C ‖τ‖_∞^α,   ∀ f ∈ C ⊆ L²(ℝ^d)

3) Combine 1) and 2) to get

|||Φ(F_τ f) − Φ(f)||| ≤ ‖F_τ f − f‖₂ ≤ C_C ‖τ‖_∞^α,

for all f ∈ C ⊆ L²(ℝ^d).

Noise robustness

Lipschitz continuity of Φ according to

|||Φ(f) − Φ(h)||| ≤ ‖f − h‖₂,   ∀ f, h ∈ L²(ℝ^d),

also implies robustness w.r.t. additive noise η ∈ L²(ℝ^d) according to

|||Φ(f + η) − Φ(f)||| ≤ ‖η‖₂.

CNNs in a nutshell

CNNs used in practice employ potentially hundreds of layers and 10,000s of nodes!

e.g.: Winner of the ImageNet 2015 challenge [He et al., 2015]
- Network depth: 152 layers
- average # of nodes per layer: 472
- # of FLOPS for a single forward pass: 11.3 billion

Such depths (and breadths) pose formidable computational challenges in training and operating the network!

Topology reduction

Determine how fast the energy contained in the propagated signals (a.k.a. feature maps) decays across layers

Guarantee trivial null-space for the feature extractor Φ

Specify the number of layers needed to have “most” of the input signal energy be contained in the feature vector

For a fixed (possibly small) depth, design CNNs that capture “most” of the input signal energy

Building blocks

Basic operations in the n-th network layer:

[Figure: f → g_{λ_n} → | · | → ↓S, for each λ_n ∈ Λ_n]

Filters: Semi-discrete frame Ψ_n := {χ_n} ∪ {g_{λ_n}}_{λ_n ∈ Λ_n}
Non-linearity: Modulus | · |
Pooling: Sub-sampling with pooling factor S ≥ 1

Demodulation effect of modulus non-linearity

Components of the feature vector are given by |f ∗ g_{λ_n}| ∗ χ_{n+1}.

[Figure: spectra illustrating the demodulation effect — f̂(ω) together with the band-pass ĝ_{λ_n}(ω) and the low-pass χ̂_{n+1}(ω); the band-pass spectrum f̂(ω)·ĝ_{λ_n}(ω); the spectrum of the modulus squared |f ∗ g_{λ_n}(x)|², i.e., the autocorrelation R_{f̂·ĝ_{λ_n}}(ω), concentrated around ω = 0; the spectrum of |f ∗ g_{λ_n}|, whose base-band portion is extracted into Φ(f) via χ̂_{n+1}.]

Do all non-linearities demodulate?

High-pass filtered signal: F(f ∗ g_λ), with spectral support of width 2R located away from base-band.

Modulus: Yes! |F(|f ∗ g_λ|)| is concentrated in [−2R, 2R] … but has (small) tails!

Modulus squared: Yes, and sharply so! |F(|f ∗ g_λ|²)| is supported in [−2R, 2R] … but the modulus squared is not Lipschitz-continuous!

Rectified linear unit: No! |F(ReLU(f ∗ g_λ))| retains spectral content at the original high-pass frequencies.
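A one-dimensional numerical illustration I have added (toy setup, not from the talk): for an amplitude-modulated high-frequency signal, the modulus pushes most of the spectral energy towards base-band, whereas the ReLU keeps a large share at the carrier frequency.

```python
import numpy as np

N = 4096
t = np.arange(N) / N
carrier = N // 8                                        # carrier frequency (in DFT bins)
envelope = np.exp(-0.5 * ((t - 0.5) / 0.05) ** 2)       # smooth, slowly varying envelope
x = envelope * np.cos(2 * np.pi * carrier * t)          # band-pass signal ~ f * g_lambda

def lowband_energy_fraction(y, cutoff_bin):
    Y = np.fft.rfft(y)
    energy = np.abs(Y) ** 2
    return energy[:cutoff_bin].sum() / energy.sum()

cutoff = carrier // 2                                   # "low-frequency" band, well below the carrier
print("modulus:", lowband_energy_fraction(np.abs(x), cutoff))          # ~0.8
print("ReLU   :", lowband_energy_fraction(np.maximum(x, 0), cutoff))   # ~0.4
```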

First goal: Quantify feature map energy decay

[Figure: scattering tree with the signal f at the root; W_n(f) denotes the energy contained in the feature maps of the n-th layer, e.g., W_1(f) collects the |f ∗ g_{λ_1}| and W_2(f) the ||f ∗ g_{λ_1}| ∗ g_{λ_2}|; each node output is filtered by χ_n to form the feature vector.]

Assumptions (on the filters)

i) Analyticity: For every filter g_{λ_n} there exists a (not necessarily canonical) orthant H_{λ_n} ⊆ ℝ^d such that supp(ĝ_{λ_n}) ⊆ H_{λ_n}.

ii) High-pass: There exists δ > 0 such that

Σ_{λ_n ∈ Λ_n} |ĝ_{λ_n}(ω)|² = 0,   a.e. ω ∈ B_δ(0).

⇒ Comprises various constructions of WH filters, wavelets, ridgelets, (α)-curvelets, shearlets

e.g.: analytic band-limited curvelets [Figure: tiling of the (ω_1, ω_2)-plane]

Input signal classes

Sobolev functions of order s ≥ 0:

H^s(ℝ^d) = { f ∈ L²(ℝ^d) | ∫_{ℝ^d} (1 + |ω|²)^s |f̂(ω)|² dω < ∞ }

H^s(ℝ^d) contains a wide range of practically relevant signal classes:

- square-integrable functions L²(ℝ^d) = H^0(ℝ^d)
- L-band-limited functions L²_L(ℝ^d) ⊆ H^s(ℝ^d), ∀ L > 0, ∀ s ≥ 0
- cartoon functions [Donoho, 2001] C_CART ⊆ H^s(ℝ^d), ∀ s ∈ [0, 1/2)

[Figure: handwritten digits from the MNIST database [LeCun & Cortes, 1998]]

Exponential energy decay

Theorem
Let the filters be wavelets with mother wavelet

supp(ψ̂) ⊆ [r^{−1}, r],   r > 1,

or Weyl-Heisenberg (WH) filters with prototype function

supp(ĝ) ⊆ [−R, R],   R > 0.

Then, for every f ∈ H^s(ℝ^d), there exists β > 0 such that

W_n(f) = O( a^{ −n(2s+β) / (2s+β+1) } ),

where a = (r²+1)/(r²−1) in the wavelet case, and a = 1/2 + 1/R in the WH case.

⇒ decay factor a is explicit and can be tuned via r, R

- β > 0 determines the decay of f̂(ω) (as |ω| → ∞) according to
  |f̂(ω)| ≤ µ (1 + |ω|²)^{ −(s/2 + 1/4 + β/4) },   ∀ |ω| ≥ L,
  for some µ > 0, and L acts as an “effective bandwidth”
- smoother input signals (i.e., s ↑) lead to faster energy decay
- pooling through sub-sampling f ↦ S^{1/2} f(S·) leads to decay factor aS

What about general filters? ⇒ polynomial energy decay!

Arbitrarily large decay factor

Exponential energy decay for wavelets:

W_n(f) = O( a^{ −n(2s+β) / (2s+β+1) } ),   where a = (r²+1)/(r²−1), for r > 1

[Figure: spectral supports of the mother wavelet ψ̂ and the dilated filters ĝ_1, ĝ_2, ĝ_3 on the frequency axis at 1/r, 1, r, r², r³]

⇒ Smaller bandwidth (i.e., r ↓) leads to faster energy decay

… our second goal … trivial null-space for Φ

Why trivial null-space?

[Figure: feature space with a linear classifier w, the regions ⟨w, Φ(f)⟩ > 0 and ⟨w, Φ(f)⟩ < 0, and the point Φ(f*)]

Non-trivial null-space: ∃ f* ≠ 0 such that Φ(f*) = 0
⇒ ⟨w, Φ(f*)⟩ = 0 for all w!
⇒ these f* become unclassifiable!

Trivial null-space for the feature extractor:

{ f ∈ L²(ℝ^d) | Φ(f) = 0 } = {0}

The feature extractor Φ(·) = ⋃_{n=0}^{∞} Φ^n(·) shall satisfy

A‖f‖₂² ≤ |||Φ(f)|||² ≤ B‖f‖₂²,   ∀ f ∈ L²(ℝ^d),

for some A, B > 0.

“Energy conservation”

Theorem
For the frame upper bounds {B_n}_{n∈ℕ} and frame lower bounds {A_n}_{n∈ℕ}, define B := ∏_{n=1}^{∞} max{1, B_n} and A := ∏_{n=1}^{∞} min{1, A_n}. If

0 < A ≤ B < ∞,

then

A‖f‖₂² ≤ |||Φ(f)|||² ≤ B‖f‖₂²,   ∀ f ∈ L²(ℝ^d).

- For Parseval frames (i.e., A_n = B_n = 1, n ∈ ℕ), this yields |||Φ(f)|||² = ‖f‖₂²
- Connection to energy decay:
  ‖f‖₂² = Σ_{k=0}^{n−1} |||Φ^k(f)|||² + W_n(f),   with W_n(f) → 0

… and our third goal …

For a given CNN, specify the number of layers needed to capture “most” of the input signal energy:

How many layers n are needed to have at least ((1 − ε) · 100)% of the input signal energy be contained in the feature vector, i.e.,

(1 − ε)‖f‖₂² ≤ Σ_{k=0}^{n} |||Φ^k(f)|||² ≤ ‖f‖₂²,   ∀ f ∈ L²(ℝ^d)?

Number of layers needed

Theorem
Let the frame bounds satisfy A_n = B_n = 1, n ∈ ℕ. Let the input signal f be L-band-limited, and let ε ∈ (0, 1). If

n ≥ log_a( L / (1 − √(1 − ε)) ),

then

(1 − ε)‖f‖₂² ≤ Σ_{k=0}^{n} |||Φ^k(f)|||² ≤ ‖f‖₂².

⇒ also guarantees a trivial null-space for ⋃_{k=0}^{n} Φ^k(f)

- the lower bound depends on
  - the description complexity of the input signals (i.e., the bandwidth L)
  - the decay factor (wavelets: a = (r²+1)/(r²−1), WH filters: a = 1/2 + 1/R)
- similar estimates for Sobolev input signals and for general filters (polynomial decay!)

Numerical example for bandwidth L = 1:

(1 − ε)              0.25   0.5   0.75   0.9   0.95   0.99
wavelets (r = 2)       2      3     4      6     8      11
WH filters (R = 1)     2      4     5      8     10     14
general filters        2      3     7     19     39    199

Recall: Winner of the ImageNet 2015 challenge [He et al., 2015]
- Network depth: 152 layers
- average # of nodes per layer: 472
- # of FLOPS for a single forward pass: 11.3 billion
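As a check on the table above, here is a short script I have added that evaluates the theorem's layer bound n = ⌈log_a(L / (1 − √(1 − ε)))⌉ for the wavelet and WH decay factors; it reproduces the first two rows.

```python
import math

L = 1.0
decay_factors = {
    "wavelets (r = 2)": (2**2 + 1) / (2**2 - 1),   # a = (r^2+1)/(r^2-1) = 5/3
    "WH filters (R = 1)": 0.5 + 1.0 / 1.0,         # a = 1/2 + 1/R = 3/2
}

for name, a in decay_factors.items():
    layers = []
    for captured in (0.25, 0.5, 0.75, 0.9, 0.95, 0.99):   # captured = 1 - eps
        n = math.ceil(math.log(L / (1.0 - math.sqrt(captured)), a))
        layers.append(n)
    print(name, layers)
# wavelets (r = 2)   [2, 3, 4, 6, 8, 11]
# WH filters (R = 1) [2, 4, 5, 8, 10, 14]
```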

… our fourth and last goal …

For a fixed (possibly small) depth N, design scattering networks that capture “most” of the input signal energy:

For fixed depth N, we want to impose conditions on the filters g_{λ_n} so that

(1 − ε)‖f‖₂² ≤ Σ_{k=0}^{N} |||Φ^k(f)|||² ≤ ‖f‖₂²,   ∀ f ∈ L²(ℝ^d).

Depth-constrained networks

Theorem
Let the input signal f be L-band-limited, and fix ε ∈ (0, 1) and N ∈ ℕ. If

supp(ĝ_{λ_n}) = [a_{λ_n}, b_{λ_n}],   ∀ λ_n,

and

1 < sup_{λ_n} b_{λ_n} / a_{λ_n} ≤ (κ + 1)/(κ − 1),   where κ := ( L / (1 − √(1 − ε)) )^{1/N},

then

(1 − ε)‖f‖₂² ≤ Σ_{k=0}^{N} |||Φ^k(f)|||² ≤ ‖f‖₂².

⇒ Condition on the growth of the filter bandwidths

⇒ Satisfied, e.g., by analytic band-limited wavelets and WH filters
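A small worked example I have added, evaluating the condition for hypothetical parameters L = 1 and 90% energy capture: the maximal allowed band-edge ratio (κ+1)/(κ−1) shrinks towards 1 as the depth N decreases, i.e., shallower networks require narrower (and hence more) filters.

```python
import math

L, captured = 1.0, 0.9            # L-band-limited input, capture (1 - eps) = 90% of the energy
base = L / (1.0 - math.sqrt(captured))

for N in (1, 2, 3, 5, 10):
    kappa = base ** (1.0 / N)
    max_ratio = (kappa + 1.0) / (kappa - 1.0)
    print(f"N = {N:2d}: kappa = {kappa:6.3f}, max allowed b/a ratio = {max_ratio:6.3f}")
# The allowed bandwidth growth per filter shrinks as the depth N decreases
# (depth-width tradeoff).
```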

Depth-width tradeoff

[Figure: spectral supports of the mother wavelet ψ̂ and the filters ĝ_1, ĝ_2, ĝ_3 covering the band [−L, L]]

Smaller depth N
⇒ smaller bandwidth of the filters
⇒ larger number of filters to cover the spectral support [−L, L] of the input signal
⇒ larger number of filters in the first layer
⇒ depth-width tradeoff

Practice is digital

In practice … we need to handle discrete data: f ∈ ℝ^{n×n}

In practice … a wide variety of network architectures is used!

[Figure: examples of network topologies, e.g., a convolutional network computing class probabilities [LeCun et al., 1990], and deeper/wider trees of iterated filtering and modulus operations, possibly including a low-pass branch f ∗ ϕ_low]

How general is the scattering network architecture?

- building blocks ⇒ general filters, non-linearities, and poolings
- “special” architectural components, e.g., drop-out (cf. AlexNet [Krizhevsky et al., 2012]), skip-layer connections (cf. ResNets [He et al., 2015]), frame filters (cf. Parseval networks [Cisse et al., 2017])
- feature map aggregation can be incorporated [Balan et al., 2017]
- tree topology can be width-pruned by determining the “operationally significant nodes” [Wiatowski et al., 2017]

Local properties: Lipschitz continuity

Theorem (Wiatowski et al., 2016)
The features generated in the d-th network layer are Lipschitz-continuous with Lipschitz constant

L^d := ‖χ_d‖₁ ( ∏_{k=1}^{d} B_k L_k² R_k² )^{1/2},

i.e.,

|||Φ^d(f) − Φ^d(h)||| ≤ L^d ‖f − h‖₂,   ∀ f, h ∈ H_N,

where H_N := { f : ℤ → ℂ | f[n] = f[n + N], ∀ n ∈ ℤ }.

The Lipschitz constant L^d
- determines the noise sensitivity of Φ^d(f):  |||Φ^d(f + η) − Φ^d(f)||| ≤ L^d ‖η‖₂,   ∀ f ∈ H_N
- impacts the energy of Φ^d(f):  |||Φ^d(f)||| ≤ L^d ‖f‖₂,   ∀ f ∈ H_N
- quantifies the impact of deformations τ:  |||Φ^d(F_τ f) − Φ^d(f)||| ≤ 4 L^d K N^{1/2} ‖τ‖_∞^{1/2},   ∀ f ∈ C_CART^{N,K}
- is hence a characteristic constant for the features Φ^d(f) generated in the d-th network layer

Local properties: Lipschitz continuity

L^d / L^{d−1} = ( ‖χ_d‖₁ / ‖χ_{d−1}‖₁ ) B_d^{1/2} L_d R_d

If ‖χ_d‖₁ < ‖χ_{d−1}‖₁ / ( B_d^{1/2} L_d R_d ), then L^d < L^{d−1}, and hence

- the features Φ^d(f) are less deformation-sensitive than Φ^{d−1}(f), thanks to
  |||Φ^d(F_τ f) − Φ^d(f)||| ≤ 4 L^d K N^{1/2} ‖τ‖_∞^{1/2},   ∀ f ∈ C_CART^{N,K}
- the features Φ^d(f) contain less energy than Φ^{d−1}(f), owing to
  |||Φ^d(f)||| ≤ L^d ‖f‖₂,   ∀ f ∈ H_N

⇒ Tradeoff between deformation sensitivity and energy preservation!
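A tiny numerical illustration I have added (with made-up layer parameters): computing L^d layer by layer shows how the per-layer factor (‖χ_d‖₁/‖χ_{d−1}‖₁) B_d^{1/2} L_d R_d determines whether the Lipschitz constant, and with it the deformation-sensitivity and energy bounds, grows or shrinks with depth.

```python
import math

# Hypothetical per-layer parameters: frame bound B_d, non-linearity Lipschitz constant L_d,
# pooling Lipschitz constant R_d, and output filter norm ||chi_d||_1.
layers = [
    # (B_d, L_d, R_d, ||chi_d||_1)
    (1.0, 1.0, 1.0, 1.0),
    (1.0, 1.0, 1.0, 0.8),
    (1.0, 1.0, 1.0, 0.5),
    (1.0, 1.0, 1.0, 0.5),
]

running_product = 1.0
prev_L = None
for d, (B, Lip, R, chi_norm) in enumerate(layers, start=1):
    running_product *= B * Lip**2 * R**2
    L_d = chi_norm * math.sqrt(running_product)   # L^d = ||chi_d||_1 (prod_k B_k L_k^2 R_k^2)^(1/2)
    ratio = None if prev_L is None else L_d / prev_L
    print(f"layer {d}: L^d = {L_d:.3f}",
          "" if ratio is None else f"(L^d / L^(d-1) = {ratio:.3f})")
    prev_L = L_d
```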

Experiment: Handwritten digit classification

- Dataset: MNIST database of handwritten digits [LeCun & Cortes, 1998]; 60,000 training and 10,000 test images
- Φ-network: D = 3 layers; same filters, non-linearities, and pooling operators in all layers
- Classifier: SVM with radial basis function kernel [Vapnik, 1995]
- Dimensionality reduction: Supervised orthogonal least squares scheme [Chen et al., 1991]

Classification error in percent:

        Haar wavelet                   Bi-orthogonal wavelet
        abs    ReLU   tanh   LogSig    abs    ReLU   tanh   LogSig
n.p.    0.57   0.57   1.35   1.49      0.51   0.57   1.12   1.22
sub.    0.69   0.66   1.25   1.46      0.61   0.61   1.20   1.18
max.    0.58   0.65   0.75   0.74      0.52   0.64   0.78   0.73
avg.    0.55   0.60   1.27   1.35      0.58   0.59   1.07   1.26

- modulus and ReLU perform better than tanh and LogSig
- results with pooling (S = 2) are competitive with those without pooling, at significantly lower computational cost
- state-of-the-art: 0.43 [Bruna and Mallat, 2013]
  - similar feature extraction network with directional, non-separable wavelets and no pooling
  - significantly higher computational complexity

Energy decay: Related work

[Waldspurger, 2017]: Exponential energy decay

W_n(f) = O(a^{−n}),

for some unspecified a > 1.
- 1-D wavelet filters
- every network layer equipped with the same set of wavelets
- vanishing moments condition on the mother wavelet
- applies to 1-D real-valued band-limited input signals f ∈ L²(ℝ)

[Czaja and Li, 2016]: Exponential energy decay

W_n(f) = O(a^{−n}),

for some unspecified a > 1.
- d-dimensional uniform covering filters (similar to Weyl-Heisenberg filters), but does not cover multi-scale filters (e.g., wavelets, ridgelets, curvelets, etc.)
- every network layer equipped with the same set of filters
- analyticity and vanishing moments conditions on the filters
- applies to d-dimensional complex-valued input signals f ∈ L²(ℝ^d)