Harmonic Analysis of Deep Convolutional Neural Networks

Harmonic Analysis of Deep Convolutional Neural Networks Helmut B}olcskei Department of Information Technology and Electrical Engineering October 2017 joint work with Thomas Wiatowski and Philipp Grohs ImageNet CNNs win the ImageNet 2015 challenge [He et al., 2015 ] ImageNet ski rock plant coffee ImageNet ski rock plant coffee CNNs win the ImageNet 2015 challenge [He et al., 2015 ] \Carlos Kleiber conducting the Vienna Philharmonic's New Year's Concert 1989." Describing the content of an image CNNs generate sentences describing the content of an image [Vinyals et al., 2015 ] Kleiber conducting the Vienna Philharmonic's New Year's Concert 1989 Describing the content of an image CNNs generate sentences describing the content of an image [Vinyals et al., 2015 ] \Carlos ." conducting the Vienna Philharmonic's New Year's Concert 1989 Describing the content of an image CNNs generate sentences describing the content of an image [Vinyals et al., 2015 ] \Carlos Kleiber ." Vienna Philharmonic's New Year's Concert 1989 Describing the content of an image CNNs generate sentences describing the content of an image [Vinyals et al., 2015 ] \Carlos Kleiber conducting the ." New Year's Concert 1989 Describing the content of an image CNNs generate sentences describing the content of an image [Vinyals et al., 2015 ] \Carlos Kleiber conducting the Vienna Philharmonic's ." 1989 Describing the content of an image CNNs generate sentences describing the content of an image [Vinyals et al., 2015 ] \Carlos Kleiber conducting the Vienna Philharmonic's New Year's Concert ." Describing the content of an image CNNs generate sentences describing the content of an image [Vinyals et al., 2015 ] \Carlos Kleiber conducting the Vienna Philharmonic's New Year's Concert 1989." Feature extraction and classification input: f = non-linear feature extraction feature vector Φ(f) linear classifier ( hw; Φ(f)i > 0; ) Shannon output: hw; Φ(f)i < 0; ) von Neumann 1 possible with w = −1 : hw; Φ(f)i > 0 : hw; Φ(f)i < 0 kfk Φ(f) = 1 Why non-linear feature extractors? Task: Separate two categories of data through a linear classifier 1 : hw; fi > 0 : hw; fi < 0 : hw; Φ(f)i > 0 : hw; Φ(f)i < 0 1 possible with w = −1 kfk Φ(f) = 1 Why non-linear feature extractors? Task: Separate two categories of data through a linear classifier 1 : hw; fi > 0 : hw; fi < 0 not possible! Why non-linear feature extractors? Task: Separate two categories of data through a linear classifier kfk Φ(f) = 1 1 : hw; fi > 0 : hw; Φ(f)i > 0 : hw; fi < 0 : hw; Φ(f)i < 0 1 not possible! possible with w = −1 ) Linear separability in feature space! Why non-linear feature extractors? Task: Separate two categories of data through a linear classifier kfk Φ(f) = 1 ) Φ is invariant to angular component of the data Why non-linear feature extractors? Task: Separate two categories of data through a linear classifier kfk Φ(f) = 1 ) Φ is invariant to angular component of the data ) Linear separability in feature space! Translation invariance Handwritten digits from the MNIST database [LeCun & Cortes, 1998 ] Feature vector should be invariant to spatial location ) translation invariance Deformation insensitivity Feature vector should be independent of cameras (of different resolutions), and insensitive to small acquisition jitters Scattering networks ([Mallat, 2012], [Wiatowski and HB, 2015]) jjf ∗ g (k) j ∗ g (l) j jjf ∗ g (p) j ∗ g (r) j λ1 λ2 λ1 λ2 jf ∗ g (k) j jf ∗ g (p) j λ1 λ1 f feature map Scattering networks ([Mallat, 2012], [Wiatowski and HB, 2015]) jjf ∗ g (k) j ∗ g (l) j jjf ∗ g (p) j ∗ g (r) j λ1 λ2 λ1 λ2 · ∗ χ3 · ∗ χ3 jf ∗ g (k) j jf ∗ g (p) j λ1 λ1 · ∗ χ · ∗ χ 2 f feature map 2 · ∗ χ1 Scattering networks ([Mallat, 2012], [Wiatowski and HB, 2015]) jjf ∗ g (k) j ∗ g (l) j jjf ∗ g (p) j ∗ g (r) j λ1 λ2 λ1 λ2 · ∗ χ3 · ∗ χ3 jf ∗ g (k) j jf ∗ g (p) j λ1 λ1 · ∗ χ · ∗ χ 2 f feature map 2 · ∗ χ1 feature vector Φ(f) General scattering networks guarantee [Wiatowski & HB, 2015 ] - (vertical) translation invariance - small deformation sensitivity essentially irrespective of filters, non-linearities, and poolings! e.g.: Learned filters Building blocks Basic operations in the n-th network layer g (k) λn non-lin. pool. f . g (r) λn non-lin. pool. Filters: Semi-discrete frame Ψn := fχng [ fgλn gλn2Λn 2 2 X 2 2 2 d Ankfk2 ≤ kf ∗ χnk2 + kf ∗ gλn k ≤ Bnkfk2; 8f 2 L (R ) λn2Λn e.g.: Learned filters Building blocks Basic operations in the n-th network layer g (k) λn non-lin. pool. f . g (r) λn non-lin. pool. Filters: Semi-discrete frame Ψn := fχng [ fgλn gλn2Λn 2 2 X 2 2 2 d Ankfk2 ≤ kf ∗ χnk2 + kf ∗ gλn k ≤ Bnkfk2; 8f 2 L (R ) λn2Λn e.g.: Structured filters e.g.: Learned filters Building blocks Basic operations in the n-th network layer g (k) λn non-lin. pool. f . g (r) λn non-lin. pool. Filters: Semi-discrete frame Ψn := fχng [ fgλn gλn2Λn 2 2 X 2 2 2 d Ankfk2 ≤ kf ∗ χnk2 + kf ∗ gλn k ≤ Bnkfk2; 8f 2 L (R ) λn2Λn e.g.: Unstructured filters Building blocks Basic operations in the n-th network layer g (k) λn non-lin. pool. f . g (r) λn non-lin. pool. Filters: Semi-discrete frame Ψn := fχng [ fgλn gλn2Λn 2 2 X 2 2 2 d Ankfk2 ≤ kf ∗ χnk2 + kf ∗ gλn k ≤ Bnkfk2; 8f 2 L (R ) λn2Λn e.g.: Learned filters ) Satisfied by virtually all non-linearities used in the deep learning literature! 1 ReLU: Ln = 1; modulus: Ln = 1; logistic sigmoid: Ln = 4 ; ... Building blocks Basic operations in the n-th network layer g (k) λn non-lin. pool. f . g (r) λn non-lin. pool. Non-linearities: Point-wise and Lipschitz-continuous 2 d kMn(f) − Mn(h)k2 ≤ Lnkf − hk2; 8 f; h 2 L (R ) Building blocks Basic operations in the n-th network layer g (k) λn non-lin. pool. f . g (r) λn non-lin. pool. Non-linearities: Point-wise and Lipschitz-continuous 2 d kMn(f) − Mn(h)k2 ≤ Lnkf − hk2; 8 f; h 2 L (R ) ) Satisfied by virtually all non-linearities used in the deep learning literature! 1 ReLU: Ln = 1; modulus: Ln = 1; logistic sigmoid: Ln = 4 ; ... ) Emulates most poolings used in the deep learning literature! e.g.: Pooling by sub-sampling Pn(f) = f with Rn = 1 Building blocks Basic operations in the n-th network layer g (k) λn non-lin. pool. f . g (r) λn non-lin. pool. Pooling: In continuous-time according to d=2 f 7! Sn Pn(f)(Sn·); 2 d 2 d where Sn ≥ 1 is the pooling factor and Pn : L (R ) ! L (R ) is Rn-Lipschitz-continuous Building blocks Basic operations in the n-th network layer g (k) λn non-lin. pool. f . g (r) λn non-lin. pool. Pooling: In continuous-time according to d=2 f 7! Sn Pn(f)(Sn·); 2 d 2 d where Sn ≥ 1 is the pooling factor and Pn : L (R ) ! L (R ) is Rn-Lipschitz-continuous ) Emulates most poolings used in the deep learning literature! e.g.: Pooling by sub-sampling Pn(f) = f with Rn = 1 Building blocks Basic operations in the n-th network layer g (k) λn non-lin. pool. f . g (r) λn non-lin. pool. Pooling: In continuous-time according to d=2 f 7! Sn Pn(f)(Sn·); 2 d 2 d where Sn ≥ 1 is the pooling factor and Pn : L (R ) ! L (R ) is Rn-Lipschitz-continuous ) Emulates most poolings used in the deep learning literature! e.g.: Pooling by averaging Pn(f) = f ∗ φn with Rn = kφnk1 Vertical translation invariance Theorem (Wiatowski and HB, 2015) Assume that the filters, non-linearities, and poolings satisfy −2 −2 Bn ≤ minf1;Ln Rn g; 8 n 2 N: Let the pooling factors be Sn ≥ 1, n 2 N. Then, n n ktk jjjΦ (Ttf) − Φ (f)jjj = O ; S1 :::Sn 2 d d for all f 2 L (R ), t 2 R , n 2 N. Vertical translation invariance Theorem (Wiatowski and HB, 2015) Assume that the filters, non-linearities, and poolings satisfy −2 −2 Bn ≤ minf1;Ln Rn g; 8 n 2 N: Let the pooling factors be Sn ≥ 1, n 2 N. Then, n n ktk jjjΦ (Ttf) − Φ (f)jjj = O ; S1 :::Sn 2 d d for all f 2 L (R ), t 2 R , n 2 N. ) Features become more invariant with increasing network depth! Vertical translation invariance Theorem (Wiatowski and HB, 2015) Assume that the filters, non-linearities, and poolings satisfy −2 −2 Bn ≤ minf1;Ln Rn g; 8 n 2 N: Let the pooling factors be Sn ≥ 1, n 2 N. Then, n n ktk jjjΦ (Ttf) − Φ (f)jjj = O ; S1 :::Sn 2 d d for all f 2 L (R ), t 2 R , n 2 N. Full translation invariance: If lim S1 · S2 · ::: · Sn = 1, then n!1 n n lim jjjΦ (Ttf) − Φ (f)jjj = 0 n!1 Vertical translation invariance Theorem (Wiatowski and HB, 2015) Assume that the filters, non-linearities, and poolings satisfy −2 −2 Bn ≤ minf1;Ln Rn g; 8 n 2 N: Let the pooling factors be Sn ≥ 1, n 2 N. Then, n n ktk jjjΦ (Ttf) − Φ (f)jjj = O ; S1 :::Sn 2 d d for all f 2 L (R ), t 2 R , n 2 N. The condition −2 −2 Bn ≤ minf1;Ln Rn g; 8 n 2 N; is easily satisfied by normalizing the filters fgλn gλn2Λn .

Load more