of Deep Convolutional Neural Networks

Helmut Bölcskei

Department of Information Technology and Electrical Engineering

November 2017

Joint work with Thomas Wiatowski and Philipp Grohs

ImageNet

[Figure: sample ImageNet images with labels ski, rock, plant, coffee]

CNNs win the ImageNet 2015 challenge [He et al., 2015]

Go!

CNNs beat Go champion Lee Sedol [Silver et al., 2016]

Atari games

CNNs beat professional human Atari players [Mnih et al., 2015]

Describing the content of an image

CNNs generate sentences describing the content of an image [Vinyals et al., 2015]

“Carlos Kleiber conducting the Vienna Philharmonic’s New Year’s Concert 1989.”

Feature extraction and classification

input: f = [image]

non-linear feature extraction

feature vector Φ(f)

linear classifier

output: ⟨w, Φ(f)⟩ > 0 ⇒ Shannon,   ⟨w, Φ(f)⟩ < 0 ⇒ von Neumann

Mathematics for CNNs

“It is the guiding principle of many applied mathematicians that if something mathematical works really well, there must be a good underlying mathematical reason for it, and we ought to be able to understand it.” [I. Daubechies, 2015]

Why non-linear feature extractors?

Task: Separate two categories of data through a linear classifier

[Figure: two classes of data points arranged on concentric circles]

Directly on the data: ⟨w, f⟩ > 0 vs. ⟨w, f⟩ < 0  ⇒ not possible!

On the features Φ(f) = (‖f‖, 1)ᵀ: ⟨w, Φ(f)⟩ > 0 vs. ⟨w, Φ(f)⟩ < 0  ⇒ possible with w = (1, −1)ᵀ

⇒ Φ is invariant to the angular component of the data
⇒ Linear separability in feature space!
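A minimal numerical illustration of this toy example (an added sketch, not from the talk): the feature map Φ(f) = (‖f‖, 1)ᵀ together with the weight vector w = (1, −1)ᵀ separates two classes that are radially, but not linearly, separable.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two classes on concentric circles: radius 0.5 (class -1) and radius 2 (class +1).
def sample_circle(radius, n):
    angles = rng.uniform(0, 2 * np.pi, n)
    return radius * np.stack([np.cos(angles), np.sin(angles)], axis=1)

f_inner = sample_circle(0.5, 200)   # should end up with <w, Phi(f)> < 0
f_outer = sample_circle(2.0, 200)   # should end up with <w, Phi(f)> > 0

# Non-linear feature map Phi(f) = (||f||, 1)^T and linear classifier w = (1, -1)^T.
def phi(f):
    return np.stack([np.linalg.norm(f, axis=1), np.ones(len(f))], axis=1)

w = np.array([1.0, -1.0])

print("inner class, max <w, Phi(f)>:", (phi(f_inner) @ w).max())   # < 0
print("outer class, min <w, Phi(f)>:", (phi(f_outer) @ w).min())   # > 0
```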

Why non-linear feature extractors?

Let T ∈ ℝ^{m×m} be the cyclic shift matrix, i.e.,

T = [ 0  I_{m−1}
      1  0 ⋯ 0 ],

and let Φ ∈ ℂ^{m×m} be translation-invariant, i.e., Φ f = Φ T^l f, ∀ f, l.

What is the structure of Φ?

Φ = [φ_1 φ_2 … φ_m] = [φ_2 φ_3 … φ_1] = [φ_m φ_1 … φ_{m−1}]

⇒ Φ = [φ_1 φ_1 … φ_1]  ⇒ rank(Φ) = 1!
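A quick numerical check of this rank argument (an added illustration, not part of the slides): projecting an arbitrary matrix onto the set of translation-invariant maps, by averaging over all cyclic shifts, always yields a rank-one matrix with identical columns.

```python
import numpy as np

rng = np.random.default_rng(1)
m = 8

# Cyclic shift matrix T: (T f)[i] = f[i-1 mod m].
T = np.roll(np.eye(m), 1, axis=0)

Phi = rng.standard_normal((m, m))

# Enforce translation invariance Phi = Phi T^l for all l by averaging over shifts.
Phi_inv = sum(Phi @ np.linalg.matrix_power(T, l) for l in range(m)) / m

# All columns coincide and the rank collapses to 1.
print("rank:", np.linalg.matrix_rank(Phi_inv))                      # 1
print("columns identical:", np.allclose(Phi_inv, Phi_inv[:, [0]]))  # True
print("invariant:", np.allclose(Phi_inv @ T, Phi_inv))              # True
```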

Why non-linear feature extractors?

Want Φ to have a trivial null-space, i.e., Φ(f) = 0 ⇔ f = 0, to avoid “losing” features.

Non-linear Φ needed to avoid having

‖Φ(f_1) − Φ(f_2)‖² = ‖Φ(f_1 − f_2)‖² ≥ A ‖f_1 − f_2‖₂²,

for some A > 0 (a lower bound that is incompatible with the desired invariance).

Why non-linear feature extractors?

e.g.: Fourier-modulus Φ = |F_n|, where F_n ∈ ℂ^{n×n} is the DFT matrix, and |F_n| refers to multiplication by F_n followed by componentwise application of | · |.

⇒ translation-invariant, i.e., Φ f = Φ T^l f, ∀ f, l

BUT: In practice, the Fourier-modulus has poor feature extraction performance due to its global nature!
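A small added sanity check (not from the slides): the modulus of the DFT is indeed invariant to cyclic shifts, since a shift only multiplies each DFT coefficient by a unit-modulus phase factor.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 64

f = rng.standard_normal(n)
f_shifted = np.roll(f, 5)              # cyclic translation T^5 f

phi_f = np.abs(np.fft.fft(f))          # Phi = |F_n|
phi_shift = np.abs(np.fft.fft(f_shifted))

# Translation invariance: identical features up to numerical precision.
print(np.max(np.abs(phi_f - phi_shift)))   # ~1e-14
```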

Invariance/Covariance

Classification vs. regression: want to explicitly control the amount of invariance induced; less/more invariance ⇔ more/less covariance.

Deformation insensitivity

Feature vector should be independent of cameras (of different resolutions), and insensitive to small acquisition jitters.

Scattering networks ([Mallat, 2012], [Wiatowski and HB, 2015])

[Figure: tree-structured network. The root is the signal f; the nodes in layer n are the feature maps |···|f ∗ g_{λ_1}^{(k)}| ∗ g_{λ_2}^{(l)}| ∗ ···|; the output of every node is filtered by χ_n and collected into the feature vector Φ(f).]

General scattering networks guarantee [Wiatowski & HB, 2015]
- (vertical) translation invariance
- small deformation sensitivity
essentially irrespective of filters, non-linearities, and poolings!

Building blocks

Basic operations in the n-th network layer:

[Figure: f → g_{λ_n} → non-linearity → pooling, for each λ_n ∈ Λ_n]

Filters: Semi-discrete frame Ψ_n := {χ_n} ∪ {g_{λ_n}}_{λ_n ∈ Λ_n}:

A_n ‖f‖₂² ≤ ‖f ∗ χ_n‖₂² + Σ_{λ_n ∈ Λ_n} ‖f ∗ g_{λ_n}‖₂² ≤ B_n ‖f‖₂²,   ∀ f ∈ L²(ℝ^d)

e.g.: structured filters, unstructured filters, learned filters
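To make the frame condition concrete, here is a small numerical check (my own sketch, not from the talk): for an ideal (brick-wall) low-pass χ and band-pass filters g_λ whose squared frequency responses partition the spectrum, the Littlewood-Paley sum equals ‖f‖₂² exactly, i.e., A_n = B_n = 1 (a Parseval frame). The filter construction below is a hypothetical toy example.

```python
import numpy as np

rng = np.random.default_rng(3)
N = 256
f = rng.standard_normal(N)
f_hat = np.fft.fft(f)
freqs = np.abs(np.fft.fftfreq(N))          # normalized |frequency| in [0, 0.5]

# Brick-wall filter bank in the DFT domain: one low-pass chi and three band-pass g's
# whose squared magnitude responses partition the spectrum (so A = B = 1).
bands = [(0.0, 0.0625), (0.0625, 0.125), (0.125, 0.25), (0.25, 0.51)]
masks = [(freqs >= lo) & (freqs < hi) for lo, hi in bands]

def filter_energy(mask):
    # ||f * g||_2^2 computed via Parseval: (1/N) * sum |f_hat * g_hat|^2
    return np.sum(np.abs(f_hat * mask) ** 2) / N

lp_energy = filter_energy(masks[0])                     # ||f * chi||^2
bp_energy = sum(filter_energy(m) for m in masks[1:])    # sum over lambda of ||f * g_lambda||^2

print(lp_energy + bp_energy, np.sum(f ** 2))            # equal up to numerical precision
```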

Building blocks

Non-linearities: Point-wise and Lipschitz-continuous,

‖M_n(f) − M_n(h)‖₂ ≤ L_n ‖f − h‖₂,   ∀ f, h ∈ L²(ℝ^d)

⇒ Satisfied by virtually all non-linearities used in the deep learning literature!

e.g.: ReLU: L_n = 1; modulus: L_n = 1; logistic sigmoid: L_n = 1/4; ...
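A quick empirical check of these Lipschitz constants (added illustration): for random scalar pairs, the ratio |σ(x) − σ(y)| / |x − y| never exceeds 1 for ReLU and modulus, and never exceeds 1/4 for the logistic sigmoid.

```python
import numpy as np

rng = np.random.default_rng(4)
x, y = rng.standard_normal(10**6), rng.standard_normal(10**6)

nonlinearities = {
    "ReLU": lambda t: np.maximum(t, 0.0),
    "modulus": np.abs,
    "logistic sigmoid": lambda t: 1.0 / (1.0 + np.exp(-t)),
}

for name, sigma in nonlinearities.items():
    ratio = np.abs(sigma(x) - sigma(y)) / np.abs(x - y)
    print(name, "empirical Lipschitz constant ~", ratio.max())
# ReLU ~ 1, modulus ~ 1, logistic sigmoid ~ 0.25
```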

Building blocks

Pooling: In continuous time according to

f ↦ S_n^{d/2} P_n(f)(S_n ·),

where S_n ≥ 1 is the pooling factor and P_n : L²(ℝ^d) → L²(ℝ^d) is R_n-Lipschitz-continuous.

⇒ Emulates most poolings used in the deep learning literature!

e.g.: pooling by sub-sampling P_n(f) = f with R_n = 1; pooling by averaging P_n(f) = f ∗ φ_n with R_n = ‖φ_n‖₁
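A discrete-time analogue of the averaging example, as an added sanity check (assumed toy setup): for P(f) = f ∗ φ, Young's inequality gives ‖P(f) − P(h)‖₂ ≤ ‖φ‖₁ ‖f − h‖₂, so the empirical ratio stays below ‖φ‖₁.

```python
import numpy as np

rng = np.random.default_rng(5)
N = 512
phi = np.array([0.25, 0.5, 0.25])          # averaging kernel, ||phi||_1 = 1

def pool_avg(f):
    # circular convolution with the averaging kernel phi
    return np.real(np.fft.ifft(np.fft.fft(f) * np.fft.fft(phi, N)))

ratios = []
for _ in range(200):
    f, h = rng.standard_normal(N), rng.standard_normal(N)
    ratios.append(np.linalg.norm(pool_avg(f) - pool_avg(h)) / np.linalg.norm(f - h))

print("max ratio:", max(ratios), "<= ||phi||_1 =", np.sum(np.abs(phi)))
```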

Vertical translation invariance

Theorem (Wiatowski and HB, 2015)
Assume that the filters, non-linearities, and poolings satisfy

B_n ≤ min{1, L_n^{−2} R_n^{−2}},   ∀ n ∈ ℕ.

Let the pooling factors be S_n ≥ 1, n ∈ ℕ. Then,

|||Φ^n(T_t f) − Φ^n(f)||| = O( ‖t‖ / (S_1 ⋯ S_n) ),

for all f ∈ L²(ℝ^d), t ∈ ℝ^d, n ∈ ℕ.

⇒ Features become more invariant with increasing network depth!

Full translation invariance: If lim_{n→∞} S_1 · S_2 · … · S_n = ∞, then

lim_{n→∞} |||Φ^n(T_t f) − Φ^n(f)||| = 0.

The condition B_n ≤ min{1, L_n^{−2} R_n^{−2}}, ∀ n ∈ ℕ, is easily satisfied by normalizing the filters {g_{λ_n}}_{λ_n ∈ Λ_n}.

⇒ Applies to general filters, non-linearities, and poolings.
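As an added worked instance of this bound (my own specialization, assuming a constant pooling factor S_n = S > 1 in every layer), the decay is geometric in the depth:

```latex
\left\|\Phi^{n}(T_t f)-\Phi^{n}(f)\right\|
  = O\!\left(\frac{\|t\|}{S_1\cdots S_n}\right)
  = O\!\left(\|t\|\,S^{-n}\right)\longrightarrow 0 \quad (n\to\infty);
\qquad \text{e.g., } S=2,\ \|t\|=8:\ \frac{\|t\|}{S^{5}}=\frac{8}{32}=\frac{1}{4}.
```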

Philosophy behind invariance results

Mallat’s “horizontal” translation invariance [Mallat, 2012]:

lim_{J→∞} |||Φ_W(T_t f) − Φ_W(f)||| = 0,   ∀ f ∈ L²(ℝ^d), ∀ t ∈ ℝ^d

- features become invariant in every network layer, but needs J → ∞
- applies to wavelet transform and modulus non-linearity without pooling

“Vertical” translation invariance:

lim_{n→∞} |||Φ^n(T_t f) − Φ^n(f)||| = 0,   ∀ f ∈ L²(ℝ^d), ∀ t ∈ ℝ^d

- features become more invariant with increasing network depth
- applies to general filters, general non-linearities, and general poolings

Translation covariance

Theorem (Wiatowski and HB, 2015)
Assume that the filters, non-linearities, and poolings satisfy

B_n ≤ min{1, L_n^{−2} R_n^{−2}},   ∀ n ∈ ℕ.

Let the pooling factors be S_n ≥ 1, n ∈ ℕ. Then,

|||Φ^n(T_t f) − T_t Φ^n(f)||| = O( ‖t‖ · | 1/(S_1 ⋯ S_n) − 1 | ),

for all f ∈ L²(ℝ^d), t ∈ ℝ^d, n ∈ ℕ.

⇒ No pooling (i.e., S_n = 1, for all n ∈ ℕ) leads to full translation covariance!

Experiment: Feature importance evaluation

Question: Which features are important in
- handwritten digit classification?
- detection of facial landmarks (eyes, nose, mouth) through regression?

Compare importance of features corresponding to (i) different layers, (ii) wavelet scales, and (iii) wavelet directions.

Setup for Φ: D = 4 layers; Haar wavelets with J = 3 scales and modulus non-linearity in every network layer; no pooling in the first layer, average pooling with uniform weights in the second and third layers (S = 2)

Handwritten digit classification:
- Dataset: MNIST database (10,000 training and 10,000 test images)
- Random forest classifier [Breiman, 2001] with 30 trees
- Feature importance: Gini importance [Breiman, 1984]

Facial landmark detection:
- Dataset: Caltech Web Faces database (7092 images; 80% for training, 20% for testing)
- Random forest regressor [Breiman, 2001] with 30 trees
- Feature importance: Gini importance [Breiman, 1984]

Experiment: Feature importance evaluation

[Figure: average cumulative feature importance for digit classification (digits and displaced digits) and for facial landmarks (left eye, nose), broken down by layer/direction pairs 0/-, 1/0, …, 3/2 and by wavelet scale j = 0, 1, 2; a triplet of bars d/r corresponds to horizontal (r = 0), vertical (r = 1), and diagonal (r = 2) features in layer d.]

Experiment: Feature importance evaluation

Average cumulative feature importance per layer:

          left eye   right eye   nose    mouth   digits
Layer 0   0.020      0.023       0.016   0.014   0.004
Layer 1   0.629      0.646       0.576   0.490   0.094
Layer 2   0.261      0.236       0.298   0.388   0.280
Layer 3   0.090      0.095       0.110   0.108   0.622

Non-linear deformations

Non-linear deformation (F_τ f)(x) = f(x − τ(x)), where τ : ℝ^d → ℝ^d

[Figure: effect of F_τ on an image for “small” τ and for “large” τ]

Deformation sensitivity for signal classes

Consider (F_τ f)(x) = f(x − τ(x)) = f(x − e^{−x²})

[Figure: f_1(x) and (F_τ f_1)(x); f_2(x) and (F_τ f_2)(x)]

For given τ, the amount of deformation induced can depend drastically on f ∈ L²(ℝ^d).

Philosophy behind deformation stability/sensitivity bounds

Mallat’s deformation stability bound [Mallat, 2012]:

|||Φ_W(F_τ f) − Φ_W(f)||| ≤ C ( 2^{−J} ‖τ‖_∞ + J ‖Dτ‖_∞ + ‖D²τ‖_∞ ) ‖f‖_W,

for all f ∈ H_W ⊆ L²(ℝ^d)

- the signal class H_W and the corresponding ‖·‖_W depend on the mother wavelet (and hence the network)
- signal class description complexity implicit via the norm ‖·‖_W
- the bound depends explicitly on higher-order derivatives of τ
- the bound is coupled to horizontal translation invariance
  lim_{J→∞} |||Φ_W(T_t f) − Φ_W(f)||| = 0,   ∀ f ∈ L²(ℝ^d), ∀ t ∈ ℝ^d

Our deformation sensitivity bound:

|||Φ(F_τ f) − Φ(f)||| ≤ C_C ‖τ‖_∞^α,   ∀ f ∈ C ⊆ L²(ℝ^d)

- the signal class C (band-limited functions, cartoon functions, or Lipschitz functions) is independent of the network
- signal class description complexity explicit via C_C:
  L-band-limited functions: C_C = O(L); cartoon functions of size K: C_C = O(K^{3/2}); M-Lipschitz functions: C_C = O(M)
- the decay rate α > 0 of the deformation error is signal-class-specific (band-limited functions: α = 1, cartoon functions: α = 1/2, Lipschitz functions: α = 1)
- the bound implicitly depends on derivatives of τ via the condition ‖Dτ‖_∞ ≤ 1/(2d)
- the bound is decoupled from vertical translation invariance
  lim_{n→∞} |||Φ^n(T_t f) − Φ^n(f)||| = 0,   ∀ f ∈ L²(ℝ^d), ∀ t ∈ ℝ^d

Proof sketch: Decoupling

|||Φ(F_τ f) − Φ(f)||| ≤ C_C ‖τ‖_∞^α,   ∀ f ∈ C ⊆ L²(ℝ^d)

1) Lipschitz continuity:

|||Φ(f) − Φ(h)||| ≤ ‖f − h‖₂,   ∀ f, h ∈ L²(ℝ^d),

established through (i) the frame property of Ψ_n, (ii) Lipschitz continuity of the non-linearities, and (iii) the admissibility condition B_n ≤ min{1, L_n^{−2} R_n^{−2}}.

2) Signal-class-specific deformation sensitivity bound:

‖F_τ f − f‖₂ ≤ C_C ‖τ‖_∞^α,   ∀ f ∈ C ⊆ L²(ℝ^d)

3) Combine 1) and 2) to get

|||Φ(F_τ f) − Φ(f)||| ≤ ‖F_τ f − f‖₂ ≤ C_C ‖τ‖_∞^α,

for all f ∈ C ⊆ L²(ℝ^d).

Noise robustness

Lipschitz continuity of Φ according to

|||Φ(f) − Φ(h)||| ≤ ‖f − h‖₂,   ∀ f, h ∈ L²(ℝ^d),

also implies robustness w.r.t. additive noise η ∈ L²(ℝ^d) according to

|||Φ(f + η) − Φ(f)||| ≤ ‖η‖₂.

CNNs in a nutshell

CNNs used in practice employ potentially hundreds of layers and 10,000s of nodes!

e.g.: Winner of the ImageNet 2015 challenge [He et al., 2015]
- Network depth: 152 layers
- average # of nodes per layer: 472
- # of FLOPS for a single forward pass: 11.3 billion

Such depths (and breadths) pose formidable computational challenges in training and operating the network!

Topology reduction

Determine how fast the energy contained in the propagated signals (a.k.a. feature maps) decays across layers

Guarantee trivial null-space for the feature extractor Φ

Specify the number of layers needed to have “most” of the input signal energy be contained in the feature vector

For a fixed (possibly small) depth, design CNNs that capture “most” of the input signal energy

Building blocks

Basic operations in the n-th network layer:

[Figure: f → g_{λ_n} → | · | → ↓S, for each λ_n ∈ Λ_n]

Filters: Semi-discrete frame Ψ_n := {χ_n} ∪ {g_{λ_n}}_{λ_n ∈ Λ_n}
Non-linearity: Modulus | · |
Pooling: Sub-sampling with pooling factor S ≥ 1

Demodulation effect of modulus non-linearity

Components of the feature vector are given by |f ∗ g_{λ_n}| ∗ χ_{n+1}.

[Figure: spectra illustrating the demodulation effect — f̂(ω) together with the band-pass ĝ_{λ_n}(ω) and the low-pass χ̂_{n+1}(ω); the band-pass spectrum f̂(ω)·ĝ_{λ_n}(ω); the spectrum of the modulus squared |f ∗ g_{λ_n}(x)|², i.e., the autocorrelation R_{f̂·ĝ_{λ_n}}(ω), concentrated around ω = 0; the spectrum of |f ∗ g_{λ_n}|, whose base-band portion is extracted into Φ(f) via χ̂_{n+1}.]

Do all non-linearities demodulate?

High-pass filtered signal: F(f ∗ g_λ), with spectral support of width 2R located away from base-band.

Modulus: Yes! |F(|f ∗ g_λ|)| is concentrated in [−2R, 2R] … but has (small) tails!

Modulus squared: Yes, and sharply so! |F(|f ∗ g_λ|²)| is supported in [−2R, 2R] … but the modulus squared is not Lipschitz-continuous!

Rectified linear unit: No! |F(ReLU(f ∗ g_λ))| retains spectral content at the original high-pass frequencies.
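A one-dimensional numerical illustration I have added (toy setup, not from the talk): for an amplitude-modulated high-frequency signal, the modulus pushes most of the spectral energy towards base-band, whereas the ReLU keeps a large share at the carrier frequency.

```python
import numpy as np

N = 4096
t = np.arange(N) / N
carrier = N // 8                                        # carrier frequency (in DFT bins)
envelope = np.exp(-0.5 * ((t - 0.5) / 0.05) ** 2)       # smooth, slowly varying envelope
x = envelope * np.cos(2 * np.pi * carrier * t)          # band-pass signal ~ f * g_lambda

def lowband_energy_fraction(y, cutoff_bin):
    Y = np.fft.rfft(y)
    energy = np.abs(Y) ** 2
    return energy[:cutoff_bin].sum() / energy.sum()

cutoff = carrier // 2                                   # "low-frequency" band, well below the carrier
print("modulus:", lowband_energy_fraction(np.abs(x), cutoff))          # ~0.8
print("ReLU   :", lowband_energy_fraction(np.maximum(x, 0), cutoff))   # ~0.4
```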

First goal: Quantify feature map energy decay

[Figure: scattering tree with the signal f at the root; W_n(f) denotes the energy contained in the feature maps of the n-th layer, e.g., W_1(f) collects the |f ∗ g_{λ_1}| and W_2(f) the ||f ∗ g_{λ_1}| ∗ g_{λ_2}|; each node output is filtered by χ_n to form the feature vector.]

Assumptions (on the filters)

i) Analyticity: For every filter g_{λ_n} there exists a (not necessarily canonical) orthant H_{λ_n} ⊆ ℝ^d such that supp(ĝ_{λ_n}) ⊆ H_{λ_n}.

ii) High-pass: There exists δ > 0 such that

Σ_{λ_n ∈ Λ_n} |ĝ_{λ_n}(ω)|² = 0,   a.e. ω ∈ B_δ(0).

⇒ Comprises various constructions of WH filters, wavelets, ridgelets, (α)-curvelets, shearlets

e.g.: analytic band-limited curvelets [Figure: tiling of the (ω_1, ω_2)-plane]

Input signal classes

Sobolev functions of order s ≥ 0:

H^s(ℝ^d) = { f ∈ L²(ℝ^d) | ∫_{ℝ^d} (1 + |ω|²)^s |f̂(ω)|² dω < ∞ }

H^s(ℝ^d) contains a wide range of practically relevant signal classes:

- square-integrable functions L²(ℝ^d) = H^0(ℝ^d)
- L-band-limited functions L²_L(ℝ^d) ⊆ H^s(ℝ^d), ∀ L > 0, ∀ s ≥ 0
- cartoon functions [Donoho, 2001] C_CART ⊆ H^s(ℝ^d), ∀ s ∈ [0, 1/2)

[Figure: handwritten digits from the MNIST database [LeCun & Cortes, 1998]]

Exponential energy decay

Theorem
Let the filters be wavelets with mother wavelet

supp(ψ̂) ⊆ [r^{−1}, r],   r > 1,

or Weyl-Heisenberg (WH) filters with prototype function

supp(ĝ) ⊆ [−R, R],   R > 0.

Then, for every f ∈ H^s(ℝ^d), there exists β > 0 such that

W_n(f) = O( a^{ −n(2s+β) / (2s+β+1) } ),

where a = (r²+1)/(r²−1) in the wavelet case, and a = 1/2 + 1/R in the WH case.

⇒ decay factor a is explicit and can be tuned via r, R

- β > 0 determines the decay of f̂(ω) (as |ω| → ∞) according to
  |f̂(ω)| ≤ µ (1 + |ω|²)^{ −(s/2 + 1/4 + β/4) },   ∀ |ω| ≥ L,
  for some µ > 0, and L acts as an “effective bandwidth”
- smoother input signals (i.e., s ↑) lead to faster energy decay
- pooling through sub-sampling f ↦ S^{1/2} f(S·) leads to decay factor aS

What about general filters? ⇒ polynomial energy decay!

Arbitrarily large decay factor

Exponential energy decay for wavelets:

W_n(f) = O( a^{ −n(2s+β) / (2s+β+1) } ),   where a = (r²+1)/(r²−1), for r > 1

[Figure: spectral supports of the mother wavelet ψ̂ and the dilated filters ĝ_1, ĝ_2, ĝ_3 on the frequency axis at 1/r, 1, r, r², r³]

⇒ Smaller bandwidth (i.e., r ↓) leads to faster energy decay

… our second goal … trivial null-space for Φ

Why trivial null-space?

[Figure: feature space with a linear classifier w, the regions ⟨w, Φ(f)⟩ > 0 and ⟨w, Φ(f)⟩ < 0, and the point Φ(f*)]

Non-trivial null-space: ∃ f* ≠ 0 such that Φ(f*) = 0
⇒ ⟨w, Φ(f*)⟩ = 0 for all w!
⇒ these f* become unclassifiable!

Trivial null-space for the feature extractor:

{ f ∈ L²(ℝ^d) | Φ(f) = 0 } = {0}

The feature extractor Φ(·) = ⋃_{n=0}^{∞} Φ^n(·) shall satisfy

A‖f‖₂² ≤ |||Φ(f)|||² ≤ B‖f‖₂²,   ∀ f ∈ L²(ℝ^d),

for some A, B > 0.

“Energy conservation”

Theorem
For the frame upper bounds {B_n}_{n∈ℕ} and frame lower bounds {A_n}_{n∈ℕ}, define B := ∏_{n=1}^{∞} max{1, B_n} and A := ∏_{n=1}^{∞} min{1, A_n}. If

0 < A ≤ B < ∞,

then

A‖f‖₂² ≤ |||Φ(f)|||² ≤ B‖f‖₂²,   ∀ f ∈ L²(ℝ^d).

- For Parseval frames (i.e., A_n = B_n = 1, n ∈ ℕ), this yields |||Φ(f)|||² = ‖f‖₂²
- Connection to energy decay:
  ‖f‖₂² = Σ_{k=0}^{n−1} |||Φ^k(f)|||² + W_n(f),   with W_n(f) → 0

… and our third goal …

For a given CNN, specify the number of layers needed to capture “most” of the input signal energy:

How many layers n are needed to have at least ((1 − ε) · 100)% of the input signal energy be contained in the feature vector, i.e.,

(1 − ε)‖f‖₂² ≤ Σ_{k=0}^{n} |||Φ^k(f)|||² ≤ ‖f‖₂²,   ∀ f ∈ L²(ℝ^d)?

Number of layers needed

Theorem
Let the frame bounds satisfy A_n = B_n = 1, n ∈ ℕ. Let the input signal f be L-band-limited, and let ε ∈ (0, 1). If

n ≥ log_a( L / (1 − √(1 − ε)) ),

then

(1 − ε)‖f‖₂² ≤ Σ_{k=0}^{n} |||Φ^k(f)|||² ≤ ‖f‖₂².

⇒ also guarantees a trivial null-space for ⋃_{k=0}^{n} Φ^k(f)

- the lower bound depends on
  - the description complexity of the input signals (i.e., the bandwidth L)
  - the decay factor (wavelets: a = (r²+1)/(r²−1), WH filters: a = 1/2 + 1/R)
- similar estimates for Sobolev input signals and for general filters (polynomial decay!)

Numerical example for bandwidth L = 1:

(1 − ε)              0.25   0.5   0.75   0.9   0.95   0.99
wavelets (r = 2)       2      3     4      6     8      11
WH filters (R = 1)     2      4     5      8     10     14
general filters        2      3     7     19     39    199

Recall: Winner of the ImageNet 2015 challenge [He et al., 2015]
- Network depth: 152 layers
- average # of nodes per layer: 472
- # of FLOPS for a single forward pass: 11.3 billion
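As a check on the table above, here is a short script I have added that evaluates the theorem's layer bound n = ⌈log_a(L / (1 − √(1 − ε)))⌉ for the wavelet and WH decay factors; it reproduces the first two rows.

```python
import math

L = 1.0
decay_factors = {
    "wavelets (r = 2)": (2**2 + 1) / (2**2 - 1),   # a = (r^2+1)/(r^2-1) = 5/3
    "WH filters (R = 1)": 0.5 + 1.0 / 1.0,         # a = 1/2 + 1/R = 3/2
}

for name, a in decay_factors.items():
    layers = []
    for captured in (0.25, 0.5, 0.75, 0.9, 0.95, 0.99):   # captured = 1 - eps
        n = math.ceil(math.log(L / (1.0 - math.sqrt(captured)), a))
        layers.append(n)
    print(name, layers)
# wavelets (r = 2)   [2, 3, 4, 6, 8, 11]
# WH filters (R = 1) [2, 4, 5, 8, 10, 14]
```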

… our fourth and last goal …

For a fixed (possibly small) depth N, design scattering networks that capture “most” of the input signal energy:

For fixed depth N, we want to impose conditions on the filters g_{λ_n} so that

(1 − ε)‖f‖₂² ≤ Σ_{k=0}^{N} |||Φ^k(f)|||² ≤ ‖f‖₂²,   ∀ f ∈ L²(ℝ^d).

Depth-constrained networks

Theorem
Let the input signal f be L-band-limited, and fix ε ∈ (0, 1) and N ∈ ℕ. If

supp(ĝ_{λ_n}) = [a_{λ_n}, b_{λ_n}],   ∀ λ_n,

and

1 < sup_{λ_n} b_{λ_n} / a_{λ_n} ≤ (κ + 1)/(κ − 1),   where κ := ( L / (1 − √(1 − ε)) )^{1/N},

then

(1 − ε)‖f‖₂² ≤ Σ_{k=0}^{N} |||Φ^k(f)|||² ≤ ‖f‖₂².

⇒ Condition on the growth of the filter bandwidths

⇒ Satisfied, e.g., by analytic band-limited wavelets and WH filters
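A small worked example I have added, evaluating the condition for hypothetical parameters L = 1 and 90% energy capture: the maximal allowed band-edge ratio (κ+1)/(κ−1) shrinks towards 1 as the depth N decreases, i.e., shallower networks require narrower (and hence more) filters.

```python
import math

L, captured = 1.0, 0.9            # L-band-limited input, capture (1 - eps) = 90% of the energy
base = L / (1.0 - math.sqrt(captured))

for N in (1, 2, 3, 5, 10):
    kappa = base ** (1.0 / N)
    max_ratio = (kappa + 1.0) / (kappa - 1.0)
    print(f"N = {N:2d}: kappa = {kappa:6.3f}, max allowed b/a ratio = {max_ratio:6.3f}")
# The allowed bandwidth growth per filter shrinks as the depth N decreases
# (depth-width tradeoff).
```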

Depth-width tradeoff

[Figure: spectral supports of the mother wavelet ψ̂ and the filters ĝ_1, ĝ_2, ĝ_3 covering the band [−L, L]]

Smaller depth N
⇒ smaller bandwidth of the filters
⇒ larger number of filters to cover the spectral support [−L, L] of the input signal
⇒ larger number of filters in the first layer
⇒ depth-width tradeoff

Practice is digital

In practice … we need to handle discrete data: f ∈ ℝ^{n×n}

In practice … a wide variety of network architectures is used!

[Figure: examples of network topologies, e.g., a convolutional network computing class probabilities [LeCun et al., 1990], and deeper/wider trees of iterated filtering and modulus operations, possibly including a low-pass branch f ∗ ϕ_low]

How general is the scattering network architecture?

- building blocks ⇒ general filters, non-linearities, and poolings
- “special” architectural components, e.g., drop-out (cf. AlexNet [Krizhevsky et al., 2012]), skip-layer connections (cf. ResNets [He et al., 2015]), frame filters (cf. Parseval networks [Cisse et al., 2017])
- feature map aggregation can be incorporated [Balan et al., 2017]
- tree topology can be width-pruned by determining the “operationally significant nodes” [Wiatowski et al., 2017]

Local properties: Lipschitz continuity

Theorem (Wiatowski et al., 2016)
The features generated in the d-th network layer are Lipschitz-continuous with Lipschitz constant

L^d := ‖χ_d‖₁ ( ∏_{k=1}^{d} B_k L_k² R_k² )^{1/2},

i.e.,

|||Φ^d(f) − Φ^d(h)||| ≤ L^d ‖f − h‖₂,   ∀ f, h ∈ H_N,

where H_N := { f : ℤ → ℂ | f[n] = f[n + N], ∀ n ∈ ℤ }.

The Lipschitz constant L^d
- determines the noise sensitivity of Φ^d(f):  |||Φ^d(f + η) − Φ^d(f)||| ≤ L^d ‖η‖₂,   ∀ f ∈ H_N
- impacts the energy of Φ^d(f):  |||Φ^d(f)||| ≤ L^d ‖f‖₂,   ∀ f ∈ H_N
- quantifies the impact of deformations τ:  |||Φ^d(F_τ f) − Φ^d(f)||| ≤ 4 L^d K N^{1/2} ‖τ‖_∞^{1/2},   ∀ f ∈ C_CART^{N,K}
- is hence a characteristic constant for the features Φ^d(f) generated in the d-th network layer

Local properties: Lipschitz continuity

L^d / L^{d−1} = ( ‖χ_d‖₁ / ‖χ_{d−1}‖₁ ) B_d^{1/2} L_d R_d

If ‖χ_d‖₁ < ‖χ_{d−1}‖₁ / ( B_d^{1/2} L_d R_d ), then L^d < L^{d−1}, and hence

- the features Φ^d(f) are less deformation-sensitive than Φ^{d−1}(f), thanks to
  |||Φ^d(F_τ f) − Φ^d(f)||| ≤ 4 L^d K N^{1/2} ‖τ‖_∞^{1/2},   ∀ f ∈ C_CART^{N,K}
- the features Φ^d(f) contain less energy than Φ^{d−1}(f), owing to
  |||Φ^d(f)||| ≤ L^d ‖f‖₂,   ∀ f ∈ H_N

⇒ Tradeoff between deformation sensitivity and energy preservation!
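A tiny numerical illustration I have added (with made-up layer parameters): computing L^d layer by layer shows how the per-layer factor (‖χ_d‖₁/‖χ_{d−1}‖₁) B_d^{1/2} L_d R_d determines whether the Lipschitz constant, and with it the deformation-sensitivity and energy bounds, grows or shrinks with depth.

```python
import math

# Hypothetical per-layer parameters: frame bound B_d, non-linearity Lipschitz constant L_d,
# pooling Lipschitz constant R_d, and output filter norm ||chi_d||_1.
layers = [
    # (B_d, L_d, R_d, ||chi_d||_1)
    (1.0, 1.0, 1.0, 1.0),
    (1.0, 1.0, 1.0, 0.8),
    (1.0, 1.0, 1.0, 0.5),
    (1.0, 1.0, 1.0, 0.5),
]

running_product = 1.0
prev_L = None
for d, (B, Lip, R, chi_norm) in enumerate(layers, start=1):
    running_product *= B * Lip**2 * R**2
    L_d = chi_norm * math.sqrt(running_product)   # L^d = ||chi_d||_1 (prod_k B_k L_k^2 R_k^2)^(1/2)
    ratio = None if prev_L is None else L_d / prev_L
    print(f"layer {d}: L^d = {L_d:.3f}",
          "" if ratio is None else f"(L^d / L^(d-1) = {ratio:.3f})")
    prev_L = L_d
```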

Experiment: Handwritten digit classification

- Dataset: MNIST database of handwritten digits [LeCun & Cortes, 1998]; 60,000 training and 10,000 test images
- Φ-network: D = 3 layers; same filters, non-linearities, and pooling operators in all layers
- Classifier: SVM with radial basis function kernel [Vapnik, 1995]
- Dimensionality reduction: Supervised orthogonal least squares scheme [Chen et al., 1991]

Classification error in percent:

        Haar wavelet                   Bi-orthogonal wavelet
        abs    ReLU   tanh   LogSig    abs    ReLU   tanh   LogSig
n.p.    0.57   0.57   1.35   1.49      0.51   0.57   1.12   1.22
sub.    0.69   0.66   1.25   1.46      0.61   0.61   1.20   1.18
max.    0.58   0.65   0.75   0.74      0.52   0.64   0.78   0.73
avg.    0.55   0.60   1.27   1.35      0.58   0.59   1.07   1.26

- modulus and ReLU perform better than tanh and LogSig
- results with pooling (S = 2) are competitive with those without pooling, at significantly lower computational cost
- state-of-the-art: 0.43 [Bruna and Mallat, 2013]
  - similar feature extraction network with directional, non-separable wavelets and no pooling
  - significantly higher computational complexity

Energy decay: Related work

[Waldspurger, 2017]: Exponential energy decay

W_n(f) = O(a^{−n}),

for some unspecified a > 1.
- 1-D wavelet filters
- every network layer equipped with the same set of wavelets
- vanishing moments condition on the mother wavelet
- applies to 1-D real-valued band-limited input signals f ∈ L²(ℝ)

[Czaja and Li, 2016]: Exponential energy decay

W_n(f) = O(a^{−n}),

for some unspecified a > 1.
- d-dimensional uniform covering filters (similar to Weyl-Heisenberg filters), but does not cover multi-scale filters (e.g., wavelets, ridgelets, curvelets, etc.)
- every network layer equipped with the same set of filters
- analyticity and vanishing moments conditions on the filters
- applies to d-dimensional complex-valued input signals f ∈ L²(ℝ^d)