arXiv:2006.08391v2 [cs.LG] 7 Nov 2020

On Lipschitz Regularization of Convolutional Layers using Toeplitz Matrix Theory

Alexandre Araujo, Benjamin Negrevergne, Yann Chevaleyre, Jamal Atif
PSL, Université Paris-Dauphine, CNRS, LAMSADE, MILES Team, Paris, France

Copyright © 2021, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Abstract

This paper tackles the problem of Lipschitz regularization of Convolutional Neural Networks. Lipschitz regularity is now established as a key property of modern deep learning with implications in training stability, generalization, robustness against adversarial examples, etc. However, computing the exact value of the Lipschitz constant of a neural network is known to be NP-hard. Recent attempts from the literature introduce upper bounds to approximate this constant that are either efficient but loose or accurate but computationally expensive. In this work, by leveraging the theory of Toeplitz matrices, we introduce a new upper bound for convolutional layers that is both tight and easy to compute. Based on this result we devise an algorithm to train Lipschitz regularized Convolutional Neural Networks.

1 Introduction

The last few years have witnessed a growing interest in Lipschitz regularization of neural networks, with the aim of improving their generalization (Bartlett, Foster, and Telgarsky 2017), their robustness to adversarial attacks (Tsuzuku, Sato, and Sugiyama 2018; Farnia, Zhang, and Tse 2019), or their generation abilities (e.g., for GANs: Miyato et al. 2018; Arjovsky, Chintala, and Bottou 2017). Unfortunately, computing the exact Lipschitz constant of a neural network is NP-hard (Virmaux and Scaman 2018), and in practice existing techniques such as those of Virmaux and Scaman (2018), Fazlyab et al. (2019), or Latorre, Rolland, and Cevher (2020) are difficult to implement for neural networks with more than one or two layers, which hinders their use in deep learning applications.

To overcome this difficulty, most of the work has focused on computing the Lipschitz constant of individual layers instead. The product of the Lipschitz constants of the layers is an upper bound for the Lipschitz constant of the entire network, and it can be used as a surrogate to perform Lipschitz regularization. Since most common activation functions (such as ReLU) have a Lipschitz constant equal to one, the main bottleneck is to compute the Lipschitz constant of the underlying linear map, which is equal to its maximal singular value. The work in this line of research mainly relies on the celebrated iterative algorithm by Golub and Van der Vorst (2000) used to approximate the maximum singular value of a linear function. Although generic and accurate, this technique is also computationally expensive, which impedes its usage in large training settings.

In this paper we introduce a new upper bound on the largest singular value of convolution layers that is both tight and easy to compute. Instead of using the power method to iteratively approximate this value, we rely on Toeplitz matrix theory and its links with Fourier analysis. Our work is based on the result (Gray et al. 2006) that an upper bound on the singular value of Toeplitz matrices can be computed from the inverse Fourier transform of the characteristic sequence of these matrices. We first extend this result to doubly-block Toeplitz matrices (i.e., block Toeplitz matrices where each block is Toeplitz) and then to convolutional operators, which can be represented as stacked sequences of doubly-block Toeplitz matrices. An algorithm for bounding the Lipschitz constant of a convolutional layer, and by extension the Lipschitz constant of the whole network, follows immediately from our analysis. We theoretically study the approximation of this algorithm and show experimentally that it is more efficient and accurate than competing approaches.

Finally, we illustrate our approach on adversarial robustness. Recent work has shown that empirical methods such as adversarial training (AT) offer poor generalization (Schmidt et al. 2018), and can be improved by applying Lipschitz regularization (Farnia, Zhang, and Tse 2019). To illustrate the benefit of our new method, we train a large, state-of-the-art Wide ResNet architecture with Lipschitz regularization and show that it offers a significant improvement over adversarial training alone, and over other methods for Lipschitz regularization.

In summary, we make the following three contributions:

1. We devise an upper bound on the singular values of the operator matrix of convolutional layers by leveraging Toeplitz matrix theory and its links with Fourier analysis.
2. We propose an efficient algorithm to compute this upper bound, which enables its use in the context of Convolutional Neural Networks.
3. We use our method to regularize the Lipschitz constant of neural networks for adversarial robustness and show that it offers a significant improvement over AT alone.

2 Related Work

A popular technique for approximating the maximal singular value of a matrix is the power method (Golub and Van der Vorst 2000), an iterative algorithm which yields a good approximation of the maximum singular value when the algorithm is able to run for a sufficient number of iterations.

Yoshida and Miyato (2017) and Miyato et al. (2018) have used the power method to normalize the spectral norm of each layer of a neural network, and showed that the resulting models offered improved generalization performance and generated better examples when they were used in the context of GANs. Farnia, Zhang, and Tse (2019) built upon the work of Miyato et al. (2018) and proposed a power method specific to convolutional layers that leverages the deconvolution operation and avoids the computation of the gradient. They used it in combination with adversarial training. In the same vein, Gouk et al. (2018) demonstrated that neural networks regularized using the power method also offered improvements over their non-regularized counterparts. Furthermore, Tsuzuku, Sato, and Sugiyama (2018) have shown that a neural network can be more robust to some adversarial attacks if the prediction margin of the network (i.e., the difference between the first and the second maximum logit) is higher than a minimum threshold that depends on the global Lipschitz constant of the network. Building on this observation, they use the power method to compute an upper bound on the global Lipschitz constant, and maximize the prediction margin during training. Finally, Virmaux and Scaman (2018) have used automatic differentiation combined with the power method to compute a tighter bound on the global Lipschitz constant of neural networks. Despite a number of interesting results, using the power method is expensive and results in prohibitive training times.

Other approaches to regularize the Lipschitz constant of neural networks have been proposed by Sedghi, Gupta, and Long (2019) and Singla and Feizi (2019). The method of Sedghi, Gupta, and Long (2019) exploits the properties of circulant matrices to approximate the maximal singular value of a convolutional layer. Although interesting, this method results in a loose approximation of the maximal singular value of a convolutional layer. Furthermore, the complexity of their algorithm depends on the convolution input, which can be high for large datasets such as ImageNet. More recently, Singla and Feizi (2019) have successfully bounded the operator norm of the Jacobian matrix of a convolution layer by the Frobenius norm of the reshaped kernel. This technique has the advantage of being very fast to compute and independent of the input size, but it also results in a loose [...]

[...] works are theoretically interesting but lack scalability (i.e., the bound can only be computed on small networks).

Finally, in parallel to the development of the results in this paper, we discovered that Yi (2020) has studied the asymptotic distribution of the singular values of convolutional layers by using a related approach. However, this author does not investigate the robustness applications of Lipschitz regularization.

3 A Primer on Toeplitz and block Toeplitz matrices

In order to devise a bound on the Lipschitz constant of a convolution layer as used by the Deep Learning community, we study the properties of doubly-block Toeplitz matrices. In this section, we first introduce the necessary background on Toeplitz and block Toeplitz matrices, and introduce a new result on doubly-block Toeplitz matrices.

Toeplitz matrices and block Toeplitz matrices are well-known types of structured matrices. A Toeplitz matrix (respectively a block Toeplitz matrix) is a matrix in which each scalar (respectively block) is repeated identically along diagonals.

An $n \times n$ Toeplitz matrix $A$ is fully determined by a two-sided sequence of scalars $\{a_h\}_{h \in N}$, whereas an $nm \times nm$ block Toeplitz matrix $B$ is fully determined by a two-sided sequence of blocks $\{B_h\}_{h \in N}$, where $N = \{-n+1, \dots, n-1\}$ and where each block $B_h$ is an $m \times m$ matrix.

$$
A = \begin{pmatrix}
a_0 & a_1 & \cdots & a_{n-1} \\
a_{-1} & a_0 & \ddots & \vdots \\
\vdots & \ddots & a_0 & a_1 \\
a_{-n+1} & \cdots & a_{-1} & a_0
\end{pmatrix}
\qquad
B = \begin{pmatrix}
B_0 & B_1 & \cdots & B_{n-1} \\
B_{-1} & B_0 & \ddots & \vdots \\
\vdots & \ddots & B_0 & B_1 \\
B_{-n+1} & \cdots & B_{-1} & B_0
\end{pmatrix}
$$

Finally, a doubly-block Toeplitz matrix is a block Toeplitz matrix in which each block is itself a Toeplitz matrix. In the remainder, we will use the standard notation $(\,\cdot\,)_{i,j \in \{0, \dots, n-1\}}$ to construct (block) matrices. For example, $A = (a_{j-i})_{i,j \in \{0, \dots, n-1\}}$ and $B = (B_{j-i})_{i,j \in \{0, \dots, n-1\}}$.

3.1 Bound on the singular value of Toeplitz and block Toeplitz matrices

A standard tool for manipulating (block) Toeplitz matrices is the use of Fourier analysis. Let $\{a_h\}_{h \in N}$ be the sequence of coefficients of the Toeplitz matrix $A \in \mathbb{R}^{n \times n}$ and let $\{B_h\}_{h \in N}$ be the sequence of $m \times m$ blocks of the block Toeplitz matrix $B$.
