
The Lipschitz Constant of Self-Attention

Hyunjik Kim 1 George Papamakarios 1 Andriy Mnih 1

¹DeepMind, UK. Correspondence to: Hyunjik Kim.

Abstract

Lipschitz constants of neural networks have been explored in various contexts in deep learning, such as provable adversarial robustness, estimating Wasserstein distance, stabilising training of GANs, and formulating invertible neural networks. Such works have focused on bounding the Lipschitz constant of fully connected or convolutional networks, composed of linear maps and pointwise non-linearities. In this paper, we investigate the Lipschitz constant of self-attention, a non-linear neural network module widely used in sequence modelling. We prove that the standard dot-product self-attention is not Lipschitz for unbounded input domain, and propose an alternative L2 self-attention that is Lipschitz. We derive an upper bound on the Lipschitz constant of L2 self-attention and provide empirical evidence for its asymptotic tightness. To demonstrate the practical relevance of our theoretical work, we formulate invertible self-attention and use it in a Transformer-based architecture for a character-level language modelling task.

1. Introduction

Lipschitz continuity is a strong form of continuity for functions. Loosely speaking, a function is Lipschitz continuous if changing its input by a certain amount cannot change its output by more than K times that amount. The constant K is a hard constraint on how rapidly the function's output can vary, and the smallest such K is known as the function's Lipschitz constant. For example, f1(x) = √|x| and f2(x) = exp(x) for x ∈ R are not Lipschitz continuous, because their output can change arbitrarily fast as x approaches 0 and +∞ respectively. On the other hand, g1(x) = tanh(x) and g2(x) = αx are Lipschitz continuous, because their rate of change (derivative) is bounded.

In deep learning, we often use Lipschitz continuity as a constraint for neural networks, to control how much a network's output can change relative to its input. Such Lipschitz constraints are useful in several contexts. For example, Lipschitz constraints can endow models with provable robustness against adversarial perturbations (Cisse et al., 2017; Tsuzuku et al., 2018; Anil et al., 2019), and guaranteed generalisation bounds (Sokolić et al., 2017). Moreover, the dual form of the Wasserstein distance is defined as a supremum over Lipschitz functions with a given Lipschitz constant, hence Lipschitz-constrained networks are used for estimating Wasserstein distances (Peyré & Cuturi, 2019). Further, Lipschitz-constrained networks can stabilise training for GANs, an example being spectral normalisation (Miyato et al., 2018). Finally, Lipschitz-constrained networks are also used to construct invertible models and normalising flows. For example, Lipschitz-constrained networks can be used as a building block for invertible residual networks and hence flow-based generative models (Behrmann et al., 2019; Chen et al., 2019). Additionally, Neural ODEs (Chen et al., 2018; Grathwohl et al., 2019) are typically defined using vector fields parameterized via Lipschitz networks, so that the flow generated by the vector field is guaranteed to exist for all times.

Nonetheless, designing Lipschitz-continuous neural networks and computing (or even upper-bounding) their Lipschitz constant is a hard problem. Previous work mostly focused on fully-connected and convolutional networks, not only because they are common in deep learning, but also because they are relatively simple to analyze, as compositions of linear maps and pointwise non-linearities. Even in this case however, exact evaluation of the Lipschitz constant of fully-connected and convolutional networks is NP-hard (Virmaux & Scaman, 2018) and obtaining a tight upper bound remains a challenging task (Virmaux & Scaman, 2018; Fazlyab et al., 2019; Latorre et al., 2020).

Fully-connected and convolutional networks are not the only neural networks worthy of interest. Recently, self-attention (Vaswani et al., 2017) has become a popular alternative to recurrent neural networks. Self-attention is a key component of the Transformer (Vaswani et al., 2017), which has found widespread use across models of various data modalities, starting with natural-language processing (Vaswani et al., 2017; Devlin et al., 2019; Brown et al., 2020) and extending to computer vision

(Zhang et al., 2019; Parmar et al., 2019), audio generation (Huang et al., 2019), and reinforcement learning (Parisotto et al., 2020). However, so far no previous work has analysed the Lipschitz properties of self-attention, and thus it has been unclear whether self-attention is a viable option in applications that require Lipschitz constraints. In this work, we address this gap in the theory of self-attention by providing a thorough analysis of its Lipschitz properties. In particular, we make the following contributions:

• We prove that the widely used dot-product self-attention is not Lipschitz, and therefore not suitable to use in applications requiring Lipschitz constraints.

• We formulate L2 self-attention as an alternative, and show that it is Lipschitz.

• We derive a theoretical upper bound on the Lipschitz constant of L2 self-attention, and provide empirical evidence of the asymptotic tightness of the bound.

• As a practical demonstration of the theory, we use this bound to formulate invertible self-attention, and explore its use in a Transformer architecture for character-level language modelling. We compare its test log-likelihood and stability to dot-product self-attention.

2. Lipschitz Constant of Fully-Connected/Convolutional Layers

We first define the notion of Lipschitz continuity, and proceed to define the Lipschitz constant.

Definition 2.1. Given two metric spaces (X, d_X) and (Y, d_Y), a function f : X → Y is called Lipschitz continuous (or K-Lipschitz) if there exists a constant K ≥ 0 such that

    d_Y(f(x), f(x')) ≤ K d_X(x, x')  for all x, x' ∈ X.    (1)

The smallest such K is the Lipschitz constant of f, denoted Lip(f).

In this paper, we focus on the common case where X = R^n, Y = R^m, and d_X, d_Y are induced by a p-norm ‖x‖_p := (Σ_i |x_i|^p)^{1/p}. We will primarily consider the cases p = 2 and p = ∞, where ‖x‖_∞ := max_i |x_i|. To emphasise the dependence of the Lipschitz constant on the choice of p-norm, we will often denote it by Lip_p(f). In this case, it follows directly from Definition 2.1 that the Lipschitz constant is given by

    Lip_p(f) = sup_{x ≠ x' ∈ R^n} ‖f(x) − f(x')‖_p / ‖x − x'‖_p.    (2)

Next, we outline some basic results that are useful for estimating Lipschitz constants, also covered in related works (Virmaux & Scaman, 2018; Behrmann et al., 2019). We describe how these results are used to provide bounds on the Lipschitz constant of fully-connected networks (FCN) and convolutional neural networks (CNN), using the fact that both are compositions of linear maps and pointwise non-linearities. To begin with, the following theorem suggests a way to bound Lip_p(f) for a differentiable Lipschitz function f:

Theorem 2.1 (Federer, 1969). Let f : R^n → R^m be differentiable and Lipschitz continuous under a choice of p-norm ‖·‖_p. Let J_f(x) denote its total derivative (Jacobian) at x. Then Lip_p(f) = sup_{x ∈ R^n} ‖J_f(x)‖_p, where ‖J_f(x)‖_p is the induced operator norm on J_f(x).

Hence if f is a linear map represented by a matrix W then

    Lip_p(f) = ‖W‖_p := sup_{‖x‖_p = 1} ‖W x‖_p
             = σ_max(W)            if p = 2,
             = max_i Σ_j |W_ij|    if p = ∞,

where ‖W‖_p is the operator norm on matrices induced by the vector p-norm, and σ_max(W) is the largest singular value of W. Under this choice of norm, many common non-linearities (including relu, sigmoid, tanh, elu) are 1-Lipschitz. ‖W‖_2 = σ_max(W) is usually estimated via power iteration; we provide details on how this is done in Appendix B.

Since we now know the Lipschitz constants of the components of both FCN and CNN, we can bound their Lipschitz constants by applying the following lemma:

Lemma 2.1 (Federer, 1969). Let g, h be two composable Lipschitz functions. Then g ∘ h is also Lipschitz with Lip(g ∘ h) ≤ Lip(g) Lip(h).

Corollary 2.1. For a fully-connected network (FCN) or a convolutional neural network (CNN) f = W_K ∘ ρ_{K−1} ∘ W_{K−1} ∘ ... ∘ ρ_1 ∘ W_1, we have Lip_p(f) ≤ Π_k ‖W_k‖_p under a choice of p-norm with 1-Lipschitz non-linearities ρ_k.

The above bound is not necessarily tight; there are various works that compute tighter bounds for FCN and CNN (e.g. Virmaux & Scaman, 2018; Fazlyab et al., 2019; Latorre et al., 2020).
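To make Corollary 2.1 concrete, the following is a minimal sketch (ours, not part of the paper) of the product-of-norms bound for a small fully-connected network, using exact operator norms computed with NumPy; the layer sizes and random weights are arbitrary illustrative choices.

```python
import numpy as np

def operator_norm(W, p):
    """Operator norm of x -> W x induced by the vector p-norm (p = 2 or inf)."""
    if p == 2:
        return np.linalg.svd(W, compute_uv=False)[0]   # largest singular value
    if p == np.inf:
        return np.abs(W).sum(axis=1).max()             # maximum absolute row sum
    raise ValueError("only p = 2 or p = inf handled here")

def fcn_lipschitz_upper_bound(weights, p):
    """Corollary 2.1: Lip_p(W_K o rho_{K-1} o ... o W_1) <= prod_k ||W_k||_p
    for 1-Lipschitz non-linearities rho_k (e.g. relu, tanh)."""
    return np.prod([operator_norm(W, p) for W in weights])

rng = np.random.default_rng(0)
# weight matrices with shape (out_dim, in_dim): 32 -> 64 -> 32 -> 1
weights = [rng.normal(size=(64, 32)), rng.normal(size=(32, 64)), rng.normal(size=(1, 32))]
print(fcn_lipschitz_upper_bound(weights, p=2))
print(fcn_lipschitz_upper_bound(weights, p=np.inf))
```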

3. Lipschitz Constant of Self-Attention

3.1. Dot-product self-attention is not Lipschitz

Moving on, we investigate whether self-attention is Lipschitz. We first consider the widely used (scaled) dot-product multihead self-attention as formulated by Vaswani et al. (2017). Let x_1, ..., x_N be a sequence of N elements, where

x_i ∈ R^D for i = 1, ..., N. We represent this sequence as a matrix X:

    X = [x_1^T; ...; x_N^T] ∈ R^{N×D}.    (3)

Dot-product multihead self-attention (DP-MHA) is a map from R^{N×D} to R^{N×D} consisting of H 'heads', where H is chosen to divide D. Each head is a map from R^{N×D} to R^{N×D/H} defined by

    DP(X) := softmax( X W^Q (X W^K)^T / sqrt(D/H) ) X W^V = P X W^V,

where W^Q, W^K, W^V ∈ R^{D×D/H} are learnable parameters specific to each head, and P ∈ R^{N×N} is the output of the softmax (we suppress the dependence of P on X to reduce clutter below). The input to the softmax is an N × N matrix of pairwise dot products (hence dot-product self-attention), and the softmax is applied to each row of this matrix. Finally, the outputs of all heads are concatenated into an N × D matrix and are right multiplied by W^O ∈ R^{D×D}, thus DP-MHA is defined by

    MHA_DP(X) := [DP^1(X), ..., DP^H(X)] W^O.    (4)

In what follows, we will prove that MHA as defined above is not Lipschitz, assuming that the MHA map is non-trivial, i.e. W^Q, W^K, W^V, W^O ≠ 0. It is sufficient to show that a single head DP is not Lipschitz, since MHA is a linear combination of the outputs of each head. Also note that P is a stochastic matrix, i.e. its entries are non-negative and its rows sum to 1. Since the rows of X are the x_i's, a linear transformation of each x_i by some matrix A is equivalent to right multiplication of X by A^T. So right multiplication of X by W^V is a linear map and thus Lipschitz. Therefore, we are interested in the mapping f(X) = PX; this is not a linear mapping because P itself is a non-linear function of X. In fact, we show that f is not Lipschitz, thus proving the first main result of the paper:

Theorem 3.1. DP-MHA is not Lipschitz for any vector p-norm ‖·‖_p with p ∈ [1, ∞].

Summary of Proof. We use Theorem 2.1, noting that if the supremum of the norm of the Jacobian is infinite, then the mapping is not Lipschitz. In particular, we show that when x_i = 0 for some i, some elements of the Jacobian of f grow proportionally to the sample variance of x_{≠i}, which is unbounded.
Proof. We show the proof for the case D = H = 1 (i.e. X ∈ R^{N×1}, a column vector, and x_i ∈ R) for readability. See Appendix C for the general case, which follows the same logic.

The mapping f can be written as

    f(X) = PX = softmax(a X X^T) X = [f_1(X); ...; f_N(X)] ∈ R^{N×1},

where f_i(X) = Σ_j P_ij x_j ∈ R and a = W^K W^Q ∈ R (we assume a ≠ 0 such that self-attention is non-trivial). Hence f can be interpreted as a map of each x_i to a point in the convex hull of x_1, ..., x_N. Since f is a map from R^{N×1} to R^{N×1}, its Jacobian is

    J_f = [J_11 ... J_1N; ...; J_N1 ... J_NN] ∈ R^{N×N},    (5)

where J_ij = ∂f_i(X)/∂x_j ∈ R. By taking partial derivatives we can show that

    J_ij = a X^T P^(i) [E_ji X + δ_ij X] + P_ij,

where

• E_ij ∈ R^{N×N} is a binary matrix with zeros everywhere except the (i, j)th entry,

• δ_ij ∈ {0, 1} is the Kronecker delta,

• P^(i) := diag(P_i:) − P_i:^T P_i: ∈ R^{N×N}.

See Appendix A for useful identities in deriving the above Jacobian. So for i = j:

    J_ii = a X^T P^(i) E_ii X + a X^T P^(i) X + P_ii.    (6)

Let us investigate the scalar X^T P^(i) X. We observe that it is in fact a variance of a discrete distribution. Specifically:

    X^T P^(i) X = Σ_k P_ik x_k^2 − (Σ_k P_ik x_k)^2 = Var(X),    (7)

where X is a discrete distribution with support at the inputs {x_1, ..., x_N} and probability mass function given by their softmax probabilities P(X = x_j) = P_ij. A consequence of this interpretation is that P^(i) is positive semi-definite (PSD) since X^T P^(i) X = Var(X) ≥ 0, with equality if and only if the x_j are all equal.

We use this observation to show that J_ii is unbounded, and so ‖J_f‖_p is unbounded, hence DP-MHA is not Lipschitz. Consider the case x_i = 0. Then

    P_i:^T = softmax(X A x_i) = (1/N) 1,

i.e. we have uniform attention regardless of x_{≠i}. The first term of J_ii in Equation (6) disappears since E_ii X = [0, ..., x_i, ..., 0]^T = 0, and the last term becomes 1/N. Now consider the second term a X^T P^(i) X = a Var(X). Note X is uniformly distributed, since P(X = x_j) = P_ij = 1/N. Hence the second term is equal to a times the sample variance of x_1, ..., x_N, which can be arbitrarily large. Hence J_ii can become arbitrarily large, so the full Jacobian J_f is unbounded.

High-level intuition for proof. At x_i = 0, f_i(X) = (1/N) Σ_k x_k, the mean of the inputs. The rate of change of f_i is governed by how fast the softmax saturates when x_i is perturbed, which is determined by how spread out the x_{≠i} are. The more spread out they are (the higher the sample variance), the greater the rate of saturation of the softmax, and the faster the rate of change of f_i. Since the sample variance of x_{≠i} can be arbitrarily large, the rate of change of f_i can also be arbitrarily large, i.e. the entries of the Jacobian (and hence its p-norm) can become arbitrarily large. In Appendix D, we show that adding bias terms to x_i^T W^Q and x_j^T W^K does not resolve the issue.
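As a quick numerical illustration of this argument (our own sketch, not from the paper), the following evaluates the Jacobian of f(X) = softmax(a X X^T) X for D = H = 1 by finite differences, with x_1 = 0 and the remaining inputs scaled up; the entry J_11 grows in proportion to the sample variance of the inputs, as in Equation (6). The constant a and the input values are arbitrary illustrative choices.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def f(X, a=1.0):
    # single-head dot-product self-attention with D = 1: f(X) = softmax(a X X^T) X
    P = softmax(a * np.outer(X, X))
    return P @ X

def jacobian_fd(X, eps=1e-5):
    # central finite-difference Jacobian of f at X
    N = X.shape[0]
    J = np.zeros((N, N))
    for j in range(N):
        Xp, Xm = X.copy(), X.copy()
        Xp[j] += eps
        Xm[j] -= eps
        J[:, j] = (f(Xp) - f(Xm)) / (2 * eps)
    return J

rng = np.random.default_rng(0)
base = rng.normal(size=8)
for scale in [1.0, 10.0, 100.0]:
    X = scale * base
    X[0] = 0.0                        # x_1 = 0 gives uniform attention in row 1
    J = jacobian_fd(X)
    print(scale, J[0, 0], np.var(X))  # J_11 tracks a * Var(X) + 1/N, which is unbounded
```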
The implications of this result are the following. (1) There can be undesirable behaviour (e.g. training instabilities) for the Transformer when some inputs are close to zero and others have large magnitude. (2) Dot-product self-attention (and hence the standard Transformer) is not a suitable choice when we require a Lipschitz neural network, such as for formulating invertible residual networks (Behrmann et al., 2019). Therefore, to use self-attention and Transformers in such applications, a Lipschitz formulation of self-attention is required, together with an explicit (ideally tight) upper bound to its Lipschitz constant, to quantify how much the output can change with respect to changes in the input.

One method to make dot-product self-attention Lipschitz is by ensuring its inputs are bounded. Indeed, if the input space is compact, e.g. [0, 1]^{N×D}, any continuously differentiable function is Lipschitz, including dot-product self-attention. However, as we further discuss in Section 6, such an approach has its own challenges, since it makes the Lipschitz constant depend on the input range. Instead, in the next section we formulate a version of self-attention that is provably Lipschitz on all of R^{N×D}, allowing us to derive an upper bound that holds for any subset of R^{N×D}.

3.2. L2 self-attention: a Lipschitz formulation of self-attention

The pathology in dot-product self-attention arises because the softmax probabilities P_i: are constant with respect to x_{≠i} when x_i = 0. This behaviour can be undesirable as we want P_ij to vary according to x_j, regardless of whether x_i is zero or not. Hence we propose an alternative form of self-attention based on L2 distance:

    P_ij ∝ exp(L_ij) := exp( −‖x_i^T W^Q − x_j^T W^K‖_2^2 / sqrt(D/H) ),    (8)

with the normalisation constant ensuring that Σ_j P_ij = 1. We will refer to it as L2 self-attention. It is reminiscent of the standard squared-exponential kernel, but with softmax normalisation that ensures that each row of the kernel matrix sums to 1. Normalisation is usually necessary to deal with inputs of varying length N (Wang et al., 2018), hence we keep the softmax for L2 self-attention. Similarly to dot-product self-attention, L2 self-attention can be computed efficiently with matrix operations; see Appendix E for details, with a comparison of wall-clock runtimes between different choices of attention.

We first state the mathematical formulation of L2 multihead self-attention (L2-MHA) before proving the main result — the upper bound of its Lipschitz constant with respect to ‖·‖_p for p = 2, ∞. The full L2-MHA map F : R^{N×D} → R^{N×D} is defined as

    F(X) := [f^1(X) W^{V,1}, ..., f^H(X) W^{V,H}] W^O,  where f^h(X) := P^h X A_h.

In the above, W^{V,h} ∈ R^{D×D/H}, W^O ∈ R^{D×D}, P^h is defined as in Equation (8) with W^{Q,h} = W^{K,h} ∈ R^{D×D/H}, and A_h := W^{Q,h} W^{Q,h T} / sqrt(D/H) ∈ R^{D×D}. There are two changes from the usual form of multihead self-attention:

(1) We require W^{Q,h} = W^{K,h} for each head f^h(X) to be Lipschitz. In Lemma F.1 of Appendix F we show that L2-MHA is not Lipschitz for arbitrary W^{Q,h}, W^{K,h}, and that tying W^{Q,h} = W^{K,h} is sufficient for L2-MHA to be Lipschitz, with intuition for why tying is sufficient.

(2) In each head of the self-attention f^h(X), right multiplication by A_h has been included for the theorem below to hold (details are in the proof). In practice, there is little harm done by this extra linear transformation, since when the heads are combined together in F, each f^h(X) is additionally transformed by W^{V,h}, a free parameter. (A minimal sketch of a single L2 self-attention head is given below.)
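The following is a small NumPy sketch (our own illustration, not the authors' code) of one L2 self-attention head with tied W^Q = W^K, following Equation (8) and the definition f^h(X) = P^h X A_h; the shapes and random weights are arbitrary.

```python
import numpy as np

def softmax_rows(Z):
    E = np.exp(Z - Z.max(axis=-1, keepdims=True))
    return E / E.sum(axis=-1, keepdims=True)

def l2_attention_head(X, WQ, WV):
    """One head of L2-MHA with tied queries/keys: returns f(X) @ WV where f(X) = P X A."""
    Dh = WQ.shape[1]                       # D/H, the per-head width
    Q = X @ WQ                             # queries == keys since WQ == WK
    sq = (Q ** 2).sum(axis=1)
    dists = sq[:, None] - 2 * Q @ Q.T + sq[None, :]   # pairwise ||q_i - q_j||^2
    P = softmax_rows(-dists / np.sqrt(Dh))            # Equation (8)
    A = WQ @ WQ.T / np.sqrt(Dh)                       # A_h = W^Q W^{Q,T} / sqrt(D/H)
    return P @ X @ A @ WV

N, D, H = 6, 8, 2
rng = np.random.default_rng(1)
X = rng.normal(size=(N, D))
WQ = rng.normal(size=(D, D // H))
WV = rng.normal(size=(D, D // H))
out = l2_attention_head(X, WQ, WV)         # shape (N, D // H)
print(out.shape)
```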
The second main result of the paper is the following:

Theorem 3.2. L2-MHA is Lipschitz, with the following bound on Lip∞(F):

    Lip∞(F) ≤ ( 4 φ^{-1}(N − 1) + 1/sqrt(D/H) ) ‖W^{O T}‖_∞ max_h ( ‖W^{Q,h}‖_∞ ‖W^{Q,h T}‖_∞ ) max_h ‖W^{V,h T}‖_∞

and the following bound on Lip_2(F):

    Lip_2(F) ≤ ( sqrt(N) / sqrt(D/H) ) ( 4 φ^{-1}(N − 1) + 1 ) sqrt( Σ_h ‖W^{Q,h}‖_2^2 ‖W^{V,h}‖_2^2 ) ‖W^O‖_2,

where φ(x) := x exp(x + 1) is an invertible univariate function on x > 0, and N is the input sequence length. Specifically, φ^{-1}(N − 1) = W_0((N − 1)/e) where W_0 is the Lambert W-function, which grows sub-logarithmically as O(log N − log log N) (Corless et al., 1996). Hence the above bounds can be simplified to O(log N) for p = ∞ and O(sqrt(N) log N) for p = 2.

Proof. See Appendix F, which uses the key observation that X^T P^(i) X is a covariance matrix (c.f. Equation (7)) to bound ‖J_F‖_p, the norm of the Jacobian of F. Appendix G shows how the argument can be modified to prove the analogous result for the case with masking in the self-attention.

These bounds are complemented by the concurrent work of Vuckovic et al. (2020), which provides a O(sqrt(D) log N) bound on Lip_1(F) using measure-theoretic tools.

4. Application: Invertible Self-Attention

4.1. Invertible residual network

Consider the residual function g(x) := x + f(x). Behrmann et al. (2019) give the following sufficient condition for its invertibility: if f is a contraction with respect to some metric, i.e. if Lip(f) < 1, and the metric space on which f is defined is complete, then g is invertible. (A Euclidean space with a metric induced by a p-norm ‖·‖_p for p ∈ [1, ∞] is always complete.) Specifically, the inverse g^{-1}(y) is the unique fixed point of the recursion x^{k+1} := y − f(x^k), since by the definition of the inverse we have y = g^{-1}(y) + f(g^{-1}(y)). Because f is a contraction, Banach's Fixed Point Theorem guarantees that this fixed point exists and is unique for all y, and that the recursion converges for all initial values x^0 (often set to y in practice) exponentially fast. Hence the inverse can be computed to arbitrary accuracy (up to numerical precision in practice) by the above fixed-point iteration.

Note that a composition of such invertible residual blocks is also invertible. Behrmann et al. (2019) use this observation to design invertible ResNets: they take f to be a CNN normalised by an upper bound on Lip(f) given by Corollary 2.1, making the resulting function contractive. For the 2-norm ‖·‖_2, a hyperparameter c < 1 is chosen and each linear map (convolution) W in the CNN is multiplied by c/‖W‖_2 if c < ‖W‖_2, where ‖W‖_2 is estimated by power iteration (c.f. Appendix B). This multiplicative factor determines the scale of the Lipschitz constant of the normalised function.
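As an illustration of the fixed-point inversion just described (our own sketch, not the paper's code), the snippet below inverts g(x) = x + f(x) for a stand-in contractive residual branch; the specific f, its contraction factor, and the iteration count are arbitrary assumptions for the example.

```python
import numpy as np

def invert_residual(f, y, n_iter=30):
    """Invert g(x) = x + f(x) via the Banach fixed-point iteration x_{k+1} = y - f(x_k).
    Requires Lip(f) < 1 for guaranteed convergence."""
    x = y.copy()                 # common initialisation: x_0 = y
    for _ in range(n_iter):
        x = y - f(x)
    return x

# stand-in contractive residual branch: tanh(x W^T) with spectral norm of W set to 0.5
rng = np.random.default_rng(0)
W = rng.normal(size=(16, 16))
W = 0.5 * W / np.linalg.svd(W, compute_uv=False)[0]
f = lambda x: np.tanh(x @ W.T)

x_true = rng.normal(size=(4, 16))
y = x_true + f(x_true)                                # forward pass of the residual block
x_rec = invert_residual(f, y)
print(np.max(np.abs(x_rec - x_true)))                 # error decays like 0.5^k, ~1e-9 here
```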
4.2. Invertible self-attention

The standard use case of self-attention is with a skip connection inside the Transformer. A Transformer block is composed of residual blocks of multihead self-attention (MHA) and fully-connected (FCN) layers (Figure 1). Hence similarly to invertible ResNets, we can normalise L2-MHA by the upper bounds given in Theorem 3.2 to obtain Contractive-L2-MHA f, with which we can obtain invertible self-attention g(x) = x + f(x).

[Figure 1. Transformer block. Diagram labels, bottom to top: Input, MHA, Dropout, LayerNorm, FCN, Dropout, LayerNorm, Output, with skip connections around the MHA and FCN sub-blocks.]

Since Dropout is also part of the residual branch along with Contractive-L2-MHA, we should check that it is also contractive. At test time, Dropout multiplies inputs by the dropout keep probability p < 1, so it is a contraction with Lipschitz constant p at evaluation time. At training time, Dropout amounts to setting some inputs to zero, while keeping other inputs constant. This can be expressed as right multiplication by a diagonal binary matrix M, and for such matrices we can verify ‖M‖_p := sup_{‖x‖_p = 1} ‖Mx‖_p ≤ 1.

Notice that LayerNorm is not part of the residual branch, hence its Lipschitz continuity is not relevant for invertibility; rather, we can replace it with an invertible normalisation such as ActNorm (Kingma & Dhariwal, 2018). However, architectures that place LayerNorm inside the residual branch (termed pre-LN as opposed to the traditional post-LN in Figure 1) have become more prevalent in the literature (Wang et al., 2019; Xiong et al., 2020), and in this case it makes sense to investigate its Lipschitz continuity. We show that LayerNorm is Lipschitz in Appendix N, with a bound on its Lipschitz constant.

In the next section, we investigate the properties of invertible self-attention and how it compares with the standard dot-product self-attention; we replace DP-MHA in the Transformer with Contractive-L2-MHA, hence replacing the residual self-attention module with invertible self-attention. We are not interested in the modified Transformer per se, but rather in comparing the properties of invertible self-attention to standard self-attention — we only use the Transformer as a testbed for this purpose, since self-attention is commonly used in a Transformer. Given the theoretical focus of the paper, we believe that a more challenging application of invertible self-attention, such as normalising flow-based modelling, would be more suitable as a separate paper focused on that particular application.

5. Experimental Results

5.1. Asymptotic tightness of the upper bound on Lip∞(F)

[Figure 2 plots the bounds against N on a log-scale x-axis; legend: UB; LB: top 1 out of 50 random init; LB: top 5 out of 50 random init.]

Figure 2. Lower and upper bound on Lip∞(f) for L2-MHA f, with H = D = 1 and varying N.

A tight bound on the Lipschitz constant of self-attention is desirable for all listed applications in Section 1; it leads to tighter generalisation bounds, lighter constraints for provable robustness, and better expressiveness in residual flow models. Hence we investigate the tightness of our bound on the Lipschitz constant of L2-MHA. The Lipschitz constant is a supremum over the space of inputs X ∈ R^{N×D} (c.f. Equation (2)) and approximating it requires solving an intractable optimisation problem. Hence it is infeasible to estimate accurately in general, especially when X is high-dimensional. However, we may compute a lower bound on the Lipschitz constant by maximising the norm of the Jacobian ‖J_f(X)‖ with respect to X until convergence. This local optimum will form a lower bound by Theorem 2.1, and we can expect this lower bound to be fairly tight for the low-dimensional case, provided the optimisation is thorough.

We use this observation to provide empirical evidence for the asymptotic tightness of the upper bound on Lip∞(f) in Theorem 3.2. In Figure 2, we show the upper bound as well as the lower bound on Lip∞(f) obtained by optimising ‖J_f(X)‖_∞ with respect to X for L2-MHA f with 50 different random initialisations of X, with H = D = 1 and N varying between 100 and 1000. See Appendix H for further details. Note that we use a log-scale for the x-axis, and recall that the upper bound is O(log N − log log N), dominated by the O(log N) term for large N. Hence the plot for the upper bound shows a linear trend. We also observe that the slope of the lower bound is very similar, providing empirical evidence that the O(log N − log log N) upper bound is asymptotically tight.

There are at least two possible explanations for the gap between the upper and lower bounds. (1) The lower bound is only a local optimum — the true Lipschitz constant is a global optimum across inputs, which can be difficult to attain especially for high values of N. (2) The multiplicative constant of the upper bound may be loose. Assuming asymptotic tightness, it remains an open question whether the multiplicative constant can be tightened. We show the analogous plot for Lip_2(F) and discuss the results in Appendix J. Additionally in Appendix K, we show that optimising ‖J_f(X)‖_∞ w.r.t. X for DP-MHA f causes the norm to diverge, providing empirical verification of Theorem 3.1, that DP-MHA is indeed not Lipschitz.
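To illustrate the lower-bound idea on a toy scale (our own sketch under simplifying assumptions, not the authors' experimental setup of 50 restarts and N up to 1000), the following performs crude gradient ascent on ‖J_f(X)‖_∞ for the D = H = 1 L2 self-attention map, using finite differences for both the Jacobian and the objective's gradient; the scalar weight, step size, and iteration count are arbitrary.

```python
import numpy as np

def softmax_rows(Z):
    E = np.exp(Z - Z.max(axis=-1, keepdims=True))
    return E / E.sum(axis=-1, keepdims=True)

def f_l2(X, w=1.0):
    # L2 self-attention for D = H = 1 with tied queries/keys: f(X) = P X A, A = w^2
    q = w * X
    d = (q[:, None] - q[None, :]) ** 2           # pairwise squared distances
    P = softmax_rows(-d)
    return P @ X * (w ** 2)

def jac_inf_norm(X, eps=1e-5):
    # ||J_f(X)||_inf (max absolute row sum), with a central finite-difference Jacobian
    N = X.shape[0]
    J = np.zeros((N, N))
    for j in range(N):
        Xp, Xm = X.copy(), X.copy()
        Xp[j] += eps; Xm[j] -= eps
        J[:, j] = (f_l2(Xp) - f_l2(Xm)) / (2 * eps)
    return np.abs(J).sum(axis=1).max()

def lower_bound(N, steps=200, lr=0.1, seed=0):
    # gradient ascent on ||J_f(X)||_inf; any local optimum is a lower bound on Lip_inf(f)
    rng = np.random.default_rng(seed)
    X = rng.normal(size=N)
    for _ in range(steps):
        g = np.zeros(N)
        for j in range(N):
            Xp, Xm = X.copy(), X.copy()
            Xp[j] += 1e-3; Xm[j] -= 1e-3
            g[j] = (jac_inf_norm(Xp) - jac_inf_norm(Xm)) / 2e-3
        X = X + lr * g
    return jac_inf_norm(X)

print(lower_bound(N=20))   # a (local) lower bound on Lip_inf for this small instance
```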
5.2. Numerical invertibility of MHA residual map

[Figure 3 panels: Inverting L2-MHA Residual Map (left), Inverting DP-MHA Residual Map (right); y-axis: max reconstruction error (log scale), x-axis: fixed point iterations; curves for c = 0.90, 0.70, 0.50.]

Figure 3. Invertibility of g(x) = x + cf(x) where f is L2-MHA (left) and DP-MHA (right).

Recall from Section 4.1 that g(x) = x + f(x) is invertible if f is contractive. Hence if f is Contractive-L2-MHA, g is necessarily invertible. However, technically we do not disprove the invertibility of DP-MHA, since the converse does not hold in general, i.e. if f is DP-MHA, which we have shown is not Lipschitz hence not contractive, it may still be the case that g is invertible. To verify that DP-MHA (with the skip connection) is not invertible in practice, we compare the numerical invertibility of the residual map g(x) = x + cf(x) between the cases where f is L2-MHA and DP-MHA in Figure 3. For each, we take MHA with 8 heads and randomly initialised weights, and quantify the maximum reconstruction error across a batch of 128 inputs whose outputs are inverted via the fixed-point iteration described in Section 4.1. We use N = 64, D = 64, and c ∈ {0.5, 0.7, 0.9} (see Appendix I for analogous results for a wider range of N and D and for DP-MHA with trained weights). To highlight the difference between the two types of self-attention, recall in the proof of Theorem 3.1 (showing that DP-MHA is not Lipschitz) that when one of the inputs x_i is 0, some terms of the Jacobian grow with the sample variance of x_{≠i}. Hence we check numerical invertibility at a set of N inputs where x_i = 0 and x_{≠i} are chosen uniformly at random.

In Figure 3, we see that DP-MHA is not invertible whereas L2-MHA is invertible for sufficiently small c. This shows how not having the theoretical guarantee of f being contractive can cost us invertibility in practice. We note that the figure shows local invertibility at the sampled inputs, as opposed to global invertibility across the whole input space, yet this clearly highlights the difference between the two choices of self-attention. Experiments with the globally invertible self-attention obtained by normalising with the Lipschitz upper bound are provided in the next section.

[Figure 4 panels, left to right: LSTM; Transformer (DP); Transformer (L2); Transformer (L2), W^Q = W^K; Transformer (Contractive-L2). Each panel plots test NLL against training epoch, with curves for different numbers of layers (1–14) and LSTM hidden sizes (64–512).]

Figure 4. Test NLL during training for various LSTM/Transformer models on PTB character level language modelling.

5.3. Expressiveness of L2-MHA and invertible self-attention

A natural question to ask is: how does the expressiveness of L2-MHA and Contractive-L2-MHA (that leads to invertible self-attention with the skip connection) compare with the original DP-MHA? We expect that the Lipschitz constraint will limit the expressiveness of the Transformer, and would like to find out by how much. We investigate this by comparing the performance of the original Transformer and the Transformer with invertible self-attention (c.f. Figure 1) at character-level language modelling on the Penn Treebank dataset (Marcus et al., 1993). We compare the test negative log-likelihood (NLL) of a baseline LSTM, the original Transformer (DP-MHA), and a series of models between the original Transformer and the Transformer with invertible self-attention (Contractive-L2-MHA), making one change at a time and tuning the hyperparameters on a validation set. For Contractive-L2-MHA, we normalise F = L2-MHA by the bound on Lip∞(F) as it is tighter than the bound on Lip_2(F). During training we backpropagate through these contractive blocks F / Lip∞(F) (including the denominator) to update the model parameters. We found that only backpropagating through the numerator (i.e. applying stop-gradient to the denominator) gave slightly worse performance. See Appendix H for experimental details.
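To illustrate how such a normalisation can be computed (a sketch under our own assumptions, not the authors' implementation), the code below evaluates the Theorem 3.2 bound on Lip∞(F) from an L2-MHA layer's weights, using SciPy's Lambert W for φ⁻¹, and then divides by it; the weight shapes and values are arbitrary, the exact form of the bound follows our reconstruction of the theorem above, and the l2_mha call in the final comment is a hypothetical placeholder for a full multi-head implementation.

```python
import numpy as np
from scipy.special import lambertw

def inf_norm(W):
    return np.abs(W).sum(axis=1).max()           # operator inf-norm: max absolute row sum

def lip_inf_upper_bound(WQ_heads, WV_heads, WO, N):
    """Theorem 3.2 bound on Lip_inf of L2-MHA with tied W^Q = W^K (as reconstructed here)."""
    Dh = WQ_heads[0].shape[1]                    # D/H
    phi_inv = np.real(lambertw((N - 1) / np.e))  # phi^{-1}(N-1), with phi(x) = x exp(x+1)
    term_q = max(inf_norm(WQ) * inf_norm(WQ.T) for WQ in WQ_heads)
    term_v = max(inf_norm(WV.T) for WV in WV_heads)
    return (4 * phi_inv + 1 / np.sqrt(Dh)) * inf_norm(WO.T) * term_q * term_v

rng = np.random.default_rng(0)
N, D, H = 128, 64, 8
WQ_heads = [0.1 * rng.normal(size=(D, D // H)) for _ in range(H)]
WV_heads = [0.1 * rng.normal(size=(D, D // H)) for _ in range(H)]
WO = 0.1 * rng.normal(size=(D, D))
bound = lip_inf_upper_bound(WQ_heads, WV_heads, WO, N)
print(bound)
# Contractive-L2-MHA: divide the layer output by the bound so that Lip_inf <= 1, e.g.
# contractive_out = l2_mha(X, WQ_heads, WV_heads, WO) / bound
```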
The results are shown in Figure 4. The first plot shows the best performing LSTM reaching a test NLL of around 1.0, and the second plot shows the best performing Transformer reaching a slightly improved performance for 3–5 layers of Transformer blocks. We observe instabilities in training for a higher number of layers, requiring careful tuning of the learning rate schedule for stability at the cost of performance, a commonly observed phenomenon in the literature of deep Transformer architectures (Bapna et al., 2018; Parisotto et al., 2020). The third plot shows results for the Transformer with DP-MHA replaced with L2-MHA but without tying W^Q and W^K, and we observe a very similar test performance. The fourth plot shows the change when we further tie the query and key weights (making W^Q = W^K); we see that there is a small degradation in performance. Here the number of trainable parameters has been reduced, but in Appendix L we show that matching parameter count does not help performance, suggesting that the reduction in performance when tying queries and keys is not solely due to having fewer parameters. We note that performance saturates at around 5 layers for each Transformer model so far. On the rightmost plot we show results when further dividing self-attention in each block by the upper bound on Lip∞(F), to obtain invertible self-attention. This does give reduced performance for the same number of layers, but we can attain similar performance with more layers, no longer saturating at 5 layers.

Thus we conclude the following. (1) Replacing the dot-product with the L2 distance incurs hardly any loss in expressiveness. (2) Tying the query and key weights to obtain Lipschitz self-attention incurs a small loss in expressiveness.

Number of Layers               |   2   |   4   |   6   |   8   |  10   |  12   |  14   |  16   |  18
Transformer (DP)               | 1.061 | 1.032 | 1.021 | 1.017 | 1.025 |   -   |   -   |   -   |   -
Transformer (L2), W^Q = W^K    | 1.168 | 1.040 | 1.023 | 1.024 | 1.019 | 1.008 | 1.018 | 1.027 | 1.034
Transformer (Contractive-L2)   | 1.246 | 1.135 | 1.103 | 1.079 | 1.072 | 1.060 | 1.039 | 1.029 | 1.031

Table 1. Test NLL for Transformer models trained with fixed learning rate on PTB character level language modelling.

(3) Dividing by the upper bound on Lip∞(F) to obtain invertible self-attention incurs a noticeable loss in expressiveness, but also has a stabilising effect on the optimisation of the Transformer, thus allowing one to compensate for the apparent loss in expressiveness by increasing the number of layers.

5.4. Training Stability of DP-MHA vs L2-MHA

In Figure 5, we compare the output variance of trained L2-MHA against trained DP-MHA, with weights from the one-layer Transformer (L2), W^Q = W^K model and (DP) model used for Figure 4 respectively. We take the same distribution of inputs as used for the numerical invertibility experiment in Section 5.2, and show the histogram of inputs and outputs after flattening the input/output tensors. We see that the range of outputs remains similar to the range of inputs for Lipschitz L2-MHA, whereas for DP-MHA the outputs have a much wider range, because the Jacobian norm is large for DP-MHA at these inputs.

[Figure 5 panels, left to right: Inputs; Outputs: L2-MHA (trained); Outputs: DP-MHA (trained).]

Figure 5. Histogram showing distribution of inputs/outputs of trained L2-MHA and DP-MHA.

In practice, this leads to instabilities in training for DP-MHA, hence requiring careful tuning of the learning rate schedule for training deeper Transformer models: linear warmup and square root decay, as detailed in Appendix H. We investigate the behaviour of the different Transformer models on the above PTB task when using a fixed learning rate.
We observe that DP-MHA fails to train at all beyond 10 layers, whereas both L2-MHA (W^Q = W^K) (i.e. Lipschitz L2-MHA but not contractive) and Contractive-L2-MHA show stable training for up to 18 layers (see Appendix M for the training curves). This was the deepest model we could fit on a single GPU, and we expect to be able to train even deeper models with these two. In Table 1 we show the best test NLL across training for each of the Transformer models. Note that for DP-MHA training becomes unstable beyond 10 layers, so we are only able to provide results up to 10 layers. The generalisation performance of the best model for each setting of self-attention is similar.

6. Conclusion and Discussion

We have shown that the widely used dot-product self-attention is not Lipschitz, and that the proposed L2 self-attention is Lipschitz, by deriving an O(log N − log log N) Lipschitz bound for p = ∞ and an O(sqrt(N)(log N − log log N)) bound for p = 2, where N is the input sequence length. We also provided empirical evidence of the asymptotic tightness of the bound for p = ∞. We demonstrated that Lipschitz-constrained self-attention can be used to formulate invertible self-attention, which we experimentally evaluated on a character-level language modelling task. And finally, we also showed that L2-MHA is more stable during training, allowing the use of a fixed learning rate for stable training of deep architectures.

Our approach to Lipschitz self-attention has been to replace the dot-product kernel with an L2 kernel. An alternative would be to constrain the inputs of self-attention to be bounded; if the input space is compact, e.g. [0, 1]^{N×D}, any continuously differentiable function is Lipschitz, including dot-product self-attention. However, while being simple to implement, this solution has its own difficulties. First, it makes the Lipschitz constant depend on the range of the input, and thus obtaining a tight bound would require non-trivial mathematical work. We stress that a guarantee that the function is Lipschitz does not tell us anything about its Lipschitz constant; without a tight Lipschitz bound, the true Lipschitz constant can be very large, at which point it is unhelpful that the function is Lipschitz. Second, since self-attention is typically applied at multiple layers within a model (e.g. Transformer), the input to each self-attention will live in a different compact set that depends on the parameters of the previous layers, complicating the analysis for subsequent layers. A solution is to constrain the inputs of each layer to be in the same compact set, e.g. by passing them through a sigmoid non-linearity. This however can have undesirable side effects such as vanishing gradients when the sigmoids are saturated. Despite these difficulties, this could be a worthwhile alternative route for obtaining Lipschitz self-attention to explore in the future.

Having a provably Lipschitz self-attention module at our disposal makes it possible to use Transformer-based architectures in applications requiring Lipschitz constraints, while enjoying theoretical guarantees. A natural application of Lipschitz self-attention is for residual flows (Behrmann et al., 2019), and for parameterising Neural ODEs (Chen et al., 2018) where a Lipschitz vector field guarantees the existence of a unique solution to the ODE for all times. These models can be used for density estimation and generative modelling of sets. Another interesting direction for future work would be to analyse different variants of self-attention based on kernels other than dot-product and L2, as Tsai et al. (2019) do from an experimental perspective, for which we believe the mathematical tools developed in this paper may aid the analysis.

Acknowledgements

We would like to thank Adam Kosiorek, Arnaud Doucet, Yee Whye Teh, Michalis Titsias, Emilien Dupont and Theophane Weber for helpful discussion and feedback.

References

Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G. S., Davis, A., Dean, J., Devin, M., et al. TensorFlow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467, 2016.

Anil, C., Lucas, J., and Grosse, R. Sorting out Lipschitz function approximation. In International Conference on Machine Learning, pp. 291–301, 2019.

Bapna, A., Chen, M. X., Firat, O., Cao, Y., and Wu, Y. Training deeper neural machine translation models with transparent attention. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 3028–3033, 2018.

Behrmann, J., Grathwohl, W., Chen, R. T. Q., Duvenaud, D., and Jacobsen, J.-H. Invertible residual networks. In Proceedings of the 36th International Conference on Machine Learning, 2019.

Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. Language models are few-shot learners. arXiv preprint arXiv:2005.14165, 2020.

Chen, R. T. Q., Rubanova, Y., Bettencourt, J., and Duvenaud, D. Neural ordinary differential equations. In Advances in Neural Information Processing Systems, pp. 6571–6583, 2018.

Chen, R. T. Q., Behrmann, J., Duvenaud, D., and Jacobsen, J.-H. Residual flows for invertible generative modeling. In Advances in Neural Information Processing Systems, 2019.

Cisse, M., Bojanowski, P., Grave, E., Dauphin, Y., and Usunier, N. Parseval networks: Improving robustness to adversarial examples. In Proceedings of the 34th International Conference on Machine Learning, pp. 854–863, 2017.

Corless, R. M., Gonnet, G. H., Hare, D. E., Jeffrey, D. J., and Knuth, D. E. On the Lambert W function. Advances in Computational Mathematics, 5(1):329–359, 1996.

Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics, pp. 4171–4186, 2019.

Fazlyab, M., Robey, A., Hassani, H., Morari, M., and Pappas, G. Efficient and accurate estimation of Lipschitz constants for deep neural networks. In Advances in Neural Information Processing Systems, pp. 11423–11434, 2019.

Federer, H. Geometric Measure Theory. Classics in Mathematics. Springer Berlin Heidelberg, 1969. ISBN 9783642620102.

Glorot, X. and Bengio, Y. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp. 249–256, 2010.

Grathwohl, W., Chen, R. T. Q., Bettencourt, J., Sutskever, I., and Duvenaud, D. FFJORD: Free-form continuous dynamics for scalable reversible generative models. In International Conference on Learning Representations, 2019.

Huang, C.-Z. A., Vaswani, A., Uszkoreit, J., Simon, I., Hawthorne, C., Shazeer, N., Dai, A. M., Hoffman, M. D., Dinculescu, M., and Eck, D. Music Transformer. In International Conference on Learning Representations, 2019.

Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations, 2015.

Kingma, D. P. and Dhariwal, P. Glow: Generative flow with invertible 1 × 1 convolutions. In Advances in Neural Information Processing Systems, pp. 10215–10224, 2018.

Latorre, F., Rolland, P., and Cevher, V. Lipschitz constant estimation of neural networks via sparse polynomial optimization. In International Conference on Learning Representations, 2020.

Marcus, M. P., Marcinkiewicz, M. A., and Santorini, B. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313–330, 1993.

Mises, R. and Pollaczek-Geiringer, H. Praktische Verfahren der Gleichungsauflösung. ZAMM - Journal of Applied Mathematics and Mechanics / Zeitschrift für Angewandte Mathematik und Mechanik, 9(2):152–164, 1929.

Miyato, T., Kataoka, T., Koyama, M., and Yoshida, Y. Spectral normalization for Generative Adversarial Networks. In International Conference on Learning Representations, 2018.

Parisotto, E., Song, H. F., Rae, J. W., Pascanu, R., Gulcehre, C., Jayakumar, S. M., Jaderberg, M., Kaufman, R. L., Clark, A., Noury, S., Botvinick, M. M., Heess, N., and Hadsell, R. Stabilizing Transformers for reinforcement learning. In International Conference on Machine Learning, 2020.

Parmar, N., Ramachandran, P., Vaswani, A., Bello, I., Levskaya, A., and Shlens, J. Stand-alone self-attention in vision models. In Advances in Neural Information Processing Systems, pp. 68–80, 2019.

Peyré, G. and Cuturi, M. Computational optimal transport. Foundations and Trends in Machine Learning, 11(5-6):355–607, 2019.

Sokolić, J., Giryes, R., Sapiro, G., and Rodrigues, M. R. Robust large margin deep neural networks. IEEE Transactions on Signal Processing, 65(16):4265–4280, 2017.

Tsai, Y.-H. H., Bai, S., Yamada, M., Morency, L.-P., and Salakhutdinov, R. Transformer dissection: An unified understanding for Transformer's attention via the lens of kernel. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, pp. 4344–4353, 2019.

Tsuzuku, Y., Sato, I., and Sugiyama, M. Lipschitz-margin training: Scalable certification of perturbation invariance for deep neural networks. In Advances in Neural Information Processing Systems, pp. 6541–6550, 2018.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998–6008, 2017.

Virmaux, A. and Scaman, K. Lipschitz regularity of deep neural networks: analysis and efficient estimation. In Advances in Neural Information Processing Systems, pp. 3835–3844, 2018.

Vuckovic, J., Baratin, A., and Tachet des Combes, R. A mathematical theory of attention. arXiv preprint arXiv:2007.02876, 2020.

Wang, Q., Li, B., Xiao, T., Zhu, J., Li, C., Wong, D. F., and Chao, L. S. Learning deep transformer models for machine translation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 1810–1822, 2019.

Wang, X., Girshick, R., Gupta, A., and He, K. Non-local neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7794–7803, 2018.

Xiong, R., Yang, Y., He, D., Zheng, K., Zheng, S., Xing, C., Zhang, H., Lan, Y., Wang, L., and Liu, T. On layer normalization in the transformer architecture. In International Conference on Machine Learning, pp. 10524–10533. PMLR, 2020.

Zhang, H., Goodfellow, I., Metaxas, D., and Odena, A. Self-attention Generative Adversarial Networks. In Proceedings of the 36th International Conference on Machine Learning, pp. 7354–7363, 2019.

Appendix

A. Useful Identities for deriving Jacobian expressions

In this section, we list some useful identities for deriving the Jacobians of the expressions in the paper.

Suppose λ is a scalar, u, v, x ∈ R^{n×1} are column vectors, and f(u) is a vector-valued function. We use the standard convention that for a ∈ R^m, b ∈ R^n, we have ∂a/∂b ∈ R^{m×n}. Then we have the following identities:

• ∂[λu]/∂x = λ ∂u/∂x + u ∂λ/∂x

• ∂f(u)/∂x = (∂f(u)/∂u)(∂u/∂x)

• ∂[u^T v]/∂x = u^T ∂v/∂x + v^T ∂u/∂x

Note ∂λ/∂x is a row vector, so u ∂λ/∂x is a matrix. The Jacobian of the softmax is also well-known. Suppose v = softmax(u) ∈ R^{n×1}. Then

    ∂v/∂u = diag(v) − v v^T,

the n × n matrix with entries v_i(1 − v_i) on the diagonal and −v_i v_j off the diagonal.
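As a quick sanity check of this identity (our own snippet, not part of the paper), one can compare diag(v) − v vᵀ against a finite-difference Jacobian of the softmax:

```python
import numpy as np

def softmax(u):
    e = np.exp(u - u.max())
    return e / e.sum()

u = np.random.default_rng(0).normal(size=5)
v = softmax(u)
J_closed = np.diag(v) - np.outer(v, v)            # diag(v) - v v^T

eps = 1e-6
J_fd = np.column_stack([(softmax(u + eps * np.eye(5)[j]) - softmax(u - eps * np.eye(5)[j])) / (2 * eps)
                        for j in range(5)])
print(np.abs(J_closed - J_fd).max())              # agreement up to finite-difference error
```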

B. Power Iteration

Although ‖W‖_∞ can be computed efficiently in O(nm) time for W ∈ R^{m×n}, naively computing ‖W‖_2 = σ_max(W) := sqrt(λ_max(W^T W)) requires O(n^3) operations. (By λ_max(A) we denote the greatest eigenvalue of a symmetric matrix A.) We can however obtain an underestimate σ̃(W) via power iteration:

    b_{k+1} = W^T W b_k / ‖W^T W b_k‖_2,    σ̃_k(W) = sqrt( b_k^T W^T W b_k / b_k^T b_k ),    (9)

with each iteration taking O(n^2) time. Then using K ≪ n iterations gives us an underestimate σ̃_K in O(Kn^2) time. Since this is an underestimate, the resulting approximation to the Lipschitz constant of the linear map will not be an upper bound. However the number of power iterations is usually chosen so that σ̃ is accurate enough — K = 5 is shown to be sufficient in the context of fully connected networks or convolutions considered by Behrmann et al. (2019). The iteration will converge if W^T W has an eigenvalue that is strictly greater in magnitude than its other eigenvalues, and the starting vector b_0 has a nonzero component in the direction of an eigenvector associated with the dominant eigenvalue. This happens with probability 1 if b_0 is chosen at random, and the convergence is geometric with ratio |λ_2/λ_max| where λ_2 is the eigenvalue with second largest magnitude (Mises & Pollaczek-Geiringer, 1929).
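A minimal NumPy sketch of this estimator (ours, not the paper's implementation); the matrix and iteration count are arbitrary, and the result underestimates σ_max as discussed above.

```python
import numpy as np

def power_iteration_sigma_max(W, n_iter=5, seed=0):
    """Estimate sigma_max(W) via power iteration on W^T W (Equation (9))."""
    rng = np.random.default_rng(seed)
    b = rng.normal(size=W.shape[1])
    for _ in range(n_iter):
        Wb = W.T @ (W @ b)
        b = Wb / np.linalg.norm(Wb)
    return np.sqrt(b @ W.T @ W @ b / (b @ b))

W = np.random.default_rng(1).normal(size=(50, 30))
print(power_iteration_sigma_max(W))                  # underestimate of sigma_max
print(np.linalg.svd(W, compute_uv=False)[0])         # exact largest singular value
```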

C. Proof of Theorem 3.1 for General D

Theorem 3.1. DP-MHA is not Lipschitz for any vector p-norm ‖·‖_p with p ∈ [1, ∞].

Proof. The mapping f can be written as

    f(X) = PX = softmax(X A^T X^T) X = [f_1(X)^T; ...; f_N(X)^T] ∈ R^{N×D},    (10)

where A = W^K W^{Q T} / sqrt(D/H) ∈ R^{D×D} and f_i(X) = Σ_{j=1}^N P_ij x_j with P_i: = softmax(X A x_i)^T. Hence f can be interpreted as a map of each x_i to a point in the convex hull of x_1, ..., x_N. Since f is a map from R^{N×D} to R^{N×D}, its Jacobian is

    J_f = [J_11 ... J_1N; ...; J_N1 ... J_NN] ∈ R^{ND×ND},    (11)

where J_ij = ∂f_i(X)/∂x_j ∈ R^{D×D}. By taking partial derivatives we can show that J_ij = X^T P^(i) [E_ji X A^T + X A δ_ij] + P_ij I, where E_ij ∈ R^{N×N} is a binary matrix with zeros everywhere except the (i, j)th entry, δ_ij is the Kronecker delta, and P^(i) := diag(P_i:) − P_i:^T P_i:. So for i = j:

    J_ii = X^T P^(i) E_ii X A^T + X^T P^(i) X A + P_ii I
         = P_ii (x_i − Σ_k P_ik x_k) x_i^T A^T + X^T P^(i) X A + P_ii I.    (12)

For the last equality, note E_ii X has all rows equal to zero except for the ith row given by x_i^T. We can then verify that X^T P^(i) E_ii X simplifies to P_ii (x_i − Σ_k P_ik x_k) x_i^T.

For vector p-norms, ‖J_f‖_p is bounded if and only if its entries are bounded, by definition of the operator norm. The entries of X^T P^(i) X A are bounded for arbitrary A only if the entries of X^T P^(i) X are bounded. So let us investigate the entries of this D × D matrix. Writing out each term of the matrix, we observe that it is in fact a covariance matrix of a discrete distribution. Specifically:

    [X^T P^(i) X]_{lm} = Σ_k P_ik x_kl x_km − (Σ_k P_ik x_kl)(Σ_k P_ik x_km) = Cov(X_l, X_m),    (13)

where X is a discrete distribution with support at the inputs {x_1, ..., x_N} and probability mass function given by their softmax probabilities P(X = x_j) = P_ij. A consequence of this interpretation is that P^(i) is positive semi-definite (PSD) since for D = 1, Equation (13) becomes X^T P^(i) X = Var(X) ≥ 0, with equality if and only if the x_j are all equal.

We use this observation to show that the terms of J_ii are unbounded, and so DP-MHA is not Lipschitz. Consider the case x_i = 0. Then P_i: = softmax(X A x_i)^T = (1/N) 1^T, i.e. we have uniform attention regardless of x_{≠i}. The first term of J_ii in Equation (12) disappears since x_i = 0, and the last term becomes (1/N) I. For the second term, the entries [X^T P^(i) X]_{ll} = Var(X_l) are unbounded since the latter is equal to the sample variance of x_{1l}, ..., x_{Nl}, which can be arbitrarily large.

Note that we have shown that single-head dot-product self-attention (H = 1) is not Lipschitz, but it is clear that this implies multihead self-attention DP-MHA is also not Lipschitz, since the output of multihead attention is a linear combination of the outputs of each head.

D. Bias term in DP Self-Attention

A natural question to ask is whether we can add bias terms b^Q to x_i^T W^Q and b^K to x_j^T W^K to resolve the issue of attention weights P_i: becoming uniform when x_i = 0. The answer is no in general. It can again be shown that J_ii is unbounded when x_i is chosen such that x_i^T W^Q + b^Q = 0 (such a choice is possible assuming W^Q is full rank; full-rank matrices form a dense set in R^{D×D/H}). Then P_i: = (1/N) 1^T again, and the diagonal entries of X^T P^(i) X are unbounded.

E. Efficient Computation of L2 Self-Attention

Dot-product self-attention only requires a few matrix multiplications to compute the logits (i.e. the inputs to the softmax) between all pairs of inputs, without having to loop over pairs, hence it can be computed efficiently. Similarly, we can show that L2 self-attention can also be computed in an efficient manner. Using the identity ‖a − b‖_2^2 = ‖a‖_2^2 − 2 a^T b + ‖b‖_2^2, we can compute the logits of L2 attention between all pairs via matrix multiplications and computation of row-wise L2 norms, with negligible overhead compared to dot-product self-attention. Specifically, for L2 self-attention we can show that

    P = softmax( −( ‖X W^Q‖²_row 1^T − 2 X W^Q (X W^K)^T + 1 (‖X W^K‖²_row)^T ) / sqrt(D/H) ),    (14)

where ‖A‖²_row applies the squared L2 norm to each row of A, so if A ∈ R^{m×n} then ‖A‖²_row ∈ R^m. In Table 2 we show the wall-clock training times for the Transformer models with different attention functions and a varying number of layers. It is evident that the differences between the models are rather small.
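A small NumPy check of this identity (our own snippet, not from the paper): the vectorised logits of Equation (14) match explicitly computed pairwise squared distances; the shapes and weights are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, Dh = 7, 16, 4
X = rng.normal(size=(N, D))
WQ = rng.normal(size=(D, Dh))
WK = rng.normal(size=(D, Dh))

Q, K = X @ WQ, X @ WK
sq_q = (Q ** 2).sum(axis=1, keepdims=True)          # ||X W^Q||^2_row as a column vector
sq_k = (K ** 2).sum(axis=1, keepdims=True)          # ||X W^K||^2_row as a column vector
logits_vec = -(sq_q - 2 * Q @ K.T + sq_k.T) / np.sqrt(Dh)   # Equation (14), before softmax

logits_loop = np.array([[-np.sum((Q[i] - K[j]) ** 2) / np.sqrt(Dh) for j in range(N)]
                        for i in range(N)])
print(np.abs(logits_vec - logits_loop).max())       # ~1e-12
```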

                              | 1 Layer | 2 Layers | 3 Layers | 4 Layers | 5 Layers
Transformer (DP)              |   37    |    56    |    77    |    92    |   110
Transformer (L2)              |   35    |    56    |    73    |    99    |   115
Transformer, W^Q = W^K (L2)   |   39    |    58    |    79    |    91    |   108
Transformer (Contractive-L2)  |   37    |    60    |    81    |   102    |   127

Table 2. Wall clock training times for one epoch of training (seconds)

F. Proof of Theorem 3.2

Recall the formulation of L2-MHA:

    F : R^{N×D} → R^{N×D}
    F(X) = [f^1(X) W^{V,1}, ..., f^H(X) W^{V,H}] W^O
    f^h(X) = P^h X A_h
    P^h_ij ∝ exp(L^h_ij) := exp( −‖x_i^T W^{Q,h} − x_j^T W^{K,h}‖_2^2 / sqrt(D/H) ),    Σ_j P^h_ij = 1,

where we have that W^{Q,h}, W^{K,h}, W^{V,h} ∈ R^{D×D/H}, W^O ∈ R^{D×D}, P^h ∈ R^{N×N} and A_h := W^{Q,h} W^{Q,h T} / sqrt(D/H) ∈ R^{D×D}, and the softmax is applied to each row of the input matrix. Recall Equation (14):

    P^h = softmax( −( ‖X W^{Q,h}‖²_row 1^T − 2 X W^{Q,h} (X W^{K,h})^T + 1 (‖X W^{K,h}‖²_row)^T ) / sqrt(D/H) ).

F.1. L2 self-attention is not Lipschitz for general W^Q, W^K

Let us first look at the case of H = 1 and suppress the index h to reduce clutter. Consider the map f̃(X) := PX, so f(X) = f̃(X)A. We need f̃ to be Lipschitz for f and hence F to be Lipschitz. Note that P is defined as:

    P_ij ∝ exp(L_ij) := exp( −‖x_i^T W^Q − x_j^T W^K‖_2^2 / sqrt(D/H) )

and the normalisation constant satisfies Σ_j P_ij = 1, for P ∈ R^{N×N}, X ∈ R^{N×D}. For L2 self-attention, we may take partial derivatives and use the chain rule to show that the Jacobian of f̃ is:

    J_f̃ = [J̃_11 ... J̃_1N; ...; J̃_N1 ... J̃_NN] ∈ R^{ND×ND}    (15)

with

    J̃_ij = X^T P^(i) ∂L_i:/∂x_j + P_ij I ∈ R^{D×D}    (16)

where

    ∂L_i:/∂x_j = (2/sqrt(D/H)) [ (X W^K − 1 x_i^T W^Q) W^{Q T} δ_ij + (E_ji X W^Q − E_jj X W^K) W^{K T} ]    (17)

and

    P^(i) := diag(P_i:) − P_i:^T P_i: ∈ R^{N×N},   i.e. the N × N matrix with entries P_ik(1 − P_ik) on the diagonal and −P_ik P_il off the diagonal,

    P_ij = exp( −‖x_i^T W^Q − x_j^T W^K‖_2^2 ) / Σ_k exp( −‖x_i^T W^Q − x_k^T W^K‖_2^2 ).

Recall that E_ji ∈ R^{N×N} is a binary matrix with zeros everywhere except the (j, i)th entry. Hence E_ji X has all rows equal to zero except for the jth row given by x_i^T. We can then verify:

    X^T P^(i) E_ji X = P_ij (x_j − Σ_k P_ik x_k) x_i^T.    (18)

Also note P^(i) is symmetric, and each row/column sums to 0, i.e. P^(i) 1 = 0 and 1^T P^(i) = 0. Hence we may simplify the Jacobian terms as follows:

    J̃_ii = (2/sqrt(D/H)) [ X^T P^(i) (X W^K − 1 x_i^T W^Q) W^{Q T} + X^T P^(i) E_ii X (W^Q − W^K) W^{K T} ] + P_ii I
         = (2/sqrt(D/H)) [ X^T P^(i) (X W^K − 1 x_i^T W^Q) W^{Q T} + P_ii (x_i − Σ_k P_ik x_k) x_i^T (W^Q − W^K) W^{K T} ] + P_ii I
         = (2/sqrt(D/H)) [ X^T P^(i) X W^K W^{Q T} + P_ii (x_i − Σ_k P_ik x_k) x_i^T (W^Q − W^K) W^{K T} ] + P_ii I,    (19)

and for i ≠ j:

    J̃_ij = (2/sqrt(D/H)) X^T P^(i) (E_ji X W^Q − E_jj X W^K) W^{K T} + P_ij I
         = (2/sqrt(D/H)) P_ij (x_j − Σ_k P_ik x_k)(x_i^T W^Q − x_j^T W^K) W^{K T} + P_ij I.    (20)

We are now ready to show that f̃ is not Lipschitz for general W^Q, W^K:

Lemma F.1. If W^K ∈ R^{D×D/H} is full rank (i.e. full column rank), and W^Q ≠ W^K, then J̃_ij has terms that are unbounded for i ≠ j, hence f̃ is not Lipschitz.

> D D ˜ K P > Q > K H × H Proof. Let us investigate the expression Kij := PijW (xj − k Pikxk)(xi W − xj W ) ∈ R for i 6= j, which is related to J˜ij as follows by Equation (20):

! K> 2 K> W J˜ij = K˜ij + PijI W . pD/H

K It suffices to show that K˜ij is unbounded to show that J˜ij is unbounded, since W is full rank and Pij ∈ [0, 1]. > > Q > K Let yj = xi W − xj W . Then we have:

X Q> K> X Q> K> yj − Pikyk = W xi − W xj − Pik(W xi − W xk) k k Q> K> Q> X K> = W xi − W xj − (W xi − PikW xk) k K> X = −W (xj − Pikxk). k

˜ P > D/H K Q K Hence Kij = −Pij(yj − k Pikyk)yj . Note yi can take an arbitrary value in R , since W 6= W and W is full-rank. The Lipschitz Constant of Self-Attention

K For all j 6= i, let us choose xj such that yj = −yi. This is possible for any value of yi since W is full-rank. Note 2 2 exp(−kyj k2) 1 yj = −yi and not yi. We then have that kyjk2 is equal for all j, hence Pij := P 2 = for all j. Then for k exp(−kykk2) N i 6= j, K˜ij simplifies to

1  1  2N − 2 K˜ = − −y − (N − 2)(−y ) (−y )> = − y y> ij N i N i i N 2 i i

D/H whose entries are unbounded since yi can be any vector in R (note we assume N ≥ 2 for self-attention to be well-defined, hence 2N − 2 6= 0).

The intuition for this result is as follows: a reason for DP-MHA not being Lipschitz is that for x_i = 0, the attention weights P_ij become uniform regardless of the values of x_j for j ≠ i. A similar issue arises for L2-MHA with W^Q ≠ W^K and full-rank W^K, as shown above: given any x_i, we can choose x_j such that the P_ij become uniform.

F.2. L2 self-attention is Lipschitz for W^Q = W^K

Hence we impose the restriction that W^K = W^Q. With this assumption we have

    P_ij ∝ exp( −‖(x_i − x_j)^T √A‖_2^2 )    (21)

where A = W^Q W^{Q T} / sqrt(D/H) ∈ R^{D×D} and √A is chosen such that A = √A √A^T, in particular √A := W^Q / (D/H)^{1/4}. The terms in the Jacobian of f̃ simplify to:

    J̃_ii = 2 X^T P^(i) X A + P_ii I    (note P^(i) 1 = 0),    (22)
    J̃_ij = 2 P_ij (x_j − Σ_k P_ik x_k)(x_i − x_j)^T A + P_ij I    for i ≠ j.    (23)

Let the Jacobian of f(X) be:

    J_f = [J_11 ... J_1N; ...; J_N1 ... J_NN] ∈ R^{ND×ND}.    (24)

Since f(X) = f̃(X) A, and by the chain rule ∂[A^T f̃_i(X)]/∂x_j = A^T ∂f̃_i(X)/∂x_j = A ∂f̃_i(X)/∂x_j (by symmetry of A), we have that J_ij = A J̃_ij. Hence

    J_ii = 2 A X^T P^(i) X A + P_ii A    (note P^(i) 1 = 0),    (25)
    J_ij = 2 P_ij A (x_j − Σ_k P_ik x_k)(x_i − x_j)^T A + P_ij A    for i ≠ j.    (26)

Noting Lip_p(f) = sup_X ‖J_f(X)‖_p, we would like to upper bound ‖J_f‖_p.

F.2.1. Upper bound on Lip∞(F) for L2-MHA

Consider the choice p = ∞, where kJf k∞ is the maximum absolute row sum of Jf . A key observation is that if we can bound the ∞-norm of the Jacobian of fi, a single output of f, (i.e. a single block row k[Ji1, ..., JiN ]k∞ of Jf ) then this is also a bound on kJf k∞ due to permutation equivariance of self-attention; all block rows have the same maximal k · k∞ when each is optimised over the input X. Using this, we can prove that kJf k∞ admits an upper bound that is O(log N − log log N). Below we state and prove lemmas that lead to the proof of this upper bound. √ √ √ > > (i) First we analyse the term√ A X P X A, that appears in the first term of Jii. Note that for Y := X A, so that the > > rows of Y are yi := xi A, we have √ √ > > (i) > (i) A X P X A = Y P Y = Cov(Y) (27) The Lipschitz Constant of Self-Attention

where \mathbb{P}(Y = y_j) = P_{ij} = \exp(-\|y_j - y_i\|_2^2) / \sum_k \exp(-\|y_k - y_i\|_2^2). The last equality uses the observation in Equation (7). The central inequality used throughout the proof of the main theorem is the following:

Lemma F.2. \mathrm{Tr}(\mathrm{Cov}(Y)) = \sum_j P_{ij} \|y_j - \sum_k P_{ik} y_k\|_2^2 \leq \sum_j P_{ij} \|y_j - y_i\|_2^2 \leq \phi^{-1}(N - 1), where \phi(c) = c \exp(c + 1) is a one-dimensional invertible function on \mathbb{R}_{\geq 0}.

Proof. The first equality holds since \mathrm{Tr}(\mathrm{Cov}(Y)) = \sum_j \mathrm{Cov}(Y)_{jj} = \sum_j \mathrm{Var}(Y_j) = \sum_j \mathbb{E}[(Y_j - \mathbb{E}[Y_j])^2]. The next inequality holds since \mathrm{Var}(Y_j) = \mathrm{Var}(\bar{Y}_j) = \mathbb{E}[\bar{Y}_j^2] - \mathbb{E}[\bar{Y}_j]^2 \leq \mathbb{E}[\bar{Y}_j^2], where \bar{Y} = Y - y_i. The final inequality can be proved as follows. We would like to bound

\sum_j P_{ij} \|y_j - y_i\|_2^2 = \frac{\sum_j \|y_j - y_i\|_2^2 \exp(-\|y_j - y_i\|_2^2)}{\sum_k \exp(-\|y_k - y_i\|_2^2)} = \frac{\sum_j z_j^2 \exp(-z_j^2)}{\sum_k \exp(-z_k^2)} \qquad (28)

where z_j := \|y_j - y_i\|_2 (hence z_i = 0). Define:

g(z) := \frac{\sum_j z_j^2 \exp(-z_j^2)}{\sum_k \exp(-z_k^2)} = \frac{\sum_{j \neq i} z_j^2 \exp(-z_j^2)}{1 + \sum_{k \neq i} \exp(-z_k^2)}. \qquad (29)

First note that as z_j \to \infty, \exp(-z_j^2) \to 0 exponentially fast, causing the product z_j^2 \exp(-z_j^2) \to 0. Hence we expect the above quantity to be bounded and to attain its maximum.

Let h(z_j) := \exp(-z_j^2) for notational conciseness, and note h(z_j) > 0. By taking partial derivatives with the chain rule, we have that for j \neq i

\frac{\partial g(z)}{\partial z_j} = \frac{2 z_j h(z_j)}{\big(\sum_k h(z_k)\big)^2} \left[ (1 - z_j^2) \sum_k h(z_k) + \sum_k h(z_k) z_k^2 \right]. \qquad (30)

Hence the derivative is 0 if and only if z_j = 0 or (1 - z_j^2) \sum_k h(z_k) + \sum_k h(z_k) z_k^2 = 0, the latter being equivalent to z_j^2 = 1 + \frac{\sum_k h(z_k) z_k^2}{\sum_k h(z_k)} = 1 + g(z). Hence at the maximum, the non-zero values among \{z_j\}_{j=1}^N must be equal to one another. It is clear now that the maximum value c of g is attained when z_j^2 = 1 + c for j \neq i (and recall z_i = 0). So h(z_j) = \exp(-1 - c) for j \neq i. Substituting this into g(z) and rearranging, we obtain c \exp(c + 1) = N - 1. Note \phi(x) := x \exp(x + 1) is increasing for x > 0, hence c = \phi^{-1}(N - 1).

Note \phi(\log N) = (\log N) \exp(\log N + 1) \geq N \log N \geq N - 1 for N \geq 3. Since \phi is increasing, we have \phi^{-1}(N - 1) \leq \log N for N \geq 3. In fact, it is known that \phi^{-1}(N - 1) = O(\log N - \log \log N) (Corless et al., 1996).
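Since c \exp(c + 1) = N - 1 is equivalent to c e^c = (N-1)/e, the quantity \phi^{-1}(N - 1) can be computed with the principal branch of the Lambert W function studied in Corless et al. (1996). A small Python sketch, assuming SciPy is available:

```python
import numpy as np
from scipy.special import lambertw

def phi_inv(n):
    """Solve phi(c) = c * exp(c + 1) = n for c >= 0, i.e. c = W_0(n / e)."""
    return float(np.real(lambertw(n / np.e, k=0)))

for N in (3, 10, 100, 10_000):
    c = phi_inv(N - 1)
    # phi^{-1}(N - 1) <= log N for N >= 3, and grows like log N - log log N
    print(N, round(c, 4), round(np.log(N), 4), c <= np.log(N))
```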

Note the A term in f(X) = \tilde{f}(X) A allows us to use the above inequality, since Y^\top P^{(i)} Y = \mathrm{Cov}(Y) now appears in the terms of J_f:

J_{ii} = 2 \sqrt{A} \left[ Y^\top P^{(i)} Y \right] \sqrt{A}^\top + P_{ii} A, \qquad (31)
J_{ij} = 2 \sqrt{A} P_{ij} \Big(y_j - \sum_k P_{ik} y_k\Big)(y_i - y_j)^\top \sqrt{A}^\top + P_{ij} A \quad \text{for } i \neq j. \qquad (32)

Using the inequalities \|BC\| \leq \|B\|\|C\|, \|B + C\| \leq \|B\| + \|C\| and \|[A_1, \ldots, A_N]\| \leq \sum_i \|A_i\|, we have:

\|[J_{i1}, \ldots, J_{iN}]\|_\infty
\leq \|J_{ii}\|_\infty + \sum_{j \neq i} \|J_{ij}\|_\infty
\leq 2 \|\sqrt{A}\|_\infty \|Y^\top P^{(i)} Y\|_\infty \|\sqrt{A}^\top\|_\infty + P_{ii} \|A\|_\infty + 2 \sum_{j \neq i} \|\sqrt{A}\|_\infty \Big\|P_{ij} \Big(y_j - \sum_k P_{ik} y_k\Big)(y_i - y_j)^\top\Big\|_\infty \|\sqrt{A}^\top\|_\infty + \sum_{j \neq i} P_{ij} \|A\|_\infty
= 2 \|\sqrt{A}\|_\infty \|\sqrt{A}^\top\|_\infty \left( \|Y^\top P^{(i)} Y\|_\infty + \sum_{j \neq i} \Big\|P_{ij} \Big(y_j - \sum_k P_{ik} y_k\Big)(y_i - y_j)^\top\Big\|_\infty \right) + \|A\|_\infty
= 2 \frac{\|W^Q\|_\infty \|W^{Q\top}\|_\infty}{\sqrt{D/H}} \left( \|Y^\top P^{(i)} Y\|_\infty + \sum_j \Big\|P_{ij} \Big(y_j - \sum_k P_{ik} y_k\Big)(y_i - y_j)^\top\Big\|_\infty \right) + \frac{\|W^Q W^{Q\top}\|_\infty}{\sqrt{D/H}}.

For the first equality, note that \sum_j P_{ij} = 1. For the second equality, note that the summand for j = i is 0 because the term y_i - y_j = 0. Each of the terms in the brackets is bounded by the following lemmas:

Lemma F.3. \|Y^\top P^{(i)} Y\|_\infty \leq \phi^{-1}(N - 1)\sqrt{D/H} (\phi defined as in Lemma F.2).

Proof. Recall that Y^\top P^{(i)} Y = \mathrm{Cov}(Y). Let \sigma(Y_m) denote the standard deviation of Y_m. Then [\mathrm{Cov}(Y)]_{lm} \leq \sigma(Y_l)\sigma(Y_m). Hence

\|\mathrm{Cov}(Y)\|_\infty = \max_l \sum_m \big|[\mathrm{Cov}(Y)]_{lm}\big| \leq \max_l \sigma(Y_l) \sum_m \sigma(Y_m) \leq \sqrt{\frac{D}{H}} \sum_m \sigma^2(Y_m) = \sqrt{\frac{D}{H}} \mathrm{Tr}(\mathrm{Cov}(Y)) \leq \sqrt{\frac{D}{H}} \phi^{-1}(N - 1),

since \sum_m \sigma(Y_m) \leq \sqrt{D/H} \sqrt{\sum_m \sigma^2(Y_m)} (by e.g. using the Cauchy–Schwarz inequality on [\sigma(Y_1), \ldots, \sigma(Y_{D/H})] and \mathbf{1}) and \max_l \sigma(Y_l) \leq \sqrt{\sum_m \sigma^2(Y_m)}, and the last inequality is from Lemma F.2.

Lemma F.4. \sum_j \|P_{ij} (y_j - \sum_k P_{ik} y_k)(y_i - y_j)^\top\|_\infty \leq \phi^{-1}(N - 1)\sqrt{D/H}.

Proof. Note \|uv^\top\|_\infty = \|u\|_\infty \|v\|_1 for real vectors u, v. Hence

\sum_j \Big\|P_{ij} \Big(y_j - \sum_k P_{ik} y_k\Big)(y_i - y_j)^\top\Big\|_\infty = \sum_j P_{ij} \Big\|y_j - \sum_k P_{ik} y_k\Big\|_\infty \|y_i - y_j\|_1 = a^\top b \leq \|a\|_2 \|b\|_2,

where a_j = \sqrt{P_{ij}} \|y_j - \sum_k P_{ik} y_k\|_\infty and b_j = \sqrt{P_{ij}} \|y_i - y_j\|_1.

Note a_j \leq c_j := \sqrt{P_{ij}} \|y_j - \sum_k P_{ik} y_k\|_2, since \|u\|_\infty \leq \|u\|_2 for a vector u. Hence \|a\|_2 \leq \|c\|_2.

Also b_j \leq \sqrt{\frac{D}{H}} d_j := \sqrt{\frac{D}{H}} \sqrt{P_{ij}} \|y_i - y_j\|_2, since \|u\|_1 \leq \sqrt{D/H} \|u\|_2 for u \in \mathbb{R}^{D/H} (e.g. by the Cauchy–Schwarz inequality on [|u_1|, \ldots, |u_{D/H}|] and \mathbf{1}). Hence \|b\|_2 \leq \sqrt{D/H} \|d\|_2.

Note \|c\|_2^2 = \sum_j P_{ij} \|y_j - \sum_k P_{ik} y_k\|_2^2 = \mathrm{Tr}(\mathrm{Cov}(Y)) \leq \phi^{-1}(N - 1) from Lemma F.2, and \|d\|_2^2 = \sum_j P_{ij} \|y_i - y_j\|_2^2 \leq \phi^{-1}(N - 1), also from Lemma F.2. Hence \|a\|_2 \|b\|_2 \leq \sqrt{D/H} \|c\|_2 \|d\|_2 \leq \sqrt{D/H} \phi^{-1}(N - 1).

Putting the above lemmas together, with the observation that \sup_X \|J_f(X)\|_\infty = \sup_X \|[J_{i1}(X), \ldots, J_{iN}(X)]\|_\infty by permutation invariance of \|J_f\|_\infty (since f is permutation equivariant and \|\cdot\|_\infty is the maximum absolute row sum), we have

\|J_f\|_\infty \leq 4 \|W^Q\|_\infty \|W^{Q\top}\|_\infty \phi^{-1}(N - 1) + \frac{\|W^Q W^{Q\top}\|_\infty}{\sqrt{D/H}}
\leq \|W^Q\|_\infty \|W^{Q\top}\|_\infty \left( 4 \phi^{-1}(N - 1) + \frac{1}{\sqrt{D/H}} \right) \qquad (33)
\leq \|W^Q\|_\infty \|W^{Q\top}\|_\infty \left( 4 \log N + \frac{1}{\sqrt{D/H}} \right),

where the last inequality holds for N \geq 3. The full multihead attention map that combines the heads f^h(X) is:

F : X \mapsto \left[ f^1(X) W^{V,1}, \ldots, f^H(X) W^{V,H} \right] W^O = g(X) W^V W^O

where g : X \mapsto [f^1(X), \ldots, f^H(X)], W^O \in \mathbb{R}^{D \times D} and

W^V = \begin{bmatrix} W^{V,1} & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & W^{V,H} \end{bmatrix} \in \mathbb{R}^{DH \times D}.

Note the Jacobian J_g is a block matrix whose block rows are the J_{f^h}, hence \|J_g\|_\infty = \max_h \|J_{f^h}\|_\infty, and similarly \|W^{V\top}\|_\infty = \max_h \|W^{V,h\top}\|_\infty. Hence we have

\mathrm{Lip}_\infty(F) \leq \max_h \|J_{f^h}\|_\infty \, \max_h \|W^{V,h\top}\|_\infty \, \|W^{O\top}\|_\infty.

Combining this with Inequality (33), we have:

\mathrm{Lip}_\infty(F) \leq \left( 4 \phi^{-1}(N - 1) + \frac{1}{\sqrt{D/H}} \right) \max_h \|W^{Q,h}\|_\infty \|W^{Q,h\top}\|_\infty \, \max_h \|W^{V,h\top}\|_\infty \, \|W^{O\top}\|_\infty.
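This bound is straightforward to evaluate numerically for a given set of weights. Below is a sketch (our own helper, not the paper's code), assuming the head weights W^{Q,h}, W^{V,h} are of shape (D, D/H), W^O is of shape (D, D), and SciPy is available for \phi^{-1}:

```python
import numpy as np
from scipy.special import lambertw

def lip_inf_bound(WQ_heads, WV_heads, WO, N):
    """Evaluate the upper bound on Lip_inf(F) for L2-MHA displayed above."""
    d = WQ_heads[0].shape[1]                            # D/H
    inf_norm = lambda M: np.abs(M).sum(axis=1).max()    # maximum absolute row sum
    phi_inv = float(np.real(lambertw((N - 1) / np.e)))  # phi^{-1}(N - 1)
    q_term = max(inf_norm(Wq) * inf_norm(Wq.T) for Wq in WQ_heads)
    v_term = max(inf_norm(Wv.T) for Wv in WV_heads)
    return (4 * phi_inv + 1 / np.sqrt(d)) * q_term * v_term * inf_norm(WO.T)
```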

F.2.2. UPPER BOUND ON Lip_2(F) FOR L2-MHA

For p = 2, we use the following lemma:

Lemma F.5. Let A be a block matrix with block rows A_1, \ldots, A_N. Then \|A\|_2 \leq \sqrt{\sum_i \|A_i\|_2^2}, and equality holds if and only if the first right singular vectors of the A_i align.

Proof.

\|A\|_2^2 = \left\| \begin{bmatrix} A_1 \\ \vdots \\ A_N \end{bmatrix} \right\|_2^2 = \sup_{\|x\|_2 = 1} \left\| \begin{bmatrix} A_1 \\ \vdots \\ A_N \end{bmatrix} x \right\|_2^2 = \sup_{\|x\|_2 = 1} \sum_i \|A_i x\|_2^2 \leq \sum_i \sup_{\|x\|_2 = 1} \|A_i x\|_2^2 = \sum_i \|A_i\|_2^2.

Note that equality holds if and only if the first right singular vectors of the A_i align.

Hence a bound on the spectral norm of each block row of J_f can give us an O(\sqrt{N}) bound on \|J_f\|_2, which may be loose, and it remains an open question whether this bound can be tightened.

To bound the \|\cdot\|_2 norm of each block row of J_f, we use the following lemmas:

Lemma F.6. \|Y^\top P^{(i)} Y\|_2 \leq \phi^{-1}(N - 1).

Proof. \|Y^\top P^{(i)} Y\|_2 = \|\mathrm{Cov}(Y)\|_2 = \lambda_{\max}(\mathrm{Cov}(Y)) \leq \mathrm{Tr}(\mathrm{Cov}(Y)) \leq \phi^{-1}(N - 1), where the second equality holds by symmetry of \mathrm{Cov}(Y), and the first inequality holds since \mathrm{Cov}(Y) is positive semi-definite, so all its eigenvalues are non-negative and the maximal eigenvalue is bounded by the sum of the eigenvalues, which equals its trace. The final inequality is from Lemma F.2.

Lemma F.7. \sum_j \|P_{ij} (y_j - \sum_k P_{ik} y_k)(y_i - y_j)^\top\|_2 \leq \phi^{-1}(N - 1).

Proof. Noting that \|uv^\top\|_2 = \|u\|_2 \|v\|_2 for real vectors u, v, directly use Cauchy–Schwarz on c and d as in the proof of Lemma F.4.

Again using the inequalities \|BC\| \leq \|B\|\|C\|, \|B + C\| \leq \|B\| + \|C\| and \|[A_1, \ldots, A_N]\| \leq \sum_i \|A_i\|, together with the additional equality \|B^\top\|_2 = \|B\|_2, we have the bound:

\|[J_{i1}, \ldots, J_{iN}]\|_2 \leq 2 \frac{\|W^Q\|_2 \|W^{Q\top}\|_2}{\sqrt{D/H}} \left( \|Y^\top P^{(i)} Y\|_2 + \sum_j \Big\|P_{ij} \Big(y_j - \sum_k P_{ik} y_k\Big)(y_i - y_j)^\top\Big\|_2 \right) + \frac{\|W^Q W^{Q\top}\|_2}{\sqrt{D/H}}
\leq 4 \phi^{-1}(N - 1) \frac{\|W^Q\|_2^2}{\sqrt{D/H}} + \frac{\|W^Q W^{Q\top}\|_2}{\sqrt{D/H}}
\leq \frac{\|W^Q\|_2^2}{\sqrt{D/H}} \left( 4 \phi^{-1}(N - 1) + 1 \right).

Using Lemma F.5, we have that

\|J_f\|_2 \leq \frac{\sqrt{N} \|W^Q\|_2^2}{\sqrt{D/H}} \left( 4 \phi^{-1}(N - 1) + 1 \right) \qquad (34)
\leq \frac{\sqrt{N} \|W^Q\|_2^2}{\sqrt{D/H}} (4 \log N + 1).

To obtain the final result for the full multihead self-attention F, we need a final lemma:

Lemma F.8. Let A be a block matrix with block columns A_1, \ldots, A_N. Then \|A\|_2 \leq \sqrt{\sum_i \|A_i\|_2^2}.

Proof.

\|A\|_2 = \|[A_1, \ldots, A_N]\|_2 = \sup_{\sum_i \|x_i\|_2^2 = 1} \left\| [A_1, \ldots, A_N] \begin{bmatrix} x_1 \\ \vdots \\ x_N \end{bmatrix} \right\|_2 = \sup_{\sum_i \|x_i\|_2^2 = 1} \Big\| \sum_i A_i x_i \Big\|_2
\leq \sup_{\sum_i \|x_i\|_2^2 = 1} \sum_i \|A_i x_i\|_2 = \sup_{\|e_i\|_2 = 1, \, \sum_i \lambda_i^2 = 1} \sum_i \lambda_i \|A_i e_i\|_2 = \sup_{\sum_i \lambda_i^2 = 1} \sum_i \lambda_i \|A_i\|_2 \leq \sqrt{\sum_i \|A_i\|_2^2},

where we use the substitution x_i = \lambda_i e_i, and the last inequality holds by e.g. the Cauchy–Schwarz inequality on [\lambda_1, \ldots, \lambda_N] and [\|A_1\|_2, \ldots, \|A_N\|_2].

Recall that F : X \mapsto \left[ f^1(X) W^{V,1}, \ldots, f^H(X) W^{V,H} \right] W^O.

Since the Jacobian of X \mapsto f^h(X) W^{V,h} satisfies \|J_{f^h W^{V,h}}\|_2 \leq \|J_{f^h}\|_2 \|W^{V,h}\|_2, by Lemma F.8 we have that

\left\| J_{[f^1 W^{V,1}, \ldots, f^H W^{V,H}]} \right\|_2 \leq \sqrt{\sum_h \|J_{f^h}\|_2^2 \|W^{V,h}\|_2^2}

and hence

\mathrm{Lip}_2(F) \leq \left( \sqrt{\sum_h \|J_{f^h}\|_2^2 \|W^{V,h}\|_2^2} \right) \|W^O\|_2. \qquad (35)

Combining this with Inequality (34), we have:

\mathrm{Lip}_2(F) \leq \frac{\sqrt{N}}{\sqrt{D/H}} \left( 4 \phi^{-1}(N - 1) + 1 \right) \sqrt{\sum_h \|W^{Q,h}\|_2^4 \|W^{V,h}\|_2^2} \; \|W^O\|_2.
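Analogously to the \infty-norm case, this combined bound can be evaluated from the weights; the following is a sketch consistent with combining Inequalities (34) and (35) (our own helper, not the paper's code; head weights of shape (D, D/H), W^O of shape (D, D)):

```python
import numpy as np
from scipy.special import lambertw

def lip2_bound(WQ_heads, WV_heads, WO, N):
    """Evaluate the Lip_2 upper bound obtained by combining (34) and (35)."""
    d = WQ_heads[0].shape[1]                            # D/H
    spec = lambda M: np.linalg.norm(M, 2)               # spectral norm
    phi_inv = float(np.real(lambertw((N - 1) / np.e)))  # phi^{-1}(N - 1)
    per_head = sum((spec(Wq) ** 2 * spec(Wv)) ** 2
                   for Wq, Wv in zip(WQ_heads, WV_heads))
    return np.sqrt(N / d) * (4 * phi_inv + 1) * np.sqrt(per_head) * spec(WO)
```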

G. The Case with Masking

Since self-attention is often used with masking, a natural question is how masking affects the derived bounds. In self-attention (for any choice of attention function), masking is implemented as follows: given a set of mask indices M \subset \{1, \ldots, N\} \times \{1, \ldots, N\}, the logits (i.e. the inputs to the softmax) are set to -\infty at the mask indices. That is,

L_{ij} = \begin{cases} \tilde{L}_{ij} & \text{if } (i, j) \notin M \\ -\infty & \text{if } (i, j) \in M \end{cases}

where \tilde{L}_{ij} is the original logit (e.g. for L2 self-attention, \tilde{L}_{ij} = -(x_i - x_j)^\top A (x_i - x_j)).

Masking implies that f_i(X) is not a function of x_j for (i, j) \in M, hence J_{ij} = 0 for (i, j) \in M. Thus f_i(X) is equal to the ith output of self-attention with the inputs restricted to \{x_j : (i, j) \notin M\}, the unmasked inputs with respect to the ith output. Hence J_{ij} no longer contributes to the bound on \|[J_{i1}, \ldots, J_{iN}]\|, and the bound for the unmasked case continues to hold as long as (i, i) \notin M, i.e. x_i attends to itself (this is necessary for the proof of Lemma F.2 to hold). The bound can in fact be tightened by replacing N with |\{x_j : (i, j) \notin M\}|, the number of unmasked inputs with respect to the ith output.
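As an illustration, here is a minimal NumPy sketch of masked attention weights with L2 logits (the function name and the causal-mask example are ours; in practice the same masking is applied to whichever logits the attention function uses):

```python
import numpy as np

def masked_l2_attention_weights(X, WQ, mask):
    """P_ij with logits set to -inf where mask[i, j] is True (i may NOT attend to j).
    Every row must leave at least (i, i) unmasked, as assumed above."""
    d = WQ.shape[1]
    Y = X @ (WQ / d ** 0.25)
    logits = -((Y[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    logits = np.where(mask, -np.inf, logits)
    P = np.exp(logits - logits.max(axis=1, keepdims=True))
    return P / P.sum(axis=1, keepdims=True)

# Causal mask for N = 4: position i attends only to j <= i, so (i, i) is never masked
N = 4
causal_mask = np.triu(np.ones((N, N), dtype=bool), k=1)
```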

H. Experimental Details

For the experiment in Section 5.1, showing the asymptotic tightness of the upper bound on Lip_\infty(F) where F is L2-MHA, we fix all free parameters of F (namely W^Q, W^V) to be the identity, and only optimise the input X. We use 50 random initialisations of X for each N, where X_{ij} \sim U[-c, c] for c \sim U[0, 10] (we observed that having c itself be random improves optimisation). We display the top 5 results for each value of N after optimising each random initialisation until convergence using Adam (Kingma & Ba, 2015) with a learning rate of 0.1.

For the experiments in Section 5.3, we compare the performance of the original Transformer and the Transformer with Lipschitz/invertible self-attention at character-level language modelling on the Penn Treebank dataset (Marcus et al., 1993).^1 Each training example is a sentence represented as a variable-length sequence of characters, and examples are batched according to length such that padding is minimised, with the maximum sequence length set to 288. All models are autoregressive, outputting the logits of the categorical likelihood predicting the next character, and are trained using maximum likelihood (cross-entropy loss) with a batch size of 64. The LSTM models have the dimensionality of the hidden state equal to the dimensionality D of the cell state (the usual default implementation). The Transformer models are trained with a varying number of blocks (number of layers), with H = 8 heads and D = 512, tuning hyperparameters for the dropout rate in \{0, 0.1, 0.2\} and the base learning rate \gamma \in \{0.2, 0.4, 0.6, 0.8, 1.0, 1.5, 2.0\}, with the number of warmup iterations w \in \{1000, 2000, 4000, 8000\}, for the standard custom learning rate schedule of Vaswani et al. (2017):

\gamma_t = \frac{\gamma}{\sqrt{D}} \min\big(t^{-1/2}, \, t \, w^{-3/2}\big),

where \gamma_t is the learning rate at training iteration t. Hence the learning rate increases linearly from 0 to \gamma (Dw)^{-1/2} over w iterations, then decays proportionally to t^{-1/2}. We use Glorot uniform initialisation (Glorot & Bengio, 2010) for all weights, U[-\sqrt{1/(d_{in}+d_{out})}, \sqrt{1/(d_{in}+d_{out})}], except for the weights in L2-MHA, which are initialised from U[-s/\sqrt{D}, s/\sqrt{D}], where s is a hyperparameter. For D = 512, we used s = 1/24. All experiments were done in TensorFlow 1.14 (Abadi et al., 2016) on single Nvidia Tesla V100 GPUs.
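For reference, the schedule can be written as the following small Python function (a sketch; the actual training code is in TensorFlow):

```python
import numpy as np

def lr_schedule(t, gamma, D=512, w=4000):
    """Learning rate at iteration t: gamma / sqrt(D) * min(t^{-1/2}, t * w^{-3/2}).
    Linear warmup to gamma * (D * w)^{-1/2} over w iterations, then t^{-1/2} decay."""
    t = max(int(t), 1)
    return gamma / np.sqrt(D) * min(t ** -0.5, t * w ** -1.5)
```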

^1 We use the standard training-validation-test split, and the dataset can be found at e.g. https://github.com/harvardnlp/TextFlow/tree/master/data/ptb.

I. Numerical Invertibility of MHA Residual Map

Following Section 5.2, Figure 6 confirms that numerical invertibility does not hold for trained weights of dot-product multihead self-attention (DP-MHA) (obtained from the one-layer Transformer (DP) model used for Figure 4), similar to the randomly initialised weight case. Figure 7 shows additional results for different values of N and D.
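The numerical-invertibility check inverts the residual map by fixed-point iteration and records the reconstruction error; a minimal sketch (our own function names, assuming f is any map \mathbb{R}^{N \times D} \to \mathbb{R}^{N \times D}):

```python
import numpy as np

def invert_residual(f, y, c, n_iters=25):
    """Approximately solve y = x + c * f(x) by the fixed-point iteration x <- y - c * f(x),
    returning the final iterate and the max reconstruction error after each step."""
    x, errors = y.copy(), []
    for _ in range(n_iters):
        x = y - c * f(x)
        errors.append(np.abs(x + c * f(x) - y).max())  # max reconstruction error
    return x, errors
```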

Figure 6. Invertibility of g(x) = x + cf(x) for trained DP-MHA f (D = 512, H = 8): maximum reconstruction error against the number of fixed-point iterations, for c \in \{0.5, 0.7, 0.9\}.

Figure 7. Numerical invertibility of g(x) = x + cf(x) where f is L2-MHA (left) or DP-MHA (right), for different values of N and D.

J. Behaviour of Lower Bound on Lip2(F )

Figure 8. Lower bound on Lip_2(F) where F is L2-MHA, with D = 1 and varying N, obtained by optimising \|J_F(X)\|_2 with respect to X, with 50 random initialisations of X for each N. Shown are the top 1 and top 5 values over the random initialisations, together with a c \log N reference curve.

In Figure 8, we show the lower bound on Lip_2(F) obtained by optimising \|J_F(X)\|_2 using the same optimisation procedure as for Figure 2 of Section 5.1. Here the optimisation is more difficult, evident in the variance of the top 5 values, and the trend is less clear, but it appears that Lip_2(F) grows at a rate of O(\log N). The message is less conclusive here, and there are at least two possibilities:

(1) The optimisation is difficult even for small values of N, hence Figure 8 shows a loose lower bound.

(2) If the lower bound is tight, this suggests that the O(\sqrt{N} \log N) bound in Theorem 3.2 is not asymptotically tight, and could be improved to O(\log N) (or O(\log N - \log \log N), as for p = \infty).

K. Optimising the norm of the Jacobian of DP-MHA

In Figure 9, we show how the norm of the Jacobian \|J_f(X)\|_\infty for DP-MHA f keeps increasing when it is optimised with respect to X. This is a useful sanity check validating our theoretical result of Theorem 3.1, namely that DP-MHA is not Lipschitz. The oscillations are likely due to the momentum term of the Adam optimiser that was used to optimise the norm.

Figure 9. Optimising \|J_f(X)\|_\infty with respect to X for trained DP-MHA f (D = 512): the norm continues to grow with the number of optimisation iterations.

L. Experiment tying keys and queries of L2-MHA but preserving parameter count

In Figure 4 of Section 5.3, we have shown that there is a clear reduction in performance when tying the keys and queries. To test whether this can be attributed to the reduction in parameter count, we tried doubling the number of columns of W^Q when the keys and queries are shared (i.e. from D/H to 2D/H), so that the shared model has the same number of parameters as the unshared model. In Figure 10, the third column shows results for shared L2-MHA with the same number of parameters as the unshared L2-MHA, i.e. as without tying the keys and queries. The performance is similar to the second column (tying with a reduced number of parameters), suggesting that there is an inherent limitation in expressiveness from tying the keys and queries, and that the reduction in the number of parameters is an insufficient explanation for this phenomenon.

Figure 10. Experiment tying keys/queries but preserving parameter count: validation NLL against training epochs for Transformer (L2), Transformer (L2) with W^Q = W^K, and Transformer (L2) with W^Q = W^K but with the same parameter count, each with 1-5 layers.

M. Training curves for fixed learning rate DP-MHA vs L2-MHA

Figure 11. Train NLL against training epochs for Transformer (DP), Transformer (L2) and Transformer (Contractive-L2), for varying numbers of layers (2-18).

N. The Lipschitz constant of LayerNorm

In this section, we show that LayerNorm is Lipschitz, with a loose bound on its Lipschitz constant w.r.t. the \infty-norm. LayerNorm is defined as follows:

LN(x) = \gamma \odot \frac{x - \mu(x)}{\sqrt{\sigma^2(x) + \epsilon}} + \beta
\mu(x) = \frac{1}{D} \sum_{d=1}^D x_d
\sigma^2(x) = \frac{1}{D} \sum_{d=1}^D (x_d - \mu(x))^2

where x, \beta, \gamma \in \mathbb{R}^D. We will omit the dependence on x and write \mu, \sigma^2 when there is no ambiguity, to reduce clutter.

In the trivial case where the x_d are all equal or when D = 1, we have x = \mu \mathbf{1}, hence LN(x) = \beta, so its Lipschitz constant is 0. Thus let us assume D > 2 and that not all x_d are equal. First let us compute the derivatives of \mu and \sigma^2 w.r.t. x:

\frac{\partial \mu}{\partial x} = \frac{1}{D} \mathbf{1}^\top
\frac{\partial \sigma^2}{\partial x} = \frac{1}{D} \sum_d 2 (x_d - \mu) \frac{\partial}{\partial x}(x_d - \mu)
= \frac{2}{D} \sum_d (x_d - \mu) \Big(e_d - \frac{1}{D}\mathbf{1}\Big)^\top
= \frac{2}{D} \Big( \sum_d (x_d - \mu) e_d^\top - \frac{1}{D} \Big(\sum_d (x_d - \mu)\Big) \mathbf{1}^\top \Big)
= \frac{2}{D} \sum_d (x_d - \mu) e_d^\top
= \frac{2}{D} (x - \mu)^\top,

where e_d \in \mathbb{R}^D is a one-hot vector with 1 at the dth element. Note the penultimate equality holds because \sum_d (x_d - \mu) = 0.

Now the derivative of LN(x)_d, the dth element of LN(x), w.r.t. x is

\frac{\partial LN(x)_d}{\partial x} = \gamma_d \left( \frac{\partial}{\partial x}(x_d - \mu) \, (\sigma^2 + \epsilon)^{-1/2} + (x_d - \mu) \Big(-\frac{1}{2}\Big) (\sigma^2 + \epsilon)^{-3/2} \frac{\partial \sigma^2}{\partial x} \right)
= \gamma_d \left( (\sigma^2 + \epsilon)^{-1/2} \Big(e_d - \frac{1}{D}\mathbf{1}\Big)^\top - \frac{1}{2} (x_d - \mu) (\sigma^2 + \epsilon)^{-1} \frac{2}{D} (\sigma^2 + \epsilon)^{-1/2} (x - \mu)^\top \right)
= \gamma_d (\sigma^2 + \epsilon)^{-1/2} \left( \Big(e_d - \frac{1}{D}\mathbf{1}\Big)^\top - \frac{1}{D} (\sigma^2 + \epsilon)^{-1} (x_d - \mu) (x - \mu)^\top \right).

Hence

\frac{\partial LN(x)}{\partial x} = (\sigma^2 + \epsilon)^{-1/2} \left( \mathrm{diag}(\gamma) - \frac{1}{D} \gamma \mathbf{1}^\top - \frac{1}{D} (\sigma^2 + \epsilon)^{-1} \mathrm{diag}(\gamma) (x - \mu)(x - \mu)^\top \right).

Note

\mathrm{diag}(\gamma) - \frac{1}{D} \gamma \mathbf{1}^\top = \begin{bmatrix} \gamma_1 (D-1)/D & -\gamma_1/D & \cdots & -\gamma_1/D \\ -\gamma_2/D & \gamma_2 (D-1)/D & \cdots & -\gamma_2/D \\ \vdots & \vdots & \ddots & \vdots \\ -\gamma_D/D & -\gamma_D/D & \cdots & \gamma_D (D-1)/D \end{bmatrix},

hence

\left\| \mathrm{diag}(\gamma) - \frac{1}{D} \gamma \mathbf{1}^\top \right\|_\infty = \frac{2(D-1)}{D} \max_d |\gamma_d|, \qquad (36)

recalling that \|\cdot\|_\infty is the maximum absolute row sum.

Let z_d := x_d - \mu. Hence \sum_d z_d = 0, \sigma^2 = \frac{1}{D} \sum_d z_d^2 and

\mathrm{Cov}(x) := (x - \mu)(x - \mu)^\top = \begin{bmatrix} z_1^2 & \cdots & z_1 z_D \\ \vdots & \ddots & \vdots \\ z_D z_1 & \cdots & z_D^2 \end{bmatrix}.

Hence

\frac{\|\mathrm{Cov}(x)\|_\infty}{\sigma^2} = \frac{\max_d |z_d| \sum_{d'} |z_{d'}|}{\frac{1}{D} \sum_d z_d^2}.

Noting that this expression is scale-invariant in z, we may assume WLOG that \max_d |z_d| = z_D = 1, since we are assuming that not all x_d are equal and hence at least one z_d is non-zero. The expression now becomes

\frac{\|\mathrm{Cov}(x)\|_\infty}{\sigma^2} = D \, \frac{1 + \sum_{d < D} |z_d|}{1 + \sum_{d < D} z_d^2}. \qquad (37)

Since all the |z_d| are bounded by 1, this continuous expression attains a global maximum at some value of z with z_D = 1.

It is easy to see that at the global maximum, z_d \neq 0 for all d. Suppose this were not true, WLOG z_1 = 0. Then consider how the quantity (37) changes when z_1 = 0 is increased by 0 < \delta < 1 and z_D = 1 is decreased by \delta, keeping the sum constant. The numerator term \sum_d |z_d| stays constant, but the denominator term \sum_d z_d^2 changes by 2\delta^2 - 2\delta < 0. Since for small \delta the numerator of (37) stays constant while the denominator decreases, the quantity (37) increases, contradicting that the global maximum is attained at z_1 = 0. Hence we may assume that z_d \neq 0 for all d.

Hence the quantity (37) (in particular, \sum_{d<D} |z_d|) is differentiable at the global maximum, at which the partial derivatives of the following Lagrangian are zero:

\mathcal{L}(z_1, \ldots, z_{D-1}, \lambda) = \frac{1 + \sum_{d<D} |z_d|}{1 + \sum_{d<D} z_d^2} - \lambda \Big( \sum_{d<D} z_d + 1 \Big).

From now on let us write \sum for \sum_{d<D} to reduce clutter. Setting \frac{\partial \mathcal{L}}{\partial z_k} = 0 and noting \frac{\partial |z_k|}{\partial z_k} = \mathrm{sgn}(z_k), we obtain

\frac{\mathrm{sgn}(z_k)\big(1 + \sum z_d^2\big) - \big(1 + \sum |z_d|\big) 2 z_k}{\big(1 + \sum z_d^2\big)^2} = \lambda.

Hence at the global maximum, z_k takes one of two values a > 0 and b < 0. Further we have that

\frac{1 + \sum |z_d|}{1 + \sum z_d^2} = \frac{\mathrm{sgn}(z_k) - \lambda\big(1 + \sum z_d^2\big)}{2 z_k}. \qquad (38)

If both a and b are among the z_k, we have that \frac{1 - \lambda(1 + \sum z_d^2)}{2a} = \frac{-1 - \lambda(1 + \sum z_d^2)}{2b}. Solving for \lambda(1 + \sum z_d^2) and plugging it back into Equation (38), we get:

\frac{1 + \sum |z_d|}{1 + \sum z_d^2} = \frac{1}{a - b}.

Since a > 0, b < 0 and \sum z_d = -1, the difference a - b is minimised when only one of the z_d is a and the rest are b. Hence a crude lower bound on a - b is \frac{1}{D-2}, giving the bound:

\frac{\|\mathrm{Cov}(x)\|_\infty}{\sigma^2} \leq D(D - 2). \qquad (39)

However, we conjecture that the true global maximum is attained when z_d = -\frac{1}{D-1} for all d < D (i.e. all the z_d for d < D are equal), in which case \frac{1 + \sum_{d<D} |z_d|}{1 + \sum_{d<D} z_d^2} = \frac{2(D-1)}{D}, which would give the tighter bound \frac{\|\mathrm{Cov}(x)\|_\infty}{\sigma^2} \leq 2(D - 1).

Putting the above together, we can bound the \infty-norm of the Jacobian of LayerNorm:

\left\| \frac{\partial LN(x)}{\partial x} \right\|_\infty = \left\| (\sigma^2 + \epsilon)^{-1/2} \left( \mathrm{diag}(\gamma) - \frac{1}{D} \gamma \mathbf{1}^\top - \frac{1}{D} (\sigma^2 + \epsilon)^{-1} \mathrm{diag}(\gamma)(x - \mu)(x - \mu)^\top \right) \right\|_\infty
\leq \epsilon^{-1/2} \left( \left\| \mathrm{diag}(\gamma) - \frac{1}{D} \gamma \mathbf{1}^\top \right\|_\infty + \frac{1}{D} \|\mathrm{diag}(\gamma)\|_\infty \left\| (\sigma^2 + \epsilon)^{-1} (x - \mu)(x - \mu)^\top \right\|_\infty \right)
\leq \epsilon^{-1/2} \left( \left\| \mathrm{diag}(\gamma) - \frac{1}{D} \gamma \mathbf{1}^\top \right\|_\infty + \frac{1}{D} \|\mathrm{diag}(\gamma)\|_\infty \left\| \mathrm{Cov}(x)/\sigma^2 \right\|_\infty \right)
\leq \epsilon^{-1/2} \left( \frac{2(D-1)}{D} \max_d |\gamma_d| + \frac{1}{D} \max_d |\gamma_d| \, D(D-2) \right)
= \epsilon^{-1/2} \max_d |\gamma_d| \left( \frac{2(D-1)}{D} + D - 2 \right)
= \epsilon^{-1/2} \max_d |\gamma_d| \, \frac{D^2 - 2}{D}.

Hence the Lipschitz constant of LayerNorm w.r.t. the \infty-norm is at most \epsilon^{-1/2} \max_d |\gamma_d| (D^2 - 2)/D.
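A small NumPy check of the resulting bound, comparing a finite-difference estimate of \|\partial LN(x)/\partial x\|_\infty with \epsilon^{-1/2} \max_d |\gamma_d| (D^2 - 2)/D (a sketch with our own helper names; the bound is loose, so the margin is large):

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    mu = x.mean()
    var = ((x - mu) ** 2).mean()
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

rng = np.random.default_rng(0)
D, eps, h = 8, 1e-5, 1e-6
gamma, beta, x = rng.normal(size=D), rng.normal(size=D), rng.normal(size=D)

# Columns of the Jacobian by finite differences: J[:, d] = d LN(x) / d x_d
J = np.stack([(layer_norm(x + h * np.eye(D)[d], gamma, beta, eps)
               - layer_norm(x, gamma, beta, eps)) / h for d in range(D)], axis=1)
jac_inf = np.abs(J).sum(axis=1).max()      # maximum absolute row sum
bound = eps ** -0.5 * np.abs(gamma).max() * (D ** 2 - 2) / D
print(jac_inf, bound, jac_inf <= bound)
```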