
The Lipschitz Constant of Self-Attention

Hyunjik Kim 1 George Papamakarios 1 Andriy Mnih 1

¹DeepMind, UK. Correspondence to: Hyunjik Kim.

Abstract

Lipschitz constants of neural networks have been explored in various contexts in deep learning, such as provable adversarial robustness, estimating Wasserstein distance, stabilising training of GANs, and formulating invertible neural networks. Such works have focused on bounding the Lipschitz constant of fully connected or convolutional networks, composed of linear maps and pointwise non-linearities. In this paper, we investigate the Lipschitz constant of self-attention, a non-linear neural network module widely used in sequence modelling. We prove that the standard dot-product self-attention is not Lipschitz for unbounded input domain, and propose an alternative L2 self-attention that is Lipschitz. We derive an upper bound on the Lipschitz constant of L2 self-attention and provide empirical evidence for its asymptotic tightness. To demonstrate the practical relevance of our theoretical work, we formulate invertible self-attention and use it in a Transformer-based architecture for a character-level language modelling task.

1. Introduction

Lipschitz continuity is a strong form of continuity for functions. Loosely speaking, a function is Lipschitz continuous if changing its input by a certain amount cannot change its output by more than K times that amount. The constant K is a hard constraint on how rapidly the function's output can vary, and the smallest such K is known as the function's Lipschitz constant. For example, f1(x) = √|x| and f2(x) = exp(x) for x ∈ R are not Lipschitz continuous, because their output can change arbitrarily fast as x approaches 0 and +∞ respectively. On the other hand, g1(x) = tanh(x) and g2(x) = αx are Lipschitz continuous, because their rate of change (derivative) is bounded.

In deep learning, we often use Lipschitz continuity as a constraint for neural networks, to control how much a network's output can change relative to its input. Such Lipschitz constraints are useful in several contexts. For example, Lipschitz constraints can endow models with provable robustness against adversarial perturbations (Cisse et al., 2017; Tsuzuku et al., 2018; Anil et al., 2019), and guaranteed generalisation bounds (Sokolić et al., 2017). Moreover, the dual form of the Wasserstein distance is defined as a supremum over Lipschitz functions with a given Lipschitz constant, hence Lipschitz-constrained networks are used for estimating Wasserstein distances (Peyré & Cuturi, 2019). Further, Lipschitz-constrained networks can stabilise training for GANs, an example being spectral normalisation (Miyato et al., 2018). Finally, Lipschitz-constrained networks are also used to construct invertible models and normalising flows. For example, Lipschitz-constrained networks can be used as a building block for invertible residual networks and hence flow-based generative models (Behrmann et al., 2019; Chen et al., 2019). Additionally, Neural ODEs (Chen et al., 2018; Grathwohl et al., 2019) are typically defined using vector fields parameterized via Lipschitz networks, so that the flow generated by the vector field is guaranteed to exist for all times.

Nonetheless, designing Lipschitz-continuous neural networks and computing (or even upper-bounding) their Lipschitz constant is a hard problem. Previous work mostly focused on fully-connected and convolutional networks, not only because they are common in deep learning, but also because they are relatively simple to analyze, as compositions of linear maps and pointwise non-linearities. Even in this case however, exact evaluation of the Lipschitz constant of fully-connected and convolutional networks is NP-hard (Virmaux & Scaman, 2018) and obtaining a tight upper bound remains a challenging task (Virmaux & Scaman, 2018; Fazlyab et al., 2019; Latorre et al., 2020).

Fully-connected and convolutional networks are not the only neural networks worthy of interest. Recently, self-attention (Vaswani et al., 2017) has become a popular alternative to recurrent neural networks. Self-attention is a key component of the Transformer (Vaswani et al., 2017), which has found widespread use across models of various data modalities, starting with natural-language processing (Vaswani et al., 2017; Devlin et al., 2019; Brown et al., 2020) and extending to computer vision

(Zhang et al., 2019; Parmar et al., 2019), audio generation (Huang et al., 2019), and reinforcement learning (Parisotto et al., 2020). However, so far no previous work has analysed the Lipschitz properties of self-attention, and thus it has been unclear whether self-attention is a viable option in applications that require Lipschitz constraints. In this work, we address this gap in the theory of self-attention by providing a thorough analysis of its Lipschitz properties. In particular, we make the following contributions:

• We prove that the widely used dot-product self-attention is not Lipschitz, and therefore not suitable to use in applications requiring Lipschitz constraints.

• We formulate L2 self-attention as an alternative, and show that it is Lipschitz.

• We derive a theoretical upper bound on the Lipschitz constant of L2 self-attention, and provide empirical evidence of the asymptotic tightness of the bound.

• As a practical demonstration of the theory, we use this bound to formulate invertible self-attention, and explore its use in a Transformer architecture for character-level language modelling. We compare its test log-likelihood and stability to dot-product self-attention.

2. Lipschitz Constant of Fully-Connected/Convolutional Layers

We first define the notion of Lipschitz continuity, and proceed to define the Lipschitz constant.

Definition 2.1. Given two metric spaces (X, d_X) and (Y, d_Y), a function f : X → Y is called Lipschitz continuous (or K-Lipschitz) if there exists a constant K ≥ 0 such that

    d_Y(f(x), f(x')) ≤ K d_X(x, x')  for all x, x' ∈ X.    (1)

The smallest such K is the Lipschitz constant of f, denoted Lip(f).

In this paper, we focus on the common case where X = R^n, Y = R^m, and d_X, d_Y are induced by a p-norm ‖x‖_p := (Σ_i |x_i|^p)^{1/p}. We will primarily consider the cases p = 2 and p = ∞, where ‖x‖_∞ := max_i |x_i|. To emphasise the dependence of the Lipschitz constant on the choice of p-norm, we will often denote it by Lip_p(f). In this case, it follows directly from Definition 2.1 that the Lipschitz constant is given by

    Lip_p(f) = sup_{x ≠ x' ∈ R^n} ‖f(x) − f(x')‖_p / ‖x − x'‖_p.    (2)

Next, we outline some basic results that are useful for estimating Lipschitz constants, also covered in related works (Virmaux & Scaman, 2018; Behrmann et al., 2019). We describe how these results are used to provide bounds on the Lipschitz constant of fully-connected networks (FCN) and convolutional neural networks (CNN), using the fact that both are compositions of linear maps and pointwise non-linearities. To begin with, the following theorem suggests a way to bound Lip_p(f) for a differentiable Lipschitz function f:

Theorem 2.1 (Federer, 1969). Let f : R^n → R^m be differentiable and Lipschitz continuous under a choice of p-norm ‖·‖_p. Let J_f(x) denote its total derivative (Jacobian) at x. Then Lip_p(f) = sup_{x ∈ R^n} ‖J_f(x)‖_p, where ‖J_f(x)‖_p is the induced operator norm on J_f(x).

Hence if f is a linear map represented by a matrix W then

    Lip_p(f) = ‖W‖_p := sup_{‖x‖_p = 1} ‖W x‖_p
             = σ_max(W)            if p = 2,
             = max_i Σ_j |W_ij|    if p = ∞,

where ‖W‖_p is the operator norm on matrices induced by the vector p-norm, and σ_max(W) is the largest singular value of W. Under this choice of norm, many common non-linearities (including relu, sigmoid, tanh, elu) are 1-Lipschitz. ‖W‖_2 = σ_max(W) is usually estimated via power iteration; we provide details on how this is done in Appendix B.

Since we now know the Lipschitz constants of the components of both FCN and CNN, we can bound their Lipschitz constants by applying the following lemma:

Lemma 2.1 (Federer, 1969). Let g, h be two composable Lipschitz functions. Then g ∘ h is also Lipschitz with Lip(g ∘ h) ≤ Lip(g) Lip(h).

Corollary 2.1. For a fully-connected network (FCN) or a convolutional neural network (CNN) f = W_K ∘ ρ_{K−1} ∘ W_{K−1} ∘ ... ∘ ρ_1 ∘ W_1, we have Lip_p(f) ≤ Π_k ‖W_k‖_p under a choice of p-norm with 1-Lipschitz non-linearities ρ_k.

The above bound is not necessarily tight; there are various works that compute tighter bounds for FCN and CNN (e.g. Virmaux & Scaman, 2018; Fazlyab et al., 2019; Latorre et al., 2020).
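To make Corollary 2.1 concrete, the following is a minimal sketch (ours, not part of the paper) of the product-of-norms bound for a small fully-connected network, using exact operator norms computed with NumPy; the layer sizes and random weights are arbitrary illustrative choices.

```python
import numpy as np

def operator_norm(W, p):
    """Operator norm of x -> W x induced by the vector p-norm (p = 2 or inf)."""
    if p == 2:
        return np.linalg.svd(W, compute_uv=False)[0]   # largest singular value
    if p == np.inf:
        return np.abs(W).sum(axis=1).max()             # maximum absolute row sum
    raise ValueError("only p = 2 or p = inf handled here")

def fcn_lipschitz_upper_bound(weights, p):
    """Corollary 2.1: Lip_p(W_K o rho_{K-1} o ... o W_1) <= prod_k ||W_k||_p
    for 1-Lipschitz non-linearities rho_k (e.g. relu, tanh)."""
    return np.prod([operator_norm(W, p) for W in weights])

rng = np.random.default_rng(0)
# weight matrices with shape (out_dim, in_dim): 32 -> 64 -> 32 -> 1
weights = [rng.normal(size=(64, 32)), rng.normal(size=(32, 64)), rng.normal(size=(1, 32))]
print(fcn_lipschitz_upper_bound(weights, p=2))
print(fcn_lipschitz_upper_bound(weights, p=np.inf))
```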

3. Lipschitz Constant of Self-Attention

3.1. Dot-product self-attention is not Lipschitz

Moving on, we investigate whether self-attention is Lipschitz. We first consider the widely used (scaled) dot-product multihead self-attention as formulated by Vaswani et al. (2017). Let x_1, ..., x_N be a sequence of N elements, where

x_i ∈ R^D for i = 1, ..., N. We represent this sequence as a matrix X:

    X = [x_1^T; ...; x_N^T] ∈ R^{N×D}.    (3)

Dot-product multihead self-attention (DP-MHA) is a map from R^{N×D} to R^{N×D} consisting of H 'heads', where H is chosen to divide D. Each head is a map from R^{N×D} to R^{N×D/H} defined by

    DP(X) := softmax( X W^Q (X W^K)^T / sqrt(D/H) ) X W^V = P X W^V,

where W^Q, W^K, W^V ∈ R^{D×D/H} are learnable parameters specific to each head, and P ∈ R^{N×N} is the output of the softmax (we suppress the dependence of P on X to reduce clutter below). The input to the softmax is an N × N matrix of pairwise dot products (hence dot-product self-attention), and the softmax is applied to each row of this matrix. Finally, the outputs of all heads are concatenated into an N × D matrix and are right multiplied by W^O ∈ R^{D×D}, thus DP-MHA is defined by

    MHA_DP(X) := [DP^1(X), ..., DP^H(X)] W^O.    (4)

In what follows, we will prove that MHA as defined above is not Lipschitz, assuming that the MHA map is non-trivial, i.e. W^Q, W^K, W^V, W^O ≠ 0. It is sufficient to show that a single head DP is not Lipschitz, since MHA is a linear combination of the outputs of each head. Also note that P is a stochastic matrix, i.e. its entries are non-negative and its rows sum to 1. Since the rows of X are the x_i's, a linear transformation of each x_i by some matrix A is equivalent to right multiplication of X by A^T. So right multiplication of X by W^V is a linear map and thus Lipschitz. Therefore, we are interested in the mapping f(X) = PX; this is not a linear mapping because P itself is a non-linear function of X. In fact, we show that f is not Lipschitz, thus proving the first main result of the paper:

Theorem 3.1. DP-MHA is not Lipschitz for any vector p-norm ‖·‖_p with p ∈ [1, ∞].

Summary of Proof. We use Theorem 2.1, noting that if the supremum of the norm of the Jacobian is infinite, then the mapping is not Lipschitz. In particular, we show that when x_i = 0 for some i, some elements of the Jacobian of f grow proportionally to the sample variance of x_{≠i}, which is unbounded.
Proof. We show the proof for the case D = H = 1 (i.e. X ∈ R^{N×1}, a column vector, and x_i ∈ R) for readability. See Appendix C for the general case, which follows the same logic.

The mapping f can be written as

    f(X) = PX = softmax(a X X^T) X = [f_1(X); ...; f_N(X)] ∈ R^{N×1},

where f_i(X) = Σ_j P_ij x_j ∈ R and a = W^K W^Q ∈ R (we assume a ≠ 0 such that self-attention is non-trivial). Hence f can be interpreted as a map of each x_i to a point in the convex hull of x_1, ..., x_N. Since f is a map from R^{N×1} to R^{N×1}, its Jacobian is

    J_f = [J_11 ... J_1N; ...; J_N1 ... J_NN] ∈ R^{N×N},    (5)

where J_ij = ∂f_i(X)/∂x_j ∈ R. By taking partial derivatives we can show that

    J_ij = a X^T P^(i) [E_ji X + δ_ij X] + P_ij,

where

• E_ij ∈ R^{N×N} is a binary matrix with zeros everywhere except the (i, j)th entry,

• δ_ij ∈ {0, 1} is the Kronecker delta,

• P^(i) := diag(P_i:) − P_i:^T P_i: ∈ R^{N×N}.

See Appendix A for useful identities in deriving the above Jacobian. So for i = j:

    J_ii = a X^T P^(i) E_ii X + a X^T P^(i) X + P_ii.    (6)

Let us investigate the scalar X^T P^(i) X. We observe that it is in fact a variance of a discrete distribution. Specifically:

    X^T P^(i) X = Σ_k P_ik x_k^2 − (Σ_k P_ik x_k)^2 = Var(X),    (7)

where X is a discrete distribution with support at the inputs {x_1, ..., x_N} and probability mass function given by their softmax probabilities P(X = x_j) = P_ij. A consequence of this interpretation is that P^(i) is positive semi-definite (PSD) since X^T P^(i) X = Var(X) ≥ 0, with equality if and only if the x_j are all equal.

We use this observation to show that J_ii is unbounded, and so ‖J_f‖_p is unbounded, hence DP-MHA is not Lipschitz. Consider the case x_i = 0. Then

    P_i:^T = softmax(X A x_i) = (1/N) 1,

i.e. we have uniform attention regardless of x_{≠i}. The first term of J_ii in Equation (6) disappears since E_ii X = [0, ..., x_i, ..., 0]^T = 0, and the last term becomes 1/N. Now consider the second term a X^T P^(i) X = a Var(X). Note X is uniformly distributed, since P(X = x_j) = P_ij = 1/N. Hence the second term is equal to a times the sample variance of x_1, ..., x_N, which can be arbitrarily large. Hence J_ii can become arbitrarily large, so the full Jacobian J_f is unbounded.

High-level intuition for proof. At x_i = 0, f_i(X) = (1/N) Σ_k x_k, the mean of the inputs. The rate of change of f_i is governed by how fast the softmax saturates when x_i is perturbed, which is determined by how spread out the x_{≠i} are. The more spread out they are (the higher the sample variance), the greater the rate of saturation of the softmax, and the faster the rate of change of f_i. Since the sample variance of x_{≠i} can be arbitrarily large, the rate of change of f_i can also be arbitrarily large, i.e. the entries of the Jacobian (and hence its p-norm) can become arbitrarily large. In Appendix D, we show that adding bias terms to x_i^T W^Q and x_j^T W^K does not resolve the issue.
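As a quick numerical illustration of this argument (our own sketch, not from the paper), the following evaluates the Jacobian of f(X) = softmax(a X X^T) X for D = H = 1 by finite differences, with x_1 = 0 and the remaining inputs scaled up; the entry J_11 grows in proportion to the sample variance of the inputs, as in Equation (6). The constant a and the input values are arbitrary illustrative choices.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def f(X, a=1.0):
    # single-head dot-product self-attention with D = 1: f(X) = softmax(a X X^T) X
    P = softmax(a * np.outer(X, X))
    return P @ X

def jacobian_fd(X, eps=1e-5):
    # central finite-difference Jacobian of f at X
    N = X.shape[0]
    J = np.zeros((N, N))
    for j in range(N):
        Xp, Xm = X.copy(), X.copy()
        Xp[j] += eps
        Xm[j] -= eps
        J[:, j] = (f(Xp) - f(Xm)) / (2 * eps)
    return J

rng = np.random.default_rng(0)
base = rng.normal(size=8)
for scale in [1.0, 10.0, 100.0]:
    X = scale * base
    X[0] = 0.0                        # x_1 = 0 gives uniform attention in row 1
    J = jacobian_fd(X)
    print(scale, J[0, 0], np.var(X))  # J_11 tracks a * Var(X) + 1/N, which is unbounded
```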
The implications of this result are the following. (1) There can be undesirable behaviour (e.g. training instabilities) for the Transformer when some inputs are close to zero and others have large magnitude. (2) Dot-product self-attention (and hence the standard Transformer) is not a suitable choice when we require a Lipschitz neural network, such as for formulating invertible residual networks (Behrmann et al., 2019). Therefore, to use self-attention and Transformers in such applications, a Lipschitz formulation of self-attention is required, together with an explicit (ideally tight) upper bound to its Lipschitz constant, to quantify how much the output can change with respect to changes in the input.

One method to make dot-product self-attention Lipschitz is by ensuring its inputs are bounded. Indeed, if the input space is compact, e.g. [0, 1]^{N×D}, any continuously differentiable function is Lipschitz, including dot-product self-attention. However, as we further discuss in Section 6, such an approach has its own challenges, since it makes the Lipschitz constant depend on the input range. Instead, in the next section we formulate a version of self-attention that is provably Lipschitz on all of R^{N×D}, allowing us to derive an upper bound that holds for any subset of R^{N×D}.

3.2. L2 self-attention: a Lipschitz formulation of self-attention

The pathology in dot-product self-attention arises because the softmax probabilities P_i: are constant with respect to x_{≠i} when x_i = 0. This behaviour can be undesirable as we want P_ij to vary according to x_j, regardless of whether x_i is zero or not. Hence we propose an alternative form of self-attention based on L2 distance:

    P_ij ∝ exp(L_ij) := exp( −‖x_i^T W^Q − x_j^T W^K‖_2^2 / sqrt(D/H) ),    (8)

with the normalisation constant ensuring that Σ_j P_ij = 1. We will refer to it as L2 self-attention. It is reminiscent of the standard squared-exponential kernel, but with softmax normalisation that ensures that each row of the kernel matrix sums to 1. Normalisation is usually necessary to deal with inputs of varying length N (Wang et al., 2018), hence we keep the softmax for L2 self-attention. Similarly to dot-product self-attention, L2 self-attention can be computed efficiently with matrix operations; see Appendix E for details, with a comparison of wall-clock runtimes between different choices of attention.

We first state the mathematical formulation of L2 multihead self-attention (L2-MHA) before proving the main result — the upper bound of its Lipschitz constant with respect to ‖·‖_p for p = 2, ∞. The full L2-MHA map F : R^{N×D} → R^{N×D} is defined as

    F(X) := [f^1(X) W^{V,1}, ..., f^H(X) W^{V,H}] W^O,  where f^h(X) := P^h X A_h.

In the above, W^{V,h} ∈ R^{D×D/H}, W^O ∈ R^{D×D}, P^h is defined as in Equation (8) with W^{Q,h} = W^{K,h} ∈ R^{D×D/H}, and A_h := W^{Q,h} W^{Q,h T} / sqrt(D/H) ∈ R^{D×D}. There are two changes from the usual form of multihead self-attention:

(1) We require W^{Q,h} = W^{K,h} for each head f^h(X) to be Lipschitz. In Lemma F.1 of Appendix F we show that L2-MHA is not Lipschitz for arbitrary W^{Q,h}, W^{K,h}, and that tying W^{Q,h} = W^{K,h} is sufficient for L2-MHA to be Lipschitz, with intuition for why tying is sufficient.

(2) In each head of the self-attention f^h(X), right multiplication by A_h has been included for the theorem below to hold (details are in the proof). In practice, there is little harm done by this extra linear transformation, since when the heads are combined together in F, each f^h(X) is additionally transformed by W^{V,h}, a free parameter. (A minimal sketch of a single L2 self-attention head is given below.)
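The following is a small NumPy sketch (our own illustration, not the authors' code) of one L2 self-attention head with tied W^Q = W^K, following Equation (8) and the definition f^h(X) = P^h X A_h; the shapes and random weights are arbitrary.

```python
import numpy as np

def softmax_rows(Z):
    E = np.exp(Z - Z.max(axis=-1, keepdims=True))
    return E / E.sum(axis=-1, keepdims=True)

def l2_attention_head(X, WQ, WV):
    """One head of L2-MHA with tied queries/keys: returns f(X) @ WV where f(X) = P X A."""
    Dh = WQ.shape[1]                       # D/H, the per-head width
    Q = X @ WQ                             # queries == keys since WQ == WK
    sq = (Q ** 2).sum(axis=1)
    dists = sq[:, None] - 2 * Q @ Q.T + sq[None, :]   # pairwise ||q_i - q_j||^2
    P = softmax_rows(-dists / np.sqrt(Dh))            # Equation (8)
    A = WQ @ WQ.T / np.sqrt(Dh)                       # A_h = W^Q W^{Q,T} / sqrt(D/H)
    return P @ X @ A @ WV

N, D, H = 6, 8, 2
rng = np.random.default_rng(1)
X = rng.normal(size=(N, D))
WQ = rng.normal(size=(D, D // H))
WV = rng.normal(size=(D, D // H))
out = l2_attention_head(X, WQ, WV)         # shape (N, D // H)
print(out.shape)
```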
The second main result of the paper is the following:

Theorem 3.2. L2-MHA is Lipschitz, with the following bound on Lip∞(F):

    Lip∞(F) ≤ ( 4 φ^{-1}(N − 1) + 1/sqrt(D/H) ) ‖W^{O T}‖_∞ max_h ( ‖W^{Q,h}‖_∞ ‖W^{Q,h T}‖_∞ ) max_h ‖W^{V,h T}‖_∞

and the following bound on Lip_2(F):

    Lip_2(F) ≤ ( sqrt(N) / sqrt(D/H) ) ( 4 φ^{-1}(N − 1) + 1 ) sqrt( Σ_h ‖W^{Q,h}‖_2^2 ‖W^{V,h}‖_2^2 ) ‖W^O‖_2,

where φ(x) := x exp(x + 1) is an invertible univariate function on x > 0, and N is the input sequence length. Specifically, φ^{-1}(N − 1) = W_0((N − 1)/e) where W_0 is the Lambert W-function, which grows sub-logarithmically as O(log N − log log N) (Corless et al., 1996). Hence the above bounds can be simplified to O(log N) for p = ∞ and O(sqrt(N) log N) for p = 2.

Proof. See Appendix F, which uses the key observation that X^T P^(i) X is a covariance matrix (c.f. Equation (7)) to bound ‖J_F‖_p, the norm of the Jacobian of F. Appendix G shows how the argument can be modified to prove the analogous result for the case with masking in the self-attention.

These bounds are complemented by the concurrent work of Vuckovic et al. (2020), which provides a O(sqrt(D) log N) bound on Lip_1(F) using measure-theoretic tools.

4. Application: Invertible Self-Attention

4.1. Invertible residual network

Consider the residual function g(x) := x + f(x). Behrmann et al. (2019) give the following sufficient condition for its invertibility: if f is a contraction with respect to some metric, i.e. if Lip(f) < 1, and the metric space on which f is defined is complete, then g is invertible. (A Euclidean space with a metric induced by a p-norm ‖·‖_p for p ∈ [1, ∞] is always complete.) Specifically, the inverse g^{-1}(y) is the unique fixed point of the recursion x^{k+1} := y − f(x^k), since by the definition of the inverse we have y = g^{-1}(y) + f(g^{-1}(y)). Because f is a contraction, Banach's Fixed Point Theorem guarantees that this fixed point exists and is unique for all y, and that the recursion converges for all initial values x^0 (often set to y in practice) exponentially fast. Hence the inverse can be computed to arbitrary accuracy (up to numerical precision in practice) by the above fixed-point iteration.

Note that a composition of such invertible residual blocks is also invertible. Behrmann et al. (2019) use this observation to design invertible ResNets: they take f to be a CNN normalised by an upper bound on Lip(f) given by Corollary 2.1, making the resulting function contractive. For the 2-norm ‖·‖_2, a hyperparameter c < 1 is chosen and each linear map (convolution) W in the CNN is multiplied by c/‖W‖_2 if c < ‖W‖_2, where ‖W‖_2 is estimated by power iteration (c.f. Appendix B). This multiplicative factor determines the scale of the Lipschitz constant of the normalised function.
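As an illustration of the fixed-point inversion just described (our own sketch, not the paper's code), the snippet below inverts g(x) = x + f(x) for a stand-in contractive residual branch; the specific f, its contraction factor, and the iteration count are arbitrary assumptions for the example.

```python
import numpy as np

def invert_residual(f, y, n_iter=30):
    """Invert g(x) = x + f(x) via the Banach fixed-point iteration x_{k+1} = y - f(x_k).
    Requires Lip(f) < 1 for guaranteed convergence."""
    x = y.copy()                 # common initialisation: x_0 = y
    for _ in range(n_iter):
        x = y - f(x)
    return x

# stand-in contractive residual branch: tanh(x W^T) with spectral norm of W set to 0.5
rng = np.random.default_rng(0)
W = rng.normal(size=(16, 16))
W = 0.5 * W / np.linalg.svd(W, compute_uv=False)[0]
f = lambda x: np.tanh(x @ W.T)

x_true = rng.normal(size=(4, 16))
y = x_true + f(x_true)                                # forward pass of the residual block
x_rec = invert_residual(f, y)
print(np.max(np.abs(x_rec - x_true)))                 # error decays like 0.5^k, ~1e-9 here
```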
4.2. Invertible self-attention

The standard use case of self-attention is with a skip connection inside the Transformer. A Transformer block is composed of residual blocks of multihead self-attention (MHA) and fully-connected (FCN) layers (Figure 1). Hence similarly to invertible ResNets, we can normalise L2-MHA by the upper bounds given in Theorem 3.2 to obtain Contractive-L2-MHA f, with which we can obtain invertible self-attention g(x) = x + f(x).

[Figure 1. Transformer block. Diagram labels, bottom to top: Input, MHA, Dropout, LayerNorm, FCN, Dropout, LayerNorm, Output, with skip connections around the MHA and FCN sub-blocks.]

Since Dropout is also part of the residual branch along with Contractive-L2-MHA, we should check that it is also contractive. At test time, Dropout multiplies inputs by the dropout keep probability p < 1, so it is a contraction with Lipschitz constant p at evaluation time. At training time, Dropout amounts to setting some inputs to zero, while keeping other inputs constant. This can be expressed as right multiplication by a diagonal binary matrix M, and for such matrices we can verify ‖M‖_p := sup_{‖x‖_p = 1} ‖Mx‖_p ≤ 1.

Notice that LayerNorm is not part of the residual branch, hence its Lipschitz continuity is not relevant for invertibility; rather, we can replace it with an invertible normalisation such as ActNorm (Kingma & Dhariwal, 2018). However, architectures that place LayerNorm inside the residual branch (termed pre-LN as opposed to the traditional post-LN in Figure 1) have become more prevalent in the literature (Wang et al., 2019; Xiong et al., 2020), and in this case it makes sense to investigate its Lipschitz continuity. We show that LayerNorm is Lipschitz in Appendix N, with a bound on its Lipschitz constant.

In the next section, we investigate the properties of invertible self-attention and how it compares with the standard dot-product self-attention; we replace DP-MHA in the Transformer with Contractive-L2-MHA, hence replacing the residual self-attention module with invertible self-attention. We are not interested in the modified Transformer per se, but rather in comparing the properties of invertible self-attention to standard self-attention — we only use the Transformer as a testbed for this purpose, since self-attention is commonly used in a Transformer. Given the theoretical focus of the paper, we believe that a more challenging application of invertible self-attention, such as normalising flow-based modelling, would be more suitable as a separate paper focused on that particular application.

5. Experimental Results

5.1. Asymptotic tightness of the upper bound on Lip∞(F)

[Figure 2 plots the bounds against N on a log-scale x-axis; legend: UB; LB: top 1 out of 50 random init; LB: top 5 out of 50 random init.]

Figure 2. Lower and upper bound on Lip∞(f) for L2-MHA f, with H = D = 1 and varying N.

A tight bound on the Lipschitz constant of self-attention is desirable for all listed applications in Section 1; it leads to tighter generalisation bounds, lighter constraints for provable robustness, and better expressiveness in residual flow models. Hence we investigate the tightness of our bound on the Lipschitz constant of L2-MHA. The Lipschitz constant is a supremum over the space of inputs X ∈ R^{N×D} (c.f. Equation (2)) and approximating it requires solving an intractable optimisation problem. Hence it is infeasible to estimate accurately in general, especially when X is high-dimensional. However, we may compute a lower bound on the Lipschitz constant by maximising the norm of the Jacobian ‖J_f(X)‖ with respect to X until convergence. This local optimum will form a lower bound by Theorem 2.1, and we can expect this lower bound to be fairly tight for the low-dimensional case, provided the optimisation is thorough.

We use this observation to provide empirical evidence for the asymptotic tightness of the upper bound on Lip∞(f) in Theorem 3.2. In Figure 2, we show the upper bound as well as the lower bound on Lip∞(f) obtained by optimising ‖J_f(X)‖_∞ with respect to X for L2-MHA f with 50 different random initialisations of X, with H = D = 1 and N varying between 100 and 1000. See Appendix H for further details. Note that we use a log-scale for the x-axis, and recall that the upper bound is O(log N − log log N), dominated by the O(log N) term for large N. Hence the plot for the upper bound shows a linear trend. We also observe that the slope of the lower bound is very similar, providing empirical evidence that the O(log N − log log N) upper bound is asymptotically tight.

There are at least two possible explanations for the gap between the upper and lower bounds. (1) The lower bound is only a local optimum — the true Lipschitz constant is a global optimum across inputs, which can be difficult to attain especially for high values of N. (2) The multiplicative constant of the upper bound may be loose. Assuming asymptotic tightness, it remains an open question whether the multiplicative constant can be tightened. We show the analogous plot for Lip_2(F) and discuss the results in Appendix J. Additionally in Appendix K, we show that optimising ‖J_f(X)‖_∞ w.r.t. X for DP-MHA f causes the norm to diverge, providing empirical verification of Theorem 3.1, that DP-MHA is indeed not Lipschitz.
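To illustrate the lower-bound idea on a toy scale (our own sketch under simplifying assumptions, not the authors' experimental setup of 50 restarts and N up to 1000), the following performs crude gradient ascent on ‖J_f(X)‖_∞ for the D = H = 1 L2 self-attention map, using finite differences for both the Jacobian and the objective's gradient; the scalar weight, step size, and iteration count are arbitrary.

```python
import numpy as np

def softmax_rows(Z):
    E = np.exp(Z - Z.max(axis=-1, keepdims=True))
    return E / E.sum(axis=-1, keepdims=True)

def f_l2(X, w=1.0):
    # L2 self-attention for D = H = 1 with tied queries/keys: f(X) = P X A, A = w^2
    q = w * X
    d = (q[:, None] - q[None, :]) ** 2           # pairwise squared distances
    P = softmax_rows(-d)
    return P @ X * (w ** 2)

def jac_inf_norm(X, eps=1e-5):
    # ||J_f(X)||_inf (max absolute row sum), with a central finite-difference Jacobian
    N = X.shape[0]
    J = np.zeros((N, N))
    for j in range(N):
        Xp, Xm = X.copy(), X.copy()
        Xp[j] += eps; Xm[j] -= eps
        J[:, j] = (f_l2(Xp) - f_l2(Xm)) / (2 * eps)
    return np.abs(J).sum(axis=1).max()

def lower_bound(N, steps=200, lr=0.1, seed=0):
    # gradient ascent on ||J_f(X)||_inf; any local optimum is a lower bound on Lip_inf(f)
    rng = np.random.default_rng(seed)
    X = rng.normal(size=N)
    for _ in range(steps):
        g = np.zeros(N)
        for j in range(N):
            Xp, Xm = X.copy(), X.copy()
            Xp[j] += 1e-3; Xm[j] -= 1e-3
            g[j] = (jac_inf_norm(Xp) - jac_inf_norm(Xm)) / 2e-3
        X = X + lr * g
    return jac_inf_norm(X)

print(lower_bound(N=20))   # a (local) lower bound on Lip_inf for this small instance
```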
5.2. Numerical invertibility of MHA residual map

[Figure 3 panels: Inverting L2-MHA Residual Map (left), Inverting DP-MHA Residual Map (right); y-axis: max reconstruction error (log scale), x-axis: fixed point iterations; curves for c = 0.90, 0.70, 0.50.]

Figure 3. Invertibility of g(x) = x + cf(x) where f is L2-MHA (left) and DP-MHA (right).

Recall from Section 4.1 that g(x) = x + f(x) is invertible if f is contractive. Hence if f is Contractive-L2-MHA, g is necessarily invertible. However, technically we do not disprove the invertibility of DP-MHA, since the converse does not hold in general, i.e. if f is DP-MHA, which we have shown is not Lipschitz hence not contractive, it may still be the case that g is invertible. To verify that DP-MHA (with the skip connection) is not invertible in practice, we compare the numerical invertibility of the residual map g(x) = x + cf(x) between the cases where f is L2-MHA and DP-MHA in Figure 3. For each, we take MHA with 8 heads and randomly initialised weights, and quantify the maximum reconstruction error across a batch of 128 inputs whose outputs are inverted via the fixed-point iteration described in Section 4.1. We use N = 64, D = 64, and c ∈ {0.5, 0.7, 0.9} (see Appendix I for analogous results for a wider range of N and D and for DP-MHA with trained weights). To highlight the difference between the two types of self-attention, recall in the proof of Theorem 3.1 (showing that DP-MHA is not Lipschitz) that when one of the inputs x_i is 0, some terms of the Jacobian grow with the sample variance of x_{≠i}. Hence we check numerical invertibility at a set of N inputs where x_i = 0 and x_{≠i} are chosen uniformly at random.

In Figure 3, we see that DP-MHA is not invertible whereas L2-MHA is invertible for sufficiently small c. This shows how not having the theoretical guarantee of f being contractive can cost us invertibility in practice. We note that the figure shows local invertibility at the sampled inputs, as opposed to global invertibility across the whole input space, yet this clearly highlights the difference between the two choices of self-attention. Experiments with the globally invertible self-attention obtained by normalising with the Lipschitz upper bound are provided in the next section.

[Figure 4 panels, left to right: LSTM; Transformer (DP); Transformer (L2); Transformer (L2), W^Q = W^K; Transformer (Contractive-L2). Each panel plots test NLL against training epoch, with curves for different numbers of layers (1–14) and LSTM hidden sizes (64–512).]

Figure 4. Test NLL during training for various LSTM/Transformer models on PTB character level language modelling.

5.3. Expressiveness of L2-MHA and invertible self-attention

A natural question to ask is: how does the expressiveness of L2-MHA and Contractive-L2-MHA (that leads to invertible self-attention with the skip connection) compare with the original DP-MHA? We expect that the Lipschitz constraint will limit the expressiveness of the Transformer, and would like to find out by how much. We investigate this by comparing the performance of the original Transformer and the Transformer with invertible self-attention (c.f. Figure 1) at character-level language modelling on the Penn Treebank dataset (Marcus et al., 1993). We compare the test negative log-likelihood (NLL) of a baseline LSTM, the original Transformer (DP-MHA), and a series of models between the original Transformer and the Transformer with invertible self-attention (Contractive-L2-MHA), making one change at a time and tuning the hyperparameters on a validation set. For Contractive-L2-MHA, we normalise F = L2-MHA by the bound on Lip∞(F) as it is tighter than the bound on Lip_2(F). During training we backpropagate through these contractive blocks F / Lip∞(F) (including the denominator) to update the model parameters. We found that only backpropagating through the numerator (i.e. applying stop-gradient to the denominator) gave slightly worse performance. See Appendix H for experimental details.
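To illustrate how such a normalisation can be computed (a sketch under our own assumptions, not the authors' implementation), the code below evaluates the Theorem 3.2 bound on Lip∞(F) from an L2-MHA layer's weights, using SciPy's Lambert W for φ⁻¹, and then divides by it; the weight shapes and values are arbitrary, the exact form of the bound follows our reconstruction of the theorem above, and the l2_mha call in the final comment is a hypothetical placeholder for a full multi-head implementation.

```python
import numpy as np
from scipy.special import lambertw

def inf_norm(W):
    return np.abs(W).sum(axis=1).max()           # operator inf-norm: max absolute row sum

def lip_inf_upper_bound(WQ_heads, WV_heads, WO, N):
    """Theorem 3.2 bound on Lip_inf of L2-MHA with tied W^Q = W^K (as reconstructed here)."""
    Dh = WQ_heads[0].shape[1]                    # D/H
    phi_inv = np.real(lambertw((N - 1) / np.e))  # phi^{-1}(N-1), with phi(x) = x exp(x+1)
    term_q = max(inf_norm(WQ) * inf_norm(WQ.T) for WQ in WQ_heads)
    term_v = max(inf_norm(WV.T) for WV in WV_heads)
    return (4 * phi_inv + 1 / np.sqrt(Dh)) * inf_norm(WO.T) * term_q * term_v

rng = np.random.default_rng(0)
N, D, H = 128, 64, 8
WQ_heads = [0.1 * rng.normal(size=(D, D // H)) for _ in range(H)]
WV_heads = [0.1 * rng.normal(size=(D, D // H)) for _ in range(H)]
WO = 0.1 * rng.normal(size=(D, D))
bound = lip_inf_upper_bound(WQ_heads, WV_heads, WO, N)
print(bound)
# Contractive-L2-MHA: divide the layer output by the bound so that Lip_inf <= 1, e.g.
# contractive_out = l2_mha(X, WQ_heads, WV_heads, WO) / bound
```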
The results are shown in Figure 4. The first plot shows the best performing LSTM reaching a test NLL of around 1.0, and the second plot shows the best performing Transformer reaching a slightly improved performance for 3–5 layers of Transformer blocks. We observe instabilities in training for a higher number of layers, requiring careful tuning of the learning rate schedule for stability at the cost of performance, a commonly observed phenomenon in the literature of deep Transformer architectures (Bapna et al., 2018; Parisotto et al., 2020). The third plot shows results for the Transformer with DP-MHA replaced with L2-MHA but without tying W^Q and W^K, and we observe a very similar test performance. The fourth plot shows the change when we further tie the query and key weights (making W^Q = W^K); we see that there is a small degradation in performance. Here the number of trainable parameters has been reduced, but in Appendix L we show that matching parameter count does not help performance, suggesting that the reduction in performance when tying queries and keys is not solely due to having fewer parameters. We note that performance saturates at around 5 layers for each Transformer model so far. On the rightmost plot we show results when further dividing self-attention in each block by the upper bound on Lip∞(F), to obtain invertible self-attention. This does give reduced performance for the same number of layers, but we can attain similar performance with more layers, no longer saturating at 5 layers.

Thus we conclude the following. (1) Replacing the dot-product with the L2 distance incurs hardly any loss in expressiveness. (2) Tying the query and key weights to obtain Lipschitz self-attention incurs a small loss in expressiveness.

Number of Layers               |   2   |   4   |   6   |   8   |  10   |  12   |  14   |  16   |  18
Transformer (DP)               | 1.061 | 1.032 | 1.021 | 1.017 | 1.025 |   -   |   -   |   -   |   -
Transformer (L2), W^Q = W^K    | 1.168 | 1.040 | 1.023 | 1.024 | 1.019 | 1.008 | 1.018 | 1.027 | 1.034
Transformer (Contractive-L2)   | 1.246 | 1.135 | 1.103 | 1.079 | 1.072 | 1.060 | 1.039 | 1.029 | 1.031

Table 1. Test NLL for Transformer models trained with fixed learning rate on PTB character level language modelling.

(3) Dividing by the upper bound on Lip∞(F) to obtain invertible self-attention incurs a noticeable loss in expressiveness, but also has a stabilising effect on the optimisation of the Transformer, thus allowing one to compensate for the apparent loss in expressiveness by increasing the number of layers.

5.4. Training Stability of DP-MHA vs L2-MHA

In Figure 5, we compare the output variance of trained L2-MHA against trained DP-MHA, with weights from the one-layer Transformer (L2), W^Q = W^K model and (DP) model used for Figure 4 respectively. We take the same distribution of inputs as used for the numerical invertibility experiment in Section 5.2, and show the histogram of inputs and outputs after flattening the input/output tensors. We see that the range of outputs remains similar to the range of inputs for Lipschitz L2-MHA, whereas for DP-MHA the outputs have a much wider range, because the Jacobian norm is large for DP-MHA at these inputs.

[Figure 5 panels, left to right: Inputs; Outputs: L2-MHA (trained); Outputs: DP-MHA (trained).]

Figure 5. Histogram showing distribution of inputs/outputs of trained L2-MHA and DP-MHA.

In practice, this leads to instabilities in training for DP-MHA, hence requiring careful tuning of the learning rate schedule for training deeper Transformer models: linear warmup and square root decay, as detailed in Appendix H. We investigate the behaviour of the different Transformer models on the above PTB task when using a fixed learning rate.
We observe that DP-MHA fails to train at all beyond 10 layers, whereas both L2-MHA (W^Q = W^K) (i.e. Lipschitz L2-MHA but not contractive) and Contractive-L2-MHA show stable training for up to 18 layers (see Appendix M for the training curves). This was the deepest model we could fit on a single GPU, and we expect to be able to train even deeper models with these two. In Table 1 we show the best test NLL across training for each of the Transformer models. Note that for DP-MHA training becomes unstable beyond 10 layers, so we are only able to provide results up to 10 layers. The generalisation performance of the best model for each setting of self-attention is similar.

6. Conclusion and Discussion

We have shown that the widely used dot-product self-attention is not Lipschitz, and that the proposed L2 self-attention is Lipschitz, by deriving an O(log N − log log N) Lipschitz bound for p = ∞ and an O(sqrt(N)(log N − log log N)) bound for p = 2, where N is the input sequence length. We also provided empirical evidence of the asymptotic tightness of the bound for p = ∞. We demonstrated that Lipschitz-constrained self-attention can be used to formulate invertible self-attention, which we experimentally evaluated on a character-level language modelling task. And finally, we also showed that L2-MHA is more stable during training, allowing the use of a fixed learning rate for stable training of deep architectures.

Our approach to Lipschitz self-attention has been to replace the dot-product kernel with an L2 kernel. An alternative would be to constrain the inputs of self-attention to be bounded; if the input space is compact, e.g. [0, 1]^{N×D}, any continuously differentiable function is Lipschitz, including dot-product self-attention. However, while being simple to implement, this solution has its own difficulties. First, it makes the Lipschitz constant depend on the range of the input, and thus obtaining a tight bound would require non-trivial mathematical work. We stress that a guarantee that the function is Lipschitz does not tell us anything about its Lipschitz constant; without a tight Lipschitz bound, the true Lipschitz constant can be very large, at which point it is unhelpful that the function is Lipschitz. Second, since self-attention is typically applied at multiple layers within a model (e.g. Transformer), the input to each self-attention will live in a different compact set that depends on the parameters of the previous layers, complicating the analysis for subsequent layers. A solution is to constrain the inputs of each layer to be in the same compact set, e.g. by passing them through a sigmoid non-linearity. This however can have undesirable side effects such as vanishing gradients when the sigmoids are saturated. Despite these difficulties, this could be a worthwhile alternative route for obtaining Lipschitz self-attention to explore in the future.

Having a provably Lipschitz self-attention module at our disposal makes it possible to use Transformer-based architectures in applications requiring Lipschitz constraints, while enjoying theoretical guarantees. A natural application of Lipschitz self-attention is for residual flows (Behrmann et al., 2019), and for parameterising Neural ODEs (Chen et al., 2018) where a Lipschitz vector field guarantees the existence of a unique solution to the ODE for all times. These models can be used for density estimation and generative modelling of sets. Another interesting direction for future work would be to analyse different variants of self-attention based on kernels other than dot-product and L2, as Tsai et al. (2019) do from an experimental perspective, for which we believe the mathematical tools developed in this paper may aid the analysis.

Acknowledgements

We would like to thank Adam Kosiorek, Arnaud Doucet, Yee Whye Teh, Michalis Titsias, Emilien Dupont and Theophane Weber for helpful discussion and feedback.

References

Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G. S., Davis, A., Dean, J., Devin, M., et al. TensorFlow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467, 2016.

Anil, C., Lucas, J., and Grosse, R. Sorting out Lipschitz function approximation. In International Conference on Machine Learning, pp. 291–301, 2019.

Bapna, A., Chen, M. X., Firat, O., Cao, Y., and Wu, Y. Training deeper neural machine translation models with transparent attention. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 3028–3033, 2018.

Behrmann, J., Grathwohl, W., Chen, R. T. Q., Duvenaud, D., and Jacobsen, J.-H. Invertible residual networks. In Proceedings of the 36th International Conference on Machine Learning, 2019.

Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. Language models are few-shot learners. arXiv preprint arXiv:2005.14165, 2020.

Chen, R. T. Q., Rubanova, Y., Bettencourt, J., and Duvenaud, D. Neural ordinary differential equations. In Advances in Neural Information Processing Systems, pp. 6571–6583, 2018.

Chen, R. T. Q., Behrmann, J., Duvenaud, D., and Jacobsen, J.-H. Residual flows for invertible generative modeling. In Advances in Neural Information Processing Systems, 2019.

Cisse, M., Bojanowski, P., Grave, E., Dauphin, Y., and Usunier, N. Parseval networks: Improving robustness to adversarial examples. In Proceedings of the 34th International Conference on Machine Learning, pp. 854–863, 2017.

Corless, R. M., Gonnet, G. H., Hare, D. E., Jeffrey, D. J., and Knuth, D. E. On the Lambert W function. Advances in Computational Mathematics, 5(1):329–359, 1996.

Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics, pp. 4171–4186, 2019.

Fazlyab, M., Robey, A., Hassani, H., Morari, M., and Pappas, G. Efficient and accurate estimation of Lipschitz constants for deep neural networks. In Advances in Neural Information Processing Systems, pp. 11423–11434, 2019.

Federer, H. Geometric Measure Theory. Classics in Mathematics. Springer Berlin Heidelberg, 1969. ISBN 9783642620102.

Glorot, X. and Bengio, Y. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp. 249–256, 2010.

Grathwohl, W., Chen, R. T. Q., Bettencourt, J., Sutskever, I., and Duvenaud, D. FFJORD: Free-form continuous dynamics for scalable reversible generative models. In International Conference on Learning Representations, 2019.

Huang, C.-Z. A., Vaswani, A., Uszkoreit, J., Simon, I., Hawthorne, C., Shazeer, N., Dai, A. M., Hoffman, M. D., Dinculescu, M., and Eck, D. Music Transformer. In International Conference on Learning Representations, 2019.

Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations, 2015.

Kingma, D. P. and Dhariwal, P. Glow: Generative flow with invertible 1 × 1 convolutions. In Advances in Neural Information Processing Systems, pp. 10215–10224, 2018.

Latorre, F., Rolland, P., and Cevher, V. Lipschitz constant estimation of neural networks via sparse polynomial optimization. In International Conference on Learning Representations, 2020.

Marcus, M. P., Marcinkiewicz, M. A., and Santorini, B. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313–330, 1993.

Mises, R. and Pollaczek-Geiringer, H. Praktische Verfahren der Gleichungsauflösung. ZAMM - Journal of Applied Mathematics and Mechanics / Zeitschrift für Angewandte Mathematik und Mechanik, 9(2):152–164, 1929.

Miyato, T., Kataoka, T., Koyama, M., and Yoshida, Y. Spectral normalization for Generative Adversarial Networks. In International Conference on Learning Representations, 2018.

Parisotto, E., Song, H. F., Rae, J. W., Pascanu, R., Gulcehre, C., Jayakumar, S. M., Jaderberg, M., Kaufman, R. L., Clark, A., Noury, S., Botvinick, M. M., Heess, N., and Hadsell, R. Stabilizing Transformers for reinforcement learning. In International Conference on Machine Learning, 2020.

Parmar, N., Ramachandran, P., Vaswani, A., Bello, I., Levskaya, A., and Shlens, J. Stand-alone self-attention in vision models. In Advances in Neural Information Processing Systems, pp. 68–80, 2019.

Peyré, G. and Cuturi, M. Computational optimal transport. Foundations and Trends in Machine Learning, 11(5-6):355–607, 2019.

Sokolić, J., Giryes, R., Sapiro, G., and Rodrigues, M. R. Robust large margin deep neural networks. IEEE Transactions on Signal Processing, 65(16):4265–4280, 2017.

Tsai, Y.-H. H., Bai, S., Yamada, M., Morency, L.-P., and Salakhutdinov, R. Transformer dissection: An unified understanding for Transformer's attention via the lens of kernel. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, pp. 4344–4353, 2019.

Tsuzuku, Y., Sato, I., and Sugiyama, M. Lipschitz-margin training: Scalable certification of perturbation invariance for deep neural networks. In Advances in Neural Information Processing Systems, pp. 6541–6550, 2018.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998–6008, 2017.

Virmaux, A. and Scaman, K. Lipschitz regularity of deep neural networks: analysis and efficient estimation. In Advances in Neural Information Processing Systems, pp. 3835–3844, 2018.

Vuckovic, J., Baratin, A., and Tachet des Combes, R. A mathematical theory of attention. arXiv preprint arXiv:2007.02876, 2020.

Wang, Q., Li, B., Xiao, T., Zhu, J., Li, C., Wong, D. F., and Chao, L. S. Learning deep transformer models for machine translation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 1810–1822, 2019.

Wang, X., Girshick, R., Gupta, A., and He, K. Non-local neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7794–7803, 2018.

Xiong, R., Yang, Y., He, D., Zheng, K., Zheng, S., Xing, C., Zhang, H., Lan, Y., Wang, L., and Liu, T. On layer normalization in the transformer architecture. In International Conference on Machine Learning, pp. 10524–10533. PMLR, 2020.

Zhang, H., Goodfellow, I., Metaxas, D., and Odena, A. Self-attention Generative Adversarial Networks. In Proceedings of the 36th International Conference on Machine Learning, pp. 7354–7363, 2019.

Appendix

A. Useful Identities for deriving Jacobian expressions

In this section, we list some useful identities for deriving the Jacobians of the expressions in the paper.

Suppose λ is a scalar, u, v, x ∈ R^{n×1} are column vectors, and f(u) is a vector-valued function. We use the standard convention that for a ∈ R^m, b ∈ R^n, we have ∂a/∂b ∈ R^{m×n}. Then we have the following identities:

• ∂[λu]/∂x = λ ∂u/∂x + u ∂λ/∂x

• ∂f(u)/∂x = (∂f(u)/∂u)(∂u/∂x)

• ∂[u^T v]/∂x = u^T ∂v/∂x + v^T ∂u/∂x

Note ∂λ/∂x is a row vector, so u ∂λ/∂x is a matrix. The Jacobian of the softmax is also well-known. Suppose v = softmax(u) ∈ R^{n×1}. Then

    ∂v/∂u = diag(v) − v v^T,

the n × n matrix with entries v_i(1 − v_i) on the diagonal and −v_i v_j off the diagonal.
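As a quick sanity check of this identity (our own snippet, not part of the paper), one can compare diag(v) − v vᵀ against a finite-difference Jacobian of the softmax:

```python
import numpy as np

def softmax(u):
    e = np.exp(u - u.max())
    return e / e.sum()

u = np.random.default_rng(0).normal(size=5)
v = softmax(u)
J_closed = np.diag(v) - np.outer(v, v)            # diag(v) - v v^T

eps = 1e-6
J_fd = np.column_stack([(softmax(u + eps * np.eye(5)[j]) - softmax(u - eps * np.eye(5)[j])) / (2 * eps)
                        for j in range(5)])
print(np.abs(J_closed - J_fd).max())              # agreement up to finite-difference error
```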

B. Power Iteration

Although ‖W‖_∞ can be computed efficiently in O(nm) time for W ∈ R^{m×n}, naively computing ‖W‖_2 = σ_max(W) := sqrt(λ_max(W^T W)) requires O(n^3) operations. (By λ_max(A) we denote the greatest eigenvalue of a symmetric matrix A.) We can however obtain an underestimate σ̃(W) via power iteration:

    b_{k+1} = W^T W b_k / ‖W^T W b_k‖_2,    σ̃_k(W) = sqrt( b_k^T W^T W b_k / b_k^T b_k ),    (9)

with each iteration taking O(n^2) time. Then using K ≪ n iterations gives us an underestimate σ̃_K in O(Kn^2) time. Since this is an underestimate, the resulting approximation to the Lipschitz constant of the linear map will not be an upper bound. However the number of power iterations is usually chosen so that σ̃ is accurate enough — K = 5 is shown to be sufficient in the context of fully connected networks or convolutions considered by Behrmann et al. (2019). The iteration will converge if W^T W has an eigenvalue that is strictly greater in magnitude than its other eigenvalues, and the starting vector b_0 has a nonzero component in the direction of an eigenvector associated with the dominant eigenvalue. This happens with probability 1 if b_0 is chosen at random, and the convergence is geometric with ratio |λ_2/λ_max| where λ_2 is the eigenvalue with second largest magnitude (Mises & Pollaczek-Geiringer, 1929).
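A minimal NumPy sketch of this estimator (ours, not the paper's implementation); the matrix and iteration count are arbitrary, and the result underestimates σ_max as discussed above.

```python
import numpy as np

def power_iteration_sigma_max(W, n_iter=5, seed=0):
    """Estimate sigma_max(W) via power iteration on W^T W (Equation (9))."""
    rng = np.random.default_rng(seed)
    b = rng.normal(size=W.shape[1])
    for _ in range(n_iter):
        Wb = W.T @ (W @ b)
        b = Wb / np.linalg.norm(Wb)
    return np.sqrt(b @ W.T @ W @ b / (b @ b))

W = np.random.default_rng(1).normal(size=(50, 30))
print(power_iteration_sigma_max(W))                  # underestimate of sigma_max
print(np.linalg.svd(W, compute_uv=False)[0])         # exact largest singular value
```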

C. Proof of Theorem 3.1 for General D

Theorem 3.1. DP-MHA is not Lipschitz for any vector p-norm ‖·‖_p with p ∈ [1, ∞].

Proof. The mapping f can be written as

    f(X) = PX = softmax(X A^T X^T) X = [f_1(X)^T; ...; f_N(X)^T] ∈ R^{N×D},    (10)

where A = W^K W^{Q T} / sqrt(D/H) ∈ R^{D×D} and f_i(X) = Σ_{j=1}^N P_ij x_j with P_i: = softmax(X A x_i)^T. Hence f can be interpreted as a map of each x_i to a point in the convex hull of x_1, ..., x_N. Since f is a map from R^{N×D} to R^{N×D}, its Jacobian is

    J_f = [J_11 ... J_1N; ...; J_N1 ... J_NN] ∈ R^{ND×ND},    (11)

where J_ij = ∂f_i(X)/∂x_j ∈ R^{D×D}. By taking partial derivatives we can show that J_ij = X^T P^(i) [E_ji X A^T + X A δ_ij] + P_ij I, where E_ij ∈ R^{N×N} is a binary matrix with zeros everywhere except the (i, j)th entry, δ_ij is the Kronecker delta, and P^(i) := diag(P_i:) − P_i:^T P_i:. So for i = j:

    J_ii = X^T P^(i) E_ii X A^T + X^T P^(i) X A + P_ii I
         = P_ii (x_i − Σ_k P_ik x_k) x_i^T A^T + X^T P^(i) X A + P_ii I.    (12)

For the last equality, note E_ii X has all rows equal to zero except for the ith row given by x_i^T. We can then verify that X^T P^(i) E_ii X simplifies to P_ii (x_i − Σ_k P_ik x_k) x_i^T.

For vector p-norms, ‖J_f‖_p is bounded if and only if its entries are bounded, by definition of the operator norm. The entries of X^T P^(i) X A are bounded for arbitrary A only if the entries of X^T P^(i) X are bounded. So let us investigate the entries of this D × D matrix. Writing out each term of the matrix, we observe that it is in fact a covariance matrix of a discrete distribution. Specifically:

    [X^T P^(i) X]_{lm} = Σ_k P_ik x_kl x_km − (Σ_k P_ik x_kl)(Σ_k P_ik x_km) = Cov(X_l, X_m),    (13)

where X is a discrete distribution with support at the inputs {x_1, ..., x_N} and probability mass function given by their softmax probabilities P(X = x_j) = P_ij. A consequence of this interpretation is that P^(i) is positive semi-definite (PSD) since for D = 1, Equation (13) becomes X^T P^(i) X = Var(X) ≥ 0, with equality if and only if the x_j are all equal.

We use this observation to show that the terms of J_ii are unbounded, and so DP-MHA is not Lipschitz. Consider the case x_i = 0. Then P_i: = softmax(X A x_i)^T = (1/N) 1^T, i.e. we have uniform attention regardless of x_{≠i}. The first term of J_ii in Equation (12) disappears since x_i = 0, and the last term becomes (1/N) I. For the second term, the entries [X^T P^(i) X]_{ll} = Var(X_l) are unbounded since the latter is equal to the sample variance of x_{1l}, ..., x_{Nl}, which can be arbitrarily large.

Note that we have shown that single-head dot-product self-attention (H = 1) is not Lipschitz, but it is clear that this implies multihead self-attention DP-MHA is also not Lipschitz, since the output of multihead attention is a linear combination of the outputs of each head.

D. Bias term in DP Self-Attention

A natural question to ask is whether we can add bias terms b^Q to x_i^T W^Q and b^K to x_j^T W^K to resolve the issue of attention weights P_i: becoming uniform when x_i = 0. The answer is no in general. It can again be shown that J_ii is unbounded when x_i is chosen such that x_i^T W^Q + b^Q = 0 (such a choice is possible assuming W^Q is full rank; full-rank matrices form a dense set in R^{D×D/H}). Then P_i: = (1/N) 1^T again, and the diagonal entries of X^T P^(i) X are unbounded.

E. Efficient Computation of L2 Self-Attention

Dot-product self-attention only requires a few matrix multiplications to compute the logits (i.e. the inputs to the softmax) between all pairs of inputs, without having to loop over pairs, hence it can be computed efficiently. Similarly, we can show that L2 self-attention can also be computed in an efficient manner. Using the identity ‖a − b‖_2^2 = ‖a‖_2^2 − 2 a^T b + ‖b‖_2^2, we can compute the logits of L2 attention between all pairs via matrix multiplications and computation of row-wise L2 norms, with negligible overhead compared to dot-product self-attention. Specifically, for L2 self-attention we can show that

    P = softmax( −( ‖X W^Q‖²_row 1^T − 2 X W^Q (X W^K)^T + 1 (‖X W^K‖²_row)^T ) / sqrt(D/H) ),    (14)

where ‖A‖²_row applies the squared L2 norm to each row of A, so if A ∈ R^{m×n} then ‖A‖²_row ∈ R^m. In Table 2 we show the wall-clock training times for the Transformer models with different attention functions and a varying number of layers. It is evident that the differences between the models are rather small.
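A small NumPy check of this identity (our own snippet, not from the paper): the vectorised logits of Equation (14) match explicitly computed pairwise squared distances; the shapes and weights are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, Dh = 7, 16, 4
X = rng.normal(size=(N, D))
WQ = rng.normal(size=(D, Dh))
WK = rng.normal(size=(D, Dh))

Q, K = X @ WQ, X @ WK
sq_q = (Q ** 2).sum(axis=1, keepdims=True)          # ||X W^Q||^2_row as a column vector
sq_k = (K ** 2).sum(axis=1, keepdims=True)          # ||X W^K||^2_row as a column vector
logits_vec = -(sq_q - 2 * Q @ K.T + sq_k.T) / np.sqrt(Dh)   # Equation (14), before softmax

logits_loop = np.array([[-np.sum((Q[i] - K[j]) ** 2) / np.sqrt(Dh) for j in range(N)]
                        for i in range(N)])
print(np.abs(logits_vec - logits_loop).max())       # ~1e-12
```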

                              | 1 Layer | 2 Layers | 3 Layers | 4 Layers | 5 Layers
Transformer (DP)              |   37    |    56    |    77    |    92    |   110
Transformer (L2)              |   35    |    56    |    73    |    99    |   115
Transformer, W^Q = W^K (L2)   |   39    |    58    |    79    |    91    |   108
Transformer (Contractive-L2)  |   37    |    60    |    81    |   102    |   127

Table 2. Wall clock training times for one epoch of training (seconds)

F. Proof of Theorem 3.2

Recall the formulation of L2-MHA:

    F : R^{N×D} → R^{N×D}
    F(X) = [f^1(X) W^{V,1}, ..., f^H(X) W^{V,H}] W^O
    f^h(X) = P^h X A_h
    P^h_ij ∝ exp(L^h_ij) := exp( −‖x_i^T W^{Q,h} − x_j^T W^{K,h}‖_2^2 / sqrt(D/H) ),    Σ_j P^h_ij = 1,

where we have that W^{Q,h}, W^{K,h}, W^{V,h} ∈ R^{D×D/H}, W^O ∈ R^{D×D}, P^h ∈ R^{N×N} and A_h := W^{Q,h} W^{Q,h T} / sqrt(D/H) ∈ R^{D×D}, and the softmax is applied to each row of the input matrix. Recall Equation (14):

    P^h = softmax( −( ‖X W^{Q,h}‖²_row 1^T − 2 X W^{Q,h} (X W^{K,h})^T + 1 (‖X W^{K,h}‖²_row)^T ) / sqrt(D/H) ).

F.1. L2 self-attention is not Lipschitz for general W^Q, W^K

Let us first look at the case of H = 1 and suppress the index h to reduce clutter. Consider the map f̃(X) := PX, so f(X) = f̃(X)A. We need f̃ to be Lipschitz for f and hence F to be Lipschitz. Note that P is defined as:

    P_ij ∝ exp(L_ij) := exp( −‖x_i^T W^Q − x_j^T W^K‖_2^2 / sqrt(D/H) )

and the normalisation constant satisfies Σ_j P_ij = 1, for P ∈ R^{N×N}, X ∈ R^{N×D}. For L2 self-attention, we may take partial derivatives and use the chain rule to show that the Jacobian of f̃ is:

    J_f̃ = [J̃_11 ... J̃_1N; ...; J̃_N1 ... J̃_NN] ∈ R^{ND×ND}    (15)

with

    J̃_ij = X^T P^(i) ∂L_i:/∂x_j + P_ij I ∈ R^{D×D}    (16)

where

    ∂L_i:/∂x_j = (2/sqrt(D/H)) [ (X W^K − 1 x_i^T W^Q) W^{Q T} δ_ij + (E_ji X W^Q − E_jj X W^K) W^{K T} ]    (17)

and

    P^(i) := diag(P_i:) − P_i:^T P_i: ∈ R^{N×N},   i.e. the N × N matrix with entries P_ik(1 − P_ik) on the diagonal and −P_ik P_il off the diagonal,

    P_ij = exp( −‖x_i^T W^Q − x_j^T W^K‖_2^2 ) / Σ_k exp( −‖x_i^T W^Q − x_k^T W^K‖_2^2 ).

Recall that E_ji ∈ R^{N×N} is a binary matrix with zeros everywhere except the (j, i)th entry. Hence E_ji X has all rows equal to zero except for the jth row given by x_i^T. We can then verify:

    X^T P^(i) E_ji X = P_ij (x_j − Σ_k P_ik x_k) x_i^T.    (18)

Also note P^(i) is symmetric, and each row/column sums to 0, i.e. P^(i) 1 = 0 and 1^T P^(i) = 0. Hence we may simplify the Jacobian terms as follows:

    J̃_ii = (2/sqrt(D/H)) [ X^T P^(i) (X W^K − 1 x_i^T W^Q) W^{Q T} + X^T P^(i) E_ii X (W^Q − W^K) W^{K T} ] + P_ii I
         = (2/sqrt(D/H)) [ X^T P^(i) (X W^K − 1 x_i^T W^Q) W^{Q T} + P_ii (x_i − Σ_k P_ik x_k) x_i^T (W^Q − W^K) W^{K T} ] + P_ii I
         = (2/sqrt(D/H)) [ X^T P^(i) X W^K W^{Q T} + P_ii (x_i − Σ_k P_ik x_k) x_i^T (W^Q − W^K) W^{K T} ] + P_ii I,    (19)

and for i ≠ j:

    J̃_ij = (2/sqrt(D/H)) X^T P^(i) (E_ji X W^Q − E_jj X W^K) W^{K T} + P_ij I
         = (2/sqrt(D/H)) P_ij (x_j − Σ_k P_ik x_k)(x_i^T W^Q − x_j^T W^K) W^{K T} + P_ij I.    (20)

We are now ready to show that f̃ is not Lipschitz for general W^Q, W^K:

Lemma F.1. If W^K ∈ R^{D×D/H} is full rank (i.e. full column rank), and W^Q ≠ W^K, then J̃_ij has terms that are unbounded for i ≠ j, hence f̃ is not Lipschitz.

> D D ˜ K P > Q > K H × H Proof. Let us investigate the expression Kij := PijW (xj − k Pikxk)(xi W − xj W ) ∈ R for i 6= j, which is related to J˜ij as follows by Equation (20):

! K> 2 K> W J˜ij = K˜ij + PijI W . pD/H

K It suffices to show that K˜ij is unbounded to show that J˜ij is unbounded, since W is full rank and Pij ∈ [0, 1]. > > Q > K Let yj = xi W − xj W . Then we have:

X Q> K> X Q> K> yj − Pikyk = W xi − W xj − Pik(W xi − W xk) k k Q> K> Q> X K> = W xi − W xj − (W xi − PikW xk) k K> X = −W (xj − Pikxk). k

˜ P > D/H K Q K Hence Kij = −Pij(yj − k Pikyk)yj . Note yi can take an arbitrary value in R , since W 6= W and W is full-rank. The Lipschitz Constant of Self-Attention

K For all j 6= i, let us choose xj such that yj = −yi. This is possible for any value of yi since W is full-rank. Note 2 2 exp(−kyj k2) 1 yj = −yi and not yi. We then have that kyjk2 is equal for all j, hence Pij := P 2 = for all j. Then for k exp(−kykk2) N i 6= j, K˜ij simplifies to

1  1  2N − 2 K˜ = − −y − (N − 2)(−y ) (−y )> = − y y> ij N i N i i N 2 i i

D/H whose entries are unbounded since yi can be any vector in R (note we assume N ≥ 2 for self-attention to be well-defined, hence 2N − 2 6= 0).

The intuition for this result is as follows: a reason for DP-MHA not being Lipschitz is that for x_i = 0, the attention weights P_ij become uniform regardless of the values of x_j for j ≠ i. A similar issue arises for L2-MHA with W^Q ≠ W^K and full-rank W^K, as shown above: given any x_i, we can choose x_j such that the P_ij become uniform.

F.2. L2 self-attention is Lipschitz for W^Q = W^K

Hence we impose the restriction that W^K = W^Q. With this assumption we have

    P_ij ∝ exp( −‖(x_i − x_j)^T √A‖_2^2 )    (21)

where A = W^Q W^{Q T} / sqrt(D/H) ∈ R^{D×D} and √A is chosen such that A = √A √A^T, in particular √A := W^Q / (D/H)^{1/4}. The terms in the Jacobian of f̃ simplify to:

    J̃_ii = 2 X^T P^(i) X A + P_ii I    (note P^(i) 1 = 0),    (22)
    J̃_ij = 2 P_ij (x_j − Σ_k P_ik x_k)(x_i − x_j)^T A + P_ij I    for i ≠ j.    (23)

Let the Jacobian of f(X) be:

    J_f = [J_11 ... J_1N; ...; J_N1 ... J_NN] ∈ R^{ND×ND}.    (24)

Since f(X) = f̃(X) A, and by the chain rule ∂[A^T f̃_i(X)]/∂x_j = A^T ∂f̃_i(X)/∂x_j = A ∂f̃_i(X)/∂x_j (by symmetry of A), we have that J_ij = A J̃_ij. Hence

    J_ii = 2 A X^T P^(i) X A + P_ii A    (note P^(i) 1 = 0),    (25)
    J_ij = 2 P_ij A (x_j − Σ_k P_ik x_k)(x_i − x_j)^T A + P_ij A    for i ≠ j.    (26)

Noting Lip_p(f) = sup_X ‖J_f(X)‖_p, we would like to upper bound ‖J_f‖_p.

F.2.1. Upper bound on Lip∞(F) for L2-MHA

Consider the choice p = ∞, where kJf k∞ is the maximum absolute row sum of Jf . A key observation is that if we can bound the ∞-norm of the Jacobian of fi, a single output of f, (i.e. a single block row k[Ji1, ..., JiN ]k∞ of Jf ) then this is also a bound on kJf k∞ due to permutation equivariance of self-attention; all block rows have the same maximal k · k∞ when each is optimised over the input X. Using this, we can prove that kJf k∞ admits an upper bound that is O(log N − log log N). Below we state and prove lemmas that lead to the proof of this upper bound. √ √ √ > > (i) First we analyse the term√ A X P X A, that appears in the first term of Jii. Note that for Y := X A, so that the > > rows of Y are yi := xi A, we have √ √ > > (i) > (i) A X P X A = Y P Y = Cov(Y) (27) The Lipschitz Constant of Self-Attention

where \mathbb{P}(Y = y_j) = P_{ij} = \exp(-\|y_j - y_i\|_2^2) / \sum_k \exp(-\|y_k - y_i\|_2^2). The last equality uses the observation in Equation (7). The central inequality used throughout the proof of the main theorem is the following:

Lemma F.2. \mathrm{Tr}(\mathrm{Cov}(Y)) = \sum_j P_{ij} \|y_j - \sum_k P_{ik} y_k\|_2^2 \leq \sum_j P_{ij} \|y_j - y_i\|_2^2 \leq \phi^{-1}(N - 1), where \phi(c) = c \exp(c + 1) is a one-dimensional invertible function on \mathbb{R}_{\geq 0}.

Proof. The first equality holds since \mathrm{Tr}(\mathrm{Cov}(Y)) = \sum_j \mathrm{Cov}(Y)_{jj} = \sum_j \mathrm{Var}(Y_j) = \sum_j \mathbb{E}[(Y_j - \mathbb{E}[Y_j])^2]. The next inequality holds since \mathrm{Var}(Y_j) = \mathrm{Var}(\bar{Y}_j) = \mathbb{E}[\bar{Y}_j^2] - \mathbb{E}[\bar{Y}_j]^2 \leq \mathbb{E}[\bar{Y}_j^2], where \bar{Y} = Y - y_i. The final inequality can be proved as follows. We would like to bound

\sum_j P_{ij} \|y_j - y_i\|_2^2 = \frac{\sum_j \|y_j - y_i\|_2^2 \exp(-\|y_j - y_i\|_2^2)}{\sum_k \exp(-\|y_k - y_i\|_2^2)} = \frac{\sum_j z_j^2 \exp(-z_j^2)}{\sum_k \exp(-z_k^2)} \qquad (28)

where z_j := \|y_j - y_i\|_2 (hence z_i = 0). Define:

g(z) := \frac{\sum_j z_j^2 \exp(-z_j^2)}{\sum_k \exp(-z_k^2)} = \frac{\sum_{j \neq i} z_j^2 \exp(-z_j^2)}{1 + \sum_{k \neq i} \exp(-z_k^2)}. \qquad (29)

First note that as z_j \to \infty, \exp(-z_j^2) \to 0 exponentially fast, causing the product z_j^2 \exp(-z_j^2) \to 0. Hence we expect the above quantity to be bounded and to attain its maximum.

Let h(z_j) := \exp(-z_j^2) for notational conciseness, and note h(z_j) > 0. By taking partial derivatives with the chain rule, we have that for j \neq i

\frac{\partial g(z)}{\partial z_j} = \frac{2 z_j h(z_j)}{\big(\sum_k h(z_k)\big)^2} \left[ (1 - z_j^2) \sum_k h(z_k) + \sum_k h(z_k) z_k^2 \right]. \qquad (30)

Hence the derivative is 0 if and only if z_j = 0 or (1 - z_j^2) \sum_k h(z_k) + \sum_k h(z_k) z_k^2 = 0, the latter being equivalent to z_j^2 = 1 + \frac{\sum_k h(z_k) z_k^2}{\sum_k h(z_k)} = 1 + g(z). Hence at the maximum, the non-zero values among \{z_j\}_{j=1}^N must be equal to one another. It is clear now that the maximum value c of g is attained when z_j^2 = 1 + c for j \neq i (and recall z_i = 0). So h(z_j) = \exp(-1 - c) for j \neq i. Substituting this into g(z) and rearranging, we obtain c \exp(c + 1) = N - 1. Note \phi(x) := x \exp(x + 1) is increasing for x > 0, hence c = \phi^{-1}(N - 1).

Note \phi(\log N) = (\log N) \exp(\log N + 1) \geq N \log N \geq N - 1 for N \geq 3. Since \phi is increasing, we have \phi^{-1}(N - 1) \leq \log N for N \geq 3. In fact, it is known that \phi^{-1}(N - 1) = O(\log N - \log \log N) (Corless et al., 1996).
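Since c \exp(c + 1) = N - 1 is equivalent to c e^c = (N-1)/e, the quantity \phi^{-1}(N - 1) can be computed with the principal branch of the Lambert W function studied in Corless et al. (1996). A small Python sketch, assuming SciPy is available:

```python
import numpy as np
from scipy.special import lambertw

def phi_inv(n):
    """Solve phi(c) = c * exp(c + 1) = n for c >= 0, i.e. c = W_0(n / e)."""
    return float(np.real(lambertw(n / np.e, k=0)))

for N in (3, 10, 100, 10_000):
    c = phi_inv(N - 1)
    # phi^{-1}(N - 1) <= log N for N >= 3, and grows like log N - log log N
    print(N, round(c, 4), round(np.log(N), 4), c <= np.log(N))
```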

Note the A term in f(X) = \tilde{f}(X) A allows us to use the above inequality, since Y^\top P^{(i)} Y = \mathrm{Cov}(Y) now appears in the terms of J_f:

J_{ii} = 2 \sqrt{A} \left[ Y^\top P^{(i)} Y \right] \sqrt{A}^\top + P_{ii} A, \qquad (31)
J_{ij} = 2 \sqrt{A} P_{ij} \Big(y_j - \sum_k P_{ik} y_k\Big)(y_i - y_j)^\top \sqrt{A}^\top + P_{ij} A \quad \text{for } i \neq j. \qquad (32)

Using the inequalities \|BC\| \leq \|B\|\|C\|, \|B + C\| \leq \|B\| + \|C\| and \|[A_1, \ldots, A_N]\| \leq \sum_i \|A_i\|, we have:

\|[J_{i1}, \ldots, J_{iN}]\|_\infty
\leq \|J_{ii}\|_\infty + \sum_{j \neq i} \|J_{ij}\|_\infty
\leq 2 \|\sqrt{A}\|_\infty \|Y^\top P^{(i)} Y\|_\infty \|\sqrt{A}^\top\|_\infty + P_{ii} \|A\|_\infty + 2 \sum_{j \neq i} \|\sqrt{A}\|_\infty \Big\|P_{ij} \Big(y_j - \sum_k P_{ik} y_k\Big)(y_i - y_j)^\top\Big\|_\infty \|\sqrt{A}^\top\|_\infty + \sum_{j \neq i} P_{ij} \|A\|_\infty
= 2 \|\sqrt{A}\|_\infty \|\sqrt{A}^\top\|_\infty \left( \|Y^\top P^{(i)} Y\|_\infty + \sum_{j \neq i} \Big\|P_{ij} \Big(y_j - \sum_k P_{ik} y_k\Big)(y_i - y_j)^\top\Big\|_\infty \right) + \|A\|_\infty
= 2 \frac{\|W^Q\|_\infty \|W^{Q\top}\|_\infty}{\sqrt{D/H}} \left( \|Y^\top P^{(i)} Y\|_\infty + \sum_j \Big\|P_{ij} \Big(y_j - \sum_k P_{ik} y_k\Big)(y_i - y_j)^\top\Big\|_\infty \right) + \frac{\|W^Q W^{Q\top}\|_\infty}{\sqrt{D/H}}.

For the first equality, note that \sum_j P_{ij} = 1. For the second equality, note that the summand for j = i is 0 because the term y_i - y_j = 0. Each of the terms in the brackets is bounded by the following lemmas:

Lemma F.3. \|Y^\top P^{(i)} Y\|_\infty \leq \phi^{-1}(N - 1)\sqrt{D/H} (\phi defined as in Lemma F.2).

Proof. Recall that Y^\top P^{(i)} Y = \mathrm{Cov}(Y). Let \sigma(Y_m) denote the standard deviation of Y_m. Then [\mathrm{Cov}(Y)]_{lm} \leq \sigma(Y_l)\sigma(Y_m). Hence

\|\mathrm{Cov}(Y)\|_\infty = \max_l \sum_m \big|[\mathrm{Cov}(Y)]_{lm}\big| \leq \max_l \sigma(Y_l) \sum_m \sigma(Y_m) \leq \sqrt{\frac{D}{H}} \sum_m \sigma^2(Y_m) = \sqrt{\frac{D}{H}} \mathrm{Tr}(\mathrm{Cov}(Y)) \leq \sqrt{\frac{D}{H}} \phi^{-1}(N - 1),

since \sum_m \sigma(Y_m) \leq \sqrt{D/H} \sqrt{\sum_m \sigma^2(Y_m)} (by e.g. using the Cauchy–Schwarz inequality on [\sigma(Y_1), \ldots, \sigma(Y_{D/H})] and \mathbf{1}) and \max_l \sigma(Y_l) \leq \sqrt{\sum_m \sigma^2(Y_m)}, and the last inequality is from Lemma F.2.

Lemma F.4. \sum_j \|P_{ij} (y_j - \sum_k P_{ik} y_k)(y_i - y_j)^\top\|_\infty \leq \phi^{-1}(N - 1)\sqrt{D/H}.

Proof. Note \|uv^\top\|_\infty = \|u\|_\infty \|v\|_1 for real vectors u, v. Hence

\sum_j \Big\|P_{ij} \Big(y_j - \sum_k P_{ik} y_k\Big)(y_i - y_j)^\top\Big\|_\infty = \sum_j P_{ij} \Big\|y_j - \sum_k P_{ik} y_k\Big\|_\infty \|y_i - y_j\|_1 = a^\top b \leq \|a\|_2 \|b\|_2,

where a_j = \sqrt{P_{ij}} \|y_j - \sum_k P_{ik} y_k\|_\infty and b_j = \sqrt{P_{ij}} \|y_i - y_j\|_1.

Note a_j \leq c_j := \sqrt{P_{ij}} \|y_j - \sum_k P_{ik} y_k\|_2, since \|u\|_\infty \leq \|u\|_2 for a vector u. Hence \|a\|_2 \leq \|c\|_2.

Also b_j \leq \sqrt{\frac{D}{H}} d_j := \sqrt{\frac{D}{H}} \sqrt{P_{ij}} \|y_i - y_j\|_2, since \|u\|_1 \leq \sqrt{D/H} \|u\|_2 for u \in \mathbb{R}^{D/H} (e.g. by the Cauchy–Schwarz inequality on [|u_1|, \ldots, |u_{D/H}|] and \mathbf{1}). Hence \|b\|_2 \leq \sqrt{D/H} \|d\|_2.

Note \|c\|_2^2 = \sum_j P_{ij} \|y_j - \sum_k P_{ik} y_k\|_2^2 = \mathrm{Tr}(\mathrm{Cov}(Y)) \leq \phi^{-1}(N - 1) from Lemma F.2, and \|d\|_2^2 = \sum_j P_{ij} \|y_i - y_j\|_2^2 \leq \phi^{-1}(N - 1), also from Lemma F.2. Hence \|a\|_2 \|b\|_2 \leq \sqrt{D/H} \|c\|_2 \|d\|_2 \leq \sqrt{D/H} \phi^{-1}(N - 1).

Putting the above lemmas together, with the observation that \sup_X \|J_f(X)\|_\infty = \sup_X \|[J_{i1}(X), \ldots, J_{iN}(X)]\|_\infty by permutation invariance of \|J_f\|_\infty (since f is permutation equivariant and \|\cdot\|_\infty is the maximum absolute row sum), we have

\|J_f\|_\infty \leq 4 \|W^Q\|_\infty \|W^{Q\top}\|_\infty \phi^{-1}(N - 1) + \frac{\|W^Q W^{Q\top}\|_\infty}{\sqrt{D/H}}
\leq \|W^Q\|_\infty \|W^{Q\top}\|_\infty \left( 4 \phi^{-1}(N - 1) + \frac{1}{\sqrt{D/H}} \right) \qquad (33)
\leq \|W^Q\|_\infty \|W^{Q\top}\|_\infty \left( 4 \log N + \frac{1}{\sqrt{D/H}} \right),

where the last inequality holds for N \geq 3. The full multihead attention map that combines the heads f^h(X) is:

F : X \mapsto \left[ f^1(X) W^{V,1}, \ldots, f^H(X) W^{V,H} \right] W^O = g(X) W^V W^O

where g : X \mapsto [f^1(X), \ldots, f^H(X)], W^O \in \mathbb{R}^{D \times D} and

W^V = \begin{bmatrix} W^{V,1} & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & W^{V,H} \end{bmatrix} \in \mathbb{R}^{DH \times D}.

Note the Jacobian J_g is a block matrix whose block rows are the J_{f^h}, hence \|J_g\|_\infty = \max_h \|J_{f^h}\|_\infty, and similarly \|W^{V\top}\|_\infty = \max_h \|W^{V,h\top}\|_\infty. Hence we have

\mathrm{Lip}_\infty(F) \leq \max_h \|J_{f^h}\|_\infty \, \max_h \|W^{V,h\top}\|_\infty \, \|W^{O\top}\|_\infty.

Combining this with Inequality (33), we have:

\mathrm{Lip}_\infty(F) \leq \left( 4 \phi^{-1}(N - 1) + \frac{1}{\sqrt{D/H}} \right) \max_h \|W^{Q,h}\|_\infty \|W^{Q,h\top}\|_\infty \, \max_h \|W^{V,h\top}\|_\infty \, \|W^{O\top}\|_\infty.
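This bound is straightforward to evaluate numerically for a given set of weights. Below is a sketch (our own helper, not the paper's code), assuming the head weights W^{Q,h}, W^{V,h} are of shape (D, D/H), W^O is of shape (D, D), and SciPy is available for \phi^{-1}:

```python
import numpy as np
from scipy.special import lambertw

def lip_inf_bound(WQ_heads, WV_heads, WO, N):
    """Evaluate the upper bound on Lip_inf(F) for L2-MHA displayed above."""
    d = WQ_heads[0].shape[1]                            # D/H
    inf_norm = lambda M: np.abs(M).sum(axis=1).max()    # maximum absolute row sum
    phi_inv = float(np.real(lambertw((N - 1) / np.e)))  # phi^{-1}(N - 1)
    q_term = max(inf_norm(Wq) * inf_norm(Wq.T) for Wq in WQ_heads)
    v_term = max(inf_norm(Wv.T) for Wv in WV_heads)
    return (4 * phi_inv + 1 / np.sqrt(d)) * q_term * v_term * inf_norm(WO.T)
```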

F.2.2. UPPER BOUND ON Lip_2(F) FOR L2-MHA

For p = 2, we use the following lemma:

Lemma F.5. Let A be a block matrix with block rows A_1, \ldots, A_N. Then \|A\|_2 \leq \sqrt{\sum_i \|A_i\|_2^2}, and equality holds if and only if the first right singular vectors of the A_i align.

Proof.

\|A\|_2^2 = \left\| \begin{bmatrix} A_1 \\ \vdots \\ A_N \end{bmatrix} \right\|_2^2 = \sup_{\|x\|_2 = 1} \left\| \begin{bmatrix} A_1 \\ \vdots \\ A_N \end{bmatrix} x \right\|_2^2 = \sup_{\|x\|_2 = 1} \sum_i \|A_i x\|_2^2 \leq \sum_i \sup_{\|x\|_2 = 1} \|A_i x\|_2^2 = \sum_i \|A_i\|_2^2.

Note that equality holds if and only if the first right singular vectors of the A_i align.

Hence a bound on the spectral norm of each block row of J_f can give us an O(\sqrt{N}) bound on \|J_f\|_2, which may be loose, and it remains an open question whether this bound can be tightened.

To bound the \|\cdot\|_2 norm of each block row of J_f, we use the following lemmas:

Lemma F.6. \|Y^\top P^{(i)} Y\|_2 \leq \phi^{-1}(N - 1).

Proof. \|Y^\top P^{(i)} Y\|_2 = \|\mathrm{Cov}(Y)\|_2 = \lambda_{\max}(\mathrm{Cov}(Y)) \leq \mathrm{Tr}(\mathrm{Cov}(Y)) \leq \phi^{-1}(N - 1), where the second equality holds by symmetry of \mathrm{Cov}(Y), and the first inequality holds since \mathrm{Cov}(Y) is positive semi-definite, so all its eigenvalues are non-negative and the maximal eigenvalue is bounded by the sum of the eigenvalues, which equals its trace. The final inequality is from Lemma F.2.

Lemma F.7. \sum_j \|P_{ij} (y_j - \sum_k P_{ik} y_k)(y_i - y_j)^\top\|_2 \leq \phi^{-1}(N - 1).

Proof. Noting that \|uv^\top\|_2 = \|u\|_2 \|v\|_2 for real vectors u, v, directly use Cauchy–Schwarz on c and d as in the proof of Lemma F.4.

Again using the inequalities \|BC\| \leq \|B\|\|C\|, \|B + C\| \leq \|B\| + \|C\| and \|[A_1, \ldots, A_N]\| \leq \sum_i \|A_i\|, together with the additional equality \|B^\top\|_2 = \|B\|_2, we have the bound:

\|[J_{i1}, \ldots, J_{iN}]\|_2 \leq 2 \frac{\|W^Q\|_2 \|W^{Q\top}\|_2}{\sqrt{D/H}} \left( \|Y^\top P^{(i)} Y\|_2 + \sum_j \Big\|P_{ij} \Big(y_j - \sum_k P_{ik} y_k\Big)(y_i - y_j)^\top\Big\|_2 \right) + \frac{\|W^Q W^{Q\top}\|_2}{\sqrt{D/H}}
\leq 4 \phi^{-1}(N - 1) \frac{\|W^Q\|_2^2}{\sqrt{D/H}} + \frac{\|W^Q W^{Q\top}\|_2}{\sqrt{D/H}}
\leq \frac{\|W^Q\|_2^2}{\sqrt{D/H}} \left( 4 \phi^{-1}(N - 1) + 1 \right).

Using Lemma F.5, we have that

\|J_f\|_2 \leq \frac{\sqrt{N} \|W^Q\|_2^2}{\sqrt{D/H}} \left( 4 \phi^{-1}(N - 1) + 1 \right) \qquad (34)
\leq \frac{\sqrt{N} \|W^Q\|_2^2}{\sqrt{D/H}} (4 \log N + 1).

To obtain the final result for the full multihead self-attention F, we need a final lemma:

Lemma F.8. Let A be a block matrix with block columns A_1, \ldots, A_N. Then \|A\|_2 \leq \sqrt{\sum_i \|A_i\|_2^2}.

Proof.

\|A\|_2 = \|[A_1, \ldots, A_N]\|_2 = \sup_{\sum_i \|x_i\|_2^2 = 1} \left\| [A_1, \ldots, A_N] \begin{bmatrix} x_1 \\ \vdots \\ x_N \end{bmatrix} \right\|_2 = \sup_{\sum_i \|x_i\|_2^2 = 1} \Big\| \sum_i A_i x_i \Big\|_2
\leq \sup_{\sum_i \|x_i\|_2^2 = 1} \sum_i \|A_i x_i\|_2 = \sup_{\|e_i\|_2 = 1, \, \sum_i \lambda_i^2 = 1} \sum_i \lambda_i \|A_i e_i\|_2 = \sup_{\sum_i \lambda_i^2 = 1} \sum_i \lambda_i \|A_i\|_2 \leq \sqrt{\sum_i \|A_i\|_2^2},

where we use the substitution x_i = \lambda_i e_i, and the last inequality holds by e.g. the Cauchy–Schwarz inequality on [\lambda_1, \ldots, \lambda_N] and [\|A_1\|_2, \ldots, \|A_N\|_2].

Recall that F : X \mapsto \left[ f^1(X) W^{V,1}, \ldots, f^H(X) W^{V,H} \right] W^O.

Since the Jacobian of X \mapsto f^h(X) W^{V,h} satisfies \|J_{f^h W^{V,h}}\|_2 \leq \|J_{f^h}\|_2 \|W^{V,h}\|_2, by Lemma F.8 we have that

\left\| J_{[f^1 W^{V,1}, \ldots, f^H W^{V,H}]} \right\|_2 \leq \sqrt{\sum_h \|J_{f^h}\|_2^2 \|W^{V,h}\|_2^2}

and hence

\mathrm{Lip}_2(F) \leq \left( \sqrt{\sum_h \|J_{f^h}\|_2^2 \|W^{V,h}\|_2^2} \right) \|W^O\|_2. \qquad (35)

Combining this with Inequality (34), we have:

\mathrm{Lip}_2(F) \leq \frac{\sqrt{N}}{\sqrt{D/H}} \left( 4 \phi^{-1}(N - 1) + 1 \right) \sqrt{\sum_h \|W^{Q,h}\|_2^4 \|W^{V,h}\|_2^2} \; \|W^O\|_2.
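Analogously to the \infty-norm case, this combined bound can be evaluated from the weights; the following is a sketch consistent with combining Inequalities (34) and (35) (our own helper, not the paper's code; head weights of shape (D, D/H), W^O of shape (D, D)):

```python
import numpy as np
from scipy.special import lambertw

def lip2_bound(WQ_heads, WV_heads, WO, N):
    """Evaluate the Lip_2 upper bound obtained by combining (34) and (35)."""
    d = WQ_heads[0].shape[1]                            # D/H
    spec = lambda M: np.linalg.norm(M, 2)               # spectral norm
    phi_inv = float(np.real(lambertw((N - 1) / np.e)))  # phi^{-1}(N - 1)
    per_head = sum((spec(Wq) ** 2 * spec(Wv)) ** 2
                   for Wq, Wv in zip(WQ_heads, WV_heads))
    return np.sqrt(N / d) * (4 * phi_inv + 1) * np.sqrt(per_head) * spec(WO)
```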

G. The Case with Masking

Since self-attention is often used with masking, a natural question is how masking affects the derived bounds. In self-attention (for any choice of attention function), masking is implemented as follows: given a set of mask indices M \subset \{1, \ldots, N\} \times \{1, \ldots, N\}, the logits (i.e. the inputs to the softmax) are set to -\infty at the mask indices. That is,

L_{ij} = \begin{cases} \tilde{L}_{ij} & \text{if } (i, j) \notin M \\ -\infty & \text{if } (i, j) \in M \end{cases}

where \tilde{L}_{ij} is the original logit (e.g. for L2 self-attention, \tilde{L}_{ij} = -(x_i - x_j)^\top A (x_i - x_j)).

Masking implies that f_i(X) is not a function of x_j for (i, j) \in M, hence J_{ij} = 0 for (i, j) \in M. Thus f_i(X) is equal to the ith output of self-attention with the inputs restricted to \{x_j : (i, j) \notin M\}, the unmasked inputs with respect to the ith output. Hence J_{ij} no longer contributes to the bound on \|[J_{i1}, \ldots, J_{iN}]\|, and the bound for the unmasked case continues to hold as long as (i, i) \notin M, i.e. x_i attends to itself (this is necessary for the proof of Lemma F.2 to hold). The bound can in fact be tightened by replacing N with |\{x_j : (i, j) \notin M\}|, the number of unmasked inputs with respect to the ith output.
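As an illustration, here is a minimal NumPy sketch of masked attention weights with L2 logits (the function name and the causal-mask example are ours; in practice the same masking is applied to whichever logits the attention function uses):

```python
import numpy as np

def masked_l2_attention_weights(X, WQ, mask):
    """P_ij with logits set to -inf where mask[i, j] is True (i may NOT attend to j).
    Every row must leave at least (i, i) unmasked, as assumed above."""
    d = WQ.shape[1]
    Y = X @ (WQ / d ** 0.25)
    logits = -((Y[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    logits = np.where(mask, -np.inf, logits)
    P = np.exp(logits - logits.max(axis=1, keepdims=True))
    return P / P.sum(axis=1, keepdims=True)

# Causal mask for N = 4: position i attends only to j <= i, so (i, i) is never masked
N = 4
causal_mask = np.triu(np.ones((N, N), dtype=bool), k=1)
```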

H. Experimental Details

For the experiment in Section 5.1, showing the asymptotic tightness of the upper bound on Lip_\infty(F) where F is L2-MHA, we fix all free parameters of F (namely W^Q, W^V) to be the identity, and only optimise the input X. We use 50 random initialisations of X for each N, where X_{ij} \sim U[-c, c] for c \sim U[0, 10] (we observed that having c itself be random improves optimisation). We display the top 5 results for each value of N after optimising each random initialisation until convergence using Adam (Kingma & Ba, 2015) with a learning rate of 0.1.

For the experiments in Section 5.3, we compare the performance of the original Transformer and the Transformer with Lipschitz/invertible self-attention at character-level language modelling on the Penn Treebank dataset (Marcus et al., 1993).^1 Each training example is a sentence represented as a variable-length sequence of characters, and examples are batched according to length such that padding is minimised, with the maximum sequence length set to 288. All models are autoregressive, outputting the logits of the categorical likelihood predicting the next character, and are trained using maximum likelihood (cross-entropy loss) with a batch size of 64. The LSTM models have the dimensionality of the hidden state equal to the dimensionality D of the cell state (the usual default implementation). The Transformer models are trained with a varying number of blocks (number of layers), with H = 8 heads and D = 512, tuning hyperparameters for the dropout rate in \{0, 0.1, 0.2\} and the base learning rate \gamma \in \{0.2, 0.4, 0.6, 0.8, 1.0, 1.5, 2.0\}, with the number of warmup iterations w \in \{1000, 2000, 4000, 8000\}, for the standard custom learning rate schedule of Vaswani et al. (2017):

\gamma_t = \frac{\gamma}{\sqrt{D}} \min\big(t^{-1/2}, \, t \, w^{-3/2}\big),

where \gamma_t is the learning rate at training iteration t. Hence the learning rate increases linearly from 0 to \gamma (Dw)^{-1/2} over w iterations, then decays proportionally to t^{-1/2}. We use Glorot uniform initialisation (Glorot & Bengio, 2010) for all weights, U[-\sqrt{1/(d_{in}+d_{out})}, \sqrt{1/(d_{in}+d_{out})}], except for the weights in L2-MHA, which are initialised from U[-s/\sqrt{D}, s/\sqrt{D}], where s is a hyperparameter. For D = 512, we used s = 1/24. All experiments were done in TensorFlow 1.14 (Abadi et al., 2016) on single Nvidia Tesla V100 GPUs.
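For reference, the schedule can be written as the following small Python function (a sketch; the actual training code is in TensorFlow):

```python
import numpy as np

def lr_schedule(t, gamma, D=512, w=4000):
    """Learning rate at iteration t: gamma / sqrt(D) * min(t^{-1/2}, t * w^{-3/2}).
    Linear warmup to gamma * (D * w)^{-1/2} over w iterations, then t^{-1/2} decay."""
    t = max(int(t), 1)
    return gamma / np.sqrt(D) * min(t ** -0.5, t * w ** -1.5)
```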

^1 We use the standard training-validation-test split, and the dataset can be found at e.g. https://github.com/harvardnlp/TextFlow/tree/master/data/ptb.

I. Numerical Invertibility of MHA Residual Map

Following Section 5.2, Figure 6 confirms that numerical invertibility does not hold for trained weights of dot-product multihead self-attention (DP-MHA) (obtained from the one-layer Transformer (DP) model used for Figure 4), similar to the randomly initialised weight case. Figure 7 shows additional results for different values of N and D.
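The numerical-invertibility check inverts the residual map by fixed-point iteration and records the reconstruction error; a minimal sketch (our own function names, assuming f is any map \mathbb{R}^{N \times D} \to \mathbb{R}^{N \times D}):

```python
import numpy as np

def invert_residual(f, y, c, n_iters=25):
    """Approximately solve y = x + c * f(x) by the fixed-point iteration x <- y - c * f(x),
    returning the final iterate and the max reconstruction error after each step."""
    x, errors = y.copy(), []
    for _ in range(n_iters):
        x = y - c * f(x)
        errors.append(np.abs(x + c * f(x) - y).max())  # max reconstruction error
    return x, errors
```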

Figure 6. Invertibility of g(x) = x + cf(x) for trained DP-MHA f (D = 512, H = 8): maximum reconstruction error against the number of fixed-point iterations, for c \in \{0.5, 0.7, 0.9\}.

Figure 7. Numerical invertibility of g(x) = x + cf(x) where f is L2-MHA (left) or DP-MHA (right), for different values of N and D.

J. Behaviour of Lower Bound on Lip2(F )

Figure 8. Lower bound on Lip_2(F) where F is L2-MHA, with D = 1 and varying N, obtained by optimising \|J_F(X)\|_2 with respect to X, with 50 random initialisations of X for each N. Shown are the top 1 and top 5 values over the random initialisations, together with a c \log N reference curve.

In Figure 8, we show the lower bound on Lip_2(F) obtained by optimising \|J_F(X)\|_2 using the same optimisation procedure as for Figure 2 of Section 5.1. Here the optimisation is more difficult, evident in the variance of the top 5 values, and the trend is less clear, but it appears that Lip_2(F) grows at a rate of O(\log N). The message is less conclusive here, and there are at least two possibilities:

(1) The optimisation is difficult even for small values of N, hence Figure 8 shows a loose lower bound.

(2) If the lower bound is tight, this suggests that the O(\sqrt{N} \log N) bound in Theorem 3.2 is not asymptotically tight, and could be improved to O(\log N) (or O(\log N - \log \log N), as for p = \infty).

K. Optimising the norm of the Jacobian of DP-MHA

In Figure 9, we show how the norm of the Jacobian \|J_f(X)\|_\infty for DP-MHA f keeps increasing when it is optimised with respect to X. This is a useful sanity check validating our theoretical result of Theorem 3.1, namely that DP-MHA is not Lipschitz. The oscillations are likely due to the momentum term of the Adam optimiser that was used to optimise the norm.

Figure 9. Optimising \|J_f(X)\|_\infty with respect to X for trained DP-MHA f (D = 512): the norm continues to grow with the number of optimisation iterations.

L. Experiment tying keys and queries of L2-MHA but preserving parameter count

In Figure 4 of Section 5.3, we have shown that there is a clear reduction in performance when tying the keys and queries. To test whether this can be attributed to the reduction in parameter count, we tried doubling the number of columns of W^Q when the keys and queries are shared (i.e. from D/H to 2D/H), so that the shared model has the same number of parameters as the unshared model. In Figure 10, the third column shows results for shared L2-MHA with the same number of parameters as the unshared L2-MHA, i.e. as without tying the keys and queries. The performance is similar to the second column (tying with a reduced number of parameters), suggesting that there is an inherent limitation in expressiveness from tying the keys and queries, and that the reduction in the number of parameters is an insufficient explanation for this phenomenon.

Figure 10. Experiment tying keys/queries but preserving parameter count: validation NLL against training epochs for Transformer (L2), Transformer (L2) with W^Q = W^K, and Transformer (L2) with W^Q = W^K but with the same parameter count, each with 1-5 layers.

M. Training curves for fixed learning rate DP-MHA vs L2-MHA

Figure 11. Train NLL against training epochs for Transformer (DP), Transformer (L2) and Transformer (Contractive-L2), for varying numbers of layers (2-18).

N. The Lipschitz constant of LayerNorm

In this section, we show that LayerNorm is Lipschitz, with a loose bound on its Lipschitz constant w.r.t. the \infty-norm. LayerNorm is defined as follows:

LN(x) = \gamma \odot \frac{x - \mu(x)}{\sqrt{\sigma^2(x) + \epsilon}} + \beta
\mu(x) = \frac{1}{D} \sum_{d=1}^D x_d
\sigma^2(x) = \frac{1}{D} \sum_{d=1}^D (x_d - \mu(x))^2

where x, \beta, \gamma \in \mathbb{R}^D. We will omit the dependence on x and write \mu, \sigma^2 when there is no ambiguity, to reduce clutter.

In the trivial case where the x_d are all equal or when D = 1, we have x = \mu \mathbf{1}, hence LN(x) = \beta, so its Lipschitz constant is 0. Thus let us assume D > 2 and that not all x_d are equal. First let us compute the derivatives of \mu and \sigma^2 w.r.t. x:

\frac{\partial \mu}{\partial x} = \frac{1}{D} \mathbf{1}^\top
\frac{\partial \sigma^2}{\partial x} = \frac{1}{D} \sum_d 2 (x_d - \mu) \frac{\partial}{\partial x}(x_d - \mu)
= \frac{2}{D} \sum_d (x_d - \mu) \Big(e_d - \frac{1}{D}\mathbf{1}\Big)^\top
= \frac{2}{D} \Big( \sum_d (x_d - \mu) e_d^\top - \frac{1}{D} \Big(\sum_d (x_d - \mu)\Big) \mathbf{1}^\top \Big)
= \frac{2}{D} \sum_d (x_d - \mu) e_d^\top
= \frac{2}{D} (x - \mu)^\top,

where e_d \in \mathbb{R}^D is a one-hot vector with 1 at the dth element. Note the penultimate equality holds because \sum_d (x_d - \mu) = 0.

Now the derivative of LN(x)_d, the dth element of LN(x), w.r.t. x is

\frac{\partial LN(x)_d}{\partial x} = \gamma_d \left( \frac{\partial}{\partial x}(x_d - \mu) \, (\sigma^2 + \epsilon)^{-1/2} + (x_d - \mu) \Big(-\frac{1}{2}\Big) (\sigma^2 + \epsilon)^{-3/2} \frac{\partial \sigma^2}{\partial x} \right)
= \gamma_d \left( (\sigma^2 + \epsilon)^{-1/2} \Big(e_d - \frac{1}{D}\mathbf{1}\Big)^\top - \frac{1}{2} (x_d - \mu) (\sigma^2 + \epsilon)^{-1} \frac{2}{D} (\sigma^2 + \epsilon)^{-1/2} (x - \mu)^\top \right)
= \gamma_d (\sigma^2 + \epsilon)^{-1/2} \left( \Big(e_d - \frac{1}{D}\mathbf{1}\Big)^\top - \frac{1}{D} (\sigma^2 + \epsilon)^{-1} (x_d - \mu) (x - \mu)^\top \right).

Hence

\frac{\partial LN(x)}{\partial x} = (\sigma^2 + \epsilon)^{-1/2} \left( \mathrm{diag}(\gamma) - \frac{1}{D} \gamma \mathbf{1}^\top - \frac{1}{D} (\sigma^2 + \epsilon)^{-1} \mathrm{diag}(\gamma) (x - \mu)(x - \mu)^\top \right).

Note

\mathrm{diag}(\gamma) - \frac{1}{D} \gamma \mathbf{1}^\top = \begin{bmatrix} \gamma_1 (D-1)/D & -\gamma_1/D & \cdots & -\gamma_1/D \\ -\gamma_2/D & \gamma_2 (D-1)/D & \cdots & -\gamma_2/D \\ \vdots & \vdots & \ddots & \vdots \\ -\gamma_D/D & -\gamma_D/D & \cdots & \gamma_D (D-1)/D \end{bmatrix},

hence

\left\| \mathrm{diag}(\gamma) - \frac{1}{D} \gamma \mathbf{1}^\top \right\|_\infty = \frac{2(D-1)}{D} \max_d |\gamma_d|, \qquad (36)

recalling that \|\cdot\|_\infty is the maximum absolute row sum.

Let z_d := x_d - \mu. Hence \sum_d z_d = 0, \sigma^2 = \frac{1}{D} \sum_d z_d^2 and

\mathrm{Cov}(x) := (x - \mu)(x - \mu)^\top = \begin{bmatrix} z_1^2 & \cdots & z_1 z_D \\ \vdots & \ddots & \vdots \\ z_D z_1 & \cdots & z_D^2 \end{bmatrix}.

Hence

\frac{\|\mathrm{Cov}(x)\|_\infty}{\sigma^2} = \frac{\max_d |z_d| \sum_{d'} |z_{d'}|}{\frac{1}{D} \sum_d z_d^2}.

Noting that this expression is scale-invariant in z, we may assume WLOG that \max_d |z_d| = z_D = 1, since we are assuming that not all x_d are equal and hence at least one z_d is non-zero. The expression now becomes

\frac{\|\mathrm{Cov}(x)\|_\infty}{\sigma^2} = D \, \frac{1 + \sum_{d < D} |z_d|}{1 + \sum_{d < D} z_d^2}. \qquad (37)

Since all the |z_d| are bounded by 1, this continuous expression attains a global maximum at some value of z with z_D = 1.

It is easy to see that at the global maximum, z_d \neq 0 for all d. Suppose this were not true, WLOG z_1 = 0. Then consider how the quantity (37) changes when z_1 = 0 is increased by 0 < \delta < 1 and z_D = 1 is decreased by \delta, keeping the sum constant. The numerator term \sum_d |z_d| stays constant, but the denominator term \sum_d z_d^2 changes by 2\delta^2 - 2\delta < 0. Since for small \delta the numerator of (37) stays constant while the denominator decreases, the quantity (37) increases, contradicting that the global maximum is attained at z_1 = 0. Hence we may assume that z_d \neq 0 for all d.

Hence the quantity (37) (in particular, \sum_{d<D} |z_d|) is differentiable at the global maximum, at which the partial derivatives of the following Lagrangian are zero:

\mathcal{L}(z_1, \ldots, z_{D-1}, \lambda) = \frac{1 + \sum_{d<D} |z_d|}{1 + \sum_{d<D} z_d^2} - \lambda \Big( \sum_{d<D} z_d + 1 \Big).

From now on let us write \sum for \sum_{d<D} to reduce clutter. Setting \frac{\partial \mathcal{L}}{\partial z_k} = 0 and noting \frac{\partial |z_k|}{\partial z_k} = \mathrm{sgn}(z_k), we obtain

\frac{\mathrm{sgn}(z_k)\big(1 + \sum z_d^2\big) - \big(1 + \sum |z_d|\big) 2 z_k}{\big(1 + \sum z_d^2\big)^2} = \lambda.

Hence at the global maximum, z_k takes one of two values a > 0 and b < 0. Further we have that

\frac{1 + \sum |z_d|}{1 + \sum z_d^2} = \frac{\mathrm{sgn}(z_k) - \lambda\big(1 + \sum z_d^2\big)}{2 z_k}. \qquad (38)

If both a and b are among the z_k, we have that \frac{1 - \lambda(1 + \sum z_d^2)}{2a} = \frac{-1 - \lambda(1 + \sum z_d^2)}{2b}. Solving for \lambda(1 + \sum z_d^2) and plugging it back into Equation (38), we get:

\frac{1 + \sum |z_d|}{1 + \sum z_d^2} = \frac{1}{a - b}.

Since a > 0, b < 0 and \sum z_d = -1, the difference a - b is minimised when only one of the z_d is a and the rest are b. Hence a crude lower bound on a - b is \frac{1}{D-2}, giving the bound:

\frac{\|\mathrm{Cov}(x)\|_\infty}{\sigma^2} \leq D(D - 2). \qquad (39)

However, we conjecture that the true global maximum is attained when z_d = -\frac{1}{D-1} for all d < D (i.e. all the z_d for d < D are equal), in which case \frac{1 + \sum_{d<D} |z_d|}{1 + \sum_{d<D} z_d^2} = \frac{2(D-1)}{D}, which would give the tighter bound \frac{\|\mathrm{Cov}(x)\|_\infty}{\sigma^2} \leq 2(D - 1).

Putting the above together, we can bound the \infty-norm of the Jacobian of LayerNorm:

\left\| \frac{\partial LN(x)}{\partial x} \right\|_\infty = \left\| (\sigma^2 + \epsilon)^{-1/2} \left( \mathrm{diag}(\gamma) - \frac{1}{D} \gamma \mathbf{1}^\top - \frac{1}{D} (\sigma^2 + \epsilon)^{-1} \mathrm{diag}(\gamma)(x - \mu)(x - \mu)^\top \right) \right\|_\infty
\leq \epsilon^{-1/2} \left( \left\| \mathrm{diag}(\gamma) - \frac{1}{D} \gamma \mathbf{1}^\top \right\|_\infty + \frac{1}{D} \|\mathrm{diag}(\gamma)\|_\infty \left\| (\sigma^2 + \epsilon)^{-1} (x - \mu)(x - \mu)^\top \right\|_\infty \right)
\leq \epsilon^{-1/2} \left( \left\| \mathrm{diag}(\gamma) - \frac{1}{D} \gamma \mathbf{1}^\top \right\|_\infty + \frac{1}{D} \|\mathrm{diag}(\gamma)\|_\infty \left\| \mathrm{Cov}(x)/\sigma^2 \right\|_\infty \right)
\leq \epsilon^{-1/2} \left( \frac{2(D-1)}{D} \max_d |\gamma_d| + \frac{1}{D} \max_d |\gamma_d| \, D(D-2) \right)
= \epsilon^{-1/2} \max_d |\gamma_d| \left( \frac{2(D-1)}{D} + D - 2 \right)
= \epsilon^{-1/2} \max_d |\gamma_d| \, \frac{D^2 - 2}{D}.

Hence the Lipschitz constant of LayerNorm w.r.t. the \infty-norm is at most \epsilon^{-1/2} \max_d |\gamma_d| (D^2 - 2)/D.
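A small NumPy check of the resulting bound, comparing a finite-difference estimate of \|\partial LN(x)/\partial x\|_\infty with \epsilon^{-1/2} \max_d |\gamma_d| (D^2 - 2)/D (a sketch with our own helper names; the bound is loose, so the margin is large):

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    mu = x.mean()
    var = ((x - mu) ** 2).mean()
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

rng = np.random.default_rng(0)
D, eps, h = 8, 1e-5, 1e-6
gamma, beta, x = rng.normal(size=D), rng.normal(size=D), rng.normal(size=D)

# Columns of the Jacobian by finite differences: J[:, d] = d LN(x) / d x_d
J = np.stack([(layer_norm(x + h * np.eye(D)[d], gamma, beta, eps)
               - layer_norm(x, gamma, beta, eps)) / h for d in range(D)], axis=1)
jac_inf = np.abs(J).sum(axis=1).max()      # maximum absolute row sum
bound = eps ** -0.5 * np.abs(gamma).max() * (D ** 2 - 2) / D
print(jac_inf, bound, jac_inf <= bound)
```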