
13.3. Transformer Networks

François Fleuret, https://fleuret.org/dlc/

Vaswani et al. (2017) proposed to go one step further: instead of using attention mechanisms as a supplement to standard convolutional and recurrent operations, they designed a model combining only attention layers.

They designed this “transformer” for a sequence-to-sequence translation task, but it is currently key to state-of-the-art approaches across NLP tasks.


Notes

The standard practice is to train a transformer in an unsupervised manner on large unlabeled datasets such as Wikipedia–or re-use a pre-trained transformer–and then fine-tune it in a supervised manner for tasks which require a ground truth, such as sentiment analysis.

They first introduce a multi-head attention module.

Figure 2 (Vaswani et al., 2017): (left) Scaled Dot-Product Attention. (right) Multi-Head Attention consists of several attention layers running in parallel.

Attention(Q, K, V) = softmax( QK^T / √d_k ) V

MultiHead(Q, K, V) = Concat(H_1, ..., H_h) W^O

H_i = Attention(Q W_i^Q, K W_i^K, V W_i^V),   i = 1, ..., h

with W_i^Q ∈ R^{d_model × d_k}, W_i^K ∈ R^{d_model × d_k}, W_i^V ∈ R^{d_model × d_v}, W^O ∈ R^{h d_v × d_model}.
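To make these equations concrete, here is a minimal PyTorch sketch of scaled dot-product attention and of a multi-head layer. This is not a reference implementation; names such as d_model, n_heads, and the optional mask argument are illustrative.

import math
import torch
from torch import nn

def attention(Q, K, V, mask=None):
    # Q, K: (..., T, d_k), V: (..., T, d_v)
    d_k = Q.size(-1)
    A = Q @ K.transpose(-2, -1) / math.sqrt(d_k)      # (..., T, T)
    if mask is not None:
        A = A.masked_fill(mask == 0, float('-inf'))   # e.g. causal mask
    return torch.softmax(A, dim=-1) @ V

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, n_heads):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        # W^Q, W^K, W^V for all the heads at once, and the final W^O
        self.w_q = nn.Linear(d_model, d_model, bias=False)
        self.w_k = nn.Linear(d_model, d_model, bias=False)
        self.w_v = nn.Linear(d_model, d_model, bias=False)
        self.w_o = nn.Linear(d_model, d_model, bias=False)

    def forward(self, q, k, v, mask=None):
        B, T, _ = q.shape
        def split(x):  # (B, T, d_model) -> (B, n_heads, T, d_head)
            return x.view(B, -1, self.n_heads, self.d_head).transpose(1, 2)
        y = attention(split(self.w_q(q)), split(self.w_k(k)), split(self.w_v(v)), mask)
        y = y.transpose(1, 2).contiguous().view(B, T, -1)  # concatenate the heads
        return self.w_o(y)

A causal mask for auto-regressive use, as discussed in the notes below, can be obtained with torch.tril(torch.ones(T, T)).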

Notes

The "scaled dot-product attention" (left) is very close to the attention module we saw in lecture 13.2. "Attention Mechanisms", with the addition of an optional masking (in pink). This may be useful when such a module is used for a generative auto-regressive operation and the attention should be causal, looking only to the past.

The attention is a function of the keys, queries, and values. The only difference with what was seen in the previous course is that the attention matrix is rescaled with the dimension d_k of the embedding, which matters quite a lot: for large d_k the dot products grow large in magnitude, pushing the softmax into regions where it has extremely small gradients, hence the scaling by 1/√d_k.

In the multi-head attention, each head has its own processing of the input keys, queries, and values through respectively W_i^K, W_i^Q, and W_i^V, and there is one final processing W^O applied on the concatenated results of the multiple heads.

Their complete model is composed of:

• An encoder that combines N = 6 modules, each composed of a multi-head attention sub-module and a [per-component] one-hidden-layer MLP, with residual pass-through and layer normalization.

• A decoder with a similar structure, but with causal attention layers to allow for auto-regressive training, and additional attention layers that attend to the output of the encoder.

Positional information is provided through an additive positional encoding of the same dimension d_model as the internal representation, and is of the form

PE_{t,2i} = sin( t / 10,000^{2i / d_model} )

PE_{t,2i+1} = cos( t / 10,000^{(2i+1) / d_model} )
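A minimal sketch of this additive encoding, following the formula as written above (Vaswani et al. use the exponent 2i/d_model for both the sine and cosine components; names are illustrative):

import torch

def positional_encoding(T, d_model):
    # pe[t, j] = sin(t / 10000^(j/d_model)) for even j, cos(t / 10000^(j/d_model)) for odd j
    t = torch.arange(T, dtype=torch.float32).unsqueeze(1)        # (T, 1)
    j = torch.arange(d_model, dtype=torch.float32).unsqueeze(0)  # (1, d_model)
    angles = t / torch.pow(10000.0, j / d_model)                 # (T, d_model)
    return torch.where(j.long() % 2 == 0, torch.sin(angles), torch.cos(angles))

x = torch.randn(16, 512)              # e.g. 16 token embeddings of dimension d_model = 512
x = x + positional_encoding(16, 512)  # the encoding is simply added to the embeddings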


Notes

Contrary to what we previously saw with the concatenated binary positional encoding, here the position is provided as an additive encoding, where t is the position in the sequence, and 2i and 2i+1 the dimension.

Figure 1: The Transformer - model architecture. (Vaswani et al., 2017)

(Vaswani et al., 2017), Section 3.1 "Encoder and Decoder Stacks": the encoder is a stack of N = 6 identical layers, each with a multi-head self-attention sub-layer and a position-wise feed-forward sub-layer, both wrapped in a residual connection followed by layer normalization, with all sub-layers producing outputs of dimension d_model = 512. The decoder adds a third sub-layer that performs multi-head attention over the output of the encoder stack, and masks its self-attention so that the prediction for position i can depend only on positions less than i.

Notes

This is a depiction of the standard transformer architecture for sequence-to-sequence translation. It consists of an encoder (left part) and a decoder (right part). Both are a stack of N = 6 modules.

Each token (sub-word) of the input sequence is encoded with a look-up table to get its embedding of dimension d, so that the input is a tensor of size T × d. Then a positional encoding of the same size is added to it.

Each of the N modules of the encoder is composed of a multi-head self-attention operation followed by a "feed-forward" operation that applies a one-hidden-layer MLP to every position of the sequence separately; the latter can be implemented with 1 × 1 convolutions. Both the self-attention and the feed-forward are combined with residual pass-through.

The decoder is an auto-regressive model, and each of its modules has a multi-head self-attention operation, then an attention that attends to the encoder, and a feed-forward operation. The self-attention is masked to make it causal, i.e. it takes into account only the part of the sequence already generated. The attention to the encoder is not masked, but its keys and values are functions of the outputs of the corresponding module in the encoding stack.
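As an illustration, a possible sketch of one such encoder module, reusing the MultiHeadAttention sketch from above (the exact placement of normalization and the dropout layers of the original model are omitted):

import torch
from torch import nn

class EncoderBlock(nn.Module):
    # One encoder module: self-attention and a per-position one-hidden-layer MLP,
    # each combined with a residual pass-through and layer normalization.
    def __init__(self, d_model, n_heads, d_ff):
        super().__init__()
        self.attn = MultiHeadAttention(d_model, n_heads)  # from the sketch above
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):                        # x: (B, T, d_model)
        x = self.norm1(x + self.attn(x, x, x))   # self-attention + residual + LN
        x = self.norm2(x + self.ff(x))           # position-wise MLP + residual + LN
        return x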

The architecture is tested on English-to-German and English-to-French translation using the standard WMT2014 datasets.

• English-to-German: 4.5M sentence pairs, 37k tokens vocabulary.

• English-to-French: 36M sentence pairs, 32k tokens vocabulary.

• 8 P100 GPUs (150 TFlops FP16), 0.5 day for the small model, 3.5 days for the large one.

Table 2 (Vaswani et al., 2017): The Transformer achieves better BLEU scores than previous state-of-the-art models on the English-to-German and English-to-French newstest2014 tests at a fraction of the training cost.

                                        BLEU                Training Cost (FLOPs)
Model                              EN-DE    EN-FR         EN-DE          EN-FR
ByteNet [18]                       23.75
Deep-Att + PosUnk [39]                      39.2                        1.0 · 10^20
GNMT + RL [38]                     24.6     39.92         2.3 · 10^19   1.4 · 10^20
ConvS2S [9]                        25.16    40.46         9.6 · 10^18   1.5 · 10^20
MoE [32]                           26.03    40.56         2.0 · 10^19   1.2 · 10^20
Deep-Att + PosUnk Ensemble [39]             40.4                        8.0 · 10^20
GNMT + RL Ensemble [38]            26.30    41.16         1.8 · 10^20   1.1 · 10^21
ConvS2S Ensemble [9]               26.36    41.29         7.7 · 10^19   1.2 · 10^21
Transformer (base model)           27.3     38.1                3.3 · 10^18
Transformer (big)                  28.4     41.8                2.3 · 10^19

Label Smoothing: During training, we employed label smoothing of value ε_ls = 0.1 [36]. This hurts perplexity, as the model learns to be more unsure, but improves accuracy and BLEU score.

(Vaswani et al., 2017)

6 Results

6.1 Machine Translation

On the WMT 2014 English-to-German translation task, the big transformer model (Transformer (big) in Table 2) outperforms the best previously reported models (including ensembles) by more than 2.0 BLEU, establishing a new state-of-the-art BLEU score of 28.4. The configuration of this model is listed in the bottom line of Table 3. Training took 3.5 days on 8 P100 GPUs. Even our base model surpasses all previously published models and ensembles, at a fraction of the training cost of any of the competitive models.

On the WMT 2014 English-to-French translation task, our big model achieves a BLEU score of 41.0, outperforming all of the previously published single models, at less than 1/4 the training cost of the previous state-of-the-art model. The Transformer (big) model trained for English-to-French used dropout rate P_drop = 0.1, instead of 0.3.

For the base models, we used a single model obtained by averaging the last 5 checkpoints, which were written at 10-minute intervals. For the big models, we averaged the last 20 checkpoints. We used beam search with a beam size of 4 and length penalty α = 0.6 [38]. These hyperparameters were chosen after experimentation on the development set. We set the maximum output length during inference to input length + 50, but terminate early when possible [38]. Table 2 summarizes our results and compares our translation quality and training costs to other model architectures from the literature. We estimate the number of floating point operations used to train a model by multiplying the training time, the number of GPUs used, and an estimate of the sustained single-precision floating-point capacity of each GPU.

(Vaswani et al., 2017)

Notes

The standard metric in natural language processing is the Bilingual Evaluation Understudy (BLEU) score, which aims at evaluating a generated sequence against a reference sentence. The BLEU score ranges between 0 (perfect mismatch) and 1 (perfect match).

6.2 Model Variations

To evaluate the importance of different components of the Transformer, we varied our base model in different ways, measuring the change in performance on English-to-German translation on the development set, newstest2013. We used beam search as described in the previous section, but no checkpoint averaging. We present these results in Table 3. In Table 3 rows (A), we vary the number of attention heads and the attention key and value dimensions, keeping the amount of computation constant, as described in Section 3.2.2. While single-head attention is 0.9 BLEU worse than the best setting, quality also drops off with too many heads.

Footnote 5 (Vaswani et al., 2017): We used values of 2.8, 3.7, 6.0 and 9.5 TFLOPS for K80, K40, M40 and P100, respectively.

Figure 4 (Vaswani et al., 2017): Two attention heads, also in layer 5 of 6, apparently involved in anaphora resolution. Top: Full attentions for head 5. Bottom: Isolated attentions from just the word 'its' for attention heads 5 and 6. Note that the attentions are very sharp for this word.

Notes

On the left is a visualization of the attention computed by one head of the encoder at layer 5. On the right, the attention given by the word "its" for two different heads is on "law" and "application", which provides help for gender and grammatical issues.

Figure 5 (Vaswani et al., 2017): Many of the attention heads exhibit behaviour that seems related to the structure of the sentence. We give two such examples above, from two different heads from the encoder self-attention at layer 5 of 6. The heads clearly learned to perform different tasks.

Notes

Two other heads, also in layer 5.

The Universal Transformer (Dehghani et al., 2018) is a similar model where all the blocks are identical, resulting in a recurrent model that iterates over consecutive revisions of the representation instead of positions.

Additionally the number of steps is modulated per position dynamically.

Transformer self-training and fine-tuning for NLP

The transformer networks were introduced for translation, and trained with a supervised procedure, from pairs of sentences.

However, as for word embeddings, they can be trained in an unsupervised manner, for auto-regression or as denoising auto-encoders, from very large datasets, and fine-tuned on supervised tasks with small datasets.


Notes

A transformer is [pre-]trained in an unsupervised manner for the task of predicting a token: for auto-regression, the input is the sentence up to the token to predict; for masked language modeling, the input is a full sentence with some tokens replaced by a "mask" token. No ground truth is required for those tasks.

As for word embeddings, training a transformer model like this allows it to capture statistical structure in the text and provides an extremely good representation for more sophisticated tasks, which can only be trained in a supervised manner with only small datasets available.


Figure 3: Differences in pre-training model architectures. BERT uses a bidirectional Transformer. OpenAI GPT uses a left-to-right Transformer. ELMo uses the concatenation of independently trained left-to-right and right-to- left LSTMs to generate features for downstream tasks. Among the three, only BERT representations are jointly conditioned on both left and right context in all layers. In addition to the architecture differences, BERT and OpenAI GPT are fine-tuning approaches, while ELMo is a feature-based approach.

(Devlin et al., 2018). The accompanying excerpt describes BERT's pre-training setup: masked LM and next-sentence prediction objectives, batches of 256 sequences of up to 512 tokens, 1,000,000 steps (about 40 epochs over a 3.3-billion-word corpus), trained on Cloud TPUs for about 4 days.

GPT (Generative Pre-Training, Radford, 2018) is a transformer trained for auto-regressive text generation.

Figure 1: (left) Transformer architecture and training objectives used in this work. (right) Input transformations for fine-tuning on different tasks. We convert all structured inputs into token sequences to be processed by our pre-trained model, followed by a linear+softmax layer. (Radford, 2018)

(Radford, 2018), Section 3.3 "Task-specific input transformations": while text classification can be fine-tuned directly, tasks with structured inputs (ordered sentence pairs, or document / question / answer triplets) are converted into ordered token sequences with start, end and delimiter tokens, so that the pre-trained model can process them without task-specific architectures on top.

Notes

Note that GPT is inherently causal, so it only carries information forward, and consists of 12 modules as opposed to 6 for the original transformer.

The tasks GPT can be fine-tuned on are:

• Classification: for instance for sentiment analysis, when the input is a comment and the task is to predict whether it is positive or negative.

• Entailment: given a premise and a hypothesis, the task is to predict whether the hypothesis is implied by the premise.

• Similarity: the task is to predict if two pieces of text have the same meaning.

• Multiple choice: the task is to predict the correct answer.

“GPT-2 is a large transformer-based language model with 1.5 billion parameters, trained on a dataset of 8 million web pages. GPT-2 is trained with a simple objective: predict the next word, given all of the previous words within some text. The diversity of the dataset causes this simple goal to contain naturally occurring demonstrations of many tasks across diverse domains. GPT-2 is a direct scale-up of GPT, with more than 10X the parameters and trained on more than 10X the amount of data.”

(Radford et al., 2019)

The GPT-3 model has 175B parameters and was trained on 300B tokens from various sources (Brown et al., 2020).

We can use HuggingFace's pre-trained models (https://huggingface.co/).

import torch

from transformers import GPT2Tokenizer, GPT2LMHeadModel

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2')
model.eval()

tokens = tokenizer.encode('Studying Deep-Learning is')

for k in range(100): # no more than 100 tokens
    outputs = model(torch.tensor([tokens])).logits
    next_token = torch.argmax(outputs[0, -1])
    print(k, next_token)
    tokens.append(next_token)
    if tokenizer.decode([next_token]) == '.':
        break

print(tokenizer.decode(tokens))

prints

Studying Deep-Learning is a great way to learn about the world around you.


Notes

HuggingFace is a company that develops and distributes open-source implementations of language models. The transformers package can be installed using the pip Python package management system with

pip install transformers

The piece of code in the slide loads a GPT-2 model and generates the end of a sentence given "Studying Deep-Learning is" as the beginning.

tokenizer can take as input a sequence of strings and produce a sequence of tokens, or the opposite: take as input a sequence of tokens and produce the corresponding sequence of words.

In this example, the generative procedure picks at each iteration the word with the maximum probability, but stochastic sampling could also be used.
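For instance, a variant of the loop above that samples from a temperature-scaled distribution instead of taking the argmax (the temperature value 0.8 is arbitrary; tokenizer and model are those loaded in the slide's code):

tokens = tokenizer.encode('Studying Deep-Learning is')

for k in range(100):
    logits = model(torch.tensor([tokens])).logits[0, -1]
    probs = torch.softmax(logits / 0.8, dim=-1)                 # temperature-scaled distribution
    next_token = torch.multinomial(probs, num_samples=1).item() # stochastic sampling
    tokens.append(next_token)
    if tokenizer.decode([next_token]) == '.':
        break

print(tokenizer.decode(tokens))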

The generated sentence would make sense in other situations (e.g. replacing "Deep-Learning" with another topic of study), but it is grammatically correct and consistent with "studying". Large models of this class have been shown to exhibit some "zero shot learning" capabilities when they are properly "primed" (Brown et al., 2020).

For instance using HuggingFace’s gpt2-xl model with 1.6B parameters, we can get these sentence completions, where the priming text is between <>:

yellow, and orange is blue.

sour, and orange is bitter.

a fruit, and so on.

BERT (Bidirectional Encoder Representation from Transformers, Devlin et al., 2018) is a transformer pre-trained with:

• Masked Language Model (MLM), that consists in predicting [15% of] words which have been replaced with a “MASK” token.

• Next Sentence Prediction (NSP), which consists in predicting if a certain sentence follows the current one.

It is then fine-tuned on multiple NLP tasks.
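As an illustration of the MLM objective above, a minimal sketch of how masked inputs and training targets could be built (this is not the exact BERT recipe, which also sometimes keeps or randomizes the selected tokens; names are illustrative):

import torch

def make_mlm_batch(token_ids, mask_id, p=0.15):
    # token_ids: (B, T) integer tensor of token indices
    ids = token_ids.clone()
    selected = torch.rand(ids.shape) < p      # choose ~15% of the positions
    targets = torch.full_like(ids, -100)      # -100 = position ignored by the loss
    targets[selected] = ids[selected]         # predict only the masked tokens
    ids[selected] = mask_id                   # replace them with the "MASK" token
    return ids, targets

# The model is then trained with a cross-entropy loss over the vocabulary at every
# position, e.g. nn.CrossEntropyLoss(ignore_index=-100).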


Figure 1: Overall pre-training and fine-tuning procedures for BERT. Apart from output layers, the same architectures are used in both pre-training and fine-tuning. The same pre-trained model parameters are used to initialize models for different down-stream tasks. During fine-tuning, all parameters are fine-tuned. [CLS] is a special symbol added in front of every input example, and [SEP] is a special separator token (e.g. separating questions/answers).

(Devlin et al., 2018)

The paper reports two model sizes: BERT_BASE (L = 12 layers, hidden size H = 768, A = 12 attention heads, 110M parameters) and BERT_LARGE (L = 24, H = 1024, A = 16, 340M parameters). BERT uses bidirectional self-attention, while GPT uses constrained self-attention where every token can only attend to context to its left.

Notes

BERT is trained in an unsupervised manner to do two tasks at the same time: predicting missing words and predicting whether two sentences follow each other in the corpus.


Figure 4: Illustrations of Fine-tuning BERT on Different Tasks. (Devlin et al., 2018)

Head 8-10: Direct objects attend to their verbs; 86.8% accuracy at the dobj relation.

Head 8-11: Noun modifiers (e.g., determiners) attend to their noun; 94.3% accuracy at the det relation.

Head 7-6: Possessive pronouns and apostrophes attend to the head of the corresponding NP; 80.5% accuracy at the poss relation.

Head 4-10: Passive auxiliary verbs attend to the verb they modify; 82.5% accuracy at the auxpass relation.

Head 9-6: Prepositions attend to their objects; 76.3% accuracy at the pobj relation.

Head 5-4: Coreferent mentions attend to their antecedents; 65.1% accuracy at linking the head of a coreferent mention to the head of an antecedent.

Figure 5: BERT attention heads that correspond to linguistic phenomena. In the example attention maps, the darkness of a line indicates the strength of the attention weight. All attention to/from red words is colored red; these colors are there to highlight certain parts of the attention heads' behaviors. For Head 9-6, we don't show attention to [SEP] for clarity. Despite not being explicitly trained on these tasks, BERT's attention heads perform remarkably well, illustrating how syntax-sensitive behavior can emerge from self-supervised training alone. (Clark et al., 2019)


Notes

Once again, visualizing the attention matrices shows that the model connects a word with other words that help to get its meaning.

Attention in computer vision

Wang et al. (2018) proposed an attention mechanism for images, following the model from Vaswani et al. (2017):

y = softmax( (W_θ x)^T (W_φ x) ) W_g x.

They insert "non-local blocks" in residual architectures and get improvements on both video and image classification.
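A minimal sketch of this operation for a single image, without the bottleneck and residual path of the actual non-local block (names are illustrative):

import torch
from torch import nn

class NonLocalAttention2d(nn.Module):
    def __init__(self, channels):
        super().__init__()
        # theta, phi and g correspond to W_theta, W_phi and W_g, as 1x1 convolutions
        self.theta = nn.Conv2d(channels, channels, kernel_size=1)
        self.phi = nn.Conv2d(channels, channels, kernel_size=1)
        self.g = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x):                    # x: (B, C, H, W)
        B, C, H, W = x.shape
        q = self.theta(x).flatten(2)         # (B, C, HW)
        k = self.phi(x).flatten(2)           # (B, C, HW)
        v = self.g(x).flatten(2)             # (B, C, HW)
        att = torch.softmax(q.transpose(1, 2) @ k, dim=-1)    # (B, HW, HW), one row per pixel
        y = (v @ att.transpose(1, 2)).view(B, C, H, W)        # weighted sum of the values
        return y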

Figure 2 (Wang et al., 2018): A spacetime non-local block. The feature maps are shown as the shape of their tensors, e.g., T × H × W × 1024 for 1024 channels (proper reshaping is performed when noted). "⊗" denotes matrix multiplication, and "⊕" denotes element-wise sum. The softmax operation is performed on each row. The blue boxes denote 1 × 1 × 1 convolutions. Here we show the embedded Gaussian version, with a bottleneck of 512 channels.

Notes

The input is a video sequence of T frames of size H × W. Each pixel location is represented by a feature vector of dimension 1024, and the input is of size T × H × W × 1024.

The query, key, and value tensors are obtained by one-by-one convolutions from the input, and a final one-by-one convolution projects the output of the attention mechanism to the original input size. Similarly to the transformer for natural language understanding, there is a residual pass-through.

Figure 3 (Wang et al., 2018): Examples of the behavior of a non-local block in res3 computed by a 5-block non-local model trained on Kinetics. These examples are from held-out validation videos. The starting point of arrows represents one x_i, and the ending points represent x_j. The 20 highest weighted arrows for each x_i are visualized. The 4 frames are from a 32-frame input, shown with a stride of 8 frames. These visualizations show how the model finds related clues to its prediction.

2 Background

2.1 Convolutions Convolutional neural networks (CNNs) are typically employed with small neighborhoods (i.e. kernel sizes) to encourage the network to learn local correlation structures within a particular layer. Given h w din an input x R × × with height h, width w, and input channels din, a local neighborhood k around a pixel∈ x is extracted with spatial extent k, resulting in a region with shape k k d (seeN ij × × in Figure1). Ramachandran et al. (2019) replaced convolutions with local attention. k k dout din dout Given a learned weight matrix W R × × × , the output yij R for position ij is defined ∈ ∈ by spatially summing the product of depthwise matrix of the inputX values: yi,j = Wi−a,j−bxa,b (Convolution) yij = Wi a,j b xab (a,b)∈N (i,j) (1) − − a,b (i,j) ∈NXk X  >  yi,j = softmaxa,b WQ xi,j WK xa,b va,b (Local attention) where (i, j) = a, b a i k/2, b j k/2 (see Figure2). Importantly, CNNs employ k (a,b)∈N (i,j) weightN sharing, where W| is− reused| ≤ for| generating− | ≤ the output for all pixel positions ij. Weight  sharing enforces translation equivariance in the learned representation and consequently decouples the parameter count of the convolution from the input size.
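A minimal, unoptimized sketch of such a local attention layer, using unfold to extract the k × k neighborhoods (names are illustrative; the actual layer also adds relative positional embeddings, discussed below):

import torch
from torch import nn
import torch.nn.functional as F

class LocalAttention2d(nn.Module):
    def __init__(self, channels, k=3):
        super().__init__()
        self.k = k
        self.w_q = nn.Conv2d(channels, channels, kernel_size=1)
        self.w_k = nn.Conv2d(channels, channels, kernel_size=1)
        self.w_v = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x):                              # x: (B, C, H, W)
        B, C, H, W = x.shape
        q = self.w_q(x).flatten(2)                     # (B, C, HW)
        # Extract the k*k neighborhood of keys and values around every pixel
        keys = F.unfold(self.w_k(x), self.k, padding=self.k // 2).view(B, C, self.k * self.k, H * W)
        vals = F.unfold(self.w_v(x), self.k, padding=self.k // 2).view(B, C, self.k * self.k, H * W)
        att = torch.softmax((q.unsqueeze(2) * keys).sum(1), dim=1)  # (B, k*k, HW)
        y = (vals * att.unsqueeze(1)).sum(2)                        # (B, C, HW)
        return y.view(B, C, H, W)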

Figure 1 (Ramachandran et al., 2019): An example of a local window around i = 3, j = 3 (one-indexed) with spatial extent k = 3. Figure 2: An example of a 3 × 3 convolution; the output is the inner product between the local window and the learned weights. Figure 3: An example of a local attention layer over spatial extent of k = 3. Figure 4: An example of relative distance computation; the relative distances are computed with respect to the position of the highlighted pixel, in the format (row offset, column offset).

As currently framed, no positional information is encoded in attention, which makes it permutation equivariant, limiting expressivity for vision tasks. Sinusoidal embeddings based on the absolute position of pixels in an image can be used, but early experimentation suggested that relative positional embeddings result in significantly better accuracies. Relative attention starts by defining the relative distance of (i, j) to each position (a, b) ∈ N_k(i, j). The relative distance is factorized across dimensions, so each element (a, b) receives two distances: a row offset a − i and a column offset b − j (see Figure 4). The row and column offsets are associated with embeddings r_{a−i} and r_{b−j} respectively, each of dimension d_out/2, which are concatenated to form r_{a−i,b−j}. This spatial-relative attention is defined as follows (Ramachandran et al., 2019):

y_{i,j} = Σ_{(a,b) ∈ N_k(i,j)} softmax_{a,b}( q_{i,j}^T k_{a,b} + q_{i,j}^T r_{a−i,b−j} ) v_{a,b}    (3)

Thus, the logit measuring the similarity between the query and an element in N_k(i, j) is modulated both by the content of the element and the relative distance of the element from the query. Note that by infusing relative position information, self-attention also enjoys translation equivariance, similar to convolutions. The parameter count of attention is independent of the size of the spatial extent, whereas the parameter count for convolution grows quadratically with spatial extent. The computational cost of attention also grows slower with spatial extent compared to convolution with typical values of d_in and d_out. For example, if d_in = d_out = 128, a convolution layer with k = 3 has the same computational cost as an attention layer with k = 19.

To build a fully attentional architecture, Ramachandran et al. (2019) take an existing convolutional architecture and replace every instance of a spatial convolution (spatial extent k > 1) with a local attention layer, followed by a 2 × 2 average pooling with stride 2 whenever spatial downsampling is required.

Table 1 (Ramachandran et al., 2019): ImageNet classification results for a ResNet network with different depths. Baseline is a standard ResNet, Conv-stem + Attention uses spatial convolution in the stem and attention everywhere else, and Full Attention uses attention everywhere including the stem.

                          ResNet-26             ResNet-38             ResNet-50
                      FLOPS Params  Acc.    FLOPS Params  Acc.    FLOPS Params  Acc.
                       (B)   (M)    (%)      (B)   (M)    (%)      (B)   (M)    (%)
Baseline               4.7  13.7    74.5     6.5  19.6    76.2     8.2  25.6    76.9
Conv-stem + Attention  4.5  10.3    75.8     5.7  14.1    77.1     7.0  18.0    77.4
Full Attention         4.7  10.3    74.8     6.0  14.1    76.9     7.2  18.0    77.6

Figure 5 (Ramachandran et al., 2019): Comparing parameters and FLOPS against accuracy on ImageNet classification across a range of network widths for ResNet-50; attention models have fewer parameters and FLOPS while improving upon the accuracy of the baseline.

The attention models outperform the baseline across all depths while having 12% fewer floating point operations (FLOPS) and 29% fewer parameters, and this gain is consistent across most model variations generated by both depth and width scaling. On COCO object detection with the RetinaNet architecture, an attention-based backbone matches the mAP of the convolutional backbone with 22% fewer parameters, and making the backbone, FPN and detection heads all attentional matches the baseline mAP with 34% fewer parameters and 39% fewer FLOPS.

(Ramachandran et al., 2019)

“A fully attentional network based off of the proposed stand-alone local self-attention layer achieves competitive predictive performance on ImageNet classification and COCO object detection tasks while requiring fewer parameters and floating point operations than the corresponding convolution baselines.”

(Ramachandran et al., 2019)

Cordonnier et al. (2020) showed that, provided with proper positional encoding, multi-head multiplicative attention layers can encode convolutions with a filter support whose size is the number of heads:

“A multi-head self-attention layer with N_h heads of dimension D_h, output dimension D_out and a relative positional encoding of dimension D_p ≥ 3 can express any convolutional layer of kernel size √N_h × √N_h and min(D_h, D_out) output channels.”

(Cordonnier et al., 2020)
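For instance, with N_h = 9 heads and D_h ≥ D_out, such a layer can express any 3 × 3 convolutional layer with D_out output channels.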


Figure 5: Attention of each head (column) at each layer (row) using learned relative positional encoding without content-based attention. The central black square is the query pixel. We reordered the heads for visualization and zoomed on the 7x7 pixels around the query pixel.

Figure 6: Attention probabilities for a model with 6 layers (rows) and 9 heads (columns) using learned relative positional encoding and content-content based attention. Attention maps are averaged over 100 test images to display head behavior and remove the dependence on the input content. The black square is the query pixel. More examples are presented in Appendix A. (Cordonnier et al., 2020)

https://epfml.github.io/attention-cnn/

References

T. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D. Amodei. Language models are few-shot learners. CoRR, abs/2005.14165, 2020.

K. Clark, U. Khandelwal, O. Levy, and C. Manning. What does BERT look at? An analysis of BERT's attention. CoRR, abs/1906.04341, 2019.

J. Cordonnier, A. Loukas, and M. Jaggi. On the relationship between self-attention and convolutional layers. In International Conference on Learning Representations (ICLR), 2020.

M. Dehghani, S. Gouws, O. Vinyals, J. Uszkoreit, and L. Kaiser. Universal transformers. CoRR, abs/1807.03819, 2018.

J. Devlin, M. Chang, K. Lee, and K. Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. CoRR, abs/1810.04805, 2018.

A. Radford. Improving language understanding with unsupervised learning. Web, June 2018. https://openai.com/blog/language-unsupervised/.

A. Radford, J. Wu, D. Amodei, D. Amodei, J. Clark, M. Brundage, and I. Sutskever. Better language models and their implications. Web, February 2019. https://blog.openai.com/better-language-models/.

P. Ramachandran, N. Parmar, A. Vaswani, I. Bello, A. Levskaya, and J. Shlens. Stand-alone self-attention in vision models. CoRR, abs/1906.05909, 2019.

A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. Gomez, L. Kaiser, and I. Polosukhin. Attention is all you need. CoRR, abs/1706.03762, 2017.

X. Wang, R. Girshick, A. Gupta, and K. He. Non-local neural networks. In Conference on Computer Vision and Pattern Recognition (CVPR), 2018.