
13.3. Transformer Networks

François Fleuret, https://fleuret.org/dlc/

Vaswani et al. (2017) proposed to go one step further: instead of using attention mechanisms as a supplement to standard convolutional and recurrent operations, they designed a model combining only attention layers.

They designed this “transformer” for a sequence-to-sequence translation task, but it is currently key to state-of-the-art approaches across NLP tasks.


Notes

The standard practice is to train a transformer in an unsupervised manner on large unlabeled datasets such as Wikipedia–or re-use a pre-trained transformer–and then fine-tune it in a supervised manner for tasks which require a ground truth, such as sentiment analysis.

They first introduce a multi-head attention module.

Figure 2 (Vaswani et al., 2017): (left) Scaled Dot-Product Attention. (right) Multi-Head Attention consists of several attention layers running in parallel.

Attention(Q, K, V) = softmax( QK^T / √d_k ) V

MultiHead(Q, K, V) = Concat(H_1, ..., H_h) W^O

H_i = Attention(Q W_i^Q, K W_i^K, V W_i^V),   i = 1, ..., h

with W_i^Q ∈ R^{d_model × d_k}, W_i^K ∈ R^{d_model × d_k}, W_i^V ∈ R^{d_model × d_v}, W^O ∈ R^{h d_v × d_model}.
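To make these equations concrete, here is a minimal PyTorch sketch of scaled dot-product attention and of a multi-head layer. This is not a reference implementation; names such as d_model, n_heads, and the optional mask argument are illustrative.

import math
import torch
from torch import nn

def attention(Q, K, V, mask=None):
    # Q, K: (..., T, d_k), V: (..., T, d_v)
    d_k = Q.size(-1)
    A = Q @ K.transpose(-2, -1) / math.sqrt(d_k)      # (..., T, T)
    if mask is not None:
        A = A.masked_fill(mask == 0, float('-inf'))   # e.g. causal mask
    return torch.softmax(A, dim=-1) @ V

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, n_heads):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        # W^Q, W^K, W^V for all the heads at once, and the final W^O
        self.w_q = nn.Linear(d_model, d_model, bias=False)
        self.w_k = nn.Linear(d_model, d_model, bias=False)
        self.w_v = nn.Linear(d_model, d_model, bias=False)
        self.w_o = nn.Linear(d_model, d_model, bias=False)

    def forward(self, q, k, v, mask=None):
        B, T, _ = q.shape
        def split(x):  # (B, T, d_model) -> (B, n_heads, T, d_head)
            return x.view(B, -1, self.n_heads, self.d_head).transpose(1, 2)
        y = attention(split(self.w_q(q)), split(self.w_k(k)), split(self.w_v(v)), mask)
        y = y.transpose(1, 2).contiguous().view(B, T, -1)  # concatenate the heads
        return self.w_o(y)

A causal mask for auto-regressive use, as discussed in the notes below, can be obtained with torch.tril(torch.ones(T, T)).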

Notes

The "scaled dot-product attention" (left) is very close to the attention module we saw in lecture 13.2. "Attention Mechanisms", with the addition of an optional masking (in pink). This may be useful when such a module is used for a generative auto-regressive operation and the attention should be causal, looking only to the past.

The attention is a function of the keys, queries, and values. The only difference with what was seen in the previous course is that the attention matrix is rescaled with the dimension d_k of the embedding, which matters quite a lot: for large d_k the dot products grow large in magnitude, pushing the softmax into regions where it has extremely small gradients, hence the scaling by 1/√d_k.

In the multi-head attention, each head has its own processing of the input keys, queries, and values through respectively W_i^K, W_i^Q, and W_i^V, and there is one final processing W^O applied on the concatenated results of the multiple heads.

Their complete model is composed of:

• An encoder that combines N = 6 modules, each composed of a multi-head attention sub-module and a [per-component] one-hidden-layer MLP, with residual pass-through and layer normalization.

• A decoder with a similar structure, but with causal attention layers to allow for auto-regressive training, and additional attention layers that attend to the output of the encoder.

Positional information is provided through an additive positional encoding of the same dimension d_model as the internal representation, and is of the form

PE_{t,2i} = sin( t / 10,000^{2i / d_model} )

PE_{t,2i+1} = cos( t / 10,000^{(2i+1) / d_model} )
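A minimal sketch of this additive encoding, following the formula as written above (Vaswani et al. use the exponent 2i/d_model for both the sine and cosine components; names are illustrative):

import torch

def positional_encoding(T, d_model):
    # pe[t, j] = sin(t / 10000^(j/d_model)) for even j, cos(t / 10000^(j/d_model)) for odd j
    t = torch.arange(T, dtype=torch.float32).unsqueeze(1)        # (T, 1)
    j = torch.arange(d_model, dtype=torch.float32).unsqueeze(0)  # (1, d_model)
    angles = t / torch.pow(10000.0, j / d_model)                 # (T, d_model)
    return torch.where(j.long() % 2 == 0, torch.sin(angles), torch.cos(angles))

x = torch.randn(16, 512)              # e.g. 16 token embeddings of dimension d_model = 512
x = x + positional_encoding(16, 512)  # the encoding is simply added to the embeddings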


Notes

Contrary to what we previously saw with the concatenated binary positional encoding, here the position is provided as an additive encoding, where t is the position in the sequence, and 2i and 2i+1 the dimension.

Figure 1: The Transformer - model architecture. (Vaswani et al., 2017)

(Vaswani et al., 2017), Section 3.1 "Encoder and Decoder Stacks": the encoder is a stack of N = 6 identical layers, each with a multi-head self-attention sub-layer and a position-wise feed-forward sub-layer, both wrapped in a residual connection followed by layer normalization, with all sub-layers producing outputs of dimension d_model = 512. The decoder adds a third sub-layer that performs multi-head attention over the output of the encoder stack, and masks its self-attention so that the prediction for position i can depend only on positions less than i.

Notes

This is a depiction of the standard transformer architecture for sequence-to-sequence translation. It consists of an encoder (left part) and a decoder (right part). Both are a stack of N = 6 modules.

Each token (sub-word) of the input sequence is encoded with a look-up table to get its embedding of dimension d, so that the input is a tensor of size T × d. Then a positional encoding of the same size is added to it.

Each of the N modules of the encoder is composed of a multi-head self-attention operation followed by a "feed-forward" operation that applies a one-hidden-layer MLP to every position of the sequence separately; the latter can be implemented with 1 × 1 convolutions. Both the self-attention and the feed-forward are combined with residual pass-through.

The decoder is an auto-regressive model, and each of its modules has a multi-head self-attention operation, then an attention that attends to the encoder, and a feed-forward operation. The self-attention is masked to make it causal, i.e. it takes into account only the part of the sequence already generated. The attention to the encoder is not masked, but its keys and values are functions of the outputs of the corresponding module in the encoding stack.
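As an illustration, a possible sketch of one such encoder module, reusing the MultiHeadAttention sketch from above (the exact placement of normalization and the dropout layers of the original model are omitted):

import torch
from torch import nn

class EncoderBlock(nn.Module):
    # One encoder module: self-attention and a per-position one-hidden-layer MLP,
    # each combined with a residual pass-through and layer normalization.
    def __init__(self, d_model, n_heads, d_ff):
        super().__init__()
        self.attn = MultiHeadAttention(d_model, n_heads)  # from the sketch above
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):                        # x: (B, T, d_model)
        x = self.norm1(x + self.attn(x, x, x))   # self-attention + residual + LN
        x = self.norm2(x + self.ff(x))           # position-wise MLP + residual + LN
        return x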

The architecture is tested on English-to-German and English-to-French translation using the standard WMT2014 datasets.

• English-to-German: 4.5M sentence pairs, 37k tokens vocabulary.

• English-to-French: 36M sentence pairs, 32k tokens vocabulary.

• 8 P100 GPUs (150 TFlops FP16), 0.5 day for the small model, 3.5 days for the large one.

Table 2 (Vaswani et al., 2017): The Transformer achieves better BLEU scores than previous state-of-the-art models on the English-to-German and English-to-French newstest2014 tests at a fraction of the training cost.

                                        BLEU                Training Cost (FLOPs)
Model                              EN-DE    EN-FR         EN-DE          EN-FR
ByteNet [18]                       23.75
Deep-Att + PosUnk [39]                      39.2                        1.0 · 10^20
GNMT + RL [38]                     24.6     39.92         2.3 · 10^19   1.4 · 10^20
ConvS2S [9]                        25.16    40.46         9.6 · 10^18   1.5 · 10^20
MoE [32]                           26.03    40.56         2.0 · 10^19   1.2 · 10^20
Deep-Att + PosUnk Ensemble [39]             40.4                        8.0 · 10^20
GNMT + RL Ensemble [38]            26.30    41.16         1.8 · 10^20   1.1 · 10^21
ConvS2S Ensemble [9]               26.36    41.29         7.7 · 10^19   1.2 · 10^21
Transformer (base model)           27.3     38.1                3.3 · 10^18
Transformer (big)                  28.4     41.8                2.3 · 10^19

Label Smoothing: During training, we employed label smoothing of value ε_ls = 0.1 [36]. This hurts perplexity, as the model learns to be more unsure, but improves accuracy and BLEU score.

(Vaswani et al., 2017)

6 Results

6.1 Machine Translation

On the WMT 2014 English-to-German translation task, the big transformer model (Transformer (big) in Table 2) outperforms the best previously reported models (including ensembles) by more than 2.0 BLEU, establishing a new state-of-the-art BLEU score of 28.4. The configuration of this model is listed in the bottom line of Table 3. Training took 3.5 days on 8 P100 GPUs. Even our base model surpasses all previously published models and ensembles, at a fraction of the training cost of any of the competitive models.

On the WMT 2014 English-to-French translation task, our big model achieves a BLEU score of 41.0, outperforming all of the previously published single models, at less than 1/4 the training cost of the previous state-of-the-art model. The Transformer (big) model trained for English-to-French used dropout rate P_drop = 0.1, instead of 0.3.

For the base models, we used a single model obtained by averaging the last 5 checkpoints, which were written at 10-minute intervals. For the big models, we averaged the last 20 checkpoints. We used beam search with a beam size of 4 and length penalty α = 0.6 [38]. These hyperparameters were chosen after experimentation on the development set. We set the maximum output length during inference to input length + 50, but terminate early when possible [38]. Table 2 summarizes our results and compares our translation quality and training costs to other model architectures from the literature. We estimate the number of floating point operations used to train a model by multiplying the training time, the number of GPUs used, and an estimate of the sustained single-precision floating-point capacity of each GPU.

(Vaswani et al., 2017)

Notes

The standard metric in natural language processing is the Bilingual Evaluation Understudy (BLEU) score, which aims at evaluating a generated sequence against a reference sentence. The BLEU score ranges between 0 (perfect mismatch) and 1 (perfect match).

6.2 Model Variations

To evaluate the importance of different components of the Transformer, we varied our base model in different ways, measuring the change in performance on English-to-German translation on the development set, newstest2013. We used beam search as described in the previous section, but no checkpoint averaging. We present these results in Table 3. In Table 3 rows (A), we vary the number of attention heads and the attention key and value dimensions, keeping the amount of computation constant, as described in Section 3.2.2. While single-head attention is 0.9 BLEU worse than the best setting, quality also drops off with too many heads.

Footnote 5 (Vaswani et al., 2017): We used values of 2.8, 3.7, 6.0 and 9.5 TFLOPS for K80, K40, M40 and P100, respectively.

Figure 4 (Vaswani et al., 2017): Two attention heads, also in layer 5 of 6, apparently involved in anaphora resolution. Top: Full attentions for head 5. Bottom: Isolated attentions from just the word 'its' for attention heads 5 and 6. Note that the attentions are very sharp for this word.

Notes

On the left is a visualization of the attention computed by one head of the encoder at layer 5. On the right, the attention given by the word "its" for two different heads is on "law" and "application", which provides help for gender and grammatical issues.

Figure 5 (Vaswani et al., 2017): Many of the attention heads exhibit behaviour that seems related to the structure of the sentence. We give two such examples above, from two different heads from the encoder self-attention at layer 5 of 6. The heads clearly learned to perform different tasks.

Notes

Two other heads, also in layer 5.

The Universal Transformer (Dehghani et al., 2018) is a similar model where all the blocks are identical, resulting in a recurrent model that iterates over consecutive revisions of the representation instead of positions.

Additionally the number of steps is modulated per position dynamically.

Transformer self-training and fine-tuning for NLP

The transformer networks were introduced for translation, and trained with a supervised procedure, from pairs of sentences.

However, as for word embeddings, they can be trained in an unsupervised manner, for auto-regression or as denoising auto-encoders, from very large datasets, and fine-tuned on supervised tasks with small datasets.


Notes

A transformer is [pre-]trained in an unsupervised manner for the task of predicting a token: for auto-regression, the input is the sentence up to the token to predict; for masked language modeling, the input is a full sentence with some tokens replaced by a "mask" token. No ground truth is required for those tasks.

As for word embeddings, training a transformer model like this allows it to capture statistical structure in the text and provides an extremely good representation for more sophisticated tasks, which can only be trained in a supervised manner with only small datasets available.


Figure 3: Differences in pre-training model architectures. BERT uses a bidirectional Transformer. OpenAI GPT uses a left-to-right Transformer. ELMo uses the concatenation of independently trained left-to-right and right-to- left LSTMs to generate features for downstream tasks. Among the three, only BERT representations are jointly conditioned on both left and right context in all layers. In addition to the architecture differences, BERT and OpenAI GPT are fine-tuning approaches, while ELMo is a feature-based approach.

(Devlin et al., 2018). The accompanying excerpt describes BERT's pre-training setup: masked LM and next-sentence prediction objectives, batches of 256 sequences of up to 512 tokens, 1,000,000 steps (about 40 epochs over a 3.3-billion-word corpus), trained on Cloud TPUs for about 4 days.

GPT (Generative Pre-Training, Radford, 2018) is a transformer trained for auto-regressive text generation.

Figure 1: (left) Transformer architecture and training objectives used in this work. (right) Input transformations for fine-tuning on different tasks. We convert all structured inputs into token sequences to be processed by our pre-trained model, followed by a linear+softmax layer. (Radford, 2018)

(Radford, 2018), Section 3.3 "Task-specific input transformations": while text classification can be fine-tuned directly, tasks with structured inputs (ordered sentence pairs, or document / question / answer triplets) are converted into ordered token sequences with start, end and delimiter tokens, so that the pre-trained model can process them without task-specific architectures on top.

Notes

Note that GPT is inherently causal, so it only carries information forward, and consists of 12 modules as opposed to 6 for the original transformer.

The tasks GPT can be fine-tuned on are:

• Classification: for instance for sentiment analysis, when the input is a comment and the task is to predict whether it is positive or negative.

• Entailment: given a premise and a hypothesis, the task is to predict whether the hypothesis is implied by the premise.

• Similarity: the task is to predict if two pieces of text have the same meaning.

• Multiple choice: the task is to predict the correct answer.

“GPT-2 is a large transformer-based language model with 1.5 billion parameters, trained on a dataset of 8 million web pages. GPT-2 is trained with a simple objective: predict the next word, given all of the previous words within some text. The diversity of the dataset causes this simple goal to contain naturally occurring demonstrations of many tasks across diverse domains. GPT-2 is a direct scale-up of GPT, with more than 10X the parameters and trained on more than 10X the amount of data.”

(Radford et al., 2019)

The GPT-3 model has 175B parameters and was trained on 300B tokens from various sources (Brown et al., 2020).

We can use HuggingFace's pre-trained models (https://huggingface.co/).

import torch

from transformers import GPT2Tokenizer, GPT2LMHeadModel

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2')
model.eval()

tokens = tokenizer.encode('Studying Deep-Learning is')

for k in range(100): # no more than 100 tokens
    outputs = model(torch.tensor([tokens])).logits
    next_token = torch.argmax(outputs[0, -1])
    print(k, next_token)
    tokens.append(next_token)
    if tokenizer.decode([next_token]) == '.':
        break

print(tokenizer.decode(tokens))

prints

Studying Deep-Learning is a great way to learn about the world around you.


Notes

HuggingFace is a company that develops and distributes open-source implementations of language models. The transformers package can be installed using the pip Python package management system with

pip install transformers

The piece of code in the slide loads a GPT-2 model and generates the end of a sentence given "Studying Deep-Learning is" as the beginning.

tokenizer can take as input a sequence of strings and produce a sequence of tokens, or the opposite: take as input a sequence of tokens and produce the corresponding sequence of words.

In this example, the generative procedure picks at each iteration the word with the maximum probability, but stochastic sampling could also be used.
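For instance, a variant of the loop above that samples from a temperature-scaled distribution instead of taking the argmax (the temperature value 0.8 is arbitrary; tokenizer and model are those loaded in the slide's code):

tokens = tokenizer.encode('Studying Deep-Learning is')

for k in range(100):
    logits = model(torch.tensor([tokens])).logits[0, -1]
    probs = torch.softmax(logits / 0.8, dim=-1)                 # temperature-scaled distribution
    next_token = torch.multinomial(probs, num_samples=1).item() # stochastic sampling
    tokens.append(next_token)
    if tokenizer.decode([next_token]) == '.':
        break

print(tokenizer.decode(tokens))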

The generated sentence would make sense in other situations (e.g. replacing "Deep-Learning" with another topic of study), but it is grammatically correct and consistent with "studying". Large models of this class have been shown to exhibit some "zero shot learning" capabilities when they are properly "primed" (Brown et al., 2020).

For instance using HuggingFace’s gpt2-xl model with 1.6B parameters, we can get these sentence completions, where the priming text is between <>:

yellow, and orange is blue.

sour, and orange is bitter.

a fruit, and so on.

BERT (Bidirectional Encoder Representation from Transformers, Devlin et al., 2018) is a transformer pre-trained with:

• Masked Language Model (MLM), that consists in predicting [15% of] words which have been replaced with a “MASK” token.

• Next Sentence Prediction (NSP), which consists in predicting if a certain sentence follows the current one.

It is then fine-tuned on multiple NLP tasks.
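As an illustration of the MLM objective above, a minimal sketch of how masked inputs and training targets could be built (this is not the exact BERT recipe, which also sometimes keeps or randomizes the selected tokens; names are illustrative):

import torch

def make_mlm_batch(token_ids, mask_id, p=0.15):
    # token_ids: (B, T) integer tensor of token indices
    ids = token_ids.clone()
    selected = torch.rand(ids.shape) < p      # choose ~15% of the positions
    targets = torch.full_like(ids, -100)      # -100 = position ignored by the loss
    targets[selected] = ids[selected]         # predict only the masked tokens
    ids[selected] = mask_id                   # replace them with the "MASK" token
    return ids, targets

# The model is then trained with a cross-entropy loss over the vocabulary at every
# position, e.g. nn.CrossEntropyLoss(ignore_index=-100).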


Figure 1: Overall pre-training and fine-tuning procedures for BERT. Apart from output layers, the same architectures are used in both pre-training and fine-tuning. The same pre-trained model parameters are used to initialize models for different down-stream tasks. During fine-tuning, all parameters are fine-tuned. [CLS] is a special symbol added in front of every input example, and [SEP] is a special separator token (e.g. separating questions/answers).

(Devlin et al., 2018)

The paper reports two model sizes: BERT_BASE (L = 12 layers, hidden size H = 768, A = 12 attention heads, 110M parameters) and BERT_LARGE (L = 24, H = 1024, A = 16, 340M parameters). BERT uses bidirectional self-attention, while GPT uses constrained self-attention where every token can only attend to context to its left.

Notes

BERT is trained in an unsupervised manner to do two tasks at the same time: predicting missing words and predicting whether two sentences follow each other in the corpus.


Figure 4: Illustrations of Fine-tuning BERT on Different Tasks. (Devlin et al., 2018)

Head 8-10: Direct objects attend to their verbs; 86.8% accuracy at the dobj relation.

Head 8-11: Noun modifiers (e.g., determiners) attend to their noun; 94.3% accuracy at the det relation.

Head 7-6: Possessive pronouns and apostrophes attend to the head of the corresponding NP; 80.5% accuracy at the poss relation.

Head 4-10: Passive auxiliary verbs attend to the verb they modify; 82.5% accuracy at the auxpass relation.

Head 9-6: Prepositions attend to their objects; 76.3% accuracy at the pobj relation.

Head 5-4: Coreferent mentions attend to their antecedents; 65.1% accuracy at linking the head of a coreferent mention to the head of an antecedent.

Figure 5: BERT attention heads that correspond to linguistic phenomena. In the example attention maps, the darkness of a line indicates the strength of the attention weight. All attention to/from red words is colored red; these colors are there to highlight certain parts of the attention heads' behaviors. For Head 9-6, we don't show attention to [SEP] for clarity. Despite not being explicitly trained on these tasks, BERT's attention heads perform remarkably well, illustrating how syntax-sensitive behavior can emerge from self-supervised training alone. (Clark et al., 2019)


Notes

Once again, visualizing the attention matrices shows that the model connects a word with other words that help to get its meaning.

Attention in computer vision

Wang et al. (2018) proposed an attention mechanism for images, following the model from Vaswani et al. (2017):

y = softmax( (W_θ x)^T (W_φ x) ) W_g x.

They insert "non-local blocks" in residual architectures and get improvements on both video and image classification.
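A minimal sketch of this operation for a single image, without the bottleneck and residual path of the actual non-local block (names are illustrative):

import torch
from torch import nn

class NonLocalAttention2d(nn.Module):
    def __init__(self, channels):
        super().__init__()
        # theta, phi and g correspond to W_theta, W_phi and W_g, as 1x1 convolutions
        self.theta = nn.Conv2d(channels, channels, kernel_size=1)
        self.phi = nn.Conv2d(channels, channels, kernel_size=1)
        self.g = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x):                    # x: (B, C, H, W)
        B, C, H, W = x.shape
        q = self.theta(x).flatten(2)         # (B, C, HW)
        k = self.phi(x).flatten(2)           # (B, C, HW)
        v = self.g(x).flatten(2)             # (B, C, HW)
        att = torch.softmax(q.transpose(1, 2) @ k, dim=-1)    # (B, HW, HW), one row per pixel
        y = (v @ att.transpose(1, 2)).view(B, C, H, W)        # weighted sum of the values
        return y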

Figure 2 (Wang et al., 2018): A spacetime non-local block. The feature maps are shown as the shape of their tensors, e.g., T × H × W × 1024 for 1024 channels (proper reshaping is performed when noted). "⊗" denotes matrix multiplication, and "⊕" denotes element-wise sum. The softmax operation is performed on each row. The blue boxes denote 1 × 1 × 1 convolutions. Here we show the embedded Gaussian version, with a bottleneck of 512 channels.

Notes

The input is a video sequence of T frames of size H × W. Each pixel location is represented by a feature vector of dimension 1024, and the input is of size T × H × W × 1024.

The query, key, and value tensors are obtained by one-by-one convolutions from the input, and a final one-by-one convolution projects the output of the attention mechanism to the original input size. Similarly to the transformer for natural language understanding, there is a residual pass-through.

Figure 3 (Wang et al., 2018): Examples of the behavior of a non-local block in res3 computed by a 5-block non-local model trained on Kinetics. These examples are from held-out validation videos. The starting point of arrows represents one x_i, and the ending points represent x_j. The 20 highest weighted arrows for each x_i are visualized. The 4 frames are from a 32-frame input, shown with a stride of 8 frames. These visualizations show how the model finds related clues to its prediction.

2 Background

2.1 Convolutions Convolutional neural networks (CNNs) are typically employed with small neighborhoods (i.e. kernel sizes) to encourage the network to learn local correlation structures within a particular layer. Given h w din an input x R × × with height h, width w, and input channels din, a local neighborhood k around a pixel∈ x is extracted with spatial extent k, resulting in a region with shape k k d (seeN ij × × in Figure1). Ramachandran et al. (2019) replaced convolutions with local attention. k k dout din dout Given a learned weight matrix W R × × × , the output yij R for position ij is defined ∈ ∈ by spatially summing the product of depthwise matrix of the inputX values: yi,j = Wi−a,j−bxa,b (Convolution) yij = Wi a,j b xab (a,b)∈N (i,j) (1) − − a,b (i,j) ∈NXk X  >  yi,j = softmaxa,b WQ xi,j WK xa,b va,b (Local attention) where (i, j) = a, b a i k/2, b j k/2 (see Figure2). Importantly, CNNs employ k (a,b)∈N (i,j) weightN sharing, where W| is− reused| ≤ for| generating− | ≤ the output for all pixel positions ij. Weight  sharing enforces translation equivariance in the learned representation and consequently decouples the parameter count of the convolution from the input size.
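A minimal, unoptimized sketch of such a local attention layer, using unfold to extract the k × k neighborhoods (names are illustrative; the actual layer also adds relative positional embeddings, discussed below):

import torch
from torch import nn
import torch.nn.functional as F

class LocalAttention2d(nn.Module):
    def __init__(self, channels, k=3):
        super().__init__()
        self.k = k
        self.w_q = nn.Conv2d(channels, channels, kernel_size=1)
        self.w_k = nn.Conv2d(channels, channels, kernel_size=1)
        self.w_v = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x):                              # x: (B, C, H, W)
        B, C, H, W = x.shape
        q = self.w_q(x).flatten(2)                     # (B, C, HW)
        # Extract the k*k neighborhood of keys and values around every pixel
        keys = F.unfold(self.w_k(x), self.k, padding=self.k // 2).view(B, C, self.k * self.k, H * W)
        vals = F.unfold(self.w_v(x), self.k, padding=self.k // 2).view(B, C, self.k * self.k, H * W)
        att = torch.softmax((q.unsqueeze(2) * keys).sum(1), dim=1)  # (B, k*k, HW)
        y = (vals * att.unsqueeze(1)).sum(2)                        # (B, C, HW)
        return y.view(B, C, H, W)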

Figure 1 (Ramachandran et al., 2019): An example of a local window around i = 3, j = 3 (one-indexed) with spatial extent k = 3. Figure 2: An example of a 3 × 3 convolution; the output is the inner product between the local window and the learned weights. Figure 3: An example of a local attention layer over spatial extent of k = 3. Figure 4: An example of relative distance computation; the relative distances are computed with respect to the position of the highlighted pixel, in the format (row offset, column offset).

As currently framed, no positional information is encoded in attention, which makes it permutation equivariant, limiting expressivity for vision tasks. Sinusoidal embeddings based on the absolute position of pixels in an image can be used, but early experimentation suggested that relative positional embeddings result in significantly better accuracies. Relative attention starts by defining the relative distance of (i, j) to each position (a, b) ∈ N_k(i, j). The relative distance is factorized across dimensions, so each element (a, b) receives two distances: a row offset a − i and a column offset b − j (see Figure 4). The row and column offsets are associated with embeddings r_{a−i} and r_{b−j} respectively, each of dimension d_out/2, which are concatenated to form r_{a−i,b−j}. This spatial-relative attention is defined as follows (Ramachandran et al., 2019):

y_{i,j} = Σ_{(a,b) ∈ N_k(i,j)} softmax_{a,b}( q_{i,j}^T k_{a,b} + q_{i,j}^T r_{a−i,b−j} ) v_{a,b}    (3)

Thus, the logit measuring the similarity between the query and an element in N_k(i, j) is modulated both by the content of the element and the relative distance of the element from the query. Note that by infusing relative position information, self-attention also enjoys translation equivariance, similar to convolutions. The parameter count of attention is independent of the size of the spatial extent, whereas the parameter count for convolution grows quadratically with spatial extent. The computational cost of attention also grows slower with spatial extent compared to convolution with typical values of d_in and d_out. For example, if d_in = d_out = 128, a convolution layer with k = 3 has the same computational cost as an attention layer with k = 19.

To build a fully attentional architecture, Ramachandran et al. (2019) take an existing convolutional architecture and replace every instance of a spatial convolution (spatial extent k > 1) with a local attention layer, followed by a 2 × 2 average pooling with stride 2 whenever spatial downsampling is required.

Table 1 (Ramachandran et al., 2019): ImageNet classification results for a ResNet network with different depths. Baseline is a standard ResNet, Conv-stem + Attention uses spatial convolution in the stem and attention everywhere else, and Full Attention uses attention everywhere including the stem.

                          ResNet-26             ResNet-38             ResNet-50
                      FLOPS Params  Acc.    FLOPS Params  Acc.    FLOPS Params  Acc.
                       (B)   (M)    (%)      (B)   (M)    (%)      (B)   (M)    (%)
Baseline               4.7  13.7    74.5     6.5  19.6    76.2     8.2  25.6    76.9
Conv-stem + Attention  4.5  10.3    75.8     5.7  14.1    77.1     7.0  18.0    77.4
Full Attention         4.7  10.3    74.8     6.0  14.1    76.9     7.2  18.0    77.6

Figure 5 (Ramachandran et al., 2019): Comparing parameters and FLOPS against accuracy on ImageNet classification across a range of network widths for ResNet-50; attention models have fewer parameters and FLOPS while improving upon the accuracy of the baseline.

The attention models outperform the baseline across all depths while having 12% fewer floating point operations (FLOPS) and 29% fewer parameters, and this gain is consistent across most model variations generated by both depth and width scaling. On COCO object detection with the RetinaNet architecture, an attention-based backbone matches the mAP of the convolutional backbone with 22% fewer parameters, and making the backbone, FPN and detection heads all attentional matches the baseline mAP with 34% fewer parameters and 39% fewer FLOPS.

(Ramachandran et al., 2019)

“A fully attentional network based off of the proposed stand-alone local self-attention layer achieves competitive predictive performance on ImageNet classification and COCO object detection tasks while requiring fewer parameters and floating point operations than the corresponding convolution baselines.”

(Ramachandran et al., 2019)

Cordonnier et al. (2020) showed that, provided with proper positional encoding, multi-head multiplicative attention layers can encode convolutions with a filter support whose size is the number of heads:

“A multi-head self-attention layer with N_h heads of dimension D_h, output dimension D_out and a relative positional encoding of dimension D_p ≥ 3 can express any convolutional layer of kernel size √N_h × √N_h and min(D_h, D_out) output channels.”

(Cordonnier et al., 2020)
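For instance, with N_h = 9 heads and D_h ≥ D_out, such a layer can express any 3 × 3 convolutional layer with D_out output channels.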


Figure 5: Attention of each head (column) at each layer (row) using learned relative positional encoding without content-based attention. The central black square is the query pixel. We reordered the heads for visualization and zoomed on the 7x7 pixels around the query pixel.

Figure 6: Attention probabilities for a model with 6 layers (rows) and 9 heads (columns) using learned relative positional encoding and content-content based attention. Attention maps are averaged over 100 test images to display head behavior and remove the dependence on the input content. The black square is the query pixel. More examples are presented in Appendix A. (Cordonnier et al., 2020)

https://epfml.github.io/attention-cnn/

References

T. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D. Amodei. Language models are few-shot learners. CoRR, abs/2005.14165, 2020.

K. Clark, U. Khandelwal, O. Levy, and C. Manning. What does BERT look at? An analysis of BERT's attention. CoRR, abs/1906.04341, 2019.

J. Cordonnier, A. Loukas, and M. Jaggi. On the relationship between self-attention and convolutional layers. In International Conference on Learning Representations (ICLR), 2020.

M. Dehghani, S. Gouws, O. Vinyals, J. Uszkoreit, and L. Kaiser. Universal transformers. CoRR, abs/1807.03819, 2018.

J. Devlin, M. Chang, K. Lee, and K. Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. CoRR, abs/1810.04805, 2018.

A. Radford. Improving language understanding with unsupervised learning. Web, June 2018. https://openai.com/blog/language-unsupervised/.

A. Radford, J. Wu, D. Amodei, D. Amodei, J. Clark, M. Brundage, and I. Sutskever. Better language models and their implications. Web, February 2019. https://blog.openai.com/better-language-models/.

P. Ramachandran, N. Parmar, A. Vaswani, I. Bello, A. Levskaya, and J. Shlens. Stand-alone self-attention in vision models. CoRR, abs/1906.05909, 2019.

A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. Gomez, L. Kaiser, and I. Polosukhin. Attention is all you need. CoRR, abs/1706.03762, 2017.

X. Wang, R. Girshick, A. Gupta, and K. He. Non-local neural networks. In Conference on Computer Vision and Pattern Recognition (CVPR), 2018.