Transformers for large-scale language and image modeling

Outline
• Basic transformer model (review and details)
• Transformer-based language models
  • BERT
  • GPT
  • Other models
  • Concerns
• Models for vision and language
  • Image transformers
  • Image-text transformers: CLIP, DALL-E

Basic transformer model (review)
• Sequence-to-sequence architecture using only point-wise processing and attention (no recurrent units or convolutions)

Encoder: receives the entire input sequence and outputs an encoded sequence of the same length.
Decoder: predicts the next token conditioned on the encoder output and the previously predicted tokens.

A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. Gomez, L. Kaiser, I. Polosukhin, Attention is all you need, NeurIPS 2017

Key-Value-Query attention model

1. The decoder generates a query describing what it wants to focus on.
2. Compute dot products between the query and the keys generated by the encoder, giving alignment scores between the source tokens and the query.
3. Feed the scores into a softmax to create the attention weights.
4. Sum the values generated by the encoder, weighted by the attention weights.

Key-Value-Query attention model

• Query vectors: $Q$ (generated by the decoder)
• Key vectors: $K = XW_K$
• Value vectors: $V = XW_V$
• Similarities: scaled dot-product attention, $e_{i,j} = q_i \cdot k_j / \sqrt{D}$, or $E = QK^T / \sqrt{D}$ ($D$ is the dimensionality of the keys)
• Attention weights: $A = \mathrm{softmax}(E, \mathrm{dim}=1)$
• Output vectors: $y_i = \sum_j a_{i,j} v_j$, or $Y = AV$

Adapted from J. Johnson

Key-Value-Query attention model

• How does permuting the order of the queries change the output?
• How does changing the order of the keys/values change the output?

Adapted from J. Johnson

Attention mechanisms

• Encoder self-attention: queries, keys, and values come from the previous layer of the encoder
• Decoder self-attention: values corresponding to future decoder outputs are masked out
• Encoder-decoder attention: queries come from the previous decoder layer; keys and values come from the output of the encoder

Self-attention
• Used to capture context within the sequence

As we are encoding “it”, we should focus on “the animal” (in one context) or on “the street” (in another).

Self-attention layer (one query per input vector)
• Query vectors: $Q = XW_Q$
• Key vectors: $K = XW_K$
• Value vectors: $V = XW_V$
• Similarities: scaled dot-product attention, $e_{i,j} = q_i \cdot k_j / \sqrt{D}$, or $E = QK^T / \sqrt{D}$ ($D$ is the dimensionality of the keys)
• Attention weights: $A = \mathrm{softmax}(E, \mathrm{dim}=1)$
• Output vectors: $y_i = \sum_j a_{i,j} v_j$, or $Y = AV$

Adapted from J. Johnson
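To make the computation above concrete, here is a minimal NumPy sketch of a single-head self-attention layer; the projection matrices are random stand-ins for learned parameters and the sizes are arbitrary.

```python
# Minimal sketch of single-head self-attention (notation as above).
import numpy as np

def self_attention(X, W_Q, W_K, W_V):
    """X: (T, D_in) input vectors; returns (T, D_v) outputs Y = AV."""
    Q = X @ W_Q                      # query vectors, one per input
    K = X @ W_K                      # key vectors
    V = X @ W_V                      # value vectors
    D = K.shape[-1]
    E = Q @ K.T / np.sqrt(D)         # scaled dot-product similarities
    A = np.exp(E - E.max(axis=-1, keepdims=True))
    A = A / A.sum(axis=-1, keepdims=True)   # softmax over keys (rows sum to 1)
    return A @ V                     # outputs are attention-weighted sums of values

T, D_in, D = 4, 8, 16                # toy sizes
rng = np.random.default_rng(0)
X = rng.normal(size=(T, D_in))
W_Q, W_K, W_V = (rng.normal(size=(D_in, D)) for _ in range(3))
Y = self_attention(X, W_Q, W_K, W_V)
print(Y.shape)                       # (4, 16)
```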

Masked self-attention layer
• The decoder should not “look ahead” in the output sequence (e.g., when generating the tokens that follow “This is …”)


• Masking: alignment scores $e_{i,j}$ for future positions ($j > i$) are set to $-\infty$ before the softmax, so the corresponding attention weights become 0
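A minimal NumPy sketch of this masking step, in the same single-head setting as the earlier sketch: scores above the diagonal (future positions) are set to −∞ before the softmax, so their attention weights come out as exactly 0.

```python
# Minimal sketch of causally masked self-attention.
import numpy as np

def masked_self_attention(Q, K, V):
    T, D = Q.shape
    E = Q @ K.T / np.sqrt(D)                           # (T, T) alignment scores
    future = np.triu(np.ones((T, T), dtype=bool), k=1) # True above the diagonal = future tokens
    E = np.where(future, -np.inf, E)                   # block attention to future positions
    A = np.exp(E - E.max(axis=-1, keepdims=True))
    A = A / A.sum(axis=-1, keepdims=True)              # rows are lower-triangular after softmax
    return A @ V
```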

Adapted from J. Johnson Attention mechanisms: Summary

Encoder and decoder each consist of a stack of N transformer blocks.

• Encoder self-attention: queries, keys, and values come from the previous layer of the encoder
• Decoder self-attention: values corresponding to future decoder outputs are masked out
• Encoder-decoder attention: queries come from the previous decoder layer; keys and values come from the output of the encoder

Attention mechanisms: Illustration

https://ai.googleblog.com/2017/08/transformer-novel-neural-network.html Transformer architecture: Details

(Figure: the full encoder-decoder architecture from Vaswani et al.)

A. Vaswani et al., Attention is all you need, NeurIPS 2017

Multi-head attention
• Run $h$ attention models in parallel on top of different linearly projected versions of $Q$, $K$, $V$; concatenate and linearly project the results
• Intuition: enables the model to attend to different kinds of information at different positions (see visualization tool)

Transformer blocks
• A transformer is a sequence of transformer blocks
• Vaswani et al.: N = 12 blocks (6 encoder + 6 decoder), embedding dimension = 512, 8 attention heads
• Add & Norm: residual connection followed by layer normalization
• Feedforward: two linear layers with a ReLU in between, applied independently to each vector
• Attention is the only interaction between inputs!

Positional encoding
• To give the transformer information about the ordering of tokens, add a function of position (based on sines and cosines) to every input
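A minimal NumPy sketch of the sinusoidal positional encoding from Vaswani et al. (sines on even dimensions, cosines on odd ones), assuming an even embedding dimension; the result is added to the token embeddings.

```python
# Minimal sketch of sinusoidal positional encodings (assumes d_model is even).
import numpy as np

def positional_encoding(max_len, d_model):
    pos = np.arange(max_len)[:, None]                 # (max_len, 1) positions
    i = np.arange(0, d_model, 2)[None, :]             # even dimension indices
    angles = pos / np.power(10000.0, i / d_model)     # (max_len, d_model / 2)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                      # even dimensions: sine
    pe[:, 1::2] = np.cos(angles)                      # odd dimensions: cosine
    return pe

embeddings = np.random.randn(50, 512)                 # toy token embeddings (seq_len, d_model)
inputs = embeddings + positional_encoding(50, 512)    # position-aware transformer inputs
```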

(Figure: positional encoding values as a function of position and embedding dimension)

Transformer architecture: Zooming back out

(Figure: the full encoder-decoder architecture)

A. Vaswani et al., Attention is all you need, NeurIPS 2017 Results

https://ai.googleblog.com/2017/08/transformer-novel-neural-network.html Different ways of processing sequences

RNN
• Works on ordered sequences
• Pro: good for long sequences: after one RNN layer, $h_T$ “sees” the whole sequence
• Con: not parallelizable: need to compute hidden states sequentially
• Con: hidden states have limited expressive capacity

1D convolutional network
• Works on multidimensional grids
• Pro: each output can be computed in parallel (at training time)
• Con: bad at long sequences: need to stack many conv layers for outputs to “see” the whole sequence

Transformer
• Works on sets of vectors
• Pro: good at long sequences: after one self-attention layer, each output “sees” all inputs!
• Pro: each output can be computed in parallel (at training time)
• Con: memory-intensive

Making transformers more efficient

R. Child et al., Generating Long Sequences with Sparse Transformers, arXiv 2019

Outline
• Basic transformer model
• Transformer-based language models
  • BERT
  • GPT
  • Other models
  • Concerns

Self-supervised language modeling with transformers
1. Download A LOT of text from the internet
2. Train a giant transformer using a suitable pretext task
3. Fine-tune the transformer on the desired NLP task


Bidirectional Encoder Representations from Transformers (BERT)

J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, NAACL 2019

BERT: Pretext tasks
• Masked language model (MLM)
• Randomly mask 15% of the tokens in the input sentences; the goal is to reconstruct them using bidirectional context
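As a concrete illustration of this pretext task, here is a minimal Python sketch of the 15% masking rule; the whitespace tokenization and the "[MASK]" string are placeholders rather than BERT's actual WordPiece pipeline (the paper additionally keeps or randomly replaces some of the selected tokens instead of masking them).

```python
# Minimal sketch of building a masked-language-modeling training example.
import random

def make_mlm_example(tokens, mask_prob=0.15, mask_token="[MASK]"):
    inputs, targets = [], []
    for tok in tokens:
        if random.random() < mask_prob:
            inputs.append(mask_token)   # hide the token from the model...
            targets.append(tok)         # ...but keep it as the prediction target
        else:
            inputs.append(tok)
            targets.append(None)        # no loss on unmasked positions
    return inputs, targets

inputs, targets = make_mlm_example("the man went to the store".split())
```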

BERT: Pretext tasks
• Next sentence prediction (NSP): predict the likelihood that sentence B belongs after sentence A
• Useful for Question Answering and Natural Language Inference tasks
• In the training data, 50% of the time B is the actual sentence that follows A (labeled as IsNext), and 50% of the time it is a random sentence (labeled as NotNext)
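A minimal Python sketch of how IsNext/NotNext pairs can be constructed from a list of consecutive sentences; a real pipeline would avoid re-sampling the true next sentence and would pack the pair as [CLS] A [SEP] B [SEP].

```python
# Minimal sketch of next-sentence-prediction pair construction (50/50 split).
import random

def make_nsp_pair(sentences, i):
    """sentences: list of consecutive sentences; i: index of sentence A."""
    a = sentences[i]
    if random.random() < 0.5:
        return a, sentences[i + 1], "IsNext"    # the true next sentence
    else:
        j = random.randrange(len(sentences))    # a random (likely unrelated) sentence
        return a, sentences[j], "NotNext"
```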

BERT: More detailed view

Tokenization: WordPiece (from GNMT)

Trained on Wikipedia (2.5B words) + BookCorpus (800M words)

BERT: Evaluation
• General Language Understanding Evaluation (GLUE) benchmark (gluebenchmark.com)

BERT: Downstream tasks

Textual entailment

Source: J. Hockenmaier

Entailment, textual equivalence and similarity BERT: Downstream tasks

Sentiment classification, linguistic acceptability

BERT: Downstream tasks

Find span in paragraph that contains the answer Source: SQuAD v1.1 paper BERT: Downstream tasks


Named entity recognition Other language models

Source: J. Johnson

Scaling up transformers

Model | Layers | Hidden dim. | Heads | Params | Data | Training
Transformer-Base | 12 | 512 | 8 | 65M | | 8x P100 (12 hours)
Transformer-Large | 12 | 1024 | 16 | 213M | | 8x P100 (3.5 days)

Vaswani et al., Attention is all you need, NeurIPS 2017 Source: J. Johnson Scaling up transformers

Model | Layers | Hidden dim. | Heads | Params | Data | Training
Transformer-Base | 12 | 512 | 8 | 65M | | 8x P100 (12 hours)
Transformer-Large | 12 | 1024 | 16 | 213M | | 8x P100 (3.5 days)
BERT-Base | 12 | 768 | 12 | 110M | 13 GB | 4x TPU (4 days)
BERT-Large | 24 | 1024 | 16 | 340M | 13 GB | 16x TPU (4 days)

Devlin et al., BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, NAACL 2019 Source: J. Johnson Scaling up transformers

Model | Layers | Hidden dim. | Heads | Params | Data | Training
Transformer-Base | 12 | 512 | 8 | 65M | | 8x P100 (12 hours)
Transformer-Large | 12 | 1024 | 16 | 213M | | 8x P100 (3.5 days)
BERT-Base | 12 | 768 | 12 | 110M | 13 GB | 4x TPU (4 days)
BERT-Large | 24 | 1024 | 16 | 340M | 13 GB | 16x TPU (4 days)
XLNet-Large | 24 | 1024 | 16 | ~340M | 126 GB | 512x TPU-v3 (2.5 days)
RoBERTa | 24 | 1024 | 16 | 355M | 160 GB | 1024x V100 GPU (1 day)

Yang et al., XLNet: Generalized Autoregressive Pretraining for Language Understanding, 2019 Liu et al., RoBERTa: A Robustly Optimized BERT Pretraining Approach, 2019 Source: J. Johnson Scaling up transformers

Model | Layers | Hidden dim. | Heads | Params | Data | Training
Transformer-Base | 12 | 512 | 8 | 65M | | 8x P100 (12 hours)
Transformer-Large | 12 | 1024 | 16 | 213M | | 8x P100 (3.5 days)
BERT-Base | 12 | 768 | 12 | 110M | 13 GB | 4x TPU (4 days)
BERT-Large | 24 | 1024 | 16 | 340M | 13 GB | 16x TPU (4 days)
XLNet-Large | 24 | 1024 | 16 | ~340M | 126 GB | 512x TPU-v3 (2.5 days)
RoBERTa | 24 | 1024 | 16 | 355M | 160 GB | 1024x V100 GPU (1 day)
GPT-2 | 48 | 1600 | ? | 1.5B | 40 GB |

Radford et al., Language models are unsupervised multitask learners, 2019 Source: J. Johnson Scaling up transformers

Model | Layers | Hidden dim. | Heads | Params | Data | Training
Transformer-Base | 12 | 512 | 8 | 65M | | 8x P100 (12 hours)
Transformer-Large | 12 | 1024 | 16 | 213M | | 8x P100 (3.5 days)
BERT-Base | 12 | 768 | 12 | 110M | 13 GB | 4x TPU (4 days)
BERT-Large | 24 | 1024 | 16 | 340M | 13 GB | 16x TPU (4 days)
XLNet-Large | 24 | 1024 | 16 | ~340M | 126 GB | 512x TPU-v3 (2.5 days)
RoBERTa | 24 | 1024 | 16 | 355M | 160 GB | 1024x V100 GPU (1 day)
GPT-2 | 48 | 1600 | ? | 1.5B | 40 GB |
Megatron-LM | 72 | 3072 | 32 | 8.3B | 174 GB | 512x V100 GPU (9 days)

~$430,000 on Amazon AWS!

Shoeybi et al., Megatron-LM: Training Multi-Billion Parameter Language Models using Model Parallelism, 2019 Source: J. Johnson Scaling up transformers

Model | Layers | Hidden dim. | Heads | Params | Data | Training
Transformer-Base | 12 | 512 | 8 | 65M | | 8x P100 (12 hours)
Transformer-Large | 12 | 1024 | 16 | 213M | | 8x P100 (3.5 days)
BERT-Base | 12 | 768 | 12 | 110M | 13 GB | 4x TPU (4 days)
BERT-Large | 24 | 1024 | 16 | 340M | 13 GB | 16x TPU (4 days)
XLNet-Large | 24 | 1024 | 16 | ~340M | 126 GB | 512x TPU-v3 (2.5 days)
RoBERTa | 24 | 1024 | 16 | 355M | 160 GB | 1024x V100 GPU (1 day)
GPT-2 | 48 | 1600 | ? | 1.5B | 40 GB |
Megatron-LM | 72 | 3072 | 32 | 8.3B | 174 GB | 512x V100 GPU (9 days)
Turing-NLG | 78 | 4256 | 28 | 17B | ? | 256x V100 GPU

Microsoft, Turing-NLG: A 17-billion parameter language model by Microsoft, 2020 Source: J. Johnson Scaling up transformers

Model | Layers | Hidden dim. | Heads | Params | Data | Training
Transformer-Base | 12 | 512 | 8 | 65M | | 8x P100 (12 hours)
Transformer-Large | 12 | 1024 | 16 | 213M | | 8x P100 (3.5 days)
BERT-Base | 12 | 768 | 12 | 110M | 13 GB | 4x TPU (4 days)
BERT-Large | 24 | 1024 | 16 | 340M | 13 GB | 16x TPU (4 days)
XLNet-Large | 24 | 1024 | 16 | ~340M | 126 GB | 512x TPU-v3 (2.5 days)
RoBERTa | 24 | 1024 | 16 | 355M | 160 GB | 1024x V100 GPU (1 day)
GPT-2 | 48 | 1600 | ? | 1.5B | 40 GB |
Megatron-LM | 72 | 3072 | 32 | 8.3B | 174 GB | 512x V100 GPU (9 days)
Turing-NLG | 78 | 4256 | 28 | 17B | ? | 256x V100 GPU
GPT-3 | 96 | 12288 | 96 | 175B | 694 GB | ?

~$4.6M, 355 GPU-years (source)

Brown et al., Language Models are Few-Shot Learners, NeurIPS 2020 (Best Paper Award)

OpenAI GPT (Generative Pre-Training)
• Pre-training task: next-token prediction (causal language modeling)
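A minimal NumPy sketch of the next-token-prediction objective: targets are the inputs shifted by one position, scored with cross-entropy at every position. The logits here would come from the transformer decoder; this is an illustration of the loss, not OpenAI's training code.

```python
# Minimal sketch of the causal language-modeling loss.
import numpy as np

def causal_lm_loss(logits, token_ids):
    """logits: (T, vocab_size) model outputs; token_ids: (T,) integer tokens."""
    pred = logits[:-1]                       # predictions made at positions 0..T-2
    targets = np.asarray(token_ids)[1:]      # each position is trained to predict the NEXT token
    shifted = pred - pred.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

# toy usage with random "model outputs"
T, vocab = 6, 100
loss = causal_lm_loss(np.random.randn(T, vocab), np.random.randint(vocab, size=T))
```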

A. Radford et al., Improving language understanding with unsupervised learning, 2018

GPT-2 and GPT-3
• Key idea: if the model and training datasets are big enough, the model can adapt to new tasks without fine-tuning

Model | Layers | Hidden dim. | Heads | Params | Dataset
GPT-2 | 48 | 1600 | ? | 1.5B | WebText: 40 GB
GPT-3 | 96 | 12288 | 96 | 175B | CommonCrawl (cleaned up): 694 GB

GPT-2: A. Radford et al., Language models are unsupervised multitask learners, 2019
GPT-3: T. Brown et al., Language models are few-shot learners, NeurIPS 2020 (Best Paper Award)

GPT-3
• Key idea: if the model and training datasets are big enough, the model can adapt to new tasks without fine-tuning
• Few-shot learning: in addition to the task description, the model sees a few examples of the task
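To make the in-context formats concrete, here is an illustrative sketch of how a few-shot prompt is assembled (dropping the examples gives the zero-shot case described on the following slides); the translation task and examples are made up for illustration, and no gradient updates are involved.

```python
# Illustrative sketch of building a few-shot prompt: the "training" examples are
# simply placed in the context window, and the model completes the last line.
task_description = "Translate English to French:"
examples = [("sea otter", "loutre de mer"), ("cheese", "fromage")]  # omit for zero-shot
query = "peppermint"

prompt = task_description + "\n"
for en, fr in examples:            # few-shot: a handful of demonstrations in the context
    prompt += f"{en} => {fr}\n"
prompt += f"{query} =>"            # the model is asked to complete this line
print(prompt)
```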

T. Brown et al., Language models are few-shot learners, NeurIPS 2020 GPT-3 • Key idea: if the model and training datasets are big enough, model can adapt to new tasks without fine-tuning • One-shot learning: In addition to the task description, the model sees a single example of the task

T. Brown et al., Language models are few-shot learners, NeurIPS 2020 GPT-3 • Key idea: if the model and training datasets are big enough, model can adapt to new tasks without fine-tuning • Zero-shot learning: The model sees the task description and no training examples

T. Brown et al., Language models are few-shot learners, NeurIPS 2020 Task: Generate news article

Gray: human prompts, boldface: GPT-3 completions

(Three articles provided as training examples) Task: Use new word in sentence

Gray: human prompts, boldface: GPT-3 completions Task: Correct grammar

Gray: human prompts, boldface: GPT-3 completions

Task: Generate poems

GPT-3 creative fiction

https://www.gwern.net/GPT-3
For much, much more, see: https://github.com/elyase/awesome-gpt3

GPT-3 bias
• Male vs. female
• Race
• Religion

GPT-3 concerns
• Potential damage from bias and misuse (e.g., offensive chatbots, fake news, fake college essays)
• Privacy (e.g., memorizing and revealing personal data from the training set)
• Access and reproducibility
• Carbon footprint
• Potential for destroying jobs (e.g., writers, editors, programmers)
• Too smart or not smart enough?

• See also: Philosophers on GPT-3 The importance of grounding • Hypothesis: Meaning cannot be learned from form alone – it requires knowing about the relationship between language and the outside world


E. Bender and A. Koller, Climbing towards NLU: On Meaning, Form, and Understanding in the Age of Data, ACL 2020

Transformers: Outline
• Basic transformer model
• Transformer-based language models
  • BERT
  • GPT
  • Other models
  • Concerns
• Applications of transformers in vision

Image transformer – Google
• Image generation and super-resolution with 32x32 output, attention restricted to local neighborhoods

N. Parmar et al., Image transformer, ICML 2018 Sparse transformers – OpenAI

R. Child et al., Generating Long Sequences with Sparse Transformers, arXiv 2019 Image GPT – OpenAI

https://openai.com/blog/image-gpt/

M. Chen et al., Generative pretraining from pixels, ICML 2020 Image GPT • Image resolution up to 64x64, color values quantized to 512 levels (9 bits), dense attention • For transfer learning, average-pool encoded features across all positions

M. Chen et al., Generative pretraining from pixels, ICML 2020 Image GPT

M. Chen et al., Generative pretraining from pixels, ICML 2020 Vision transformer (ViT) – Google • Split an image into patches, feed linearly projected patches into standard transformer encoder • With patches of 14x14 pixels, you need 16x16=256 patches to represent 224x224 images
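As a concrete illustration of the patch-splitting step just described, here is a minimal NumPy sketch assuming a 224x224 RGB image and 14x14 patches; the projection matrix is a random stand-in for the learned patch-embedding weights (position embeddings and the class token are omitted).

```python
# Minimal sketch of the ViT input pipeline: non-overlapping patches, flattened
# and linearly projected into token embeddings for the transformer encoder.
import numpy as np

def patchify(image, patch_size):
    """image: (H, W, C) -> (num_patches, patch_size * patch_size * C)."""
    H, W, C = image.shape
    p = patch_size
    patches = image.reshape(H // p, p, W // p, p, C).transpose(0, 2, 1, 3, 4)
    return patches.reshape(-1, p * p * C)

image = np.random.rand(224, 224, 3)
patches = patchify(image, 14)               # 16x16 = 256 patches of 14x14 pixels
W_embed = np.random.randn(14 * 14 * 3, 768) # stand-in for the learned projection
tokens = patches @ W_embed                  # (256, 768) patch embeddings fed to the encoder
```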

A. Dosovitskiy et al. An image is worth 16x16 words: Transformers for image recognition at scale. ICLR 2021 Vision transformer (ViT)

BiT: Big Transfer (ResNet) ViT: Vision Transformer (Base/Large/Huge, patch size of 14x14, 16x16, or 32x32)

Internal Google dataset (not public)

A. Dosovitskiy et al. An image is worth 16x16 words: Transformers for image recognition at scale. ICLR 2021 Vision transformer (ViT) – Google • Transformers are claimed to be more computationally efficient than CNNs or hybrid architectures

A. Dosovitskiy et al. An image is worth 16x16 words: Transformers for image recognition at scale. ICLR 2021

Outline
• Transformer-based language models
  • BERT
  • GPT
  • Other models
  • Concerns
• Models for vision and language
  • Image/vision transformers
  • Image-text transformers: CLIP, DALL-E

Contrastive Language-Image Pretraining (CLIP)

Contrastive objective: in a batch of N image-text pairs, classify each text string to the correct image and vice versa
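As a concrete illustration of this objective, here is a minimal NumPy sketch of the symmetric contrastive loss over a batch of N pairs, assuming precomputed image and text embeddings; the temperature is fixed here for simplicity, whereas CLIP learns it as a parameter.

```python
# Minimal sketch of the symmetric image-text contrastive objective:
# each image should match its own text among the N candidates, and vice versa.
import numpy as np

def clip_loss(image_feats, text_feats, temperature=0.07):
    """image_feats, text_feats: (N, D) embeddings from the two encoders."""
    img = image_feats / np.linalg.norm(image_feats, axis=1, keepdims=True)
    txt = text_feats / np.linalg.norm(text_feats, axis=1, keepdims=True)
    logits = img @ txt.T / temperature        # (N, N); diagonal = matching pairs

    def xent(l):                              # cross-entropy with the diagonal as targets
        l = l - l.max(axis=1, keepdims=True)
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    return 0.5 * (xent(logits) + xent(logits.T))  # image-to-text + text-to-image
```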

A. Radford et al., Learning Transferable Visual Models From Natural Language Supervision, arXiv 2021 https://openai.com/blog/clip/ Contrastive Language-Image Pretraining (CLIP)

A. Radford et al., Learning Transferable Visual Models From Natural Language Supervision, arXiv 2021
https://openai.com/blog/clip/

CLIP: Details
• Image encoders
  • ResNet-50, with an attention pooling layer replacing the final global average pooling
  • Vision transformer (ViT)
• Language encoder: GPT-style transformer with 63M parameters
• Dataset: 400M image-text pairs from the Web

CLIP: Results

CLIP for image editing
• Combine a powerful recent image-text embedding technique (CLIP) with a pre-trained StyleGAN for text-based image editing

O. Patashnik et al., StyleCLIP: Text-Driven Manipulation of StyleGAN Imagery, arXiv 2021

CLIP for image editing

https://twitter.com/RiversHaveWings/status/1382436519181385729

Text-based image generation: DALL-E
• Learn a joint sequential transformer model that can be used to generate images from a text prompt

A. Ramesh et al., Zero-Shot Text-to-Image Generation, arXiv 2021
https://openai.com/blog/dall-e/

DALL-E: Image encoding
• Train a convolutional encoder and decoder similar to VQ-VAE to compress images to 32x32 grids of discrete tokens (each assuming 8192 values)

DALL-E: Transformer architecture and training
• Concatenate up to 256 text tokens with 32x32 = 1024 image tokens, learn a transformer model with 64 layers and 12B parameters
• Dataset: 250M image-text pairs from the Internet (similar scale to JFT-300M, apparently different from the data used to train CLIP)
• Transformer model details
  • Decoder-only architecture
  • 64 self-attention layers, 62 attention heads, sparse attention patterns
  • Mixed-precision training, distributed optimization

DALL-E: Generating images given text
• Re-rank samples using CLIP
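A minimal Python sketch of how a training sequence could be assembled from the token concatenation described under "DALL-E: Transformer architecture and training" above; the padding scheme and token values are illustrative, not the exact DALL-E implementation.

```python
# Illustrative sketch: text tokens followed by the 32x32 grid of image tokens,
# modeled autoregressively as one stream by the decoder-only transformer.
def build_dalle_sequence(text_tokens, image_tokens, max_text_len=256, pad_id=0):
    assert len(image_tokens) == 32 * 32                   # discrete codes from the VQ-VAE-style encoder
    text = text_tokens[:max_text_len]
    text = text + [pad_id] * (max_text_len - len(text))   # pad text to a fixed length (illustrative)
    return text + image_tokens                             # one 256 + 1024 = 1280-token sequence

sequence = build_dalle_sequence(list(range(10)), list(range(1024)))
print(len(sequence))  # 1280
```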