Transformers for large-scale language and image modeling

Outline
• Basic transformer model (review and details)
• Transformer-based language models
  • BERT
  • GPT
  • Other models
  • Concerns
• Models for vision and language
  • Image transformers
  • Image-text transformers: CLIP, DALL-E

Basic transformer model (review)
• Sequence-to-sequence architecture using only point-wise processing and attention (no recurrent units or convolutions)

Encoder: receives the entire input sequence and outputs an encoded sequence of the same length.
Decoder: predicts the next token conditioned on the encoder output and the previously predicted tokens.

A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. Gomez, L. Kaiser, I. Polosukhin, Attention is all you need, NeurIPS 2017

Key-Value-Query attention model

1. The decoder generates a query describing what it wants to focus on.
2. Compute dot products between the query and the keys generated by the encoder, giving alignment scores between the source tokens and the query.
3. Feed the scores into a softmax to create the attention weights.
4. Sum the values generated by the encoder, weighted by the attention weights.

Key-Value-Query attention model

• Query vectors: $Q$ (generated by the decoder)
• Key vectors: $K = XW_K$
• Value vectors: $V = XW_V$
• Similarities: scaled dot-product attention, $e_{i,j} = q_i \cdot k_j / \sqrt{D}$, or $E = QK^T / \sqrt{D}$ ($D$ is the dimensionality of the keys)
• Attention weights: $A = \mathrm{softmax}(E, \mathrm{dim}=1)$
• Output vectors: $y_i = \sum_j a_{i,j} v_j$, or $Y = AV$

Adapted from J. Johnson

Key-Value-Query attention model

• How does permuting the order of the queries change the output?
• How does changing the order of the keys/values change the output?

Adapted from J. Johnson

Attention mechanisms

• Encoder self-attention: queries, keys, and values come from the previous layer of the encoder
• Decoder self-attention: values corresponding to future decoder outputs are masked out
• Encoder-decoder attention: queries come from the previous decoder layer; keys and values come from the output of the encoder

Self-attention
• Used to capture context within the sequence

As we are encoding “it”, we should focus on “the animal” (in one context) or on “the street” (in another).

Self-attention layer (one query per input vector)
• Query vectors: $Q = XW_Q$
• Key vectors: $K = XW_K$
• Value vectors: $V = XW_V$
• Similarities: scaled dot-product attention, $e_{i,j} = q_i \cdot k_j / \sqrt{D}$, or $E = QK^T / \sqrt{D}$ ($D$ is the dimensionality of the keys)
• Attention weights: $A = \mathrm{softmax}(E, \mathrm{dim}=1)$
• Output vectors: $y_i = \sum_j a_{i,j} v_j$, or $Y = AV$

Adapted from J. Johnson
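To make the computation above concrete, here is a minimal NumPy sketch of a single-head self-attention layer; the projection matrices are random stand-ins for learned parameters and the sizes are arbitrary.

```python
# Minimal sketch of single-head self-attention (notation as above).
import numpy as np

def self_attention(X, W_Q, W_K, W_V):
    """X: (T, D_in) input vectors; returns (T, D_v) outputs Y = AV."""
    Q = X @ W_Q                      # query vectors, one per input
    K = X @ W_K                      # key vectors
    V = X @ W_V                      # value vectors
    D = K.shape[-1]
    E = Q @ K.T / np.sqrt(D)         # scaled dot-product similarities
    A = np.exp(E - E.max(axis=-1, keepdims=True))
    A = A / A.sum(axis=-1, keepdims=True)   # softmax over keys (rows sum to 1)
    return A @ V                     # outputs are attention-weighted sums of values

T, D_in, D = 4, 8, 16                # toy sizes
rng = np.random.default_rng(0)
X = rng.normal(size=(T, D_in))
W_Q, W_K, W_V = (rng.normal(size=(D_in, D)) for _ in range(3))
Y = self_attention(X, W_Q, W_K, W_V)
print(Y.shape)                       # (4, 16)
```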

Masked self-attention layer
• The decoder should not “look ahead” in the output sequence (e.g., when generating the tokens that follow “This is …”)


• Masking: alignment scores $e_{i,j}$ for future positions ($j > i$) are set to $-\infty$ before the softmax, so the corresponding attention weights become 0
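A minimal NumPy sketch of this masking step, in the same single-head setting as the earlier sketch: scores above the diagonal (future positions) are set to −∞ before the softmax, so their attention weights come out as exactly 0.

```python
# Minimal sketch of causally masked self-attention.
import numpy as np

def masked_self_attention(Q, K, V):
    T, D = Q.shape
    E = Q @ K.T / np.sqrt(D)                           # (T, T) alignment scores
    future = np.triu(np.ones((T, T), dtype=bool), k=1) # True above the diagonal = future tokens
    E = np.where(future, -np.inf, E)                   # block attention to future positions
    A = np.exp(E - E.max(axis=-1, keepdims=True))
    A = A / A.sum(axis=-1, keepdims=True)              # rows are lower-triangular after softmax
    return A @ V
```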

Adapted from J. Johnson Attention mechanisms: Summary

Encoder and decoder each consist of a stack of N transformer blocks.

• Encoder self-attention: queries, keys, and values come from the previous layer of the encoder
• Decoder self-attention: values corresponding to future decoder outputs are masked out
• Encoder-decoder attention: queries come from the previous decoder layer; keys and values come from the output of the encoder

Attention mechanisms: Illustration

https://ai.googleblog.com/2017/08/transformer-novel-neural-network.html Transformer architecture: Details

(Figure: the full encoder-decoder architecture from Vaswani et al.)

A. Vaswani et al., Attention is all you need, NeurIPS 2017

Multi-head attention
• Run $h$ attention models in parallel on top of different linearly projected versions of $Q$, $K$, $V$; concatenate and linearly project the results
• Intuition: enables the model to attend to different kinds of information at different positions (see visualization tool)

Transformer blocks
• A transformer is a sequence of transformer blocks
• Vaswani et al.: N = 12 blocks (6 encoder + 6 decoder), embedding dimension = 512, 8 attention heads
• Add & Norm: residual connection followed by layer normalization
• Feedforward: two linear layers with a ReLU in between, applied independently to each vector
• Attention is the only interaction between inputs!

Positional encoding
• To give the transformer information about the ordering of tokens, add a function of position (based on sines and cosines) to every input
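A minimal NumPy sketch of the sinusoidal positional encoding from Vaswani et al. (sines on even dimensions, cosines on odd ones), assuming an even embedding dimension; the result is added to the token embeddings.

```python
# Minimal sketch of sinusoidal positional encodings (assumes d_model is even).
import numpy as np

def positional_encoding(max_len, d_model):
    pos = np.arange(max_len)[:, None]                 # (max_len, 1) positions
    i = np.arange(0, d_model, 2)[None, :]             # even dimension indices
    angles = pos / np.power(10000.0, i / d_model)     # (max_len, d_model / 2)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                      # even dimensions: sine
    pe[:, 1::2] = np.cos(angles)                      # odd dimensions: cosine
    return pe

embeddings = np.random.randn(50, 512)                 # toy token embeddings (seq_len, d_model)
inputs = embeddings + positional_encoding(50, 512)    # position-aware transformer inputs
```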

(Figure: positional encoding values as a function of position and embedding dimension)

Transformer architecture: Zooming back out

(Figure: the full encoder-decoder architecture)

A. Vaswani et al., Attention is all you need, NeurIPS 2017 Results

https://ai.googleblog.com/2017/08/transformer-novel-neural-network.html Different ways of processing sequences

RNN
• Works on ordered sequences
• Pro: good for long sequences: after one RNN layer, $h_T$ “sees” the whole sequence
• Con: not parallelizable: need to compute hidden states sequentially
• Con: hidden states have limited expressive capacity

1D convolutional network
• Works on multidimensional grids
• Pro: each output can be computed in parallel (at training time)
• Con: bad at long sequences: need to stack many conv layers for outputs to “see” the whole sequence

Transformer
• Works on sets of vectors
• Pro: good at long sequences: after one self-attention layer, each output “sees” all inputs!
• Pro: each output can be computed in parallel (at training time)
• Con: memory-intensive

Making transformers more efficient

R. Child et al., Generating Long Sequences with Sparse Transformers, arXiv 2019

Outline
• Basic transformer model
• Transformer-based language models
  • BERT
  • GPT
  • Other models
  • Concerns

Self-supervised language modeling with transformers
1. Download A LOT of text from the internet
2. Train a giant transformer using a suitable pretext task
3. Fine-tune the transformer on the desired NLP task


Bidirectional Encoder Representations from Transformers (BERT)

J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, NAACL 2019

BERT: Pretext tasks
• Masked language model (MLM)
• Randomly mask 15% of the tokens in the input sentences; the goal is to reconstruct them using bidirectional context
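As a concrete illustration of this pretext task, here is a minimal Python sketch of the 15% masking rule; the whitespace tokenization and the "[MASK]" string are placeholders rather than BERT's actual WordPiece pipeline (the paper additionally keeps or randomly replaces some of the selected tokens instead of masking them).

```python
# Minimal sketch of building a masked-language-modeling training example.
import random

def make_mlm_example(tokens, mask_prob=0.15, mask_token="[MASK]"):
    inputs, targets = [], []
    for tok in tokens:
        if random.random() < mask_prob:
            inputs.append(mask_token)   # hide the token from the model...
            targets.append(tok)         # ...but keep it as the prediction target
        else:
            inputs.append(tok)
            targets.append(None)        # no loss on unmasked positions
    return inputs, targets

inputs, targets = make_mlm_example("the man went to the store".split())
```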

BERT: Pretext tasks
• Next sentence prediction (NSP): predict the likelihood that sentence B belongs after sentence A
• Useful for Question Answering and Natural Language Inference tasks
• In the training data, 50% of the time B is the actual sentence that follows A (labeled as IsNext), and 50% of the time it is a random sentence (labeled as NotNext)
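A minimal Python sketch of how IsNext/NotNext pairs can be constructed from a list of consecutive sentences; a real pipeline would avoid re-sampling the true next sentence and would pack the pair as [CLS] A [SEP] B [SEP].

```python
# Minimal sketch of next-sentence-prediction pair construction (50/50 split).
import random

def make_nsp_pair(sentences, i):
    """sentences: list of consecutive sentences; i: index of sentence A."""
    a = sentences[i]
    if random.random() < 0.5:
        return a, sentences[i + 1], "IsNext"    # the true next sentence
    else:
        j = random.randrange(len(sentences))    # a random (likely unrelated) sentence
        return a, sentences[j], "NotNext"
```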

BERT: More detailed view

Tokenization: WordPiece (from GNMT)

Trained on Wikipedia (2.5B words) + BookCorpus (800M words)

BERT: Evaluation
• General Language Understanding Evaluation (GLUE) benchmark (gluebenchmark.com)

BERT: Downstream tasks

Textual entailment

Source: J. Hockenmaier

Entailment, textual equivalence and similarity BERT: Downstream tasks

Sentiment classification, linguistic acceptability

BERT: Downstream tasks

Find span in paragraph that contains the answer Source: SQuAD v1.1 paper BERT: Downstream tasks


Named entity recognition Other language models

Source: J. Johnson

Scaling up transformers

Model | Layers | Hidden dim. | Heads | Params | Data | Training
Transformer-Base | 12 | 512 | 8 | 65M | | 8x P100 (12 hours)
Transformer-Large | 12 | 1024 | 16 | 213M | | 8x P100 (3.5 days)

Vaswani et al., Attention is all you need, NeurIPS 2017 Source: J. Johnson Scaling up transformers

Model | Layers | Hidden dim. | Heads | Params | Data | Training
Transformer-Base | 12 | 512 | 8 | 65M | | 8x P100 (12 hours)
Transformer-Large | 12 | 1024 | 16 | 213M | | 8x P100 (3.5 days)
BERT-Base | 12 | 768 | 12 | 110M | 13 GB | 4x TPU (4 days)
BERT-Large | 24 | 1024 | 16 | 340M | 13 GB | 16x TPU (4 days)

Devlin et al., BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, NAACL 2019 Source: J. Johnson Scaling up transformers

Model | Layers | Hidden dim. | Heads | Params | Data | Training
Transformer-Base | 12 | 512 | 8 | 65M | | 8x P100 (12 hours)
Transformer-Large | 12 | 1024 | 16 | 213M | | 8x P100 (3.5 days)
BERT-Base | 12 | 768 | 12 | 110M | 13 GB | 4x TPU (4 days)
BERT-Large | 24 | 1024 | 16 | 340M | 13 GB | 16x TPU (4 days)
XLNet-Large | 24 | 1024 | 16 | ~340M | 126 GB | 512x TPU-v3 (2.5 days)
RoBERTa | 24 | 1024 | 16 | 355M | 160 GB | 1024x V100 GPU (1 day)

Yang et al., XLNet: Generalized Autoregressive Pretraining for Language Understanding, 2019 Liu et al., RoBERTa: A Robustly Optimized BERT Pretraining Approach, 2019 Source: J. Johnson Scaling up transformers

Model | Layers | Hidden dim. | Heads | Params | Data | Training
Transformer-Base | 12 | 512 | 8 | 65M | | 8x P100 (12 hours)
Transformer-Large | 12 | 1024 | 16 | 213M | | 8x P100 (3.5 days)
BERT-Base | 12 | 768 | 12 | 110M | 13 GB | 4x TPU (4 days)
BERT-Large | 24 | 1024 | 16 | 340M | 13 GB | 16x TPU (4 days)
XLNet-Large | 24 | 1024 | 16 | ~340M | 126 GB | 512x TPU-v3 (2.5 days)
RoBERTa | 24 | 1024 | 16 | 355M | 160 GB | 1024x V100 GPU (1 day)
GPT-2 | 48 | 1600 | ? | 1.5B | 40 GB |

Radford et al., Language models are unsupervised multitask learners, 2019 Source: J. Johnson Scaling up transformers

Model | Layers | Hidden dim. | Heads | Params | Data | Training
Transformer-Base | 12 | 512 | 8 | 65M | | 8x P100 (12 hours)
Transformer-Large | 12 | 1024 | 16 | 213M | | 8x P100 (3.5 days)
BERT-Base | 12 | 768 | 12 | 110M | 13 GB | 4x TPU (4 days)
BERT-Large | 24 | 1024 | 16 | 340M | 13 GB | 16x TPU (4 days)
XLNet-Large | 24 | 1024 | 16 | ~340M | 126 GB | 512x TPU-v3 (2.5 days)
RoBERTa | 24 | 1024 | 16 | 355M | 160 GB | 1024x V100 GPU (1 day)
GPT-2 | 48 | 1600 | ? | 1.5B | 40 GB |
Megatron-LM | 72 | 3072 | 32 | 8.3B | 174 GB | 512x V100 GPU (9 days)

~$430,000 on Amazon AWS!

Shoeybi et al., Megatron-LM: Training Multi-Billion Parameter Language Models using Model Parallelism, 2019 Source: J. Johnson Scaling up transformers

Model | Layers | Hidden dim. | Heads | Params | Data | Training
Transformer-Base | 12 | 512 | 8 | 65M | | 8x P100 (12 hours)
Transformer-Large | 12 | 1024 | 16 | 213M | | 8x P100 (3.5 days)
BERT-Base | 12 | 768 | 12 | 110M | 13 GB | 4x TPU (4 days)
BERT-Large | 24 | 1024 | 16 | 340M | 13 GB | 16x TPU (4 days)
XLNet-Large | 24 | 1024 | 16 | ~340M | 126 GB | 512x TPU-v3 (2.5 days)
RoBERTa | 24 | 1024 | 16 | 355M | 160 GB | 1024x V100 GPU (1 day)
GPT-2 | 48 | 1600 | ? | 1.5B | 40 GB |
Megatron-LM | 72 | 3072 | 32 | 8.3B | 174 GB | 512x V100 GPU (9 days)
Turing-NLG | 78 | 4256 | 28 | 17B | ? | 256x V100 GPU

Microsoft, Turing-NLG: A 17-billion parameter language model by Microsoft, 2020 Source: J. Johnson Scaling up transformers

Model | Layers | Hidden dim. | Heads | Params | Data | Training
Transformer-Base | 12 | 512 | 8 | 65M | | 8x P100 (12 hours)
Transformer-Large | 12 | 1024 | 16 | 213M | | 8x P100 (3.5 days)
BERT-Base | 12 | 768 | 12 | 110M | 13 GB | 4x TPU (4 days)
BERT-Large | 24 | 1024 | 16 | 340M | 13 GB | 16x TPU (4 days)
XLNet-Large | 24 | 1024 | 16 | ~340M | 126 GB | 512x TPU-v3 (2.5 days)
RoBERTa | 24 | 1024 | 16 | 355M | 160 GB | 1024x V100 GPU (1 day)
GPT-2 | 48 | 1600 | ? | 1.5B | 40 GB |
Megatron-LM | 72 | 3072 | 32 | 8.3B | 174 GB | 512x V100 GPU (9 days)
Turing-NLG | 78 | 4256 | 28 | 17B | ? | 256x V100 GPU
GPT-3 | 96 | 12288 | 96 | 175B | 694 GB | ?

~$4.6M, 355 GPU-years (source)

Brown et al., Language Models are Few-Shot Learners, NeurIPS 2020 (Best Paper Award)

OpenAI GPT (Generative Pre-Training)
• Pre-training task: next-token prediction (causal language modeling)
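A minimal NumPy sketch of the next-token-prediction objective: targets are the inputs shifted by one position, scored with cross-entropy at every position. The logits here would come from the transformer decoder; this is an illustration of the loss, not OpenAI's training code.

```python
# Minimal sketch of the causal language-modeling loss.
import numpy as np

def causal_lm_loss(logits, token_ids):
    """logits: (T, vocab_size) model outputs; token_ids: (T,) integer tokens."""
    pred = logits[:-1]                       # predictions made at positions 0..T-2
    targets = np.asarray(token_ids)[1:]      # each position is trained to predict the NEXT token
    shifted = pred - pred.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

# toy usage with random "model outputs"
T, vocab = 6, 100
loss = causal_lm_loss(np.random.randn(T, vocab), np.random.randint(vocab, size=T))
```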

A. Radford et al., Improving language understanding with unsupervised learning, 2018

GPT-2 and GPT-3
• Key idea: if the model and training datasets are big enough, the model can adapt to new tasks without fine-tuning

Model | Layers | Hidden dim. | Heads | Params | Dataset
GPT-2 | 48 | 1600 | ? | 1.5B | WebText: 40 GB
GPT-3 | 96 | 12288 | 96 | 175B | CommonCrawl (cleaned up): 694 GB

GPT-2: A. Radford et al., Language models are unsupervised multitask learners, 2019
GPT-3: T. Brown et al., Language models are few-shot learners, NeurIPS 2020 (Best Paper Award)

GPT-3
• Key idea: if the model and training datasets are big enough, the model can adapt to new tasks without fine-tuning
• Few-shot learning: in addition to the task description, the model sees a few examples of the task
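To make the in-context formats concrete, here is an illustrative sketch of how a few-shot prompt is assembled (dropping the examples gives the zero-shot case described on the following slides); the translation task and examples are made up for illustration, and no gradient updates are involved.

```python
# Illustrative sketch of building a few-shot prompt: the "training" examples are
# simply placed in the context window, and the model completes the last line.
task_description = "Translate English to French:"
examples = [("sea otter", "loutre de mer"), ("cheese", "fromage")]  # omit for zero-shot
query = "peppermint"

prompt = task_description + "\n"
for en, fr in examples:            # few-shot: a handful of demonstrations in the context
    prompt += f"{en} => {fr}\n"
prompt += f"{query} =>"            # the model is asked to complete this line
print(prompt)
```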

T. Brown et al., Language models are few-shot learners, NeurIPS 2020 GPT-3 • Key idea: if the model and training datasets are big enough, model can adapt to new tasks without fine-tuning • One-shot learning: In addition to the task description, the model sees a single example of the task

T. Brown et al., Language models are few-shot learners, NeurIPS 2020 GPT-3 • Key idea: if the model and training datasets are big enough, model can adapt to new tasks without fine-tuning • Zero-shot learning: The model sees the task description and no training examples

T. Brown et al., Language models are few-shot learners, NeurIPS 2020 Task: Generate news article

Gray: human prompts, boldface: GPT-3 completions

(Three articles provided as training examples) Task: Use new word in sentence

Gray: human prompts, boldface: GPT-3 completions Task: Correct grammar

Gray: human prompts, boldface: GPT-3 completions

Task: Generate poems

GPT-3 creative fiction

https://www.gwern.net/GPT-3
For much, much more, see: https://github.com/elyase/awesome-gpt3

GPT-3 bias
• Male vs. female
• Race
• Religion

GPT-3 concerns
• Potential damage from bias and misuse (e.g., offensive chatbots, fake news, fake college essays)
• Privacy (e.g., memorizing and revealing personal data from the training set)
• Access and reproducibility
• Carbon footprint
• Potential for destroying jobs (e.g., writers, editors, programmers)
• Too smart or not smart enough?

• See also: Philosophers on GPT-3 The importance of grounding • Hypothesis: Meaning cannot be learned from form alone – it requires knowing about the relationship between language and the outside world


E. Bender and A. Koller, Climbing towards NLU: On Meaning, Form, and Understanding in the Age of Data, ACL 2020

Transformers: Outline
• Basic transformer model
• Transformer-based language models
  • BERT
  • GPT
  • Other models
  • Concerns
• Applications of transformers in vision

Image transformer – Google
• Image generation and super-resolution with 32x32 output, attention restricted to local neighborhoods

N. Parmar et al., Image transformer, ICML 2018 Sparse transformers – OpenAI

R. Child et al., Generating Long Sequences with Sparse Transformers, arXiv 2019 Image GPT – OpenAI

https://openai.com/blog/image-gpt/

M. Chen et al., Generative pretraining from pixels, ICML 2020 Image GPT • Image resolution up to 64x64, color values quantized to 512 levels (9 bits), dense attention • For transfer learning, average-pool encoded features across all positions

M. Chen et al., Generative pretraining from pixels, ICML 2020 Image GPT

M. Chen et al., Generative pretraining from pixels, ICML 2020 Vision transformer (ViT) – Google • Split an image into patches, feed linearly projected patches into standard transformer encoder • With patches of 14x14 pixels, you need 16x16=256 patches to represent 224x224 images
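As a concrete illustration of the patch-splitting step just described, here is a minimal NumPy sketch assuming a 224x224 RGB image and 14x14 patches; the projection matrix is a random stand-in for the learned patch-embedding weights (position embeddings and the class token are omitted).

```python
# Minimal sketch of the ViT input pipeline: non-overlapping patches, flattened
# and linearly projected into token embeddings for the transformer encoder.
import numpy as np

def patchify(image, patch_size):
    """image: (H, W, C) -> (num_patches, patch_size * patch_size * C)."""
    H, W, C = image.shape
    p = patch_size
    patches = image.reshape(H // p, p, W // p, p, C).transpose(0, 2, 1, 3, 4)
    return patches.reshape(-1, p * p * C)

image = np.random.rand(224, 224, 3)
patches = patchify(image, 14)               # 16x16 = 256 patches of 14x14 pixels
W_embed = np.random.randn(14 * 14 * 3, 768) # stand-in for the learned projection
tokens = patches @ W_embed                  # (256, 768) patch embeddings fed to the encoder
```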

A. Dosovitskiy et al. An image is worth 16x16 words: Transformers for image recognition at scale. ICLR 2021 Vision transformer (ViT)

BiT: Big Transfer (ResNet) ViT: Vision Transformer (Base/Large/Huge, patch size of 14x14, 16x16, or 32x32)

Internal Google dataset (not public)

A. Dosovitskiy et al. An image is worth 16x16 words: Transformers for image recognition at scale. ICLR 2021 Vision transformer (ViT) – Google • Transformers are claimed to be more computationally efficient than CNNs or hybrid architectures

A. Dosovitskiy et al. An image is worth 16x16 words: Transformers for image recognition at scale. ICLR 2021

Outline
• Transformer-based language models
  • BERT
  • GPT
  • Other models
  • Concerns
• Models for vision and language
  • Image/vision transformers
  • Image-text transformers: CLIP, DALL-E

Contrastive Language-Image Pretraining (CLIP)

Contrastive objective: in a batch of N image-text pairs, classify each text string to the correct image and vice versa
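As a concrete illustration of this objective, here is a minimal NumPy sketch of the symmetric contrastive loss over a batch of N pairs, assuming precomputed image and text embeddings; the temperature is fixed here for simplicity, whereas CLIP learns it as a parameter.

```python
# Minimal sketch of the symmetric image-text contrastive objective:
# each image should match its own text among the N candidates, and vice versa.
import numpy as np

def clip_loss(image_feats, text_feats, temperature=0.07):
    """image_feats, text_feats: (N, D) embeddings from the two encoders."""
    img = image_feats / np.linalg.norm(image_feats, axis=1, keepdims=True)
    txt = text_feats / np.linalg.norm(text_feats, axis=1, keepdims=True)
    logits = img @ txt.T / temperature        # (N, N); diagonal = matching pairs

    def xent(l):                              # cross-entropy with the diagonal as targets
        l = l - l.max(axis=1, keepdims=True)
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    return 0.5 * (xent(logits) + xent(logits.T))  # image-to-text + text-to-image
```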

A. Radford et al., Learning Transferable Visual Models From Natural Language Supervision, arXiv 2021 https://openai.com/blog/clip/ Contrastive Language-Image Pretraining (CLIP)

A. Radford et al., Learning Transferable Visual Models From Natural Language Supervision, arXiv 2021
https://openai.com/blog/clip/

CLIP: Details
• Image encoders
  • ResNet-50, with an attention pooling layer replacing the final global average pooling
  • Vision transformer (ViT)
• Language encoder: GPT-style transformer with 63M parameters
• Dataset: 400M image-text pairs from the Web

CLIP: Results

CLIP for image editing
• Combine a powerful recent image-text embedding technique (CLIP) with a pre-trained StyleGAN for text-based image editing

O. Patashnik et al., StyleCLIP: Text-Driven Manipulation of StyleGAN Imagery, arXiv 2021

CLIP for image editing

https://twitter.com/RiversHaveWings/status/1382436519181385729

Text-based image generation: DALL-E
• Learn a joint sequential transformer model that can be used to generate images from a text prompt

A. Ramesh et al., Zero-Shot Text-to-Image Generation, arXiv 2021
https://openai.com/blog/dall-e/

DALL-E: Image encoding
• Train a convolutional encoder and decoder similar to VQ-VAE to compress images to 32x32 grids of discrete tokens (each assuming 8192 values)

DALL-E: Transformer architecture and training
• Concatenate up to 256 text tokens with 32x32 = 1024 image tokens, learn a transformer model with 64 layers and 12B parameters
• Dataset: 250M image-text pairs from the Internet (similar scale to JFT-300M, apparently different from the data used to train CLIP)
• Transformer model details
  • Decoder-only architecture
  • 64 self-attention layers, 62 attention heads, sparse attention patterns
  • Mixed-precision training, distributed optimization

DALL-E: Generating images given text
• Re-rank samples using CLIP
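A minimal Python sketch of how a training sequence could be assembled from the token concatenation described under "DALL-E: Transformer architecture and training" above; the padding scheme and token values are illustrative, not the exact DALL-E implementation.

```python
# Illustrative sketch: text tokens followed by the 32x32 grid of image tokens,
# modeled autoregressively as one stream by the decoder-only transformer.
def build_dalle_sequence(text_tokens, image_tokens, max_text_len=256, pad_id=0):
    assert len(image_tokens) == 32 * 32                   # discrete codes from the VQ-VAE-style encoder
    text = text_tokens[:max_text_len]
    text = text + [pad_id] * (max_text_len - len(text))   # pad text to a fixed length (illustrative)
    return text + image_tokens                             # one 256 + 1024 = 1280-token sequence

sequence = build_dalle_sequence(list(range(10)), list(range(1024)))
print(len(sequence))  # 1280
```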