Pre-training Language Representation (ELMo, BERT, OpenAI GPT)

Ko, Youngjoong
Sungkyunkwan University

Contents

1. Introduction to Pre-trained Language Representations
2. (Feature-based approach) ELMo
3. (Fine-tuning approach) OpenAI GPT
4. OpenAI GPT-3
5. (Fine-tuning approach) BERT
6. Summary

Introduction : Transfer Learning

• How can we take advantage of distributed word representation?
  - Transfer learning using word representations (transfer learning with a pre-trained word representation)
• What is Transfer Learning?
[Figure: a text corpus is used for unsupervised learning (with unlabeled data) of a pre-trained word representation (GloVe, skip-thought, InferSent, ELMo, ULMFiT, GPT, BERT, ...), which is then transferred to downstream tasks trained with labeled data (classification, sequence labeling, question answering, ...)]

Introduction : Pre-trained Language Representations

• Two kinds of Pre-trained Language Representations
  - 1) Feature-based approach
  - 2) Fine-tuning approach

Introduction : Pre-trained Language Representations

• Feature-based approach
  - Use task-specific architectures that include the pre-trained representations as additional features
    . Learned representations are used as features in a downstream model
  - ELMo

Introduction : Pre-trained Language Representations

• Fine-tuning approach
  - Introduce minimal task-specific parameters
  - Trained on the downstream tasks by simply fine-tuning the pre-trained parameters
  - OpenAI GPT, BERT
[Figure: a language model pre-trained on text such as "I grew up in ..." is fine-tuned directly for a downstream task (e.g. a Y/N decision)]

Feature-based approach : ELMo

• ELMo (Peters et al. 2018)
  - In GloVe and Word2vec, polysemous words get the same representation no matter the context
    . "I am a big fan of Mozart" → 'fan' = [-0.5, -0.3, 0.228, 0.9, 0.31]
    . "I need a fan to cool the heat" → 'fan' = [-0.5, -0.3, 0.228, 0.9, 0.31]
[Figure: 'fan' may refer to a (BTS) fan at a concert or to an air-conditioner fan that cools, yet the static vector is identical]
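As a toy illustration of this context-blindness (not from the original slides; the vectors and the tiny `embedding` table below are made up), a static lookup returns the identical vector for 'fan' in both sentences, whereas a contextual encoder such as ELMo would not:

```python
# Toy illustration (made-up numbers): a static embedding table is context-blind.
embedding = {
    "fan":  [-0.5, -0.3, 0.228, 0.9, 0.31],
    "need": [0.1, 0.4, -0.2, 0.0, 0.7],
    # ... one fixed vector per vocabulary word
}

def embed(sentence):
    """Look up one fixed vector per token, ignoring the surrounding words."""
    return [embedding.get(tok.lower(), [0.0] * 5) for tok in sentence.split()]

v1 = embed("I am a big fan of Mozart")[4]       # vector for 'fan' (music admirer)
v2 = embed("I need a fan to cool the heat")[3]  # vector for 'fan' (appliance)
print(v1 == v2)  # True: the two senses collapse into one representation
```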

Feature-based approach : ELMo

• ELMo (Peters et al. 2018) (Cont'd)
  - Let the words be represented according to the context!
[Figure: for "I need a fan … fan to cool the heat", a language model learns to consider context from left-to-right and from right-to-left, e.g. when reading "The tree grew up"]

Feature-based approach : ELMo

• ELMo (Peters et al. 2018) (Cont'd)
  - Trained task-specifically
    . Learned parameters (weights) from the language model are used as a feature for another task
[Figure: a general representation of 'tree' from ELMo, e.g. [-0.1, 0.3, 0.25, 0.7, -0.1, -0.4, 0.6, 0.8], is fed into a different task]

Feature-based approach : ELMo

• ELMo (Peters et al. 2018) (Cont'd)
  - Two components of ELMo
    . biLSTM pre-training part: use vectors derived from a bidirectional LSTM trained with a coupled LM objective
    . ELMo part: task-specific combination of the representations in the biLM

Feature-based approach : ELMo

• biLSTM part
  - Two objectives: predicting the word in the forward direction and in the backward direction
    . Forward: $p(t_1, t_2, \ldots, t_N) = \prod_{k=1}^{N} p(t_k \mid t_1, t_2, \ldots, t_{k-1})$ (the task of predicting the next token)
    . Backward: $p(t_1, t_2, \ldots, t_N) = \prod_{k=1}^{N} p(t_k \mid t_{k+1}, t_{k+2}, \ldots, t_N)$ (the task of predicting the previous token)
  - The overall objective jointly maximizes the log likelihood of the forward and backward directions
    . $J(\Theta) = \sum_{k=1}^{N} \left( \log p(t_k \mid t_1, \ldots, t_{k-1}; \Theta_x, \overrightarrow{\Theta}_{\mathrm{LSTM}}, \Theta_s) + \log p(t_k \mid t_{k+1}, \ldots, t_N; \Theta_x, \overleftarrow{\Theta}_{\mathrm{LSTM}}, \Theta_s) \right)$
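A minimal sketch of this coupled forward/backward objective, assuming a toy PyTorch setup (real ELMo uses a character-CNN input layer and ties more parameters across directions; `TinyBiLM` and all sizes below are invented for illustration):

```python
# Minimal sketch of the coupled forward/backward LM objective (not the actual
# ELMo code: real ELMo uses a character CNN input layer and a tied softmax).
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyBiLM(nn.Module):
    def __init__(self, vocab_size, dim=64, layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        # Two independent stacks: one reads left-to-right, one right-to-left.
        self.fwd = nn.LSTM(dim, dim, num_layers=layers, batch_first=True)
        self.bwd = nn.LSTM(dim, dim, num_layers=layers, batch_first=True)
        self.out = nn.Linear(dim, vocab_size)  # softmax layer shared by both directions

    def loss(self, tokens):                    # tokens: (batch, seq_len) of ids
        x = self.embed(tokens)
        h_fwd, _ = self.fwd(x)                 # predicts t_{k+1} from t_1..t_k
        h_bwd, _ = self.bwd(x.flip(1))         # predicts t_{k-1} from t_N..t_k
        h_bwd = h_bwd.flip(1)
        fwd_logits = self.out(h_fwd[:, :-1])   # positions 1..N-1 -> targets 2..N
        bwd_logits = self.out(h_bwd[:, 1:])    # positions 2..N  -> targets 1..N-1
        fwd_nll = F.cross_entropy(fwd_logits.reshape(-1, fwd_logits.size(-1)),
                                  tokens[:, 1:].reshape(-1))
        bwd_nll = F.cross_entropy(bwd_logits.reshape(-1, bwd_logits.size(-1)),
                                  tokens[:, :-1].reshape(-1))
        return fwd_nll + bwd_nll               # minimizing this maximizes J(Theta)

model = TinyBiLM(vocab_size=1000)
batch = torch.randint(0, 1000, (4, 12))        # 4 dummy sentences of 12 token ids
print(model.loss(batch))
```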

Feature-based approach : ELMo

• ELMo part
  - Part where the ELMo word representation is defined
  - Task-specific combination of the intermediate layer representations in the biLM
    . $R_k = \{\, \mathbf{h}_{k,j}^{LM} \mid j = 0, \ldots, L \,\}$
[Figure: for each token position k = 1, 2, 3, layer j = 0 is the token embedding $[\mathbf{x}_k; \mathbf{x}_k]$ and layers j = 1, 2 are the concatenated forward/backward hidden states $[\overrightarrow{\mathbf{h}}_{k,j}; \overleftarrow{\mathbf{h}}_{k,j}]$]

Feature-based approach : ELMo

• ELMo part (Cont'd)
  - The representation trained in the biLSTM part, $R_k$
  - ELMo collapses all layers in $R_k$ into a single vector, $\mathrm{ELMo}_k = E(R_k; \Theta_e)$

Feature-based approach : ELMo

• ELMo part (Cont'd)
  - Choices of $\mathrm{ELMo}_k = E(R_k; \Theta^{task})$
    . Simplest case: $\mathrm{ELMo}_k = E(R_k) = \mathbf{h}_{k,L}^{LM}$ (top layer only) (Peters et al. 2017, McCann et al. 2017)
    . General case: compute a task-specific weighting of all biLM layers; the downstream task learns the weighting parameters
      $\mathrm{ELMo}_k^{task} = E(R_k; \Theta^{task}) = \gamma^{task} \sum_{j=0}^{L} s_j^{task} \, \mathbf{h}_{k,j}^{LM}$

Feature-based approach : ELMo

• ELMo Evaluation
  - Question Answering (SQuAD): average F1, +1.4% over SOTA
  - Textual Entailment (SNLI): accuracy, +0.7% with SOTA + ELMo
  - Semantic Role Labeling (SRL): average F1, +3.2% with SOTA reimplementation + ELMo
  - Coreference Resolution (Coref): average F1, +3.2% with SOTA reimplementation + ELMo
  - Named Entity Recognition (NER): average F1, +0.3% with SOTA + ELMo
  - Sentiment Analysis (SST-5): accuracy, +1% with SOTA reimplementation + ELMo
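The "general case" combination above (a softmax-weighted sum of the biLM layers scaled by a task-specific γ) can be sketched in a few lines; this is only an illustration with made-up shapes, not the reference ELMo implementation:

```python
# Minimal sketch of the task-specific layer weighting ("scalar mix").
# h_layers: list of L+1 arrays, each of shape (seq_len, 2*dim), from the biLM.
import numpy as np

def elmo_combine(h_layers, s_logits, gamma):
    """ELMo_k^task = gamma * sum_j softmax(s)_j * h_{k,j}."""
    s = np.exp(s_logits - np.max(s_logits))
    s = s / s.sum()                           # softmax-normalized layer weights
    stacked = np.stack(h_layers, axis=0)      # (L+1, seq_len, 2*dim)
    return gamma * np.tensordot(s, stacked, axes=1)   # (seq_len, 2*dim)

L, seq_len, dim2 = 2, 5, 8
layers = [np.random.randn(seq_len, dim2) for _ in range(L + 1)]
s_logits = np.zeros(L + 1)                    # learned per task; zeros = uniform weights
print(elmo_combine(layers, s_logits, gamma=1.0).shape)   # (5, 8)
```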

Feature-based approach : ELMo

• ELMo Evaluation: effects of the 'Deep biLM' part
  - Deep biLM effects
  - Using the biLM's context representations
    . Disambiguate word sense in the source sentence (word sense disambiguation test): deeper layers capture more semantic information
    . Disambiguate part of speech in the source sentence (POS tagging test): shallower layers capture more syntactic information

Fine-tuning approach to pre-trained Word Representation

• Fine-tuning approach
  - Trained on the downstream tasks by simply fine-tuning the pre-trained parameters
    . Minimal task-specific parameters
    . "fine-tune effectively with minimal changes to the architecture of the pre-trained model" (Radford et al. 2018)
  - OpenAI GPT, BERT
[Figure: a pre-trained language model ("I grew up in ...") with a small task head (Y/N) on top for the downstream task]

Fine-tuning approach : Open-AI GPT

• OpenAI-GPT
  - From the 'Improving Language Understanding by Generative Pre-Training' paper, Radford et al. 2018
  - Makes use of the Transformer model for unsupervised pre-training
  - It is then transferred to discriminative tasks (downstream tasks)
[Figure: the Transformer encoder stack is the basis of BERT, the decoder stack the basis of GPT]

Fine-tuning approach : Open-AI GPT

• Framework
  - First stage: learning a high-capacity language model on a large corpus of text (the BooksCorpus dataset)
  - Followed by a fine-tuning stage where the model adapts to a discriminative task with labeled data
• First Stage, Unsupervised pre-training
  - Language model objective over a large corpus of unlabeled data
    . $L_1(\mathcal{U}) = \sum_i \log P(u_i \mid u_{i-k}, \ldots, u_{i-1}; \Theta)$
    . where $k$ is the size of the context window, $\mathcal{U} = \{u_1, u_2, \ldots, u_n\}$ is the unsupervised corpus of tokens, and $P$ is modeled using a neural network with parameters $\Theta$
  - Multi-layer Transformer decoder blocks for the language model
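A sketch of how the objective $L_1(\mathcal{U})$ is accumulated over a context window of size k; `log_prob_next_token` is a hypothetical stand-in for the Transformer decoder's output distribution, not part of any real API:

```python
# Sketch of the pre-training objective L1(U). `log_prob_next_token` is a
# hypothetical stand-in for the Transformer decoder's output distribution.
import math
import random

def log_prob_next_token(context, token):
    """Placeholder: log P(token | context; Theta). A real model would return the
    log-softmax score of `token` given the k previous tokens."""
    return math.log(random.uniform(0.05, 1.0))

def lm_objective(corpus_tokens, k):
    """L1(U) = sum_i log P(u_i | u_{i-k}, ..., u_{i-1}; Theta)."""
    total = 0.0
    for i in range(len(corpus_tokens)):
        context = corpus_tokens[max(0, i - k):i]     # at most k previous tokens
        total += log_prob_next_token(context, corpus_tokens[i])
    return total   # training maximizes this (equivalently, minimizes the NLL)

tokens = "i want to build a language model architecture".split()
print(lm_objective(tokens, k=3))
```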

Fine-tuning approach : Open-AI GPT

• Transformer (Decoder Block)
  - Linearly transform the current position's representation $d_i$ into a query Q; linearly transform the other positions into keys K
  - Take the dot product of the query with every neighboring position
    . (mask out the logits of future words, e.g. by setting them to a very large negative value before the softmax)
  - Softmax the logits; the convex combination of the (value) vectors weighted by the softmax result is then put through a feed-forward network
    . This becomes the re-representation of $d_i$, i.e. $d_i'$

Fine-tuning approach : Open-AI GPT

• First Stage, Unsupervised pre-training (Cont'd)
  - Overview
    . Inputs tokenized by the spaCy tokenizer
    . Inputs are fed into 12 layers of Transformer blocks at each time step
    . The last layer produces a distribution over the BPE-based vocabulary (40,000 entries)
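A minimal single-head sketch of the masked self-attention step described above (real GPT uses multiple heads, learned biases, and dropout; all shapes and weights here are made up):

```python
# Minimal single-head masked self-attention sketch (the decoder-block core).
import numpy as np

def masked_self_attention(X, Wq, Wk, Wv):
    """X: (seq_len, d) token representations; returns re-representations d_i'."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])          # dot product with every position
    mask = np.triu(np.ones_like(scores), k=1)        # 1s above the diagonal = future positions
    scores = scores + mask * -1e9                    # future logits pushed toward -inf
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                               # convex combination of the values

d, seq_len = 8, 5
rng = np.random.default_rng(0)
X = rng.normal(size=(seq_len, d))
out = masked_self_attention(X, rng.normal(size=(d, d)),
                            rng.normal(size=(d, d)), rng.normal(size=(d, d)))
print(out.shape)   # (5, 8); position i only attends to positions <= i
```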

Fine-tuning approach : Open-AI GPT

• First Stage, Unsupervised pre-training (Cont'd)
  - 12 layers (L = 12) of Transformer blocks
    . i) Masked multi-headed self-attention over the input context tokens
    . ii) Followed by position-wise feed-forward layers
    . iii) Softmax of the feed-forward result over the target tokens

Fine-tuning approach : Open-AI GPT

• First Stage, Unsupervised pre-training (Cont'd)
  - Example) "I want to build a language model architecture …"
[Figure: a stack of L = 12 blocks; given the inputs "I", "want", "to", the model predicts 'want', 'to', 'build' at successive positions]

Fine-tuning approach : Open-AI GPT

• Second Stage, Supervised fine-tuning
  - Once the first stage is finished on the unsupervised corpus of tokens, the language model parameters are pre-trained and thus available as a pre-trained word representation
  - Adapt the parameters from the first stage to the supervised target task with a labeled dataset $\mathcal{C} = \{c_1, \ldots, c_n\}$, where each instance $c = \{x^1, \ldots, x^m, y\}$
  - The inputs are passed through the pre-trained model to obtain the final transformer block's activation $h_l^m$
    . $m$: last token index, $l$: last layer index
    . $P(y \mid x^1, \ldots, x^m) = \mathrm{softmax}(h_l^m W_y)$
[Figure: the tokens $x^1, \ldots, x^m$ pass through the pre-trained blocks; the final activation is multiplied by $W_y$ to predict the label]

Fine-tuning approach : Open-AI GPT

• There is one problem in the Second Stage …
  - The input form of Open-AI GPT looks like $\mathcal{U} = \{u_1, u_2, \ldots, u_n\}$
    . ex) {'I', 'am', 'a', 'student', 'I', 'like', 'studying', … }
  - Task-specific tasks such as question answering and textual entailment have structured inputs
    . ex) {document, question, answers}

Fine-tuning approach : Open-AI GPT

• There is one problem in the Second Stage … (Cont'd)
  - How do we align those structured inputs so that they can be fine-tuned in the same input format as pre-training?
    . Fit the form of the inputs to the model
    . 'Start', 'Delim', 'Extract' tokens!
    . Ex) Entailment task: the input has the structured form of a premise and a hypothesis
    . Align the input as: [Start] premise [Delim] hypothesis [Extract]

Fine-tuning approach : Open-AI GPT

• Second Stage, Supervised fine-tuning (Cont'd)
  - See the difference between the feature-based & fine-tuning approaches?
    . Feature-based: you fit your representation to the task
    . Fine-tuning: the task is fitted to the representation learning
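A small sketch of packing a structured entailment example into one token sequence in the spirit of the 'Start'/'Delim'/'Extract' tokens above; the special-token strings are invented placeholders, not GPT's actual vocabulary entries:

```python
# Sketch of packing a structured entailment example into one token sequence
# (the actual GPT code uses learned special-token ids; names here are made up).
START, DELIM, EXTRACT = "<start>", "<delim>", "<extract>"

def format_entailment(premise_tokens, hypothesis_tokens):
    """[Start] premise [Delim] hypothesis [Extract]; the final hidden state at
    the extract token feeds the classification layer."""
    return [START] + premise_tokens + [DELIM] + hypothesis_tokens + [EXTRACT]

print(format_entailment("a man is sleeping".split(), "a person is awake".split()))
```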

Open-AI GPT 3

• Task-agnostic Language Model
  - A general-purpose, task-agnostic NLP model that performs well without fine-tuning
  - Zero-shot Learning?

Open-AI GPT 3

• Task-agnostic Language Model
  - Few-shot Learning

Open-AI GPT 3

• Model: the same architecture as GPT-2
  - Increased number of parameters (175B parameters)
• Data
  - 150 billion tokens, amounting to 45TB of raw text (500GB of preprocessed text)

Open-AI GPT 3

• Performance on sentence generation and Cloze (fill-in-the-blank) tasks
• Translation

Fine-tuning approach : BERT

• BERT (Bidirectional Encoder Representations from Transformers)
  - Paper published at NAACL 2019 by Google AI
    . "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding," Devlin et al. 2019, NAACL
    . Won the Best Long Paper award

Fine-tuning approach : BERT

• BERT (Bidirectional Encoder Representations from Transformers)
  - Open-AI GPT cannot take the right-to-left context into account
    . A deep bidirectional model is more powerful than either a left-to-right model (GPT) or the shallow concatenation of a left-to-right and a right-to-left model (ELMo)
    . Every token can only attend to previous tokens in the self-attention layers of the Transformer
    . This is because standard language models can only be trained left-to-right or right-to-left, since bidirectional conditioning would allow each word to indirectly "see itself" in a multi-layered context
    . Ex) training a language model on "As long as you love me" from left to right: letting the prediction of "as" look at "as" itself or at later tokens would be illegal
[Figure: Transformer layers reading 'SOS', "As", "long" predict the next word; an arrow from a future token back into the prediction is marked "Illegal !!!"]

Fine-tuning approach : BERT

• BERT (Bidirectional Encoder Representations from Transformers)
  - BERT is designed to pre-train representations by jointly conditioning on both left and right context in all layers
    . How? By training on two new tasks!
  - Word representation learning via the "masked language model" task
  - The "Next Sentence Prediction" task
[Figure: a left-to-right LM that conditions on both sides illegally injects future label information into the LM, while the masked language model task does not]

Fine-tuning approach : BERT

• Masked Language Model?
  - Mask one of the tokens in a sentence and guess what it is
    . Ex) "As long [MASK] you love me" : "as"
  - Can take into account the context after the target token
• Next Sentence Prediction task?
  - Guess which sequence appropriately follows the current one
    . Ex) current sequence: "I think I mastered the concept"
    . Appropriate next sequence: "Now I can start coding." / "Let's do it!!"
    . Inappropriate next sequence: "What's this smell?" / "Can you smell it too?"

Fine-tuning approach : BERT

• Overall Architecture
  - Multi-layer bidirectional Transformer encoder
    . The Transformer can now refer to the right-to-left context thanks to the changed training objectives
  - Task-specific layer on top of the model

Fine-tuning approach : BERT

• Overall Architecture
  - Two model sizes
    . BERT_BASE: L=12, H=768, A=12
    . BERT_LARGE: L=24, H=1024, A=16
    . where L = number of layers, H = hidden size, A = number of self-attention heads

Fine-tuning approach : BERT

• Overall Procedure
  - How to construct an input?
    . [CLS] + sentence A + [SEP] + sentence B
    . Just like the 'start' and 'delim' tokens in GPT: the '[CLS]' and '[SEP]' tokens
  - Input example
    . Ex) ['[CLS]', 'my', 'dog', 'is', 'cute', '[SEP]', 'he', 'likes', 'playing', '[SEP]']

Fine-tuning approach : BERT

• Overall Procedure (Cont'd)
  - The input representation fed to the model is the sum of
    . 1) WordPiece embeddings (Wu et al. 2016)
    . 2) Segment embeddings
    . 3) Learned positional embeddings
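A minimal sketch of summing the three embeddings above; the sizes mirror BERT_BASE, but the module is untrained and omits the LayerNorm and dropout that the real model applies afterwards:

```python
# Minimal sketch of BERT's input representation: token + segment + position.
import torch
import torch.nn as nn

class BertInputEmbeddings(nn.Module):
    def __init__(self, vocab_size=30522, hidden=768, max_len=512, segments=2):
        super().__init__()
        self.token = nn.Embedding(vocab_size, hidden)     # WordPiece embeddings
        self.segment = nn.Embedding(segments, hidden)     # sentence A = 0, sentence B = 1
        self.position = nn.Embedding(max_len, hidden)     # learned positional embeddings

    def forward(self, token_ids, segment_ids):
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        # Real BERT also applies LayerNorm and dropout after this sum.
        return (self.token(token_ids)
                + self.segment(segment_ids)
                + self.position(positions))               # broadcast over the batch

emb = BertInputEmbeddings()
token_ids = torch.randint(0, 30522, (1, 10))   # e.g. ids of "[CLS] my dog is cute [SEP] he likes playing [SEP]"
segment_ids = torch.tensor([[0]*6 + [1]*4])    # 6 tokens of sentence A, 4 of sentence B
print(emb(token_ids, segment_ids).shape)       # torch.Size([1, 10, 768])
```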

Fine-tuning approach : BERT

• Overall Procedure (Cont'd)
  - 1) WordPiece embeddings (Wu et al. 2016)
    . Use embeddings trained with an objective that selects D wordpieces such that the resulting corpus is minimal in the number of wordpieces when segmented according to the chosen wordpiece model
    . A data-driven tokenization method that aims to achieve a balance between vocabulary size and out-of-vocabulary words
      → e.g. "strawberries" = "straw" + "berries"
    . Enables BERT to store only 30,522 "words" in its vocabulary and very rarely encounter out-of-vocabulary words

Fine-tuning approach : BERT

• Overall Procedure (Cont'd)
  - 2) Segment embeddings
    . Segment embeddings help BERT distinguish the tokens of the two sentences in an input pair
    . Sentence A: index 0 → a 768-dim vector; sentence B: index 1 → a 768-dim vector
    . Index 0 is used when the input contains only one sentence

Fine-tuning approach : BERT

• Overall Procedure (Cont'd)
  - 3) Positional embeddings
    . BERT is designed to process input sequences of up to length 512
    . BERT learns a vector representation for each position

Architecture of Transformer

• Multihead attention in Transformer
  - A sentence is about 'who,' 'did what,' and 'to whom'
  - In a CNN, different filters learn the concepts of 'who,' 'did what,' and 'to whom'
  - A single self-attention head can't pick out different information from different places
    . It's just a linear combination everywhere

Architecture of Transformer

• Multihead attention in Transformer (Cont'd)
  - Apply self-attention multiple times; each head linearly transforms the tokens differently so that it conveys different information of interest

Fine-tuning approach : BERT

• Task #1 : Masked Language Model (MLM)
  - Mask α% of the input tokens to be predicted
  - The final hidden vectors corresponding to the masked tokens are fed into an output softmax over the vocabulary
[Figure: for the input "[CLS] the man [MASK] to [MASK] to buy", the final Transformer hidden states at the two [MASK] positions go through a softmax to predict "went" and "store"]
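A sketch of that output layer: gather the final hidden states at the masked positions and project them to vocabulary logits (the real BERT head adds a dense transform with LayerNorm and ties the projection matrix to the input embeddings; all ids below are made up):

```python
# Sketch of the MLM output layer: predict only the masked positions.
import torch
import torch.nn as nn
import torch.nn.functional as F

hidden, vocab_size = 768, 30522
output_layer = nn.Linear(hidden, vocab_size)

def mlm_loss(final_hidden, masked_positions, target_ids):
    """final_hidden: (seq_len, hidden) from the encoder; predict only the masks."""
    masked_states = final_hidden[masked_positions]        # (num_masked, hidden)
    logits = output_layer(masked_states)                  # softmax over the vocab
    return F.cross_entropy(logits, target_ids)

final_hidden = torch.randn(10, hidden)                    # dummy encoder output
masked_positions = torch.tensor([3, 6])                   # the two [MASK] slots
target_ids = torch.tensor([2354, 8761])                   # ids of "went", "store" (made up)
print(mlm_loss(final_hidden, masked_positions, target_ids))
```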

Fine-tuning approach : BERT

• Task #1 : Masked LM (Cont'd)
  - Two downsides of this approach
    . 1st, we are creating a mismatch between pre-training and fine-tuning: the '[MASK]' token is never seen at fine-tuning time
      → Take special steps. Ex) "my dog is hairy", and 'hairy' is randomly selected for prediction:
        80% of the time (0.8 × α%): replace with [MASK] → "my dog is [MASK]"
        10% of the time (0.1 × α%): replace with a random word → "my dog is apple"
        10% of the time (0.1 × α%): keep the word unchanged → "my dog is hairy"

Fine-tuning approach : BERT

• Task #1 : Masked LM (Cont'd)
  - Two downsides of this approach
    . 2nd, only 15% of the tokens are predicted in each batch, which suggests that more pre-training steps may be required for the model to converge
      → A left-to-right model predicts every token, so it converges faster
      → However, the empirical improvements of the MLM model far outweigh the increased training cost
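The 80/10/10 rule above can be sketched as a small corruption function (α and the toy vocabulary below are purely illustrative):

```python
# Sketch of the 80/10/10 masking rule (alpha = 15% of tokens selected).
import random

def mask_tokens(tokens, vocab, alpha=0.15, seed=None):
    rng = random.Random(seed)
    inputs, targets = list(tokens), []
    for i, tok in enumerate(tokens):
        if rng.random() >= alpha:            # token not selected for prediction
            targets.append(None)
            continue
        targets.append(tok)                  # the model must predict the original
        r = rng.random()
        if r < 0.8:
            inputs[i] = "[MASK]"             # 80%: replace with [MASK]
        elif r < 0.9:
            inputs[i] = rng.choice(vocab)    # 10%: replace with a random word
        # remaining 10%: keep the word unchanged
    return inputs, targets

vocab = ["apple", "dog", "is", "my", "hairy", "cute"]
print(mask_tokens("my dog is hairy".split(), vocab, seed=0))
```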

Fine-tuning approach : BERT

• Task #2 : Next Sentence Prediction (NSP)
  - To equip the model with the ability to understand the relationship between two text sentences, which is not directly captured by language modeling
    . e.g. Question Answering (QA) and Natural Language Inference (NLI) tasks
  - Pre-train a binary next-sentence-prediction task

Fine-tuning approach : BERT

• Task #2 : Next Sentence Prediction (NSP) (Cont'd)
[Figure: for the input "[CLS] the man went to [MASK] store [SEP] he bought a [MASK] chocolate milk [SEP]", the model predicts IsNext vs. NotNext, e.g. 83% vs. 17%]

Fine-tuning approach : BERT

• Task #2 : Next Sentence Prediction (NSP) (Cont'd)
[Figure: for the input "[CLS] the man went to [MASK] store [SEP] starry [MASK] night [SEP]", the model predicts IsNext vs. NotNext, e.g. 11% vs. 89%]

Fine-tuning approach : BERT

• Task #2 : Next Sentence Prediction (NSP) (Cont'd)
  - Effect of training with the NSP task
    . No NSP: trained without the NSP task
    . LTR & No NSP: trained without the NSP task and with a left-to-right LM only
  - Removing NSP hurts performance significantly on QNLI, MNLI, and SQuAD, which depend largely on the relationship between two sentences

Fine-tuning approach : BERT

• Now that the pre-trained model is ready, start fine-tuning!
  - No need to construct another model for another task
  - Just add the output layer parts!
[Figure: pre-trained model ready → fine-tuning: just add parameters for the specific task]

Fine-tuning approach : BERT

• Fine-tuning
  - The [CLS] embedding ($C \in \mathbb{R}^H$) is mostly used for the fine-tuning task
  - The only new parameters are the classification layer weights ($W \in \mathbb{R}^{K \times H}$)
    . $K$ is the number of classifier labels, e.g. 2 for ['IsNext', 'NotNext']
  - Label probabilities: $P \in \mathbb{R}^K = \mathrm{softmax}(C W^T)$
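A minimal sketch of this classification layer, with the encoder itself omitted and random weights standing in for the fine-tuned parameters:

```python
# Sketch of P = softmax(C W^T), where C is the final hidden vector of [CLS].
import torch

H, K = 768, 2                                  # hidden size, number of labels
W = torch.randn(K, H, requires_grad=True)      # the only task-specific parameters

def classify(cls_embedding):                   # C: (batch, H) [CLS] vectors
    logits = cls_embedding @ W.t()             # C W^T -> (batch, K)
    return torch.softmax(logits, dim=-1)       # P, probabilities over the K labels

C = torch.randn(4, H)                          # dummy [CLS] outputs for 4 examples
print(classify(C))                             # each row sums to 1
```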

Fine-tuning approach : BERT

• Fine-tuning in span-level or token-level tasks
  - Modified slightly to use a different number of hidden states and different positions than just [CLS]

Fine-tuning approach : BERT

• Result
  - GLUE (General Language Understanding Evaluation) dataset
    . BERT_BASE and BERT_LARGE obtain 4.5% and 7.0% respective average accuracy improvements over the prior SOTA

Fine-tuning approach : BERT

• Result (QA test)

Fine-tuning approach : BERT

• Result: SQuAD 2.0
  - SQuAD 1.1 + a 'No Answer' task
  - +5.1 F1 improvement over the previous best system

References

• Chris McCormick, "Word2Vec Tutorial – The Skip-Gram Model": http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/
• Mikolov et al. 2013, "Efficient Estimation of Word Representations in Vector Space"
• G. E. Hinton, J. L. McClelland, D. E. Rumelhart, "Distributed Representations," in Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Volume 1: Foundations, MIT Press, 1986
• Mikolov et al. 2013, "Distributed Representations of Words and Phrases and their Compositionality"
• LM picture: https://upload.wikimedia.org/wikipedia/commons/d/dd/CBOW_eta_Skipgram.png
• Transfer Learning Tutorial by Sebastian Ruder, Matthew Peters, Swabha Swayamdipta, Thomas Wolf: https://docs.google.com/presentation/d/1fIhGikFPnb7G5kr58OvYC3GN4io7MznnM0aAgadvJfc/edit#slide=id.g58bdd596a1_0_0
• Dai and Le 2015, "Semi-supervised Sequence Learning," NIPS
• Peters et al. 2017, "Semi-supervised Sequence Tagging with Bidirectional Language Models"
• Peters et al. 2018, "Deep Contextualized Word Representations"
• ELMo explanation blog: https://www.slideshare.net/shuntaroy/a-review-of-deep-contextualized-word-representations-peters-2018
• ELMo, BERT, OpenAI GPT explanation blog: http://jalammar.github.io/illustrated-bert/
• Howard and Ruder 2018, "Universal Language Model Fine-tuning for Text Classification"
• Devlin et al. 2019, "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding," NAACL
• Radford et al. 2018, "Improving Language Understanding by Generative Pre-Training"
• Vaswani et al. 2017, "Attention Is All You Need"
• Stanford CS224N: NLP with Deep Learning, Winter 2019, Lecture 14 – Transformers and Self-Attention: https://www.youtube.com/watch?v=5vcj8kSwBCY&t=811s
• OpenAI-GPT figure: https://www.topbots.com/generalized-language-models-ulmfit- -gpt/
• Input representation of BERT: https://medium.com/@_init_/why-bert-has-3-embedding-layers-and-their-implementation-details-9c261108e28a

Thank you for your attention!

고 영 중 (Ko, Youngjoong) nlplab.skku.edu

