Pre-training Language Representation (ELMo, BERT, OpenAI GPT)
Ko, Youngjoong
Sungkyunkwan University

Contents
1. Introduction to Pre-trained Language Representations
2. (Feature-based approach) ELMo
3. (Fine-tuning approach) OpenAI GPT
4. OpenAI GPT3
5. (Fine-tuning approach) BERT
6. Summary
Introduction : Transfer Learning
How can we take advantage of distributed word representations?
→ Transfer Learning using pre-trained word representations

What is Transfer Learning?
[Figure: Text Corpus → pre-trained word representations (word2vec, GloVe, skip-thought, InferSent, ELMo, ULMFiT, GPT, BERT) → downstream tasks (Classification, Sequence Labeling, Question & Answering, ...)]
Unsupervised Learning with unlabeled data (pre-training) → Supervised Learning with labeled data (downstream task)
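To make the pipeline concrete, here is a minimal sketch of the feature-based transfer setup. The tiny lookup table and the two-example sentiment task are hypothetical stand-ins for real word2vec/GloVe vectors and a real labeled dataset; the pre-trained vectors stay frozen while only the downstream classifier is trained on labeled data.

```python
# Feature-based transfer: frozen pre-trained word vectors as input features
# for a supervised downstream classifier.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical pre-trained vectors (in practice loaded from word2vec/GloVe files).
pretrained = {
    "good":  np.array([0.9, 0.1, 0.3]),
    "bad":   np.array([-0.8, 0.2, -0.4]),
    "movie": np.array([0.1, 0.7, 0.0]),
}

def sentence_features(tokens):
    """Average the pre-trained vectors of the tokens (unknown words -> zeros)."""
    vecs = [pretrained.get(t, np.zeros(3)) for t in tokens]
    return np.mean(vecs, axis=0)

# Downstream supervised task (toy sentiment): only the classifier is trained
# on labeled data; the word representations themselves are never updated.
X = np.stack([sentence_features(s) for s in (["good", "movie"], ["bad", "movie"])])
y = np.array([1, 0])
clf = LogisticRegression().fit(X, y)
print(clf.predict([sentence_features(["good"])]))  # likely [1]: 'good' lies on the positive side
```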
Introduction : Pre-trained Language Representations
Two kinds of Pre-trained Language Representations
1) Feature-based approach
2) Fine-tuning approach

Feature-based approach
. Use task-specific architectures that include the pre-trained representations as additional features.
. Learned representations are used as features in a downstream model.
. e.g., ELMo
Introduction : Pre-trained Language Representations
Fine-tuning approach
. Introduce minimal task-specific parameters.
. Trained on the downstream tasks by simply fine-tuning the pre-trained parameters (see the sketch below).
. e.g., OpenAI GPT, BERT
[Figure: a pre-trained Language Model ("... need a fan of ...") with a small task-specific output layer (Y / N) for the downstream task]
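As a hedged illustration of the fine-tuning recipe (not the actual GPT/BERT code; TinyEncoder, the toy batch, and all sizes are made up), the PyTorch sketch below adds only a small linear head on top of a "pre-trained" encoder and updates all parameters on the downstream task.

```python
# Fine-tuning approach: pre-trained encoder + minimal task-specific head,
# all parameters updated on the downstream task.
import torch
import torch.nn as nn

class TinyEncoder(nn.Module):
    """Stand-in for a pre-trained Transformer/LSTM encoder."""
    def __init__(self, vocab=1000, dim=64):
        super().__init__()
        self.emb = nn.Embedding(vocab, dim)
        self.lstm = nn.LSTM(dim, dim, batch_first=True)
    def forward(self, ids):
        h, _ = self.lstm(self.emb(ids))
        return h[:, -1]                 # last hidden state as the sentence representation

encoder = TinyEncoder()                 # pretend its weights were pre-trained on a large corpus
head = nn.Linear(64, 2)                 # minimal task-specific parameters (Y / N)

# One optimizer over *both* the pre-trained encoder and the new head.
optimizer = torch.optim.Adam(list(encoder.parameters()) + list(head.parameters()), lr=1e-4)

ids = torch.randint(0, 1000, (8, 12))   # toy batch of token ids
labels = torch.randint(0, 2, (8,))
loss = nn.functional.cross_entropy(head(encoder(ids)), labels)
loss.backward()
optimizer.step()
```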
Feature-based approach : ELMo
ELMo (Peters et al. 2018)
. In the GloVe and Word2vec methods, polysemous words refer to the same representation no matter the context (see the sketch below).
  "I am a big fan of Mozart" → 'fan' = [-0.5, -0.3, 0.228, 0.9, 0.31]   (fan = an admirer, e.g., a BTS fan at a concert)
  "I need a fan to cool the heat" → 'fan' = [-0.5, -0.3, 0.228, 0.9, 0.31]   (fan = a cooling device, like an air-conditioner)
[Figure: a Language Model over the example inputs "I need a fan" and "I grew up in"]
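A few lines (with a hypothetical embedding table) reproduce the problem described above: a static lookup returns the identical vector for 'fan' in both sentences, regardless of its sense.

```python
# Static (word2vec/GloVe-style) embeddings ignore context.
import numpy as np

static_emb = {"fan": np.array([-0.5, -0.3, 0.228, 0.9, 0.31])}  # hypothetical lookup table

s1 = "I am a big fan of Mozart".split()
s2 = "I need a fan to cool the heat".split()

v1 = static_emb["fan"]          # vector of 'fan' in s1 (admirer sense)
v2 = static_emb["fan"]          # vector of 'fan' in s2 (cooling-device sense)
print(np.array_equal(v1, v2))   # True: identical, even though the senses differ
```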
Feature-based approach : ELMo
ELMo (Peters et al. 2018) (Cont'd)
. Let the words be represented according to the context!!!
  "I need a fan … Hmm? … fan to cool the heat"
. Learn to consider context from left-to-right and from right-to-left (see the sketch below).
[Figure: a Language Model reading "The tree grew up"; 'tree' (general) vs. 'tree' from ELMo = [-0.1, 0.3, 0.25, 0.7, -0.1, -0.4, 0.6, 0.8]]
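A minimal PyTorch sketch of the idea in the figure, with toy dimensions and untrained weights: an LSTM reads the sentence left-to-right and right-to-left, and concatenating the two hidden states gives each word a context-dependent vector (8-dimensional here, matching the toy example).

```python
# Context from both directions: forward and backward LSTM states, concatenated per token.
import torch
import torch.nn as nn

emb = nn.Embedding(100, 4)
bilstm = nn.LSTM(input_size=4, hidden_size=4, bidirectional=True, batch_first=True)

sent = torch.tensor([[1, 2, 3, 4]])   # toy ids standing in for "The tree grew up"
h, _ = bilstm(emb(sent))              # h: (1, 4 tokens, 8) = forward(4) ++ backward(4)
tree_vec = h[0, 1]                    # 8-dim contextual vector for 'tree'
print(tree_vec.shape)                 # torch.Size([8])
```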
ELMo (Peters et al. 2018) (Cont'd)
. Trained task-specifically.
. Learned parameters (weights) from the language model are used as features for another task.
[Figure: the pre-trained Language Model feeding a Different Task]
Feature-based approach : ELMo
ELMo (Peters et al. 2018) (Cont'd)
Two components of ELMo
. biLSTM pre-training part
  Use vectors derived from a bidirectional LSTM trained with a coupled LM objective.
. ELMo part
  Task-specific combination of the representations in the biLM.

biLSTM part
Two objectives: predicting the word in the forward direction and in the backward direction.
. Forward: $p(t_1, t_2, \ldots, t_N) = \prod_{k=1}^{N} p(t_k \mid t_1, t_2, \ldots, t_{k-1})$
  Task of predicting the next token.
. Backward: $p(t_1, t_2, \ldots, t_N) = \prod_{k=1}^{N} p(t_k \mid t_{k+1}, t_{k+2}, \ldots, t_N)$
  Task of predicting the previous token.
. The overall objective jointly maximizes the log likelihood of the forward and backward directions:
  $J(\Theta) = \sum_{k=1}^{N} \Big( \log p(t_k \mid t_1, \ldots, t_{k-1}; \Theta_x, \overrightarrow{\Theta}_{LSTM}, \Theta_s) + \log p(t_k \mid t_{k+1}, \ldots, t_N; \Theta_x, \overleftarrow{\Theta}_{LSTM}, \Theta_s) \Big)$
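A minimal PyTorch rendering of the coupled objective, under toy assumptions (tiny vocabulary, random weights, one layer per direction): `emb` and `softmax` stand in for the shared $\Theta_x$ and $\Theta_s$, the two LSTMs for the forward and backward $\Theta_{LSTM}$, and minimizing the summed cross-entropy corresponds to maximizing $J(\Theta)$.

```python
# Coupled biLM objective: forward LM predicts the next token,
# backward LM predicts the previous token; the log likelihoods are summed.
import torch
import torch.nn as nn
import torch.nn.functional as F

V, D = 100, 16
emb = nn.Embedding(V, D)                      # Theta_x (shared)
fwd_lstm = nn.LSTM(D, D, batch_first=True)    # forward Theta_LSTM
bwd_lstm = nn.LSTM(D, D, batch_first=True)    # backward Theta_LSTM
softmax = nn.Linear(D, V)                     # Theta_s (shared)

tokens = torch.tensor([[5, 8, 2, 9, 3]])      # t_1 .. t_N (toy ids)

# Forward LM: states over t_1..t_{N-1} predict t_2..t_N.
h_f, _ = fwd_lstm(emb(tokens[:, :-1]))
loss_f = F.cross_entropy(softmax(h_f).reshape(-1, V), tokens[:, 1:].reshape(-1))

# Backward LM: run over the reversed sequence, so states over t_N..t_2 predict t_{N-1}..t_1.
rev = tokens.flip(1)
h_b, _ = bwd_lstm(emb(rev[:, :-1]))
loss_b = F.cross_entropy(softmax(h_b).reshape(-1, V), rev[:, 1:].reshape(-1))

joint_loss = loss_f + loss_b   # minimizing this maximizes the joint log likelihood J(Theta)
```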
Feature-based approach : ELMo
ELMo part
. Part where the ELMo word representation is defined.
. Task-specific combination of the intermediate layer representations in the biLM.

ELMo part (Cont'd)
. Layer representations trained from the biLSTM part: $R_k$.
. ELMo collapses all layers in $R_k$ into a single vector: $\mathrm{ELMo}_k = E(R_k; \Theta_e)$.
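As a sketch of this collapse step: the ELMo paper instantiates $E(R_k; \Theta)$ for a task as a softmax-weighted sum of the layer representations, scaled by a task-specific factor $\gamma$. In the snippet below the tensors are random stand-ins for real biLM activations, and the weights are left at their uniform initialization rather than learned.

```python
# Task-specific combination: ELMo_k = gamma * sum_j s_j * h_{k,j}
import torch
import torch.nn.functional as F

L, dim = 2, 8                      # 2 biLSTM layers + the token layer, toy dimension
R_k = torch.randn(L + 1, dim)      # layer representations h_{k,0..L} for token k

s = F.softmax(torch.zeros(L + 1), dim=0)   # task-specific weights s_j (learned in practice)
gamma = torch.tensor(1.0)                  # task-specific scale gamma (learned in practice)

elmo_k = gamma * (s.unsqueeze(1) * R_k).sum(dim=0)   # single vector for token k
print(elmo_k.shape)                                  # torch.Size([8])
```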