
Pre-training Language Representation (ELMo, BERT, OpenAI GPT)
Ko, Youngjoong, Sungkyunkwan University

Contents
1. Introduction to Pre-trained Language Representations
2. (Feature-based approach) ELMo
3. (Fine-tuning approach) OpenAI GPT
4. OpenAI GPT-3
5. (Fine-tuning approach) BERT
6. Summary

Introduction : Transfer Learning
What is Transfer Learning?
. A text corpus is first used for unsupervised learning with unlabeled data to obtain a pre-trained word representation.
. That representation is then transferred to supervised learning with labeled data on downstream tasks such as classification, sequence labeling, question answering, and so on.
How can we take advantage of distributed word representations? Transfer learning using word representations.
. Pre-trained word representations: word2vec, GloVe, skip-thought, InferSent, ELMo, ULMFiT, GPT, BERT, ...

Introduction : Pre-trained Language Representations
Two kinds of pre-trained language representations
. 1) Feature-based approach
. 2) Fine-tuning approach
Feature-based approach
. Uses task-specific architectures that include the pre-trained representations as additional features.
. The learned representations are used as features in a downstream model.
. Example: ELMo
Fine-tuning approach
. Introduces minimal task-specific parameters.
. Trained on the downstream tasks by simply fine-tuning the pre-trained parameters.
. Examples: OpenAI GPT, BERT

Feature-based approach : ELMo
ELMo (Peters et al. 2018)
. In GloVe- or word2vec-style methods, a polysemous word gets the same representation no matter the context:
  "I am a big fan of Mozart"      -> 'fan' = [-0.5, -0.3, 0.228, 0.9, 0.31]
  "I need a fan to cool the heat" -> 'fan' = [-0.5, -0.3, 0.228, 0.9, 0.31]
. ELMo instead lets words be represented according to their context ("I need a fan ..." vs. "... fan to cool the heat"), learning to consider context both from left to right and from right to left with a language model (e.g. reading "The tree grew up" in both directions).
. Trained task-specifically: the parameters (weights) learned by the language model are used as features for another task, so a general-purpose vector for 'tree' is replaced by a context- and task-dependent vector from ELMo, e.g. [-0.1, 0.3, 0.25, 0.7, -0.1, -0.4, 0.6, 0.8].

Two components of ELMo
. biLSTM pre-training part: use vectors derived from a bidirectional LSTM trained with a coupled LM objective.
. ELMo part: a task-specific combination of the representations in the biLM.

biLSTM part
. Two objectives: predicting the word in the forward direction and in the backward direction.
. Forward (task of predicting the next token):
  $p(t_1, t_2, \ldots, t_N) = \prod_{k=1}^{N} p(t_k \mid t_1, t_2, \ldots, t_{k-1})$
. Backward (task of predicting the previous token):
  $p(t_1, t_2, \ldots, t_N) = \prod_{k=1}^{N} p(t_k \mid t_{k+1}, t_{k+2}, \ldots, t_N)$
. The overall objective jointly maximizes the log likelihood of the forward and backward directions:
  $J(\Theta) = \sum_{k=1}^{N} \Big( \log p(t_k \mid t_1, \ldots, t_{k-1};\, \Theta_x, \overrightarrow{\Theta}_{LSTM}, \Theta_s) + \log p(t_k \mid t_{k+1}, \ldots, t_N;\, \Theta_x, \overleftarrow{\Theta}_{LSTM}, \Theta_s) \Big)$
  where $\Theta_x$ are the token-representation parameters and $\Theta_s$ the softmax parameters (both shared across directions), and $\overrightarrow{\Theta}_{LSTM}, \overleftarrow{\Theta}_{LSTM}$ are the forward and backward LSTM parameters.
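To make the joint objective concrete, here is a minimal PyTorch sketch of a bidirectional LM loss. It is an illustration under assumptions, not the actual ELMo implementation: the class name BiLMSketch, the word-level embeddings and the layer sizes are invented for the example, whereas real ELMo uses character convolutions, multi-layer projection LSTMs and tied softmax weights. Two LSTMs read the sequence in opposite directions over a shared embedding ($\Theta_x$) and a shared output layer ($\Theta_s$), and the training loss is the summed negative log likelihood, i.e. $-J(\Theta)$.

```python
import torch
import torch.nn as nn

class BiLMSketch(nn.Module):
    """Minimal bidirectional LM sketch: shared embedding (Theta_x) and shared
    softmax layer (Theta_s), separate forward/backward LSTMs (Theta_LSTM)."""
    def __init__(self, vocab_size, emb_dim=128, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)                   # Theta_x (shared)
        self.fwd_lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True)   # forward LM
        self.bwd_lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True)   # backward LM
        self.out = nn.Linear(hidden_dim, vocab_size)                     # Theta_s (shared)
        self.loss = nn.CrossEntropyLoss()

    def forward(self, tokens):
        # tokens: (batch, N) integer token ids
        x = self.embed(tokens)
        # Forward direction: predict t_k from t_1 .. t_{k-1}
        h_fwd, _ = self.fwd_lstm(x[:, :-1])
        loss_fwd = self.loss(self.out(h_fwd).flatten(0, 1), tokens[:, 1:].flatten())
        # Backward direction: predict t_k from t_{k+1} .. t_N
        # (run the second LSTM over the reversed sequence)
        x_rev = torch.flip(x, dims=[1])
        h_bwd, _ = self.bwd_lstm(x_rev[:, :-1])
        targets_bwd = torch.flip(tokens, dims=[1])[:, 1:]
        loss_bwd = self.loss(self.out(h_bwd).flatten(0, 1), targets_bwd.flatten())
        # Jointly maximizing the log likelihood <=> minimizing the summed NLL, -J(Theta)
        return loss_fwd + loss_bwd

# toy usage
model = BiLMSketch(vocab_size=1000)
batch = torch.randint(0, 1000, (2, 7))   # two sentences of length 7
loss = model(batch)
loss.backward()
```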
ELMo part
. The part where the ELMo word representation is defined: a task-specific combination of the intermediate layer representations in the biLM.
. The layer representations trained by the biLSTM part for token $k$ are collected as
  $R_k = \{\, \mathbf{h}^{LM}_{k,j} \mid j = 0, \ldots, L \,\}$
  where $j = 0$ is the token embedding $[\mathbf{x}_k; \mathbf{x}_k]$ and each higher layer is the concatenation $[\overrightarrow{\mathbf{h}}^{LM}_{k,j}; \overleftarrow{\mathbf{h}}^{LM}_{k,j}]$ of the forward and backward LSTM states.
. ELMo collapses all layers in $R_k$ into a single vector: $\mathbf{ELMo}_k = E(R_k; \Theta_e)$.
. Choices of $E(R_k; \Theta_e)$:
  Simplest case: $\mathbf{ELMo}_k = E(R_k) = \mathbf{h}^{LM}_{k,L}$, i.e. the top layer only (Peters et al. 2017, McCann et al. 2017).
  General case: compute a task-specific weighting of all biLM layers, where the downstream task learns the weighting parameters:
  $\mathbf{ELMo}^{task}_k = E(R_k; \Theta^{task}) = \gamma^{task} \sum_{j=0}^{L} s^{task}_j \, \mathbf{h}^{LM}_{k,j}$

ELMo Evaluation
. Question Answering (SQuAD): F1 +1.4% over the previous SOTA
. Textual Entailment (SNLI): accuracy +0.7% with SOTA + ELMo
. Semantic Role Labeling (SRL): F1 +3.2% with a SOTA reimplementation + ELMo
. Coreference Resolution (Coref): average F1 +3.2% with a SOTA reimplementation + ELMo
. Named Entity Recognition (NER): F1 +0.3% with SOTA + ELMo
. Sentiment Analysis (SST-5): accuracy +1% with a SOTA reimplementation + ELMo

ELMo Evaluation : effects of the 'Deep biLM' part
. Using the biLM's context representations to disambiguate word sense in the source sentence (word sense disambiguation test): deeper layers capture more semantic information.
. Using them to disambiguate part of speech in the source sentence (POS tagging test): shallower layers capture more syntactic information.

Fine-tuning approach to pre-trained Word Representation
. Trained on the downstream tasks by simply fine-tuning the pre-trained parameters; only minimal task-specific parameters are added.
. "fine-tune effectively with minimal changes to the architecture of the pre-trained model" (Radford et al. 2018)
. Examples: OpenAI GPT, BERT

Fine-tuning approach : Open-AI GPT
OpenAI GPT
. From the paper "Improving Language Understanding by Generative Pre-Training," Radford et al. 2018.
. Makes use of the Transformer model (the decoder, whereas BERT uses the encoder) for unsupervised pre-training, which is then transferred to discriminative (downstream) tasks.

Framework
. First stage: learn a high-capacity language model on a large corpus of text (the BooksCorpus dataset).
. Followed by a fine-tuning stage where the model adapts to a discriminative task with labeled data.

First Stage, Unsupervised pre-training
. Language-model objective on a large corpus of unlabeled data:
  $L_1(\mathcal{U}) = \sum_i \log P(u_i \mid u_{i-k}, \ldots, u_{i-1}; \Theta)$
  where $k$ is the size of the context window, $\mathcal{U} = \{u_1, u_2, \ldots, u_n\}$ is the unsupervised corpus of tokens, and $P$ is modeled by a neural network with parameters $\Theta$.
. A multi-layer Transformer decoder block is used for the language model.

Transformer (Decoder Block)
. Linearly transform $d_i$ into a Query (Q), and linearly transform the other positions into Keys (K).
. Take the dot product with every neighboring position, masking out future positions by setting their logits to a large negative value (about $-10^9$).
. Softmax the logits, take the convex combination of the (value) vectors weighted by the softmax result, and pass it through a feed-forward network: the output becomes the re-representation of $d_i$, i.e. $d_i'$.
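The decoder-block steps above can be sketched as a single-head causal self-attention layer. This is an illustration under assumptions, not the OpenAI GPT code: the slide mentions only Query and Key projections, so the separate Value projection, the $1/\sqrt{d}$ scaling and the GELU feed-forward layer added here are standard Transformer practice, while residual connections, layer norm and multiple heads are omitted for brevity.

```python
import math
import torch
import torch.nn as nn

class CausalSelfAttentionSketch(nn.Module):
    """One simplified decoder block: masked single-head self-attention
    followed by a position-wise FFNN (no residuals, layer norm, or heads)."""
    def __init__(self, d_model=64):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_model)   # d_i -> Query
        self.k_proj = nn.Linear(d_model, d_model)   # other positions -> Key
        self.v_proj = nn.Linear(d_model, d_model)   # Value (standard practice)
        self.ffnn = nn.Sequential(                  # position-wise feed-forward
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model))

    def forward(self, x):
        # x: (batch, seq_len, d_model)
        q, k, v = self.q_proj(x), self.k_proj(x), self.v_proj(x)
        logits = q @ k.transpose(-2, -1) / math.sqrt(x.size(-1))   # dot product of every pair
        # mask out future positions: set their logits to a large negative value
        seq_len = x.size(1)
        causal = torch.tril(torch.ones(seq_len, seq_len, device=x.device)).bool()
        logits = logits.masked_fill(~causal, -1e9)
        weights = torch.softmax(logits, dim=-1)     # each row sums to 1 (convex combination)
        context = weights @ v                       # weighted combination per position
        return self.ffnn(context)                   # re-representation d_i'

# toy usage: three token vectors
block = CausalSelfAttentionSketch()
h = block(torch.randn(1, 3, 64))   # -> shape (1, 3, 64)
```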
First Stage, Unsupervised pre-training (Cont'd)
Overview
. Inputs are tokenized by the spaCy tokenizer.
. Inputs are fed into 12 layers of Transformer blocks at each time step.
. The last layer produces a probability distribution over the BPE-based vocabulary (40,000 entries).
12 layers of Transformer blocks
. i) Masked multi-headed self-attention over the input context tokens,
. ii) followed by position-wise feed-forward layers,
. iii) then a softmax of the feed-forward result over the target tokens.
Example) "I want to build a language model architecture ..."
. Feeding "I", "want", "to" through the L = 12 blocks, the model predicts the next tokens "want", "to", "build".

Second Stage, Supervised fine-tuning
. Once the first stage is finished on the unsupervised corpus of tokens, the language-model parameters are pre-trained and thus available as a pre-trained word representation.
. Adapt the parameters from the first stage to the supervised target task with a labeled dataset $\mathcal{C} = \{c_1, \ldots, c_n\}$, where each example is $c_i = \{x^1, \ldots, x^m, y\}$.
. The inputs are passed through the pre-trained model to obtain the final Transformer block's activation $h_l^m$ ($m$: last token index, $l$: last layer index):
  $P(y \mid x^1, \ldots, x^m) = \mathrm{softmax}(h_l^m W_y)$

There is one problem in the Second Stage ...
. The input form of OpenAI GPT is a flat token sequence $\mathcal{U} = \{u_1, u_2, \ldots, u_n\}$, e.g. {'I', 'am', 'a', 'student', 'I', 'like', 'studying', ...}.
. But task-specific problems such as question answering and textual entailment have structured inputs, e.g. {document, question, answers}.
. How can those structured inputs be aligned so that fine-tuning works in the pre-trained input manner? Fit the form of the inputs to the model with 'Start', 'Delim', and 'Extract' tokens!!
. Ex) In the entailment task, the input has the structured form of a premise and a hypothesis; the two are concatenated into a single sequence separated by these special tokens.
. See the difference between the feature-based and fine-tuning approaches?
  Feature-based: you fit your representation to the task.
  Fine-tuning: the task is fitted to the representation learning.

Open-AI GPT 3
Task-agnostic Language Model
. A task-agnostic NLP model with good generality that requires no fine-tuning.
. Few-shot learning, zero-shot learning?
Model
. Same architecture as GPT-2, with a greatly increased number of parameters (175B parameters).
Data
. About 150 billion tokens drawn from as much as 45 TB of text (500 GB of preprocessed text).
Results
. Performance reported on sentence generation and Cloze-style quiz tasks, as well as translation.

Fine-tuning approach : BERT
BERT (Bidirectional Encoder Representations from Transformers)
. Paper published in NAACL 2019 by Google AI: "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding," Devlin et al. 2019. Won the Best Long Paper award.
. OpenAI GPT cannot take in right-to-left context: a deep bidirectional model is more powerful than either a left-to-right model (GPT) or the shallow concatenation of a left-to-right and a right-to-left model (ELMo).
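As a closing illustration of the "deep bidirectional" point, the sketch below contrasts the attention masks of a left-to-right model (GPT) and a bidirectional encoder (BERT). The masked-language-model comment describes standard BERT training and, like the example sentence (which extends the "I grew up in ..." figure from the slides), is illustrative rather than taken from these slides.

```python
import torch

seq_len = 5

# GPT-style causal mask: position k may attend only to positions 1..k (left-to-right context).
causal_mask = torch.tril(torch.ones(seq_len, seq_len)).bool()

# BERT-style mask: every position attends to the whole sentence, in both directions, at every layer.
bidirectional_mask = torch.ones(seq_len, seq_len, dtype=torch.bool)

# Because BERT sees both sides at once, it cannot be trained as an ordinary next-token LM;
# instead some input tokens are replaced by [MASK] and must be recovered from full context.
tokens = ["I", "grew", "up", "in", "Seoul"]          # illustrative sentence
masked_input = ["I", "grew", "[MASK]", "in", "Seoul"]  # the model must predict "up"

print(causal_mask.int())
print(bidirectional_mask.int())
print(masked_input)
```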