Pre-training Language Representation (ELMo, BERT, OpenAI GPT)

Contents
1. Introduction to Pre-trained Language Representations
2. (Feature-based approach) ELMo
3. (Fine-tuning approach) OpenAI GPT
4. OpenAI GPT-3
5. (Fine-tuning approach) BERT
6. Summary

Ko, Youngjoong, Sungkyunkwan University

Introduction: Transfer Learning

How can we take advantage of distributed word representations? Through transfer learning with pre-trained word representations: a representation (word2vec, GloVe, skip-thought, InferSent, ELMo, ULMFiT, GPT, BERT, ...) is first learned from a large text corpus by unsupervised learning on unlabeled data, and is then transferred to downstream tasks such as classification, sequence labeling, and question answering, which are trained by supervised learning on labeled data.

Introduction: Pre-trained Language Representations

There are two kinds of pre-trained language representations:

1) Feature-based approach
- Uses task-specific architectures that include the pre-trained representations as additional features.
- The learned representations are used as features in a downstream model.
- Example: ELMo.

2) Fine-tuning approach
- Introduces minimal task-specific parameters.
- Trained on the downstream tasks by simply fine-tuning the pre-trained parameters.
- Examples: OpenAI GPT, BERT.

Feature-based approach: ELMo

ELMo (Peters et al. 2018)
- With GloVe or word2vec, polysemous words get the same representation no matter the context: 'fan' may mean an enthusiast (a BTS fan at a concert) or a device that cools the air, yet
  "I am a big fan of Mozart"      → 'fan' = [-0.5, -0.3, 0.228, 0.9, 0.31]
  "I need a fan to cool the heat" → 'fan' = [-0.5, -0.3, 0.228, 0.9, 0.31]
- ELMo instead lets words be represented according to their context ("I need a fan ... hmm ... a fan to cool the heat"), by learning to consider context both from left to right and from right to left with a language model.
- ELMo is used task-specifically: the parameters (weights) learned by the language model serve as features for another task. A general-purpose 'tree' vector differs from the 'tree' representation ELMo produces for the sentence "The tree grew up":
  'tree' from ELMo = [-0.1, 0.3, 0.25, 0.7, -0.1, -0.4, 0.6, 0.8]

Two components of ELMo
- biLSTM pre-training part: use vectors derived from a bidirectional LSTM trained with a coupled language-model (LM) objective.
- ELMo part: a task-specific combination of the representations in the biLM.

biLSTM part
- Two objectives: predicting the word in the forward direction and in the backward direction.
- Forward (task of predicting the next token):
  p(t_1, t_2, ..., t_N) = \prod_{k=1}^{N} p(t_k | t_1, t_2, ..., t_{k-1})
- Backward (task of predicting the previous token):
  p(t_1, t_2, ..., t_N) = \prod_{k=1}^{N} p(t_k | t_{k+1}, t_{k+2}, ..., t_N)
- The overall objective jointly maximizes the log likelihood of the forward and backward directions:
  J(\Theta) = \sum_{k=1}^{N} ( \log p(t_k | t_1, ..., t_{k-1}; \Theta_x, \overrightarrow{\Theta}_{LSTM}, \Theta_s) + \log p(t_k | t_{k+1}, ..., t_N; \Theta_x, \overleftarrow{\Theta}_{LSTM}, \Theta_s) )
  where the token-representation parameters \Theta_x and the softmax parameters \Theta_s are shared between the two directions, while \overrightarrow{\Theta}_{LSTM} and \overleftarrow{\Theta}_{LSTM} are direction-specific.
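As a concrete illustration of this coupled objective, here is a minimal sketch of a forward/backward LM loss (PyTorch is assumed; the class name, layer sizes, and training details are illustrative only — the actual ELMo biLM uses character-CNN token representations and, as the formula above indicates, ties the token and softmax parameters across the two directions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BiLM(nn.Module):
    """Minimal sketch of a coupled forward/backward language model."""

    def __init__(self, vocab_size, emb_dim=128, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        # Two direction-specific LSTMs; the embedding and the output
        # (softmax) layer are shared between the two directions.
        self.fwd_lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        self.bwd_lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, tokens):                 # tokens: (batch, seq_len)
        emb = self.embed(tokens)
        fwd_h, _ = self.fwd_lstm(emb)          # reads left-to-right
        bwd_h, _ = self.bwd_lstm(emb.flip(1))  # reads right-to-left
        bwd_h = bwd_h.flip(1)                  # re-align to original positions
        # Forward LM: predict t_k from t_1 .. t_{k-1}.
        fwd_loss = F.cross_entropy(
            self.out(fwd_h[:, :-1]).flatten(0, 1), tokens[:, 1:].flatten())
        # Backward LM: predict t_k from t_{k+1} .. t_N.
        bwd_loss = F.cross_entropy(
            self.out(bwd_h[:, 1:]).flatten(0, 1), tokens[:, :-1].flatten())
        # Jointly maximize both log likelihoods = minimize the summed NLL.
        return fwd_loss + bwd_loss

# toy usage
model = BiLM(vocab_size=1000)
batch = torch.randint(0, 1000, (4, 20))        # 4 sequences of 20 token ids
loss = model(batch)
loss.backward()
```

The per-direction hidden states (fwd_h and bwd_h above) are exactly the layer representations that the ELMo part, described next, combines into a single vector per token.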
ELMo part
- The part where the ELMo word representation is defined: a task-specific combination of the intermediate layer representations in the biLM.
- For each token k, the biLSTM part provides a set of layer representations
  R_k = { h_{k,j}^{LM} | j = 0, ..., L },
  where h_{k,0}^{LM} is the token (input) layer x_k and, for j >= 1, h_{k,j}^{LM} = [\overrightarrow{h}_{k,j}^{LM}; \overleftarrow{h}_{k,j}^{LM}] concatenates the forward and backward hidden states of biLSTM layer j.
- ELMo collapses all layers in R_k into a single vector, ELMo_k = E(R_k; \Theta_e).

Choices of E(R_k; \Theta_e)
- Simplest case: ELMo_k = E(R_k) = h_{k,L}^{LM}, i.e. just the top layer (Peters et al. 2017, McCann et al. 2017).
- General case: compute a task-specific weighting of all biLM layers, where the downstream task learns the weighting parameters:
  ELMo_k^{task} = E(R_k; \Theta^{task}) = \gamma^{task} \sum_{j=0}^{L} s_j^{task} h_{k,j}^{LM}
  with softmax-normalized layer weights s^{task} and a scalar \gamma^{task}.

ELMo Evaluation
- Question Answering (SQuAD): average F1 score, +1.4% over SOTA.
- Textual Entailment (SNLI): accuracy, +0.7% with SOTA + ELMo.
- Semantic Role Labeling (SRL): average F1 score, +3.2% with SOTA reimplementation + ELMo.
- Coreference Resolution (Coref): average F1 score, +3.2% with SOTA reimplementation + ELMo.
- Named Entity Recognition (NER): average F1 score, +0.3% with SOTA + ELMo.
- Sentiment Analysis (SST-5): accuracy, +1% with SOTA reimplementation + ELMo.

ELMo Evaluation: effects of the deep biLM
- Downstream tasks that use the biLM's context representations: disambiguating word sense in the source sentence (word sense disambiguation test) and disambiguating part of speech in the source sentence (POS tagging test).
- Deeper layers capture more semantic information; shallower layers capture more syntactic information.

Fine-tuning approach to pre-trained word representations
- Trained on the downstream tasks by simply fine-tuning the pre-trained parameters.
- Minimal task-specific parameters: "fine-tune effectively with minimal changes to the architecture of the pre-trained model" (Radford et al. 2018).
- Examples: OpenAI GPT, BERT.

Fine-tuning approach: OpenAI GPT
- From the paper "Improving Language Understanding by Generative Pre-Training," Radford et al. 2018.
- Makes use of the Transformer model — the decoder blocks (whereas BERT uses the encoder blocks) — for unsupervised pre-training; the pre-trained model is then transferred to discriminative (downstream) tasks.

Framework
- First stage: learn a high-capacity language model on a large corpus of text (the BooksCorpus dataset).
- Second stage: a fine-tuning stage in which the model adapts to a discriminative task with labeled data.

First Stage: Unsupervised pre-training
- Language-model objective over a large corpus of unlabeled data:
  L_1(\mathcal{U}) = \sum_i \log P(u_i | u_{i-k}, ..., u_{i-1}; \Theta)
  where k is the size of the context window, \mathcal{U} = {u_1, u_2, ..., u_n} is the unsupervised corpus of tokens, and P is modeled by a neural network with parameters \Theta.
- A multi-layer Transformer decoder block is used for the language model.

Transformer (Decoder Block)
- Linearly transform the current position d_i into a query Q, and linearly transform the other positions into keys K (and, in the standard Transformer formulation, values V).
- Take the dot product of the query with every neighboring position, and mask out the logits of future positions (e.g., by pushing them to a large negative value) so that they receive essentially zero attention weight.
- Softmax the remaining logits, form the convex combination of the neighboring representations weighted by the softmax result, and pass it through a position-wise feed-forward network; this becomes the re-representation d_i → d_i'.
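The masked self-attention step just described can be sketched as follows (PyTorch is assumed; a single attention head with an explicit value projection is shown, following the standard Transformer formulation, and the position-wise feed-forward network that follows is omitted — this is an illustration, not the GPT implementation):

```python
import torch

def masked_self_attention(x, w_q, w_k, w_v):
    """Single-head masked (causal) self-attention over x: (seq_len, d_model)."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v                   # linear transforms: queries, keys, values
    logits = q @ k.T / k.shape[-1] ** 0.5                 # scaled dot product with every position
    causal = torch.tril(torch.ones_like(logits)).bool()   # allowed: self and earlier positions
    logits = logits.masked_fill(~causal, float("-inf"))   # hide future positions
    weights = torch.softmax(logits, dim=-1)               # attention weights (each row sums to 1)
    return weights @ v                                     # convex combination -> re-representation d'

# toy usage: 5 tokens, model width 8
d = 8
x = torch.randn(5, d)
w_q, w_k, w_v = (torch.randn(d, d) for _ in range(3))
d_prime = masked_self_attention(x, w_q, w_k, w_v)          # shape (5, 8)
```

Stacking several such blocks (with multiple heads and the feed-forward sub-layer) yields the decoder-only Transformer used as the GPT language model, as summarized next.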
First Stage: Unsupervised pre-training (Cont'd) — Overview
- Inputs are tokenized by the spaCy tokenizer.
- At each time step, the inputs are fed into 12 layers of Transformer blocks, each consisting of
  i) masked multi-headed self-attention over the input context tokens,
  ii) followed by position-wise feed-forward layers,
  iii) and a softmax of the feed-forward result over the target tokens.
- The last layer produces a probability distribution over a BPE-based vocabulary (40,000 tokens).
- Example: given "I want to build a language model architecture ...", the model predicts the next token at each position — 'want' after "I", 'to' after "I want", 'build' after "I want to", and so on — using L = 12 stacked Transformer blocks.

Second Stage: Supervised fine-tuning
- Once the first stage on the unsupervised corpus of tokens is finished, the language-model parameters are pre-trained and are therefore available as a pre-trained word representation.
- Adapt the parameters from the first stage to the supervised target task with a labeled dataset \mathcal{C} = {c_1, ..., c_n}, where each instance c = {x^1, ..., x^m, y} is a sequence of input tokens with a label.
- The inputs are passed through the pre-trained model to obtain the final Transformer block's activation h_l^m (m: last token index, l: last layer index), which is fed into an added linear output layer with parameters W_y:
  P(y | x^1, ..., x^m) = softmax(h_l^m W_y)

There is one problem in the second stage
- The input form that OpenAI GPT is pre-trained on is a flat token sequence \mathcal{U} = {u_1, u_2, ..., u_n}, e.g. {'I', 'am', 'a', 'student', 'I', 'like', 'studying', ...}.
- Task-specific settings such as question answering or textual entailment have structured inputs, e.g. {document, question, answers}.
- How can those structured inputs be aligned so that they can be fine-tuned in the same form the pre-trained model expects? Fit the form of the inputs to the model using special 'Start', 'Delim', and 'Extract' tokens. For the entailment task, for example, the input has the structured form of a premise and a hypothesis, and it is aligned as a single sequence: Start, premise tokens, Delim, hypothesis tokens, Extract.
- This highlights the difference between the feature-based and fine-tuning approaches: in the feature-based approach you fit the representation to the task, whereas in the fine-tuning approach the task is fitted to the representation learning.

OpenAI GPT-3

Task-agnostic Language Model
- A task-agnostic NLP model with broad generality that works well without fine-tuning.
- Relies on few-shot (or even zero-shot) learning instead of task-specific fine-tuning.
- Model: the same architecture as GPT-2, with a greatly increased number of parameters (175B).
- Data: as much as 45 TB of text — about 150 billion tokens (500 GB of preprocessed text).
- Evaluated on sentence generation, cloze-style quiz tasks, and translation.

Fine-tuning approach: BERT

BERT (Bidirectional Encoder Representations from Transformers)
- Paper published at NAACL 2019 by Google AI: "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding," Devlin et al. 2019, NAACL; it won the Best Long Paper award.
- OpenAI GPT cannot take right-to-left context into account. A deep bidirectional model is more powerful than either a left-to-right model (GPT) or the shallow concatenation of a left-to-right and a right-to-left model (ELMo).
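To make this contrast concrete, the sketch below builds the three context masks involved: a left-to-right (GPT-style) causal mask, its right-to-left counterpart, and the full bidirectional mask a deep bidirectional encoder can use. It only illustrates which positions may attend to which; it is not code from any of the papers (PyTorch assumed):

```python
import torch

def context_masks(seq_len):
    """mask[i, j] == True means position i may attend to position j."""
    left_to_right = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))  # causal, GPT-style
    right_to_left = left_to_right.T.clone()                                     # the reverse-direction LM
    bidirectional = torch.ones(seq_len, seq_len, dtype=torch.bool)              # full context
    return left_to_right, right_to_left, bidirectional

l2r, r2l, full = context_masks(4)
print(l2r.int())   # lower-triangular: each token sees itself and the past
print(r2l.int())   # upper-triangular: each token sees itself and the future
print(full.int())  # all ones: every token sees the whole sentence
```

ELMo's biLM corresponds to running separate models under the first two masks and concatenating their outputs (a shallow combination), while a deep bidirectional model conditions every layer on the full mask.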