Transformers for Large-Scale Language and Image Modeling


Outline
• Basic transformer model (review and details)
• Transformer-based language models: BERT, GPT, other models; concerns
• Models for vision and language: image transformers; image-text transformers (CLIP, DALL-E)

Basic transformer model (review)
• Sequence-to-sequence architecture using only point-wise processing and attention (no recurrent units or convolutions)
• Encoder: receives the entire input sequence and outputs an encoded sequence of the same length
• Decoder: predicts the next token conditioned on the encoder output and the previously predicted tokens
A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. Gomez, L. Kaiser, I. Polosukhin, Attention is all you need, NeurIPS 2017

Key-Value-Query attention model
• The decoder generates a query describing what it wants to focus on
• Dot products between the query and the keys generated by the encoder give alignment scores between the source tokens and the query
• The scores are fed into a softmax to create the attention weights
• The output is the sum of the values generated by the encoder, weighted by the attention weights
• Key vectors: K = X W_K
• Value vectors: V = X W_V
• Query vectors: Q (produced by the decoder)
• Similarities (scaled dot-product attention): e_{i,j} = (q_i · k_j) / √D, or in matrix form E = Q K^T / √D, where D is the dimensionality of the keys
• Attention weights: A = softmax(E, dim = 1)
• Output vectors: y_i = Σ_j a_{i,j} v_j, or Y = A V
• Questions: How does permuting the order of the queries change the output? How does changing the order of the keys/values change the output?
Adapted from J. Johnson

Attention mechanisms
• Encoder self-attention: queries, keys, and values come from the previous layer of the encoder
• Decoder self-attention: values corresponding to future decoder outputs are masked out
• Encoder-decoder attention: queries come from the previous decoder layer; keys and values come from the output of the encoder

Self-attention
• Used to capture context within the sequence: as we encode “it”, we may need to focus on “the animal” in one sentence and on “the street” in another

Self-attention layer
• One query per input vector: Q = X W_Q, K = X W_K, V = X W_V
• Similarities (scaled dot-product attention): E = Q K^T / √D
• Attention weights: A = softmax(E, dim = 1)
• Output vectors: Y = A V
Adapted from J. Johnson

Masked self-attention layer
• The decoder should not “look ahead” in the output sequence: while generating “This is …” from “<START> This is”, each position may only attend to earlier positions
• Scores for future positions are set to −∞ before the softmax, so their attention weights become zero
Adapted from J. Johnson
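The attention equations above translate almost line for line into code. Below is a minimal NumPy sketch (not from the slides; the function name and shapes are illustrative) of scaled dot-product attention, with an optional causal flag implementing the masked self-attention just described.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V, causal=False):
    """Q: (n_q, D), K: (n_kv, D), V: (n_kv, D_v) -> output of shape (n_q, D_v).
    The causal case assumes n_q == n_kv (self-attention)."""
    D = K.shape[-1]
    E = Q @ K.T / np.sqrt(D)                        # alignment scores E = Q K^T / sqrt(D)
    if causal:
        # Masked self-attention: position i must not attend to positions j > i,
        # so set those scores to -inf; the softmax then gives them zero weight.
        future = np.triu(np.ones_like(E, dtype=bool), k=1)
        E = np.where(future, -np.inf, E)
    A = np.exp(E - E.max(axis=-1, keepdims=True))   # softmax over the key axis...
    A /= A.sum(axis=-1, keepdims=True)              # ...gives the attention weights
    return A @ V                                    # outputs Y = A V
```

Running this with the rows of K and V permuted together leaves every output vector unchanged, while permuting the rows of Q only reorders the outputs; this answers the two questions posed above: attention treats the key/value pairs as an unordered set.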
Attention mechanisms: Summary
• Encoder self-attention: queries, keys, and values come from the previous layer of the encoder
• Decoder self-attention: values corresponding to future decoder outputs are masked out
• Encoder-decoder attention: queries come from the previous decoder layer; keys and values come from the output of the encoder
• The encoder and the decoder each stack N transformer blocks

Attention mechanisms: Illustration
https://ai.googleblog.com/2017/08/transformer-novel-neural-network.html

Transformer architecture: Details
A. Vaswani et al., Attention is all you need, NeurIPS 2017

Multi-head attention
• Run h attention models in parallel on top of different linearly projected versions of Q, K, V; concatenate and linearly project the results
• Intuition: enables the model to attend to different kinds of information at different positions (see visualization tool)

Transformer blocks
• A Transformer is a sequence of transformer blocks (see the code sketch below)
• Vaswani et al.: 12 blocks (6 encoder + 6 decoder), embedding dimension 512, 8 attention heads
• Add & Norm: residual connection followed by layer normalization
• Feedforward: two linear layers with a ReLU in between, applied independently to each vector
• Attention is the only interaction between inputs!

Positional encoding
• To give the transformer information about the ordering of tokens, add a function of position (based on sines and cosines) to every input embedding

Transformer architecture: Zooming back out
A. Vaswani et al., Attention is all you need, NeurIPS 2017

Results
https://ai.googleblog.com/2017/08/transformer-novel-neural-network.html
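To make the block structure concrete, here is a minimal PyTorch sketch (an illustration, not the authors' implementation) of one encoder block: multi-head self-attention followed by Add & Norm, then the position-wise feedforward network followed by another Add & Norm. The default sizes follow the base configuration above; the feedforward width of 2048 is the value used in the original paper.

```python
import torch
import torch.nn as nn

class TransformerEncoderBlock(nn.Module):
    """One encoder block: multi-head self-attention + Add & Norm,
    then a position-wise feedforward network + Add & Norm."""

    def __init__(self, d_model=512, num_heads=8, d_ff=2048, dropout=0.1):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, num_heads,
                                               dropout=dropout, batch_first=True)
        # Feedforward: two linear layers with a ReLU in between,
        # applied independently to each position.
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Linear(d_ff, d_model),
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # x: (batch, sequence_length, d_model)
        attn_out, _ = self.self_attn(x, x, x)         # queries = keys = values = x
        x = self.norm1(x + self.dropout(attn_out))    # Add & Norm (residual + LayerNorm)
        x = self.norm2(x + self.dropout(self.ff(x)))  # Add & Norm
        return x

block = TransformerEncoderBlock()
y = block(torch.randn(2, 10, 512))                    # 2 sequences of 10 tokens each
```

Stacking several such blocks on top of token embeddings plus the sinusoidal positional encoding gives the encoder; decoder blocks additionally apply the causal mask from the masked self-attention layer and an encoder-decoder attention layer.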
Different ways of processing sequences
• RNN: works on ordered sequences
  – Pro: good for long sequences: after one RNN layer, h_T “sees” the whole sequence
  – Con: not parallelizable: hidden states must be computed sequentially
  – Con: hidden states have limited expressive capacity
• 1D convolutional network: works on multidimensional grids
  – Pro: each output can be computed in parallel (at training time)
  – Con: bad at long sequences: many conv layers must be stacked before outputs “see” the whole sequence
• Transformer: works on sets of vectors
  – Pro: good at long sequences: after one self-attention layer, each output “sees” all inputs
  – Pro: each output can be computed in parallel (at training time)
  – Con: memory-intensive

Making transformers more efficient
R. Child et al., Generating Long Sequences with Sparse Transformers, arXiv 2019

Outline
• Basic transformer model
• Transformer-based language models: BERT, GPT, other models; concerns

Self-supervised language modeling with transformers
1. Download A LOT of text from the internet
2. Train a giant transformer using a suitable pretext task
3. Fine-tune the transformer on the desired NLP task

Bidirectional Encoder Representations from Transformers (BERT)
J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, NAACL 2019

BERT: Pretext tasks
• Masked language model (MLM): randomly mask 15% of the tokens in the input sentences; the goal is to reconstruct them using bidirectional context (see the sketch at the end of this section)
• Next sentence prediction (NSP): predict the likelihood that sentence B belongs after sentence A
  – Useful for Question Answering and Natural Language Inference tasks
  – In the training data, 50% of the time B is the actual sentence that follows A (labeled IsNext), and 50% of the time it is a random sentence (labeled NotNext)

BERT: More detailed view
• WordPiece tokenization (from GNMT)
• Trained on Wikipedia (2.5B words) + BookCorpus (800M words)

BERT: Evaluation
• General Language Understanding Evaluation (GLUE) benchmark (gluebenchmark.com)

BERT: Downstream tasks
• Textual entailment, textual equivalence and similarity (source: J. Hockenmaier)
• Sentiment classification, linguistic acceptability
• Question answering: find the span in a paragraph that contains the answer (source: SQuAD v1.1 paper)
• Named entity recognition

Other language models (overview figure; source: J. Johnson)

Scaling up transformers

Model              Layers  Hidden dim.  Heads  Params  Data    Training
Transformer-Base   12      512          8      65M     -       8x P100 (12 hours)
Transformer-Large  12      1024         16     213M    -       8x P100 (3.5 days)
BERT-Base          12      768          12     110M    13 GB   4x TPU (4 days)
BERT-Large         24      1024         16     340M    13 GB   16x TPU (4 days)
XLNet-Large        24      1024         16     ~340M   126 GB  512x TPU-v3 (2.5 days)
RoBERTa            24      1024         16     355M    160 GB  1024x V100 GPU (1 day)
GPT-2              48      1600         ?      1.5B    40 GB   -

Vaswani et al., Attention is all you need, NeurIPS 2017
Devlin et al., BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, NAACL 2019
Yang et al., XLNet: Generalized Autoregressive Pretraining for Language Understanding, 2019
Liu et al., RoBERTa: A Robustly Optimized BERT Pretraining Approach, 2019
Radford et al., Language models are unsupervised multitask learners, 2019
Source: J. Johnson
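As referenced in the MLM bullet above, here is a small Python sketch of the masking step of the pretext task. The whitespace tokenization, the mask_tokens name, and the toy sentence are illustrative assumptions; this is not BERT's actual preprocessing code.

```python
import random

def mask_tokens(tokens, mask_token="[MASK]", mask_prob=0.15, seed=0):
    """Randomly mask ~15% of the tokens; the model is trained to
    reconstruct the originals from bidirectional context."""
    rng = random.Random(seed)
    masked = list(tokens)
    targets = {}                       # position -> original token to predict
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            targets[i] = tok
            masked[i] = mask_token
    return masked, targets

masked, targets = mask_tokens("the animal did not cross the street".split())
print(masked)    # e.g. ['the', '[MASK]', 'did', 'not', 'cross', 'the', 'street']
print(targets)   # e.g. {1: 'animal'}
```

The full BERT recipe refines this slightly (a selected token is sometimes replaced by a random token or left unchanged rather than always by [MASK]), but the 15% masking rate matches the description above.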