Two/Too Simple Adaptations of Word2vec for Syntax Problems

Wang Ling    Chris Dyer    Alan Black    Isabel Trancoso
L2F Spoken Systems Lab, INESC-ID, Lisbon, Portugal
Language Technologies Institute, Carnegie Mellon University, Pittsburgh, PA, USA
Instituto Superior Técnico, Lisbon, Portugal
{lingwang, cdyer, [email protected]}    [email protected]

Abstract

We present two simple modifications to the models in the popular Word2Vec tool, in order to generate embeddings more suited to tasks involving syntax. The main issue with the original models is the fact that they are insensitive to word order. While order independence is useful for inducing semantic representations, this leads to suboptimal results when they are used to solve syntax-based problems. We show improvements in part-of-speech tagging and dependency parsing using our proposed models.

1 Introduction

Word representations learned from neural language models have been shown to improve many NLP tasks, such as part-of-speech tagging (Collobert et al., 2011), dependency parsing (Chen and Manning, 2014; Kong et al., 2014) and machine translation (Liu et al., 2014; Kalchbrenner and Blunsom, 2013; Devlin et al., 2014; Sutskever et al., 2014). These low-dimensional representations are learned as parameters in a language model and trained to maximize the likelihood of a large corpus of raw text. They are then incorporated as features alongside hand-engineered features (Turian et al., 2010), or used to initialize the parameters of neural networks targeting tasks for which substantially less training data is available (Hinton and Salakhutdinov, 2012; Erhan et al., 2010; Guo et al., 2014).

Among the most widely used tools for building word vectors are the models described in (Mikolov et al., 2013), implemented in the Word2Vec tool, in particular the "skip-gram" and the "continuous bag-of-words" (CBOW) models. These two models make different independence and conditioning assumptions; however, both models discard word order information in how they account for context. Thus, embeddings built using these models have been shown to capture semantic information between words, and pre-training using these models has been shown to lead to major improvements in many tasks (Collobert et al., 2011). While more sophisticated approaches have been proposed (Dhillon et al., 2011; Huang et al., 2012; Faruqui and Dyer, 2014; Levy and Goldberg, 2014; Yang and Eisenstein, 2015), Word2Vec remains a popular choice due to its efficiency and simplicity.

However, as these models are insensitive to word order, embeddings built using them are suboptimal for tasks involving syntax, such as part-of-speech tagging or dependency parsing. This is because syntax defines "what words go where?", while semantics defines "what words go together". Obviously, in a model where word order is discarded, many of the syntactic relations between words cannot be captured properly. For instance, while most words occur with the word the, only nouns tend to occur exactly afterwards (e.g. the cat). This is supported by empirical evidence suggesting that order-insensitivity does indeed lead to substandard syntactic representations (Andreas and Klein, 2014; Bansal et al., 2014): systems using embeddings pre-trained with Word2Vec yield slight improvements, while the computationally far more expensive embeddings of Collobert et al. (2011), which use word order information, yielded much better results.
In this work, we describe two simple modifications to Word2Vec, one for the skip-gram model and one for the CBOW model, that improve the quality of the embeddings for syntax-based tasks¹. Our goal is to improve the final embeddings while maintaining the simplicity and efficiency of the original models. We demonstrate the effectiveness of our approaches by training, on commodity hardware, on datasets containing more than 50 million sentences and over 1 billion words in less than a day, and show that our methods lead to improvements when used in state-of-the-art neural network systems for part-of-speech tagging and dependency parsing, relative to the original models.

¹ The code developed in this work is made available at https://github.com/wlin12/wang2vec.

2 Word2Vec

The work in (Mikolov et al., 2013) is a popular choice for pre-training the projection matrix $W \in \mathbb{R}^{d \times |V|}$, where $d$ is the embedding dimension and $V$ is the vocabulary. As an unsupervised task trained on raw text, it builds word embeddings by maximizing the likelihood that words are predicted from their context or vice versa. Two models were defined, the skip-gram model and the continuous bag-of-words model, illustrated in Figure 1.

The skip-gram model's objective function is to maximize the likelihood of the prediction of contextual words given the center word. More formally, given a document of $T$ words, we wish to maximize

$$L = \frac{1}{T} \sum_{t=1}^{T} \sum_{-c \le j \le c,\ j \ne 0} \log p(w_{t+j} \mid w_t) \qquad (1)$$

where $c$ is a hyperparameter defining the window of context words. To obtain the output probability $p(w_o \mid w_i)$, the model estimates a matrix $O \in \mathbb{R}^{|V| \times d_w}$, which maps the embedding $r_{w_i}$ into a $|V|$-dimensional vector $o_{w_i}$. Then, the probability of predicting the word $w_o$ given the word $w_i$ is defined as:

$$p(w_o \mid w_i) = \frac{e^{o_{w_i}(w_o)}}{\sum_{w \in V} e^{o_{w_i}(w)}} \qquad (2)$$

This is referred to as the softmax objective. However, for larger vocabularies it is inefficient to compute $o_{w_i}$, since this requires a $|V| \times d_w$ matrix multiplication. Solutions for this problem are addressed in Word2Vec by using the hierarchical softmax objective function or by resorting to negative sampling (Goldberg and Levy, 2014).

The CBOW model predicts the center word $w_0$ given a representation of the surrounding words $w_{-c}, \dots, w_{-1}, w_1, \dots, w_c$. Thus, the output vector $o_{w_{-c}, \dots, w_{-1}, w_1, \dots, w_c}$ is obtained from the product of the matrix $O \in \mathbb{R}^{|V| \times d_w}$ with the sum of the embeddings of the context words, $\sum_{-c \le j \le c,\ j \ne 0} r_{w_j}$.

We can observe that in both methods, the order of the context words does not influence the prediction output. As such, while these methods may find similar representations for semantically similar words, they are less likely to capture representations based on the syntactic properties of the words.

[Figure 1: Illustration of the Skip-gram and Continuous Bag-of-Word (CBOW) models.]
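The prediction step of both original models comes down to a few lines of linear algebra. The following is a minimal numpy sketch of the forward (prediction) pass only, using a full softmax for clarity; the actual Word2Vec tool is written in C and trains with hierarchical softmax or negative sampling, and the names here (R, O, V, d, c) are illustrative assumptions rather than the tool's API.

```python
import numpy as np

# Minimal sketch of the forward (prediction) step of the two original models.
# V, d, c and the matrices R and O are illustrative assumptions; the real tool
# learns them with negative sampling or hierarchical softmax, not a full softmax.
V, d, c = 10000, 100, 5                       # vocabulary size, dimension, window
rng = np.random.default_rng(0)
R = rng.normal(scale=0.1, size=(V, d))        # input (projection) embeddings r_w
O = rng.normal(scale=0.1, size=(V, d))        # single output matrix O

def softmax(z):
    z = z - z.max()                           # numerical stability
    e = np.exp(z)
    return e / e.sum()

def skipgram_predict(center_id):
    """p(w_o | w_i): one distribution over V, reused for every context position."""
    return softmax(O @ R[center_id])          # |V|-dimensional vector o_{w_i}

def cbow_predict(context_ids):
    """p(w_0 | context): context embeddings are summed, so word order is lost."""
    h = R[context_ids].sum(axis=0)            # order-insensitive context vector
    return softmax(O @ h)

# Any permutation of the context yields the same CBOW prediction:
print(np.allclose(cbow_predict([3, 7, 11, 42]), cbow_predict([42, 11, 7, 3])))  # True
```

The last line makes the order-insensitivity concrete: permuting the context does not change the prediction, which is exactly the behaviour the modifications in the next section set out to remove.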
3 Structured Word2Vec

To account for the lack of order-dependence in the above models, we propose two simple modifications to these methods that include ordering information, which we expect will lead to more syntactically-oriented embeddings. These models are illustrated in Figure 2.

3.1 Structured Skip-gram Model

The skip-gram model uses a single output matrix $O \in \mathbb{R}^{|V| \times d}$ to predict every contextual word $w_{-c}, \dots, w_{-1}, w_1, \dots, w_c$, given the embedding of the center word $w_0$. Our approach adapts the model so that it is sensitive to the positioning of the words. It defines a set of $c \times 2$ output predictors $O_{-c}, \dots, O_{-1}, O_1, \dots, O_c$, each of size $|V| \times d$. Each of the output matrices is dedicated to predicting the output for a specific relative position to the center word. When making a prediction $p(w_o \mid w_i)$, we select the appropriate output matrix $O_{o-i}$ to project the word embeddings to the output vector. Note that the number of operations that must be performed for the forward and backward passes in the network remains the same, as we are simply switching the output layer $O$ for each different word index.

3.2 Continuous Window Model

The Continuous Bag-of-Words model defines a window of words $w_{-c}, \dots, w_c$ with size $c$, where the prediction of the center word $w_0$ is conditioned on the remaining words $w_{-c}, \dots, w_{-1}, w_1, \dots, w_c$. The prediction matrix $O \in \mathbb{R}^{|V| \times d}$ is fed with the sum of the embeddings of the context words. As such, the order of the contextual words does not influence the prediction of the center word. Our approach defines a different output predictor $O \in \mathbb{R}^{|V| \times 2cd}$, which receives as input a $(2c \times d)$-dimensional vector that is the concatenation of the embeddings of the context words in the order they occur, $[e(w_{-c}), \dots, e(w_{-1}), e(w_1), \dots, e(w_c)]$. As matrix $O$ defines a parameter for the word embeddings at each relative position, this allows the words to be treated differently depending on where they occur. This model, denoted as CWindow, is essentially the window-based model described in (Collobert et al., 2011). (A minimal code sketch of both modifications is given after Section 4.1 below.)

[Figure 2: Illustration of the Structured Skip-gram and Continuous Window (CWindow) models.]

4 Experiments

[…] shown that pre-trained embeddings can be used to achieve better generalization (Collobert et al., 2011; Chen and Manning, 2014).

4.1 Building Word Vectors

We built vectors for English in two very different domains. Firstly, we used an English Wikipedia dump containing 1,897 million words (60 million sentences), collected in September of 2014. We built word embeddings using the original and our proposed methods on this dataset. These embeddings will be denoted as WIKI(L). Then, we took a sample of 56 million English tweets with 847 million words collected in (Owoputi et al., 2013), and applied the same procedure to build the TWITTER embeddings.
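Both modifications change only the output layer. Below is a minimal sketch, under the same illustrative assumptions as the previous block (full softmax instead of hierarchical softmax or negative sampling, hypothetical names O_pos and O_cw), of how the structured skip-gram selects a position-specific output matrix and how CWindow concatenates rather than sums the context embeddings; it is not the authors' C implementation from the wang2vec repository.

```python
import numpy as np

# Minimal sketch of the two proposed output layers; the names (O_pos, O_cw) and the
# full-softmax prediction are illustrative assumptions, not the wang2vec C code.
V, d, c = 10000, 100, 5
rng = np.random.default_rng(0)
R = rng.normal(scale=0.1, size=(V, d))             # shared input embeddings r_w

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

# Structured skip-gram: c * 2 output matrices, one per relative position j != 0.
positions = [j for j in range(-c, c + 1) if j != 0]
O_pos = {j: rng.normal(scale=0.1, size=(V, d)) for j in positions}

def structured_skipgram_predict(center_id, rel_pos):
    """p(w_o | w_i, j): the output matrix O_j is chosen by the relative position,
    so the same center word can prefer different words before vs. after it."""
    return softmax(O_pos[rel_pos] @ R[center_id])

# CWindow: concatenate the 2c context embeddings in order instead of summing them.
O_cw = rng.normal(scale=0.1, size=(V, 2 * c * d))  # output matrix of size |V| x 2cd

def cwindow_predict(context_ids):
    """p(w_0 | context): concatenation keeps each context word tied to its slot."""
    h = np.concatenate([R[w] for w in context_ids])  # (2c*d,)-dimensional input
    assert h.shape[0] == 2 * c * d                   # expects exactly 2c context words
    return softmax(O_cw @ h)

# Example usage with hypothetical word ids:
p_right_neighbor = structured_skipgram_predict(center_id=3, rel_pos=1)
p_center = cwindow_predict(list(range(2 * c)))
```

As the paper notes, the structured skip-gram performs the same number of operations per prediction as the original model, since only the matrix used is switched by relative position, while CWindow's output matrix grows from $|V| \times d$ to $|V| \times 2cd$ because its input is the ordered concatenation of the context embeddings.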