Lecture 4: Static Word Embeddings

Total Pages: 16

File Type: PDF, Size: 1020 KB

CS546: Machine Learning in NLP (Spring 2020)
http://courses.engr.illinois.edu/cs546/
Julia Hockenmaier, [email protected]
3324 Siebel Center
Office hours: Monday, 11am-12:30pm

(Static) Word Embeddings

A (static) word embedding is a function that maps each word type to a single vector.
- These vectors are typically dense and have much lower dimensionality than the size of the vocabulary.
- This mapping function typically ignores that the same string of letters may have different senses (dining table vs. a table of contents) or parts of speech (to table a motion vs. a table).
- This mapping function typically assumes a fixed-size vocabulary (so an UNK token is still required).

Word2Vec Embeddings

Main idea: use a classifier to predict which words appear in the context of (i.e. near) a target word (or vice versa). This classifier induces a dense vector representation of words (an embedding). Words that appear in similar contexts (that have high distributional similarity) will have very similar vector representations. These models can be trained on large amounts of raw text (and pre-trained embeddings can be downloaded).

Word2Vec (Mikolov et al. 2013)

The first really influential dense word embeddings.

Two ways to think about Word2Vec:
- a simplification of neural language models
- a binary logistic regression classifier

Variants of Word2Vec:
- two different context representations: CBOW or Skip-Gram
- two different optimization objectives: negative sampling (NS) or hierarchical softmax
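Before turning to the Word2Vec architectures, here is a minimal sketch of what the "(Static) Word Embeddings" slide describes: a fixed-vocabulary lookup table from word types to dense vectors, with an UNK fallback. The class name, vocabulary, dimensionality, and random initialization below are illustrative assumptions, not part of the lecture.

```python
# Minimal sketch (not from the lecture) of a static word embedding: a fixed-vocabulary
# lookup table that maps every word *type* to a single dense vector. The class name,
# vocabulary, dimensionality, and random initialization are illustrative assumptions.
import numpy as np

class StaticEmbedding:
    def __init__(self, vocab, dim=100, seed=0):
        rng = np.random.default_rng(seed)
        self.word2id = {w: i for i, w in enumerate(vocab)}
        self.unk_id = self.word2id["<UNK>"]          # fixed vocabulary, so UNK is still required
        self.vectors = rng.normal(scale=0.1, size=(len(vocab), dim))

    def __call__(self, word):
        # Every occurrence of "table" gets the same vector, whatever its sense or part of speech.
        return self.vectors[self.word2id.get(word, self.unk_id)]

emb = StaticEmbedding(["<UNK>", "table", "apricot", "jam"], dim=4)
print(emb("table").shape)   # (4,)
print(emb("motion"))        # out-of-vocabulary word falls back to the <UNK> vector
```

In practice the vectors would of course come from training (e.g. Word2Vec, as below) or from a downloaded pre-trained model rather than from random initialization.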
Word2Vec architectures

[Diagram (Figure 1 from Mikolov et al. 2013): CBOW and Skip-gram, each drawn with INPUT, PROJECTION, and OUTPUT layers over the words w(t-2), w(t-1), w(t), w(t+1), w(t+2); in CBOW the context projections are summed.]

Figure 1: New model architectures. The CBOW architecture predicts the current word based on the context, and the Skip-gram predicts surrounding words given the current word.

Text from the paper surrounding Figure 1, as extracted from the slide:

"... R words from the future of the current word as correct labels. This will require us to do R × 2 word classifications, with the current word as input, and each of the R + R words as output. In the following experiments, we use C = 10.

4 Results

To compare the quality of different versions of word vectors, previous papers typically use a table showing example words and their most similar words, and understand them intuitively. Although it is easy to show that word France is similar to Italy and perhaps some other countries, it is much more challenging when subjecting those vectors in a more complex similarity task, as follows. We follow previous observation that there can be many different types of similarities between words, for example, word big is similar to bigger in the same sense that small is similar to smaller. Example of another type of relationship can be word pairs big - biggest and small - smallest [20]. We further denote two pairs of words with the same relationship as a question, as we can ask: 'What is the word that is similar to small in the same sense as biggest is similar to big?'

Somewhat surprisingly, these questions can be answered by performing simple algebraic operations with the vector representation of words. To find a word that is similar to small in the same sense as biggest is similar to big, we can simply compute vector X = vector('biggest') - vector('big') + vector('small'). Then, we search in the vector space for the word closest to X measured by cosine distance, and use it as the answer to the question (we discard the input question words during this search). When the word vectors are well trained, it is possible to find the correct answer (word smallest) using this method.

Finally, we found that when we train high dimensional word vectors on a large amount of data, the resulting vectors can be used to answer very subtle semantic relationships between words, such as a city and the country it belongs to, e.g. France is to Paris as Germany is to Berlin. Word vectors with such semantic relationships could be used to improve many existing NLP applications, such as machine translation, information retrieval and question answering systems, and may enable other future applications yet to be invented."
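The analogy procedure described in the excerpt translates directly into a few lines of code. This is a sketch with hand-made two-dimensional vectors (purely illustrative, not trained embeddings); with well-trained Word2Vec vectors the same search recovers "smallest".

```python
# Sketch of the analogy computation described above: X = vector("biggest") - vector("big")
# + vector("small"), then return the vocabulary word closest to X by cosine similarity,
# discarding the three query words. The tiny 2-d vectors are made up for illustration.
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def analogy(a, b, c, vectors):
    """Answer 'a is to b as c is to ?' via vectors[b] - vectors[a] + vectors[c]."""
    x = vectors[b] - vectors[a] + vectors[c]
    candidates = (w for w in vectors if w not in {a, b, c})   # discard the input question words
    return max(candidates, key=lambda w: cosine(vectors[w], x))

toy = {
    "big":      np.array([ 1.0,  0.1]),
    "biggest":  np.array([ 1.0,  0.9]),
    "small":    np.array([-1.0,  0.1]),
    "smallest": np.array([-1.0,  0.9]),
    "lemon":    np.array([ 0.2, -0.5]),
}
print(analogy("big", "biggest", "small", toy))   # -> 'smallest'
```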
CBOW: predict target from context
(CBOW = Continuous Bag Of Words)

Training sentence: ... lemon, a tablespoon of apricot jam a pinch ...
(c1 = tablespoon, c2 = of, t = apricot, c3 = jam, c4 = a)

Given the surrounding context words (tablespoon, of, jam, a), predict the target word (apricot).
Input: each context word is a one-hot vector.
Projection layer: map each one-hot vector down to a dense D-dimensional vector, and average these vectors.
Output: predict the target word with softmax.

Skipgram: predict context from target

Training sentence: ... lemon, a tablespoon of apricot jam a pinch ...
(c1 = tablespoon, c2 = of, t = apricot, c3 = jam, c4 = a)

Given the target word (apricot), predict the surrounding context words (tablespoon, of, jam, a).
Input: the target word is a one-hot vector.
Projection layer: map this one-hot vector down to a dense D-dimensional vector.
Output: predict each context word with softmax.

Skipgram

[Diagram: the one-hot encoding w_i of the target word is multiplied by an input-to-hidden weight matrix and then by a hidden-to-output weight matrix to produce a score for each context word w_j, with p(w_c | w_t) ∝ exp(w_c ⋅ w_t).]

The rows in the weight matrix for the hidden layer correspond to the weights for each hidden unit.
The columns in the weight matrix from the input to the hidden layer correspond to the input vectors for each (target) word [typically, these are used as the word2vec vectors].
The rows in the weight matrix from the hidden to the output layer correspond to the output vectors for each (context) word [typically, these are ignored].

Negative sampling

Skipgram aims to optimize the average log probability of the data:

(1/T) ∑_{t=1}^{T} ∑_{−c ≤ j ≤ c, j ≠ 0} log p(w_{t+j} | w_t)
  = (1/T) ∑_{t=1}^{T} ∑_{−c ≤ j ≤ c, j ≠ 0} log [ exp(w_{t+j} ⋅ w_t) / ∑_{k=1}^{V} exp(w_k ⋅ w_t) ]

But computing the partition function ∑_{k=1}^{V} exp(w_k ⋅ w_t) is very expensive.
- This can be mitigated by hierarchical softmax (represent each w_{t+j} by a Huffman encoding, and predict the sequence of nodes in the resulting binary tree via softmax).
- Noise Contrastive Estimation is an alternative to (hierarchical) softmax that aims to distinguish actual data points w_{t+j} from noise via logistic regression.
- But we just want good word representations, so we do something simpler. Negative sampling instead aims to optimize

log σ(w_c ⋅ w_t) + ∑_{i=1}^{k} E_{w_i ∼ P(w)} [ log σ(−w_i ⋅ w_t) ],   with σ(x) = 1 / (1 + exp(−x)).

Skip-Gram training data

Training sentence: ... lemon, a tablespoon of apricot jam a pinch ...
(c1 = tablespoon, c2 = of, t = apricot, c3 = jam, c4 = a)

Training data: input/output pairs centering on apricot. Assume a +/- 2 word window (in reality: use +/- 10 words).

Positive examples: (apricot, tablespoon), (apricot, of), (apricot, jam), (apricot, a)

For each positive example, sample k negative examples, using noise words (according to [adjusted] unigram frequencies).
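A short sketch tying the last two slides together: it generates the positive (target, context) pairs from the example sentence with a +/- 2 word window, samples k noise words per pair, and evaluates the negative-sampling objective for one pair. The 0.75 exponent on the unigram counts (the usual word2vec "adjusted" distribution), k = 2, and the random toy embeddings are assumptions for illustration; the slide text is cut off just where the noise distribution is introduced.

```python
# Sketch (not the lecture's code) of skip-gram training data with negative sampling.
# Assumptions for illustration: noise words drawn from unigram counts raised to the
# power 0.75, k = 2 negatives per positive pair, and small random toy embeddings.
import numpy as np
from collections import Counter

sentence = "lemon a tablespoon of apricot jam a pinch".split()

def positive_pairs(tokens, window=2):
    # All (target, context) pairs within a +/- `window` word window.
    pairs = []
    for t, target in enumerate(tokens):
        for c in range(max(0, t - window), min(len(tokens), t + window + 1)):
            if c != t:
                pairs.append((target, tokens[c]))
    return pairs

def noise_distribution(tokens, power=0.75):
    # Adjusted unigram distribution: P(w) proportional to count(w) ** power.
    counts = Counter(tokens)
    words = list(counts)
    weights = np.array([counts[w] for w in words], dtype=float) ** power
    return words, weights / weights.sum()

def sgns_objective(w_t, w_c, w_negs):
    # log sigma(w_c . w_t) + sum_i log sigma(-w_i . w_t), to be maximized.
    sigma = lambda x: 1.0 / (1.0 + np.exp(-x))
    return np.log(sigma(w_c @ w_t)) + sum(np.log(sigma(-w_i @ w_t)) for w_i in w_negs)

rng = np.random.default_rng(0)
words, probs = noise_distribution(sentence)
emb = {w: rng.normal(scale=0.1, size=10) for w in words}   # toy vectors for this demo

for target, context in positive_pairs(sentence, window=2):
    if target != "apricot":
        continue
    negatives = list(rng.choice(words, size=2, p=probs))   # k = 2 noise words
    score = sgns_objective(emb[target], emb[context], [emb[n] for n in negatives])
    print((target, context), "negatives:", negatives, "objective:", round(float(score), 3))
```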
Recommended publications
  • Backpropagation and Deep Learning in the Brain
  • Implementing Machine Learning & Neural Network Chip Architectures Using Network-on-Chip Interconnect IP
  • Semantic Relatedness and Taxonomic Word Embeddings (arXiv:2002.06235v1)
  • PredRNN: Recurrent Neural Networks for Predictive Learning Using Spatiotemporal LSTMs
  • Automatic Correction of Real-Word Errors in Spanish Clinical Texts
  • AutoExtend: Combining Word Embeddings with Semantic Resources
  • Word-Like Character N-Gram Embedding
  • ACL 2019 Social Media Mining for Health Applications (#SMM4H)
  • Mapping the Hyponymy Relation of WordNet onto Vector Spaces
  • Pairwise FastText Classifier for Entity Disambiguation
  • Learned in Speech Recognition: Contextual Acoustic Word Embeddings
  • FastText.zip: Compressing Text Classification Models