kōan: A Corrected CBOW Implementation

Ozan İrsoy, Bloomberg LP, [email protected]
Adrian Benton, Bloomberg LP, [email protected]
Karl Stratos, Rutgers University, [email protected]

arXiv:2012.15332v1 [cs.CL] 30 Dec 2020

Abstract

It is a common belief in the NLP community that continuous bag-of-words (CBOW) word embeddings tend to underperform skip-gram (SG) embeddings. We find that this belief is founded less on theoretical differences in their training objectives and more on faulty CBOW implementations in standard software libraries such as the official implementation word2vec.c [Mikolov et al., 2013b] and Gensim [Řehůřek and Sojka, 2010]. We show that our correct implementation of CBOW yields word embeddings that are fully competitive with SG on various intrinsic and extrinsic tasks while being more than three times as fast to train. We release our implementation, kōan (a kōan is a story used in Zen practice to provoke "great doubt"), at https://github.com/bloomberg/koan.

1 Introduction

Pre-trained word embeddings are a standard way to boost performance on many tasks of interest such as named-entity recognition (NER), where contextual word embeddings such as BERT [Devlin et al., 2019] are computationally expensive and may yield only marginal gains. Word2vec [Mikolov et al., 2013a] is a popular method for learning word embeddings because of its scalability and robust performance. Word2vec offers two training objectives: (1) continuous bag-of-words (CBOW), which predicts the center word by averaging context embeddings, and (2) skip-gram (SG), which predicts context words from the center word. (In this work, we always use the negative sampling formulations of the Word2vec objectives, which are consistently more efficient and effective than the hierarchical softmax formulations.)

A common belief in the NLP community is that while CBOW is fast to train (SG requires sampling negative examples for every word in context, while CBOW samples negatives only for the target word), it lags behind SG in performance. This observation is made by the inventors of Word2vec themselves [Mikolov, 2013] and also independently by other works such as Stratos et al. [2015]. This is strange, since the two objectives lead to very similar weight updates. It is also strange given the enormous success of contextual word embeddings based on masked language modeling (MLM), as CBOW follows a rudimentary form of MLM (i.e., predicting a masked target word from a context window without any perturbation of the masked word). In fact, direct extensions of CBOW such as fastText [Bojanowski et al., 2017] and Sent2vec [Pagliardini et al., 2018] have been quite successful.

In this work, we find that the performance discrepancy between CBOW and SG embeddings is founded less on theoretical differences in their training objectives and more on faulty CBOW implementations in standard software libraries such as the official implementation word2vec.c [Mikolov et al., 2013b] and Gensim [Řehůřek and Sojka, 2010]. Specifically, we find that in these implementations the gradient for source embeddings is incorrectly multiplied by the context window size, resulting in incorrect weight updates and inferior performance. We make the following contributions.
First, we show that our correct implementation of CBOW indeed yields word embeddings that are fully competitive with SG while being trained in less than a third of the time (e.g., it takes less than 73 minutes to train strong CBOW embeddings on the entire Wikipedia corpus). We present experiments on intrinsic word similarity and analogy tasks, as well as extrinsic evaluations on the SST-2, QNLI, and MNLI tasks from the GLUE benchmark [Wang et al., 2018] and the CoNLL03 English named entity recognition task [Sang and De Meulder, 2003]. Second, we make our implementation, kōan, publicly available at https://github.com/bloomberg/koan. In addition to including a correct CBOW implementation, kōan features other software improvements such as a constant-time, linear-space negative sampling routine based on the alias method [Walker, 1974], along with multi-threaded support.

2 Implementation

The parameters of CBOW are two sets of word embeddings: "source-side" and "target-side" vectors $v_w, v'_w \in \mathbb{R}^d$ for every word type $w \in V$ in the vocabulary. A window of text in a corpus consists of a center word $w_O$ and context words $w_1, \ldots, w_C$. For instance, in the window "the dog laughed", we have $w_O = \text{dog}$, $w_1 = \text{the}$, and $w_2 = \text{laughed}$. Given a window of text, the CBOW loss is defined as:

$$v_c = \frac{1}{C} \sum_{j=1}^{C} v_{w_j}$$

$$L = -\log \sigma\big({v'_{w_O}}^{\top} v_c\big) - \sum_{i=1}^{k} \log \sigma\big(-{v'_{n_i}}^{\top} v_c\big)$$

where $n_1, \ldots, n_k \in V$ are negative examples drawn iid from some noise distribution $P_n$ over $V$. The gradients of $L$ with respect to the target ($v'_{w_O}$), negative target ($v'_{n_i}$), and average context source ($v_c$) embeddings are:

$$\frac{\partial L}{\partial v'_{w_O}} = \big(\sigma({v'_{w_O}}^{\top} v_c) - 1\big)\, v_c$$

$$\frac{\partial L}{\partial v'_{n_i}} = \sigma({v'_{n_i}}^{\top} v_c)\, v_c$$

$$\frac{\partial L}{\partial v_c} = \big(\sigma({v'_{w_O}}^{\top} v_c) - 1\big)\, v'_{w_O} + \sum_{i=1}^{k} \sigma({v'_{n_i}}^{\top} v_c)\, v'_{n_i} \qquad (1)$$

and, by the chain rule, with respect to a source context embedding:

$$\frac{\partial L}{\partial v_{w_j}} = \frac{1}{C} \left[ \big(\sigma({v'_{w_O}}^{\top} v_c) - 1\big)\, v'_{w_O} + \sum_{i=1}^{k} \sigma({v'_{n_i}}^{\top} v_c)\, v'_{n_i} \right]$$

However, the CBOW negative sampling implementations in word2vec.c (https://github.com/tmikolov/word2vec/blob/20c129af10659f7c50e86e3be406df663beff438/word2vec.c#L483) and Gensim incorrectly update each context vector by Eq. (1), without normalizing by the number of context words. (In Gensim, it appears that normalization by the number of context words is only done for the "sum variant" of CBOW: https://github.com/RaRe-Technologies/gensim/blob/master/gensim/models/word2vec_inner.pyx#L455-L456.) In fact, this error has been pointed out in several Gensim issues (https://github.com/RaRe-Technologies/gensim/issues/1873, https://github.com/RaRe-Technologies/gensim/issues/697).
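To make the faulty update concrete, here is a minimal NumPy sketch of a single CBOW negative-sampling step. It is an illustration under our own names (V_src, V_tgt, cbow_update), not the kōan, word2vec.c, or Gensim source: with correct=True the source-side update is normalized by the number of context words C, as in the chain-rule gradient above, while correct=False applies the full Eq. (1) update to every context vector, mirroring the faulty behavior just described.

```python
# Minimal sketch of one CBOW negative-sampling step (illustrative only).
# V_src, V_tgt: source- and target-side embedding matrices, one row per word.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cbow_update(V_src, V_tgt, context_ids, target_id, negative_ids, lr,
                correct=True):
    C = len(context_ids)
    v_c = V_src[context_ids].mean(axis=0)        # average context embedding

    # Accumulate dL/dv_c (Eq. 1) while updating the target-side vectors.
    grad_vc = np.zeros_like(v_c)
    for wid, label in [(target_id, 1.0)] + [(n, 0.0) for n in negative_ids]:
        g = sigmoid(V_tgt[wid] @ v_c) - label    # dL/d(score) for this pair
        grad_vc += g * V_tgt[wid]
        V_tgt[wid] -= lr * g * v_c               # target / negative-target update

    # Source-side update: the chain rule divides by C; the faulty
    # implementations omit this normalization (correct=False).
    scale = 1.0 / C if correct else 1.0
    for cid in context_ids:
        V_src[cid] -= lr * scale * grad_vc
```

With correct=False, each of the C context vectors is moved by the full gradient of their average, so the total source-side change grows with the window width; this is the effect discussed next.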
Why does this matter? Aside from being incorrect, this update matters because:

1. In both Gensim and word2vec.c, the width of the context window is randomly sampled from $1, \ldots, C_{\max}$ for every target word. This means that context embeddings which were averaged over wider context windows will experience a larger update than their contribution, relative to context embeddings averaged over narrower windows. As a consequence, the norm of the source embeddings increases with the context size (Appendix B).

2. One might think that this scaling of the source vector updates will have little effect on the resulting embeddings, as the stochastic scaling can be folded into the learning rate. However, recall that the set of parameters for the CBOW model is $\theta = [v; v']$. By scaling the update for the source but not the target embeddings, we are no longer taking a step in the direction of the negative stochastic gradient, and our computation of the stochastic gradient is no longer an unbiased estimate of the true gradient. See Appendix C for an analysis of how the incorrect gradient update relates to the correct one as a function of CBOW hyperparameters.

In addition to implementing the correct gradient for CBOW training, we also use the alias method for constant-time, linear-memory negative sampling (Appendix D); a brief sketch of this routine is given at the end of Section 3.1.

3 Experiments

We evaluated CBOW and SG embeddings, and recorded the time it took to train them, under Gensim and our implementation. Unless otherwise noted, we learn word embeddings on the entire Wikipedia corpus. The corpus was tokenized using the PTB tokenizer with default settings, and words occurring fewer than ten times were dropped. Training hyperparameters were held fixed for each set of embeddings: negative sampling rate 5, maximum context window of 5 words, 5 training epochs, and embedding dimensionality 300.

We found that the default initial learning rate in Gensim, 0.025, learned strong SG embeddings. However, we swept separately over the CBOW initial learning rate for Gensim and our implementation, considering initial learning rates in {0.025, 0.05, 0.075, 0.1} and selecting 0.025 for Gensim CBOW embeddings and 0.075 for kōan. We selected the learning rate that maximized average performance on the intrinsic evaluation tasks. We believe that Gensim requires a smaller learning rate for CBOW because a smaller learning rate is more likely to reduce the stochastic loss, since the incorrect update is still a descent direction, although it does not follow the negative stochastic gradient (Appendix C).

3.1 Training time

Figure 1 shows the time to train each set of embeddings as a function of the number of worker threads on the tokenized Wikipedia corpus.

[Figure 1: Hours to train CBOW (left) and SG (right) embeddings as a function of the number of worker threads for different Word2vec implementations.]

All implementations were benchmarked on an Amazon EC2 c5.metal instance (96 logical processors, 196 GB memory) using the Unix time command, and embeddings were trained with a minimum token frequency of 10 and a downsampling frequency threshold of $10^{-3}$. We benchmark kōan with a buffer of 100,000 sentences. We trained CBOW for five epochs and SG for a single epoch, as SG is much slower to train. Gensim seems to have trouble exploiting more worker threads when training CBOW, and both word2vec.c and kōan learn embeddings faster.
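To illustrate the constant-time negative sampling routine mentioned in Section 2, here is a minimal Python sketch of the alias method. It is a generic Walker/Vose construction over a word2vec-style unigram^0.75 noise distribution with hypothetical toy counts, not the kōan C++ code: the tables are built once in time and space linear in the vocabulary, and each draw afterwards takes constant time.

```python
# Minimal sketch of Walker's alias method for O(1) negative sampling
# (illustrative; not the koan implementation).
import random

def build_alias_table(weights):
    """Precompute prob/alias tables in O(n) time and O(n) space."""
    n = len(weights)
    total = sum(weights)
    scaled = [w * n / total for w in weights]
    prob, alias = [0.0] * n, [0] * n
    small = [i for i, p in enumerate(scaled) if p < 1.0]
    large = [i for i, p in enumerate(scaled) if p >= 1.0]
    while small and large:
        s, l = small.pop(), large.pop()
        prob[s], alias[s] = scaled[s], l          # s keeps part of its mass
        scaled[l] -= 1.0 - scaled[s]              # l donates the rest
        (small if scaled[l] < 1.0 else large).append(l)
    for i in small + large:                       # leftovers are ~1.0 full
        prob[i] = 1.0
    return prob, alias

def sample(prob, alias):
    """Draw one index in O(1): pick a bucket, then flip a biased coin."""
    i = random.randrange(len(prob))
    return i if random.random() < prob[i] else alias[i]

# Example: hypothetical unigram counts raised to the 0.75 power.
counts = {"the": 1000, "dog": 50, "laughed": 5}
vocab = list(counts)
prob, alias = build_alias_table([counts[w] ** 0.75 for w in vocab])
negatives = [vocab[sample(prob, alias)] for _ in range(5)]
```

Because each negative draw is O(1) after the one-time O(|V|) setup, negative sampling remains cheap per update even for very large vocabularies.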