Latent Dirichlet Allocation, Explained and Improved Upon for Applications in Marketing Intelligence

Total pages: 16

File type: PDF, size: 1,020 KB

Latent Dirichlet Allocation, explained and improved upon for applications in marketing intelligence

by Iris Koks, to obtain the degree of Master of Science at the Delft University of Technology (Technische Universiteit Delft), to be defended publicly on Friday, March 22, 2019 at 2:00 PM.

Student number: 4299981
Project duration: August 2018 – March 2019
Thesis committee: Prof. dr. ir. Geurt Jongbloed, TU Delft, supervisor; Dr. Dorota Kurowicka, TU Delft; Drs. Jan Willem Bikker PDEng, CQM

An electronic version of this thesis is available at http://repository.tudelft.nl/.

"La science, mon garçon, est faite d'erreurs, mais ce sont des erreurs qu'il est utile de faire, parce qu'elles conduisent peu à peu à la vérité." ("Science, my boy, is made of errors, but of errors that it is useful to make, because little by little they lead to the truth.") - Jules Verne, in "Voyage au centre de la Terre"

Abstract

In today's digital world, customers give their opinions on products they have purchased online in the form of reviews. The industry is interested in these reviews and wants to know which topics its clients write about, so that producers can improve products on specific aspects. Topic models can extract the main topics from large data sets such as review data. One of these models is Latent Dirichlet Allocation (LDA), a hierarchical Bayesian topic model that retrieves topics from text data sets in an unsupervised manner. The method assumes that a topic is assigned to each word in a document (review), and aims to retrieve a topic distribution for each document and a word distribution for each topic. Using the highest-probability words from each topic-word distribution, the content of each topic can be determined, so that the main subjects can be derived.

Three inference methods for obtaining the topic and word distributions are considered in this research: Gibbs sampling, variational methods, and Adam optimization to find the posterior mode. Gibbs sampling and Adam optimization have the best theoretical foundations for their application to LDA. From results on artificial and real data sets, it is concluded that Gibbs sampling has the best performance in terms of robustness and perplexity.

When the data set consists of reviews, it is desirable to extract the sentiment (positive, neutral, negative) from the documents in addition to the topics. Therefore, an extension to LDA that uses sentiment words and sentence structure as additional input is proposed: LDA with syntax and sentiment. In this model, a topic distribution and a sentiment distribution are retrieved for each review. Furthermore, a word distribution per topic-sentiment combination can be estimated. With these distributions, the main topics and sentiments in a data set can be determined. Adam optimization is used as the inference method. The algorithm is tested on simulated data and found to work well. However, the optimization method is very sensitive to hyperparameter settings, so Gibbs sampling is expected to perform better as the inference method for LDA with syntax and sentiment; its implementation is left for further research.

Keywords: Latent Dirichlet Allocation, topic modeling, sentiment analysis, opinion mining, review analysis, hierarchical Bayesian inference
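To make the generative story described in the abstract concrete, here is a minimal simulation of basic LDA in Python. It is only an illustrative sketch: the number of topics, vocabulary size, hyperparameter values and document lengths below are arbitrary placeholder choices, not values used in the thesis.

    # Minimal sketch of the LDA generative process (illustrative sizes and priors only).
    import numpy as np

    rng = np.random.default_rng(0)

    K, V, M = 3, 50, 10                   # topics, vocabulary size, number of documents
    alpha = np.full(K, 0.1)               # hyperparameter on document-topic distributions
    beta = np.full(V, 0.01)               # hyperparameter on topic-word distributions

    phi = rng.dirichlet(beta, size=K)     # phi_k: word distribution of topic k
    theta = rng.dirichlet(alpha, size=M)  # theta_d: topic distribution of document d

    documents = []
    for d in range(M):
        N_d = rng.poisson(80)                                # document length (arbitrary choice)
        z = rng.choice(K, size=N_d, p=theta[d])              # topic assigned to each word
        w = np.array([rng.choice(V, p=phi[k]) for k in z])   # word drawn from its topic's distribution
        documents.append(w)

Inference runs in the opposite direction: given only the observed words, the goal is to recover estimates of theta_d and phi_k, which is what the Gibbs sampling, variational, and Adam-based methods compared in the thesis attempt to do.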
Preface

With this thesis, my student life comes to an end. Although I started out studying French after high school, after four years I finally came to my senses, so that now I have become a mathematician. Fortunately, my passion for languages has never completely disappeared, and it could even be incorporated in this final research project.

During the last eight months, I have been doing research and writing my master's thesis at CQM in Eindhoven. I have had a wonderful time with my colleagues there, and I would like to thank all of them for making it the interesting, enjoyable time it was. I learned a lot about industrial statistics, machine learning, consultancy and the CQM way of working. Special thanks go to my supervisors Jan Willem Bikker, Peter Stehouwer and Matthijs Tijink, for their sincere interest and helpfulness during every meeting. Next to the interesting conversations about mathematics, we also had nice talks about careers, life and personal development, which I will always remember. Although he was not my direct supervisor, Johan van Rooij helped me a lot with programming and implementation questions, and by thinking along patiently when I got stuck in some mathematical derivation, for which I am very grateful.

Furthermore, I would like to thank my other supervisor, professor Geurt Jongbloed, for his guidance throughout this project and for bringing out the best in me on a mathematical level. Naturally, I also enjoyed our talks about all aspects of life and the laughs we had. Also, I would like to thank Dorota Kurowicka for being on my graduation committee, and for her interesting lectures about copulas. Unfortunately, I could not incorporate them in this project, but maybe in my career as a (financial) risk analyst?

Then, of course, I would like to express my gratitude to Jan Frouws, for his unconditional care and support during these last eight months. He was always there to listen to me, mumbling on and on about reviews, topics, strollers and optimization methods. Without you, throughout this project, the ups would not have been as high and the downs would have been far deeper. I am really proud of how we manage and enjoy life together.

In addition, I would like to thank my parents for always supporting me during my entire educational journey. Because of them, I always liked going to school and learning, hence my broad interest in languages, economics, physics and mathematics. Also, thanks to my brother Corné, with whom I did my homework for years and had the most interesting (political) discussions.

Lastly, I would like to thank my friends whom I met in all sorts of places. With you, I have had a wonderful student life in which I started (and immediately quit) rowing and bouldering, learned to cook properly and for large groups (making sushi, pasta, ravioli, Mexican wraps, massive Heel Holland Bakt cakes...), learned about Christianity, airplanes and bitcoins, lost my 'Brabants' accent so that I became a little more like a 'Randstad' person, and gained an interest in the financial world, in which I hope to have a nice career.
Iris Koks
Eindhoven, March 2019

Nomenclature

T_n   n-dimensional closed simplex
D     Corpus, set of all documents
L     Log-likelihood
S     Number of different sentiments
σ     Sentiment index
α     Hyperparameter vector of size K with belief on the document-topic distribution
β     Hyperparameter vector of size V with belief on the topic-word distribution
γ     Hyperparameter vector of size S with belief on the document-sentiment distribution
φ_k   Word probability vector of size V for topic k
π_d   Sentiment probability vector of size S for document d
θ_d   Topic probability vector of size K for document d
C     Number of different parts-of-speech considered
c     Part-of-speech
d     Document index
H     Entropy
h     Shannon information
K     Number of topics
M     Number of documents in a data set
N_d   Number of words in document d
N_s   Number of words in phrase s
s     Phrase or sentence index
S_d   Number of phrases in document d
V     Vocabulary size, i.e. number of unique words in the data set
w     Word index
z     Topic index

Adam  Adaptive moment estimation (optimization method)
JS    Jensen-Shannon
KL    Kullback-Leibler
LDA   Latent Dirichlet Allocation
MAP   Maximum a posteriori, or posterior mode
NLP   Natural Language Processing
VBEM  Variational Bayesian Expectation Maximization

(A short sketch of how these model quantities fit together follows after the table of contents.)

Contents

List of Figures
List of Tables
1 Introduction
  1.1 Customer insights using Latent Dirichlet Allocation
  1.2 Research questions
  1.3 Thesis outline
  1.4 Note on notation
2 Theoretical background
  2.1 Bayesian statistics
  2.2 Dirichlet process and distribution
    2.2.1 Stick-breaking construction of the Dirichlet process
    2.2.2 Dirichlet distribution
  2.3 Natural language processing
  2.4 Model selection criteria
3 Latent Dirichlet Allocation
  3.1 Into the mind of the writer: generative process
  3.2 Important distributions in LDA
  3.3 Probability distribution of the words
  3.4 Improvements and adaptations to the basic LDA model
4 Inference methods for LDA
  4.1 Posterior mean
    4.1.1 Analytical determination
    4.1.2 Markov chain Monte Carlo methods
  4.2 Posterior mode
    4.2.1 General variational methods
    4.2.2 Variational Bayesian EM for LDA
5 Determination of posterior mode estimates for LDA using optimization
  5.1 LDA's posterior density
  5.2 Gradient descent
  5.3 Stochastic gradient descent
  5.4 Adam optimization
    5.4.1 Softmax transformation
    5.4.2 Regularization
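For reference, the quantities in the nomenclature fit together in the following generative model. The display below shows only the standard (basic) LDA formulation; the syntax-and-sentiment extension mentioned in the abstract additionally draws a sentiment distribution π_d from a Dirichlet(γ) prior and uses a word distribution per topic-sentiment combination, and its exact per-phrase structure is specified in the thesis itself.

    \begin{align*}
    \phi_k   &\sim \mathrm{Dirichlet}(\beta),               & k &= 1, \dots, K, \\
    \theta_d &\sim \mathrm{Dirichlet}(\alpha),              & d &= 1, \dots, M, \\
    z_{d,n}  &\sim \mathrm{Categorical}(\theta_d),          & n &= 1, \dots, N_d, \\
    w_{d,n}  &\sim \mathrm{Categorical}(\phi_{z_{d,n}}),    & n &= 1, \dots, N_d.
    \end{align*}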
Recommended publications
  • Variational Message Passing
    Journal of Machine Learning Research 6 (2005) 661–694. Submitted 2/04; Revised 11/04; Published 4/05.
    Variational Message Passing
    John Winn [email protected], Christopher M. Bishop [email protected]
    Microsoft Research Cambridge, Roger Needham Building, 7 J. J. Thomson Avenue, Cambridge CB3 0FB, U.K.
    Editor: Tommi Jaakkola
    Abstract: Bayesian inference is now widely established as one of the principal foundations for machine learning. In practice, exact inference is rarely possible, and so a variety of approximation techniques have been developed, one of the most widely used being a deterministic framework called variational inference. In this paper we introduce Variational Message Passing (VMP), a general-purpose algorithm for applying variational inference to Bayesian networks. Like belief propagation, VMP proceeds by sending messages between nodes in the network and updating posterior beliefs using local operations at each node. Each such update increases a lower bound on the log evidence (unless already at a local maximum). In contrast to belief propagation, VMP can be applied to a very general class of conjugate-exponential models because it uses a factorised variational approximation. Furthermore, by introducing additional variational parameters, VMP can be applied to models containing non-conjugate distributions. The VMP framework also allows the lower bound to be evaluated, and this can be used both for model comparison and for detection of convergence. Variational message passing has been implemented in the form of a general-purpose inference engine called VIBES ("Variational Inference for BayEsian networkS") which allows models to be specified graphically and then solved variationally without recourse to coding.
  • Bayesian Unsupervised Labeling of Web Document Clusters
    Bayesian Unsupervised Labeling of Web Document Clusters
    by Ting Liu
    A thesis presented to the University of Waterloo in fulfillment of the thesis requirement for the degree of Master of Mathematics in Computer Science. Waterloo, Ontario, Canada, 2011. © Ting Liu 2011.
    I hereby declare that I am the sole author of this thesis. This is a true copy of the thesis, including any required final revisions, as accepted by my examiners. I understand that my thesis may be made electronically available to the public.
    Abstract: Information technologies have recently led to a surge of electronic documents in the form of emails, webpages, blogs, news articles, etc. To help users decide which documents may be interesting to read, it is common practice to organize documents by categories/topics. A wide range of supervised and unsupervised learning techniques already exist for automated text classification and text clustering. However, supervised learning requires a training set of documents already labeled with topics/categories, which is not always readily available. In contrast, unsupervised learning techniques do not require labeled documents, but assigning a suitable category to each resulting cluster remains a difficult problem. The state of the art consists of extracting keywords based on word frequency (or related heuristics). In this thesis, we improve the extraction of keywords for unsupervised labeling of document clusters by designing a Bayesian approach based on topic modeling. More precisely, we describe an approach that uses a large side corpus to infer a language model that implicitly encodes the semantic relatedness of different words.
  • Bayesian Network Structure Learning for the Uncertain Experimentalist with Applications to Network Biology
    Bayesian network structure learning for the uncertain experimentalist, with applications to network biology
    by Daniel James Eaton, B.Sc., The University of British Columbia, 2005
    A thesis submitted in partial fulfilment of the requirements for the degree of Master of Science in The Faculty of Graduate Studies (Computer Science), The University of British Columbia, June 2007. © Daniel James Eaton 2007.
    Abstract: In this work, we address both the computational and modeling aspects of Bayesian network structure learning. Several recent algorithms can handle large networks by operating on the space of variable orderings, but for technical reasons they cannot compute many interesting structural features and require the use of a restrictive prior. We introduce a novel MCMC method that utilizes the deterministic output of the exact structure learning algorithm of Koivisto and Sood to construct a fast-mixing proposal on the space of DAGs. We show that in addition to fixing the order-space algorithms' shortcomings, our method outperforms other existing samplers on real datasets by delivering more accurate structure and higher predictive likelihoods in less compute time. Next, we discuss current models of intervention and propose a novel approach named the uncertain intervention model, whereby the targets of an intervention can be learned in parallel to the graph's causal structure. We validate our model experimentally using synthetic data generated from known ground truth. We then apply our model to two biological datasets that have been previously analyzed using Bayesian networks. On the T-cell dataset of Sachs et al. we show that the uncertain intervention model is able to better model the density of the data compared to previous techniques, while on the ALL dataset of Yeoh et al.
  • A Latent Dirichlet Allocation Method for Selectional Preferences
    Latent Variable Models of Lexical Semantics
    Alan Ritter [email protected]
    Agenda:
    • Topic Modeling Tutorial
    • Quick Overview of Lexical Semantics
    • Examples & Practical Tips
      – Selectional Preferences: argument types for relations extracted from the web
      – Event Type Induction: induce types for events extracted from social media
      – Weakly Supervised Named Entity Classification: classify named entities extracted from social media into types such as PERSON, LOCATION, PRODUCT, etc.
    Topic Modeling — Useful References:
    • Parameter estimation for text analysis – Gregor Heinrich – http://www.arbylon.net/publications/text-est.pdf
    • Gibbs Sampling for the Uninitiated – Philip Resnik, Eric Hardisty – http://www.cs.umd.edu/~hardisty/papers/gsfu.pdf
    • Bayesian Inference with Tears – Kevin Knight – http://www.isi.edu/natural-language/people/bayes-with-tears.pdf
    Topic Modeling: Motivation
    • Uncover latent structure
    • Lower-dimensional representation
    • Probably a good fit if you have: a large amount of grouped data, unlabeled, and you are not even sure what the labels should be
    Terminology
    • I'm going to use the terms: Documents / Words / Topics
    • Really these models apply to any kind of grouped data: Images / Pixels / Objects, Users / Friends / Groups, Music / Notes / Genres, Archeological Sites / Artifacts / Building Type
    Estimating Parameters from Text
    • Consider a document as a bag of random words: P(D | θ) = ∏_j P(w_j | θ) = ∏_j θ_{w_j} = ∏_w θ_w^{N_w}
    • How to estimate θ? By Bayes' rule, P(θ | D) = P(D | θ) P(θ) / P(D), i.e. Posterior ∝ Likelihood × Prior. (A worked version of this estimate is sketched after this list of recommended publications.)
  • Lecture Notes: Mathematics for Inference and Machine Learning
    LECTURE NOTES, IMPERIAL COLLEGE LONDON, DEPARTMENT OF COMPUTING
    Mathematics for Inference and Machine Learning
    Instructors: Marc Deisenroth, Stefanos Zafeiriou. Version: January 13, 2017
    Contents:
    1 Linear Regression
      1.1 Problem Formulation
      1.2 Probabilities
        1.2.1 Means and Covariances (Sum of Random Variables; Affine Transformation)
        1.2.2 Statistical Independence
        1.2.3 Basic Probability Distributions (Uniform, Bernoulli, Binomial, Beta, Gaussian, Gamma, Wishart)
        1.2.4 Conjugacy
      1.3 Probabilistic Graphical Models
        1.3.1 From Joint Distributions to Graphs
        1.3.2 From Graphs to Joint Distributions
        1.3.3 Further Reading
      1.4 Vector Calculus
        1.4.1 Partial Differentiation and Gradients (Jacobian; Linearization and Taylor Series)
        1.4.2 Basic Rules of Partial Differentiation
        1.4.3 Useful Identities for Computing Gradients
        1.4.4 Chain Rule
      1.5 Parameter Estimation
        1.5.1 Maximum Likelihood Estimation (Closed-Form Solution; Iterative Solution; MLE with Features; Properties)
        1.5.2 Overfitting
        1.5.3 Regularization
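As a complement to the "How to estimate θ?" question raised in the tutorial excerpt above: a standard worked example (not taken from any of the documents listed here) is the conjugate Dirichlet-multinomial update that also underlies LDA. Taking a Dirichlet(α) prior on θ for the bag-of-words likelihood gives

    \begin{align*}
    p(\theta \mid D) \;\propto\; p(D \mid \theta)\, p(\theta)
      \;\propto\; \prod_{w=1}^{V} \theta_w^{N_w} \prod_{w=1}^{V} \theta_w^{\alpha_w - 1}
      \;=\; \prod_{w=1}^{V} \theta_w^{N_w + \alpha_w - 1},
    \end{align*}

so θ | D ~ Dirichlet(α_1 + N_1, ..., α_V + N_V). The posterior mean is E[θ_w | D] = (N_w + α_w) / (N + Σ_v α_v) with N = Σ_w N_w, and the posterior mode (MAP estimate) is (N_w + α_w − 1) / (N + Σ_v α_v − V) when all N_w + α_w > 1. These are the two kinds of point estimates (posterior mean and posterior mode) that the thesis above compares for the full LDA model.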