Latent Dirichlet Allocation, explained and improved upon for applications in marketing intelligence

Iris Koks Technische Universiteit Delft

Latent Dirichlet Allocation, explained and improved upon for applications in marketing intelligence

by

Iris Koks

to obtain the degree of Master of Science at the Delft University of Technology, to be defended publicly on Friday March 22, 2019 at 2:00 PM.

Student number: 4299981
Project duration: August 2018 – March 2019
Thesis committee: Prof. dr. ir. Geurt Jongbloed, TU Delft, supervisor
Dr. Dorota Kurowicka, TU Delft
Drs. Jan Willem Bikker PDEng, CQM

An electronic version of this thesis is available at http://repository.tudelft.nl/.

"La science, mon garçon, est faite d’erreurs, mais ce sont des erreurs qu’il est utile de faire, parce qu’elles conduisent peu à peu à la vérité." - Jules Verne, in "Voyage au centre de la Terre"

Abstract

In today's digital world, customers give their opinions on products they have purchased online in the form of reviews. Industry is interested in these reviews and wants to know which topics their clients write about, so that producers can improve products on specific aspects. Topic models can extract the main topics from large data sets such as review data. One of these is Latent Dirichlet Allocation (LDA). LDA is a hierarchical Bayesian topic model that retrieves topics from text data sets in an unsupervised manner. The method assumes that a topic is assigned to each word in a document (review), and aims to retrieve the topic distribution for each document and a word distribution for each topic. Using the highest-probability words from each topic-word distribution, the content of each topic can be determined, such that the main subjects can be derived. Three methods of inference to obtain the topic and word distributions are considered in this research: Gibbs sampling, variational methods, and Adam optimization to find the posterior mode. Gibbs sampling and Adam optimization have the best theoretical foundations for their application to LDA. From results on artificial and real data sets, it is concluded that Gibbs sampling has the best performance in terms of robustness and perplexity.

When the data set consists of reviews, it is desirable to extract the sentiment (positive, neutral, negative) from the documents, in addition to the topics. Therefore, an extension to LDA that uses sentiment words and sentence structure as additional input is proposed: LDA with syntax and sentiment. In this model, a topic distribution and a sentiment distribution for each review are retrieved. Furthermore, a word distribution per topic-sentiment combination can be estimated. With these distributions, the main topics and sentiments in a data set can be determined. Adam optimization is used as the inference method. The algorithm is tested on simulated data and found to work well. However, the optimization method is very sensitive to hyperparameter settings, so it is expected that Gibbs sampling as the inference method for LDA with syntax and sentiment performs better. Its implementation is left for further research.

Keywords: Latent Dirichlet Allocation, topic modeling, sentiment analysis, opinion mining, review analysis, Hierarchical


Preface

With this thesis, my student life comes to an end. Although I started with studying French after high school, after 4 years I finally came to my senses, such that now I have become a mathematician. Fortunately, my passion for languages has never completely disappeared, and it could even be incorporated in this final research project.

During the last 8 months, I have been doing research and writing my master thesis at CQM in Eindhoven. I have had a wonderful time with my colleagues there, and I would like to thank all of them for making it the interesting, nice time it was. I learned a lot about industrial statistics, machine learning, consultancy and the CQM way of working. Special thanks go to my supervisors Jan Willem Bikker, Peter Stehouwer and Matthijs Tijink, for their sincere interest and helpfulness during every meeting. Next to the interesting conversations about mathematics, we also had nice talks about careers, life and personal development, which I will always remember. Although he was not my direct supervisor, Johan van Rooij helped me a lot with programming and implementation questions, and by thinking along patiently when I got stuck in some mathematical derivation, for which I am very grateful.

Furthermore, I would like to thank my other supervisor, professor Geurt Jongbloed, for his guidance through this project and for bringing out the best in me on a mathematical level. Naturally, I also enjoyed our talks about all aspects of life and the laughs we had. Also, I would like to thank Dorota Kurowicka for being on my graduation committee, and for her interesting lectures about copulas. Unfortunately, I could not incorporate them in this project, but maybe in my career as (financial) risk analyst?

Then, of course, I would like to express my gratitude to Jan Frouws, for his unconditional care and support during these last 8 months. He was always there to listen to me, mumbling on and on about reviews, topics, strollers and optimization methods. Without you, throughout this project, the ups would not have been that high and the downs would have been way deeper. I am really proud of how we manage and enjoy life together.

In addition, I would like to thank my parents for always supporting me during my entire educational journey. Because of them, I always liked going to school and learning, hence my broad interests in languages, economics, physics and mathematics. Also, thanks to my brother Corné, with whom I did my homework together for years and had the most interesting (political) discussions.

Lastly, I would like to thank the friends I met in all different places. With you, I have had a wonderful student life in which I started (and immediately quit) rowing and bouldering, learned to cook properly and for large groups (making sushi, pasta, ravioli, Mexican wraps, massive Heel Holland Bakt cakes...), learned about Christianity, airplanes and bitcoins, lost my 'Brabants' accent such that I became a little more like a 'Randstad' person, and gained interest in the financial world, in which I hope to have a nice career.

Iris Koks
Eindhoven, March 2019


Nomenclature

T_n   n-dimensional closed simplex
D   Corpus, set of all documents
L   Log likelihood
Σ   Number of different sentiments
σ   Sentiment index
α   Hyperparameter vector of size K with belief on document-topic distribution
β   Hyperparameter vector of size V with belief on topic-word distribution
γ   Hyperparameter vector of size Σ with belief on document-sentiment distribution
φ_k   Word probability vector of size V for topic k
π_d   Sentiment probability vector of size Σ for document d
θ_d   Topic probability vector of size K for document d
C   Number of different parts-of-speech considered
c   Part-of-speech
d   Document index
H   Entropy
h   Shannon information
K   Number of topics
M   Number of documents in a data set
N_d   Number of words in document d
N_s   Number of words in phrase s
s   Phrase or sentence index
S_d   Number of phrases in document d
V   Vocabulary size, i.e. number of unique words in data set
w   Word index
z   Topic index
Adam   Adaptive moment estimation (optimization method)
JS   Jensen-Shannon
KL   Kullback-Leibler
LDA   Latent Dirichlet Allocation
MAP   Maximum a posteriori or posterior mode
NLP   Natural Language Processing
VBEM   Variational Bayesian Expectation Maximization


Contents

List of Figures xiii
List of Tables xv
1 Introduction 1
1.1 Customer insights using Latent Dirichlet Allocation 1
1.2 Research questions 2
1.3 Thesis outline 2
1.4 Note on notation 4
2 Theoretical background 5
2.1 Bayesian statistics 5
2.2 Dirichlet process and distribution 7
2.2.1 Stick-breaking construction of Dirichlet process 8
2.2.2 Dirichlet distribution 8
2.3 Natural language processing 13
2.4 Model selection criteria 14
3 Latent Dirichlet Allocation 17
3.1 Into the mind of the writer: generative process 18
3.2 Important distributions in LDA 21
3.3 Probability distribution of the words 21
3.4 Improvements and adaptations to basic LDA model 24
4 Inference methods for LDA 27
4.1 Posterior mean 29
4.1.1 Analytical determination 29
4.1.2 Markov chain Monte Carlo methods 30
4.2 Posterior mode 38
4.2.1 General variational methods 38
4.2.2 Variational Bayesian EM for LDA 43
5 Determination posterior mode estimates for LDA using optimization 47
5.1 LDA's posterior density 47
5.2 Gradient descent 49
5.3 Stochastic gradient descent 50
5.4 Adam optimization 51
5.4.1 Softmax transformation 54
5.4.2 Regularization 55
6 LDA with syntax and sentiment 57
6.1 Into the more complicated mind of the writer: generative process 57
6.2 Practical choices in phrase detection 60
6.3 Estimating the variables of interest 61
6.3.1 Posterior mode: optimization 63
6.3.2 Posterior mean: Gibbs sampling 64
7 Validity of topic-word distribution estimates 69
7.1 Normalized symmetric KL-divergence 69
7.2 Symmetrized Jensen-Shannon divergence 70


8 Results 71
8.1 Posterior density visualization of LDA 71
8.1.1 Influence of the hyperparameters in LDA 71
8.1.2 VBEM's posterior density approximation 73
8.2 LDA: different methods of inference 74
8.2.1 Small data set: Cats and Dogs 75
8.2.2 Towards more realistic analyses: stroller data 79
8.3 LDA with syntax and sentiment: is this the future? 86
8.3.1 Model testing on gibberish 86
8.3.2 Stroller data set: deep dive on a single topic 91
9 Discussion 93
9.1 Latent Dirichlet Allocation 93
9.1.1 Assumptions 93
9.1.2 Linguistics 94
9.1.3 Influence of the prior 94
9.1.4 Model selection measures 95
9.2 LDA with syntax and sentiment 96
10 Conclusion and recommendations 97
10.1 Main findings 97
10.2 Further research 97
A Mathematical background and derivations 99
A.1 Functional derivative and Euler-Lagrange equation 99
A.2 Expectation of logarithm of Beta distributed random variable 102
A.3 LDA posterior mean determination 103
B Results and data sets 107
B.1 Stroller topic-word distributions 107
B.2 Top stroller reviews topic 5 112
B.3 Data sets 114
B.4 Conjunction word list 117
B.5 Stop word list 118
B.6 Sentiment word lists 120
Bibliography 139

List of Figures

1.1 Overview of methods of inference to estimate the model parameters of Latent Dirichlet Allocation. The posterior distribution can either be calculated analytically, after which the posterior mode can be found through optimization, or it can be approximated using Gibbs sampling or Variational Bayes methods. The latter methods are discussed in chapter 4 of this thesis, while the first inference method is described in chapter 5, as it is one of the main contributions of this research. 3

2.1 Bayesian statistics: A prior density is imposed on the random parameter Θ and results in a posterior density of Θ given the observed data. The three possible estimators for Θ are given in figure 2.1b. 7
2.2 Three-dimensional simplex T_3(1). On the gray plane lie the values of X_1, X_2 and X_3 for X ∈ T_3(1), such that the sum of X_1, X_2 and X_3 is equal to 1. 11
2.3 100 samples from the Dir(α) distribution with the same (α)_i for i ∈ {1,2,3}. 12
2.4 100 samples from asymmetric Dirichlet distributions. The axis of X_1 is on the left, that of X_3 on the right, and that of X_2 below the triangle. 13
2.5 Data preprocessing workflow in KNIME. 13

3.1 Plate notation of Latent Dirichlet Allocation as visualized in [25]. The hyperparameters α and β are denoted with a dotted circle. Θ represents the document-topic distribution, and Φ is the topic-word distribution. Random variable Z is the topic that has one-to-one correspondence with word W. 20

4.1 Posterior densities of Θ_d and 1 − Θ_d, i.e. the probabilities of respectively topic 1 and topic 2 for document d. 28
4.2 Visualization of the posterior densities p(θ_1, θ_2, φ_1, φ_2 | w_1, w_2, α, β) with θ_1 and θ_2 fixed at the value of a mode. The maximum of the 4-dimensional posterior density is found using grid search. The two documents consist of only two possible words, for ease of notation denoted by 1 and 2. Document 1 is then '2 2 1 1 1' and document 2 consists of '1 1 1 1 1 1 2'. Hyperparameters α and β are set to respectively 0.9 and 1.1 and are symmetric, i.e. the same for all dimensions. 29
4.3 Schematic overview of an example of variational methods. There are n instances of X and Y, where X depends on the parameter θ. X is a latent random variable, θ is a fixed parameter and y is an observed random variable. 38
4.4 Schematic overview of an example of variational methods in a Bayesian setting. There are n instances of X and Y, where X depends on the parameter Θ. X is a latent random variable, Θ is a random variable depending on fixed hyperparameter α, and y is an observed random variable. 40

5.1 Posterior density p(θ_1, φ_1, φ_2 | w_1, w_2, w_3, α, β) for two fixed values of θ_1 and hyperparameters α = (0.9, 0.9) and β = (1, 1). 48
5.2 Basic gradient descent algorithm visualized. Every step is made in the direction of the steepest descent. In this one-dimensional case, one can only move upwards or downwards. Therefore, this is an easy minimization problem. The (local) minimum is reached, and no more steps can be made downwards. 49
5.3 Example of finding the minimum of an ellipse-like hill. From the red dot, the gradient in the x-direction is smaller than in the y-direction, so the algorithm makes a larger step in the x-direction than in the y-direction. 53
5.4 Contour plot of f(x, y) = (x − 1)^4 + 0.5 y^4. The minimum of f is located at (x, y) = (1, 0). 53
5.5 Parameters used in Adam optimization to find the location of the minimum of f(x, y) = (x − 1)^4 + 0.5 y^4. The step size is the size of the change in each iteration, that is: x_{n+1} − x_n = −[step size] for each iteration n. If the step size is negative, the algorithm walks forwards. 54


6.1 Plate notation of the extension of LDA specifically designed for review studies, also called 'LDA with syntax and sentiment'. Each rectangle represents a repetitive action with, in the bottom right corner, the number of times the action (e.g. a draw from a distribution) is executed. 60

8.1 Posterior densities for different settings of the symmetric hyperparameters α and β. There are three documents: w_1 = [1111], w_2 = [121122] and w_3 = [12222222]. The vocabulary size is V = 2 and the number of topics is K = 2. Because the posterior density for this case has 5 parameters and is therefore six-dimensional, the surface plots actually show the joint posterior density of Φ_1 and Φ_2 conditional on the words, hyperparameters and the values of Θ_1, Θ_2 and Θ_3. For the θ_d (with d = 1, 2, 3), the optimal values are taken, that is, the values for each θ for which the posterior density is maximal. Note that due to the coarse grid, each value has a rounding error. 72
8.2 Surface plot of the joint posterior density of Φ_1, Φ_2 conditional on Θ_1 = 0.5, Θ_2 = 0.8, Θ_3 = 0.25. Hyperparameters are set to α = 1.1 and β = 1. The number of topics is K = 2, the vocabulary size is V = 2 and there are 3 documents: M = 3. The data consists of three documents: w_1 = [1,2], w_2 = [1,1,1,1,2] and w_3 = [1,2,2,2,1,2,2,2]. Also the approximation of the conditional posterior density, q, is shown, and it can be seen that their maxima lie at different locations, resulting in different posterior modes. Note that for the sake of comparison, all values are normalized such that the maximal value equals 1 for both surfaces. 74
8.3 Word probabilities of the top 20 words of topic 5. This φ̂_5 topic-word distribution is estimated using Adam optimization with (α)_i = 1.1 and (β)_i = 1.1, and with random initialization. 83

List of Tables

3.1 Overview of parameters and observational data in Latent Dirichlet Allocation. 18
3.2 Random variables and constants used in Latent Dirichlet Allocation. 20

6.1 (Random) variables used in Latent Dirichlet Allocation with syntax and sentiment. 59

8.1 'Cats and dogs' data set. Simple, small data set with distinctive topic clusters by construction. Documents in gray belong to the 'cat' topic, while those in white are part of the 'dog' topic. 75
8.2 Estimates of the topic-word distributions Φ_1 and Φ_2, based on the intuitive construction of topics by reading the documents. The probabilities are calculated using relative frequencies. 76
8.3 Optimization results for φ̂_1 and φ̂_2 with symmetric (α)_i = 0.99 and symmetric (β)_i = 1. The Adam gradient descent algorithm is used with a learning rate of 0.001, a stopping criterion of 10^{-4} for the relative change and a maximum of 20,000 iterations. Regularization as described in 5.4.2 is applied. 76
8.4 Variational Bayesian EM results for Φ_1 and Φ_2 with symmetric (α)_i = 0.99 and symmetric (β)_i = 1. 77
8.5 Gibbs sampling results from KNIME for Φ_1 and Φ_2 with symmetric (α)_i = 0.99 and symmetric (β)_i = 1. 1000 iterations are executed on 8 different threads. 77
8.6 Combinations of hyperparameters α and β, both taken as symmetric vectors, for which the optimization results are satisfactory concerning estimations of θ and φ. 78
8.7 Overview of results for LDA on stroller data using different inference methods. Adam optimization and Variational Bayesian Expectation Maximization are used to determine the posterior mode, while the KNIME implementation estimates the model parameters via Gibbs sampling. Two different initialization methods are used for the optimization algorithm: random initialization and taking the estimates of VBEM as the initial value. Besides the model validation scores, it is indicated whether, for either optimization method, the maximum number of iterations is reached. The optimization method is truncated after 100/(learning rate) iterations. 80
8.8 Perplexity and log posterior scores for LDA applied to 2000 reviews about strollers. The hyperparameters are set to (α)_i = 0.999 and (β)_i = 0.999. There are ten topics to be found, thus K = 10. The Adam optimization algorithm is used with a stopping criterion threshold of 10^{-3} and random initialization. The same split in training and test set is used to avoid measuring multiple effects at the same time. The only variance comes from the initialization and the steps taken in the algorithm. 82
8.9 Perplexity and log posterior scores for LDA applied to 2000 reviews about strollers. The hyperparameters are set to (α)_i = 0.999 and (β)_i = 0.999. There are ten topics to be found, thus K = 10. The KNIME Topic Extractor is used with different seeds. The training and test set are kept the same throughout the analysis. 82
8.10 Perplexity and log posterior scores for LDA applied to 2000 reviews about strollers. The hyperparameters are set to (α)_i = 0.99 and (β)_i = 0.99. There are 10 topics to be found, thus K = 10. The Adam optimization algorithm is used with a threshold of 10^{-3} and random initialization. To avoid measuring multiple effects at the same time, the random initialization of Adam optimization is fixed using a seed. The training and test set division is now the only random effect, and it is checked that, for each run, the training and test sets are different. 83
8.11 Normalized KL-divergence scores between all estimated topic-word distributions φ̂_1,...,φ̂_K, which are determined for the stroller data using Adam optimization with symmetric (α)_i = 1.1 and (β)_i = 1.1, and setting the number of topics K = 10. 84
8.12 Normalized symmetric JS-divergence similarity scores between all estimated topic-word distributions φ̂_1,...,φ̂_K, which are determined for the stroller data using Adam optimization with symmetric (α)_i = 1.1 and (β)_i = 1.1, and setting the number of topics K = 10. 85
8.13 Simulated document-topic distributions that are independently drawn from a Dirichlet(0.5, 0.5) distribution. 87


8.14 Posterior mode estimates of the document-topic distributions of the simulated data, determined with Adam optimization in which the hyperparameter settings are (α)_i = 1.8, (γ)_i = 1.1, and β_o for o = 1,2,3 constructed as described above. A learning rate of 0.0001 and random initialization have been used. 87
8.15 Simulated document-sentiment distributions that are independently drawn from a Dirichlet(0.5, 0.5, 0.5) distribution. 88
8.16 Posterior mode estimates of the document-sentiment distributions of the simulated data, determined with Adam optimization in which the hyperparameter settings are (α)_i = 1.8, (γ)_i = 1.1, and β_o for o = 1,2,3 constructed as described above. A learning rate of 0.0001 and random initialization have been used. 88
8.17 Simulated topic-sentiment-word distributions that are independently drawn from Dirichlet(β_o) distributions corresponding to their sentiment. Hyperparameters β_o are constructed by dividing the probability mass over the corresponding sentiment words and the rest of the vocabulary. 89
8.18 Posterior mode estimates of the topic-sentiment-word distributions of the simulated data, determined with Adam optimization in which the hyperparameter settings are (α)_i = 1.8, (γ)_i = 1.1, and β_o for o = 1,2,3 constructed as described above. A learning rate of 0.0001 and random initialization have been used. 90
8.19 NJSS between the estimated (columns) and simulated (rows) distributions. 90
8.20 Estimated topic-sentiment-word distributions for subtopic 1 for stroller reviews that belong to topic 5 of an analysis with plain LDA (with hyperparameters (α)_i = 1.1, (β)_i = 1.1). In this analysis for LDA with syntax and sentiment, (α)_i = 2, (γ)_i = 2, and β is constructed as explained in section 8.3.2. 91
8.21 Topic-sentiment-word distributions for subtopic 2 for stroller reviews that belong to topic 5 of an analysis with plain LDA (with hyperparameters (α)_i = 1.1, (β)_i = 1.1). In the analysis for LDA with syntax and sentiment, (α)_i = 2, γ = 2, and β is constructed as explained in section 8.3.2. 92
B.1 'Cats and dogs' data set. Simple, small data set with distinctive topic clusters by construction. 114
B.2 Simulated data for test of Adam optimization applied to determining the posterior mode estimates in the LDA with syntax and sentiment model. The data is simulated with hyperparameters α = (0.5, 0.5), γ = (0.5, 0.5, 0.5), and β_o for o = 1,2,3 as explained in chapter 6. There are 20 documents in total, of which 11 are shown here. A period is used to indicate the end of a phrase. 115
B.3 Simulated data for test of Adam optimization applied to determining the posterior mode estimates in the LDA with syntax and sentiment model. The data is simulated with hyperparameters α = (0.5, 0.5), γ = (0.5, 0.5, 0.5), and β_o for o = 1,2,3 as explained in chapter 6. There are 20 documents in total, of which 9 are shown here. A period is used to indicate the end of a phrase. 116

1 Introduction

Over the last decade, giant steps have been made in the world of big data and 'big analytics'. These terms are used in several settings, and their definitions evolve; what is now called 'big' data will probably not be that big anymore in a few years. With the availability of big data and fast computers, better and larger-scale analyses can be done. For companies, these analyses are key, as it is believed that a lot of information and knowledge can be retrieved from logged data and data that is freely available online. The field of application of big data that this thesis focuses on is marketing intelligence: using data to gain insights into customer behavior and opinions. Even marketing strategies can be tuned based on prediction models such that the strategy is optimal for sales or product ratings. In the next section, we dive into the specific questions CQM1 is asked by its clients.

1.1. Customer insights using Latent Dirichlet Allocation

Suppose you are the head of marketing of a large industrial company. On the box of your product is a claim, for example, 'easy to use and unbreakable'. This text is expected to motivate the customer to buy your product. You are interested in the influence of this claim on customer opinion, so we turn to online reviews. Do people talk online about the claim on the box? If there are also boxes with different claims for the same product, is there a significant difference in opinion when it comes to, e.g., ease of use? The answers to these questions can help the marketing department choose the text on the box well and gain more knowledge of the customer experience. This information will help the company better meet the customers' needs and wishes. Another question concerns the star rating of a product on a webshop. These star ratings are essential to industrial companies because they are sometimes even linked to (personal) bonuses. That is why these companies want to know which aspects of the product drive the star rating. Are customers more satisfied if the product is cheap but satisfactory, or is the ease of use more important in their final judgment, and thus the star rating?

Naturally, it is very time-consuming to read all reviews online, so we want to apply a method that quickly summarizes a (large) set of reviews in a list of topics. This is where the field of topic modeling comes in. A topic model is a statistical method that retrieves topics from a set of documents consisting of words. Here, topics should be seen as common themes or subjects that occur in many documents. For the review data, we can think of, for example, the price; a large group of customers is expected to mention the price of a product in their review. One of the simplest topic models is Latent Dirichlet Allocation (LDA). This model assumes that there is a fixed number of topics in the set of documents/reviews, and finds distributions over the topics for each document and distributions over a list of words for each topic. With these distributions, we know in what proportions customers write about specific themes, and per topic, we know which words are used most frequently. From these lists of highly probable words per topic, the general customer opinion on that topic can usually be

1CQM stands for Consultants in Quantitative Methods and is the company in which my internship took place. The department of CQM in which this thesis is written is specialized in product and process innovation. To better innovate and improve products, insights in customer opinions are required. Therefore, research is done in extracting overarching themes and opinions from large sets of online reviews, without having to read every single review.

concluded. Latent Dirichlet Allocation is therefore well suited to review analyses and forms the main theme of this thesis.

1.2. Research questions
This research is conducted in collaboration with CQM. Therefore, the research questions that are studied in this thesis follow from issues encountered in the day-to-day work at CQM. First of all, the Bayesian model called Latent Dirichlet Allocation needs a more thorough explanation. There are many different implementations of LDA in software, but a mathematical explanation of the method of inference used in that software is often lacking. Also in the literature, articles may offer too little information on what is really behind the model of Latent Dirichlet Allocation, and different methods of inference are proposed. An overview of these methods, including extensive derivations, is therefore given in this thesis. Secondly, Latent Dirichlet Allocation is a model that can be applied to all kinds of documents. However, in the field of marketing intelligence, LDA might not give the desired results, and more information is expected to be 'hidden' in the analyzed reviews. An improvement upon the basic LDA model to make it more suitable for review analyses is researched. In conclusion, two research questions are to be answered in this thesis:
• What is Latent Dirichlet Allocation and which methods of inference are possible for this Bayesian hierarchical model?
• How can Latent Dirichlet Allocation be improved upon to make it more suitable for review analyses using linguistics?

1.3. Thesis outline
At this point, you, the reader, have been given the motivation to use Latent Dirichlet Allocation. Before the precise working principle of the LDA model is elaborated on in chapter 3, chapter 2 provides theoretical background information essential to understanding chapter 3. Apart from fundamentally understanding what LDA does and what assumptions are made, different methods of inference are possible; these are explained in chapter 4. To make a clear distinction between the methods that are already frequently used in the literature and the method that is a new contribution of this thesis, chapter 4 explains only the inference methods that are present in the literature and widely used. In chapter 5, a new, different inference method is explained which is, to the best of the author's knowledge, not yet applied to LDA. Because CQM gets specific questions from clients about customer opinions on products, an extension of LDA that is particularly suitable for extracting this kind of information from large data sets is described in chapter 6. This extension has been given the name 'LDA with syntax and sentiment', as it extends plain LDA with a combination of the sentiment of words (i.e., positive and negative opinion words) and their part-of-speech. Then, in chapter 7, a small note is made on how to interpret the results of LDA and which conclusions should and should not be drawn. In chapter 8, the actual results of the application of LDA to different data sets are displayed. Firstly, the different inference methods are compared, and secondly, results for LDA with syntax and sentiment are given. Naturally, a discussion of the results follows in chapter 9, and lastly, in chapter 10, conclusions are drawn on the different inference methods for LDA and on the extended version of LDA specifically designed for CQM. Recommendations for further research can also be found in chapter 10.

Figure 1.1: Overview of methods of inference to estimate the model parameters of Latent Dirichlet Allocation. The posterior distribution can either be calculated analytically, after which the posterior mode can be found through optimization, or it can be approximated using Gibbs sampling (collapsed or non-collapsed, taking the sample mean of the marginal distributions) or Variational Bayes methods (mean-field or stochastic VB, yielding a 'posterior' mode). The approximation methods are discussed in chapter 4 of this thesis, while the analytical route followed by optimization is described in chapter 5, as it is one of the main contributions of this research.

1.4. Note on notation
In the field of mathematics, there are many different ways of saying the same thing. Therefore, in this section, some clarity is given about the notation used in this thesis. First, all constants or one-dimensional parameters are simply given by a letter in italics, whether from the Greek or Roman alphabet, upper or lower case. When the parameter is a vector, the corresponding letter is given in boldface. Sometimes, for ease of notation, sets of vectors are also given in boldface, such that φ = {φ_1, ..., φ_K}, while, strictly speaking, φ is then a set of vectors. Whenever this simplified notation is used, it is mentioned. Very often in this thesis, only one element of a vector is used in an equation or a density. This element is then denoted with a subscript, while the vector remains in boldface and is surrounded by round brackets. For example, the i-th element of vector φ_j is denoted by (φ_j)_i (where j indicates which vector φ is used, as there are many vectors φ). Secondly, random variables are, conventionally, denoted with capital letters. Once they take a value, the notation changes to lower case to show that we are dealing with data. Some constants are also denoted with a capital letter, such as the data set size. When in doubt, the nomenclature can be consulted for clarification. As is conventional, the expectation of a random variable is denoted by E. If the probability of a random variable X taking value x is denoted by P(X = x), we use P to indicate the probability measure that belongs to the probability space in which X lives. Lastly, for densities, we use the Bayesian notation, as will be explained more extensively in chapter 2. By Bayesian notation we mean that the density of random variable X, f_X(x), is denoted by p(x). The conditional density of random variable X given Y, f_{X|Y}(x|y), becomes p(x|y).

2 Theoretical background

"Inside every non-Bayesian, there is a Bayesian struggling to get out" Dennis Lindley (1923-2013)

This chapter contains the theoretical background needed to understand the rest of this thesis. Firstly, the principles of Bayesian statistics are explained, since LDA is a hierarchical Bayesian model. Secondly, a section is dedicated to the Dirichlet process and Dirichlet distribution, because the latter is used multiple times in the model. Next, natural language processing (NLP), an overlapping field in computer science, artificial intelligence and linguistics, is introduced. Some principles of NLP are used in the preprocessing steps of review analyses with LDA. Lastly, model selection criteria that are frequently used in topic modeling are explained.

2.1. Bayesian statistics
In the field of statistics, there are, generally speaking, two types of statisticians: frequentists and Bayesian believers. The latter group considers all unknowns to be random variables, including the parameters [46]. We will explain this way of thinking and doing statistics using a simple example. Consider flipping a coin of which we do not know whether it is fair or not. Suppose we do this n times, and X_i | Θ ~ Bernoulli(Θ) for i = 1,...,n. The probability of throwing heads is represented by the random variable Θ, while the probability of getting tails is 1 − Θ. The random variables X_1,...,X_n are the flips and can either be heads or tails, i.e. X_i ∈ {H, T} for i = 1,...,n. Note that each flip is executed separately and with the same coin (and under the same circumstances); therefore X_1,...,X_n are conditionally independent and identically distributed. Note that conditional independence is specific to Bayesian statistics, as the parameter Θ is a random variable. Therefore, conditional on Θ, the flips are independent. Because Θ is a random variable, it has a distribution reflecting our belief. Beforehand, we believe that the probability of heads is in the neighborhood of 1/2, as would be the case for a fair coin. This belief is inserted into the model via a prior distribution. This is the distribution of Θ we believe to be true before generating the data. After having observed X_1 = x_1, ..., X_n = x_n, the posterior distribution is constructed, as the name suggests. The posterior density is determined via Bayes' rule:

f_{\Theta|\mathbf{X}}(\theta \mid \mathbf{x}) = \frac{f_{\mathbf{X}|\Theta}(\mathbf{x} \mid \theta) \cdot f_\Theta(\theta)}{f_{\mathbf{X}}(\mathbf{x})} \qquad (2.1)

Here f_{\Theta|\mathbf{X}}(\theta \mid \mathbf{x}) is the conditional density of parameter Θ given the observations X = x, and is referred to as the posterior. f_{\mathbf{X}|\Theta}(\mathbf{x} \mid \theta) is the conditional multivariate density function of the random variables X_1,...,X_n given parameter Θ = θ, f_\Theta(\theta) is the density of the initial distribution of Θ, that is, the prior density, and the term in the denominator, f_{\mathbf{X}}(\mathbf{x}), is called the evidence. This is the marginal distribution of the data and can be determined by integrating the numerator in equation 2.1 over θ.


An appropriate prior that represents our strong belief that the coin will have approximately equal probabilities for heads and tails is given by the Beta distribution with equal parameters and thus mean 1/2. Because we are relatively certain that the coin is fair, we choose both a and b to be 4. Therefore, we take as prior Θ ~ Beta(4,4):

f_\Theta(\theta) = \frac{1}{B(4,4)} \cdot \theta^{3}(1-\theta)^{3} \qquad (2.2)

With B(a,b) the beta function, defined as:

B(a,b) = \int_0^1 t^{a-1}(1-t)^{b-1}\,dt \qquad (2.3)

which can be rewritten in terms of the gamma function \Gamma(a) = \int_0^\infty t^{a-1} e^{-t}\,dt, as shown in e.g. [37]:

B(a,b) = \frac{\Gamma(a)\Gamma(b)}{\Gamma(a+b)} \qquad (2.4)

Instead of determining f_{\mathbf{X}|\Theta}(\mathbf{x} \mid \theta) directly, it is wise to first define Y as the total number of heads, such that Y ~ Binomial(n, Θ). The probability density function of Y is then:

f_{Y|\Theta}(y \mid \theta) = \frac{n!}{y!(n-y)!} \cdot \theta^{y}(1-\theta)^{n-y} \qquad (2.5)

Using Bayes' rule from equation 2.1, the posterior distribution becomes:

f_{\Theta|\mathbf{X}}(\theta \mid x_1,\dots,x_n) = f_{\Theta|Y}\Big(\theta \,\Big|\, Y = \sum_{i=1}^{n} 1_{\{x_i = H\}}\Big) \quad (*)
\propto f_{Y|\Theta}(y \mid \theta) \cdot f_\Theta(\theta) \quad (**)
= \frac{n!}{y!(n-y)!}\,\theta^{y}(1-\theta)^{n-y} \cdot \frac{1}{B(4,4)}\,\theta^{3}(1-\theta)^{3}
\propto \theta^{y+3}(1-\theta)^{n-y+3} \qquad (2.6)

(*): Because the data x_1,...,x_n are categorical, namely heads (H) or tails (T), it is better to work with the random variable Y, as introduced above.
(**): We only take the part of equation 2.1 that depends on θ, because we are merely interested in the distribution of this random variable. The denominator of 2.1 does not depend on θ; therefore this term is left out. Note that both n and y are known; n is fixed, and y is given by the data. In equation 2.6, θ is the only unknown. We recognize in the posterior density in equation 2.6 a Beta density with parameters y + 4 and n − y + 4, i.e., Θ | X ~ Beta(y + 4, n − y + 4). The posterior distribution does not directly give us the value of Θ; it is only a density over all values that Θ can attain. A natural way to get an estimator from the posterior is to compute the mean or the mode of the distribution. These can be straightforwardly determined using:

\hat{\Theta}_{\text{mean}} = \int_0^1 \theta \cdot f_{\Theta|\mathbf{X}}(\theta \mid \mathbf{x})\,d\theta \qquad \text{(posterior mean)} \qquad (2.7)

\hat{\Theta}_{\text{mode}} = \underset{\theta'}{\operatorname{argmax}}\; f_{\Theta|\mathbf{X}}(\theta' \mid \mathbf{x}) \qquad \text{(posterior mode)} \qquad (2.8)

For the Beta distribution, these two estimators can easily be computed, as their expressions are known (and easily derived) in terms of the parameters. The two estimators for our example with flipping a coin n times become:

\hat{\Theta}_{\text{mean}} = \frac{y+4}{n+8} \qquad (2.9)
\hat{\Theta}_{\text{mode}} = \frac{y+3}{n+6} \qquad (2.10)

When considering the coin flipping experiment as a frequentist statistician, a possible estimator for the probability of heads θ would be the maximum likelihood estimator. It is easy to verify that:

\hat{\theta}_{\text{MLE}} = \frac{y}{n} \qquad (2.11)

Comparing this estimator with the posterior mean and mode above, we can observe the influence of the prior distribution. Also note that for increasing sample size n, the influence of the prior diminishes. This principle is generalized in the Bernstein-Von Mises theorem, see e.g. [46].

The shift from the prior density of Θ to the posterior density f_{\Theta|\mathbf{X}}(\theta \mid \mathbf{x}) is visualized in figure 2.1. The data used are drawn from a Bernoulli(0.3) distribution, so the true parameter is θ = 0.3. The sample size is n = 50, and we can see that the three estimators, the posterior mean and mode and the maximum likelihood estimator, are still quite far off with respect to the true θ. For the maximum likelihood estimator, this is caused by the fact that the sample size is not large enough for a precise estimate. It is clear that if the sample size is increased, θ̂_MLE will be closer to 0.3. The posterior mean and mode are influenced by the prior distribution, which attributes most mass to θ = 0.5. That is the reason why the values of these estimators are closer to 0.5 than to the actual value 0.3.

Figure 2.1: Bayesian statistics: A prior density is imposed on the random parameter Θ and results in a posterior density of Θ given the observed data. Panel (a) shows the prior Θ ~ Beta(4,4); panel (b) shows the posterior Θ ~ Beta(y+4, n−y+4) together with the three possible estimators for Θ: posterior mean 0.397, posterior mode 0.393 and maximum likelihood estimate 0.380.
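To make the example concrete, the following Python sketch redoes the computation behind figure 2.1 (n = 50 flips of a Bernoulli(0.3) coin under a Beta(4,4) prior). The random seed is an arbitrary choice of mine, so the printed numbers will differ slightly from the values 0.397, 0.393 and 0.380 reported in the figure.

```python
import numpy as np

rng = np.random.default_rng(0)

n = 50                # number of coin flips
true_theta = 0.3      # true probability of heads
a, b = 4, 4           # Beta(4,4) prior parameters

flips = rng.random(n) < true_theta   # True represents heads
y = int(flips.sum())                 # observed number of heads

# Posterior is Beta(y + a, n - y + b), cf. equation 2.6
theta_mean = (y + a) / (n + a + b)           # posterior mean, equation 2.9
theta_mode = (y + a - 1) / (n + a + b - 2)   # posterior mode, equation 2.10
theta_mle = y / n                            # maximum likelihood estimate, equation 2.11

print(f"heads: {y} out of {n}")
print(f"posterior mean: {theta_mean:.3f}")
print(f"posterior mode: {theta_mode:.3f}")
print(f"MLE:            {theta_mle:.3f}")
```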

Latent Dirichlet Allocation is slightly more complicated than the example above. In the example, we used the fact that the Beta distribution is conjugate to the binomial distribution. That is, if we choose a Beta distribution as a prior, and the data are binomially distributed given the random parameter, the posterior distribution will also be a Beta distribution. In Bayesian statistics, these conjugate priors are often used to simplify inference. In LDA, instead of Beta and Binomial distributions, we use Dirichlet and Multinomial distributions, which are multivariate versions of the former. More on these distributions and their application in LDA will be explained in chapter 3. A last note needs to be made on the notation of the prior and posterior densities in Bayesian statistics. Instead of writing f_{\Theta|\mathbf{X}}(\theta \mid \mathbf{x}), usually the shorter version p(θ|x) is used. Similarly, p(θ) denotes the prior density and p(x|θ) the likelihood. This 'Bayesian' notation will be used in the remainder of this thesis.

2.2. Dirichlet process and distribution
Latent Dirichlet Allocation uses, as the name suggests, a Dirichlet distribution as a prior twice: once for the document-topic distributions, and once for the topic-word distributions. A Dirichlet distribution can be thought of as a distribution over distributions, where the latter distribution always consists of a probability vector, that is, a vector of which each element represents a probability and of which the elements sum to one. The official definition and properties of the Dirichlet distribution are given in subsection 2.2.2, but we first focus on the representation of sampling from a Dirichlet distribution using the stick-breaking construction. This method helps the reader understand the underlying principles that characterize the Dirichlet distribution.

2.2.1. Stick-breaking construction of Dirichlet process
A common representation of sampling from a Dirichlet distribution is given by the stick-breaking construction, introduced by Sethuraman in [40]. We will first state the constructive definition of a Dirichlet distribution, after which the intuition behind it will be explained. If a vector θ of length K is constructed according to:

\theta = \sum_{i=1}^{\infty} V_i \prod_{j=1}^{i-1} (1 - V_j)\, e_{Y_i},
\qquad V_i \sim \text{Beta}(1, \alpha) \text{ i.i.d.},
\qquad Y_i \sim \text{Multinomial}(1, g_0) \text{ i.i.d.} \; (*)
\qquad (2.12)

then θ ~ Dirichlet(α · g_0). At (*), a draw from a Multinomial(1, g_0) results in a unit vector, with all probability mass in one dimension i ∈ {1,...,K} and zero probability mass in all other dimensions. Y_i is then the index of the dimension in which all mass is concentrated. The vector e_{Y_i} is the unit vector in dimension Y_i, as Y_i ∈ {1,...,K}. As is conventional, i.i.d. means independent and identically distributed. For the proof that θ from equation 2.12 has the same distribution as θ ~ Dirichlet(α · g_0), we refer to [34]. Let us go step by step through the process. For i = 1, we draw V_1 from the Beta distribution and Y_1 from the Multinomial distribution. Y_1 denotes a dimension, and V_1 denotes the mass assigned to that dimension of the vector θ. Then, for the second iteration, again a dimension Y_2 and a length V_2 are drawn. The mass V_2 · (1 − V_1) is assigned to dimension Y_2 of θ, while mass V_1 remains assigned to the initially drawn dimension Y_1 of θ. Note that Y_1 might be the same as Y_2, as they are drawn from the same distribution and they are independent random variables. It is easy to see that the probability mass assigned in each iteration i, V_i \prod_{j=1}^{i-1}(1 - V_j), lies between 0 and 1. This means that we can look at the assignment of the mass in each iteration as consecutively breaking a stick, hence the name. We start with a stick of unit length and break V_1 from it. Then, from the remainder of the stick, V_2 is broken. This continues for i → ∞, such that in total, the mass is distributed over the K dimensions as θ. Parameters α and g_0 influence this distribution. The smaller α, the larger the probability of drawing a large V_i, such that almost all mass is already distributed in the first few iterations. On the other hand, if α is large, the density of the Beta(1, α) will be skewed to the left, and the V_i will be small, resulting in the end in a more uniformly distributed probability vector θ, given a symmetric g_0. The second parameter, g_0, handles the preference for certain dimensions. If (g_0)_1 is much larger than all other elements of g_0, most mass will be assigned to the first dimension in the iterative process of equation 2.12, such that (θ)_1 will be much larger than all other elements of θ. It can be concluded that α is a scaling parameter, with which one can steer towards a more uniform distribution or, on the other hand, a distribution that assigns most probability mass to one or a few dimensions. With g_0, we can incorporate a preference for certain dimensions in the distribution. It can be seen as a location parameter. If g_0 is symmetric and consists of, for example, only ones, Y_i can take on each dimension with equal probability in every step i of the constructive process. In general, the parameter vector of a Dirichlet distribution is given by α, in which both the scaling and the location parameter are collected. In the next section, the general definition of a Dirichlet distributed random variable is given, and its properties are derived and visualized.
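As an illustration of the construction in equation 2.12, the sketch below draws an approximate sample via stick-breaking in Python. Truncating the infinite sum after a fixed number of breaks and renormalizing at the end are simplifications I introduce for this example; they are not part of the construction itself.

```python
import numpy as np

def stick_breaking_sample(alpha, g0, n_breaks=1000, seed=None):
    """Approximate draw of theta ~ Dirichlet(alpha * g0) via stick-breaking.

    alpha    : positive scaling parameter
    g0       : length-K location vector (probabilities summing to 1)
    n_breaks : truncation level of the infinite sum in equation 2.12
    """
    rng = np.random.default_rng(seed)
    K = len(g0)
    theta = np.zeros(K)
    remaining = 1.0                      # length of the stick still left
    for _ in range(n_breaks):
        v = rng.beta(1.0, alpha)         # V_i ~ Beta(1, alpha)
        y = rng.choice(K, p=g0)          # Y_i ~ Multinomial(1, g0), as an index
        theta[y] += v * remaining        # assign the broken-off piece to dimension Y_i
        remaining *= (1.0 - v)           # shorten the stick
    return theta / theta.sum()           # renormalize the truncated sum

# Example: a small alpha concentrates the mass on few dimensions
print(stick_breaking_sample(alpha=0.5, g0=np.ones(3) / 3, seed=1))
```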

2.2.2. Dirichlet distribution
Now that the process of sampling from a Dirichlet distribution has been explained, we have gained an understanding of the parameters of this distribution and their functions. For the official definition, we first need to understand the simplex. In this thesis, it is chosen to define the Dirichlet distribution using the closed simplex, as in [33].

Definition 2.1 (Closed simplex). Let c be a positive number. The n-dimensional closed simplex T_n(c) in R^n is defined by¹:

T_n(c) = \left\{ (x_1,\dots,x_n)^T : x_i > 0,\ 1 \le i \le n,\ \sum_{i=1}^n x_i = c \right\}

¹Note that this is not a closed space in the topological sense.

An alternative is the open simplex, in which the sum \sum_{i=1}^{n-1} x_i is defined to be smaller than the constant c. It is a matter of choice to use either the open or the closed simplex, as long as we ensure that the elements of an n-dimensional Dirichlet distributed random vector sum to 1 and live in (0,1)^n.

Definition 2.2 (Dirichlet density [33]). A random vector X = (X_1,\dots,X_n)^T \in T_n(1) is said to have a Dirichlet distribution if the density of X_{-n} = (X_1,\dots,X_{n-1})^T is:

\text{Dirichlet}_n(\mathbf{x}_{-n} \mid \boldsymbol\alpha) = \frac{\Gamma\!\left(\sum_{i=1}^n \alpha_i\right)}{\prod_{i=1}^n \Gamma(\alpha_i)} \prod_{i=1}^n x_i^{\alpha_i - 1} \qquad (2.13)

where \boldsymbol\alpha = (\alpha_1,\dots,\alpha_n)^T is a strictly positive parameter vector. We will write X \sim \text{Dirichlet}_n(\boldsymbol\alpha) on T_n(1), that is, \sum_{i=1}^n X_i = 1.

The Dirichlet distribution owes its name to the integral in 2.14, studied by Peter Gustav Lejeune Dirichlet in 1839, to which the integral of the Dirichlet density from equation 2.13 is proportional.

\int \prod_{i=1}^{n-1} x_i^{\alpha_i - 1} \left(1 - \sum_{i=1}^{n-1} x_i\right)^{\alpha_n - 1} dx_1 \cdots dx_{n-1} = \frac{\prod_{i=1}^n \Gamma(\alpha_i)}{\Gamma\!\left(\sum_{i=1}^n \alpha_i\right)} \qquad (2.14)

Furthermore, note that for n = 2, we obtain a Beta distribution:

\text{Beta}(\alpha_1, \alpha_2) = \frac{\Gamma(\alpha_1 + \alpha_2)}{\Gamma(\alpha_1)\Gamma(\alpha_2)}\, x^{\alpha_1 - 1} (1 - x)^{\alpha_2 - 1} = \frac{1}{B(\alpha_1, \alpha_2)}\, x^{\alpha_1 - 1} (1 - x)^{\alpha_2 - 1} \qquad (2.15)

Consequently, the Dirichlet distribution can be thought of as a higher-dimensional version of the Beta distribution.

Properties
The Dirichlet distribution has nice closed-form properties when it comes to marginal distributions, conditional distributions and product moment generating functions. Most of them are used in this thesis. First, we take a look at the theorem containing the marginal and conditional distributions from [33]. The proof of this theorem can be found in [33].

Theorem 2.1 (Marginals and conditionals [33]). Let X ~ Dirichlet_n(α) on T_n; then we have the following results.
1. For any s < n, the subvector (X_1,\dots,X_s)^T has a Dirichlet distribution with parameters (\alpha_1,\dots,\alpha_s;\ \sum_{j=s+1}^n \alpha_j). In particular, X_i \sim \text{Beta}(\alpha_i, \alpha_+ - \alpha_i) with \alpha_+ = \sum_{i=1}^n \alpha_i.
2. The conditional distribution of X_i' = \frac{X_i}{1 - \sum_{j=1}^s x_j^*} for i \in \{s+1,\dots,n-1\}, given X_1 = x_1^*, \dots, X_s = x_s^*, follows a Dirichlet distribution with parameters (\alpha_{s+1},\dots,\alpha_{n-1},\alpha_n).

To give an idea of the proof of the marginal distributions, we show the derivation for a one-dimensional marginal distribution of a three-dimensional Dirichlet distribution. The method used in this derivation can be informative for other derivations in this thesis. We want to show that the marginal distribution of a 3-dimensional Dirichlet distribution on T_n(1) is a Beta distribution with parameters \alpha_i and \sum_{j=1, j \ne i}^{3} \alpha_j, following theorem 2.1. This result can later be generalized

for an n-dimensional Dirichlet distribution. For ease of notation, let us derive the marginal of X_1. Because X_3 = 1 - X_1 - X_2, we only need to integrate out X_2.

f_{X_1}(x_1) = \frac{\Gamma\!\left(\sum_{i=1}^{3}\alpha_i\right)}{\prod_{i=1}^{3}\Gamma(\alpha_i)} \int_0^1 x_1^{\alpha_1-1} x_2^{\alpha_2-1} (1-x_1-x_2)^{\alpha_3-1}\, dx_2
= \frac{\Gamma\!\left(\sum_{i=1}^{3}\alpha_i\right)}{\prod_{i=1}^{3}\Gamma(\alpha_i)} \int_0^{1-x_1} x_1^{\alpha_1-1} x_2^{\alpha_2-1} (1-x_1-x_2)^{\alpha_3-1}\, dx_2 \qquad (*)
= \frac{\Gamma\!\left(\sum_{i=1}^{3}\alpha_i\right)}{\prod_{i=1}^{3}\Gamma(\alpha_i)} \int_0^{1} x_1^{\alpha_1-1} (1-x_1)^{\alpha_2-1} u^{\alpha_2-1} (1-x_1)^{\alpha_3-1} (1-u)^{\alpha_3-1} (1-x_1)\, du \qquad (**)
= \frac{\Gamma\!\left(\sum_{i=1}^{3}\alpha_i\right)}{\prod_{i=1}^{3}\Gamma(\alpha_i)}\, x_1^{\alpha_1-1} (1-x_1)^{\alpha_2+\alpha_3-1} \int_0^{1} u^{\alpha_2-1} (1-u)^{\alpha_3-1}\, du
= \frac{\Gamma\!\left(\sum_{i=1}^{3}\alpha_i\right)}{\prod_{i=1}^{3}\Gamma(\alpha_i)}\, x_1^{\alpha_1-1} (1-x_1)^{\alpha_2+\alpha_3-1} \frac{\Gamma(\alpha_2)\Gamma(\alpha_3)}{\Gamma(\alpha_2+\alpha_3)}
= \frac{\Gamma(\alpha_1+\alpha_2+\alpha_3)}{\Gamma(\alpha_1)\Gamma(\alpha_2+\alpha_3)}\, x_1^{\alpha_1-1} (1-x_1)^{\alpha_2+\alpha_3-1} \qquad (2.16)

(*): Combining 0 ≤ x_2 ≤ 1 with 0 ≤ 1 - x_1 - x_2 ≤ 1 results in 0 ≤ x_2 ≤ 1 - x_1.
(**): Substitution of x_2 = (1 - x_1) u.

The last expression in 2.16 is exactly the density function of a Beta(\alpha_1, \alpha_2 + \alpha_3) distributed random variable. This result can be generalized for an n-dimensional Dirichlet distribution by integrating out all variables x_j for j \ne i, j \in \{1,\dots,n-1\}. Note that the Dirichlet distribution of an n-dimensional vector has support on the (n-1)-simplex by definition. For x_n we therefore use 1 - x_1 - \cdots - x_{n-1}. The same result follows:

X_i \sim \text{Beta}\!\left(\alpha_i, \sum_{j=1, j \ne i}^{n} \alpha_j\right)

and we can easily see that:

E[X_i] = \frac{\alpha_i}{\sum_{j=1}^{n} \alpha_j} \qquad (2.17)

As for any Beta(a,b) distribution, the mean is given by \frac{a}{a+b}.
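The marginal result above is easy to check by simulation. The following sketch (the parameter values are illustrative choices of mine) compares the empirical mean and variance of the first coordinate of Dirichlet samples with those of the Beta(α_1, α_2 + α_3) distribution derived in equation 2.16.

```python
import numpy as np

rng = np.random.default_rng(42)
alpha = np.array([2.0, 3.0, 5.0])        # illustrative Dirichlet parameters

samples = rng.dirichlet(alpha, size=200_000)
x1 = samples[:, 0]                        # first coordinate of each sample

# Theoretical marginal: X_1 ~ Beta(alpha_1, alpha_2 + alpha_3)
a, b = alpha[0], alpha[1] + alpha[2]
beta_mean = a / (a + b)                   # equals alpha_1 / sum(alpha), cf. equation 2.17
beta_var = a * b / ((a + b) ** 2 * (a + b + 1))

print(f"empirical mean {x1.mean():.4f} vs Beta mean {beta_mean:.4f}")
print(f"empirical var  {x1.var():.5f} vs Beta var  {beta_var:.5f}")
```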

Another property of the Dirichlet distribution that is used in derivations in this thesis is the product moment generating function. It is expressed as follows.

Proposition 2.1 (Product moment generating function). Let X ~ Dirichlet_n(α) on T_n(1). Let m be an n-dimensional vector with non-negative values. The product moment of X is given by:

" # n Γ¡Pn α ¢ n Y mi i 1 i Y Γ(mi αi ) E (X) = + (2.18) i = ¡Pn ¢ i 1 Γ i 1(mi αi ) i 1 Γ(αi ) = = + = To show that equation 2.18 is true, only a smart substitution is needed. Let us start with the definition of the expectation.

" n # Ã n ! Pn n 1 Z Γ( αi ) Y mi Y mi i 1 Y− αi 1 αn 1 E (X) x n = x − (1 x1 xn 1) − dx1 dxn 1 i i Q i − − i 1 = i 1 · i 1 Γ(αi ) i 1 · − − ··· − ··· = = = = (2.19) Pn n 1 Γ( αi ) Z i 1 Y− mi αi 1 mn αn 1 n = x + − (1 x1 xn 1) + − dx1 dxn 1 Q i − − = i 1 Γ(αi ) l 1 − − ··· − ··· = =

To compute this integral, we need to apply the same trick as in 2.16, but now iteratively. Substitute: x_{n-1} = (1-x_1-\cdots-x_{n-2})\, u_{n-1}, x_{n-2} = (1-x_1-\cdots-x_{n-3})\, u_{n-2}, and so on, up to x_2 = (1-x_1)\, u_2.

" n # Pn n 1 Γ( αi ) Z Y mi i 1 Y− mi αi 1 mn αn 1 E (X) n = x + − (1 x1 xn 1) + − dx1 dxn 1 i Q i − − i 1 = i 1 Γ(αi ) l 1 − − ··· − ··· = = = Pn Z 1 Γ( i 1 αi ) m α 1 Pn 1 1 i 2(mi αi ) 1 Qn = x1 + − (1 x1) = + − dx1 = i 1 Γ(αi ) 0 − = Z 1 Z 1 m α 1 Pn m α 1 2 2 i 3(mi αi ) 1 n 1 n 1 mn αn 1 u2 + − (1 u2) = + − du2 ... un 1− + − − (1 un 1) + − dun 1 · 0 − · · 0 − − − − Pn à n ! à n ! Γ( i 1 αi ) X X n = B m1 α1, (mi αi ) B m2 α2, (mi αi ) ... B(mn 1 αn 1,mn αn) Q − − = i 1 Γ(αi ) · + i 2 + · + i 3 + · · + + = = = ¡Pn ¢ n Γ i 1 αi Y Γ(mi αi ) = + = ¡Pn ¢ Γ i 1(mi αi ) i 1 Γ(αi ) = + = (2.20) Where B( , ) is the beta function, and in the last step, its gamma function representation from equation 2.4 is · · used.

Visualization

Dirichlet distributed random vectors live on a simplex T_n(1), such that a draw from the Dirichlet distribution results in a vector of probabilities that sum to 1. To get an idea of the Dirichlet density, it is necessary to understand the plane on which X can lie. For a three-dimensional X, this is visualized in figure 2.2. In higher dimensions, a hyperplane describes the simplex, but it cannot be visualized intuitively anymore.

Figure 2.2: Three-dimensional simplex T_3(1). On the gray plane lie the values of X_1, X_2 and X_3 for X ∈ T_3(1), such that the sum of X_1, X_2 and X_3 is equal to 1.

To understand the Dirichlet density, many samples are drawn and visualized in a triangle format. The triangle from figure 2.2 is mapped to a two-dimensional representation in the plots in figure 2.3. Different values for the parameter vector α of the Dirichlet distribution are chosen, but the symmetry is retained, that is, all elements (α)_i are the same. The plots of samples from Dirichlet(α) distributions on a three-dimensional simplex for different values of (α)_i are shown in figure 2.3. Note that each dot represents a sample.

Figure 2.3: 100 samples from the Dir(α) distribution with the same (α)_i for i ∈ {1,2,3}. Panels: (a) (α)_i = 0.01, (b) (α)_i = 0.1, (c) (α)_i = 1, (d) (α)_i = 10, (e) (α)_i = 100.
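Plots like figure 2.3 can be reproduced with a few lines of Python. The barycentric projection onto a triangle used below is my own illustrative choice and not necessarily the exact mapping used to produce the figures in this thesis.

```python
import numpy as np
import matplotlib.pyplot as plt

# Corners of the 2D triangle used to display the 3-dimensional simplex
corners = np.array([[0.0, 0.0], [1.0, 0.0], [0.5, np.sqrt(3) / 2]])

rng = np.random.default_rng(7)
alphas = [0.1, 1.0, 10.0]                 # symmetric (alpha)_i values to compare

fig, axes = plt.subplots(1, len(alphas), figsize=(12, 4))
for ax, a in zip(axes, alphas):
    samples = rng.dirichlet([a, a, a], size=100)
    xy = samples @ corners                # barycentric -> Cartesian coordinates
    ax.scatter(xy[:, 0], xy[:, 1], s=8)
    ax.plot(*corners[[0, 1, 2, 0]].T, color="gray")   # triangle boundary
    ax.set_title(f"(alpha)_i = {a}")
    ax.set_aspect("equal")
    ax.axis("off")
plt.show()
```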

From figure 2.3 we can deduce a pattern. When the parameters (α)_i, i ∈ {1,2,3}, are all equal to 1, the distribution is in fact uniform, as can be seen in 2.3c. When the (α)_i, i ∈ {1,2,3}, are smaller than 1, there is a tendency towards one of the three dimensions. For every sample from the distribution, this can be a different dimension. In each sample the value of one of the X_i's is near 1, while the others are near 0. This effect is the strongest for (α)_i = 0.01, i ∈ {1,2,3}, as shown in 2.3a. On the other hand, when the (α)_i, i ∈ {1,2,3}, are larger than 1, the dots in the graph move towards the middle. This results in all samples lying in the middle of the triangle. Note that, by definition, the sum of the values of X_1, X_2 and X_3 is always equal to 1. The larger the (α)_i, the more similar the values of X_1, X_2 and X_3 will be.

Of course, the (α)_i, i ∈ {1,2,3}, need not all have the same value. One can also take asymmetric Dirichlet priors. Samples are drawn from two asymmetric three-dimensional Dirichlet distributions and shown in figure 2.4.

The plots in figure 2.4 show very clearly the influence of the parameters (α)_i on the samples. In 2.4a, all samples have a strong tendency towards X_1, a less strong tendency towards X_2 and a small tendency towards X_3. In this sense, the parameter vector α can capture initial belief via the Dirichlet prior, because a larger (α)_i will in general result in a larger X_i. Note that the patterns of figure 2.3 are still valid. If one would take α_1 = 50, α_2 = 30, α_3 = 10, all dots would lie closer to each other, but the cloud would still be centered mostly towards X_1, a little less towards X_2, and not more towards X_3 than shown in 2.3d.

(a) α_1 = 5, α_2 = 3, α_3 = 1. (b) α_1 = 0.1, α_2 = 1, α_3 = 1.

Figure 2.4: 100 samples from asymmetric Dirichlet distributions. The axis of X_1 is on the left, that of X_3 on the right, and that of X_2 below the triangle.

In 2.4b, the dots lie on the line between X_2 and X_3, as their corresponding parameters are the largest, whereas α_1 is smaller and therefore results in smaller values of X_1.

2.3. Natural language processing

Natural language processing is an overlapping field of computer science, artificial intelligence, and linguistics, in which all kinds of processing of human languages are involved. Examples are predictive text generation, automatic text generation, handwriting recognition, machine translation, and text summarization [4, 15]. The latter is of interest to us, as LDA aims to retrieve information from a large data set of reviews and summarize people's opinions.

The field of natural language processing (NLP) is vast, and applications are numerous. In this thesis, NLP is used in the preprocessing steps in which the reviews are prepared to be analyzed by LDA. That is, the data needs to be cleaned before it can be used. To this end, we used a combination of KNIME [23], Microsoft Excel and Python software, to prepare all data using the following steps.

Figure 2.5: Data preprocessing workflow in KNIME.

First, only the review text needs to be retrieved, so all other information that might be in the data set, such as e-mail addresses or websites, needs to be removed. Then, it is made sure that the data is in the right format. In KNIME, this is the 'document' class, while in Python we use lists of strings, in which each string contains a review. Subsequently, all capital letters are converted to lower case letters. In this way, words such as Like and like are considered the same in the model, as they should be. The next step is to remove all punctuation and special symbols, as these are not needed in the analysis. Note that apostrophes are also removed, such that a word like doesn't becomes doesnt. Because this preprocessing step is applied consistently, we know that all doesnt words used to be doesn't. Another remark needs to be made, as in the extension of LDA introduced in this thesis (see chapter 6), sentences and phrases are needed. Therefore, when the extended version of LDA is applied to the data set, commas, periods, question marks, exclamation marks, parentheses, (semi)colons and brackets are left in the data, because these are used later on to split reviews into phrases. In the data preparation process of both versions of LDA, numbers and words containing numbers are removed.

Next, the toughest NLP step is applied to the data: POS-tagging and lemmatization. POS stands for part-of-speech and is the function of a word in a sentence. Different programs have been developed to automatically assign a part-of-speech to each word in the document. With the POS-tag of each word and the words themselves, lemmatization can be done. This is the process of truncating each word to its root. Consider the word walking. The POS-tag of this word indicates that it is a verb and that it is in the present continuous form. Therefore, its lemma is walk. By lemmatizing all words, we reduce the size of the vocabulary (the total number of unique words in the data set), and the analyses are improved, as verbs that are conjugated differently are considered the same after the lemmatization step. Moreover, adverbs and adjectives that come from the same lemma are considered equal.

Another preprocessing step that helps to reduce the vocabulary size is the removal of stop words. There are many lists of English stop words available in software or online, containing words such as the, a, it, though, et cetera. The stop word list used in this thesis can be found in section B.5 in the appendix. These words are uninformative in the analysis and only unnecessarily increase the size of the data set. Therefore, removal of stop words is often applied. In addition, short words can be removed. Because it is useful in opinion mining to keep the word 'not', it is chosen to only remove one- and two-letter words.

The last step, before inserting the data into the LDA model, is the removal of low-frequency words. The size of the vocabulary determines the number of parameters that need to be estimated in the model. For this reason, it is essential that only words that contribute to the analysis are contained in the vocabulary. Words that occur very rarely will not contribute much to the results of LDA, so they are detected using frequency counts and then removed from the entire data set. With the fully cleaned and reduced data set, consisting of lists of strings with reviews, Latent Dirichlet Allocation can be applied, as is explained in chapter 3.
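A rough Python sketch of these cleaning steps is shown below, here using NLTK for tokenization, POS-tagging and lemmatization (an assumption made for illustration; the thesis workflow itself runs through KNIME, and the stop word list and frequency threshold below are only stand-ins for the choices in appendix B.5):

```python
import re
from collections import Counter

from nltk import pos_tag, word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words("english")) - {"not"}  # keep 'not' for opinion mining

def penn_to_wordnet(tag):
    """Map Penn Treebank POS tags to the WordNet POS classes the lemmatizer expects."""
    return {"J": "a", "V": "v", "R": "r"}.get(tag[0], "n")

def clean_review(text):
    text = text.lower().replace("'", "")            # lower case; doesn't -> doesnt
    text = re.sub(r"[^a-z\s]", " ", text)           # drop punctuation, digits, symbols
    tokens = word_tokenize(text)
    lemmas = [lemmatizer.lemmatize(w, penn_to_wordnet(t)) for w, t in pos_tag(tokens)]
    return [w for w in lemmas if w not in stop_words and len(w) > 2]

def preprocess_corpus(reviews, min_count=5):
    """Clean every review and remove words occurring fewer than min_count times."""
    docs = [clean_review(r) for r in reviews]
    freq = Counter(w for doc in docs for w in doc)
    return [[w for w in doc if freq[w] >= min_count] for doc in docs]
```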

2.4. Model selection criteria

In every statistical model in which inference is done and parameters are estimated, model validation is needed. We need to check whether the estimated parameters are good, but what is 'good'? In this section, two methods to assess the parameter estimates are explained. When looking at the quality of the inferred model parameters in topic modelling, information theory comes into play. Many measures to quantify the goodness of fit originate from information theory. The most fundamental quantity in this field of science is the Shannon information, introduced by Claude Shannon in 1948².

Definition 2.3 (Shannon information [30])
The Shannon information content of an outcome x is defined to be:

\[ h(x) = \log_2 \frac{1}{P(x)} \]

With P(x) the probability of x and h(x) measured in bits.

When looking at all possible outcomes that a random variable can have, the entropy, or weighted average of the Shannon information, comes into play. It is defined in [30] for an ensemble, which is just a random variable X with outcome space Ω_X and corresponding probabilities collected in P_X.

Definition 2.4 (Entropy)
The entropy of an ensemble X = (x, Ω_X, P_X) with probability measure P is defined to be the average Shannon information content of an outcome:

\[ H(X) = \sum_{x \in \Omega_X} P(x) \log \frac{1}{P(x)} \qquad (2.21) \]

² Did you know that Claude Shannon and Alan Turing, the inventor of the computer, had lunch together?

Here, capital X is used to denote the fact that the entropy is computed for a discrete random variable X with sample space Ω_X and probability measure P. If P(x) = 0 for some x ∈ Ω_X, then P(x) log(1/P(x)) is defined to be equal to 0. Furthermore, H(X) is measured in bits and is also referred to as the uncertainty of X.

The idea of entropy can best be understood by considering the example of flipping a coin again. First, assume that we have a fair coin, such that the probabilities of heads and tails are equal: P(H) = P(T) = 1/2. Substituting this in equation 2.21, with X the random variable with sample space {H, T} and the aforementioned probabilities, we get H(X) = log_2(2) = 1. This means that we need only 1 bit to communicate the outcome of the coin flip, namely 1 for heads and 0 for tails (or vice versa). In the same way, for a 4-sided die with 4 equally likely outcomes we need 2 bits, as H(X) = 4 · (1/4) log_2(4) = 2. However, if we have a strange die with 1 unique side (e.g. 1) and 3 sides showing the same number (e.g. 2), the probabilities and the entropy change: H(X) = (1/4) log_2(4) + (3/4) log_2(4/3) ≈ 0.81. Note that the entropy is lower than for the fair die, where we needed 2 bits to communicate the result. One can interpret this as more information already being contained in the outcome itself, so that on average only '0.81' bit is needed to tell the result of the throw to your opponent. This result holds in general, as stated in the second item of the highlighted properties of the entropy from [30].
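These small calculations are easy to verify numerically; a short check (the helper name is ours, not from the thesis):

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits of a discrete distribution p (zero-probability terms contribute 0)."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

print(entropy([0.5, 0.5]))        # fair coin: 1.0 bit
print(entropy([0.25] * 4))        # fair 4-sided die: 2.0 bits
print(entropy([0.25, 0.75]))      # 'strange' die: ~0.81 bit
```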

Theorem 2.2 (Properties of the entropy)
• H(X) ≥ 0, with equality if and only if there exists an i such that p_i = P(X = a_i) = 1.
• H(X) is maximized if p = (p_1, ..., p_I) is uniform, that is, if p_i = 1/|Ω_X| for all i ∈ {1,...,I}. Then H(X) = log(|Ω_X|). In general, we have H(X) ≤ log(|Ω_X|).
• The joint entropy of random variables X and Y with sample spaces Ω_X and Ω_Y and joint probability measure P is defined as
\[ H(X, Y) = \sum_{x \in \Omega_X,\, y \in \Omega_Y} P(x, y) \log \frac{1}{P(x, y)} \]
and if X and Y are independent random variables, then H(X, Y) = H(X) + H(Y).

A metric that is often used for the comparison of two probability distributions is the Kullback-Leibler divergence. In the field of information theory, it is called the relative entropy. Note that it is not an actual distance in the mathematical sense.

Definition 2.5 (Relative entropy, KL-divergence)
The relative entropy, also called the Kullback-Leibler divergence, between two discrete probability distributions p and q that are defined over the same sample space Ω_X is given by:

\[ D_{KL}(p \| q) = \sum_{x \in \Omega_X} p(x) \log \frac{p(x)}{q(x)} \qquad (2.22) \]

To give an idea of the working principle of this relative entropy, we consider a small example. Let p = (0.6, 0.2, 0.1, 0.05, 0.05) and q = (0.6, 0.2, 0.05, 0.05, 0.1). The relative entropy is then D_KL(p‖q) = 0.05 · log(2) ≈ 0.03. The only differences between p and q are the swapped probabilities of the third and fifth elements, which both carry little probability mass. If we instead halve the first element and triple the third element of p to obtain q, i.e. q = (0.3, 0.2, 0.3, 0.05, 0.05), the relative entropy becomes D_KL(p‖q) = 0.6 · log(2) − 0.1 · log(3) ≈ 0.31, which is a lot higher than the previous score. As expected, large changes in probability mass (in the absolute sense) result in larger KL-divergence scores than small changes (in the absolute sense). The relative entropy is also defined for the comparison of densities of two continuous random variables sharing the same domain. Then, the sum in the definition above becomes an integral over the domain, and the probability densities replace the probability mass functions. For the qualification of our model parameters, we cannot compare the estimated distributions q with the true distribution p, as the true distribution is unknown. Nevertheless, the Kullback-Leibler divergence is used for other purposes in chapter 7 and section 4.2.1.

In topic modelling, another statistic is used for model comparison: the perplexity. In the field of NLP, and for language and topic models such as LDA, this measure is most frequently used to observe the difference in the quality of the model when parameters like the number of topics, the vocabulary size or the number of iterations are changed. The model with the lowest perplexity is then assumed to be the best fit on the data.
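The two example values above can be reproduced directly (natural logarithm, matching the ≈ 0.03 and ≈ 0.31 quoted; the helper name is ours):

```python
import numpy as np

def kl_divergence(p, q):
    """Kullback-Leibler divergence D_KL(p || q) for discrete p, q with q > 0 wherever p > 0."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

p = [0.6, 0.2, 0.1, 0.05, 0.05]
print(kl_divergence(p, [0.6, 0.2, 0.05, 0.05, 0.1]))   # ~0.03 (small swap of mass)
print(kl_divergence(p, [0.3, 0.2, 0.3, 0.05, 0.05]))   # ~0.31 (large shift of mass)
```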

The definition of the perplexity is taken from the original LDA paper by Blei et al. [7].

Definition 2.6 (Perplexity) Consider a model that is trained on training data wtrain to obtain estimates for the model parameters. Then, the perplexity of the left-out test data set wtest is defined as:

\[ \text{Perplexity}(w_{\text{test}}) = 2^{-\frac{\log_2 P(w_{\text{test}})}{|w_{\text{test}}|}} = e^{-\frac{\log P(w_{\text{test}})}{|w_{\text{test}}|}} \qquad (2.23) \]

where |w_test| is the size of the test set.
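Given a fitted model's total log-likelihood of the held-out words (which depends on the inference method and is simply assumed available in this sketch), the perplexity is a one-liner:

```python
import numpy as np

def perplexity(log_likelihood, n_test_words):
    """Perplexity of held-out data from its total (natural) log-likelihood."""
    return float(np.exp(-log_likelihood / n_test_words))

# e.g. a model assigning each of 10,000 test words probability 1/500 on average:
print(perplexity(10_000 * np.log(1 / 500), 10_000))   # 500.0
```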

One can think of the perplexity as a comparison of the inferred model with the case of a uniform distribution. Remember that in the latter case, the entropy was highest, so in the perplexity, we observe to what extent the model has improved on the uninformative prior. Because the perplexity is only used for comparison among models or parameter settings, the 'best' model is the one that has the lowest entropy and thus retrieved the most information from the data.

3

Latent Dirichlet Allocation

In this thesis, we focus on the model called 'Latent Dirichlet Allocation'. This model was introduced by Blei et al. in 2003, and is essentially a hierarchical model that brings structure to a (large) set of documents. First, the terminology used in the model and throughout this literature study must be set straight. Let there be a set of documents D = {1,...,M}, also referred to as the corpus. In this research, documents are customer reviews, but all kinds of text can be used as input for LDA. Each document d ∈ D in the corpus contains a list of words, represented by the vector w_d. Each w_d has its own length N_d, meaning that the documents in the corpus are of varying lengths. Furthermore, there is a finite set of (unique) words that occur in the corpus, conveniently called the vocabulary. The size of the vocabulary is denoted by V.

A simple example of 4 documents is shown below. Document 1 consists of 17 words, therefore N_1 = 17. Documents 2, 3 and 4 can have different lengths. The vocabulary is shown for only the words of document 1. Assuming that different words are used in documents 2, 3 and 4, the vocabulary size is V > 17. The words that indicate the writer's opinion are shown in boldface. These are the possible words of interest, since they contain a customer's opinion.

document 1: I like my new stroller. It is light and flexible. However I find it a bit expensive...
documents 2, 3 and 4: ...
vocabulary: a, and, bit, expensive, find, flexible, however, I, is, it, light, like, my, new, stroller

Each word (w_d)_i in a document has two indices, d and i, meaning that it is a word from document d on location i within the document (with d ∈ {1,...,M} and i ∈ {1,...,N_d}). The vocabulary consists of all words that occur in the corpus, in alphabetic order, and assigns to each word an index. The first word in the vocabulary above, a, has index 1, the second word, and, is represented by 2, et cetera. A word (w_d)_i is thus not given by its textual representation, e.g. flexible, but by its index in the vocabulary, 6. As a result, we know that for all d, i, (w_d)_i ∈ {1,...,V}. Furthermore, the bag-of-words representation of documents is used in LDA. In this representation, word order is disregarded, so only the frequency of word occurrence in each document matters. Lastly, we assume that there are K topics hidden in the corpus. These topics can be seen as common themes that can be found in reviews. In the example above, we see that the writer of document 1 writes about the lightness and flexibility of his/her new stroller. Also, he/she finds it expensive. There might be more people who write about flexibility, so then it becomes a theme. Besides, if many customers write 'I like this stroller', a topic consisting of the main word 'like' will be formed. Note that topics are not found by a label or overarching theme like 'comfort'; only a topic-word distribution rolls out of the algorithm. That is, each topic k ∈ {1,...,K} has a corresponding topic-word distribution, with higher probabilities for words that are important to this topic. The topic that is manually labeled to be about flexibility will have a high probability for the word 'flexible' in its topic-word distribution. Note that in LDA, it is unknown beforehand what topics can be found in the review data set, as it is an unsupervised method. Even the number of topics, K, is unknown and must be determined by domain knowledge, the size of the data set, and trial and error (which model gives the best fit based on a goodness-of-fit statistic). That is, a small data set of M documents can barely give accurate results if K ≈ M. An overview of all sets, variables, and parameters is given in table 3.1.

Table 3.1: Overview of parameters and observational data in Latent Dirichlet Allocation.

Variable name | Meaning
D | Set of documents, 'corpus'
M | Number of documents
w_d | List of words in document d
(w_d)_i | Word on location i in document d
N_d | Number of words in document d
V | Size of vocabulary, i.e. number of unique words in corpus
K | Number of topics
k | Topic index

3.1. Into the mind of the writer: generative process

The goal of LDA is to extract topics from a set of reviews via a hierarchical Bayesian model. Before we look at statistical inference methods (see chapter 4), we need to understand the hierarchy of the Bayesian model. The generative process that forms this hierarchy aims to summarize the writing process in the minds of the writers. To stay with the example of strollers, imagine you have just bought a new, very expensive buggy. As it has cost you a lot of money, you have high expectations, but the stroller turns out to be a bit disappointing. You want to share your experience with other customers, so you decide to write a review. The process that follows is the generative process, as you are going to generate a document.

Latent Dirichlet Allocation assumes that the generative process goes as follows. First, you think about which aspects you want to write about. You feel disappointed, as the stroller you have bought was very expensive. Furthermore, you want to explain your disappointment: the stroller is very heavy, too large to fit in the car, and the basket underneath is too small. Thus, you want to talk about four topics: value for money, weight, size, and the basket. Of course, the labels that are now manually assigned to the topics do not necessarily occur explicitly in the reviews. You find your disappointment in the performance of the stroller compared to the price you paid for it the most essential aspect, so 40% of the words in the review are about this topic. The other three topics are formed by the rest of the document, with an equal number of words. That is, 20% of the words are about the weight, 20% about the size, and 20% about the basket. After this decision, you need to find the right words to describe your opinion. For each topic, there is a set of words in your own English vocabulary from which can be chosen. For example, for the topic 'size' one can think of: large, small, big, little, fit, huge, size, proportions, width, broad, height, et cetera. These sets of words exist for each topic about which you want to write. All these aspect words are then glued together with verbs, personal pronouns, and determinants to form a review of clear and correct English.

Mathematically speaking, this generative process of writing a review can be summarized in a hierarchical Bayesian model, with a prior belief on how the writer chooses his or her topics and a prior belief on which words occur in the set of words to choose from for each possible topic. The following scheme from [7] summarizes it.

1. For each document d ∈ {1,...,M}, draw a topic distribution parameter vector Θ_d from a Dirichlet(α) distribution, i.e. Θ_d ∼ Dirichlet(α).

2. For each topic k ∈ {1,...,K}, draw a topic-word distribution parameter vector Φ_k from a Dirichlet(β) distribution, i.e. Φ_k ∼ Dirichlet(β).

3. For each word i in document d,

(a) Draw a topic (Zd)i from a Multinomial(1,Θd)

(b) Draw a word (Wd)i from a Multinomial(1,Φ(zd)i )

Attention must be paid to steps 3a and 3b, because drawing from a Multinomial(1, Θ_d) results in a vector instead of an integer. Therefore we define Z̃_{d,i} ∼ Multinomial(1, Θ_d), such that (z_d)_i = k ⟺ z̃_{d,i} = (0, 0, ..., 1, 0, ..., 0) with a single 1 in the k-th dimension of z̃_{d,i}. That is, z̃_{d,i} is the unit vector in dimension k. So when it is written in this thesis that (Z_d)_i is drawn from Multinomial(1, Θ_d), actually Z̃_{d,i} is drawn from Multinomial(1, Θ_d) and the mapping (z̃_{d,i})_k = 1 ⟹ (z_d)_i = k for some k ∈ {1,...,K} is applied. A similar definition can be made for the words: (w_d)_i = k ⟺ (w̃_{d,i})_k = 1 for W̃_{d,i} ∼ Multinomial(1, Φ_{(z_d)_i}). Consequently, the same steps are applied when we 'draw' (W_d)_i from Multinomial(1, Φ_{(z_d)_i}).

In the generative process, some critical assumptions on independence are made. Each document-topic distribution Θ_d is drawn independently from all other Θ_i for i ≠ d. This is reasonable, as it is probable that the writers of the reviews decide independently about which topics they want to write. It is thus assumed that there has been no communication between them beforehand. However, all customers write about the same product, so independence is a strong assumption that is not expected to be satisfied in every data set. The same is valid for the topic-word distributions: each Φ_k is drawn independently from all other Φ_j for j ≠ k. That is, the word probabilities that belong to a particular topic are independent of the word probability distributions of other topics. Furthermore, all Θ and Φ are independent by construction, which makes sense, as the topic distribution of a certain review and the sets of words to choose from per topic have nothing to do with each other. Lastly, each topic (Z_d)_i is drawn independently from the corresponding Multinomial(1, Θ_d), and therefore each pair ((Z_d)_i, (W_d)_i) is independent of every other pair ((Z_i)_j, (W_i)_j). These assumptions will be used later on in statistical inference on the hierarchical model.

The generative process can be visualized in plate notation, shown in figure 3.1. One should read figure 3.1 as follows. The three rectangles can be read as three 'for loops', and they represent three levels in the hierarchical model. First consider the two on the right of figure 3.1. The outer rectangle represents the corpus, or the loop over documents. The hyperparameter vector α is outside the rectangle and is therefore independent of the documents. Random vector Θ is within the loop, so this vector is drawn for each document. One level deeper, we look at the words in the document. For each word instance in the document, we first draw a topic and then a word. These draws are done as often as there are words in the document, therefore N_d times for document d. The rectangle on the left is separate and does not depend directly on the documents. The hyperparameter vector β is outside the rectangle, meaning that this prior belief on the topic-word distributions Φ is the same for each topic. Then, a random vector Φ is drawn K times, as is denoted in the corner. For clarity, the indices are left out in figure 3.1. Furthermore, the gray variable in the plate notation represents the word random variable W, which is actually observed. The circles with α and β are dotted because they are pre-set values and thus constant. These are located at the top of our hierarchical scheme. No further distribution is imposed on either α or β in the basic LDA model.
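A minimal simulation of steps 1-3 is given below; it is useful for generating artificial test corpora (all sizes and hyperparameter values are arbitrary illustration choices, not the settings used in this thesis):

```python
import numpy as np

rng = np.random.default_rng(0)

M, K, V = 100, 3, 50            # documents, topics, vocabulary size
N = rng.integers(20, 60, M)     # document lengths N_d
alpha = np.full(K, 0.5)         # symmetric Dirichlet hyperparameter alpha
beta = np.full(V, 0.1)          # symmetric Dirichlet hyperparameter beta

phi = rng.dirichlet(beta, size=K)      # topic-word distributions Phi_k, shape (K, V)
theta = rng.dirichlet(alpha, size=M)   # document-topic distributions Theta_d, shape (M, K)

docs, topics = [], []
for d in range(M):
    z = rng.choice(K, size=N[d], p=theta[d])              # draw a topic per word position
    w = np.array([rng.choice(V, p=phi[k]) for k in z])    # draw a word given its topic
    topics.append(z)
    docs.append(w)
```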


Figure 3.1: Plate notation of Latent Dirichlet Allocation as visualized in [25]. The hyperparameters α and β are denoted with a dotted circle. Θ represents the document-topic distribution, and Φ is the topic-word distribution. Random variable Z is the topic that has one-to-one correspondence with word W .

An overview of all parameters and random variables, their dimensions, and the spaces in which they exist, is made in table 3.2.

Table 3.2: Random variables and constants used in Latent Dirichlet Allocation.

Symbol | Meaning | Type (and size) | Space
V | Size of vocabulary | integer | ℕ
K | Number of topics | integer | ℕ
M | Number of documents in corpus | integer | ℕ
α | Prior belief on document-topic distribution (see section 2.2) | vector: 1 × K | ℝ_{>0}^K
β | Prior belief on topic-word distribution (see section 2.2) | vector: 1 × V | ℝ_{>0}^V
Φ_k | Parameter vector of multinomial word distribution for topic k | vector: 1 × V | T_V(1) (simplex)
Θ_d | Parameter vector of multinomial topic distribution for document d | vector: 1 × K | T_K(1) (simplex)
z̃_{d,i} | Unit vector in the dimension of the chosen topic corresponding to word (d,i) | vector: 1 × K | {0,1}^K
(z_d)_i | Topic (index) for word i in document d | integer | {1,...,K}
w̃_{d,i} | Unit vector in the dimension of the chosen word (index) | vector: 1 × V | {0,1}^V
(w_d)_i | Word (index) on location i in document d | integer: 1 × 1 | {1,...,V}

3.2. Important distributions in LDA

In the generative process, several probability distributions are mentioned. In this section, the distribution of each random variable in the scheme in figure 3.1 conditional on its previous node is given. Note that α and β are the hyperparameters, and no distribution on these is imposed.

We start at the top of the scheme, with the topic distribution per document. We know that the parameter vector Θ_d of the document-topic distribution is Dirichlet distributed for each document d ∈ {1,...,M}, that is:

\[ (\Theta_d \mid \alpha) \sim \text{Dirichlet}(\alpha), \qquad p(\theta_d \mid \alpha) = \frac{\Gamma\!\left(\sum_{k=1}^{K} (\alpha)_k\right)}{\prod_{k=1}^{K} \Gamma\!\left((\alpha)_k\right)} \prod_{k=1}^{K} (\theta_d)_k^{(\alpha)_k - 1} \qquad (3.1) \]

Throughout this thesis, the notation (Θ_d)_k is used for the k-th element of vector Θ_d (in boldface). Also the word distribution per topic, Φ_k, is Dirichlet distributed with parameter vector β for each k ∈ {1,...,K}, that is:

\[ (\Phi_k \mid \beta) \sim \text{Dirichlet}(\beta), \qquad p(\phi_k \mid \beta) = \frac{\Gamma\!\left(\sum_{j=1}^{V} (\beta)_j\right)}{\prod_{j=1}^{V} \Gamma\!\left((\beta)_j\right)} \prod_{j=1}^{V} (\phi_k)_j^{(\beta)_j - 1} \qquad (3.2) \]

Topic Z˜d,i is drawn from a Multinomial distribution with parameter Θd, also from document d:

\[ (\tilde{Z}_{d,i} \mid \Theta_d) \sim \text{Multinomial}(1, \Theta_d), \qquad p(\tilde{z}_{d,i} \mid \theta_d) = \frac{\Gamma\!\left(\sum_{k=1}^{K} (\tilde{z}_{d,i})_k + 1\right)}{\prod_{k=1}^{K} \Gamma\!\left((\tilde{z}_{d,i})_k + 1\right)} \prod_{k=1}^{K} (\theta_d)_k^{(\tilde{z}_{d,i})_k} = \prod_{k=1}^{K} (\theta_d)_k^{(\tilde{z}_{d,i})_k} = (\theta_d)_{(z_d)_i} \qquad (3.3) \]

Note that the probability of (Zd)i being topic l is equal to the l-th element of document-topic vector Θd. This is a very natural way to consider the topic probabilities.

A similar procedure can be followed for the word probability density, given that the corresponding topic (Z_d)_i = k for some k ∈ {1,...,K}:

\[ (\tilde{W}_{d,i} \mid (Z_d)_i = k, \Phi_k) \sim \text{Multinomial}(1, \Phi_k), \qquad p(\tilde{w}_{d,i} \mid (z_d)_i = k, \phi_k) = \frac{\Gamma\!\left(\sum_{j=1}^{V} (\tilde{w}_{d,i})_j + 1\right)}{\prod_{j=1}^{V} \Gamma\!\left((\tilde{w}_{d,i})_j + 1\right)} \prod_{j=1}^{V} (\phi_k)_j^{(\tilde{w}_{d,i})_j} = \prod_{j=1}^{V} (\phi_k)_j^{(\tilde{w}_{d,i})_j} = (\phi_k)_{(w_d)_i} \qquad (3.4) \]

Again, given that you chose topic (Z_d)_i = k, the probability of picking the j-th word in the vocabulary, i.e. P((w_d)_i = j | (z_d)_i = k, φ_k), is just equal to the j-th element of the topic-word probability vector Φ_k.

3.3. Probability distribution of the words

The probability of having the corpus as observed, given the hyperparameters, is given by p(w | α, β). To obtain a closed-form expression for this 'likelihood', we first derive p(w, z | α, β). In the derivation the conditioning on the hyperparameters is omitted in the notation, as this is trivial.

Let us take a document d, so we consider the case in which we only have one document, d. The document-topic distribution for this document is denoted by Θ_d and the topic-word distributions are denoted by Φ_k for k = 1,...,K, as shown in table 3.2. We want to know the joint distribution of all words and corresponding topics in this document, i.e. (w, z) = (w_1, ..., w_{N_d}, z_1, ..., z_{N_d}).

\begin{align*}
p(W = w, Z = z) &= E\left[\mathbb{1}\{W = w, Z = z\}\right] \\
&= E\left[E\left[\mathbb{1}\{W = w, Z = z\} \mid \Theta_d, \Phi\right]\right] \\
&= E\left[p(W = w, Z = z \mid \Theta_d, \Phi)\right] \\
&\overset{*}{=} E\left[\prod_{i=1}^{N_d} p(W_i = w_i, Z_i = z_i \mid \Theta_d, \Phi)\right] \\
&= E\left[\prod_{i=1}^{N_d} p(W_i = w_i \mid Z_i = z_i, \Theta_d, \Phi)\, p(Z_i = z_i \mid \Theta_d, \Phi)\right] \\
&= E\left[\prod_{i=1}^{N_d} (\Phi_{z_i})_{w_i} (\Theta_d)_{z_i}\right] \qquad (3.5)
\end{align*}

* This step can be made because each topic and word combination, i.e. (W_i, Z_i), is drawn from respectively the Multinomial(Θ_d) and Multinomial(Φ_{z_i}) distributions, independently from all other pairs.

Note that for each i in the last line of expression 3.5, (Φ_{z_i})_{w_i} and (Θ_d)_{z_i} are independent random variables by construction, as shown in the plate notation of the hierarchical model in figure 3.1 and in the generative process. All (Φ_i)_j are drawn from their prior distribution, independently from Θ_d for all d ∈ {1,...,M}. Therefore the expectation in 3.5 can be split up. Remember that among the (Φ_{z_i})_{w_i} for i ∈ {1,...,N_d} there is a dependence structure, in particular because Σ_{j=1}^{V} (Φ_{z_i})_j = 1, so these cannot be split up. The same can be said about the (Θ_d)_{z_i}, because also Σ_{k=1}^{K} (Θ_d)_k = 1.

Continuing derivation 3.5:

\begin{align*}
p(W = w, Z = z) &= E\left[\prod_{i=1}^{N_d} (\Phi_{z_i})_{w_i} (\Theta_d)_{z_i}\right] \\
&= E\left[\prod_{i=1}^{N_d} (\Phi_{z_i})_{w_i}\right] \cdot E\left[\prod_{i=1}^{N_d} (\Theta_d)_{z_i}\right] \\
&= E\left[\prod_{k=1}^{K}\prod_{j=1}^{V} (\Phi_k)_j^{(n_k)_j}\right] \cdot E\left[\prod_{k=1}^{K} (\Theta_d)_k^{(m)_k}\right] \\
&\overset{*}{=} \left(\prod_{k=1}^{K} E\left[\prod_{j=1}^{V} (\Phi_k)_j^{(n_k)_j}\right]\right) \cdot E\left[\prod_{k=1}^{K} (\Theta_d)_k^{(m)_k}\right] \qquad (3.6)
\end{align*}

where we define (m)_k as the number of times a word in document d is assigned to topic k, and (n_k)_j as the number of times a word in the document is assigned to topic k and the word is equal to word j in the vocabulary. Note that this is a logical thing to do, as the probability of a word occurring 5 times in a document is the probability of that word to the power 5, i.e. (P(word))^5. At * we can take the product over all topics out of the expectation, because for each topic, the vector Φ_k is drawn from the Dirichlet(β) distribution independently of all other topics.

The only difficult expressions left are the product moments of respectively Φ_k and Θ_d: E[∏_{j=1}^{V} (Φ_k)_j^{(n_k)_j}] and E[∏_{k=1}^{K} (Θ_d)_k^{(m)_k}]. We use the expression for the product moment of a Dirichlet distribution, as derived in section 2.2 in equation 2.18, to obtain the final result for the joint distribution of the word and topic vectors in document d.

\begin{align*}
p(W=w, Z=z) &= \left(\prod_{k=1}^{K} E\left[\prod_{j=1}^{V} (\Phi_k)_j^{(n_k)_j}\right]\right) \cdot E\left[\prod_{k=1}^{K} (\Theta_d)_k^{(m)_k}\right] \\
&= \prod_{k=1}^{K}\left[\frac{\Gamma\!\left(\sum_{i=1}^{V}(\beta)_i\right)}{\Gamma\!\left(\sum_{j=1}^{V}\left((n_k)_j+(\beta)_j\right)\right)}\prod_{j=1}^{V}\frac{\Gamma\!\left((n_k)_j+(\beta)_j\right)}{\Gamma\!\left((\beta)_j\right)}\right] \cdot \frac{\Gamma\!\left(\sum_{i=1}^{K}(\alpha)_i\right)}{\Gamma\!\left(\sum_{i=1}^{K}\left((m)_i+(\alpha)_i\right)\right)}\prod_{k=1}^{K}\frac{\Gamma\!\left((m)_k+(\alpha)_k\right)}{\Gamma\!\left((\alpha)_k\right)} \\
&= \left(\frac{\Gamma\!\left(\sum_{i=1}^{V}(\beta)_i\right)}{\prod_{j=1}^{V}\Gamma\!\left((\beta)_j\right)}\right)^{K}\prod_{k=1}^{K}\frac{\prod_{j=1}^{V}\Gamma\!\left((n_k)_j+(\beta)_j\right)}{\Gamma\!\left(\sum_{j=1}^{V}\left((n_k)_j+(\beta)_j\right)\right)} \cdot \frac{\Gamma\!\left(\sum_{i=1}^{K}(\alpha)_i\right)}{\prod_{k=1}^{K}\Gamma\!\left((\alpha)_k\right)}\cdot\frac{\prod_{k=1}^{K}\Gamma\!\left((m)_k+(\alpha)_k\right)}{\Gamma\!\left(N_d+\sum_{i=1}^{K}(\alpha)_i\right)} \qquad (3.7)
\end{align*}

The application of LDA to only one document is not very informative. Therefore we now assume that the corpus consists of M documents. Because each document has a document-topic distribution that is independent of all other documents, the extension of equation 3.7 to the case in which there are M documents is not very difficult. The same steps as before for one document are followed.

\begin{align*}
p(W=w, Z=z) &= E\left[\prod_{d=1}^{M}\prod_{i=1}^{N_d} p\left((W_d)_i=(w_d)_i, (Z_d)_i=(z_d)_i \mid \Theta_d, \Phi\right)\right] \\
&= E\left[\prod_{d=1}^{M}\prod_{i=1}^{N_d} (\Phi_{(z_d)_i})_{(w_d)_i}\,(\Theta_d)_{(z_d)_i}\right] \\
&= E\left[\prod_{d=1}^{M}\prod_{i=1}^{N_d} (\Phi_{(z_d)_i})_{(w_d)_i}\right]\cdot E\left[\prod_{d=1}^{M}\prod_{i=1}^{N_d} (\Theta_d)_{(z_d)_i}\right] \\
&= \left(\prod_{k=1}^{K} E\left[\prod_{j=1}^{V}(\Phi_k)_j^{(n_k)_j}\right]\right)\cdot\prod_{d=1}^{M} E\left[\prod_{k=1}^{K}(\Theta_d)_k^{(m_d)_k}\right] \qquad (3.8)
\end{align*}

The word and topic count vectors from before, n and m, are slightly changed. Now, (n_k)_j represents the number of times we observe the word-topic pair (w, z) = (j, k) in the whole corpus, thus in all documents. (m_d)_k is the frequency of topic k in document d, so this count is still on document level. Again, we can apply the formulas for the product moment of a Dirichlet distribution and we arrive at:

\begin{align*}
p(W=w, Z=z) &= \left(\prod_{k=1}^{K} E\left[\prod_{j=1}^{V}(\Phi_k)_j^{(n_k)_j}\right]\right)\cdot\prod_{d=1}^{M} E\left[\prod_{k=1}^{K}(\Theta_d)_k^{(m_d)_k}\right] \\
&= \prod_{k=1}^{K}\left[\frac{\Gamma\!\left(\sum_{i=1}^{V}(\beta)_i\right)}{\Gamma\!\left(\sum_{j=1}^{V}\left((n_k)_j+(\beta)_j\right)\right)}\prod_{j=1}^{V}\frac{\Gamma\!\left((n_k)_j+(\beta)_j\right)}{\Gamma\!\left((\beta)_j\right)}\right]\cdot\prod_{d=1}^{M}\left[\frac{\Gamma\!\left(\sum_{k=1}^{K}(\alpha)_k\right)}{\Gamma\!\left(\sum_{k=1}^{K}\left((m_d)_k+(\alpha)_k\right)\right)}\prod_{k=1}^{K}\frac{\Gamma\!\left((m_d)_k+(\alpha)_k\right)}{\Gamma\!\left((\alpha)_k\right)}\right] \\
&= \left(\frac{\Gamma\!\left(\sum_{j=1}^{V}(\beta)_j\right)}{\prod_{j=1}^{V}\Gamma\!\left((\beta)_j\right)}\right)^{K}\left(\frac{\Gamma\!\left(\sum_{k=1}^{K}(\alpha)_k\right)}{\prod_{k=1}^{K}\Gamma\!\left((\alpha)_k\right)}\right)^{M}\cdot\prod_{k=1}^{K}\frac{\prod_{j=1}^{V}\Gamma\!\left((n_k)_j+(\beta)_j\right)}{\Gamma\!\left(\sum_{j=1}^{V}\left((n_k)_j+(\beta)_j\right)\right)}\cdot\prod_{d=1}^{M}\frac{\prod_{k=1}^{K}\Gamma\!\left((m_d)_k+(\alpha)_k\right)}{\Gamma\!\left(N_d+\sum_{k=1}^{K}(\alpha)_k\right)} \qquad (3.9)
\end{align*}

To obtain the distribution of all words w given the hyperparameters α and β only, we need to sum over all possible values of vector z which has a multivariate discrete distribution. Every topic (zd)i in document d linked to word i, can take a value in {1,...,K }. Therefore, we need to sum over a huge set of possible values of z. That is:

\[ p(w \mid \alpha, \beta) = \sum_{z_i} p(W = w, Z = z_i \mid \alpha, \beta) \qquad (3.10) \]

where z_i is some configuration of all topic assignments in the corpus. This vector has length Σ_{d=1}^{M} N_d, i.e. the same length as the corpus. The number of possible configurations z_i is therefore K^{Σ_{d=1}^{M} N_d}. This sum causes computational problems, which is why it is considered very challenging to compute the actual corpus probability.
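To get a feeling for the size of this sum, a quick computation with illustrative corpus dimensions (the numbers are only examples):

```python
import math

K = 20                          # topics
corpus_length = 10_000 * 100    # e.g. 10,000 reviews of 100 words each

# log10 of the number of topic configurations K ** corpus_length
print(corpus_length * math.log10(K))   # ~1.3e6, i.e. a count with about 1.3 million digits
```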

3.4. Improvements and adaptations to basic LDA model

Latent Dirichlet Allocation is the simplest unsupervised topic model. Because it is applied in different scientific fields [19], there are many extensions and applications of LDA, of which the most important ones (invented between 2003 and 2017) are summarized in [19]. To give an idea of the vast area of modeling possibilities, we mention the extensions that are the most interesting for applications in opinion mining below.

LDA is a hierarchical Bayesian model, and hierarchical models can easily be extended by just adding a node to the graphical structure, which is done in the dynamic topic model [6]. This model gives information on how the average topic distribution and the word distributions per topic evolve. The model was developed for political sciences, but it can also be applied to opinion mining: the development of customer opinion over time can say something about, e.g., the durability of the product, or the effect of a campaign might be reflected in people's views. Another model that looks at the evolution of topics over time is presented in [51].

LDA can also be adapted by changing the model parameters that reflect our prior belief about the topics. With the adaptation in [49], we can influence the topic distribution beforehand by steering documents towards one broad topic. If there is a reasonable prior belief that this will occur in the review data set, this can improve the results of the topic model.

Multi-grain LDA makes a distinction between local and global topics [44]. This distinction can be incorporated into the model by adding a layer to the hierarchical structure of LDA. In the setting of review analyses, local topics can be thought of as ratable aspects like price or ease of use. Global topics are then the types of products or brands. In this way, you can retrieve information about competitive products and improve decision-making on product development to outperform your competitors.

To quickly summarize a large set of research papers, one can use the labels or tags that are usually mentioned in an article to improve the topic model. In this LDA extension, the labels form an extra layer in the hierarchical scheme of LDA, and the most probable words are now given per label instead of per topic. Topics and labels are considered the same. This model can, therefore, be seen as a semi-supervised topic model, as we know the topics beforehand. Although it is more useful in the scientific world, it can also be applied to review analysis, as sometimes customers give their opinion on each aspect specifically.

As previously mentioned, topics in opinion mining or review analysis are often aspects of the product about which customers give their opinion. Therefore, it is useful to make a distinction between aspect words and background words. In the topic-aspect model described in [36] this distinction is made, and the most probable words per aspect and topic (e.g., product type) are given.

In basic LDA, a bag-of-words representation is used. This means that each document consists of a set of words, whose order is ignored. Every word is an individual entity and its relation with the surrounding words is lost. Finding topics is therefore done on document level, making it difficult to link the corresponding opinion words (e.g. ‘nice’) with the right aspect of the product. In sentence LDA [21] this bag-of-words assumption is slightly relaxed because it is assumed that a sentence consists of only one topic. In the same paper, the aspect and sentiment unification model (ASUM) is described. In this model, it is assumed that every document has one sentiment and tells about different topics. As a result, highly probable words per sentiment-aspect combination are given. Therefore, one can draw more detailed conclusions on the sentiment about specific topics.

The topic sentiment mixture model (TSMM) belongs to the same type of models. In this model, a distinction between background, positive and negative words is made [31]. Again, the results consist of lists with the most probable words per topic and per type of words (background, positive or negative). Note that these models, sentence LDA, ASUM and TSMM, are comparable concerning functionality. Differences can mostly be found in the order of picking sentiments or aspects. Compare the case in which customers first decide if they are positive, neutral or negative about a product and then decide on their opinion of the aspect, with the case in which the customers first decide which topics they want to mention in their review and subsequently what their sentiment is on these topics. Other similar models are the joint sentiment topic (JST) model [26], sentiment LDA [25], dependency-sentiment LDA [25], the reverse joint sentiment topic model [27] and the latent aspect rating analysis (LARA) model [50].

One of the newest extensions is called part-of-speech LDA [10]. This method introduces syntactic information, that is, the function of words in a sentence, into LDA. The bag-of-words representation is let go of, and word order is incorporated. The results consist again of lists of highly probable words, but now per combination of topic and syntax class. Syntax categories are for example nouns, adjectives, adverbs or determinants, but there are many more [15]. This last model is used as inspiration for the development of a new extension of LDA that fills the needs of CQM in their ratings and review studies.

4

Inference methods for LDA

“We may at once admit that any inference from the particular to the general must be attended with some degree of uncertainty, but this is not the same as to admit that such inference cannot be absolutely rigorous, for the nature and degree of the uncertainty may itself be capable of rigorous expression.” Sir Ronald Fisher (1890-1962)

While Latent Dirichlet Allocation is usually described as a generative process, the actual use of it is the reverse. The corpus is formed by a set of reviews, whose words are observed, while the topics they belong to are unknown. The goal of LDA is to determine which topics customers write about and which topics occur more often than others. In other words, the aim is to estimate the topic-word distributions Φ_k for k ∈ {1,...,K} and the document-topic distributions Θ_d for d ∈ {1,...,M}. The topic assignments (Z_d)_i for d ∈ {1,...,M} and i ∈ {1,...,N_d} are merely auxiliary variables that link the document-topic distributions with the topic-word distributions.

As described before, LDA is a hierarchical Bayesian model. On the two ends of the hierarchical structure in figure 3.1, that is on Θ and Φ, priors are imposed, representing the degree of belief in values of the document- topic distributions and the topic-word distributions respectively. The priors are probability densities with fixed parameters, respectively α and β. After having observed the data, i.e., the words, posterior probabilities can be constructed. The mechanics of Bayesian statistics were explained in section 2.1.

One of the advantages of this Bayesian way of estimating a variable is that we can give extra information to the model. Expert opinions can be taken into account by choosing a prior that reflects their belief. Thus, we can select the values of α and β such that they correspond with our expectation on a typical document- topic distribution and topic-word distribution. If we do not have any prior knowledge, we can choose the hyperparameters to be equal to vectors with only 1’s, which is the multivariate uniform distribution and therefore is an uninformative prior.

The posterior distribution of all hidden variables Θ, Φ and Z can be expressed using Bayes' rule, where we abuse the notation of Φ and Θ, which actually represent sets of vectors: Φ = {φ_1, ..., φ_K} and Θ = {θ_1, ..., θ_M}. In w, all words from all documents are collected, and in z all topics.

\[ p(\theta, \phi, z \mid w, \alpha, \beta) = \frac{p(\theta, \phi, z, w \mid \alpha, \beta)}{p(w \mid \alpha, \beta)} \qquad (4.1) \]

However, we are only interested in Θ_d for d = 1,...,M and Φ_k for k = 1,...,K, so we can marginalize out the topic assignments z.


Then, the posterior becomes:

\begin{align*}
p(\theta, \phi \mid w, \alpha, \beta) &= \frac{p(w \mid \theta, \phi)\, p(\theta, \phi \mid \alpha, \beta)}{p(w)} \\
&= \frac{\left[\prod_{d=1}^{M}\prod_{i=1}^{N_d} p(\tilde{w}_{d,i} \mid \theta_d, \phi)\right]\cdot\left[\prod_{d=1}^{M} p(\theta_d \mid \alpha)\right]\cdot\left[\prod_{k=1}^{K} p(\phi_k \mid \beta)\right]}{p(w)} \\
&= \frac{\left[\prod_{d=1}^{M}\prod_{i=1}^{N_d}\left(\sum_{k=1}^{K} p(\tilde{w}_{d,i} \mid (z_d)_i = k, \phi_k)\, p((z_d)_i = k \mid \theta_d)\right)\right]\cdot\left[\prod_{d=1}^{M} p(\theta_d \mid \alpha)\right]\cdot\left[\prod_{k=1}^{K} p(\phi_k \mid \beta)\right]}{p(w)} \\
&= \frac{\left[\prod_{d=1}^{M}\prod_{i=1}^{N_d}\left(\sum_{k=1}^{K}(\phi_k)_{(w_d)_i}(\theta_d)_k\right)\right]\cdot\left[\prod_{d=1}^{M} p(\theta_d \mid \alpha)\right]\cdot\left[\prod_{k=1}^{K} p(\phi_k \mid \beta)\right]}{p(w)} \\
&\propto \left[\prod_{d=1}^{M}\prod_{j=1}^{V}\left(\sum_{k=1}^{K}(\phi_k)_j(\theta_d)_k\right)^{n_{d,j}}\right]\cdot\left[\prod_{d=1}^{M}\prod_{k=1}^{K}(\theta_d)_k^{(\alpha)_k-1}\right]\cdot\left[\prod_{k=1}^{K}\prod_{j=1}^{V}(\phi_k)_j^{(\beta)_j-1}\right] \qquad (4.2)
\end{align*}

Here n_{d,j} is the frequency of word j in document d. The first term between square brackets on the right-hand side makes posterior inference difficult, due to the coupling, through a summation, between the topic-word distribution parameters (Φ_k)_j and the document-topic distribution parameters (Θ_d)_k. Posterior inference aims to retrieve an estimate of the parameters of interest. Possible estimators in the Bayesian setting are the posterior mean or the posterior mode. We will consider both the posterior mean and the posterior mode for posterior inference on LDA. In the literature, it is often immediately concluded that the posterior is intractable, see, e.g., [7], and approximation methods are applied. We do not wish to draw that conclusion so quickly; therefore we search for cases in which 'analytical' posterior inference can be done. As mentioned in chapter 2, there are two possibilities for estimators based on the posterior distribution: the posterior mean and the posterior mode. In the case of LDA, multiple sources (e.g. [7]) mention multimodality of the posterior distribution. Therefore, taking the posterior mean might not be wise, because it averages over the modes. This phenomenon is shown for the simple case in which there are two topics. There is only one parameter (actually a random variable) Θ_d, namely the probability of the document belonging to the first topic. The probability of the second topic is then 1 − Θ_d. Consider the Bayesian statistics example from chapter 2 and figures 2.1a and 2.1b. Now take a closer look at the posterior distribution:

(a) Posterior density of Θ_d. (b) Posterior density of 1 − Θ_d.

Figure 4.1: Posterior densities of Θ_d and 1 − Θ_d, i.e. the probabilities of respectively topic 1 and topic 2 for document d.

Both the posterior mean and the posterior mode are good estimators for Θ_d. Naturally, the estimator for the probability of topic 2 is then 1 − θ̂_d. Only one aspect of LDA has not been taken into account so far, namely topic exchangeability. We can call topic 1 topic 2 and vice versa. There is nothing wrong with this, as the index of a topic is just a name. In that case, the two graphs in figure 4.1 interchange and the posterior mode probability of topic 1 is approximately 0.6 instead of 0.4. This topic exchangeability causes multiple modes to arise in the complete posterior, where both Θ and Φ are variables. The computation of the posterior mean for each Θ_d then results in a value equal to 1/K (approximately, depending on the data). The multimodality in an example with two topics, two documents, and two possible words is shown in figure 4.2.

(a) Posterior density p(θ_1 = 0, θ_2 = 1, φ_1, φ_2). (b) Posterior density p(θ_1 = 1, θ_2 = 0, φ_1, φ_2).

Figure 4.2: Visualization of the posterior densities p(θ_1, θ_2, φ_1, φ_2 | w_1, w_2, α, β) with θ_1 and θ_2 fixed at the value of a mode. The maximum of the 4-dimensional posterior density is found using grid search. The two documents consist of only two possible words, for ease of notation denoted by 1 and 2. Document 1 is then '2 2 1 1 1' and document 2 consists of '1 1 1 1 1 1 2'. Hyperparameters α and β are set to 0.9 and 1.1 respectively and are symmetric, i.e. the same for all dimensions.

The two modes in figure 4.2 contain the same information and actually also show the same result. The posterior modes are θ_1 = 1, θ_2 = 0, φ_1 = 0.6, φ_2 = 0.85 and θ_1 = 0, θ_2 = 1, φ_1 = 0.85, φ_2 = 0.6, where it is easily seen that the two topics are interchanged to get from the first mode to the second. The posterior mean values of the parameters are θ_1 = 0.5, θ_2 = 0.5, φ_1 = 0.64, φ_2 = 0.64. As a result, we see that indeed the posterior mean values for θ are 1/K. Thus, in terms of topic distributions per document, the posterior mean is uninformative. It is not realistic that every document is about all topics to the same extent. Therefore, the posterior mean is not the best choice of estimator.

4.1. Posterior mean

Nevertheless, research has been done on the computation of the posterior mean for LDA in any dimension. A summary of this research is given in the next section. The explanation of the possible methods for calculating the posterior mean and their disadvantages can be found in appendix A.3. Although posterior mean estimation may not be very informative in terms of useful topics, Markov chain Monte Carlo methods show unusual behavior for LDA: the fact that MCMC methods do not handle the problematic form of the posterior density well actually results in good estimates of the latent variables, as will be explained in section 4.1.2.

4.1.1. Analytical determination

First of all, one might think that it is not possible to compute the posterior mean, as we know the posterior distribution only up to a constant. That is, the term p(w) is too difficult to calculate. Blei et al. even conclude that it is the reason for the intractability of the posterior [7]. In many cases in statistics, it is true that the posterior mean cannot be determined if the posterior distribution is not fully known, because the proportionality constant influences the actual value of the posterior mean. However, in this case, we are lucky, as we know beforehand that our parameter vector elements need to sum to 1. That is, Σ_i (Θ_d)_i = 1 for all d ∈ {1,...,M} and Σ_j (Φ_k)_j = 1 for all k ∈ {1,...,K}. Therefore, assume we know the posterior mean up to a constant a:

\[ (\hat{\theta}_d)_i^{(\mathrm{true})} = a \cdot (\hat{\theta}_d)_i^{(\mathrm{est})} \qquad (4.3) \]

with 'true' denoting the true posterior mean and 'est' the posterior mean estimated using the posterior in equation 4.2. Then the actual posterior mean values can be determined with:

\[ (\hat{\theta}_d)_i^{(\mathrm{true})} = \frac{a \cdot (\hat{\theta}_d)_i^{(\mathrm{est})}}{\sum_{i=1}^{K} a \cdot (\hat{\theta}_d)_i^{(\mathrm{est})}} = \frac{(\hat{\theta}_d)_i^{(\mathrm{est})}}{\sum_{i=1}^{K} (\hat{\theta}_d)_i^{(\mathrm{est})}} \qquad (4.4) \]

The same procedure can be followed for the posterior mean estimators of all vectors Φ_k, k = 1,...,K. Hence, the actual value of p(w) is not needed for the computation of the posterior means and we can continue with the posterior in equation 4.2.

In appendix A.3 it is explained how this posterior mean can be computed analytically, and if that is not possible, how it can be approximated. Unfortunately, the circumstances and model choices in Latent Dirichlet Allocation are such that neither analytical computation nor feasible approximation of the posterior mean is possible. Therefore, it is better to focus on either the posterior mode or the posterior mean over a subspace of the domain around the posterior mode. The latter can be approximated using Markov chain Monte Carlo methods and is elaborated on in the next section.

4.1.2. Markov chain Monte Carlo methods

Markov chain Monte Carlo (MCMC) methods form a collection of techniques with which samples from the posterior distribution can be obtained. If there are enough samples, the techniques result in a good approximation of the posterior distribution. With this posterior density approximation, the posterior mean can be computed. The official definition of an MCMC method is given in [46].

Definition 4.1 (MCMC method) A Markov chain Monte Carlo method for the simulation of a distribution π is any method producing an ergodic Markov chain whose stationary distribution is π.

Remember that one can think of an ergodic Markov chain as a chain in which each state can be reached from every other state. The stationary distribution of the Markov chain is its distribution in the limit of infinitely many samples.

There are various simulation methods proposed in the literature: the Metropolis-Hastings algorithm, Gibbs sampling and collapsed Gibbs sampling, where the latter is an extension of Gibbs sampling.

Metropolis-Hastings

The Metropolis-Hastings algorithm was invented in 1953 by, among others, Nicholas Metropolis. W.K. Hastings generalized the idea in 1970 to the now commonly known 'Metropolis-Hastings' algorithm. To summarize the idea of this algorithm, the notation of Smith and Roberts in [41] is followed.

Let π(x) = π(x_1, ..., x_k) for x_j ∈ R^n, j = 1,...,k, denote a joint density and let π(x_i | x_{−i}) denote the conditional density of x_i for i ∈ {1,...,k} given all other x_j's, i.e. x_{−i} = (x_j, j ≠ i). The goal of the algorithm is to construct a Markov chain X^1, ..., X^t, ... with state space Ω and equilibrium distribution π(x). The state space Ω is the space of all values that x can take.

The Metropolis-Hastings algorithm works with transition probabilities from one state to the next, e.g. from X^t = x to X^{t+1} = y for some state y ∈ Ω. These transition probabilities are denoted with the transition probability function q, such that q(x, y) = P(X^t = x, X^{t+1} = y). However, further randomization is applied: with some probability the new state X^{t+1} is accepted, while with complementary probability the new state is rejected and X^{t+1} remains in the same state as X^t. Intuitively speaking, we do not jump to the next state, but remain in the state in which we already were. Formally, the transition probability of the Markov chain is given by the function p. For ease of notation, we have left out the vector notation.

\[ p(x, y) = \begin{cases} q(x, y)\,\alpha(x, y) & \text{if } y \neq x \\ 1 - \sum_{y' \in \Omega} q(x, y')\,\alpha(x, y') & \text{if } y = x \end{cases} \qquad (4.5) \]

With:

\[ \alpha(x, y) = \begin{cases} \min\left\{\dfrac{\pi(y)\,q(y, x)}{\pi(x)\,q(x, y)}, 1\right\} & \text{if } \pi(x)\,q(x, y) > 0 \\ 1 & \text{if } \pi(x)\,q(x, y) = 0 \end{cases} \qquad (4.6) \]

Note that the so-called detailed balance is satisfied: π(x)p(x,y) = π(y)p(y,x), which is necessary for the Markov chain to be irreducible and aperiodic, ergo ergodic. With this condition, π(x) will be the equilibrium distribution of the Markov chain, as desired [41]. Furthermore, because in the computation of α(x,y) the joint density π arises both in the numerator and the denominator, it is only needed to know π up to a proportionality constant.

For the transition probability function q there are several possibilities to choose from [46]. If π is a continuous density, one can take for example a random walk distribution for q. Another possibility is taking q(x,·) = f(·), such that the transition probability from state x is independent of x. It is useful to take for f a probability density that resembles the target function π, but is tractable, in contrast to π. More possibilities for q can be found in [46]. Because in the case of LDA there are many parameters, resulting in a high dimensionality, it is difficult to find good proposal densities q [46]. Therefore, we will not dive more deeply into the Metropolis-Hastings algorithm and its properties, but rather resort to Gibbs sampling.

Gibbs sampling

Often applied to (variants of) LDA is Gibbs sampling, which is actually a special case of the one-at-a-time version of the Metropolis-Hastings algorithm. In one-at-a-time Metropolis-Hastings, only one component of the vector x is sampled at a time, as the name suggests. For each x_i, i ∈ {1,...,k}, there is a specific transition probability function q_i. First, consider a two-dimensional example of sampling π(x_1, x_2), where x = (x_1, x_2) denotes the current state and y = (y_1, y_2) the next state. Thus, supposing we have (x_1, x_2), the Markov chain evolves using the following steps [46]:

1. Draw y_1 ∼ q_1(x_1, y_1 | x_2), where conditioning on x_2 means that you keep the value of x_2 fixed in this step.

2. Accept the transition with probability

\[ \alpha_1 = \min\left\{\frac{\pi(y_1, x_2)\, q_1(y_1, x_1 \mid x_2)}{\pi(x_1, x_2)\, q_1(x_1, y_1 \mid x_2)}, 1\right\} \qquad (4.7) \]

else y_1 = x_1 (no jump).

3. Draw y_2 ∼ q_2(x_2, y_2 | y_1), where conditioning on y_1 means that you keep the obtained value of the first component (y_1) from the current/next index in the Markov chain fixed.

4. Accept the transition with probability

\[ \alpha_2 = \min\left\{\frac{\pi(y_1, y_2)\, q_2(y_2, x_2 \mid y_1)}{\pi(y_1, x_2)\, q_2(x_2, y_2 \mid y_1)}, 1\right\} \qquad (4.8) \]

else y_2 = x_2 (no jump).

The Gibbs sampler then arises when you use the following transition probability functions q_i, again for the 2-dimensional example:

\[ q_1(x_1, y_1 \mid x_2) = \pi(y_1 \mid x_2), \qquad q_2(x_2, y_2 \mid y_1) = \pi(y_2 \mid y_1) \qquad (4.9) \]

That is, when the transition probability functions q_i are defined as the conditional distributions derived from the target distribution π and only one component of the random vector is sampled at a time, the Metropolis-Hastings method becomes Gibbs sampling. Note that with these functions q_i, the acceptance probabilities α_i will always be 1; therefore, you always draw the next state in the Markov chain from the conditional distribution.

Gibbs sampling is only applicable when the conditional distributions are of a known form. Therefore, Bayesian statisticians usually use conjugate priors, such that the posterior distribution is from the same family of distributions as the prior. Then, one can easily sample from conditional distributions. If the posterior distribution is not a known density, it is very challenging to obtain samples, which is why in those cases Metropolis-Hastings is used. One of the main advantages of Gibbs sampling or Markov chain Monte Carlo methods in general is that they converge under relatively weak assumptions. In [42], these assumptions are explained. Define K (x,y) to be the transition kernel of the Markov chain, that is, for the two-dimensional example above [42]:

\[ K(x, y) = \pi(y_1 \mid x_2) \cdot \pi(y_2 \mid y_1) \qquad (4.10) \]

provided that ∫ π(y_1, x_2) dy_1 > 0 and ∫ π(y_1, y_2) dy_2 > 0; otherwise, in determining the conditional densities, we would divide by 0. If one of these conditions is not satisfied, K(x,y) = 0 by definition. The kernel K(x,y) maps from D × D to R^2, where D = {x ∈ Ω, π(x) > 0}, with Ω the state space of x, in this section chosen to be R^2. Smith and Roberts state convergence of the Gibbs sampler in the following theorem:

Theorem 4.1 (Convergence of the Gibbs sampler)
If K is π-irreducible and aperiodic, then for all x ∈ D:

1. \( \int_\Omega \left|K^{(t)}(x, y) - \pi(y)\right| dy \to 0 \) as \( t \to \infty \).

2. For any real-valued, π-integrable function f,
\[ t^{-1}\left(f(X^1) + \dots + f(X^t)\right) \to \int_\Omega f(x)\,\pi(x)\, dx \]
almost surely as \( t \to \infty \).

Here t is the number of samples in the Markov chain. K^{(t)} is the kernel describing t iterations, more thoroughly explained in [42]. From the theorem, we can conclude that the kernel K converges to π in L^1. Furthermore, the sample mean of all samples in the Markov chain (corrected for the initial, transient phase, also called the burn-in period) converges to the actual mean of π almost surely. Hence, in Gibbs sampling, the sample mean is used as an approximation of the posterior mean. To use theorem 4.1, we need to show that the kernel is π-irreducible and aperiodic. Smith and Roberts give simple conditions for the standard Gibbs sampling formulas to satisfy these two assumptions in [42]. They explain that it is hard to find settings for Gibbs sampling in which these two assumptions are not satisfied. Therefore, we expect Gibbs sampling to be a reliable method to obtain posterior mean estimates for the latent random variables of interest in LDA.

Example Gibbs sampling

To illustrate the Gibbs sampling algorithm, a 2-dimensional example is worked out, which is, in fact, the 2-dimensional basis of LDA. Therefore, the link to higher dimensions is easy to make, as the hierarchical scheme of LDA has already been explained. Consider a group of students that take a test. The test consists of N questions, and each student is assumed to have a probability Θ_i of giving the right answer to a question. There are n students, and the only observed variables are X_1, ..., X_n, the number of questions that each student has answered correctly. Furthermore, we assume that each student works independently and all questions are equally difficult, so X_i ∼ Binomial(Θ_i, N) with no correlation between the X_i. The probability of giving the right answer is a priori Beta distributed, i.e. Θ_i ∼ Beta(a, b) i.i.d. Then the posterior distribution of Θ | X can be derived.

\begin{align*}
p(\theta \mid x, a, b) &= \frac{p(\theta, x \mid a, b)}{p(x \mid a, b)} \\
&\propto p(x \mid \theta, a, b)\, p(\theta \mid a, b) \\
&= \prod_{i=1}^{n} \binom{N}{x_i}\,\theta_i^{x_i} (1-\theta_i)^{N - x_i} \cdot \frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)}\,\theta_i^{a-1}(1-\theta_i)^{b-1} \\
&\propto \prod_{i=1}^{n} \theta_i^{x_i + a - 1} (1-\theta_i)^{N - x_i + b - 1} \qquad (4.11)
\end{align*}

Because Θ_i is independent of all other Θ_j for j ≠ i, we can easily see that:

\[ p(\theta_i \mid \theta_1, \dots, \theta_{i-1}, \theta_{i+1}, \dots, \theta_n, x, a, b) \propto \theta_i^{x_i + a - 1}(1 - \theta_i)^{N - x_i + b - 1} \;\Rightarrow\; \Theta_i \mid x_i, a, b \sim \text{Beta}(x_i + a,\, N - x_i + b) \qquad (4.12) \]

Note that the Beta distribution is a conjugate prior to the Binomial distribution. The Gibbs sampling procedure becomes:
• Draw θ_1 from Beta(x_1 + a, N − x_1 + b)
• ...
• Draw θ_n from Beta(x_n + a, N − x_n + b)

When this procedure is executed sufficiently often, the samples for each Θ_i will approximate the posterior distribution p(θ_i | x_i, a, b), with which the value of Θ_i can then be estimated via the posterior mode or posterior mean.
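A minimal sketch of this Gibbs sampler follows; because the full conditionals factorize into independent Beta draws, it is trivial to run and easy to check against the closed-form posterior (the data below are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

N = 40                                 # questions per test
x = np.array([31, 25, 36, 18, 28])     # correct answers per student (illustration data)
a, b = 2.0, 2.0                        # Beta prior hyperparameters
n_iter, burn_in = 5000, 500

samples = np.empty((n_iter, len(x)))
theta = rng.uniform(size=len(x))       # arbitrary initialization
for t in range(n_iter):
    # Full conditional of each theta_i: Beta(x_i + a, N - x_i + b)
    theta = rng.beta(x + a, N - x + b)
    samples[t] = theta

posterior_mean = samples[burn_in:].mean(axis=0)
print(posterior_mean)                  # close to (x + a) / (N + a + b)
```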

Gibbs sampling for LDA

In Latent Dirichlet Allocation, many latent variables need to be retrieved using Gibbs sampling. The observed data are the words w, where w has a length of Σ_{d=1}^{M} N_d. The fixed parameters, also called hyperparameters, are α and β. The unknown parameters are Θ_d for d ∈ {1,...,M}, Φ_k for k ∈ {1,...,K} and Z, which has the same length as w. This means that there are M + K + 1 vectors to be estimated, which together have M·K + K·V + (Σ_{d=1}^{M} N_d) components. To give an idea: assume we have 10,000 reviews of 100 words each, that 20 topics are written about, and that in total there are 1,000 words in the vocabulary. This means that we have to estimate 10,000·20 + 20·1,000 + 10,000·100 = 1,220,000 components. Luckily, it is possible to sample the vectors Θ_d and Φ_k at once, so no component-wise inference needs to be done. The conditional distributions are derived as follows. A single document d from the set {1,...,M} can be considered, because in LDA it is assumed that each document is generated independently. The full conditional distribution of Θ_d is derived.

\[
\begin{aligned}
p(\theta_d \mid \theta_1,\dots,\theta_{d-1},\theta_{d+1},\dots,\theta_M,\phi_1,\dots,\phi_K,\mathbf{z},\mathbf{w},\alpha,\beta)
&= \frac{p(\theta_1,\dots,\theta_M,\phi_1,\dots,\phi_K,\mathbf{z},\mathbf{w} \mid \alpha,\beta)}{p(\theta_1,\dots,\theta_{d-1},\theta_{d+1},\dots,\theta_M,\phi_1,\dots,\phi_K,\mathbf{z},\mathbf{w} \mid \alpha,\beta)} \\
&\propto p(\theta_1,\dots,\theta_M,\phi_1,\dots,\phi_K,\mathbf{z},\mathbf{w} \mid \alpha,\beta) \\
&\propto p(\theta_d,\phi_1,\dots,\phi_K,\mathbf{z}_d,\mathbf{w}_d \mid \alpha,\beta) \\
&= \left( \prod_{i=1}^{N_d} p(\tilde{w}_{d,i} \mid \tilde{z}_{d,i},\theta_d,\phi_1,\dots,\phi_K,\alpha,\beta)\, p(\tilde{z}_{d,i} \mid \theta_d,\phi_1,\dots,\phi_K,\alpha,\beta) \right) \left( \prod_{k=1}^{K} p(\phi_k \mid \beta) \right) p(\theta_d \mid \alpha) \\
&= \left( \prod_{i=1}^{N_d} p(\tilde{w}_{d,i} \mid \tilde{z}_{d,i},\phi_1,\dots,\phi_K)\, p(\tilde{z}_{d,i} \mid \theta_d) \right) \left( \prod_{k=1}^{K} p(\phi_k \mid \beta) \right) p(\theta_d \mid \alpha) \\
&\propto \left( \prod_{i=1}^{N_d} p(\tilde{z}_{d,i} \mid \theta_d) \right) p(\theta_d \mid \alpha) \\
&\propto \prod_{i=1}^{N_d} \prod_{k=1}^{K} (\theta_d)_k^{(\tilde{z}_{d,i})_k} \cdot \prod_{k=1}^{K} (\theta_d)_k^{(\alpha)_k - 1} \\
&= \prod_{k=1}^{K} (\theta_d)_k^{(m_d)_k + (\alpha)_k - 1}
\end{aligned}
\tag{4.13}
\]

Here we define $(m_d)_k$ as the number of times a word in document $d$ is assigned to topic $k$. The expression in 4.13 can be recognized as a Dirichlet distribution:

\[
\Theta_d \mid \theta_1,\dots,\theta_{d-1},\theta_{d+1},\dots,\theta_M,\phi_1,\dots,\phi_K,\mathbf{z},\mathbf{w},\alpha,\beta \;\sim\; \text{Dirichlet}(m_d + \alpha)
\tag{4.14}
\]
with $m_d$ the vector of topic frequencies $(m_d)_k$ for $k = 1,\dots,K$.

Next, the full conditional of $\Phi_t$ for some $t \in \{1,\dots,K\}$ is derived.

\[
\begin{aligned}
p(\phi_t \mid \theta_1,\dots,\theta_M,\phi_1,\dots,\phi_{t-1},\phi_{t+1},\dots,\phi_K,\mathbf{z},\mathbf{w},\alpha,\beta)
&\propto p(\theta_1,\dots,\theta_M,\phi_1,\dots,\phi_K,\mathbf{z},\mathbf{w} \mid \alpha,\beta) \\
&= \prod_{d=1}^{M} \left[ \prod_{i=1}^{N_d} p(\tilde{w}_{d,i} \mid \tilde{z}_{d,i},\phi_{(z_d)_i})\, p(\tilde{z}_{d,i} \mid \theta_d)\, p(\theta_d \mid \alpha) \right] \cdot \prod_{k=1}^{K} p(\phi_k \mid \beta) \\
&\propto \left[ \prod_{d=1}^{M} \prod_{i=1}^{N_d} p(\tilde{w}_{d,i} \mid \tilde{z}_{d,i},\phi_{(z_d)_i}) \right] \cdot p(\phi_t \mid \beta) \\
&\propto \prod_{d=1}^{M} \prod_{i=1}^{N_d} \prod_{j=1}^{V} (\phi_{(z_d)_i})_j^{(\tilde{w}_{d,i})_j} \cdot \prod_{j=1}^{V} (\phi_t)_j^{(\beta)_j - 1} \\
&\propto \prod_{j=1}^{V} (\phi_t)_j^{(n_t)_j + (\beta)_j - 1}
\end{aligned}
\tag{4.15}
\]
where again $(n_t)_j$ represents the number of times word $j$ is assigned to topic $t$ in the whole corpus. Similarly as for $\Theta_d$, the resulting conditional distribution is a Dirichlet:
\[
\Phi_t \mid \theta_1,\dots,\theta_M,\phi_1,\dots,\phi_{t-1},\phi_{t+1},\dots,\phi_K,\mathbf{z},\mathbf{w},\alpha,\beta \;\sim\; \text{Dirichlet}(n_t + \beta)
\tag{4.16}
\]

Lastly, the topics corresponding to each word in each document need to be sampled conditional on all other variables:

\[
\begin{aligned}
p((z_d)_i \mid \theta_1,\dots,\theta_M,\phi_1,\dots,\phi_K,\mathbf{z}_{-(d,i)},\mathbf{w},\alpha,\beta)
&= p(\tilde{z}_{d,i} \mid \theta_1,\dots,\theta_M,\phi_1,\dots,\phi_K,\mathbf{z}_{-(d,i)},\mathbf{w},\alpha,\beta) \\
&\propto p(\theta_1,\dots,\theta_M,\phi_1,\dots,\phi_K,\mathbf{z},\mathbf{w} \mid \alpha,\beta) \\
&= \prod_{d'=1}^{M} \left[ \prod_{i'=1}^{N_{d'}} p(\tilde{w}_{d',i'} \mid \tilde{z}_{d',i'},\phi_{(z_{d'})_{i'}})\, p(\tilde{z}_{d',i'} \mid \theta_{d'})\, p(\theta_{d'} \mid \alpha) \right] \cdot \prod_{k=1}^{K} p(\phi_k \mid \beta) \\
&\propto p(\tilde{w}_{d,i} \mid \tilde{z}_{d,i},\phi_{(z_d)_i})\, p(\tilde{z}_{d,i} \mid \theta_d) \\
&\propto \prod_{k=1}^{K} (\theta_d)_k^{(\tilde{z}_{d,i})_k} \cdot \prod_{j=1}^{V} (\phi_{(z_d)_i})_j^{(\tilde{w}_{d,i})_j} \\
&= (\theta_d)_{(z_d)_i} \cdot (\phi_{(z_d)_i})_{(w_d)_i}
\end{aligned}
\tag{4.17}
\]
Here we used the notation $\mathbf{z}_{-(d,i)}$ for the vector containing all topics except the topic corresponding to word $i$ of document $d$. From expression 4.17, we can conclude that:
\[
\tilde{Z}_{d,i} \mid \theta_1,\dots,\theta_M,\phi_1,\dots,\phi_K,\mathbf{z}_{-(d,i)},\mathbf{w},\alpha,\beta \;\sim\; \text{Multinomial}\bigl((\theta_d)_1 (\phi_1)_{(w_d)_i},\,\dots,\,(\theta_d)_K (\phi_K)_{(w_d)_i}\bigr)
\tag{4.18}
\]

In this distribution, the role of conditioning on word $(w_d)_i$ is clearly visible: the larger the probability of word $(w_d)_i$ in a topic-word distribution $\Phi_k$ for some topic $k$, the larger the probability that $(z_d)_i = k$. Vice versa, if the probability of $(w_d)_i$ is very small for, say, topic 1, the chance that $(z_d)_i = 1$ will be negligible. The Gibbs sampling algorithm (see algorithm 1 below) has excellent intuitive properties. In LDA, we only observe the words in all documents, and we fix the hyperparameters $\alpha$ and $\beta$. Based on just this data, we want to retrieve the per-document topic distribution, the per-topic word distribution, and the topic assigned to each word. Initially, a (randomly drawn) value is assigned to all parameters, and iteratively they are adapted by plugging in information. That is, we update the document-topic probabilities by looking at frequencies of the topics in a document, and the co-occurrence of words and topics influences the topic-word probabilities. More mathematically, the topics assigned to words are determined by a multinomial distribution that retrieves its parameters from a combination of the topic probability itself with the probability of the observed word given a particular topic. In this way, both the observed words and the hyperparameters influence the latent variables in each step, until enough samples of the conditional distributions are obtained to give a meaningful estimate of the hidden parameters. However, the multimodality of the posterior density makes it difficult for the Gibbs sampler to 'walk' over the entire domain. The convergence result from the beginning of this section is only valid if the Gibbs sampler can reach every value of the domain of each latent variable.

Algorithm 1 Gibbs Sampling for LDA

1: Initialize $\theta_1,\dots,\theta_M,\phi_1,\dots,\phi_K,\mathbf{z}$
2: Compute initial frequencies $(m_d)_k$ (for $d = 1$ to $M$, $k = 1$ to $K$) and $(n_k)_j$ (for $k = 1$ to $K$, $j = 1$ to $V$)
3: Fix $N_{iter}$ for the maximum number of iterations
4: for $iter = 1$ to $N_{iter}$ do  ▷ Sample $N_{iter}$ times
5:   for $d = 1$ to $M$ do  ▷ Iterate over documents
6:     Draw $\Theta_d$ from $\text{Dirichlet}(m_d + \alpha)$
7:     for $i = 1$ to $N_d$ do  ▷ Iterate over words
8:       Draw $\tilde{Z}_{d,i}$ from $\text{Multinomial}\bigl((\theta_d)_1 (\phi_1)_{(w_d)_i},\dots,(\theta_d)_K (\phi_K)_{(w_d)_i}\bigr)$
9:     end for
10:  end for
11:  for $k = 1$ to $K$ do  ▷ Iterate over topics
12:    Draw $\Phi_k$ from $\text{Dirichlet}(n_k + \beta)$
13:  end for
14:  Update all frequencies $(m_d)_k$ and $(n_k)_j$
15: end for
16: Compute posterior estimates of the variables $\Theta_1,\dots,\Theta_M,\Phi_1,\dots,\Phi_K,\mathbf{z}$ using the $N_{iter}$ samples from their posterior distributions

An analogy can be made with two mountain tops with a ravine in between. When you, the Gibbs sampler state, are standing on one mountain, it is not possible to get to the other mountain, because you cannot cross the ravine. Maybe it is possible to jump over, but the chance that you will survive is practically zero. With probabilities between the posterior modes being very low, and the latent variables being dependent on each other, the Gibbs sampler will not move from one posterior mode to another, and therefore the estimated posterior density from the samples does not converge to the true posterior density. This might seem unfortunate, but in the application of Gibbs sampling to LDA, it helps. We are not interested in the entire posterior mean, because we already know that it will average over all possible topic permutations, resulting in $\frac{1}{K}$ for each document-topic probability. Nevertheless, because the Gibbs sampler cannot move from one topic permutation (a hill in the posterior density) to another, or only does so with a very small probability, the posterior mean based on the Gibbs samples is the posterior mean of the samples lying around one posterior mode for some topic permutation. For this reason, the estimates found by Gibbs sampling are good estimates for $\Theta$ and $\Phi$, the document-topic and topic-word distributions. Unfortunately, there is still one downside to this procedure. As mentioned before, there are $M \cdot K + K \cdot V + \sum_{d=1}^{M} N_d$ latent variables and only $K + V + \sum_{d=1}^{M} N_d$ fixed/observed variables. Therefore, it is challenging to do inference, and dimension reduction techniques come into play.
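Before moving on to collapsed Gibbs sampling, one sweep of Algorithm 1 can be sketched in a few lines of Python. This is only an illustrative, unoptimized sketch; the function name and data layout are chosen here for clarity, not taken from any particular implementation.

```python
import numpy as np

def gibbs_sweep(words, z, alpha, beta, K, V, rng):
    """One iteration of Algorithm 1 (uncollapsed Gibbs sampling for LDA).

    words: list of M integer arrays (word indices per document)
    z:     list of M integer arrays (current topic assignment per word)
    alpha, beta: hyperparameter vectors of length K and V respectively.
    Returns the newly drawn theta (M x K), phi (K x V) and updated z.
    """
    M = len(words)
    # Count matrices m (document-topic) and n (topic-word) from the current z.
    m = np.zeros((M, K))
    n = np.zeros((K, V))
    for d in range(M):
        np.add.at(m[d], z[d], 1)
        np.add.at(n, (z[d], words[d]), 1)

    theta = np.array([rng.dirichlet(m[d] + alpha) for d in range(M)])
    phi = np.array([rng.dirichlet(n[k] + beta) for k in range(K)])

    for d in range(M):
        for i, w in enumerate(words[d]):
            p = theta[d] * phi[:, w]          # unnormalized Multinomial weights (eq. 4.18)
            z[d][i] = rng.choice(K, p=p / p.sum())
    return theta, phi, z
```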

Collapsed Gibbs sampling Two possible dimension reduction techniques are grouping and collapsing [28]. In grouping, one samples multiple parameters at a time, still using the conditional distribution on all other parameters. In a 3-dimensional example, one can for instance sample $(x_1, x_2)$ conditional on $x_3$. Note that this technique is already implicitly applied in the aforementioned Gibbs sampling method, because we sample a whole vector, e.g. $\Theta_d$, at once. This was done because the joint distribution of all components of, e.g., $\Theta_d$ conditional on all other parameters in the model is well known. Collapsing is the technique on which this section focuses. In collapsing, one variable is integrated out and only sampled after all Gibbs iterations. Looking at the simple 3-dimensional example, one can for instance integrate out $x_3$, such that one iteratively samples $x_1$ from $p(x_1 \mid x_2)$ and $x_2$ from $p(x_2 \mid x_1)$. After this Gibbs sampling procedure is finished and convergence is attained, $x_3$ comes back into play and is sampled from $p(x_3 \mid x_1, x_2)$ [28]. Collapsed Gibbs sampling produces results in fewer steps, because one integrates out $\Theta$ and $\Phi$, such that only the latent variables $\tilde{Z}_{d,i}$ for $d = 1,\dots,M$ and $i = 1,\dots,N_d$ need to be sampled from their conditional distribution. After a certain number of iterations, many samples from the conditional posterior distributions are available, so each $\tilde{Z}_{d,i}$ can be estimated. These estimates are then used to sample $\Theta$ and $\Phi$ from their conditional distributions. We already know from the Gibbs sampling procedure that these are Dirichlet distributed, of which we know the means. Thus, it is possible to compute the posterior means of all $\Theta$ and $\Phi$ directly.

The posterior distribution of $\tilde{Z}_{d,i}$ conditional on all other variables (with $\Phi$ and $\Theta$ integrated out), i.e. $p(\tilde{z}_{d,i} \mid \mathbf{w}, \mathbf{z}_{-(d,i)}, \alpha, \beta)$, can be derived using the joint distribution of all words and topics from equation 3.9. Let us consider the topic corresponding to word $i$ in document $d$. Its conditional distribution $p(\tilde{z}_{d,i} \mid \mathbf{w}, \mathbf{z}_{-(d,i)}, \alpha, \beta)$ can be expressed as follows.

\[
\begin{aligned}
p((z_d)_i \mid \mathbf{w}, \mathbf{z}_{-(d,i)}, \alpha, \beta)
&= \frac{p(\mathbf{w}, \mathbf{z}_{-(d,i)}, (z_d)_i \mid \alpha, \beta)}{p(\mathbf{w}, \mathbf{z}_{-(d,i)} \mid \alpha, \beta)}
 = \frac{p(\mathbf{w}, \mathbf{z} \mid \alpha, \beta)}{p(\mathbf{w}, \mathbf{z}_{-(d,i)} \mid \alpha, \beta)} \\[4pt]
&\propto \frac{\prod_{j=1}^{V}\Gamma\bigl((n_{(z_d)_i})_j + (\beta)_j\bigr)}{\Gamma\Bigl(\sum_{j=1}^{V}\bigl((n_{(z_d)_i})_j + (\beta)_j\bigr)\Bigr)} \cdot
\frac{\Gamma\Bigl(\sum_{j=1}^{V}\bigl((n_{(z_d)_i})_j^{-(d,i)} + (\beta)_j\bigr)\Bigr)}{\prod_{j=1}^{V}\Gamma\bigl((n_{(z_d)_i})_j^{-(d,i)} + (\beta)_j\bigr)} \cdot
\frac{\Gamma\bigl((m_d)_{(z_d)_i} + (\alpha)_{(z_d)_i}\bigr)}{\Gamma\bigl((m_d)_{(z_d)_i}^{-(d,i)} + (\alpha)_{(z_d)_i}\bigr)} \cdot
\frac{\Gamma\Bigl(\sum_{k=1}^{K}\bigl((m_d)_k^{-(d,i)} + (\alpha)_k\bigr)\Bigr)}{\Gamma\Bigl(\sum_{k=1}^{K}\bigl((m_d)_k + (\alpha)_k\bigr)\Bigr)} \\[4pt]
&= \frac{(n_{(z_d)_i})_{(w_d)_i}^{-(d,i)} + (\beta)_{(w_d)_i}}{\sum_{j=1}^{V}\bigl((n_{(z_d)_i})_j^{-(d,i)} + (\beta)_j\bigr)} \cdot
\frac{(m_d)_{(z_d)_i}^{-(d,i)} + (\alpha)_{(z_d)_i}}{\sum_{k=1}^{K}\bigl((m_d)_k^{-(d,i)} + (\alpha)_k\bigr)} \\[4pt]
&= \frac{(n_{(z_d)_i})_{(w_d)_i}^{-(d,i)} + (\beta)_{(w_d)_i}}{\sum_{j=1}^{V}\bigl((n_{(z_d)_i})_j^{-(d,i)} + (\beta)_j\bigr)} \cdot
\frac{(m_d)_{(z_d)_i}^{-(d,i)} + (\alpha)_{(z_d)_i}}{N_d - 1 + \sum_{k=1}^{K}(\alpha)_k}
\end{aligned}
\tag{4.19}
\]
In the proportionality, only the factors involving topic $(z_d)_i$ and document $d$ differ between numerator and denominator; the Gamma-function ratios then simplify because the counts with and without the pair $(d,i)$ differ by exactly one, using $\Gamma(x+1) = x\,\Gamma(x)$.

Note that the sampling distribution for $(Z_d)_i$ depends only on the fixed parameters $\alpha$ and $\beta$ and on the counts $(n_{(z_d)_i})_{(w_d)_i}$, $(n_{(z_d)_i})_j$ and $(m_d)_{(z_d)_i}$. To clarify: $(n_{(z_d)_i})_{(w_d)_i}$ represents the number of times word $(w_d)_i$ is assigned to topic $(z_d)_i$, $(n_{(z_d)_i})_j$ equals the number of times word $j$ is assigned to topic $(z_d)_i$, and $(m_d)_{(z_d)_i}$ is the number of times a word in document $d$ is assigned to topic $(z_d)_i$.

The result in 4.19 shows that the conditional distribution of $\tilde{Z}_{d,i}$ is a Multinomial:

\[
\tilde{Z}_{d,i} \mid \mathbf{w}, \mathbf{z}_{-(d,i)}, \alpha, \beta \;\sim\; \text{Multinomial}(1, Y_{d,i})
\tag{4.20}
\]

where we define $Y_{d,i}$ as:
\[
Y_{d,i} = \left( \frac{(n_1)_{(w_d)_i}^{-(d,i)} + (\beta)_{(w_d)_i}}{\sum_{j=1}^{V}\bigl((n_1)_j^{-(d,i)} + (\beta)_j\bigr)} \cdot \frac{(m_d)_1^{-(d,i)} + (\alpha)_1}{\sum_{k=1}^{K}\bigl((m_d)_k^{-(d,i)} + (\alpha)_k\bigr)},\;\dots,\; \frac{(n_K)_{(w_d)_i}^{-(d,i)} + (\beta)_{(w_d)_i}}{\sum_{j=1}^{V}\bigl((n_K)_j^{-(d,i)} + (\beta)_j\bigr)} \cdot \frac{(m_d)_K^{-(d,i)} + (\alpha)_K}{\sum_{k=1}^{K}\bigl((m_d)_k^{-(d,i)} + (\alpha)_k\bigr)} \right)
\tag{4.21}
\]

As mentioned before, in collapsed Gibbs sampling, initially only the topics are sampled. After these samples are obtained, the parameters that were integrated out, here $\Theta$ and $\Phi$, can be sampled conditional on the observed variables $\mathbf{w}$, the hyperparameters $\alpha, \beta$ and the sampled variables $\mathbf{z}$. From these samples, the values of $\Theta$ and $\Phi$ can be estimated using e.g. the posterior mean. From the derivation in section 4.1.2, we know that $\Theta_d \mid \mathbf{z}, \mathbf{w}, \alpha, \beta \sim \text{Dirichlet}(m_d + \alpha)$ and $\Phi_t \mid \mathbf{z}, \mathbf{w}, \alpha, \beta \sim \text{Dirichlet}(n_t + \beta)$. The posterior means of these distributions are used as estimators.

\[
\text{For } d \in \{1,\dots,M\} \text{ and } j \in \{1,\dots,K\}: \quad (\hat{\theta}_d)_j = \frac{(m_d)_j + (\alpha)_j}{\sum_{k=1}^{K}\bigl((m_d)_k + (\alpha)_k\bigr)}
\tag{4.22}
\]

\[
\text{For } k \in \{1,\dots,K\} \text{ and } i \in \{1,\dots,V\}: \quad (\hat{\phi}_k)_i = \frac{(n_k)_i + (\beta)_i}{\sum_{j=1}^{V}\bigl((n_k)_j + (\beta)_j\bigr)}
\tag{4.23}
\]
To create a better overview, the complete algorithm is given below.

Algorithm 2 Collapsed Gibbs Sampling for LDA
1: Initialize $\mathbf{z}$ and compute the initial frequencies $n$ and $m$
2: Fix $N_{iter}$ for the maximum number of iterations
3: for $iter = 1$ to $N_{iter}$ do
4:   for $d = 1$ to $M$ do  ▷ Iterate over documents
5:     for $i = 1$ to $N_d$ do  ▷ Iterate over words in document $d$
6:       Draw $(Z_d)_i$ from $p((z_d)_i \mid \mathbf{w}, \mathbf{z}_{-(d,i)}, \alpha, \beta)$  ▷ Draw topic for each word
7:       Update $(n_{(z_d)_i})_{(w_d)_i}$ and $(m_d)_{(z_d)_i}$
8:     end for
9:   end for
10: end for
11: Compute posterior estimates of the parameters $\Theta$ and $\Phi$
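For concreteness, a minimal (and deliberately slow) Python sketch of Algorithm 2 might look as follows; the function name and data layout are illustrative only, and the final lines compute the estimators of equations 4.22 and 4.23 from the final state of the counts rather than from averaged samples.

```python
import numpy as np

def collapsed_gibbs(words, K, V, alpha, beta, n_iter, rng):
    """Sketch of Algorithm 2: collapsed Gibbs sampling for LDA.

    words: list of M integer arrays with word indices; alpha (length K) and
    beta (length V) are the hyperparameter vectors.
    """
    M = len(words)
    z = [rng.integers(K, size=len(doc)) for doc in words]    # random initial topics
    m = np.zeros((M, K))                                     # document-topic counts
    n = np.zeros((K, V))                                     # topic-word counts
    for d, doc in enumerate(words):
        np.add.at(m[d], z[d], 1)
        np.add.at(n, (z[d], doc), 1)

    for _ in range(n_iter):
        for d, doc in enumerate(words):
            for i, w in enumerate(doc):
                k_old = z[d][i]
                m[d, k_old] -= 1; n[k_old, w] -= 1           # remove current assignment
                # Unnormalized conditional of eq. 4.19 for every topic k
                p = (n[:, w] + beta[w]) / (n.sum(axis=1) + beta.sum()) * (m[d] + alpha)
                k_new = rng.choice(K, p=p / p.sum())
                z[d][i] = k_new
                m[d, k_new] += 1; n[k_new, w] += 1

    theta_hat = (m + alpha) / (m + alpha).sum(axis=1, keepdims=True)   # eq. 4.22
    phi_hat = (n + beta) / (n + beta).sum(axis=1, keepdims=True)       # eq. 4.23
    return theta_hat, phi_hat
```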

With the estimates of $\Theta$ and $\Phi$, we know the per-document topic distribution and the per-topic word distribution. By determining the most frequent words per topic, one can retrieve information on the topic's theme. With the vector $\Theta_d$ for each document, one can decide how many and which topics each document discusses. Also, one can determine which topics are, on average, mentioned most frequently. In conclusion, Markov chain Monte Carlo methods are slow in terms of convergence, so many samples are needed for good estimates. Furthermore, in the case of LDA, we exploit their disadvantage of getting stuck in one topic permutation and not being able to walk through the entire domain of all latent variables. With the sample mean per latent variable, we get good estimates of $\Theta$ and $\Phi$, because we are only interested in the results of one topic permutation. The question that remains is: why compute the posterior mean using Gibbs sampling, which gets stuck around one posterior mode, when we can also simply determine the posterior mode itself? Methods to calculate the latter are elaborated on in the next section and in chapter 5.

4.2. Posterior mode

Apart from the posterior mean, another estimator for the desired parameters $\Theta_d$ for $d = 1,\dots,M$ and $\Phi_k$ for $k = 1,\dots,K$ can be used: the posterior mode. In the paper in which LDA is introduced [7], Blei et al. use the posterior mode, but not of the actual posterior density; instead, they use the posterior mode of an approximation of it. This method uses variational calculus and is therefore called variational inference. It will be elaborated on in the next section. However, the posterior mode can also be determined using the actual posterior density (up to a proportionality constant). The method to compute the posterior mode using the 'analytical' posterior density is explained in chapter 5, as it is considered a new contribution to the literature.

4.2.1. General variational methods

The posterior density of the hierarchical Bayesian model LDA can be approximated using variational methods. This concerns the posterior density of all latent variables in LDA, so including the topic assignments $\mathbf{Z}$. Although we are only interested in $\Theta$ and $\Phi$, the topic assignments are included in this inference method to be consistent with the application of variational methods to LDA in the literature and in software.

\[
\begin{aligned}
p(\theta,\phi,\mathbf{z} \mid \mathbf{w}) &= \frac{p(\mathbf{w} \mid \theta,\phi,\mathbf{z})\, p(\theta,\phi,\mathbf{z})}{p(\mathbf{w})} \\
&= \frac{\prod_{d=1}^{M} \prod_{i=1}^{N_d} p(\tilde{w}_{d,i} \mid \theta,\phi,\mathbf{z}) \cdot p(\theta,\phi,\mathbf{z})}{p(\mathbf{w})} \\
&= \frac{\prod_{d=1}^{M} p(\theta_d) \prod_{i=1}^{N_d} \prod_{j=1}^{V} (\phi_{(z_d)_i})_j^{(\tilde{w}_{d,i})_j}\, p(\tilde{z}_{d,i} \mid \theta_d) \cdot \prod_{k=1}^{K} p(\phi_k)}{p(\mathbf{w})}
\end{aligned}
\tag{4.24}
\]

As mentioned before, the problem in computing the posterior density is the denominator, which contains a computationally intractable integral. This argument is often given for the application of variational methods. Besides, the numerator has a complicated form that cannot be traced back to a simple multivariate distribution. One can observe the coupling of $\Phi$ and $(Z_d)_i$ in the product; this will cause problems when calculating the posterior mode. Note that in this posterior density, the topic assignments are still included, although they are not of main interest. In chapter 5, in which the posterior mode is calculated using the true posterior density, these latent topic assignments are integrated out in order to make posterior mode determination possible.

One way to deal with the difficult form of the posterior density is to approximate it by a function that makes statistical inference easier. The so-called variational parameters of the approximation function are chosen such that the approximation function is as close as possible to the true posterior density. When the best approximation function is found, the posterior mode of the approximation function can be determined. This posterior mode is then considered to be a good estimator for the model parameters Θ, Φ and Z.

Before diving into the application of variational methods to LDA, the general mechanism is explained using an example. Consider $n$ independent observations summarized in $y = (y_1,\dots,y_n)$ with one-to-one corresponding hidden variables $X = (X_1,\dots,X_n)$ that depend on a parameter $\theta$. The scheme is given in figure 4.3.


Figure 4.3: Schematic overview of an example of variational methods. There are n instances of X and Y , where X depends on the parameter θ. X is a latent random variable, θ is a fixed parameter and y is an observed random variable.

The probability of observing y given the parameter θ is given by:

\[
p(y \mid \theta) = \prod_{i=1}^{n} p(y_i \mid \theta) = \prod_{i=1}^{n} \int_{\Omega} p(x_i, y_i \mid \theta)\, dx_i
\tag{4.25}
\]

with $\Omega$ the set of possible outcomes of $X_i$. The goal in this example is to estimate the parameter $\theta$ via the conditional density $p(x_i \mid y_i, \theta)$ for all $i = 1,\dots,n$. This conditional density is needed, because it forms the link between the data $y$ and the model parameter $\theta$. Beal notes in [1] that for models with many hidden variables, the integral in equation 4.25 can become intractable, making it difficult to compute the likelihood on the left-hand side. Therefore, an approximation is determined for it, or, strictly speaking, for the log likelihood. To this end, an auxiliary distribution $q_{x_i}(x_i)$ over each hidden variable $x_i$ is introduced, where $q_{x_i}(x_i)$ can take on any form. This auxiliary distribution is an approximation of the conditional distribution $p(x_i \mid y_i, \theta)$:
\[
\forall i \in \{1,\dots,n\}: \quad q_{x_i}(x_i) \approx p(x_i \mid y_i, \theta)
\tag{4.26}
\]

Note that $q_{x_i}(x_i)$ is not a function of the observations $y$; it is only an approximate density of the latent variable $X_i$. However, the function as a whole does depend on the observations. Namely, the auxiliary function $q_{x_i}$ approximates the density of the latent variable $X_i$ conditional on the observation $y_i$ and the parameter $\theta$, i.e. $p(x_i \mid y_i, \theta)$. This conditional distribution depends on $y_i$: different observations $\tilde{y}_i$ result in a different conditional distribution $p(x_i \mid \tilde{y}_i, \theta)$. Because $q_{x_i}$ is an approximation of this conditional distribution, one can see that it does indeed depend on $y_i$, but implicitly.

The introduction of $q_{x_i}(x_i)$ can be used to derive a lower bound for the log likelihood. Note that the likelihood in equation 4.25 was intractable, so the auxiliary distributions $q_{x_i}(x_i)$ are chosen such that the lower bound for the likelihood is in fact tractable. For the sake of simplicity, the set $\Omega$ over which $x_i$ is integrated is omitted.

à n ! Y Z L (θ;y) logp(y θ) log p(xi , yi θ)dx = | = i 1 | i = n µ ¶ X Z log p(xi , yi θ)dx = i 1 | i = n µZ ¶ X p(xi , yi θ) (4.27) log qx (xi ) | dx i i = i 1 qxi (xi ) = n Z µ ¶ X p(xi , yi θ) qx (xi )log | dx i i ≥ i 1 qxi (xi ) ∗ = F (q ( ),...,q ( ),θ;y) ≡ x1 · xn · Where at , Jensen’s inequality for concave functions (log(x)) is used. F is a functional dependent on all ∗ auxiliary functions and parameter θ. This functional F forms the lower bound for the log likelihood. If the lower bound equals the log likelihood, then i {1,...,n}, q (x ) p(x y ,θ) and vice versa. ∀ ∈ xi i = i | i The optimization of the functional F for q ( ), i {1,...n}, and for parameter vector θ results in the xi · ∀ ∈ expectation-maximization for MAP algorithm [16]. In this algorithm iteratively functions q ( ) are deter- xi · mined given fixed θ and fixed q ( ) for j i, after which the value of θ is chosen for which the lower bound x j · 6= of the likelihood, i.e. F , is maximal. This means that the two steps below are iteratively executed until convergence. • Find q ( ) by maximizing F for q ( ) and keeping θ fixed. (E-step) x,i · x,i · • Optimize the lower bound with respect to θ, with all auxiliary functions from the previous step substi- (t 1) tuted. This gives θ + . (M-step) EM for MAP estimation is not the focus of this research, as we are not interested in estimating a fixed parameter. Remember that in Bayesian statistics, all parameters are considered random variables, except the hyperparam- eters. Only for the estimation of the hyperparameters, the EM for MAP method will be suitable. Therefore, this thesis will not elaborate more on this algorithm. For more information on EM for MAP estimation, one can resort to [2].

Now, we will consider a more Bayesian example. In Bayesian statistics, both the latent variables and the parameters are considered random variables. Beforehand, prior distributions are imposed on the parameter and the latent variables, after which a posterior distribution is retrieved via Bayes' rule. From these posterior distributions, summarizing statistics about the parameter $\Theta$ and the latent variables $X$ can be retrieved (e.g. the mean or the mode). Besides, estimators for $\Theta$ and $X$ can be determined. This is a slightly different setting. Therefore, we extend the EM for MAP algorithm to the so-called Variational Bayesian EM (VBEM). Consider the same situation as in figure 4.3. In VBEM, an auxiliary distribution is also needed for $\Theta$, as it is a random variable. Therefore, a general auxiliary distribution over all latent variables is introduced: $q(x_1,\dots,x_n,\theta)$. The prior distribution of $\Theta$ depends on some fixed hyperparameter $\alpha$, i.e. the prior has the form $p(\theta \mid \alpha)$.


Figure 4.4: Schematic overview of an example of variational methods in a Bayesian setting. There are n instances of X and Y , where X depends on the parameter Θ. X is a latent random variable, Θ is a random variable depending on fixed hyperparameter α, and y is an observed random variable.

For ease of notation, we summarize the latent variables in the vector $X = (X_1,\dots,X_n)$ and the observed data in $y = (y_1,\dots,y_n)$. The log likelihood of the model becomes:
\[
\begin{aligned}
\mathcal{L}(\alpha; y) = \log p(y \mid \alpha) &= \log \left( \iint p(x, y, \theta \mid \alpha)\, dx\, d\theta \right) \\
&= \log \left( \iint q_{x,\theta}(x,\theta) \frac{p(x, y, \theta \mid \alpha)}{q_{x,\theta}(x,\theta)}\, dx\, d\theta \right) \\
&\overset{(*)}{\geq} \iint q_{x,\theta}(x,\theta) \log \left( \frac{p(x, y, \theta \mid \alpha)}{q_{x,\theta}(x,\theta)} \right) dx\, d\theta
\end{aligned}
\tag{4.28}
\]

At $(*)$, Jensen's inequality is again used. With the two integral signs, it is denoted that we integrate over $\theta$ once, and over $x_i$ for all $i = 1,\dots,n$. Note that the difference between the likelihood $\mathcal{L}(\alpha;y)$ and the lower bound in equation 4.28 is exactly the Kullback-Leibler divergence of $q_{x,\theta}(\cdot,\cdot)$ with respect to $p(\cdot,\cdot \mid y,\alpha)$ [8]:
\[
\begin{aligned}
KL\bigl(q_{x,\theta}(\cdot,\cdot)\,\|\,p(\cdot,\cdot \mid y,\alpha)\bigr) &= \iint q(x,\theta) \log \left( \frac{q(x,\theta)}{p(x,\theta \mid y,\alpha)} \right) dx\, d\theta \\
&= \mathbb{E}_{q_{x,\theta}}\bigl[\log q_{x,\theta}(X,\Theta)\bigr] - \mathbb{E}_{q_{x,\theta}}\bigl[\log p(X,\Theta \mid y,\alpha)\bigr] \\
&= \mathbb{E}_{q_{x,\theta}}\bigl[\log q_{x,\theta}(X,\Theta)\bigr] - \mathbb{E}_{q_{x,\theta}}\left[\log \frac{p(X,\Theta,y \mid \alpha)}{p(y \mid \alpha)}\right] \\
&= \mathbb{E}_{q_{x,\theta}}\bigl[\log q_{x,\theta}(X,\Theta)\bigr] - \mathbb{E}_{q_{x,\theta}}\bigl[\log p(X,\Theta,y \mid \alpha)\bigr] + \log p(y \mid \alpha) \\
&= \mathbb{E}_{q_{x,\theta}}\bigl[\log q_{x,\theta}(X,\Theta)\bigr] - \mathbb{E}_{q_{x,\theta}}\bigl[\log p(X,\Theta,y \mid \alpha)\bigr] + \mathcal{L}(\alpha;y) \\
&= -\iint q_{x,\theta}(x,\theta) \log \left( \frac{p(x,y,\theta \mid \alpha)}{q_{x,\theta}(x,\theta)} \right) dx\, d\theta + \mathcal{L}(\alpha;y)
\end{aligned}
\tag{4.29}
\]

\[
\Rightarrow \quad \mathcal{L}(\alpha;y) = \iint q_{x,\theta}(x,\theta) \log \left( \frac{p(x,y,\theta \mid \alpha)}{q_{x,\theta}(x,\theta)} \right) dx\, d\theta + KL\bigl(q_{x,\theta}(\cdot,\cdot)\,\|\,p(\cdot,\cdot \mid y,\alpha)\bigr)
\tag{4.30}
\]

Note that in the derivation above, the notation $\mathbb{E}_{q_{x,\theta}}$ is used. The subscript $q_{x,\theta}$ indicates that the expectation of a function of the random variables $X$ and $\Theta$ is computed with the joint density of $X$ and $\Theta$ being $q_{x,\theta}$. From equation 4.30, it can be concluded that minimizing the KL divergence of $q_{x,\theta}(\cdot,\cdot)$ with respect to $p(\cdot,\cdot \mid y,\alpha)$ is equivalent to maximizing the lower bound given in equation 4.28. Minimization of the KL divergence might be more intuitive when approximating one function by another. However, in the derivation of the VBEM algorithm, we will stick to maximizing the log likelihood.

There are many possible functional forms for $q_{x,\theta}(\cdot,\cdot)$, but the most frequently used choice is the mean field approximation [47]. This approximation originates in the field of statistical physics [35] and assumes that all variables $X_1,\dots,X_n,\Theta$ are independent. This is also the method used by Blei et al. in their original paper on LDA. For the example in figure 4.4, the mean field method restricts the choice of auxiliary functions to those that can be factorized such that $q_{x,\theta}(x,\theta) \approx \bigl(\prod_{i=1}^{n} q_{x_i}(x_i)\bigr) q_{\theta}(\theta)$. With this choice for the auxiliary distributions, the lower bound for the log likelihood given the Bayesian model becomes:

\[
\begin{aligned}
\mathcal{L}(\alpha;y) &\geq \iint q_x(x)\, q_\theta(\theta) \log \left( \frac{p(x,y,\theta \mid \alpha)}{q_x(x)\, q_\theta(\theta)} \right) dx\, d\theta \\
&\equiv \mathcal{F}_\alpha\bigl(q_x(\cdot), q_\theta(\cdot); y\bigr)
\end{aligned}
\tag{4.31}
\]

where we denoted $q_x(x) = \prod_{i=1}^{n} q_{x_i}(x_i)$ for simplicity. By choosing auxiliary distributions $q_x(\cdot)$ and $q_\theta(\cdot)$ such that the functional $\mathcal{F}_\alpha$ is maximal, we can find an approximation of the actual likelihood, and we hopefully have $q_{x,\theta}(x,\theta) \approx p(x,\theta \mid y,\alpha)$. The assumptions on the form of the auxiliary function are quite strong when using the mean field approximation, so we cannot tell whether the two functions are close to each other for all $x$ and $\theta$.

General expressions for the auxiliary functions for which the function Fα is maximal are given in theorem 4.2 below, based on [1].

Theorem 4.2 (Variational Bayesian EM with mean field approximation) Let $\alpha$ be the hyperparameter on which the random variable $\Theta$ depends and let $Y = (Y_1,\dots,Y_n)$ be independently distributed with corresponding hidden variables $X = (X_1,\dots,X_n)$. A lower bound on the model's log marginal likelihood is given by the functional:

\[
\mathcal{F}_\alpha = \iint q_x(x)\, q_\theta(\theta) \log \left( \frac{p(x,y,\theta \mid \alpha)}{q_x(x)\, q_\theta(\theta)} \right) dx\, d\theta
\tag{4.32}
\]

This can be iteratively optimized by performing the following updates for the auxiliary functions with superscript (t) as iteration number:

\[
\begin{aligned}
q_{x_i}^{(t+1)}(x_i) &= \frac{1}{Z_{x_i}} \exp \left[ \int q_\theta^{(t)}(\theta) \log\bigl(p(x_i, y_i \mid \theta, \alpha)\bigr)\, d\theta \right] \\
q_\theta^{(t+1)}(\theta) &= \frac{1}{Z_\theta}\, p(\theta \mid \alpha) \cdot \exp \left[ \int q_x^{(t+1)}(x) \log\bigl(p(x,y \mid \theta,\alpha)\bigr)\, dx \right] \\
\text{where} \quad q_x^{(t+1)}(x) &= \prod_{i=1}^{n} q_{x_i}^{(t+1)}(x_i)
\end{aligned}
\tag{4.33}
\]

These update rules converge to a local maximum of $\mathcal{F}_\alpha\bigl(q_x(\cdot), q_\theta(\cdot)\bigr)$. Note that $Z_{x_i}$ and $Z_\theta$ are normalization constants, such that the auxiliary functions integrate to 1.

Proof In this proof, we only show the derivation of the update equations for a single variable $x$. It can easily be seen that the result extends to the case in which $x = (x_1,\dots,x_n)$ and thus $q_x(x) = \prod_{i=1}^{n} q_{x_i}(x_i)$. Lagrange multipliers are introduced to make sure that $q_x$ and $q_\theta$ are valid densities, thus integrating to 1. Note that $\lambda_x$ and $\lambda_\theta$ are strictly positive.

\[
\tilde{\mathcal{F}}_\alpha = \int q_x(x) \left( \int q_\theta(\theta) \log \left( \frac{p(x,y,\theta \mid \alpha)}{q_x(x)\, q_\theta(\theta)} \right) d\theta \right) dx - \lambda_x \left( \int q_x(x)\, dx - 1 \right)^2 - \lambda_\theta \left( \int q_\theta(\theta)\, d\theta - 1 \right)^2
\tag{4.34}
\]

First, we differentiate with respect to $\lambda_x$ and $\lambda_\theta$ and equate these derivatives to zero. By construction, this results in, respectively:
\[
\int q_x(x)\, dx = 1 \qquad \& \qquad \int q_\theta(\theta)\, d\theta = 1
\tag{4.35}
\]

Then, the lower bound $\tilde{\mathcal{F}}_\alpha$ with the Lagrange multiplier terms is differentiated with respect to $q_x$. The definition of differentiating a functional with respect to a function can be found in appendix A.1. The functional can be rewritten as:
\[
\begin{aligned}
\tilde{\mathcal{F}}_\alpha &= \int q_x(x)\left(\int q_\theta(\theta)\log\left(\frac{p(x,y,\theta\mid\alpha)}{q_x(x)\,q_\theta(\theta)}\right)d\theta\right)dx-\lambda_x\left(\int q_x(x)\,dx-1\right)^2-\lambda_\theta\left(\int q_\theta(\theta)\,d\theta-1\right)^2\\
&=\int q_x(x)\left(\int q_\theta(\theta)\log\left(\frac{p(x,y\mid\theta,\alpha)}{q_x(x)}\right)d\theta\right)dx+\iint q_x(x)\,q_\theta(\theta)\log\left(\frac{p(\theta\mid\alpha)}{q_\theta(\theta)}\right)d\theta\,dx\\
&\quad-\lambda_x\left(\int q_x(x)\,dx-1\right)^2-\lambda_\theta\left(\int q_\theta(\theta)\,d\theta-1\right)^2\\
&=\int L^{(1)}(q_x,\dot q_x,x)\,dx+\int L^{(2)}(q_x,\dot q_x,x)\,dx-\lambda_x\left(\int q_x(x)\,dx-1\right)^2-\lambda_\theta\left(\int q_\theta(\theta)\,d\theta-1\right)^2
\end{aligned}
\tag{4.36}
\]
The differential of the functional $\tilde{\mathcal{F}}_\alpha$ with respect to $q_x$, holding $q_\theta(\theta)$ fixed, is calculated as follows:
\[
\begin{aligned}
\frac{\partial\tilde{\mathcal{F}}_\alpha}{\partial q_x}&=L^{(1)}_{q_x}(q_x,\dot q_x,x)-\frac{d}{dx}L^{(1)}_{\dot q_x}(q_x,\dot q_x,x)+L^{(2)}_{q_x}(q_x,\dot q_x,x)-\frac{d}{dx}L^{(2)}_{\dot q_x}(q_x,\dot q_x,x)-2\lambda_x\left(\int q_x(x)\,dx-1\right)\\
&=\int q_\theta(\theta)\log\bigl(p(x,y\mid\theta,\alpha)\bigr)d\theta-\int q_\theta(\theta)\log\bigl(q_x(x)\bigr)d\theta-\int q_x(x)\,q_\theta(\theta)\frac{1}{q_x(x)}\,d\theta\\
&\quad+\int q_\theta(\theta)\log\left(\frac{p(\theta\mid\alpha)}{q_\theta(\theta)}\right)d\theta-2\lambda_x\left(\int q_x(x)\,dx-1\right)\\
&\overset{(*)}{=}\mathbb{E}_{q_\theta}\bigl[\log\bigl(p(x,y\mid\Theta,\alpha)\bigr)\bigr]-\log q_x(x)-1+\mathbb{E}_{q_\theta}\left[\log\frac{p(\Theta\mid\alpha)}{q_\theta(\Theta)}\right]=0\\[2pt]
\Rightarrow\quad q_x(x)&\propto\exp\Bigl\{\mathbb{E}_{q_\theta}\bigl[\log\bigl(p(x,y\mid\Theta,\alpha)\bigr)\bigr]\Bigr\}
\end{aligned}
\tag{4.37}
\]
Here $L_{q_x}$ represents the derivative of the functional $L$ with respect to the function $q_x$. At $(*)$, we used the results from equation 4.35. Furthermore, with $\mathbb{E}_{q_\theta}$ we denote the expectation over the random variable $\Theta$, where $\Theta$ has density function $q_\theta$. As we considered a single random variable $X$ in this derivation, it can easily be seen that, in the case with multiple latent variables $X_i$, indeed for $i=1,\dots,n$:
\[
q_{x_i}(x_i)=\frac{1}{Z_{x_i}}\exp\Bigl\{\mathbb{E}_{q_\theta}\bigl[\log\bigl(p(x_i,y\mid\Theta,\alpha)\bigr)\bigr]\Bigr\}
\tag{4.38}
\]
A similar procedure can be followed to derive the auxiliary distribution $q_\theta(\theta)$ for which the functional $\tilde{\mathcal{F}}_\alpha$ is maximal. The same steps as in equation 4.36 are followed:
\[
\begin{aligned}
\tilde{\mathcal{F}}_\alpha&=\int q_\theta(\theta)\left(\int q_x(x)\log\left(\frac{p(x,y,\theta\mid\alpha)}{q_x(x)\,q_\theta(\theta)}\right)dx\right)d\theta-\lambda_x\left(\int q_x(x)\,dx-1\right)^2-\lambda_\theta\left(\int q_\theta(\theta)\,d\theta-1\right)^2\\
&=\int q_\theta(\theta)\left(\int q_x(x)\log\bigl(p(x,y\mid\theta,\alpha)\bigr)dx+\int q_x(x)\log\left(\frac{p(\theta\mid\alpha)}{q_\theta(\theta)}\right)dx\right)d\theta-\iint q_\theta(\theta)\,q_x(x)\log\bigl(q_x(x)\bigr)dx\,d\theta\\
&\quad-\lambda_x\left(\int q_x(x)\,dx-1\right)^2-\lambda_\theta\left(\int q_\theta(\theta)\,d\theta-1\right)^2\\
&=\int L^{(1)}(q_\theta,\dot q_\theta,\theta)\,d\theta+\int L^{(2)}(q_\theta,\dot q_\theta,\theta)\,d\theta+\int L^{(3)}(q_\theta,\dot q_\theta,\theta)\,d\theta-\lambda_x\left(\int q_x(x)\,dx-1\right)^2-\lambda_\theta\left(\int q_\theta(\theta)\,d\theta-1\right)^2
\end{aligned}
\tag{4.39}
\]
The differential of the functional $\tilde{\mathcal{F}}_\alpha$ with respect to $q_\theta$, holding $q_x(x)$ fixed and assuming that $q_x(x)$ is a density, is calculated as follows:
\[
\begin{aligned}
\frac{\partial\tilde{\mathcal{F}}_\alpha}{\partial q_\theta}&=\int q_x(x)\log p(x,y\mid\theta,\alpha)\,dx+\int q_x(x)\log\left(\frac{p(\theta\mid\alpha)}{q_\theta(\theta)}\right)dx-\int q_x(x)\,dx-\int q_x(x)\log\bigl(q_x(x)\bigr)dx-2\lambda_\theta\left(\int q_\theta(\theta)\,d\theta-1\right)\\
&\overset{(*)}{=}\int q_x(x)\log p(x,y\mid\theta,\alpha)\,dx+\log\left(\frac{p(\theta\mid\alpha)}{q_\theta(\theta)}\right)-1-\int q_x(x)\log\bigl(q_x(x)\bigr)dx=0\\[2pt]
\Rightarrow\quad q_\theta(\theta)&\overset{(**)}{\propto}p(\theta\mid\alpha)\cdot\exp\Bigl\{\mathbb{E}_{q_x}\bigl[\log p(X,y\mid\theta,\alpha)\bigr]\Bigr\}
\end{aligned}
\tag{4.40}
\]

At $(*)$, we used the results from 4.35, and at $(**)$ the proportionality sign arises from the fact that we are only interested in the terms that contain $\theta$. The update equation for $q_\theta$ becomes:

\[
q_\theta(\theta) = \frac{1}{Z_\theta}\, p(\theta \mid \alpha) \cdot \exp\Bigl\{ \mathbb{E}_{q_x}\bigl[\log p(X,y \mid \theta,\alpha)\bigr] \Bigr\}
\tag{4.41}
\]

The functional derivatives are computed under the assumption that $\tilde{\mathcal{F}}_\alpha$ is smooth and differentiable. We also assume that there is a local maximum. At least, it is known that a maximum of $\tilde{\mathcal{F}}_\alpha$ exists, as it is bounded from above by the log likelihood. The precise proof of this convergence needs more work and is left for future research. ■

In less formal terms, variational Bayesian EM can be explained as follows. Suppose the posterior density is intractable. We introduce auxiliary distributions over the latent variables. The auxiliary distributions are chosen in such a way that the lower bound for the likelihood of the observed data is as tight as possible to the actual likelihood. This means that the product of all auxiliary densities is an approximation of the actual posterior density of all latent variables and random parameters given the data, that is, the posterior density of all variables of interest. Lastly, with these approximate distributions, the values of the latent variables and parameters can easily be estimated using, e.g., the posterior mean or the posterior mode of each auxiliary distribution.

4.2.2. Variational Bayesian EM for LDA

The variational Bayesian EM algorithm is used for inference in the paper by Blei et al. [7]. However, we will follow the derivation in [8], because in that paper variational Bayes is applied to the so-called smoothed LDA from [7], which corresponds to how LDA is defined in this thesis, that is, with a prior distribution on the topic-word distributions $\Phi$. In theorem 4.2, variational Bayesian EM is defined using the terminology of latent variables and parameters. In LDA these can be considered the same, as the hyperparameters $\alpha$ and $\beta$ are fixed and do not need statistical inference. All $\Phi$, $\Theta$ and $\mathbf{Z}$ are latent variables. The log likelihood of the latent variables in LDA and the lower bound are given in 4.42. For ease of notation, and because they are fixed, the conditioning on the hyperparameters $\alpha$ and $\beta$ is omitted.

\[
\begin{aligned}
\mathcal{L}(\alpha,\beta;\mathbf{w}) = \log p(\mathbf{w}) &= \log \left( \iint \sum_{\mathbf{z}} p(\phi,\theta,\mathbf{z},\mathbf{w})\, d\theta\, d\phi \right) \\
&\geq \iint \sum_{\mathbf{z}} q(\phi,\theta,\mathbf{z}) \log \frac{p(\phi,\theta,\mathbf{z},\mathbf{w})}{q(\phi,\theta,\mathbf{z})}\, d\theta\, d\phi
\end{aligned}
\tag{4.42}
\]

In [8], it is assumed that $p(\phi,\theta,\mathbf{z} \mid \mathbf{w})$ can be approximated by a mean field variational family consisting of auxiliary distributions. Note that in the expressions for the latent variables $Z$, the topics, a simple $Z$ is used, representing the topic index. Actually, the topic is distributed as $\tilde{Z} \sim \text{Multinomial}(1,\Theta)$, such that the only non-zero component of the vector $\tilde{Z}$ gives the value of the topic $Z$ (e.g. $\tilde{Z} = e_6 \Rightarrow Z = 6$). For simplicity, we called the vector with all topics $\mathbf{z}$, but when looking at its distribution, $\tilde{Z}$ can better be used. The mean field approximation for the latent variables becomes:

\[
q(\phi,\theta,\mathbf{z}) = \prod_{k=1}^{K} q_{\phi_k}(\phi_k; \lambda_k) \prod_{d=1}^{M} \left( q_{\theta_d}(\theta_d; \gamma_d) \prod_{i=1}^{N_d} q_{\tilde{z}_{d,i}}(\tilde{z}_{d,i}; \nu_{d,i}) \right)
\tag{4.43}
\]

Each auxiliary distribution is chosen to come from the same family of distributions as the one to which the conditional distribution of each latent variable conditioned on all other variables in the model belongs. Note that these conditional distributions have already been derived in section 4.1.2. This means that:

\[
\begin{aligned}
q_{\phi_k}(\phi_k; \lambda_k) &\leftarrow \text{Dirichlet}(\lambda_k) \\
q_{\theta_d}(\theta_d; \gamma_d) &\leftarrow \text{Dirichlet}(\gamma_d) \\
q_{\tilde{z}_{d,i}}(\tilde{z}_{d,i}; \nu_{d,i}) &\leftarrow \text{Multinomial}(1, \nu_{d,i})
\end{aligned}
\tag{4.44}
\]

In order to be consistent with theorem 4.2, the lower bound of $\mathcal{L}(\alpha,\beta;\mathbf{w})$ is denoted by $\mathcal{F}$:
\[
\begin{aligned}
\mathcal{F}\bigl(q_\phi(\cdot),q_\theta(\cdot),q_z(\cdot)\bigr) &= \iint \sum_{\mathbf{z}} q(\phi,\theta,\mathbf{z}) \log \frac{p(\phi,\theta,\mathbf{z},\mathbf{w} \mid \alpha,\beta)}{q(\phi,\theta,\mathbf{z})}\, d\theta\, d\phi \\
&= \mathbb{E}_{q_{\phi,\theta,z}}\bigl[\log p(\Phi,\Theta,\mathbf{Z},\mathbf{w} \mid \alpha,\beta)\bigr] - \mathbb{E}_{q_{\phi,\theta,z}}\bigl[\log q(\Phi,\Theta,\mathbf{Z})\bigr] \\
&= \sum_{d=1}^{M} \mathbb{E}_{q_\theta}\bigl[\log p(\Theta_d \mid \alpha)\bigr] + \sum_{k=1}^{K} \mathbb{E}_{q_\phi}\bigl[\log p(\Phi_k \mid \beta)\bigr] + \sum_{d=1}^{M}\sum_{i=1}^{N_d} \mathbb{E}_{q_{\theta,z}}\bigl[\log p(\tilde{Z}_{d,i} \mid \Theta_d)\bigr] \\
&\quad + \sum_{d=1}^{M}\sum_{i=1}^{N_d} \mathbb{E}_{q_{\phi,z}}\bigl[\log p(\tilde{w}_{d,i} \mid \tilde{Z}_{d,i},\Phi)\bigr] - \sum_{k=1}^{K} \mathbb{E}_{q_\phi}\bigl[\log q(\Phi_k \mid \lambda_k)\bigr] \\
&\quad - \sum_{d=1}^{M} \mathbb{E}_{q_\theta}\bigl[\log q(\Theta_d \mid \gamma_d)\bigr] - \sum_{d=1}^{M}\sum_{i=1}^{N_d} \mathbb{E}_{q_z}\bigl[\log q(\tilde{Z}_{d,i} \mid \nu_{d,i})\bigr]
\end{aligned}
\tag{4.45}
\]
Substituting the known conditionals and the auxiliary distributions as proposed in 4.44 gives:
\[
\begin{aligned}
\mathcal{F}\bigl(q_\phi(\cdot),q_\theta(\cdot),q_z(\cdot)\bigr)
&= \sum_{d=1}^{M} \left( \log\Gamma\Bigl(\textstyle\sum_{k=1}^{K}(\alpha)_k\Bigr) - \sum_{k=1}^{K}\log\Gamma\bigl((\alpha)_k\bigr) + \sum_{k=1}^{K}\bigl((\alpha)_k - 1\bigr)\,\mathbb{E}_{q_\theta}\bigl[\log(\Theta_d)_k\bigr] \right) \\
&\quad + \sum_{k=1}^{K} \left( \log\Gamma\Bigl(\textstyle\sum_{j=1}^{V}(\beta)_j\Bigr) - \sum_{j=1}^{V}\log\Gamma\bigl((\beta)_j\bigr) + \sum_{j=1}^{V}\bigl((\beta)_j - 1\bigr)\,\mathbb{E}_{q_\phi}\bigl[\log(\Phi_k)_j\bigr] \right) \\
&\quad + \sum_{d=1}^{M}\sum_{i=1}^{N_d} \mathbb{E}_{q_\theta}\Bigl[\mathbb{E}_{q_z}\bigl[\log p(\tilde{Z}_{d,i} \mid \Theta_d) \,\big|\, \Theta\bigr]\Bigr] + \sum_{d=1}^{M}\sum_{i=1}^{N_d} \mathbb{E}_{q_\phi}\Bigl[\mathbb{E}_{q_z}\bigl[\log p(\tilde{w}_{d,i} \mid \Phi,\tilde{Z}_{d,i}) \,\big|\, \Phi\bigr]\Bigr] \\
&\quad - \sum_{d=1}^{M} \left( \log\Gamma\Bigl(\textstyle\sum_{k=1}^{K}(\gamma_d)_k\Bigr) - \sum_{k=1}^{K}\log\Gamma\bigl((\gamma_d)_k\bigr) + \sum_{k=1}^{K}\bigl((\gamma_d)_k - 1\bigr)\,\mathbb{E}_{q_\theta}\bigl[\log(\Theta_d)_k\bigr] \right) \\
&\quad - \sum_{k=1}^{K} \left( \log\Gamma\Bigl(\textstyle\sum_{j=1}^{V}(\lambda_k)_j\Bigr) - \sum_{j=1}^{V}\log\Gamma\bigl((\lambda_k)_j\bigr) + \sum_{j=1}^{V}\bigl((\lambda_k)_j - 1\bigr)\,\mathbb{E}_{q_\phi}\bigl[\log(\Phi_k)_j\bigr] \right) \\
&\quad - \sum_{d=1}^{M}\sum_{i=1}^{N_d}\sum_{k=1}^{K} \log\bigl((\nu_{d,i})_k\bigr)\,\mathbb{E}_{q_z}\bigl[(\tilde{Z}_{d,i})_k\bigr] \\
&= C + \sum_{d=1}^{M}\sum_{k=1}^{K} \left( (\alpha)_k - (\gamma_d)_k + \sum_{i=1}^{N_d} (\nu_{d,i})_k \right) \mathbb{E}_{q_\theta}\bigl[\log(\Theta_d)_k\bigr] \\
&\quad + \sum_{k=1}^{K}\sum_{j=1}^{V} \left( (\beta)_j - (\lambda_k)_j + \sum_{d=1}^{M}\sum_{i=1}^{N_d} (\tilde{w}_{d,i})_j (\nu_{d,i})_k \right) \mathbb{E}_{q_\phi}\bigl[\log(\Phi_k)_j\bigr] \\
&\quad - \sum_{d=1}^{M} \left( \log\Gamma\Bigl(\textstyle\sum_{k=1}^{K}(\gamma_d)_k\Bigr) - \sum_{k=1}^{K}\log\Gamma\bigl((\gamma_d)_k\bigr) \right) - \sum_{k=1}^{K} \left( \log\Gamma\Bigl(\textstyle\sum_{j=1}^{V}(\lambda_k)_j\Bigr) - \sum_{j=1}^{V}\log\Gamma\bigl((\lambda_k)_j\bigr) \right) \\
&\quad - \sum_{d=1}^{M}\sum_{i=1}^{N_d}\sum_{k=1}^{K} (\nu_{d,i})_k \log\bigl((\nu_{d,i})_k\bigr)
\end{aligned}
\tag{4.46}
\]

M à à K ! K K ! ¡ ¢ X X X X £ ¤ F qφ( ),qθ( ),qz( ) log Γ( (α)k ) log(Γ((α)k )) ((α)k 1) Eqθ log(Θd)k · · · = d 1 k 1 − k 1 + k 1 − · = = = = K à à V ! V V ! X X X ¡ ¢ X £ ¤ log Γ( (β)j log Γ((β)j ((β)j 1) Eqφ log(Φk)j + k 1 j 1 − j 1 + j 1 − · = = = = M N X Xd h h ¯ ii E E log¡p(Z˜ )¢¯ qθ qz d,i Θd ¯Θ + d 1 i 1 | = = M N X Xd h h ¯ ii E E ¡ ˜ ˜ ¢¯ qφ qz log p(wd,i Φ,Zd,i) ¯Φ + d 1 i 1 | = = M à à K ! K K ! X X X ¡ ¢ X £ ¤ log Γ( (γd)k ) log Γ((γd)k ) ((γd)k 1) Eqθ log(Θd)k − d 1 k 1 − k 1 + k 1 − · = = = = K à à V ! V V ! X X X ¡ ¢ X £ ¤ log Γ( (λk)j ) log Γ((λk)j ) ((λk)j 1) Eqφ log(Φk)j − k 1 j 1 − j 1 + j 1 − · = = = = (4.46) M Nd K X X X ¡ ¢ £ ˜ ¤ log (νd,i)k Eqz (Zd,i)k − d 1 i 1 k 1 = = = M K à N ! X X Xd £ ¤ C (α)k (γd)k (νd,i)k Eqθ log(Θd)k = + d 1 k 1 − + i 1 · = = = K V à M N ! X X X Xd £ ¤ (β)j (λk)j (w˜ d,i)j (νd,i)k Eqφ log(Φk)j + k 1 j 1 − + d 1 i 1 · · = = = = M à à K ! K ! X X X ¡ ¢ log Γ( (γd)k ) log Γ((γd)k ) − d 1 k 1 − k 1 = = = K à à V ! V ! X X X ¡ ¢ log Γ( (λk)j ) log Γ((λk)j ) − k 1 j 1 − j 1 = = = M N K X Xd X ¡ ¢ (νd,i)k log (νd,i)k − d 1 i 1 k 1 = = = Differentiating F with respect to each variational parameter separately leads to the following update equations. Through the choice of the auxiliary distributions, we already know that they integrate to 1. Therefore, Lagrange multipliers are not needed in this case. For the variational parameter vector belonging to Z˜d,i, we get a proportionality:

\[
\begin{aligned}
(\nu_{d,i})_k &\propto \exp\left\{ \mathbb{E}_{q_\theta}\bigl[\log(\Theta_d)_k\bigr] + \sum_{j=1}^{V} (\tilde{w}_{d,i})_j\, \mathbb{E}_{q_\phi}\bigl[\log(\Phi_k)_j\bigr] \right\} \\
&= \exp\left\{ \Psi\bigl((\gamma_d)_k\bigr) - \Psi\Bigl(\textstyle\sum_{k'=1}^{K}(\gamma_d)_{k'}\Bigr) + \Psi\bigl((\lambda_k)_{w_{d,i}}\bigr) - \Psi\Bigl(\textstyle\sum_{j=1}^{V}(\lambda_k)_j\Bigr) \right\}
\end{aligned}
\tag{4.47}
\]

where $\Psi(\cdot)$ is the digamma function. The derivation of $\mathbb{E}_{q_\theta}[\log(\Theta_d)_k]$ is elaborated on in the appendix. The exact value of $(\nu_{d,i})_k$ is retrieved via normalization, i.e. $\sum_{k=1}^{K}(\nu_{d,i})_k$ must equal 1, so one needs to divide every $(\nu_{d,i})_k$ by $\sum_{k=1}^{K}(\nu_{d,i})_k$. The variational parameter vector belonging to $\Theta_d$ can be determined directly:

\[
\gamma_d = \alpha + \sum_{i=1}^{N_d} \nu_{d,i}
\tag{4.48}
\]

We obtain the variational parameter vector belonging to $\Phi_k$ via:

\[
\lambda_k = \beta + \sum_{d=1}^{M}\sum_{i=1}^{N_d} (\nu_{d,i})_k\, \tilde{w}_{d,i}
\tag{4.49}
\]

The method of variational Bayesian EM is summarized in algorithm 3 [8]. The E-step consists of computing the variational parameters $\gamma$, $\nu$ and $\lambda$, after which the lower bound $\mathcal{F}$ is determined. These steps are executed alternately until a local maximum of $\mathcal{F}$ is attained (M-step). Only a local optimum can be found, because the objective function is non-convex due to the mean field approximation [47].

Algorithm 3 Variational Bayesian EM for LDA
1: Initialize $\gamma$, $\lambda$ and $\nu$
2: Compute the lower bound $\mathcal{F}$
3: Set $\epsilon$
4: Start with $i = 1$
5: while $|\mathcal{F}[i+1] - \mathcal{F}[i]| > \epsilon$ do
6:   for $d = 1$ to $M$ do  ▷ Iterate over documents
7:     for $i = 1$ to $N_d$ do  ▷ Iterate over words in document $d$
8:       Compute $\gamma_d$
9:       Compute $\nu_{d,i}$
10:    end for
11:  end for
12:  for $k = 1$ to $K$ do
13:    Compute $\lambda_k$
14:  end for
15:  Compute $\mathcal{F}[i+1]$
16:  $i = i + 1$
17: end while
18: Compute posterior means for $\Theta$, $\Phi$ and $\mathbf{Z}$ using the variational parameters $\gamma$, $\lambda$ and $\nu$.
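A compact Python sketch of these coordinate updates could look as follows. This is illustrative only: the function name and initialization are ad hoc, a fixed number of sweeps replaces the convergence check on $\mathcal{F}$, and $\nu$ and $\gamma$ are updated once per document per sweep rather than iterated to convergence.

```python
import numpy as np
from scipy.special import digamma

def vbem_lda(words, K, V, alpha, beta, n_iter, rng):
    """Sketch of the variational updates (4.47)-(4.49) for LDA.

    words: list of M integer arrays of word indices; alpha (length K) and beta
    (length V) are the hyperparameter vectors.
    """
    M = len(words)
    lam = rng.gamma(100.0, 0.01, size=(K, V))            # lambda_k, k = 1..K
    gamma = np.ones((M, K)) + alpha                      # gamma_d, d = 1..M

    for _ in range(n_iter):
        lam_new = np.tile(beta, (K, 1)).astype(float)    # start from the prior beta
        for d, doc in enumerate(words):
            # E_q[log Theta_d] and E_q[log Phi_k] via digamma, as in eq. 4.47
            e_log_theta = digamma(gamma[d]) - digamma(gamma[d].sum())
            e_log_phi = digamma(lam[:, doc]) - digamma(lam.sum(axis=1))[:, None]
            log_nu = e_log_theta[:, None] + e_log_phi    # K x N_d
            nu = np.exp(log_nu - log_nu.max(axis=0))
            nu /= nu.sum(axis=0)                         # normalize over topics
            gamma[d] = alpha + nu.sum(axis=1)            # eq. 4.48
            np.add.at(lam_new.T, doc, nu.T)              # eq. 4.49 accumulation
        lam = lam_new

    theta_hat = gamma / gamma.sum(axis=1, keepdims=True)     # posterior means of q
    phi_hat = lam / lam.sum(axis=1, keepdims=True)
    return theta_hat, phi_hat
```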

When the algorithm has converged, values for the variational parameters are retrieved. With these parameters, we know the complete auxiliary distributions, whose product approximates the posterior density of all latent variables. The values of the latent variables can then be estimated using the posterior mode. The posterior mode is not determined using the true posterior density, but using the approximation by the auxiliary distributions. Because each latent variable has its own auxiliary density function, the posterior modes can be determined independently. For example, if we want to get the estimator of $\Theta_d$:

\[
\begin{aligned}
\hat{\theta}_d &= \arg\max_{\theta_d}\, p(\theta,\phi,\mathbf{z} \mid \mathbf{w},\alpha,\beta) \\
&\approx \arg\max_{\theta_d}\, \prod_{k=1}^{K} q_{\phi_k}(\phi_k;\lambda_k) \prod_{d'=1}^{M} q_{\theta_{d'}}(\theta_{d'};\gamma_{d'}) \prod_{i=1}^{N_{d'}} q_{\tilde{z}_{d',i}}(\tilde{z}_{d',i};\nu_{d',i}) \\
&= \arg\max_{\theta_d}\, q_{\theta_d}(\theta_d;\gamma_d) \\
&\overset{(*)}{=} \left( \frac{(\gamma_d)_1 - 1}{\sum_{k=1}^{K}(\gamma_d)_k - K},\;\dots,\; \frac{(\gamma_d)_K - 1}{\sum_{k=1}^{K}(\gamma_d)_k - K} \right)
\end{aligned}
\tag{4.50}
\]

At $(*)$, we used the expression for the mode of a Dirichlet distributed random vector. Note that this expression is only valid if all $(\gamma_d)_i$ with $i = 1,\dots,K$ are larger than 1. Because the mean field assumption is quite strong, it is not guaranteed that the posterior mode of the approximation function is close to the posterior mode of the true posterior density. In chapter 8, we will visualize the variational approximation function and the true posterior density for a small example, such that more insight is gained into the functioning of the VBEM algorithm. Zhang et al. made a complete overview of variational inference methods in [53], and concluded that research is needed on the theoretical aspects of variational inference, such as the approximation errors that are involved when an approximation function replaces the posterior density. Presently, the writer of this thesis has not found quantitative methods to determine the accuracy of the variational approximation, and therefore only empirical results are shown in chapter 8. Because one of the arguments against the application of variational inference to LDA is that the mean field approximation is too strong, improvements on this assumption have been made; these are given in [53]. Furthermore, instead of using the update equations of theorem 4.2, the lower bound of the likelihood, $\mathcal{F}_\alpha$, can be optimized using stochastic optimization. The mean field approximation is still used for the auxiliary functions, but the optimization method is different. This version of variational inference is called Stochastic Variational Inference (SVI). Further improvements on this method are made in Structured Stochastic Variational Inference [5]. In this method, the mean field approximation is also let go of, and dependencies between the latent variables are allowed. These dependencies can be modeled in a hierarchical structure, which is then called Hierarchical Variational Inference [38]. Another option to model dependencies is using copulas. The auxiliary function then takes the form:
\[
q(\theta) = \left( \prod_{d=1}^{M} q(\theta_d; \lambda_d) \right) \cdot c\bigl(Q(\theta_1),\dots,Q(\theta_M)\bigr)
\tag{4.51}
\]
with $c(\dots)$ a copula and each $Q$ the cumulative distribution function corresponding to the auxiliary densities $q$. For more information on this extension of variational inference, we refer to [45]. Although promising improvements on variational inference, and in particular on variational methods applied to LDA, are present in the literature, the method still lacks theoretical evidence of accuracy. Furthermore, other inference methods for LDA, such as Markov chain Monte Carlo methods, exist and have better convergence results. Also, even though the posterior mode cannot easily be calculated analytically, optimization methods exist that have already proven to be useful in deep learning and neural networks. One of these optimization methods is elaborated on in the next chapter, 'Posterior mode estimation for LDA', in which we propose an optimization method that aims to find the posterior mode of the high-dimensional posterior density function of LDA.

5 Determination posterior mode estimates for LDA using optimization

In the previous chapter, different inference methods to obtain estimates of the parameters $\Theta$ and $\Phi$ (respectively the document-topic distributions and the topic-word distributions) of LDA were discussed. The posterior mode can be determined in another way than proposed in the literature, namely via optimization methods. These techniques are often used in neural network and deep learning algorithms, and can easily be applied to hierarchical Bayesian models like LDA. Posterior mode estimation is often referred to as MAP estimation, where MAP stands for Maximum A Posteriori. In figure 1.1 in the introduction, the method in this chapter is described as 'analytical', even though we still estimate the values of $\Theta$ and $\Phi$. The reason to still call this optimization method 'analytical' is the fact that the posterior density is not approximated by some other function: we use the actual posterior density (up to a proportionality constant) and search for its maximum.

5.1. LDA’s posterior density

The posterior mode is given by the values of all parameters $\Theta_d$ and $\Phi_k$ (with $d = 1,\dots,M$ and $k = 1,\dots,K$) for which the posterior density in equation 5.1 is maximal. Note that the proportionality constant is not needed for this estimator. Because the posterior density is not convex in most cases, smart optimization techniques are needed.

\[
p(\theta,\phi \mid \mathbf{w},\alpha,\beta) \propto \left[ \prod_{d=1}^{M} \prod_{j=1}^{V} \left( \sum_{k=1}^{K} (\phi_k)_j (\theta_d)_k \right)^{n_{d,j}} \right] \cdot \left[ \prod_{d=1}^{M} \prod_{k=1}^{K} (\theta_d)_k^{(\alpha)_k - 1} \right] \cdot \left[ \prod_{k=1}^{K} \prod_{j=1}^{V} (\phi_k)_j^{(\beta)_j - 1} \right]
\tag{5.1}
\]
To demonstrate the form of the posterior density and its non-convexity, we will first consider a simple case in which the dimensionality is small enough for visualization.

Consider the example in which we have only one document (think of the silly document 'nice stupid nice') consisting of three words: $w_1 = 1$, $w_2 = 2$, $w_3 = 1$. There are only two possible words, as the word index is either 1 or 2; this means that the vocabulary size $V$ is 2. Furthermore, it is assumed that there are 2 possible topics, that is, $K = 2$. Lastly, the hyperparameters are symmetric and are set to $\alpha = 0.1$ and $\beta = 1$. The parameters of the model that we want to estimate are $\Theta_1$, $\Phi_1$ and $\Phi_2$, as $\Theta_1 + \Theta_2 = 1$ and $(\Phi_1)_1 + (\Phi_1)_2 = 1$. Thus, $\Phi_1$ denotes the probability of word 1 under topic 1, and $\Phi_2$ is then the probability of word 1 under topic 2. The posterior density (up to a proportionality constant) is given by:

\[
\begin{aligned}
p(\theta_1,\phi_1,\phi_2 \mid \mathbf{w}) &\propto \prod_{j=1}^{V} \bigl(\phi_{1,j}\theta_1 + \phi_{2,j}(1-\theta_1)\bigr)^{n_{1,j}} \cdot \theta_1^{\alpha-1}(1-\theta_1)^{\alpha-1} \cdot \prod_{k=1}^{K} \phi_k^{\beta-1}(1-\phi_k)^{\beta-1} \\
&= \bigl(\phi_1\theta_1 + \phi_2(1-\theta_1)\bigr)^2 \cdot \bigl((1-\phi_1)\theta_1 + (1-\phi_2)(1-\theta_1)\bigr) \cdot \theta_1^{-0.9}(1-\theta_1)^{-0.9} \cdot 1
\end{aligned}
\tag{5.2}
\]


Here we take $0^0 = 1$, as is the case when $\phi_k = 0$ or $\phi_k = 1$ for some $k$. From equation 5.2, it can be derived that the posterior density is maximal if $\theta_1 = 0$ or $\theta_1 = 1$, since, as shown in equation 5.3, for any positive $a$:

\[
\lim_{x \downarrow 0} x^{-a} = \infty \qquad \& \qquad \lim_{x \uparrow 1} (1-x)^{-a} = \infty
\tag{5.3}
\]

Therefore, for $\alpha < 1$, the posterior mode will always have $\theta_1$ drawn to the edges of the domain: $\theta_1 \in \{0,1\}$. The posterior density in equation 5.2 is shown in figure 5.1 in its two extremes for $\theta_1$. For numerical reasons, $\theta_1 \in [\epsilon, 1-\epsilon]$, with $\epsilon = 10^{-10}$. Note that in figure 5.1, there is not one single mode, but an area in which the posterior attains the same maximum value, in the form of a ridge.

(a) Posterior density $p(\theta_1 \approx 0, \phi_1, \phi_2 \mid w_1,w_2,w_3,\alpha,\beta)$  (b) Posterior density $p(\theta_1 \approx 1, \phi_1, \phi_2 \mid w_1,w_2,w_3,\alpha,\beta)$

Figure 5.1: Posterior density $p(\theta_1,\phi_1,\phi_2 \mid w_1,w_2,w_3,\alpha,\beta)$ for two fixed values of $\theta_1$ and hyperparameters $\alpha = (0.1, 0.1)$ and $\beta = (1,1)$.

This is an obvious result, because in the first case, where $\theta_1 \approx 0$, the document is about topic 2, and therefore nothing is known about the first topic, i.e. $\phi_1$. The same reasoning can be applied to figure 5.1b, but the other way around. Including more documents in the analysis generally results in the disappearance of the ridge of posterior modes as in figure 5.1, and more distinctive modes are found. Furthermore, it is important to note that to find the maximum, we cannot simply differentiate the posterior density with respect to each parameter and set the derivatives equal to zero, as there can be saddle points. It is not easy to determine whether there are saddle points in the posterior density, but for some examples they have been observed. Therefore, we assume that they are also present in high-dimensional posteriors, such that looking for the gradient being equal to 0 will not give the desired results. The optimization method that searches for the maximum value takes this into account, such that it cannot get stuck in a saddle point.
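For this small example, the unnormalized posterior of equation 5.2 can be evaluated directly on a grid, which is how figures like 5.1 can be produced; a minimal sketch (with a made-up grid resolution) is given below.

```python
import numpy as np

def toy_posterior(theta1, phi1, phi2, alpha=0.1, beta=1.0):
    """Unnormalized posterior of eq. 5.2 for the single document w = (1, 2, 1)."""
    word1 = phi1 * theta1 + phi2 * (1 - theta1)               # probability of word 1
    word2 = (1 - phi1) * theta1 + (1 - phi2) * (1 - theta1)   # probability of word 2
    prior_theta = theta1 ** (alpha - 1) * (1 - theta1) ** (alpha - 1)
    prior_phi = (phi1 * (1 - phi1) * phi2 * (1 - phi2)) ** (beta - 1)
    return word1 ** 2 * word2 * prior_theta * prior_phi

# Evaluate on a grid with theta1 fixed near 0, as in figure 5.1a.
eps = 1e-10
grid = np.linspace(0.0, 1.0, 201)
phi1, phi2 = np.meshgrid(grid, grid)
density = toy_posterior(eps, phi1, phi2)
# The ridge of maxima shows up as many grid points close to the maximum value:
print(density.max(), np.unravel_index(density.argmax(), density.shape))
```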

Although we are now considering a simple case, already a three-dimensional array is involved, of size $N \times N \times N$, with $N$ the grid size. Finding the posterior mode is the same as searching this array for the maximum value(s). It is clear that for higher dimensions, the grid search for the posterior mode becomes computationally more expensive. These high dimensions are not rare: one in general desires to find multiple topics, say 20; for accurate results many documents are taken into account, of the order of 10,000; and these documents contain many different words (even after data preprocessing), of the order of 5,000. To decrease dimensionality, the vocabulary is often reduced to the 2,000 most frequently occurring words. Consequently, for the posterior mode we need to search a $20 \times 10{,}000 \times 2{,}000 = 4 \cdot 10^8$ dimensional parameter space to find the maximal value(s) of the posterior distribution. Due to topic exchangeability, there are $K! = 20!$ posterior modes, and there might be even more modes depending on the actual data. Grid search is not feasible anymore, so a smart optimization algorithm must be used. Note that optimization algorithms conventionally search for a minimum. Therefore, the aim of the method is to find the minimum of the following objective.

\[
\begin{aligned}
\text{objective} &= -\log(\text{posterior}) \\
&= C - \log\left( \left[ \prod_{d=1}^{M} \prod_{j=1}^{V} \left( \sum_{k=1}^{K} (\phi_k)_j (\theta_d)_k \right)^{n_{d,j}} \right] \cdot \left[ \prod_{d=1}^{M} \prod_{k=1}^{K} (\theta_d)_k^{(\alpha)_k - 1} \right] \cdot \left[ \prod_{k=1}^{K} \prod_{j=1}^{V} (\phi_k)_j^{(\beta)_j - 1} \right] \right) \\
&= C - \sum_{d=1}^{M} \sum_{j=1}^{V} n_{d,j} \log\left( \sum_{k=1}^{K} (\phi_k)_j (\theta_d)_k \right) - \sum_{d=1}^{M} \sum_{k=1}^{K} \bigl((\alpha)_k - 1\bigr) \log\bigl((\theta_d)_k\bigr) - \sum_{k=1}^{K} \sum_{j=1}^{V} \bigl((\beta)_j - 1\bigr) \log\bigl((\phi_k)_j\bigr)
\end{aligned}
\tag{5.4}
\]
The logarithm is taken because it makes optimization easier. Furthermore, the constant $C$ can be omitted in the optimization, as it has no influence on the location of the minimum. Because this thesis focuses on LDA and its statistical properties, only gradient descent optimization methods are looked into. There might be numerous other suitable optimization methods, but those are considered beyond the scope of this thesis.
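For reference, a direct NumPy evaluation of this objective (up to the constant $C$) can be written in a few lines; the argument names are illustrative and dense count matrices are assumed.

```python
import numpy as np

def neg_log_posterior(theta, phi, counts, alpha, beta):
    """Objective of eq. 5.4 (up to the constant C).

    theta: M x K document-topic probabilities, phi: K x V topic-word
    probabilities, counts: M x V matrix with n_{d,j}; alpha (length K) and
    beta (length V) are the hyperparameter vectors.
    """
    mixture = theta @ phi                                 # entry (d, j) = sum_k (phi_k)_j (theta_d)_k
    loglik = np.sum(counts * np.log(mixture))
    logprior_theta = np.sum((alpha - 1) * np.log(theta))
    logprior_phi = np.sum((beta - 1) * np.log(phi))
    return -(loglik + logprior_theta + logprior_phi)
```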

5.2. Gradient descent Gradient descent optimization is a very intuitive method of looking for a minimum. The idea is the following. Imagine you stand on a hillside and want to move to the lowest point in the surroundings. Because of near-sightedness and the absence of glasses, you cannot see the valley, so you need to look down and step in the direction of steepest descent. Step by step you continue until you have reached a location from which you cannot make another step downwards. Then the local minimum is found. The idea of gradient descent is shown in figure 5.2.

Figure 5.2: Basic gradient descent algorithm visualized. Every step is made in the direction of the steepest descent. In this one-dimensional case, one can only move upwards or downwards. Therefore, this is an easy minimization problem. The (local) minimum is reached, and no more steps can be made downwards.

Mathematically speaking, the optimization problem is the following:

\[
\min_{x \in \mathbb{R}} f(x)
\tag{5.5}
\]
for some function $f(x)$. The gradient descent algorithm starts at an initial point $x_0 \in \mathbb{R}$ and updates via:

\[
x_{n+1} = x_n - a \nabla f(x_n)
\tag{5.6}
\]
until convergence ($x_n \approx x_{n+1}$). The step size or learning rate $a$ needs to be chosen carefully: not too small, otherwise convergence is very slow, but neither too large, to prevent jumping over the minimum. In the high-dimensional optimization problem of finding the posterior mode of equation 5.1, it is chosen to be of the order $10^{-3}$. If $f(x)$ is convex, the global minimum will be found; otherwise only convergence to a local minimum is assured [13]. In high-dimensional optimization problems, computing the gradient in every iteration can be very time-consuming and computationally expensive. Therefore, one often resorts to stochastic gradient descent algorithms [39].
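A minimal implementation of this update rule, with an example objective chosen here purely for illustration, could look as follows.

```python
import numpy as np

def gradient_descent(grad, x0, a=1e-3, tol=1e-8, max_iter=100_000):
    """Plain gradient descent (eq. 5.6): x_{n+1} = x_n - a * grad f(x_n).

    grad: function returning the gradient of f at x; x0: starting point.
    Stops when consecutive iterates are (almost) equal or max_iter is reached.
    """
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        x_new = x - a * grad(x)
        if np.linalg.norm(x_new - x) < tol:
            return x_new
        x = x_new
    return x

# Example: minimize f(x, y) = (x - 1)^2 + 4 y^2, whose minimum is at (1, 0).
grad_f = lambda v: np.array([2 * (v[0] - 1), 8 * v[1]])
print(gradient_descent(grad_f, x0=[0.0, 2.0], a=0.05))
```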

5.3. Stochastic gradient descent Stochastic gradient descent uses an update formula akin to the one used in the basic gradient descent algorithm, only a noise term is added:

\[
x_{n+1} = x_n - a\bigl(\nabla f(x_n) + w_n\bigr)
\tag{5.7}
\]
The following proposition gives assumptions that are necessary for the algorithm to converge.
Proposition 5.1 (Stochastic gradient descent convergence, from [3]) Let $x_n$ be a sequence generated by the method:

\[
x_{n+1} = x_n + \gamma_n (s_n + w_n)
\tag{5.8}
\]
where $\gamma_n$ is a deterministic positive step size, $s_n$ a descent direction and $w_n$ random noise. Let $\mathcal{F}_n$ be an increasing sequence of $\sigma$-fields. One can consider $\mathcal{F}_n$ to be the history of the algorithm, so it contains information about $x_0, s_0, \gamma_0, w_0, \dots, x_{n-1}, s_{n-1}, \gamma_{n-1}, w_{n-1}$. The function $f: \mathbb{R}^d \to \mathbb{R}$ (for some positive integer $d$) needs to be optimized. Furthermore, $\nabla f$ is Lipschitz continuous with some constant $L$. We assume the following:

1. $x_n$ and $s_n$ are $\mathcal{F}_n$-measurable.
2. There exist positive scalars $c_1$ and $c_2$ such that for all $n$:
\[
c_1 \|\nabla f(x_n)\|^2 \leq -\bigl(\nabla f(x_n)\bigr)^T s_n \qquad \& \qquad \|s_n\| \leq c_2\bigl(1 + \|\nabla f(x_n)\|\bigr)
\tag{5.9}
\]

3. For all $n$ and with probability 1:
\[
\mathbb{E}[w_n \mid \mathcal{F}_n] = 0
\tag{5.10}
\]
and
\[
\mathbb{E}\bigl[\|w_n\|^2 \mid \mathcal{F}_n\bigr] \leq A\bigl(1 + \|\nabla f(x_n)\|^2\bigr)
\tag{5.11}
\]
where $A$ is a positive deterministic constant.
4. We have:
\[
\sum_{n=0}^{\infty} \gamma_n = \infty \qquad \& \qquad \sum_{n=0}^{\infty} \gamma_n^2 < \infty
\tag{5.12}
\]

Then, either $f(x_n) \to -\infty$, or else $f(x_n)$ converges to a finite value and $\lim_{n\to\infty} \nabla f(x_n) = 0$ almost surely. Furthermore, every limit point of $x_n$ is a stationary point of $f$ (a stationary point of a function $f: \mathbb{R}^d \to \mathbb{R}$ is a point in $\mathbb{R}^d$ at which the gradient $\nabla f$ is zero).

The proof of this proposition can be found in [3] and is quite extensive. Note that the fourth assumption ensures that the algorithm has steps $\gamma_n$ large enough to find the stationary point of $f$, but at the same time not too large, such that continually jumping over the minimum is prevented. In the proposition, we see that for a Lipschitz continuous first derivative of the function $f$, and noise with zero mean and bounded variance, we have almost sure convergence (or the minimum is $-\infty$). Note that the domain of the function $f$ is $\mathbb{R}^d$ in this proposition. One might get confused here, because for the posterior mode optimization problem for LDA, we want to find probability vectors that live in $(0,1)$. However, a smart transformation trick is used, such that the optimization domain is again $\mathbb{R}^d$. This trick is called the softmax transformation and is elaborated on in section 5.4.1. The stochasticity in stochastic gradient descent is not further specified other than adding random noise to the gradient. However, other choices than random noise can be made to turn the gradient descent algorithm into stochastic gradient descent. In the domain of deep learning, the objective function often consists of a sum of functions, i.e. $f(x) = \sum_{i=1}^{m} f_i(x)$. Stochastic gradient descent is then defined as [3]:

\[
x_{n+1} = x_n - a \cdot \nabla f_j(x_n)
\tag{5.13}
\]
for some $j \in \{1,\dots,m\}$. This update formula can be rewritten in the form of equation 5.8 [3]:

à m " m #! 1 X 1 X xn 1 xn a fi (xn) f j (xn) fi (xn) (5.14) + = − · m i 1 ∇ + ∇ − m i 1 ∇ = =

Note that $\frac{1}{m}\sum_{i=1}^{m} \nabla f_i(x_n) = \frac{1}{m} \cdot \nabla f(x_n)$, so it is indeed a direction of descent. Furthermore, we can check assumption 3 from proposition 5.1 [3]:

\[
\begin{aligned}
\mathbb{E}[w_n \mid \mathcal{F}_n] &= \mathbb{E}\left[ \nabla f_j(x_n) - \frac{1}{m}\sum_{i=1}^{m} \nabla f_i(x_n) \,\Big|\, \mathcal{F}_n \right] \\
&= \frac{1}{m}\sum_{i=1}^{m} \nabla f_i(x_n) - \frac{1}{m}\sum_{i=1}^{m} \nabla f_i(x_n) \\
&= 0
\end{aligned}
\tag{5.15}
\]
where we used the fact that $j$ is chosen randomly and uniformly from the set $\{1,\dots,m\}$. The second item in assumption 3 is the bound on the squared $L^2$-norm of $w_n$ [3]:

\[
\begin{aligned}
\mathbb{E}\bigl[\|w_n\|^2 \mid \mathcal{F}_n\bigr] &= \mathbb{E}\bigl[\|\nabla f_j(x_n)\|^2 \mid \mathcal{F}_n\bigr] - \bigl\|\mathbb{E}[\nabla f_j(x_n) \mid \mathcal{F}_n]\bigr\|^2 \\
&\leq \mathbb{E}\bigl[\|\nabla f_j(x_n)\|^2 \mid \mathcal{F}_n\bigr]
\end{aligned}
\tag{5.16}
\]
Now assume that there exist positive constants $C$ and $D$ such that:

\[
\|\nabla f_i(x)\| \leq C + D \cdot \|\nabla f(x)\| \qquad \forall\, i, x.
\tag{5.17}
\]
Then, it follows that:
\[
\mathbb{E}\bigl[\|w_n\|^2 \mid \mathcal{F}_n\bigr] \leq 2C^2 + 2D^2 \cdot \|\nabla f(x_n)\|^2
\tag{5.18}
\]
which clearly satisfies equation 5.11 for an adapted constant $A$. The other assumptions from proposition 5.1 are also satisfied, as shown in [3]. As mentioned in [22], the stochasticity in this type of stochastic gradient descent does not come from random noise, but from the random selection of $j$ for $f_j(x)$. Note that this form of stochastic gradient descent is called incremental gradient descent in [3]. Sometimes multiple 'sub-functions' are used instead of only $f_j$, to improve accuracy. Especially in high-dimensional problems with an objective that consists of a large summation, this is more accurate than taking only one sub-function. This type of gradient descent can also be referred to as mini-batch gradient descent; a minimal sketch is given after this paragraph. The Python package that is used for the implementation of posterior mode determination using optimization is called Tensorflow², which uses this mini-batch gradient descent method. Apart from computing the gradient of only a smaller sum of sub-functions of which the objective consists, other adaptations to the basic gradient descent method are also applied to increase performance. These adaptations together form the method used in this thesis: Adam optimization.
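A minimal sketch of the mini-batch variant just described might look as follows; the function names and defaults are purely illustrative (in the thesis itself, Tensorflow handles the batching internally).

```python
import numpy as np

def minibatch_sgd(grad_batch, m, x0, a=1e-3, batch_size=32, n_iter=10_000, seed=0):
    """Sketch of mini-batch stochastic gradient descent for f(x) = sum_i f_i(x).

    grad_batch(x, idx): gradient of the sum of f_i over the indices in idx;
    m: total number of sub-functions i = 1,...,m.
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iter):
        idx = rng.choice(m, size=batch_size, replace=False)   # random mini-batch
        # Rescale by m / batch_size so the batch gradient is an unbiased
        # estimate of the full gradient sum_i grad f_i(x).
        x = x - a * (m / batch_size) * grad_batch(x, idx)
    return x
```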

5.4. Adam optimization

Over the years and with the development of deep learning and neural networks, more high-end algorithms have been invented that speed up convergence and can better deal with non-convex objective functions and high-dimensional parameter spaces. Among them are Adadelta, RMSprop, Adagrad and Adam [39]. Adam optimization is used in this thesis to compute the posterior mode and is the most versatile for large-scale high-dimensional machine learning problems [22].

Adam is a type of stochastic gradient descent algorithm with adaptive learning rates. Its name stands for 'adaptive moment estimation', which already reveals that it uses the first and second moments of the gradient for this adaptation. We will state the algorithm for a one-dimensional problem, for ease of notation, and then explain the necessity of each step.

² One of the main advantages of Tensorflow's optimization methods is that it computes gradients using automatic differentiation instead of numerical differentiation.

Algorithm 4 Adam optimization in one dimension

1: Set $a$, $\beta_1$, $\beta_2$, $\epsilon$
2: Initialize $x = x_0$, $m_0 = 0$, $v_0 = 0$, $n = 0$
3: while $\dfrac{|f(x_n) - f(x_{n-j})|}{f(x_n)} > \text{threshold}$ do
4:   $g_{n+1} = \nabla f(x_n)$
5:   $m_{n+1} = \beta_1 \cdot m_n + (1 - \beta_1) \cdot g_{n+1}$
6:   $v_{n+1} = \beta_2 \cdot v_n + (1 - \beta_2) \cdot g_{n+1}^2$
7:   $\hat{m}_{n+1} = \dfrac{m_{n+1}}{1 - (\beta_1)^{n+1}}$
8:   $\hat{v}_{n+1} = \dfrac{v_{n+1}}{1 - (\beta_2)^{n+1}}$
9:   $x_{n+1} = x_n - a \cdot \dfrac{\hat{m}_{n+1}}{\sqrt{\hat{v}_{n+1}} + \epsilon}$
10:  $n = n + 1$
11: end while
12: Return posterior mode approximation: $x_n$
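For illustration, a straightforward NumPy version of this scheme (applied element-wise, so it also covers the multi-dimensional case discussed below) might look as follows. The stopping rule and the test function mirror the ones used in this chapter, but the code itself is only a sketch, not the Tensorflow implementation used for the experiments.

```python
import numpy as np

def adam(grad, f, x0, a=0.001, b1=0.9, b2=0.999, eps=1e-10,
         threshold=1e-4, j=100, max_iter=100_000):
    """Sketch of Algorithm 4 (Adam) for a scalar or vector argument x.

    grad and f are the gradient and the objective; the stopping rule compares
    the objective with its value j iterations earlier, as described in the text.
    """
    x = np.asarray(x0, dtype=float)
    m = np.zeros_like(x)
    v = np.zeros_like(x)
    history = [f(x)]
    for n in range(1, max_iter + 1):
        g = grad(x)
        m = b1 * m + (1 - b1) * g            # exponential moving average of the gradient
        v = b2 * v + (1 - b2) * g**2         # ... and of the element-wise squared gradient
        m_hat = m / (1 - b1**n)              # bias corrections
        v_hat = v / (1 - b2**n)
        x = x - a * m_hat / (np.sqrt(v_hat) + eps)
        history.append(f(x))
        if n > j and abs(history[-1] - history[-1 - j]) / (abs(history[-1]) + eps) < threshold:
            break
    return x

# Example on the test function f(x, y) = (x - 1)^4 + 0.5 y^4 used below.
f = lambda v: (v[0] - 1)**4 + 0.5 * v[1]**4
grad = lambda v: np.array([4 * (v[0] - 1)**3, 2 * v[1]**3])
print(adam(grad, f, x0=[0.0, 2.0], a=0.01))
```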

First, the hyperparameters are set. Recommended values are $a = 0.001$, $\beta_1 = 0.9$, $\beta_2 = 0.999$ and $\epsilon = 10^{-10}$ [22]. Depending on the dimensionality of the objective, the learning rate $a$ can be adapted, because in a high-dimensional problem we want to take smaller steps than in a lower-dimensional problem. The initial location in the parameter space from which the algorithm starts searching for a minimum is denoted by $x_0$. Then, the Adam algorithm starts 'walking' through the parameter space until convergence. Convergence is attained when the relative difference of the objective with the $j$-th previous value of the objective is smaller than a certain threshold. In the experiments in this thesis, $j = 100$ for small-dimensional problems and $j = 1000$ for large-dimensional problems. The threshold is set to $10^{-4}$, since this results in the best trade-off between accurate results and a reasonable amount of computation time. In higher-dimensional problems, the steps in algorithm 4 are applied to each dimension separately: the gradient is determined for each dimension, there are $m$-terms and $v$-terms for each dimension, and with these dimension-specific quantities, the coordinate of each dimension is updated according to step 9 of algorithm 4. For the two-dimensional case, we get $x_{n+1} = x_n - a \cdot \frac{\hat{m}_{x,n+1}}{\sqrt{\hat{v}_{x,n+1}} + \epsilon}$ and $y_{n+1} = y_n - a \cdot \frac{\hat{m}_{y,n+1}}{\sqrt{\hat{v}_{y,n+1}} + \epsilon}$. The update step in Adam is formed by combining ideas from momentum gradient descent and the RMSprop algorithm. Let us look at the update formulas in algorithm 4 step by step. In step 4 of algorithm 4, the gradient is computed. If $x$ has a dimension larger than 1, the result $g_n$ is a vector with the gradient computed with respect to each parameter dimension. Then, in step 5, a momentum term for the gradient is calculated. The formula in step 5 is a recurrence relation for the exponential moving average. Using the setting that $m_0 = 0$, we can rewrite it as:
\[
m_{n+1} = (1 - \beta_1) \cdot \sum_{i=1}^{n+1} (\beta_1)^{n+1-i} \cdot g_i
\tag{5.19}
\]
A 'basic' moving average takes into account gradients from a number of previous steps, all with equal weight. The exponential moving average is slightly different, because the weights decrease for gradients further back in time. That is, the previous gradient has a larger influence on $m_{n+1}$ than the gradient of, e.g., ten iterations back. The parameter $\beta_1$ is chosen in the interval $[0,1]$: the larger $\beta_1$, the larger the influence of previous steps. If $\beta_1$ is for example 0.5, the weight for the 10th previous iteration is only 0.001, while if $\beta_1 = 0.9$, that same weight is 0.35. This momentum term is used in optimization algorithms to damp out oscillations in the gradient. It is called a momentum term after the analogy with momentum used in physics, $p = m \cdot v$, with $m$ the mass and $v$ the velocity. One can think of a ball rolling down the slope of a bowl with an initial speed not in the direction of the minimum: it will roll down towards the minimum, but its initial momentum results in a path that circles a little around the minimum. The same momentum mechanism is applied to the squared gradient $g_n^2$. Note that the square is element-wise, resulting in a vector of the same size as $g_n$. So, also for the squared gradient, we look at the previous iterations. Because $\beta_2$ is even larger than $\beta_1$ in the recommended settings, namely 0.999, iterations further in the past are taken into account. The reason for the computation of $g_n^2$ will be elaborated on in the explanation of step 9. Steps 7 and 8 are bias correction terms.
Both m_n and v_n are biased towards 0 in the first iterations of the algorithm because both have initial value 0. Therefore, if divided by 1 − β1^n and 1 − β2^n respectively, they will

return larger values for small n, i.e. the first few iterations. After a certain number of iterations, the terms β1^n and β2^n will become so small that the bias-correction step no longer has any significant influence on m_n and v_n, as they are then just divided by 1.

At last, the update in the parameter space is given in step 9. From the previous location x_n, a step is made in the direction of steepest descent, corrected with an exponential moving average and a bias correction, i.e., m̂_{n+1}. Subsequently, it is divided by the square root of the bias-corrected and exponentially averaged squared-gradient term v̂_{n+1}, plus a small constant ε that is only included to avoid division by 0. This correction by v̂_{n+1} originates from the RMSprop algorithm [39] and results in automatic annealing, i.e., an adaptation of the learning rate. Consider a two-dimensional parameter space. From a certain location (x_n, y_n), the gradient in the y-direction is relatively large (steep hill), while in the x-direction it is small. Then, ideally, we would like to make a large step in the x-direction, because the gradient is small and taking a large step lets us converge faster towards the minimum. On the other hand, in the y-direction, we want to take a small step, to avoid overshooting. See figure 5.3 for an illustration. Exactly this correction is made by the division by √v̂_{n+1}.


Figure 5.3: Example of finding the minimum of an ellipse-like hill. From the red dot, the gradient in the x-direction is smaller than in the y-direction, so the algorithm makes a larger step in the x-direction than in the y-direction.

The steps taken by Adam in finding the minimum of the ellipse-shaped function f(x, y) = (x − 1)^4 + 0.5y^4 are visualized to provide more insight into the steps used in this algorithm. Note that the true minimum of f(x, y) is located at (1, 0), and f(x, y) is a convex function.


Figure 5.4: Contour plot of f(x, y) = (x − 1)^4 + 0.5y^4. The minimum of f is located at (x, y) = (1, 0).

The following settings are used for the fixed parameters in Adam: a = 0.01, β1 = 0.8, β2 = 0.9, ε = 10^{-10}. We start the search at (x_0, y_0) = (0, 2). In figure 5.5, we see that the algorithm walks smoothly to x = 1 and y = 0, the location of the minimum. Initially, m_x and m_y are zero. Then, m_x decreases very fast, while m_y increases in the first few iterations. Both make sense, as we need to walk to the right (on the x-axis) for x and to the left for y from the starting point (0, 2). Furthermore, m_y is larger than m_x in the absolute sense. Because an exponential moving average is used for the step size via m, the steps taken by y will be relatively large, as previous (large) gradients are taken into account.


Figure 5.5: Parameters used in Adam optimization to find the location of the minimum of f(x, y) = (x − 1)^4 + 0.5y^4: per iteration, the coordinates x and y, the momentum terms m_x and m_y, the terms v_x and v_y, and the step sizes in both directions. With step size is meant the size of the change in each iteration, that is: x_n − x_{n+1} = [step size] for each iteration n. If the step size is negative, the algorithm walks forwards.

On the other hand, we wanted to damp this effect by dividing by v, as in regions with a large gradient in the y-direction and a smaller gradient in the x-direction, the step in the former direction is smaller than the step in the latter direction. However, this phenomenon cannot clearly be seen in figure 5.5, where the step sizes of y remain large (in the absolute sense) during more iterations than the step sizes of x. Naturally, from starting point (0, 2), we are further away from the minimum in the y-direction than in the x-direction. Therefore, larger steps need to be made in the y-direction, and indeed, more iterations are needed to attain the minimum in the y-direction than in the x-direction. When we are near the optimum, the averaged gradient m and the averaged gradient squared v become very small, such that the step sizes in both the x- and y-directions get close to zero. In the last 400 iterations, we even see that the x and y coordinates hardly change, so the minimum is attained.

In [22], the Adam algorithm is compared with three other common machine learning optimization algorithms: Adagrad, RMSprop and stochastic gradient descent with Nesterov correction. From different experiments, it can be concluded that Adam converges well, is robust and is well-suited for non-convex optimization problems [22]. Therefore, it is chosen as an appropriate optimization method to compute the posterior mode for LDA.

5.4.1. Softmax transformation

For the application of Adam optimization to LDA inference, a variable transformation is needed. The parameter space of the posterior density for LDA is (0,1)^P, with P the number of parameters to be estimated. For hyperparameters α and/or β smaller than 1, parameters reaching 0 or 1 will cause numerical problems, as the posterior goes to infinity. Furthermore, all parameter vectors θ_d and φ_k with d = 1,...,M and k = 1,...,K need to sum to 1. These two constraints are relatively hard to implement in the algorithm. Therefore, the softmax transformation is applied to all parameters θ and φ. This transformation is defined for, e.g., θ_d of length K as:

(\theta_d)_i = \frac{e^{(\tilde\theta_d)_i}}{\sum_{j=1}^{K} e^{(\tilde\theta_d)_j}} \qquad (5.20)

where (θ_d)_i is between 0 and 1, as desired, and Σ_{i=1}^K (θ_d)_i = 1. Now, the optimization takes place in the parameter space of (θ̃_d)_i, which is R. With this transformation, both constraints are automatically implemented, and there are no numerical problems around 0 or 1. However, we need to be careful. If the posterior mode is attained for θ_i = 0, as can be the case for α or β smaller than 1, (θ̃_d)_i must go to −∞. Therefore, the optimization will keep running, pushing θ̃_i towards −∞. A regularization term creates a bound on the size of (θ̃_d)_i, such that the optimization algorithm is punished if it keeps pushing a transformed parameter, e.g. (θ̃_d)_i, towards −∞. For LDA inference, we do not need parameters that are very accurate. That is, if θ_i = 0 in the true posterior mode, then θ_i = 10^{-4} is more than accurate enough, especially because the number of topics K (and thus the size of θ) is rarely larger than 50. With the softmax transformation, the solution found by Adam is no longer unique in terms of the transformed variables θ̃_d for d = 1,...,M and φ̃_k for k = 1,...,K, which each live in R^d with d = K for the document-topic distributions and d = V for the topic-word distributions. A constant can be added to each (θ̃_d)_i and the solution (θ_d)_i will be the same. Exactly because the parameters we are interested in, (θ_d)_i, are invariant under this non-uniqueness of (θ̃_d)_i, we do not mind the multiple solutions. Furthermore, regularization can steer towards small values of (θ̃_d)_i, as will be explained in the next section.
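As an illustration of equation 5.20, a minimal NumPy sketch of the softmax transformation is given below; the shift by the maximum is only for numerical stability and also demonstrates the invariance under adding a constant discussed above.

import numpy as np

def softmax(theta_tilde):
    """Map an unconstrained vector in R^K to the K-simplex, as in equation 5.20."""
    z = theta_tilde - np.max(theta_tilde)   # adding a constant does not change the result
    e = np.exp(z)
    return e / e.sum()

theta_tilde = np.array([0.3, -1.2, 2.0])
print(softmax(theta_tilde))                 # sums to 1, all entries in (0, 1)
print(softmax(theta_tilde + 5.0))           # identical: the solution is invariant under shifts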

5.4.2. Regularization

Adam optimization with the softmax transformation works well without regularization if both hyperparameters are larger than 1, as will be seen in chapter 8. However, if one of them is smaller than 1, the algorithm will keep walking towards the edges of the domain, which is R for each parameter if we use the softmax transformation. Regularization is applied to prevent this undesired behavior of the algorithm. Different choices can be made for regularization. The two most common ones, especially in machine learning, are lasso and ridge regularization. The lasso ensures that the L1-norm of each parameter vector is not too large, while in ridge regularization, the L2-norm is used. Note that sometimes ridge regularization is referred to as Tikhonov regularization [14]. In terms of the objective, we get:

\text{objective} = -\log(\text{posterior}) + \lambda_x \cdot \|x\|_p \qquad (5.21)

where for the lasso p = 1, and for ridge regularization p = 2. An appropriate value for λ needs to be determined iteratively. For high-dimensional problems with both α and β (much) smaller than 1, λ is chosen to be a bit larger than for lower-dimensional problems or if only one hyperparameter is smaller than 1. One can think of λ = 10 for the first case, and λ = 1 for the latter. Each parameter can have its own corresponding λ, but in practice, they are chosen to be all equal. Both lasso and ridge regularization are implemented with Adam optimization and seem to do their job. However, in the extreme cases in which the dimensionality is high or α or β are (much) smaller than 1, this regularization is not strong enough to prevent the algorithm from going to −∞. Therefore, another, stronger type of regularization is used, based on the maximum value of each random vector that is estimated. Following the example in equation 5.20 with vector θ_d for some d ∈ {1,...,M}:

\text{objective} = -\log(\text{posterior}) + \lambda_{\theta_d} \cdot \left( \max_i \tilde\theta_i \right)^4 \qquad (5.22)

In practice, each λ is set to 1, as the regularization term itself is already strong enough to compete with the −log(posterior) term in the optimization algorithm. The maximal value of each random vector is drawn towards 0, as we are minimizing the objective and the maximum to the fourth power cannot be negative. Note that for α and β smaller than 1, all other elements will be negative. With the maximum of each vector constrained, the other elements of that same vector will also automatically be constrained, since they are highly dependent on each other. Remember that after the softmax transformation, each random vector that is estimated sums to 1. With one element being close to 0 in the transformed space (R), the others have to follow, which prevents getting only unit vectors as estimations for Θ and Φ, which is probably not the location of the minimum of −log(posterior).
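As a sketch of how the penalty of equation 5.22 could be combined with the objective, assuming a placeholder function neg_log_posterior for −log(posterior) evaluated in the transformed parameters:

import numpy as np

def regularized_objective(theta_tilde_list, neg_log_posterior, lam=1.0):
    """Objective of equation 5.22: -log(posterior) plus a (max)^4 penalty per transformed vector.

    theta_tilde_list : list of unconstrained (softmax-transformed) parameter vectors
    neg_log_posterior: placeholder callable returning -log p(parameters | data)
    lam              : regularization weight lambda (set to 1 in practice)
    """
    penalty = sum(lam * np.max(t)**4 for t in theta_tilde_list)
    return neg_log_posterior(theta_tilde_list) + penalty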

6 LDA with syntax and sentiment

"The integers of language are sentences, and their organs are the parts-of-speech. Linguistic organization, then, consists in the differentiation of the parts-of-speech and the integration of the sentence." John Wesley Powell (1834-1902)

The aim of the application of LDA to a large set of reviews is to extract about which topics people write. To this end, a topic distribution per document and a word distribution per topic are determined. The first shows in which proportions topics are written about, and therefore which topics are the most important to each customer and on average. The second distribution tells what story or theme is linked to each topic. Based on the word distribution, we hope to draw a founded conclusion on the overarching issue of that topic.

Although the basic version of LDA can already be very informative, it would be better to also know a sentiment per topic from the reviews. Think about knowing which words are used to describe a favorable opinion on, e.g., the price of a product. Secondly, the topic-word distributions consist of all types of words in terms of parts-of-speech: nouns, adjectives, verbs, pronouns et cetera. It would be ideal if each topic tells about one aspect of a product, linked with opinion words, such that a story is told in that topic. An aspect is typically described using nouns or verbs, while opinion words are from the lexical categories adjectives or adverbs.

Combining the wishes above results in the extension defined in this chapter. The results from 'LDA with syntax and sentiment' consist of topic distributions per document, sentiment distributions per document and, most importantly, a word distribution per combination of topic, sentiment and part-of-speech. This means that there are K · Σ · C word distributions, where K is the number of topics, Σ is the number of possible sentiments, and C the number of different parts-of-speech used.

An extra feature of the extension in this chapter is the integration of sentence LDA. Instead of only looking at documents as a bag of words, we focus the attention on sentences or even clauses, hereafter referred to as phrases, although, strictly speaking, this is not the best terminology. The critical and strong assumption is made that each phrase is about one topic and has one sentiment. It is expected that with this assumption, and with the construction of bags of words within each phrase instead of on document level, the results of LDA will become more accurate.

In this chapter, first the generative model will be explained, and the plate notation of LDA with syntax and sentiment will be elaborated on. Then some small notes on practicalities are made, e.g., how to split documents into phrases. Lastly, the posterior distribution of the desired random variables is derived, and inference methods are explained.

6.1. Into the more complicated mind of the writer: generative process

In this section, we will dive again into the mind of the writer of a review, but now we assume that there are more steps involved than those described in section 3.1. We will take the example of a stroller review again. As a writer, you first think about which aspects you want to write. Suppose you feel disappointed, as the stroller

you have bought was very expensive, and you are not satisfied with the product. Then, you want to explain your disappointment: the stroller is cumbersome, too large to fit in the car, and the basket underneath is too small. As a consequence, you want to talk about four topics: value for money, weight, size, and the basket. In addition to these topics, a sentiment is added. The value for money aspect gives you a negative sentiment. The same is valid for weight, size, and basket. When you start writing the review, in each sentence or clause, one aspect is described with negative sentiment. This supports the assumption that every phrase has only one topic and one sentiment. Secondly, once you have chosen the topic and your opinion, words need to be selected. In general, you will use nouns or verbs to describe aspects. Think of the example above in which 'fit' and 'basket' are nouns. Then, the sentiment is often described by an adjective or adverb: 'too large' or 'too small'. Note that we need the word 'too' here to describe a negative sentiment. If the words 'too' and 'small' occur together in many phrases, they will both have a high probability in the word distributions for that topic, and therefore the negative sentiment can be extracted.

The process of writing a review can be summarized in the following steps.

1. For each topic k ∈ {1,...,K}:

   (a) For each sentiment o ∈ {1,...,Σ}:

       i. Draw a topic-sentiment-word distribution Φ_{k,o} from Dirichlet(β_o)

2. For each document d ∈ {1,...,M}:

   (a) Draw a topic distribution Θ_d from Dirichlet(α)

   (b) Draw a sentiment distribution Π_d from Dirichlet(γ)

   (c) For each phrase s ∈ {1,...,S_d}:

       i. Draw a topic Z_{d,s} from Multinomial(1, Θ_d)

       ii. Draw a sentiment Σ_{d,s} from Multinomial(1, Π_d)

       iii. For each word i in sentence s:

            A. Pick a part-of-speech c_{d,s,i}

            B. Draw a word (W_{d,s})_i from Multinomial(1, Φ_{(z_d)_{s'}, (σ_d)_{s'}, c_{d,s,i}})

Again, attention must be paid to all steps in which a draw from a Multinomial distribution is made. Drawing from a Multinomial(1, Θ_d) results in a vector instead of an integer. Therefore, remember that we defined in section 3.1, Z̃_{d,s} ∼ Multinomial(1, Θ_d), such that (Z_d)_{s'} = k ⟺ Z̃_{d,s} = (0,0,...,1,0,...,0) with only one 1, in the k-th dimension of Z̃_{d,s}. That is, Z̃_{d,s} is the unit vector in dimension k. When Z_{d,s} is drawn from Multinomial(1, Θ_d), actually Z̃_{d,s} is drawn from Multinomial(1, Θ_d) and the mapping (Z̃_{d,s})_k = 1 ⇒ (Z_d)_{s'} = k for some k ∈ {1,...,K} is applied.

Although in the process above a part-of-speech c_{d,s,i} is drawn, we do not include it in the final model. The reason for this choice is the fact that the part-of-speech of a word cannot be learned from the data, such that the topic-sentiment-word distributions per part-of-speech would be very inaccurate. One improvement could be to set the prior distribution β_{o,c} for sentiment o and part-of-speech c such that words in the vocabulary corresponding to part-of-speech c have a higher probability. However, to this end, we would still need to determine the part-of-speech of each word in the vocabulary, which is similar to first finding the topic-sentiment-word distributions with all parts-of-speech included and then, afterwards, doing the split. We conclude that the latter method is a better way to determine the word distributions per topic, per sentiment and per part-of-speech, if only because the dimensionality of the parameter set that needs to be inferred is lower.
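To make the generative story concrete, a small NumPy simulation of the model as used here (without the part-of-speech draw) is sketched below; all sizes, hyperparameter values and the random seed are arbitrary illustration choices, not the settings used in this thesis.

import numpy as np

rng = np.random.default_rng(0)
K, Sigma, V, M = 3, 3, 50, 10           # topics, sentiments, vocabulary size, documents
alpha = np.full(K, 0.5)                  # prior on document-topic distributions
gamma = np.full(Sigma, 0.5)              # prior on document-sentiment distributions
beta = np.full((Sigma, V), 0.1)          # one prior vector per sentiment

# Topic-sentiment-word distributions Phi[k, o]
phi = np.array([[rng.dirichlet(beta[o]) for o in range(Sigma)] for k in range(K)])

corpus = []
for d in range(M):
    theta_d = rng.dirichlet(alpha)                 # topic distribution of document d
    pi_d = rng.dirichlet(gamma)                    # sentiment distribution of document d
    document = []
    for s in range(rng.integers(3, 8)):            # number of phrases in this document
        z = rng.choice(K, p=theta_d)               # one topic per phrase
        sigma = rng.choice(Sigma, p=pi_d)          # one sentiment per phrase
        n_words = rng.integers(2, 6)
        words = rng.choice(V, size=n_words, p=phi[z, sigma])
        document.append(list(words))               # word indices of this phrase
    corpus.append(document)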

The sentiment of the reviews can be found by taking smart prior vectors β_o. There exist lists with positive and negative words in the English language (see appendix B.6). With these lists, it can be determined which words in the vocabulary of the concerned corpus are positive, and which are negative. The remaining words are considered neutral. Then, the hyperparameter vector β_pos, imposed on the positive-sentiment topic-word

vectors Φ_{k,pos}, is chosen such that the positive words get a higher weight than the neutral and negative words. Note that the latter are not given probability zero, following Cromwell's rule¹. An overview of all sets, random variables and random vectors is given in table 6.1.

Table 6.1: (Random) variables used in Latent Dirichlet Allocation with syntax and sentiment.

Symbol | Meaning | Type (and size) | Space
V | Size of vocabulary | integer | ℕ
K | Number of topics | integer | ℕ
M | Number of documents in corpus | integer | ℕ
Σ | Number of sentiments | integer | ℕ
C | Number of parts-of-speech | integer | ℕ
S_d | Number of phrases in document d | integer | ℕ
N_{S_d} | Number of words in phrase S_d | integer: 1 × 1 | ℕ
α | Prior belief on document-topic distribution | vector: 1 × K | ℝ_{>0}^K
β_o | Prior belief on word distribution for a sentiment | vector: 1 × V | ℝ_{>0}^V
γ | Prior belief on document-sentiment distribution | vector: 1 × Σ | ℝ_{>0}^Σ
Φ_{k,o} | Parameter vector of multinomial word distribution for topic k, sentiment o | vector: 1 × V | T_V(1) (simplex)
Θ_d | Parameter vector of multinomial topic distribution for document d | vector: 1 × K | T_K(1) (simplex)
Π_d | Parameter vector of multinomial sentiment distribution for document d | vector: 1 × Σ | T_Σ(1) (simplex)
Z̃_{d,s} | Unit vector in the dimension of the chosen topic for phrase s | vector: 1 × K | {0,1}^K
(Z_d)_{s'} | Topic (index) for phrase s in document d | integer | {1,...,K}
Σ̃_{d,s} | Unit vector in the dimension of the chosen sentiment for phrase s | vector: 1 × Σ | {0,1}^Σ
(Σ_d)_{s'} | Sentiment (index) for phrase s in document d | integer | {1,...,Σ}
(w_{d,s})_i | Word index corresponding to location i in phrase s from document d | integer | {1,...,V}

For simplicity, several assumptions on independence are made. We assume that each sentiment distribution Π_d and each topic distribution Θ_d is drawn from its prior independently of the sentiment and topic distributions of the other documents, and that Π_d and Θ_d are independent random vectors. The latter is a strict assumption which might be violated in some reviews, because topics and sentiment can be correlated. However, these assumptions are needed for tractable inference. Furthermore, the topic for each phrase is drawn independently of the topics of the previous phrases in the same document. This assumption will probably not hold in most cases, as the probability of writing about the same topic in the current phrase as in the previous phrase differs from the unconditional probability of writing about that topic. Also, the sentiment of each sentence is drawn independently of the preceding sentiments in the same document. At the deepest level, the word level, independence assumptions are also made. Each word (W_{d,s})_i is drawn independently of the other words in that phrase. Also, all word distributions per topic and sentiment combination, Φ_{k,o} for some topic k and sentiment o, are assumed to be independent. The resulting plate notation (cf. figure 3.1) of LDA with syntax and sentiment is given in figure 6.1.

¹ Oliver Cromwell (1599-1658) was an English political leader, who wrote in one of his letters to the General Assembly of Scotland: "I beseech you, in the bowels of Christ, think it possible that you may be mistaken." [9]. Later, this quote was used by statisticians to say that you should always leave some positive probability for unexpected things to happen, and assign a probability smaller than 1 to events that (almost) certainly occur.


Figure 6.1: Plate notation of the extension of LDA specifically designed for review studies, also called 'LDA with syntax and sentiment'. Each rectangle represents a repetitive action with, in the bottom-right corner, the number of times the action (e.g. a draw from a distribution) is executed.

6.2. Practical choices in phrase detection

Because each document must be split into phrases, some rules need to be set. Trivially, every sentence ends with a period or another punctuation symbol (e.g. '!' or '?'). That is a natural first rule to split up a document. A comma is a more difficult punctuation mark, as it has multiple functions. Indeed, it can split up sentences into clauses, but it also arises in enumerations. In most practical implementations of this extended LDA model, the choice will be made to use every comma as a location at which phrases are split. Some attention is needed, though, since some data sets might contain texts that are not nicely written, in the sense that many commas are used in each sentence. In this type of text, it is not wise to split the documents into phrases based on comma occurrence. Therefore, a check is needed for each data set. Lastly, conjunctions can be used to denote clauses. The conjunctions themselves are of no use in LDA, so they are only used to split up phrases and are then removed from the data set.

An example will show how the splitting rules above function. The review below is an actual review about a shaver.

Nothing special with this shaver. It seems underpowered (runs on rechargeable AA batteries) and the blades aren’t high quality. It has a nice feel in your hand but it doesn’t shave nearly as close as my 10 yr. old Panasonic wet/dry. It also beat up my face a bit, leaving skin red and tender. If you have a tougher beard, I recommend investing in a higher end shaver.

The splitting process will then be as follows. The conjunctions and punctuation marks are removed after the split. Also, the following preprocessing steps are applied: words shorter than three letters are eliminated, numbers are removed, and all uppercase letters are converted to lowercase letters. Moreover, punctuation marks like the apostrophe (') and slashes (/) are replaced with a space.

nothing special with this shaver | seems underpowered | runs rechargeable batteries | the blades arent high quality | has nice feel your hand | doesnt shave nearly close | old panasonic wet dry | also beat face bit | leaving skin red | tender | you have tougher beard | recommend investing higher end shaver
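A rough sketch of splitting and preprocessing rules of this kind, using only the Python standard library, is given below; the conjunction list and the exact regular expressions are simplified assumptions for illustration and not the exact preprocessing used in this thesis.

import re

CONJUNCTIONS = {"and", "but", "or", "because", "so", "if", "although", "while"}

def split_into_phrases(review):
    """Split a review into phrases on sentence punctuation, commas and conjunctions."""
    text = review.lower()
    text = re.sub(r"[’'/]", " ", text)              # apostrophes and slashes become spaces
    text = re.sub(r"[0-9]+", " ", text)             # remove numbers
    chunks = re.split(r"[.!?,;)]+", text)           # split on sentence/clause punctuation
    phrases = []
    for chunk in chunks:
        phrase = []
        for word in re.findall(r"[a-z]+", chunk):
            if word in CONJUNCTIONS:                # conjunctions split phrases and are dropped
                if phrase:
                    phrases.append(" ".join(phrase))
                phrase = []
            elif len(word) >= 3:                    # drop words shorter than three letters
                phrase.append(word)
        if phrase:
            phrases.append(" ".join(phrase))
    return phrases

print(" | ".join(split_into_phrases("Nothing special with this shaver. It seems underpowered.")))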

The parts-of-speech that are of interest in this model are chosen to be: nouns, verbs, adverbs, and adjectives. Also, interjections are kept, although they do not occur very often. When we highlight these parts-of-speech in boldface, the aspects and corresponding sentiments in the review become clear.

nothing special with this shaver seems underpowered runs rechargeable batteries the blades arent high quality has nice feel your hand doesnt shave nearly close old panasonic wet dry also beat face bit leaving skin red tender you have tougher beard recommend investing higher end shaver

From the review example above, we conclude that, theoretically, LDA with syntax and sentiment is promising and suitable for review analyses.

A remark needs to be made for this example. The review considered is written very neatly, with commas where they belong and in proper English. In most review data sets, however, reviews are not written this well, and either no commas or a lot of commas are used, such that the phrases based on comma splits are not informative anymore. Therefore, one needs to decide per data set which splitting rules are the most appropriate and give the best results.

6.3. Estimating the variables of interest

LDA with syntax and sentiment aims to retrieve the topics customers write about, in combination with their sentiment about them. Also, after having retrieved the word distributions per topic and sentiment combination, a further split per part-of-speech is made, such that the final result consists of topic-sentiment-word distributions per part-of-speech, in which the following parts-of-speech are taken into account: nouns, verbs, adjectives, and adverbs. In formulas: the goal is to retrieve estimates for Θ_d and Π_d with d = 1,...,M, and Φ_{k,o} with k = 1,...,K and o = 1,...,Σ. The word distributions Φ_{k,o} can then be split into word distributions per part-of-speech, i.e. Φ_{k,o,c} with c being a noun, verb, adjective or adverb.

Again, different methods can be chosen to estimate the desired parameters. Because we use a Bayesian hierarchical model, it is natural to determine the posterior mean or mode. Given the topic and sentiment exchangeability, the posterior mean of the whole posterior distribution is not a wise choice as an estimator for the parameters mentioned earlier; a more thorough explanation can be found in chapter 4. Two good possibilities remain: the posterior mean via Gibbs sampling, and the posterior mode. The posterior mode estimate can be determined via Adam optimization, as described in chapter 5.

The posterior density can be expressed as follows. Note that we slightly abuse the notation Θ, with which we actually mean Θ1,...,ΘM. The shorter notation helps to keep the posterior distribution readable.

p(\theta,\pi,\phi \mid w) = \frac{p(w \mid \theta,\pi,\phi) \cdot p(\theta,\pi,\phi)}{p(w)} \propto p(w \mid \theta,\pi,\phi) \cdot p(\theta,\pi,\phi) \qquad (6.1)

First, we will look at the left factor on the right-hand side, i.e. the likelihood.

p(w \mid \theta,\pi,\phi) = \prod_{d=1}^{M} p(w_d \mid \theta_d,\pi_d,\phi)
= \prod_{d=1}^{M} \prod_{s=1}^{S_d} p(w_{d,s} \mid \theta_d,\pi_d,\phi)
= \prod_{d=1}^{M} \prod_{s=1}^{S_d} \left( \sum_{k=1}^{K} p(w_{d,s} \mid z_{d,s}=k, \pi_d, \phi_k)\, p(z_{d,s}=k \mid \theta_d) \right)
= \prod_{d=1}^{M} \prod_{s=1}^{S_d} \left( \sum_{k=1}^{K} \sum_{o=1}^{\Sigma} p(w_{d,s} \mid z_{d,s}=k, \sigma_{d,s}=o, \phi_{k,o})\, p(z_{d,s}=k \mid \theta_d)\, p(\sigma_{d,s}=o \mid \pi_d) \right) \qquad (6.2)
= \prod_{d=1}^{M} \prod_{s=1}^{S_d} \left( \sum_{k=1}^{K} \sum_{o=1}^{\Sigma} \left[ \prod_{i=1}^{N_{s'}} p((w_{d,s})_i \mid z_{d,s}=k, \sigma_{d,s}=o, \phi_{k,o}) \right] p(z_{d,s}=k \mid \theta_d)\, p(\sigma_{d,s}=o \mid \pi_d) \right)
= \prod_{d=1}^{M} \prod_{s=1}^{S_d} \left( \sum_{k=1}^{K} \sum_{o=1}^{\Sigma} \left[ \prod_{i=1}^{N_{s'}} (\phi_{k,o})_{(w_{d,s})_i} \right] (\theta_d)_k\, (\pi_d)_o \right)
= \prod_{d=1}^{M} \prod_{s=1}^{S_d} \left( \sum_{k=1}^{K} \sum_{o=1}^{\Sigma} (\theta_d)_k\, (\pi_d)_o \left[ \prod_{j=1}^{V} (\phi_{k,o})_j^{n_{d,s,j}} \right] \right)

Here, the count array n is introduced. This count array has shape M × max_d{S_d} × V, such that the frequency of each word per phrase per document is registered. To obtain one array with all summarized data, we use max_d{S_d} as the second dimension. Note that the remainder of the array is filled up with zeros if, for some document, the number of phrases is smaller than max_d{S_d}.
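A sketch of how the count array n could be built from a tokenized corpus is given below, assuming each document is represented as a list of phrases and each phrase as a list of word indices in {0,...,V−1} (zero-based for convenience).

import numpy as np

def build_count_array(corpus, V):
    """Count array n of shape M x max_d S_d x V: n[d, s, j] = frequency of word j in phrase s of document d."""
    M = len(corpus)
    S_max = max(len(doc) for doc in corpus)
    n = np.zeros((M, S_max, V), dtype=int)          # unused phrase slots stay zero
    for d, doc in enumerate(corpus):
        for s, phrase in enumerate(doc):
            for j in phrase:
                n[d, s, j] += 1
    return n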

Then the right term, which represents the prior distributions involved in the LDA extension, can be expressed as follows.

p(\theta,\pi,\phi) = p(\theta \mid \alpha)\, p(\pi \mid \gamma)\, p(\phi \mid \beta)
= \left[ \prod_{d=1}^{M} p(\theta_d \mid \alpha)\, p(\pi_d \mid \gamma) \right] \left[ \prod_{k=1}^{K} \prod_{o=1}^{\Sigma} p(\phi_{k,o} \mid \beta_o) \right] \qquad (6.3)
\propto \left[ \prod_{d=1}^{M} \left( \prod_{k=1}^{K} (\theta_d)_k^{(\alpha)_k - 1} \right) \left( \prod_{o=1}^{\Sigma} (\pi_d)_o^{(\gamma)_o - 1} \right) \right] \left[ \prod_{k=1}^{K} \prod_{o=1}^{\Sigma} \prod_{j=1}^{V} (\phi_{k,o})_j^{(\beta_o)_j - 1} \right]

The complete expression for the posterior distribution of (Θ, Π, Φ) then becomes:

p(\theta,\pi,\phi \mid w) \propto p(w \mid \theta,\pi,\phi)\, p(\theta,\pi,\phi)
= \left[ \prod_{d=1}^{M} \prod_{s=1}^{S_d} \left( \sum_{k=1}^{K} \sum_{o=1}^{\Sigma} (\theta_d)_k (\pi_d)_o \left[ \prod_{j=1}^{V} (\phi_{k,o})_j^{n_{d,s,j}} \right] \right) \right] \cdot \left[ \prod_{d=1}^{M} \left( \prod_{k=1}^{K} (\theta_d)_k^{(\alpha)_k - 1} \right) \left( \prod_{o=1}^{\Sigma} (\pi_d)_o^{(\gamma)_o - 1} \right) \right] \cdot \left[ \prod_{k=1}^{K} \prod_{o=1}^{\Sigma} \prod_{j=1}^{V} (\phi_{k,o})_j^{(\beta_o)_j - 1} \right] \qquad (6.4)
= \left[ \prod_{d=1}^{M} \left( \prod_{k=1}^{K} (\theta_d)_k^{(\alpha)_k - 1} \right) \left( \prod_{o=1}^{\Sigma} (\pi_d)_o^{(\gamma)_o - 1} \right) \left( \prod_{s=1}^{S_d} \left( \sum_{k=1}^{K} \sum_{o=1}^{\Sigma} (\theta_d)_k (\pi_d)_o \left[ \prod_{j=1}^{V} (\phi_{k,o})_j^{n_{d,s,j}} \right] \right) \right) \right] \cdot \left[ \prod_{k=1}^{K} \prod_{o=1}^{\Sigma} \prod_{j=1}^{V} (\phi_{k,o})_j^{(\beta_o)_j - 1} \right]

6.3.1. Posterior mode: optimization

With the expression for the posterior density, we can determine the posterior mode. This statistic is expected to be a good estimator for all latent random variables of interest in the model, that is, Θ_1,...,Θ_M, Π_1,...,Π_M and Φ_{1,1},...,Φ_{1,Σ},...,Φ_{K,1},...,Φ_{K,Σ}. For the application of Adam optimization to find the posterior mode, it is better to use the log posterior:

\log p(\theta,\pi,\phi \mid w) = C + \sum_{d=1}^{M} \left[ \sum_{k=1}^{K} ((\alpha)_k - 1)\log((\theta_d)_k) + \sum_{o=1}^{\Sigma} ((\gamma)_o - 1)\log((\pi_d)_o) + \sum_{s=1}^{S_d} \log\left( \sum_{k=1}^{K} \sum_{o=1}^{\Sigma} (\theta_d)_k (\pi_d)_o \prod_{j=1}^{V} (\phi_{k,o})_j^{n_{d,s,j}} \right) \right] + \sum_{k=1}^{K} \sum_{o=1}^{\Sigma} \sum_{j=1}^{V} ((\beta_o)_j - 1)\log\left( (\phi_{k,o})_j \right)
= C + \sum_{d=1}^{M} \left[ \sum_{k=1}^{K} ((\alpha)_k - 1)\log((\theta_d)_k) + \sum_{o=1}^{\Sigma} ((\gamma)_o - 1)\log((\pi_d)_o) \right] + \sum_{d=1}^{M} \sum_{s=1}^{S_d} \log\left( \sum_{k=1}^{K} \sum_{o=1}^{\Sigma} (\theta_d)_k (\pi_d)_o \exp\left\{ \sum_{j=1}^{V} n_{d,s,j} \log\left( (\phi_{k,o})_j \right) \right\} \right) + \sum_{k=1}^{K} \sum_{o=1}^{\Sigma} \sum_{j=1}^{V} ((\beta_o)_j - 1)\log\left( (\phi_{k,o})_j \right) \qquad (6.5)

Here, the constant C originates from the fact that the posterior density in equation 6.4 is only expressed up to a proportionality constant. In equation 6.5 we can see that the log posterior again has a satisfactory form, for which optimization is well possible. The sum of subfunctions allows us to do stochastic gradient descent, which is used in Adam optimization, and the Python package Tensorflow can run the algorithm in parallel, keeping the computation time within reasonable bounds. Furthermore, the sums in the third term of the log posterior are actually three tensor products. A tensor product can be seen as a product of two high-dimensional arrays in which the dimension over which is summed is specified, and it can easily be implemented in Tensorflow. The same softmax transformation and regularization methods are used as explained in chapter 5. Only an extra trick needs to be applied, as this extended version of LDA has more problems when the hyperparameters are smaller than 1.

Numerical stability via the log-sum-exp trick

The posterior density, or the log posterior, of LDA with syntax and sentiment is even more complicated than the posterior density of plain LDA. Therefore, other numerical problems arise. If some parameter (Θ_d)_k, (Π_d)_o or (Φ_{k,o})_j gets too close to zero during the optimization process, the algorithm will reach the bounds of numerical precision when computing the exponent in the softmax transformation. The calculated objective will return a 'NaN', resulting in an immediate exit from the optimization. This problem, caused by a lack of numerical precision, can be solved by using a smart way of rewriting the log posterior.

Consider an example in which the x_i, i = 1,...,n, are such that log(Σ_i e^{x_i}) is hard to compute numerically, for example due to the x_i being very negative. If e^{x_i} is smaller than the machine precision, log(e^{x_i}) will return either −∞ or NaN, both resulting in an objective that cannot be computed. Therefore, the optimization algorithm is stopped, and no results are given. To avoid this type of problem with machine precision, the 'log-sum-exp' transformation, often used in machine learning, is applied:

\log\left( \sum_{i=1}^{n} e^{x_i} \right) = x^* + \log\left( \sum_{i=1}^{n} e^{x_i - x^*} \right) \qquad (6.6)

Here, x^* is the maximum of all parameters, i.e. x^* = max_i{x_i}. Naturally, x^* can always be calculated, as it will be of the order −50 for e^{x^*} close to 0. This is often the case when the optimization drives the optimal values of the parameters to the left bound of the domain [0,1]. Additionally, e^{x_i − x^*} can now be computed: where e^{−100} could not be computed, e^{−100−(−50)} can, considering the example in which x^* = −50.

Note that in the left-hand side of equation 6.6, the term e^{−50} strongly dominates if all other x_j are around −100, so indeed, the right-hand side is a good alternative for the computation of the log-sum-exp term. The second term in the log posterior from equation 6.5 is rewritten with this log-sum-exp trick for each tensor product. Note that in equation 6.5, the terms are not yet of the form log(Σ e^{x_i}), so some extra logarithms are needed to facilitate the log-sum-exp trick. This might seem redundant, but it does increase numerical stability, while the optimization is still done using the same objective. In Tensorflow there is a ready-to-use function for this log-sum-exp trick, as it is very often used in neural network and deep learning algorithms.
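As an illustration, the phrase-level term of the log posterior in equation 6.5 can be evaluated in log space with scipy.special.logsumexp (Tensorflow offers tf.reduce_logsumexp for the same purpose). The sketch below assumes strictly positive θ, π and φ and the count array n introduced earlier; it is an illustration, not the implementation used in this thesis.

import numpy as np
from scipy.special import logsumexp

def phrase_term_log_posterior(n, theta, pi, phi):
    """Sum over documents and phrases of log sum_{k,o} theta_dk * pi_do * prod_j phi_koj^n_dsj,
    evaluated stably in log space.

    n     : count array of shape (M, S_max, V)
    theta : document-topic distributions, shape (M, K)
    pi    : document-sentiment distributions, shape (M, Sigma)
    phi   : topic-sentiment-word distributions, shape (K, Sigma, V)
    """
    log_phi = np.log(phi)                                    # (K, Sigma, V)
    # sum_j n_dsj * log phi_koj as a tensor product: shape (M, S_max, K, Sigma)
    word_part = np.einsum('dsj,koj->dsko', n, log_phi)
    scores = (word_part
              + np.log(theta)[:, None, :, None]              # add log theta_dk
              + np.log(pi)[:, None, None, :])                # add log pi_do
    per_phrase = logsumexp(scores, axis=(2, 3))              # (M, S_max)
    mask = n.sum(axis=2) > 0                                 # skip empty (padded) phrase slots
    return per_phrase[mask].sum()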

6.3.2. Posterior mean: Gibbs sampling

As an alternative method of inference, Markov chain Monte Carlo sampling can be used to obtain good estimates of the model parameters. The model parameters of the LDA with syntax and sentiment model are Θ_d for d = 1,...,M, the topic distribution per document; Π_d, also for d = 1,...,M, the sentiment distribution per document; and Φ_{k,o} for k = 1,...,K and o = 1,...,Σ, that is, the word distribution per topic (k) and sentiment (o) combination. In LDA with syntax and sentiment, the distributions are chosen such that Gibbs sampling is possible, that is, the conditional distributions are known and belong to a familiar family of distributions such as the Dirichlet or Multinomial distributions. This makes the MCMC sampling method a lot simpler. Although we are only interested in Θ, Π and Φ, all latent variables that are specified in the model need to be sampled. In LDA with syntax and sentiment, this means that (Z_d)_{s'} and (Σ_d)_{s'} for d = 1,...,M and s = 1,...,S_d, thus the topic and sentiment of each phrase in each document, also need to be sampled from their corresponding conditional distributions. Below, the distribution of each random variable or random vector conditional on all other random parameters in the model is derived.

First, we determine the distribution of the topic distribution Θ_{d'} of document d' conditional on all other parameters in the model. The same procedure as in section 4.1.2 is followed.

p(\theta_{d'} \mid \theta_1,\dots,\theta_{d'-1},\theta_{d'+1},\dots,\theta_M,\pi,\phi,z,\sigma,w,\alpha,\beta,\gamma) = \frac{p(\theta_1,\dots,\theta_M,\pi,\phi,z,\sigma,w \mid \alpha,\beta,\gamma)}{p(\theta_1,\dots,\theta_{d'-1},\theta_{d'+1},\dots,\theta_M,\pi,\phi,z,\sigma,w \mid \alpha,\beta,\gamma)}
\propto p(\theta_1,\dots,\theta_M,\pi,\phi,z,\sigma,w \mid \alpha,\beta,\gamma)
= \left( \prod_{d=1}^{M} \left[ \prod_{s=1}^{S_d} p(w_{d,s} \mid \tilde z_{d,s},\tilde\sigma_{d,s},\phi_{z_{d,s},\sigma_{d,s}})\, p(\tilde z_{d,s} \mid \theta_d)\, p(\tilde\sigma_{d,s} \mid \pi_d) \right] p(\theta_d \mid \alpha)\, p(\pi_d \mid \gamma) \right) \left( \prod_{k=1}^{K} \prod_{o=1}^{\Sigma} p(\phi_{k,o} \mid \beta_o) \right)
\propto \left( \prod_{s=1}^{S_{d'}} p(w_{d',s} \mid \tilde z_{d',s},\tilde\sigma_{d',s},\phi_{(z_{d'})_{s'},(\sigma_{d'})_{s'}})\, p(\tilde z_{d',s} \mid \theta_{d'}) \right) p(\theta_{d'} \mid \alpha)
\propto \left( \prod_{s=1}^{S_{d'}} p(\tilde z_{d',s} \mid \theta_{d'}) \right) p(\theta_{d'} \mid \alpha)
\propto \left( \prod_{s=1}^{S_{d'}} (\theta_{d'})_{(z_{d'})_{s'}} \right) p(\theta_{d'} \mid \alpha)
= \prod_{k=1}^{K} (\theta_{d'})_k^{(m_{d'})_k} \cdot \prod_{k=1}^{K} (\theta_{d'})_k^{(\alpha)_k - 1}
= \prod_{k=1}^{K} (\theta_{d'})_k^{(m_{d'})_k + (\alpha)_k - 1} \qquad (6.7)

Here, (m_d)_k is the number of times topic k is assigned to a sentence in document d. From the expression in equation 6.7, we find that:

\Theta_{d'} \mid \theta_1,\dots,\theta_{d'-1},\theta_{d'+1},\dots,\theta_M,\pi,\phi,z,\sigma,w,\alpha,\beta,\gamma \;\sim\; \text{Dirichlet}(m_{d'} + \alpha) \qquad (6.8)

Then, the conditional distribution of Π_{d'}, the sentiment distribution over the phrases of document d', is derived.

p(\pi_{d'} \mid \theta,\pi_1,\dots,\pi_{d'-1},\pi_{d'+1},\dots,\pi_M,\phi,z,\sigma,w,\alpha,\beta,\gamma) = \frac{p(\pi_1,\dots,\pi_M,\theta,\phi,z,\sigma,w \mid \alpha,\beta,\gamma)}{p(\pi_1,\dots,\pi_{d'-1},\pi_{d'+1},\dots,\pi_M,\theta,\phi,z,\sigma,w \mid \alpha,\beta,\gamma)}
\propto p(\pi_1,\dots,\pi_M,\theta,\phi,z,\sigma,w \mid \alpha,\beta,\gamma)
= \left( \prod_{d=1}^{M} \left[ \prod_{s=1}^{S_d} p(w_{d,s} \mid \tilde z_{d,s},\tilde\sigma_{d,s},\phi_{z_{d,s},\sigma_{d,s}})\, p(\tilde z_{d,s} \mid \theta_d)\, p(\tilde\sigma_{d,s} \mid \pi_d) \right] p(\theta_d \mid \alpha)\, p(\pi_d \mid \gamma) \right) \left( \prod_{k=1}^{K} \prod_{o=1}^{\Sigma} p(\phi_{k,o} \mid \beta_o) \right)
\propto \left( \prod_{s=1}^{S_{d'}} p(w_{d',s} \mid \tilde z_{d',s},\tilde\sigma_{d',s},\phi_{(z_{d'})_{s'},(\sigma_{d'})_{s'}})\, p(\tilde\sigma_{d',s} \mid \pi_{d'}) \right) p(\pi_{d'} \mid \gamma)
\propto \left( \prod_{s=1}^{S_{d'}} p(\tilde\sigma_{d',s} \mid \pi_{d'}) \right) p(\pi_{d'} \mid \gamma)
\propto \prod_{s=1}^{S_{d'}} (\pi_{d'})_{(\sigma_{d'})_{s'}} \cdot \prod_{o=1}^{\Sigma} (\pi_{d'})_o^{(\gamma)_o - 1}
= \prod_{o=1}^{\Sigma} (\pi_{d'})_o^{(l_{d'})_o + (\gamma)_o - 1} \qquad (6.9)

Here, (l_d)_o represents the number of times sentiment o is assigned to a sentence in document d. From equation 6.9, it follows that

\Pi_{d'} \mid \theta,\pi_1,\dots,\pi_{d'-1},\pi_{d'+1},\dots,\pi_M,\phi,z,\sigma,w,\alpha,\beta,\gamma \;\sim\; \text{Dirichlet}(l_{d'} + \gamma) \qquad (6.10)

The last random vector of interest whose conditional distribution needs to be determined is Φ_{k',o'}, the word distribution for topic k' and sentiment o'.

p(\phi_{k',o'} \mid \phi_{-(k',o')},\theta,\pi,z,\sigma,w,\alpha,\beta,\gamma) = \frac{p(\phi,\theta,\pi,z,\sigma,w \mid \alpha,\beta,\gamma)}{p(\phi_{-(k',o')},\theta,\pi,z,\sigma,w \mid \alpha,\beta,\gamma)}
\propto p(\phi,\theta,\pi,z,\sigma,w \mid \alpha,\beta,\gamma)
= \left( \prod_{d=1}^{M} \left[ \prod_{s=1}^{S_d} p(w_{d,s} \mid \tilde z_{d,s},\tilde\sigma_{d,s},\phi_{z_{d,s},\sigma_{d,s}})\, p(\tilde z_{d,s} \mid \theta_d)\, p(\tilde\sigma_{d,s} \mid \pi_d) \right] p(\theta_d \mid \alpha)\, p(\pi_d \mid \gamma) \right) \left( \prod_{k=1}^{K} \prod_{o=1}^{\Sigma} p(\phi_{k,o} \mid \beta_o) \right)
\propto \left( \prod_{d=1}^{M} \prod_{s=1}^{S_d} p(w_{d,s} \mid \tilde z_{d,s},\tilde\sigma_{d,s},\phi_{z_{d,s},\sigma_{d,s}}) \right) \left( \prod_{k=1}^{K} \prod_{o=1}^{\Sigma} p(\phi_{k,o} \mid \beta_o) \right)
\propto \left( \prod_{d=1}^{M} \prod_{s=1}^{S_d} \prod_{i=1}^{N_{s'}} (\phi_{z_{d,s},\sigma_{d,s}})_{(w_{d,s})_i} \right) p(\phi_{k',o'} \mid \beta_{o'})
\propto \left( \prod_{j=1}^{V} (\phi_{k',o'})_j^{n_{k',o',j}} \right) \cdot \prod_{j=1}^{V} (\phi_{k',o'})_j^{(\beta_{o'})_j - 1}
= \prod_{j=1}^{V} (\phi_{k',o'})_j^{n_{k',o',j} + (\beta_{o'})_j - 1} \qquad (6.11)

Here, n_{k,o,j} represents the number of times word j occurs in a sentence that has topic k and sentiment o. From equation 6.11, it follows, not surprisingly, that all Φ's are also conditionally Dirichlet distributed.

\Phi_{k',o'} \mid \phi_{-(k',o')},\theta,\pi,z,\sigma,w,\alpha,\beta,\gamma \;\sim\; \text{Dirichlet}(n_{k',o'} + \beta_{o'}) \qquad (6.12)

Now also the latent random variables whose values are not of particular interest to us need to be sampled. The derivations of their distributions conditional on all other variables can be found below. Note that with z we mean all topic assignments in the corpus, and with z_{-(d',s')} all topic assignments without the topic of

sentence s' in document d'.

p((z_{d'})_{s'} \mid z_{-(d',s')},\theta,\pi,\phi,\sigma,w,\alpha,\beta,\gamma) = \frac{p(z,\theta,\pi,\phi,\sigma,w \mid \alpha,\beta,\gamma)}{p(z_{-(d',s')},\theta,\pi,\phi,\sigma,w \mid \alpha,\beta,\gamma)}
\propto p(z,\theta,\pi,\phi,\sigma,w \mid \alpha,\beta,\gamma)
= \left( \prod_{d=1}^{M} \left[ \prod_{s=1}^{S_d} p(w_{d,s} \mid \tilde z_{d,s},\tilde\sigma_{d,s},\phi_{z_{d,s},\sigma_{d,s}})\, p(\tilde z_{d,s} \mid \theta_d)\, p(\tilde\sigma_{d,s} \mid \pi_d) \right] p(\theta_d \mid \alpha)\, p(\pi_d \mid \gamma) \right) \left( \prod_{k=1}^{K} \prod_{o=1}^{\Sigma} p(\phi_{k,o} \mid \beta_o) \right)
\propto \prod_{d=1}^{M} \prod_{s=1}^{S_d} p(w_{d,s} \mid \tilde z_{d,s},\tilde\sigma_{d,s},\phi_{z_{d,s},\sigma_{d,s}})\, p(\tilde z_{d,s} \mid \theta_d)
\propto p(w_{d',s'} \mid \tilde z_{d',s'},\tilde\sigma_{d',s'},\phi_{(z_{d'})_{s'},(\sigma_{d'})_{s'}})\, p(\tilde z_{d',s'} \mid \theta_{d'})
= \left( \prod_{i=1}^{N_{s'}} (\phi_{(z_{d'})_{s'},(\sigma_{d'})_{s'}})_{(w_{d',s'})_i} \right) (\theta_{d'})_{(z_{d'})_{s'}}
= \left( \prod_{j=1}^{V} (\phi_{(z_{d'})_{s'},(\sigma_{d'})_{s'}})_j^{n_{(z_{d'})_{s'},(\sigma_{d'})_{s'},j}} \right) (\theta_{d'})_{(z_{d'})_{s'}} \qquad (6.13)

Therefore, (Z_{d'})_{s'} | z_{-(d',s')}, θ, π, φ, σ, w, α, β, γ has a Multinomial distribution with parameters:

\tilde Z_{d',s'} \mid \text{all other parameters} \;\sim\; \text{Multinomial}\left( 1, \left( \left[ \prod_{j=1}^{V} (\phi_{1,(\sigma_{d'})_{s'}})_j^{n_{1,(\sigma_{d'})_{s'},j}} \right] (\theta_{d'})_1, \;\dots,\; \left[ \prod_{j=1}^{V} (\phi_{K,(\sigma_{d'})_{s'}})_j^{n_{K,(\sigma_{d'})_{s'},j}} \right] (\theta_{d'})_K \right) \right) \qquad (6.14)

Lastly, the conditional distribution of the sentiment assignment to each phrase in each document is derived.

p((\sigma_{d'})_{s'} \mid \sigma_{-(d',s')},\theta,\pi,\phi,w,z,\alpha,\beta,\gamma) = \frac{p(\sigma,\theta,\pi,\phi,w,z \mid \alpha,\beta,\gamma)}{p(\sigma_{-(d',s')},\theta,\pi,\phi,w,z \mid \alpha,\beta,\gamma)}
\propto p(\sigma,\theta,\pi,\phi,w,z \mid \alpha,\beta,\gamma)
= \left( \prod_{d=1}^{M} \left[ \prod_{s=1}^{S_d} p(w_{d,s} \mid \tilde z_{d,s},\tilde\sigma_{d,s},\phi_{z_{d,s},\sigma_{d,s}})\, p(\tilde z_{d,s} \mid \theta_d)\, p(\tilde\sigma_{d,s} \mid \pi_d) \right] p(\theta_d \mid \alpha)\, p(\pi_d \mid \gamma) \right) \left( \prod_{k=1}^{K} \prod_{o=1}^{\Sigma} p(\phi_{k,o} \mid \beta_o) \right)
\propto \prod_{d=1}^{M} \prod_{s=1}^{S_d} p(w_{d,s} \mid \tilde z_{d,s},\tilde\sigma_{d,s},\phi_{z_{d,s},\sigma_{d,s}})\, p(\tilde\sigma_{d,s} \mid \pi_d)
\propto p(w_{d',s'} \mid \tilde z_{d',s'},\tilde\sigma_{d',s'},\phi_{(z_{d'})_{s'},(\sigma_{d'})_{s'}})\, p(\tilde\sigma_{d',s'} \mid \pi_{d'})
= \left( \prod_{i=1}^{N_{s'}} (\phi_{(z_{d'})_{s'},(\sigma_{d'})_{s'}})_{(w_{d',s'})_i} \right) (\pi_{d'})_{(\sigma_{d'})_{s'}}
= \left( \prod_{j=1}^{V} (\phi_{(z_{d'})_{s'},(\sigma_{d'})_{s'}})_j^{n_{(z_{d'})_{s'},(\sigma_{d'})_{s'},j}} \right) (\pi_{d'})_{(\sigma_{d'})_{s'}} \qquad (6.15)

Therefore, (Σ_{d'})_{s'} | σ_{-(d',s')}, θ, π, φ, z, w, α, β, γ has a Multinomial distribution with parameters:

\tilde\Sigma_{d',s'} \mid \text{all other parameters} \;\sim\; \text{Multinomial}\left( 1, \left( \left[ \prod_{j=1}^{V} (\phi_{(z_{d'})_{s'},1})_j^{n_{(z_{d'})_{s'},1,j}} \right] (\pi_{d'})_1, \;\dots,\; \left[ \prod_{j=1}^{V} (\phi_{(z_{d'})_{s'},\Sigma})_j^{n_{(z_{d'})_{s'},\Sigma,j}} \right] (\pi_{d'})_\Sigma \right) \right) \qquad (6.16)

The Gibbs sampling algorithm for the extended version of LDA described in this chapter is given in algorithm 5. Although Gibbs sampling has good convergence properties, it is not implemented in this research, because convergence can take a long time, especially with the many parameter samples that are needed. Programming the algorithm in such a way that its implementation is fast and convergence is reached within a reasonable amount of time is considered beyond the scope of this thesis.

Algorithm 5 Gibbs Sampling for LDA with syntax and sentiment

1: Initialize θ_1,...,θ_M, π_1,...,π_M, φ_{1,1},...,φ_{K,Σ}, z, σ
2: Compute initial frequencies (m_d)_k (for d = 1,...,M, k = 1,...,K), (l_d)_o (for o = 1,...,Σ)
3:     and (n_{k,o})_j (for k = 1,...,K, o = 1,...,Σ and j = 1,...,V)
4: Fix N_iter for the maximum number of iterations
5: for iter = 1 to N_iter do                      ▷ Sample N_iter times
6:     for d = 1 to M do                          ▷ Iterate over documents
7:         Draw Θ_d from Dirichlet(m_d + α)
8:         Draw Π_d from Dirichlet(l_d + γ)
9:         for s = 1 to S_d do                    ▷ Iterate over phrases
10:            Draw Z̃_{d,s} from the Multinomial distribution in equation 6.14
11:            Draw Σ̃_{d,s} from the Multinomial distribution in equation 6.16
12:        end for
13:    end for
14:    Update frequencies (n_{k,o})_j
15:    for k = 1 to K do                          ▷ Iterate over topics
16:        for o = 1 to Σ do                      ▷ Iterate over sentiments
17:            Draw Φ_{k,o} from Dirichlet(n_{k,o} + β_o)
18:        end for
19:    end for
20:    Update frequencies (m_d)_k and (l_d)_o
21: end for
22: Compute posterior estimates of Θ_1,...,Θ_M, Π_1,...,Π_M, Φ_{1,1},...,Φ_{K,Σ}, Z, Σ using the N_iter samples from their posterior distributions
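A compact NumPy sketch of algorithm 5 is given below; it follows the conditional distributions in equations 6.8, 6.10, 6.12, 6.14 and 6.16, but skips burn-in, thinning and convergence checks, so it is an illustration rather than an efficient or validated implementation.

import numpy as np

def gibbs_lda_syntax_sentiment(corpus, K, Sigma, V, alpha, gamma, beta, n_iter=500, seed=0):
    """Sketch of algorithm 5. corpus[d][s] is a list of word indices (0-based) of phrase s in document d;
    alpha has shape (K,), gamma has shape (Sigma,), beta has shape (Sigma, V)."""
    rng = np.random.default_rng(seed)
    M = len(corpus)
    z = [rng.integers(K, size=len(doc)) for doc in corpus]        # topic per phrase
    sig = [rng.integers(Sigma, size=len(doc)) for doc in corpus]  # sentiment per phrase
    theta_sum = np.zeros((M, K)); pi_sum = np.zeros((M, Sigma)); phi_sum = np.zeros((K, Sigma, V))
    for _ in range(n_iter):
        # Frequencies m, l and n computed from the current assignments (steps 2, 14 and 20)
        m = np.zeros((M, K)); l = np.zeros((M, Sigma)); n = np.zeros((K, Sigma, V))
        for d, doc in enumerate(corpus):
            for s, phrase in enumerate(doc):
                m[d, z[d][s]] += 1
                l[d, sig[d][s]] += 1
                for j in phrase:
                    n[z[d][s], sig[d][s], j] += 1
        # Conditional Dirichlet draws (equations 6.8, 6.10 and 6.12)
        theta = np.array([rng.dirichlet(m[d] + alpha) for d in range(M)])
        pi = np.array([rng.dirichlet(l[d] + gamma) for d in range(M)])
        phi = np.array([[rng.dirichlet(n[k, o] + beta[o]) for o in range(Sigma)] for k in range(K)])
        log_phi = np.log(phi)
        # Conditional Multinomial draws per phrase (equations 6.14 and 6.16)
        for d, doc in enumerate(corpus):
            for s, phrase in enumerate(doc):
                counts = np.bincount(phrase, minlength=V)
                logit_z = log_phi[:, sig[d][s], :] @ counts + np.log(theta[d])
                pz = np.exp(logit_z - logit_z.max()); pz /= pz.sum()
                z[d][s] = rng.choice(K, p=pz)
                logit_o = log_phi[z[d][s], :, :] @ counts + np.log(pi[d])
                po = np.exp(logit_o - logit_o.max()); po /= po.sum()
                sig[d][s] = rng.choice(Sigma, p=po)
        theta_sum += theta; pi_sum += pi; phi_sum += phi
    # Posterior mean estimates from the collected samples (no burn-in or thinning in this sketch)
    return theta_sum / n_iter, pi_sum / n_iter, phi_sum / n_iter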

7 Validity of topic-word distribution estimates

In the previous chapters, we have discussed different inference methods for both the plain LDA model and LDA with syntax and sentiment. All methods of inference result in estimates for the latent random variables of interest. That is, for each document d ∈ {1,...,M} we want to know the document-topic distribution Θ_d, and for each topic k ∈ {1,...,K} the topic-word distribution Φ_k is estimated.

After having obtained these estimates, we need to take a look at their validity before drawing conclusions. The results are considered not valid if the topic-word distributions are all similar, as will be explained in this chapter. Therefore, the difference between probability vectors is 'measured'. The most insightful of the latent variables are the topic-word distributions Φ, from which we can qualitatively see what the topics are about, and therefore what customers find essential to write about in their reviews. Because words form the topics, the human brain can creatively interpret what general theme is behind each topic when looking at, for example, the top 10 words. However, one needs to be careful here: to be allowed to draw conclusions from a topic-word distribution, the topic must also mathematically be distinct from the others, otherwise it might fit noise or consist of multiple topics. It can be the case that multiple topics have similar top 10 words, such that these topics are difficult to distinguish. To be able to determine quantitatively which topics are too similar to be interpreted independently, and which are distinctive and unique, different similarity measures are proposed in the literature.

7.1. Normalized symmetric KL-divergence

Koltcov et al. derived a similarity measure based on the Kullback-Leibler divergence. They have found that large proportions of the topics fit noise if the chosen number of topics K is too large. This results in different results for runs with different initializations [24]. Their similarity measure can be thought of as a rescaled symmetric KL-divergence. The symmetric KL-divergence is defined for discrete probability distributions as follows [43].

Definition 7.1 (Symmetric KL-divergence) The symmetric Kullback-Leibler divergence of a discrete probability distribution q with respect to another discrete probability distribution p, where q and p have the same support Ω, is given by:

KL_{sym}(q \| p) = \frac{1}{2}\left( KL(p \| q) + KL(q \| p) \right) \qquad (7.1)

with:

KL(q \| p) = \sum_{x \in \Omega} q(x) \log\left( \frac{q(x)}{p(x)} \right) \qquad (7.2)

Note that the general Kullback-Leibler divergence was already introduced in chapter 2, as the relative entropy. In [24] it is mentioned that the symmetric KL-divergence for the topic distributions of LDA is sensitive

to vocabulary sizes, because it is dominated by the long tail of rare words in the estimate φ. Therefore, one improvement can be to look at, e.g., the top x% of words, where the percentage x can be varied and optimized per data set. Another option is to normalize the symmetric KL-divergence to obtain a better interpretable similarity measure. Koltcov et al. introduce the Normalized Kullback-Leibler Similarity (NKLS) measure:

NKLS(q \| p) = 1 - \frac{KL_{sym}(q \| p)}{\max_{q',p'}\{ KL_{sym}(q' \| p') \}} \qquad (7.3)

The NKLS takes values in the interval [0,1], where 1 is reached if the two probability distributions are exactly equal, and 0 if the two distributions are the most distinctive among all possible combinations of q' and p'. For NKLS in LDA, this means that we compare the similarity of each combination of estimated vectors φ_k and φ_l (for some k, l ∈ {1,...,K}) with the two most distinctive vectors among all possible combinations of k and l. The latter gives the maximal KL-divergence of two distributions φ with respect to each other. A similarity matrix can thus be constructed, from which we can conclude which topics are very similar concerning the word probabilities and which topics are more distinctive. A topic is considered valid if its similarity scores with all other estimated topic-word distributions are below a certain threshold. Koltcov et al. found that a threshold of 0.9 is reasonable, as with this value of NKLS, the top 30-50 words (depending on the size of the vocabulary and the data set) are the same, only the probabilities are different [24]. For values below 0.9, the top 30-50 words can be completely different, while for values above 0.9, the order of the top 30-50 words is almost the same. Therefore, in the results of this thesis, it is also decided to classify topics with a mutual similarity score higher than 0.9 as belonging to the same topic or subject, while topics whose similarity scores with all other topics are below 0.9 are distinctive and can safely be interpreted as results.

The NKLS score can be used not only to assess the quality of the topics, but also to check the stability of the inference procedure for LDA. That is, two runs can be performed with different initializations, and the similarity of the results can be measured. One expects a stable algorithm to give the same topic-word distributions twice. Remember that there is topic exchangeability in LDA, so with the score, we can automatically match the right topic index k ∈ {1,...,K} from the first run with index k' ∈ {1,...,K} from the second run. In practice, this is too much work in comparison with just sorting the topics based on the average θ̄ over all documents for every method. Therefore, the latter method is used to ensure that we are comparing results of the same topic permutation.
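A sketch of the symmetric KL-divergence and the resulting NKLS similarity matrix for a K × V array phi of estimated topic-word distributions is given below; the small constant guarding against zero probabilities is an assumption of this sketch.

import numpy as np

def kl(q, p, eps=1e-12):
    """Kullback-Leibler divergence KL(q || p) for discrete distributions (equation 7.2)."""
    q = q + eps; p = p + eps
    return np.sum(q * np.log(q / p))

def kl_sym(q, p):
    """Symmetric KL-divergence (equation 7.1)."""
    return 0.5 * (kl(p, q) + kl(q, p))

def nkls_matrix(phi):
    """Normalized KL similarity (equation 7.3) between all pairs of topic-word distributions."""
    K = phi.shape[0]
    div = np.array([[kl_sym(phi[k], phi[l]) for l in range(K)] for k in range(K)])
    return 1.0 - div / div.max()

Pairs k, l with nkls_matrix(phi)[k, l] above 0.9 would then be treated as describing the same subject, following the threshold discussed above.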

7.2. Symmetrized Jensen-Shannon divergence

In [43], the symmetrized Jensen-Shannon (JS) divergence is used to determine the similarity between documents, but naturally it can also be applied to find similar topic-word distributions. The JS-divergence is based on the Kullback-Leibler divergence, only it compares each probability distribution with the pointwise arithmetic mean of the two distributions. In formulas:

JS_{sym}(p, q) = \frac{1}{2}\left[ KL\left(p \,\Big\|\, \tfrac{1}{2}(p+q)\right) + KL\left(q \,\Big\|\, \tfrac{1}{2}(p+q)\right) \right] \qquad (7.4)
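A sketch of equation 7.4 in NumPy; the kl helper from the previous sketch is repeated here to keep the example self-contained, and the small constant is again an assumption of the sketch.

import numpy as np

def kl(q, p, eps=1e-12):
    q = q + eps; p = p + eps
    return np.sum(q * np.log(q / p))

def js_sym(p, q):
    """Symmetrized Jensen-Shannon divergence (equation 7.4)."""
    m = 0.5 * (p + q)
    return 0.5 * (kl(p, m) + kl(q, m))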

If p and q represent the same probability density function, the symmetrized JS-divergence is 0, as the arithmetic mean of the two is equal to both p and q. According to [43], both the symmetrized JS-divergence and the symmetric KL-divergence work well in practice. Because for the Jensen-Shannon divergence it is also expected that the long tail of low probabilities will dominate the score, the same type of normalization can be applied, such that we get the normalized Jensen-Shannon similarity measure.

NJSS(q \| p) = 1 - \frac{JS_{sym}(q \| p)}{\max_{q',p'}\{ JS_{sym}(q' \| p') \}} \qquad (7.5)

One can compute a similarity matrix using either divergence measure and compare. A decision on which topics to take into account in the review analysis will be better founded when based on both the NKLS and the NJSS.

8 Results

The main research in this thesis can be split up into two different parts, as is represented in the research questions. The first subject concerns the ‘basic’ model of Latent Dirichlet Allocation and the different inference methods that can be used to estimate the model parameters. The results of this research are given in section 2 of this chapter. However, we need to know more about the properties and shape of the posterior density first, to fully understand the LDA results in section 2. To this end, the first section of this chapter elaborates on the visualization of the posterior density. Secondly, an extension to LDA called ‘LDA with syntax and sentiment’ has been constructed, whose results on various data sets will be shown in section 3 of this chapter.

8.1. Posterior density visualization of LDA

Before we apply Latent Dirichlet Allocation to actual data sets, it is interesting to learn more about the form of the posterior density. Especially when using the optimization method to find the posterior mode, it is essential to understand its shape.

8.1.1. Influence of the hyperparameters in LDA

Firstly, we will look at one of the smallest possible data sets to which LDA can be applied. With this example, we want to understand more about the influence of the hyperparameters α and β in LDA. Consider a toy example with two topics (K = 2), two possible words (V = 2), and three documents (M = 3):

1. document: [1 1 1 1]

2. document: [1 2 1 1 2 2]

3. document: [1 2 2 2 2 2 2 2]

Remember that the numbers in the document lists stand for either word 1 or word 2. It is not necessary to know what the exact words are to understand this example. The order of the words does not influence the form of the posterior density, since only the frequencies per document are taken into account. Naturally, this is an unrealistic example to apply LDA to, but nevertheless we already have 5 parameters to estimate: θ_1, θ_2, θ_3, φ_1 and φ_2, making visualization a challenge. Using a grid on [ε, 1−ε]^5 with 21 nodes in each dimension, the posterior density can be computed over the grid. We use ε instead of 0 to avoid numerical problems, where ε is set to 10^{-8}. Connecting the nodes, we get a 5-dimensional hyperplane that forms the posterior density.

In figure 8.1, the posterior densities are visualized using 8 different settings for hyperparameters α and β. Because we can only easily understand a three-dimensional surface plot, the conditional densities p(φ_1, φ_2 | θ_1 = θ_{1,opt}, θ_2 = θ_{2,opt}, θ_3 = θ_{3,opt}) are shown instead of the full posterior densities. With 'opt' we denote the value of each θ_d (with d = 1,2,3) for which the maximum is attained.


(a) α = 0.1, β = 0.1, optimal θ-values: θ1 = 1, θ2 = 0, θ3 = 0.
(b) α = 0.5, β = 0.5, optimal θ-values: θ1 = 1, θ2 = 0, θ3 = 0.
(c) α = 0.9, β = 0.9, optimal θ-values: θ1 = 1, θ2 = 0, θ3 = 0.
(d) α = 0.9, β = 1, optimal θ-values: θ1 = 1, θ2 = 0, θ3 = 0.
(e) α = 1, β = 0.9, optimal θ-values: θ1 = 1, θ2 = 0.5, θ3 = 0.15.
(f) α = 1, β = 1, optimal θ-values: θ1 = 1, θ2 = 0.45, θ3 = 0.05.
(g) α = 1.1, β = 1.1, optimal θ-values: θ1 = 0.95, θ2 = 0.45, θ3 = 0.05.
(h) α = 2, β = 3, optimal θ-values: θ1 = 0.75, θ2 = 0.5, θ3 = 0.2.

Figure 8.1: Posterior densities for different settings of the symmetric hyperparameters α and β. There are three documents: w_1 = [1 1 1 1], w_2 = [1 2 1 1 2 2] and w_3 = [1 2 2 2 2 2 2 2]. The vocabulary size is V = 2 and the number of topics is K = 2. Because the posterior density for this case has 5 parameters and is therefore six-dimensional, the surface plots actually show the joint posterior density of Φ_1 and Φ_2 conditional on the words, hyperparameters and the values of Θ_1, Θ_2 and Θ_3. For the θ_d (with d = 1,2,3), the optimal values are taken, that is, the values for each θ for which the posterior density is maximal. Note that due to the coarse grid, each value has a rounding error.

The posterior density for this toy example is given by the following expression (up to a proportionality constant):

p(\theta_1,\theta_2,\theta_3,\phi_1,\phi_2 \mid w,\alpha,\beta) \propto \left[ \prod_{d=1}^{3} \prod_{j=1}^{2} \left( \sum_{k=1}^{2} (\phi_k)_j (\theta_d)_k \right)^{(n_d)_j} \right] \cdot \left[ \prod_{d=1}^{3} \prod_{k=1}^{2} (\theta_d)_k^{(\alpha)_k - 1} \right] \cdot \left[ \prod_{k=1}^{2} \prod_{j=1}^{2} (\phi_k)_j^{(\beta)_j - 1} \right] \qquad (8.1)

Remember that (n_d)_j was defined as the frequency of word j in document d. We can deduce from equation 8.1 that if, for some k, (α)_k < 1 and (θ_d)_k is close to 0 or 1, the posterior density will go to +∞. These values for θ_k are therefore the posterior modes when (α)_k < 1 for some k, as can be seen in figures 8.1a, 8.1b, 8.1c and 8.1d. The same can be concluded for β, as shown in figures 8.1a, 8.1b, 8.1c and 8.1e, where the posterior mode estimates for Φ_1 and Φ_2 are either 0 or 1. For both α and β larger than 1, no numerical problems on the boundaries are found and the posterior mode lies nicely away from the boundaries. Although the conditional posterior density is not a convex plane (strictly speaking), the posterior mode can be easily found using optimization methods; see for example figure 8.1h.
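A sketch of how the unnormalized log posterior of equation 8.1 can be evaluated on such a grid for this toy example is given below; the brute-force search over all 21^5 (about 4 million) grid points is slow but feasible, and the parametrization by the probability of topic 1 per document and of word 1 per topic is an illustration choice.

import numpy as np
from itertools import product

# Word-frequency matrix (n_d)_j for the three toy documents and V = 2 words
n = np.array([[4, 0],      # document 1: [1 1 1 1]
              [3, 3],      # document 2: [1 2 1 1 2 2]
              [1, 7]])     # document 3: [1 2 2 2 2 2 2 2]

def log_posterior(t, f, alpha, beta, eps=1e-8):
    """Unnormalized log posterior of equation 8.1.
    t[d] = probability of topic 1 in document d, f[k] = probability of word 1 under topic k."""
    theta = np.clip(np.array(t), eps, 1 - eps)
    phi = np.clip(np.array(f), eps, 1 - eps)
    logp = 0.0
    for d in range(3):
        for j in range(2):
            word_prob = theta[d] * (phi[0] if j == 0 else 1 - phi[0]) \
                      + (1 - theta[d]) * (phi[1] if j == 0 else 1 - phi[1])
            logp += n[d, j] * np.log(word_prob)
        logp += (alpha - 1) * (np.log(theta[d]) + np.log(1 - theta[d]))
    for k in range(2):
        logp += (beta - 1) * (np.log(phi[k]) + np.log(1 - phi[k]))
    return logp

grid = np.linspace(1e-8, 1 - 1e-8, 21)
best = max(product(grid, repeat=5),
           key=lambda p: log_posterior(p[:3], p[3:], alpha=1.1, beta=1.1))
print(best)   # approximate location of a posterior mode (theta_1, theta_2, theta_3, phi_1, phi_2)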

8.1.2. VBEM’s posterior density approximation VBEM’s approximation of the posterior density can be visualized in the same way. In the Variational Bayesian Expectation-Maximization method, we use auxiliary functions with variational parameters γ1,γ2,γ3,φ1 and φ2, such that the approximation function q for the posterior density is given by:

q(\theta_1,\theta_2,\theta_3,\phi_1,\phi_2) = \left( \prod_{d=1}^{3} q(\theta_d; \gamma_d) \right) \left( \prod_{j=1}^{2} q(\phi_j; \lambda_j) \right)
= \left( \prod_{d=1}^{3} \left[ \frac{\Gamma\left(\sum_{k=1}^{2} (\gamma_d)_k\right)}{\prod_{k=1}^{2} \Gamma((\gamma_d)_k)} \right] (\theta_d)_1^{(\gamma_d)_1 - 1} (1 - (\theta_d)_1)^{(\gamma_d)_2 - 1} \right) \cdot \left( \prod_{k=1}^{2} \left[ \frac{\Gamma\left(\sum_{j=1}^{2} (\lambda_k)_j\right)}{\prod_{j=1}^{2} \Gamma((\lambda_k)_j)} \right] (\phi_k)_1^{(\lambda_k)_1 - 1} (1 - (\phi_k)_1)^{(\lambda_k)_2 - 1} \right) \qquad (8.2)
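As an illustration, q from equation 8.2 can be evaluated as a product of Beta densities (a two-dimensional Dirichlet reduces to a Beta distribution); the variational parameter values below are placeholders, not the values found by VBEM.

import numpy as np
from scipy.stats import beta as beta_dist

def q_density(theta, phi, gammas, lambdas):
    """Mean-field approximation q(theta_1..3, phi_1..2) of equation 8.2 as a product of Beta pdfs."""
    value = 1.0
    for theta_d, g in zip(theta, gammas):
        value *= beta_dist.pdf(theta_d, g[0], g[1])      # q(theta_d; gamma_d)
    for phi_k, lam in zip(phi, lambdas):
        value *= beta_dist.pdf(phi_k, lam[0], lam[1])    # q(phi_k; lambda_k)
    return value

# Illustrative placeholder variational parameters
gammas = [(2.0, 2.0), (3.0, 1.5), (1.5, 3.0)]
lambdas = [(4.0, 1.2), (1.2, 4.0)]
print(q_density(theta=[0.5, 0.75, 0.3], phi=[0.75, 0.15], gammas=gammas, lambdas=lambdas))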

With the variational parameter vectors γ_1, γ_2, γ_3, λ_1, λ_2 determined such that q approximates the posterior density as well as possible, we can compute the posterior mode of q. Because the data set in this example is small, this can be done using grid search over the relatively coarse grid. Naturally, we are limited by the grid size, but an approximation of the maximum can be found. It is not likely that there exists a value between two nodes that is much higher than the values of q on these same two nodes. A plot with both the true posterior density and the approximate posterior density q is given in figure 8.2. We have fixed θ_1, θ_2 and θ_3 on their posterior mode values, since we can only plot a three-dimensional graph. We are only interested in the location of the posterior mode, not the posterior mode itself. We see in figure 8.2 that the locations of the maximum of the posterior density and the approximate density are not the same. This is caused by the mean-field approximation in the approximation function.

In section 4.2.1, the posterior density for all latent variables, thus including the topic assignments Z, is approximated with q(θ, φ, z). However, we are only interested in the posterior mode estimates of the document-topic distributions Θ and the topic-word distributions Φ. Therefore, in this example, the topic assignments Z are integrated out. Because of the mean-field approximation, we can easily see that:

\sum_z q(\theta,\phi,z) = q(\theta,\phi) \cdot \sum_z q(z) = q(\theta,\phi) \cdot 1 \qquad (8.3)

since each auxiliary function q is a probability density. Therefore, after having determined q(θ, φ, z) using VBEM, we can simply ignore q(z).

For the data in this example, there are two posterior modes due to topic exchangeability. The posterior modes of the true posterior density found using grid search are: θ_1 = 0.5, θ_2 = 0.8, θ_3 = 0.25, φ_1 = 1, φ_2 = 0 and θ_1 = 0.5, θ_2 = 0.2, θ_3 = 0.75, φ_1 = 0, φ_2 = 1. The first mode is shown in figure 8.2. The location of the maximum of the approximate posterior density from VBEM is: θ_1 = 0.5, θ_2 = 0.75, θ_3 = 0.3, φ_1 = 0.75, φ_2 = 0.15. It is clear that the approximation function aimed to approximate the posterior density around its first posterior mode, but the model parameter estimates differ quite a lot. Therefore, we conclude that the VBEM algorithm can approximate the posterior density in the right general direction, but the approximation is still too far off to draw conclusions about the model parameters. This conclusion will be supported by the application of LDA to two data sets.

Figure 8.2: Surface plot of the joint posterior density of Φ_1, Φ_2 conditional on Θ_1 = 0.5, Θ_2 = 0.8, Θ_3 = 0.25. Hyperparameters are set to α = 1.1 and β = 1. The number of topics is K = 2, the vocabulary size is V = 2 and there are 3 documents: M = 3. The data consists of three documents: w_1 = [1,2], w_2 = [1,1,1,1,2] and w_3 = [1,2,2,2,1,2,2,2]. Also the approximation of the conditional posterior density, q, is shown, and it can be seen that their maxima lie at different locations, resulting in different posterior modes. Note that for the sake of comparison, all values are normalized such that the maximal value equals 1 for both surfaces.

8.2. LDA: different methods of inference

There are different methods to estimate the model parameters1 Θd with d = 1,...,M, and Φk with k = 1,...,K, respectively the document-topic distributions and the topic-word distributions. The estimators are based on either the posterior mean or the posterior mode.

1Strictly speaking, these are latent random variables in a Bayesian setting.

Although the posterior mean estimator calculated from the complete posterior distribution does not give good explanatory results for the topic and word distributions due to topic exchangeability, it is used in Gibbs sampling. This is possible because the Gibbs sampling algorithm is expected to 'circle' around one topic permutation, as the probability of moving from one hill in the posterior density to another is very small. We therefore use the fact that Gibbs sampling does not work properly (in terms of convergence) to obtain informative estimators for our latent random variables of interest. There is a good implementation of Gibbs sampling for LDA in the open source program KNIME [23]. The results from this implementation are considered to be good Gibbs sampling results, even though it is not entirely clear exactly which steps are taken in this software; the documentation refers to the methods described in [32, 52]. Another possibility, apart from programming the Gibbs algorithm with the update formulas from algorithm 1 yourself, is using the JAGS package in either R or Python, whichever is preferred. JAGS stands for 'Just Another Gibbs Sampler' and allows for Markov chain Monte Carlo sampling for almost every hierarchical Bayesian model. Only the conditional distributions need to be specified; JAGS then determines whether Gibbs sampling can be done or Metropolis-Hastings sampling is needed. Remember that Gibbs sampling is only possible if the distributions and dependencies in the hierarchical model are chosen such that the conditional distribution of each parameter given all other parameters and the data is of a known, closed form. Unfortunately, JAGS is not a very fast implementation of Gibbs sampling, so results are not generated with this implementation.

The posterior mode is very challenging to compute analytically. However, we can use the posterior distribution in its exact form and search for an optimum. This optimization method is described in chapter 5. Unfortunately, it cannot be guaranteed that a global optimum is found; only local optima can be reached by the algorithm. Moreover, there are many local optima due to topic exchangeability. Based on observations from low-dimensional versions of LDA, we conclude that all posterior modes that are symmetric due to topic exchangeability are equally good and give the same result concerning how the topics are distributed and which words are most frequently used per topic (i.e., the estimates of Θ and Φ, respectively). The optimization algorithm is not stable in finding the same posterior mode every time, as different initializations are used. However, when the topics are sorted in the same way after every optimization run, we see that indeed every posterior mode is equally good and gives the same results when looking at the estimates of Θ and Φ.

Another method to estimate the model parameters via the posterior mode, often used by topic modelers, is Variational Bayesian Expectation-Maximization (VBEM). In particular, VBEM with the mean-field approximation is very common. In this method, the posterior distribution is approximated by a simpler function of which the mode can be computed analytically. The approximation function is based on the mean-field approximation, which means that each latent variable is assumed to be independent, such that the product of their marginal density functions gives an approximation of the posterior distribution. Naturally, this is a very strong assumption, but we will see in the results that the method can perform relatively well in terms of the estimates of the document-topic and topic-word distributions.
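Returning to Gibbs sampling, the following Python sketch shows one sweep of a standard collapsed Gibbs sampler for LDA with symmetric hyperparameters. It illustrates the type of update formulas in algorithm 1, but it is not the KNIME implementation, and the variable names are illustrative.

import numpy as np

def init_state(docs, K, V, seed=0):
    # docs[d] is a list of word indices; assign a random topic to every word and build the count matrices.
    rng = np.random.default_rng(seed)
    M = len(docs)
    z = [[rng.integers(K) for _ in doc] for doc in docs]
    n_dk, n_kw, n_k = np.zeros((M, K)), np.zeros((K, V)), np.zeros(K)
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = z[d][i]
            n_dk[d, k] += 1; n_kw[k, w] += 1; n_k[k] += 1
    return z, n_dk, n_kw, n_k

def gibbs_sweep(docs, z, n_dk, n_kw, n_k, alpha, beta, rng):
    K, V = n_kw.shape
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k_old = z[d][i]
            n_dk[d, k_old] -= 1; n_kw[k_old, w] -= 1; n_k[k_old] -= 1   # remove the current assignment
            p = (n_dk[d] + alpha) * (n_kw[:, w] + beta) / (n_k + V * beta)
            k_new = rng.choice(K, p=p / p.sum())                         # sample from the full conditional
            z[d][i] = k_new
            n_dk[d, k_new] += 1; n_kw[k_new, w] += 1; n_k[k_new] += 1

# After enough sweeps, posterior mean estimates follow from the counts, e.g.
# theta_hat = (n_dk + alpha) / (n_dk.sum(axis=1, keepdims=True) + K * alpha)
# phi_hat = (n_kw + beta) / (n_k[:, None] + V * beta)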

8.2.1. Small data set: Cats and Dogs
First, a simple data set with distinctive documents is considered. This data set is created by the author of this thesis, based on the principle that if two documents are about different subjects and use different words to describe them, these documents belong to two distinct topics. The data set and the preprocessed version of it can be found in appendix B.3. There are documents about dogs, and documents about cats. Also, some texts are about animals in general, but the words used in these documents are also used in the cat documents, meaning that they are expected to be assigned to the 'cat' topic. If a person who understands English were to divide the documents into two clusters or topics, this person would get the following classification.

Table 8.1: 'Cats and dogs' data set. Simple, small data set with distinctive topic clusters by construction. The second column indicates whether a document belongs to the 'cat' topic or the 'dog' topic.

Documents                       Topic
cats are animals                cat
dogs are canids                 dog
cats are fluffy                 cat
dogs bark                       dog
cats meow                       cat
fluffy are cats                 cat
animals are large               cat
dogs bite                       dog
cats scratch                    cat
dogs bite                       dog
cats scratch                    cat
dogs bark                       dog
cats are fluffy                 cat
animals are cool                cat
not all animals are fluffy      cat
dogs are tough                  dog
canids are special              dog
bark dogs                       dog
cool cats                       cat

Then, the topic-word distributions would be:

Table 8.2: Estimates of the topic-word distributions Φ1 and Φ2, based on the intuitive construction of topics by reading the documents. The probabilities are calculated using relative frequencies.

Topic 1: φˆ1                    Topic 2: φˆ2
Words      Probabilities        Words      Probabilities
cats       0.364                dogs       0.438
fluffy     0.182                bark       0.188
animals    0.182                bite       0.125
cool       0.0909               canids     0.125
scratch    0.0909               special    0.0625
large      0.0455               tough      0.0625
meow       0.0455

The word lists consist of the words that occur in documents belonging to either topic 1 or topic 2. The probabilities are computed by taking relative frequencies, that is, the frequency of a word in all documents belonging to a particular topic, divided by the total number of words in all documents assigned to that topic. For this small data set, it is possible to read all documents and assign each of them to a topic, especially when there are only two topics. However, in case of multiple topics within one document, this becomes more difficult. Also, the aim of the research conducted in this thesis is to do unsupervised Latent Dirichlet Allocation: we want to avoid reading reviews, and rather let the algorithm decide which topics are hidden in the data set. Therefore, three algorithms are run to find the topics in the 'Cats and dogs' data set: Gibbs sampling using KNIME, Variational Bayesian EM using Python's gensim package, and Adam optimization to find the posterior mode. It is already known that there are only two topics, so K = 2. Then, hyperparameters α and β need to be chosen. For simplicity, we take symmetric priors, that is, (α)i = (α)j for all i, j = 1,...,K, and similarly for β. Different combinations of α and β are used, and estimates for Θ and Φ are determined for each setting and each method. One of the settings in which the topic assignment per document corresponds to those in table 8.1 is (α)i = 0.99 and (β)i = 1. In the tables below, the estimates for Φ1 and Φ2 per inference method for LDA are shown.
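As an illustration of how the VBEM run can be set up, the sketch below applies gensim's LdaModel to the preprocessed 'Cats and dogs' documents with (α)i = 0.99 and K = 2; the preprocessed word lists, the number of passes and the random seed are illustrative assumptions, not the exact settings behind the tables below.

from gensim import corpora
from gensim.models import LdaModel

# Preprocessed 'Cats and dogs' documents (assumed stop words such as 'are' already removed).
texts = [["cats", "animals"], ["dogs", "canids"], ["cats", "fluffy"], ["dogs", "bark"],
         ["cats", "meow"], ["fluffy", "cats"], ["animals", "large"], ["dogs", "bite"],
         ["cats", "scratch"], ["dogs", "bite"], ["cats", "scratch"], ["dogs", "bark"],
         ["cats", "fluffy"], ["animals", "cool"], ["animals", "fluffy"], ["dogs", "tough"],
         ["canids", "special"], ["bark", "dogs"], ["cool", "cats"]]

dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

# Symmetric document-topic prior (alpha)_i = 0.99 and K = 2 topics.
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2,
               alpha=[0.99, 0.99], passes=100, random_state=0)

for k in range(2):
    print(lda.show_topic(k, topn=7))   # top words of each topic with their estimated probabilities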

Table 8.3: Optimization results for φˆ1 and φˆ2 with symmetric (α)i = 0.99 and symmetric (β)i = 1. The Adam gradient descent algorithm is used with a learning rate of 0.001, a stopping criterion of 10^-4 for the relative change, and a maximum of 20,000 iterations. Regularization as described in section 5.4.2 is applied.

Topic 1: φˆ1                    Topic 2: φˆ2
Words      Probabilities        Words      Probabilities
dogs       0.437                cats       0.364
bark       0.187                fluffy     0.182
canids     0.124                animals    0.182
bite       0.124                cool       0.0909
special    0.0625               scratch    0.0909
tough      0.0625               meow       0.0455
animals    3.11E-05             large      0.0454
cats       1.29E-06             dogs       3.48E-05
meow       1.08E-06             special    3.27E-05
fluffy     8.82E-07             bite       2.51E-05
cool       6.42E-07             tough      1.44E-05
large      4.87E-07             bark       1.25E-05
scratch    3.41E-07             canids     7.64E-06

Table 8.4: Variational Bayesian EM results for Φ1 and Φ2 with symmetric (α)i = 0.99 and symmetric (β)i = 1.

Topic 1: φˆ1                    Topic 2: φˆ2
Words      Probabilities        Words      Probabilities
dogs       0.307                cats       0.283
bark       0.144                animals    0.160
bite       0.103                fluffy     0.154
canids     0.102                cool       0.0861
special    0.0615               scratch    0.0754
tough      0.0557               large      0.0523
cats       0.0520               meow       0.0508
scratch    0.0394               dogs       0.0278
fluffy     0.0319               tough      0.0250
cool       0.0269               bark       0.0222
animals    0.0256               canids     0.0216
meow       0.0255               bite       0.0206
large      0.0238               special    0.0201

Table 8.5: Gibbs sampling results from KNIME for Φ1 and Φ2 with symmetric (α)i = 0.99 and symmetric (β)i = 1. 1000 iterations are executed on 8 different threads.

Topic 1: φˆ1                    Topic 2: φˆ2
Words      Probabilities        Words      Probabilities
dogs       0.438                cats       0.364
bark       0.188                animals    0.182
bite       0.125                fluffy     0.182
canids     0.125                cool       0.0909
special    0.0625               scratch    0.0909
tough      0.0625               large      0.0455
animals    0                    meow       0.0455
cats       0                    bark       0
cool       0                    bite       0
fluffy     0                    canids     0
large      0                    dogs       0
meow       0                    special    0
scratch    0                    tough      0

Note that the Gibbs sampling results in table 8.5 correspond precisely to those intuitively constructed in table 8.2. The posterior mode estimates for Φ1 and Φ2 via optimization in table 8.3 are also very similar to the intuitive result; only a small probability is assigned to the words that do not actually belong to that topic. The fact that these small probabilities are not 0 is a property of the optimization algorithm, caused by the regularization term. The Variational Bayesian EM algorithm does find the right topic assignment for each document and the right top words for each topic, but there is still some probability mass left for words that do not belong to, for example, the cat topic. This already shows the lack of accuracy of this algorithm, since in this simple case, with few documents and a very clear distinction between documents, the performance is still not optimal. However, the results from the VBEM algorithm can be used as input for the posterior mode optimization algorithm, namely as its initial value. It is found that this significantly reduces the number of iterations needed to find the posterior mode, while the same results as in table 8.3 are obtained.

In total, 18 different combinations of symmetric α and β are used. Note that by (α)i = 1, we mean that each element of the vector α equals 1. First, (β)i is kept constant at 1, while (α)i takes the values 0.25, 0.5, 0.75, 0.9, 0.99, 1, 1.5, 2, 5. Within these settings, (α)i = 0.99 showed the best results; therefore, in the second sweep, (α)i is kept constant at 0.99, while (β)i = 0.25, 0.5, 0.75, 0.9, 0.99, 1, 1.5, 2, 5. Note that the (β)i hyperparameter cannot be altered in the VBEM algorithm from gensim. The settings for (α)i and (β)i for which the estimates of Θd for d = 1,...,M, and of Φ1 and Φ2, correspond to the intuitive results in terms of the order of magnitude of the document-topic and topic-word probabilities are given in the table below.

Table 8.6: Combinations of hyperparameters α and β, both taken as symmetric vectors, for which the optimization results are satisfactory concerning estimations of θ and φ.

(α)i    (β)i
0.99    1
2       1
0.99    0.99
0.99    2
0.99    5

In general, when choosing the hyperparameters, you need to think about what you expect from the documents. If you expect that there is only one topic per document, (α)i needs to be smaller than 1. From the Dirichlet distribution, it is known that the smaller (α)i, the more likely it is to draw a distribution that is almost a unit vector. On the other hand, if (α)i is larger than 1, a distribution with equal probabilities for each dimension is preferred. When (α)i = 1, you do not know anything about the topic distribution per document: it can be about only one topic or about K topics, and the data will guide you towards good estimates of the topic distribution θd for each document d. The same mechanism applies to the hyperparameter (β)i. Because a topic is, in general, about more than one word, a small (β)i is not a wise choice. Some topics are expected to be about a few words, while other topics can cover a large list of words, or they form a 'noise' topic, that is, a topic to which all documents, or words in documents, that cannot directly be assigned to a specific subject are assigned. Think of background words, or simply stories people tell in a review that are so unique that they do not form a topic. If both (α)i and (β)i are 1, the posterior density is proportional to the likelihood, such that posterior mode estimation actually becomes maximum likelihood estimation.

The optimization algorithm does not handle small (α)i or (β)i well, because these values result in a -log(posterior) of -∞ when values of (Θd)i for some d and i, or (Φk)j for some k and j, are close to 0. Once the optimization algorithm steps towards these boundaries, it will only push the value of the small-valued parameter further towards 0, as this value minimizes the -log(posterior). Therefore, it is better to take (α)i or (β)i close to 1, but only a little smaller, like 0.99. In this way, the mechanism of only one topic per document, or a preference for only a small number of words per topic, is maintained, but the optimization algorithm 'falls' less quickly into the abyss at the boundaries.
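The effect of the symmetric Dirichlet hyperparameter described above can be seen directly from a few draws; the following short Python sketch is purely illustrative.

import numpy as np

rng = np.random.default_rng(0)
for a in [0.25, 0.99, 1.0, 5.0]:
    draws = rng.dirichlet([a, a], size=5)   # five document-topic vectors theta_d for K = 2
    print(f"(alpha)_i = {a}:")
    print(np.round(draws, 3))
# Small (alpha)_i concentrates the mass on one topic (near-unit vectors);
# large (alpha)_i pushes the draws towards (0.5, 0.5).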

Naturally, the 'Cats and dogs' data set was a simple example that is not representative of the use of LDA in practice. Therefore, a real review data set is taken, and the same analyses are done. Only now, it is already known that we should not take (α)i and (β)i too small.

8.2.2. Towards more realistic analyses: stroller data
The stroller data set consists of 2000 reviews from Amazon concerning different brands and types of strollers. The goal is to retrieve about which aspects, issues or stories people write in their reviews. We expect some elements of a stroller to each form a topic, but also a topic with only positive or negative opinion words, without many explanatory words; think of, for example, 'Great!' or 'I love it'. Furthermore, there can be some reviews with spam or advertisements included in the data set. These documents are also expected to form a topic. With this reasoning, we set the number of topics to K = 10. The reviews are relatively large (sizes vary between 1 and 500 words), so we want (α)i to be close to 1. In this setting, a document/review can write about only one topic, but also about five topics, for example. Variation among the document-topic distributions is still possible, as (α)i is close to 1. Beforehand, we believe that there will be a slight preference for a few topics in a document; it is not likely that a review writes about all ten topics. Therefore, (α)i slightly smaller than 1 is expected to give the best results.

Three different methods of inference are used: Gibbs sampling using KNIME, Variational Bayesian EM using the gensim package in Python, and Adam optimization for finding the posterior mode. The maximal vocabulary size V is set to 2000, meaning that if there are more than 2000 different words in all 2000 reviews together, those that occur the least frequently are removed. Note that, in total, K·M (document-topic probabilities) plus K·V (topic-word probabilities) parameters need to be estimated, resulting in at most 40000 estimates (taking V = 2000). This parameter space is enormous, resulting in slow convergence of both the Gibbs and VBEM algorithms. Also, the optimization method has more problems with finding the optimum, especially when either (α)i or (β)i is smaller than 1. Therefore, the regularization term is given more weight to keep the parameter estimates away from the boundaries of the domain [0,1].

Because it is not feasible to read all reviews and manually assign topics to them, we cannot compare the inference results with intuitive results, as was done for the 'Cats and dogs' data set. Therefore, model validation measures come into play. In section 2.4, perplexity was introduced for a general training and test set. To compute the perplexity for the LDA model, we need to define it further. For a test set consisting of M documents, each having Nd words, document-topic distribution θd, and given the topic-word distributions φk for k = 1,...,K, the perplexity is computed using:

\[
\text{Perplexity}(\mathbf{w}_{\text{test}}) = \exp\left\{-\frac{\log P(\mathbf{w}_{\text{test}}\mid\theta,\phi)}{|\mathbf{w}_{\text{test}}|}\right\}
\propto \exp\left\{-\frac{\sum_{d=1}^{M}\sum_{j=1}^{V}(n_d)_j \log\left(\sum_{k=1}^{K}(\phi_k)_j(\theta_d)_k\right)}{\sum_{d=1}^{M} N_d}\right\} \tag{8.4}
\]

The perplexity compares the inferred model with the case in which each word is equally likely, which is the least informative model and has the highest entropy. The lower the perplexity, the better the model, that is, the more information the model has retrieved from the data set. The computation of the perplexity for our text data is not straightforward. Every document d has its own estimated parameter vector θd. The exact meaning and independence assumptions of all parameters in Latent Dirichlet Allocation will be elaborated on in the next chapter.
For now, it is important to understand that we cannot split the review data set into a set of reviews belonging to the training set and a set of reviews belonging to the test set, because the parameter vectors θd for the documents d in the test set could then not be estimated. Therefore, the data set is split into a training and test set differently. Every document consists of a set of words, which can easily be split in two. The largest part, in this thesis 80% of the words (up to a rounding error), is assigned to the training set; the remaining part of the document forms the test set. With this setting, all model parameters are estimated using the training set, and the perplexity is calculated on the test set. This method of computing the perplexity is proposed in [48].

Another way of comparing inferred parameters is to look at the logarithm of the posterior, which resembles maximum likelihood estimation, only now, instead of the likelihood, we use the posterior distribution, that is, the likelihood times the prior distributions. Earlier in this thesis, Bayesian statistics was introduced, and a distinction was made between the posterior mean and the posterior mode estimator. If different methods are used to estimate the posterior mode, it is straightforward to substitute the resulting parameter estimates into the posterior distribution as a check of which method yields the highest posterior, or highest log posterior. Therefore, the value of the log posterior is also used for model comparison. The model with the highest log posterior value is considered the best.
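A minimal Python sketch of this word-level 80%/20% split and of the perplexity in equation 8.4 is given below; the representation of a document as a list of word indices and the variable names are illustrative assumptions.

import numpy as np

def split_documents(docs, train_frac=0.8, seed=0):
    # docs[d] is a list of word indices; about 80% of each document's words go to the training part.
    rng = np.random.default_rng(seed)
    train, test = [], []
    for words in docs:
        permuted = list(rng.permutation(words))
        cut = int(round(train_frac * len(permuted)))
        train.append(permuted[:cut]); test.append(permuted[cut:])
    return train, test

def perplexity(test_docs, theta_hat, phi_hat):
    # theta_hat: (M, K) estimates from the training parts; phi_hat: (K, V) topic-word estimates.
    log_lik, n_words = 0.0, 0
    for d, words in enumerate(test_docs):
        word_probs = theta_hat[d] @ phi_hat   # P(word j | document d) = sum_k (phi_k)_j (theta_d)_k
        for w in words:
            log_lik += np.log(word_probs[w])  # blows up if this probability is (near) zero
            n_words += 1
    return np.exp(-log_lik / n_words)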

On a server with a fast GPU (Graphics Processing Unit), the optimization and the Variational Bayesian EM have been run. The optimization algorithm can be run in parallel using Tensorflow, such that a lot of time is gained when running the program on a fast GPU. Again, different values for (α)i and (β)i are taken; the results are summarized in table 8.7.

Table 8.7: Overview of results for LDA on the stroller data using different inference methods. Adam optimization and Variational Bayesian Expectation Maximization are used to determine the posterior mode, while the KNIME implementation estimates the model parameters via Gibbs sampling. Two different initialization methods are used for the optimization algorithm: random initialization and taking the estimates of VBEM as the initial value. Besides the model validation scores, it is indicated whether, for either optimization method, the maximum number of iterations is reached; the optimization method is truncated after 100/learning rate iterations.

(α)i    (β)i    Perplexity                                          Log posterior                                           Truncation
                Optim         Optim        KNIME    VBEM            Optim         Optim          KNIME         VBEM         Optim
                random init.  VBEM init.                            random init.  VBEM init.

0.8     0.1     6.43·10^9     5.54·10^8*   691      689**           1.00·10^5     1.27·10^5*     1.12·10^5     inf**        x
0.9     0.9     1.76·10^8     8.46·10^7*   695      681**           -3.14·10^5    -3.24·10^5*    -2.85·10^5    inf**        x
0.99    0.99    1279          1019*        686      675**           -3.96·10^5    -4.23·10^5*    -3.35·10^5    inf**
0.999   0.999   1032          994*         461      491**           -4.05·10^5    -4.33·10^5*    -3.40·10^5    inf**
0.1     1       2072          2512         717      847             7.35·10^5     1.03·10^6      -2.64·10^5    inf          x
1       0.01    4.01·10^5     7644*        678      675**           -2.86·10^5    -1.04·10^5*    1.14·10^5     NaN**        x
1.5     0.9     761           753*         668      665**           -3.55·10^5    -4.04·10^5*    -3.27·10^5    -inf**       x
1       1       1286          1031         692      675             -4.03·10^5    -4.35·10^5     -3.41·10^5    NaN
1.1     1.1     821           850*         913      668**           -4.05·10^5    -4.64·10^5*    -4.42·10^5    -inf**

In the VBEM application in the Python package gensim, the parameter (β)i cannot be tuned. Therefore, at (*), we used the results of VBEM with the corresponding value of (α)i and (β)i = 1 as initialization for Adam optimization. At (**), the model validation scores are actually given for the corresponding (α)i and (β)i = 1, so one needs to be careful when comparing the methods with each other. Whether or not truncation is applied in the optimization method is given for both types of initialization. For each setting of the hyperparameters for which the maximum number of iterations was reached in the optimization method, truncation took place both for optimization with random initialization and for optimization with VBEM initialization. This means that the algorithm did not converge faster when using VBEM initialization with this stopping criterion.

In table 8.7, we see many effects and peculiarities. First of all, the perplexity blows up for the results of the inferred parameters using Adam optimization with both (α)i and (β)i smaller than 1. This is expected, as the optimization algorithm pushes the parameter values of most dimensions towards 0, except for a few, such that for every θd and φk with d = 1,...,M and k = 1,...,K, the sum still equals one. If the test set then contains a word that has a very small probability of occurring in that document, based on the combination of θˆd and the topic-word probabilities φˆk for k = 1,...,K, the contribution of that element's probability to the perplexity is enormous. Remember the definition of perplexity in equation 8.4: if, in some document, word j does occur in the test set (i.e., (nd)j > 0), but its probability in all φˆk is very small, the log of something close to zero is a large negative number. This results in a large value of the perplexity.

The results using Gibbs sampling by KNIME do not suffer from this effect, since Gibbs sampling always assigns some weight to a word: in at least one topic-word distribution estimate Φk (for k = 1,...,K) from KNIME, the word probability is larger than 10^-6. This is in contrast with the optimization results, where the smallest word probability summed over all topics is of the order of magnitude 10^-17. This explains the large difference in perplexity values between the optimization method and KNIME. Therefore, we can conclude that Gibbs sampling has better predictive performance than the posterior mode estimates obtained via Adam optimization.

The only setting in which we can compare VBEM's perplexity score with the other ones, and in which VBEM performs best, is (α)i = 1 and (β)i = 1. Remember that the lower the perplexity, the better the model, as explained in section 2.4. When a closer look is taken at the actual parameter estimates by VBEM, we see that the document-topic and topic-word distributions are relatively flat. This phenomenon could also be observed in table 8.4, where all words are given a relatively large probability, and the highest probability, for the top word 'dogs', is only 0.307, while this word receives a probability of 0.438 in the estimates by the Gibbs sampling method. Naturally, when all words are given a relatively large probability in at least one topic-word distribution, the predictive performance is higher. However, the aim of the application of LDA in marketing intelligence is to describe and summarize the considered data set, not to predict what the topics in the next review will be.

The second model comparison measure that is reported in table 8.7 is the log posterior. This is just the natural logarithm of the posterior distribution, in which all constants are left out. Because these constants are the same for each inference method, their omission has no influence on model comparison. Remember the log posterior distribution for general LDA:

\[
\log(\text{posterior}) = C + \sum_{d=1}^{M}\sum_{j=1}^{V}(n_d)_j\log\left(\sum_{k=1}^{K}(\phi_k)_j(\theta_d)_k\right) + \sum_{d=1}^{M}\sum_{k=1}^{K}\left((\alpha)_k-1\right)\log\left((\theta_d)_k\right) + \sum_{k=1}^{K}\sum_{j=1}^{V}\left((\beta)_j-1\right)\log\left((\phi_k)_j\right) \tag{8.5}
\]

One can clearly see that as soon as any parameter estimate (Θd)k for d = 1,...,M and k = 1,...,K, or any (Φk)j for k = 1,...,K and j = 1,...,V, equals zero, problems arise, since the log of that parameter will go to -∞. If, in addition, either (α)i or (β)i is smaller than one, the log posterior goes to +∞. Indeed, that is the highest log posterior value and thus the posterior mode, but we cannot conclude anything about whether, in general, all parameters are estimated well or not. This occurs several times in the estimates of VBEM; therefore, we see in table 8.7 either +inf or -inf. Note that for (α)i = (β)i = 1, the log posterior of VBEM gives back a 'NaN'. That is caused by the fact that a term 0·∞, which is undefined, occurs in the computation. Therefore, the log posterior measure does not help in making a good comparison between the results of VBEM and those of the other methods.
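For reference, a direct NumPy evaluation of equation 8.5 (omitting the constant C), as used here to compare the estimates from the different inference methods, could look as follows; the array shapes and names are illustrative assumptions.

import numpy as np

def log_posterior(n, theta, phi, alpha, beta):
    # n: (M, V) word counts; theta: (M, K) document-topic estimates; phi: (K, V) topic-word estimates.
    # alpha, beta: symmetric hyperparameters (scalars); the constant C of equation 8.5 is omitted.
    mix = theta @ phi                                   # (M, V): sum_k (phi_k)_j (theta_d)_k
    log_lik = np.sum(n * np.log(mix))
    log_prior = (alpha - 1.0) * np.sum(np.log(theta)) + (beta - 1.0) * np.sum(np.log(phi))
    return log_lik + log_prior

# Example usage: compare two sets of estimates for the same count matrix n, e.g.
# log_posterior(n, theta_gibbs, phi_gibbs, 0.99, 0.99) versus log_posterior(n, theta_adam, phi_adam, 0.99, 0.99)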

Considering the log posterior values of the optimization method and Gibbs sampling, we see that both exist; that is, all estimated parameters have a value larger than 0, albeit sometimes of the order of magnitude 10^-30. It is surprising that, although the explicit goal of Adam optimization is to find a maximum of the log posterior, the Gibbs sampling method finds parameters with a higher log posterior value in all cases in which either (α)i or (β)i is smaller than one. From the previous example, it was already concluded that the optimization method does not always work properly if one of the hyperparameters is smaller than 1. This can also be concluded from table 8.7, where the log posterior value of KNIME is higher than the log posterior values of the optimization method with both initializations for almost all settings in which either hyperparameter is smaller than 1. Only the settings (α)i = 0.8 with (β)i = 0.1, and (α)i = 0.1 with (β)i = 1, are exceptions; with these settings, the optimization method with VBEM initialization performs best. However, for both hyperparameters larger than 1, that is (α)i = 1.1 and (β)i = 1.1, the optimization method with random initialization performs better, both in terms of log posterior value and in terms of perplexity. The optimization method with VBEM initialization, however, performs worse. The latter indicates that the starting point of the optimization algorithm can have a large influence on the model validation scores.

For this reason, the perplexity and log posterior scores are determined for the optimization method with different starting points. The random training and test set split is fixed using a seed, and the hyperparameters are chosen to be (α)i = 0.999 and (β)i = 0.999, since for these settings, no truncation is applied. The results of this sensitivity test are given in table 8.8.

We see that there is a lot of variation in perplexity and log posterior values caused by a random starting point. However, none of the scores in table 8.8 can beat the performance of KNIME. To check whether KNIME was not only lucky with its scores in table 8.7, a sensitivity test is performed by changing the seed in the Gibbs sampling algorithm. With different seeds, the Gibbs sampling algorithm draws different samples. The robustness of the algorithm can then be determined by computing the perplexity and log posterior values.

Although only 5 runs are done, we can already see that the variation in perplexity and log posterior scores is a lot smaller for KNIME than for the optimization method. Therefore, it is concluded that KNIME is a more robust algorithm than Adam optimization.

Another sensitivity test can be done by looking at the split of the documents into training and test sets. Note that in table 8.7, the same training and test sets were used for each method and in each setting. This sensitivity test is only executed for the optimization method with random initialization. We see in table 8.10 that the variation due to the random split into training and test sets is smaller than the variation due to the random initialization and the optimization algorithm itself.

Table 8.8: Perplexity and log posterior scores for LDA applied to 2000 reviews about strollers. The hyperparameters are set to (α)i = 0.999 and (β)i = 0.999. There are ten topics to be found, thus K = 10. The Adam optimization algorithm is used with a stopping criterion threshold of 10^-3 and random initialization. The same split into training and test set is used to avoid measuring multiple effects at the same time. The only variance comes from the initialization and the steps taken in the algorithm.

run    perplexity    log posterior
1      1160          -394845
2      1210          -401540
3      1154          -393955
4      1024          -386141
5      1157          -399764
6      948.7         -366133
7      927.0         -368490
8      1081          -387658
9      1165          -402124

Table 8.9: Perplexity and log posterior scores for LDA applied to 2000 reviews about strollers. The hyperparameters are set to (α)i = 0.999 and (β)i = 0.999. There are ten topics to be found, thus K = 10. The KNIME Topic Extractor is used with different seeds. The training and test set are kept the same throughout the analysis.

run    perplexity    log posterior
1      688           -340473
2      684           -340481
3      687           -340441
4      688           -340379
5      674           -340465

Again, comparing all the different outcomes from the sensitivity tests with KNIME's results in table 8.7, KNIME keeps outperforming the optimization methods.

The remaining question is now: which method is preferred? One that gives back 'flat' probability vectors, such that the perplexity is the lowest and it has the highest predictive power (VBEM)? One in which the validity of the estimates relies on the algorithm not working as it generally should, but which in the end performs well for all hyperparameter settings (Gibbs)? Or one that finds the posterior mode best, but only for hyperparameters larger than 1 (Adam optimization)? The answer that follows from tests with both a small and a large data set is that Gibbs sampling using KNIME works best. The method is robust, works for all hyperparameter settings, and is very quick due to smart parallel programming. However, the application in KNIME is challenging to adapt, as it is integrated into an interface within KNIME. Fortunately, the program is open source, so it is possible to dive into the code and make adjustments where desired. Nevertheless, for this master thesis, that is considered out of scope. The Gibbs sampling algorithm is given, and its performance is shown to be good, so if a fast implementation in a primary programming language like C is possible to construct, we recommend using this method of inference. The optimization method does not perform poorly, especially not for hyperparameters larger than 1; therefore, this algorithm is not ruled out. If there is no time or possibility to do Gibbs sampling, Adam optimization using the package Tensorflow in Python and a fast GPU is well-suited, but the use of different initializations is recommended, such that the best estimates in terms of the log posterior can be determined.
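To indicate what such a Tensorflow-based posterior mode search can look like, a simplified TensorFlow 2 sketch is given below. It maximizes the log posterior of equation 8.5 with Adam over softmax-parameterized Θ and Φ; it omits the regularization term and the relative-change stopping criterion, and it is not the exact implementation used for the reported results.

import numpy as np
import tensorflow as tf

def lda_posterior_mode(n, K, alpha=1.1, beta=1.1, steps=20000, learning_rate=0.001):
    # n: (M, V) document-word count matrix; symmetric hyperparameters alpha and beta.
    M, V = n.shape
    counts = tf.constant(n, dtype=tf.float32)
    theta_logits = tf.Variable(tf.random.normal([M, K]))
    phi_logits = tf.Variable(tf.random.normal([K, V]))
    optimizer = tf.keras.optimizers.Adam(learning_rate=learning_rate)
    for _ in range(steps):
        with tf.GradientTape() as tape:
            theta = tf.nn.softmax(theta_logits)        # (M, K), rows on the probability simplex
            phi = tf.nn.softmax(phi_logits)            # (K, V), rows on the probability simplex
            mix = tf.matmul(theta, phi)                # (M, V): sum_k (phi_k)_j (theta_d)_k
            log_lik = tf.reduce_sum(counts * tf.math.log(mix + 1e-10))
            log_prior = (alpha - 1.0) * tf.reduce_sum(tf.math.log(theta + 1e-10)) \
                        + (beta - 1.0) * tf.reduce_sum(tf.math.log(phi + 1e-10))
            loss = -(log_lik + log_prior)              # minimize the negative log posterior
        grads = tape.gradient(loss, [theta_logits, phi_logits])
        optimizer.apply_gradients(zip(grads, [theta_logits, phi_logits]))
    return tf.nn.softmax(theta_logits).numpy(), tf.nn.softmax(phi_logits).numpy()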

Interpretation of review results
To give an idea of what kind of conclusions can be drawn from LDA, a visualization of a topic is shown in figure 8.3. Similar figures for all other topics can be found in appendix B.1. From the top 20 words of topic 5, we can already see that this topic is mainly about wheels, probably about the combination 'front wheel', and about wheels that are locked. Furthermore, the most common word is 'bob', so the reviews are expected to be about the Bob stroller, with which you can jog. Also, both the words 'turn' and 'revolution' are frequently used. Although these words can have different meanings, for the stroller reviews we expect people to write about whether or not they can easily make a turn with the stroller. Note that the word 'stroller' does not occur. This word is removed from the entire data set beforehand.

Table 8.10: Perplexity and log posterior scores for LDA applied to 2000 reviews about strollers. The hyperparameters are set to (α)i = 0.99 and (β)i = 0.99. There are 10 topics to be found, thus K = 10. The Adam optimization algorithm is used with a threshold of 10^-3 and random initialization. To avoid measuring multiple effects at the same time, the random initialization of Adam optimization is fixed using a seed. The training and test set division is now the only random effect, and it is checked that, for each run, the training and test sets are different.

run    perplexity    log posterior
1      964           -390348
2      1026          -400230
3      1178          -408713
4      1040          -400385
5      940           -393722
6      957           -390796
7      1067          -407518
8      890           -393323
9      1167          -407784
10     1072          -405006

Figure 8.3: Word probabilities of the top 20 words of topic 5. This topic-word distribution φˆ5 is estimated using Adam optimization with (α)i = 1.1 and (β)i = 1.1, and with random initialization.

Words that occur very often, in this case in almost every review, can make the performance of LDA worse concerning topic interpretability. We already know that each document is about a stroller, so it does not give us more information when the word 'stroller' is included as a top word in each topic-word distribution. Our interest is focused on what customers write about their stroller, what problems they encounter, and what they would like to see changed. Also, specific types or brands of strollers can stand out positively or negatively, which is something we would like to extract from the data. This phenomenon can, for example, be seen in topic 1, where among the top words are 'city', 'jogger', 'mini', 'baby', 'britax' and 'gt', which form the names of specific products. Topic 1 is therefore specifically about the 'City Mini GT Jogger' stroller and the 'Britax' car seat, as these are the names customers have used to refer to their strollers or car seats in their reviews.

When we want to know more about the stories behind topic 5 in figure 8.3, or we wish to gain further insight into the sentiment expressed, we can look at the top reviews that belong to that topic. That is, the estimates of Θ1,...,ΘM are examined, and the reviews that have the highest probability of belonging to topic 5 are selected. The top 6 reviews belonging mostly to topic 5 are given in appendix B.2. Note that the reviews can be long; two of them are not even shown entirely because they are too long to visualize. It is interesting that some customers tend to write a whole essay about the stroller, while others only say 'good'. This variation needs to be taken into account when choosing the hyperparameters, and it supports the conclusion that (α)i should be close to 1, such that a lot of variation in the number of topics per review is allowed for.
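A small Python sketch of this selection step is given below, assuming an estimated document-topic matrix theta_hat of shape (M, K) and a list reviews of the raw texts; both names are illustrative.

import numpy as np

def top_reviews_for_topic(theta_hat, reviews, topic, n=6):
    # Sort documents by their estimated probability of belonging to the given topic.
    order = np.argsort(theta_hat[:, topic])[::-1]
    return [(reviews[d], theta_hat[d, topic]) for d in order[:n]]

# Example: the six reviews with the highest probability of belonging to topic 5.
# for text, prob in top_reviews_for_topic(theta_hat, reviews, topic=5): print(round(prob, 3), text[:80])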

From the reviews, we can deduce that topic 5 is indeed about the Bob jogger stroller, with which you can easily jog. There are two modes for this stroller: walk mode and jog mode. In the jog mode, the front wheel is locked, such that the jogger keeps going straight. The customers think you can easily switch the front wheel from locked to unlocked and are satisfied with this option. Also, they are happy with the front wheel being locked while jogging, as this makes jogging with the stroller easier. Two customers write about the wrist strap that keeps the stroller attached to you when you are jogging; both are not using it because they think it is dangerous to use. Only one customer is not satisfied, as his/her stroller's front wheel locks on its own every time, while this is not desired.

Instead of having to read 1000 reviews and manually summarize the major themes in the whole data set, LDA gives us the main topics. From the highest probability words (via the estimates φˆk), the story of the topic can already be speculated about, but when reading the top reviews (determined from the θˆd estimates) for that specific topic, a more detailed story is retrieved. With this information, a next generation of strollers can be improved, or marketing strategies can be adjusted.

In chapter 7, about the validity of the topic-word distribution estimates, it is said that even after having estimated the model parameters Θ and Φ and getting interpretable results, attention needs to be paid to their validity. The number of topics K is chosen based on intuition and expectation. However, it is not certain that K fits the data. It might be the case that there are more topics hidden in the data than we have set in K, which results in two subjects being joined in one topic in the current model. On the other hand, when K is too large compared to the actual number of topics in the data set, some topics will fit noise. To this end, in chapter 7, the NKLS (normalized Kullback-Leibler divergence similarity) and NJSS (normalized symmetric Jensen-Shannon divergence similarity) measures are defined. These measures2 indicate the extent of similarity between two topics. That is, if two topics both fit noise, they will be more similar than two topics that each have their own story. The NKLS scores between the estimates of all Φk for k = 1,...,K of the stroller data set, determined with Adam optimization, are computed and summarized in table 8.11. The NJSS scores of the same estimates are shown in table 8.12.
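The following Python sketch computes pairwise similarity scores of this kind between estimated topic-word distributions. The exact normalization used for NKLS and NJSS is the one defined in chapter 7; as an assumption for illustration only, each divergence is here divided by the largest pairwise divergence and turned into a similarity by subtracting from 1.

import numpy as np

def kl(p, q, eps=1e-12):
    p, q = p + eps, q + eps
    return np.sum(p * np.log(p / q))

def symmetric_divergences(phi_hat, kind="kl"):
    # phi_hat: (K, V) estimated topic-word distributions; returns a (K, K) divergence matrix.
    K = phi_hat.shape[0]
    D = np.zeros((K, K))
    for a in range(K):
        for b in range(K):
            if kind == "kl":                      # symmetrized Kullback-Leibler divergence
                D[a, b] = 0.5 * (kl(phi_hat[a], phi_hat[b]) + kl(phi_hat[b], phi_hat[a]))
            else:                                 # Jensen-Shannon divergence
                m = 0.5 * (phi_hat[a] + phi_hat[b])
                D[a, b] = 0.5 * kl(phi_hat[a], m) + 0.5 * kl(phi_hat[b], m)
    return D

def normalized_similarity(phi_hat, kind="kl"):
    # Assumed normalization: divide by the largest pairwise divergence and map to a similarity in [0, 1].
    D = symmetric_divergences(phi_hat, kind)
    return 1.0 - D / D.max()

# Example: pairs of topics with a similarity above 0.9 are likely fitting the same topic or noise.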

Table 8.11: Normalized KL-divergence scores between all estimated topic-word distributions φˆ1,...,φˆK, which are determined for the stroller data using Adam optimization with symmetric (α)i = 1.1 and (β)i = 1.1, and setting the number of topics to K = 10.

        φˆ0     φˆ1     φˆ2     φˆ3     φˆ4     φˆ5     φˆ6     φˆ7     φˆ8     φˆ9
φˆ0     1       0.262   0.105   0.114   0.584   0.334   0.378   0.483   0.327   0.465
φˆ1     0.262   1       0.049   0.106   0.354   0.324   0.322   0.395   0.193   0.421
φˆ2     0.105   0.049   1       0       0.359   0.069   0.281   0.132   0.104   0.146
φˆ3     0.113   0.106   0       1       0.185   0.073   0.211   0.190   0.010   0.197
φˆ4     0.584   0.354   0.359   0.185   1       0.446   0.550   0.580   0.456   0.565
φˆ5     0.334   0.324   0.069   0.073   0.446   1       0.424   0.452   0.241   0.327
φˆ6     0.377   0.322   0.281   0.211   0.550   0.424   1       0.463   0.392   0.349
φˆ7     0.483   0.395   0.132   0.190   0.580   0.452   0.463   1       0.459   0.639
φˆ8     0.327   0.193   0.104   0.010   0.456   0.241   0.391   0.459   1       0.454
φˆ9     0.465   0.421   0.146   0.197   0.565   0.327   0.349   0.639   0.455   1

There are no remarkable differences between the NKLS and NJSS scores in tables 8.11 and 8.12. All non-zero and non-one NJSS scores are smaller than the corresponding NKLS scores, but, in general, the same conclusions can be drawn from both tables. Therefore, we will only focus on the NKLS scores in table 8.11. Naturally, the diagonal entries equal NKLS = 1, as each topic-word distribution is exactly the same as itself. Furthermore, there are two zeros for the combination of φˆ2 and φˆ3, meaning that these two topic-word distributions are the most different among all combinations. All other NKLS values are in between; the closer a value is to zero, the less similar the two topic-word probability vectors are. In chapter 7, we have said that when an NKLS score is higher than 0.9, two φˆs are considered to be about the same topic. Fortunately, in table 8.11 all similarity scores (except the diagonal, of course) are below 0.9, meaning that each topic can be interpreted separately. This might even be an indicator that K can be increased, as we are not fitting noise yet. It is wise to always perform this check after having estimated all topic-word distributions. In this way, one can draw better conclusions, and noise is detected beforehand, instead of after having read the top reviews for each topic and not being able to retrieve a coherent story behind the topic.

2They are not measures in the mathematical sense.

Table 8.12: Normalized symmetric JS-divergence similarity scores between all estimated topic-word distributions φˆ1,...,φˆK, which are determined for the stroller data using Adam optimization with symmetric (α)i = 1.1 and (β)i = 1.1, and setting the number of topics to K = 10.

        φˆ0     φˆ1     φˆ2     φˆ3     φˆ4     φˆ5     φˆ6     φˆ7     φˆ8     φˆ9
φˆ0     1       0.253   0.087   0.106   0.494   0.320   0.323   0.451   0.332   0.458
φˆ1     0.252   1       0.034   0.130   0.336   0.347   0.267   0.407   0.237   0.415
φˆ2     0.087   0.034   1       0       0.283   0.043   0.211   0.148   0.073   0.128
φˆ3     0.106   0.130   0       1       0.176   0.103   0.211   0.195   0.015   0.238
φˆ4     0.493   0.336   0.283   0.176   1       0.453   0.498   0.577   0.449   0.532
φˆ5     0.320   0.347   0.043   0.103   0.453   1       0.414   0.457   0.299   0.365
φˆ6     0.323   0.267   0.211   0.211   0.498   0.414   1       0.456   0.361   0.338
φˆ7     0.451   0.407   0.148   0.195   0.577   0.457   0.456   1       0.460   0.619
φˆ8     0.332   0.237   0.073   0.015   0.449   0.299   0.361   0.460   1       0.441
φˆ9     0.457   0.415   0.128   0.238   0.532   0.365   0.338   0.619   0.441   1

As the differences between the NKLS and NJSS scores are so small, the similarity score based on the symmetrized Kullback-Leibler divergence is preferred, as a more thorough study of it is performed in [24].

8.3. LDA with syntax and sentiment: is this the future?
The model of LDA with syntax and sentiment is an extension of basic LDA; thus, the parameter space in which we search for the posterior mode location is larger. In this extension, we look for a topic distribution per document, a sentiment distribution per document, and word distributions per topic-sentiment combination. In LDA with syntax and sentiment, a topic is drawn per phrase instead of per word as in plain LDA. Also, a sentiment is assigned to each phrase, and all words in a phrase or sentence are drawn from the word distribution of the topic-sentiment combination belonging to that phrase. Although different methods of inference were researched for basic LDA, in this extension only Adam optimization is used to find the posterior mode estimates of, respectively, Θd for d = 1,...,M, Πd for d = 1,...,M, and Φk,o for k = 1,...,K and o = 1,...,Σ.

The increased number of parameters to be estimated and the addition of sentiments make the form of the posterior density for this LDA extension more complicated than the one for basic LDA. This results in slower convergence of the optimization. Furthermore, a count array with the frequencies of each word per phrase and per document needs to be computed. The latter is the most significant bottleneck encountered in the inference of LDA with syntax and sentiment. This count matrix3 has such large proportions that the computer server used cannot handle it in terms of memory. The count matrix consists of 32-bit floating points, which is required for steps in the optimization algorithm to prevent accuracy loss. For the application of LDA with syntax and sentiment to 200 reviews with at most 100 phrases (otherwise they are removed from the data set) and a vocabulary size of 1000, the count matrix has 200 · 100 · 1000 = 2 · 10^7 elements, and each element is a 32-bit floating point. This results in a count matrix that is too large to keep in memory. Although the array consists of many zeros, it is not possible to convert it to a sparse array, because the Adam optimization implementation in Tensorflow cannot handle sparse arrays. For this reason, LDA with syntax and sentiment in the current implementation can only be used for small data sets.

In this section, first, the algorithm and model are tested on a simulated data set of which the model parameters are known. Then, we gain more intuition about the model through its application to a toy data set. Subsequently, with more knowledge about which settings to use in Adam optimization to obtain good estimates, the algorithm is tested on the stroller data set. Unfortunately, only a minimal number of reviews can be used due to memory problems.

8.3.1. Model testing on gibberish
First, we generate a small data set following the generative process of LDA with syntax and sentiment from chapter 6. A data set has been created consisting of 20 documents with 2 hidden topics. As usual, there are three sentiments: positive, neutral and negative. There are 26 words in the vocabulary, of which 5 are positive, 9 are negative, and the rest is neutral. Because we want distinct topics that can relatively easily be found by the inference method, the hyperparameters α and γ are symmetric and set to (0.5, 0.5) and (0.5, 0.5, 0.5), respectively. These form the parameters of a Dirichlet prior. With all parameters being smaller than 1, we expect the documents to be mostly about a single topic and to have one dominant sentiment. The last hyperparameter is βo for o = 1, 2, 3. In β1, the hyperparameter vector for the positive-sentiment topic-word distributions, 50% of the probability mass is given to the positive words, and the other half is given to the neutral and negative words together. Then, in β2, 70% of the probability mass is given to the neutral words, and 30% to the positive and negative words. A higher percentage is chosen here because there are more neutral words in the vocabulary than positive words, and we want the values of each βo to be of the same order of magnitude. Lastly, in β3, 60% of the probability mass is given to the negative words, and the rest to the positive and neutral words.

With these hyperparameters, we can draw Θd from Dirichlet(α) and Πd from Dirichlet(γ) for each document d ∈ {1,...,M}. Subsequently, for each o ∈ {1,2,3} and k = 1,2, we draw Φk,o from Dirichlet(βo). To construct documents, we first need to draw a number of phrases Sd for each document. This is done using a Poisson(2) distribution: each document consists of at least 10 sentences plus S, where S ~ Poisson(2). Also, the number of words per phrase is random and is drawn from a Poisson(1) distribution; the minimal number of words is 5, to which the Poisson draw is added. Consequently, with all θd, πd and Sd being generated, we can draw a topic and a sentiment for each phrase in each document. Then, given the topic-sentiment combination and the number of words in each sentence, words can be drawn. This results in a gibberish data set, as can be seen in tables B.2 and B.3 in appendix B.3.

3Strictly speaking, this is an array of size N × max_d{Sd} × V. In Tensorflow, the variable type is a tensor with the aforementioned shape.
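A minimal NumPy sketch of the generative process described above is given below; the concrete word-index layout of the vocabulary and the way the βo vectors are filled are illustrative assumptions that follow the percentages mentioned in the text.

import numpy as np

rng = np.random.default_rng(0)
K, S_sent, V, M = 2, 3, 26, 20                         # topics, sentiments, vocabulary size, documents
pos, neg = np.arange(0, 5), np.arange(5, 14)           # assumed indices: 5 positive and 9 negative words
neu = np.arange(14, 26)                                # the remaining 12 words are neutral

def beta_for(sent_idx, share):
    # Spread 'share' of the mass over the sentiment's own words and the rest over the other words.
    b = np.full(V, (1.0 - share) / (V - len(sent_idx)))
    b[sent_idx] = share / len(sent_idx)
    return b

betas = [beta_for(pos, 0.5), beta_for(neu, 0.7), beta_for(neg, 0.6)]
phi = np.array([[rng.dirichlet(betas[o]) for o in range(S_sent)] for k in range(K)])  # (K, 3, V)

docs = []
for d in range(M):
    theta = rng.dirichlet([0.5, 0.5])                  # document-topic distribution
    pi = rng.dirichlet([0.5, 0.5, 0.5])                # document-sentiment distribution
    n_phrases = 10 + rng.poisson(2)
    doc = []
    for _ in range(n_phrases):
        k, o = rng.choice(K, p=theta), rng.choice(S_sent, p=pi)   # topic and sentiment per phrase
        n_words = 5 + rng.poisson(1)
        doc.append(rng.choice(V, size=n_words, p=phi[k, o]))
    docs.append(doc)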

However, the values of the latent random variables in the model, Θd for d = 1,...,M, Πd for d = 1,...,M, and Φk,o for k = 1,...,K and o = 1,...,Σ, are known, and we can check if the posterior mode estimates determined by Adam optimization correspond to the true values.

Adam optimization for LDA with syntax and sentiment has to deal with a large parameter space in which it searches for the maximum (i.e., the posterior mode). If we set the values of one hyperparameter vector α or γ smaller than 1, it quickly falls into the abyss at the boundaries, as explained in the previous section for Adam optimization applied to basic LDA. Therefore, we choose the values of (α)i and (γ)i to be larger than 1. The best posterior mode estimates for Θ and Π are obtained for (α)i = 1.8 and (γ)i = 1.1. From runs with different hyperparameter choices, we conclude that the algorithm quickly 'decides' that a document belongs to only one topic, even if we set (α)i larger than 1. Therefore, this hyperparameter vector has larger values than γ. It is more challenging to find the settings of the hyperparameter vectors βo with o = 1,2,3 for which the posterior mode estimates correspond to the true parameters. Empirically, we found that the best settings for βo with o = 1,2,3 are obtained if they are constructed by assigning 90% of the probability mass to respectively the positive, neutral, or negative words. Subsequently, the obtained vectors are multiplied by 50 for the positive hyperparameter vector β1, and by 40 for both the neutral and negative hyperparameter vectors β2 and β3.

But how can we determine which estimates are the best? For the computation of the perplexity, we need to split the data set into a training and test set. Because LDA with syntax and sentiment focuses on topics and sentiments at the sentence level, and, in practice, many reviews consist of only a few sentences, it is not wise to split. Therefore, we have reported the estimates of Θd for d = 1,...,M, Πd for d = 1,...,M, and Φk,o for k = 1,...,K and o = 1,...,Σ, and compared them manually with the corresponding true parameters.

Firstly, we take a look at the posterior mode estimates of all document-topic distributions. In table 8.13, the true values of θd for d = 1,...,M are shown, and in table 8.14, the corresponding estimates are given.

Table 8.13: Simulated document-topic distributions that are independently drawn from a Dirichlet(0.5, 0.5) distribution.

d     (θd)1    (θd)2
1     0.176    0.824
2     0.878    0.122
3     0.999    0.001
4     0.971    0.029
5     0.332    0.668
6     0.821    0.179
7     0.839    0.161
8     0.989    0.011
9     0.973    0.027
10    0.973    0.027
11    0.017    0.983
12    0.244    0.756
13    0.599    0.401
14    0.979    0.021
15    0.013    0.987
16    0.931    0.069
17    0.781    0.219
18    1.000    0.000
19    0.426    0.574
20    0.932    0.068

Table 8.14: Posterior mode estimates of the document-topic distributions of the simulated data, determined with Adam optimization in which the hyperparameter settings are (α)i = 1.8, (γ)i = 1.1, and βo for o = 1,2,3 constructed as described above. A learning rate of 0.0001 and random initialization have been used.

d     (θˆd)1   (θˆd)2
1     0.059    0.941
2     0.931    0.069
3     0.941    0.059
4     0.945    0.055
5     0.245    0.755
6     0.818    0.182
7     0.845    0.155
8     0.931    0.069
9     0.945    0.055
10    0.778    0.222
11    0.059    0.941
12    0.534    0.466
13    0.651    0.349
14    0.941    0.059
15    0.055    0.945
16    0.868    0.132
17    0.759    0.241
18    0.931    0.069
19    0.222    0.778
20    0.857    0.143

The estimates are considered good if the following principles are satisfied. Firstly, the allocation of the main topic corresponds to the true allocation, that is, if a document is mainly about one topic (for example document 3), this is also the case in the estimation results. Secondly, if a document is about both topics (for example document 13), the same phenomenon can be seen in the estimates, albeit in a slightly different proportion.

With these two weak principles, we can conclude that the estimates of Θd by Adam optimization perform relatively well: the main topic allocations, and the documents that contain both topics, are found. Subsequently, the estimates πˆd of the document-sentiment distributions are compared with the corresponding true sentiment distributions in tables 8.15 and 8.16. Similarly as for the document-topic distributions, we check if the main sentiments are found in the posterior mode estimates. We see in table 8.16 that if a document has almost only one sentiment, this effect is also present in the estimates. If there are multiple sentiments, the πˆd deviate a bit more from the true πd values. In general, we can conclude that the main effects are captured by the Adam optimization.
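These two checks can be written down in a few lines of NumPy; theta_true, theta_hat, pi_true and pi_hat denote the arrays of true and estimated distributions, and the names are illustrative.

import numpy as np

def dominant_agreement(true_dist, est_dist):
    # Fraction of documents whose dominant topic (or sentiment) is recovered by the estimates.
    # Note: due to topic exchangeability, the estimated topics may first need to be permuted to match the true ones.
    return np.mean(np.argmax(true_dist, axis=1) == np.argmax(est_dist, axis=1))

def mean_abs_error(true_dist, est_dist):
    # Average absolute deviation between true and estimated probabilities.
    return np.mean(np.abs(true_dist - est_dist))

# Example: dominant_agreement(theta_true, theta_hat) and dominant_agreement(pi_true, pi_hat)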

Table 8.15: Simulated document-sentiment distributions that are independently drawn from a Dirichlet(0.5, 0.5, 0.5) distribution.

d     (πd)1    (πd)2    (πd)3
1     0.294    0.699    0.006
2     0.414    0.435    0.150
3     0.006    0.010    0.984
4     0.891    0.083    0.026
5     0.031    0.810    0.159
6     0.267    0.460    0.273
7     0.087    0.045    0.868
8     0.000    0.924    0.076
9     0.862    0.092    0.045
10    0.387    0.609    0.004
11    0.775    0.009    0.216
12    0.091    0.218    0.691
13    0.072    0.913    0.015
14    0.208    0.357    0.435
15    0.051    0.920    0.029
16    0.776    0.217    0.007
17    0.693    0.047    0.260
18    0.977    0.016    0.007
19    0.000    0.396    0.604
20    0.516    0.091    0.393

Table 8.16: Posterior mode estimates of the document-sentiment distributions of the simulated data, determined with Adam optimization in which the hyperparameter settings are (α)i = 1.8, (γ)i = 1.1, and βo for o = 1,2,3 constructed as described above. A learning rate of 0.0001 and random initialization have been used.

d     (πˆd)1   (πˆd)2   (πˆd)3
1     0.378    0.614    0.008
2     0.495    0.301    0.204
3     0.008    0.008    0.984
4     0.834    0.158    0.008
5     0.073    0.778    0.149
6     0.383    0.383    0.233
7     0.107    0.013    0.880
8     0.010    0.883    0.107
9     0.979    0.014    0.008
10    0.274    0.717    0.009
11    0.975    0.009    0.016
12    0.083    0.233    0.684
13    0.007    0.987    0.007
14    0.415    0.252    0.333
15    0.008    0.975    0.017
16    0.901    0.091    0.008
17    0.495    0.010    0.495
18    0.949    0.042    0.010
19    0.009    0.452    0.540
20    0.097    0.009    0.894

Lastly, the topic-sentiment-word distributions are estimated using the posterior mode determination via Adam optimization. In tables 8.17 and 8.18, the true values of Φ used to generate the data and the posterior mode estimates are given. In this case, estimates are considered good if the main words per φˆk,o for each possible (k,o)-combination are the same, and if the word probabilities are of the same order for the top 10 words in each distribution. As an extra check, the NJSS similarity scores between all true φk,o and estimated φˆk,o are given in table 8.19. From both the manual comparison of the true topic-sentiment-word distributions with the estimates and the NJSS similarity scores, we see that the estimates are very similar to the true φ distributions. Therefore, we conclude that Adam optimization used for posterior mode determination for LDA with syntax and sentiment works well for simulated data.

Table 8.17: Simulated topic-sentiment-word distributions that are independently drawn from Dirichlet(βo) distributions corresponding to their sentiment. Hyperparameters βo are constructed by dividing the probability mass over the corresponding sentiment words and the rest of the vocabulary.

Vocabulary    φ1,1     φ1,2     φ1,3     φ2,1     φ2,2     φ2,3
afraid        0.000    0.052    0.310    0.004    0.000    0.001
aggressive    0.000    0.000    0.003    0.000    0.000    0.004
allergic      0.000    0.008    0.000    0.000    0.000    0.000
canid         0.198    0.069    0.000    0.000    0.000    0.068
cat           0.028    0.532    0.000    0.000    0.009    0.000
dog           0.002    0.000    0.000    0.001    0.000    0.000
donot         0.000    0.013    0.010    0.522    0.000    0.000
fluffy        0.000    0.002    0.000    0.000    0.030    0.000
give          0.000    0.000    0.000    0.000    0.509    0.135
happiness     0.034    0.000    0.000    0.141    0.000    0.000
hate          0.084    0.015    0.000    0.001    0.000    0.000
lifelong      0.264    0.000    0.000    0.000    0.000    0.000
like          0.000    0.003    0.000    0.009    0.000    0.028
make          0.000    0.025    0.000    0.000    0.000    0.067
most          0.145    0.000    0.000    0.000    0.000    0.000
nice          0.001    0.000    0.000    0.220    0.000    0.046
noteworthy    0.143    0.001    0.000    0.001    0.000    0.000
people        0.000    0.005    0.077    0.098    0.201    0.000
pet           0.077    0.202    0.000    0.000    0.000    0.271
regret        0.000    0.000    0.001    0.000    0.000    0.004
sad           0.000    0.000    0.239    0.000    0.000    0.321
smell         0.000    0.022    0.359    0.002    0.000    0.053
stubborn      0.020    0.000    0.000    0.000    0.000    0.000
stupid        0.000    0.000    0.000    0.000    0.000    0.000
walk          0.000    0.008    0.000    0.000    0.247    0.002
wet           0.004    0.041    0.000    0.000    0.003    0.001

Table 8.18: Posterior mode estimates of the topic-sentiment-word distributions of the simulated data, determined with Adam optimization in which the hyperparameter settings are (α)i = 1.8, (γ)i = 1.1, and βo for o = 1,2,3 constructed as described above. A learning rate of 0.0001 and random initialization have been used.

Vocabulary    φˆ1,1    φˆ1,2    φˆ1,3    φˆ2,1    φˆ2,2    φˆ2,3
afraid        0.004    0.041    0.291    0.011    0.007    0.015
aggressive    0.004    0.006    0.006    0.011    0.007    0.024
allergic      0.004    0.016    0.006    0.011    0.007    0.016
canid         0.173    0.065    0.001    0.003    0.002    0.019
cat           0.015    0.443    0.001    0.003    0.005    0.004
dog           0.005    0.002    0.001    0.003    0.002    0.004
donot         0.001    0.020    0.010    0.336    0.002    0.004
fluffy        0.001    0.002    0.001    0.003    0.032    0.004
give          0.001    0.002    0.002    0.003    0.424    0.096
happiness     0.049    0.021    0.018    0.152    0.022    0.051
hate          0.093    0.016    0.006    0.011    0.007    0.016
lifelong      0.249    0.021    0.018    0.035    0.022    0.051
like          0.013    0.030    0.018    0.035    0.022    0.066
make          0.001    0.020    0.001    0.003    0.002    0.035
most          0.146    0.002    0.001    0.003    0.002    0.004
nice          0.013    0.021    0.018    0.234    0.022    0.082
noteworthy    0.123    0.021    0.018    0.035    0.022    0.051
people        0.001    0.011    0.068    0.056    0.159    0.004
pet           0.063    0.153    0.001    0.003    0.002    0.162
regret        0.004    0.006    0.006    0.011    0.007    0.024
sad           0.004    0.006    0.197    0.011    0.007    0.166
smell         0.004    0.029    0.299    0.016    0.007    0.079
stubborn      0.021    0.006    0.006    0.011    0.007    0.016
stupid        0.000    0.000    0.000    0.000    0.000    0.000
walk          0.001    0.005    0.001    0.003    0.203    0.004
wet           0.003    0.036    0.001    0.003    0.002    0.004

Table 8.19: NJSS scores between the simulated topic-sentiment-word distributions φk,o (rows) and the estimated distributions φˆk,o (columns).

NJSS      φˆ1,1    φˆ1,2    φˆ1,3    φˆ2,1    φˆ2,2    φˆ2,3
φ1,1      0.996    0.652    0.975    0.978    0.526    0.879
φ1,2      0.514    0.992    0.433    0.442    0.000    0.349
φ1,3      0.982    0.517    0.997    0.994    0.534    0.892
φ2,1      0.982    0.517    0.997    0.994    0.534    0.892
φ2,2      0.456    0.040    0.458    0.467    0.993    0.756
φ2,3      0.845    0.384    0.865    0.868    0.871    0.991

8.3.2. Stroller data set: deep dive on a single topic

The implementation of the inference method used to determine the model parameters of LDA with syntax and sentiment, Adam optimization, does not work well with large data sets, as explained in the beginning of this section. Due to memory problems, we need to restrict ourselves to analyses with at most 400 documents. Moreover, documents with more than 100 phrases are removed, as they blow up the size of the count array in which the observed words are summarized. Taking 400 arbitrary documents out of the entire data set would give inaccurate results, because these documents will be about at least 10 topics, as shown in the previous section in which LDA is applied to the entire stroller data set. The estimation of 10 topics, each with 3 possible sentiments, from a data set with 400 documents is infeasible. Therefore, we have chosen to pick a subset of the whole data set based on the topic the documents belong to. Each document has an estimated probability vector θˆ_d with the topic probabilities. For consistency with the previous results of plain LDA, we have selected the reviews that have most probability assigned to topic 5, that is, argmax_k (θˆ_d)_k = 5. This reduced stroller data set consists of 177 documents. We already know that together these documents form a topic about the Bob Jogger stroller that has a front wheel that can be locked in order to facilitate jogging. Also, people write about the wrist strap. Therefore, we expect that there are 2 subtopics within this data set, so K = 2. Again, there are three sentiments: Σ = 3. Using the previous results of LDA with syntax and sentiment with Adam optimization as inference method, we have set the hyperparameters (α)_i and γ both to 2. The probability mass in each vector β_o is divided 95%-5% over respectively the corresponding sentiment words and the remaining words in the vocabulary. Subsequently, β_o is multiplied by 10. The resulting estimates for Φ_{k,o} for topics k = 1,2 and sentiments o = 1,2,3 are given in tables 8.20 and 8.21. Note that the vectors φˆ_{k,o} are rearranged to display only the top 20 words with the highest probability in each word distribution.
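The selection of this reduced data set can be summarized in a few lines. The sketch below is a minimal illustration (not the implementation used in this thesis) of picking the reviews whose estimated document-topic distribution puts most mass on topic 5; the array `theta_hat` and the list `reviews` are placeholder names.

```python
import numpy as np

# Minimal sketch: select the reviews whose estimated document-topic distribution
# puts most mass on topic 5 (0-based index 4 here). `theta_hat` is assumed to be
# an (M x K) array of estimated theta_d vectors and `reviews` a list of M texts.
theta_hat = np.random.dirichlet(alpha=[0.5] * 10, size=1000)  # placeholder estimates
reviews = [f"review text {d}" for d in range(1000)]           # placeholder documents

dominant_topic = np.argmax(theta_hat, axis=1)   # argmax_k (theta_d)_k per document
selected = [r for r, k in zip(reviews, dominant_topic) if k == 4]
print(f"{len(selected)} reviews have topic 5 as their most probable topic")
```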

Table 8.20: Estimated topic-sentiment-word distributions for subtopic 1, for the stroller reviews that belong to topic 5 of an analysis with plain LDA (with hyperparameters (α)_i = 1.1, (β)_i = 1.1). In this analysis for LDA with syntax and sentiment, (α)_i = 2, (γ)_i = 2, and β is constructed as explained in section 8.3.2.

positive (φˆ1,1)         neutral (φˆ1,2)        negative (φˆ1,3)
love          0.024      bob          0.043     break         0.022
glad          0.022      wheel        0.038     pain          0.019
easy          0.021      run          0.033     problem       0.018
perfect       0.021      jog          0.024     useless       0.018
fine          0.020      front        0.023     pricey        0.018
sturdy        0.020      baby         0.022     hurt          0.018
smooth        0.018      make         0.018     worse         0.018
great         0.016      lock         0.016     cons          0.018
pretty        0.016      buy          0.015     downside      0.018
awesome       0.016      love         0.015     shock         0.017
happy         0.016      walk         0.014     mind          0.017
pleased       0.015      like         0.013     warning       0.017
hope          0.015      dont         0.012     slow          0.017
kid           0.015      great        0.012     tire          0.016
agree         0.015      jogger       0.012     frustrate     0.016
deal          0.015      im           0.011     long          0.016
decent        0.014      fold         0.011     barely        0.016
solid         0.014      seat         0.010     regret        0.016
drive         0.013      easy         0.010     suck          0.016
active        0.013      month        0.010     narrower      0.016
good          0.013      turn         0.010     disappointed  0.015

Table 8.21: Topic-sentiment-word distributions for subtopic 2, for the stroller reviews that belong to topic 5 of an analysis with plain LDA (with hyperparameters (α)_i = 1.1, (β)_i = 1.1). In the analysis for LDA with syntax and sentiment, (α)_i = 2, γ = 2, and β is constructed as explained in section 8.3.2.

positive (φˆ2,1)         neutral (φˆ2,2)        negative (φˆ2,3)
versatile     0.014      bring        0.011     bulky         0.018
live          0.013      sister       0.010     disappointed  0.017
solid         0.013      travel       0.009     hang          0.016
simple        0.013      bigger       0.009     rough         0.015
fantastic     0.012      sand         0.009     issue         0.015
impress       0.012      toy          0.009     hole          0.015
favorite      0.012      return       0.009     warning       0.015
handy         0.012      sell         0.009     expensive     0.014
strong        0.011      wife         0.009     cheap         0.014
stable        0.011      rock         0.009     worry         0.013
real          0.010      house        0.009     buckle        0.013
helpful       0.010      product      0.008     worn          0.013
suggest       0.010      cost         0.008     miss          0.013
excite        0.010      anymore      0.008     lie           0.013
special       0.010      speed        0.008     backward      0.013
appreciate    0.010      wet          0.008     dirt          0.013
beautiful     0.010      room         0.008     rack          0.013
plenty        0.010      warn         0.008     cold          0.013
beautifully   0.010      aisle        0.008     narrow        0.013
importantly   0.010      order        0.008     disappoint    0.013
clear         0.010      part         0.008     flimsy        0.013

From tables 8.20 and 8.21, we can conclude that the positive, neutral and negative words are well divided over the three classes. This is forced by assigning much more probability mass to the words corresponding to the sentiment class than to the other words in the construction of the hyperparameter vectors β_o. Furthermore, the two topics found within this small stroller data set are distinctive, because they do not have the same top 20 words in any of the word distributions. A remark needs to be made here: we have to be careful interpreting these results, since the word probabilities are almost equal within each vector. We expect different results for other values of the hyperparameters β_o.

Focusing on table 8.20, we can already conclude that people are glad, happy, and pleased with their Bob jogger, find it pretty and sturdy, and actually say they love it. On the downside, there is a problem with pain: something apparently breaks and people mention that they are hurt. From the word list, it does not become immediately clear what actually hurts or breaks, so for more detailed knowledge, we would still have to return to reading the reviews. Nevertheless, we can already conclude more based on these tables than on the word probabilities from plain LDA in the previous section.

In chapter 6, LDA with syntax and sentiment was proposed with a split on parts-of-speech. This can be done with the results in tables 8.20 and 8.21 by splitting each topic-sentiment-word distribution into two groups: one group with adjectives and adverbs, and one group with nouns and verbs. In this way, we globally get a list with opinion words (group 1) and a list with aspect words (group 2) for each topic-sentiment-word distribution φˆ_{k,o}; a small sketch of such a split is given after this paragraph. However, this split is not applied in this example because the estimates are considered too unreliable.

The test of inference with Adam optimization for the extended LDA model on a small stroller data set shows that there is potential. However, more research needs to be done on which settings are optimal, and on a measure that quantitatively determines whether one set of settings gives better results than another. Note that the perplexity score could again be used here, but the division of the data into a training and test set is too cumbersome for this model, as we need to take the sentences into account. Also, in the previous section, the perplexity has been shown to have undesired properties. Therefore, it is not applied to the results for this LDA extension.
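As an illustration of the proposed part-of-speech split, the sketch below divides a small, hand-picked word list into opinion words (adjectives and adverbs) and aspect words (nouns and verbs) using the Python package nltk. This is only a sketch: tagging isolated words without sentence context is less accurate than tagging full phrases, and the word list is illustrative, not taken from the estimates.

```python
import nltk

# Sketch of the proposed split of a topic-sentiment word list into opinion words
# (adjectives/adverbs) and aspect words (nouns/verbs); the word list is illustrative.
# Requires the nltk tagger data, e.g. nltk.download('averaged_perceptron_tagger').
top_words = ["love", "wheel", "run", "smooth", "sturdy", "jog", "break", "pricey"]

tagged = nltk.pos_tag(top_words)                                           # [(word, tag), ...]
opinion_words = [w for w, tag in tagged if tag.startswith(("JJ", "RB"))]   # adjectives, adverbs
aspect_words = [w for w, tag in tagged if tag.startswith(("NN", "VB"))]    # nouns, verbs
print("opinion:", opinion_words)
print("aspect:", aspect_words)
```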

9. Discussion

"I have not failed. I’ve just found 10,000 ways that won’t work." Thomas Edison (1847-1931)

In this master thesis, the topic model Latent Dirichlet Allocation is thoroughly researched, and different methods of inference are applied to estimate the desired model parameters. Throughout the process, several questions have arisen, and the LDA results for two data sets using different methods of inference have shown that some of these methods do not perform satisfactorily. The phenomena that cause this are elaborated on in the first section of this chapter. Shortcomings of and open questions about the extension of LDA, LDA with syntax and sentiment, are described in section 9.2.

9.1. Latent Dirichlet Allocation

9.1.1. Assumptions

First of all, the assumptions made in LDA are quite strong. Each review is considered to consist of a list of words that have no syntactic relation with each other, which means that all words can be permuted, while it is assumed that no information is lost. This bag-of-words assumption is the weakest part of LDA, because these permutations result in weak topics. Naturally, some words are strictly linked to one topic, but many of them only have a clear meaning in relation to the words surrounding them in a sentence. Words like ‘the’ and ‘for’ are removed beforehand because they are considered to be ‘stop words’, but other frequently occurring words that are, for example, adjectives are still included in the data set. These words can have different meanings depending on the context, which is entirely lost in the bag-of-words representation. To improve upon this aspect of LDA, the extension with syntax and sentiment is proposed. The bag-of-words representation is still used, but now only on sentence level, so that the exchangeability of words only applies within a sentence or phrase. With this splitting of documents into phrases, less information is lost, and the meaning of adjectives is linked to words in that same phrase, resulting, in theory, in more accurate topics.

Besides, on a more mathematical level, the assumptions made in the hierarchical structure of LDA are questionable. All topic-word distributions are assumed to be independent. Whether this assumption is valid really depends on the data set. If you look at the ‘newsgroups’ data set consisting of news articles, which is often used in the literature, it is natural to assume that the topic-word distributions are independent, as they are very distinctive. However, for review data, this is less likely. There can be two topics that describe overlapping aspects or opinions, such that their topic-word distributions are neither wholly distinctive nor independent, making it harder to do inference.


9.1.2. Linguistics

Some aspects of language cannot be taken into account in either basic LDA or the proposed extension. A problematic word that quickly comes to mind is ‘not’. This word says a lot about someone’s opinion or sentiment on the product, while it is treated as just another word, nothing more, nothing less. That is, only if the word ‘not’ and a certain adjective or verb often occur together in a review will both words be linked to the same topic, such that the sentiment of ‘not + other word’ can be retrieved from the estimated topic-word distributions φˆ. If this co-occurrence is not present, the sentiment or opinion linked to a topic will unfortunately be inaccurate. At the moment, in review analyses done by CQM, this phenomenon is counteracted by reading the top reviews belonging to each topic. Reading only a limited number of reviews saves time, while opinions corresponding to topics can still be extracted. However, it would be better to improve the way LDA copes with words such as ‘not’, such that the topic-word distributions themselves show the reviewers’ opinions correctly, and reading becomes superfluous.

One possibility to incorporate the function of ‘not’ in the data analysis is, for example, attaching the word ‘not’ to the previous or next word. This method is not very accurate, but easy to implement (a minimal sketch is given below). Another option is to parse each sentence automatically (this can be done in Python using the package nltk), and retrieve the link of ‘not’ with the corresponding verb, adjective, adverb, or other part-of-speech in the sentence using part-of-speech tagging. This is an implementable option, but it will take a lot of time to parse each sentence in each review. Besides, some reviews are not written in proper English, making accurate parsing more challenging. The last option is the incorporation of word order in the Latent Dirichlet Allocation model. This has already been done by some topic modellers (see for example [18], [17], [10]), and has been proved to work well. However, these models have increased complexity, and thus require, in general, more computation time and better programming skills.
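A minimal sketch of the first, simple option (attaching ‘not’ to the next word) is shown below; it is meant only as an illustration of the preprocessing step, not as the pipeline used in this thesis.

```python
def merge_negations(tokens):
    """Attach 'not' to the following token, e.g. ['do', 'not', 'like'] -> ['do', 'not_like'].

    This is the simple preprocessing option described above; it does not use parsing.
    """
    merged = []
    skip_next = False
    for i, tok in enumerate(tokens):
        if skip_next:
            skip_next = False
            continue
        if tok.lower() == "not" and i + 1 < len(tokens):
            merged.append("not_" + tokens[i + 1])
            skip_next = True
        else:
            merged.append(tok)
    return merged

print(merge_negations("i do not like the wrist strap".split()))
# ['i', 'do', 'not_like', 'the', 'wrist', 'strap']
```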

9.1.3. Influence of the prior

In Bayesian statistics, we have seen that if the data set is large enough, the influence of the prior distribution will be negligible. Firstly, it is difficult to know when the data set is large enough to accurately estimate all latent random variables of interest. In general, whether the data set is large enough can be determined by changing the hyperparameters and checking their influence on the results (a sketch of such a check is given below). If this influence is negligible, the data set is large enough. If it is small and the main conclusions remain the same, the data set can be worked with, but if the results change significantly for different hyperparameters, the data set is not large enough. Unfortunately, data sets cannot easily be increased, so the number of parameters should be decreased, leaving relatively more data to estimate fewer parameters. If the number of topics is decreased, however, it might be the case that two actual themes in the data set are assigned to the same topic. Therefore, the results must be interpreted more carefully in this case.

Furthermore, we have seen that the Adam optimization method is very sensitive to the values of the hyperparameters. This is an undesirable property, as the influence of the hyperparameters should vanish for large data sets. Therefore, Adam optimization as inference method is not preferred. Gibbs sampling shows a lot less sensitivity, so for the application of LDA to data, this method is recommended. Variational Bayesian EM is often used in LDA applications, but this inference method is nevertheless advised against due to its independence assumptions, which are considered too strong.
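A sketch of such a sensitivity check is given below. The two topic-word estimates here are random placeholders; in practice they would come from rerunning the inference with different hyperparameter settings, and the Jensen-Shannon distance is only one of several reasonable comparison measures.

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

# Sketch of the sensitivity check described above: estimate the topic-word matrix under
# two different hyperparameter settings and measure how much the topics move. Here the
# two (K x V) estimates are random placeholders standing in for the rerun results.
rng = np.random.default_rng(0)
phi_setting_1 = rng.dirichlet(np.ones(500) * 0.1, size=10)
phi_setting_2 = rng.dirichlet(np.ones(500) * 0.1, size=10)

def max_shift(phi_a, phi_b):
    # For each topic in phi_a, Jensen-Shannon distance to the closest topic in phi_b
    # (taking the minimum makes the comparison tolerant to topic permutations).
    return max(min(jensenshannon(pa, pb) for pb in phi_b) for pa in phi_a)

shift = max_shift(phi_setting_1, phi_setting_2)
print(f"largest topic shift between settings: {shift:.3f}")
# A small shift suggests the data dominate the prior; a large shift suggests the
# data set is too small for the chosen number of parameters.
```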

Given the fact that, in general, the data set is too small to make the influence of the prior disappear, we need to be considerate in choosing the prior distributions and the corresponding hyperparameters. Dirichlet distributions are good priors for Multinomial distributions, and Multinomials are appropriate for drawing categorical random variables such as topics or sentiments. Therefore, these prior distributions are a good fit for the topic modeling aim of Latent Dirichlet Allocation. However, as mentioned before, the values of hyperparameters α and β can have a large influence on the results. Instead of choosing a symmetric α or β hyperparameter vector, an asymmetric prior can also be chosen. In this way, one topic (in case of α) or a few words (in case of β) are given preference in each document or each topic. Especially for the document-topic distributions, this can be wise, since it is not rare that every document is expected to be about mainly one topic. This topic is then given preference beforehand in the chosen α by assigning more mass to one dimension of the parameter vector of the Dirichlet distribution.

Because the asymmetric hyperparameter vector is still a powerful prior, an extra prior on α can be imposed. A combination of extra priors on α has been researched: one prior is imposed for the (a)symmetry, and another for the order of magnitude of α. Their product then forms α, the hyperparameter for the document-topic distributions. As we have seen before, the order of magnitude of α has a large influence on the way topics are distributed. A value larger than one for symmetric α gives preference to a uniform distribution over the topics, while α smaller than 1 gives preference to a distribution that assigns most mass to only one or a few dimensions. We want to learn this scaling aspect of α from the data, and thus impose a prior on it. The supplementary (a)symmetry prior has been chosen to be a Dirichlet distribution with a parameter vector consisting of only ones, such that any possible document-topic distribution is equally probable. The distribution of the scaling prior is more difficult to choose, and several options have been thought of. Because we care about the order of magnitude, one might think of drawing n ∼ Uniform([−2, 2]) and then taking 10^n for the order of magnitude. There are other possibilities for the distribution of n, as long as the resulting orders of magnitude lie between 10^{−2} and 10^{2}; these are considered reasonable orders of magnitude for α (a small sketch of a draw from this double prior is given below). Because a double prior on α causes the hierarchical model to become more complex concerning the form of the posterior density, this idea is not worked out more elaborately. It is, however, determined that with such a double prior, conjugacy in the Bayesian model is lost, and both Gibbs sampling and posterior mode determination via Adam optimization become computationally more challenging.
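For illustration, a single draw from the double prior described above could look as follows. The rescaling by the number of topics at the end is one possible interpretation of “the product forms α” and is an assumption of this sketch, not the thesis specification.

```python
import numpy as np

# Sketch of the double prior on alpha discussed above: a symmetric Dirichlet(1,...,1)
# draw controls the (a)symmetry, and an independent scaling draw 10**n with
# n ~ Uniform([-2, 2]) controls the order of magnitude. Their product forms alpha.
rng = np.random.default_rng(1)
K = 10                                        # number of topics (assumed)

asymmetry = rng.dirichlet(np.ones(K))         # any shape over the simplex equally probable
n = rng.uniform(-2.0, 2.0)                    # exponent for the order of magnitude
scale = 10.0 ** n                             # resulting scale between 1e-2 and 1e2

alpha = scale * K * asymmetry                 # rescaled so a symmetric draw has entries ~ scale
print(np.round(alpha, 3))
```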

9.1.4. Model selection measures

We have seen in the results in chapter 8 that it is not easy to find a suitable measure to compare different models, whether the difference comes from different estimation methods or different hyperparameters. Perplexity is a widely used measure based on information theory and is appropriate for topic models. However, it can only work with training and test sets, making it a lot more difficult to apply to LDA because the data set needs to be split. In this thesis, the split is performed within each document, such that approximately 80% of the words in a document are taken to train the model and estimate the required parameters, while 20% are left for the test set.

We have seen that the perplexity measure gives preference to models with flat distributions, that is, each topic and each word have a relatively large probability in the document-topic and topic-word distributions respectively. However, if a topic does not occur in a document, it is preferred that the probability of that topic in the document-topic distribution is 0. Therefore, flat distributions are usually not good representations of the data. Numerically, document-topic probabilities being exactly zero cause problems, so 10^{-20} or an even smaller number is used instead. These numbers occur in the estimates of Θ and Φ obtained by Adam optimization, but the perplexity reacts very poorly to them, resulting in a bad perplexity score.

One method to improve the perplexity score that can easily be implemented is the following. When computing the perplexity over a test set, words with low estimated probabilities in all topics can be ignored, such that their contribution to the perplexity is removed, whereas otherwise their small probabilities would have caused the perplexity to increase (a sketch of such a computation, including this filter, is given below). Although this does not solve the entire problem, it is expected that the perplexity score will become more stable and less sensitive to low values in the topic-word distributions. Note that we do not wish to throw out topics with low probabilities in the same manner: if a topic has an overall low probability in all documents, we had better reduce the number of topics K. Other methods to improve the perplexity, based on different ways of splitting the data set into a training and test set, are proposed in [48].
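The sketch below shows one way a held-out, per-word perplexity of this kind could be computed, including the proposed filter on words with negligible probability in every topic. It is only an illustration under placeholder estimates and counts; the exact perplexity definition and split used in this thesis may differ in details.

```python
import numpy as np

# Minimal sketch of a held-out perplexity computation of the kind discussed above,
# given estimates theta_hat (M x K) and phi_hat (K x V) and held-out counts n_test (M x V).
def perplexity(theta_hat, phi_hat, n_test, min_word_prob=0.0):
    p_w = theta_hat @ phi_hat                     # (M x V): p(word j | document d)
    mask = n_test > 0
    if min_word_prob > 0.0:
        # Optional filter discussed above: ignore held-out words whose probability is
        # negligible under every topic, so they do not dominate the score.
        mask &= phi_hat.max(axis=0)[None, :] >= min_word_prob
    log_lik = np.sum(n_test * np.log(p_w) * mask)
    return np.exp(-log_lik / np.sum(n_test * mask))

# Placeholder example with random estimates and counts:
rng = np.random.default_rng(2)
theta_hat = rng.dirichlet(np.ones(5), size=50)        # 50 documents, 5 topics
phi_hat = rng.dirichlet(np.ones(200), size=5)         # 5 topics, 200 words
n_test = rng.poisson(0.1, size=(50, 200))             # held-out word counts
print(f"perplexity: {perplexity(theta_hat, phi_hat, n_test):.1f}")
```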

Besides the disappointing property of perplexity that it deals poorly with probabilities close to 0, it also does not correct for the number of parameters considered. This means that a lower and thus better perplexity score is reached for increasing K, the number of topics. However, for higher K the model might be overfitting, and the accuracy of the topics decreases for an increasing number of parameters. As an alternative, the value of the log posterior for the estimated random variables Θ and Φ is proposed. This measure does not suffer from the effects of training and test sets. The larger the log posterior value, the better the parameter estimates. This measure is based on the posterior mode being the best representation of the model; the aim of each inference method is then to get as close as possible to the posterior mode. This principle is comparable to maximum likelihood estimation in classical statistics. However, this measure can only be used to compare different methods of inference or different hyperparameters, since the total number of parameters needs to be the same in each model. Whether the posterior mode indeed gives the best model parameters for LDA is not researched in this thesis. For annotated data sets, this can be looked into by comparing the log posterior value for the actual, true model parameters with that of the posterior mode.
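For concreteness, such a log posterior value (up to an additive constant) can be evaluated as in the sketch below, assuming the standard LDA likelihood with symmetric Dirichlet priors; terms that do not depend on Θ and Φ are dropped, which is exactly why the score can only compare models with the same dimensions.

```python
import numpy as np

# Sketch of the log posterior value (up to an additive constant) for plain LDA,
# evaluated at estimates theta_hat (M x K) and phi_hat (K x V), with word counts
# n (M x V) and symmetric hyperparameters alpha and beta.
def log_posterior(theta_hat, phi_hat, n, alpha, beta):
    log_lik = np.sum(n * np.log(theta_hat @ phi_hat))            # sum_d sum_j n_dj log p(word j | d)
    log_prior_theta = (alpha - 1.0) * np.sum(np.log(theta_hat))  # Dirichlet(alpha) prior terms
    log_prior_phi = (beta - 1.0) * np.sum(np.log(phi_hat))       # Dirichlet(beta) prior terms
    return log_lik + log_prior_theta + log_prior_phi

rng = np.random.default_rng(3)
theta_hat = rng.dirichlet(np.ones(5), size=50)        # placeholder estimates
phi_hat = rng.dirichlet(np.ones(200), size=5)
n = rng.poisson(0.2, size=(50, 200))                  # placeholder word counts
print(f"log posterior value: {log_posterior(theta_hat, phi_hat, n, 1.1, 1.1):.1f}")
```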

Apart from the perplexity and the log posterior value, measures for better model comparison need to be found. Fortunately, the data in LDA consist of words, such that, through reading, one can qualitatively validate the correctness of the model parameters. This becomes, however, more difficult for large data sets with many parameters. It is better to find a quantitative measure with a valid mathematical argumentation to conclude which inference method is the best for LDA, and to find the best settings for K and hyperparameters α and β. More research can thus be done on good model comparison and model validation measures for topic modeling.

9.2. LDA with syntax and sentiment

The extension to LDA using sentences and sentiments has a good theoretical foundation. The assumption that in each phrase only one topic is described with one sentiment is reasonable, although not always true. Also, the manner of splitting a review into sentences can be discussed, as the splitting rules do not apply to every review. Think of splitting on the word ‘and’, for example. This word can either be part of an enumeration, in which case it is not always informative to split on ‘and’, or it can be a conjunction between two main clauses. In the latter case, it is wise to split on ‘and’, because a complete phrase exists on both sides of the word. The same can be said about splitting on commas. Therefore, it is recommended to read a small subset of the (review) data set to get a grasp of the writing style that is used, such that, with this knowledge, you can decide which splitting rules are the most appropriate.

Furthermore, for the assignment of positive and negative sentiment to a review, word lists with the most common positive and negative words in the English language are used. These word lists are not perfect, and some words will be assigned to the wrong sentiment. In further applications, these word lists need to be perfected and adapted to the data set at hand. Words that are positive in one context might be negative in another; think of, e.g., ‘close’.

Lastly, Adam optimization is not robust to hyperparameter settings and initialization. It can recover the parameters of a simulated data set, but a remark needs to be made there: the simulated data set consisted of very distinct topic-sentiment-word distributions, making it easier to find the correct parameters. We have seen for the stroller data that it is a lot more difficult to find the right settings and interpretable estimates. Furthermore, the algorithm only works on small data sets in the current implementation. More research can be done on both the robustness and the upscaling of Adam optimization for determining the posterior mode estimates of interest in LDA with syntax and sentiment. Besides, in the comparison of inference methods for plain LDA, we have seen that Gibbs sampling performs better than Adam optimization in terms of parameter estimation. Therefore, it is expected that Gibbs sampling for LDA with syntax and sentiment will give better results, with which hopefully manual reading becomes superfluous.

10. Conclusion and recommendations

The main goal of this work is to find good methods to summarize customers’ opinions in review data, without having to read all reviews. To this end, Latent Dirichlet Allocation has been described, and various inference methods to estimate its model parameters have been researched. Moreover, a new method of inference, Adam optimization, has been applied to LDA for two different data sets. Because this thesis is written in collaboration with CQM, an extension to LDA that is more suitable for review analyses and opinion mining is introduced, called LDA with syntax and sentiment. For this new topic model, only one inference method is tested, on a simulated data set and a real data set. The main conclusions are given in the first section of this chapter. Subsequently, recommendations are proposed for further research.

10.1. Main findings

Latent Dirichlet Allocation is a hierarchical Bayesian topic model that allocates topics to words, and in this way finds topic distributions per document and word distributions per topic. From the document-topic distributions, we can conclude which topics occur most frequently in the reviews. The topic-word distributions give insight into what the topics are about, such that we know more about the customers’ opinions on the specific product they describe in the set of reviews. Marketing strategies and product innovations can then be steered in a specific direction using the conclusions from the topics in the review data set.

Three different inference methods are used to estimate the document-topic and topic-word distributions in LDA: Gibbs sampling, Variational Bayesian Expectation-Maximization, and Adam optimization. Of these methods, Gibbs sampling, using the implementation in the program KNIME, gives the most robust and best results. The inference method that is new in its application to LDA, Adam optimization, disappoints, since it is sensitive to hyperparameter settings and initialization. Only for hyperparameters larger than 1, a setting that is preferred in only a limited number of cases, does the algorithm perform well.

After having estimated the parameters in the topic-word distributions, a check is needed to determine whether the chosen number of topics was correct. That is, if the number of topics is too large, the model overfits, which results in some topics consisting of noise. This noise is not informative and should not be interpreted. Therefore, two similarity measures are computed for all pairs of topic-word distributions: the normalized symmetric Kullback-Leibler divergence and the normalized symmetric Jensen-Shannon divergence. Both perform well and can thus be used to check for overfitting in terms of the number of topics.
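A sketch of such a pairwise comparison is given below. The exact normalized symmetric divergences used in this thesis are defined in an earlier chapter; here a plain (squared) Jensen-Shannon divergence between all pairs of estimated topic-word distributions is shown as a stand-in, with random placeholder estimates.

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

# Sketch of the pairwise topic comparison described above: compute a symmetric
# divergence between every pair of topic-word rows of the estimated phi matrix.
def topic_divergence_matrix(phi_hat):
    K = phi_hat.shape[0]
    D = np.zeros((K, K))
    for i in range(K):
        for j in range(K):
            D[i, j] = jensenshannon(phi_hat[i], phi_hat[j], base=2) ** 2  # JS divergence
    return D

rng = np.random.default_rng(4)
phi_hat = rng.dirichlet(np.ones(200) * 0.1, size=6)   # placeholder (K x V) estimates
D = topic_divergence_matrix(phi_hat)
# Off-diagonal values close to 0 indicate near-duplicate (possibly noise) topics,
# which suggests that the chosen number of topics K is too large.
print(np.round(D, 2))
```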

10.2. Further research

Although in this thesis different methods of inference are looked into, a new one is proposed, and even a new topic model is introduced, research is never finished. Therefore, we will shortly discuss recommendations for further research in this section.

Firstly, better model validation and model comparison measures need to be defined. Currently, in the field of topic modeling, the perplexity is used. For this measure, a division of the data set into a training and a test set is required. However, model parameters are estimated on document level, with the result that this split is not straightforward. More research can be done on methods to form a training and test set for the computation of the perplexity. Moreover, the perplexity has two disadvantages that are both inherent to its definition: it has a preference for flat distributions and for overfitted models. Lastly, it says something about the predictive power of LDA, while the aim of LDA in this research is to summarize data sets. Another possible method to compare model outcomes is the log posterior value. This is the value of the posterior density (up to a proportionality constant) for the estimated model parameters. It is essential that the parameters of the models that are compared have the same dimensions. This is a drawback, as it would be ideal if the model comparison score could give insight into the best number of topics to use. Furthermore, it is not certain that the highest log posterior value, that is, the posterior mode, also gives the best estimates of the model parameters for LDA. More research can be done on whether or not this is true for Latent Dirichlet Allocation. Because both model validation scores do not perform satisfactorily, a better score needs to be thought of. The outcomes of topic models are often interpreted qualitatively; thus, a score in which human interpretation of text can be combined with automatic reading is preferred.

Secondly, Adam optimization has been applied to the posterior density of the Latent Dirichlet Allocation model to find the posterior mode. This method can only find the posterior mode and estimate the model parameters correctly for specific hyperparameter settings. That is, for hyperparameters smaller than 1, the algorithm walks towards the boundary of the domain, with the result that optimization is only performed in a limited number of dimensions. A method to prevent the algorithm from showing this behavior can be researched. Either the algorithm can be improved, or another way of making the model prefer documents having only a few topics, for example via an extra prior, can be thought of.

Lastly, the extended version of LDA, specifically designed for review analyses, can be perfected. The theoretical framework underlying LDA with syntax and sentiment is realistic, but Adam optimization as inference method leaves much to be desired. Adam optimization is used to search for the posterior mode estimates of the document-topic, document-sentiment, and topic-sentiment-word distributions, but it is challenging to find good settings for the algorithm, and the method is very sensitive to initialization. Only for a simulated data set with distinct topic-sentiment-word distributions have good estimates been found. Already for basic LDA, it had been shown that Adam optimization does not work as desired. Therefore, it is recommended to apply Gibbs sampling to LDA with syntax and sentiment to find estimates for the model parameters. Still, much is expected from the new extension to LDA; only the best method of inference needs to be discovered. After all, research is nothing more than searching for answers to your problems, and consequently adapting the questions themselves.
Following the wise words of Sherlock Holmes in the works of Sir Arthur Conan Doyle (1859-1930): "Once you eliminate the impossible, whatever remains, no matter how improbable, must be the truth."

A. Mathematical background and derivations

A.1. Functional derivative and Euler-Lagrange equation

In the derivation of the variational Bayesian EM algorithm with the mean field approximation, a functional derivative is used to determine the form of the auxiliary distribution for which the functional, in this case a lower bound on the log likelihood, is maximal. In this section, the functional derivative is derived and the definition of the differential (the derivative of a functional) is given. But first, some general definitions and results from functional analysis are stated.

Let X be a vector space, Y a normed space and T a transformation defined on a domain D ⊂ X and having range R ⊂ Y [29].

Definition A.1 (Gateaux differential)
Let x ∈ D ⊂ X and let h be arbitrary in X. If the limit
\[
\delta T(x;h) = \lim_{\alpha \to 0} \frac{1}{\alpha}\big[ T(x + \alpha h) - T(x) \big] \tag{A.1}
\]
exists, it is called the Gateaux differential of T at x with increment h. If the limit in A.1 exists for each h ∈ X, the transformation T is said to be Gateaux differentiable at x.

A more frequently used definition of the Gateaux differential is the following [29]: if f is a functional on X , the Gateaux differential of f , if it exists, is

\[
\delta f(x;h) = \frac{d}{d\alpha} f(x + \alpha h)\,\Big|_{\alpha = 0} \tag{A.2}
\]
and for each fixed x ∈ X, δf(x;h) is a functional with respect to the variable h ∈ X. A stronger differential is the Fréchet differential, which is defined on a normed space X; this stronger notion implies continuity (see proposition A.1) [29].

Definition A.2 (Fréchet differential)
Let T be a transformation defined on an open domain D in a normed space X and having range in a normed space Y. If for fixed x ∈ D and each h ∈ X there exists δT(x;h) ∈ Y which is linear and continuous with respect to h such that
\[
\lim_{\|h\| \to 0} \frac{\big\| T(x + h) - T(x) - \delta T(x;h) \big\|}{\|h\|} = 0, \tag{A.3}
\]
then T is said to be Fréchet differentiable at x and δT(x;h) is said to be the Fréchet differential of T at x with increment h.

Three general propositions from [29] are summarized in proposition A.1. For the proofs, we refer to [29].


Proposition A.1 (Properties Fréchet differential) The following are true for a transformation T defined on an open domain D in a normed space X and having range in a normed space Y :

• If the transformation T has a Fréchet differential, it is unique.

• If the Fréchet differential of T exists at x, then the Gateaux differential exists at x and they are equal.

• If the transformation T defined on an open set D in X has a Fréchet differential at x, then T is continuous at x.

Now consider a functional F of the form:

\[
F = \int_{x_1}^{x_2} L\big(q(x), \dot q(x), x\big)\, dx \tag{A.4}
\]

where $\dot q(x) = \frac{dq}{dx}$. A classical problem in the field of variational calculus is finding a function q on [x_1, x_2] that minimizes the functional F [29]. The admissible set of functions for this problem consists of all functions that are continuous and whose derivatives are continuous on the interval [x_1, x_2]. Besides, let q be an admissible function and suppose there exists h such that q + h is also admissible. All such possible functions h are collected in the class of so-called admissible variations. Also restrict the set of admissible functions to those whose end points, i.e. q(x_1) and q(x_2), are fixed. The Gateaux differential of functional F, assuming that it exists, is given by:

\[
\delta F(q;h) = \frac{d}{d\alpha} \int_{x_1}^{x_2} L\big(q + \alpha h,\, \dot q + \alpha \dot h,\, x\big)\, dx\, \Big|_{\alpha = 0}
= \int_{x_1}^{x_2} L_q(q, \dot q, x)\, h(x)\, dx + \int_{x_1}^{x_2} L_{\dot q}(q, \dot q, x)\, \dot h(x)\, dx \tag{A.5}
\]

It can be verified that this differential is also a Fréchet differential [29]. Now, theorem A.1 gives a necessary condition for the extrema of functional F, as desired [29].

Theorem A.1 (Extrema of a functional)
Let the real-valued function f have a Gateaux differential on a vector space X. A necessary condition for f to have an extremum at x_0 ∈ X is that δf(x_0; h) = 0 for all h ∈ X.

As proposition A.1 indicates that every Fréchet differential is also a Gateaux differential, we can apply theorem A.1 to equation A.5 to find the extremum.

\[
\delta F(q;h) = \int_{x_1}^{x_2} L_q(q, \dot q, x)\, h(x)\, dx + \int_{x_1}^{x_2} L_{\dot q}(q, \dot q, x)\, \dot h(x)\, dx
= \int_{x_1}^{x_2} \big[ L_q(q, \dot q, x)\, h(x) + L_{\dot q}(q, \dot q, x)\, \dot h(x) \big]\, dx = 0 \tag{A.6}
\]

To arrive from the equation above at the Euler-Lagrange equation, we use one of the fundamental lemmas from variational calculus [29]:

Lemma A.1
If α(t) and β(t) are continuous in [t_1, t_2] and

\[
\int_{t_1}^{t_2} \big[ \alpha(t) h(t) + \beta(t) \dot h(t) \big]\, dt = 0 \tag{A.7}
\]
for every h ∈ D[t_1, t_2] with h(t_1) = h(t_2) = 0, then β is differentiable and $\dot\beta(t) \equiv \alpha(t)$ in [t_1, t_2].

Note that in equation A.6, we have the same form as in lemma A.1. Therefore:

\[
L_q(q, \dot q, x) = \frac{d}{dx} L_{\dot q}(q, \dot q, x)
\;\;\Rightarrow\;\;
L_q(q, \dot q, x) - \frac{d}{dx} L_{\dot q}(q, \dot q, x) = 0 \tag{A.8}
\]
This last result is known as the Euler-Lagrange equation, which is used in this thesis for the derivation of the variational Bayesian EM update equations.

A.2. Expectation of logarithm of Beta distributed random variable

In the derivation of the update equations in variational Bayesian EM for general LDA, the expectation of the logarithm of a Beta distributed random variable occurs several times. In this section, its computation is elaborated on. Consider the latent random vector Θ, which is Dirichlet distributed with parameter vector α. From previous results, we know that (Θ)_i is Beta distributed with parameters (α)_i and $\sum_{j \neq i} (\alpha)_j$. The probability density function of (Θ)_i is therefore:

\[
\begin{aligned}
p\Big((\theta)_i \,\Big|\, (\alpha)_i, \textstyle\sum_{j \neq i} (\alpha)_j\Big)
&= \frac{\Gamma\!\left(\sum_{k=1}^K (\alpha)_k\right)}{\Gamma((\alpha)_i)\,\Gamma\!\left(\sum_{j \neq i} (\alpha)_j\right)} \cdot (\theta)_i^{(\alpha)_i - 1} \cdot \big(1 - (\theta)_i\big)^{\sum_{j \neq i} (\alpha)_j - 1} \\
&= \exp\Bigg\{ \big((\alpha)_i - 1\big)\log\big((\theta)_i\big) + \Big(\sum_{j \neq i} (\alpha)_j - 1\Big)\log\big(1 - (\theta)_i\big) \\
&\qquad\quad + \log\Gamma\!\Big(\sum_{k=1}^K (\alpha)_k\Big) - \log\Gamma\!\Big(\sum_{j \neq i} (\alpha)_j\Big) - \log\big(\Gamma((\alpha)_i)\big) \Bigg\} \\
&= h\big((\theta)_i\big) \cdot \exp\big\{ \eta_1 t_1((\theta)_i) + \eta_2 t_2((\theta)_i) - A(\eta) \big\}
\end{aligned} \tag{A.9}
\]
From equation A.9 we can conclude that the distribution of (Θ)_i belongs to an exponential family with natural statistics log((θ)_i) and log(1 − (θ)_i) and natural parameters (α)_i − 1 and $\sum_{j \neq i} (\alpha)_j$. The normalization constant is given by A(η). Now we can use a useful result for exponential family distributions, namely its moment generating function, which we will derive for the first moment as follows. First note that:

\[
e^{A(\eta_1, \eta_2)} = \int h\big((\theta)_i\big) \cdot \exp\big\{ \eta^T t((\theta)_i) \big\}\, d(\theta)_i \tag{A.10}
\]

Differentiating both sides with respect to η gives:
\[
\begin{aligned}
\nabla_\eta\, e^{A(\eta_1, \eta_2)} &= \nabla_\eta \left( \int h\big((\theta)_i\big) \exp\big\{ \eta^T t((\theta)_i) \big\}\, d(\theta)_i \right) \\
e^{A(\eta_1, \eta_2)}\, \nabla_\eta A(\eta_1, \eta_2) &= \left( \int h\big((\theta)_i\big)\, t_1\big((\theta)_i\big) \exp\big\{ \eta^T t((\theta)_i) \big\}\, d(\theta)_i \,,\;\; \int h\big((\theta)_i\big)\, t_2\big((\theta)_i\big) \exp\big\{ \eta^T t((\theta)_i) \big\}\, d(\theta)_i \right) \\
\Rightarrow \nabla_\eta A(\eta_1, \eta_2) &= \big( \mathbb{E}[t_1((\theta)_i)],\; \mathbb{E}[t_2((\theta)_i)] \big)
\end{aligned} \tag{A.11}
\]
where integration and differentiation can be interchanged via dominated convergence. The result is that we have obtained expressions for the expectations of each natural statistic. For the derivation of the variational Bayesian EM algorithm for LDA, the expectation of log(Θ) was needed, which is also one of the two natural statistics. Therefore:

à à à K !! à à !! ! £ ¤ ∂A(η1,η2) ∂ X X E log((θ)i ) log Γ (α)k log Γ (α)j log(Γ((α)i )) = ∂η1 = ∂((α)i 1) − k 1 + j i + − = 6= (A.12) à K ! X Ψ Γ( (α)k ) Ψ((α)i ) = − k 1 + = With Ψ( ) being the digamma function. · A.3. LDA posterior mean determination 103

A.3. LDA posterior mean determination

One of the possibilities to estimate parameters using Bayesian statistics is to compute the posterior mean. In this section, the determination of the posterior mean for the high-dimensional hierarchical Bayesian model that LDA is, is derived.

Suppose we start with the computation of the posterior mean for each Θ_d. To retrieve an expression for this estimator from the joint posterior distribution p(θ, φ | w), we can first condition on φ and then compute the mean over Φ. In formulas:
\[
\mathbb{E}[\Theta \mid w] = \mathbb{E}_{\Phi}\big[ \mathbb{E}_{\Theta \mid \Phi}\left[ \Theta \mid \Phi, w \right] \big] \tag{A.13}
\]
The subscript in E_Φ denotes the fact that the expectation is taken with respect to random variable Φ. Because each Θ_d is independent of the topic distributions of other documents, Θ_j for j ≠ d, the posterior mean for Θ_d can be computed separately for each document. Note that we still need to condition on all topic-word distributions Φ_k for k = 1,...,K.

\[
\mathbb{E}[\Theta_d \mid w] = \mathbb{E}_{\Phi}\big[ \mathbb{E}_{\Theta_d \mid \Phi}\left[ \Theta_d \mid \Phi, w \right] \big] \tag{A.14}
\]

To this end, we need an expression for the conditional posterior of Θd given Φ and w. Using equation 4.2, we arrive at:

" M V Ã K !nd,j # " M K # Y Y X Y Y (α) 1 k − p(θ φ,w) (φk)j (θd)k (θd)k | ∝ d 1 j 1 k 1 · d 1 k 1 = = = = = M (" V Ã K !nd,j # " K #) Y Y X Y (α) 1 k − (φk)j (θd)k (θd)k (A.15) = d 1 j 1 k 1 · k 1 = = = = " V Ã K !nd,j # " K # Y X Y (α) 1 k − p(θd φ,w) (φk)j (θd)k (θd)k | ∝ j 1 k 1 · k 1 = = = The distribution of (Θ Φ,w) in equation A.15 is called the generalized Dirichlet distribution and is defined in d| [11] as follows. Definition A.3 (Generalized Dirichlet distribution) Let u and b be K -dimensional vectors, let Z be a [K κ] matrix and let β be a κ-dimensional vector. If u is × distributed as Dirκ(b,Z,β), then its probability density function defined on the (K 1)-simplex is given by: − ³ b 1´h β i 1 QK i Qκ ¡PK ¢− j B(b)− i 1 ui − j 1 i 1 ui zi,j f (u;b,Z,β) = = = (A.16) = R β(b,Z,β) −

Where R is a double Dirichlet average [11], which is explained in definition A.4 below.

Therefore we know that (Θ_d | φ, w) ∼ Dir_V(α, Φ, −n_d), where matrix Φ consists of all vectors φ_k concatenated row-wise. That is, matrix element Φ_{i,j} is the j-th element of topic-word distribution vector φ_i.

Definition A.4 (Double Dirichlet average)
The double Dirichlet average is the generalization of the function R, Carlson's multiple hypergeometric function. Consider a matrix Z (K × κ), vectors u and b of size K × 1 and vectors v and β of size κ × 1. The double Dirichlet average for some a is then defined as follows:

\[
\begin{aligned}
R_a(b, Z, \beta) &= \mathbb{E}_{u \mid b}\big[ \mathbb{E}_{v \mid \beta}\big[ (u^T Z v)^a \big] \big] \\
&= \mathbb{E}_{u \mid b}\big[ R_a(\beta;\, Z^T u) \big]
\end{aligned} \tag{A.17}
\]
For a = −β_·, where β_· = Σ_i β_i, the double average becomes [11]:
\[
R_{-\beta_\cdot}(b, Z, \beta) = \mathbb{E}_{u \mid b}\left[ \prod_{j=1}^{\kappa} \Big( \sum_{i=1}^{K} u_i z_{i,j} \Big)^{-\beta_j} \right] \tag{A.18}
\]

In [20], Jiang et al. present several methods to approximate Carlson’s R and they derive its exact computation for certain special cases.

In order to obtain the posterior mean of Θ, we first need to know the mean of the distribution of (Θ | φ, w). Therefore an expression for the product moment function of a generalized Dirichlet distribution could be used.

Proposition A.2 (Product moment of a generalized Dirichlet distributed random variable)
Consider the case in definition A.3, that is u ∼ Dir_κ(b, Z, β). Then its product moment for vectors m and μ is:

"Ã K ! κ Ã K ! µj # Y m Y X − B(b m) R (β µ)(b m, Z ,β µ) i − + + + Eu b,Z ,β ui ui zi,j + (A.19) | i 1 j 1 k 1 · = B(b) R β(b, Z ,β) = = = −

In the case of the document-topic distribution in LDA, we know that (Θ_d | φ, w) ∼ Dir_V(α, Φ, n_d), with θ_d the topic distribution of document d, α the prior parameter vector, Φ the matrix with topic-word probabilities of size (K × V) and n_d the vector with all word occurrences (in counts) in document d. With this analytical expression for the conditional distribution, we can compute the conditional posterior mean in closed form, using equation A.19. Taking m = i (with i the unit vector, i.e. (0,0,...,1,0,...,0) with a 1 in the i-th place) and μ = 0 in equation A.19:
\[
\mathbb{E}\big[ (\Theta_d)_i \mid \Phi, w \big] = \frac{B(\alpha + i)}{B(\alpha)} \cdot \frac{R_{-N_d}(\alpha + i, \Phi, n_d)}{R_{-N_d}(\alpha, \Phi, n_d)} \tag{A.20}
\]

As a result, there is an analytical expression for the conditional posterior mean of (Θ_d)_i for each document d and each dimension i = 1,...,K. However, in order to obtain a general, unconditional posterior mean for (Θ_d)_i, the expected value of expression A.20 with respect to Φ needs to be determined. Therefore, we need to integrate over the ratio of double Dirichlet averages R. To this end, it is better to have a simplified expression for R(α + i, Φ, n_d), such that we might recognize a probability distribution of which an exact form of the mean is known, or such that equation A.20 can be written in a form in which we can compute the integral.

Jiang et al. propose in [20] two methods to compute the double Dirichlet average analytically and two methods that approximate its value. For the first two methods, matrix Φ needs to satisfy several assumptions, which hold for LDA only in rare cases. The first method, in which R can be calculated relatively easily, requires that matrix Φ be an n-level nested partition indicator matrix. An n-level nested partition indicator matrix is a matrix whose columns are indicator vectors of the n-level nested partition subsets. The indicator vectors are vectors that take on value 1 for index i if category i is in the subset and value 0 otherwise. The n-level nested partition subsets are explained more thoroughly in [20], but one can think of these sets as forming a partition of the set {1,...,V} while either being subsets of each other or being disjoint. That is, the subsets in the n-level nested partition cannot partially overlap. Note that in the case of LDA, matrix Φ is the matrix considered in R, which is a probability matrix with values in the interval [0,1] summing row-wise to 1. Therefore it can only satisfy the requirement for exact computation if each topic-word distribution gives all probability to one word and probability 0 to all other words in the vocabulary. Only then will the matrix consist of 0's and 1's, as required in this method. Because this is not a realistic case for the model considered in this thesis, the first method of exact computation of the double Dirichlet average is discarded.

Secondly, the so-called ‘expansion method’ is proposed in [20]. This method is valid for any matrix Φ and uses the fact that vector n_d consists of non-negative integers. This results in a simplification of R(α + i, Φ, n_d) via the introduction of a matrix W. For the exact procedure, we refer to the explanation in [20]. This method does result in an analytical expression for R; however, we need to sum over all possible matrices W, which results in a sum over $\prod_{u=1}^{U} \big((n_d)_u + 1\big)$ terms, where U is the number of unique words in document d and (n_d)_u the frequency of word u in document d. Furthermore, note that the entries of matrix Φ are unknown, as they are still random variables of which the expectation needs to be computed. Although it is analytically possible to write out all possibilities for W and obtain an expression for the double Dirichlet average R for each possible W in terms of beta functions and Φ, the method is considered computationally intractable. This conclusion is also drawn in [20] for high dimensional data sets, which is the case in this project.

The first approximation method of Jiang et al. is the application of Laplace's approximation. In order to be allowed to apply this formula, we need the function:

\[
g(\theta_d) = \prod_{j=1}^{J} \Big( \sum_{k=1}^{K} (\Phi_k)_j \cdot (\Theta_d)_k \Big)^{m_j} \tag{A.21}
\]
to have one single mode. Requirements for this condition to be true are that all terms m_j are strictly positive and that the columns of Φ span the K-dimensional vector space. The first requirement is not always met, as vector m includes the word counts n_d for document d, and it is not seldom that a word in the vocabulary does not occur in the specific document d, such that n_{d,j} = 0 for word j. The second requirement, that the columns of Φ span the K-dimensional vector space, is not verifiable beforehand, as Φ is a random matrix whose values are unknown. Therefore it cannot be guaranteed that function g(θ_d) has a single mode, which we already expected due to the possibility of topic permutations. With this being a strict requirement for the Laplace approximation, this method can unfortunately not be applied to the double Dirichlet averages in equation A.20.

At last, a Monte Carlo method is proposed to determine R. Here the fact is used that R is actually the mean of a function of the form in equation A.21, as stated in equation A.18 in the definition of the double Dirichlet average. Therefore, given matrix Φ, we could simulate the Dirichlet distribution by drawing x_i from a Gamma((α)_i, 1) distribution for i = 1,...,K (see section 2.2.1), computing $u_i = x_i / \sum_j x_j$, and substituting this in equation A.18. Note that vector β in equation A.18 is known and we have assumed that matrix Φ is also known. However, as aforementioned, Φ is a random matrix whose values are unknown, so the Monte Carlo procedure to determine R cannot be executed. Naturally, we could take different numerical examples for Φ and use these in the approximation; only, there are infinitely many options for matrix Φ, as each element can take on any value in [0,1] as long as the rows sum to 1. It is therefore considered unfeasible to use this method for the approximation of R.
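For completeness, the mechanics of the Monte Carlo procedure are illustrated below for one fixed numerical example of Φ; as argued above, this does not solve the actual problem, since in the posterior-mean computation Φ is unknown. The dimensions, counts and parameter values are arbitrary placeholders.

```python
import numpy as np

# Sketch of the Monte Carlo procedure described above for the double Dirichlet
# average R in equation A.18, for one fixed numerical example of the matrix Phi.
rng = np.random.default_rng(6)
K, V = 3, 8
alpha = np.full(K, 1.5)                       # vector b in equation A.18
Phi = rng.dirichlet(np.ones(V), size=K)       # fixed (K x V) example standing in for matrix Z
n_d = rng.poisson(1.0, size=V)                # word counts, so beta_j = n_(d,j)

def R_monte_carlo(alpha, Phi, n_d, n_samples=100_000):
    x = rng.gamma(shape=alpha, scale=1.0, size=(n_samples, len(alpha)))
    u = x / x.sum(axis=1, keepdims=True)      # u ~ Dirichlet(alpha) via the gamma construction
    mix = u @ Phi                             # (n_samples x V): sum_i u_i z_(i,j)
    return np.mean(np.prod(mix ** (-n_d), axis=1))

print(f"Monte Carlo estimate of R: {R_monte_carlo(alpha, Phi, n_d):.4g}")
```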

Research on approximations of Carlson's multiple hypergeometric function R is mostly focused on problems in a Bayesian setting, categorical data and missing values, see for example [12]. In those cases, the matrix in R is an n-level nested partition indicator matrix or can be transformed into one. Consequently, the posterior mean can be determined analytically. The writer of this thesis has not found other methods than those proposed in [20] to determine R, and therefore the posterior mean of θ_d is considered both computationally and, in most cases, also analytically intractable. That is, computations are too exhaustive and in most cases no analytical expression for the posterior mean is known. Besides, when we look at the posterior mean for Φ, which naturally also needs to be determined, we see that the conditional posterior distribution of (Φ | θ, w) is not even a generalized Dirichlet distribution. Computations are expected to be even more difficult in this case, so the same conclusion can be drawn. In conclusion, the posterior mean of the desired parameters cannot be determined analytically; therefore we will need to resort to approximation methods for the posterior distribution in order to compute the posterior mean.

B. Results and data sets

B.1. Stroller topic-word distributions


B.2. Top stroller reviews topic 5

Probability (θ_d)_5    Review (topic 5)

Im an avid runner and a selfconfessed gearhead and was debating between this and the Ironman (or both) or some other brand. This was to be a pure running stroller, as we already have a good everyday stroller in the City Mini. It was a difficult choice, as we have some good local running trails, but theyre somewhat curvy. I wasnt sure if the Ironman would be too difficult to navigate through all the turns, and I thought maybe this one would be easier to run with the wheel unlocked. But I knew youre supposed to run with the wheel locked, and the Ironman was a bit backweighted to make it easier to turn so that made me think maybe the Ironman was the better choice. I got this one on basically a coin flip (after convincing myself that if backweighting is the big difference then I could just hang some weight on the handlebar) and it turned out that the one thing making me consider the Ironman probably makes no difference at all.So the thing with this stroller and curvy trails is, I still lock the wheel. Our trail (portage bicentennial) has some curvy areas and some straight areas. On the straight areas I found its far easier to run with the wheel locked, because you can guide the stroller with one hand and jog with the other. When you get to the curvy areas, you have to slow down a bit to turn it, but I dont think the Ironmans backweighting would make any difference in that respect. Because its not the lifting of the front wheel thats difficultits the angular momentum that youve got to account for to keep it from tipping over that forces you to slow down, and that would be applicable no matter which stroller youre using. The curvature at which I find myself slowing down is approx anything tighter than what a 1 4 mile track has. Wider than that (or anything under say 20 degrees no matter how sharp), its pretty easy to guide and turn with one hand and a couple flicks of the wrist. I cant imagine the Ironman being any easier.The 0.921414981 advantage this has over the Ironman though, is that since we live in a condo on the 2nd floor, this is far easier to get out the door, through the elevator and hallways, etc, with the front wheel unlocked, and then lock the front wheel when we get outside. And even though weve got a walking stroller, sometimes thats in the trunk and the Bob is sitting out, and so its convenient to be able to take the Bob on walks (walks are easier with the wheel unlocked) in those situations. Also I like the tires better, as these are probably harder to puncture than the tires on the Ironman and have nicer tread. One possible disadvantage is that it seems like the front wheel needs recalibrated after every hour or so of run time. Not that it gets difficult to run with at that point at allitll be slightly noticeable on straightaways, but still well within the limits of what you can compensate for with just occasional flick of the wrist, and only takes 10 sec to bend over and recalibrate it. However I have no idea whether the Ironman would be any better in this regard with its permanently locked wheel or not, so this minor note may not even be relevant. And of course the Ironmans tires probably have a bit less drag, but Ill attest that the Revolutions tires drag is nexttonothing so if thats your concern, it shouldnt be.With respect to other strollers, I was also considering some that allow the wheel to be locked unlocked from a switch on the handlebar, thinking this might be a way to get through the curves on the trail without having to slow down. 
Now having the Bob, Id have to say I dont think that feature would make any difference, as to go from locked to unlocked and v.v., the front wheel has to do a 180 anyway, so youd have to slow down for that to happen. In fact to go from unlocked to locked, youd have to pull the stroller backwards to get the front wheel to lock facing forward. So, that realization combined with some generally negative reviews of strollers with that feature, Im glad I went with the Bob.

0.814486302 The front wheel locks on its own every time I turn and I cant fix it no matter what i do.

My granddaughter is now seventeen months old and I have been using this stroller for the past seven months. I bought this for jogging and hiking on dirt trails in our local parks, and for running in Central Park in Manhattan where my granddaughter lives. It does a superb job. However this is not a stroller that I would buy for everyday use, especially if you live in the suburbs rather than a big city, where people walk most places and dont need to constantly take a stroller in and out of a car trunk. Let me describe the stroller, which I bought at my local REI store since it came fully assembled and because they have a lifetime return policy in case I run into any problems.1. The construction is first rate, the fabric used is high quality, and it takes literally seconds to open and close, which is very simple. Just pull a red handle to lift the stroller into the open position. To close, just push two levers on the top forward and the stroller collapses (I disagree with the leading negative review that this is at all difficult). There is a wrist strap on the handle to be used when jogging that can be buckled to keep the stroller closed when it is folded. I NEVER use the wrist strap while jogging. I know that its purpose is to prevent the stroller from getting away from you if you lose your grip, but I think it is dangerous to use. If I tripped while jogging and had 0.783527048 the wrist strap on, not only could I break my wrist from the force of the stroller with a child in it, but the odds are that I would flip the stroller too. If I felt myself going down I would rather just hold onto the handlebar and try to slow the stroller down. Just my opinion.2. The stroller has two modeswalk (the front wheel swivels) and jog (the front wheel is locked into position and stays straight). There is a simple red knob on the front wheel that allows you to easily switch between the two positions, and it literally takes just two seconds to switch.Note If you are ONLY going to jog, and dont mind having the front wheel permanently locked, then you can buy a less expensive BOB stroller known as the Sport Utility model, on which the front wheel does not swivel at all. Yes you can turn the stroller with a locked front wheelbut you have to lift the front of the stroller to do so. I did not want to be so limited, especially hiking on trails, which is why I bought the Revoution SE instead. However, after months of use I have found that lifting the front wheel when it is locked to change direction is not a big deal unless you are hiking in the woods on an uneven trail.This winter I went jogging with my then 13 month old granddaughter in Central Park in Manhattan and really appreciated how this stroller performed on lots of different surfaces and terrain smooth paved roads, uneven asphalt surfaces, sidewalks with bumps, street curbs, and some moderately steep uphill and downhill paths. (+300 words) B.2. Top stroller reviews topic 5 113


0.757429179 LOVE our Bob stroller. Dont know how we survived without it! Rarely lock the front wheel, and I use this for walking running. Its smooth, turns easily, and is a musthave for anyone who likes to exercise with their kiddo in tow.

If you are actually going to run with a stroller, this is the one you want. I had previously purchased a Baby Trend stroller and 0.751979336 wish I would have paid more the first time for the Bob. The stroller runs so smoothly, especially with the front wheel locked in place. The other stroller wobbled when you ran and made it a lot more challenging to run.

My grandchild is one year old and I bought this stroller for jogging and hiking on dirt trails in our local parks. It does a superb job for each. However this is not a stroller that I would buy for everyday use, especially if you live in the suburbs rather than a big city, where people walk most places and dont need to constantly take a stroller in and out of a car trunk. Let me describe the stroller, which I bought at my local REI store since it came fully assembled and because they have a lifetime return policy in case I run into any problems.1. The construction is first rate, the fabric used is high quality, and it takes literally seconds to open and close, which is very simple. Just pull a red handle to lift the stroller into the open position. To close, just push two levers on the top forward and the stroller collapses (I disagree with the leading negative review that this is at all difficult). There is a wrist strap on the handle to be used when jogging that can be buckled to keep the stroller closed when it is folded. I NEVER use the wrist strap while jogging. I know that its purpose is to prevent the stroller from getting away from you if you lose your grip, but I think it is dangerous to use. If I tripped while jogging and had the wrist strap on, not only could I break my wrist from the force of the stroller with a child in it, but the odds are that I would flip the stroller too. If I felt myself going down I would rather just hold onto the handlebar and try to slow the stroller down. Just my opinion.2. The stroller has two modeswalk (the front wheel swivels) and jog (the front wheel is locked into position and stays straight). There is a simple red knob on the front wheel that allows you to easily switch between the two positions, and it literally takes just two seconds to switch.Note If you are ONLY going to jog, and dont mind having the front wheel permanently locked, then you can buy a less expensive BOB stroller known as the Sport Utility model, on which the front wheel does not swivel at all. Yes you can turn the stroller with a locked front wheelbut you have to lift the front of the stroller to do so. I did not want to be so limited, especially hiking on trails, which is why I bought the Revoution SE instead.3. This rides very smooth for jogging, and handles off road surface well when walking. The reason is that this is a very heavy stroller (24 poundsI weighed it on my luggage scale, which makes it heavier than any of the other strollers that my grandchild has, which I discuss below) and has very large wheelsagain larger than on her other strollers. Unlike other strollers, these wheels are inflatable just like bicycle tires. They need to be kept at 30psi for best performance. Of course jogging with a seriously under inflated wheel could be dangerous ordinary walking would just be more difficult. You dont have to check tire pressure all the time, but ask yourself if you want to bother having to check it at all if you are considering this as an everyday stroller. You might not want to have to deal with an unexpected flat tire just when you need to use the stroller.This takes up a lot of space in a trunk, and is heavy to put in and take out. Yes, each of the wheels has a quick release lever (just like bicycle wheels), so you can take them all off to save trunk space. This might make sense on a long trip, but I can tell you from experience that this is not something you would want to do on a regular basis, especially with a cranky young child or in inclement weather. 
Plus, using quick release wheels takes some getting used to. As the directions point out, if the quick release lever does not leave a visible imprint in the palm of your hand after you put the wheel back on, then you have not done it right.
4. I do agree with the leading negative review that there is no soft padding on the seat, though I disagree that the crotch strap is too short (it is adjustable) or that buckling your child in is any more difficult than on any other stroller. When the canopy is fully extended, there is a window on top that lets you see your child. There is also ample storage underneath. You can adjust the seat to a reclining position using two straps, though for jogging you need to keep it fully upright (the further back it is, the less stability you have). However I would not use this as an everyday stroller. My grandchild (who lives in Manhattan) started out with the Bugaboo Cameleon stroller for local neighborhood walking (which I have reviewed on Amazon), and then at about nine months also started using the Maclaren Quest Sport stroller (which I have reviewed on Amazon) for traveling in cabs and subways, as well as day trips out of the city (like visiting me and my wife) since it is more lightweight, easier to fold and close, and easier to carry with a carrying strap. And at my house she sometimes used the Graco Infant Car Seat stroller frame (which I have reviewed on Amazon). I mention these different strollers because all of these provide more comfortable seating, and are lighter and more compact (except maybe the heavier and bulkier Bugaboo) than the Revolution SE, which for me is a special purpose stroller for jogging and off-road use. Yes, it can be used as an everyday stroller, but its strength lies not in lots of comfortable padding or a lightweight compact size when folded, but rather in great stability while jogging or walking off road. For walking only, the recommended age range is 8 weeks to 8 months; for jogging and off-road use it is 8 months to 5 years. The stroller can accommodate a child up to 70 pounds.
5. This stroller comes with a very clear and well illustrated manual that explains everything. Among the advanced features is a simple form of wheel alignment in case the stroller does not roll in a straight line (which could occur after off-road use, the same as when a car goes over lots of bumps), and a shock absorber setting.
Bottom line: this is a special use stroller that works great for jogging and off-road use. For everyday use I would get something else, whether you live in the city or the suburbs.
Update February 23, 2012: this past weekend I went jogging with my 13 month old granddaughter in Manhattan and really appreciated how this stroller performed on lots of different surfaces and terrain: smooth paved roads, uneven asphalt surfaces, sidewalks with bumps, street curbs, and some moderately steep uphill and downhill paths. It was a breeze using this stroller and more importantly my granddaughter enjoyed every minute of it. Since it is critical to keep the front wheel in a locked position while running, anytime that I needed to make a turn (like at a street corner after we left Central Park), I easily just pulled back on the handlebars to lift the front wheel up and move it into the new position. Very easy to do and no big deal.
Update June 10, 2012: the instruction manual contains the following warning in bold letters: "Never jog with the stroller in walk mode. Doing so could result in loss of control and serious injury."
Nevertheless my daughter and son-in-law went running with my granddaughter in Central Park in Manhattan with the stroller in walk mode. My daughter said it worked fine, and made the stroller much easier to maneuver going back and forth to the Park and running inside the Park. I am not recommending this, but am simply pointing out someone else's experience... (+300 words)

B.3. Data sets

Table B.1: ‘Cats and dogs’ data set, a simple, small data set with distinctive topic clusters by construction.

Documents                      Preprocessed documents
cats are animals               cats animals
dogs are canids                dogs canids
cats are fluffy                cats fluffy
dogs bark                      dogs bark
cats meow                      cats meow
fluffy are cats                fluffy cats
animals are large              animals large
dogs bite                      dogs bite
cats scratch                   cats scratch
dogs bite                      dogs bite
cats scratch                   cats scratch
dogs bark                      dogs bark
cats are fluffy                cats fluffy
animals are cool               animals cool
not all animals are fluffy     animals fluffy
dogs are tough                 dogs tough
canids are special             canids special
bark dogs                      bark dogs
cool cats                      cool cats
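To make the mapping from the raw documents to the preprocessed documents in Table B.1 explicit, a minimal preprocessing sketch is given below. It assumes that preprocessing here amounts to lowercasing, tokenizing on whitespace and removing stop words; the stop word set in the snippet is only an illustrative subset of the full list in appendix B.5, and the function name preprocess is chosen for illustration.

# Minimal preprocessing sketch (assumption: lowercase, tokenize on
# whitespace, drop stop words); STOP_WORDS is a small illustrative
# subset of the list in appendix B.5.
STOP_WORDS = {"are", "not", "all"}

def preprocess(document):
    """Return the tokens of `document` with stop words removed."""
    tokens = document.lower().split()
    return [token for token in tokens if token not in STOP_WORDS]

if __name__ == "__main__":
    for doc in ["cats are animals", "not all animals are fluffy"]:
        print(doc, "->", preprocess(doc))
    # cats are animals -> ['cats', 'animals']
    # not all animals are fluffy -> ['animals', 'fluffy']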

Table B.2: Simulated data for the test of Adam optimization applied to determining the posterior mode estimates in the LDA with syntax and sentiment model. The data is simulated with hyperparameters α = (0.5, 0.5), γ = (0.5, 0.5, 0.5), and β_o for o = 1, 2, 3 as explained in chapter 6. There are 20 documents in total, of which 11 are shown here. A period is used to indicate the end of a phrase.

Simulated data people give give give give. donot donot donot donot donot. give walk walk give give walk. give people give give people fluffy. give give give give give. fluffy give give give give. give give people people give give give. give walk people give give. donot donot donot nice donot. donot people nice nice donot donot. nice donot donot donot donot donot donot. donot happiness donot donot donot donot most noteworthy lifelong lifelong lifelong lifelong. afraid afraid smell sad smell afraid smell afraid. donot cat cat pet cat. cat pet cat cat pet cat. sad people afraid smell sad sad smell. most canid lifelong noteworthy noteworthy hate. lifelong noteworthy canid happiness most most. canid pet lifelong lifelong lifelong pet canid. noteworthy pet lifelong most pet most. cat pet smell pet cat afraid smell smell sad sad. afraid people people sad smell people. afraid sad afraid afraid sad. afraid smell afraid smell afraid sad afraid. afraid smell afraid afraid afraid sad sad. sad sad people sad afraid. people afraid smell sad afraid. afraid afraid afraid smell afraid. smell afraid afraid afraid smell. afraid smell afraid afraid afraid afraid afraid. people afraid people sad afraid sad. smell smell afraid smell sad smell lifelong most happiness canid noteworthy stubborn. pet canid lifelong hate lifelong most. noteworthy noteworthy lifelong lifelong most canid. canid most canid lifelong happiness lifelong hate. lifelong lifelong most pet most. lifelong canid lifelong lifelong hate. lifelong most lifelong happiness noteworthy noteworthy. canid pet canid pet lifelong canid. most noteworthy happiness lifelong cat stubborn pet. cat cat pet afraid make. smell cat pet pet cat. canid canid most hate most canid. hate most noteworthy most happiness nice happiness people nice smell. give people walk walk walk. give give give give give give. give people give walk walk. smell pet pet give smell. people give give give give. give walk walk walk walk. people give walk give give. walk give give walk walk walk. cat cat people cat canid cat hate. give people walk give people. people walk walk people walk people. smell afraid smell afraid people smell afraid. cat wet cat cat cat cat pet wet cat cat smell canid. canid cat cat cat afraid pet wet. canid hate lifelong noteworthy most lifelong hate. smell afraid afraid smell smell afraid smell. sad sad people afraid smell. canid most lifelong noteworthy lifelong pet. sad sad sad sad give. walk give give give people. lifelong hate noteworthy hate lifelong canid. cat cat like pet pet. happiness stubborn canid lifelong most. canid cat afraid canid cat cat cat. most most canid pet canid lifelong most noteworthy lifelong noteworthy lifelong most lifelong most canid. smell afraid smell people afraid. afraid smell smell smell afraid smell afraid. smell smell smell sad people. sad sad smell sad smell smell. afraid sad sad smell afraid smell. afraid smell sad afraid smell sad. people smell afraid smell people smell. make give give pet nice. smell afraid afraid afraid smell smell cat canid cat allergic cat smell afraid. donot sad afraid afraid smell. pet cat cat cat donot cat. cat donot wet canid donot. smell hate cat cat pet cat afraid. cat canid cat wet afraid wet cat pet. cat donot pet pet pet cat pet. cat cat cat make cat. canid cat donot cat cat cat. cat cat cat cat cat pet pet most noteworthy happiness pet canid pet lifelong. hate hate noteworthy lifelong hate. canid most canid lifelong most canid lifelong. noteworthy lifelong canid most pet. 
lifelong noteworthy pet lifelong happiness noteworthy hate noteworthy. lifelong lifelong canid most noteworthy cat. lifelong lifelong happiness lifelong lifelong. lifelong most most stubborn canid canid most. noteworthy canid hate dog lifelong canid. lifelong noteworthy most lifelong hate hate. canid lifelong pet canid most cat lifelong canid. noteworthy pet happiness lifelong canid stubborn. most canid noteworthy noteworthy most canid hate hate cat cat cat cat cat. give walk give walk give walk walk. cat like cat wet cat pet cat. lifelong lifelong canid lifelong canid canid canid pet. cat canid cat cat cat cat cat. allergic wet allergic like wet. hate canid most canid lifelong. cat cat pet cat cat cat cat pet. afraid cat pet cat cat cat afraid. walk give give give give. hate most lifelong lifelong lifelong lifelong donot nice donot people donot happiness happiness nice happiness. nice donot nice donot donot nice. people nice donot donot donot. happiness nice donot donot donot people donot. happiness nice happiness happiness donot people nice. happiness nice nice happiness donot. donot donot donot happiness people. nice happiness donot nice nice. nice nice nice happiness donot happiness people. nice nice donot donot nice nice. happiness nice nice happiness nice donot. people donot donot nice donot happiness
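As a companion to the caption of Table B.2, the snippet below sketches one way data with this structure could be simulated under the generative story of LDA with syntax and sentiment: per document a topic distribution (drawn from Dirichlet(α)) and a sentiment distribution (drawn from Dirichlet(γ)), per phrase one topic and one sentiment, and per word a draw from the corresponding topic-sentiment word distribution. The vocabulary, the number of phrases per document, the way β is drawn, and the helper name simulate_document are illustrative assumptions, not the exact code used for the thesis experiments.

import numpy as np

rng = np.random.default_rng(0)

# Illustrative assumptions: 2 topics, 3 sentiments, a tiny vocabulary.
alpha = np.array([0.5, 0.5])          # prior on topic proportions
gamma = np.array([0.5, 0.5, 0.5])     # prior on sentiment proportions
vocab = ["give", "walk", "people", "cat", "pet", "canid",
         "lifelong", "smell", "afraid", "sad", "nice", "donot"]
# One word distribution per (topic, sentiment) pair; a symmetric
# Dirichlet is used here as a stand-in for the beta_o of chapter 6.
beta = rng.dirichlet(np.full(len(vocab), 0.5),
                     size=(len(alpha), len(gamma)))

def simulate_document(n_phrases=6, words_per_phrase=5):
    """Simulate one document; a period marks the end of a phrase."""
    theta = rng.dirichlet(alpha)      # document-level topic distribution
    pi = rng.dirichlet(gamma)         # document-level sentiment distribution
    phrases = []
    for _ in range(n_phrases):
        z = rng.choice(len(alpha), p=theta)   # topic of this phrase
        s = rng.choice(len(gamma), p=pi)      # sentiment of this phrase
        words = rng.choice(vocab, size=words_per_phrase, p=beta[z, s])
        phrases.append(" ".join(words))
    return ". ".join(phrases) + "."

corpus = [simulate_document() for _ in range(20)]
print(corpus[0])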

Table B.3: Simulated data for the test of Adam optimization applied to determining the posterior mode estimates in the LDA with syntax and sentiment model. The data is simulated with hyperparameters α = (0.5, 0.5), γ = (0.5, 0.5, 0.5), and β_o for o = 1, 2, 3 as explained in chapter 6. There are 20 documents in total, of which 9 are shown here. A period is used to indicate the end of a phrase.

Simulated data pet canid sad canid sad give smell. pet wet cat cat people pet. aggressive regret pet sad give pet. sad sad sad smell smell afraid. smell smell smell sad sad afraid people. smell smell smell smell sad sad. pet pet pet give pet nice. afraid sad afraid smell afraid. sad sad afraid afraid smell sad sad. walk people walk give give give give. give walk give walk give. afraid sad sad people smell afraid. donot donot nice people donot happiness donot give give give give give walk. people canid make cat canid cat. people give give people give give. cat cat cat cat pet cat cat. pet afraid canid cat pet cat cat canid. give give give walk people people walk. make pet cat pet pet cat. pet cat cat cat cat canid. walk people people give walk give give. make pet canid pet cat. pet cat cat cat pet cat cat make. cat cat pet cat cat cat. cat hate canid cat pet cat pet cat. afraid cat cat smell pet pet pet. people give give give walk give most pet hate noteworthy hate pet most. afraid smell afraid smell afraid smell. lifelong hate lifelong hate hate most. smell sad smell afraid afraid. lifelong stubborn noteworthy lifelong canid. smell smell smell sad sad sad donot. sad sad sad sad smell afraid. cat walk cat pet cat cat. cat afraid cat cat wet. most hate cat lifelong canid. canid hate noteworthy lifelong lifelong. canid cat canid cat cat give give give give walk people people. walk cat people give give. give give give give give. give walk people people give people. people walk people give walk. fluffy give give give people. give people people walk people. walk give walk walk walk walk. people people people give give give. give fluffy people people give. people people give give walk. fluffy give walk give fluffy fluffy. give give walk give walk give donot donot happiness donot happiness happiness nice nice donot. canid hate hate pet lifelong. lifelong most lifelong lifelong stubborn. lifelong canid lifelong lifelong dog pet pet. canid pet cat pet smell cat cat. cat noteworthy lifelong lifelong lifelong lifelong noteworthy stubborn. pet cat canid noteworthy lifelong noteworthy. lifelong most pet canid lifelong most. noteworthy hate cat pet canid wet most. noteworthy most canid most noteworthy most. lifelong lifelong canid lifelong canid noteworthy pet canid. noteworthy most canid canid lifelong hate happiness pet sad nice pet like. noteworthy lifelong lifelong lifelong most canid most happiness. noteworthy hate lifelong hate canid most canid most. canid hate most canid noteworthy. donot afraid people afraid smell. pet sad pet smell give. most canid canid most most pet. smell afraid people smell sad sad sad afraid. sad smell smell afraid smell sad. hate canid most most hate most hate lifelong canid canid lifelong hate. canid canid lifelong lifelong lifelong lifelong. lifelong most hate noteworthy noteworthy lifelong. noteworthy canid hate canid hate noteworthy. hate most pet noteworthy canid noteworthy. canid happiness happiness lifelong most lifelong canid. canid canid most noteworthy lifelong canid noteworthy most lifelong most happiness. lifelong noteworthy most canid canid. canid pet lifelong happiness lifelong. lifelong lifelong lifelong canid canid lifelong smell smell smell afraid smell. sad smell pet make pet. give sad sad smell smell pet. make sad sad give pet sad pet. pet cat cat cat cat. give people give give give walk walk. give walk walk walk people. people give people give give. fluffy fluffy give walk give walk. sad make pet sad smell sad. 
give pet give nice sad like sad sad smell smell people. sad afraid sad afraid afraid smell afraid. afraid afraid afraid afraid afraid smell. smell sad afraid people smell smell. smell smell smell people sad. afraid smell sad smell afraid afraid. afraid sad sad people smell smell. smell smell afraid smell afraid people people. donot donot nice nice donot donot nice donot nice. sad smell afraid sad smell smell afraid. afraid afraid afraid smell smell smell afraid
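Since Tables B.2 and B.3 are used to test Adam optimization for the posterior mode, a generic Adam ascent loop is sketched below. The gradient function passed in is a placeholder for the gradient of the (reparametrised) log posterior of the LDA with syntax and sentiment model from chapter 6; only the Adam update rule itself (exponentially weighted first and second moment estimates with bias correction) is spelled out, and the settings lr = 0.01, beta1 = 0.9, beta2 = 0.999, eps = 1e-8 are the usual defaults, assumed here for illustration.

import numpy as np

def adam_maximize(grad_fn, x0, lr=0.01, beta1=0.9, beta2=0.999,
                  eps=1e-8, n_steps=5000):
    """Generic Adam ascent on a differentiable objective.

    grad_fn(x) must return the gradient of the objective (e.g. a
    reparametrised log posterior) at x; x0 is the starting point.
    """
    x = np.asarray(x0, dtype=float).copy()
    m = np.zeros_like(x)              # first-moment estimate
    v = np.zeros_like(x)              # second-moment estimate
    for t in range(1, n_steps + 1):
        g = grad_fn(x)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g ** 2
        m_hat = m / (1 - beta1 ** t)  # bias-corrected moments
        v_hat = v / (1 - beta2 ** t)
        x += lr * m_hat / (np.sqrt(v_hat) + eps)   # ascent step
    return x

# Toy usage: the mode of a standard-normal log density centred at 3
# is 3, and the gradient of that log density is -(x - 3).
mode = adam_maximize(lambda x: -(x - 3.0), x0=np.array([0.0]))
print(mode)   # close to [3.]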

B.4. Conjunction word list
but, so, or, and, after, before, although, even though, because, as, if, as long as, provided that, till, until, unless, when, once, as soon as, while, whereas, in spite of, despite, in addition, furthermore, however, on the other hand, therefore, consequently, firstly, secondly, thirdly, finally, accordingly, also, anyway, besides, for example, for instance, further, hence, incidentally, indeed, in fact, instead, likewise, meanwhile, moreover, namely, of course, on the contrary, otherwise, nevertheless, nonetheless, similarly, so far, until now, then, thus
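One plausible use of this list, in line with the phrase-based structure of the LDA with syntax and sentiment model, is to split review sentences into phrases at conjunction words; the snippet below is a minimal sketch of that idea. Treating every single-word conjunction as a phrase boundary (and dropping it), the illustrative subset CONJUNCTIONS, and the function name split_into_phrases are assumptions for illustration rather than the exact procedure of chapter 6.

import re

# Illustrative subset of the conjunction list in appendix B.4.
CONJUNCTIONS = {"but", "and", "because", "although", "however", "so"}

def split_into_phrases(sentence):
    """Split a sentence into phrases; conjunctions from the list mark
    phrase boundaries and are not kept in the output."""
    tokens = re.findall(r"[a-z']+", sentence.lower())
    phrases, current = [], []
    for token in tokens:
        if token in CONJUNCTIONS and current:
            phrases.append(current)
            current = []
        else:
            current.append(token)
    if current:
        phrases.append(current)
    return phrases

print(split_into_phrases("The stroller runs smoothly but the seat has no padding"))
# [['the', 'stroller', 'runs', 'smoothly'], ['the', 'seat', 'has', 'no', 'padding']]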

B.5. Stop word list a be despite g immediate meanwhile a’s became did get in merely able because didn’t gets inasmuch might about become different getting inc more above becomes do given indeed moreover according becoming does gives indicate mostly accordingly been doesn’t go indicated much across before doing goes indicates must actually beforehand don’t going inner my after behind done gone insofar myself afterwards being down got instead n again believe downwards gotten into name against below during greetings inward namely ain’t beside e h is nd all besides each had isn’t near allow best edu hadn’t it nearly allows better eg happens it’d necessary almost between eight hardly it’ll need alone beyond either has it’s needs along both else hasn’t its neither already brief elsewhere have itself never also but enough haven’t j nevertheless although by entirely having just new always c especially he k next am c’mon et he’s keep nine among c’s etc hello keeps no amongst came even help kept nobody an can ever hence know non and can’t every her knows none another cannot everybody here known noone any cant everyone here’s l nor anybody certain everything hereafter last normally anyhow certainly everywhere hereby lately not anyone changes ex herein later nothing anything clearly exactly hereupon latter novel anyway co example hers latterly now anyways com except herself least nowhere anywhere come f hi less o apart comes far him lest obviously appear concerning few himself let of appropriate consequently fifth his let’s off are consider first hither likely often aren’t considering five hopefully little oh around contain followed how look ok as containing following howbeit looking okay aside contains follows however looks old ask corresponding for i ltd on asking could former i’d m once associated couldn’t formerly i’ll mainly one at course forth i’m many ones available currently four i’ve may only away d from ie maybe onto awfully definitely further if me or b described furthermore ignored mean other B.5. 
Stop word list 119 others saw sub thorough very whom otherwise say such thoroughly via whose ought saying sup those viz why our says sure though vs will ours second t three w willing ourselves secondly t’s through want wish out see take throughout wants with outside seeing taken thru was within over seem tell thus wasn’t without overall seemed tends to way won’t own seeming th together we wonder p seems than too we’d would particular seen thank took we’ll would particularly self thanks toward we’re wouldn’t per selves thanx towards we’ve x perhaps sensible that tried welcome y placed sent that’s tries well yes please serious thats truly went yet plus seriously the try were you possible seven their trying weren’t you’d presumably several theirs twice what you’ll probably shall them two what’s you’re provides she themselves u whatever you’ve q should then un when your que shouldn’t thence under whence yours quite since there unfortunately whenever yourself qv six there’s unless where yourselves r so thereafter unlikely where’s z rather some thereby until whereafter zero rd somebody therefore unto whereas you’re re somehow therein up whereby you’ve really someone theres upon wherein your reasonably something thereupon us whereupon yours regarding sometime these use wherever yourself regardless sometimes they used whether yourselves regards somewhat they’d uses which z relatively somewhere they’ll using while zero respectively soon they’re usually whither right specified they’ve uucp who s specify think v who’s said specifying third value whoever same still this various whole 120 B. Results and data sets

B.6. Sentiment word lists
Positive words:

abidance admit almighty arousing bargain brave abide admittedly altruist arresting basic bravery abilities adorable altruistic articulate bask breakthrough ability adore altruistically ascendant beacon breakthroughs able adored amaze ascertainable beatify breathlessness abound adorer amazed aspiration beauteous breathtaking above adoring amazement aspirations beautiful breathtakingly above-average adoringly amazing aspire beautifully bright absolve adroit amazingly assent beautify brighten abundance adroitly ambitious assertions beauty brightness abundant adulate ambitiously assertive befit brilliance accede adulation ameliorate asset befitting brilliant accept adulatory amenable assiduous befriend brilliantly acceptable advanced amenity assiduously believable brisk acceptance advantage amiability assuage beloved broad accessible advantageous amiabily assurance benefactor brook acclaim advantages amiable assurances beneficent brotherly acclaimed adventure amicability assure beneficial bull acclamation adventuresome amicable assuredly beneficially bullish accolade adventurism amicably astonish beneficiary buoyant accolades adventurous amity astonished benefit calm accommodative advice amnesty astonishing benefits calming accomplish advisable amour astonishingly benevolence calmness accomplishment advocacy ample astonishment benevolent candid accomplishments advocate amply astound benign candor accord affability amuse astounded best capability accordance affable amusement astounding best-known capable accordantly affably amusing astoundingly best-performing capably accurate affection amusingly astute best-selling capitalize accurately affectionate angel astutely better captivate achievable affinity angelic asylum better-known captivating achieve affirm animated attain better-than- captivation achievement affirmation apostle attainable expected care achievements affirmative apotheosis attentive blameless carefree acknowledge affluence appeal attest bless careful acknowledgement affluent appealing attraction blessing catalyst acquit afford appease attractive bliss catchy active affordable applaud attractively blissful celebrate acumen afloat appreciable attune blissfully celebrated adaptability agile appreciate auspicious blithe celebration adaptable agilely appreciation authentic bloom celebratory adaptive agility appreciative authoritative blossom celebrity adept agree appreciatively autonomous boast champ adeptly agreeability appreciativeness aver bold champion adequate agreeable appropriate avid boldly charismatic adherence agreeableness approval avidly boldness charitable adherent agreeably approve award bolster charity adhesion agreement apt awe bonny charm admirable allay aptitude awed bonus charming admirably alleviate aptly awesome boom charmingly admiration allow ardent awesomely booming chaste admire allowable ardently awesomeness boost cheer admirer allure ardor awestruck boundless cheerful admiring alluring aristocratic back bountiful cheery admiringly alluringly arousal backbone brains cherish admission ally arouse balanced brainy cherished B.6. 
Sentiment word lists 121 cherub compliment covenant desire easy encouragingly chic complimentary cozy desirous easygoing endear chivalrous comprehensive crave destine ebullience endearing chivalry compromise craving destined ebullient endorse chum compromises creative destinies ebulliently endorsement civil comrades credence destiny eclectic endorser civility conceivable credible determination economical endurable civilization conciliate crisp devote ecstasies endure civilize conciliatory crusade devoted ecstasy enduring clarity conclusive crusader devotee ecstatic energetic classic concrete cure-all devotion ecstatically energize clean concur curious devout edify engaging cleanliness condone curiously dexterity educable engrossing cleanse conducive cute dexterous educated enhance clear confer dance dexterously educational enhanced clear-cut confidence dare dextrous effective enhancement clearer confident daring dig effectiveness enjoy clearly confute daringly dignified effectual enjoyable clever congenial darling dignify efficacious enjoyably closeness congratulate dashing dignity efficiency enjoyment clout congratulations dauntless diligence efficient enlighten co-operation congratulatory dawn diligent effortless enlightenment coax conquer daydream diligently effortlessly enliven coddle conscience daydreamer diplomatic effusion ennoble cogent conscientious dazzle discerning effusive enrapt cohere consensus dazzled discreet effusively enrapture coherence consent dazzling discretion effusiveness enraptured coherent considerate deal discriminating egalitarian enrich cohesion consistent dear discriminatingly elan enrichment cohesive console decency distinct elate ensure colorful constancy decent distinction elated enterprising colossal constructive decisive distinctive elatedly entertain comeback consummate decisiveness distinguish elation entertaining comely content dedicated distinguished electrification enthral comfort contentment defend diversified electrify enthrall comfortable continuity defender divine elegance enthralled comfortably contribution defense divinely elegant enthuse comforting convenient deference dodge elegantly enthusiasm commend conveniently definite dote elevate enthusiast commendable conviction definitely dotingly elevated enthusiastic commendably convince definitive doubtless eligible enthusiastically commensurate convincing definitively dream elite entice commitment convincingly deflationary dreamland eloquence enticing commodious cooperate deft dreams eloquent enticingly commonsense cooperation delectable dreamy eloquently entrance commonsensible cooperative delicacy drive emancipate entranced commonsensibly cooperatively delicate driven embellish entrancing commonsensical cordial delicious durability embolden entreat compact cornerstone delight durable embrace entreatingly compassion correct delighted dynamic eminence entrust compassionate correctly delightful eager eminent enviable compatible cost-effective delightfully eagerly empower enviably compelling cost-saving delightfulness eagerness empowerment envision compensate courage democratic earnest enable envisions competence courageous demystify earnestly enchant epic competency courageously dependable earnestness enchanted epitome competent courageousness deserve ease enchanting equality competitive court deserved easier enchantingly equitable competitiveness courteous deservedly easiest encourage erudite complement courtesy deserving easily encouragement especially compliant courtly desirable easiness 
encouraging essential 122 B. Results and data sets established extraordinarily fiery galore great hilariously esteem extraordinary fine gem greatest hilariousness eternity exuberance finely gems greatness hilarity ethical exuberant first-class generosity greet historic eulogize exuberantly first-rate generous grin holy euphoria exult fit generously grit homage euphoric exultation fitting genial groove honest euphorically exultingly flair genius groundbreaking honestly even fabulous flame gentle guarantee honesty evenly fabulously flatter genuine guardian honeymoon eventful facilitate flattering germane guidance honor everlasting fair flatteringly giddy guiltless honorable evident fairly flawless gifted gumption hope evidently fairness flawlessly glad gush hopeful evocative faith flexible gladden gusto hopefully exalt faithful flourish gladly gutsy hopefulness exaltation faithfully flourishing gladness hail hopes exalted faithfulness fluent glamorous halcyon hospitable exaltedly fame fond glee hale hot exalting famed fondly gleeful hallowed hug exaltingly famous fondness gleefully handily humane exceed famously foolproof glimmer handsome humanists exceeding fancy foremost glimmering handy humanity exceedingly fanfare foresight glisten hanker humankind excel fantastic forgave glistening happily humble excellence fantastically forgive glitter happiness humility excellency fantasy forgiven glorify happy humor excellent farsighted forgiveness glorious hard-working humorous excellently fascinate forgiving gloriously hardier humorously exceptional fascinating forgivingly glory hardy humour exceptionally fascinatingly fortitude glossy harmless humourous excite fascination fortuitous glow harmonious ideal excited fashionable fortuitously glowing harmoniously idealism excitedly fashionably fortunate glowingly harmonize idealist excitedness fast-growing fortunately go-ahead harmony idealize excitement fast-paced fortune god-given haven ideally exciting fastest-growing fragrant godlike headway idol excitingly fathom frank gold heady idolize exclusive favor free golden heal idolized excusable favorable freedom good healthful idyllic excuse favored freedoms goodly healthy illuminate exemplar favorite fresh goodness heart illuminati exemplary favour friend goodwill hearten illuminating exhaustive fearless friendliness gorgeous heartening illumine exhaustively fearlessly friendly gorgeously heartfelt illustrious exhilarate feasible friends grace heartily imaginative exhilarating feasibly friendship graceful heartwarming immaculate exhilaratingly feat frolic gracefully heaven immaculately exhilaration featly fruitful gracious heavenly impartial exonerate feisty fulfillment graciously help impartiality expansive felicitate full-fledged graciousness helpful impartially experienced felicitous fun grail herald impassioned expert felicity functional grand hero impeccable expertly fertile funny grandeur heroic impeccably explicit fervent gaiety grateful heroically impel explicitly fervently gaily gratefully heroine imperial expressive fervid gain gratification heroize imperturbable exquisite fervidly gainful gratify heros impervious exquisitely fervor gainfully gratifying high-quality impetus extol festive gallant gratifyingly highlight importance extoll fidelity gallantly gratitude hilarious important B.6. 
Sentiment word lists 123 importantly ingratiatingly jovial lifelong marvelously motivated impregnable innocence joy light marvelousness motivation impress innocent joyful light-hearted marvels moving impression innocently joyfully lighten master myriad impressions innocuous joyless likable masterful natural impressive innovation joyous like masterfully naturally impressively innovative joyously liking masterpiece navigable impressiveness inoffensive jubilant lionhearted masterpieces neat improve inquisitive jubilantly literate masters neatly improved insight jubilate live mastery necessarily improvement insightful jubilation lively matchless necessary improving insightfully judicious lofty mature neutralize improvise insist just logical maturely nice inalienable insistence justice lovable maturity nicely incisive insistent justifiable lovably maximize nifty incisively insistently justifiably love meaningful nimble incisiveness inspiration justification loveliness meek noble inclination inspirational justify lovely mellow nobly inclinations inspire justly lover memorable non-violence inclined inspiring keen low-cost memorialize non-violent inclusive instructive keenly low-risk mend normal incontestable instrumental keenness lower-priced mentor notable incontrovertible intact kemp loyal merciful notably incorruptible integral kid loyalty mercifully noteworthy incredible integrity kind lucid mercy noticeable incredibly intelligence kindliness lucidly merit nourish indebted intelligent kindly luck meritorious nourishing indefatigable intelligible kindness luckier merrily nourishment indelible intercede kingmaker luckiest merriment novel indelibly interest kiss luckily merriness nurture independence interested knowledgeable luckiness merry nurturing independent interesting large lucky mesmerize oasis indescribable interests lark lucrative mesmerizing obedience indescribably intimacy laud luminous mesmerizingly obedient indestructible intimate laudable lush meticulous obediently indispensability intricate laudably luster meticulously obey indispensable intrigue lavish lustrous might objective indisputable intriguing lavishly luxuriant mightily objectively individuality intriguingly law-abiding luxuriate mighty obliged indomitable intuitive lawful luxurious mild obviate indomitably invaluable lawfully luxuriously mindful offbeat indubitable invaluablely leading luxury minister offset indubitably inventive lean lyrical miracle okay indulgence invigorate learned magic miracles onward indulgent invigorating learning magical miraculous open industrious invincibility legendary magnanimous miraculously openly inestimable invincible legitimacy magnanimously miraculousness openness inestimably inviolable legitimate magnetic mirth opportune inexpensive inviolate legitimately magnificence moderate opportunity infallibility invulnerable lenient magnificent moderation optimal infallible irrefutable leniently magnificently modern optimism infallibly irrefutably less-expensive magnify modest optimistic influential irreproachable leverage majestic modesty opulent informative irresistible levity majesty mollify orderly ingenious irresistibly liberal manageable momentous original ingeniously jauntily liberalism manifest monumental originality ingenuity jaunty liberally manly monumentally outdo ingenuous jest liberate mannerly moral outgoing ingenuously joke liberation marvel morality outshine ingratiate jollify liberty marvellous moralize outsmart ingratiating jolly lifeblood marvelous motivate outstanding 124 B. 
Results and data sets outstandingly placid preferably protector rectification respect outstrip plain preference proud rectify respectable outwit plainly preferences providence rectifying respectful ovation plausibility premier prowess redeem respectfully overachiever plausible premium prudence redeeming respite overjoyed playful prepared prudent redemption resplendent overture playfully preponderance prudently reestablish responsibility pacifist pleasant press punctual refine responsible pacifists pleasantly prestige pundits refined responsibly painless please prestigious pure refinement responsive painlessly pleased prettily purification reform restful painstaking pleasing pretty purify refresh restoration painstakingly pleasingly priceless purity refreshing restore palatable pleasurable pride purposeful refuge restraint palatial pleasurably principle quaint regal resurgent palliate pleasure principled qualified regally reunite pamper pledge privilege qualify regard revel paradise pledges privileged quasi-ally rehabilitate revelation paramount plentiful prize quench rehabilitation revere pardon plenty pro quicken reinforce reverence passion plush pro-American radiance reinforcement reverent passionate poetic pro-Beijing radiant rejoice reverently passionately poeticize pro-Cuba rally rejoicing revitalize patience poignant pro-peace rapport rejoicingly revival patient poise proactive rapprochement relax revive patiently poised prodigious rapt relaxed revolution patriot polished prodigiously rapture relent reward patriotic polite prodigy raptureous relevance rewarding peace politeness productive raptureously relevant rewardingly peaceable popular profess rapturous reliability rich peaceful popularity proficient rapturously reliable riches peacefully portable proficiently rational reliably richly peacekeepers posh profit rationality relief richness peerless positive profitable rave relieve right penetrating positively profound re-conquest relish righten penitent positiveness profoundly readily remarkable righteous perceptive posterity profuse ready remarkably righteously perfect potent profusely reaffirm remedy righteousness perfection potential profusion reaffirmation reminiscent rightful perfectly powerful progress real remunerate rightfully permissible powerfully progressive realist renaissance rightly perseverance practicable prolific realistic renewal rightness persevere practical prominence realistically renovate rights persistent pragmatic prominent reason renovation ripe personages praise promise reasonable renown risk-free personality praiseworthy promising reasonably renowned robust perspicuous praising promoter reasoned repair romantic perspicuously pre-eminent prompt reassurance reparation romantically persuade preach promptly reassure repay romanticize persuasive preaching proper receptive repent rosy persuasively precaution properly reclaim repentance rousing pertinent precautions propitious recognition reputable sacred phenomenal precedent propitiously recommend rescue safe phenomenally precious prospect recommendation resilient safeguard picturesque precise prospects recommendations resolute sagacity piety precisely prosper recommended resolve sage pillar precision prosperity recompense resolved sagely pinnacle preeminent prosperous reconcile resound saint pious preemptive protect reconciliation resounding saintliness pithy prefer protection record-setting resourceful saintly placate preferable protective recover resourcefulness salable B.6. 
Sentiment word lists 125 salivate shrewd splendidly suffice tenderly truculently salutary shrewdly splendor sufficient tenderness true salute shrewdness spotless sufficiently terrific truly salvation significance sprightly suggest terrifically trump sanctify significant spur suggestions terrified trumpet sanction signify squarely suit terrify trust sanctity simple stability suitable terrifying trusting sanctuary simplicity stabilize sumptuous terrifyingly trustingly sane simplified stable sumptuously thank trustworthiness sanguine simplify stainless sumptuousness thankful trustworthy sanity sincere stand sunny thankfully truth satisfaction sincerely star super thinkable truthful satisfactorily sincerity stars superb thorough truthfully satisfactory skill stately superbly thoughtful truthfulness satisfy skilled statuesque superior thoughtfully twinkly satisfying skillful staunch superlative thoughtfulness ultimate savor skillfully staunchly support thrift ultimately savvy sleek staunchness supporter thrifty ultra scenic slender steadfast supportive thrill unabashed scruples slim steadfastly supreme thrilling unabashedly scrupulous smart steadfastness supremely thrillingly unanimous scrupulously smarter steadiness supurb thrills unassailable seamless smartest steady supurbly thrive unbiased seasoned smartly stellar sure thriving unbosom secure smile stellarly surely tickle unbound securely smiling stimulate surge tidy unbroken security smilingly stimulating surging time-honored uncommon seductive smitten stimulative surmise timely uncommonly selective smooth stirring surmount tingle unconcerned self- sociable stirringly surpass titillate unconditional determination soft-spoken stood survival titillating unconventional self-respect soften straight survive titillatingly undaunted self-satisfaction solace straightforward survivor toast understand self-sufficiency solicitous streamlined sustainability togetherness understandable self-sufficient solicitously stride sustainable tolerable understanding semblance solicitude strides sustained tolerably understate sensation solid striking sweeping tolerance understated sensational solidarity strikingly sweet tolerant understatedly sensationally soothe striving sweeten tolerantly understood sensations soothingly strong sweetheart tolerate undisputable sense sophisticated studious sweetly toleration undisputably sensible sound studiously sweetness top undisputed sensibly soundness stunned swift torrid undoubted sensitive spacious stunning swiftness torridly undoubtedly sensitively spare stunningly sworn tradition unencumbered sensitivity sparing stupendous tact traditional unequivocal sentiment sparingly stupendously talent tranquil unequivocally sentimentality sparkle sturdy talented tranquility unfazed sentimentally sparkling stylish tantalize treasure unfettered sentiments special stylishly tantalizing treat unforgettable serene spectacular suave tantalizingly tremendous uniform serenity spectacularly sublime taste tremendously uniformly settle speedy subscribe temperance trendy unique sexy spellbind substantial temperate trepidation unity shelter spellbinding substantially tempt tribute universal shield spellbindingly substantive tempting trim unlimited shimmer spellbound subtle temptingly triumph unparalleled shimmering spirit succeed tenacious triumphal unpretentious shimmeringly spirited success tenaciously triumphant unquestionable shine spiritual successful tenacity triumphantly unquestionably shiny splendid successfully tender truculent 
unrestricted 126 B. Results and data sets unscathed valiant victory warmly wide woo unselfish valiantly vigilance warmth wide-open workable untouched valid vigilant wealthy wide-ranging world-famous untrained validity vigorous welcome will worship upbeat valor vigorously welfare willful worth upfront valuable vindicate well willfully worth-while upgrade value vintage well-being willing worthiness upheld values virtue well-connected willingness worthwhile uphold vanquish virtuous well-educated wink worthy uplift vast virtuously well-established winnable wow uplifting vastly visionary well-informed winners wry upliftingly vastness vital well-intentioned wisdom yearn upliftment venerable vitality well-managed wise yearning upright venerably vivacious well-positioned wisely yearningly upscale venerate vivid well-publicized wish yep upside verifiable voluntarily well-received wishes yes upward veritable voluntary well-regarded wishing youthful urge versatile vouch well-run witty zeal usable versatility vouchsafe well-wishers wonder zenith useful viability vow wellbeing wonderful zest usefulness viable vulnerable whimsical wonderfully utilitarian vibrant want white wonderous utmost vibrantly warm wholeheartedly wonderously uttermost victorious warmhearted wholesome wondrous B.6. Sentiment word lists 127

Negative words:

abandon acrimonious allegation ape avenge begging abandoned acrimoniously allegations apocalypse averse beguile abandonment acrimony allege apocalyptic aversion belabor abase adamant allergic apologist avoid belated abasement adamantly aloof apologists avoidance beleaguer abash addict altercation appal awful belie abate addiction although appall awfully belittle abdicate admonish ambiguity appalled awfulness belittled aberration admonisher ambiguous appalling awkward belittling abhor admonishingly ambivalence appallingly awkwardness bellicose abhorred admonishment ambivalent apprehension ax belligerence abhorrence admonition ambush apprehensions babble belligerent abhorrent adrift amiss apprehensive backbite belligerently abhorrently adulterate amputate apprehensively backbiting bemoan abhors adulterated anarchism arbitrary backward bemoaning abject adulteration anarchist arcane backwardness bemused abjectly adversarial anarchistic archaic bad bent abjure adversary anarchy arduous badly berate abnormal adverse anemic arduously baffle bereave abolish adversity anger argue baffled bereavement abominable affectation angrily argument bafflement bereft abominably afflict angriness argumentative baffling berserk abominate affliction angry arguments bait beseech abomination afflictive anguish arrogance balk beset abrade affront animosity arrogant banal besiege abrasive afraid annihilate arrogantly banalize besmirch abrupt against annihilation artificial bane bestial abscond aggravate annoy ashamed banish betray absence aggravating annoyance asinine banishment betrayal absent-minded aggravation annoyed asininely bankrupt betrayals absentee aggression annoying asinininity bar betrayer absurd aggressive annoyingly askance barbarian bewail absurdity aggressiveness anomalous asperse barbaric beware absurdly aggressor anomaly aspersion barbarically bewilder absurdness aggrieve antagonism aspersions barbarity bewildered abuse aggrieved antagonist assail barbarous bewildering abuses aghast antagonistic assassin barbarously bewilderingly abusive agitate antagonize assassinate barely bewilderment abysmal agitated anti- assault barren bewitch abysmally agitation anti-American astray baseless bias abyss agitator anti-Israeli asunder bashful biased accidental agonies anti-Semites atrocious bastard biases accost agonize anti-US atrocities battered bicker accountable agonizing anti-occupation atrocity battering bickering accursed agonizingly anti-proliferation atrophy battle bid-rigging accusation agony anti-social attack battle-lines bitch accusations ail anti-white audacious battlefield bitchy accuse ailment antipathy audaciously battleground biting accuses aimless antiquated audaciousness batty bitingly accusing airs antithetical audacity bearish bitter accusingly alarm anxieties austere beast bitterly acerbate alarmed anxiety authoritarian beastly bitterness acerbic alarming anxious autocrat bedlam bizarre acerbically alarmingly anxiously autocratic bedlamite blab ache alas anxiousness avalanche befoul blabber acrid alienate apathetic avarice beg black acridly alienated apathetically avaricious beggar blackmail acridness alienation apathy avariciously beggarly blah 128 B. 
Results and data sets blame brainwash calumnious choppy confess coward blameworthy brash calumniously chore confession cowardly bland brashly calumny chronic confessions crackdown blandish brashness cancer clamor conflict crafty blaspheme brat cancerous clamorous confound cramped blasphemous bravado cannibal clash confounded cranky blasphemy brazen cannibalize cliche confounding crass blast brazenly capitulate cliched confront craven blasted brazenness capricious clique confrontation cravenly blatant breach capriciously clog confrontational craze blatantly break capriciousness close confuse crazily blather break-point capsize cloud confused craziness bleak breakdown captive clumsy confusing crazy bleakly brimstone careless coarse confusion credulous bleakness bristle carelessness cocky congested crime bleed brittle caricature coerce congestion criminal blemish broke carnage coercion conspicuous cringe blind broken-hearted carp coercive conspicuously cripple blinding brood cartoon cold conspiracies crippling blindingly browbeat cartoonish coldly conspiracy crisis blindness bruise cash-strapped collapse conspirator critic blindside brusque castigate collide conspiratorial critical blister brutal casualty collude conspire criticism blistering brutalising cataclysm collusion consternation criticisms bloated brutalities cataclysmal combative constrain criticize block brutality cataclysmic comedy constraint critics blockhead brutalize cataclysmically comical consume crook blood brutalizing catastrophe commiserate contagious crooked bloodshed brutally catastrophes commonplace contaminate cross bloodthirsty brute catastrophic commotion contamination crowded bloody brutish catastrophically compel contempt crude blow buckle caustic complacent contemptible cruel blunder bug caustically complain contemptuous cruelties blundering bulky cautionary complaining contemptuously cruelty blunders bullies cautious complaint contend crumble blunt bully cave complaints contention crumple blur bullyingly censure complex contentious crush blurt bum chafe complicate contort crushing boast bumpy chaff complicated contortions cry boastful bungle chagrin complication contradict culpable boggle bungler challenge complicit contradiction cumbersome bogus bunk challenging compulsion contradictory cuplrit boil burden chaos compulsive contrariness curse boiling burdensome chaotic compulsory contrary cursed boisterous burdensomely charisma concede contravene curses bomb burn chasten conceit contrive cursory bombard busy chastise conceited contrived curt bombardment busybody chastisement concern controversial cuss bombastic butcher chatter concerned controversy cut bondage butchery chatterbox concerns convoluted cutthroat bonkers byzantine cheap concession coping cynical bore cackle cheapen concessions corrode cynicism boredom cajole cheat condemn corrosion damage boring calamities cheater condemnable corrosive damaging botch calamitous cheerless condemnation corrupt damn bother calamitously chide condescend corruption damnable bothersome calamity childish condescending costly damnably bowdlerize callous chill condescendingly counterproductive damnation boycott calumniate chilly condescension coupists damned braggart calumniation chit condolence covetous damning bragger calumnies choke condolences cow danger B.6. 
Sentiment word lists 129 dangerous deepening denunciations deter disadvantageous disdainful dangerousness defamation deny deteriorate disaffect disdainfully dangle defamations deplete deteriorating disaffected disease dark defamatory deplorable deterioration disaffirm diseased darken defame deplorably deterrent disagree disfavor darkness defeat deplore detest disagreeable disgrace darn defect deploring detestable disagreeably disgraced dash defective deploringly detestably disagreement disgraceful dastard defensive deprave detract disallow disgracefully dastardly defiance depraved detraction disappoint disgruntle daunt defiant depravedly detriment disappointed disgruntled daunting defiantly deprecate detrimental disappointing disgust dauntingly deficiency depress devastate disappointingly disgusted dawdle deficient depressed devastated disappointment disgustedly daze defile depressing devastating disapprobation disgustful dazed defiler depressingly devastatingly disapproval disgustfully dead deform depression devastation disapprove disgusting deadbeat deformed deprive deviate disapproving disgustingly deadlock defrauding deprived deviation disarm dishearten deadly defunct deride devil disarray disheartening deadweight defy derision devilish disaster dishearteningly deaf degenerate derisive devilishly disastrous dishonest dearth degenerately derisively devilment disastrously dishonestly death degeneration derisiveness devilry disavow dishonesty debacle degradation derogatory devious disavowal dishonor debase degrade desecrate deviously disbelief dishonorable debasement degrading desert deviousness disbelieve dishonorablely debaser degradingly desertion devoid disbeliever disillusion debatable dehumanization desiccate diabolic disclaim disillusioned debate dehumanize desiccated diabolical discombobulate disinclination debauch deign desolate diabolically discomfit disinclined debaucher deject desolately diametrically discomfititure disingenuous debauchery dejected desolation diatribe discomfort disingenuously debilitate dejectedly despair diatribes discompose disintegrate debilitating dejection despairing dictator disconcert disintegration debility delinquency despairingly dictatorial disconcerted disinterest decadence delinquent desperate differ disconcerting disinterested decadent delirious desperately difficult disconcertingly dislike decay delirium desperation difficulties disconsolate dislocated decayed delude despicable difficulty disconsolately disloyal deceit deluded despicably diffidence disconsolation disloyalty deceitful deluge despise dig discontent dismal deceitfully delusion despised digress discontented dismally deceitfulness delusional despite dilapidated discontentedly dismalness deceive delusions despoil dilemma discontinuity dismay deceiver demean despoiler dilly-dally discord dismayed deceivers demeaning despondence dim discordance dismaying deceiving demise despondency diminish discordant dismayingly deception demolish despondent diminishing discountenance dismissive deceptive demolisher despondently din discourage dismissively deceptively demon despot dinky discouragement disobedience declaim demonic despotic dire discouraging disobedient decline demonize despotism direly discouragingly disobey declining demoralize destabilisation direness discourteous disorder decrease demoralizing destitute dirt discourteously disordered decreasing demoralizingly destitution dirty discredit disorderly decrement denial destroy disable discrepant disorganized decrepit denigrate destroyer 
(Appendix B.6, continued: the negative sentiment word list, "decrepitude" through "zealously".)
