Feature Extraction Using Latent Dirichlet Allocation and Neural Networks: a Case Study on Movie Synopses

Despoina I. Christou
Department of Applied Informatics
University of Macedonia

Dissertation submitted for the degree of BSc in Applied Informatics
2015

To my parents and my siblings…

Acknowledgements

I wish to extend my deepest appreciation to my supervisor, Prof. I. Refanidis, with whom I had the chance to work personally for the first time. I have always appreciated him as a professor of AI, but I was mostly inspired by his exceptional ethics as a human being. I am grateful for the discussions we had during my thesis work, which gave me different perspectives on my future work and goals. I am also indebted to Mr. Ioannis Alexander M. Assael, first graduate of Oxford MSc Computer Science, class of 2015, for his expert suggestions and continuous assistance. Finally, I would like to thank my parents, siblings, friends and all the other people whose constant faith in me encourages me to pursue my interests.

Abstract

Feature extraction has gained increasing attention in the field of machine learning, because informative features are crucial for detecting patterns, extracting information, or predicting future observations from big data. The process of extracting features is closely linked to dimensionality reduction, as it involves transforming the data from a sparse, high-dimensional space into higher-level, meaningful abstractions. This dissertation employs Neural Networks for distributed paragraph representations, and Latent Dirichlet Allocation to capture higher-level features of paragraph vectors. Although Neural Networks for distributed paragraph representations are considered the state of the art for extracting paragraph vectors, we show that a quick topic analysis model such as Latent Dirichlet Allocation can provide meaningful features too.
We evaluate the two methods on the CMU Movie Summary Corpus, a collection of 25,203 movie plot summaries extracted from Wikipedia. Finally, for both approaches, we use K-Nearest Neighbors to discover similar movies, and plot the projected representations using t-Distributed Stochastic Neighbor Embedding to depict the context similarities. These similarities, expressed as movie distances, can be used for movie recommendation. The movies recommended by this approach are compared with the movies recommended by IMDB, which uses a collaborative filtering recommendation approach, to show that our two models could constitute either an alternative or a supplementary recommendation approach.

Contents

1. Introduction
    1.1 Overview
    1.2 Motivation
2. Latent Dirichlet Allocation
    2.1 History
        2.1.1 TF-IDF scheme
        2.1.2 Latent Semantic Analysis (LSA)
        2.1.3 Probabilistic Latent Semantic Analysis (PLSA)
    2.2 Latent Dirichlet Allocation (LDA)
        2.2.1 LDA intuition
        2.2.2 LDA and Probabilistic models
        2.2.3 Model Inference
            2.2.3.1 Gibbs Sampler
3. Autoencoders
    3.1 Neural Networks
    3.2 Autoencoder
    3.3 Backpropagation method
4. Feature Representation
    4.1 Word Vectors (word2vec models)
    4.2 Using Autoencoder to obtain paragraph vectors (doc2vec)
    4.3 Using LDA to obtain paragraph vectors
5. Case Study: Movies Modelling
    5.1 Movie Database
    5.2 Preprocessing
    5.3 Learning features using LDA
    5.4 Learning features using Autoencoder
    5.5 K-Nearest Neighbors
    5.6 t-SNE
6. Models Evaluation
    6.1 Symmetric LDA Evaluation
        6.1.1 Evaluated Topics
        6.1.2 Dataset Plot
        6.1.3 Movies Recommendation
    6.2 Autoencoder Evaluation
        6.2.1 Dataset Plot
        6.2.2 Movies Recommendation
    6.3 Comparative evaluation of models
7. Conclusions
    7.1 Contributions and Future Work
Appendix A
Appendix B
Appendix C
Bibliography

Chapter 1. Introduction

1.1 Overview

Thanks to the digitization of old material, the registration of new material, sensor data, and both governmental and private digitization efforts in general, the amount of available electronic data of all sorts has been growing steadily over the last decade. Simultaneously, the need for automatic data organization tools and search engines has become obvious. Naturally, this has led to increased scientific interest and activity in related areas such as pattern recognition and dimensionality reduction, fields closely related to feature extraction.

Although the history of text categorization dates back to the introduction of computers, it is only since the early 1990s that text categorization has become an important part of mainstream text mining research, thanks to increased application-oriented interest and to the rapid development of more powerful hardware. Categorization has proved its strengths in various contexts, such as automatic document annotation (or indexing), document filtering (spam filtering in particular), automated metadata

Probabilistic Topic Modelling with Semantic Graph

Employee Matching Using Machine Learning Methods

A Comprehensive Embedding Approach for Determining Repository Similarity

Matrix Decompositions and Latent Semantic Indexing

An Evaluation of Machine Learning Approaches to Natural Language Processing for Legal Text Classification

Fasttext.Zip: Compressing Text Classification Models

Supervised and Unsupervised Word Sense Disambiguation on Word Embedding Vectors of Unambiguous Synonyms

Latent Semantic Analysis for Text-Based Research

Practice with Python

Investigating Classification for Natural Language Processing Tasks

Linked Data Triples Enhance Document Relevance Classification

How Latent Is Latent Semantic Analysis?