Word Embeddings for Natural Language Processing

THÈSE NO 7148 (2016)

PRÉSENTÉE LE 26 SEPTEMBRE 2016
À LA FACULTÉ DES SCIENCES ET TECHNIQUES DE L'INGÉNIEUR
LABORATOIRE DE L'IDIAP
PROGRAMME DOCTORAL EN GÉNIE ÉLECTRIQUE

ÉCOLE POLYTECHNIQUE FÉDÉRALE DE LAUSANNE

POUR L'OBTENTION DU GRADE DE DOCTEUR ÈS SCIENCES

PAR

Rémi Philippe LEBRET

acceptée sur proposition du jury:

Prof. J.-Ph. Thiran, président du jury
Prof. H. Bourlard, Dr R. Collobert, directeurs de thèse
Dr Y. Grandvalet, rapporteur
Prof. G. Attardi, rapporteur
Dr J.-C. Chappelier, rapporteur

Suisse 2016

He who is not courageous enough to take risks will accomplish nothing in life. — Muhammad Ali

To my grandfathers. . .

Acknowledgements

I would like to express my gratitude to the people who contributed to this thesis and to my life as a Ph.D. student.

First, I would like to thank Dr. Ronan Collobert for giving me the opportunity to join his team at Idiap for my doctoral studies. Thanks for always being available during the first two years to teach me all the tricks of the trade for training neural networks. I would also like to thank Dr. David Grangier and Dr. Michael Auli for the day-to-day supervision during my 6-month internship at Facebook, where I learned a lot. At that point in my Ph.D., it gave me the enthusiasm and energy to move forward. Many thanks to Prof. Hervé Bourlard for making Idiap such a great place to work, and to Nadine and Sylvie for their kindness and responsiveness. I am also particularly grateful to the Hasler Foundation for supporting me financially.

I really appreciate the time and dedication my thesis committee devoted to reviewing my work: Prof. Giuseppe Attardi, Dr. Yves Grandvalet, Dr. Jean-Cédric Chappelier, and Prof. Jean-Philippe Thiran. Thank you for the fruitful discussions and your comments.

A special dedication goes to my fellow colleagues from the applied team at Idiap: Dimitri, Joël, and Pedro. Thanks to Dimitri for teaching us all the specificities of Switzerland and Game of Thrones; to Joël for organizing great team dinners composed of Raclette and Fondue; and to Pedro for always being in a good mood and for his Brazilian accent. I have a special thought for César, the Fifth Musketeer, who joined the team for a year before boarding a plane to Montreal. Thanks are also in order to all my floorball teammates: Bastien, Laurent, Pierre-Edouard, David, Hugues, Christian, Michael, Manuel, Elie, Paul, Romain, Rémi.

This Ph.D. has given me the opportunity to discover Switzerland, specifically the Canton of Valais. I would like to thank the people of Le Trétien (the small mountain village where I lived) for the warm welcome. I particularly want to thank Monique and Christian for their love and presence; and my closest neighbors Vicky and Thierry, Rita and Benoit, Marie-Noëlle, and Damien for their friendship and generosity.

The last two years of this journey have been stressful, with a lot of questions and concerns. This period would have been very difficult without the unconditional support and love of my wife Claire. Thank you for all the encouragement, your joie de vivre, and for the sacrifices that you made for me. Now it is my turn to support you in your project.

Finally, I want to thank my family for their loving support through the years. Thanks to my parents, Christine and Philippe; to my sister, Julie, her husband, Eddie, my nephew, Matilin, and my niece, Margaud. Thanks for frequently making the long trip to Le Trétien, where we shared some good moments together. Last but not least, I thank my grandmother, Jeanne, for her unfaltering support.

Lausanne, 6 August 2016 R. L.

Abstract

Word embedding is a technique which aims at mapping words from a vocabulary into vectors of real numbers in a low-dimensional space. By leveraging large corpora of unlabeled text, such continuous space representations can be computed to capture both syntactic and semantic information about words. Word embeddings, when used as the underlying input representation, have been shown to be a great asset for a large variety of natural language processing (NLP) tasks. Recent techniques to obtain such word embeddings are mostly based on neural network language models (NNLM). In such systems, the word vectors are randomly initialized and then trained to optimally predict the contexts in which the corresponding words tend to appear. Because words occurring in similar contexts have, in general, similar meanings, their resulting word embeddings are semantically close after training. However, such architectures might be challenging and time-consuming to train. In this thesis, we focus on building simple models which are fast and efficient on large-scale datasets. As a result, we propose a model based on counts for computing word embeddings. A word co-occurrence probability matrix can easily be obtained by directly counting the context words surrounding the vocabulary words in a large corpus of texts. The computation can then be drastically simplified by performing a Hellinger PCA of this matrix. Besides being simple, fast and intuitive, this method has two other advantages over NNLM. First, it provides a framework to infer unseen words or phrases. Second, all embedding dimensions can be obtained after a single Hellinger PCA, while a new training is required for each new size with NNLM. We evaluate our word embeddings on classical word tagging tasks and show that we reach performance similar to that of neural network based word embeddings.

While many techniques exist for computing word embeddings, models for phrases remain a challenge. Still based on the idea of proposing simple and practical tools for NLP, we introduce a novel model that jointly learns word embeddings and their summation. Sequences of words (i.e. phrases) of different sizes are thus embedded in the same semantic space by just averaging word embeddings. In contrast to previous methods which reported a posteriori some compositionality aspects by simple summation, we simultaneously train words to sum, while keeping the maximum information from the original vectors. These word and phrase embeddings are then used in two different NLP tasks: document classification and sentence generation. Using such word embeddings as inputs, we show that good performance is achieved in sentiment classification of short and long text documents with a convolutional neural network. Finding good compact representations of text documents is crucial in classification systems. Based on the summation of word embeddings, we introduce a method to represent documents in a low-dimensional semantic space. This simple operation, along with a clustering method, provides an efficient framework for adding semantic information to documents, which yields better results than classical approaches for classification. Simple models for sentence generation can also be designed by leveraging such phrase embeddings. We propose a phrase-based model for image captioning which achieves results similar to those obtained with more complex models. Not only word and phrase embeddings but also embeddings for non-textual elements can be helpful for sentence generation. We therefore explore embedding table elements for generating better sentences from structured data. We experiment with this approach on a large-scale dataset of biographies, where biographical infoboxes were available. By parameterizing both words and fields as vectors (embeddings), we significantly outperform a classical model.

Key words: word embedding, natural language processing, PCA, artificial neural networks, language model, document classification, sentence generation

Résumé

Le word embedding est une méthode d'apprentissage automatique qui vise à représenter les mots d'un vocabulaire dans des vecteurs de réels dans un espace à faible dimension. En s'appuyant sur un grand corpus de textes non annoté, de telles représentations vectorielles peuvent être calculées pour capturer à la fois des informations syntaxiques et sémantiques sur les mots. Ces word embeddings, lorsqu'ils sont ensuite utilisés comme données d'entrée, se sont révélés être un grand atout pour une grande variété de tâches en traitement automatique du langage naturel (TALN). Les techniques récentes pour obtenir ces représentations de mots sont principalement basées sur des modèles de langue neuronaux (MLN). Dans de tels systèmes, les vecteurs représentant les mots sont initialisés aléatoirement, puis entrainés à prédire de façon optimale les contextes dans lesquels ils apparaissent. Étant donné que les mots apparaissant dans des contextes similaires ont, en principe, des significations semblables, leurs représentations vectorielles sont sémantiquement proches après l'apprentissage. Cependant, de telles architectures sont généralement difficiles et longues à entrainer. Dans cette thèse, nous nous concentrons sur la construction de modèles simples qui sont à la fois rapides et efficaces avec des ensembles de données à grande échelle. En conséquence, nous proposons un modèle basé sur le simple comptage de mots pour calculer les word embeddings. Une matrice de probabilité de cooccurrences peut être facilement obtenue en comptant directement, dans un grand corpus de textes, les mots de contexte entourant les mots du vocabulaire d'intérêt. L'obtention des word embeddings peut alors être considérablement simplifiée en effectuant une ACP de cette matrice, avec la distance de Hellinger. En plus d'être simple, rapide et intuitive, cette méthode présente deux autres avantages par rapport aux MLN. Tout d'abord, cette méthode permet l'inférence de nouveaux mots ou expressions (groupes de mots). Deuxièmement, toutes les dimensions de word embeddings peuvent être obtenues après une seule ACP, alors qu'un nouvel apprentissage est nécessaire pour chaque nouvelle taille d'embeddings avec les MLN. Nous évaluons ensuite nos représentations de mots sur des tâches classiques d'étiquetage de mots, et nous montrons que les performances sont similaires à celles obtenues avec des word embeddings obtenus par l'intermédiaire de MLN. Alors que de nombreuses techniques existent pour le calcul de word embeddings, la représentation vectorielle de groupes de mots reste encore un défi. Toujours dans l'idée de proposer des outils simples et pratiques pour le TALN, nous introduisons un modèle qui apprend conjointement les word embeddings et leur sommation. Les séquences de mots de tailles différentes sont ainsi encodées dans le même espace sémantique, juste en moyennant les embeddings des mots. Contrairement aux méthodes précédentes qui ont observé a posteriori

des propriétés de compositionnalité par simple sommation, nous apprenons aux vecteurs de mots à s'agréger, tout en gardant le maximum d'informations des vecteurs originaux. Ces embeddings de mots et groupes de mots sont ensuite utilisés dans deux tâches de TALN différentes : la classification des documents et la génération automatique de phrases. En utilisant les word embeddings comme données d'entrée dans un réseau de neurones convolutif, nous montrons que de bonnes performances sont obtenues pour la classification de sentiments dans des documents textuels, aussi bien longs que courts. Les systèmes classiques de classification doivent souvent faire face à des problèmes de dimensionnalité, il est donc crucial de trouver des représentations de documents compactes. En se basant sur notre modèle de sommation des word embeddings, nous introduisons une méthode pour représenter les documents dans un espace sémantique de faible dimension. Cette opération de sommation, suivie d'une méthode de clustering, permet d'ajouter efficacement de l'information sémantique aux documents, et ainsi d'obtenir de meilleurs résultats que les approches classiques pour la classification. Des modèles simples pour la génération automatique de phrases peuvent également être conçus en tirant profit de ces embeddings de groupes de mots. Nous proposons ainsi un modèle pour le légendage automatique d'images qui obtient des résultats similaires à ceux de modèles plus complexes à base de réseaux de neurones récurrents. En complément des embeddings de mots ou de groupes de mots, les éléments non textuels peuvent aussi être utiles pour la génération de texte. Nous proposons ainsi d'explorer l'encodage d'éléments issus de tableaux pour générer de meilleures phrases à partir de données structurées. Nous expérimentons cette approche sur une grande collection de biographies issues de Wikipedia, où des tableaux d'informations sont disponibles. En paramétrant les mots et les champs des tableaux comme vecteurs (embeddings), nous obtenons ainsi de meilleurs résultats qu'avec un modèle classique.

Mots clefs : word embedding, traitement automatique du langage naturel, ACP, réseau de neurones artificiels, modèle de langue, classification de documents, génération automatique de phrases

Contents

Acknowledgements
Abstract (English/Français)
List of Figures
List of Tables
List of Abbreviations

1 Introduction
  1.1 Motivation
  1.2 Objectives
  1.3 Thesis Contributions
  1.4 Thesis Outline

2 Background
  2.1 Neural Network Models for Natural Language Processing
    2.1.1 Mathematical Notations
    2.1.2 Feed-forward Neural Networks
    2.1.3 Transforming Words into Feature Vectors
  2.2 Document Classification
    2.2.1 Bag-of-words Model
    2.2.2 Topic Models
    2.2.3 Classification Models
    2.2.4 Evaluation Metrics
  2.3 Language Models
    2.3.1 N-gram Models
    2.3.2 Smoothing Models
    2.3.3 Neural Network Language Models
    2.3.4 Recurrent Neural Networks
    2.3.5 Evaluation Metrics


I Building Word Embeddings from Large Text Corpora

3 State-of-the-art Word Representations
  3.1 Brown Clustering
  3.2 Through Neural Network Language Models
    3.2.1 Pairwise Ranking Approach
    3.2.2 Scalable Log-Bilinear Model
    3.2.3 Skip-gram Model

4 Word Embeddings through Hellinger PCA
  4.1 Word Co-Occurrence Probabilities
  4.2 Hellinger Distance
  4.3 Experimental Setup
    4.3.1 Building Word Representations over Large Corpora
    4.3.2 Evaluating Word Representations
  4.4 Analysis of the Context
    4.4.1 Type of Context
    4.4.2 Context Window Size
    4.4.3 Analysis Findings
  4.5 Principal Component Analysis (PCA)
    4.5.1 Eigen Decomposition (ED)
    4.5.2 Singular Value Decomposition (SVD)
    4.5.3 Experimental Analysis
  4.6 Supervised Evaluation Tasks
    4.6.1 Tasks Description
    4.6.2 Neural Network Approach
    4.6.3 Experimental Setup
    4.6.4 Results
  4.7 Embedding Inference
  4.8 Implementation
  4.9 The Revival of Count-based Methods
    4.9.1 SVD over Shifted Positive Pointwise Mutual Information
    4.9.2 Global Vectors (GloVe)
  4.10 Conclusion

5 Towards Phrase Embeddings
  5.1 Related Work
  5.2 Hellinger PCA with Autoencoder
  5.3 Joint Learning with Summation
    5.3.1 Training an Additive Model
    5.3.2 Joint Learning with Negative Sampling
  5.4 Experimental Results
    5.4.1 Phrase Dataset

    5.4.2 Other Methods
    5.4.3 Evaluating Word Representations
    5.4.4 Evaluating Phrase Representations
    5.4.5 Inferring New Phrase Representations
  5.5 Conclusion

II Document Classification

6 Sentiment Classification with Convolutional Neural Network
  6.1 Convolutional Neural Network for Sentiment Classification
    6.1.1 Embedding Layer
    6.1.2 Convolutional Layer
    6.1.3 Global Document Representation
    6.1.4 Binary Classification
    6.1.5 Training
  6.2 Short Document Classification
    6.2.1 Dataset Description
    6.2.2 Related Work
    6.2.3 Experimental Results
  6.3 Long Document Classification
    6.3.1 Dataset Description
    6.3.2 Related Work
    6.3.3 Experimental Results
  6.4 Conclusion

7 N-gram-Based Model for Compact Document Representation
  7.1 Related Work
  7.2 A Bag of Semantic Concepts Model
    7.2.1 N-gram Representation
    7.2.2 K-means Clustering
    7.2.3 Document Representation
  7.3 Experiments with Sentiment Classification
    7.3.1 IMDB Movie Reviews Datasets
    7.3.2 Building Bag of Semantic Concepts for Movie Reviews
    7.3.3 Comparison with Other Methods
    7.3.4 Classification using SVM
    7.3.5 Results
    7.3.6 Computation Time
    7.3.7 Inferring Semantic Concepts for Unseen N-grams
  7.4 Experiments with Text News Classification
    7.4.1 Reuters-27000 Dataset
    7.4.2 Experimental Setup


    7.4.3 Results
  7.5 Conclusion

III Sentence Generation

8 Phrase-based Image Captioning
  8.1 Related Work
  8.2 Syntax Analysis of Image Descriptions
    8.2.1 Datasets
    8.2.2 Chunking-based Approach
  8.3 Phrase-based Model for Image Descriptions
    8.3.1 Image Representations
    8.3.2 Learning a Common Space for Image and Phrase Representations
    8.3.3 Phrase Representations Initialization
    8.3.4 Training with Negative Sampling
  8.4 From Phrases to Sentence
    8.4.1 Sentence Generation
    8.4.2 Sentence Decoding
    8.4.3 Sentence Re-ranking
  8.5 Experiments
    8.5.1 Experimental Setup
    8.5.2 Experimental Results
    8.5.3 Diversity of Image Descriptions
    8.5.4 Phrase Representation Fine-Tuning
  8.6 Conclusion

9 Generating Text from Structured Data
  9.1 Related Work
  9.2 Language Modeling for Constrained Sentence Generation
    9.2.1 Language Model
    9.2.2 Language Model Conditioned on Tables
    9.2.3 Copy Actions
  9.3 A Neural Language Model Approach
    9.3.1 Embeddings as Inputs
    9.3.2 In-Vocabulary Outputs
    9.3.3 Mixing Outputs for Better Copying
    9.3.4 Training
  9.4 Experiments
    9.4.1 Biography Dataset
    9.4.2 Baseline
    9.4.3 Training Setup
    9.4.4 Evaluation Metrics

  9.5 Results
    9.5.1 The More, The Better
    9.5.2 Attention Mechanism
    9.5.3 Sentence Decoding
    9.5.4 Qualitative Analysis
  9.6 Conclusion

10 Conclusion
  10.1 Achievements
  10.2 Perspectives for Future Work

Bibliography

Curriculum Vitae


List of Figures

1.1 Two different approaches for encoding words into vector spaces. One-hot encoding on the left-hand side. Word embedding on the right-hand side.
1.2 From word to phrase embeddings.
1.3 Defining semantic concepts from phrase embeddings for representing a text document as a bag of semantic concepts. Best viewed in color.
1.4 Screenshot of the English Wikipedia page about Vladimir Vapnik. Red rectangles highlight the overlap of sequences of words between the infobox and the introduction section.
1.5 Chunking analysis of the ground-truth descriptions of a training image from the MS COCO dataset. NP, VP and PP mean, respectively, noun phrase, verbal phrase and prepositional phrase.

2.1 Simple recurrent neural network (Mikolov et al., 2010).
2.2 Long Short-term Memory Cell (Graves, 2013).

3.1 The class-based language model, which defines the quality of a clustering, represented as a …
3.2 An example of a Brown word-cluster hierarchy. Each node in the tree is labeled with a bit string indicating the path from the root node to that node, where 0 indicates a left branch and 1 indicates a right branch.

4.1 A toy example illustrating that related words have similar word co-occurrence probability distributions.
4.2 Cumulative probability distribution of unigrams that occur at least a hundred times in the English Wikipedia.
4.3 Performance on word similarity datasets with different types of context word vocabularies D (in the ascending order of their number of words), and different window sizes (in the legend, wsz1 is for a symmetric window of 1 context word, etc.). Spearman rank correlation is reported.
4.4 Performance on word analogy datasets with different types of context word vocabularies D (in the ascending order of their number of words), and different window sizes (in the legend, wsz1 is for a symmetric window of 1 context word, etc.). Accuracy is reported.


4.5 Performance on datasets with different eigenvalue weighting parameter p. Word embeddings dimension is 512. Context vocabulary interval is ]1; 10^-5] with a symmetric window of 10 words. Spearman rank correlation is reported on word similarity tasks. Accuracy is reported on word analogy tasks.
4.6 Performance on datasets with different dimensions using context interval ]1; 10^-5] with a symmetric window of 10 words. Word embeddings have been obtained with the Hellinger PCA using randomized SVD. Spearman rank correlation is reported on word similarity tasks. Accuracy is reported on word analogy tasks.

5.1 Architecture for the joint learning of word representations and their summation. Considering the noun phrase s = the red cat, each word w_t ∈ s is represented as the square root of its co-occurrence probability distribution P_{w_t}. These are the inputs given to an autoencoder which encodes them in a lower dimension x_{w_t} ∈ R^{d_wrd}. These new representations are then given to a decoder which is trained to reconstruct the initial inputs. This is the first objective function. The second objective is to keep information when words are summed. All x_{w_t} are averaged together to represent s in the same space as w_t. A dot product between the phrase representation x_s and all the other word representations from the dictionary W is calculated. These scores are trained to be high for words that appear in s and low for the others.
5.2 Recall@1 based on the number of words per phrase. Comparison of performance across all models with 100-dimensional word vector representations.

6.1 Convolutional neural network for sentiment classification.
6.2 Selection of tweets from the test set where sentiments are highlighted using our model outputs. The blue color scale indicates negative sentiment, the red one indicates positive sentiments. Best viewed in colors.
6.3 Highlighting sentiments in IMDB movie reviews. The blue color scale indicates negative sentiment, the red one indicates positive sentiments. Best viewed in colors.

7.1 Performance of the three models on Reuters-27000 with different training sizes. Bar chart (left axis) is the number of features. Line chart (right axis) reports the F1-scores. Our method is denoted by BOSC. BOSC and LSA have a fixed number of features, K = 300.

8.1 Statistics on the number of phrase chunks (NP, VP, PP) per ground-truth description in Flickr30k and COCO training datasets. Best viewed in colors.
8.2 The 20 most frequent sentence structures in Flickr30k and COCO training datasets. The black line is the appearance frequency for each structure, the red line is the cumulative distribution. Best viewed in colors.
8.3 Schematic illustration of our phrase-based model for image descriptions.

8.4 The constrained language model for generating a description given the predicted phrases for an image.
8.5 Qualitative results for images on the COCO dataset. Ground-truth annotation (in blue, at the top), the NP, VP and PP predicted from the model and generated annotation (in black, at the bottom) are shown for each image. The last row shows failure samples.

9.1 Wikipedia infobox of Frederick Parker-Rhodes. The introduction of his article reads: “Frederick Parker-Rhodes (21 March 1914 – 21 November 1987) was an English linguist, plant pathologist, computer scientist, mathematician, mystic, and mycologist.”
9.2 Schematic example of the language model conditioned on tables.
9.3 Schematic example of the embedding-based language model. {·} symbolizes the max operation over the embeddings.
9.4 Comparison between our best model (Table NNLM) and the baseline (Template KN) for different beam sizes. The x-axis is the average timing (in milliseconds) for generating one sentence. The y-axis is the BLEU score. All results are measured on a subset of 1,000 samples of the validation set.


List of Tables

4.1 The average number of context words according to the type and the size of context.
4.2 Two rare words with their rank and their 5 nearest words with respect to the Hellinger distance, for a symmetric window of 1 and 10 context words.
4.3 Performance on word similarity and word analogy datasets using a context word vocabulary with P(w_t) > 10^-5, and different window sizes. For dense representations, we report results using the first 512 principal components after Hellinger PCA with SVD. Spearman rank correlation is reported for similarities. Accuracy is reported for analogies.
4.4 Words with their rank in W and their 10 nearest neighbors in the word embedding space (according to the Euclidean metric). Dimension of word embeddings is 128. A window of 10 context words has been used to build the word co-occurrence matrix.
4.5 Performance comparison on named entity recognition (NER) and chunking (CHUNK) tasks with different embeddings. The first column reports results with the original embeddings. The second column reports results after fine-tuning the embeddings for the task. Results are reported in F1 score (mean ± standard deviation of ten training runs with different initialization). H-PCA is for Hellinger PCA, while E-PCA stands for Euclidean PCA.
4.6 CPU time for computing word embeddings. min is for minutes and sec for seconds.
4.7 Three phrases with their 10 nearest words with respect to the Euclidean distance between the inferred phrase embeddings and the pre-computed word embeddings. Word co-occurrence statistics for these phrases are built using a context window of 10 words with a vocabulary containing the 6961 most frequent words.

5.1 Evaluation of word representations on both similarity and analogy tasks. Comparison of performance between Hellinger PCA with randomized SVD and with autoencoder. We use 100-dimensional word vector representations. Spearman rank correlation is reported on word similarity tasks. Accuracy is reported on word analogy tasks.
5.2 Evaluation of phrase representations. Comparison of performance across all models with 100-dimensional phrase vector representations on word retrieval. R@K is Recall@K, with K = {1,5,10}.


5.3 Examples of phrases and five of their ten nearest phrases from the collection of phrases. Representations for the collection of phrases have been computed by averaging the word representations. Query phrase representations are inferred using the two different alternatives: (1) with the encoding function f using counts from a symmetric window of ten context words around the query phrase, (2) by averaging the representations of the words that compose the query phrase. All distributed representations are 100-dimensional vectors.

6.1 Accuracy on the Twitter sentiment classification test set. The three classical models (SVM, BiNB and MaxEnt) are based on unigram and bigram features; the results are reported from Go et al. (2009).

6.2 Accuracy on the IMDB test set for sentiment classification. When BI is used as prefix, models include bigram features.
6.3 Set of words with their 10 nearest neighbors before and after fine-tuning for the movie review task (using the Euclidean metric in the embedding space). Before tuning, antonyms are highlighted in blue. After tuning, antonyms have been replaced by some task-specific words which are highlighted in red. Best viewed in colors.

6.4 The top 3 positive and negative filters [α]_i and their respective top 3 windows of words within the whole IMDB review dataset.

7.1 Classification accuracy on both movie review tasks with K = {100,200,300} number of features.
7.2 Selected pairs of antonyms and their cluster number. Here, n-grams from Maas et al.'s dataset have been partitioned into 300 clusters. Each n-gram is accompanied with a selection of others from its cluster.
7.3 Computation time for building movie review representations with K = 300 semantic concepts. Time is reported in seconds.
7.4 N-gram frequencies on Reuters-27000 according to the number of documents in the training set.

8.1 Chunking analysis of an image description.
8.2 Statistics of phrases appearing at least ten times.
8.3 Recall on phrase retrieval. For each test image, we take the top 20 predicted NP, the top 5 predicted VP, and the top 5 predicted PP.
8.4 Comparison between human agreement scores, state-of-the-art models and our model on both datasets. Note that there are slight variations between the test sets chosen in each paper. Scores are reported in terms of BLEU metric.
8.5 Examples of three noun phrases from the COCO dataset with five of their nearest neighbors before and after learning.

9.1 Valid and test perplexity for all models. Valid and test BLEU for table-conditioned models. For neural network language models (NNLM) we report the mean with standard deviation of five training runs with different initialization. BLEU scores are computed over sentences generated with a beam search (beam size is 5). ∗ and † are not directly comparable as the output vocabulary is slightly different.
9.2 Visualization of attention scores for Nellie Wong's Wikipedia infobox. Rows represent the probability distribution over (field, position) pairs from the table after generating each word. The columns represent the conditioning context, e.g., the model takes n − 1 words as context. The darker the color, the higher the probability. Best viewed in colors.
9.3 First sentence from the current Wikipedia article about Frederick Parker-Rhodes and the sentences generated from the three versions of our table-conditioned neural language model (Table NNLM) using the Wikipedia infobox seen in Figure 9.1.


List of Abbreviations

AI Artificial Intelligence
BC Bhattacharyya Coefficient
BLEU Bilingual Evaluation Understudy
BOSC Bag Of Semantic Concepts
BOW Bag Of Words
BP Brevity Penalty
CNN Convolutional Neural Network
CPU Central Processing Unit
CRF Conditional Random Field
ED Eigen Decomposition
GloVe Global Vectors
GPU Graphical Processing Unit
HLBL Hierarchical Log-Bilinear
KN Kneser-Ney
LDA Latent Dirichlet Allocation
LM Language Model
LSA Latent Semantic Analysis
LSTM Long Short-Term Memory
MAP Maximum A Posteriori
MaxEnt Maximum Entropy
ML Maximum Likelihood
MLP Multilayer Perceptron
NB Naive Bayes
NCE Noise-Contrastive Estimation
NER Named Entity Recognition
NLP Natural Language Processing
NNLM Neural Network Language Model
NP Noun Phrase
PCA Principal Component Analysis
PMI Pointwise Mutual Information
PP Prepositional Phrase
RNN Recurrent Neural Network
SG Skip-Gram


SVD Singular Value Decomposition
SVM Support Vector Machine
TF-IDF Term Frequency-Inverse Document Frequency
VP Verbal Phrase

1 Introduction

1.1 Motivation

Natural language can be described as a system which “makes infinite use of finite means” (Humboldt, 1836), meaning that an infinite number of sentences can be generated from a finite set of words. One can understand a sentence never heard before or express a meaning never expressed before. This capacity to generate an infinite number of possible sentences results from some underlying mechanisms. Linguists have defined compositionality, hierarchy, and recursion as the universal features of human language. The Principle of Compositionality (Frege, 1892) emphasizes that the meaning of an expression is a function of the meanings of its parts and of the way they are syntactically combined. Sentences are hierarchically structured: words are grouped into phrases, which are grouped into higher-level phrases, and so on until they form a complete sentence. A logical form of a sentence can then be derived from its syntactic structure (Montague, 1974). Recursion theorizes that longer sentences can be constructed by recursively inserting sentences within other sentences (Chomsky, 1957). Human language permits infinite recursion, at least in theory, which makes this feature a fundamental aspect of language for some linguists (Hauser et al., 2002). All these features are connected: recursion requires hierarchy and hierarchy requires compositionality. This big picture of linguistic theories tells us how crucial it is to capture the meaning of words along with their compositionality, the basis for understanding all sentences.

Natural language processing (NLP) aims at modeling the interactions between computers and human language, which involves many different tasks such as machine translation, document classification, natural language generation, question answering, speech recognition, etc. To tackle this variety of tasks, one needs to build “intelligent” systems, capable of understanding and analyzing human language. For such systems to work, it is therefore necessary to capture the meaning of words into vector spaces, along with modeling the compositionality, hierarchy and recursion of these vector spaces.


Figure 1.1 – Two different approaches for encoding words into vector spaces. One-hot encoding on the left-hand side. Word embedding on the right-hand side.

1.2 Objectives

Modeling natural language is a diverse and complex problem which involves many underlying mechanisms, as mentioned previously. This thesis focuses on bringing efficient and effective methods for computers to understand the foundations of human language: the meaning of words and their compositionality. The objectives are the following:

• Capturing the meaning of words into dense vectors. Compared with other fields of artificial intelligence where inputs are represented as vector spaces (images in computer vision, temporal signals in speech processing), machines face the challenge of getting words as inputs when dealing with natural language. A classical approach to address this issue is to define a |W|-dimensional vector where all entries are set to 0 except for a single entry that identifies a word w_t ∈ W (a fixed vocabulary of words); this is called one-hot encoding. As illustrated in Figure 1.1, a one-hot vector is a poor way to encode the meaning of words, since a separate dimension represents each word. A better solution is to encode words into dense vectors whose dimensions capture syntactic and semantic properties of words, such that related words are close in this continuous vector space. Such a representation, in a much lower-dimensional space than the vocabulary W, is called a word embedding (a minimal sketch contrasting the two encodings is given after this list).

• Composing words into phrases. Following the Principle of Compositionality, we then want to define a function that combines word embeddings to get representations of phrases (i.e. sequences of words). Similar phrases are often composed of different numbers of words. As illustrated in Figure 1.2, such a composition function must combine word embeddings independently of their number, so as to represent words and phrases in the same semantic space. The sketch after this list also illustrates this composition by averaging.
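To make the two encodings and the composition function concrete, here is a minimal, illustrative sketch that is not taken from the thesis: the toy vocabulary, the random embedding matrix and the dimension d_wrd = 4 are assumptions for illustration only; in practice the embeddings are learned from large corpora.

```python
# Illustrative sketch (not from the thesis): one-hot encoding vs. dense word
# embeddings, and phrase composition by averaging word embeddings.
import numpy as np

vocab = ["the", "black", "cat", "dog", "france", "england"]  # toy vocabulary W
word_to_index = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    """|W|-dimensional vector: all zeros except a 1 at the word's index."""
    f = np.zeros(len(vocab))
    f[word_to_index[word]] = 1.0
    return f

# Dense embeddings: one d_wrd-dimensional vector per vocabulary word.
# Random here; in practice they are learned from large unlabeled corpora.
d_wrd = 4
rng = np.random.default_rng(0)
E = rng.normal(size=(len(vocab), d_wrd))

def embed(word):
    return E[word_to_index[word]]          # simple lookup instead of a |W|-sized vector

def embed_phrase(words):
    """Phrase embedding as the average of its word embeddings (same space)."""
    return np.mean([embed(w) for w in words], axis=0)

print(one_hot("cat"))                      # sparse, 6-dimensional
print(embed("cat"))                        # dense, 4-dimensional
print(embed_phrase(["the", "black", "cat"]))  # phrase lives in the same 4-d space
```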

Having semantic representations of words and phrases, the objective is then to use them for solving NLP tasks. In this thesis, we leverage these embeddings for attacking two tasks in both supervised and unsupervised manners:


Figure 1.2 – From word to phrase embeddings.

• Document classification. Given a document, and assuming we are given a fixed-size vocabulary W (possibly including millions of words), a standard approach is to count word occurrences in the document. The |W|-dimensional vector of counts is a classical representation (often called the bag-of-words (BOW) representation), which is used as input for training standard machine learning classifiers; a minimal sketch of such a count representation is given after this list. Consequently, such systems do not rely on the meaning of words for classifying documents, since the classifiers are trained to find the most discriminative keywords. In this thesis, we want to go beyond keyword-based models, investigating new ways to find semantic document representations in an unsupervised manner. By capturing the meaning of words and phrases in the same embedding space, expressions from the same semantic group are now close in this space. As illustrated in Figure 1.3, semantic concepts can be defined to represent a given document as a bag of semantic concepts. Such a representation adds semantic information while reducing the number of features dramatically compared to a classical BOW representation. In machine learning, neural network models are established as powerful models for classification problems, like object recognition in computer vision, or acoustic modeling in speech recognition. Despite their success, they have not yet been extensively applied to text classification. Document representations are traditionally high-dimensional and sparse feature vectors, which do not fit well with neural networks. By encoding all words in a given document with embeddings, we obtain a real-valued representation which enables the use of neural network models for text classification. Furthermore, word embeddings carry semantic information that can help to attack more complex text classification problems.

• Sentence generation. While document classification is about understanding natural language (i.e. reading) to provide the right prediction, sentence generation is about analyzing a computer-based representation (e.g. images or tables) with the goal of converting it into natural language (i.e. writing). Traditional methods are based on a series of processes that make decisions for transforming concepts (e.g. objects in images, or values in tables) into words or expressions, which are then combined to form sentences according to rules of syntax, morphology and orthography. Consequently, such approaches generate text using cut-and-paste actions without understanding the meaning of what they produce. In this thesis, we want to investigate whether vector-based representations of words and expressions can be used for generating sentences. We want to leverage our semantically rich representations to propose more efficient and effective models for natural language generation, bypassing the need for hand-crafted rules.
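As announced in the document classification item above, the classical bag-of-words counting can be sketched as follows; the toy vocabulary and example document are hypothetical and only serve to show why word order and meaning are lost.

```python
# Illustrative sketch (not from the thesis): a bag-of-words (BOW) count vector.
from collections import Counter
import numpy as np

vocab = ["movie", "good", "not", "funny", "boring"]   # toy fixed vocabulary W
word_to_index = {w: i for i, w in enumerate(vocab)}

def bow(document):
    """|W|-dimensional vector of word counts; word order is discarded."""
    counts = Counter(document.lower().split())
    x = np.zeros(len(vocab))
    for word, count in counts.items():
        if word in word_to_index:
            x[word_to_index[word]] = count
    return x

print(bow("not good not funny"))   # [0. 1. 2. 1. 0.]
# Note: "not good" and "good" contribute the same 1-gram features,
# which is why n-gram or semantic representations can help.
```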

(a) Semantic concepts from phrase embeddings. (b) Bag of semantic concepts.

Figure 1.3 – Defining semantic concepts from phrase embeddings for representing a text document as a bag of semantic concepts. Best viewed in color.

1.3 Thesis Contributions

The first objective of this thesis is to encode words into real-valued vectors:

• Word embeddings through Hellinger PCA. There has been a lot of effort to capture the meaning of words into vector space models, often called word embeddings. Popular approaches such as the Brown clustering algorithm (Brown et al., 1992) have been used with success in a wide variety of NLP tasks. Recently, approaches based on neural network language models (NNLM) have revived the field of learning word embeddings. However, a neural network architecture can be hard to train. Finding the right hyper-parameters to tune the model is a challenging task and the training phase is, in general, computationally expensive. This thesis aims to show that word embeddings can be obtained using simple (linear) operations. Linguists assumed long ago that words occurring in similar contexts tend to have similar meanings (Harris, 1954; Firth, 1957). We therefore propose to compute word embeddings using word co-occurrence statistics and a well-known dimensionality reduction operation, Principal Component Analysis (PCA), along with an appropriate metric (the Hellinger distance). An evaluation on classical NLP tasks shows that such a simple spectral method can generate word embeddings as good as those obtained with neural network architectures. Besides being simple, this method is also very fast and provides a framework to infer new words or phrases. This work was first presented at a workshop (Lebret and Collobert, 2013), before being published in an NLP conference (Lebret and Collobert, 2014).
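The following sketch illustrates the overall recipe just described (co-occurrence counts, row-normalization into probabilities, square root for the Hellinger distance, then PCA via SVD). It is a toy illustration under simplifying assumptions (a tiny corpus, and a context vocabulary equal to the word vocabulary), not the thesis implementation.

```python
# Minimal sketch (not the thesis implementation) of count-based embeddings:
# co-occurrence probabilities -> Hellinger mapping (square root) -> PCA via SVD.
import numpy as np

corpus = [["the", "cat", "sat", "on", "the", "mat"],
          ["the", "dog", "sat", "on", "the", "rug"]]
window = 2                                   # symmetric context window size

vocab = sorted({w for sent in corpus for w in sent})
idx = {w: i for i, w in enumerate(vocab)}
counts = np.zeros((len(vocab), len(vocab)))

for sent in corpus:
    for t, w in enumerate(sent):
        lo, hi = max(0, t - window), min(len(sent), t + window + 1)
        for c in sent[lo:t] + sent[t + 1:hi]:
            counts[idx[w], idx[c]] += 1

# Row-normalize the counts to get co-occurrence probabilities P(c | w).
P = counts / counts.sum(axis=1, keepdims=True)

# Hellinger PCA: take the square root, center, then keep the top components.
X = np.sqrt(P)
X_centered = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
d = 2                                        # embedding dimension
embeddings = X_centered @ Vt[:d].T           # one d-dimensional vector per word

print(embeddings[idx["cat"]])
```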

This first contribution enables the encoding of words into continuous low-dimensional vectors. We leverage these word embeddings for solving document classification and sentence generation in the two following contributions:

• Convolutional neural networks for sentiment classification. Neural network architectures have shown their potential in several supervised NLP tasks, by using word embeddings. Good performance is achieved on tasks where the syntactic aspect is dominant, such as part-of-speech tagging, chunking and named entity recognition (Turian et al., 2010; Collobert et al., 2011). Such systems can also be a good alternative to classical approaches for supervised tasks where the semantic aspect predominates, such as sentiment classification. Inspired by its success in computer vision, we propose a convolutional neural network (CNN) model for predicting sentiments in text documents. Each word in a given document is first mapped into a pre-trained embedding, thanks to an embedding layer. Then, a convolutional layer locally extracts sentiments from sequences of words. Finally, a max layer pools the best local sentiments into a global document representation which is used to train a linear classifier (a minimal sketch of this pipeline is given at the end of this list). As word embeddings are usually generated over large corpora of unlabeled data, words are represented in a generic manner. Because the system is trained in an end-to-end manner, embedding specialization is easily done, which yields good performance in sentiment classification of short (tweets) and long (movie reviews) documents. This work has been published in Lebret and Collobert (2014) for long document classification. Experiments on short documents have been reported in Lebret et al. (2016b).

• Generating text from structured data. Machine learning methods for NLP can also be developed by leveraging systems that rely on embeddings. In this thesis, we propose to explore the use of NNLM for concept-to-text generation, which addresses the problem of rendering structured records into natural language (Reiter et al., 2000). In contrast to previous work, we scale to the large and very diverse problem of generating biographies for personalities based on Wikipedia infoboxes (i.e. a fact table describing a person). As illustrated in the example in Figure 1.4, there exists a large overlap of sequences of words between an article and its infobox. To tackle this problem we thus introduce a statistical generation model conditioned on a Wikipedia infobox. We focus on the generation of the first sentence of a biography, which requires the model to select among a large number of possible fields to generate an adequate output. Such diversity makes it difficult for classical count-based models to estimate probabilities of rare events due to data sparsity. In our case, we can address this issue by parameterizing words and fields as embeddings, along with a neural language model operating on these embeddings (Bengio et al., 2003). This factorization allows us to scale to a large number of words and fields compared to classical approaches where the number of parameters grows as the product of the number of words and fields. As with our CNN model for sentiment classification, our model exploits structured data both globally and locally. Global conditioning summarizes all information about a personality to understand high-level themes, such as whether the biography is about a scientist or an artist, while local conditioning describes the previously generated tokens regarding their relationship to the infobox. Our evaluation on biography generation shows that our model can generate a wide variety of natural sentences, with high overlap (BLEU) compared to human reference sentences. This work has been published in Lebret et al. (2016a).

Figure 1.4 – Screenshot of the English Wikipedia page about Vladimir Vapnik. Red rectangles highlight the overlap of sequences of words between the infobox and the introduction section.
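Referring back to the convolutional model for sentiment classification described above (as announced in that item), its forward pass can be illustrated as follows; all sizes and parameters are toy assumptions, and the real model is trained end-to-end rather than used with random weights.

```python
# Minimal sketch (not the thesis code) of the CNN pipeline for sentiment
# classification: embedding lookup -> convolution over word windows ->
# max pooling over the document -> linear (binary) classifier.
import numpy as np

rng = np.random.default_rng(0)
V, d_wrd, win, n_filters = 1000, 50, 3, 10   # toy sizes (assumptions)

E = rng.normal(size=(V, d_wrd))              # pre-trained word embeddings
W_conv = rng.normal(size=(n_filters, win * d_wrd)) * 0.1
b_conv = np.zeros(n_filters)
w_out = rng.normal(size=n_filters) * 0.1     # linear classifier weights
b_out = 0.0

def score(word_indices):
    x = E[word_indices]                                  # (T, d_wrd) lookup
    windows = [x[t:t + win].reshape(-1)                  # concatenate each word window
               for t in range(len(word_indices) - win + 1)]
    h = np.tanh(np.array(windows) @ W_conv.T + b_conv)   # local sentiment features
    doc = h.max(axis=0)                                  # max pooling over time
    return doc @ w_out + b_out                           # > 0: positive, < 0: negative

print(score(rng.integers(0, V, size=12)))    # random 12-word "document"
```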

Given that we built representations of words in a vector space, the Principle of Compositionality can be addressed by combining word embeddings to get representations of phrases or sentences:

• Composing word embeddings. Given pre-computed word embeddings, some compositional models involve vector addition or multiplication (Mitchell and Lapata, 2010). Others use the syntactic relations between words to treat certain words as functions and others as arguments, such as adjective-noun composition (Baroni and Zamparelli, 2010) or noun-verb composition (Grefenstette et al., 2013). Still based on the idea of proposing efficient and effective methods for NLP, we propose to learn word embeddings that can be averaged together to quickly compute representations of phrases in the same embedding space, with no additional memory. To get representations in a low-dimensional space, we reproduce the Hellinger PCA of the word co-occurrence matrix using an autoencoder network. In Mikolov et al. (2013b), the authors train word embeddings with linear models and show that some nice properties can be found using additive compositionality, meaning that such representations exhibit linear structures. For combining word representations in a common semantic space, a compositional model based on vector addition is chosen due to its simplicity and its inherent structure. The autoencoder architecture enables this compositional function to be trained simultaneously with the word vector representations. This work has been published in Lebret and Collobert (2015b), which introduces the learning of word embeddings via autoencoders. The joint learning with summation has been published in Lebret and Collobert (2015c).
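A toy formulation of the two objectives just described might look as follows; the encoder/decoder shapes, the tanh nonlinearity and the Dirichlet-sampled co-occurrence rows are assumptions made for illustration, not the thesis architecture.

```python
# Minimal sketch (a toy formulation, not the thesis code) of the joint objective:
# an autoencoder reconstructs square-rooted co-occurrence rows, while the
# averaged encodings of a phrase are trained to score its own words highly.
import numpy as np

rng = np.random.default_rng(0)
n_context, d = 50, 8                      # toy sizes (assumptions)
W_enc = rng.normal(size=(d, n_context)) * 0.1
W_dec = rng.normal(size=(n_context, d)) * 0.1

def encode(sqrt_p):                       # word embedding x_w = f(sqrt(P_w))
    return np.tanh(W_enc @ sqrt_p)

def reconstruction_loss(sqrt_p):          # first objective: autoencoding
    x = encode(sqrt_p)
    return np.sum((W_dec @ x - sqrt_p) ** 2)

def summation_score(phrase_rows, word_row):
    """Second objective (sketch): the averaged phrase embedding should give a
    high dot product to the words the phrase actually contains."""
    x_s = np.mean([encode(r) for r in phrase_rows], axis=0)
    return x_s @ encode(word_row)

# Toy square-rooted co-occurrence rows for the words of a three-word phrase.
phrase = [np.sqrt(rng.dirichlet(np.ones(n_context))) for _ in range(3)]
print(reconstruction_loss(phrase[0]), summation_score(phrase, phrase[1]))
```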

Defining a common semantic space for n-grams (i.e. sequences of n words) can be useful for many NLP tasks. We leverage these new embeddings for solving our two tasks of interest:

• Bag of semantic concepts for document classification. Bag-of-words (BOW) based models have long been state-of-the-art methods to classify documents. These techniques count words (1-grams) within documents. Classifiers are then trained to identify the most discriminative words for classifying documents. Documents sharing those discriminative words are considered similar. Such models work very well in practice, as certain topic keywords are indicative enough for classification. However, having n-gram representations could help to increase classification performance for certain types of documents. For instance, it has been shown that classification accuracy for movie reviews can be improved by using 2-grams (Wang and Manning, 2012). This outcome is not really surprising, since a document that contains 2-grams such as not good or not funny would probably be classified as positive with a BOW model. By leveraging the ability of our word embeddings to compose, we propose instead a bag of semantic concepts model, where semantic concepts are defined with a K-means clustering of all n-gram representations (a minimal sketch of this pipeline follows this list). Using the same classifiers as with BOW-based models, we show that such a model yields good results in the classification of movie reviews and news. This work has been published in Lebret and Collobert (2015a).


A given image i ∈ I, with its ground-truth descriptions s ∈ S:
a man riding a skateboard up the side of a wooden ramp
a man is grinding a ramp on a skateboard
man riding on edge of an oval ramp with a skate board
a man in a helmet skateboarding before an audience
a man on a skateboard is doing a trick

Figure 1.5 – Chunking analysis of the ground-truth descriptions of a training image from the MS COCO dataset. NP, VP and PP mean, respectively, noun phrase, verbal phrase and prepositional phrase.

• Phrase-based model for image captioning. Composing word embeddings is also valuable in the context of sentence generation. As illustrated in Figure 1.5, an exploratory analysis of image captions reveals that their syntax is quite simple. The ground-truth descriptions can be represented as a collection of noun, verbal and prepositional phrases. The different entities (e.g. a man, a skateboard, a wooden ramp) in a given image are described by the noun phrases, while the interactions or events between them are explained with verbal or prepositional phrases. For automatically generating image captions, we thus propose to train a model that predicts the set of phrases present in the sentences used to describe the images. By leveraging our embedding model, phrase vector representations are obtained by aggregating the word embeddings that compose the phrases. Image vector representations can also be easily obtained from some pre-trained convolutional neural networks, leveraging previous work in computer vision. We then introduce a model that learns a common embedding space between these phrase and image representations. By introducing a constrained language model based on our prior knowledge about the caption syntax, we generate syntactically correct sentences from the subset of phrases that best describe a given test image. Evaluation on two datasets using the BLEU metric shows that our generated sentences achieve performance similar to that of humans and other more complex models. This work has been published in Lebret et al. (2015b), and extended in Lebret et al. (2015a).
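As announced in the bag-of-semantic-concepts item above, here is a minimal sketch of that pipeline; the toy vocabulary, the random embeddings and K = 3 clusters are assumptions, and scikit-learn's KMeans merely stands in for whichever clustering implementation is actually used.

```python
# Minimal sketch (not the thesis code) of the bag-of-semantic-concepts model:
# n-gram embeddings (averaged word embeddings) are clustered with K-means,
# and a document is represented by a histogram over the K clusters.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
vocab = ["not", "good", "funny", "boring", "great", "movie"]   # toy vocabulary
E = {w: rng.normal(size=16) for w in vocab}                    # toy embeddings

def ngram_embedding(ngram):
    return np.mean([E[w] for w in ngram], axis=0)              # average the words

# All n-grams extracted from a (toy) training corpus.
ngrams = [("not", "good"), ("not", "funny"), ("great", "movie"),
          ("boring", "movie"), ("good", "movie"), ("funny",)]
X = np.array([ngram_embedding(g) for g in ngrams])

K = 3                                                          # number of semantic concepts
kmeans = KMeans(n_clusters=K, n_init=10, random_state=0).fit(X)

def bag_of_semantic_concepts(doc_ngrams):
    """K-dimensional histogram of the concepts present in a document."""
    hist = np.zeros(K)
    assignments = kmeans.predict(np.array([ngram_embedding(g) for g in doc_ngrams]))
    for c in assignments:
        hist[c] += 1
    return hist

print(bag_of_semantic_concepts([("not", "good"), ("boring", "movie")]))
```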

1.4 Thesis Outline

The rest of the thesis is structured as follows:

• Chapter 2, Background. We first introduce the standard feed-forward neural network, with the specific layers for NLP. We then describe the standard machine learning approaches for text document classification. Finally, we describe language modeling, which is at the core of sentence generation.

• Chapter 3, State-of-the-art Word Representations. We provide information about other approaches for computing word embeddings. We start with a clustering method; we then describe the main methods based on neural network language models.

• Chapter 4, Word Embeddings through Hellinger PCA. This chapter introduces our proposed model for getting word embeddings. We conduct an exploratory analysis of the parameters needed to build the word co-occurrence matrix, relying on benchmark datasets of word similarities and word analogies. We then describe the different techniques for performing the Hellinger PCA. Finally, we compare the obtained embeddings with other state-of-the-art methods on two classical NLP tasks: chunking and named entity recognition.

• Chapter 5, Towards Phrase Embeddings. We introduce a model that jointly learns word embeddings and a function for combining them to compute phrase embeddings. We also describe a new dataset extracted from Wikipedia, which contains noun phrases and verbal phrases used to train and test the embeddings.

• Chapter 6, Sentiment Classification with Convolutional Neural Network. This chapter describes a convolutional neural network for text document classification. We then evaluate the proposed model in sentiment classification of short (tweets) and long (movie reviews) documents.

• Chapter 7, N-gram-Based Model for Compact Document Representation. This chapter gives an alternative to classical BOW models for document representations, leveraging our phrase embeddings. It introduces the bag of semantic concepts model, which adds semantic information coming from n-grams, while offering compact document representations. We evaluate the proposed model on sentiment classification of a small and a large dataset of movie reviews, as well as text news classification.

• Chapter 8, Phrase-based Image Captioning. In this chapter, we propose a phrase-based model for image captioning. Based on the analysis of captions from two large datasets, we propose to decompose the sentences describing the images into noun, verbal and prepositional phrases. Thanks to our embedding model described in Chapter 5, we introduce a bilinear model which learns a common semantic space between phrases and images. We then use our prior knowledge about the caption syntax to define a constrained language model which produces sentences from a given set of top-ranked phrases.

• Chapter 9, Generating Text from Structured Data. This chapter presents an NNLM approach for generating text from Wikipedia infoboxes. We introduce a statistical language model with both local and global conditioning, as well as its neural network version. We also describe the dataset of biographies extracted from Wikipedia, which is used to train and test this model.


• Chapter 10, Conclusion. This last chapter concludes this thesis and proposes directions for further research in NLP using word and phrase embeddings.

2 Background

This thesis mainly focuses on machine learning techniques based on neural networks for tackling two tasks in NLP: document classification and sentence generation. In this chapter, we first introduce neural network models with a focus on how they deal with natural language. We then describe the classical approaches for text document classification. Sentence generation is often possible thanks to language modeling, which is detailed in the last section, going from statistical language models to neural network language models.

2.1 Neural Network Models for Natural Language Processing

Machine learning algorithms in NLP are traditionally based on linear models trained over very high dimensional yet very sparse feature vectors. Inspired by the success of neural networks in computer vision, nonlinear neural network models have recently emerged for tackling NLP problems. This requires a paradigm shift in the way the models deal with features. Fully connected feed-forward neural networks learn over dense inputs, which requires switching from classical sparse feature vectors to continuous word vector features, a.k.a. word embeddings.

2.1.1 Mathematical Notations

We consider a neural network φ_θ(·), with parameters θ. Any feed-forward neural network with L layers can be seen as a composition of functions φ^l_θ(·), corresponding to each layer l:

φ_θ(·) = φ^L_θ(φ^{L−1}_θ(... φ^1_θ(·) ...)).   (2.1)

For NLP, the first layer is the embedding layer, the final layer is the output layer, and the in-between layers are called the hidden layers.

In this thesis, bold upper-case letters represent matrices (X, Y, Z), and bold lower-case letters represent vectors (a, b, c). A_i represents the i-th row of matrix A, and [A]_{i,j} is the scalar at row i and column j. For a vector v, we denote [v]_i the scalar at index i in the vector. Unless otherwise stated, vectors are assumed to be column vectors. We use [v_1; v_2] to denote vector concatenation.
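For readers who prefer code, these conventions map directly onto array operations; the following numpy snippet is only an illustration and is not part of the thesis.

```python
# Small numpy illustration (not part of the thesis) of the notation above.
import numpy as np

A = np.arange(6).reshape(2, 3)      # a matrix
v1, v2 = np.array([1., 2.]), np.array([3., 4.])

row_i = A[0]                        # A_i: the i-th row of A
scalar_ij = A[0, 2]                 # [A]_{i,j}: scalar at row i, column j
scalar_i = v1[1]                    # [v]_i: scalar at index i of a vector
concat = np.concatenate([v1, v2])   # [v1; v2]: vector concatenation

print(row_i, scalar_ij, scalar_i, concat)
```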

2.1.2 Feed-forward Neural Networks

Single-layer perceptron

The simplest neural network is the single-layer perceptron, which is a linear function of its inputs x ∈ R^{d_in}:

φθ = Wx + b (2.2)

out× in out where θ = {W,b} are the trainable parameters with W ∈ Rd d the weight matrix, and b ∈ Rd a bias term.

Multi-layer perceptron

Nonlinear problems are addressed by introducing a nonlinear hidden layer l:

φ_θ^l = W^l h(φ_θ^{l−1}) + b^l   (2.3)

where h(·) is a nonlinear function (also called a nonlinearity or an activation function) that is applied element-wise to the previous layer output φ_θ^{l−1} ∈ R^{d^{l−1}}. W^l ∈ R^{d^l × d^{l−1}} and b^l ∈ R^{d^l} are the weight matrix and the bias term of the next linear transformation. Networks with more than one hidden layer are usually named deep neural networks.

Output layer

A neural network outputs a d^{out}-dimensional vector. When d^{out} = 1, such a network outputs only a scalar and can be used for regression (the output corresponds to a score), or for binary classification (the class depends on the sign of the output). For a k-class classification problem, it outputs a d^{out} = k > 1 dimensional vector where each dimension is associated with a class (the maximal value corresponds to the predicted class). The output can also be interpreted as a distribution over class assignments (all entries are positive and sum to one), when using a softmax transformation of the output layer (see Section 2.3.3 for more details).

Common nonlinearities

Several nonlinear functions h(·) exist. The choice among them is largely an empirical question.


Sigmoid The sigmoid activation function σ(·) transforms each value x into the range [0,1].

σ(x) = 1 / (1 + e^{−x})   (2.4)

Hyperbolic tangent (Tanh) The hyperbolic tangent transforms each value x into the range [−1,1].

Tanh(x) = (e^{2x} − 1) / (e^{2x} + 1)   (2.5)

Hard-Tanh It is a “hard” version of the hyperbolic tangent. It has the advantage of being slightly cheaper to compute, while leaving the generalization performance unchanged (Collobert, 2004).

HardTanh(x) = −1 if x < −1,  x if −1 ≤ x ≤ 1,  1 if x > 1.   (2.6)
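
For concreteness, the following minimal NumPy sketch (toy layer sizes and random weights, purely for illustration) computes the forward pass of a two-layer perceptron with a Tanh hidden layer, alongside the sigmoid and hard-tanh nonlinearities defined above.

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))            # Equation (2.4)

    def hard_tanh(x):
        return np.clip(x, -1.0, 1.0)               # Equation (2.6)

    d_in, d_hid, d_out = 4, 8, 3                   # arbitrary dimensions for the example
    rng = np.random.default_rng(0)
    W1, b1 = rng.normal(size=(d_hid, d_in)), np.zeros(d_hid)
    W2, b2 = rng.normal(size=(d_out, d_hid)), np.zeros(d_out)

    x = rng.normal(size=d_in)                      # input feature vector
    h = np.tanh(W1 @ x + b1)                       # hidden layer, Equation (2.3) with h = Tanh
    y = W2 @ h + b2                                # output layer: one score per class
    print(y, sigmoid(y), hard_tanh(y))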

2.1.3 Transforming Words into Feature Vectors

When dealing with natural language, the input x encodes features such as words, part-of-speech tags or other linguistic information. Conventionally, supervised lexicalized NLP approaches take a word and convert it to an index, which is then transformed into a feature vector f using a one-hot representation. We consider a fixed-size word vocabulary W. Given a word w_t ∈ W, f_t is a |W|-dimensional vector where all entries are zeros except at the t-th feature, whose value is 1:

f_t = (0, 0, 0, . . . , 1, . . . , 0, 0, 0)^T ∈ R^{|W|},  with the 1 at index t.   (2.7)

The core idea of neural network based models is to stop representing features as one-hot vectors and to represent them instead as dense vectors, the so-called word embeddings.

Embedding layer

A layer is thus defined to transform the one-hot representations into word embeddings. Given a sequence of T words {w_1, w_2, . . . , w_T}, each word w_t ∈ W is embedded into a d^{wrd}-dimensional vector space, by applying the following operation:

φ_θ(w_t) = E f_t   (2.8)


where the matrix E ∈ R^{d^{wrd} × |W|} represents all the word embeddings to be learned in this layer, just as the other parameters of the network. In practice, we use a lookup table to replace this computation with a simpler array indexing operation, where E_{w_t} ∈ R^{d^{wrd}} corresponds to the embedding of the word w_t. This lookup table operation is then applied for each word in the sequence. A common approach is to concatenate all resulting word embeddings:

φ_θ(w_1, w_2, . . . , w_T) = (E_{w_1}; E_{w_2}; . . . ; E_{w_T}) ∈ R^{(d^{wrd} × T)}.   (2.9)

This vector can then be fed to further neural network layers.
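
A minimal sketch of the lookup-table operation of Equations 2.8 and 2.9, assuming a toy vocabulary and randomly initialized embeddings (in a real model E would be learned together with the other parameters).

    import numpy as np

    vocab = {"the": 0, "cat": 1, "sat": 2}        # toy word-to-index mapping
    d_wrd = 5
    rng = np.random.default_rng(0)
    E = rng.normal(size=(d_wrd, len(vocab)))      # embedding matrix, one column per word

    def embed(words):
        # array indexing plays the role of the product E f_t in Equation (2.8)
        vectors = [E[:, vocab[w]] for w in words]
        return np.concatenate(vectors)            # concatenation as in Equation (2.9)

    x = embed(["the", "cat", "sat"])              # vector of size d_wrd * T = 15
    print(x.shape)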

2.2 Document Classification

In document classification, we are given a description d ∈ D of a document, where D is the document space; and a fixed set of classes Y = {y1, y2,...,ym}. Typically, the document space D is represented as a high-dimensional space, the so-called bag-of-words model. Such representations are then used to learn classification models or topic models.

2.2.1 Bag-of-words Model

With a bag-of-words model, a document is represented as the bag of its words. Documents are first pre-processed to avoid unnecessarily large feature vectors. Usual pre-processing methods involve filtering out meaningless words (e.g. it, a, the, that), reducing the number of distinct words by only keeping word stems, and performing lowercase conversion. By defining this word feature list W to consider, a document d ∈ D is represented as a |W|-entry vector v_d where each dimension corresponds to a separate word:

v_d = (w_1, w_2, . . . , w_{|W|})^T.   (2.10)

Each word w_i ∈ W can then be transformed into a real-valued feature, in order to train classifiers such as MaxEnt or SVM. The simplest approach is to use binary values (whether the word appears in the document or not):

w_i = 0 if w_i ∉ d,  1 if w_i ∈ d.   (2.11)

Another possibility is to use the frequency of the word w_i in d, called term frequency (tf_{w_i}). Some term weightings have also been defined to reflect how important a word is for a document. A popular method is the term frequency-inverse document frequency (TF-IDF) model (a small computational sketch is given after the definitions below):

w_i = tf_{w_i} × log( |D| / |{d ∈ D : w_i ∈ d}| ),   (2.12)

where

• tf_{w_i} is the term frequency of word w_i in document d (a local parameter),

• log( |D| / |{d ∈ D : w_i ∈ d}| ) is the inverse document frequency (a global parameter). |D| is the total number of documents in the document set; |{d ∈ D : w_i ∈ d}| is the number of documents containing the word w_i.
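
The sketch announced above (pure Python, with an invented toy corpus) computes the TF-IDF weights of Equation 2.12; a real system would typically add the pre-processing steps mentioned earlier.

    import math
    from collections import Counter

    docs = [["movie", "great", "great"],          # toy pre-processed documents
            ["movie", "boring"],
            ["great", "plot"]]

    df = Counter(w for d in docs for w in set(d)) # document frequency of each word

    def tfidf(doc):
        tf = Counter(doc)                          # term frequencies tf_w in this document
        return {w: tf[w] * math.log(len(docs) / df[w]) for w in tf}

    print(tfidf(docs[0]))                          # weights for 'movie' and 'great'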

2.2.2 Topic Models

In topic modeling, the objective is to discover the latent topics that occur in a collection of documents. Using such techniques, documents are represented in much lower dimensional spaces than with bag-of-words models, while modeling the hidden semantic structures. These representations are usually ideal for document indexing, but they can also be used as an alternative to bag-of-words models for document classification. Topic models are either based on probabilistic models (e.g. Latent Dirichlet Allocation), or can be derived from distributional semantic modeling (e.g. Latent Semantic Analysis). Such latent variable models are approaches to unsupervised learning.

Latent Dirichlet Allocation

The idea behind Latent Dirichlet Allocation (LDA) (Blei et al., 2003) is to model documents as a mixture of various topics, where a topic is defined to be a distribution over a fixed vocabulary of words, with a Dirichlet prior. Assuming that the collection of documents contains K topics, each document exhibits these topics with different proportions. More formally, LDA defines a hidden variable model of documents. The observed data are the words of each document and the hidden variables represent the latent topical structure, i.e., the topics themselves and how each document exhibits them. The interaction between the observed documents and hidden topic structure emerges from the probabilistic generative process associated with LDA, the imaginary random process that is assumed to have produced the observed data.

Latent Semantic Analysis

Latent Semantic Analysis (LSA) (Landauer and Dumais, 1997) measures occurrence frequencies of words in documents to discover a set of hidden topics (i.e. concepts). LSA writes these frequencies as a term-document matrix: a sparse matrix whose rows correspond to terms and whose columns correspond to documents. Each entry of the matrix is then transformed using a weighting scheme, such as TF-IDF. A mathematical technique called singular value decomposition (SVD) is then used to reduce the number of rows while preserving the similarity structure among columns. After selecting the K largest singular values, each document is then expressed as a K-dimensional vector, where each entry gives the degree of participation of the document in the corresponding topic.
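
A small NumPy sketch of this procedure: take the SVD of a (weighted) term-document matrix and keep the K largest singular values; the rows of the truncated factors give the topic representations. The matrix here is random, only to keep the example self-contained.

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.random((1000, 50))                  # term-document matrix (terms x documents), e.g. TF-IDF weighted
    K = 10                                       # number of latent topics to keep

    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    doc_topics = (np.diag(s[:K]) @ Vt[:K]).T     # each row: a document expressed over K topics
    term_topics = U[:, :K]                       # each row: a term expressed over K topics
    print(doc_topics.shape, term_topics.shape)   # (50, 10) (1000, 10)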


2.2.3 Classification Models

Given a training set of m hand-labeled documents (d_1, y_1), . . . , (d_m, y_m), we wish to learn a classifier γ that maps documents to classes:

γ : D → Y . (2.13)

This type of learning is called supervised learning. Successful classifiers for text classification are Naive Bayes (NB), Maximum Entropy (MaxEnt), and Support Vector Machine (SVM).

Naive Bayes Classifier

Naive Bayes is a probabilistic learning method which has been shown to work well on text classification (Manning and Schütze, 1999). The probability of a document d being in class y is computed as:

P(y|d) ∝ P(y) ∏_{w_i ∈ d} P(w_i|y)   (2.14)

where P(w_i|y) is the conditional probability of word w_i occurring in a document of class y, and P(y) is the prior probability of a document occurring in class y. Bayesian classifiers attempt to build a probabilistic classifier based on modeling the underlying word features in different classes. The classification of text documents is then based on the posterior probability of the documents belonging to the different classes, on the basis of the words present in the documents. Given a new unlabeled document d, the objective is to find the best class for this document, which is the most likely or maximum a posteriori (MAP) class y_MAP:

y_MAP = argmax_{y ∈ Y} P̃(y|d) = argmax_{y ∈ Y} P̃(y) ∏_{w_i ∈ d} P̃(w_i|y)   (2.15)

where the probabilities P̃ are estimated from the training set.
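
A minimal sketch of the MAP decision rule of Equation 2.15, estimated from an invented toy training set; add-one smoothed word likelihoods (anticipating Section 2.3.2) are used only to avoid zero probabilities.

    import math
    from collections import Counter, defaultdict

    train = [(["great", "movie"], "pos"), (["boring", "movie"], "neg"),
             (["great", "plot"], "pos")]
    vocab = {w for doc, _ in train for w in doc}

    prior = Counter(y for _, y in train)                 # class counts for P(y)
    word_counts = defaultdict(Counter)                   # word counts per class for P(w|y)
    for doc, y in train:
        word_counts[y].update(doc)

    def log_posterior(doc, y):
        total = sum(word_counts[y].values())
        lp = math.log(prior[y] / len(train))
        for w in doc:
            lp += math.log((word_counts[y][w] + 1) / (total + len(vocab)))  # add-one smoothing
        return lp

    doc = ["great", "great", "plot"]
    print(max(prior, key=lambda y: log_posterior(doc, y)))   # MAP class, here 'pos'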

Maximum Entropy Classifier

The idea behind Maximum Entropy (MaxEnt) models is that one should prefer the most uniform model that satisfies a given constraint (Nigam, 1999). Unlike the Naive Bayes classifier, the MaxEnt classifier does not assume that the features are conditionally independent of each other. Additional features such as phrases can thus be included in the model without worrying about overlapping features.

We define a feature f_i(d, y) as any real-valued function of the document d and the class y. The expected value of feature f_i with respect to the empirical distribution P̃(d, y)

is:

∑_{d ∈ D} ∑_{y ∈ Y} P̃(d, y) f_i(d, y).   (2.16)

The probability P̃(d) of the document d being chosen from the training data is:

P̃(d) = 1 / |D|.   (2.17)

Because each document from the training data has been labeled, it is clear that:

P̃(y|d) = 0 if y ≠ y(d),  1 if y = y(d),   (2.18)

where y(d) ∈ Y denotes the class which is assigned to d by the expert. The expected value of feature f_i in Equation 2.16 can be rewritten as follows:

(1/|D|) ∑_{d ∈ D} f_i(d, y(d))   (2.19)

The expected value of feature f_i with respect to the model P(y|d) is equal to:

∑_{d ∈ D} ∑_{y ∈ Y} P̃(d) P(y|d) f_i(d, y).   (2.20)

Maximum entropy allows us to constrain the expected value to be equal to the empirical value:

(1/|D|) ∑_{d ∈ D} f_i(d, y(d)) = (1/|D|) ∑_{d ∈ D} ∑_{y ∈ Y} P(y|d) f_i(d, y).   (2.21)

Then, it can be shown that if we find the λ_1, . . . , λ_n parameters which maximize the dual problem, the probability for a document d to be classified as y is equal to:

P(y|d) = exp( ∑_i λ_i f_i(d, y) ) / ∑_{y' ∈ Y} exp( ∑_i λ_i f_i(d, y') ).   (2.22)

To classify a new document (given that the lambda parameters have been estimated), a MAP decision rule is used to select the category with the highest probability.

Support Vector Machines

Support Vector Machines (SVM) are a learning tool for solving two-class pattern recognition problems (Cortes and Vapnik, 1995). Given a set of training documents (v_1, y_1), . . . , (v_m, y_m) with v_i ∈ R^{|W|} and y_i ∈ {−1; 1}, we search for a separating hyper-plane that divides positive and negative samples from each other with maximal margin. An SVM is trained via the following optimization problem:

argmin_w (1/2) ‖w‖^2 + C ∑_i ξ_i,   (2.23)

with constraints

y_i (v_i · w + b) ≥ 1 − ξ_i,  ξ_i ≥ 0,  ∀i,   (2.24)

where the weight vector w and the constant b are learned, and C is a penalty parameter. Joachims (1998) sums up why SVM are appropriate for learning text classifiers:

• High dimensional input space: SVM use over-fitting protection.

• Few irrelevant features: SVM do not need an aggressive feature selection.

• Document vectors are sparse: each document vector v_d contains only a few non-zero entries.

• Most text categorization problems are linearly separable.

There are nonlinear extensions to the SVM, but the linear kernel usually outperforms nonlinear kernels in text classification (Yang and Liu, 1999).

In the multiclass classification framework, the standard approach is to reduce the single multiclass problem to multiple binary classification problems. Two strategies can be envisaged for building such binary classifiers:

1. one-versus-all which distinguishes between one of the classes and the rest,

2. one-versus-one which distinguishes between every pair of classes.

2.2.4 Evaluation Metrics

For classification, the terms true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN) compare the results of the classifier under test with a trusted external judgment. The terms positive and negative refer to the classifier’s prediction, while true and false refer to whether that prediction agrees with the external judgment.


Accuracy

The accuracy is the proportion of true results (both true positives and true negatives) among the total number of cases examined.

Accuracy = (TP + TN) / (TP + FP + TN + FN)   (2.25)

Precision

Precision measures the exactness of a classifier. A higher precision means fewer false positives, while a lower precision means more false positives. This is often at odds with recall, as an easy way to improve precision is to decrease recall.

Precision = TP / (TP + FP)   (2.26)

Recall

Recall measures the completeness, or sensitivity, of a classifier. Higher recall means fewer false negatives, while lower recall means more false negatives. Improving recall can often decrease precision because it gets increasingly harder to be precise as the sample space increases.

Recall = TP / (TP + FN)   (2.27)

F-1 score

Precision and recall can be combined to produce a single metric known as the F-1 score. It can be interpreted as a weighted average of the precision and recall. The F-1 score is the harmonic mean of precision and recall:

F1 = 2 · (Precision · Recall) / (Precision + Recall)   (2.28)
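
A small helper computing Equations 2.25 to 2.28 from gold and predicted labels (binary case, with an assumed label string standing for the positive class).

    def binary_metrics(gold, pred, positive="positive"):
        tp = sum(g == positive and p == positive for g, p in zip(gold, pred))
        tn = sum(g != positive and p != positive for g, p in zip(gold, pred))
        fp = sum(g != positive and p == positive for g, p in zip(gold, pred))
        fn = sum(g == positive and p != positive for g, p in zip(gold, pred))
        accuracy = (tp + tn) / (tp + fp + tn + fn)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        return accuracy, precision, recall, f1

    print(binary_metrics(["positive", "negative", "positive"],
                         ["positive", "positive", "negative"]))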

2.3 Language Models

Language modeling (LM) is a central task in NLP for problems involving natural language generation. Given a sentence s = w_1, . . . , w_T composed of T words from a vocabulary W, a language model estimates:

P(s) = ∏_{t=1}^{T} P(w_t | w_1, . . . , w_{t−1}).   (2.29)

2.3.1 N-gram Models

Let c_t = w_{t−(n−1)}, . . . , w_{t−1} be the sequence of n − 1 context words preceding w_t ∈ s. In an n-gram language model, Equation 2.29 is approximated as

P(s) ≈ ∏_{t=1}^{T} P(w_t | c_t),   (2.30)

assuming an order-n Markov property. Typically, n is taken to be two or three, corresponding to a bigram or trigram model, respectively. The probabilities P(w_t | c_t) are estimated over a large corpus of text (called training data),

P_ML(w_t | c_t) = P(c_t w_t) / P(c_t)   (2.31)
             = n(c_t w_t) / ∑_{w_t} n(c_t w_t),   (2.32)

where n(α) denotes the number of times the string α occurs in the text. This is called the maximum likelihood (ML) estimate of P(w_t | c_t). Although maximum likelihood estimation is intuitive, it performs poorly with small training data (Chen and Goodman, 1996).
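
A short sketch of the count-based estimate of Equation 2.32 for a bigram model (n = 2), on an invented toy corpus.

    from collections import Counter

    corpus = "the cat sat on the mat the cat ran".split()

    bigrams = Counter(zip(corpus, corpus[1:]))      # n(c_t w_t) for n = 2
    unigrams = Counter(corpus[:-1])                 # n(c_t), context counts

    def p_ml(word, context):
        return bigrams[(context, word)] / unigrams[context]

    print(p_ml("cat", "the"))   # 2/3: "the" is followed twice by "cat", once by "mat"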

2.3.2 Smoothing Models

When the amount of training data is insufficient, there are many events α such that n(α) = 0 in Equation 2.31. The ML estimate is therefore P_ML(w_t | c_t) = 0. Even if an event has never been observed in training data, it does not mean it will never occur at inference time. Smoothing methods have thus been introduced to deal with data sparsity.

Add-one smoothing

Also called Laplace smoothing (Laplace, 1820), this simple smoothing technique just adds one to all the counts:

P_{+1}(w_t | c_t) = (n(c_t w_t) + 1) / (n(c_t) + |W|)   (2.33)

It prevents zero probabilities, but tends to reassign too much mass to unseen events.


Good-Turing estimation

The Good-Turing estimate (Good, 1953) is central to many smoothing techniques. The idea is to re-estimate the amount of probability mass of n-grams with zero or low counts by looking at the number of n-grams with higher counts. Let n_r be the number of n-grams that occur r times; an adjusted count r* is computed as follows:

r* = (r + 1) n_{r+1} / n_r.   (2.34)

For an n-gram ct wt with r counts, this count is then converted into a probability:

P_GT(c_t w_t) = r* / N,   (2.35)

where N = ∑_{r=1}^{∞} r n_r. In practice, the Good–Turing estimate is not used by itself for n-gram smoothing, because it does not include the combination of higher-order models with lower-order models necessary for good performance.

Interpolation methods

Higher and lower order n-gram models have different strengths and weaknesses. High-order n-grams are sensitive to more context, but have sparse counts. Low-order n-grams consider only very limited context, but have robust counts. The idea with interpolation methods is to combine them. For instance in a trigram model, the estimate is defined as follows:

P(w_t | w_{t−2}, w_{t−1}) = λ_1 × P_ML(w_t | w_{t−2}, w_{t−1}) + λ_2 × P_ML(w_t | w_{t−1}) + λ_3 × P_ML(w_t),   (2.36)

where λ_1, λ_2 and λ_3 are three additional parameters with 0 ≤ λ_i ≤ 1 and ∑_i λ_i = 1. The λ_i can be estimated on some held-out data set using the EM algorithm.

Kneser–Ney smoothing

Kneser-Ney (KN) smoothing is considered the most effective method for both higher and lower order n-gram models. This method is optimized to give high probability to the lower-order model only when the count is small or zero in the higher-order model. Considering the bigram San Francisco, we want to give Francisco a low unigram probability because it mostly occurs after San. For each unigram w_t, we count the number of different words that it follows, normalized by this count summed over all words:

P_CONTINUATION(w_t) = |{w_{t−1} : n(w_{t−1} w_t) > 0}| / ∑_{w_t} |{w_{t−1} : n(w_{t−1} w_t) > 0}|.   (2.37)

21 Chapter 2. Background

For bigrams, the estimate is defined as follows:

P_KN(w_t | w_{t−1}) = max( n(w_{t−1} w_t) − δ, 0 ) / n(w_{t−1}) + λ_{w_{t−1}} P_CONTINUATION(w_t).   (2.38)

The parameter δ is a constant which denotes the discount value subtracted from the count of each n-gram, usually between 0 and 1. λ is a normalizing constant which defines the probability mass that has been discounted:

λ_{w_{t−1}} = ( δ / n(w_{t−1}) ) |{w_t : n(w_{t−1} w_t) > 0; ∀w_t ∈ W}|.   (2.39)

For n-grams, we have by recursion:

P_KN(w_t | c_t) = max( n(c_t w_t) − δ, 0 ) / n(c_t) + ( δ / n(c_t) ) |{w_t : n(c_t w_t) > 0}| P_KN(w_t | w_{t−(n−2)}, . . . , w_{t−1}).   (2.40)
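
The bigram case (Equations 2.37 to 2.39) can be written down directly; the sketch below uses a fixed discount δ on an invented toy corpus and is only meant to make the quantities concrete, not to reproduce a full modified Kneser–Ney implementation.

    from collections import Counter

    corpus = "san francisco is in california san francisco is foggy".split()
    bigrams = Counter(zip(corpus, corpus[1:]))
    unigram_counts = Counter(corpus[:-1])
    delta = 0.75

    # continuation probability: in how many distinct bigram types does w appear as second word?
    continuation = Counter(w2 for (w1, w2) in bigrams)
    total_bigram_types = len(bigrams)

    def p_continuation(w):
        return continuation[w] / total_bigram_types          # Equation (2.37)

    def p_kn(w, prev):
        discounted = max(bigrams[(prev, w)] - delta, 0) / unigram_counts[prev]
        # probability mass reserved for the continuation distribution, Equation (2.39)
        lambda_prev = delta * len([1 for (w1, _) in bigrams if w1 == prev]) / unigram_counts[prev]
        return discounted + lambda_prev * p_continuation(w)  # Equation (2.38)

    print(p_kn("francisco", "san"), p_kn("california", "san"))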

2.3.3 Neural Network Language Models

Neural network language models (NNLM) address the n-gram data sparsity issue by parameterizing words as vectors (word embeddings) and using them as inputs to a neural network (Bengio et al., 2003). By applying the softmax function (Bridle, 1990) to the output layer of the network with parameters θ, the conditional distribution corresponding to context c_t, P(w_t | c_t), is defined as

P_θ(w_t | c_t) = exp( φ_θ(w_t, c_t) ) / ∑_{i=1}^{|W|} exp( φ_θ(w_i, c_t) ) = exp( φ_θ(w_t, c_t) ) / Z_θ(c_t),  φ_θ ∈ R^{|W|},   (2.41)

where φ_θ(w_t, c_t) is the output score that quantifies the compatibility of word w_t with context c_t, and Z_θ(c_t) is the partition function that normalizes these scores into a probability distribution. Word embeddings obtained through NNLM exhibit the property that semantically close words are likewise close in the induced vector space. This property will be discussed in more detail in Section 3.2.

Maximum likelihood learning

NNLM are trained by maximizing a likelihood over the training data, using stochastic gradient descent. This learning procedure amounts to optimizing the cross-entropy between the target probability distribution (e.g., the target word that should be predicted) and the model predictions.

We can express the log-likelihood for a single word/context observation (wt ,ct ) as follows:

L(wt ,ct ;θ) = φθ(wt ,ct ) − log Zθ(ct ) (2.42)


Because it requires computing φ_θ(w_t, c_t) for all words in the vocabulary W, maximum likelihood training of neural language models is tractable but tends to be very slow and expensive.
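
A small sketch making Equations 2.41 and 2.42 concrete: given raw scores φ_θ(w, c_t) for every word in a tiny, invented vocabulary, compute the softmax distribution and the log-likelihood of the target word. The cost that makes training slow is visible in the normalization over all |W| scores.

    import numpy as np

    rng = np.random.default_rng(0)
    vocab_size = 10                        # tiny vocabulary for the example
    scores = rng.normal(size=vocab_size)   # phi_theta(w_i, c_t) for every word w_i

    def log_softmax(scores):
        # numerically stable log of Equation (2.41)
        shifted = scores - scores.max()
        return shifted - np.log(np.exp(shifted).sum())

    target = 3                                     # index of the observed word w_t
    log_likelihood = log_softmax(scores)[target]   # Equation (2.42)
    print(log_likelihood)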

2.3.4 Recurrent Neural Networks

Figure 2.1 – Simple recurrent neural network (Mikolov et al., 2010).

Unlike standard feed-forward neural networks, recurrent networks retain a state that can represent information from an arbitrarily long context window. This feature can be useful in language generation for modeling long-term dependencies.

Figure 2.1 illustrates a simple recurrent neural network (RNN) (Mikolov et al., 2010). The input layer x_t represents the input word at time t, y_t is the output layer, and the hidden layer h_t (also called context layer or state) maintains a representation of the sentence history. The layers are connected with weight matrices θ = {U, W, V}. The values in the hidden and output layers are computed as follows:

h_t = σ( U x_t + W h_{t−1} ),   (2.43)
y_t = g( V h_t ),   (2.44)

where σ(·) is the sigmoid activation function (see Section 2.1.2), and g(z) is the softmax function:

g(z_m) = e^{z_m} / ∑_k e^{z_k}.   (2.45)


As for NNLM, the model is trained using standard backpropagation to maximize the data conditional likelihood:

L(θ) = ∏_t P( y_t | x_1 . . . x_t ).   (2.46)
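
A minimal NumPy sketch of Equations 2.43 to 2.45, iterating the recurrence over a toy sequence of one-hot inputs (the sizes and weights are invented for illustration).

    import numpy as np

    rng = np.random.default_rng(0)
    vocab_size, hidden_size = 10, 8
    U = rng.normal(scale=0.1, size=(hidden_size, vocab_size))
    W = rng.normal(scale=0.1, size=(hidden_size, hidden_size))
    V = rng.normal(scale=0.1, size=(vocab_size, hidden_size))

    def softmax(z):
        e = np.exp(z - z.max())
        return e / e.sum()                           # Equation (2.45)

    h = np.zeros(hidden_size)                        # initial state
    for word_index in [1, 4, 2]:                     # a toy input word sequence
        x = np.zeros(vocab_size)
        x[word_index] = 1.0                          # one-hot input x_t
        h = 1.0 / (1.0 + np.exp(-(U @ x + W @ h)))   # Equation (2.43)
        y = softmax(V @ h)                           # Equation (2.44): distribution over the next word
    print(y.argmax(), y.sum())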

Long Short-Term Memory

Figure 2.2 – Long Short-term Memory Cell (Graves, 2013).

In practice, standard RNN encounter difficulties in storing information about long-past inputs (Hochreiter et al., 2001). Long Short-Term Memory (LSTM) networks (Hochreiter and Schmidhuber, 1997) have therefore been introduced to provide a solution to the exploding and vanishing gradient problem (Bengio et al., 1994).

Figure 2.2 illustrates a single LSTM memory cell (Graves, 2013). Such an LSTM cell contains gates that determine when the input is significant enough to remember, when it should continue to remember or forget the value, and when it should output the value. This memory cell is meant to substitute for the hidden layer in a traditional RNN. The hidden layer is thus

implemented by the following composite functions:

i_t = σ( W_{xi} x_t + W_{hi} h_{t−1} + W_{ci} c_{t−1} )   (2.47)
f_t = σ( W_{xf} x_t + W_{hf} h_{t−1} + W_{cf} c_{t−1} )   (2.48)
c_t = f_t c_{t−1} + i_t Tanh( W_{xc} x_t + W_{hc} h_{t−1} )   (2.49)
o_t = σ( W_{xo} x_t + W_{ho} h_{t−1} + W_{co} c_t )   (2.50)

h_t = o_t Tanh(c_t)   (2.51)

where σ(·) and Tanh(·) are activation functions (see Section 2.1.2), and i, f, o and c are respectively the input gate, forget gate, output gate and cell activation vectors, all of which are the same size as the hidden vector h. The weight matrix subscripts have the obvious meaning, for example W_{hi} is the hidden-input gate matrix, W_{xo} is the input-output gate matrix, etc. The weight matrices from the cell to the gate vectors (e.g. W_{ci}) are diagonal, so element m in each gate vector only receives input from element m of the cell vector. The bias terms (which are added to i, f, c and o) have been omitted for clarity.
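
A minimal NumPy sketch of Equations 2.47 to 2.51, with the diagonal cell-to-gate connections implemented as element-wise products and bias terms omitted as in the text (weights and sizes are invented for illustration).

    import numpy as np

    rng = np.random.default_rng(0)
    d_x, d_h = 4, 6
    def mat(rows, cols): return rng.normal(scale=0.1, size=(rows, cols))

    Wxi, Whi, Wxf, Whf = mat(d_h, d_x), mat(d_h, d_h), mat(d_h, d_x), mat(d_h, d_h)
    Wxc, Whc, Wxo, Who = mat(d_h, d_x), mat(d_h, d_h), mat(d_h, d_x), mat(d_h, d_h)
    wci, wcf, wco = (rng.normal(scale=0.1, size=d_h) for _ in range(3))  # diagonal cell-to-gate weights

    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

    def lstm_step(x, h_prev, c_prev):
        i = sigmoid(Wxi @ x + Whi @ h_prev + wci * c_prev)    # input gate,  Eq. (2.47)
        f = sigmoid(Wxf @ x + Whf @ h_prev + wcf * c_prev)    # forget gate, Eq. (2.48)
        c = f * c_prev + i * np.tanh(Wxc @ x + Whc @ h_prev)  # cell state,  Eq. (2.49)
        o = sigmoid(Wxo @ x + Who @ h_prev + wco * c)         # output gate, Eq. (2.50)
        return o * np.tanh(c), c                              # hidden state, Eq. (2.51)

    h, c = np.zeros(d_h), np.zeros(d_h)
    for _ in range(3):                                        # run the cell on a few random inputs
        h, c = lstm_step(rng.normal(size=d_x), h, c)
    print(h.shape, c.shape)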

2.3.5 Evaluation Metrics

Perplexity

A very common method to measure the quality of a language model is to evaluate the perplexity of the model on some held-out data. Given a new test sentence s = w1,...,wT , we can measure its probability P(s) under the language model.

The cross-entropy over a sentence is then:

H(P) = −(1/T) log P(s)   (2.52)
     = −(1/T) ∑_{t=1}^{T} log P(w_t | w_1, . . . , w_{t−1})

The perplexity for that sentence is 2^{H(P)}.

Perplexity can be interpreted as a measure of, on average, how many different equally probable words can follow any given word. Lower perplexities indicate better language models. However, evaluating language models only with perplexity is not optimal, as in real-world applications the data may not be drawn from the same distribution as the test set.
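
For concreteness, the sketch below computes Equation 2.52 and the corresponding perplexity from a list of per-word probabilities (the numbers are arbitrary stand-ins for P(w_t | w_1, . . . , w_{t−1})).

    import math

    word_probs = [0.2, 0.05, 0.1, 0.4]    # P(w_t | history) for each word of a test sentence

    cross_entropy = -sum(math.log2(p) for p in word_probs) / len(word_probs)  # Equation (2.52), base 2
    perplexity = 2 ** cross_entropy
    print(cross_entropy, perplexity)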

BLEU metric

BLEU (bilingual evaluation understudy) is an algorithm that was first introduced for evaluating machine translation systems (Papineni et al., 2002). But this measure can also be used for comparing candidate sentences against reference sentences in the same language.


BLEU is about n-gram precision. It measures the proportion of n-grams in the candidate text that also appear in a reference text.

To deal with over-generation of frequent words, BLEU uses a clipped notion of precision that only accepts as many instances of a word as actually appear in some reference text. Given a set of candidate sentences C ∈ {Candidates}, a modified n-gram precision score p_n is computed as follows:

p_n = ∑_{C ∈ {Candidates}} ∑_{n-gram ∈ C} Count_clip(n-gram) / ∑_{C' ∈ {Candidates}} ∑_{n-gram ∈ C'} Count(n-gram),   (2.53)

where Count_clip = min( Count, Max_Reference_Count ).

A high-scoring candidate sentence must also match the reference sentence in length. Candidate sentences longer than their references are already penalized by the modified n-gram precision measure. Consequently, only a brevity penalty (BP) factor is introduced. Let c be the length of the candidate sentence and r be the effective reference corpus length. The brevity penalty is computed as follows:

BP = 1 if c > r,  e^{1 − r/c} if c ≤ r.   (2.54)

Then, the BLEU score is the geometric mean of different sizes of n-gram (usually up to N = 4):

BLEU = BP · exp( (1/N) ∑_{n=1}^{N} log p_n ).   (2.55)
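
The sketch below computes the modified n-gram precisions of Equation 2.53 and the brevity penalty of Equation 2.54 for a single invented candidate/reference pair; full BLEU would accumulate these statistics over a whole corpus before combining them as in Equation 2.55.

    import math
    from collections import Counter

    def ngrams(tokens, n):
        return Counter(zip(*(tokens[i:] for i in range(n))))

    def modified_precision(candidate, reference, n):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        clipped = sum(min(count, ref[g]) for g, count in cand.items())   # Count_clip
        return clipped / max(sum(cand.values()), 1)

    def brevity_penalty(c, r):
        return 1.0 if c > r else math.exp(1 - r / c)                     # Equation (2.54)

    candidate = "the cat sat on the mat".split()
    reference = "the cat is on the mat".split()
    p = [modified_precision(candidate, reference, n) for n in (1, 2, 3, 4)]
    bleu = brevity_penalty(len(candidate), len(reference)) * math.exp(
        sum(math.log(pn) for pn in p if pn > 0) / 4)                     # Equation (2.55), zero precisions skipped
    print(p, bleu)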

Part I: Building Word Embeddings from Large Text Corpora


3 State-of-the-art Word Representations

In this chapter, we introduce the state-of-the-art methods for computing word embeddings. Clustering techniques have been employed in the past to obtain word representations in discrete form. A first section is devoted to describing Brown clustering, which is the main clustering algorithm for word representations. More recently, models based on NNLM have been proposed to compute distributed word representations. In a second section, we present the major methods, going from the pioneering work of Collobert and Weston (2008) to the most popular technique, the Skip-gram model from Mikolov et al. (2013a).

3.1 Brown Clustering

This algorithm was first introduced in (Brown et al., 1992) with the idea of partitioning a corpus vocabulary into clusters, where clusters (ideally) include semantically similar words.

Initially, each word in the vocabulary is considered to be in its own distinct cluster. The algorithm then repeatedly merges the pair of clusters which causes the smallest decrease in the likelihood of the corpus, according to a class-based bigram language model defined on the word clusters (Liang, 2005), as illustrated in Figure 3.1.

Given a sequence of words w1,...,wT ∈ W and a partition function C of the vocabulary into k


Figure 3.1 – The class-based bigram language model, which defines the quality of a clustering, represented as a Bayesian network.



Figure 3.2 – An example of a Brown word-cluster hierarchy. Each node in the tree is labeled with a bit string indicating the path from the root node to that node, where 0 indicates a left branch and 1 indicates a right branch.

classes, the likelihood model is defined as

P(w_1, . . . , w_T) = ∏_{t=1}^{T} P( w_t | C(w_t) ) P( C(w_t) | C(w_{t−1}) ).   (3.1)

More conveniently,

log P(w_1, . . . , w_T) = ∑_{t=1}^{T} log [ P( w_t | C(w_t) ) P( C(w_t) | C(w_{t−1}) ) ].   (3.2)

Let n(w, w′) be the number of times that word w precedes w′ in the corpus, and n(w) the number of times we see word w. The quality of a clustering C is measured as

Quality(C) = (1/T) ∑_{t=1}^{T} log P( C(w_t) | C(w_{t−1}) ) P( w_t | C(w_t) )
          = ∑_{w,w′} ( n(w, w′)/T ) log P( C(w′) | C(w) ) P( w′ | C(w′) )
          = ∑_{w,w′} ( n(w, w′)/T ) log [ n( C(w), C(w′) ) n(w′) / ( n( C(w) ) n( C(w′) ) ) ]
          = ∑_{w,w′} ( n(w, w′)/T ) log [ n( C(w), C(w′) ) × T / ( n( C(w) ) n( C(w′) ) ) ] + ∑_{w,w′} ( n(w, w′)/T ) log ( n(w′)/T )
          = ∑_{c,c′} ( n(c, c′)/T ) log [ n(c, c′) × T / ( n(c) n(c′) ) ] + ∑_{w} ( n(w)/T ) log ( n(w)/T )
          = ∑_{c,c′} P(c, c′) log [ P(c, c′) / ( P(c) P(c′) ) ] + ∑_{w} P(w) log P(w)
          = I(C) − H   (3.3)

where P(c, c′) = n(c, c′)/T, P(w) = n(w)/T and P(c) = n(c)/T. The first term I(C) is the mutual information between adjacent clusters and the second term H is the entropy of the word distribution.


At the end of the algorithm, we obtain a hierarchy of word types which can be represented as a binary tree as in Figure 3.2. Within this tree, each word is uniquely identified by its path from the root, and this path can be compactly represented with a bit string.

3.2 Through Neural Network Language Models

Word embeddings can be induced using neural network language models (NNLM), which use neural networks as the underlying predictive model. However, training of NNLM is slow, scaling as the size of the vocabulary for each model computation (see Section 2.3.3). On the other hand, for feature learning, we do not necessarily need a full probabilistic model. In recent years, several authors have thus proposed to eliminate that linear dependency on vocabulary size and allow scaling to very large training corpora.

3.2.1 Pairwise Ranking Approach

Collobert and Weston (2008) consider a standard NNLM, but they train the model on a two-class discrimination task: whether the word w_t in the middle of a sequence of words is related to its context c_t or not. By considering all possible sequences of words from the corpus, they build a training set of all context/word pairs (w_t, c_t) ∈ T. Those pairs are the positive examples. A negative example is then the same sequence where the middle word w_t has been replaced by a random word w_i ∈ W, with w_i ≠ w_t.

They train the model with a ranking-type cost:

L(θ) = ∑_{(w_t, c_t) ∈ T} ∑_{i=1, w_i ≠ w_t}^{|W|} max( 0, 1 − φ_θ(w_t, c_t) + φ_θ(w_i, c_t) ),   (3.4)

where W is the vocabulary of words, and φ_θ is the scoring function without the softmax activation function. φ_θ(w_i, c_t) is thus the score for a corrupted sequence where the middle word is replaced by the random word w_i.
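
A sketch of the ranking criterion of Equation 3.4 for one positive context/word pair, where the array scores stands in for φ_θ(w_i, c_t); in practice the inner sum over the vocabulary is typically approximated by sampling a few random words per update.

    import numpy as np

    rng = np.random.default_rng(0)
    vocab_size = 20
    scores = rng.normal(size=vocab_size)   # phi_theta(w_i, c_t) for every vocabulary word
    true_word = 7                          # index of the observed middle word w_t

    # hinge ranking loss: the true word should score at least 1 above every corrupted word
    margins = 1.0 - scores[true_word] + np.delete(scores, true_word)
    loss = np.maximum(0.0, margins).sum()  # Equation (3.4), for a single (w_t, c_t) pair
    print(loss)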

3.2.2 Scalable Log-Bilinear Model

In Mnih and Hinton (2007), the authors introduce a purely linear NNLM. Given a context ct , the model combines the embeddings of the n − 1 context words:

φ_θ^2(c_t) = ∑_{i=1}^{n−1} C^i φ_θ^1(w_i),   (3.5)

where C^i is the weight matrix associated with the context position i, and φ_θ^1(·) is the embedding layer, as described in Section 2.1.3. It then learns a linear model to predict the embedding of


the next word wt . The similarity between the context embedding and the embedding for each word wt in the vocabulary is computed using the inner product:

φ_θ(w_t, c_t) = φ_θ^2(c_t) · φ_θ^1(w_t) + b_{w_t},   (3.6)

where b_{w_t} is the bias for word w_t, which is used to capture the context-independent word frequency. The best word embeddings are obtained by learning high-dimensional embeddings from very large quantities of data, which makes scalability of the training method a critical factor. The same authors have thus proposed two methods for scaling up to a large vocabulary, with hierarchical softmax or noise-contrastive estimation.

Hierarchical Softmax

Based on the hierarchical model from Morin and Bengio (2005), Mnih and Hinton (2009) propose to speed up this model by using a hierarchy to exponentially filter down the number of computations that are performed. The model, combined with this optimization, is called the hierarchical log-bilinear (HLBL) model. As in Brown clustering described in Section 3.1, this model organizes the output vocabulary into a binary tree where the leaves are the words. This allows each word to be represented by a binary string which is called a code. At each node, the HLBL model then uses a probabilistic model for making the decisions. Each of the non-leaf nodes in the tree is represented with an embedding that is used for discriminating the words in the left sub-tree from the words in the right sub-tree of the node. The probability of the next word being w_t is the probability of making the sequence of binary decisions specified by the word’s code, given the context. Since the probability of making a decision at a node depends only on the context embedding and the embedding for that node, the probability of the next word is expressed as a product of probabilities of the binary decisions:

P(w_t | c_t) = ∏_i P(d_i | q_i, c_t),   (3.7)

where d_i is the i-th digit in the code for word w_t, and q_i is the embedding for the i-th node in the path corresponding to that code. The probability of each decision is given by:

P(d_i | q_i, c_t) = σ( φ_θ^2(c_t) · q_i + b_i ),   (3.8)

where σ(x) is the logistic function and φ_θ^2(c_t) is the context embedding computed using Equation 3.5. b_i in the equation is the node’s bias, which captures the context-independent tendency to visit the left child when leaving this node. If the tree is perfectly balanced, this can reduce the complexity from O(V) to O(log V).


Noise-Contrastive Estimation

Noise-Contrastive Estimation (NCE) is another approach which enables fast training without the complexity of working with tree-structured models, such as hierarchical softmax (Mnih and Teh, 2012). It reduces the language model estimation problem to estimating the parameters of a probabilistic binary classifier that uses the same parameters to distinguish samples from the empirical distribution from samples generated by the “noise” distribution Q (Gutmann and Hyvärinen, 2012). In practice, Q is the empirical context-independent (i.e. unigram) distribution, Q(w). We denote by P̃(w_t | c_t) and P̃(c) the empirical distributions. We are interested in fitting the context-dependent P_θ(w_t | c_t) to P̃(w_t | c_t). The two-class training data is generated as follows:

1. sample a context ct from P˜(c),

2. sample one “true” sample from P˜(wt |ct ), with auxiliary label D = 1 indicating the data point is drawn from the true distribution,

3. sample k “noise” samples from Q(w), with auxiliary label D = 0 indicating these data points are noise.

Thus, given c_t, the joint probability of (d, w_t) in the two-class data has the form of a mixture of two distributions:

P(d, w_t | c_t) = (k/(1+k)) × Q(w) if d = 0,  (1/(1+k)) × P̃(w_t | c_t) if d = 1.   (3.9)

Using the definition of conditional probability, this can be turned into a conditional probability of d having observed wt and ct :

P(D = 0 | c_t, w_t) = ( (k/(1+k)) × Q(w) ) / ( (1/(1+k)) × P̃(w_t | c_t) + (k/(1+k)) × Q(w) )
                  = k × Q(w) / ( P̃(w_t | c_t) + k × Q(w) ),   (3.10)

P(D = 1 | c_t, w_t) = P̃(w_t | c_t) / ( P̃(w_t | c_t) + k × Q(w) ).   (3.11)

Note that these probabilities are written in terms of the empirical distribution.

NCE replaces the empirical distribution P˜(wt |ct ) with the model distribution Pθ(wt |ct ), and θ is chosen to maximize the conditional likelihood of the “proxy corpus” created as described above. To avoid the expense of evaluating the partition function, NCE makes two further assumptions:

1. it proposes that the partition function value Z(c_t) be estimated as a parameter z_{c_t} (thus, for every


empirical c_t, classic NCE introduces one parameter);

2. for neural networks with lots of parameters, it turns out that fixing z_{c_t} = 1 for all c_t is effective (Mnih and Teh, 2012). This latter assumption both reduces the number of

parameters and encourages the model to have “self-normalized” outputs (i.e., Z (ct ) ≈ 1).

Consequently, the normalizing factor Z_θ(c_t) from Equation 2.41 can be dropped, and we simply use exp( φ_θ(w_t, c_t) ) in place of P_θ(w_t | c_t) during training. Making these assumptions, we can now write the conditional likelihood of being a noise sample or a true distribution sample in terms of θ as

P(D = 0 | c_t, w_t) = k × Q(w) / ( exp( φ_θ(w_t, c_t) ) + k × Q(w) )
                  = 1 / ( (1/(k × Q(w))) exp( φ_θ(w_t, c_t) ) + 1 )
                  = 1 / ( exp( φ_θ(w_t, c_t) − log( k × Q(w) ) ) + 1 )
                  = 1 − 1 / ( 1 + exp( −φ_θ(w_t, c_t) + log( k × Q(w) ) ) )
                  = 1 − σ( Δφ_θ(w_t, c_t) ),   (3.12)

P(D = 1 | c_t, w_t) = exp( φ_θ(w_t, c_t) ) / ( exp( φ_θ(w_t, c_t) ) + k × Q(w) )
                  = 1 / ( 1 + exp( −φ_θ(w_t, c_t) ) × k × Q(w) )
                  = 1 / ( 1 + exp( −φ_θ(w_t, c_t) + log( k × Q(w) ) ) )
                  = σ( Δφ_θ(w_t, c_t) ),   (3.13)

where σ(x) is the logistic function and Δφ_θ(w_t, c_t) = φ_θ(w_t, c_t) − log( k × Q(w) ) is the difference in the scores of word w_t under the model and the (scaled) noise distribution. The scaling factor k in front of Q(w) accounts for the fact that noise samples are k times more frequent than data samples.

We now have a binary classification problem with parameters θ that can be trained to maximize

conditional log-likelihood of the training set T, with k negative samples chosen:

L_{NCE_k} = ∑_{(w_t, c_t) ∈ T} [ log P(D = 1 | c_t, w_t) + k E_{w_i ∼ Q} log P(D = 0 | c_t, w_i) ].   (3.14)

Unfortunately, the expectation in the second term of this summation is still a difficult summation: it is k times the expected log probability (according to the current model) of producing a negative label under the noise distribution, over all words in W in a context c_t. We still have a loop over the entire vocabulary. The final step is therefore to replace this expectation with its Monte Carlo approximation:

L^{MC}_{NCE_k} = ∑_{(w_t, c_t) ∈ T} [ log P(D = 1 | c_t, w_t) + k × (1/k) ∑_{i=1, w_i ∼ Q}^{k} log P(D = 0 | c_t, w_i) ]
            = ∑_{(w_t, c_t) ∈ T} [ log P(D = 1 | c_t, w_t) + ∑_{i=1, w_i ∼ Q}^{k} log P(D = 0 | c_t, w_i) ],

L^{MC}_{NCE_k}(θ) = ∑_{(w_t, c_t) ∈ T} [ log σ( Δφ_θ(w_t, c_t) ) + ∑_{i=1, w_i ∼ Q}^{k} log( 1 − σ( Δφ_θ(w_i, c_t) ) ) ].   (3.15)

This objective is maximized when the model assigns high probabilities to the real words, and low probabilities to noise words. Technically, this is called Negative Sampling, and there is good mathematical motivation for using this loss function. The updates it proposes approximate the updates of the softmax function in the limit. But computationally it is especially appealing because computing the loss function now scales only with the number of noise words that we select (k), and not all words in the vocabulary (W). This makes it much faster to train.

3.2.3 Skip-gram Model

The Skip-gram model is also a purely linear NNLM (Mikolov et al., 2013a). Given a training sample (w_t, c_t), the model predicts the source context-words w_i ∈ c_t from the target word w_t (usually the word in the middle of a sequence):

∑_{w_i ∈ c_t} log P(w_i | w_t).   (3.16)

By defining two embedding layers, an input embedding layer φ_θ^{in} and an output embedding layer φ_θ^{out}, a score between a context word w_i and a target word w_t is computed with an inner product:

φ_θ(w_t, w_i) = φ_θ^{in}(w_t) · φ_θ^{out}(w_i).   (3.17)


Negative Sampling

For training the Skip-gram model, the authors use a variation of NCE, where the conditional probabilities are defined as follows:

P(D = 0 | w_i, w_t) = 1 / ( exp( φ_θ(w_t, w_i) ) + 1 ) = σ( −φ_θ(w_t, w_i) ),   (3.18)
P(D = 1 | w_i, w_t) = exp( φ_θ(w_t, w_i) ) / ( exp( φ_θ(w_t, w_i) ) + 1 ) = σ( φ_θ(w_t, w_i) ).   (3.19)

The objective to maximize is thus defined as

L_{NEG_k}(θ) = ∑_{(w_t, c_t) ∈ T} ∑_{w_i ∈ c_t} [ log σ( φ_θ(w_t, w_i) ) + ∑_{j=1, w_j ∼ Q}^{k} log σ( −φ_θ(w_t, w_j) ) ].   (3.20)

This negative sampling does not have the same asymptotic consistency guarantees that NCE has, but it works well for learning word embeddings.
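
A small NumPy sketch of one term of the negative-sampling objective in Equation 3.20 for a single (target, context-word) pair, with k noise words drawn from an assumed uniform noise distribution; the embeddings here are random and would normally be updated by gradient ascent on this objective.

    import numpy as np

    rng = np.random.default_rng(0)
    vocab_size, dim, k = 50, 16, 5
    E_in = rng.normal(scale=0.1, size=(vocab_size, dim))     # input embeddings  (phi_in)
    E_out = rng.normal(scale=0.1, size=(vocab_size, dim))    # output embeddings (phi_out)
    noise_dist = np.full(vocab_size, 1.0 / vocab_size)       # noise distribution Q(w), uniform here

    log_sigmoid = lambda z: -np.logaddexp(0.0, -z)           # log sigma(z), numerically stable

    def neg_sampling_objective(target, context_word):
        pos = log_sigmoid(E_in[target] @ E_out[context_word])        # log sigma(phi(w_t, w_i))
        noise = rng.choice(vocab_size, size=k, p=noise_dist)         # k negative samples from Q
        neg = log_sigmoid(-(E_in[target] @ E_out[noise].T)).sum()    # log sigma(-phi(w_t, w_j))
        return pos + neg                                             # one term of Equation (3.20)

    print(neg_sampling_objective(target=3, context_word=10))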

Sub-Sampling of Frequent Words

In very large corpora, the most frequent words can easily occur hundreds of millions of times (e.g., “in”, “the”, and “a”). Such words usually provide less information value than the rare words. To counter the imbalance between rare and frequent words, Mikolov et al. (2013b) introduce a sub-sampling method that randomly removes words w_t that are more frequent than some threshold t with probability

P(w_t) = 1 − √( t / tf_{w_t} ),   (3.21)

where tf_{w_t} is the frequency of word w_t in the corpus. The recommended value for t is typically around 10^{-5}. Although this sub-sampling formula has been chosen heuristically, it works well in practice. It accelerates learning and seems to help learning better embeddings for rare words.

4 Word Embeddings through Hellinger PCA

Building word embeddings has always generated much interest for linguists. Popular approaches such as the Brown clustering algorithm (see Section 3.1) have been used with success in a wide variety of NLP tasks (Schütze, 1995; Koo et al., 2008; Ratinov and Roth, 2009). Recently, distributed approaches based on neural network language models (NNLM) have revived the field of learning word embeddings (Collobert and Weston, 2008; Huang and Yates, 2009; Turian et al., 2010; Collobert et al., 2011; Mikolov et al., 2013b; Mnih and Kavukcuoglu, 2013). However, a neural network architecture can be hard to train. Finding the right hyper-parameters to tune the model is often a challenging task, and the training phase is in general computationally expensive.

An NNLM learns word embeddings by predicting the words in the vocabulary that are likely to appear around given sequences of words. More formally, it learns the word co-occurrence probability distributions. Instead, simply counting words over a large corpus of unlabeled text can be performed to retrieve those word distributions and to represent words (Turney and Pantel, 2010). In this chapter, we thus show that similar word embeddings can be computed using word co-occurrence statistics and a well-known dimensionality reduction operation such as Principal Component Analysis (PCA). In contrast with NNLM, PCA is a simple, non-parametric method of extracting relevant information from confusing data sets. Previous work has attempted to build word representations with such techniques with varying degrees of success (Schütze, 1993; Bengio et al., 2001). We claim that, assuming an appropriate metric, this simple spectral method can generate word embeddings as good as those obtained with NNLM architectures.

4.1 Word Co-Occurrence Probabilities

“You shall know a word by the company it keeps” (Firth, 1957). Keeping this famous quote in mind, word co-occurrence probabilities are computed by counting the number of times each


Figure 4.1 – A toy example illustrating that related words have similar word co-occurrence probability distributions.

context word ct ∈ D (where D ⊆ W) occurs around a word wt ∈ W:

P(c_t | w_t) = P(c_t, w_t) / P(w_t) = n(c_t, w_t) / ∑_{i=1}^{|D|} n(c_i, w_t),   (4.1)

where n(c_t, w_t) is the number of times a context word c_t occurs in the surrounding of the word w_t. The number of context words to consider around each word is variable, and the window can be either symmetric or asymmetric. A multinomial distribution over the |D| classes (words) is thus obtained for each word w_t:

P(c_1, . . . , c_{|D|} | w_t) = ( P(c_1 | w_t), . . . , P(c_{|D|} | w_t) ).   (4.2)

This distribution becomes less sparse as the window of context words grows.

For illustrative purposes, we denote the target and context vocabularies as follows:

W = {cat, dog, cloud},   (4.3)
D = {breeds, computing, cover, food, is, meat, named, of}.   (4.4)

We see in Figure 4.1 that related words (cat and dog) have similar word co-occurrence proba- bilities, while a non-related word (cloud) has a completely different probability distribution over this specific context vocabulary D.
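
A short sketch of Equation 4.1: count, for every target word, how often each context word occurs in a symmetric window around it, then normalize the rows into probability distributions (invented toy corpus, window of one word on each side).

    from collections import Counter, defaultdict

    corpus = "the cat eats meat the dog eats meat the cloud is white".split()
    window = 1                                              # symmetric context window size

    counts = defaultdict(Counter)                           # counts[w_t][c_t] = n(c_t, w_t)
    for t, word in enumerate(corpus):
        for j in range(max(0, t - window), min(len(corpus), t + window + 1)):
            if j != t:
                counts[word][corpus[j]] += 1

    def cooccurrence_distribution(word):
        total = sum(counts[word].values())                  # denominator of Equation (4.1)
        return {c: n / total for c, n in counts[word].items()}

    print(cooccurrence_distribution("cat"))
    print(cooccurrence_distribution("dog"))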

Because we are facing discrete probability distributions, the Hellinger distance (Hellinger, 1909) seems appropriate to calculate similarities between these word representations.


4.2 Hellinger Distance

Similarities between words can be derived by computing a distance between their corresponding word distributions. Several distances (or metrics) over discrete distributions exist, such as the Bhattacharyya distance, the Hellinger distance or the Kullback-Leibler divergence. We chose the Hellinger distance for its simplicity and its symmetry property (as it is a true distance).

Considering two discrete probability distributions P = (p_1, . . . , p_k) and Q = (q_1, . . . , q_k), the Hellinger distance is formally defined as:

H(P, Q) = √( (1/2) ∑_{i=1}^{k} ( √p_i − √q_i )^2 ),   (4.5)

which is directly related to the Euclidean norm of the difference of the square root vectors:

H(P, Q) = (1/√2) ‖ √P − √Q ‖_2.   (4.6)

Note that it makes more sense to take the Hellinger distance rather than the Euclidean distance for comparing discrete distributions, as P and Q are unit vectors according to the Hellinger distance (√P and √Q are unit vectors according to the ℓ_2 norm). The Hellinger distance thus forms a bounded metric on the space of probability distributions over a given probability space. The maximum distance 1 is achieved when P assigns probability zero to every set to which Q assigns a positive probability, and vice versa. Finally, the Hellinger distance is related to the Bhattacharyya coefficient (Bhattacharyya, 1943) BC(P, Q), as it can be defined as

H(P, Q) = √( 1 − BC(P, Q) ).   (4.7)

The Bhattacharyya coefficient is another divergence-type measure which is defined as

BC(P, Q) = ∑_{i=1}^{k} √(p_i q_i).   (4.8)
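
A direct NumPy implementation of Equations 4.5 to 4.8 for two toy discrete distributions.

    import numpy as np

    def hellinger(p, q):
        return np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))   # Equation (4.5)

    def bhattacharyya(p, q):
        return np.sum(np.sqrt(p * q))                                   # Equation (4.8)

    p = np.array([0.5, 0.3, 0.2])
    q = np.array([0.4, 0.4, 0.2])
    print(hellinger(p, q))
    print(np.sqrt(1.0 - bhattacharyya(p, q)))   # same value, via the Bhattacharyya coefficient, Eq. (4.7)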

4.3 Experimental Setup

After building our word vector representations over the English Wikipedia, we use benchmark datasets for evaluating them and selecting the right hyper-parameters accordingly.

4.3.1 Building Word Representations over Large Corpora

A large amount of text is necessary to build the word co-occurrence probability distributions. Our English corpus is thus composed of the entire English Wikipedia1 (where all MediaWiki

1Available at http://download.wikimedia.org. We took the January 2014 version.

markups have been removed2). We consider lower-case words to limit the number of words in the vocabulary. Additionally, all occurrences of sequences of numbers within a word are replaced with the special token “0”. The resulting text is tokenized using the Stanford tokenizer3. The final dataset S contains about 1.6 billion words. As vocabulary W, we consider all the words within our corpus which appear at least one hundred times. This results in a dictionary of 191,268 words.

4.3.2 Evaluating Word Representations

Word analogies

The word analogy task consists of questions like, “a is to b as a* is to ?”. It was introduced in Mikolov et al. (2013a) and contains 19,544 such questions, divided into a semantic subset (SEM) and a syntactic subset (SYN). The 8,869 semantic questions are analogies about places, like “Bern is to Switzerland as Paris is to ?”, or family relationships, like “uncle is to aunt as boy is to ?”. The 10,675 syntactic questions are grammatical analogies, involving plural and adjective forms, superlatives, verb tenses, etc. To correctly answer the question, the model should uniquely identify the missing term, with only an exact correspondence counted as a correct match. With word vector representations, we thus want the hidden vector b* to be similar to the vector b − a + a*. The analogy question can be solved by optimizing:

argmax_{b* ∈ V} sim(b*, b − a + a*),   (4.9)

where V is the question vocabulary excluding the question words b, a and a*, and sim(·) is a similarity measure (usually the cosine similarity). As the number of questions differs in each category of analogy, the macro-averaged accuracy is reported for these two tasks.
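
A sketch of the analogy prediction of Equation 4.9 with cosine similarity, over a toy embedding matrix (the random vectors are only a stand-in for real embeddings); the question words are excluded from the candidate set, as described above.

    import numpy as np

    rng = np.random.default_rng(0)
    words = ["bern", "switzerland", "paris", "france", "uncle", "aunt"]
    E = rng.normal(size=(len(words), 20))                      # toy word embeddings, one row per word
    E /= np.linalg.norm(E, axis=1, keepdims=True)              # unit-normalize for cosine similarity

    def analogy(a, b, a_star):
        target = E[words.index(b)] - E[words.index(a)] + E[words.index(a_star)]   # b - a + a*
        scores = E @ (target / np.linalg.norm(target))         # cosine similarity with every word
        for w in (a, b, a_star):                               # exclude the question words
            scores[words.index(w)] = -np.inf
        return words[int(np.argmax(scores))]                   # Equation (4.9)

    print(analogy("bern", "switzerland", "paris"))             # with real embeddings: "france"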

Word Similarities

Word vector representations are also evaluated on two word similarity datasets: the WordSimilarity-353 (WS-353) Test Collection (Finkelstein et al., 2002) and the Stanford Rare Word (RW) dataset (Luong et al., 2013). They both contain sets of English word pairs along with human-assigned similarity judgments. WS-353 contains 353 word pairs of relatively common words, like computer:internet or football:tennis. On the other hand, the RW dataset focuses on rare words. It contains 2,034 pairs where one of the two words is rare or morphologically complex, such as brigadier:general or cognizance:knowing. Performance is measured by Spearman’s rank correlation coefficient.

2We used Wikipedia Extractor available at http://medialab.di.unipi.it/wiki/Wikipedia_Extractor. 3Available at http://nlp.stanford.edu/software/tokenizer.shtml


4.4 Analysis of the Context

As regards the context, two main hyper-parameters are involved:

1. The type of context to use, i.e. which words are to be chosen for defining the context dictionary D. Do we need all the words, the most frequent ones or, on the contrary, the rare ones?

2. The context window size to consider, i.e. the number of context words ct to count surrounding a given word wt (symmetric context window).

4.4.1 Type of Context

Several types of context vocabularies D are considered to build the word co-occurrence probabilities. Mikolov et al. (2013b) have shown that better word representations can be obtained by sub-sampling of the frequent words. We thus want to find the right balance between frequent and rare words, while maintaining a reasonable number of context words in D. 0.5 1.0 unction f umulative distribution distribution umulative C 0.1 0.2


Figure 4.2 – Cumulative probability distribution of unigrams that occur at least a hundred times in the English Wikipedia.

Figure 4.2 reports the analysis of the unigram distribution of our corpus of text, with

P(w_t) = n(w_t) / |S|,  ∀ w_t ∈ W.   (4.10)

We see that the 64 most frequent words, with P(w_t) > 10^{-3}, represent about half of the entire corpus. These words are considered as stop words and can be removed from the context vocabulary. Because the choice of stop words is arbitrary, this list can also include words with P(w_t) > 10^{-4}, which represent less than a thousand words (911). By selecting all words with P(w_t) > 10^{-6}, we see that we cover about 95% of our corpus. The size of the context vocabulary becomes however quite large, as 31,471 words are above this threshold. A practical alternative is therefore to select words with P(w_t) > 10^{-5}, which already covers about 87% of the corpus. Finally, we observe a long tail of rare words which results in very sparse word representations when selecting large context vocabularies.

For a complete analysis, we thus define the following types of context vocabularies:

1. Only the most frequent words, where we consider all words with P(w_t) > 10^{-5}. This results in a context vocabulary of size |D| = 6960.

2. Same vocabulary as 1, where we remove words with P(w_t) > 10^{-3}; |D| = 6896.

3. Same vocabulary as 1, where we remove words with P(w_t) > 10^{-4}; |D| = 6049.

4. All words with P(w_t) > 10^{-6}; |D| = 31471.

5. Same vocabulary as 4, where we remove words with P(w_t) > 10^{-3}; |D| = 31407.

6. Same vocabulary as 4, where we remove words with P(w_t) > 10^{-4}; |D| = 30560.

7. Same vocabulary as 4, where we remove words with P(w_t) > 10^{-5}; |D| = 24511.

8. Only rare words, where we consider all words with P(w_t) < 10^{-5}; |D| = 184308.

9. Context vocabulary D is the same as the target vocabulary W; |D| = 191268.

Figures 4.3 and 4.4 present the performance obtained on the benchmark datasets for all these nine scenarios with different sizes of context. No dimensionality reduction has been applied in this analysis. Similarities between words are calculated with the Hellinger distance between the word probability distributions. For the word analogy task, we used the objective function 3COSMUL defined by Levy and Goldberg (2014), as we are dealing with explicit word representations in this case:

argmax_{b* ∈ V} [ cos(b*, b) cos(b*, a*) / ( cos(b*, a) + ε ) ],   (4.11)

where ε = 0.001 is used to prevent division by zero.

First, using all words as context does not necessarily yield the best performance. With our smallest context vocabularies, performance is fairly similar to that obtained with all words. An in-between situation, with words whose appearance frequency is greater than 10^{-6}, also gives quite similar performance. Secondly, discarding the most frequent words from the context distributions helps to increase performance when using only one word of context. With five or ten context


(a) WordSim-353 dataset. (b) Rare Word dataset. (x-axis: context type interval, with the number of words; y-axis: relatedness score, Spearman correlation.)

Figure 4.3 – Performance on word similarity datasets with different types of context word vocabularies D (in the ascending order of their number of words), and different window sizes (in the legend, wsz1 is for symmetric window of 1 context word, etc.). Spearman rank correlation is reported.


 

(a) Mikolov’s syntactic. (b) Mikolov’s semantic. (x-axis: context type interval, with the number of words; y-axis: accuracy in %.)

Figure 4.4 – Performance on word analogy datasets with different types of context word vocabularies D (in the ascending order of their number of words), and different window sizes (in the legend, wsz1 is for symmetric window of 1 context word, etc.). Accuracy is reported.

words, the most frequent words seem helpful, except on Mikolov’s semantic task. On this semantic task, adding rare words to the context vocabulary helps to improve the general performance, since results with words whose appearance frequency is less than 10^{-5} are the best. However, these observations might be explained by the sparsity of the probability distributions.

TYPE                       DIM.      WINDOW SIZE 1    WINDOW SIZE 5    WINDOW SIZE 10
FROM 10^{-4} TO 10^{-5}    6049      151              660              946
FROM 10^{-3} TO 10^{-5}    6896      230              942              1319
UP TO 10^{-5}              6960      264              998              1380
FROM 10^{-5} TO 10^{-6}    24511     134              1615             1028
FROM 10^{-4} TO 10^{-6}    30560     282              1333             1974
FROM 10^{-3} TO 10^{-6}    31407     362              998              2347
UP TO 10^{-6}              31471     396              1672             2408
FROM 10^{-5}               184308    250              1305             2050
ALL                        191268    514              2303             3430

Table 4.1 – The average number of context words according to the type and the size of context.

Counts in Table 4.1 show significant differences in terms of sparsity depending on the type of context. Similarities between words seem to be easier to find with sparse distributions. Consequently, adding the most frequent words hurts the performance when the distributions are very sparse, since it introduces noise. The overlap between two word distributions on the most frequent words is probably high, which eliminates the information coming from the less frequent words. When the number of context words is higher (5 or 10), the opposite occurs: frequent words as context increase the general performance. As the counts for frequent words become higher with larger contexts, both the most common and the most rare words provide relevant information.

The average number of context words (i.e. features) whose appearance frequency is less than 10^{-5} and greater than 10^{-6}, with a symmetric window of size 1, is extremely low (134). Performance with these hyper-parameters is still highly competitive on word similarities and syntactic analogies. Within this framework, it then becomes a good option for representing words in a low and sparse dimension. But overall, the context vocabulary with only the most frequent words (those with P(w_t) > 10^{-5}) appears to be the ideal solution when considering larger context windows. The overall performance is very competitive, yet with a far smaller number of context words. We thus choose this context vocabulary for performing the dimensionality reduction in the next section.


WORD (RANK)                   WINDOW SIZE 1                                         WINDOW SIZE 10
BAIKAL (no. 37415)            MÄLAREN, TITICACA, BALATON, LADOGA, ILMEN             LAKE, SIBERIA, AMUR, BASIN, VOLGA
SPECIAL-NEEDS (no. 165996)    AT-RISK, SCHOOL-AGE, LOW-INCOME, HEARING-IMPAIRED,    PRESCHOOL, KINDERGARTEN, TEACHERS, SCHOOLS,
                              GRADE-SCHOOL                                          VOCATIONAL

Table 4.2 – Two rare words with their rank and their 5 nearest words with respect to the Hellinger distance, for a symmetric window of 1 and 10 context words.

4.4.2 Context Window Size

Except for the semantic analogy questions, the best performance is always obtained with a symmetric context window of size 1. However, performance drops dramatically with this window size on the semantic questions. It seems that a limited window size helps to find syntactic similarities, but a large window is needed to detect the semantic aspects. The best results on the semantic analogy questions are thus obtained with a symmetric window of ten words. This intuition is confirmed by looking at the nearest neighbors of certain rare words with different sizes of context. In Table 4.2, we can observe that a window of one context word brings together words that occur in the same syntactic structure, while a window of ten context words goes beyond that and adds semantic information. With only one word of context, Lake Baikal is therefore a neighbor of other lakes, and the word special-needs is close to other words composed of two words. With ten words of context, the nearest neighbors of Baikal are words in direct relation to this location, i.e. these words cannot match other lakes, like Lake Titicaca. This also applies to the word special-needs, where we find words related to the educational meaning of this word. This could explain why the symmetric window of one context word gives the best results on the word similarity and syntactic tasks, but performs very poorly on the semantic task.

4.4.3 Analysis Findings

The analysis of the context reveals that word similarities can even be found with extremely sparse word vector representations. But these representations lack semantic information since they perform poorly on the word analogy task involving semantic questions. A symmetric

window of five or ten context words seems to be the best option to capture both syntactic and semantic information about words. The average number of context words is much larger with these parameters, which justifies the need for dimensionality reduction. Furthermore, this analysis shows that a large vocabulary of context words is not necessary to achieve significant improvements. Good performance on the syntactic and similarity tasks is reached with our smallest vocabularies of context words. Using instead a distribution over a limited number of rare words increases performance on the semantic task while reducing performance on the syntactic and similarity tasks.

4.5 Principal Component Analysis (PCA)

As discrete distributions are vocabulary-size-dependent, directly using the distribution as a word representation is, in general, not tractable for large vocabularies. This is even more true with a large number of context words, as distributions become less sparse. As a dimensionality reduction technique, we choose to perform a principal component analysis (PCA) of the word co-occurrence probability matrix C ∈ R^{|W|×|D|}. The word co-occurrence matrix C is defined as follows:

C = ( P(c_1|w_1)   ···  P(c_{|D|}|w_1)  )   ( P_{w_1}     )
    ( P(c_1|w_2)   ···  P(c_{|D|}|w_2)  ) = ( P_{w_2}     ) .   (4.12)
    (     ⋮         ⋱         ⋮          )   (    ⋮        )
    ( P(c_1|w_{|W|}) ··· P(c_{|D|}|w_{|W|}) ) ( P_{w_{|W|}} )

PCA uses an orthogonal transformation to convert a set of observations (the target words, i.e. rows of C) of possibly correlated variables (the context words, i.e. columns of C) into a set of values of linearly uncorrelated variables called principal components C¯ defined as

C¯ = CA, (4.13)

where the columns of A ∈ R^{|D|×|D|} are orthonormal vectors. This transformation is defined in such a way that the first principal component has the largest possible variance (that is, accounts for as much of the variability in the data as possible). Keeping only the first d principal components gives a truncated transformation of C, leading to C̄_d ∈ R^{|W|×d} where d ≪ |D|. Reducing dimensions means that redundancy in the data is eliminated. Redundancy does not mean that the variables are identical; it means that there is a strong correlation between them. PCA is usually done by eigenvalue decomposition (ED) of a data covariance (or correlation) matrix C^T C, or by singular value decomposition (SVD) of C, after normalizing the variables. Because we are dealing with discrete probability distributions, we use the Bhattacharyya coefficient to quantify the redundancy between two word context distributions. Taking the element-wise square root √C, PCA thus learns to project word co-occurrence probability distributions onto a lower-dimensional manifold, while minimizing the reconstruction error according to the Hellinger distance:

min_{A_d ∈ R^{|D|×d}}  Σ_{i=1}^{|W|}  ‖ √P_{w_i} − A_d A_d^T √P_{w_i} ‖_2 .   (4.14)

Covariance-based PCA of high-dimensional matrices can lead to round-off errors, and thus fails to properly approximate these high-dimensional matrices with low-rank matrices. SVD, on the other hand, generally requires a large amount of memory to factorize such huge matrices. To overcome these barriers, we propose a dimensionality reduction based on a fast randomized SVD.

4.5.1 Eigen Decomposition (ED)

PCA can be done by ED of the covariance (or correlation) matrix R. The symmetric, positive semi-definite matrix R can be rewritten as

R = A D A^T ,   (4.15)

where D is the diagonal matrix of eigenvalues of R and A is the orthogonal matrix of eigenvectors of R. As we use the Bhattacharyya coefficient to quantify the redundancy between context distributions, the matrix R is obtained as follows:

R = √C^T √C ,   R ∈ R^{|D|×|D|} .   (4.16)

With a limited size of the context word dictionary D (thousands of words), this operation can be performed very quickly since it is highly parallelizable. Word embeddings are then obtained by projecting the transformed word distributions √P_{w_t} onto the first d principal components:

C̄ = √C A_d ,   (4.17)

where the columns of A_d ∈ R^{|D|×d} are the first d eigenvectors, and each row of C̄ ∈ R^{|W|×d} is a word embedding.
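As an illustration, here is a minimal NumPy sketch of this eigendecomposition route. The function name `hellinger_pca_ed`, the toy data, and the chosen dimension are hypothetical placeholders; the thesis toolkit implements this step in C++.

```python
import numpy as np

def hellinger_pca_ed(C, d):
    """Hellinger PCA through eigendecomposition of R = sqrt(C)^T sqrt(C).

    C : (|W|, |D|) matrix of co-occurrence probabilities P(c_j | w_i), rows sum to 1.
    d : number of principal components to keep.
    Returns the (|W|, d) word embeddings and the (|D|, d) projection matrix A_d.
    """
    sqrtC = np.sqrt(C)                      # element-wise square root (Hellinger transform)
    R = sqrtC.T @ sqrtC                     # (|D|, |D|) symmetric PSD matrix (Eq. 4.16)
    eigvals, eigvecs = np.linalg.eigh(R)    # eigh returns eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:d]   # keep the d largest eigenvalues
    A_d = eigvecs[:, order]                 # first d eigenvectors
    embeddings = sqrtC @ A_d                # projection of Eq. 4.17
    return embeddings, A_d

# toy usage: 5 target words, 8 context words
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    counts = rng.integers(0, 20, size=(5, 8)).astype(float)
    C = counts / counts.sum(axis=1, keepdims=True)
    E, A_d = hellinger_pca_ed(C, d=3)
    print(E.shape)  # (5, 3)
```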

4.5.2 Singular Value Decomposition (SVD)

The principal components transformation can also be associated with another matrix factorization, the SVD of √C:

√C = U Σ V^T .   (4.18)

Here Σ ∈ R^{|W|×|D|} is a rectangular diagonal matrix of positive numbers, called the singular values. The columns of U ∈ R^{|W|×|W|} are orthogonal unit vectors called the left singular vectors, and the columns of V ∈ R^{|D|×|D|} are the right singular vectors. In terms of this factorization,


the matrix R = √C^T √C can be written as

R = V Σ U^T U Σ V^T   (4.19)
  = V Σ^2 V^T .   (4.20)

Consequently, the right singular vectors V are equivalent to the eigenvectors of R, while the singular values are equal to the square roots of the eigenvalues. A truncated transformation of √C can be obtained by considering only the d largest singular values and their corresponding singular vectors:

C̄ = √C V_d   (4.21)
  = U_d Σ_d V_d^T V_d   (4.22)
  = U_d Σ_d .   (4.23)

Computing the SVD can be extremely time-consuming for large-scale problems. We thus turn to randomized methods, which offer significant speedups over classical methods.

Fast Randomized SVD

Halko et al. (2011) propose a two-stage algorithm that uses randomized techniques for computing a low-rank approximation of √C.

Stage A. We seek a matrix Q, with as few columns as possible, that approximates the range of the input matrix √C. Q has orthonormal columns and

√C ≈ Q Q^T √C .   (4.24)

Stage B. Assuming we have found such a Q, we can then compute an SVD of √C as follows:

1. construct B = Q^T √C;

2. compute the SVD of the small matrix B: B = S Σ V^T;

3. as √C ≈ Q Q^T √C = Q S Σ V^T, taking U = Q S gives a low-rank SVD approximation √C ≈ U Σ V^T.

When Q has few columns, this procedure is efficient because we can easily construct the reduced matrix B and rapidly compute its SVD.
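A compact NumPy sketch of this two-stage procedure is shown below; the function and variable names (`randomized_svd`, `n_oversamples`) are illustrative, and the thesis itself relies on the redsvd library rather than this code.

```python
import numpy as np

def randomized_svd(M, d, n_oversamples=10, seed=None):
    """Two-stage randomized SVD in the spirit of Halko et al. (2011), illustrative sketch.

    M : (m, n) input matrix (here, the element-wise square root of C).
    d : target rank (number of singular vectors to keep).
    """
    rng = np.random.default_rng(seed)
    k = d + n_oversamples

    # Stage A: find Q with orthonormal columns such that M ~ Q Q^T M.
    Omega = rng.standard_normal((M.shape[1], k))   # random test matrix
    Y = M @ Omega                                  # sample the range of M
    Q, _ = np.linalg.qr(Y)                         # orthonormal basis of the sample

    # Stage B: SVD of the small matrix B = Q^T M.
    B = Q.T @ M
    S, sigma, Vt = np.linalg.svd(B, full_matrices=False)
    U = Q @ S                                      # approximate left singular vectors of M
    return U[:, :d], sigma[:d], Vt[:d, :]

# usage on a toy Hellinger-transformed co-occurrence matrix
if __name__ == "__main__":
    rng = np.random.default_rng(1)
    counts = rng.integers(0, 30, size=(1000, 200)).astype(float)
    C = counts / counts.sum(axis=1, keepdims=True)
    U, sigma, Vt = randomized_svd(np.sqrt(C), d=50)
    embeddings = U * sigma                         # C_bar = U_d Sigma_d (Eq. 4.23)
```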


4.5.3 Experimental Analysis

As seen in Section 4.4.1, we are dealing with very sparse distributions. PCA is about reducing dimensionality by removing redundant variables. It thus makes sense to use context vocabularies which provide the least sparse variables for performing the Hellinger PCA over word probability distributions. In our experiments, we perform the dimensionality reduction over a context of words with probability P(w_t) > 10^-5, following the findings of Section 4.4.1.

Eigenvalue Weighting

[Figure 4.5: (a) word similarity datasets (WS-353, RW), relatedness score (Spearman correlation) vs. eigenvalue weighting parameter p; (b) word analogy datasets (SYN, SEM), accuracy (%) vs. eigenvalue weighting parameter p.]

Figure 4.5 – Performance on datasets with different eigenvalue weighting parameter p. Word embeddings dimension is 512. Context vocabulary interval is ]1; 10^-5] with a symmetric window of 10 words. Spearman rank correlation is reported on word similarity tasks. Accuracy is reported on word analogy tasks.

It has been shown that adding a parameter p to control the eigenvalue matrix Σ helps to get better word representations after the SVD (Caron, 2001):

C̄^p = U_d Σ_d^p .   (4.25)

Since Σ is diagonal and the singular values are sorted in descending order, setting p < 1 gives relatively more emphasis to the later components of U. We thus evaluate word embeddings with p ∈ {0, 0.5, 1}, where p = 1 corresponds to the traditional factorization, and p = 0 means that the eigenvalue matrix is discarded. Results reported in Figure 4.5 show that the best performance is achieved when discarding the eigenvalue matrix. We also observe a significant drop when we use the traditional principal components (with p = 1). For the next sections, we will therefore define

the word embeddings as

C̄ = U_d .   (4.26)
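The weighting in Equation 4.25 amounts to a single operation on top of the randomized SVD output. The sketch below is illustrative (the names `U`, `sigma`, and `weight_singular_values` are assumptions) and shows the three settings compared in Figure 4.5.

```python
import numpy as np

def weight_singular_values(U, sigma, p):
    """Return U_d * Sigma_d^p (Eq. 4.25); p = 1 is the usual PCA projection,
    p = 0 discards the singular values entirely (Eq. 4.26)."""
    return U * (sigma ** p)

# assuming U (|W| x d) and sigma (d,) come from the randomized SVD of sqrt(C):
# emb_p1  = weight_singular_values(U, sigma, 1.0)   # traditional factorization
# emb_p05 = weight_singular_values(U, sigma, 0.5)
# emb_p0  = weight_singular_values(U, sigma, 0.0)   # best setting in Figure 4.5
```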

Number of Dimensions

[Figure 4.6: (a) word similarity datasets (WS-353, RW), relatedness score (Spearman correlation) vs. number of principal components (32 to 1024); (b) word analogy datasets (SYN, SEM), accuracy (%) vs. number of principal components.]

Figure 4.6 – Performance on datasets with different dimensions using the context interval ]1; 10^-5] with a symmetric window of 10 words. Dimensionality reduction has been obtained with the Hellinger PCA using randomized SVD. Spearman rank correlation is reported on word similarity tasks. Accuracy is reported on word analogy tasks.

When a dimensionality reduction method is applied, a number of dimensions needs to be chosen. This number has to be large enough to retain the maximum variability. It also has to be small enough for the dimensionality reduction to be truly meaningful and effective. We thus analyze the impact of the number of principal components from the Hellinger PCA of the co-occurrence matrix. Figure 4.6 reports performance on the benchmark datasets for different numbers of dimensions. On both tasks, we observe that performance is optimal with 512 dimensions. However, about a hundred dimensions already gives competitive results.

Dense vs Sparse Word Representations

In Table 4.3, we compare performance on the benchmark datasets described in Section 4.3.2 between the sparse word representations and the dense word representations obtained after the Hellinger PCA. The ability of the PCA to summarize the information compactly leads to improved results on both tasks, where performance is better than with no dimensionality reduction. This observation is especially true for finding word similarities, where the improvement is the most significant.


                          SPARSE                 DENSE
Context Window Size       1      5      10       1      5      10

WORDSIMILARITY-353        0.39   0.41   0.39     0.53   0.67   0.69
RARE WORD                 0.17   0.13   0.11     0.38   0.42   0.41
SYNTACTIC ANALOGIES       51.9   54.1   52.6     50.7   68.8   70.9
SEMANTIC ANALOGIES        21.8   50.3   58.1     18.3   54.5   64.1

Table 4.3 – Performance on word similarity and word analogy datasets using a context word vocabulary with P(w_t) > 10^-5, and different window sizes. For dense representations, we report results using the first 512 principal components after Hellinger PCA with SVD. Spearman rank correlation is reported for similarities. Accuracy is reported for analogies.

Qualitative Analysis

ASSOCIATED (n°866)         CONSISTENT, CLOSELY, ALONG, DEALT, DEALING, FAMILIAR, CONTRASTED, INTIMATELY, SYNONYMOUS, ASSOCIATING
ABBEY (n°2980)             PRIORY, MONASTERY, FRIARY, ABBOT, CLUNY, NUNNERY, CISTERCIAN, BENEDICTINE, CASTLE, CARTHUSIAN
BAIKAL (n°37415)           AMUR, SIBERIAN, URAL, TITICACA, ALTAI, URALS, SIBERIA, YAKUTIA, VOLGA, KAMCHATKA
ZIDANE (n°49155)           ZINEDINE, RONALDINHO, MARADONA, FIGO, MESSI, RONALDO, CANTONA, PLATINI, CRUYFF, PELÉ
SPECIAL-NEEDS (n°165997)   DAYCARE, HEARING-IMPAIRED, PRESCHOOL, SCHOOL-AGE, SCHOOLS, KINDERGARTENS, SCHOOLCHILDREN, VOCATIONAL, BOARDING, MIDDLE-SCHOOL

Table 4.4 – Words with their rank in W and their 10 nearest neighbors in the word embedding space (according to the Euclidean metric). The dimension of the word embeddings is 128. A window of 10 context words has been used to build the word co-occurrence matrix.

Table 4.4 shows the ten nearest neighbors of a few randomly chosen words in the word embedding space. Word embeddings are the first 128 principal components of the Hellinger PCA of the word co-occurrence matrix built with a context window of ten words. First, we see that our method produces appealing word embeddings for both frequent and rare words. Furthermore, we see that both syntactic and semantic information about words is captured. For instance, the nearest neighbors of associated are other similar verbs in the past participle form,

but also adverbs and adjectives with the same meaning. For baikal and special-needs, which were analyzed in Table 4.2, we see that the nearest neighbors are now a combination of syntactically and semantically related words. Abbey is close to other buildings, whether or not they are religious ones (e.g. castle), and also close to words related to religion that are not buildings (e.g. abbot or benedictine). For a person like zidane, both nationality and occupation have been captured, as we see other French football players among the nearest neighbors (e.g. platini or cantona), along with other famous players.

4.6 Supervised Evaluation Tasks

While the benchmark datasets described in Section 4.3.2 are useful for evaluating the quality of word embeddings, these tasks are not representative of real-world applications. Using word embeddings as features has been shown to improve generalization performance on several NLP tasks (Turian et al., 2010; Collobert et al., 2011; Chen et al., 2013). Good word embeddings should be those which give the best performance in such real-world applications. We thus evaluate our word embeddings on two standard word tagging tasks: chunking (CHUNK) and named entity recognition (NER). For that purpose, we train a neural network with a window approach, as in Collobert et al. (2011).

4.6.1 Tasks Description

Chunking

Also called shallow parsing, chunking aims at labeling segments of a sentence with syntactic constituents such as noun or verb phrases (NP or VP). Each word is assigned one unique tag, often encoded as a begin-chunk (e.g., B-NP) or inside-chunk tag (e.g., I-NP). Chunking is often evaluated using the CoNLL 2000 shared task4. Sections 15–18 of the Wall Street Journal data are used for training and section 20 for testing. Validation is achieved by splitting the training set. As a benchmark, we report an F1 score around 94.3% obtained by several systems based on second-order random fields (Sha and Pereira, 2003; McDonald et al., 2005; Sun et al., 2008). These systems use features composed of words, part-of-speech tags, and other tags.

Named Entity Recognition

This task labels atomic elements in the sentence into categories such as “PERSON" or “LOCATION". The CoNLL 2003 setup5 is a NER benchmark dataset based on Reuters data. The contest provides training, validation and testing sets. As a benchmark, we report the system of Ando et al. (2005), which reached 89.31% F1 with a semi-supervised approach. Their system uses many hand-crafted features (words, part-of-speech tags, suffixes and prefixes and

4See http://www.cnts.ua.ac.be/conll2000/chunking. 5See http://www.cnts.ua.ac.be/conll2003/ner/.


CHUNK tags), but is overall less specialized than the CoNLL 2003 challengers.

For both tasks, we adopt the BIO2 annotation standard.

4.6.2 Neural Network Approach

We train a neural network to tag each word in a given sentence with a label. Using a sliding window approach, the embeddings of the word to tag and of some context words are given to a nonlinear classification model. We train the model by maximizing the log-likelihood at the sentence level.

Sliding window

Context is crucial to characterize word meanings. We thus consider n context words around each word w_t to be tagged, leading to a window of N = 2n + 1 words, {w_{t−n}, ..., w_t, ..., w_{t+n}}. We first define an embedding layer φ^1_θ (see Section 2.1.3) which maps each word in a given window to a d^{wrd}-dimensional vector. By concatenating the resulting vectors, we obtain a (d^{wrd} × N)-dimensional vector, which aims at characterizing the middle word w_t in this window:

φ^1_θ(w_t) = [ E_{w_{t−n}} ; ... ; E_{w_t} ; ... ; E_{w_{t+n}} ] ∈ R^{(d^{wrd} × N)} .   (4.27)

Given a complete sentence of T words s = {w_1, ..., w_T}, we can obtain for each word w_t a context-dependent representation by sliding over all the possible windows in the sentence. Each window representation is then given to a nonlinear classifier which outputs a score for each possible tag y_k:

φ^2_θ(w_t) = W^2 h( W^1 φ^1_θ(w_t) + b^1 ) + b^2 ∈ R^K ,   (4.28)

where θ = {E, W^1, W^2, b^1, b^2} are the trainable parameters of the network, and h(·) is the activation function. E is the word embedding matrix, which is initialized with our pre-trained embeddings, and W^1 ∈ R^{n_{hu} × (d^{wrd} × N)}, W^2 ∈ R^{K × n_{hu}}, b^1 ∈ R^{n_{hu}} and b^2 ∈ R^K are the classifier parameters, with K the number of classes.
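A minimal NumPy sketch of this window scorer is given below; the embedding matrix `E`, the padding index, and the parameter shapes are illustrative placeholders, not the thesis's actual configuration.

```python
import numpy as np

def window_scores(E, sentence_ids, n, W1, b1, W2, b2):
    """Score every tag for every word of a sentence with the window approach (Eq. 4.27-4.28).

    E            : (|W|, d_wrd) word embedding matrix.
    sentence_ids : list of word indices for the sentence.
    n            : number of context words on each side (window of N = 2n + 1 words).
    Returns a (T, K) matrix of tag scores.
    """
    pad = 0                                             # index of a special PADDING word
    padded = [pad] * n + list(sentence_ids) + [pad] * n
    scores = []
    for t in range(len(sentence_ids)):
        window = padded[t:t + 2 * n + 1]
        x = E[window].reshape(-1)                       # concatenation, size d_wrd * N
        hidden = np.tanh(W1 @ x + b1)                   # nonlinear hidden layer
        scores.append(W2 @ hidden + b2)                 # one score per tag
    return np.stack(scores)
```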

CRF-type inference

There exist strong dependencies between tags in a sentence: some tags cannot follow other tags. To take the sentence structure into account, we want to encourage valid paths of tags during training, while discouraging all other paths. Considering the matrix of scores φ_θ(s) ∈ R^{T×K} outputted by the network for a given sentence s of T words, we train a conditional random field (CRF). At test time, the best path, i.e. the one maximizing the sentence score, is inferred with the Viterbi algorithm (Viterbi, 1967). The element [φ_θ]_{y_k, w_t} of the matrix is the score outputted by the network for the tag y_k at the word w_t ∈ s. We introduce a transition matrix Z, where

54 4.6. Supervised Evaluation Tasks

[Z]_{y_i, y_j} is the score for jumping from tag y_i to tag y_j in successive words, and an initial score [Z]_{y_i, 0} for starting from the tag y_i. As the transition scores are going to be trained, we define θ̃ = θ ∪ Z. The score of a sentence s along a path of tags y = {y_1, ..., y_T} is then given by the sum of transition scores and classification scores:

φ_{θ̃}(s, y) = Σ_{t=1}^{T} ( [Z]_{y_{t−1}, y_t} + [φ_θ]_{y_t, w_t} ) .   (4.29)

We normalize this score over all possible tag paths y¯ using a softmax, and we interpret the resulting ratio as a conditional tag path probability. Taking the log, the conditional probability of the true path y is therefore given by:

log P_{θ̃}(y | s) = φ_{θ̃}(s, y) − logadd_{∀ȳ} φ_{θ̃}(s, ȳ) ,   (4.30)

where we adopt the notation

logadd_i z_i = log( Σ_i e^{z_i} ) .   (4.31)

Computing the log-likelihood efficiently is not straightforward, as the number of terms in the logadd grows exponentially with the length of the sentence. It can be computed in linear time with the forward algorithm, which derives a recursion similar to the Viterbi algorithm (see Rabiner (1989)). This allows maximizing the log-likelihood over all the training pairs (s, y). In contrast to a classical CRF, all parameters θ̃ are trained in an end-to-end manner, by backpropagation through the forward recursion, following Collobert et al. (2011). At inference time, the Viterbi algorithm is used to find the best path for a given sentence s among all the possible paths ȳ:

argmax_{ȳ}  φ_{θ̃}(s, ȳ) .   (4.32)
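For illustration, here is a small NumPy sketch of the forward recursion (for the logadd term of Equation 4.30) and of Viterbi decoding (Equation 4.32); the array names and the handling of the initial transition scores are assumptions made for the sake of the example.

```python
import numpy as np
from scipy.special import logsumexp

def forward_logadd(scores, Z, z0):
    """logadd over all tag paths (Eq. 4.30), computed in linear time.

    scores : (T, K) network scores [phi_theta]_{y, w_t}.
    Z      : (K, K) transition scores, Z[i, j] = score for jumping from tag i to tag j.
    z0     : (K,) initial scores for starting from each tag.
    """
    delta = z0 + scores[0]                           # logadd over paths ending in each tag
    for t in range(1, len(scores)):
        delta = logsumexp(delta[:, None] + Z, axis=0) + scores[t]
    return logsumexp(delta)

def viterbi(scores, Z, z0):
    """Best tag path (Eq. 4.32) with the Viterbi algorithm."""
    T, K = scores.shape
    delta = z0 + scores[0]
    back = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        trans = delta[:, None] + Z                   # (from_tag, to_tag) scores
        back[t] = trans.argmax(axis=0)
        delta = trans.max(axis=0) + scores[t]
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# log P(y | s) = phi(s, y) - forward_logadd(scores, Z, z0)
```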

4.6.3 Experimental Setup

Word Embeddings

For the purpose of a fair comparison, we choose to compare only with other word embeddings that have been trained on the same corpus as our word embeddings (i.e. the entire English Wikipedia). We thus compare with word embeddings from SENNA6 and word embeddings trained with a Skip-gram model. For all embeddings, we train two versions of our system. In the first version, word embeddings are considered as fixed inputs, as a classical approach would use them. In the second version, we leverage the deep architecture of our system and tune the word embeddings for the given task.

6Available at http://ml.nec-labs.com/senna/


SENNA SENNA’s embeddings cover 130,000 words with 50 dimensions for each word. They were trained for about two months, over Wikipedia, using a NNLM with a pairwise ranking approach (see Section 3.2.1). These embeddings have, furthermore, been fine-tuned for NER and chunking. This makes them highly competitive for these two tasks.

Skip-gram Nowadays, the Skip-gram model (see Section 3.2.3 for details) is considered the state-of-the-art method for obtaining word embeddings. This method has become popular thanks to the word2vec7 toolkit, which provides a user-friendly and efficient implementation of the method. We thus use this toolkit to compute word embeddings on the same Wikipedia corpus using the same context window sizes (1, 5 and 10 words). Embeddings have been trained with negative sampling (with 5 negative samples) and sub-sampling of frequent words (with t = 10^-5).

As SENNA’s embeddings are only available with d^{wrd} = 50 dimensions, we compute word embeddings with Skip-gram and with the Hellinger PCA (H-PCA) in that dimension for this experiment. To highlight the importance of the Hellinger metric, we also compute word embeddings using the matrix C instead of √C. These embeddings are named Euclidean PCA (E-PCA).

Other Features and Hyper-parameters

For chunking, the networks are fed only one raw feature: the word embeddings. For NER, where the goal is to tag entities, we use an additional raw feature: a capital letter feature. The “caps” feature tells whether each word was in lowercase, was all uppercase, had a first capital letter, or had at least one non-initial capital letter. Each caps feature is mapped to an embedding of size 5, which is learned during training like the other parameters. No other feature has been used to tune the models. This is a main difference from other systems, which usually include more features, such as part-of-speech tags, prefixes and suffixes, or gazetteers (only for NER). We also introduce a special “PADDING” word for context at the beginning and the end of each sentence. Hyper-parameters were tuned by early stopping on the validation set. We selected n = 2 context words, leading to a window of 5 words. The number of hidden units is n_{hu} = 300. For chunking, the best context window size for the embeddings is 1. This makes sense since it is a purely syntactic task. For NER, the best context window size is 5. This task combines both syntactic and semantic aspects, so it is not surprising that this context window size gives the best performance. Since the capacity of our tagging model mainly comes from the word embeddings, we use word embedding dropout as a regularization, following Legrand and Collobert (2014).

7Available at https://code.google.com/archive/p/word2vec/


Embeddings Normalization

Word embeddings are continuous vectors whose values do not necessarily lie in a bounded range. To avoid saturation issues in the network architectures, the embeddings need to be properly normalized. Considering the matrix of word embeddings E, we normalize the rows of E.

4.6.4 Results

                     CHUNK                                 NER
Embeddings           Fixed           Tuned                 Fixed           Tuned

BENCHMARK            94.29                                 89.31
SENNA                93.35 ± 0.06    94.18 ± 0.05          88.54 ± 0.13    89.55 ± 0.15
SKIP-GRAM            91.83 ± 0.09    93.66 ± 0.05          87.50 ± 0.20    88.96 ± 0.24
E-PCA                90.46 ± 0.09    93.18 ± 0.07          85.56 ± 0.16    87.50 ± 0.07
H-PCA                92.74 ± 0.10    94.19 ± 0.07          88.07 ± 0.25    89.45 ± 0.09

Table 4.5 – Performance comparison on named entity recognition (NER) and chunking (CHUNK) tasks with different embeddings. The “Fixed” columns report results with the original embeddings; the “Tuned” columns report results after fine-tuning the embeddings for the task. Results are reported in F1 score (mean ± standard deviation over ten training runs with different initializations). H-PCA stands for Hellinger PCA, while E-PCA stands for Euclidean PCA.

H-PCA’s embeddings

Results summarized in Table 4.5 reveal that performance on both tasks can be as good with word embeddings from a word co-occurrence matrix decomposition as with a NNLM. The F1 scores with the tuned H-PCA embeddings are as good as with the tuned SENNA embeddings, which yield state-of-the-art results on both tasks. When the embeddings are not tuned, H-PCA's embeddings are slightly outperformed by SENNA's embeddings, which is not surprising as the latter are already fine-tuned for those tasks. However, H-PCA's embeddings always achieve better results than Skip-gram's embeddings on both tasks. It is worth mentioning that on both tasks, H-PCA's embeddings outperform the E-PCA's embeddings, demonstrating the value of the Hellinger distance.

Embedding Fine-Tuning

By leveraging the deep architecture of our system, we can tune the word embeddings by backpropagating the error through the embedding layer. Results in Table 4.5 show that tuning the embeddings increases the overall performance on both the chunking and NER tasks. This is a great advantage over conventional approaches, which are not structurally able to fine-tune the embeddings and must leave them unchanged.


CPU Time

Embedding size            d^wrd = 50                  d^wrd = 512
Context Window Size       1          5                1          5

SENNA                     2 months                    -
SKIP-GRAM                 39 min     62 min           73 min     138 min
H-PCA                     24 sec     77 sec           229 sec    660 sec

Table 4.6 – CPU time for computing word embeddings. min is for minutes and sec for seconds.

The Hellinger PCA is very fast to compute8. We report in Table 4.6 the time needed to compute the embeddings used for this experiment (50 dimensions). For this benchmark we used an Intel i7 3770K 3.5 GHz CPU. Although word embeddings with word2vec are also quickly computed, we see that the randomized SVD obtains the first 50 principal components in a few seconds. When setting a larger number of components, this operation takes a bit longer, but it provides all embedding sizes in a single operation. This is not the case with the Skip-gram model, where word2vec needs to be relaunched for each new embedding size. We intentionally do not mention the time for building the word co-occurrence matrix in Table 4.6. The sparse matrix C is stored as a set of triplets {w_t, c_t, n(w_t, c_t)}, making our model much more scalable than a NNLM, where each pair {w_t, c_t} is treated separately. Counts n(w_t, c_t) can easily be aggregated over time and over multiple corpora of text. This counting operation is furthermore highly parallelizable. For each new matrix C, only the randomized SVD needs to be rerun to extract new word embeddings, while a NNLM needs to run a complete training process.

4.7 Embedding Inference

While inference is not possible with NNLM-based methods, one main advantage of computing word embeddings through the Hellinger PCA is the possibility to infer embeddings for unseen words. Given a new word w_new, one only needs to count its context words over a large corpus of text to build the distribution P_{w_new}. The embedding for that word is then computed with the eigenvectors, by projecting its word co-occurrence distribution P_{w_new} into the dimensionally reduced feature space:

e_{w_new} = A_d^T √P_{w_new} ,   (4.33)

where A_d contains the first d eigenvectors, and e_{w_new} ∈ R^d is the resulting embedding for w_new. The few examples in Table 4.7 show that this nice feature can be extrapolated to phrases. By building word co-occurrence distributions for phrases in the same way, phrase embeddings are easily computed. It thus becomes a valuable asset which offers a simple approach for embedding sequences of words, such as entities or multiword expressions.
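A short NumPy sketch of this inference step is given below; the function and variable names are hypothetical, and the context counts are assumed to be gathered over the same context vocabulary used to build C.

```python
import numpy as np

def infer_embedding(context_counts, A_d):
    """Infer an embedding for an unseen word or phrase from its context counts (Eq. 4.33).

    context_counts : (|D|,) raw co-occurrence counts of the new word with the context vocabulary.
    A_d            : (|D|, d) matrix of the first d eigenvectors (or right singular vectors).
    """
    total = context_counts.sum()
    if total == 0:
        return np.zeros(A_d.shape[1])       # no evidence: return a zero vector
    P_new = context_counts / total          # co-occurrence probability distribution
    return A_d.T @ np.sqrt(P_new)           # project the Hellinger-transformed distribution

# usage: counts gathered with a 10-word symmetric window around every occurrence of the phrase
# e_new = infer_embedding(counts, A_d)
```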

8The randomized SVD is done with MATLAB. We use the implementation of Mark Tygert available at http: //tygert.com/software.html.


NEW YORK CITY       PRESIDENT OF THE UNITED STATES      HOME PLATE

MANHATTAN           LINCOLN                             INFIELD
BROOKLYN            REAGAN                              DUGOUT
MINNEAPOLIS         TRUMAN                              BATTER
HARLEM              APPOINTEE                           FIELDERS
CHICAGO             PRESIDENT-ELECT                     OUTFIELD
BOSTON              PRESIDENT                           CREASE
D.C.                NIXON                               ELBOW
WASHINGTON          NOMINATING                          SIDELINE
NYC                 EISENHOWER                          GOALPOST
THEATER             SENATOR                             BALL

Table 4.7 – Three phrases with their 10 nearest words with respect to the Euclidean distance between the inferred phrase embeddings and the pre-computed word embeddings. Word co-occurrence statistics for these phrases are built using a context window of 10 words with a vocabulary containing the 6961 most frequent words.

4.8 Implementation

We implemented a standalone version of the Hellinger PCA for computing word embeddings, written in C++9. The runtime version contains about 2,800 lines of C++ code, and it has been designed to run on any standard computer. The toolkit provides 7 different tools covering the following steps:

1. Corpus pre-processing. Given a tokenized corpus of text, preprocess implements lowercase conversion and/or replaces all numbers with a special token (0).

2. Vocabulary extraction. Given a pre-processed corpus, vocab extracts all words with their respective frequency.

3. Getting the co-occurrence probability matrix. Given the pre-processed corpus and the extracted vocabulary, cooccurrence constructs word-word co-occurrence statistics from the corpus. Several options are available for setting the word context vocabulary and the context window size, and for discarding target words with a low frequency of occurrence.

4. Performing the Hellinger PCA. When the co-occurrence matrix is ready and given the number of components to keep, pca runs the randomized SVD with respect to the Hellinger distance. This tool uses the external redsvd library10, which implements the

9Available at https://github.com/rlebret/hpca 10Available at https://code.google.com/p/redsvd/.

59 Chapter 4. Word Embeddings through Hellinger PCA

randomized SVD using Eigen311.

5. Extracting word embeddings. embeddings generates word embeddings from the Hellinger PCA for a given dimension. An option for eigenvalue weighting is available, as well as embedding normalization.

6. Evaluating word embeddings. eval provides a quick evaluation of the word embeddings produced by embeddings for an English corpus. It includes all datasets described in Section 4.3.2.

7. Computing word embeddings nearest neighbors. neighbors is an exploratory tool to evaluate word embeddings quality. Given a word, it computes its nearest neighbors according to the Euclidean distance between their embeddings.

The first three steps, which lead to the creation of the word co-occurrence matrix, are all highly parallelizable. POSIX threads are thus used to allow parallel execution and speed up the process. Users can also control the memory usage when building the co-occurrence matrix.
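To make the counting step concrete, here is a hedged Python sketch of steps 2–3 (vocabulary handling and co-occurrence counting); it mirrors the description above but is not the toolkit's C++ code, and the window handling and data structures are illustrative.

```python
from collections import Counter, defaultdict

def build_cooccurrence(sentences, context_vocab, window=10):
    """Count word/context co-occurrences with a symmetric window.

    sentences     : iterable of token lists (pre-processed corpus).
    context_vocab : set of context words (e.g. words with P(w) > 1e-5).
    Returns {target_word: Counter(context_word -> count)}.
    """
    counts = defaultdict(Counter)
    for tokens in sentences:
        for t, word in enumerate(tokens):
            lo, hi = max(0, t - window), min(len(tokens), t + window + 1)
            for c in tokens[lo:t] + tokens[t + 1:hi]:
                if c in context_vocab:
                    counts[word][c] += 1
    return counts

def to_probabilities(counts):
    """Normalize each row to get P(c | w), i.e. one row of the matrix C."""
    return {w: {c: n / sum(ctx.values()) for c, n in ctx.items()}
            for w, ctx in counts.items()}
```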

4.9 The Revival of Count-based Methods

After we introduced our count-based model for obtaining word embeddings, other new count-based approaches emerged. In this section, we present two recent works that propose alternative approaches to build the word co-occurrence matrix.

4.9.1 SVD over Shifted Positive Pointwise Mutual Information

Levy and Goldberg (2014) show that the Skip-gram model implicitly factorizes a word co-occurrence matrix, where each entry is the pointwise mutual information (PMI) of the word and context pairs, shifted by a global constant. PMI measures the association between a word w_t and a context c_t:

PMI(w_t, c_t) = log [ P(w_t, c_t) / ( P(w_t) P(c_t) ) ] = log [ n(w_t, c_t) · |S| / ( n(w_t) n(c_t) ) ] .   (4.34)

The positive PMI (PPMI) has been shown to be a better alternative for word representation (Bullinaria and Levy, 2007), as it allows for sparsity and consistency:

PPMI(w_t, c_t) = max( PMI(w_t, c_t), 0 ) .   (4.35)

Inspired by the success of negative sampling in the Skip-gram model, the authors then propose a shifted version of PPMI:

SPPMI(w_t, c_t) = max( PMI(w_t, c_t) − log(k), 0 ) ,   (4.36)

11See http://eigen.tuxfamily.org/.

where k is a prior on the probability of observing a positive example (an actual occurrence of (w_t, c_t) in the corpus S) versus a negative example. After building the Shifted PPMI matrix, SVD is used for dimensionality reduction.
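The sketch below shows, under simplifying assumptions (dense matrices, a shared target/context vocabulary), how an SPPMI matrix can be built from raw counts and factorized with a truncated SVD; it is an illustration of the construction described above, not Levy and Goldberg's released code.

```python
import numpy as np

def sppmi_embeddings(counts, k=5, d=100):
    """Word embeddings from SVD of the Shifted Positive PMI matrix (Eq. 4.34-4.36).

    counts : (V, V) matrix of raw word-context co-occurrence counts n(w, c).
    k      : shift, i.e. the number of negative samples in the equivalent Skip-gram model.
    d      : embedding dimension.
    """
    counts = counts.astype(float)
    total = counts.sum()
    n_w = counts.sum(axis=1, keepdims=True)            # n(w)
    n_c = counts.sum(axis=0, keepdims=True)            # n(c)
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(counts * total / (n_w * n_c))     # log [ n(w,c) |S| / (n(w) n(c)) ]
    pmi[~np.isfinite(pmi)] = 0.0                       # zero counts give -inf: drop them
    sppmi = np.maximum(pmi - np.log(k), 0.0)           # shift by log(k) and clip at zero
    U, sigma, Vt = np.linalg.svd(sppmi, full_matrices=False)
    return U[:, :d] * sigma[:d]                        # one common choice: U_d Sigma_d
```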

4.9.2 Global Vectors (GloVe)

In contrast with our model, Pennington et al. (2014) suggest that the appropriate starting point for word vector learning should be ratios of co-occurrence probabilities rather than the probabilities themselves. A training objective is then defined to learn word embeddings such that their dot product equals the logarithm of their co-occurrence count:

e_{w_i} · ẽ_{w_j} + b_{w_i} + b̃_{w_j} = log n(w_i, w_j)   ∀ (w_i, w_j) ∈ S ,   (4.37)

where e_{w_i} and ẽ_{w_j} ∈ R^{d^{wrd}} are word embeddings, and b_{w_i} and b̃_{w_j} are additional scalar biases. Because the logarithm of a ratio equals the difference of logarithms, this objective associates (the logarithm of) ratios of co-occurrence probabilities with vector differences in the word vector space. The model is fit to minimize a weighted least squares loss, giving more weight to frequent (w_i, w_j) pairs. At the end, each word w_t has two different embeddings e_{w_t} and ẽ_{w_t}; the authors use the summation of the two word vectors as the final embedding. As with the Skip-gram model, the main drawbacks of this model are that it does not allow inference of unseen words, and that it trains one embedding size at a time.
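As a concrete illustration of this weighted least squares objective, here is a small PyTorch sketch of the GloVe loss for a batch of pairs; the weighting-function parameters (x_max = 100, alpha = 0.75) follow commonly used defaults and the training loop is omitted.

```python
import torch

def glove_loss(E, E_tilde, b, b_tilde, i_idx, j_idx, counts,
               x_max=100.0, alpha=0.75):
    """Weighted least squares GloVe objective for a batch of (w_i, w_j) pairs.

    E, E_tilde : (V, d) embedding tables; b, b_tilde : (V,) bias vectors.
    i_idx, j_idx, counts : 1-D tensors holding the pairs and their counts n(w_i, w_j).
    """
    pred = (E[i_idx] * E_tilde[j_idx]).sum(dim=1) + b[i_idx] + b_tilde[j_idx]
    err = pred - torch.log(counts)
    # f(x) = min(1, (x / x_max)^alpha): rare pairs get a smaller weight,
    # very frequent pairs are capped at weight 1.
    weight = torch.clamp(counts / x_max, max=1.0) ** alpha
    return (weight * err ** 2).sum()
```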

4.10 Conclusion

We have demonstrated that appealing word embeddings can be obtained by computing a Hellinger PCA of the word co-occurrence matrix. While a NNLM can be painful and slow to train, a word co-occurrence matrix can be obtained by simply counting words over a large corpus of text. The resulting embeddings give similar results on NLP tasks, even from a word co-occurrence matrix computed with only a relatively small context vocabulary (i.e. a small number of columns). It reveals that a significant, but not too large, set of common words seems sufficient for capturing most of the syntactic and semantic characteristics of words. As the PCA of such a matrix is really fast to compute, our method is an interesting and practical alternative to NNLM for generating word embeddings. By leveraging the deep architecture of neural networks, we also show that existing embeddings can be fine-tuned for a specific task, which improves the overall performance on this task. Last but not least, this method enables inference of embeddings for unseen words or phrases.


5 Towards Phrase Embeddings

While there has been a lot of effort to capture the meaning of words, distributed representations of phrases still remain a challenge. Many recent works are however based on distributed representations of phrases to tackle a wide range of applications in NLP: machine translation (Bahdanau et al., 2015; Zhao et al., 2015), constituency parsing (Legrand and Collobert, 2014) or sentiment analysis (Socher et al., 2013). In all these works, phrase representations are learned from the composition of word embeddings. There is therefore a clear need for word embeddings that can be easily extrapolated to meaningful phrase representations.

We argue that distributed representation and composition must go hand in hand, i.e., they must be mutually learned. We present a model that learns to capture the meaning of words in distributed representations using a low-rank approximation of a large word co-occurrence matrix. We choose to perform this low-rank approximation stochastically (with an autoencoder network), which enables the model to simultaneously train these representations to compose well for producing representations of phrases (see Figure 5.1). As the composition function, we choose a simple weighted addition, for its simplicity and because it enables sequences of words of different lengths to be represented in a common vector space. Aside from generating distributed representations of words and phrases, this model gives an encoding function (represented by a matrix) which can be used to encode new words or even phrases based on their co-occurrence counts. This offers two different alternatives for phrase representations: (1) the representation for a query phrase can be inferred by averaging the vector representations of its words (only if they all were in the training set), or (2) by using its word co-occurrence statistics.

Evaluation on the popular word similarity and analogy tasks demonstrates the capability of our joint model to capture distributed representations as good as those of the Hellinger PCA. We then introduce a novel task for evaluating phrase representations. Given a phrase representation, the objective is to retrieve the words that compose the phrase. We compare our model against other state-of-the-art methods for distributed word representations which capture meaningful linear substructures (Mikolov et al., 2013a; Pennington et al., 2014). We show that our model achieves similar performance on the word evaluation tasks, but that it outperforms the other methods on the phrase evaluation task.


5.1 Related Work

Given representations of words in a vector space, techniques for combining them have been proposed to obtain representations of phrases or sentences. These compositional models involve vector addition or multiplication (Mitchell and Lapata, 2010). Such simple compositions have been shown to perform competitively on paraphrase detection and phrase similarity tasks (Blacoe and Lapata, 2012). More sophisticated approaches use techniques from logic, category theory, and quantum information (Clark et al., 2008). Others use the syntactic relations between words to treat certain words as functions and others as arguments, such as adjective-noun composition (Baroni and Zamparelli, 2010) or noun-verb composition (Grefenstette et al., 2013). A model for semantic compositionality has also been proposed (Socher et al., 2012), where each word has a matrix-vector representation: the vector captures its meaning (as it is initialized with a pre-trained distributed representation), while the matrix learns, through a parse tree, how it modifies the meaning of the other word that it combines with.

All these methods learn to compose pre-trained word embeddings. In contrast, we propose to simultaneously learn word embeddings and the composition function.

5.2 Hellinger PCA with Autoencoder

Inspired by the success of the Hellinger PCA for computing meaningful word embeddings (see Chapter 4), we propose to perform the low-rank approximation stochastically. For this purpose, we use an autoencoder with only linear activations to find an optimal solution related to the Hellinger PCA (Bourlard and Kamp, 1988). Replacing the PCA by an autoencoder allows us to jointly learn a cost function which constrains the word information to be kept under summation. An autoencoder is employed to represent words in a lower-dimensional space. It takes a distribution P_{w_t} as input, encodes it in a more compact representation, and is trained to reconstruct its own input from that representation:

‖ √P_{w_t} − g( f( √P_{w_t} ) ) ‖_2 ,   (5.1)

where φ_θ = g( f( √P_{w_t} ) ) is the output of the network, f(·) is the encoding function which maps distributions into a d^{wrd}-dimensional space (with d^{wrd} ≪ |D|), and g(·) is the decoding function. f(√P_{w_t}) is a distributed representation that captures the main factors of variation in the data, as the Hellinger PCA does. Here, the encoder f(·) and decoder g(·) are defined as follows:

f(x) = U x ,   g(x) = V^T x ,   (5.2)


where x ∈ R^{|D|}, and U, V ∈ R^{d^{wrd}×|D|}. We see that we can reformulate Equation 5.1 as follows:

‖ √P_{w_t} − V^T U √P_{w_t} ‖_2 ,   (5.3)

which corresponds to the same minimization as with the Hellinger PCA (see Equation 4.14). The autoencoder parameters θ = {U, V} are trained by backpropagation using stochastic gradient descent.
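A minimal PyTorch sketch of this linear autoencoder is shown below; the class name and hyper-parameters are illustrative, and only the reconstruction objective of Equation 5.3 is covered here (the composition objective is added in the next section).

```python
import torch
import torch.nn as nn

class LinearAutoencoder(nn.Module):
    """Linear autoencoder over sqrt(P_w): encoder U, decoder V^T (Eq. 5.2-5.3)."""

    def __init__(self, context_size, d_wrd):
        super().__init__()
        self.encoder = nn.Linear(context_size, d_wrd, bias=False)   # x -> U x
        self.decoder = nn.Linear(d_wrd, context_size, bias=False)   # x -> V^T x

    def forward(self, sqrt_p):
        code = self.encoder(sqrt_p)            # word representation x_w = f(sqrt(P_w))
        return code, self.decoder(code)

# one SGD step on a batch of sqrt-transformed co-occurrence rows `sqrt_p`:
# model = LinearAutoencoder(context_size=10000, d_wrd=100)
# opt = torch.optim.SGD(model.parameters(), lr=0.01)
# code, recon = model(sqrt_p)
# loss = torch.norm(sqrt_p - recon, dim=1).sum()   # L2 reconstruction error (Eq. 5.3)
# loss.backward(); opt.step(); opt.zero_grad()
```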

5.3 Joint Learning with Summation

Interesting compositionality properties have been observed in models based on the addition of representations (Mikolov et al., 2013b). An exhaustive comparison of different composition functions has indeed revealed that an additive model performs well on pre-trained word representations (Mitchell and Lapata, 2010). Because our word representations are learned from linear operations, the inherent structure of these representations is linear. To combine a sequence of words into a common vector space, we then simply apply an element-wise addition of their vector representations. This approach makes sense and works well when the meaning of a text is literally “the sum of its parts”. This is usually the case with noun and verb phrase chunks. For example, in phrases such as “the red cat” or “struggle to deal”, each word independently has its own meaning. Distributed representations for such phrase chunks must retain information from the individual words. An objective function is thus defined to learn how to combine the word vector representations, while keeping the maximum information from the original vectors. An operation as simple as a weighted sum will probably fail for sequences where individual words act as operators that modify the meaning of another word, or for multiword expressions. Other, more complex functions could be chosen to also cover such cases, but we choose to propose a much simpler model (i.e., averaging the word representations) to get phrase chunk representations with unsupervised learning. In this work, we therefore focus on noun and verb phrase chunks.

5.3.1 Training an Additive Model

We define s = {w_1, ..., w_T} a phrase chunk of T words, with S a set of phrase chunks. By feeding all √P_{w_t} into the autoencoder, a representation x_{w_t} ∈ R^{d^{wrd}} of each word w_t ∈ W is obtained:

x_{w_t} = f( √P_{w_t} ) .   (5.4)

By an element-wise addition, a representation of the phrase chunk s can be calculated as:

x_s = (1/T) Σ_{w_t ∈ s} x_{w_t} .   (5.5)



Figure 5.1 – Architecture for the joint learning of word representations and their summation. Considering the noun phrase s = “the red cat”, each word w_t ∈ s is represented as the square root of its co-occurrence probability distribution P_{w_t}. These are the inputs given to an autoencoder, which encodes them in a lower dimension x_{w_t} ∈ R^{d^{wrd}}. These new representations are then given to a decoder which is trained to reconstruct the initial inputs; this is the first objective function. The second objective is to keep information when words are summed: all the x_{w_t} are averaged together to represent s in the same space as the w_t, and a dot product between the phrase representation x_s and all the other word representations from the dictionary W is calculated. These scores are trained to be high for words that appear in s and low for the others.


In predictive-based models, such as the Skip-gram model, the objective is to maximize the likelihood of a word based on other words in the same sequence (see Section 3.2.3). Our training is slightly different, in the sense that we aim at discriminating whether words are in the phrase chunk or not. An objective function is thus defined to encourage words w_t which appear in the chunk s to give high scores when calculating the dot product between x_{w_t} and x_s. On the other hand, these scores must be low for words w_i ∉ s that do not appear in the chunk. We train this problem with a ranking-type cost:

Σ_{s ∈ S}  Σ_{w_t ∈ s}  Σ_{w_i ∈ W, w_i ∉ s}  max( 0, 1 − x_s · x_{w_t} + x_s · x_{w_i} ) .   (5.6)

5.3.2 Joint Learning with Negative Sampling

In contrast with other methods, which have subsequently found nice compositionality properties through simple summation, the novelty of our method is the explicit learning of word representations suitable for summation. The system is thus designed to force words with similar contexts to be close in a d^{wrd}-dimensional space, while these dimensions are learned to be combined with other related words. This joint learning is illustrated in Figure 5.1.

Due to the large size of W, a negative sampling approach is used to speed up the training. In Equation 5.6, the whole dictionary W is thus replaced by a subset W^− ⊆ W of N randomly chosen negative samples w_i ∉ s. A new set W^− is randomly picked at each iteration during training. The whole system is trained by minimizing both objective functions (5.1) and (5.6) over the training data using stochastic gradient descent. For a training sample s ∈ S, the loss function is thus as follows:

L(s; θ) = Σ_{w_t ∈ s} ( ‖ √P_{w_t} − g( f( √P_{w_t} ) ) ‖_2 + Σ_{w_i ∈ W^−, w_i ∉ s} max( 0, 1 − x_s · x_{w_t} + x_s · x_{w_i} ) ) .   (5.7)
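The sketch below adds the ranking term of Equation 5.7 on top of the linear autoencoder from Section 5.2; the sampling of negatives, the batching, and the variable names are simplifying assumptions made for illustration.

```python
import torch

def joint_loss(model, sqrt_p_chunk, all_codes, chunk_idx, n_negatives=10):
    """Reconstruction + ranking loss for one phrase chunk (Eq. 5.7).

    model        : the linear autoencoder (encoder f, decoder g).
    sqrt_p_chunk : (T, |D|) sqrt co-occurrence rows of the T words in the chunk.
    all_codes    : (|W|, d_wrd) representations x_w of the whole dictionary.
    chunk_idx    : indices of the chunk words in the dictionary.
    """
    codes, recon = model(sqrt_p_chunk)                       # x_{w_t} = f(sqrt(P_{w_t}))
    rec_loss = torch.norm(sqrt_p_chunk - recon, dim=1).sum() # first term of Eq. 5.7

    x_s = codes.mean(dim=0)                                  # phrase representation (Eq. 5.5)
    neg_idx = torch.randint(all_codes.size(0), (n_negatives,))
    neg_idx = neg_idx[~torch.isin(neg_idx, torch.as_tensor(chunk_idx))]  # keep w_i not in s
    pos_scores = codes @ x_s                                 # x_s . x_{w_t}, shape (T,)
    neg_scores = all_codes[neg_idx] @ x_s                    # x_s . x_{w_i}
    rank_loss = torch.clamp(1 - pos_scores[:, None] + neg_scores[None, :], min=0).sum()
    return rec_loss + rank_loss
```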

5.4 Experimental Results

For our experiments, the corpus of text is the entire English Wikipedia described in Section 4.3.1.

5.4.1 Phrase Dataset

To learn the summation of words that frequently appear together, we choose to consider only the noun and verb phrase chunks to build S. We extract these chunks with a phrase chunking approach using the SENNA software1. By retaining only the phrase chunks appearing at least ten times, this results in 1,823,259 noun phrase chunks and 255,232 verb phrase chunks,

1Available at http://ml.nec-labs.com/senna/

for a total of 2,078,491 phrase chunks. We divided this set of phrases into three sets: 1,000 phrases for validation, 5,000 phrases for testing, and the rest for training (2,072,491 phrases). An unsupervised framework requires a large amount of data. Because our primary focus is to provide good word representations, the validation and testing sets are intentionally kept small to retain as many phrases as possible in the training set.

5.4.2 Other Methods

We compare our distributed representations with other available models for computing vector representations of words:

1. the GloVe model which is also based on co-occurrence statistics of corpora (Pennington et al., 2014)2 (see Section 4.9.2),

2. the Skip-gram (SG) model, which learns representations with a prediction-based model (Mikolov et al., 2013b)3 (see Section 3.2.3).

The same corpus and dictionary W as the ones described in Section 4.3.1 are used to train 100-dimensional word vector representations. We use a symmetric context window of ten words, and the default values set by the authors for the other hyper-parameters.

5.4.3 Evaluating Word Representations

                          SVD     AUTOENCODER

WORDSIMILARITY-353        0.64    0.64
RARE WORD                 0.37    0.39
SYNTACTIC ANALOGIES       65.6    68.0
SEMANTIC ANALOGIES        52.7    51.3

Table 5.1 – Evaluation of word representations on both similarity and analogy tasks. Comparison of performance between the Hellinger PCA with randomized SVD and with an autoencoder. We use 100-dimensional word vector representations. Spearman rank correlation is reported on word similarity tasks. Accuracy is reported on word analogy tasks.

The first objective of the model is to learn word embeddings as good as those obtained with the Hellinger PCA through randomized SVD. To evaluate this, we use both the analogy and similarity tasks described in Section 4.3.2. As expected, results reported in Table 5.1 show that our model gives results similar to those of the randomized SVD. Even with the addition of a second objective function,

2Code available at http://www-nlp.stanford.edu/software/glove.tar.gz. 3Code available at http://word2vec.googlecode.com/svn/trunk/.

68 5.4. Experimental Results the autoencoder approach has captured the same syntactic and semantic information about words.

5.4.4 Evaluating Phrase Representations

We aim at learning to sum word representations to generate phrase representations, while keeping the original information coming from the words. We thus introduce a novel task to evaluate the phrase representations.

Description of the Task

As dataset, we use the collection of test phrases described in Section 5.4.1. It contains 5,000 phrases (noun phrases and verb phrases) extracted from Wikipedia with a chunking approach. Among them, 2,244, 2,030 and 547 are, respectively, composed of two, three and four words. The remaining 179 are composed of at least five words, with a maximum of eight words. For a given phrase s = {w_1, ..., w_T} ∈ S of T words, the objective is to retrieve the T words from its distributed representation x_s. Scores between the phrase s and all the possible words w_i ∈ W are calculated using the dot product between their distributed representations, x_s · x_{w_i}, as illustrated in Figure 5.1. The top T scores are considered as the words composing the phrase s.

Results

To evaluate whether the words making up a given phrase can be retrieved from the distributed phrase representation, we use Recall@K, which measures the fraction of times a correct word was found among the top K results. K is proportional to the number of words per phrase, e.g. for a 3-word phrase with Recall@5, the correct words are searched for among the top 15 results. A higher Recall@K means better retrieval performance. Since we care most about the top-ranked retrieved results, the Recall@K values with small K are the most important.
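A small sketch of this evaluation is given below; the scoring by dot product follows the task description, while the function name and data layout are illustrative.

```python
import numpy as np

def recall_at_k(phrase_vectors, phrase_words, word_vectors, k=1):
    """Fraction of phrase words retrieved among the top (k * T) scored words.

    phrase_vectors : (P, d) phrase representations x_s.
    phrase_words   : list of index lists, the words composing each phrase.
    word_vectors   : (|W|, d) word representations x_w.
    """
    hits, total = 0, 0
    scores = phrase_vectors @ word_vectors.T            # dot products x_s . x_w
    for s in range(len(phrase_words)):
        T = len(phrase_words[s])
        top = np.argsort(-scores[s])[: k * T]           # top k*T candidate words
        hits += len(set(top.tolist()) & set(phrase_words[s]))
        total += T
    return hits / total
```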

                R@1      R@5      R@10

SKIP-GRAM       7.96     22.26    30.04
GLOVE           54.97    79.97    86.54
SVD             17.42    32.87    40.72
OUR MODEL       64.22    91.72    95.85

Table 5.2 – Evaluation of phrase representations. Comparison of performance across all models with 100-dimensional phrase vector representations on word retrieval. R@K is Recall@K , with K = {1,5,10}.

Results reported in Table 5.2 show that our distributed word representations can be averaged together to produce meaningful phrase representations, since the words are retrieved with a high recall. Our model significantly outperforms the other methods on this task. The comparison with the SVD results is particularly interesting, since we observe a significant gap. By introducing the second objective, we learn word embeddings that are distributed differently, which allows the summation without discarding the information carried by the original vectors.

[Figure 5.2: Recall@1 (%) for phrases of 2, 3, 4, and 5+ words, comparing Our Model, GloVe, SVD, and Skip-gram.]

Figure 5.2 – Recall@1 based on the number of words per phrase. Comparison of performance across all models with 100-dimensional word vector representations.

In Figure 5.2, a more detailed analysis of the results reveals that the GloVe model competes with ours for the 2-word phrases. However, GloVe's representations cannot maintain this performance for longer phrases. This is probably not too surprising, as this model is trained using ratios of co-occurrence probabilities for two target words. Consequently, it learns linear substructures well for pairs of words. In contrast, our joint model can learn more complex substructures, which makes the aggregation of multiple words within a low-dimensional vector space possible.

5.4.5 Inferring New Phrase Representations

Representations for new phrases can thus be generated by simply averaging their word representations, assuming that all words are in the vocabulary W. Considering that the number of possible n-word sequences tends to grow exponentially with n, this gives a nice framework to produce representations for the huge variety of possible sequences of n words in a timely and efficient manner with low memory consumption, unlike other methods.


QUERY PHRASES                     NEAREST PHRASES
                                  ENCODING FUNCTION f(·)              AVERAGING WORDS

AMERICAN AIRLINES                 BRANIFF AIRLINES                    AMERICAN AIRWAYS
                                  ALOHA AIRLINES                      PAN AMERICAN AIRLINES
                                  BRANIFF AIRWAYS                     AMERICAN EAGLE AIRLINES
                                  JETBLUE AIRWAYS                     NORTH AMERICAN AIRLINES
                                  BRANIFF INTERNATIONAL AIRWAYS       AMERICAN OVERSEAS AIRLINES

CHICAGO BULLS                     DENVER NUGGETS                      CHICAGO COLTS
                                  SEATTLE SUPERSONICS                 CHICAGO HORNETS
                                  CLEVELAND CAVALIERS                 CHICAGO STAGS
                                  BOSTON CELTICS                      BUFFALO BULLS
                                  DALLAS MAVERICKS                    CHICAGO CARDINALS

HOME PLATE                        RIGHT FIELDER                       THE HOME PLATE UMPIRE
                                  CENTER FIELDERS                     THE HOME PLATE AREA
                                  THE OUTFIELD FENCE                  THE HOME LEG
                                  LEADOFF BATTER                      THE BALL HOME
                                  THE INFIELD                         THE DIAMOND STATE BASE BALL CLUB

PRESIDENT OF THE UNITED STATES    PRESIDENT COOLIDGE                  THE UNITED STATES PRESIDENT
                                  PRESIDENT EISENHOWER                THE UNITED STATES PRESIDENCY
                                  U.S. PRESIDENT DWIGHT EISENHOWER    THE FIRST UNITED STATES SECRETARY
                                  PRESIDENT TRUMAN                    THE UNITED STATES MINISTER
                                  PRESIDENT REAGAN                    THE FIRST UNITED STATES SENATOR

Table 5.3 – Examples of phrases and five of their ten nearest phrases from the collection of phrases. Representations for the collection of phrases have been computed by averaging the word representations. Query phrase representations are inferred using the two different alternatives: (1) with the encoding function f, using counts from a symmetric window of ten context words around the query phrase; (2) by averaging the representations of the words that compose the query phrase. All distributed representations are 100-dimensional vectors.

Relying on word co-occurrence statistics to represent words in vector space also provides a framework to easily generate representations for unseen words or phrases, as described in Section 4.7. Table 5.3 presents some examples of phrases, where we use both alternatives to compute their distributed representations. It can be seen that the two alternatives give distinct representations. For instance, by using the encoding function f(·), our model infers a representation for the entity Chicago Bulls which is close to other NBA teams, like the Denver Nuggets or the Seattle Supersonics. By averaging the representations of the words Chicago and Bulls, our model infers a representation which is close to other Chicago sports teams. Both representations are meaningful, but they carry different information. Relying on co-occurrence statistics gives entities that occur in a similar context, while the summation tries to find entities containing the maximum amount of similar information. This also works with longer phrases, such as President of the United States. The first alternative gives men who served as president, while the second gives related positions.


5.5 Conclusion

We introduce a model that combines count-based and predictive-based methods for generating distributed representations of words and phrases. Using a chunking approach, a collection of noun phrases and verb phrases is extracted from Wikipedia. For a given n-word phrase, we train our model to generate a low-dimensional representation of each word based on its co-occurrence probability distribution. These n representations are averaged together to generate a distributed phrase representation in the same semantic space. Thanks to the autoencoder approach, we can simultaneously train the model to retrieve the original n words from the phrase representation, and therefore learn complex linear substructures. Furthermore, we show that the autoencoder learns word embeddings as good as those of the conventional SVD. Performance on a novel task for evaluating phrase representations confirms the ability of our model to learn complex substructures, which makes the aggregation of multiple words within a low-dimensional vector space possible. Better still, inference of new phrase representations is also easily feasible when relying on counts. Some qualitative examples demonstrate that both alternatives can give different but meaningful information about phrases.

Part II   Document Classification


6 Sentiment Classification with Convolutional Neural Network

Successful methods for document classification are traditionally based on bag-of-words (BOW) representations, where words are transformed into numeric values. These are considered as features for training a classifier, such as Naive Bayes, Maximum Entropy, or a Support Vector Machine (see Section 2.2). Such methods work well in practice, since finding the most discriminative keywords is generally enough to classify documents. However, sentiments might be harder to detect when relying only on keywords. An obvious example is when “not” is used with an adjective: a negative expression is then likely to be predicted as a positive sentiment. Adding n-grams could be a solution to overcome this limitation, but the number of features grows exponentially with n, causing an increase in the computational cost. We thus propose to address this issue with an approach based on a convolutional neural network (CNN). CNNs are powerful models for classification, and are already a great success in computer vision. Thanks to word embeddings, CNN-based models can also be designed to tackle NLP problems. Words are represented as dense vectors that can be fed to a convolutional layer. Stacking many layers comes at a high computational cost, which requires the use of Graphics Processing Units (GPUs). However, traditional models in text document classification are relatively simple, leading to fast classification. In this chapter, we therefore introduce a simple CNN with only one convolutional layer followed by one max layer for classifying documents by sentiment. We evaluate our approach on short and long documents, using tweets and movie reviews.

6.1 Convolutional Neural Network for Sentiment Classification

In sentiment classification, we are given a document d ∈ D and a class y ∈ {−1; 1}, where y = −1 is negative sentiment and y = 1 is positive sentiment. Traditional NLP approaches extract a rich set of hand-designed features from documents, which are then fed to a standard classification algorithm. In contrast, we want to pre-process our features as little as possible. In that respect, a convolutional neural network architecture seems appropriate, as it can be trained in an end-to-end fashion on the task of interest (Collobert et al., 2011). A convolutional layer receives pre-trained word embeddings as inputs, and learns which sequences of words are good indicators of sentiments.


[Figure 6.1 illustrates the architecture on the example document d = “the craziest , most delirious spectacle you ’re likely to lay eyes on this year .”: the embedding layer produces word features, the convolution layer produces n_filter local features per window, the max layer produces n_filter global features, and a classifier outputs the sentiment score.]



Figure 6.1 – Convolutional neural network for sentiment classification.

are good indicators of sentiments. Documents might contain multiple sentiments with various levels of polarity, especially for long documents such as movie reviews. As we want to extract the most discriminative sentiments, we use a max layer which summarizes all local features into a global document representation. This final representation is used to train a classifier. The architecture is illustrated in Figure 6.1.

6.1.1 Embedding Layer

Given a document of T words {w_1, w_2, ..., w_T}, each word w_t ∈ W is first embedded into a d^wrd-dimensional vector space, using an embedding layer φ^1_θ as described in Section 2.1.3. The embedding layer produces the following matrix by applying this operation to all T words in d:

φ^1_θ(w_1, w_2, ..., w_T) = [ E_{w_1}  E_{w_2}  ...  E_{w_T} ] ∈ R^{d^wrd × T} ,   (6.1)


where E ∈ R^{d^wrd × |W|} is the word embedding matrix, which is initialized with pre-trained embeddings obtained with the Hellinger PCA (see Chapter 4). As sentiment classification is clearly a semantic task, we select embeddings obtained with a context window size of 10 words.

6.1.2 Convolutional Layer

The convolutional layer takes the complete document d and successively produces local features by applying a nonlinear transformation to all sequences of words in d. We define a kernel size k (a hyper-parameter), which corresponds to a fixed window size of words. The layer then applies a k-word sliding window over the matrix of embeddings. It first concatenates the k column vectors to produce a (d^wrd × k)-dimensional vector. This vector is then fed to a nonlinear hidden layer,

φ^2_θ(w_t, ..., w_{t+k−1}) = W^2 h( W^1 φ^1_θ(w_t, ..., w_{t+k−1}) + b^1 ) + b^2 ∈ R^{n_filter} .   (6.2)

The hyper-parameter n_filter is the number of filters of the convolution layer. The weight matrices W^1 ∈ R^{n_hu × (d^wrd × k)} and W^2 ∈ R^{n_filter × n_hu}, and the biases b^1 ∈ R^{n_hu} and b^2 ∈ R^{n_filter}, are the same across all windows in the document.

6.1.3 Global Document Representation

Each window of words in a document d is represented as a set of (trained) filters produced by the convolutional layer. We now aim at focusing on the most important filters in the document, regardless of their location. The maximum value obtained by the i-th filter over the whole document is:

[ φ^3_θ(d) ]_i = max_{1 ≤ t ≤ T−k+1} [ φ^2_θ(w_t, ..., w_{t+k−1}) ]_i ,   1 ≤ i ≤ n_filter ,   (6.3)

where φ^3_θ(d) ∈ R^{n_filter} is a global document representation. It can be seen as a way to measure whether the information represented by each filter has been captured in the document or not.

6.1.4 Binary Classification

Finally, we feed all these intermediate scores to a linear classifier, leading to the following simple model:

φ_θ(d) = α · φ^3_θ(d) ,   α ∈ R^{n_filter} .   (6.4)

The i-th filter might capture positive or negative sentiment depending on the sign of [α]_i.


6.1.5 Training

The neural network is trained using stochastic gradient descent. We denote by θ = {E, W^1, W^2, b^1, b^2, α} the set of all trainable parameters of the network. Using a training set D, we minimize the following logistic loss function with respect to θ:

L(θ) = Σ_{(d,y) ∈ D} log( 1 + e^{−y φ_θ(d)} ) .   (6.5)
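To make the architecture concrete, the following sketch implements the scoring function of Equations 6.1–6.4 and the per-example loss of Equation 6.5 in NumPy. It is a minimal illustration, not the thesis implementation: the tanh nonlinearity, the toy dimensions and the random initialization are assumptions, and the SGD update itself is omitted.

```python
import numpy as np

def score_document(word_ids, E, W1, b1, W2, b2, alpha, k):
    """Forward pass of Equations 6.1-6.4 for one document (a list of word indices)."""
    X = E[:, word_ids]                       # (6.1) d_wrd x T matrix of word embeddings
    T = X.shape[1]
    local_features = []
    for t in range(T - k + 1):               # slide a k-word window over the document
        window = X[:, t:t + k].reshape(-1)   # concatenate k columns -> (d_wrd * k,) vector
        h = np.tanh(W1 @ window + b1)        # hidden layer (tanh assumed as nonlinearity)
        local_features.append(W2 @ h + b2)   # (6.2) n_filter local features for this window
    phi3 = np.max(np.stack(local_features), axis=0)  # (6.3) max over all windows
    return alpha @ phi3                      # (6.4) scalar sentiment score

def logistic_loss(score, y):
    """Equation 6.5 for a single (document, label) pair, y in {-1, +1}."""
    return np.log1p(np.exp(-y * score))

# Toy example with random parameters (d_wrd=50, n_hu=50, n_filter=30, k=3 as in Section 6.2.3)
rng = np.random.default_rng(0)
d_wrd, n_hu, n_filter, k, vocab = 50, 50, 30, 3, 1000
E = rng.normal(scale=0.1, size=(d_wrd, vocab))
W1 = rng.normal(scale=0.1, size=(n_hu, d_wrd * k)); b1 = np.zeros(n_hu)
W2 = rng.normal(scale=0.1, size=(n_filter, n_hu));  b2 = np.zeros(n_filter)
alpha = rng.normal(scale=0.1, size=n_filter)
doc = rng.integers(0, vocab, size=14)        # a 14-word "tweet"
print(logistic_loss(score_document(doc, E, W1, b1, W2, b2, alpha, k), y=1))
```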

6.2 Short Document Classification

As a first experiment, we want to evaluate whether our approach can detect sentiments in short documents. For that purpose, we use Twitter as a data source. Twitter is an online social networking service that enables users to send and read short 140-character messages called tweets. Users post messages where they can express opinions about different topics, which includes products or services. These tweets are publicly visible by default, which makes Twitter a gold mine for consumers, marketers or companies. Consumers can analyze the opinions of Twitter users about products or services before making a purchase. Marketers can analyze customer satisfaction or research public opinion of their company and products. Companies can gather critical feedback about problems in newly released products. Identifying and extracting this subjective information has therefore become a key point.

6.2.1 Dataset Description

Twitter data possesses many unique properties that make sentiment classification much more challenging than in other domains:

• The maximum length of a tweet is 140 characters. Tweets in the dataset considered here contain on average 14 words and 78 characters.

• The amount of misspellings, slang and informal language is much higher than in other types of data.

• Twitter users post about a huge variety of subjects. This differs from classical sentiment classification datasets, which are usually focused on a specific domain (such as movie reviews).

For our experiments, we consider the same dataset used by Go et al. (2009). For the training data, the tweets were extracted using the official Twitter Application Programming Interface (API)1. The sentiment of Twitter posts has been assigned using distant supervision. The positive tweets were selected with a query for tweets containing “:)”, “:-)”, “: )”, “:D”, “=)”; the negative ones with a query for tweets containing “:(”, “:-(”, “: (”. The tweets in the training set

1http://apiwiki.twitter.com

are from the time period between April 6, 2009 and June 25, 2009. The following filtering steps were applied to the data:

• emoticons were removed from the tweets,

• tweets containing both positive and negative emoticons were removed,

• retweets were removed to avoid giving extra weight to a particular tweet.

Stripping out the emoticons forces the classifier to learn from the other features (the words, in our case) present in the tweet. The final training data consists of a total of 1.6M tweets, half labeled as positive and half labeled as negative. The test data was manually collected, using the web application. A set of 177 negative tweets and 182 positive tweets were manually marked. Not all the test tweets have emoticons. We also take the particularities of Twitter language into account to reduce the vocabulary size and make the data more concise. This is achieved with the following data pre-processing:

• Target: words starting with the character @ are replaced by the special token “TARGET”.

• Link: every URL (http://...) is replaced by the special token “URL”.

• Hashtag: words starting with the character # are replaced by the token “HASHTAG”.

• Repeated Letters: every letter occurring more than two times in a row is replaced with two occurrences (e.g., ‘huuuuuuuungry’ is replaced by ‘huungry’).

• Digit: all occurrences of sequences of numbers within a word are replaced with the special token “NUMBER”.

Finally, all words are lowercased. The resulting tweets have been tokenized using the CMU ARK Twitter NLP tools2 (Gimpel et al., 2011). This results in a vocabulary W of 323,393 words.
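As an illustration, a minimal sketch of the pre-processing rules listed above using Python regular expressions; the exact patterns are assumptions (the thesis does not specify them), and the final tokenization with the CMU ARK tool is left out.

```python
import re

def preprocess_tweet(tweet: str) -> str:
    """Apply the normalization rules of Section 6.2.1 (sketch, not the exact thesis pipeline)."""
    tweet = re.sub(r"@\w+", "TARGET", tweet)        # @user mentions -> TARGET
    tweet = re.sub(r"https?://\S+", "URL", tweet)   # links -> URL
    tweet = re.sub(r"#\w+", "HASHTAG", tweet)       # hashtags -> HASHTAG
    tweet = re.sub(r"(\w)\1{2,}", r"\1\1", tweet)   # letters repeated > 2 times -> 2 occurrences
    tweet = re.sub(r"\d+", "NUMBER", tweet)         # digit sequences -> NUMBER
    return tweet.lower()                            # finally, lowercase everything

print(preprocess_tweet("@bob I'm soooo huuuuungry!! see http://t.co/xyz #lunch at 12"))
# -> "target i'm soo huungry!! see url hashtag at number"
```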

6.2.2 Related Work

Go et al. (2009) proposed a model to automatically extract sentiment from tweets. They consider three different feature-based classical machine learning classifiers to infer sentiment on tweets: (i) Naive Bayes (NB), (ii) Maximum Entropy (MaxEnt) and (iii) Support Vector Machine (SVM). They report results for different sets of features: Unigram, Unigram+Bigram and Unigram+Part-of-Speech (POS). More recently, Poria et al. (2014) outperformed these baseline methods by employing lexical resources to provide polarity scores (from SenticNet) or emotion labels (from WordNet-Affect) for words and concepts. Kalchbrenner et al. (2014) have proposed a dynamic convolutional neural network (DCNN) for modeling sentences. While we

2http://www.ark.cs.cmu.edu/TweetNLP/

propose to simply extract a global feature vector with a max approach after one convolutional layer, they use multiple layers of convolution followed by dynamic k-max pooling to induce a structured feature graph over a given tweet. This approach is therefore much more complex (deeper) than our proposed model.

6.2.3 Experimental Results

Tweets are inherently very short documents. We thus use a small window size of k = 3 words for the convolutional layer, along with n_filter = 30 filters. For the nonlinear hidden layer, we use n_hu = 50 hidden units. All these hyper-parameters were chosen using a validation set extracted from the training data. The word embedding dimension is d^wrd = 50. Because each input (a tweet) contains a limited number of words, we choose not to introduce a special embedding for unknown words. For each word in W whose embedding is not available, we instead use a random initialization. These new embeddings are then learned, while the existing ones are fine-tuned during training.

MODEL            ACCURACY (%)

SVM              81.6
NB               82.7
MAXENT           83.0
EMOSENTICSPACE   85.1
DCNN             87.4
OUR MODEL        88.3

Table 6.1 – Accuracy on the Twitter sentiment classification test set. The three classical models (SVM, BiNB and MaxEnt) are based on unigram and bigram features; the results are reported from Go et al. (2009).

Results reported in Table 6.1 show that our model significantly outperforms the baseline models, as well as a model with prior knowledge about sentiments (EmoSenticSpace). It also slightly outperforms a much deeper convolutional neural network (DCNN), which indicates that there is no need for multiple convolutional layers in sentiment classification of short documents.

The number of filters used in our model is very low, n_filter = 30. This means that tweets are represented by 30-dimensional global representations. Conversely, traditional bag-of-words based classifiers represent tweets with as many features as there are words in the vocabulary. Considering that our vocabulary W contains 323,393 words, this is about ten thousand times more features than in our tweet representations.

At inference time, our model can also output polarity scores for each k-word window in a given tweet by simply removing the max layer. This is a valuable asset to detect which parts of a tweet are positive or negative, as illustrated in Figure 6.2. It also helps to understand why certain tweets are misclassified. Some examples of misclassified tweets, where both sentiments have been detected, are also shown in Figure 6.2.


Figure 6.2 – Selection of tweets from the test set where sentiments are highlighted using our model outputs. The blue color scale indicates negative sentiment, the red one indicates positive sentiment. Best viewed in colors.

6.3 Long Document Classification

Our second experiment focuses on long text documents. In that respect, we consider movie reviews as the data source. Like tweets, movie reviews are widely available online. Many websites offer a platform for expressing opinions on movies, such as IMDB (www.imdb.com) or Rotten Tomatoes (http://www.rottentomatoes.com/). But unlike tweets, these reviews are, in general, well written, as they are subject to moderation by the website, and they come with a polarity score. Distant supervision is therefore not needed.

6.3.1 Dataset Description

We used a collection of 50,000 reviews from IMDB introduced in Maas et al. (2011)3. This dataset contains no more than 30 reviews per movie, with an even number of positive and negative reviews, so randomly guessing yields 50% accuracy. Reviewers on IMDB give a score from 1 to 10 in addition to their reviews, which allows supervised learning. Only highly polarized reviews have been considered: a review is considered negative if its score is ≤ 4, and positive if its score is ≥ 7. The final dataset has been evenly divided into training and test sets (25,000 reviews each). In contrast with tweets, IMDB movie reviews are long documents, since each review contains on average 271 words. The IMDB review guidelines indeed say that the minimum length for reviews is 10 lines of text, with a recommended length of 200 to 500 words. As data pre-processing, we just replace all digits with a special token and lowercase all words.

3Available at http://www.andrew-maas.net/data/sentiment


6.3.2 Related Work

As a baseline system, we report the best model from Maas et al. (2011). By mixing unsupervised and supervised techniques, they learn word vectors capturing general semantic information as well as rich sentiment content. They then combine these word representations with BOW representations for classifying the sentiment of movie reviews. Later, Wang and Manning (2012) explored variants of Naive Bayes (NB) and Support Vector Machines (SVM) for movie reviews. They propose to combine generative and discriminative classifiers into a simple model where an SVM is built over NB log-count ratios as feature values (see Section 7.2.3 for details). These baselines prove to be highly competitive, as they provide state-of-the-art performance on this task.

6.3.3 Experimental Results

When dealing with movie reviews, the model has to classify much longer documents than tweets. The convolutional layer therefore uses a window of k = 5 words and n_filter = 1000 filters. The number of hidden units is n_hu = 300. A simple cross-validation has been performed on the training set to choose these hyper-parameters. The word embedding dimension remains d^wrd = 50, but this time we introduce a special embedding for unknown words.

MODEL                ACCURACY (%)

MAAS ET AL. (2011)   88.9
SVM                  86.9
BISVM                89.2
NB                   83.5
BINB                 86.6
NBSVM                88.3
BINBSVM              91.2
OUR MODEL            90.2

Table 6.2 – Accuracy on the IMDB test set for sentiment classification. When BI is used as prefix, models include bigram features.

Results in Table 6.2 show that our model outperforms the baseline from Maas et al. (2011) and other classical approaches based on unigrams. We note that including bigrams in those classical approaches significantly helps to increase the general performance. This confirms the need to consider longer sequences of words for sentiment classification, which legitimizes the use of our convolutional model to tackle this task. However, it is interesting to see that the best result is obtained with an SVM classifier built over NB features from unigrams and bigrams. For long document classification, those simple models are highly competitive, which raises the question of how relevant CNN models are for this task. One answer is that CNNs can help fight the curse of dimensionality, as they provide global document representations in a much lower-dimensional space than BOW models.


Embeddings fine-tuning

BORING                      BAD                          AWESOME
before        after         before         after         before        after

SAD           CRAP          HORRIBLE       TERRIBLE      SPOOKY        TERRIFIC
SILLY         LAME          TERRIBLE       STUPID        AWFUL         TIMELESS
SUBLIME       MESS          DREADFUL       BORING        SILLY         FANTASTIC
FANCY         STUPID        UNFORTUNATE    DULL          SUMMERTIME    LOVELY
SOBER         DULL          AMAZING        CRAP          NASTY         FLAWLESS
TRASH         HORRIBLE      AWFUL          WRONG         MACABRE       MARVELOUS
LOUD          RUBBISH       MARVELOUS      TRASH         CRAZY         EERIE
RIDICULOUS    SHAME         WONDERFUL      SHAME         ROTTEN        LIVELY
RUDE          AWFUL         GOOD           KINDA         OUTRAGEOUS    FANTASY
MAGIC         ANNOYING      FANTASTIC      JOKE          SCARY         SURREAL

Table 6.3 – Set of words with their 10 nearest neighbors before and after fine-tuning for the movie review task (using the Euclidean metric in the embedding space). Before tuning, antonyms are highlighted in blue. After tuning, antonyms have been replaced by some task- specific words which are highlighted in red. Best viewed in colors.

As we have seen in Section 4.6.4, tuning word embeddings for the given task helps to increase the general performance. In sentiment classification, it is even more important since antonyms tend to be close in the original embedding space. We see in Table 6.3 that bad is, for instance, close to antonyms such as good or fantastic. After fine-tuning, antonyms have been removed from the nearest neighbors, and replaced by words that are related to the task of interest. In terms of results, the accuracy drops from 90.2% to 88.0% when the word embeddings remain fixed during the training. Fine-tuning is therefore quite important in sentiment classification.

Sentiment Detection

As our method takes the whole review as input, we can extract windows of words having the most discriminative power: it is a major advantage of our method compared to conventional bag-of-words based methods. We report in Table 6.4 some examples of windows of words extracted from the most discriminative filters [α]i (positive and negative). Note that there is about the same number of positive and negative filters after learning.

As with tweets, each k-word window can receive a polarity score highlighting which parts of a movie review contain positive and negative sentiments. This feature makes even more sense when dealing with long documents, as it gives a nice visualization tool to quickly explore the most interesting sequences of words, as seen in Figure 6.3.

6.4 Conclusion

As we have built embeddings carrying meaningful semantic information about words, we propose to tackle sentiment classification with a convolutional neural network.


k-WORD WINDOW, [α]_i < 0             k-WORD WINDOW, [α]_i > 0

THE WORST FILM THIS YEAR             BOTH REALLY JUST WONDERFUL .
VERY WORST FILM I ’VE                . A TRULY EXCELLENT FILM
VERY WORST MOVIE I ’VE               . A REALLY GREAT FILM

WATCH THIS UNFUNNY STINKER .         EXCELLENT FILM WITH GREAT PERFORMANCES
, EXTREMELY UNFUNNY DRIVEL COME      EXCELLENT FILM WITH A GREAT
, THIS LUDICROUS SCRIPT GETS         EXCELLENT MOVIE WITH A STELLAR

IT WAS POINTLESS AND BORING          INCREDIBLE . JUST INCREDIBLE .
IT IS UNFUNNY . UNFUNNY              PERFORMANCES AND JUST AMAZING .
FILM ARE AWFUL AND EMBARRASSING      ONE WAS REALLY GREAT .

Table 6.4 – The top 3 positive and negative filters [α]i and their respective top 3 windows of words within the whole IMDB review dataset.

Figure 6.3 – Highlighting sentiments in IMDB movie reviews. The blue color scale indicates negative sentiment, the red one indicates positive sentiments. Best viewed in colors.

A convolutional layer learns local features from windows of k word embeddings. A global (and compact) document representation is then obtained by extracting the most discriminative local features with a max layer. This global representation is finally used for training a linear classifier, which gives better performance than classical approaches on short document classification. Bag-of-words models with bigram features are still highly competitive on long document classification, but we show that the proposed model is a good alternative for fighting the curse of dimensionality, while offering a framework for sentiment visualization.


7 N-gram-Based Model for Compact Document Representation

Day after day, the amount of text documents available online is growing. Accessing useful information becomes harder without efficient organization, summarization and indexing of document content. The traditional representation of documents, known as bag-of-words (BOW), considers every document as a vector in a very high-dimensional space, where each element of this vector represents one term appearing in the document collection. One limitation of this model is that the discriminative words are usually not the most frequent ones. A large vocabulary of words needs to be defined to obtain a robust model. Classification or text clustering then has to deal with a huge number of features, and it becomes time-consuming and memory-hungry.

Furthermore, such models are based on words alone, which raises another limitation. A collection of words cannot capture phrases or multiword expressions, while n-grams have been shown to be helpful features in several natural language processing tasks (Tan et al., 2002; Lin and Wu, 2009; Wang and Manning, 2012). N-gram features are not commonly used in text mining, probably because the vocabulary W^n tends to grow exponentially with n. Phrase structure extraction can be used to identify only the n-grams which are phrase patterns, and thus limit the vocabulary size. However, this adds another step to the model, making it more complex. To overcome these barriers, we propose that documents be represented as a bag of semantic concepts, where n-grams are considered instead of only words. By leveraging the ability of word vector representations to compose (see Chapter 5), representations for n-grams are easily computed with an element-wise addition. Using a clustering algorithm such as K-means, those representations are grouped into K clusters which can be viewed as semantic concepts. Text documents are then represented as bags of semantic concepts, with each feature corresponding to the presence or not of n-grams from the resulting clusters. By setting a small K, semantic information is captured while remaining in a low-dimensional space, as the number of features is dramatically reduced. We evaluate the proposed model on the classification of movie reviews and news.


7.1 Related Work

Some techniques have been proposed to reduce the dimensionality and represent documents in a low-dimensional semantic space. Latent Semantic Analysis (LSA) (Deerwester et al., 1990) uses the term-document matrix and a singular value decomposition (SVD) to represent terms and documents in a new low-dimensional space. Latent Dirichlet Allocation (LDA) (Blei et al., 2003) is a generative probabilistic model of a corpus. Each document is represented as a mixture of latent topics, where each topic is characterized by a distribution over words. By defining K topics, documents can then be represented as K-dimensional vectors. Pessiot et al. (2010) also proposed probabilistic models for unsupervised dimensionality reduction in the context of document clustering. They make the hypothesis that words occurring with the same frequencies in the same document are semantically related. Based on this assumption, words are partitioned into word topics. Documents are then represented by a vector where each feature corresponds to a word-topic and counts the number of occurrences of words from that word-topic in the document. Other techniques have tried to improve text document clustering by taking into account relationships between important terms. Some have enriched document representations by integrating core ontologies as background knowledge (Staab and Hotho, 2003), or with Wikipedia concepts and category information (Hu et al., 2009). Part-of-speech tags have also been used to disambiguate words (Sedding and Kazakov, 2004).

7.2 A Bag of Semantic Concepts Model

The model is divided into three steps:

1. vector representations of n-grams are obtained by averaging the pre-trained representations of their individual words;

2. n-grams are grouped into K semantic concepts by performing K -means clustering on all n-gram representations;

3. documents are represented by a bag of K semantic concepts, where each entry depends on the presence of n-grams from the concepts defined in the previous step.

7.2.1 N-gram Representation

The first step of the model is to generate continuous vector representations for each n-gram. Leveraging the model described in Chapter 5, word representations are averaged to generate n-gram representations:

(1/n) Σ_{i=1}^{n} x_{w_i} .   (7.1)


These representations are vectors which keep the semantic information of n-grams with different n in the same dimensionality. Distances between them are thus computable. It allows the use of a K -means clustering for grouping all n-grams into K classes.
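A minimal sketch of Equation 7.1, assuming a lookup table of pre-trained word vectors (here random stand-ins for the 100-dimensional embeddings of Chapter 5):

```python
import numpy as np

def ngram_vector(ngram, word_vectors):
    """Equation 7.1: average the word vectors of an n-gram (a tuple of words)."""
    return np.mean([word_vectors[w] for w in ngram], axis=0)

# Toy 100-dimensional embeddings standing in for the pre-trained ones
rng = np.random.default_rng(0)
word_vectors = {w: rng.normal(size=100) for w in ["not", "good", "really", "bad"]}
x_not_good = ngram_vector(("not", "good"), word_vectors)   # 2-gram in R^100
x_good = ngram_vector(("good",), word_vectors)             # 1-gram in the same space
print(np.linalg.norm(x_not_good - x_good))                 # distances are now comparable
```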

7.2.2 K -means Clustering

K-means is an unsupervised learning algorithm commonly used to automatically partition a data set into K clusters. Considering a set of n-gram representations x_i ∈ R^{d^wrd}, the algorithm will determine a set of K centroids γ_k ∈ R^{d^wrd}, so as to minimize the average distance from each representation to its nearest centroid:

Σ_i || x_i − γ_{σ_i} ||²_2 ,   where   σ_i = argmin_k || x_i − γ_k ||²_2 .   (7.2)

The limitation due to the size of the vocabulary is therefore overcome. By setting K to a low value, documents can also be represented by much more compact vectors than with a bag-of-words model, while keeping all the meaningful information.
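In practice, this partitioning can be done with an off-the-shelf implementation. The sketch below uses scikit-learn's MiniBatchKMeans (the mini-batch variant mentioned in Section 7.3.6); the random matrix stands in for the actual n-gram representations.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

rng = np.random.default_rng(0)
ngram_vectors = rng.normal(size=(100_000, 100))   # stand-in for all n-gram representations

K = 300
kmeans = MiniBatchKMeans(n_clusters=K, batch_size=10_000, n_init=3, random_state=0)
concepts = kmeans.fit_predict(ngram_vectors)      # sigma_i: cluster index of each n-gram
centroids = kmeans.cluster_centers_               # gamma_k, a (K x 100) matrix
```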

7.2.3 Document Representation

Let D = {d_1, d_2, ..., d_m} denote a set of text documents, where each document d_i contains a set of n-grams. First, each n-gram is embedded into a common vector space by averaging its word vector representations. The resulting n-gram representations are assigned to clusters using the centroids γ_k defined by the K-means clustering. Each document d_i is then represented by a vector of K features, v_i ∈ R^K. Each entry [v_i]_k usually corresponds to the frequency of n-grams from the k-th cluster within the document d_i. The set of text documents is then defined as D̄ = {v_1, v_2, ..., v_m}.
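A sketch of this frequency-based representation, reusing the `ngram_vector` helper and the fitted `kmeans` object from the previous sketches (both are assumptions of this illustration):

```python
import numpy as np

def bag_of_semantic_concepts(doc_ngrams, word_vectors, ngram_vector, kmeans, K):
    """Map a document (a list of n-gram tuples) to a K-dimensional concept-frequency vector."""
    v = np.zeros(K)
    if not doc_ngrams:
        return v
    vectors = np.array([ngram_vector(g, word_vectors) for g in doc_ngrams])
    for cluster in kmeans.predict(vectors):   # assign each n-gram to its nearest centroid
        v[cluster] += 1                       # [v_i]_k = frequency of cluster k in d_i
    return v
```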

With Naive Bayes Features

For certain types of documents, such as the movie reviews in Section 6.3, the use of Naive Bayes features can improve the general performance (Wang and Manning, 2012). Success in sentiment classification relies mostly on the capability of the model to detect negative and positive n-grams in a document. A proper normalization is then calculated to determine how important each n-gram is for a given class y ∈ {−1, 1}. We first define a set of count vectors f_1, ..., f_N for all n-grams contained in D, where f_t ∈ R^m contains the frequencies of the t-th n-gram: [f_t]_i represents the number of occurrences of the t-th n-gram in the training document d_i. We then define count vectors for each sentiment. p ∈ R^N is for positive sentiment, where each entry corresponds to an n-gram, such that

[p]_t = 1 + Σ_{i: y_i = 1} [f_t]_i   (7.3)

is the number of occurrences of the t-th n-gram in positive documents. The same operation is applied to define a count vector q ∈ R^N for negative sentiment, where

[q]_t = 1 + Σ_{i: y_i = −1} [f_t]_i   (7.4)

is the number of occurrences of the t-th n-gram in negative documents.

A log-count ratio is then calculated to determine how important n-grams are for the sentiments (classes y):

r = log( (p / ||p||_1) / (q / ||q||_1) ) ,   with r ∈ R^N .   (7.5)

Because n-grams are grouped in clusters, we extract the maximum absolute log-count ratio for every cluster k ∈ {1, ..., K}:

[ṽ_i]_k = argmax_{[r]_t} |[r]_t| ,   ∀ n-gram t ∈ k with [f_t]_i > 0 .   (7.6)

These document representations can then be used for several NLP tasks such as classification or information retrieval. As for BOW-based models, this model is particularly suitable for linear SVM.
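The sketch below implements Equations 7.3–7.6 with NumPy; the count matrix F and the cluster assignments are assumed to be precomputed, and this is only one possible way to organize the computation.

```python
import numpy as np

def nb_log_count_ratio(F, y):
    """Equations 7.3-7.5. F is an (N x m) matrix of n-gram counts per training document,
    y an array of labels in {-1, +1}. Returns the log-count ratio r in R^N."""
    p = 1 + F[:, y == 1].sum(axis=1)           # (7.3) smoothed counts in positive documents
    q = 1 + F[:, y == -1].sum(axis=1)          # (7.4) smoothed counts in negative documents
    return np.log((p / p.sum()) / (q / q.sum()))   # (7.5), ||.||_1 = sum of the positive entries

def nb_concept_features(doc_ngram_ids, clusters, r, K):
    """Equation 7.6: for each cluster k, keep the log-count ratio of largest magnitude
    among the n-grams of the document that belong to that cluster."""
    v = np.zeros(K)
    for t in doc_ngram_ids:                    # indices of n-grams occurring in document d_i
        k = clusters[t]                        # cluster of the t-th n-gram
        if abs(r[t]) > abs(v[k]):
            v[k] = r[t]
    return v
```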

7.3 Experiments with Sentiment Classification

Sentiments can have a completely different meaning if n-grams are considered instead of words. A classifier might leverage a bigram such as “not good” to classify a document as negative, while this would probably fail if only unigrams (words) were considered. We thus benchmark the bag of semantic concepts model on sentiment classification.

7.3.1 IMDB Movie Reviews Datasets

Datasets from IMDB have the nice property of containing long documents. It is thus valuable to consider n-grams in such a framework. We did experiments with small and large collections of reviews. We can thus analyze how well our model competes against classical models, for different dataset sizes.


Pang and Lee (2004)

The collection consists of 1,000 positive and 1,000 negative processed reviews1, so a random guess yields 50% accuracy. The authors selected only reviews where the rating was expressed either with stars or some numerical value. To avoid domination of the corpus by a small number of prolific reviewers, they imposed a limit of fewer than 20 reviews per author per sentiment category. As there is no test set, we use 10-fold cross-validation.

Maas et al. (2011)

The collection consists of 100,000 reviews2. It has been divided into three datasets: training and test sets (25,000 labeled reviews each), and 50,000 unlabeled training reviews. See Section 6.3.1 for more details.

7.3.2 Building Bag of Semantic Concepts for Movie Reviews

In the following experiments, we use the word representations trained with the model described in Chapter 5. By following the three steps described in Section 7.2, movie reviews are then represented as bags of semantic concepts.

Computing n-gram representations.

We consider n-grams up to n = 3. Only n-grams composed of words from our vocabulary W are considered for both datasets3. This results in a set of 34,360 1-gram representations, 419,918 2-gram representations, and 921,837 3-gram representations for the Pang and Lee dataset; and 67,847 1-gram representations, 1,842,461 2-gram representations, and 5,724,871 3-gram representations for the Maas et al. dataset. Because n-gram representations are computed by averaging the representations of their words, all n-grams are also represented by 100-dimensional vectors.

Partitioning n-grams into semantic concepts.

Because n-grams are represented in a common vector space, similarities between n-grams of different length can be computed. To evaluate the benefit of adding n-grams for sentiment analysis, we define semantic concepts with different combinations of n-grams:

• only 1-grams (i.e. clusters of words),

• only 2-grams,

1Available at http://www.cs.cornell.edu/people/pabo/movie-review-data/. 2Available at http://www.andrew-maas.net/data/sentiment. 3Our English corpus is not large enough to cover all the words present in the IMDB datasets. We thus use the same 1-gram vocabulary with the other methods.


• only 3-grams,

• with 1-grams and 2-grams,

• with 1-grams, 2-grams and 3-grams.

Each of these five sets of n-gram representations is then partitioned into K = {100, 200, 300} clusters with K-means clustering. The centroids γ_k ∈ R^100 are obtained after K-means convergence (usually after 10 iterations of the algorithm).

Movie review representations.

Movie reviews are then represented as bags of semantic concepts with Naive Bayes features as described in Section 7.2.3. The log-count ratio for each n-gram is calculated on the training set for both datasets.

7.3.3 Comparison with Other Methods

We compare our models with two classical techniques for representing text documents in a low-dimensional vector space: LSA and LDA. Both methods use the same 1-gram vocabulary as the bag of semantic concepts model, with K = {100, 200, 300}. In the framework of the Maas et al. dataset, LSA and LDA benefit from the large set of unlabeled reviews.

Latent Semantic Analysis (LSA) (Deerwester et al., 1990).

Let X ∈ R^{|W| × m} be a matrix where each element [X]_{i,j} describes the log-count ratio of word i in document j, with m the number of training documents and W the vocabulary of words (i.e. 34,360 for the Pang and Lee dataset, 67,847 for the Maas et al. dataset). By applying a truncated SVD to the log-count ratio matrix X, we thus obtain semantic representations in a K-dimensional space for movie reviews.

Latent Dirichlet Allocation (LDA) (Blei et al., 2003).

We train the K-topics LDA model using the code released by Blei et al. (2003)4. We leave the LDA hyper-parameters at their default values. Like our model, LDA extracts K topics (i.e. semantic concepts) and assigns words to these topics. Considering only the words in documents, we thus apply the method described in Section 7.2.3 to get document representations. A movie review d_i is then represented by a K-dimensional vector, where each feature [ṽ_i]_k is the maximum absolute log-count ratio for the k-th topic.

4Available at http://www.cs.princeton.edu/~blei/lda-c/.


7.3.4 Classification using SVM

Having represented movie reviews as K-dimensional vectors, a classifier is trained to determine whether a given review is positive or negative. Given the set of training documents D̃ = {(ṽ_i, y_i) | ṽ_i ∈ R^K, y_i ∈ {−1, 1}}_{i=1}^{m}, we pick a linear SVM as classifier, trained using the LIBLINEAR library (Fan et al., 2008):

min_w  (1/2) w^T w + C Σ_i max(0, 1 − y_i w^T ṽ_i)² ,   (7.7)

with w the weight vector, and C a penalty parameter.
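Equation 7.7 is an L2-regularized SVM with a squared hinge loss, which is one of the formulations solved by LIBLINEAR; scikit-learn's LinearSVC wraps the same library. A sketch with placeholder data:

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
X_train = rng.normal(size=(2000, 300))        # document vectors v_i in R^K (K = 300 here)
y_train = rng.choice([-1, 1], size=2000)      # sentiment labels

# penalty="l2" and loss="squared_hinge" match Equation 7.7; C is the penalty parameter.
clf = LinearSVC(C=1.0, penalty="l2", loss="squared_hinge")
clf.fit(X_train, y_train)
print(clf.predict(X_train[:5]))
```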

7.3.5 Results

             PANG AND LEE, 2004          MAAS ET AL., 2011
K =          100     200     300         100     200     300

LDA          76.20   77.10   76.80       85.43   85.45   84.40
LSA          81.60   82.55   83.75       85.82   86.63   86.88

1-GRAM       81.60   82.60   82.70       84.51   84.76   85.54
2-GRAM       82.30   82.25   83.15       88.02   88.06   87.87
3-GRAM       73.85   73.05   72.65       87.41   87.46   87.22
1+2-GRAM     83.85   84.00   84.00       88.10   88.19   88.18
1+2+3-GRAM   82.45   83.05   83.05       88.39   88.46   88.55

Table 7.1 – Classification accuracy on both movie review tasks with K = {100,200,300} number of features.

The overall results summarized in Table 7.1 show that the bag of semantic concepts approach outperforms the traditional LDA and LSA approaches for representing documents in a low-dimensional space. Good performance is achieved with only 100 clusters, while LSA needs more clusters to improve. We also note that our approach performs well on a small dataset, where LDA fails. A significant increase is observed when using 2-grams instead of 1-grams. However, using only 3-grams hurts the performance. The best results are obtained using a combination of n-grams, which confirms the benefit of the method. This also means that word vector representations can be combined while keeping relevant semantic information.

This is illustrated in Table 7.2 where semantically close n-grams are in the same cluster. We can see that the model is furthermore able to clearly separate antonyms, which is a good asset for sentiment classification. The results are also very competitive with a traditional BOW-model.


GOOD                 NOT GOOD             ENJOY                DID N’T ENJOY
k = 269              k = 297              k = 160              k = 108

NICE ONE             SUFFICIENTLY BAD     ENTERTAIN            SCEPTICS
LIKED HERE           NOT LIKED            ADORED THEM          DID N’T LIKE
IS PRETTY NICE       IS FAR WORSE         ENJOYING             N’T ENJOY ANY
THE GREATEST THING   NOT THAT GREATEST    WATCHED AND ENJOY    VALUELESS

Table 7.2 – Selected pairs of antonyms and their cluster number. Here, n-grams from Maas et al’s dataset have been partitioned into 300 clusters. Each n-gram is accompanied with a selection of others from its cluster.

Using the same 1-gram vocabulary and a linear SVM classifier with Naive Bayes features, a BOW model achieves 83% accuracy on the Pang and Lee dataset, and 88.58% on the Maas et al. dataset. Our model therefore performs better with about 344 times fewer features on the first dataset, and yields a similar result with about 678 times fewer features on the second one.

7.3.6 Computation Time

             1-GRAM   2-GRAM   3-GRAM    1+2-GRAM   1+2+3-GRAM

N-GRAM       0        43.00    164.34    43.00      207.34
K-MEANS      14.18    291.62   747.90    302.34     1203.99
DOCUMENT     36.45    173.48   494.06    343.29     949.01

TOTAL        50.63    508.10   1406.30   688.63     2360.34

Table 7.3 – Computation time for building movie review representations with K = 300 semantic concepts. Time is reported in seconds.

The bag of semantic concepts model can leverage information coming from n-grams to improve sentiment classification of documents. This model also has the nice property of building document representations in an efficient and timely manner. The most time-consuming and costly step of the model is the K-means clustering, especially when dealing with millions of n-gram representations. However, this step can be done very quickly and with low memory by using the mini-batch K-means method. Computation times for generating 300-dimensional representations are reported in Table 7.3. All experiments have been run on a single core of an Intel i7 2600K 3.4 GHz CPU. Although a single core has been used for this benchmark, the three steps of the model are highly parallelizable; the reported times could thus be divided by the number of cores available. We see that representations can be computed in less than one minute with only the 1-gram vocabulary. About 10 minutes are

necessary when adding 2-grams, and about 40 minutes when adding 3-grams. In comparison, LDA needs six hours to extract 100 topics and three days for 300 topics. Our model is also very competitive with LSA, which takes 540 seconds to generate 300-dimensional document representations. However, adding 2-grams and 3-grams to LSA would be extremely time-consuming and memory-hungry, while our model can handle them.

7.3.7 Inferring Semantic Concepts for Unseen N-grams

Another drawback of classical models is that they cannot deal with unseen words: only words present in the training documents are used to infer representations for a new text document. Unlike these models, our model can easily assign semantic concepts to new n-grams. Because n-gram representations are based on their word vector representations, a new n-gram vector representation can be calculated as long as a representation is available for each of its words. This new representation is then assigned to the nearest centroid γ_k, which determines its semantic concept. With a small training set, this is a valuable asset when compared to other models.
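A short sketch of this inference step, reusing the notation of the earlier sketches (the word vector lookup and the centroid matrix are assumed to be available):

```python
import numpy as np

def concept_of_unseen_ngram(ngram, word_vectors, centroids):
    """Embed a new n-gram by averaging its word vectors (Equation 7.1) and assign it
    to the nearest centroid gamma_k; assumes every word of the n-gram has a vector."""
    x = np.mean([word_vectors[w] for w in ngram], axis=0)
    return int(np.argmin(np.linalg.norm(centroids - x, axis=1)))
```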

7.4 Experiments with Text News Classification

The classification of full-text news is another important application in natural language processing. Given the huge amount of this type of document, an unsupervised method for obtaining low-dimensional document representations is therefore interesting. In this section, we then use our method to represent text news in a low-dimensional vector space.

7.4.1 Reuters-27000 Dataset

We use the Reuters-27000 corpus recently released by Mourino-García et al. (2015)5. This corpus comprises 23,722 online news articles from the Reuters agency, each belonging to only one category. There are about 3,000 articles for each of the 8 following categories: Health, Art, Politics, Sports, Science, Technology, Economy and Business. We divided this corpus into two parts: (1) we randomly selected 1,000 articles from each category to compose a test set; (2) the remaining 15,722 documents form the training set.

7.4.2 Experimental Setup

One of the findings from the experiments on sentiment classification in Section 7.3 is that our method becomes a good alternative to classical approaches when the training set is relatively small. In this framework of text news classification, we aim to exploit this potential. We compare our method to LSA and BOW models with different training sizes L = {50, 100, 500, 1000, 1500, 2000}6. Given the overall results on the Pang and Lee dataset, we

5Available at http://www.itec-sde.net/reuters_27000.zip 6Here, LDA is deliberately left out as it gives poor performance with small datasets.

choose to group 1-grams and 2-grams into K = 300 clusters.

Bag of semantic concepts for text news

For each training size L, we consider only the 1-grams and 2-grams that appear in the training set to learn the centroids γ_k ∈ R^100. The resulting number of n-grams is reported in Table 7.4.

SIZE    1-GRAM   2-GRAM     1+2-GRAM

50      14839    90870      105709
100     20755    156142     176897
500     41220    506027     547247
1000    52292    801121     853413
1500    59483    1038236    1097719
2000    64010    1221578    1285588

Table 7.4 – Number of n-grams in Reuters-27000 according to the number of documents in the training set.

As for sentiment classification, the word vector representations trained in Chapter 5 are used in this experiment. Once the centroids are defined, each news article is represented as a bag of semantic concepts by assigning n-gram representations to their nearest centroid. As described in Section 7.3.7, this method provides a way to easily assign n-grams from the test set which do not appear in the training set. The popular TF-IDF (term frequency-inverse document frequency) weighting is then used, where the terms are clusters.
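A sketch of this weighting, treating each of the K clusters as a "term"; it uses one common TF-IDF variant, not necessarily the exact formula used in the thesis:

```python
import numpy as np

def tfidf_over_clusters(counts):
    """counts: (n_docs x K) matrix of raw cluster frequencies. Returns TF-IDF weighted features."""
    tf = counts / np.maximum(counts.sum(axis=1, keepdims=True), 1)   # term frequency per document
    df = (counts > 0).sum(axis=0)                                    # document frequency per cluster
    idf = np.log(counts.shape[0] / np.maximum(df, 1))                # inverse document frequency
    return tf * idf
```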

Other methods

In this framework, we compared our method to LSA and BOW models for each training size L. For both models, we consider words (1-grams) with TF-IDF weighting.

Classification using multiclass SVM

Reuters-27000 contains news articles for eight different categories, so we are facing a multiclass problem. A common approach is to reduce the single multiclass problem into multiple binary classification problems. We thus train a linear SVM classifier with a “one-versus-all” strategy, using the LIBLINEAR library (Fan et al., 2008).

7.4.3 Results

Results reported in Figure 7.1 show that our method is also competitive for representing text news in a low-dimensional vector space. It performs as well as a BOW model with the



Figure 7.1 – Performance of the three models on Reuters-27000 with different training sizes. Bar chart (left axis) is the number of features. Line chart (right axis) reports the F1-scores. Our method is denoted by BOSC. BOSC and LSA have a fixed number of features, K = 300.

smallest training set L = 50, where LSA gives a poorer result. Increasing the number of training documents helps the BOW model to outperform our method, but the number of features also increases dramatically. LSA also gives slightly better results than our method when the training set contains at least 500 documents; below this number, our method outperforms LSA. A larger training set, however, has an impact on their computational cost, while our method does not suffer from this. Those results also confirm that using n-grams with n > 1 is not particularly interesting for text news classification: finding the right keywords seems to be enough for predicting news categories.

7.5 Conclusion

In this chapter, we leverage the model described in Chapter 5 to propose an unsupervised method for producing compact representations of text documents. As word embeddings can be summed together, n-grams of different lengths n can be embedded in the same vector space with a simple element-wise addition. This makes it possible to

compute distances between n-grams, which can have many applications in natural language processing. We therefore propose a bag of semantic concepts model to represent documents in a low-dimensional space. Semantic concepts are obtained by performing a K-means clustering which partitions all n-grams into K clusters. This model has several advantages over classical approaches for representing documents in a low-dimensional space: it leverages semantic information coming from n-grams; it builds document representations with low resource consumption (time and memory); it can infer semantic concepts for unseen n-grams; and finally, it is capable of providing relevant document representations even with a small set of documents. We have shown that such a model is suitable for document classification, both for text news and movie reviews. Competitive performance has been reached on binary sentiment classification tasks, where this model outperforms traditional approaches. It also attained results similar to traditional bag-of-words models with considerably fewer features.

Part III
Sentence Generation


8 Phrase-based Image Captioning

Being able to automatically generate a description from an image is a fundamental problem in artificial intelligence, connecting computer vision and natural language processing. The problem is particularly challenging because it requires correctly recognizing the different objects in an image and how they interact. Another challenge is that an image description generator needs to express these interactions in a natural language (e.g. English). Therefore, a language model is implicitly required in addition to visual understanding. Recently, this problem has been studied by many different authors. Most of the attempts are based on recurrent neural networks to generate sentences. These models leverage the power of neural networks to transform image and sentence representations into a common space (Mao et al., 2015; Donahue et al., 2014; Karpathy and Fei-Fei, 2015; Vinyals et al., 2015).

In this chapter, we propose a different approach to the problem that does not rely on complex recurrent neural networks. An exploratory analysis of two large datasets of image descriptions reveals that their syntax is quite simple. The ground-truth descriptions can be represented as a collection of noun, verb and prepositional phrases. The different entities in a given image are described by the noun phrases, while the interactions or events between these entities are encoded by both the verb and the prepositional phrases. We thus train a model that predicts the set of phrases present in the sentences used to describe the images. By leveraging word vector representations defined in Chapter 5, each phrase can be represented by the average of the representations of the words that compose the phrase. Vector representations for images can also be easily obtained from some pre-trained convolutional neural networks. The model then learns a common embedding between phrase and image representations (see Figure 8.3).

Given a test image, a bilinear model is trained to predict a set of top-ranked phrases that best describe it. Several noun phrases, verb phrases and prepositional phrases are in this set. The objective is therefore to generate syntactically correct sentences from (possibly different) subsets of these phrases. We introduce a constrained trigram language model based on our knowledge about how the sentence descriptions are structured in the training set. With a very constrained decoding scheme, sentences are inferred with a beam search. Because these sentences are not conditioned on the given image (apart from the initial phrase selection),

a re-ranking is used to pick the sentence that is closest to the sample image (according to the learned metric). The quality of our sentence generation is evaluated on two very popular datasets for the task: Flickr30k (Hodosh et al., 2013) and COCO (Lin et al., 2014).

8.1 Related Work

The classical approach to sentence generation is to pose the problem as a retrieval problem: a given test image will be described with the highest ranked annotation in the training set (Hodosh et al., 2013; Socher et al., 2014; Srivastava and Salakhutdinov, 2014). These matching methods may not generate proper descriptions for a new combination of objects. Due to this limitation, several generative approaches have been proposed. Many of them use syntactic and semantic constraints in the generation process (Yao et al., 2010; Mitchell et al., 2012; Kuznetsova et al., 2012; Kulkarni et al., 2013). These approaches benefit from visual recognition systems to infer words or phrases, but in contrast to the proposed model they do not leverage a multimodal metric between images and phrases.

More recently, automatic image sentence description approaches based on deep neural networks have emerged with the release of new large datasets. As a starting point, these solutions use the rich representation of images generated by Convolutional Neural Networks (CNN) (LeCun et al., 1998) that were previously trained for object recognition tasks. These CNNs are generally followed by recurrent neural networks (RNN) in order to generate full sentence descriptions (Donahue et al., 2014; Chen and Zitnick, 2015; Mao et al., 2015; Venugopalan et al., 2014; Kiros et al., 2014; Karpathy and Fei-Fei, 2015; Vinyals et al., 2015). Among these recent works, long short-term memory (LSTM) networks are often chosen as the RNN. In such approaches, the key point is to learn a common space between images and words or between images and sentences, i.e. a multimodal embedding.

Vinyals et al. (2015) consider the problem in a similar way as a machine translation problem. The authors propose an encoder/decoder (CNN/LSTM networks) system that is trained to maximize the likelihood of the target description sentence given a training image. Karpathy and Fei-Fei (2015) propose an approach that is a combination of CNN, bidirectional RNN over sentences and a structured objective responsible for a multimodal embedding. They then propose a second RNN architecture to generate new sentences. Similarly, Mao et al. (2015) and Donahue et al. (2014) propose a system that uses a CNN to extract image features and a RNN for sentences. The two networks interact with each other in a multimodal common layer.

Our model shares some similarities with these recent proposed approaches. We also use a pre-trained CNN to extract image features. However, thanks to the phrase-based approach, our model does not rely on complex recurrent networks for sentence generation, and we do not fine-tune the image features.


8.2 Syntax Analysis of Image Descriptions

The art of writing sentences can vary a lot according to the domain. When reporting news or reviewing an item, not only the choice of the words might vary, but also the general structure of the sentence. In this section, we wish to analyze the syntax of image descriptions to identify whether captions have their own structures. We therefore proceed to an exploratory analysis of two datasets containing a large amount of images with descriptions: Flickr30k (Hodosh et al., 2013) and COCO (Lin et al., 2014).

8.2.1 Datasets

The Flickr30k dataset contains 31,014 images where 1,014 images are for validation, 1,000 for testing and the rest for training (i.e. 29,000 images). The COCO dataset contains 123,287 images, 82,783 training images and 40,504 validation images1. We use two sets of 5,000 images from the validation images for validation and test, as in Karpathy and Fei-Fei (2015)2. In both datasets, images are given with five (or six) sentence descriptions annotated using Amazon Mechanical Turk (see Figure 8.3). This results in 559,113 sentences when combining both training datasets.

8.2.2 Chunking-based Approach

A quick overview of these sentence descriptions reveals that they all share a common structure, usually describing the different entities present in the image and how they interact with each other. This interaction among entities is described as actions or relative positions between different objects. The sentence can be short or long, but it generally respects this process. To confirm this claim and better understand the description structures, we used a chunking (also called shallow parsing) approach, which identifies the phrase chunks of a sentence (i.e., the non-recursive cores of various phrase types in text). These chunks are usually noun phrases (NP), verb phrases (VP) and prepositional phrases (PP). We extract them from the training sentences with the SENNA software3. Pre-verbal and post-verbal adverb phrases are merged with verb phrases to limit the number of phrase types. Table 8.1 presents an example sentence with its chunking analysis.

a man is grinding a ramp on a skateboard .

NP VP NP PP NP O

Table 8.1 – Chunking analysis of an image description.

Statistics reported in Figure 8.1 and Figure 8.2 confirm that image descriptions possess a simple and distinct structure.

1The testing images were not released at the time of the experiment. 2Available at http://cs.stanford.edu/people/karpathy/deepimagesent/ 3Available at http://ml.nec-labs.com/senna/



Figure 8.1 – Statistics on the number of phrase chunks (NP, VP, PP) per ground-truth description in the Flickr30k and COCO training datasets. Best viewed in colors.

These sentences do not have much variability. All the key elements in a given image are usually described with a noun phrase (NP). Interactions between these elements can then be explained using prepositional phrases (PP) or verb phrases (VP). A large majority of sentences contains from two to four noun phrases. Two noun phrases then interact using a verb or prepositional phrase. Describing an image is therefore just a matter of identifying these chunks. We thus propose to train a model which can predict the phrases which are likely to be in a given image.

8.3 Phrase-based Model for Image Descriptions

By leveraging pre-trained word and image representations, we propose a simple model which can predict the phrases that best describe a given image. For this purpose, a metric between images and phrases is trained, as illustrated in Figure 8.3. The proposed architecture is then just a low-rank bilinear model U^T V.



Figure 8.2 – The 20 most frequent sentence structures in Flickr30k and COCO training datasets. The black line is the appearance frequency for each structure, the red line is the cumulative distribution. Best viewed in colors.

8.3.1 Image Representations

For the representation of images, we choose to use a Convolutional Neural Network. CNNs have been widely used in different vision domains and are currently the state of the art in many object recognition tasks. We consider a CNN that has been pre-trained for the task of object classification (Simonyan and Zisserman, 2014). We use the CNN solely for the purpose of feature extraction, that is, no learning is done in the CNN layers.

8.3.2 Learning a Common Space for Image and Phrase Representations

Let I be the set of training images, C the set of all phrases used to describe I, and θ the trainable parameters of the model. By representing each image i ∈ I with a vector z_i ∈ R^n thanks to the pre-trained CNN, we define a metric between the image i and a phrase c as a bilinear operation:

φ_θ(c, i) = u_c^T V z_i ,   (8.1)


[Figure 8.3 shows an example image with its five ground-truth captions (“A man in a helmet skateboarding before an audience.”, “Man riding on edge of an oval ramp with a skate board.”, “A man riding a skateboard up the side of a wooden ramp.”, “A man on a skateboard is doing a trick.”, “A man is grinding a ramp on a skateboard.”), the phrases extracted from them (NP: “a man”, “a skate board”, “a wooden ramp”; VP: “riding”, “is grinding”; PP: “on”, “with”), and the matrices U and V that map phrases and the image into a common space.]

Figure 8.3 – Schematic illustration of our phrase-based model for image descriptions.

with U = (u_{c_1}, ..., u_{c_|C|}) ∈ R^{m × |C|} and V ∈ R^{m × n} being the trainable parameters θ. Note that U^T V could be a full matrix, but a low-rank setting eases the capacity control.
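Equation 8.1 amounts to a matrix product, so all phrases can be scored against an image at once. A sketch with toy dimensions (m, n and |C| are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, num_phrases = 400, 4096, 10_000               # phrase space, CNN feature size, |C| (toy values)
U = rng.normal(scale=0.01, size=(m, num_phrases))   # one column u_c per phrase
V = rng.normal(scale=0.01, size=(m, n))             # projects CNN features into the phrase space
z_i = rng.normal(size=n)                            # pre-trained CNN representation of image i

scores = U.T @ (V @ z_i)          # Equation 8.1 for every phrase c: phi_theta(c, i) = u_c^T V z_i
top_phrases = np.argsort(-scores)[:20]              # e.g. keep the L = 20 top-ranked phrases
```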

8.3.3 Phrase Representations Initialization

Noun phrases or verb phrases are often a combination of several words. By leveraging the ability of our word vector representations to compose by simple summation (see Chapter 5), representations for phrases are easily computed with an element-wise addition. A vector representation u_c ∈ R^m for a phrase c = {w_1, ..., w_K} is then calculated by averaging its word vector representations pre-trained on Wikipedia:

u_c = (1/K) Σ_{k=1}^{K} x_{w_k} .   (8.2)

Vector representations for all phrases c ∈ C can thus be obtained to initialize the matrix U ∈ R^{m × |C|}. V ∈ R^{m × n} is initialized randomly and trained to encode images in the same vector space as the phrases used for their descriptions.

8.3.4 Training with Negative Sampling

Each image i is described by a multitude of possible phrases C_i ⊆ C. We consider |C| classifiers attributing a score to each phrase. We train our model to discriminate a target phrase c_j from a set of negative phrases c_k ∈ C⁻ ⊆ C, with c_k ≠ c_j. With θ = {U, V}, we minimize the following logistic loss function with respect to θ:

L(θ) = Σ_{i ∈ I} Σ_{c_j ∈ C_i} ( log( 1 + e^{−φ_θ(c_j, i)} ) + Σ_{c_k ∈ C⁻} log( 1 + e^{+φ_θ(c_k, i)} ) ) .   (8.3)


The model is trained using stochastic gradient descent. A new set of negative phrases C⁻ is randomly picked from the training set at each iteration.
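A sketch of the loss of Equation 8.3 restricted to a single image, continuing the names of the previous sketch; the sampling of negatives (which should exclude the ground-truth phrases) is only indicated:

```python
import numpy as np

def image_loss(U, V, z_i, positive_ids, negative_ids):
    """Equation 8.3 restricted to one image i: sum over its ground-truth phrases c_j
    and over a sampled set of negative phrases c_k."""
    scores = U.T @ (V @ z_i)                              # phi_theta(c, i) for all phrases
    pos = np.log1p(np.exp(-scores[positive_ids])).sum()   # pull ground-truth phrases up
    neg = np.log1p(np.exp(+scores[negative_ids])).sum()   # push sampled negatives down
    return pos + neg

# At each SGD iteration a fresh set of negatives would be drawn, e.g.:
# negative_ids = rng.choice(num_phrases, size=15, replace=False)
```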

8.4 From Phrases to Sentence

After identifying the L most likely constituents c_j in the image i, we propose to generate sentences out of them. From this set, l ∈ {1, ..., L} phrases are used to compose a syntactically correct description.

8.4.1 Sentence Generation

Using a statistical language modeling framework, the likelihood of a certain sentence is given by:

$$P(c_1,c_2,\ldots,c_l) = \prod_{j=1}^{l} P(c_j \mid c_1,\ldots,c_{j-1}) \qquad (8.4)$$

Keeping this system as simple as possible and using a second-order Markov property, we approximate Equation 8.4 with a trigram language model:

$$P(c_1,c_2,\ldots,c_l) \approx \prod_{j=1}^{l} P(c_j \mid c_{j-2},c_{j-1})\,. \qquad (8.5)$$

The best candidate corresponds to the sentence c_1, c_2, ..., c_l which maximizes the likelihood of Equation 8.5 over all the possible sentence lengths. Because we want to constrain the decoding algorithm to include prior knowledge on chunking tags t ∈ {NP, VP, PP}, we rewrite Equation 8.5 as:

$$P(c_1,c_2,\ldots,c_l) = \prod_{j=1}^{l} \sum_{t} P(c_j \mid t_j = t, c_{j-2},c_{j-1})\, P(t_j = t \mid c_{j-2},c_{j-1}) = \prod_{j=1}^{l} P(c_j \mid t_j, c_{j-2},c_{j-1})\, P(t_j \mid c_{j-2},c_{j-1})\,. \qquad (8.6)$$

Both conditional probabilities, P(c_j | t_j, c_{j-2}, c_{j-1}) and P(t_j | c_{j-2}, c_{j-1}), are estimated by counting occurrences in the training datasets.
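
A minimal sketch of this counting, assuming the captions have already been chunked into (phrase, tag) sequences; the toy captions and function names below are ours.

```python
from collections import Counter

# Toy training captions already chunked into (phrase, tag) pairs.
training = [
    [("a man", "NP"), ("riding", "VP"), ("a skateboard", "NP"), ("on", "PP"), ("a ramp", "NP")],
    [("a man", "NP"), ("is grinding", "VP"), ("a ramp", "NP"), ("on", "PP"), ("a skateboard", "NP")],
]

ctx = Counter()              # counts of (c_{j-2}, c_{j-1})
ctx_tag = Counter()          # counts of (c_{j-2}, c_{j-1}, t_j)
ctx_tag_phrase = Counter()   # counts of (c_{j-2}, c_{j-1}, t_j, c_j)

for caption in training:
    padded = [("<s>", "O"), ("<s>", "O")] + caption
    for j in range(2, len(padded)):
        c2, c1 = padded[j - 2][0], padded[j - 1][0]
        c, t = padded[j]
        ctx[(c2, c1)] += 1
        ctx_tag[(c2, c1, t)] += 1
        ctx_tag_phrase[(c2, c1, t, c)] += 1

def p_tag(t, c2, c1):
    """P(t_j | c_{j-2}, c_{j-1}) estimated by relative frequency."""
    return ctx_tag[(c2, c1, t)] / ctx[(c2, c1)] if ctx[(c2, c1)] else 0.0

def p_phrase(c, t, c2, c1):
    """P(c_j | t_j, c_{j-2}, c_{j-1}) estimated by relative frequency."""
    return ctx_tag_phrase[(c2, c1, t, c)] / ctx_tag[(c2, c1, t)] if ctx_tag[(c2, c1, t)] else 0.0

print(p_tag("VP", "<s>", "a man"), p_phrase("riding", "VP", "<s>", "a man"))   # 1.0 0.5
```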

8.4.2 Sentence Decoding

At decoding time, we prune the graph of all possible sentences made out of the top L phrases with a beam search, according to three heuristics:

• we consider only the transitions which are likely to happen (we discard any sentence which would have a trigram transition probability inferior to 0.01). This thresholding helps to discard sentences that are semantically incorrect;

• each predicted phrase c_j may appear only once (this is easy to enforce with a beam search, but intractable with a full search);

• we add syntactic constraints which are illustrated in Figure 8.4.

Figure 8.4 – The constrained language model for generating description given the predicted phrases for an image.

The last heuristic is based on the analysis of syntax in Section 8.2. In Figure 8.2, we see that a noun phrase is, in general, followed by a verb phrase or a prepositional phrase, and both are then followed by another noun phrase. A large majority of the sentences contain three noun phrases interleaved with verb phrases or prepositional phrases. According to the statistics reported in Figure 8.1, sentences with two or four noun phrases are also common, but sentences with more than four noun phrases are marginal. We thus repeat this process N = {2,3,4} times until reaching the end of a sentence (characterized by a period).
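
The sketch below puts the three heuristics together in a small constrained beam search; the function and variable names are ours, and `trigram(c2, c1, c)` is assumed to return the transition probability P(c_j | c_{j-2}, c_{j-1}) of Equation 8.6, estimated by counting as above.

```python
import math

def constrained_beam_search(top_np, top_vp, top_pp, trigram, beam=5, min_p=0.01):
    """Sketch of the decoding of Section 8.4.2: hypotheses alternate NP and VP/PP,
    never reuse a phrase, drop transitions below min_p, and may end with a period
    once they contain two to four noun phrases."""
    beams = [(("<s>", "<s>"), 0.0)]            # (phrase sequence incl. padding, log-probability)
    finished = []
    for _ in range(9):                         # at most 4 NPs interleaved with VP/PP, then '.'
        grown = []
        for seq, logp in beams:
            n_np = sum(c in top_np for c in seq)
            if seq[-1] == "<s>" or seq[-1] in top_vp or seq[-1] in top_pp:
                nexts = list(top_np)           # an NP must follow start, a VP or a PP
            else:
                nexts = list(top_vp) + list(top_pp) + (["."] if 2 <= n_np <= 4 else [])
            for c in nexts:
                p = trigram(seq[-2], seq[-1], c)
                if p < min_p or c in seq:      # heuristics 1 (thresholding) and 2 (no reuse)
                    continue
                hyp = (seq + (c,), logp + math.log(p))
                (finished if c == "." else grown).append(hyp)
        beams = sorted(grown, key=lambda h: h[1], reverse=True)[:beam]
    return sorted(finished, key=lambda h: h[1], reverse=True)
```

The returned hypotheses still contain the two `<s>` padding tokens, which would be stripped before re-ranking.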

8.4.3 Sentence Re-ranking

For each test image i, the proposed model generates a set of M sentences. Sentence generation is not conditioned on the image, apart from the phrases which are selected beforehand. Some phrase sequences might be syntactically good, but match the image poorly. Consider, for instance, an image with a cat and a dog. Both sentences "a cat sitting on a mat and a dog eating a bone" and "a cat sitting on a mat" are correct, but the second one is missing an important part of the image. A ranking of the generated sentences is therefore necessary to choose the one that best matches the image.

Because a generated sentence is composed of l phrases predicted by our system, we simply average the phrase scores given by Equation 8.1. For a generated sentence s composed of l phrases c_j, a score between s and i is calculated as:

$$\phi_\theta(s,i) = \frac{1}{l} \sum_{c_j \in s} \phi_\theta(c_j,i)\,. \qquad (8.7)$$

The best candidate is the sentence with the highest score out of the M generated sentences. This ranking helps the system to choose the sentence which is closest to the sample image.
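
In code, this re-ranking step reduces to a few lines; `phrase_score(c)` is assumed to return the score of Equation 8.1 for the test image at hand.

```python
def rerank(generated_sentences, phrase_score):
    """Equation 8.7: score each generated sentence (a list of phrases) by the
    average score of its phrases, and keep the best one."""
    def sentence_score(phrases):
        return sum(phrase_score(c) for c in phrases) / len(phrases)
    return max(generated_sentences, key=sentence_score)
```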

8.5 Experiments

8.5.1 Experimental Setup

Feature Selection

Following Karpathy and Fei-Fei (2015), the image features are extracted using the VGG CNN (Simonyan and Zisserman, 2014). This model generates image representations of dimension 4096 from RGB input images.

For each training set, only phrases occurring at least ten times are considered. This threshold is chosen to fulfill two objectives: (i) limit the number of phrases |C|, and therefore the size of the matrix U, and (ii) exclude rare phrases to better generalize the descriptions. Statistics on the number of phrases are reported in Table 8.2.

                            FLICKR30K     COCO
NOUN PHRASE (NP)                 4818     8982
VERB PHRASE (VP)                 2109     3083
PREPOSITIONAL PHRASE (PP)         128      189
TOTAL |C|                        7055    12254

Table 8.2 – Statistics of phrases appearing at least ten times.

For Flickr30k, this threshold covers about 81% of NP, 83% of VP and 99% of PP. For COCO, it covers about 73% of NP, 75% of VP and 99% of PP. Phrase representations are computed by averaging vector representations of their words, trained with the model described in Chapter 5 as d^wrd = 400-dimensional vectors.

Learning the Multimodal Metric

The parameters θ are V ∈ R^{400×4096} (initialized randomly) and U ∈ R^{400×|C|} (initialized with the phrase representations), which are trained with 15 randomly chosen negative samples. It takes about 2.5 hours on a single CPU (Intel i7 4930K 3.4 GHz) to train on the COCO training dataset, which is much faster than models based on deep neural networks.

Generating Sentences from the Predicted Phrases

Transition probabilities for our constrained language model (see Figure 8.4) are calculated independently for each training set. No smoothing has been used in the experiments. Concerning the set of top-ranked phrases for a given test image, we select only the top five predicted verb phrases and the top five predicted prepositional phrases. Since the average number of noun phrases is higher than for the two other types of phrases (see Figure 8.1), more noun phrases are needed. The top twenty predicted noun phrases are thus selected.

8.5.2 Experimental Results

                            FLICKR30K     COCO
NOUN PHRASE (NP)                38.14    45.44
VERB PHRASE (VP)                20.61    27.83
PREPOSITIONAL PHRASE (PP)       81.70    84.49
TOTAL                           44.92    52.49

Table 8.3 – Recall on phrase retrieval. For each test image, we take the top 20 predicted NP, the top 5 predicted VP, and the top 5 predicted PP.

As a first evaluation, we consider the task of retrieving the ground-truth phrases from test image descriptions. Results reported in Table 8.3 show that our system achieves a recall of around 50% on this task on the test set of both datasets, assuming the threshold considered for each type of phrase (see Section 8.5.1). Note that this task is extremely difficult, as semantically similar phrases (the women / women / the little girls) are classified separately. Despite the possible number of noun phrases being higher, the results in Table 8.3 reveal that noun phrases are better retrieved than verb phrases. This shows that our system is able to detect different objects in the image. However, finding the right verb phrase seems to be more difficult. A possible explanation could be that there exists a wide choice of verb phrases to describe interactions between the noun phrases. For instance, we see in Figure 8.3 that two annotators have used the same noun phrases (a man, a skateboard and a (wooden) ramp) to describe the scene, but they have then chosen a different verb phrase to link them (riding versus is grinding). Therefore, we suspect that a low recall for verb phrases does not necessarily mean that the predictions are wrong. Finding the right prepositional phrase seems, on the contrary, much easier. The high recall for prepositional phrases can be explained by the much lower variability of this type of phrase compared to the two others (see Table 8.2).

As a second evaluation, we consider the task of generating full descriptions.


                                  FLICKR30K                      COCO
                             B-1   B-2   B-3   B-4        B-1   B-2   B-3   B-4
HUMAN AGREEMENT              0.55  0.35  0.23  0.15       0.68  0.45  0.30  0.20
MAO ET AL. (2015)            0.60  0.41  0.28  0.19       0.67  0.49  0.35  0.25
KARPATHY AND FEI-FEI (2015)  0.57  0.37  0.24  0.16       0.62  0.45  0.32  0.23
VINYALS ET AL. (2015)        0.66  0.42  0.28  0.18       0.67   -     -     -
DONAHUE ET AL. (2014)        0.59  0.39  0.25  0.16       0.63  0.44  0.30  0.21
OUR MODEL                    0.60  0.37  0.22  0.14       0.73  0.50  0.34  0.23

Table 8.4 – Comparison between human agreement scores, state-of-the-art models and our model on both datasets. Note that there are slight variations between the test sets chosen in each paper. Scores are reported in terms of the BLEU metric.

We measure the quality of the generated sentences using the popular, yet controversial, BLEU score (Papineni et al., 2002). Table 8.4 shows our sentence generation results on the two datasets considered. BLEU scores are reported up to 4-grams. Human agreement scores are computed by comparing the first ground-truth description against the four others⁵. For comparison, we include results from recently proposed models. Our model, despite being simpler, achieves results similar to the state of the art. It is interesting to note that our results are very close to the human agreement scores.

We show examples of fully automatically generated sentences in Figure 8.5. The simple language model used is able to generate sentences that are in general syntactically correct. Our model produces sensible descriptions with variable complexity for different test samples. Due to the generative aspect of the model, it can occur that the generated sentence is very different from the ground truth but still provides a decent description. The last row of Figure 8.5 illustrates failure samples. We can see in these failure samples that our system has nevertheless output relevant phrases. There is still room for improvement in generating the final description. We deliberately choose a simple language model to show that competitive results can be achieved with a simple approach. A more complex language model could probably avoid these failure samples by considering a larger context. The probability for a dog to stand on top of a wave is obviously very low, but this kind of mistake cannot be detected with a simple trigram language model.

8.5.3 Diversity of Image Descriptions

In contrast to RNN-based models, our model is not trained to match a given image i with its ground-truth descriptions s, i.e., to estimate P(s|i). Because our model instead outputs a set of phrases, it is not really surprising that only 1% of our generated descriptions are in the training set for Flickr30k, and 9.7% for COCO.

⁵ For all models, BLEU scores are computed against five reference sentences, which gives a slight advantage compared to the human agreement scores.


Figure 8.5 – Qualitative results for images on the COCO dataset. The ground-truth annotation (in blue, at the top), the NP, VP and PP predicted from the model, and the generated annotation (in black, at the bottom) are shown for each image. The last row shows failure samples.

While an RNN-based model is generative, it might easily overfit a small training set. Vinyals et al. (2015) report, for instance, that the generated sentence is present in the training set 80% of the time. Our model therefore offers a good alternative with the possibility of producing unseen descriptions with a combination of phrases from the training set.

8.5.4 Phrase Representation Fine-Tuning

Before training the model, the matrix U is initialized with phrase representations obtained from the whole English Wikipedia. This corpus of unlabeled text is well structured and large enough to provide good word vector representations, which can then produce good phrase representations.


PHRASE         #    NEAREST NEIGHBORS (BEFORE)    NEAREST NEIGHBORS (AFTER)
A GREY CAT     1    A GREY DOG                    A GRAY CAT
               2    A GREY AND BLACK CAT          A GREY AND BLACK CAT
               3    A GRAY CAT                    A BROWN CAT
               4    A GREY ELEPHANT               A GREY AND WHITE CAT
               10   A YELLOW CAT                  GREY AND WHITE CAT

HOME PLATE     1    A HOME PLATE                  A HOME PLATE
               4    A PLATE                       HOME BASE
               6    ANOTHER PLATE                 THE PITCH
               9    A RED PLATE                   THE BATTER
               10   A DINNER PLATE                A BASEBALL PITCH

A HALF PIPE    1    A PIPE                        A PIPE
               2    A HALF                        THE RAMP
               5    A SMALL CLOCK                 A HAND RAIL
               9    A LARGE CLOCK                 A SKATE BOARD RAMP
               10   A SMALL PLATE                 AN EMPTY POOL

Table 8.5 – Examples of three noun phrases from the COCO dataset with five of their nearest neighbors before and after learning.

However, the content of Wikipedia is clearly different from the content of the image descriptions. Some words used for describing images might be used in different contexts in Wikipedia, which can lead to out-of-domain representations for certain phrases. It thus becomes crucial to adapt these phrase representations by fine-tuning the matrix U during training⁶. Some examples of noun phrases are reported in Table 8.5 with their nearest neighbors before and after the training. These confirm the importance of fine-tuning to incorporate visual features. In Wikipedia, cat seems to occur in the same contexts as dog or other animals. When looking at the nearest neighbors of a phrase such as a grey cat, other grey animals arise. After training on images, the word cat becomes the important feature of that phrase, and we see that the nearest neighbors are now cats with different colors. In some cases, averaging word vectors to represent phrases is not enough to capture the semantic meaning. Fine-tuning is thus also important to better learn specific phrases. Images related to baseball games, for example, have enabled the phrase home plate to be better defined. This is also true for the phrase a half pipe with images about skateboarding. This leads to interesting phrase representations, grounded in the visual world, which could possibly be used in natural language applications in future work.

⁶ Experiments with a fixed phrase representation matrix U significantly hurt the general performance. We observe about a 50% decrease on both datasets with the BLEU metric. Since the number of trainable parameters is reduced, the capacity of V should be increased to guarantee a fair comparison.


8.6 Conclusion

With this model, we propose to infer different phrases from image samples by leveraging pre-trained word and image representations. From the predicted phrases, our model is able to automatically generate sentences using a statistical language model. We show that the problem of sentence generation can be effectively addressed without the use of complex recurrent networks. Our algorithm, despite being simpler than state-of-the-art models, achieves comparable results on this task. Also, our model generates new sentences which are generally not present in the training set. Future research directions will go towards leveraging unsupervised data and more complex language models to improve sentence generation. Another interest is assessing the impact of visually grounded phrase representations on existing natural language processing systems.

9 Generating Text from Structured Data

Concept-to-text generation addresses the problem of rendering structured records into natural language (Reiter et al., 2000). A typical application is to generate a weather forecast based on a set of structured records of meteorological measurements. While previous work has experimented with datasets that contain only a few tens of thousands of records, such as WEATHERGOV reports or the ROBOCUP dataset, we scale to the large and very diverse problem of generating biographies for personalities based on the structured data contained in Wikipedia infoboxes. Similar applications include the generation of product descriptions based on a product catalog, which may contain millions of items with dozens of attributes each.

To tackle this problem we introduce a statistical generation model conditioned on a Wikipedia infobox. The diversity of this data makes it difficult for classical count-based models to estimate probabilities of rare events due to data sparsity. We address this issue by parameterizing words and fields as vectors (embeddings), along with a neural language model operating on these embeddings. This factorization allows us to scale to a large number of words and fields compared to Liang et al. (2009) and Kim and Mooney (2010), where the number of parameters grows as the product of the number of words and fields. Moreover, our approach does not restrict the relations between the record content and the generated text. This contrasts with less flexible strategies that assume the generation to follow a hybrid alignment tree (Kim and Mooney, 2010), a probabilistic context-free grammar (Konstas and Lapata, 2013), or a tree adjoining grammar (Gyawali and Gardent, 2014).

Our model exploits structured data both globally and locally. Global conditioning summarizes all information about a personality to capture high-level themes, such as whether the biography is about a scientist or an artist, while local conditioning, or attention, describes the previously generated tokens in terms of their relationship to the infobox. We analyze the effectiveness of each and demonstrate their complementarity.


9.1 Related Work

Traditionally, generation systems relied on rules and hand-crafted specifications (Dale et al., 2003; Reiter et al., 2005; Green, 2006; Galanis and Androutsopoulos, 2007; Turner et al., 2010). Generation is divided into modular, yet highly interdependent, decisions: (1) content planning defines which parts of the input fields or meaning representations should be selected; (2) sentence planning determines which selected fields are to be dealt with in each output sentence; and (3) surface realization generates those sentences.

Data-driven approaches have been proposed to automatically learn the individual modules. One approach first aligns records and sentences and then learns a content selection model (Duboue and McKeown, 2002; Barzilay and Lapata, 2005). Hierarchical hidden semi-Markov generative models have also been used to first determine which facts to discuss and then to generate words from the predicates and arguments of the chosen facts (Liang et al., 2009). Sentence planning has been formulated as a supervised set partitioning problem over facts where each partition corresponds to a sentence (Barzilay and Lapata, 2006). End-to-end approaches have combined sentence planning and surface realization by using explicitly aligned sentence/meaning pairs as training data (Ratnaparkhi, 2002; Wong and Mooney, 2007; Belz, 2008; Lu and Ng, 2011). More recently, content selection and surface realization have been combined (Angeli et al., 2010; Kim and Mooney, 2010; Konstas and Lapata, 2013).

Our model is probably most similar to that of Mei et al. (2016), who use an encoder-decoder style neural network model to tackle the limited WEATHERGOV and ROBOCUP tasks. Their architecture relies on LSTM units and a fairly complicated attention mechanism, which reduces its scalability compared to our much simpler design.

Our approach is inspired by the recent success of neural language models in image captioning (Kiros et al., 2014; Karpathy and Fei-Fei, 2015; Vinyals et al., 2015; Fang et al., 2015; Xu et al., 2015), neural machine translation (Devlin et al., 2014; Bahdanau et al., 2015; Luong et al., 2015), and in modeling conversations and dialogues (Shang et al., 2015; Wen et al., 2015; Yao et al., 2015).

9.2 Language Modeling for Constrained Sentence Generation

Conditional language models are a popular choice to generate sentences. We introduce a table-conditioned language model for constraining the sentence generation to include elements from fact tables.


9.2.1 Language Model

Given a sentence s = w1,...,wT composed of T words from a vocabulary W, a language model estimates:

$$P(s) = \prod_{t=1}^{T} P(w_t \mid w_1,\ldots,w_{t-1})\,. \qquad (9.1)$$

Let c_t = w_{t-(n-1)}, ..., w_{t-1} be the sequence of n − 1 context words preceding w_t ∈ s. In an n-gram language model, Equation 9.1 is approximated as

$$P(s) \approx \prod_{t=1}^{T} P(w_t \mid c_t)\,, \qquad (9.2)$$

assuming an order-n Markov property.

9.2.2 Language Model Conditioned on Tables

Figure 9.1 – Wikipedia infobox of Frederick Parker-Rhodes. The introduction of his article reads: “Frederick Parker-Rhodes (21 March 1914 – 21 November 1987) was an English linguist, plant pathologist, computer scientist, mathematician, mystic, and mycologist.”.

As seen in Figure 9.1, a table consists of a set of field/value pairs, where values are sequences of words. We therefore propose language models that are conditioned on these pairs.

Local conditioning

The table allows us to describe each word not only by its string (or index in the vocabulary) but also by a descriptor of its occurrence in the table. Let F define the set of all possible fields f. The occurrence of a word w in the table is described by a set of (field, position) pairs:

$$z_w = \{(f_i, p_i)\}_{i=1}^{m}\,, \qquad (9.3)$$

where m is the number of occurrences of w. Each pair (f, p) indicates that w occurs in field f at position p. In this scheme, most words are described by the empty set as they do not occur in the table. For example, the word linguistics in the table of Figure 9.1 is described as follows:

$$z_{\text{linguistics}} = \{(\text{fields}, 8); (\text{known for}, 4)\}\,, \qquad (9.4)$$

assuming words are lower-cased and commas are treated as separate tokens.

Conditioning both on the field type and the position within the field allows the model to encode field-specific regularities, e.g., a number token in a date field is likely followed by a month token; knowing that the number is the first token in the date field makes this even more likely.

The (field, position) description scheme of the table does not allow to express that a token terminates a field, which can be useful to capture field transitions. For biographies, the last token of the name field is often followed by an introduction of the birth date like '(' or 'was born'. We hence extend our descriptor to a triplet that includes the position of the token counted from the end of the field:

$$z_w = \{(f_i, p_i^+, p_i^-)\}_{i=1}^{m}\,, \qquad (9.5)$$

where our example becomes:

$$z_{\text{linguistics}} = \{(\text{fields}, 8, 4); (\text{known for}, 4, 13)\}\,. \qquad (9.6)$$
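
The sketch below shows how such triplets can be extracted from a tokenized infobox; the field values are a plausible reconstruction of part of the table in Figure 9.1, tokenized so that the resulting descriptors match Equation 9.6, and the function name is ours.

```python
def local_descriptors(infobox):
    """Build z_w = {(field, start position, end position)} for every word in the
    table (Equation 9.5). `infobox` maps a field name to its tokenized value."""
    z = {}
    for field, tokens in infobox.items():
        length = len(tokens)
        for pos, word in enumerate(tokens, start=1):
            # position counted from the start and from the end of the field
            z.setdefault(word, []).append((field, pos, length - pos + 1))
    return z

infobox = {
    "name": ["frederick", "parker-rhodes"],
    "fields": ["mycology", ",", "plant", "pathology", ",", "mathematics",
               ",", "linguistics", ",", "computer", "science"],
    "known for": ["contributions", "to", "computational", "linguistics", ",",
                  "combinatorial", "physics", ",", "bit-string", "physics", ",",
                  "plant", "pathology", ",", "and", "mycology"],
}
print(local_descriptors(infobox)["linguistics"])
# [('fields', 8, 4), ('known for', 4, 13)]
```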

We extend Equation 9.2 to use the above information as additional conditioning context when generating a sentence s:

$$P(s \mid z) = \prod_{t=1}^{T} P(w_t \mid c_t, z_{c_t})\,, \qquad (9.7)$$

where z_{c_t} = z_{w_{t-(n-1)}}, ..., z_{w_{t-1}} are referred to as the local conditioning variables, since they describe the relations of the local context (previous words) with the table.

Global conditioning

The set of fields available in a table often impacts the structure of the generation. For biographies, the fields used to describe a politician are different from the ones for an actor or an athlete. Knowing which fields are available in the table provides type information and helps to determine which fields should be mentioned, both of which greatly influence sentence structure. We introduce global conditioning on the fields g_f as

$$P(s \mid z, g_f) = \prod_{t=1}^{T} P(w_t \mid c_t, z_{c_t}, g_f)\,. \qquad (9.8)$$

Similarly, global conditioning g_w on the words occurring in the table is introduced:

$$P(s \mid z, g_f, g_w) = \prod_{t=1}^{T} P(w_t \mid c_t, z_{c_t}, g_f, g_w)\,. \qquad (9.9)$$

Words provide information complementary to fields. For example, it may be hard to distinguish a basketball player from a hockey player by looking only at the field names, e.g., teams, league, position, weight and height, etc. However, the actual field values, such as the team names, the league name or the player's position, can help the model to give a better prediction. Here, g_f ∈ {0,1}^{|F|} and g_w ∈ {0,1}^{|W|} are binary indicators over fixed field and word vocabularies.
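
A minimal sketch of how these indicators can be built from a table; the vocabularies and table content below are toy examples (the latter loosely follows the schematic table of Figure 9.2), and the function name is ours.

```python
import numpy as np

def global_conditioning(table, field_vocab, word_vocab):
    """Binary indicators g_f over the field vocabulary F and g_w over the word
    vocabulary W, used for global conditioning (Equations 9.8 and 9.9)."""
    g_f = np.zeros(len(field_vocab), dtype=np.int8)
    g_w = np.zeros(len(word_vocab), dtype=np.int8)
    for field, tokens in table.items():
        if field in field_vocab:
            g_f[field_vocab[field]] = 1
        for w in tokens:
            if w in word_vocab:
                g_w[word_vocab[w]] = 1
    return g_f, g_w

field_vocab = {"name": 0, "birthdate": 1, "occupation": 2, "teams": 3}
word_vocab = {"john": 0, "doe": 1, "april": 2, "the": 3}
table = {"name": ["john", "doe"], "birthdate": ["18", "april", "1352"]}
g_f, g_w = global_conditioning(table, field_vocab, word_vocab)
print(g_f, g_w)   # [1 1 0 0] [1 1 1 0]
```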

Figure 9.2 illustrates the model with a schematic example. For predicting the next word wt after a given context ct , we see that the language model is conditioned with sets of triplets for each word occurring in the table, along with all fields and words from this table.

Figure 9.2 – Schematic example of the language model conditioned on tables.

9.2.3 Copy Actions

So far, we have extended the context description with table information. The scoring of each potential output word can also leverage table information. In particular, knowing that a word appears in the table is valuable. For instance, sentences which express facts from a given table often copy words from the table. We therefore extend our language model to enable copy operations. As for the context/input conditioning, we describe each word w with both its string (or index in a vocabulary) and its table descriptor z_w.


Our model reads a table and defines an output domain Q ∪ W which encompasses all vocabulary words W as well as all table tokens Q, as illustrated in Figure 9.2. A side effect is that we can generate words which are outside our vocabulary; for instance, a word like Parker-Rhodes from the table of Figure 9.1 is unlikely to be in the vocabulary W, but its table descriptor expresses that it occurs as the second token of the name field. Therefore the output space of each decision, Q ∪ W, is often larger than W.

9.3 A Neural Language Model Approach

As described in Section 2.3.3, a feed-forward neural network language model (NNLM) estimates P(w_t | c_t) in Equation 9.1 with a parametric function φ_θ which results from the composition of simple differentiable functions or layers. Given a context input c_t, it outputs a score φ_θ(w_t, c_t) for each next word w_t ∈ W.

9.3.1 Embeddings as Inputs

A key aspect of neural language models is the use of word embeddings as inputs. Similar words have generally similar embeddings, as they share latent features. Because the probability estimates are smooth functions of the continuous word embeddings, a small change in the features results in a small change in the probability estimation (Bengio et al., 2003). Therefore, the neural language model can achieve better generalization for unseen n-grams. Just as the discrete feature representations of words are mapped into continuous word embeddings, the discrete feature representations of tables can be mapped into continuous vector spaces, as illustrated in Figure 9.3.

Word embeddings

As described in Section 2.1.3, the embedding layer maps each context word index to a continuous d^wrd-dimensional vector. It relies on a parameter matrix E ∈ R^{d^wrd × |W|} to convert the input c_t into n − 1 vectors of dimension d^wrd:

$$c_t = \left[E_{w_{t-(n-1)}};\ldots;E_{w_{t-1}}\right] \in \mathbb{R}^{d^{wrd}\times(n-1)}\,, \qquad (9.10)$$

where E is initialized with pre-trained word embeddings obtained with Hellinger PCA (see Chapter 4).

Embeddings for tables

As described in Section 9.2.2, the language model is conditioned with elements coming from the tables. Embedding matrices are therefore defined to model both local and global conditioning information. For local conditioning, we denote by l the maximum length of a sequence of words within a field. Each field f_j ∈ F is associated with 2 × l vectors of d dimensions: a first set of l vectors covers each possible starting position in 1, ..., l, and a similar set covers ending positions. This results in a parameter matrix Z ∈ R^{|F|×(2×l)×d}. For a given triplet (f_j, p_i^+, p_i^-), Z_{j,p_i^+} and Z_{j,p_i^-} respectively refer to the embedding vectors of the start and end position for field f_j.

Finally, the global conditioning is handled with two other parameter matrices G^f ∈ R^{|F|×g} and G^w ∈ R^{|W|×g}. Each row G^f_j maps a table field f_j into a vector of dimension g, while each row G^w_t maps a word w_t into a vector of the same dimension. In general, G^w shares its parameters with E, provided that d^wrd is equal to g.

Aggregating embeddings

We represent each occurrence of a word w_t as a triplet (field, start, end) where we have embeddings for the start and end position as described above. Oftentimes a particular word w_t occurs multiple times in a table, e.g., 'linguistics' has two instances in Figure 9.1. In this case, we perform a component-wise max over the start embeddings of all instances of w_t to obtain the best features across all occurrences of w_t. We do the same for end position embeddings:

$$z_{w_t} = \left[\max\left(Z_{j,p_i^+},\ \forall (f_j,p_i^+,p_i^-) \in z_{w_t}\right);\ \max\left(Z_{j,p_i^-},\ \forall (f_j,p_i^+,p_i^-) \in z_{w_t}\right)\right] \qquad (9.11)$$

Note that a special no-field embedding is assigned to wt , when that word is not associated with any fields.

For global conditioning, we define F_q ⊂ F as the set of all the fields in a given table q, and Q as the set of all words in q. We also perform max aggregation. This yields the vectors

$$g_f = \max\left(G^f_f,\ \forall f \in \mathcal{F}_q\right)\,, \qquad (9.12)$$

and

$$g_w = \max\left(G^w_w,\ \forall w \in \mathcal{Q}\right)\,. \qquad (9.13)$$

The final context input is then the concatenation of these vectors:

$$x_{c_t} = \left[c_t;\ z_{c_t};\ g_f;\ g_w\right] \in \mathbb{R}^{d_1}\,, \qquad (9.14)$$

with d_1 = (n − 1) × (2 × d + d^wrd) + (2 × g).
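
The aggregation and concatenation steps of Equations 9.11 to 9.14 can be sketched as follows; the matrix names follow the text, but the random initialization, the toy descriptors and the helper functions are our own illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_wrd, d, g = 11, 64, 64, 128            # context size and embedding sizes (Section 9.4.3)
W_size, F_size, l = 20000, 1740, 10

E  = rng.normal(size=(W_size, d_wrd))       # word embeddings
Zs = rng.normal(size=(F_size, l, d))        # start-position embeddings Z+
Ze = rng.normal(size=(F_size, l, d))        # end-position embeddings Z-
Gf = rng.normal(size=(F_size, g))           # global field embeddings
Gw = rng.normal(size=(W_size, g))           # global word embeddings
no_field = np.zeros(2 * d)                  # special embedding for words outside the table

def embed_word_table(zw):
    """Equation 9.11: component-wise max over the start/end embeddings of all
    occurrences (f, p+, p-) of a context word; zw is a list of index triplets."""
    if not zw:
        return no_field
    starts = np.max([Zs[f, p_start - 1] for f, p_start, _ in zw], axis=0)
    ends   = np.max([Ze[f, p_end - 1]   for f, _, p_end   in zw], axis=0)
    return np.concatenate([starts, ends])

def context_input(word_idx, local_desc, table_fields, table_words):
    """Equation 9.14: concatenate the word embeddings, the local-conditioning
    embeddings and the max-aggregated global embeddings (Equations 9.12-9.13)."""
    ct  = np.concatenate([E[w] for w in word_idx])                # n-1 word vectors
    zct = np.concatenate([embed_word_table(z) for z in local_desc])
    gf  = np.max(Gf[list(table_fields)], axis=0)
    gw  = np.max(Gw[list(table_words)], axis=0)
    return np.concatenate([ct, zct, gf, gw])

# toy usage: 10 context words, the first of which occurs in the table
word_idx = rng.integers(W_size, size=n - 1)
local_desc = [[] for _ in range(n - 1)]
local_desc[0] = [(3, 1, 2)]                  # e.g. first token of a two-token field 3
x = context_input(word_idx, local_desc, table_fields={3, 7}, table_words={int(word_idx[0])})
print(x.shape)   # (n-1)*(d_wrd + 2*d) + 2*g = 10*192 + 256 = 2176
```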

This input is mapped to a latent context representation using a linear operation followed by a non-linear activation function h(·):

$$h_{c_t} = h\left(W^1 x_{c_t} + b^1\right)\,. \qquad (9.15)$$


Figure 9.3 – Schematic example of the embedding-based language model. {·} symbolizes the max operation over the embeddings.

9.3.2 In-Vocabulary Outputs

The representation of the context h_{c_t} is then multiplied by a matrix with one row per word, which produces a real-valued score for each word in the vocabulary:

$$\phi_\theta^{\mathcal{W}}(c_t) = W^{out} h_{c_t} + b^{out} \in \mathbb{R}^{|\mathcal{W}|}\,, \qquad (9.16)$$

where W^1 ∈ R^{n_hu × d_1}, W^out ∈ R^{|W| × n_hu}, b^1 ∈ R^{n_hu}, and b^out ∈ R^{|W|} are learnable weights and biases.

9.3.3 Mixing Outputs for Better Copying

Section 9.2.3 explains that each word w_t is also associated with z_{w_t}, the set of fields in which it occurs, along with the position of w_t in each of these fields. Similarly to local conditioning, we represent each (field, position) pair (j, i) with an embedding F_{j,i}. These embeddings are then projected into the same space as the latent representation of the context input h_{c_t}. Using the max operation over the embedding dimensions, each word is finally embedded into a unique vector:

$$q_{w_t} = \max\left(h\left(W^2 F_{j,i} + b^2\right),\ \forall F_{j,i} \in z_{w_t}\right) \qquad (9.17)$$


where W^2 ∈ R^{n_hu × d} and b^2 ∈ R^{n_hu} are learnable weights and biases, and q_{w_t} ∈ R^{n_hu}. A dot product with the context vector h_{c_t} produces a real-valued score for each word w_t in the table:

$$\phi_\theta^{\mathcal{Q}}(w_t, c_t) = h_{c_t} \cdot q_{w_t}\,. \qquad (9.18)$$

Each word w_t ∈ W ∪ Q then receives a final score by summing its vocabulary score and its field score:

$$\phi_\theta(w_t, c_t) = \phi_\theta^{\mathcal{W}}(w_t, c_t) + \phi_\theta^{\mathcal{Q}}(w_t, c_t)\,, \qquad (9.19)$$

where φ^Q_θ(w_t, c_t) = 0 when w_t ∉ Q.
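
The mixed scoring of Equations 9.16 to 9.19 can be sketched as follows; the parameter names follow the text, `tanh` merely stands in for the non-linear activation h(·), and the toy inputs are ours.

```python
import numpy as np

rng = np.random.default_rng(0)
nhu, d, W_size, F_size, l = 256, 64, 20000, 1740, 10

W_out = rng.normal(size=(W_size, nhu)); b_out = rng.normal(size=W_size)
W2    = rng.normal(size=(nhu, d));      b2    = rng.normal(size=nhu)
F_emb = rng.normal(size=(F_size, l, d))              # copy-action embeddings F_{j,i}

def scores(h_ct, table_words):
    """Equation 9.19: vocabulary score (Eq. 9.16) plus, for words occurring in the
    table, a copy score (Eqs. 9.17-9.18). `table_words` maps each candidate word
    of the table to its list of (field, position) pairs."""
    phi_W = W_out @ h_ct + b_out                      # one score per vocabulary word
    phi = dict(enumerate(phi_W))                      # output domain starts with W
    for w, pairs in table_words.items():
        q_w = np.max([np.tanh(W2 @ F_emb[j, i - 1] + b2) for j, i in pairs], axis=0)
        phi[w] = phi.get(w, 0.0) + h_ct @ q_w         # phi_Q added; OOV table tokens enter here
    return phi

h_ct = rng.normal(size=nhu)
phi = scores(h_ct, {5302: [(12, 1)], "parker-rhodes": [(0, 2)]})   # an in-vocabulary and an OOV word
```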

9.3.4 Training

The neural language model is trained to minimize the negative log-likelihood of a training sentence s with stochastic gradient descent:

$$\mathcal{L}_\theta(s) = -\sum_{t=1}^{T} \log P(w_t \mid c_t, z_{c_t}, g_f, g_w)\,, \qquad (9.20)$$

with θ = {E, Z, G^f, G^w, F, W^1, b^1, W^2, b^2, W^out, b^out}.
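
Assuming, as is standard for NNLMs, that P(w_t | ·) is obtained by a softmax over the scores of Equation 9.19 on the output domain W ∪ Q, one term of this loss can be sketched as follows (toy scores, our own function name).

```python
import numpy as np

def nll(phi, target):
    """Negative log-likelihood of one target word under a softmax over the mixed
    output domain W ∪ Q (one term of the sum in Equation 9.20). `phi` maps every
    candidate word to its score phi_theta(w_t, c_t)."""
    logits = np.array(list(phi.values()), dtype=float)
    m = logits.max()
    log_z = m + np.log(np.exp(logits - m).sum())       # numerically stable log-partition
    return -(phi[target] - log_z)

print(nll({"the": 1.2, "april": 0.3, "parker-rhodes": 2.1}, target="parker-rhodes"))
```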

9.4 Experiments

Our neural network model (Section 9.3) is designed to generate sentences from tables for large-scale problems, where a diverse set of sentence types needs to be generated. Biographies are therefore a good framework to evaluate our model, with Wikipedia offering a large and diverse dataset.

9.4.1 Biography Dataset

The corpus consists of 728,321 biography articles extracted from English Wikipedia (dump of September 2015). These biographies have been detected using "WikiProject Biography"¹. For each biography article, only the introduction section and the infobox are kept. Introductions are split into sentences and tokenized with the Stanford CoreNLP toolkit (Manning et al., 2014). All numbers are mapped to a special token '0', except for years, which are mapped to another special token 'XXXX'. Infobox values have also been tokenized; templates for birth dates and death dates have been formatted in natural language². All tokens in introductions and infoboxes have been lowercased. The final corpus has been divided into three sub-parts to provide training (80%), validation (10%) and test (10%) sets. We will release this data with the camera-ready.

¹ See https://en.wikipedia.org/wiki/Wikipedia:WikiProject_Biography
² See https://en.wikipedia.org/wiki/Template:Birth_date


9.4.2 Baseline

Our baseline is an interpolated Kneser-Ney (KN) language model, and we use the KenLM toolkit to train 5-gram models without pruning (Heafield et al., 2013). We equip the baseline with copy actions of words from tables to sentences by pre-processing words occurring in both as follows: each copied word w is replaced by a special token reflecting its table descriptor z_w (Equation 9.3). When words occur in multiple fields, the longest common field sequence is chosen. In case of ties, the most frequent field is preferred. The introduction of the article associated with the table in Figure 9.1 may look as follows under this scheme: "name_1 name_2 ( birthdate_1 birthdate_2 birthdate_3 – deathdate_1 deathdate_2 deathdate_3 ) was an english linguist , fields_3 pathologist , fields_10 scientist , mathematician , mystic and mycologist ."

At decoding time, we copy words from the tables when those special tokens are emitted. This baseline can be seen as a template model, and it is called Template KN in the next sections.
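
A sketch of this pre-processing; the tie-breaking rules (longest common field sequence, then most frequent field) are omitted, so this is only a simplified rendering with our own function name.

```python
def templatize(sentence_tokens, table):
    """Replace every sentence token that also appears in the table by a
    field_position token reflecting its descriptor z_w (Equation 9.3)."""
    positions = {}
    for field, tokens in table.items():
        for pos, word in enumerate(tokens, start=1):
            positions.setdefault(word, (field, pos))    # keep the first occurrence only
    return [f"{positions[w][0]}_{positions[w][1]}" if w in positions else w
            for w in sentence_tokens]

table = {"name": ["frederick", "parker-rhodes"],
         "birthdate": ["21", "march", "1914"]}
print(templatize("frederick parker-rhodes ( 21 march 1914 )".split(), table))
# ['name_1', 'name_2', '(', 'birthdate_1', 'birthdate_2', 'birthdate_3', ')']
```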

9.4.3 Training Setup

For our neural models, we train 11-gram language models (n = 11) with the following hyper-parameters:

• number of word types: |W| = 20000,

• number of field types: |F| = 1740,

• maximum number of occurrences of a word in a table: m = 10,

• maximum length of a word sequence in a field: l = 10,

• word embedding size: d^wrd = 64,

• field embedding size: d = 64,

• global embedding size: g = 128,

• number of hidden units: n_hu = 256.

All fields that occur at least 100 times in the training data are included in the set of fields F. We include the 20,000 most frequent words in the vocabulary. The other hyper-parameters are set through validation, maximizing BLEU over a validation subset of 1,000 sentences. Similarly, early stopping is applied: training ends when BLEU stops improving on the same validation subset. One should note that the maximum number of tokens in a field, l = 10, means that we encode only 10 positions: for longer field values the final tokens are not dropped but their position is capped to 10.


MODEL                                      PERPLEXITY                        BLEU
                                       valid          test            valid        test
KN                                     10.54          10.51            2.2          2.2
NNLM                                    9.43 ± 0.01    9.40 ± 0.01     2.6 ± 0.4    2.4 ± 0.3
  + LOCAL (FIELD, START, END)           8.63 ± 0.01    8.61 ± 0.01     4.4 ± 0.2    4.2 ± 0.5
TEMPLATE KN                             7.48           7.46           19.7         19.8
TABLE NNLM W/ LOCAL (FIELD, START)      4.59 ± 0.01†   4.60 ± 0.01†   26.0 ± 0.4   26.0 ± 0.4
  + LOCAL (FIELD, START, END)           4.60 ± 0.01†   4.60 ± 0.01†   26.7 ± 0.4   26.6 ± 0.4
  + GLOBAL (FIELD)                      4.30 ± 0.01†   4.30 ± 0.01†   33.4 ± 0.2   33.4 ± 0.2
  + GLOBAL (FIELD & WORD)               4.40 ± 0.02†   4.40 ± 0.02†   34.7 ± 0.3   34.7 ± 0.4

Table 9.1 – Valid and test perplexity for all models, and valid and test BLEU for table-conditioned models. For neural network language models (NNLM) we report the mean and standard deviation of five training runs with different initializations. BLEU scores are computed over sentences generated with a beam search (beam size is 5). Perplexity values marked with † are not directly comparable to the unmarked ones, as the output vocabulary is slightly different.

9.4.4 Evaluation Metrics

We use two different metrics to evaluate our models. Performance is first evaluated in terms of perplexity, as this is the standard metric for language modeling (see Section 2.3.5). We report perplexity results on a per sentence basis. For models with copy actions, we measure the quality of the generated sentences using the BLEU score (see Section 2.3.5).

9.5 Results

This section describes our results and discusses the impact of the different conditioning variables.

9.5.1 The More, The Better

The results (Table 9.1) show that more conditioning information helps to improve the performance of our models.

Without copy actions.

In terms of perplexity, (i) the neural network language model (NNLM) is slightly better than an interpolated KN language model, and (ii) adding local conditioning on the field start and end positions further improves performance. BLEU over the model generations is generally very low, but there is a clear improvement when using local conditioning, because it allows learning transitions between fields and links past model predictions to the table, unlike KN or the plain NNLM.

With copy actions.

For experiments with copy actions we use the full local conditioning (Equation 9.5) in the neural language models. Perplexity can only be compared between variants of Table NNLM models, as described above, and improvements are less clear in terms of this measure. However, BLEU clearly improves when moving from Template KN to Table NNLM, and more features successively improve accuracy. Global conditioning on the fields improves the model by over 7 BLEU, and adding words gives a further 1.3 points. This is a total improvement of nearly 15 BLEU over the Template KN baseline.

9.5.2 Attention Mechanism

Our model implements attention over the input table fields. For each word w_t in the table, Equation 9.19 takes the language model score φ^W_θ(w_t, c_t) and adds a bias φ^Q_θ(w_t, c_t). The bias is the dot product between a representation of the table field in which w_t occurs and a representation of the context (Equation 9.18) that summarizes the previously generated fields and words.

Table 9.2 shows that this mechanism adds a large bias to continue a field if it has not generated all tokens from the table, e.g., it emits the word occurring in name_2 after generating name_1. It also nicely handles transitions between field types, e.g., the model adds a large bias to the words occurring in the occupation field after emitting the birth date.

9.5.3 Sentence Decoding

We use a standard beam search to explore a larger set of sentences compared to simple greedy search. This allows us to explore K times more paths, which comes at a linear increase in the number of forward computation steps for our language model. We compare various beam settings for the baseline Template KN and our Table NNLM (Figure 9.4). The best validation BLEU can be obtained with a beam size of K = 5. Our model is also several times faster than the baseline, requiring only about 200 ms per sentence with K = 5. Beam search generates many n-gram lookups for Kneser-Ney, which requires many random memory accesses; neural models instead perform scoring through matrix-matrix products, an operation which is more local and can be performed in a block-parallel manner where modern graphics processors shine (Kindratenko, 2014).

9.5.4 Qualitative Analysis

Table 9.3 shows generations for different conditioning information from the Wikipedia table shown in Figure 9.1.


Table 9.2 – Visualization of attention scores for Nellie Wong's Wikipedia infobox, for the generated sentence "nellie wong ( born september 12 , 1934 ) is an american poet and activist .". Rows represent the probability distribution over (field, position) pairs from the table after generating each word. The columns represent the conditioning context, e.g., the model takes n − 1 words as context. The darker the color, the higher the probability. Best viewed in colors.


Figure 9.4 – Comparison between our best model (Table NNLM) and the baseline (Template KN) for different beam sizes. The x-axis is the average timing (in milliseconds) for generating one sentence. The y-axis is the BLEU score. All results are measured on a subset of 1,000 samples of the validation set.

First of all, comparing the reference to the fact table reveals that our training data is not perfect. The month of birth mentioned in the fact table and the first sentence of the Wikipedia article are different; this may have been introduced by one contributor editing the article and not keeping the information consistent.

All three versions of our model correctly generate the beginning of the sentence by copying the name, the birth date and the death date from the table. Knowing that the person has died, the past tense is used. Frederick Parker-Rhodes was a scientist, but this occupation is not directly mentioned in the table. The model without global conditioning can therefore not predict the right occupation, and it continues the generation with the most common occupation (in Wikipedia) for a person who has died. In contrast, the global conditioning over the fields helps the model to understand that this person was indeed a scientist. However, it is only with the global conditioning on the words that the model can infer the correct occupation, i.e., computer scientist.


MODEL                       GENERATED SENTENCE

REFERENCE                   Frederick Parker-Rhodes (21 March 1914 – 21 November 1987) was an English linguist, plant pathologist, computer scientist, mathematician, mystic, and mycologist.

TABLE NNLM W/ LOCAL         frederick parker-rhodes ( 21 november 1914 – 2 march 1987 ) was an australian rules footballer who played with carlton in the victorian football league ( vfl ) during the XXXXs and XXXXs .

+ GLOBAL (FIELD)            frederick parker-rhodes ( 21 november 1914 – 2 march 1987 ) was an english mycology and plant pathology , mathematics at the university of uk .

+ GLOBAL (FIELD & WORD)     frederick parker-rhodes ( 21 november 1914 – 2 march 1987 ) was a british computer scientist , best known for his contributions to computational linguistics .

Table 9.3 – First sentence from the current Wikipedia article about Frederick Parker-Rhodes and the sentences generated from the three versions of our table-conditioned neural language model (Table NNLM) using the Wikipedia infobox seen in Figure 9.1.

9.6 Conclusion

We have shown that our embeddings-based model can generate fluent descriptions of arbitrary people based on structured data. Local and global conditioning improves our model by a large margin and we outperform a Kneser-Ney language model by nearly 15 BLEU. Our task uses an order of magnitude more data than previous work and has a vocabulary that is three orders of magnitude larger.

In this chapter, we have only focused on generating the first sentence, but this model can be extended to the generation of longer biographies. Furthermore, the current training loss function does not explicitly penalize the model for generating incorrect facts, e.g., predicting the wrong nationality or a wrong occupation is not currently considered worse than choosing the wrong determiner. A loss function that could assess factual accuracy would certainly improve sentence generation by avoiding such mistakes.


10 Conclusion

Natural language is complex and subtle, which makes its understanding an AI-hard problem. For solving NLP tasks, computers need to become as intelligent as people and thus apprehend the underlying mechanisms that determine human language. This thesis does not have the ambition to solve NLP but aims at bringing efficient and effective solutions for computers to understand the basics of human language: words and phrases. Just as humans know that cat and dog are close semantically, we propose an approach that allows computers to understand that too, by capturing syntactic and semantic properties of words into vector space models (aka word embeddings). Because a red persian cat is also close to cat, we then propose a model that enables phrases of different lengths to be encoded into the same semantic space. The model is trained in such a manner that word vector representations can be aggregated together while keeping the maximum information from the original vectors, so that computers know that a red persian cat is a composition of three features. Finally, we show that "intelligent" systems can be built upon the understanding of these basics for classifying text documents or generating natural language.

10.1 Achievements

This thesis proposes techniques for:

• Building word embeddings from large text corpora. In Chapter 4, we demonstrate that appealing word embeddings can be obtained by computing a Hellinger PCA of the word co-occurrence matrix built from the English Wikipedia. While PCA can be computationally expensive with huge matrices, our findings reveal that having a significant, but not too large, set of context words is sufficient for capturing most of the syntactic and semantic properties of words. With a limited number of columns and a randomized SVD, PCA is fast to compute and becomes a good alternative to NNLM.

In Chapter 5, we show that Hellinger PCA can be performed with an autoencoder network, which enables the construction of a joint learning architecture. Given an input phrase composed of n words, the network learns embeddings for the n words, while keeping the maximum information of each individual word when they are summed together. Considering that the phrase vocabulary grows exponentially with n, this approach gives a nice framework to produce the huge variety of possible sequences of n words in a timely and efficient manner with low memory consumption. Relying on word co-occurrence statistics to represent words in vector space also provides a framework to easily generate representations for unseen words or phrases. Regarding technology transfer, we have released a complete toolkit in C++ for computing word embeddings from a large corpus of text with Hellinger PCA¹. An open source implementation of the autoencoder approach is also available on github.com². Pre-trained word embeddings and the phrase dataset from Chapter 5 are both available online³.

• Document classification. As we have built embeddings carrying meaningful semantic information about words, we propose to tackle sentiment classification with a convolutional neural network in Chapter 6. The objective of sentiment classification is to classify a text document according to the sentimental polarities of the opinions it contains (e.g., positive or negative). Traditional BOW models usually need bigram features not to misclassify documents containing bigrams such as not good or not funny, leading to high-dimensional yet sparse document representations. We show that the proposed model is a good alternative for fighting the curse of dimensionality, while offering good performance and an interesting framework for sentiment visualization. In Chapter 7, we leverage our model described in Chapter 5 to propose an unsupervised method for producing compact representations of text documents, while adding more semantic information. As word embeddings can be summed together, we group n-grams of different lengths n into semantic concepts. Documents are then represented as bags of semantic concepts. Such a model has several advantages over classical approaches for representing documents in a low-dimensional space: it leverages semantic information coming from n-grams; it builds document representations with low resource consumption (time and memory); it can infer semantic concepts for unseen n-grams; and finally, it is capable of providing relevant document representations even with a small set of documents. We show that such a model is suitable for the classification of news texts and movie reviews, reaching similar performance to traditional BOW models with considerably fewer features.

• Sentence generation. Chapter 8 proposes a model for generating image descriptions, leveraging our pre-trained word embeddings from Chapter 5 and image representations from a pre-trained CNN. Exploiting such rich representations enables the design of an effective model that achieves good performance without the use of complex recurrent networks. In this chapter, we also show that multimodal learning improves the quality of phrase vector representations by incorporating visual features.

¹ Available at http://github.com/rlebret/hpca.
² Available at http://github.com/ParallelDots/WordEmbeddingAutoencoder.
³ Available at http://www.lebret.ch/words.


Chapter 9 addresses the problem of concept-to-text generation, rendering Wikipedia infoboxes into natural language. To tackle this problem we propose to parameterize words and fields from infoboxes as embeddings, along with a neural network language model operating on these embeddings. We show that our embeddings-based model can generate fluent descriptions of arbitrary people based on structured data, outperforming a standard template-based model.

Regarding technology transfer, we have released the code of our phrase-based model for image captioning⁴. All biography articles and infoboxes from Chapter 9 will also be released online soon.

10.2 Perspectives for Future Work

As mentioned in the introduction, natural language is based on three mechanisms: compositionality, hierarchy and recursion. Even if not optimal, this thesis provides a solution for the first mechanism. The two other mechanisms are however not addressed in this thesis. Perspectives for future work are thus the following:

• Considering word order for compositionality. Our approach for compositionality makes sense and works well when the meaning of a phrase is literally "the sum of its parts", where each word independently has its proper meaning. However, Landauer (2002) estimates that 80% of the meaning of English text comes from word choice and the remaining 20% comes from word order. It therefore seems important to preserve word order, but commutative or associative operations ignore word order and thus the syntactic structure of expressions. To overcome this shortcoming, connectionist approaches have been proposed to develop distributed representations which encode the structural relationships between words (Hinton, 1986; Pollack, 1990; Elman, 1991). Future work is to investigate these recurrent architectures, which could convert sequences of words of variable sizes into the same semantic space while retaining the word order.

• Hierarchical structure for higher-level phrase embeddings. Knowing the hierarchical structure, approaches based on recurrent neural network architectures can produce representations of syntactic sentence structures in an efficient way (Kwasny and Kalman, 1995). When operating over a parse tree, good performance can be achieved on tasks such as sentiment detection or semantic relationship classification (Socher et al., 2012). Future work would be to learn higher-level phrase embeddings by combining both compositionality and hierarchical structure in an unsupervised manner.

• Recursion for sentence embeddings. With recurrent architectures, the recursion theory could then be addressed by recursively compressing higher-level phrase embeddings within other higher-level phrase embeddings to compute sentence embeddings.

⁴ Available at http://github.com/rlebret/phrase-based_image_captioning.


Sentence embeddings would be helpful for document classification and natural language generation. For instance, we could define the semantic concepts from Chapter 7 by clustering sentences instead of n-grams. After the generation of the first sentence in Chapter 9, we could use the embedding of this sentence as conditioning for generating the next one, and thus produce a short paragraph summary for a given Wikipedia infobox.


References

R. K. Ando, T. Zhang, and P. Bartlett. 2005. A framework for learning predictive structures from multiple tasks and unlabeled data. Journal of Machine Learning Research.

G. Angeli, P. Liang, and D. Klein. 2010. A simple domain-independent probabilistic approach to generation. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 502–512. Association for Computational Linguistics.

D. Bahdanau, K. Cho, and Y. Bengio. 2015. Neural machine translation by jointly learning to align and translate. In 3rd International Conference on Learning Representations (ICLR).

M. Baroni and R. Zamparelli. 2010. Nouns are Vectors, Adjectives are Matrices: Representing Adjective-Noun Constructions in Semantic Space. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 1183–1193. Association for Computational Linguistics.

R. Barzilay and M. Lapata. 2005. Collective content selection for concept-to-text generation. In Proceedings of the conference on Human Language Technology and Empirical Methods in Natural Language Processing, pages 331–338. Association for Computational Linguistics.

R. Barzilay and M. Lapata. 2006. Aggregation via set partitioning for natural language generation. In Proceedings of the main conference on Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics, pages 359–366. Association for Computational Linguistics.

A. Belz. 2008. Automatic generation of weather forecast texts using comprehensive probabilistic generation-space models. Natural Language Engineering, 14(04):431–455.

Y. Bengio, P. Simard, and P. Frasconi. 1994. Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5(2):157–166.

Y. Bengio, R. Ducharme, and P. Vincent. 2001. A Neural Probabilistic Language Model. In NIPS Workshop.

Y. Bengio, R. Ducharme, P. Vincent, and C. Jauvin. 2003. A Neural Probabilistic Language Model. Journal of Machine Learning Research, 3:1137–1155.

A. Bhattacharyya. 1943. On a Measure of Divergence Between Two Statistical Populations Defined by Probability Distributions. Bulletin of the Calcutta Mathematical Society, 35:99–110.

W. Blacoe and M. Lapata. 2012. A Comparison of Vector-based Representations for Semantic Composition. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 546–556. Association for Computational Linguistics.

D. M. Blei, A. Y. Ng, and M. I. Jordan. 2003. Latent Dirichlet Allocation. Journal of Machine Learning Research.

H. Bourlard and Y. Kamp. 1988. Auto-Association by Multilayer Perceptrons and Singular Value Decomposition. Biological Cybernetics, 59(4):291–294.

J. S. Bridle. 1990. Probabilistic interpretation of feedforward classification network outputs, with relationships to statistical pattern recognition. In Neurocomputing, pages 227–236. Springer.

P. F. Brown, P. V. deSouza, R. L. Mercer, V. J. D. Pietra, and J. C. Lai. 1992. Class-based n-gram models of natural language. Computational Linguistics.

J. A. Bullinaria and J. P. Levy. 2007. Extracting semantic representations from word co-occurrence statistics: A computational study. Behavior Research Methods, 39(3):510–526.

J. Caron. 2001. Experiments with LSA Scoring: Optimal Rank and Basis. In Computational Information Retrieval, pages 157–169. Society for Industrial and Applied Mathematics, Philadelphia, PA, USA.

S. F. Chen and J. Goodman. 1996. An empirical study of smoothing techniques for language modeling. In Proceedings of the 34th annual meeting on Association for Computational Linguistics, pages 310–318. Association for Computational Linguistics.

X. Chen and C. L. Zitnick. 2015. Mind’s Eye: A Recurrent Visual Representation for Image Caption Generation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

Y. Chen, B. Perozzi, R. Al-Rfou’, and S. Skiena. 2013. The Expressive Power of Word Embeddings. CoRR, abs/1301.3226.

N. Chomsky. 1957. Syntactic Structures. Mouton.

S. Clark, B. Coecke, and M. Sadrzadeh. 2008. A compositional distributional model of meaning. In Proceedings of the Second Quantum Interaction Symposium (QI-2008), pages 133–140.

R. Collobert and J. Weston. 2008. A Unified Architecture for Natural Language Processing: Deep Neural Networks with Multitask Learning. In ICML.

R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, and P. Kuksa. 2011. Natural Language Processing (Almost) from Scratch. Journal of Machine Learning Research.

R. Collobert. 2004. Large Scale Machine Learning. Ph.D. thesis, Université Paris VI.

C. Cortes and V. Vapnik. 1995. Support-vector networks. Machine Learning, 20(3):273–297.

R. Dale, S. Geldof, and J.-P. Prost. 2003. Coral: Using natural language generation for navigational assistance. In Proceedings of the 26th Australasian computer science conference-Volume 16, pages 35–44. Australian Computer Society, Inc.

S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer, and R. Harshman. 1990. Indexing by Latent Semantic Analysis. Journal of the American Society for Information Science, 41(6):391–407.

J. Devlin, R. Zbib, Z. Huang, T. Lamar, R. Schwartz, and J. Makhoul. 2014. Fast and robust neural network joint models for statistical machine translation. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, volume 1, pages 1370–1380.

J. Donahue, L. A. Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell. 2014. Long-term Recurrent Convolutional Networks for Visual Recognition and Description. arXiv preprint arXiv:1411.4389.

P. A. Duboue and K. R. McKeown. 2002. Content planner construction via evolutionary algorithms and a corpus-based fitness function. In Proceedings of INLG 2002, pages 89–96.

J. L. Elman. 1991. Distributed Representations, Simple Recurrent Networks, and Grammatical Structure. Machine Learning.

R. Fan, K. Chang, C. Hsieh, X. Wang, and C. Lin. 2008. LIBLINEAR: A Library for Large Linear Classification. Journal of Machine Learning Research.

H. Fang, S. Gupta, F. Iandola, R. K. Srivastava, L. Deng, P. Dollar, J. Gao, X. He, M. Mitchell, J. C. Platt, L. C. Zitnick, and G. Zweig. 2015. From Captions to Visual Concepts and Back. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June.

L. Finkelstein, E. Gabrilovich, Y. Matias, E. Rivlin, Z. Solan, G. Wolfman, and E. Ruppin. 2002. Placing Search in Context: The Concept Revisited. ACM Transactions on Information Systems.

J. R. Firth. 1957. A Synopsis of Linguistic Theory 1930-55.

G. Frege. 1892. Über Sinn und Bedeutung. In Mark Textor, editor, Funktion - Begriff - Bedeutung, Sammlung Philosophie. Vandenhoeck & Ruprecht, Göttingen.

D. Galanis and I. Androutsopoulos. 2007. Generating multilingual descriptions from linguistically annotated OWL ontologies: the NaturalOWL system. In Proceedings of the Eleventh European Workshop on Natural Language Generation, pages 143–146. Association for Computational Linguistics.

K. Gimpel, N. Schneider, B. O’Connor, D. Das, D. Mills, J. Eisenstein, M. Heilman, D. Yogatama, J. Flanigan, and N. A. Smith. 2011. Part-of-Speech Tagging for Twitter: Annotation, Features, and Experiments. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics.

A. Go, R. Bhayani, and L. Huang. 2009. Twitter Sentiment Classification using Distant Supervision. CS224N Project Report.

I. J. Good. 1953. The population frequencies of species and the estimation of population parameters. Biometrika, 40(3-4):237–264.

A. Graves. 2013. Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850.

N. Green. 2006. Generation of biomedical arguments for lay readers. In Proceedings of the Fourth International Natural Language Generation Conference, pages 114–121. Association for Computational Linguistics.

E. Grefenstette, G. Dinu, Y. Zhang, M. Sadrzadeh, and M. Baroni. 2013. Multi-Step Regression Learning for Compositional Distributional Semantics. In Proceedings of the 10th International Conference on Computational Semantics (IWCS 2013) – Long Papers, pages 131–142. Association for Computational Linguistics.

M. U. Gutmann and A. Hyvärinen. 2012. Noise-contrastive estimation of unnormalized statistical models, with applications to natural image statistics. The Journal of Machine Learning Research, 13(1):307–361.

B. Gyawali and C. Gardent. 2014. Surface Realisation from Knowledge-Bases. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, volume 1, pages 424–434.

N. Halko, P.-G. Martinsson, and J. A. Tropp. 2011. Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM review, 53(2):217–288.

Z. Harris. 1954. Distributional Structure. Word, 10(2-3):146–162.

M. D. Hauser, N. Chomsky, and W. T. Fitch. 2002. The Faculty of Language: What Is It, Who Has It, and How Did It Evolve? Science.

K. Heafield, I. Pouzyrevsky, J. H. Clark, and P. Koehn. 2013. Scalable Modified Kneser-Ney Language Model Estimation. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, pages 690–696, Sofia, Bulgaria, August.

E. Hellinger. 1909. Neue Begründung der Theorie quadratischer Formen von unendlichvielen Veränderlichen. Journal für die reine und angewandte Mathematik, 136:210–271.

G. E. Hinton. 1986. Learning Distributed Representations of Concepts. In Proceedings of the 8th Annual Conference of the Cognitive Science Society, pages 1–12. Hillsdale, NJ: Erlbaum.

S. Hochreiter and J. Schmidhuber. 1997. Long short-term memory. Neural computation, 9(8):1735–1780.

S. Hochreiter, Y. Bengio, P. Frasconi, and J. Schmidhuber. 2001. Gradient flow in recurrent nets: the difficulty of learning long-term dependencies.

M. Hodosh, P. Young, and J. Hockenmaier. 2013. Framing image description as a ranking task: data, models and evaluation metrics. Journal of Artificial Intelligence Research.

X. Hu, X. Zhang, C. Lu, E. K. Park, and X. Zhou. 2009. Exploiting Wikipedia As External Knowledge for Document Clustering. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’09, pages 389–396, New York, NY, USA. ACM.

F. Huang and A. Yates. 2009. Distributional Representations for Handling Sparsity in Supervised Sequence-Labeling. In ACL.

W. Humboldt. 1836. Über die Verschiedenheit des menschlichen Sprachbaues: Und ihren Einfluss auf die geistige Entwickelung des Menschengeschlechts. Druckerei der Königlichen Akademie der Wissenschaften.

T. Joachims. 1998. Text categorization with Support Vector Machines: Learning with many relevant features. In C. Nédellec and C. Rouveirol, editors, Machine Learning: ECML-98, volume 1398 of Lecture Notes in Computer Science, pages 137–142. Springer Berlin Heidelberg.

N. Kalchbrenner, E. Grefenstette, and P. Blunsom. 2014. A Convolutional Neural Network for Modelling Sentences. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, volume 1, pages 655–665.

A. Karpathy and L. Fei-Fei. 2015. Deep Visual-Semantic Alignments for Generating Image Descriptions. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June.

J. Kim and R. J. Mooney. 2010. Generative alignment and semantic parsing for learning from ambiguous supervision. In Proceedings of the 23rd International Conference on Computational Linguistics: Posters, pages 543–551. Association for Computational Linguistics.

V. Kindratenko. 2014. Numerical Computations with GPUs. Springer.

R. Kiros, R. Salakhutdinov, and R. S. Zemel. 2014. Unifying visual-semantic embeddings with multimodal neural language models. arXiv preprint arXiv:1411.2539.

I. Konstas and M. Lapata. 2013. A global model for concept-to-text generation. Journal of Artificial Intelligence Research, 48(1):305–346, October.

T. Koo, X. Carreras, and M. Collins. 2008. Simple semi-supervised dependency parsing. In ACL.

G. Kulkarni, V. Premraj, S. Dhar, S. Li, Y. Choi, A. C. Berg, and T. L. Berg. 2013. Baby Talk: Understanding and Generating Simple Image Descriptions. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(12):2891–2903.

P. Kuznetsova, V. Ordonez, A. C. Berg, T. L. Berg, and Y. Choi. 2012. Collective Generation of Natural Image Descriptions. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 359–368. Association for Computational Linguistics, July.

S. C. Kwasny and B. L. Kalman. 1995. Tail-Recursive Distributed Representations and Simple Recurrent Networks. Connection Science, 7:61–80.

T. K. Landauer and S. T. Dumais. 1997. A solution to Plato’s problem: The Latent Semantic Analysis theory of the acquisition, induction, and representation of knowledge. Psychological Review.

T. K. Landauer. 2002. On the computational basis of learning and cognition: Arguments from LSA. In The psychology of learning and motivation. Academic Press.

P.-S. Laplace. 1820. Théorie analytique des probabilités. V. Courcier.

R. Lebret and R. Collobert. 2013. Is Deep Learning Really Necessary for Word Embeddings? In NIPS Deep Learning Workshop.

R. Lebret and R. Collobert. 2014. Word Embeddings through Hellinger PCA. In Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, pages 482–490. Association for Computational Linguistics.

R. Lebret and R. Collobert. 2015a. N-gram-Based Low-Dimensional Representation for Document Classification. In 3rd International Conference on Learning Representations (ICLR) – Workshop Track.

R. Lebret and R. Collobert. 2015b. Rehabilitation of Count-Based Models for Word Vector Representations. In Alexander Gelbukh, editor, Computational Linguistics and Intelligent Text Processing, volume 9041 of Lecture Notes in Computer Science, pages 417–429. Springer International Publishing.

R. Lebret and R. Collobert. 2015c. “The Sum of Its Parts”: Joint Learning of Word and Phrase Representations with Autoencoders. In ICML Deep Learning Workshop.

R. Lebret, P. O. Pinheiro, and R. Collobert. 2015a. Phrase-Based Image Captioning. In Proceedings of the 32nd International Conference on Machine Learning (ICML).

R. Lebret, P. O. Pinheiro, and R. Collobert. 2015b. Simple Image Description Generator via a Linear Phrase-Based Model. In 3rd International Conference on Learning Representations (ICLR) – Workshop Track.

R. Lebret, D. Grangier, and M. Auli. 2016a. Neural Text Generation from Structured Data with Application to the Biography Domain. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing (EMNLP).

R. Lebret, P. O. Pinheiro, and R. Collobert. 2016b. Twitter Sentiment Analysis (Almost) from Scratch. Idiap Research Report Idiap-RR-15-2016, Idiap.

Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. 1998. Gradient-based learning applied to document recognition. Proceedings of the IEEE.

J. Legrand and R. Collobert. 2014. Joint RNN-based Greedy Parsing and Word Composition. arXiv preprint arXiv:1412.7028.

O. Levy and Y. Goldberg. 2014. Linguistic Regularities in Sparse and Explicit Word Representations. In CoNLL, pages 171–180.

P. Liang, M. I. Jordan, and D. Klein. 2009. Learning semantic correspondences with less supervision. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 1-Volume 1, pages 91–99. Association for Computational Linguistics.

P. Liang. 2005. Semi-supervised learning for natural language. Ph.D. thesis, Citeseer.

D. Lin and X. Wu. 2009. Phrase clustering for discriminative learning. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 2-Volume 2, pages 1030–1038. Association for Computational Linguistics.

T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. 2014. Microsoft COCO: Common Objects in Context. In Computer Vision–ECCV 2014, pages 740–755. Springer.

W. Lu and H. T. Ng. 2011. A probabilistic forest-to-string model for language generation from typed lambda calculus expressions. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 1611–1622. Association for Computational Linguistics.

M. Luong, R. Socher, and C. D. Manning. 2013. Better Word Representations with Recursive Neural Networks for Morphology. In CoNLL, pages 104–113. Citeseer.

M.-T. Luong, I. Sutskever, Q. V. Le, O. Vinyals, and W. Zaremba. 2015. Addressing the rare word problem in neural machine translation. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics, pages 11–19.

A. L. Maas, R. E. Daly, P. T. Pham, D. Huang, A. Y. Ng, and C. Potts. 2011. Learning Word Vectors for Sentiment Analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies-Volume 1, pages 142–150. Association for Computational Linguistics.

C. D. Manning and H. Schütze. 1999. Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, MA, USA.

C. D. Manning, M. Surdeanu, J. Bauer, J. Finkel, S. J. Bethard, and D. McClosky. 2014. The Stanford CoreNLP natural language processing toolkit. In Association for Computational Linguistics (ACL) System Demonstrations, pages 55–60.

J. Mao, W. Xu, Y. Yang, J. Wang, Z. Huang, and A. L. Yuille. 2015. Deep Captioning with Multimodal Recurrent Neural Networks (m-RNN). In 3rd International Conference on Learning Representations (ICLR).

R. McDonald, K. Crammer, and F. Pereira. 2005. Flexible text segmentation with structured multilabel classification. In Proceedings of the conference on Human Language Technology and Empirical Methods in Natural Language Processing, pages 987–994. Association for Computational Linguistics.

H. Mei, M. Bansal, and M. R. Walter. 2016. What to talk about and how? Selective Generation using LSTMs with Coarse-to-Fine Alignment. In Proceedings of Human Language Technologies: The 2016 Annual Conference of the North American Chapter of the Association for Computational Linguistics.

T. Mikolov, M. Karafiat, L. Burget, J. Cernocky, and S. Khudanpur. 2010. Recurrent neural network based language model. In INTERSPEECH.

T. Mikolov, K. Chen, G. Corrado, and J. Dean. 2013a. Efficient Estimation of Word Represen- tations in Vector Space. 1st International Conference on Learning Representations (ICLR) – Workshop Track.

T. Mikolov, I. Sutskever, K. Chen, G. Corrado, and J. Dean. 2013b. Distributed Representa- tions of Words and Phrases and their Compositionality. In Advances in neural information processing systems, pages 3111–3119.

J. Mitchell and M. Lapata. 2010. Composition in Distributional Models of Semantics. Cognitive Science.

M. Mitchell, X. Han, J. Dodge, A. Mensch, A. Goyal, A. Berg, K. Yamaguchi, T. Berg, K. Stratos, and H. Daumé, III. 2012. Midge: Generating Image Descriptions from Computer Vision Detections. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 747–756. Association for Computational Linguistics.

A. Mnih and G. Hinton. 2007. Three new graphical models for statistical language modelling. In Proceedings of the 24th International Conference on Machine Learning, pages 641–648.

A. Mnih and G. Hinton. 2009. A Scalable Hierarchical Distributed Language Model. In Advances in Neural Information Processing Systems, pages 1081–1088.

A. Mnih and K. Kavukcuoglu. 2013. Learning Word Embeddings Efficiently with Noise-Contrastive Estimation. In Advances in Neural Information Processing Systems, pages 2265–2273.

A. Mnih and Y. W. Teh. 2012. A fast and simple algorithm for training neural probabilistic language models. In John Langford and Joelle Pineau, editors, Proceedings of the 29th International Conference on Machine Learning (ICML-12), ICML ’12, pages 1751–1758, New York, NY, USA, July. Omnipress.

R. Montague. 1974. English as a Formal Language. In Richmond H. Thomason, editor, Formal Philosophy: Selected Papers of Richard Montague, pages 188–222. Yale University Press, New Haven, London.

F. Morin and Y. Bengio. 2005. Hierarchical Probabilistic Neural Network Language Model. In AISTATS, volume 5, pages 246–252. Citeseer.

M. Mourino-García, R. Pérez-Rodríguez, and L. Anido-Rifón. 2015. Bag-of-Concepts Document Representation for Textual News Classification. In Proceedings of the 16th International Conference on Intelligent Text Processing and Computational Linguistics.

K. Nigam. 1999. Using maximum entropy for text classification. In IJCAI-99 Workshop on Machine Learning for Information Filtering, pages 61–67.

K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu. 2002. Bleu: a method for automatic evalu- ation of machine translation. In Proceedings of the 40th annual meeting on association for computational linguistics, pages 311–318. Association for Computational Linguistics.

J. Pennington, R. Socher, and C. D. Manning. 2014. GloVe: Global Vectors for Word Repre- sentation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.

J.-F. Pessiot, Y.-M. Kim, M.-R. Amini, and P. Gallinari. 2010. Improving Document Clustering in a Learned Concept Space. Information Processing & Management, 46(2):180–192.

J. B. Pollack. 1990. Recursive Distributed Representations. Artificial Intelligence.

S. Poria, A. F. Gelbukh, E. Cambria, A. Hussain, and G.-B. Huang. 2014. EmoSenticSpace: A novel framework for affective common-sense reasoning. Knowledge-Based Systems.

L. R. Rabiner. 1989. A tutorial on hidden Markov models and selected applications in speech recognition. In Proceedings of the IEEE, pages 257–286.

L. Ratinov and D. Roth. 2009. Design challenges and misconceptions in named entity recognition. In Proceedings of the Thirteenth Conference on Computational Natural Language Learning (CoNLL). Association for Computational Linguistics.

A. Ratnaparkhi. 2002. Trainable approaches to surface natural language generation and their application to conversational dialog systems. Computer Speech & Language, 16(3):435–455.

E. Reiter, R. Dale, and Z. Feng. 2000. Building natural language generation systems, volume 33. MIT Press.

E. Reiter, S. Sripada, J. Hunter, J. Yu, and I. Davy. 2005. Choosing words in computer-generated weather forecasts. Artificial Intelligence, 167(1):137–169.

H. Schütze. 1993. Word Space. In S. J. Hanson, J. D. Cowan, and C. L. Giles, editors, Advances in Neural Information Processing Systems 5, pages 895–902. Morgan-Kaufmann.

H. Schütze. 1995. Distributional part-of-speech tagging. In Proceedings of the 7th conference on European Chapter of the Association for Computational Linguistics, pages 141–148. Morgan Kaufmann Publishers Inc.

J. Sedding and D. Kazakov. 2004. WordNet-based Text Document Clustering. In Proceedings of the 3rd Workshop on RObust Methods in Analysis of Natural Language Data, pages 104–113, Stroudsburg, PA, USA. Association for Computational Linguistics.

F. Sha and F. Pereira. 2003. Shallow parsing with conditional random fields. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology-Volume 1, pages 134–141. Association for Computational Linguistics.

L. Shang, Z. Lu, and H. Li. 2015. Neural Responding Machine for Short-Text Conversation. arXiv preprint arXiv:1503.02364.

K. Simonyan and A. Zisserman. 2014. Very deep convolutional networks for large-scale image recognition. CoRR.

R. Socher, B. Huval, C. Manning, and A. Ng. 2012. Semantic Compositionality Through Recursive Matrix-Vector Spaces. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 1201–1211. Association for Computational Linguistics.

R. Socher, A. Perelygin, J. Y. Wu, J. Chuang, C. D. Manning, A. Y. Ng, and C. Potts. 2013. Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1631–1642. Association for Computational Linguistics.

R. Socher, A. Karpathy, Q. V. Le, C. D. Manning, and A. Y. Ng. 2014. Grounded Compositional Semantics for Finding and Describing Images with Sentences. Transactions of the Association for Computational Linguistics, 2:207–218.

N. Srivastava and R. Salakhutdinov. 2014. Multimodal Learning with Deep Boltzmann Machines. Journal of Machine Learning Research.

S. Staab and A. Hotho. 2003. Ontology-based Text Document Clustering. In Intelligent Information Processing and Web Mining, Proceedings of the International IIS: IIPWM’03 Conference held in Zakopane, pages 451–452.

X. Sun, L.-P. Morency, D. Okanohara, and J. Tsujii. 2008. Modeling latent-dynamic in shallow parsing: a latent conditional model with improved inference. In Proceedings of the 22nd International Conference on Computational Linguistics-Volume 1, pages 841–848. Association for Computational Linguistics.

C. Tan, Y. Wang, and C. Lee. 2002. The Use of Bigrams to Enhance Text Categorization. Journal of Information Processing and Management.

J. Turian, L. Ratinov, and Y. Bengio. 2010. Word representations: A simple and general method for semi-supervised learning. In Proceedings of the 48th annual meeting of the association for computational linguistics, pages 384–394. Association for Computational Linguistics.

R. Turner, S. Sripada, and E. Reiter. 2010. Generating approximate geographic descriptions. In Empirical methods in natural language generation, pages 121–140. Springer.

P. Turney and P. Pantel. 2010. From Frequency to Meaning: Vector Space Models of Semantics. Journal of Artificial Intelligence Research.

S. Venugopalan, H. Xu, J. Donahue, M. Rohrbach, R. J. Mooney, and K. Saenko. 2014. Translating Videos to Natural Language Using Deep Recurrent Neural Networks. arXiv preprint arXiv:1412.4729.

O. Vinyals, A. Toshev, S. Bengio, and D. Erhan. 2015. Show and Tell: A Neural Image Caption Generator. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June.

A. Viterbi. 1967. Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. IEEE Transactions on Information Theory.

S. I. Wang and C. D. Manning. 2012. Baselines and Bigrams: Simple, Good Sentiment and Topic Classification. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Short Papers-Volume 2, pages 90–94. Association for Computational Linguistics.

T.-H. Wen, M. Gasic, N. Mrksic, P.-H. Su, D. Vandyke, and S. Young. 2015. Semantically Conditioned LSTM-based Natural Language Generation for Spoken Dialogue Systems. arXiv preprint arXiv:1508.01745.

Y. W. Wong and R. J. Mooney. 2007. Generation by inverting a semantic parser that uses statistical machine translation. In HLT-NAACL, pages 172–179.

K. Xu, J. Ba, R. Kiros, A. Courville, R. Salakhutdinov, R. Zemel, and Y. Bengio. 2015. Show, attend and tell: Neural image caption generation with visual attention. In Proceedings of The 32nd International Conference on Machine Learning, volume 37, July.

Y. Yang and X. Liu. 1999. A re-examination of text categorization methods. In Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval, pages 42–49. ACM.

B. Z. Yao, X. Yang, L. Lin, M. W. Lee, and S. C. Zhu. 2010. I2T: Image Parsing to Text Description. Proceedings of the IEEE, 98(8):1485–1508.

K. Yao, G. Zweig, and B. Peng. 2015. Attention with Intention for a Neural Network Conversation Model. arXiv preprint arXiv:1510.08565.

K. Zhao, H. Hassan, and M. Auli. 2015. Learning Translation Models from Monolingual Continuous Representations. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1527–1536, Denver, Colorado, May–June. Association for Computational Linguistics.

Rémi Lebret
Ph.D., Machine Learning / Natural Language Processing
Rue du Village 14, 1923 Le Trétien, Switzerland
+41 (0) 76 592 26 12 · [email protected] · http://lebret.ch
Date of birth: 26 December 1985. French nationality.

Professional Experience

May 2012 – June 2016: Research Assistant, Idiap Research Institute, Martigny, Valais, Switzerland. Working on deep learning approaches for Natural Language Processing (NLP) under the supervision of Ronan Collobert.

September 2015 – March 2016: Research Intern, Facebook Artificial Intelligence Research, Menlo Park, California, USA. Working on concept-to-text generation under the supervision of David Grangier and Michael Auli.

September 2011 – May 2012: CNRS Research Engineer, Heudiasyc laboratory of CNRS at the University of Technology of Compiègne; Paul Painlevé Laboratory, Probability and Statistics team, University Lille 1; MODAL Team, INRIA Lille - Nord Europe, Lille, France. Developing software in R and C++ for supervised and unsupervised classification under the supervision of Yves Grandvalet and Christophe Biernacki.

March 2011 – August 2011: CNRS Studies Engineer in Biostatistics, Institut Pasteur de Lille, metabolic diseases, Lille, France. Supporting researchers in the design of statistical genetic analyses. Creating new tools for high-speed analyses.

January 2009 – March 2011: Statistical Programmer/Analyst, PharmacoGenomic Innovative Solutions (PGXIS), High Wycombe, UK; Paul Painlevé Laboratory, Probability and Statistics team, University Lille 1. Designing software for the multivariate analysis of large-scale genetic datasets.

Education

July 2016: École Polytechnique Fédérale de Lausanne (EPFL), Switzerland. Ph.D. in Electrical Engineering.
September 2009: École Polytechnique Universitaire de Lille (Polytech’Lille), France. “Diplôme d’ingénieur” (Master’s degree in Engineering) in Software Engineering and Statistics.
Fall 2008: Oklahoma State University, USA. Exchange student for a semester in the Department of Statistics.

Computer Skills

Programming languages: C++/C (software design, Torch deep learning framework), R (creation of packages), Lua/Perl (scripts for data management), SQL (data management with MySQL), PHP/HTML (website design). Parallel computing: POSIX Threads, Open MPI. Other: LaTeX; multi-platform system administration under Linux, Mac OS X, Unix, Windows.

Languages spoken: French (native), English (fluent, TOEIC: 895), German (basic knowledge).

Publications

Journal

R. Lebret, S. Iovleff, F. Langrognet, C. Biernacki, G. Celeux, and G. Govaert. Rmixmod: The R Package of the Model-Based Unsupervised, Supervised, and Semi-Supervised Classification Mixmod Library. Journal of Statistical Software, 67(1):1–29, 2015.

Conference

Lebret R., Grangier D., and Auli M. Neural Text Generation from Structured Data with Application to the Biography Domain. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing (EMNLP), Austin, USA, November 2016.

Lebret R., Pinheiro P. O., and Collobert R. Phrase-based Image Captioning. In Proceedings of the 32nd International Conference on Machine Learning (ICML), volume 37, pages 2085–2094, Lille, France, July 2015.

Lebret R. and Collobert R. Rehabilitation of Count-based Models for Word Vector Representations. In Alexander Gelbukh, editor, Computational Linguistics and Intelligent Text Processing (CICLing): Proceedings of the 16th International Conference, volume 9041 of Lecture Notes in Computer Science, pages 417–429, Cairo, Egypt, April 2015. Springer.

Lebret R. and Collobert R. Word Embeddings through Hellinger PCA. In Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics (EACL), pages 482–490, Gothenburg, Sweden, April 2014. Association for Computational Linguistics.

Workshop

Lebret R. and Collobert R. “The Sum of Its Parts”: Joint Learning of Word and Phrase Representations with Autoencoders. In ICML Deep Learning Workshop, July 2015.

Lebret R., Pinheiro P. O., and Collobert R. Simple Image Description Generator via a Linear Phrase-Based Model. In ICLR Workshop, May 2015.

Lebret R. and Collobert R. N-gram-Based Low-Dimensional Representation for Document Classification. In ICLR Workshop, May 2015.

Lebret R., Legrand J., and Collobert R. Is Deep Learning Really Necessary for Word Embeddings? In NIPS Workshop on Deep Learning, December 2013.

Lebret R., Iovleff S., and Langrognet F. Rmixmod: A MIXture MODelling R Package. In Statlearn’12 – Workshop on Challenging Problems in Statistical Learning, Lille, France, April 2012.

Lebret R., Biernacki C., Iovleff S., Jacques J., Preda C., McCarthy A., and Delrieu O. Genetic epistasis analysis using ’taxonomy3’. In 2nd International BIO-SI Workshop, Rennes, France, October 2011.

Lebret R., Iovleff S., Biernacki C., Jacques J., Preda C., McCarthy A., and Delrieu O. Rapid multivariate analysis of 269 HapMap subjects and 1 million SNPs using ’taxonomy3’. In Cold Spring Harbor/Wellcome Trust Meeting on Pharmacogenomics, Hinxton, UK, September 2009.

Software and Packages

Lebret R. HPCA, 2015. C/C++ implementation of the Hellinger PCA for computing word embeddings.

Lebret R., Iovleff S., and Langrognet F. Rmixmod: MIXture MODelling Package, 2012. R package for model-based supervised and unsupervised classification of qualitative and quantitative data.

Lebret R. and Iovleff S. Taxonomy3, 2011. C/C++ implementation of a statistical method providing an analytical framework for high-dimensional datasets and complex problems combining several variable types: genetics, genomics, biomarkers and phenotypes.