Topic Modelling with Word Embeddings

Fabrizio Esposito
Dept. of Humanities, Univ. of Napoli Federico II
fabrizio.esposito3@unina.it

Anna Corazza, Francesco Cutugno
DIETI, Univ. of Napoli Federico II
{anna.corazza|francesco.cutugno}@unina.it

Abstract

English. This work aims at evaluating and comparing two different frameworks for the unsupervised topic modelling of the CompWHoB Corpus, namely our political-linguistic dataset. The first approach is the application of latent Dirichlet allocation (henceforth LDA), whose evaluation is defined as the baseline of comparison. The second framework employs the Word2Vec technique to learn the word vector representations that are later used to topic-model our data. Compared to the previously defined LDA baseline, results show that the use of Word2Vec word embeddings significantly improves topic modelling performance, but only when an accurate and task-oriented linguistic pre-processing step is carried out.

Italiano. The aim of this contribution is to evaluate and compare two different frameworks for unsupervised topic modelling on the CompWHoB Corpus, our textual resource. After implementing the latent Dirichlet allocation model, we defined the evaluation of this approach as the reference baseline. As a second framework, we used the Word2Vec model to learn the vector representations of the terms, which were subsequently employed as input for the topic modelling stage. Results show that the word embeddings generated by Word2Vec significantly improve the performance of the model, but only when supported by an accurate linguistic pre-processing stage.

1 Introduction

Over recent years, the development of political corpora (Guerini et al., 2013; Osenova and Simov, 2012) has represented one of the major trends in the fields of corpus and computational linguistics. Being carriers of specific content features, these textual resources have attracted the interest of researchers and practitioners in the study of topic detection. Unfortunately, not only has this task turned out to be hard and challenging even for human evaluators, but it must also be borne in mind that manual annotation often comes at a price. Hence, the aid provided by unsupervised machine learning techniques proves to be fundamental in addressing the topic detection issue.

Topic models are a family of algorithms that allow large unlabelled collections of documents to be analysed in order to discover and identify hidden topic patterns in the form of clusters of words. While LDA (Blei et al., 2003) has become the most influential topic model (Hall et al., 2008), different extensions have been proposed so far: Rosen-Zvi et al. (2004) developed an author-topic generative model that also includes authorship information; Chang et al. (2009a) presented a probabilistic topic model to infer descriptions of entities from corpora, also identifying the relationships between them; Yang et al. (2015) proposed a factor graph framework for incorporating prior knowledge into LDA.

In the present paper we aim at topic modelling the CompWHoB Corpus (Esposito et al., 2015), a political corpus collecting the transcripts of the White House Press Briefings. The main characteristic of our dataset is its dialogical structure: since a briefing consists of a question-answer sequence between the US press secretary and the news media, the topic under discussion may change from one answer to the following question, and vice versa.
Our purpose was to address this main feature of the CompWHoB Corpus by associating only one topic with each answer/question. In order to reach this goal, we propose an evaluative comparison of two different frameworks: in the first one, we employed the LDA approach, extracting from each answer/question document only the topic with the highest probability; in the second framework, we applied the word embeddings generated by the Word2Vec model (Mikolov and Dean, 2013) to our data in order to test how well dense high-quality vectors represent it, finally comparing this approach with the previously defined LDA baseline. The evaluation was performed using a set of gold-standard annotations developed by human experts in political science and linguistics. In Section 2 we present the dataset used in this work. In Section 3, the linguistic pre-processing is detailed. Section 4 shows the methodology employed to topic-model our data. In Section 5 we present the results of our work.

2 The dataset

2.1 The CompWHoB Corpus

The textual resource used in the present contribution is the CompWHoB (Computational White House press Briefings) Corpus, a political corpus collecting the transcripts of the White House Press Briefings extracted from the American Presidency Project website, annotated and formatted in XML according to the TEI Guidelines (Consortium et al., 2008). The CompWHoB Corpus spans from January 27, 1993 to December 18, 2014. Each briefing is characterised by a turn-taking between the podium and the journalists, signalled in the XML files by the use of a u tag for each utterance. At the time of writing, 5,239 briefings have been collected, comprising 25,251,572 tokens and a total of 512,651 utterances (from now on, utterances will be referred to as 'documents'). The average document length is 49.25 tokens, with lengths ranging from a minimum of 0 to a maximum of 4,724 tokens. The dataset used in the present contribution was built and divided into training and test sets by randomly selecting documents from the CompWHoB Corpus in order to vary as much as possible the topics dealt with by the US administration.

2.2 Gold-Standard Annotation

Two hundred documents of the test set were manually annotated by scholars with expertise in linguistics and political science, using a set of thirteen categories. Seven macro-categories were created taking into account the major US federal executive departments, so as not to excessively narrow the topic representation; they account for 28.5% of the labelled documents. Six more categories were designed to take into account the informal nature of the press briefings, which makes them an atypical political-media genre (Venuti and Spinzi, 2013); they account for the remaining 71.5% (Table 1). The labelled documents represent the gold standard used in the evaluation stage. This choice is motivated by the fact that, even if metrics such as perplexity or held-out likelihood prove to be useful in the evaluation of topic models, they often fail to qualitatively measure the coherence of the generated topics (Chang et al., 2009b). More formally, our gold standard can be defined as the set G = {g_1, g_2, ..., g_S}, where g_i is the i-th category and S = 13 is the total number of categories.

Crime and justice
Culture and Education
Economy and welfare
Foreign Affairs
Greetings
Health
Internal Politics
Legislation & Reforms
Military & Defense
President Updates
Presidential News
Press issues
Unknown topic

Table 1: Gold-Standard Topics
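To make the corpus structure described in Section 2.1 concrete, the following is a minimal sketch (in Python) of how utterance-level documents could be pulled out of the TEI-encoded briefing files and split into training and test sets. The directory layout, the 80/20 split ratio and the simplified namespace handling are illustrative assumptions, not details taken from the paper.

# Minimal sketch: extract one document per <u> (utterance) element from the
# TEI-encoded briefing files. Paths and split ratio are illustrative only.
import glob
import random
import xml.etree.ElementTree as ET

documents = []
for path in glob.glob("compwhob/*.xml"):              # hypothetical corpus location
    tree = ET.parse(path)
    for elem in tree.iter():
        if elem.tag.split("}")[-1] == "u":            # tolerate TEI namespaces
            text = "".join(elem.itertext()).strip()
            if text:
                documents.append(text)

random.shuffle(documents)
split = int(0.8 * len(documents))                     # illustrative split ratio
train_docs_raw, test_docs_raw = documents[:split], documents[split:]
# gold_labels (used in the later sketches) would map the two hundred annotated
# test-document indices to the thirteen Table 1 categories.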

3 Linguistic Pre-Processing

In order to improve the quality of our textual data, special attention was paid to the linguistic pre-processing step. In particular, since LDA represents documents as mixtures of topics in the form of word probabilities, we wanted these topics to make sense also to human judges. Since press briefings are actual conversations in which the talk moves from one social register to another (e.g. switching from the reading of an official statement to an informal interaction between the podium and the journalists) (Partington, 2003), the first step was to design an ad-hoc stoplist able to take into account the main features of this linguistic genre. Indeed, not only were words with a low frequency discarded, but high-frequency ones were also removed in order not to overpower the rest of the documents. More importantly, we included in our stoplist all the personal and indefinite pronouns as well as the most commonly used honorifics (e.g. Mr., Ms., etc.), given their predominant role in addressing the speakers in both informal and formal settings (e.g. "Mr. Secretary, you said oil production is up, [...]"). Moreover, the list of the first names of the press secretaries in office during the years covered by the CompWHoB Corpus was extracted from Wikipedia and added to the stoplist, since they are most of the time used only as nouns of address (Brown et al., 1960).

As regards the NLP pipeline proper, the Natural Language ToolKit (NLTK) platform (Bird et al., 2009; http://www.nltk.org) was employed: word tokenization, POS-tagging using the Penn Treebank tag set (Marcus et al., 1993) and lemmatization were carried out to refine our data. When pre-processing is not applied to the dataset, only punctuation is removed from the documents.
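As an illustration of the pipeline just described, here is a minimal NLTK-based sketch, assuming the punkt, averaged_perceptron_tagger and wordnet resources are available and that train_docs_raw/test_docs_raw are lists of raw utterance strings (e.g. from the earlier sketch). The stoplist shown is a small illustrative subset of the ad-hoc list, and lowercasing is an assumption.

# Minimal sketch of the Section 3 pipeline: tokenization, Penn Treebank
# POS-tagging, stoplist filtering and lemmatization (nouns only, as in 4.1.1).
import nltk
from nltk.stem import WordNetLemmatizer

# Illustrative subset of the ad-hoc stoplist: pronouns, honorifics and press
# secretaries' first names (the full list is task-specific, see above).
STOPLIST = {"i", "you", "he", "she", "we", "they", "anybody", "somebody",
            "mr.", "ms.", "mrs.", "robert", "jay", "josh"}

lemmatizer = WordNetLemmatizer()

def preprocess(document, keep_only_nouns=True):
    """Turn a raw utterance into a list of lemmatized, filtered tokens."""
    tokens = nltk.word_tokenize(document.lower())     # lowercasing is an assumption
    tagged = nltk.pos_tag(tokens)                     # Penn Treebank tag set
    kept = []
    for word, tag in tagged:
        if word in STOPLIST or not word.isalpha():
            continue
        if keep_only_nouns and not tag.startswith("NN"):
            continue
        kept.append(lemmatizer.lemmatize(word, pos="n"))
    return kept

train_docs = [preprocess(d) for d in train_docs_raw]
test_docs = [preprocess(d) for d in test_docs_raw]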
4 Methodology

This section deals with the two techniques employed in this work to topic-model our data. We first discuss the LDA approach and then focus on the use of the word embeddings learnt with the Word2Vec model. Both techniques were implemented in Python (version 3.4) using the Gensim library (Rehurek and Sojka, 2010; https://radimrehurek.com/gensim/).

4.1 Latent Dirichlet Allocation

In our first experiment we ran LDA, a generative probabilistic model that allows latent topics to be inferred in a collection of documents. In this unsupervised technique the topic structure represents the underlying hidden variable (Blei, 2012) to be discovered given the observed variables, i.e. the documents' items from a fixed vocabulary, be they textual or not. More formally, LDA describes each document d as a multinomial distribution θ_d over topics, while each topic t is defined as a multinomial distribution φ_t over the words of a fixed vocabulary, where i_d,n is the n-th item in document d.

4.1.1 Topic modelling with LDA

Data were linguistically pre-processed prior to training the LDA model, and only words POS-tagged as nouns ('NN') were kept in both the training and test sets' documents. This choice was motivated by the necessity of generating topics that could be semantically meaningful. After carrying out the pre-processing step, we trained the LDA model on our training corpus employing the online variational Bayes (VB) algorithm (Hoffman et al., 2010) provided by the Gensim library. Based on online stochastic optimization with a natural gradient step, online LDA converges to a local optimum of the VB objective function and can be applied to large streaming document collections, making better predictions and finding better topic models than those found with batch VB. As parameters of our model, we set the number of topics k to thirteen, i.e. the number of classes in our gold standard, updating the model every 150 documents and performing two passes over the corpus in order to obtain accurate topics. Once the model was trained, we inferred topic distributions on the unseen documents of the test set. For each document d_i, the topic t_max(i) with the highest probability in the multinomial distribution was selected and associated with it. The cluster ω_k then corresponds to the set of documents associated with topic t_k. Due to the presence of a gold standard, the external criterion of purity was chosen as the evaluation measure of this approach. Purity is formally defined as:

purity(Ω, G) = (1/N) Σ_k max_j |ω_k ∩ g_j|

where Ω = {ω_1, ω_2, ..., ω_K} is the set of clusters, G = {g_1, g_2, ..., g_S} is the set of gold-standard classes and N is the number of documents. The purity computed for the LDA approach is:

purity ≈ 0.46

This measure constituted the baseline of comparison with the Word2Vec word embeddings approach.
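A minimal sketch of this set-up with Gensim's online LdaModel follows. The mapping of "updating the model every 150 documents" onto the chunksize parameter is an assumption, and train_docs, test_docs and gold_labels are carried over from the earlier sketches (gold_labels covering only the two hundred annotated test documents).

# Minimal sketch of the Section 4.1.1 set-up: online variational Bayes LDA with
# k = 13 topics, hard assignment of each annotated test document to its most
# probable topic, and purity against the gold standard.
from collections import Counter
from gensim import corpora, models

dictionary = corpora.Dictionary(train_docs)
train_bow = [dictionary.doc2bow(doc) for doc in train_docs]

lda = models.LdaModel(train_bow, id2word=dictionary, num_topics=13,
                      update_every=1, chunksize=150, passes=2)

# Cluster w_k = set of annotated test documents whose most probable topic is t_k.
clusters = {}
for i, doc in enumerate(test_docs):
    if i not in gold_labels:                  # only the 200 annotated documents
        continue
    topics = lda[dictionary.doc2bow(doc)]     # inferred topic distribution
    if not topics:
        continue
    best_topic = max(topics, key=lambda t: t[1])[0]
    clusters.setdefault(best_topic, []).append(i)

def purity(clusters, gold_labels):
    """purity(Omega, G) = (1/N) * sum_k max_j |w_k intersect g_j|."""
    n_docs = sum(len(docs) for docs in clusters.values())
    matched = sum(Counter(gold_labels[i] for i in docs).most_common(1)[0][1]
                  for docs in clusters.values())
    return matched / n_docs

print(purity(clusters, gold_labels))          # ~0.46 reported in the paper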
4.2 Word2Vec

Word2Vec (Mikolov et al., 2013a) is probably the most popular software providing learning models for the generation of dense embeddings. Based on Zellig Harris' Distributional Hypothesis (Harris, 1954), which states that words occurring in similar contexts tend to have similar meanings, the Word2Vec model learns vector representations of words referred to as word embeddings. Differently from techniques such as LSA (Dumais, 2004), LDA and other topic models that use documents as context, Word2Vec learns the distributed representation of each target word by defining the context as the terms surrounding it. The main advantage of this model is that each dimension of the embedding represents a latent feature of the word (Turian et al., 2010), encoding in each word vector essential syntactic and semantic properties (Mikolov et al., 2013c). In this way, simple vector similarity operations can be computed using cosine similarity. Moreover, it must not be forgotten that one of Word2Vec's secrets lies in its efficient implementation, which allows very robust and fast training.

4.2.1 Topic modelling with Word2Vec

Training data were linguistically pre-processed beforehand according to the ad-hoc pipeline implemented in this work. The model was initialised by setting a minimum count for the input words: terms whose frequency was lower than 20 were discarded. In addition, we set the downsampling threshold to the default value of 10^-3, so that high-frequency words are randomly downsampled in order to improve the quality of the word embeddings (Mikolov and Dean, 2013). Moreover, as highlighted by Goldberg and Levy (2014), both sub-sampling and rare-word pruning seem to increase the effective size of the window, making the similarities more topical. Finally, following the recommendations of Mikolov et al. (2013b) and Baroni et al. (2014), in this work we trained our model using the CBOW algorithm, since it is more suitable for larger datasets. The dimensionality of our feature vectors was fixed at 200. Once the vocabulary was constructed and the input data trained, we applied the learnt word vector representations to the unseen test set documents. We then calculated the centroid c of each document d, where e_d,i is the i-th embedding in d, so as to obtain a meaningful topic representation for each document (Mikolov and Dean, 2013). Finally, we clustered our data using the k-means algorithm. In order to compare this approach with the previously defined baseline, the external criterion of purity was computed also in this experiment, to evaluate how well the k-means clustering matched the gold-standard classes:

purity ≈ 0.54

This technique proved to outperform the LDA topic model approach presented in this work. Surprisingly, notwithstanding the fact that Word2Vec relies on a broad context to produce high-quality embeddings, this framework performed better on a linguistically pre-processed dataset in which only nouns are kept. Table 2 shows the results obtained in the two experiments.

Framework                        Purity
LDA without pre-processing       0.45
LDA with pre-processing          0.46
Word2Vec without pre-processing  0.44
Word2Vec with pre-processing     0.54

Table 2: Results of the two frameworks. When pre-processing is not applied, only punctuation is removed.
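The sketch below mirrors the pipeline of Section 4.2.1 under a few assumptions: the keyword names follow the Gensim releases contemporary with the paper (size rather than the later vector_size, direct model[word] lookup rather than model.wv), and scikit-learn's KMeans stands in for the unspecified k-means implementation. train_docs, test_docs, gold_labels and the purity function are carried over from the earlier sketches.

# Minimal sketch of Section 4.2.1: CBOW Word2Vec embeddings, per-document
# centroids, k-means clustering and purity against the gold standard.
import numpy as np
from gensim.models import Word2Vec
from sklearn.cluster import KMeans

# CBOW (sg=0), 200-dimensional vectors, min_count=20, sub-sampling at 1e-3.
w2v = Word2Vec(train_docs, sg=0, size=200, min_count=20, sample=1e-3)

def centroid(doc, model):
    """Average the embeddings of the in-vocabulary words of a document."""
    vecs = [model[w] for w in doc if w in model]
    return np.mean(vecs, axis=0) if vecs else np.zeros(model.vector_size)

annotated = sorted(gold_labels)                       # the 200 labelled test docs
doc_vectors = np.array([centroid(test_docs[i], w2v) for i in annotated])

kmeans = KMeans(n_clusters=13).fit(doc_vectors)       # one cluster per category
clusters = {}
for idx, label in zip(annotated, kmeans.labels_):
    clusters.setdefault(label, []).append(idx)

print(purity(clusters, gold_labels))                  # ~0.54 reported in the paper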
5 Conclusions

In this contribution we have presented a comparative evaluation of two approaches to topic modelling. Two experiments were carried out: in the first one, we applied a classical LDA model to our dataset; in the second one, we trained our model using Word2Vec so as to generate the word embeddings used to topic-model our test set. After clustering the output of the two approaches, we evaluated them using the external criterion of purity. Results show that the use of word embeddings outperforms the LDA approach, but only if a task-oriented linguistic pre-processing stage is carried out. As no comprehensive explanation can be provided at the moment, we can only suggest that the main reason for these results may lie in the fluctuating length of the documents in our dataset. In fact, we hypothesise that the use of word embeddings may prove to be the boosting factor of the Word2Vec topic model, since they encode information about the close context of the target term. As part of future work, we aim to further investigate this aspect and design a topic model framework that takes into account the main structural and linguistic features of the CompWHoB Corpus.

Acknowledgments

The authors would like to thank Antonio Origlia for the useful and thoughtful discussions and insights.

References

Marco Baroni, Georgiana Dinu, and Germán Kruszewski. 2014. Don't count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors. In Proceedings of ACL, pages 238–247.

Steven Bird, Ewan Klein, and Edward Loper. 2009. Natural Language Processing with Python. O'Reilly Media.

David M. Blei, Andrew Y. Ng, and Michael I. Jordan. 2003. Latent Dirichlet Allocation. Journal of Machine Learning Research, 3(Jan):993–1022.

David M. Blei. 2012. Probabilistic Topic Models. Communications of the ACM, 55(4):77–84.

Roger Brown, Albert Gilman, et al. 1960. The Pronouns of Power and Solidarity. Style in Language, pages 253–276.

Jonathan Chang, Jordan Boyd-Graber, and David M. Blei. 2009a. Connections between the Lines: Augmenting Social Networks with Text. In Knowledge Discovery and Data Mining.

Jonathan Chang, Sean Gerrish, Chong Wang, Jordan L. Boyd-Graber, and David M. Blei. 2009b. Reading Tea Leaves: How Humans Interpret Topic Models. In Advances in Neural Information Processing Systems, pages 288–296.

TEI Consortium, Lou Burnard, Syd Bauman, et al. 2008. TEI P5: Guidelines for Electronic Text Encoding and Interchange. TEI Consortium.

Susan T. Dumais. 2004. Latent Semantic Analysis. Annual Review of Information Science and Technology, 38(1):188–230.

Fabrizio Esposito, Pierpaolo Basile, Francesco Cutugno, and Marco Venuti. 2015. The CompWHoB Corpus: Computational Construction, Annotation and Linguistic Analysis of the White House Press Briefings Corpus. In CLiC-it, page 120.

Yoav Goldberg and Omer Levy. 2014. word2vec Explained: Deriving Mikolov et al.'s Negative-Sampling Word-Embedding Method. arXiv preprint arXiv:1402.3722.

Marco Guerini, Danilo Giampiccolo, Giovanni Moretti, Rachele Sprugnoli, and Carlo Strapparava. 2013. The New Release of CORPS: A Corpus of Political Speeches Annotated with Audience Reactions. In Multimodal Communication in Political Speech. Shaping Minds and Social Action, pages 86–98. Springer.

David Hall, Daniel Jurafsky, and Christopher D. Manning. 2008. Studying the History of Ideas Using Topic Models. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 363–371. Association for Computational Linguistics.

Zellig S. Harris. 1954. Distributional Structure. Word, 10(2-3).

Matthew Hoffman, Francis R. Bach, and David M. Blei. 2010. Online Learning for Latent Dirichlet Allocation. In Advances in Neural Information Processing Systems, pages 856–864.

Mitchell P. Marcus, Beatrice Santorini, and Mary Ann Marcinkiewicz. 1993. Building a Large Annotated Corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313–330.

T. Mikolov and J. Dean. 2013. Distributed Representations of Words and Phrases and their Compositionality. In Advances in Neural Information Processing Systems.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013a. Efficient Estimation of Word Representations in Vector Space. arXiv preprint arXiv:1301.3781.

Tomas Mikolov, Quoc V. Le, and Ilya Sutskever. 2013b. Exploiting Similarities among Languages for Machine Translation. arXiv preprint arXiv:1309.4168.

Tomas Mikolov, Wen-tau Yih, and Geoffrey Zweig. 2013c. Linguistic Regularities in Continuous Space Word Representations. In HLT-NAACL, volume 13, pages 746–751.

Petya Osenova and Kiril Simov. 2012. The Political Speech Corpus of Bulgarian. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), Istanbul, Turkey, May. European Language Resources Association (ELRA).

Alan Partington. 2003. The Linguistics of Political Argument: The Spin-Doctor and the Wolf-Pack at the White House. Routledge.

Radim Rehurek and Petr Sojka. 2010. Software Framework for Topic Modelling with Large Corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks.

Michal Rosen-Zvi, Thomas Griffiths, Mark Steyvers, and Padhraic Smyth. 2004. The Author-Topic Model for Authors and Documents. In Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence, pages 487–494. AUAI Press.

Joseph Turian, Lev Ratinov, and Yoshua Bengio. 2010. Word Representations: A Simple and General Method for Semi-Supervised Learning. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 384–394. Association for Computational Linguistics.

M. Venuti and C. Spinzi. 2013. Tracking the Change in Institutional Genre: A Diachronic Corpus-Based Study of White House Press Briefings. In The Three Waves of Globalization: Winds of Change in Professional, Institutional and Academic Genres.

Yi Yang, Doug Downey, and Jordan Boyd-Graber. 2015. Efficient Methods for Incorporating Knowledge into Topic Models. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing.