Twitter Topic Modeling by Tweet Aggregation

Twitter Topic Modeling by Tweet Aggregation Asbjørn Ottesen Steinskog Jonas Foyn Therkelsen Björn Gambäck Department of Computer Science Norwegian University of Science and Technology NO–7491 Trondheim, Norway [email protected] [email protected] [email protected] Abstract yield interesting results about people’s inter- ests. One could for example compare the top- Conventional topic modeling ics different politicians tend to talk about to schemes, such as Latent Dirich- obtain a greater understanding of their simi- let Allocation, are known to perform larities and differences. Twitter has an abun- inadequately when applied to tweets, dance of messages, and the enormous amount due to the sparsity of short documents. of tweets posted every second makes Twitter To alleviate these disadvantages, we suitable for such tasks. However, detecting apply several pooling techniques, topics in tweets can be a challenging task due aggregating similar tweets into indi- to their informal type of language and since vidual documents, and specifically tweets usually are more incoherent than tra- study the aggregation of tweets shar- ditional documents. The community has also ing authors or hashtags. The results spawned user-generated metatags, like hash- show that aggregating similar tweets tags and mentions, that have analytical value into individual documents signifi- for opinion mining. cantly increases topic coherence. The paper describes a system aimed at dis- covering trending topics and events in a cor- 1 Introduction pus of tweets, as well as exploring the top- Due to the tremendous amount of data broad- ics of different Twitter users and how they casted on microblog sites like Twitter, extract- relate to each other. Utilizing Twitter meta- ing information from microblogs has turned data mitigates the disadvantages tweets typi- out to be useful for establishing the public cally have when using standard topic model- opinion on different issues. O’Connor et al. ing methods; user information as well as hash- (2010) found a correlation between word fre- tag co-occurrences can give a lot of insight quencies in Twitter and public opinion sur- into what topics are currently trending. veys in politics. Analyzing tweets (Twitter The rest of the text is outlined as follows: messages) over a timespan can give great in- Section 2 describes the topic modeling task sights into what happened during that time, and some previous work in the field, while as people tend to tweet about what concerns Section 3 outlines our topic modeling strate- them and their surroundings. Many influential gies, and Section 4 details a set of experi- people post messages on Twitter, and inves- ments using these. Section 5 then discusses tigating the relation between the underlying and sums up the results, before pointing to topics of different authors’ messages could some directions for future research. 77 Proceedings of the 21st Nordic Conference of Computational Linguistics, pages 77–86, Gothenburg, Sweden, 23-24 May 2017. c 2017 Linkoping¨ University Electronic Press 2 Topic modeling ics is first sampled from a Dirichlet distribution, and a topic is chosen based on this distri- Topic models are statistical methods used to bution. Each document is modeled as a distri- represent latent topics in document collec- bution over topics, with topics represented as tions. These probabilistic models usually distributions over words (Blei, 2012). present topics as multinomial distributions over words, assuming that each document in Koltsova and Koltcov (2013) used LDA a collection can be described as a mixture of mainly on topics regarding Russian presiden- topics. The language used in tweets is of- tial elections, but also on recreational and ten informal, containing grammatically cre- other topics, with a dataset of all posts by ative text, slang, emoticons and abbreviations, 2,000 LiveJournal bloggers. Despite the broad making it more difficult to extract topics from categories, LDA showed its robustness by tweets than from more formal text. correctly identifying 30–40% of the topics. Sotiropoulos et al. (2014) obtained similar The 2015 International Workshop on Se- results on targeted sentiment towards topics mantic Evaluation (SemEval) presented a task related to two US telecommunication firms, on Topic-Based Message Polarity Classifica- while Waila et al. (2013) identified socio- tion, similar to the topic of this paper. The political events and entities during the Arab most successful systems used text preprocess- Spring, to find global sentiment towards these. ing and standard methods: Boag et al. (2015) took a supervised learning approach using lin- The Author-topic model (Rosen-Zvi et al., ear SVM (Support Vector Machines), heav- 2004) is an LDA extension taking information ily focused on feature engineering, to reach about an author into account: for each word the best performance of all. Plotnikova et al. in a document d, an author from the docu- (2015) came in second utilizing another super- ment’s set of authors is chosen at random. A vised method, Maximum Entropy, with lexi- topic t is then chosen from a distribution over con and emoticon scores and trigrams, while topics specific to the author, and the word is essentially ignoring topics, which is interest- generated from that topic. The model gives ing given the task. Zhang et al. (2015) differed information about the diversity of the topics from the other techniques by focusing on word covered by an author, and makes it possible to embedding features, as well as the traditional calculate the distance between the topics cov- textual features, but argued that to only extend ered by different authors, to see how similar the model with the word embeddings did not they are in their themes and topics. necessarily significantly improve results. Topic modeling algorithms have gained in- Although the informal language and sparse creased attention in modeling tweets. How- text make it difficult to retrieve the underly- ever, tweets pose some difficulties because ing topics in tweets, Weng et al. (2010) pre- of their sparseness, as the short documents viously found that Latent Dirichlet Alloca- might not contain sufficient data to establish tion (LDA) produced decent results on tweets. satisfactory term co-occurrences. Therefore, LDA (Blei et al., 2003) is an unsupervised pooling techniques (which involve aggregat- probabilistic model which generates mixtures ing related tweets into individual documents) of latent topics from a collection of docu- might improve the results produced by stan- ments, where each mixture of topics produces dard topic model methods. Pooling tech- words from the collection’s vocabulary with niques include, among others, aggregation of certain probabilities. A distribution over top- tweets that share hashtags and aggregation of 78 tweets that share author. Hong and Davison However, statistical methods cannot model (2010) compare the LDA topic model with a human’s perception of the coherence in a an Author-topic model for tweets, finding that topic model perfectly, so human judgement the topics learned from these methods differ is commonly used to evaluate topic models. from each other. By aggregating tweets writ- Chang et al. (2009) propose two tasks where ten by an author into one individual document, humans can evaluate topic models: Word in- they mitigate the disadvantages caused by the trusion lets humans measure the coherence of sparse nature of tweets. Moreover, Quan et the topics in a model by evaluating the la- al. (2015) present a solution for topic model- tent space in the topics. The human subject ing for sparse documents, finding that auto- is presented with six words, and the task is matic text aggregation during topic modeling to find the intruder, which is the one word is able to produce more interpretable topics that does not belong with the others. The idea from short texts than standard topic models. is that the subject should easily identify that word when the set of words minus the intruder 3 Extracting topic models makes sense together. In topic intrusion, subjects are shown a document’s title along with Topic models can be extracted in several the first few words of the document and four ways, in addition to the LDA-based meth- topics. Three of those are the highest proba- ods and SemEval methods outlined above. bility topics assigned to the document, while Specifically, here three sources of information the intruder topic is chosen randomly. are singled out for this purpose: topic model scores, topic clustering, and hashtags. In addition to these methods, we introduce a way to evaluate author-topic models, specif- 3.1 Topic model scoring ically with the Twitter domain in mind. A topic mixture for each author is obtained from The unsupervised nature of topic discovery the model. The human subjects should know makes the assessment of topic models chal- the authors in advance, and have a fair under- lenging. Quantitative metrics do not necessar- standing of which topics the authors are gen- ily provide accurate reflections of a human’s erally interested in. The subjects are then pre- perception of a topic model, and hence a vari- sented a list of authors, along with topic distri- ety of evaluation metrics have been proposed. butions for each author (represented by the 10 The UMass coherence metric (Mimno et most probable topics, with each topic given by al., 2011) measures topic coherence: C = M m 1 D(wm,wl )+1 its 10 most probable words). The task of the ∑ = ∑ − log with (w1,...,wM) m 2 l=1 D(wl ) subject is to deduce which topic distribution being the M most probable words in the topic, belongs to which author. The idea is that co- D(w) the number of documents that contain herent topics would make it easy to recognize word w, and D(w ,w ) the number of doc- m l authors from a topic mixture, as author inter- uments that contain both words w and w .

Load more