Topic Modeling by Tweet Aggregation

Asbjørn Ottesen Steinskog, Jonas Foyn Therkelsen, Björn Gambäck
Department of Computer Science
Norwegian University of Science and Technology
NO–7491 Trondheim, Norway
[email protected]  [email protected]  [email protected]

Abstract

Conventional topic modeling schemes, such as Latent Dirichlet Allocation, are known to perform inadequately when applied to tweets, due to the sparsity of short documents. To alleviate these disadvantages, we apply several pooling techniques, aggregating similar tweets into individual documents, and specifically study the aggregation of tweets sharing authors or hashtags. The results show that aggregating similar tweets into individual documents significantly increases topic coherence.

1 Introduction

Due to the tremendous amount of data broadcasted on microblog sites like Twitter, extracting information from microblogs has turned out to be useful for establishing the public opinion on different issues. O'Connor et al. (2010) found a correlation between word frequencies in Twitter and public opinion surveys in politics. Analyzing tweets (Twitter messages) over a timespan can give great insights into what happened during that time, as people tend to tweet about what concerns them and their surroundings. Many influential people post messages on Twitter, and investigating the relation between the underlying topics of different authors' messages could yield interesting results about people's interests. One could, for example, compare the topics different politicians tend to talk about to obtain a greater understanding of their similarities and differences. Twitter has an abundance of messages, and the enormous number of tweets posted every second makes Twitter suitable for such tasks. However, detecting topics in tweets can be a challenging task due to their informal type of language and since tweets usually are more incoherent than traditional documents. The community has also spawned user-generated metatags, like hashtags and mentions, that have analytical value for opinion mining.

The paper describes a system aimed at discovering trending topics and events in a corpus of tweets, as well as exploring the topics of different Twitter users and how they relate to each other. Utilizing Twitter metadata mitigates the disadvantages tweets typically have when using standard topic modeling methods; user information as well as hashtag co-occurrences can give a lot of insight into what topics are currently trending.

The rest of the text is outlined as follows: Section 2 describes the topic modeling task and some previous work in the field, while Section 3 outlines our topic modeling strategies, and Section 4 details a set of experiments using these. Section 5 then discusses and sums up the results, before pointing to some directions for future research.

2 Topic modeling

Topic models are statistical methods used to represent latent topics in document collections. These probabilistic models usually present topics as multinomial distributions over words, assuming that each document in a collection can be described as a mixture of topics. The language used in tweets is often informal, containing grammatically creative text, slang, emoticons and abbreviations, making it more difficult to extract topics from tweets than from more formal text.

The 2015 International Workshop on Semantic Evaluation (SemEval) presented a task on Topic-Based Message Polarity Classification, similar to the topic of this paper. The most successful systems used text preprocessing and standard methods: Boag et al. (2015) took a supervised learning approach using linear SVM (Support Vector Machines), heavily focused on feature engineering, to reach the best performance of all. Plotnikova et al. (2015) came in second utilizing another supervised method, Maximum Entropy, with lexicon and emoticon scores and trigrams, while essentially ignoring topics, which is interesting given the task. Zhang et al. (2015) differed from the other techniques by focusing on word embedding features, as well as the traditional textual features, but argued that only extending the model with word embeddings did not necessarily improve results significantly.

Although the informal language and sparse text make it difficult to retrieve the underlying topics in tweets, Weng et al. (2010) previously found that Latent Dirichlet Allocation (LDA) produced decent results on tweets. LDA (Blei et al., 2003) is an unsupervised probabilistic model which generates mixtures of latent topics from a collection of documents, where each mixture of topics produces words from the collection's vocabulary with certain probabilities. A distribution over topics is first sampled from a Dirichlet distribution, and a topic is chosen based on this distribution. Each document is modeled as a distribution over topics, with topics represented as distributions over words (Blei, 2012).

Koltsova and Koltcov (2013) used LDA mainly on topics regarding Russian presidential elections, but also on recreational and other topics, with a dataset of all posts by 2,000 LiveJournal bloggers. Despite the broad categories, LDA showed its robustness by correctly identifying 30–40% of the topics. Sotiropoulos et al. (2014) obtained similar results on targeted sentiment towards topics related to two US telecommunication firms, while Waila et al. (2013) identified socio-political events and entities during the Arab Spring, to find global sentiment towards these.

The Author-topic model (Rosen-Zvi et al., 2004) is an LDA extension taking information about an author into account: for each word in a document d, an author from the document's set of authors is chosen at random. A topic t is then chosen from a distribution over topics specific to the author, and the word is generated from that topic. The model gives information about the diversity of the topics covered by an author, and makes it possible to calculate the distance between the topics covered by different authors, to see how similar they are in their themes and topics.

Topic modeling algorithms have gained increased attention in modeling tweets. However, tweets pose some difficulties because of their sparseness, as the short documents might not contain sufficient data to establish satisfactory term co-occurrences. Therefore, pooling techniques (which involve aggregating related tweets into individual documents) might improve the results produced by standard topic model methods. Pooling techniques include, among others, aggregation of tweets that share hashtags and aggregation of tweets that share author.
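To make the per-tweet baseline concrete, a minimal sketch of such an LDA run with the gensim library might look as follows (illustrative only; the toy tweets and whitespace tokenization are placeholder assumptions, not the preprocessing actually used in the experiments). The pooling techniques discussed below change only how the documents are constructed before training:

```python
# Minimal per-tweet LDA sketch with gensim; toy data, illustrative only.
from gensim import corpora
from gensim.models import LdaModel

tweets = [
    "broncos win the super bowl tonight",
    "kca vote for your favorite artist",
    "new rocket launch scheduled for tonight",
    "halftime show with coldplay and beyonce",
]

# Baseline: each (very short) tweet is its own document -- the sparse
# setting that makes standard LDA struggle.
docs = [tweet.lower().split() for tweet in tweets]
dictionary = corpora.Dictionary(docs)
corpus = [dictionary.doc2bow(doc) for doc in docs]

lda = LdaModel(corpus, num_topics=2, id2word=dictionary, passes=10)
for topic_id, words in lda.print_topics(num_words=5):
    print(topic_id, words)
```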

Hong and Davison (2010) compare the LDA topic model with an Author-topic model for tweets, finding that the topics learned from these methods differ from each other. By aggregating tweets written by an author into one individual document, they mitigate the disadvantages caused by the sparse nature of tweets. Moreover, Quan et al. (2015) present a solution for topic modeling for sparse documents, finding that automatic text aggregation during topic modeling is able to produce more interpretable topics from short texts than standard topic models.

3 Extracting topic models

Topic models can be extracted in several ways, in addition to the LDA-based methods and SemEval methods outlined above. Specifically, here three sources of information are singled out for this purpose: topic model scores, topic clustering, and hashtags.

3.1 Topic model scoring

The unsupervised nature of topic discovery makes the assessment of topic models challenging. Quantitative metrics do not necessarily provide accurate reflections of a human's perception of a topic model, and hence a variety of evaluation metrics have been proposed.

The UMass coherence metric (Mimno et al., 2011) measures topic coherence as

$C = \sum_{m=2}^{M} \sum_{l=1}^{m-1} \log \frac{D(w_m, w_l) + 1}{D(w_l)}$

with $(w_1, \ldots, w_M)$ being the $M$ most probable words in the topic, $D(w)$ the number of documents that contain word $w$, and $D(w_m, w_l)$ the number of documents that contain both words $w_m$ and $w_l$. The metric utilizes word co-occurrence statistics gathered from the corpus, which ideally already should be accounted for in the topic model. Mimno et al. (2011) achieved reasonable results when comparing the scores obtained by this measure with human scoring on a corpus of 300,000 health journal abstracts.
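Read literally, the metric can be computed from document frequencies in a few lines. The sketch below is an illustrative direct implementation of the formula (gensim's CoherenceModel with coherence='u_mass' offers a library version):

```python
import math

def umass_coherence(top_words, documents):
    """UMass coherence of one topic.

    top_words: the M most probable topic words, most probable first.
    documents: iterable of token lists from the training corpus.
    """
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        # Number of documents containing all of the given words.
        return sum(1 for ds in doc_sets if all(w in ds for w in words))

    score = 0.0
    for m in range(1, len(top_words)):        # w_m for m = 2..M
        for l in range(m):                    # w_l for l = 1..m-1
            score += math.log(
                (doc_freq(top_words[m], top_words[l]) + 1)
                / doc_freq(top_words[l]))     # top words occur in the corpus, so > 0
    return score
```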

However, statistical methods cannot model a human's perception of the coherence in a topic model perfectly, so human judgement is commonly used to evaluate topic models. Chang et al. (2009) propose two tasks where humans can evaluate topic models: Word intrusion lets humans measure the coherence of the topics in a model by evaluating the latent space of the topics. The human subject is presented with six words, and the task is to find the intruder, which is the one word that does not belong with the others. The idea is that the subject should easily identify that word when the set of words minus the intruder makes sense together. In topic intrusion, subjects are shown a document's title along with the first few words of the document and four topics. Three of those are the highest probability topics assigned to the document, while the intruder topic is chosen randomly.

In addition to these methods, we introduce a way to evaluate author-topic models, specifically with the Twitter domain in mind. A topic mixture for each author is obtained from the model. The human subjects should know the authors in advance, and have a fair understanding of which topics the authors are generally interested in. The subjects are then presented with a list of authors, along with topic distributions for each author (represented by the 10 most probable topics, with each topic given by its 10 most probable words). The task of the subject is to deduce which topic distribution belongs to which author. The idea is that coherent topics would make it easy to recognize authors from a topic mixture, as author interests would be reflected by topic probabilities.

3.2 Clustering tweets

An important task related to topic modeling is determining the number of clusters, k, to use for the model. There is usually not a single correct optimal value: too few clusters will produce topics that are overly broad, and too many clusters will result in overlapping or too similar topics. The Elbow method can be used to estimate the optimal number of clusters, by running k-means clustering on the dataset for different values of $k$, and then calculating the sum of squared errors, $\mathrm{SSE} = \sum_{i=1}^{n} [y_i - f(x_i)]^2$, for each value.
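As an illustration of the Elbow procedure (a sketch, with TF-IDF vectors as an assumed tweet representation; the paper does not specify the exact representation used for clustering), the SSE for each $k$ can be read off scikit-learn's inertia_ attribute:

```python
# Elbow method sketch: k-means SSE over a range of k values.
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

tweets = [
    "broncos defense wins the game", "keep pounding panthers",
    "vote kca tonight", "kids choice awards voting open",
    "rocket launch delayed again", "new mars mission announced",
    "halftime show was amazing", "coldplay and beyonce on stage",
]
X = TfidfVectorizer().fit_transform(tweets)

for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    # inertia_ is the within-cluster sum of squared errors (SSE);
    # the "elbow" is where this curve stops dropping sharply.
    print(k, round(km.inertia_, 3))
```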
For text datasets, Can and Ozkarahan (1990) propose that the number of clusters can be expressed by the formula $mn/t$, where $m$ is the number of documents, $n$ the number of terms, and $t$ the number of non-zero entries in the document-by-term matrix.

Greene et al. (2014) introduce a term-centric stability analysis strategy, assuming that a topic model with an optimal number of clusters is more robust to deviations in the dataset. However, they validated the method on news articles, which are much longer and usually more coherent than tweets. Greene et al. (2014) released a Python implementation of the stability analysis approach (https://github.com/derekgreene/topic-stability), which we used to predict the optimal number of clusters for a Twitter dataset.

To estimate the number of topics in a tweet corpus, stability analysis was applied to 10,000 tweets posted on January 27, 2016, using a [2,10] range for the number of topics $k$. An initial topic model was generated from the whole dataset. Subsequently, $\tau$ random subsets of the dataset were generated, with one topic model per $k$ value for each subset $S_1, \ldots, S_\tau$. The stability score for a $k$ value is the mean agreement score between a reference set $S_0$ and the sample ranking sets $S_i$ for that $k$: $\mathit{stability}(k) = \frac{1}{\tau} \sum_{i=1}^{\tau} \mathit{agree}(S_0, S_i)$ (Greene et al., 2014). The number of terms to consider, $t$, also affects the agreement; a $t$ value of 20 indicates that the top 20 terms for each topic were used. The stability scores were overall low, e.g., for $t = 20$ ranging from 0.43 at $k = 2$ to 0.31 at $k = 10$. The low scores are likely caused by the sparse and noisy data in tweets, as this method was originally used for longer, more coherent documents. The method's estimate is therefore not a good indication of the number of underlying topics in Twitter corpora.

Good results have been achieved by using a higher number of topics for tweets than what is needed for larger, more coherent corpora, since short messages and the diverse themes in tweets require more topics. Hong and Davison (2010) use a range of 20 to 150 topics for tweets, obtaining the best results with 50 topics, although the chance of duplicate or overlapping topics increases with the number of topics.

3.3 Hashtags

Tweets have user-generated metatags that can aid the topic and sentiment analysis. A hashtag is a term or an abbreviation preceded by the hash mark (#). Being labels that users tag their messages with, they serve as indicators of which underlying topics are contained in a tweet. Moreover, hashtags can help discover emerging events and breaking news, by looking at new or uncommon hashtags that suddenly rise in attention. We here present the usage of hashtag co-occurrences to divulge the hidden thematic structure in a corpus, using a collection of 3 million tweets retrieved during the Super Bowl game, February 7th, 2016.

Hashtag co-occurrences show how hashtags appear together in tweets. Since different hashtags appearing in the same tweet usually share the underlying topics, a hashtag co-occurrence graph might give interesting information regarding the topics central to the tweet. Looking at which hashtags co-occur with the hashtag #SuperBowl gives us more information about other important themes and topics related to the Super Bowl. Table 1 shows the 10 most popular hashtags from the Super Bowl corpus, about half of them being related to the Super Bowl.

Hashtag                   Freq.
SB50                     10,234
KCA                       3,624
SuperBowl                 2,985
FollowMeCameronDallas     1,899
Broncos                   1,290
EsuranceSweepstakes       1,079
followmecaniff              995
FollowMeCarterReynolds      938
KeepPounding                794
SuperBowl50                 783

Table 1: Hashtags during Super Bowl 2016

[Figure 1: Hashtag co-occurrence network]
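The co-occurrence counts behind a network like the one in Figure 1 can be gathered in a single pass over the corpus. A minimal illustrative sketch (the regex-based hashtag extraction is a simplifying assumption):

```python
# Sketch: counting hashtag co-occurrences within single tweets.
import re
from collections import Counter
from itertools import combinations

def hashtags(tweet):
    # Lowercased, deduplicated hashtags; the regex is a simplification.
    return sorted({tag.lower() for tag in re.findall(r"#(\w+)", tweet)})

tweets = [
    "#SB50 #KeepPounding what a game",
    "#SuperBowl #SB50 go Broncos",
    "#KCA #VoteArianaGrande",
    "#SB50 #KeepPounding defense wins championships",
]

pair_counts = Counter()
for tweet in tweets:
    # Every unordered pair of distinct hashtags in a tweet co-occurs once.
    pair_counts.update(combinations(hashtags(tweet), 2))

print(pair_counts.most_common(3))
```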

Interestingly, (Denver) Broncos, the winning team of the match, was the 5th most mentioned hashtag, while the losing team, the (Carolina) Panthers, was only the 17th most popular hashtag that day.

Some hashtags are related to a topic without it being apparent, since it requires further knowledge to understand how they are related: "Keep Pounding" is a quote by the late Carolina Panthers player and coach Sam Mills. Hashtag co-occurrences help reveal such related hashtags, and hashtag-graph based topic models have been found to enhance the semantic relations displayed by standard topic model techniques (Wang et al., 2014). Figure 1 displays the 20 most co-occurring hashtags in a co-occurrence network for the Super Bowl corpus. Three clusters of hashtags emerge, with Super Bowl related hashtags forming the largest. Related topics and terms become more apparent when displayed in a co-occurrence graph like this, with KeepPounding and SB50 being the 8th most co-occurring hashtag pair, and the artists Beyonce and Coldplay appearing in the Super Bowl cluster since they performed during the halftime show. The graph also indicates EsuranceSweepstakes to be related to the Super Bowl, and indeed the company Esurance ran an ad during the match, encouraging people to tweet using the hashtag EsuranceSweepstakes. Another cluster consists of the three hashtags votelauramarano, KCA and VoteArianaGrande. KCA is an abbreviation of Kids' Choice Awards, an annual award show where people can vote for their favorite television, movie and music acts, by tweeting the hashtag KCA together with a nominee-specific voting hashtag (e.g., VoteArianaGrande).

4 Topic Modeling Experiments

Extending the studies of the previous section, a set of experiments related to topic modeling were conducted, comparing a standard LDA topic model to a hashtag-aggregated model, and comparing two author-topic models.

4.1 Hashtag-aggregated topic model

A pooling technique that involves aggregating tweets sharing hashtags was applied, the assumption being that tweets that share hashtags also share underlying topics. The main goal of this method is the same as for the Author-topic model and other pooling techniques: alleviating the disadvantages of short documents by aggregating documents that are likely to share latent topics. Some restrictions were introduced: only single-hashtag tweets were used, and only hashtags that appeared in at least 20 of the documents in the corpus were kept. A sketch of this pooling step is given below.
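A minimal sketch of the pooling step under the stated restrictions (illustrative; the regex-based hashtag extraction is a simplifying assumption, and the threshold is a parameter that should be lowered for small corpora):

```python
# Hashtag pooling sketch: one aggregated document per frequent hashtag.
import re
from collections import defaultdict

def pool_by_hashtag(tweets, min_docs=20):
    pools = defaultdict(list)
    for tweet in tweets:
        tags = {tag.lower() for tag in re.findall(r"#(\w+)", tweet)}
        if len(tags) == 1:                  # restriction: single-hashtag tweets
            pools[tags.pop()].append(tweet)
    # Restriction: keep hashtags appearing in at least min_docs tweets,
    # then merge each pool into one document for the topic model.
    return {tag: " ".join(pool)
            for tag, pool in pools.items() if len(pool) >= min_docs}
```

The resulting documents are then tokenized and passed to LDA exactly as in the earlier sketch, with one document per hashtag instead of one per tweet.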

Topic #7      Topic #21    Topic #24           Topic #34
revivaltour   new          purposetourboston   trump
selenagomez   soundcloud   justinbieber        hillary
whumun        news         boston              bernie
wtf           video        one                 realdonaldtrump
getting       favorite     best                will
boyfriend     sounds       tonight             clinton
bitch         health       yet                 greysanatomy
mad           blessed      shows               vote
resulted      efc          bitcoin             president
blend         mealamovie   redsox              people

Table 2: Four topics after hashtag aggregation

[Figure 2: Coherence score of LDA topic model vs. hashtag-aggregated topic model]

Table 2 shows a sample of the resulting topics. They appear more coherent than the topics generated with tweets as individual documents, even though many of the less probable words in each topic might seem somewhat random. It is, however, easier to get an understanding of the underlying topics conveyed in the tweets, and aggregating tweets sharing hashtags can produce more coherence than a topic model generated with single tweets as documents. The UMass coherence scores for the topics in this topic model are also much higher than for standard LDA, as shown in Figure 2.

4.2 Author-topic model experiments

Tweets from six popular Twitter users were obtained through the Twitter API, selecting users known for tweeting about different topics, so that the results would be distinguishable. Barack Obama would be expected to tweet mainly about topics related to politics, while the astrophysicist Neil deGrasse Tyson would presumably tweet about science-related topics. The themes communicated by Obama and Donald Trump ought to be similar, both being politicians, while the inventor Elon Musk ought to show similarities with Tyson. Tweets from two pop artists, Justin Bieber and Taylor Swift, were also included and expected not to share many topics with the other users. To obtain tweets reflecting the authors' interests and views, all retweets and quote tweets were discarded, as well as tweets containing media or URLs. Two approaches to author topic-modeling were compared, based on Rosen-Zvi et al. (2004) and on Hong and Davison (2010), respectively.

Ten topics were generated from the Rosen-Zvi et al. (2004) author-topic model, each topic being represented by its 10 most probable words. The resulting topics are reasonably coherent, and can be seen in Table 3.

The quality of an author-topic model can be measured by its ability to accurately portray the user's interests. A person that has knowledge of which themes and topics a user usually talks about should be able to recognize the user by their topic distribution. Eight persons were shown topic distributions such as the one in Figure 3 without knowing which user each belonged to, and asked to identify the users based on the topic distributions. All participants managed to figure out which topic distribution belonged to which author for all authors, except the distributions of Taylor Swift and Justin Bieber, which were very similar, both having Topics 4 and 5 as the most probable. The remaining authors had easily recognizable topic distributions, which was confirmed by the experiment.

The author-topic model proposed by Hong and Davison (2010) performs standard LDA on aggregated user profiles. To conduct the experiments, the model was thus first trained on a corpus where each document contains the aggregated tweets of one user.
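A sketch of this aggregated setup, including the held-out inference-and-averaging step described below (gensim again; the per-user tweet lists are toy assumptions, not the experimental data):

```python
# Aggregated author-topic sketch: one concatenated document per user.
import numpy as np
from gensim import corpora
from gensim.models import LdaModel

user_tweets = {
    "politician": ["economy jobs growth plan", "healthcare reform vote"],
    "scientist":  ["rocket booster landing", "mars orbit insertion test"],
}

users = list(user_tweets)
docs = [" ".join(user_tweets[u]).split() for u in users]
dictionary = corpora.Dictionary(docs)
corpus = [dictionary.doc2bow(d) for d in docs]
lda = LdaModel(corpus, num_topics=2, id2word=dictionary, passes=20)

def user_profile(new_tweets, num_topics=2):
    """Average inferred topic distribution over a user's held-out tweets."""
    profile = np.zeros(num_topics)
    for tweet in new_tweets:
        bow = dictionary.doc2bow(tweet.split())
        for topic_id, prob in lda.get_document_topics(
                bow, minimum_probability=0.0):
            profile[topic_id] += prob
    return profile / max(len(new_tweets), 1)

print(user_profile(["economy growth tonight"]))
```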

#1            #2        #3         #4      #5           #6        #7        #8                      #9            #10
will          just      earth      night   tonight      just      people    great                   president     don
new           like      moon       today   love         one       much      thank                   obama         fyi
now           will      day        get     thank        know      time      trump2016               america       time
get           now       sun        new     thanks       orbit     won       will                    sotu          tesla
america       good      world      happy   ts1989       two       tonight   cruz                    actonclimate  first
make          think     ask        back    taylurking   might     way       hillary                 economy       rocket
poll          also      universe   time    show         long      bad       makeamericagreatagain   work          science
many          around    full       see     got          planet    show      big                     change        space
trump         live      space      one     crowd        star      morning   cnn                     americans     launch
don           landing   year       good    tomorrow     instead   really    said                    jobs          model

Table 3: The ten topics generated for the Rosen-Zvi et al. (2004) author-topic model
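For the Rosen-Zvi et al. (2004) variant behind Table 3, gensim provides an AuthorTopicModel class; a minimal sketch with toy data (the corpus and the author-to-document mapping are illustrative assumptions, not the experimental setup):

```python
# Author-topic model sketch using gensim's AuthorTopicModel.
from gensim import corpora
from gensim.models import AuthorTopicModel

docs = [
    "climate change policy vote".split(),
    "rocket booster landing test".split(),
    "economy healthcare reform bill".split(),
]
# Map each author to the indices of the documents they wrote.
author2doc = {"politician": [0, 2], "engineer": [1]}

dictionary = corpora.Dictionary(docs)
corpus = [dictionary.doc2bow(d) for d in docs]

atm = AuthorTopicModel(corpus, author2doc=author2doc,
                       id2word=dictionary, num_topics=2)
# Topic mixture for one author, analogous to the distribution in Figure 3.
print(atm.get_author_topics("politician"))
```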

[Figure 3: Topic distribution for Obama]

[Figure 4: Obama's aggregated topics]

Furthermore, new tweets for each author (which were not part of the training data) were downloaded, and the topic distribution was inferred for each of the new tweets. Finally, the topic distribution for each user was calculated as the average topic distribution over all new tweets written by that user. Since a low number of topics produces many topics containing the same popular words, 50 topics were used instead of the 10 in the previous experiments.

An example of the resulting topic mixtures for the authors can be seen in Figure 4, with the most probable topics for each of the authors tabulated in Table 4. As opposed to the previous topic mixtures, these topic mixtures generally had one topic that was much more probable than the remaining topics. Therefore, the diversity in the language of each author might not be captured as well by this model. On the other hand, the most probable topic for each author generally describes the author with high precision. It is therefore easier to distinguish Justin Bieber from Taylor Swift than it was in the previous experiment.

5 Discussion and conclusion

A topic modeling system for modeling tweet corpora was created, utilizing pooling techniques to improve the coherence and interpretability of standard topic models. The results indicate that techniques such as author aggregation and hashtag aggregation generate more coherent topics.

Various methods for estimating the optimal number of topics for tweets were tested. The Elbow method almost exclusively suggested a k value between 3 and 5, no matter how large or diverse the corpus was. The stability score (Greene et al., 2014) also produced rather poor estimates of the number of topics when applied to a tweet corpus.

#0 (Obama)    #20 (Musk)   #26 (Tyson)   #35 (Trump)             #43 (Bieber)    #19 (Swift)
president     tesla        earth         will                    thanks          tonight
obama         will         moon          great                   love            ts1989
america       rocket       just          thank                   whatdoyoumean   taylurking
sotu          just         day           trump2016               mean            just
actonclimate  model        one           just                    purpose         love
time          launch       time          cruz                    thank           thank
work          good         sun           hillary                 lol             crowd
economy       dragon       people        new                     good            night
americans     falcon       space         people                  great           now
change        now          will          makeamericagreatagain   see             show

Table 4: The most probable topic for each of the six authors, inferred from the aggregated topic distributions

The sparsity of tweets is likely the cause of this: the documents do not contain enough words to produce sufficient term co-occurrences. Hong and Davison (2010) found that 50 topics produced the optimal results for their author-topic model, although the optimal number of topics depends on the diversity of the corpora. Thus k values of 10 and 50 were used in the experiments, with 50 for large corpora where a diverse collection of documents was expected.

Hashtag co-occurrences were used to divulge latent networks of topics and events in a collection of tweets. A hashtag aggregation technique was shown to mitigate the negative impact sparse texts have on the coherence of a topic model. The hashtag aggregation technique is especially interesting, as it utilizes a specific metadata tag that is not present in standard documents. A hashtag-aggregated topic model produced much better coherence than the standard LDA variation for the same corpus; this is also consistent with recent research on topic models for microblogs. Two author-topic models were used in our experiments, one using the Rosen-Zvi et al. (2004) topic model and one using the aggregated author-topic model proposed by Hong and Davison (2010), both seeming to produce interpretable topics.

It is worth noting that there are no standardized methods for evaluating topic models, as most quantitative measures try to estimate human judgement. Moreover, there is no precise definition of a gold standard for topic models, which makes the task of comparing and ranking topic models difficult. A combination of a computational method and human judgement was therefore used in the evaluations.

One way to extend the topic modeling system would be to apply online analysis by implementing automatic updates of a topic model, continuously extended by new tweets retrieved from the Twitter Streaming API. This would help detect emerging events, in a fashion similar to Lau et al. (2012). Moreover, Dynamic Topic Models could be considered to provide better temporal modeling of Twitter data. A limitation of topic modeling in general is the difficulty in evaluating the accuracy of the models. Computational methods try to simulate human judgement, which poses difficulties, as human judgement is not clearly defined. Further research could help provide better methods for evaluating topic models. In this paper, we aggregated tweets sharing authors and hashtags. Further work should look into other pooling schemes, and see how they compare to author and hashtag aggregation. One example would be to aggregate conversations on Twitter into individual documents. Tweets contain a lot of metadata that can aid the aggregation process.

References

David M. Blei, Andrew Y. Ng, and Michael I. Jordan. 2003. Latent Dirichlet Allocation. Journal of Machine Learning Research, 3:993–1022.

David M. Blei. 2012. Probabilistic topic models. Communications of the ACM, 55(4):77–84.

William Boag, Peter Potash, and Anna Rumshisky. 2015. TwitterHawk: A feature bucket based approach to sentiment analysis. In Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015), pages 640–646, Denver, Colorado. Association for Computational Linguistics.

Fazli Can and Esen A. Ozkarahan. 1990. Concepts and effectiveness of the cover-coefficient-based clustering methodology for text databases. ACM Transactions on Database Systems, 15(4):483–517.

Jonathan Chang, Sean Gerrish, Chong Wang, Jordan L. Boyd-Graber, and David M. Blei. 2009. Reading tea leaves: How humans interpret topic models. In Advances in Neural Information Processing Systems, pages 288–296, Vancouver, British Columbia.

Derek Greene, Derek O'Callaghan, and Pádraig Cunningham. 2014. How many topics? Stability analysis for topic models. In Toon Calders, Floriana Esposito, Eyke Hüllermeier, and Rosa Meo, editors, Machine Learning and Knowledge Discovery in Databases, pages 498–513. Springer Berlin Heidelberg, Berlin, Heidelberg.

Liangjie Hong and Brian D. Davison. 2010. Empirical study of topic modeling in Twitter. In Proceedings of the First Workshop on Social Media Analytics, SOMA '10, pages 80–88, New York, NY, USA. ACM.

Olessia Koltsova and Sergei Koltcov. 2013. Mapping the public agenda with topic modeling: The case of the Russian LiveJournal. Policy & Internet, 5(2):207–227.

Jey Han Lau, Nigel Collier, and Timothy Baldwin. 2012. On-line trend analysis with topic models: #twitter trends detection topic model online. In Proceedings of COLING 2012: Technical Papers, pages 1519–1534, Mumbai, India.

David Mimno, Hanna M. Wallach, Edmund Talley, Miriam Leenders, and Andrew McCallum. 2011. Optimizing semantic coherence in topic models. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, EMNLP '11, pages 262–272, Stroudsburg, PA, USA. Association for Computational Linguistics.

Brendan O'Connor, Ramnath Balasubramanyan, Bryan R. Routledge, and Noah A. Smith. 2010. From tweets to polls: Linking text sentiment to public opinion time series. In International AAAI Conference on Weblogs and Social Media (ICWSM), volume 11, pages 1–2, Washington, DC, USA.

Nataliia Plotnikova, Micha Kohl, Kevin Volkert, Andreas Lerner, Natalie Dykes, Heiko Emer, and Stefan Evert. 2015. KLUEless: Polarity classification and association. In Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015), Denver, Colorado. Association for Computational Linguistics.

Xiaojun Quan, Chunyu Kit, Yong Ge, and Sinno Jialin Pan. 2015. Short and sparse text topic modeling via self-aggregation. In Proceedings of the 24th International Conference on Artificial Intelligence, IJCAI '15, pages 2270–2276. AAAI Press.

Michal Rosen-Zvi, Thomas Griffiths, Mark Steyvers, and Padhraic Smyth. 2004. The author-topic model for authors and documents. In Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence, UAI '04, pages 487–494, Arlington, Virginia, United States. AUAI Press.

Dionisios N. Sotiropoulos, Chris D. Kounavis, Panos Kourouthanassis, and George M. Giaglis. 2014. What drives social sentiment? An entropic measure-based clustering approach towards identifying factors that influence social sentiment polarity. In Information, Intelligence, Systems and Applications (IISA 2014), The 5th International Conference, pages 361–373, Chania, Crete, Greece. IEEE.

Pranav Waila, V. K. Singh, and Manish K. Singh. 2013. Blog text analysis using topic modeling, named entity recognition and sentiment classifier combine. In Advances in Computing, Communications and Informatics (ICACCI), 2013 International Conference on, pages 1166–1171, Mysore, India. IEEE.

Y. Wang, J. Liu, J. Qu, Y. Huang, J. Chen, and X. Feng. 2014. Hashtag graph based topic model for tweet mining. In 2014 IEEE International Conference on Data Mining, pages 1025–1030, Shenzhen, China, December.

Jianshu Weng, Ee-Peng Lim, Jing Jiang, and Qi He. 2010. TwitterRank: Finding topic-sensitive influential Twitterers. In Proceedings of the Third ACM International Conference on Web Search and Data Mining, WSDM '10, pages 261–270, New York, NY, USA. ACM.

Zhihua Zhang, Guoshun Wu, and Man Lan. 2015. ECNU: Multi-level sentiment analysis on Twitter using traditional linguistic features and word embedding features. In Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015), Denver, Colorado. Association for Computational Linguistics.
