Tales of a Coronavirus Pandemic: Topic Modelling with Short-Text Data

by Adam Shen

a thesis submitted to The Faculty of Graduate and Postdoctoral Affairs in partial fulfilment of the requirements for the degree of Master of Science in Statistics

Carleton University, Ottawa, Ontario, Canada

© 2021 Adam Shen

Master of Science (2021)
Carleton University
Mathematics and Statistics
Ottawa, Ontario, Canada

TITLE: Tales of a Coronavirus Pandemic: Topic Modelling with Short-Text Data

AUTHOR: Adam Shen, B.Sc. (Statistics), McMaster University, Hamilton, Ontario, Canada

SUPERVISORS: Dr. David Campbell, Dr. Song Cai, Dr. Shirley Mills

NUMBER OF PAGES: xiii, 72

Abstract

With more than 13 million tweets relating to the COVID-19 global pandemic, collected between March 2020 and November 2020, the topics of discussion are investigated using topic models – statistical models that learn latent topics present in a collection of documents.

Topic modelling is first conducted using Latent Dirichlet Allocation (LDA), a method that has seen great success when applied to formal texts. As LDA attempts to learn latent topics by analysing term co-occurrences within documents, it can encounter difficulties in the learning process when presented with shorter documents such as tweets. To address the inadequacies of LDA applied to short-text, a second topic modelling technique is considered, known as the Biterm Topic Model (BTM), which instead analyses term co-occurrences over the entire collection of documents.

Comparing the performances of LDA and BTM, it was found that the topic quality of BTM was superior to that of LDA.

Acknowledgements

I would like to thank Dr. David Campbell for his patience, positivity, and encouragement throughout the duration of this thesis. I have learned so much from him in such a short amount of time and I am extremely grateful for the many opportunities he has given me.

I would not have made it this far without the kindness, guidance, and support from Dr. Song Cai, especially in my first semester when I was struggling at just about everything. In addition, this thesis would not have been possible without his computing server, which I had the privilege of using all to myself.

Thank you to Dr. Shirley Mills, MITACS, and Carleton University for funding the project upon which this thesis was built.

Contents

Abstract

Acknowledgements

List of Figures

List of Tables

List of Algorithms

1 Introduction
  1.1 The coronavirus disease 2019 pandemic
  1.2 About Twitter
    1.2.1 Terminology
    1.2.2 Increasing one's own public presence
  1.3 Thesis outline

2 Data preparation
  2.1 Data collection
  2.2 Data verification
  2.3 Text cleaning
  2.4 Language detection
  2.5 Quality check

3 Exploratory analysis
  3.1 Results
  3.2 Discussion

4 Sentiment analysis
  4.1 Background
  4.2 How sentimentr works
  4.3 Methods
  4.4 Results
  4.5 Discussion

5 Topic modelling
  5.1 Introduction
  5.2 Latent Dirichlet Allocation (LDA)
    5.2.1 Model outline
    5.2.2 Parameter estimation
    5.2.3 Perplexity
  5.3 Biterm Topic Model (BTM)
    5.3.1 Model outline
    5.3.2 Parameter estimation
    5.3.3 Document-topic proportions
    5.3.4 Log-likelihood
  5.4 Comparison of LDA and BTM models
  5.5 Methods
    5.5.1 LDA
    5.5.2 BTM
  5.6 Results
    5.6.1 LDA
    5.6.2 BTM
    5.6.3 Topic quality
  5.7 Discussion

6 Conclusions

Bibliography

List of Figures

3.1 Number of tweets (thousands) available per day.
3.2 Number of tweets (millions) that used between 0 and 15 hashtags.
3.3 Top 30 hashtags and their frequencies (thousands), excluding hashtags related to the search queries.
3.4 Top three monthly hashtags and their frequencies (thousands), excluding hashtags related to the search queries. Horizontal scales differ across months due to differing amounts of tweets available for each month.
3.5 Wordcloud of the most used terms in the collected tweets, excluding stopwords and terms related to the search queries. Increased text size and darker colour corresponds to increased usage.
3.6 Wordclouds of terms co-occurring with the captioned term in a tweet, excluding stopwords, terms related to the search query, and the captioned term. Increased text size and darker colour corresponds to increased usage.

4.1 Monthly average sentiments for tweets, using the Jockers-Rinker and NRC lexicons. Point sizes represent the proportion of tweets contributed by the given month, relative to other months.

4.2 Sentiment scores for the five lowest scoring tweets of March 2020, using the Jockers-Rinker and NRC lexicons. Text shown is the cleaned text, i.e. removal of mentions, symbols, emojis, and demotion of hashtags.
4.3 Sentiment scores for the five lowest scoring tweets of October 2020, using the Jockers-Rinker and NRC lexicons. Text shown is the cleaned text, i.e. removal of mentions, symbols, emojis, and demotion of hashtags.

5.1 Graphical model representation of LDA. Nodes in the graph represent random variables; shaded nodes are observed variables. Plates denote replication, with the number of replicates given in the bottom right corner of the plate.
5.2 Graphical model representation of BTM. Nodes in the graph represent random variables; shaded nodes are observed variables. Plates denote replication, with the number of replicates given in the bottom right corner of the plate.
5.3 Perplexities for the nine models trained in the first tuning step, evaluated on the validation set. K, the number of topics, was fixed while varying values of α and β. The hyperparameters corresponding to the model with the lowest perplexity were α = 100/K and β = 0.1.
5.4 Perplexities for six of the seven models trained in the second tuning step, plus the best model from the first tuning step, evaluated on the validation set. α and β were fixed, while varying over values of K. Perplexities involving the K = 500 model were unable to be calculated due to memory constraints.
5.5 Monthly document-topic allocations scaled by the number of tweets available per month. The five topics highlighted attained the highest scaled allocation values overall.

5.6 Monthly top three topics by scaled allocations.
5.7 Top 20 most probable terms for topics appearing in Figure 5.5 and Figure 5.6.
5.8 Log-likelihoods for the nine models trained in the first tuning step, evaluated on the validation set. K, the number of topics, was fixed at 10 while varying values of α and β. The hyperparameters corresponding to the model with the highest log-likelihood were α = 50/K and β = 0.1.
5.9 Log-likelihoods for the seven models trained in the second tuning step, plus the best model from the first tuning step, evaluated on the validation set (black) and the testing set (green). α and β were fixed, while varying over values of K.
5.10 Monthly document-topic allocations scaled by the number of tweets available per month. The five topics highlighted obtained the highest scaled allocation values overall.
5.11 Monthly top three topics by scaled allocations.
5.12 Biterm clusters for topics appearing in Figure 5.10 and Figure 5.11. Each cluster is an undirected graph as biterms are unordered pairs of terms. Increased node (term) size corresponds to higher topic-term probability. Increased thickness and darkness of edges (links) corresponds to higher co-occurrences within the topic. As biterms were computed using a window size of 15, adjacent nodes may not have necessarily appeared adjacently in the original text.
5.13 Top 20 most probable terms for topics appearing in Figure 5.10 and Figure 5.11.

5.14 Mean coherence scores and 95% t-intervals for the 500 topics of the BTM and LDA models using the top M = {5, 10, 15, 20} terms of each topic, evaluated on the testing data. A higher coherence score means that a topic is more coherent. Using a two-sample t-test, the mean coherence of topics learned by BTM was found to be significantly (p < 0.001) greater than the mean coherence of topics learned by LDA for each of the four values of M considered here.

List of Tables

2.1 A sample of the raw data obtained from the Twitter API using the rtweet13 package.

List of Algorithms

5.1 Collapsed Gibbs sampling algorithm for LDA.
5.2 Collapsed Gibbs sampling algorithm for BTM.

Chapter 1

Introduction

1.1 The coronavirus disease 2019 pandemic

The coronavirus disease 2019 (COVID-19) pandemic is an ongoing pandemic caused by a strain of coronavirus known as severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2)1. This strain of coronavirus was first identified in Wuhan, China in December of 20191. At the time of writing, more than one year has elapsed since the documentation of the initial case. In this time, more than 86.5 million cases have been confirmed globally, of which more than 1.87 million cases have resulted in death2.

Transmission of COVID-19 can occur between humans when a person comes into contact with the respiratory droplets resulting from the coughing, sneezing, talking, or breathing of an infected person1. Symptoms of COVID-19 can vary greatly from person to person, with typical symptoms including, but not limited to: fever, cough, fatigue, difficulty breathing, loss of smell, and loss of taste1.

The effect of the pandemic has been felt on a global scale through the economy3, travel restrictions4, unemployment5, supply shortages6, heightened racism and xenophobia towards people of Chinese and East Asian descent7,8, and protests for change in healthcare and political systems9,10.

Mass hysteria was bound to ensue in the early days of the pandemic due to uncertainty surrounding the authenticity of information sources, and a cult-like uprising of deniers of scientific evidence. In times like these, where the general public has been advised to stay home except for essential purposes and in-person social gatherings have been restricted, one way to stay connected with friends and family, get the latest news, and discuss current events is through social networking platforms.

1.2 About Twitter

Twitter is an American social networking platform with approximately 321 million active users (as of February 2019)11 across the world. Twitter is available in most countries and can be accessed from a web browser or a mobile application. The platform allows users to publish posts called tweets, consisting of text up to 280 characters, as well as media including photos, GIFs, and videos. For the average user, published tweets are visible to the public (including those without Twitter accounts). As a result of the character limit, users may resort to slang or abbreviations to convey their ideas. After a tweet has been published (publicly), the original poster, as well as other Twitter users, can interact with the tweet in three ways: like, comment, or retweet (re-post a tweet to one’s own collection of tweets).

1.2.1 Terminology

• Hashtag: a keyword or group of words (without spaces or punctuation) preceded by a # symbol. Hashtags can act as a topic identifier in a tweet.

• Mention: the tagging of one or more Twitter users by including their Twitter handles (usernames) preceded by a @ symbol. Mentions can be used to direct a tweet at one or more Twitter users and notify them of the tweet.

• Follow: keep up with a Twitter user’s activity by displaying their tweets, likes, comments, and retweets on your main feed.

1.2.2 Increasing one’s own public presence

While hashtags are not mandatory, they can be clicked on to view all other tweets sharing the same hashtag. Through Twitter's explore feature, one can view trending keywords/hashtags and associated tweets. Outside of the overall trending topics, the explore feature can be filtered to include tweets on news, politics, and entertainment. Therefore, Twitter users can garner more attention towards their tweets by taking part in the discussion of current events.

Because the users that one follows dictate the content shown on the main feed, having users interact with your tweets (like, comment, or retweet) allows your content to travel through their social networks, appearing as activity for their followers. With a moderately sized number of followers, information (whether correct or not) can travel extremely quickly across users' social networks.

1.3 Thesis outline

Twitter users have voiced many differing opinions on the pandemic, its effect on our daily lives, its handling by government organizations, and its intertwining with global affairs varying from region to region. While Twitter’s application programming interface (API) allows for the easy access and retrieval of tweet data, the amount of available data (despite only being a sample) is enormous and would be overwhelming for manual analysis by humans.

The purpose of this thesis is to employ modern information retrieval techniques, namely Latent Dirichlet Allocation and Biterm Topic Modelling, in order to summarise the obtained pandemic tweet data by their latent topics. With this information, an analysis of the topical trends can be performed to elucidate the most prominent topics of discussion, as well as how the topics of discussion vary over the course of the pandemic.

The processing of the raw text data preceding any analyses is detailed in Chapter 2. In Chapter 3, an exploratory analysis is performed to give a basic overview of the data and illustrate some of the insights that can be drawn from text data without the use of models. Chapter 4 is an extension of the exploratory analysis: a sentiment analysis is performed using an augmented dictionary lookup – one that considers parts of speech known as valence shifters. In Chapter 5, topic modelling with the classical method of Latent Dirichlet Allocation (LDA), and its short-text variant, the Biterm Topic Model (BTM), is performed. Finally, Chapter 6 discusses the shortcomings of LDA and BTM, along with a novel approach to topic modelling that addresses some of these shortcomings.

All computational methods in this thesis were performed in R 4.012 with the aid of numerous open-source packages13–40.

Chapter 2

Data preparation

2.1 Data collection

Twitter data can be obtained from the Twitter Application Programming Interface (API) by specifying keywords, either as plain text or as hashtags, that the returned data should contain. Data from up to nine days prior to the time of the call to the Twitter API can be retrieved. The rate limit allows a maximum retrieval of 18,000 observations every 15 minutes. For each successful query, a simple random sample of data containing the specified keywords is returned. Individual calls to the Twitter API are independent; as a result, duplicate data may be obtained between calls.

The keywords used in the queries to obtain the data for this thesis included:

covid covid19 covid_19 coronavirus epidemic pandemic social distancing socialdistancing lockdown

Twitter data were obtained at the rate limit continuously between March 27, 2020 and November 14, 2020. After removing duplicate data, a total of 61,001,110 unique observations remained, consisting of 14,458,259 (23.7%) tweets and 46,542,851 (76.3%) retweets.
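Although the thesis pipeline was implemented in R, the de-duplication step can be sketched in Python: since independent API calls may return the same status more than once, observations are keyed on the status_id metadata field and only the first occurrence is kept. The records below are illustrative, not real data.

```python
def deduplicate(observations):
    # Keep only the first occurrence of each status_id across API calls.
    seen = set()
    unique = []
    for obs in observations:
        if obs["status_id"] not in seen:
            seen.add(obs["status_id"])
            unique.append(obs)
    return unique

raw = [
    {"status_id": "1", "is_retweet": False},
    {"status_id": "2", "is_retweet": True},
    {"status_id": "2", "is_retweet": True},   # duplicate returned by a later call
    {"status_id": "3", "is_retweet": False},
]

unique = deduplicate(raw)
# Retweets are later excluded from the text analyses.
tweets = [o for o in unique if not o["is_retweet"]]
```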

Each observation had metadata fields including:

• user_id : a unique numeric identifier for the user performing the tweet/retweet

• status_id : a unique numeric identifier for the tweet/retweet

• created_at : the date and time at which the tweet/retweet was performed (in UTC time)

• is_retweet : a logical value indicating whether the observation was a retweet

• hashtags : a vector of hashtags contained within the tweet/retweet

• lang : a character value indicating the language that Twitter has detected

Other metadata fields such as city, country, and geographical coordinates were also included. However, since the availability of data for these fields was highly dependent on the settings of the individual users, many of these fields were mostly missing values and were therefore unusable. A sample of the raw data can be found in Table 2.1.

All text-related analyses were conducted using only tweets, i.e. retweets were excluded, as tweets were most representative of the ideas of individual users. Henceforth, the term “data” will refer to tweets only, unless otherwise specified. The word “term” will be used when words and hashtags are referred to without distinction.

2.2 Data verification

It was found that there were instances of tweets and retweets that had been collected despite not matching any of the search queries. Therefore, another pass through the data was made to ensure that each tweet contained at least one of the queried terms. Of the 14,458,259 collected tweets, 681,654 (4.7%) did not contain any of the queried terms and were removed, leaving 13,776,605 tweets containing one or more of the queried terms.

2.3 Text cleaning

The major steps of the text cleaning included:

• Conversion of the text encoding from UTF-8 to ASCII, thereby removing any symbols, emojis, and characters that were not native to the English alphabet. This was done for compatibility across machines and packages.

• Conversion of all text to lower case.

• Removal of user mentions and links.

• Demotion of hashtags to words (by removing the # sign). This was done due to the inconsistent nature of how Twitter users incorporate hashtags into their tweets – some users replace keywords mid-sentence with hashtags, some accumulate all relevant hashtags at the end of the tweet, and some use a combination of both.
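The cleaning steps above can be sketched as follows; the thesis used R, and the regular expressions here are illustrative assumptions rather than the original implementation.

```python
import re
import unicodedata

def clean_tweet(text):
    # Transliterate to ASCII, dropping emojis and non-English symbols.
    text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
    # Convert to lower case.
    text = text.lower()
    # Remove user mentions and links.
    text = re.sub(r"@\w+", "", text)
    text = re.sub(r"https?://\S+", "", text)
    # Demote hashtags to plain words.
    text = text.replace("#", "")
    # Collapse whitespace left behind by the removals.
    return re.sub(r"\s+", " ", text).strip()

print(clean_tweet("Stay safe @everyone! #StayHome https://t.co/abc"))
# → "stay safe ! stayhome"
```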

Additional text cleaning occurred in Chapters 3 and 5, after further exploration.

2.4 Language detection

While Twitter automatically classifies the language of tweets using its own algorithms and allows for retrieval of tweets matching specific languages, the retrieved tweets can still contain a mixture of languages. For the results of the subsequent exploratory analysis to be interpretable, and given the role valence shifters play in the English language, only tweets that were either mostly or entirely in English were retained.

Table 2.1: A sample of the raw data obtained from the Twitter API using the rtweet13 package.

user_id: 833499844517961728 | status_id: 1255290686250811392 | created_at: 2020-04-29 00:19:56 | is_retweet: FALSE | hashtags: NA | lang: en
text: The NY Dem Presidential Primary on 6/23 is canceled due to COVID-19. Other elections on 6/23 will still take place: https://t.co/0P6PMPq20h - Rock the Vote

user_id: 581180392 | status_id: 1254511313331597314 | created_at: 2020-04-26 20:42:59 | is_retweet: FALSE | hashtags: "pandemic", "homeworkout", "trainfromhome" | lang: en
text: In the midst of the current #pandemic, it is more important than ever to get creative with your #homeworkout. One great #trainfromhome option includes making a sandbag from a laundry bag. Here’s how to implement sandbag circuits: https://t.co/Ottxy2YZAq

user_id: 819296296062201857 | status_id: 1304087649011953665 | created_at: 2020-09-10 16:01:38 | is_retweet: FALSE | hashtags: "Masks" | lang: en
text: #Masks serve 2 purposes: 1) To remind everyone that there is supposed to be a deadly pandemic, despite nobody knowing anybody who is sick with covid, and 2) To increase incidence of pulmonary illnesses, which will later be blamed on covid. Proof: See my pinned tweet.

user_id: 2350488460 | status_id: 1322387795718119425 | created_at: 2020-10-31 03:59:53 | is_retweet: FALSE | hashtags: "GOPBetrayedAmerica", "GOPCorruptionOverCountry" | lang: en
text: #GOPBetrayedAmerica decisions are being made to take your voice/vote away. The same #GOPCorruptionOverCountry that won’t respond with help for every American during this once a century pandemic https://t.co/Gg5gecx7tb

The R packages cld224 (a wrapper for the C++ implementation of Google's Compact Language Detector 2) and cld325 (a wrapper for the C++ implementation of Google's Compact Language Detector 3) were used to detect the language of each tweet. Unfortunately, the results of cld3's language detection were very poor, even for tweets that were truly entirely in English. This was possibly due to the short nature of tweets, combined with the uncertainty surrounding scientific terms such as COVID-19. On the other hand, the results of cld2 were very promising. As such, only tweets that were classified as English by both Twitter and cld2 were kept.

A total of 192,233 (1.4%) tweets were removed, with 13,584,372 tweets remaining.
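The resulting filter can be sketched as requiring agreement between Twitter's lang field and a second detector. The detect_language function below is a hypothetical stand-in for cld2, not its real API.

```python
def detect_language(text):
    # Hypothetical detector standing in for cld2; real detectors are far
    # more sophisticated than this ASCII heuristic.
    return "en" if text.isascii() else "und"

def keep_english(records):
    # Keep a tweet only when Twitter and the second detector agree on English.
    return [r for r in records
            if r["lang"] == "en" and detect_language(r["text"]) == "en"]

records = [
    {"text": "lockdown again", "lang": "en"},
    {"text": "confinement à nouveau", "lang": "fr"},
]
kept = keep_english(records)
```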

2.5 Quality check

Tweets that had fewer than two terms were removed. Upon further inspection of the number of hashtags used in the collected tweets, it was found that tweets containing an excessive number of hashtags had no semantic value and were either spam or attempts to gain attention by mentioning as many trending topics as possible. Therefore, tweets containing more than 15 hashtags were removed.

A total of 14,751 (0.1%) tweets were removed, leaving a final total of 13,569,621 tweets.
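A minimal sketch of this quality filter, in Python rather than the R used in the thesis; treating whitespace-separated tokens as terms is an assumption for illustration.

```python
def passes_quality_check(text, hashtags, min_terms=2, max_hashtags=15):
    # Drop tweets with fewer than two terms or more than 15 hashtags.
    return len(text.split()) >= min_terms and len(hashtags) <= max_hashtags

examples = [("lockdown extended again", ["lockdown"]),
            ("covid", [])]
kept = [text for text, tags in examples if passes_quality_check(text, tags)]
```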

Chapter 3

Exploratory analysis

3.1 Results

Considering that the rtweet13 package can collect tweets up to nine days prior to the current time of collection, the creation timestamps of the collected tweets were investigated to ensure that the data were suitable for further analysis. Collected tweets were dated between March 27, 2020 and November 14, 2020. Within this interval of 232 days, 51 (22%) days had no tweet data due to software failures in the collection code. Counts of the number of tweets available per day are given in Figure 3.1.

Of the data remaining after removing tweets that contained more than 15 hash- tags, it was seen that the vast majority of tweets did not use any hashtags (Figure 3.2). For tweets that did contain hashtags, the most common hashtags were those that matched the search queries, as one would expect. Excluding hashtags that matched the search queries, the 30 most popular hashtags are displayed in Figure 3.3.

[Figure 3.1: Number of tweets (thousands) available per day.]

Throughout the duration of the collected tweets, there were many variants of hashtags advising others to stay home. These included hashtags such as stayhome, stayathome, stayhomestaysafe, stayhomesavelives, and stayhome[***], where [***] was a city, state, province, or country. As such, hashtags containing the word stay and any of home, safe, save, lives, or life were collapsed into the hashtag stay*. The stay* variant hashtags were the most popular, with 95,538 occurrences out of 5,316,541 (1.8%) non-query hashtags. The second most popular hashtag was trump, with related hashtags trumpvirus at rank 12 and maga (Make America Great Again) at rank 16. Lastly, many of the top hashtags referred to geographical locations, including China, India, Australia, USA, Canada, and UK.
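The collapsing rule can be sketched as a membership test: a hashtag becomes stay* when it contains stay together with any of the trigger words. This Python sketch is illustrative; the thesis implementation was in R.

```python
def collapse_stay(hashtag):
    # Collapse variants containing "stay" plus a trigger word into stay*.
    triggers = ("home", "safe", "save", "lives", "life")
    if "stay" in hashtag and any(t in hashtag for t in triggers):
        return "stay*"
    return hashtag

collapsed = [collapse_stay(h) for h in
             ["stayathome", "stayhomesavelives", "trump", "staytuned"]]
```

Note that a hashtag such as staytuned is left alone because it contains no trigger word.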

[Figure 3.2: Number of tweets (millions) that used between 0 and 15 hashtags.]

Due to the large amount of variability in the number of tweets obtained for each day (Figure 3.1), with many days having missing data, the top three non-query hashtags for each month were sought (Figure 3.4), rather than the daily top hashtags. The all-time top hashtag, stay*, persisted in the top three most used hashtags for all months, only losing its first-place position in the months of September, October, and November. Interestingly, during the months of March, April, and May, the usage of stay* was substantially greater than that of the other top hashtags. Over the months of June, July, and August, while stay* remained the top hashtag, its usage was eventually equalized with the other top hashtags. From September onwards, stay* consistently remained at position three.

The trump hashtag appeared in the top three hashtags for the months of March to June, and reappeared in the top three hashtags in the months of October and November. In September, the top two hashtags, protectenhypen and respect_enhypen, were in reference to the Korean boy-band Enhypen, whose members were swarmed by fans at an airport due to the leakage of their private flight schedule41.

[Figure 3.3: Top 30 hashtags and their frequencies (thousands), excluding hashtags related to the search queries.]

To visualize term frequencies relative to each other, wordclouds were constructed with increased text size and darker colour corresponding to increased usage. Term frequencies were counted following the removal of terms related to the search queries as well as the removal of stopwords. Stopwords were removed according to the snowball stopword lexicon22.

A general wordcloud (Figure 3.5) was constructed for an overview of the terms contained in the data. The most used term was people, with a total of 1,753,528 usages. trump was the seventh most used term, with 841,979 usages. Unfortunately, a global wordcloud may not offer much information due to its lack of focus, as evidenced by the dark green centre with the surrounding words being much lighter.

[Figure 3.4: Top three monthly hashtags and their frequencies (thousands), excluding hashtags related to the search queries. Horizontal scales differ across months due to differing amounts of tweets available for each month.]

It was therefore of greater interest to construct wordclouds for terms co-occurring in tweets that mentioned the (exact) terms: trump, china, election2020, economy, health, and vaccine (Figure 3.6). These wordclouds were created similarly to the general wordcloud: term frequencies were counted following the removal of terms related to the search queries and the removal of stopwords (snowball lexicon22), followed by the additional removal of the term of interest.
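The co-occurrence counting behind these wordclouds can be sketched as follows: for a focal term, count every other term appearing in tweets that mention it, after dropping stopwords, query terms, and the focal term itself. The tiny corpus and stopword list below are illustrative; the thesis used R.

```python
from collections import Counter

# Illustrative stopword and query-term sets; the thesis used the snowball
# stopword lexicon and the full query list from Section 2.1.
STOPWORDS = {"the", "a", "is", "and", "of"}
QUERY_TERMS = {"covid", "coronavirus", "pandemic", "lockdown"}

def cooccurrence_counts(tweets, focal):
    counts = Counter()
    for tweet in tweets:
        terms = tweet.lower().split()
        if focal in terms:
            # Count co-occurring terms, excluding stopwords, query terms,
            # and the focal term itself.
            counts.update(t for t in terms
                          if t not in STOPWORDS | QUERY_TERMS and t != focal)
    return counts

tweets = ["trump and the pandemic response",
          "the economy is struggling",
          "trump economy news"]
counts = cooccurrence_counts(tweets, "trump")
```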

From Figure 3.6, it was seen that the term trump appeared in all wordclouds (excluding its own). In addition, trump was the central term in the china, election2020, and economy wordclouds, while still remaining close to the centre in the health and vaccine wordclouds. Lastly, it was observed that tweets that mentioned vaccine, trump, health, and economy were the most focused (smaller vocabulary size for terms appearing more than once) as evidenced by the more abundant increased-size and dark green terms spanning towards the outer edges of the wordclouds, with the vaccine wordcloud being the most focused.

[Figure 3.5: Wordcloud of the most used terms in the collected tweets, excluding stopwords and terms related to the search queries. Increased text size and darker colour corresponds to increased usage.]

[Figure 3.6: Wordclouds of terms co-occurring with the captioned term in a tweet, excluding stopwords, terms related to the search query, and the captioned term. Increased text size and darker colour corresponds to increased usage.]

(a) trump (b) china

shows stress hoax else reason rise food instead history donald system show deadly christmas dems major restrictions ppl actually celebrate uspolitics far asking result shit opening nation blame something hours stand failure anything trumpvirus watching leader leadership maybe across making white system god process reopening numbers everyone destroying past seriously start harris especially dying national hard anyone anything fact find cost crash policy government impact house high record financial court worst without handling forget voteblue2020 seems despite destroy things healthcare pilots practice working congress talk common point likely wearamask believe racist taking measures pre kill schools united hear damage american trying electionresults2020 racism candidates dead lot reopen closed die thousands course history sure live getting ever quarantine record biden2020 makes flu close republicans public almost mean week obama done bring

die least taxes recover data first plans second market foxnews vaccine keep second ensure trumps ballot next truth 2020election victory inherited

running away man response long thanks man democracy lets public say millions matter turn elections2020 businesses long say everyone despite black funds future believe voteearly yet able protect relief climate last wear elections trump's business must wisconsin numbers candidate rally voters also economic real happen american help week join run workers pay frontline supporters good make old use thousands thing

watch early usa saying vote2020 needs politics life cases open uk election cause love told electionday president u.s wants voter economy may thought ahead amid china crisis really make around well likely blame well put stay americans states war focus control maybe florida world presidential virus health local first today show actually part needs w like re worst even since lives close killed fight risk never check amid u may support left response mr w get lot house fact growth weeks lost final people must fear term many months low best poll look via

white now free country ago place real back number wear jobs just need spread something thing come potus failed better

months s failed now stay voted plan texas far loss know federal might re senate mask already due hard voting oct getting world us get keep home everything end died can vote less tax can lose gets masks trying better state died wait virus really full office safe fear top via going tell way america create killing bidenharris2020 one run nation s safe pass go issue year polling elect million work give debate want means got president new also wearing post recovery jobs bad gdp even try think days mind bad still risk media trump took great trump come positive every tonight time home let stop enough stock voters done two see one issues big news lost post india relief counting usa due campaign dr time new

weeks march whole

global level much november right deaths another like take since protect cast course biden rate dr joebiden gop worse election us million stimulus wearing says oh full around joe much job good news people case someone made middle nov wrong mail still yes never number impact ballots country

night give says debates2020 said follow political find shit put great biden just fix left change right end caused called call life win fauci countries deaths go saying unemployment hit coming sure case small always day work test back plan year trump's america health tell fake going

years etc face family masks line won two feel stop debt daily took thank lives everything votes georgia take help vote know scotus global china money cases blue making party many donald results job death result less u.s person issues govt every think need care matter last fight happy together remember politics cnn wins best said state democrats little place read americans want years able let please season lies day healthcare science trumps testing

trumpmeltdown sick times breaking high support way economic money rallies count electionnight maga millions democratic racial normal half kids mask government deal fraud killed polls care got away democrats focus live dead lies race soon national nothing without officials wants waiting issue started important workers trump2020 control listen face open entire address things read climate crisis please taking times votehimout true middle recession today fucking administration counted donaldtrump death act see states strong morning big change stimulus shut little citizens party counteveryvote republican extension already mean fighting business another hope ever become depression yet destroyed point dying

nothing recovery made look arizona try continue policy anyone start given tomorrow leadership approve huge save killing whole bidenharris election2020results testing different coming use deal working worse turnout power society current forward four blacklivesmatter hope joe vaccine political talking obama pennsylvania looking wrong trade gop seems lead usps dems handling important effects brexit unemployment midst free next stupid sick spread instead votersuppression talk created continue enough payrollsupportprogram looks heard absentee russia elected understand war media lead

(c) election2020 (d) economy

online ensure numbers vote city give face efforts late things save case conditions president countries food less population oxford comes trying bad money education case economy oh science states issue information around rate study show already students everyone post given lives another healthy sure etc potential states research stock economic around hospitals ceo hold weeks learn access political americans show working team away testing works shot please someone everyone clinical country department reported great return kids feel years systems non cure created measures find disease deaths developed got officials trying treatment side without home needs millions fight come working support family worse immune please lack old nothing

jobs lost flu two daily cdc million good said available government fear

science economy good data end due tell worried long fast still deal guidelines hard thing found since tests follow distribution kill risk race say die vaccines yes hands really risk spread free health pfizer fans middle trump way u.s vote come right already based life fact taking future every likely wave since especially share see go masks take keep person use know women privacy might start fake must better real day tested ppl got stop away put death protect important line best make one emergency people rate let take control bill re ill care virus even sure death think today well look stay via news job point future masks die china better top can ready way time called going like patients mask two mask can ever may w trump let govt time care just number real cdc fda want organization live try end really u dear much hard create

nhs read world healthcare know testing best open new joe term well state read never made never news now safe big love gates day wants ones issues get public people yes global want local going work stop make live said us hope like every act old school usa told true bad say positive many virus fauci uk

keep join normal next new one others without w back still now wait impact stay far us yet use system mental first plan get flu till half workers data u.s done u track global find election ministry plan top lot hit respect part also full via due save cases even spread lot tell effective lets less another need ever november help sick white just dr call americans minister amid part long need public give work see test think

week response far home services trials india wear put big days

government uk safety done back doses soon year says world years go ask county test safe s anti much thing look thank related continue month safety medical security today trial shows vaccine talk immunity life help contact develop state hospital national toll dr crisis deaths right first may early says disease advice free latest biden year continue cases also high million yet appreciate enhypen check phase control term scientists saying last experts times making months made biden front schools children taking cold countries etc herd sars feel results

used many china place total protectenhypen next ppe

lives fight enough insurance coming getting president face nothing measures drug country point problems full getting report wearing months least company human wearing wear stress response last team week human families physical community needs service high experts hope days start media must join means shows month fact development trust possible across things died staff great including money place believe tested latest outbreak doctors professionals line deadly able russian killed medical media

(e) health (f) vaccine

Figure 3.6: Wordclouds of terms co-occurring with the captioned term in a tweet, excluding stopwords, terms related to the search query, and the captioned term. Increased text size and darker colour corresponds to increased usage.

3.2 Discussion

From Figure 3.1, it was seen that many dates had no available tweet data. For the dates that did have tweet data, there was a large amount of variability in the number of tweets collected. This showcases one of the difficulties of obtaining data from the Twitter API and can be explained by a variety of factors:

1. Rate limit and data window. Because Twitter data returned by a query could have been created up to nine days prior to the time of the query, for a topic as popular as the COVID-19 pandemic, it is almost certain that an incredibly large amount of tweet and retweet data related to the search terms existed. With a rate limit (the maximum number of data points retrieved in one call) of 18,000, presumably much smaller than the total number of tweets and retweets related to the search terms, numerous rounds of collection would be required to obtain a sizeable fraction of the existing data within this moving nine-day window. In addition, since each collection is performed independently of the others, it is possible to obtain duplicated data, and on rare occasions, unwanted data (Section 2.2).
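Since each round is collected independently, the overlap between rounds can be removed after the fact by keying on each tweet's unique identifier. The following is a minimal Python sketch; the field name `status_id` is a hypothetical stand-in for whatever identifier field the collection tool returns.

```python
def deduplicate(rounds):
    """Merge independently collected rounds of tweets, keeping the
    first copy of each tweet encountered.

    Each tweet is represented as a dict; 'status_id' stands in for
    the tweet's unique identifier.
    """
    seen = set()
    unique = []
    for batch in rounds:
        for tweet in batch:
            if tweet["status_id"] not in seen:
                seen.add(tweet["status_id"])
                unique.append(tweet)
    return unique

rounds = [
    [{"status_id": 1}, {"status_id": 2}],  # first collection round
    [{"status_id": 2}, {"status_id": 3}],  # second round overlaps the first
]
print([t["status_id"] for t in deduplicate(rounds)])  # [1, 2, 3]
```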

2. Collection of retweet data. As the original intention behind the collection of this particular COVID-19 Twitter data was not for this thesis, retweets were included in the collection step. Since the call to the Twitter API allows retweets to be excluded, enabling this option would have been desirable. As seen in Section 2.1, 76.3% of the original collected data consisted of retweets, which were inevitably excluded from the analyses performed in this thesis.

Had metadata fields detailing the geographic origin of the collected tweet data been available, an additional analysis of the topics of discussion by region would have been of interest. Unfortunately, as explained in Section 2.1, the majority of values for these fields were missing. Although a sizeable number of tweets mentioned geographical regions in hashtag form (Figure 3.3), and quite likely an even larger number of tweets mentioned geographical regions in plain text (as hashtags are not mandatory), it would not have been correct to infer the geographic origin of a tweet from the region it discussed. For example, tweets that discussed the origin of COVID-19 would have mentioned Wuhan or China, but this would not have offered any geographical information on the user. In addition, as Twitter is a loosely moderated platform, it is acceptable to share opinions on matters one knows nothing about.

The appearances of the term trump in Figure 3.5 and Figure 3.6 suggest that discussions of Donald Trump, the former president of the United States, are heavily intertwined with many of the sub-topics of these COVID-19 tweets. One should recall and take into consideration that the term frequencies in the trump wordcloud of Figure 3.6 were counted from tweets that explicitly mentioned trump. As such, the term frequencies of tweets that talked about Trump without mentioning his name were not counted. This includes:

• Tweets where users referred to Trump by his Twitter handle (recall that user mentions were removed as part of the cleaning process; Section 2.3) and did not mention his name again for the remainder of the tweet

• Tweets where Trump was referred to by another title/nickname such as president or POTUS

• Tweets directed at Trump's family members, referring to Trump by a familial relation such as your father

Finally, it is worth noting that when working with informal text, tallied term frequencies are estimates. Due to the presence of hashtags, character limits, and to a lesser extent, misspellings, these estimates are likely to be underestimates. For example, a hashtag such as nomasknoservice would not generate counts for no, mask, and service, but would instead generate a count for the possibly unique term, nomasknoservice. At this time there does not exist a simple solution to split hashtags accurately into their component terms, especially for data of this size that uses both colloquial and modern scientific terms.

Chapter 4

Sentiment analysis

4.1 Background

Sentiment analysis aims to classify the polarity of text as either negative or positive, based on the sign of the final computed sentiment score. Basic sentiment analysis methods involve a dictionary lookup: words in a document are compared against words in a polarity lexicon, where each polarized word has a pre-determined score, and the final sentiment score is computed as the sum of the individual polarized words’ scores.
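The basic lookup described above amounts to only a few lines of code. The sketch below uses an invented toy lexicon; real polarity lexicons contain thousands of weighted entries.

```python
# Toy polarity lexicon with invented weights, for illustration only.
TOY_LEXICON = {"good": 1.0, "great": 1.5, "sick": -1.0, "terrible": -1.5}

def dictionary_sentiment(text, lexicon=TOY_LEXICON):
    """Sum the pre-determined scores of every polarized word in the
    text; the sign of the total gives the polarity."""
    return sum(lexicon.get(word, 0.0) for word in text.lower().split())

print(dictionary_sentiment("this was a great day"))  # 1.5
print(dictionary_sentiment("i am not sick"))         # -1.0 (negation is missed)
```

Note the second example: the sentence is positive, but the plain lookup scores it negatively, which is exactly the shortcoming that motivates the valence shifters discussed next.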

The disadvantage of simple dictionary lookups is that they do not consider parts of speech that can affect polarized words. These parts of speech are known as valence shifters and can be broken down into four categories32:

• Negators: flip the sign of a polarized word (e.g., I am not sick.)

• Amplifiers: increase the impact of a polarized word (e.g., I am definitely sick.)

• De-amplifiers: reduce the impact of a polarized word (e.g., I am somewhat sick.)

• Adversative conjunctions: overrule the previous clause containing a polarized word (e.g., I was sick but I feel better now.)

Without consideration for valence shifters, a basic sentiment analysis may not accurately capture the intended sentiment of the given text. The R package sentimentr32 attempts to address this shortcoming by employing an augmented dictionary lookup, which combines a basic dictionary lookup of polarized terms with consideration for valence shifters.

4.2 How sentimentr works

Consider a sentence, $s_i$, as an ordered bag of words such that

$$s_i = \{w_{i,1}, \ldots, w_{i,n}\},$$

where $w_{i,j}$ represents the $j$th word of the $i$th sentence. All punctuation is stripped from the sentence with the exceptions of commas, colons, and semicolons, which will be denoted as $w^c$ and referred to as "comma-words". Polarized words, $w^p_{i,j}$, of the sentence are tagged as positive or negative (according to the chosen sentiment lexicon).

After the tagging of polarized words, context clusters are formed about each polarized word. Cluster boundaries take into consideration the locations of comma-words, as comma-words can be thought of as possible "changes in thought". Therefore a context cluster, $c_{i,j}$, centred at $w^p_{i,j}$ with lower bound, $L$, and upper bound, $U$, can be written as an ordered sequence of words

$$c_{i,j} = \{L, \ldots, w^p_{i,j}, \ldots, U\}$$

where:

• $L = \max\left( w_{i,\,j - n_{\text{before}}},\; w_{i,1},\; \max\{ w^c : w^c < w^p_{i,j} \} \right)$

• $U = \min\left( w_{i,\,j + n_{\text{after}}},\; w_{i,n},\; \min\{ w^c : w^c > w^p_{i,j} \} \right)$

• $j - n_{\text{before}}$ is the position of the word $n_{\text{before}}$ units before the polarized word, and $j + n_{\text{after}}$ is the position of the word $n_{\text{after}}$ units after the polarized word. By default, $n_{\text{before}} = 4$ and $n_{\text{after}} = 2$.

Here the maxima and minima are taken over word positions within the sentence. Informally, the lower bound, $L$, is whichever of the following is closest to the polarized centre word: the most recent comma-word before it, the word $n_{\text{before}}$ units before it, or the first word of the sentence should $j - n_{\text{before}}$ go out of range. Similarly, the upper bound, $U$, is whichever of the following is closest to the polarized centre word: the nearest comma-word after it, the word $n_{\text{after}}$ units after it, or the last word of the sentence should $j + n_{\text{after}}$ go out of range.
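Assuming 0-indexed word positions, the bound computation can be sketched as follows. This is a simplified illustration of the rule just described, not sentimentr's actual implementation.

```python
def cluster_bounds(j, n_words, comma_positions, n_before=4, n_after=2):
    """Return (L, U) word positions for the context cluster centred at
    the polarized word at position j (0-indexed)."""
    # Lower bound: n_before units back, clipped at the sentence start...
    lower = max(j - n_before, 0)
    commas_before = [c for c in comma_positions if c < j]
    if commas_before:
        # ...unless a comma-word before the centre is closer.
        lower = max(lower, max(commas_before))
    # Upper bound: n_after units forward, clipped at the sentence end...
    upper = min(j + n_after, n_words - 1)
    commas_after = [c for c in comma_positions if c > j]
    if commas_after:
        # ...unless a comma-word after the centre is closer.
        upper = min(upper, min(commas_after))
    return lower, upper

# Polarized word at position 2, comma-word at position 4, 8 words total:
print(cluster_bounds(2, 8, [4]))  # (0, 4)
# Centre after the comma: the comma pulls the lower bound up to 4.
print(cluster_bounds(6, 8, [4]))  # (4, 7)
```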

It is within this context cluster that valence shifters are considered and affect the weighting of the polarized centre word.

• Amplifiers, $w^a_{i,j}$, increase the polarity of the context cluster centre while de-amplifiers, $w^d_{i,j}$, decrease the polarity of the context cluster centre.

• Negators, $w^{neg}_{i,j}$, flip the sign of the polarized centre word. In addition, amplifiers become de-amplifiers if there is an odd number of negators within the context cluster. Similarly, de-amplifiers will become amplifiers if there is an even number of negators.

• Adversative conjunctions that appear before the polarized centre word upweight the polarity of the context cluster, while adversative conjunctions that appear after the polarized centre word will downweight the polarity of the context cluster.

For an amplification/de-amplification weight, $z_1$ (default value 0.8), and an adversative conjunction weight, $z_2$ (default value 0.85), the raw sentiment of a sentence is the sum of the weighted context clusters and is given by:

$$\sum c_{i,j} = \sum \left[ (1 + \phi_a + \phi_d) \cdot w^p_{i,j} \cdot (-1)^{\,2 + \phi_{neg}} \right]$$

where:

• $\phi_a = \sum \left[ (\phi_{adv} > 1) + (\phi_{neg} \cdot z_1 \cdot w^a_{i,j}) \right]$ is the amplification weight

• $\phi_d = \sum \left[ (\phi_{adv} < 1) + \max\left( z_1 \left( -\phi_{neg} \cdot w^a_{i,j} + w^d_{i,j} \right),\, -1 \right) \right]$ is the de-amplification weight

• $\phi_{neg} = \left( \sum w^{neg}_{i,j} \right) \bmod 2$ is the negation weight

• $w^p_{i,j}$ is the weight of the polarized centre word (obtained from the sentiment lexicon)

• $\phi_{adv} = 1 + z_2 \left( n_{adv,\,\text{before}} - n_{adv,\,\text{after}} \right)$ is the adversative conjunction weight, where $n_{adv,\,\text{before}}$ is the number of adversative conjunctions before the polarized centre and $n_{adv,\,\text{after}}$ is the number of adversative conjunctions after the polarized centre.

Finally, the raw sentiment scores of each sentence are scaled by dividing by the square root of the word count of the respective sentence.
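This final scaling step is a one-line transformation; a direct transcription:

```python
import math

def scale_sentiments(raw_scores, word_counts):
    """Divide each sentence's raw sentiment by the square root of its
    word count, as described above."""
    return [raw / math.sqrt(n) for raw, n in zip(raw_scores, word_counts)]

print(scale_sentiments([-1.5, 2.0], [9, 16]))  # [-0.5, 0.5]
```

The square-root divisor dampens the advantage long sentences have in accumulating polarized words.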

4.3 Methods

Two sentiment lexicons were used to calculate tweet sentiments, both of which are available in the lexicon31 package:

1. Jockers-Rinker lexicon: a combination of Jockers' lexicon42 and Rinker's augmented version of Hu and Liu's lexicon42,43. This lexicon contained a total of 11,710 entries, of which 287 were bigrams (a pair of terms). Entries had weights ranging from -2.0 to 1.0. The term trump was removed from this lexicon because most, if not all, of its occurrences in this context would be in reference to Donald Trump rather than the verb.

2. NRC lexicon: a lexicon by National Research Council Canada's Mohammad and Turney44. This lexicon contained a total of 5,468 entries, all of which were single terms. Entries had weights of -1 or +1. As the term trump did not appear in this lexicon, no additional modifications were required.

For the analysis presented in the next section, the sentiment of a tweet was com- puted as the sum of the raw sentiment scores. Raw sentiment scores were used rather than scaled sentiment scores since tweets with extreme sentiment scores were of in- terest and scaling would reduce the extremity of the sentiment scores, particularly for longer tweets and tweets that were less concise.

4.4 Results

Monthly average tweet sentiments were computed using the modified Jockers-Rinker lexicon (with the aforementioned removal of the term trump) and the NRC lexicon for the entire collection of tweets (Figure 4.1). Despite the difference in polarized term weightings and abundance of terms between the two lexicons, the overall monthly trends were quite similar. For both lexicons, the monthly average sentiments began very negatively in March. By May, the monthly average tweet sentiment had reached its least negative point (still below zero), followed immediately by a stark descent until October, and becoming slightly less negative in November than in October. Interestingly, when using the NRC lexicon, the average tweet sentiment for October surpassed the average tweet sentiment of March in negativity.

[Line plots of monthly average sentiment, March-November 2020; panels: Jockers-Rinker and NRC; vertical axis: average sentiment, roughly -0.3 to -0.1 (Jockers-Rinker) and -0.10 to -0.06 (NRC)]

Figure 4.1: Monthly average sentiments for tweets, using the Jockers-Rinker and NRC lexicons. Point sizes represent the proportion of tweets contributed by the given month, relative to other months.

Following from the trends of Figure 4.1, the five most negative tweets of March (Figure 4.2) and October (Figure 4.3) for both lexicons were extracted along with their corresponding cleaned texts. Collectively, the lowest scoring tweets of March discussed pandemic comparisons, criticisms of Trump and the American Republican party, the New York State COVID-19 case counts, and criticism of Indian political leaders. In October, the lowest scoring tweets collectively discussed criticisms of Trump and the American Republican party, and pandemic related frustrations. Interestingly, with the Jockers-Rinker lexicon, the lowest scoring tweet of October scores only one unit less than the lowest scoring tweet of March. Meanwhile, with the NRC lexicon, the lowest scoring tweet of October is almost as negative as the lowest scoring tweet of March.

flu pandemic death toll: million cause: influenza sixth cholera pandemic death toll: , cause: cholera flu pandemic death toll: million cause: influenza

what kind of pandemic is worse: highly contagious but not as deadly or not very contagious but extremely deadly

civics home schooling. the current and or previous gop president: started an unnecessary, disastrous war made the hurricane katrina disaster much worse unnecessarily racked up half the total debt made the climate crisis worse is making the pandemic much worse

. you wanted to be . congrats! you are : most ignorant person in america most useless person most illegitimate president making usa shithole country creating country wide poorest response to pandemic most criminal president administration

really bad takes about why south korea has been relatively successful in controlling covid19: collectivism wtf? confucianism wtf? wtf? political consensus wtf?wtf?wtf?wtf?wtf?wtf?wtf?wtf?wtf?wtf?wtf?wtf? wtf?wtf?wtf?wtf?wtf?wtf?wtf?wtf?wtf?wtf?wtf?wtf?

[Bar chart axis: Sentiment (Jockers-Rinker), -4 to 0]

you all have failed to lead us indians in this crisis and stop pandemic. shame on you to be our leaders. you don't deserve to lead india. shame shame shame shame shame shame shame shame shame shame shame shame

gah; messed up the nys data. here's the right numbers for deaths since the pandemic began. mar mar mar mar mar mar mar mar mar mar mar mar mar mar mar mar

lost the election lost recount lost electoral college threats lost fake stormy daniels lost fake meuller investigation lost russia hoax lost fake impeachment hmmmm. how can we dems get rid of trump?. blame world pandemic on him fake trumpgenocide

civics home schooling. the current and or previous gop president: started an unnecessary, disastrous war made the hurricane katrina disaster much worse unnecessarily racked up half the total debt made the climate crisis worse is making the pandemic much worse

note the differences dry cough sneeze air pollution cough mucus sneeze runny nose common cold cough mucus sneeze runny nose body ache weakness light fever flu dry cough sneeze body pain weakness high fever difficulty breathing coronavirus coronavirus covid19

[Bar chart axis: Sentiment (NRC), -15 to 0]

Figure 4.2: Sentiment scores for the five lowest scoring tweets of March 2020, using the Jockers-Rinker and NRC lexicons. Text shown is the cleaned text, i.e. removal of mentions, symbols, emojis, and demotion of hashtags.

rump has helped make america 1st again: largest stock market drop most lies highest national debt most convicted team members most pandemic infections in the world most racist acts by any humanoid in history most dictator allies most nepotism in history

but what about a pandemic? but what about a pandemic? but what about a pandemic? but what about a pandemic?but what about a pandemic? but what about a pandemic? but what about a pandemic? but what about a pandemic?but what about a pandemic? but what about a pandemic? but.

i hate the pandemic. i hate the pandemic. i hate the pandemic. i hate the pandemic. i hate the pandemic. i hate the pandemic. i hate the pandemic. i hate the pandemic. i hate the pandemic. i hate the pandemic. i hate the pandemic. i hate the pandemic. i hate the pandemic.

covid19 covid19 covid19 covid19 covid19 covid19 covid19 covid19 liar liar liar liar liar liar liar liar liar liar liar fraud fraud fraud fraud fraud fraud fraud cheat cheat cheat cheat cheat cheat cheat cheat failure failure failure failure failure failure failure loser loser loser loser

republicans: more sexual assault more treason more whining about mean tweets more unnecessary pandemic deaths more voter disenfranchisement more lowlife scumbag lies

[Bar chart axis: Sentiment (Jockers-Rinker), -5 to 0]

oh he had a mask, but was smoking a cig really feel the gross was intentional frankly i miss smoking cigs myself but quit in light of the pandemic plus i'm fucking broke but with people like him being gross and even screaming outside i don't want to go get cigs anyway

a 2nd trump term more racism more misogyny more attacks on women's rights more attacks on lgbtq rights more bullying more hate crimes more tax breaks for rich more corruption more pollution more misery more poverty more coronavirus no aca obamacare

but what about a pandemic? but what about a pandemic? but what about a pandemic? but what about a pandemic?but what about a pandemic? but what about a pandemic? but what about a pandemic? but what about a pandemic?but what about a pandemic? but what about a pandemic? but.

i hate the pandemic. i hate the pandemic. i hate the pandemic. i hate the pandemic. i hate the pandemic. i hate the pandemic. i hate the pandemic. i hate the pandemic. i hate the pandemic. i hate the pandemic. i hate the pandemic. i hate the pandemic. i hate the pandemic.

covid19 covid19 covid19 covid19 covid19 covid19 covid19 covid19 liar liar liar liar liar liar liar liar liar liar liar fraud fraud fraud fraud fraud fraud fraud cheat cheat cheat cheat cheat cheat cheat cheat failure failure failure failure failure failure failure loser loser loser loser

[Bar chart axis: Sentiment (NRC), -30 to 0]

Figure 4.3: Sentiment scores for the five lowest scoring tweets of October 2020, using the Jockers-Rinker and NRC lexicons. Text shown is the cleaned text, i.e. removal of mentions, symbols, emojis, and demotion of hashtags.

4.5 Discussion

Upon examining the texts in Figures 4.2 and 4.3, it is apparent that a prominent issue with dictionary-based sentiment analyses on informal texts is the lack of consideration for excessive repetition. In particular, by repeating polarized terms, tweets are able to receive inflated sentiment scores while offering little semantic value to human readers. To improve the quality of the computed sentiment scores, one could implement a penalty for the repetition of polarized words occurring within a fixed window.
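One possible form of such a penalty, offered here purely as an illustration rather than a method used in this thesis, is to shrink each repeat of a polarized word occurring within a sliding window geometrically:

```python
from collections import deque

def penalized_sentiment(words, lexicon, window=10, decay=0.5):
    """Dictionary-lookup sentiment where a polarized word repeated
    within the last `window` tokens contributes decay ** (number of
    recent repeats) of its lexicon weight."""
    recent = deque(maxlen=window)
    total = 0.0
    for w in words:
        weight = lexicon.get(w, 0.0)
        if weight:
            total += weight * decay ** recent.count(w)
        recent.append(w)
    return total

tweet = "i hate the pandemic " * 3
# Successive repeats of "hate" contribute -1, -0.5, -0.25, ...
print(penalized_sentiment(tweet.split(), {"hate": -1.0}))  # -1.75
```

A plain lookup would score this tweet -3.0; with the penalty, the third repetition adds little, curbing the score inflation seen in Figures 4.2 and 4.3.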

A caveat of dictionary-based sentiment analysis is that the choice of lexicon can be quite arbitrary and the term weightings of the chosen lexicon may not be appropriate for all contexts. In the setting of COVID-19, the term positive should have a positive weighting, whereas the bigram test positive should have a negative weighting. As the term test is unweighted in both Jockers-Rinker and NRC lexicons, a phrase where a user claims to have tested positive would be assigned a positive score.
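A lookup that matches two-word entries before single terms is one way to handle cases like test positive; the sketch below uses an invented two-entry lexicon and is not sentimentr's matching logic.

```python
def bigram_sentiment(words, unigrams, bigrams):
    """Dictionary lookup that consumes a matching bigram first, so its
    weight overrides the weights of its component unigrams."""
    total, i = 0.0, 0
    while i < len(words):
        pair = tuple(words[i:i + 2])
        if len(pair) == 2 and pair in bigrams:
            total += bigrams[pair]
            i += 2  # consume both words of the bigram
        else:
            total += unigrams.get(words[i], 0.0)
            i += 1
    return total

unigrams = {"positive": 0.75}                 # invented weights
bigrams = {("test", "positive"): -0.75}
print(bigram_sentiment("i test positive today".split(), unigrams, bigrams))  # -0.75
print(bigram_sentiment("a positive outlook".split(), unigrams, bigrams))     # 0.75
```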

From the plots of the monthly trends (Figure 4.1), it was seen that the average sentiments were all negative. However, without further investigation into the contents of each tweet, it cannot be concluded that the overall opinion of Twitter users towards the COVID-19 pandemic was negative. As was seen in Figures 4.2 and 4.3, few tweets managed to form an opinion on the COVID-19 pandemic itself. Instead, tweets mostly consisted of criticisms of political and societal actions in response to the pandemic.

Chapter 5

Topic modelling

5.1 Introduction

Topic models are statistical models used to learn the latent topics present in a collection of documents, also known as a corpus. Popular text data sources include, but are not limited to, journal articles, news articles, social networking platforms, survey answers, and product reviews. In today's world, copious amounts of text data are readily available in digitized formats – far too much for humans to process manually through reading. In the domain of information retrieval, topic modelling can help simplify the synthesis of large corpora. Training a topic model on a large text corpus can provide insight into the topical structure of the text, creating a structured framework for understanding unstructured text.

In the setting of Twitter data, while hashtags have the potential to act as topic identifiers, the non-mandatory nature of hashtags makes their use as topic identifiers infeasible. As discussed in the exploratory analysis, tweets that mentioned the term trump were not the only tweets about Trump. The appearance of the term trump in subfigures (b) through (f) of Figure 3.6 also suggested that a topic such as "Trump" would be very broad. As such, the use of single terms as topic identifiers is inadequate.

5.2 Latent Dirichlet Allocation (LDA)

To date, one of the most well-known approaches to topic modelling is Latent Dirichlet Allocation (LDA), which proposes that documents cover a limited number of topics and topics tend to use a limited number of keywords45. With the LDA approach, topic-term distributions and document-topic proportions are learned by analysing the document-term co-occurrences46,47. Working with formal sources of text such as books, journal articles, and news articles, LDA has seen great success45.

The Dirichlet distribution

The Dirichlet distribution is a multivariate continuous probability distribution parameterized by α = (α₁, . . . , α_K), with α₁, . . . , α_K > 0 and K ≥ 2, and has probability density function:

$$ f(x_1, \ldots, x_K) = \frac{\Gamma\!\left(\sum_{i=1}^{K} \alpha_i\right)}{\prod_{i=1}^{K} \Gamma(\alpha_i)} \prod_{i=1}^{K} x_i^{\alpha_i - 1}, \qquad x_1, \ldots, x_K \geq 0, \quad \sum_{i=1}^{K} x_i = 1, $$

where Γ(·) is the gamma function. In the case where all elements of α are equal, the distribution can be parameterized by the single common value α and the probability density function simplifies to:

$$ f(x_1, \ldots, x_K) = \frac{\Gamma(K\alpha)}{\Gamma(\alpha)^K} \prod_{i=1}^{K} x_i^{\alpha - 1}, $$

which is referred to as the symmetric Dirichlet distribution.
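As a numerical aside, the symmetric Dirichlet distribution is easy to explore by simulation. The following illustrative Python sketch (not part of the thesis, which works in R) verifies that draws lie on the probability simplex and that a small common value of α yields sparse draws:

```python
import numpy as np

rng = np.random.default_rng(0)

K = 5        # dimension of the simplex
alpha = 0.1  # common value for a symmetric Dirichlet

# Draw 1000 samples from Dirichlet(alpha, ..., alpha).
samples = rng.dirichlet(np.full(K, alpha), size=1000)

# Every draw has non-negative entries summing to one,
# matching the support of the density above.
assert (samples >= 0).all()
assert np.allclose(samples.sum(axis=1), 1.0)

# A small alpha concentrates most of each draw's mass in one component.
print(samples.max(axis=1).mean())  # close to 1 when alpha is small
```

With larger common values such as α = 50/K or 100/K, as tuned later in this chapter, draws are much closer to uniform.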

5.2.1 Model outline

The LDA model is constructed under the premise that each document is composed of a latent distribution of topics, and each topic is composed of a latent distribution of terms. Consider a corpus, C, of D documents covering K topics over W unique terms in the vocabulary.

Notation:

• The collection of all terms, t, as they appear in the corpus is denoted $\mathcal{W} = \{t_i\}_{i=1}^{N_t}$, where $N_t$ is the total number of terms in the corpus.

• The vocabulary of unique terms is denoted $V = \{w_j\}_{j=1}^{W}$.

• The number of terms in each document, d ∈ [1, D], is denoted $N_d$.

• z ∈ [1, K] is a topic indicator variable.

• The distribution of topics within the documents, P(z | d), is represented by the D × K matrix $\theta = \{\theta_d\}_{d=1}^{D}$, whose individual entries may be denoted $\theta_{d,k} = P(z = k \mid d)$.

• The distribution of terms within topics, P(w | z), is represented by the K × W matrix $\Phi = \{\phi_k\}_{k=1}^{K}$, whose individual entries may be denoted $\phi_{k,w} = P(w \mid z = k)$.

For data with a multinomial distribution, the Dirichlet distribution is a conjugate prior. For the sake of mathematical convenience, LDA relies on the use of symmetric, conjugate Dirichlet priors. Let α and β be the hyperparameters for two symmetric Dirichlet priors. α represents the topic uniformity within documents – a low value of α coincides with the belief that documents cover few topics. β represents the term uniformity within topics – a low value of β coincides with the belief that topics use few terms. The generative process of LDA is as follows:

1. For each topic, k ∈ [1,K]:

(a) Draw the topic-term distribution, ϕk ∼ Dirichlet(β)

2. For each document, d ∈ [1,D]:

(a) Draw the topic distribution, θd ∼ Dirichlet(α)

(b) Draw the document length, Nd ∼ Poisson(ξ)

3. For each term in the dth document, $w_{d,n}$, n ∈ [1, N_d]:

(a) Draw a topic assignment, zd,n ∼ Multinomial(θd)

(b) Draw the term from the topic assignment, $w_{d,n} \sim \text{Multinomial}(\phi_{z_{d,n}})$

A graphical representation of the generative process of LDA can be found in Figure 5.1.

[Plate diagram: α → θ_d → z_{d,n} → w_{d,n} ← ϕ_k ← β, with plates over n ∈ [1, N_d], d ∈ [1, D], and k ∈ [1, K].]

Figure 5.1: Graphical model representation of LDA. Nodes in the graph represent random variables; shaded nodes are observed variables. Plates denote replication, with the number of replicates given in the bottom right corner of the plate.
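The generative process above can be made concrete with a short simulation. This is an illustrative Python sketch with toy sizes (the thesis itself works in R); all variable names and values here are for demonstration only:

```python
import numpy as np

rng = np.random.default_rng(1)

K, W, D = 3, 20, 5               # topics, vocabulary size, documents (toy sizes)
alpha, beta, xi = 0.5, 0.1, 8.0  # Dirichlet hyperparameters and Poisson mean

# Step 1: draw one term distribution per topic, phi_k ~ Dirichlet(beta).
phi = rng.dirichlet(np.full(W, beta), size=K)  # K x W

docs = []
for d in range(D):
    theta_d = rng.dirichlet(np.full(K, alpha))     # step 2(a): topic distribution
    N_d = max(1, rng.poisson(xi))                  # step 2(b): document length
    z = rng.choice(K, size=N_d, p=theta_d)         # step 3(a): topic assignments
    words = [rng.choice(W, p=phi[k]) for k in z]   # step 3(b): terms from topics
    docs.append(words)
```

Fitting LDA amounts to inverting this process: only the words are observed, and θ and Φ must be recovered.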

The joint distribution of the latent and observed variables and hyperparameters for the dth document can be written as:

$$ P(\mathbf{w}_d, \mathbf{z}_d, \theta_d, \Phi \mid \alpha, \beta) = P(\theta_d \mid \alpha)\, P(\Phi \mid \beta) \prod_{n=1}^{N_d} P(z_{d,n} \mid \theta_d)\, P(w_{d,n} \mid \phi_{z_{d,n}}) $$

The likelihood of document $\mathbf{w}_d$ is obtained by integrating over θ_d and Φ, and summing over $\mathbf{z}_d$:

$$ P(\mathbf{w}_d \mid \alpha, \beta) = \iint P(\theta_d \mid \alpha)\, P(\Phi \mid \beta) \prod_{n=1}^{N_d} \sum_{z_{d,n}=1}^{K} P(z_{d,n} \mid \theta_d)\, P(w_{d,n} \mid \phi_{z_{d,n}})\, d\theta_d\, d\Phi $$

Finally, the likelihood of the corpus C is computed as the product of the likelihoods of the individual documents:

$$ P(C \mid \alpha, \beta) = \prod_{d=1}^{D} P(\mathbf{w}_d \mid \alpha, \beta) \qquad (5.1) $$

5.2.2 Parameter estimation

The estimation of LDA parameters through direct maximization of the likelihood function is infeasible as the problem is:

1. Ill-posed – a unique solution is not guaranteed.

2. Non-identifiable – the topic labels are unconstrained; while the label switching does not impact model fit, it results in a complex posterior that is rampant with multiple equivalent likelihood modes.

3. Intractable – there does not exist an efficient optimization algorithm when faced with multiple equivalent likelihood modes.

As such, approximate inference is done using collapsed Gibbs sampling. Gibbs sampling is a sampling algorithm used for statistical inference when the true joint distribution of a collection of parameters is unknown or difficult to sample from, but a conditional distribution is known or easier to sample from. Collapsed Gibbs sampling is a special case of Gibbs sampling where one or more variables have been marginalized out, often simplifying the sampling distribution even further.

As the complex posterior is rampant with multiple equivalent modes, the collapsed Gibbs sampler is used as an optimization tool rather than to obtain a posterior sample. By performing collapsed Gibbs sampling over the marginal distributions, the intractabilities in the posterior can be avoided. After collapsed Gibbs sampling, the document-topic probabilities and topic-term probabilities can be estimated by their maximum posterior value.

Due to the use of conjugate Dirichlet priors48, θ and Φ are marginalized over.

For term t, only its topic zi needs to be sampled according to the conditional distri- bution39,49:

$$ P(z_i = k \mid \mathbf{z}_{(i)}, \mathcal{W}) \;\propto\; \frac{n^{(t)}_{(i),k} + \beta}{\sum_{j=1}^{W} n^{(j)}_{k} - 1 + W\beta} \cdot \frac{n^{(k)}_{(i),d} + \alpha}{\sum_{v=1}^{K} n^{(v)}_{d} - 1 + K\alpha}, \qquad (5.2) $$

where:

• $\mathbf{z}_{(i)}$ is the topic assignments for all terms, except the current term, t

• $n^{(t)}_{(i),k}$ is the number of times the term t is assigned to topic k, excluding the current topic assignment

• $n^{(k)}_{(i),d}$ is the number of terms in document d assigned to topic k, excluding the current topic assignment

• $\sum_{j=1}^{W} n^{(j)}_{k} - 1$ is the total number of terms assigned to topic k, excluding the current topic assignment

• $\sum_{v=1}^{K} n^{(v)}_{d} - 1$ is the total number of terms in document d, excluding the current term, t

Algorithm 5.1 Collapsed Gibbs sampling algorithm for LDA.
Input: number of topics K, Dirichlet hyperparameters α and β, collection of terms $\mathcal{W}$
Output: $\hat{\theta}$, $\hat{\Phi}$
  Randomly assign a topic to every term t ∈ $\mathcal{W}$
  for iter = 1 to N do
    for each document, d ∈ [1, D] do
      for each term, t, in document d do
        Update the topic assignment of t according to Equation 5.2
      end for
    end for
    Update $n^{(k)}_{d}$, $n^{(w)}_{k}$
  end for
  Compute estimates for θ and Φ using Equation 5.3 and Equation 5.4, respectively

The collapsed Gibbs sampling algorithm for LDA (Algorithm 5.1) can be summarised as follows49. To begin, all terms in the corpus are randomly assigned a topic. For each iteration, 1 to N, the topic assignments of each term within each document are updated sequentially according to Equation 5.2. After N iterations, the number of times document d was assigned to topic k, $n^{(k)}_{d}$, and the number of times term w of the vocabulary was assigned to topic k, $n^{(w)}_{k}$, are counted. Finally, maximum conditional posterior estimates for θ and Φ are computed using49:

$$ \hat{\theta}_{d,k} = \frac{n^{(k)}_{d} + \alpha}{\sum_{v=1}^{K} n^{(v)}_{d} + K\alpha} \qquad (5.3) $$

$$ \hat{\phi}_{k,w} = \frac{n^{(w)}_{k} + \beta}{\sum_{j=1}^{W} n^{(j)}_{k} + W\beta} \qquad (5.4) $$
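Algorithm 5.1 together with Equations 5.2 to 5.4 can be expressed compactly in code. The following illustrative Python implementation on a toy corpus is a sketch of the method, not the topicmodels routine actually used in this thesis; the function name lda_gibbs and all toy values are ours:

```python
import numpy as np

def lda_gibbs(docs, K, alpha, beta, W, n_iter=50, seed=0):
    """Minimal collapsed Gibbs sampler for LDA (Algorithm 5.1).

    docs: list of documents, each a list of term indices in [0, W).
    Returns (theta_hat, phi_hat) as in Equations 5.3 and 5.4.
    """
    rng = np.random.default_rng(seed)
    D = len(docs)
    n_dk = np.zeros((D, K))   # terms in document d assigned to topic k
    n_kw = np.zeros((K, W))   # times term w is assigned to topic k
    n_k = np.zeros(K)         # total terms assigned to topic k
    z = [rng.integers(K, size=len(doc)) for doc in docs]  # random initial topics

    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = z[d][i]
            n_dk[d, k] += 1; n_kw[k, w] += 1; n_k[k] += 1

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]
                # Remove the current assignment before resampling.
                n_dk[d, k] -= 1; n_kw[k, w] -= 1; n_k[k] -= 1
                # Conditional distribution of Equation 5.2 (up to a constant).
                p = (n_kw[:, w] + beta) / (n_k + W * beta) * (n_dk[d] + alpha)
                k = rng.choice(K, p=p / p.sum())
                z[d][i] = k
                n_dk[d, k] += 1; n_kw[k, w] += 1; n_k[k] += 1

    theta_hat = (n_dk + alpha) / (n_dk.sum(axis=1, keepdims=True) + K * alpha)
    phi_hat = (n_kw + beta) / (n_kw.sum(axis=1, keepdims=True) + W * beta)
    return theta_hat, phi_hat

# Toy corpus over a vocabulary of 4 terms.
docs = [[0, 0, 1], [0, 1, 1, 0], [2, 3, 2], [3, 2, 3, 3]]
theta_hat, phi_hat = lda_gibbs(docs, K=2, alpha=0.5, beta=0.1, W=4)
```

The document-side denominator of Equation 5.2 is constant in k and is dropped before normalising; the −1 terms of Equation 5.2 correspond to removing the current assignment before resampling.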

5.2.3 Perplexity

Perplexity is a method of evaluating language model performance on held-out data and is equivalent to the geometric mean per-word profile likelihood39:

$$ \text{Perplexity}(C_H \mid \hat{\theta}_H, \hat{\Phi}) = \exp\left\{ -\frac{\log P(C_H \mid \hat{\theta}_H, \hat{\Phi})}{\sum_{d=1}^{D} \sum_{w=1}^{W} n^{(d,w)}} \right\}, $$

with the profile likelihood in the numerator calculated using50:

$$ \log P(C_H \mid \hat{\theta}_H, \hat{\Phi}) = \sum_{d=1}^{D} \sum_{w=1}^{W} n^{(d,w)} \log\left\{ \sum_{k=1}^{K} \hat{\theta}_{H_{d,k}}\, \hat{\phi}_{k,w} \right\}, $$

where:

• CH is the held-out corpus

• $\hat{\theta}_H$ contains the document-topic proportions of the held-out data, estimated by repeating the collapsed Gibbs sampling procedure while keeping $\hat{\Phi}$ fixed

• n(d,w) is the number of times the wth term occurs in the dth document

A low perplexity indicates that the model performs well as it is unsurprised by the term co-occurrences in the held-out data.
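Given estimated parameters and held-out counts, the perplexity above reduces to a few lines. This illustrative Python sketch uses hypothetical toy values for the estimates and the count matrix; none of the numbers come from the thesis:

```python
import numpy as np

def perplexity(n_dw, theta_H, phi):
    """Perplexity of a held-out D x W count matrix n_dw, given estimated
    document-topic proportions theta_H (D x K) and topic-term matrix phi (K x W)."""
    p_dw = theta_H @ phi                   # P(w | d): mix topics within documents
    log_lik = np.sum(n_dw * np.log(p_dw))  # profile log-likelihood
    return np.exp(-log_lik / n_dw.sum())   # geometric mean per-word measure

# Hypothetical estimates: two documents, two topics, three vocabulary terms.
theta_H = np.array([[0.9, 0.1], [0.2, 0.8]])
phi = np.array([[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]])
n_dw = np.array([[3, 1, 0], [0, 2, 4]])
print(perplexity(n_dw, theta_H, phi))  # lower is better
```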

5.3 Biterm Topic Model (BTM)

With short texts such as tweets, if each tweet is treated as a single document, document-term co-occurrences can become extremely sparse and LDA can encounter difficulties learning topic-term relationships51. Some attempts to overcome this hurdle involved the creation of longer pseudo-documents through the aggregation of users' tweets52 or the aggregation of tweets sharing common terms51 prior to using LDA. These methods have shown some improvement over LDA, but each subjective aggregation has its own disadvantages.

In 2013, the biterm topic model (BTM)53 was introduced, which attempted to address the inadequacies of using LDA on short documents through the modelling of global term co-occurrences rather than at the document level. Today, BTM is one of the most well-known methods for topic modelling with short texts, with many variations building off of the core ideas of BTM such as FastBTM54 and GraphBTM55.

5.3.1 Model outline

To begin, biterms are extracted from the individual documents using a specified window size. The extracted biterms are unordered term pairs occurring within the document and the specified window size. For example, a single document containing the terms ABCB, using a window size of 3, would result in two text windows, ABC and BCB, and produce the biterms AB, AC, BC, BC, CB, and BB. As the term pairs are unordered, the biterm quantities can be summarised as: AB (n = 1), AC (n = 1), BC (n = 3), and BB (n = 1).
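The extraction step can be sketched as follows. This illustrative Python function (our naming, not the interface of the BTM package used later) reproduces the worked example above:

```python
from collections import Counter
from itertools import combinations

def extract_biterms(doc, window=15):
    """Extract unordered biterms from a document (a list of terms) by pairing
    all terms that co-occur within a sliding window."""
    n = len(doc)
    if n <= window:
        windows = [doc]  # short documents form a single window
    else:
        windows = [doc[i:i + window] for i in range(n - window + 1)]
    biterms = Counter()
    for w in windows:
        for a, b in combinations(w, 2):
            biterms[tuple(sorted((a, b)))] += 1  # unordered: CB counts as BC
    return biterms

# The worked example from the text: document ABCB with window size 3
# yields AB (n = 1), AC (n = 1), BC (n = 3), and BB (n = 1).
print(extract_biterms(list("ABCB"), window=3))
```

Note that the BC pair shared by the two overlapping windows is counted once per window, which is what gives BC its count of 3 in the example.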

The BTM is constructed under the premise that terms co-occurring together frequently likely belong to the same topic. Therefore, the first step in learning the latent topic components is to model the generation of biterms. It is assumed that the topic is sampled from a mixture of topics over the entire corpus and each biterm is drawn from a specific topic independently, i.e. both terms are drawn from the same topic46.

Consider a corpus of D documents containing $N_B$ biterms and covering K topics over W unique terms in the vocabulary.

Notation:

• The collection of biterms is denoted $B = \{b_i\}_{i=1}^{N_B}$.

• The individual biterms are denoted $b_i = (w_{i,1}, w_{i,2})$, where $w_{i,1}$ and $w_{i,2}$ are the terms of the biterm.

• z ∈ [1, K] is a topic indicator variable.

• The distribution of topics within the corpus, P(z), is represented by a K-dimensional multinomial distribution, $\theta = \{\theta_k\}_{k=1}^{K}$, where $\theta_k = P(z = k)$ and $\sum_{k=1}^{K} \theta_k = 1$.

• The distribution of terms within topics, P(w | z), is represented by the K × W matrix, Φ, with entries $\phi_{k,w} = P(w \mid z = k)$, where $\sum_{w=1}^{W} \phi_{k,w} = 1$.

Like LDA, BTM relies on the use of symmetric, conjugate Dirichlet priors. Let α and β be the two hyperparameters for the Dirichlet priors. α represents the topic uniformity within the corpus, which in turn affects the topic uniformity within documents – a low value of α coincides with the belief that documents cover few topics. β represents the term uniformity within topics – a low value of β coincides with the belief that topics use few terms. The generative process of BTM is as follows:

1. Draw the global topic distribution, θ ∼ Dirichlet(α)

2. For each topic, k ∈ [1,K]:

(a) Draw the topic-term distribution, ϕk ∼ Dirichlet(β)

3. For each biterm, bi ∈ B:

(a) Draw a topic assignment, zi ∼ Multinomial(θ)

(b) Draw two terms from the topic assignment, $w_{i,1}, w_{i,2} \sim \text{Multinomial}(\phi_{z_i})$

For simplicity, it is assumed that biterms are generated independently. A graphical representation of the generative process of BTM can be found in Figure 5.2.

[Plate diagram: α → θ → z → (w_i, w_j) ← ϕ_k ← β, with plates over b ∈ [1, N_B] and k ∈ [1, K].]

Figure 5.2: Graphical model representation of BTM. Nodes in the graph represent random variables; shaded nodes are observed variables. Plates denote replication, with the number of replicates given in the bottom right corner of the plate.

The probability of the biterm, $b_i = (w_{i,1}, w_{i,2})$, conditioned on the model parameters θ and Φ is given by:

$$ \begin{aligned} P(b_i \mid \theta, \Phi) &= \sum_{k=1}^{K} P(w_{i,1}, w_{i,2}, z_i = k \mid \theta, \Phi) \\ &= \sum_{k=1}^{K} P(z_i = k \mid \theta_k)\, P(w_{i,1} \mid z_i = k, \phi_{k,w_{i,1}})\, P(w_{i,2} \mid z_i = k, \phi_{k,w_{i,2}}) \\ &= \sum_{k=1}^{K} \theta_k\, \phi_{k,w_{i,1}}\, \phi_{k,w_{i,2}} \end{aligned} $$

Given the hyperparameters α and β, the probability of the biterm, bi, is obtained by integrating over θ and Φ:

$$ \begin{aligned} P(b_i \mid \alpha, \beta) &= \iint P(b_i \mid \theta, \Phi)\, P(\theta \mid \alpha)\, P(\Phi \mid \beta)\, d\theta\, d\Phi \\ &= \iint \sum_{k=1}^{K} \theta_k\, \phi_{k,w_{i,1}}\, \phi_{k,w_{i,2}}\, P(\theta \mid \alpha)\, P(\Phi \mid \beta)\, d\theta\, d\Phi \end{aligned} $$

The likelihood of the entire corpus, B, can be obtained by multiplying the probabilities of the individual biterms:

$$ P(B \mid \alpha, \beta) = \iint \prod_{i=1}^{N_B} \sum_{k=1}^{K} \theta_k\, \phi_{k,w_{i,1}}\, \phi_{k,w_{i,2}}\, P(\theta \mid \alpha)\, P(\Phi \mid \beta)\, d\theta\, d\Phi \qquad (5.5) $$

The likelihood of the corpus in BTM (Equation 5.5) shares similarities with the likelihood of the corpus in LDA (Equation 5.1), except that BTM models biterm generation at the corpus level while LDA models term generation at the document level.

5.3.2 Parameter estimation

As with LDA, the solving of the coupled parameters θ and Φ through direct maximization of the likelihood function is ill-posed, non-identifiable, and intractable (Section 5.2.2). Therefore, approximate inference is carried out using collapsed Gibbs sampling.

Like LDA, the collapsed Gibbs sampler is used as an optimization tool rather than to obtain a posterior sample. Due to the use of conjugate Dirichlet priors46,48, θ and Φ are marginalized over. For biterm $b_i$, only its topic $z_i$ needs to be sampled according to the conditional distribution46,56:

$$ P(z_i = k \mid \mathbf{z}_{(i)}, B) \;\propto\; (n_{(i),k} + \alpha)\, \frac{\left(n_{(i),\,w_{i,1} \mid k} + \beta\right)\left(n_{(i),\,w_{i,2} \mid k} + \beta\right)}{\left(\sum_{j=1}^{W} n_{(i),\,j \mid k} + W\beta + 1\right)\left(\sum_{j=1}^{W} n_{(i),\,j \mid k} + W\beta\right)}, \qquad (5.6) $$

where:

• z(i) is the topic assignments for all biterms, excluding bi

• n(i), k is the number of biterms assigned to topic k, excluding bi

• n(i), j | k is the number of times term j is assigned to topic k, excluding bi

Algorithm 5.2 Collapsed Gibbs sampling algorithm for BTM.
Input: number of topics K, Dirichlet hyperparameters α and β, collection of biterms B
Output: $\hat{\theta}$, $\hat{\Phi}$
  Randomly assign a topic to every biterm $b_i = (w_{i,1}, w_{i,2})$ ∈ B
  for iter = 1 to N do
    for each biterm, $b_i$, do
      Update the topic assignment of $b_i$ according to Equation 5.6
      Update $n_k$, $n_{w_{i,1} \mid k}$, $n_{w_{i,2} \mid k}$
    end for
  end for
  Compute estimates for θ and Φ using Equation 5.7 and Equation 5.8, respectively

The collapsed Gibbs sampling algorithm for BTM (Algorithm 5.2) is summarised as follows46. To begin, all biterms in the corpus are randomly assigned a topic. For each iteration, 1 to N, the topic assignments of each biterm are updated sequentially according to Equation 5.6. After N iterations, the number of biterms in each topic, $n_k$, and the number of times term w was assigned to topic k, $n_{w \mid k}$, are counted. Finally, maximum conditional posterior estimates for θ and Φ are computed using46:

$$ \hat{\theta}_k = \frac{n_k + \alpha}{N_B + K\alpha} \qquad (5.7) $$

$$ \hat{\phi}_{k,w} = \frac{n_{w \mid k} + \beta}{\sum_{j=1}^{W} n_{j \mid k} + W\beta} \qquad (5.8) $$
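Algorithm 5.2 and Equations 5.6 to 5.8 can likewise be sketched in a few lines. The following illustrative Python implementation on a toy set of pooled biterms is not the C++ implementation wrapped by the BTM package; the function name btm_gibbs and all toy values are ours:

```python
import numpy as np

def btm_gibbs(biterms, K, alpha, beta, W, n_iter=50, seed=0):
    """Minimal collapsed Gibbs sampler for BTM (Algorithm 5.2).

    biterms: list of (w1, w2) term-index pairs pooled over the corpus.
    Returns (theta_hat, phi_hat) as in Equations 5.7 and 5.8.
    """
    rng = np.random.default_rng(seed)
    NB = len(biterms)
    n_k = np.zeros(K)            # biterms assigned to topic k
    n_wk = np.zeros((K, W))      # term assignments within topics
    z = rng.integers(K, size=NB) # random initial topic per biterm

    for i, (w1, w2) in enumerate(biterms):
        n_k[z[i]] += 1
        n_wk[z[i], w1] += 1
        n_wk[z[i], w2] += 1

    for _ in range(n_iter):
        for i, (w1, w2) in enumerate(biterms):
            k = z[i]
            # Remove the current biterm's assignment before resampling.
            n_k[k] -= 1; n_wk[k, w1] -= 1; n_wk[k, w2] -= 1
            # Conditional distribution of Equation 5.6.
            tot = n_wk.sum(axis=1)
            p = (n_k + alpha) * (n_wk[:, w1] + beta) * (n_wk[:, w2] + beta) \
                / ((tot + W * beta + 1) * (tot + W * beta))
            k = rng.choice(K, p=p / p.sum())
            z[i] = k
            n_k[k] += 1; n_wk[k, w1] += 1; n_wk[k, w2] += 1

    theta_hat = (n_k + alpha) / (NB + K * alpha)
    phi_hat = (n_wk + beta) / (n_wk.sum(axis=1, keepdims=True) + W * beta)
    return theta_hat, phi_hat

# Toy corpus of pooled biterms over a vocabulary of 4 terms.
biterms = [(0, 1), (0, 1), (1, 0), (2, 3), (3, 2), (2, 3)]
theta_hat, phi_hat = btm_gibbs(biterms, K=2, alpha=1.0, beta=0.1, W=4)
```

Unlike the LDA sampler, there is no per-document state: a single global θ is shared by all biterms, mirroring the corpus-level modelling described above.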

5.3.3 Document-topic proportions

As the BTM models the generative process of biterms within the corpus, unlike LDA which models the generative process of terms within documents, the topic proportions of documents in BTM cannot be found directly. Instead, it is assumed that the topic proportions of a document equals the expectation of the topic proportions of biterms generated from the document46,53:

$$ P(z \mid d) = \sum_{b_i \in B_d} P(z \mid b_i)\, P(b_i \mid d), $$

where $B_d$ is the collection of biterms found in document d. The first term of the right-hand side, $P(z \mid b_i)$, is obtained using Bayes' formula with the estimated parameters of the model:

$$ P(z = k \mid b_i) = \frac{P(z)\, P(w_{i,1} \mid z)\, P(w_{i,2} \mid z)}{\sum_{z} P(z)\, P(w_{i,1} \mid z)\, P(w_{i,2} \mid z)} = \frac{\hat{\theta}_k\, \hat{\phi}_{k,w_{i,1}}\, \hat{\phi}_{k,w_{i,2}}}{\sum_{k=1}^{K} \hat{\theta}_k\, \hat{\phi}_{k,w_{i,1}}\, \hat{\phi}_{k,w_{i,2}}}, $$

and the quantity $P(b_i \mid d)$ is estimated empirically as:

$$ \widehat{P}(b_i \mid d) = \frac{n_d(b_i)}{\sum_{b \in d} n_d(b)}, $$

where $n_d(b)$ is the number of times biterm b appears in document d.
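The inference above amounts to a weighted average of per-biterm topic posteriors. An illustrative Python sketch, using hypothetical estimates for θ and Φ rather than fitted values:

```python
import numpy as np
from collections import Counter

def doc_topic_proportions(doc_biterms, theta_hat, phi_hat):
    """Infer P(z | d) for one document from its biterms, as in Section 5.3.3:
    mix the posteriors P(z | b_i) by the empirical biterm frequencies P(b_i | d)."""
    counts = Counter(doc_biterms)
    total = sum(counts.values())
    p_z = np.zeros(len(theta_hat))
    for (w1, w2), n in counts.items():
        joint = theta_hat * phi_hat[:, w1] * phi_hat[:, w2]  # P(z) P(w1|z) P(w2|z)
        p_z += (joint / joint.sum()) * (n / total)           # P(z|b_i) P(b_i|d)
    return p_z

# Hypothetical estimates over two topics and four vocabulary terms.
theta_hat = np.array([0.6, 0.4])
phi_hat = np.array([[0.5, 0.3, 0.1, 0.1],
                    [0.1, 0.1, 0.4, 0.4]])
# A document dominated by the biterm (0, 1) leans towards topic 0.
print(doc_topic_proportions([(0, 1), (0, 1), (2, 3)], theta_hat, phi_hat))
```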

5.3.4 Log-likelihood

Perplexities should not be used with BTM models, unlike LDA models, as LDA optimizes the likelihood of word co-occurrences within documents, while BTM optimizes the likelihood of biterm occurrences within the corpus46. For BTM models, the log-likelihood of held-out data is used and can be calculated as35:

$$ \ell(B_H \mid \hat{\theta}, \hat{\Phi}) = \sum_{b_i \in B_H} \log\left\{ \sum_{k=1}^{K} \hat{\theta}_k\, \hat{\phi}_{k,w_{i,1}}\, \hat{\phi}_{k,w_{i,2}} \right\}, $$

where:

• $B_H$ is the collection of held-out biterms

• each biterm can be expressed as $b_i = (w_{i,1}, w_{i,2})$

A higher log-likelihood indicates that the model performs better as it is unsurprised by the biterm occurrences in the held-out data.
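The held-out log-likelihood is a direct translation of the formula above. A small illustrative Python sketch with hypothetical estimates (not fitted values from the thesis):

```python
import numpy as np

def btm_log_likelihood(heldout_biterms, theta_hat, phi_hat):
    """Held-out log-likelihood of a BTM fit: for each biterm, sum
    theta_k * phi[k, w1] * phi[k, w2] over topics, then take logs."""
    ll = 0.0
    for w1, w2 in heldout_biterms:
        ll += np.log(np.sum(theta_hat * phi_hat[:, w1] * phi_hat[:, w2]))
    return ll

# Hypothetical estimates over two topics and four vocabulary terms.
theta_hat = np.array([0.6, 0.4])
phi_hat = np.array([[0.5, 0.3, 0.1, 0.1],
                    [0.1, 0.1, 0.4, 0.4]])
print(btm_log_likelihood([(0, 1), (2, 3)], theta_hat, phi_hat))  # higher is better
```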

5.4 Comparison of LDA and BTM models

The performance of LDA and BTM models can be compared through the quality of the learned topics. Topic models can often learn nonsense topics or topics that do not agree with human judgements, especially as the number of topics in a topic model grows. The quality of the topics learned by LDA and BTM models can be assessed by calculating topic coherence scores using57:

$$ C\!\left(t; V^{(t)}\right) = \sum_{m=2}^{M} \sum_{l=1}^{m-1} \log \frac{D\!\left(v^{(t)}_m, v^{(t)}_l\right) + 1}{D\!\left(v^{(t)}_l\right)}, $$

where:

• t is the topic

• $V^{(t)} = \left(v^{(t)}_1, \ldots, v^{(t)}_M\right)$ is the list of the M most probable terms of topic t

• D(v) is the number of documents where term v appears (the document frequency of v)

• D(v, v′) is the number of documents where terms v and v′ appear together (the co-document frequency of v and v′)

Informally, the coherence score posits that a topic is coherent if its M most probable terms co-occur in documents frequently. As such, a higher coherence score indicates that a topic is more coherent. While both LDA and BTM are constructed under the premise that terms co-occurring frequently likely belong to the same topic, as LDA encounters difficulties in learning topic-term relationships in shorter documents, it is unsurprising that BTM often achieves higher coherence scores than LDA53,58.
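The coherence score can be computed directly from document and co-document frequencies. The following illustrative Python sketch (the toy corpus and topic term lists are ours) shows that a topic whose top terms co-occur scores higher than one mixing unrelated terms:

```python
import numpy as np
from itertools import combinations

def umass_coherence(top_terms, docs):
    """Coherence score for one topic's M most probable terms, computed from
    document frequencies over `docs` (each a list of terms)."""
    doc_sets = [set(d) for d in docs]
    def D1(v):            # document frequency of v
        return sum(v in d for d in doc_sets)
    def D2(v, w):         # co-document frequency of v and w
        return sum(v in d and w in d for d in doc_sets)
    score = 0.0
    for m in range(1, len(top_terms)):
        for l in range(m):
            score += np.log((D2(top_terms[m], top_terms[l]) + 1)
                            / D1(top_terms[l]))
    return score

docs = [["covid", "mask", "wash"], ["mask", "wear", "covid"],
        ["vote", "mail", "ballot"], ["covid", "mask"]]
# A coherent topic: its top terms frequently co-occur in documents.
print(umass_coherence(["covid", "mask", "wash"], docs))
# A less coherent topic mixing unrelated terms scores lower.
print(umass_coherence(["covid", "ballot", "wash"], docs))
```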

5.5 Methods

5.5.1 LDA

Restarting from the cleaned data obtained at the end of the data preparation step (Chapter 2), further processing was performed prior to model fitting. Stopwords were removed according to a custom stopword list, in order to decrease the number of potential terms in the vocabulary. This custom stopword list was derived from

the snowball lexicon of stopwords found within the tidytext22 package, with the following modifications:

• Negation words were removed from the stopword list, e.g., not, can't, don't, against, etc.

• Remaining contraction stopwords that do not become words when the apostrophe is missing were added to the stopword list, e.g., heres, theyll, hows, etc. This was done to account for the fact that punctuation is not mandatory on Twitter and to reduce vocabulary size. As an example, words such as ill, hell, well, originating from i'll, he'll, and we'll, were not added to the stopword list.

The remaining terms were then stemmed using the SnowballC33 package, according to the Porter stemming algorithm, once again in an effort to reduce the vocabulary size. For example, terms such as want, wants, and wanted, would all become want.

The hyperparameters that required tuning were K (number of topics), α (document- topic uniformity), and β (topic-term uniformity). Due to the size of the data, tuning of hyperparameters using cross-validation was not feasible. Instead, an 80/10/10 training/validation/testing split was made. In order to evaluate the trained models on the validation and testing sets, it was imperative that the validation and testing sets did not contain terms that were not seen in the training phase. Therefore, terms across the three splits that appeared in five or fewer tweets were removed.
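The splitting and rare-term filtering can be sketched as follows. This is an illustrative Python version of the procedure described above (the five-or-fewer-tweets threshold comes from the text; the function name, seed, and shuffling details are ours):

```python
import random
from collections import Counter

def split_and_prune(tweets, min_tweets=6, seed=42):
    """80/10/10 train/validation/test split, then drop any term appearing
    in fewer than `min_tweets` tweets across the whole collection, so the
    validation and test sets contain no terms unseen in training.
    `tweets` is a list of token lists."""
    rng = random.Random(seed)
    shuffled = tweets[:]
    rng.shuffle(shuffled)
    n = len(shuffled)
    splits = (shuffled[:int(0.8 * n)],
              shuffled[int(0.8 * n):int(0.9 * n)],
              shuffled[int(0.9 * n):])
    # Document frequency of each term over all tweets.
    df = Counter(term for tweet in shuffled for term in set(tweet))
    keep = {term for term, count in df.items() if count >= min_tweets}
    return tuple([[t for t in tweet if t in keep] for tweet in split]
                 for split in splits)

# Toy collection: "rareword" appears in a single tweet and is pruned.
tweets = [["covid", "mask", "wash"] for _ in range(19)] + [["covid", "rareword"]]
train, valid, test = split_and_prune(tweets)
```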

The R package topicmodels39 was used to fit the LDA models via collapsed Gibbs sampling. The tuning of the hyperparameters was as follows:

1. Fix K while varying α and β. For all combinations of α and β:

(a) Train a LDA model using training data.

(b) Evaluate model on validation data to obtain a perplexity.

(c) Collect value of α and β that resulted in the lowest perplexity value.

2. Vary K while fixing α and β at the values obtained from step 1. For each value of K:

(a) Train a LDA model using training data.

(b) Evaluate model on validation data to obtain a perplexity.

(c) Collect value of K that resulted in the lowest perplexity value.

3. Evaluate final model performance by obtaining a perplexity using testing data.

Out of consideration for time and the number of models that were required to be trained, 50 iterations of collapsed Gibbs sampling were performed for each model.

5.5.2 BTM

The same processed text data and its corresponding splits used to fit the LDA models were reused for fitting the BTM models. The R package BTM35 (a wrapper for the original C++ implementation of the BTM by Yan et al.53) was used to perform the biterm topic modelling. The biterm window size was kept at the default value of 15 as it was expected that most tweets would contain 15 or fewer terms following the removal of custom stopwords and rare terms in the data processing step. The hyperparameters that required tuning were K (number of topics), α (corpus-topic uniformity), and β (topic-term uniformity). Due to the size of the data, the tuning of hyperparameters with cross-validation was not feasible and a tuning process similar to that of LDA was taken:

1. Fix K while varying α and β. For all combinations of α and β:

(a) Train a BTM model using training data.

(b) Evaluate model on validation data to obtain a log-likelihood.

(c) Collect value of α and β that resulted in the highest log-likelihood value.

2. Vary K while fixing α and β at the values obtained from step 1. For each value of K:

(a) Train a BTM model using training data.

(b) Evaluate model on validation data to obtain a log-likelihood.

(c) Collect value of K that resulted in the highest log-likelihood value.

3. Evaluate final model performance using testing data.

Out of consideration for time and the number of models that were required to be trained, 50 iterations of collapsed Gibbs sampling were performed for each model.

5.6 Results

5.6.1 LDA

Model tuning

In the first step of the tuning process, the following values were used:

K = 10, α = {1/K, 50/K, 100/K}, β = {0.001, 0.01, 0.1}

Nine models were trained on the training data and evaluated on the validation data to obtain perplexities (Figure 5.3). With K = 10 fixed, the model with the lowest perplexity was the one using α = 100/K and β = 0.1.

In the second step of the tuning process, the following values were used:

K = {20, 50, 100, 200, 300, 400, 500}, α = 100/K, β = 0.1

[Figure: perplexity on the validation set plotted against α, with one line per value of β; K = 10.]

Figure 5.3: Perplexities for the nine models trained in the first tuning step, evaluated on the validation set. K, the number of topics, was fixed while varying values of α and β. The hyperparameters corresponding to the model with the lowest perplexity were α = 100/K and β = 0.1.

Seven models were trained on the training data and evaluated on the validation data to obtain perplexities (Figure 5.4). The perplexities continued to decrease as the number of topics (K) increased, and it is possible that they would have continued to decrease for values of K greater than 500. However, the perplexity for the K = 500 model could not be calculated on either the validation set or the testing set due to insufficient memory. Given these memory constraints, it was decided that no additional models would be trained, and the K = 500 model was taken to be the optimal model. Henceforth, “model” will refer to the K = 500 model unless otherwise specified.

[Figure: perplexity on the validation set plotted against K, with α = 100/K and β = 0.1.]

Figure 5.4: Perplexities for six of the seven models trained in the second tuning step, plus the best model from the first tuning step, evaluated on the validation set. α and β were fixed, while varying over values of K. Perplexities involving the K = 500 model were unable to be calculated due to memory constraints.

Document-topic allocations

As each document is assumed to contain a mixture of topics, document-topic propor- tions were treated as topic allocations and summed for each topic by month. Monthly topic allocations were then scaled by dividing by the number of tweets available for each month, so as to not artificially inflate topic allocations due to having more tweets in a given month.

The five topics with the highest scaled allocations overall were highlighted in Figure 5.5. Using Figure 5.7 as a guide, the five topics shown in Figure 5.5 were labelled.

[Figure: monthly scaled allocations (Mar–Nov) for the five highlighted topics: Korean boy-band Enhypen; postponement of Indian NEET and JEE entrance exams; voting methods in the USA presidential election; police brutality; USA presidential election.]

Figure 5.5: Monthly document-topic allocations scaled by the number of tweets available per month. The five topics highlighted attained the highest scaled allocation values overall.

For the months of March and April, all five of the highlighted topics had extremely low scaled allocation values. Throughout May and June, the topic of police brutality had the highest scaled allocation, corresponding to the murder of George Floyd and the subsequent protests against police brutality towards Black people. The postponement of the Indian NEET and JEE entrance exams attained the highest scaled allocation in the month of August, followed immediately by a sharp decline in subsequent months. In September, the topic with the highest scaled allocation concerned the swarming of the Korean boy-band Enhypen by fans after the leak of their private flight schedule. Finally, topics discussing the USA presidential election and voting methods saw slow increases in scaled allocation from month to month, but eventually attained the highest scaled allocations for the months of October and November, coinciding with the election which took place in November 2020.

The scaled topic allocations were ranked for each month and the top three topics for each month were displayed in Figure 5.6. Once again, using Figure 5.7 as a guide, the topics that appeared in Figure 5.6 were labelled. Interestingly, there were two topics in the months of June and July that gave advice to the general public. While the top terms for both topics were quite similar, one topic placed an emphasis on hand-washing, while the other placed an emphasis on wearing masks.

[Figure: nine monthly panels (Mar–Nov), each showing the three topics with the highest scaled allocations, e.g. daily case reports, prayers, and the pandemic in India (Mar); police brutality and general pandemic advice (Jun); the USA presidential election, voting methods, and the reliability of COVID-19 tests (Oct–Nov).]

Figure 5.6: Monthly top three topics by scaled allocations.

[Figure: bar charts of topic-term probabilities for the top 20 stemmed terms of each labelled topic, including police brutality, criticising of Trump, primary/secondary and post-secondary education, postponement of Indian NEET and JEE entrance exams, vaccine, prayers, pandemic in India, general pandemic advice (hand-washing; mask-wearing), USA presidential election, reliability of COVID-19 tests, daily case reports, voting methods, Korean boy-band Enhypen, China, shopping, essential workers, and Trump rallies.]

Figure 5.7: Top 20 most probable terms for topics appearing in Figure 5.5 and Figure 5.6.

5.6.2 BTM

Model tuning

In the first step of the tuning process, the following values were used:

K = 10, α = {1/K, 50/K, 100/K}, β = {0.001, 0.01, 0.1}

[Figure: log-likelihood on the validation set plotted against α, with one line per value of β; K = 10.]

Figure 5.8: Log-likelihoods for the nine models trained in the first tuning step, evaluated on the validation set. K, the number of topics, was fixed at 10 while varying values of α and β. The hyperparameters corresponding to the model with the highest log-likelihood were α = 50/K and β = 0.1.

Nine models were trained on the training data and evaluated on the validation data to obtain log-likelihoods (Figure 5.8). With K = 10 fixed, the model with the highest log-likelihood was the one using α = 50/K and β = 0.1.
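The selection logic of this first tuning step can be sketched as a grid search that keeps the (α, β) pair maximising the validation log-likelihood. The actual analysis fitted BTM models in R; the Python sketch below substitutes a toy scoring function (shaped so that α = 50/K and β = 0.1 win, matching Figure 5.8) for the expensive model fit.

```python
from itertools import product

K = 10
alphas = [1 / K, 50 / K, 100 / K]
betas = [0.001, 0.01, 0.1]

def validation_log_likelihood(alpha, beta):
    # Stand-in for training a BTM on the training set and scoring it on the
    # validation set; in the real analysis this call is an expensive model fit.
    return -2.607e9 + 1e4 * beta - 1e2 * abs(alpha - 50 / K)

# Fit one model per (alpha, beta) pair; keep the pair with the highest score.
best = max(product(alphas, betas), key=lambda ab: validation_log_likelihood(*ab))
print(best)  # (5.0, 0.1), i.e. alpha = 50/K and beta = 0.1
```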

In the second step of the tuning process, the following values were used:

K = {20, 50, 100, 200, 300, 400, 500}, α = 50/K, β = 0.1

[Figure 5.9: log-likelihoods (×10^9) against K, with α = 50/K and β = 0.1.]

Figure 5.9: Log-likelihoods for the seven models trained in the second tuning step, plus the best model from the first tuning step, evaluated on the validation set (black) and the testing set (green). α and β were fixed while varying over values of K.

Seven models were trained on the training data and evaluated on the validation data to obtain log-likelihoods (Figure 5.9). The log-likelihoods continued to increase with the number of topics (K), and might well have kept increasing for values of K greater than 500. However, because training the K = 500 model took slightly more than two weeks, no additional models were trained. The K = 500 model was therefore taken to be the optimal model and evaluated on the testing data to obtain a final log-likelihood. Henceforth, "model" will refer to the K = 500 model unless otherwise specified.

Document-topic allocations

As each document is assumed to contain a mixture of topics, document-topic proportions were treated as topic allocations and summed for each topic by month. Monthly topic allocations were then scaled by dividing by the number of tweets available for each month, so that months with more tweets would not have artificially inflated allocations.
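The scaling just described can be sketched as follows. The topic names, proportions, and tiny toy data are purely illustrative; the real computation was carried out in R over the fitted model's full document-topic matrix.

```python
from collections import defaultdict

# Toy document-topic proportions: (month, {topic: proportion}) per tweet.
tweets = [
    ("Mar", {"election": 0.7, "shopping": 0.3}),
    ("Mar", {"election": 0.2, "shopping": 0.8}),
    ("Apr", {"election": 0.1, "shopping": 0.9}),
]

totals = defaultdict(lambda: defaultdict(float))  # month -> topic -> summed proportion
counts = defaultdict(int)                         # month -> number of tweets
for month, theta in tweets:
    counts[month] += 1
    for topic, p in theta.items():
        totals[month][topic] += p

# Divide by the month's tweet count so busier months are not inflated.
scaled = {m: {t: s / counts[m] for t, s in topics.items()}
          for m, topics in totals.items()}
print(round(scaled["Mar"]["election"], 2))  # 0.45
```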

[Figure 5.10: monthly scaled topic allocations from March to November, with five topics highlighted: general pandemic advice, shopping, voting methods in USA presidential election, postponement of Indian NEET and JEE entrance exams, and USA presidential election.]

Figure 5.10: Monthly document-topic allocations scaled by the number of tweets available per month. The five topics highlighted obtained the highest scaled allocation values overall.

The five topics with the highest scaled allocations overall were highlighted in Figure 5.10. Using Figure 5.12 and Figure 5.13 as guides, the five topics shown in Figure 5.10 were labelled.

In March, the highest scaled allocation was attributed to the topic discussing the American presidential election. Its biterm cluster (Figure 5.12) suggested that discussions were motivated by Trump's handling of the COVID-19 pandemic, the responsibilities (word stem: respons) of a capable president, and support for the opposing candidate, Joe Biden. As the months passed, the topic of the American presidential election slowly saw an increase in its scaled allocation, eventually reaching the highest monthly scaled allocation for three straight months (September to November). Unsurprisingly, another election-related topic, which discussed voting methods, saw slow growth in its monthly scaled allocation until the month of the election (November), when it attained the second highest scaled allocation.

The topics of general pandemic advice and shopping saw large growth in scaled allocation from March to April, and maintained the highest scaled allocations from April to July, eventually being overtaken by the election topics. Discussions on the postponement of the Indian NEET and JEE entrance exams saw a sharp increase in scaled allocation in August, attaining the highest scaled allocation for that month. Immediately after, this topic saw a sharp decrease in scaled allocation, eventually becoming indistinguishable from the other unpopular topics.

The scaled topic allocations were ranked for each month and the top three topics for each month were displayed in Figure 5.11. Using Figure 5.12 and Figure 5.13 once again as guides, the topics that appeared in Figure 5.11 were labelled.

From the biterm cluster of the topic discussing the lack of masks at Republican events (Figure 5.12), the biterm with the highest number of occurrences was no mask. Due to the presence of this biterm, one would likely have described this topic as tweets in support of not wearing masks. However, further analysis of the tweets with the highest allocations for this topic revealed that it arose from tweeters calling out the lack of masks at Republican events. In June, this topic was the third most discussed topic and coincided with Trump's rally in Tulsa, Oklahoma.

[Figure 5.11: top three topics by scaled allocation for each month.
Mar: USA presidential election; economic impact of pandemic; social distancing enforcement and violations.
Apr: general pandemic advice; shopping; pandemic frustrations.
May: general pandemic advice; shopping; USA presidential election.
Jun: postponement of Indian NEET and JEE entrance exams; general pandemic advice; no masks at Republican events.
Jul: general pandemic advice; shopping; USA presidential election.
Aug: general pandemic advice; shopping; USA presidential election.
Sep: USA presidential election; voting methods in USA presidential election; economic impact of pandemic.
Oct: USA presidential election; general pandemic advice; economic impact of pandemic.
Nov: USA presidential election; general pandemic advice; pandemic frustrations.]

Figure 5.11: Monthly top three topics by scaled allocations.

For visual clarity, the textplot_bitermclusters function from the textplot [36] package requires terms and biterms to appear at most once when plotting multiple topics in a single call, in order to prevent overlapping clusters. A term that is common to more than one of the plotted topics is reserved for the single topic with the highest topic-term probability for that term. An edge joining two terms into a biterm is only drawn within a topic cluster if both component terms appear in the cluster and if the biterm had the highest number of assignments towards that particular topic, in comparison to the other plotted topics.

In Figure 5.12, the topics of "social distancing enforcement and violations", "shopping", "no masks at Republican events" and "general pandemic advice" all had social and distanc appear in their top 20 terms (Figure 5.13). Out of these four topics, social and distanc both had the highest topic-term probabilities in the topic of "social distancing enforcement and violations" and therefore appeared as nodes in this cluster. The biterm social distanc, however, had the most assignments towards the topic "general pandemic advice". As a result, the nodes social and distanc in the "social distancing enforcement and violations" cluster were not joined by an edge. For the sake of convenience and visual aesthetics, the clusters in Figure 5.12 were plotted in a single call rather than plotting the individual component plots separately and stitching them back together.
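The term de-duplication rule described above can be sketched as follows. This is an illustration of the rule only, not the textplot source code, and the topic labels and probabilities are invented for the example.

```python
# phi: topic -> {term: topic-term probability}; illustrative values in which
# both shared terms have their highest probability under "social distancing".
phi = {
    "social distancing": {"social": 0.08, "distanc": 0.09},
    "general advice":    {"social": 0.06, "distanc": 0.05},
}

# Each shared term is "owned" by (drawn only in) the topic cluster where its
# topic-term probability is highest.
owner = {}
for topic, terms in phi.items():
    for term, p in terms.items():
        if term not in owner or p > phi[owner[term]][term]:
            owner[term] = topic

print(owner["social"])   # social distancing
print(owner["distanc"])  # social distancing
```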

[Figure 5.12: biterm cluster graphs for the BTM with K = 500, covering the topics USA presidential election, no masks at Republican events, economic impact of pandemic, social distancing enforcement and violations, voting methods in USA presidential election, shopping, pandemic frustrations, postponement of Indian NEET and JEE entrance exams, and general pandemic advice.]

Figure 5.12: Biterm clusters for topics appearing in Figure 5.10 and Figure 5.11. Each cluster is an undirected graph as biterms are unordered pairs of terms. Increased node (term) size corresponds to higher topic-term probability. Increased thickness and darkness of edges (links) corresponds to higher co-occurrences within the topic. As biterms were computed using a window size of 15, adjacent nodes may not have necessarily appeared adjacently in the original text.

[Figure 5.13: bar charts of topic-term probabilities for the topics voting methods in USA presidential election, pandemic frustrations, economic impact of pandemic, postponement of Indian NEET and JEE entrance exams, shopping, no masks at Republican events, social distancing enforcement and violations, USA presidential election, and general pandemic advice.]

Figure 5.13: Top 20 most probable terms for topics appearing in Figure 5.10 and Figure 5.11.

5.6.3 Topic quality

[Figure 5.14: mean coherence scores against the number of top words M for the BTM and LDA models with K = 500.]

Figure 5.14: Mean coherence scores and 95% t-intervals for the 500 topics of the BTM and LDA models using the top M = {5, 10, 15, 20} terms of each topic, evaluated on the testing data. A higher coherence score means that a topic is more coherent. Using a two-sample t-test, the mean coherence of topics learned by BTM was found to be significantly (p < 0.001) greater than the mean coherence of topics learned by LDA for each of the four values of M considered here.

For each topic within each model, the top M = {5, 10, 15, 20} terms were used to calculate the coherence scores. For each value of M and each model, the coherence scores of all topics evaluated on the testing data were averaged, with a 95% t-interval constructed around the centre (Figure 5.14). For small M, the topic qualities of BTM and LDA were comparable. However, as the number of top words increased, the difference in mean coherence between the models became more apparent. The individual topic coherence scores of BTM also exhibited much less variability than those of LDA. By a two-sample t-test, the mean topic coherence of BTM was found to be significantly (p < 0.001) greater than the mean topic coherence of LDA for each of the four values of M considered here. This was consistent with the results presented in Yan et al. [53].
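The coherence score averaged here is that of Mimno et al. [57]: for a topic's top-M terms v_1, ..., v_M ordered by decreasing probability, C = Σ_{m=2..M} Σ_{l<m} log((D(v_m, v_l) + 1)/D(v_l)), where D counts how many evaluation documents contain the given term(s). A minimal sketch on toy data (assuming every top term occurs in at least one document, so D(v_l) > 0):

```python
import math

def coherence(top_terms, documents):
    """UMass coherence of a topic's top terms over a set of documents."""
    doc_sets = [set(d) for d in documents]
    def D(*terms):
        # Number of documents containing all of the given terms.
        return sum(all(t in d for t in terms) for d in doc_sets)
    score = 0.0
    for m in range(1, len(top_terms)):
        for l in range(m):
            score += math.log((D(top_terms[m], top_terms[l]) + 1) / D(top_terms[l]))
    return score

toy_docs = [["mask", "wear", "safe"], ["mask", "wear"], ["vote", "ballot"]]
print(round(coherence(["mask", "wear", "safe"], toy_docs), 4))  # 0.4055
```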

5.7 Discussion

It was seen that three of the five overall top topics from March to November were shared by LDA and BTM: the USA presidential election, voting methods in the USA presidential election, and the postponement of the Indian entrance exams. The differences in topic allocations between the two models became apparent when looking at the top three topics for each month. In LDA, many of the top monthly topics were not the same as the overall top topics, whereas for BTM, many of the top monthly topics were the same as the overall top topics. Nonetheless, the use of both LDA and BTM has provided insightful information into the topics discussed in these COVID-19 tweets.

The application of topic modelling to Twitter data comes with some caveats. When using Twitter data, it is assumed that the collected data has semantic value. Unfortunately, regardless of the number of steps taken to filter out tweets of low quality, it is unlikely that all such cases will be found, especially for data of this size. Unknowingly retaining a large number of low-quality tweets will likely have a negative impact on the learned topics.

As the most important part of topic modelling is to allow humans to interpret the resulting topics, interpretation may not always be possible when topics arise from short, vague text. While Twitter users can collectively discuss specific topics that are successfully learned by the topic models, a lack of context combined with the short nature of representative tweets can add to the difficulty of interpreting a topic, especially for outsiders.

To date, LDA has few visualizations that aid users in interpreting topics. Because BTM models term co-occurrences at the corpus level, biterm topic clusters such as those shown in Figure 5.12 prove to be indispensable visual aids for interpreting the learned topics. In conclusion, as the topics learned by BTM were, on average, more coherent than those learned by LDA (Figure 5.14), and easier to interpret with the aid of the textplot [36] package, BTM is clearly superior to LDA when working with short-text data.

Chapter 6

Conclusions

In this thesis, many numerical and graphical methods were presented for analysing informal, unstructured short-text data pertaining to the COVID-19 pandemic. This included analysing term frequencies and term co-frequencies, sentiment analysis, and the training of topic models in order to better understand the topics of discussion throughout the global pandemic. In regards to topic modelling, it was concluded that the topics learned by BTM were, on average, more coherent than the topics learned by LDA. In spite of these promising results, there are a few naturally arising shortcomings to consider with topic models such as LDA and BTM.

First, held-out data cannot contain terms that were not seen in the training step. In the setting of Twitter data, this often involves removing terms that appeared in too few documents, which in many cases are unique hashtags or proper nouns. However, it is left to the user to decide how many is "too few", and the resulting cut-off point can be somewhat arbitrary, varying between data sets. The possibility of out-of-vocabulary terms also means that previously trained models may not be re-usable for new text data of a similar theme without extensive pre-processing of the new text.
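A minimal sketch of such a document-frequency cut-off is given below. The threshold and toy documents are illustrative; the thesis applied the equivalent filtering in R during pre-processing.

```python
from collections import Counter

def prune_vocabulary(documents, min_docs=2):
    """Drop terms appearing in fewer than min_docs documents and restrict
    each document to the surviving vocabulary."""
    # Document frequency: count each term once per document it appears in.
    df = Counter(term for doc in documents for term in set(doc))
    vocab = {t for t, n in df.items() if n >= min_docs}
    return vocab, [[t for t in doc if t in vocab] for doc in documents]

train = [["covid", "mask", "#rarehashtag"], ["covid", "mask"], ["covid", "vote"]]
vocab, pruned = prune_vocabulary(train)
print(sorted(vocab))  # ['covid', 'mask']
print(pruned)         # [['covid', 'mask'], ['covid', 'mask'], ['covid']]
```

Held-out or new documents would be filtered against the same `vocab`, which is exactly why terms unseen in training (the out-of-vocabulary problem above) cannot be scored.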

Second, the number of topics must be chosen before fitting LDA or BTM models.

While the true number of topics remains unknown, using too few topics will result in unhelpful and difficult-to-interpret topic clusters. On the other hand, using too many topics can be computationally prohibitive. In addition, using too many topics defeats the purpose of topic modelling, as the number of topics to interpret is no longer manageable.

Finally, there is a lack of justification for the use of Dirichlet priors other than mathematical convenience.

Future work

In 2018, a novel approach to topic modelling was introduced by Gerlach et al. [59], which incorporated a community detection technique, known as stochastic block modelling, into the realm of topic modelling. In that study, it was concluded that their model, the hierarchical stochastic block model (hSBM), was able to outperform LDA on the basis of minimum description length [60] on both long and short-text data. In addition, the need to choose the number of topics was eliminated, as this is determined automatically, and the unjustified Dirichlet priors were replaced through the formulation of new priors. As the results of hSBM appear extremely promising, it is of interest to compare the performance of hSBM with BTM on short-text data.

Bibliography

[1] World Health Organization. Coronavirus disease (COVID-19). URL https://www.who.int/news-room/q-a-detail/coronavirus-disease-covid-19.

[2] Johns Hopkins University & Medicine. Johns Hopkins coronavirus resource center. Johns Hopkins Coronavirus Resource Center. URL https://coronavirus.jhu.edu/.

[3] Rob McLean, Laura He, and Anneken Tappe. Dow plunges 1,000 points, posting its worst day in two years as coronavirus fears spike. CNN. URL https://www.cnn.com/2020/02/23/business/stock-futures-coronavirus/index.html.

[4] Coronavirus scare: Complete list of airlines suspending flights. India Today. URL https://www.indiatoday.in/lifestyle/travel/story/coronavirus-scare-complete-list-of-airlines-suspending-flights-1650574-2020-02-27.

[5] James Brownsell. ILO: Half of all workers risk losing jobs due to virus. URL https://www.aljazeera.com/economy/2020/4/29/half-the-worlds-workers-face-losing-their-jobs-says-ilo.

[6] Jessica Guynn. Coronavirus fears empty store shelves of toilet paper, bottled water, masks as shoppers stock up. USA TODAY. URL https://www.usatoday.com/story/money/2020/02/28/coronavirus-2020-preparation-more-supply-shortages-expected/4903322002/.

[7] Nylah Burton. The coronavirus exposes the history of racism and "cleanliness". Vox, Feb 2020. URL https://www.vox.com/2020/2/7/21126758/coronavirus-xenophobia-racism-china-asians.

[8] Fears of coronavirus trigger anti-China sentiment worldwide. koreatimes, Feb 2020. URL http://www.koreatimes.co.kr/www/world/2021/01/683_282767.html.

[9] Claire C. Miller. Could the pandemic wind up fixing what's broken about work in America? The New York Times, Apr 2020. ISSN 0362-4331. URL https://www.nytimes.com/2020/04/10/upshot/coronavirus-future-work-america.html.

[10] Ian Swanson. Five ways the coronavirus could change American politics. The Hill, May 2020. URL https://thehill.com/homenews/campaign/495761-five-ways-the-coronavirus-could-change-american-politics.

[11] Brett Molina. Twitter overcounted active users since 2014, shares surge on profit hopes. USA TODAY. URL https://www.usatoday.com/story/tech/news/2017/10/26/twitter-overcounted-active-users-since-2014-shares-surge/801968001/.

[12] R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2020. URL https://www.R-project.org/.

[13] Michael W. Kearney. rtweet: Collecting and analyzing Twitter data. Journal of Open Source Software, 4(42):1829, 2019. doi: 10.21105/joss.01829. URL https://joss.theoj.org/papers/10.21105/joss.01829. R package version 0.7.0.

[14] Travers Ching. qs: Quick Serialization of R Objects, 2020. URL https://CRAN.R-project.org/package=qs. R package version 0.23.3.

[15] Hadley Wickham, Romain François, Lionel Henry, and Kirill Müller. dplyr: A Grammar of Data Manipulation, 2020. URL https://CRAN.R-project.org/package=dplyr. R package version 1.0.2.

[16] Hadley Wickham. ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York, 2016. ISBN 978-3-319-24277-4. URL https://ggplot2.tidyverse.org.

[17] Lionel Henry and Hadley Wickham. purrr: Functional Programming Tools, 2020. URL https://CRAN.R-project.org/package=purrr. R package version 0.3.4.

[18] Henrik Bengtsson. A unifying framework for parallel and distributed processing in R using futures, Aug 2020. URL https://arxiv.org/abs/2008.00553.

[19] Davis Vaughan and Matt Dancho. furrr: Apply Mapping Functions in Parallel using Futures, 2020. URL https://CRAN.R-project.org/package=furrr. R package version 0.2.1.

[20] Kirill Müller and Hadley Wickham. tibble: Simple Data Frames, 2020. URL https://CRAN.R-project.org/package=tibble. R package version 3.0.4.

[21] Hadley Wickham. tidyr: Tidy Messy Data, 2020. URL https://CRAN.R-project.org/package=tidyr. R package version 1.1.2.

[22] Julia Silge and David Robinson. tidytext: Text mining and analysis using tidy data principles in R. JOSS, 1(3), 2016. doi: 10.21105/joss.00037. URL http://dx.doi.org/10.21105/joss.00037.

[23] Hadley Wickham. stringr: Simple, Consistent Wrappers for Common String Operations, 2019. URL https://CRAN.R-project.org/package=stringr. R package version 1.4.0.

[24] Jeroen Ooms. cld2: Google's Compact Language Detector 2, 2018. URL https://CRAN.R-project.org/package=cld2. R package version 1.2.

[25] Jeroen Ooms. cld3: Google's Compact Language Detector 3, 2020. URL https://CRAN.R-project.org/package=cld3. R package version 1.3.

[26] Garrett Grolemund and Hadley Wickham. Dates and times made easy with lubridate. Journal of Statistical Software, 40(3):1–25, 2011. URL https://www.jstatsoft.org/v40/i03/.

[27] Kamil Slowikowski. ggrepel: Automatically Position Non-Overlapping Text Labels with 'ggplot2', 2020. URL https://CRAN.R-project.org/package=ggrepel. R package version 0.8.2.

[28] Stefan Milton Bache and Hadley Wickham. magrittr: A Forward-Pipe Operator for R, 2020. URL https://CRAN.R-project.org/package=magrittr. R package version 2.0.1.

[29] Ian Fellows. wordcloud: Word Clouds, 2018. URL https://CRAN.R-project.org/package=wordcloud. R package version 2.6.

[30] Erich Neuwirth. RColorBrewer: ColorBrewer Palettes, 2014. URL https://CRAN.R-project.org/package=RColorBrewer. R package version 1.1-2.

[31] Tyler W. Rinker. lexicon: Lexicon Data. Buffalo, New York, 2018. URL http://github.com/trinker/lexicon. version 1.2.1.

[32] Tyler W. Rinker. sentimentr: Calculate Text Polarity Sentiment. Buffalo, New York, 2019. URL http://github.com/trinker/sentimentr. version 2.7.1.

[33] Milan Bouchet-Valat. SnowballC: Snowball Stemmers Based on the C 'libstemmer' UTF-8 Library, 2020. URL https://CRAN.R-project.org/package=SnowballC. R package version 0.7.0.

[34] Max Kuhn, Fanny Chow, and Hadley Wickham. rsample: General Resampling Infrastructure, 2020. URL https://CRAN.R-project.org/package=rsample. R package version 0.0.8.

[35] Jan Wijffels. BTM: Biterm Topic Models for Short Text, 2020. URL https://CRAN.R-project.org/package=BTM. R package version 0.3.4.

[36] Jan Wijffels. textplot: Text Plots, 2020. URL https://CRAN.R-project.org/package=textplot. R package version 0.1.4.

[37] Thomas Lin Pedersen. ggraph: An Implementation of Grammar of Graphics for Graphs and Networks, 2020. URL https://CRAN.R-project.org/package=ggraph. R package version 2.0.2.

[38] Joël Gombin, Ramnath Vaidyanathan, and Vladimir Agafonkin. concaveman: A Very Fast 2D Concave Hull Algorithm, 2020. URL https://CRAN.R-project.org/package=concaveman. R package version 1.1.0.

[39] Bettina Grün and Kurt Hornik. topicmodels: An R package for fitting topic models. Journal of Statistical Software, 40(13):1–30, 2011. doi: 10.18637/jss.v040.i13.

[40] Hiroaki Yutani. gghighlight: Highlight Lines and Points in 'ggplot2', 2020. URL https://CRAN.R-project.org/package=gghighlight. R package version 0.3.1.

[41] Enhypen reportedly left hurt, scared, & crying after being mobbed by "fans" on first airport trip. Koreaboo, Sep 2020. URL https://www.koreaboo.com/news/enhypen-reportedly-left-hurt-scared-crying-after-mobbed-fans-first-airport-trip/.

[42] Matthew L. Jockers. Syuzhet: Extract Sentiment and Plot Arcs from Text, 2015. URL https://github.com/mjockers/syuzhet.

[43] Minqing Hu and Bing Liu. Mining and summarizing customer reviews. Seattle, Washington, 2004.

[44] Saif M. Mohammad and Patrick D. Turney. Emotions evoked by common words and phrases: Using Mechanical Turk to create an emotion lexicon, 2010.

[45] David M. Blei, Andrew Y. Ng, and Michael I. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3(4):993–1022, 2003.

[46] Xueqi Cheng, Xiaohui Yan, Yanyan Lan, and Jiafeng Guo. BTM: Topic modeling over short texts. IEEE Transactions on Knowledge and Data Engineering, 26:2928–2941, 2014.

[47] Xuerui Wang and Andrew McCallum. Topics over time: A non-Markov continuous-time model of topical trends. Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 424–433, 2006. doi: 10.1145/1150402.1150450.

[48] Jun S. Liu. The collapsed Gibbs sampler in Bayesian computations with applications to a gene regulation problem. Journal of the American Statistical Association, 89(427):958–966, Sep 1994. ISSN 0162-1459, 1537-274X. doi: 10.1080/01621459.1994.10476829.

[49] X.-H. Phan, L.-M. Nguyen, and S. Horiguchi. Learning to classify short and sparse text & web with hidden topics from large-scale data collections. Proceedings of the 17th International Conference on World Wide Web, pages 91–100, Apr 2008. doi: 10.1145/1367497.1367510.

[50] David Newman, Arthur Asuncion, Padhraic Smyth, and Max Welling. Distributed algorithms for topic models. Journal of Machine Learning Research, pages 1801–1828, 2009.

[51] Liangjie Hong and Brian D. Davison. Empirical study of topic modeling in Twitter. pages 80–88, 2010. doi: 10.1145/1964858.1964870.

[52] Jianshu Weng, Ee-Peng Lim, Jing Jiang, and Qi He. TwitterRank: Finding topic-sensitive influential twitterers. pages 261–270, 2010. doi: 10.1145/1718487.1718520.

[53] Xiaohui Yan, Jiafeng Guo, Yanyan Lan, and Xueqi Cheng. A biterm topic model for short texts. pages 1445–1456, May 2013. doi: 10.1145/2488388.2488514.

[54] Xingwei He, Hua Xu, Jia Li, Liu He, and Linlin Yu. FastBTM: Reducing the sampling time for biterm topic model. Knowledge-Based Systems, 132:11–20, 2017. doi: 10.1016/j.knosys.2017.06.005.

[55] Qile Zhu, Zheng Feng, and Xiaolin Li. GraphBTM: Graph enhanced autoencoded variational inference for biterm topic model. pages 4663–4672, 2018. URL https://www.aclweb.org/anthology/D18-1495.pdf.

[56] Xueqi Cheng, Xiaohui Yan, Yanyan Lan, and Jiafeng Guo. Supplemental material of BTM: Topic modeling over short texts, 2014. URL http://xiaohuiyan.github.io/paper/BTM-TKDE-supplemental.pdf.

[57] David Mimno, Hanna Wallach, Edmund Talley, Miriam Leenders, and Andrew McCallum. Optimizing semantic coherence in topic models. pages 262–272, Jul 2011. URL https://www.aclweb.org/anthology/D11-1024.

[58] Elías Jónsson and Jake Stolee. An evaluation of topic modelling techniques for Twitter.

[59] Martin Gerlach, Tiago P. Peixoto, and Eduardo G. Altmann. A network approach to topic models. Science Advances, 4(7):eaaq1360, 2018.

[60] Jorma Rissanen. Modeling by shortest data description. Automatica, 14(5):465–471, 1978.
