COVID-19 Twitter Monitor
COVID-19 Twitter Monitor: Aggregating and visualizing COVID-19 related trends in social media

Joseph Cornelius∗, Tilia Ellendorff†‡, Lenz Furrer†‡, Fabio Rinaldi∗†‡
∗Dalle Molle Institute for Artificial Intelligence Research (IDSIA)
†Swiss Institute of Bioinformatics
‡University of Zurich, Department of Computational Linguistics
fjoseph.cornelius,[email protected] ftilia.ellendorff, [email protected]

Abstract

Social media platforms offer extensive information about the development of the COVID-19 pandemic and the current state of public health. In recent years, the Natural Language Processing community has developed a variety of methods to extract health-related information from posts on social media platforms. For these techniques to be used by a broad public, they must be aggregated and presented in a user-friendly way. We have aggregated ten methods to analyze tweets related to the COVID-19 pandemic, and present interactive visualizations of the results on our online platform, the COVID-19 Twitter Monitor. In the current version of our platform, we offer distinct methods for inspecting the dataset at different levels: corpus-wide, single post, and spans within each post. In addition, we allow different methods to be combined to enable a more selective acquisition of knowledge. The visual and interactive combination of various methods can reveal interconnections between their outputs.

1 Introduction

Today, social media platforms are important sources of information, used by over 70% of adults in the USA and continuously growing in popularity.¹ Platforms like Twitter are characterized by their thematic diversity and real-time coverage of worldwide events. As a result, social media platforms offer information especially about ongoing events that fluctuate locally and dynamically, such as the SARS-CoV-2 (COVID-19) pandemic.
The micro-posts contain not only the latest news or announcements of new medical findings but also reports about personal well-being and the sentiment towards public health interventions. Information contained in tweets has multiple uses. For instance, domain experts could discover how recent scientific studies are echoed in social media, the health industry could find patients' reactions to certain drugs, and the general public could observe trends in popular topics (e.g. the alternating popularity of politicians over time). To provide universal access to knowledge from public discourse about COVID-19 on Twitter, social media mining methods have to be easily applicable.

¹ https://www.pewresearch.org/internet/fact-sheet/social-media/

Figure 1: A single COVID-19 related tweet annotated with our drug brand name detection system and classified by our systems for personal health mention (PHM) identification and sentiment analysis.

Figure 2: A schematic of the data acquisition and processing pipeline of our web platform.

We present a platform called COVID-19 Twitter Monitor that analyzes, categorizes, and extracts textual information from posts on Twitter. Our platform provides insights into the development and spread of the COVID-19 pandemic. The information is retrieved at different levels: spans within each post (e.g. drug name detection), single post (e.g. text language identification), and corpus level (e.g. distribution of hashtags). We use a variety of methods to analyze pandemic-related micro-posts with a focus on health issues. In contrast to existing platforms, we aim to provide a platform for newly emerging diseases for which no disease-specific supervised datasets are available.
Therefore, our approach concentrates on unsupervised and pre-trained supervised methods that can be expected to generalize to unseen conditions. We have created an interactive web interface that integrates these methods and thus simplifies the use of social media mining methods. When visualizing the results, different methods are combined to display underlying connections in the extracted information. With this approach, we hope to make the contained information more accessible and provide an increased level of insight. Since this is the first version of our web platform,² we give only an initial description of our approach and the preliminary results.

2 Related Work

Numerous studies show that social media platforms are particularly suitable for obtaining health-related information, such as identifying drug abuse or tracking infectious diseases (Woo et al., 2016; Paul et al., 2016; McGough et al., 2017; Cocos et al., 2017). The detection of infectious disease outbreaks using data from Twitter has already been tested, for instance in the case of the Ebola virus (Odlum and Yoon, 2015) and the Zika virus (Juric et al., 2017; Mamidi et al., 2019). Paul et al. (2016) propose using social media data for disease surveillance by locally determining and forecasting influenza prevalence. They address obstacles that the typically informal text on social media platforms presents to common language processing systems, and stress the need for normalized medical terminology and general linguistic normalization for social media. In addition, they argue that sentiment analysis is particularly suitable for analyzing patients' impressions of the quality of health care or public opinion on certain drugs. Likewise, Mamidi et al. (2019) apply sentiment analysis to tweets concerning virus outbreaks to gain information for disease control. Joshi et al. (2019) describe how to identify disease outbreaks using text-based data. Like Wang et al.
(2014), they show how topic modeling can be used to discover health-related topics in the micro-posts. Twitter-adapted topic models were also used by Paul and Dredze (2011) to conduct syndromic surveillance of the flu. Lamb et al. (2013) showed that more intricate data retrieval techniques can be applied to noisy Twitter data, such as identifying posts that contain a personal health mention (PHM). They also conducted more fine-grained analyses, for example to determine whether a tweet concerns the health of the author or of a person known to the author. Going even further, Yin et al. (2015) showed that PHM classifiers for Twitter can operate in a disease-agnostic manner; they trained their system on four disease contexts and achieved a precision of 0.77 on 34 other health issues.

² https://covid19smm.nlp.idsia.ch

                      Total number   Unique   Tweets containing
URLs                        372228   283887   323760 (72.76%)
Domains                     372228    37282   323760 (72.76%)
Hashtags                    380744    93454   136279 (30.63%)
Languages                   425721       53   425721 (95.67%)
RxNorm Brand Names            1285      101     1218 (0.27%)
Preprint Paper                 152      124      149 (0.03%)
Tweets                      444990   444990   –

Table 1: Statistics of the data collection, originally 500K tweets, after similarity filtering. For each feature, the total number of occurrences, the number of distinct occurrences, and the total number of tweets containing the feature are displayed.

Several platforms have already explored using social media data for health surveillance and disease tracking. Lee et al. (2013) use Twitter data for influenza and cancer surveillance but focus on the computation of various distributions, e.g. the temporal change in quantity or the most frequent words of influenza tweets. In contrast, HealthMap³ (Brownstein et al., 2008) lets users explore various disease outbreaks at the corpus and post level, but primarily uses Google News as a data source.
Systems such as Flutrack⁴ (Chorianopoulos and Talvis, 2016) and Influenza Observations and Forecast⁵ focus instead on map-based identification of infectious disease outbreaks.

3 Data

We use the COVID-19 Twitter dataset published by the Panacea Lab (Banda et al., 2020). The dataset consists of daily Twitter chatter (about 4M tweets per day), collected via the Twitter Stream API for the keywords “COVD19”, “CoronavirusPandemic”, “COVID-19”, “2019nCOV”, “CoronaOutbreak”, “coronavirus”, and “WuhanVirus”. In addition, the Panacea Lab extended the dataset over time with Twitter datasets from research groups at the Universitat Autònoma de Barcelona, the National Research University Higher School of Economics, and the Kazan Federal University. As of June 14, 2020, version 14.0 of the dataset contains over 400 million unique tweets. Since the data has been collected irrespective of language, all languages are covered, with a pronounced prevalence of English (60%), Spanish (16%), and French (4%) tweets. An update is provided every two days and a cumulative update is provided every week. We use the cleaned version of the dataset provided by the Panacea Lab, with all retweets filtered out.

4 Methods and Results

In this section, we describe the different methods used by our platform. We start by creating a randomly selected subset of the Panacea Lab's COVID-19 Twitter dataset.⁶ We then apply the preprocessing steps described below to this subset, followed by the data analysis methods described in this section. Subsequently, we store the obtained results and the original data subset in a database, which our interactive web platform accesses to visualize the results, as illustrated in Figure 2.

4.1 Pre-Processing

For tweets used for topic modeling, sentiment analysis, and PHM identification, our system applies the following preprocessing methods:

• All tweets are tokenized with the spaCy⁷ tokenizer, without sentence splitting.
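
As a rough illustration of the pipeline outlined at the start of Section 4 (random downsampling, preprocessing, and storage in a database), the following minimal Python sketch uses only the standard library. It is not the platform's actual implementation: the function names `downsample`, `tokenize`, and `store`, the table schema, and the use of SQLite are assumptions made for this example, and the whitespace split is only a stand-in for the spaCy tokenizer named above.

```python
import random
import sqlite3


def downsample(tweets, k, seed=0):
    """Draw k tweets uniformly at random, without replacement."""
    rng = random.Random(seed)  # fixed seed keeps the subset reproducible
    return rng.sample(tweets, k)


def tokenize(text):
    """Placeholder tokenizer: whitespace split, NOT the spaCy tokenizer."""
    return text.split()


def store(conn, tweets):
    """Store each tweet with its tokens (hypothetical schema)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS tweets "
        "(id INTEGER PRIMARY KEY, text TEXT, tokens TEXT)"
    )
    conn.executemany(
        "INSERT INTO tweets (id, text, tokens) VALUES (?, ?, ?)",
        [(tid, text, " ".join(tokenize(text))) for tid, text in tweets],
    )
    conn.commit()


# Toy corpus of (id, text) pairs standing in for the full dataset.
corpus = [(i, f"tweet number {i} about #COVID19") for i in range(1000)]

subset = downsample(corpus, k=100)
conn = sqlite3.connect(":memory:")  # in-memory database for the example
store(conn, subset)
stored = conn.execute("SELECT COUNT(*) FROM tweets").fetchone()[0]
```

Keeping the tokenizer behind a single function means the placeholder could be swapped for a real tokenizer without touching the sampling or storage steps.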