Using a Recurrent Neural Network Model for Classification of Tweets Conveying Influenza-related Information

Chen-Kai Wang1, Onkar Singh2,3, Zhao-Li Tang1 and Hong-Jie Dai4,5*

1 Graduate Institute of Biomedical Informatics, Taipei Medical University, Taipei, Taiwan, R.O.C.
2 Bioinformatics Program, Taiwan International Graduate Program, Institute of Information Science, Academia Sinica, Taipei, Taiwan, R.O.C.
3 Institute of Biomedical Informatics, National Yang-Ming University, Taipei, Taiwan, R.O.C.
4 Department of Computer Science and Information Engineering, National Taitung University, Taitung, Taiwan, R.O.C.
5 Interdisciplinary Program of Green and Information Technology, National Taitung University, Taitung, Taiwan, R.O.C.

Abstract

Traditional disease surveillance systems depend on outpatient reporting and virological test results released by hospitals. These data provide valid and accurate information about emerging outbreaks, but they are often not timely. In recent years, the exponential growth of users connected to social media has provided immense knowledge about epidemics through the sharing of related information, and social media can now flag more immediate concerns related to outbreaks in real time. In this paper we apply the long short-term memory (LSTM) recurrent neural network architecture to classify tweets conveying influenza-related information and compare its performance with baseline algorithms including the support vector machine (SVM), decision tree, naïve Bayes, simple logistic, and naïve Bayes multinomial. The developed RNN model achieved an F-score of 0.845 on the MedWeb task test set, which outperforms the F-score of SVM without the synthetic minority oversampling technique by 0.08. The F-score of the RNN model is within 1% of the highest score, achieved by SVM with the oversampling technique.

1 Introduction

With the popularity of the WWW, the use of data mining techniques to analyze the big data generated by users provides a feasible way to identify and explore health-related information. For example, Ginsberg et al. (2009) utilized Google search query logs to develop models for influenza surveillance. With the recent increased use of social media platforms, users can communicate with each other by updating their status, which has led to the wide and timely sharing of personal information. The information spread over these platforms is now a valuable resource for building social sensors and developing real-time event detection systems for events such as earthquakes (Sakaki, Okazaki, & Matsuo, 2010) and abuse of medications (Sarker, O'Connor, et al., 2016).

A review conducted by Charles-Smith et al. (2015) demonstrated evidence that the use of social media data can provide real-time surveillance of health issues, speed up outbreak management, identify target populations necessary to support, and even improve public health and intervention outcomes. Facebook, microblogs, blogs, and discussion forums are examples of such media. Among them, Twitter, the leading micro-blogging platform, has become the primary data source for digital disease surveillance and outbreak management. The

* Corresponding author

Proceedings of the International Workshop on Digital Disease Detection using Social Media 2017 (DDDSM-2017), pages 33–38, Taipei, Taiwan, November 27, 2017. © 2017 AFNLP

platform has been used for creating first-hand reports of adverse drug events (Bian, Topaloglu, & Yu, 2012). Shared tasks (Aramaki, Wakamiya, Morita, Kano, & Ohkuma, 2017; Sarker, Nikfarjam, & Gonzalez, 2016) and hackathon-style competitions (Adam, Jonnagaddala, Chughtai, & Macintyre, 2017) for digital disease detection or biosurveillance are also emerging. The rationale behind social media-based surveillance systems rests on the assumption that target events occurring in the real world are immediately reflected on social media. Therefore, systems that aggregate and determine the degree of related information from social media can monitor or even forecast current or future outbreak events.

Iso, Wakamiya, and Aramaki (2016) demonstrated that words such as "fever" present clues for upcoming influenza outbreaks. They concluded that an approximately 16-day time lag exists between the frequency of the word "fever" mentioned in tweets and the number of influenza patients announced by the infectious disease surveillance center in Japan. Although their results are promising, the use of word-level information is noisy and impedes precise influenza surveillance. Take the following two tweets, described in the work of Aramaki, Maskawa, and Morita (2011), as an example.

"Headache? You might have flu."

"The World Health Organization reports the avian influenza, or bird flu, epidemic has spread to nine Asian countries in the past few weeks."

Although both tweets mention "flu", they apparently do not indicate that any influenza patient is present nearby. Therefore, classifiers that categorize diseases/symptoms related to influenza are required in order to build an accurate influenza forecasting model. With this in mind, we formulate the task as a classification problem and employ several baseline algorithms and recurrent neural networks (RNNs) to develop our models. The performance of the developed models is evaluated on a corpus annotated with eight diseases/symptoms: influenza, cold, hay fever, diarrhea, headache, cough, fever and runny nose.

2 Method

2.1 Preprocessing

In the preprocessing step, we normalize all web links and usernames into "@URL" and "@REF", respectively. The part-of-speech tagger developed by Gimpel et al. (2011) is then used to tokenize the tweets, followed by removing the hashtag symbol "#" from its attached keywords or topics. Finally, we follow the numeric normalization procedure proposed by Dai, Touray, Wang, Jonnagaddala, and Syed-Abdul (2016) to normalize all numeral parts of each token into "1".

2.2 Network Architecture

The network architecture used in this study is a recurrent model consisting of an embedding layer and a bi-directional RNN layer, followed by a dense layer that computes the posterior probabilities for each disease or symptom. Figure 1 illustrates the architecture.

2.3 Embedding Layer

The pre-trained word vectors for Twitter generated by GloVe (Pennington, Socher, & Manning, 2014) were used to initialize the embedding layer. These 200-dimensional vectors were trained on two billion tweets and can be downloaded from the project website2. For words that cannot be found in the pre-trained vectors, we initialized the embeddings with values close to zero.

2.4 Recurrent Neural Network and Dense Layer

RNNs have proven to be very powerful models for many natural language tasks (Mesnil, He, Deng, & Bengio, 2013; Mikolov, Karafiát, Burget, Černocký, & Khudanpur, 2010). This work uses the long short-term memory (LSTM) network, a special kind of RNN capable of learning long-term dependencies, to implement the RNN layer. Like traditional RNNs, LSTM networks have the form of a chain of repeating cells, and the LSTM uses gates to control the information flow inside the network. In Figure 1, assume that the input to the first cell in the forward LSTM is the tth token. The cell uses Equation 1 to output a value by considering the value h_{t−1} generated by the previous cell and the current input x_t.

f_t = σ(W_f · [h_{t−1}, x_t] + b_f)    (1)

2 https://nlp.stanford.edu/projects/glove/
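The normalization steps of Section 2.1 can be sketched in Python as follows. The exact regular expressions are our assumptions, not the authors' published rules; in particular, the hashtag and number patterns are illustrative:

```python
import re

def normalize_tweet(text):
    """Sketch of the Section 2.1 normalization: usernames and web links
    become placeholder tokens, the hashtag symbol is stripped from its
    keyword, and numeral parts are collapsed to "1"."""
    text = re.sub(r"@\w+", "@REF", text)          # usernames -> @REF (done first so @URL is untouched)
    text = re.sub(r"https?://\S+", "@URL", text)  # web links -> @URL
    text = re.sub(r"#(\w+)", r"\1", text)         # drop the "#" but keep the topic word
    text = re.sub(r"\d+", "1", text)              # numeric normalization
    return text

print(normalize_tweet("@bob I had a 39.2C fever, see https://t.co/xyz #flu"))
```

Running the tagger from Gimpel et al. (2011) would happen between these steps in the actual pipeline; here only the string-level normalization is shown.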


[Figure: the tokens of the example tweet "I got a fever" pass through an embedding layer into parallel forward and backward LSTM chains; their outputs are combined into a thought vector that feeds a dense layer with a softmax producing outputs y1 and y2.]

Figure 1: Network architecture developed in this work.
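The data flow of Figure 1 can be sketched in NumPy using the standard LSTM cell equations (Equations 1–6 in Section 2.4). All dimensions and random weights below are illustrative only (the actual model uses 200-dimensional GloVe embeddings and a 150-dimensional dense-layer input), and we read the "last forward frame / first backward frame" concatenation as taking each direction's final hidden state:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, c, W, b):
    """One LSTM cell step; W stacks the forget, input, candidate and
    output weight matrices, each applied to the concatenation [h, x]."""
    hx = np.concatenate([h, x])
    f = sigmoid(W[0] @ hx + b[0])        # forget gate (Eq. 1)
    i = sigmoid(W[1] @ hx + b[1])        # input gate (Eq. 2)
    c_tilde = np.tanh(W[2] @ hx + b[2])  # candidate state (Eq. 3)
    c = f * c + i * c_tilde              # new cell state (Eq. 4)
    o = sigmoid(W[3] @ hx + b[3])        # output gate (Eq. 5)
    h = o * np.tanh(c)                   # new hidden state (Eq. 6)
    return h, c

def bilstm_classify(token_ids, emb, Wf, bf, Wb, bb, Wd, bd):
    """Embed the tokens, run forward and backward LSTMs, concatenate
    the final frames into a thought vector, and apply a dense softmax."""
    xs = [emb[t] for t in token_ids]
    d = Wf[0].shape[0]
    h_f = c_f = np.zeros(d)
    for x in xs:                         # forward recursion
        h_f, c_f = lstm_step(x, h_f, c_f, Wf, bf)
    h_b = c_b = np.zeros(d)
    for x in reversed(xs):               # backward recursion (reversed input)
        h_b, c_b = lstm_step(x, h_b, c_b, Wb, bb)
    thought = np.concatenate([h_f, h_b]) # thought vector
    logits = Wd @ thought + bd
    p = np.exp(logits - logits.max())
    return p / p.sum()                   # softmax over the labels {p, n}

# Tiny random instance: vocabulary 10, embedding 4-d, hidden 3-d, 2 classes.
rng = np.random.default_rng(0)
emb = rng.normal(size=(10, 4))
Wf = rng.normal(size=(4, 3, 7)); bf = rng.normal(size=(4, 3))
Wb = rng.normal(size=(4, 3, 7)); bb = rng.normal(size=(4, 3))
Wd = rng.normal(size=(2, 6)); bd = rng.normal(size=2)
probs = bilstm_classify([1, 5, 2, 7], emb, Wf, bf, Wb, bb, Wd, bd)
```

In the paper's setting, eight such binary models are trained, one per disease/symptom.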

The cell then produces its two outputs, C_t and h_t, using Equations 4 and 6, respectively.

i_t = σ(W_i · [h_{t−1}, x_t] + b_i)    (2)

C̃_t = tanh(W_C · [h_{t−1}, x_t] + b_C)    (3)

C_t = f_t ∗ C_{t−1} + i_t ∗ C̃_t    (4)

o_t = σ(W_o · [h_{t−1}, x_t] + b_o)    (5)

h_t = o_t ∗ tanh(C_t)    (6)

In our network architecture, we duplicate the first recurrent layer to create a second recurrent layer, so that two layers run side by side. The second layer is denoted as the backward LSTM in Figure 1; for this layer, the input sequence is provided as a reversed copy of the original input sequence. Finally, we concatenate the last frame from the forward recursion and the first frame from the backward recursion to build the thought vector (Gibson & Nicholson). This vector is then the input to the dense layer, which has a dimension of 150, for softmax classification. For the task studied in this work, we created one RNN model for each disease/symptom; in total, eight RNN models were constructed for the eight types of diseases/symptoms.

2.5 Baseline Systems

Because the evaluation of the MedWeb task is ongoing, we implemented several baseline algorithms with Weka (Holmes, Donkin, & Witten, 1994) for the purpose of performance comparison, including the C4.5 decision tree, simple logistic, a support vector machine (SVM) trained with sequential minimal optimization, and naïve Bayes multinomial. The features used by the baselines were n-gram features with TF-IDF as the weighting function. Based on the tokens generated by the preprocessing step, we generated lowercased unigrams, bigrams and trigrams and filtered out stop words using the list developed by McCallum (1996) along with a custom stop word list created by analyzing the training set. Finally, the Snowball stemmer (Porter, 2001) was used to perform stemming.
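The n-gram extraction and TF-IDF weighting used for the baselines can be sketched in plain Python. The tf·log(N/df) formula below is one common TF-IDF variant chosen for illustration; Weka's exact formula and the stop-word filtering and stemming steps are omitted:

```python
import math
from collections import Counter

def ngrams(tokens, n_max=3):
    """Lowercased unigrams, bigrams and trigrams, as in Section 2.5."""
    toks = [t.lower() for t in tokens]
    feats = []
    for n in range(1, n_max + 1):
        feats += [" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)]
    return feats

def tfidf(docs):
    """Weight each document's n-gram counts by tf * log(N / df)."""
    grams = [Counter(ngrams(d)) for d in docs]
    df = Counter(g for c in grams for g in c)   # document frequency per n-gram
    N = len(docs)
    return [{g: tf * math.log(N / df[g]) for g, tf in c.items()} for c in grams]

docs = [["I", "got", "a", "fever"], ["fever", "and", "cough"]]
weights = tfidf(docs)
```

Note that "fever", appearing in every document, receives weight zero under this variant, while n-grams unique to one document are weighted highest.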


Symptom/Disease   Positive   Negative
Influenza            112       1808
Diarrhea             189       1731
Hay fever            201       1719
Cough                237       1683
Headache             254       1666
Cold                 284       1636
Fever                355       1565
Runny nose           417       1503
Total               2049      13311

Table 1: Statistics of the training set.

3 Results

3.1 Dataset

The dataset released by the NTCIR-13 MedWeb task was used in this study (Kato, Kishida, Kando, Saka, & Sanderson, 2017). The dataset contains annotations indicating whether a Twitter user, or someone around the user, has symptoms or diseases such as influenza, cold, hay fever, diarrhea, headache, cough, fever or runny nose at that point in time. Each tweet in the dataset was assigned one of two labels: the label "p" (positive) if the tweet is determined as expressing a symptom/disease, and the label "n" (negative) otherwise.

The training set consists of 1,920 tweets. Table 1 shows the statistics of the training set. As one can see, the dataset suffers from the class imbalance problem.

3.2 Evaluation Metrics

We used the standard precision, recall and F-measure to evaluate the developed methods, considering both the micro- and macro-averaged F-measure (Sokolova & Lapalme, 2009). A micro-F-score is generated by pooling all true positives, false positives and false negatives and calculating the F-score from the pooled counts. A macro-average, on the other hand, is obtained by calculating the F-score for each class and then averaging those F-scores into a single number.

3.3 Results of the Baseline System on the Training Set

Tables 2 and 3 show the two-fold cross-validation results on the English corpus of the MedWeb task. The baseline classifier with the best overall F-measure is SVM, which achieves the highest F-scores in the categories "Cold" and "Influenza". The detailed per-category performance of SVM is shown in Table 4.

Considering the imbalance observed in the training set, we tried to increase the weight of examples on which the classifiers make false-positive errors; however, the F-scores of the baseline classifiers did not improve. We also applied the synthetic minority oversampling technique (SMOTE) (Chawla, Bowyer, Hall, & Kegelmeyer, 2002) to create new instances for training the SVM. The result is indicated by SVM-SMOTE in Tables 2 and 3. We can observe that the precision and the recall are improved in the micro-average and the macro-average, respectively. The overall F-score is also slightly improved.

                      P      R      F
SVM                 0.869  0.880  0.875
C4.5                0.849  0.861  0.855
Naïve Bayes (NB)    0.262  0.966  0.413
Simple Logistic     0.822  0.924  0.870
NB Multinomial      0.666  0.874  0.756
SVM-SMOTE           0.867  0.882  0.875

Table 2: Baseline algorithm performance on the training set (micro-averaged).

                      P      R      F
SVM                 0.865  0.874  0.869
C4.5                0.841  0.851  0.845
Naïve Bayes (NB)    0.265  0.967  0.406
Simple Logistic     0.817  0.915  0.863
NB Multinomial      0.662  0.874  0.751
SVM-SMOTE           0.861  0.876  0.869

Table 3: Baseline algorithm performance on the training set (macro-averaged).

              P      R      F
Influenza   0.746  0.759  0.752
Diarrhea    0.849  0.894  0.871
Hay fever   0.848  0.915  0.880
Cough       0.982  0.911  0.945
Headache    0.887  0.929  0.908
Cold        0.982  0.911  0.945
Fever       0.837  0.865  0.850
Runny nose  0.864  0.882  0.873

Table 4: Per-category performance of the best-performing baseline model (SVM) on the training set.
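The micro- versus macro-averaging described in Section 3.2 can be illustrated with a small worked example. The counts below are hypothetical, chosen to show how a rare class drags the macro-average down while barely affecting the micro-average:

```python
def prf(tp, fp, fn):
    """Precision, recall and F-score from raw counts."""
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    return p, r, 2 * p * r / (p + r)

def micro_macro_f(counts):
    """counts: per-class (tp, fp, fn) triples.
    Micro: pool the counts, then compute one F-score.
    Macro: compute an F-score per class, then average them."""
    tp = sum(c[0] for c in counts)
    fp = sum(c[1] for c in counts)
    fn = sum(c[2] for c in counts)
    micro = prf(tp, fp, fn)[2]
    macro = sum(prf(*c)[2] for c in counts) / len(counts)
    return micro, macro

# One frequent class classified well, one rare class classified poorly.
micro, macro = micro_macro_f([(90, 10, 10), (1, 4, 4)])
```

Here the per-class F-scores are 0.9 and 0.2, so the macro-average is 0.55, while the pooled counts give a micro-average of about 0.867.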

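The SMOTE oversampling applied to the SVM baseline in Section 3.3 can be sketched as follows: each synthetic minority example interpolates between a minority point and one of its k nearest minority neighbours (Chawla et al., 2002). The parameter values here are illustrative, not the ones used in the experiments:

```python
import numpy as np

def smote(X, k=2, n_new=4, seed=0):
    """Minimal SMOTE-style sketch over the minority-class points X."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    new = []
    for _ in range(n_new):
        i = rng.integers(len(X))
        d = np.linalg.norm(X - X[i], axis=1)
        nbrs = np.argsort(d)[1:k + 1]           # k nearest neighbours (excluding self)
        j = rng.choice(nbrs)
        gap = rng.random()
        new.append(X[i] + gap * (X[j] - X[i]))  # interpolate along the segment
    return np.vstack(new)

minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
synthetic = smote(minority)
```

Every synthetic point lies on a segment between two existing minority points, so the oversampled class stays inside the region the minority examples already occupy.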

3.4 Results on the Test Set

Tables 5 and 6 show the performance of the two best-performing baselines and the proposed RNN model on the test set of the MedWeb task. The developed RNN model achieves an F-score of 0.845, which outperforms the F-score of the SVM model by 0.0754 and is close to that of the best-performing baseline, SVM-SMOTE. Comparing SVM-SMOTE with the proposed RNN model, the RNN has better precision but lower recall. We can also observe that the precision and recall of the developed RNN model are similar, while the SVM-based models have better recall.

Configuration    P      R      F
SVM            0.734  0.809  0.770
SVM-SMOTE      0.807  0.911  0.856
RNN            0.836  0.854  0.845

Table 5: Performance on the test set (micro-averaged).

Configuration    P      R      F
SVM            0.733  0.835  0.770
SVM-SMOTE      0.796  0.918  0.849
RNN            0.818  0.844  0.828

Table 6: Performance on the test set (macro-averaged).

3.5 Discussion

Because the organizers of the MedWeb task have not released the gold annotations for the test set, we cannot conduct an in-depth error analysis on the test set. Herein, we list some important key terms for each symptom/disease based on the results on the training set in Table 7. The list is generated by using the tree structures of C4.5 to prioritize the important terms for each symptom/disease. In addition to terms directly related to the corresponding symptoms/diseases, we can see some interesting terms, such as "dog" for runny nose and "Nepali" for diarrhea.

Type        Terms
Cold        cold, fever
Cough       coughing, cough, phlegm
Hay fever   because of, allergies, spring, pollen
Headache    headache, head hurt
Influenza   flu, vaccinate
Runny nose  nose, dog
Fever       temperature
Diarrhea    stomach, Nepali

Table 7: Key terms observed on the training set.

4 Conclusion

In this paper, we presented a neural network architecture based on bi-directional RNNs that can classify tweets conveying influenza-related information. We studied the performance of this architecture and compared it to the best-performing baseline algorithms on the test set of the MedWeb task. Using the micro-F1 measure, the developed RNN model outperforms SVM by 0.087 and is within 1% of the highest score, achieved by SVM with the oversampling technique. In the future, we will continue to improve the performance of our model and conduct an in-depth error analysis with regard to the different symptoms/diseases.

References

Adam, D., Jonnagaddala, J., Chughtai, A. A., & Macintyre, C. R. (2017). ZikaHack 2016: A digital disease detection competition. Paper presented at the Proceedings of the International Workshop on Digital Disease Detection using Social Media 2017, Taipei, Taiwan.

Aramaki, E., Maskawa, S., & Morita, M. (2011). Twitter catches the flu: detecting influenza epidemics using Twitter. Paper presented at the Proceedings of the Conference on Empirical Methods in Natural Language Processing.

Aramaki, E., Wakamiya, S., Morita, M., Kano, Y., & Ohkuma, T. (2017). Overview of the NTCIR-13: MedWeb task. Paper presented at the Proceedings of the NTCIR-13 Conference, Tokyo, Japan.

Bian, J., Topaloglu, U., & Yu, F. (2012). Towards large-scale twitter mining for drug-related adverse events. Paper presented at the Proceedings of the 2012 International Workshop on Smart Health and Wellbeing.

Charles-Smith, L. E., Reynolds, T. L., Cameron, M. A., Conway, M., Lau, E. H. Y., Olsen, J. M., . . . Corley, C. D. (2015). Using Social Media for Actionable Disease Surveillance and Outbreak Management: A Systematic Literature Review. PLoS ONE, 10(10), e0139701. doi:10.1371/journal.pone.0139701

Chawla, N. V., Bowyer, K. W., Hall, L. O., & Kegelmeyer, W. P. (2002). SMOTE: synthetic


minority over-sampling technique. Journal of Artificial Intelligence Research, 16, 321-357.

Dai, H.-J., Touray, M., Wang, C.-K., Jonnagaddala, J., & Syed-Abdul, S. (2016). Feature Engineering for Recognizing Adverse Drug Reactions from Twitter Posts. Information.

Gibson, A., & Nicholson, C. Thought Vectors, Deep Learning & the Future of AI. Deeplearning4j: Open-source, distributed deep learning for the JVM.

Gimpel, K., Schneider, N., O'Connor, B., Das, D., Mills, D., Eisenstein, J., . . . Smith, N. A. (2011). Part-of-speech tagging for Twitter: annotation, features, and experiments. Paper presented at the Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies: Short Papers - Volume 2, Portland, Oregon.

Ginsberg, J., Mohebbi, M. H., Patel, R. S., Brammer, L., Smolinski, M. S., & Brilliant, L. (2009). Detecting influenza epidemics using search engine query data. Nature, 457(7232), 1012-1014.

Holmes, G., Donkin, A., & Witten, I. H. (1994). Weka: A machine learning workbench. Paper presented at the Proceedings of the 1994 Second Australian and New Zealand Conference on Intelligent Information Systems.

Iso, H., Wakamiya, S., & Aramaki, E. (2016, December 11-17). Forecasting Word Model: Twitter-based Influenza Surveillance and Prediction. Paper presented at the Proceedings of the 26th International Conference on Computational Linguistics, Osaka, Japan.

Kato, M. P., Kishida, K., Kando, N., Saka, T., & Sanderson, M. (2017). Report on NTCIR-12: The Twelfth Round of NII Testbeds and Community for Information Access Research. Paper presented at the ACM SIGIR Forum.

McCallum, A. K. (1996). Bow: A toolkit for statistical language modeling, text retrieval, classification and clustering. http://www.cs.cmu.edu/~mccallum/bow.

Mesnil, G., He, X., Deng, L., & Bengio, Y. (2013). Investigation of recurrent-neural-network architectures and learning methods for spoken language understanding. Paper presented at INTERSPEECH.

Mikolov, T., Karafiát, M., Burget, L., Černocký, J., & Khudanpur, S. (2010). Recurrent neural network based language model. Paper presented at the Proceedings of the 11th Annual Conference of the International Speech Communication Association, Makuhari, Chiba, Japan.

Pennington, J., Socher, R., & Manning, C. D. (2014). GloVe: Global vectors for word representation. Proceedings of the Empirical Methods in Natural Language Processing (EMNLP 2014), 12, 1532-1543.

Porter, M. F. (2001). Snowball: A language for stemming algorithms. Retrieved from http://snowball.tartarus.org/texts/introduction.html

Sakaki, T., Okazaki, M., & Matsuo, Y. (2010). Earthquake shakes Twitter users: real-time event detection by social sensors. Paper presented at the Proceedings of the 19th International Conference on World Wide Web.

Sarker, A., Nikfarjam, A., & Gonzalez, G. (2016). Social media mining shared task workshop. Paper presented at the Pacific Symposium on Biocomputing 2016.

Sarker, A., O'Connor, K., Ginn, R., Scotch, M., Smith, K., Malone, D., & Gonzalez, G. (2016). Social Media Mining for Toxicovigilance: Automatic Monitoring of Prescription Medication Abuse from Twitter. Drug Safety, 39(3), 231-240. doi:10.1007/s40264-015-0379-4

Sokolova, M., & Lapalme, G. (2009). A systematic analysis of performance measures for classification tasks. Information Processing & Management, 45(4), 427-437.
