SmokEng: Towards Fine-grained Classification of Tobacco-related Social Media Text

Kartikey Pant, Venkata Himakar Yanamandra, Alok Debnath and Radhika Mamidi
International Institute of Information Technology
Hyderabad, Telangana, India
{kartikey.pant, himakar.y, alok.debnath}@research.iiit.ac.in [email protected]

arXiv:1910.05598v1 [cs.CL] 12 Oct 2019

Abstract

Contemporary datasets on tobacco consumption focus on one of two topics: either public health mentions and disease surveillance, or sentiment analysis of topical tobacco products and services. However, two primary considerations are not accounted for: the language of the affected demographic, and a combination of the topics mentioned above in a fine-grained classification mechanism. In this paper, we create a dataset of 3144 tweets, which are selected based on the presence of colloquial slang related to smoking, and analyze it based on the semantics of the tweet. Each class is created and annotated based on the content of the tweets such that further hierarchical methods can be easily applied.

Further, we demonstrate the efficacy of standard text classification methods on this dataset by designing experiments which perform both binary and multi-class classification. Our experiments tackle the identification of either a specific topic (such as tobacco product promotion), a general mention (cigarettes and related products) or a more fine-grained classification. This methodology paves the way for further analysis, such as understanding sentiment or style, which makes this dataset a vital contribution to both disease surveillance and tobacco use research.

1 Introduction

As Twitter has grown in popularity to 330 million monthly active users, researchers have increasingly been using it as a source of data for tobacco surveillance (Lienemann et al., 2017). Tobacco-related advertisements, tweets, awareness posts, and related information are most actively viewed by young adults (aged 18 to 29), who are extensive users of social media and also represent the largest population of smokers in the US and Canada.¹ Furthermore, such data allows us to understand patterns in ethnically diverse and vulnerable audiences (Lienemann et al., 2017). Social media provides an active and useful platform for spreading awareness, especially dialog platforms, which have untapped potential for disease surveillance (Platt et al., 2016). These platforms are useful in stimulating the discussion on societal roles in the domain of public health (Platt et al., 2016). Sharpe et al. (2016) have shown the utility of social media by highlighting the number of people who use social media channels for information about their illnesses before seeking medical care.

Correlation studies have shown that the most probable leading cause of preventable death globally is the consumption of tobacco and tobacco products (Prochaska et al., 2012). The disease most commonly associated with tobacco consumption is lung cancer, with two million cases reported in 2018 alone.²

While cigarettes are condemned on social media, this has been rivaled by the rising popularity and analysis of the supposed benefits of e-cigarettes (Dai and Hao, 2017). Information pertaining to new flavors and innovations in the industry and surrounding culture has generated sizable traffic on social media as well (Hilton et al., 2016). Studies show that social acceptance is a leading factor in the use and proliferation of e-cigarettes, with some reports claiming as many as 2.39 million high school and 0.63 million middle school students having used an e-cigarette at least once (Malik et al., 2019; Mantey et al., 2019). However, there are strong claims suggesting the use of e-cigarettes as a 'gateway' drug for other illicit substances (Unger et al., 2016).

¹ https://www.cdc.gov/tobacco/data_statistics/fact_sheets/adult_data/cig_smoking/
² https://www.wcrf.org/dietandcancer/cancer-trends/lung-cancer-statistics

Figure 1: Procedure for Data Collection. We started out with approximately 7 million tweets, which were mined based on 24 slang terms. These were pre-processed to select relevant tweets with decent traction on Twitter. A final cleaned dataset of 3144 tweets is presented.
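As a rough illustration of the procedure summarized in Figure 1 and detailed in Section 3.1, the filtering step might be sketched as below. This is a minimal sketch under stated assumptions, not the authors' code: the `Tweet` record, the function names, and the whole-word matching rule (with a trailing `*` treated as a prefix wildcard) are hypothetical. The English-only and more-than-100-followers conditions and the keyword terms come from Section 3.1; one of the 24 terms is missing from the available text, so only 23 are listed.

```python
import re
from dataclasses import dataclass

# Keyword terms listed in Section 3.1 (one term is missing in the source text).
# A trailing "*" is interpreted here as a prefix wildcard (an assumption).
KEYWORDS = [
    "smoking", "cigarette", "e-cig*", "cigar", "tobacco", "hookah", "shisha",
    "e-juice", "e-liquid", "vape", "vaping", "cheroot", "cigarillo", "roll-up",
    "baccy", "rollies", "claro", "chain-smok*", "vaper", "ciggie", "nicotine",
    "non-smoker", "non-smoking",
]


@dataclass
class Tweet:
    """Hypothetical minimal tweet record used only for this sketch."""
    text: str
    lang: str
    followers: int


def compile_keywords(terms):
    """Build one case-insensitive regex that matches any term as a whole token."""
    alternatives = []
    for term in terms:
        if term.endswith("*"):
            # Prefix wildcard: the stem followed by any further word characters.
            alternatives.append(re.escape(term[:-1]) + r"\w*")
        else:
            alternatives.append(re.escape(term))
    # Reject matches that are glued to other word characters or hyphens.
    return re.compile(
        r"(?<![\w-])(?:" + "|".join(alternatives) + r")(?![\w-])",
        re.IGNORECASE,
    )


def filter_tweets(tweets, pattern, min_followers=100):
    """Keep English tweets from users with more than min_followers followers
    whose text contains at least one keyword."""
    return [
        t for t in tweets
        if t.lang == "en"
        and t.followers > min_followers
        and pattern.search(t.text) is not None
    ]


PATTERN = compile_keywords(KEYWORDS)

if __name__ == "__main__":
    sample = [
        Tweet("I love vaping at night", "en", 500),
        Tweet("I love vaping at night", "en", 50),
        Tweet("new e-cigarettes in stock", "en", 101),
        Tweet("nothing relevant here", "en", 1000),
    ]
    for tweet in filter_tweets(sample, PATTERN):
        print(tweet.text)
```

Whole-word matching is only one possible reading of the keyword list; substring matching would also pick up plural forms such as "cigars", at the cost of more false positives.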

In this paper, we aim to classify tweets relating to cigarettes, e-cigarettes, and other tobacco-related products into distinct classes. This classification is fine-grained in order to assist in the analysis of the type of tweets which affect the users the most for each product or category. The extensive, manually annotated dataset of 3144 tweets pertains to tobacco use classification into advertisement, general information, personal information, and non-tobacco drug classes. Such classification provides insight into the type of tweet and associated target audience. For example, present cessation programs target users who are ready to quit rather than people who use tobacco regularly, a gap which could be addressed using Twitter and other online social media (Prochaska et al., 2012). Unlike many previous studies, we also include common slang terms in the classification scheme so as to be able to work with the social media discourse of the target audience.

Finally, we present several text-classification models for the fine-grained classification tasks pertaining to tobacco-related tweets on the released dataset.³ In doing so, we extend the work in topical Twitter content analysis as well as the study of public health mentions on Twitter.

³ https://github.com/kartikeypant/smokeng-tobacco-classification

2 Related Work

Myslín et al. (2013) explored content and sentiment analysis of tobacco-related Twitter posts and used machine learning classifiers for the detection of tobacco-relevant posts, with a particular focus on emerging products like e-cigarettes and hookah. Their work depends on a triaxial classification and uses basic statistical classifiers. However, their feature-engineered keyword-based systems do not account for slang associated with tobacco consumption.

Vandewater et al. (2018) perform a classification study that identifies the brand associated with a post, using keyword-based text analytics and image-based classifiers to determine the brands that were most responsible for posting about themselves on social media. Cortese et al. (2018) carry out a similar analysis on the consumer side, for female smokers on Instagram, targeting the same age group, but based entirely on feature extraction on images, particularly selfies.

More recently, Malik et al. (2019) explored patterns of communication of e-cigarette company use on Twitter. They categorized 1008 randomly selected tweets across four dimensions, namely user type, sentiment, genre and theme. However, they explore the effects of only Juul, and not other cigarettes or e-cigarettes, which limits their experiment to Juul-based analysis and inferences.

In the domain of disease surveillance, Aramaki et al. (2011) explored the problem of identifying influenza epidemics using machine-learning based tweet classifiers along with search engine trends for medical keywords and medical records for the disease in a local environment. To do so, they use SVM-based classifiers for extracting tweets that mention actual influenza patients. However, since they use only SVM-based classifiers, they are limited in their classification accuracy.

Dai et al. (2017) also focus on public health surveillance, and use word embeddings with a topic classifier in order to identify and capture semantic similarities between medical tweets by disease and tweet type. This gives a more robust yet very filtered classification, which does not account for the variety of linguistic features in tweets, such as slang, abbreviations and the like, in the keyword-based classification mechanism. Jiang et al. (2018) work on a similar problem using machine learning solutions such as an LSTM classifier.

Annotation Name                    Label   Class
Mention of Non-Tobacco Drugs       OD      -1
Unrelated or Ambiguous Mention     UM      0
Personal or Anecdotal Mention      PM      1
Informative or Advisory Mention    IM      2
Advertisements                     AD      3

Table 1: Label and ID associated with each class.

3 Dataset Creation

In this section, we explain the development of the dataset that we present along with this paper. We summarize the methods for collecting and filtering the tweets to arrive at the final dataset and provide some examples of the types of tweets and features we focused on. We also provide the dataset annotation schema and guidelines.

3.1 Data Collection

Using the Twitter Application Programming Interface (API)⁴, we collected a sample of tweets between 1st October 2018 and 7th October 2018 that represented 1% of the entire Twitter feed. This 1% sample consisted of an average of 1,035,206 tweets per day. Out of the 7,246,442 tweets, only tweets written in English and written by users with more than 100 followers were selected for the next step, in order to clear spam written by bots.

In order to extract tobacco-related tweets from this dataset, we constructed a list of keywords relevant to general tobacco usage, including hookah and e-cigarettes. Our initial list consisted of 32 such terms compiled from online slang dictionaries, but we pruned this list to 24 terms. These were smoking, cigarette, e-cig*, cigar, tobacco, hookah, shisha, e-juice, e-liquid, vape, vaping, cheroot, cigarillo, roll-up, , baccy, rollies, claro, chain-smok*, vaper, ciggie, nicotine, non-smoker, non-smoking.

By taking the dataset for a full week, we avoided potential bias based on the day of the week, which has been observed for alcohol-related tweets, which spike in positive sentiment on Fridays and Saturdays (Cavazos-Rehg et al., 2015). For each of the 7 days, all tweets matching any of the listed keywords were included. Tweets matching these tobacco-related keywords reflected roughly 0.043% (a fraction of about 0.00043) of all tweets in the Twitter API 1% sample. The resulting final dataset thus contained 3144 tweets, with a mean of 449 tweets per day.

⁴ https://developer.twitter.com/en/products/tweets/sample.html

Label   Examples
UM      "What are you smoking bruh ?"
        "The smoking gun on Kavanaugh! URL"
PM      "im smoking and doing whats best for me"
        "I haven't had a cigarette in $NUMBER$ months why do I want one so bad now??"
IM      "Obama puffed. Clinton did cigar feel.Churchill won major wars on whisky."
        "The FDA's claim of a teen vaping addiction epidemic doesn't add up. #ecigarette #health"
AD      "Which ACID Kuba Kuba are you aiming for? #De4L #ExperienceAcid #cigar #cigars URL"
        "Spookah Lounge: A concept - a year round Halloween-themed hookah lounge"
OD      "Making my money and smoking my weed"
        "Mobbin in da Bentley smoking moonrocks."

Table 2: Examples for each category represented by its label.

3.2 Data Annotation

The collected data was then annotated based on the categories mentioned in Table 1. These categories were chosen on the basis of frequency of occurrence, motivated by the general perception of tobacco and non-tobacco drug related tweets. These included advertisements as well as anecdotes, information and cautionary tweets. We further noticed that a similar pattern was seen for e-cigarettes and also pertained to some other drugs. While we have explored e-cigarettes in this classification, we have marked the mention of other drugs that were tagged with the same keywords.

A formal definition of each of the categories is given below.

• Unrelated or Ambiguous Mention: This category contains tweets with information unrelated to tobacco or any other drug, or tweets whose intent is ambiguous, for example because of sarcasm.

• Personal or Anecdotal Mention: Tweets are classified as containing a personal or anecdotal mention if they imply personal use of tobacco products or e-cigarettes, or provide instances of use of the products by the poster or others.

• Informative or Advisory Mention: This class of tweets consists of a broad range of topics such as:
  – mention or discussion of statistics on tobacco and e-cigarette use or consumption
  – mention of associated health risks or benefits
  – portrayal of the use of tobacco products or e-cigarettes by a public figure
  – emphasis on social campaigns for anti-smoking, and related products such as patches

• Advertisements: All tweets written with the intent of the sale of tobacco products, e-cigarettes and associated products or services are marked as advertisements. In this classification, intent is considered using the mention of price as an objective measure.

• Mention of Non-Tobacco Drugs: Tweets which mention the use, sale, anecdotes and information about drugs other than e-cigarettes or tobacco products are annotated in this category.

3.3 Inter-annotator Agreement

Annotation of the dataset to detect the presence of tobacco substance use was carried out by two human annotators with a linguistic background and proficiency in English. A sample annotation set consisting of 10 tweets per class was selected randomly from across the corpus. Both annotators were given the selected sample annotation set. This sample annotation set served as a reference baseline for each category of the text.

In order to validate the quality of annotation, we calculated the Inter-Annotator Agreement (IAA) for the fine-grained classification between the two annotation sets of 3,144 tobacco-related tweets using Cohen's Kappa coefficient (Fleiss and Cohen, 1973). The Kappa score of 0.791 indicates substantial agreement between the annotators and suggests that the presented schema is productive.

4 Methodology

In this section we describe the classifiers designed for this task of fine-grained classification. The classifier architecture is based upon a combination of a chosen word representation and a discriminator that is compatible with that representation. We use TF-IDF for the support vector machines, and GloVe embeddings (Pennington et al., 2014) with our convolutional neural network architecture and recurrent architectures (LSTM and Bi-LSTM). We also used FastText and BERT embeddings (both base and large) with their native classifiers to note the change in the accuracies.

4.1 Support Vector Machines (SVM)

The first learning model used for classification in our experiment was Support Vector Machines (SVM) (Cortes and Vapnik, 1995). We used term frequency-inverse document frequency (TF-IDF) as a feature to classify the annotated tweets in our data set (Salton and Buckley, 1988). TF-IDF captures the importance of a given word in a document, as defined in Equation 1.

$$\mathrm{tfidf}(t, d, D) = f(t, d) \times \log \frac{N}{|\{d \in D : t \in d\}|} \tag{1}$$

where $f(t, d)$ indicates the number of times term $t$ appears in document $d$, $N$ is the total number of documents, and $|\{d \in D : t \in d\}|$ represents the total number of documents in which $t$ occurs.

The SVM classifier finds the decision boundary that maximizes the margin by minimizing $\lVert w \rVert$ to find the optimal hyperplane for all the classification tasks:

$$\min f : \frac{1}{2} \lVert w \rVert^2 \quad \text{s.t.} \quad y^{(i)} \left( w^T x^{(i)} + b \right) \ge 1, \quad i = 1, \dots, m \tag{2}$$

where $w$ is the weight vector, $x$ is the input vector and $b$ is the bias.

4.2 Convolutional Neural Networks (CNN)

In this subsection, we outline Convolutional Neural Networks (Fukushima, 1988) for classification and describe the process for text classification in particular. Convolutional neural networks are multistage trainable neural network architectures developed for classification tasks (Lecun et al., 1998). Each of these stages consists of the types of layers described below:

• Embedding Layer: The purpose of an embedding layer is to transform the text inputs into a form which can be used by the CNN model. Here, each word of a text document is transformed into a dense vector of fixed size.

• Convolutional Layers: A convolutional layer consists of multiple kernel matrices that perform the convolution operation on their input and produce an output matrix of features upon the addition of a bias value.

• Pooling Layers: The purpose of a pooling layer is to perform dimensionality reduction of the input feature vectors. Pooling layers apply sub-sampling to the output matrices of the convolutional layer, combining neighbouring elements. We have used the commonly used max-pooling function for the pooling.

• Fully-Connected Layer: This is a classic fully connected neural network layer. It is connected to the pooling layers via a dropout layer in order to prevent overfitting. The softmax activation function is used for defining the final output of this layer.

The following objective function is commonly used in the task:

$$E_w = \frac{1}{n} \sum_{p=1}^{P} \sum_{j=1}^{N_L} \left( o^L_{j,p} - y_{j,p} \right)^2 \tag{3}$$

where $P$ is the number of patterns, $o^L_{j,p}$ is the output of the $j$-th neuron that belongs to the $L$-th layer, $N_L$ is the number of neurons in the output of the $L$-th layer, $y_{j,p}$ is the desirable target of the $j$-th neuron of pattern $p$, and $y_i$ is the output associated with an input vector $x_i$ to the CNN.

We use the Adam optimizer (Kingma and Ba, 2014) to minimize the cost function $E_w$.

4.3 Recurrent Neural Architectures

Recurrent neural networks (RNN) have been employed to produce promising results on a variety of tasks, including language modeling and speech recognition (Mikolov et al., 2010, 2011; Graves and Schmidhuber, 2005). An RNN predicts the current output conditioned on long-distance features by maintaining a memory based on history information.

An input layer represents features at time $t$. One-hot vectors for words, dense vector features such as word embeddings, or sparse features usually represent an input layer. An input layer has the same dimensionality as the feature size. An output layer represents a probability distribution over labels at time $t$ and has the same dimensionality as the size of the labels. Compared to the feed-forward network, an RNN contains a connection between the previous hidden state and the current hidden state. This connection is made through the recurrent layer, which is designed to store history information. The following equations are used to compute the values in the hidden and output layers:

$$h(t) = f(U x(t) + W h(t - 1)) \tag{4}$$

$$y(t) = g(V h(t)) \tag{5}$$

where $U$, $W$, and $V$ are the connection weights to be computed during training, and $f(z)$ and $g(z)$ are sigmoid and softmax activation functions as follows.

$$f(z) = \frac{1}{1 + e^{-z}} \tag{6}$$

$$g(z_m) = \frac{e^{z_m}}{\sum_k e^{z_k}} \tag{7}$$

In this paper, we apply Long Short Term Memory (LSTM) and Bidirectional Long Short Term Memory (Bi-LSTM) to sequence tagging (Hochreiter and Schmidhuber, 1997; Graves and Schmidhuber, 2005; Graves et al., 2013).

LSTM networks use purpose-built memory cells to update the hidden layer values. As a result, they may be better at finding and exploiting long-range dependencies in the data than a standard RNN. The following equations implement the LSTM model:

$$i_t = \sigma(W_{xi} x_t + W_{hi} h_{t-1} + W_{ci} c_{t-1} + b_i) \tag{8}$$

$$f_t = \sigma(W_{xf} x_t + W_{hf} h_{t-1} + W_{cf} c_{t-1} + b_f) \tag{9}$$

$$o_t = \sigma(W_{xo} x_t + W_{ho} h_{t-1} + W_{co} c_t + b_o) \tag{10}$$

$$h_t = o_t \tanh(c_t) \tag{11}$$

In a sequence tagging task, we have access to both past and future input features for a given time. Thus, we can utilize a bidirectional LSTM network (Bi-LSTM) as proposed in Graves et al. (2013).

4.4 FastText

The FastText classifier has proven to be efficient for text classification (Joulin et al., 2016). It is often on par with deep learning classifiers in terms of accuracy, and much faster for training and evaluation. FastText uses bag of words and bag of n-grams as features for text classification. The bag of n-grams feature captures partial information about the local word order. FastText allows updating word vectors through back-propagation during training, allowing the model to fine-tune word representations according to the task at hand (Bojanowski et al., 2016). The model is trained using stochastic gradient descent and a linearly decaying learning rate.

4.5 BERT

While previous studies on word representations focused on learning context-independent representations, recent works have focused on learning contextualized word representations. One of the more recent contextualized word representations is BERT (Devlin et al., 2019).

BERT is a contextualized word representation model, pre-trained using bidirectional transformers (Vaswani et al., 2017). It uses a masked language model that predicts randomly masked words in a sequence. It uses the task of next sentence prediction for learning the embeddings with a broader context. It outperforms many existing techniques on most NLP tasks with minimal task-specific architectural changes. It is pretrained on 3.3B words from various sources including BooksCorpus and the English Wikipedia.

Based on the transformer architecture used, BERT is classified into two types: BERT_Base and BERT_Large. BERT_Base uses a 12-layered transformer with 110M parameters. BERT_Large uses a 24-layered transformer with 340M parameters. We use the cased variant of both models.

Model/Experiment   Personal Health Mentions   Tobacco-related Mentions
SVM                82.17%                     83.44%
CNN                84.08%                     82.48%
LSTM               84.39%                     83.32%
BiLSTM             83.92%                     82.97%
FastText           83.76%                     81.05%
BERT_Base          85.19%                     85.50%
BERT_Large         87.26%                     85.67%

Table 3: Binary classification accuracies for a specific topic (Personal Health Mention) or a general theme (Tobacco-related Mentions).

Methods      Accuracy   F1 Score   Recall
SVM          65.45%     0.678      0.657
CNN          66.72%     0.668      0.599
LSTM         64.97%     0.641      0.583
BiLSTM       65.29%     0.643      0.597
FastText     69.43%     0.696      0.669
BERT_Base    70.86%     0.708      0.709
BERT_Large   71.34%     0.714      0.713

Table 4: Evaluation scores for the Fine-grained classification experiment.
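Table 4 reports accuracy, F1 score and recall for the five-class task. To make these quantities concrete, the following minimal sketch computes them from gold and predicted labels. The text does not state how per-class scores are combined, so macro-averaging is an assumption here, and the function name `evaluate` and the toy labels are illustrative only.

```python
from collections import Counter

# The five labels of Table 1.
LABELS = ["OD", "UM", "PM", "IM", "AD"]


def evaluate(y_true, y_pred, labels=LABELS):
    """Return (accuracy, macro-averaged recall, macro-averaged F1).

    Macro-averaging is an assumption; the paper does not say which
    averaging scheme was used for Table 4.
    """
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equally long")

    tp, fp, fn = Counter(), Counter(), Counter()
    for gold, pred in zip(y_true, y_pred):
        if gold == pred:
            tp[gold] += 1
        else:
            fp[pred] += 1  # predicted class gains a false positive
            fn[gold] += 1  # gold class gains a false negative

    recalls, f1_scores = [], []
    for label in labels:
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / actual if actual else 0.0
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        recalls.append(recall)
        f1_scores.append(f1)

    accuracy = sum(tp.values()) / len(y_true)
    macro_recall = sum(recalls) / len(labels)
    macro_f1 = sum(f1_scores) / len(labels)
    return accuracy, macro_recall, macro_f1


if __name__ == "__main__":
    gold = ["PM", "PM", "IM", "AD", "UM", "OD"]
    pred = ["PM", "IM", "IM", "AD", "UM", "UM"]
    acc, rec, f1 = evaluate(gold, pred)
    print(f"accuracy={acc:.3f} recall={rec:.3f} f1={f1:.3f}")
```

With macro-averaging every class contributes equally, so rare classes such as AD or OD weigh as much as frequent ones; a weighted average would instead track the class distribution shown in Figure 2.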

5 Experiments

In this section, we describe three experiments on the dataset created in the section above. The experiments are designed to show how well existing models perform on the naive binary classification based on this dataset as well as on the fine-grained five-class classification system. The first experiment is based on detecting just personal or anecdotal mentions. The second is based on identifying whether a tweet is about tobacco or not. The last experiment is a full fine-grained classification experiment.

The following experiments were conducted keeping an 80-20 split between training and test data, with 2517 tweets in the training dataset and 629 tweets in the test dataset. All tweets were shuffled randomly before the train-test split.

BERT_Large was observed to perform the best in all three experiments, followed closely by BERT_Base in all the experiments that were conducted.

5.1 Experiment 1: Detecting Personal Mentions of Tobacco Use

The first experiment in the study was to detect tweets containing personal mentions of tobacco use. Tweets containing personal mentions of tobacco use are the ones marking implicit or explicit use of a tobacco substance by the poster. The objective of this experiment is to find the best method to identify tweets which talk about tobacco in an anecdotal manner, which can be used to understand the semantic similarity between such tweets. Table 3 illustrates the results for this experiment.

5.2 Experiment 2: Identifying Tobacco-related Mentions

The next experiment in the study was to detect all tobacco-related tweets. These include the following categories of tweets: personal mentions of tobacco use, general information about tobacco or its use, and advertisements. Thus, the experiment was to determine whether the tweet belonged to one of the above categories or not. The objective here is also to gauge semantic information in tweets with mentions of tobacco, given that tweets using similar slang might be talking about other drugs or contain ambiguous or unrelated information. Table 3 illustrates the results for this experiment.

5.3 Experiment 3: Performing Fine-grained Classification of Tobacco-related Mentions

The last experiment conducted in the study was to classify the tweets into all five categories: UM, PM, IM, AD, OD. This is essentially the fine-grained classification experiment, which relies on semantic information as well as lexical choice. We see that models from all the three experiments perform differently given the type of task. Table 4 illustrates the results for this experiment.

6 Discussion

In this section, we analyze our contributions from the perspective of advancing work in the fields of topical content analysis as well as the study of public health mentions in tweets, with regard to tobacco products, as well as e-cigarettes and related products. Given the effects of both, as well as the significant overlap in the demographic of consumers of tobacco products and Twitter users, we found it necessary to understand the nature of the tweets produced and consumed by them.

Our dataset, a collection of 3144 tweets, accumulated and filtered over the period of just a week, implies that tobacco and related drugs are tweeted about and spoken of quite frequently, but the linguistic cues common among these tweets were not considered until now. The inclusion of tweets into the corpus based on slang terminology is an attempt to analyze the Twitter landscape in the language of the audience which most highly correlates with the demographic of consumers for the aforementioned products. To the best of our knowledge, using common slang as a basis of dataset creation and filtration for this task has not been attempted before.

Contemporary methods in the field focus on two basic characterizations, user based and sentiment based. User-based classifications such as Malik et al. (2019) and Jo et al. (2016) are based on analyzing activity from a particular user or set of users, while sentiment-based analyses such as Paul and Dredze (2011), Allem et al. (2018) and Myslín et al. (2013) are based on understanding the sentiment of the users on the basis of a new product, category or a more generalized perception of smoking. On the other hand, public health mention research such as Jawad et al. (2015) focuses on the effect of a particular type of tweet, generally health campaigns. Fundamentally, the classes we have chosen for the collected data are based on the same principle as the data collection mechanism, with the aim to bridge the gap between the classification studies and the public health surveillance research. This is because our categories cover the breadth of the tweets evenly, and are directed towards semantically understanding the nature of the tweets. This information is vital for addressing the validity and reach of campaigns, advertisements and other efforts.

Figure 2 shows the distribution of the number of tweets in each class. We see that in the span of a week, informative or advisory and personal mentions are the most widely posted. The tweets that provide general information about smokers or the habits of smoking tobacco or e-cigarettes are generated the most, implying that a larger section of the population tweets about smoking in an anecdotal manner. Similarly, Table 5 shows an interesting trend for the favorites. Advertisements have a higher average favorite count than most other classes, while anecdotal and advisory tweets are the most retweeted on average. This difference is an interesting observation, primarily because further work such as sentiment analysis and short text style transfer (Luo et al., 2019) for these categories may provide an effective strategy for advertisers and campaigners alike.

Figure 2: Distribution of tweets among different categories

Category   Retweets   Favorites
UM         1079.05    0.794
PM         12171.60   0.904
IM         680.24     3.918
AD         140.81     4.586
OD         873.08     0.868

Table 5: Average retweets and favorites across classes

7 Conclusion and Future Work

In this paper, we created a dataset of tweets and classified them in order to understand the social media atmosphere around tobacco, e-cigarettes and other related products. Our schema for categorization targets posts on public health as much as tobacco-related products, therefore allowing us to know the number and type of tweets used in public health surveillance for the above mentioned products. Most importantly, we consider slang as a very important aspect of our data collection mechanism, which has allowed us to factor in the content which is circulated and exposed to the majority of the consumers of both social media and the aforementioned products.

This contribution can be further extended by working with other social media platforms, where the methods introduced above can be easily replicated. Social media specific slang can be taken into account to make a more robust dataset for this task. Furthermore, on the public health surveillance aspect, more metadata can be extracted from the tweets, which gives an idea of the type of tweets or posts needed to grab the attention of a wider audience on topics of public health and awareness for the grave topic of tobacco products and e-cigarettes.

References

Jon-Patrick Allem, Likhit Dharmapuri, Adam Leventhal, Jennifer Unger, and Tess Cruz. 2018. Hookah-related posts to twitter from 2017 to 2018: Thematic analysis. Journal of Medical Internet Research, 20:e11669.

Eiji Aramaki, Sachiko Maskawa, and Mizuki Morita. 2011. Twitter catches the flu: detecting influenza epidemics using twitter. In Proceedings of the conference on empirical methods in natural language processing, pages 1568–1576. Association for Computational Linguistics.

Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2016. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5:135–146.

Patricia A. Cavazos-Rehg, Melissa J. Krauss, Shaina Sowles, and Laura J. Bierut. 2015. "Hey everyone, I'm drunk." An evaluation of drinking-related twitter chatter. Journal of studies on alcohol and drugs, 76(4):635–43.

Corinna Cortes and Vladimir Vapnik. 1995. Support-vector networks. Mach. Learn., 20(3):273–297.

Daniel K Cortese, Glen Szczypka, Sherry Emery, Shuai Wang, Elizabeth Hair, and Donna Vallone. 2018. Smoking selfies: using instagram to explore young women's smoking behaviors. Social Media + Society, 4(3):2056305118790762.

Hongying Dai and Jianqiang Hao. 2017. Mining social media data for opinion polarities about electronic cigarettes. Tobacco control, 26(2):175–180.

Xiangfeng Dai, Marwan Bikdash, and Bradley Meyer. 2017. From social media to public health surveillance: Word embedding based clustering method for twitter classification. In SoutheastCon 2017, pages 1–7. IEEE.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), pages 4171–4186.

Joseph L. Fleiss and Jacob Cohen. 1973. The equivalence of weighted kappa and the intraclass correlation coefficient as measures of reliability. Educational and Psychological Measurement, 33(3):613–619.

Kunihiko Fukushima. 1988. Neocognitron: A hierarchical neural network capable of visual pattern recognition. Neural Networks, 1(2):119–130.

Alex Graves, Abdel-rahman Mohamed, and Geoffrey E. Hinton. 2013. Speech recognition with deep recurrent neural networks. 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, pages 6645–6649.

Alex Graves and Jürgen Schmidhuber. 2005. Framewise phoneme classification with bidirectional lstm and other neural network architectures. Neural networks: the official journal of the International Neural Network Society, 18(5-6):602–10.

Shona Hilton, Heide Weishaar, Helen Sweeting, Filippo Trevisan, and Srinivasa Vittal Katikireddi. 2016. E-cigarettes, a safer alternative for teenagers? A UK focus group study of teenagers' views. BMJ Open, 6(11).

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9:1735–1780.

Mohammed Jawad, Jooman Abass, Ahmad Hariri, and Elie A Akl. 2015. Social media use for public health campaigning in a low resource setting: the case of waterpipe tobacco smoking. BioMed research international, 2015.

Keyuan Jiang, Shichao Feng, Qunhao Song, Ricardo A Calix, Matrika Gupta, and Gordon R Bernard. 2018. Identifying tweets of personal health experience through word embedding and lstm neural network. BMC bioinformatics, 19(8):210.

Catherine L Jo, Rachel Kornfield, Yoonsang Kim, Sherry Emery, and Kurt M Ribisl. 2016. Price-related promotions for tobacco products on twitter. Tobacco control, 25(4):476–479.

Armand Joulin, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. 2016. Bag of tricks for efficient text classification. In EACL.

Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv:1412.6980. Published as a conference paper at the 3rd International Conference for Learning Representations, San Diego, 2015.

Yann Lecun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. 1998. Gradient-based learning applied to document recognition. In Proceedings of the IEEE, pages 2278–2324.

Brianna Lienemann, Jennifer Unger, Tess Cruz, and Kar-Hai Chu. 2017. Methods for coding tobacco-related twitter data: A systematic review. Journal of Medical Internet Research, 19:e91.

Fuli Luo, Peng Li, Pengcheng Yang, Jie Zhou, Yutong Tan, Baobao Chang, Zhifang Sui, and Xu Sun. 2019. Towards fine-grained text sentiment transfer. In Proceedings of the 57th Conference of the Association for Computational Linguistics, pages 2013–2022.

Aqdas Malik, Yisheng Li, Habib Karbasian, Juho Hamari, and Aditya Johri. 2019. Live, love, juul: User and content analysis of twitter posts about juul. American Journal of Health Behavior, 43:326–336.

Dale S Mantey, Cristina S Barroso, Ben T Kelder, and Steven H Kelder. 2019. Retail access to e-cigarettes and frequency of e-cigarette use in high school students. Tobacco Regulatory Science, 5(3):280–290.

Tomas Mikolov, Anoop Deoras, Daniel Povey, Lukáš Burget, and Jan Černocký. 2011. Strategies for training large scale neural network language models. 2011 IEEE Workshop on Automatic Speech Recognition & Understanding, pages 196–201.

Tomas Mikolov, Martin Karafiát, Lukáš Burget, Jan Černocký, and Sanjeev Khudanpur. 2010. Recurrent neural network based language model. In INTERSPEECH.

Mark Myslín, Shu-Hong Zhu, Wendy Chapman, and Mike Conway. 2013. Using twitter to examine smoking behavior and perceptions of emerging tobacco products. Journal of medical Internet research, 15:e174.

Michael J. Paul and Mark Dredze. 2011. You are what you tweet: Analyzing twitter for public health. In ICWSM. The AAAI Press.

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. Glove: Global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pages 1532–1543.

using bayesian change point analysis: A comparative analysis. JMIR Public Health and Surveillance, 2:e161.

Jennifer Unger, Daniel Soto, and Adam Leventhal. 2016. E-cigarette use and subsequent cigarette and marijuana use among hispanic young adults. Drug and Alcohol Dependence, 163.

Elizabeth A Vandewater, Stephanie L Clendennen, Emily T Hébert, Galya Bigman, Christian D Jackson, Anna V Wilkinson, and Cheryl L Perry. 2018. Whose post is it? Predicting e-cigarette brand from social media posts. Tobacco regulatory science, 4(2):30–43.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.

Tevah Platt, Jodyn Platt, Daniel Thiel, and Sharon L. R Kardia. 2016. Facebook advertising across an engagement spectrum: A case example for pub- lic health communication. JMIR Public Health and Surveillance, 2:e27.

Judith Prochaska, Cornelia Pechmann, Romina Kim, and James Leonhardt. 2012. Twitter = quitter? an analysis of twitter quit smoking social networks. To- bacco Control, 21:447–449.

Gerard Salton and Christopher Buckley. 1988. Term- weighting approaches in automatic text retrieval. Inf. Process. Manage., 24(5):513–523.

Danielle Sharpe, Richard S Hopkins, Robert Cook, and Catherine Striley. 2016. Evaluating google, twit- ter, and wikipedia as tools for influenza surveillance