An Insight Analysis and Detection of Drug‑Abuse Risk Behavior on Twitter with Self‑Taught Deep Learning
Total Page:16
File Type:pdf, Size:1020Kb
Hu et al. Comput Soc Netw (2019) 6:10 https://doi.org/10.1186/s40649‑019‑0071‑4 RESEARCH Open Access An insight analysis and detection of drug‑abuse risk behavior on Twitter with self‑taught deep learning Han Hu1 , NhatHai Phan1*, Soon A. Chun2, James Geller1, Huy Vo3, Xinyue Ye1, Ruoming Jin4, Kele Ding4, Deric Kenne4 and Dejing Dou5 *Correspondence: [email protected] Abstract 1 New Jersey Institute Drug abuse continues to accelerate towards becoming the most severe public health of Technology, University Heights, Newark 07102, USA problem in the United States. The ability to detect drug-abuse risk behavior at a popu- Full list of author information lation scale, such as among the population of Twitter users, can help us to monitor the is available at the end of the trend of drug-abuse incidents. Unfortunately, traditional methods do not efectively article detect drug-abuse risk behavior, given tweets. This is because: (1) tweets usually are noisy and sparse and (2) the availability of labeled data is limited. To address these challenging problems, we propose a deep self-taught learning system to detect and monitor drug-abuse risk behaviors in the Twitter sphere, by leveraging a large amount of unlabeled data. Our models automatically augment annotated data: (i) to improve the classifcation performance and (ii) to capture the evolving picture of drug abuse on online social media. Our extensive experiments have been conducted on three mil- lion drug-abuse-related tweets with geo-location information. Results show that our approach is highly efective in detecting drug-abuse risk behaviors. Keywords: Deep learning, Self-taught learning, Drug abuse, Twitter Introduction Abuse of prescription drugs and of illicit drugs has been declared a “national emer- gency” [1]. Tis crisis includes the misuse and abuse of cannabinoids, opioids, tran- quilizers, stimulants, inhalants, and other types of psychoactive drugs, which statistical analysis documents as a rising trend in the United States. Te most recent reports from the National Survey on Drug Use and Health (NSDUH) [2] estimate that 10.6% of the total population of people ages 12 years and older (i.e., about 28.6 million people) mis- used illicit drugs in 2016, which represents an increase of 0.5% since 2015 [3]. According to the Centers for Disease Control and Prevention (CDC), opioid drugs were involved in 42,249 known deaths in 2016 nationwide [4]. In addition, the number of heroin-involved deaths has been increasing sharply for 5 years, and surpassed the number of frearm homicides in 2015 [5]. In April 2017, the Department of Health and Human Services announced their “Opi- oid Strategy” to battle the country’s drug-abuse crisis [1]. In the Opioid Strategy, one of the major aims is to strengthen public health data collection, to inform a timeliness © The Author(s) 2019. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creat iveco mmons .org/licen ses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. Hu et al. Comput Soc Netw (2019) 6:10 Page 2 of 19 public health response, as the epidemic evolves. Given its 100 million daily active users and 500 million daily tweets [6] (messages posted by Twitter users), Twitter has been used as a sufcient and reliable data source for many detection tasks, including epide- miology [7] and public health [8–13], at the population scale, in a real-time manner. Motivated by these facts and the urgent needs, our goal in this paper is to develop a large-scale computational system to detect drug-abuse risk behaviors via Twitter sphere. Several studies [10, 13–17] have explored the detection of prescription drug abuse on Twitter. However, the current state-of-the-art approaches and systems are limited in terms of scales and accuracy. Tey typically applied keyword-based approaches to col- lect tweets explicitly mentioning specifc drug names, such as Adderall, Oxycodone, Quetiapine, Metformin, Cocaine, marijuana, weed, meth, tranquilizer, etc. [10, 13, 15, 17]. However, that may not refect the actual distribution of drug-abuse risk behaviors on online social media, since: (1) the expressions of drug-abuse risk behaviors are often vague, in comparison with common topics, i.e., a lot of slang is used and (2) relying on only keyword-based approaches is susceptible to lexical ambiguity in natural language [12]. In addition, the drug-abuse risk behavior Twitter data are very imbalanced, i.e., dominated by non-drug-abuse risk behavior tweets, such as drug-related news, social discussions, reports, advertisements, etc. Te limited availability of annotated tweets makes it even more challenging to distinguish drug-abuse risk behaviors from drug- related tweets. However, existing approaches [10, 13, 15, 17] have not been designed to address these challenging issues for drug-abuse risk behavior detection on online social media. Contributions: To address these challenges, our main contributions are to propose: (1) a large-scale drug-abuse risk behavior tweets collection mechanism based on super- vised machine-learning and data crowd-sourcing techniques and (2) a deep self-taught learning algorithm for drug-abuse risk behavior detection. Based on our previous work [18], we extended the analysis of the classifcation results from our three million tweets dataset with the analysis of word frequency, hashtag frequency, drug name-behavior co- occurrence, temporal distribution (time-of-day), and state-level spatial distribution. We frst collect tweets through a flter, in which a variety of drug names, colloquial- isms and slang terms, and abuse-indicating terms (e.g., overdose, addiction, high, abuse, and even death) are combined together. We manually annotate a small number of tweets as seed tweets, which are used to train machine-learning classifers. Ten, the classifers are applied to large number of unlabeled tweets to produce machine-labeled tweets. Te machine-labeled tweets are verifed again by humans on Mechanical Turk, i.e., a crowd- sourcing platform, with good accuracy but at a much lower cost. Te new labeled tweets and the seed tweets are combined to form a sufcient and reliable labeled dataset for drug-abuse risk behavior detection by applying deep learning models, i.e., convolution neural networks (CNN) [19], long–short-term memory (LSTM) models [20], etc. However, there are still a large amount of unlabeled data, which can be leveraged to signifcantly improve our models in terms of classifcation accuracy. Terefore, we fur- ther propose a self-taught learning algorithm, in which the training data of our deep self- taught learning models will be recursively augmented with a set of new machine-labeled tweets. Tese machine-labeled tweets are generated by applying the previously trained Hu et al. Comput Soc Netw (2019) 6:10 Page 3 of 19 deep learning models to a random sample of a huge number of unlabeled tweets, i.e., the three million tweets in our dataset. Note that the set of new machine-labeled tweets possibly has a diferent distribution from the original training and test datasets. After the model is trained, we apply it to our geo-location-tagged dataset to acquire classifcation results for analysis. Results from the aforementioned analysis show that the drug-abuse risk behavior-positive tweets have distinctive patterns of words, hashtags, drug name-behavior co-occurrence, time-of-day distribution, and spatial distribution, compared with other tweets. Tese results show that our approach is highly efective in detecting drug-abuse risk behaviors. Te rest of this paper is organized as follows. We review related work in the next section. Ten, we describe the implementation of our method in detail, followed by experiment results and data analysis. Finally, we conclude this paper and propose future directions. Background and related work On one hand, the traditional studies and organizations, such as NSDUH [2], CDC [4], Monitoring the Future [21], the Drug-Abuse Warning Network (DAWN) [22], and the MedWatch program [23] are trustworthy sources for getting the general picture of the drug-abuse epidemic. On the other hand, many studies that are based on modern online social media, such as Twitter, have shown promising results in drug-abuse detection and related topics [7–17]. Many of the existing studies were focusing on the quantitative analysis utilizing data from online social media. Meng et al. [24] used traditional text and sentiment analysis methods to investigate substance use patterns and underage use of substance, and the association between demographic data and these patterns. Ding et al. [25] investigated the correlation between substance (tobacco, alcohol, and drug) use disorders and words in Facebook users’ “Status Updates” and “Likes”. Teir results showing word patterns are diferent between users who have substance us disorder and users who do not have. Hanson et al. [15] conducted a quantitative analysis on 213,633 tweets discussing “Adderall”, a prescription stimulant commonly abused among college students, and also published another study [14] focused on how possible drug-abuser interact with and infuence others in online social circles. Te results showed that strong correlation could be found: (1) between the amount of interaction about prescription drugs and a level of abusiveness shown by the network and (2) between the types of drugs mentioned by the index user and his or her network. Shutler et al. [17] performed a qualitative analysis of six prescription opioids, i.e., Percocet, Percs, OxyContin, Oxys, Vicodin, and Hydros. Tweets were collected with exact word matching and manual clas- sifcation. Teir primary goal was to identify the key terms used in tweets that likely indicate drug abuse.