
Using Computational Linguistics to Understand Text for Pharmacovigilance: A Systematic Review

Patrick Pilipiec

Computer Science and Engineering, master's level (60 credits) 2021

Luleå University of Technology
Department of Computer Science, Electrical and Space Engineering

Abstract

Background: Pharmacovigilance is a science that involves the ongoing monitoring of adverse reactions to existing medicines. Its primary purpose is to sustain and improve public health. The existing systems that apply the science of pharmacovigilance in practice are, however, not only expensive and time-consuming, but they also fail to include the experiences of many users. The application of computational linguistics to user-generated text is hypothesized to be a pro-active and effective supplemental source of evidence.

Objective: To review the existing evidence on the effectiveness of computational linguistics to understand user-generated text for the purpose of pharmacovigilance.

Methodology: A broad and multi-disciplinary systematic literature search was conducted that involved four databases. Studies were considered relevant if they reported on the application of computational linguistics to understand text for pharmacovigilance. Both peer-reviewed journal articles and conference materials were included. The PRISMA guidelines were used to evaluate the quality of this systematic review.

Results: A total of 16 relevant publications were included in this systematic review. All studies were evaluated to have medium reliability and validity. Despite this moderate quality, and across all types of drugs studied, the vast majority of publications reported positive findings with respect to the identification of adverse drug reactions. The remaining two studies reported rather neutral results but acknowledged the potential of computational linguistics for pharmacovigilance.

Conclusions: There exists consistent evidence that computational linguistics can be used effectively and accurately on user-generated textual content published to the Internet to identify adverse drug reactions for the purpose of pharmacovigilance. The evidence suggests that the analysis of textual data has the potential to complement the traditional system of pharmacovigilance. Recommendations are offered for researchers and practitioners of computational linguistics, policy makers, and users of drugs.

Keywords: Adverse drug reactions, Computational linguistics, Machine learning, Pharmacovigilance, Social media, User-generated content

Table of Contents

1 Introduction ...... 5

1.1 Research Problem ...... 5

1.2 Background and Relevance ...... 5

1.3 Research Gap ...... 6

1.4 Research Objective ...... 7

2 Literature Review ...... 8

2.1 Process from Drug Development to Market Authorization ...... 8

2.2 Need to Monitor Drug Safety for Adverse Drug Reactions ...... 9

2.3 Pharmacovigilance ...... 9

2.4 Emergence of Machine Learning ...... 10

2.5 Computational Linguistics ...... 11

2.6 Applications of Computational Linguistics ...... 12

3 Methodology ...... 14

3.1 Information Sources ...... 14

3.2 Search Strategy ...... 14

3.3 Study Selection ...... 15

3.4 Inclusion and Exclusion Criteria ...... 16

3.5 Data Analysis ...... 16

4 Results ...... 18

4.1 Literature Search and Study Selection ...... 18

4.2 Study Characteristics ...... 20

4.3 Effectiveness of Computational Linguistics to Identify Adverse Drug Reactions .. 23

5 Discussion ...... 25

5.1 Discussion ...... 25

5.2 Limitations ...... 26

5.3 Recommendations ...... 27

6 Conclusion ...... 29

References ...... 30

Appendices ...... 50

Appendix 1. Search Strategies ...... 50

Appendix 2. Characteristics of Publications ...... 52

Appendix 3. PRISMA Checklist ...... 57

1 Introduction

This chapter first introduces the research problem. Subsequently, the background and relevance of this study are discussed. Furthermore, the research gap that this study addresses is identified. This chapter concludes with specifying the research objective.

1.1 Research Problem

This study investigates the problem that there exists a paucity of evidence on the application of machine learning, and in particular computational linguistics, to understand user-generated textual content for the purpose of pharmacovigilance.

1.2 Background and Relevance

In drug development, there exists a strong tension between accessibility and safety. While drugs can effectively cure diseases and improve someone’s life (Santoro, Genov, Spooner, Raine, & Arlett, 2017), the required process of research and development of drugs is expensive, and not surprisingly, pharmaceutical companies have a high stake in yielding a profit on their investment (Barlow, 2017). This increases the urgency to make effective drugs available to the public swiftly. In contrast, medicines can also induce adverse drug reactions that may result in death. Identifying such reactions demands thorough and time-consuming testing of a drug’s safety, which drastically increases the time-to-market of new drugs (World Health Organization, 2004).

In fact, the potential consequences of adverse drug reactions are significant. In the European Union (EU), five per cent of hospital admissions and almost 200,000 deaths were caused by adverse drug reactions in 2008, and the associated societal cost totaled 79 billion Euros (European Commission, 2008).

The system that applies tools and practices from the research field of pharmacovigilance was introduced to alleviate this tension (Beninger, 2018). This system performs ongoing monitoring of adverse drug reactions of existing drugs (Beninger, 2018). It ensures that the time-to-market of effective drugs is minimized, while their long-term safety continues to be examined after market authorization (European Commission, 2020). The worldwide need for the swift development and public availability of a new effective drug was

also noticeable for COVID-19, which demonstrates the tension between reducing time-to-market and maintaining drug safety. Overall, pharmacovigilance is the cornerstone in the regulation of drugs (Santoro et al., 2017).

The traditional system that applies pharmacovigilance is, however, suboptimal, mainly because it is very expensive and often fails to monitor adverse drug reactions experienced by users if these are not reported to the authorities, pharmaceutical companies, or to medical professionals (European Commission, 2020; Schmider et al., 2019). The reporting of these adverse drug reactions is important because it may help to protect public health (Santoro et al., 2017).

In today’s society, many people share personal content on social media (Kaplan & Haenlein, 2010; Monkman, Kaiser, & Hyder, 2018; Moro, Rita, & Vala, 2016). Accordingly, an abundance of studies has already demonstrated that user-generated content can be used accurately for remote sensing, among other purposes to gauge public health. This raises the question of whether user-generated textual content can also be analyzed for the purpose of pharmacovigilance. Such automated analysis may provide a cheaper and more timely supplement to the expensive and time-consuming traditional methods for pharmacovigilance, and it may also include first-hand experiences of adverse drug reactions from users that were not reported to the authorities, pharmaceutical companies, or medical professionals.

1.3 Research Gap

We observe a paucity of evidence on the application of machine learning, and in particular computational linguistics, to understand user-generated textual content for the purpose of pharmacovigilance. Although various studies have investigated the suitability and effectiveness of computational linguistics, to our knowledge, no systematic review has yet aggregated the reported evidence and assessed the quality of those studies.

Therefore, it may be worthwhile to analyze user-generated content that has already been published to the Internet to pro-actively and automatically identify adverse drug reactions, without relying on users to actively report those cases to the authorities, pharmaceutical companies, or medical professionals.

1.4 Research Objective

To address the research gap, the purpose of this study is to review the existing evidence on the effectiveness of machine learning, and in particular computational linguistics, to understand user-generated content for the purpose of pharmacovigilance.

2 Literature Review

This chapter starts with an introduction to the process from drug development to market authorization. It continues by discussing the need to monitor the safety of medicines that already exist on the market, and the traditional system of pharmacovigilance to surveil adverse drug reactions. Subsequently, the emergence of machine learning and computational linguistics is elaborated. This chapter concludes with a discussion of the applications of computational linguistics.

2.1 Process from Drug Development to Market Authorization

In the last century, researchers and the pharmaceutical industry have made significant progress in the development of drugs that can cure a wide range of diseases. Among others, these novel drugs can decrease the suffering of people, reduce disease-related mortality, and prolong human life (Santoro et al., 2017).

The process of research and development of drugs is not only expensive and risky (Barlow, 2017), but most importantly it is also highly regulated (Santoro et al., 2017). Traditionally, drugs are developed in three subsequent phases, also referred to as hurdles (Rawlins, 2012).

1. In the first phase, pre-clinical studies that involve in vitro experiments are performed using test tubes or laboratory studies, to identify promising chemical substances that react with a disease (Angell, 2004). A limited number of these drug-candidates is then tested in animals until a few most promising drug-candidates remain (Angell, 2004).

2. In the second phase, clinical studies are performed that involve experimental testing of these drug-candidates in humans, where the conditions of these clinical experiments are highly controlled and supervised (Angell, 2004). These clinical studies are intended to confirm the quality, optimal dosage, safety, and side-effects of the most preferred drug.

3. In the third phase, the pharmaceutical company can apply for market authorization with the European Medicines Agency in the EU or with the Food and Drug Administration in the United States (Angell, 2004). Once market authorization is approved, the now registered drug can be prescribed by medical doctors and sold to the general public (Walker, Sculpher, Claxton, & Palmer, 2012).

2.2 Need to Monitor Drug Safety for Adverse Drug Reactions

A severe limitation in the process of bringing new drugs to the market is the potential of drugs to cause adverse drug reactions. This should not be underestimated, as these adverse drug reactions are a significant cause of illnesses and deaths (Santoro et al., 2017). For example, it was reported that in the EU, five per cent of the hospital admissions and almost 200,000 deaths were caused by adverse drug reactions in 2008 (European Commission, 2008). That year, adverse drug reactions also had a societal cost of 79 billion Euros (European Commission, 2008). Adverse drug reactions are also a primary cause of death in various other countries (World Health Organization, 2004).

Although the pre-clinical and clinical studies comprise testing for drug safety and potential adverse drug reactions, only a few hundred to a few thousand participants are included in these studies (Santoro et al., 2017). In addition, these studies are performed under highly controlled clinical conditions that are not representative of usage in real-world circumstances (Santoro et al., 2017). For example, people may consume various drugs simultaneously, and the interaction between chemical substances may result in harmful reactions. Therefore, the findings do not have high external validity, and not all adverse drug reactions may have been identified prior to making the drug generally available (World Health Organization, 2004). As long as the benefits outweigh the potential costs, it is generally considered unethical to withhold an effective drug from the general public at this stage, and it is therefore accepted that some persons may develop adverse drug reactions in the future.

To counteract this limitation, existing drugs on the market are constantly being monitored for safety and adverse drug reactions (World Health Organization, 2004). The long-term monitoring of existing drugs is crucial, because potential adverse drug reactions, interactions, and other factors may only emerge many years to even decades after the drug initially received market authorization (World Health Organization, 2004).

2.3 Pharmacovigilance

The long-term monitoring of drug safety beyond market authorization is named pharmacovigilance (Beninger, 2018). Pharmacovigilance is defined by the World Health Organization as “the science and activities relating to the detection, assessment,

understanding and prevention of adverse effects or any other medicine-related problem” (World Health Organization, 2004, p. 1).

As such, the application of tools and practices from pharmacovigilance by public health authorities results in a pro-active system that is intended to promote and protect public health (Santoro et al., 2017). It involves a wide array of activities, including the collection of data related to drug safety, obligating pharmaceutical companies and medical professionals to report adverse drug reactions, inviting users to report experiences with drugs, and the detection of signals that may indicate drug safety issues (European Commission, 2020). This makes pharmacovigilance an essential pillar in the regulation of drugs (Santoro et al., 2017).

There are, however, significant costs associated with the processing and administration of the reported cases concerning adverse drug reactions (Schmider et al., 2019). In addition, the current system of collecting data to monitor drug safety is suboptimal because end-users are not obliged to report cases of adverse drug reactions (European Commission, 2020).

2.4 Emergence of Machine Learning

Over the past 15 years, many technological innovations have enabled the storage, processing, and analysis of big data (Mikalef, Pappas, Krogstie, & Pavlou, 2020; Tekin, Etlioğlu, Koyuncuoğlu, & Tekin, 2019; Wall & Krummel, 2020). In particular, with the emergence of Web 2.0 and social media platforms, there has been a significant increase in user-generated content that is published to the Internet (Kaplan & Haenlein, 2010; Monkman et al., 2018; Moro et al., 2016). Among others, vast amounts of textual data are generated on blogs, call center logs, documents, emails, forums, news, social media sites, and surveys (Gandomi & Haider, 2015).

Similarly, there have been significant developments in machine learning that resulted in powerful methods and algorithms for computational linguistics, which enabled the processing and understanding of human language (Gandomi & Haider, 2015; Iglesias, Tiemblo, Ledezma, & Sanchis, 2016).

Consequently, mining social media and computational linguistics have received much interest in the literature (Kennedy, Elgesem, & Miguel, 2015).

2.5 Computational Linguistics

In recent years, the field of computational linguistics, also referred to as Natural Language Processing (NLP), has experienced significant innovations (Szpakowicz, Feldman, & Kazantseva, 2018). NLP is an important field of research within the realm of artificial intelligence (Zeng, Shi, Wu, & Hong, 2015). It uses algorithms and techniques from machine learning to analyze and understand natural language (Zeng et al., 2015).

Text mining is frequently defined as the analysis of textual data, such as unstructured or semi-structured text, with the purpose of extracting hidden patterns and information (Gandomi & Haider, 2015). As such, it combines techniques from data mining with NLP (Tekin et al., 2019). The key difference between data mining and text mining is that the former extracts hidden patterns from structured data, while the latter is used to analyze semi-structured and unstructured data with the objective of obtaining structured data in the end (Tekin et al., 2019).

Text mining has firm roots in artificial intelligence, data mining, databases, linguistics, and statistics (Iglesias et al., 2016). It has emerged from a need to analyze large amounts of texts that contain human language, which can be mined for insights that facilitate data-driven decision making (Gandomi & Haider, 2015).

Data mining techniques can, however, not be applied to unstructured data. Therefore, text mining is applied as pre-processing for unstructured data. The pre-processing of text contains various steps, such as (Jurafsky & Martin, 2019; Rizk & Elragal, 2020):

§ Tokenization: Splitting a document of text into a sequence of tokens, where each token represents a single word.
§ Transformation of cases: Converting all uppercase characters to lowercase, to avoid duplicate words caused by inconsistent capitalization.
§ Stop word removal: Stop words are matched with a dictionary and removed, because these individual tokens do not hold much meaning.
§ Stemming: This transforms all tokens to their stem, such that the words “walking” and “walked” are converted to their stem “walk”.
§ Filtering of tokens: Specified tokens may be eliminated because they can skew the distribution of words.
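As an illustration, the steps above can be sketched in a few lines of Python. The stop-word list and the suffix-stripping stemmer below are deliberately naive stand-ins for the dictionary-based and algorithmic implementations used in practice (e.g., Porter stemming), so the sketch shows only the shape of the pipeline:

```python
import re

# Toy stop-word list; real pipelines match against a full dictionary.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "is", "was"}

def tokenize(text):
    """Split a document of text into a sequence of word tokens."""
    return re.findall(r"[A-Za-z']+", text)

def to_lowercase(tokens):
    """Collapse case variants such as 'Drug' and 'drug' into one token."""
    return [t.lower() for t in tokens]

def remove_stop_words(tokens):
    """Drop tokens that hold little meaning on their own."""
    return [t for t in tokens if t not in STOP_WORDS]

def stem(tokens):
    """Naive suffix stripping: 'walking' and 'walked' both become 'walk'."""
    out = []
    for t in tokens:
        for suffix in ("ing", "ed", "s"):
            if t.endswith(suffix) and len(t) > len(suffix) + 2:
                t = t[: -len(suffix)]
                break
        out.append(t)
    return out

def preprocess(text, filtered=()):
    """Apply all steps; `filtered` lists tokens to eliminate afterwards."""
    tokens = stem(remove_stop_words(to_lowercase(tokenize(text))))
    return [t for t in tokens if t not in filtered]

tokens = preprocess("The patient was walking and walked to the clinic")
# tokens -> ['patient', 'walk', 'walk', 'clinic']
```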

Once text mining has been applied to extract structured data from a semi-structured or unstructured source, the conventional data mining algorithms can, if required, subsequently be used to process and analyze these structured data further to yield the valued insights (Iglesias et al., 2016). The complexity involved with analyzing unstructured textual data, and in particular its irregularities, therefore makes the process of text mining far more difficult than data mining (Liao, Chu, & Hsiao, 2012).

The examples of text mining are numerous and include, among others (Tekin et al., 2019), the discovery of associations (Sunikka & Bragge, 2012), the summarization of documents (Tekin et al., 2019), the clustering of texts (Aggarwal & Zhai, 2012; Lu, Zhang, Liu, Li, & Deng, 2013), text classification (Sadiq & Abdullah, 2014; Tan, 1999), entity relationship modeling (Tekin et al., 2019), the creation of a granular taxonomy (Tekin et al., 2019), prediction (Nassirtoussi, Aghabozorgi, Wah, & Ngo, 2014; Oberreuter & Velsquez, 2013), and the extraction of concepts and entities (Tekin et al., 2019).
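One of these applications, text classification, can be illustrated with a minimal multinomial Naive Bayes classifier written in plain Python. The training posts and the ADR/no-ADR labels below are invented for illustration only and do not come from any of the reviewed studies:

```python
import math
from collections import Counter

class NaiveBayesTextClassifier:
    """Minimal multinomial Naive Bayes for text classification."""

    def fit(self, docs, labels):
        self.classes = set(labels)
        self.word_counts = {c: Counter() for c in self.classes}
        self.class_counts = Counter(labels)
        for doc, label in zip(docs, labels):
            self.word_counts[label].update(doc.lower().split())
        self.vocab = {w for c in self.classes for w in self.word_counts[c]}

    def predict(self, doc):
        scores = {}
        for c in self.classes:
            # Start from the log prior of the class.
            score = math.log(self.class_counts[c] / sum(self.class_counts.values()))
            total = sum(self.word_counts[c].values())
            for w in doc.lower().split():
                # Laplace smoothing keeps unseen words from zeroing the score.
                score += math.log((self.word_counts[c][w] + 1) / (total + len(self.vocab)))
            scores[c] = score
        return max(scores, key=scores.get)

# Hypothetical training posts, invented purely for illustration.
docs = ["this drug gave me a terrible headache",
        "severe nausea after taking the medication",
        "the medication worked fine for me",
        "great results no problems at all"]
labels = ["ADR", "ADR", "no-ADR", "no-ADR"]

clf = NaiveBayesTextClassifier()
clf.fit(docs, labels)
```

With realistic data volumes, feature engineering and smoothing matter far more, and a library implementation would normally be preferred; the sketch only shows the principle of scoring a post against each class.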

2.6 Applications of Computational Linguistics

An abundance of studies conducted in many domains has demonstrated the potential of computational linguistics to analyze user-generated content written in human language.

For example, in environmental studies that use remote sensing, social media content was used to estimate seabass fishing as a reliable substitute for on-site surveillance (Monkman et al., 2018). Similarly, the content from these platforms was also used to investigate air pollution and air quality (Hswen, Qin, Brownstein, & Hawkins, 2019; Jiang, Sun, Wang, & Young, 2019; Riga & Karatzas, 2014; Wang, Paul, & Dredze, 2015).

In other studies, text mining was used in the domain of financial investments, where financial news was mined to predict the stock market (Chung, 2014). In fact, the mining of web news has branched off into web news mining, because it requires novel methods and technologies to process and analyze a vast amount of news sources that are continuously being updated at a high velocity (Iglesias et al., 2016; Sayyadi, Salehi, & AbolHassani, 2007).

In addition, social media were often analyzed to identify sentiments and discussions about vaccination (Bahk et al., 2016; Becker et al., 2016; D'Andrea, Ducange, & Marcelloni, 2017;

Du, Xu, Song, & Tao, 2017; Dunn, Leask, Zhou, Mandl, & Coiera, 2015; Powell et al., 2016; Straton, Ng, Jang, Vatrapu, & Mukkamala, 2019; Tavoschi et al., 2020; Zhou et al., 2015). Social media and web searches were also used to accurately predict disease outbreaks (Yang, Horneffer, & DiLisio, 2013), gauge food quality and food-borne illnesses (Kate, Negi, & Kalagnanam, 2014), and to detect various diseases (Edo-Osagie, Smith, Lake, Edeghere, & De La Iglesia, 2019; Tufts et al., 2018).

Furthermore, computational linguistics was used to monitor public health, such as surveilling allergies (Lee, Agrawal, & Choudhary, 2015; Nargund & Natarajan, 2016; Rong, Michalska, Subramani, Du, & Wang, 2019), depression (Leis, Ronzano, Mayer, Furlong, & Sanz, 2019; Mowery et al., 2017; Zucco, Calabrese, & Cannataro, 2017), suicidal thoughts and suicide-related conversations (Brown et al., 2019; Kavuluru et al., 2016; Song, Song, Seo, & Jin, 2016), obesity (Cesare, Dwivedi, Nguyen, & Nsoesie, 2019; Kent et al., 2016), marijuana and drug abuse (Cavazos-Rehg, Zewdie, Krauss, & Sowles, 2018; Phan, Chun, Bhole, Geller, & Ieee, 2017; Tran, Nguyen, Nguyen, & Golen, 2018; Ward et al., 2019), tobacco and e-cigarettes (Cole-Lewis et al., 2015; Kim, Miano, Chew, Eggers, & Nonnemaker, 2017; Kostygina, Tran, Shi, Kim, & Emery, 2016; Myslin, Zhu, Chapman, & Conway, 2013), and to gauge public health concerns (Ji, Chun, Wei, & Geller, 2015).

3 Methodology

This chapter first discusses the information sources and search strategy. Thereafter, the selection of studies, as well as the inclusion and exclusion criteria are elaborated. This chapter concludes with outlining the procedure for data analysis.

3.1 Information Sources

To cover all related disciplines, a broad selection of databases was made that included PubMed, Web of Science, IEEE Xplore, and ACM Digital Library. These databases were selected because they index studies in various fields. Specifically, PubMed was included because it predominantly indexes research in the fields of public health, healthcare, and medicine. IEEE Xplore and ACM Digital Library were searched because these databases index publications in information technology and information management. Web of Science was included because it is a very large database that indexes studies in various disciplines; because of this multidisciplinary nature, there is a consensus among researchers that including it in systematic reviews is good practice.

We recognize that Google Scholar is increasingly used as a source for systematic reviews, but that there exists a debate among scientists about the appropriateness of using Google Scholar (Giustini & Boulos, 2013). The quality of systematic reviews is partly determined by whether a systematic literature search can be replicated and validated (Giustini & Boulos, 2013). Since Google Scholar frequently, even daily, updates its algorithms for search and the ranking of content, replication of the systematic literature search may yield unpredictable results (Giustini & Boulos, 2013). Therefore, we have elected to exclude Google Scholar as an information source in this review.

3.2 Search Strategy

For each of the included databases, an optimized search strategy was formulated (see Appendix 1). The search query was constructed from two blocks. The first block addresses the concept of computational linguistics and the second block includes search terms related to health surveillance. Each block included a broad list of related search terms and synonyms that were combined using the logical OR-operator. Furthermore, the queries for each block were combined

using the logical AND-operator. Additional database-specific filters were applied to further optimize the search results.
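The block structure of such a query can be sketched as follows; the search terms shown are illustrative placeholders only, as the actual database-specific strategies are listed in Appendix 1:

```python
def build_query(block_a, block_b):
    """OR the synonyms within each concept block, then AND the two blocks."""
    def or_join(terms):
        return "(" + " OR ".join(f'"{t}"' for t in terms) + ")"
    return f"{or_join(block_a)} AND {or_join(block_b)}"

# Illustrative terms only; see Appendix 1 for the real search strategies.
linguistics_block = ["computational linguistics", "natural language processing", "text mining"]
surveillance_block = ["pharmacovigilance", "adverse drug reaction", "health surveillance"]

query = build_query(linguistics_block, surveillance_block)
# ("computational linguistics" OR ...) AND ("pharmacovigilance" OR ...)
```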

The systematic literature search was performed in March 2020. After the databases were searched and the results were imported into an EndNote Library, the method for de-duplication by Bramer et al. (2016) was performed to identify and remove duplicate studies.
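The de-duplication method of Bramer et al. (2016) compares several bibliographic fields within EndNote; a much-simplified sketch of the underlying idea, matching records on a normalized title, might look as follows (the records shown are hypothetical):

```python
import re

def normalize_title(title):
    """Lowercase and strip punctuation so formatting differences between
    databases do not hide duplicate records."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()

def deduplicate(records):
    """Keep only the first record seen for each normalized title."""
    seen, unique = set(), []
    for rec in records:
        key = normalize_title(rec["title"])
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique

# Hypothetical records, as they might arrive from different databases.
records = [
    {"title": "Mining Twitter for Adverse Drug Reactions", "db": "PubMed"},
    {"title": "Mining Twitter for adverse drug reactions.", "db": "Web of Science"},
    {"title": "NLP for Pharmacovigilance", "db": "IEEE Xplore"},
]
unique = deduplicate(records)
# The first two records collapse into one; two unique records remain.
```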

3.3 Study Selection

Studies that are eligible for this systematic review were selected in three subsequent phases.

In the first phase, the titles were screened for the presence of subjects related to public health monitoring or public health surveillance. The screening was intended to be very broad, to reduce the likelihood that studies were excluded incorrectly. Therefore, not only titles containing terms such as “adverse drug reactions” were considered relevant, but also titles containing more indirect terms such as “medication outcomes”. In addition, if it was ambiguous whether a study was relevant, it was still included for further screening in the next phase. Studies that were not relevant were omitted from the library.

In the second phase, the abstracts were screened for information related to computational linguistics, public health monitoring, public health surveillance, and pharmacovigilance. In addition, the keywords that are provided with the manuscript for indexing purposes were screened for these concepts. This screening phase was also intended to be broad. For example, abstracts were considered relevant if they contained terms directly related to pharmacovigilance, such as “adverse effects of drug treatment”, but also rather indirectly related terms such as “drug reviews”. Publications were still included if their relevance was considered ambiguous, for further screening in the next phase. Irrelevant manuscripts were removed.

In the third phase, the full text of the remaining studies was downloaded and read. Studies were considered relevant if they investigated the application of computational linguistics to understand text, with the purpose of public health monitoring or public health surveillance within the discipline of pharmacovigilance. Therefore, studies were considered relevant if they aimed to identify adverse drug reactions using computational linguistics. Publications were only included in this systematic review when they met these criteria.

3.4 Inclusion and Exclusion Criteria

Studies were considered eligible when they investigated the application of computational linguistics to understand text, with the purpose of public health monitoring or public health surveillance within the discipline of pharmacovigilance. Therefore, eligible studies reported on the application and results of using computational linguistics to identify adverse drug reactions from textual sources, such as forums, patient records, and social media. Furthermore, because it is common in information technology and computational linguistics that research is not published in peer-reviewed journals but appears only in conference proceedings or conference papers, both journal articles and conference proceedings were included in this systematic review. This was not expected to affect the reliability of studies, because conference proceedings and conference papers are also subject to a peer-review process.

Publications were excluded if they only reported on a framework instead of the actual application. For example, authors may suggest a process to investigate adverse drug reactions using computational linguistics without actually applying it and evaluating the results. Likewise, studies were excluded if they were published in a language other than English. Furthermore, if the same publication was published in different formats, for example as both a conference proceeding and a journal article, only one format of the publication was retained.

3.5 Data Analysis

The included publications were evaluated on their quality by assessing their reliability and validity. This assessment was performed by following the strategy of Kampmeijer et al. (2016): A publication was evaluated as reliable if it reported a thorough and repeatable description of the performed process, methods, data collection, and data analysis (Kampmeijer et al., 2016). A publication was evaluated as valid if the reported findings are logically the result of the described process, methods, data, and analyses that were used to find that result (Kampmeijer et al., 2016). Although the quality assessment was rigorous and based on scientific standards, all identified publications were included in the systematic review.

Thematic analysis was used to analyze the included publications (Hsieh & Shannon, 2005). The themes were defined by the objectives of the present systematic review. Therefore, the following themes were extracted from the full text of the included publications: Authors, year of publication, type of drugs, data source, sample size, users, unique users, origin of users, average number of followers, years of data collection, horizon of data collection, software used, techniques and classifiers used, outcome, drugs studied, result, and a description of the result.

For each publication, the extracted themes were processed into an extraction matrix. This matrix was used to synthesize and narratively present the extracted information by theme. The results are summarized and presented using tables.

This systematic review was guided by the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines (Liberati et al., 2009; Moher, Liberati, Tetzlaff, Altman, & The PRISMA Group, 2009). The quality of this systematic review was also evaluated using the PRISMA Checklist in Appendix 3.

4 Results

First, this chapter summarizes the studies that were identified via the literature search and it visualizes the study selection using a flow diagram. Second, this chapter describes and discusses the included publications in further detail. Finally, this chapter concludes with the presentation of the effectiveness of computational linguistics to understand text in the domain of pharmacovigilance to identify adverse drug reactions.

4.1 Literature Search and Study Selection

The procedure that was followed for the selection of studies is presented in Figure 1. The literature search involved four databases (see Appendix 1) and yielded 5,318 initial records. These records were assessed for the presence of duplicate publications. Consequently, 744 duplicate results were identified and omitted. Therefore, the literature search yielded 4,574 unique studies.


Figure 1. Flow diagram for literature search and study selection

The study selection was performed in three phases. In the first phase, studies were screened on title. A total of 4,347 studies were considered irrelevant and were excluded. A reason that these studies were excluded is that they did not mention “adverse drug reactions” or other related but rather general terms such as “medication outcomes” in the title. In the second phase, the remaining 227 results were screened by reading the abstract; 206 irrelevant studies were omitted. For example, studies were excluded when not mentioning “adverse effects of drug treatment” or other related but rather general terms such as “drug reviews” in the abstract. In the third phase, the full text of the remaining 21 publications was read. Five studies were considered irrelevant because they did not investigate the application of computational linguistics to understand text, with the purpose of public health monitoring or

public health surveillance within the discipline of pharmacovigilance. These irrelevant studies were, therefore, removed.
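The counts in the flow diagram are internally consistent, which can be verified with simple arithmetic:

```python
# Numbers reported in the flow diagram (Figure 1).
initial_records = 5318
duplicates = 744
excluded_on_title = 4347
excluded_on_abstract = 206
excluded_on_full_text = 5

unique_records = initial_records - duplicates                      # 4,574 unique studies
after_title_screen = unique_records - excluded_on_title            # 227 remain
after_abstract_screen = after_title_screen - excluded_on_abstract  # 21 remain
included = after_abstract_screen - excluded_on_full_text           # 16 included
```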

Overall, the literature search and selection of studies yielded 16 publications that were considered relevant and were included in this systematic review. A detailed overview of the characteristics of the included studies is presented in Appendix 2. A summary is provided in Table 1 and will be further elaborated in this chapter.

[Insert Table 1 here]

4.2 Study Characteristics

A general description of the publications included in the analysis is provided in Table 2.

[Insert Table 2 here]

All studies were published between 2009 and 2019. Most of the studies (68.75%) were published in the last five years (2015-2019).

The included studies investigated reported adverse drug reactions for a variety of drugs. Of the studies that specified the type of drugs under study, the publications involved drugs to treat depression (5.26%), asthma (5.26%), cystic fibrosis (5.26%), HIV (5.26%), cancer (10.53%), rheumatoid arthritis (5.26%), and type 2 diabetes (5.26%). In the majority of studies (57.89%), the type of drugs was not specified.

Publications used data from four sources. A majority of studies (55.56%) used textual information from social media to extract adverse drug reactions. Drug reviews were also a popular source of unstructured data (22.22%). Forums (16.67%) and electronic health records (5.56%) were used less often.

There was a wide diversity in the sample size of the posts that were included, which ranged from 1,245 (Lin et al., 2015) to more than two billion (Bian, Topaloglu, & Yu, 2012) tweets. Three studies (18.75%) included fewer than 5,000 posts, six publications (37.50%) used at least 5,000 but fewer than 20,000 posts, and six studies (37.50%) were performed using more than 20,000 posts. The sample size was not reported in one study (6.25%).

Only one study (6.25%) provided contextual information about the background of the publishers of the included posts. In that study, the users were identified as HIV-infected and undergoing drug treatment (Adrover, Bodnar, Huang, Telenti, & Salathé, 2015). In the remaining studies (93.75%), the background of these users was not disclosed.

The vast majority of studies (87.50%) did not provide information about the number of unique users who published the analyzed content. In only two studies (12.50%), it was disclosed that fewer than 5,000 users had published the posts.

One study discussed the geographical location of the users who published the included posts. In that study, it was reported that the users were from Canada, South Africa, the United Kingdom, or the United States (Adrover et al., 2015). The remaining 15 studies did not discuss the geographical location of users.

Similarly, only one study (6.25%) reported that the users had, on average, fewer than 5,000 followers (Adrover et al., 2015). The remaining studies (93.75%) provided no information about this theme.

In the studies that presented the date of publication of the included textual samples (74.19%), the posts were published between 2004 and 2015. Content published since 2010 was included in more studies than content published before 2010. The remaining 25.81% of studies did not discuss when the posts were published.

For the studies that reported the date of publication of the included content (50.00%), the time horizon of the data collection was examined. In 12.50% of studies, this horizon was one calendar year. In 6.25% of the studies, this horizon was between two and five years. In another 6.25%, the horizon ranged between six and ten years. In four studies (25.00%), the time horizon could not be computed because the data were published within the same calendar year. The remaining studies (50.00%) did not present the date on which the included data were published, and, therefore, the horizon of data collection could not be computed.

The studies differed widely in the software that was used. In total, 27 software products were discussed. Studies often used alternative products of the same type. For example, although some studies used Tweepy (2.86%), Twitter REST API (2.86%), or Twitter4J (2.86%) to retrieve data from Twitter in different ways, these software products can be aggregated under the type of Twitter API. By frequency, the spelling checker was used most often (8.57%). Notably, three studies (8.57%) did not present the software that was used.

Likewise, a total of 21 different techniques and classifiers was reported. NLP was used in five studies (11.63%). Similarly, sentiment analysis was used in five studies (11.63%). Two studies (4.65%) and one study (2.33%) reported the use of term frequency-inverse document frequency (TF-IDF) and a term-document matrix (TDM), respectively. By frequency, Support Vector Machines (SVM) were used most often, in eight studies (18.60%), to analyze the quantitative features that were extracted from unstructured data. The remaining 4.65% of the studies did not disclose the techniques and classifiers that were used.
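The included studies did not share a common implementation of these techniques, but the general idea behind TF-IDF weighting, which two studies reported using, can be illustrated with a minimal sketch. The toy posts and terms below are hypothetical, and the pure-Python implementation does not correspond to any specific library used in the included studies:

```python
import math
from collections import Counter

def tf_idf(corpus):
    """Compute TF-IDF weights for a list of tokenized posts.

    TF is the relative term frequency within a post; IDF is
    log(N / df), where df is the number of posts containing the term.
    """
    n_docs = len(corpus)
    df = Counter(term for doc in corpus for term in set(doc))
    weights = []
    for doc in corpus:
        tf = Counter(doc)
        total = len(doc)
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return weights

# Three hypothetical posts mentioning a fictitious drug and side effects.
posts = [
    "drug x gave me a headache".split(),
    "no side effects with drug x".split(),
    "severe headache and nausea after drug x".split(),
]
weights = tf_idf(posts)
# Terms that occur in every post (e.g. "drug", "x") receive weight 0,
# while rarer, more distinctive terms such as "nausea" are weighted higher.
```

The resulting weight vectors are the kind of quantitative features that a classifier such as an SVM can then be trained on to separate posts that report an adverse reaction from posts that do not.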

Although all included studies investigated how computational linguistics can be applied to understand text for pharmacovigilance, and thus all studies investigated reported adverse drug reactions, some studies were more explicit than others in discussing their outcome. For example, several studies explicitly disclosed that the outcome was reported adverse drug reactions for cancer (5.56%), reported adverse drug reactions for HIV treatment (5.56%), user sentiment on depression drugs (5.56%), and user sentiment on cancer drugs (5.56%). The remaining studies were more ambiguous, presenting the outcome as patient satisfaction with drugs (5.56%), patient-reported medication outcomes (5.56%), reported adverse drug reactions (61.11%), and reported effectiveness of drugs (5.56%).

The studies also differed with respect to the number of drugs for which adverse drug reactions were investigated. Most studies (31.25%) included posts concerning 20 or more drugs, followed by 25.00% of publications that studied between five and nine drugs. Two studies (12.50%) included between 10 and 14 drugs, while only one study (6.25%) addressed fewer than five drugs. No studies included between 15 and 19 drugs. The remaining 25.00% of studies did not disclose the number of drugs that were investigated.

The studies reported consistent evidence that computational linguistics can be successfully used to understand text for the purpose of pharmacovigilance. A vast majority of studies (87.50%) presented positive results. These studies observed that adverse drug reactions could indeed be extracted accurately and reliably from content published by patients. These studies often compared the accuracy of the adverse effects that were extracted from posts against a list of known adverse drug reactions, for example from the medical package insert or from other reliable sources. Only 12.50% of the studies reported neutral findings. No studies reported a negative result.
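The included studies describe various evaluation set-ups rather than one shared procedure. As a minimal, hypothetical sketch of such a comparison (the term lists below are invented for illustration and are not data from any included study), precision and recall of extracted reactions against a reference list can be computed as follows:

```python
def evaluate_extraction(extracted, known):
    """Compare extracted adverse-reaction terms against a reference
    list of known reactions (e.g. from the medical package insert)."""
    extracted, known = set(extracted), set(known)
    true_positives = extracted & known
    precision = len(true_positives) / len(extracted) if extracted else 0.0
    recall = len(true_positives) / len(known) if known else 0.0
    return precision, recall

# Hypothetical example: terms mined from posts vs. the package insert.
mined = {"headache", "nausea", "insomnia"}
package_insert = {"headache", "nausea", "dizziness", "fatigue"}
precision, recall = evaluate_extraction(mined, package_insert)
# precision = 2/3: two of the three mined terms are known reactions
# recall = 0.5: two of the four known reactions were recovered
```

Note that a mined term absent from the reference list, such as "insomnia" here, lowers precision but may also represent a genuinely novel adverse reaction, which is why several studies verified such terms against additional reliable sources.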

In this systematic review, rigorous medical standards were used to assess the quality of the included studies. The reliability and validity of all studies were assessed as medium. One reason for this is that, although all studies were performed reasonably well, they failed to be entirely transparent about their process, methodology, software, and the techniques and classifiers that were used. As is presented in Appendix 2, no study disclosed a full overview of this crucial information.

4.3 Effectiveness of Computational Linguistics to Identify Adverse Drug Reactions

The reported findings on the effectiveness of computational linguistics to identify adverse drug reactions from text were analyzed. To identify differences between themes, various characteristics of these publications were compared with the reported results of using computational linguistics to identify adverse drug reactions. The observed findings are presented in Table 3.

[Insert Table 3 here]

As most studies reported positive findings, the results that follow are skewed towards a positive effectiveness.

Positive results were reported for utilizing computational linguistics to identify adverse drug reactions. For all types of drugs that were investigated, positive results were found. Among the studies that named the diseases, only one study observed a neutral effectiveness, for oncological drugs (Bian et al., 2012). Not all publications included sufficient details on the type of drugs that were included in the analyses. Of the studies that did not disclose the type of drugs, 90.91% of publications reported a positive effect, while only 9.09% found a neutral effectiveness for identifying adverse drug reactions.

Most of the studies (62.50%) included unstructured data from social media. In particular, all of these studies included data from Twitter, while one study additionally used content from YouTube (Ru, Harris, & Yao, 2015). The majority of these studies reported that adverse drug reactions could indeed be extracted from textual content that was published to social media. In addition, drug reviews (25.00%) and content from forums (18.75%) were studied frequently. For both data sources, adverse drug reactions were identified correctly. The least studied source was electronic health records, for which positive results were also reported.

There were no significant inconsistencies in the effectiveness of computational linguistics to identify adverse drug reactions with respect to the outcome that was under investigation in each study. For a vast majority of the outcomes, adverse drug reactions could indeed be established. For the outcome of reported adverse drug reactions for cancer, only a neutral effectiveness was reported (Bian et al., 2012). Likewise, although most of the studies that investigated the outcome of reported adverse drug reactions observed positive findings, one study instead found a neutral result (Dai & Wang, 2019).

There were no notable differences in the effectiveness of computational linguistics with respect to the number of drugs that were considered in the publications.

Lastly, no studies had either a low or high reliability or validity. Irrespective of having a medium reliability and validity, a majority of publications (87.50%) reported a positive effectiveness of using computational linguistics to identify adverse drug reactions. In only two studies (12.50%), this effectiveness remained ambiguous and was, therefore, rated as neutral.

5 Discussion

This chapter starts with a discussion of the findings. Subsequently, it presents and elaborates on the limitations of this systematic review. It concludes with suggested recommendations for researchers and practitioners of computational linguistics, policy makers, and users.

5.1 Discussion

The purpose of this study was to review the existing evidence on the effectiveness of machine learning, and in particular computational linguistics, to understand user-generated textual content for the purpose of pharmacovigilance.

A main finding of this systematic review is that the potential of applying computational linguistics for pharmacovigilance is very promising. Studies included in this systematic review consistently reported positive results on the effectiveness and accuracy of applying computational linguistics to user-generated digital content to identify adverse drug reactions. For all diseases investigated, a vast majority of studies reported that the identified adverse drug reactions were consistent with the information provided on the medical package insert.

This finding suggests that the user-generated textual content that drug users share on the Internet may have the potential to augment or enhance the reactive, expensive, and time-consuming traditional system of pharmacovigilance. Computational linguistics may thus be used to automate the monitoring of adverse drug reactions using content that users publish to social media and other digital platforms (Ru et al., 2015). This novel tool may not only contribute to improving public health and the quality of healthcare, but it could potentially also reduce the costs and processing time that are associated with conducting pharmacovigilance. Therefore, this tool may be a viable solution that addresses two of the most prominent challenges of traditional pharmacovigilance, namely the reduction of the high associated costs (Schmider et al., 2019) and the inclusion of adverse drug reactions as experienced by the end-users (European Commission, 2020).

A second main finding of this systematic review is that some studies also correctly identified adverse drug reactions that were previously unknown. This suggests that computational linguistics may also be used to identify novel adverse drug reactions. Therefore, it may serve as a suitable tool for pro-active and real-time health surveillance using remote sensing (Sampathkumar, Chen, & Luo, 2014). As such, this automated system may identify trends and periodically report novel insights to policy makers and public health professionals, and it may support and enable these professionals to initiate timely interventions to protect public health and to maintain, and perhaps even further increase, the quality of healthcare (Akay, Dragomir, Erlandsson, & Akay, 2016).

Although this systematic review finds that the application of computational linguistics may be at least as reliable as the traditional system of pharmacovigilance, it does not suggest that the traditional system is obsolete and should be replaced by computational linguistics. Instead, it may be worthwhile to apply computational linguistics as a complementary tool to retrieve and process adverse drug reactions that end-users share on the Internet. These insights may be combined with the adverse drug reactions that are reported by medical professionals, to achieve a more complete overview of adverse drug reactions. Similarly, computational linguistics may be a suitable tool for the real-time monitoring of adverse drug reactions, or for discovering novel adverse drug reactions. Indeed, once this tool has been developed, calibrated, and optimized, it would function fully automatically and require very limited human input.

Overall, the evidence about the effectiveness of computational linguistics is also consistent with the findings of other researchers, who reported a similar accuracy of machine learning to understand text in various other domains. This suggests that computational linguistics may have the potential to innovate and revolutionize many other industries as well. Future research is required to identify the industries, and their prominent problems, that can be revolutionized using machine learning.

5.2 Limitations

This systematic review has four noteworthy limitations.

First, due to practical considerations, the systematic literature search and study selection were performed by only one researcher. Therefore, it was not possible to establish inter-rater reliability. As a result, it is possible that this has introduced selection bias, but this could not be verified.

Second, it was observed that studies often failed to report information on all themes that were used to extract relevant information (see Appendix 2). Consequently, the absence of these data limited the analyses of the studies with respect to their methodology, sample characteristics, and the utilized techniques. In addition, various publications failed to disclose for which diseases the adverse drug reactions were identified. Failure to report such essential details limits the replicability of these studies.

Third, all studies that were included in this systematic review were found to have a medium quality. Quality was operationalized using reliability and validity. Indeed, rigorous scientific standards were used to rate this quality. The quality was, however, assessed by only one researcher, which impedes the measurement of inter-rater reliability. As was already introduced in the second limitation, the quality of publications could overall be improved by disclosing more details on aspects such as the methodology, sample characteristics, used techniques, and the investigated drugs and related diseases.

Fourth, because findings in the fields of information technology and computational linguistics are not always published in peer-reviewed journals, but often only in conference proceedings or conference papers, both types of publications were included in this systematic review. This is important, because the process of peer review may be more rigorous when performed by journals than by conferences. Indeed, a significant number of the included publications were not journal articles.

5.3 Recommendations

In the following, three recommendations are suggested to various audiences, namely researchers and practitioners in the field of computational linguistics, policy makers, and drug users.

First, researchers and practitioners of computational linguistics ought to be more transparent and disclose more details about the methods, sample characteristics, used technologies, and the drugs and diseases that were investigated. This information has great value, among others to establish the quality of these studies and for the purpose of replicability. Furthermore, although conferences are an important platform to share knowledge and findings in the field of information technology and computational linguistics, it is suggested that these studies are also more often published in peer-reviewed journals.

Second, it may be relevant for policy makers to consider the automated analysis of user-generated textual content for the purpose of pharmacovigilance. The findings indicate strong evidence that computational linguistics can effectively and accurately identify adverse drug reactions using content that drug users have published to the Internet. These artificial intelligence technologies may complement information that is derived from traditional pharmacovigilance. Indeed, this systematic review suggests that policy makers can pro-actively gain greater control over pharmacovigilance. It is strongly suggested that policy makers use these powerful technologies ethically, responsibly, and with great respect for the privacy and anonymity of drug users.

Third, it is suggested that drug users who share their experiences of adverse drug reactions on the Internet publish such content intentionally and thoughtfully. Although the published information is valuable and has the potential to protect public health, it can also easily be misused, for example by creating a profile of these users that includes various types of personal, geographical, and medical information. In that sense, such information can be used or misused for commercial or illegal purposes. Users should be aware of the potential consequences, both positive and negative, of disclosing such information publicly on the Internet.

6 Conclusion

This systematic review has demonstrated that there exists consistent evidence that computational linguistics can be used effectively and accurately to identify adverse drug reactions that are published to the Internet by the users of these drugs. In addition, some studies correctly identified new adverse drug reactions that were previously unknown. The effectiveness and reliability of computational linguistics to understand user-generated textual content for the purpose of pharmacovigilance is also consistent with findings in other domains. This suggests that computational linguistics has the potential to innovate in other industries as well. For pharmacovigilance, computational linguistics has the potential to complement traditional pharmacovigilance.

References

Adrover, C., Bodnar, T., Huang, Z., Telenti, A., & Salathé, M. (2015). Identifying Adverse Effects of HIV Drug Treatment and Associated Sentiments Using Twitter. JMIR Public Health and Surveillance, 1(2), 10.

Aggarwal, C. C., & Zhai, C. (2012). A Survey of Text Clustering Algorithms. In Mining Text Data (pp. 77-128). New York City, NY, USA: Springer.

Akay, A., Dragomir, A., Erlandsson, B.-E., & Akay, M. (2016). Assessing anti-depressants using intelligent data monitoring and mining of online fora. IEEE Journal of Biomedical and Health Informatics, 20(4), 977-986.

Angell, M. (2004). The Truth About the Drug Companies: How They Deceive Us and What to Do About It. New York, NY, USA: Random House.

Bahk, C. Y., Cumming, M., Paushter, L., Madoff, L. C., Thomson, A., & Brownstein, J. S. (2016). Publicly Available Online Tool Facilitates Real-Time Monitoring Of Vaccine Conversations And Sentiments. Health Affairs, 35(2), 341-347.

Barlow, J. (2017). Managing Innovation in Healthcare. London, UK: World Scientific.

Becker, B. F. H., Larson, H. J., Bonhoeffer, J., Van Mulligen, E. M., Kors, J. A., & Sturkenboom, M. C. J. M. (2016). Evaluation of a multinational, multilingual vaccine debate on Twitter. Vaccine, 34(50), 6166-6171.

Beninger, P. (2018). Pharmacovigilance: An Overview. Clinical Therapeutics, 30(12), 1991-2004.

Bian, J., Topaloglu, U., & Yu, F. (2012). Towards Large-scale Twitter Mining for Drug- related Adverse Events. Paper presented at the Proceedings of the 2012 international workshop on Smart health and wellbeing, Maui, Hawaii, USA.

Bramer, W. M., Giustini, D., de Jonge, G. B., Holland, L., & Bekhuis, T. (2016). De- duplication of database search results for systematic reviews in EndNote. Journal of the Medical Library Association: JMLA, 104(3), 240-243.

Brown, R. C., Bendig, E., Fischer, T., Goldwich, A. D., Baumeister, H., & Plener, P. L. (2019). Can acute suicidality be predicted by Instagram data? Results from qualitative and quantitative language analyses. PLoS One, 14(9), e0220623.

Cavazos-Rehg, P. A., Zewdie, K., Krauss, M. J., & Sowles, S. J. (2018). "No High Like a Brownie High": A Content Analysis of Edible Marijuana Tweets. American Journal of , 32(4), 880-886.

Cesare, N., Dwivedi, P., Nguyen, Q. C., & Nsoesie, E. O. (2019). Use of social media, search queries, and demographic data to assess obesity prevalence in the United States. Palgrave Communications, 5, 9.

Chung, W. (2014). BizPro: Extracting and categorizing business intelligence factors from textual news articles. International Journal of Information Management, 34(2), 272-284.

Cole-Lewis, H., Pugatch, J., Sanders, A., Varghese, A., Posada, S., Yun, C., . . . Augustson, E. (2015). Social Listening: A Content Analysis of E-Cigarette Discussions on Twitter. Journal of Medical Internet Research, 17(10), 14.

D'Andrea, E., Ducange, P., & Marcelloni, F. (2017). Monitoring Negative Opinion about Vaccines from Tweets Analysis. Paper presented at the 2017 Third International Conference on Research in Computational Intelligence and Communication Networks (ICRCICN), Kolkata, India.

Dai, H.-J., Touray, M., Jonnagaddala, J., & Syed-Abdul, S. (2016). Feature Engineering for Recognizing Adverse Drug Reactions from Twitter Posts. Information, 7(2), 20.

Dai, H.-J., & Wang, C.-K. (2019). Classifying Adverse Drug Reactions from Imbalanced Twitter Data. International Journal of Medical Informatics, 129, 122-132.

Du, J., Xu, J., Song, H.-Y., & Tao, C. (2017). Leveraging machine learning-based approaches to assess human papillomavirus vaccination sentiment trends with Twitter data. BMC Medical Informatics and Decision Making, 17, 8.

Dunn, A. G., Leask, J., Zhou, X., Mandl, K. D., & Coiera, E. (2015). Associations Between Exposure to and Expression of Negative Opinions About Human Papillomavirus Vaccines on Social Media: An Observational Study. Journal of Medical Internet Research, 17(6), e144.

Edo-Osagie, O., Smith, G., Lake, I., Edeghere, O., & De La Iglesia, B. (2019). Twitter mining using semi-supervised classification for relevance filtering in syndromic surveillance. PLoS One, 14(7), 29.

European Commission. (2008). Strengthening pharmacovigilance to reduce adverse effects of medicines. Brussels, Belgium: European Commission.

European Commission. (2020). Pharmacovigilance. Retrieved from https://ec.europa.eu/health/human-use/pharmacovigilance_en

Gandomi, A., & Haider, M. (2015). Beyond the hype: Big data concepts, methods, and analytics. International Journal of Information Management, 35(2), 137-144.

Ginn, R., Pimpalkhute, P., Nikfarjam, A., Patki, A., O'Connor, K., Sarker, A., . . . Gonzalez, G. (2014). Mining Twitter for Adverse Drug Reaction Mentions: A Corpus and Classification Benchmark. Paper presented at the LREC 2014 - Ninth International Conference on Language Resources and Evaluation, Paris.

Giustini, D., & Boulos, M. N. K. (2013). Google Scholar is not enough to be used alone for systematic reviews. Online Journal of Public Health Informatics, 5(2).

Gräßer, F., Kallumadi, S., Malberg, H., & Zaunseder, S. (2018). Aspect-Based Sentiment Analysis of Drug Reviews Applying Cross-Domain and Cross-Data Learning. Paper presented at the Proceedings of the 2018 International Conference on Digital Health, New York.

Hsieh, H.-F., & Shannon, S. E. (2005). Three Approaches to Qualitative Content Analysis. Qualitative Health Research, 15(9), 1277-1288.

Hswen, Y. L., Qin, Q. Y., Brownstein, J. S., & Hawkins, J. B. (2019). Feasibility of using social media to monitor outdoor air pollution in London, England. Preventive Medicine, 121, 86-93.

Iglesias, J. A., Tiemblo, A., Ledezma, A., & Sanchis, A. (2016). Web news mining in an evolving framework. Information Fusion, 28, 90-98.

Ji, X., Chun, S. A., Wei, Z., & Geller, J. (2015). Twitter sentiment classification for measuring public health concerns. Social Network Analysis and Mining, 5.

Jiang, J. Y., Sun, X., Wang, W., & Young, S. (2019). Enhancing Air Quality Prediction with Social Media and Natural Language Processing. Stroudsburg: Assoc Computational Linguistics-Acl.

Jurafsky, D., & Martin, J. H. (2019). Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition (3rd ed.). Stanford, CA, USA: Stanford University.

Kampmeijer, R., Pavlova, M., Tambor, M., Golinowska, S., & Groot, W. (2016). The use of e-health and m-health tools in health promotion and primary prevention among older adults: a systematic literature review. BMC Health Services Research, 16(Suppl 5), 290.

Kaplan, A. M., & Haenlein, M. (2010). Users of the world, unite! The challenges and opportunities of Social Media. Business Horizons, 53(1), 59-68.

Kate, K., Negi, S., & Kalagnanam, J. (2014). Monitoring violation reports from internet forums. Studies in Health Technology and Informatics, 205, 1090-1094.

Katragadda, S., Karnati, H., Pusala, M., Raghavan, V., & Benton, R. (2015, 9-12 Nov. 2015). Detecting Adverse Drug Effects Using Link Classification on Twitter Data. Paper presented at the 2015 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), Washington, DC, USA.

Kavuluru, R., Ramos-Morales, M., Holaday, T., Williams, A. G., Haye, L., & Cerel, J. (2016). Classification of Helpful Comments on Online Suicide Watch Forums. New York: Assoc Computing Machinery.

Kennedy, H., Elgesem, D., & Miguel, C. (2015). On fairness: User perspectives on social media data mining. Convergence: The International Journal of Research into New Media Technologies, 23(3), 270–288.

Kent, E. E., Prestin, A., Gaysynsky, A., Galica, K., Rinker, R., Graff, K., & Chou, W. Y. S. (2016). "Obesity is the New Major Cause of Cancer": Connections Between Obesity and Cancer on Facebook and Twitter. Journal of Cancer Education, 31(3), 453-459.

Kim, A., Miano, T., Chew, R., Eggers, M., & Nonnemaker, J. (2017). Classification of Twitter Users Who Tweet About E-Cigarettes. JMIR Public Health and Surveillance, 3(3), e63.

Kostygina, G., Tran, H., Shi, Y., Kim, Y., & Emery, S. (2016). "Sweeter Than a Swisher": amount and themes of little cigar and cigarillo content on Twitter. Tobacco Control, 25, i75-i82.

Lee, K., Agrawal, A., & Choudhary, A. (2015, 25-28 Aug. 2015). Mining social media streams to improve public health allergy surveillance. Paper presented at the 2015 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM).

Leis, A., Ronzano, F., Mayer, M. A., Furlong, L. I., & Sanz, F. (2019). Detecting Signs of Depression in Tweets in Spanish: Behavioral and Linguistic Analysis. Journal of Medical Internet Research, 21(6), 16.

Liao, S.-H., Chu, P.-H., & Hsiao, P.-Y. (2012). Data mining techniques and applications – A decade review from 2000 to 2011. Expert Systems with Applications, 39, 11303-11311.

Liberati, A., Altman, D. G., Tetzlaff, J., Mulrow, C., Gøtzsche, P. C., Ioannidis, J. P. A., . . . Moher, D. (2009). The PRISMA statement for reporting systematic reviews and meta- analyses of studies that evaluate healthcare interventions: explanation and elaboration. BMJ, 339, b2700.

Lin, W.-S., Dai, H.-J., Jonnagaddala, J., Chang, N.-W., Jue, T. R., Iqbal, U., . . . Li, Y.-C. (2015, 20-22 Nov. 2015). Utilizing different word representation methods for twitter data in adverse drug reactions extraction. Paper presented at the 2015 Conference on Technologies and Applications of Artificial Intelligence (TAAI), Tainan, Taiwan.

Lu, Y., Zhang, P., Liu, J., Li, J., & Deng, S. (2013). Health-Related Hot Topic Detection in Online Communities Using Text Clustering. PLoS One, 8(2), e56221.

Mikalef, P., Pappas, I. O., Krogstie, J., & Pavlou, P. A. (2020). Big data and business analytics: A research agenda for realizing business value. Information & Management, 57(1), 103237.

Mishra, A., Malviya, A., & Aggarwal, S. (2015, 14-17 Nov. 2015). Towards Automatic Pharmacovigilance: Analysing Patient Reviews and Sentiment on Oncological Drugs. Paper presented at the 2015 IEEE 15th International Conference on Data Mining Workshops, Atlantic City, NJ, USA.

Moher, D., Liberati, A., Tetzlaff, J., Altman, D. G., & The PRISMA Group. (2009). Preferred Reporting Items for Systematic Reviews and Meta-Analyses: The PRISMA Statement. PLOS Medicine, 6(7), e1000097.

Monkman, G. F., Kaiser, M. J., & Hyder, K. (2018). Text and data mining of social media to map wildlife recreation activity. Biological Conservation, 228, 89-99.

Moro, S., Rita, P., & Vala, B. (2016). Predicting social media performance metrics and evaluation of the impact on brand building: A data mining approach. Journal of Business Research, 69(9), 3341-3351.

Mowery, D., Smith, H., Cheney, T., Stoddard, G., Coppersmith, G., Bryan, C., & Conway, M. (2017). Understanding Depressive Symptoms and Psychosocial Stressors on Twitter: A Corpus-Based Study. Journal of Medical Internet Research, 19(2), e48.

Myslin, M., Zhu, S. H., Chapman, W., & Conway, M. (2013). Using twitter to examine smoking behavior and perceptions of emerging tobacco products. Journal of Medical Internet Research, 15(8), e174.

Nargund, K., & Natarajan, S. (2016, 21-24 Sept. 2016). Public health allergy surveillance using micro-blogs. Paper presented at the 2016 International Conference on Advances in Computing, Communications and Informatics (ICACCI).

Nassirtoussi, A. K., Aghabozorgi, S., Wah, T. Y., & Ngo, D. C. L. (2014). Text mining for market prediction: A systematic review. Expert Systems with Applications, 41(16), 7653-7670.

Nikfarjam, A., Sarker, A., O'Connor, K., Ginn, R., & Gonzalez, G. (2015). Pharmacovigilance from social media: mining adverse drug reaction mentions using sequence labeling with cluster features. Journal of the American Medical Informatics Association, 22(3), 671-681.

Oberreuter, G., & Velsquez, J. D. (2013). Text mining applied to plagiarism detection: The use of words for detecting deviations in the writing style. Expert Systems with Applications, 40(9), 3756-3763.

Phan, N., Chun, S. A., Bhole, M., & Geller, J. (2017). Enabling Real-Time Drug Abuse Detection in Tweets. In 2017 IEEE 33rd International Conference on Data Engineering (pp. 1510-1514). New York: IEEE.

Powell, G. A., Zinszer, K., Verma, A., Bahk, C., Madoff, L., Brownstein, J., & Buckeridge, D. (2016). Media content about vaccines in the United States and Canada, 2012-2014: An analysis using data from the Vaccine Sentimeter. Vaccine, 34(50), 6229-6235.

Rawlins, M. D. (2012). Crossing the fourth hurdle. British Journal of Clinical , 73(6), 855-860.

Riga, M., & Karatzas, K. (2014). Investigating the Relationship between Social Media Content and Real-time Observations for Urban Air Quality and Public Health. New York: Assoc Computing Machinery.

Rizk, A., & Elragal, A. (2020). Data science: developing theoretical contributions in information systems via text analytics. Journal of Big Data, 7.

Rong, J., Michalska, S., Subramani, S., Du, J., & Wang, H. (2019). Deep learning for pollen allergy surveillance from twitter in Australia. BMC Medical Informatics and Decision Making, 19(1), 208.

Ru, B., Harris, K., & Yao, L. (2015). A Content Analysis of Patient-Reported Medication Outcomes on Social Media. Atlantic City, NJ, USA: IEEE.

Sadiq, A. T., & Abdullah, S. M. (2014). Hybrid Intelligent Techniques for Text Categorization. International Journal of Advanced Computer Science and Applications, 2, 23-40.

Sampathkumar, H., Chen, X.-W., & Luo, B. (2014). Mining Adverse Drug Reactions from online healthcare forums using Hidden Markov Model. BMC Medical Informatics and Decision Making, 14(91), 18.

Santoro, A., Genov, G., Spooner, A., Raine, J., & Arlett, P. (2017). Promoting and Protecting Public Health: How the European Union Pharmacovigilance System Works. Drug Safety, 40, 855-869.

Sayyadi, H., Salehi, S., & AbolHassani, H. (2007). Survey on News Mining Tasks. In T. Sobh (Ed.), Innovations and Advanced Techniques in Computer and Information Sciences and Engineering (pp. 219-224). Dordrecht, The Netherlands: Springer.

Schmider, J., Kumar, K., LaForest, C., Swankoski, B., Naim, K., & Caubel, P. M. (2019). Innovation in Pharmacovigilance: Use of Artificial Intelligence in Case Processing. Clinical Pharmacology & Therapeutics, 105(4), 954-961.

Song, J. Y., Song, T. M., Seo, D. C., & Jin, J. H. (2016). Data Mining of Web-Based Documents on Social Networking Sites That Included Suicide-Related Words Among Korean Adolescents. Journal of Adolescent Health, 59(6), 668-673.

Straton, N., Ng, R., Jang, H., Vatrapu, R., & Mukkamala, R. R. (2019, November 18-21). Predictive modelling of stigmatized behaviour in vaccination discussions on Facebook. Paper presented at the 2019 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), San Diego, CA, USA.

Sunikka, A., & Bragge, J. (2012). Applying text-mining to personalization and customization research literature – Who, what and where? Expert Systems with Applications, 39(11), 10049-10058.

Szpakowicz, S., Feldman, A., & Kazantseva, A. (2018). Editorial: Computational Linguistics and Literature. Frontiers in Digital Humanities, 5, 1-2.

Tan, A. (1999). Text Mining: The State of the Art and the Challenges. Paper presented at the PAKDD 1999 Workshop on Knowledge Discovery from Advanced Databases.

Tavoschi, L., Quattrone, F., D’Andrea, E., Ducange, P., Vabanesi, M., Marcelloni, F., & Lopalco, P. L. (2020). Twitter as a sentinel tool to monitor public opinion on vaccination: an opinion mining analysis from September 2016 to August 2017 in Italy. Human Vaccines & Immunotherapeutics, 8.

Tekin, M., Etlioğlu, M., Koyuncuoğlu, Ö., & Tekin, E. (2019). Data Mining in Digital Marketing. Paper presented at the Proceedings of the International Symposium for Production Research 2018, Cham.

Tran, T., Nguyen, D., Nguyen, A., & Golen, E. (2018, May 20-24). Sentiment Analysis of Marijuana Content via Facebook Emoji-Based Reactions. Paper presented at the 2018 IEEE International Conference on Communications (ICC).

Tufts, C., Polsky, D., Volpp, K. G., Groeneveld, P. W., Ungar, L., Merchant, R. M., & Pelullo, A. P. (2018). Characterizing Tweet Volume and Content About Common Health Conditions Across Pennsylvania: Retrospective Analysis. JMIR Public Health Surveill, 4(4), e10834.

Walker, S., Sculpher, M., Claxton, K., & Palmer, S. (2012). Coverage with Evidence Development, Only in Research, Risk Sharing, or Patient Access Scheme? A Framework for Coverage Decisions. Value in Health, 15(3), 570-579.

Wall, J., & Krummel, T. (2020). The digital surgeon: How big data, automation, and artificial intelligence will change surgical practice. Journal of Pediatric Surgery, 55, 47-50.

Wang, S., Paul, M. J., & Dredze, M. (2015). Social media as a sensor of air quality and public response in China. Journal of Medical Internet Research, 17(3), e22.

Wang, X., Hripcsak, G., Markatou, M., & Friedman, C. (2009). Active Computerized Pharmacovigilance Using Natural Language Processing, Statistics, and Electronic Health Records: A Feasibility Study. Journal of the American Medical Informatics Association, 16(3), 328-337.

Ward, P. J., Rock, P. J., Slavova, S., Young, A. M., Bunn, T. L., & Kavuluru, R. (2019). Enhancing timeliness of drug overdose mortality surveillance: A machine learning approach. PLoS One, 14(10), e0223318.

World Health Organization. (2004). WHO Policy Perspectives on Medicines - Pharmacovigilance: ensuring the safe use of medicines. Geneva, Switzerland: World Health Organization.

Wu, L., Moh, T.-S., & Khuri, N. (2015, October 29-November 1). Twitter Opinion Mining for Adverse Drug Reactions. Paper presented at the 2015 IEEE International Conference on Big Data (Big Data), Santa Clara, CA, USA.

Yang, C. C., Yang, H., Jiang, L., & Zhang, M. (2012). Media Mining for Drug Safety Signal Detection. Paper presented at the Proceedings of the 2012 international workshop on Smart health and wellbeing, Maui, Hawaii, USA.

Yang, Y. T., Horneffer, M., & DiLisio, N. (2013). Mining social media and web searches for disease detection. Journal of Public Health Research, 2(1), 17-21.

Zeng, Z., Shi, H., Wu, Y., & Hong, Z. (2015). Survey on Natural Language Processing Techniques in Bioinformatics. Computational and Mathematical Methods in Medicine, 1-10.

Zhou, X., Coiera, E., Tsafnat, G., Arachi, D., Ong, M.-S., & Dunn, A. G. (2015). Using social connection information to improve opinion mining: Identifying negative sentiment about HPV vaccines on Twitter. Studies in Health Technology and Informatics, 216, 761-765.

Zucco, C., Calabrese, B., & Cannataro, M. (2017, November 13-16). Sentiment analysis and affective computing for depression monitoring. Paper presented at the 2017 IEEE International Conference on Bioinformatics and Biomedicine (BIBM).

Table 1. Summary of Characteristics of Publications Included in the Analysis (16 Publications Reviewed)

(Adrover et al., 2015)
  Data source: Social media
  Sample size: 1,642 tweets
  Horizon of data collection: 3 years
  Software used: Toolkit for Multivariate Analysis
  Techniques and classifiers used: Artificial Neural Networks (ANN); Boosted Decision Trees with AdaBoost (BDT); Boosted Decision Trees with Bagging (BDTG); Sentiment Analysis; Support Vector Machines (SVM)
  Outcome: Reported adverse drug reactions for HIV treatment
  Result: Positive
  Description of result: Reported adverse effects are consistent with well-recognized toxicities.

(Akay et al., 2016)
  Data source: Forums
  Sample size: 7,726 posts
  Horizon of data collection: 10 years
  Software used: General Architecture for Text Engineering (GATE); NLTK toolkit within MATLAB; RapidMiner
  Techniques and classifiers used: Hyperlink-Induced Topic Search (HITS); k-Means Clustering; Network Analysis; Term-frequency-inverse document frequency (TF-IDF)
  Outcome: User sentiment on depression drugs
  Result: Positive
  Description of result: Natural language processing is suitable to extract information on adverse drug reactions concerning depression.

(Bian et al., 2012)
  Data source: Social media
  Sample size: 2,102,176,189 tweets
  Horizon of data collection: 1 year
  Software used: Apache Lucene
  Techniques and classifiers used: MetaMap; Support Vector Machines (SVM)
  Outcome: Reported adverse drug reactions for cancer
  Result: Neutral
  Description of result: Classification models had limited performance. Adverse events related to cancer drugs can potentially be extracted from tweets.

(Dai, Touray, Jonnagaddala, & Syed-Abdul, 2016)
  Data source: Social media
  Sample size: 6,528 tweets
  Horizon of data collection: Unknown
  Software used: GENIA tagger; Hunspell; Snowball stemmer; Stanford Topic Modelling Toolbox; Twokenizer
  Techniques and classifiers used: Backward/forward sequential feature selection (BSFS/FSFS) algorithm; k-Means Clustering; Sentiment Analysis; Support Vector Machines (SVM)
  Outcome: Reported adverse drug reactions
  Result: Positive
  Description of result: Adverse drug reactions were identified reasonably well.

(Dai & Wang, 2019)
  Data source: Social media
  Sample size: 32,670 tweets
  Horizon of data collection: Unknown
  Software used: Hunspell; Twitter tokenizer
  Techniques and classifiers used: Support Vector Machines (SVM); Term Frequency-Inverse Document Frequency (TF-IDF)
  Outcome: Reported adverse drug reactions
  Result: Neutral
  Description of result: Adverse drug reactions were not identified very well.

(Ginn et al., 2014)
  Data source: Social media
  Sample size: 10,822 tweets
  Horizon of data collection: Unknown
  Software used: Unknown
  Techniques and classifiers used: Naïve Bayes (NB); Natural Language Processing (NLP); Support Vector Machines (SVM)
  Outcome: Reported adverse drug reactions
  Result: Positive
  Description of result: Adverse drug reactions were identified well.

(Gräßer, Kallumadi, Malberg, & Zaunseder, 2018)
  Data source: Drug reviews
  Sample size: 218,614 reviews
  Horizon of data collection: Unknown
  Software used: BeautifulSoup
  Techniques and classifiers used: Logistic Regression; Sentiment Analysis
  Outcome: Patient satisfaction with drugs; Reported adverse drug reactions; Reported effectiveness of drugs
  Result: Positive
  Description of result: Classification results were very good.

(Katragadda, Karnati, Pusala, Raghavan, & Benton, 2015)
  Data source: Social media
  Sample size: 172,800 tweets
  Horizon of data collection: 1 year
  Software used: Twitter4J
  Techniques and classifiers used: Decision Trees; Medical Profile Graph; Natural Language Processing (NLP)
  Outcome: Reported adverse drug reactions
  Result: Positive
  Description of result: Building a medical profile of users enables the accurate detection of adverse drug events.

(Lin et al., 2015)
  Data source: Social media
  Sample size: 1,245 tweets
  Horizon of data collection: Unknown
  Software used: CRF++ toolkit; GENIA tagger; Hunspell; Twitter REST API; Twokenizer
  Techniques and classifiers used: Natural Language Processing (NLP)
  Outcome: Reported adverse drug reactions
  Result: Positive
  Description of result: Adverse drug reactions were identified reasonably well.

(Mishra, Malviya, & Aggarwal, 2015)
  Data source: Drug reviews
  Sample size: Unknown
  Horizon of data collection: Unknown
  Software used: SentiWordNet; WordNet
  Techniques and classifiers used: Sentiment Analysis; Support Vector Machines (SVM); Term Document Matrix (TDM)
  Outcome: User sentiment on cancer drugs
  Result: Positive
  Description of result: Sentiment on adverse drug reactions was identified reasonably well.

(Nikfarjam, Sarker, O'Connor, Ginn, & Gonzalez, 2015)
  Data source: Drug reviews; Social media
  Sample size: DailyStrength.org: 6,279 reviews; Twitter.com: 1,784 tweets
  Horizon of data collection: Unknown
  Software used: Unknown
  Techniques and classifiers used: ARDMine; Lexicon-based; MetaMap; Support Vector Machines (SVM)
  Outcome: Reported adverse drug reactions
  Result: Positive
  Description of result: Adverse drug reactions were identified very well.

(Ru et al., 2015)
  Data source: Drug reviews; Social media
  Sample size: PatientsLikeMe.com: 796 reviews; Twitter.com: 39,127 tweets; WebMD.com: 2,567 reviews; YouTube.com: 42,544 comments
  Horizon of data collection: Not applicable
  Software used: Deeply Moving
  Techniques and classifiers used: Unknown
  Outcome: Patient-reported medication outcomes
  Result: Positive
  Description of result: Social media serves as a new data source to extract patient-reported medication outcomes.

(Sampathkumar et al., 2014)
  Data source: Forums
  Sample size: .com: 8,065 posts; SteadyHealth.com: 11,878 posts
  Horizon of data collection: Not applicable
  Software used: Java Hidden Markov Model library; jsoup
  Techniques and classifiers used: Hidden Markov Model (HMM); Natural Language Processing (NLP)
  Outcome: Reported adverse drug reactions
  Result: Positive
  Description of result: Reported adverse effects are consistent with well-recognized side-effects.

(Wang, Hripcsak, Markatou, & Friedman, 2009)
  Data source: Electronic Health Record (EHR)
  Sample size: 25,074 discharge summaries
  Horizon of data collection: Not applicable
  Software used: MedLEE
  Techniques and classifiers used: Unknown
  Outcome: Reported adverse drug reactions
  Result: Positive
  Description of result: Reported adverse effects are consistent with well-recognized toxicities (recall: 75%; precision: 31%).

(Wu, Moh, & Khuri, 2015)
  Data source: Social media
  Sample size: 3,251 tweets
  Horizon of data collection: Not applicable
  Software used: AFINN; Bing Liu sentiment words; Multi-Perspective Question Answering (MPQA); SentiWordNet; TextBlob; Tweepy; WEKA
  Techniques and classifiers used: MetaMap; Naïve Bayes (NB); Natural Language Processing (NLP); Sentiment Analysis; Support Vector Machines (SVM)
  Outcome: Reported adverse drug reactions
  Result: Positive
  Description of result: Several well-known adverse drug reactions were identified.

(Yang, Yang, Jiang, & Zhang, 2012)
  Data source: Forums
  Sample size: 6,244 discussion threads
  Horizon of data collection: Unknown
  Software used: Unknown
  Techniques and classifiers used: Association Mining
  Outcome: Reported adverse drug reactions
  Result: Positive
  Description of result: Adverse drug reactions were identified.

Table 2. General Description of Publications Included in the Analysis (16 Publications Reviewed)

Each row is listed as: Sub-category: n (%) Publications.

Year of publication
  2009: 1 (6.25) (Wang et al., 2009)
  2010: 0 (0.00)
  2011: 0 (0.00)
  2012: 2 (12.50) (Bian et al., 2012; Yang et al., 2012)
  2013: 0 (0.00)
  2014: 2 (12.50) (Ginn et al., 2014; Sampathkumar et al., 2014)
  2015: 7 (43.75) (Adrover et al., 2015; Katragadda et al., 2015; Lin et al., 2015; Mishra et al., 2015; Nikfarjam et al., 2015; Ru et al., 2015; Wu et al., 2015)
  2016: 2 (12.50) (Akay et al., 2016; Dai et al., 2016)
  2017: 0 (0.00)
  2018: 1 (6.25) (Gräßer et al., 2018)
  2019: 1 (6.25) (Dai & Wang, 2019)

Type of drugs
  Asthma: 1 (5.26) (Ru et al., 2015)
  Cancer: 2 (10.53) (Bian et al., 2012; Mishra et al., 2015)
  Cystic fibrosis: 1 (5.26) (Ru et al., 2015)
  Depression: 1 (5.26) (Akay et al., 2016)
  HIV: 1 (5.26) (Adrover et al., 2015)
  Rheumatoid arthritis: 1 (5.26) (Ru et al., 2015)
  Type 2 diabetes: 1 (5.26) (Ru et al., 2015)
  Unknown: 11 (57.89) (Dai et al., 2016; Dai & Wang, 2019; Ginn et al., 2014; Gräßer et al., 2018; Katragadda et al., 2015; Lin et al., 2015; Nikfarjam et al., 2015; Sampathkumar et al., 2014; Wang et al., 2009; Wu et al., 2015; Yang et al., 2012)

Data source
  Drug reviews: 4 (22.22) (Gräßer et al., 2018; Mishra et al., 2015; Nikfarjam et al., 2015; Ru et al., 2015)
  Electronic Health Record (EHR): 1 (5.56) (Wang et al., 2009)
  Forums: 3 (16.67) (Akay et al., 2016; Sampathkumar et al., 2014; Yang et al., 2012)
  Social media: 10 (55.56) (Adrover et al., 2015; Bian et al., 2012; Dai et al., 2016; Dai & Wang, 2019; Ginn et al., 2014; Katragadda et al., 2015; Lin et al., 2015; Nikfarjam et al., 2015; Ru et al., 2015; Wu et al., 2015)

Sample size
  Less than 5,000: 3 (18.75) (Adrover et al., 2015; Lin et al., 2015; Wu et al., 2015)
  5,000 to 9,999: 4 (25.00) (Akay et al., 2016; Dai et al., 2016; Nikfarjam et al., 2015; Yang et al., 2012)
  10,000 to 14,999: 1 (6.25) (Ginn et al., 2014)
  15,000 to 19,999: 1 (6.25) (Sampathkumar et al., 2014)
  20,000 or more: 6 (37.50) (Bian et al., 2012; Dai & Wang, 2019; Gräßer et al., 2018; Katragadda et al., 2015; Ru et al., 2015; Wang et al., 2009)
  Unknown: 1 (6.25) (Mishra et al., 2015)

Users
  HIV-infected persons undergoing drug treatment: 1 (6.25) (Adrover et al., 2015)
  Unknown: 15 (93.75) (Akay et al., 2016; Bian et al., 2012; Dai et al., 2016; Dai & Wang, 2019; Ginn et al., 2014; Gräßer et al., 2018; Katragadda et al., 2015; Lin et al., 2015; Mishra et al., 2015; Nikfarjam et al., 2015; Ru et al., 2015; Sampathkumar et al., 2014; Wang et al., 2009; Wu et al., 2015; Yang et al., 2012)

Unique users
  Less than 5,000: 2 (12.50) (Adrover et al., 2015; Katragadda et al., 2015)
  5,000 to 9,999: 0 (0.00)
  10,000 to 14,999: 0 (0.00)
  15,000 to 19,999: 0 (0.00)
  20,000 or more: 0 (0.00)
  Unknown: 14 (87.50) (Akay et al., 2016; Bian et al., 2012; Dai et al., 2016; Dai & Wang, 2019; Ginn et al., 2014; Gräßer et al., 2018; Lin et al., 2015; Mishra et al., 2015; Nikfarjam et al., 2015; Ru et al., 2015; Sampathkumar et al., 2014; Wang et al., 2009; Wu et al., 2015; Yang et al., 2012)

Origin of users
  Canada: 1 (5.26) (Adrover et al., 2015)
  South Africa: 1 (5.26) (Adrover et al., 2015)
  United Kingdom: 1 (5.26) (Adrover et al., 2015)
  United States: 1 (5.26) (Adrover et al., 2015)
  Unknown: 15 (78.95) (Akay et al., 2016; Bian et al., 2012; Dai et al., 2016; Dai & Wang, 2019; Ginn et al., 2014; Gräßer et al., 2018; Katragadda et al., 2015; Lin et al., 2015; Mishra et al., 2015; Nikfarjam et al., 2015; Ru et al., 2015; Sampathkumar et al., 2014; Wang et al., 2009; Wu et al., 2015; Yang et al., 2012)

Average number of followers
  Less than 5,000: 1 (6.25) (Adrover et al., 2015)
  5,000 to 9,999: 0 (0.00)
  10,000 to 14,999: 0 (0.00)
  15,000 to 19,999: 0 (0.00)
  20,000 or more: 0 (0.00)
  Unknown: 15 (93.75) (Akay et al., 2016; Bian et al., 2012; Dai et al., 2016; Dai & Wang, 2019; Ginn et al., 2014; Gräßer et al., 2018; Katragadda et al., 2015; Lin et al., 2015; Mishra et al., 2015; Nikfarjam et al., 2015; Ru et al., 2015; Sampathkumar et al., 2014; Wang et al., 2009; Wu et al., 2015; Yang et al., 2012)

Years of data collection
  2004: 2 (6.45) (Akay et al., 2016; Wang et al., 2009)
  2005: 1 (3.23) (Akay et al., 2016)
  2006: 1 (3.23) (Akay et al., 2016)
  2007: 1 (3.23) (Akay et al., 2016)
  2008: 1 (3.23) (Akay et al., 2016)
  2009: 2 (6.45) (Akay et al., 2016; Bian et al., 2012)
  2010: 3 (9.68) (Adrover et al., 2015; Akay et al., 2016; Bian et al., 2012)
  2011: 2 (6.45) (Adrover et al., 2015; Akay et al., 2016)
  2012: 3 (9.68) (Adrover et al., 2015; Akay et al., 2016; Sampathkumar et al., 2014)
  2013: 2 (6.45) (Adrover et al., 2015; Akay et al., 2016)
  2014: 3 (9.68) (Akay et al., 2016; Katragadda et al., 2015; Ru et al., 2015)
  2015: 2 (6.45) (Katragadda et al., 2015; Wu et al., 2015)
  Unknown: 8 (25.81) (Dai et al., 2016; Dai & Wang, 2019; Ginn et al., 2014; Gräßer et al., 2018; Lin et al., 2015; Mishra et al., 2015; Nikfarjam et al., 2015; Yang et al., 2012)

Horizon of data collection
  1 year: 2 (12.50) (Bian et al., 2012; Katragadda et al., 2015)
  2 to 5 years: 1 (6.25) (Adrover et al., 2015)
  6 to 10 years: 1 (6.25) (Akay et al., 2016)
  Not applicable: 4 (25.00) (Ru et al., 2015; Sampathkumar et al., 2014; Wang et al., 2009; Wu et al., 2015)
  Unknown: 8 (50.00) (Dai et al., 2016; Dai & Wang, 2019; Ginn et al., 2014; Gräßer et al., 2018; Lin et al., 2015; Mishra et al., 2015; Nikfarjam et al., 2015; Yang et al., 2012)

Software used
  AFINN: 1 (2.86) (Wu et al., 2015)
  Apache Lucene: 1 (2.86) (Bian et al., 2012)
  BeautifulSoup: 1 (2.86) (Gräßer et al., 2018)
  Bing Liu sentiment words: 1 (2.86) (Wu et al., 2015)
  CRF++ toolkit: 1 (2.86) (Lin et al., 2015)
  Deeply Moving: 1 (2.86) (Ru et al., 2015)
  General Architecture for Text Engineering (GATE): 1 (2.86) (Akay et al., 2016)
  GENIA tagger: 2 (5.71) (Dai et al., 2016; Lin et al., 2015)
  Hunspell: 3 (8.57) (Dai et al., 2016; Dai & Wang, 2019; Lin et al., 2015)
  Java Hidden Markov Model library: 1 (2.86) (Sampathkumar et al., 2014)
  jsoup: 1 (2.86) (Sampathkumar et al., 2014)
  MedLEE: 1 (2.86) (Wang et al., 2009)
  Multi-Perspective Question Answering (MPQA): 1 (2.86) (Wu et al., 2015)
  NLTK toolkit within MATLAB: 1 (2.86) (Akay et al., 2016)
  RapidMiner: 1 (2.86) (Akay et al., 2016)
  SentiWordNet: 2 (5.71) (Mishra et al., 2015; Wu et al., 2015)
  Snowball stemmer: 1 (2.86) (Dai et al., 2016)
  Stanford Topic Modelling Toolbox: 1 (2.86) (Dai et al., 2016)
  TextBlob: 1 (2.86) (Wu et al., 2015)
  Toolkit for Multivariate Analysis: 1 (2.86) (Adrover et al., 2015)
  Tweepy: 1 (2.86) (Wu et al., 2015)
  Twitter REST API: 1 (2.86) (Lin et al., 2015)
  Twitter tokenizer: 1 (2.86) (Dai & Wang, 2019)
  Twitter4J: 1 (2.86) (Katragadda et al., 2015)
  Twokenizer: 2 (5.71) (Dai et al., 2016; Lin et al., 2015)
  Unknown: 3 (8.57) (Ginn et al., 2014; Nikfarjam et al., 2015; Yang et al., 2012)
  WEKA: 1 (2.86) (Wu et al., 2015)
  WordNet: 1 (2.86) (Mishra et al., 2015)

Techniques and classifiers used
  ARDMine: 1 (2.33) (Nikfarjam et al., 2015)
  Artificial Neural Networks (ANN): 1 (2.33) (Adrover et al., 2015)
  Association Mining: 1 (2.33) (Yang et al., 2012)
  Backward/forward sequential feature selection (BSFS/FSFS) algorithm: 1 (2.33) (Dai et al., 2016)
  Boosted Decision Trees with AdaBoost (BDT): 1 (2.33) (Adrover et al., 2015)
  Boosted Decision Trees with Bagging (BDTG): 1 (2.33) (Adrover et al., 2015)
  Decision Trees: 1 (2.33) (Katragadda et al., 2015)
  Hidden Markov Model (HMM): 1 (2.33) (Sampathkumar et al., 2014)
  Hyperlink-Induced Topic Search (HITS): 1 (2.33) (Akay et al., 2016)
  k-Means Clustering: 2 (4.65) (Akay et al., 2016; Dai et al., 2016)
  Lexicon-based: 1 (2.33) (Nikfarjam et al., 2015)
  Logistic Regression: 1 (2.33) (Gräßer et al., 2018)
  Medical Profile Graph: 1 (2.33) (Katragadda et al., 2015)
  MetaMap: 3 (6.98) (Bian et al., 2012; Nikfarjam et al., 2015; Wu et al., 2015)
  Naïve Bayes (NB): 2 (4.65) (Ginn et al., 2014; Wu et al., 2015)
  Natural Language Processing (NLP): 5 (11.63) (Ginn et al., 2014; Katragadda et al., 2015; Lin et al., 2015; Sampathkumar et al., 2014; Wu et al., 2015)
  Network Analysis: 1 (2.33) (Akay et al., 2016)
  Sentiment Analysis: 5 (11.63) (Adrover et al., 2015; Dai et al., 2016; Gräßer et al., 2018; Mishra et al., 2015; Wu et al., 2015)
  Support Vector Machines (SVM): 8 (18.60) (Adrover et al., 2015; Bian et al., 2012; Dai et al., 2016; Dai & Wang, 2019; Ginn et al., 2014; Mishra et al., 2015; Nikfarjam et al., 2015; Wu et al., 2015)
  Term Document Matrix (TDM): 1 (2.33) (Mishra et al., 2015)
  Term-frequency-inverse document frequency (TF-IDF): 2 (4.65) (Akay et al., 2016; Dai & Wang, 2019)
  Unknown: 2 (4.65) (Ru et al., 2015; Wang et al., 2009)

Outcome
  Patient satisfaction with drugs: 1 (5.56) (Gräßer et al., 2018)
  Patient-reported medication outcomes: 1 (5.56) (Ru et al., 2015)
  Reported adverse drug reactions: 11 (61.11) (Dai et al., 2016; Dai & Wang, 2019; Ginn et al., 2014; Gräßer et al., 2018; Katragadda et al., 2015; Lin et al., 2015; Nikfarjam et al., 2015; Sampathkumar et al., 2014; Wang et al., 2009; Wu et al., 2015; Yang et al., 2012)
  Reported adverse drug reactions for cancer: 1 (5.56) (Bian et al., 2012)
  Reported adverse drug reactions for HIV treatment: 1 (5.56) (Adrover et al., 2015)
  Reported effectiveness of drugs: 1 (5.56) (Gräßer et al., 2018)
  User sentiment on depression drugs: 1 (5.56) (Akay et al., 2016)
  User sentiment on cancer drugs: 1 (5.56) (Mishra et al., 2015)

Drugs studied
  Less than 5: 1 (6.25) (Adrover et al., 2015)
  5 to 9: 4 (25.00) (Akay et al., 2016; Bian et al., 2012; Wang et al., 2009; Wu et al., 2015)
  10 to 14: 2 (12.50) (Ru et al., 2015; Yang et al., 2012)
  15 to 19: 0 (0.00)
  20 or more: 5 (31.25) (Ginn et al., 2014; Katragadda et al., 2015; Mishra et al., 2015; Nikfarjam et al., 2015; Sampathkumar et al., 2014)
  Unknown: 4 (25.00) (Dai et al., 2016; Dai & Wang, 2019; Gräßer et al., 2018; Lin et al., 2015)

Result
  Positive: 14 (87.50) (Adrover et al., 2015; Akay et al., 2016; Dai et al., 2016; Ginn et al., 2014; Gräßer et al., 2018; Katragadda et al., 2015; Lin et al., 2015; Mishra et al., 2015; Nikfarjam et al., 2015; Ru et al., 2015; Sampathkumar et al., 2014; Wang et al., 2009; Wu et al., 2015; Yang et al., 2012)
  Neutral: 2 (12.50) (Bian et al., 2012; Dai & Wang, 2019)
  Negative: 0 (0.00)

Reliability
  Low: 0 (0.00)
  Medium: 16 (100.00) (Adrover et al., 2015; Akay et al., 2016; Bian et al., 2012; Dai et al., 2016; Dai & Wang, 2019; Ginn et al., 2014; Gräßer et al., 2018; Katragadda et al., 2015; Lin et al., 2015; Mishra et al., 2015; Nikfarjam et al., 2015; Ru et al., 2015; Sampathkumar et al., 2014; Wang et al., 2009; Wu et al., 2015; Yang et al., 2012)
  High: 0 (0.00)

Validity
  Low: 0 (0.00)
  Medium: 16 (100.00) (Adrover et al., 2015; Akay et al., 2016; Bian et al., 2012; Dai et al., 2016; Dai & Wang, 2019; Ginn et al., 2014; Gräßer et al., 2018; Katragadda et al., 2015; Lin et al., 2015; Mishra et al., 2015; Nikfarjam et al., 2015; Ru et al., 2015; Sampathkumar et al., 2014; Wang et al., 2009; Wu et al., 2015; Yang et al., 2012)
  High: 0 (0.00)

Table 3. Publications by Classification Category and Result (16 Publications Reviewed)

Each row is listed as: Sub-category: Positive n (%); Neutral n (%); Negative n (%); Publications.

Type of drugs
  Asthma: Positive 1 (5.26); Neutral 0 (0.00); Negative 0 (0.00); (Ru et al., 2015)
  Cancer: Positive 1 (5.26); Neutral 1 (5.26); Negative 0 (0.00); (Bian et al., 2012; Mishra et al., 2015)
  Cystic fibrosis: Positive 1 (5.26); Neutral 0 (0.00); Negative 0 (0.00); (Ru et al., 2015)
  Depression: Positive 1 (5.26); Neutral 0 (0.00); Negative 0 (0.00); (Akay et al., 2016)
  HIV: Positive 1 (5.26); Neutral 0 (0.00); Negative 0 (0.00); (Adrover et al., 2015)
  Rheumatoid arthritis: Positive 1 (5.26); Neutral 0 (0.00); Negative 0 (0.00); (Ru et al., 2015)
  Type 2 diabetes: Positive 1 (5.26); Neutral 0 (0.00); Negative 0 (0.00); (Ru et al., 2015)
  Unknown: Positive 10 (52.63); Neutral 1 (5.26); Negative 0 (0.00); (Dai et al., 2016; Dai & Wang, 2019; Ginn et al., 2014; Gräßer et al., 2018; Katragadda et al., 2015; Lin et al., 2015; Nikfarjam et al., 2015; Sampathkumar et al., 2014; Wang et al., 2009; Wu et al., 2015; Yang et al., 2012)

Data source
  Drug reviews: Positive 4 (22.22); Neutral 0 (0.00); Negative 0 (0.00); (Gräßer et al., 2018; Mishra et al., 2015; Nikfarjam et al., 2015; Ru et al., 2015)
  Electronic Health Record (EHR): Positive 1 (5.56); Neutral 0 (0.00); Negative 0 (0.00); (Wang et al., 2009)
  Forums: Positive 3 (16.67); Neutral 0 (0.00); Negative 0 (0.00); (Akay et al., 2016; Sampathkumar et al., 2014; Yang et al., 2012)
  Social media: Positive 8 (44.44); Neutral 2 (11.11); Negative 0 (0.00); (Adrover et al., 2015; Bian et al., 2012; Dai et al., 2016; Dai & Wang, 2019; Ginn et al., 2014; Katragadda et al., 2015; Lin et al., 2015; Nikfarjam et al., 2015; Ru et al., 2015; Wu et al., 2015)

Origin of users
  Canada: Positive 1 (5.26); Neutral 0 (0.00); Negative 0 (0.00); (Adrover et al., 2015)
  South Africa: Positive 1 (5.26); Neutral 0 (0.00); Negative 0 (0.00); (Adrover et al., 2015)
  United Kingdom: Positive 1 (5.26); Neutral 0 (0.00); Negative 0 (0.00); (Adrover et al., 2015)
  United States: Positive 1 (5.26); Neutral 0 (0.00); Negative 0 (0.00); (Adrover et al., 2015)
  Unknown: Positive 13 (68.42); Neutral 2 (10.53); Negative 0 (0.00); (Akay et al., 2016; Bian et al., 2012; Dai et al., 2016; Dai & Wang, 2019; Ginn et al., 2014; Gräßer et al., 2018; Katragadda et al., 2015; Lin et al., 2015; Mishra et al., 2015; Nikfarjam et al., 2015; Ru et al., 2015; Sampathkumar et al., 2014; Wang et al., 2009; Wu et al., 2015; Yang et al., 2012)

Horizon of data collection
  1 year: Positive 1 (6.25); Neutral 1 (6.25); Negative 0 (0.00); (Bian et al., 2012; Katragadda et al., 2015)
  2 to 5 years: Positive 1 (6.25); Neutral 0 (0.00); Negative 0 (0.00); (Adrover et al., 2015)
  6 to 10 years: Positive 1 (6.25); Neutral 0 (0.00); Negative 0 (0.00); (Akay et al., 2016)
  Not applicable: Positive 4 (25.00); Neutral 0 (0.00); Negative 0 (0.00); (Ru et al., 2015; Sampathkumar et al., 2014; Wang et al., 2009; Wu et al., 2015)
  Unknown: Positive 7 (43.75); Neutral 1 (6.25); Negative 0 (0.00); (Dai et al., 2016; Dai & Wang, 2019; Ginn et al., 2014; Gräßer et al., 2018; Lin et al., 2015; Mishra et al., 2015; Nikfarjam et al., 2015; Yang et al., 2012)

Outcome
  Patient satisfaction with drugs: Positive 1 (5.56); Neutral 0 (0.00); Negative 0 (0.00); (Gräßer et al., 2018)
  Patient-reported medication outcomes: Positive 1 (5.56); Neutral 0 (0.00); Negative 0 (0.00); (Ru et al., 2015)
  Reported adverse drug reactions: Positive 10 (55.56); Neutral 1 (5.56); Negative 0 (0.00); (Dai et al., 2016; Dai & Wang, 2019; Ginn et al., 2014; Gräßer et al., 2018; Katragadda et al., 2015; Lin et al., 2015; Nikfarjam et al., 2015; Sampathkumar et al., 2014; Wang et al., 2009; Wu et al., 2015; Yang et al., 2012)
  Reported adverse drug reactions for cancer: Positive 0 (0.00); Neutral 1 (5.56); Negative 0 (0.00); (Bian et al., 2012)
  Reported adverse drug reactions for HIV treatment: Positive 1 (5.56); Neutral 0 (0.00); Negative 0 (0.00); (Adrover et al., 2015)
  Reported effectiveness of drugs: Positive 1 (5.56); Neutral 0 (0.00); Negative 0 (0.00); (Gräßer et al., 2018)
  User sentiment on depression drugs: Positive 1 (5.56); Neutral 0 (0.00); Negative 0 (0.00); (Akay et al., 2016)
  User sentiment on cancer drugs: Positive 1 (5.56); Neutral 0 (0.00); Negative 0 (0.00); (Mishra et al., 2015)

Drugs studied
  Less than 5: Positive 1 (6.25); Neutral 0 (0.00); Negative 0 (0.00); (Adrover et al., 2015)
  5 to 9: Positive 3 (18.75); Neutral 1 (6.25); Negative 0 (0.00); (Akay et al., 2016; Bian et al., 2012; Wang et al., 2009; Wu et al., 2015)
  10 to 14: Positive 2 (12.50); Neutral 0 (0.00); Negative 0 (0.00); (Ru et al., 2015; Yang et al., 2012)
  15 to 19: Positive 0 (0.00); Neutral 0 (0.00); Negative 0 (0.00)
  20 or more: Positive 5 (31.25); Neutral 0 (0.00); Negative 0 (0.00); (Ginn et al., 2014; Katragadda et al., 2015; Mishra et al., 2015; Nikfarjam et al., 2015; Sampathkumar et al., 2014)
  Unknown: Positive 3 (18.75); Neutral 1 (6.25); Negative 0 (0.00); (Dai et al., 2016; Dai & Wang, 2019; Gräßer et al., 2018; Lin et al., 2015)

Reliability
  Low: Positive 0 (0.00); Neutral 0 (0.00); Negative 0 (0.00)
  Medium: Positive 14 (87.50); Neutral 2 (12.50); Negative 0 (0.00); (Adrover et al., 2015; Akay et al., 2016; Bian et al., 2012; Dai et al., 2016; Dai & Wang, 2019; Ginn et al., 2014; Gräßer et al., 2018; Katragadda et al., 2015; Lin et al., 2015; Mishra et al., 2015; Nikfarjam et al., 2015; Ru et al., 2015; Sampathkumar et al., 2014; Wang et al., 2009; Wu et al., 2015; Yang et al., 2012)
  High: Positive 0 (0.00); Neutral 0 (0.00); Negative 0 (0.00)

Validity
  Low: Positive 0 (0.00); Neutral 0 (0.00); Negative 0 (0.00)
  Medium: Positive 14 (87.50); Neutral 2 (12.50); Negative 0 (0.00); (Adrover et al., 2015; Akay et al., 2016; Bian et al., 2012; Dai et al., 2016; Dai & Wang, 2019; Ginn et al., 2014; Gräßer et al., 2018; Katragadda et al., 2015; Lin et al., 2015; Mishra et al., 2015; Nikfarjam et al., 2015; Ru et al., 2015; Sampathkumar et al., 2014; Wang et al., 2009; Wu et al., 2015; Yang et al., 2012)
  High: Positive 0 (0.00); Neutral 0 (0.00); Negative 0 (0.00)

Appendices

Appendix 1. Search Strategies

Table 4. Search strategy for PubMed

PubMed

Block 1: Computational Linguistics artificial intelligence[Title/Abstract] OR Artificial Intelligence[MeSH Terms] OR machine learning[Title/Abstract] OR Machine Learning[MeSH Terms] OR text mining[Title/Abstract] OR computational linguistics[Title/Abstract] OR natural language processing[Title/Abstract] OR Natural Language Processing[MeSH Terms] OR nlp[Title/Abstract] OR sentiment analysis[Title/Abstract] OR word embedding*[Title/Abstract] OR natural language toolkit[Title/Abstract] OR nltk[Title/Abstract]

Block 2: Health Surveillance public health surveillance[Title/Abstract] OR Public Health Surveillance[MeSH Terms] OR health surveillance[Title/Abstract] OR public health monitoring[Title/Abstract] OR health monitoring[Title/Abstract]

Filters:
- Article types: Journal Article
- Languages: English

Table 5. Search strategy for Web of Science

Web of Science

Block 1: Computational Linguistics TS=(artificial intelligence) OR TS=(machine learning) OR TS=(text mining) OR TS=(computational linguistics) OR TS=(natural language processing) OR TS=(nlp) OR TS=(sentiment analysis) OR TS=(word embedding*) OR TS=(natural language toolkit) OR TS=(nltk)

Block 2: Health Surveillance TS=(public health surveillance) OR TS=(health surveillance) OR TS=(public health monitoring) OR TS=(health monitoring)

Filters:
- Document types: Article, Proceedings Paper
- Languages: English


Table 6. Search strategy for IEEE Xplore

IEEE Xplore

Block 1: Computational Linguistics "All Metadata":artificial intelligence OR "All Metadata":machine learning OR "All Metadata":text mining OR "All Metadata":computational linguistics OR "All Metadata":natural language processing OR "All Metadata":nlp OR "All Metadata":sentiment analysis OR "All Metadata":word embedding* OR "All Metadata":natural language toolkit OR "All Metadata":nltk

Block 2: Health Surveillance "All Metadata":public health surveillance OR "All Metadata":health surveillance OR "All Metadata":public health monitoring OR "All Metadata":health monitoring

Filters:
- Document types: Conferences, Journals

Table 7. Search strategy for ACM Digital Library

ACM Digital Library

Block 1: Computational Linguistics artificial intelligence OR machine learning OR text mining OR computational linguistics OR natural language processing OR nlp OR sentiment analysis OR word embedding* OR natural language toolkit OR nltk

Block 2: Health Surveillance public health surveillance OR health surveillance OR public health monitoring OR health monitoring

Filters:
- Searched within: Abstract, Author Keyword, Title
- All publications: Proceedings, Research Article
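Block-based strategies like those in Tables 4-7 can also be assembled programmatically, which makes the search reproducible across databases. The sketch below is illustrative only: the `build_query` helper and the quoting convention are assumptions for illustration, not part of the review's methodology; the term lists are copied from Block 1 and Block 2.

```python
# Illustrative sketch: assembling a two-block Boolean search string of the
# kind shown in Tables 4-7. The helper and quoting style are assumptions.

BLOCK_1 = [  # Block 1: Computational Linguistics
    "artificial intelligence", "machine learning", "text mining",
    "computational linguistics", "natural language processing", "nlp",
    "sentiment analysis", "word embedding*", "natural language toolkit",
    "nltk",
]

BLOCK_2 = [  # Block 2: Health Surveillance
    "public health surveillance", "health surveillance",
    "public health monitoring", "health monitoring",
]

def build_query(block, field=""):
    """OR-join the terms of one block, optionally prefixing a field tag."""
    prefix = f"{field}:" if field else ""
    return "(" + " OR ".join(f'{prefix}"{term}"' for term in block) + ")"

# The two concept blocks are combined with AND, so a record must match at
# least one term from each block to be retrieved.
query = build_query(BLOCK_1) + " AND " + build_query(BLOCK_2)
print(query)
```

The same helper can emit database-specific variants by passing a field tag, for example `build_query(BLOCK_1, field="All Metadata")` for an IEEE Xplore-style prefix.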

Appendix 2. Characteristics of Publications

Table 8. Characteristics of Publications Included in the Analysis (16 Publications Reviewed)

(Adrover et al., 2015)
  Type of drugs: HIV
  Data source: Social media (Twitter.com)
  Sample size: 1,642 tweets
  Users: HIV-infected persons undergoing drug treatment
  Unique users: 512 (male: 247; female: 83; unknown: 182)
  Origin of users: Canada; South Africa; United Kingdom; United States (New York City, San Francisco)
  Average number of followers: 2,300
  Years of data collection: 2010, 2011, 2012, 2013
  Horizon of data collection: 3 years
  Software used: Toolkit for Multivariate Analysis
  Techniques and classifiers used: Artificial Neural Networks (ANN); Boosted Decision Trees with AdaBoost (BDT); Boosted Decision Trees with Bagging (BDTG); Sentiment Analysis; Support Vector Machines (SVM)
  Outcome: Reported adverse drug reactions for HIV treatment
  Drugs studied: Atripla; Sustiva; Truvada
  Result: Positive
  Description of result: Reported adverse effects are consistent with well-recognized toxicities.
  Reliability: Medium
  Validity: Medium

(Akay et al., 2016)
  Type of drugs: Depression
  Data source: Forums (DepressionForums.org)
  Sample size: 7,726 posts
  Users: Unknown
  Unique users: Unknown
  Origin of users: Unknown
  Average number of followers: Unknown
  Years of data collection: 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014
  Horizon of data collection: 10 years
  Software used: General Architecture for Text Engineering (GATE); NLTK toolkit within MATLAB; RapidMiner
  Techniques and classifiers used: Hyperlink-Induced Topic Search (HITS); k-Means Clustering; Network Analysis; Term-frequency-inverse document frequency (TF-IDF)
  Outcome: User sentiment on depression drugs
  Drugs studied: Citalopram; Chlorpromazine; Cyclobenzaprine; Duloxetine; Promethazine
  Result: Positive
  Description of result: Natural language processing is suitable to extract information on adverse drug reactions concerning depression.
  Reliability: Medium
  Validity: Medium

(Bian et al., 2012)
  Type of drugs: Cancer
  Data source: Social media (Twitter.com)
  Sample size: 2,102,176,189 tweets
  Users: Unknown
  Unique users: Unknown
  Origin of users: Unknown
  Average number of followers: Unknown
  Years of data collection: 2009, 2010
  Horizon of data collection: 1 year
  Software used: Apache Lucene
  Techniques and classifiers used: MetaMap; Support Vector Machines (SVM)
  Outcome: Reported adverse drug reactions for cancer
  Drugs studied: Avastin; Melphalan; Rupatadin; Tamoxifen; Taxotere
  Result: Neutral
  Description of result: Classification models had limited performance. Adverse events related to cancer drugs can potentially be extracted from tweets.
  Reliability: Medium
  Validity: Medium

(Dai et al., 2016)
  Type of drugs: Unknown
  Data source: Social media (Twitter.com)
  Sample size: 6,528 tweets
  Users: Unknown
  Unique users: Unknown
  Origin of users: Unknown
  Average number of followers: Unknown
  Years of data collection: Unknown
  Horizon of data collection: Unknown
  Software used: GENIA tagger; Hunspell; Snowball stemmer; Stanford Topic Modelling Toolbox; Twokenizer
  Techniques and classifiers used: Backward/forward sequential feature selection (BSFS/FSFS) algorithm; k-Means Clustering; Sentiment Analysis; Support Vector Machines (SVM)
  Outcome: Reported adverse drug reactions
  Drugs studied: Unknown
  Result: Positive
  Description of result: Adverse drug reactions were identified reasonably well.
  Reliability: Medium
  Validity: Medium

(Dai & Wang, 2019)
  Type of drugs: Unknown
  Data source: Social media (Twitter.com)
  Sample size: 32,670 tweets
  Users: Unknown
  Unique users: Unknown
  Origin of users: Unknown
  Average number of followers: Unknown
  Years of data collection: Unknown
  Horizon of data collection: Unknown
  Software used: Hunspell; Twitter tokenizer
  Techniques and classifiers used: Support Vector Machines (SVM); Term Frequency-Inverse Document Frequency (TF-IDF)
  Outcome: Reported adverse drug reactions
  Drugs studied: Unknown
  Result: Neutral
  Description of result: Adverse drug reactions were not identified very well.
  Reliability: Medium
  Validity: Medium

(Ginn et al., 2014)
  Type of drugs: Unknown
  Data source: Social media (Twitter.com)
  Sample size: 10,822 tweets
  Users: Unknown
  Unique users: Unknown
  Origin of users: Unknown
  Average number of followers: Unknown
  Years of data collection: Unknown
  Horizon of data collection: Unknown
  Software used: Unknown
  Techniques and classifiers used: Naïve Bayes (NB); Natural Language Processing (NLP); Support Vector Machines (SVM)
  Outcome: Reported adverse drug reactions
  Drugs studied: 65 drugs
  Result: Positive
  Description of result: Adverse drug reactions were identified well.
  Reliability: Medium
  Validity: Medium

(Gräßer et al., 2018)
  Type of drugs: Unknown
  Data source: Drug reviews (Drugs.com; Drugslib.com)
  Sample size: 218,614 reviews
  Users: Unknown
  Unique users: Unknown
  Origin of users: Unknown
  Average number of followers: Unknown
  Years of data collection: Unknown
  Horizon of data collection: Unknown
  Software used: BeautifulSoup
  Techniques and classifiers used: Logistic Regression; Sentiment Analysis
  Outcome: Patient satisfaction with drugs; Reported adverse drug reactions; Reported effectiveness of drugs
  Drugs studied: Unknown
  Result: Positive
  Description of result: Classification results were very good.
  Reliability: Medium
  Validity: Medium

(Katragadda et al., 2015)
  Type of drugs: Unknown
  Data source: Social media (Twitter.com)
  Sample size: 172,800 tweets
  Users: Unknown
  Unique users: 864
  Origin of users: Unknown
  Average number of followers: Unknown
  Years of data collection: 2014, 2015
  Horizon of data collection: 1 year
  Software used: Twitter4J
  Techniques and classifiers used: Decision Trees; Medical Profile Graph; Natural Language Processing (NLP)
  Outcome: Reported adverse drug reactions
  Drugs studied: 103 drugs
  Result: Positive
  Description of result: Building a medical profile of users enables the accurate detection of adverse drug events.
  Reliability: Medium
  Validity: Medium

(Lin et al., Unknown Social media: 1,245 tweets Unknown Unknow Unknown Unknow Unknown Unknown - CRF++ - Natural Reported Unknown Positiv Adverse Medium Mediu 2015) - Twitter.com n n toolkit Language adverse e drug m - GENIA Processing (NLP) drug reactions tagger reactions were - Hunspell identified - Twitter reasonably REST API well. - Twokenizer (Mishra et al., Cancer Drug reviews: Unknown Unknown Unknow Unknown Unknow Unknown Unknown - - Sentiment User 146 drugs Positiv Sentiment on Medium Mediu 2015) - WebMD.com n n SentiWordNe Analysis sentiment e adverse drug m t - Support Vector on cancer reactions - WordNet Machines (SVM) drugs was - Term Document identified Matrix (TDM) reasonably well.

(Nikfarjam et Unknown Drug reviews: - DailyStrength.org: Unknown Unknow Unknown Unknow Unknown Unknown Unknown - ARDMine Reported 81 drugs Positiv Adverse Medium Mediu al., 2015) - DailyStrength.org 6,279 reviews n n - Lexicon-based adverse e drug m - Twitter.com: 1,784 - MetaMap drug reactions Social media: tweets - Support Vector reactions were - Twitter.com Machines (SVM) identified very well.

54 (Ru et al., - Asthma Drug reviews: - Unknown Unknow Unknown Unknow 2014 Not - Deeply Unknown Patient- - Albuterol Positiv Social media Medium Mediu 2015) - Cystic - PatientsLikeMe.com PatientsLikeMe.com n n applicabl Moving reported - Azithromycin e serves as a m fibrosis - WebMD.com : 796 reviews e medication - Bromocriptine new data - - Twitter.com: outcomes - Insulin source to Rheumatoi Social media: 39,127 tweets - Ipratropium extract d arthritis - Twitter.com - WebMD.com: - Ivacaftor patient- - Type 2 - YouTube.com 2,567 reviews - Meloxicam reported diabetes - YouTube.com: - Metformin medication 42,544 comments - Prednisone outcomes. - Sulfasalazine

(Sampathkuma Unknown Forums: - Medications.com: Unknown Unknow Unknown Unknow 2012 Not - Java Hidden - Hidden Markov Reported - Positiv Reported Medium Mediu r et al., 2014) - Medications.com 8,065 posts n n applicabl Markov Model (HMM) adverse Medications.com: e adverse m - SteadyHealth.com - SteadyHealth.com: e Model library - Natural drug 168 drugs effects are 11,878 - jsoup Language reactions - consistent Processing (NLP) SteadyHealth.com with well- : 316 drugs recognized side-effects.

(Wang et al., Unknown Electronic Health 25,074 discharge Unknown Unknow Unknown Unknow 2004 Not MedLEE Unknown Reported - ACE inhibitors Positiv Reported Medium Mediu 2009) Record (EHR) summaries n n applicabl adverse - Bupropion e adverse m e drug - Ibuprofen effects are reactions - Morphine consistent - Paroxetine with well- - Rosiglitazone recognized - Warfarin toxicities (recall: 75%; precision: 31%). (Wu et al., Unknown Social media: 3,251 tweets Unknown Unknow Unknown Unknow 2015 Not - AFINN - MetaMap Reported - Baclofen Positiv Several well- Medium Mediu 2015) - Twitter.com n n applicabl - Bing Liu - Naïve Bayes adverse - Duloxetine e known m e sentiment (NB) drug - Gabapentin adverse drug words - Natural reactions - Glatiramer reactions - Multi- Language - Pregabalin were Perspective Processing (NLP) identified. Question - Sentiment Answering Analysis (MPQA) - Support Vector - Machines (SVM) SentiWordNe t

55 - TextBlob - Tweepy - WEKA

(Yang et al., Unknown Forums: 6,244 discussion Unknown Unknow Unknown Unknow Unknown Unknown Unknown - Association Reported - Adenosine Positiv Adverse Medium Mediu 2012) - MedHelp.org threads n n Mining adverse - Biaxin e drug m drug - Cialis reactions reactions - Elidel were - Lansoprazole identified. - Lantus - Luvox - Prozac - Tacrolimus - Vyvanse

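Many of the publications in Table 8 pair a bag-of-words representation such as TF-IDF with a standard classifier such as a Support Vector Machine to flag posts that mention adverse drug reactions. The sketch below illustrates that general pipeline only; it is not taken from any reviewed study, and the texts, labels, and model configuration are invented for demonstration, assuming scikit-learn is available.

```python
# Illustrative TF-IDF + linear SVM pipeline for adverse drug reaction (ADR)
# detection. The tiny corpus below is invented for demonstration; real studies
# used thousands of hand-labelled posts or tweets.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

posts = [
    "this drug gave me terrible headaches and nausea",
    "felt dizzy and sick after the second dose",
    "works great, my symptoms are gone",
    "no side effects so far, very happy with it",
]
labels = [1, 1, 0, 0]  # 1 = mentions an adverse reaction, 0 = does not

# Unigrams and bigrams are weighted by TF-IDF, then classified with a linear SVM.
model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
model.fit(posts, labels)

print(model.predict(["severe nausea since I started taking it"]))
```

In practice, the reviewed studies extend this skeleton with domain resources such as MetaMap concept mapping or sentiment lexicons, and evaluate with precision and recall on held-out annotated data rather than on the training set.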
Appendix 3. PRISMA Checklist

Table 9. PRISMA Checklist

TITLE
- 1. Title: Identify the report as a systematic review, meta-analysis, or both. (Reported on page 1)

ABSTRACT
- 2. Structured summary: Provide a structured summary including, as applicable: background; objectives; data sources; study eligibility criteria, participants, and interventions; study appraisal and synthesis methods; results; limitations; conclusions and implications of key findings; systematic review registration number. (Reported on page 2)

INTRODUCTION
- 3. Rationale: Describe the rationale for the review in the context of what is already known. (Reported on page 5)
- 4. Objectives: Provide an explicit statement of questions being addressed with reference to participants, interventions, comparisons, outcomes, and study design (PICOS). (Reported on page 7)

METHODS
- 5. Protocol and registration: Indicate if a review protocol exists, if and where it can be accessed (e.g., Web address), and, if available, provide registration information including registration number. (Not reported)
- 6. Eligibility criteria: Specify study characteristics (e.g., PICOS, length of follow-up) and report characteristics (e.g., years considered, language, publication status) used as criteria for eligibility, giving rationale. (Reported on page 16)
- 7. Information sources: Describe all information sources (e.g., databases with dates of coverage, contact with study authors to identify additional studies) in the search and date last searched. (Reported on page 14)
- 8. Search: Present full electronic search strategy for at least one database, including any limits used, such that it could be repeated. (Reported on page 14)
- 9. Study selection: State the process for selecting studies (i.e., screening, eligibility, included in systematic review, and, if applicable, included in the meta-analysis). (Reported on page 15)
- 10. Data collection process: Describe method of data extraction from reports (e.g., piloted forms, independently, in duplicate) and any processes for obtaining and confirming data from investigators. (Reported on page 17)
- 11. Data items: List and define all variables for which data were sought (e.g., PICOS, funding sources) and any assumptions and simplifications made. (Reported on page 17)
- 12. Risk of bias in individual studies: Describe methods used for assessing risk of bias of individual studies (including specification of whether this was done at the study or outcome level), and how this information is to be used in any data synthesis. (Reported on page 16)
- 13. Summary measures: State the principal summary measures (e.g., risk ratio, difference in means). (Not reported)
- 14. Synthesis of results: Describe the methods of handling data and combining results of studies, if done, including measures of consistency (e.g., I²) for each meta-analysis. (Reported on page 17)
- 15. Risk of bias across studies: Specify any assessment of risk of bias that may affect the cumulative evidence (e.g., publication bias, selective reporting within studies). (Reported on page 16)
- 16. Additional analyses: Describe methods of additional analyses (e.g., sensitivity or subgroup analyses, meta-regression), if done, indicating which were pre-specified. (Reported on page 17)

RESULTS
- 17. Study selection: Give numbers of studies screened, assessed for eligibility, and included in the review, with reasons for exclusions at each stage, ideally with a flow diagram. (Reported on page 18)
- 18. Study characteristics: For each study, present characteristics for which data were extracted (e.g., study size, PICOS, follow-up period) and provide the citations. (Reported on page 20)
- 19. Risk of bias within studies: Present data on risk of bias of each study and, if available, any outcome-level assessment (see item 12). (Reported on page 23)
- 20. Results of individual studies: For all outcomes considered (benefits or harms), present, for each study: (a) simple summary data for each intervention group, and (b) effect estimates and confidence intervals, ideally with a forest plot. (Reported on page 23)
- 21. Synthesis of results: Present results of each meta-analysis done, including confidence intervals and measures of consistency. (Reported on page 23)
- 22. Risk of bias across studies: Present results of any assessment of risk of bias across studies (see item 15). (Reported on page 23)
- 23. Additional analysis: Give results of additional analyses, if done (e.g., sensitivity or subgroup analyses, meta-regression [see item 16]). (Not reported)

DISCUSSION
- 24. Summary of evidence: Summarize the main findings including the strength of evidence for each main outcome; consider their relevance to key groups (e.g., healthcare providers, users, and policy makers). (Reported on page 25)
- 25. Limitations: Discuss limitations at study and outcome level (e.g., risk of bias), and at review level (e.g., incomplete retrieval of identified research, reporting bias). (Reported on page 26)
- 26. Conclusions: Provide a general interpretation of the results in the context of other evidence, and implications for future research. (Reported on page 29)

FUNDING
- 27. Funding: Describe sources of funding for the systematic review and other support (e.g., supply of data); role of funders for the systematic review. (Not reported)

Adopted and modified from Moher et al. (2009)

For more information, visit: www.prisma-statement.org.
