A General Method for Estimating the Prevalence of Influenza-Like-Symptoms with Wikipedia Data
Giovanni De Toni (DISI, University of Trento) [email protected]
Cristian Consonni (Eurecat - Centre Tecnològic de Catalunya) [email protected]
Alberto Montresor (DISI, University of Trento) [email protected]

arXiv:2010.14903v1 [cs.CY] 28 Oct 2020

ABSTRACT

Influenza is an acute respiratory seasonal disease that affects millions of people worldwide and causes thousands of deaths in Europe alone. Being able to estimate in a fast and reliable way the impact of an illness on a given country is essential to plan and organize effective countermeasures, which is now possible by leveraging unconventional data sources like web searches and visits. In this study, we show the feasibility of exploiting information about Wikipedia's page views of a selected group of articles, together with machine learning models, to obtain accurate estimates of influenza-like-illness incidence in four European countries: Italy, Germany, Belgium, and the Netherlands. We propose a novel language-agnostic method, based on two algorithms, Personalized PageRank and CycleRank, to automatically select the most relevant Wikipedia pages to be monitored without the need for expert supervision. We then show how our model reaches state-of-the-art results by comparing it with previous solutions.

1 INTRODUCTION

Influenza is a widespread acute respiratory infection that occurs with seasonal frequency, mainly during the autumn-winter period. The European Centre for Disease Prevention and Control (ECDC) estimates that influenza causes up to 50 million cases and up to 70,000 deaths annually in Europe [13, 20]. Globally, the number of deaths caused by this infection ranges from 290,000 to 650,000 [21]. The most affected categories are children in the pediatric age range and seniors; severe cases are recorded in people above 65 years old and with previous health conditions, for example, respiratory or cardiovascular illnesses [8, 14].

The ECDC monitors influenza virus activity and provides weekly bulletins that aggregate the open data coming from all the European countries. The local data are gathered from national networks of volunteer physicians [22]. The Centre supplies only data about the current situation, which means that it does not provide any information about the future impact of influenza in a European country, namely the number of people who will be infected. Moreover, ECDC data are available with a delay ranging between one and two weeks, which prevents taking preventive measures to mitigate the impact of influenza. In recent years, thanks to machine learning techniques, new methods have been developed to estimate the activity level of diseases by harnessing the power of the internet and big data. Researchers have tried to use unconventional sources of data to make predictions, for example, tweets from Twitter [4, 10, 11], Google's search keywords [18], or social media [17].

Wikipedia¹ is a multilingual encyclopedia, written collaboratively by volunteers online. As of this writing, the English version alone has more than 6M articles and 50M pages, and it is edited on average by more than 65k active users every month [16]. Research has shown that Wikipedia is a first-stop source for information of all kinds, including information about science [12, 19]. This is of particular significance in fields such as medicine, where it has been shown that Wikipedia is a prominent result for over 70% of search results for health-related keywords [5]. Wikipedia is also the most viewed medical resource globally [3], and it is used as an information source by 50% to 70% of practicing physicians [2].

In this study, we investigate how to exploit Wikipedia as a source of data for rapid now-casting of influenza activity levels in several European countries. McIver and Brownstein [7] developed a method based on Wikipedia's pageviews, which behaves quite well when predicting the influenza levels in the United States provided by the Centers for Disease Control and Prevention (CDC). They use pageviews of a predefined set of pages from the English language edition of Wikipedia, chosen by a panel of experts. Generous et al. [1] also developed linear machine learning models and demonstrated the ability to forecast incidence levels for 14 location-disease combinations.

We extend the state of the art by proposing a novel method that can be applied automatically to other languages and countries. We focus our attention on predicting influenza levels in Italy, Germany, Belgium, and the Netherlands. In extending this work, we have developed a method that is more flexible with respect to the characteristics of each edition of Wikipedia and that can be applied to any language, thus enabling us to build and deploy machine learning models for Influenza-like Illness (ILI) estimation without the need for expert supervision.

The paper is structured as follows: in Section 2 we review the relevant datasets that we have used in our study. In Sections 3 and 4 we discuss our models and how we selected the relevant features to use for predicting influenza activity levels from Wikipedia data. We present the results of our study in Sections 5 and 6, and we discuss them in Section 7. Finally, we conclude the paper in Section 8.

2 DATA SOURCES

Wikipedia page views. Data about Wikipedia page views are freely available to download. For our study, we have used two datasets: the pagecounts dataset² (December 2007 to May 2016, for a total of 461 weeks) and the pageviews dataset (October 2016 to April 2019, for a total of 134 weeks). Both of them count the number of page hits to Wikipedia articles. The pagecounts dataset contains non-filtered page views, including automatically bot-generated ones; moreover, it does not contain data about page hits from mobile devices. The pageviews dataset has been developed more recently, superseding pagecounts in October 2016, and includes only human traffic from both desktop and mobile devices. By considering both these datasets, we can examine a longer time range while estimating the impact of the different models on our results.

From these datasets, we extracted the page views of three different language versions of Wikipedia: Italian, German, and Dutch. We selected these languages because the majority of page views for each of them come almost exclusively from a single area.³ For instance, only 41.5% of the page views of the English Wikipedia come from the US and the UK, while the rest come from all over the world. On the contrary, 87.0% of the page views of the Italian version of Wikipedia come from Italy, 77.0% of the page views of the German version come from Germany, and almost 89% of the Dutch Wikipedia's page views come from the Netherlands and Belgium (20.3% from Belgium and 69.1% from the Netherlands alone). This enables us to build datasets which contain less noise and which are affected only by the internet behaviour of the population of the country or area we are examining.

Each row is composed of four columns, namely: the project's language, the page title, the number of requests, and the size in bytes of the page retrieved. Data are available with hourly granularity; to use them in our project, we filtered them to select only specific entries, and then we computed an aggregated weekly value of the page views for each of the selected entries. We generated weekly aggregates of page views for the period ranging from the 50th week of 2007 to the 15th week of 2019, for a total of 591 weeks. pageviews data were used to cover the period from September 2016 to April 2019, while pagecounts data were used for the previous period.

Influenza incidence. Italian data were downloaded from the InfluNet system⁴, the Italian flu surveillance program managed by the Istituto Superiore di Sanità; German data were obtained from the Robert Koch Institute⁵; Dutch and Belgian data were taken from the GISRS (Global Influenza Surveillance and Response System) online tool, which is available on the WHO website⁶, since, to the best of our knowledge, the two countries do not offer open data from national health institute websites.

For Germany and Italy, we cover 12 influenza seasons (from 2007-2008 to 2018-2019); for the Netherlands and Belgium, we have data covering 9 influenza seasons (from 2010-2011 to 2018-2019). Each of the datasets records the influenza incidence aggregated weekly per 100,000 patients. Each dataset entry represents the influenza level in a specific week. For each influenza season, data cover 26 weeks, from the 42nd week of the previous year to the 15th week of the following year; for instance, the 2009-2010 influenza season comprises data ranging from the 42nd week of 2009 to the 15th week of 2010. The complete dataset has the following size: 312 weeks of data for Germany and Italy, 234 for the Netherlands and Belgium.

3 FEATURE SELECTION

The selection of the pages to monitor was based on the following assumption: during influenza seasons, we should see an increase in the page view counts of some specific Wikipedia pages, which should be related to flu topics (e.g., "Influenza", "Headache", "Flu vaccine", etc.). This occurs because most people tend to search online for the symptoms of their disease when they are sick, trying to get more information about their illness.

We use three different methodologies in order to avoid making unnecessary assumptions about which pages to choose.

---
¹ https://www.wikipedia.org
² Pagecounts-raw on wikitech.wikimedia.org, https://w.wiki/gnq, visited on 2020-10-14
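The hourly-to-weekly aggregation of the page view dumps described in Section 2 can be sketched as follows. This is an illustrative reconstruction, not the authors' code: the row tuple mirrors the four dataset columns (language, page title, requests, size in bytes) plus a date for the hour the row came from, and the function name is ours.

```python
from collections import defaultdict
from datetime import date

def weekly_page_views(rows, language, titles):
    """Aggregate hourly page-view rows into weekly counts.

    Each row is (language, page_title, requests, size_bytes, day);
    only rows for the chosen language edition and the monitored
    page titles are kept, and requests are summed per ISO week.
    """
    weekly = defaultdict(int)  # (iso_year, iso_week, title) -> views
    for lang, title, requests, _size, day in rows:
        if lang != language or title not in titles:
            continue  # filter: keep only the monitored pages
        iso_year, iso_week, _ = day.isocalendar()
        weekly[(iso_year, iso_week, title)] += requests
    return dict(weekly)

# Toy example: two hourly rows for "Influenza" in the same ISO week.
rows = [
    ("it", "Influenza", 120, 35000, date(2019, 1, 7)),
    ("it", "Influenza", 80, 35000, date(2019, 1, 9)),
    ("it", "Milano", 999, 50000, date(2019, 1, 7)),  # filtered out
]
print(weekly_page_views(rows, "it", {"Influenza"}))
# → {(2019, 2, 'Influenza'): 200}
```

ISO week numbering keeps weekly buckets consistent across year boundaries, which matters for seasons that span two calendar years.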
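The 26-week season window used for the incidence data (week 42 of one year through week 15 of the next) can be enumerated as below. The helper name is ours, and we assume a 52-week year, which is what the paper's stated 26-week count implies; a year with a 53rd ISO week would contribute one extra entry.

```python
def season_weeks(start_year):
    """List the (year, week) pairs of one influenza season.

    A season runs from week 42 of `start_year` to week 15 of the
    following year: 11 + 15 = 26 weekly data points, assuming a
    52-week year.
    """
    weeks = [(start_year, w) for w in range(42, 53)]
    weeks += [(start_year + 1, w) for w in range(1, 16)]
    return weeks

season = season_weeks(2009)      # the 2009-2010 season
print(len(season))               # 26 weekly entries
print(season[0], season[-1])     # (2009, 42) (2010, 15)
```

Twelve such seasons give the 312 weekly data points reported for Germany and Italy, and nine give the 234 reported for the Netherlands and Belgium.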
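The abstract names Personalized PageRank as one of the two algorithms used to select relevant pages automatically. The paper's exact setup comes later, but the general idea, ranking pages of the Wikipedia link graph by a random walk that restarts at a seed article such as "Influenza", can be sketched with a plain power iteration. The toy graph and parameter values below are illustrative, not the paper's.

```python
def personalized_pagerank(links, seeds, damping=0.85, iters=100):
    """Power iteration for Personalized PageRank.

    `links` maps each page to the pages it links to; with
    probability 1 - damping the walk restarts at the seed pages,
    so pages close to the seeds accumulate the most rank.
    """
    pages = sorted(links)
    restart = {p: (1 / len(seeds) if p in seeds else 0.0) for p in pages}
    rank = dict(restart)
    for _ in range(iters):
        nxt = {p: (1 - damping) * restart[p] for p in pages}
        for p in pages:
            out = links[p]
            if not out:  # dangling page: send its mass back to the seeds
                for s in seeds:
                    nxt[s] += damping * rank[p] / len(seeds)
            else:
                for q in out:
                    nxt[q] += damping * rank[p] / len(out)
        rank = nxt
    return rank

# Tiny illustrative link graph around a seed article.
links = {
    "Influenza": ["Fever", "Flu vaccine"],
    "Fever": ["Influenza"],
    "Flu vaccine": ["Influenza"],
    "Rome": ["Influenza"],  # links in, but is never reached from the seed
}
rank = personalized_pagerank(links, seeds={"Influenza"})
top = sorted(rank, key=rank.get, reverse=True)
print(top[:3])  # ['Influenza', 'Fever', 'Flu vaccine']
```

Pages unrelated to the seed, like "Rome" here, receive essentially no rank, which is what makes the selection language-agnostic: only the link structure around the seed article matters.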
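The prior work cited in the introduction (McIver and Brownstein; Generous et al.) fits linear models from page-view counts to ILI incidence. A minimal single-feature version, with synthetic numbers and plain ordinary least squares, looks like this; it is a sketch of the general approach, not the models evaluated in this paper.

```python
def fit_line(x, y):
    """Ordinary least squares for y = a*x + b (single feature)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    a = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / \
        sum((xi - mx) ** 2 for xi in x)
    b = my - a * mx
    return a, b

# Synthetic weekly data: views of a flu page vs. ILI cases per 100,000.
views = [100, 200, 300, 400]
ili = [1.0, 2.0, 3.0, 4.0]       # perfectly linear, for illustration
a, b = fit_line(views, ili)
print(round(a, 4), round(b, 4))  # slope 0.01, intercept 0.0
```

In practice one feature per monitored page is used, so the real models are multivariate; the page-selection step in Section 3 decides which features enter the model.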