Data-Driven Methods for Modeling and Predicting Multivariate Time Series using Surrogates
Prithwish Chakraborty
Dissertation submitted to the Faculty of the Virginia Polytechnic Institute and State University in partial fulfillment of the requirements for the degree of
Doctor of Philosophy in Computer Science
Narendran Ramakrishnan, Chair Madhav Marathe Chang-Tien Lu Ravi Tandon John S. Brownstein
April 28, 2016 Arlington, VA
Keywords: Multivariate Time Series, Surrogates, Generalized Linear Models, Bayesian Sequential Analysis, Computational Epidemiology
Copyright © 2015, Prithwish Chakraborty
(ABSTRACT)
Modeling and predicting multivariate time series data has been of prime interest to researchers for many decades. Traditionally, time series prediction models have focused on finding attributes that have consistent correlations with the target variable(s). However, diverse surrogate signals, such as News data and Twitter chatter, are increasingly available and can provide real-time information, albeit with inconsistent correlations. Intelligent use of such sources can lead to early and real-time warning systems such as Google Flu Trends. Furthermore, the target variables of interest, such as public health surveillance data, can be noisy. Thus, models built for such data sources should be flexible as well as adaptable to changing correlation patterns. In this thesis we explore various methods of using surrogates to generate more reliable and timely forecasts for noisy target signals. We primarily investigate three key components of the forecasting problem, viz. (i) short-term forecasting, where surrogates can be employed in a now-casting framework, (ii) long-term forecasting, where surrogates act as forcing parameters to model system dynamics, and (iii) robust drift models that detect and exploit ‘changepoints’ in the surrogate-target relationship to produce robust models. We explore various ‘physical’ and ‘social’ surrogate sources to study these sub-problems, primarily to generate real-time forecasts for endemic diseases. On the modeling side, we employed matrix factorization and generalized linear models to detect short-term trends and explored various Bayesian sequential analysis methods to model long-term effects. Our research indicates that, in general, a combination of surrogates can lead to more robust models. Interestingly, our findings indicate that under specific scenarios, particular surrogates can decrease overall forecasting accuracy, thus providing an argument for the use of ‘Good data’ over ‘Big data’.
This work is supported by the Intelligence Advanced Research Projects Activity (IARPA) via Department of Interior National Business Center (DoI/NBC) contract number D12PC000337. The US Government is authorized to reproduce and distribute reprints of this work for Governmental purposes notwithstanding any copyright annotation thereon. Disclaimer: The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of IARPA, DoI/NBC, or the US Government.
(GENERAL AUDIENCE ABSTRACT)
In the context of public health, modeling and early forecasting of infectious diseases is of prime importance. Such efforts help agencies devise interventions and implement effective counter-measures. However, disease surveillance is an involved process in which agencies estimate the intensity of diseases in the public domain using various networks. The process involves several levels of data cleaning and aggregation, and as such the resultant surveillance data is inherently noisy (requiring several revisions to stabilize) and delayed. Thus, real-time forecasting of such diseases necessitates stable and robust methods that can provide accurate public health information in a time-critical manner. This work focuses on data-driven modeling and forecasting of time series, especially infectious diseases, for a number of regions of the world including Latin America and the United States of America. With the increasing popularity of social media, real-time societal information can be extracted from various media such as Twitter and News. This work addresses this critical area by presenting a number of models that systematically integrate and compare the usefulness of such real-time information from both physical indicators (such as temperature) and non-physical indicators (such as Twitter) for robust disease forecasting. Specifically, this work focuses on three critical areas: (a) short-term forecasting of disease case counts to get better estimates of the current on-ground scenario, (b) long-term forecasting of disease season characteristics to help public health agencies plan and implement interventions, and finally (c) concept drift detection and adaptation to account for the ever-evolving relationship between the societal surrogates and public health surveillance and to lend robustness to the disease forecasting models.
This work shows that such indicators could be useful for reliable estimation of disease characteristics, even when the ground truth itself is unreliable, and provides insights as to how such indicators can be integrated as part of public surveillance. This work has used principles from diverse fields spanning Bayesian Statistics, Machine Learning, Information Theory, and Public Health to analyze and characterize such diseases.
Acknowledgments
I extend my sincere thanks and gratitude to my advisor Dr. Naren Ramakrishnan for his continued encouragement and guidance throughout my work. His feedback, insights and inputs have contributed immensely to the final form of this work. He has been my mentor and my guide. I have always found in him a patient listener who rendered clarity to my thoughts, and I have always come out of our discussions with renewed vigor and focus. It has been my utmost privilege to work with him for all these years. I would also like to thank my entire committee. I sincerely thank Dr. Madhav Marathe and Dr. John Brownstein for their unique perspectives on public health, without which this work wouldn’t have been complete. I have especially enjoyed my meetings with Dr. Madhav Marathe and our collaborations, which have helped me gain a broader understanding of the field of computational epidemiology. I cannot thank Dr. Ravi Tandon enough for his inputs and insights that ultimately materialized in the form of ‘concept drift’ - a crucial component of this work. Finally, Dr. C.T. Lu has always been welcoming and encouraging, and I thank him for his crucial feedback and inputs about this work. I consider myself fortunate to have received the guidance of such an esteemed and kind group of people. I extend my heartfelt thanks and gratitude to Dr. Bryan Lewis, NDSSL at Virginia Tech, for all his encouragement, guidance and countless hours working with me on this work. I have been lucky to have him as my mentor. I would also like to thank the Discovery Analytics Center at Virginia Tech, which has been my home-away-from-home for these past few years. I have found mentors like Tozammel Hossain and Patrick Butler who have immensely shaped my early PhD years. All my lab members have been crucial and I will miss my time with all of them. They have been my friends, my colleagues and, more often than not, my support group throughout this process. I wish all of you the best for your future.
I have also been fortunate to work with a varied group of collaborators from NDSSL, HealthMap and YeLab as well as public health agencies such as IARPA and CDC, which has made my PhD a great experience that I will cherish forever. I would also like to express my gratitude to my wonderful friends - Deba Pratim Saha, Gourab Ghosh Roy, Saurav Ghosh, Sathappan Muthiah, Arijit Chattopadhyay, Sayantan Guha and Abhishek Mukherjee, to name a few - with whom I have shared unique moments throughout this time. Thanks for being around and being there for me whenever I needed you all. Thanking my family is perhaps not enough. My mother Mrs. Devyani Chakraborty and my brother Mr. Prasenjit Chakraborty have been my closest friends and confidants. Both this work and I owe everything to you. My late father Mr. Prasanta Kr. Chakraborty would have been happy to see me where I am today. My sister-in-law Mrs. Amrita Dhole Chakraborty and my cousins, I thank you for being the best family I could hope for and for always being there for me.
Contents

1 Background and Motivation
  1.1 Flu Surveillance Effects
  1.2 Motivation towards using surrogates

I Short-term Forecasting using Surrogates

2 Forecasting a Moving Target: Ensemble Models for ILI Case Count Predictions
  2.1 Related Work
  2.2 Problem Formulation
    2.2.1 Methods
  2.3 Ensemble Approaches
    2.3.1 Data level fusion
    2.3.2 Model level fusion
  2.4 Forecasting a Moving Target
  2.5 Experimental Setup
    2.5.1 Reference Data
    2.5.2 Evaluation criteria
    2.5.3 Surrogate data sources
  2.6 Results
  2.7 Discussion

3 Dynamic Poisson Autoregression for Influenza-Like-Illness Case Count Prediction
  3.1 Summary
    3.1.1 Model Similarity
    3.1.2 Forecasting Results
    3.1.3 Seasonal Analysis
  3.2 Discussion

II Long-term Forecasting using Surrogates

4 Curve-matching from library of curves

5 Data Assimilation methods for long-term forecasting
  5.1 Data Assimilation
  5.2 Data Assimilation Models in disease forecasting
  5.3 Data Assimilation Using surrogate Sources
  5.4 Experimental Results and Performance Summary
  5.5 Discussion

III Detecting and Adapting to Concept Drift

6 Hierarchical Quickest Change Detection via Surrogates
  6.1 HQCD–Hierarchical Quickest Change Detection
    6.1.1 Quickest Change Detection (QCD)
    6.1.2 Changepoint detection in Hierarchical Data
  6.2 HQCD for Count Data via Surrogates
    6.2.1 Hierarchical Model for Count Data
    6.2.2 Changepoint Posterior Estimation
  6.3 Experiments
    6.3.1 Synthetic Data
    6.3.2 Real life case study
  6.4 Discussion

7 Concept Drift Adaptation for Google Flu Trends
  7.1 Background
  7.2 Robust Models via Concept Drift Adaptation
    7.2.1 Experimental evaluation and comparing Surrogate Sources
  7.3 Discussion

8 Conclusion
  8.1 Importance of Open Source Indicators for Public Health
  8.2 Guidelines for using surrogates for Health Surveillance
  8.3 Future Work

A Data Assimilation: detailed performance

B Sequential Bayesian Inference
  B.1 SMC2 algorithm traces
  B.2 SMC2 priors

C HQCD: Additional Experimental Results
List of Figures

1.1 Epidemic Pyramid: Depicts the process of how disease exposure in general population goes through several stages of surveillance and gets reported as confirmed cases. Adapted and redrawn from “The public health officer - Antimicrobial Resistance Learning Site For Veterinary Students”, http://amrls.cvm.msu.edu/integrated/principles/meet-the-public-health-officer
1.2 Christmas Effect in USA: Number of people seeking care drops during Christmas holidays. However, the number of ILI-related visits does not vary from non-Christmas times, leading to an inflated percent ILI in general population.
1.3 ILI Surveillance drop towards the end of ILI season in CDC ILINet system. Inflection point can be seen at week 33. Reduced surveillance may render reports from later parts less accurate.
1.4 ILI surveillance instability: percentage relative error of updates w.r.t. final value as a function of update horizon for PAHO ILI reports for several Latin American countries. Stability varies from one country to another.

2.1 Our ILI data pipeline, depicting six different data sources used in this chapter to forecast ILI case counts.
2.2 Average relative error of PAHO count values with respect to stable values. (a) Comparison between Argentina and Colombia (b) Comparison between different seasons for Argentina.
2.3 Average relative error of PAHO count values before and after correction for different countries.
2.4 Accuracy of different methods for each country.

3.1 The distance matrix obtained from our learned DPARX model (bottom figure), associated with the ground truth ILI case count series (top figure) on the AR dataset. We can observe the strong seasonality automatically inferred in the matrix. Each element in the matrix is the Euclidean distance between a pair of the learned models at two corresponding time points after training. For the top figure, the x axis is the index of the weeks; the y axis is the number of ILI cases. For the bottom figure, both x and y axes are the index of the time points. Note that the starting time point (index 0) for the distance matrix is week 15 of the ILI case count series.
3.2 Model distance matrices for US dataset. The three matrices are derived from the fully connected similarity graph, the 3-nearest neighbor similarity graph and the seasonal 3-nearest neighbor similarity graph, from left to right correspondingly.
3.3 Comparison of seasonal characteristics for Mexico using different algorithms for one-step ahead prediction. Blue vertical dashed lines indicate the actual start and end of the season. ILI season considered: 2013.

4.1 Filtering library of curves based on season size and season shape.
4.2 Example of seasonal forecasts for ILI using curve-matching methods.
4.3 Performance measures for ILI seasonal characteristics using curve-matching.

5.1 Performance summary for (a) ILI and (b) CHIKV seasonal forecasts using Weather as a surrogate source under data assimilation framework
5.2 Comparison of forecasting accuracy for Date metrics using surrogates
5.3 Comparison of forecasting accuracy for Value metrics using surrogates
5.4 Comparison of forecasting accuracy for ‘Start Date’ using different surrogate sources
5.5 Comparison of forecasting accuracy for ‘End Date’ using different surrogate sources
5.6 Comparison of forecasting accuracy for ‘Peak Date’ using different surrogate sources
5.7 Comparison of forecasting accuracy for ‘Peak Value’ using different surrogate sources
5.8 Comparison of forecasting accuracy for ‘Season Value’ using different surrogate sources

6.1 Illustration of Quickest Change Detection (QCD): blue colored line represents the actual changepoint at time Γ = t4. (a) declaring a change at γ1 leads to a false alarm, whereas (b) declaring the change at γ2 leads to detection delay. QCD can strike a tradeoff between false alarm and detection delay.
6.2 Generative process for HQCD. As an example consider civil unrest protests. In the framework, different protest types (such as Education- and Housing-related protests) form the targets denoted by Si’s. The total number of protests will be denoted by the top-most variable E. Finally, the set of surrogates, such as counts of Twitter keywords, stock price data, weather data, network usage data etc. are denoted by Kj’s.
6.3 Histogram fit of (a) surrogate source (Twitter keyword counts) and (b) target source (Number of protests of different categories), for various temporal windows, under i.i.d. assumptions. These assumptions lead to satisfactory distribution fit, at a batch level, for both sources. The top-most row corresponds to the period before the Brazilian spring (pre 2013-05-25), the second row is for the period 2013-05-25 to 2013-10-20, and the third is for the period after 2013-10-20. The last row shows the fit for the entire period. These temporal fits are indicative of significant changes in distribution along the Brazilian Spring timeline, for both target and surrogates.
6.4 Computation time for one complete run of changepoint detection (in mins) on a 1.6 GHz quad-core Intel i5 processor with 8 GB RAM: Gibbs sampling [8] vs HQCD vs HQCD without surrogates. Gibbs sampling computation times are unsuitable for online detection.
6.5 Comparison of HQCD against state-of-the-art on simulated target sources. X-axis represents time and Y-axis represents actual value. Solid blue lines refer to the true changepoint, solid green refers to the ones detected by HQCD and brown refers to HQCD without surrogates. Dashed red, magenta, purple and gold lines refer to changepoints detected by RuLSIF, WGLRT, BOCPD and GLRT, respectively. HQCD shows better detection for most targets with low overall detection delay and false alarms.
6.6 False Alarm vs Delay trade-off for different methods. HQCD shows the best trade-off.
6.7 Comparison of detected changepoints at the sum-of-targets (all Protests). HQCD detections are shown in solid green while those from the state-of-the-art methods i.e. RuLSIF (red), WGLRT (magenta), BOCPD (purple) and GLRT (gold) are shown with dashed lines. HQCD detection is the closest to the traditional start date of Mass Protests in the three countries studied
6.8 (Brazilian Spring) Heatmap of changepoint influences of targets on targets (a); and surrogates on targets (b). Darker (lighter) shades indicate higher (lesser) changepoint influence. (a) shows presence of strong off-diagonal elements indicating strong cross-target changepoint information. (b) shows a mixture of uninformative and informative surrogates.

7.1 Evidence of Concept Drift. In Google Flu Trends data for Argentina (left), the corresponding 52-week rolling mean (right) exhibits a saddle point in early 2012, indicating a possible mean-shift drift in GFT for Argentina.
7.2 Concept Drift Adaptation Framework. The framework ingests target sources such as CDC ILI case count data and surrogate sources such as GFT and detects changepoints via the ‘Concept Drift Detector’ stage. Drift probabilities are next passed on to the ‘Drift Adaptation’ stage, where robust predictions are generated using resampling-based methods.
7.3 Drift Adaptation for Mexico using GFT
7.4 Drift Adaptation for Mexico using GST
7.5 Drift Adaptation for Mexico using HealthMap
7.6 Drift Adaptation for Mexico using weather sources
7.7 Drift Adaptation for Mexico using All sources

8.1 Correlation of surrogate sources with disease incidence. Count of influenza related keywords from (a) HealthMap and (b) GST compared against influenza case counts for Argentina as available from PAHO. HealthMap keywords capture the start of the season more accurately, while GST keywords exhibit a sub-optimal but consistent correlation with PAHO counts.

C.1 Comparison of detected changepoints at the target sources (Protest types). HQCD detections are shown in solid green while those from the state-of-the-art methods i.e. RuLSIF (red), WGLRT (magenta), BOCPD (purple) and GLRT (gold) are shown with dashed lines.
List of Tables

2.1 Comparing forecasting accuracy of models using individual sources. Scores in this and other tables are normalized to [0,4] so that 4 is the most accurate.
2.2 Comparison of prediction accuracy while combining all data sources and using MFN regression.
2.3 Comparison of prediction accuracy while using model level fusion on MFN regressors and employing PAHO stabilization.
2.4 Discovering importance of sources in Model level fusion on MFN regressors by ablating one source at a time.
2.5 ILI case count prediction accuracy for Mexico using OpenTable data as a single source, and by combining it with all other sources using model level fusion on uncorrected ILI case count data.

3.1 Prediction accuracies for competing algorithms with different forecast steps over different countries using the GFT input source. GFT data is not available for other countries.
3.2 Prediction accuracies for competing algorithms with different forecast steps over different countries using the weather data source.
3.3 Prediction accuracies for competing algorithms with different forecast steps over different countries using the GST data source.
3.4 Prediction accuracies for competing algorithms with different forecast steps over different countries using the HealthMap data source.

5.1 Forecasting performance of seasonal characteristics using data assimilation methods

6.1 Comparison of state-of-the-art methods vs Hierarchical Quickest Change Detection
6.2 (Synthetic data) comparing true changepoint (Γ) for targets against detected changepoint (γ) by HQCD against state-of-the-art methods for false alarm (FA) and additive detection delay (ADD). Each row represents a target; the best detected changepoint is shown in bold, whereas false alarms are shown in red.

7.1 Comparison of surrogate sources pre- and post-drift adaptation.

A.1 Performance of Data assimilation methods using different surrogate sources w.r.t. seasonal characteristics

C.1 (Protest uprisings) Comparison of HQCD vs state-of-the-art with respect to detected changepoints
Chapter 1
Background and Motivation
The problem of multivariate time series forecasting has been studied extensively for several decades and has found use in diverse fields such as Economics and Statistics [9]. Some of the more popular methods for linear problems are Autoregressive (AR) models, Autoregressive Moving Average (ARMA) models and Vector Autoregressive (VAR) models. For nonlinear problems, some of the more popular methods have been Kernel Regression and Gaussian Processes. However, the traditional approaches have focused on admitting only coherent time series and/or independent time series which exhibit a consistent causal relation with the target of interest. In recent years, ‘big data’ in the form of diverse real-time sources such as social media and news has become readily available. These data sources are in general noisy, and their relationships with any target source can change over time, for example as the search patterns of users shift. However, if used intelligently, such sources can aid in accurately modeling complex target sources, such as the number of influenza cases for a country, in near real-time. This thesis focuses on such noisy surrogates. We explore the problem of flu forecasting in Section 1.1 to identify the key advantages of using surrogates and motivate our methods in Section 1.2.
1.1 Flu Surveillance Effects
Accurate and timely influenza (flu) forecasting has gained significant traction in recent times. If done well, such forecasting can aid in deploying effective public health measures. Unlike other statistical or machine learning problems, however, flu forecasting brings unique challenges and considerations stemming from the nature of the surveillance apparatus and the end-utility of forecasts. Flu surveillance is an inherently complex process, and identifying the quirks of this process can lead to a better understanding of the possible problems facing a forecasting model.
[Figure 1.1 pyramid levels, from top to bottom: Final reports to Health Agencies; Surveillance estimations; Specimen obtained; Person seeks care; Person becomes ill; Exposures in general population]
Figure 1.1: Epidemic Pyramid: Depicts the process of how disease exposure in general population goes through several stages of surveillance and gets reported as confirmed cases. Adapted and redrawn from “The public health officer - Antimicrobial Resistance Learning Site For Veterinary Students”, http://amrls.cvm.msu.edu/integrated/principles/meet-the-public-health-officer
Figure 1.2: Christmas Effect in USA: Number of people seeking care drops during Christmas holidays. However, the number of ILI-related visits does not vary from non-Christmas times, leading to an inflated percent ILI in general population.
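The inflation described in the Figure 1.2 caption is simple arithmetic: percent ILI is a ratio, so it rises when the denominator (all visits) falls while the numerator (ILI visits) stays flat. The following is a minimal Python sketch of that effect. The function name `percent_ili` and all visit counts are hypothetical illustrations, not values from the surveillance data used in this thesis.

```python
def percent_ili(ili_visits, total_visits):
    """Percent ILI: share of all outpatient visits that are ILI-related."""
    if total_visits <= 0:
        raise ValueError("total_visits must be positive")
    return 100.0 * ili_visits / total_visits


# Hypothetical counts (assumptions for illustration only).
# Regular week: 50 ILI visits out of 2500 total visits.
regular_week = percent_ili(50, 2500)

# Holiday week: same 50 ILI visits, but total visits are halved
# because people postpone non-urgent care.
holiday_week = percent_ili(50, 1250)

print(regular_week, holiday_week)
```

With these made-up numbers the reported percent ILI doubles even though the number of ILI visits is unchanged, which is why a forecasting model that reads the percentage naively can mistake a holiday for a surge.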
Influenza-like Illnesses (ILI), tracked by many agencies such as CDC, PAHO, and WHO [10, 44, 64], is a category designed to capture severe respiratory disease, like influenza (flu), but also includes many other less severe respiratory illnesses due to their similar presentation. Surveillance methods often vary between agencies. Even for a single agency, there may be different networks (such as outpatient based and lab sample based) tracking ILI/Flu. While outpatient reporting networks such as ILINet aim to measure exact case counts for the regions under consideration, lab surveillance networks such as WHO NREVSS (used by PAHO) seek to confirm and identify the specific strain. In the absence of a clinic based surveillance system, lab-based systems can provide estimates at a per “X” population level; however, estimating actual influenza cases from these systems is challenging [10]. Furthermore, surveillance reports are often non-representative of actual ILI incidence. Figure 1.1 shows a representative ‘epidemic pyramid’ which depicts the surveillance system. The entire process is inherently associated with possible reporting errors, starting from patients seeking care to the final determination of confirmed cases through laboratory tests. Surveillance networks are also affected by cultural phenomena such as holiday periods, when the behavior of people visiting hospitals differs from other weeks. Figure 1.2 depicts the ‘Christmas effect’ observed during the holidays, when people seek care from physicians only in emergency situations, leading to inflated ILI percentages. Such effects may render the surveillance reports non-representative of on-ground scenarios. In addition to these effects, surveillance systems are also affected by other systematic artifacts. Surveillance reporting has been known to taper off or stop altogether during the post-peak part of the season. For example, as is evident from Figure 1.3, the number of providers who reported to US CDC ILINet surveillance tapers off towards the end of the ILI season (for US, calendar week 40 corresponds to first ILI season week [10]). Specifically, the inflection point of the average curve occurs at season week 33. Such effects can possibly be attributed to resource re-allocation due to reduced interest in post-peak activities.
A combination of such effects ultimately causes surveillance data to be delayed from real-time. Even when the reports are published, they can be candidates for revision/updating for several weeks after initial publication. The lag between initial publication and final revision can be as small as 2 weeks (e.g., for CDC ILINet data) or can fluctuate wildly. For example, PAHO reports for some Latin American countries such as Argentina, Colombia and Mexico can take more than 10 weeks to settle. On the other hand, PAHO reports stabilize within 5 weeks for countries such as Chile, Costa Rica and Peru (see Figure 1.4). Such discrepancies stem from differences in the maturity of the surveillance apparatus and the level of coordination underlying public health reporting.
Figure 1.3: ILI Surveillance drop towards the end of ILI season in CDC ILINet system. Inflection point can be seen at week 33. Reduced surveillance may render reports from later parts less accurate.
Figure 1.4: ILI surveillance instability: percentage relative error of updates w.r.t. final value as a function of update horizon for PAHO ILI reports for several Latin American countries. Stability varies from one country to another.
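The instability measure named in the Figure 1.4 caption, the percentage relative error of each update with respect to the final value, can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the function names, the 5% tolerance, and the revision sequence are hypothetical and are not taken from PAHO or CDC data, nor from the exact procedure used in this thesis.

```python
def percent_relative_errors(revisions):
    """Percentage relative error of each revision w.r.t. the final value.

    `revisions` lists the successive reported counts for one surveillance
    week, ordered by update horizon (index 0 = first published value).
    """
    final = revisions[-1]
    if final == 0:
        raise ValueError("final value must be non-zero")
    return [100.0 * abs(value - final) / final for value in revisions]


def weeks_to_stabilize(revisions, tolerance_pct=5.0):
    """Smallest update horizon after which every revision stays within
    `tolerance_pct` of the final value (tolerance is an assumed choice)."""
    errors = percent_relative_errors(revisions)
    for lag in range(len(errors)):
        if all(err <= tolerance_pct for err in errors[lag:]):
            return lag
    return len(errors) - 1  # not reached: the last error is always 0


# Hypothetical revision history for a single week's case count.
history = [60, 80, 95, 98, 100, 100]
print(percent_relative_errors(history))
print(weeks_to_stabilize(history))
```

Averaging such error curves over many weeks, per country, would give a curve of the kind the caption describes; a country whose curve decays slowly is one where early reports should be trusted less.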
1.2 Motivation towards using surrogates
The flu surveillance effects described above can be thought of as a representative scenario for a large class of problems dealing with real-time surveillance where the on-ground scenario is difficult to ascertain. Most work on forecasting does not account for such instability. In essence, these problems require forecasting a moving target. Real-time surrogates, such as the social media and news sources outlined above, can be useful in such scenarios to augment the surveillance mechanism with information from the general population. Motivated by the problem of flu forecasting, this thesis outlines three key problems as follows:
• Short-term forecasts using surrogates to augment delayed surveillance reports and provide real-time information of on-ground scenarios.
• Long-term forecasts using surrogates as forcing parameters to determine long-term characteristics with increased accuracy.
• Identifying and adapting to Concept Drift to detect changing relationships of surrogates and increase robustness of short- and long-term forecasts using such surrogates.
Part I
Short-term Forecasting using Surrogates
The first problem of this thesis is aimed at short-term forecasting of often delayed and unstable target sources such as Influenza-Like-Illness (ILI) case counts as reported by surveillance agencies such as CDC [10] and PAHO [44]. We compared a range of surrogates encompassing physical sources such as humidity and temperature, and social sources such as Twitter and News in [12] under a Matrix Factorization framework for ILI prediction in 15 Latin American countries. We found that no single source is best suited to model ILI for all countries. However, physical sources were in general the most informative sources. Furthermore, combining the sources led to better forecasting accuracy in general. We present these considerations in Chapter 2. We next focused on increasing the forecasting horizon and used Regularized Generalized Linear Models to capture dynamic trends of ILI data in [62]. Our experiments indicate that we can reliably forecast up to 4 weeks in advance, for a range of countries including USA and several Latin American countries, using our proposed methods. We highlight the important aspects of our findings from the problem in Chapter 3.
Chapter 2
Forecasting a Moving Target: Ensemble Models for ILI Case Count Predictions
Traditionally, epidemiological forecasts of common illnesses, such as the flu, rely heavily on surveillance reports published by health organizations. However, as discussed in Chapter 1, traditional surveillance reports are often published with a considerable delay, and thus recent research has focused on mining social signals from search engine query volume [67, 24] and social media chatter [27, 34, 39, 15, 56]. One of the pioneering works in this space is that of Ginsberg et al. [24], where ILI case counts are predicted from the volume of search engine queries. This work inspired significant follow-on work, e.g., [67], where Yuan et al. used search query data from Baidu (a popular search engine in China) to detect influenza outbreaks. More real-time ILI detection systems [34] have been proposed by modeling Twitter streams. Apart from such social media sources, there has also been considerable research on exploiting physical indicators such as climate data. The primary advantage of such data sources is that the effects are much more causal and less noisy. Shaman et al. [57, 49, 51] explored this area in detail and found absolute humidity to be a good indicator of influenza outbreaks. While the aforementioned efforts have made important strides, there are important areas that have been relatively less studied. First, only a few efforts have focused on combining multiple data sources [29, 27] to aid in forecasting. In particular, to the best of our knowledge there has been no work that investigates the combination of social indicators and physical indicators to forecast ILI incidence. Second, and more importantly, official estimates as reported by health organizations (e.g., WHO, PAHO) are often lagged by several weeks and, even when reported, are typically revised for several weeks before the case counts are finalized. Real-time prediction systems must be designed to handle the forecasting of such a ‘moving target’. Finally, most existing work has been retrospective and not set in the context of a formal data mining validation framework. To overcome these deficiencies, we propose a novel approach to ILI case count forecasting. Our contributions are:
• Our approach integrates both social indicators and physical indicators and thus leverages the selective superiorities of both types of feature sets. We systematize such integration using a novel matrix factorization-based regression approach using neighborhood embedding, thus helping account for non-linear relationships between the surrogates and the official ILI estimates.
• We investigate the efficacy of combining diverse sources at two levels, the data fusion level and the model level, and discuss the relative (de)merits.
• We propose different ways of handling uncertainties in the official estimates and factor these uncertainties into our prediction models.
• Finally, we present a detailed and prospective analysis of our proposed methods by comparing predictions from a near-horizon real time prediction system to official estimates of ILI case counts in 15 countries of Latin America.
2.1 Related Work
Related work naturally falls into the categories of social media analytics, physical indicators, and event dynamics modeling. These are described next.
Social media analytics: Most relevant work using social media analytics focuses on Twitter, specifically by tracking a dictionary of ILI-related keywords in the data stream. Such investigations have often focused on the importance of diversity in keyword lists, e.g., [39, 15]. In [39], Kanhabua and Nejdl used clustering methods to determine important topics in Twitter data, constructed time series for matched keywords, and used Jaccard’s coefficient to characterize the temporal diversity of tweets. They noted that such temporal diversity may be correlated with real-world ILI outbreaks. In [15] the authors studied the dynamics between the change in circulated tweets and the H1N1 virus. Inspired by these works, we curated a custom ILI-related keyword dictionary, which is described in detail in Section 2.5.3.
Physical indicators for detecting ILI incidence levels: Tamerius et al. [57] investigated the existence of seasonal cycles of influenza epidemics in different climate regions. For this work, they considered climatic information from 78 globally distributed sites. Using logistic regression, they found that strong correlations exist between influenza epidemics and weather conditions, especially when conditions are cold-dry or humid-rainy. Similarly, exciting results were reported by Shaman et al. in [49, 51], where they discovered absolute humidity to be a key indicator of flu. To uncover these relationships they used non-linear