Data-Driven Methods for Modeling and Predicting Multivariate Time Series using Surrogates

Prithwish Chakraborty

Dissertation submitted to the Faculty of the Virginia Polytechnic Institute and State University in partial fulfillment of the requirements for the degree of

Doctor of Philosophy in Computer Science

Narendran Ramakrishnan, Chair
Madhav Marathe
Chang-Tien Lu
Ravi Tandon
John S. Brownstein

April 28, 2016
Arlington, VA

Keywords: Multivariate Time Series, Surrogates, Generalized Linear Models, Bayesian Sequential Analysis, Computational Epidemiology

Copyright © 2015, Prithwish Chakraborty

Data-Driven Methods for Modeling and Predicting Multivariate Time Series using Surrogates

Prithwish Chakraborty

(ABSTRACT)

Modeling and predicting multivariate time series data has been of prime interest to researchers for many decades. Traditionally, time series prediction models have focused on finding attributes that have consistent correlations with the target variable(s). However, diverse surrogate signals, such as News data and Twitter chatter, are increasingly available and can provide real-time information, albeit with inconsistent correlations. Intelligent use of such sources can lead to early and real-time warning systems such as Google Flu Trends. Furthermore, the target variables of interest, such as public health surveillance data, can be noisy. Thus models built for such data sources should be flexible as well as adaptable to changing correlation patterns. In this thesis we explore various methods of using surrogates to generate more reliable and timely forecasts for noisy target signals. We primarily investigate three key components of the forecasting problem, viz. (i) short-term forecasting, where surrogates can be employed in a now-casting framework; (ii) long-term forecasting, where surrogates act as forcing parameters to model system dynamics; and (iii) robust drift models that detect and exploit ‘changepoints’ in the surrogate-target relationship to produce robust models. We explore various ‘physical’ and ‘social’ surrogate sources to study these sub-problems, primarily to generate real-time forecasts for endemic diseases. On the modeling side, we employed matrix factorization and generalized linear models to detect short-term trends and explored various Bayesian sequential analysis methods to model long-term effects. Our research indicates that, in general, a combination of surrogates can lead to more robust models. Interestingly, our findings indicate that under specific scenarios, particular surrogates can decrease overall forecasting accuracy, thus providing an argument for the use of ‘Good data’ over ‘Big data’.

This work is supported by the Intelligence Advanced Research Projects Activity (IARPA) via Department of Interior National Business Center (DoI/NBC) contract number D12PC000337. The US Government is authorized to reproduce and distribute reprints of this work for Governmental purposes notwithstanding any copyright annotation thereon. Disclaimer: The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of IARPA, DoI/NBC, or the US Government.

Data-Driven Methods for Modeling and Predicting Multivariate Time Series using Surrogates

Prithwish Chakraborty

(GENERAL AUDIENCE ABSTRACT)

In the context of public health, modeling and early forecasting of infectious diseases is of prime importance. Such efforts help agencies to devise interventions and implement effective counter-measures. However, disease surveillance is an involved process in which agencies estimate the intensity of diseases in the public domain using various networks. The process involves various levels of data cleaning and aggregation, and as such the resultant surveillance data is inherently noisy (requiring several revisions to stabilize) and delayed. Thus real-time forecasting of such diseases necessitates stable and robust methods that can provide accurate public health information in a time-critical manner. This work focuses on data-driven modeling and forecasting of time series, especially of infectious diseases, for a number of regions of the world including Latin America and the United States of America. With the increasing popularity of social media, real-time societal information can be extracted from various media such as Twitter and News. This work addresses this critical area: a number of models are presented to systematically integrate and compare the usefulness of such real-time information from both physical indicators (such as temperature) and non-physical indicators (such as Twitter) towards robust disease forecasting. Specifically, this work focuses on three critical areas: (a) short-term forecasting of disease case counts to get better estimates of the current on-ground scenario, (b) long-term forecasting of disease season characteristics to help public health agencies plan and implement interventions, and finally (c) concept drift detection and adaptation to account for the ever-evolving relationship between societal surrogates and public health surveillance and lend robustness to the disease forecasting models. This work shows that such indicators can be useful for reliable estimation of disease characteristics, even when the ground truth itself is unreliable, and provides insights as to how such indicators can be integrated as part of public health surveillance. This work has used principles from diverse fields spanning Bayesian statistics, machine learning, information theory, and public health to analyze and characterize such diseases.

Acknowledgments

I extend my sincere thanks and gratitude to my advisor Dr. Naren Ramakrishnan for his continued encouragement and guidance throughout my work. His feedback, insights and inputs have contributed immensely to the final form of this work. He has been my mentor and my guide. I have always found in him a patient listener who rendered clarity to my thoughts, and I have always come out of our discussions with renewed vigor and focus. It has been my utmost privilege to work with him for all these years. I would also like to thank my entire committee. I sincerely thank Dr. Madhav Marathe and Dr. John Brownstein for their unique perspectives on public health, without which this work wouldn't have been complete. I have especially enjoyed my meetings with Dr. Madhav Marathe and our collaborations that have helped me gain a broader understanding of the field of computational epidemiology. I cannot thank Dr. Ravi Tandon enough for his inputs and insights that ultimately materialized in the form of ‘concept drift’, a crucial component of this work. Finally, Dr. C.T. Lu has always been welcoming and encouraging, and I thank him for his crucial feedback and inputs on this work. I consider myself fortunate to have received the guidance of such an esteemed and kind group of people. I extend my heartfelt thanks and gratitude to Dr. Bryan Lewis of NDSSL at Virginia Tech for all his encouragement, guidance, and countless hours working with me on this work. I have been lucky to have him as my mentor. I would also like to thank the Discovery Analytics Center at Virginia Tech, which has been my home-away-from-home for these past few years. I have found mentors like Tozammel Hossain and Patrick Butler who have immensely shaped my early PhD years. All my lab members have been crucial and I will miss my time with all of them. They have been my friends, my colleagues, and more often than not my support group throughout this process. I wish all of you the best for your future. I have also been fortunate to work with a varied group of collaborators from NDSSL, HealthMap and YeLab, as well as agencies such as IARPA and the CDC, which has made my PhD a great experience that I will cherish forever. I would also like to express my gratitude to my wonderful friends - Deba Pratim Saha, Gourab Ghosh Roy, Saurav Ghosh, Sathappan Muthiah, Arijit Chattopadhyay, Sayantan Guha and Abhishek Mukherjee, to name a few - with whom I have shared unique moments throughout this time. Thanks for being around and being there for me whenever I needed

you all. Thanking my family is perhaps not enough. My mother Mrs. Devyani Chakraborty and my brother Mr. Prasenjit Chakraborty have been my closest friends and confidants. This work, and I, owe everything to you. My late father Mr. Prasanta Kr. Chakraborty would have been happy to see me where I am today. My sister-in-law Mrs. Amrita Dhole Chakraborty and my cousins, I thank you for being the best family I could hope for and for being there for me always.

Contents

1 Background and Motivation
  1.1 Flu Surveillance Effects
  1.2 Motivation towards using surrogates

I Short-term Forecasting using Surrogates

2 Forecasting a Moving Target: Ensemble Models for ILI Case Count Predictions
  2.1 Related Work
  2.2 Problem Formulation
    2.2.1 Methods
  2.3 Ensemble Approaches
    2.3.1 Data level fusion
    2.3.2 Model level fusion
  2.4 Forecasting a Moving Target
  2.5 Experimental Setup
    2.5.1 Reference Data
    2.5.2 Evaluation criteria
    2.5.3 Surrogate data sources
  2.6 Results
  2.7 Discussion

3 Dynamic Poisson Autoregression for Influenza-Like-Illness Case Count Prediction
  3.1 Summary
    3.1.1 Model Similarity
    3.1.2 Forecasting Results
    3.1.3 Seasonal Analysis
  3.2 Discussion

II Long-term Forecasting using Surrogates

4 Curve-matching from library of curves

5 Data Assimilation methods for long-term forecasting
  5.1 Data Assimilation
  5.2 Data Assimilation Models in disease forecasting
  5.3 Data Assimilation Using surrogate Sources
  5.4 Experimental Results and Performance Summary
  5.5 Discussion

III Detecting and Adapting to Concept Drift

6 Hierarchical Quickest Change Detection via Surrogates
  6.1 HQCD–Hierarchical Quickest Change Detection
    6.1.1 Quickest Change Detection (QCD)
    6.1.2 Changepoint detection in Hierarchical Data
  6.2 HQCD for Count Data via Surrogates
    6.2.1 Hierarchical Model for Count Data
    6.2.2 Changepoint Posterior Estimation
  6.3 Experiments
    6.3.1 Synthetic Data
    6.3.2 Real life case study
  6.4 Discussion

7 Concept Drift Adaptation for Google Flu Trends
  7.1 Background
  7.2 Robust Models via Concept Drift Adaptation
    7.2.1 Experimental evaluation and comparing Surrogate Sources
  7.3 Discussion

8 Conclusion
  8.1 Importance of Open Source Indicators for Public Health
  8.2 Guidelines for using surrogates for Health Surveillance
  8.3 Future Work

A Data Assimilation: detailed performance

B Sequential Bayesian Inference
  B.1 SMC2 algorithm traces
  B.2 SMC2 priors

C HQCD: Additional Experimental Results

List of Figures

1.1 Epidemic Pyramid: Depicts the process of how disease exposure in the general population goes through several stages of surveillance and gets reported as confirmed cases. Adapted and redrawn from “The public health officer - Antimicrobial Resistance Learning Site For Veterinary Students”, http://amrls.cvm.msu.edu/integrated/principles/meet-the-public-health-officer
1.2 Christmas Effect in USA: The number of people seeking care drops during the Christmas holidays. However, the number of ILI-related visits does not vary from non-Christmas times, leading to an inflated percent ILI in the general population.
1.3 ILI Surveillance drop towards the end of the ILI season in the CDC ILINet system. The inflection point can be seen at week 33. Reduced surveillance may render reports from later parts less accurate.
1.4 ILI surveillance instability: percentage relative error of updates w.r.t. the final value as a function of update horizon for PAHO ILI reports for several Latin American countries. Stability varies from one country to another.
2.1 Our ILI data pipeline, depicting six different data sources used in this chapter to forecast ILI case counts.
2.2 Average relative error of PAHO count values with respect to stable values. (a) Comparison between Argentina and Colombia. (b) Comparison between different seasons for Argentina.
2.3 Average relative error of PAHO count values before and after correction for different countries.
2.4 Accuracy of different methods for each country.
3.1 The distance matrix obtained from our learned DPARX model (bottom figure), associated with the ground truth ILI case count series (top figure) on the AR dataset. We can observe the strong seasonality automatically inferred in the matrix. Each element in the matrix is the Euclidean distance between a pair of the learned models at two corresponding time points after training. For the top figure, the x axis is the index of the weeks; the y axis is the number of ILI cases. For the bottom figure, both x and y axes are the index of the time points. Note that the starting time point (index 0) for the distance matrix is week 15 of the ILI case count series.
3.2 Model distance matrices for the US dataset. The three matrices are derived from the fully connected similarity graph, the 3-nearest neighbor similarity graph and the seasonal 3-nearest neighbor similarity graph, from left to right correspondingly.
3.3 Comparison of seasonal characteristics for Mexico using different algorithms for one-step ahead prediction. Blue vertical dashed lines indicate the actual start and end of the season. ILI season considered: 2013.
4.1 Filtering library of curves based on season size and season shape.
4.2 Example of seasonal forecasts for ILI using curve-matching methods.
4.3 Performance measures for ILI seasonal characteristics using curve-matching.
5.1 Performance summary for (a) ILI and (b) CHIKV seasonal forecasts using Weather as a surrogate source under the data assimilation framework.
5.2 Comparison of forecasting accuracy for Date metrics using surrogates.
5.3 Comparison of forecasting accuracy for Value metrics using surrogates.
5.4 Comparison of forecasting accuracy for ‘Start Date’ using different surrogate sources.
5.5 Comparison of forecasting accuracy for ‘End Date’ using different surrogate sources.
5.6 Comparison of forecasting accuracy for ‘Peak Date’ using different surrogate sources.
5.7 Comparison of forecasting accuracy for ‘Peak Value’ using different surrogate sources.
5.8 Comparison of forecasting accuracy for ‘Season Value’ using different surrogate sources.
6.1 Illustration of Quickest Change Detection (QCD): the blue colored line represents the actual changepoint at time Γ = t4. (a) declaring a change at γ1 leads to a false alarm, whereas (b) declaring the change at γ2 leads to detection delay. QCD can strike a tradeoff between false alarm and detection delay.
6.2 Generative process for HQCD. As an example consider civil unrest protests. In the framework, different protest types (such as Education- and Housing-related protests) form the targets denoted by Si's. The total number of protests will be denoted by the top-most variable E. Finally, the set of surrogates, such as counts of Twitter keywords, stock price data, weather data, network usage data etc., are denoted by Kj's.
6.3 Histogram fit of (a) surrogate source (Twitter keyword counts) and (b) target source (number of protests of different categories), for various temporal windows, under i.i.d. assumptions. These assumptions lead to a satisfactory distribution fit, at a batch level, for both sources. The top-most row corresponds to the period before the Brazilian Spring (pre 2013-05-25), the second row is for the period 2013-05-25 to 2013-10-20, and the third is for the period after 2013-10-20. The last row shows the fit for the entire period. These temporal fits are indicative of significant changes in distribution along the Brazilian Spring timeline, for both target and surrogates.
6.4 Computation time for one complete run of changepoint detection (in mins) on a 1.6 GHz quad-core 8GB Intel i5 processor: Gibbs sampling [8] vs HQCD vs HQCD without surrogates. Gibbs sampling computation times are unsuitable for online detection.
6.5 Comparison of HQCD against state-of-the-art on simulated target sources. X-axis represents time and Y-axis represents actual value. Solid blue lines refer to the true changepoint, solid green refers to the ones detected by HQCD and brown refers to HQCD without surrogates. Dashed red, magenta, purple and gold lines refer to changepoints detected by RuLSIF, WGLRT, BOCPD and GLRT, respectively. HQCD shows better detection for most targets with low overall detection delay and false alarms.
6.6 False Alarm vs Delay trade-off for different methods. HQCD shows the best trade-off.
6.7 Comparison of detected changepoints at the sum-of-targets (all Protests). HQCD detections are shown in solid green while those from the state-of-the-art methods i.e. RuLSIF (red), WGLRT (magenta), BOCPD (purple) and GLRT (gold) are shown with dashed lines. HQCD detection is the closest to the traditional start date of Mass Protests in the three countries studied.
6.8 (Brazilian Spring) Heatmap of changepoint influences of targets on targets (a); and surrogates on targets (b). Darker (lighter) shades indicate higher (lesser) changepoint influence. (a) shows presence of strong off-diagonal elements indicating strong cross-target changepoint information. (b) shows a mixture of uninformative and informative surrogates.
7.1 Evidence of Concept Drift. In Google Flu Trends data for Argentina (left), the corresponding 52-week rolling mean (right) exhibits a saddle point in early 2012, indicating a possible mean shift drift in GFT for Argentina.
7.2 Concept Drift Adaptation Framework. The framework ingests target sources such as CDC ILI case count data and surrogate sources such as GFT, and detects changepoints via the ‘Concept Drift Detector’ stage. Drift probabilities are next passed onto the ‘Drift Adaptation’ stage where robust predictions are generated using resampling based methods.
7.3 Drift Adaptation for Mexico using GFT.
7.4 Drift Adaptation for Mexico using GST.
7.5 Drift Adaptation for Mexico using HealthMap.
7.6 Drift Adaptation for Mexico using weather sources.
7.7 Drift Adaptation for Mexico using All sources.
8.1 Correlation of surrogate sources with disease incidence. Count of influenza related keywords from (a) HealthMap and (b) GST compared against influenza case counts for Argentina as available from PAHO. HealthMap keywords capture the start of the season more accurately, while GST keywords exhibit a sub-optimal but consistent correlation with PAHO counts.
C.1 Comparison of detected changepoints at the target sources (Protest types). HQCD detections are shown in solid green while those from the state-of-the-art methods i.e. RuLSIF (red), WGLRT (magenta), BOCPD (purple) and GLRT (gold) are shown with dashed lines.

List of Tables

2.1 Comparing forecasting accuracy of models using individual sources. Scores in this and other tables are normalized to [0,4] so that 4 is the most accurate.
2.2 Comparison of prediction accuracy while combining all data sources and using MFN regression.
2.3 Comparison of prediction accuracy while using model level fusion on MFN regressors and employing PAHO stabilization.
2.4 Discovering importance of sources in model level fusion on MFN regressors by ablating one source at a time.
2.5 ILI case count prediction accuracy for Mexico using OpenTable data as a single source, and by combining it with all other sources using model level fusion on uncorrected ILI case count data.
3.1 Prediction accuracies for competing algorithms with different forecast steps over different countries using the GFT input source. GFT data is not available for other countries.
3.2 Prediction accuracies for competing algorithms with different forecast steps over different countries using the weather data source.
3.3 Prediction accuracies for competing algorithms with different forecast steps over different countries using the GST data source.
3.4 Prediction accuracies for competing algorithms with different forecast steps over different countries using the HealthMap data source.
5.1 Forecasting performance of seasonal characteristics using data assimilation methods.
6.1 Comparison of state-of-the-art methods vs Hierarchical Quickest Change Detection.
6.2 (Synthetic data) Comparing the true changepoint (Γ) for targets against the changepoint (γ) detected by HQCD and by state-of-the-art methods, with respect to false alarm (FA) and additive detection delay (ADD). Each row represents a target; the best detected changepoint is shown in bold whereas false alarms are shown in red.
7.1 Comparison of surrogate sources pre- and post-drift adaptation.
A.1 Performance of data assimilation methods using different surrogate sources w.r.t. seasonal characteristics.
C.1 (Protest uprisings) Comparison of HQCD vs state-of-the-art with respect to detected changepoints.

Chapter 1

Background and Motivation

The problem of multivariate time series forecasting has been studied extensively for several decades and has found use in diverse fields such as economics and statistics [9]. Some of the more popular methods in this sphere are Autoregressive (AR) models, Autoregressive Moving Average (ARMA) models, and Vector Autoregressive (VAR) models for linear problems; for nonlinear problems, popular methods include kernel regression and Gaussian processes. However, traditional approaches have focused on admitting only coherent time series and/or independent time series that exhibit consistent causal relations with the target of interest. In recent years, ‘big data’ in the form of diverse real-time sources such as social media and news has become readily available. These data sources are in general noisy, and their relationships with the target sources can change over time (for example, as user search patterns evolve). However, if used intelligently, such sources can aid in accurately modeling complex target sources, such as the number of influenza cases in a country, in near real-time. This thesis focuses on such noisy surrogates. We explore the problem of flu forecasting in Section 1.1 to identify the key advantages of using surrogates, and motivate our methods in Section 1.2.
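To make these classical baselines concrete, the sketch below fits a vector autoregressive (VAR) model to a synthetic two-variable weekly series and issues a two-week-ahead forecast. The data, the variable names, and the use of statsmodels here are illustrative assumptions for orientation only; this is not a method developed in this thesis.

```python
# A minimal VAR baseline on synthetic weekly data (illustrative only).
import numpy as np
import pandas as pd
from statsmodels.tsa.api import VAR

rng = np.random.default_rng(0)
n_weeks = 200
season = 200 + 80 * np.sin(2 * np.pi * np.arange(n_weeks) / 52)
ili = season + rng.normal(0, 10, n_weeks)                          # hypothetical case counts
surrogate = 0.7 * np.roll(season, 2) + rng.normal(0, 10, n_weeks)  # hypothetical proxy signal

data = pd.DataFrame({"ili": ili, "surrogate": surrogate},
                    index=pd.date_range("2010-01-03", periods=n_weeks, freq="W"))

fit = VAR(data).fit(maxlags=4)                  # fixed lag order of 4 weeks
forecast = fit.forecast(data.values[-fit.k_ar:], steps=2)
print(forecast)                                 # joint 2-week-ahead forecast
```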

1.1 Flu Surveillance Effects

Accurate and timely influenza (flu) forecasting has gained significant traction in recent times. If done well, such forecasting can aid in deploying effective public health measures. Unlike other statistical or machine learning problems, however, flu forecasting brings unique challenges and considerations stemming from the nature of the surveillance apparatus and the end-utility of forecasts. Flu surveillance is an inherently complex process, and identifying the quirks of this process can lead to a better understanding of the possible problems facing a forecasting model.

[Figure 1.1 shows the epidemic pyramid, with levels from bottom to top: exposures in general population; person becomes ill; person seeks care; specimen obtained; surveillance estimations; final reports to health agencies.]

Figure 1.1: Epidemic Pyramid: Depicts the process of how disease exposure in the general population goes through several stages of surveillance and gets reported as confirmed cases. Adapted and redrawn from “The public health officer - Antimicrobial Resistance Learning Site For Veterinary Students”, http://amrls.cvm.msu.edu/integrated/principles/meet-the-public-health-officer

Figure 1.2: Christmas Effect in USA: The number of people seeking care drops during the Christmas holidays. However, the number of ILI-related visits does not vary from non-Christmas times, leading to an inflated percent ILI in the general population.

Influenza-like Illness (ILI), tracked by many agencies such as the CDC, PAHO, and WHO [10, 44, 64], is a category designed to capture severe respiratory disease, like influenza (flu), but it also includes many other less severe respiratory illnesses due to their similar presentation. Surveillance methods often vary between agencies. Even for a single agency, there may be different networks (such as outpatient based and lab sample based) tracking ILI/flu. While outpatient reporting networks such as ILINet aim to measure exact case counts for the regions under consideration, lab surveillance networks such as WHO NREVSS (used by PAHO) seek to confirm and identify the specific strain. In the absence of a clinic-based surveillance system, lab-based systems can provide estimates at a per "X" population level; however, making an estimate of actual influenza cases from these systems is challenging [10].

Furthermore, surveillance reports are often non-representative of actual ILI incidence. Figure 1.1 shows a representative ‘epidemic pyramid’ which depicts the surveillance system. The entire process is inherently associated with possible reporting errors, starting from patients seeking care to the final determination of confirmed cases through laboratory tests. Surveillance networks are also affected by cultural phenomena such as holiday periods, when the behavior of people visiting hospitals changes from other weeks. Figure 1.2 depicts the ‘Christmas effect’ observed during the holidays, when people seek care from physicians only in emergency situations, leading to inflated ILI percentages. Such effects may render the surveillance reports non-representative of on-ground scenarios.

In addition to these effects, surveillance systems are also affected by other systematic artifacts. Surveillance reporting has been known to taper off or stop altogether during the post-peak part of the season. For example, as is evident from Figure 1.3, the number of providers who report to the US CDC ILINet surveillance tapers off towards the end of the ILI season (for the US, calendar week 40 corresponds to the first ILI season week [10]). Specifically, the inflection point of the average curve occurs at season week 33. Such effects can possibly be attributed to resource re-allocation due to reduced interest in post-peak activities. A combination of such effects ultimately causes surveillance data to be delayed from real-time. Even when the reports are published, they can be candidates for revision/updating for several weeks after initial publication. The lag between initial publication and final revision can be as small as 2 weeks (e.g., for CDC ILINet data) or can fluctuate wildly. For example, PAHO reports for some Latin American countries such as Argentina, Colombia and Mexico can take more than 10 weeks to settle. On the other hand, PAHO reports stabilize within 5 weeks for countries such as Chile, Costa Rica and Peru (see Figure 1.4). The reason for such discrepancies has to do with the maturity of the surveillance apparatus and the level of coordination underlying public health reporting.

Figure 1.3: ILI Surveillance drop towards the end of the ILI season in the CDC ILINet system. The inflection point can be seen at week 33. Reduced surveillance may render reports from later parts less accurate.

Figure 1.4: ILI surveillance instability: percentage relative error of updates w.r.t. the final value as a function of update horizon for PAHO ILI reports for several Latin American countries. Stability varies from one country to another.
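As a small illustration of the instability quantified in Figure 1.4, the snippet below computes the percentage relative error of each successive revision of a weekly report against its final stable value; the revision history used here is invented for the example.

```python
# Percentage relative error of successive report revisions w.r.t. the final
# (stable) value, as in Figure 1.4. The revision history is hypothetical.
revisions = [120, 180, 230, 260, 270, 272, 272]   # values published week after week
final_value = revisions[-1]

for horizon, value in enumerate(revisions):
    rel_err = 100.0 * abs(value - final_value) / final_value
    print(f"update horizon {horizon}: reported={value}, relative error={rel_err:.1f}%")
```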

1.2 Motivation towards using surrogates

The flu surveillance effects described above can be thought of as a representative scenario for a large class of problems dealing with real-time surveillance where the on-ground scenario is difficult to ascertain. Most work on forecasting does not account for such instability. In essence, these problems require forecasting a moving target. Real-time surrogates, as outlined above, can be useful in such scenarios to augment the surveillance mechanism with information from the general population. Thus, motivated by the problem of flu forecasting, this thesis addresses three key problems as follows:

• Short-term forecasts using surrogates, to augment delayed surveillance reports and provide real-time information on on-ground scenarios.

• Long-term forecasts using surrogates as forcing parameters, to determine long-term characteristics with increased accuracy.

• Identifying and adapting to Concept Drift, to detect changing relationships of surrogates and increase the robustness of short- and long-term forecasts using such surrogates.

Part I

Short-term Forecasting using Surrogates


The first problem of this thesis is aimed at short-term forecasting of often delayed and unstable target sources such as Influenza-Like-Illness (ILI) case counts as reported by surveillance agencies such as CDC [10] and PAHO [44]. We compared a range of surrogates encompassing physical sources such as humidity and temperature, and social sources such as Twitter and News, in [12] under a Matrix Factorization framework for ILI prediction in 15 Latin American countries. We found that no single source is best suited to model ILI for all countries. However, physical sources were in general the most informative sources. Furthermore, combining the sources led to better forecasting accuracy in general. We present these considerations in Chapter 2. We next focused on increasing the forecasting horizon and used Regularized Generalized Linear Models to capture dynamic trends of ILI data in [62]. Our experiments indicate that we can reliably forecast up to 4 weeks in advance, for a range of countries including the USA and several Latin American countries, using our proposed methods. We highlight the important aspects of our findings on this problem in Chapter 3.

Chapter 2

Forecasting a Moving Target: Ensemble Models for ILI Case Count Predictions

Traditionally, epidemiological forecasts of common illnesses, such as the flu, rely heavily on surveillance reports published by health organizations. However, as discussed in Chapter 1, traditional surveillance reports are often published with a considerable delay, and thus recent research has focused on mining social signals from search engine query volume [67, 24] and social media chatter [27, 34, 39, 15, 56]. One of the pioneering works in this space is that of Ginsberg et al. [24], where ILI case counts are predicted from the volume of search engine queries. This work inspired significant follow-on work, e.g., [67], where Yuan et al. used search query data from Baidu (a popular search engine in China) to detect influenza outbreaks. More real-time ILI detection systems [34] have been proposed by modeling Twitter streams. Apart from such social media sources, there has also been considerable research on exploiting physical indicators such as climate data. The primary advantage of such data sources is that the effects are much more causal and less noisy. Shaman et al. [57, 49, 51] explored this area in detail and found absolute humidity to be a good indicator of influenza outbreaks. While the aforementioned efforts have made important strides, there are important areas that have been relatively less studied. First, only a few efforts have focused on combining multiple data sources [29, 27] to aid in forecasting. In particular, to the best of our knowledge there has been no work that investigates the combination of social indicators and physical indicators to forecast ILI incidence. Second, and more importantly, official estimates as reported by health organizations (e.g., WHO, PAHO) are often lagged by several weeks and, even when reported, are typically revised for several weeks before the case counts are finalized. Real-time prediction systems must be designed to handle the forecasting of such a ‘moving

target’. Finally, most existing work has been retrospective and not set in the context of a formal data mining validation framework. To overcome these deficiencies, we propose a novel approach to ILI case count forecasting. Our contributions are:

• Our approach integrates both social indicators and physical indicators and thus leverages the selective superiorities of both types of feature sets. We systematize such integration using a novel matrix factorization-based regression approach using neighborhood embedding, thus helping account for non-linear relationships between the surrogates and the official ILI estimates.

• We investigate the efficacy of combining diverse data sources at two levels: the data fusion level and the model level, and discuss the relative (de)merits.

• We propose different ways of handling uncertainties in the official estimates and factor these uncertainties into our prediction models.

• Finally, we present a detailed and prospective analysis of our proposed methods by comparing predictions from a near-horizon real-time prediction system to official estimates of ILI case counts in 15 countries of Latin America.

2.1 Related Work

Related work naturally falls into the categories of social media analytics, physical indicators, and event dynamics modeling. We describe these next.

Social media analytics: Most relevant work using social media analytics focuses on Twitter, specifically by tracking a dictionary of ILI-related keywords in the data stream. Such investigations have often focused on the importance of diversity in keyword lists, e.g., [39, 15]. In [39], Kanhabua and Nejdl used clustering methods to determine important topics in Twitter data, constructed time series for matched keywords, and used Jaccard's coefficient to characterize the temporal diversity of tweets. They noted that such temporal diversity may be correlated with real-world ILI outbreaks. In [15], the authors studied the dynamics between changes in circulated tweets and the H1N1 virus. Inspired by these works, we curated a custom ILI-related keyword dictionary, which is described in detail in Section 2.5.3.

Physical indicators for detecting ILI incidence levels: Tamerius et al. [57] investigated the existence of seasonal cycles of influenza epidemics in different climate regions. For this work, they considered climatic information from 78 globally distributed sites. Using logistic regression they found that strong correlations exist between influenza epidemics and weather conditions, especially when conditions are cold-dry or humid-rainy. Similarly, exciting results were reported by Shaman et al. in [49, 51], where they discovered absolute humidity to be a key indicator of flu. To uncover these relationships they used non-linear regressors such as Kalman filters, and this was a key inspiration for us in finding a uniform model for the varied data sources, as explained in Section 2.2.1.

Figure 2.1: Our ILI data pipeline, depicting six different data sources used in this chapter to forecast ILI case counts.

Event dynamics modeling: Denecke et al. [27] proposed an event-based approach for early prediction of ILI threats. Their method (M-Eco) considers multiple resources such as Twitter, TV reports, online news articles, and blogs and uses clustering to identify signals for event detection. Network dynamic solutions have also been used [3] to study the behavior of an epidemic in a society.

2.2 Problem Formulation

In this section, we formally introduce the problem. Let $\mathcal{P} = \langle P_1, P_2, \ldots, P_T \rangle$ denote the known total weekly ILI case counts for the country under consideration, where $P_t$ denotes the case count for time point $t$ and $T$ denotes the time point up to which the ILI case count is known. Corresponding to the ILI case count data, let us denote the available surrogate information for the same country by $\mathcal{X} = \langle \mathcal{X}_1, \mathcal{X}_2, \ldots, \mathcal{X}_{T_1} \rangle$, where $T_1$ is the time point up to which the surrogate information is available and $\mathcal{X}_t$ denotes the surrogate attributes for time point $t$. The problem we desire to solve is to find a predictive model $f$ for the case count data, as presented formally in Equation 2.1:

\[ f : \; P_t = f(\mathcal{P}, \mathcal{X}) \qquad (2.1) \]

In this chapter, in order to better understand the importance of different sources, we assume that the ILI activities in different countries are independent of each other.

2.2.1 Methods

Focusing on the methods, we employ non-linear temporal regressions over the surrogate attributes to forecast the case count using three models: (a) Matrix Factorization Based Regression (MF), (b) Nearest Neighbor Based Regression (NN), and (c) Matrix Factorization Regression using Nearest Neighbor embedding (MFN). For each of the methods, we define two parameters: β and α. α is the lookahead window length, denoting the distance of the prediction time point from T; β is the lookback window length, denoting the number of time points to look back in order to find the regression relation between the case count and the surrogate data.

We define regression vectors $V_t$ and labels $L_t$, $\forall t = 1, \ldots, T$, as below:

\[ V_t \equiv \langle P_{t-\beta-\alpha}, \mathcal{X}_{t-\beta-\alpha}, P_{t+1-\beta-\alpha}, \mathcal{X}_{t+1-\beta-\alpha}, \ldots, P_{t-\alpha}, \mathcal{X}_{t-\alpha} \rangle \]
\[ L_t \equiv P_t \]

The regression vector for predicting the case count at time point $T'$ ($T + \alpha > T' > T$) is given by Equation 2.2:

\[ V_{T'} \equiv \langle P_{T'-\beta-\alpha}, \mathcal{X}_{T'-\beta-\alpha}, P_{T'+1-\beta-\alpha}, \mathcal{X}_{T'+1-\beta-\alpha}, \ldots, P_{T'-\alpha}, \mathcal{X}_{T'-\alpha} \rangle \qquad (2.2) \]

Under these definitions we describe the models as follows:

Matrix Factorization Based Regression (MF):

Matrix factorization is a well-accepted technique in the recommender systems literature to predict user preferences from incomplete user ratings/information. Typically [7], a user-preference matrix is factored into a user-factor and a factor-preference matrix. However, such factorizations are incognizant of any temporal continuity. To enforce temporal continuity when predicting for the time point $T'$ ($T + \alpha > T' > T$), we use the regression vectors and labels defined earlier to define an $m \times n$ prediction matrix $\mathcal{M}$, as given in Equation 2.3:

\[ \mathcal{M} = \begin{bmatrix} V_{\alpha+\beta+1} & L_{\alpha+\beta+1} \\ \vdots & \vdots \\ V_T & L_T \\ V_{T'} & L_{T'} \end{bmatrix} \qquad (2.3) \]

The prediction matrix is factorized into an $f \times m$ factor-feature matrix $U$ and an $f \times n$ factor-prediction matrix $F$ as:

\[ \widehat{\mathcal{M}}_{i,j} = b_{i,j} + U_i^T F_j \]

Here, $b_{i,j}$ is the baseline estimate given by:

\[ b_{i,j} = \bar{\mathcal{M}} + b_j \qquad (2.4) \]

where $\bar{\mathcal{M}}$ represents the all-element average and $b_j$ represents the column-wise deviation from the average; $b_j$ is generally a free parameter, i.e., it is fitted as part of the optimization problem. The $U$ and $F$ matrices are estimated by minimizing the error function:

\[ b_*, F, U = \operatorname{argmin} \sum_{i=1}^{m-1} \left( \mathcal{M}_{i,n} - \widehat{\mathcal{M}}_{i,n} \right)^2 + \lambda_1 \left( \sum_{j=1}^{n} b_j^2 + \sum_{i=1}^{m-1} \| U_i \|^2 + \sum_{j=1}^{n} \| F_j \|^2 \right) \qquad (2.5) \]

where $\lambda_1$ is a regularization parameter. An important design criterion in the error function of Equation 2.5 is that we only compute the error between the predicted label values and the actual label values, i.e., the $n$th column of the prediction matrix $\mathcal{M}$. The rationale behind this choice is that, unlike traditional recommender systems, we are only concerned with the label column and can sacrifice reconstruction accuracy for the other columns.

The lookback window $\beta$, the factor size $f$, and the regularization parameter $\lambda_1$ are estimated using cross-validation, and the final prediction for time point $T'$ is given by:

\[ \hat{P}_{T'} = b_{m,n} + U_m^T F_n \]
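The following sketch conveys the flavor of the MF regressor with a simple stochastic-gradient loop: the prediction matrix (regression vectors plus a label column) is factorized with column biases, and the label of the last row is read off from the reconstruction. As a simplification, the feature columns are also reconstructed here so that the last row's factors can be learned from its features, the baseline is reduced to the mean of the known labels, and the hyperparameters are arbitrary; this is a sketch, not the exact optimization of Equation 2.5.

```python
# Simplified matrix-factorization regression over the prediction matrix.
import numpy as np

def mf_predict(features, labels, f=5, lam=0.1, lr=0.005, epochs=300, seed=0):
    """features: (m, d) regression vectors (last row is the query T');
    labels: (m-1,) known labels. Returns a predicted label for the last row."""
    rng = np.random.default_rng(seed)
    features = np.asarray(features, dtype=float)
    labels = np.asarray(labels, dtype=float)
    m, d = features.shape
    M = np.hstack([features, np.zeros((m, 1))])
    M[:-1, -1] = labels                        # label column; last entry unknown
    n = d + 1
    mean = labels.mean()                       # crude stand-in for the baseline
    U = 0.1 * rng.standard_normal((m, f))
    F = 0.1 * rng.standard_normal((n, f))
    b = np.zeros(n)
    for _ in range(epochs):
        for i in range(m):
            cols = range(n) if i < m - 1 else range(d)   # skip the unknown label
            for j in cols:
                err = M[i, j] - (mean + b[j] + U[i] @ F[j])
                ui = U[i].copy()
                b[j] += lr * (err - lam * b[j])
                U[i] += lr * (err * F[j] - lam * U[i])
                F[j] += lr * (err * ui - lam * F[j])
    return mean + b[-1] + U[-1] @ F[-1]        # reconstructed label for row T'
```

For example, one would stack the query vector $V_{T'}$ from Equation 2.2 beneath the training vectors before calling this function.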

Nearest Neighbor Based Regression (NN):

For our second class of models, viz. nearest neighbor models, we define a training set $\Gamma_{NN} = \{ V_t, L_t \}$, where $V_t$ represents the regression attributes and $L_t$ denotes the corresponding labels. Also, let us define the set $\mathcal{N}(i) = \{ k : V_k \text{ is one of the top } K \text{ nearest neighbors of } V_i \}$, where $K$ indicates the maximum number of nearest neighbors considered. The predicted count $\hat{P}_{T'}$ for the time point $T'$ is given as:

\[ \hat{P}_{T'} = \left( \sum_{k \in \mathcal{N}(T')} \theta_k L_k \right) \Big/ \sum_{k=1}^{K} \theta_k \qquad (2.6) \]

Here $\theta_k$ indicates the weight assigned to the $k$th nearest neighbor. Typically the inverse Euclidean distances to $V_{T'}$ are chosen as the weights.
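A minimal sketch of the nearest-neighbor regressor with inverse-Euclidean-distance weights of Equation 2.6; the function name and the small epsilon that guards against zero distances are additions for the example.

```python
# K-nearest-neighbor regression with inverse Euclidean distance weights.
import numpy as np

def nn_predict(V_train, L_train, v_query, K=5, eps=1e-6):
    V_train = np.asarray(V_train, dtype=float)
    L_train = np.asarray(L_train, dtype=float)
    dists = np.linalg.norm(V_train - np.asarray(v_query, dtype=float), axis=1)
    nearest = np.argsort(dists)[:K]              # indices of the K closest vectors
    weights = 1.0 / (dists[nearest] + eps)       # inverse-distance weights
    return float(weights @ L_train[nearest] / weights.sum())
```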

Matrix Factorization Based Regression using Nearest Neighbor Embedding (MFN):

It has been shown in [28] that matrix factorization using nearest neighbor constraints can outperform the classical matrix factorization approach as well as traditional nearest neighbor approaches for recommender systems. Drawing inspiration from this result, we modify the method to suit the temporal nature of our problem in a similar way as described in Section 2.2.1. We again define a similar prediction matrix $\mathcal{M}$ (see Equation 2.3). Following [28], we define the matrix decomposition rule as

\[ \widehat{\mathcal{M}}_{i,j} = b_{i,j} + U_i^T F_j + F_j \, |\mathcal{N}(i)|^{-\frac{1}{2}} \sum_{k \in \mathcal{N}(i)} \left( \mathcal{M}_{i,k} - b_{i,k} \right) x_k \qquad (2.7) \]

The key difference between Equation 2.7 and the one proposed in [28] is that we don't have any term for implicit feedback and, further, only the top $K$ neighbors as found through Euclidean distance are used. The model is fitted using Equation 2.8 as given below:

\[ b_*, F, U, x_* = \operatorname{argmin} \sum_{i=1}^{m-1} \left( \mathcal{M}_{i,n} - \widehat{\mathcal{M}}_{i,n} \right)^2 + \lambda_2 \left( \sum_{j=1}^{n} b_j^2 + \sum_{i=1}^{m-1} \| U_i \|^2 + \sum_{j=1}^{n} \| F_j \|^2 + \sum_{k} \| x_k \|^2 \right) \qquad (2.8) \]

2.3 Ensemble Approaches

In the last section, we described different strategies to correlate a specific source with the ILI case count of a specific country and predict future ILI counts. In practice, we desire to work with a multitude of data sources, and there are two broad ways to accomplish this objective: (a) data level fusion, where a single regressor is constructed from the different data sources to the ILI case count, and (b) model level fusion, where we build one regressor for each data source and subsequently combine the predictions from the models. In this section, we describe these fusion methods. Experimental results with both methods are presented in Section 2.6.

2.3.1 Data level fusion:

Here we express the feature vector $\mathcal{X}$ as a tuple over all the different data sources and then proceed with any one of the regression methods outlined in Section 2.2.1. For example, while combining the Twitter and weather data sources (see Figure 2.1), the feature vector is given by:

\[ \mathcal{X}_t = \langle \mathcal{T}_t, \mathcal{W}_t \rangle \]

where $\mathcal{T}_t$ and $\mathcal{W}_t$ denote attributes derived from Twitter and weather, respectively.
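In code, data level fusion is simply a week-by-week concatenation of per-source feature blocks before the regression vectors are built; the arrays below are hypothetical.

```python
# Data level fusion: concatenate per-source weekly features column-wise.
import numpy as np

twitter_feats = np.random.rand(100, 114)   # weekly keyword counts (hypothetical)
weather_feats = np.random.rand(100, 3)     # e.g., humidity, temperature, rainfall

X_fused = np.hstack([twitter_feats, weather_feats])   # X_t = <T_t, W_t>
print(X_fused.shape)                                  # (100, 117)
```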

2.3.2 Model level fusion:

In this approach, the models are combined using matrix factorization regression with nearest neighbor embedding, by comparing the prediction estimates from each model with the actual estimate (since the ground truth can change as well) and the average ILI case count for the month for the particular country (to help organize a baseline). Let us denote the average ILI case count for a particular calendar month $I$ for a given country by:

\[ \mu_I = \sum_{t \in I} P_t \, \Big/ \, |\{ t \in I \}| \]

Considering $C$ different sources and hence $C$ different models, let us denote the prediction for the $t$th time point from the $c$th model by ${}_c\hat{P}_t$. Using these definitions we can now proceed to describe the fusion model. Essentially, the model is similar to the one described in Section 2.2.1, where the differences can be found in the way we construct the feature vectors. Similar to Equation 2.3, we construct an $m' \times n'$ prediction matrix for fusion, ${}_C\mathcal{M}$, whose $t$th row is represented by Equation 2.9:

\[ {}_C\mathcal{M}_t = \left[ \, {}_1\hat{P}_t \;\; \ldots \;\; {}_C\hat{P}_t \;\; P_t \, \right] \qquad (2.9) \]

Then, similar to Equation 2.7, we factor this matrix into latent factors ${}_CU$, ${}_CF$, ${}_Cb_*$ as given by Equation 2.10:

\[ {}_C\widehat{\mathcal{M}}_{i,j} = \mu_i + {}_Cb_j + {}_CU_i^T \, {}_CF_j + {}_CF_j \, |{}_C\mathcal{N}(i)|^{-\frac{1}{2}} \sum_{k \in {}_C\mathcal{N}(i)} \left( {}_C\mathcal{M}_{i,k} - \mu_i + {}_Cb_k \right) {}_CZ_k \qquad (2.10) \]

so that the final prediction for the $T'$th data point is given by $\hat{P}_{T'} = {}_C\widehat{\mathcal{M}}_{T', n'}$. The fitting function is given by Equation 2.11:

\[ {}_Cb_*, {}_CF, {}_CU, {}_Cx_* = \operatorname{argmin} \sum_{i=1}^{m'-1} \left( {}_C\mathcal{M}_{i,n'} - {}_C\widehat{\mathcal{M}}_{i,n'} \right)^2 + \lambda_3 \left( \sum_{j=1}^{n'} {}_Cb_j^2 + \sum_{i=1}^{m'-1} \| {}_CU_i \|^2 + \sum_{j=1}^{n'} \| {}_CF_j \|^2 + \sum_k \| {}_Cx_k \|^2 \right) \qquad (2.11) \]

As before the free parameters are estimated through cross-validation.

2.4 Forecasting a Moving Target

One of the key challenges in creating a prospective ILI case count predictor is the fact that the official estimates are often delayed and, furthermore, even when published, the estimates are revised over a number of weeks before they finally become stable. For this chapter, we concentrate on 15 Latin American countries as described in Section 2.5 and consider the official ILI estimates from the Pan American Health Organization (PAHO). Thus we can categorize PAHO count values downloaded in any week into three different types: (a) the unknown PAHO counts, represented by $\ddot{P}_t$; (b) the known and stable PAHO counts, denoted by $\dot{P}_t$; and (c) the known and unstable PAHO counts, denoted by $\tilde{P}_t$. While we desire to predict $\ddot{P}_t$, the uncertainty associated with $\tilde{P}_t$ introduces errors in the predictions. In this section, we study the effects of such unstable data and propose three different models to adjust these unstable values to more accurate ones.

Figure 2.2a plots the relative error of an unstable PAHO data series w.r.t. its final estimate, as a function of time. It can be seen that different countries have different stability characteristics: for some countries, PAHO count values stabilize very slowly, whereas for others they stabilize faster (especially as the number of updates for a week increases). The stability behavior of PAHO count values was also found to depend on the time of the year, as shown in Figure 2.2b. To plot this curve for Argentina, we categorized any week with less than 100 cases as belonging to the low season, greater than 300 as the high season, and the remaining values as the mid season (the thresholds were different for different countries). At the same time, the PAHO official updates provide an indication of the number of samples used to generate the case count estimate. Preliminary experiments show that this size is correlated with the accuracy of ILI case counts. In other words, in general, larger values of the statistical population size result in smaller relative errors for the ILI case count. Thus, using both the number of samples and the lag in uploading the week's data, we can use machine learning techniques to revise the officially published PAHO estimates. Preliminary results show that for different seasons and different countries, we encounter different stability patterns. Therefore, any PAHO count adjustment method should be customized for seasons and countries separately. Let us assume that $\dot{\mathcal{P}}$ is the set of stable PAHO counts for a specific country. Also, assume that the sequence of updates for each stable PAHO count value is available. In other words, for $\dot{P}_i$ we have the following set:

\[ \dot{\mathcal{P}}_i = \left\{ P_i^{(1)}, P_i^{(2)}, \ldots, P_i^{(m)}, \ldots \right\} \qquad (2.12) \]

where $P_i^{(m)}$ is the value of $P_i$ after $m$ weeks of updates.

Figure 2.2: Average relative error of PAHO count values with respect to stable values. (a) Comparison between Argentina and Colombia. (b) Comparison between different seasons for Argentina.

After recognizing the high-, low-, and mid-season months for the country, we can categorize each $\dot{P}_i$ as belonging to one of these categories. Then, for category $S$, an adjustment dataset named $\mathcal{A}_{\mathcal{P}}^{S}$ is constructed, which is defined as follows:

\[ \mathcal{A}_{\mathcal{P}}^{S} = \left\{ \left( 1, P_i^{(1)}, \dot{P}_i, N_i^{(1)} \right), \ldots, \left( m, P_i^{(m)}, \dot{P}_i, N_i^{(m)} \right), \ldots \right\} \qquad (2.13) \]

Each member of $\mathcal{A}_{\mathcal{P}}^{S}$ is a tuple with four entries: the first entry denotes the time slot that the sample belongs to; the second entry is the actual unstable value of $P_i$; the third entry is the related stable value; and finally, $N_i^{(m)}$ is the size of the statistical population for that week. In the next step, a linear regression algorithm is used to adjust unstable PAHO values. In order to adjust the PAHO values in the $m$th time slot of season $S$, we use the set $\mathcal{A}_{\mathcal{P}}^{S}$ to learn the coefficients $a_0$, $a_1$, $a_2$, and $a_3$ in the following equation:

\[ \hat{\dot{P}}_i^{(m)} = a_0 + a_1 m + a_2 P_i^{(m)} + a_3 N_i^{(m)} \qquad (2.14) \]

where $\hat{\dot{P}}_i^{(m)}$ is the adjusted PAHO count value for the $m$th time slot. Experimental results show that this adjustment method results in more accurate known PAHO values. The average relative errors of the published unstable PAHO values before and after correction for each country are shown in Figure 2.3. While in a few cases we do not experience any improvement, for countries such as Argentina and Paraguay we experience significant improvements. Finally, similar to Equation 2.14, in addition to $P_i^{(m)}$ one can use only the time difference ($m$) or the size of the population ($N_i^{(m)}$) to correct unstable PAHO values. The effect of these corrections on the overall accuracy of predictions is explored in Section 2.6.
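A sketch of the stabilization step: the coefficients $a_0$ through $a_3$ of Equation 2.14 are fit by ordinary least squares from the adjustment tuples of Equation 2.13. The tuples below are invented, and in practice a separate fit would be learned per country and season category.

```python
# Fit the PAHO adjustment regression of Equation 2.14 by least squares.
import numpy as np

# Hypothetical adjustment tuples: (time slot m, unstable value, stable value, sample size)
tuples = [
    (1, 120, 270, 400),
    (2, 180, 270, 430),
    (3, 230, 270, 450),
    (1, 200, 310, 380),
    (2, 250, 310, 410),
    (3, 290, 310, 440),
]
m_slot, unstable, stable, n_size = (np.array(col, dtype=float) for col in zip(*tuples))

A = np.column_stack([np.ones_like(m_slot), m_slot, unstable, n_size])
(a0, a1, a2, a3), *_ = np.linalg.lstsq(A, stable, rcond=None)

adjusted = a0 + a1 * 2 + a2 * 240 + a3 * 420   # adjust a new 2-week-old report
print(round(adjusted, 1))
```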

2.5 Experimental Setup

2.5.1 Reference Data.

In this chapter, we focus on 15 Latin American countries, viz. Argentina, Bolivia, Costa Rica, Colombia, Chile, Ecuador, El Salvador, Guatemala, French Guiana, Honduras, Mexico, Nicaragua, Paraguay, Panama and Peru. We collected weekly ILI counts from the official Pan American Health Organization (PAHO) website (http://ais.paho.org/phip/viz/ed_flu.asp) every day from January 2013 to August 2013. The estimates downloaded every day for each country contain data from January 2010 to the latest available week on the day of collection. This dataset is stored in a database we refer to as the Temporal Data Repository (TDR). The TDR is also timestamped so that for any given day, we can readily retrieve the ILI case counts that were downloaded on that day. This is important as historic data may be updated by PAHO even a number of weeks after the first update. For the purpose of experimental validation we used the data for the period Jan 2010 to December 2012 as the static training set. We considered Wednesday as the reference day within each week. For each Wednesday from Jan 2013 to July 2013, we used the latest available PAHO data in the TDR for that day and predicted 2 weeks from the last available week for which PAHO data was available. These predictions are then evaluated against the final ILI case counts as downloaded on September 1, 2013, and we report the performance of our algorithms in Section 2.6.

Figure 2.3: Average relative error of PAHO count values before and after correction for different countries.

2.5.2 Evaluation criteria.

We evaluate the prediction accuracy of the different algorithms using a modified version of percentage relative error:

\[ \mathcal{A} = \frac{4}{N_p} \sum_{t=t_s}^{t_e} \left( 1 - \frac{|P_t - \hat{P}_t|}{\max(P_t, \hat{P}_t, 10)} \right) \qquad (2.15) \]

where $t_s$ and $t_e$ indicate the starting and ending time points for which predictions were generated, and $N_p$ indicates the number of time points over the same time period (i.e., $N_p = t_e - t_s + 1$). Note that the measure is scaled to have values in $[0, 4]$ and the denominator is designed to not over-penalize small deviations from the true ILI case count (e.g., when the true case count is 0 and the predicted count is 1). It is to be noted that the accuracy metric so defined is non-convex and is, in general, multi-modal.
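For reference, the accuracy measure of Equation 2.15 transcribes directly into a small function; the example arrays are made up.

```python
# Accuracy measure of Equation 2.15, scaled to [0, 4].
import numpy as np

def accuracy(P_true, P_pred):
    P_true = np.asarray(P_true, dtype=float)
    P_pred = np.asarray(P_pred, dtype=float)
    denom = np.maximum(np.maximum(P_true, P_pred), 10.0)
    return 4.0 * np.mean(1.0 - np.abs(P_true - P_pred) / denom)

print(accuracy([0, 150, 300], [1, 140, 360]))   # made-up example; closer to 4 is better
```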

2.5.3 Surrogate data sources.

Before describing our data sources in detail, we describe our overall methodology for organizing a flu-related dictionary (for tracking in multiple media such as news, tweets, and search queries).

Dictionary creation.

The keywords relating to ILI were organized from a seed set of words and expanded using a combination of time series correlation analysis and pseudo-query expansion. The seed set of keywords (e.g., gripe) was constructed in Spanish, Portuguese, and English using feedback from our in-house subject matter experts.

Pseudo-query expansion. Using the seed set, we crawled the top 20 web sites (according to Google Search) associated with each word in this set. We also crawled some expert sites such as the official CDC website and equivalent websites of the countries under consideration, detailing the causes, symptoms and treatment for influenza. Additionally, we crawled a few hand-picked websites such as http://www.flufacts.com and http://health.yahoo.net/channel/flu_treatments. We filtered the words from these sites using standard language processing filtering techniques such as stopword removal and Porter stemming. The filtered set of keywords was then ranked according to the absolute frequency of occurrence. The top 500 words for Spanish and English were then selected. For example, words such as enfermedad and pandemia were obtained from this step.

Time series correlation analysis. Next we used Google Correlate (now a part of Google Trends) to identify keywords most correlated with the ILI case count time series for each country. Once again these words were found to be a mix of both English and Spanish. As an added step in this process, we also compared time-shifted ILI counts: left-shifted to capture the words searched leading up to the actual flu infection, and right-shifted to capture the words commonly searched during the tail of the infection. This entire exercise provided us with some interesting terms like ginger, which has been used as a natural herbal remedy in the eastern world. We also found popular flu medications such as Acemuk and Oseltamivir (the latter also sold under the trade name Tamiflu) as highly correlated search queries, especially for Argentina.

Final filtering. The set of terms obtained from query expansion and correlation analysis was then pruned by hand to obtain a vocabulary of 151 words. We then performed a final correlation check and retained a final set of 114 words.
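The correlation-based steps above can be pictured as ranking candidate keyword series by their best (possibly time-shifted) correlation with the ILI count series and keeping those above a threshold. The sketch below uses invented data and a made-up threshold; it illustrates the idea rather than the exact procedure used.

```python
# Keep keywords whose weekly series correlates with ILI counts at some shift.
import numpy as np

def best_shifted_corr(keyword_series, ili_series, max_shift=4):
    best = 0.0
    for shift in range(-max_shift, max_shift + 1):
        r = np.corrcoef(np.roll(keyword_series, shift), ili_series)[0, 1]
        best = max(best, abs(r))
    return best

rng = np.random.default_rng(1)
weeks = np.arange(150)
ili = 100 + 50 * np.sin(2 * np.pi * weeks / 52) + rng.normal(0, 5, 150)
candidates = {
    "gripe": np.roll(ili, -2) + rng.normal(0, 10, 150),   # leads ILI by ~2 weeks
    "ginger": rng.normal(50, 10, 150),                    # unrelated noise
}

kept = [word for word, series in candidates.items()
        if best_shifted_corr(series, ili) > 0.5]
print(kept)   # expected: ['gripe']
```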

Google Flu Trends (F):

Google Flu Trends (GFT; http://www.google.org/flutrends) is a tool based on [24] and provided by Google.org which gives weekly and up-to-date ILI case count estimates using search query volumes. Of the countries under consideration, GFT provides weekly estimates for only 6 of them, viz. Argentina, Bolivia, Chile, Mexico, Peru and Paraguay. These estimates are typically at a different scale than the ILI case counts provided by PAHO and therefore need to be scaled accordingly. We collected this data weekly on Mondays from Jan 2013 to Aug 2013. (The data downloaded on a particular day contains the entire time series from 2004 to the corresponding week.)

Google Search Trends (S):

Google Search Trends (http://www.google.com/trends) is another tool provided by Google. Using this tool we can download an estimate of search query volume as a percentage over its own temporal history, filtered geographically. We download the search query volume time series for the 114 keywords described earlier and convert the percentage measures to absolute values using a static dataset we downloaded in Oct 2012, when Google Search Trends used to provide absolute query volumes.

Twitter (T):

Twitter data was collected from Datasift.com and geotagged using an in-house geocoder. We lemmatized the tweet contents and used language detection and POS tagging to help differentiate relevant from irrelevant uses of our keywords (e.g., the Spanish word gripe, meaning flu, is part of our flu keyword list as opposed to the undesired and unrelated English word ‘gripe’). The resulting analysis yields a weekly occurrence count of our dictionary in tweets.

HealthMap (H):

Similar to Twitter, we also collect flu-related news stories using HealthMap (http://healthmap.org), an online global disease alert system capturing outbreak data from over 50,000 electronic sources. Using this service we receive flu-related news as a daily feed, which is similarly enriched and filtered to obtain a multivariate time series over lemmatized versions of the keywords. While Twitter is more suitable for ascertaining general public response, the HealthMap data provides more detailed information but may capture the trends at a slower rate. Thus each of these sources offers utility in capturing different surrogate signals: Twitter offers leading but noisy indicators whereas HealthMap provides a slightly delayed but more reliable indicator.

OpenTable (O):

We also use data on trends in restaurant table reservations, initially studied in [41] as a potential early indicator for outbreak surveillance, as another surrogate for ILI detection. This novel data stream is based on the postulate that a higher than average number of restaurants with table availability in a region can serve as an indicator of an event of interest, such as an increase in flu cases. Table availability was monitored using OpenTable (http://www.opentable.com), an online restaurant reservation site with 28,000 restaurants at the time of this writing. Daily searches were performed starting from September 2012 for a table for two persons at lunch and dinner: between 12:30-3pm, and between 6-10:30pm. Data was collected for Mexico by city (Cancun, Mexico City, Puebla, Monterrey, and Guadalajara) and for the entire country. The daily proportion (a proportion is used due to changes in the number of restaurants in the system) of restaurants with available tables was aggregated as a weekly time series.

Weather (W): All of the previously described data sources can be termed non-physical indicators, which work as indirect indicators of the state of the population with respect to flu by exposing different population characteristics. On the other hand, meteorological data can be considered a more direct and physical driver of influenza transmission [65]. It has been shown in [49, 51, 57] that absolute humidity can be directly used to predict the onset of influenza epidemics. Here, we collect several other meteorological indicators such as temperature and rainfall in addition to humidity from the Global Data Assimilation System (GDAS). We accessed this data in GRIB format from http://ladsweb.nascom.nasa.gov/ at a resolution of 1 degree lat/long. However, using all lat/long points for a country can often lead to noisy data, so we filtered the downloaded data and used the indicators only around the surveillance centers. We also aggregate this data using weekly averages to obtain a resultant time series for each country. We collected this data weekly from Jan 2013 to August 2013.

2.6 Results

In this section, we present an exhaustive set of experiments evaluating our algorithms over 6 months of predictions from Jan 2013 to August 2013. The final and stable estimates of ILI case counts are considered to be the estimates downloaded from PAHO on Oct 1, 2013. All models considered here were used to forecast 2 weeks beyond the latest available PAHO ILI estimates. Key findings are presented in Table 2.1. We analyze some important observations from this table next.

Figure 2.4: Accuracy of different methods for each country.

Can we ‘beat’ Google Flu Trends with our custom dictionary? The key difference between Google Flu Trends (which can be considered a base rate) and Google Search Trends is that the former uses a closed dictionary whereas we constructed the dictionary used with GST. As can be seen in Table 2.1, for the majority of the common countries (countries for which data from both GST and GFT is present), regressors running on GST consistently outperform those running on GFT (with Mexico and Peru being the exceptions). Thus we posit that the GST model devised here is a sufficiently close approximation to GFT, with the added advantages of having access to raw-level data and being available for more countries than GFT (among the 15 countries we consider, only 6 are present in the GFT database).

Which is the optimal regression model? From Table 2.1, we can also analyze the three different regressors proposed in Section 2.2.1 with respect to overall accuracy. With respect to each individual source, we can see that matrix factorization with nearest neighbor embedding (MFN) performs the best on average over the countries. For some countries such as Panama, when using only GST, MFN performs poorer than vanilla MF; nevertheless the average accuracy over all countries for any given data source is best when using MFN.

Table 2.1: Comparing forecasting accuracy of models using individual sources. Scores in this and other tables are normalized to [0,4] so that 4 is the most accurate.

Model  Source  AR    BO    CL    CR    CO    EC    GF    GT    HN    MX    NI    PA    PY    PE    SV    All
MF     W       2.78  2.46  2.39  2.14  2.70  2.22  2.12  2.63  2.52  2.73  2.31  2.21  2.49  2.77  2.61  2.47
       H       2.81  2.31  2.22  1.92  2.43  2.04  2.11  2.57  2.33  2.48  2.39  2.15  2.18  2.47  2.33  2.32
       T       2.37  2.35  2.18  2.03  2.21  2.12  1.83  2.12  2.29  2.03  1.89  2.06  1.96  2.20  2.21  2.12
       F       2.34  2.11  2.29  N/A   N/A   N/A   N/A   N/A   N/A   2.71  N/A   N/A   2.31  2.24  N/A   2.33
       S       2.48  2.21  2.33  2.04  2.31  2.21  1.93  2.03  2.15  2.51  2.42  2.52  2.33  1.93  2.30  2.24
NN     W       2.92  2.93  2.63  2.52  2.66  2.51  2.71  2.82  2.59  2.62  2.55  2.59  2.61  2.80  2.52  2.66
       H       2.73  3.10  2.42  2.27  2.83  2.64  2.43  2.25  2.71  2.31  2.61  2.35  2.43  2.39  2.52  2.53
       T       2.72  2.86  2.31  2.62  2.77  2.52  2.71  2.66  2.51  2.44  2.13  2.01  1.77  2.51  2.20  2.45
       F       2.11  2.21  2.33  N/A   N/A   N/A   N/A   N/A   N/A   2.19  N/A   N/A   2.41  2.32  N/A   2.26
       S       2.51  2.31  2.41  1.81  2.52  2.41  2.12  2.29  2.51  2.13  2.61  2.14  2.51  1.87  2.12  2.28
MFN    W       2.99  3.01  2.88  2.53  2.78  2.81  2.77  2.83  2.61  2.70  2.56  2.66  2.82  2.79  2.51  2.75
       H       2.81  3.13  2.63  2.58  2.91  2.77  2.57  2.63  2.73  2.50  2.61  2.54  2.51  2.69  2.61  2.68
       T       2.74  3.03  2.51  2.64  2.83  2.51  2.81  2.71  2.60  2.48  2.13  2.55  2.19  2.57  2.31  2.57
       F       2.33  2.41  2.34  N/A   N/A   N/A   N/A   N/A   N/A   2.69  N/A   N/A   2.54  2.48  N/A   2.46
       S       2.61  2.44  2.55  2.22  2.61  2.52  2.71  2.31  2.62  2.48  2.61  2.31  2.53  2.23  2.13  2.46

Table 2.2: Comparison of prediction accuracy while combining all data sources and using MFN regression.

Fusion Level  AR    BO    CL    CR    CO    EC    GF    GT    HN    MX    NI    PA    PY    PE    SV    All
Model         3.12  3.22  3.03  2.88  2.98  3.13  2.87  2.99  2.87  3.00  2.77  2.82  2.81  2.92  2.87  2.95
Data          3.01  2.97  3.13  2.87  2.86  3.04  2.91  2.88  2.72  2.89  2.70  2.60  2.88  2.81  2.92  2.88

Table 2.3: Comparison of prediction accuracy while using model level fusion on MFN regressors and employing PAHO stabilization.

Correction Method  AR    BO    CL    CR    CO    EC    GF    GT    HN    MX    NI    PA    PY    PE    SV    All
None               3.12  3.22  3.03  2.88  2.98  3.13  2.87  2.99  2.87  3.00  2.77  2.82  2.81  2.92  2.87  2.95
Weeks Ahead        3.15  3.24  3.04  2.87  2.97  3.17  2.87  2.99  2.88  3.05  2.77  2.91  3.02  2.91  2.88  2.98
Num samples        3.20  3.24  3.03  2.88  2.96  3.12  2.87  3.01  2.89  3.12  2.78  2.92  3.04  2.91  2.87  2.99
Combined           3.21  3.24  3.05  2.89  2.96  3.19  2.87  3.00  2.89  3.13  2.77  2.93  3.08  2.92  2.88  3.00

Table 2.4: Discovering importance of sources in Model level fusion on MFN regressors by ablating one source at a time.

Sources  AR    BO    CL    CR    CO    EC    GF    GT    HN    MX    NI    PA    PY    PE    SV    All
All      3.21  3.24  3.05  2.89  2.96  3.19  2.87  3.00  2.89  3.13  2.77  2.93  3.08  2.92  2.88  3.00
w/o W    2.91  2.99  2.77  2.71  2.61  2.59  2.66  2.69  2.49  2.78  2.62  2.87  2.60  2.43  2.67  2.69
w/o H    3.04  2.85  2.89  2.56  2.81  2.77  2.61  2.75  2.75  2.82  2.57  2.75  2.51  2.87  2.71  2.75
w/o T    2.92  3.14  2.95  2.61  2.72  2.81  2.88  2.79  2.61  2.93  2.74  2.63  2.79  2.74  2.81  2.80
w/o S    3.19  3.11  2.92  2.64  2.69  2.70  2.89  2.88  2.78  3.07  2.75  2.91  2.80  2.71  2.86  2.86
w/o F    3.20  3.12  2.88  2.89  2.96  3.19  2.87  3.00  2.83  3.02  2.77  2.93  2.98  2.88  2.88  2.96

Table 2.5: ILI case count prediction accuracy for Mexico using OpenTable data as a single source, and by combining it with all other sources using model level fusion on uncorrected ILI case count data.

Method        Lunch  Dinner  Lunch & Dinner
MF            1.92   2.23    2.31
NN            1.99   1.83    2.11
MFN           2.11   2.31    2.44
Model Fusion  2.96   2.87    2.99

Which is the best strategy to combine multiple data sources? As shown in Table 2.2, overall, model level fusion works better than data level fusion. For 8 of the 15 countries, model level fusion works appreciably better than data level fusion, while the reverse trend is seen for 4 other countries. This showcases the importance of considering both kinds of fusion depending on the country of interest.

How effective are we at forecasting a moving PAHO target? As shown in Table 2.3, our corrected estimates using both the number of samples and the weeks ahead from the upload date are generally better. It is instructive to note that while our correction strategy increases the overall accuracy only by approximately 0.05 across all countries, for some countries such as Mexico and Argentina (for which the data updates are typically noisy) we obtain a substantial improvement in scores. This suggests that the correction strategy may be selectively applied when forecasting for certain countries.

How do physical vs social indicators fare against each other? From Table 2.1, we see that the data source with the best single-source accuracy happens to be the physical indicator source, i.e., weather data. However, Table 2.4 conveys a mixed story. Here we conduct an ablation test, wherein we remove one data source at a time from our model level MFN fusion framework and contrast accuracies. While removing the weather data degrades the accuracy score the most, removing the social indicators also degrades the score to varying degrees. Thus we posit that it is important to consider both the physical and social indicators to get a refined signal about the prevalent ILI incidence in the population.

How relevant is restaurant reservation data to forecasting ILI? All the results thus far do not consider the OpenTable reservation data, since this source is available only for Mexico (among the countries studied here). We considered table availability for different time ranges and compared performance using our MFN model. As Table 2.5 demonstrates, we obtain the best performance when considering both lunch and dinner reservation data. Nevertheless, we have observed that including this source as part of the ensemble decreases the overall accuracy by 0.01 over the uncorrected ILI case count data. Thus it is our opinion that although the reservation data could exhibit some signals about prevalent ILI conditions, it likely is also a surrogate for non-health conditions (e.g., social unrest) which must be factored out to make the data source more useful.

Finally, we present Figure 2.4, where we compare, for each country, the prediction accuracies from the best individual source with those from both data level and model level fusion of the different sources, and from model level fusion of MF regressors applied on the corrected PAHO estimates rather than the raw ones. As can be seen, we progressively increase our accuracies, with the corrected PAHO estimates providing the final increase in predictive power to our model level fusion framework.
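The ablation test described above can be summarized as the following sketch; the `fit_model_fusion` and `accuracy` helpers are hypothetical stand-ins for the model-level MFN fusion and the [0, 4] scoring used in this chapter.

```python
def ablation_scores(sources: dict, ili_series, fit_model_fusion, accuracy):
    """Drop one surrogate source at a time and re-score the fused model.

    sources:          mapping such as {"W": weather_df, "H": healthmap_df, ...}.
    fit_model_fusion: callable training model-level fusion on a dict of sources
                      and returning a fitted forecaster (hypothetical helper).
    accuracy:         callable mapping (forecaster, ili_series) to a [0, 4] score.
    """
    scores = {"All": accuracy(fit_model_fusion(sources), ili_series)}
    for name in sources:
        reduced = {k: v for k, v in sources.items() if k != name}
        scores[f"w/o {name}"] = accuracy(fit_model_fusion(reduced), ili_series)
    return scores
```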

2.7 Discussion

In this chapter, we have aimed to generate short-term ILI forecasts over a range of Latin American countries using a gamut of options pertaining to data sources, fusion possibilities, and corrections to track a moving target. Our results demonstrate that there are significant opportunities to improve forecasting performance and selective superiority among data sources that can be leveraged. However, the presented method works best for near-horizon forecasts, with a significant drop in accuracy (see Chapter 3) for longer-range forecasts. Thus we will next explore methods to increase the forecasting horizon while adhering to the principle of using multiple sources to generate such forecasts.

Chapter 3

Dynamic Poisson Autoregression for Influenza-Like-Illness Case Count Prediction

In Chapter 2, we presented our initial efforts at forecasting influenza-like-illness (ILI) case counts. Seasonal influenza regularly affects the global population and improvements in forecasting capability can directly translate into tangible measures of public health. The methods presented in that chapter successfully incorporated surrogate sources of information to produce real-time forecasts. However, the ‘reliable’ forecasting horizon was limited, both by model complexity as well as by the inability to maintain coherence between successive forecasts. In this chapter, we aim to relax such limitations and increase the forecasting horizon without increasing the computational complexity of the model. Traditionally, epidemiologists aim to predict several characteristics of ILI from surveillance reports. Such characteristics of interest can be broadly classified into: (a) seasonal characteristics and (b) short-term characteristics. Seasonal characteristics are concerned with the overall shape of ILI counts for the particular season (see Part II for more details). Such methods are generally trained by assigning greater importance to statistics of the ILI curve such as the peak value and the peak size. Conversely, short-term characteristics are concerned with accurately predicting the next few data points in absolute value rather than aiming for an overall fit for the season. In this chapter we are motivated by the second problem, i.e., the short-term forecasting challenge (but we also evaluate our methods w.r.t. seasonal characteristics). As discussed earlier, among the several challenges in ILI case count forecasting, one of the most important facts is that surveillance reports are often delayed by a number of weeks, and therefore estimating the current on-ground scenario is a crucial problem. The case count estimates for a given week can be delayed anywhere from 1 week to 4 weeks, depending on the quality of the surveillance apparatus in a given country. Thus in this chapter we aim to provide reliable short-term forecasts from the last available surveillance data such that we can estimate the on-ground case counts and increase our forecasting horizon to at least 4 weeks.

In traditional epidemiology, several models such as SEIR and SIRS [3] have been proposed to model the temporal profile of infectious diseases. In modern computational epidemiology, more advanced methods have been used. One of the currently popular methods is to fit prediction models by matching observational data against a large library of simulated curves [5, 40, 58]. The curve simulations are generated by using different epidemiological parameters and assumptions. Sometimes network-based models are used to generate the curves [3]. Partially observed influenza counts for a particular year can then be matched to a library of curves to produce the best set of predictions [40]. Closely related to such curve matching methods are filtering-based methods that dynamically fit epidemic models onto observed data by letting the base epidemic parameters vary over time. Yang et al. [66] provide an excellent survey of filtering-based methods used for influenza forecasting and also present a comparative analysis of such methods.

Concurrently, there has been a lot of interest in using indicator data sources to predict seasonal influenza. In [24], Ginsberg et al. presented a method of estimating weekly influenza counts based on search query volumes (Google Flu Trends). Following this seminal work, researchers have investigated a wide variety of data sources such as Wikipedia [25], Twitter [12, 34, 46], and online restaurant reservations [41]. Weather has been found to be a significant indicator of seasonal influenza [49, 50, 51, 57]. In [12], different indicator sources are contrasted to understand their relative influence on short-term forecasting quality. As rich and varied as the above approaches are, most approaches in the literature aim to use the same model to predict for the entire influenza season. This is not entirely desirable, as ‘in-season’ ILI characteristics may vary significantly from the ‘out-of-season’ characteristics (see Section 3.1.3). While researchers appreciate the need for dynamic models (e.g., [12]), constraints on temporal consistency are never explicitly imposed in current models. Thus in this chapter we aim to propose a general purpose time series prediction model allowing external factors from indicator sources to produce robust short-term forecasts in a consistent manner.

A popular model for analyzing time series data is the autoregressive exogenous (ARX) model [4, 36]. The ARX model has also been adopted by Paul et al. [46] to predict ILI case counts by using Twitter and Google Flu Trends (GFT) as the indicator sources. However, the underlying static autoregressive model may not be suitable for flu trend forecasting, as the activity of the disease and the human living environment evolve over time. Ohlsson et al. [42] have designed a more flexible ARX model for time-varying systems based on model segmentation. It allows the weights of the autoregressive model to be temporally piecewise constant. In this chapter, we further relax this requirement. We build separate models for each time point, but we constrain the models to share common characteristics. To capture such characteristics, we build a graph over models at different time points and embed the prior knowledge on model similarity in terms of the structure of the graph.
We then formulate the dynamic ARX model learning problem as a convex optimization problem, whose objective balances the autoregressive loss and the model similarity regularization induced by the graph structure. In this optimization problem, the variables have a natural block structure. Thus we apply a block coordinate descent method to solve this problem. We further extend our dynamic ARX modeling to the Poisson regression model for a better fit of the count data [4, 14], as is relevant for ILI case count forecasting. We perform extensive experimental studies to evaluate the effectiveness of the proposed model and the corresponding learning algorithm. We use various real world datasets in the experiments, including different types of indicator data sources from 15 countries around the world. Our experimental studies illustrate that the dynamic modeling of the linear Poisson autoregressive model captures well the underlying progression of disease counts. Further, our results also show that our proposed method outperforms state-of-the-art ILI case count forecasting methods. Our main contributions are summarized as follows:

• We propose a new dynamic ARX model for the task of ILI case count forecasting. This approach incorporates a linear Poisson regression model with non-negativity constraints into an ARX model, ideal for case count modeling.

• Prior domain knowledge can be encoded as structural relationships among different time points in a graph, which is embedded into the objective as a regularization term while still ensuring that the optimization problem is convex.

• We evaluate the proposed method using various real world datasets, including different types of indicator data sources from the USA and 14 Latin American countries.
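To make the graph-regularized formulation described above concrete, the following is a rough sketch of the kind of objective being balanced (a simplified squared-loss version; the exact Poisson formulation and notation are given in [62], and the variable names here are illustrative assumptions):

```python
import numpy as np

def dynamic_arx_objective(W, X, y, edges, eta=1.0):
    """Graph-regularized dynamic autoregressive loss (simplified sketch).

    W:     (T, d) array, one weight vector per time point.
    X:     (T, d) array of lagged target values and exogenous surrogate features.
    y:     (T,) array of observed case counts.
    edges: list of (i, j, a_ij) similarity-graph edges between time points.
    eta:   regularization weight controlling how much the models may vary.
    """
    fit = np.sum((y - np.einsum("td,td->t", X, W)) ** 2)          # per-time-point AR loss
    smooth = sum(a * np.sum((W[i] - W[j]) ** 2) for i, j, a in edges)
    return fit + eta * smooth
```

Minimizing such an objective over the blocks W[t] (one block per time point) is what makes block coordinate descent a natural fit.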

3.1 Summary

We present a brief summary of our findings here. For a more detailed treatment we ask the reader to refer to [62]. We developed two dynamic generalized linear models, viz. the Dynamic Autoregressive model (DARX) and the Dynamic Poisson Autoregressive model (DPARX), and compared their forecasting performance for various sources against a number of state-of-the-art algorithms. We highlight some of our interesting findings here.

3.1.1 Model Similarity

First, we conduct experiments to investigate the model similarities posited by our proposed algorithm. In this experiment, we calculate the distance between all pairs of models learned by DPARX during a period of time on the AR dataset. We present the distance matrix associated with the ground truth ILI case count series in Figure 3.1. We see that the distance matrix has a strong seasonal pattern, which is consistent with the pattern of the ILI case count series. At the beginning of each flu season, the model is significantly different from the rest of the models at other time points. This result demonstrates that ILI case counts have a strong periodic pattern and that the dynamic modeling approach successfully captures this pattern. It also validates the necessity of conducting this level of modeling for flu forecasting.

In the next experiment, we run our proposed DPARX method on the US dataset under three different model similarity graphs: the fully connected graph, the 3-nearest neighbor graph, and the seasonal 3-nearest neighbor graph. We then calculate the three corresponding distance matrices of the learned models, which are shown in Figure 3.2. The patterns in the three distance matrices are very similar. However, the distances between the pairs of models are smaller for the fully connected similarity graph. Without strong prior knowledge, the fully connected similarity graph is preferred, as the target signal may still differ considerably across seasons. In the following experiments, we use the fully connected similarity graph for the regularization term.
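A minimal sketch of how such a model distance matrix can be computed, assuming the learned per-time-point models are stored as rows of a matrix W:

```python
import numpy as np

def model_distance_matrix(W: np.ndarray) -> np.ndarray:
    """Pairwise Euclidean distances between models learned at different time points.

    W: (T, d) array whose t-th row is the weight vector learned for time point t.
    Returns a (T, T) matrix D with D[i, j] = ||W[i] - W[j]||_2.
    """
    diff = W[:, None, :] - W[None, :, :]          # shape (T, T, d)
    return np.sqrt(np.sum(diff ** 2, axis=-1))    # shape (T, T)
```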

3.1.2 Forecasting Results

In the ILI case count forecast experiments, we use the data records from all 15 countries. All the case count data are associated with several data sources similar to the ones in Section 2. We start with 50 given time points and test the prediction results on the remaining time points. We run all the competing methods in an online manner: the models are re-trained and updated after the arrival of values at every additional time point. For the DARX and DPARX models, we use the same parameter settings: p = 1, b = 15 for the GFT and Weather data sources, as these data sources have relatively small dimension; p = 1, b = 4 for the GST and HealthMap data sources, as these data sources have relatively high dimension. The ARX model does not provide numerically stable results for high dimensional data. Thus we present its results on the GFT and Weather data sources with p = 1, b = 15. Likewise, the training of the SARX model is very time consuming, especially for high dimensional data. We thus only present its results using the GFT data source with the same setting (p = 1 and b = 15). The remaining parameter in our model is the regularization parameter that controls the variation of the model. We fix it as η = 1 for the DARX model and η = 5 for the DPARX model during all experiments. For the MFN algorithm, we follow the same procedure and parameter settings as in [12]. We present the results of short-term ILI case count forecasting for different countries with both 1-step forecasts and multi-step forecasts with step sizes of 2, 3, and 4. The prediction accuracies for the GFT, Weather, GST, and HealthMap data sources are presented in Tables 3.1, 3.2, 3.3, and 3.4, respectively. The experiments show that our models yield better prediction accuracy, especially for multi-step forecasting. Multi-step forecasting is a much harder task than 1-step forecasting. The dynamic modeling of ARX provides more flexibility in handling the uncertainty associated with the target signal.

Table 3.1: Prediction accuracies for competing algorithms with different forecast steps over different countries using the GFT input source. GFT data is not available for other countries.

Step  Method  AR    BO    CL    MX    PE    PY    US
1     ARX     2.85  2.63  3.18  2.61  2.51  2.82  3.71
      MFN     2.33  2.41  2.34  2.69  2.48  2.54  3.73
      SARX    3.02  2.42  3.11  2.90  2.81  2.69  3.67
      DARX    3.05  2.74  3.12  2.78  2.50  2.65  3.71
      DPARX   3.13  2.82  3.18  2.97  2.64  2.81  3.72
2     ARX     2.38  2.22  2.83  1.88  1.90  2.57  3.47
      MFN     2.12  2.00  2.13  2.33  2.21  2.19  3.63
      SARX    2.75  2.03  2.76  2.64  2.43  2.43  3.64
      DARX    2.94  2.68  3.02  2.58  2.38  2.58  3.60
      DPARX   2.86  2.70  2.89  2.64  2.52  2.65  3.61
3     ARX     2.11  1.86  2.61  1.28  1.44  2.31  3.19
      MFN     1.99  1.87  2.11  2.14  2.10  2.09  3.33
      SARX    2.33  1.61  2.46  2.42  2.16  2.23  3.40
      DARX    2.66  2.36  2.77  2.37  2.26  2.46  3.41
      DPARX   2.58  2.53  2.56  2.45  2.37  2.52  3.42
4     ARX     1.84  1.61  2.39  0.88  1.12  2.22  2.92
      MFN     1.85  1.83  2.00  2.05  2.01  1.94  3.15
      SARX    2.12  1.41  2.30  2.22  2.02  2.09  3.30
      DARX    2.34  2.21  2.52  1.98  2.19  2.22  3.18
      DPARX   2.29  2.35  2.32  2.26  2.29  2.40  3.20
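The online evaluation protocol used to produce these accuracies (re-training after every newly arrived observation and scoring k-step-ahead forecasts) can be sketched as follows; the `fit` and `forecast` callables are hypothetical wrappers around any of the competing models, not functions from this work.

```python
def rolling_evaluation(y, X, fit, forecast, start=50, steps=(1, 2, 3, 4)):
    """Online evaluation: retrain at every time point, record k-step forecast errors.

    y:        observed weekly case counts (list or array).
    X:        aligned surrogate features, passed through to the model.
    fit:      callable (y_hist, X_hist) -> fitted model   (hypothetical wrapper).
    forecast: callable (model, k) -> k-step-ahead point forecast.
    """
    errors = {k: [] for k in steps}
    for t in range(start, len(y) - max(steps)):
        model = fit(y[:t], X[:t])                 # re-train with data up to week t
        for k in steps:
            errors[k].append(abs(forecast(model, k) - y[t + k - 1]))
    return errors
```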

3.1.3 Seasonal Analysis

In this chapter, we have not trained the models to predict the seasonal metrics. However, we can construct ILI prediction curves for each ‘step-ahead’, i.e., a 1-step ILI prediction curve, a 2-step ILI prediction curve, and so on. From these prediction curves we can then calculate the season characteristics and compare them against those calculated from the observed PAHO (or CDC) ILI counts. We compare the predicted and observed seasonal characteristics for the last ILI year in our set for each country. As our experimental results show [62], the proposed algorithms work well for a number of countries. In general, DPARX performs better in terms of the overall prediction characteristics. This is consistent with our results for near-term forecasts. For seasonal characteristics, Weather and GFT seem to be the most important sources for prediction. We also present the predicted and real curves for Mexico for the 2013 ILI season in Figure 3.3 based on 1-step-ahead predictions. Except for the GST and HealthMap data under some of the state-of-the-art methods, all the curves match up closely to the observed ILI curve.

3.2 Discussion

In this chapter, we presented a practical short-term ILI case count forecasting method using multiple digital data sources. One of the main contributions of the proposed model is that the underlying autoregressive model is allowed to change over time. In order to control the variation of the model, we built a model similarity graph to indicate the relationship between each pair of models at two different time points and embedded the prior knowledge as the structure of the graph. The experiments demonstrate that our proposed algorithm provides consistently better forecasting results than state-of-the-art time series models used for short-term ILI case count forecasting. We also observed that the dynamic model successfully captures the seasonal pattern of flu activity. Finally, while these techniques were applied to the relatively specialized field of ILI case count forecasting, the methods presented are generic enough that they may be adapted to other similar count prediction problems.

Figure 3.1: The distance matrix obtained from our learned DPARX model (bottom figure), associated with the ground truth ILI case count series (top figure) on the AR dataset. We can observe the strong seasonality automatically inferred in the matrix. Each element in the matrix is the Euclidean distance between a pair of the learned models at two corresponding time points after training. For the top figure, the x axis is the index of the weeks; the y axis is the number of ILI cases. For the bottom figure, both x and y axes are the index of the time points. Note that the starting time point (index 0) for the distance matrix is week 15 of the ILI case count series.

Figure 3.2: Model distance matrices for the US dataset. The three matrices are derived from the fully connected similarity graph, the 3-nearest neighbor similarity graph, and the seasonal 3-nearest neighbor similarity graph, from left to right correspondingly.

Figure 3.3: Comparison of seasonal characteristics for Mexico using different algorithms (Actual, HM-DARX, HM-DPARX, GST-DARX, GST-DPARX, Weather-DARX, Weather-DPARX, GFT-SARX, GFT-DARX, GFT-DPARX) for one-step-ahead prediction; x-axis: weeks, y-axis: ILI count. Blue vertical dashed lines indicate the actual start and end of the season. ILI season considered: 2013.

Table 3.2: Prediction accuracies for competing algorithms with different forecast steps over different countries using the weather data source.

Step  Method  AR    BO    CL    CO    CR    EC    GT    HN    MX    NI    PA    PE    PY    SV    US
1     ARX     2.94  2.51  3.10  2.90  2.21  2.81  2.83  2.96  2.25  2.18  2.78  2.51  2.84  2.83  3.51
      MFN     2.99  3.01  2.88  2.53  2.78  2.81  2.77  2.83  2.61  2.70  2.56  2.82  2.66  2.79  3.81
      DARX    3.09  2.84  3.17  2.84  2.57  2.94  2.83  2.89  2.91  2.77  2.72  2.67  2.79  2.72  3.71
      DPARX   2.98  2.84  3.07  3.01  2.70  2.97  2.87  2.93  2.84  2.86  2.82  2.78  2.86  2.77  3.72
2     ARX     2.56  2.05  2.63  2.71  1.61  2.56  2.63  2.76  1.15  1.36  2.56  2.05  2.62  2.64  3.21
      MFN     2.86  2.89  2.81  2.49  2.71  2.67  2.72  2.41  2.55  2.31  2.50  2.59  2.71  2.30  3.75
      DARX    2.98  2.69  3.00  2.69  2.63  2.79  2.72  2.81  2.66  2.28  2.55  2.49  2.68  2.66  3.60
      DPARX   2.67  2.73  2.86  2.83  2.66  2.79  2.78  2.78  2.62  2.49  2.71  2.63  2.64  2.68  3.61
3     ARX     2.25  1.65  2.21  2.50  1.06  2.30  2.39  2.59  0.60  0.94  2.42  1.72  2.39  2.46  2.92
      MFN     2.49  2.38  2.41  2.33  2.45  2.31  2.32  2.10  2.21  2.11  2.19  2.22  2.40  2.08  3.64
      DARX    2.68  2.32  2.68  2.57  2.52  2.72  2.50  2.65  2.47  2.00  2.52  2.32  2.54  2.53  3.41
      DPARX   2.33  2.44  2.63  2.70  2.58  2.66  2.59  2.61  2.36  2.31  2.75  2.44  2.51  2.55  3.42
4     ARX     1.98  1.37  1.73  2.31  0.72  2.07  2.22  2.41  0.39  0.83  2.21  1.46  2.21  2.30  2.56
      MFN     2.10  2.13  2.15  2.04  2.25  2.11  2.22  1.94  1.99  1.87  2.01  1.86  2.10  1.77  3.54
      DARX    2.42  2.12  2.39  2.49  2.34  2.52  2.42  2.51  2.17  1.74  2.38  2.27  2.30  2.42  3.18
      DPARX   2.10  2.23  2.32  2.64  2.38  2.52  2.55  2.45  2.06  2.15  2.72  2.38  2.27  2.53  3.20

Table 3.3: Prediction accuracies for competing algorithms with different forecast steps over different countries using the GST data source.

Step  Method  AR    BO    CL    CO    CR    EC    GT    HN    MX    NI    PA    PE    PY    SV
1     MFN     2.61  2.44  2.55  2.22  2.61  2.52  2.31  2.62  2.48  2.61  2.31  2.23  2.53  2.13
      DARX    2.99  2.65  3.09  2.74  2.41  2.86  2.72  2.83  2.82  2.84  2.59  2.56  2.75  2.63
      DPARX   3.07  2.74  3.15  2.85  2.72  2.80  2.51  2.80  2.96  2.77  2.59  2.66  2.82  2.61
2     MFN     2.50  2.33  2.31  2.10  2.44  2.29  2.11  2.43  2.37  2.39  2.20  2.01  2.27  2.00
      DARX    2.83  2.54  2.94  2.57  2.53  2.69  2.58  2.72  2.59  2.40  2.35  2.40  2.54  2.51
      DPARX   2.78  2.59  2.86  2.67  2.63  2.67  2.35  2.71  2.60  2.48  2.43  2.53  2.57  2.59
3     MFN     2.33  2.10  2.16  1.99  2.21  2.03  1.99  2.14  2.20  2.14  2.02  1.91  2.13  1.92
      DARX    2.51  2.07  2.69  2.45  2.36  2.47  2.41  2.54  2.34  2.06  2.48  2.10  2.49  2.44
      DPARX   2.46  2.41  2.53  2.56  2.48  2.51  2.26  2.58  2.38  2.30  2.41  2.34  2.49  2.51
4     MFN     1.99  2.00  2.01  1.82  1.97  1.88  1.92  1.93  1.81  1.77  1.79  1.70  1.82  1.71
      DARX    2.16  1.91  2.36  2.24  2.20  2.17  2.28  2.40  1.80  1.86  2.40  2.06  2.23  2.36
      DPARX   2.17  2.21  2.29  2.46  2.35  2.33  2.14  2.46  2.10  2.13  2.33  2.21  2.30  2.44

Table 3.4: Prediction accuracies for competing algorithms with different forecast steps over different countries using the HealthMap data source.

Step  Method  AR    BO    CL    CO    CR    EC    GT    HN    MX    NI    PA    PE    PY    SV    US
1     MFN     2.81  3.13  2.63  2.58  2.91  2.77  2.63  2.73  2.50  2.61  2.54  2.69  2.51  2.61  3.78
      DARX    3.00  2.69  3.11  2.79  2.44  2.89  2.75  2.91  2.85  2.86  2.60  2.65  2.75  2.64  3.71
      DPARX   3.07  2.74  3.15  2.84  2.69  2.83  2.58  2.82  2.95  2.79  2.59  2.70  2.83  2.62  3.72
2     MFN     2.71  2.91  2.30  2.21  2.77  2.49  2.40  2.38  2.44  2.36  2.15  2.33  2.22  2.33  3.64
      DARX    2.86  2.60  3.01  2.62  2.54  2.74  2.64  2.77  2.66  2.47  2.37  2.47  2.53  2.58  3.60
      DPARX   2.78  2.60  2.88  2.67  2.62  2.71  2.44  2.72  2.60  2.50  2.45  2.58  2.58  2.60  3.61
3     MFN     2.44  2.30  2.42  2.07  2.31  2.14  2.28  2.01  2.19  2.12  1.99  2.00  1.97  1.95  3.35
      DARX    2.58  2.18  2.78  2.49  2.35  2.63  2.51  2.62  2.48  2.15  2.49  2.33  2.48  2.51  3.41
      DPARX   2.46  2.42  2.55  2.56  2.47  2.58  2.36  2.59  2.38  2.31  2.45  2.37  2.49  2.50  3.42
4     MFN     1.93  1.99  2.20  1.88  2.00  1.95  2.15  1.95  1.89  1.85  1.72  1.78  1.91  1.81  3.13
      DARX    2.28  2.02  2.46  2.39  2.19  2.37  2.39  2.45  2.22  1.97  2.45  2.26  2.20  2.42  3.18
      DPARX   2.17  2.21  2.30  2.44  2.34  2.42  2.25  2.47  2.12  2.14  2.37  2.25  2.30  2.47  3.21

Part II

Long-term Forecasting using Surrogates


We discussed several facets of short-term forecasting, especially with respect to ILI, in Part I. Concomitant to short-term forecasting, which provides real-time insights about the current on-ground scenario, long-term characteristics of targets are often of prime interest. Considering the example of epidemic diseases, surveillance agencies are interested in identifying seasonal characteristics such as the following:

1. Start week: Within a particular ILI year (which may not be a calendar year; e.g., in the USA, the ILI year spans from Epi Week 40 to Epi Week 39 [11]), ‘start week’ is the week from which ILI is said to be in season. We define the start week for an ILI year to be the first week where the ILI count for 3 consecutive past weeks (including itself) is greater than a pre-defined threshold.

2. Peak week: Within a particular ILI year, the peak week is the week for which the ILI count is highest for that ILI year.

3. Peak Size: Peak Size is the ILI count observed on the peak week.

4. End week: Within a particular ILI year, the end week is the first week after the peak week such that the ILI counts for 3 consecutive past weeks (including itself) are lower than a pre-defined threshold. The end week signifies the end of the ILI season and is thus of interest to epidemiologists.

5. Season Size: Season size is used as a proxy for the size of the epidemic. It is calculated by summing up the total ILI count from the start week to the end week. (A sketch of how these characteristics can be computed from a weekly count series is given after this list.)
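A minimal sketch of these definitions applied to a weekly count series is shown below; the pandas-based representation and the threshold handling are assumptions made for illustration.

```python
import pandas as pd

def season_characteristics(ili: pd.Series, threshold: float):
    """Compute seasonal characteristics from a weekly ILI count series.

    ili:       weekly ILI counts for one ILI year, indexed by week.
    threshold: pre-defined count threshold used by the start/end week rules.
    """
    above = ili >= threshold
    below = ili < threshold
    # Start week: first week whose 3 trailing weeks (including itself) exceed the threshold.
    start = next((wk for wk in ili.index[2:] if above.loc[:wk].tail(3).all()), None)
    peak_week = ili.idxmax()
    peak_size = ili.max()
    # End week: first week after the peak whose 3 trailing weeks fall below the threshold.
    end = next((wk for wk in ili.loc[peak_week:].index[1:]
                if below.loc[:wk].tail(3).all()), None)
    season_size = ili.loc[start:end].sum() if start is not None and end is not None else None
    return {"start_week": start, "peak_week": peak_week, "peak_size": peak_size,
            "end_week": end, "season_size": season_size}
```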

In traditional epidemiology, several models such as SEIR and SIRS [3] have been proposed to model the temporal profile of infectious diseases. In modern computational epidemiology, more advanced methods have been used. One of the currently popular methods is to fit prediction models by matching observational data against a large library of simulated curves [5, 40, 58]. The curve simulations are generated by using different epidemiological parameters and assumptions. Sometimes network-based models are used to generate the curves [3]. Partially observed influenza counts for a particular year can then be matched to a library of curves to produce the best set of predictions [40]. Closely related to such curve matching methods are filtering-based methods that dynamically fit epidemic models onto observed data by letting the base epidemic parameters vary over time. Yang et al. [66] provide an excellent survey of filtering-based methods used for influenza forecasting and also present a comparative analysis of such methods. We present our efforts at disease forecasting using curve matching methods in Chapter 4 and subsequently present our data assimilation based models, which allow easier integration of surrogates, in Chapter 5.

Chapter 4

Curve-matching from library of curves

One of the simplest and most intuitive strategies for long-term forecasting is based on matching against a library of curves [3, 5]. Typically, a library of curves can be generated using various parameter choices of compartmental models such as SEIR. Curves can also be generated from agent based models that are informed through a combination of diverse sources such as census and road-network data. Curves from such a library can then be matched against specific epidemic surveillance data to predict the seasonal curves. Seasonal characteristics can then be identified from these detected curves using the definitions outlined earlier. Some of the considerations in this process can be identified as follows:

Figure 4.1: Filtering library of curves based on season size and season shape.

1. Appending short-term forecasts to surveillance data: Surveillance reports are typically delayed. As presented in Part I, we can use surrogates to generate robust short-term predictions. In general, these predictions are robust, i.e., stable with respect to surveillance updates. Short-term forecasts can also provide measures of uncertainty about current surveillance. We append these predictions to the last-available surveillance data so that the partial time series to match against the curves is longer and hence more accurate. This is especially useful during the initial part of the season, when only a few data points are available from the surveillance reports to match against the library of curves.

2. Filtering the library of curves: Typically, the library may contain a wide variety of curves corresponding to various kinds of diseases, generated using various epidemiological parameters. Many of these curves can be unsuitable for matching against the disease of interest (such as ILI). Moreover, admitting these curves for matching may lead to increased false detections. As such, we filter the curves using historical trends of the disease of interest according to the following factors:

• Filter curves by the average season size of the disease.
• Filter curves by the average peak-to-season size ratio. Effectively, this strategy filters according to the shape of the epidemic curve.

Figure 4.1 shows examples of curves that were filtered out from such a library while matching against ILI data for Latin America.
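A rough sketch of this filtering step is shown below; the tolerance band and the representation of the library as a list of arrays are assumptions made purely for illustration.

```python
import numpy as np

def filter_curve_library(curves, hist_season_size, hist_peak_ratio, tol=0.5):
    """Keep simulated curves whose size and shape resemble the historical disease.

    curves:           list of 1-D numpy arrays, each a simulated weekly epi-curve.
    hist_season_size: average total seasonal count from historical surveillance.
    hist_peak_ratio:  average peak-to-season-size ratio from historical surveillance.
    tol:              relative tolerance around the historical values (illustrative).
    """
    kept = []
    for curve in curves:
        size = curve.sum()
        ratio = curve.max() / size if size > 0 else 0.0
        size_ok = abs(size - hist_season_size) <= tol * hist_season_size
        shape_ok = abs(ratio - hist_peak_ratio) <= tol * hist_peak_ratio
        if size_ok and shape_ok:
            kept.append(curve)
    return kept
```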

Performance Highlights. We used the aforementioned curve-matching strategy to predict ILI seasonal characteristics for 15 Latin American countries. Figure 4.2 gives an example of such forecasts and Figure 4.3 outlines the results for several countries against the aforementioned metrics as reported by IARPA. As can be seen, the framework works well for a few metrics and for a few countries, such as Ecuador for total RSV counts. However, our performance was poor for several other metrics. Furthermore, we found curve matching models to be inconsistent with respect to the week of the season in which the forecasts were generated. Also, this method admits the use of surrogates only for increasing the length of the time series to match against, and fails to use them for determining more interesting facets such as the disease transmission rate.

Figure 4.2: Example of seasonal forecasts for ILI using curve-matching methods.

Figure 4.3: Performance measures for ILI seasonal characteristics using curve-matching.

Chapter 5

Data Assimilation methods for long-term forecasting

Chapter 4 outlined our current efforts at seasonal forecasting using curve-matching models. As identified in that chapter, such methods involve a sub-optimal use of surrogate information in determining seasonal characteristics. Motivated by the efforts of Shaman et al. [51], we developed data assimilation models where surrogates are used to force the disease parameters, and seasonal characteristics are found by optimizing over the most probable seasonal curves. We present our efforts in some detail in the following sections, first describing some of the relevant data assimilation models in Section 5.1 and then presenting our disease forecasting models using data assimilation methods in Sections 5.2 and 5.3.

5.1 Data Assimilation

Originally proposed in the 1960s [26], the Kalman filter (KF) has rapidly gained a reputation in a myriad of applications [17] featuring estimation and forecasting. Nowadays the original KF has evolved and given rise to an entire class of dynamic estimation/forecasting algorithms that recursively estimate and forecast: they optimize the estimates by data assimilation using noisy measurements (observations), and forecast using a presumed process model. We first consider linear process models. Such systems can be expressed as a pair of linear stochastic process and measurement equations as shown below:

$$
\begin{aligned}
x_{k+1} &= A x_k + B u_k + w_k, \qquad w_k \sim \mathcal{N}(0, Q) \\
z_k &= H x_k + v_k, \qquad v_k \sim \mathcal{N}(0, R)
\end{aligned}
\tag{5.1}
$$

where $x \in \mathbb{R}^n$ is the state vector, $z \in \mathbb{R}^m$ is the measurement vector, $A \in \mathbb{R}^{n \times n}$ is called the process matrix, $B$ is the matrix that relates the optional control input $u$ to the state, and $H \in \mathbb{R}^{m \times n}$ is the measurement matrix. The process noise $w_k$ and measurement noise $v_k$ are assumed to be mutually independent random variables with zero mean and normally distributed with noise covariances $Q$ and $R$ respectively. The classic Kalman filter was developed to estimate the hidden states as well as forecast the observed targets of such linear processes, and the relevant equations can be outlined in the following two groups:

Estimation:
$$
\begin{aligned}
K_k &= P_k^f H^T \left( H P_k^f H^T + R \right)^{-1} \\
x_k^a &= x_k^f + K_k \left( z_k - H x_k^f \right) \\
P_k^a &= \left( I - K_k H \right) P_k^f
\end{aligned}
\tag{5.2}
$$

Forecast:
$$
\begin{aligned}
x_{k+1}^f &= A x_k^a + B u_k \\
P_{k+1}^f &= A P_k^a A^T + Q
\end{aligned}
\tag{5.3}
$$

where $x_k^a$ is the optimal state estimate given the measurement vector $z_k \in \mathbb{R}^m$, $P_k^a$ is the analysis state error covariance, and $R$ is the measurement noise covariance. Equation 5.2 assimilates measurements into the estimate via the Kalman gain matrix $K$, which weighs the impact from the measurement versus that from the prediction. A larger $R$ or smaller $Q$ increases the weight of the prediction, while a smaller $R$ or larger $Q$ increases the weight of the measurement. $x_{k+1}^f \in \mathbb{R}^n$ is the forecast state, $P_{k+1}^f \in \mathbb{R}^{n \times n}$ is the forecast state error covariance, and $Q$ is the process noise covariance. Equation 5.3 forecasts $x$ and $P$ for time step $k+1$.

In reality, the process and measurement models are often nonlinear. A nonlinear system can be modeled by the nonlinear stochastic equations:

$$
\begin{aligned}
x_{k+1} &= a(x_k, u_k) + w_k, \qquad w_k \sim \mathcal{N}(0, Q) \\
z_k &= h(x_k) + v_k, \qquad v_k \sim \mathcal{N}(0, R)
\end{aligned}
\tag{5.4}
$$

One of the more popular solutions to such non-linear systems is the extended Kalman filter (EKF), which is essentially a Kalman filter modified to linearize the estimation about the current mean and covariance [63]. The EKF equations are similar to the KF equations, except that the EKF needs to compute Jacobian matrices at each time step:

$$
A = \left.\frac{\partial a(x)}{\partial x}\right|_{x}, \qquad H = \left.\frac{\partial h(x)}{\partial x}\right|_{x}.
\tag{5.5}
$$
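For concreteness, a minimal NumPy sketch of the linear estimation and forecast steps in Equations 5.2–5.3 is given below; this is a bare-bones illustration under the stated linear-Gaussian assumptions, not the implementation used in this work.

```python
import numpy as np

def kf_step(x_f, P_f, z, A, B, u, H, Q, R):
    """One Kalman filter cycle: assimilate measurement z, then forecast one step.

    x_f, P_f: prior (forecast) state mean and covariance at time k.
    z:        measurement at time k; A, B, H, Q, R are as in Eq. 5.1.
    """
    # Estimation (Eq. 5.2)
    K = P_f @ H.T @ np.linalg.inv(H @ P_f @ H.T + R)    # Kalman gain
    x_a = x_f + K @ (z - H @ x_f)                       # analysis state
    P_a = (np.eye(len(x_f)) - K @ H) @ P_f              # analysis covariance
    # Forecast (Eq. 5.3)
    x_next = A @ x_a + B @ u
    P_next = A @ P_a @ A.T + Q
    return x_a, P_a, x_next, P_next
```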

The EKF came to prominence in aerospace and robotics applications where the state space is small; however, in more complex systems with high-dimensional state spaces such as those in weather and disease prediction, it falls short due to the intractable computational burden associated with the Jacobian matrices, as well as the need to maintain and evolve a separate covariance matrix at each time step. The Ensemble Kalman filter (EnKF) [21] was thus developed to alleviate this computational complexity. It is related to the Particle Filter (PF) [20] in the sense that each ensemble member can be considered a particle aimed at estimating the relevant probability distributions using Monte Carlo procedures. However, contrary to the PF, the EnKF assumes Gaussian distributed noise characteristics and is thus computationally more efficient than the PF, albeit with stricter assumptions about the underlying process. The essential steps of the EnKF are: 1) maintaining an ensemble of state estimates instead of a single estimate, 2) simply advancing each member of the ensemble, and 3) calculating the mean and error covariance matrix directly from this ensemble. Assuming that we have an ensemble of q state estimates with random sample errors, the EnKF steps can be expressed via the following equations:

$$
\begin{aligned}
\hat{K}_k &= E^f_{x_k} (E^f_{z_k})^T \left( E^f_{z_k} (E^f_{z_k})^T + R \right)^{-1} \\
x_k^{a_i} &= x_k^{f_i} + \hat{K}_k \left( z_k + v_k^i - z_k^{f_i} \right), \qquad i = 1, 2, \ldots, q \\
\bar{x}_k^a &= \frac{1}{q} \sum_{i=1}^{q} x_k^{a_i}
\end{aligned}
\tag{5.6}
$$

$$
\begin{aligned}
x_{k+1}^{f_i} &= a(x_k^{a_i}, u_k) + w_k^i, \qquad i = 1, 2, \ldots, q \\
z_{k+1}^{f_i} &= h(x_{k+1}^{f_i}), \qquad i = 1, 2, \ldots, q \\
E^f_{x_{k+1}} &= \frac{1}{\sqrt{q-1}} \left[ x_{k+1}^{f_1} - \bar{x}_{k+1}^f \;\; \ldots \;\; x_{k+1}^{f_q} - \bar{x}_{k+1}^f \right] \\
E^f_{z_{k+1}} &= \frac{1}{\sqrt{q-1}} \left[ z_{k+1}^{f_1} - \bar{z}_{k+1}^f \;\; \ldots \;\; z_{k+1}^{f_q} - \bar{z}_{k+1}^f \right]
\end{aligned}
\tag{5.7}
$$

where $\bar{x}_{k+1}^f = \frac{1}{q}\sum_{i=1}^{q} x_{k+1}^{f_i}$ and $\bar{z}_{k+1}^f = \frac{1}{q}\sum_{i=1}^{q} z_{k+1}^{f_i}$ are the means of the state forecast ensemble and the measurement forecast ensemble, and $E^f_{x_{k+1}}$ and $E^f_{z_{k+1}}$ are the corresponding ensemble perturbation matrices. $w_k^i$ and $v_k^i$ are generated random noise variables that follow the normal distributions $\mathcal{N}(0, Q)$ and $\mathcal{N}(0, R)$ respectively. The EnKF offers great ease of implementation and handling of non-linearity due to the absence of Jacobian calculations; on the other hand, it is critical to choose an ensemble size that is large enough to be statistically representative. The details of the EnKF can be found in [21, 37]. Some of the more popular variations of the EnKF, viz. the Ensemble Adjustment Kalman Filter (EAKF) [2] and the Ensemble Transform Kalman Filter (ETKF) [61], do not add Gaussian noise to form measurement ensembles and instead deterministically adjust each ensemble member so that the posterior variance is identical to that predicted by Bayes' theorem under Gaussian distribution assumptions, while keeping the ensemble mean unchanged. With respect to ETKF, EAKF shows better numerical stability but requires extra SVD operations and is thus computationally more expensive. In EAKF, the estimated state perturbation matrix can be written in the pre-multiplier form:

$$
E_x^a = A E_x^f
\tag{5.8}
$$

Compared to ETKF, in which $E_x^a$ can only be expressed in post-multiplier form, EAKF does not suffer from two issues that may appear in ETKF: 1) producing analysis ensembles with inconsistent statistics, such as a biased mean and/or small standard deviations of the coordinates; and 2) each assimilation of an observation producing a collapse in the number of distinct values of the observed coordinates in the ensemble. More discussion of EAKF and ETKF can be found in [2, 61, 37].
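A compact NumPy sketch of the EnKF analysis step in Equation 5.6 is given below, assuming the forecast ensemble is stored column-wise; this is a simplified illustration rather than the exact implementation used in this work.

```python
import numpy as np

def enkf_analysis(X_f, h, z, R, rng=np.random.default_rng()):
    """EnKF analysis step (Eq. 5.6) with perturbed observations.

    X_f: (n, q) forecast ensemble, one state vector per column.
    h:   observation operator mapping a state vector to measurement space.
    z:   (m,) observed measurement vector; R: (m, m) measurement noise covariance.
    """
    q = X_f.shape[1]
    Z_f = np.column_stack([h(X_f[:, i]) for i in range(q)])        # forecast measurements
    Ex = (X_f - X_f.mean(axis=1, keepdims=True)) / np.sqrt(q - 1)  # state perturbations
    Ez = (Z_f - Z_f.mean(axis=1, keepdims=True)) / np.sqrt(q - 1)  # measurement perturbations
    K = Ex @ Ez.T @ np.linalg.inv(Ez @ Ez.T + R)                   # ensemble Kalman gain
    V = rng.multivariate_normal(np.zeros(len(z)), R, size=q).T     # perturbed observations
    X_a = X_f + K @ (z[:, None] + V - Z_f)                         # analysis ensemble
    return X_a
```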

5.2 Data Assimilation Models in disease forecasting

We described some of the more practical and popular data assimilation models in Section 5.1. In this section, we present our disease-specific data assimilation model, which we used to generate seasonal forecasts for ILI and subsequently extended towards CHIKV forecasts. To build such data assimilation models, we need to specify disease spread processes whose parameters are learned through the data assimilation algorithm of choice (see [51]). For our purpose, we chose a dynamic data-driven SIRS model. This dynamic model is inspired by Shaman et al. [51] and aims to use a Bayesian filter to continuously assimilate observed data sources into the model characteristics and generate an ensemble of models. A key distinguishing feature of our work is the diversity of syndromic surveillance sources used. The spread of the ensemble predictions also reveals the underlying probability distribution of various seasonal characteristics such as start week and peak week. The model used for ILI can be formally described as follows. Let us denote the observed ILI percentage for the region of interest (including national level data) by yt. We choose as a candidate the well defined SIRS model, where St and It denote the number of people in the ‘Susceptible’ and ‘Infectious’ compartments at time t. Let us also denote the new infections moving into the I bucket at time t by newIt, which can be directly computed from It. Let us denote the population size by N, the mean infectious period by D, the mean resistance period by L, and the basic reproductive rate at time t by R0,t. Then the basic SIRS equations at time t can be given as

$$
\begin{aligned}
\frac{dS_t}{dt} &= \frac{N - S_t - I_t}{L} - \frac{\beta(t)\, I_t S_t}{N} - \alpha \\
\frac{dI_t}{dt} &= \frac{\beta(t)\, I_t S_t}{N} - \frac{I_t}{D} + \alpha
\end{aligned}
\tag{5.9}
$$

where $\beta(t) = R_{0,t}/D$.

Let us denote by xt a hidden layer of variables that connects the SIRS model with the observed ILI percentages. The hidden variable set can be thought of as an n-tuple xt:

$$
x_t = (S_t, I_t, R_0, D, L, f, r)
$$

The equations governing the Bayesian filter can be given as:

$$
\begin{aligned}
y_t &= f \cdot \text{newI}_t + \mathcal{N}(0, r) \\
x_t &= g(x_t \mid x_{t-1})
\end{aligned}
\tag{5.10}
$$

where $g$ denotes the dynamic model transition from time $t-1$ to $t$. $g$ can be a general purpose transition function. For our purpose, we perturb S and I via the SIRS equations and the remaining state parameters via a random walk model within specified bounds. We studied a number of data assimilation models, as presented in Section 5.1, and selected EnKF filters to allow for greater flexibility in modeling and with a stated goal of comparing different sources towards their relative importance in disease forecasting. We used an EnKF with 10,000 ensemble members to estimate the disease parameters. The distribution of the ensemble provides the posterior distribution over the SIRS parameters and can be used to directly infer the parameters.
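A simplified sketch of the transition function g for a single ensemble member is shown below. The weekly Euler discretization of Equation 5.9, the random-walk step sizes, and the parameter bounds are illustrative assumptions (the small seeding term α in Eq. 5.9 is omitted for brevity).

```python
import numpy as np

def sirs_transition(state, N, rng, dt=1.0):
    """Advance one ensemble member (S, I, R0, D, L, f, r) by one week.

    S and I are propagated through a discretized form of the SIRS equations (Eq. 5.9);
    the remaining parameters follow a bounded random walk (step sizes/bounds assumed).
    """
    S, I, R0, D, L, f, r = state
    beta = R0 / D
    dS = (N - S - I) / L - beta * I * S / N
    dI = beta * I * S / N - I / D
    S, I = max(S + dt * dS, 0.0), max(I + dt * dI, 0.0)
    # Random-walk perturbation of the latent parameters, clipped to plausible (assumed) bounds.
    R0 = np.clip(R0 + rng.normal(0, 0.05), 0.5, 4.0)
    D  = np.clip(D  + rng.normal(0, 0.10), 2.0, 7.0)
    L  = np.clip(L  + rng.normal(0, 1.00), 100.0, 500.0)
    f  = np.clip(f  + rng.normal(0, 0.01), 0.0, 1.0)
    r  = max(r + rng.normal(0, 0.01), 1e-3)
    return np.array([S, I, R0, D, L, f, r])
```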

5.3 Data Assimilation Using surrogate Sources

The method described above can be thought of as a general purpose algorithm where we can introduce information about different sources by modifying Equation 5.10. Earlier research [51] has shown that surrogate sources such as absolute humidity can be used to locally modify disease parameters and generate more robust forecasts. However, such methods have mainly focused on allowing a single surrogate and/or using custom state transition equations which are not easily generalizable to other sources. We focused on extending such methods to more generic sources and on studying the relative importance of such sources towards long-term forecasting. We have used a number of surrogate sources such as Weather, Google Flu Trends, Google Search Trends, HealthMap, and Twitter chatter. For the sake of simplicity, we explain our model using Google Flu Trends (GFT) as the illustrative source. Additional data sources can be incorporated following similar equations. As discussed in Part I, surrogate sources were found to encode disease transmission information while also exhibiting significant noise. However, from our experiments we found that although absolute surrogate counts are noisy, their rolling covariance can be used to inform a sudden increase/decrease of disease incidence in the population. Thus surrogate information was used to modify the transition equation for other latent variables such as R0 as:

$$
R_{0,t} = R_{0,t-1} + \mathcal{N}\left(0, \operatorname{cov}(GFT_{t-1}, GFT_t)\right)
\tag{5.11}
$$

Following Chakraborty et al. [12], we intend to analyze a myriad of data sources to train a more precise model with lower uncertainty bounds.
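The surrogate forcing in Equation 5.11 can be sketched as follows; the window length used for the rolling covariance is an assumed value introduced only for illustration.

```python
import numpy as np

def perturb_r0_with_surrogate(R0_prev, gft_history, rng, window=4):
    """Perturb R0 using the rolling covariance of the surrogate series (Eq. 5.11).

    gft_history: recent weekly GFT values, most recent last.
    window:      number of trailing weeks used for the covariance (assumed).
    """
    recent = np.asarray(gft_history[-window:])
    lagged = np.asarray(gft_history[-window - 1:-1])
    scale = abs(np.cov(lagged, recent)[0, 1])         # cov(GFT_{t-1}, GFT_t) over the window
    return R0_prev + rng.normal(0.0, np.sqrt(scale))  # variance set by the rolling covariance
```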

5.4 Experimental Results and Performance Summary

We used our data assimilation model to generate forecasts for ILI and CHIKV for various regions of the world. While ILI is a human-to-human transmitted infectious disease, CHIKV is a vector-driven disease, and hence forecasting models for CHIKV need to be cognizant of this. For both diseases, weather attributes such as temperature and humidity can be argued to be important transmission modulators. We applied data assimilation methods as outlined in Section 5.3 using weather as a surrogate source. These forecasts were generated continuously for CHIKV in the Americas and for ILI in the US. As can be seen, data assimilation methods were able to more accurately forecast several seasonal characteristics for ILI compared to CHIKV. CHIKV, being a newly introduced disease in the Americas, was characterized by more noise, and our results also indicate the possible importance of modeling the vectors (mosquitoes) in addition to surrogate sources, which may improve our forecasting performance.

Figure 5.1: Performance summary for (a) ILI and (b) CHIKV seasonal forecasts using Weather as a surrogate source under the data assimilation framework.

Table 5.1: Forecasting performance of seasonal characteristics using data assimilation methods

Metric      BO      CL      MX      PE
start date  8.000   3.000   11.000  5.000
end date    16.000  14.000  35.000  14.000
peak date   17.000  2.000   28.000  4.000
peak val    2.005   3.413   0.145   2.908
season val  2.889   2.958   0.117   2.255

Similar to our efforts in short-term forecasting, we compared the importance of each individual surrogate source towards long-term forecasts. We applied our data assimilation model as outlined in Section 5.3 to ILI incidence for the 2014-2015 season over four Latin American countries, viz. Bolivia, Chile, Mexico, and Peru. We chose these countries because Google Flu Trends was available for them and because they exhibit different modes of seasonality within Latin America. We generated seasonal forecasts using data available at weeks 4 through 8 of the flu season for each of these countries. Table 5.1 summarizes the performance of our forecasts; the complete performance summary for these forecasts can be seen in Appendix A. Figure 5.2 and Figure 5.3 plot the distributions of forecasting accuracy for dates (deviation in days) and values (quality score), respectively. As can be seen, the HealthMap source performs best for both categories, indicating that the news media captures long-term signals about the season. The combination of all sources performs with accuracy similar to HealthMap, indicating that the competing sources could potentially be used, especially to improve accuracies against local variations.

We further analyze forecasting performance by examining the change in forecasting accuracy as a function of the number of season weeks used to generate the forecasts in Figures 5.4, 5.5, 5.6, 5.7 and 5.8. As can be seen, a combination of all sources shows the most consistent performance over the season weeks compared to any single source. Furthermore, forecasting accuracy on value metrics (such as peak value and season value) benefits more from observing additional seasonal weeks than accuracy on date metrics. Our results indicate that the shape of the disease curve can be forecasted with better accuracy than its actual size when only a few data points are observable for the season. Furthermore, the temporal accuracy plots indicate that surrogate sources such as HealthMap and GST contribute more heavily in the initial part of the disease season compared to the later part.

[Panels: start_date, end_date, peak_date. y-axis: Score. x-axis: source (Weather, gft, gst, hmap, merged, twitter).]

Figure 5.2: Comparison of forecasting accuracy for Date metrics using surrogates

[Panels: peak_val, season_val. y-axis: Score. x-axis: source (Weather, gft, gst, hmap, merged, twitter).]

Figure 5.3: Comparison of forecasting accuracy for Value metrics using surrogates

[Panels: Country BO, CL, MX, PE. x-axis: curr_week (4–9). Series: twitter, hmap, gst, gft, Weather, merged.]

Figure 5.4: Comparison of forecasting accuracy for ‘Start Date’ using different surrogate sources


Figure 5.5: Comparison of forecasting accuracy for ‘End Date’ using different surrogate sources


Figure 5.6: Comparison of forecasting accuracy for ‘Peak Date’ using different surrogate sources


Figure 5.7: Comparison of forecasting accuracy for ‘Peak Value’ using different surrogate sources


Figure 5.8: Comparison of forecasting accuracy for ‘Season Value’ using different surrogate sources

5.5 Discussion

We have presented our work on long-term forecasts using both data assimilation methods and a curve matching process. Our results indicate that data assimilation methods are in general more flexible and robust for long-term forecasts. Surrogate sources such as HealthMap are important factors for such forecasts, especially during the initial part of the season. Our future research will focus on systematically including other infectious diseases within the framework and on sparse selection of surrogates for more robust forecasting.

Part III

Detecting and Adapting to Concept Drift


Part I and Part II outlined our efforts at short-term and long-term forecasting using surrogates. However, surrogates are typically noisy, and their relationships to targets may be dynamic in nature. The changes in surrogate-target relationships can be significant and, if undetected, may subsequently render any model developed on these surrogates ineffective. This motivates the third problem of this thesis, where we first try to identify such major changes under the concept of ‘changepoints’. For this, we develop a hierarchical changepoint detection framework which can inform the changepoints in targets using information from the surrogate layers in Chapter 6. We also propose the use of such changepoints towards adaptive target forecasting in Chapter 7.

Chapter 6

Hierarchical Quickest Change Detection via Surrogates

With the increasing availability of digital data sources, there is a concomitant interest in using such sources to understand and detect events of interest, reliably and rapidly. For instance, protest uprisings in unstable countries can be better analyzed by considering a variety of sources such as economic indicators (e.g. inflation, food prices) and social media indicators (e.g. Twitter and news activity). Concurrently, detecting the onset of such events with minimal delay is of critical importance. For instance, detecting a disease outbreak [45] in real time can help in triggering preventive measures to control the outbreak. Similarly, early alerts about possible protest uprisings can help in designing traffic diversions and enhanced security to ensure peaceful protests. Motivated by similar real-life scenarios where significant events can be argued to be observ- able in social sphere, we propose Hierarchical Quickest Change Detection (HQCD), for online change detection across multiple sources, viz. target and surrogates. Typically, targets are sources of imminent interest (such as disease outbreaks or civil unrest); whereas surrogates (such as counts of the word ‘protesta’ in Twitter) by themselves are not of significant inter- est. Thus, HQCD is aimed towards continuously utilizing both categories, but more focused on early (or quickest) detection of significant changes across the target sources. Traditional event (or change) detection approaches are not suitable for such problems. These are either a) offline approaches [43, 60, 52, 8] using the entire data retrospectively - thus not applicable to real-time scenarios, or b) online detection approaches [53, 54, 30, 31, 1, 35] with primary focus on the target source of interest and do not utilize other correlated sources. Table 6.1 shows a comparison of HQCD and several state-of-the-art methods in terms of the desirable attributes. The main contributions of the work presented in this chapter are: HQCD formalizes a hierarchical structure which in addition to the observed set of target •


Table 6.1: Comparison of state-of-the-art methods vs. Hierarchical Quickest Change Detection. Methods compared: Sequential GLRT [53, 54], Window-Limited GLRT [30, 31], Bayesian Online CPD [1], Relative Density-ratio Estimation (RuLSIF) [35], Hierarchical Bayesian Analysis of Change Point Problems [8], and HQCD (this paper).

Desirable property                            Methods satisfying it
Online                                        Sequential GLRT, Window-Limited GLRT, Bayesian Online CPD, RuLSIF, HQCD
Hierarchical                                  Hierarchical Bayesian Analysis [8], HQCD
Bounded False Alarm Rate / Detection delay    Sequential GLRT, Window-Limited GLRT, Bayesian Online CPD, HQCD
Handles Non-IID                               HQCD

6.1 HQCD–Hierarchical Quickest Change Detection

We first provide a brief overview of the classical QCD problem and then present the HQCD framework.

6.1.1 Quickest Change Detection (QCD)

Let us consider a data source $S$ changing over time and following different stochastic processes before and after an unknown time $\Gamma$ (the changepoint). The task of QCD is to produce an estimate $\hat{\Gamma} = \gamma$ in an online setting (i.e., at time $t$, only $S_1, \ldots, S_t$ are available). Figure 6.1 illustrates the two fundamental performance metrics related to this problem. In the figure, $\Gamma = t_4$ is the actual time point when the change happened. An early estimate such as $\gamma_1 = t_1$ in the figure leads to a false alarm, whereas another estimate, such as $\gamma_2 = t_6$, leads to an 'additive delay' of $\gamma_2 - \Gamma = t_6 - t_4$. The goal of QCD is to design an online detection strategy which minimizes the expected additive detection delay (EADD) while not exceeding a maximum pre-specified probability of false alarm (PFA). QCD has been studied in various contexts. Some of the foremost methods have considered i.i.d. distributions with known (or unknown) parameters before and after unknown changepoints [59]. Some of the more popular methods have used CUSUM (cumulative sum of likelihood) based tests, while more general approaches are adopted in GLRT (generalized likelihood ratio test) based methods [19].
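This classical single-source setting can be illustrated with a minimal CUSUM-style sketch in Python; the Gaussian pre-/post-change distributions, the threshold, and the simulated series below are illustrative assumptions and are not the models used later in this chapter.

```python
import numpy as np
from scipy.stats import norm

def cusum_detect(x, pre_mean, post_mean, sigma, threshold):
    """Declare a change at the first time the cumulative log-likelihood ratio
    statistic exceeds `threshold`; return None if no change is declared."""
    stat = 0.0
    for t, xt in enumerate(x):
        llr = norm.logpdf(xt, post_mean, sigma) - norm.logpdf(xt, pre_mean, sigma)
        stat = max(0.0, stat + llr)   # CUSUM recursion
        if stat > threshold:
            return t                  # estimated changepoint gamma
    return None

rng = np.random.default_rng(0)
series = np.concatenate([rng.normal(0, 1, 40), rng.normal(2, 1, 20)])  # true change at t = 40
print(cusum_detect(series, pre_mean=0.0, post_mean=2.0, sigma=1.0, threshold=5.0))
```

Raising the threshold trades a lower false alarm probability for a larger detection delay, which is exactly the tension the figure depicts.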


Figure 6.1: Illustration of Quickest Change Detection (QCD): the blue line represents the actual changepoint at time $\Gamma = t_4$; (a) declaring a change at $\gamma_1$ leads to a false alarm, whereas (b) declaring the change at $\gamma_2$ leads to detection delay. QCD can strike a tradeoff between false alarm and detection delay.

6.1.2 Changepoint detection in Hierarchical Data

We next present our approach to generalize QCD to a hierarchical setting. We first describe a generic hierarchical model and then propose the QCD statistics for such models in Section 6.1.2. For computational feasibility, we then present a bounded approximation of the same and our multilevel changepoint algorithm in Section 6.1.2.

Generic Hierarchical Model

Let us consider $\bar{S}^{(T)}$, a set of $I$ correlated temporal sequences $\{S_1^{(T)}, S_2^{(T)}, \ldots, S_I^{(T)}\}$, where $S_i^{(T)} = [s_i(1), s_i(2), \ldots, s_i(T)]$ represents the $i$-th target data sequence for $i = 1, \ldots, I$, collected up until some time $T$.


Figure 6.2: Generative process for HQCD. As an example, consider civil unrest protests. In the framework, different protest types (such as Education- and Housing-related protests) form the targets, denoted by $S_i$'s. The total number of protests is denoted by the top-most variable $E$. Finally, the set of surrogates, such as counts of Twitter keywords, stock price data, weather data, network usage data, etc., are denoted by $K_j$'s.

The cumulative sum of the target sources $S_i$ at time $t$ is given by $E(t) = \sum_{i=1}^{I} s_i(t)$. Concurrent to the target sources, we also observe a set of $J$ surrogate sources, $\bar{K}^{(T)} = \{K_1^{(T)}, K_2^{(T)}, \ldots, K_J^{(T)}\}$, where $K_j^{(T)} = [k_j(1), k_j(2), \ldots, k_j(T)]$ for $j = 1, \ldots, J$, which may have either a causal or an effectual relationship with the target source set $\bar{S}^{(T)}$ (see Figure 6.2). We assume that targets and surrogates follow a stochastic Markov process as follows:

$$P(\bar{S}^{(T)}, \bar{K}^{(T)}) = P(S_1^{(T)}, \ldots, S_I^{(T)}, K_1^{(T)}, \ldots, K_J^{(T)}) = \prod_{t=1}^{T} \left\{ \prod_{j=1}^{J} P_t^{\phi_j^K}\big(K_j(t)\big) \times \prod_{i=1}^{I} P_t^{\phi_i^S}\big(S_i(t) \mid \bar{S}^{(t-1)}, \bar{K}^{(t-1)}\big) \right\}.$$

The binary variables $\phi_j^K, \phi_i^S \in \{0, 1\}$ capture the notion of significant changes in events through changes in the distribution of the generative process as follows: if the surrogate source $K_j$ undergoes a change in distribution at some time $t$, then $\phi_j^K$ changes from 0 to 1. In other words, $P_t^0(K_j)$ (respectively $P_t^1(K_j)$) denotes the pre-change (post-change) distribution of the $j$-th surrogate source. Similarly, if the target source $S_i$ undergoes a change in distribution at some time $t$, then $\phi_i^S$ changes from 0 to 1. In other words, $P_t^0(S_i \mid \cdot)$ (respectively $P_t^1(S_i \mid \cdot)$) denotes the pre-change (post-change) conditional distribution of the $i$-th target data source. We denote $\Gamma_{K_j}$ (respectively $\Gamma_{S_i}$) as the random variable denoting the time at which $\phi_j^K$ (respectively $\phi_i^S$) changes from 0 to 1. We write $\bar{\Gamma}_{\bar{K}} = (\Gamma_{K_1}, \ldots, \Gamma_{K_J})$ and $\bar{\Gamma}_{\bar{S}} = (\Gamma_{S_1}, \ldots, \Gamma_{S_I})$ as the collective sets of changepoints in the surrogate and target sources, respectively. Finally, we denote by $\Gamma_E$ the changepoint random variable for the top layer $E$, which represents the sum total of all target sources.

From QCD to HQCD

We extend the concepts of QCD presented in Section 6.1.1 to the multilevel setting by formalizing the problem as the earliest detection of the set of all $(J + I + 1)$ changepoints, i.e., $\bar{\Gamma} = \{\bar{\Gamma}_{\bar{K}}, \bar{\Gamma}_{\bar{S}}, \Gamma_E\}$, having observed the target and surrogate sources $\bar{S}^{(T)}, \bar{K}^{(T)}$. Let $\bar{\gamma} = \{\bar{\gamma}_{\bar{K}}, \bar{\gamma}_{\bar{S}}, \gamma_E\}$ be the $(J + I + 1)$-dimensional vector of decision variables for the changepoints. To measure detection performance, we define the following two novel performance criteria:

Multi-Level Probability-of-False-Alarm (ML-PFA):
$$\text{ML-PFA}(\bar{\gamma}) = P\left(\bar{\gamma} \preceq \bar{\Gamma}\right), \qquad (6.1)$$

where for any two $N$-length vectors $a \preceq b$, the notation implies $a_i \leq b_i$ for $i = 1, \ldots, N$. For instance, consider the example of $I = 1$ target and $J = 1$ surrogate. Then $\bar{\Gamma} = (\Gamma_{K_1}, \Gamma_{S_1})$ and $\gamma = (\gamma_{K_1}, \gamma_{S_1})$, and the probability of multi-level false alarm is given by $\text{ML-PFA}(\gamma) = P(\gamma_{K_1} \leq \Gamma_{K_1}, \gamma_{S_1} \leq \Gamma_{S_1})$. This definition of ML-PFA declares a false alarm only if all the $(J + I + 1)$ change decision variables are smaller than the true changepoints.

Expected Additive Detection Delay (EADD):

$$\text{EADD}(\gamma) = \mathbb{E}\big(|\gamma - \bar{\Gamma}|_1\big) = \underbrace{\sum_{j=1}^{J} \mathbb{E}\big(|\gamma_{K_j} - \Gamma_{K_j}|\big)}_{\text{Surrogate layer delay}} + \underbrace{\sum_{i=1}^{I} \mathbb{E}\big(|\gamma_{S_i} - \Gamma_{S_i}|\big)}_{\text{Target layer delay}} + \underbrace{\mathbb{E}\big(|\gamma_E - \Gamma_E|\big)}_{\text{Top layer delay}} \qquad (6.2)$$

Given the observations, i.e., all target and surrogate sources $(\bar{S}^{(T)}, \bar{K}^{(T)})$ till time $T$, governed by unknown changepoints $\bar{\Gamma}$, we aim to make an optimal decision $\gamma$ about these changepoints under the following criterion:

$$\gamma^*(\alpha) = \arg\min_{\gamma} \text{EADD}(\gamma) \quad \text{s.t.} \quad \text{ML-PFA}(\gamma) \leq \alpha. \qquad (6.3)$$

In other words, $\gamma^*(\alpha)$ is the optimal change decision vector which minimizes the EADD while guaranteeing that the ML-PFA is no more than a tolerable threshold $\alpha$. We note that the above optimal test is challenging to implement for real-world data sets due to the following issues: (a) it requires the knowledge of pre- and post-change distributions (for all sources) and the distribution of the changepoint random vector $\bar{\Gamma}$, (b) unlike single-source QCD, finding the optimal $\gamma^*(\alpha)$ requires a multi-dimensional search over multiple sources, making it computationally expensive, and (c) it does not discriminate between false alarms across different sources. For instance, declaring a false alarm at a target source (such as premature declaration of the onset of protests or disease outbreaks) must be penalized more in comparison to declaring a false alarm at a surrogate source (such as incorrectly declaring a rise in Twitter activity).
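To make the two criteria concrete, the following minimal sketch estimates EADD (equation 6.2) and ML-PFA (equation 6.1) by Monte Carlo from arrays of true changepoints and change decisions; the array shapes and the simulated decisions are hypothetical.

```python
import numpy as np

def eadd(gamma, true_cp):
    """Empirical additive detection delay (eq. 6.2), averaged over Monte Carlo runs.
    gamma, true_cp: arrays of shape (runs, J + I + 1)."""
    return np.mean(np.abs(gamma - true_cp).sum(axis=1))

def ml_pfa(gamma, true_cp):
    """Multi-level PFA (eq. 6.1): a false alarm occurs only when *every*
    decision variable precedes (or equals) its true changepoint."""
    return np.mean(np.all(gamma <= true_cp, axis=1))

# Hypothetical example: 1000 runs, I = 2 targets, J = 3 surrogates, plus E.
rng = np.random.default_rng(1)
truth = rng.integers(20, 40, size=(1000, 6))
decisions = truth + rng.integers(-5, 10, size=truth.shape)  # some early, some late
print(eadd(decisions, truth), ml_pfa(decisions, truth))
```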

Bounded approximation of HQCD

We can circumvent problem (b) of the original definition of ML-PFA as given in equation 6.1 by upper bounding it in Theorem 6.1.

Theorem 6.1 (Modified-PFA). Let $\bar{\gamma} = \{\bar{\gamma}_S, \bar{\gamma}_K, \gamma_E\}$ be a set of estimates of the true changepoints for targets, surrogates, and sum-of-targets, respectively. Then, under the condition of greater importance to accurate target layer detections, ML-PFA (see 6.1) is upper-bounded by Modified-PFA, where:

$$\text{Modified-PFA}(\gamma) \triangleq I \times \max_i P(\gamma_{S_i} \leq \Gamma_{S_i}) + \min_j P(\gamma_{K_j} \leq \Gamma_{K_j}) + P(\gamma_E \leq \Gamma_E) \qquad (6.4)$$

Proof. We can prove the upper bound of ML-PFA with the following reductions:

$$\begin{aligned}
\text{ML-PFA}(\gamma) &= P(\gamma \preceq \Gamma) = P(\gamma_{\bar{S}} \preceq \Gamma_{\bar{S}},\ \gamma_{\bar{K}} \preceq \Gamma_{\bar{K}},\ \gamma_E \leq \Gamma_E) \\
&\overset{(a)}{\leq} P(\gamma_{\bar{S}} \preceq \Gamma_{\bar{S}}) + P(\gamma_{\bar{K}} \preceq \Gamma_{\bar{K}}) + P(\gamma_E \leq \Gamma_E) \\
&\overset{(b)}{\leq} \sum_{i=1}^{I} P(\gamma_{S_i} \leq \Gamma_{S_i}) + P(\gamma_{\bar{K}} \preceq \Gamma_{\bar{K}}) + P(\gamma_E \leq \Gamma_E) \qquad (6.5)\\
&\leq I \times \max_i P(\gamma_{S_i} \leq \Gamma_{S_i}) + P(\gamma_{\bar{K}} \preceq \Gamma_{\bar{K}}) + P(\gamma_E \leq \Gamma_E) \\
&\overset{(c)}{\leq} I \times \max_i P(\gamma_{S_i} \leq \Gamma_{S_i}) + \min_j P(\gamma_{K_j} \leq \Gamma_{K_j}) + P(\gamma_E \leq \Gamma_E),
\end{aligned}$$

where (a) and (b) follow from the union bound on probability, and (c) follows from the fact that the joint probability of a set of events is less than the probability of any one event, i.e., $P(\gamma_{\bar{K}} \preceq \Gamma_{\bar{K}}) \leq P(\gamma_{K_j} \leq \Gamma_{K_j})$ for any $j = 1, \ldots, J$, and then taking the minimum over all $j$. The resulting upper bound in (6.5) forms the basis of the modification of the multi-level PFA:

$$\text{Modified-PFA}(\gamma) \triangleq I \times \max_i P(\gamma_{S_i} \leq \Gamma_{S_i}) + \min_j P(\gamma_{K_j} \leq \Gamma_{K_j}) + P(\gamma_E \leq \Gamma_E) \qquad \square$$

The Modified-PFA expression leads to intuitive interpretations as follows: (i) as false alarms at targets can have a higher impact, it is desirable to keep the worst-case PFA across these to be the smallest, or equivalently, $\max_i P(\gamma_{S_i} \leq \Gamma_{S_i})$ should be minimized; (ii) false alarms at surrogates are not as important and we can declare a false alarm only if all of the surrogate-level detections are unreliable, or equivalently, $\min_j P(\gamma_{K_j} \leq \Gamma_{K_j})$ needs to be minimized; (iii) notably, the above modification leads to a low-complexity change detection approach across multiple sources via locally optimal detection strategies, avoiding a multi-dimensional search. Based on Modified-PFA, we next present a compact test suite to declare changes at pre-specified levels of maximum PFA, as given in Theorem 6.2, and incorporate the specificity issues pointed out in problem (c) of the original formulation of PFA.

Theorem 6.2 (Multi-level Change Detection). Let $\Gamma_{S_i}$ be the true changepoint random variable for the $i$-th target source $S_i$. Let $\Gamma_{K_j}$ and $\Gamma_E$ represent the same for the $j$-th surrogate and the sum-of-targets, respectively. Let the data observed till time $T$ be $D^{(T)} \triangleq \{\bar{S}^{(T)}, \bar{K}^{(T)}\}$ and let $P(\bar{\Gamma} \mid D^{(T)})$ denote the estimate of the conditional distribution (see Section 6.2.2). Then, if $\alpha_i, \beta_j, \lambda$ represent the PFA thresholds for $S_i, K_j, E$, the changepoint tests can be given as:

$$\gamma_{S_i}(\alpha_i) = \inf\left\{ n : TS_{S_i}(D^{(T)}) \geq \frac{\alpha_i}{1 + \alpha_i} \right\}, \quad i = 1, \ldots, I \qquad (6.6a)$$
$$\gamma_{K_j}(\beta_j) = \inf\left\{ n : TS_{K_j}(D^{(T)}) \geq \frac{\beta_j}{1 + \beta_j} \right\}, \quad j = 1, \ldots, J \qquad (6.6b)$$
$$\gamma_E(\lambda) = \inf\left\{ n : TS_E(D^{(T)}) \geq \frac{\lambda}{1 + \lambda} \right\} \qquad (6.6c)$$

where $TS_X(D^{(T)}) = P(\Gamma_X \leq n \mid D^{(T)})$ is the test statistic (TS) for a source $X$.

Proof. In quickest change detection, our goal at time $T$ is to decide if a change should be declared for some $n \leq T$ for a particular data source. To this end, we can use the following change detection test

$$\gamma_{S_i}(\alpha_i) = \inf\left\{ n : \log \frac{P\big(\Gamma_{S_i} \leq n \mid D^{(T)}\big)}{P\big(\Gamma_{S_i} > n \mid D^{(T)}\big)} \geq \log(\alpha_i) \right\},$$

which is equivalent to the following test:

$$\gamma_{S_i}(\alpha_i) = \inf\left\{ n : P\big(\Gamma_{S_i} \leq n \mid D^{(T)}\big) \geq \frac{\alpha_i}{1 + \alpha_i} \right\}. \qquad (6.7)$$

Intuitively, the above test declares the change for the $i$-th target source $S_i$ at the smallest time $n$ for which the test statistic (i.e., the posterior probability of the changepoint random variable being less than $n$) exceeds a threshold. The probability of false alarm for the above test can be bounded in terms of the threshold $\alpha_i$ as:

$$P(\gamma_{S_i} \leq \Gamma_{S_i}) = \sum_{D^{(T)}} \sum_{n} P\big(D^{(T)}, \gamma_{S_i} = n\big)\, P\big(\Gamma_{S_i} > n \mid D^{(T)}, \gamma_{S_i} = n\big) \overset{(d)}{\leq} \frac{1}{1 + \alpha_i} \underbrace{\sum_{D^{(T)}} \sum_{n} P\big(D^{(T)}, \gamma_{S_i} = n\big)}_{=1} = \frac{1}{1 + \alpha_i}, \qquad (6.8)$$

where (d) follows from the fact that, given the observed data and the event $\gamma_{S_i} = n$ (i.e., the change is declared at $n$), it follows from equation 6.7 that

$$P\big(\Gamma_{S_i} > n \mid D^{(T)}, \gamma_{S_i} = n\big) \leq 1/(1 + \alpha_i).$$

Let us denote the test statistic (TS) for a data source $X$ as:

$$TS_X(D^{(T)}) = P(\Gamma_X \leq n \mid D^{(T)})$$

Then the multi-level change detection test is:

$$\gamma_{S_i}(\alpha_i) = \inf\left\{ n : TS_{S_i}(D^{(T)}) \geq \frac{\alpha_i}{1 + \alpha_i} \right\}, \quad i = 1, \ldots, I$$
$$\gamma_{K_j}(\beta_j) = \inf\left\{ n : TS_{K_j}(D^{(T)}) \geq \frac{\beta_j}{1 + \beta_j} \right\}, \quad j = 1, \ldots, J$$
$$\gamma_E(\lambda) = \inf\left\{ n : TS_E(D^{(T)}) \geq \frac{\lambda}{1 + \lambda} \right\} \qquad \square$$

From Theorem 6.2, we can infer the boundedness property of Modified-PFA expressed in the following lemma.

Lemma 6.3. If we define $\alpha \triangleq \min_i(\alpha_i)$ and $\beta \triangleq \max_j(\beta_j)$, then Modified-PFA in equation 6.4 can be bounded as:
$$\text{Modified-PFA}(\gamma) \leq I \times \frac{1}{1 + \alpha} + \frac{1}{1 + \beta} + \frac{1}{1 + \lambda} \qquad (6.10)$$
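A minimal sketch of the resulting tests (equations 6.6a-6.6c): given a posterior test statistic $TS_X(D^{(T)}) = P(\Gamma_X \leq n \mid D^{(T)})$ for each source, a change is declared at the first $n$ whose statistic crosses the source-specific threshold. The posterior curves and the threshold values below are hypothetical stand-ins for the output of the estimation procedure in Section 6.2.2.

```python
import numpy as np

def declare_change(test_stat, alpha):
    """Return the first n where TS_X(D^(T)) >= alpha/(1+alpha), else None."""
    level = alpha / (1.0 + alpha)
    hits = np.nonzero(np.asarray(test_stat) >= level)[0]
    return int(hits[0]) if hits.size else None

# Hypothetical posteriors P(Gamma_X <= n | D^(T)) over n = 0..49 for a target, a surrogate, and E.
ts = {
    "S_1": np.linspace(0.0, 1.0, 50),
    "K_1": np.clip(np.linspace(-0.5, 1.5, 50), 0.0, 1.0),
    "E":   np.linspace(0.0, 0.9, 50),
}
# Larger alpha => higher evidence level alpha/(1+alpha) => tighter false alarm bound 1/(1+alpha),
# so targets and E are given larger budgets than surrogates here (assumed values).
thresholds = {"S_1": 9.0, "K_1": 1.0, "E": 4.0}
for name, stat in ts.items():
    print(name, "declared at n =", declare_change(stat, thresholds[name]))
```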

6.2 HQCD for Count Data via Surrogates

In this section we discuss the HQCD framework for count data sources, which may be observed in real life. For example, we can analyze the number of protests towards early detection of protest uprisings via surrogate sources. Protests can happen in civil society for various reasons, such as protests against fare hikes or protests demanding more job opportunities.

Algorithm 1: HQCD Multi-level Change Point Detection Algorithm
Input: At time T, target and surrogate sources D^(T) = {S^(T), K^(T)}
Parameters: PFA thresholds for targets (α), surrogates (β), and sum of targets (λ)
Output: Changepoint decisions {γ̄_S, γ̄_K, γ_E} at each timepoint T

1   for each T do
2       Update joint posterior P(Γ_K, Γ_S, Γ_E | D^(T))
        // target change detection
3       for i ← 1 to I do
4           Compute target marginal P(Γ_{S_i} | D^(T))
5           Find γ_{S_i}(α) using 6.6a
6       γ̄_S ← {γ_{S_1}(α), ..., γ_{S_I}(α)}
        // surrogate change detection
7       for j ← 1 to J do
8           Compute surrogate marginal P(Γ_{K_j} | D^(T))
9           Find γ_{K_j}(β) using 6.6b
10      γ̄_K ← {γ_{K_1}(β), ..., γ_{K_J}(β)}
        // sum-of-targets change detection
11      Compute sum-of-targets marginal P(Γ_E | D^(T))
12      Find γ_E(λ) using 6.6c
13      Return decision {γ̄_S, γ̄_K, γ_E(λ)} at T

Such protests, especially major changes in protest base levels, are potentially interlinked. However, explaining such interactions is a non-trivial process. [48] found several social sources, especially Twitter chatter, to capture protest-related information. We apply HQCD to find significant changes in protests concurrent to changes in Twitter chatter, such that detecting changes accurately in the protests is of primary importance, in contrast to the chatter, which can be influenced by a range of factors, including protests. In general, HQCD can be applied to similar events, such as disease outbreaks, to find significant changes in targets using information from noisy surrogates.

6.2.1 Hierarchical Model for Count Data

In general, HQCD can be applied to any count data sources. However, the exact specification may depend on the application. For example, considering protest uprisings, we first note that surrogate sources such as Twitter are in general noisy and involve a complex interplay of several factors, one of which could be protest uprisings. Furthermore, for protest uprisings, we are more concerned with using the surrogates (Twitter chatter) to help declare changes at the target level (protest counts) than with accurately identifying the changes in the surrogates. Thus, without loss of generality, we model the surrogates as i.i.d. distributed variables. Figure 6.3 evaluates the i.i.d. assumptions for both protest counts and Twitter chatter. Our results indicate that Log-normal is a reasonable fit for Twitter chatter.

Figure 6.3: Histogram fit of (a) surrogate source (Twitter keyword counts) and (b) target source (number of protests of different categories), for various temporal windows, under i.i.d. assumptions. These assumptions lead to a satisfactory distribution fit, at a batch level, for both sources. The top-most row corresponds to the period before the Brazilian Spring (pre 2013-05-25), the second row is for the period 2013-05-25 to 2013-10-20, and the third is for the period after 2013-10-20. The last row shows the fit for the entire period. These temporal fits are indicative of significant changes in distribution along the Brazilian Spring timeline, for both target and surrogates.
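The Log-normal check summarized in Figure 6.3 can be reproduced in spirit with a maximum-likelihood fit over one temporal window; the keyword-count array below is a hypothetical stand-in for the actual Twitter chatter series.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
# Hypothetical weekly Twitter keyword counts for one temporal window.
counts = rng.lognormal(mean=12.0, sigma=1.5, size=200)

# Fit a Log-normal by maximum likelihood (location fixed at 0, as for count-like data).
shape, loc, scale = stats.lognorm.fit(counts, floc=0)
print("scale (median) =", scale, "shape (sigma) =", shape)

# A quick goodness-of-fit check against the fitted distribution.
ks_stat, p_value = stats.kstest(counts, "lognorm", args=(shape, loc, scale))
print("KS statistic =", ks_stat, "p-value =", p_value)
```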


Surrogate Sources: Formally, we assume that the $j$-th surrogate source $K_j$ is generated i.i.d. from a distribution $f^K$ with respect to the associated changepoint $\Gamma_{K_j}$ as:
$$k_j(t) \overset{\text{i.i.d.}}{\sim} \begin{cases} f^K\big(\phi_0^{K_j}\big) & t \leq \Gamma_{K_j} \\ f^K\big(\phi_1^{K_j}\big) & t > \Gamma_{K_j} \end{cases} \qquad (6.11)$$

where $\phi_0^{K_j}$ and $\phi_1^{K_j}$ are the pre- and post-change parameters. Following our earlier discussion, we select $f^K$ as Log-normal (with location and scale parameters $\phi^{K_j} = \{c^{K_j}, d^{K_j}\}$) for Twitter counts.

Target Sources: Target sources can in general be dependent on both the past values of targets as well as the surrogates. Here, we restrict the target source process to be a first-order Markov process. Under this assumption, we formalize the $i$-th target source $S_i$ to follow a Markov process $f_t^S$ with respect to its changepoint $\Gamma_{S_i}$ as:
$$s_i(t) \sim \begin{cases} f_t^S\big(\phi_0^{S_i}(t)\big) & t \leq \Gamma_{S_i} \\ f_t^S\big(\phi_1^{S_i}(t)\big) & t > \Gamma_{S_i} \end{cases} \qquad (6.12)$$

where $\phi_0^{S_i}$ and $\phi_1^{S_i}$ are the pre- and post-change parameters of the process. A Poisson process with dynamic rate parameters has been shown [8] to be effective in specifying hierarchical count data w.r.t. changepoints. Here, we model the rate parameters as a nested autoregressive process [22, 8] given as:

$$\phi_{0/1}^{S_i}(t) = \phi_{0/1}^{S_i}(t-1) + \frac{A_{0/1}^{i}(t)}{|A_{0/1}^{i}(t)|} \begin{bmatrix} S(t-1) \\ K(t-1) \end{bmatrix} + \mathcal{N}(0, \sigma_S) \qquad (6.13)$$
$$A_{0/1}^{i}(t) = A_{0/1}^{i}(t-1) + \mathcal{N}(0, \Sigma_{A^i})$$

Here, $\phi_{0/1}^{S_i}(t)$ captures the latent rate and $\sigma_S$ denotes the error variance. $A_{0/1}^{i}(t)$ captures the variation due to the observed values of target and surrogate sources.

Changepoint Priors: Following our prior discussion, surrogate changepoints can be assumed

to have an uninformative prior, and we model $\Gamma_{K_j}$ via a memoryless arrival distribution (a static probability of observing a change given that it has not occurred earlier) as:

$$\Gamma_{K_j} \sim \text{Geom}(\rho_{K_j}) \;\Rightarrow\; P\big(\Gamma_{K_j} = t \mid \Gamma_{K_j} \geq t\big) = \rho_{K_j} \qquad (6.14)$$

Conversely, target changepoints can be influenced by surrogate changepoints, as their generative process is dependent on the surrogates. Specifically, whenever we observe a changepoint in the surrogates, we assume that the base rate of changepoint for a target increases for a certain period of time. Formally, target changepoint priors are assumed to follow a dynamic process as:

$$\Gamma_{S_i} \sim \text{Geom}\big(\rho_{S_i}(t)\big) \qquad (6.15)$$
$$\rho_{S_i}(t) = \rho_{S_i} + \sum_j \mathbb{I}\big(\Gamma_{K_j} < t\big)\, \mu_j^1 e^{-\mu_j^2 (t - \Gamma_{K_j})}$$

where $\mathbb{I}$ is the indicator function and $\rho_{S_i}$ represents the nominal base rate for the changepoint. As can be seen, a change in the $j$-th surrogate source is modeled as an exponentially decaying 'impulse' of amplitude $\mu_j^1$. The summation of targets, $E(t)$, is known deterministically given $S_i(t)$. Moreover, given $S_i(t-1)$, $E(t)$ can be considered a summation of independent Poisson processes following dynamics similar to equation 6.13, which is omitted due to limited space. Similarly, the dependence of $\Gamma_E$ on $\bar{K}$ can be modeled similar to equation 6.15.
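A minimal simulation sketch of this generative story under simplifying assumptions (one surrogate, and a scalar random-walk log-rate standing in for the full nested process of equation 6.13): the surrogate is i.i.d. Log-normal around its changepoint (equation 6.11), the target is Poisson with a drifting rate, and the target changepoint hazard receives an exponentially decaying boost after the surrogate changes (equations 6.14-6.15). All numeric parameters are assumed.

```python
import numpy as np

rng = np.random.default_rng(3)
T, rho_K, rho_S, mu1, mu2 = 100, 0.03, 0.02, 0.3, 0.5

# Surrogate changepoint: memoryless (geometric) arrival, eq. 6.14.
gamma_K = rng.geometric(rho_K)
surrogate = np.where(np.arange(T) <= gamma_K,
                     rng.lognormal(5.0, 1.0, T),     # pre-change parameters (assumed)
                     rng.lognormal(6.0, 1.0, T))     # post-change parameters (assumed)

# Target changepoint: dynamic hazard boosted after the surrogate change, eq. 6.15.
gamma_S, target, log_rate = None, [], np.log(20.0)
for t in range(T):
    hazard = rho_S + (mu1 * np.exp(-mu2 * (t - gamma_K)) if gamma_K < t else 0.0)
    if gamma_S is None and rng.random() < hazard:
        gamma_S = t
    # Simplified stand-in for eq. 6.13: random-walk log-rate, shifted after the change.
    log_rate += rng.normal(0.0, 0.05) + (0.05 if gamma_S is not None else 0.0)
    target.append(rng.poisson(np.exp(log_rate)))

print("surrogate changepoint:", gamma_K, "target changepoint:", gamma_S)
```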

6.2.2 Changepoint Posterior Estimation

Algorithm 1 involves posterior estimation of the changepoints given the data at a particular time point. Earlier work has focused mainly on offline methods such as Gibbs sampling [8]. Online posterior estimation for such problems has been studied extensively in the context of Sequential Bayesian Inference [9], such as Kalman filters [26, 55, 2] (Gaussian transitions) and particle filters [18, 47, 20]. Recently, Chopin et al. [16] proposed a robust particle filter, SMC$^2$, which is ideally suited for fitting the parameters of the non-linear hierarchical model described in Section 6.2.1. In this section we formulate a Sequential Bayesian Algorithm that makes HQCD tractable under real-world constraints (see Figure 6.4).


Figure 6.4: Computation time for one complete run of changepoint detection (in minutes) on a 1.6 GHz quad-core, 8 GB Intel i5 processor: Gibbs sampling [8] vs. HQCD vs. HQCD without surrogates. Gibbs sampling computation times are unsuitable for online detection.

To find the posterior $P(\bar{\Gamma}_S, \bar{\Gamma}_K, \Gamma_E \mid D^{(T)})$ at any time $T$ using SMC$^2$, we first cast the model parameters and variables into the following three categories:

Observations ($y_T$): In the context of SMC$^2$, these are the parameters that correspond to observed variables at each time point $T$. For HQCD we can model $y_T$ as:

$$y_T \triangleq \{S(T), K(T)\} \qquad (6.16)$$

Hidden States ($x_T$): SMC$^2$ estimates the observations based on interactions with hidden states, which are dynamic, unobserved, and sufficient to describe $y_T$ at $T$. For HQCD, we can express $x_T$ as follows:

$$x_T \triangleq \big\{\bar{\Gamma}_S, \bar{\Gamma}_K, \Gamma_E,\ \bar{\phi}_{0/1}^{S}(T-1),\ \bar{\phi}_{0/1}^{K},\ \bar{\rho}_K(T),\ \bar{\rho}_S(T),\ \bar{A}_{0/1},\ S(T-1),\ K(T-1)\big\} \qquad (6.17)$$

Static Parameters ($\theta$): Finally, SMC$^2$ also accommodates the concept of static parameters which do not change over time, such as the base probabilities of changepoint $\bar{\rho}_S$ and the noise matrix $\Sigma_A$ in HQCD. We can express $\theta$ as:

$$\theta \triangleq \big\{\sigma_S, \Sigma_A, \bar{\rho}_S, \bar{\mu}^1, \bar{\mu}^2\big\} \qquad (6.18)$$

For a given set of such parameters, SMC$^2$ works by first generating $N_\theta$ samples of $\theta$ using the prior distribution $P(\theta)$. For each of these samples of $\theta$, SMC$^2$ samples $N_x$ samples of $x_0$ from its prior $P(x_0 \mid \theta)$. Following standard practice, we use conjugate distributions [9] for the priors.

Algorithm 2: HQCD Changepoint Posterior Estimation via SMC²

Input: At time T, y_T as given in equation 6.16
Parameters: Prior distributions P(θ) and P(x_0 | θ); hyperparameters for P(θ) and P(x_0 | θ)
Output: Joint posterior P(Γ_K, Γ_S, Γ_E | D^(T))

1   Define x_T as given in equation 6.17
2   Define θ as given in equation 6.18
    // initialization
3   Sample N_θ values θ_q using P(θ)
4   Sample N_x values x_{0,q,r} using P(x_0 | θ_q)
5   Update weights w(0)   // see Appendix
    // online learning
6   for each T do
        // state updates
7       for each q ∈ N_θ do
8           for each r ∈ N_x do
9               Update states: x_{T,q,r} from x_{T-1,q,r}
10              Compute importance weights w_{q,r}(T)
11          Compute observation probability P(y_T | y_{T-1}, θ_q)
            // incorporate observation at time T
12          Update importance weight w_{q,r}(T) ← w_{q,r}(T) P(y_T | y_{T-1}, θ_q)
            // test premature convergence
13          Test degeneracy conditions using effective sample size
14          if degeneracy then
                // Markov kernel jumps
15              Update x_{T,q,r} by applying a Markov kernel K_T
                // recompute weights
16              Exchange x_{T,q,r} and set w_{q,r} ∝ 1
    // find joints
17  Return updated P(Γ̄_S, Γ̄_K, Γ_E | D^(T)) using equation 6.19

Table 6.2: (Synthetic data) Comparing the true changepoint (Γ) for targets against the changepoints (γ) detected by HQCD and by state-of-the-art methods, with the corresponding additive detection delay (ADD). Each row represents a target; detections earlier than the true changepoint are false alarms and are marked '–' in the ADD column.

Target | True Γ | GLRT γ/ADD | WGLRT γ/ADD | BOCPD γ/ADD | RuLSIF γ/ADD | HQCD γ/ADD | HQCD w/o surr. γ/ADD
S1     | 29     | 7 / –      | 10 / –      | 13 / –      | 36 / 7       | 33 / 4     | 32 / 3
S2     | 6      | 11 / 5     | 14 / 8      | 16 / 10     | 28 / 22      | 8 / 2      | 9 / 3
S3     | 24     | 7 / –      | 16 / –      | 15 / –      | 29 / 5       | 22 / –     | 26 / 2
S4     | 26     | 5 / –      | 11 / –      | 11 / –      | 38 / 12      | 27 / 1     | 31 / 5
S5     | 47     | 40 / –     | 15 / –      | 8 / –       | 26 / –       | 50 / 3     | 55 / 8

At each time point $T$, the samples are perturbed using the model equations given in Section 6.2.1 and associated with weights $w$ to estimate the joint posteriors as:

$$P(\theta, x_T \mid y_T) = \sum_{q=1}^{N_\theta} \sum_{r=1}^{N_x} w_{q,r}\, \delta(\theta, x_T) \qquad (6.19)$$
$$P\big(\bar{\Gamma}_S, \bar{\Gamma}_K, \Gamma_E \mid D^{(T)}\big) \propto \sum_{q=1}^{N_\theta} \sum_{r=1}^{N_x} w_{q,r}\, \delta\big(\bar{\Gamma}_S, \bar{\Gamma}_K, \Gamma_E\big)$$

where $\delta$ is the Kronecker delta function. Algorithm 2 outlines the steps involved in this process. For more details on SMC$^2$, see the Appendix.
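For intuition only, the following heavily simplified sketch tracks the changepoint posterior of a single Poisson source with a plain bootstrap-style particle filter (known pre-/post-change rates, no static-parameter layer, no hierarchy); it is a stand-in for, not an implementation of, the SMC$^2$ procedure of Algorithm 2, and all rates and sizes are assumed.

```python
import numpy as np
from scipy.stats import poisson

rng = np.random.default_rng(11)
data = np.concatenate([rng.poisson(20, 40), rng.poisson(35, 30)])  # true change at t = 40
N, rho = 2000, 0.02                      # particles and geometric hazard (assumed)

# Each particle carries a candidate changepoint; "not yet changed" is encoded as +inf.
cp = np.full(N, np.inf)
weights = np.full(N, 1.0 / N)
for t, y in enumerate(data):
    # Propagate: particles that have not changed may change now with probability rho.
    flips = (cp == np.inf) & (rng.random(N) < rho)
    cp[flips] = t
    # Reweight by the likelihood under each particle's regime (pre-rate 20, post-rate 35).
    lam = np.where(t >= cp, 35.0, 20.0)
    weights *= poisson.pmf(y, lam)
    weights /= weights.sum()
    # Resample when the effective sample size degenerates.
    if 1.0 / np.sum(weights ** 2) < N / 2:
        idx = rng.choice(N, size=N, p=weights)
        cp, weights = cp[idx], np.full(N, 1.0 / N)

print("P(Gamma <= 45 | data) =", weights[cp <= 45].sum())
```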

6.3 Experiments

We present experimental results for both synthetic and real-world datasets, and compare HQCD against several state-of-the-art online change detection methods (see Table 6.1), specifically GLRT [54], W-GLRT [31], BOCPD [1], and RuLSIF [35]. To further analyze the effect of surrogates in detecting changepoints, we also compare against HQCD without surrogates, where $K(t-1)$ is dropped from equation 6.13 and $\rho_{S_i}(t)$ is made static (i.e., independent of changepoints from surrogates) in equation 6.15.

6.3.1 Synthetic Data

In this section, we validate against synthetic datasets with known changepoint parameters. For this, we pick 5 targets ($I = 5$) and 10 surrogates ($J = 10$). The surrogates were generated from i.i.d. Log-normal distributions (see equation 6.11) while the targets were generated using a Poisson process (see equation 6.12). The changepoints for the surrogates were sampled from a fixed Gamma distribution (see 6.14), while the associated changepoints for the target sources were simulated via equation 6.15.


Figure 6.5: Comparison of HQCD against the state-of-the-art on simulated target sources. The X-axis represents time and the Y-axis represents the actual value. Solid blue lines refer to the true changepoint, solid green refers to the ones detected by HQCD, and brown refers to HQCD without surrogates. Dashed red, magenta, purple, and gold lines refer to changepoints detected by RuLSIF, WGLRT, BOCPD, and GLRT, respectively. HQCD shows better detection for most targets with low overall detection delay and false alarms.

Comparisons with state-of-the-art

As true changepoints are known for the synthetic dataset, we can compare HQCD against the state-of-the-art methods for the detected changepoints, as shown in Figure 6.5. Table 6.2 presents the results in terms of the false alarm (FA) and additive detection delay (ADD). From the table, we can see that HQCD is able to detect the changepoints with fewer false alarms. Also, HQCD has the lowest delay across all methods for all targets except Target-1, for which HQCD without surrogates achieved a better delay, indicating that the surrogates are not informative for this target source.

Usefulness of Surrogates

Our comparisons with the state-of-the-art show the significant improvements achieved by HQCD, both in terms of FA and ADD, and showcase the importance of systematically admitting surrogate information to attain quicker change detection with low false alarm. We compare HQCD with surrogates against HQCD without surrogates (Table 6.2) and find that admitting surrogates significantly improves the average delay (2.5 compared to 4.2). We also plot the average false alarm rate against the detection delay in Figure 6.6 and find that the HQCD results are in general the ones with the best trade-off between FA and ADD.


Figure 6.6: False Alarm vs Delay trade-off for different methods. HQCD shows the best trade-off.

6.3.2 Real life case study

In real-life scenarios, the true changepoint is typically unknown. One representative example can be seen w.r.t. the onset of major civil unrest related protests and uprisings. We present an analysis of three major uprisings: (i) in Brazil around mid 2013 (often termed the Brazilian Spring), (ii) in Venezuela around early 2014, and (iii) in Uruguay around late 2013. We first describe the data collection procedure and follow up with a comparative analysis of detected changepoints. Weekly counts of civil unrest events from Nov. 2012 to Dec. 2014 were obtained as part of a database of discrete unrest events (Gold Standard Report - GSR) prepared by human analysts by parsing news articles for civil unrest content. Among other annotations, the GSR also classifies each event into one of 6 possible event types based on the reason ('why') behind the protest. Each of these event types, such as (a) Employment and Wages, (b) Housing, (c) Energy and Resources, (d) Other government, (e) Other economic, and (f) Other, bears certain societal importance. We treat the weekly counts of each of these event types as target sources ($S$) and the sum total of all protests for a week as the sum-of-targets ($E$). We also collected geo-fenced tweets for each country over the same time period. We used a human-annotated dictionary of 962 keywords/phrases that contains several identifiers of protest in the languages spoken in the countries of interest (similar to Ramakrishnan et al. [48]). As most of these keywords could have similar trends, we cluster them using k-means into 30 clusters (i.e., we have $J = 30$ surrogates). To account for scaling effects while preserving temporal coherence, each keyword time series was normalized to zero mean and unit variance.
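The keyword pre-processing step described above (z-scoring each keyword series and grouping the 962 keywords into 30 surrogate clusters with k-means) can be sketched as follows; the random matrix stands in for the actual keyword-by-week count matrix, and taking the cluster mean as each surrogate $K_j$ is an assumption, since the chapter does not specify the aggregation.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(5)
keyword_counts = rng.poisson(30, size=(962, 110)).astype(float)  # keywords x weeks (hypothetical)

# Normalize each keyword time series to zero mean and unit variance.
z = (keyword_counts - keyword_counts.mean(axis=1, keepdims=True)) / \
    keyword_counts.std(axis=1, keepdims=True)

# Cluster the normalized series into J = 30 surrogate groups.
labels = KMeans(n_clusters=30, n_init=10, random_state=0).fit_predict(z)

# Assumed aggregation: each surrogate K_j is the mean normalized series of its cluster.
surrogates = np.vstack([z[labels == j].mean(axis=0) for j in range(30)])
print(surrogates.shape)   # (30, 110)
```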


Figure 6.7: Comparison of detected changepoints at the sum-of-targets (all protests) for (a) Brazil, (b) Venezuela, and (c) Uruguay. HQCD detections are shown in solid green, while those from the state-of-the-art methods, i.e., RuLSIF (red), WGLRT (magenta), BOCPD (purple), and GLRT (gold), are shown with dashed lines. The HQCD detection is the closest to the traditional start date of mass protests in the three countries studied.

Changepoint Across layers

We show the changepoints detected by HQCD (bold green) and the state-of-the-art methods (dashed lines) for the sum of all protests in Figure 6.7 (see Figure C.1 in Appendix C for individual protest types). We can observe that HQCD, which uses the surrogate information sources and exploits the hierarchical structure, finds indicators of changes which are visually better as well as more aligned to the dates of major events (see the demo at https://prithwi.github.io/hqcd_supplementary). In contrast, the state-of-the-art methods can be argued to show a significantly high false alarm rate. For such real-world data sources, the notion of a true changepoint is difficult to ascertain; we can instead consider, for example, the onset of the Brazilian Spring protests (2013-06-01) as an underlying changepoint to compare at the sum-of-targets and interpret notions of false alarm. Table C.1 tabulates these inferences for the targets as well as the sum-of-targets. Although a true changepoint is unknown, we note that for HQCD, the expected additive detection delay (EADD) can be estimated according to equation 6.2 (from $P(\bar{\Gamma} \mid D^{(T)})$ in Algorithm 2).

Changepoint influence analysis

The experiments presented in the previous section can be further analyzed to ascertain the nature of the progression of significant events that lead to a protest. Here we present our analysis for the Brazilian Spring. We found that the detected changepoints (see Table C.1 in Appendix C) for Brazil reveal an interesting progression: significant changes in Energy-related unrest (06/02) propagated to Housing/Other Govt. unrest (06/16) and culminated in mass Employment-related unrest (08/18). Interestingly, we can analyze the fitted parameters of the weight vector $A_{0/1}^i$ of the rate updates (see 6.13) to quantify the changepoint influence of a source (target/surrogate) at time $T-1$ on time $T$. For each target $S_i$, we can compute the average value of the weight vector component of each target/surrogate separately.

Figure 6.8: (Brazilian Spring) Heatmap of changepoint influences of (a) targets on targets and (b) surrogates on targets. Darker (lighter) shades indicate higher (lesser) changepoint influence. (a) shows the presence of strong off-diagonal elements, indicating strong cross-target changepoint information. (b) shows a mixture of uninformative and informative surrogates.

Let $h_0$ and $h_1$ denote these averages for one such source. Effectively, $h_0$ then measures the effect of the source at time $t-1$ on $S_i$ at $t$ before the change, while $h_1$ captures the same post-change. Their percentage relative change can then be used as a measure of the changepoint influence of a particular target/surrogate source on $S_i$. We plot a heatmap of these percentages in Figure 6.8 for both targets and surrogates, separately. From Figure 6.8a, we can see that 'Other Economic' and 'Employment' related protests had strong influences from 'Housing' related protests. Furthermore, from Figure 6.8b we can see that 'Housing' and 'Employment' related protests were influenced by similar Twitter chatter clusters (cluster-01 and cluster-26), indicating that the interaction between these protest subtypes can be inferred from the social domain. Conversely, 'Housing' and 'Other Economic' related protests are only weakly correlated through Twitter chatter, thus exhibiting the robustness of HQCD, which can still detect interactions between targets when surrogates fail to explain the same. In general, for a particular target we can see linked precursors in other targets (strong off-diagonal elements in Figure 6.8a) and highly specific informative surrogates (few strong cells for a row in Figure 6.8b).
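The influence scores behind Figure 6.8 can be computed as sketched below: for every (source, target) pair, average the fitted weight component before and after the change ($h_0$, $h_1$) and take the percentage relative change. The weight arrays here are hypothetical placeholders for the fitted $A^i_{0/1}$ trajectories, and normalizing by $|h_0|$ is an assumed reading of "percentage relative change".

```python
import numpy as np

def changepoint_influence(A_pre, A_post):
    """Percentage relative change of the average weight component.
    A_pre, A_post: arrays of shape (time, n_sources) of fitted weights for one target,
    before and after that target's detected changepoint."""
    h0 = A_pre.mean(axis=0)
    h1 = A_post.mean(axis=0)
    return 100.0 * (h1 - h0) / np.abs(h0)

rng = np.random.default_rng(2)
n_targets, n_sources = 6, 30
heatmap = np.vstack([
    changepoint_influence(rng.normal(1.0, 0.1, (20, n_sources)),   # pre-change weights (placeholder)
                          rng.normal(1.3, 0.1, (15, n_sources)))   # post-change weights (placeholder)
    for _ in range(n_targets)
])
print(heatmap.shape)   # rows: targets, columns: lagged surrogate clusters
```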

6.4 Discussion

We have shown HQCD to be an effective framework for detecting changepoints in an online manner while accommodating multiple sources in a hierarchical structure. HQCD has been validated against both synthetic sources and real-life scenarios. In the next chapter, we present our efforts at utilizing these changepoints towards robust forecasting models.

Supporting Information: A demo of HQCD and the datasets used in this chapter can be found at https://prithwi.github.io/hqcd_supplementary. The attached appendix provides additional details on SMC$^2$.

Chapter 7

Concept Drift Adaptation for Google Flu Trends

Early detection of disease outbreaks can lead to prompt response strategies and effective implementation of counter-measures. Syndromic surveillance mechanisms hold great promise in improving lead time to detection. Google Flu Trends (GFT) was one of the most celebrated examples of syndromic surveillance and has emerged as one of the most popular mechanisms involving non-clinical data. Recent work, including at Google, has shown that systems like GFT, just like other surveillance and forecasting strategies, require periodic re-training and adaptation every year. In particular, GFT estimates tend to be locally spiky in nature, which often leads to difficulties in regression w.r.t. CDC ILI surveillance data. In addition to local variations, we posit that the fundamental cause of major seasonal performance variations of GFT is dynamic patterns in user search behavior. Such a phenomenon can be analyzed under the framework of concept drift. Our proposed approach is to explicitly model concept drift to make ILI estimates from surrogate sources such as GFT more robust, in an online manner.

7.1 Background

Google Flu Trends first came into the limelight with Ginsberg et al.'s seminal work [24] on mining indicators for disease surveillance from social media activity. This work has spurred a flurry of research in this domain, such as [38]. GFT ILI estimates were available for several countries and for several regions, which can be used by epidemiologists to gain quick insight into the prevalent influenza state. However, as noted in some recent studies such as [6, 33, 32], GFT has been under-performing against official surveillance data. In spite of updates to the GFT system, which attempt to rescale search query terms in response to sudden spikes in the search data, the drifting performance issue has not been completely resolved [32].



Figure 7.1: Evidence of concept drift. For the Google Flu Trends data for Argentina (left), the corresponding 52-week rolling mean (right) exhibits a saddle point in early 2012, indicating a possible mean-shift drift in GFT for Argentina.

Part I and II have outlined our efforts at short-term and long-term forecasting of infectious diseases using surrogate sources. Such forecasts were also generated for the IARPA OSI nationwide challenge, for which our winning team developed Early Model Based Event Recognition using Surrogates (EMBERS) [48], an automated continuous surveillance and predictive system that monitors, among other things, epidemic and rare disease outbreaks. During this effort we came to better understand the inherent drift in surrogate-target relationships for diseases, and we had to continuously monitor and adapt our models, paying equal attention to robustness and efficacy. From this experience, we learned that the effective usage of open source data, in the presence of ever-changing data patterns, necessitates the incorporation of adaptivity in models. Specifically, for ILI we have been monitoring a set of keywords in several media such as Google search data, news, and Twitter, and found evidence of evolving correlations of such keyword counts with surveillance data [12]. We have also been closely collaborating with the CDC for the past three years, providing forecasts about US national and region-level ILINet percentages as well as seasonal indicators such as peaks. Such efforts led us to run a market for flu predictions under Scicast (https://scicast.org/flu). These experiences corroborate our EMBERS observations, and we have made similar observations about ILI disease surveillance in general.

Focusing on GFT, we conducted experiments for six Latin American countries, namely Argentina, Bolivia, Chile, Mexico, Peru and Paraguay. Figure 7.1 shows the GFT data for Argentina (from 2010-2014) and the corresponding rolling mean (over a 52-week window). As can be seen, the rolling mean indicates that the average activity of flu trends showed a major shift around 2012. Apart from this major change, similar local changes in mean shift can also be observed. Rolling statistics over standard deviations and kurtosis provide similar insights. In general, a combination of these measures indicates that the GFT data distribution is non-stationary. From a machine learning perspective, such non-stationarity in the independent variables leads to varying statistical correlation with the target variable (here, official surveillance data), also referred to as concept drift. Concept drift is known to cause predictions to be less accurate over time, and identification/handling of such drifts can show significant improvement in models. We observed similar trends of concept drift in GFT data for the other five Latin American countries.
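The rolling-statistics diagnostics described above can be reproduced with pandas; the synthetic weekly series below (with an injected mean shift) is a hypothetical stand-in for the GFT series for Argentina.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(4)
weeks = pd.date_range("2010-01-03", periods=260, freq="W")
gft = pd.Series(np.r_[rng.normal(1000, 80, 120), rng.normal(1400, 80, 140)], index=weeks)

window = 52  # one-year window
rolling = pd.DataFrame({
    "mean": gft.rolling(window).mean(),
    "std": gft.rolling(window).std(),
    "kurtosis": gft.rolling(window).kurt(),
})
# A sustained level shift in the rolling mean is the kind of non-stationarity
# that motivates explicit concept-drift handling.
print(rolling.dropna().iloc[[0, -1]])
```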


Figure 7.2: Concept drift adaptation framework. The framework ingests target sources such as CDC ILI case count data and surrogate sources such as GST, GFT, Weather, and HealthMap, and detects changepoints via the 'Concept Drift Detector' stage. Drift probabilities are then passed on to the 'Drift Adaptation' stage, where robust predictions are generated using resampling-based methods (drift-adaptive resampling and model retargeting).

7.2 Robust Models via Concept Drift Adaptation

Concept drift is an actively studied problem and researchers have proposed many different methods to handle concept drift [23]. Some of the more popular methods focus on ensemble models, where ensembles can either be created at the model level or via random resampling of data points to constitute a drift-adapted dataset, which can then be passed on to machine learning algorithms. We focused on the random resampling approaches, aiming for a computationally inexpensive and generic approach, and propose a two-step formalism to handle concept drift towards a robust GFT estimate.

First, we detect concept drifts in the surrogate-target data relationships using an online nonparametric changepoint detection test (see Chapter 6). We used windowed GLRT approaches on a Poisson regression model from surrogate data sources to ILI surveillance data and analyze the regression errors (slack) for changes in distribution. Following the classical CUSUM test and our experiences (see Chapter 6), we propose a rolling window over the series of slacks and identify changepoints based on log-likelihood ratios. These log-likelihood ratios can then be used as probabilities of concept drift for each time point and we can use

weighted resampling of past data where the weights for sampling the time-point t can be given as:

$$w_t = \frac{1 - L_{\text{drift}}(t)}{\sum_t \big(1 - L_{\text{drift}}(t)\big)} \qquad (7.1)$$

where $L_{\text{drift}}(t)$ quantifies the drift at time $t$ in terms of the likelihood of a change at the said time point. The second component involves fitting a Poisson regression once more, but this time on the resampled dataset, to find updated model parameters and generate the adapted GFT estimates. We use random resampling without replacement using the drift probabilities from equation 7.1 and fit our Poisson regression model on the same. The framework is shown schematically in Figure 7.2. We can also employ a feedback mechanism where past accuracies of the adapted GFT against ILI surveillance data are used to update the computed log-likelihood for drift.
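A minimal sketch of the two-step adaptation under stated assumptions: the drift likelihoods $L_{\text{drift}}(t)$ are taken as given (random placeholders here rather than the windowed GLRT/CUSUM output), sampling weights follow equation 7.1, the 70% resample size is an arbitrary choice, and the Poisson regression is refit with statsmodels.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(6)
T = 150
X = sm.add_constant(rng.normal(size=(T, 2)))          # surrogate features (e.g., GFT, GST)
y = rng.poisson(np.exp(0.5 + X[:, 1]))                 # ILI-like counts (hypothetical)

# Step 1: drift likelihoods per time point (placeholder for the changepoint-test output).
L_drift = np.clip(rng.beta(2, 5, size=T), 0, 0.99)
w = (1.0 - L_drift) / np.sum(1.0 - L_drift)            # eq. 7.1

# Step 2: drift-weighted resampling without replacement, then refit the Poisson regression.
keep = rng.choice(T, size=int(0.7 * T), replace=False, p=w)
model = sm.GLM(y[keep], X[keep], family=sm.families.Poisson()).fit()
adapted_estimate = model.predict(X)                    # drift-adapted estimates
print(model.params)
```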

7.2.1 Experimental evaluation and comparing Surrogate Sources

The proposed method, outlined in the previous section, can capture drifts using aggregated surrogate activity. Similar to Part I and Part II, we intend to compare different surrogate sources for their drift correction ability. Our results from Part II have indicated that long-term forecasts for Mexico were especially noisy. As such, we focus on Mexico for the following assay. We applied our drift adaptation framework for the season 2014-2015 and Table 7.1 presents our findings. As can be seen, incorporation of surrogate sources via drift adapters significantly improves forecasting accuracy. GST contributes most significantly towards the drift adaptation, while a combination of all sources produces the best overall forecasting accuracy. Significant drift adaptation could also be seen for HealthMap; however, the absolute value of forecasting accuracy renders the HealthMap source insignificant for the country of interest.

Table 7.1: Comparison of surrogate sources pre- and post-drift adaptation.

Source     | Pre Drift Correction | Post Drift Correction | Percentage correction
GST        | 2.801                | 3.125                 | 10.372
GFT        | 2.533                | 2.741                 | 7.561
HealthMap  | 1.364                | 1.815                 | 24.874
Weather    | 2.242                | 2.499                 | 10.271
All        | 3.082                | 3.496                 | 11.836

We also plot the quality score and deviance plots for pre- and post-drift-corrected forecasts for GFT, GST, HealthMap, Weather, and all sources in Figures 7.3, 7.4, 7.5, 7.6, and 7.7, respectively. As can be seen, the quality score distribution of forecasts shows a marked improvement, both in terms of higher absolute value and tighter bounds, for post-drift-corrected models. The figures also show the distribution of residual deviance. In terms of concept drift, a narrow distribution indicates a well-fitted problem and hence better drift correction, whereas a more spread-out deviance distribution indicates a sub-optimal correction. The deviance plots also exhibit the efficacy of our methods; especially GST and the combined sources show marked improvement, indicating these are the best methods of correcting for drift.

7.3 Discussion

We have proposed a computationally inexpensive method of drift adaptation for disease sources for Mexico. Our results indicate that significant improvement in forecasting, as well as modeling, accuracy could be achieved by including surrogates via the proposed framework. Furthermore, a combination of all sources performs best in terms of drift adaptation, thus exhibiting the importance of considering diverse sources. In the future, we will extend this analysis to more regions and ascertain the relative importance of such sources w.r.t. the regions.

Figure 7.3: Drift Adaptation for Mexico using GFT. (a) Quality score distribution of forecasts before and after drift correction; (b) residual deviance distributions (drift-uncorrected residual deviance: 0.829; drift-corrected: 0.469).

Figure 7.4: Drift Adaptation for Mexico using GST. (a) Quality score distribution of forecasts before and after drift correction; (b) residual deviance distributions (drift-uncorrected residual deviance: -0.398; drift-corrected: -0.250).

Figure 7.5: Drift Adaptation for Mexico using HealthMap. (a) Quality score distribution of forecasts before and after drift correction; (b) residual deviance distributions (drift-uncorrected residual deviance: 6.201; drift-corrected: 3.650).

Figure 7.6: Drift Adaptation for Mexico using weather sources. (a) Quality score distribution of forecasts before and after drift correction; (b) residual deviance distributions (drift-uncorrected residual deviance: -2.556; drift-corrected: -1.204).

Figure 7.7: Drift Adaptation for Mexico using all sources. (a) Quality score distribution of forecasts before and after drift correction; (b) residual deviance distributions (drift-uncorrected residual deviance: -0.158; drift-corrected: -0.001).

Chapter 8

Conclusion

We have presented the problem of time series prediction using surrogates and motivated our efforts by examining the particular case of influenza forecasting. We identified three major thrusts for this problem, viz. (i) short-term forecasting, (ii) long-term forecasting, and (iii) concept drift. We presented our approaches for each of these thrusts in this thesis and communicated our findings in [12, 62, 13]. Our results showcase the efficacy of using surrogates to forecast disease characteristics. In the following sections, we discuss the importance of surrogate information available from open source indicators for public health surveillance and conclude with some key insights on how such surrogates can be used towards an integrated surveillance mechanism.

8.1 Importance of Open Source Indicators for Public Health

Our results indicate that open source indicators (OSI) are extremely useful for forecasting various facets of disease characteristics, such as peak intensity and case counts in the short term. One of the key advantages of using surrogates can be attributed to the real-time nature of such sources as well as their ready availability. However, such surrogates are in general noisy and may exhibit changing relationships with the disease characteristics of interest. For example, the volume of search queries for the term 'flu' may have been more indicative of ILI case counts in the population for the years preceding 2011 than post-2012, for the United States. Thus, this work motivates the use of algorithms that in principle are aware of the possibility of such changing patterns and, more importantly, are adaptable to such circumstances. In general, surrogate sources, especially the non-physical ones, can be considered to be 'sensors' of disease spread in the population rather than actual indicators of disease characteristics. Surrogates from a particular source (see Part I) may contain information about a certain stage of the disease spread more than others. For example, Figure 8.1 indicates that disease keywords from the HealthMap news corpus are more indicative during the start of the season, whereas search query volumes as accessed by Google Search Trends exhibit a sub-optimal but stable correlation throughout the season. Thus a single OSI source may not be suitable for robust disease forecasting. However, as seen from Part I and Part II, combining multiple surrogates can lead to a more robust and stable forecasting framework. It can be argued that multiple surrogates may provide better coverage over the different stages of the season. Also, noise such as spikes in search query activity may be better compensated for by using a variety of OSI sources, and a consensus of increased/decreased activity may better inform a forecasting framework.

Another crucial aspect of public health surveillance is the fact that the 'ground truth' information available at a particular point of time is subject to noise. Consequently, models for disease forecasting should be aware of such noise, which can often be systematic. Such flexibility in modeling is more important while using OSI sources, as such sources themselves may be subject to noise. In this work, we have shown that forecasting with an ability to model the surveillance uncertainty increases the final forecasting accuracy manifold.

This work has focused more on influenza as a primer for endemic disease forecasting. One of the key advantages of using influenza as an application is the fact that it is one of the most common infectious diseases worldwide, exhibiting evolving patterns over regions and time, and, more importantly, has significant public health impact. In this work, we have found physical sources such as temperature and humidity to be more useful in forecasting influenza conditions in the population. Non-physical sources were found to contribute to the overall forecasting accuracy to varied degrees w.r.t. countries and disease characteristics. For short-term forecasts, Twitter chatter and disease-related news were found to be significantly useful for a number of countries of interest. For long-term forecasts, surrogates were found to be more useful at the initial stages of the disease season, where disease information from traditional surveillance is sparse and more noisy. For the later part of the season, simulation-based models worked better than data assimilation models, and the inclusion of surrogates improved the overall forecasting accuracy to a lesser degree.

8.2 Guidelines for using surrogates for Health Surveillance

In this section, we combine the insights presented in the previous section with our experience on forecasting infectious diseases into a list of guidelines that may be followed while using surrogates for disease surveillance, as follows:

• Surrogates are more useful for forecasting diseases in regions where historical data for such sources, as well as surveillance data for the said diseases, are available for at least a few disease seasons. For emerging diseases, such surrogates may still be useful but may require different stochastic models than the regression/assimilation-based models presented in this work.

Figure 8.1: Correlation of surrogate sources with disease incidence. Counts of influenza-related keywords from (a) HealthMap and (b) GST compared against influenza case counts for Argentina as available from PAHO. HealthMap keywords capture the start of the season more accurately, while GST keywords exhibit a sub-optimal but consistent correlation with PAHO counts.

• As discussed in Section 8.3, multiple surrogate sources are more useful for disease forecasting than a single source. In general, a variety of heterogeneous sources, such as Twitter chatter and news, which may encode different information about the disease spread, should be used rather than a number of homogeneous sources, such as disease news information from multiple sources.

• The use of OSI must also be justified by being aware of spurious correlations, and the usability of such sources in a real-time system must be backed by both quantitative (such as forecasting accuracy) as well as qualitative (such as expert insights) measures. For example, the use of search queries as surrogates for influenza forecasting can be justified by improved forecasting performance as well as by recognizing the fact that people may search for flu symptoms and remedies as possible measures of self-diagnosis.

• Finally, surrogates that are available at a regular and steady interval with wide coverage are preferable to other sources that may be available only at sporadic intervals. For example, Twitter chatter, albeit noisy, is preferable from a public health surveillance standpoint to other sources such as telephone surveys.

8.3 Future Work

The research presented in this thesis has been mainly aimed at disease forecasting using data from a single region of interest. In the future, we aim to generate spatially aware models such that the increase/decrease of disease incidence in neighboring regions can be used to modulate forecasts for the region of interest. Furthermore, we will expand the frameworks presented in this thesis to simultaneously study multiple diseases with underlying similarities, either via common transmission methods (such as Dengue and Chikungunya) or via similar exposed populations. Such combined studies can lead to more robust forecasts, especially for diseases with sparse data (such as Chikungunya), and provide a deeper understanding of the spread of such diseases.

Bibliography

[1] R. P. Adams and D. J. MacKay. Bayesian Online Changepoint Detection. arXiv preprint arXiv:0710.3742, 2007.
[2] J. L. Anderson. An Ensemble Adjustment Kalman Filter for Data Assimilation. Monthly Weather Review, 129(12):2884–2903, 2001.
[3] A. Apolloni, V. A. Kumar, M. V. Marathe, and S. Swarup. Computational Epidemiology in a Connected World. Computer, 42(12):0083–86, 2009.
[4] M. T. Bahadori, Y. Liu, and E. P. Xing. Fast Structure Learning in Generalized Stochastic Processes with Latent Factors. In Proceedings of KDD '13, 2013.
[5] K. R. Bisset, J. Chen, X. Feng, V. Kumar, and M. V. Marathe. EpiFast: A Fast Algorithm for Large Scale Realistic Epidemic Simulations on Distributed Memory Systems. In Proceedings of the ICS '09, 2009.
[6] D. Butler. When Google got Flu Wrong. Nature, 494(7436):155, 2013.
[7] J. Canny. Collaborative Filtering with Privacy via Factor Analysis. In Proceedings of SIGIR '02, pages 238–245, 2002.
[8] B. P. Carlin, A. E. Gelfand, and A. F. Smith. Hierarchical Bayesian Analysis of Changepoint Problems. Applied Statistics, pages 389–405, 1992.
[9] G. Casella and R. L. Berger. Statistical Inference, volume 2. Duxbury Pacific Grove, CA, 2002.
[10] CDC. Influenza (flu). www.cdc.gov/flu/index.htm. Accessed: 2015-09-17.
[11] P. Chakraborty. U.S. Flu Forecasting 2014 - SciCast. https://scicast.org/flu. Last Accessed: 2015-02-20.
[12] P. Chakraborty, P. Khadivi, B. Lewis, A. Mahendiran, J. Chen, P. Butler, E. O. Nsoesie, S. R. Mekaru, J. S. Brownstein, M. V. Marathe, and N. Ramakrishnan. Forecasting a Moving Target: Ensemble Models for ILI Case Count Predictions. In Proceedings of the 2014 SIAM International Conference on Data Mining, Philadelphia, Pennsylvania, USA, April 24-26, 2014, pages 262–270, 2014.


[13] P. Chakraborty, S. Muthiah, R. Tandon, and N. Ramakrishnan. Hierarchical Quickest Change Detection via Surrogates. arXiv preprint arXiv:1603.09739, 2016.

[14] Y. Chen, D. Pavlov, and J. F. Canny. Large-Scale Behavioral Targeting. In Proceedings of KDD ’09, 2009.

[15] C. Chew and G. Eysenbach. Pandemics in the Age of Twitter: Content Analysis of Tweets during the 2009 H1N1 Outbreak. PLOS One, 5(11):e14118, 2013.

[16] N. Chopin, P. E. Jacob, and O. Papaspiliopoulos. SMC2: An Efficient Algorithm for Sequential Analysis of State Space Models. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 75(3):397–426, 2013.

[17] J. L. Crassidis and J. L. Junkins. Optimal estimation of dynamic systems. CRC press, 2011.

[18] P. Del Moral. Non-linear Filtering: Interacting Particle Resolution. Markov processes and related fields, 2(4):555–581, 1996.

[19] A. Dessein and A. Cont. Online Change Detection in Exponential Families with Unknown Parameters. In F. Nielsen and F. Barbaresco, editors, Geometric Science of Information, volume 8085 of Lecture Notes in Computer Science, pages 633–640. Springer Berlin Heidelberg, 2013.

[20] A. Doucet and A. M. Johansen. A Tutorial on Particle Filtering and Smoothing: Fifteen Years Later. Handbook of Nonlinear Filtering, 12:656–704, 2009.

[21] G. Evensen. The Ensemble Kalman Filter: Theoretical Formulation and Practical Implementation. Ocean Dynamics, 53(4):343–367, 2003.

[22] K. Fokianos, A. Rahbek, and D. Tjøstheim. Poisson Autoregression. Journal of the American Statistical Association, 104(488):1430–1439, 2009.

[23] J. Gama, I. Žliobaitė, A. Bifet, M. Pechenizkiy, and A. Bouchachia. A Survey on Concept Drift Adaptation. ACM Computing Surveys (CSUR), 46(4):44, 2014.

[24] J. Ginsberg, M. H. Mohebbi, R. S. Patel, L. Brammer, M. S. Smolinski, and L. Brilliant. Detecting Influenza Epidemics using Search Engine Query Data. Nature, 457(7232):1012–1014, 2008.

[25] K. S. Hickmann, G. Fairchild, R. Priedhorsky, N. Generous, J. M. Hyman, A. Deshpande, and S. Y. Del Valle. Forecasting the 2013–2014 Influenza Season using Wikipedia. arXiv preprint arXiv:1410.7716, 2014.

[26] R. E. Kalman. A New Approach to Linear Filtering and Prediction Problems. Journal of Fluids Engineering, 82(1):35–45, 1960.

[27] K. Denecke, P. Dolog, and P. Smrz. Making Use of Social Media Data in Public Health. In Proceedings of WWW ’12, pages 243–246, 2012.

[28] Y. Koren. Factorization Meets the Neighborhood: A Multifaceted Collaborative Filtering Model. In Proceedings of KDD ’08, pages 426–434, 2008.

[29] P. Kostkova. A Roadmap to Integrated Digital Public Health Surveillance: The Vision and The Challenges. In Proceedings of WWW ’13, pages 687–694, 2013.

[30] T. L. Lai. Sequential Changepoint Detection in Quality Control and Dynamical Systems. Journal of the Royal Statistical Society. Series B (Methodological), pages 613–658, 1995.

[31] T. L. Lai and H. Xing. Sequential Change-point Detection when the pre-and post-change Parameters are Unknown. Sequential Analysis, 29(2):162–175, 2010.

[32] D. Lazer, R. Kennedy, G. King, and A. Vespignani. Google Flu Trends Still Appears Sick: An Evaluation of the 2013-2014 Flu Season. Available at SSRN 2408560, 2014.

[33] D. Lazer, R. Kennedy, G. King, and A. Vespignani. The Parable of Google Flu: Traps in Big Data Analysis. Science, 343(6176):1203–1205, 2014.

[34] K. Lee, A. Agrawal, and A. Choudhary. Real-Time Disease Surveillance using Twitter Data: Demonstration on Flu and Cancer. In Proceedings of KDD ’13, pages 1474–1477, 2013.

[35] S. Liu, M. Yamada, N. Collier, and M. Sugiyama. Change-point Detection in Time-series Data by Relative Density-Ratio Estimation. Neural Networks, 43:72–83, 2013.

[36] Y. Liu, M. T. Bahadori, and H. Li. Sparse-GEV: Sparse latent space model for multi- variate extreme value time series modelling. In Proceedings of ICML ’12, 2012.

[37] D. Livings. Aspects of the Ensemble Kalman Filter. Master’s thesis, University of Reading, 2005.

[38] D. J. McIver and J. S. Brownstein. Wikipedia Usage Estimates Prevalence of Influenza- Like Illness in the United States in Near Real-Time. PLOS Computational Biology, 10(4):e1003581, 04 2014.

[39] N. Kanhabua and W. Nejdl. Understanding the Diversity of Tweets in the Time of Outbreaks. In Proceedings of WWW ’13, pages 1335–1342, 2013.

[40] E. Nsoesie, M. Marathe, and J. Brownstein. Forecasting Peaks of Seasonal Influenza Epidemics. PLOS Currents, 5, 2013.

[41] E. O. Nsoesie, D. L. Buckeridge, and J. S. Brownstein. Who’s Not Coming to Dinner? Evaluating Trends in Online Restaurant Reservations for Outbreak Surveillance. Online Journal of Public Health Informatics, 5(1), 2013.

[42] H. Ohlsson, L. Ljung, and S. Boyd. Segmentation of ARX-models using Sum-Of-Norms Regularization. Automatica, 46:1107–1111, 2010.

[43] E. Page. Continuous Inspection Schemes. Biometrika, pages 100–115, 1954.

[44] PAHO. Influenza and other Respiratory Viruses. http://ais.paho.org/phip/viz/ed_flu.asp. Accessed: 2015-09-01.

[45] I. Painter, J. Eaton, and B. Lober. Using Change Point Detection for Monitoring the Quality of Aggregate Data. Online journal of public health informatics, 5(1), 2013.

[46] M. J. Paul, M. Dredze, and D. Broniatowski. Twitter improves influenza forecasting. PLOS Currents, 6, 2014.

[47] M. K. Pitt and N. Shephard. Filtering via simulation: Auxiliary particle filters. Journal of the American statistical association, 94(446):590–599, 1999.

[48] N. Ramakrishnan, P. Butler, S. Muthiah, et al. ‘Beating the News’ with EMBERS: Forecasting Civil Unrest Using Open Source Indicators. In Proceedings of the 20th ACM SIGKDD, KDD, pages 1799–1808, New York, NY, USA, 2014. ACM.

[49] J. Shaman, E. Goldstein, and M. Lipsitch. Absolute Humidity and Pandemic Versus Epidemic Influenza. American journal of epidemiology, 173(2):127–135, 2010.

[50] J. Shaman and A. Karspeck. Forecasting Seasonal Outbreaks of Influenza. Proceedings of the National Academy of Sciences, 109(50):20425–20430, 2012.

[51] J. Shaman, V. E. Pitzer, C. Viboud, B. T. Grenfell, and M. Lipsitch. Absolute humidity and the seasonal onset of influenza in the continental United States. PLOS Biology, 8(2):e1000316, 2010.

[52] W. A. Shewhart. The Application of Statistics as an Aid in Maintaining Quality of a Manufactured Product. Journal of the American Statistical Association, 20(152):546–548, 1925.

[53] A. N. Shiryaev. On Optimum Methods in Quickest Detection Problems. Theory of Probability & Its Applications, 8(1):22–46, 1963.

[54] D. Siegmund and E. Venkatraman. Using the Generalized Likelihood Ratio Statistic for Sequential Detection of a Change-point. The Annals of Statistics, pages 255–271, 1995.

[55] D. Simon. Kalman Filtering with State Constraints: A Survey of Linear and Nonlinear Algorithms. IET Control Theory & Applications, 4:1303–1318(15), August 2010.

[56] R. Sugumaran and J. Voss. Real-Time Spatio-Temporal Analysis of West Nile Virus using Twitter Data. In Proceedings of COM.Geo ’12, pages 1335–1342, 2012.

[57] J. D. Tamerius, J. Shaman, W. J. Alonso, K. Bloom-Feshbach, C. K. Uejio, A. Comrie, and C. Viboud. Environmental Predictors of Seasonal Influenza Epidemics across Temperate and Tropical Climates. PLOS Pathog., 9(3):68–72, 2013.

[58] M. Tizzoni, P. Bajardi, C. Poletto, J. J. Ramasco, D. Balcan, B. Gonçalves, N. Perra, V. Colizza, and A. Vespignani. Real-Time Numerical Forecast of Global Epidemic Spreading: Case Study of 2009 A/H1N1pdm. BMC Medicine, 10(1):165, 2012.

[59] V. V. Veeravalli and T. Banerjee. Quickest Change Detection. Academic Press Library in Signal Processing: Array and Statistical Signal Processing, 3:209–256, 2013.

[60] A. Wald. Sequential tests of Statistical Hypotheses. The Annals of Mathematical Statis- tics, 16(2):117–186, 1945.

[61] X. Wang and C. H. Bishop. A Comparison of Breeding and Ensemble Transform Kalman Filter Ensemble Forecast Schemes. Journal of the Atmospheric Sciences, 60(9):1140–1158, 2003.

[62] Z. Wang, P. Chakraborty, S. R. Mekaru, J. S. Brownstein, J. Ye, and N. Ramakrishnan. Dynamic Poisson Autoregression for Influenza-Like-Illness Case Count Prediction. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discov- ery and Data Mining, KDD ’15, pages 1285–1294, New York, NY, USA, 2015. ACM.

[63] G. Welch and G. Bishop. An Introduction to the Kalman Filter. Department of Computer Science, University of North Carolina, 2006.

[64] WHO. Surveillance and Monitoring. http://www.who.int/influenza/surveillance_monitoring/en/. Accessed: 2015-09-17.

[65] W. Yang, S. Elankumaran, and L. C. Marr. Relationship between humidity and influenza A viability in droplets and implications for influenza’s seasonality. PLOS One, 7(10):e46789, 2012.

[66] W. Yang, A. Karspeck, and J. Shaman. Comparison of filtering methods for the modeling and retrospective forecasting of influenza epidemics. PLOS Computational Biology, 10(4):e1003583, 2014.

[67] Q. Yuan, E. O. Nsoesie, B. Lv, G. Peng, R. Chunara, and J. S. Brownstein. Monitoring Influenza Epidemics in China with Search Query from Baidu. PLOS One, 8(5):e64323, 2013.

Appendix A

Data Assimilation: Detailed Performance

In this appendix, we present the detailed performance of the data assimilation model presented in Chapter 5, with respect to several seasonal characteristics, using different sources individually as well as in a combined manner. The metrics are presented in Table A.1. The quality score is used to evaluate the value metrics (peak value and season value), while the number of days offset is used to evaluate the date metrics (start date, peak date, and end date). The combined sources show the best performance overall.
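For concreteness, the scores in Table A.1 are consistent with a value-metric quality score of the form 4·min(actual, predicted)/max(actual, predicted) and a date metric scored by the absolute offset between the actual and predicted (week-indexed) dates. The snippet below is a minimal sketch of these computations under that assumption; the exact metric definitions are given in the main text, and the function names here are illustrative only.

```python
# Minimal sketch of the evaluation metrics summarized in Table A.1 (illustrative only).
# Assumption: the value-metric quality score is 4 * min(actual, predicted) / max(actual, predicted),
# which is consistent with the tabulated scores (e.g., 4 * 74 / 215.983 ~= 1.370 for the
# BO peak value at current week 4), and the date metrics are scored by the absolute
# offset between the actual and predicted (week-indexed) dates.

def quality_score(actual: float, predicted: float) -> float:
    """Score in [0, 4]; 4 corresponds to a perfect match of the value metric."""
    if actual <= 0 or predicted <= 0:
        return 0.0
    return 4.0 * min(actual, predicted) / max(actual, predicted)

def date_offset(actual_week: float, predicted_week: float) -> float:
    """Absolute offset between the actual and predicted (week-indexed) dates."""
    return abs(actual_week - predicted_week)

if __name__ == "__main__":
    # Example rows from Table A.1 (country BO, current week 4).
    print(round(quality_score(74.0, 215.983), 3))  # peak value -> ~1.370
    print(date_offset(36.0, 52.0))                 # end date   -> 16.0
```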

Table A.1: Performance of data assimilation methods using different surrogate sources w.r.t. seasonal characteristics. Each row lists, in order: Metric, Current week, Country, Source, Actual, Predicted, and Score; the Metric, Current week, and Country labels are shown only where they change.

Actual Predicted Score Metric Current week Country Source end date 4 BO Weather 36.0 52.000 16.000 GFT 36.0 52.000 16.000 GST 36.0 52.000 16.000 HealthMap 36.0 47.455 11.455 Twitter 36.0 52.000 16.000 Merged 36.0 52.000 16.000 CL Weather 28.0 42.000 14.000 GFT 28.0 42.000 14.000 GST 28.0 42.000 14.000 HealthMap 28.0 42.000 14.000 Twitter 28.0 42.000 14.000 Merged 28.0 42.000 14.000 Continued on next page


Actual Predicted Score Metric Current week Country Source MX Weather 11.0 46.000 35.000 GFT 11.0 46.000 35.000 GST 11.0 46.000 35.000 HealthMap 11.0 46.000 35.000 Twitter 11.0 46.000 35.000 Merged 11.0 46.000 35.000 PE Weather 28.0 42.000 14.000 GFT 28.0 42.000 14.000 GST 28.0 42.000 14.000 HealthMap 28.0 35.000 7.000 Twitter 28.0 42.000 14.000 Merged 28.0 42.273 14.273 5 BO Weather 36.0 52.000 16.000 GFT 36.0 52.000 16.000 GST 36.0 52.000 16.000 HealthMap 36.0 47.000 11.000 Twitter 36.0 52.000 16.000 Merged 36.0 52.000 16.000 CL Weather 28.0 42.000 14.000 GFT 28.0 42.000 14.000 GST 28.0 42.000 14.000 HealthMap 28.0 42.000 14.000 Twitter 28.0 42.000 14.000 Merged 28.0 42.000 14.000 MX Weather 11.0 46.000 35.000 GFT 11.0 46.000 35.000 GST 11.0 46.000 35.000 HealthMap 11.0 46.000 35.000 Twitter 11.0 46.000 35.000 Merged 11.0 46.000 35.000 PE Weather 28.0 42.000 14.000 GFT 28.0 44.000 16.000 GST 28.0 42.000 14.000 HealthMap 28.0 37.545 9.545 Twitter 28.0 42.000 14.000 Merged 28.0 42.727 14.727 6 BO Weather 36.0 52.000 16.000 GFT 36.0 52.000 16.000 Continued on next page 94

Actual Predicted Score Metric Current week Country Source GST 36.0 52.000 16.000 HealthMap 36.0 47.000 11.000 Twitter 36.0 52.000 16.000 Merged 36.0 52.000 16.000 CL Weather 28.0 42.000 14.000 GFT 28.0 42.000 14.000 GST 28.0 42.000 14.000 HealthMap 28.0 40.636 12.636 Twitter 28.0 42.000 14.000 Merged 28.0 42.000 14.000 MX Weather 11.0 46.000 35.000 GFT 11.0 46.000 35.000 GST 11.0 46.000 35.000 HealthMap 11.0 46.273 35.273 Twitter 11.0 46.000 35.000 Merged 11.0 46.000 35.000 PE Weather 28.0 42.000 14.000 GFT 28.0 42.182 14.182 GST 28.0 42.000 14.000 HealthMap 28.0 35.000 7.000 Twitter 28.0 42.000 14.000 Merged 28.0 43.000 15.000 7 BO Weather 36.0 52.000 16.000 GFT 36.0 52.000 16.000 GST 36.0 52.000 16.000 HealthMap 36.0 47.000 11.000 Twitter 36.0 52.000 16.000 Merged 36.0 52.000 16.000 CL Weather 28.0 42.000 14.000 GFT 28.0 42.000 14.000 GST 28.0 42.000 14.000 HealthMap 28.0 41.000 13.000 Twitter 28.0 42.000 14.000 Merged 28.0 42.000 14.000 MX Weather 11.0 46.000 35.000 GFT 11.0 46.000 35.000 GST 11.0 46.000 35.000 HealthMap 11.0 46.545 35.545 Continued on next page 95

Actual Predicted Score Metric Current week Country Source Twitter 11.0 46.000 35.000 Merged 11.0 46.000 35.000 PE Weather 28.0 42.000 14.000 GFT 28.0 42.091 14.091 GST 28.0 42.000 14.000 HealthMap 28.0 35.000 7.000 Twitter 28.0 42.000 14.000 Merged 28.0 43.000 15.000 8 BO Weather 36.0 52.000 16.000 GFT 36.0 52.000 16.000 GST 36.0 52.000 16.000 HealthMap 36.0 47.000 11.000 Twitter 36.0 52.000 16.000 Merged 36.0 52.000 16.000 CL Weather 28.0 42.000 14.000 GFT 28.0 42.000 14.000 GST 28.0 42.000 14.000 HealthMap 28.0 42.000 14.000 Twitter 28.0 42.000 14.000 Merged 28.0 42.000 14.000 MX Weather 11.0 46.000 35.000 GFT 11.0 45.545 34.545 GST 11.0 46.000 35.000 HealthMap 11.0 46.000 35.000 Twitter 11.0 46.000 35.000 Merged 11.0 46.000 35.000 PE Weather 28.0 42.000 14.000 GFT 28.0 42.000 14.000 GST 28.0 42.000 14.000 HealthMap 28.0 35.000 7.000 Twitter 28.0 42.000 14.000 Merged 28.0 43.000 15.000 9 BO Weather 36.0 52.000 16.000 GFT 36.0 52.000 16.000 GST 36.0 52.000 16.000 HealthMap 36.0 47.000 11.000 Twitter 36.0 52.000 16.000 Merged 36.0 52.000 16.000 Continued on next page 96

Actual Predicted Score Metric Current week Country Source CL Weather 28.0 42.000 14.000 GFT 28.0 42.000 14.000 GST 28.0 42.000 14.000 HealthMap 28.0 41.545 13.545 Twitter 28.0 42.000 14.000 Merged 28.0 42.000 14.000 MX Weather 11.0 46.000 35.000 GFT 11.0 44.909 33.909 GST 11.0 46.000 35.000 HealthMap 11.0 46.000 35.000 Twitter 11.0 46.000 35.000 Merged 11.0 46.000 35.000 PE Weather 28.0 42.000 14.000 GFT 28.0 42.000 14.000 GST 28.0 43.000 15.000 HealthMap 28.0 35.000 7.000 Twitter 28.0 42.000 14.000 Merged 28.0 43.000 15.000 peak date 4 BO Weather 22.0 39.000 17.000 GFT 22.0 39.000 17.000 GST 22.0 39.000 17.000 HealthMap 22.0 24.000 2.000 Twitter 22.0 39.000 17.000 Merged 22.0 39.000 17.000 CL Weather 19.0 17.000 2.000 GFT 19.0 19.000 0.000 GST 19.0 19.000 0.000 HealthMap 19.0 16.818 2.182 Twitter 19.0 17.000 2.000 Merged 19.0 19.000 0.000 MX Weather 4.0 32.000 28.000 GFT 4.0 32.000 28.000 GST 4.0 32.000 28.000 HealthMap 4.0 30.000 26.000 Twitter 4.0 32.000 28.000 Merged 4.0 32.000 28.000 PE Weather 25.0 21.000 4.000 GFT 25.0 21.000 4.000 Continued on next page 97

Actual Predicted Score Metric Current week Country Source GST 25.0 21.000 4.000 HealthMap 25.0 21.000 4.000 Twitter 25.0 21.000 4.000 Merged 25.0 21.000 4.000 5 BO Weather 22.0 39.000 17.000 GFT 22.0 39.000 17.000 GST 22.0 39.000 17.000 HealthMap 22.0 24.000 2.000 Twitter 22.0 39.000 17.000 Merged 22.0 39.000 17.000 CL Weather 19.0 17.000 2.000 GFT 19.0 15.000 4.000 GST 19.0 19.000 0.000 HealthMap 19.0 15.000 4.000 Twitter 19.0 17.000 2.000 Merged 19.0 19.000 0.000 MX Weather 4.0 32.000 28.000 GFT 4.0 32.000 28.000 GST 4.0 32.000 28.000 HealthMap 4.0 30.000 26.000 Twitter 4.0 32.000 28.000 Merged 4.0 32.000 28.000 PE Weather 25.0 21.000 4.000 GFT 25.0 22.000 3.000 GST 25.0 21.000 4.000 HealthMap 25.0 21.000 4.000 Twitter 25.0 21.000 4.000 Merged 25.0 21.000 4.000 6 BO Weather 22.0 39.000 17.000 GFT 22.0 39.000 17.000 GST 22.0 39.000 17.000 HealthMap 22.0 24.000 2.000 Twitter 22.0 39.000 17.000 Merged 22.0 39.000 17.000 CL Weather 19.0 17.000 2.000 GFT 19.0 15.000 4.000 GST 19.0 19.000 0.000 HealthMap 19.0 16.000 3.000 Continued on next page 98

Actual Predicted Score Metric Current week Country Source Twitter 19.0 17.000 2.000 Merged 19.0 19.000 0.000 MX Weather 4.0 32.000 28.000 GFT 4.0 32.000 28.000 GST 4.0 32.000 28.000 HealthMap 4.0 30.545 26.545 Twitter 4.0 32.000 28.000 Merged 4.0 32.000 28.000 PE Weather 25.0 21.000 4.000 GFT 25.0 21.000 4.000 GST 25.0 21.000 4.000 HealthMap 25.0 21.000 4.000 Twitter 25.0 21.000 4.000 Merged 25.0 21.000 4.000 7 BO Weather 22.0 39.000 17.000 GFT 22.0 39.000 17.000 GST 22.0 39.000 17.000 HealthMap 22.0 24.000 2.000 Twitter 22.0 39.000 17.000 Merged 22.0 39.000 17.000 CL Weather 19.0 17.000 2.000 GFT 19.0 15.000 4.000 GST 19.0 19.000 0.000 HealthMap 19.0 17.000 2.000 Twitter 19.0 17.000 2.000 Merged 19.0 19.000 0.000 MX Weather 4.0 32.000 28.000 GFT 4.0 32.000 28.000 GST 4.0 32.000 28.000 HealthMap 4.0 30.000 26.000 Twitter 4.0 32.000 28.000 Merged 4.0 32.000 28.000 PE Weather 25.0 21.000 4.000 GFT 25.0 21.000 4.000 GST 25.0 21.000 4.000 HealthMap 25.0 21.000 4.000 Twitter 25.0 21.000 4.000 Continued on next page 99

Actual Predicted Score Metric Current week Country Source Merged 25.0 21.000 4.000 8 BO Weather 22.0 39.000 17.000 GFT 22.0 39.000 17.000 GST 22.0 39.000 17.000 HealthMap 22.0 24.000 2.000 Twitter 22.0 39.000 17.000 Merged 22.0 39.000 17.000 CL Weather 19.0 17.000 2.000 GFT 19.0 15.000 4.000 GST 19.0 19.000 0.000 HealthMap 19.0 17.000 2.000 Twitter 19.0 17.000 2.000 Merged 19.0 19.000 0.000 MX Weather 4.0 32.000 28.000 GFT 4.0 32.000 28.000 GST 4.0 32.000 28.000 HealthMap 4.0 30.000 26.000 Twitter 4.0 32.000 28.000 Merged 4.0 32.000 28.000 PE Weather 25.0 21.000 4.000 GFT 25.0 21.000 4.000 GST 25.0 21.000 4.000 HealthMap 25.0 21.000 4.000 Twitter 25.0 21.000 4.000 Merged 25.0 21.000 4.000 9 BO Weather 22.0 39.000 17.000 GFT 22.0 39.000 17.000 GST 22.0 39.000 17.000 HealthMap 22.0 24.000 2.000 Twitter 22.0 39.000 17.000 Merged 22.0 39.000 17.000 CL Weather 19.0 17.000 2.000 GFT 19.0 15.000 4.000 GST 19.0 19.000 0.000 HealthMap 19.0 17.000 2.000 Twitter 19.0 17.000 2.000 Merged 19.0 19.000 0.000 MX Weather 4.0 32.000 28.000 Continued on next page 100

Actual Predicted Score Metric Current week Country Source GFT 4.0 32.000 28.000 GST 4.0 32.000 28.000 HealthMap 4.0 30.000 26.000 Twitter 4.0 32.000 28.000 Merged 4.0 32.000 28.000 PE Weather 25.0 21.000 4.000 GFT 25.0 21.000 4.000 GST 25.0 21.000 4.000 HealthMap 25.0 21.000 4.000 Twitter 25.0 21.000 4.000 Merged 25.0 21.000 4.000 peak val 4 BO Weather 74.0 215.983 1.370 GFT 74.0 141.080 2.098 GST 74.0 154.167 1.920 HealthMap 74.0 48.998 2.649 Twitter 74.0 178.928 1.654 Merged 74.0 145.784 2.030 CL Weather 1004.0 828.863 3.302 GFT 1004.0 882.475 3.516 GST 1004.0 864.912 3.446 HealthMap 1004.0 849.428 3.384 Twitter 1004.0 889.066 3.542 Merged 1004.0 876.892 3.494 MX Weather 25.0 690.645 0.145 GFT 25.0 721.851 0.139 GST 25.0 701.918 0.142 HealthMap 25.0 643.583 0.155 Twitter 25.0 668.093 0.150 Merged 25.0 693.189 0.144 PE Weather 83.0 112.651 2.947 GFT 83.0 89.895 3.693 GST 83.0 118.909 2.792 HealthMap 83.0 103.134 3.219 Twitter 83.0 91.843 3.615 Merged 83.0 123.781 2.682 5 BO Weather 74.0 183.187 1.616 GFT 74.0 135.428 2.186 GST 74.0 149.081 1.985 Continued on next page 101

Actual Predicted Score Metric Current week Country Source HealthMap 74.0 46.160 2.495 Twitter 74.0 182.980 1.618 Merged 74.0 144.726 2.045 CL Weather 1004.0 831.180 3.311 GFT 1004.0 812.141 3.236 GST 1004.0 858.648 3.421 HealthMap 1004.0 844.272 3.364 Twitter 1004.0 888.098 3.538 Merged 1004.0 876.708 3.493 MX Weather 25.0 706.795 0.141 GFT 25.0 716.962 0.139 GST 25.0 706.077 0.142 HealthMap 25.0 601.926 0.166 Twitter 25.0 668.791 0.150 Merged 25.0 689.063 0.145 PE Weather 83.0 98.817 3.360 GFT 83.0 72.951 3.516 GST 83.0 117.118 2.835 HealthMap 83.0 120.097 2.764 Twitter 83.0 91.362 3.634 Merged 83.0 125.151 2.653 6 BO Weather 74.0 164.075 1.804 GFT 74.0 132.266 2.238 GST 74.0 148.606 1.992 HealthMap 74.0 44.203 2.389 Twitter 74.0 177.193 1.670 Merged 74.0 144.096 2.054 CL Weather 1004.0 831.363 3.312 GFT 1004.0 811.911 3.235 GST 1004.0 857.130 3.415 HealthMap 1004.0 844.386 3.364 Twitter 1004.0 887.248 3.535 Merged 1004.0 876.766 3.493 MX Weather 25.0 697.397 0.143 GFT 25.0 717.214 0.139 GST 25.0 710.721 0.141 HealthMap 25.0 574.125 0.174 Twitter 25.0 669.851 0.149 Continued on next page 102

Actual Predicted Score Metric Current week Country Source Merged 25.0 687.034 0.146 PE Weather 83.0 112.842 2.942 GFT 83.0 85.031 3.904 GST 83.0 117.471 2.826 HealthMap 83.0 158.475 2.095 Twitter 83.0 90.194 3.681 Merged 83.0 129.150 2.571 7 BO Weather 74.0 164.737 1.797 GFT 74.0 131.936 2.244 GST 74.0 148.456 1.994 HealthMap 74.0 43.403 2.346 Twitter 74.0 180.890 1.636 Merged 74.0 145.478 2.035 CL Weather 1004.0 834.666 3.325 GFT 1004.0 810.008 3.227 GST 1004.0 856.240 3.411 HealthMap 1004.0 861.195 3.431 Twitter 1004.0 887.279 3.535 Merged 1004.0 877.324 3.495 MX Weather 25.0 693.235 0.144 GFT 25.0 720.404 0.139 GST 25.0 712.002 0.140 HealthMap 25.0 583.512 0.171 Twitter 25.0 670.235 0.149 Merged 25.0 686.116 0.146 PE Weather 83.0 106.492 3.118 GFT 83.0 91.732 3.619 GST 83.0 117.554 2.824 HealthMap 83.0 168.713 1.968 Twitter 83.0 90.009 3.689 Merged 83.0 129.021 2.573 8 BO Weather 74.0 151.037 1.960 GFT 74.0 119.704 2.473 GST 74.0 148.344 1.995 HealthMap 74.0 43.239 2.337 Twitter 74.0 182.833 1.619 Merged 74.0 145.993 2.027 CL Weather 1004.0 833.944 3.322 Continued on next page 103

Actual Predicted Score Metric Current week Country Source GFT 1004.0 809.992 3.227 GST 1004.0 853.501 3.400 HealthMap 1004.0 859.736 3.425 Twitter 1004.0 885.675 3.529 Merged 1004.0 877.683 3.497 MX Weather 25.0 695.657 0.144 GFT 25.0 725.328 0.138 GST 25.0 711.714 0.141 HealthMap 25.0 588.392 0.170 Twitter 25.0 670.595 0.149 Merged 25.0 686.213 0.146 PE Weather 83.0 111.860 2.968 GFT 83.0 95.614 3.472 GST 83.0 117.518 2.825 HealthMap 83.0 187.161 1.774 Twitter 83.0 88.713 3.742 Merged 83.0 129.278 2.568 9 BO Weather 74.0 170.650 1.735 GFT 74.0 123.155 2.403 GST 74.0 150.413 1.968 HealthMap 74.0 42.851 2.316 Twitter 74.0 181.010 1.635 Merged 74.0 146.912 2.015 CL Weather 1004.0 831.587 3.313 GFT 1004.0 812.774 3.238 GST 1004.0 854.884 3.406 HealthMap 1004.0 854.480 3.404 Twitter 1004.0 884.240 3.523 Merged 1004.0 877.627 3.497 MX Weather 25.0 686.548 0.146 GFT 25.0 730.480 0.137 GST 25.0 710.135 0.141 HealthMap 25.0 633.748 0.158 Twitter 25.0 670.665 0.149 Merged 25.0 688.331 0.145 PE Weather 83.0 115.505 2.874 GFT 83.0 93.706 3.543 GST 83.0 117.460 2.826 Continued on next page 104

Actual Predicted Score Metric Current week Country Source HealthMap 83.0 167.631 1.981 Twitter 83.0 86.963 3.818 Merged 83.0 129.119 2.571 season val 4 BO Weather 674.0 1002.532 2.689 GFT 674.0 907.496 2.971 GST 674.0 942.767 2.860 HealthMap 674.0 712.464 3.784 Twitter 674.0 964.909 2.794 Merged 674.0 930.938 2.896 CL Weather 8752.0 11786.915 2.970 GFT 8752.0 12024.415 2.911 GST 8752.0 11960.716 2.927 HealthMap 8752.0 11505.883 3.043 Twitter 8752.0 11833.753 2.958 Merged 8752.0 12027.331 2.911 MX Weather 158.0 5562.933 0.114 GFT 158.0 5358.149 0.118 GST 158.0 5651.481 0.112 HealthMap 158.0 5305.788 0.119 Twitter 158.0 5217.161 0.121 Merged 158.0 5627.434 0.112 PE Weather 649.0 1146.146 2.265 GFT 649.0 1006.834 2.578 GST 649.0 1191.952 2.178 HealthMap 649.0 1029.804 2.521 Twitter 649.0 1029.770 2.521 Merged 649.0 1220.496 2.127 5 BO Weather 674.0 965.295 2.793 GFT 674.0 904.409 2.981 GST 674.0 936.644 2.878 HealthMap 674.0 724.866 3.719 Twitter 674.0 967.609 2.786 Merged 674.0 930.526 2.897 CL Weather 8752.0 11716.956 2.988 GFT 8752.0 11474.547 3.051 GST 8752.0 11931.396 2.934 HealthMap 8752.0 11246.845 3.113 Twitter 8752.0 11834.536 2.958 Continued on next page 105

Actual Predicted Score Metric Current week Country Source Merged 8752.0 12086.014 2.897 MX Weather 158.0 5533.380 0.114 GFT 158.0 5325.509 0.119 GST 158.0 5671.309 0.111 HealthMap 158.0 5139.495 0.123 Twitter 158.0 5230.854 0.121 Merged 158.0 5607.392 0.113 PE Weather 649.0 1069.703 2.427 GFT 649.0 936.372 2.772 GST 649.0 1183.395 2.194 HealthMap 649.0 1154.279 2.249 Twitter 649.0 1029.177 2.522 Merged 649.0 1230.061 2.110 6 BO Weather 674.0 947.951 2.844 GFT 674.0 904.324 2.981 GST 674.0 935.693 2.881 HealthMap 674.0 742.711 3.630 Twitter 674.0 961.997 2.803 Merged 674.0 930.619 2.897 CL Weather 8752.0 11664.138 3.001 GFT 8752.0 11469.219 3.052 GST 8752.0 11926.568 2.935 HealthMap 8752.0 11289.348 3.101 Twitter 8752.0 11840.811 2.957 Merged 8752.0 12075.780 2.899 MX Weather 158.0 5494.777 0.115 GFT 158.0 5323.444 0.119 GST 158.0 5694.656 0.111 HealthMap 158.0 5009.903 0.126 Twitter 158.0 5247.421 0.120 Merged 158.0 5589.361 0.113 PE Weather 649.0 1148.559 2.260 GFT 649.0 994.009 2.612 GST 649.0 1183.066 2.194 HealthMap 649.0 1325.658 1.958 Twitter 649.0 1023.533 2.536 Merged 649.0 1252.013 2.073 7 BO Weather 674.0 949.164 2.840 Continued on next page 106

Actual Predicted Score Metric Current week Country Source GFT 674.0 905.601 2.977 GST 674.0 935.529 2.882 HealthMap 674.0 753.922 3.576 Twitter 674.0 965.598 2.792 Merged 674.0 932.080 2.892 CL Weather 8752.0 11642.172 3.007 GFT 8752.0 11475.410 3.051 GST 8752.0 11921.222 2.937 HealthMap 8752.0 11427.368 3.064 Twitter 8752.0 11840.378 2.957 Merged 8752.0 12083.704 2.897 MX Weather 158.0 5469.313 0.116 GFT 158.0 5387.525 0.117 GST 158.0 5697.444 0.111 HealthMap 158.0 5042.625 0.125 Twitter 158.0 5254.495 0.120 Merged 158.0 5575.940 0.113 PE Weather 649.0 1105.682 2.348 GFT 649.0 1032.684 2.514 GST 649.0 1181.024 2.198 HealthMap 649.0 1374.625 1.889 Twitter 649.0 1023.418 2.537 Merged 649.0 1252.055 2.073 8 BO Weather 674.0 931.436 2.894 GFT 674.0 896.667 3.007 GST 674.0 935.181 2.883 HealthMap 674.0 755.413 3.569 Twitter 674.0 967.242 2.787 Merged 674.0 932.683 2.891 CL Weather 8752.0 11613.619 3.014 GFT 8752.0 11477.909 3.050 GST 8752.0 11918.721 2.937 HealthMap 8752.0 11620.917 3.012 Twitter 8752.0 11856.342 2.953 Merged 8752.0 12091.229 2.895 MX Weather 158.0 5445.851 0.116 GFT 158.0 5340.681 0.118 GST 158.0 5695.009 0.111 Continued on next page 107

Actual Predicted Score Metric Current week Country Source HealthMap 158.0 5074.198 0.125 Twitter 158.0 5260.482 0.120 Merged 158.0 5569.559 0.113 PE Weather 649.0 1138.684 2.280 GFT 649.0 1059.566 2.450 GST 649.0 1179.636 2.201 HealthMap 649.0 1461.538 1.776 Twitter 649.0 1016.001 2.555 Merged 649.0 1253.958 2.070 9 BO Weather 674.0 953.125 2.829 GFT 674.0 898.970 2.999 GST 674.0 937.181 2.877 HealthMap 674.0 760.434 3.545 Twitter 674.0 965.647 2.792 Merged 674.0 933.647 2.888 CL Weather 8752.0 11587.588 3.021 GFT 8752.0 11451.570 3.057 GST 8752.0 11920.379 2.937 HealthMap 8752.0 11517.121 3.040 Twitter 8752.0 11863.286 2.951 Merged 8752.0 12091.663 2.895 MX Weather 158.0 5400.856 0.117 GFT 158.0 5308.693 0.119 GST 158.0 5690.288 0.111 HealthMap 158.0 5243.898 0.121 Twitter 158.0 5261.940 0.120 Merged 158.0 5595.444 0.113 PE Weather 649.0 1161.855 2.234 GFT 649.0 1047.027 2.479 GST 649.0 1183.520 2.193 HealthMap 649.0 1373.276 1.890 Twitter 649.0 1004.894 2.583 Merged 649.0 1253.256 2.071 start date 4 BO Weather 17.0 9.000 8.000 GFT 17.0 9.000 8.000 GST 17.0 9.000 8.000 HealthMap 17.0 8.000 9.000 Twitter 17.0 9.000 8.000 Continued on next page 108

Actual Predicted Score Metric Current week Country Source Merged 17.0 9.000 8.000 CL Weather 10.0 7.000 3.000 GFT 10.0 7.000 3.000 GST 10.0 7.000 3.000 HealthMap 10.0 7.000 3.000 Twitter 10.0 7.000 3.000 Merged 10.0 7.000 3.000 MX Weather 3.0 14.000 11.000 GFT 3.0 14.000 11.000 GST 3.0 14.000 11.000 HealthMap 3.0 14.000 11.000 Twitter 3.0 14.000 11.000 Merged 3.0 14.000 11.000 PE Weather 6.0 1.000 5.000 GFT 6.0 1.000 5.000 GST 6.0 1.000 5.000 HealthMap 6.0 1.000 5.000 Twitter 6.0 1.000 5.000 Merged 6.0 1.000 5.000 5 BO Weather 17.0 9.000 8.000 GFT 17.0 9.000 8.000 GST 17.0 9.000 8.000 HealthMap 17.0 9.000 8.000 Twitter 17.0 9.000 8.000 Merged 17.0 9.000 8.000 CL Weather 10.0 7.000 3.000 GFT 10.0 7.000 3.000 GST 10.0 7.000 3.000 HealthMap 10.0 7.000 3.000 Twitter 10.0 7.000 3.000 Merged 10.0 7.000 3.000 MX Weather 3.0 14.000 11.000 GFT 3.0 14.000 11.000 GST 3.0 14.000 11.000 HealthMap 3.0 14.000 11.000 Twitter 3.0 14.000 11.000 Merged 3.0 14.000 11.000 PE Weather 6.0 1.000 5.000 Continued on next page 109

Actual Predicted Score Metric Current week Country Source GFT 6.0 1.000 5.000 GST 6.0 1.000 5.000 HealthMap 6.0 1.000 5.000 Twitter 6.0 1.000 5.000 Merged 6.0 1.000 5.000 6 BO Weather 17.0 9.000 8.000 GFT 17.0 9.000 8.000 GST 17.0 9.000 8.000 HealthMap 17.0 9.000 8.000 Twitter 17.0 9.000 8.000 Merged 17.0 9.000 8.000 CL Weather 10.0 7.000 3.000 GFT 10.0 7.000 3.000 GST 10.0 7.000 3.000 HealthMap 10.0 7.000 3.000 Twitter 10.0 7.000 3.000 Merged 10.0 7.000 3.000 MX Weather 3.0 14.000 11.000 GFT 3.0 14.000 11.000 GST 3.0 14.000 11.000 HealthMap 3.0 14.000 11.000 Twitter 3.0 14.000 11.000 Merged 3.0 14.000 11.000 PE Weather 6.0 1.000 5.000 GFT 6.0 1.000 5.000 GST 6.0 1.000 5.000 HealthMap 6.0 1.000 5.000 Twitter 6.0 1.000 5.000 Merged 6.0 1.000 5.000 7 BO Weather 17.0 9.000 8.000 GFT 17.0 9.000 8.000 GST 17.0 9.000 8.000 HealthMap 17.0 9.000 8.000 Twitter 17.0 9.000 8.000 Merged 17.0 9.000 8.000 CL Weather 10.0 7.000 3.000 GFT 10.0 7.000 3.000 GST 10.0 7.000 3.000 Continued on next page 110

Actual Predicted Score Metric Current week Country Source HealthMap 10.0 7.000 3.000 Twitter 10.0 7.000 3.000 Merged 10.0 7.000 3.000 MX Weather 3.0 14.000 11.000 GFT 3.0 14.000 11.000 GST 3.0 14.000 11.000 HealthMap 3.0 14.000 11.000 Twitter 3.0 14.000 11.000 Merged 3.0 14.000 11.000 PE Weather 6.0 1.000 5.000 GFT 6.0 1.000 5.000 GST 6.0 1.000 5.000 HealthMap 6.0 1.000 5.000 Twitter 6.0 1.000 5.000 Merged 6.0 1.000 5.000 8 BO Weather 17.0 9.000 8.000 GFT 17.0 9.000 8.000 GST 17.0 9.000 8.000 HealthMap 17.0 9.000 8.000 Twitter 17.0 9.000 8.000 Merged 17.0 9.000 8.000 CL Weather 10.0 7.000 3.000 GFT 10.0 7.000 3.000 GST 10.0 7.000 3.000 HealthMap 10.0 7.000 3.000 Twitter 10.0 7.000 3.000 Merged 10.0 7.000 3.000 MX Weather 3.0 14.000 11.000 GFT 3.0 14.000 11.000 GST 3.0 14.000 11.000 HealthMap 3.0 14.000 11.000 Twitter 3.0 14.000 11.000 Merged 3.0 14.000 11.000 PE Weather 6.0 1.000 5.000 GFT 6.0 1.000 5.000 GST 6.0 1.000 5.000 HealthMap 6.0 1.000 5.000 Twitter 6.0 1.000 5.000 Continued on next page 111

Actual Predicted Score Metric Current week Country Source Merged 6.0 1.000 5.000 9 BO Weather 17.0 9.000 8.000 GFT 17.0 9.000 8.000 GST 17.0 9.000 8.000 HealthMap 17.0 9.000 8.000 Twitter 17.0 9.000 8.000 Merged 17.0 9.000 8.000 CL Weather 10.0 7.000 3.000 GFT 10.0 7.000 3.000 GST 10.0 7.000 3.000 HealthMap 10.0 7.000 3.000 Twitter 10.0 7.000 3.000 Merged 10.0 7.000 3.000 MX Weather 3.0 14.000 11.000 GFT 3.0 14.000 11.000 GST 3.0 14.000 11.000 HealthMap 3.0 14.000 11.000 Twitter 3.0 14.000 11.000 Merged 3.0 14.000 11.000 PE Weather 6.0 1.000 5.000 GFT 6.0 1.000 5.000 GST 6.0 1.000 5.000 HealthMap 6.0 1.000 5.000 Twitter 6.0 1.000 5.000 Merged 6.0 1.000 5.000

Appendix B

Sequential Bayesian Inference

In this appendix, we briefly describe the methodology of Sequential Bayesian Inference and present some of the details relevant to the methods presented in Chapter 6.

Consider a stochastic process where an observed temporal data sequence $\bar{y} = \{y_1, y_2, \ldots, y_t\}$ depends on unobserved latent states $\bar{x} = \{x_1, x_2, \ldots, x_t\}$ such that the following formulation holds:

$$P(y_t \mid y_{1:t-1}, x_{1:t}, \theta) = f_\theta(y_t \mid x_t)$$
$$P(x_t \mid x_{1:t-1}, \theta) = g_\theta(x_t \mid x_{t-1}) \qquad \text{(B.1)}$$
$$P(x_1 \mid \theta) = \mu_\theta(x_1)$$
$$\Pi_0(\theta) = P(\theta)$$

i.e., $y_t$ depends only on the current estimate of the state $x_t$. On the other hand, $x_t$ depends only on $x_{t-1}$, thus exhibiting a first-order Markov property. $\theta$ denotes the set of parameters for the described process, which are constant over time. For a given $\theta$, $f_\theta$ and $g_\theta$ describe the observation probability and the state transition probability, respectively. $P(\theta)$ is the prior distribution for the static parameter $\theta$, while $\mu_\theta$ is the same for $x_1$ given a particular $\theta$. Typically, at any time point $t-1$ the observation values are known but the latent states and the parameter $\theta$ are unknown. The problem of interest is then to estimate the posterior probability

$$P_\theta\left(\{x_1, x_2, \ldots, x_{t-1}\} \mid \{y_1, y_2, \ldots, y_{t-1}\}\right)$$

This problem has been studied extensively in the context of Sequential Bayesian Inference [9]. Kalman filters [26], a class of such algorithms, are very popular when $f_\theta$ and $g_\theta$ describe linear Gaussian transitions. There have been efforts [55, 2] at relaxing these restrictions using methods such as Taylor series expansions and ensemble averages. However, for arbitrary forms of $f_\theta$ and $g_\theta$, Sequential Monte Carlo methods, and more specifically Particle Filters, are more popular. Particle Filters [18] estimate the posteriors using a large number of Monte Carlo samples from the observation and state transition models. At any time $t$, these algorithms only need to draw new samples for time $t$ using data from time $t-1$; thus these methods are ideally suited for online learning. Standard Particle Filters are known to suffer from premature convergence (particle degeneracy) [20] or to be unsuitable for unknown static variables [47, 20]. Recently, Chopin et al. [16] proposed a hybrid particle filter, SMC$^2$, which interleaves iterated batch importance sampling with particle filter updates to handle both static and state parameters. Given an observed sequence $y_{1:t}$, SMC$^2$ can be used to find the best posterior fit of the static and state parameters as given below:

$$\phi \leftarrow P\left(\{\phi, x_{1:t}\} \mid y_{1:t}\right)$$
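For illustration, a minimal bootstrap particle filter for a fixed $\theta$ is sketched below. The transition (an autoregressive log-intensity) and the observation model (Poisson counts) are assumptions made only for this example and are not the models used in this thesis; the helper name bootstrap_particle_filter is likewise illustrative. The sketch shows the propagate, weight, and resample cycle that the SMC$^2$ algorithm traced in Section B.1 below wraps with an outer loop over $\theta$ particles.

```python
import numpy as np

rng = np.random.default_rng(0)

def g_theta(x_prev, theta):
    """Illustrative state transition: autoregressive log-intensity with Gaussian noise."""
    return theta["a"] * x_prev + theta["b"] + theta["sigma"] * rng.standard_normal(x_prev.shape)

def f_theta_loglik(y_t, x_t):
    """Illustrative observation model: Poisson counts with rate exp(x_t).

    Returns the log-likelihood up to an additive constant (the log y! term is dropped).
    """
    rate = np.exp(x_t)
    return y_t * np.log(rate) - rate

def bootstrap_particle_filter(y, theta, n_particles=500):
    """Returns the per-step likelihood estimates P_hat(y_t | y_{1:t-1}, theta)."""
    x = rng.normal(0.0, 1.0, n_particles)                 # x_1 ~ mu_theta (illustrative prior)
    increments = []
    for y_t in y:
        logw = f_theta_loglik(y_t, x)                     # weight particles by the observation density
        w = np.exp(logw - logw.max())
        increments.append(w.mean() * np.exp(logw.max()))  # average unnormalized weight
        w /= w.sum()
        idx = rng.choice(n_particles, n_particles, p=w)   # multinomial resampling
        x = g_theta(x[idx], theta)                        # propagate to the next time step
    return np.array(increments)

# Usage: filter a short synthetic count series under an assumed theta.
theta = {"a": 0.9, "b": 0.1, "sigma": 0.3}
y_obs = np.array([3, 5, 4, 8, 12, 9])
print(bootstrap_particle_filter(y_obs, theta))
```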

B.1 SMC$^2$ algorithm traces

We present the traces of the SMC$^2$ algorithm below. For a more detailed treatment of the same (including theoretical proofs of convergence) we ask the reader to refer to [16].

SMC$^2$ typically starts with two parameters: (a) $N_\theta$, the number of static parameter samples drawn from the prior of $\theta$, and (b) $N_x$, the number of state particles initialized for each $\theta$. The algorithm can then be given as follows:

1. Sample $N_\theta$ parameter particles $\theta^m \sim P(\theta)$.

2. $\forall\, \theta^m$, run the following particle filter:

   (a) Initialization: $t = 1$

      i. $x_1^{1:N_x, m} \sim \mu_{\theta^m}$

      ii. $w_{1,\theta}(x_1^{n,m}) = \dfrac{\mu_{1,\theta^m}(x_1^{n,m})\, g_\theta(y_1 \mid x_1^{n,m})}{q_{1,\theta}(x_1^{n,m})}$

      iii. $W_{1,\theta}^{n,m} = \dfrac{w_{1,\theta}(x_1^{n,m})}{\sum_i w_{1,\theta}(x_1^{i,m})}$

      iv. $\hat{P}(y_1 \mid \theta^m) = \dfrac{1}{N_x} \sum_{n=1}^{N_x} w_{1,\theta}(x_1^{n,m})$

   (b) For $t > 1$:

      i. Auxiliary variable: $a_{t-1}^{n,m} \sim \text{Multinomial}\left(W_{t-1,\theta}^{1:N_x, m}\right)$

      ii. State proposal: $x_t^{n,m} \sim q_{t,\theta}\left(\cdot \mid x_{t-1}^{a_{t-1}^{n,m}}\right)$

      iii. Weight update: $W_{t,\theta}^{n,m} = \dfrac{w_{t,\theta}\left(x_{t-1}^{a_{t-1}^{n,m}}, x_t^{n,m}\right)}{\sum_n w_{t,\theta}\left(x_{t-1}^{a_{t-1}^{n,m}}, x_t^{n,m}\right)}$

      iv. Observation probability: $\hat{P}(y_t \mid y_{1:t-1}, \theta^m) = \dfrac{1}{N_x} \sum_{n=1}^{N_x} w_{t,\theta}\left(x_{t-1}^{a_{t-1}^{n,m}}, x_t^{n,m}\right)$

3. Update importance weights: $\forall\, \theta^m:\ w^m \leftarrow w^m\, \hat{P}(y_t \mid y_{1:t-1}, \theta^m)$.

4. Under a degeneracy criterion, move the particles using a kernel $K_t$:
   $$\left(\tilde{\theta}^m,\; \tilde{x}_{1:t}^{1:\tilde{N}_x, m},\; \tilde{a}_{1:t-1}^{1:\tilde{N}_x, m}\right) \;\overset{\text{i.i.d.}}{\sim}\; \frac{\sum_m w^m\, K_t\!\left(\theta^m, x_{1:t}^{1:N_x, m}, a_{1:t-1}^{1:N_x, m}\right)}{\sum_m w^m}$$

5. Weight exchange:
   $$\left(\theta^m,\; x_{1:t}^{1:N_x, m},\; a_{1:t-1}^{1:N_x, m}\right) \;\leftarrow\; \left(\tilde{\theta}^m,\; \tilde{x}_{1:t}^{1:\tilde{N}_x, m},\; \tilde{a}_{1:t-1}^{1:\tilde{N}_x, m}\right)$$

Here, $K_t$ is a Markov kernel targeting the posterior distribution. It can be shown that such Markov moves do not change the target distribution and can alleviate the problem of particle degeneracy.
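To make the flow of the algorithm concrete, the sketch below wraps a per-$\theta$ particle filter (such as the one sketched earlier in this appendix) with the outer $\theta$ layer. It is a simplified illustration under stated assumptions: the rejuvenation kernel of steps 4 and 5 is replaced by a plain resampling step, the function and parameter names (smc2_lite, ess_threshold) are hypothetical, and this is not the implementation used in the thesis.

```python
import numpy as np

rng = np.random.default_rng(1)

def smc2_lite(y, sample_theta_prior, run_pf_increments, n_theta=50, ess_threshold=0.5):
    """Simplified SMC^2-style loop (illustration only):
      - sample N_theta parameter particles theta^m from the prior,
      - update each theta weight with its particle-filter likelihood increment (step 3),
      - resample the theta particles when the effective sample size drops.
    The rejuvenation kernel K_t (steps 4-5 above) is omitted and replaced by a plain
    resample, so this sketch does not reproduce the full guarantees of SMC^2 [16].
    """
    thetas = [sample_theta_prior() for _ in range(n_theta)]
    # Per-theta increments P_hat(y_t | y_{1:t-1}, theta^m); computed in one pass here for
    # brevity, whereas a faithful implementation interleaves them with the updates over t.
    increments = np.array([run_pf_increments(y, th) for th in thetas])  # shape (n_theta, T)
    logw = np.zeros(n_theta)
    for t in range(len(y)):
        logw += np.log(increments[:, t] + 1e-300)        # step 3: importance-weight update
        w = np.exp(logw - logw.max())
        w /= w.sum()
        ess = 1.0 / np.sum(w ** 2)
        if ess < ess_threshold * n_theta:                # degeneracy criterion
            idx = rng.choice(n_theta, n_theta, p=w)      # resample instead of rejuvenating
            thetas = [thetas[i] for i in idx]
            increments = increments[idx]
            logw = np.zeros(n_theta)
    w = np.exp(logw - logw.max())
    return thetas, w / w.sum()

# Usage (assumes bootstrap_particle_filter from the previous sketch is in scope):
# prior = lambda: {"a": rng.uniform(0.5, 1.0), "b": rng.uniform(0.0, 0.5),
#                  "sigma": rng.uniform(0.1, 1.0)}
# thetas, weights = smc2_lite(y_obs, prior, bootstrap_particle_filter)
```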

B.2 SMC$^2$ priors

We used conjugate distributions to model the priors. For $P(\theta)$, we used a mixture of Latin hypercube sampling (LHS) and conjugate priors as follows:

$$\sigma_S,\ \bar{\rho}_S,\ \bar{\mu}^1,\ \bar{\mu}^2 \sim \text{LHS}, \qquad \Sigma_A \sim \text{InverseWishart} \qquad \text{(B.2)}$$

Similar to $P(\theta)$, we model the initial distribution $P(x_0 \mid \theta)$ via LHS sampling for the base values and by using the model equations as presented in Section 3.1, as follows:

$$\bar{c}_K \sim \text{Normal}, \qquad \bar{\phi}_k,\ \bar{\rho}_s \sim \text{Gamma} \qquad \text{(B.3)}$$
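As an illustration of this prior specification, the following sketch draws the LHS and conjugate samples using SciPy. The variable names mirror the symbols in Eqs. (B.2) and (B.3), but the bounds, the dimensionality of $\Sigma_A$, and the distribution hyperparameters are placeholders only; the actual hyperparameters are chosen via cross-validation as noted below.

```python
import numpy as np
from scipy.stats import qmc, invwishart, norm, gamma

# Sketch of the prior sampling in Eqs. (B.2)-(B.3). The bounds, dimensions, and
# distribution hyperparameters below are illustrative placeholders, not the values
# used in the thesis (which are set via cross-validation).

n_samples = 100

# P(theta): LHS over (sigma_S, rho_S, mu_1, mu_2) and an inverse-Wishart prior on Sigma_A.
sampler = qmc.LatinHypercube(d=4, seed=7)
unit_cube = sampler.random(n=n_samples)                       # samples in [0, 1)^4
lower, upper = [0.01, 0.0, -1.0, -1.0], [1.0, 1.0, 1.0, 1.0]  # placeholder bounds
theta_lhs = qmc.scale(unit_cube, lower, upper)                # columns: sigma_S, rho_S, mu_1, mu_2
Sigma_A = invwishart(df=4, scale=np.eye(2)).rvs(size=n_samples, random_state=7)

# P(x_0 | theta): Normal base values and Gamma-distributed rates.
c_K = norm(loc=0.0, scale=1.0).rvs(size=n_samples, random_state=8)
phi_k = gamma(a=2.0, scale=0.5).rvs(size=n_samples, random_state=9)
rho_s = gamma(a=2.0, scale=0.5).rvs(size=n_samples, random_state=10)

print(theta_lhs.shape, Sigma_A.shape, c_K.shape, phi_k.shape, rho_s.shape)
```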

The parameters of the distributions of $P(\theta)$ and $P(x_0 \mid \theta)$ are called hyperparameters in the general domain of Bayesian Inference and, following standard practice, are found via cross-validation.

Appendix C

HQCD: Additional Experimental Results

In this appendix, we present some additional experimental results that complement the summary results presented in Section 6.3.

[Figure C.1 shows time series of event counts for the protest subtypes Employment, Energy & Resources, Housing, Other Economic, Other Government, and Other, with one column of panels per country: (a) Brazil Subtypes, (b) Venezuela Subtypes, (c) Uruguay Subtypes.]

Figure C.1: Comparison of detected changepoints at the target sources (protest types). HQCD detections are shown in solid green while those from the state-of-the-art methods, i.e., RuLSIF (red), WGLRT (magenta), BOCPD (purple), and GLRT (gold), are shown with dashed lines.


Table C.1: (Protest uprisings) Comparison of HQCD vs state-of-the-art with respect to detected changepoints

Event-Type         | GLRT (γ) | WGLRT (γ) | BOCPD (γ) | RuLSIF (γ) | HQCD (γ) | EADD

Brazil
Employment & Wages | 02/10 | 03/17 | 06/16 | 05/26 | 08/18 | 4
Energy & Resources | 02/10 | 03/17 | 06/09 | 05/19 | 06/02 | 6
Housing            | 03/24 | 03/31 | 07/28 | 05/19 | 06/16 | 8
Other Economic     | 03/24 | 03/24 | 06/23 | 05/19 | 06/30 | 5
Other Government   | 02/17 | 06/23 | 04/07 | 05/19 | 06/16 | 4
Other              | 03/03 | 03/17 | 06/30 | 05/19 | 06/23 | 6
All                | 02/17 | 04/28 | 05/19 | 06/16 | 06/16 | 8

Venezuela
Employment & Wages | 01/14 | 01/13 | 01/28 | 01/25 | 01/27 | 3
Energy & Resources | 01/20 | 01/11 | 02/28 | 01/20 | 02/24 | 7
Housing            | -     | -     | -     | -     | -     | -
Other Economic     | 01/31 | 01/31 | 01/28 | -     | 01/27 | 9
Other Government   | 01/22 | 01/11 | 02/03 | 01/20 | 02/10 | 4
Other              | 01/14 | 01/12 | 01/25 | 01/30 | 01/24 | 5
All                | 01/26 | 01/11 | 01/30 | 01/20 | 02/12 | 3

Uruguay
Employment & Wages | 12/06 | 12/08 | 12/13 | 12/03 | 12/10 | 3
Energy & Resources | 12/04 | 12/05 | 12/10 | -     | 12/09 | 4
Housing            | 12/21 | 12/06 | 11/30 | -     | 11/28 | 2
Other Economic     | 12/20 | 12/06 | -     | -     | 11/26 | 2
Other Government   | 11/25 | 12/05 | 12/16 | 11/29 | 12/15 | 3
Other              | 12/05 | 12/09 | 12/03 | -     | 01/14 | 10
All                | 12/05 | 12/09 | 12/03 | 11/29 | 12/10 | 3