Data-Driven Methods for Modeling and Predicting Multivariate Time Series Using Surrogates
Prithwish Chakraborty

Dissertation submitted to the Faculty of the Virginia Polytechnic Institute and State University in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Computer Science

Narendran Ramakrishnan, Chair
Madhav Marathe
Chang-Tien Lu
Ravi Tandon
John S. Brownstein

April 28, 2016
Arlington, VA

Keywords: Multivariate Time Series, Surrogates, Generalized Linear Models, Bayesian Sequential Analysis, Computational Epidemiology

Copyright © 2015, Prithwish Chakraborty

(ABSTRACT)

Modeling and predicting multivariate time series data has been of prime interest to researchers for many decades. Traditionally, time series prediction models have focused on finding attributes that have consistent correlations with the target variable(s). However, diverse surrogate signals, such as news data and Twitter chatter, are increasingly available and can provide real-time information, albeit with inconsistent correlations. Intelligent use of such sources can lead to early and real-time warning systems such as Google Flu Trends. Furthermore, the target variables of interest, such as public health surveillance, can be noisy. Thus models built for such data sources should be flexible as well as adaptable to changing correlation patterns. In this thesis we explore various methods of using surrogates to generate more reliable and timely forecasts for noisy target signals. We primarily investigate three key components of the forecasting problem, viz.
(i) short-term forecasting, where surrogates can be employed in a now-casting framework; (ii) long-term forecasting, where surrogates act as forcing parameters to model system dynamics; and (iii) robust drift models that detect and exploit 'changepoints' in the surrogate-target relationship to produce robust models. We explore various 'physical' and 'social' surrogate sources to study these sub-problems, primarily to generate real-time forecasts for endemic diseases. On the modeling side, we employed matrix factorization and generalized linear models to detect short-term trends and explored various Bayesian sequential analysis methods to model long-term effects. Our research indicates that, in general, a combination of surrogates can lead to more robust models. Interestingly, our findings indicate that under specific scenarios, particular surrogates can decrease overall forecasting accuracy, thus providing an argument for the use of 'Good data' over 'Big data'.

This work is supported by the Intelligence Advanced Research Projects Activity (IARPA) via Department of Interior National Business Center (DoI/NBC) contract number D12PC000337. The US Government is authorized to reproduce and distribute reprints of this work for Governmental purposes notwithstanding any copyright annotation thereon. Disclaimer: The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of IARPA, DoI/NBC, or the US Government.

(GENERAL AUDIENCE ABSTRACT)

In the context of public health, modeling and early forecasting of infectious diseases is of prime importance. Such efforts help agencies to devise interventions and implement effective counter-measures.
However, disease surveillance is an involved process in which agencies estimate the intensity of diseases in the public domain using various networks. The process involves various levels of data cleaning and aggregation, and as such the resultant surveillance data is inherently noisy (requiring several revisions to stabilize) and delayed. Thus real-time forecasting of such diseases necessitates stable and robust methods that can provide accurate public health information in a time-critical manner. This work focuses on data-driven modeling and forecasting of time series, especially of infectious diseases, for a number of regions of the world including Latin America and the United States of America. With the increasing popularity of social media, real-time societal information can be extracted from various media such as Twitter and news. This work addresses this critical area by presenting a number of models that systematically integrate and compare the usefulness of such real-time information from both physical indicators (such as temperature) and non-physical indicators (such as Twitter) for robust disease forecasting. Specifically, this work focuses on three critical areas: (a) short-term forecasting of disease case counts to get better estimates of the current on-ground scenario, (b) long-term forecasting of disease season characteristics to help public health agencies plan and implement interventions, and (c) concept drift detection and adaptation to account for the ever-evolving relationship between the societal surrogates and public health surveillance and to lend robustness to the disease forecasting models. This work shows that such indicators can be useful for reliable estimation of disease characteristics, even when the ground truth itself is unreliable, and provides insights into how such indicators can be integrated as part of public surveillance.
This work has used principles from diverse fields spanning Bayesian Statistics, Machine Learning, Information Theory, and Public Health to analyze and characterize such diseases.

Acknowledgments

I extend my sincere thanks and gratitude to my advisor Dr. Naren Ramakrishnan for his continued encouragement and guidance throughout my work. His feedback, insights and inputs have contributed immensely to the final form of this work. He has been my mentor and my guide. I have always found in him a patient listener who rendered clarity to my thoughts, and I have always come out of our discussions with renewed vigor and focus. It has been my utmost privilege to work with him for all these years.

I would also like to thank my entire committee. I sincerely thank Dr. Madhav Marathe and Dr. John Brownstein for their unique perspectives on public health, without which this work wouldn't have been complete. I have especially enjoyed my meetings with Dr. Madhav Marathe and our collaborations, which have helped me to gain a broader understanding of the field of computational epidemiology. I cannot thank Dr. Ravi Tandon enough for his inputs and insights that ultimately materialized in the form of 'concept drift' - a crucial component of this work. Finally, Dr. C.T. Lu has always been welcoming and encouraging, and I thank him for his crucial feedback and inputs about this work. I consider myself fortunate to have received the guidance of such an esteemed and kind group of people.

I extend my heartfelt thanks and gratitude to Dr. Bryan Lewis, NDSSL at Virginia Tech, for all his encouragement, guidance and countless hours working with me on this work. I have been lucky to have him as my mentor.

I would also like to thank the Discovery Analytics Center at Virginia Tech, which has been my home-away-from-home for these past few years. I have found mentors like Tozammel Hossain and Patrick Butler who have immensely shaped my early PhD years.
All my lab members have been crucial and I will miss my time with all of them. They have been my friends, my colleagues and, more often than not, my support group throughout this process. I wish all of you the best for your future. I have also been fortunate to work with a varied group of collaborators from NDSSL, HealthMap and YeLab as well as public health agencies such as IARPA and CDC, which has made my PhD a great experience that I will cherish forever.

I would also like to express my gratitude to my wonderful friends - Deba Pratim Saha, Gourab Ghosh Roy, Saurav Ghosh, Sathappan Muthiah, Arijit Chattopadhyay, Sayantan Guha and Abhishek Mukherjee, to name a few - with whom I have shared unique moments throughout this time. Thanks for being around and being there for me whenever I needed you all.

Thanking my family is perhaps not enough. My mother Mrs. Devyani Chakraborty and my brother Mr. Prasenjit Chakraborty have been my closest friends and confidants. This work and I owe everything to you. My late father Mr. Prasanta Kr. Chakraborty would have been happy to see me where I am today. My sister-in-law Mrs. Amrita Dhole Chakraborty and my cousins, I thank you for being the best family I could hope for and for always being there for me.

Contents

1 Background and Motivation
    1.1 Flu Surveillance Effects
    1.2 Motivation towards using surrogates

I Short-term Forecasting using Surrogates

2 Forecasting a Moving Target: Ensemble Models for ILI Case Count Predictions
    2.1 Related Work
    2.2 Problem Formulation
        2.2.1 Methods
    2.3 Ensemble Approaches
        2.3.1 Data level fusion
        2.3.2 Model level fusion
    2.4 Forecasting a Moving Target
    2.5 Experimental Setup
        2.5.1 Reference Data
        2.5.2 Evaluation criteria
        2.5.3 Surrogate data sources
    2.6 Results
    2.7 Discussion

3 Dynamic Poisson Autoregression for Influenza-Like-Illness Case Count Prediction
    3.1 Summary
        3.1.1 Model Similarity
        3.1.2 Forecasting Results
        3.1.3 Seasonal Analysis
    3.2 Discussion

II Long-term Forecasting using Surrogates

4 Curve-matching from library of curves

5 Data Assimilation methods for long-term forecasting
    5.1 Data Assimilation
    5.2 Data Assimilation Models in disease forecasting
    5.3 Data Assimilation Using surrogate Sources
    5.4 Experimental Results and Performance Summary
    5.5 Discussion

III Detecting and Adapting to Concept Drift

6 Hierarchical Quickest Change Detection via Surrogates
    6.1 HQCD - Hierarchical Quickest Change Detection
        6.1.1 Quickest Change Detection (QCD)
        6.1.2 Changepoint detection in Hierarchical Data
    6.2 HQCD for Count Data via Surrogates