Sentinel Surveillance Network Design and Novel Uses of Wikipedia
Total Page:16
File Type:pdf, Size:1020Kb
- Improving disease surveillance : sentinel surveillance network design and novel uses of Wikipedia. Fairchild, Geoffrey Colin https://iro.uiowa.edu/discovery/delivery/01IOWA_INST:ResearchRepository/12730531290002771?l#13730830480002771 Fairchild, G. C. (2015). Improving disease surveillance: sentinel surveillance network design and novel uses of Wikipedia [University of Iowa]. https://doi.org/10.17077/etd.sfbu328x https://iro.uiowa.edu Copyright 2014 Geoffrey Colin Fairchild Downloaded on 2021/10/01 19:58:37 -0500 - IMPROVING DISEASE SURVEILLANCE: SENTINEL SURVEILLANCE NETWORK DESIGN AND NOVEL USES OF WIKIPEDIA by Geoffrey Colin Fairchild A thesis submitted in partial fulfillment of the requirements for the Doctor of Philosophy degree in Computer Science in the Graduate College of The University of Iowa December 2014 Thesis Supervisor: Professor Alberto Maria Segre Graduate College The University of Iowa Iowa City, Iowa CERTIFICATE OF APPROVAL PH.D. THESIS This is to certify that the Ph.D. thesis of Geoffrey Colin Fairchild has been approved by the Examining Committee for the thesis requirement for the Doctor of Philosophy degree in Computer Science at the December 2014 graduation. Thesis committee: Alberto M. Segre, Thesis Supervisor Philip M. Polgreen Sriram Pemmaraju Ted Herman Gerard Rushton Sara Y. Del Valle ACKNOWLEDGEMENTS A Ph.D. is a long and arduous journey. I've been lucky to have been sur- rounded by a truly wonderful group of people over the last 6 years, and I know for a fact that I wouldn't have made it this far without them. First and foremost, I want to thank Dr. Alberto Segre and Dr. Phil Polgreen for their patience, guidance, and advice. They were quick to recognize my affinity for medicine and were able to direct me to a research area that I wasn't even aware of prior to grad school. Computational epidemiology, as it turns out, is the perfect union between my skill set and research interests. Both of you have been instrumental in molding me into the computer scientist I am today, and I am forever grateful. I also want to thank the rest of the CompEpi group at Iowa. Dr. Sriram Pemmaraju, Dr. Ted Herman, and Dr. Gerard Rushton have been especially great at critiquing my work and offering advice when necessary. The rest of my cohort at Iowa was equally important, providing many laughs and thought-provoking discussions. Dr. Sara Del Valle, I'm not sure you'll ever know how much I appreciate the opportunity you've given me at Los Alamos. I've met some brilliant and inspiring people here, and you've opened up new avenues for me that I didn't even know were available. I hope that one day I can pay it forward. To my friends outside of academia and work, thanks for keeping me sane and grounded. It's easy to lose track of how to have fun during grad school, but you all ensured that I never fell into that trap. ii My mom, sister, dad, and stepmom have provided the kind of unconditional love and support that I so often needed. Mom, I am forever indebted to you for the sacrifices you made in order for Britt and me to have the chances we did. You kept us in good schools, and you've always been encouraging. Britt, you're my best friend. We can talk about anything, we always support one another, and I know you've got my back when I need it. Dad and Linda, I love our daily emails and weekly conversations about life. You've both always shown such an interest in my work, and I really appreciate that. I love you all more than you'll ever know. I'm so lucky to have you in my life. Finally, I want to thank Tamiko. You supplied motivation when there was none, laughs when I was down, deep discussions, and constant encouragement. With- out you, I'm not sure I would've survived grad school. iii ABSTRACT Traditional disease surveillance systems are instrumental in guiding policy- makers' decisions and understanding disease dynamics. The first study in this dis- sertation looks at sentinel surveillance network design. We consider three location- allocation models: two based on the maximal coverage model (MCM) and one based on the K-median model. The MCM selects sites that maximize the total number of people within a specified distance to the site. The K-median model minimizes the sum of the distances from each individual to the individual's nearest site. Using a ground truth dataset consisting of two million de-identified Medicaid billing records representing eight complete influenza seasons and an evaluation function based on the Huff spatial interaction model, we empirically compare networks against the existing volunteer-based Iowa Department of Public Health influenza-like illness network by simulating the spread of influenza across the state of Iowa. We compare networks on two metrics: outbreak intensity (i.e., disease burden) and outbreak timing (i.e., the start, peak, and end of the epidemic). We show that it is possible to design a network that achieves outbreak intensity performance identical to the status quo network using two fewer sites. We also show that if outbreak timing detection is of primary interest, it is actually possible to create a network that matches the existing network's performance using 42% fewer sites. Finally, in an effort to demonstrate the generic usefulness of these location-allocation models, we examine primary stroke center selection. We describe the ineffectiveness of the current self-initiated approach iv and argue for a more organized primary stroke center system. While these traditional disease surveillance systems are important, they have several downsides. First, due to a complex reporting hierarchy, there is generally a reporting lag; for example, most diseases in the United States experience a reporting lag of approximately 1{2 weeks. Second, many regions of the world lack trustworthy or reliable data. As a result, there has been a surge of research looking at using publicly available data on the internet for disease surveillance purposes. The second and third studies in this dissertation analyze Wikipedia's viability in this sphere. The first of these two studies looks at Wikipedia access logs. Hourly access logs dating back to December 2007 are available for anyone to download completely free of charge. These logs contain, among other things, the total number of accesses for every article in Wikipedia. Using a linear model and a simple article selection procedure, we show that it is possible to nowcast and, in some cases, forecast up to the 28 days tested in 8 of the 14 disease-location contexts considered. We also demonstrate that it may be possible in some cases to train a model in one context and use the same model to nowcast or forecast in another context with poor surveillance data. The second of the Wikipedia studies looked at disease-relevant data found in the article content. A number of disease outbreaks are meticulously tracked on Wikipedia. Case counts, death counts, and hospitalization counts are often provided in the article narrative. Using a dataset created from 14 Wikipedia articles, we trained a named-entity recognizer (NER) to recognize and tag these phrases. The NER achieved an F1 score of 0.753. In addition to these counts in the narrative, v we tested the accuracy of tabular data using the 2014 West African Ebola virus disease epidemic. This article, like a number of other disease articles on Wikipedia, contains granular case counts and deaths counts per country affected by the disease. By computing the root-mean-square error between the Wikipedia time series and a ground truth time series, we show that the Wikipedia time series are both timely and accurate. vi PUBLIC ABSTRACT This dissertation presents three studies that aim to improve the current state of disease surveillance. The first study looks at designing traditional sentinel surveillance systems, which are instrumental in guiding policy-makers' decisions and understanding disease dynamics. We use several popular location-allocation models to algorithmically design surveillance networks of varying sizes. By simulating the spread of influenza across the state of Iowa, we demonstrate that we are capable of generating smaller networks capable of performance at least as good as the volunteer-based influenza surveillance network used by the Iowa Department of Public Health. We also apply these network design methods to primary stroke center placement. The second and third studies recognize that while these traditional surveillance systems are important, they have drawbacks, such as reporting lags and untrustwor- thy, unreliable, or unavailable data in some instances. To help solve these problems, we introduce a novel data source: Wikipedia. The second study displays how a linear model using time series of article accesses can nowcast and forecast a variety of diseases in a variety of locations. Finally, the third study demonstrates how disease-related data can be elicited from Wikipedia article content. We show how a named-entity recognizer can be trained to tag case, death, and hospitalization counts in the article narrative. We also analyze tabular time series data and show that they are accurate and timely. vii TABLE OF CONTENTS LIST OF TABLES . .x LIST OF FIGURES . xi CHAPTER 1 INTRODUCTION . .1 2 HOW MANY SUFFICE? A COMPUTATIONAL METHOD FOR SIZ- ING SENTINEL SURVEILLANCE NETWORKS . .8 2.1 Background . .9 2.2 Methods . 13 2.2.1 Online tool . 13 2.2.2 Algorithms . 15 2.2.2.1 Maximal coverage model . 15 2.2.2.2 K-median model . 17 2.2.3 Validation . 18 2.2.3.1 Medicaid dataset . 18 2.2.3.2 Simulation . 22 2.3 Results . 24 2.3.1 Outbreak intensity .