© Copyright 2019 Fahad Pervaiz

Understanding Challenges in the Data Pipeline for Development Data
Fahad Pervaiz

A dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy

University of Washington
2019

Reading Committee:
Richard Anderson, Chair
Kurtis Heimerl
Abraham Flaxman

Program Authorized to Offer Degree: Paul G. Allen School of Computer Science & Engineering

Abstract

Chair of the Supervisory Committee: Professor Richard Anderson, Paul G. Allen School of Computer Science & Engineering

The developing world is relying more and more on data-driven policies. Numerous development agencies have pushed for on-ground data collection to support the development work they pursue. Many governments have launched efforts for more frequent information gathering. Overall, the amount of data collected is tremendous, yet we face significant issues in doing useful analysis. Most of these barriers are around data cleaning and merging, and they require a data engineer to support some parts of the analysis. This thesis aims to understand the pain points of cleaning development data. It also proposes solutions that harness the thought process of a data engineer to reduce the manual workload of the tedious process of cleaning such data. To achieve these goals, two research areas are critical: (1) to discern current data usage patterns and to build a taxonomy of data cleaning in the developing world; and (2) to create algorithms to support automated data cleaning, which target selected problems including matching transliterated names. With these goals, this thesis will empower regular data users to easily do the necessary data cleaning and scrubbing for analysis.

TABLE OF CONTENTS

List of Figures
List of Tables
Chapter 1: Introduction
Chapter 2: Background
    2.1 Types of Datasets
    2.2 Challenges in Development Data
    2.3 The Development Data Pipeline
Chapter 3: Identified Problems in the Data Pipeline
    3.1 Methodology
    3.2 Findings
    3.3 Discussion
    3.4 Conclusion
Chapter 4: Case Study: Improved Cold Chain Information System
    4.1 Summary
    4.2 Introduction
    4.3 Developing a CCIS for Laos
    4.4 Field Experience
    4.5 Discussion
Chapter 5: Case Study: Developing Cold Chain Data Standard
    5.1 Summary
    5.2 Introduction
    5.3 Immunization
    5.4 Cold Chain Information System
    5.5 Software Context and Challenges
    5.6 Data Standards
    5.7 Discussion
    5.8 Conclusion
Chapter 6: Case Study: An Assessment of SMS Fraud in Pakistan
    6.1 Summary
    6.2 Introduction
    6.3 Related Work
    6.4 SMS Fraud Background
    6.5 Data Collection
    6.6 Data Analysis
    6.7 Results
    6.8 Qualitative Analysis
    6.9 Discussion
    6.10 Conclusion
Chapter 7: The Scalability of SMS Reporting Systems: Integrating with National Health Information Systems
    7.1 Summary
    7.2 Introduction
    7.3 Background
    7.4 Case Studies
    7.5 Barriers to Responding
    7.6 Discussion: Lessons at Scale
    7.7 Conclusion
Chapter 8: Name Resolution for Data Cleaning
    8.1 Summary
    8.2 Introduction
    8.3 Related Work
    8.4 Methodology
    8.5 Evaluation
    8.6 Discussion
    8.7 Conclusion
Chapter 9: Conclusion
Bibliography

LIST OF FIGURES

2.1 Various stages of the data pipeline along with a list of challenges at each stage.
3.1 Summary of specific challenges grouped into categories. Challenges in bold text were mentioned by three or more participants.
4.1 (Top) fridgetag 30DTR with no alarms in the last thirty days, (bottom) fridgetag 30DTR with two high temperature alarms, yesterday and two days ago.
4.2 Typical rural health center staffed by approximately four health workers.
4.3 Lao PDR; the NIP office is located in the capital, Vientiane.
4.4 System architecture: Health workers send data via SMS to an Android phone that syncs with a cloud system. The cold chain manager and SMS moderator manage these systems using their respective web interfaces.
4.5 Vaccine refrigerator labeled with A for reporting.
4.6 Five valid messages that all have the same semantic meaning. Each message tells the system that refrigerators A and B had zero alarms and that the current stock levels for pentavalent and pneumococcal are 20 and 30.
6.1 Screenshots from Safe SMS app, showing how to label a conversation.
6.2 Number of conversations that were available on each user's phone, the ones they uploaded, and the conversations that were labeled by them.
6.3 Presence of different features with important ones highlighted.
7.1 Wheel (close-up of the SMS report job aid for DSS).
7.2 A BHU dispenser explaining how he uses the SMS reporting wheel to create an SMS message.
7.3 A 30 day temperature recorder (30DTR) in a Lao refrigerator showing two high alarms in the last two days.
7.4 The Lao SMS Immunization Manager (SIM) showing a list of incoming SMS reports of October 2014.
7.5 An example message showing a sample of special characters accidentally typed during a training in Laos.
8.1 Sensitivity of string matching algorithms against the Niger transliterated locality names.
8.2 Sensitivity of different heuristics defined, using a combination of string matching algorithms, against the Niger transliterated locality names.

LIST OF TABLES

3.1 Distribution of participants by role and organization type.
6.1 Summary of User Labeled Data.
6.2 Summary of Fraud Types.
6.3 Examples of fraudulent messages that were collected. English translations are given for messages sent in Roman Urdu.
6.4 Summary of Heuristic Results.
8.1 List of Heuristics.
8.2 Examples of failed matches.
Chapter 1

INTRODUCTION

Global development organizations and governments in developing regions increasingly rely on data to inform policies that are intended to improve health, education, employment, human rights, and economic development. Significant amounts of data are collected and analyzed by a wide array of stakeholders (e.g., government agencies, non-governmental organizations, global development donors, and social enterprises) to conceptualize, implement, evaluate, and support policy decisions. Often these stakeholders have different goals and strategies for data collection and analysis. For example, while government agencies often collect a wide range of data to get an overview of different development indicators, nonprofit and non-governmental organizations gather data to identify insights on a specific topic or to measure the impact of a specific intervention. For these reasons, attempts to collect data are often disorganized and siloed, which results in copious amounts of poor-quality data that is inconsistent and isolated and that lacks structure and standards, making it hard to clean and analyze. In many instances, data is collected without much consideration and planning, and it often remains little used or forgotten, resulting in time- and cost-intensive duplicated efforts to collect the same data for different purposes.

Although it is desirable to combine datasets that cover similar domains, merging, transforming, and cleaning datasets with different schemas and types is a non-trivial process that requires a substantial amount of effort. Data processing also involves multiple stakeholders, both within and outside the organization, and data often goes through multiple processing stages, including importing, merging, rebuilding missing datasets, standardizing and normalizing, de-duplicating, and exporting. Processed datasets then undergo cycles of analyses and visualizations.
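
As a minimal illustration of a few of the stages listed above (importing into a shared schema, standardizing, removing duplicate records, and merging), the following Python sketch combines two small datasets that describe the same facilities under different schemas. It is only a sketch under stated assumptions: the field names, facility names, and helper functions are hypothetical and are not taken from the datasets or tools studied in this thesis.

```python
# Hypothetical records from two sources; field names and values are
# illustrative only and do not come from any real dataset.
survey_rows = [
    {"Facility Name": "  Clinic   A ", "fridges": "2"},
    {"Facility Name": "clinic a", "fridges": "2"},  # same facility entered twice
    {"Facility Name": "Clinic B", "fridges": "1"},
    {"Facility Name": "Clinic C", "fridges": "3"},  # missing from the registry
]
registry_rows = [
    {"name": "CLINIC A", "district": "District 1"},
    {"name": "Clinic B", "district": "District 2"},
]


def normalize_name(raw):
    """Standardize a free-text name: trim, collapse whitespace, lowercase."""
    return " ".join(raw.split()).lower()


def standardize(rows, name_field, rename):
    """Map source rows onto a shared schema keyed by a normalized name."""
    records = []
    for row in rows:
        record = {"key": normalize_name(row[name_field])}
        for old_field, new_field in rename.items():
            record[new_field] = row[old_field]
        records.append(record)
    return records


def deduplicate(records):
    """Keep only the first record seen for each key."""
    first_seen = {}
    for record in records:
        first_seen.setdefault(record["key"], record)
    return list(first_seen.values())


def merge(left, right):
    """Join two record lists on 'key' and report left keys with no match."""
    right_by_key = {record["key"]: record for record in right}
    merged, unmatched = [], []
    for record in left:
        match = right_by_key.get(record["key"])
        if match is None:
            unmatched.append(record["key"])
        else:
            merged.append({**match, **record})
    return merged, unmatched


survey = deduplicate(
    standardize(survey_rows, "Facility Name", {"fridges": "fridge_count"})
)
registry = deduplicate(standardize(registry_rows, "name", {"district": "district"}))
merged, unmatched = merge(survey, registry)
```

Even in this toy case, the join succeeds only because the names differ in case and spacing alone. Records that fail to match are returned in `unmatched` rather than silently dropped, since they usually need manual review.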
Many of these stages, from data collection to data visualization, and the processes within them are isolated from each other, making it easier for people to work on data independently. However, this isolation also means that people who work on one aspect (e.g., data cleaning) might have no control over processes in other stages (e.g., collection) and may not fully understand the context in which the data is collected, cleaned, transformed, or analyzed.

Several Information and Communication Technology for Development (ICTD) researchers have investigated challenges in collecting development data [45] and designed new tools that are better suited to gathering development data [115, 61, 23, 21]. However, research that examines challenges across the different stages of the data pipeline is largely absent. While some researchers have provided taxonomies for dirty data (inaccurate, incomplete, and inconsistent data) in the context of systems in the developed world [81] and identified challenges such.