
A Spark-based workflow for probabilistic record linkage of healthcare data∗

Robespierre Pita, Clicia Pinto, Pedro Melo, Malu Silva, Marcos Barreto
Distributed Systems Lab., Federal University of Bahia, Salvador, BA, Brazil
[email protected], [email protected], [email protected], [email protected], [email protected]

Davide Rasella
Institute of Public Health, Federal University of Bahia, Salvador, BA, Brazil
[email protected]

ABSTRACT

Several areas, such as science, economics, finance, business intelligence, health, and others are exploring big data as a way to produce new information, make better decisions, and move their related technologies and systems forward. Specifically in health, big data represents a challenging problem due to the poor quality of data in some circumstances and the need to retrieve, aggregate, and process a huge amount of data from disparate databases. In this work, we focus on the Brazilian Public Health System and on large databases from the Ministry of Health and the Ministry of Social Development and Hunger Alleviation. We present our Spark-based approach to data processing and probabilistic record linkage of such databases in order to produce very accurate data marts. These data marts are used by statisticians and epidemiologists to assess the effectiveness of conditional cash transfer programs for poor families with respect to the occurrence of some diseases (tuberculosis, leprosy, and AIDS). The case study we made as a proof of concept presents good performance with accurate results. For comparison, we also discuss an OpenMP-based implementation.

Categories and Subject Descriptors

J.1 [Administrative data processing]: Government; D.1.3 [Concurrent Programming]: Distributed programming.

∗This work is partially sponsored by the Federal University of Bahia (UFBA), with PIBIC and PRODOC grants, by FAPESB (PPSUS-BA 30/2013 grants), and by CNPq — Brazilian National Research Council.

© 2015, Copyright is with the authors. Published in the Workshop Proceedings of the EDBT/ICDT 2015 Joint Conference (March 27, 2015, Brussels, Belgium) on CEUR-WS.org (ISSN 1613-0073). Distribution of this paper is permitted under the terms of the Creative Commons license CC-by-nc-nd 4.0.

1. INTRODUCTION

The term big data [18] was coined to represent the large volume of data produced daily by thousands of devices, users, and computer systems. These data should be stored in secure, scalable infrastructures in order to be processed using knowledge discovery and analytics tools. Today, there is a significant number of big data applications covering several areas, such as finance, entertainment, e-government, science, health, etc. All these applications require performance, reliability, and accurate results from their underlying execution environments, as well as specific requisites depending on each context.

Healthcare data come from different information systems, disparate databases, and potential applications that need to be combined for diverse purposes, including the aggregation of medical and hospital services, analysis of patients' profiles and diseases, assessment of public health policies, monitoring of drug interventions, and so on.

Our work focuses on the Brazilian Public Health System [23], specifically on supporting the assessment of data quality, pre-processing, and linkage of databases provided by the Ministry of Health and the Ministry of Social Development and Hunger Alleviation. The data marts produced by the linkage are used by statisticians and epidemiologists in order to assess the effectiveness of conditional cash transfer programs for poor families in relation to some diseases, such as leprosy, tuberculosis, and AIDS.

We present a four-stage workflow designed to provide the functionalities mentioned above. The second (pre-processing) and third (linkage) stages of our workflow are very data-intensive and time-consuming tasks, so we based our implementation on the Spark scalable execution engine [41] in order to produce very accurate results in a short period of time.

The first stage (assessment of data quality) is carried out with SPSS [17]. The last stage is dedicated to the evaluation of the data marts produced by our pre-processing and linkage algorithms and is performed by statisticians and epidemiologists. Once approved, they load these data marts into SPSS and Stata [6] in order to perform some specific case studies.

We evaluate our workflow by linking three databases: CadÚnico (social and economic data of poor families — approximately 76 million records), PBF (payments from the "Bolsa Família" program), and SIH (hospitalization data from the Brazilian Public Health System — 56,059 records). We discuss the results obtained with our Spark-based implementation and also a comparison with an OpenMP-based implementation.

This paper is structured as follows: Section 2 presents the Brazilian Healthcare System to contextualize our work. In Section 3 we discuss some related work, especially focusing on record linkage. Our proposed workflow is detailed in Section 4 and its Spark-based implementation is discussed in Section 5. We present results obtained from our case study, both in Spark and OpenMP, in Section 6. Some concluding remarks are presented in Section 7.

2. BRAZILIAN HEALTHCARE SYSTEM

As a strategy to combat poverty, the Brazilian government implemented cash transfer policies for poor families, in order to facilitate their access to education and healthcare, as well as to offer them allowances for consuming goods and services. In particular, the "Bolsa Família" Program [25] was created under the management of the Ministry of Social Development and Hunger Alleviation to support poor families and promote their social inclusion through income transfers.

Socioeconomic information about poor families is kept in a database called Cadastro Único (CadÚnico) [24]. All families with a monthly income below half the minimum wage per person, or a total monthly income of less than three minimum wages, can be enrolled in the database. This registration must be renewed every two years in order to keep the data updated. All social programs from the federal government should select their recipients based on data contained in CadÚnico.

In order to observe the influence of certain social interventions and their positive (or negative) effects on their beneficiaries, rigorous impact evaluations are required. Individual cohorts [19] have emerged as the primary method for this purpose, supporting the process of improving public policies and social programs in order to qualify the transparency of public investments. It is expected that these transfer programs can positively contribute to the health and the education of beneficiary families, but studies capable of proving this are highly desirable and necessary for the evaluation of public policies.

From an epidemiological standpoint, tuberculosis and leprosy are major public health problems in Brazil, with poverty as one of their main drivers. In addition, there is a broad consensus on the bidirectional relationship between these infectious diseases and poverty: one can lead to the other. It is therefore clear that reducing morbidity and mortality from poverty-related diseases requires planning interventions that address their social determinants.

This work pertains to a project involving the longitudinal study of CadÚnico, PBF ("Bolsa Família" program), and three databases from the Brazilian Public Health System (SUS): SIH (hospitalization), SINAN (notifiable diseases), and SIM (mortality). Table 1 shows these databases with the years of coverage to which we have access. The main goal is to relate individuals in the existing SUS databases with their counterparts in PBF and CadÚnico, through a process called linkage (or pairing). After linkage, the resulting databases (data marts) are used by statisticians and epidemiologists to analyze the incidence of some diseases in families benefiting from "Bolsa Família" compared to non-beneficiary families.

  Databases                       Years
  SIH (hospitalization)           1998 to 2011
  SINAN (notifications)           2000 to 2010
  SIM (mortality)                 2000 to 2010
  CadÚnico (socioeconomic data)   2007 to 2013
  PBF (payments)                  2007 to 2013

Table 1: Brazilian governmental databases.

The major obstacle for linkage is the absence of common identifiers (key attributes) in all databases, which requires the use of probabilistic linkage algorithms, resulting in a significant number of comparisons and in a large execution time. In addition, handling these databases requires the use of secrecy and confidentiality policies for personal information, especially those related to health data. Therefore, techniques for data transformation and anonymisation should be employed before the linkage stage.

The longitudinal study requires the pairing of all available versions of certain databases within the period to be analyzed. In the scope of our project, we must link versions of CadÚnico, PBF, and SIH between 2007 and 2011 to allow a retrospective analysis of the incidence of diseases in poor families and, thereafter, draw up prospects for the coming years. In this scenario, the amount of data to be analyzed, processed, and anonymised tends to increase significantly.

3. CHALLENGES AND RELATED WORK

Record linkage is not a new problem and its classic method was first proposed by [13]. This approach is the basis for most of the models developed later [5]. The basic idea is to use a set of common attributes present in records from different data sources in order to identify true matches.

In [32], probabilistic and deterministic record linkage methods were used to evaluate the impact of the "Bolsa Família" program on education, using some information also contained in CadÚnico. That work proved the importance of database relationships as a tool capable of allowing an integrated view of the information available from various sources, ensuring efficient comparative analysis and increasing the quality and quantity of information required for a search. In public health, many studies use matched records to evaluate impacts or to find patterns [27].

In [11], the authors used probabilistic methods to match records from two SUS databases — SIH (hospitalization) and SIM (mortality) — to identify deaths from ill-defined causes. They developed routines for standardizing variables, blocking based on identification keys, comparison algorithms, and calculation of similarity scores. They also used RecLink [4] to check dubious records for reclassification (as true pairs or not).

A crucial point is that as the size of the databases increases, and therefore the number of comparisons required for record matching, traditional tools for data processing and analysis may not be able to run such applications in a timely manner. Among the several studies on software for record linkage, there are few that discuss issues related to the parallelization of processes and data distribution. In [33], some ways to parallelize matching algorithms are discussed, showing good scalability results.

The MapReduce paradigm and the technologies that followed it have contributed to advance the big data scenario. Some methods to adapt the MapReduce model to deal with record matching are discussed in [16]. Despite these efforts, it is still difficult to find references addressing the problem of matching records using the advantages of MapReduce or similar tools.

Computational techniques related to the preparation steps for record linkage, such as data cleansing and standardization, are still little discussed in the literature. In [31], the authors claim that the cleansing process can represent 75% of the total linkage effort. In fact, preparation steps can directly affect the accuracy of results. It is possible to observe that management and some aspects of service provision in this context are not yet sufficiently explored [22]. Regarding databases under the coordination of public sectors, such as CadÚnico and the SUS databases, we can observe high sensitivity and strict requirements for processing and storing such databases in private clusters. There is also a lack, mainly in Brazil, of probabilistic matching references over large databases that use the benefits of big data tools.

4. PROPOSED WORKFLOW

Our workflow is divided into four stages, further discussed in the following sections. The first stage corresponds to the analysis of data quality, aiming to identify, for each database, the attributes most suitable for the probabilistic record linkage process. The set of attributes is chosen based on metrics such as missing values or misfiled records. This step is performed with the support of SPSS software. For security and privacy reasons, the ministries do not allow direct access to their databases; instead, they give us flat files extracted from the databases listed in Table 1. Two people on our team are the only ones who manipulate these data, under a strict confidentiality term.

The next stage is pre-processing, responsible for applying data transformation and cleansing routines to these attributes. We based our implementation on ETL (extract, transform, and load) techniques commonly found in tools for standardizing names, filling null/missing fields with default values, and removing duplicate records.

An important step within this stage regards data privacy. We apply a technique based on Bloom filters [34] to anonymize relevant fields prior to the record linkage stage. As stated before, pre-processing is a time-consuming, data-intensive stage, so we use Spark to perform data transformation, cleansing, anonymization, and blocking.

The record linkage stage applies deterministic and probabilistic algorithms to perform pairing. Between the CadÚnico and PBF databases, we can use a deterministic record linkage algorithm based on a common attribute called NIS (social number ID), as illustrated in the sketch below. All beneficiaries of PBF are necessarily registered in CadÚnico and therefore have this attribute. Linkage between CadÚnico and any SUS database (SIH, SINAN, and SIM) must be done through probabilistic algorithms, since there are no attributes common to all databases.

Within the SUS databases, the occurrence of incomplete records is quite high, since many records correspond to children or homeless people, who do not always have identification documents or are not directly registered in CadÚnico. In such cases, we try to find a record of an immediate family member, when available. Another very common problem regards incomplete or abbreviated names, which hinders pairing. Again, we use Spark to execute our linkage algorithms in a timely manner and produce the resulting data marts (files with matched and non-matched records).
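The deterministic step based on the NIS key can be sketched with Spark's Python API. This is only a minimal illustration under assumed conventions (semicolon-separated flat files with NIS as the first field; file names are placeholders), not the production implementation.

# Minimal sketch of the deterministic linkage by the NIS key.
# Assumptions: semicolon-separated flat files, NIS in the first column.
from pyspark import SparkContext

sc = SparkContext(appName="DeterministicLinkageByNIS")

def key_by_nis(line):
    fields = line.split(";")
    return (fields[0], line)                  # (NIS, full record)

cadunico = sc.textFile("cadunico.csv").map(key_by_nis)
pbf = sc.textFile("pbf.csv").map(key_by_nis)

# Records sharing the same NIS are paired directly; no similarity test is needed.
matches = cadunico.join(pbf)                  # RDD of (NIS, (cadunico_record, pbf_record))
matches.saveAsTextFile("cadunico_pbf_matches")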

The last stage is performed by statisticians and epidemiologists with the support of statistical tools (Stata and SPSS). The goal is to evaluate the accuracy of the data marts produced by the linkage algorithms, based on data samples from the databases involved. This step is extremely important to validate our implementation and provide feedback for corrections and adjustments in our workflow.

In the following sections, we discuss a case study on the linkage of the CadÚnico and SIH databases made as a proof of concept of our workflow. The goal was to generate a data mart covering such databases that is used to analyze the incidence of tuberculosis in PBF beneficiary and non-beneficiary families. We chose the databases of the year 2011, with approximately 76 million records and 56,059 records, respectively.

4.1 Data Quality Assessment

Attributes suitable for probabilistic matching should be chosen taking into account their coexistence in all databases, their discriminatory capacity, and their quality in terms of filling requirements and constraints. The occurrence of null or missing values is the major problem considered at this stage. This problem can occur by omission or negligence of the operator responsible for filling out forms or by the faulty implementation of the information systems involved. The analysis of missing values is extremely important, because using a variable that has a high incidence of empty fields brings little or no benefit to the matching process.

In our case study, we analyzed the occurrence of null and missing values in the CadÚnico and SIH databases. Tables 2 and 3 show the results obtained for the most significant attributes in each database. Based on the results, we chose three attributes: NOME (person's or patient's name), NASC (date of birth), and MUNIC RES (city of residence).

  Attribute   Description              Missing (%)
  NIS         Social number ID         0.7
  NOME        Person's name            0
  DT NASC     Date of birth            0
  MUNIC RES   City of residence        55.4
  SEXO        Gender                   0
  RG          General ID               48.7
  CPF         Individual taxpayer ID   52.1

Table 2: CadÚnico — missing values.

  Attribute    Description          Missing (%)
  MUNIC RES    City of residence    0
  NASC         Date of birth        0
  SEXO         Gender               0
  NOME         Patient's name       0
  LOGR         Street name          0.9
  NUM LOGR     House number         16.4
  COMPL LOGR   Address complement   80.7

Table 3: SIH — missing values.

4.2 Data Pre-processing

Data marts produced in our case study are composed of linked information that reflects the pairing process output. They should contain information about people hospitalized in 2011 with a primary diagnosis of tuberculosis and their socioeconomic data, if registered in CadÚnico, relevant for epidemiological studies.

To facilitate the linkage and increase accuracy, the values of the NOME attribute are transformed to uppercase and accents (and possible punctuation) are removed, so as not to influence the similarity degree between two records. Attributes with null or missing values are treated through a simple substitution with predefined values. This ensures all records are in the same format and contain the same information pattern.

A fundamental concern in our work is confidentiality. We must use privacy policies to guarantee that personal data are protected throughout the workflow. Linkage routines should not be able to identify any person in any database. To accomplish this, we use Bloom filters for record anonymization. A Bloom filter is an approach that allows checking whether an element belongs to a set. An advantage of this method is that it does not allow false negatives: if a record belongs to the set, the method always returns true. Furthermore, false positives (two records that do not represent the same entity) are allowed. This can be advantageous if the goal is to include records containing small differences among the matched pairs.

The construction of Bloom filters is described in [34] and involves a vector initially populated with 0's. Depending on each attribute, specific positions of this vector, determined by hash functions, are replaced with 1's. Our approach considers an array of 110 positions that maps each bigram (two characters) of the attributes involved. Each attribute affects a fraction of the vector: NOME comprises the first 50 bits, NASC comprises the following 40 bits, and MUNIC RES the last 20 bits, as shown in Figure 1.

Figure 1: Bit vector generated by Bloom filter.

The fractions were chosen considering two aspects: the ideal size a filter must have in order to represent an attribute with a minimum probability of error, and the influence (or "weight") each field has on the matching decision. Accuracy depends on the filter size (and thus the weight of each attribute), the number of hash functions, and the number of elements added to the filter [36]. The smaller the filter, the more errors and false positives are expected, because different records can generate very similar vectors with 1's coincidentally mapped to the same positions. So, there is a classic tradeoff between size and performance: the vector must be large enough to increase accuracy and, at the same time, small enough not to overload the similarity tests.
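A minimal Python sketch of this construction is shown below, with the 110-bit vector split as 50/40/20. The hash scheme (MD5 of each bigram reduced modulo the field size) and the input layout are assumptions made for illustration; the paper does not state which hash functions are used.

import hashlib

# Field layout taken from the text: NOME -> 50 bits, NASC -> 40 bits, MUNIC RES -> 20 bits.
FIELD_SIZES = [("NOME", 50), ("NASC", 40), ("MUNIC_RES", 20)]

def bigrams(value):
    return [value[i:i + 2] for i in range(len(value) - 1)]

def bloom_vector(record):
    """record: dict with the three attributes already normalized (uppercase, no accents)."""
    vector = [0] * sum(size for _, size in FIELD_SIZES)
    offset = 0
    for field, size in FIELD_SIZES:
        for bigram in bigrams(record[field]):
            # One illustrative hash per bigram, mapped into this field's slice of the vector.
            digest = int(hashlib.md5(bigram.encode("utf-8")).hexdigest(), 16)
            vector[offset + digest % size] = 1
        offset += size
    return vector

example = {"NOME": "MARIA DOS SANTOS", "NASC": "19800101", "MUNIC_RES": "SALVADOR"}
print("".join(str(bit) for bit in bloom_vector(example)))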

For testing our Bloom filter, we constructed three controlled scenarios and used two databases with 50 and 20 records, respectively. The idea was to determine the best vector size and the distribution (number of bits for each field) that provides the best accuracy. Table 4 shows our simulation results. In scenario 3, we simulated filling errors. We can observe that when one attribute has a similarity index lower than the expected value, pairing can be saved by the other two attributes that have satisfactory similarity indexes.

Among all distributions that provide correct results, the distribution with 50, 40, and 20 bits is better for all scenarios. In this sense, the attribute MUNIC RES must have less influence than NOME, because the probability that the same value for NOME in different databases refers to the same person is more significant than that of two identical values for city.

Another important task performed during the pre-processing stage is the construction of blocks. The record linkage process requires all records from both databases to be compared in order to determine whether they match or not, which demands M x N comparisons, M and N being the sizes of the databases. However, most of the comparisons will result in non-matched records. In our case study, the number of comparisons between CadÚnico (approximately 76 million records) and SIH (56,059 records) would be quite prohibitive, so we decided to group records in each database according to a similarity criterion. We chose the MUNIC RES (city of residence) attribute as the blocking key, so that only individuals who live in the same city are compared. As blocking strategies are a difficult problem, we are also considering other approaches such as adaptive blocking [2], predicates, and phonetic codes (such as Soundex [38], Metaphone [30], and BuscaBR [7]).

4.3 Calculation of Similarity

The decision on pairing two records depends on the analysis of their similarity factor. In this work, we use the Sørensen index [35], also known as the Dice index [9], to calculate the similarity based on the bigrams (two characters) extracted from the bit vector generated by the Bloom filter.

Given a pair of records similar to those shown in Figure 1, the similarity test runs through every bit of both vectors in order to find three metrics: h, the count of 1's occupying the same position in both vectors, and a and b, the total number of 1's in the first and second vectors, respectively, regardless of their positions. With these values, it is possible to calculate the Sørensen index using the following formula:

  Da,b = 2h / (|a| + |b|)

A perfect result requires the number of 1's contained in the first vector added to that of the second vector to be exactly twice the number of common 1's. When this happens, we have a result equal to 1 (perfect agreement). If the two records differ, the value of h decreases and the ratio becomes smaller than 1.

We multiply the result by 10,000 to represent the Dice coefficient, so values ranging from 0 to 10,000 are used to represent the similarity degree between two vectors. Records are placed into three distinct groups, as depicted in Figure 2. Every pair whose similarity degree is less than 9,000 is considered non-matched. Values between 9,000 and 9,600 are included in an indecision group for manual analysis, whereas values above 9,600 are considered true matches.

Figure 2: Similarity degrees for Dice calculation.
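This classification can be written down directly. The sketch below computes the Dice index of two equal-length bit vectors, scales it to the 0-10,000 range, and applies the 9,000/9,600 cut-offs described above; it only illustrates the formula and is not the authors' implementation.

def dice_index(vec_a, vec_b):
    """Sørensen-Dice index of two equal-length bit vectors, scaled to 0-10,000."""
    h = sum(1 for x, y in zip(vec_a, vec_b) if x == 1 and y == 1)  # common 1's
    a = sum(vec_a)                                                 # 1's in the first vector
    b = sum(vec_b)                                                 # 1's in the second vector
    if a + b == 0:
        return 0
    return int(10000 * 2 * h / (a + b))

def classify(score):
    if score < 9000:
        return "non-matched"
    if score <= 9600:
        return "indecision (manual analysis)"
    return "true match"

vec_a = [1] * 10 + [0] * 2
vec_b = [1] * 11 + [0] * 1
score = dice_index(vec_a, vec_b)
print(score, classify(score))   # 9523 indecision (manual analysis)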

  Total size and        Scenario 1:           Scenario 2:               Scenario 3:
  weight distribution   no matched records    five perfectly matched    five matched records expected,
                        expected              records expected          with one incorrect character
                        Expected   Pairings   Expected   Pairings       Expected   Pairings
                        pairings   found      pairings   found          pairings   found
  20x20x20              0          310        5          347            5          348
  30x30x30              0          29         5          41             5          42
  40x40x40              0          11         5          17             5          16
  50x50x50              0          0          5          5              5          5
  50x50x40              0          0          5          5              5          5
  50x40x40              0          0          5          5              5          5
  50x40x30              0          0          5          5              5          5
  50x30x30              0          2          5          6              5          6
  50x40x20              0          0          5          5              5          5
Table 4: Comparison of different vector sizes and weight distributions.

4.4 Record Linkage

Practical applications of record linkage exist in several areas. For the impact assessment of strategies, for example, it is often necessary to use individual search methods to prove whether a specific situation happens in the whole group being analysed. Therefore, record linkage is a suitable method to follow cohorts of individuals by monitoring databases that contain continuous outcomes [32]. The interest group can be individually observed in order to obtain more accurate results or to identify variations in the characteristics of each individual. This situation is called a longitudinal study [19].

Probabilistic approaches can be used to match records without common keys from disparate databases. To succeed, we must use a set of attributes for which a probability of pairing can be set. This method requires a careful choice of the keys involved in matching or indeterminacy decisions [10]. This is the case, for example, of determining whether the records "Maria dos Santos Oliveira, Rua Caetano Moura, Salvador" and "Maria S. Oliveira, R. Caetano Moura, Salvador" refer to the same person. The main disadvantages of probabilistic approaches are their long execution times and the debugging complexity they impose.

One of the big challenges in probabilistic record linkage is to link records with different schemas and achieve good accuracy [11]. There are many problems that hinder pairing, such as abbreviations, different naming conventions, omissions, and transcription and gathering errors. Another big issue is scaling the algorithms to large data sets. Transformation and similarity calculation are important challenges for the execution environment when scaled to large databases.

5. SPARK-BASED DESIGN ISSUES

The pioneering programming model capable of handling hundreds or thousands of machines in a cluster, providing fault tolerance, petascale computing, and high abstraction in building applications was MapReduce [8], further popularized by its open-source implementation provided by Hadoop [1]. Basically, this model proposes the division of the input data into splits that must be processed by the threads, cores or machines in a cluster responsible for running the map or reduce functions written by the developer. Intermediate data generated by the first phase are stored on the local disks of the processing machines and are accessed remotely by the machines performing reduce jobs.

Hadoop was responsible for driving a number of variations seeking to meet specific requirements. Hive [37], Pig [28], GraphLab [20], and Twister [12] are examples of initiatives classified as "beyond Hadoop" [26], which basically keep the MapReduce paradigm but intend to generate new levels of abstraction. However, some authors have indicated significant shortcomings in MapReduce, especially for applications that need to process large volumes of data with strong requirements regarding iterations, or even with different performance requisites. New frameworks classified as "beyond MapReduce", such as Dremel [21], Jumbo [15], Shark [39], and Spark [41], were created to deal with these new requirements.

Spark is a framework that allows the design of applications based on working sets, the use of general-purpose languages (such as Java, Scala, and Python), in-memory data processing, and a new data distribution model called RDD (resilient distributed dataset) [40]. An RDD is a collection of read-only objects partitioned across a set of machines that can be rebuilt if a partition is lost.

The main benefits of using Spark are related to the creation of an RDD for the dataset that must be processed. There are two basic ways to create an RDD, both using the SparkContext class: parallelizing a vector of iterable items created at runtime, or referencing a dataset in an external storage system such as a shared filesystem, HDFS [3], HBase [14], or any data source offering a Hadoop-like InputFormat interface [41].
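For concreteness, the two ways of creating an RDD mentioned above look roughly as follows in the Python API (the file path is a placeholder):

from pyspark import SparkContext

sc = SparkContext(appName="RddCreationExample")

# (1) Parallelize an iterable built at runtime.
numbers = sc.parallelize([1, 2, 3, 4, 5])

# (2) Reference a dataset in external storage (local file, HDFS, or any Hadoop InputFormat source).
records = sc.textFile("hdfs:///linkage/input/records.csv")

# Transformations are lazy; work happens only when an action such as collect() or count() runs.
print(numbers.map(lambda x: x * 2).collect())   # [2, 4, 6, 8, 10]
print(records.count())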

RDDs can be used through two classes of basic operations: transformations, which create a new dataset from an existing one, and actions, which return a value to the driver program after running a computation on the dataset. The first class is implemented using lazy evaluation and is intended to provide better performance and management of large data sets: transformations are only computed when an action requires a value to be returned to the driver program [41]. Table 5 shows the main features of the Spark framework we used to implement our probabilistic record linkage algorithms.

  Transformation        Meaning
  map(func)             Returns a new RDD by passing each element of the source through func.
  mapPartitions(func)   Similar to map, but runs separately on each partition (block) of the RDD.

  Action                Meaning
  collect()             Returns all the elements of the dataset as an array at the driver program.
  count()               Returns the number of elements in the dataset.

Table 5: RDD API used for record linkage.

Another advantage of Spark is its ability to perform tasks in memory. Views generated during execution are kept in memory, avoiding the storage of intermediate data on hard disks. Spark's developers claim that it is possible to reduce the execution time by up to 100 times thanks to the use of working sets, and by up to 10 times if hard disks are used. So, our choice of Spark is justified by its performance, scalability, the RDD's fault tolerance, and a very comfortable learning curve due to its compatibility with different programming languages.

Algorithm 1 PreProcessing
  Input ← OriginalDatabase.csv
  Output ← TreatedDatabaseAnom.bloom
  InputSparkC ← sc.textFile(Input)
  NameSize ← 50; BirthSize ← 40; CitySize ← 20
  ResultBeta ← InputSparkC.cache().map(normalize)
  Result ← ResultBeta.cache().map(blocking).collect()
  for line in Result:
      write line to Output

  procedure normalize(rawLine)
      splitedLine ← rawLine.split(";")
      for field in splitedLine:
          field ← field.normalized(UTF8)
      return splitedLine.join(";")

  procedure blocking(treatedLine)
      splLine ← treatedLine.split(";")
      splLine[0] ← applyBloom(splLine[0], NameSize)
      splLine[1] ← applyBloom(splLine[1], BirthSize)
      splLine[2] ← applyBloom(splLine[2], CitySize)
      return splLine.join(";")

  procedure applyBloom(field, vectorSize)
      bitsVector ← zeroed vector of vectorSize positions
      for each bigram in field:
          set to 1 the positions of bitsVector given by the hashes of the bigram
      return bitsVector

The pre-processing stage follows Algorithm 1, which shows how this flow is implemented as a chain of transformations over the input data, using map functions that call the other procedures. The intention is that map(blocking) starts running as map(normalize) delivers its results; we use the collect() action to ensure this. It is also important to highlight the use of the cache() function, which keeps in memory the splits extracted from the input files.

Algorithm 2 Record linkage
  InputMinor ← TreatedDatabaseAnom1.bloom
  InputLarger ← TreatedDatabaseAnom2.bloom
  InputSC1 ← sc.textFile(InputMinor)
  InputSC2 ← sc.textFile(InputLarger)
  var ← InputSC1.cache().collect()
  varbc ← sc.broadcast(var)
  Result ← InputSC2.cache().map(compare).collect()
  for line in Result:
      write line to Output

  procedure compare(line)
      for linebc in varbc.value:
          compute the Dice index of linebc and line
          if Dice >= 9000 then return line
      return None

Algorithm 2 shows our record linkage flow. We use an RDD object, since it is read-only, to map the smallest database (SIH). We also use a shared variable, called a broadcast variable in Spark, to give every node a copy of the largest database (CadÚnico) in an efficient manner, preventing communication costs, file loads, and split management. A comparison procedure calculates the Dice index and decides about matching.
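As a rough PySpark rendering of the flow in Algorithm 2 (assuming the .bloom files hold one bit string per record, which is our assumption about the file format, not something the paper states):

from pyspark import SparkContext

sc = SparkContext(appName="ProbabilisticLinkageSketch")

def parse(line):
    return [int(c) for c in line.strip()]

input_sc1 = sc.textFile("TreatedDatabaseAnom1.bloom").map(parse)
input_sc2 = sc.textFile("TreatedDatabaseAnom2.bloom").map(parse)

# Collect one side and broadcast it so every worker holds a read-only copy.
var_bc = sc.broadcast(input_sc1.collect())

def dice(u, v):
    h = sum(1 for x, y in zip(u, v) if x and y)
    total = sum(u) + sum(v)
    return 10000 * 2 * h // total if total else 0

def compare(vector):
    # Emit every broadcast record whose Dice index reaches the decision threshold.
    return [(vector, candidate) for candidate in var_bc.value if dice(candidate, vector) >= 9000]

matches = input_sc2.flatMap(compare)
matches.saveAsTextFile("linkage_output")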

6. PERFORMANCE EVALUATION

In order to evaluate the proposed workflow, we ran our Spark implementation on a cluster with 8 Intel Xeon E7-4820 processors, 16 cores, 126 GB of RAM, and a storage machine with up to 10 TB of disks connected through the NFS protocol. We compared this implementation with our OpenMP version of the same workflow, also considering other multicore machines: an i5 processor with 4 GB of RAM and 300 GB of hard disk, and an i7 processor with 32 GB of RAM and 350 GB of hard disk.

6.1 Spark

For Spark, we chose three samples from the CadÚnico and SIH databases, each representing all the cities from the states of Amapá (Sample A), Sergipe (Sample B), and Tocantins (Sample C). These samples represent the smallest Brazilian states in terms of number of records in CadÚnico. Based on them, we can get an idea of the number of comparisons and the rise in the execution time in each case, as shown in Table 6.

  Sample   Size (in lines)    Comparisons   Exec. time
  name     CadÚnico x SIH     (millions)    (seconds)
  A        367,892 x 147      54.0          96.26
  B        1.6 mi x 171       289.5         479
  C        1.02 mi x 389      397.63        656.79

Table 6: Spark results for record linkage.

These preliminary results are very promising if we consider the possibility of scaling up the number of machines involved in data processing. Table 7 details the time spent in each stage of the workflow. The standardization, anonymization, and blocking stages are detailed by Algorithm 1 and take only a few minutes on the larger database, while the similarity test and the decision on pairing require a longer running time. The last step consists in recovering the pairs of linked records to create a data mart. Together, all steps take no more than 12 hours of execution.

                                    CadÚnico        SIH
  Size (lines)                      approx. 87 mi   approx. 61 k
  Standardization, anonymization,
  and blocking                      2310.4 s        36.5 s
  Record linkage                    9.03 hours
  Paired recovery                   1.31 hours

Table 7: Execution time within the workflow.

6.2 OpenMP

The OpenMP interface [29] was chosen due to its syntactic simplicity. This kind of implementation divides a task between threads that execute simultaneously, distributed across processors or functional units. The OpenMP API supports the C, C++, and Fortran programming languages. The C language was chosen because of its worldwide familiarity.

The database sets used for this implementation were files containing the results of the Bloom filter applied during the pre-processing stage. These files have N bits in each line (record). The goal is to perform the record linkage by calculating the Dice coefficient for each pair of records and writing the positive Dice results and their respective lines to an output file.

Access to the database sets in the C language is made through pointer types. When a parallel region of the code is initialized, it is necessary to specify which variables are global and which are local to the threads. If a pointer is global to the threads, there is a race condition problem if they try to access different positions of the same file. This problem was solved by making these pointers private to each thread. As it is not possible to pass pointer types (only native types), they are created for every line (record) of one of the files. As the files always have the same N bits in each line, it is possible to assign each thread a unique set of lines from these files.

  Tools    i5        i7        Cluster
  Spark    507.5 s   235.7 s   96.26 s
  OpenMP   104.9 s   65.5 s    13.36 s

Table 8: OpenMP x Spark metrics (Sample A).

Table 8 shows the execution time achieved by OpenMP for our Sample A. The machines we used have the following configurations: i5 (4 cores, 8 execution threads) and i7 (8 cores, 16 execution threads); the cluster has been described at the beginning of Section 6. The execution time on the i7 processor was 37% shorter than on the i5, showing that the execution time could be even shorter when using computers with more threads per core. Despite its shorter execution time, this approach does not provide a number of advantages offered by Spark, such as scalability and fault tolerance.

The number of matching records was 245. The results were very satisfactory, taking into account that the application runs on only one computer. This shows that the OpenMP implementation is indicated for small linkages or for bigger linkages blocked into smaller parts. We are also considering the use of OpenMP for generating the Bloom filter and for grouping records to compose the data marts.

7. CONCLUDING REMARKS

The development of a computational infrastructure to support projects focusing on big data from health systems, like the case study discussed here, was motivated by two factors.

First, the need to provide a tool capable of linking disparate databases with socioeconomic and healthcare data, serving as a basis for decision-making processes and the assessment of the effectiveness of governmental programs. Second, the availability of recent tools for big data processing and analytics, such as those mentioned in this work, with interesting capabilities to deal with the new requirements imposed by the applications.

Among the available tools, we chose Spark due to its in-memory facility, its scalability, and its ease of programming. Our preliminary tests present very promising results, reinforcing the need for some adjustments in our implementation. New features recently included in Spark could help us, such as the SparkR extension for data quality assessment. We are also testing other techniques throughout the workflow, like phonetic codes, predicates (for blocking), and multi-bit trees.

We plan to continue our tests with OpenMP in order to identify scenarios for which it can provide good performance. The exploration of hybrid architectures (multicore + multi-GPU) is also in our roadmap.

The execution platform developed in this work represents a major advance over existing solutions for record linkage in Brazil. It will serve as the base architecture for the installation of a Referral Center for Probabilistic Linkage, and should be supplemented with new features regarding privacy, security, and storage, among others.

8. REFERENCES

[1] Apache. Apache Hadoop, 2014. [Online; accessed 12-December-2014].
[2] M. Bilenko, B. Kamath, and R. J. Mooney. Adaptive blocking: learning to scale up record linkage. In ICDM, pages 87-96. IEEE Computer Society, 2006.
[3] D. Borthakur. HDFS architecture guide. Hadoop Apache Project, http://hadoop.apache.org/common/docs/current/hdfs design.pdf, 2008.
[4] K. R. Camargo Jr. and C. M. Coeli. RecLink: aplicativo para o relacionamento de bases de dados, implementando o método probabilistic record linkage. Cadernos de Saúde Pública, 16:439-447, 2000.
[5] P. Christen. Data matching: concepts and techniques for record linkage, entity resolution, and duplicate detection. Springer, 1st edition, 2012.
[6] Stata Corporation. Stata — data analysis and statistical software, 2014. [Online; accessed 12-December-2014].
[7] F. J. T. de Lucena. Busca fonética em português do Brasil, 2006. [Online; accessed 13-December-2014].
[8] J. Dean and S. Ghemawat. MapReduce: simplified data processing on large clusters. Communications of the ACM, 51(1):107-113, 2008.
[9] L. R. Dice. Measures of the amount of ecologic association between species. Ecology, 26(3), 1945.
[10] A. Doan, A. Halevy, and Z. Ives. Principles of data integration. Morgan Kaufmann, 2012.
[11] C. L. dos Santos Teixeira, C. H. Klein, K. V. Bloch, and C. M. Coeli. Reclassificação dos grupos de causas prováveis dos óbitos de causa mal definida, com base nas Autorizações de Internação Hospitalar no Sistema Único de Saúde, Estado do Rio de Janeiro, Brasil. Cadernos de Saúde Pública, 22(6):1315-1324, 2006.
[12] J. Ekanayake, H. Li, B. Zhang, T. Gunarathne, S.-H. Bae, J. Qiu, and G. Fox. Twister: a runtime for iterative MapReduce. In Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing, pages 810-818. ACM, 2010.
[13] I. P. Fellegi and A. B. Sunter. A theory for record linkage. Journal of the American Statistical Association, 64:1183-1210, 1969.
[14] L. George. HBase: the definitive guide. O'Reilly Media, Inc., 2011.
[15] S. Groot and M. Kitsuregawa. Jumbo: beyond MapReduce for workload balancing. In 36th International Conference on Very Large Data Bases, Singapore, 2010.
[16] Y. Huang. Record linkage in a Hadoop environment. Technical report, School of Computing, National University of Singapore, 2011.
[17] IBM. SPSS software — predictive analytics software and solutions, 2014. [Online; accessed 12-December-2014].
[18] IBM, P. Zikopoulos, and C. Eaton. Understanding big data: analytics for enterprise class Hadoop and streaming data. McGraw-Hill Osborne Media, 1st edition, 2011.
[19] K. M. Keyes and S. Galea. Epidemiology matters: a new introduction to methodological foundations. Oxford University Press, 1st edition, 2014.
[20] Y. Low, J. Gonzalez, A. Kyrola, D. Bickson, C. Guestrin, and J. M. Hellerstein. GraphLab: a new framework for parallel machine learning. arXiv preprint arXiv:1006.4990, 2010.
[21] S. Melnik, A. Gubarev, J. J. Long, G. Romer, S. Shivakumar, M. Tolton, and T. Vassilakis. Dremel: interactive analysis of Web-scale datasets. In Proceedings of the 36th International Conference on Very Large Data Bases, pages 330-339, 2010.
[22] I. Merelli, H. Pérez-Sánchez, S. Gesing, and D. D'Agostino. Managing, analysing, and integrating big data in medical bioinformatics: open problems and future perspectives. BioMed Research International, 2014(1), 2014.
[23] Ministério da Saúde. Portal da Saúde — Sistema Único de Saúde, 2014. [Online; accessed 12-December-2014].
[24] Ministério do Desenvolvimento Social e Combate à Fome (MDS). Cadastro único para programas sociais do governo federal, 2014. [Online; accessed 12-December-2014].
[25] Ministério do Desenvolvimento Social e Combate à Fome (MDS). Programa Bolsa Família, 2014. [Online; accessed 12-December-2014].
[26] G. Mone. Beyond Hadoop. Communications of the ACM, 56(1):22-24, 2013.
[27] H. B. Newcombe. Handbook of record linkage: methods for health and statistical studies, administration, and business. Oxford University Press, 1st edition, 1988.
[28] C. Olston, B. Reed, U. Srivastava, R. Kumar, and A. Tomkins. Pig Latin: a not-so-foreign language for data processing. In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, pages 1099-1110. ACM, 2008.
[29] OpenMP.org. The OpenMP API specification for parallel programming, 1998. [Online; accessed 13-December-2014].
[30] L. Philips. Hanging on the Metaphone. Computer Language, 7(12), 1990.
[31] S. M. Randall, A. M. Ferrante, and J. Semmens. The effect of data cleaning on record linkage quality. BMC Medical Informatics and Decision Making, 13, 2013.
[32] J. Romero. Utilizando o relacionamento de bases de dados para avaliação de políticas públicas: uma aplicação para o Programa Bolsa Família. Doutorado, UFMG/Cedeplar, 2008.
[33] W. Santos. Um algoritmo paralelo e eficiente para o problema de pareamento de dados. Mestrado, Universidade Federal de Minas Gerais, 2008.
[34] R. Schnell, T. Bachteler, and J. Reiher. Privacy-preserving record linkage using Bloom filters. BMC Medical Informatics and Decision Making, 9:41, 2009.
[35] T. Sørensen. A method of establishing groups of equal amplitude in plant sociology based on similarity of species and its application to analyses of the vegetation on Danish commons. Kongelige Danske Videnskabernes Selskab, 5(4), 1948.
[36] S. Tarkoma, C. E. Rothenberg, and E. Lagerspetz. Theory and practice of Bloom filters for distributed systems. IEEE Communications Surveys and Tutorials, 14(1), 2012.
[37] A. Thusoo, J. S. Sarma, N. Jain, Z. Shao, P. Chakka, S. Anthony, H. Liu, P. Wyckoff, and R. Murthy. Hive: a warehousing solution over a map-reduce framework. Proceedings of the VLDB Endowment, 2(2):1626-1629, 2009.
[38] U.S. National Archives and Records Administration. The Soundex indexing system, 2007. [Online; accessed 13-December-2014].
[39] R. S. Xin, J. Rosen, M. Zaharia, M. J. Franklin, S. Shenker, and I. Stoica. Shark: SQL and rich analytics at scale. In Proceedings of the 2013 International Conference on Management of Data, pages 13-24. ACM, 2013.
[40] M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, M. McCauley, M. J. Franklin, S. Shenker, and I. Stoica. Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing. In Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation, pages 2-2. USENIX Association, 2012.
[41] M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, and I. Stoica. Spark: cluster computing with working sets. In Proceedings of the 2nd USENIX Conference on Hot Topics in Cloud Computing, HotCloud'10, pages 10-10, Berkeley, CA, USA, 2010. USENIX Association.
