TEMPORAL GRAPH RECORD LINKAGE AND K-SAFE APPROXIMATE MATCH

A Dissertation Submitted to the Temple University Graduate Board


In Partial Fulfillment of the Requirements for the Degree DOCTOR OF PHILOSOPHY

By Joseph Jupin, December 2016

Examining Committee Members:

Dr. Justin Yuan Shi, Advisory Chair, CIS Department
Dr. Yuhong Guo, CIS Department
Dr. Edward Dragut, CIS Department
Dr. Bo Ji, CIS Department
Dr. Boleslaw K. Szymanski, External Member, RPI CS Department

© Copyright 2016 by Joseph Jupin. All rights reserved.


ABSTRACT

Temporal Graph Record Linkage and k-Safe Approximate Match: Graph Aggregated Temporal and Relational Data Enhanced Record Linkage with Iterative Matching and Hash Optimized Edit-Distance String Comparison

By Joseph Jupin ([email protected]), Temple University, 2016
Committee: Dr. Justin Yuan Shi (Advisor and Chair), Dr. Yuhong Guo, Dr. Edward Dragut, Dr. Bo Ji, Dr. Boleslaw K. Szymanski

Since the advent of electronic data processing, organizations have accrued vast amounts of data in multiple databases that share no reliable global unique identifier. These databases were developed by different departments, for different purposes, at different times. Organizing and analyzing these data for human services requires linking records from all sources. Record Linkage (RL) is a process that connects records related to the same, or a sufficiently similar, entity across multiple heterogeneous databases. RL is a data- and compute-intensive, mission-critical process: it must be efficient enough to process big data and effective enough to provide accurate matches.

We have evaluated an RL system that is currently in use by a local health and human services department. We found that they were using the typical approach that was

offered by Fellegi and Sunter, with tuple-by-tuple processing and the Soundex as the primary approximate string matching method. The Soundex has been found to be unreliable both as a phonetic method and as an approximate string matcher. We found that their data, in many cases, have more than one value per field, suggesting that the data were queried from a 5NF database. Consider that if a woman has been married three times, she may have up to four last names on record. This query process produced more than one tuple per database/entity, apparently generating a Cartesian product of the data. In many cases, more than a dozen tuples were observed for a single database/entity. This approach is both ineffective and inefficient. An effective RL method should handle this multi-data without redundancy and use edit-distance for approximate string matching.

However, due to high computational complexity, edit-distance will not scale well with big data problems.

We developed two methodologies for resolving the aforementioned issues: PSH and

ALIM. PSH – The Probabilistic Signature Hash is a composite method that increases the speed of Damerau-Levenshtein edit-distance. It combines signature filtering, probabilistic hashing, length filtering and prefix pruning to increase the speed of edit- distance. It is also lossless because it does not lose any true positive matches. ALIM –

Aggregate Link and Iterative Match is a graph-based record linkage methodology that uses a multi-graph to store demographic data about people. ALIM performs string matching as records are inserted into the graph. ALIM eliminates data redundancy and stores the relationships between data.

We tested PSH for string comparison and found it to be approximately 6,000 times faster than DL. We tested it against the trie-join methods and found that they are up to 6.26 times faster but lose between 10 and 20 percent of true positives. We tested ALIM against a method currently in use by a local health and human services department and found that ALIM produces significantly more matches (even with more restrictive match criteria) and runs more than twice as fast.

ALIM handles the multi-data problem and PSH allows the use of edit-distance comparison in this RL model. ALIM is more efficient and effective than a currently implemented RL system. This model can also be expanded to perform social network analysis and temporal data modeling. For human services, temporal modeling can reveal how policy changes and treatments affect clients over time and social network analysis can determine the effects of these on whole families by facilitating family linkage.


ACKNOWLEDGEMENTS

I would like to thank my advisor, committee chair and associate chair of the CIS department, Dr. Justin Yuan Shi, for all of his help and guidance with this project. Dr. Shi provided me with this research problem after being contacted by a local health department to assess their Record Linkage system. Through his efforts, we had the very rare privilege of access to a real system with real data. This access is usually very restricted due to HIPAA privacy regulations. He was always there to help me when meeting with health department officials and in developing a methodology to assess and report on the health department's RL system. He has been very patient and has invested time and effort into our project. He always made time to meet and discuss strategies and solutions to problems that arose during the development of the solution to our problem.

I also thank the other members of my dissertation committee, Dr. Yuhong Guo, Dr. Edward Dragut, Dr. Bo Ji and Dr. Boleslaw K. Szymanski, for their time and consideration and for agreeing to participate in and review my work. I am glad to have had faculty with experience and interest in data problems accept involvement in my project.

I would also like to thank the members of my former project, analyzing data from the ProDES project. Dr. Zoran Obradovic was my former advisor and committee chair. He provided me with experience working on data collected from multiple sources and all the issues that such data present. Dr. Philip W. Harris was our primary investigator and provided the problem and the data, along with valuable domain advice for working with Criminal Justice and Social Sciences data. Dr. Alan Izenman, Dr. Jeremy Mennis and Dr. Heidi Grunwald were other members of the group who also provided insight into processing data, analyzing data and reporting on results obtained from our research. I also thank my fellow graduate students, Brian Lockwood and Yilian Qin; I hope that their post-Temple careers are both successful and satisfying.

I would also like to thank officials from the Philadelphia Department of Health: Dr. Donald Schwarz, Deputy Mayor of Health and Opportunity; Susan Kretsge, Chief of Staff for the Office of Health and Opportunity; Swamy Bodige, Analyst for the Office of Health and Opportunity; LaVern Wright, Program Director for Integrated Systems for the Office of Health and Opportunity; and Dr. Zheng Wen, IFI, for their advice and support in assessing and understanding their RL system. I would also like to thank the City of Philadelphia's Office of Health and Opportunity for grant support for this project.


TABLE OF CONTENTS

ABSTRACT ...... iii
ACKNOWLEDGEMENTS ...... vi
LIST OF FIGURES ...... x
LIST OF TABLES ...... xii
LIST OF EQUATIONS ...... xiv

I PREFACE ...... 1
1 INTRODUCTION TO RECORD LINKAGE ...... 2
1.1 Challenges and Principles of Record Linkage ...... 4
1.2 Proposed Methodology ...... 5
1.3 Envisioned Contributions ...... 6
1.4 Organization of Dissertation ...... 7
1.5 Notation and Formalization ...... 8
2 BACKGROUND ...... 9
2.1 Standardization ...... 13
2.2 Blocking ...... 14
2.3 Comparing and Matching ...... 17
2.4 Field-level Comparison ...... 18
2.5 Survey of Existing RL Methods ...... 52
2.6 Similar Work in Other Domains or with Differing Methodologies ...... 63
II FINDINGS ...... 67
3 INTRODUCTION TO FINDINGS ...... 68
3.1 Evaluation of the CARES System ...... 68
3.2 Temporal Data and Compound Records ...... 71
3.3 Data Quality ...... 73
3.4 SSNs as Universal Identifiers ...... 75
3.5 Transitive Closure ...... 76
3.6 Family-Based Social Networks and Matching Context ...... 77
3.7 The Soundex as a Phonetic Match ...... 78


3.8 Edit-Distance as a String Comparator ...... 78
3.9 CARES Final Report ...... 79
III METHODOLOGIES, EXPERIMENTS AND RESULTS ...... 80
4 INTRODUCTION TO METHODOLOGIES, EXPERIMENTS AND RESULTS ...... 81
5 STRING DISTANCE OPTIMIZATION ...... 82
5.1 First Optimization Technique: Fast Bitwise Filter ...... 82
5.2 Second Optimization Technique: Hybrid Bitwise Signature Hashing with Bitwise Neighborhood Generation ...... 106
5.3 Third Optimization Technique: Probabilistic Signature Hashing (PSH) ...... 124
6 STRING AND RECORD RELATIONSHIP GRAPHS ...... 143
7 AGGREGATE, LINK AND ITERATIVE MATCH (ALIM) ...... 148
7.1 ALIM Prototype 1 ...... 148
7.2 ALIM Prototype 2 ...... 151
7.3 ALIM Prototype 3 ...... 156
IV CONCLUDING REMARKS ...... 175
8 CONCLUSION ...... 176
9 CONTRIBUTIONS TO THE RL DOMAIN ...... 178
10 FUTURE WORK ...... 180
REFERENCES ...... 181
BIBLIOGRAPHY ...... 190
PUBLISHED WORK ...... 191


LIST OF FIGURES

Figure 1: Record Linkage (RL) process to merge two databases ...... 13
Figure 2: Russell Soundex Description ...... 20
Figure 3: Russell Soundex Mapping ...... 20
Figure 4: American Soundex Mapping ...... 21
Figure 5: Refined Soundex Mapping ...... 22
Figure 6: Rules for D-M Soundex ...... 23
Figure 7: Rules for Cologne Phonetic ...... 25
Figure 8: NYSIIS Mapping ...... 26
Figure 9: Metaphone Mapping ...... 28
Figure 10: Caverphone Mapping ...... 30
Figure 11: Q-Grams for "Smith" and "Simth" ...... 36
Figure 12: Calculation of ...... 38
Figure 13: Calculation of Levenshtein (left) and Damerau-Levenshtein (right) distances ...... 39
Figure 14: Prefix Pruning DL (PDL) distance matrix ...... 43
Figure 15: Sample Hash Map ...... 46
Figure 16: 13-bit Hash Code for "JAMES" ...... 46
Figure 17: Prefix trie ...... 50
Figure 18: Taxonomy of phonetic algorithms ...... 51
Figure 19: Taxonomy of string distance metrics and optimizations ...... 52
Figure 20: FRIL Model ...... 54
Figure 21: CARES Demographic Fields ...... 70
Figure 22: Venn diagrams on names ...... 83
Figure 23: 32-bit alphabetic FBF bit signature for "SMITH" ...... 84
Figure 24: 32-bit numeric FBF bit signature for "8005551212" ...... 84
Figure 25: Venn diagram for FPDL relationship ...... 92
Figure 26: Plot of average pairwise comparison by number of comparisons ...... 97
Figure 27: Runtime curves for all methods ...... 101
Figure 28: Runtime curves for length filter and FBF methods for last name ...... 102
Figure 29: Old FBF signature for numeric strings ...... 107
Figure 30: New FBF signature for numeric strings ...... 107
Figure 31: New FBF signature for alphabetic string "MISSISSIPPI" ...... 108
Figure 32: Histogram of character occurrence in Philadelphia addresses ...... 110
Figure 33: New FBF Address signature ...... 110
Figure 34: Counts of Strings in Hash Buckets by Frequency ...... 114
Figure 35: Hash codes for JAMES (top) and TAM (bottom) ...... 115
Figure 36: Code to generate all buckets within a 2 bit difference ...... 115
Figure 37: FBF signatures for JIM and TIM ...... 119
Figure 38: Hash codes for JIM and TIM ...... 119
Figure 39: XOR result for JIM and TIM signatures ...... 119
Figure 40: XOR results for JIM and TIM hash codes ...... 120


Figure 41: Timing model for 50,000 to all Philadelphia street addresses ...... 124
Figure 42: Signature for "1801 N BROAD ST" ...... 126
Figure 43: Signature hash code for "1801 N BROAD ST" ...... 126
Figure 44: Characters for address signature ...... 130
Figure 45: Frequencies of co-occurring pairs ...... 130
Figure 46: Frequency sorted co-occurring pairs ...... 131
Figure 47: Pairs and frequencies selected for addresses ...... 131
Figure 48: Signatures for "SMITH" and "ABLE" ...... 132
Figure 49: k-Safe Sets ...... 141
Figure 50: String relationship graph and structures ...... 144
Figure 51: String relationship example ...... 145
Figure 52: Record relationship graph and structure ...... 146
Figure 53: Record and relationship graph example ...... 146
Figure 54: Record graph and associated compound records ...... 147
Figure 55: ALIM Prototype 1 ...... 149
Figure 56: ALIM Prototype 2 ...... 153
Figure 57: ALIM Prototype 3 Model ...... 157
Figure 58: FirstName and Record Graphs ...... 159
Figure 59: Compound Record ...... 161
Figure 60: ALIM Query Process ...... 162
Figure 61: Compound records ...... 162
Figure 62: Record relations ...... 173


LIST OF TABLES

Table 1: Two compared records and their resulting comparison vector ...... 18
Table 2: D-M Soundex Code Mapping ...... 24
Table 3: Cologne Phonetic Code Mapping ...... 25
Table 4: MACDP Selected Attributes ...... 56
Table 5: MACDP Results ...... 56
Table 6: CARES Missing Data ...... 70
Table 7: Data Records for Mary ...... 72
Table 8: Aggregated Data Record for Mary ...... 73
Table 9: Accuracy and performance results for SSN string experiment ...... 95
Table 10: Accuracy and performance results for SSN experiment k=2 ...... 96
Table 11: Accuracy and performance results for LN string experiment ...... 97
Table 12: Accuracy and performance results for Ad string experiment ...... 98
Table 13: Speedup for FPDL versus all other methods ...... 99
Table 14: Performance results for RL experiment ...... 99
Table 15: Soundex vs. DL with error injected ...... 100
Table 16: Soundex vs. DL with clean data ...... 100
Table 17: Coefficients and constants for speedup polynomials for FBF ...... 101
Table 18: Speedups for last name data by n ...... 102
Table 19: Coefficients and constants for speedup polynomials for length filter ...... 102
Table 20: Accuracy and performance results for LN string experiment using length filter ...... 103
Table 21: Counts of Census last name string lengths ...... 104
Table 22: Accuracy and performance results for Ad (Phila. Street Addresses) string experiment using length filter, FBF and PDL ...... 104
Table 23: Sample Hash Buckets for 13 Bits ...... 112
Table 24: Sample Hash Buckets for 16 Bits ...... 113
Table 25: 2k = 2 neighborhood buckets for JAMES ...... 116
Table 26: Experimental results for 5,163 Census first names ...... 122
Table 27: Experimental results for 151,671 Census last names ...... 123
Table 28: Experimental results for 10,000 Philadelphia street addresses ...... 123
Table 29: Experiments on 5,000 street addresses with GLFPDL hashing ...... 137
Table 30: Bucket statistics for addresses ...... 138
Table 31: Experiments on 5,000 addresses with PGLFPDL probabilistic hashing ...... 138
Table 32: Experiments on 25,000 addresses with PGLFPDL probabilistic hashing ...... 138
Table 33: Experiments on 125,000 addresses with PGLFPDL probabilistic hashing ...... 139
Table 34: Experiments on all 547,771 addresses with PGLFPDL probabilistic hashing ...... 139
Table 35: Comparison of PGLFPDL and Trie Join with all 547,771 addresses ...... 140
Table 36: Experimental configurations ...... 151
Table 37: Previous experiments ...... 151
Table 38: New experiments with unmatchable records ...... 154
Table 39: Experimental results for OCM and ALIM ...... 155


Table 40: ALIM v2 results ...... 155
Table 41: ALIM v2 results with unmatchable records ...... 156
Table 42: Multiple record entries for the same entity ...... 161
Table 43: Summed comparison vectors ...... 162
Table 44: Weighted Point System ...... 163
Table 45: ALIM results ...... 164
Table 46: Inserts and runtimes by method ...... 165
Table 47: Tuple to compound records ...... 165
Table 48: ALIM tuples to ALIM defined matches ...... 166
Table 49: Convert ALIM tuples to ALIM search matches ...... 167
Table 50: Counts of matches ...... 167
Table 51: Comparison of match types ...... 168
Table 52: Match points and difference ...... 169
Table 53: Point counts in 3 bin distribution ...... 169
Table 54: Point counts with ALIM threshold at n=40 points, OCM at n=35 points ...... 170
Table 55: Agency boundaries ...... 170
Table 56: Candidate TM and DE counts ...... 170
Table 57: Distribution for candidate TM and DE ...... 171
Table 58: Distribution for candidate TM and DE with ALIM at 40 points ...... 171
Table 59: Points for exact field matches for ALIM and OCM ...... 176


LIST OF EQUATIONS

Equation 1: Fellegi and Sunter Domain ...... 11
Equation 2: Fellegi and Sunter Matches ...... 11
Equation 3: Fellegi and Sunter Un-matches ...... 11
Equation 4: Fellegi and Sunter Possible Matches ...... 11
Equation 5: Fellegi and Sunter Comparison Vector ...... 12
Equation 6: Fellegi and Sunter Set of Comparison Vectors ...... 12
Equation 7: Fellegi and Sunter Conditional Probability for a Match ...... 12
Equation 8: Fellegi and Sunter Conditional Probability for an Un-Match ...... 12
Equation 9: Blocking ...... 15
Equation 10: Jaro Distance ...... 31
Equation 11: Jaro Match ...... 32
Equation 12: Jaro-Winkler Distance ...... 32
Equation 13: Jaccard Similarity ...... 33
Equation 14: Jaccard Difference ...... 33
Equation 15: Dice's Coefficient ...... 34
Equation 16: Dice's Difference ...... 34
Equation 17: Tversky Index ...... 35
Equation 18: Overlap Coefficient ...... 35
Equation 19: DL computation ...... 40
Equation 20: Composition of Length Filter and DL ...... 44
Equation 21: FRIL String Matching ...... 54
Equation 22: FRIL Record Scoring Function ...... 55
Equation 23: FRIL Match Condition ...... 55
Equation 24: Composition of FBF and PDL ...... 85
Equation 25: Number of buckets within 2k bits ...... 116
Equation 26: Composition of hashing and DL ...... 117
Equation 27: Composition of hashing, length filter, FBF and PDL ...... 120


PART I

I PREFACE


CHAPTER 1

1 INTRODUCTION TO RECORD LINKAGE

The task of merging heterogeneous data sources has many names: record linkage, merge/purge, entity resolution, entity identification, deduplication, data cleansing, etc. Combining such data is an important first step in the knowledge discovery process [Fayyad 1996], whether aggregation, data mining or statistical analysis of the data is to be performed. Many modern organizations have databases that were designed and developed by different people, at different times and for different purposes. As organizations evolve and grow, their information needs change. Corporate mergers and departmental consolidations create conditions where it is necessary to link or combine multiple heterogeneous data sources. These organizations, such as corporations and governmental departments, need to generate information from data analysis to make decisions. An unconscious patient may be admitted to a hospital emergency room, unable to provide information about drug allergies, medical conditions or prior procedures. Hospital emergency rooms and trauma centers can benefit greatly from integrated patient data reports from multiple hospitals and doctors' offices: the patient's life depends on medical professionals having complete information, especially when the patient cannot provide it. If a caseworker for a social services agency is evaluating services for a client who is receiving services from multiple agencies, the caseworker needs complete information from all databases to best serve the client. If an analyst is preparing an aggregate report of all clients within the system, the analyst needs accurate linkages between each client's records within the system to generate accurate results. Federal, state and municipal governments all have databases that contain a myriad of social, health and financial data about citizens. Many of these databases are not linked because they were designed and implemented by different departments or organizations with different schemas and may not have a reliable universal unique identifier.

The primary objective of Record Linkage (RL) is to identify records of a single entity within one data source (deduplication) or across two or more (linking). Due to potentially differing database schemas, inconsistent data, and the lack of a reliable universal global identifier, RL can be a very challenging task. The Social Security Number (SSN), if available, can be used as a universal identifier but has weaknesses: it is neither self-checking nor self-correcting, as described in [Kuch 1972]. As many as 32 percent of SSNs have been reported missing from health-related databases [Campbell 2005], and we have observed health and social services databases with more than 46% missing SSNs. Data entry errors, inconsistencies, changes in data over time (such as a last name changed by marriage or an address changed when the entity moves) and missing data create challenging hurdles to accurately matching a single entity's records in one or more data sources.
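The "neither self-checking nor self-correcting" point can be made concrete: an SSN carries no check digit, so a single-digit typo yields another structurally plausible SSN, whereas a check-digit scheme such as Luhn (used for payment card numbers, not for SSNs) detects the same error. A minimal sketch, with illustrative numbers:

```python
def luhn_valid(digits: str) -> bool:
    """Luhn check-digit test (used by payment cards; SSNs have no such scheme)."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:      # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9      # equivalent to summing the two digits of d
        total += d
    return total % 10 == 0

# An SSN with a single-digit typo is indistinguishable from a valid one:
ssn, ssn_typo = "123456789", "123456780"   # both look like valid SSNs

# A Luhn-protected identifier catches the same kind of typo:
acct = "79927398713"        # classic Luhn-valid test number
bad = "79927398712"         # one digit changed
print(luhn_valid(acct), luhn_valid(bad))   # True False
```

This is why a mistyped SSN silently links to the wrong person, while a mistyped card number is rejected at entry.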

We are currently engaged in research that aims to improve some of the tools required to perform RL. We have studied a deterministic RL model currently being used by a local government health department and found modifications and innovations that can improve RL accuracy. The current system uses the Soundex for approximate string matching. We have demonstrated that an edit-distance-based method will perform better; edit distance has been found to reveal approximately 80% of data entry errors [Damerau 1964]. However, edit distance is approximately five times slower than the Soundex. We have evaluated the literature, an existing record linkage system and data from a local health department, and collected information from the system's data related to this issue. This research has revealed valuable insights into the health and social sciences record linkage problem.
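The Damerau-Levenshtein (DL) distance referred to throughout can be sketched as a dynamic program. This is the textbook restricted form (insertions, deletions, substitutions and adjacent transpositions, each at cost 1), not the dissertation's optimized PDL or PSH variants:

```python
def dl_distance(s: str, t: str) -> int:
    """Restricted Damerau-Levenshtein distance: insert, delete,
    substitute, and adjacent transposition, each at cost 1."""
    m, n = len(s), len(t)
    # d[i][j] = distance between the prefixes s[:i] and t[:j]
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            if (i > 1 and j > 1 and s[i - 1] == t[j - 2]
                    and s[i - 2] == t[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[m][n]

print(dl_distance("SMITH", "SIMTH"))   # 1 (a single transposition)
print(dl_distance("JON", "JOHN"))      # 1 (a single insertion)
```

The O(|s|·|t|) cost per pair is what motivates the filtering, pruning and hashing optimizations developed later in this work.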

A perfect solution for RL is unlikely: even humans cannot match some records that refer to the same entity, due to errors, inconsistencies and missing data. As with many other computer science problems, each step in the RL process is only as dependable as the one before it. Current RL technologies generally focus on one part of the problem and ignore important qualities of health and social sciences data that should define a better methodology for this domain.

1.1 Challenges and Principles of Record Linkage

To summarize the challenges we have found in the literature and in our own experiments, we offer our principles for the development of an improved RL model. An effective RL system must:

1. Handle the temporal data problem (aggregate records)
2. Remove data and processing redundancies
3. Ensure transitive closure on RL (iterative RL)
4. Handle synonymous terms (aliases), phonetic terms and minor spelling errors
5. Guarantee that all optimizations are lossless with respect to accuracy
6. Provide a mechanism that allows system users to correct mistakes in RL
7. Make use of available contextual data (family members)
8. Provide a mechanism for the future implementation of family linkage (SNA)
9. Handle both RL and deduplication operations concurrently
10. Perform approximate string matching faster, using our new and novel hybrid hash method
11. Create better quality comparison vectors than is possible with other methodologies

1.2 Proposed Methodology

Record linkage attempts to assemble all of a given individual’s data from multiple heterogeneous data sources into a comprehensive and complete report of his or her experiences (services and treatments) within the system. Social workers and health professionals require these reports to best serve their clients, efficiently and effectively.

We have found it useful and beneficial to consider all of a client's historical identifying data (temporal data) and relationships (family and social network) with other entities in the system. We therefore developed Aggregate-Link-Iterative-Match (ALIM), an efficient way to process, store, query and match these data to develop more accurate comparison vectors for enterprise database entities; the vectors can be used by deterministic, probabilistic and data mining record matching algorithms. ALIM is based on a temporal data graph structure that links all entities' historic and family relational data and pre-computes and stores all alias, approximate and phonetic matches for data elements, as well as relationship links for records, to provide a more efficient and effective model. This is offered as a partial solution to the Record Linkage (RL) problem.
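The aggregation idea behind ALIM (one compound record per entity, with each field holding all values observed over time, instead of one tuple per value combination) can be sketched with a toy structure. The entity and field names are hypothetical; the actual ALIM multi-graph also stores relationship edges and pre-computed matches:

```python
from collections import defaultdict

def aggregate(tuples):
    """Collapse (entity_id, field, value) observations into one compound
    record per entity, with a set of historical values per field."""
    entities = defaultdict(lambda: defaultdict(set))
    for entity_id, field, value in tuples:
        entities[entity_id][field].add(value)
    return entities

rows = [
    (1, "last_name", "SMITH"),        # maiden name
    (1, "last_name", "JONES"),        # name after marriage
    (1, "address", "1801 N BROAD ST"),
    (1, "address", "100 MAIN ST"),    # prior address (hypothetical)
]
agg = aggregate(rows)

# Two last names and two addresses collapse into one compound record,
# rather than the 2 x 2 = 4 tuples a flat Cartesian-product query yields.
print(sorted(agg[1]["last_name"]))   # ['JONES', 'SMITH']
```

A query can then match an incoming record against any historical value, so a woman observed under either last name still resolves to the same entity.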

1.3 Envisioned Contributions

The primary contribution of this work is an innovative method for handling RL data: the efficient generation of comparison vectors that capture the essence of entities' histories, the storage of family relationships, and the use of previously discovered matched records to iteratively facilitate transitive closure for new matches. This method addresses a recognized but unsolved temporal dimension and increases the value of the comparison vectors used with the selected matching algorithm. It can perform pre-computed RL for speed and efficiency, and real-time RL to provide the most up-to-date results based on the current state of the data. An ancillary contribution is a probabilistic hybrid hashing, filtering and pruning optimization for the Damerau-Levenshtein edit distance algorithm, called Probabilistic Signature Hashing (PSH), which has been found to be almost 6,000 times faster than using Damerau-Levenshtein (DL) alone. PSH also produces exactly the same string matching results as DL. This is significant because DL can detect up to 80% of data entry errors.


1.4 Organization of Dissertation

This report is organized into five major sections. The Introduction introduces the Record Linkage problem, the proposed methodology, the contributions to the domain, and the notation and formalization used in this document. The Background section presents material in the order in which it occurs in the RL process: preprocessing by data standardization, reduction of data processing by blocking, field-level comparison using phonetic and string distance metrics, optimizations for string distance metrics, and a survey of existing RL methods. The Findings section discusses our findings based on the review of the literature, an assessment of an RL system currently in use by an urban health department, and experimental results. The Methodologies, Experiments and Results section describes the history and current status of our string matching optimization, PSH, and of our RL system, ALIM; it chronicles our methodologies, experiments and results. The Concluding Remarks section discusses our contributions to the domain and future work.


1.5 Notation and Formalization

The table below shows the formal notation used in this document, unless otherwise stated locally. All variables, acronyms for functions and acronyms for composites of functions are italicized. Acronyms for methodologies are not italicized.

Alphabet: Σ
Set of all strings over Σ: Σ*
Sets of strings in Σ*: S, T
Reduced alphabet of Σ: σ
The length of a set of strings in Σ*: |S|
Individual strings in S and T, s ∈ S, t ∈ T: s, t
Individual characters in strings s and t, si ∈ s, tj ∈ t: si, tj
The length of a string: |s|
Individual character variables: a, b, c, d
A function to calculate |s|: l(s)
A length filter comparison function: L(s, t)
A signature filter generation function: f(s)
Signatures f(s) and f(t): x, y
Individual bits in signatures x and y, xi ∈ x and yj ∈ y: xi, yj
A signature filter comparison function: F(x, y)
Maximum edit distance between approximately matched strings: k
Damerau-Levenshtein edit distance: DL(s, t)
Prefix Pruned Damerau-Levenshtein edit distance: PDL(s, t, k)
A function to generate a hash code: h(s)
Hash codes h(s) and h(t): u, v
Hash tables that contain u and v: U, V
An exhaustive search function to find a set of candidate hash buckets for pattern u in table V: H(u, V)
A function to generate all bucket codes related to edit distance k for pattern u in table V: Gk(u, V)
A set of candidate hash buckets: C
An individual hash bucket containing a list of strings: Ci
Sets of records: Q, R
Individual records in Q and R, q ∈ Q and r ∈ R: q, r
Individual records ri ∈ R: ri
A comparison vector: γ
A set of comparison vectors: Γ
A set of matching record or string pairs: M
A set of unmatched record or string pairs: U
A set of possible match record pairs: P


CHAPTER 2

2 BACKGROUND

Record Linkage (RL) is a process that compares pairs of records from heterogeneous databases to find similar or identical entities [Dunn 1946]. We use RL to describe this operation in a general sense, whether it refers to merging entities from heterogeneous databases or purging duplicate entities from the same database.

If RL is applied to a single data source, the process is typically an attempt to detect duplicate entries and is referred to as duplicate detection or deduplication. If RL is applied to two or more data sources, the process is likely an attempt to match the same entities across multiple data sources. This process is known under many names, which has caused a great deal of confusion when cross-referencing articles. Statisticians, health services and historians primarily refer to RL as record linkage. Database professionals generally refer to RL as merge/purge processing or list washing. Computer and information scientists may refer to it as data matching, entity resolution, entity identification, record matching or the object identity problem. Other names for RL are entity disambiguation, instance identification, co-reference resolution, reference reconciliation and database hardening [Elmagarmid, Ipeirotis and Verykios 2007] [Christen and Churches 2003].

Halbert L. Dunn of the U.S. National Bureau of Statistics conceived the concept of record linkage (RL) in 1946. He described the process in terms of a person's life:

"Each person in the world creates a Book of Life. This Book starts with birth and ends with death. Record Linkage is the name of the process of assembling the pages of this Book into a volume." [Dunn 1946]

Howard Newcombe, a Canadian geneticist, and his associates Kennedy, Axford, and James were the first to perform record linkage using a computer, in 1959. They took advantage of a computer's speed, accuracy and consistency to arrange personal records into family histories. Newcombe compiled a manual, the "Handbook of Record Linkage Methods for Health and Statistical Studies," in 1988. [Newcombe 1988]

The first formal mathematical theory of probability for record linkage was published in an innovative paper, "A Theory for Record Linkage," by Ivan Fellegi and Alan Sunter in 1969, which uses the relative frequencies of the strings to be compared. They proved that their probabilistic decision rule model produces optimal results when the data's attributes are conditionally independent. The method is a decision-theoretic approach to RL, and it established the validity of the method first applied by Newcombe et al. [Fellegi and Sunter 1969]. Methods based on this work generally use three steps to link records: searching, decision-making and grouping. The searching step uses a very fast but less precise comparator to decrease the number of pairwise comparisons passed to a slower but more accurate comparator; this is usually referred to as blocking. The decision-making step compares records with the more accurate comparator, which produces a probability that two records represent the same entity. The method compares each record from one file to those in another, producing three categories: records that are definitely matched (M), records that are definitely not matched (U), and records that are possibly matched (P). The possible matches are set aside for further consideration by a domain expert because not enough data exists to compute a positive or negative match. We describe the traditional approach to RL as presented by Fellegi and Sunter in their seminal work. Given two datasets Q and R, we compare all unique record pairs in Q and R:

Equation 1: Fellegi and Sunter Domain

Q × R = {⟨q, r⟩ : q ∈ Q, r ∈ R}

Use a combination of blocking methods, string comparison metrics and matching rules to determine the set of matches M:

Equation 2: Fellegi and Sunter Matches

M = {⟨q, r⟩ : q = r, q ∈ Q, r ∈ R}

Determine the set of un-matches U:

Equation 3: Fellegi and Sunter Un-matches

U = {⟨q, r⟩ : q ≠ r, q ∈ Q, r ∈ R}

Record the possible matches P for human comparison:

Equation 4: Fellegi and Sunter Possible Matches

P = {⟨q, r⟩ : ⟨q, r⟩ ∉ M, ⟨q, r⟩ ∉ U, q ∈ Q, r ∈ R}

Generally there are two thresholds used to decide which record pairs are assigned to M and U: all pairs scoring greater than or equal to x are matches and all pairs scoring less than or equal to y are un-matches. The possible matches are those that fall between the two thresholds. For each record pair ⟨q, r⟩, record a comparison vector γ containing the agreements and disagreements for each record's attributes (first name, last name, address, etc.), numbered from 1 to n:


Equation 5: Fellegi and Sunter Comparison Vector

γ(q, r) = ⟨γ_1(q, r), γ_2(q, r), …, γ_n(q, r)⟩, i = 1, …, n

Γ is the set of all functions γ on Q × R.

Equation 6: Fellegi and Sunter Set of Comparison Vectors

Γ = {γ(q, r) : ⟨q, r⟩ ∈ Q × R}

The conditional probability of observing a given γ for ⟨q, r⟩ ∈ M is:

Equation 7: Fellegi and Sunter Conditional Probability for a Match

m(γ) = P(γ(q, r) | ⟨q, r⟩ ∈ M) = Σ_{⟨q, r⟩ ∈ M} P(γ(q, r)) · P(⟨q, r⟩ | M)

The conditional probability of observing a given γ for ⟨q, r⟩ ∈ U is:

Equation 8: Fellegi and Sunter Conditional Probability for an Un-Match

u(γ) = P(γ(q, r) | ⟨q, r⟩ ∈ U) = Σ_{⟨q, r⟩ ∈ U} P(γ(q, r)) · P(⟨q, r⟩ | U)

The set of comparison vectors Γ is used to assign candidate pairs from Q × R into M, P and U using a linkage rule L. Most probabilistic RL methods are wrappers for deterministic methods that simply adjust the weight awarded to each field comparison to reflect probabilities revealed by the data. For example, "James Smith" is the most likely first-name/last-name combination in the 1990 Census data and "Alonso Cooke" is the least likely [U. S. Census 1990]. A match on the former would carry less weight than one on the latter because there are likely to be many more James Smiths in the population. The matched pairs are stored for future queries. A model for a traditional RL system for merging databases is shown in Figure 1.


[Figure 1 shows records from two input data sources each passing through a Standardize Records step, then flowing through Block Candidates, Compare Candidates and Match Candidates steps to produce the output matched pairs.]

Figure 1: Record Linkage (RL) process to merge two databases
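The probabilistic weighting scheme described above can be sketched in code. The following is a minimal illustration, not the evaluated system's implementation; the field list, the m- and u-probabilities and the two thresholds are assumed values chosen for the example.

```python
import math

# Illustrative m/u probabilities per field (assumed, not from the evaluated
# system): m = P(field agrees | true match), u = P(field agrees | non-match).
FIELDS = {
    "first_name": (0.95, 0.01),
    "last_name": (0.95, 0.005),
    "birth_date": (0.98, 0.001),
}

def pair_weight(rec_q, rec_r):
    """Sum log-likelihood weights over all fields: log2(m/u) on agreement
    and log2((1-m)/(1-u)) on disagreement."""
    total = 0.0
    for field, (m, u) in FIELDS.items():
        if rec_q[field] == rec_r[field]:
            total += math.log2(m / u)
        else:
            total += math.log2((1 - m) / (1 - u))
    return total

def classify(weight, upper=10.0, lower=0.0):
    """Two thresholds split candidate pairs into matches M, un-matches U
    and possible matches P."""
    if weight >= upper:
        return "M"
    if weight <= lower:
        return "U"
    return "P"
```

Pairs that score between the two thresholds fall into P and are routed to a domain expert, exactly as in the decision rule above.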

2.1 Standardization

The first step in the process is to standardize the records. A standard form for an address may be {number direction street designation}, as in "1234 S BROAD ST", where direction is a single letter N, S, E or W for North, South, East and West, and designation is abbreviated as AVE, ST, RD, etc. for Avenue, Street, Road, etc. A popular and useful string distance metric is the Damerau-Levenshtein edit distance [Damerau 1964], which counts the number of edits (character substitutions, insertions, deletions and transpositions) required to transform a string s into a string t. Standardization is necessary because if s = "1234 S BROAD ST" and t = "1234 South Broad Street", the edit distance between s and t is 8, which is a significant difference.
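A minimal standardization step can be sketched as follows. The abbreviation tables here are small illustrative assumptions; a production system would use a much larger curated dictionary (e.g. the USPS suffix abbreviations).

```python
import re

# Assumed abbreviation maps for illustration only.
DIRECTIONS = {"NORTH": "N", "SOUTH": "S", "EAST": "E", "WEST": "W"}
DESIGNATIONS = {"AVENUE": "AVE", "STREET": "ST", "ROAD": "RD"}

def standardize_address(addr):
    """Upper-case, strip punctuation, and rewrite tokens into the
    {number direction street designation} standard form."""
    tokens = re.sub(r"[^\w\s]", "", addr.upper()).split()
    out = []
    for tok in tokens:
        tok = DIRECTIONS.get(tok, tok)
        tok = DESIGNATIONS.get(tok, tok)
        out.append(tok)
    return " ".join(out)
```

With this step, "1234 South Broad Street" and "1234 S BROAD ST" standardize to the same string, so their edit distance drops from 8 to 0.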

This report considers an RL solution based on findings (discussed in the Findings section) for a large historical data warehouse for a local urban health department. It uses the seven demographic fields shown in Table 1: First Name (FN), Last Name (LN), Address (Ad), Phone Number (PN), Gender (Gn), Birth Date (BD) and Social Security Number (SSN), because these are the fields used in the local government system that we evaluated. The data also have errors, inconsistencies and missing fields. Our objective is to develop a method to match records across multiple heterogeneous databases for a local city government's health department. These databases were developed by different departmental agencies for different purposes at different times. They all have the aforementioned fields in common, but inconsistencies, errors and missing data will cause exact string matching to fail to link records that belong to the same entity. There are approximately 50 million records for 1.5 million entities, because many entities have more than one record in each database, recording how the entity has changed over time. In forty percent of the records the SSN is missing, and there are errors in some of the SSNs that are not missing. Phone numbers and addresses can change, as can last names for women due to marriage or divorce. Aliases, nicknames or initials may be used instead of first names. Any of these records may have data capture or entry errors, and all strings are relatively short. A cursory scan of a few of these data sets immediately revealed individuals who had moved from one address to another and had changed phone numbers.

2.2 Blocking

Blocking is a method for decreasing the number of pair-wise comparisons performed by record matching algorithms. Processing every record pair in very large data sets may not be practical, or even possible, using a comprehensive RL algorithm such as one based on DL edit distance. The idea is that some key fields are used to decide candidate pairs for future record matching, while non-candidates are immediately disqualified. The blocking method should be much faster than the matching method. The main idea is to perform a fast pass over the data to substantially decrease future work. The blocking algorithm should remove a significant number of record pairs from the set of unblocked pairs. Blocking produces a set of candidate pairs:

Equation 9: Blocking

C = {⟨q, r⟩ : q ∈ Q, r ∈ R}, C ⊂ Q × R, |C| ≪ |Q × R|

which are passed to the Compare Candidates step.

Traditional blocking methods include standard blocking [Jaro 1989], sorted neighborhood [Hernandez and Stolfo 1998], bigram indexing [Christen and Churches 2003] and canopy clustering with tf-idf [McCallum, Nigam and Ungar 2000] [Cohen and Richman 2002]. An analysis of each method is given in [Baxter, Christen and Churches 2003]. Below is a brief description of each method and how it relates to the problem above.

Standard blocking uses one or more attributes to create a blocking key that is used to cluster records into blocks by identical blocking keys. An example would be to create blocks based on the first three characters of the last name and the zip code. Last name changes due to marriage or divorce and zip code changes from families that move from one neighborhood to another would have the potential to cause an unacceptable amount of false negative matches. The only immutable fields in the set are SSN, birth date and gender. Any of the other fields may change over time. All have the potential for errors.
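Standard blocking can be sketched as follows, using the illustrative last-name-prefix plus zip-code key from the example above (the field names are assumptions for the sketch):

```python
from collections import defaultdict

def blocking_key(record):
    # Illustrative key: first three letters of the last name plus zip code.
    return (record["last_name"][:3].lower(), record["zip"])

def standard_blocking(records):
    """Group records into blocks by identical blocking key; only records
    within the same block become candidate pairs."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[blocking_key(rec)].append(rec)
    return blocks
```

Note how this key inherits exactly the weaknesses described above: a married name or a move to a new zip code places the record in a different block and the true match is never compared.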

Sorted neighborhood blocking is similar to standard blocking, except that it sorts the data set on a sort key selected from the attributes and slides a window of w records over the data set, selecting candidate pairs only from within the window. This method has the same problems as standard blocking; in addition, if w is too small and many records share an identical key value, then the window will miss candidate pairs. For example, suppose the first three characters of the last name are used, so the key "smi" matches "Smith". In the 2000 Census, 2,376,206 out of 242,121,073 respondents had "Smith" as a last name [U. S. Census 2000]. If the window size is not large enough, viable candidates and true matches may be missed. A record with the single character transposition error "Simth" would be missed completely, as would any other single character error that affects the key value.

Bigram indexing converts blocking key values to bigrams, sorts them and inserts them into an inverted list using a threshold value between 0.0 and 1.0. The set of bigrams for "Smith" is {"sm", "mi", "it", "th"}. After sorting we have {"it", "mi", "sm", "th"}. The threshold removes a subset of bigrams from the set. If the threshold is 0.75, then 75% of the bigrams are retained and 25% are removed, generating 4 blocks, one for each combination of 3 members selected from the bigram set: {"it", "mi", "sm"}, {"it", "mi", "th"}, {"it", "sm", "th"} and {"mi", "sm", "th"}. Other strings that produce matching bigram blocks will be assigned to one or more of these blocks. This substantially increases the number of blocks, and with it the string pair comparisons required, relative to standard blocking. The method has been found to be resilient in handling small errors in data [Cohen and Richman 2002]. However, substitution and transposition errors hurt candidate generation on small strings. Consider "Smith" and "Simth", with bigrams {"si", "im", "mt", "th"} and a 0.75 threshold: {"im", "mt", "si"}, {"im", "mt", "th"}, {"im", "si", "th"} and {"mt", "si", "th"}. The "Simth" bigrams share only one value, "th", with those for "Smith", so none of the blocks coincide. Bigram indexing is not resilient to transposition errors for relatively short strings because the effects of a transposition or substitution error are multiplied in the bigrams. In this case, three of the four bigrams are changed by a single transposition, but the threshold removes only one bigram from each set. Deletion and insertion errors change the number of bigrams and also affect two bigram values.
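The block-generation step of bigram indexing can be sketched as follows, reproducing the "Smith"/"Simth" behaviour described above:

```python
from itertools import combinations

def bigrams(s):
    """Sorted list of character bigrams of s."""
    s = s.lower()
    return sorted(s[i:i + 2] for i in range(len(s) - 1))

def bigram_blocks(s, threshold=0.75):
    """Keep round(threshold * n) bigrams and emit every combination of
    that size; each combination names one block the string is placed in."""
    grams = bigrams(s)
    keep = max(1, int(round(threshold * len(grams))))
    return [tuple(c) for c in combinations(grams, keep)]
```

For "Smith" this yields the four blocks listed above, and none of them coincide with any block produced for "Simth".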

Canopy clustering with tf-idf uses tf-idf as a measure of distance on bigram tokens to create canopy clusters. A random record is selected and other records are clustered with it based on a loose distance threshold. The random record and any records within a tight threshold are removed from the cluster and the process is repeated a number of times. If errors exist in key values, potential matches will be ignored.

With all of these blocking methods, the selection of reliable blocking keys is the most important factor for their success. For example, blocking on the SSN would create a large number of very small blocks and blocking on gender would create two very large blocks. SSN, birth date and gender are considered immutable strings or fields because they do not change over time. The mutable fields, names, addresses and phone numbers, would not be reliable for blocking because, if used, candidate pairs may be mismatched.

Mutable fields should be treated as variable and not as constant data fields because of their potential to change over time.

2.3 Comparing and Matching

A common and straightforward way to compare two candidate records is to compare them field by field using some string distance metric. For records q and r, the fields from each are used to generate pair-wise comparison features, which are assembled into a comparison vector. Table 1 below shows two records and the comparison vector produced by the approximate match rule "the fields do not differ by more than 1 edit operation", which can identify approximately 80% of data entry errors [Damerau 1964]. The comparison vector in the bottom row shows 1 for an exact or approximate match, 0 for unknown due to a missing value, and -1 for a field that does not match. The comparison vectors are passed to the Match Candidates step.

Table 1: Two compared records and their resulting comparison vector

            First Name   Last Name   Address           Phone        Gender   Birth Date   SSN
Rec 1       John         Doe         1234 N Broad St   2155551234   M        08205245     123456789
Rec 2       John         Smith       1234 S Broad St   2675559876   M        06151978
Comp Vec    1            -1          1                 -1           1        -1           0

The process to match candidates uses the comparison vector to decide whether or not a record pair matches. Records that have a sufficient number of similar fields to fulfill a quantitative or rule-based requirement are said to match. For a deterministic RL system, points can be assigned to fields based on importance or uniqueness and summed against a threshold, or a set of rules can require that certain fields match. Probabilistic models can also be applied to RL.
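Comparison-vector generation under the one-edit rule can be sketched as follows. The within_one_edit helper is a simplified check introduced here for illustration, not a full Damerau-Levenshtein implementation.

```python
def within_one_edit(s, t):
    """True if s and t differ by at most one substitution, insertion,
    deletion or adjacent transposition (a minimal check)."""
    if s == t:
        return True
    if abs(len(s) - len(t)) > 1:
        return False
    if len(s) == len(t):
        diffs = [i for i in range(len(s)) if s[i] != t[i]]
        if len(diffs) == 1:            # one substitution
            return True
        if len(diffs) == 2 and diffs[1] == diffs[0] + 1:
            i, j = diffs               # one adjacent transposition
            return s[i] == t[j] and s[j] == t[i]
        return False
    a, b = (s, t) if len(s) < len(t) else (t, s)
    return any(a == b[:i] + b[i + 1:] for i in range(len(b)))

def comparison_vector(rec_q, rec_r, fields):
    """1 = exact/approximate match, 0 = missing value, -1 = un-match."""
    vec = []
    for f in fields:
        q, r = rec_q.get(f), rec_r.get(f)
        if not q or not r:
            vec.append(0)
        elif within_one_edit(q, r):
            vec.append(1)
        else:
            vec.append(-1)
    return vec
```

Applied to the two records of Table 1, this reproduces the vector in the bottom row: the addresses differ by a single substitution and count as an approximate match, while the missing SSN yields 0.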

2.4 Field-level Comparison

The methods discussed below are either phonetic or string distance matching algorithms.

These field-level comparators are used to generate each field’s pairwise agreement entry in the comparison vector.


2.4.1 Field Comparison with Phonetic Matching

Phonetic comparison of character strings is performed by converting each string into a phonetic code, which is used as an index to group homophonic strings with minor spelling differences into classes. Phonetic algorithms basically map single or multiple characters in a pattern string s to single or multiple codes that together form a phonetic code string s′, which is typically compared for an exact match against the codes t′ of strings t in a dictionary T, where s′ = t′. The following are descriptions of the most common phonetic code and index methods that are applicable to RL.

The Russell Soundex

An early phonetic string matching method called the Soundex was developed and patented by Robert C. Russell and Margaret K. Odell in 1918. This method was intended to be an improvement on indexing systems to phonetically accommodate spelling errors and inconsistencies in names. This phonetic indexing scheme is commonly referred to as "Soundexing." Russell's reasons for developing the Soundex were stated in his patent:

"There are certain sounds which form the nucleus of the English language, and those sounds are inadequately represented merely by the letters of the alphabet, as one sound may sometimes be represented by more than one letter or combination of letters, and one letter or combination of letters may represent two or more sounds. Because of this, a great many names may have two or more different spellings which in an alphabetic index, or an index which separates names according to the sequence of their contained letters in the alphabet, necessitates their filing in widely separate places." [U.S. Patent 1,261,167 1918]

The list below shows how Russell separated characters into phonetic categories. Each category is represented by a number.

19

1. The vowels or "oral resonants": a, e, i, o, u, y
2. The labials and labio-dentals: b, f, p, v
3. The gutturals and sibilants: c, g, k, q, s, x, z
4. The dental-mutes: d, t
5. The palatal-fricative: l
6. The labio-nasal: m
7. The dento- or lingua-nasal: n
8. The dental fricative: r

Figure 2: Russell Soundex Description

The Russell Soundex algorithm parses a name and generates a code beginning with the first letter in the name followed by three digits. The algorithm below demonstrates the rules for generating a Russell Soundex code. If two compared strings have the same code, they are considered a phonetic match.

1. Convert all characters to upper case
2. The first character in the string is the first character in the code
3. Drop the combination "GH"
4. Drop 'S' or 'Z' if it is the last character in the string
5. {A, E, I, O, U, Y} → 1
6. {B, F, P, V} → 2
7. {C, G, J, K, Q, S, X, Z} → 3
8. {D, T} → 4
9. {L} → 5
10. {M} → 6
11. {N} → 7
12. {R} → 8
13. Double adjacent code characters like "44" are reduced to '4'
14. Other characters (H and W) are ignored
15. Only the first occurrence of a vowel from category 1 is recorded
16. The code is padded with zeros if there are fewer than three numbers in the code: "C###"

Figure 3: Russell Soundex Mapping

The Russell Soundex code for "Michael" would be "M135" and "Bobby" would be "B120".


The American Soundex

The American Soundex [Knuth 1973] is a variant of the Russell Soundex and is the most popular phonetic algorithm. It is a standard function in many database systems, including Oracle [Oracle 2012], Microsoft SQL Server [Microsoft 2012], MySQL [MySQL 2012] and PostgreSQL [PostgreSQL 2012]. It was applied in the 1930s to U.S. Census data from 1890 to 1920 to phonetically link entities and is still in use today. The primary differences from the original Soundex are:

- All vowels are ignored unless the first character in the string
- The combination "GH" is not dropped
- The characters 'S' and 'Z' are not dropped as the last characters
- The characters 'M' and 'N' are merged into one category

The official variant of the Soundex used by the U.S. government is maintained by the National Archives and Records Administration (NARA) [NARA 2007]. The algorithm below shows the rules for generating the American Soundex code for a string.

1. Convert all characters to upper case
2. The first character in the string is the first character in the code
3. {B, F, P, V} → 1
4. {C, G, J, K, Q, S, X, Z} → 2
5. {D, T} → 3
6. {L} → 4
7. {M, N} → 5
8. {R} → 6
9. Double adjacent code characters like "44" are reduced to '4'
10. Other characters are ignored
11. The code is padded with zeros if there are fewer than three numbers in the code: "C###"

Figure 4: American Soundex Mapping

The code for "Michael" would be "M240" and "Bobby" would be "B100".
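A sketch of the American Soundex following the rules above; note that the official NARA variant has additional subtleties (such as the treatment of 'H' and 'W' between coded letters) that this simplified version ignores.

```python
# Digit classes from the rules above.
CODES = {**{c: "1" for c in "BFPV"},
         **{c: "2" for c in "CGJKQSXZ"},
         **{c: "3" for c in "DT"},
         "L": "4", "M": "5", "N": "5", "R": "6"}

def american_soundex(name):
    """First letter kept, remaining letters coded, doubled adjacent codes
    collapsed, result padded/truncated to four characters."""
    name = name.upper()
    code = name[0]
    prev = CODES.get(name[0], "")
    for ch in name[1:]:
        digit = CODES.get(ch, "")
        if digit and digit != prev:   # collapse doubled adjacent codes
            code += digit
        prev = digit                  # vowels reset the previous code
    return (code + "000")[:4]
```

Note that "Smith" and "Simth" receive the same code, which is exactly why the Soundex is forgiving of transpositions but, as discussed elsewhere in this work, too coarse for accurate matching.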


Refined Soundex

Some suggested refinements to the Soundex from the Apache Commons Codec are shown below [Apache Commons 2012]. The modifications help to create a greater number of smaller groups to hedge against false positive matches.

1. Convert all characters to upper case
2. The first character in the string is the first character in the code
3. {B, P} → 1
4. {F, V} → 2
5. {C, K, S} → 3
6. {G, J} → 4
7. {Q, X, Z} → 5
8. {D, T} → 6
9. {L} → 7
10. {M, N} → 8
11. {R} → 9
12. Double adjacent code characters like "44" are reduced to '4'
13. Other characters are ignored
14. The code is padded with zeros if there are fewer than three numbers in the code: "C###"

Figure 5: Refined Soundex Mapping

The code for "Michael" would be "M370" and "Bobby" would be "B100".

Daitch-Mokotoff Soundex (D-M Soundex)

The Daitch-Mokotoff Soundex, also called the D-M Soundex, was developed by Jewish genealogists Gary Mokotoff and Randy Daitch in 1985 to provide better accuracy for Yiddish and Slavic names [Avotaynu 2012]. Mokotoff states that the D-M Soundex has the following improvements over the conventional American Soundex:

1. Information is coded to the first six meaningful letters rather than four.
2. The initial letter is coded rather than kept as is.
3. Where two consecutive letters have a single sound, they are coded as a single number.
4. When a letter or combination of letters may have two different sounds, it is double coded under the two different codes.
5. A letter or combination of letters maps into ten possible codes rather than seven.

The rules for the D-M Soundex are:

1. Names are coded to six digits, each digit representing a sound listed in the Coding Chart below.
2. The letters A, E, I, O, U, J and Y are always coded at the beginning of a name, as in Augsburg (054795). In any other situation, they are ignored except when two of them form a pair and the pair comes before a vowel, as in Breuer (791900), but not Freud. The letter "H" is coded at the beginning of a name, as in Halberstadt (587943), or preceding a vowel, as in Mannheim (665600); otherwise it is not coded.
3. When adjacent sounds can combine to form a larger sound, they are given the code number of the larger sound, as in Chernowitz, which is not coded Chernowi-t-z (496734) but Chernowi-tz (496740).
4. When adjacent letters have the same code number, they are coded as one sound, as in Cherkassy, which is not coded Cherka-s-sy (495440) but Cherkassy (495400). Exceptions to this rule are the letter combinations "MN" and "NM", whose letters are coded separately, as in Kleinman, which is coded 586660, not 586600.
5. When a name consists of more than one word, it is coded as if one word, such as Nowy Targ, which is treated as Nowytarg.
6. Several letters and letter combinations pose the problem that they may sound in one of two ways. The letters and letter combinations CH, CK, C, J and RZ (see chart below) are assigned two possible code numbers. Be sure to try both possibilities.
7. When a name lacks enough coded sounds to fill the six digits, the remaining digits are coded "0", as in Berlin (798600), which has only four coded sounds (B-R-L-N).

Figure 6: Rules for D-M Soundex

This variant produces a 6 character code based on the transformations in the table below.


Table 2: D-M Soundex Code Mapping

Letter combinations found                                          At the beginning   After a vowel   Else
AI, AJ, AY, EI, EY, EJ, OI, OJ, OY, UI, UJ, UY                     0                  1               —
AU                                                                 0                  7               —
IA, IE, IO, IU                                                     1                  —               —
EU                                                                 1                  1               —
A, UE, E, I, O, U, Y                                               0                  —               —
J                                                                  1                  1               1
SCHTSCH, SCHTSH, SCHTCH, SHTCH, SHCH, SHTSH, STCH,
  STSCH, STRZ, STRS, STSH, SZCZ, SZCS                              2                  4               4
SHT, SCHT, SCHD, ST, SZT, SHD, SZD, SD                             2                  43              43
CSZ, CZS, CS, CZ, DRZ, DRS, DSH, DS, DZH, DZS, DZ, TRZ,
  TRS, TRCH, TSH, TTSZ, TTZ, TZS, TSZ, SZ, TTCH, TCH,
  TTSCH, ZSCH, ZHSH, SCH, SH, TTS, TC, TS, TZ, ZH, ZS              4                  4               4
SC                                                                 2                  4               4
DT, D, TH, T                                                       3                  3               3
CHS, KS, X                                                         5                  54              54
S, Z                                                               4                  4               4
CH, CK, C, G, KH, K, Q                                             5                  5               5
MN, NM                                                             —                  66              66
M, N                                                               6                  6               6
FB, B, PH, PF, F, P, V, W                                          7                  7               7
H                                                                  5                  5               —
L                                                                  8                  8               8
R                                                                  9                  9               9

The code for "Moskowitz" would be "645740" and "Auerbach" would be "097500".

Cologne Phonetic (Kölner Phonetik)

The Cologne Phonetic was developed and published by Hans Joachim Postel in 1969, primarily for Germanic names. It maps strings to phonetic codes in a process similar to the Soundex and its variants [Postel 1969]. The codes are not limited to 4 or 6 characters as in the Soundex methods. However, multiple adjacent matching code characters are reduced to one character, and vowels are ignored except at the beginning of the string. The character 'H' is not coded, and 'J' is treated as a vowel. The rules for generating a phonetic code are:


1. Convert the string to upper case.
2. Transform the characters in the string from left to right according to the conversion table.
3. Remove all multiple adjacent matching codes.
4. Remove all occurrences of the code "0" except at the beginning.

Figure 7: Rules for Cologne Phonetic

The table below contains the transformations for the Cologne Phonetic.

Table 3: Cologne Phonetic Code Mapping

Letter                 Context                                              Code
A, E, I, J, O, U, Y                                                         0
H                                                                           —
B                                                                           1
P                      not before H                                         1
D, T                   not before C, S, Z                                   2
F, V, W                                                                     3
P                      before H                                             3
G, K, Q                                                                     4
C                      at onset before A, H, K, L, O, Q, R, U, X            4
C                      before A, H, K, O, Q, U, X except after S, Z         4
X                      not after C, K, Q                                    48
L                                                                           5
M, N                                                                        6
R                                                                           7
S, Z                                                                        8
C                      after S, Z                                           8
C                      at onset except before A, H, K, L, O, Q, R, U, X     8
C                      not before A, H, K, O, Q, U, X                       8
D, T                   before C, S, Z                                       8
X                      after C, K, Q                                        8

The code for “Brezhnev” is 17863.

NYSIIS

The New York State Identification and Intelligence System (NYSIIS) was a project headed by Robert L. Taft in 1970. Taft published a paper that compared the Soundex and NYSIIS, in which he stated that NYSIIS was designed through rigorous empirical analysis (of names) [Taft 1970]. The algorithm is shown below.


1. Convert all characters to upper case
2. Perform these translations if they are the first characters in the string:
   MAC → MCC
   PH → FF
   KN → NN
   PF → FF
   K → C
   SCH → SSS
3. Perform these translations if they are the last characters in the string:
   EE → Y
   IE → Y
   {DT, RT, RD, NT, ND} → D
4. The first character of the code is the first character of the string
5. Translate the remaining characters:
   EV → AF
   {A, E, I, O, U} → A
   Q → G
   Z → S
   M → N
   KN → N
   K → C
   SCH → SSS
   PH → FF
   If the previous or next character is a consonant, H → ''
   If the previous character is a vowel, W → A
6. Translate the last characters in the string:
   S → ''
   AY → Y
   A → ''
7. The code can optionally be truncated to 6 characters

Figure 8: NYSIIS Mapping

Metaphone

The Metaphone was published by Lawrence Philips in 1990 as an alternative phonetic index to the Soundex. The original Metaphone algorithm uses 16 symbols, "0BFHJKLMNPRSTWXY", to represent consonant sounds. The '0' (zero) represents the "th" sound and the 'X' represents the "ch" or "sh" sounds. All other symbols represent the typical English pronunciation [Philips 1990]. The algorithm below shows the transformation rules for the Metaphone.

1. Convert all characters to upper case
2. Drop all duplicate adjacent characters, except for 'C' (i.e. "CC" is retained)
3. If the string begins with the following, perform these transformations:
   KN → N
   GN → N
   PN → N
   AE → E
   WR → R
4. Replace 'C' as:
   SCH → SKH
   CIA → XIA
   CH → XH
   CI → SI
   CE → SE
   CY → SY
   C → K
5. Replace 'D' as:
   DGE → JGE
   DGY → JGY
   DGI → JGY
   D → T
6. Replace 'G' as:
   If "GH" is not at the end of the string or before a vowel: GH → H
   If not at the end of the string: GN → N, GNED → NED
   GI → JI
   GE → JE
   GY → JY
   G → K
7. If 'H' follows a vowel and is not before a vowel: H → ''
   Translate the following:
   CK → K
   PH → F
   Q → K
   V → F
   Z → S
8. Replace 'S' as:
   SH → XH
   SIO → XIO
   SIA → XIA
9. Replace 'T' as:
   TIA → XIA
   TIO → XIO
   TH → 0 (zero)
   TCH → CH
10. If "WH" is at the beginning of the string: WH → W
11. If 'W' does not precede a vowel: W → ''
12. If 'X' is at the beginning of the string: X → S; else X → KS
13. If 'Y' does not precede a vowel: Y → ''
14. Remove all vowels except those at the beginning of the string

Figure 9: Metaphone Mapping

Caverphone

The Caverphone was developed in 2002 by David Hood at the University of Otago for the Caversham Project, to match data in old and new electoral lists under New Zealand pronunciation of names. It was intended to match electoral rolls for the 19th and early 20th centuries, when names were recorded in a "commonly recognizable form" [Caversham 2012]. The algorithm below shows the rules to generate the code from a string.

1. Convert the string to lowercase
2. Remove characters that are not a-z
3. If the name begins with:
   cough → cou2f
   rough → rou2f
   tough → tou2f
   enough → enou2f
   gn → 2n
4. If the name ends with:
   mb → m2
5. Perform transformations:
   cq → 2q
   ci → si
   ce → se
   cy → sy
   tch → 2ch
   c → k
   q → k
   x → k
   v → f
   dg → 2g
   tio → sio
   tia → sia
   d → t
   ph → fh
   b → p
   sh → s2
   z → s
   any initial vowel → A
   all other vowels → 3
   3gh3 → 3kh3
   gh → 22
   g → k
   groups of the letter 's' → S
   groups of the letter 't' → T
   groups of the letter 'p' → P
   groups of the letter 'k' → K
   groups of the letter 'f' → F
   groups of the letter 'm' → M
   groups of the letter 'n' → N
   w3 → W3
   wy → Wy
   wh3 → Wh3
   why → Why
   w → 2
   any initial 'h' → A
   all other occurrences of 'h' → 2
   r3 → R3
   ry → Ry
   r → 2
   l3 → L3
   ly → Ly
   l → 2
   j → y
   y3 → Y3
   y → 2
6. Remove:
   2 → ''
   3 → ''
7. Put six 1s on the end
8. Take the first six characters as the code

Figure 10: Caverphone Mapping

2.4.2 Field Comparison with String Distance Metrics

There are a variety of string distance metrics, from simple algorithms such as Hamming Distance and Jaccard Distance to more elaborate dynamic programming solutions such as the Levenshtein edit distance and its variants.

Hamming Distance (Ham) [Hamming 1950] is the de facto standard for comparing fixed-length numeric strings. Its strength is in its simplicity: it has O(n) complexity, where n = min(|s|, |t|). Its weakness is its inability to detect insertion, deletion and transposition errors. Given the algorithm below, the Hamming distance between the strings "Smith" and "Simth" would be two, which would result in only a 60% similarity for a single transposition error.

Algorithm 1: Ham(s, t)
Input: s, t: strings of characters
Output: n: 32-bit integer
Begin
  Step 1: Check for empty or different-length strings
    if |s| = 0: return -1 end-if
    if |t| ≠ |s|: return -1 end-if
  Step 2: Calculate distance
    n ← 0
    for i = 1 to |s|
      if s(i) ≠ t(i)
        n ← n + 1
      end-if
    end-for
    return n
End

Jaro

The Jaro Distance [Jaro 1989] is a string similarity metric, normalized between 0 and 1, used to determine whether two strings s and t are likely to match. The Jaro distance is defined as:

Equation 10: Jaro Distance

d_Jaro = (1/3) (m/|s| + m/|t| + (m − x)/m)

where m is the number of matching characters and x is half the number of transpositions.

Two characters are said to match if their distance is less than:


Equation 11: Jaro Match

d_Match = ⌊max(|t|, |s|) / 2⌋ − 1

Consider an example comparison: s = "Smith" and t = "Simth". The number of matching characters m = 5 because both strings have the same three characters 'S', 't' and 'h' in the same positions, and the characters 'm' and 'i' are less than d_Match = 1.5 character positions apart. Half the number of transpositions is x = 0.5, because there is one transposition: "mi" to "im". Then d_Jaro = (1/3)(5/5 + 5/5 + 4.5/5) ≈ 0.967.
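The Jaro computation can be sketched as follows. Note that implementations differ in how transpositions are counted; this sketch uses the common convention of half the number of matched characters that are out of order, which yields approximately 0.933 for "Smith"/"Simth" rather than the 0.967 of the worked example above.

```python
def jaro(s, t):
    """Jaro similarity with match window max(|s|, |t|) // 2 - 1."""
    if s == t:
        return 1.0
    window = max(len(s), len(t)) // 2 - 1
    s_flags = [False] * len(s)
    t_flags = [False] * len(t)
    m = 0
    for i, ch in enumerate(s):                      # find matching characters
        lo, hi = max(0, i - window), min(len(t), i + window + 1)
        for j in range(lo, hi):
            if not t_flags[j] and t[j] == ch:
                s_flags[i] = t_flags[j] = True
                m += 1
                break
    if m == 0:
        return 0.0
    s_m = [s[i] for i in range(len(s)) if s_flags[i]]
    t_m = [t[j] for j in range(len(t)) if t_flags[j]]
    x = sum(a != b for a, b in zip(s_m, t_m)) / 2   # half the transpositions
    return (m / len(s) + m / len(t) + (m - x) / m) / 3
```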

Jaro-Winkler

The Jaro-Winkler Distance [Winkler 1990] is an extension of Jaro that awards higher scores for longer matching prefix lengths l, using a predetermined prefix scale p. The Jaro-Winkler distance is defined as:

Equation 12: Jaro-Winkler Distance

d_Wink = d_Jaro + l · p · (1 − d_Jaro)

Consider the previous strings and let p = 0.2. Then:

d_Wink = 0.967 + 1 × 0.2 × (1 − 0.967) ≈ 0.973

The score is slightly higher due to the matching first character. If the strings were "Smith" and "Smiht", the score would be 0.987 because the matching prefix is three characters: "Smi".

Jaccard Similarity and Distance

The Jaccard Similarity Coefficient [Jaccard 1901] is a statistical measure for comparing the similarity and difference between sets of entities. Similarity is measured as the size of the intersection of the sets divided by the size of their union. For string similarity, the sets are the strings s and t and the elements of the sets are their respective characters. Jaccard Similarity is defined as:

Equation 13: Jaccard Similarity

J(s, t) = |s ∩ t| / |s ∪ t|

Jaccard Difference is defined as:

Equation 14: Jaccard Difference

J′(s, t) = 1 − J(s, t) = (|s ∪ t| − |s ∩ t|) / |s ∪ t|

For the "Smith" and "Simth" example, we have |"Smith" ∩ "Simth"| = 5 and |"Smith" ∪ "Simth"| = 5:

J("Smith", "Simth") = 5/5 = 1

J′("Smith", "Simth") = 1 − J("Smith", "Simth") = (5 − 5)/5 = 0.0

For "Smith" and "this", we have:

J("Smith", "this") = 4/5 = 0.8

J′("Smith", "this") = 1 − J("Smith", "this") = (5 − 4)/5 = 0.2

Jaccard does well with transposition errors because all the characters are the same. Jaccard's weakness is that it ignores the order and frequency of characters: the strings "Smith" and "this" are considered an 80% match. Both Jaccard and Hamming have been used as a pretest or filter to speed up edit distance comparison because of their low computational complexity [Arasu, Ganti and Kaushik 2006].
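Character-set Jaccard similarity is nearly a one-liner, which is part of why it is attractive as a cheap pre-filter:

```python
def jaccard(s, t):
    """Character-set Jaccard similarity (case-insensitive, ignoring the
    order and frequency of characters)."""
    a, b = set(s.lower()), set(t.lower())
    return len(a & b) / len(a | b)
```

This reproduces both examples above, including the misleading 0.8 score for "Smith" versus "this".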


Dice's Coefficient

Dice's Coefficient [Dice 1945] is a statistical measure on sets and is similar to Jaccard, except that the function ranges over (0, 1). Dice's Similarity is defined as:

Equation 15: Dice's Coefficient

D_s(s, t) = 2|s ∩ t| / (|s| + |t|)

Dice’s Difference is defined as:

Equation 16: Dice's Difference

D_d(s, t) = 1 − 2|s ∩ t| / (|s| + |t|)

If we compare “Smith” and “Simth”, we have:

D_s("Smith", "Simth") = (2 × 5)/(5 + 5) = 1

D_d("Smith", "Simth") = 1 − (2 × 5)/(5 + 5) = 0.0

For “Smith” and “this”, we have:

D_s("Smith", "this") = (2 × 4)/(5 + 4) ≈ 0.89

D_d("Smith", "this") = 1 − (2 × 4)/(5 + 4) ≈ 0.11

Tversky Index

The Tversky Index [Tversky 1977] is an asymmetric measure of similarity that is useful for information retrieval. The function returns a number in (0, 1) and is defined as:


Equation 17: Tversky Index

S(s, t) = |s ∩ t| / (|s ∩ t| + α|s − t| + β|t − s|)

where α, β ≥ 0 are parameters that assign weights to the magnitudes of the differences between the sets. Consider the parameters α = 0.5 and β = 0.5; then:

S("Smith", "this") = 4 / (4 + 0.5 × 1 + 0.5 × 0) = 4/4.5 ≈ 0.89

With these parameters, Tversky will consistently produce the same results as Dice but require more computation.

Overlap Coefficient

The Overlap Coefficient computes the ratio of the overlap of two sets to the size of the smaller set:

Equation 18: Overlap Coefficient

O(s, t) = |s ∩ t| / min(|s|, |t|)

For "Smith" and "this", we have:

O("Smith", "this") = 4/4 = 1

N-Gram/Q-Gram N-Grams, also referred to as Q-Grams, are a statistical or probabilistic string comparison technique that segments strings into n-size (or q-size) parts for efficient processing. If n

=1, the parts of the string are referred to as unigrams. For n = 2, they are referred to as bigrams and for n = 3, trigrams, and so on. Bigrams and trigrams are typically used for

RL string comparison. We previously discussed bigram indexing in the section on

blocking. We will describe trigrams with prefix and suffix markers for approximate string field comparison for RL.

The set of trigrams for “Smith” is {“--s”, “-sm”, “smi”, “mit”, “ith”, “th-”, “h--”}. The hyphens are the prefix markers that show the beginning of the string (“--s” means the string begins with ‘s’) and the suffix markers that mark the end of the string (“h--” means the string ends with ‘h’). A lexicographical sort will place a hyphen (ASCII decimal 45) before all characters (uppercase A is 65 and lowercase a is 97). The sort for “Smith” is

{“--s”, “-sm”, “h--”, “ith”, “mit”, “smi”, “th-”}. The sorted trigrams for “Simth” are {“--s”, “-si”, “h--”, “imt”, “mth”, “sim”, “th-”}. If these sorted trigrams are matched, as shown below, only three of the seven trigrams match, yielding a 3/7 = 0.43 match ratio.

If prefixes and suffixes are ignored, the ratio decreases to 0/7 = 0.0. This shows the same problem noted in the blocking section that n-grams are not especially useful for matching relatively short strings for RL field-level string comparison.

{“--s”, “-sm”, “h--”, “ith”, “mit”, “smi”, “th-”}

{“--s”, “-si”, “h--”, “imt”, “mth”, “sim”, “th-”}

Figure 11: Q-Grams for "Smith" and "Simth"
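The trigram construction above can be sketched as follows (Python, illustrative only; “--” is used as both the prefix and suffix marker, as in the text):

```python
def trigrams(s):
    """Trigrams with '--' prefix and suffix markers, as described above."""
    padded = "--" + s.lower() + "--"
    return {padded[i:i + 3] for i in range(len(padded) - 2)}

def trigram_ratio(s, t):
    """Fraction of matching trigrams, relative to the larger trigram set."""
    a, b = trigrams(s), trigrams(t)
    return len(a & b) / max(len(a), len(b))
```

For “Smith” and “Simth” this yields the 3/7 ≈ 0.43 match ratio computed above.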

Levenshtein Distance Levenshtein Distance is commonly referred to as edit distance. For effective string similarity measurement, edit distance has an advantage over other metrics because it considers the ordering of characters and allows nontrivial alignment [Xiao, Wang and Lin

2008]. The Levenshtein edit distance algorithm is a dynamic programming solution for calculating the minimum number of character substitutions, insertions and deletions to

convert one word into another [Levenshtein 1966]. For example, the Levenshtein distance between the words “Saturday” and “Sunday” is 3 because the ‘a’ and ‘t’ can be deleted and the ‘r’ can be substituted with an ‘n’ to convert “Saturday” to “Sunday”. The main problem with the Levenshtein algorithm is its complexity: O(mn), where m and n are the lengths of the compared words. This can require significant computation for comparisons of very large datasets.

Algorithm 2: Lev(s, t)
Input: s, t: strings of characters
Output: n: 32-bit integer
d: (|s| + 1) × (|t| + 1) array of 32-bit integers
Begin
Step 1: Check for empty strings
  if |s| = 0: return |t| end-if
  if |t| = 0: return |s| end-if
Step 2: Create distance matrix d
  for i = 0 to |s|: d(i, 0) ← i end-for
  for j = 1 to |t|: d(0, j) ← j end-for
Step 3: Calculate distance matrix
  for i = 1 to |s|
    for j = 1 to |t|
      if s(i − 1) = t(j − 1)
        d(i, j) ← d(i − 1, j − 1)
      else
        d(i, j) ← min(d(i − 1, j), d(i, j − 1), d(i − 1, j − 1)) + 1
      end-if
    end-for
  end-for
  return d(|s|, |t|)
end
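A direct Python rendering of Algorithm 2, for illustration (a straightforward O(mn) dynamic program, not an optimized implementation):

```python
def lev(s, t):
    """Levenshtein distance via the full (m+1) x (n+1) DP matrix."""
    m, n = len(s), len(t)
    if m == 0:
        return n
    if n == 0:
        return m
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i  # cost of deleting i characters of s
    for j in range(n + 1):
        d[0][j] = j  # cost of inserting j characters of t
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if s[i - 1] == t[j - 1]:
                d[i][j] = d[i - 1][j - 1]  # characters match: copy diagonal
            else:
                d[i][j] = min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1]) + 1
    return d[m][n]
```

This reproduces the “Saturday”/“Sunday” distance of 3 from the text; note that a transposition such as “Smith”/“Simth” costs 2 here, since plain Levenshtein has no transposition operation.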

Figure 12 contains the Levenshtein computation matrix. The dark grey cells show the initial cell numbers for rows and columns and the light grey area is where the edit distance is calculated for standard Levenshtein. The intersections of character columns and rows in the light grey area are the optimal calculated distances for all substrings. For

example, the distance between “Sat” and “Sun” is 2 because the intersection at ‘t’ and ‘n’ is 2.

Figure 12: Calculation of Levenshtein distance

Damerau-Levenshtein Edit Distance Damerau-Levenshtein (DL) edit distance extends Levenshtein distance to also detect transposition errors and treat them as one edit operation. Approximately 80 percent of data entry errors can be corrected using a one character substitution, one character deletion, one character insertion or a transposition of two characters [Damerau 1964].

The algorithm below shows DL for character strings (Damerau’s contribution to DL is the added transposition test):

Algorithm 3: DL(s, t)
Input: s, t: strings of characters
Output: n: 32-bit integer
d: (|s| + 1) × (|t| + 1) array of 32-bit integers
Begin
Step 1: Check for empty strings
  if |s| = 0: return |t| end-if
  if |t| = 0: return |s| end-if
Step 2: Create distance matrix d
  for i = 0 to |s|: d(i, 0) ← i end-for
  for j = 1 to |t|: d(0, j) ← j end-for
Step 3: Calculate distance matrix
  for i = 1 to |s|
    for j = 1 to |t|
      if s(i − 1) = t(j − 1)
        d(i, j) ← d(i − 1, j − 1)
      else
        d(i, j) ← min(d(i − 1, j), d(i, j − 1), d(i − 1, j − 1)) + 1
        if i > 1 and j > 1
          if s(i − 1) = t(j − 2) and s(i − 2) = t(j − 1)
            d(i, j) ← min(d(i, j), d(i − 2, j − 2) + 1)
          end-if
        end-if
      end-if
    end-for
  end-for
  return d(|s|, |t|)
end
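A Python sketch of Algorithm 3 (illustrative only); the extra min against d(i−2, j−2) + 1 is Damerau's transposition test:

```python
def dl(s, t):
    """Damerau-Levenshtein (restricted edit distance): Levenshtein
    plus adjacent-transposition detection, following Algorithm 3."""
    m, n = len(s), len(t)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if s[i - 1] == t[j - 1]:
                d[i][j] = d[i - 1][j - 1]
            else:
                d[i][j] = min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1]) + 1
                # Damerau's addition: an adjacent transposition costs one edit
                if i > 1 and j > 1 and s[i - 1] == t[j - 2] and s[i - 2] == t[j - 1]:
                    d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[m][n]
```

The transposed pair in “SUNDAY”/“SUDNAY” now costs 1 instead of the 2 that plain Levenshtein charges.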

The two computation matrices in Figure 13 below demonstrate the subtle differences in computing Levenshtein (Left) and Damerau-Levenshtein (Right). The strings

“SUNDAY” and “SUDNAY” were processed using both algorithms and differ by only a single transposition error. At the second intersection of ‘D’ and ‘N’, Damerau-Levenshtein detects the transposition and does not add an extra edit. This quality is important for both alphabetic and numeric strings. The most likely error that occurs when recording Social Security Numbers and phone numbers is a single transposition error.

Levenshtein:                    Damerau-Levenshtein:
      S  U  D  N  A  Y                S  U  D  N  A  Y
   0  1  2  3  4  5  6             0  1  2  3  4  5  6
S  1  0  1  2  3  4  5          S  1  0  1  2  3  4  5
U  2  1  0  1  2  3  4          U  2  1  0  1  2  3  4
N  3  2  1  1  1  2  3          N  3  2  1  1  1  2  3
D  4  3  2  1  2  3  4          D  4  3  2  1  1  2  3
A  5  4  3  2  2  2  3          A  5  4  3  2  2  1  2
Y  6  5  4  3  3  3  2          Y  6  5  4  3  3  2  1

Figure 13: Calculation of Levenshtein (left) and Damerau-Levenshtein (right) distances


2.4.3 Optimizations for Edit-Distance Comparison Approximate string matching is much more computationally expensive than exact string matching and has been a hot research topic for decades. The most common and effective way to compare strings is the Damerau-Levenshtein (DL) edit distance algorithm, which counts the number of character edits that must be made to convert a string s into another string t. Equation 19 below shows the computation for each element in the DL matrix, which is based on the concept of dynamic programming. The first line applies when the two compared characters are equal and simply copies the previously computed distance.

The second line tests for a transposition error and adds 1 to the distance two steps back, which is Damerau’s contribution to the original Levenshtein edit distance [Levenshtein

1966]. The third line adds 1 to the distance if a substitution, insertion or deletion error is detected.

If s(j) = t(i):  C(j, i) ← C(j − 1, i − 1)

Else if s(j − 1) = t(i) and s(j) = t(i − 1):  C(j, i) ← C(j − 2, i − 2) + 1

Else:  C(j, i) ← min(C(j − 1, i − 1) + 1, C(j − 1, i) + 1, C(j, i − 1) + 1)

Equation 19: DL computation

These edit operations include character substitutions, insertions, deletions and transpositions. This algorithm has a computational complexity of O(mn), meaning that for two compared strings, s and t, where m = |s| and n = |t|, the algorithm requires mn time and space to compute the edit distance. It takes more than 9 ½ hours just to compare the complete list of 151,671 last names (surnames) that occur 100 or more times from the

2000 U.S. Census [U.S. Census 2000]. Given the complexity of DL, it would take a very long time to pairwise compare every street address in the country. Addresses are longer strings and every address in the country is a much longer list. There are almost 550,000 street addresses in the City of Philadelphia alone. Approximate matching aims to find all

strings t in a set T for a given pattern s that match within k or fewer edits, i.e., DL(s, t) ≤ k.

String pairs that have an edit distance of k or less are said to be k-related. Any optimization for edit distance must guarantee that it does not violate the k-relation between strings that approximately match using DL. There are some optimizations available to speed up DL for approximate string matching. Some of these are filtering, neighborhood generation [Myers 1994], pruning [Wang, Feng and Li 2010], hashing [Knuth 1997] and other methods [Apostolico and Galil 1985].

Prefix Pruning Prefix Pruning is based on the premise that once a pair of strings has been determined to be sufficiently different for an edit distance measurement, the calculation should be terminated. A user-defined threshold can be added to decrease the computation required for DL. For a threshold k, it is only necessary to compute the elements in Step 3 of the DL algorithm from j = i − k to j = i + k, which reduces the computation to a strip of width 2k + 1 along the diagonal [Gusfield 1997]. The threshold can be used to force an early termination if

d(i, ∗) > k [Sakoe and Chiba 1990]. The implementation below, called Pruning-Damerau-Levenshtein (PDL), is similar to an edit distance Prefix Pruning method for Trie-based string similarity joins [Wang, Feng and Li 2010]. A counter variable x is added to count the cells where d(i, j) ≤ k. If x ≤ 0 for row i, the function terminates and returns a

Boolean value FALSE. This decreases the complexity of DL from 푂(푚푛) to 푂(푘푙), where l is the length of the shortest string. If PDL completes and 푑(푚, 푛) ≤ 푘, the function returns TRUE, otherwise it returns FALSE. There are two lines that assign elements in the distance matrix 푑 to 1000. This is to impose a border of arbitrarily large integers just outside the 2푘 + 1 strip, ensuring the selection of a correct minimum value in

the line d(i, j) ← min(d(i − 1, j), d(i, j − 1), d(i − 1, j − 1)) + 1, because d is initially all zeros. The algorithm for PDL differs from the algorithm in [Wang, Feng and Li 2010] because it has been modified to perform DL instead of the standard Levenshtein distance and it returns a

Boolean value.

Algorithm 4: PDL(s, t, k)
Input: s, t: strings of characters; k: integer threshold
Output: Boolean
d: (|s| + 1) × (|t| + 1) array of integer zeros
m, n, x: integers
Begin
Step 1: Check for empty strings and string lengths
  m ← |s|
  n ← |t|
  if m = 0: return FALSE end-if
  if n = 0: return FALSE end-if
Step 2: Create distance matrix d
  for i = 0 to m: d(i, 0) ← i end-for
  for j = 1 to n: d(0, j) ← j end-for
Step 3: Calculate distance matrix
  for i = 1 to m
    x ← 0
    if i > k: d(i, i − k − 1) ← 1000 end-if
    for j = max(i − k, 1) to min(i + k, n)
      if s(i − 1) = t(j − 1)
        d(i, j) ← d(i − 1, j − 1)
      else
        d(i, j) ← min(d(i − 1, j), d(i, j − 1), d(i − 1, j − 1)) + 1
        if i > 1 and j > 1
          if s(i − 1) = t(j − 2) and s(i − 2) = t(j − 1)
            d(i, j) ← min(d(i, j), d(i − 2, j − 2) + 1)
          end-if
        end-if
      end-if
      if d(i, j) ≤ k: x ← x + 1 end-if
    end-for
    if j < n: d(i, j + 1) ← 1000 end-if
    if x ≤ 0: return FALSE end-if
  end-for
  if d(m, n) ≤ k: return TRUE else: return FALSE end-if
end

The matrix for PDL is in Figure 14. The 15 ‘K’ and 16 ‘X’ cells are the computational savings for PDL with k = 2, a difference of 2 or fewer characters. PDL only required 17 cell evaluations of the 48 required by DL. For k = 1, PDL would terminate immediately because abs(|s| − |t|) > k. If this restriction were removed, only 8 cell evaluations would be required because the underlined cells would be skipped.

Figure 14: Prefix Pruning DL (PDL) distance matrix
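A Python sketch of the PDL idea (banded computation plus early termination; it is not the dissertation's code, and it uses k + 1 rather than 1000 as the out-of-band border value, which is sufficient for deciding DL(s, t) ≤ k):

```python
def pdl(s, t, k):
    """Pruned Damerau-Levenshtein: True iff DL(s, t) <= k.
    Only cells within k of the diagonal are computed; a row whose
    band minimum exceeds k triggers early termination."""
    m, n = len(s), len(t)
    if m == 0 or n == 0 or abs(m - n) > k:
        return False
    border = k + 1  # any value > k acts as the out-of-band border
    d = [[border] * (n + 1) for _ in range(m + 1)]
    for j in range(min(n, k) + 1):
        d[0][j] = j
    for i in range(1, min(m, k) + 1):
        d[i][0] = i
    for i in range(1, m + 1):
        lo, hi = max(1, i - k), min(n, i + k)
        for j in range(lo, hi + 1):
            if s[i - 1] == t[j - 1]:
                d[i][j] = d[i - 1][j - 1]
            else:
                d[i][j] = min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1]) + 1
                if i > 1 and j > 1 and s[i - 1] == t[j - 2] and s[i - 2] == t[j - 1]:
                    d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
        if min(d[i][lo:hi + 1]) > k:
            return False  # no cell in this row can still lead to a match
    return d[m][n] <= k
```

This preserves the k-relation: any pair rejected here has a true DL greater than k, while pairs within k are always accepted.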

Filter and Verify Methods In the 1990s, methods to decrease data comparison using edit distance called “filter and verify” methods were introduced. Basically, filtering uses a computationally cheap method to filter out pairwise comparisons that have no chance of matching using the more expensive DL. Research on these methods is still very active and filters have the potential to discard a very large number of more comprehensive and expensive comparisons [Navarro 2001]. One of the earliest known and most common methods,

is Length Filtering [Gravano and Ipeirtis 2001]. The idea is based on the fact that two strings, s and t, cannot differ by k or fewer characters or edit operations if the difference in their lengths is greater than k. Consider that “Joe” and “Jose”, and “Jose” and “Josef”, are approximate matches for k = 1, but “Joe” and “Josef” are not. It is obvious that

length filtering will not work on fixed-length strings, such as phone numbers, Social

Security Numbers and many other types of IDs. The length filter from [Gravano and

Ipeirtis 2001] is defined as shown in Algorithm 5:

Algorithm 5: LengthFilter(s, t, k)
Input: s, t: strings of characters; k: integer threshold
Output: Boolean
Begin
  if abs(|s| − |t|) > k: return FALSE
  else: return TRUE end-if
end

Given a function L(s, t) that computes the absolute difference of the string lengths, we filter as shown below in length-filtered DL (LDL). If the first term, L(s, t) ≤ k, fails, the second term, DL(s, t) ≤ k, is not evaluated, which saves unnecessary computation because it takes much less work to count characters than to perform DL.

if (L(s, t) ≤ k AND DL(s, t) ≤ k), M ← (s, t)

Equation 20: Composition of Length Filter and DL
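The composition in Equation 20 maps directly onto short-circuit evaluation. A sketch (Python, illustrative only), bundling a compact DL implementation so the example is self-contained:

```python
def dl(s, t):
    """Compact Damerau-Levenshtein distance, as described earlier."""
    m, n = len(s), len(t)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if s[i - 1] == t[j - 1]:
                d[i][j] = d[i - 1][j - 1]
            else:
                d[i][j] = min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1]) + 1
                if i > 1 and j > 1 and s[i - 1] == t[j - 2] and s[i - 2] == t[j - 1]:
                    d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[m][n]

def ldl(s, t, k):
    """Length-filtered DL: the cheap length test short-circuits
    the expensive DL call, as in Equation 20."""
    return abs(len(s) - len(t)) <= k and dl(s, t) <= k
```

For “Joe” and “Josef” with k = 1 the length filter rejects the pair without ever computing DL.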

Neighborhood Generation Neighborhood generation [Myers 1994] generates all approximate strings for a pattern s given an alphabet Σ and performs exact matching, usually using an index or binary search, to find all approximate matches in a list T. Consider the alphabet Σ = {a, b, c} and the pattern s = aba. There will be exactly three deletion matches {ba, aa, ab}, nine insertion matches {aaba, baba, caba, abba, acba, abaa, abca, abab, abac}, and six substitution matches {bba, cba, aaa, aca, abb, abc}. One can easily see that the processing and number of neighbors generated can explode for a larger alphabet |Σ| and

longer strings |s|. The strings generated are called the k-neighborhood of s and are denoted as Nk(s) = {x ∈ Σ∗ : DL(x, s) ≤ k} [Navarro, Baeza-yates, Sutinen, and Tarhio 2000]. The growth of the k-neighborhood is estimated to be |Nk(s)| = O(m^k |Σ|^k) [Ukkonen 1985].
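A sketch of 1-neighborhood generation (Python, illustrative only; insertions, deletions and substitutions over a given alphabet, with transpositions omitted to match the example above):

```python
def one_neighborhood(s, alphabet):
    """All distinct strings one insert/delete/substitute edit away from s."""
    out = set()
    for i in range(len(s)):
        out.add(s[:i] + s[i + 1:])                  # deletions
        for c in alphabet:
            if c != s[i]:
                out.add(s[:i] + c + s[i + 1:])      # substitutions
    for i in range(len(s) + 1):
        for c in alphabet:
            out.add(s[:i] + c + s[i:])              # insertions
    out.discard(s)
    return out
```

For s = "aba" over {a, b, c} this yields the 3 deletion, 9 insertion and 6 substitution matches listed above, 18 distinct strings in all.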

Hashing Hashing converts strings into a numeric code that arranges a list of strings 푇 into a hash table 푉 of buckets each of which contains strings that may be spelled similarly. Consider that for a function ℎ() that converts strings into hash codes, we have 푢 = ℎ(푠) for a search pattern 푠 and a search pattern hash code 푢, a hash table 푉, where ∀푣 ∈ 푉 and

∀t ∈ T, v = h(t), and a function H(u, V) that finds candidate matches C from the bucket codes in V. The exact-match string will always be in the same bucket because if s = t then u = v. If the hash coding method is well designed for approximate matching, then s ≈ t implies u ≈ v. Hashing methods increase approximate matching performance because they substantially decrease the search space by selecting candidate hash buckets that may have some similarity with the search pattern and ignoring those that do not.

Hashing will be discussed further in subsequent topics.

Signature Hashing Signature Hashing is referred to as a simple folklore approximate dictionary search optimization method—meaning that it is not really published in the literature but is reasonably well known in the domain [Boytsov 2011], though hashing itself is extensively covered [Knuth 1997]. The strings s from dictionary set S are converted to numeric codes, representing hash buckets H, via some hash function h( ) to reduce the size of the alphabet by mapping Σ → ϭ, where |ϭ| < |Σ|, such that s′ = h(s). An example

of a hash map operation is shown in Figure 15 below. Individual characters from the left-hand side are mapped to the right-hand side, basically reducing the size of the alphabet from 26 to 13 symbols by grouping characters in pairs and storing them in binary signatures. This reduces the search space to 2^13 (8,192) buckets. Without the reduction, 2^26 (67,108,864) buckets would be needed to store un-hash-coded binary signatures.

a→{ab} b→{ab} c→{cd} d→{cd} … w→{wx} x→{wx} y→{yz} z→{yz}

Figure 15: Sample Hash Map

The mapping in Figure 16 below shows the hash code for the name “JAMES”, which would map to bucket number 597 (binary 0001001010101 = decimal 597). Any recurring characters are dropped, and if ‘A’ and ‘B’ (or any other grouped pair) are both present they only flip one bit. This scheme would result in a one-bit difference between the words

“MISS” (0001001010000 = 592) and “MISSISSIPPI” (0001011010000 = 720) because the {OP} bit would be ‘1’ in the latter.

YWUSQOMKIGECA
ZXVTRPNLJHFDB
0001001010101

Figure 16: 13-bit Hash Code for "JAMES"

Each t ∈ T will hash to one-and-only-one bucket. If two strings are assigned to the same bucket because they have the same hash code, they are said to have had a collision. This problem is handled by chaining, or allowing more than one string index in a bucket, by using an array to contain the indices of all t_i that hash to any given code. The hash codes

are the indices to the buckets represented by arrays of candidate matches. For a match pattern string s, an exact match in the hash table, t = s, will be in the bucket that matches the hash code for s, s′ = h(s), if it is a member of T. To find approximate matches for s that differ by k edits, the binary signatures of the hash table are scanned to find all signatures that differ by at most 2k bits. Any t ∈ T in the buckets that are returned is a candidate for an approximate match of k or fewer edits [Boytsov 2011].
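A sketch of the 13-bit signature described above (Python, illustrative only; purely alphabetic input is assumed):

```python
def signature(s):
    """13-bit signature: each letter pair (ab, cd, ..., yz) shares one bit."""
    v = 0
    for c in s.lower():
        v |= 1 << ((ord(c) - ord("a")) // 2)
    return v
```

This reproduces the worked values: signature("JAMES") is 597, and "MISS" (592) and "MISSISSIPPI" (720) differ in exactly one bit, the {OP} bit.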

Full Neighborhood Generation Full Neighborhood Generation generates all strings within k edits of a pattern string s, which is referred to as a k-neighborhood and is denoted as Nk(t) = {s ∈ Σ∗ : DL(t, s) ≤ k}. Each of the strings s in the k-neighborhood is exact-matched against the dictionary set T. The major drawback is the potential size of the generated neighborhood.

Consider the full 1-neighborhood for “JAMES”:

The string “JAMES”

4 single transpositions

5 single character deletions

6 x 26 single character insertions

5 x 25 single substitutions

This is 291 elements in the 1-neighborhood list. A rough estimate for this growth is O(n^k |Σ|^k) [Ukkonen 1993]. The full 2-neighborhood for “JAMES” has approximately 16,900 elements and the full 3-neighborhood would have nearly 2.2 million elements.

The growth of the space and computation is dependent on the number of edits 푘, the size of the string 푚, and the size of the alphabet Σ. The value in this method is that candidates can be generated, albeit in large sets for greater values of 푘, 푚, and |Σ| [Boytsov 2012].
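The operation counts listed above can be expressed as a closed form (Python, illustrative only; this counts generated edit operations as in the text, a rough upper bound, since duplicate generated strings are not removed):

```python
def full_neighborhood_ops(m, sigma=26):
    """Number of strings generated for a full 1-neighborhood of a
    length-m string over an alphabet of size sigma."""
    original = 1
    transpositions = m - 1
    deletions = m
    insertions = (m + 1) * sigma
    substitutions = m * (sigma - 1)
    return original + transpositions + deletions + insertions + substitutions
```

For m = 5 and sigma = 26 this gives 1 + 4 + 5 + 156 + 125 = 291, the count quoted for “JAMES”.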


Reduced Alphabet Neighborhood Generation Hashing Reduced Alphabet Neighborhood Generation Hashing uses a character-by-character coding to generate a reduced neighborhood set C of candidate strings by converting each string s ∈ S into its projection s′ = h(s). The alphabet Σ = {A, B, C, …, X, Y, Z} is reduced to a binary alphabet ϭ = {0, 1} using the character mapping: for all si ∈ s, h(si) = 0 if si ∈ [a, m] and h(si) = 1 if si ∈ [n, z]. Then for s′ = h(s), where s = “JAMES”, s′ = "00001". The string “JAMES” and all other strings with the code “00001” would be assigned to the same bucket. All binary strings that differ by k or fewer bits are calculated, and all k-related strings should be in these buckets.

The reduced 1-neighborhood for “JAMES” would be:

The string "00001"

1 single transposition

5 single character deletions

6 x 2 single character insertions

5 x 2 single substitutions

This substantially reduces the size of the neighborhood. This reduced set C has 29 elements and is used to find string candidates t ∈ T to be processed with a more elaborate edit distance to choose string pairs where DL(s, t) ≤ k.
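A sketch of the binary projection and the resulting bucketing (Python, illustrative only; the helper names are not from the dissertation, and alphabetic input is assumed):

```python
from collections import defaultdict

def project(s):
    """Binary projection: letters a-m map to '0', n-z map to '1'."""
    return "".join("0" if c.lower() <= "m" else "1" for c in s)

def build_buckets(words):
    """Group dictionary strings by their binary projection."""
    buckets = defaultdict(list)
    for w in words:
        buckets[project(w)].append(w)
    return buckets
```

Strings with the same projection land in the same bucket, so a lookup for “JAMES” only needs to examine buckets whose codes are within k bit-edits of "00001".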

Ed-Join: An Efficient Algorithm for Similarity Joins with Edit Distance Constraints In [Xiao, Wang and Lin 2008], the authors offer a new method for similarity joins based on studying mismatching q-grams by analyzing their locations and contents. Their method exploits mismatch-based filtering algorithms, uses positional q-grams and a probing window over q-gram signatures to decrease the size of candidate sets. The

method has very impressive performance results for both efficiency and effectiveness.

However, as with previous methodologies, this method joins or concatenates all fields into one string, which substantially decreases the amount of information contained within compared records. Also, it would not be practical to use this on relatively short fields, such as first name and last name, because single edit errors can increase the potential for false negatives as was shown in the discussion about blocking.

Trie-Join: Efficient Trie-based String Similarity Joins with Edit Distance Constraints Special purpose approximate string matching tree structures, referred to as “tries”, have been used to facilitate DL-type searching. In [Wang, Feng and Li 2010], the authors use trie structures with edit distance constraints to compress data and compute string similarity more efficiently. Their experimentation shows that their algorithm is much more efficient for shorter strings with low edit distances of 1 or 2 edits. The experiments included Ed-Join and some of the previous methods upon which Ed-Join is based. An interesting algorithm is their Prefix Pruning edit distance algorithm. It decreases the run-time of edit distance by restricting computation to elements within a distance k of the diagonal and introducing an early termination protocol. We had developed a similar method but Prefix Pruning outperformed it. We modified this algorithm, as described in Prefix Pruning, above.

The Trie-Join method combined the concept of the aforementioned prefix pruning and a tree structure to index strings that approximately match from a dictionary, list or data store. Figure 17 below shows a trie and the data used to create the tree structure. If a node in the trie refers to a string in the list, it will contain a pointer to it. These methods typically perform a depth first search to find all approximate matches in the trie. The

downside to these methods is that there can be up to |Σ| pointers from each node at each level to cover the whole alphabet and pointers back to the string list. These structures typically require at least nine times the memory that it takes to store the list. Consider that the string “bag” requires three bytes and the trie entry requires thirty-nine bytes.

These methods are fast but will require too much memory for very large data.

Figure 17: Prefix trie

2.4.4 Overview of String Comparison and Optimization Figure 18 below shows the taxonomy of the described phonetic matching algorithms based on their applications to names from various ethnic groups. Most of these algorithms are designed to cover names from a specific ethnicity or group of ethnicities.

The Russell Soundex was designed for Anglo-Saxon names because most names in the

U.S. were within this ethnicity in the late 1800s. Presently, the U.S. has a much more ethnically diverse population. Only the second and third generations of the Metaphone claim to be designed for a large group of ethnicities.


Phonetic Mapping
- Anglo-Saxon Names: Russell Soundex, American Soundex, Enhanced Soundex, Metaphone
- Anglo-Saxon and Hispanic Names: NYSIIS
- Yiddish and Slavic Names: Daitch-Mokotoff Soundex
- Germanic Names: Cologne Phonetic
- New Zealand Names: Caverphone
- Anglo-Saxon, Hispanic, Slavic, Germanic, Celtic, Greek, Italian, French, Chinese, etc.: Metaphone 2 and 3

Figure 18: Taxonomy of phonetic algorithms

Figure 19 below shows the taxonomy of string distance measures and optimizations based on the primary matching methodology. Edit distance is the number of characters that differ between compared strings. Proximal similarity checks for matching characters in compared strings that are near each other. Set operations consider strings as sets of characters and perform various set comparisons to determine likeness or difference.

Exact match methods generate a neighborhood of candidate strings and use an exact match to find matching entries in a list or dictionary. Substring methods cut strings into groups of characters and match the substrings. Filtering uses a computationally efficient method to generate a subset of candidates that may match. Indexing uses a fast match to avoid comparing all strings in a list or dictionary.


String Distance
- Edit Distance: Levenshtein, Damerau-Levenshtein, Pruned DL, Ed-Join, Trie-Join
- Proximal Similarity: Hamming, Jaro, Jaro-Winkler
- Set Operations: Jaccard, Dice, Tversky, Overlap
- Exact Match: Neighborhood Generation, Reduced Neighborhood Gen.
- Substring Matching: N-Gram/Q-Gram
- Filtering: Length Filter
- Indexing: Hashing, Signature Hashing

Figure 19: Taxonomy of string distance metrics and optimizations

2.5 Survey of Existing RL Methods

We consider eight examples of RL applications that were developed by academia and government health agencies.


2.5.1 FRIL (2008) FRIL is an acronym for fine-grained record integration and linkage [Jurczyk, Lu, Xiong,

Cragan and Correa 2008a] [Jurczyk, Lu, Xiong, Cragan and Correa 2008b]. The FRIL project was a cooperative effort between the Mathematics and Computer Science Dept. at

Emory University and the National Center on Birth Defects and Developmental

Disabilities at the Centers for Disease Control and Prevention. The FRIL system uses a set of tools described in toolboxes such as TAILOR [Elfeky, Verykios, and Elmagarmid

2002] but unlike other offerings, some described below, FRIL allows a fine tuning of the match system by allowing the user to choose various search methods, distance functions for record similarity and decision models. Figure 20 below shows the FRIL architecture.

The search methods are algorithms that determine which records to compare. The two search methods in FRIL are nested loop join (NLJ), an exhaustive search that compares all possible pairs, and the sorted neighborhood method (SNM), which sorts the records in databases Q and R on some relevant attributes and only compares records that fall within some fixed-size windows, 휔푄 for database Q and 휔푅 for database R that are advanced down the records in each database. This is a blocking method as mentioned above. It avoids the exhaustive search and should substantially decrease pairwise comparisons.

FRIL also facilitates schema mapping to normalize discrepancies between databases by allowing a user to identify attributes that map from one source to another. This is referred to as attribute selection and mapping (ASM); the set of attributes is denoted as Φ and each element is a ∈ Φ.


Figure 20: FRIL Model

FRIL has four distance functions that a user can choose: edit distance, Soundex, Q-gram and equality, called the distance function selection (DFS). The user can indicate a threshold for acceptance or rejection of a string pair using a simple probabilistic fuzzy logic, referred to as attribute scoring (AS). Equation 21 below shows a function mfa for calculating a string match from a distance function fa. The tunable parameters minfa, the threshold for a full match, and maxfa, the threshold for a full unmatch, are entered by the user on [0, 1]. If a string pair is not a full match or unmatch, a probability value on

[minfa, maxfa] is calculated.

Equation 21: FRIL String Matching

mfa = 1, if fa(s1, s2) ≤ minfa
    = 0, if fa(s1, s2) > maxfa
    = (maxfa − fa(s1, s2)) / (maxfa − minfa), otherwise

If the distance function were edit distance, say between strings “AARON” and “ARON”, the edit distance is 1, fa is 0.2 (calculated as the edit distance divided by the longest string length 1/5 = 0.2), and if minfa is 0.5, mfa = 1.0. FRIL also allows a weight to be assigned to each attribute based on its expected value for record matching within its decision

model, referred to as attribute weighting (AW); the weight is denoted as αa ∈ [0, 1].

The last step in the decision model is record scoring (RS). The score function is:

Equation 22: FRIL Record Scoring Function

score(qi, rj) = (Σ_{a∈Φ} αa · mfa(πa(qi), πa(rj))) / (Σ_{a∈Φ} αa)

Note that π is the standard projection operator. The user can select tunable parameters maxt, the minimum score for a match, and mint, the upper limit of an unmatch condition.

The match condition M is evaluated as:

Equation 23: FRIL Match Condition

M(qi, rj) = 1, if score(qi, rj) ≥ maxt
          = 0, if score(qi, rj) < mint
          = (score(qi, rj) − mint) / (maxt − mint), otherwise
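The FRIL scoring pipeline in Equations 21 and 22 can be sketched as follows (Python, illustrative only; parameter names are hypothetical):

```python
def mfa(fa, min_fa, max_fa):
    """FRIL attribute match score from a normalized distance fa in [0, 1]:
    full match below min_fa, full unmatch above max_fa, linear in between."""
    if fa <= min_fa:
        return 1.0
    if fa > max_fa:
        return 0.0
    return (max_fa - fa) / (max_fa - min_fa)

def score(weights, attr_scores):
    """FRIL record score: weighted average of per-attribute match scores."""
    return sum(w * m for w, m in zip(weights, attr_scores)) / sum(weights)
```

For the “AARON”/“ARON” example, fa = 1/5 = 0.2, so with min_fa = 0.5 the pair is a full attribute match (mfa = 1.0), as in the text.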

The FRIL system was specifically designed to match birth certificates with a database recording birth defects. Specifically, there were 1.25 million birth certificates of children born in the time frame of 1997 to 2006 for over 12,700 cases of birth defects for the

MACDP program. The selected attributes for their linkage experiments included those listed in Table 4. The two columns contain the percentage of non-null fields in the attribute columns.


Table 4: MACDP Selected Attributes

Attribute           MACDP     Birth cert. data
Birth date baby     100.0%    100.0%
Name baby           100.0%    100.0%
Birth date mother   100.0%    100.0%
Name mother         100.0%    100.0%
Birth date father   82.0%     83.0%
Name father         82.0%     83.0%
Hospital #          100.0%    99.5%
City                100.0%    97.5%
Zip code            100.0%    99.8%
Gender baby         100.0%    100.0%

The team used precision and recall as their measure of accuracy. Table 5 below shows the results and tuned parameters for 4 experiments. During the experiments, the user tuned FRIL’s parameters to increase accuracy. The results show that varying the tunable parameters can substantially increase precision and recall.

Table 5: MACDP Results

Experiment   Precision   Recall
1            0.95        0.86
2            0.98        0.85
3            0.99        0.85
4            0.99        0.95

2.5.2 Febrl (2008) Freely Extensible Biomedical Record Linkage (Febrl) [Christen 2008a] [Christen 2008b]

[Christen and Churches 2003] is an open source RL system developed by the ANU Data

Mining Group at the Australian National University. This system has a GUI interface and recently developed advanced techniques for data cleaning and standardization, blocking, field comparison and record pair classification. The system also has some useful data exploration tools. The authors state that the system is appropriate for data sets that contain up to several hundred thousand records. The system has an impressive

56 collection of phonetic and distance-based field comparison functions, including DL, Jaro-

Winkler, NYSIIS, Double Metaphone and Soundex. It has record matching functions based on the traditional Fellegi and Sunter model [Fellegi and Sunter 1969], a supervised

Optimal Match method, K-Means and Farthest First [Goiser and Christen 2006] classifiers, a Support Vector Machine classifier [Chang and Lin 2001] and the Two Step classifier [Christen 2007]. The system also includes RL evaluation tools, such as accuracy, precision, recall, F-measure, AUC and ROC.

2.5.3 Link Plus (2007) Link Plus was developed by the Cancer Division at the Centers for Disease Control and

Prevention [Thoburn, Gu and Rawson 2007]. It was developed as a linkage tool for cancer registries to support the CDC’s National Program of Cancer Registries. The system was designed after a careful review of the literature going back to 1969 and performs probabilistic RL based on the theoretical work of Fellegi and Sunter [Fellegi and Sunter 1969] and uses the EM [Dempster, Laird and Rubin 1977] algorithm to estimate parameters from Fellegi and Sunter’s model. The system allows the user to select blocking variables to partition the set of all enumerable record pairs and decrease the search space. Link Plus uses the Soundex and NYSIIS for phonetic matching. The system has a GUI to facilitate the configuration of the matching system and to import datasets. It has a collection of matching methods [Link Plus Guide 2007]:

 Exact—character-for-character match

 Last and First Name—incorporates value-specific matching and NYSIIS and treats

hyphenated names as two individual strings

 Jaro-Winkler Metric


 Middle Initial

 SSN—Uses a partial match and transposed character match. Also has a 9-to-4-

digit conversion

 Date—Incorporates partial matching and a weighted method to provide higher

weights to better matches

 Value-Specific (Frequency-Based)—Adjusts weights for matches based on

frequencies of values. For first names, “James” is more frequent than “Alonzo”

and would pay a weight penalty. This is also done with the race value.

 Generic String—Uses the Levenshtein edit distance function for typographical

errors.

 Zip Code—Compares 9 and 5-digit Zip Codes.

Link Plus also allows the user to define special strings used to mark missing values. The system also has useful reporting tools to analyze matching results. There was no mention of experiments in the documentation to measure the quality of RL.

2.5.4 The Link King (2005)

The Link King [Campbell 2005] is an SAS/AF application for record linkage and deduplication that was developed at the Washington State Division of Alcohol and Substance Abuse. The documentation immediately states that administrative datasets have inherent discrepancies due to inconsistencies in the data. Some of the problems linking databases that may lead to incomplete or inaccurate linking from different sources are:

 Use of nicknames
 Hyphenated names
 Misspelled names
 Transposed SSN digits
 Transposed date fields

The Link King incorporates both probabilistic and deterministic record linkage methods to increase the linkage power of the application. In order to match records, the Link King requires a first name, a last name, and either an SSN or a birthdate. The system also performs gender field imputation for records with a missing gender value, using a lookup table and the first name of the entity to be matched. The system uses SAS’s Spedis function for approximate spelling match, and the Soundex and the New York State Identification and Intelligence System (NYSIIS) for phonetic matching. It also has a nickname identification method that uses a lookup table to identify names that are synonymous.

The system also has a user interface that facilitates importing data and matching. The problem with this application is that it is no longer available for download and its Website, www.the-link-king.com, is no longer active. Some of the features in this system, especially the ability to process aliases and the requirement that either an SSN or a birthdate be present for matching, are very important considerations for RL.
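The gender imputation described above amounts to a first-name lookup. The sketch below is a minimal illustration with a tiny hypothetical table (`GENDER_BY_NAME`), not The Link King's actual SAS implementation; a real table would be built from name-frequency data:

```python
# Illustrative first-name -> gender lookup table (tiny sample only).
GENDER_BY_NAME = {"MARY": "F", "JAMES": "M", "LINDA": "F", "ROBERT": "M"}

def impute_gender(first_name, gender):
    """If the gender field is present, keep it; otherwise fall back to
    the first-name lookup.  An empty result means still unknown."""
    if gender in ("M", "F"):
        return gender
    return GENDER_BY_NAME.get(first_name.upper(), "")
```

Names absent from the table simply stay unresolved, which is the safe behavior for a matching pipeline.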

2.5.5 LinkWiz (2001)

This system was developed by LinkageWiz Inc. and offered as a trial-version commercial record linkage and deduplication analysis application [LinkageWiz 2009]. There is no documentation for the system’s methodology or how the system generates matches. The Website for the system, www.linkagewiz.com, is no longer active. We only include a reference to this application because it has been mentioned in the literature.


2.5.6 Record Linkage as a DNA Problem

In [Hong, Yang, Kang and Lee 2008], the authors use the popular BLAST (Basic Local Alignment Search Tool) gene sequence alignment algorithm to perform RL. This method assumes that all the data for a record are contained in a single field, i.e. “John Doe, 110 E. Foster Ave. State College, PA, USA”. While their method is well designed and has very promising results, it may not be appropriate for RL on administrative datasets that contain demographic data. Combining all the fields into one decreases the information in the record by removing the context or meaning of the individual fields. This method seems very appropriate for problems such as comparing references and keyword lists.

2.5.7 EARL: an Evolutionary Algorithm for Record Linkage

In [Cusmai, Aiello, Scannapieco and Catarci 2008], the authors formulate RL as a multi-objective optimization problem and introduce an evolutionary programming based solution. They comment that probabilistic models have problems working with large datasets and that knowledge-based techniques rely on the presence of training data. They state that evolutionary algorithms are appropriate because they are stochastic, adaptive and intrinsically parallel in their search space. The process is comprised of two steps: formulation of the initial population and application of the Non-dominated Sorting Genetic Algorithm-II (NSGA-II). The strengths of this algorithm are that it is a one-shot algorithm with a linear-complexity runtime and that it does not depend on prior knowledge, such as the probabilities in probabilistic models or the training sets in knowledge-based models. One problem is that for any given record q ∈ Q it only allows one match r ∈ R where q = r, allowing only one match between two datasets. For many administrative datasets, this may not be the case. This problem will be revisited in the Future Directions section.


2.5.8 Swoosh: a generic approach to entity resolution

In their paper [Benjelloun, Garcia-Molina, Menestrina, Su, Whang and Widom 2008], the authors discuss new methods that they have developed for RL, but more importantly they discuss some important mathematical properties:

 For a domain of records R,
 an instance I = {r1, …, rn} is a finite set of records from R, i.e. a set of records on which to perform match and merge functions,
 a Boolean match function M over R × R determines whether two records match the same real-world entity,
 a partial merge function µ maps R × R into R to aggregate matched records.

They define the merge closure of I, denoted Ī, as the smallest set of records S such that I ⊆ S and, for any records r1, r2 ∈ S, if r1 ≈ r2 then ⟨r1, r2⟩ ∈ S. They determine that Ī exists and is unique. They extend a partial preorder (it is reflexive and transitive), or domination, to instances by definition: given two instances I1, I2, we say that I1 is dominated by I2, denoted I1 ≼ I2, if ∀r1 ∈ I1, ∃r2 ∈ I2 such that r1 ≼ r2. An entity resolution of I is a set of records I′ such that I′ ⊆ Ī, Ī ≼ I′ and no strict subset of I′ satisfies both of these conditions. They determine that an entity resolution I′ exists and is unique.

Their new methods are modeled by mathematical properties that they refer to as ICAR and merge dominance. The basis for ICAR is:

1. Idempotence: ∀r, r ≈ r and ⟨r, r⟩ = r. A record always matches itself, and merging it with itself still yields the same record.

2. Commutativity: ∀r1, r2, r1 ≈ r2 iff r2 ≈ r1, and if r1 ≈ r2, then ⟨r1, r2⟩ = ⟨r2, r1⟩.

3. Associativity: ∀r1, r2, r3 such that ⟨r1, ⟨r2, r3⟩⟩ and ⟨⟨r1, r2⟩, r3⟩ exist, ⟨r1, ⟨r2, r3⟩⟩ = ⟨⟨r1, r2⟩, r3⟩.

4. Representativity: if r3 = ⟨r1, r2⟩, then for any r4 such that r1 ≈ r4, we also have r3 ≈ r4.

Merge dominance ≤ is an order on records under which the following monotonicity conditions hold:

(A) for any records r1, r2 such that r1 ≈ r2, it holds that r1 ≤ ⟨r1, r2⟩ and r2 ≤ ⟨r1, r2⟩, i.e. a merged record always dominates the records it was derived from;

(B) if r1 ≤ r2 and r1 ≈ r, then r2 ≈ r, i.e. the match function is monotonic;

(C) if r1 ≤ r2 and r1 ≈ r, then ⟨r1, r⟩ ≤ ⟨r2, r⟩, i.e. the merge function is monotonic;

(D) if r1 ≤ s, r2 ≤ s and r1 ≈ r2, then ⟨r1, r2⟩ ≤ s.

The authors propose that if merge dominance, which is a canonical domination order, exists and M and µ meet the monotonicity requirements above, then ≤ can be substituted for ≼. If M and µ are reflexive and commutative, then the ICAR properties are also satisfied. The match function M may not be transitive, in that for r1 ≈ r2 and r2 ≈ r4 it is possible that r1 ≉ r4. The important finding is that matching records r1 ≈ r2 can be merged into ⟨r1, r2⟩ and that, by aggregating records, bridges can be made to record matches that would otherwise have been missed, because it may be the case that r3 ≈ ⟨r1, r2⟩ and r3 ≈ r4. This means that it is possible to force a transitive closure on the RL process, which is very desirable according to [Benjelloun, Garcia-Molina, Menestrina, Su, Whang and Widom 2008] and [Dong, Halevy and Madhavan 2005]. The authors acknowledge that redundancy in matching is a problem in RL, and their new methods show that it can be eliminated by using bookkeeping to track previous comparisons that did not match, which is also desirable and makes the process much more efficient.
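A naive sketch of such a match-merge fixpoint (our own illustration, not the authors' optimized R-Swoosh algorithm) shows both ideas: merging matched records and bookkeeping failed comparisons so they are never repeated. Here `match` and `merge` stand in for M and µ, records are assumed hashable, and the greedy merging is only guaranteed to find the unique entity resolution when the ICAR properties hold:

```python
def entity_resolve(records, match, merge):
    """Repeatedly merge any pair of matching records until no pair
    matches.  Pairs already compared without a match are cached (the
    bookkeeping described above) so they are never compared twice."""
    result = set(records)
    no_match = set()                      # pairs known not to match
    changed = True
    while changed:
        changed = False
        for r1 in list(result):
            for r2 in list(result):
                if r1 == r2 or frozenset((r1, r2)) in no_match:
                    continue
                if match(r1, r2):
                    result.discard(r1)
                    result.discard(r2)
                    result.add(merge(r1, r2))   # replace pair with merge
                    changed = True
                    break
                no_match.add(frozenset((r1, r2)))
            if changed:
                break
    return result
```

With records represented as frozensets of (field, value) pairs, a match function of "share at least one value" and merge as set union, two overlapping records collapse into one compound record while unrelated records survive untouched.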

2.6 Similar Work in Other Domains or with Differing Methodologies

The work discussed in this section is within the Record Linkage field but differs from our research. It is discussed solely to show that RL is applicable to other problems and has other modalities. Some of the work below postdates our work (2009 for the first version of ALIM) and is mentioned both as support and to differentiate our work from other work that may seem superficially similar.

Temporal Data Handling

In [Li, Dong, Maurino and Srivastava 2011], the authors discuss the existence of temporal data in terms of DBLP authors. The data contain the name of the author, co-authors, affiliation and the year of publication. The year is used as a timestamp, and the authors apply time decay and clustering. Time decay decreases the indication that two records match as the amount of time between them increases. Their temporal clustering compares the records to match authors. Our temporal data do not include a timestamp. However, they include all data that have changed over time (i.e. all addresses, phone numbers, etc.). Our solution aims to leverage this temporal data to increase match efficiency and effectiveness by aggregating all known data about an entity and matching the aggregations between entities.

Internet Data Sources

In [Bleiholder, Khuller, Naumann, Raschid and Wu 2006], the authors discuss the topic of querying life science sources on the Web. They state that the data is an “interconnected…large and complex graph, and there is an overlap of objects in the sources.” The WWW is basically a graph of data, so it makes sense to treat this as a graph traversal problem. Our data are not stored on the Web and cannot be traversed in tuple form. Our proposed solution uses a graph to model the data and creates aggregate records to handle the temporal data problem. Traversal in ALIM involves the relationships between data and records, not just a path from source to source. In [Bleiholder, Naumann 2008], the authors discuss data fusion for Internet sources. They mention that the challenges of data fusion are “uncertain and conflicting data values” and that its goals are “complete, concise and consistent data”. Our data have significant missing values, conflicting values from temporal changes, data entry errors and multiple entries for many entities. Complete, concise and consistent are unreachable goals with these data.

In [Rezig, Dragut, Ouzzani and Elmagarmid], the authors use an iterative caching approach to match online records. It uses the Bed-tree index to approximately match string data.

Use of Graphs

[Ravikumar and Cohen 2004] and [Kardes, Konidena, Agrawal, Micah Huff and Sun 2013] also use graphs to facilitate record linkage. However, these methods use graphs in the decision model, not in the data model. The ALIM insert method converts data tuples into a data graph. This allows for temporal multi-data approximate record matching. In [Wenfei, Fan, Tian, Dong 2015], the authors use graphs with vertices (data) and keys (edges) to find pairs of vertices that match. Vertices might be “Led Zeppelin”, “Beatles”, “Robert Plant” and “Hey Jude”, and keys might be recorded_by and member_of. While this establishes data relations, it is not temporal. This scheme is somewhat similar to our record relation methodology, which forms relations such as parent_of and sibling_of. In [Lunen 2012], the author uses graph mining to link sparse data for travel records by using places as vertices and people and dates as edges.

Summary of New Developments

Table 5 below shows features of the methods in this comparison with respect to similarities and differences. The table shows considerable differences between ALIM and all other methods. These methods were designed for disparate purposes and share little in their structure or methodology. The primary purpose of ALIM is to leverage temporal data to increase the confidence of record linkage by linking data that are already known to match within constituent databases. The primary contribution of ALIM is the proposition that temporal data should be organized into graph-based multi-data records and matched in an iterative manner to increase the effectiveness of record linkage. The following choices were made for ALIM after an in-depth study of real data from 11 databases for a local health department.

• We use a graph data model to represent and process temporal, semi-structured, evolving data.

• We choose a blocking-free approach because, for every match decision, we need a discrete step-by-step reason for the match.

• Internet data sources were not considered because our data were structured database data that evolved into semi-structured data over time, whereas Internet data are unstructured. Our work can be extended to unstructured data, but this was not our original intention.


Table 5: Methods vs Features

The table compares seven methods (ALIM (2009); Bleiholder, Khuller (2006); Li, Dong (2011); Kardes, Konidena (2013); Lunen (2012); Ravikumar and Cohen (2004); and Wenfei, Fan (2015)) on the following features: temporal data handling, use of graphs, database data aggregation, deduplication, online data, edit-distance optimization, alias matching, and use of blocking or MapReduce. Among these, only ALIM processes temporal data without requiring a timestamp; uses graphs in the data model to relate data and records; aggregates compound records within databases to increase match confidence; links between databases and deduplicates at the same time; uses Probabilistic Signature Hashing with Damerau-Levenshtein edit distance to match minor spelling errors; stores alias relations in the graph, loaded from an alias list; and eliminates redundant data processing. The other methods each cover only a subset of these features, as discussed above.


PART II

II FINDINGS


CHAPTER 3

3 INTRODUCTION TO FINDINGS

We evaluated samples of a real health and social sciences database and RL system, currently used by a local health department, and found that the data had some interesting properties. In our study, we identified two primary tasks required of the RL system: real-time searches to provide reports about client histories, usually for a caseworker or investigator, and bulk matching of client lists, usually for administrative research or cohort studies. We have also found that many social scientists would like these systems to perform not only person-based but also family-based Record Linkage and social network analysis.

3.1 Evaluation of the CARES System

The City of Philadelphia Department of Health and Opportunities currently uses a system called CARES to perform RL operations across 11 social sciences related databases. The CARES matching algorithm is used to determine if two or more records retrieved from different data sources actually correspond to the same real-world entity. This is critical in several important applications. The general method, within the domain, for determining whether two records r1 and r2 can be matched includes two steps. First, values in corresponding attributes/fields from r1 and r2 are matched. Typically, for each attribute α, a similarity (or distance) between r1α and r2α is computed. Second, the similarities between value pairs under all attributes are aggregated to determine whether r1 and r2 are matched. The first step is accomplished by employing various value matching functions, most of which are approximate string matching functions such as edit-distance, n-gram, Jaro-Winkler and some phonetic-based functions [Elmagarmid, Ipeirotis and Verykios 2007]. The second step is carried out by employing record matching rules that can be derived manually, statistically, or by machine learning and/or classification techniques based on sample matched and unmatched record pairs. There have been many papers published about the entity identification problem (e.g., [Thor and Rahm 2007] [Wang and Madnick 1989] [Shu and Meng 2009] [Shen, DeRose, Vu, Doan and Ramakrishnan 2007] [Shu, Meng and Yu 2007]) and a general survey of these methods can be found in [Elmagarmid, Ipeirotis and Verykios 2007].
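The generic two-step method can be sketched as follows. This is an illustrative sketch only: `difflib`'s similarity ratio stands in for the edit-distance or Jaro-Winkler comparators used in real systems, and the weights and threshold are hypothetical tuning parameters, not values from CARES or any cited system:

```python
import difflib

def field_sim(v1, v2):
    """Step 1: per-field similarity in [0, 1].  Missing values
    contribute no evidence either way."""
    if not v1 or not v2:
        return 0.0
    return difflib.SequenceMatcher(None, v1.lower(), v2.lower()).ratio()

def records_match(r1, r2, weights, threshold):
    """Step 2: aggregate per-field similarities into one decision by
    a weighted sum compared against a threshold."""
    score = sum(w * field_sim(r1.get(f, ""), r2.get(f, ""))
                for f, w in weights.items())
    return score >= threshold
```

In practice the weights and threshold would be derived manually, statistically, or by the learning techniques mentioned above.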

The current CARES matching system consists of the following two steps:

1. Field comparisons:
   a. Exact string and date match.
   b. Approximate phonetic attribute matching (Soundex) for names, and sum of digits for Social Security Number matching given 5 matching digits.

2. Field-by-field scoring system for record matching, similar to the previously described system.

The CARES system uses the following demographic data to match records:

1. First Name
2. Last Name
3. Address
4. Phone Number
5. Gender
6. Birthdate
7. Social Security Number

Figure 21: CARES Demographic Fields

Table 6 below shows the non-missing data for one of the databases that are processed by CARES. Notice that 46% of SSNs, 47% of addresses and 58% of phone numbers are missing.

Table 6: CARES Missing Data

In our preliminary experiments, we found that the Soundex had inherent problems with errors and inconsistencies in the data. If spelling errors or inconsistencies affect critical characters, characters that affect the generation of the Soundex code, this method completely fails to identify approximate matches. If “JAMES” is misspelled as “KAMES” (‘J’ and ‘K’ are adjacent on the keyboard), their Soundex codes are “J520” and “K520”, respectively. Because Soundex matching only considers exact codes to match, a single character error can foil it for approximate spelling matching. There is a significant difference between the two problems of identifying spelling errors and identifying phonetic matches; both may be present in the data. Aliases are also a problem for the Soundex. Consider that “JAMES” and “JIM” are considered synonymous. Their Soundex codes are “J520” and “J500”, which also do not match. We also found that the sum of digits was not a good approximate match for Social Security Numbers. This method is sometimes used because transposition errors are common with SSNs. We performed a histogram analysis of the sums of SSNs and found a high potential for false positive matches. Consider that “170-44-9887” and “288-79-7007” have the same digit sum (48) and are completely different SSNs.
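Both failure modes are easy to reproduce. The sketch below is our own illustrative code (a standard American Soundex and a digit-sum comparison), not the CARES implementation:

```python
def soundex(name):
    """Standard American Soundex: first letter plus three digit codes."""
    codes = {**dict.fromkeys("BFPV", "1"), **dict.fromkeys("CGJKQSXZ", "2"),
             **dict.fromkeys("DT", "3"), "L": "4",
             **dict.fromkeys("MN", "5"), "R": "6"}
    name = name.upper()
    out, prev = name[0], codes.get(name[0], "")
    for ch in name[1:]:
        code = codes.get(ch, "")
        if code and code != prev:       # skip repeats of the same code
            out += code
        if ch not in "HW":              # H and W do not reset the run
            prev = code
    return (out + "000")[:4]

def ssn_digit_sum(ssn):
    """Sum of digits, the weak comparator discussed above."""
    return sum(int(d) for d in ssn if d.isdigit())

# "JAMES" -> J520 but "KAMES" -> K520: one keystroke breaks the match.
# "170-44-9887" and "288-79-7007" both sum to 48: a false positive.
```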

We discuss our findings from the CARES evaluation and literature review below. For a more complete description of the preliminary CARES findings, please see [Jupin Prelim 1 2011].

3.2 Temporal Data and Compound Records

A very important property of the RL problem is that a client may not have only one record containing their demographic data; many clients have many records with differing demographic data throughout their lives. A synthetic example is shown below in Table 7. Every time a client received services and the client’s demographic information had changed, a new tuple may have been created. Notice that the first SSN differs from the rest. This is because SSNs are not issued at birth, and the agency providing the service issued its own temporary ID for the SSN because the SSN is used as the client ID. Every time a client moves to a new address, changes a phone number, gets married or divorced, or uses an alias (i.e. “Jim” for “James”), the client’s demographic data changes in the system.


Table 7: Data Records for Mary

Num FirstName LastName  Address             Zip    Phone         Birth     SSN          Gender
1   Mary      Smith     123 REED ST         19147  215-123-4567  1/1/1970  999-34-5678  F
2   Mary      Smith     123 REED ST         19147  215-123-4567  1/1/1970  123-45-6789  F
3   Mary      Smith     1326 S HANCOCK ST   19147  267-222-9876  1/1/1970  123-45-6789  F
4   Mary      Johnson   114 MORRIS ST       19148  267-222-9876  1/1/1970  123-45-6789  F
5   Mary      Williams  114 MORRIS ST       19148  215-543-6789  1/1/1970  123-45-6789  F
6   Mary      Williams  3604 POWELTON AVE   19104  215-543-6789  1/1/1970  123-45-6789  F
7   Mary      Smith     15146 ENDICOTT ST   19116  215-543-6789  1/1/1970  123-45-6789  F
8   Mary      Smith     15146 ENDICOTT ST   19116  267-987-6543  1/1/1970  123-45-6789  F
9   Mary      Brown     1033 E CHELTEN AVE  19138  267-987-6543  1/1/1970  123-45-6789  F

In the real data, 46% of SSNs and 24% of birthdates were missing. If the Birth and SSN columns are removed from Table 7, the last record (9th) matches the 8th on only 3 fields (FirstName, Phone and Gender). It matches the rest on only 2 fields (FirstName and Gender). Neither condition gives much confidence to a match even though these records belong to the same entity. This is a disaster for blocking records. If the blocking keys are “the first 3 characters of the last name and the Zip code”, the example above is scattered across 6 blocks for the same entity record set. It should also be considered that data entry errors in blocking keys will decrease their effectiveness. [Ravikumar and Cohen 2004] mention that databases frequently contain multiple records that refer to the same entity. In [Li, Dong, Maurino and Srivastava 2011], the authors also note that data for entities can change over time. Their method uses time decay and temporal clustering techniques to link temporal records. This condition is found very frequently in our real data. However, we do not compare our proposed solution to this system because our dataset does not have timestamp data.

When the term “Record Linkage” was first introduced by Halbert L. Dunn in 1946 [Dunn 1946], he described it in terms of taking the pages or records of a person’s life and assembling them into a single volume or “Book of Life”. It may be the case that a social worker is attempting to use an RL system to create a summary of a client’s whole life.


Table 8 shows the aggregation of data in Table 7 with redundant data removed. This compound record will match any individual record from Table 7 if all fields are used in the match. An effective RL model would exploit this property because it makes best use of the data to increase match quality. If each record is a page, the compound record is a chapter of someone’s life experiences within a member agency within the enterprise and given the situation, it makes sense to assemble the chapters from the pages before assembling the volume.

Table 8: Aggregated Data Record for Mary

FirstName  Mary
LastName   Brown, Johnson, Smith, Williams
Address    1033 E CHELTEN AVE, 114 MORRIS ST, 123 REED ST, 1326 S HANCOCK ST, 15146 ENDICOTT ST, 3604 POWELTON AVE
Zip        19138, 19148, 19147, 19116, 19104
Phone      215-123-4567, 215-543-6789, 267-222-9876, 267-987-6543
Birth      1/1/1970
SSN        123-45-6789, 999-34-5678
Gender     F
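The aggregation from per-event tuples to a compound record can be sketched as a field-to-value-set mapping. This is an illustrative sketch, not ALIM's actual insert method:

```python
from collections import defaultdict

def aggregate(records):
    """Collapse a list of tuples (dicts) for one entity into a single
    compound record: one set of distinct values per field.  Redundant
    values disappear automatically; missing values are skipped."""
    compound = defaultdict(set)
    for rec in records:
        for field, value in rec.items():
            if value:
                compound[field].add(value)
    return dict(compound)
```

Applied to the nine tuples of Table 7, this yields exactly the shape of Table 8: one Mary, four last names, six addresses, and so on.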

3.3 Data Quality

Considering the demographic fields in Table 8, it should be noted that records contain both mutable and immutable data. Fields that contain the SSN, birth date and gender should not change over time but may contain multiple distinct values due to data entry errors and inconsistencies. Another property that can be used to increase performance is to pre-disqualify records that have too much missing data. Consider: a record is unmatchable if it cannot match itself under a given match rule. If the record cannot match itself, it will not match any other record.
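The pre-disqualification test follows directly from the definition: apply the match rule to the record paired with itself. The rule below is an illustrative example in the spirit of The Link King's requirement (names present plus either an SSN or a birthdate), not a rule from CARES:

```python
def is_matchable(rec, match_rule):
    """A record that cannot match itself under the rule can never match
    any other record, so it can be skipped entirely."""
    return match_rule(rec, rec)

def rule(r1, r2):
    """Hypothetical rule: both names present, plus SSN or birthdate."""
    return (bool(r1["first"] and r2["first"]) and
            bool(r1["last"] and r2["last"]) and
            bool((r1["ssn"] and r2["ssn"]) or (r1["birth"] and r2["birth"])))
```

Filtering on `is_matchable` before pairwise comparison removes hopeless records in a single linear pass.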


The RL system should be able to match commonly known aliases [Campbell 2005] because they are commonly used. The names “James”, “Jim”, “Jimmy” and “Jamie” should be considered synonymous. In the explored real dataset, even the gender field has errors and inconsistencies, in some cases outright contradictions: some entities are recorded as both male and female. This condition should be treated the same as if the gender were missing. It is probable that some people use nicknames instead of proper names. Someone named Robert may use Rob, Bob or Bobby. The characteristics of nicknames differ based on ethnicity or culture. Western European nicknames are likely derived from the proper name (i.e. Bob for Robert and Pat for Patricia). Hispanic nicknames (called apodos) are more likely to be based on profession, affection or physical characteristics [Mexican Nicknames]. The relation between proper names and nicknames is not one-to-one and onto. The nickname Ron is shared by the proper names Aaron, Cameron, Ronald and Veronica. Other nicknames for Aaron are Erin and Ronnie. Even the use of a nickname look-up table does not ensure perfect accuracy, but it can potentially increase matching accuracy if properly incorporated into a record matching system. Simple string comparators will likely miss the relationships that exist between proper names and nicknames. A partial list of 1230 proper-nickname pairs was constructed from various sources on the Internet for testing. This can be used as a look-up table to connect nicknames with proper names and as a method to substitute proper names for nicknames.
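A minimal sketch of such a look-up table shows why the relation must be treated as many-to-many. The pairs below are a tiny illustrative excerpt, not the actual 1230-pair table:

```python
# Illustrative excerpt of (proper name, nickname) pairs.
PAIRS = [("ROBERT", "BOB"), ("ROBERT", "ROB"), ("ROBERT", "BOBBY"),
         ("AARON", "RON"), ("AARON", "ERIN"), ("RONALD", "RON"),
         ("VERONICA", "RON"), ("PATRICIA", "PAT")]

def proper_names(nickname):
    """One nickname can point back to several proper names, so return
    the whole set rather than a single substitution."""
    return {p for p, n in PAIRS if n == nickname.upper()}
```

Looking up "Ron" returns Aaron, Ronald and Veronica at once, so a matcher must score each candidate rather than blindly rewrite the field.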

It has been found that approximately 2% of data become obsolete within 1 month of entry [Eckerson 2002].


3.4 SSNs as Universal Identifiers

As mentioned, SSNs may be missing from a significant number of records. Even if all SSNs are present, there are critical weaknesses in the SSN’s design and use as a Universal Identifier (UID). In [Daechsel 1979a], the author outlines 5 necessary characteristics of a UID:

1. Uniqueness
2. Permanence
3. Universality
4. Availability
5. Economy

The author of [Kuch 1972] adds 2 more characteristics:

6. Self-checking
7. Memorability

Each UID must be unique to each individual and each individual must have only one UID. Though the SSN is intended to be unique, there have been more than 1 million instances of a person having multiple SSNs, and the same SSN has been issued to more than one individual in a few instances [Daechsel 1979b]. A UID must be permanent throughout an individual’s life and must be universally recognized; the SSN is both permanent and universally recognized. A UID must be available, meaning easy to access and communicate. The SSN is mostly available, but not if an individual loses their card or intentionally reports a false SSN. A UID must have little cost in issue and in use. The SSN is fairly economical at 9 digits. The SSN is absolutely not self-checking: there is no way to validate an SSN without accessing an external database. Consider that approximately 70% of randomly generated 9-digit numbers are valid SSNs [Kuch 1972] by the Social Security Administration’s rules for issuing them. A UID should be easy to remember. SSNs are not necessarily memorable, but most individuals can memorize their own SSN given a little time. Given the weaknesses of the SSN, it is reasonable to assume that there may be errors in client-reported SSNs, and that data entry errors and missing data further complicate its use. Even though there is a high rate of missing SSNs in health and social sciences databases, the National Center for Health Statistics (NCHS) does consider it to be an important identifier. NCHS considers records to be linked if they have the same SSN, birthdate and gender [NCHS 2010].
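Although the SSN has no check digit, its structural issuance rules can still reject obviously invalid values. The sketch below encodes the pre-2011 rules as we understand them (area 001-899 excluding 666, group 01-99, serial 0001-9999); it is an assumption on our part, not an SSA-provided validator, and it can only reject implausible values, never confirm a real one:

```python
def plausible_ssn(ssn):
    """Structural plausibility check only.  Rules assumed (pre-2011
    issuance): area 001-899 excluding 666, group 01-99, serial
    0001-9999.  A True result does NOT mean the SSN was ever issued."""
    digits = ssn.replace("-", "")
    if len(digits) != 9 or not digits.isdigit():
        return False
    area, group, serial = int(digits[:3]), int(digits[3:5]), int(digits[5:])
    return 1 <= area <= 899 and area != 666 and group >= 1 and serial >= 1
```

Because most random 9-digit numbers pass such a check (the roughly 70% figure cited above), it is useful only as a cheap pre-filter, not as validation.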

3.5 Transitive closure

The transitive property suggests that records may act as bridges to find more matches for a given entity. This idea is found in [Bhattacharya and Getoor 2004] and [Dong, Halevy and Madhavan 2005]. The authors suggest that a transitive closure can be imposed on data records that should match by using an iterative scheme that merges matching records and re-matches them to increase the likelihood of finding additional matched records that would otherwise have been ignored. This property has also been acknowledged in [Campbell 2005] but is handled there by joining the match results: if record ri matches record rj and record rj matches record rk, then record ri is assigned to match record rk. The iterative scheme is more powerful because it can use multi-record compound data to find matches that would otherwise be ignored. If record ri matches record rj but neither ri nor rj matches rk, the composite record rirj (ri and rj’s data combined into a compound record, rirj = ri ∪ rj) may match rk. This is important because the data in these demographic records change over time; it is actually a temporal data problem.
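The bridging effect can be demonstrated with multi-valued records and a simple hypothetical rule: match if at least two fields share a value (the threshold and field names are illustrative, not from any real system):

```python
def fields_shared(r1, r2):
    """Count fields on which two multi-valued records share a value."""
    return sum(1 for f in r1 if f in r2 and r1[f] & r2[f])

def union(r1, r2):
    """Combine two records into a compound record (set union per field)."""
    return {f: r1.get(f, set()) | r2.get(f, set())
            for f in set(r1) | set(r2)}
```

Below, ri and rj match each other (first and last name), rk matches neither alone (one shared field each), yet rk matches the compound record ri ∪ rj, because the compound carries ri's phone and rj's address at once.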

3.6 Family-Based Social Networks and Matching Context

[Bhattacharya and Getoor 2004] state that by examining the context of a record, or the relationships that a record has with other records, the accuracy of the linkage can be increased. Consider a match between any two of the records shown for Mary in Table 7. If in both records Mary has a husband, John, then we can have higher confidence that the two records match. If one record has a husband, John, and the other does not, but some of the husband’s data in the record with the relationship matches the other record for Mary, we can also have higher confidence. We can use wives, mothers, fathers, sons, daughters, sisters, brothers, cousins and grandparents to increase RL accuracy. [Ananthakrishna, Chaudhuri and Ganti 2002] and [Bilenko and Mooney 2003] have also taken this approach to RL.


3.7 The Soundex as a Phonetic Match

The major problems with the Soundex are sensitivity to single-key data entry errors and high false positive and false negative rates, as shown in the following two genealogy papers. In [Stanier 1990], the Soundex missed 25% of true positives and had a 66% false negative rate. [Lait and Randell 1996] reports a 60% false negative rate, with only 37% of true positives found. Many database management systems include the Soundex as an approximate string matching algorithm. The benefit is that it is computationally economical, with a low computational complexity. The cost is that it is not very accurate. Our research found that the Soundex is extremely susceptible to single-key data entry errors when they occur on a critical character, one that is used to generate the 4-character Soundex code. Even though the Soundex has some real issues, it is still used as a phonetic match algorithm. As previously mentioned, many database management systems, such as Oracle, SQL Server and MySQL, use the Soundex as a phonetic match algorithm.

3.8 Edit-Distance as a String Comparator

As mentioned, edit distance has an advantage for effective string similarity measurement over other metrics because it considers the ordering of characters and allows nontrivial alignment [Xiao, Wang and Lin 2008]. Because of its ability to detect transposition errors, the Damerau-Levenshtein edit distance algorithm can detect approximately 80 percent of manual data entry errors [Damerau 1964]. Combining Prefix Pruning [Wang, Feng and Li 2010] with the Damerau-Levenshtein algorithm can increase performance for string similarity measurement by restricting the problem search space [Gusfield 1997] and by using a threshold value k to force early termination of a comparison when the edit distance can no longer meet the threshold, while still producing the same results as Damerau-Levenshtein [Sakoe and Chiba 1990].
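A sketch of the thresholded comparison is shown below: the optimal string alignment variant of Damerau-Levenshtein with a length filter and row-minimum early termination. This is our own illustration of the general technique, not the PSH implementation described later:

```python
def dl_within(s, t, k):
    """Damerau-Levenshtein (optimal string alignment) distance, or None
    if it exceeds threshold k.  The early exit fires when every cell in
    a row is above k, since no later row can come back under it."""
    if abs(len(s) - len(t)) > k:             # length filter
        return None
    prev2, prev = None, list(range(len(t) + 1))
    for i in range(1, len(s) + 1):
        cur = [i] + [0] * len(t)
        for j in range(1, len(t) + 1):
            cost = 0 if s[i-1] == t[j-1] else 1
            cur[j] = min(prev[j] + 1,        # deletion
                         cur[j-1] + 1,       # insertion
                         prev[j-1] + cost)   # substitution
            if (i > 1 and j > 1 and s[i-1] == t[j-2]
                    and s[i-2] == t[j-1]):   # adjacent transposition
                cur[j] = min(cur[j], prev2[j-2] + 1)
        if min(cur) > k:                     # early termination
            return None
        prev2, prev = prev, cur
    return prev[len(t)] if prev[len(t)] <= k else None
```

For strings within the threshold the result equals the unrestricted distance; for strings beyond it, most of the dynamic programming table is never computed.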

3.9 CARES Final Report

The CARES final report [Jupin and Shi 2011] implemented a prototype graph-based RL system, originally called Aggregate-Link-Match (ALM) and renamed Aggregate, Link and Iterative Match (ALIM) Prototype 1. ALIM was found to find more true positive matches and to run approximately 12.5 times as fast as the original CARES method, currently in use by the City of Philadelphia Department of Health and Opportunities. The old prototypes are briefly mentioned in the Current Progress section, and the latest prototype of ALIM is discussed in the Proposed Methodology section.


PART III

III METHODOLOGIES, EXPERIMENTS AND RESULTS


CHAPTER 4

4 INTRODUCTION TO METHODOLOGIES, EXPERIMENTS AND RESULTS

We have developed improvements in two interrelated domains: string comparison and record linkage. Our latest development in string comparison is a method called the Probabilistic Signature Hash (PSH). This method combines the power of a collection of methods to find strings that approximately match and is mathematically proven to produce exactly the same results as the Damerau-Levenshtein (DL) edit distance. PSH is also experimentally shown to be up to 6,000 times faster than DL. Our latest development in RL is Aggregate, Link and Iterative Match (ALIM) prototype 3. This in-memory, graph-based RL system uses PSH for fast approximate matching and produces better match results faster than a system currently used by a local health department. The sequence of developments for each system is described below.

We also discuss the first two ALIM prototypes and their experiments. Our third optimization uses a probabilistic approach to create a more efficient hash function.


CHAPTER 5

5 STRING DISTANCE OPTIMIZATION

We have performed research and development on DL edit-distance optimization. We describe a sequence of three methods that successively improve performance. Our first string comparison optimization is the Fast Bitwise Filter (FBF), which is based on the filter-and-verify approach and achieved a 130 times speedup over DL when combined with a length filter (LFPDL, described below). The second method is the Hybrid Bitwise Signature Hashing with Bitwise Neighborhood Generation, Length Filter, FBF and PDL composite (GLFPDL, described below), which combines hashing, filtering and pruning to obtain computational speedups of up to 3,000 times over DL. The third and latest method is the Probabilistic Signature Hash, which combines GLFPDL with a probability-optimized hash function; this decreases the number of entries in hash buckets, decreasing the number of candidate strings and increasing the speedup to 6,000 times that of DL.

5.1 First Optimization Technique: Fast Bitwise Filter

Our first development for optimizing DL edit-distance is called the Fast Bitwise Filter (FBF) [Jupin, Shi and Obradovic 2012]. The FBF is a filter-and-verify method that substantially increases the speed of approximate string matching using edit distance. This method has been found to be almost 80 times faster than Damerau-Levenshtein edit distance (130 times when combined with a length filter) and preserves all approximate matches, meaning no false negatives. Our method creates compressed signatures for data fields and uses Boolean operations and an enhanced bit counter to quickly compare the distance between the fields. This method is intended to be applied to data records whose fields contain relatively short strings, such as those found in most demographic data. Without loss of accuracy, the proposed Fast Bitwise Filter provides substantial performance gains to approximate string comparison in database, record linkage and deduplication data processing systems.

The Venn diagrams in Figure 22 below show string comparison using the set concept for some sample last names. The left diagram shows a deletion error for “SMITH” as “SMIT”, which results in one character outside the intersection. The center diagram shows the substitution error for “SMITH” as “SNITH”, a single edit difference that results in two characters outside of the intersection. The right diagram shows that the strings “SMITH” and “JONES” have eight characters outside the intersection and that these strings are significantly different.

Figure 22: Venn diagrams on names

Our optimization has two parts: generating FBF signatures and filtering the signatures.

The FBF method takes advantage of a computer's ability to perform logical and arithmetic operations on unsigned integers very quickly [Microsoft MSDN]. The idea is that the string's filter signature is compressed into 32-bit unsigned integers, which have sufficient capacity to contain a checklist of numeric and alphabetic characters in bits, as shown below in Figure 23 and Figure 24.

The signature x for string s is a checklist of a subset of the characters in s, where bit x_0 = 1 iff 'A' ∈ s, bit x_1 = 1 iff 'B' ∈ s, and so on. A 32-bit integer is large enough to record one occurrence of every letter of the alphabet (A to Z) in a string, or one to three occurrences of every digit (0 to 9), as shown in the figures below. The signatures require some storage for each string, but they can be created very quickly. The unused bits can store extra information about the string (e.g., "Does any character in the string occur more than 2 times for an alphabetic string?" or "Are 2 of the same character juxtaposed?").

------ZY XWVUTSRQ PONMLKJI HGFEDCBA
00000000 00001100 00010001 10000000

Figure 23: 32-bit alphabetic FBF bit signature for “SMITH”

--999888 77766655 54443332 22111000
00000001 00000011 10000000 11011011

Figure 24: 32-bit numeric FBF bit signature for “8005551212”

Consider a function f() that generates binary signatures, with x = f(s) and y = f(t), and a function F(x, y) that compares signatures. The statement shown below demonstrates the use of FBF. We prove below that all string pairs (s, t) that differ by k edits are contained in the set of pairs whose signatures differ by at most 2k bits.


if (F(x, y) ≤ 2k AND PDL(s, t, k) ≤ k), M ← M ∪ {(s, t)}

Equation 24: Composition of FBF and PDL

where M is the set of matched string pairs. Pruning further decreases the computation for DL by comparing prefixes or suffixes. Prefix pruning eliminates string pairs that differ by more than k characters when compared from the beginning to the end of the respective strings, and suffix pruning from the last character to the first. The comparison stops with an early termination when the k difference threshold can no longer be met. We developed our own flavor of the Prefix Pruning method that is combined with DL (PDL, shown in Algorithm 4) and a more limited search space to increase edit-distance performance.

For SSNs, birthdates and phone numbers, one 32-bit integer should be sufficient for an FBF signature. This is because, like edit distance, FBF measures the difference between strings. If one of the compared numeric strings has many repeated characters, say a phone number “213-333-3333”, the signature will only record three of the 3s. If the other compared number has n fewer 3s, then there are n different characters that will be revealed when the signatures are compared, as described in Alg. 8. The FBF difference between “213-333-3333” and “213-333-4444” would be 3, because three of the 4s would be recorded. An FBF comparison is a fast approximation of edit distance. An FBF signature need only contain a subset of the characters in its string to be effective. To record more than one occurrence of each character in alphabetic strings, simply add integers to the signature: a two-integer vector can record two occurrences of each alphabetic character.

5.1.1 Generating FBF Signatures

Algorithm 6, SetAlphaBits(s, l), shows the process for generating an l-length vector of signatures that counts up to l occurrences of each character in a string s containing only alphabetic characters. The algorithm, as implemented in code, should ignore case and will ensure an accurate count of every letter of Σ = {A, B, C, …, X, Y, Z} that occurs in s. Upon completion of SetAlphaBits(s, l), the following condition holds: bit c of x_j is 1 if and only if the letter Σ_c occurs at least j + 1 times in s, for 0 ≤ c < |Σ| and 0 ≤ j < l. Note that ≪ below represents the bit-shift operator.

Algorithm 6: SetAlphaBits(s, l)
Input:  s: string of characters
        l: length of the signature vector
Output: x: vector of l 32-bit unsigned integers, initialized to zeros
Σ = {A, B, C, …, X, Y, Z}
j: |Σ|-length integer vector of zeros
begin
    for each s_i ∈ s
        if s_i ∈ Σ
            c ← c | Σ_c = s_i
            if j_c < l
                x[j_c] ← x[j_c] ∨ (1 ≪ c)
            end-if
            j_c ← j_c + 1
        end-if
    end-for
    return x
end
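A C rendering of Algorithm 6 might look like the following (the function name set_alpha_bits and the use of the C library's character classifiers are our choices for the sketch):

```c
#include <stdint.h>
#include <ctype.h>

/* Sketch of Algorithm 6: build an l-word alphabetic FBF signature.
   After the call, x[j] bit c is set iff letter ('A' + c) occurs at
   least j + 1 times in s.  Case is ignored; non-letters are skipped. */
void set_alpha_bits(const char *s, uint32_t *x, int l)
{
    int count[26] = {0};                 /* the j vector in Algorithm 6 */
    for (int i = 0; i < l; i++) x[i] = 0;
    for (; *s; s++) {
        if (!isalpha((unsigned char)*s)) continue;
        int c = toupper((unsigned char)*s) - 'A';
        if (count[c] < l)
            x[count[c]] |= (uint32_t)1 << c;   /* x[j_c] ∨= (1 << c) */
        count[c]++;
    }
}
```

With l = 1 on "SMITH" this reproduces the bit pattern of Figure 23 (0x000C1180); with l = 2, the second word records second occurrences of a letter, as discussed above for multi-occurrence signatures.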

Algorithm 7, SetNumBits(s), shows the process of generating a signature for a numeric-only string, such as a Social Security Number, phone number or birth date. It uses a 32-bit unsigned integer, which can count up to three instances of each digit in the string s using 30 bits, and it ignores non-numeric characters. Since the strings are relatively short, only one integer is used for the signature. This algorithm can be modified to count more than three occurrences, as is done in Alg. 6 above. For Σ = {0, 1, 2, …, 7, 8, 9}, the following condition holds upon completion of SetNumBits(s): bit 3c + j of x is 1 if and only if the digit Σ_c occurs at least j + 1 times in s, for 0 ≤ c < |Σ| and 0 ≤ j < 3.


Algorithm 7: SetNumBits(s)
Input:  s: string of characters
Output: x: 32-bit unsigned integer
Σ = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}
j: |Σ|-length integer vector of zeros
begin
    x ← 0
    for each s_i ∈ s
        if s_i ∈ Σ
            c ← c | Σ_c = s_i
            if j_c = 2
                x ← x ∨ (4 ≪ (3 · c))
            else-if j_c = 1
                x ← x ∨ (2 ≪ (3 · c))
            else-if j_c = 0
                x ← x ∨ (1 ≪ (3 · c))
            end-if
            j_c ← j_c + 1
        end-if
    end-for
    return x
end
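Algorithm 7 translates to C along these lines (the name set_num_bits is ours). Each digit d owns the 3-bit field at bits 3d..3d+2, which records up to three occurrences of d:

```c
#include <stdint.h>
#include <ctype.h>

/* Sketch of Algorithm 7: pack a numeric FBF signature into one
   32-bit word.  Bit (3c + j) is set iff digit c occurs at least
   j + 1 times; separators such as '-' are ignored. */
uint32_t set_num_bits(const char *s)
{
    uint32_t x = 0;
    int count[10] = {0};                 /* the j vector in Algorithm 7 */
    for (; *s; s++) {
        if (!isdigit((unsigned char)*s)) continue;
        int c = *s - '0';
        if (count[c] < 3)
            x |= (uint32_t)1 << (3 * c + count[c]);
        count[c]++;
    }
    return x;
}
```

For “8005551212” this reproduces the bit pattern shown in Figure 24, i.e. 0x010380DB.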

5.1.2 Filtering FBF Signatures

The differing characters between strings s and t can be found by applying the exclusive disjunction (XOR) to their signatures x and y, respectively. Algorithm 8, FindDiffBits(x, y, l), uses a fast bit-counting method to count the ones in the exclusive disjunction of x and y. The loop executes only as many times as there are one-bits in the result [Wenger 1960]. The longest last names in the Census data are 15 characters. The longest numeric string is the phone number, which has 10 characters. The address strings are alphanumeric, and the maximum length in a list of real standardized local addresses is 25 characters. Because of the relatively short length of these strings, they will most likely produce sparse bit vectors. The loop in FindDiffBits(x, y, l) executes n times for the n one-bits in the integers' exclusive disjunction, which represent n differing set members, denoted n = |x ⊕ y|. The algorithm for FindDiffBits(x, y, l) is:

Algorithm 8: FindDiffBits(x, y, l)
Input:  x: 32-bit integer vector signature for string s
        y: 32-bit integer vector signature for string t
        l: integer length of the x and y vectors
Output: n: integer count of differing bits
d: 32-bit unsigned integer
begin
    i ← 0
    n ← 0
    while i < l
        d ← x_i ⊕ y_i
        while d > 0
            n ← n + 1
            d ← d ∧ (d − 1)
        end-while
        i ← i + 1
    end-while
    return n
end
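In C, Algorithm 8 reduces to an XOR per word and the bit-clearing loop (the function name find_diff_bits is ours):

```c
#include <stdint.h>

/* Sketch of Algorithm 8: count the bits that differ between two
   signature vectors.  d &= d - 1 clears the lowest set bit, so the
   inner loop runs once per one-bit of x[i] XOR y[i]. */
int find_diff_bits(const uint32_t *x, const uint32_t *y, int l)
{
    int n = 0;
    for (int i = 0; i < l; i++) {
        uint32_t d = x[i] ^ y[i];
        while (d > 0) {
            n++;
            d &= d - 1;
        }
    }
    return n;
}
```

Comparing the Figure 23 signature of "SMITH" (0x000C1180) with the corresponding signature of "SNITH" (0x000C2180) yields 2 differing bits, matching the single-substitution case analyzed in the proof below.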

The length variable l is set to 1 for numeric signatures, as defined in Alg. 7. The alphabetic and numeric signature and filter methods can be combined to process alphanumeric fields. This method has two significant performance properties: it compresses the signatures into compact primitive data types, which means faster loading to and storing from registers, and it allows machine-level operations to process the primitive data, which means much faster processing.

5.1.3 Proof of Correctness

We define an approximate match as “strings s and t differ by k or fewer edits”, where k is a user-defined variable. Implementing this in PDL forces termination once a distance of at most k is no longer possible. To process numeric strings s and t, we create FBF signatures x = SetNumBits(s) and y = SetNumBits(t). There is a relation between PDL and the FBF signature comparison FindDiffBits(x, y, l): the set of all string pairs (s, t) ∈ S × T with FindDiffBits(x, y, l) ≤ 2k contains all string pairs that will pass PDL for a maximum of k edits, where PDL(s, t, k) = TRUE. In other words, if we select the sets of pairs G_≤2k = {(s, t) ∈ S × T : FindDiffBits(x, y, l) ≤ 2k}, where x is the signature of s and y is the signature of t, and H_≤k = {(s, t) ∈ S × T : PDL(s, t, k) = TRUE}, then we claim G_≤2k ⊇ H_≤k.

Consider that PDL returns a Boolean value that is TRUE if the number of substitutions, deletions, insertions and transpositions needed to convert string s into t is less than or equal to k, and that FindDiffBits(x, y, l) = |x ⊕ y|.

If the single edit operation for a pair (s, t) ∈ S × T is a transposition, the filter will show a difference of zero, because |x ⊕ y| = 0: every character of s occurs in t and every character of t occurs in s. Let s = “13245” and t = “12345”. Since s and t contain the same characters, |x ⊕ y| = 0.

If the single edit operation is a deletion, the worst case is |x ⊕ y| = 1, which occurs when the deleted character s_i ∈ s is such that s_i ∉ t. Let s = “123456” and t = “12345”. Since s and t differ by one delete operation, they differ by the one character ‘6’, so |x ⊕ y| = 1.

If the single edit is an insertion, the worst case is |x ⊕ y| = 1, which occurs when the character t_j ∈ t inserted into s is such that t_j ∉ s. Let s = “1234” and t = “12345”. Since s and t differ by one insert operation, they differ by the one character ‘5’, so |x ⊕ y| = 1.

If the single edit is a substitution, the worst case is |x ⊕ y| = 2, which occurs when the character s_i is changed to t_j such that s_i ∈ s but s_i ∉ t, and t_j ∈ t but t_j ∉ s. Let s = “12346” and t = “12345”. Since s and t differ by one substitution operation, each differs from the other by one character: ‘5’ is in t but not in s, and ‘6’ is in s but not in t, so we have |x ⊕ y| = 2.

This also holds for strings that contain multiples of the same character, because each new occurrence of a character that is already in a string is recorded as a new character. Consider s = “123456” and t = “1234566”. The second ‘6’ is treated as different from the first and sets the “found a second 6” bit to 1. Since there is no second ‘6’ in s, we have t_j ∈ t such that t_j ∉ s, because the second ‘6’ is considered different from the first.
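The worked cases above can be confirmed mechanically. The sketch below re-implements the numeric signature and bit count inline; the helper names sig and diff are ours, not the dissertation's code:

```c
#include <stdint.h>
#include <ctype.h>

/* Numeric FBF signature: bit (3c + j) set iff digit c occurs at
   least j + 1 times (up to 3 occurrences recorded per digit). */
static uint32_t sig(const char *s)
{
    uint32_t x = 0;
    int cnt[10] = {0};
    for (; *s; s++) {
        if (!isdigit((unsigned char)*s)) continue;
        int c = *s - '0';
        if (cnt[c] < 3) x |= (uint32_t)1 << (3 * c + cnt[c]);
        cnt[c]++;
    }
    return x;
}

/* |x XOR y|: the number of differing signature bits. */
static int diff(const char *s, const char *t)
{
    uint32_t d = sig(s) ^ sig(t);
    int n = 0;
    while (d) { n++; d &= d - 1; }
    return n;
}
```

Running the four single-edit cases gives exactly the bounds claimed: 0 for the transposition, 1 for the deletion and the insertion, 2 for the substitution, and 1 for the repeated-character pair.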

The examples above show the proof for k = 1. By inductive reasoning, the worst case for a PDL threshold of k is |x ⊕ y| = 2k, which occurs when all k edits are substitutions. We claim that

∀(s, t) ∈ H_≤k, (s, t) ∈ G_≤2k

and

PDL(s, t, k) = TRUE ⟹ FindDiffBits(x, y, l) ≤ 2k.

Algorithm 9 below shows an example of an approximate string similarity join on the lists S and T, using FBF and PDL together as a Filtered and Pruned Damerau-Levenshtein distance (FPDL). Each string in the lists has a single numeric field. FindDiffBits(x, y, l) uses signatures x and y for strings s and t, respectively, to decide which pairs warrant the computationally more expensive PDL(s, t, k), using the aforementioned threshold relationship. To process the numeric strings, create FBF signature arrays X and Y as x_i = SetNumBits(s_i) for every s_i ∈ S and y_j = SetNumBits(t_j) for every t_j ∈ T, and follow the process in MatchStrings(S, T, X, Y, k, l):

Algorithm 9: MatchStrings(S, T, X, Y, k, l)
Input:  S, T: string lists to be matched
        X, Y: FBF signature lists for S and T
        k: integer threshold
        l: integer length of the signatures
i, j: integers
begin
    for each (s_i, t_j) ∈ S × T
        if FindDiffBits(x_i, y_j, l) ≤ 2k
            if PDL(s_i, t_j, k) = TRUE
                match(s_i, t_j)
            else
                unmatch(s_i, t_j)
            end-if
        else
            unmatch(s_i, t_j)
        end-if
    end-for
end
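A self-contained C sketch of the FPDL composition follows. The helpers inline a numeric signature, a bit count and a plain (un-pruned) optimal-string-alignment distance; all names are ours. The point is the gate: the edit-distance verification runs only when the signature difference is within 2k:

```c
#include <stdint.h>
#include <stdbool.h>
#include <string.h>
#include <ctype.h>

/* Numeric FBF signature, as in Algorithm 7 (up to 3 per digit). */
static uint32_t numsig(const char *s)
{
    uint32_t x = 0; int cnt[10] = {0};
    for (; *s; s++) {
        if (!isdigit((unsigned char)*s)) continue;
        int c = *s - '0';
        if (cnt[c] < 3) x |= (uint32_t)1 << (3 * c + cnt[c]);
        cnt[c]++;
    }
    return x;
}

static int popcnt(uint32_t d)
{
    int n = 0;
    while (d) { n++; d &= d - 1; }
    return n;
}

/* Restricted Damerau-Levenshtein, threshold-checked (strings of
   up to 15 characters, which covers the demographic fields here). */
static bool osa_leq(const char *s, const char *t, int k)
{
    int m = (int)strlen(s), n = (int)strlen(t);
    if (m > 15 || n > 15) return false;
    int d[16][16];
    for (int i = 0; i <= m; i++) d[i][0] = i;
    for (int j = 0; j <= n; j++) d[0][j] = j;
    for (int i = 1; i <= m; i++)
        for (int j = 1; j <= n; j++) {
            int best = d[i-1][j-1] + (s[i-1] != t[j-1]);
            if (d[i-1][j] + 1 < best) best = d[i-1][j] + 1;
            if (d[i][j-1] + 1 < best) best = d[i][j-1] + 1;
            if (i > 1 && j > 1 && s[i-1] == t[j-2] && s[i-2] == t[j-1]
                && d[i-2][j-2] + 1 < best) best = d[i-2][j-2] + 1;
            d[i][j] = best;
        }
    return d[m][n] <= k;
}

/* FPDL: reject safely on |x XOR y| > 2k, otherwise verify. */
bool fpdl_match(const char *s, const char *t, int k)
{
    if (popcnt(numsig(s) ^ numsig(t)) > 2 * k)
        return false;               /* filtered out: no edit-distance call */
    return osa_leq(s, t, k);
}
```

For example, with k = 1 the pair ("8005551212", "8005551213") passes the filter (2 differing bits) and verifies, while ("8005551212", "9999999999") is rejected by the filter alone, before any edit-distance work.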

The filter can also be used to determine the unmatch condition, as shown in the second else statement above for FPDL, because it is already established that the strings cannot match: since G_≤2k ⊇ H_≤k and G_≤2k ∩ G_>2k = ∅, where G_>2k = {(s, t) ∈ S × T : |x ⊕ y| > 2k}, it follows that H_≤k ∩ G_>2k = ∅, so no pair of H_≤k lies in G_>2k, and FindDiffBits(x, y, l) > 2k ⟹ PDL(s, t, k) ≠ TRUE. The proofs for the alphabetic and alphanumeric FBF are similar to this proof because the signature bits are generated and processed in the same way. The Venn diagram in Fig. 25 below shows a typical relation between the four sets G_≤2k, G_>2k, H_≤k and H_>k = {(s, t) ∈ S × T : PDL(s, t, k) ≠ TRUE}. The large black circle is H_>k and the small black circle is G_≤2k. Their intersection is the set of pairs (s, t) ∈ S × T for which FindDiffBits(x, y, l) ≤ 2k but PDL(s, t, k) ≠ TRUE. It may also be the case that G_≤2k ∩ H_>k = ∅, if FindDiffBits(x, y, l) and PDL(s, t, k) are in perfect agreement. If k is increased, this same property holds because, by inductive reasoning, in the worst case for k edits the signatures will differ by at most 2k bits due to substitution errors. The proof is similar for alphabetic characters and addresses.


Figure 25: Venn diagram for FPDL relationship

5.1.4 Experiments for FBF

The experiments include only comparable string distance metrics, meaning that the compared methods are likely to be effective on short strings. The primary intended use of this method is linking fields in the previously described demographic-type data (and possibly other relatively short strings). According to the 1990 US Census data obtained from [U.S. Census 1990], the minimum, maximum and average character lengths for first names are 2, 11 and 5.96, respectively. These statistics are 2, 15 and 6.89 for last names. The birth date, Social Security number and phone number fields are fixed-length at 8, 9 and 10 characters, respectively. The string comparators used in the experiments include:

1. Damerau-Levenshtein (DL)

2. Prefix Pruned Damerau-Levenshtein (PDL)

3. Jaro string similarity (Jaro)

4. Jaro-Winkler string similarity (Wink)

5. Hamming Distance (Ham) [24]

6. FBF Filtered DL (FDL)

7. FBF Filtered PDL (FPDL)

8. FBF Filtered only (FBF)


9. Length Filtered DL (LDL)

10. Length Filtered PDL (LPDL)

11. Length Filter only (LF)

12. Length then FBF Filtered DL (LFDL)

13. Length then FBF Filtered PDL (LFPDL)

14. Length then FBF only (LFBF)

The experiments were run 5 times and their average was recorded as the result. The data for the experiments included randomly selected strings from:

1. 5,163 Census first names (FN)

2. 151,670 Census last names (LN)

3. 547,771 local addresses (Ad)

4. 12,000 synthetic phone numbers (Ph)

5. 35,525 random birthdates (Bi)

6. 12,000 synthetic Social Security Numbers (SSN)

The first names were merged from the male and female first-name lists from the 1990 U.S. Census. The last names were from the 2000 U.S. Census. The addresses were from local tax records containing 3,874 unique streets. The phone numbers were synthetically generated based on the numbering scheme of the North American Numbering Plan [NANP]. The birthdates were randomly selected over the 100 years between 2/25/1912 and 2/24/2012, or 36,525 unique dates. The Social Security Numbers were synthetically generated by the same rules that the Social Security Administration uses to issue actual SSNs to clients [SSA]. Each entry in the initial or “clean” data sets was injected with a single edit error to produce a second “error” data set that simulates data entry errors, where the clean entries match the error entries by index position in each list to maintain a ground truth. Samples of 5,000 were selected from each list and matched using the string comparison algorithms.

Our second set of results generates runtime curves for all of the methods from the first experiment. When executed on two same-size lists, the FBF algorithm is clearly O(n²), as are all of the others, but the work for each string-pair comparison is, on average, significantly reduced. We ran experiments varying n from 1,000 to 18,000 strings in each of the two datasets to be merged, drawn from the 151,670 last names in the Census data. We randomly selected 5 clean datasets for each n (1,000 to 18,000 last names, or 90 clean datasets) and created 90 error datasets by injecting a single edit-distance error into a copy of each clean set. We ran each experiment 5 times, discarding the fastest and slowest times from each and averaging the remaining times.

Our third set of experiments repeats the second set but adds the length filter from [Gravano and Ipeirotis 2001] and the combination of length filtering with FBF, using the Census last names. The length filter was used as a wrapper for FBF, just as FBF is used as a wrapper for DL and PDL in Alg. 9, basically adding another if-else statement. We also reran the first set of experiments with the local last names, addresses and first names. We did not perform any length-based filter experiments on the fixed-length numeric strings (birthdates, SSNs and phone numbers) because, as mentioned, the length filter is useless for fixed-length strings.

The test computer is an Acer Aspire 7745G notebook with a 64-bit Intel i7-720QM processor, 16 GB PC1333 RAM and the Windows 7 Ultimate 64-bit operating system. The code was written and compiled with 32-bit GCC.


5.1.5 Results for FBF

The experimental results show significant performance gains using FBF. The results for SSN are shown below in Table 9, with the edit distance threshold k = 1 and the Jaro/Wink threshold set to 0.8 (0.75 for FN). Exactly 5,000 of the 25,000,000 pairs are known to match. The DL algorithm is used as the baseline and ground truth for all of the other methods. The first column is the method, the second is Type 1 errors (false positives), the third is Type 2 errors (false negatives), the fourth is time in milliseconds and the last is the performance gain over DL. Notice that all of the DL-based methods have very few Type 1 errors compared to Jaro and Wink. Ham is the only method that has Type 2 errors. The time had to be recorded in milliseconds because the FBF functions run very quickly. As shown in the last row of Table 9, Gen, the SetNumBits(s) function generates the FBF signatures for all 10,000 SSNs in 0.6 milliseconds. The FBF row shows the results for matching the strings using only the FBF filter. Notice that the filter has 123,318 Type 1 errors. This shows that FBF removed 24,876,682 unnecessary pair-wise comparisons before processing the strings with DL or PDL: a filter efficiency of more than 99%. FPDL is 62.24 times faster than DL. The average FBF comparison for an SSN was less than 58 nanoseconds per pair using FindDiffBits(x, y, l).

Table 9: Accuracy and performance results for SSN string experiment

SSN      Type 1    Type 2    Time ms      Speedup
DL           42         0   52,807.2         1.00
PDL          42         0   17,449.2         3.03
Jaro     93,658         0   16,043.6         3.29
Wink    239,922         0   17,720.2         2.98
Ham          41     2,352    3,571.6        14.79
FDL          42         0    1,060.8        49.78
FPDL         42         0      848.4        62.24
FBF     123,318         0      729.0        72.44
Gen                              0.6    88,012.00


Table 10 shows the experimental results for SSN with k = 2. Notice that even with a more relaxed match threshold, the edit distance methods still have fewer Type 1 errors, and the FBF filter still provides a significant performance gain with no loss of true positive matches. Notice that the FBF performance gain is less than with k = 1. The reason is that the FBF passed 1,344,669 candidate pairs, 10.9 times as many. FPDL is still faster than Hamming distance.

Table 10: Accuracy and performance results for SSN experiment with k = 2

SSN2       Type 1    Type 2    Time ms     Speedup
DL          1,229         0   51,523.4        1.00
PDL         1,229         0   22,441.4        2.30
Jaro       93,658         0   15,473.6        3.33
Wink      239,922         0   17,120.0        3.01
Ham         1,014         0    3,518.4       14.64
FDL         1,229         0    3,625.6       14.21
FPDL        1,229         0    2,097.0       24.57
FBF     1,344,669         0      713.2       72.24
Gen                               0.8    64,404.25

Figure 26 shows the plot of the average runtime for an FBF filter comparison on a single pair of SSNs by total number of pairwise comparisons performed. The performance gain is stable and consistent, with an average time of 58 nanoseconds for an FBF-only pairwise comparison given the search space, 67.9 nanoseconds for FPDL and 84.9 nanoseconds for FDL. The average time for DL was 4,122.7 nanoseconds per comparison. The plot for LN was more erratic because the strings are not of fixed length as in SSN, but it seemed to converge on runtimes between 56 and 58 nanoseconds per comparison.


Figure 26: Plot of average pairwise comparison by number of comparisons

Table 11 shows the results for Census last names, LN, with edit distance threshold k = 1 and the Jaro/Wink threshold set to 0.8. There are again two lists, each containing 5,000 strings. The performance gain is still evident, the edit-based methods continue to have lower Type 1 errors, and the FPDL method is approximately 3 times faster than Ham, which is a linear-runtime function. FPDL is 27.29 times faster than DL.

Table 11: Accuracy and performance results for LN string experiment

LN      Type 1    Type 2    Time ms     Speedup
DL         766         0   31,073.2        1.00
PDL        766         0    6,201.0        5.01
Jaro    18,615        44   10,707.2        2.90
Wink    47,195        28   12,242.6        2.54
Ham        559     3,011    3,344.0        9.29
FDL        766         0    1,154.4       26.92
FPDL       766         0    1,138.6       27.29
FBF     20,174         0    1,142.6       27.20
Gen                             0.8   38,841.50

Our best performance results were realized in the street address, Ad, experiments. Table 12 below shows the experimental results for a sample of 5,000 of the 547,771 Philadelphia street addresses. Addresses are usually the longest strings in demographic data and typically have the largest alphabet. The experiment contained two files, each containing 5,000 strings. The first file contained clean street addresses and the second contained the same street addresses with a single edit error randomly injected into each address. All of the strings in the error file were matched against all of the strings in the clean file, and the accuracy and performance were recorded. The rows are the methods and the columns are Type 1 and Type 2 errors (false positives and false negatives, respectively), time and computational speedup over the baseline DL method. It should be noted that the ground truth for these experiments is the DL results. We refer to an optimization as k-safe if it produces exactly the same string matching results as DL. The experimental data were aligned by index, meaning that error string index 1 was clean string index 1 with a single edit error injected. If a method matched string pairs between sets whose indexes matched, it was recorded as a true positive match. Any other match was recorded as a false positive. All of the methods with 120 Type 1 errors actually have no errors, because their results exactly match DL. The FDL method achieved a 78.18 times speedup. The FPDL method achieved a 79.60 times speedup and was more than 3 times faster than Hamming distance, which has an O(n) complexity (where n is the length of the shorter string). These results clearly show the cumulative effect of combining k-safe optimizations and that the combination does not harm results.

Table 12: Accuracy and performance results for Ad string experiment

Ad      Type 1    Type 2     Time ms     Speedup
DL         120         0   135,098.8        1.00
PDL        120         0    15,887.4        8.50
Jaro   103,368         0    35,034.8        3.86
Wink   192,108         0    36,587.8        3.69
Ham         69     3,444     5,537.8       24.40
FDL        120         0     1,728.0       78.18
FPDL       120         0     1,697.2       79.60
FBF      3,452         0     1,664.6       81.16
Gen                              2.0    67,549.40

Table 13 shows the performance gains for FPDL compared to all non-filtered methods. FPDL is almost 80 times faster than DL for comparing address strings and almost 5 times faster than Ham for phone numbers; Hamming distance has an O(n) computational complexity, while FPDL is faster and has the matching power of DL. The results show that FBF yields better performance on longer strings: street addresses are the longest, phone numbers are second and SSNs are third. The difference is due to DL's O(mn) complexity. Table 13 shows the performance speedups from the smallest average length, FN on the left, to the longest, Ad on the right.

Table 13: Speedup for FPDL versus all other methods

FPDL      FN      LN      Bi     SSN      Ph      Ad
DL      23.23   26.10   42.46   62.24   75.00   79.60
PDL      6.04    5.22   15.91   20.57   22.63    9.36
Jaro     8.76    9.52   14.08   18.91   23.87   20.64
Wink    10.08   11.06   15.80   20.89   25.98   21.56
Ham      2.89    3.00    3.86    4.21    4.71    3.26

Table 14 shows the results from an RL experiment using FBF. The RL method was a simple deterministic point-and-threshold-based algorithm. There were two datasets: one with 1,000 clean records and the other with 1,000 single-edit-error-injected records. The table shows that FDL is 45 times faster and FPDL is 48.9 times faster than the baseline DL-based RL.

Table 14: Performance results for RL experiment

RL            DL      PDL     FDL    FPDL     FBF      Gen
Time ms  13762.0   3464.6   305.6   281.6   273.2      2.0
Speedup      1.0      4.0    45.0    48.9    50.4   6881.0

For completeness, we provide results from experiments using the Soundex (SDX) in Table 15. These experiments were performed using the same datasets as described for the previous experiments: one clean dataset and one with single edit errors injected. DL is 2.3 times slower than the Soundex for first names and 2.6 times slower for last names. When implemented in the aforementioned RL system, DL-based RL was 5 times slower than the department's proprietary system. FPDL is 10.3 times faster than the Soundex for first names and 10.8 times faster for last names. We did not compare the Soundex on the other data because it is primarily a name-matching phonetic algorithm. The accuracy of the Soundex is much worse than DL: in both cases, the Soundex found less than half of the true positive matches, and the number of false positives is 6.4 times that of DL for first names and almost 40 times greater than DL for last names. The Soundex even has false negative results.

Table 15: Soundex vs. DL with error injected

Error      TP      FN      FP          TN    Time ms
FN-DL   5,000       0   6,458  24,988,542     24,586
FN-SDX  2,259   2,741  47,137  24,947,863     10,664
LN-DL   5,000       0     766  24,994,234     32,308
LN-SDX  2,499   2,501  30,606  24,964,394     12,344

Table 16 shows the results of matching the clean dataset against itself. Both found all true positives and both have higher false positives than the previous experiment but the false positives are much higher for the Soundex than for DL. These results show that the presence of single edit data entry errors can cripple the Soundex’s ability to find true positive matches. In both experiments, the Soundex suffers from substantially higher false positive matches.

Table 16: Soundex vs. DL with clean data

Clean      TP      FN      FP          TN    Time ms
FN-DL   5,000       0  18,268  24,976,732     24,464
FN-SDX  5,000       0  70,476  24,924,524     10,936
LN-DL   5,000       0   1,760  24,993,240     31,586
LN-SDX  5,000       0  37,654  24,957,346     11,938

Figure 27 shows the runtime curves from the second set of experiments. Notice that the greatest growth rate is DL's and the smallest growth rates belong to the FBF methods (FDL and FPDL). The FBF methods grow more slowly than Hamming distance and appear almost linear when compared to DL in this context.


Figure 27: Runtime curves for all methods

The runtimes and n for each method were analyzed with Matlab's polyfit function for a second-degree polynomial solution in the form an² + bn + c. Table 17 below shows the coefficients for the second- and first-degree terms and the constant. The FBF methods' growth rates (a for FDL and a for FPDL) are two orders of magnitude smaller than DL's (a for DL).

Table 17: Coefficients and constants for speedup polynomials for FBF

        DL        PDL       Jaro      Wink      Ham       FDL       FPDL      Fil
a    1.32E-03  2.57E-04  4.68E-04  5.48E-04  9.30E-05  4.69E-05  4.67E-05  4.57E-05
b      -0.374    -0.080    -0.171    -0.496    -0.039    -0.008    -0.013    -0.012
c     512.739   127.316   247.971  1134.396    71.392    12.328    28.035    27.081

The actual speedups for the experiments comparing FPDL against DL are shown in Table 18 below. Based on the polynomial, the expected speedup of FPDL over DL for very large n (500,000 or more) is approximately 28.3 and is stable. DL would take approximately 3.8 days to merge two datasets of 500,000 strings each; FPDL would require only 0.13 days.


Table 18: Speedups for last name data by n

     n   speedup        n   speedup
 1,000      27.6   10,000      28.2
 2,000      27.3   11,000      28.3
 3,000      27.7   12,000      28.6
 4,000      28.3   13,000      28.4
 5,000      27.9   14,000      28.0
 6,000      27.8   15,000      28.2
 7,000      27.8   16,000      28.2
 8,000      28.0   17,000      28.5
 9,000      28.1   18,000      28.1

Figure 28 shows the comparison curves for FBF, length filtering and both methods combined for last names. The bottom curve is LFPDL, which also obscures the LFDL curve; these are the fastest methods. The length-filter methods LDL and LPDL were the slowest. The FBF methods, FDL and FPDL from Figure 27, are the middle two curves.

Figure 28: Runtime curves for length filter and FBF methods for last name

The results of polyfit on the runtime curves from Figure 28 are shown in Table 19. Notice that the a coefficient for LFPDL, 3.41E-05, is about 27% smaller than the a coefficient for FPDL from the results in Table 17.

Table 19: Coefficients and constants for speedup polynomials for length filter

        LDL        LPDL       Len        LFDL       LFPDL      LFil
a   5.38E-04   2.21E-04   9.23E-06   3.34E-05   3.41E-05   3.21E-05
b      0.263      0.119      0.004      0.012      0.001     -0.003
c   -531.126   -244.743     -9.159    -10.796      6.730     14.420


Table 20 shows the last name results for DL and FPDL from Table 11 and new results for length filtering (LF) and the combination of length filtering and FBF. Notice that the FPDL method is 27.29 times faster than the traditional DL method, but the combination of both methods is 36.01 times faster than DL, an improvement of about 32% over FPDL. Also notice that LF, the length-filter-only method, was 127.52 times faster than DL. The LF row shows that the length filter passed 11,196,547 string pairs of 25,000,000 possible pairs, which is why the LDL and LPDL methods did not perform as well as the FBF or combined filter methods. The combined filter decreased the number of calls to DL and PDL to 12,735, as shown in the LFBF row. None of the methods in Table 20, or any of the other experiments that use these filtering methods, produce any type 2 errors.

Table 20: Accuracy and performance results for LN string experiment using length filter

LN      Type1        Type2   Time ms     Speedup
DL      766          0       31,073.2      1.00
FPDL    766          0        1,138.6     27.29
LDL     766          0       13,599.0      2.28
LPDL    766          0        5,666.7      5.48
LF      11,196,547   0          243.7    127.52
LFDL    766          0          890.7     34.89
LFPDL   766          0          863.0     36.01
LFBF    12,735       0          795.3     39.07

The main reason for this performance gain is that the minimum length of a last name in the Census data is 2 and the maximum length is 15, with an average string length of 6.89. Efficiency is increased by not calling FBF's FindDiffBits(m, n, l) function for every possible string pair. The FBF row in Table 11 shows 20,174 calls passed on for verification by DL and PDL, while LFBF processed only 12,735, a savings of 7,439 calls. The increased precision comes from adding FBF to length filtering. Table 21 shows the counts of string lengths for the Census last name data used in our experiments.


Table 21: Counts of Census last name string lengths

Length  Frequency   Length  Frequency
2           175      9        14,424
3         1,585     10         7,772
4         8,768     11         3,215
5        23,238     12         1,190
6        34,025     13           442
7        33,256     14           177
8        23,380     15            23
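The length-then-FBF screening described above can be sketched in C as follows. The function names are ours; count_ones is the "enhanced while loop" (Kernighan's method) for counting set bits:

```c
#include <stdlib.h>   /* abs */
#include <string.h>   /* strlen */

/* Kernighan's loop: clears the lowest set bit each iteration,
 * so it runs once per 1-bit rather than once per bit position. */
static int count_ones(unsigned v) {
    int c = 0;
    while (v) { v &= v - 1; c++; }
    return c;
}

/* Length filter: strings whose lengths differ by more than k
 * cannot be within k edits of each other. */
static int length_pass(const char *s, const char *t, int k) {
    return abs((int)strlen(s) - (int)strlen(t)) <= k;
}

/* FBF filter: k edits can flip at most 2k signature bits, so a
 * larger XOR difference guarantees the pair cannot match. */
static int fbf_pass(unsigned x, unsigned y, int k) {
    return count_ones(x ^ y) <= 2 * k;
}

/* Only pairs that survive both cheap tests are handed to DL/PDL:
 *   if (length_pass(s, t, k) && fbf_pass(f(s), f(t), k)) run PDL. */
```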

Table 22 shows the experimental results for comparing 5,000 randomly selected Philadelphia street addresses, one clean dataset and one with single edit errors injected, for DL (the baseline, standard Damerau-Levenshtein edit-distance) and FPDL (Fast Filtered Prefix Pruned Damerau-Levenshtein edit-distance), along with results including the addition of the length filter. The combined method increased the speedup over DL from 79.6 times (FPDL) to 130.83 times, shown in the LFPDL row (Length and Fast Filtered Prefix Pruned Damerau-Levenshtein edit-distance). The filter efficiency describes how well a filter removes non-matching string pairs before applying DL. The filter efficiency for the length filter was 61.5%, and for the combination of length filter and FBF it was 99.987%. Addresses are the longest strings, have the largest alphabet and are the most computationally complex to compare under DL.

Table 22: Accuracy and performance results for Ad (Phila. Street Addresses) string experiment using length filter, FBF and PDL

Ad      Type1       Type2   Time ms      Speedup
DL      120         0       135,098.8      1.00
FPDL    120         0         1,697.2     79.60
LDL     120         0        48,879.3      2.76
LPDL    120         0        14,343.3      9.42
LF      9,623,583   0           237.3    569.24
LFDL    120         0         1,164.0    116.06
LFPDL   120         0         1,032.7    130.83
LFBF    3,200       0           985.3    137.11


5.1.6 Conclusion for FBF

As the number and size of databases grow, and as research and business problems require merging data without a reliable unique identifier, identity resolution has become a commonly required function for health and human services, government and business services, fraud detection and other security services. Our goal was to reduce unnecessary edit distance computation by identifying and filtering record pairs that are guaranteed not to match. Our computational results clearly show substantial performance gains when using the FBF over the compared string distance metrics. The results also show that there is no loss of accuracy: FBF, when combined with DL or PDL, produces the same results as DL. FBF takes advantage of a computer's ability to quickly compare differences in primitive data types using a single instruction, exclusive disjunction, and an enhanced while loop to count ones.

The computational cost of creating FBF signatures is very low, and only 4 bytes are needed for a numeric string, 8 bytes for an alphabetic string (to count two occurrences of each character) and 12 bytes for an alphanumeric string. Considering that addresses can be up to 25 characters or bytes and a phone number is 10 bytes, the space complexity is not unreasonable given the performance benefits. The record pair search spaces for FDL and FPDL are as exhaustive as DL's, but each comparison completes, on average, much more quickly by eliminating unnecessary work when comparing string pairs. FBF and PDL work efficiently together to deliver the same resulting accuracies as DL in significantly less time. Our results provide evidence that FPDL can perform in as little as one day the same work that would take DL nearly 80 days to complete. Adding the length filter to FPDL (LFPDL) further increases performance without loss of accuracy. The 40-hour record linkage (RL) update mentioned in the Introduction can now be completed in an hour or two. This performance is far superior to using the low-complexity Soundex alone, with a more than 46% improvement in true positive matches. Our goal is to implement a distributed in-memory data graph to process demographic data and resolve entities within the data across heterogeneous databases.

Our methodology differs substantially from SSJoin [Chaudhuri, Ganti and Kaushik 2006] in similarity handling. SSJoin is a DBMS operator that uses edit similarity and other string comparators for approximate joins. It also uses a prefix similarity filter based on matching tokens; that similarity filter cannot guarantee zero accuracy loss. FBF instead focuses on the difference between strings, which allows it to use a single machine instruction, the exclusive or, to find a subset of differing characters very quickly.

5.2 Second Optimization Technique: Hybrid Bitwise Signature Hashing with Bitwise Neighborhood Generation

Our second method uses a new signature for alphabetic, numeric and alphanumeric strings that requires only one 32-bit integer for each type. We also developed a signature hashing method that substantially increases performance without harming accuracy.

5.2.1 New Signatures for FBF

It is evident that smaller FBF signatures will process faster and require less memory to store. The original numeric FBF signature contains one unsigned integer counting up to three occurrences of each digit. The original alphabetic FBF signature contains two unsigned integers and the alphanumeric signature contains three. As of this report, all three are contained within one unsigned integer each. Other improvements include using the previously unused bits to record more information about the strings that they represent. The old numeric signature had two unused bits, shown as dashes in Figure 29 below.

--999888 77766655 54443332 22111000 00000001 00000011 10000000 11011011

Figure 29: Old FBF signature for numeric strings

The new numeric signature uses these bits to record that a digit has occurred four times (%) or five times (#), which still ensures that the relation between k edits and a 2k-bit signature difference is maintained. Figure 30 below shows the new FBF signature for the phone number "215-555-2345".

#%999888 77766655 54443332 22111000 11000000 00000011 10010010 11001000

Figure 30: New FBF signature for numeric strings

The new algorithm SetNumBits2(s) is shown below in Algorithm 10. Two tests are added to the if-else chain to record the occurrence of four or five matching digits in the string s.

Algorithm 10: SetNumBits2(s)
Input:  s: string of characters
Output: x: 32-bit unsigned integer
Σ = {0,1,2,3,4,5,6,7,8,9}
j: |Σ| integer vector of zeros
c: integer to set signature bit
x: integer to contain signature
begin
    for each s_i ∈ s
        if s_i ∈ Σ
            c = the index such that Σ_c = s_i
            if j_c = 4
                x = x ∨ (1 ≪ 31)
            else-if j_c = 3
                x = x ∨ (1 ≪ 30)
            else-if j_c = 2
                x = x ∨ (4 ≪ (3 ∗ c))
            else-if j_c = 1
                x = x ∨ (2 ≪ (3 ∗ c))
            else-if j_c = 0
                x = x ∨ (1 ≪ (3 ∗ c))
            end-if
            j_c += 1
        end-if
    end-for
    return x
end
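A C rendering of Algorithm 10 might look as follows; the function name set_num_bits2 is ours, and the bit layout follows Figure 30, with bits 30 and 31 marking fourth and fifth occurrences:

```c
/* Builds the new 32-bit numeric FBF signature: 3 bits per digit for
 * the first three occurrences, bit 30 for a fourth occurrence and
 * bit 31 for a fifth. Non-digit characters are skipped. */
unsigned set_num_bits2(const char *s) {
    int j[10] = {0};                 /* per-digit occurrence counts */
    unsigned x = 0;
    for (; *s; s++) {
        if (*s < '0' || *s > '9') continue;
        int c = *s - '0';
        if      (j[c] == 4) x |= 1u << 31;        /* fifth occurrence (#) */
        else if (j[c] == 3) x |= 1u << 30;        /* fourth occurrence (%) */
        else if (j[c] == 2) x |= 4u << (3 * c);   /* third  */
        else if (j[c] == 1) x |= 2u << (3 * c);   /* second */
        else if (j[c] == 0) x |= 1u << (3 * c);   /* first  */
        j[c]++;
    }
    return x;
}
```

For the phone number "215-555-2345" this reproduces the bit pattern shown in Figure 30 (hexadecimal 0xC00392C8).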

The old signature for alphabetic strings used two unsigned integers to count up to two occurrences of each character but left six bits per unsigned integer unused. The new signature uses one unsigned integer, records the first occurrence of each character in the first 26 bits and uses the six remaining bits to count characters that occur multiple times in the string. Figure 31 below shows that the last six bits (the bits to the left) record that up to three characters occurred twice (2222) and that up to two characters occurred three times (33). The figure shows the signature for "MISSISSIPPI". This string has only four unique characters, but the signature contains nine 1's for a string containing eleven characters. This enables the signature to record some of the characteristics of string length for longer strings with multiple matching characters.

332222ZY XWVUTSRQ PONMLKJI HGFEDCBA 11011100 00000100 10010001 00000000

Figure 31: New FBF signature for alphabetic string "MISSISSIPPI"

The new algorithm SetAlphaBits2(s) is shown below in Algorithm 11. Additional tests are added to the if-else chain to record repeated occurrences of matching characters in the string s. The C implementation of this algorithm and the others discussed uses registers and bitwise operators to achieve high performance. The algorithm shows that the code is calculated bit by bit. The ∧ symbol represents the logical AND, the ∨ symbol represents the bitwise OR and the ≪ symbol is the bit-shift to the next higher order bit in the integer.


Algorithm 11: SetAlphaBits2(s)
Input:  s: string of characters
Output: x: 32-bit unsigned integer
Σ = {A, B, C, …, X, Y, Z}
j: |Σ| integer vector of zeros to count occurrences
c: integer to set signature bit
c2: integer set to zero to count second occurrences
c3: integer set to zero to count third occurrences
x: integer to contain signature
begin
    x = 0
    for each s_i ∈ s
        if s_i ∈ Σ
            c = the index such that Σ_c = s_i
            if j_c = 0
                x = x ∨ (1 ≪ c)
            else-if j_c = 1 ∧ c2 < 3
                x = x ∨ (1 ≪ (c2 + 26))
                c2 += 1
            else-if j_c = 2 ∧ c3 < 2
                x = x ∨ (1 ≪ (c3 + 29))
                c3 += 1
            else-if j_c = 3
                x = x ∨ (1 ≪ 31)
            end-if
            j_c += 1
        end-if
    end-for
    return x
end
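A C rendering of Algorithm 11 (the function name set_alpha_bits2 is ours) might be:

```c
#include <ctype.h>

/* Builds the new 32-bit alphabetic FBF signature: bits 0-25 mark the
 * first occurrence of each letter; bits 26-28 mark up to three second
 * occurrences, bits 29-30 up to two third occurrences, and bit 31 a
 * fourth occurrence. Non-letter characters are skipped. */
unsigned set_alpha_bits2(const char *s) {
    int j[26] = {0};          /* per-letter occurrence counts */
    int c2 = 0, c3 = 0;       /* shared second/third occurrence counters */
    unsigned x = 0;
    for (; *s; s++) {
        if (!isalpha((unsigned char)*s)) continue;
        int c = toupper((unsigned char)*s) - 'A';
        if      (j[c] == 0)             x |= 1u << c;
        else if (j[c] == 1 && c2 < 3) { x |= 1u << (c2 + 26); c2++; }
        else if (j[c] == 2 && c3 < 2) { x |= 1u << (c3 + 29); c3++; }
        else if (j[c] == 3)             x |= 1u << 31;
        j[c]++;
    }
    return x;
}
```

For strings with no repeated letters the overflow bits stay clear; set_alpha_bits2("JIM") yields 4,864 and set_alpha_bits2("TIM") yields 528,640, the values used for Figure 37 later in this chapter.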

There is also a signature designed specifically for Philadelphia street addresses that requires only one unsigned integer for all |Σ| = 36 characters of Σ = {0,1,2,3,4,5,6,7,8,9, A, B, C, …, X, Y, Z}. We performed a character frequency analysis on all 547,771 taxable street addresses within the city of Philadelphia, shown in Figure 32. The histogram shows, for each character in Σ, the count of addresses that contain one or more occurrences of that character. Notice that nearly 89% of all addresses contain at least one 'S' or 'T'.

Also note that the characters ‘J’, ‘Q’, ‘X’, and ‘Z’ are very infrequent in the addresses.


Figure 32: Histogram of character occurrence in Philadelphia addresses

The new signature has 'S' and 'T' removed, since the majority of addresses contain them and they are therefore not a good measure of variance. The 'J' and 'Q' were combined into one bit, as were the 'X' and 'Z' characters, because they are very infrequent. All ten digits were retained. The figure below shows the address signature for "1801 N BROAD ST".

98765432 10YXWVUR PONMLKJI HGFEDCBA Z Q 01000000 11000001 01100000 00001011

Figure 33: New FBF Address signature

Algorithm 12 shows how to implement the SetAddrBits(s) function to generate a signature for street address strings, which contain only alphanumeric characters.

Algorithm 12: SetAddrBits(s)
Input:  s: string of characters
Output: x: 32-bit unsigned integer
Σ = {A, B, C, D, E, F, G, H, I, JQ, K, L, M, N, O, P, R, U, V, W, XZ, Y, 0,1,2,3,4,5,6,7,8,9}
(Note that J and Q are treated as the same character, as are X and Z, while S and T are ignored)
c: integer to set signature bit
x: integer to contain signature
begin
    for each s_i ∈ s
        if s_i ∈ Σ
            c = the index such that Σ_c = s_i
            x = x ∨ (1 ≪ c)
        end-if
    end-for
    return x
end
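A C sketch of Algorithm 12 follows. The function name set_addr_bits and the exact bit positions are our reading of the Figure 33 layout: A through I occupy bits 0-8, J and Q share bit 9, K through P occupy bits 10-15, R is bit 16, U through W are bits 17-19, X and Z share bit 20, Y is bit 21, the digits 0-9 are bits 22-31, and S and T are ignored:

```c
/* Builds the 32-bit Philadelphia street address signature. */
unsigned set_addr_bits(const char *s) {
    /* per-letter bit index for 'A'..'Z'; -1 means ignored (S, T) */
    static const int pos[26] = {
         0,  1,  2,  3,  4,  5,  6,  7,  8,   /* A-I            */
         9,                                   /* J (shares w/ Q) */
        10, 11, 12, 13, 14, 15,               /* K-P            */
         9,                                   /* Q              */
        16, -1, -1,                           /* R, S, T        */
        17, 18, 19,                           /* U-W            */
        20,                                   /* X (shares w/ Z) */
        21,                                   /* Y              */
        20                                    /* Z              */
    };
    unsigned x = 0;
    for (; *s; s++) {
        char ch = *s;
        if (ch >= 'a' && ch <= 'z') ch -= 32;              /* uppercase  */
        if (ch >= '0' && ch <= '9')
            x |= 1u << (22 + ch - '0');                    /* digit bits */
        else if (ch >= 'A' && ch <= 'Z' && pos[ch - 'A'] >= 0)
            x |= 1u << pos[ch - 'A'];                      /* letter bits */
    }
    return x;
}
```

With this layout, set_addr_bits("1801 N BROAD ST") reproduces the bit pattern of Figure 33 (hexadecimal 0x40C1600B).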

5.2.2 Bitwise Signature Hashing with Bitwise Neighborhood Generation

One of the problems with Reduced Alphabet Neighborhood Generation is that the codes generated are variable length, making it hard to use a code as the index for a hash table bucket. The code for "JAMES" is "00001" and the code for "TAM" is "001". These names are somewhat different and should not both map to the bucket with integer value 1. Signature Hashing has the ability to better differentiate disparate strings, due to a larger alphabet than Reduced Alphabet Neighborhood Generation, and it also has the benefit of using a numeric signature value that directly accesses the hash buckets. In the Signature Hashing scheme discussed below, the string "JAMES" maps to the binary string "0001001010101", which is the decimal 597. The pointer to "JAMES" and all other strings with the same hash code will be in this bucket. We suggest that it is possible to combine both of these methods using bitwise operators to make Signature Hashing faster and Reduced Alphabet Neighborhood Generation more robust. Consider that the 26-character alphabet is condensed to 13 characters as described for signature hashing.

Signature Hashing would have to compare the numeric values of all 2^13 buckets with the pattern string p to find candidates whose codes differ by 2k bits (see the Proof of Correctness section for FBF), which are used as candidates for an edit distance of k or fewer edits. Instead of comparing all 2^13 bucket indices, a function can be used to calculate a neighborhood of buckets containing candidates more quickly than comparing all buckets, and the buckets can be inspected quickly by processing only buckets that contain references to strings s ∈ S (i.e., nonempty buckets). The same logic that proves the FBF is lossless with respect to false negatives also proves that this method is lossless, because the hash codes are constructed from disjunctions of bits in signatures, which expands the hash candidate set to a dilation of the signature candidates. Any string signature that will pass the signature filter is guaranteed to be in the hash candidate set. Table 23 below shows the hash buckets local to "JAMES" from the U.S. Census list of first names for 5,163 males and females [U.S. Census 1990].

Table 23: Sample Hash Buckets for 13 Bits

Bucket  Count  Contents
585     15     HANS HASSAN MATHA NATASHA NATHAN SAMANTHA SAMATHA SANG SHAN SHANA SHANNA SHANNAN SHANTA TAMATHA THANH
586      0
587      2     NATACHA SHANDA
588      0
589     15     AGNES ATHENA BETHANN ETHAN SHANAE SHANE SHANTAE SHANTE SHEENA SHENA SHENNA TAMESHA TANESHA TEGAN TENESHA
590      0
591      1     CHANTE
592      3     MISS MISTI TIM
593     24     ANASTASIA ANISA ANISSA ANITA ANNIS ANNITA JANIS JANITA JASMIN MINTA NITA SABINA SANTINA SIMA SINA TAINA TAMI TAMMI TANIA TANJA TATIANA TIANA TIANNA TINA
594      0
595      7     ANASTACIA CANDIS CATINA DANITA JACINTA SANDI TAMICA
596     11     EMMITT INES JENETTE JENISE JENNETTE JESTINE MISTIE MITTIE NETTIE TENNIE TIEN
597     35     ANJANETTE BENITA BETTINA FATIMA JAMES JANESSA JANET JANETT JANETTA JANETTE JANISE JANNET JANNETTE JASMINE JEANETT JEANETTA JEANETTE JEANNETTA JEANNETTE JESENIA JESSENIA MAISIE MATTIE NENITA SABINE SAMMIE SEBASTIAN STEFANI STEFANIA STEFANIE STEFFANIE TAMIE TAMMIE TIFFANI TIFFANIE
598      5     DENIS DENISE DENISSE DENNIS DENNISE
599      4     BENEDICT DENITA SANDIE SENAIDA
600      1     SHIN
601     15     ANISHA ASHANTI MAISHA MISHA NATASHIA NATISHA NISHA SHAINA SHANI SHANITA SHANTI TAMISHA TANISHA TASHINA TINISHA
602      1     MITCH
603      2     CINTHIA SHANDI
604      3     GENESIS GINETTE SIGNE
605      7     BETHANIE FATIMAH JANETH MIESHA NIESHA SHENITA TENISHA
606      0
607      2     DENISHA SHANICE
608      0
609      1     TAMALA
610      0
611      0
612      2     KENT SELENE

Notice that bucket 597, containing “JAMES”, has 35 strings. If the number of bits for the bucket signatures is increased to 16 bits, using the new alphabetic signature discussed above, the extra 3 bits can be used to record information about recurring characters. The binary signature for “JAMES” becomes “0000001001010101”. From right to left the extra 3 bits record:


- First occurrence of at least 2 characters that occurred once were found to occur again
- Third occurrence of at least 2 characters that occurred once were found to occur again
- At least 3 characters that occurred twice were found to occur again

These rules ensure that some of the length and frequency of characters are represented in the signatures. Table 24 below shows that the lengths of the strings in the buckets are more similar than in the previous table.

Table 24: Sample Hash Buckets for 16 Bits

Bucket  Count  Contents
585      3     HANS SANG SHAN
586      0
587      0
588      0
589      6     AGNES ETHAN SHANE SHANTE SHENA TEGAN
590      0
591      1     CHANTE
592      1     TIM
593      8     JANIS JASMIN MINTA NITA SIMA SINA TAMI TINA
594      0
595      2     CANDIS SANDI
596      2     INES TIEN
597      8     BENITA JAMES JANET JANISE JASMINE SABINE STEFANI TAMIE
598      1     DENIS
599      2     DENITA SANDIE
600      1     SHIN
601      4     MISHA NISHA SHANI SHANTI
602      1     MITCH
603      1     SHANDI
604      1     SIGNE
605      5     JANETH MIESHA NIESHA SHENITA TENISHA
606      0
607      2     DENISHA SHANICE
608      0
609      0
610      0
611      0
612      1     KENT

Notice that bucket 597 contains "JAMES" and seven other names that have the same hash code. The exact match for a search pattern s and a string t ∈ T is guaranteed to be in the bucket that matches its hash code, where h(s) = h(t). Only eight strings would have to be compared for an exact match. There are 5,163 unique names in the list; these are names that occurred 100 or more times in the 1990 Census. A binary search could take as many as 12.33 comparisons (log2(5,163) = 12.33). Figure 34 below shows the counts of strings in hash buckets by frequency. Notice that 1,387 buckets had only one string and only one bucket had 30. It should also be noted that 63,137 buckets are empty and that 4,354 of the first names are in buckets with 8 or fewer elements.

Figure 34: Counts of Strings in Hash Buckets by Frequency

The pointer array of integers for the buckets will use 64K x 4 bytes, or 256K of RAM, and will remain constant as long as 16 bits are used for the hash code. Each of the integer indices in the buckets will use 4 bytes, with all buckets together using 5,163 x 4 bytes, or less than 20K of RAM, which will increase if more strings are added to the hash table. This will double for a 64-bit system, which will have to be used for large RL problems.

Signature Hashing and the concept of Full Neighborhood Generation have been combined with the reduced alphabet from the hashing to perform Hybrid Bitwise Signature Hashing with Bitwise Neighborhood Generation. The mapping in Figure 35 below is used to generate a hash signature for the names "JAMES" (top) and "TAM" (bottom). These names now map to different buckets using this method, as shown below.

322Y WUSQ OMKI GECA 322Z XVTR PNLJ HFDB 0000 0010 0101 0101

322Y WUSQ OMKI GECA 322Z XVTR PNLJ HFDB 0000 0010 0100 0001

Figure 35: Hash codes for JAMES (top) and TAM (bottom)

The neighborhood generation is performed using bitwise operators. The XOR and bit-shift can be used to quickly compute all buckets whose hash codes differ by 2k bits, as shown in the code below.
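Since the code listing of Figure 36 is not reproduced in this text, the following C sketch (the function name generate_neighborhood is ours) shows the idea: XOR-ing the pattern's hash code with shifted 1-bits enumerates every bucket index within a 2-bit difference.

```c
#define HASH_BITS 16

/* Enumerates u itself plus every code differing from u in one or two
 * of the low HASH_BITS bits; visit() may be NULL to only count.
 * Returns the number of codes generated: 1 + b + b(b-1)/2, which is
 * 137 for b = 16. */
int generate_neighborhood(unsigned u, void (*visit)(unsigned)) {
    int count = 0;
    if (visit) visit(u);
    count++;                                        /* 0 bits differ */
    for (int i = 0; i < HASH_BITS; i++) {
        unsigned u1 = u ^ (1u << i);                /* flip bit i */
        if (visit) visit(u1);
        count++;                                    /* 1 bit differs */
        for (int j = i + 1; j < HASH_BITS; j++) {
            if (visit) visit(u1 ^ (1u << j));       /* flip bits i and j */
            count++;                                /* 2 bits differ */
        }
    }
    return count;
}
```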

Figure 36: Code to generate all buckets within a 2-bit difference

The number of buckets in the approximate neighborhood of a pattern's hash code, for a b-bit code and k differing bits, is:


1 + b + ∑_{i=0}^{k−2} (b − i)(b − i − 1)/2

Equation 25: Number of buckets within 2k bits

Table 25 below shows the 2k-neighborhood for "JAMES" with 2k = 2. Notice that "TAM" is 2k-related to "JAMES", with its hash code differing by 2 bits.

Table 25: 2k = 2 neighborhood buckets for JAMES

The 2k-neighborhood for "JAMES" with 2k = 2

Bucket  Count  Contents
8277    30     AIMEE AMIEE AMMIE ANNIE BENJAMIN BENNIE FABIAN FANNIE JAIMEE JAIMIE JAMEE JAMMIE JANAE JANEAN JANEE JANEEN JANENE JANINE JANNIE JEANA JEANE JEANIE JEANINE JEANNA JEANNE JEANNIE JENAE JENNA MAMIE NINFA
8789    21     BETTINA FATIMA JANESSA JANETT JANETTA JANETTE JANNET JEANETT JESENIA JESSENIA MAISIE MATTIE NENITA SAMMIE SEBASTIAN STEFANIA STEFANIE STEFFANIE TAMMIE TIFFANI TIFFANIE
117     14     ALINE BLAINE ELINA JAMEL JANEL KENIA LAINE LANIE LIANE MAILE MELANI MELIA MELINA NELIA
8785    14     ANISA ANISSA ANITA ANNIS ANNITA JANITA SABINA SANTINA TAINA TAMMI TANIA TANJA TIANA TIANNA
81      12     AMI IAN IMA INA JAMI JAN JANI JINA MAI MIA MINA NIA
565     11     ALISE BELKIS ELIAS ELISA FELISA ISABEL ISELA KASIE KATIE LEISA LESIA
87      10     CAMIE CANDIE DAINE DAMIEN DIANE JANICE MACIE MADIE MANDIE NEIDA
721     10     BONITA JAMISON JASON JONAS ONITA SIMONA SONIA SONJA TONIA TONJA
85       9     AMIE JAIME JAME JAMIE JANE JANIE JEAN JENA MANIE
341      8     EFRAIN FERMINA IRENA MAIRE MARIE MARINE MARNIE REINA
593      8     JANIS JASMIN MINTA NITA SIMA SINA TAMI TINA
597      8     BENITA JAMES JANET JANISE JASMINE SABINE STEFANI TAMIE
4181     8     JAMEY JANEY JAYME JAYMIE JAYNE JAZMINE MAZIE ZENIA
9045     8     BRITTANIE EMERITA ERNESTINA MARIETTA MARIETTE MERISSA NERISSA SERAFINA
629      7     ISMAEL KATELIN LENITA MELISA MELITA SELINA TEMIKA
853      7     BERNITA MARTINE RENITA SEBRINA SERINA SIRENA TERINA
8788     7     EMMITT JENISE JESTINE MISTIE MITTIE NETTIE TENNIE
8821     7     MELISSA MELISSIA MELLISA MELLISSA NATALIE TAMEIKA TAMEKIA
581      6     META NETA SEAN SENA STEFAN TENA
21       1     JAE
589      6     AGNES ETHAN SHANE SHANTE SHENA TEGAN
517      1     ESTA
789      6     ARTIE BEATRIS REITA SERITA TERISA TIERA
583      1     DANTE
8773     6     ANETTE ANNETT ESTEBAN ESTEFANA SEEMA TEENA
592      1     TIM
605      5     JANETH MIESHA NIESHA SHENITA TENISHA
598      1     DENIS
849      5     MARIS MARTI MARTIN MIRTA TRINA
604      1     SIGNE
69       4     BEN EMA ENA MAE
625      1     NILSA
213      4     EBONI FIONA JOANE JOANIE
628      1     MILES
529      4     ISA TAI TIA TISA
631      1     MATILDE
535      4     CASIE JESICA SADIE STACIE
709      1     AFTON
541      4     FAITH IESHA TEISHA TIESHA
727      1     DESPINA
601      4     MISHA NISHA SHANI SHANTI
855      1     FRANCIS
637      4     KATHLINE KENISHA SHEMIKA SHENIKA
93       2     ANGIE GENIA
861      1     MARIBETH
1617     4     AUSTIN JUNITA JUSTINA MAVIS
595      2     CANDIS SANDI
981      1     PETRINA
8725     4     BESSIE BETTIE JESSIA JETTA
596      2     INES TIEN
1557     1     EVITA
12885    4     TIFFANEY TIFFANY YESENIA YESSENIA
599      2     DENITA SANDIE
1621     1     VENITA
25173    4     ANJANETTE JANNETTE JEANETTA JEANNETTA
607      2     DENISHA SHANICE
1623     1     VICENTA
84       3     JEN JENI MEI
613      2     SELMA TELMA
1749     1     FAUSTINO
577      3     SAM STAN TAM
724      2     SIMONE TONIE
1877     1     VERNITA
661      3     JOSEFA SOFIA TOBIE
725      2     BENITO JOSEFINA
4677     1     STEFANY
733      3     JOSEPHINA STEPHANI THOMASINE
1620     2     JUSTINE MITSUE
4689     1     YASMIN
837      3     BRENT TAREN TRENA
4693     2     TIFANY YASMINE
4949     1     BRITNEY
2133     3     MAXIE MAXINE XENIA
8791     2     BENEDICT SENAIDA
9813     1     FAUSTINA
8917     3     ANTIONE ANTOINE STEPANIE
8797     2     BETHANIE FATIMAH
41557    1     JEANETTE

The total number of neighboring buckets generated by the code above is 137. The table above shows all buckets that contained pointers to strings in the list of Census first names. Only 78 buckets contained pointers to 346 strings. This is a filter efficiency of


93.3% for the most common male first name in the U.S. The DL("JAMES", t) algorithm would only be applied to these 346 strings.

A similar experiment was run on the U.S. Census last names list [U.S. Census 2000], containing 151,671 last names from the 2000 Census that appeared 100 or more times. A search for "JAMES" found a neighborhood containing 4,424 approximate string match candidates, a filter efficiency of 97%. The string "SMITH", the most common surname in the U.S., has a neighborhood containing 1,989 candidates, a filter efficiency of 98.7%.

The new method generates the candidate bucket list C = H_k(u) from u = h(s), where s is the search pattern, and performs the following hashed edit distance:

∀C_i ∈ C, ∀v ∈ C_i: if DL(s, t_v) ≤ k, then M ← (s, t_v)

Equation 26: Composition of hashing and DL

where t_v is the string indexed by bucket element v in bucket C_i, y_v is its FBF signature, and M is a set of matched string pairs. The experimental results will be discussed in the description of the Hybrid model below.

5.2.3 Hybrid Bitwise Signature Hashing with Bitwise Neighborhood Generation, Length Filter, FBF and PDL

Prefix pruning, filters and hashing may be combined to substantially increase edit-distance performance. The key is to use hashing, filtering and pruning methods that are guaranteed not to remove pairs of strings (s, t) where DL(s, t) ≤ k. We consider a function to be k-safe if this guarantee holds. If all hashing, filtering and pruning optimization methods are k-safe, then the composite of these functions will be k-safe, because none of these methods will remove string pairs with a DL edit distance less than or equal to k. Hashing with Bitwise Neighborhood Generation can be used to quickly generate a candidate set C of buckets C_i containing references to strings t ∈ T, where ∀C_i ∈ C, ∀t ∈ C_i we have Σ|C_i| ≪ |T|. Lengths of strings can be used to filter out candidates that have no chance of matching a pattern within k or fewer edits, namely those whose lengths differ by more than k characters. The FBF can be used to make a fast pass over the signatures of string pairs that survive hashing and length filtering. The PDL(s, t, k) algorithm is faster than DL(s, t) and produces the same results.

Our solution combines hashing, neighborhood generation, filtering, and pruning to produce substantial performance gains over any single method or sub-combination of methods. When data is processed, the FBF is generated from the input string, x = f(s), and the hash code u = h(x) is generated from the FBF signature. This creates a direct relationship between the string, the FBF signature and the hash code that can be exploited to substantially increase the filter efficiency for approximate string comparison. Consider the pairwise comparison between the strings "JIM" and "TIM", which have an edit distance of 1. The filter signatures are shown below in Figure 37 (the first is JIM, the second TIM). A bit is flipped to 1 if a character is present in a string; the real method uses the six extra bits to record recurring characters. For x = f(JIM), x = 00000000 00000000 00010011 00000000 = 4,864, and for y = f(TIM), y = 00000000 00001000 00010001 00000000 = 528,640.


433222ZY XWVUTSRQ PONMLKJI HGFEDCBA 00000000 00000000 00010011 00000000

433222ZY XWVUTSRQ PONMLKJI HGFEDCBA 00000000 00001000 00010001 00000000

Figure 37: FBF signatures for JIM and TIM

The FBF signatures are compressed into 16-bit hash codes, reducing the 2^32 possible 32-bit strings to 2^16 16-bit hash buckets, as shown below. For u = h(4,864), u = 0000 0000 0101 0000 = 80, and for v = h(528,640), v = 0000 0010 0101 0000 = 592.

432Y WUSQ OMKI GECA 432Z XVTR PNLJ HFDB 0000 0000 0101 0000

432Y WUSQ OMKI GECA 432Z XVTR PNLJ HFDB 0000 0010 0101 0000

Figure 38: Hash codes for JIM and TIM

The bit string below in Figure 39 is the result of the XOR operation on the two signatures. Note that the 'J' and 'T' bits are 1. This is an example of a substitution difference; its filter difference is less than or equal to 2k for an edit distance k, so the pair will be passed to a more intensive comparator.

433222ZY XWVUTSRQ PONMLKJI HGFEDCBA 00000000 00001000 00000010 00000000

Figure 39: XOR result for JIM and TIM signatures

The bit string below in Figure 40 is the result of the XOR operator on the hash codes. Notice that the hash code buckets for "JIM" and "TIM" now differ by only 1 bit, due to the compression of characters into hash codes. The combination of 'I' and 'J' in the hash function reduces the differing bits in the hash codes. This is the aforementioned dilation effect that ensures that the signature hash is k-safe.

432Y WUSQ OMKI GECA 432Z XVTR PNLJ HFDB 0000 0010 0000 0000

Figure 40: XOR result for JIM and TIM hash codes

The example FBF signatures and hash codes demonstrate the close relationship between the two optimizations and that they are also k-related. We use neighborhood generation to create a list of hash codes for candidate buckets. We do not generate alphabetic strings and code them; instead, we use the hash code of the pattern string s to generate all 1- and 2-bit different hash bucket codes for u, thereby skipping the character-based generation described above and the subsequent signature and hash coding. The new method generates the candidate bucket list C = H_k(u) from u = h(x) and x = f(s), where s is the search pattern, and performs the following filtering and pruning:

∀C_i ∈ C, ∀v ∈ C_i: if L(s, t_v) ≤ k AND F(x, y_v) ≤ 2k AND PDL(s, t_v) ≤ k, then M ← (s, t_v)

Equation 27: Composition of hashing, length filter, FBF and PDL

where t_v is the string indexed by bucket element v in bucket C_i, y_v is its FBF signature, L() is the length filter, F() is the FBF and M is a set of matched string pairs. The filter and pruning methods are able to process the strings from the buckets very quickly, as shown in the experiments below. The functions are arranged so that the computationally cheapest, the length filter, occurs first; the next cheapest, the FBF, occurs second; and the most expensive, PDL, occurs last. Hashing and the two filters work together to decrease the search space to a small number of candidates relative to the total search space.
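A C sketch of the Equation 27 pipeline follows. The helper names are ours: hash_fold compresses a 32-bit FBF signature into a 16-bit hash code by OR-ing adjacent bit pairs, which is one plausible reading of the Figure 38 layout and reproduces the values h(4,864) = 80 and h(528,640) = 592 given for "JIM" and "TIM"; pdl stands in for the author's PDL routine.

```c
#include <stdlib.h>   /* abs */
#include <string.h>   /* strlen */

/* Folds a 32-bit FBF signature into a 16-bit hash code: hash bit i is
 * set when either bit 2i or bit 2i+1 of the signature is set. */
static unsigned hash_fold(unsigned x) {
    unsigned u = 0;
    for (int i = 0; i < 16; i++)
        if (x & (3u << (2 * i)))     /* either bit of the pair set */
            u |= 1u << i;
    return u;
}

/* Kernighan's loop for counting set bits. */
static int count_ones(unsigned v) {
    int c = 0;
    while (v) { v &= v - 1; c++; }
    return c;
}

/* Equation 27, cheapest test first: length filter L(), then FBF F(),
 * then the (stubbed) PDL; each stage is k-safe, so the composite is. */
static int candidate_matches(const char *s, unsigned x,
                             const char *t, unsigned y, int k,
                             int (*pdl)(const char *, const char *, int)) {
    if (abs((int)strlen(s) - (int)strlen(t)) > k) return 0;  /* L()   */
    if (count_ones(x ^ y) > 2 * k) return 0;                 /* F()   */
    return pdl(s, t, k) <= k;                                /* PDL() */
}
```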


5.2.4 Experiments for Hybrid Hash Method

The experiments are similar to those for the FBF, using some of the same comparators and the first name, last name and address datasets. The string comparators used in the experiments include:

1. Damerau-Levenshtein (DL)

2. Prefix Pruned Damerau-Levenshtein (PDL)

3. FBF Filtered PDL (FPDL)

4. Length then FBF Filtered PDL (LFPDL)

5. Hashed DL (HDL)

6. Hashed PDL (HPDL)

7. Bitwise Neighborhood Hashed DL (GDL)

8. Bitwise Neighborhood Hashed PDL (GPDL)

9. Hybrid Signature Neighborhood Hashed Length and Fast Filtered PDL (GLFPDL)

The experiments were run 5 times and their average was recorded as the result. The data for the experiments included randomly selected strings from:

1. 5,163 Census first names (FN)

2. 151,670 Census last names (LN)

3. 547,771 local addresses (Ad)

The test computer is an Acer Aspire 7745G notebook with a 64-bit Intel i7-720QM processor, 16 GB of PC1333 RAM and the Windows 7 Ultimate 64-bit operating system. The code was written and compiled with 32-bit GCC.


5.2.5 Results for Hybrid Hash Method Table 26 below shows the experiments for the first name list from the 1990 U.S. Census

[U.S. Census 1990], which contains 5,163 male and female first names. The clean data list had single edit errors injected for all experiments to create an error list. These two lists were compared as an 푂(푚푛) but because the lists are the same size it is actually

푂(푛2). There are two types of hashing shown in the results. HDL is straight up signature hashing, where the pattern’s hash code is compared to all 64K hash codes in the hash table. GDL is the aforementioned neighborhood generation hashing method. HDL performed poorly in these because the number of buckets was much greater than the number of strings in the list and the strings are relatively short. The last row is the results for the new composite method. The Time ms column is the runtime to process the lists.

On a relatively small set of strings, the hybrid method achieved a speedup of more than 300X, shown in the last row. The last two columns show the average time for a search and the average time for a pairwise comparison. The new method averaged 20 microseconds to find all approximate matches for a single string, and the average pairwise comparison, given the size of the search space, was 3.9 nanoseconds.

Table 26: Experimental results for 5,163 Census first names

Experiments on 5,163 Census First Names

Method                                                     MatchPairs  Time ms  Time sec  Time hr  Speedup  Avg Search   Avg Comp
Damerau-Levenshtein (DL)                                   24,481      32870    32.870    0.009    1.000    6.366453612  0.001233092
Prefix Pruned DL (PDL)                                     24,481      12496    12.496    0.003    2.630    2.420298276  0.000468778
Fast Filtered PDL (FPDL)                                   24,481      933      0.933     0.000    35.230   0.18070889   3.50008E-05
Length Filtered FPDL (LFPDL)                               24,481      819      0.819     0.000    40.134   0.158628704  3.07241E-05
Hashed DL (HDL)                                            24,481      31192    31.192    0.009    1.054    6.04144877   0.001170143
Hashed PDL (HPDL)                                          24,481      15823    15.823    0.004    2.077    3.064691071  0.000593587
Bitwise Neighborhood Hashed DL (GDL)                       24,481      939      0.939     0.000    35.005   0.181871005  3.52258E-05
Bitwise Neighborhood Hashed PDL (GPDL)                     24,481      457      0.457     0.000    71.926   0.08851443   1.7144E-05
Hybrid Sig/Neigh Hashed Length and Fast Filtered (GLFPDL)  24,481      104      0.104     0.000    316.058  0.020143328  3.90148E-06

Table 27 below shows the experiments for the last name list from the 2000 U.S. Census [U.S. Census 2000], which contains 151,671 last names. The last row is the result for the new method. On a larger set the results are much better. The experiments show over a 1,000X speedup and still produce the same matched pairs as Damerau-Levenshtein. The new method averaged 224 microseconds to find all approximate matches for a single string, and the average pairwise comparison, given the size of the search space, was 1.5 nanoseconds.

Table 27: Experimental results for 151,671 Census last names

Experiments on 151,671 Census Last Names

Method                                                     MatchPairs  Time ms   Time sec   Time hr  Speedup   Avg Search   Avg Comp
Damerau-Levenshtein (DL)                                   1,254,103   34106563  34106.563  9.474    1.000     224.8720124  0.00148263
Prefix Pruned DL (PDL)                                     1,254,103   14174836  14174.836  3.937    2.406     93.45778692  0.000616188
Fast Filtered PDL (FPDL)                                   1,254,103   1030502   1030.502   0.286    33.097    6.794324558  4.47965E-05
Length Filtered FPDL (LFPDL)                               1,254,103   721989    721.989    0.201    47.240    4.760231026  3.13852E-05
Hashed DL (HDL)                                            1,254,103   36398984  36398.984  10.111   0.937     239.9864443  0.001582283
Hashed PDL (HPDL)                                          1,254,103   16626852  16626.852  4.619    2.051     109.6244635  0.000722778
Bitwise Neighborhood Hashed DL (GDL)                       1,254,103   611113    611.113    0.170    55.811    4.029201363  2.65654E-05
Bitwise Neighborhood Hashed PDL (GPDL)                     1,254,103   274686    274.686    0.076    124.166   1.811064739  1.19407E-05
Hybrid Sig/Neigh Hashed Length and Fast Filtered (GLFPDL)  1,254,103   33987     33.987     0.009    1003.518  0.224083707  1.47743E-06

Table 28 below shows the results for street address experiments with lists containing 10,000 strings each. HDL fared much better because the address strings are much longer than names. The new method had a speedup of over 2,300 times and the average pairwise comparison was approximately 2.5 nanoseconds.

Table 28: Experimental results for 10,000 Philadelphia street addresses

Experiments on 10,000 Philadelphia Street Addresses

Method                                                     MatchPairs  Time ms  Time sec  Time hr  Speedup   Avg Search  Avg Comp
Damerau-Levenshtein (DL)                                   10,467      596831   596.831   0.166    1.000     59.683      5.97E-03
Prefix Pruned DL (PDL)                                     10,467      42924    42.924    0.012    13.904    4.292       4.29E-04
Fast Filtered DL (FDL)                                     10,467      177596   177.596   0.049    3.361     17.760      1.78E-03
Length Filtered DL (LDL)                                   10,467      223683   223.683   0.062    2.668     22.368      2.24E-03
Length Filtered FPDL (LFPDL)                               10,467      7581     7.581     0.002    78.727    0.758       7.58E-05
Hashed DL (HDL)                                            10,467      14131    14.131    0.004    42.235    1.413       1.41E-04
Bitwise Neighborhood Hashed DL (GDL)                       10,467      9394     9.394     0.003    63.533    0.939       9.39E-05
Hybrid Sig/Neigh Hashed Length and Fast Filtered (GLFPDL)  10,467      252      0.252     0.000    2366.499  0.025       2.52E-06

The timing model in Figure 41 below shows the average runtime for experiments from 50,000 addresses up to all 547,771 street addresses in the City of Philadelphia. The x-axis is the number of addresses in each of two lists and the y-axis is the runtime in milliseconds. The formula of the fitted curve is y = 2.08×10⁻⁶x² + 0.107x − 6628.47. The GLFPDL method was 3,000 times faster than DL for the experiment with all addresses. Using this method, it would take approximately 36 minutes to compare two lists, each containing 1,000,000 addresses. The average pairwise comparison, given the size of the search space, was two nanoseconds.

Figure 41: Timing model for 50,000 to all Philadelphia street addresses
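As a quick check of the 36-minute figure, the fitted curve can be evaluated directly; the coefficients below are copied from the timing model above, and the function name is ours.

```c
/* Fitted timing model: runtime in milliseconds as a function of the
   number of addresses x in each of the two compared lists. */
static double model_ms(double x) {
    return 2.08e-6 * x * x + 0.107 * x - 6628.47;
}
```

At x = 1,000,000 the quadratic term contributes about 2.08×10⁶ ms and the linear term about 1.07×10⁵ ms, for roughly 2.18×10⁶ ms in total, i.e. about 36 minutes.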

5.2.6 Conclusion for Hybrid Hash Method

The new FBF signature, combined with the hybrid hashing, filtering and pruning distance metric and with Bitwise Neighborhood Generation, provides a significant performance advantage over the previous LFPDL method: LFPDL achieved a maximum 130-times speedup, while the new method achieved a 3,000-times speedup.

5.3 Third Optimization Technique: Probabilistic Signature Hashing (PSH)

Our latest Damerau-Levenshtein string distance optimization is described below. The new method doubles the performance of GLFPDL by making use of character co-occurrence probabilities revealed within the data, which are used to generate a more efficient hash function. We have developed a methodology, the Probabilistic Signature Hash (PSH), that combines the capabilities of hashing, filtering and pruning with co-occurrence probabilities revealed by the data to substantially increase the speed of approximate string matching in datasets.

5.3.1 Hash Filtering Revisited

After some research, a method referred to as Signature Hashing was found in a methodology comparison report [Boytsov 2011]. This concept, albeit a different methodology, was modified and integrated into our composite method. A signature hashing method was developed based on our pruning and filtering methods. FBF is basically a 32-bit binary checklist of characters that appear in strings. These signatures can be compared very quickly using binary machine instructions. The new signature-based hashing reduces the signature to a 16-bit code, which is used as the hash bucket code. The method is deterministic: all exact match strings will be in the same bucket, and all strings that match by k edits will be in buckets whose codes differ by 2k or fewer bits. Our first attempt to hash signatures is shown in the example below. Figure 42 shows the signature for the address "1801 N BROAD ST" as was described in Signature filtering. Figure 43 shows the reduction of the 32-bit signature to a 16-bit hash code by combining the bits from the signature, where bits 0 and 1 are combined for hash bit 0, bits 2 and 3 for hash bit 1, …, and bits 30 and 31 for hash bit 15. This reduces the upper limit of the search index space from 2³² to 2¹⁶, which is a reasonable number of buckets for a hash table. This example shows an address string. This methodology was also applied to numeric and alphabetic strings. The experiments and results are discussed in subsequent sections.

Chars:     ABCDEFGH IQKLMNOP RUVWZY01 23456789   (J shares Q's bit; X shares Z's bit)
Signature: 11010000 00000110 10000011 00000010

Figure 42: Signature for "1801 N BROAD ST"

First bit of pair:  ACEGJKMORVY02468
Second bit of pair: BDFHQLNPUWZ13579
Hash code:          1100001110010001

Figure 43: Signature hash code for "1801 N BROAD ST"

Algorithm 13 shows the calculation of a simple hash code from a string signature. The 'A' and 'B' bits, bits 0 and 1 of the signature shown above, are the lowest order bits (the signature is shown inverted, with the lowest order bit first) and are combined to generate the lowest order bit in the hash code shown above. The algorithm combines each pair of bits, in order, into the hash code. The line with m ∧ 3 compares the two bits under consideration with the binary string "11" (decimal 3), which is true if either of the considered signature bits is '1'.

Algorithm 13: GetHashCode(m)
Input: m: 32-bit vector signature for string s
Output: h: 16-bit integer hash code
begin
    i = 0: 32-bit integer iterator
    b = 1: 32-bit integer for bit shift
    h = 0
    while i < 16
        if m ∧ 3
            h = h ∨ b
        end-if
        b = b ≪ 1
        m = m ≫ 2
        i = i + 1
    end-while
    return h
end
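Algorithm 13 maps directly to a few lines of C; this is a sketch, and the function name is ours.

```c
#include <stdint.h>

/* Fold a 32-bit FBF signature into a 16-bit hash code: hash bit i is
   the OR of signature bits 2i and 2i+1 (Algorithm 13). */
uint16_t get_hash_code(uint32_t m) {
    uint16_t h = 0;
    uint16_t b = 1;
    for (int i = 0; i < 16; i++) {
        if (m & 3u)   /* true if either of the two low signature bits is set */
            h |= b;
        b <<= 1;
        m >>= 2;
    }
    return h;
}
```

Because each hash bit is a disjunction of two signature bits, two signatures differing in 2k bits can differ in at most 2k hash bits, which is what makes the hash k-safe.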

We have also found that scanning all bucket codes for candidates is not necessary. A function can be used to more efficiently generate all approximate match buckets that are guaranteed to contain all strings that pass the FBF. This is because the hash coding is based on a disjunction of bits in the FBF signature. Only 137 buckets need be compared for an edit-distance k = 1, which, like signatures, relates to a 2k-bit difference in hash codes for approximate matching. This is much faster than scanning all 65,536 hash codes. It should be noted that increasing k will decrease performance. For k = 2, there will be 2,517 buckets. The number of buckets searched grows as Σ_{r=0}^{2k} C(16, r). The 137 buckets for k = 1 are:

 the exact match bucket code: C(16, 0) = 1
 the 1-bit differing codes: C(16, 1) = 16
 the 2-bit differing codes: C(16, 2) = 120

Algorithm 14 below shows the calculation of hash candidates.

Algorithm 14: GetHashCandidates(h)
Input: h: 16-bit hash code
Output: C*: Vector of 137 candidate hash codes
begin
    i = 0: 32-bit integer iterator
    j = 0: 32-bit integer iterator
    k = 1: 32-bit integer counter
    b1 = 1: 32-bit integer for bit shift
    b2 = 2: 32-bit integer for bit shift
    C_0 = h
    while i < 16
        C_k = h ⊕ b1
        k = k + 1
        j = i + 1
        while j < 16
            C_k = h ⊕ (b1 ∨ b2)
            k = k + 1
            b2 = b2 ≪ 1
            j = j + 1
        end-while
        b1 = b1 ≪ 1
        b2 = b1 ≪ 1
        i = i + 1
    end-while
    return C
end
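A C sketch of Algorithm 14's neighborhood enumeration (function name ours): it emits the exact code plus every one- and two-bit neighbor, 1 + 16 + 120 = 137 codes in total.

```c
#include <stdint.h>

/* Fill cand[] with the 137 bucket codes within two flipped bits of h
   (the k = 1 neighborhood): the exact code, the 16 one-bit neighbors
   and the 120 two-bit neighbors. Returns the count, always 137. */
int get_hash_candidates(uint16_t h, uint16_t cand[137]) {
    int n = 0;
    cand[n++] = h;                                   /* C(16,0) = 1   */
    for (int i = 0; i < 16; i++)
        cand[n++] = (uint16_t)(h ^ (1u << i));       /* C(16,1) = 16  */
    for (int i = 0; i < 15; i++)
        for (int j = i + 1; j < 16; j++)
            cand[n++] = (uint16_t)(h ^ (1u << i) ^ (1u << j)); /* C(16,2) = 120 */
    return n;
}
```

Scanning these 137 buckets replaces a scan of all 65,536 bucket codes, a roughly 478-fold reduction in bucket probes per search.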

To make the hash function algorithms more readable, we define a few data structures.

We define a structure, DATA, for a datum d as shown in Data Structure 1 below. The structure contains the string value, the 32-bit signature, the hash code and the length of the string value.

Data Structure 1: DATA
    *val;  // String value pointer
    sig;   // String signature
    bkt;   // Hash bucket code
    len;   // Length of string value

We define the DATA_ARRAY structure as shown in Data Structure 2 below. The structure contains an array of DATA elements, the count of DATA elements inserted, the pre-allocated length of the array and the type of data contained in the array.

Data Structure 2: DATA_ARRAY
    DATA *array;  // Array of DATA
    count;        // Count of entries in array
    length;       // Pre-allocated length of array
    type;         // Type of DATA (alphabetic, numeric or address)

We define the HASH_ARRAY structure as shown in Data Structure 3 below. It contains an array of pointers to hash bucket arrays, an array recording the count of insertions into each bucket, the pre-allocated length of each bucket and the configuration of the dynamic hash function.

Data Structure 3: HASH_ARRAY
    *array[65536];  // Array of hash buckets
    count[65536];   // Count of entries in each bucket
    length[65536];  // Allocated length of each bucket
    pr[16][2];      // Probabilistic pair configuration (stores the dynamic hash function)

Algorithm 15 shows the hash match algorithm.

Algorithm 15: HashMatch(s, st, l, k, H, D)
Input:
    s: string
    st: string type (alphabetic, numeric or address)
    l: 32-bit integer length of s
    k: 32-bit integer max edit-distance
    D: DATA_ARRAY
    H: HASH_ARRAY
Output:
    A*: d-length vector of matching string indices
    d: 32-bit unsigned integer count of matches
begin
    m = 0: 32-bit vector signature for string s
    h = 0: 16-bit hash code for signature m
    C*: hash bucket candidate array of length 137
    d = 0: count of matches
    (Set signature and hash code)
    if st is alphabetic
        m = SetAlphaBits2(s)
    else if st is numeric
        m = SetNumBits2(s)
    else if st is address
        m = SetAddrBits(s)
    end-if
    h = GetHashCode(m)
    C = GetHashCandidates(h)
    (Process all candidate buckets in C)
    ∀c ∈ C
        ∀ci ∈ c
            if abs(l − D[ci].len) ≤ (k + 1)
                if FindDiffBits(m, D[ci].bkt) ≤ 2
                    if PDL(s, D[ci].val, 1)
                        A_d = ci
                        d = d + 1
                    end-if
                end-if
            end-if
    return A, d
end

There was a problem with the organization of the strings into hash buckets using the previously described naïve hashing method. The hash table for Philadelphia addresses had 46,129 of 65,536 buckets empty, with a max bucket size of 6,513 strings (see Table 30 below in Experiments). The Probabilistic Pair Hash technique was developed to smooth the contents of the buckets without violating the k-safe rule. Smoother hash tables, with fewer empty buckets and fewer strings per non-empty bucket, may increase performance. The new method performs a frequency analysis of co-occurring bits that are set to 1 in the signatures. For addresses, the following characters are used for signature generation (note that Q is JQ and Z is XZ):

ABCDEFGHIQKLMNOPRUVWZY0123456789

Figure 44: Characters for address signature

The method makes a single pass over all of the addresses in the data set and counts the frequency of all possible enumerated pairs of signature bit indices (the example uses characters instead of bit indices to make it more readable):

Pair:  AB      AC      AD      …  78      79      89
Count: 338213  350587  380903  …  241567  232658  225143

Figure 45: Frequencies of co-occurring pairs Sort the enumerated pairs by frequency:


Pair:  JX     FX     FJ     …  E1      NR      EN
Count: 24759  54709  56896  …  476181  480206  485825

Figure 46: Frequency sorted co-occurring pairs

Start at the frequency closest to the mean and select next-best pairs where neither character has been picked in a previous pair. The pairs below were selected from the signatures for the street address data. Notice that the selected frequencies are relatively close.

Pair:  59      W4      LV      MO      B3      67      G0      DI
Count: 278361  279420  276807  276235  275005  272173  270427  286442

Pair:  H8      J2      RX      AF      Y1      KN      EP      CU
Count: 260221  299857  303564  326097  338785  351388  352392  170173

Figure 47: Pairs and frequencies selected for addresses

Algorithm 16 shows the creation of the probabilistic hash function. Our new hashing method generates a dynamic hash function based on the statistical properties of the data signatures. This is accomplished by recording the counts of the co-occurrence of bits for all string signatures in the dataset and selecting combinations that will reduce the maximum bucket size in the hash table. Fewer entries in a hash bucket locally reduce the number of string comparisons. Our method uses chaining to handle collisions, so some buckets may have many entries. The method counts the co-occurrence of character pairs in a string by examining the '1's in the binary signatures.

The signatures for "SMITH" and "ABLE" are shown below. For example, bit 0 of an alphabetic signature records that at least one 'A' is in a string and bit 1 records a 'B'. The count for "AB" in "SMITH" is 0, but the count in "ABLE" is 1, because 'A' and/or 'B' is contained within the string. It should be noted that all pairs that contain 'A' {AB, AC, AD, …, A31} or 'B' {BC, BD, BE, …, B31} will record the presence of either character. All counts of co-occurring characters in {AB, AC, AD, …, 30 31} are counted from the dataset. The character representation of signature elements is used to explain the method with an alphabetic string as an example. The system actually uses the indices of the bit positions {00 01, 00 02, 00 03, …, 29 30, 29 31, 30 31}, which enables the same methods to be used on all data types that use the aforementioned 32-bit signatures.

Chars:   ABCDEFGH IJKLMNOP QRSTUVWX YZ222334
"SMITH": 00000001 10001000 00110000 00000000
"ABLE":  11001000 00010000 00000000 00000000

Figure 48: Signatures for "SMITH" and "ABLE"

The mean of the counts vector is calculated and the counts are sorted in ascending order. The count closest to the mean is selected as the first best choice and recorded as the first bit in the dynamic hash function. The next count closest to the mean, where neither bit position has already been selected, is then chosen, and so on until all bit positions are assigned in the hash function.

Algorithm 16: GetHashFunction(D)
Input: D: DATA_ARRAY
Output: F**: probabilistic hash function, 16 by 2 array of integers
begin
    n* = 0: 32-length array of per-string bit flags
    used* = 0: 32-length array marking selected bit positions
    idx** = 0: 496 by 2 array of integers
    cnt* = 0: 496-length array of integers
    sum = 0: Integer accumulator
    avg = 0: Average
    bestIdx = 0: Record best index so far
    bestVal = 0: Record best value so far
    i = 0, j = 0, k = 0: Iterators
    m = 0: Signature
    (Count pair co-occurrences over all data string values in D)
    ∀d ∈ D
        n* = 0
        m = d.sig
        i = 0
        while i < 32
            if m ∧ 1
                n_i = 1
            end-if
            m = m ≫ 1
            i = i + 1
        end-while
        i = 0, k = 0
        while i < 31
            j = i + 1
            while j < 32
                if (n_i > 0) ∨ (n_j > 0)
                    cnt_k = cnt_k + 1
                    sum = sum + 1
                end-if
                idx_k,0 = i
                idx_k,1 = j
                k = k + 1
                j = j + 1
            end-while
            i = i + 1
        end-while
    end-for
    (Calculate the hash function F)
    avg = sum / 496
    sort(cnt, idx)    (ascending; idx is permuted along with cnt)
    bestIdx = 0
    bestVal = abs(avg − cnt_0)
    i = 1
    while i < 496
        if abs(avg − cnt_i) < bestVal
            bestIdx = i
            bestVal = abs(avg − cnt_i)
        end-if
        i = i + 1
    end-while
    F_0,0 = idx_bestIdx,0
    F_0,1 = idx_bestIdx,1
    used_F_0,0 = 1, used_F_0,1 = 1
    j = bestIdx − 1
    k = bestIdx + 1
    i = 1
    while i < 16
        (Scan downward for the next pair with both bit positions unused)
        while j ≥ 0
            if (used_idx_j,0 = 0) ∧ (used_idx_j,1 = 0)
                break
            else
                j = j − 1
            end-if
        end-while
        (Scan upward for the next pair with both bit positions unused)
        while k < 496
            if (used_idx_k,0 = 0) ∧ (used_idx_k,1 = 0)
                break
            else
                k = k + 1
            end-if
        end-while
        if j ≥ 0 ∧ k < 496
            (Pick whichever candidate count is closer to the mean)
            if abs(avg − cnt_k) ≥ abs(avg − cnt_j)
                F_i,0 = idx_j,0, F_i,1 = idx_j,1
                used_F_i,0 = 1, used_F_i,1 = 1
                j = j − 1
            else
                F_i,0 = idx_k,0, F_i,1 = idx_k,1
                used_F_i,0 = 1, used_F_i,1 = 1
                k = k + 1
            end-if
        else-if j ≥ 0
            F_i,0 = idx_j,0, F_i,1 = idx_j,1
            used_F_i,0 = 1, used_F_i,1 = 1
            j = j − 1
        else-if k < 496
            F_i,0 = idx_k,0, F_i,1 = idx_k,1
            used_F_i,0 = 1, used_F_i,1 = 1
            k = k + 1
        end-if
        i = i + 1
    end-while
    return F
end
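The pair-counting and mean-centered selection of Algorithm 16 can be sketched compactly in C. Names here are ours, and the explicit used[] bookkeeping is a simplification of the pseudocode's selection state; this is an illustration of the technique, not the dissertation's source.

```c
#include <stdint.h>
#include <stdlib.h>

/* One enumerated bit pair (i, j), i < j, with its co-occurrence count. */
typedef struct { int i, j; long cnt; } Pair;

static int cmp_pair(const void *a, const void *b) {
    long d = ((const Pair *)a)->cnt - ((const Pair *)b)->cnt;
    return (d > 0) - (d < 0);
}

/* Build a 16-entry dynamic hash function F[16][2] from n signatures:
   count, for each of the 496 bit pairs, how many signatures set either
   bit, then greedily pick pairs with counts nearest the mean whose bit
   positions are still unused (a sketch of Algorithm 16). */
void get_hash_function(const uint32_t *sig, long n, int F[16][2]) {
    Pair p[496];
    long sum = 0;
    int k = 0;
    for (int i = 0; i < 31; i++)                 /* enumerate all 496 pairs */
        for (int j = i + 1; j < 32; j++) {
            p[k].i = i; p[k].j = j; p[k].cnt = 0;
            k++;
        }
    for (long s = 0; s < n; s++) {               /* one pass over the data */
        uint32_t m = sig[s];
        k = 0;
        for (int i = 0; i < 31; i++)
            for (int j = i + 1; j < 32; j++) {
                if ((m >> i & 1u) || (m >> j & 1u)) { p[k].cnt++; sum++; }
                k++;
            }
    }
    qsort(p, 496, sizeof(Pair), cmp_pair);       /* ascending by count */
    double avg = (double)sum / 496.0;
    int used[32] = {0};
    int best = 0;                                /* pair closest to the mean */
    double bestVal = avg - (double)p[0].cnt; if (bestVal < 0) bestVal = -bestVal;
    for (int i = 1; i < 496; i++) {
        double v = avg - (double)p[i].cnt; if (v < 0) v = -v;
        if (v < bestVal) { best = i; bestVal = v; }
    }
    int lo = best, hi = best;
    for (int t = 0; t < 16; t++) {
        /* walk outward from the mean for the next pair with both bits unused */
        while (lo >= 0 && (used[p[lo].i] || used[p[lo].j])) lo--;
        while (hi < 496 && (used[p[hi].i] || used[p[hi].j])) hi++;
        int pick;
        if (lo < 0) pick = hi;
        else if (hi >= 496) pick = lo;
        else {
            double dl = avg - (double)p[lo].cnt; if (dl < 0) dl = -dl;
            double dh = avg - (double)p[hi].cnt; if (dh < 0) dh = -dh;
            pick = (dh >= dl) ? lo : hi;
        }
        F[t][0] = p[pick].i; F[t][1] = p[pick].j;
        used[p[pick].i] = 1; used[p[pick].j] = 1;
        if (pick == lo) lo--; else hi++;
    }
}
```

Since all 496 pairs are enumerated, the greedy selection always finds 16 disjoint pairs, so the resulting function covers each of the 32 signature bit positions exactly once.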

Algorithm 17 shows the calculation of the probabilistic hash bucket code from the hash function H.pr. Note that H.pr is the array F returned from Algorithm 16. A pass over the data converts the signatures into their respective dynamic hash codes, which are stored in the DATA elements. We call this a dynamic hash code because it can be updated to best represent the current statistical properties of the data, ensuring near-optimal performance, as is shown in our experiments.

Algorithm 17: GetHashCodeF(m, H)
Input:
    m: 32-bit vector signature for string s
    H: HASH_ARRAY
Output: h: 16-bit integer hash code
begin
    i = 0: 32-bit integer iterator
    h = 0
    while i < 16
        if ((m ∧ 2^(H.pr_i,0)) > 0) ∨ ((m ∧ 2^(H.pr_i,1)) > 0)
            h = h ∨ 2^i
        end-if
        i = i + 1
    end-while
    return h
end
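A C rendering of Algorithm 17 (a sketch; the function name is ours): pr is the 16-by-2 table of selected signature bit positions produced by the hash-function construction.

```c
#include <stdint.h>

/* Probabilistic hash code (Algorithm 17): hash bit i is set when either
   of the two signature bit positions chosen by the dynamic hash
   function, pr[i][0] and pr[i][1], is set in the signature m. */
uint16_t get_hash_code_f(uint32_t m, int pr[16][2]) {
    uint16_t h = 0;
    for (int i = 0; i < 16; i++) {
        if ((m >> pr[i][0] & 1u) || (m >> pr[i][1] & 1u))
            h |= (uint16_t)(1u << i);
    }
    return h;
}
```

With the identity pairing pr[i] = {2i, 2i+1} this degenerates to the naïve folding hash; the point of the dynamic table is that the pairing adapts to the data's bit statistics while preserving the same disjunction structure, and therefore the same k-safety.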

The improved hash matching function is shown below in Algorithm 18. The subtle difference is the call to the GetHashCodeF(푚, 퐻) function to get the probabilistic hash code based on the hash function stored in the HASH_ARRAY.

Algorithm 18: HashMatchF(s, st, l, k, H, D)
Input:
    s: string
    st: string type (alphabetic, numeric or address)
    l: 32-bit integer length of s
    k: 32-bit integer max edit-distance
    D: DATA_ARRAY
    H: HASH_ARRAY
Output:
    A*: d-length vector of matching string indices
    d: 32-bit unsigned integer count of matches
begin
    m = 0: 32-bit vector signature for string s
    h = 0: 16-bit hash code for signature m
    C*: hash bucket candidate array of length 137
    d = 0: count of matches
    (Set signature and hash code)
    if st is alphabetic
        m = SetAlphaBits2(s)
    else if st is numeric
        m = SetNumBits2(s)
    else if st is address
        m = SetAddrBits(s)
    end-if
    h = GetHashCodeF(m, H)
    C = GetHashCandidates(h)
    (Process all candidate buckets in C)
    ∀c ∈ C
        ∀ci ∈ c
            if abs(l − D[ci].len) ≤ (k + 1)
                if FindDiffBits(m, D[ci].bkt) ≤ 2
                    if PDL(s, D[ci].val, 1)
                        A_d = ci
                        d = d + 1
                    end-if
                end-if
            end-if
    return A, d
end

5.3.2 Experiments for PSH

The data for the experiments was the 547,771-address Philadelphia dataset mentioned above. The string comparators used in the experiments include:

1. Damerau-Levenshtein (DL)

2. FBF Filtered PDL (FPDL)

3. Length then FBF Filtered PDL (LFPDL)

4. Hybrid Signature Neighborhood Hashed Length and Fast Filtered PDL (GLFPDL)

5. Probabilistic Hybrid Signature Neighborhood Hashed Length and Fast Filtered PDL (PGLFPDL)


The experiments were run 5 times and their average was recorded as the result. The test computer is an Acer Aspire 7745G notebook with a 64-bit Intel i7-720QM processor, 16 GB PC1333 RAM and the Windows 7 Ultimate 64-bit operating system. The code was written and compiled in 32-bit GCC.

5.3.3 Results for PSH

Non-Probabilistic Hashed Method

The experimental results are shown in Table 29 below. The composite method with hashing is almost 1,800 times faster than the baseline DL method. The Pair Time column contains the average comparison time in milliseconds for a single string pair.

Table 29: Experiments on 5,000 street addresses with GLFPDL hashing

Ad       Type1  Type2  Time ms  Speedup   Pair Time
DL       120    0      135,099  1.00      5.40E-03
FPDL     120    0      1,697    79.60     6.79E-05
LFPDL    120    0      1,033    130.82    4.13E-05
GLFPDL   120    0      76       1,777.62  3.04E-06

Probabilistic Hashed Method

An analysis of the first naïve hash method found most hash buckets empty. The effects of this method are shown in the table below. The AdCle column is the hash table analysis for the naïve method, showing 46,129 empty buckets and a max bucket size of 6,513. The Probabilistic Pair method in the AdCleP column decreases the empty buckets by 18,457 to 27,672 and the max bucket size by 5,429 to 1,084 entries. The last two columns are for the error-injected street addresses, which have more variance in their data due to the random errors, resulting in less variance in individual hash buckets. Smaller bucket sizes mean less local string comparison and better performance.


Table 30: Bucket statistics for addresses

        AdCle     AdCleP    AdErr     AdErrP
Zero    46129     27672     27403     14658
Max     6513      1084      4725      853
Mean    8.358322  8.358322  8.358322  8.358322
StDev   56.08403  23.3084   43.01646  19.11679

The performance results for 5,000 randomly selected addresses are shown in Table 31 below. The new method is more than 2,800 times faster than the baseline DL method and produces the same results.

Table 31: Experiments on 5,000 addresses with PGLFPDL probabilistic hashing

Ad        Type1  Type2  Time ms  Speedup   Pair Time
DL        120    0      135,099  1.00      5.40E-03
FPDL      120    0      1,697    79.60     6.79E-05
LFPDL     120    0      1,033    130.82    4.13E-05
GLFPDL    120    0      76       1,777.62  3.04E-06
PGLFPDL   120    0      48       2,814.56  1.92E-06

The same experiments were performed with bigger data, containing 25,000 address strings in each dataset. The results are below and show that the new method is much more effective on larger datasets, at more than 4,500 times faster than DL.

Table 32: Experiments on 25,000 addresses with PGLFPDL probabilistic hashing

Ad        Type1  Type2  Time ms    Speedup   Pair Time
DL        3,021  0      4,217,067  1.00      6.75E-03
FPDL      3,021  0      94,307     44.72     1.51E-04
LFPDL     3,021  0      46,985     89.75     7.52E-05
GLFPDL    3,021  0      1,543      2,733.03  2.47E-06
PGLFPDL   3,021  0      920        4,583.77  1.47E-06

The next set of experiments has 125,000 address strings per input file. *Note that the time for DL was calculated from the previous experiment because it would take too long to complete. PGLFPDL is now over 5,000 times faster than DL.


Table 33: Experiments on 125,000 addresses with PGLFPDL probabilistic hashing

Ad        Type1   Type2  Time ms      Speedup   Pair Time
DL*       67,392  0      105,426,675  1.00      6.75E-03
FPDL      67,392  0      2,421,357    43.54     1.55E-04
LFPDL     67,392  0      1,116,077    94.46     7.14E-05
GLFPDL    67,392  0      36,710       2,871.88  2.35E-06
PGLFPDL   67,392  0      20,928       5,037.59  1.34E-06

The next set of experiments has all 547,771 Philadelphia address strings per input file. *Note that the times for DL, FPDL and LFPDL were calculated from the previous experiment because they would take too long to complete. PGLFPDL is now nearly 6,000 times faster than DL.

Table 34: Experiments on all 547,771 addresses with PGLFPDL probabilistic hashing

Ad        Type1    Type2  Time ms        Speedup   Pair Time
DL*       955,992  0      2,024,550,229  1.00      6.75E-03
FPDL*     955,992  0      46,498,278     43.54     1.55E-04
LFPDL*    955,992  0      21,432,469     94.46     7.14E-05
GLFPDL    955,992  0      606,182        3,339.84  2.02E-06
PGLFPDL   955,992  0      341,693        5,925.06  1.14E-06

Comparison to Trie-Join

The experiments in Table 35 below compare the new method to a recently published string comparison optimization called Trie Join and a variant, Bi Trie Join [Wang, Feng and Li 2010]. The software for the Trie Join methods was obtained from the authors' project Website. The experiments were run five times for each method and their average runtime in milliseconds is shown below. The Type1 and Type2 error columns are calculated differently for these experiments: the ground truth is the DL edit distance algorithm's string matching results for a single edit. Notice that the Trie methods have Type2 errors. The Trie Join method was able to find 87.5% and Bi Trie Join 90.9% of the strings that DL found. There are no false positives for any of the three methods. Per the Speedup column, Trie Join is approximately 2.62 times faster and Bi Trie Join approximately 3.28 times faster than PGLFPDL. The peak working set memory for each method is shown in the last columns in kilobytes. Trie Join required 3.57 times and Bi Trie Join 6.26 times the RAM required by PGLFPDL. The amount of memory used for an optimization is an important consideration for big data problems.

Table 35: Comparison of PGLFPDL and Trie Join with all 547,771 addresses

Ad        Type1  Type2   Time ms   Speedup  Pair Time  Memory K  Memory R
PGLFPDL   0      0       216889.4  1.00     7.23E-07   76,900    1.00
TJ        0      188029  82660.4   2.62     2.75E-07   274,600   3.57
BTJ       0      136610  66162.8   3.28     2.21E-07   481,604   6.26

5.3.4 Conclusion for PSH

The composite probabilistic hashing methodology is primarily designed for the Record Linkage (RL) problem for very large databases. The new method and the Trie Join methods are all thousands of times faster than DL. However, due to the size of the data problem and the limited information available for matching demographic data records, we suggest that our new method is more appropriate for comparing record fields that may contain minor spelling errors. The DL algorithm has been found to match approximately 80% of data entry errors. Our new method is both more compact and k-safe, providing the same match results as DL but nearly 6,000 times faster. Modern big databases have the potential for substantial missing and incorrect field entries. We found 40% of Social Security Numbers missing from a real health and social services dataset and cannot be sure how many contain entry errors. Faster edit-distance comparison of database fields is crucial to the record linkage process. Record linkage is necessary to link a client's data across multiple heterogeneous databases to aid health and social service workers in efficiently and effectively serving their communities.

Figure 49: k-Safe Sets (nested candidate sets: Complete Dataset ⊇ Data From Hashing ⊇ Data From Length Filter ⊇ Data From Signature Filter ⊇ Data From Edit-Distance Match)

The hashing process is k-safe because the hash codes are generated by a disjunction that combines bits from the signature into the hash code, which dilates the signature candidate set and will contain all candidates within 2k differing bits. Length filtering is k-safe because strings whose lengths differ by more than k characters cannot differ by k or fewer edits. As previously discussed, the signature filter is k-safe. Prefix pruning is k-safe because the path in the DL matrix that contains all matches within k edits will be contained within the 2k+1 stripe on the diagonal. Because all of these methods are k-safe, meaning that they do not filter out true positive matches, their composite is k-safe. These four techniques progressively reduce computing complexity without harming comparison accuracy (Figure 49). Prefix Pruning produces exactly the same results as DL. The Signature Filter is mathematically proven not to filter out strings that would pass DL for k or fewer edits and represents a dilation of the set of strings that pass DL. The Length Filter is guaranteed to pass all strings that will pass DL. It is also mathematically proven that the results returned from the Bitwise Signature Hash function will contain all strings that will pass the Bitwise Signature Filter, because the hash function is based on the signature and the hash set represents a dilation of the set of strings that will pass the signature-based filter. Since all of these optimizations pass all strings in a dataset within k or fewer edits of a given search pattern, their composite will also pass all strings in a dataset within k or fewer edits. We refer to this quality of an optimization as being k-safe.

PSH, like most edit-distance optimizations, is most efficient for k = 1. Increasing the value of k will substantially increase both the time to find approximate matches and the number of approximate matches. The ALIM system uses string relationship graphs that contain edges recording exactly how strings are related. The FNLnk edge would record that "JIM" is an alias of "JAMES", that "KIM" is one edit different from "JIM", and that "JEM" has the same Soundex code as "JIM". The edge would link all alias, approximate (k = 1) and phonetic matches for all observed strings. For RL, we do not suggest increasing to k > 1, because of the potential for false positives. Consider that "JIM", "JOE" and "JEN" are all 2 edits different from each other but are distinctly different names. PSH was designed for use in ALIM to handle minor spelling errors. Most demographic data used in RL are relatively short strings.
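The k = 1 versus k = 2 distinction is easy to check with a small edit-distance routine. The sketch below (function name ours) implements the optimal string alignment variant of Damerau-Levenshtein, covering substitutions, insertions, deletions and adjacent transpositions; it is a simplification, not the dissertation's PDL implementation.

```c
#include <string.h>

/* Optimal string alignment variant of Damerau-Levenshtein distance. */
int dl_distance(const char *a, const char *b) {
    int la = (int)strlen(a), lb = (int)strlen(b);
    int d[32][32]; /* short demographic strings assumed */
    for (int i = 0; i <= la; i++) d[i][0] = i;
    for (int j = 0; j <= lb; j++) d[0][j] = j;
    for (int i = 1; i <= la; i++)
        for (int j = 1; j <= lb; j++) {
            int c = d[i-1][j-1] + (a[i-1] != b[j-1]);   /* match/substitute */
            if (d[i-1][j] + 1 < c) c = d[i-1][j] + 1;   /* delete           */
            if (d[i][j-1] + 1 < c) c = d[i][j-1] + 1;   /* insert           */
            if (i > 1 && j > 1 && a[i-1] == b[j-2] && a[i-2] == b[j-1]
                && d[i-2][j-2] + 1 < c)
                c = d[i-2][j-2] + 1;                    /* transpose        */
            d[i][j] = c;
        }
    return d[la][lb];
}
```

"JIM", "JOE" and "JEN" each come back at distance 2 from one another, so widening the threshold to k = 2 would link these distinct names, while "KIM" stays at distance 1 from "JIM".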


CHAPTER 6

6 STRING AND RECORD RELATIONSHIP GRAPHS

RL systems need to organize and store data efficiently and effectively to ensure that the system produces correct results in a timely manner. [Ravikumar and Cohen 2004] and [Kardes, Konidena, Agrawal, Huff and Sun 2013] use graphs in their decision models to find records that approximately match. We use graphs in the data model to accommodate changing temporal data. To store data, we introduce the concept of a String Relationship Graph. Strings can be related in different ways. "TIM" and "JIM" are spelled approximately alike because they differ by one edit. "SMITH" and "SMYTHE" are phonetic matches because they sound alike. "JAMES" and "JIM" are aliases because they are synonymous terms that may refer to the same person. These relationships are very important for RL systems. Figure 50 on the left below shows a String Relationship Graph for first names. The First Name vertex represents the names and the FNLink edge represents how any two given names in the graph may be related. These are represented as edge and vertex lists in our model. The small square containing the 'H' is a hash table that enables fast hash searches when inserting new data or attempting to find strings in the graph. Figure 50 in the center is the C code to define a datum in the graph. Notice that the FBF signature (sig), the hash bucket code (bkt), the Soundex code (sdx) and the length of the string (len) are stored along with the string value (val). These vertices are stored in an array. Storing this data uses a little more memory but saves time over computing it on-the-fly for small data. As the data grows, this method of storage will compress the data by eliminating redundancy. Figure 50 on the right is an edge list: the array1 and array2 arrays contain the indices of the two related strings from the data array. The index in array1 is always numerically less than the index in array2. The sorted1 and sorted2 arrays comprise a bidirectional sorted index that enables fast binary searches on the links between related strings. The match variable is a binary checklist that records how two strings are related. The count and length variables record the number of links inserted and the pre-allocated space for future inserts. The mode variable records whether the link array is for a data-to-data link or a record-to-data link, which is discussed below. There are two scenarios when searching the graph: one where the search term is in the graph and one where it is not. If the search term is in the graph, all related data are extracted and returned from the FNLnk edge; no string comparison is necessary because these relationships are stored and retrieved. If the search term is not in the graph, PGLFPDL is used to quickly find all approximate matches and the Soundex codes are compared to find all phonetic matches. PGLFPDL is also used during insert operations into the data graph.


Figure 50: String relationship graph and structures

Figure 51 below is an example of a populated graph, which contains alias matches, approximate matches for k ≤ 1 and phonetic matches as found by the American Soundex. Notice that some string pairs are related in more than one way. The idea is that each unique string pair only needs to be compared once. If a second occurrence of a string is inserted, no comparison is necessary. This model facilitates the removal of all inter-record redundancy for a given field in the table.


Figure 51: String relationship example

To make this system useful for RL, a Record graph is linked to the data graph as shown in Figure 52 below on the left. The Record vertex mostly contains data that refers back to the database and record(s) in which the data strings in the First Name vertices are stored, allowing the linkage system to generate groups of database/records that match under the RL rules. This facilitates query searches and the generation of reports that use the linked data. Figure 52 on the right shows the definition of a Record vertex. The agency represents the agency database that contains the person's data and the agency_id is the unique identifier within the database for that person. The person's gender is stored with the record because a person should have only one, and the only acceptable values are 'F' and 'M'. The RecLnk edge stores data about how records are related. It uses the same data structure as the data graph except that, in record mode, array1 refers to the record and array2 refers to the data. Person A may be a sibling, child or parent of person B. This is useful for family linkage, which essentially combines RL with social networks. The RecFN edge connects each Record with its corresponding data. This structure allows multiple Records to be associated with a single datum or multiple data values for a field to be associated with each Record. A date can be added to RecFN to allow temporal sorting of data. If a woman has been married more than once, every last name can be sorted as it has changed over time. The same can be said for addresses and phone numbers, which are also likely to change over time.
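A Record vertex and a dated record-to-data link might look like the following sketch. The agency, agency_id and gender fields follow the text; the DatedLink structure and its date encoding are assumptions added to illustrate the temporal sorting idea.

```c
#include <stdlib.h>

/* Sketch of a Record vertex: source database plus its unique identifier. */
typedef struct {
    int  agency;     /* the agency database holding the person's data */
    long agency_id;  /* unique identifier within that database        */
    char gender;     /* 'F', 'M', or '\0' when missing                */
} RecordVertex;

/* A RecFN-style link with a date attached, enabling temporal sorting of
   values such as successive last names. The date encoding (e.g. days
   since some epoch) is an assumption. */
typedef struct {
    int  record_idx; /* index into the Record vertex array */
    int  datum_idx;  /* index into the data vertex array   */
    long date;       /* when this value was observed       */
} DatedLink;

/* qsort comparator: order one record's values chronologically. */
int dated_link_cmp(const void *x, const void *y) {
    const DatedLink *a = x, *b = y;
    return (a->date > b->date) - (a->date < b->date);
}
```

Sorting a record's RecFN entries with this comparator yields the value history (for example, a sequence of last names) in chronological order.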


Figure 52: Record relationship graph and structure

Figure 53 below shows that record index 100 is associated with "James" and that "James" is related to other data within the graph.


Figure 53: Record and relationship graph example

Figure 54 below shows a set of record vertices that are associated with compound records from agencies 1 and 2. Notice that the RecLnk edges between the record vertices act as a social network graph embedded into the Record graph. This can facilitate family linkage and the use of family members' data as a context to increase confidence and accuracy for the RL process. The "Keep together" and "Keep apart" edges are human-entered corrections that keep records known to be the same person together and keep apart records that are known to be from different people. This allows the domain and other experts to share their knowledge and experience about the data with the RL system.


Figure 54: Record graph and associated compound records


CHAPTER 7

7 AGGREGATE, LINK AND ITERATIVE MATCH (ALIM)

Aggregate, Link and Iterative Match (ALIM) is an in-memory, temporal data graph-based RL system designed to efficiently and effectively process the temporal demographic (and possibly other) data that is generally observed in health and social sciences databases.

7.1 ALIM Prototype 1

The first prototype was used during the final experiments run for the CARES assessment.

There were some memory management issues and a slight false-positive RL matching problem that were fixed in the second prototype. Figure 55 shows the basic schema for ALIM Prototype 1. The model shows 6 vertices (large outer circles, which contain the data string value and information used for approximate and phonetic matching) that contain demographic data and one node that defines the record (large center circle, which contains the origin agency, the agency's UID and the client's gender). The edges (small circles with leaders) record how data strings are related and link the data with the records.



Figure 55: ALIM Prototype 1

7.1.1 Experiments for ALIM Prototype 1

We created synthetic records for our RL experiments. There were 11 datasets containing 10,000 records each. The records match by index, meaning that record n in one dataset matches record n in all datasets. These were constructed as clean records, and errors, inconsistencies and missing data were then injected to mimic the condition of the health department's real data. The synthetic records were constructed randomly using the data used in the string comparison experiments. The data are:

1. 5,163 Census first names (FN)

2. 151,670 Census last names (LN)

3. 547,771 local addresses (Ad)

4. 12,000 synthetic phone numbers (Ph)

5. 35,525 random birthdates (Bi)


6. 12,000 synthetic Social Security Numbers (SSN)

There were 5,000 male and 5,000 female records. First names were chosen to match gender. The records were constructed with the following fields:

1. First Name

2. Last Name

3. Address

4. Phone

5. Gender

6. Birth Date

7. Social Security Number

The test computer is an Acer Aspire 7745G notebook with a 64-bit Intel i7-720QM processor, 16 GB of PC1333 RAM and the Windows 7 Ultimate 64-bit operating system. The code was written and compiled in 32-bit GCC.

7.1.2 Results for ALIM Prototype 1

The experimental results for this model are shown in Table 37 below. The best TP accuracy was obtained in the middle column, A8S4I3Al, for the synthetic CARES data. The original CARES method had 61,705 true positives and 0 false positives, resulting in a very low 12.637% sensitivity. The best ALIM result had 120,132 true positives, 967 false positives, and a 27.884% sensitivity. The column titles are described in Table 36 below. OCM is the method used by the local health department. The A8 as the first 2 characters means that 8 points were added for an approximate match on the first name, last name, address and phone fields. The S4 means that 4 points were added for a Soundex match on the first name and last name fields. I2 means that 2 match iterations were performed and I3 means that 3 were performed. An iteration is basically a re-matching of a record with the results of the previous match incorporated into the query exemplar. The Al on the end of the column title means that the alias match function was active.

Table 36: Experimental configurations

OCM       Original CARES Method
A8S4I2Al  2 iterations and alias search active
A8S4I3Al  3 iterations and alias search active
A8S4I2    2 iterations and alias search not active
A8S4I3    3 iterations and alias search not active

Table 37: Previous experiments

                 OCM            A8S4I2Al       A8S4I3Al       A8S4I2         A8S4I3
True positives   61,705         117,266        120,132        117,011        119,896
False negatives  488,295        432,734        429,868        432,989        430,104
False positives  0              246            967            286            1,041
True negatives   6,049,395,000  6,049,394,754  6,049,394,033  6,049,394,714  6,049,393,959
Accuracy         99.992%        99.993%        99.993%        99.993%        99.993%
Sensitivity      12.637%        27.083%        27.884%        27.006%        27.809%
Specificity      100.000%       100.000%       100.000%       100.000%       100.000%
S_Accuracy       56.318%        63.542%        63.942%        63.503%        63.904%
Read time        0:00:01        0:08:11        0:08:08        0:07:26        0:07:58
Match time       0:50:28        0:16:33        0:20:46        0:04:21        0:06:04
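The metrics reported in these tables follow the standard confusion-matrix definitions, with S_Accuracy defined later in the text as the mean of sensitivity and specificity. A minimal sketch of those formulas:

```c
/* Standard confusion-matrix metrics used in the result tables.
   tp/fn/fp/tn are true/false positives/negatives. */
double sensitivity(double tp, double fn) { return tp / (tp + fn); }
double specificity(double tn, double fp) { return tn / (tn + fp); }
double accuracy(double tp, double tn, double fp, double fn) {
    return (tp + tn) / (tp + tn + fp + fn);
}
/* S_Accuracy: the sensitivity/specificity average reported in the tables. */
double s_accuracy(double sens, double spec) { return (sens + spec) / 2.0; }
```

With the enormous true-negative counts in these experiments, plain accuracy saturates near 100% for every method, which is why the balanced S_Accuracy is the more informative column.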

7.2 ALIM Prototype 2

The String Relationship Graph as described above is a crucial part of ALIM version 2.

The second prototype's data model was updated to be more compact, accurate and useful. The new model differs from the old model in the following ways:

1. New standardized data and link nodes


2. Only one relationship link per data node for alias, approximate and phonetic matches

3. Relationship links for records allow inclusion of family relationships and social networks

4. New length filter, FBF and PDL optimization (LFPDL) for edit distance

5. New unmatchable record optimization eliminates unnecessary RL comparison

6. New index optimization for existing records (binary searching)

7. Memory leak fixed

8. Link index problem fixed that harmed previous ALIM results

9. SSN and Birthdate both missing optimization (required fields for RL matching)

Figure 56 below shows a schema for the new model. The indexing system was also made more efficient. The LFPDL string comparison substantially decreases data insertion time for the data graphs. Data relationship links were aggregated into one link, and a link was added to the Record node to store Record relationships. The *Lnk edges contain all alias, approximate and sound match data in the bits of an unsigned char. The RecLnk can contain data that links records with other records that are related (i.e., parents, siblings, spouses, records known to match or not match, etc.). This will allow family matching and the use of family data for linking entities. We forgo an elaborate description of these prototypes and will describe all the features of the newest prototype in the Proposed Methodology section.
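Packing the three relationship types into one unsigned char is naturally done with bit flags. The specific bit positions below are assumptions; the text states only that alias, approximate and sound evidence share the byte.

```c
/* Hypothetical bit layout for a *Lnk match byte. */
enum {
    LNK_ALIAS  = 1u << 0, /* nickname/alias table match     */
    LNK_APPROX = 1u << 1, /* within the edit-distance bound */
    LNK_SOUND  = 1u << 2  /* identical Soundex codes        */
};

/* Record one more kind of evidence for a string pair. */
unsigned char link_set(unsigned char m, unsigned char flag) { return m | flag; }

/* Query whether a given kind of evidence was recorded. */
int link_has(unsigned char m, unsigned char flag) { return (m & flag) != 0; }
```

A pair such as "Jem"/"Jim", related as alias, approximate and sound match at once, needs only one byte rather than three separate link entries.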



Figure 56: ALIM Prototype 2

7.2.1 Experiments for ALIM Prototype 2

The data for these experiments were the same as used for ALIM Prototype 1. The test computer is an Acer Aspire 7745G notebook with a 64-bit Intel i7-720QM processor, 16 GB of PC1333 RAM and the Windows 7 Ultimate 64-bit operating system. The code was written and compiled in 32-bit GCC.

7.2.2 Results for ALIM Prototype 2

The new model's results are below. True positives increased to 166,187 (+46,055) and false positives dropped to 0 when compared to the results for ALIM Prototype 1. The sensitivity increased to 30.216% (+2.332%). The new model's match time increased from 20 min 46 sec to 45 min 19 sec, but the read/build time is the same. The unmatchable record optimization's effects are shown in the second column. A record is considered unmatchable if it cannot match itself under a given match rule. There are 29,683 unmatchable records in the 11 datasets, which resulted in 256,805 false negatives due to lost unmatchable record pairs. If unmatchable record pairs are considered true negatives, the sensitivity increases to 56.681%. ALIM records unmatchable pairs in the hope that more data will arrive in the future to change their state to matchable.

Table 38: New experiments with unmatchable records

                   ALIMv2         With UM
True positives     166,187        166,187
False negatives    383,813        127,008
False positives    0              0
True negatives     6,049,395,000  6,049,651,805
Accuracy           99.994%        99.998%
Sensitivity        30.216%        56.681%
Specificity        100.000%       100.000%
S_Accuracy         65.105%        78.340%
Read time          0:08:11        0:08:11
Match time         0:45:19        0:45:19
Unmatchable Recs   29,683         29,683
Unmatchable Pairs  256,805        256,805

The 56% sensitivity may seem low, but this dataset was made to model the missing data in the CARES system. This data does not have aggregated records: each dataset has only one record per entity. Aggregated records will make ALIM results much more accurate when compared to methods that ignore the dependence between some records.

The new results show significant improvement over the previous results. The results below are the old results from the ALIM report to the Dept. of Health. The experiment compared 220,000 synthetic records for 110,000 agency/entities. There were two records for each entity in each agency. ALIM really shows its efficiency and effectiveness in these results. The table on the left shows the results for the existing CARES method (OCM) and the table on the right shows the original ALIM results. Since OCM treats records as independent, there are 4 times as many pairwise comparisons.

154

Table 39: Experimental results for OCM and ALIM

                     OCM             A4S0
Possible pairs       24,199,890,000  6,049,945,000
Possible matches     2,310,000       550,000
Possible nonmatches  24,197,580,000  6,049,395,000
True positives       260,967         392,130
False negatives      2,049,033       157,870
False positives      0               53
True negatives       24,197,580,000  6,049,394,947
Accuracy             99.992%         99.997%
Sensitivity          11.297%         71.296%
Specificity          100.000%        100.000%
S_Accuracy           55.649%         85.648%
Read time            0:00:02         0:40:52
Match time           3:09:15         0:31:41

The new ALIM 2 results are shown below and are a significant improvement. The A4S0I2Al configuration is the same basic configuration as A4S0 above. There are 84,008 more true positive matches and 48 fewer false positives. Also note that the increased accuracy with respect to OCM has no runtime cost; it is actually faster even with multiple match iterations. This is due to the elimination of redundant and unnecessary comparisons of data fields and records. It appears that the experiments with 2 iterations yield the most true positives.

Table 40: ALIM v2 results

                     Less restrictive points                       More restrictive points
                     A8S4I3Al       A8S4I2Al       A8S4I1Al       A4S0I3Al       A4S0I2Al       A4S0I1Al
Possible pairs       6,049,945,000  6,049,945,000  6,049,945,000  6,049,945,000  6,049,945,000  6,049,945,000
Possible matches     550,000        550,000        550,000        550,000        550,000        550,000
Possible nonmatches  6,049,395,000  6,049,395,000  6,049,395,000  6,049,395,000  6,049,395,000  6,049,395,000
True positives       467,862        476,138        431,594        467,862        476,138        431,593
False negatives      82,138         73,862         118,406        82,138         73,862         118,407
False positives      34             5              0              34             5              0
True negatives       6,049,394,966  6,049,394,995  6,049,395,000  6,049,394,966  6,049,394,995  6,049,395,000
Accuracy             99.999%        99.999%        99.998%        99.999%        99.999%        99.998%
Sensitivity          85.066%        86.571%        78.472%        85.066%        86.571%        78.471%
Specificity          100.000%       100.000%       100.000%       100.000%       100.000%       100.000%
S_Accuracy           92.533%        93.285%        89.236%        92.533%        93.285%        89.236%
Read time            42:47          44:18          44:02          43:42          43:12          44:11
Match time           1:22:27        1:05:48        0:25:22        1:22:36        1:05:10        0:25:47
Total time           2:05:14        1:50:06        1:09:24        2:06:18        1:48:22        1:05:58


The table below shows the results from the previous table but counts unmatchable record pairs as true negatives rather than false negatives. Note the increase in sensitivity. The unmatchability of a record is decided based on the missing data in a given record.

Table 41: ALIM v2 results with unmatchable records

                     Less restrictive points                       More restrictive points
                     A8S4I3Al       A8S4I2Al       A8S4I1Al       A4S0I3Al       A4S0I2Al       A4S0I1Al
Possible pairs       6,049,945,000  6,049,945,000  6,049,945,000  6,049,945,000  6,049,945,000  6,049,945,000
Possible matches     550,000        550,000        550,000        550,000        550,000        550,000
Possible nonmatches  6,049,395,000  6,049,395,000  6,049,395,000  6,049,395,000  6,049,395,000  6,049,395,000
True positives       467,862        476,138        431,594        467,862        476,138        431,593
False negatives      48,226         39,950         84,494         48,226         39,950         84,495
False positives      34             5              0              34             5              0
True negatives       6,049,428,878  6,049,428,907  6,049,428,912  6,049,428,878  6,049,428,907  6,049,428,912
Accuracy             99.999%        99.999%        99.999%        99.999%        99.999%        99.999%
Sensitivity          90.655%        92.259%        83.628%        90.655%        92.259%        83.628%
Specificity          100.000%       100.000%       100.000%       100.000%       100.000%       100.000%
S_Accuracy           95.328%        96.130%        91.814%        95.328%        96.130%        91.814%
Read time            42:47          44:18          44:02          43:42          43:12          44:11
Match time           1:22:27       1:05:48        0:25:22        1:22:36        1:05:10        0:25:47
Total time           2:05:14        1:50:06        1:09:24        2:06:18        1:48:22        1:05:58
Unmatchable Recs     3,448          3,448          3,448          3,448          3,448          3,448
Unmatchable Pairs    33,912         33,912         33,912         33,912         33,912         33,912

7.3 ALIM Prototype 3

This section discusses our latest Record Linkage model: Aggregate, Link and Iterative Match (ALIM) Prototype 3. Our new data model is proposed as a heterogeneous in-memory data graph structure with search operations similar to database queries but with a focus on approximate string and record matching as opposed to exact or partial string or record matching. Figure 57 below shows an overall schema based on the demographic fields shown in Figure 21. The results of RL searches on the data graph structure are converted into comparison vectors, which can be processed using deterministic, probabilistic or data mining record matching algorithms.

Since the vast majority of data entry errors are typographical, character-by-character edit-distance algorithms are the most effective typographic error detectors. A fast short-string comparison algorithm, PSH, has been developed and is used in this prototype. Second, unlike in traditional databases, approximate matching is a non-trivial compute-intensive and data-intensive function; RL marks a departure from the single-truth model of traditional databases. Figuratively speaking, ALIM is a dedicated memory-resident "co-processor" with its own sustainable processing/storage layer in addition to the multiple database sources.

One significant difference is that these methods were tested on a large, real dataset from a system currently in use by a local health department.


Figure 57: ALIM Prototype 3 Model


7.3.1 ALIM Graph Components and Initialization

The model is a system of interconnected graphs. The circular nodes or vertices in Figure 57 contain the RL system's data. The box-shaped vertices are hash tables that are used to optimize approximate string searching in the vertices of the data subgraphs. All of the bidirectional leaders or edges are the relationships between data. If the leader is a loop, it records how the data in a vertex is related to itself. If the edge connects a data vertex with a record vertex, it links records with their data. If the edge connects a data vertex to a hash table, it is used for approximate string searching. With the exception of the gender variable, all inter-record redundancy is eliminated, compressing the graph to the least possible number of vertices, similar to a 5th or 6th normal form database [Ling, Tompa and Kameda 1981] [Date, Darwen and Lorentzos 2003]. Gender is stored in the Record because of the limited number of valid values: "" (missing data), "F", "M" and "FM" (FM meaning that for some reason both M and F have been observed in the data for an individual, a contradiction that is treated as missing data). This compression makes it much less expensive to record the relationships between unique data vertices within each subgraph.
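The gender rule above can be sketched as a small merge function; the function name and string-based state encoding are assumptions introduced for illustration.

```c
#include <string.h>

/* Merge one observed gender value into the stored state.
   Valid stored states: "" (missing), "F", "M", "FM" (contradiction,
   treated as missing during matching). */
const char *gender_merge(const char *stored, char observed) {
    if (observed != 'F' && observed != 'M') return stored; /* ignore invalid input   */
    if (strcmp(stored, "FM") == 0) return "FM";            /* already contradictory  */
    if (stored[0] == '\0') return observed == 'F' ? "F" : "M";
    if (stored[0] == observed && stored[1] == '\0') return stored;
    return "FM";                                           /* both F and M observed  */
}
```

Once the state reaches "FM" it never reverts, so a single contradictory entry permanently flags the field as unreliable for that person.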

Figure 58 shows the FirstName and Record subgraphs. The FirstName subgraph, G_FN = (FirstName, FNLnk), is a string relationship graph that represents all the data for first names, where FirstName is the list of vertices, each element of which contains a value, such as the name "James", and FNLnk represents the edges, which relate one datum to another, such as "James" being an alias of "Jim", or "Jim" being approximately spelled like "Kim" because they differ by 1 edit (all edges in Figure 57 ending in Lnk are similar to this). These edges are basically similar to a database table with a many-to-many relation from itself onto itself. The RecFN is a many-to-many relationship between the data in FirstName and Record that connects the two subgraphs (all edges in Figure 57 beginning with Rec are similar to this). The Record subgraph, G_Rec = (Record, RecLnk), represents all the data input from a single database about a single entity from each agency. The Record contains data about the data source, its unique identifier and the gender. The data source is the agency that contains the record, combined with the unique identifier within the agency, which together comprise an agency/entity. Agency/entities from each agency are compared to produce enterprise/entities during the RL process. RecLnk represents how each Record is related to connected records (i.e., parent/child, siblings, spouses, etc.). The edge, vertex and relation lists are implemented as arrays of structures but can also be stored in a database. However, initializing and updating the model may be difficult in a database due to a lack of byte- and bit-level access to data. Our method requires this access to speed up the initialize and update processes (see FBF below and Hybrid Signature Hashing with Bitwise Neighborhood Generation, Length Filter, FBF and PDL above).


Figure 58: FirstName and Record Graphs

The hash table is precomputed with a quick single pass over the data. The relationships are all calculated and stored as each record and the record's data are inserted into the graph.


There are three phases in ALIM: a) building a demographic data graph model, b) querying the model, and c) updating the model. For each query, the system should be able to quickly traverse the entire graph in all dimensions before concluding a potential match.

Figure 57 shows a typical ALIM demographic data model. Unlike relational databases, the ALIM data model stores all data evidence for each field in the client's record. For example, if a person changes their last name twice, all names will be included in the compound record with multiple entries in the "RecLN" edge list. The graph also stores relations between data values defined by multiple approximate matching algorithms. These include approximate spelling, phonetic and alias matches, which are pre-computed when records are inserted into the graph data model. This allows fast query performance during RL. For example, the "FNLnk" (first name link) edge links "James", "Jim", "Jimmy" and "Jamie" as synonymous terms, which will be considered as exact matches. The graph also stores record and family relationships between entities in the RecLnk (record source link) edge list. This allows the use of family data to increase the confidence of ER. If data that is missing from an entity's record makes it unmatchable, family data can be substituted during RL processing. This also allows ALIM to leverage social network data. The "RecLnk" edge also allows users to incorporate their knowledge of the data into the matching system. Records that are known to be from the same entity can be forced to link, and records from entities that are known not to be the same entity can be forced apart, using a "Keep-together/Keep-apart rule".

7.3.2 ALIM Matching

Because ALIM stores relationships between data strings, string comparison need not be performed while matching records unless some or all of the data in the query has not been inserted into the system of graphs. A match table with one row for each entity/agency in the system is used to tabulate the field-by-field agreement between the query record and every entity/agency in the system. Immediately after the ALIM graph system is initialized with all of the enterprise's records, pre-matching of all entity/agency records in the system can be performed to speed up queries for agency/entities in the system. Most queries are likely to already be in the system.

7.3.3 ALIM Architecture: The Next-Generation Entity Resolution (ER)

Table 42 shows that each entity has a compound record constructed from all data relevant to the entity within a given dataset.

Table 42: Multiple record entries for the same entity

ID  SSN        Birth     Gender  Phone       Address          Last Name  First Name
12  123456789  01012000  M       2151234567  1234 N BROAD ST  SMITH      JAMES
12  123457689  01012000  M       2677654321  3456 PINE ST     SMIHT      JIM

Figure 59 shows a compound record (left) and its data (right). These inconsistencies are very common in human service systems. There is also a temporal dimension in the proposed ER system that attaches dates to each data point (not shown in Figure 59). This allows exploitation of temporal changes to enhance matching confidence.

Unique ID: 0
Agency: CBH
Agency ID: 12
{All first names on record}: {James, Jim}
{All last names on record}: {Smith, Smiht}
{All addresses on record}: {1234 N Broad St, 3456 Pine St}
{All phone numbers on record}: {2151234567, 2677654321}
{All SSNs on record}: {123456789, 123457689}
{All birth dates on record}: {01012000}
{All genders on record}: M

Figure 59: Compound Record

In the query phase, all related fields are traversed based on multiple matching criteria.

Figure 60 shows approximate spelling, phonetic and alias matches on First Names.


Figure 60: ALIM Query Process

Table 43 shows an example of a point system for the data graph from Figure 57. Note that there are penalties for immutable fields that generally should not change, for example birth date and gender. The penalty decreases the possibility of false positive matches.

Two records are considered a match if the maximum match point value exceeds a user-defined threshold. Note that an alias match is technically identical to an exact match.
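Using the SSN column of Table 43 as an example (22 points for an exact match, 11 for an approximate match, and an 11-point penalty on contradiction), the scoring and thresholding step might be sketched as follows; the Outcome encoding is an assumption.

```c
/* Possible outcomes when comparing one field of two records. */
typedef enum { NO_EVIDENCE, EXACT_MATCH, APPROX_MATCH, CONTRADICTION } Outcome;

/* Points contributed by the SSN field (values from Table 43). */
int score_ssn(Outcome o) {
    switch (o) {
    case EXACT_MATCH:   return 22;
    case APPROX_MATCH:  return 11;
    case CONTRADICTION: return -11; /* penalty on an immutable field    */
    default:            return 0;   /* missing data contributes nothing */
    }
}

/* Two records match when the summed comparison vector reaches the
   user-defined threshold. */
int is_match(int total_points, int threshold) { return total_points >= threshold; }
```

Summing one such score per field produces the comparison vector total; a pair scoring 48 therefore matches under any threshold up to 48.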

Table 44 shows an example of comparison vectors (V_t) and their sum in the Total column, where Record 12 is considered a match if the threshold is less than or equal to 48. ALIM is also capable of iterative matching to force a transitive closure on RL (i.e., if records r~q and q~p then r~p). Unlike generic machine learning systems, an ALIM match can be readily interpreted by a human user by examining the comparison vectors.

ALIM’s in-memory data graph model compresses the data by removing redundancy and increases efficiency by pre-computing approximate matches between data.
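The transitive closure that iterative matching enforces (r~q and q~p implies r~p) is the defining property of a disjoint-set partition. A standard union-find structure over record indices, shown below as an illustrative sketch rather than ALIM's actual implementation, maintains that closure incrementally:

```c
#define MAX_RECORDS 1024

static int parent[MAX_RECORDS];

/* Each record starts in its own cluster. */
void uf_init(int n) { for (int i = 0; i < n; i++) parent[i] = i; }

/* Find the cluster representative, with path compression. */
int uf_find(int x) {
    return parent[x] == x ? x : (parent[x] = uf_find(parent[x]));
}

/* Declaring r~q merges the two clusters, so closure holds automatically. */
void uf_union(int a, int b) { parent[uf_find(a)] = uf_find(b); }
```

After uf_union(r, q) and uf_union(q, p), records r and p share one representative, so the r~p relation never needs to be discovered separately.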

Table 43: Weighted point system

         FirstName  LastName  Address  Phone  Social  Birth  Gender
Exact    8          8         8        8      22      11     2
Approx   8          8         8        8      11      6
Sound    4          4
Penalty                                       11      6      5


Table 44: Summed comparison vectors

7.3.4 Experiments for ALIM Prototype 3

The data in these experiments are real data from the CARES RL system. The first experiment's data includes two 10,000-record datasets that were hand selected from CARES data known to match by index. The second dataset is real data from the health department's three largest datasets, containing a total of 8,376,754 tuples. The experiments were processed on a dual quad-core Xeon server with 48 GB of RAM.

7.3.5 Results for ALIM Prototype 3

Table 45 below shows the results from the real data experiments. The first result column, OCM, is a deterministic algorithm that uses the Soundex for approximate matching and no blocking, currently in use by a local health department; the second is ALIM with a threshold of 25 points (liberal analysis); and the third is ALIM with a threshold of 35 points (more restrictive analysis). ALIM has higher true positives but with higher false positives. However, we discovered after further investigation that these datasets contain more than one record per person; like most of the datasets, they also have a deduplication problem. This was discovered by desk checking the reported false positives, which revealed no actual false positives: the FPs were actually TPs. The S_Accuracy is calculated as S_Accuracy = (Sensitivity + Specificity)/2.


Table 45: ALIM results

                 OCM          ALM25        ALM35
True positives   8,336        8,907        8,646
False negatives  1,664        1,093        1,354
False positives  302          406          355
True negatives   199,979,698  199,979,594  199,979,645
Accuracy         99.999%      99.999%      99.999%
Sensitivity      83.360%      89.070%      86.460%
Specificity      100.000%     100.000%     100.000%
S_Accuracy       91.680%      94.535%      93.230%

Our current experiment has processed a real dataset containing 8,376,754 records. OCM treats the individual tuples in compound records as completely independent and had to process 35,084,999,599,881 record pairs (i.e., a matrix diagonalization (n × (n − 1))/2 for n records). It would have taken over 50 weeks to complete this experiment, so it was terminated. To decrease the runtime, we added a blocking scheme that required at least one field to be an exact match for a pair to be considered a candidate for the OCM algorithm. It took the same deterministic algorithm used in the previous experiment, with blocking added, over 710 hours to complete. ALIM's 5th normal form data graph reduced the dataset to 1,733,921 compound records and further compressed the data by eliminating all data redundancy. Table 46 shows the number of records inserted into each algorithm and the runtimes. ALIM also has an optimization that considers a record unmatchable if it cannot match itself given the match rule. The Unmatchable column shows that ALIM eliminated 68,608 records that did not contain enough nonempty data fields to be matchable. It took ALIM less than 42 hours to build the data graph, including all approximate string matching, and less than 300 hours to complete the record matching process. ALIM is twice as fast without the use of blocking.
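The pair counts above follow directly from the quoted formula; a quick check in 64-bit arithmetic reproduces the figures in the text (and the possible-pair counts in Table 39 earlier):

```c
#include <stdint.h>

/* Pairwise comparisons for n records: the matrix diagonalization n(n-1)/2.
   64-bit arithmetic is required: for n in the millions the product
   n*(n-1) overflows 32-bit integers. */
uint64_t pair_count(uint64_t n) { return n * (n - 1) / 2; }
```

For the 8,376,754-record dataset this yields 35,084,999,599,881 pairs, matching the number quoted above, while the 1,733,921 compound records ALIM actually matched imply roughly 23 times fewer pairs.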


Table 46: Inserts and runtimes by method

Method  Hours  Minutes  Inserts    Unmatchable
OCM     710    27       8,376,754
ALIM    341    28       1,733,921  68,608

The results from the current RL experiment between OCM and ALIM cannot be directly compared because of the structural differences between these methods. The ALIM matches had to be expanded to individual tuple-based matches to be comparable to the OCM matches. To expand the ALIM results, the records in the original data files were numbered by their respective compound record. OCM performs a tuple-based search, comparing all individual pairs of records. ALIM uses a 5th normal form graph-based approach that immediately combines tuples that originate from the same agency unique identifier into a compound record arrangement, which substantially reduces the number of records to be matched. Table 47 shows the modifications that converted the raw OCM input data (OCM_Tuples) to a tuple-based representation for the ALIM expanded (ALIM_Tuples) results. Notice that the Insert number column from OCM was converted to the Compound Record insert number (CompRec) for ALIM to create the expanded ALIM file, which is a relation between each method's results by index.

Table 47: Tuple to compound records

OCM tuples                  ALIM compound records (tuple to compound)
Insert  Agency  AgencyID    CompRec  Agency  AgencyID
0       1       1           0        1       1
1       1       1           0        1       1
2       1       1           0        1       1
3       1       1           0        1       1
4       1       10          1        1       10
5       1       100         2        1       100
6       1       1000        3        1       1000
7       1       1000        3        1       1000
8       1       1000        3        1       1000


There are two types of matches in the ALIM_Tuples results. Matches by definition (ALIM_Defined) are the multiple records for an entity that are known to be true positive matches because the agency's unique identifier matches, as mentioned above. Search matches (ALIM_Search) are matches that were produced by the matching algorithm and expanded using ALIM_Tuples. The ALIM_Tuples were converted directly into ALIM_Defined. Notice that the 4 tuples for CompRec 0, at indices 0 to 3, produced 6 defined matches, from (0,1) to (2,3). We used 99 points as a sentinel value for ALIM_Defined matches.

This operation produced a file containing only true positive matches, because they are known to be from the same entity within a given agency, and it can therefore be used as a limited ground truth. It should be noted that the data were sorted lexicographically by unique ID and not numerically.
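The expansion of one compound record's k tuples into its k(k-1)/2 defined pairs can be sketched as follows (Python, illustrative; the sentinel value follows the text):

```python
from itertools import combinations

def defined_matches(tuple_indices, sentinel_points=99):
    """Expand the tuple indices of one compound record into all
    defined (true positive) match pairs, scored with a sentinel."""
    return [(i, j, sentinel_points) for i, j in combinations(tuple_indices, 2)]

# CompRec 0 spans tuple indices 0..3 -> 6 defined matches, (0,1) through (2,3)
pairs = defined_matches([0, 1, 2, 3])
print(len(pairs))           # 6
print(pairs[0], pairs[-1])  # (0, 1, 99) (2, 3, 99)

# An entity with 20 entries in one database yields 20*19/2 = 190 defined matches
print(len(defined_matches(range(20))))  # 190
```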

Table 48: ALIM tuples to ALIM defined matches

Compound to Tuple

Index   CompRec   Agency   AgencyID      Index 1   Index 2   Points
0       0         1        1             0         1         99
1       0         1        1             0         2         99
2       0         1        1             0         3         99
3       0         1        1             1         2         99
4       1         1        10            1         3         99
5       2         1        100           2         3         99
6       3         1        1000          6         7         99
7       3         1        1000          6         8         99
8       3         1        1000          7         8         99

There were 170,446,671 OCM matches with a 35-point threshold. ALIM had 872,115 compound record matches with a 35-point threshold. The ALIM_Search matches were extracted by using the ALIM_Tuples tuples to convert the ALIM compound matches

(ALIM_Compound) to ALIM_Search matches. Table 49 shows that Index 33, CompRec 6, which is Agency 1, AgencyID 1000000, has 6 matches in the ALIM_Search list on the

right. Also note that Indices 33 to 41 are all the same entity because they share the same

CompRec.

Table 49: Convert ALIM tuples to ALIM search matches

Compound to Tuple

Index   CompRec   Agency   AgencyID    Index 1   Index 2   Points
33      6         1        1000000     33        7994240   59
34      6         1        1000000     33        7994241   59
35      6         1        1000000     33        7994242   59
36      6         1        1000000     33        7994243   59
37      6         1        1000000     33        7994244   59
38      6         1        1000000     33        7994245   59
39      6         1        1000000     34        7994240   59
40      6         1        1000000     34        7994241   59
41      6         1        1000000     34        7994242   59

In order to compare ALIM_Defined and ALIM_Search to OCM, the OCM results had to be converted into similar datasets. To extract OCM_Defined from OCM_Raw (all matches found with the OCM algorithm on the data), a set intersection, OCM_Defined = OCM_Raw ∩ ALIM_Defined, was performed. To extract OCM_Search from OCM_Raw, a set subtraction, OCM_Search = OCM_Raw ⊖ OCM_Defined, was performed. There were 167,740,521 expanded defined matches for ALIM and 43,861,648 expanded search matches, for a total of 211,602,169. ALIM produced 41,155,498 (24.15%) more total tuple-based matched pairs than OCM. OCM missed 3,107,897 defined matches. ALIM produced 38,047,601 more search matches, 7.54 times as many as OCM. Table 50 contains the resulting counts.

Table 50: Counts of matches

Method/Type     Count
OCM_Raw         170,446,671
OCM_Defined     164,632,624
OCM_Search      5,814,047
ALIM_Compound   872,115
ALIM_Defined    167,740,521
ALIM_Search     43,861,648
ALIM total      211,602,169
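The set extraction described above maps directly onto built-in set operations; a sketch in Python with toy match pairs (not the experimental match lists):

```python
# Match pairs represented as (index1, index2) tuples, normalized so index1 < index2.
def norm(pair):
    a, b = pair
    return (a, b) if a < b else (b, a)

# Toy stand-ins for the real match lists (assumed data, for illustration only).
ocm_raw      = {norm(p) for p in [(0, 1), (0, 2), (5, 9), (8, 7)]}
alim_defined = {norm(p) for p in [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]}

ocm_defined = ocm_raw & alim_defined   # OCM_Defined = OCM_Raw ∩ ALIM_Defined
ocm_search  = ocm_raw - ocm_defined    # OCM_Search  = OCM_Raw ⊖ OCM_Defined

print(sorted(ocm_defined))  # [(0, 1), (0, 2)]
print(sorted(ocm_search))   # [(5, 9), (7, 8)]
```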


The results in Table 51 below show counts for records matched by one method and missed by the other (e.g., |ALIM_Search ⊖ OCM_Search| = 38,368,945).

Table 51: Comparison of match types

Set operation                  Count
ALIM_Search ⊖ OCM_Search       38,368,945
OCM_Search ⊖ ALIM_Search       321,345
ALIM_Defined ⊖ OCM_Defined     3,107,897
OCM_Defined ⊖ ALIM_Defined     0

The following analysis shows the counts of the search match results for ALIM and OCM by point value. Table 52 shows the results, with point scores from 35 to 75 points for each method's match scores. The maximum score for OCM is 75 points and the maximum score for ALIM is 67 points; ALIM's point system is more conservative than OCM's. The Difference rows show the point differences between methods, with positive values indicating more matches for ALIM and negative values indicating more matches for OCM. Notice that 32.94 million of the 43.86 million search matches are at the 35-point threshold. This suggests that many of the records are missing some data or contain some inconsistent data.


Table 52: Match points and difference

Points       35          36         37         38        39        40        41
ALIMsearch   32,943,476  770,731    2,115,060  18,102    0         20,372    55,347
OCMsearch    467,540     21,397     98,689     30,394    55,409    223,597   189,117
Difference   32,475,936  749,334    2,016,371  -12,292   -55,409   -203,225  -133,770

Points       42          43         44         45        46        47        48
ALIMsearch   189,626     1,468,548  43,710     422,488   1,712     0         10,008
OCMsearch    615,714     38,546     12,577     178,314   10,030    372,911   5,885
Difference   -426,088    1,430,002  31,133     244,174   -8,318    -372,911  4,123

Points       49          50         51         52        53        54        55
ALIMsearch   42,038      26,499     1,010,458  190,835   0         313       0
OCMsearch    13,423      538,038    38,238     68,595    6,036     1,010     2,031,943
Difference   28,615      -511,539   972,220    122,240   -6,036    -697      -2,031,943

Points       56          57         58         59        60        61        62
ALIMsearch   5,891       159,280    0          3,034,579 68,160    0         0
OCMsearch    1,672       79,816     593        1,804     158,443   3,023     5,111
Difference   4,219       79,464     -593       3,032,775 -90,283   -3,023    -5,111

Points       63          64         65         66        67        68        69
ALIMsearch   0           0          33,345     0         1,231,070 0         0
OCMsearch    360         91         477,384    109       5,171     0         0
Difference   -360        -91        -444,039   -109      1,225,899 0         0

Points       70          71         72         73        74        75
ALIMsearch   0           0          0          0         0         0
OCMsearch    26,362      0          0          0         0         36,705
Difference   -26,362     0          0          0         0         -36,705

The distribution in Table 53 below shows 3 bins of results for the previous point counts and totals. Notice that, again, the ALIM matches are pooled within the lower point values. The last column is the mean point value for all matches.

Table 53: Point counts in 3 bin distribution

Points       35-48       49-62      63-75      Total       Point Mean
ALIMsearch   38,059,180  4,538,053  1,264,415  43,861,648  38.70
OCMsearch    2,320,120   2,947,745  546,182    5,814,047   50.08
Difference   35,739,060  1,590,308  718,233    38,047,601

Table 54 shows the ALIM threshold increased to 40 points and the OCM threshold at 35.

Any false positive matches are most likely below this threshold, because the Point Mean has increased from 38.70 to 54.62. ALIM produced a total of 2,200,232 more matches, even with a substantially increased threshold. The higher the threshold, the less likely it is that false positive matches remain.


Table 54: Point counts with ALIM threshold at n=40 points, OCM at n=35 points

Points       n-48        49-62      63-75      Total      Point Mean
ALIMsearch   2,211,811   4,538,053  1,264,415  8,014,279  54.62
OCMsearch    2,320,120   2,947,745  546,182    5,814,047  50.08
Difference   -108,309    1,590,308  718,233    2,200,232
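The three-bin summaries in Tables 53 and 54 are a simple fold over the per-point counts of Table 52; a sketch (Python, with a toy histogram rather than the experimental counts):

```python
def bin_counts(by_points, bins=((35, 48), (49, 62), (63, 75))):
    """Collapse a {point_score: match_count} map into range totals."""
    return {f"{lo}-{hi}": sum(c for p, c in by_points.items() if lo <= p <= hi)
            for lo, hi in bins}

def point_mean(by_points):
    """Weighted mean point value over all matches."""
    return sum(p * c for p, c in by_points.items()) / sum(by_points.values())

hist = {35: 100, 40: 50, 49: 25, 62: 5, 67: 3}  # toy counts, for illustration
print(bin_counts(hist))            # {'35-48': 150, '49-62': 30, '63-75': 3}
print(round(point_mean(hist), 2))  # 39.54
```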

The search match results for ALIM_Search and OCM_Search were checked for possible duplicate entities (DE) and possible table matches (TM) by determining whether or not the matches are contained within the same agency's dataset. We call these possible DE and TM because some may be false positive matches. In large datasets, it is impossible to determine the exact number of true positive and false positive matches; if we could, the method for doing so would be a perfect solution to the RL problem itself, which is theoretically impossible. The tuples were inserted in order of occurrence from each of the agencies, which means that we can resolve an agency by tuple insert number. Table 55 shows the insert boundaries for the enterprise's respective agencies.

Table 55: Agency boundaries

Agency    First       Last
Agency1   0           7,623,994
Agency2   7,623,995   8,040,317
Agency3   8,040,318   8,376,753
ALL       0           8,376,753
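Resolving an agency from an insert number, as described above, is a binary search over the boundary table; the same lookup classifies a match pair as DE (same agency) or TM (cross-agency). A sketch (Python; names are illustrative):

```python
import bisect

# First insert index of each agency, from Table 55.
AGENCY_STARTS = [0, 7_623_995, 8_040_318]
AGENCY_NAMES = ["Agency1", "Agency2", "Agency3"]

def resolve_agency(insert_no: int) -> str:
    """Map a tuple insert number to its source agency via binary search."""
    return AGENCY_NAMES[bisect.bisect_right(AGENCY_STARTS, insert_no) - 1]

def classify(pair) -> str:
    """DE if both tuples come from the same agency, TM otherwise."""
    a, b = (resolve_agency(i) for i in pair)
    return "DE" if a == b else "TM"

print(resolve_agency(7_623_994))   # Agency1 (last Agency1 insert)
print(resolve_agency(8_000_000))   # Agency2
print(classify((10, 20)))          # DE (both Agency1)
print(classify((10, 8_376_753)))   # TM (Agency1 vs Agency3)
```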

The candidate TM and DE counts are shown in Table 56 below. Notice that the DE count for ALIM_Search is very high, which suggests a probable deduplication problem in the data (or false positives at lower threshold values).

Table 56: Candidate TM and DE counts

       TM           DE
ALIM   10,799,534   33,062,114
OCM    5,384,765    429,284

Table 57 shows the distribution for the candidate TMs and DEs. Both methods' DEs and TMs decrease as the threshold increases, except for OCMtm, which peaks in the 49-62 threshold range. Notice that ALIM and OCM both have significant DEs in the 63-75 point range. Because these are matches from an agency onto itself, this suggests probable intra-agency duplicates. ALIM had nearly 200,000 matches with a point score of 67 and OCM had 4,041 with a point score of 75. These scores suggest exact-matching duplicates within the same agency.

Table 57: Distribution for candidate TM and DE

Points   35-48        49-62      63-75      Total        Point Mean
ALIMtm   5,862,786    3,876,105  1,060,643  10,799,534   46.89
ALIMde   32,196,394   661,948    203,772    33,062,114   36.02
OCMtm    1,990,552    2,868,429  525,783    5,384,764    50.57
OCMde    329,568      79,316     20,399     429,283      43.93

Table 58 shows the ALIM threshold increased to 40 points and the OCM threshold at 35. This substantially lowered the DE count for ALIM. With a Point Mean of 48.42, few of the ALIM DE are likely to be false positives.

Table 58: Distribution for candidate TM and DE with ALIM at 40 points

Points   n-48        49-62      63-75      Total       Point Mean
ALIMtm   636,566     3,876,105  1,060,643  5,573,314   57.33
ALIMde   1,575,245   661,948    203,772    2,440,965   48.42
OCMtm    1,990,552   2,868,429  525,783    5,384,764   50.57
OCMde    329,568     79,316     20,399     429,283     43.93

7.3.6 Conclusion for ALIM Prototype 3

The reason for the high number of matches is the treatment of tuples as independent in OCM and the expansion of ALIM's results to match this form. Some entities were observed to have more than 20 entries in a single database linked by a primary key; 20 entries for the same entity result in 190 defined matches in ALIM. This occurs because some or all of the source databases are in 5th or 6th normal form, to accommodate data that change over time, and the query that produced the independent tuples output the


Cartesian product of these records. Families or individuals move and the address changes; phone numbers change; first names have aliases; last names change due to marriage and divorce. Some databases place temporary entries in the Social Security Number field for newborns that have not yet been issued an SSN. In rare cases, even gender changes in these data. There are also data entry errors. Also consider that desk checking of the data revealed a severe deduplication problem, because entities have multiple unique IDs within the same agency.

If one field changes, say the phone number, the query outputs two independent tuples, one with each phone number. If two fields change, there will be four. Consider that if a woman marries and changes her last name, has three addresses on record due to moving, has two phone numbers, and is referred to as both Sue and Susan, she will have 2 × 3 × 2 × 2 = 24 independent tuples, and there is no guarantee that all of these will match under a matching algorithm that treats data tuples independently. Using a graph-based compound record approach ensures that these data are stored and processed as they really are and does not assume that they will match within the RL system. It also increases the power of the RL process. Treating these data as independent is both less efficient and less effective.
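The blow-up described above is just a Cartesian product over the multi-valued fields; a quick illustration (Python, with made-up field values):

```python
from itertools import product

# One entity's multi-valued fields (invented values matching the example in the text)
first_names = ["Sue", "Susan"]
last_names = ["Smith", "Jones"]               # maiden and married names
addresses = ["Addr A", "Addr B", "Addr C"]    # three addresses on record
phones = ["555-0101", "555-0102"]

tuples = list(product(first_names, last_names, addresses, phones))
print(len(tuples))  # 2 * 2 * 3 * 2 = 24 independent tuples for one entity
```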

The ALIM model can be extended to include record relations. One such relation can allow human experts to include keep-together and keep-apart rules for records that are known to match, or not to match, for a given entity. Another is social network data that identify related entities within the system. This can increase accuracy by denying matches between family members. Twins are a problem for RL because most of their data matches; if two entities are identified as familially related, they cannot be the same entity. If a social network exists in one or more agencies, RL can be used to extend these to an enterprise-wide

social network, allowing Health and Social Service professionals to explore familial trends over time. ALIM can be extended to a 6th-normal-form data graph by including a timestamp on data entries, allowing temporal sorting and analysis of data. Figure 62 shows sample record relations.

[Figure: five record nodes (Indices 100, 101 and 102 in Agency 1; Index 312 in Agency 2; Index 567 in Agency 3) linked by Parent-of, Child-of, Sibling-of and Twins relations and by Keep-Together and Keep-Apart rule edges.]

Figure 62: Record relations

RL is a useful but challenging topic for big data science, both theoretically and practically.

Having a reasonable definition and a practice-grounded implementation helps push the envelope of big data science toward trustworthy and dependable services. ALIM is capable of performing far more comprehensive record linkage matches than traditional deterministic static phonetic methods using relational or NoSQL databases. In preliminary tests using a real dataset with a known ground truth, we were able to outperform a system currently in use, producing more true positive matches. We are currently writing papers to present ALIM and PSH, in which the methods are described in detail so that the experiments are reproducible and the methods implementable by domain experts and researchers. One of the major problems in RL is the lack of real data for testing methods: real Health and Social Services data are protected in the U.S. by federal law and are extremely hard to obtain.

It has been suggested that processing big data in memory may not be possible, which is the case for some problems. The experiments mentioned above were processed on a dual quad-core Xeon processor server with 48GB of RAM and consumed less than 12GB for approximately 8.5 million tuples. We checked the price of readily available hardware and found that a server with four 12-core Opteron processors and 256GB of RAM can be built for under $25,000 U.S., which may be capable of processing approximately 170 million records if the memory usage scales linearly (ignoring any compression gained by eliminating redundant data). Our experiment included only three of eleven datasets from the health department; all eleven datasets contain over 50 million records for approximately 1.5 million clients.
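The ~170-million-record figure follows from straight linear extrapolation of the measured memory footprint, with some headroom; the raw arithmetic:

```python
# Linear memory-scaling estimate behind the ~170 million record figure.
measured_tuples = 8_500_000   # tuples processed in the experiment
measured_gb = 12              # RAM consumed (GB)
server_gb = 256               # RAM on the proposed server (GB)

capacity = server_gb * measured_tuples // measured_gb
print(capacity)  # 181333333 tuples before any OS or workspace headroom
```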


PART IV

CONCLUDING REMARKS


CHAPTER 8

CONCLUSION

The combination of the ALIM and PSH methodologies creates a powerful RL tool that handles many known issues (alias terms, use of DL and transitive closure of RL) and issues revealed by our research (temporal data, redundant data and reliable DL optimization). Our model also provides a facility for domain-expert guidance through "keep together" and "keep apart" rules, family member data to boost RL confidence and quality, and the potential for social network analysis. Though the storage mechanism is present in the ALIM model, these functions were not implemented because the data to populate the record relation links were not available.

We are confident that our current results are of higher quality than OCM's because our point-and-threshold system is stricter than OCM's. We were concerned that data shared by parents and children with the same first names, and by twin siblings, could be erroneously matched. This is a common problem for RL systems on big datasets. The upper limits of the point-and-threshold system for exact matches in ALIM and OCM are shown below in Table 59.

Table 59: Points for exact field matches for ALIM and OCM

       First Name   Last Name   Address   Phone   Gender   Birthdate   SSN
ALIM   8            8           8         8       2        11          22
OCM    10           10          10        10      2        10          20


Note that, given the 35-point threshold, the sum of the first five fields is a match for OCM (42 points) but not a match for ALIM (34 points): a father and son with the same first name will match in OCM. In creating our match system, we sought to use the same kind of logic that a human would use to match records. OCM also awards 3 bonus points for a combined Gender, Birthdate and SSN match. The maximum point value for ALIM is 67 points and the maximum point value for OCM is 75.
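The threshold arithmetic above can be checked directly from Table 59 (Python; the field keys are illustrative):

```python
# Per-field exact-match points from Table 59.
ALIM = {"first": 8, "last": 8, "address": 8, "phone": 8,
        "gender": 2, "birthdate": 11, "ssn": 22}
OCM = {"first": 10, "last": 10, "address": 10, "phone": 10,
       "gender": 2, "birthdate": 10, "ssn": 20}
OCM_BONUS = 3  # bonus for matching Gender, Birthdate and SSN together

first_five = ["first", "last", "address", "phone", "gender"]
print(sum(OCM[f] for f in first_five))   # 42 -> clears the 35-point threshold
print(sum(ALIM[f] for f in first_five))  # 34 -> does not clear the threshold

print(sum(ALIM.values()))                # 67, ALIM's maximum score
print(sum(OCM.values()) + OCM_BONUS)     # 75, OCM's maximum score
```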


CHAPTER 9

CONTRIBUTIONS TO THE RL DOMAIN

We offer that our research has made the following contributions to the RL domain:

1. Handles the temporal data problem (aggregate records)

2. Removes data and processing redundancies

3. Ensures transitive closure on RL (Iterative RL)

4. Handles synonymous terms (aliases), phonetic terms and minor spelling errors

5. Guarantees that all optimizations are lossless with respect to accuracy

6. Provides a mechanism to allow system users to correct mistakes in RL

7. Provides a mechanism to make use of available contextual data (family members)

8. Provides a mechanism for the future implementation of family linkage (SNA)

9. Handles both RL and deduplication operations, concurrently

10. Performs approximate string matching faster with our new and novel hybrid hash method

11. Creates a better-quality comparison vector than is possible with other methodologies

Our ALIM methodology handles all of the aforementioned principles of RL and generates comparison vectors that better represent each entity's data. ALIM's in-memory data graph consolidates entities' temporal data (1), where a single field may have multiple values because data can change over time, and eliminates redundancy by using a 5th-normal-form graph to store data (2). ALIM uses iterative RL to make use of prior search results to ensure a transitive closure on RL (3). ALIM uses PSH for approximate string matching, Soundex for phonetic matching and alias matching for synonymous first names, in addition to exact field matching (4). The PSH optimization produces the same results as DL but much faster (5 and 10). The storage of "keep records together"/"keep records apart" rules at the record level allows users to apply domain knowledge to correct mistakes (6) and to use family data as context to boost matching confidence (7), and record-level familial and social relations allow family linkage and Social Network Analysis (SNA) (8). The system performs deduplication and matching concurrently (9). ALIM has been experimentally shown to find more matches than a common system currently used by a local health department.

It should also be noted that the work in this document is not merely a conjecture, hypothesis or theory of how to perform RL. It has been experimentally shown to produce better results than a system currently in use by a local urban health department, on big data from real databases that store actual clients' data.


CHAPTER 10

FUTURE WORK

ALIM Prototype 4 will enable the use of the family relation and keep-together/keep-apart rule record links. We will also obtain data and run experiments with the family context, Family Linkage and Social Network Analysis features fully enabled. The following mechanisms are available but not implemented in the current model:

1. Provides a mechanism to allow system users to correct mistakes in RL

2. Provides a mechanism to make use of available contextual data (family members)

3. Provides a mechanism for the future implementation of family linkage (SNA)

ALIM Prototype 5 will be multi-threaded and distributed to take full advantage of multi-core computers and cluster computing, to handle bigger data problems faster. Since the system will use multiple computers with data redundancy, we will have to ensure reliable updates, meaning that updates are consistently applied across all computers. We will also have to ensure fault tolerance: if some part of the cluster fails, any job in progress should complete correctly.


REFERENCES

[Ananthakrishna Chaudhuri and Ganti 2002] Ananthakrishna, R., Chaudhuri, S., and Ganti, V.: Eliminating fuzzy duplicates in data warehouses. In Proceedings of the 28th International Conference on Very Large Databases (VLDB-2002), Hong Kong, China, 2002.

[Apache Commons 2012] Apache Commons Codec Software Website. http://commons.apache.org/codec/. Retrieved 2012-11-03

[Apostolico and Galil 1985] Apostolico, A.; Galil, Z.: "Combinatorial Algorithms on Words", Springer-Verlag, 1985.

[Arasu, Ganti and Kaushik 2006] Arasu, A., Ganti, V., Kaushik, V.; “Efficient Exact Set Similarity Joins”. VLDB pp. 918-929 2006

[Avotaynu 2012] The Avotaynu Journal of Jewish Genealogy Website. http://www.avotaynu.com/soundex.htm. Retrieved 2012-11-03

[Baxter, Christen and Churches 2003] Baxter, R., Christen, P., Churches, T.: A Comparison of Fast Blocking Methods for Record Linkage, In: First Workshop on Data Cleaning, Record Linkage and Object Consolidation, KDD 2003, Washington DC, August 24-27 (2003)

[Benjelloun, Garcia-Molina, Menestrina, Su, Whang and Widom 2008] Benjelloun, O., Garcia-Molina, H., Menestrina, D., Su, Q., Whang, S., Widom, J.: Swoosh: a generic approach to entity resolution, The VLDB Journal (2008), Volume: 18, Issue: 1, Publisher: Springer, Pages: 255-276

[Bhattacharya and Getoor 2004] Bhattacharya, I., Getoor, L., Iterative record linkage for cleaning and integration, DMKD '04 Proceedings of the 9th ACM SIGMOD workshop on Research issues in data mining and knowledge discovery

[Bilenko and Mooney 2003] Mikhail Bilenko , Raymond J. Mooney, Adaptive duplicate detection using learnable string similarity measures, Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining, August 24- 27, 2003, Washington, D.C.

[Bleiholder, Khuller, Naumann, Raschid and Wu 2006] J. Bleiholder, S. Khuller, F. Naumann, L. Raschid and Y. Wu: Query planning in the presence of overlapping sources, EDBT 2006


[Bleiholder, Naumann 2008] Bleiholder, J., Naumann, F.: Data Fusion, ACM Computing Surveys Vol. 41 No. 1 Article 1 2008

[Boitsov 2001] Boitsov, L.: “Using Signature Hashing for Approximate String Matching”, Prikladnaya Matematika I Informatiks, No. 8, pp. 135-153, 2001

[Boytsov 2011] Boytsov, L; Indexing Methods for Approximate Dictionary Searching: Comparative Analysis; ACM Journal of Experimental Algorithmics, Vol. 16, No. 1, Article 1, May 2011

[Boytsov 2012] Boytsov, L; Super-Linear Indices for Approximate Dictionary Searching; SISAP 2012, LNCS 7404, pp. 162-176, 2012

[Campbell 2005] Campbell, K.: “Rule Your Data With The Link King”, SUGI30 Proceedings, 2005

[Campbell and Deck 2008] Campbell, K. M., Deck, D., Krupski, A.: Record Linkage Software in the Public Domain: A Comparison of Link Plus, The Link King, and a “Basic” Deterministic Algorithm, In: Health Informatics Journal, Vol. 14(1) (2008)

[Caversham 2012] Caversham Project Website. http://caversham.otago.ac.nz/. Retrieved 2012-11-03

[Chang and Lin 2001] Chang, C.-C. & Lin, C.-J. (2001), LIBSVM: A library for support vector machines, manual. Department of Computer Science, National Taiwan University. Software available at: http://www.csie.ntu.edu.tw/~cjlin/libsvm

[Chaudhuri, Ganti and Kaushik 2006] Chaudhuri, S., Ganti, V., Kaushik, R.: “A Primitive Operator for Similarity Joins in Data Cleaning”, In Proc. 22nd Intl. Conf. on Data Engineering, Page 5, 2006

[Christen 2007] Christen, P. (2007), A two-step classification approach to unsupervised record linkage, in ‘Australasian Data Mining Conference’ (AusDM’07), Gold Coast, Conferences in Research and Practice in Information Technology (CRPIT), vol. 70.

[Christen 2008a] Christen, P.; Febrl - An Open Source Data Cleaning, Deduplication and Record Linkage System with a Graphical User Interface, Proc. of the ACM SIGKDD, Las Vegas, August 2008

[Christen 2008b] Christen, P.; Febrl - A Freely Available Record Linkage System with a Graphical User Interface, Proc, of the Australasian Workshop on Health Data and Knowledge Management (HDKM), Wollongong, January 2008


[Christen and Belacic 2005] Christen, P. & Belacic, D.; Automated probabilistic address standardisation and verification, in ‘Australasian Data Mining Conference’ (AusDM 2005), Sydney.

[Christen and Churches 2003] Christen, P., and Churches, T. (2003), “Febrl - Freely extensible biomedical record linkage” (user manual)

[Churches, Christen, Lim and Zhu 2002] Churches, T., Christen, P., Lim, K. & Zhu, J.X.; Preparation of name and address data for record linkage using hidden Markov models, BioMed Central Medical Informatics and Decision Making, vol. 2, no. 9. 2002

[Cohen and Ravikumar 2003] Cohen, W., Ravikumar, P., Fienberg, S.: A Comparison of String Metrics for Matching Names and Records . In Proceedings of the IJCAI-2003 Workshop on Information Integration on the Web, 2003.

[Cohen and Richman 2002] Cohen, W., Richman, J.: Learning to Match and Cluster Large High-Dimensional Data Sets for Data Integration. In: SIGKDD’02, (2002)

[Cusmai, Aiello, Scannapieco, and Catarci 2008] Cusmai, Francesco; Aiello, Carola; Scannapieco, Monica; Catarci, Tiziana. “EARL: an Evolutionary Algorithm for Record Linkage”. VLDB ‘08, August 24-30, 2008, Auckland, New Zealand

[Daechsel 1979a] Daechsel, Werner F.O.: Will Canada Adopt the Universal Health Number?, Canadian Hospital, 1972 March, pp. 21-22.

[Daechsel 1979b] Daechsel, Werner F.O.: Problems Arising from Universal Numbering and Record Linkage Systems, Canadian Hospital, 1972 March, pp. 23-24.

[Damerau 1964] Damerau F. J.: A technique for computer detection and correction of spelling errors, In: Communications of the ACM, (1964)

[Date, Darwen and Lorentzos 2003] Date, C., Darwen, H., Lorentzos, N.: (January 2003). Temporal Data and the Relational Model: A Detailed Investigation into the Application of Interval and Relation Theory to the Problem of Temporal Database Management. Oxford: Elsevier LTD

[Dempster, Laird and Rubin 1977] Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977), "Maximum Likelihood From Incomplete Data Via the EM Algorithm," Journal of the Royal Statistical Society, Series B, 39, 1-38

[Dice 1945] Dice, Lee R. (1945). "Measures of the Amount of Ecologic Association Between Species". Ecology 26 (3): 297–302. doi:10.2307/1932409. JSTOR 1932409


[Dong, Halevy and Madhavan 2005] Dong, X.,Halevy, A.Y.,Madhavan, J.: Reference reconciliation in complex information spaces. In: Proceedings of ACM SIGMOD (2005)

[Dunn 1946] Dunn, H. L.: Record Linkage, In: American Journal of Public Health 36 (12): pp. 1412–1416. doi:10.2105/AJPH.36.12.1412. (1946)

[Eckerson 2002] Eckerson, Wayne W. (2002). "Data Quality and the Bottom Line: Achieving Business Success through a Commitment to High Quality Data". Data Warehousing Institute report 2002

[Elfeky, Verykios, and Elmagarmid 2002] M. G. Elfeky, V. S. Verykios, and A. K. Elmagarmid. TAILOR: A record linkage toolbox. In Proc. of the ICDE, 2002.

[Elmagarmid, Ipeirotis and Verykios 2007] Elmagarmid, Ahmed; Panagiotis G. Ipeirotis, Vassilios Verykios (2007). “Duplicate Record Detection: A Survey.” IEEE Transactions on Knowledge and Data Engineering 19 (1): pp. 1–16.

[Fayyad 1996] Fayyad, U., Piatetsky-Shapiro, G., Smyth, P.: Data Mining to Knowledge Discovery in Databases, In: AI Magazine 17(3) (1996)

[Fellegi and Sunter 1969] Fellegi, I., Sunter, A.: A Theory for Record Linkage, In: Journal of the American Statistical Association 64 (328): pp. 1183–1210. (1969)

[Giegerich and Kurtz 1999] Giegerich, R., Kurtz, S., Stoye, J.: "Efficient Implementation of Lazy Suffix Trees", In Proc. 3rd Workshop on Algorithm Engineering (WAE'99), LNCS 1668, pp. 30-42, 1999.

[Goiser and Christen 2006] Goiser K. & Christen, P. (2006), Towards automated record linkage, in ‘Australasian Data Mining Conference’ (AusDM’06), Sydney, Conferences in Research and Practice in Information Technology (CRPIT), vol. 61, pp. 23–31.

[Gravano and Ipeirtis 2001] Gravano, L., Ipeirotis, H., “Approximate string joins in a database (almost) for free”, In Proc. 27th Intl. Conf. on VLDB, pages 491 – 500, 2001

[Gu and Baxter 2004] Gu, L., Baxter, R.: "Adaptive Filtering for Efficient Record Linkage", SIAM 2004

[Gusfield 1997] Gusfield, D.: Algorithms on strings, trees, and sequences: computer science and computational biology, Cambridge, UK: Cambridge University Press, ISBN 0-521-58519-8 (1997)

[Hamming 1950] Hamming, R. W.: Error detecting and error correcting codes, In: Bell System Technical Journal 29 (2): 147–160, MR0035935 (1950)


[Hernandez and Stolfo 1998] Hernandez, M., Stolfo, S.: Real-world data is dirty: data cleansing and the merge/purge problem, In: Journal of Data Mining and Knowledge Discovery, 1(2), (1998)

[Hong, Yang, Kang and Lee 2008] Hong, Yoojin; Yang, Tao; Kang, Jaewoo; Lee, Dongwon (2008). “Record Linkage as DNA Sequence Alignment Problem”. VLDB ‘08, August 24-30, 2008, Auckland, New Zealand

[Jaccard 1901] Jaccard, P. (1901) Distribution de la flore alpine dans le bassin des Dranses et dans quelques régions voisines. Bulletin de la Société Vaudoise des Sciences Naturelles 37, 241-272.

[Jaro 1989] M. A. Jaro. Advances in Record Linkage Methodology as Applied to Matching the 1985 Census of Tampa, Florida. JASA, 84(406):414–420, 1989.

[Jupin and Shi 2011] Jupin, J., Shi, J.: Aggregate-Link-Match and Original CARES Matching Algorithms: Progress Report for Development and Experiments, 12/22/2011

[Jupin, Shi and Obradovic 2012] Jupin, J., Shi, J. and Obradovic, Z. (2012) "Understanding Cloud Data Using Approximate String Matching and Edit Distance," Proc. 2nd International Workshop on Sustainable HPC Cloud at the IEEE Supercomputing Conference 2012, Salt Lake City, Utah, Nov. 2012.

[Jupin Prelim 1 2011] Jupin, J.; Preliminary Examination 1; 2011-09-12

[Jurczyk, Lu, Xiong, Cragan and Correa 2008a] P. Jurczyk, J. Lu, L. Xiong, J. Cragan, A. Correa: “FRIL: A Tool for Comparative Record Linkage”, AMIA Annu Symp Proc. 2008; 2008: 440–444. Published online 2008

[Jurczyk, Lu, Xiong, Cragan and Correa 2008b] P. Jurczyk, J. Lu, L. Xiong, J. Cragan, A. Correa: “Fine-Grained Record Integration and Linkage Tool”, Birth Defects Research (Part A) 82:822-829 (2008), Wiley-Liss, Inc.

[Kardes, Konidena, Agrawal, Huff and Sun 2013] Kardes, H., Konidena, D., Agrawal, S., Huff, M., Sun, A.: Graph-based Approaches for Organization Entity Resolution in MapReduce, Proceedings of the TextGraphs-8 Workshop, pages 70-80, Seattle, Wash., USA, 2013

[Knuth 1973] Knuth, Donald E. (1973). The Art of Computer Programming: Volume 3, Sorting and Searching, 1st ed.. Addison-Wesley.

[Knuth 1997] Knuth, Donald E. (1997). The Art of Computer Programming: Volume 3, Sorting and Searching, 2nd ed.. Addison-Wesley.


[Kuch 1972] T. D. C. Kuch, Record linkage and the universal identifier, ACM SIGCAS Computers and Society, Volume 3 Issue 4, December 1972, ACM New York, NY, USA

[Lait and Randell 1996] Lait, A.J. and Randell, B.: An Assessment of Name Matching Algorithms, Technical Report Series, University of Newcastle Upon Tyne Computing Science (1996)

[Levenshtein 1966] Levenshtein, V. I.: Binary codes capable of correcting deletions, insertions, and reversals. Soviet Physics Doklady, (1966)

[Li, Dong, Maurino and Srivastava 2011] Pei Li, Xin Luna Dong, Andrea Maurino, and Divesh Srivastava. Linking Temporal Records. In VLDB 2011.

[Ling, Tompa and Kameda 1981] T.-W. Ling, F.W. Tompa, and T. Kameda, "An Improved Third Normal Form for Relational Databases", ACM Transactions on Database Systems, 6(2), June 1981, 329-346.

[Link Plus Guide 2007] The Link Plus Users Guide Version 2.0 (2007)

[LinkageWiz 2009] Record linkage software. Version 5.0. LinkageWiz Inc. Available from http://www.linkagewiz.net/. Retrieved 2009-04-20

[Lunen 2012] Lunen, A.: Way to go: A Graph Mining Based Approach to Record Linkage for Sparse Data, International Journal of Arts & Sciences 2012

[McCallum, Nigam and Ungar 2000] McCallum, A., Nigam, K., Ungar, L.: Efficient clustering of high-dimensional data sets, In: Proc. of 6th ACM SIGKDD Int. Conf. on KDD, pp. 169–178 (2000)

[Mexican Nicknames] Mexican nicknames: http://www.mexicofile.com/hispanicapodosnicknamesprofessionaffectionandphysicalcharacteristics.htm

[Microsoft 2012] “SOUNDEX (Transact-SQL)” at Microsoft MSDN Documentation Website. http://msdn.microsoft.com. Retrieved 2012-11-03.

[Microsoft MSDN] Microsoft MSDN: http://msdn.microsoft.com/en-us/library/x79h55x9.aspx

[Myers 1994] Myers, E.: “A sublinear algorithm for approximate keyword searching”, Algorithmica, 12(4/5) pp. 345-374, 1994

[MySQL 2012] "MySQL :: MySQL 5.5 Reference Manual :: 12.5 String Functions - SOUNDEX" at MySQL Documentation Website. http://dev.mysql.com. Retrieved 2012-11-03


[NANP] North American Numbering Plan: http://www.nanpa.com/

[NARA 2007] "The Soundex Indexing System". National Archives and Records Administration (NARA). http://www.archives.gov/research/census/soundex.html. Retrieved 2007-05-30

[Navarro 2001] Navarro, G.: "A guided tour to approximate string matching", ACM Computing Surveys (CSUR) Volume 33 Issue 1, Pages 31-88, March 2001

[Navarro, Baeza-yates, Sutinen, and Tarhio 2000] Navarro, G., Baeza-yates, R., Sutinen, E., and Tarhio, J.: “Indexing Methods for Approximate String Matching”, IEEE Data Engineering Bulletin 2000

[NCHS 2010] Linkages between Survey Data from the National Center for Health Statistics and Medicare Program Data from the Centers for Medicare and Medicaid Services; http://www.cdc.gov/nchs/data/datalinkage/cms_medicare_methods_report_final.pdf; 2010

[Newcombe 1988] Newcombe, H. B. (1988). “Handbook of Record Linkage Methods for Health and Statistical Studies. Administration and Business.” Oxford, UK: Oxford University Press.

[Oracle 2012] “Soundex” at Oracle Documentation Website. http://docs.oracle.com. Retrieved 2012-11-03.

[Li, Dong, Maurino and Srivastava 2011] Pei Li, Xin Luna Dong, Andrea Maurino, and Divesh Srivastava. Linking Temporal Records. In VLDB 2011.

[Postel 1969] Postel, H. J.: Die Kölner Phonetik. Ein Verfahren zur Identifizierung von Personennamen auf der Grundlage der Gestaltanalyse [The Cologne phonetic: a method for identifying personal names based on shape analysis]. IBM-Nachrichten, Vol. 19, pp. 925-931 (1969)

[Philips 1990] Philips, L.: "Hanging on the Metaphone", Computer Language, Vol. 7, No. 12, December 1990

[PostgreSQL 2012] "PostgreSQL: Documentation: 9.1: fuzzystrmatch" at PostgreSQL Documentation Website. http://postgresql.com. Retrieved 2012-11-03.

[Ravikumar and Cohen 2004] Ravikumar, P., Cohen, W., A hierarchical graphical model for record linkage, UAI '04 Proceedings of the 20th conference on Uncertainty in artificial intelligence.

[Rezig, Dragut, Ouzzani and Elmagarmid] E. Rezig, E. Dragut, M. Ouzzani, and A. Elmagarmid. Query-Time Record Linkage and Fusion over Web Databases. 31st International Conference on Data Engineering (ICDE'15)

[Sakoe and Chiba 1990] Sakoe, H., Chiba, S.: Dynamic programming algorithm optimization for spoken word recognition, pp. 159–165, (1990)

[Shen, DeRose, Vu, Doan and Ramakrishnan 2007] W. Shen, P. DeRose, L. Vu, A. Doan, R. Ramakrishnan. Source-aware Entity Matching: A Compositional Approach. pp. 196-205, ICDE 2007.

[Shu and Meng 2009] L. Shu, B. Long, W. Meng. A Latent Topic Model for Complete Entity Resolution. 25th IEEE International Conference on Data Engineering (ICDE), pp.880-891, Shanghai, China, March 2009.

[Shu, Meng and Yu 2007] L. Shu, W. Meng, H. He, C. Yu. Querying Capability Modeling and Construction. 8th International Conference on Web Information Systems Engineering (WISE), 2007.

[SSA] Social Security Administration: The SSN Numbering Scheme: http://www.socialsecurity.gov/

[Stanier 1990] Alan Stanier, September 1990, Computers in Genealogy, Vol. 3, No. 7

[Sun and Naughton 2012] Sun, C., Naughton, J., Barman, S.: "Approximate String Membership Checking: A Multiple Filter, Optimization-Based Approach", ICDE, 2012

[Taft 1970] Taft, R. L. (1970), "Name Search Techniques", Albany, New York: New York State Identification and Intelligence System

[Thoburn, Gu and Rawson 2007] K.K. Thoburn, D. Gu and T. Rawson. Link Plus: Probabilistic Record Linkage Software. 2nd Probabilistic Record Linkage Conference Call, 2007.

[Thor and Rahm 2007] A. Thor, E. Rahm. MOMA – A Mapping-based Object Matching System. CIDR 2007, pp. 247-258.

[Tversky 1977] Tversky, Amos (1977). "Features of Similarity". Psychological Review 84 (4): 327-352.

[U.S. Census 1990] U. S. Census Website: http://www.census.gov/

[U.S. Census 2000] U. S. Census Website: http://www.census.gov/

[U.S. Patent 1,261,167 1918] U.S. Patent No. 1,261,167 Issued on April 2, 1918

[Ukkonen 1985] Ukkonen, E.: “Finding approximate patterns in strings”, Journal of Algorithms, 6:132-137, 1985

[Ukkonen 1993] Ukkonen, E.; Approximate String Matching Over Suffix Trees. CPM 1993. LNCS, Vol. 684, pp. 228-242, Springer, Heidelberg 1993

[Wang and Madnick 1989] Y. Wang and S. Madnick. The Inter-Database Instance Identification Problem in Integrating Autonomous Systems. pp.46-55, ICDE 1989.

[Wang, Feng and Li 2010] Wang, J., Feng, J., Li, G.: "Trie-Join: Efficient Trie-based String Similarity Joins with Edit-Distance Constraints". Proceedings of the VLDB Endowment, Vol. 3, No. 1 (2010)

[Fan, Fan, Tian and Dong 2015] Fan, W., Fan, Z., Tian, C., Dong, X.: Keys for Graphs, VLDB Vol. 8, No. 12, 2015

[Wegner 1960] Wegner, P.: A technique for counting ones in a binary computer, In: Communications of the ACM 3 (5): 322 (1960)

[Winkler 1990] Winkler, W. E.: String Comparator Metrics and Enhanced Decision Rules in the Fellegi-Sunter Model of Record Linkage, In: Proceedings of the Section on Survey Research Methods, American Statistical Association: pp. 354-359 (1990)

[Xiao, Wang and Lin 2008] Xiao, C., Wang, W., Lin, X.: Ed-join: an efficient algorithm for similarity joins with edit distance constraints. PVLDB, 1(1):933–944 (2008)

BIBLIOGRAPHY

Ee-Peng Lim (1998), "Entity Identification," Encyclopedia of Computer Science and Technology, Vol. 39, Marcel Dekker Inc., New York, pp. 1-14.

Statistics New Zealand (2006). “Data Integration Manual”

Encyclopedia of Public Health. "Record Linkage"

Math, Myth and Magic (2004). Introduction to Name Search and Matching.

Christen, P. (2012). "Data Matching: Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection". Springer Data-Centric Systems and Applications Series

PUBLISHED WORK

Jones, P.; Schwartz, D.; Schwartz, I.; Obradovic, Z.; Jupin, J.; "Risk Classification and Juvenile Dispositions: What Is the State of the Art"; Temple Law Review, Volume 79:2, Temple University.

J. Mennis, P. Harris, Z. Obradovic, A. Izenman, H. Grunwald, Y. Qin, J. Jupin, and B. Lockwood; "Spatial Clustering of Juvenile Delinquency and Recidivism in Philadelphia, Pennsylvania"; Presented at the 103rd Annual Meeting of the Association of American Geographers, April 17-21, 2007, San Francisco, CA.

P. Harris, J. Mennis, Z. Obradovic, A. Izenman, H. Grunwald, Y. Qin, J. Jupin, and B. Lockwood; "Exploratory Spatial Analysis of Juvenile Delinquency and Recidivism"; Presented at the Ninth Crime Mapping Conference, March 28-31, Pittsburgh, PA.

Mennis, J., Harris, P. W., Grunwald, H., Lockwood, B., Izenman, A., Obradovic, Z. and Jupin, J. (2008) "Characterizing Neighborhoods for Modeling Environmental Effects on Individual Behavioral Outcomes", paper presented at the annual meeting of the American Society of Criminology (ASC), Adam's Mark, St. Louis, Missouri, November 11, 2008.

Ayuyev, V., Jupin, J., Harris, P. and Obradovic, Z. (2009) "Dynamic Clustering Based Estimation of Missing Values in Mixed Type Data," Proc. 11th Int’l Conf. on Data Warehousing and Knowledge Discovery, Linz, Austria, Sept. 2009, pp. 366-377.

Harris, P., Mennis, J., Obradovic, Z., Izenman, A., Grunwald, H., Lockwood, B., Jupin, J., and Chisholm, L. (2009) "Investigating the Simultaneous Effects of Individual, Neighborhood, and Program Effects on Juvenile Recidivism Using GIS and Spatial Statistics." Research report, the U.S. Department of Justice, National Institute of Justice.

Izenman, A., Harris, P., Mennis, J., Jupin, J., Obradovic, Z. (2011) "Local Spatial Biclustering and Prediction of Urban Juvenile Delinquency and Recidivism," Statistical Analysis and Data Mining Journal, vol. 4, no. 3, pp. 259-275.

Jupin, J., Shi, J. and Obradovic, Z. (2012) "Understanding Cloud Data Using Approximate String Matching and Edit Distance," Proc. 2nd International Workshop on Sustainable HPC Cloud at the IEEE Supercomputing Conference 2012, Salt Lake City, Utah, Nov. 2012.

Jupin, J., Shi, J.: "Identity Tracking in Big Data – Preliminary Research Using In-Memory Data Graph Models for Record Linkage and Probabilistic Signature Hashing for Approximate String Matching in Big Health and Human Services Databases", BigDataScience '14 - The Third ASE International Conference on Big Data Science and Computing, August 4-7, 2014, Beijing, China

Jupin, J., Shi, J.: PSH: A Probabilistic Signature Hash Method with Hash Neighborhood Candidate Generation for Fast Edit-Distance String Comparison on Big Data, IEEE Big Data, Washington, D.C., December 2016 (in review)

Jupin, J., Shi, J.: ALIM: A Foundation for a Graph-Based Data Model to Facilitate Record Linkage on Temporal Data, IEEE Big Data, Washington, D.C., December 2016 (in review)
