Data Linkage Techniques: Past, Present and Future

Peter Christen
Department of Computer Science, The Australian National University
Contact: [email protected]
Project Web site: http://datamining.anu.edu.au/linkage.html

Funded by the Australian National University, the NSW Department of Health, the Australian Research Council (ARC) under Linkage Project 0453463, and the Australian Partnership for Advanced Computing (APAC)

Outline

- What is data linkage?
- Applications and challenges
- The past: A short history of data linkage
- The present: Computer science based approaches: Learning to link
- The future: Scalability, automation, and privacy and confidentiality
- Our project: Febrl (Freely extensible biomedical record linkage)

What is data (or record) linkage?

- The process of linking and aggregating records from one or more data sources representing the same entity (patient, customer, business name, etc.)
- Also called data matching, data cleaning, data scrubbing, ETL (extraction, transformation and loading), object identification, merge-purge, etc.
- Challenging if no unique entity identifiers are available
- E.g., which of these records represent the same person?

  Dr Smith, Peter   42 Miller Street  2602 O'Connor
  Pete Smith        42 Miller St      2600 Canberra A.C.T.
  P. Smithers       24 Mill Street    2600 Canberra ACT

Recent interest in data linkage

- Traditionally, data linkage has been used in health (epidemiology) and statistics (census)
- In recent years, increased interest from businesses and governments
  - A lot of data is being collected by many organisations
  - Increased computing power and storage capacities
  - Data warehousing and distributed databases
  - Data mining of large data collections
  - E-Commerce and Web applications (for example online product comparisons: http://froogle.com)
  - Geocoding and spatial data analysis

Applications and usage

- Applications of data linkage
  - Remove duplicates in a data set (internal linkage)
  - Merge new records into a larger master data set
  - Create patient or customer oriented statistics
  - Compile data for longitudinal (over time) studies
  - Geocode matching (with reference address data)
- Widespread use of data linkage
  - Immigration, taxation, social security, census
  - Fraud, crime and terrorism intelligence
  - Business mailing lists, exchange of customer data
  - Social, health and biomedical research

Challenge 1: Dirty data

- Real world data is often dirty
  - Missing values, inconsistencies
  - Typographical errors and other variations
  - Different coding schemes / formats
  - Out-of-date data
- Names and addresses are especially prone to data entry errors (over phone, hand-written, scanned)
- Cleaned and standardised data is needed for
  - loading into databases and data warehouses
  - data mining and other data analysis studies
  - data linkage and deduplication

Challenge 2: Scalability

- Data collections with tens or even hundreds of millions of records are not uncommon
- The number of possible record pairs to compare equals the product of the sizes of the two data sets (linking two data sets with 1,000,000 records each will result in 10^6 × 10^6 = 10^12 record pairs)
- The performance bottleneck in a data linkage system is usually the (expensive) comparison of attribute (field) values between record pairs
- Blocking / indexing / filtering techniques are used to reduce the large number of comparisons
- The linkage process should be automatic

Challenge 3: Privacy and confidentiality

- The general public is worried about their information being linked and shared between organisations
  - Good: research, health, statistics, crime and fraud detection (taxation, social security, etc.)
  - Scary: intelligence, surveillance, commercial data mining (not much information from businesses, no regulation)
  - Bad: identity fraud, re-identification
- Traditionally, identified data has to be given to the person or organisation performing the linkage
  - Privacy of individuals in data sets is invaded
  - Consent of the individuals involved is needed
  - Alternatively, seek approval from ethics committees

Outline: The past

Traditional data linkage techniques

- Computer assisted data linkage goes back as far as the 1950s (based on ad-hoc heuristic methods)
- Deterministic linkage
  - Exact linkage, if a unique identifier of high quality is available (has to be precise, robust, stable over time)
  - Examples: Medicare, ABN or Tax file number (are they really unique, stable, trustworthy?)
  - Rules based linkage (complex to build and maintain)
- Probabilistic linkage
  - Apply linkage using available (personal) information (like names, addresses, dates of birth, etc.)

Probabilistic data linkage

- Basic ideas of probabilistic linkage were introduced by Newcombe & Kennedy (1962)
- Theoretical foundation by Fellegi & Sunter (1969)
- No unique entity identifiers available
- Compare common record attributes (or fields)
- Compute matching weights based on frequency ratios (global or value specific) and error estimates
- The sum of the matching weights is used to classify a pair of records as match, non-match, or possible match
- Problems: Estimating errors and threshold values, the assumption of independence, and manual clerical review
- Still the basis of many linkage systems

Fellegi and Sunter classification

- For each compared record pair a vector containing matching weights is calculated

  Record A:         ['dr', 'peter', 'paul', 'miller']
  Record B:         ['mr', 'john',  '',     'miller']
  Matching weights: [ 0.2, -3.2,    0.0,    2.4    ]

- The Fellegi & Sunter approach sums all weights, then uses two thresholds to classify record pairs as non-matches, possible matches, or matches (see the sketch below)
- [Figure: histogram of total matching weights (roughly -5 to 15) with a lower and an upper classification threshold; many more record pairs lie at the lower weights]
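The following is a minimal Python sketch of this two-threshold classification. The m- and u-probabilities and the threshold values are illustrative assumptions (none of these numbers come from the talk), and fields are compared by simple exact equality, without the frequency ratios or missing-value handling a real system would use:

```python
import math

# Illustrative m- and u-probabilities per attribute (assumed values):
# m = P(fields agree | record pair is a match),
# u = P(fields agree | record pair is a non-match).
M_U = {'title': (0.9, 0.3), 'given_name': (0.95, 0.01),
       'middle_name': (0.8, 0.05), 'surname': (0.95, 0.02)}

LOWER, UPPER = 0.0, 5.0   # assumed lower and upper thresholds

def match_weight(attr, agrees):
    """Fellegi & Sunter weight: log2(m/u) on agreement,
    log2((1-m)/(1-u)) on disagreement."""
    m, u = M_U[attr]
    return math.log2(m / u) if agrees else math.log2((1 - m) / (1 - u))

def classify(rec_a, rec_b):
    """Sum the per-field weights and apply the two thresholds."""
    total = sum(match_weight(attr, rec_a[attr] == rec_b[attr])
                for attr in M_U)
    if total >= UPPER:
        return total, 'match'
    if total <= LOWER:
        return total, 'non-match'
    return total, 'possible match'   # left for clerical review

rec_a = {'title': 'dr', 'given_name': 'peter',
         'middle_name': 'paul', 'surname': 'miller'}
rec_b = {'title': 'mr', 'given_name': 'john',
         'middle_name': '', 'surname': 'miller'}
print(classify(rec_a, rec_b))   # only the surname agrees -> non-match
```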

Traditional blocking

- Traditional blocking works by only comparing record pairs that have the same value for a blocking variable (for example, only compare records that have the same postcode value); see the sketch below
- Problems with traditional blocking
  - An erroneous value in a blocking variable results in a record being inserted into the wrong block (several passes with different blocking variables can solve this)
  - Values of the blocking variable should be uniformly distributed (as the most frequent values determine the size of the largest blocks)
  - Example: Frequency of 'Smith' in NSW: 25,425
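A minimal sketch of this idea in Python, using made-up records and postcode as the blocking variable:

```python
from collections import defaultdict
from itertools import combinations

# Group records on a blocking variable (postcode here) and generate
# candidate pairs only within each block; all other pairs are never
# compared. Records and field names are illustrative.
records = [
    {'id': 1, 'name': 'peter smith', 'postcode': '2602'},
    {'id': 2, 'name': 'pete smith',  'postcode': '2600'},
    {'id': 3, 'name': 'p. smithers', 'postcode': '2600'},
]

blocks = defaultdict(list)
for rec in records:                      # one pass, one blocking variable
    blocks[rec['postcode']].append(rec)

candidate_pairs = [pair for block in blocks.values()
                   for pair in combinations(block, 2)]
print([(a['id'], b['id']) for a, b in candidate_pairs])   # [(2, 3)]
```

Note how an erroneous postcode in record 1 would silently keep it out of the 2600 block, which is why several passes with different blocking variables are used in practice.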

Outline: The present

Improved classification

- Summing the matching weights results in a loss of information (e.g., a pair agreeing on name but not address can receive the same total weight as a pair agreeing on address but not name)
- View record pair classification as a multi-dimensional binary classification problem (use the matching weight vectors to classify record pairs into matches or non-matches, but no possible matches)
- Different techniques can be used
  - Supervised: Manually prepared training data needed (record pairs and their match status), almost like manual clerical review before the linkage
  - Un-supervised: Find (local) structure in the data (similar record pairs) without training data

Classification challenges

- In many cases there is no training data available
  - Possible to use results of earlier linkage projects? Or from the clerical review process?
  - How confident can we be about correct manual classification of possible links?
- Often there is no gold standard available (no data sets with true known linkage status)
- No test data set collection available (like in information retrieval or data mining)
  - Recent small repository: RIDDLE (Repository of Information on Duplicate Detection, Record Linkage, and Identity Uncertainty) http://www.cs.utexas.edu/users/ml/riddle/

Classification research (1)

- Information retrieval based
  - Represent records as document vectors
  - Calculate distances between vectors (tf-idf weights)
- Database research approaches
  - Extend the SQL language (fuzzy join operator)
  - Implement linkage algorithms using SQL statements
- Supervised machine learning techniques
  - Learn string distance measures (edit-distance costs for character insert, delete, substitute)
  - Decision trees, genetic programming, association rules, expert systems, etc.

Classification research (2)

- Semi-supervised techniques
  - Aim is to reduce the manual training effort
  - Active learning (select the record pair for manual classification that a set of classifiers disagree on the most)
  - Semi-supervised clustering (provide some manually classified record pairs)
- Un-supervised techniques
  - Clustering techniques (k-means, farthest first, etc.)
  - Hierarchical graphical models (probabilistic approaches)
- Main criticism: Often only confidential or small test data sets (like bibliographic citations, restaurant names, etc.)
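As a minimal sketch of the unsupervised route, the following clusters matching weight vectors into two groups with k-means and labels the cluster with the larger centre as the matches. The weight vectors and the use of scikit-learn are illustrative assumptions, not part of the talk:

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up matching weight vectors, one row per compared record pair
# (four compared fields each). No training labels are used.
weight_vectors = np.array([
    [2.4, 3.1, 2.8, 2.9],     # agrees on most fields
    [0.2, -3.2, 0.0, 2.4],    # mixed agreement
    [-2.1, -3.0, -2.5, -1.9],
    [2.0, 2.9, 3.0, 2.7],
    [-1.8, -2.7, -2.9, -2.2],
])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(weight_vectors)

# Interpret the cluster whose centre has the larger summed weight as
# the matches, and the other cluster as the non-matches.
match_cluster = kmeans.cluster_centers_.sum(axis=1).argmax()
for vec, label in zip(weight_vectors, kmeans.labels_):
    print(vec, 'match' if label == match_cluster else 'non-match')
```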


Blocking research

- Sorted neighbourhood approach (sliding window over the sorted blocking variable)
- Fuzzy blocking using q-grams, e.g. bigrams ('peter' → ['pe','et','te','er'], 'pete' → ['pe','et','te']); see the sketch below
- Overlapping canopy clustering (cheaply insert records into several clusters)
- Post-blocking filtering (like length differences or q-gram count differences)
- Supervised learning for blocking (minimise removal of true matches by the blocking process)
- US Census Bureau: BigMatch (pre-process the 'smaller' data set so its record values can be accessed directly, with all blocking passes in one go)
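A minimal sketch of fuzzy blocking with bigrams, assuming made-up record values and the simple rule that records sharing at least one bigram land in a common block:

```python
from collections import defaultdict

def qgrams(value, q=2):
    """Return the q-grams of a string, e.g. 'peter' -> ['pe','et','te','er']."""
    return [value[i:i + q] for i in range(len(value) - q + 1)]

records = {1: 'peter', 2: 'pete', 3: 'paula'}   # illustrative values

# Each record is inserted into one (overlapping) block per bigram, so
# small typographical variations still share blocks.
blocks = defaultdict(set)
for rec_id, name in records.items():
    for gram in qgrams(name):
        blocks[gram].add(rec_id)

# 'peter' and 'pete' share the bigrams 'pe', 'et' and 'te' and will be
# compared, although exact blocking on the full name would miss them.
print({gram: sorted(ids) for gram, ids in blocks.items() if len(ids) > 1})
```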

Outline: The future

The main future challenges

- Scalability
  - New computational techniques are required to allow large scale linking of massive data collections on modern parallel and distributed computing platforms.
- Automation
  - Decision models are needed that will reduce or even eliminate the manual clerical review (or the preparation of training data) while keeping a high linkage quality.
- Privacy and confidentiality
  - Techniques are required that will allow the linking of large scale data collections between organisations without revealing any personal or confidential information.
  - Public acceptance

Scalability / Computational issues

- Aim is to be able to link very large data sets with tens to hundreds of millions of records
- Large parallel machines needed (super-computers)
  - Alternatively, use networked PCs / workstations (run linkage jobs in the background, over night or on weekends)
- Linkage between organisations
- Parallel and/or distributed computing platforms (clusters, computational grids, Web services)
  - Fault tolerance (networks, computing nodes), (dynamic) load balancing, heterogeneous platforms (standards, transformations, meta data)
  - Security, access, interfaces, charging policies, etc.

Privacy and confidentiality issues

- Traditional data linkage requires that identified data (names, addresses, dates of birth, etc.) is given to the person or organisation performing the linkage
- Approval from ethics committees is required, as it is unfeasible to get consent from a large number of individuals
- Complete trust in the linkage organisation, their staff, and their computing and networking systems
- Invasion of privacy could be avoided (or mitigated) if some method were available to determine which records in two data sets match, without revealing any identifying information

Privacy preserving approach

- Alice has a database A she wants to link with Bob (without revealing the actual values in A)
- Bob has a database B he wants to link with Alice (without revealing the actual values in B)
- Easy if only exact matches are considered
  - Encode the data using one-way hashing (like SHA)
  - Example: 'peter' → '51ddc7d3a611eeba6ca770'
- More complicated if values contain typographical errors or other variations (even a single character difference between two strings will result in very different hash encodings)
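A minimal sketch of the exact-match case, using Python's standard hashlib (SHA-1 here; the hash string in the slide example is abbreviated, whereas these digests are full length):

```python
import hashlib

def encode(value):
    """One-way hash of a field value; only digests are ever exchanged."""
    return hashlib.sha1(value.encode('utf-8')).hexdigest()

print(encode('peter'))   # both parties compute the same digest for 'peter'
print(encode('petre'))   # one transposition -> a completely different digest
```

In practice a keyed encoding (a hash with a shared secret) is needed so that a third party cannot mount a dictionary attack over common names; this leads to the three-party protocol on the next slide.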


Three-party linkage protocol

- Alice and Bob negotiate a shared secret key
- They encode their data (using this key) and send it to a third party (Carol) that performs the linkage
- Results are sent back to Alice and Bob
- All transmitted data is encrypted using a public key infrastructure (PKI)
- [Figure: protocol diagram: 1) Alice and Bob negotiate the secret key; 2) each sends their encoded database to Carol; 3) Carol returns the encoded linkage results to Alice and Bob]

Privacy preserving research

- Pioneered by French researchers in the 1990s [Quantin et al., 1998] (for situations where de-identified data needs to be centralised and linked for follow-up studies)
- Blindfolded record linkage [Churches and Christen, 2004] (allows approximate linkage of strings with typographical errors based on q-gram techniques)
- Privacy-preserving data linkage protocols [O'Keefe et al., 2004] (several protocols with improved security and less information leakage)
- Blocking aware private record linkage [Al-Lawati et al., 2005] (approximate linkage based on tokens and tf-idf, and three blocking approaches)
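To make the three-party protocol above concrete, here is a minimal sketch using a keyed hash (HMAC-SHA1); the key, the databases, and the choice of a single linkage field are illustrative assumptions, not the talk's exact encoding scheme:

```python
import hmac
import hashlib

# Alice and Bob share this key; Carol never sees it, so she can match
# equal encodings but cannot dictionary-attack the underlying values.
SHARED_KEY = b'negotiated-between-alice-and-bob'

def encode(value):
    return hmac.new(SHARED_KEY, value.encode('utf-8'),
                    hashlib.sha1).hexdigest()

alice_db = {1: 'peter', 2: 'paula'}   # record id -> field value
bob_db = {7: 'peter', 8: 'john'}

# Step 2: Alice and Bob each send only the encodings to Carol.
alice_encoded = {rec_id: encode(v) for rec_id, v in alice_db.items()}
bob_encoded = {rec_id: encode(v) for rec_id, v in bob_db.items()}

# Step 3: Carol links on equal encodings and returns matching id pairs.
matches = [(a_id, b_id)
           for a_id, a_enc in alice_encoded.items()
           for b_id, b_enc in bob_encoded.items()
           if a_enc == b_enc]
print(matches)   # [(1, 7)]: 'peter' matches, value never revealed to Carol
```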


Outline: Our project: Febrl

The Febrl project

- Aims at developing new and improved techniques for parallel large scale data linkage
- Main research areas
  - Probabilistic techniques for automated data cleaning and standardisation (mainly on addresses)
  - New and improved blocking and indexing techniques
  - Improved record pair classification using (un-supervised) machine learning techniques (reduce clerical review)
  - Improved performance (scalability and parallelism)
- Project Web site: http://datamining.anu.edu.au/linkage.html


Febrl prototype software

- An experimental platform for new and improved data linkage algorithms
- Modules for data cleaning and standardisation, data linkage, deduplication, geocoding, and the generation of synthetic data sets
- Open source: https://sourceforge.net/projects/febrl/
- Implemented in Python (http://www.python.org)
  - Easy and rapid prototype software development
  - Object-oriented and cross-platform (Unix, Win, Mac)
  - Can handle large data sets stably and efficiently
  - Many external modules, easy to extend
  - Large user community

What can you do..?!

- Commercial data linkage consultancies and software are expensive (∼$10,000 to many $100,000s)
- Support local research and training
  - Develop local knowledge and expertise, rather than relying upon overseas software vendors
  - Training of PhD, Masters and honours students
- Australian Research Council (ARC) Linkage projects
  - Partially funded by an industry / government organisation
  - Develop techniques and methods specific to the industry
  - Smallest contributions ∼$15,000 plus in-kind (per annum over 3-4 years)

Outlook

- Recent interest in data linkage from governments and businesses
  - Data mining and data warehousing
  - E-Commerce and Web applications
  - Census, crime/fraud detection, intelligence/surveillance
- Main future challenges
  - Automated and accurate linkage
  - Higher performance (linking very large data sets)
  - Secure and privacy-preserving data linkage
- For more information see our project Web site (publications, talks, software, Web resources / links)

Contributions / Acknowledgements

- Dr Tim Churches (New South Wales Health Department, Centre for Epidemiology and Research)
- Dr Markus Hegland (ANU Mathematical Sciences Institute)
- Dr Lee Taylor (New South Wales Health Department, Centre for Epidemiology and Research)
- Mr Alan Willmore (New South Wales Health Department, Centre for Epidemiology and Research)
- Ms Kim Lim (New South Wales Health Department, Centre for Epidemiology and Research)
- Mr Karl Goiser (ANU Computer Science PhD student)
- Mr Daniel Belacic (ANU Computer Science honours student, 2005)
- Mr Puthick Hok (ANU Computer Science honours student, 2004)
- Mr Justin Zhu (ANU Computer Science honours student, 2002)
- Mr David Horgan (ANU Computer Science summer student, 2003/2004)
