Data Linkage Techniques: Past, Present and Future

Peter Christen
Department of Computer Science, The Australian National University
Contact: [email protected]
Project Web site: http://datamining.anu.edu.au/linkage.html

Funded by the Australian National University, the NSW Department of Health, the Australian Research Council (ARC) under Linkage Project 0453463, and the Australian Partnership for Advanced Computing (APAC)

Outline

- What is data linkage? Applications and challenges
- The past: A short history of data linkage
- The present: Computer science based approaches: Learning to link
- The future: Scalability, automation, and privacy and confidentiality
- Our project: Febrl (Freely extensible biomedical record linkage)

What is data (or record) linkage?

- The process of linking and aggregating records from one or more data sources representing the same entity (patient, customer, business name, etc.)
- Also called data matching, data integration, data scrubbing, ETL (extraction, transformation and loading), object identification, merge-purge, etc.
- Challenging if no unique entity identifiers are available. For example, which of these records represent the same person?
  - Dr Smith, Peter   42 Miller Street   2602 O'Connor
  - Pete Smith        42 Miller St       2600 Canberra A.C.T.
  - P. Smithers       24 Mill Street     2600 Canberra ACT

Recent interest in data linkage

- Traditionally, data linkage has been used in health (epidemiology) and statistics (census)
- In recent years, increased interest from businesses and governments:
  - A lot of data is being collected by many organisations
  - Increased computing power and storage capacities
  - Data warehousing and distributed databases
  - Data mining of large data collections
  - E-commerce and Web applications (for example, online product comparisons: http://froogle.com)
  - Geocoding and spatial data analysis

Applications and usage

- Applications of data linkage:
  - Remove duplicates in a data set (internal linkage)
  - Merge new records into a larger master data set
  - Create patient or customer oriented statistics
  - Compile data for longitudinal (over time) studies
  - Geocode matching (with reference address data)
- Widespread use of data linkage:
  - Immigration, taxation, social security, census
  - Fraud, crime and terrorism intelligence
  - Business mailing lists, exchange of customer data
  - Social, health and biomedical research

Challenge 1: Dirty data

- Real world data is often dirty:
  - Missing values, inconsistencies
  - Typographical errors and other variations
  - Different coding schemes / formats
  - Out-of-date data
- Names and addresses are especially prone to data entry errors (recorded over the phone, hand-written, or scanned)
- Cleaned and standardised data is needed for
  - loading into databases and data warehouses,
  - data mining and other data analysis studies, and
  - data linkage and deduplication.
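To make the standardisation step concrete, below is a minimal Python sketch of the kind of cleaning applied to address fields before linkage. It is an illustration only, not Febrl's actual implementation; the substitution table and example values are invented.

```python
import re

# Invented substitution table; real systems use large, field-specific
# lookup tables ('dr' alone is ambiguous: 'doctor' in a name field,
# 'drive' in an address field, so fields are standardised separately).
ADDRESS_SUBS = {"st": "street", "rd": "road", "ave": "avenue"}

def standardise_address(value: str) -> str:
    """Lowercase, strip punctuation, and expand common abbreviations."""
    cleaned = re.sub(r"[^a-z0-9 ]", " ", value.lower())
    tokens = [ADDRESS_SUBS.get(tok, tok) for tok in cleaned.split()]
    return " ".join(tokens)

# Two of the address variants from the example records above now map
# to the same string, so a simple comparison can link them:
print(standardise_address("42 Miller Street"))  # -> 42 miller street
print(standardise_address("42 Miller St"))      # -> 42 miller street
```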
Challenge 2: Scalability

- Data collections with tens or even hundreds of millions of records are not uncommon
- The number of possible record pairs to compare equals the product of the sizes of the two data sets (linking two data sets with 1,000,000 records each results in 10^6 × 10^6 = 10^12 record pairs)
- The performance bottleneck in a data linkage system is usually the (expensive) comparison of attribute (field) values between record pairs
- Blocking / indexing / filtering techniques are used to reduce the large number of comparisons
- The linkage process should be automatic

Challenge 3: Privacy and confidentiality

- The general public is worried about their information being linked and shared between organisations
  - Good: research, health, statistics, crime and fraud detection (taxation, social security, etc.)
  - Scary: intelligence, surveillance, commercial data mining (not much information from businesses, no regulation)
  - Bad: identity fraud, re-identification
- Traditionally, identified data has to be given to the person or organisation performing the linkage
  - Privacy of the individuals in the data sets is invaded
  - Consent of the individuals involved is needed
  - Alternatively, seek approval from ethics committees

Outline: The past (A short history of data linkage)

Traditional data linkage techniques

- Computer assisted data linkage goes back as far as the 1950s (based on ad-hoc heuristic methods)
- Deterministic linkage
  - Exact linkage, if a unique identifier of high quality is available (it has to be precise, robust, and stable over time)
  - Examples: Medicare, ABN or Tax file number (are they really unique, stable, trustworthy?)
  - Rules based linkage (complex to build and maintain)
- Probabilistic linkage
  - Apply linkage using the available (personal) information (like names, addresses, dates of birth, etc.)

Probabilistic data linkage

- Basic ideas of probabilistic linkage were introduced by Newcombe & Kennedy (1962)
- Theoretical foundation by Fellegi & Sunter (1969)
  - No unique entity identifiers available
  - Compare common record attributes (or fields)
  - Compute matching weights based on frequency ratios (global or value specific) and error estimates
  - The sum of the matching weights is used to classify a pair of records as a match, non-match, or possible match
- Problems: estimating errors and threshold values, the assumption of independence, and manual clerical review
- Still the basis of many linkage systems

Fellegi and Sunter classification

- For each compared record pair a vector containing matching weights is calculated, for example:
  Record A: ['dr', 'peter', 'paul', 'miller']
  Record B: ['mr', 'john', '', 'miller']
  Matching weights: [0.2, -3.2, 0.0, 2.4]
- The Fellegi & Sunter approach sums all weights (then uses two thresholds to classify record pairs as non-matches, possible matches, or matches)
- [Figure: histogram of record pairs by total matching weight (x-axis from -5 to 15), with the lower and upper thresholds marked; many more pairs lie at the lower weights]
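To illustrate the summing and the two-threshold rule, here is a minimal Python sketch of a Fellegi & Sunter style classifier. The m- and u-probabilities (field agreement rates among true matches and among non-matches) and the thresholds are invented for illustration; in practice they are estimated from the data, under the independence assumption noted above.

```python
from math import log2

# Invented error estimates per field:
#   m = P(field agrees | record pair is a true match)
#   u = P(field agrees | record pair is a non-match)
FIELDS = {                 #   m      u
    "given_name": (0.95, 0.010),
    "surname":    (0.95, 0.005),
    "postcode":   (0.90, 0.050),
}
LOWER, UPPER = 0.0, 10.0   # invented lower / upper thresholds

def match_weight(agreements):
    """Sum the per-field log-ratio weights: positive on agreement,
    negative on disagreement."""
    total = 0.0
    for field, agrees in agreements.items():
        m, u = FIELDS[field]
        total += log2(m / u) if agrees else log2((1 - m) / (1 - u))
    return total

def classify(agreements):
    weight = match_weight(agreements)
    if weight >= UPPER:
        return "match"
    if weight <= LOWER:
        return "non-match"
    return "possible match"   # would go to manual clerical review

print(classify({"given_name": True,  "surname": True,  "postcode": True}))   # match
print(classify({"given_name": False, "surname": True,  "postcode": True}))   # possible match
print(classify({"given_name": False, "surname": False, "postcode": False}))  # non-match
```

Each agreement pattern corresponds to one weight vector as in the example above; only the summed total is compared against the two thresholds.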
Traditional blocking

- Traditional blocking works by only comparing record pairs that have the same value for a blocking variable (for example, only compare records that have the same postcode value); a minimal sketch of this idea is given at the end of this section
- Problems with traditional blocking:
  - An erroneous value in a blocking variable results in a record being inserted into the wrong block (several passes with different blocking variables can solve this)
  - The values of the blocking variable should be uniformly distributed, as the most frequent values determine the size of the largest blocks (example: the frequency of 'Smith' in NSW is 25,425)

Outline: The present (Computer science based approaches: Learning to link)

Improved classification

- Summing the matching weights results in a loss of information (e.g. two record pairs, one with the same name but a different address, the other with a different name but the same address, can receive the same total weight)
- View record pair classification as a multi-dimensional binary classification problem (use the matching weight vectors to classify record pairs into matches or non-matches, with no possible matches)
- Different machine learning techniques can be used
  - Supervised: manually prepared training data is needed (record pairs and their match status), which is almost like manual clerical review before the linkage
  - Un-supervised: find (local) structure in the data (similar record pairs) without training data

Classification challenges

- In many cases there is no training data available
  - Is it possible to use the results of earlier linkage projects? Or from the clerical review process?
  - How confident can we be about the correct manual classification of possible links?
- Often there is no gold standard available (no data sets with the true linkage status known)
- No test data set collection is available (like in information retrieval or data mining)
  - A recent small repository is RIDDLE, the Repository of Information on Duplicate Detection, Record Linkage, and Identity Uncertainty: http://www.cs.utexas.edu/users/ml/riddle/

Classification research (1)

- Information retrieval based
  - Represent records as document vectors
  - Calculate the distance between vectors (tf-idf weights)
- Database research approaches
  - Extend the SQL language (fuzzy join operator)
  - Implement linkage algorithms using SQL statements
- Supervised machine learning techniques
  - Learn string distance measures (edit-distance costs for character insert, delete, substitute)
  - Decision trees, genetic programming, association rules, expert systems, etc.

Classification research (2)

- Semi-supervised techniques
  - The aim is to reduce the manual training effort
  - Active learning (select the record pair for manual classification that a set of classifiers disagrees on the most)
  - Semi-supervised clustering (provide some manually classified record pairs)
- Un-supervised techniques (see the clustering sketch below)
  - Clustering techniques (k-means, farthest first, etc.)
  - Hierarchical graphical models (probabilistic approaches)
- Main point of criticism: often only confidential or small test data sets are used (like bibliographic citations, restaurant names, etc.)
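As a concrete instance of the un-supervised route, the toy Python sketch below runs a tiny k-means with two clusters over matching weight vectors and labels the cluster whose centroid has the larger weight sum as the matches. The weight vectors are invented, and this is a sketch of the general idea only, not any particular published system.

```python
import random

def kmeans(vectors, k=2, iters=20, seed=1):
    """A tiny k-means: returns final centroids and cluster assignments."""
    centroids = random.Random(seed).sample(vectors, k)
    assignments = []
    for _ in range(iters):
        # Assign every vector to its nearest centroid (squared Euclidean).
        assignments = [
            min(range(k), key=lambda c: sum((x - y) ** 2
                for x, y in zip(vec, centroids[c])))
            for vec in vectors
        ]
        # Move each centroid to the mean of its assigned vectors.
        for c in range(k):
            members = [v for v, a in zip(vectors, assignments) if a == c]
            if members:
                centroids[c] = tuple(sum(xs) / len(members)
                                     for xs in zip(*members))
    return centroids, assignments

# Invented matching weight vectors, one per compared record pair.
weights = [(6.5, 7.6, 4.2), (5.9, 7.0, 3.8),        # likely matches
           (-4.3, -6.1, -3.9), (-3.8, -5.5, -4.1)]  # likely non-matches

centroids, assignments = kmeans(weights)
match_cluster = max(range(2), key=lambda c: sum(centroids[c]))
for vec, a in zip(weights, assignments):
    print(vec, "->", "match" if a == match_cluster else "non-match")
```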
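Finally, the blocking sketch promised on the 'Traditional blocking' slide: a minimal Python illustration, assuming records are dictionaries with an invented 'postcode' field as the blocking variable. Only pairs that share a block are compared; the deliberately mistyped postcode in the third record shows how an erroneous blocking value causes a true match to be missed, which is why several passes with different blocking variables are used.

```python
from collections import defaultdict
from itertools import combinations

def candidate_pairs(records, blocking_variable):
    """Yield only the record pairs that share a value for the
    blocking variable, instead of all possible pairs."""
    blocks = defaultdict(list)
    for record in records:
        blocks[record[blocking_variable]].append(record)
    for block in blocks.values():
        yield from combinations(block, 2)

records = [
    {"name": "pete smith",  "postcode": "2600"},
    {"name": "p smithers",  "postcode": "2600"},
    {"name": "peter smith", "postcode": "2060"},  # typo: should be 2600
]

# Only the two 2600 records are compared; any pair involving the
# mistyped postcode is never generated, so that match is missed.
for a, b in candidate_pairs(records, "postcode"):
    print(a["name"], "<->", b["name"])
```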