Learning Blocking Schemes for Record Linkage∗
Recommended publications
- When Does Blocking Help?
Teacher Notes, Part I. The purpose of blocking is frequently described as "reducing variability." However, this phrase carries little meaning to most beginning students of statistics. This activity, consisting of three rounds of simulation, is designed to illustrate what reducing variability really means in this context. In fact, students should see that a better description than "reducing variability" might be "attributing variability," or "reducing unexplained variability." The activity can be completed in a single 90-minute class or two classes of at least 45 minutes. For shorter classes you may wish to extend the simulations over two days. It is important that students understand not only what to do but also why they do what they do.

Background. Here is the specific problem that will be addressed in this activity: A set of 24 dogs (6 of each of four breeds; 6 from each of four veterinary clinics) has been randomly selected from a population of dogs older than eight years of age whose owners have permitted their inclusion in a study. Each dog will be assigned to exactly one of three treatment groups. Group "Ca" will receive a dietary supplement of calcium, Group "Ex" will receive a dietary supplement of calcium and a daily exercise regimen, and Group "Co" will be a control group that receives no supplement to the ordinary diet and no additional exercise. All dogs will have a bone density evaluation at the beginning and end of the one-year study. (The bone density is measured in Houndsfield units by using a CT scan.) The goals of the study are to determine (i) whether there are different changes in bone density over the year of the study for the dogs in the three treatment groups; and if so, (ii) how much each treatment influences that change in bone density.
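The blocked randomization this activity describes (24 dogs, blocked by breed, two dogs per breed assigned to each of the three groups) is easy to simulate. The following is a minimal Python sketch; only the group names Ca/Ex/Co and the 6-dogs-per-breed structure come from the excerpt, while the breed labels, dog IDs, and random seed are illustrative placeholders.

```python
import random

# Illustrative blocked randomization: within each breed (a block of 6 dogs),
# assign two dogs to each of the three treatment groups.
random.seed(1)  # arbitrary seed, for reproducibility only

breeds = ["Breed1", "Breed2", "Breed3", "Breed4"]  # placeholder breed names
groups = ["Ca", "Ex", "Co"]                         # calcium, calcium+exercise, control

assignment = {}
for breed in breeds:
    dogs = [f"{breed}-dog{i}" for i in range(1, 7)]  # 6 dogs per breed
    labels = groups * 2                              # 2 dogs per group within the block
    random.shuffle(labels)
    for dog, label in zip(dogs, labels):
        assignment[dog] = label

for dog, label in sorted(assignment.items()):
    print(dog, "->", label)
```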
- A Framework for Detecting Unnecessary Industrial Data in ETL Processes
Philip Woodall, Duncan McFarlane, Amar Shah, Torben Jess, and Mark Harrison (Department of Engineering, University of Cambridge, UK); William Krechel and Eric Nicks (Boeing, USA).

Abstract: Extract, transform and load (ETL) is a critical process used by industrial organisations to shift data from one database to another, such as from an operational system to a data warehouse. With the increasing amount of data stored by industrial organisations, some ETL processes can take in excess of 12 hours to complete; this can leave decision makers stranded while they wait for the data needed to support their decisions. After designing the ETL processes, inevitably data requirements can change, and much of the data that goes through the ETL process may not ever be used or needed.

… procurement system that is used to select suppliers for the orders of the parts. These orders may then also be transferred by ETL into the order release system, ready to send to the suppliers. Moreover, for reporting, these orders may be sent via ETL to a data warehouse that is used by managers to, for example, scrutinise various orders over the last month. However, the problem is that ETL is a time-consuming and expensive process, which can leave decision makers in …
- Lec 9: Blocking and Confounding for 2^k Factorial Design
Ying Li, December 2, 2011.

The 2^k factorial design is a special case of the general factorial design: k factors, all at two levels. The two levels are usually called low and high (they could be either quantitative or qualitative). It is very widely used in industrial experimentation.

Example: Consider an investigation into the effect of the concentration of the reactant and the amount of the catalyst on the conversion in a chemical process. A: reactant concentration, 2 levels; B: catalyst, 2 levels; 3 replicates, 12 runs in total. The treatment totals over the three replicates are (1) = 28 + 25 + 27 = 80, a = 36 + 32 + 32 = 100, b = 18 + 19 + 23 = 60, and ab = 31 + 30 + 29 = 90, with factor signs (A, B) of (−, −), (+, −), (−, +), and (+, +) respectively. The effect estimates are

A = (1/(2n)) {[ab − b] + [a − (1)]},
B = (1/(2n)) {[ab − a] + [b − (1)]},
AB = (1/(2n)) {[ab − b] − [a − (1)]}.

Manual calculation: with Contrast_A = ab + a − b − (1), the sum of squares is SS_A = (Contrast_A)^2 / (4n).

Regression model: for a 2^2 experiment, the least squares estimates of the regression coefficients are exactly half of the "usual" effect estimates.

Analysis procedure for a factorial design: estimate factor effects. …
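As a quick check on these formulas, here is a minimal Python sketch (not from the slides) that plugs in the treatment totals listed above ((1) = 80, a = 100, b = 60, ab = 90, n = 3 replicates) to get the effect estimates and the sum of squares for A.

```python
# Effect estimates for a 2^2 factorial design from treatment totals,
# using the contrasts quoted in the excerpt. n is the number of replicates.
n = 3
one, a, b, ab = 80, 100, 60, 90   # treatment totals over the 3 replicates

A  = (1 / (2 * n)) * ((ab - b) + (a - one))   # main effect of A
B  = (1 / (2 * n)) * ((ab - a) + (b - one))   # main effect of B
AB = (1 / (2 * n)) * ((ab - b) - (a - one))   # interaction effect

contrast_A = ab + a - b - one
SS_A = contrast_A ** 2 / (4 * n)              # sum of squares for A

print(f"A = {A:.3f}, B = {B:.3f}, AB = {AB:.3f}, SS_A = {SS_A:.3f}")
# Gives A = 8.333, B = -5.000, AB = 1.667, SS_A = 208.333
```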
- Chapter 7 Blocking and Confounding Systems for Two-Level Factorials
Slides by hsuhl (NUK), based on Design and Analysis of Experiments (Douglas C. Montgomery), DAE Chap. 7.

Introduction: Sometimes it is impossible to perform all 2^k factorial runs under homogeneous conditions; for example, a batch of raw material may not be large enough for the required runs. The blocking technique makes the treatment comparisons equally effective across many situations.

Blocking a replicated 2^k factorial design: for a 2^k design with n replicates (Example 7.1, the chemical process experiment: a 2^2 design with A = concentration and B = catalyst, 4 trials, 3 replicates), each set of nonhomogeneous conditions forms a block and each replicate is run in one of the blocks. The block sum of squares is

SS_Blocks = Σ_{i=1}^{3} B_i^2 / 4 − y···^2 / 12 = 6.50 (2 d.f.),

so the block effect is small.

Confounding: when the block size is smaller than the number of treatment combinations, it is impossible to perform a complete replicate of a factorial design in one block. Confounding is a design technique for arranging a complete factorial experiment in blocks; it causes information about certain treatment effects (high-order interactions) to be indistinguishable from, or confounded with, blocks.

Confounding the 2^k factorial design in two blocks: for a single replicate of the 2^2 design, two batches of raw material are required (2 factors with 2 blocks). The effect estimates are

A = (1/2)[ab + a − b − (1)],
B = (1/2)[ab + b − a − (1)]   (any difference between block 1 and block 2 will cancel out),
AB = (1/2)[ab + (1) − a − b]  (the block effect and the AB interaction are identical, i.e., confounded with blocks). …
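To make the block construction concrete, here is a small Python sketch (an illustration in the usual 2^2 notation, not code from the slides) that assigns the four treatment combinations of a single replicate to two blocks using the sign of the AB contrast as the defining contrast, so AB is confounded with blocks.

```python
# Single replicate of a 2^2 design, coded levels -1/+1 for factors A and B,
# listed in standard order: (1), a, b, ab.
runs = [("(1)", -1, -1), ("a", 1, -1), ("b", -1, 1), ("ab", 1, 1)]

block1, block2 = [], []
for name, a_level, b_level in runs:
    ab_sign = a_level * b_level          # sign of the AB contrast (the defining contrast)
    (block1 if ab_sign == +1 else block2).append(name)

print("Block 1 (AB = +):", block1)   # ['(1)', 'ab']
print("Block 2 (AB = -):", block2)   # ['a', 'b']
# An additive block effect shifts both runs in one block; it cancels in the
# A and B contrasts but adds directly to the AB contrast, which is why the
# AB interaction is confounded with blocks.
```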
- Introduction to Biostatistics
Jie Yang, Ph.D., Associate Professor, Department of Family, Population and Preventive Medicine; Director, Biostatistical Consulting Core. In collaboration with the Clinical Translational Science Center (CTSC) and the Biostatistics and Bioinformatics Shared Resource (BB-SR), Stony Brook Cancer Center (SBCC).

Outline: what biostatistics is; what a biostatistician does (experiment design and clinical trial design, descriptive and inferential analysis, result interpretation); and what you should bring when consulting with a biostatistician.

What is biostatistics? The science of biostatistics encompasses the design of biological/clinical experiments; the collection, summarization, and analysis of data from those experiments; and the interpretation of, and inference from, the results. (See How to Lie with Statistics (1954) by Darrell Huff; http://www.youtube.com/watch?v=PbODigCZqL8.)

Goal of statistics: sampling takes us from the population (with parameters μ, σ, π, …) to a sample (with statistics x̄, s, p̂, …); descriptive statistics summarize each, while probability theory and inferential statistics carry conclusions from the sample back to the population.

Properties of a "good" sample: adequate sample size (statistical power) and random selection (representativeness). Sampling techniques: 1. simple random sampling; 2. stratified sampling; 3. systematic sampling; 4. cluster sampling; 5. convenience sampling.

Study design and experiment design: a completely randomized design (CRD) randomly assigns the experimental units to the treatments. Designs with blocking deal with a nuisance factor that has some effect on the response but is of no interest to the experimenter; without blocking, large unexplained error leads to less detection power. Examples: 1. randomized complete block design (RCBD), with one single blocking factor; 2. Latin square design, with two blocking factors; 3. crossover design, where each subject is the blocking factor; 4. balanced incomplete block design. A factorial design is similar to a randomized block design, but allows testing the interaction between two treatment effects.
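As a small illustration of the population-versus-sample idea above, here is a minimal Python sketch (illustrative only; the population values are made up) that draws a simple random sample and a stratified sample and compares the sample means to the population mean.

```python
import random
import statistics

random.seed(7)  # arbitrary seed, for reproducibility only

# Made-up population: systolic blood pressure readings in two strata (clinics).
population = {"clinic_A": [random.gauss(120, 10) for _ in range(600)],
              "clinic_B": [random.gauss(135, 12) for _ in range(400)]}
all_values = population["clinic_A"] + population["clinic_B"]
mu = statistics.mean(all_values)                      # population parameter

# 1. Simple random sample of n = 50 from the whole population.
srs = random.sample(all_values, 50)

# 2. Stratified sample: sample each clinic in proportion to its size.
stratified = (random.sample(population["clinic_A"], 30) +
              random.sample(population["clinic_B"], 20))

print(f"population mean mu        = {mu:.1f}")
print(f"simple random sample mean = {statistics.mean(srs):.1f}")
print(f"stratified sample mean    = {statistics.mean(stratified):.1f}")
```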
- Large Scale Record Linkage in the Presence of Missing Data
Thilina Ranbaduge (The Australian National University, Canberra, Australia), Peter Christen (The Australian National University, Canberra, Australia), and Rainer Schnell (University Duisburg-Essen, Duisburg, Germany).

Abstract: Record linkage is aimed at the accurate and efficient identification of records that represent the same entity within or across disparate databases. It is a fundamental task in data integration and increasingly required for accurate decision making in application domains ranging from health analytics to national security. Traditional record linkage techniques calculate string similarities between quasi-identifying (QID) values, such as the names and addresses of people. Errors, variations, and missing QID values can however lead to low linkage quality because the similarities between records cannot be calculated accurately. To overcome this challenge, we propose a novel technique that can accurately link records even when QID values contain errors or variations, or are missing.

One major data quality aspect that so far has only seen limited attention in RL is missing data [4, 16, 19, 28]. There are various reasons why missing values can occur, ranging from equipment malfunction or data items not considered to be important, to deletion of values due to inconsistencies, or even the refusal of individuals to provide information, for example when answering surveys. Missing data can be categorised into different types [26]. Data missing completely at random (MCAR) are missing values that occur without any patterns or correlations at all with any other values in the same record or database. For example, if a first name is missing for an individual, then neither last name, address, age, nor gender can help predict the missing first name value (assuming no external database is accessible where a complete record …
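The "traditional" approach the abstract refers to, comparing quasi-identifying (QID) values with string similarities, can be sketched in a few lines and makes the missing-data problem visible. This is a generic illustration using difflib from the Python standard library, not the paper's proposed technique; the field list and the choice to score missing comparisons as 0 are assumptions made for the example.

```python
from difflib import SequenceMatcher

QID_FIELDS = ["first_name", "last_name", "city"]  # example QID attributes

def field_similarity(value_a, value_b):
    """String similarity in [0, 1]; returns 0.0 if either value is missing."""
    if not value_a or not value_b:          # missing QID value -> no evidence
        return 0.0
    return SequenceMatcher(None, value_a.lower(), value_b.lower()).ratio()

def record_similarity(rec_a, rec_b):
    """Average similarity over the QID fields of two records."""
    scores = [field_similarity(rec_a.get(f), rec_b.get(f)) for f in QID_FIELDS]
    return sum(scores) / len(scores)

r1 = {"first_name": "Gail", "last_name": "Smith", "city": "Canberra"}
r2 = {"first_name": None,   "last_name": "Smyth", "city": "Canberra"}  # missing first name

print(f"similarity = {record_similarity(r1, r2):.2f}")
# The missing first name drags the score down even though the records likely
# refer to the same person -- the problem the paper sets out to address.
```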
- Historical Census Record Linkage
Steven Ruggles† (University of Minnesota), Catherine Fitch (University of Minnesota), and Evan Roberts (University of Minnesota). December 2017. Working Paper No. 2017-3, https://doi.org/10.18128/MPC2017-3. †Correspondence should be directed to: Steven Ruggles, University of Minnesota, 50 Willey Hall, 225 19th Ave S., Minneapolis, MN 55455; phone: 612-624-5818; fax: 612-626-8375.

Abstract: For the past 80 years, social scientists have been linking historical censuses across time to study economic and geographic mobility. In recent decades, the quantity of historical census record linkage has exploded, owing largely to the advent of new machine-readable data created by genealogical organizations. Investigators are examining economic and geographic mobility across multiple generations, but also engaging many new topics. Several analysts are exploring the effects of early-life socioeconomic conditions, environmental exposures, or natural disasters on family, health and economic outcomes in later life. Other studies exploit natural experiments to gauge the impact of policy interventions such as social welfare programs and educational reforms. The new data sources have led to a proliferation of record linkage methodology, and some widespread approaches inadvertently introduce errors that can lead to false inferences. A new generation of large-scale shared data infrastructure now in preparation will ameliorate weaknesses of current linkage methods.

Introduction: Historical census data are indispensable for studying long-run change, since they provide the only record of the lives of millions of people over the past two centuries. In much of Europe and North America, investigators can access individual-level data based on original census manuscripts (Szołtysek & Gruber 2016). …
- Volume I, Record Linkage Issues and Results
Chapter Two: Administrative Data and Record Linkage Issues.

For research purposes, administrative data have the advantage of detailed and accurate measurement of program status and outcomes, complete coverage of populations of interest (enabling detailed subgroup analyses), data on the same individuals over long periods, low cost relative to survey data, and the ability to obtain many kinds of information through matching (Hotz et al., 1998). In addition, many types of administrative data have relatively high degrees of uniformity in content across geographic areas.

Government agencies have recognized the potential research uses of administrative data. Of 10 data development initiatives recently identified by USDA's Economic Research Service, only one did not involve administrative data (Wittenburg et al., 2001). Five of the 10 initiatives involve creation of linked databases matching administrative records from multiple agencies, or matching administrative records to survey responses. Linked databases are a way to create "new" data from existing sources. For example, the National Center for Health Statistics (NCHS) determines infant mortality rates using state files of linked data from birth and death certificates (Mathews et al., 2002). The Department of Transportation examines motor vehicle crash outcomes by linking records of police-reported crashes to hospital discharge data, EMS data, and hospital emergency department data (NHTSA, 1996a). In the social services arena, a number of States have developed master client indexes that …
- Approaches to Multiple Record Linkage
Mauricio Sadinle, Rob Hall, and Stephen E. Fienberg (Department of Statistics, Heinz College, and Department of Machine Learning, Carnegie Mellon University, Pittsburgh, PA 15213-3890, U.S.A.).

We review the theory and techniques of record linkage that date back to pioneering work by Fellegi and Sunter on matching records in two lists. When the task involves linking K > 2 lists, the most common approach consists of performing all (K choose 2) possible pairwise linkages of lists using a Fellegi-Sunter-like approach and then somehow reconciling the discrepancies in an ad hoc fashion. We describe some important uses of the methodology, provide a principled way of accomplishing the reconciliation, and finally present some key parts of the generalization of Fellegi and Sunter's method to K > 2 lists.

Introduction: Record linkage is a family of techniques for matching two data files using names, addresses, and other fields that are typically not unique identifiers of entities. Most record linkage approaches are based on, or emulate, a method presented in a pioneering paper by Fellegi and Sunter (1969). Winkler (1999) and Herzog et al. (2007) document the basic two-list methodology and several variations. We begin by reviewing some common applications of record linkage techniques, including the linking of K > 2 lists.

Data Integration: Synthesizing data on a group of individuals from two or more files in the absence of unique identifiers requires record linkage, e.g., to create a data warehouse to be used for querying such as credit checks (e.g., see Talburt 2011), or to assess enrollment in multiple government programs (e.g., see Cole 2003). …
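The "common approach" described here, running a two-list linker over every pair of the K lists and then reconciling, is easy to express in outline. Below is a minimal Python sketch of that pairwise enumeration; the match_pair function is a stand-in (a simple threshold rule on a name field), not the Fellegi-Sunter decision rule itself or anything taken from the paper.

```python
from itertools import combinations
from difflib import SequenceMatcher

def match_pair(list_a, list_b, threshold=0.85):
    """Placeholder two-list linker: link records whose 'name' fields are similar.
    A real implementation would use Fellegi-Sunter match/non-match weights."""
    links = []
    for i, rec_a in enumerate(list_a):
        for j, rec_b in enumerate(list_b):
            score = SequenceMatcher(None, rec_a["name"], rec_b["name"]).ratio()
            if score >= threshold:
                links.append((i, j, round(score, 3)))
    return links

# K = 3 toy lists; the common approach runs the two-list linker on all C(K, 2) pairs.
lists = {
    "L1": [{"name": "john smith"}, {"name": "ana garcia"}],
    "L2": [{"name": "jon smith"}, {"name": "maria lopez"}],
    "L3": [{"name": "anna garcia"}],
}

pairwise_links = {}
for (name_a, recs_a), (name_b, recs_b) in combinations(lists.items(), 2):
    pairwise_links[(name_a, name_b)] = match_pair(recs_a, recs_b)

for pair, links in pairwise_links.items():
    print(pair, links)
# Reconciling inconsistencies across these C(K, 2) pairwise decisions is the
# ad hoc step that the paper aims to replace with a principled multi-list method.
```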
- Record Linkage: A Machine Learning Approach, A Toolbox, and a Digital Government Web Service
Mohamed G. Elfeky (Purdue University), Vassilios S. Verykios (Drexel University), Ahmed K. Elmagarmid (Hewlett-Packard), Thanaa M. Ghanem (Purdue University), and Ahmed R. Huwait (Ain Shams University). Purdue University Department of Computer Science Technical Report CSD TR #03-024, July 2003. https://docs.lib.purdue.edu/cstech/1573

Citation: Elfeky, Mohamed G.; Verykios, Vassilios S.; Elmagarmid, Ahmed K.; Ghanem, Thanaa M.; and Huwait, Ahmed R., "Record Linkage: A Machine Learning Approach, A Toolbox, and a Digital Government Web Service" (2003). Department of Computer Science Technical Reports, Paper 1573.

Abstract: Data cleaning is a vital process that ensures the quality of data stored in real-world databases. Data cleaning problems are frequently encountered in many research areas, such as knowledge discovery in databases, data warehousing, system integration and e-services. …
- Contributions to Record Linkage for Disclosure Risk Assessment
By Jordi Nin Guerrero; advisor: Dr. Vicenç Torra i Reventós. A dissertation submitted in partial fulfillment of the requirements for the degree of Doctor in Artificial Intelligence at the Departament de Ciències de la Computació, Universitat Autònoma de Barcelona.

Dedication (translated from Catalan): "To Mireia, who makes my dreams her own. And to Xavier, whom we await with enthusiasm."

"I am only a child playing on the beach, while vast oceans of truth lie undiscovered before me." (Isaac Newton, 1642-1727)

Contents: Acknowledgements; Abstract; 1 Introduction (1.1 Motivations, 1.2 Contributions, 1.3 Structure of the Document); 2 Preliminaries (2.1 Aggregation Functions, 2.2 Time Series, 2.3 Re-identification Methods, 2.4 Microdata Protection Methods, 2.5 Information Loss and Disclosure Risk, 2.6 Data Sets Description); 3 Microaggregation Analysis (3.1 Attribute Selection in Multivariate Microaggregation, 3.2 Modeling Projections in Microaggregation, 3.3 Improving Microaggregation for Complex Record Anonymization); 4 Specific Disclosure Risk Measures (4.1 Rank Swapping Record Linkage, 4.2 Alignment Record Linkage, 4.3 Projected Record Linkage); 5 Record Linkage using Fuzzy Integrals (5.1 An Alternative Disclosure Risk Scenario, 5.2 Experiments); 6 Time Series Protection (6.1 Time Series Protection, 6.2 Time Series Information Loss Measures, 6.3 Time Series Disclosure Risk Measures, 6.4 Final Trade-off Evaluation, 6.5 Experiments); 7 Conclusions and Future Directions (7.1 Summary of Contributions, …)
- Design of Engineering Experiments: Blocking & Confounding in the 2^k
Lecture slides for Design & Analysis of Experiments, 8th edition (Montgomery, 2012), Chapter 7.

• Text reference: Chapter 7. Blocking is a technique for dealing with controllable nuisance variables. Two cases are considered: replicated designs and unreplicated designs.

• Blocking a replicated design: this is the same scenario discussed previously in Chapter 5. If there are n replicates of the design, then each replicate is a block. Each replicate is run in one of the blocks (time periods, batches of raw material, etc.), and runs within the block are randomized.

• Consider the example from Section 6-2, the simplest case: the 2^2 chemical process example (treatments (1), a, b, ab; A = reactant concentration, B = catalyst amount, y = recovery), with k = 2 factors and n = 3 replicates. The "usual" method for calculating a block sum of squares gives

SS_Blocks = Σ_{i=1}^{3} B_i²/4 − y···²/12 = 6.50.

The ANOVA for the blocked design follows (text page 305).

• Confounding in blocks: confounding is a design technique for arranging a complete factorial experiment in blocks, where the block size is smaller than the number of treatment combinations in one replicate. Now consider the unreplicated case; clearly the previous discussion does not apply, since there …
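To see where the 6.50 comes from, here is a minimal Python sketch (not from the slides). It takes the twelve observations quoted in the Lec 9 excerpt above and assumes the three values listed for each treatment combination are in replicate order, so that each replicate (block) total can be formed; under that assumption it reproduces SS_Blocks = 6.50.

```python
# Observations from the 2^2 chemical process example, three replicates,
# assuming the values listed per treatment are in replicate order I, II, III.
data = {
    "(1)": [28, 25, 27],
    "a":   [36, 32, 32],
    "b":   [18, 19, 23],
    "ab":  [31, 30, 29],
}

n_blocks = 3                      # each replicate is a block
runs_per_block = len(data)        # 4 treatment combinations per replicate
N = n_blocks * runs_per_block     # 12 observations in total

block_totals = [sum(values[i] for values in data.values()) for i in range(n_blocks)]
grand_total = sum(block_totals)

ss_blocks = sum(B**2 for B in block_totals) / runs_per_block - grand_total**2 / N
print("block totals:", block_totals)   # [113, 106, 111]
print(f"SS_Blocks = {ss_blocks:.2f}")  # 6.50, matching the slides
```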