Supporting Efficient Record Linkage for Large Data Sets Using Mapping
Total Page:16
File Type:pdf, Size:1020Kb

Load more
Recommended publications
-
A Framework for Detecting Unnecessary Industrial Data in ETL Processes
View metadata, citation and similar papers at core.ac.uk brought to you by CORE provided by Apollo A Framework for Detecting Unnecessary Industrial Data in ETL Processes Philip Woodall Duncan McFarlane Department of Engineering Department of Engineering University of Cambridge, UK. University of Cambridge, UK. [email protected] Amar Shah, Department of Engineering Torben Jess University of Cambridge, UK Department of Engineering University of Cambridge, UK. William Krechel Boeing, USA Mark Harrison Eric Nicks Department of Engineering Boeing, USA. University of Cambridge, UK Abstract—Extract, transform and load (ETL) is a critical procurement system that is used to select suppliers for the process used by industrial organisations to shift data from one orders of the parts. These orders may then also be transferred database to another, such as from an operational system to a data by ETL into the order release system, ready to send to the warehouse. With the increasing amount of data stored by suppliers. Moreover, for reporting, these orders may be sent via industrial organisations, some ETL processes can take in excess ETL to a data warehouse that is used by managers to, for of 12 hours to complete; this can leave decision makers stranded example, scrutinise various orders over the last month. while they wait for the data needed to support their decisions. After designing the ETL processes, inevitably data requirements However, the problem is that ETL is a time-consuming and can change, and much of the data that goes through the ETL expensive process, which can leave decision makers in process may not ever be used or needed. -
Headprint: Detecting Anomalous Communications Through Header-Based Application Fingerprinting
HeadPrint: Detecting Anomalous Communications through Header-based Application Fingerprinting Riccardo Bortolameotti Thijs van Ede Andrea Continella University of Twente University of Twente UC Santa Barbara [email protected] [email protected] [email protected] Thomas Hupperich Maarten H. Everts Reza Rafati University of Muenster University of Twente Bitdefender [email protected] [email protected] [email protected] Willem Jonker Pieter Hartel Andreas Peter University of Twente Delft University of Technology University of Twente [email protected] [email protected] [email protected] ABSTRACT 1 INTRODUCTION Passive application fingerprinting is a technique to detect anoma- Data breaches are a major concern for enterprises across all indus- lous outgoing connections. By monitoring the network traffic, a tries due to the severe financial damage they cause [25]. Although security monitor passively learns the network characteristics of the many enterprises have full visibility of their network traffic (e.g., applications installed on each machine, and uses them to detect the using TLS-proxies) and they stop many cyber attacks by inspect- presence of new applications (e.g., malware infection). ing traffic with different security solutions (e.g., IDS, IPS, Next- In this work, we propose HeadPrint, a novel passive fingerprint- Generation Firewalls), attackers are still capable of successfully ing approach that relies only on two orthogonal network header exfiltrating data. Data exfiltration often occurs over network covert characteristics to distinguish applications, namely the order of the channels, which are communication channels established by attack- headers and their associated values. Our approach automatically ers to hide the transfer of data from a security monitor. -
Author Guidelines for 8
I.J. Information Technology and Computer Science, 2014, 01, 68-75 Published Online December 2013 in MECS (http://www.mecs-press.org/) DOI: 10.5815/ijitcs.2014.01.08 FSM Circuits Design for Approximate String Matching in Hardware Based Network Intrusion Detection Systems Dejan Georgiev Faculty of Electrical Engineering and Information Technologies , Skopje, Macedonia E-mail: [email protected] Aristotel Tentov Faculty of Electrical Engineering and Information Technologies, Skopje, Macedonia E-mail: [email protected] Abstract— In this paper we present a logical circuits to perform a check to a possible variations of the design for approximate content matching implemented content because of so named "evasion" techniques. as finite state machines (FSM). As network speed Software systems have processing limitations in that increases the software based network intrusion regards. On the other hand hardware platforms, like detection and prevention systems (NIDPS) are lagging FPGA offer efficient implementation for high speed behind requirements in throughput of so called deep network traffic. package inspection - the most exhaustive process of Excluding the regular expressions as special case ,the finding a pattern in package payloads. Therefore, there problem of approximate content matching basically is a is a demand for hardware implementation. k-differentiate problem of finding a content Approximate content matching is a special case of content finding and variations detection used by C c1c2.....cm in a string T t1t2.......tn such as the "evasion" techniques. In this research we will enhance distance D(C,X) of content C and all the occurrences of the k-differentiate problem with "ability" to detect a a substring X is less or equal to k , where k m n . -
Comparison of Levenshtein Distance Algorithm and Needleman-Wunsch Distance Algorithm for String Matching
COMPARISON OF LEVENSHTEIN DISTANCE ALGORITHM AND NEEDLEMAN-WUNSCH DISTANCE ALGORITHM FOR STRING MATCHING KHIN MOE MYINT AUNG M.C.Sc. JANUARY2019 COMPARISON OF LEVENSHTEIN DISTANCE ALGORITHM AND NEEDLEMAN-WUNSCH DISTANCE ALGORITHM FOR STRING MATCHING BY KHIN MOE MYINT AUNG B.C.Sc. (Hons:) A Dissertation Submitted in Partial Fulfillment of the Requirements for the Degree of Master of Computer Science (M.C.Sc.) University of Computer Studies, Yangon JANUARY2019 ii ACKNOWLEDGEMENTS I would like to take this opportunity to express my sincere thanks to those who helped me with various aspects of conducting research and writing this thesis. Firstly, I would like to express my gratitude and my sincere thanks to Prof. Dr. Mie Mie Thet Thwin, Rector of the University of Computer Studies, Yangon, for allowing me to develop this thesis. Secondly, my heartfelt thanks and respect go to Dr. Thi Thi Soe Nyunt, Professor and Head of Faculty of Computer Science, University of Computer Studies, Yangon, for her invaluable guidance and administrative support, throughout the development of the thesis. I am deeply thankful to my supervisor, Dr. Ah Nge Htwe, Associate Professor, Faculty of Computer Science, University of Computer Studies, Yangon, for her invaluable guidance, encouragement, superior suggestion and supervision on the accomplishment of this thesis. I would like to thank Daw Ni Ni San, Lecturer, Language Department of the University of Computer Studies, Yangon, for editing my thesis from the language point of view. I thank all my teachers who taught and helped me during the period of studies in the University of Computer Studies, Yangon. Finally, I am grateful for all my teachers who have kindly taught me and my friends from the University of Computer Studies, Yangon, for their advice, their co- operation and help for the completion of thesis. -
Large Scale Record Linkage in the Presence of Missing Data Otherwise (A8,A 9 ) Is a Member of U (True Non-Matches) and A8 and Definition 3.3 (Relational Signature)
Large Scale Record Linkage in the Presence of Missing Data Thilina Ranbaduge Peter Christen Rainer Schnell The Australian National University The Australian National University University Duisburg-Essen Canberra, Australia Canberra, Australia Duisburg, Germany [email protected] [email protected] [email protected] ABSTRACT One major data quality aspect that so far has only seen limited Record linkage is aimed at the accurate and efficient identifica- attention in RL is missing data [4, 16, 19, 28]. There are various tion of records that represent the same entity within or across reasons why missing values can occur, ranging from equipment disparate databases. It is a fundamental task in data integration malfunction or data items not considered to be important, to dele- and increasingly required for accurate decision making in appli- tion of values due to inconsistencies, or even the refusal of individ- cation domains ranging from health analytics to national security. uals to provide information for example when answering surveys. Traditional record linkage techniques calculate string similarities Missing data can be categorised into different types [26]. between quasi-identifying (QID) values, such as the names and ad- Data missing completely at random (MCAR) are missing values dresses of people. Errors, variations, and missing QID values can that occur without any patterns or correlations at all with any however lead to low linkage quality because the similarities be- other values in the same record or database. For example, if a first tween records cannot be calculated accurately. To overcome this name is missing for an individual, then neither last name, address, challenge, we propose a novel technique that can accurately link age, nor gender, can help predict the missing first name value (as- records even when QID values contain errors or variations, or are suming no external database is accessible where a complete record missing. -
Historical Census Record Linkage
Historical Census Record Linkage Steven Ruggles† University of Minnesota Catherine Fitch University of Minnesota Evan Roberts University of Minnesota December 2017 Working Paper No. 2017-3 https://doi.org/10.18128/ MPC2017-3 †Correspondence should be directed to: Steven Ruggles University of Minnesota, 50 Willey Hall, 225 19th Ave S., Minneapolis, MN 55455 e-mail: [email protected], phone: 612-624-5818, fax:612-626-8375 Abstract For the past 80 years, social scientists have been linking historical censuses across time to study economic and geographic mobility. In recent decades, the quantity of historical census record linkage has exploded, owing largely to the advent of new machine-readable data created by genealogical organizations. Investigators are examining economic and geographic mobility across multiple generations, but also engaging many new topics. Several analysts are exploring the effects of early-life socioeconomic conditions, environmental exposures, or natural disasters on family, health and economic outcomes in later life. Other studies exploit natural experiments to gauge the impact of policy interventions such as social welfare programs and educational reforms. The new data sources have led to a proliferation of record linkage methodology, and some widespread approaches inadvertently introduce errors that can lead to false inferences. A new generation of large-scale shared data infrastructure now in preparation will ameliorate weaknesses of current linkage methods. Introduction Historical census data are indispensable for studying long-run change, since they provide the only record of the lives of millions of people over the past two centuries. In much of Europe and North America, investigators can access individual-level data based on original census manuscripts (Szołtysek & Gruber 2016). -
Automating Error Frequency Analysis Via the Phonemic Edit Distance Ratio
JSLHR Research Note Automating Error Frequency Analysis via the Phonemic Edit Distance Ratio Michael Smith,a Kevin T. Cunningham,a and Katarina L. Haleya Purpose: Many communication disorders result in speech corresponds to percent segments with these phonemic sound errors that listeners perceive as phonemic errors. errors. Unfortunately, manual methods for calculating phonemic Results: Convergent validity between the manually calculated error frequency are prohibitively time consuming to use in error frequency and the automated edit distance ratio was large-scale research and busy clinical settings. The purpose excellent, as demonstrated by nearly perfect correlations of this study was to validate an automated analysis based and negligible mean differences. The results were replicated on a string metric—the unweighted Levenshtein edit across 2 speech samples and 2 computation applications. distance—to express phonemic error frequency after left Conclusions: The phonemic edit distance ratio is well hemisphere stroke. suited to estimate phonemic error rate and proves itself Method: Audio-recorded speech samples from 65 speakers for immediate application to research and clinical practice. who repeated single words after a clinician were transcribed It can be calculated from any paired strings of transcription phonetically. By comparing these transcriptions to the symbols and requires no additional steps, algorithms, or target, we calculated the percent segments with a combination judgment concerning alignment between target and production. of phonemic substitutions, additions, and omissions and We recommend it as a valid, simple, and efficient substitute derived the phonemic edit distance ratio, which theoretically for manual calculation of error frequency. n the context of communication disorders that affect difference between a series of targets and a series of pro- speech production, sound error frequency is often ductions. -
APPROXIMATE STRING MATCHING METHODS for DUPLICATE DETECTION and CLUSTERING TASKS by Oleksandr Rudniy
Copyright Warning & Restrictions The copyright law of the United States (Title 17, United States Code) governs the making of photocopies or other reproductions of copyrighted material. Under certain conditions specified in the law, libraries and archives are authorized to furnish a photocopy or other reproduction. One of these specified conditions is that the photocopy or reproduction is not to be “used for any purpose other than private study, scholarship, or research.” If a, user makes a request for, or later uses, a photocopy or reproduction for purposes in excess of “fair use” that user may be liable for copyright infringement, This institution reserves the right to refuse to accept a copying order if, in its judgment, fulfillment of the order would involve violation of copyright law. Please Note: The author retains the copyright while the New Jersey Institute of Technology reserves the right to distribute this thesis or dissertation Printing note: If you do not wish to print this page, then select “Pages from: first page # to: last page #” on the print dialog screen The Van Houten library has removed some of the personal information and all signatures from the approval page and biographical sketches of theses and dissertations in order to protect the identity of NJIT graduates and faculty. ABSTRACT APPROXIMATE STRING MATCHING METHODS FOR DUPLICATE DETECTION AND CLUSTERING TASKS by Oleksandr Rudniy Approximate string matching methods are utilized by a vast number of duplicate detection and clustering applications in various knowledge domains. The application area is expected to grow due to the recent significant increase in the amount of digital data and knowledge sources. -
Finite State Recognizer and String Similarity Based Spelling
View metadata, citation and similar papers at core.ac.uk brought to you by CORE provided by BRAC University Institutional Repository FINITE STATE RECOGNIZER AND STRING SIMILARITY BASED SPELLING CHECKER FOR BANGLA Submitted to A Thesis The Department of Computer Science and Engineering of BRAC University by Munshi Asadullah In Partial Fulfillment of the Requirements for the Degree of Bachelor of Science in Computer Science and Engineering Fall 2007 BRAC University, Dhaka, Bangladesh 1 DECLARATION I hereby declare that this thesis is based on the results found by me. Materials of work found by other researcher are mentioned by reference. This thesis, neither in whole nor in part, has been previously submitted for any degree. Signature of Supervisor Signature of Author 2 ACKNOWLEDGEMENTS Special thanks to my supervisor Mumit Khan without whom this work would have been very difficult. Thanks to Zahurul Islam for providing all the support that was required for this work. Also special thanks to the members of CRBLP at BRAC University, who has managed to take out some time from their busy schedule to support, help and give feedback on the implementation of this work. 3 Abstract A crucial figure of merit for a spelling checker is not just whether it can detect misspelled words, but also in how it ranks the suggestions for the word. Spelling checker algorithms using edit distance methods tend to produce a large number of possibilities for misspelled words. We propose an alternative approach to checking the spelling of Bangla text that uses a finite state automaton (FSA) to probabilistically create the suggestion list for a misspelled word. -
Volume I, Record Linkage Issues and Resul
Chapter Two Administrative Data and Record Linkage Issues For research purposes, administrative data have the advantage of detailed and accurate measurement of program status and outcomes, complete coverage of populations of interest (enabling detailed subgroup analyses), data on the same individuals over long periods, low cost relative to survey data, and the ability to obtain many kinds of information through matching (Hotz et al., 1998). In addition, many types of administrative data have relatively high degrees of uniformity in content across geographic areas.13,14 Government agencies have recognized the potential research uses of administrative data. Of 10 data development initiatives recently identified by USDA’s Economic Research Service, only one did not involve administrative data (Wittenburg et al., 2001). Five of the 10 initiatives involve creation of linked databases matching administrative records from multiple agencies, or matching administrative records to survey responses. Linked databases are a way to create “new” data from existing sources. For example, the National Center for Health Statistics (NCHS) determines infant mortality rates using state files of linked data from birth and death certificates (Mathews et al., 2002). The Department of Transportation examines motor vehicle crash outcomes by linking records of police-reported crashes to hospital discharge data, EMS data, and hospital emergency department data (NHTSA, 1996a). In the social services arena, a number of States have developed master client indexes that -
Approaches to Multiple Record Linkage
Approaches to Multiple Record Linkage Mauricio Sadinle, Rob Hall, and Stephen E. Fienberg Department of Statistics, Heinz College, and Department of Machine Learning Carnegie Mellon University Pittsburgh, PA 15213-3890, U.S.A. E-mail: [email protected]; [email protected]; fi[email protected] We review the theory and techniques of record linkage that date back to pioneering work by Fellegi and Sunter on matching records( in) two lists. When the task involves linking K > 2 lists, the most common approach K consists of performing all 2 possible pairs of lists using a Fellegi-Sunter-like approach and then somehow reconciling the discrepancies in an ad hoc fashion. We describe some important uses of the methodology, provide a principled way of accomplishing the reconciliation and we finally present some key parts of the generalization of Fellegi and Sunter's method to K > 2 lists. Introduction Record linkage is a family of techniques for matching two data files using names, addresses, and other fields that are typically not unique identifiers of entities. Most record linkage approaches are based or emulate a method presented in a pioneering paper by Fellegi and Sunter (1969). Winkler (1999) and Herzog et al. (2007) document the basic two-list methodology and several variations. We begin by reviewing some common applications of record linkage techniques, including the linking K > 2 lists. Data Integration. Synthesizing data on a group of individuals from two or more files in the absence of unique identifiers requires record linkage, e.g., to create a data warehouse to be used for querying such as credit checks, e.g., see Talburt (2011), or to assess enrollment in multiple government programs, e.g., see Cole (2003). -
Record Linkage: a Machine Learning Approach, a Toolbox, and a Digital Government Web Service
Purdue University Purdue e-Pubs Department of Computer Science Technical Reports Department of Computer Science 2003 Record Linkage: A Machine Learning Approach, A Toolbox, and a Digital Government Web Service Mohamed G. Elfeky Vassilios S. Verykios Ahmed K. Elmagarmid Purdue University, [email protected] Thanaa M. Ghanem Ahmed R. Huwait Report Number: 03-024 Elfeky, Mohamed G.; Verykios, Vassilios S.; Elmagarmid, Ahmed K.; Ghanem, Thanaa M.; and Huwait, Ahmed R., "Record Linkage: A Machine Learning Approach, A Toolbox, and a Digital Government Web Service" (2003). Department of Computer Science Technical Reports. Paper 1573. https://docs.lib.purdue.edu/cstech/1573 This document has been made available through Purdue e-Pubs, a service of the Purdue University Libraries. Please contact [email protected] for additional information. RECORD LINKAGE, A MACHINE LEARNING APPROACH, A TOOLBOX, AND A DIGITAL GOVERNMENT WEB SERVICE Mohamed G. Elfeky Vassilios S. Verykios Ahmed K. Elmagarmid Thanaa M. Ghanem Ahmed R. Huwait CSD TR #03-024 July 2003 Record Linkage: A Machine Learning Approach, A Toolbox, and A Digital Government Web Service' l Mohamed G. Elfeky 1 VassiliosS. Verykios Ahmed K. Elmagarrnid 1 Purdue University Drexel University Hewlett-Packard [email protected] [email protected] [email protected] Thanaa M. Ghanem Ahmed R. Huwait 2 Purdue University Ain Shams University [email protected],edu [email protected] Abstract Data cleaning is a vital process mat ensures the quality of data smred in real·world databases. Data cleaning problems are frequently encountered in many research areas, such as kllowledge discovery in databases, data warellOusing. system imegratioll and e services.