R L W D by Oktie Hassanzadeh A
Total Page:16
File Type:pdf, Size:1020Kb
R L W D by Oktie Hassanzadeh A thesis submitted in conformity with the requirements for the degree of Doctor of Philosophy Graduate Department of Computer Science University of Toronto Copyright © 2012 by Oktie Hassanzadeh Abstract Record Linkage for Web Data Oktie Hassanzadeh Doctor of Philosophy Graduate Department of Computer Science University of Toronto 2012 Record linkage refers to the task of nding and linking records (in a single database or in a set of data sources) that refer to the same entity. Automating the record linkage process is a challenging problem, and has been the topic of extensive research for many years. Several tools and techniques have been developed as part of research prototypes and commercial so- ware systems. However, the changing nature of the linkage process and the growing size of data sources create new challenges for this task. In this thesis, we study the record linkage problem for Web data sources. We show that tra- ditional approaches to record linkage fail to meet the needs of Web data because 1) they do not permit users to easily tailor string matching algorithms to be useful over the highly heteroge- neous and error-riddled string data on the Web and 2) they assume that the attributes required for record linkage are given. We propose novel solutions to address these shortcomings. First, we present a framework for record linkage over relational data, motivated by the fact that many Web data sources are powered by relational database engines. is framework is based on declarative specication of the linkage requirements by the user and allows linking records in many real-world scenarios. We present algorithms for translation of these requirements to queries that can run over a relational data source, potentially using a semantic knowledge base to enhance the accuracy of link discovery. Effective specication of requirements for linking records across multiples data sources re- quires understanding the schema of each source, identifying attributes that can be used for link- ii age, and their corresponding attributes in other sources. Existing approaches rely on schema or attribute matching, where the goal is aligning schemas, so attributes are matched if they play semantically related roles in their schemas. In contrast, we seek to nd attributes that can be used to link records between data sources, which we refer to as linkage points. In this thesis, we dene the notion of linkage point and present the rst linkage point discovery algorithms. We then address the novel problem of how to publish Web data in a way that facilitates record linkage. We hypothesize that careful use of existing, curated Web sources (their data and structure) can guide the creation of conceptual models for semistructured Web data that in turn facilitate record linkage with these curated sources. Our solution is an end-to-end framework for data transformation and publication, which includes novel algorithms for identication of entity types (that are linkable) and their relationships out of semistructured Web data. A highlight of this thesis is showcasing the application of the proposed algorithms and frameworks in real applications and publishing the results as high-quality data sources on the Web. iii Dedication To my mother, Zahra Tarmast and father, Hossein Hassanzadeh. iv Acknowledgements In a thesis on “Record Linkage for Web Data”, it would be ironic to have acknowledgments in the traditional format that will only become part of a single unlinked record in the library. So here is a link to the acknowledgments data on the Web, that include links to URIs that describe (and link) those who are acknowledged. I hope one day, all the contents of theses and scientic publications become properly linked on the Web. Persistent URL: http://purl.org/oktie/phdack Backup Link: http://www.oktie.com/phdack To follow the requirements of a traditional thesis document, I will only list the name of those whom I have collaborated with as a part of the work described in this thesis (sorted by the number of publications we have co-authored, in descending order): Renée J. Miller, Anastasios Kementsietsidis, Lipyeow Lim, Min Wang, Reynold S. Xin, Ken Pu, Soheil Hassas Yeganeh, Mariano Consens, Fei Chiang, Hyun Chul Lee, Christian Fritz, Shirin Sohrabi; and the members of my committee: Chen Li, Sheila McIlraith, and Gerald Penn. Last but not least, I would like to thank my family for their unconditional love and support during the past ve years: my parents, my brother Aidin, sister-in-law Sophie, the Sohrabi family and in particular my beloved Shirin. v Contents 1 Introduction 1 1.1 Motivation ..................................... 3 1.1.1 Link Discovery over Relational Data ................... 3 1.1.2 Discovering Linkage Points ........................ 5 1.1.3 Linking Semistructured Data ....................... 6 1.2 Objectives ...................................... 8 1.3 Contributions .................................... 10 1.4 Organization of the Dissertation .......................... 13 2 State-of-the-art in Record Linkage 15 2.1 String Matching for Record Linkage ........................ 18 2.1.1 String Normalization ........................... 19 2.1.2 String Similarity Measures ........................ 19 2.1.3 Semantic (Language-Based) Similarity Measures ............ 26 2.2 Beyond String Matching .............................. 29 2.2.1 Leveraging Structure and Co-occurrence ................ 30 2.2.2 Clustering for Record Linkage ...................... 31 2.3 Entity Identication and Record Linkage ..................... 39 2.3.1 Schema Matching and Record Linkage .................. 40 2.3.2 Entity Identication from Text ...................... 41 vi 2.4 Record Linkage Tools and Systems ........................ 43 2.5 Summary and Concluding Remarks ........................ 47 3 Link Discovery over Relational Data 49 3.1 Introduction .................................... 49 3.2 Motivating Example ................................ 53 3.3 e LinQL Language ................................ 55 3.3.1 Examples .................................. 55 3.3.2 Native Link Methods ........................... 59 3.4 From LinQL to SQL ................................ 63 3.4.1 Approximate Matching Implementation ................. 65 3.4.2 Semantic Matching Implementation ................... 68 3.5 Experiments .................................... 70 3.5.1 Data Sets .................................. 70 3.5.2 Effectiveness and Accuracy Results .................... 72 3.5.3 Effectiveness of Weight Tables ...................... 79 3.5.4 Performance Results ............................ 80 3.6 Related Work .................................... 83 3.7 Conclusion ..................................... 84 4 Discovering Linkage Points over Web Data 85 4.1 Introduction .................................... 85 4.2 Preliminaries .................................... 89 4.3 Framework ..................................... 92 4.4 Search Algorithms ................................. 96 4.4.1 SMaSh-S: Search by Set Similarity .................... 97 4.4.2 SMaSh-R: Record-based Search ...................... 97 4.4.3 SMaSh-X: Improving Search by Filtering ................. 98 vii 4.5 Experiments .................................... 101 4.5.1 Data Sets .................................. 101 4.5.2 Settings & Implementation Details .................... 102 4.5.3 Accuracy Results ............................. 103 4.5.4 Running Times .............................. 107 4.5.5 Finding the Right Settings ......................... 109 4.6 Related Work .................................... 110 4.7 Conclusion ..................................... 112 5 Linking Semistructured Data on the Web 113 5.1 Introduction .................................... 113 5.2 Framework ..................................... 116 5.2.1 Entity Type Extractor ........................... 117 5.2.2 Transformer ................................ 119 5.2.3 Provenance ................................ 119 5.2.4 Data Instance Processor .......................... 120 5.2.5 Data Browse and Feedback Interface ................... 120 5.3 Entity Type Extraction ............................... 121 5.3.1 Basic Entity Type Extraction ....................... 121 5.3.2 Entity Type Extraction Enhancement ................... 122 5.4 Experience and Evaluation ............................. 126 5.4.1 Clinical Trials Data ............................ 127 5.4.2 Bibliographic Data ............................. 129 5.4.3 Mapping Transformation Interface .................... 131 5.5 Conclusion ..................................... 132 6 Creating High-Quality Linked Data: Case Studies 134 6.1 Linked Movie Data Base .............................. 134 viii 6.1.1 Data Transformation ........................... 137 6.1.2 Record Linkage .............................. 138 6.1.3 Publishing Linkage Meta-data ...................... 141 6.1.4 Impact and Future Directions ....................... 142 6.2 Linked Clinical Trials ............................... 143 6.2.1 Initial Data Transformation ........................ 144 6.2.2 Record Linkage .............................. 145 6.2.3 LinkedCT Live: Online Data Transformation .............. 148 6.2.4 Impact and Future Directions ....................... 148 6.3 BibBase Triplied: Linking Bibliographic Data on the Web ........... 149 6.3.1 Lightweight Linked Data Publication ................... 149 6.3.2 Internal Record Linkage .......................... 150 6.3.3 Linkage to External Data Sources ....................