Uva-DARE (Digital Academic Repository)
Total Page:16
File Type:pdf, Size:1020Kb
UvA-DARE (Digital Academic Repository) On entity resolution in probabilistic data Ayat, S.N. Publication date 2014 Document Version Final published version Link to publication Citation for published version (APA): Ayat, S. N. (2014). On entity resolution in probabilistic data. General rights It is not permitted to download or to forward/distribute the text or part of it without the consent of the author(s) and/or copyright holder(s), other than for strictly personal, individual use, unless the work is under an open content license (like Creative Commons). Disclaimer/Complaints regulations If you believe that digital publication of certain material infringes any of your rights or (privacy) interests, please let the Library know, stating your reasons. In case of a legitimate complaint, the Library will make the material inaccessible and/or remove it from the website. Please Ask the Library: https://uba.uva.nl/en/contact, or a letter to: Library of the University of Amsterdam, Secretariat, Singel 425, 1012 WP Amsterdam, The Netherlands. You will be contacted as soon as possible. UvA-DARE is a service provided by the library of the University of Amsterdam (https://dare.uva.nl) Download date:01 Oct 2021 On Entity Resolution in Probabilistic Data Naser Ayat On Entity Resolution in Probabilistic Data Naser Ayat On Entity Resolution in Probabilistic Data Naser Ayat On Entity Resolution in Probabilistic Data Academisch Proefschrift ter verkrijging van de graad van doctor aan de Universiteit van Amsterdam op gezag van de Rector Magnificus prof.dr. D.C. van den Boom ten overstaan van een door het college voor promoties ingestelde commissie, in het openbaar te verdedigen in de Agnietenkapel op dinsdag 30 september 2014, te 12.00 uur door Seyed Naser Ayat geboren te Esfahan, Iran Promotors: prof. dr. H. Afsarmanesh prof. dr. P. Valduriez Copromotor: dr. R. Akbarinia Overige leden: prof. dr. P.W. Adriaans prof. dr. P.M.A. Sloot prof. dr. A.P.J.M. Siebes prof. dr. M.T. Bubak prof. dr. ir. F.C.A Groen Faculteit der Natuurwetenschappen, Wiskunde en Informatica SIKS Dissertation Series No. 2014-32 The research reported in this thesis has been carried out under the auspices of SIKS, the Dutch Research School for Information and Knowledge Systems. Copyright c 2014 by Naser Ayat Printed by: GVO printers & designers. ISBN: 978 90 6464 812 0 to Aylar, and my parents iii Contents 1 Introduction 1 1.1 Uncertain Data . .1 1.2 Entity Resolution . .2 1.3 Entity Resolution for Probabilistic Data . .4 1.4 Research Questions . .5 1.5 Roadmap . .7 2 Background and Related Work 9 2.1 Uncertain Data Models . .9 2.1.1 Possible Worlds Semantics . .9 2.1.2 X-relation Data Model . 11 2.2 Entity Resolution . 12 2.2.1 Matching Methods . 14 2.2.2 Scalability Issues . 24 2.2.3 Entity Resolution for Probabilistic Data . 26 2.2.4 Conclusion . 29 3 Entity Resolution for Probabilistic Data 31 3.1 Introduction . 31 3.2 Problem Definition . 33 3.2.1 Data Model . 33 3.2.2 Context-Free and Context-Sensitive Similarity Functions . 33 3.2.3 Problem Statement . 34 3.3 Context-Free Entity Resolution . 36 3.3.1 CFA Algorithm . 36 3.3.2 Improving CFA Algorithm . 38 3.4 Context-sensitive Entity Resolution . 41 3.4.1 Monte Carlo Algorithm . 42 3.4.2 Parallel MC . 42 v 3.4.3 CB Similarity Function . 43 3.5 Performance Evaluation . 48 3.5.1 Experimental setup . 48 3.5.2 Results . 51 3.6 Analysis against related work . 54 3.7 Conclusion . 55 4 Entity Resolution for Distributed Probabilistic Data 57 4.1 Introduction . 57 4.2 Problem definition . 59 4.3 Distributed Computation of Most-probable Matching-pair . 59 4.3.1 Algorithm Overview . 60 4.3.2 Extract the Essential-set . 61 4.3.3 Merge-and-Backward Essential-Sets . 66 4.3.4 MPMP Computation and Data Retrieval . 67 4.4 FD Example . 68 4.5 Analysis of Communication Cost . 69 4.5.1 Distributed System Model . 69 4.5.2 Forward Messages . 71 4.5.3 Backward Messages . 72 4.5.4 Retrieve Messages . 73 4.6 Performance Evaluation . 73 4.6.1 Experimental and Simulation Setup . 74 4.6.2 Response Time . 77 4.6.3 Communication Cost . 79 4.6.4 Case Study on Real Data . 81 4.7 Analysis against related Work . 82 4.8 Conclusion . 83 5 Entity Resolution for Probabilistic Data Using Entropy Reduc- tion 85 5.1 Introduction . 85 5.2 Problem Definition . 87 5.2.1 Data Model . 87 5.2.2 Entropy in Probabilistic Databases . 87 5.2.3 Entity Resolution Over Probabilistic Database . 88 5.2.4 Problem Statement . 89 5.3 ERPD Using Entropy . 90 5.3.1 Computing Entropy in X-relations . 90 5.3.2 Merge Function . 92 5.3.3 ME Algorithm . 96 5.3.4 Time Complexity . 97 5.3.5 Multi-Alternative X-relation Case . 98 vi 5.4 Performance Evaluation . 99 5.4.1 Experimental Setup . 99 5.4.2 Performance Results . 102 5.5 Analysis against related Work . 107 5.6 Conclusion . 108 6 Pay-As-You-Go Data Integration Using Functional Dependen- cies 109 6.1 Introduction . 109 6.2 Problem Definition . 111 6.3 System Architecture . 113 6.4 Schema Matching . 114 6.4.1 Distance Function . 114 6.4.2 Schema Matching Algorithm . 120 6.4.3 Adding Data Sources Incrementally . 121 6.4.4 Execution Cost Analysis . 122 6.5 Performance Evaluation . 123 6.5.1 Experimental Setup . 123 6.5.2 Results . 124 6.6 Analysis against related Work . 127 6.7 Conclusion . 127 7 Conclusion 129 7.1 Answers to Research Questions . 129 7.2 Future Work . 132 A Uncertain Data Models 133 A.1 Incomplete Relational Models . 133 A.1.1 Codd tables . 133 A.1.2 C-tables . 134 A.2 Fuzzy Relational Models . 135 A.2.1 Fuzzy Set . 136 A.2.2 Fuzzy Relations . 136 A.3 Probabilistic Relational Models . 138 A.4 Probabilistic Graphical Models . 139 B The Courses Dataset 143 C Author's publications 147 Bibliography 149 Abstract 161 vii Samenvatting 165 Acknowledgments 169 SIKS Dissertation Series 171 viii Chapter 1 Introduction 1.1 Uncertain Data There are many applications that produce uncertain data. Untrusted sources, imprecise measuring instruments, and uncertain methods, are some reasons that cause uncertainty in data. Let us provide some examples of uncertain data which arise in different real-life applications. • Uncertainty inherent in data. There are many data sources that are inherently uncertain, such as scientific measurements, sensor readings, lo- cation data from GPS, and RFID data. • Data integration. An important step in automatically integrating a num- ber of structured data sources is to decide which schema elements have the same semantics, i.e. schema matching. However, the result of schema matching is also itself often uncertain. For instance, a schema matching approach outputs \phone" and \phone-no" as matching, while mentioning that they have a 90% chance to have the same semantics. • Information extraction. The process of extracting information from un- structured data sources, e.g. web, is an uncertain process, meaning that extracted information are typically associated with probability values indi- cating the likelihood that the extracted information is correct. Consider, for instance, an extractor which decides with 0.95 confidence that Bob lives in Amsterdam. • Optical Character Recognition (OCR). The aim of OCR is to recog- nize the text within handwritten or typewritten scanned documents. The output of OCR is often uncertain. For instance, an OCR method, which cannot recognize, with certainty, the 'o' character in the word \Ford", may produce the set f(\Ford" : 0.8), (\F0rd" : 0.2)g as output, meaning that 1 2 Chapter 1. Introduction each of the two output words may represent the original scanned word with the associated probability. Broadly speaking, there are two approaches in dealing with uncertain data: 1) ignoring it, and 2) managing it. While the former approach results in loss of information which is contained in uncertain data, in recent years, we have been witnessing much interest in the latter approach. As shown in the above examples, most real-life applications tend to quantify data uncertainty with probability values, referred to as probabilistic data. We also deal with probabilistic data in this thesis. 1.2 Entity Resolution Entity Resolution1 (ER) is the process of identifying duplicate tuples, refereing to the tuples that represent the same real-world entity. The ER problem arises in many real-life applications, such as data integration. Consider, for instance, that the large company A acquires another large company B. It is quite likely that A and B share many customers. On the other hand, it is not reasonable to expect a person, who is the customer of both companies, to be represented by identical tuples in the databases of the two companies. For example, consider tuple tA “ p\Thomas Michaelis"; \45, Main street"q in A's database and tuple tB “ p\T. Michaelis"; \45, Main st., 1951"q in B's database, both representing the same customer. In the above example, it is clear for humans that tA and tB refer to the same person. However, it is not reasonable to expect the humans to solve the ER problem on large datasets, since every tuple in the dataset should be compared against all other tuples in the same dataset, giving the ER process an Opn2q complexity. On the other hand, computational power exists on computers to deal easily with the ER problem on large datasets, although they lack the human intelligence, which greatly simplifies the identification of duplicate tuples.