PhD Dissertation

International Doctorate School in Information and Communication Technologies

DISI - University of Trento

Knowledge-Based Open Entity Matching

Stefano Bortoli

Advisor:

Prof. Paolo Bouquet

Università degli Studi di Trento

April 2013

Preface

The path that led me to the realization of this work starts back in early 2005. At that time I had just completed my bachelor degree in Computer Science at the University of Trento, and started working as a research assistant on the EU-funded VIKEF project in the group of professor Bouquet. Working on that project, we had to deal for the first time with problems related to the integration of large RDF graphs resulting from diverse automatic data extraction processes. We quickly realized that a viable solution to the problem would have been to assign a priori the same URI to the resources we wanted to mention unambiguously in the different RDF graphs. This intuition follows the principle of Occam’s Razor, suggesting to avoid the unnecessary multiplication of identifiers for the same entity. Thereby, under the coordination of professor Bouquet, we started to conceive and implement the first prototype of what we called the Entity Repository. The discussions and brainstorming necessary for the conception and first implementation of the prototype fostered the development of a more ambitious vision: define a global naming service for the creation and maintenance of globally unique identifiers for non-web resources. Following this objective, professor Bouquet worked to form a consortium of prestigious research institutes and companies, and successfully obtained a substantial EU grant to develop the idea. After three years of intense and challenging work, in 2009 the OKKAM project consortium released a first complete working prototype of the Entity Name System (ENS). The ENS embeds state-of-the-art solutions from many fields of information science in a scalable architecture. The ENS was considered a real success, with plenty of potential applications in real world cases. Therefore, with the support of the closest partners, professor Bouquet founded a spin-off company to sustain and further develop the vision of a Semantic Web where frictionless entity-centric information integration was possible. Nevertheless, despite the efforts of brilliant researchers, the solutions devised for the most challenging problem of entity matching required further specialization and development. In this context, I decided to apply for a PhD scholarship at the International Doctoral School in ICT of the University of Trento, and to start my research to define a more effective and efficient

solution to the entity matching problem under the supervision of professor Bouquet. Needless to say, since then the path to the definition of this thesis passed through tons of papers, a fruitful visit at the Information Sciences Institute in Los Angeles, moments of excitement due to discoveries, and depressing moments due to failures in the evaluation of intuitions. Most of all, what characterized this period was a constant struggle between the will to follow interesting leads towards the solution of possibly marginal aspects of the problem and the need to stay focused on the target. I am not entirely sure the work proposed in this thesis is the best possible I could have done. Probably, if I could do it again, I would do it differently. For sure, I did not spare energy, passion and commitment. The adrenaline, happiness and satisfaction coming from the successful testing of a new solution always compensate for long working hours, and for weekends and holidays spent reading and programming. Sometimes one may even get lost, but I believe that getting lost is a necessary condition to push oneself in the search for innovative solutions. One of my professors in high school used to tell the students: “when you don’t understand anything about something, it is the moment you start learning about it”. Well, learning is what I have done, and learning is what I want to do. I sincerely believe that what is presented in this work is neither revolutionary, nor conclusive. At the same time, I am convinced that this work in its broadness is the cornerstone for the development of innovative solutions that contribute to moving a step ahead towards the realization of the vision I embraced back in 2005.

Contents

1 Introduction
  1.1 Mission Statement

I The Problem and the State of the Art

2 The Problem: Matching in the Open World
  2.1 Examples of Semantic Heterogeneity
  2.2 Examples of Structural Heterogeneity
  2.3 Examples of Inconsistent Descriptions

3 The State of the Art
  3.1 Entity Matching in Information Systems
    3.1.1 Probabilistic Methods
    3.1.2 Distance-based Methods
    3.1.3 Rule-based Methods
  3.2 Entity Matching in the (Semantic) Web
  3.3 String Similarity Metrics

II Vision, Theories and Definitions

4 The Knowledge-based Matching Vision

5 Identification Ontology
  5.1 Defining a Conceptual Model
  5.2 Formal Analysis of the Conceptual Model
    5.2.1 OntoClean Meta-properties Annotation
    5.2.2 Constraints Violation in Backbone Taxonomy
  5.3 Identification Ontology Taxonomy
  5.4 Meta-properties for Identification
  5.5 Features for Open Entity Matching
    5.5.1 Features for Entity Type Person
    5.5.2 Features for Entity Type Location
    5.5.3 Features for Entity Type Organization
    5.5.4 Remarks About Chosen Features
  5.6 Contextual Semantic Harmonization

6 Rules for Open Entity Matching
  6.1 Theoretical Foundations
  6.2 Rules Definition, Application and Satisfaction
  6.3 Rules Normalization
    6.3.1 Atom Operator Normalization
    6.3.2 Transitive Operator Normalization
  6.4 Rules Merging
    6.4.1 Rules Subsumption
    6.4.2 Merging ρ-subsumed Rules
    6.4.3 Similarity Thresholds Normalization
    6.4.4 Rules Merging Process
    6.4.5 Defining Rules Class Hierarchy
  6.5 Remarks about Rules for Open Entity Matching

III Implementation and Evaluation

7 Semantic and Structural Harmonization
  7.1 Entity Type Harmonization
    7.1.1 Examples of Entity Type Contextual Mappings
  7.2 Semantic Harmonization of Features
    7.2.1 Mappings for Features of Person
    7.2.2 Mappings for Features of Location
    7.2.3 Mappings for Features of Organization
    7.2.4 Remarks on Contextual Mappings
  7.3 Structural Harmonization of Features

8 Building Rules for Open Entity Matching
  8.1 Top-down Rule Construction
  8.2 Learning Rules for Open Entity Matching
    8.2.1 Data Samples Collection
    8.2.2 Labeling Samples for Training Set
    8.2.3 Training Set Filter
    8.2.4 Learning a Decision Tree
    8.2.5 Extracting Rules From Decision Trees
    8.2.6 Training Set Partitions and Combinations
  8.3 Combining Top-down and Bottom-up Rules
    8.3.1 Plain Rules Combination
    8.3.2 Plain Top-Down Threshold Normalization
    8.3.3 Top-Down Priority Rules Combination
    8.3.4 Positive and Negative Only Top-Down

9 Fingerprint Match Solution
  9.1 Computing String Similarity
    9.1.1 Best Similarity Metric Per Feature
  9.2 Computing Features Comparison
    9.2.1 Greedy Features Comparison
    9.2.2 Features Comparison with Relative Completeness
    9.2.3 Features Comparison Considering Average Score
  9.3 Computing Rule-Based Matching Decision
  9.4 Fingerprint Similarity

10 Experimental Evaluation
  10.1 Evaluation Datasets
    10.1.1 Person Evaluation Datasets
    10.1.2 Location Evaluation Datasets
    10.1.3 Organization Evaluation Datasets
  10.2 Evaluating Fingerprint Match
    10.2.1 Top-down Only Rules Experiments
    10.2.2 Bottom-up Only Rules Experiments
    10.2.3 Mixed Rules Experiments
    10.2.4 Experiments Results Analysis
  10.3 Comparing with the FBEM Matcher

IV Conclusions and Appendix

11 Conclusions and Future Work

12 Acknowledgments

A Appendix A: Semantic Harmonization Mappings

B Appendix B: Dataset Collection

Chapter 1

Introduction

In the very promising vision of the Semantic Web, software agents are capable of exploiting semantically annotated information in order to automatically perform time-expensive tasks on behalf of human users [12, 2]. In order to realize this ambitious goal, much effort has been spent in the past to define and spread the usage of tools aiding the production of semantically annotated information. As a result, software agents are expected to be capable of gathering, exchanging, automatically processing and integrating semantically structured information to perform sophisticated tasks. Nevertheless, there are still many subtle and complex philosophical issues to be solved, undermining the establishment of a solid foundation for the whole architecture of the (Semantic) Web. One of the debated points is related to the problem of identity and reference on the Web [75, 25, 24, 11, 94, 44]. Namely, there is a disagreement about the way syntactically unique identifiers (i.e. Uniform Resource Identifiers) should be tied to resources (i.e. entities) in order to provide unambiguous means of reference (i.e. names), and about how these should fit the current architecture of the Web. In the first concrete realization of the Semantic Web vision, known as Linked Data [13, 20], the proliferation of identifiers for entities is deliberately allowed, relying on the assumption that, with time, conventions will emerge. Notice that in the real world, society deals with the identification issue by convention, e.g. passport, social security number, ISBN, computer network card MAC address, Web domain names, etc. However, despite some concrete attempts, e.g. [25], the community currently contributing to the development of the Semantic Web seems to be reluctant to adopt any policy of naming conventions; these are perceived as authoritarian and against the free nature of the Web. Still, the promising vision of a world where software agents, exploring semantically structured knowledge, are able to perform advanced and critical tasks is exciting, even when no naming convention is commonly shared. In order to automatically

integrate semantically structured information outside of naming conventions, software agents ought to deal with several complex, multi-faceted issues. One of them is related to the solution of the entity matching problem, which can be particularly complicated considering the heterogeneous and ambiguous knowledge available on the Web [75]. In a few words, the entity matching problem consists in establishing whether two entity descriptions (i.e. sets of attributes) refer to the same real world entity or not. The entity matching problem is well known in the context of information systems and databases. In fact, for many different reasons it is often necessary to seek duplicate records in large databases. Several effective solutions to this problem have been proposed and applied. Most of these solutions were conceived to be adopted in a controlled environment, under the supervision of experts responsible for taking educated decisions when automatic methods required clerical review. Considering the una tantum nature of the operation, executed in a closed controlled environment, these types of solutions were, and are, acceptable. However, if we consider the entity matching problem in the context of the Web, it is easy to realize that many of the assumptions underlying satisfactory entity matching solutions no longer hold. For example, the domain of interpretation is too broad to be mastered by an expert, and the execution of traditional Extraction, Transformation, and Loading (ETL) operations becomes particularly complicated due to the inherent semantic and structural heterogeneity that characterizes the data available on the Web. Moreover, given the scale and amount of data available, it is impossible to apply many heuristic techniques that lead to effective solutions in a limited, controlled environment (e.g. threshold optimization). Most of the existing solutions to the entity matching problem - including the most famous Theory of Record Linkage of Fellegi and Sunter [60] - rely on the definition of probabilistic and similarity thresholds that support entity matching decisions. In a closed context, it is plausible to assume the precise setting of such thresholds, supporting reliable matching decisions in the majority of the cases. However, this becomes particularly complicated, if not impossible, in the context of the Web. Moreover, outside of their context of definition, probability and distance scores can hardly be interpreted, becoming useful solely for ranking purposes. The selection of the best score on top of a ranked list can be suitable to support positive matching decisions. However, this approach starts to show its limitations when considering a single pair of descriptions. When can we consider a similarity score high or low enough to support reliable positive or negative matching decisions? Answers to this question have been given in several works (for example [40]). However, threshold optimization is not possible in a global, open environment. In fact, to find the optimal probability likelihood (or similarity threshold) that minimizes classification error we would need complete knowledge about any entity of the world. In this work, we assume that in principle complete knowledge is not achievable. A way to overcome the threshold dilemma is to define explicit matching rules, dictating the logic of matching, as suggested in [103, 79].
The solution of entity matching relying on rule systems is very convenient because, in general, rules are self-explanatory, explicit, intelligible and can be used to justify and support the evaluation of entity matching decisions. The main drawbacks of rule-based systems are the considerable amount of human effort required to create and maintain rules for entity matching, and the fact that they usually define very sharp conditions that may not make them applicable in all cases. In fact, the manual definition of custom rules to match pairs of descriptions, as proposed for example in SILK [147], suffers from scalability issues. As a matter of fact, if we want to solve the entity matching problem between N different sources, we have to define N × (N − 1) SILK linkage rules, one for each pair of sources. To overcome this problem, recent works propose advanced methods for automatically mining such entity matching rules [88, 118]. This approach may reduce the human effort required to define rules, but usually these methods are not easy to employ. What’s more, the rules constructed are only suitable to tackle the problem on the datasets for which they are built. Also in this case, the scalability of the solution is seriously reduced. Following these observations, the present thesis aims to define a novel knowledge-based entity matching framework for the implementation of a reliable and incrementally scalable solution to the entity matching problem in the context of the Web. The founding pillars of the solution we propose in this work are (1) a lightweight ontology defining the entity types and the features relevant for the solution of the entity matching problem, and (2) a set of entity matching rules expressed in terms of the ontology, forming an equational theory to support reliable entity matching decisions. The ontology, representing the types and features considered, has a two-fold role: on the one hand, it provides a base for discerning relevant features from irrelevant ones; on the other hand, it provides a central point of reference for the definition of contextual semantic mappings which ease the problem of semantic heterogeneity. Rules defined in terms of the ontology can be applied and reused to match any pair of descriptions once the semantics of the features have been harmonized towards the defined ontology. It is important to stress that this work does not address the problem of automatic definition of ontological mappings, as many solutions already exist [59]. In this work, the analysis of this problem is limited by assuming the existence of such mappings. The practical realization of the proposed solution includes the definition of a process to support the building and maintenance of generic and effective entity matching rules. In this regard, it is interesting to notice that, when required, people are capable of solving the entity matching problem quite effectively, relying on a combination of heuristics, background knowledge, and common sense. In particular, when looking at descriptions to be matched, people are capable of isolating the features that are relevant for matching, and of taking accurate matching decisions. To confirm this observation, recent works have proposed to solve the entity matching problem relying also on crowd-sourcing [148, 117]. Therefore, capturing this type of matching knowledge seems to be a suitable lead for the definition of reliable entity matching rules.
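To make the role of such rules more concrete, the sketch below shows one possible way an ontology-grounded matching rule could be represented and applied to a pair of descriptions whose features have already been harmonized; the feature names, the similarity metric and the thresholds are illustrative assumptions, not the actual rules defined later in this thesis.

# Minimal sketch of a declarative entity matching rule applied to a pair of
# descriptions already harmonized to a shared set of ontology features.
# Feature names, thresholds and the similarity metric are illustrative assumptions.
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Generic string similarity in [0, 1]; a real system would pick a metric per feature."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# A rule is a conjunction of (ontology feature, minimum similarity) conditions.
PERSON_RULE = [("name", 0.9), ("birth_date", 1.0)]

def apply_rule(rule, desc_a: dict, desc_b: dict) -> str:
    """Return 'match' when every condition holds, 'unknown' when a required feature
    is missing from either description (open world), 'no-match' otherwise."""
    for feature, threshold in rule:
        if feature not in desc_a or feature not in desc_b:
            return "unknown"                      # cannot decide without the feature
        if similarity(desc_a[feature], desc_b[feature]) < threshold:
            return "no-match"
    return "match"

a = {"name": "Antonio Carlos Jobim", "birth_date": "1927-01-25"}
b = {"name": "Antonio Carlos Brasileiro de Almeida Jobim", "birth_date": "1927-01-25"}
print(apply_rule(PERSON_RULE, a, b))              # decision for this pair of descriptions

Under the open world assumption, the third outcome is as important as the other two: missing evidence is not treated as negative evidence.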
Notice that this does not imply that people will be asked to explicitly define and maintain the rules; rather, we aim to extract rules based on people’s feedback about matching decisions on ambiguous cases. This would enable us to use machine learning techniques to automatically elicit entity matching rules, and thereby support the scalability and sustainability of the knowledge-based solution. Furthermore, recent trends related to the development of the Web 2.0 showed how the creation of a community of interest can help to scale up the human effort required. Still, relying purely on machine learning techniques to construct entity matching rules may be limiting, in that available data may not contain all the information necessary to define a complete set of rules. For example, it is known that some properties can be very effective in driving matching decisions (e.g. email address), but we can hardly find this type of information in open public data sources. Furthermore, fuzziness in the data and human errors could decrease the quality and the reliability of the elicited rules. Therefore, this work proposes to integrate rules that result from a bottom-up, machine learning process with entity matching rules resulting from a formal ontological analysis of the features defined in the ontology. Our intuition is that the combination of top-down and bottom-up rules through a specialized merging process will lead to the definition of a more complete and effective set of entity matching rules to be employed in the context of the Web. The definition of a generic knowledge-based entity matching process requires approaching the search for a solution in a multi-disciplinary manner, exploring the latest advancements of different communities dealing with diverse aspects of the problem. In particular, we should explore the scientific tools provided by disciplines dealing with knowledge representation and human decision processes, such as cognitive science and philosophy, to ground the “knowledge base pillar” of the overall matching process. We are aware that in principle knowledge-based solutions may be negatively affected by inconsistencies and errors in the data available on the Web. In any case, we believe that it is worthwhile to explore the definition of such a solution and to implement and evaluate its bootstrap, confident that the creation of a community of interest would render sustainable the effort necessary for the incremental definition of improvements and refinements. The remainder of this thesis is organized as follows: in the first part, we formally define the problem (chapter 2) and we present a review of the state of the art in the solution of the entity matching problem (chapter 3); in the second part, we present in detail the vision underlying the solution we propose (chapter 4) and we formally define and describe the ontology developed (chapter 5), and the rules for entity matching (chapter 6); in the third part of the thesis, we describe the implementation of what we formally defined in the second part, providing details about the solution of the semantic and structural heterogeneity (chapter 7), describing the process of construction of rules for entity matching (chapter 8), and presenting solutions related to the practical execution of entity matching as a software program (chapter 9).
Finally, in chapter 10 we validate the proposed approach through a set of experiments showing the impact of some of the solutions proposed in this work, and then in chapter 11 we present some conclusions and future work.

1.1 Mission Statement

In this work we argue for the definition of a knowledge-based entity matching framework for the implementation of a reliable and incrementally scalable solution. Such a knowledge base is formed by an ontology and a set of entity matching rules suitable to be applied as a reliable equational theory in the context of the Semantic Web. In particular, we are going to prove that, relying on the existence of a set of contextual mappings to ease the semantic heterogeneity characterizing descriptions on the Web, a knowledge-based solution can perform comparably to, and sometimes better than, existing state-of-the-art solutions. We further argue that a knowledge-based solution to the open entity matching problem ought to be considered under the open world assumption, as in some cases the descriptions to be matched may not contain the information necessary to take any accurate matching decision. The main goal of this work is to show how the framework proposed is suitable to pursue a reliable solution of the entity matching problem, regardless of the set of rules or the ontology adopted. In fact, we believe that the structural and syntactic heterogeneity affecting data on the Web undermines the definition of a single global solution. However, we argue that a knowledge-driven approach, considering the semantics and meta-properties of compared attributes, can provide important benefits and lead to more reliable solutions. To achieve this goal, we are going to implement several experiments to evaluate different sets of rules, testing our thesis and learning important lessons for future developments. The sets of rules that we will consider to bootstrap the solution proposed in this work are the result of diverse complementary processes: first, we want to investigate whether capturing the matching knowledge employed by people in taking entity matching decisions by relying on machine learning techniques can produce an effective set of rules (bottom-up strategy); second, we investigate the application of formal ontology tools to analyze the features defined in the ontology and support the definition of entity matching rules (top-down strategy). Moreover, in this work we argue that by merging the rules resulting from these complementary processes, we can define a set of rules that can reliably support entity matching decisions in an open context.

Part I

The Problem and the State of the Art

Chapter 2

The Problem: Matching in the Open World

The entity matching problem is known in many different fields of computer science under different names: record linkage or record matching [60], merge/purge [80], duplicate detection [131], entity and object identification [95, 160] in the database community; instance identification [150], database hardening [46], name matching [18], among others, in the Artificial Intelligence community; “object consolidation” [42] and “coreference resolution” [66] are used in the contexts of the Semantic Web and of Named Entity Recognition and Natural Language Processing tasks. In the context of databases and information systems, the problem of duplicate record creation is the result of the breakdown of the information model underlying relational databases when data stored in multiple databases ought to be integrated. In [93], Kent describes in a lucid way the problems of information integration among different databases when naming and identification are managed only considering a local (or private) scope. Ideally, a database schema tries to present a faithful model of what we call reality. Kent asserts that the most fundamental principle of data modeling relies on the one-to-one correspondence between the proxy objects in the database and the entities in the real world the proxies are supposed to represent. Naming is essential to support modeling, both when referring to entities and to relations (functions). Computer programs can hardly resolve ambiguous references in specific contexts autonomously, thus they rely on unique identifying codes assigned to entities. Unfortunately, these codes can be unreliable as identifying means. Value-based systems such as relational databases do not provide good models for identity, as primary and foreign keys are rough approximations. Some objects might not have a primary relation in which their identifier serves as primary key; the same key might be a primary key in several

tables; nothing prevents several primary keys from being assigned to the same object. This inadequate identity management requires a two-step process to identify records in different databases: first, identity needs to be disambiguated within each single system, and then integrated in the multi-database environment. The information model starts to break down when we deal with several databases at the same time and we want to present the illusion of a single database. Every creation in a database creates a new distinct proxy, so we can no longer ignore the difference between proxies and entities. Let us say x = y when x and y refer to the same proxy, whereas we say x ≡ y when they refer to the same entity. Notice that x = y should imply x ≡ y, but not vice versa. Thus, creation in a database no longer creates a new entity, but creates a new proxy referring to an entity that was not represented in that particular database. How shall we know that two proxies represent the same entity, i.e. that x ≡ y even though x ≠ y? Hidden constraints about identifiers that were implicit in a single database emerge and create a lot of problems in multi-database systems. The breakdown of the information model underlying relational databases described by Kent is the main cause of the ambiguities that are often the main source of the duplicates requiring entity matching resolution. In a world where data persistence was managed mostly in relational databases, the solution of the entity matching problem focused on database records. However, the fast growth in the amount of digital data production driven by Web 2.0, cloud services and the Social Web imposed a paradigm shift in the management of data representation at the persistence level. In fact, highly scalable NOSQL database systems underlie most of the popular web applications and cloud-based solutions (e.g. Google Bigtable [38], Facebook Cassandra1, Amazon SimpleDB2), providing reliable containment of the data deluge. The high scalability of this storage technology relies on the fact that data are not represented explicitly as relations, and the storage layer is agnostic about the structure of the data [78]. This often allows modeling the database as a simple two-column table, where a unique key gives access to the structured information, possibly compressed, stored in a single field. Data are represented, among others, using XML3, JSON4 or other proprietary syntax (e.g. Google Protocol Buffers5). This configuration is optimal to support horizontal partitioning of the dataset, which can be easily distributed on several machines, providing the required scalability [78]. Database management systems then become distributed systems, and access to the records is mediated by novel data processing models such as MapReduce [54], or by distributed inverted indexes (e.g. Apache Solr [136]), which provide sub-linear scalability with respect to the indexed data.

1http://incubator.apache.org/cassandra/
2http://aws.amazon.com/simpledb/
3http://www.w3.org/TR/2006/REC-xml11-20060816/
4http://tools.ietf.org/html/rfc4627
5http://code.google.com/p/protobuf/
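The sketch below illustrates, with invented names and data, the two-column layout just described: a unique key gives access to an opaque serialized record, and keys can be hash-partitioned across nodes without the store knowing anything about the record structure.

# Sketch of the schema-agnostic key-value layout described above: the storage
# layer only sees (key, serialized blob) pairs, so records can be hash-partitioned
# across nodes regardless of their internal structure. Names and data are invented.
import json
import zlib

NODES = 4

def node_for(key: str) -> int:
    """Pick the node holding a key by hashing it (horizontal partitioning)."""
    return zlib.crc32(key.encode("utf-8")) % NODES

store = {}   # stands in for one partition of the distributed two-column table

def put(key: str, record: dict) -> None:
    store[key] = json.dumps(record)      # the structure of the record is opaque to the store

def get(key: str) -> dict:
    return json.loads(store[key])

put("entity-42", {"name": "Tom Jobim", "occupation": ["composer", "pianist"]})
print(node_for("entity-42"), get("entity-42")["name"])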

The ongoing paradigm shift in data representation models is among the reasons to move from a record-oriented conception of the entity matching problem, abstracting records to the level of descriptions. Descriptions can be conceived as simple sets of attributes of the form A^[C] = {a_1^[M], ..., a_n^[M]} and B^[C] = {b_1^[M], ..., b_s^[M]}, where a_i and b_i are attributes of the form (α_i = v_i), with α a possibly empty attribute name and v an attribute value. Furthermore, M = {m_1, ..., m_j} is a set of metadata related to the attribute value (e.g. language, encoding, timestamp, etc.), and C = {c_1, ..., c_k} is a set of contextual parameters referring to the description (e.g. entity type, provenance, etc.). Entity matching consists in attempting to establish whether A and B refer to the same real world entity, and thus to assert that A ≡ B. The problem of entity matching, widely studied in the information system and database communities, produced over the years a large set of techniques that often proved effective in their application contexts. Most of the techniques conceived in the information system area relied on a set of assumptions outside of which the matching rarely performed in a satisfying way. Among others, a common assumption is that the matching task is executed on a specific type of entity, after a supervised preprocessing task aiming at data standardization, including field name homogenization and data format conversion. Typically, the techniques defined are assumed to be supervised by an expert mastering the domain knowledge underlying the information system, and thus capable of tuning the matching tool and taking adequate matching decisions on the cases requiring clerical review. These and other assumptions often allowed circumscribing most of the entity matching decision making around the string matching problem, for which several sophisticated techniques were defined and successfully applied. However, it is clear that these techniques were conceived to be applied in a controlled environment and under the supervision of human experts that take responsibility for the quality of the data in the integrated system. Furthermore, any decision about a pairwise entity match would affect only the aligned databases, without any effect on the world outside the integrated one. An overview of the more relevant approaches is presented in chapter 3, particularly in section 3.1. One of the main promises of the vision corroborating the development of the Semantic Web is the possibility of exploiting the automatic integration of a potentially vast amount of semantically structured information. In particular, recent trends go towards the definition of entity-centric information processing, aiming at producing mash-ups of sparsely distributed information about real world entities; see for example

Sig.ma6. This fosters the creation of entity-centric search engines, raising new challenges for the solution of the entity matching problem in the context of the Web. As mentioned in the introduction, the lack of a naming convention guiding the adoption of a defined set of globally unique URIs as non-ambiguous means of reference to real world entities forces the pursuit of information integration to pass through the solution of the entity matching problem over generally ambiguous descriptions [75]. Unfortunately, many of the traditional assumptions leading to an acceptable solution of the matching problem in the information system context are not valid in the open context of the Web. For example, matching decisions can neither rely on semantic and structural homogeneity of data achieved through preprocessing tasks, nor on trustworthy human expertise to supervise any ambiguous matching decision, as the amount of information available on the Web is in general too broad to be handled and mastered, and thus strongly affected by subjectivity issues. Furthermore, in the context of the Web of Data, the natural conclusion of an object consolidation task implies the creation of an owl:sameAs statement explicitly declaring the identity between two resources. Such a statement becomes a novel part of the Web of Data, and thus would affect any further processing of the information related to the matched resources. A recent study [73] estimated that only 51% of the currently existing owl:sameAs statements that are part of the Linked Data information space connect descriptions referring to the same real world entity. The estimation was performed by asking people to evaluate the equivalence of the descriptions of supposedly identical resources. An overview of the most relevant approaches dealing with object consolidation on the Web is presented in section 3.2. In this context we define Open Entity Matching as a reformulation of the traditional entity matching problem in the context of the Web, assuming its scale, mutability, heterogeneity and possible inconsistencies, without making any strong assumption about the quality of the information involved in a matching process. In particular, no assumption is made either about the way data are structured, given that they can be represented in the very general form described above, or about the semantics of the attributes composing a description. From now on we refer to these characteristics as the semantic and structural heterogeneity of a description. This means that, in principle, a solution for the open world entity matching problem should provide a matching decision for any type of entity, considering the Web as the domain of interpretation for all possible matching entities. These settings cause several complications: 1. semantic heterogeneity: descriptions are often represented according to different vocabularies and schemas. This problem is typical also of information system ETL tasks. However, database integration is usually supervised by database

6http://sig.ma

administrators that master the domain considered and are capable of aligning the schemas involved pairwise. This task is intuitively more complicated when managed in an open, wide and heterogeneous context such as the Web. Indeed, it is not uncommon that different people interpret the semantics of properties and attributes differently when they choose how to semantically annotate their data. This natural ontological relativity in the interpretation of the semantics of attributes is one of the causes of the heterogeneous usage of attributes.

2. structural heterogeneity: attributes are often represented at different levels of granularity. There are descriptions that wrap most of the descriptive information in a generic descriptive paragraph, and others that rely on a wide set of attributes. Specific composite attributes can be represented as a single field, or by specifying each element in a different attribute. For example, an address can be represented as a single field “address” or can be shredded into ’street number’, ’street name’, ’city’, ’zip code’, ’state’, etc. A similar approach can be applied to other attribute types such as date, name, geo-coordinates, etc. Another phenomenon that can be found in the wild and uncontrolled domain of the Web is that composite attributes are represented as multi-valued instances of the same attribute. For example, the elements of an address could be represented as several instances of the “address” property (a minimal sketch of two such structurally heterogeneous descriptions is given after this list).

3. underspecification: a description could be underspecified with respect to its interpretation in an open context, possibly omitting implicit contextual information, and thus causing problems of ambiguity (e.g. a description of a restaurant could omit the name of the city among the attributes used for the description, using only the street name and number, as the data are meant to be available through a web site specific to a city). Another example could be the MusicBrainz7 dataset, where names of artists are mentioned without any specific reference to the fact that they are musicians, creating ambiguities when homonyms exist in sources providing data about Health Care Providers8.

4. over-specification: a description could be over-specified, presenting an excessive amount of information that is relevant or interpretable only within a specific context, or in general not relevant for identification purposes [111]. Intuitively, the larger the context considered when taking a matching decision, the higher the possibility of dealing with underspecified descriptions. For example, considering the limited context of a family, simple descriptions containing only first

names would be sufficient to solve the entity matching problem. However, if we match descriptions of people in a school, the matching of first names does not suffice anymore, and further information is necessary to take an accurate matching decision. The amount of information necessary to take a matching decision in a limited context is usually small enough to allow the definition of an information model supporting a precise system of unambiguous references (an information system database). However, when it comes to the Web, determining the precise amount of information necessary to take an accurate matching decision would require complete knowledge about each real world entity mentioned on the Web, which is practically impossible.

7http://musicbrainz.org/, open music encyclopedia
8Factual Health Care Provider dataset http://www.factual.com/data-apis/places/healthcare
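To fix ideas, the sketch below encodes, under invented attribute names and values, two descriptions of the same hypothetical restaurant in the generic attribute/metadata/context form introduced earlier in this chapter, exhibiting the semantic and structural heterogeneity discussed in the first two points.

# Minimal sketch of the description model: a description is a set of attributes
# (name, value), each optionally carrying metadata M, plus contextual parameters C
# for the description as a whole. All names and values are invented for the example.
from dataclasses import dataclass, field

@dataclass
class Attribute:
    name: str            # alpha_i, possibly empty
    value: str           # v_i
    metadata: dict = field(default_factory=dict)   # M: language, encoding, timestamp, ...

@dataclass
class Description:
    attributes: list
    context: dict = field(default_factory=dict)     # C: entity type, provenance, ...

# Structural heterogeneity: one source shreds the address, the other does not.
a = Description(
    attributes=[
        Attribute("name", "Trattoria Da Mario"),
        Attribute("street", "Via Roma 12"),
        Attribute("city", "Trento"),
    ],
    context={"type": "Organization", "provenance": "city-guide.example"},
)
b = Description(
    attributes=[
        Attribute("label", "Da Mario", metadata={"lang": "it"}),   # semantic heterogeneity: 'label' vs 'name'
        Attribute("address", "Via Roma 12, Trento"),               # unshredded composite attribute
    ],
    context={"type": "Restaurant", "provenance": "reviews.example"},
)

A matcher operating on such descriptions must first map “label” onto “name” and decompose “address” before any feature-wise comparison becomes meaningful.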

2.1 Examples of Semantic Heterogeneity

http://xmlns.com/foaf/0.1/givenName: Antônio Carlos Brasileiro de Almeida Jobim
http://xmlns.com/foaf/0.1/givenName: Antônio Carlos
http://dbpedia.org/ontology/abstract: Antônio Carlos Brasileiro de Almeida Jobim (January 25, 1927 – December 8, 1994), also known as Tom Jobim, was a Brazilian songwriter, composer, arranger, singer, and pianist/guitarist. He was a primary force behind the creation of the bossa nova style, and his songs have been performed by many singers and instrumentalists within Brazil and internationally.
http://dbpedia.org/property/label: MCA Records
http://dbpedia.org/ontology/wikiPageExternalLink: http://www.tomjobim.com.br/
http://purl.org/dc/terms/subject: Category:Música Popular Brasileira pianists
http://dbpedia.org/ontology/occupation: Singer
http://dbpedia.org/property/name: Tom Jobim
http://purl.org/dc/terms/subject: Category:Brazilian singer-songwriters
http://dbpedia.org/property/born: 1927-01-25
http://dbpedia.org/property/associatedActs: João Gilberto
http://dbpedia.org/property/label: Philips Records
http://purl.org/dc/terms/subject: Category:Verve Records artists
http://dbpedia.org/property/origin: Rio de Janeiro, Brazil
http://dbpedia.org/property/name: Jobim, Antonio Carlos
http://dbpedia.org/ontology/genre: Música Popular Brasileira
http://xmlns.com/foaf/0.1/surname: Jobim
http://dbpedia.org/property/background: solo singer
http://xmlns.com/foaf/0.1/homepage: http://www2.uol.com.br/tomjobim/
http://purl.org/dc/terms/subject: Category:Cardiovascular disease deaths in New York
http://dbpedia.org/ontology/hometown: Rio de Janeiro (state)
http://dbpedia.org/ontology/recordLabel: A&M Records
http://dbpedia.org/property/dateOfDeath: 1994-12-08
http://purl.org/dc/terms/subject: Category:Grammy Award winners
http://dbpedia.org/property/label: Decca Records

Table 2.1: Description retrieved from DBPedia

In order to make the complexity and multi-faceted nature of the problem more explicit, please consider the samples of descriptions presented in tables 2.1 and 2.2. Table 2.1 presents the description of the Brazilian musician Antonio Carlos Jobim (aka

name: Antônio Carlos Jobim
occupation: Musician
occupation: Artist
mbid: 7a8dbe84-f4c0-4457-bfa3-edced1f8cde0
url: http://www.last.fm/music/Ant%C3%B4nio+Carlos+Jobim
image: http://userserve-ak.last.fm/serve/34/2245888.jpg
image: http://userserve-ak.last.fm/serve/64/2245888.jpg
image: http://userserve-ak.last.fm/serve/126/2245888.jpg
image: http://userserve-ak.last.fm/serve/252/2245888.jpg
image: http://userserve-ak.last.fm/serve/_/2245888/Antnio+Carlos+Jobim.jpg
streamable: 1
tag: bossa nova
tag: jazz
tag: brazilian
tag: mpb
tag: latin
bio summary: Antônio Carlos Brasileiro de Almeida Jobim (born January 25, 1927 in Rio de Janeiro, Brazil – December 8, 1994 in New York City), also called Tom Jobim, was a Brazilian composer, arranger, singer, pianist and perhaps the greatest legend of bossa nova. Jobim’s compositions, many performed by João Gilberto, gave birth to the genre in the early 1960s. Jobim’s roots were planted firmly in the works of Pixinguinha, a legendary musician and composer who, in the 1930s, began the development of modern Brazilian music. He was also influenced by the music of French composer Claude Debussy and by jazz.
album: Finest Hour
album: The Girl From Ipanema (A Retrospective)
album: Indito
album: Finest Hour
album: Jazz ’Round Midnight
album: Verve Jazz Masters 13
album: Antonio Carlos Jobim em Minas ao Vivo Piano e Voz
album: Terra Brasilis
album: The Essential Antonio Carlos Jobim
album: Wave
album: Sun Sea And Sand - Favourites
album: Stone Flower
...

Table 2.2: Description retrieved from LastFM

Tom Jobim) that can be found in DBPedia9. Table 2.2 presents a description of the same artist as represented in LastFM10, a platform for the promotion of music events and broadcasts. The DBPedia description contains 117 attributes; for reasons of space we selected a subset of them. The LastFM description presented is complete with respect to what we could find at the moment of writing this work.

Looking at the descriptions, it is possible to conclude that they refer to the same person. Although the sets of information only partially overlap, for a human being this task is not particularly complicated. In fact, both descriptions present information about the date and place of birth, the date and place of death, keywords describing the domain, profession and occupation. However, as can be seen, the description collected from DBPedia in table 2.1 presents all attribute names according to different ontologies, whereas the description from LastFM in table 2.2 structures data according to an XML schema. Therefore, the attribute names presented in the table are the names of the elements and attributes used in the schema. Clearly, the semantics of some attributes can be interpreted in a similar way.

9http://dbpedia.org/resource/Antonio_Carlos_Jobim
10http://www.last.fm/music/Ant%C3%B4nio+Carlos+Jobim

2.2 Examples of Structural Heterogeneity

Furthermore, the two sources present structurally heterogeneous descriptions. As a matter of fact, in the LastFM description most of the information necessary to take a matching decision is expressed in natural language within the bio summary attribute; in particular, the date of birth and the date of death, which are intuitively important for the matching decision. Conversely, the description contained in DBPedia presents detailed attributes carrying these important pieces of information. It is important to notice how, simply by analyzing the name attributes in both descriptions, we can find a certain degree of semantic and structural heterogeneity. The description from DBPedia presents several attributes containing the name of the artist:

foaf:givenName: Antônio Carlos Brasileiro de Almeida Jobim;
foaf:givenName: Antônio Carlos;
dbpedia:name: Tom Jobim;
foaf:surname: Jobim

Whereas the description from LastFM contains just one attribute carrying the name:

name: Antônio Carlos Jobim

Surprisingly, even if the semantics of the attributes are harmonized, none of the attribute values compare as equal. Looking more carefully, it is possible to notice that the attributes composing the DBPedia description, namely givenName and surname, precisely describe sub-parts of the name attribute. Exploiting this syntactic and structural knowledge about these attributes, it is possible in principle to derive another name attribute for the description in DBPedia:

Antônio Carlos + Jobim = Antônio Carlos Jobim
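A minimal sketch of this kind of structural harmonization is the following, where the composite name attribute is derived by concatenating the finer-grained parts; the ordering of the parts is an assumption made for the example.

# Sketch: derive a composite 'name' attribute for the DBpedia-style description
# by concatenating its finer-grained parts, so that it can be compared with the
# single 'name' attribute of the LastFM-style description.
dbpedia = {"givenName": "Antônio Carlos", "surname": "Jobim"}
lastfm = {"name": "Antônio Carlos Jobim"}

derived_name = f"{dbpedia['givenName']} {dbpedia['surname']}"   # assumed part order
print(derived_name == lastfm["name"])   # True: the composed value now matches exactly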

In this work, we will focus more on easing issues related to this last type of structural heterogeneity, rather than on Natural Language Processing aimed at analyzing textual descriptions to extract possible features embedded in them. So far, through the descriptions presented in tables 2.1 and 2.2, we highlighted instances of problems related to semantic and structural heterogeneity. However, in the definition of the problem in the previous section we also mentioned the possibility of finding inconsistencies. Consider for example the attributes:

foaf:givenName: Antônio Carlos Brasileiro de Almeida Jobim;
foaf:givenName: Antônio Carlos

If we rely on the semantics of the attribute as defined in the FOAF ontology11, the first instance of the attribute givenName is incorrect, as it contains the whole name of the person.

name: Ferrero, Martin;
name: Martin Ferrero;
subject: Category:Miami Vice;
birthPlace: Brockport, New York;
placeOfBirth: United States;
subject: Category:American film actors;
abstract: Ferrero joined the California Actors Theater in Los Gatos, California. In 1979, he moved to Los Angeles and began to act in Hollywood. He is widely remembered for his role as the ill-fated lawyer Donald Gennaro in Jurassic Park (1993). He was a regular on the 1980s TV series Miami Vice for playing two roles during its run on NBC, ...;
birthDate: 1947-09-29;
placeOfBirth: Brockport, New York;
birthYear: 1947;
birthPlace: United States;
birthPlace: U.S.;
label: Martin Ferrero;
dateOfBirth: 1947-09-29;
givenName: Martin;
surname: Ferrero;

Table 2.3: Examples of Inconsistent (but owl:sameAs) descriptions 1

last name: Ferrero;
birthdate: July 13, 1947;
full name: Martin Ferrero;
tag: actor;
domain: cinema;
short description: Martin Ferrero (born July 13 1947) is an American stage and film actor. Ferrero joined the California Actors Theater in Los Gatos, California. In 1979, he moved to Los Angeles and began to act in Hollywood. He is widely remembered for his role as the ill-fated lawyer Donald Gennaro in Jurassic Park (1993), but he has also had other significant roles. He was a regular on the 1980s TV series Miami Vice for playing two roles during its run on NBC, ...;
first name: Martin;

Table 2.4: Examples of Inconsistent (but owl:sameAs) descriptions 2

2.3 Examples of Inconsistent Descriptions

Analyzing the three descriptions in table 2.3 retrieved from DBPedia12, table 2.4 retrieved from Okkam13 and table 2.5 retrieved from Freebase14, it is possible to conclude that they are about the same actor, despite some inconsistencies. In fact, the description from Okkam contains a different date of birth for the actor Martin Ferrero. The descriptive paragraphs and other data about the participation in movies and TV series (e.g. Miami Vice) support positive matching decisions, but not the date of birth. This inconsistency is probably due to errors contained in the sources from which the descriptions were extracted. In fact, the three sources extracted part of the information by processing the InfoBox of Wikipedia15. It seems reasonable to assume that the description retrieved through Okkam was not refreshed recently, and thus lost possible updates represented in DBPedia and Freebase. A similar error can also be found in the PalZoo celebrity database16. Thereby, when implementing a knowledge-based solution, we have to take these aspects into consideration as well. Formally dealing with inconsistencies is quite complicated, and there exist branches of logic aimed at finding consistent ways to deal with possibly inconsistent knowledge.
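As a simple illustration of how a knowledge-based matcher might surface such conflicts, the sketch below compares the values of one feature across descriptions and flags disagreements; the attribute name and the very simple date normalization are assumptions made for the example.

# Sketch: flag a conflicting feature value across descriptions that are otherwise
# believed to refer to the same entity. The attribute name and the date
# normalization are assumptions for the example.
from datetime import datetime

def parse_date(value: str):
    """Try a couple of common date formats; return None if the value cannot be parsed."""
    for fmt in ("%Y-%m-%d", "%B %d, %Y"):
        try:
            return datetime.strptime(value, fmt).date()
        except ValueError:
            pass
    return None

def conflicting_birth_dates(descriptions: list) -> bool:
    dates = {parse_date(d["birth_date"]) for d in descriptions if "birth_date" in d}
    dates.discard(None)
    return len(dates) > 1     # more than one distinct value signals an inconsistency

dbpedia_like = {"birth_date": "1947-09-29"}
okkam_like = {"birth_date": "July 13, 1947"}
print(conflicting_birth_dates([dbpedia_like, okkam_like]))   # True: the sources disagree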

11http://xmlns.com/foaf/spec/#term_givenName The given name of some person. The givenName property is provided (alongside familyName) for use when describing parts of people’s names. Although these concepts do not capture the full range of personal naming styles found world-wide, they are commonly used and have some value.
12http://dbpedia.org/page/Martin_Ferrero
13http://www.okkam.org/eid-dc0d15ca-f887-46b8-a172-0901d0f44859
14http://freebase.com/view/en/martin_ferrero
15http://en.wikipedia.org/wiki/Wikipedia:Manual_of_Style/Infoboxes
16http://www.palzoo.net/Martin-Ferrero

name: Martin Ferrero;
gender: Male;
nationality: United States of America;
description: Martin Ferrero (born September 29, 1947) is an Am;
Gender: Male;
series: Miami Vice;
date of birth: 1947-09-29;
Place of birth: Brockport;
Country of nationality: United States of America;
Episode: Brother’s Keeper;
profession: Actor;
film: Stop! Or My Mom Will Shoot!;

Table 2.5: Examples of Inconsistent (but owl:sameAs) descriptions 3

Chapter 3

The State of the Art

This chapter presents a review of the most relevant literature concerned with the problem of entity matching. In particular, the literature review is presented considering two different contexts of application. The entity matching problem had its historical development in the context of databases and information systems. Over the years, efficient and effective solutions were defined to solve the problem in this context, as presented in section 3.1. However, considering the wide open context of the Web, some of the proposed solutions do not necessarily apply, as many of the assumptions are not valid. Hence, a review of the solutions proposed in this context is presented separately in section 3.2.

3.1 Entity Matching in Information Systems

The most recent survey about the topic found using traditional search engines such as Google Scholar1 is Elmagarmid, Ipeirotis, Verykios, 2007 [58]. Enterprises rely on information systems whose data, stored in databases, contain duplicates and in general suffer from low data quality due to misspelling errors or different input conventions. The authors present a brief introduction to the data preparation task, which includes parsing, data transformation and standardization. Parsing is about isolating individual data elements in the source files. Data transformation refers to conversions that can be applied to data to align data types, e.g. in legacy contexts. This task includes field renaming, so that fields from different sources can be compared in a uniform manner. Data standardization refers to the process of standardizing the information presented in certain fields, e.g. addresses, dates, times, etc. Prepared data are usually stored in tables and analyzed to decide which fields should be compared, e.g. it does not make sense to match an address with a name. The overall data processing is also known as ETL

1http://scholar.google.com/


(Extraction, Transformation, Loading) [96]. In [58], the authors of the survey divide existing methods into two main categories:

• Methods that try to learn how to match using machine learning and probabilistic techniques;

• Methods that rely on domain knowledge or generic distance metrics;

In the following sections, we aim to present a review of the literature considered relevant to entity matching, providing pros and cons of each approach. The section first provides an analysis of probabilistic methods, followed by an analysis of distance-based methods and rule-based solutions. The literature analysis does not aim to be complete, but to present an analysis of the more recent literature considered relevant. For a more complete analysis of existing works, please refer to the surveys [58] and [32].

3.1.1 Probabilistic Methods

The present section presents an analysis of the literature describing probabilistic solutions to the entity matching problem. The first part of the section provides a simple theoretical introduction, mentioning some historical works at the base of many existing solutions. The second part of the section briefly introduces the principles of different machine learning approaches and techniques, and finally an analysis of the most recent papers is presented. Newcombe et al. [76] and Fellegi and Sunter [60] were the first to recognize duplicate detection as a Bayesian inference problem. In particular, record linkage can be conceived as a classification problem based on the Bayes Decision Rule, optimized to minimize the error or, alternatively, the cost of the error. Bayes Decision Rule for Minimum Error: given a random record pair, the goal is to determine whether the pair can be classified as a match or a nonmatch. Naively, the main driver of the decision is whether the probability of being classified in one class is larger than the probability of being classified in the other. To minimize the error, a Bayesian test for minimum error based on the likelihood ratio of the classification is computed. Namely, the likelihood ratio of the classification is tested against a likelihood threshold estimated relying on the probability distributions of the datasets. It is proven that this type of classification is optimal in terms of error minimization [58]. Bayes Decision Rule for Minimum Cost: the minimization of the probability of error is not necessarily the best criterion for classification rules where misclassifications of match and nonmatch can have different consequences. Thereby, it is appropriate to assign a cost to each situation. Essentially, a classification error (misclassification) cost is combined with the probability estimation, and the classification is performed based on the minimum cost for error test. When misclassification costs are symmetrical2, the classification for minimum cost behaves exactly as the minimum error classification [58]. Decision with Reject Region: even in ideal scenarios, when the likelihood ratio is close to the threshold the cost of error is high. Thus, Fellegi and Sunter [60] proposed to add an extra reject class apart from match and nonmatch. This class represents those records that require a ’clerical review’ by experts. The most problematic aspect of this approach is to sort the thresholds determining the regions so that no region disappears [60] (a minimal sketch of this three-way decision rule is given after the list of conditions below). Naive Bayes method: the main drawback of the Bayesian classification approach, as of any other probabilistic method, is that it relies on knowing a priori the match and nonmatch probability distribution functions necessary to optimally estimate the likelihood threshold, which is rarely the case. A way to overcome this problem is to rely on a Naive Bayes approach to estimate the match and nonmatch distribution functions, assuming conditional independence of the random variables3. This way, the probabilities can be estimated using a set of pre-labeled samples (a.k.a. a training set). When the conditional independence assumption is not reasonable, it is possible to apply General Expectation Maximization algorithms [156] that work well under specific conditions for unsupervised classification [157]. In particular, these conditions are related to:

• match frequency above 5%;

• clear distinction between classes of samples;

• low number of typos;

• presence of redundant identifiers among the considered fields;

• also training-set based classifications perform well;
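The following sketch illustrates the three-way decision rule discussed above; the per-field conditional probabilities and the two thresholds are invented numbers, since in practice they must be estimated from the match and nonmatch distributions.

# Sketch of a Fellegi-Sunter style decision: the likelihood ratio of the observed
# field agreements is compared against an upper and a lower threshold, with the
# region in between sent to clerical review. All probabilities and thresholds are
# invented for illustration; in practice they are estimated from the match (M) and
# nonmatch (U) distributions.
import math

# P(field agrees | match) and P(field agrees | nonmatch) for each field.
M_PROB = {"surname": 0.95, "birth_date": 0.90, "city": 0.80}
U_PROB = {"surname": 0.01, "birth_date": 0.05, "city": 0.20}

UPPER, LOWER = 3.0, -3.0   # log-likelihood thresholds delimiting the reject region

def decide(agreements: dict) -> str:
    """agreements maps field name -> True/False (does the field pair agree?)."""
    score = 0.0
    for name, agrees in agreements.items():
        m, u = M_PROB[name], U_PROB[name]
        # Agreement adds evidence in favor of a match, disagreement against it.
        score += math.log(m / u) if agrees else math.log((1 - m) / (1 - u))
    if score >= UPPER:
        return "match"
    if score <= LOWER:
        return "nonmatch"
    return "clerical review"   # reject region

print(decide({"surname": True, "birth_date": True, "city": False}))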

An important factor affecting probabilistic methods is the absence of values in some fields (i.e. null values). A possible solution to this type of problem is the use of multi-dimensional models in which one of the dimensions is used to mark the presence of a field value in both records; see for example [21, 127]. Active learning techniques aim at easing the problem of creating a training set. An active learner picks subsets of unlabeled data that, when labeled, will give the highest information for classification [48].

2the cost of a nonmatch classified as match minus the cost of a match classified as match is equal to the cost of a match classified as nonmatch minus the cost of a nonmatch classified as nonmatch
3Conditional independence implies that the matching probabilities of different fields in the same record are independent.

Some authors proposed to seek samples lying in the ’reject region’ for the creation of the training set [131]. Namely, the identification of a ’reject region’ allows the labeling of the ambiguous cases, reducing the size of the training set. Alternative solutions relied on a committee of classifiers to spot the samples classified differently by different classifiers, which thus require expert review [143]. Essentially, with a few iterations over a dataset, the learner is capable of singling out the more interesting samples and learning about the peculiarities of a dataset. However, as mentioned in [67], these techniques may also negatively affect the quality of the match. Furthermore, in [67], Goiser and Christen propose a critical discussion of supervised probabilistic methods, pointing out the following issues:

• match rarity in large datasets can make the definition of an adequate training set cumbersome;

• matching comparison techniques allow the tuning of parameters such as thresholds. This affects the quality of the match and requires a high degree of knowledge about both the comparison technique and the data. Such knowledge is usually time expensive to acquire and seldom reusable;

• tuning of parameters can be learned, but the problem of training set generation remains;

• the adoption of active learning techniques to ease the generation of the training set can introduce bias, possibly affecting the quality of classification.
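The committee-based sample selection mentioned above (in the spirit of [143]) can be sketched as follows. The choice of decision trees as committee members, the bootstrap resampling, and the feature representation of record pairs are illustrative assumptions; labels are assumed to be 0/1 and inputs NumPy arrays.

```python
# Sketch of committee-based selection for active learning: several classifiers are
# trained on bootstrap resamples of the labeled pairs, and the unlabeled pairs on
# which they disagree most are sent to the expert for labeling.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def select_for_labeling(X_labeled, y_labeled, X_unlabeled, n_members=5, k=10):
    rng = np.random.default_rng(0)
    votes = []
    for _ in range(n_members):
        idx = rng.integers(0, len(X_labeled), len(X_labeled))    # bootstrap resample
        clf = DecisionTreeClassifier().fit(X_labeled[idx], y_labeled[idx])
        votes.append(clf.predict(X_unlabeled))
    votes = np.array(votes)                                      # shape: (members, pairs)
    match_rate = votes.mean(axis=0)                              # fraction voting 'match'
    disagreement = 0.5 - np.abs(match_rate - 0.5)                # maximal near 0.5
    return np.argsort(disagreement)[::-1][:k]                    # k most ambiguous pairs
```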

Thereby, Goiser and Christen in [67] propose to rely on unsupervised machine learning techniques combined with effective blocking techniques. The authors try to remove the criticized elements by applying adaptive 'on the fly' blocking techniques, and performed experiments analyzing which active learning technique works better. The results of the experiments are positive, even though the dataset was partially synthetic and referred mostly to people. The authors propose an experiment that is parameter free, to see whether it is possible to obtain results comparable with methods that use parameters. The authors rely on FEBRL, a biomedical record linkage system, for comparison, and on Weka for classification. The experiments show particularly good results on synthetic data using unsupervised machine learning techniques.

Doan, Lee and Han in [57] propose Profile-Based Object Matching (PROM), which exploits the fact that disjoint attributes are often correlated and can be used to improve matching accuracy. For example, if two tuples present exactly the same name and surname, but one has the salary attribute set to 100000$ and the other has the age attribute set to 6, there is evidence that the two tuples do not refer to the same real world entity. This sanity check is performed using modules that apply different profilers to the matching pairs. The authors propose the definition of hard profilers, which represent hard constraints. For example, the review year of a movie cannot precede the release year of the movie. These profilers can be manually defined or automatically learned, assuming the data are complete. The authors further propose the usage of soft profilers, which are likely to be respected, but not necessarily. Soft profilers are essentially classifiers: given a set of matching and non-matching pairs it is possible to learn such profilers. The authors explored both manually defined and automatically learned constraints. The idea proposed in the paper is excellent, but the evaluation based on a manually created dataset was not really convincing. Furthermore, the approach used for the definition of constraints does not appear to be general, even though the principle of exploiting domain knowledge to improve matching accuracy is correct. The manual definition of profilers is suitable for a closed domain, but can hardly be applied in an open, inconsistent context.

Ravikumar and Cohen in [127] introduce a novel approach to unsupervised learning techniques for record linkage. In particular, the authors propose a way to reduce the complexity of learning generative models by applying semantic and monotonicity constraints. Semantic constraints essentially move the decision whether the whole record matches to the 'matching fields' layer, a layer of latent variables that depends on the feature vector of each pair. This approach allows the definition of dependencies between the latent matching variables, rather than at the feature vector layer, which reduces the number of parameters to be learned automatically. However, the authors did not test the option where dependencies are manually established, and actually flattened the semantic interpretation of matching to the probability that all field pairs match in order to decide about a matching record. This does not assume that the single field-pair matches are error free; indeed, feature vector discretization and distance measures are still quite error prone.
However, the concept that a match should be defined on the basis of matching attributes is interesting.

Shen, Li and Doan in [133] describe a probabilistic solution to entity matching that exploits such constraints to improve matching accuracy. At the heart of the solution there is a generative model that takes the constraints into account and provides well-defined interpretations of them. Real world applications often have many semantic integrity constraints that can be exploited. Such constraints can either be learned or specified by a domain user. The authors propose to use relaxation labeling, previously used in many classification problems. This technique has the advantage of scaling easily to large datasets and of accommodating a wide range of domain constraints.

The first layer clusters mentions into groups and exploits constraints at the group level. Once this is done, the second layer exploits additional constraints at the level of individual matching mention pairs. Once automatic matching is performed, a user might want to examine the result and provide some feedback, which is modeled as temporary domain constraints before re-running the matching. The experiments are focused on one type of entity, but the solution approach can be generalized to many types of entities simultaneously. The proposed solution provides an optimal trade-off among human effort, application of domain knowledge, and exploitation of generative probabilistic models. However, the solution seems to be suitable in limited domains, and hardly applicable in an open context.

Zhao and Ram in [160] propose an analysis of the combination of multiple classifiers for the realization of entity identification tasks. The authors studied the cascade and stacking combination of classification techniques, and also evaluated the bootstrapping methods used to create the datasets. In particular, Bagging and Boosting approaches were evaluated. Bagging leads to better performance for unstable classification techniques, with rare degradation of performance. Boosting works better than bagging, but it is more sensitive to noisy datasets and thus works better with linear classifiers.

Singla and Domingos in [134] propose an approach to entity resolution based on Markov Logic, which is a sort of approximation of first order logic with a probabilistic graphical model. This approach enables the creation of potentially inconsistent knowledge bases while still supporting automatic inference over part of the knowledge base. The authors propose to model the entity resolution problem with a Markov Logic Network with formulas and weights. The most likely truth assignment is computed using MaxWalkSAT, and the conditional probabilities of query atoms given the evidence are calculated using Gibbs sampling. Equivalent and reverse equivalent relations are defined and, on this basis, it is possible to define some sort of 'logical regression' estimating whether two records are the same on the basis of words or n-grams (i.e. if two fields have the same n-gram/word then there is evidence that they might be about the same entity). The idea is interesting because it combines the expressiveness of first order logic with machine learning, adaptive, approaches in the definition of the weights used to decide the matching classification.

Rastogi, Dalvi and Garofalakis in [125] propose a framework for scaling entity matching solutions to large datasets. The authors propose to split the dataset into neighborhoods and use a message passing protocol to build a global solution. The authors relied on the Markov Logic entity matcher described in [134], providing a matching rule resulting from domain knowledge. In particular, the authors evaluated the scalability of their approach on a bibliographic dataset, relying on the soft constraint related to co-authorship. The proposed framework was proven to be very efficient when executed on a cluster of 30 servers. The main contribution of the authors is related to the scalability of the process, rather than the mere solution of the problem itself.

3.1.2 Distance-based Methods

A way to avoid the problems of machine learning/probabilistic techniques is to define distance metrics for descriptions that do not require any training. Distance-based approaches essentially consider each record from a syntactical perspective and try to compute the similarity of records according to one, or a combination of, distance metrics and weights [55]. The definition of a similarity threshold is then necessary to establish matching decisions. However, understanding what the best metric is and what the right weights and thresholds are is affected by the same issues as probabilistic and machine learning methods. Namely, a high degree of domain knowledge is required to select the proper metric and to tune the thresholds correctly. Another distance-based approach, proposed in [71], is based on ranked list merging. That is, every record is compared with the others considering only one field, and the best matches are ranked on top. Comparing all the fields in this way, we get a number of ranked lists to be merged, looking for the records with the minimum aggregate rank distance. One problem affecting any distance-based technique is that it must rely on a properly defined matching threshold. As previously mentioned, relying on training data would nullify the advantages of distance-based metrics, thus alternative approaches were pursued. In [40] the authors propose to rely on a clustering algorithm based on the assumption that records about similar entities usually have a very small distance, and that only a small number of records fits within this small distance. This way, a similarity threshold can be computed for each record, improving on methods relying on a predefined threshold.

Churches and Christen in [43] propose a protocol for minimal-knowledge n-gram comparison enabling record linkage between databases without explicitly disclosing any information. The work relies on hashing algorithms used to encrypt the values of attributes, which are sent to a third party responsible for computing the matching distance. The protocol described is quite complex, and shows how it is possible to perform n-gram distance-based record linkage without disclosing information. However, the matching process is based on assumptions about prior knowledge of which attributes have to be matched.
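To make the generic distance-based decision described at the beginning of this subsection concrete, the following is a minimal sketch in which per-field similarities are combined with manually chosen weights and compared against a threshold. The fields, weights, threshold and the character-level similarity used as a stand-in metric are all illustrative.

```python
# Sketch of a generic distance-based matching decision: per-field similarities are
# combined with manually chosen weights and compared against a threshold.
from difflib import SequenceMatcher

def field_similarity(a, b):
    """A cheap character-level similarity in [0, 1] (stand-in for any metric)."""
    if not a or not b:
        return 0.0
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def weighted_match(rec1, rec2, weights, threshold=0.8):
    total, score = 0.0, 0.0
    for field, w in weights.items():
        total += w
        score += w * field_similarity(rec1.get(field, ""), rec2.get(field, ""))
    return (score / total) >= threshold

weights = {"name": 0.5, "address": 0.3, "phone": 0.2}       # illustrative weights
r1 = {"name": "Paolo Bouquet", "address": "Via Sommarive 5, Trento", "phone": "0461"}
r2 = {"name": "P. Bouquet", "address": "Via Sommarive 5, Trento", "phone": "0461"}
print(weighted_match(r1, r2, weights))
```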
Bhattacharya and Getoor in [15] propose an approach to record linkage that takes into account the similarity of linked objects, without assuming that the linked objects have already been de-duplicated. In fact, the authors consider links among objects of the same type, so when two records are discovered to refer to the same individual it is possible to perform further inference iteratively. The iterative process produces more accurate results, decreasing the false positive rate, and allows more conservative string matching. The price for this improvement is an increased computational cost, as the matching algorithm has to face the matching of sets of sets and, furthermore, the recursive definition of duplicates adopted leads to an iterative algorithm. The paper focuses on the author resolution problem in the bibliographic repository context, where references to the same person appear very frequently and, furthermore, there is the problem of homonyms. The authors propose to make use of additional context information (e.g. co-authoring information) to understand whether two papers are written by the same author, but this comparison presupposes that the matching authors are already known. The problem of author resolution is thus likely to be an iterative process, as the identification of common authors will allow the identification of further potential co-references. The proposed algorithm starts by clustering references whose distance is negligible. Then, candidate similar cluster pairs are chosen and, iteratively, the algorithm evaluates the distance for the candidates, selects the closest pair according to the distance measure, merges the clusters and updates the attribute means. The paper presents an approach to record linkage aiming at considering contextual information to improve matching quality through an iterative algorithm that, on the basis of newly discovered matches, also updates the distance of the other matching candidates and attempts to compute further matches. Essentially, the matching is also performed on the basis of other information associated according to some target relation (e.g. co-author), but that information must be matched as well. So, when a new match is discovered among 'co-authors', for example, the distance between the other matching candidates must also be updated. The idea is good, but there is no explanation about how to isolate contextual information and how to weight its relevance with respect to the matching.

Bhattacharya and Getoor in [16] propose a different approach to entity resolution, considering relational information among records in the databases, thus defining a collective entity resolution method aiming at discovering co-occurrences jointly rather than in a pairwise fashion. The authors first distinguish between the 'identification' and 'disambiguation' problems, then rely on the concept of 'entity cluster' to implement a relational clustering algorithm for collective relational entity resolution. Subsequently, the authors evaluated the new proposal for entity resolution performing experiments on real world bibliographic datasets such as Citeseer, arXiv and Biobase. The idea that an iterative process and relations can improve precision is very interesting.

Chaudhuri, Sarma, Ganti and Kaushik in [39] propose to exploit aggregate context dependent constraints to accept or reject de-duplication steps produced by record textual similarity, in order to reduce the partitioning search space. Constraints are partitioning functions, where partitions are said to be satisfied when all tuples satisfy the constraints. The authors argue that the problem is semantically and computationally hard, so they propose a way to reduce it to a maximum satisfaction variant of the problem. The constraints considered are: constraints on individual tuples, de-duplication parameter constraints, pairwise positive and negative examples, and groupwise constraints. These constraints are not applied as hard constraints, but in a slightly relaxed fashion, so that constraint satisfaction can be expressed as a maximization problem based on a benefit function. The integration of constraints with textual similarity passes through the definition of a similarity graph based on syntactic similarity, such that edges are drawn between tuples if their similarity is above a certain threshold. Constraint verification is an NP-Hard problem in the size of the set of tuples considered, thus it is necessary to restrict the search space to avoid the complexity problem. The authors propose to group all the tuples in a unique partition, and then iteratively split the groups based on an increasing similarity threshold.
Thus, when an edge has similarity below a certain threshold it is removed, and this is repeated until the group of tuples is disconnected from the others, defining a space of valid groups. The 'frontier' groups (i.e. the subgraphs of the split tree, based on the threshold, that are just above the leaves) are considered the valid groups. The de-duplication problem thus becomes: among all the frontiers, find the one that maximizes the benefit function. The proposed solution is surely valid, but it can only be applied in a closed and controlled domain. In fact, most of the computational complexity the authors have to deal with is related to the de-duplication approach considering the whole dataset to compute the iterative partitioning. Further, it is not clear how the thresholds of the constraints should be defined.

Bhattacharya and Getoor in [17] introduce a new perspective on the problem of entity resolution at query time. The authors propose a query expansion model to cluster duplicates in a database based on collective entity resolution. The main idea behind collective entity resolution is that resolving related entities can help in resolving the primary one. In particular, the authors explore the co-author relation aiming at disambiguating authors in paper repositories. This relation-based matching allows enriching and mitigating errors due to 'attribute based' syntactic matching only. Nevertheless, the authors do not provide any analysis of the types of relations, and ambiguity is evaluated only relative to the dataset at hand. Furthermore, no analysis is performed on the ambiguity of queries, and a satisfactory answer is considered to be one that simply returns an answer set correctly partitioned according to the correct entities. Thus, the approach handles query-time de-duplication, but does not deal with identification/matching issues.

Benjelloun, Garcia-Molina, Menestrina, Su, Whang and Widom in [10] propose a generic approach for entity resolution, where matching and merging methods are treated as 'black boxes' and where rather important properties of the outcome of these methods are considered. The desirable properties of pairwise matching methods are idempotence, commutativity, associativity and representativity. The authors formally define the entity resolution operation, and further specify optimal properties of the methods used to merge records presenting information about the same entity. Finally, the authors present entity resolution algorithms satisfying the previously defined properties on different levels, and evaluated them performing a comparison between the Yahoo!Shopping and Yahoo!Travel datasets. The F-Swoosh algorithm relies on features to define a match between records; the concept recalls the "equational theory" defined in [79]. Although the authors do not explore a theory for the definition of features, this can be considered a useful framework to ground feature based matching, extending the concept of feature towards the more rigid and precise concept of fingerprint.

Whang, Benjelloun and Garcia-Molina in [153] propose a modified version of the entity resolution approach presented in [10]. The authors introduce the concept of 'negative rules' to remove potential inconsistencies in the databases after entity resolution is applied. Such inconsistencies might be introduced along pair-wise entity centric record matching and merging processes due to the defects of the adopted algorithms.
The negative rules essentially analyze records, or groups of records, and state whether these are consistent or not. This type of rule cannot be 'injected' into the matching and merging methods, and thus the rules must themselves be treated as black boxes with specific properties that make them suitable for improving the accuracy of the entity resolution process. The authors defined two approaches for the application of the negative rules, implemented them and evaluated them on the Yahoo!Travel dataset. The authors propose an early approach to solving inconsistencies in the data supported by a domain expert. Given the supervised nature of the process, the human workload required by the application of negative rules is a factor of evaluation. The principle of using negative rules to improve the precision of matching is not new, but the formalization proposed by the authors is convincing.
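The following is a minimal sketch of how negative rules of the kind discussed for [153] could be applied as black boxes over the clusters produced by entity resolution. The rule and the records are hypothetical, and a real system would split or revise flagged clusters rather than just report them.

```python
# Sketch of applying "negative rules" after entity resolution: any cluster of merged
# records that violates a rule is flagged for clerical review. The rule and the
# example records are hypothetical.

def no_conflicting_birth_years(cluster):
    years = {r["birth_year"] for r in cluster if r.get("birth_year") is not None}
    return len(years) <= 1            # a person cannot have two different birth years

def check_clusters(clusters, negative_rules):
    flagged = []
    for cluster in clusters:
        if not all(rule(cluster) for rule in negative_rules):
            flagged.append(cluster)   # inconsistent merge, needs review
    return flagged

clusters = [
    [{"name": "J. Smith", "birth_year": 1970}, {"name": "John Smith", "birth_year": 1970}],
    [{"name": "A. Rossi", "birth_year": 1980}, {"name": "Anna Rossi", "birth_year": 1985}],
]
print(check_clusters(clusters, [no_conflicting_birth_years]))  # flags the second cluster
```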

3.1.3 Rule-based Methods

An identity rule for a set of real world entities E can be logically defined in this way: ∀e1, e2 ∈ E, P(e1.A1, ..., e1.Am, e2.B1, ..., e2.Bn) → (e1 ≡ e2), where P is a conjunction of predicates on the attributes A1, ..., Am and B1, ..., Bn respectively. Furthermore, for each attribute of the conjunction predicate, e1.Ai ≈ e2.Bi. If it exists, a rule capturing an identifying attribute is of the form: ∀e1, e2 ∈ E, (e1.Ak = e2.Ak) → (e1 ≡ e2). On the opposite, distinctness rules are of the form: ∀e1, e2 ∈ E, P(e1.A1, ..., e1.Am, e2.B1, ..., e2.Bn) → (e1 ≢ e2). Ideally, the matching should be monotonic, in the sense that the addition of information must not change the matching classification. The most relevant works found on rule based methods for entity matching in information systems are [103] and [79], summarized here.

According to Lim, Srivastava, Prabhakar and Richardson in [103], entity matching is to determine the correspondence between object instances from more than one database. The authors propose the use of an extended key, which is the union of the keys from the relations to be matched, and its corresponding identity rule, to determine the equivalence between tuples from relations which may not share any common key. In a single database context, usually one object instance models one real world entity. This is not true across different databases, causing a breakdown of the information model [93]. The independent development of the databases results in databases capturing different parts of the real world. This makes it difficult, if not impossible, to provide integrated access to the databases. Logical heterogeneity can happen at schema level or instance level. The latter arises when schemas are semantically compatible but instances corresponding to the same real world entity are not identified and merged. Entity identification is the problem of identifying object instances from different databases which correspond to the same entity. In this paper the authors investigate the use of extra semantic information to partially automate the entity identification process. Two tuples from different relations are said to match if they model the same real world entity. The synonym problem arises because attributes in the two relations are not semantically equivalent (e.g. the same employee is given different employee numbers), while homonyms may also arise because the key of the relation is not a key in the integrated world (e.g. the name of an entity). The authors model the entity matching process as a three-valued function which takes a pair of tuples and returns TRUE if the pair refers to the same real world entity, FALSE if it does not, and UNKNOWN otherwise. Record pairs can be represented in a matching table and a negative matching table. The matching and negative matching tables must respect a uniqueness constraint (i.e. no tuple in either relation can be matched to more than one tuple in the other relation), and a consistency constraint (i.e. no tuple pair can appear in both the matching and the negative matching table). The matching function must respect a soundness constraint: each record pair declared to be matching (non-matching) indeed models the same (distinct) real world entity (entities); and a completeness constraint: the entity identification process returns a value of matching or not matching for all pairs of tuples. To differentiate the analyzed tuples, an attribute denoting the source is appended to each tuple (e.g. 'DB1'). This allows the definition of rules specific to particular databases. To achieve soundness, all information used for entity identification must be correct with respect to the integrated world. Rules defined by an administrator are used to define identity and distinctness. An identity rule for a set of real world entities is of the form: ∀e1, e2 ∈ E, P(e1.A1, ..., e1.Am, e2.B1, ..., e2.Bn) → (e1 ≡ e2), where P is a conjunction of predicates over the attributes. Furthermore, for each attribute of the conjunction predicate, e1.Ai = e2.Bi.
If it exists, a rule capturing the identifying attribute can be of the type: ∀e1, e2 ∈ E, (e1.Ak = e2.Ak) → (e1 ≡ e2). A distinctness rule for a set of real world entities is of the form: ∀e1, e2 ∈ E, P(e1.A1, ..., e1.Am, e2.B1, ..., e2.Bn) → (e1 ≢ e2). To guarantee completeness, it is required that enough information is available. This might mean that complete knowledge about the entities is required, and such an amount of information is often impossible to achieve. Furthermore, in order to guarantee the soundness of the entity identification process, the technique must be monotonic. An entity identification process is monotonic if, for every pair of tuples classified as matching or not matching, this classification remains the same when additional information is supplied. The authors propose a sound entity identification technique based on the concepts of Extended Key Equivalence and Instance Level Functional Dependencies. Extended Key: a minimal set of attributes needed to uniquely identify an instance in the integrated real world. The identity rule is then defined on the basis of the Extended Key constraint in order to guarantee that the tuples satisfying the matching condition are unique in their relations. In order to identify attributes that could correctly extend the set of keys of a tuple, the authors propose the introduction of Instance Level Functional Dependencies (ILFD). An ILFD is a semantic constraint on the real world entities that implies some conclusion: if an entity has certain values in certain attributes, then it also has some other property. For example, with some background knowledge, it should be possible to state that if a restaurant is specialized in 'spaghetti' then it offers 'italian' cuisine.

Hernandez and Stolfo in [79] deal with the instance identification problem known as "merge/purge". This problem is closely related to a multi-way join over a plurality of large database relations. Such strategies assume a total ordering over the domain of the join attributes (an index is thus easily computable) or a near perfect hash function that provides the means for inspecting a small partition of tuples when computing the join. Unfortunately, an open context cannot rely on all these features, because data supplied by various sources typically include identifiers or string data that are either different among different datasets or simply erroneous due to a variety of reasons (including typographical errors or fraudulent activities). To determine whether two records from two databases provide information about the same entity, a rule based knowledge base is used to implement an equational theory. The authors present a system that performs the merge/purge process, including a declarative rule language for specifying the equational theory, making it easier to experiment with and modify the criteria for equivalence. Alternative algorithms implemented for the fundamental merge process are comparatively evaluated, demonstrating that no single pass over the data using one particular scheme as a key performs as well as computing the transitive closure over several independent runs, each using a different key for ordering the data. The moral is that several cheap passes over the data produce more accurate results than one expensive pass over the data. The inference that two data items represent the same domain entity may depend on a considerable amount of statistical, logical, and empirical knowledge. A wrong entity identification can be worse than missing some matching data.
The accuracy of the result is very relevant. The authors propose the sorted-neighborhood algorithm to discover matching records. One obvious way to bring clusters of similar records close together (a complete pairwise comparison is not feasible in large datasets) is to sort the records over important discriminating key attributes. The effectiveness of this sorted neighborhood method relies on the quality of the keys chosen for sorting. The algorithm consists of three steps: create keys, sort the data, and merge the data performing matching evaluation within a shifting window. Notice that the extraction of keys might be an expensive task. To improve efficiency, the authors propose to perform a clustering of the data first and to apply the sorted neighborhood algorithm only within clusters. However, real world data might not be uniformly distributed, thus clusters might have different sizes, and either be very small or very big. To establish a match, it is necessary to define an equational theory that dictates the logic of domain equivalence, not simply value or string equivalence. For this reason, the authors defined a declarative rule language requiring inference over large datasets. The effectiveness of the sorted neighborhood depends on the key selected to sort the records. A key is defined to be a sequence of subsets of attributes, or substrings within the attributes, chosen from a record. In general, no single key will be sufficient to catch all the matching records. Attributes that appear first in the key have a higher priority than those that appear after, and if errors happen to be in an important part of the key then there is little chance of performing a correct match. To compensate for this fragility of single-key matching, the authors propose a multi-pass strategy, running the sorted neighborhood algorithm several times using a different key each time. This approach reduces the effect of errors in the data, as it is unlikely that all fields will contain errors in key attributes. The authors performed experiments to achieve improvements in both computational time and precision. The conclusion is that clustering does not improve accuracy and yields only modest improvements in computation time. The multi-pass approach, applying several keys and then computing the transitive closure, produces the best accuracy and decreases the number of false positives. The approach to identification proposed by the authors seems more suitable for robust entity matching, as it would rely on some

kind of rigid inference embedded in a background knowledge base, guaranteeing a precise match when the equivalence rules are satisfied. The main drawback of this approach is that precise rules are hard to define, and usually require a considerable amount of human effort.

Ganesh, Srivastava and Richardson in [61] propose an attribute value distance based approach for learning identification rules in a database context. Basically, a training set is provided, allowing a learner to associate distances between values with matching results. The learning process produces a decision tree that presents the identification rules at its leaves. An evaluation on a thousand records showed good accuracy. Wang, Li, Yu and Feng in [149] propose a method for learning which similarity metrics work better for discovering duplicates in a dataset. The authors propose an efficient optimization method capable of learning which similarity metrics work better, reducing the search space for the optimal threshold to a problem of redundancy removal. The authors evaluated the proposed solution on the bibliographic and restaurant datasets, with results comparable to optimal classifiers such as SVM. Chen, Jin, Zhang and Zhou in [41] propose a method for learning matching rules and relative thresholds, relying on machine learning techniques to cluster groups of attributes and on greedy maximization algorithms to optimize matching thresholds. The authors experimentally evaluated the approach on bibliographic reference and restaurant datasets. The proposed method produced satisfying results on the analyzed datasets.
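A minimal sketch of the multi-pass sorted-neighborhood method of Hernandez and Stolfo [79] described above is given below: records are sorted by a blocking key, compared only within a sliding window, candidate pairs from several passes are unioned, and the transitive closure of the resulting matches is computed. The key functions, window size and pairwise matcher are placeholders to be supplied by the user.

```python
# Sketch of the multi-pass sorted-neighborhood method with transitive closure.

def sorted_neighborhood(records, key_fn, window=5):
    ordered = sorted(range(len(records)), key=lambda i: key_fn(records[i]))
    pairs = set()
    for pos in range(len(ordered)):
        for other in ordered[pos + 1: pos + window]:      # compare within the window only
            pairs.add(tuple(sorted((ordered[pos], other))))
    return pairs

def multi_pass(records, key_fns, matcher, window=5):
    candidates = set()
    for key_fn in key_fns:                                # one cheap pass per sorting key
        candidates |= sorted_neighborhood(records, key_fn, window)
    matches = {p for p in candidates if matcher(records[p[0]], records[p[1]])}
    # Transitive closure of the matched pairs via a simple union-find structure.
    parent = list(range(len(records)))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    for i, j in matches:
        parent[find(i)] = find(j)
    clusters = {}
    for i in range(len(records)):
        clusters.setdefault(find(i), []).append(i)
    return list(clusters.values())
```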

3.2 Entity Matching in the (Semantic) Web

An increasing amount of semantically structured documents is produced in the context of Linked Data development. Along this process, the tendency is to deliberately allow the proliferation of identifiers for entities, relying on the assumption that with time conventions will emerge. The result is that entity matching is no longer a local pairwise problem, but a large scale distributed and uncertain data management problem [50]. For example, the proliferation and mixture of standards for the definition of personal information (FOAF5, hCard, DBLP, etc.) created a mixture of diverse machine readable profiles. The integration of such information would be very useful, but unfortunately this process is not easy, due to the fact that profiles rely on different identifiers and other sources of fuzziness, causing a breakdown of the information model similar to the one analyzed by Kent in the database context [93]. The entity matching problem in this context is known as object consolidation, and several techniques have recently been developed.

5 Friend Of A Friend, http://www.foaf-project.org

Tejada, Knoblock and Minton in [143] propose an active learning method based on a committee of classifiers to select problematic samples to be labeled by a domain expert. The training set so formed is then used to learn object matching rules to be applied to semi-structured data available on the Web. The idea of relying on a committee of classifiers to learn effective rules is brilliant, but the learned classification rules were conceived to match pairs of datasets in a fuzzy but known and controlled context (i.e. restaurants in Los Angeles), and thus the robustness of the method applied as a general solution has to be further studied.

Michalowski, Thakkar and Knoblock in [108] present a potentially very useful approach to the object consolidation process. Indeed, the authors propose to rely on secondary sources for disambiguating uncertain matches provided by a classifier. The authors do not explore deeply the implications of such work, presenting solutions that sound more ad-hoc than general. Surely, the principle of relying on domain knowledge and external resources to decide about object consolidation opens many opportunities compared with traditional closed object consolidation systems. In fact, machine learning classifiers for object consolidation work only as well as the training set given to them. The more critical cases need to be treated in a more compelling manner, and opening the matching process to consider secondary sources could help in providing better match/non-match decisions.

Minton, Nanjo, Knoblock, Michalowski and Michelson in [112] propose a record linkage approach based on a combination of an 'expert system' and machine learning. Essentially, the matching is performed by defining a set of transformation rules that allow establishing a match at the string level. These transformation rules are ranked in terms of "relevance" with respect to the matching problem. A set of global transformations is also defined, labeling the fields in terms of frequency or semantic annotation (this aspect is only marginally treated). Furthermore, given a training set, the average probability that a certain transformation is applied to a record field is computed. The combination of the transformations applied to each field builds a transformation graph, which defines how one record can be transformed to be the same as another. This transformation graph is then used to compute a similarity measure (distance) to classify the match/non-match of a record pair. The training set highlights the combinations of transformations (transformation graphs) that are more likely to lead to a match or a non-match, and this probability is computed, normalized and used as a similarity score. The idea goes beyond traditional string matching, but is still affected by semantic issues due to the fact that the semantics of fields is neglected. Namely, John Smith and Smith John can be the same person or not, depending on which one is the name and which the surname of the person. From a string matching perspective these can be perfectly the same, but it does not mean that they are. However, the solution proposed effectively tackles the problem of matching descriptions presenting heterogeneous representations of values for the same field.

Bekkerman and McCallum in [9] propose a framework for disambiguating appearances of people on the Web considering specific social networks.
The authors describe a method exploiting the specific link structure of Web pages to define a sophisticated agglomerative/conglomerative clustering algorithm that distinguishes web pages presenting information about an entity from web pages presenting references to homonyms. The work presented by the authors can be described as a generic method for performing unsupervised entity matching on the Web given a social network context (e.g. a mailing list). The solution proposed by the authors is promising, but cannot be applied as a general solution, as it would need a global social network as a source for disambiguation.

Hogan et al. in [82] propose a scalable method for object consolidation based on a sort of ad-hoc reasoner implemented relying on efficient data structures such as inverted indexes. The proposed algorithm processes all sets of quadruples (RDF triple (s,p,o) + context, i.e. source), discovering equivalences based on the "inverse functional" meta-property of the predicates p used in the RDF triples. If the objects o of such predicates p are the same, then the subjects s are stated to be equivalent. Once the equivalence between subjects s is determined, instances are also consolidated relying on the consolidated identifiers. The paper presents a linear solution to the problem, but does not explore the possibility that errors are actually present, and thus that the values of inverse functional properties might not really be inverse functional, or that diverse values could be used for the same property referring to the same object. The solution appears to be scalable, but not optimal.

Wu et al. in [158] propose a method for entity matching based on the analysis of publicly available documents. In particular, the authors focus on the identification of persons in obituaries, relying on a set of attributes considered sufficient to identify a person. This is a specific case of entity resolution, where entity matching in databases is performed not for consolidation, but to remove the records of decedents from diverse databases (e.g. phone books). The proposed method consists in creating lists of candidate identities based on a partial set of identification attributes, and then resolving identity by searching for equivalences among these lists considering the complete set of identification attributes. High quality automatic data integration is a topic that is receiving increased attention due to work on the Semantic Web and on more general-purpose web information systems.
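The inverse-functional-property consolidation described above for [82] can be sketched as follows; the triples and the list of properties assumed to be inverse functional are illustrative, and a complete implementation would also take the transitive closure of the resulting groups.

```python
# Sketch of consolidation via inverse functional properties (IFPs): if two subjects
# share the same value for a property assumed inverse functional, they are placed
# in the same equivalence group. Triples and the IFP list are illustrative.
from collections import defaultdict

IFPS = {"foaf:mbox", "foaf:homepage"}   # properties assumed to be inverse functional

triples = [
    ("ex:alice1", "foaf:mbox", "mailto:alice@example.org"),
    ("ex:alice2", "foaf:mbox", "mailto:alice@example.org"),
    ("ex:alice2", "foaf:name", "Alice"),
    ("ex:bob",    "foaf:mbox", "mailto:bob@example.org"),
]

def consolidate(triples, ifps):
    by_value = defaultdict(set)
    for s, p, o in triples:
        if p in ifps:
            by_value[(p, o)].add(s)     # same (IFP, value) => same real-world entity
    equivalence = defaultdict(set)
    for subjects in by_value.values():
        for s in subjects:
            equivalence[s] |= subjects  # note: not transitively closed across groups
    return equivalence

print(consolidate(triples, IFPS)["ex:alice1"])   # {'ex:alice1', 'ex:alice2'}
```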

Ioannou, Niederee and Nejdl in [86] propose a probabilistic method for entity linkage in heterogeneous information spaces. In particular, the authors propose to solve the entity matching problem relying on a Bayesian network to model the properties of entities and their interdependencies. According to the authors, this allows better accommodating information changes in the information space. A network model is built incrementally, linking entities through relations that can be direct (namely, the entities' descriptions contain the same object) or deduced (namely, the result of a deductive process due to the combination of other attributes). The model is then used to compute the probability that two entities actually match. The matching decision is then taken comparing the probability with a threshold. The proposed method is robust and founded on well established mathematical tools. However, in a global space, the problem of establishing the probability threshold for the matching decision persists.

Castano et al. in [37] propose HMatch(I), an ontology instance matching approach based on the recursive identification of the relevant individuals composing a novel data instance in an ontology evolution context. Essentially, every ontology instance in an ABox is represented as a tree of properties and values. This tree is then used for comparison, exploring similarity for nodes where individuals (representing other instances) have properties. Properties are weighted differently to discern the ones that are relevant for identifying entities. Considering descriptions as wholes (i.e. trees) can help in providing precise matching of instances, as all the properties available in the knowledge base are considered. However, the definition of weights for identifying attributes is considered only superficially.

Mauroux et al. in [50] present IdMesh, a system based on a distributed probabilistic graph defining an overlay above declarative links between entities (web objects) and their referents. The main goal is to analyze such a graph to understand how reliably an identifier (URI) can be used to refer to a real world entity. Indeed, in the Semantic Web there is a proliferation of identifiers and connections between them that makes it hard to understand how to use them.

Voltz et al. in [147] propose the Silk Linking Framework, a platform for the definition and maintenance of links between datasets in the Web of Data. Silk offers a discovery engine, a link evaluation tool, and a protocol for link maintenance. The link specification language, together with the set of built-in similarity measures, allows the definition of an equational theory enabling the discovery engine to define links between datasets. The main drawback is that the framework is totally supervised and assumes knowledge about the schemas of the target datasets for the definition of linkage rules. In Silk, linkage rules are conceived as single, or combinations of, attribute matching scores, where the matching threshold is manually defined by the user.

Glaser et al. in [66] describe a system to manage co-references in a centralized repository, which presents several advantages compared to the usage of owl:sameAs statements as part of the Linked Data. The approach has several practical positive effects and seems to be scalable in principle. The authors do not address the problem of discovering co-references, but deal just with their management.

Hogan et al. in [81] propose a method for consolidating entities in the context of Linked Data. The method consists in crawling the Linked Data to gather a large amount of information in the form of RDF triples. Such triples are then processed to infer statistical properties of predicates and their values. In particular, the method attempts to infer whether certain properties are inverse-functional or functional on the basis of their 'usage' in the dataset. The authors propose several refinements to a very naive approach, producing promising preliminary results. Basically, the cardinality of subject-property and property-object pairs is used to estimate the (inverse) functionality of properties, and these estimates are then used to produce an aggregate consolidation confidence. Furthermore, the authors outline a potentially scalable system. The attempt to estimate meta-properties of values relying on statistical methods compensates for the unreliable usage of ontologies in the Linked Data context. However, no matter how scalable the system is, the conceived solution is limited to the processable samples.

Kopke and Rahm in [99] propose a framework for Self Tuning Entity Matching (STEM). STEM is a framework for entity matching providing libraries of classifiers, blocking methods and similarity metrics. The framework takes datasets in input, and provides two methods for selecting samples to be labeled by an expert. The combination of different classifiers, similarity metrics and blocking systems defines a matching strategy (a.k.a. matching process), which includes the possibility of using multiple classifiers and taking majority vote decisions, etc. The authors then evaluated the two methods for sample selection, showing how an equal proportion of matching/non-matching samples in a dataset provides better classification.

Zaho and Rahm in [161] propose an entity matching method based on the constrained cascade application of decision tree classifiers to reduce the bias of the simple usage of the classifier. Essentially, the cascade application of decision trees (or other classification methods) forces every branching decision to be taken considering a multivariate test that takes into account the features learned by the cascading classifiers. The application of cascading classifiers has to be constrained by some criteria. Accuracy constraints seem to be prone to over-fitting issues, whereas constraints maximizing the depth of the learned tree seem to provide the best solution to capture the complexity of the data. This approach reduces the bias of a single classifier, outperforming its classification capability. The authors evaluated the proposed solutions manually choosing similarity metrics for each of the attribute types and in an environment with a controlled vocabulary.

Stoermer, Rassadko and Vaidya in [139] propose a novel approach for entity resolution that combines probabilistic and ontological methods. This work explores the approach presented in a previous paper [138].
The authors defined an ontology presenting six top level entity types, each of which is tied to two types of features: (1) generic features (name, type) that are type independent; (2) type-discriminative features that help to infer an entity type from its description [5]. These features were then associated with an estimation of their relevance along an identification process, and employed to perform general purpose entity matching. The relevance of the type-discriminative features was estimated by performing experiments involving human users [4].

Isele and Bizer in [88] propose GenLink, a solution to learn expressive entity matching rules relying on regression functions based on genetic programming. The proposed solution aggregates the selection of attributes to be compared, transformation functions, similarity metrics and relative thresholds to produce expressive linkage rules. The authors performed experiments evaluating results over OAEI datasets. The solution proposed by the authors performed better than the ones participating at OAEI and also compared favorably with existing genetic programming solutions applied to record linkage [52]. Regression functions based on genetic programming are exceptionally effective in learning how to classify data in a dataset. However, they are non-linear solutions that need to work in a controlled environment. Moreover, the authors did not evaluate whether rules learned on one dataset can be applied to others. This makes the solution optimal for pairwise dataset matching, but its outcome can hardly be used out-of-the-box for general matching purposes.

Niu, Rong, Wang and Yu in [118] provide a semi-supervised method for learning matching rules for entity matching in pairs of datasets. The proposed method combines statistical methods to mine property equivalences based on the values found in the datasets with what the authors call an Inverse Functional Property Suite (IFPS). An IFPS corresponds to a conjunction of properties providing altogether support for a positive matching decision. The approach takes in input a set of labeled samples used to extract IFPSs and property equivalences using statistical methods. Then, an iterative Expectation Maximization algorithm is run, mining rules and matches iteratively. The process is tuned to converge, and was tested aligning DBpedia with Geonames, Geospecies and LinkedMDB. The method proved to be precise, but suffered from a recall problem. This is probably due to the semantic and structural heterogeneity of the sources, which limits the property equivalence discovery and consequently the mining of the rules.

Wei, Jianfeng and Yuzhong in [84] propose a combination of statistical methods considering similarity with ontological properties to resolve object co-reference in the Semantic Web. The self-training approach consists in a process that starts from a kernel of 'owl:sameAs' statements, functional and inverse functional pairs of property/value, and max cardinality constraints. These are used to mine further equivalences that are then used to further train the system iteratively. The functionality and inverse functionality of pairs of property/value are estimated using statistical methods, similarly to what is proposed in [81]. The method was tested on the OAEI 2010 person datasets and on large datasets. The property matching values in the paper seem to be ad-hoc and entity type based. For example, the authors assume an association between 'rdf:label' and 'foaf:name'. This association is valid in some cases, but 'rdf:label' is a generic attribute, and its adoption as a placeholder for the name is valid just considering DBpedia. Furthermore, the kernel initialization relies on owl:sameAs statements, which are known to be often wrong [74] and which, if manually checked, correspond to the creation of a training set. The method is promising, but the feeling is that the evaluation is not representative of the real applicability of the method as a general solution.

Ngomo in [117] proposes a time effective method for link discovery integrated into the LIMES framework. The link discovery is based on the computation of a distance metric that is the result of the combination of the similarities between sets of pairs of attributes. Such a distance has to be compared with a threshold to take the matching decision. The author proposes a formal grammar to describe the matching process so that atomic operations can be dissected, allowing the application of the time-efficient algorithm PPJoin [159]. The evaluation of the approach focused on time performance, showing how the proposed method outperforms SILK [147]. However, the comparison was made under the supervised heuristic assumption that the combination of name and population is sufficient to discern cities in geographical data sources.

Rong, Niu, Xiang, Wang, Yang and Yu in [128] propose a distance metric for Linked Data instances, used then to feed binary classifiers to perform entity matching. In particular, the authors propose three different metrics exploiting a pre-labeling of the attribute values based on syntactic analysis. The authors evaluated the proposed metrics on the OAEI 2010 Instance Matching test dataset, showing improvements with respect to the tools participating in the contest. It is important to underline that the solution was nonetheless tested by training the classifiers on the evaluation dataset itself.

Sleeman and Finin in [135] propose a machine learning method for linking FOAF profiles. The authors consider the inverse functionality of properties of the FOAF ontology, and produce different types of matching distances for attributes (e.g. simple matching, partial matching, cross-property matching). The authors then generate a training set of 500 samples equally distributed over the match and non-match classes, and rely on an SVM classifier to take the matching decision. The authors further distinguish between easy cases and more complex cases in the evaluation. Intuitively, the complex cases should be used for training, so that the overall classification could improve.
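The classifier-based matching used by approaches such as [135] can be sketched as follows: each candidate pair is turned into a vector of per-property similarity features and an SVM decides match versus non-match. The properties, the similarity function, the tiny training set and the labels are illustrative placeholders, not the features or data used in the cited work.

```python
# Sketch of classifier-based pair matching: per-property similarity features feed
# an SVM that decides match (1) vs. non-match (0). Data and features are toy values.
import numpy as np
from difflib import SequenceMatcher
from sklearn.svm import SVC

def sim(a, b):
    a, b = str(a).lower(), str(b).lower()
    if not a or not b:
        return 0.0                       # missing values do not count as agreement
    return SequenceMatcher(None, a, b).ratio()

def features(p1, p2, props=("name", "mbox", "homepage")):
    return [sim(p1.get(k, ""), p2.get(k, "")) for k in props]

# Tiny illustrative training set: one matching and one non-matching profile pair.
X_train = np.array([
    features({"name": "Tim Berners-Lee", "mbox": "timbl@w3.org"},
             {"name": "T. Berners-Lee", "mbox": "timbl@w3.org"}),
    features({"name": "Tim Berners-Lee", "mbox": "timbl@w3.org"},
             {"name": "Tim Burton", "mbox": "tb@studio.example"}),
])
y_train = np.array([1, 0])

clf = SVC(kernel="linear").fit(X_train, y_train)
pair = features({"name": "Tim Berners-Lee", "mbox": "timbl@w3.org"},
                {"name": "Berners-Lee, Tim", "mbox": "timbl@w3.org"})
print(clf.predict([pair]))               # a match is expected with these toy values
```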
Notice that, similarly to information systems, the problem of information integration in Linked Data, known as object consolidation, is usually managed and solved in a centralized fashion, dealing, among others, with the problem of the high scalability required to handle the entity matching problem on the Web. Furthermore, in the context of the Semantic Web and Linked Data, when two entities are discovered to be the same, their identifiers are stated to be equal, materializing an owl:sameAs statement and causing several issues [74]. Among other problems, once the equivalence between identifiers is materialized, it becomes hard to evaluate its reliability. Indeed, the information used to establish the equivalence might change in time, potentially invalidating the stated equivalence. Another problem, analyzed in [50], is related to the possible inconsistencies generated by the definition of conflicting equivalences. Given the scale of the information space forming the Semantic Web, an automatic approach for the execution of object consolidation is the only viable solution. In this respect, a recent evaluation of existing methods was proposed in [100].

3.3 String Similarity Metrics

Duplicate detection typically relies on string comparison techniques. There are different types of techniques to perform string comparison. In the following we present a selection of character-based string similarity metrics:

Edit Distance (or Levenshtein) [102]: this metric refers to the minimum number of editing operations required to transform one string into another. This technique is good for typographical errors, but not for other types of mismatches. It is commonly used in databases and indexes as a cheap and effective string similarity metric.

Affine Gap Distance [151]: this technique allows computing distances where words are truncated, e.g. John R. Bool vs Johnathan Richard Bool. The technique is based on the concepts of open and extended gaps, where opening gaps are penalized with respect to extending gaps. This allows giving lower costs to missing parts of text with respect to the edit distance.

Smith-Waterman Distance [137]: this technique extends the Edit and Affine Gap distances so that mismatches at the beginning and the end of the string are weighted less than mismatches in the middle of the string. This way, strings like "prof. Paolo

Bouquet, University of Trento" and "Paolo Bouquet, prof." can match within a shorter distance.

Jaro distance [89]: it was conceived as a string comparison algorithm for first and last names. It computes the string lengths, finds the characters common to both strings (allowing for a small positional window), and counts as transpositions the common characters that appear in a different order. The score is the average of three ratios: common characters over the length of the first string, common characters over the length of the second string, and non-transposed common characters over the common characters.

Jaro-Winkler distance [155]: extends the Jaro distance by giving more weight to agreement in the first part of the matching strings. This follows the intuition that the beginnings of names are more relevant for the matching decision than the tails.

Q-Grams distance [98]: q-grams are short substrings of length q. The intuition behind the usage of these substrings is that the similarity of two strings can be computed based on the number of q-grams they have in common.

Needleman-Wunsch distance [116]: a string similarity metric conceived in the context of molecular biology to match protein sequences. Scores for matching characters are represented as a similarity matrix providing a score for each of the pairwise character similarities. One of the parameters of this distance metric is the gap penalty, which is used to penalize gaps of missing characters.
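The edit distance introduced at the beginning of this list can be computed with the standard dynamic programming recurrence, sketched below.

```python
# Standard dynamic-programming computation of the Levenshtein (edit) distance: the
# minimum number of single-character insertions, deletions and substitutions needed
# to turn one string into the other.

def levenshtein(s, t):
    prev = list(range(len(t) + 1))                 # distance from "" to t[:j]
    for i, cs in enumerate(s, start=1):
        curr = [i]                                 # distance from s[:i] to ""
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,           # deletion
                            curr[j - 1] + 1,       # insertion
                            prev[j - 1] + cost))   # substitution (or copy)
        prev = curr
    return prev[-1]

print(levenshtein("Bouquet", "Boquet"))   # 1: one deleted character
print(levenshtein("Theater", "Theatre"))  # 2: the transposed letters count as two edits
```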

Character based similarity metrics are essential to capture typographical errors; however, there are other sorts of problems that can happen when comparing attributes from different sources. One type of error is related to different conventions in representing the exact same values, for example the combination of name and surname, which can appear in different variants. For these reasons, different similarity metrics considering token level similarity were conceived. In the following we present a selection of them:

Euclidean distance [56, p. 94]: computes the geometrical distance between the tokens, or substrings, composing a string. The matching is computed considering each token as a dimension, and computing the distance as the square root of the sum of the squares of the distances between each of the dimensions. Notice that token based approaches assume exact equality in token comparison.

Jaccard distance [56, p. 293]: considers two strings as two sets of tokens, and computes the ratio between the intersection of the sets and the union of the sets. This measure does not take the order of the words into consideration.

Monge-Elkan [114]: this similarity metric considers atomic strings as strings separated by punctuation, and computes the score as the ratio between the number of matching atomic strings and the average number of atomic strings in the compared strings.

Overlap Coefficient: related to the Jaccard distance, but in this case the similarity metric is the ratio between the intersection of the sets of substrings composing each of the compared strings and the cardinality of the smaller of the two.

There is also a hybrid approach, combining token and character similarity, named TagLink. The concept underlying TagLink is rooted in the work of Cohen and Kautz [46], but was further developed in [33] and applied to bioinformatics data. Basically, a graphical model is built out of the strings, and the distance problem consists in optimizing the cost of the equivalence of each of the comparisons. The method is quite sophisticated and provides a middle ground between token-based and character-based approaches, as both can be modeled in the graph.

Bilenko et al. in [19] and [18] propose to learn adaptive similarity metrics for duplicate detection. Namely, rather than relying on heuristics that can be employed out of the box, the authors propose to learn, using support vector machines, how specific attribute types should be matched.
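As a small illustration of the token-based view, the following sketch computes the Jaccard measure described above over word tokens; the tokenization (whitespace splitting and lowercasing) is an illustrative choice.

```python
# Sketch of the Jaccard similarity over word tokens: the two strings are treated as
# sets of tokens and the score is |intersection| / |union|; word order is ignored.

def jaccard(s, t):
    a, b = set(s.lower().split()), set(t.lower().split())
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

print(jaccard("Paolo Bouquet", "Bouquet Paolo"))             # 1.0, order is ignored
print(jaccard("University of Trento", "Trento University"))  # 2/3
```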

For the sake of completeness, here we also present a small selection of phonetic similarity metrics. This type of metric relies on a transposition of words into a representation of their pronunciation or sound. The approach can be very useful to capture misspellings or syntactic variations of words (e.g. Theater vs Theatre). Phonetic metrics are language dependent.

Soundex [130]: this is the most common phonetic coding scheme, and it is based on the assignment of identical code digits to phonetically similar groups of consonants. This system works well for Caucasian names, and captures a large part of the relevant spelling variations with respect to the discriminative power of the full words. However, it worsens performance on Asian names, as their discriminative power relies on vowel sounds, which are ignored by Soundex.

New York State Identification and Intelligence System (NYSIIS) [141]: this technique applies a phonetic coding scheme considering vowels and replacing consonants with other similar letters. The coding results in a purely alphabetic string that showed to be slightly more precise than Soundex.

Oxford Name Compression Algorithm (OX-LINK) [65]: a two stage technique designed to remove defects of Soundex. Essentially, each string is first encoded according to something similar to NYSIIS and then Soundex is applied.

Metaphone and Double Metaphone [119]: an alternative to Soundex, due to Philips, that uses a larger set of consonant rules to better reflect many English and non-English sounds. Double Metaphone allows multiple encodings to support a large variety of pronunciations.

For further papers presenting alternative string similarity metrics, please refer to [58] and [47].
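A simplified sketch of the Soundex coding described above is given below; it omits the special treatment of 'h' and 'w' separators found in the full specification, so it should be read as an approximation of the scheme rather than a reference implementation.

```python
# Simplified Soundex: keep the first letter, map consonants to digits, collapse
# adjacent duplicates, drop vowel codes, and pad or truncate to four characters.

def soundex(name):
    codes = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
             **dict.fromkeys("dt", "3"), "l": "4", **dict.fromkeys("mn", "5"), "r": "6"}
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""
    digits = [codes.get(c, "0") for c in letters]    # vowels, h, w, y map to "0"
    collapsed = [digits[0]]
    for d in digits[1:]:
        if d != collapsed[-1]:                       # collapse adjacent identical codes
            collapsed.append(d)
    tail = [d for d in collapsed[1:] if d != "0"]    # drop vowel codes after collapsing
    return (letters[0].upper() + "".join(tail) + "000")[:4]

print(soundex("Smith"), soundex("Smyth"))    # S530 S530
print(soundex("Robert"), soundex("Rupert"))  # R163 R163
```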

Part II

Vision, Theories and Definitions

Chapter 4

The Knowledge-based Matching Vision

This chapter aims at outlining a knowledge-based method suitable to solve the entity matching problem in the open context of the Web. Solving the entity matching problem in the context of the Web is particularly challenging due to the semantic and structural heterogeneity characterizing data on the Web. We started conceiving the solution proposed in this work following the intuition that, when requested, people seem capable of taking relatively accurate matching decisions considering incomplete sets of information, relying on different types of knowledge to properly evaluate the relevance of the attributes matching (or not matching) in two descriptions. Capturing and using this "know-how", or matching knowledge, to build an explicit knowledge base seems to be a promising lead to define a matching approach suitable for a reliable solution of the open matching problem. It is important to underline the fact that any knowledge-based solution to the problem of entity matching would in principle be incomplete. In fact, to be complete, a knowledge-based solution would require capturing complete knowledge about any entity in the world, which is extremely complicated, if not impossible. However, recent dynamics related to the development of the Web 2.0 show that community efforts can produce relevant amounts of shared and quite reliable knowledge, easing the incompleteness problem. This work does not deal in depth with this problem/opportunity. However, knowing that in principle community efforts can help in scaling up the human effort and improve/extend the knowledge base is sufficient to justify the attempt of building such a solution, as it would be, in principle, sustainable. The proposed method consists in an advanced feature-based entity matching solution based on the seminal paper [138]. In particular, it extends the feature-based approach proposed in [139], considering groups of features that must match to establish a positive entity match (i.e. matching rules).


Figure 4.1: Fingerprint Match Vision

This approach is inspired by the methods adopted for fingerprint match analysis, and resembles the Inverse Functional Property Suite (IFPS) proposed in [118]. In fingerprint analysis, a matching decision is not taken based on one single feature of the fingerprint, but relying on combinations of biometric features, known as minutiae1. Each fingerprint is characterized by combinations of minutiae forming unique patterns [126]. If the pattern is found on two fingerprints, then a positive match decision is taken. In fingerprint analysis, the terms of comparison (i.e. features) used for fingerprint matching are clear and known a priori, allowing the verification of the evidence that led to a matching decision ex-post. These characteristics of the digital fingerprint matching process make its outcome reliable enough to identify people at a level that can be used in the investigation and establishment of facts, or evidence, in a court of law. In a like manner, this work proposes to approach entity matching relying on a knowledge base formed by an ontology explicitly declaring the types of entities considered and the feature types that must be considered along a matching process, and a set of positive or negative identification rules built on top of the ontology supporting and justifying any matching decision taken. We do not aim at defining a solution working on optimally matching entities in a closed environment; rather, we look at defining a solution that can be employed reliably out-of-the-box in an open context, and, for example, ease the problem related to the unreliability of 'owl:sameAs' statements in Linked Data [73].

1 Refer to http://en.wikipedia.org/wiki/Minutiae for an overview.

From now on, we refer to the ontology underlying the proposed solution as the identification ontology, which has the twofold role of defining the terms considered relevant along a description comparison task, likewise fingerprint match analysis (see Fig. 4.1), and providing a central point of reference for the harmonization of the semantic heterogeneity characterizing entities' descriptions on the Web. Indeed, once the terms of comparison are established a priori, it is possible to build and maintain, also with the support of automated tools for ontology and schema matching [59], a set of mappings between the known vocabularies and the identification ontology. A more in-depth formalization of the identification ontology is presented in section 5. The set of identification rules built on top of the identification ontology aims at forming a general-purpose equational theory, dictating the logic of matching as in the rule-based entity matching proposed in [103, 79]. A positive identification rule resembles the concept of 'extended key' [103], as it aims at representing a combination of attributes uniquely identifying an entity in the integrated world (i.e. the Web). A more in-depth formalization of the rules, and how we can obtain them, is presented in section 6.
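Purely as an illustration of the rule concept, and not of the formalism developed in section 6, a positive identification rule can be thought of as an 'extended key': a combination of identification-ontology features that must all match. The following Python sketch uses hypothetical feature names.

```python
# Illustrative only: a positive identification rule as an "extended key",
# i.e. a combination of features that must all match positively.
# Feature names (first_name, ...) are hypothetical examples.
PERSON_RULES = [
    {"first_name", "last_name", "birth_date"},
    {"last_name", "personal_email"},
]

def rule_satisfied(rule, positively_matched_features):
    """A rule fires when every feature it requires matched positively."""
    return rule <= positively_matched_features
```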

Figure 4.2: Knowledge-based Entity Matching Process

As analyzed in chapter 2, solving the entity matching problem in the context of the Web, we are likely to deal with descriptions that are underspecified when interpreted in the wide context of the Web. Hence, it becomes necessary to frame the solution under the Open World Assumption to accommodate cases where no reliable decision can be taken. Formally, the truth-value of a matching statement is independent of whether or not it is known to be correct by any single observer. From a practical point of view, the Open World Assumption implies that, for the cases where positive matching conditions are not satisfied, it is not possible to automatically derive a negative matching conclusion. A more in-depth analysis of the implications of the Open World Assumption on the application of entity matching rules is also presented in section 6. Once we have defined a sufficiently rich ontology presenting features supporting entity matching decisions, and a set of rules supporting reliable matching decisions, we need to define a process capable of exploiting such resources to produce matching decisions. As a matter of fact, the set of rules defining the matching theory underlying this method is defined in terms of the identification ontology. Thereby, a necessary condition for the application of the rules is to harmonize the semantics of the compared descriptions towards the identification ontology. As previously mentioned, we rely on the existence of a set of semantic mappings supporting this task. A more in-depth analysis and formalization of the definition of these mappings is presented in section 5.6. The most natural combination of all the different types of matching knowledge described so far is to put them in sequence. A graphical representation of the matching process is presented in figure 4.2. Namely, given a pair of descriptions A and B, these should first be harmonized based on the declared type, as described in section 7.1. A pair of descriptions with harmonized entity type AT and BT must then pass a further harmonization step. In fact, we now need to harmonize the semantics of the attributes contained in the descriptions towards those defined in the identification ontology. In a sense, we need to spot the features that we know are essential to take a matching decision, as described in section 7.2. Semantically harmonized descriptions AH and BH can now be compared, and a matching decision can be taken relying on matching rules. The process of matching features and application of rules is described in section 9.2.
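The sequence of figure 4.2 can be summarized by the following illustrative Python sketch; the mapping table and attribute names are hypothetical, exact value equality stands in for the feature matching of section 9.2, and the actual implementation described in later chapters is considerably richer.

```python
# Hypothetical mappings from source vocabularies to identification-ontology features.
MAPPINGS = {
    "foaf:givenName": "first_name",
    "foaf:familyName": "last_name",
    "dbo:birthDate": "birth_date",
}

def harmonize(description):
    """Attribute harmonization: keep only attributes mapped onto the identification ontology."""
    return {MAPPINGS[k]: v for k, v in description.items() if k in MAPPINGS}

def match(desc_a, desc_b, rules):
    """Compare two (type-harmonized) descriptions and apply positive identification rules."""
    a, b = harmonize(desc_a), harmonize(desc_b)
    # Exact equality used here for brevity; section 9.2 describes proper feature matching.
    matched = {f for f in a.keys() & b.keys() if a[f] == b[f]}
    if any(rule <= matched for rule in rules):
        return "match"
    # Open World Assumption: no rule fired does not imply a non-match.
    return "unknown"

print(match({"foaf:givenName": "Ada", "dbo:birthDate": "1815-12-10"},
            {"foaf:givenName": "Ada", "dbo:birthDate": "1815-12-10"},
            [{"first_name", "birth_date"}]))  # -> "match"
```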

Chapter 5

Identification Ontology

In this chapter we define the Identification Ontology endorsing the semantics of OWL 21 for the definition of classes and properties. Namely, in OWL 2 a vocabulary consists of a 7-tuple V = (VC, VOP, VDP, VI, VDT, VLT, VFA), where:

• VC is a set of Classes containing at least top and bottom concepts owl:Thing and owl:Nothing;

• VOP is a set of Object Properties containing at least owl:topObjectProperty and owl:bottomObjectProperty;

• VDP is a set of Datatype Properties containing at least the data properties owl:topDataProperty and owl:bottomDataProperty;

• VI is a set of Individuals (named and anonymous);

• VDT is a set of Datatypes of D (i.e. rdfs:Literal) and other datatypes NDT ;

• VLT is a set of Literals LV^^DT for each datatype DT ∈ NDT and each lexical form LV ∈ NLS(DT);

• VFA is a set of pairs (F, lt) for each constraining facet F, datatype DT ∈ NDT, and literal lt ∈ VLT such that (F, (LV, DT1)LS) ∈ NFS(DT), where LV is the lexical form of lt and DT1 is the datatype of lt.

The identification ontology will then present as classes the types of entities considered for the knowledge-based solution. Namely, every type considered will have to be declared as a Class. Following the notation of OWL 2, a class is declared as an entity identified by an Internationalized Resource Identifier (IRI)2. For example, the IRI a:Person is declared as a class in the following way: Declaration( Class( a:Person ) ).

1 http://www.w3.org/TR/2012/REC-owl2-direct-semantics-20121211/
2 http://www.ietf.org/rfc/rfc3987.txt
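For illustration only, such a declaration can also be produced programmatically. The following sketch uses the third-party rdflib Python library; the namespace IRI and the property name are hypothetical placeholders, not the actual identification ontology.

```python
from rdflib import Graph, Namespace, RDF, RDFS, OWL

A = Namespace("http://example.org/identification-ontology#")  # hypothetical IRI
g = Graph()

# Corresponds to Declaration( Class( a:Person ) ) in OWL 2 functional syntax.
g.add((A.Person, RDF.type, OWL.Class))

# A feature (datatype property) attached to the type, usable by matching rules.
g.add((A.birthDate, RDF.type, OWL.DatatypeProperty))
g.add((A.birthDate, RDFS.domain, A.Person))

print(g.serialize(format="turtle"))
```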

Thus, the set of types T considered in the knowledge-based solution consists of the classes contained in V: T = VC \ {owl:Thing, owl:Nothing}. A knowledge-based solution can hardly be complete with respect to the heterogeneity and extremely wide variety of possible representations of the world. Thereby, it is necessary to draw boundaries around the domain represented in the ontology. The Okkam Conceptual Model (OCM) described in [30] provides an optimal starting point to define the model underlying the knowledge-based solution proposed in this work. Indeed, besides modeling the relation between entities and identifiers, the OCM embeds a simple taxonomy of top-level, disjoint categories. These categories are defined with the goal of providing an initial set of disjoint types covering a large part of interesting entities, consistently with other top-level ontologies such as DOLCE3 [62], Yago4 [140] and OpenCyc5 [107], and are presented in [8, 7]. Quite intuitively, the types considered are: Person, Organization, Location, Event, Artifact Type and Artifact Instance, and a bulk category Other. The Okkam Conceptual Model was carefully designed, but no formal ontology method was ever applied to validate its soundness with respect to the rigidity of the classes in the taxonomy and the possibility of defining formal Identity Criteria for each of them. Therefore, in the following sections we present the results of the analysis of the OCM according to the OntoClean methodology [70], with the goal of validating the backbone taxonomy and removing inconsistencies, if any. The details of the types defined, and a precise contextualization of their interpretation, are presented in section 5.3 as the result of the formal analysis of the Okkam Conceptual Model described in sections 5.1 and 5.2. As mentioned in the introduction of this chapter, the identification ontology is conceived not only to describe what types of entities are considered, but also what features are considered relevant for the matching task for each of the considered types. Thereby, the identification ontology will present as OWL 2 Datatype properties and Object properties the features for each of the types. In particular, the set of features F considered consists in the union of the sets of Datatype and Object properties defined in the Identification Ontology IdO: F = VOP ∪ VDP. Furthermore, besides the properties associated to the considered types, we also declare specific meta-properties of these properties that can be useful to support matching purposes. As we will show in section 5.4, it may become necessary to extend the set of meta-properties for properties defined in OWL 2 in order to accommodate the level of expressiveness required. We are aware that OWL 2 and its dialects were conceived and designed to correct issues introduced in the formalization of the first Web Ontology Language, and that most of its changes are related to improving the computational tractability of reasoning tasks.

3 http://www.loa.istc.cnr.it/DOLCE.html
4 http://www.mpi-inf.mpg.de/yago-naga/yago/
5 http://www.cyc.com/platform/opencyc

However, as supporting scalable automatic reasoning is not the goal of the identification ontology, we believe we can take the license to introduce some modifications without exploring in depth the implications with respect to this task. The problem may be considered in future developments of the method, but it is not specifically addressed in this work. A detailed list of the features used for three of the considered types, together with the results of the ontological analysis, is presented in section 5.5. In section 5.1 a brief description of the current state of the OCM is presented. In section 5.2 an introduction to OntoClean and the description of the analysis of the taxonomy are presented. In section 5.2.2 we propose solutions to inconsistencies, and in section 5.3 the backbone of the Identification Ontology, as a result of an evolution of the Okkam Conceptual Model, is presented.

5.1 Defining a Conceptual Model

The way people defines and uses identifiers to refer to entities has been under (philo- sophical) investigation for a long time, among others [101, 105, 104, 68, 92]. The subtle nature of the problem led to no complete and shared solution, leaving the issue unsolved. At this regards, we are assisting to a phase of transition where criticism proposed by Kripke [101] to the widely adopted theory of description [129] are well received, but at the same time no alternative solution gained consensus. Recent devel- opments in the context of the Semantic Web (see [12]) are fostering further investigation about particular aspects and practical issues related to the problem of “Identity and Reference”, aiming to find solution to the problem known in this context as“identity crisis” [44, 94, 63, 64, 23]. The focus of the problem is in the way URIs 6, adopted as syntactical solution for the representation of “names”, should be used to refer to entities, and how naming and reference through URIs should fit the architecture of the Web. With respect to this, two main complementary approaches emerged: (1) Linked Data approach [13], the Okkam approach [28]. The Linked Data initiative, promoted by the W3C, endorses the approach support- ing the fact that names for entities are equivalent to description. Thereby, in this context the creation and definition of URIs must always be associated to a description of the entity identified. This solution has appreciable practical advantages as it fits the current architecture of the Internet. In fact, the supporters of the Linked Data

6Uniform Resource Identifiers, RFC3986 52 CHAPTER 5. IDENTIFICATION ONTOLOGY initiative suggest to adopt URL7 as identifiers for entities [20]. This solution has the advantage of defining globally unique identifier that are also de-referenceable into docu- ments containing a description for the entity [22]. This mechanism allows the definition of the Web of Data, relying on existing infrastructures and protocols defined for the Web. The main drawbacks of this approach are two: first, there is no lookup system to find and reuse names referring to entities and this has the side effect of causing a proliferation of names limiting effective data integration; second, the equivalence be- tween URIs managed through “OWL sameAs” statements is disputable and requires the computation of transitive closure along the Global Giant Graph [14]. Okkam was a large scale integration project co-founded by the European Commis- sion aiming at the definition and development of an Entity Name System (ENS) as backbone service handling the process of assigning and managing the lifecycle of glob- ally unique identifiers for entities in the WWW [28]. These identifiers are global, with the scope of persistently identifying entities across system boundaries. The ENS has a distributed repository for storing entity descriptions and their identifiers. An entity profile is essentially a relatively small amount of information on each entity identified. Clients can interact with the ENS and may inquire for entities’ identifier by provid- ing keywords about those entities. If the entity does not have and identifier in the ENS, a client can create a new one in which case the ENS returns the newly assigned identifier. The result of a consistent adoption of the ’OKKAM’ approach is that all resources presenting references to the same entity are assigned the same identifier. The Okkam approach can be seen as orthogonal and complementary with the Linked Data one. In order to promote the role of the ENS as means for smooth and frictionless data integration, entity profiles present Web of Data URIs as alternative identifiers for Okkam entities they co-refer to. More information about the ENS can be found in [27, 26]. To make explicit the role of the Entity Name System as a solution to the identity crisis, a conceptual model describing the domain of entities and their relation with URIs as their identifier on the Semantic Web was defined [30]. This model, named Okkam Conceptual Model (OCM8) was first defined to provide an explicit representation of domain in which the ENS was proposed as infrastructural element, and to support and justify technical solution implemented within the ENS. The conceptual model for Okkam is aimed at providing an explicit representation of “the world of entities and their identifiers”, extending and adapting the model proposed in [63, 64]. In particular the model describes how entities are represented and identified

7Uniform Resource Locators, RFC1738 8An OWL implementation of the OCM: http://models.okkam.org/ENS-meta-core-vocabulary.owl 5.1. DEFINING A CONCEPTUAL MODEL 53

Figure 5.1: The taxonomy backbone of the Okkam Conceptual Model in the context of the (Semantic) Web. In this context, the founding pillars are the concepts of Resources and URI. The definitions of Resources and URI (RFC 39869) are circular, and make the two things dependent from each other. Indeed, essentially the definitions say that “a Resource is anything that can be identified by a URI ” and “a URI is an identifier for a Resource”. These two types are then dissected into sub classes, defining more fine grained classes to represent specific types of Resources and URIs. A graphical view of the backbone taxonomy of the Okkam Conceptual Model is presented in figure 5.1. A graphical view of the whole conceptual model including relations between concepts (object properties) is presented in figure 5.2. An important part of the conceptual model is the definition of the taxonomy of types supported by the Entity Name System. Given the definition of OkkamEn- tity as a subclass of Non Web Resource, it was necessary to define several macro categories dividing the domain of entities treated in the Entity Name System for prac- tical reasons. In fact, such categories are used to support the user in providing a first very coarse-grained disambiguation about the entities searched/created. Further- more, knowing these general types for entities enables the possibility of defining specific ’matching methods’ given the typical set of information used to identify types of enti- ties. For example, knowing that an entity is of type Person allows the application of matching heuristic capable of finding more effectively the description matching a query. With this goal, the Okkam Conceptual Model defined a simple taxonomy presenting concepts as Location, Person, Organization, Event, Artifact Type and Ar-

9http://www.ietf.org/rfc/rfc3986.txt 54 CHAPTER 5. IDENTIFICATION ONTOLOGY

Figure 5.2: Okkam Conceptual Model: concepts and relations tifact Instance and a bulk concept named Other. It is important to notice that the categories represented in this model are common to most of the knowledge bases available online. In fact, this list is not the mere result of an intuitive enumeration, but is the result of an analysis of the most popular and accepted models in the community [7].

5.2 Formal Analysis of the Conceptual Model

OntoClean is a formal methodology aimed at guiding ontological investigation with the goal of defining valid taxonomic relations between concepts [68]. The methodology is based on ontological notions taken from philosophy such as Essence, Identity and Unity. These are used to define formal meta-properties characterizing concepts (i.e. properties). Such meta-properties are then used to define constraints for the definition of correct subsumption relations. In particular the notions considered by OntoClean are: • Rigidity (R): a property is rigid if it is essential to all its possible instances which cannot stop being instances of that property in all possible world they exists (e.g. being a human being). With respect to rigidity, a property can be rigid(+R), semi-rigid (-R) or anti-rigid (∼R). Semi-rigid properties are essential to some instances, but not to others (e.g. being hard). Anti-rigid properties are not essential at all for all their instances (e.g. being a student).

• Identity (I): this notion refers to the capability of recognizing individuals to be the

same. Namely, a property has an identity criterion when it is possible to understand whether entities satisfying that property can be compared to be identical. Identity criteria are used to determine equality between entities. There are two types of identity criteria, synchronic and diachronic. Synchronic identity criteria are those that allow recognizing entities at a specific time. Diachronic identity criteria are those that allow recognizing an entity along time. OntoClean proposes to characterize properties by analyzing whether they carry inherited (+I), supply (+O), or do not carry at all (-I) identity criteria. A property carries inherited identity criteria when it is possible to determine equality between entities on the basis of qualities that are inherited from more general properties in the taxonomy. A property supplies identity criteria when qualities specific to the entities satisfying this property can be used to determine equality between entities.

• Unity (U): this notion refers to the capability of recognizing the boundaries and the parts of an entity, such that it is possible to understand whether these entities exist as wholes. The unity criteria (UC) are determined by unifying relations that can be used to define a kind as a whole. Unifying relations can be used, for example, to distinguish topological wholes (e.g. a piece of stone), morphological wholes (e.g. a constellation) or functional wholes (e.g. a hammer). OntoClean proposes to characterize the properties with three meta-properties: unity (+U), no unity (-U) and anti-unity (∼U). A property has the unity meta-property when it has common unity criteria among all instances (e.g. a person). A property has the no-unity meta-property when it does not have uniform unity criteria among all instances (e.g. legal agent). Finally, a property has the anti-unity meta-property when it does not have any unity criteria at all (e.g. amount of water), and thus it is not possible to find a unifying relation defining a whole.

The notions above are then used to define taxonomic constraints that must be respected to create valid subsumption relations between concepts. Such constraints are:

1. anti-rigid properties cannot subsume rigid properties (+R ⊄ ∼R);

2. properties with identity criteria cannot subsume properties without identity criteria (-I ⊄ +I);

3. properties with unity criteria cannot subsume properties with no unity criteria (-U ⊄ +U);

4. properties with unity criteria cannot subsume properties with anti-unity criteria (∼U ⊄ +U).

The core part of the OntoClean methodology is:

1. to analyze the taxonomy and assign the aforementioned meta-properties to all the concepts of the taxonomy;

2. check whether taxonomic constraints are respected;

3. modify the taxonomy to respect the constraints;

For more information about OntoClean, refer to [68].
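As an informal illustration of how the taxonomic constraints above can be checked mechanically, the following Python sketch encodes a toy labeling (concept names and labels are placeholders, not the analysis of section 5.2.1) and reports the subsumptions that violate the four constraints.

```python
# Toy OntoClean-style labels: concept -> (rigidity, identity, unity).
LABELS = {
    "Resource":    ("+R", "+I", "-U"),
    "WebResource": ("~R", "+I", "+U"),
    "Student":     ("~R", "+I", "+U"),
    "Person":      ("+R", "+I", "+U"),
}
# (parent, child) subsumption pairs to validate; purely illustrative.
SUBSUMPTIONS = [("Resource", "WebResource"), ("Student", "Person")]

def violations(labels, subsumptions):
    errors = []
    for parent, child in subsumptions:
        pr, pi, pu = labels[parent]
        cr, ci, cu = labels[child]
        if pr == "~R" and cr == "+R":
            errors.append(f"{parent} (~R) cannot subsume rigid {child} (+R)")
        if pi == "+I" and ci == "-I":
            errors.append(f"{parent} (+I) cannot subsume {child} lacking identity (-I)")
        if pu == "+U" and cu == "-U":
            errors.append(f"{parent} (+U) cannot subsume {child} with no unity (-U)")
        if pu == "+U" and cu == "~U":
            errors.append(f"{parent} (+U) cannot subsume anti-unity {child} (~U)")
    return errors

print(violations(LABELS, SUBSUMPTIONS))  # flags anti-rigid Student subsuming rigid Person
```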

5.2.1 OntoClean Meta-properties Annotation

The first important step of the OntoClean methodology is to analyze the backbone taxonomy and label each property (or concept) with specific marks representing its evaluation with respect to Rigidity (R), Identity (I) and Unity (U). In the following, each paragraph presents a description summarizing the analysis leading to the assign- ment of each metaproperty.

Thing owl:Thing is the most basic concept of an ontology. Everything is necessarily a“Thing” and does not stop being a “Thing” along its existence, thus being a thing is a rigid property (+R). By definition a Thing does not have defined criteria of identity (-I) and unity (-U) properties.

Resource According to the standard definition10, a resource "is anything that can be identified by a URI". It is important to notice that this definition is circular with respect to what a URI is. However, if something can be identified, it is possible to define sufficient identity criteria of its own. Thereby, a resource supplies identity criteria (+O). The concept of Resource is so broad that it potentially involves any other concept, each with different unity criteria. For this reason, this concept has the no-unity property (-U). Intuitively, being a Resource is an essential property of all resources, so Resource has the rigidity property (+R).

Computational Object A Computational Object is the physical realization of an information object and something that might participate in a computational process that ensures the resolution of a URI. Computational object in its generic conception is a rigidity property (+R), has clear identity and unity criteria as each of them present finite physical realization that can be compared to determine equality (+O), and they

10Definition of URI in RFC3986, http://tools.ietf.org/html/rfc3986 5.2. FORMAL ANALYSIS OF THE CONCEPTUAL MODEL 57 can be considered wholes in the sequence of representing them on some type of physical support (i.e. magnetic hard-disk) (+U).

Non-web resource A Non-Web Resource is “the class of resources that are not computational objects”. This definition is quite broad, and essentially it includes any- thing but computational objects. A Non-Web Resource inherits identity criteria from Resource (+I). Not being a computational object is a rigid property of Non-Web Resource (+R) and does not present clear unity criteria (-U).

Web resource A Web Resource is “the class of computational objects accessible on the Web by dereferencing a URL.”. A web resource, as defined, inherits identity (+I) and unity properties as they can be defined as wholes in the sequence of bits composing their physical realization (+U). A web resource is a computational object that is accessible on the Internet, but this fact does not appear to be a rigid property of the object itself. For this reason, being a web resource does not seem to be a rigid property of a computational object and thus it has anti-rigidity property (∼R).

Okkam profile An Okkam Profile is “the set of information about an Okkam entity stored at the Okkam repository”. According to this definition, an Okkam profile is constituted by a set of information about an entity that might change along time. The main property of an Okkam Profile is to be about an Okkam Entity, and this can be seen as a rigid property of the Computational Objects that are Okkam Profiles. Indeed, being an Okkam Profile implies its existence within the Entity Name System and thereby, by definition, this Okkam Profile must be about an Okkam Entity. It seems reasonable to conclude that being an Okkam Profile is a rigid property (+R). An Okkam Profile has unity criteria (+U) as any Computational Object and inherits identity criteria (+I).

Okkam entity An Okkam Entity is “the class of resources that can be given an OkkamID. It includes only entities that have spatio-temporal or at least temporal prop- erties”. The property of being an Okkam Entity refers to all particulars and concrete entities, including events but excluding all abstract entities and concepts. Intuitively, being an Okkam Entity is a rigid property of all its instances (+R), inherits identity criteria from Resource (+I) and has no clear unity criteria as the varieties of possi- ble instances makes impossible to imagine common unifying relations defining wholes (-U). 58 CHAPTER 5. IDENTIFICATION ONTOLOGY

External reference An External Reference is “the class of web resources that give information about resources and is disjoint from the class of Web semantic resource”. According to the definition, an external reference is a Web Resource providing fur- ther information about a resource without being a Web Semantic Resource. Being an External Reference does not appear to be neither essential nor rigid property of a Web Resource, so the this property is anti-rigid (∼R). On the other side, External references inherits both unity (+U) and identity criteria (+I) from Web Resources.

Web Semantic Resource A Web Semantic Resource is “the class of web resources that are accessible by dereferencing a semantic URI which redirects to another WWW URL which gives access to the web resource”. Essentially, Web Semantic Resources are those Web Resources forming the Web of Data promoted by the Linked Data initiative. The property of being a Web Semantic Resource does not appear to be rigid with respect to the computational object realizing them. A Web Semantic Resource is a computational object representing a set of semantically annotated information that is accessible on the Web. Thus, as for Web Resources, also Web Semantic Resources are anti-rigid (∼R). Web Semantic Resource property inherits identity criteria from (+I) Web Resources, and present clear unity property as any computational object (+U).

Social Being A Social Being is defined as the class of “intelligent agents whose status as an agent is acknowledged within some social system, and who is capable of playing certain social roles within that system”. According to the definition, being a social being is anti rigid property (∼R) of as a “status acknowledged within a social system” can be interpreted as an agent having some role recognized by in a social system. Thus, assuming that we could remove a whole social system apart from one of its element, for example due to a terrible disease or a nuclear disaster, this element would exists without being a social being anymore as there would not be a society acknowledging it. Being acknowledged by a society does not allow to define commonly shared identity criteria (-I) nor define precise unity condition (-U). It is important to notice this problem is due to the fact that social being is quite a vague concept.

Person The property Person has been widely analyzed in other works such as [70], thus referring to that work can easily realize that the property of being a person is rigid with respect to all its instances, and that this property has clear unity (+U) and its own identity criteria (+O). 5.2. FORMAL ANALYSIS OF THE CONCEPTUAL MODEL 59

Organization Also the property Organization has been analyzed in [70], and fol- lowing the intuition of the authors we can say that being an organization is a rigid property (+R), presenting identity (+O) and unity criteria (+U). Finding a com- mon unity criteria defining an organization as whole seems to be possible relying on the functional notion of unity. Namely, an organization can be seen as a whole as it functions as whole, not only because of the mereological sum the its part.

Location A Location is defined as “the class of geographical and physical locations (e.g. London, Canada, Africa, S. Peter square)”. The property of being a location is rigid with respect to all its instances (+R). A location seems to have its own identity criteria (+O) as, intuitively, it seems to be possible to determine equality between two location by analyzing their mereological extension [68]. However, this type of criteria seems to be suitable only to establish synchronic identity, but it would create problems along time. From an ontological perspective, it seems that a location does not have a clear unity criteria, indeed it is impossible to establish uniform unifying relation between the parts of a location (∼U) to form a whole without depending on unity criteria of other entities. This problem is due to vagueness, and intuitively one may think that it could be solved by ’convention’. Namely, it should be possible to draw arbitrary limits to location and if those limits are shared than one could talk about location as whole. However, it seems that a unity criteria for location based on arbitrarily assigned overlays would make the unity of a location dependent from the unity criteria of the overlay, and thus we conclude that location has anti-unity property.

Event The concept of Event is defined as "the class of events, including natural events (e.g. hurricane, earthquake, etc.), social events (e.g. conference, meeting, wedding, etc.), economical events (e.g. closing deals, signing agreements, mergers and acquisitions, etc.)". Events have been under the lens of philosophers for many years, but no shared agreement over their nature was ever found. In particular, it seems hard to find a common and satisfying solution with regard to the identity and the unity criteria of an event. Despite this, an analysis of some literature was performed with the aim of finding evidence of approaches that could be applied when dealing with 'events' in the context of the Okkam project. The first paper considered was a famous work of Davidson [51]. In this paper the author attempts to give a definition of events on the basis of 'causes and effects'. Namely, if two events have the same causes and the same effects, then they must be the same event. In particular, the identity criterion would be: x = y if and only if ((z)(z caused x ←→ z caused y) and (z)(x caused z ←→ y caused z)). This would answer the question about identity criteria for events. Unfortunately, as pointed out by Quine in [121], the identity criterion given by Davidson suffers from some kind of circularity. Indeed, this type of definition, relying on events (i.e. causes) to identify events, relies on the assumption that events are actually individuated. Thereby, the identity criterion given by Davidson is not satisfying. However, if we consider the pragmatic approach proposed by Kellenberg in [92], the definition of identity criteria has often been confused with the definition of identification criteria. In particular, Kellenberg criticizes the approach of Lowe [105, 104], and proposes a different one. Getting a little into detail, Lowe's identity criteria follow the canonical form: (x)(y)((Φx ∧ Φy) → (x = y ↔ Rxy)). So, if the variables x and y are Φ's, then x and y are identical if and only if they stand in the relation R (capable of establishing their identity). It is important that the property Φ is not considered in the definition of R, to avoid the circularity problem affecting, for example, the identity criteria for events defined by Davidson. Kellenberg highlights the fact that this identity criterion itself presents some circularity; indeed, it presupposes the capability of identifying and discerning x and y: "how can we know which one that relation is, and what Rxy means unless we know already what it means to be a single Φ and, hence, unless we already know the criterion of identity for Φ?". What Kellenberg proposes is a pragmatic definition of identity criteria as "doing that, if successfully performed, pick out only single entities of the kind in question from all single and all pairs of entities relevant in the domain". Taking into consideration the criterion of identity proposed by Davidson, the criterion consists in verifying, with regard to a pair of entities (x, y) of an unspecified domain D, that x is an event, that y is an event, and that x and y have the same causes and effects. This pragmatic conception of criteria of identity shows identity criteria as concrete entities, as they must be performed on particular entities at particular times and in possible worlds.
Kellenberg further states that to understand whether an entity is part of a class, we need to define the class, and defining the class we have the criteria of identity for the members of the class. In order to understand whether two entities are the same, Kellenberg proposes the definition of an 'individuator', which is another doing entailed by the identity criteria. As identity criteria are doings, they do not have a logical form, but sentences expressing the identification of such criteria do. In particular, the logical standard form of a criterion of identity (InS) would be: (x)(y)(IAxy ↔ EAx ∧ EAy ∧ JAxy), where x and y range over the entities to which the expression denoting the identity criteria IA can be applied, EA is a complex predicate denoting the identity criteria for As, and JA is a complex predicate denoting the individuator for As. Other articles were considered in the pursuit of identity criteria for events. Despite admitting to be unsuccessful in the individuation of events, Unwin [145] proposes a schema in which sortal terms should fit to be recognized as events, and further suggests how events and facts seem to be strongly correlated. Cleland [45] proposes to use changes (or concrete phases) as basic individuals for the identification of events. According to Cleland, concrete phases are enduring and unrepeatable denizens of physical reality without extending in space, thus concrete phases can co-occur in the same place. This article is very interesting and promising, but a deeper discussion of its proposal is beyond the aim of this work. Finally, Carlson [34] proposes a linguistic analysis where thematic roles would have a conceptual role in the individuation of events, starting from the principle that 'an event has at most one entity playing a given thematic role'. The literature analysis performed does not have the ambition to be either complete or sufficient to state that identity and unity criteria for events in their more general conception exist. Nevertheless, given the pragmatic approach to the definition of identity criteria proposed by Kellenberg, we could say that identity criteria for events exist, and are explicitly the ones used to define events in the context of the Entity Name System (+I). Furthermore, one could argue that 'concrete phases' could be proper unity criteria for events. Indeed they seem suitable to represent events co-occurring in the same spatial region and at the same time. The sum of concrete phases occurring in a spatial region and along a temporal interval could be the unity measure defining complex events, and thus we would be tempted to say that events do have unity criteria (+U). Furthermore, if we consider the sum of concrete phases as being the cause of some specific state, thus having a precise effect on the world, we can also see an event as a whole from a 'cause-effect' or functional perspective.

Artifact Type An Artifact Type is defined as “the types (or models) of an artifact which are used to produce an arbitrary number of copies (artifact instances). Examples are: Opel Zafira 2.0 DTI version 1, MS Word 2007, the Othello by Shakespeare. Notice that the class of bridges is not the artifact type of the London Bridge, as the concept of bridge is not the model from which copies are produced. Works of arts that can be reproduced in copies are members of this class”. Being an artifact type seems to be a rigid property for a model. One could argue that once the model is ’out of production’, or never gets to production, a model stops to be an artifact type as there are no instances. Nevertheless, once an object is defined as an artifact type, potentially it can be used at any time to be a model for the creation of instances of that artifact type. Thus, we can say that artifact type is a rigid property (+R)(similar conclusion are also in [35]). Artifact types have the property of being unique, or being the unique models for an arbitrary number of instantiation copies. Intuitively, it seems then that 62 CHAPTER 5. IDENTIFICATION ONTOLOGY artifact types have their own intrinsic diachronic and synchronic identity criteria as two different models will be realized in different types of artifacts. Adopting a pragmatic approach as proposed by Kellenberg in [92], seems clear that identity criteria exists, indeed intuitively people have no problem to individuate, the artifact type of artifact instances, both distinguishing different types and recognizing when two artifacts are of the same type. Thus, it seems reasonable to conclude that artifact types do have identity (+O) and unity criteria (+U). However, artifacts are still under the lens of philosophers (e.g. [36])and works as [35] ended up with different conclusions.

Artifact Instance An Artifact Instance is defined as “the class of concrete arti- facts, like the London Bridge, my own Opel Zafira, my copy of the Othello, my instal- lation of MS Word, etc.. Not to be confused with the class of artifacts-type, like Opel Zafira, Shakespeare’s Othello, MS Word”. As for Artifact Types, being an artifact is a rigid property of every artifacts. The non-rigid property of the artifact is the role it plays with respect of its function, but every artifact does not stop being such until it gets destroyed (+R). Finding a ’metaphysic’ identity criteria shared among all artifact seems to not be feasible. For sure, all artifacts have the property of being somehow the result of a process driven by human activities. However, a possible identity criteria could be the fact that artifacts can said to be the same if the they are composed of exactly the same components and exactly by the same matter (+I). This same vision is shared also by the authors of OntoClean methodology, despite someone does not agree [35]. Artifact instances have intuitively clear unity criteria, indeed where identity is tricky, people usually don’t have problems in discerning different artifacts of the same type. However, there are types of artifacts (e.g. wine) that do not have clear unity property, if not depending from others. Thus it seems reasonable to state that not all artifact instances present unity criteria (-U).

Okkamized Okkamized is defined as “the class of Okkam entities that are assigned an OkkamID”. Being okkamized is not an essential properties of all Okkam entities. Indeed, it might happen that an Okkam entity exists without being assigned any OkkamID, so being okkamized is a anti-rigid property (∼R). Okkamized entities in- herits identity criteria from Okkam entities (+I) and does not seem plausible to find a common unity criteria for all okkamized entities (-U).

Other The property Other defines "the class of things which do not fall under any other predefined class of OKKAM entities". This property has to be intended as a utility to represent all those entities that cannot be classified according to the main set of classes analyzed. Nevertheless, being an Other is not a rigid property (∼R), and it defines neither identity (-I) nor unity criteria (-U).

URI The definition of URI (Uniform Resource Identifier) is given in RFC3986, http: //tools.ietf.org/html/rfc3986, “A Uniform Resource Identifier (URI) is a com- pact sequence of characters that identifies11 an abstract or physical resource. The URI syntax defines a grammar that is a superset of all valid URIs, allowing an imple- mentation to parse the common components of a URI reference without knowing the scheme-specific requirements of every possible identifier”. According to this definition, a URI has its own identity criteria as explicit identity (+O) property are defined and proper of all URIs. The combination syntax, encoding, size constrains seems to suggest that a URI has also unity criteria (+U). According to the definition of URI, a URI is such only if it is associated as identifier of a resource. This fact seems to suggest that being a URI is not an essential property of all the strings respecting the URI syntax, and thus the property of being a URI is anti-rigid (∼R). Indeed, at some extent, the definition of URI seems describing a role with respect to the Resource it is attached to. Intuitively, it is impossible to understand if a URI per se identifies any resource at all. The only rigid property of a URI is the syntactic one, so at worst we could define the class of “URI-like” strings as rigid.

HTTP URI HTTP URI is defined as “the class of de-referenceable URIs”. The definition makes implicit reference to the HTTP protocol as mean to dereference a resource identified by a URI. Similarly to URI, the property of being an HTTP URI has unity (+U) and identity properties (+I). Also in this case, intuitively an HTTP URI does not guarantee per se that there is a resource de-referenceable related to it. Thereby, being an HTTP URI is an anti-rigid property (∼R).

OkkamID OkkamID is defined “the class of OkkamIDs”, namely those URIs that are assigned to Okkam Entities in the the context of the Entity Name System. Similarly to URI, the property of being an OkkamID has unity (+U) and identity properties (+I). As to rigidity the class of OkkamId requires a more detailed analysis. First of all, any OkkamId is generated just and only when a new Okkam Entity is Okkamized, namely it is assigned an OkkamId. This means that since its generation, and OkkamId is necessarily tied to an Okkam Entity. Furthermore, in principle no Okkam Entity should be removed by the Entity Name System, and thus for all its existence, even

11We believe it would be more correct to say that a URI is used to identify resources, rather than identifies. Indeed a URI is not a property a resource, but rather a label/name stick to it for sake of making the reference non ambiguous. 64 CHAPTER 5. IDENTIFICATION ONTOLOGY along the lifecycle of the Okkam Entity, and Okkam Id is rigidly tied to an Okkam Entity. This principles of persistence and consistency gives, in my opinion, rigidity property to an Okkam ID. Namely, every OkkamID is necessarily tied to the Okkam Entity for all its existence. An OkkamId does not have merely the role of dereferencing objects, but to be a rigid and shared name for entities. Thus, according to my intuition I affirm that being an OkkamId is a rigid property (+R).

Semantic URI A Semantic URI is defined as “the class of URI that are identifiers of resources. They refer to resources by description since when they are dereferenced they redirect to other URIs which resolve in web resources that give description of the referred resources”. Similarly to URI, the property of being an Semantic URI has unity (+U) and identity properties (+I). The automatic de-referentiation mechanism underlying the existence of Semantic URIs suffers from the same lack of rigidity as regular HTTP URI. Indeed, in principle a Semantic URI does not guarantee per se that there is a Semantic Resource de-referenceable related to it. Thereby, being a Semantic URI is an anti-rigid property (∼R). Similar analysis can be done for the class of WWW URL.

AlternativeID An Alternative Id is defined as “the class of semantic URIs which stand in the co-refer relation to OkkamIDs”. Namely, an Alternative Id is a URI which is discovered to identify and refer an okkamized Okkam Entity. As Alternative Id is dereferenced outside the Entity Name System, and thus in a context that does not guarantee persistence and consistency of the reference, is not a rigid property. Nevertheless, as the other URIs the property of being an An Alternative has unity (+U) and inherits identity properties (+I).

5.2.2 Constraints Violation in Backbone Taxonomy

A view of the taxonomy labeled according to the analysis described in section 5.2.1 is presented in Figure 5.3. An analysis of this taxonomy with respect to the metaprop- erties assigned helps highlighting problems or misconception in the definition of the model. The first constraints check is about the rigidity meta-property. As shown in Figure 5.3, many properties included in the model as classes are anti-rigid. These classes any- way are clearly part of the domain the Okkam Conceptual Model wants to represent. A deeper analysis, presented in section 5.2.2, will highlight that most of them can be actu- ally better represented as roles. In particular, the properties presenting constraints vio- lation to be removed from the taxonomy are: Social Being and OkkamId. Both the 5.2. FORMAL ANALYSIS OF THE CONCEPTUAL MODEL 65

Figure 5.3: Okkam Conceptual Model Labeled with OntoClean metaproperties properties break the OntoClean constraint (+R 6⊂∼R). Thereby, the relation between Social Being and properties Person and Organization should not be represented by subsumption, but should be represented by a disjunction. That is, if something is a Social Being, then it is either a Person or an Organization. Analyzing the taxonomy it seems clear the intent of finding a common ancestor for person and or- ganization as social agents. However, I believe that a better way to model this would be to model legal agent as a role, and state that persons and organizations play the role of social agents. Regarding OkkamId property, it seems that this cannot be in subsumption relation with URI, but rather it is constituted by a URI and inherits its syntactic characteristics. Anyway, as we concluded that an OkkamId is rigid prop- erty with respect to its property of being about an Okkam Entity, then we have to remove OkkamId from the subsumption relation with the URI. Another subsumption that violate an OntoClean constraint is the one between Okkam Entity and Other. This property is too vague to be maintained in the taxonomy and subsumption relation with Okkam Entity violates the (-I 6⊂ +I) con- straint. Okkam Entity has identity property, but the property Other seems not to have it, and incompatible identity criteria are sign that the properties are disjoint. The taxonomy defined by rigid classes does not present particular inconsistencies with the constraints defined by the OntoClean methodology. According to the methodology, the first stage after considering rigidity, is to indi- viduate the so called phased sortal. A phased sortal is a property that changes identity criteria along its existence, while remaining the same entity. Analyzing the taxonomy, potential candidates for being phased sortal could be Web Resources, URI and their 66 CHAPTER 5. IDENTIFICATION ONTOLOGY

Figure 5.4: Okkam Conceptual Model taxonomy modified to remove constraint violation. subclasses. Nevertheless, these classes don’t change their identity criteria along their existence and thus, we can conclude that there are no phased sortal in the taxonomy. The next step is analyze the presence of roles. Performing this task we can identify a set of classes that seems to play some role, in particular: Web Resource, Web Semantic Resource, External Reference. All these classes seems to share the same kind of problem with respect to rigidity. In particular all of these are computa- tional objects that happen to be accessible on the Web, and thus depend on two things: the host server and the URI that is resolved in the specific location of the host. Without these two essential elements, no computational objects can play the role of being a web resource of any type. The property URI and its subclasses do not seem to play a role. One could think that they play the role of being the identifiers of resources, but in my opinion this is not a role. URIs are properties following specific syntactic constraints that can be employed in the system of the web to identify/locate resources. Their existence is independent from the resources they refer to. Indeed, nobody would deny that a string representing a URL is not a URL despite no computational object can be retrieved resolving it according to any of the HTTP protocols. In my opinion, URIs are particular data types, more specifically strings that respect a defined grammar (URI), declare specific protocols for its resolution (HTTP URI), follow specific conventions (WWW URL, Semantic URI), or present some common pattern (OkkamID). A view of the taxonomy modified to remove constraint violation is presented in Figure 5.4.

An Evolution of the Model

In order to fix the inconsistencies of the Conceptual Model for Okkam aimed at rep- resenting the context of naming and reference in the (Semantic) Web, I propose the 5.3. IDENTIFICATION ONTOLOGY TAXONOMY 67 following modification of the model: • define a role hierarchy for the properties Web Resources, External Refer- ences, Web Semantic Resources; • define properties representing the abstract locations that can be referenced by URI and where computational object can be placed; • define a role hierarchy for agents and create the property of Legal Agents so that Organization and Person can be referred • define URI and its hierarchy as subclass of Data Type, in particular the prop- erties should identify strings presenting particular syntactic properties (e.g. URI- Like Strings). These novel data types then should be used as range of properties related to Web Resources. • explicit the interpretation of the Location concept to be interpreted more specif- ically as Geo/Political Feature, as these are the type of entities of interests in the context of the Entity Name System. Indeed, the class of location is aimed at identifying the features such as cities, states, roads, building presenting particular geographical and spatial properties providing identity and unity properties. As showed in the analysis presented in section 5.2.1, locations are quire slippery to identify per se. Furthermore, it would be interesting to consider other properties aiming at rep- resenting further details of domain represented. An interesting feature would be to represent the resolution mechanism as the combination of protocols and the agents involved in the process of retrieving a computational object given a URI. A further aspect to explore would be an analysis over the representation of authorities behind the resolution mechanism. In order to complete the model, it would be interesting to find a way to represent the difference between the description provided by a Web Semantic computational object playing the role as Web Semantic Resource and the de- scription embedded in the computational object playing the role as Okkam Profile. In particular, it would be important to highlight the non-authoritative/open nature of the description provided by Okkam Profiles in comparison with the Web Semantic Resources. However, these aspects are not relevant for the main goal of this work.

5.3 Identification Ontology Taxonomy

In previous sections an analysis over the Okkam Conceptual Model was presented. In particular, the conceptual model was dissected according to the OntoClean methodol- 68 CHAPTER 5. IDENTIFICATION ONTOLOGY

Figure 5.5: Identification Ontology backbone Taxonomy obtained by the Analysis of the Okkam Conceptual Model ogy. The analysis highlighted some modeling error in the definition of the backbone taxonomy. Some of the concepts treated in the Okkam Conceptual Model, despite in- tuitively simple, are still under philosophical analysis and non shared agreement exists about their metaphysical nature. In particular big effort was spent in the search for suitable solution to represent events. Despite no shared solution exists, we believe that some important result can be reached considering a more pragmatic approach to the definition of identity and unity property of complex entities as events. It is important to notice that the goal of the project was not to define or defend any specific philo- sophical position about metaphysics of certain properties. OntoClean showed to be a very useful methodology to analyze the Okkam Conceptual Model, and some of the proposals defined in section 5.2.2 will be used in possible evolution of the model suit- able to be adopted as backbone taxonomy for the Identification Ontology supporting the knowledge based solution proposed in the work. The exercise performed along the analysis of the Okkam Conceptual Model guided the process of dissecting concepts and properties that intuitively appear easy to handle, but that present many subtleties when analyzed in detail with respect to clear identity criteria. The main result of this process with respect to this work is the a further confirmation that the OCM taxonomy is an optimal starting point to define the model the knowledge base supporting the proposed solution. A view of the taxonomy labeled according OntoClean methodology is presented in Figure 5.5. The taxonomy depicted in figure 5.5 is used as core component of the Identification Ontology underlying the knowledge-based solution for open entity matching. In partic- ular, the part that is more interesting for goal of this work is the part of the taxonomy specifying the Okkam Entity subclasses. However, as a result of the analysis produce in section 5.2.1, we learned that some concepts are particularly slippery with respect to the individualization clear identity criteria. For example, events and artifacts type and instances appear hard to model generally in terms of what type of knowledge is 5.4. META-PROPERTIES FOR IDENTIFICATION 69 necessary to discern them besides their name. The pragmatic philosophical approach to identity definition used as tool in the application of the OntoClean methodology can hardly support the definition of an explicit knowledge-based solution without relying on deep cognitive studies. An attempt in this direction was proposed in [6], but given the time available and the scope of the thesis, we choose to not investigate deeply these concepts, giving priority to the concepts easier to interpret and manage from an epistemic perspective. For this reason, we choose to explore in detail the defini- tion of a knowledge-based solution for entities of type Person, Location (interpreted as geospatial/political features) and Organization.

5.4 Meta-properties for Identification

As mentioned in the introduction of the chapter, we intend to rely on a set of meta-properties to explicitly highlight special properties of features when involved in the entity matching process. In this respect, we propose to rely on a top-down ontological analysis to extend the identification ontology, attempting to compensate, or integrate, OWL 2 with meta-properties that may be useful to declare properties that have special roles in the matching process. The idea is to apply concepts and principles that are proper of formal ontological analysis (e.g. [70]) to support the definition of matching, or non-matching, constraints. It is important to keep in mind that in formal ontology identity cannot in general be defined as a sufficient condition; what can be defined are some sort of information constraints providing necessary conditions for identity [69].

As previously mentioned, in the definition of identity criteria OntoClean [70] proposed to consider the distinction between synchronic and diachronic identity criteria. Synchronic identity criteria allow establishing identity between entities at the same instant of time. Formally, p is synchronic if ∃t ∀x,y (t > t0 → (p(x,y,t0) → p(x,y,t))), with t0 denoting the first time y was assigned to x as the value of the property p. Diachronic identity criteria imply the persistence of the identity criteria through time. Formally, p is diachronic if ∀t,x,y (t > t0 → (p(x,y,t0) → p(x,y,t))). Namely, diachronic properties do not change with time, and thereby represent essential properties of the identified entity. In OntoClean, the individuation of identity criteria is essential to annotate concepts in a taxonomy, and thus to evaluate its soundness according to this methodology. However, this distinction may become useful also in the context of the entity matching problem. In fact, when solving the entity matching problem in an open and wide context such as the Web, diachronic identity criteria have to be preferred over synchronic ones.

Indeed, the Web, as a global information space, has to be interpreted as intrinsically asynchronous with respect to the real world entities mentioned and described in it. Thereby, assuming synchronicity of properties in an entity matching solution may lead to problems. As analyzed in chapter 2, the solution of the entity matching problem in the context of the web is also prone to the problem of over-specification. Namely, descriptions could contain attributes that can be interpreted only within a specific context, as for example a student id number. These properties are very likely to be inverse functional in the context of a university information system. However, there is no guarantee that two systems do not independently assign the same id to different students in different universities. Thereby, there is a further dimension to be considered when dealing with identifiers: the scope of a property. In particular, we distinguish between local scope and global scope of a property with respect to matching purposes. Intuitively, a reliable solution for entity matching in a global and open context should preferably rely on properties defined with global scope and treat carefully attributes with local scope only. With these premises, in the following we define a list of meta-properties considered useful for the solution of the entity matching problem:

• Functional (F): p is functional if for every individual x there can be at most one individual y such that p(x,y)12. More formally, p is a functional binary predicate: ∀x,y,z (p(x,y) ∧ p(x,z) → y = z). The official semantics of the functional meta-property is a shortcut for a maximum cardinality constraint of 1 on the range of the property. It is important to underline that the semantics of functional properties is conceived without considering change over time. For example, it sounds correct to say that every person has just one residence address. In fact, any person can officially be resident in only one place, and thus the property can be correctly classified as functional. However, a person can change residence several times in a lifetime, and thus, considering the time dimension, a person can have more than one residence address, even though not at the same time.

• Inverse Functional (IF): p is inverse-functional if for every individual y there is at most one individual x such that p(x,y)13. More formally, p is an inverse-functional binary predicate: ∀x,y,z (p(x,y) ∧ p(z,y) → z = x). This type of meta-property is proper of identifiers assigned to entities in some context, such that within that context only one entity can be associated with each identifier. Globally inverse functional properties are practically impossible, as identifiers are usually assigned within a bounded context. However, if we consider the Web as a global space, then email address URIs can be considered as global inverse functional properties. Also in this context, inverse functionality has to be considered in a diachronic manner. Non-diachronic inverse functional properties can cause big troubles when solving entity matching in a global space.

12 http://www.w3.org/TR/owl2-syntax/#Functional_Object_Properties
13 http://www.w3.org/TR/owl2-syntax/#Inverse-Functional_Object_Properties

Here we define a set of meta-properties for identification properties, taking into consideration the time and scope dimensions. In particular we define:

• Functional Diachronic (FD): p is functional diachronic if and only if it is both functional and diachronic as an identity criterion. Namely, the range of this property must have maximum cardinality 1, and its value does not change in time for any reason. Examples of functional diachronic properties for a person are the date of birth and the birthplace.

• Inverse Functional Diachronic (IFD): p is inverse-functional diachronic if it is both inverse-functional and diachronic as an identity criterion. Namely, given a property p and a value y, the domain of p must have maximum cardinality 1, and furthermore the value does not change in time.

A minimal sketch of how such meta-properties can be turned into matching constraints is given below. In the following sections we will describe the process of extending the backbone taxonomy produced starting from the OCM with the set of features considered relevant for the identification process.
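To make the intended use of these meta-properties concrete, the following minimal sketch (in Python, with hypothetical feature names and a hypothetical decision policy) illustrates how FD and IFD annotations could be exploited as matching constraints: conflicting values on a functional diachronic property provide negative evidence, while agreeing values on an inverse functional diachronic property provide strong positive evidence. It is only an illustration of the principle, not the matching algorithm adopted in this work.

```python
# Minimal sketch: FD/IFD meta-property annotations used as matching constraints.
# Feature names and the decision policy are hypothetical illustrations.

FD = "functional_diachronic"            # at most one value, never changing in time
IFD = "inverse_functional_diachronic"   # a value identifies at most one entity, forever

META = {
    "birth_date": FD,
    "birthplace": FD,
    "email_address": IFD,   # under the simplifying assumption discussed in section 5.5.1
    "last_name": None,      # no special role in the matching process
}

def constraint_evidence(desc_a, desc_b):
    """Return 'non-match', 'match' or 'unknown' based on meta-property constraints.

    desc_a, desc_b: dicts mapping feature names to already normalized values.
    """
    for feature, meta in META.items():
        a, b = desc_a.get(feature), desc_b.get(feature)
        if a is None or b is None:
            continue                      # missing values give no evidence either way
        if meta == FD and a != b:
            return "non-match"            # an FD property cannot have two different values
        if meta == IFD and a == b:
            return "match"                # an IFD value belongs to at most one entity
    return "unknown"                      # fall back to similarity-based matching

if __name__ == "__main__":
    a = {"birth_date": "1982-02-12", "last_name": "Rossi"}
    b = {"birth_date": "1975-06-30", "last_name": "Rossi"}
    print(constraint_evidence(a, b))      # non-match: the birth dates differ
```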

5.5 Features for Open Entity Matching

In this section we briefly present the methodology adopted to select the set of features associated with each of the three entity types for which we intend to build a knowledge-based entity matching solution. It is important to remember that the features defined for the selected types are parameters of the matching process, and thus in principle any set of features could be used. Intuitively, the broader the set of features, the broader the set of sources that can be matched precisely. However, a large set of features also requires a large and heterogeneous set of labeled samples presenting these features in order to learn their relevance in the matching process. Furthermore, the larger the set of attributes, the larger and more cumbersome the maintenance of the contextual mappings becomes. Hence, in the short term it is important to rely on an iterative, incremental approach aimed at keeping the effort sustainable, whereas in a long term perspective we have to outline a process suitable for a community effort.

In the following sections, we present a set of features obtained relying on different sources and methodologies. Their combination will produce the set of features associated with each of the types considered. The first set of considered features comes from the results of a cognitive science experiment aimed at eliciting, through experiments with people, the most discriminative attributes for several entity types [6, 7, 4]. Essentially, through a set of targeted experiments, the authors of the work elicited the attribute types people would use to search for specific types of entities. A statistical analysis of the results, based on the concepts of sharedness and distinctiveness, produced a measure of the semantic relevance of the attributes used. The attributes proposed were then ranked based on this estimate, and attributes below a certain threshold were not considered [6]. The results of this work are also used in [139] to weight the relevance of attributes when computing entity matching similarity. This first set of features is the result of methodologically sound cognitive science experiments. However, these experiments were specifically focused on eliciting features relevant for searching, and thus identifying, entities in the context of the Web. Hence, the set of properties defined is surely relevant, but unlikely to be complete with respect to the need of representing information that could be useful to solve the entity matching problem.

In order to extend this set of properties, we surveyed existing vocabularies and extracted the most generic features associated with the considered entity types. Given the large variety of existing vocabularies presenting similar properties for the same type of entity, we propose an approach that clusters shared (similar) properties into general properties that encompass all their instantiations. In fact, in generalizing these properties, we considered sub-property as a relation supporting the clustering of existing properties. Consider for example the properties that provide information about a person belonging to some group or organization. For a politician, a property indicating the political party of reference could be dbpedia:party, whereas for a musician the band could be indicated with dbpedia:musicalBand, or again a noble person could be cast as belonging to a family through the dbpedia:dynasty property.
It is clear that representing all possible variants of this type of property would quickly lead to an unmanageable vocabulary. Thereby, we decided to cluster these properties into the more generic property member of (a minimal sketch of such a mapping is given at the end of this section). Notice that, in principle, this minimalistic approach to the management of the vocabulary also has the important advantage of reducing sparsity in the comparison, but it may introduce some noise, as by generalizing we lose some of the specificity embedded in the property definition. Experimental evaluation will show whether the positive effects of generalization compensate for the loss of specificity.

The Linked Open Vocabularies (LOV)14 listed 319 vocabulary spaces, providing classifications and properties for many different types of entities. LOV is a very useful entry point for gathering types and the properties related to them, as it provides a formal SPARQL interface. Given the large space of vocabularies available in the Linked Data initiative alone, we limited our analysis to the most popular vocabularies (e.g. FOAF, Schema.org, DOLCE Lite, etc.). The complete exploration of the vocabulary space will be pursued in future work. Notably, there are ontologies, such as those of DBPedia, Freebase and Yago, that are not mentioned in LOV, but we decided to include them anyway.

14 http://lov.okfn.org/dataset/lov/
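As an illustration of the clustering of source-vocabulary properties into the generic features of the identification ontology, the following sketch maps a few DBpedia-style properties onto the member of and affiliation features. The mapping shown is a tiny, hypothetical excerpt (the property and feature names beyond those cited above are placeholders); the real contextual mappings would cover many more vocabularies and be maintained incrementally.

```python
# Minimal sketch of a contextual mapping from source properties to generic
# identification-ontology features. The entries below are a small, hypothetical
# excerpt; real mappings would cover many more vocabularies.

PROPERTY_TO_FEATURE = {
    "dbpedia:party":       "member_of",
    "dbpedia:musicalBand": "member_of",
    "dbpedia:dynasty":     "member_of",
    "dbpedia:employer":    "affiliation",
    "foaf:mbox":           "email_address",
}

def harmonize(record):
    """Rewrite a source record into the feature space of the identification ontology."""
    features = {}
    for prop, value in record.items():
        feature = PROPERTY_TO_FEATURE.get(prop)
        if feature is not None:
            features.setdefault(feature, []).append(value)
    return features

if __name__ == "__main__":
    source = {"dbpedia:party": "Democratic Party", "foaf:mbox": "mailto:jane@example.org"}
    print(harmonize(source))
    # {'member_of': ['Democratic Party'], 'email_address': ['mailto:jane@example.org']}
```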

5.5.1 Features for Entity Type Person

The identification ontology includes the following properties (in alphabetic order) for the entity type Person:

• affiliation. This feature describes generically the affiliation of a person with respect to some organization that can be precisely individualized. This feature is generally functional when conceived synchronically; however, it often changes in time, which makes it unsuitable for taking matching decisions alone. However, when considered in an aggregated form, it may be useful to take matching decisions. This property generically clusters properties such as team, company, employer, workplace, military unit, party, etc. The affiliation of a person is neither functional nor inverse-functional.

• author. This feature describes the intellectual artifacts (books, plays, software, etc.) that are ascribable to a person. The authorship of an intellectual artifact is neither functional nor inverse functional, as many authors can contribute to the same work. However, it is very unlikely that homonyms co-author the same work, and thus this property can be useful to establish matching decisions. Examples of properties clustered around this property are: writer, lyrics written, plays composed, artworks, novels, contributing author, etc. Similar reasoning was applied in many works on matching authors in bibliographic datasets, e.g. [16, 17].

• award. This feature describes awards achieved by a person, including sport titles or medals of honor. The property of receiving an award is neither functional nor inverse functional if considered in an asynchronous context, as many persons can receive the same award at the same or different times. However, in most cases it is very unlikely that homonyms receive the same award at the same time, and thus this property can be useful to establish matching decisions. Examples of properties clustered around this property are: prizes, award, honored for, Olympic medals, best player of the game, etc.

• birth date. This feature describes the date of birth of a person (e.g. 12 Feb 1982). This property is functional, included among the properties described in [6], and generally used also in standard identification documents. More importantly, the birth date of a person is also functional and diachronic. In fact, the known birth date of a person can never change in time.

• birth month. This property describes the month of the year as part of the date of birth of a person (e.g. February). This property, as part of the birth date, is functional and diachronic. It is included mostly to ease problems related to structural heterogeneity in the representation of date of birth values: some sources would not present the date as a single attribute, but would rather present the three parts of a date of birth separated into day, month and year (a minimal normalization sketch is given at the end of this section). As one of the goals of the ontology is to provide a means for semantic harmonization, losing information about the date of birth because of granularity issues would affect the quality of matching.

• birthplace. This feature describes the place or location where a person was born. This property is functional, included among the properties described in [6], and generally used also in standard identification documents. As for the birth date, the birthplace is a functional and diachronic property. However, precisely identifying a place may be more complicated than a date, as location names are generally ambiguous.

• birth year. This feature describes the year as part of the date of birth of a person. As for birth month, this property is included to ease issues related to structural heterogeneity. Furthermore, for many persons (e.g. historical persons) the exact date of birth may not be available or known, but the year of birth is more likely to be known. As for birth date and birth month, also the birth year is functional and diachronic as it can never change in time.

• country of residence. This feature describes the name of the country where a person is resident. This property is functional, even though it can change in time. It is usually included in standard identification documents, typically as part of the street address of a person. We chose to represent this part of the composite street address also as a stand-alone feature, with the goal of easing the issues related to structural heterogeneity. Differently from city names, country names are quite reliable for matching since, to the best of our knowledge, no two different countries have the same name.

• city of residence. This feature describes the name of the city where a person is resident. This property is functional, even though it can change in time. It is usually included in standard identification documents, typically as part of the street address of a person. We chose to represent this part of the composite street address also as a stand-alone feature, with the goal of easing the issues related to structural heterogeneity. As for the birthplace, the identification of a city is complicated by the inherent ambiguity of city names when interpreted in a global space.

• date of death. This feature describes the date of death of a person. This property is functional and diachronic, and usually very useful to take matching decisions about historical persons when available. As for the birth date, the date of death can never change in time.

• day of birth. This feature describes the day of the month as part of the date of birth of a person (e.g. 12). This property is functional, and it is included mostly to ease problems related to structural heterogeneity in the representation of date of birth values: some sources would not present the date as a single attribute, but would rather present the three parts of a date of birth separated into day, month and year. As one of the goals of the ontology is to provide a means for semantic harmonization, losing information about the date of birth because of granularity issues would affect the quality of matching. As for any other part of a birth date, this property is also diachronic.

• deathplace. This feature describes the place of death of a person. This property is functional and diachronic, and potentially useful to take matching decisions about historical persons when available; the place of death can in principle never change in time. However, given the ambiguity related to the identification of a place, we have to treat the matching of this type of attribute carefully.

• description. This feature aims at being a placeholder for all the attributes providing generic, textual descriptions of a person. It clusters properties such as short description, abstract, notes, biography summary, etc. Any property valued with a textual description of a person should be mapped to this feature. A description usually contains information in natural language that can help human beings in taking matching decisions. However, natural language is known to be very ambiguous and hard to interpret. For this reason, we have to deal carefully with this type of feature.

• domain tag. This feature is aimed at representing an alternative, compact type of description. In fact, the domain tag feature aims at collecting all those attributes that may be useful to describe a person in terms of keywords. Data sources on the web can use this type of attribute to contextually disambiguate records about homonyms without the need of providing any specific semantics. The domain tag feature clusters properties such as the generic tag and domain, as well as sport played, music genre played, etc.

• email address. This feature aims at representing the email address of a person. An email address is often used as an identifier in web systems, as it is generally considered to be inverse-functional. Each email address, as defined according to URI standards, can be reliably considered unique in a global scope. However, mailboxes can be re-assigned over time, and thus in principle they cannot be considered inverse-functional and diachronic. However formally correct this conclusion may be, we believe that in general email addresses are still very reliable web identifiers, even though we are aware that this could lead to erroneous matching decisions. For the moment, we decided to introduce a simplifying assumption, neglecting the lifecycle of email addresses and other types of URIs with respect to their assignment to different persons over time. A similar choice was made while modeling the FOAF ontology, assuming inverse-functionality of mailbox addresses15.

• email address hashcode. This feature aims at representing the result of applying a hashing algorithm to the email address (a small sketch of this computation is given at the end of this section). This property is inspired by the FOAF mbox sha1sum16 property of an Agent: “The sha1sum of the URI of an Internet mailbox associated with exactly one owner, the first owner of the mailbox”. In [13], Berners-Lee cites this technique as an approach to safely link different FOAF profiles without disclosing private information. Also in this case, we can assume that an email address is inverse-functional, but, as for the email address, we cannot conclude that it is also diachronic. Nevertheless, as for the email address, we decided that in this context we can neglect this aspect and, introducing a simplification with respect to the complexity of the real world, consider also the email address hashcode as inverse functional and diachronic.

15 http://xmlns.com/foaf/spec/#term_mbox: A personal mailbox, ie. an Internet mailbox associated with exactly one owner, the first owner of this mailbox. This is a ’static inverse functional property’, in that there is (across time and change) at most one individual that ever has any particular value for foaf:mbox.
16 http://xmlns.com/foaf/spec/#term_mbox_sha1sum

• end date. This feature aims at capturing a date that represents the end of an activity or mandate: for example, the end of an elective mandate, the end of a period of affiliation with some organization, or the end of activity in some sector (e.g. sport). This information alone does not help in capturing generic knowledge related to the context of interpretation. However, it could contribute to some sort of time-related inference/reasoning supporting matching decisions; e.g. if the end date of an activity is later than the date of death or earlier than the date of birth, then it is possible to produce a negative decision. However, we do not explore this type of constraint in this context, and we limit the analysis to its adoption as an attribute for matching per se.

• eyes color. This feature describes the color of the eyes of a person. It is included among the ones obtained through the experiments with people described in [6]. This property generally never changes in the life of a person, and can be considered functional and diachronic.

• fax number. This feature describes the personal fax number of a person. This property is likely to be found in data sources describing public officers, together with the phone number. There exist standards for the representation of these types of attributes as URIs17. This property is intuitively inverse-functional when it refers to the personal fax machine of a person. However, this type of machine can sometimes refer to different persons at different times, or be used collectively by a group of people (e.g. an office room). Therefore, fax numbers have to be treated carefully. However, knowing whether a fax number is personal or shared by a group is an epistemic problem we can neglect in this context. Hence, aware of this simplification, we consider fax number as an inverse-functional and diachronic property.

• first name. This feature indicates the first name (or given name) of a person. Any person has at most one first name, which has to be intended as the part of the name of the person without the surname (or family name). A first name is intuitively functional and changes extremely rarely in time. Thereby, it can be considered functional and diachronic. Also in this case, we are aware of the fact that by law a person can change their official first name, but we decided to simplify the interpretation of this type of attribute by neglecting this possibility.

• gender. This feature describes the gender of a person (e.g. male or female). Any person can belong to at most one gender at a time, thus gender is clearly functional. We are aware that exceptions exist, but in this context we assume that the gender of a person is stable in time and do not consider exceptions. Therefore, also in this case, we assume that gender is functional and diachronic.

• height. This feature describes the height of a person. This property is functional, but it cannot be considered diachronic in general. It is included among the properties described in [6], and generally used also in standard identification documents. Nevertheless, the fact that different measurement systems exist (e.g. metric vs. imperial) means that the values of this property cannot be compared straightforwardly.

17 http://www.ietf.org/rfc/rfc2806.txt - URLs for Telephone Calls

• involved in. This feature describes the generic relation of a person with some product or event. This relation is aimed at clustering relations such as movie (or play, or TV episode) producer, movie (or play) director, music producer, etc. In a sense, the person is involved in the realization of something, but does not participate directly (or practically) in its realization. This property is neither functional nor inverse-functional. Nevertheless, it can be useful in taking matching decisions in combination with other attributes. For example, it is very unlikely that two homonyms produce the same movie, or direct the same play.

• last name. This feature describes the family name of a person. We make no assumptions about the size or number of elements represented by this feature; for example, Portuguese names usually include both the father's and the mother's last names. In this context we consider as last name the complement of the first name attribute in composing the full name of a person. The last name of a person is generally a functional property. However, differently from the first name, social conventions related to marriage can easily lead to a change of surname. In this case we do not introduce any simplification, and simply accept the fact that the surname cannot reliably be used as a functional and diachronic attribute.

• member of. This feature describes the belonging of a person to some group or organization. The member of relation is a weaker, less formal/subordinate interpretation of the affiliation relation. This feature clusters properties like movement, school tradition, religion, etc., with which a person can be associated. The property is neither functional nor inverse-functional, but it can help in taking matching decisions as a further element of similarity or difference.

• middle name. This feature describes the possible middle name of a person. This type of feature is included to ease issues related to structural heterogeneity. In fact, some sources could present the parts of the name of a person separately, without representing them explicitly as one. Representing the parts of the name allows reconstructing the whole name of a person based on simple syntactic rules. This property is functional and diachronic, as the first name.

• name. This feature describes the name of a person. The name has to be intended as the whole name of a person, including first, middle and last name(s). Given the not strictly diachronic nature of the surname attribute, we cannot assume that name is a functional and diachronic attribute.

• nationality. This feature describes the nationality of a person. This attribute was among the ones obtained through the experiments with people described in [6]. The nationality of a person is quite a vague concept, as it often involves the concept of citizenship. One can have multiple citizenships, and change them during a lifetime when migrating from one country to another. For this reason, the property can be considered neither functional nor inverse-functional.

• nickname. This feature describes the nickname of a person. A nickname is any name defined and used in a private, unofficial context (e.g. a skype id), the stage name (or alias) of some artist (e.g. Alice Cooper, born Vincent Damon Furnier), or a royal name (e.g. Queen Elizabeth II). A nickname cannot be considered functional, nor inverse-functional out of its context. However, together with other attributes, it can contribute to taking matching decisions, as in the case of famous persons a nickname can be better known than the first name.

• occupation. This feature describes the main occupation of a person. It is clear that this type of information is useful to take matching decisions if occupations are incompatible (doctor vs. football player). However, this type of inference is very much context dependent, and a very in-depth formalization of the different types of occupation is required to automate it. In a lifetime, a person changes occupations several times, and thus the property cannot be considered functional and diachronic. This attribute was among the ones obtained through the experiments with people described in [6], and it clusters attributes ranging from basketball roster position to instrument played in a band and government position held; more generically, any property denoting the role of a person.

• participant in. This feature describes generically the participation of a person in some event. This property is neither functional nor strictly inverse-functional. In fact, depending on the type of event, being part of it does not guarantee the possibility of identifying a person. However, if we have descriptions presenting the same name and participating in the same concert or play, we are very likely to assume that the two descriptions refer to the same person. Participation in an event is necessarily diachronic as an identity criterion, although the identification of the event itself is quite slippery beyond traditionally recognized ones (e.g. the birth of a person). This feature clusters properties such as movie appearance, sport competition, battles, musician tours, sport seasons or matches, election campaign, legislative session, etc.

• phone number. This feature describes the phone number of a person. Similarly to the fax number, this property is likely to be found in data sources describing public officers. There exist standards for the representation of these types of attributes as URIs18. This property is intuitively inverse-functional when it refers to the personal phone number of a person (e.g. a mobile phone). However, phones often refer to different persons at different times, or are used collectively by a group of people (e.g. an office room, a family phone, etc.). Nevertheless, we assume that a phone number is personal, neglecting the collective phone number interpretation and the fact that a phone number can, in principle, be assigned to different persons over time.

• picture URL. This feature describes the URL of the picture of a person. This property is inverse-functional if the picture depicts only one person; in principle, a picture could depict more than one person. However, it is intuitively very unlikely that two persons with the same name use the same picture URL in which they appear. Also in this case, we chose to introduce a simplification, assuming that any picture refers specifically to the person, and thus it can be considered inverse functional and diachronic.

• postal code. This feature describes the postal code of a person. This property is functional, but can change with time. It is represented as a single feature to ease the problems of structural heterogeneity related to the representation of street addresses.

• start date. This feature aims at capturing a date that represents the start of an activity or mandate: for example, the start of an elective mandate, the start of a period of affiliation with some organization, and so forth. As for the end date, this information per se does not help in capturing generic knowledge related to the context of interpretation. However, it could contribute to time-related inference/reasoning supporting matching decisions; e.g. if the start date of an activity is later than the date of death or earlier than the date of birth, then it is possible to produce a negative decision. However, we do not explore this type of constraint in this context.

• street address. This feature describes a street address referred to a person. Similarly to the phone and fax number, this type of information is likely to be found in data sources describing public officers, or health care providers. A street address is neither functional nor strictly inverse functional, as a person can change street address when moving, and many persons can live at the same address. Thereby, it can be considered neither functional nor inverse-functional and diachronic. However, as for other attributes, it is very unlikely that two homonyms live at the same address, and thus the street address can help in taking matching decisions in combination with other attributes.

18 http://www.ietf.org/rfc/rfc2806.txt - URLs for Telephone Calls

• public institutional id. This feature describes the possible identifiers assigned to a person by a public institution. Although these types of attributes are not guaranteed to be globally unique, as no common convention is shared, they are considered inverse-functional and diachronic, neglecting the very unlikely case in which two equal strings are created by different institutions and refer to different real world persons. This feature clusters social security number, tax code, driving license number, etc.

• title. This feature describes the title of a person. The feature encompasses honorific prefixes, noble titles, and seniority prefixes. This type of property is not functional, nor inverse-functional. However, together with other types of attributes, it can help in taking accurate matching decisions.

• website. This feature describes the URL of the web page about a person. This feature is inverse-functional if the URL refers to a personal web page. However, domain names are not guaranteed to refer to the same web page over time. In fact, it is possible that a domain name once held by a person is then used by another one when the former owner lets the registration expire or simply sells it. Despite this, matching personal web pages very likely refer to the same real world person. This feature also describes the URLs of web pages of a person related to accounts on social media applications or platforms (e.g. Twitter, Facebook, LinkedIn, etc.). This type of information is considered inverse-functional, as the application context guarantees the inverse-functionality of the URL. Even though personal websites cannot be considered strictly diachronic, we decided to neglect this and assume that websites are inverse-functional and diachronic.
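The recurring concern with structural heterogeneity (splitting the birth date into day, month and year) and the use of hashed mailboxes can be made concrete with the following small sketch. The feature names follow the ones introduced above, while the accepted date format, the parsing rules and the example data are purely illustrative assumptions.

```python
import hashlib
from datetime import datetime

def split_birth_date(value, fmt="%Y-%m-%d"):
    """Decompose a birth date into day/month/year features to ease structural
    heterogeneity; the accepted input format is an illustrative assumption."""
    d = datetime.strptime(value, fmt)
    return {
        "birth_date": d.strftime("%Y-%m-%d"),
        "day_of_birth": d.day,
        "birth_month": d.month,
        "birth_year": d.year,
    }

def mbox_sha1sum(email):
    """Hash a mailbox URI as in FOAF's mbox_sha1sum (sha1 of the mailto: URI)."""
    mbox = email if email.startswith("mailto:") else "mailto:" + email
    return hashlib.sha1(mbox.encode("utf-8")).hexdigest()

if __name__ == "__main__":
    print(split_birth_date("1982-02-12"))
    print(mbox_sha1sum("jane@example.org"))
```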

5.5.2 Features for Entity Type Location

The identification ontology includes the following properties (in alphabetic order) for the entity type Location:

• area. This feature describes the area of a location according to some measurement system (meter, foot, etc.). The area of a location is intuitively a functional property. However, as discussed in section 5.2.1, the area is not an identity criterion proper of locations per se, but the result of some definition process: someone or some authority defines the borders of a location, which then allow estimating its area. This implies that, in time, different definitions can change the value of the attribute, making it not reliable for taking matching decisions. However, to a certain extent, this attribute can be useful to take matching decisions when considered together with other attributes. Therefore, we do not consider the area of a location as a functional and diachronic property.

• city. This feature describes the name of the city of a location. This feature aims at representing the location containing the described location at the level of granularity of a city, or town. The city is often used as part of the postal address to refer precisely to a location. The city of a location is a functional property, as any location can be contained in at most one city. Therefore, we assume that the city of a location is a functional and diachronic attribute.

• contains. This feature describes the possible locations or geographical features contained, or partially contained, by the described location. For example, a city may contain several districts or neighborhoods, a region contains cities, etc. The locations contained by a location may be useful to distinguish the location itself from the contained ones when this information is available. The contains property is neither functional nor inverse-functional, as a location can contain many locations, and the same location can also be contained by other locations.

• coordinate geometry. This feature describes the geographical coordinates of a location interpreted as geometrical points rather than points in a coordinate system. In fact, coordinate systems usually present latitude and longitude in this precise order, whereas when we want to represent them on a Cartesian plane we put longitude first, as it represents the value changing on the horizontal dimension, followed by latitude, which represents the value on the vertical dimension. The coordinate geometry value thus represents the coordinates of a location according to the Cartesian convention. The coordinate geometry property is functional and quasi-inverse-functional: many locations could share the same coordinates; for example, a building may contain several apartments or shops which would share bi-dimensional coordinates. However, if the coordinates match and the name of the location matches, we can easily conclude it is the same location.

• country. This feature describes the name of the country of a location. This feature aims at representing the location containing the described location at the level of granularity of a country. As for the city, country information is often used as part of the postal address to refer precisely to a location. The country of a location is a functional property, as any location can be contained in at most one country.

• description. As for the other entity types, this feature generically denotes a textual description of a location.

• domain tag. This feature is aimed at representing an alternative, compact type of description. In fact, the domain tag feature aims at collecting all those attributes that may be useful to describe a location in terms of keywords.

• elevation. This feature describes the elevation of a location with respect to some measurement system (e.g. meter, foot, etc.). The elevation of a location is intuitively a functional property and, despite the fact that it can slightly change in time, it can be considered diachronic. However, when referring to a location, it is important to consider how the elevation measure is estimated: the elevation of a point can be considered clearly functional and diachronic, but we cannot draw the same conclusion when we consider an area.

• first level administrative parent. This feature describes the first level of subdivision of the country containing the described location. The first level of subdivision changes according to the country: for example, Italy's first-level subdivisions are regions, whereas for federal states like Germany, the USA or Brazil the first level of subdivision is the state, etc. This property is functional, as any location can be contained in at most one first level administrative subdivision, if any.

• second level administrative parent. This feature describes the second level of subdivision of the country containing the described location. The second level of subdivision changes according to the country: for example, Italy's second-level subdivisions are provinces, whereas for the USA they are counties, for Germany regions, etc. This property is functional, as any location can be contained in at most one second level administrative parent, if any.

• third level administrative parent. This feature describes the third level of subdivision of the country containing the described location. The third level of subdivision changes according to the country: for example, Italy's third-level subdivisions are municipalities, whereas for Germany they are districts, etc. This property is functional, as any location can be contained in at most one third level administrative parent, if any.

• fourth level administrative parent. This feature describes the fourth and most fine-grained level of subdivision of the country containing the described location. The fourth level of subdivision changes according to the country: for example, Italy's fourth-level subdivisions are fractions or circoscrizioni, whereas for Germany they are municipalities, etc. This property is functional, as any location can be contained in at most one fourth level administrative parent, if any.

• geocoordinate. This feature describes the combination of latitude and longitude of a location to indicate a point in a coordinate system. This property is functional, and as for coordinate geometry, it can be inverse-functional when combined with the location name attribute.

• is contained by. This property denotes the name of the location containing the described location. This property is a sort of bulk container that encompasses all the administrative subdivisions listed above, and also includes levels of subdivision not considered (e.g. groups of islands, neighborhoods, etc.). As the level of the location containing the described location is not explicitly known, it is not possible to consider this property as functional: many locations could be represented as containing another one. Furthermore, the property cannot be considered inverse-functional, as each of the containing locations can contain more than one location. Possible secondary sources can be exploited to categorize containers with respect to their corresponding administrative level, if any, as proposed in [109].

• latitude. This property describes the geographic coordinate latitude of a location. Latitude specifies the north-south position of a point on the Earth with respect to the equator. The numerical value of the latitude can change according to the coordinate system. Some systems are based on the surface of the Earth (e.g. the average sea level), approximated by a geometrical shape (e.g. an ellipsoid), which is then used to compute the angle between the radius from the center of the geometrical shape to the point and the radius to the corresponding point on the equator. There are many approximations of the surface of the Earth, and of its shape, and several standards have been defined and applied in different contexts (e.g. military standards). In most cases, knowing the reference system, it is possible to transform the coordinates from one system to another before comparing them. However, this type of contextual information will seldom be available on the web. The latitude of a location is functional and diachronic, as it cannot change in time.

• latitude degree. This feature describes the degree of a latitude coordinate (e.g. 10◦). Some data sources may present latitude data at a level of granularity distinguishing each of the parts of the coordinate. Representing the features at this level of granularity eases the problems related to structural heterogeneity and improves comparison quality (a minimal conversion sketch is given at the end of this section). The latitude degree is a functional and diachronic property.

• latitude direction. This feature describes the direction of the latitude coordinate with respect to the equator (e.g. north or south). Some data sources may present latitude data at a level of granularity distinguishing each of the parts of the coordinate. Representing the features at this level of granularity eases the problems related to structural heterogeneity and improves comparison quality. The latitude direction is a functional and diachronic property.

• latitude minute. This feature describes the minute of the latitude coordinate (e.g. 10′). Some data sources may present latitude data at a level of granularity distinguishing each of the parts of the coordinate. Representing the features at this level of granularity eases the problems related to structural heterogeneity and improves comparison quality. The latitude minute is a functional and diachronic property.

• latitude second. This feature describes the second of the latitude coordinate (e.g. 10′′). Some data sources may present latitude data at a level of granularity distinguishing each of the parts of the coordinate. Representing the features at this level of granularity eases the problems related to structural heterogeneity and improves comparison quality. The latitude second is a functional and diachronic property.

• location name. This feature describes the name of the location (e.g. Trento). This attribute clusters all possible attributes presenting the name of a location, including ISO 3166 country codes19, alternative names, etc.

• location type. This feature describes generically the type of a location. This feature clusters properties like category, type, feature class, representing information related to the possible categorization of the location.

19 http://www.iso.org/iso/country_codes.htm

• longitude. This property describes the geographic coordinate longitude of a location. Longitude specifies the east-west position of a point on the Earth with respect to the Greenwich meridian. The numerical value of the longitude can change according to the coordinate system. As for the latitude, some systems are based on the surface of the Earth (e.g. the average sea level), approximated by a geometrical shape (e.g. an ellipsoid), which is then used to compute the angle between the radius from the center of the geometrical shape to the point and the radius to the corresponding point on the meridian. There are many approximations of the surface of the Earth, and of its shape, and several standards have been defined and applied in different contexts (e.g. military standards). In most cases, knowing the reference system, it is possible to transform the coordinates from one system to another before comparing them. However, this type of contextual information will seldom be available on the web. The longitude of a location is a functional and diachronic property.

• longitude degree. This feature describes the degree of a longitude coordinate (e.g. 10◦). Some data sources may present longitude data at a level of granularity distinguishing each of the parts of the coordinate. Representing the features at this level of granularity eases the problems related to structural heterogeneity and improves comparison quality. The longitude degree of a location is a functional and diachronic property.

• longitude direction. This feature describes the direction of the longitude coordinate with respect to the Greenwich meridian (e.g. east or west). Some data sources may present longitude data at a level of granularity distinguishing each of the parts of the coordinate. Representing the features at this level of granularity eases the problems related to structural heterogeneity and improves comparison quality. The longitude direction of a location is a functional and diachronic property.

• longitude minute. This feature describes the minute of the longitude coordinate (e.g. 10′). Some data sources may present longitude data at a level of granularity distinguishing each of the parts of the coordinate. Representing the features at this level of granularity eases the problems related to structural heterogeneity and improves comparison quality. The longitude minute of a location is a functional and diachronic property.

• longitude second. This feature describes the second of the longitude coordinate (e.g. 10′′). Some data sources may present longitude data at a level of granularity distinguishing each of the parts of the coordinate. Representing the features at this level of granularity eases the problems related to structural heterogeneity and improves comparison quality. The longitude second of a location is a functional and diachronic property.

• picture URL. This feature describes the URL of a picture depicting a location. As a URL, the value of this attribute is guaranteed to be globally unique in the web space. As for persons, a picture can in principle depict more than one location. However, in this case we assume that the picture refers specifically to the described location, and thus it can be considered inverse functional and diachronic.

• postal code. This feature describes the postal code of a location. This feature, in combination with city, country and street address, allows identifying a location on the earth quite precisely. Unfortunately, postal codes may change in time according to possible reforms of the postal system. Nevertheless, representing the feature at this level of granularity eases the problems related to structural heterogeneity and improves comparison quality. Furthermore, a location may contain more than one postal code, and many locations can have the same postal code. For this reason, we consider postal code neither functional nor inverse-functional.

• street address. This feature describes the street address property of a location. A complete street address of a location can be considered inverse-functional. In fact, a street address uniquely identifies a geographical feature (e.g. a building, or a property), as well as the possible sub-locations that can be contained as part of this location. However, street addresses follow different standards in different parts of the world, and their matching is known to be a complicated problem from a syntactic point of view. Exploiting secondary resources as proposed in [108] could be a viable option to ease this problem.

• timezone. This feature denotes the time zone region containing the described location. The time zone of a location cannot be considered functional in general, as there exist nations spanning different time zones (e.g. the USA, Russia, China, etc.). However, these large countries are also very likely to contain many homonymous city names, and thus, when properties about parent subdivisions cannot be interpreted precisely, differences in time zone can be used to support matching decisions.

• website. This feature describes the URL of the web page about a location. This feature is inverse-functional and diachronic if the URL refers to a web page of the administrative organs of the location or about the location itself.
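Since latitude and longitude may arrive either as decimal values or decomposed into degree, minute, second and direction features, comparing geocoordinates requires a small normalization step. The sketch below converts the decomposed representation into signed decimal degrees; the feature decomposition follows the one described above, while the tolerance-based comparison and the example values are illustrative assumptions.

```python
def dms_to_decimal(degrees, minutes, seconds, direction):
    """Convert degree/minute/second features into signed decimal degrees.
    Direction 'S' or 'W' yields a negative value."""
    value = degrees + minutes / 60.0 + seconds / 3600.0
    return -value if direction.upper() in ("S", "W") else value

def same_point(lat1, lon1, lat2, lon2, tol=1e-3):
    """Rough equality test on coordinates; the tolerance is an arbitrary choice."""
    return abs(lat1 - lat2) <= tol and abs(lon1 - lon2) <= tol

if __name__ == "__main__":
    lat = dms_to_decimal(46, 4, 0, "N")   # roughly the latitude of Trento
    lon = dms_to_decimal(11, 7, 0, "E")
    print(lat, lon)
    print(same_point(lat, lon, 46.0667, 11.1167))
```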

5.5.3 Features for Entity Type Organization

The identification ontology includes the following properties (in alphabetic order) for the entity type Organization:

• activity sector. This feature describes generically the sector or field in which an organization is active. Examples of sectors are industrial sectors, education, gastronomy, tourism, etc. This feature clusters properties such as sport for a team, field of study for an education institute, ideology for a party, musical genre for a band, medical specialities for a hospital, etc. This property is neither functional nor inverse-functional, but it allows performing some sort of inference based on how compatible the activity sectors appearing in the organizations' descriptions are.

• activity start year. This feature describes the year in which an organization started operating actively. This property is functional and diachronic, as the birth year of a person, and it is represented at this level of granularity to ease the problems related to structural heterogeneity in the representation of this type of information (i.e. dates).

• associated with. This feature describes a generic association of an organization with other entities (e.g. organization, brands, etc). This feature clusters properties such as associated acts for musician, brand, spin-off, affiliation, partner, sister companies, etc.

• award. This feature describes the awards or prizes won by an organization. The property of receiving an award is neither functional nor inverse functional, as many organizations can receive the same award. However, in most cases it is very unlikely that homonyms receive the same award at the same time, and thus this property combined with others can be useful to establish matching decisions. Examples of properties clustered around this property are: prizes, awards for bands, titles in sport competitions, best product, etc.

• city. This feature describes the name of the city where an organization operates. This property is not functional for companies that operate in different cities, and it can change in time. It is usually included as part of the street address of an organization. We chose to represent it also as a stand-alone feature, with the goal of easing the issues related to structural heterogeneity.

• color. This feature describes the colors that are associated with an organization. This property is not common to all organizations, but is typical of sport teams or schools. The property is neither strictly functional nor inverse-functional, but it can help in taking matching decisions when two descriptions present the same football team name, such as Inter of Milan in Italy and Inter of Porto Alegre in Brazil, but the colors are different: black and blue for the first, and red and white for the second.

• controlled by. This feature describes the name of the controller of an organization. This property is not functional, and could be considered inverse-functional without considering the time dimension. However, organizations can be acquired and sold continuously, so this property cannot reliably be considered inverse-functional and diachronic. This property clusters properties such as parent company, acquiring organization, and so on. Although it cannot be considered inverse-functional and diachronic, it can still be used to take matching decisions.

• controls. This feature describes the names of the organizations controlled by the one described. This property is not functional, and could be considered inverse-functional without considering all the controlled entities and the time dimension. However, organizations can be acquired and sold continuously, so this property cannot reliably be considered inverse-functional and diachronic. This property clusters properties such as child company, organization acquired, holding, and so on. Although it cannot be considered inverse functional, this type of feature can be useful to take matching decisions.

• country. This feature describes the name of the country where a company is located. This property is not functional for international companies, and it can change in time. It is usually included as part of the street address of a company. We chose to represent it also as a stand-alone feature, with the goal of easing the issues related to structural heterogeneity.

• description. This feature denotes a descriptive text about an organization. This feature clusters properties such as abstract, description, biography, etc.

• dissolution date. This feature describes the dissolution date of an organization. This feature is functional and diachronic, and useful to take matching decisions about historical companies, or to distinguish different companies with the same name but operating at different times.

• domain tag. This feature, as for person and location, aims at presenting keywords describing an organization.

• email address. This feature describes the email address of an organization. Differently from people, this type of information is often available online and can help in taking precise matching decisions, as email addresses are inverse-functional and diachronic with respect to the organization.

• email address hashcode. This feature aims at representing the result of applying a hashing algorithm to the email address. This property is inspired by the FOAF mbox sha1sum20 property of an Agent: “The sha1sum of the URI of an Internet mailbox associated with exactly one owner, the first owner of the mailbox”. As organizations are usually less concerned about privacy, this type of property is likely to be seldom used. As for the email address, this property is also considered inverse functional and diachronic.

• end date. This feature aims at capturing a date that represents the end of an activity, for example the end of an activity in a country, the end of a period of affiliation with some organization, or the end of activity in some sector (e.g. sport). This information alone does not help in capturing generic knowledge related to the context of interpretation. However, it could contribute to some sort of time-related inference/reasoning supporting matching decisions. For example, if the end date of an activity is later than the dissolution date or earlier than the foundation date, then it is possible to produce a negative decision. However, we do not explore this type of constraint in this context.

• fax number. This feature represents the fax number of an organization. This type of property is inverse-functional and diachronic with respect to an organization. In fact, fax numbers are often necessary for effective document communications, and thus are often publicly available information.

• foundation date. This feature describes the foundation date of an organization. This property is functional and diachronic.

• founded by. This feature describes the names of the persons who founded the organization. The property is neither functional nor inverse-functional, but it seems very unlikely that two persons with the same name found organizations with the same name. Thereby, in aggregation with other information, this type of information can be very useful to take matching decisions when available.

• geocoordinate. This feature describes the geocoordinate of the main building where an organization is operating. This type of information is functional but not inverse-functional, as many organizations can operate in the same building. However, if we consider the time dimension, the main building of an organization can change, and thus its functionality is not reliable. Nevertheless, in combination with other features this property can be very useful to take matching decisions.

20http://xmlns.com/foaf/spec/#term_mbox_sha1sum

• has foundation place. This feature describes the location where an organization was first founded. This property is functional and diachronic, and can be considered as the birthplace of an organization. However, given the inherent ambiguity of location names when interpreted in the context of the Web, this property has to be treated carefully.

• has key people. This feature describes the name of a person occupying a key (important) position in an organization. This property is neither functional nor inverse-functional, but it can be useful to take matching decisions in combination with other attributes. The feature clusters properties like coach, CEO, editor, manager, president, commander and any property describing the name of a person in a leading role.

• has location. This feature generically describes the name of a location in which an organization is located. This feature clusters properties like region served, school district, neighborhood, contained by, location, etc. This property is neither functional nor inverse-functional, but it can help in taking matching decisions together with other attributes.

• has members. This feature describes the names of the persons that are known to be members of an organization. This property is neither functional nor inverse-functional when taken singularly. However, considering all members together could be useful to support matching decisions. This feature clusters properties such as member of a band, employees, players, roster, students, etc.

• has parts. This feature describes the name of an organization that is part of the described organization. This property is not functional, and in general could be considered inverse-functional if it were possible to identify the part precisely. However, in general organizational parts of companies are not guaranteed to have a unique name (e.g. agencies, departments, etc.) per se. Thus it seems risky to consider them inverse-functional. This feature clusters properties like departments, divisions, units, branches, bodies and so on.

• involved in. This feature describes the involvement of an organization with respect to some event. This property is neither functional nor inverse-functional, but in combination with other attributes it could contribute to matching decisions. This feature clusters properties like sponsored festivals, exhibitions or conferences, featured movies, convicted in court, involved in a public scandal, and so on.

• is part of. This feature describes the name of some organization (or institution) in which the company participates. Namely, every organization that is part of a larger organization indicates through this property the name of the organization it is part of. This feature clusters properties like league, record label, parent institution, is a component of, and so on.

• latitude. This feature describes the latitude of the main building where an organization is operating. This type of information is functional but not inverse-functional, as many organizations can operate in the same building. However, if we consider the time dimension, the main building of an organization can change and thus its functionality is not reliable. Nevertheless, in combination with other features this property can be very useful to take matching decisions.

• longitude. This feature describes the longitude of the main building where an organization is operating. As for latitude and geocoordinates in general, this type of information is functional but not inverse-functional, as many organizations can operate in the same building. However, if we consider the time dimension, the main building of an organization can change and thus its functionality is not reliable. Nevertheless, in combination with other features this property can be very useful to take matching decisions.

• name. This feature describes any name of the organization. This property clusters properties like organization legal name, company name, operative name, and so on. Legally, a company can have at most one name. However, it is not uncommon to refer to a company through a brand name or other types of names. Thereby, we cannot assume, when it is interpreted on the Web, that the name of a company is functional and diachronic.

• nationality. This feature describes the nationality of an organization. This property was included among the ones resulting from the experiment described in [6]. Many companies nowadays operate in a multi-national context, thus nationality can hardly be interpreted in this context.

• offers. This feature describes the name of anything offered by an organization. This property is aimed at clustering all the properties related to products manufactured, or services provided, by the organization, for example properties such as products, events organized, albums played, drugs, rockets, computers, software programs, and so on. In principle, if the name of the product is not a trademark, we cannot assume that a product name is inverse-functional. However, it seems very unlikely that two different companies with the same name also offer products with the same name. Hence, this type of attribute can help in taking matching decisions together with other attributes.

• organization type. This feature describes the legal type of an organization (e.g. non profit, private school, public company). This feature is neither functional nor inverse-functional, but it can support matching decisions in combination with other attributes.

• participant in. This feature describes the name of an event in which the organization participated as such. Consider for example concerts or festivals for bands, battles for military units, legislative sessions for political parties.

• phone number. This feature describes the phone number of an organization. Similarly to the fax number, this property is likely to be found in publicly available data sources about companies. As previously mentioned, there exist standards for the representation of these types of attributes as URIs21. This property is intuitively inverse-functional when it refers to an organization.

• picture URL. This feature describes the URL of a picture depicting an organization or group. A picture can hardly depict an organization, except for music bands or sports teams. However, pictures could depict organization logos or brands that could help in identifying the organization. Anyway, we assume that the URL of the picture of an organization is inverse-functional and diachronic, and thus that it explicitly refers to the organization.

• postal code. This feature describes the postal code of a company as part of the street address. The postal code of the main street address is functional, but can change over time as the street address of the organization changes. Nevertheless, considering the postal code can help in taking matching decisions.

• previous name. This feature describes the previous names of an organization. In fact, an organization can change its name over time due to mergers, fusions, or simply to renew the brand. This type of information may be useful to take matching decisions.

• public institutional id. This feature describes the public institutional id assigned by national authorities to an organization. This property is inverse-functional in combination with the country, and clusters properties like VAT number, tax code, etc. Inverse-functionality is guaranteed within national borders, but not necessarily outside them, as there is no uniform standard for their definition.

21http://www.ietf.org/rfc/rfc2806.txt - URLs for Telephone Calls

• slogan. This feature describes the slogan of an organization. Slogans, like brands, convey a message that helps in identifying an organization. This type of information is neither functional nor inverse-functional, but can be useful in taking matching decisions.

• start date. This feature aims at capturing a date that represents the start of an activity for an organization, for example the start of an activity in a country, the start of a period of affiliation with some organization, or the start of activity in some sector (e.g. sport). This information alone does not help in capturing generic knowledge related to the context of interpretation. However, it could contribute to some sort of time-related inference/reasoning supporting matching decisions. For example, if the start date of an activity is later than the dissolution date or earlier than the foundation date, then it is possible to produce a negative decision. However, we do not explore this type of constraint in this context.

• street address. This feature describes the street address of the main building where an organization operates. Considering the time dimension, the street address is neither functional nor inverse-functional. In fact, many companies can operate in the same building, and a company can change its street address over time. However, when considered in combination with other attributes it can support matching decisions.

• website. This feature describes the official domain name of the website of an organization. This property is reasonably inverse-functional, even though in principle the same domain name could be owned by different organizations at different points in time. This feature also describes the social media accounts (e.g. LinkedIn, Facebook, Twitter) of an organization. Organizations may want to be part of social media to keep in contact with their members, or to communicate with customers and competitors. Social media URLs are inverse-functional properties, as they are defined according to a common unique standard and guaranteed to be inverse-functional within the boundaries of the social web application.

5.5.4 Remarks About Chosen Features

It is important to remember that in this section we outlined a first but not final set of features, the result of a partial analysis of existing ontologies and aimed at covering only partially the possible space of properties that can be used to identify an entity of type Person, Location and Organization. Pursuing complete coverage is in principle an endless job, as it would require complete knowledge, and would probably also have to deal with many possible inconsistencies. Nevertheless, the approach based on generalization allows us to cover a wide set of properties in existing ontologies. Hence, we pragmatically choose to stop extending the set of features considered, relying on the fact that future applications of the method will lead us towards an incremental specialization of the vocabulary. The set of properties proposed is quite extensive and provides an initial baseline for future improvements.

Future evolutions of the identification ontology types, or specific application scenarios, may need a deeper analysis related to the level of granularity of the features described so far. Ideally, the features should represent disjoint sets of properties. However, this requirement clashes with the need to deal with data represented at different levels of granularity and precision. For example, it would be great to be able to discern among different levels of containment of a location or organization, and not represent both the administrative level and the contained by features. However, this type of information is available in some sources (e.g. geonames) but not necessarily in others (e.g. dbpedia), where containment relations are often represented with multiple instantiations of the same attribute. We believe that at this stage it is better to also include these types of attributes, relying on the fact that in the future we may be able to discern them more precisely.

Along with the definition of the features for each of the types, we also sketched a brief analysis with respect to the possible meta-properties associated with them. In particular, we evaluated each of the features in terms of the meta-properties defined in section 5.4. The analysis of some attributes required the definition of some assumptions, somehow forcing the assignment of meta-properties. In particular, public institutional id, email address and website were forced to be inverse-functional and diachronic, even though in principle these could be reassigned. We are aware that this choice is prone to cause matching errors, but we believe that doing otherwise would also reduce the possibility of taking positive matching decisions. In a sense, considering the time dimension in the assignment of such meta-properties exposed us to some complications we are willing to neglect for the moment. Experimental evaluation will inform us whether this choice has negative consequences.

5.6 Contextual Semantic Harmonization

The heart of a knowledge-based solution is the capability of exploiting the semantics of attributes to take accurate matching decisions using the rules as defined in chapter 6. To achieve this goal, we decided to rely on an Identification Ontology defining the entity types and their features as described in chapter 5. One of the main goals of such an ontology is to provide a point of reference for the harmonization of the semantics of the attributes used in different descriptions. Considering an open and wide environment such as the Web, we have to assume that entities’ descriptions are going to be represented differently across heterogeneous sources. In order to allow the application of the matching rules as defined in chapter 6, the attributes composing such descriptions ought to be mapped towards the Identification Ontology, so that their values can be compared taking into consideration the actual semantics of the attributes. It is clear that this process implies some sort of ontology matching, aimed at establishing mappings between the ontologies and schemas used to shape the collected descriptions and our ontology of reference. Automatic ontology matching is a long-studied problem that has produced a wide set of solutions [59]. However, the automatic solution of the problem is not the goal of this work, and for the moment we limit our analysis to the existence of the mappings between the attributes used in the world to describe the considered entity types and the Identification Ontology, which presents what we call the “canonical name” for such attributes.

It is important to remember that we are not dealing with an ideal scenario from this perspective, and even if data are structured with formal ontologies, we have to be careful in the definition of these mappings. In fact, in an open scenario, we have to assume a certain degree of ontological relativity in the interpretation and usage of attributes [122]. Even though we do not explore the philosophical implications of such relativity, we have to assume that the semantics of the attributes defined in ontologies is subject to the interpretation of the people who instantiate them, forming what we defined as descriptions of entities. This may sometimes cause problems related to the overloading of the semantics of attributes, or odd, contextual interpretations. For example, as a matter of convenience, the MusicBrainz22 schema uses the attributes begin and end to refer respectively to the dates of birth and death of an artist, and to the foundation and dissolution dates of a music band. For these reasons, we have to conceive a contextualized mapping process allowing us to accommodate pragmatic needs related to possibly diverse interpretations of the properties defined in ontologies and other types of schemas. The definition of such contextualized mappings follows the intuition behind the definition of Contextual Ontologies presented in [29]. In fact, what we propose is a semantic harmonization process specific to an entity matching solution, relying on the existence of mappings defined for this goal. In a sense, in this work we propose to rely on a local interpretation of shared vocabularies and schemas to produce local mappings onto a local language (i.e. the identification ontology) as defined in [29]. This allows us to conceive mappings as bridge rules supporting our localized and task-specific interpretation of the global semantics of the ontologies and schemas used to define the semantics of attributes in the collected descriptions.

22http://musicbrainz.org

With these premises, we define M as a set of contextual mappings, where each mapping m ∈ M is a pair of attribute names (α, αc) in which αc represents the canonical name for an attribute defined in the Identification Ontology. Following the syntax used in [29], the mappings m are bridge rules of the following types:

• o : x −→⊆ i : y;

• o : x −→⊇ i : y;

• o : x −→≡ i : y;

where x and y are either concepts or properties. Intuitively, these mappings translate the global semantics of the attributes collected and available on the Web into the local semantics defined in the Identification Ontology. The combination of the collected attributes and the local mappings generates a context space, in which we can interpret the semantics of the attributes relying on the defined contextual mappings, and thus transform descriptions composed of attributes defined according to a global semantics into descriptions composed of attributes defined according to a local, contextual semantics. We name this process the semantic harmonization of descriptions, and define h : D × M → D as the function that takes in input a description d ∈ D and a set of contextual mappings M, and returns the description with the attribute names α replaced by the canonical names αc defined in the Identification Ontology. To make the concept even more explicit, consider the following pseudocode:

harmonize(d in D, M){
  for each a in d{
    if(M.contains(a.n)){
      canonical = M.get(a.n);
      a.replace(n, canonical);
    }
  }
}
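As a complement to the pseudocode above, the following Java sketch shows one possible implementation of the harmonization function h. It is illustrative only: the class and method names (Harmonizer, harmonize) are ours, descriptions are simplified to single-valued attribute maps, and the example mappings (begin towards foundation date for a MusicBrainz-like source) are used purely for demonstration.

import java.util.HashMap;
import java.util.Map;

// Illustrative sketch: harmonization of attribute names towards the
// Identification Ontology. Names and mappings are hypothetical.
public class Harmonizer {

    // contextual mappings M: external attribute name -> canonical name
    private final Map<String, String> mappings;

    public Harmonizer(Map<String, String> mappings) {
        this.mappings = mappings;
    }

    // h(d, M): replace every attribute name that has a contextual mapping
    // with its canonical name; unmapped attributes are left untouched.
    public Map<String, String> harmonize(Map<String, String> description) {
        Map<String, String> harmonized = new HashMap<>();
        for (Map.Entry<String, String> attribute : description.entrySet()) {
            String canonical = mappings.getOrDefault(attribute.getKey(), attribute.getKey());
            harmonized.put(canonical, attribute.getValue());
        }
        return harmonized;
    }

    public static void main(String[] args) {
        // hypothetical contextual mappings for a MusicBrainz-like source
        Map<String, String> m = new HashMap<>();
        m.put("begin", "foundation date");
        m.put("end", "dissolution date");

        Map<String, String> d = new HashMap<>();
        d.put("name", "Example Band");
        d.put("begin", "1985");

        // the attribute "begin" is now reported under its canonical name "foundation date"
        System.out.println(new Harmonizer(m).harmonize(d));
    }
}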

Easing the problem of semantic heterogeneity is a key point in the path towards the definition of a knowledge-based solution relying on rules to take entity matching decisions. In fact, in order to compare the attributes considering their semantics, we first need to harmonize the semantics of the properties towards the ones defined in section 5.5.

Chapter 6

Rules for Open Entity Matching

In this chapter, we aim at formally defining the syntax and interpretation of matching rules as necessary and sufficient conditions to support entity matching decisions under the Open World Assumption. In section 6.1 we present a theoretical framework that supports the formulation of matching rules suitable to be employed as part of a knowledge-based solution to the problem of open entity matching. In section 6.2 we propose a set of tools that support the definition, application and satisfaction of entity matching rules. As mentioned in chapter 4, we aim at constructing rules capturing part of the knowledge used by people in dealing with the entity matching problem. This choice implies the adoption of machine learning techniques (i.e. classifiers), whose application may require performing some operations on the learned rules. For this reason, in sections 6.3 and 6.4 we describe some formal tools to normalize, combine and merge the extracted rules. The process of constructing entity matching rules is described in depth in chapter 8.

6.1 Theoretical Foundations

The definition of rules suitable to be applied in the context of the Web under the Open World Assumption requires some sophistication related to the logic underlying their definition, application and satisfaction. In fact, on the one hand we want to avoid taking a negative match decision when positive matching conditions are not satisfied. On the other hand, we want to avoid a positive matching decision when a negative matching condition is not satisfied. In a sense, what we need to formalize is something that practically invalidates the classical logic Law of Excluded Middle MATCH ∨ ¬MATCH [154]. In fact, under the Open World Assumption, if a positive matching condition is not satisfied, we cannot automatically conclude a negative match. Furthermore, we would also need to invalidate the axiom of Double Negation Elimination, ¬¬MATCH → MATCH.


In fact, the falsification of a negative matching condition should not lead to a positive match decision. Both the invalidation of the Law of Excluded Middle and that of Double Negation Elimination are among the principles underlying the Intuitionistic Logic defined in [146]. However, in Intuitionistic Logic the cases where no decision can be taken are not explicitly formalized, maintaining a boolean-valued semantics. Therefore, we choose to represent entity matching rules relying on Kleene's Three-Valued Logic [97] as a formalization tool. The choice is due to the fact that the three-valued logic can smoothly and explicitly accommodate the unknown matching cases, and provides a clear and intuitive interpretation of the usage of logical operators (i.e. connectives) for the definition of rules. Using the logic of Kleene, we could for example define an identification rule for a person conveniently represented as a simple conjunction of clauses based on the features defined in the ontology, for example (Name ∧ Surname ∧ Birthday ∧ Birthplace), assuming that this combination of attributes leads to unique identification when interpreted in the Web context.

According to the Kleene logic, the truth value of each of the clauses, and consequently of the whole rule, can be either TRUE, FALSE or UNKNOWN. The UNKNOWN case allows us to accommodate decisions related to the comparison of syntactically heterogeneous attributes. In fact, the solution of any entity matching problem must necessarily pass through some sort of string similarity estimation between the values of the features composing a description. Therefore, the truth value of each of the clauses (or atoms) composing a matching rule must necessarily pass through the comparison between a string similarity value and a threshold. Traditional boolean comparison operators (e.g. '>') would force the falsification of a clause when not satisfied. As a rule is defined as a conjunction of attributes, this would also imply the falsification of the whole rule. For this reason, we need to interpret these operators in a way that does not allow this conclusion, consistently with our goal.

It is important to remember that in this context we are not defining rules that have to be applied in a formal logical setting, and that we are simply using logical tools to formally describe the business logic that will then be implemented as a traditional software program, as described in [103, 79]. For example, in order to practically invalidate the Law of Excluded Middle and Double Negation Elimination, we decided to define two complementary sets of rules PM and P¬M leading respectively to positive and negative matching decisions. Each set of rules PM and P¬M can be intuitively interpreted as a Kleene propositional logic formula in disjunctive normal form (DNF) that allows taking different, complementary matching decisions. The satisfaction of a single rule (i.e. a conjunction of clauses) implies the satisfaction of the whole formula, and thus supports a matching decision. The actual implementation and application of the rules defined in this section will be discussed more in depth in chapters 8 and 9.
For the moment, it is sufficient to remember that our goal is the following: given any set of pairs of descriptions, apply a set of entity matching rules so as to support a positive matching decision when a positive matching rule is satisfied, a negative matching decision when a non-match rule is satisfied, and unknown in any other case. The following paragraphs are aimed at the formalization of the tools necessary to implement the intuitive logic described above into a system of rules. Notice that for this formalization we do not need to distinguish between positive and negative matching rules, therefore we simply consider P as a generic set of all rules.
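For concreteness, the following sketch shows the strong Kleene connectives underlying this interpretation of the rules; the Java enum and its method names are ours and are meant only to illustrate how the UNKNOWN value propagates through conjunctions and disjunctions.

// Illustrative sketch of Kleene's strong three-valued connectives.
public enum Kleene {
    TRUE, UNKNOWN, FALSE;

    // strong Kleene conjunction: FALSE dominates, otherwise UNKNOWN is contagious
    public Kleene and(Kleene other) {
        if (this == FALSE || other == FALSE) return FALSE;
        if (this == UNKNOWN || other == UNKNOWN) return UNKNOWN;
        return TRUE;
    }

    // strong Kleene disjunction: TRUE dominates, otherwise UNKNOWN is contagious
    public Kleene or(Kleene other) {
        if (this == TRUE || other == TRUE) return TRUE;
        if (this == UNKNOWN || other == UNKNOWN) return UNKNOWN;
        return FALSE;
    }

    public static void main(String[] args) {
        // a rule in DNF supports a decision only if one conjunction evaluates to TRUE
        System.out.println(TRUE.and(UNKNOWN));  // UNKNOWN: no decision is forced
        System.out.println(TRUE.or(UNKNOWN));   // TRUE
    }
}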

6.2 Rules Definition, Application and Satisfaction

Let us define a matching rule as a conjunction of rule atoms θ = ⟨α, o, t⟩, where α is a feature defined in the Identification Ontology, o ∈ O is an operator among O = {=, >, <, ≤, ≥} interpreted in the context of the Kleene logic, and t is a similarity threshold in the range [0, 1] to be used as the term of comparison for the satisfaction of the atom according to the operator o. More formally, a rule ρ ∈ P can be defined as:

ρ = ⋀_{i=0}^{n} θi (6.1)

A rule ρ applies to a pair of descriptions d1, d2 ∈ D if the intersection of the attributes α composing the descriptions d1 and d2 contains all the features composing ρ. We then need to formalize a few simple functions that allow us to define the function applyρ, which estimates whether a rule can be applied to a pair of descriptions. First of all, we need a function that, given a rule, extracts the feature types composing it. More formally, let us define δρ : P → A as the function that, given a rule, returns the set of attributes composing it:

δρ(ρ) = {α ∈ A | ∃θ ∈ ρ : α ∈ θ}. (6.2)

Then we need a function that, given a description, extracts the features composing it. Using the metaphor of fingerprint analysis described at the beginning of chapter 4, this step corresponds to identifying the features that are present on a fingerprint, or in this case, in a description. Let us then also define the function δd : D → A as the function that, given a description d ∈ D, returns the set of attribute names α composing it:

δd(d) = {α ∈ A | ∃a ∈ d : α ∈ a}. (6.3)

Given these two functions, δρ and δd, we can now define the function that verifies whether a rule can be applied to a pair of descriptions. Intuitively, we can simply consider the sets of attributes resulting from applying these functions to the descriptions and to the rule: if the set of attributes composing the rule, obtained by applying δρ, is a subset of the set of attributes in common between the two compared descriptions, then we can assert that the rule applies. More formally, let us define applyρ : P × D × D → B as the boolean function that, given a rule and two descriptions, verifies whether the rule applies for comparing the two descriptions:

applyρ(ρ, d1, d2) = true, if δρ(ρ) ⊆ (δd(d1) ∩ δd(d2)); false, otherwise. (6.4)

Notice that the process of selecting the attributes that are relevant for the application of a rule also helps to ease the problem of over-specification described in chapter 2. In fact, by selecting only the relevant attributes for comparison, we avoid comparing special purpose attributes which are not interpretable in a global context. Now that we have defined when a rule applies to a pair of descriptions, we need to complete the set of tools and define the function that decides when a matching rule is satisfied when comparing two descriptions. In particular, a rule ρ ∈ P is satisfied when:

1. it can be applied to a pair of descriptions d1 and d2;

2. all the atoms θi ∈ ρ are satisfied;
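To make the application test of equation (6.4) concrete, the following illustrative Java sketch checks whether the features of a rule are contained in the features shared by two descriptions; the record and method names are hypothetical, and descriptions are modelled simply as maps from harmonized feature names to values.

import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Illustrative sketch of the apply_rho check of equation (6.4).
public class RuleApplication {

    // an atom <alpha, o, t>: feature name, Kleene comparison operator, threshold
    record Atom(String feature, String op, double threshold) {}

    // delta_rho: the set of features composing a rule (a conjunction of atoms)
    static Set<String> ruleFeatures(Iterable<Atom> rule) {
        Set<String> features = new HashSet<>();
        for (Atom atom : rule) features.add(atom.feature());
        return features;
    }

    // apply_rho(rho, d1, d2): only the keys (harmonized feature names) matter here
    static boolean applies(Iterable<Atom> rule, Map<String, ?> d1, Map<String, ?> d2) {
        Set<String> common = new HashSet<>(d1.keySet());
        common.retainAll(d2.keySet());              // delta_d(d1) ∩ delta_d(d2)
        return common.containsAll(ruleFeatures(rule));
    }
}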

So far, we have defined the tools supporting an analysis of whether a rule can be applied to a pair of descriptions. The next step is to formally define the functions that support decisions about whether a single atom is satisfied. This serves as a tool to estimate whether a rule is satisfied or not. An atom θ is satisfied if and only if, among all the values of the attributes ai ∈ d1 and aj ∈ d2 of type α with α ∈ θ, the comparison of the values vi ∈ ai and vj ∈ aj according to some string similarity metric produces a score satisfying the Kleene operator o ∈ θ with respect to the threshold t ∈ θ. Hence, we first have to define the function

κ : Ω × [s1, ..., sn] × [s1, ..., sm] → [0, 1] ⊂ ℜ (6.5)

that, given two lists of strings s1, ..., sn and s1, ..., sm with si ∈ S representing the values of semantically equivalent features, returns a similarity measure between 0 and 1 according to a similarity metric ω ∈ Ω. In this context, a function ω ∈ Ω represents a string similarity metric, possibly among the ones presented in section 3.3 of the state of the art. The function κ can be interpreted as some sort of second-order function, abstracting the application of whatever string similarity metric is selected, and embedding a process of comparison of semantically equivalent features.

Given a κ function, we now need to define the function that decides whether a rule atom θ is satisfied or not. Reminding that any atom θ is defined as the tuple ⟨α, o, t⟩, we need to define the function that, given the result of a comparison obtained by the application of κ, verifies whether the similarity score of the two strings satisfies the operator o. Keeping in mind our primary goal of not falsifying any matching rule that cannot be completely satisfied, let us define satisfyo : ℜ × ℜ × O → {TRUE, UNKNOWN} as the function that, given two positive real numbers r1, r2 ∈ ℜ and a comparison Kleene operator o ∈ O, returns the result of the test:

satisfyo(r1, r2, o) = true, if o(r1, r2) = true; unknown, otherwise. (6.6)

The satisfyo function is also a second-order function, allowing the application of different operators according to need. At this point, given a function κ to compute the similarity between two strings, and a function satisfyo to verify whether a pair of real numbers satisfies a comparison Kleene operator, we can define a function satisfyθ that, given two lists of values of a feature f, verifies whether a rule atom is satisfied.

More formally, let us define satisfyθ : Θ × [vf1, ..., vfn] × [vf1, ..., vfm] × Ω → {TRUE, UNKNOWN} as the function:

satisfyθ(θ, V(f1), V(f2), ω) = true, if satisfyo(κ(ω, V(f1), V(f2)), t, o) = true with t, o ∈ θ; unknown, otherwise. (6.7)

assuming that (αθ ∈ θ ∧ α1 ∈ a1 ∧ α2 ∈ a2 ∧ αθ = α1 = α2), namely assuming that the attribute types of the features f1, f2 and of the atom θ are the same. Notice that V(f) is a compact syntactic representation of the vector of values of a feature f in a description.

Now that we have defined the function satisfyθ, we can proceed to define the function that verifies whether a rule is satisfied given two descriptions. Intuitively, this function must simply verify that, given a pair of descriptions, when they are compared the atoms of the rule are satisfied. We can then define satisfyρ : P × D × D → {TRUE, UNKNOWN} as the function that takes in input a rule ρ ∈ P and two descriptions d1 ∈ D and d2 ∈ D:

satisfyρ(ρ, d1, d2) = true, if ∀θ ∈ ρ, ∃f1 ∈ d1, f2 ∈ d2 such that satisfyθ(θ, V(f1), V(f2), ω) = true; unknown, otherwise. (6.8)

assuming that applyρ(ρ, d1, d2) holds. In this section we have formally defined the tools necessary to apply and satisfy a matching rule under the Open World Assumption. These tools are generic, and apply both to positive and negative matching rules.
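The following illustrative Java sketch puts together κ, satisfyo and satisfyρ as defined above; the names are ours, the aggregation inside κ (taking the best score among all value pairs) is one possible choice, and any string similarity metric from section 3.3 could be plugged in as ω.

import java.util.List;
import java.util.Map;
import java.util.function.BiFunction;

// Illustrative sketch of satisfy_theta and satisfy_rho (equations 6.7 and 6.8):
// an unsatisfied atom yields UNKNOWN, never FALSE, so that rules are not
// falsified under the Open World Assumption.
public class RuleSatisfaction {

    enum Outcome { TRUE, UNKNOWN }

    record Atom(String feature, String op, double threshold) {}

    // kappa: best similarity score among all pairs of values of a feature
    static double kappa(BiFunction<String, String, Double> omega,
                        List<String> values1, List<String> values2) {
        double best = 0.0;
        for (String v1 : values1)
            for (String v2 : values2)
                best = Math.max(best, omega.apply(v1, v2));
        return best;
    }

    // satisfy_o: the comparison either holds (TRUE) or leaves us undecided (UNKNOWN)
    static Outcome satisfyO(double score, double threshold, String op) {
        boolean holds = switch (op) {
            case ">"  -> score > threshold;
            case ">=" -> score >= threshold;
            case "<"  -> score < threshold;
            case "<=" -> score <= threshold;
            default   -> score == threshold;
        };
        return holds ? Outcome.TRUE : Outcome.UNKNOWN;
    }

    // satisfy_rho: TRUE only if every atom is satisfied on the two descriptions
    // (assuming apply_rho already holds); UNKNOWN in every other case.
    static Outcome satisfyRule(List<Atom> rule,
                               Map<String, List<String>> d1,
                               Map<String, List<String>> d2,
                               BiFunction<String, String, Double> omega) {
        for (Atom atom : rule) {
            double score = kappa(omega, d1.get(atom.feature()), d2.get(atom.feature()));
            if (satisfyO(score, atom.threshold(), atom.op()) != Outcome.TRUE)
                return Outcome.UNKNOWN;
        }
        return Outcome.TRUE;
    }
}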

6.3 Rules Normalization

In the previous section we formally defined entity matching rules, and some tools to apply them and verify their satisfaction. This section presents a set of normalization operations aimed at reducing possible inconsistencies or counterintuitive results of bottom-up rule extraction with respect to the given formalization. Each of the following subsections presents a normalization operation aimed at fixing a specific type of oddity we can find in learned rules, in order to normalize them and make them consistent with the formalization defined. The normalization steps take into consideration single atoms and sets of atoms to produce normalized versions of the rules. Some of the normalization processes outlined in the following are the result of intuition and heuristics. Therefore, the impact of such heuristics will have to be empirically evaluated through experiments on real data.

6.3.1 Atom Operator Normalization

Given a positive matching rule, it seems counterintuitive to find an atom with a comparison operator stating that a value must be below a similarity threshold, e.g. description < 0.6. In fact, this would imply that a positive matching rule could be satisfied only when two attributes are necessarily different. This type of atom can be extracted from datasets presenting very heterogeneous values for the same attribute types. Indeed, attributes such as “description” could contain perfectly matching values for negative matching samples and, at the same time, very poorly matching values for positive matching samples. This can be due to the fact that negative matching descriptions contain short values for some attributes, whereas positive matching samples compare strings of very different length and consistency (e.g. a whole paragraph compared with a single sentence). These samples might create some trouble in the learning process. An atom is consistent if it is coherent with the matching decision supported by the rule. If the rule is a positive matching rule and the operator o ∈ θ is among {>, ≥}, the atom is coherent, otherwise it is not. If the rule is a negative matching rule and the operator o ∈ θ is among {<, ≤}, the atom is coherent, otherwise it is not. For the reasons above, we need to normalize the rules to correct the inconsistent operators extracted. There are two options:

1. remove the atom from the rule;

2. normalize the operator according to the rule decision;

The first option is more radical, as the removal of an atom from the rule implies not considering that attribute for the matching decision, since it is clearly not reliable for taking similarity-based matching decisions. An alternative interpretation is that this attribute is not relevant for matching decisions in combination with the others. If we apply this principle consistently, given the high degree of heterogeneity and noise we can possibly find in the data, we are likely to define shorter rules. Discarding atoms could be useful to avoid considering noisy attributes in the entity matching decision process. More formally, let us define the normalization function ηR : P → P that, given a rule ρ, removes the atoms θ inconsistent with the decision supported by the rule. The second option is more conservative, as it does not discard the fact that the classifier considered that attribute discriminant for taking precise matching decisions.

More formally, let us define the normalization function ηC : P → P that, given a rule ρ, changes the operator of the atoms θ incoherent with the decision supported by the rule. If the rule is a positive matching rule and the operator o ∈ θ is among {<, ≤}, the operator should be changed into ≥. If the rule is a negative matching rule and the operator o ∈ θ is among {>, ≥}, the operator should be changed into ≤. Changing the sign of the operator according to the rule decision could help in building robust and precise rules. However, if the classifier split the dataset based on a negative similarity threshold, by being conservative we risk having a robust rule that may not get satisfied when matching noisy descriptions. Whereas by removing the inconsistent atom, we risk creating short rules that produce unreliable matching decisions. As previously mentioned, we need to experimentally evaluate the impact of both options.
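The two options can be sketched as follows; the Java names (etaR, etaC) are ours and serve only to illustrate the behaviour of the ηR and ηC normalizations on a rule represented as a list of atoms.

import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of the atom-operator normalizations of section 6.3.1.
public class OperatorNormalization {

    record Atom(String feature, String op, double threshold) {}

    static boolean coherent(Atom atom, boolean positiveRule) {
        return positiveRule ? (atom.op().equals(">") || atom.op().equals(">="))
                            : (atom.op().equals("<") || atom.op().equals("<="));
    }

    // eta_R: drop incoherent atoms (may yield shorter, less specific rules)
    static List<Atom> etaR(List<Atom> rule, boolean positiveRule) {
        List<Atom> normalized = new ArrayList<>();
        for (Atom atom : rule)
            if (coherent(atom, positiveRule)) normalized.add(atom);
        return normalized;
    }

    // eta_C: keep the atom but force the operator to agree with the decision
    static List<Atom> etaC(List<Atom> rule, boolean positiveRule) {
        List<Atom> normalized = new ArrayList<>();
        for (Atom atom : rule)
            normalized.add(coherent(atom, positiveRule)
                    ? atom
                    : new Atom(atom.feature(), positiveRule ? ">=" : "<=", atom.threshold()));
        return normalized;
    }
}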

6.3.2 Transitive Operator Normalization

The previous normalization steps on similarity operators could produce rules ρ containing several atoms θ with the same feature f but different operators and thresholds, e.g. name > 0.3 ∧ name > 0.5 ∧ name > 0.8 ∧ birthday > 0.8 ∧ birthplace > 0.9. This type of rule demands redundant attribute evaluations to establish its satisfiability. That is, the attribute name in the example would have to satisfy 3 atoms on the same value. However, intuitively the three comparisons become redundant considering the transitivity property of the operator >. Indeed, if the name comparison is tested to be larger than 0.8, then it is also implicitly larger than 0.3 and 0.5. To avoid these redundant evaluations, we propose to normalize the rules according to the transitivity property of the operators used in the single atoms. Given a rule ρ with more than one atom θ containing a feature f with a compatible operator o, i.e. |{θ ∈ ρ | f, o ∈ θ}| > 1, we need to define a normalization function ηT to perform the transitive reduction of the rule, thus producing a minimal representation of the rule. The transitive reduction of a rule ρ ∈ P on a relation (or operator) r can be built by simply choosing greedily, among the atoms with the same property and compatible operators, the one with the maximum threshold for operators {>, ≥}, and the one with the minimum threshold for operators {<, ≤}. More formally, let us define the function

πθ : P × {o1, ..., om} × F → Θ[...] that, given a rule, a set of operators and a feature, selects the atoms containing one of the operators and the feature. Let us define the function normθ : Θ[...] → Θ as the function that, given an array of atoms, selects the one with the minimum (minT) or maximum (maxT) threshold according to the operator:

normθ([θ1, ..., θn]) = minT([θ1, ..., θn]), if ∀i (θi.o ∈ {<, ≤}); maxT([θ1, ..., θn]), if ∀i (θi.o ∈ {>, ≥}). (6.9)

Therefore, the transitive reduction produces rules preserving the atom with the highest similarity threshold for positive matching rules, and the atom with the lowest similarity threshold for negative matching rules. This operation does not affect the actual effectiveness of a rule in supporting matching decisions, but rather simplifies the computation of its possible satisfaction.
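A minimal sketch of the transitive reduction, assuming rules are represented as lists of atoms and grouping atoms by feature and operator direction, could look as follows (names are ours and illustrative only):

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch of the transitive reduction of section 6.3.2: for each
// feature, among atoms with compatible operators only the most restrictive
// threshold is kept (maximum for >, >=; minimum for <, <=).
public class TransitiveReduction {

    record Atom(String feature, String op, double threshold) {}

    static List<Atom> reduce(List<Atom> rule) {
        Map<String, Atom> kept = new HashMap<>();
        for (Atom atom : rule) {
            // group atoms by feature and operator "direction"
            boolean lowerBound = atom.op().equals(">") || atom.op().equals(">=");
            String key = atom.feature() + (lowerBound ? "/lower" : "/upper");
            Atom previous = kept.get(key);
            if (previous == null
                    || (lowerBound && atom.threshold() > previous.threshold())
                    || (!lowerBound && atom.threshold() < previous.threshold())) {
                kept.put(key, atom);
            }
        }
        return new ArrayList<>(kept.values());
    }

    public static void main(String[] args) {
        List<Atom> rule = List.of(
                new Atom("name", ">", 0.3), new Atom("name", ">", 0.5),
                new Atom("name", ">", 0.8), new Atom("birthday", ">", 0.8));
        System.out.println(reduce(rule)); // keeps name > 0.8 and birthday > 0.8
    }
}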

6.4 Rules merging

The previous normalization steps, including the transitive reduction on Kleene operators, may produce rules whose application is subsumed by others. In this case, in order to optimize the set of rules and enforce their soundness, we may need to merge these rules. It is important to remember that the main requirement driving the definition of rules is soundness. Namely, the rules that are defined must lead to precise and reliable decisions. Given the problems caused by imprecise matching methods producing unreliable owl:sameAs statements, we affirm that completeness is not so relevant, meaning that it is not essential to define rules covering all possible matching cases. Intuitively, there are two cases that could necessitate the merging of rules:

• one rule is subsumed by the other;

• two rules are equivalent;

A formal definition of rule subsumption is presented in section 6.4.1. A merging principle based on rule subsumption is presented in section 6.4.2, and finally in section 6.4.4 the merging of equivalent rules is treated.

6.4.1 Rules subsumption

We define ρ2 ⊑ρ ρ1, read “a rule ρ1 subsumes another rule ρ2”, when the set of pairs of descriptions to which ρ2 applies is a proper subset of, or equivalent to, the set of pairs of descriptions to which ρ1 applies. More formally, we define Dρ as the set of description pairs to which a rule ρ applies:

Dρ(ρ, D) = {d1, d2 ∈ D | applyρ(ρ, d1, d2)} (6.10)

with applyρ as defined in equation (6.4). We can now define rule subsumption in terms of the sets of descriptions to which a rule applies. That is, if a rule ρ2 applies to a subset of the description pairs to which a rule ρ1 applies, then we say that ρ1 subsumes ρ2. More formally:

ρ2 ⊑ρ ρ1 ⇔ ∀D (Dρ(ρ2, D) ⊆ Dρ(ρ1, D)). (6.11)

This pragmatic definition of rule subsumption allows us to formalize a hierarchy of rules with respect to their level of generality. At the top of this hierarchy it is possible to find the most general rule, i.e. the empty rule ρT such that, given any set of pairs of descriptions d1, d2 ∈ D, the rule applies to all of them, that is ∀D, Dρ(ρT, D) = D. At the bottom of the hierarchy lies the most specific rule ρB, such that it does not apply to any pair of descriptions, i.e. ∀D, Dρ(ρB, D) = ∅. Furthermore, we define two rules to be equivalent when they subsume each other. More formally,

ρ2 ⊑ρ ρ1 ∧ ρ1 ⊑ρ ρ2 ⇔ ρ2 ≡ρ ρ1. (6.12)

With equations (6.11) and (6.12), we have formally defined the notion of rule subsumption. We can now use this notion to formally define a merging process, as outlined in the following section.
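Since applyρ only depends on the features a rule mentions, subsumption can be checked, for arbitrary description sets, by comparing the feature sets of the two rules: the rule mentioning fewer features applies to more description pairs. The following Java sketch (illustrative only, names are ours) reflects this reading of equations (6.11) and (6.12).

import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative sketch of rule subsumption via feature-set containment.
public class Subsumption {

    record Atom(String feature, String op, double threshold) {}

    static Set<String> features(List<Atom> rule) {
        Set<String> f = new HashSet<>();
        for (Atom atom : rule) f.add(atom.feature());
        return f;
    }

    // true iff rho1 subsumes rho2, i.e. rho2 ⊑_rho rho1:
    // every feature required by rho1 is also required by rho2
    static boolean subsumes(List<Atom> rho1, List<Atom> rho2) {
        return features(rho2).containsAll(features(rho1));
    }

    // rho-equivalence: the two rules subsume each other
    static boolean equivalent(List<Atom> rho1, List<Atom> rho2) {
        return subsumes(rho1, rho2) && subsumes(rho2, rho1);
    }
}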

6.4.2 Merging ρ-subsumed rules

Given a hierarchy of rules based on the rule subsumption relation as defined in section 6.4.1, we have a principle to define a merging function supporting the definition of robust and specific, or very general, matching rules. Following the principle that soundness of matching rules is more important than completeness, in this section we propose to merge the matching rules in favor of the most specific rules. However, given the definition of subsumption, we can just as easily produce generic rules. Intuitively, the definition of sound rules implies choosing that the longer rule will always be maintained, and the shorter one will be merged into the longer one. More formally, given the rules ρ1 = a ∧ b ∧ c and ρ2 = a ∧ b ∧ c ∧ d, then ρ2 ⊑ρ ρ1, as there may exist a description containing only the features a, b, and c to which rule ρ2 does not apply. Therefore, we need a function that, given two rules, decides which one has to be merged into the other. We call pivot rule the rule that persists, and merged rule the rule merged into the pivot. With the purpose of choosing the pivot between two rules, we define the function µr : P × P → P that, given two rules, chooses as pivot the more restrictive one based on subsumption:

µr(ρ1, ρ2) = ρ2, if ρ2 ⊑ρ ρ1; ρ1, if ρ1 ⊑ρ ρ2. (6.13)

Conversely, let us define µu : P × P → P as the function that, given two rules, chooses as pivot the more unrestrictive one based on subsumption:

µu(ρ1, ρ2) = ρ1, if ρ2 ⊑ρ ρ1; ρ2, if ρ1 ⊑ρ ρ2. (6.14)

Both the restrictive and the unrestrictive merging pivot selectors assume that the merged rules have the same classification target. That is, for the moment, positive matching rules are merged with positive matching rules only, and negative matching rules with negative matching rules only. Notice that if the rules are ρ-equivalent, i.e. ρ1 ≡ρ ρ2, the selection of the pivot does not matter.

In the merging process, once we have selected the pivot using either the µr or the µu function, we need to perform the actual merging of the rules, dealing with rule atoms presenting different operators and thresholds. Reminding that the principle driving the extraction of rules is soundness, equivalent rules with different similarity thresholds should be merged to be more restrictive. That is, two mergeable positive matching rules presenting atoms with different similarity thresholds should be merged selecting the higher threshold. Conversely, negative matching rules presenting atoms with different similarity thresholds should be merged selecting the lower similarity threshold. Thereby, a pivot selection function for atoms in merged positive matching rules, µθ≥ : Θ × Θ → Θ, would be:

µθ≥(θ1, θ2) = θ1, if θ1.f = θ2.f ∧ θ1.t ≥ θ2.t; θ2, if θ1.f = θ2.f ∧ θ2.t ≥ θ1.t. (6.15)

Whereas a pivot selection function for atoms in merged negative matching rules, µθ≤ : Θ × Θ → Θ, would be:

µθ≤(θ1, θ2) = θ1, if θ1.f = θ2.f ∧ θ1.t ≤ θ2.t; θ2, if θ1.f = θ2.f ∧ θ2.t ≤ θ1.t. (6.16)

At this point, we have formally defined tools for the selection of pivot rules in the merging process, and tools for the selection of atoms to complete the merging process. Intuitively, the merging of two rules now consists in simply selecting the pivot using either µr or

µu, and, for each of the atoms of the pivot rule, selecting the one with the more restrictive threshold. Keeping in mind the overall goal of extracting entity matching rules that could be employed in a decision process suitable for open entity matching, we also have to deal with the fact that different rules might define different thresholds for the satisfaction of rule atoms with the same feature. In fact, different datasets could present syntactically very heterogeneous values for the same attribute type. For example, the attribute “description” tends to present information in textual form of different lengths, attributes of type “date” could be represented in different formats, the attribute “name” could be represented with or without title or middle name, etc. Thereby, the similarity thresholds on different datasets could change considerably for the same feature. This aspect may be mitigated by filtering out particularly heterogeneous attributes. However, converging perfectly on a single similarity threshold is unlikely to happen, and it does not make sense to consider different thresholds for the same feature in different rules.
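The following illustrative sketch shows one possible realization of this merging step; the pivot is approximated here as the rule with more atoms (the more restrictive one under the assumptions above), and all names are ours.

import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of the merging step of section 6.4.2: mu_r selects the
// more restrictive rule as pivot, and for every atom of the pivot the more
// restrictive threshold found in either rule is kept (higher for positive
// rules, lower for negative rules).
public class RuleMerging {

    record Atom(String feature, String op, double threshold) {}

    // mu_r approximated: the rule with more atoms (the subsumed one) is the pivot
    static List<Atom> restrictivePivot(List<Atom> rho1, List<Atom> rho2) {
        return rho1.size() >= rho2.size() ? rho1 : rho2;
    }

    static List<Atom> merge(List<Atom> rho1, List<Atom> rho2, boolean positiveRule) {
        List<Atom> pivot = restrictivePivot(rho1, rho2);
        List<Atom> other = (pivot == rho1) ? rho2 : rho1;
        List<Atom> merged = new ArrayList<>();
        for (Atom atom : pivot) {
            Atom chosen = atom;
            for (Atom candidate : other) {
                if (!candidate.feature().equals(atom.feature())) continue;
                boolean moreRestrictive = positiveRule
                        ? candidate.threshold() > chosen.threshold()
                        : candidate.threshold() < chosen.threshold();
                if (moreRestrictive) chosen = candidate;   // mu_theta>= / mu_theta<=
            }
            merged.add(chosen);
        }
        return merged;
    }
}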

6.4.3 Similarity Thresholds Normalization

In the previous sections, we defined normalization processes aimed at fixing inconsistencies within single rules, and formally defined the merging condition and process. However, it is reasonable to assume the possibility of having rules that cannot be merged, as they do not apply to the same set of samples, but that present common rule atoms with different similarity thresholds for the same feature. In a knowledge-based solution the semantics of the attributes is clear, and the satisfaction of each atom has to be interpreted as a Kleene operator behaving consistently across different rules. Intuitively, if we learn from the data that two names, in order to match, should have a minimal similarity of t, it does not make sense that another rule considers the same attribute with a similarity threshold t1 with t1 > t or t1 < t. Given two descriptions, the decision about the satisfaction of a rule atom should be uniform across all rules classifying the same class. Therefore, we want to normalize all the similarity thresholds for the same feature in the different rules sharing the same classification purpose. At this point, we assume that the rules have already been normalized and merged as discussed in sections 6.3 and 6.4.

We first need to select the rule atoms shared among the rules. To do so, we can simply process the rules using the features defined in the identification ontology, and extract for each of them all the atoms presenting that feature. More formally, let us define the function πP[...] : P[...] × {o1, ..., om} × F → Θ[...] that, given a list of rules, a set of operators and a feature, returns all the atoms containing one of the operators and the feature. We now need to define simple operators that select the minimum and maximum thresholds among the set of rule atoms selected using the function πP[...]. More formally, let us define minT : Θ[...] → Θ as the function that, given an array of atoms, selects the one with the minimum threshold. Intuitively, this function can be defined in terms of the iterative application of µθ≤ on pairs of atoms. Similarly, let us define maxT : Θ[...] → Θ as the function that, given an array of atoms, selects the one with the maximum threshold. Intuitively, this function can be defined in terms of the iterative application of µθ≥ on pairs of atoms.

So far, we have defined tools for selecting the atoms with maximum and minimum thresholds, given a feature and a set of operators. We now need a function that replaces such atoms in all rules, so that similarity thresholds can be normalized across all defined rules. Let us define replace : P × Θ → P as the function that, given a rule ρ and an atom θp = ⟨α, o, t⟩, where α is a feature, o is an operator among O = {=, >, <, ≤, ≥} and t is a similarity threshold in the range [0, 1], replaces with θp the atoms θi ∈ ρ such that θi.f = θp.f ∧ θi.o = θp.o.

We now have the instruments to define a first set of simple cross-rule similarity threshold normalizers aimed at producing a set of rules presenting uniform thresholds for rule atoms supporting the same decision. The selection of the maximum or minimum threshold implies the definition of more relaxed or more conservative rules according to their final goal. In the following we present the possible combinations of threshold normalization choices:

• Conservative Match and Conservative Non Match. Aiming at defining sound rules, one possibility is to make the rules more conservative and to minimize the set of description pairs satisfying each rule, consequently maximizing the set of pairs classified as DONTKNOW. This can be done by selecting the maximum similarity threshold for each feature among all positive matching rules, and selecting the minimum threshold for each feature among all negative matching rules. This approach is conservative towards both MATCH and NONMATCH classification, and it is noted as CC (Conservative-Conservative). More formally, for each positive matching rule ρ in the list of rules P[...], we apply replace(ρ, maxT(πP[...], {>, ≥})), and for each negative matching rule ρ in the list of rules P[...], we apply replace(ρ, minT(πP[...], {<, ≤})).

• Conservative Match and Relaxed Non Match. Aiming at maximizing the set of pairs of descriptions satisfying negative matching rules, while being conservative about positive matching rules, we can select the maximum threshold for each feature in negative matching rules. This approach is conservative towards MATCH classification but more relaxed on NONMATCH, as the set of pairs of descriptions possibly satisfying the rule is enlarged, and it is noted as CR (Conservative-Relaxed). More formally, for each positive matching rule ρ in the list of rules P[...], we apply replace(ρ, maxT(πP[...], {>, ≥})), and for each negative matching rule ρ in the list of rules P[...], we apply replace(ρ, maxT(πP[...], {<, ≤})).

• Relaxed Match and Conservative Non Match. Aiming at maximizing the set of pairs of descriptions satisfying positive matching rules, while being conservative about negative matching rules, we can select the minimum threshold for each feature in positive matching rules. This approach is conservative towards NONMATCH classification but more relaxed on MATCH, as the set of pairs of descriptions possibly satisfying the rule is enlarged, and it is noted as RC (Relaxed-Conservative). More formally, for each positive matching rule ρ in the list of rules P[...], we apply replace(ρ, minT(πP[...], {>, ≥})), and for each negative matching rule ρ in the list of rules P[...], we apply replace(ρ, minT(πP[...], {<, ≤})).

• Relaxed Match and Relaxed Non Match. Aiming at improving the completeness of the rules, one possibility is to make the rules less conservative and to maximize the set of description pairs satisfying each rule, consequently minimizing the set of pairs classified as DONTKNOW, by selecting the minimum similarity threshold for each feature among all positive matching rules, and selecting the maximum threshold for each feature among all negative matching rules. This approach is relaxed towards both MATCH and NONMATCH classification, and it is noted as RR (Relaxed-Relaxed). More formally, for each positive matching rule ρ in the list of rules P[...], we apply replace(ρ, minT(πP[...], {>, ≥})), and for each negative matching rule ρ in the list of rules P[...], we apply replace(ρ, maxT(πP[...], {<, ≤})).

The effects of such normalizations have to be experimentally evaluated.
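As an illustration, the following sketch implements the CC setting; the other settings only swap the use of maximum and minimum thresholds. All names are ours, and rules are assumed to be mutable lists of atoms.

import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch of cross-rule threshold normalization (CC setting):
// positive rules adopt the maximum observed threshold per feature, negative
// rules the minimum observed threshold per feature.
public class ThresholdNormalization {

    record Atom(String feature, String op, double threshold) {}

    // per-feature extreme (maximum or minimum) threshold across a list of rules
    static Map<String, Double> extremes(List<List<Atom>> rules, boolean maximum) {
        Map<String, Double> extreme = new HashMap<>();
        for (List<Atom> rule : rules)
            for (Atom a : rule)
                extreme.merge(a.feature(), a.threshold(),
                        (x, y) -> maximum ? Math.max(x, y) : Math.min(x, y));
        return extreme;
    }

    // CC: conservative on both classes (rules must be mutable lists)
    static void normalizeCC(List<List<Atom>> positiveRules, List<List<Atom>> negativeRules) {
        Map<String, Double> posMax = extremes(positiveRules, true);
        Map<String, Double> negMin = extremes(negativeRules, false);
        for (List<Atom> rule : positiveRules)
            rule.replaceAll(a -> new Atom(a.feature(), a.op(), posMax.get(a.feature())));
        for (List<Atom> rule : negativeRules)
            rule.replaceAll(a -> new Atom(a.feature(), a.op(), negMin.get(a.feature())));
    }
}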

6.4.4 Rules Merging Process

The merging function based on rule subsumption defines a partial order on a set of rules; thereby the rules merging process can be done by simply iterating on the list of rules until no further merging is possible among them. Consider the following pseudocode describing the iterative normalization and merging process:

process(List rules){
  List normalized = normalize(rules);
  int normalizedSize = normalized.size;
  int mergedSize = 0;
  while(mergedSize != normalizedSize){
    normalizedSize = normalized.size;
    List merged = {};
    Set m = {};
    for(int i=0; i

The normalize method takes the rules and applies a set of normalization steps as described in section 6.3. The mergeable method is a boolean method that, taking in input two rules, applies either µr or µu to decide whether the pair of rules can be merged, and which one is the pivot.
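The pseudocode above stops at the beginning of the inner loop; the following Java sketch shows one plausible way the iterative process could be completed, under the assumption that normalize, mergeable and merge behave as described in sections 6.3 and 6.4. All names and the placeholder implementations are ours and purely illustrative.

import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of the iterative merging process: keep merging pairs of
// mergeable rules until a fixed point is reached (no further merge possible).
public class MergingProcess {

    static List<Rule> process(List<Rule> rules) {
        List<Rule> normalized = normalize(rules);
        int previousSize = -1;
        while (normalized.size() != previousSize) {
            previousSize = normalized.size();
            List<Rule> merged = new ArrayList<>();
            boolean[] consumed = new boolean[normalized.size()];
            for (int i = 0; i < normalized.size(); i++) {
                if (consumed[i]) continue;
                Rule current = normalized.get(i);
                for (int j = i + 1; j < normalized.size(); j++) {
                    if (!consumed[j] && mergeable(current, normalized.get(j))) {
                        current = merge(current, normalized.get(j));
                        consumed[j] = true;
                    }
                }
                merged.add(current);
            }
            normalized = merged;
        }
        return normalized;
    }

    // placeholders for the operations defined formally in sections 6.3 and 6.4
    record Rule(List<String> atoms) {}
    static List<Rule> normalize(List<Rule> rules) { return new ArrayList<>(rules); }
    static boolean mergeable(Rule r1, Rule r2) { return false; }
    static Rule merge(Rule pivot, Rule other) { return pivot; }
}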

6.4.5 Defining Rules Class Hierarchy

In section 6.1 we framed the definition of the rules to support matching decisions consistently with the Three-Valued Logic of Kleene. It is clear that MATCH and ¬MATCH rules, even if they are equivalent (i.e. they apply to the same set of description pairs), have to be disjoint and considered separately. A different analysis applies to DONTKNOW classification rules. The normalization and merging steps defined so far may end up producing equivalent matching rules belonging to the MATCH and DONTKNOW classes, or to the ¬MATCH and DONTKNOW classes. If two equivalent rules produce different matching decisions, we should be able to decide which rule should be applied, regardless of the order of application of the rules. However, it is important to also consider similarity thresholds and operators in this case. In fact, we defined normalization operations to remove or fix incoherent operators for MATCH and ¬MATCH classification rules, but did not deal with the DONTKNOW cases. This does not allow us to apply simple ρ-equivalence to define a merging process, but forces us to rely on a stricter rule equality. Namely, two rules are ρ-equal if they are satisfied by exactly the same set of description pairs, for any set of descriptions. This implies that the rules are a conjunction of exactly the same atoms. Therefore, we propose to also define a hierarchy of matching classes, so that the selection process for ρ-equal rules belonging to different classes can be formally driven. Intuitively, if we have two ρ-equal rules, one producing a MATCH decision and the other producing a DONTKNOW decision, we should conservatively choose the DONTKNOW decision, as the MATCH decision may be imprecise. We can then formalize the hierarchy in the following way:

MATCH ⊑ DONTKNOW (6.17)

Similarly, if we have two equivalent rules that classify NONMATCH in one case and DONTKNOW in the other, then the merged rule should be a DONTKNOW, as the NONMATCH decision could be too radical, i.e. imprecise. We can then formalize it in the following way:

¬MATCH ⊑ DONTKNOW (6.18)

Thereby, the selection process based on such a hierarchy, µc : P × P → P, would be:

µc(ρ1, ρ2) = ρ1, if ρ2 ∉ DONTKNOW ∧ ρ1 ∈ DONTKNOW ∧ ρ1 ≡ρ ρ2; ρ2, if ρ1 ∉ DONTKNOW ∧ ρ2 ∈ DONTKNOW ∧ ρ1 ≡ρ ρ2. (6.19)

We believe that the definition of such a hierarchy guarantees the definition of sound rules, removing possible inconsistencies due to partial normalization and merging processes.
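A minimal sketch of this selection, assuming the classification target of each rule is known and that the two rules have already been verified to be ρ-equal, could be the following (names are ours and illustrative only):

// Illustrative sketch of the class-hierarchy selection mu_c of equation (6.19):
// between two rho-equal rules supporting different decisions, the DONTKNOW
// rule is conservatively preferred over MATCH and NONMATCH.
public class ClassHierarchySelection {

    enum Decision { MATCH, NONMATCH, DONTKNOW }

    record ClassifiedRule(String id, Decision decision) {}

    static ClassifiedRule select(ClassifiedRule rho1, ClassifiedRule rho2) {
        if (rho1.decision() == Decision.DONTKNOW && rho2.decision() != Decision.DONTKNOW)
            return rho1;
        if (rho2.decision() == Decision.DONTKNOW && rho1.decision() != Decision.DONTKNOW)
            return rho2;
        return rho1; // same class: the choice is immaterial
    }
}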

6.5 Remarks about Rules for Open Entity Matching

In this chapter we formally defined the matching rules and some tools to verify their application and satisfaction. We also formally framed the application of these rules under the Open World Assumption, and modeled the satisfaction of rules around Kleene's three-valued logic. Recalling that our goal is not to implement the rules in a formal system, we believe we provided a sound formalization of the tools we need to integrate and apply these rules in a more complex software program providing a reliable rule-based solution to the open entity matching problem. It is therefore clear that the construction of matching rules suitable for open entity matching is a key point of the solution proposed in this work. This aspect is treated in detail in chapter 8. However, before applying any rule, we need to deal with the problem of semantic heterogeneity affecting the descriptions available on the (Semantic) Web. In fact, the matching rules defined in this chapter are expressed in terms of the Identification Ontology. Therefore, we necessarily have to harmonize the semantics of the descriptions using the features defined in it in order to be able to apply them. The solution to this problem is presented in section 5.6. In this chapter, we formally outlined the principles and some theoretical grounding of the knowledge-based solution we propose to the entity matching problem. In chapters 8 and 9 we are going to further explore some aspects treated in this chapter, and mostly we are going to describe a first implementation of the knowledge-based solution. In chapter 10, we are then going to experimentally evaluate the impact of the many heuristic intuitions outlined in this chapter, aimed at proposing solutions to inherent problems.

Part III

Implementation and Evaluation

Chapter 7

Semantic and Structural Harmonization

A necessary condition for the application of any rule defined in terms of the ontology is to harmonize the semantics of the attributes used in the descriptions, defining contextual mappings towards the identification ontology. The automatic definition/discovery of such mappings is not specifically tackled in this work; we rely on the existence of a set of mappings regardless of their provenance. Nevertheless, in order to bootstrap the knowledge-based solution we propose in this work, it is necessary to define a first set of mappings. In particular, in section 7.1 we present a list of mappings to harmonize known entity types, whereas in section 7.2 we describe in more detail the mappings related to the features defined for the types in the identification ontology. Besides this initial batch of manually defined mappings, we foresee the adoption of some semi-supervised method to solve this problem. Alternatively, we can simply assume that, before any application of the solution proposed in this work, a human expert provides the necessary mappings to complete the process. With respect to the last point, tools such as Open Refine1 and Karma [144] support the manual and automatic definition of these mappings, dealing with different types of data sources.

The problem of structural harmonization is only partially treated in this work. In particular, we focused on a few attribute types such as names, dates and geo-coordinates, defining a set of transformation functions that will be extended in the future. Details about this process are presented in section 7.3.


Figure 7.1: A view of the Semantic Map application

7.1 Entity Type Harmonization

From a practical perspective, the harmonization of the semantics of both entity types and attributes is fairly simple, as it assumes the existence and availability of a list of contextual mappings both for the entity types considered and for the attributes associated with them. Contextual mappings to known existing types (or concepts) are used to harmonize the type mentioned in the descriptions, replacing it with the canonical type declared in the Identification Ontology. We are aware that some of the entities might not present any information related to the entity type, or that a mapping might not always be available. In this context we do not explore automatic methods for guessing the type of the entity given a description (e.g. [5, 6]), and we postpone this to future work. Discerning the type of the entity to which a description refers is essential to the application of any knowledge-based solution. Thereby, when entity type harmonization fails, the matching process cannot be completed. In order to ease the problem of contextual mapping collection and maintenance, we implemented a first simple web application supporting the management of this task. We temporarily named this application Semantic Map, as it supports the extension and maintenance of the contextual mappings towards the Identification Ontology. Figure 7.1 shows a view of the application in the part that allows the management of the mappings towards the entity type Person. As it is possible to see, the mappings towards the type are dissected between equivalent classes and sub-classes. The mappings will be published as part of the identification ontology, which will then become an Open Linked Vocabulary.
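For illustration, a minimal sketch of how such a lookup could be applied is given below; the canonical type label "Person" is a placeholder for the Identification Ontology type, and the three mappings shown are a tiny sample drawn from table 7.1, not the full mapping set maintained in the Semantic Map application.

import java.util.HashMap;
import java.util.Map;

// Minimal, hypothetical sketch of entity type harmonization: the source type of a description
// is replaced by the canonical type of the Identification Ontology, or harmonization fails.
public class TypeHarmonizer {
    private final Map<String, String> typeMappings = new HashMap<>();

    public TypeHarmonizer() {
        typeMappings.put("dbpedia:Person", "Person");   // equivalent class
        typeMappings.put("schema:Person", "Person");    // equivalent class
        typeMappings.put("dbpedia:Athlete", "Person");  // subclass
    }

    /** Returns the canonical type, or null when no mapping exists and matching cannot proceed. */
    public String harmonize(String sourceType) {
        return typeMappings.get(sourceType);
    }
}

When harmonize returns null, the type of the description remains unknown and, as discussed above, the matching process cannot be completed.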

1http://code.google.com/p/google-refine/

Table 7.1: Entity Types Mapping Samples for Person

Equivalent                   SubClass
dbpedia:Person               umbel:MusicPerformer
schema:Person                dbpedia:Artist
foaf:Person                  yago:LivingPeople
dul:Person                   freebase:author
freebase:person              dbpedia:Politician
freebase:human               wn:synset-musician-noun-1
okkam:Person                 schema:deceased person
umbel:Person                 dbpedia:Athlete
wn:synset-person-noun-1      umbel:BaseballPlayer
                             umbel:Journalist
                             wn:synset-celebrity-noun-1
                             yago:Writer110794014
                             dbpedia:Actor
                             freebase:golfer
                             ...

For the moment, the mappings towards the classes and the attributes defined in the identification ontology are managed in a textual field, and every line of the text area represents a single mapping. We have already planned a future evolution of this application to become more user friendly, and to gradually evolve into an application suitable to be adopted in a community context.

7.1.1 Examples of Entity Type Contextual Mappings

It is important to remember that the mappings proposed in this context are not meant to be static and absolutely correct from an ontological perspective, but rather they are conceived as contextual parameters of a process aimed at solving the entity matching problem. In this context, mappings do not aim only at capturing equivalent properties or classes, but rather aim at capturing also properties that can be considered subclasses or sub-properties of the features defined in the identification ontology. For this reason, we decided to rely on a purely manual, supervised approach in the definition of the mappings. This approach is time-expensive and does not necessarily scale, but it allowed us to define a first set of mappings to evaluate the suitability of the knowledge-based solution. We plan to exploit the initial effort spent on the definition of the mappings to rely on some semi-automatic method supporting the mapping process. For example, in [5] the authors propose a probabilistic method to guess the type of an entity given its attributes. At the moment we have defined 22 mappings for equivalent classes, and 205 mappings for subclasses of the entity type Person. A sample of these mappings is presented in table 7.1. We also defined 22 mappings for equivalent classes, and 2322 mappings for subclasses of the type Location.

Table 7.2: Entity Types Mapping Samples for Location

Equivalent                    SubClass
schema:Place                  freebase:venue
dbpedia:Place                 umbel:PopulatedPlace
yago:YagoGeoEntity            yago:GeoClassBridge
wn:synset-location-noun-1     schema:Park
okkam:location
umbel:Place                   dbpedia:city
geonames:Feature              wn:synset-city-noun-3
freebase:location             umbel:Village
dul:Place                     yago:BridgesInNewMexico
opengis:Feature               umbel:Town
                              dbpedia:Station
                              yago:WineRegionsInFrance
                              schema:City
                              wn:synset-lake-noun-1
                              ...

A sample of these mappings is presented in table 7.2. Finally, we defined 20 mappings for equivalent classes and 2468 mappings for subclasses of the type Organization. A sample of these mappings is presented in table 7.3. A more extensive list of mappings is presented in appendix A.

7.2 Semantic Harmonization of Features

In order to solve the problem of semantic heterogeneity affecting descriptions on the Semantic Web, we rely on the existence of a list of contextual mappings to align the names of the features used in the descriptions with the canonical ones defined in the Identification Ontology described in chapter 5. Also in this case, we are aware of the fact that some attributes might not present any type, or that a mapping for all the attribute types might not be available. However, it is important to notice that the proliferation in the creation of ontologies seems to be beginning to converge thanks to the "terms reuse policy" defined in the tutorial about how to publish Linked Data on the Web [20]. Hence, in this context the cost for the definition of mappings between ontologies is relevant, but does not seem to be an unsolvable issue. Also in this case, we relied on the Semantic Map web application to define and maintain the mappings of attributes towards the identification ontology features. Nevertheless, in the future we aim at defining automatic or semi-automatic solutions to guess the type of an attribute given its value, as proposed in [6] and in [144], among others. Furthermore, when the descriptions are over-specified and present a long list of attributes irrelevant for matching decisions, distance-based matching algorithms that do not consider the semantics of the attributes can be heavily affected.

Table 7.3: Entity Types Mapping Samples for Organization

Equivalent                        SubClass
schema:Organization               umbel:Business
okkam:Organization                dbpedia:Company
dbpedia:Organisation              wn:synset-company-noun-1
umbel:Organization                schema:MusicalGroup
freebase:organization             umbel:NonProfitOrganization
wn:synset-organization-noun-1     yago:IslamicOrganization
foaf:Organization                 wn:synset-institution-noun-1
foaf:Group                        umbel:EducationalOrganization
dul:Organization                  freebase:hokey team
freebase:employer                 yago:GirlGroup
                                  yago:1970sMusicGroups
                                  umbel:Club Organization
                                  freebase:family
                                  ...

This problem is partially solved by selecting the interesting attributes through the mappings towards the identification ontology, and then discarding the attributes that are not recognized to be relevant for matching. Once the type is disambiguated, it is necessary to harmonize also the semantics of the attributes in the compared descriptions. Similarly to what was done for the entity types, also in this case we do not propose any automatic solution to discover the mappings, but we rely on a list of manually defined mappings. This choice is driven by the fact that the semantic harmonization is not interpreted as traditional ontological matching, but rather as the result of the application of a set of contextual bridge rules implying some sort of generalization, besides ontological equivalence. The generalization of the mapping encompasses the concept of sub-property, but its complete formalization is out of the scope of this work; thus we limit our analysis to interpreting mappings as contextual bridge rules, as described in section 5.6 and supported by the theoretical framework defined in [29]. A sample of the mappings produced for the features associated with the type Person is presented in table 7.4.
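A similar lookup can be sketched for the attribute level; the sample property URIs below come from table 7.4, the single-valued Map representation of a description is a simplification, and the sketch is an illustration of the harmonize-and-discard behaviour described above rather than the actual implementation.

import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of feature-level semantic harmonization: attributes whose property URI
// has a contextual mapping are renamed to the canonical feature; unmapped attributes are dropped.
public class FeatureHarmonizer {
    private final Map<String, String> featureMappings = new HashMap<>();

    public FeatureHarmonizer() {
        // a few sample mappings taken from table 7.4
        featureMappings.put("http://dbpedia.org/ontology/birthDate", "birthdate");
        featureMappings.put("http://schema.org/birthDate", "birthdate");
        featureMappings.put("http://xmlns.com/foaf/0.1/firstName", "first name");
    }

    /** Returns the harmonized description; attributes irrelevant for matching are discarded. */
    public Map<String, String> harmonize(Map<String, String> description) {
        Map<String, String> harmonized = new HashMap<>();
        for (Map.Entry<String, String> attribute : description.entrySet()) {
            String canonicalFeature = featureMappings.get(attribute.getKey());
            if (canonicalFeature != null) {
                harmonized.put(canonicalFeature, attribute.getValue());
            }
        }
        return harmonized;
    }
}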

7.2.1 Mappings for Features of Person

Table 7.4: Mapping for features of type Person

Feature Mappings URL affiliation http://schema.org/affiliation, http://dbpedia.org/property/workplace, http://www.loc.gov/mads/rdf/v1#hasAffiliation, http://dbpedia.org/ontology/team, http://www.freebase.com/schema/baseball/baseball_coach/current_ team_coaching, http://purl.org/net/nknouf/ns/bibtex#hasAffiliation, http://www.semanticdesktop.org/ontologies/2007/03/22/nco# hasAffiliation, http://dbpedia.org/property/coachingteams, http://dbpedia.org/ontology/party, http://www.ontotext.com/proton/protonext#ofParty, http://d-nb.info/standards/elementset/agrelon.owl#hasEmployer, ... author http://rdvocab.info/roles/author, http://www.freebase.com/schema/film/writer/film, http://purl.org/dc/elements/1.1/author, http://dbpedia.org/property/author, http://www.freebase.com/schema/music/lyricist/lyrics_written, http://dbpedia.org/property/novels, http://www.ontologydesignpatterns.org/ont/dul/IOLite.owl# isAuthorOf, http://www.ontotext.com/proton/protonext#authorOf, http://purl.org/ontology/mo/composer, ... award http://dbpedia.org/property/awards, http://schema.org/awards, http://www.freebase.com/schema/award/award_honor/award_winner, http://vivoweb.org/ontology/core#awardOrHonor, http://www.ontotext.com/proton/protonext#Award, http://rdvocab.info/Elements/award, http://dbpedia.org/property/academyawards, ... birthdate http://dbpedia.org/ontology/birthDate, http://dbpedia.org/property/dateofbirth, http://www.freebase.com/schema/people/person/date_of_birth, http://www.semanticdesktop.org/ontologies/2007/03/22/nco#birthDate, http://schema.org/birthDate, http://xmlns.com/foaf/0.1/birthday, http://www.w3.org/2006/vcard/ns#bday, http://www.ontotext.com/proton/protonext#birthDate, http://www.kanzaki.com/ns/whois#born, http://rdvocab.info/ElementsGr2/dateOfBirth, http://d-nb.info/standards/elementset/gnd#dateOfBirth, http://data.archiveshub.ac.uk/def/dateBirth, http://data.press.net/ontology/stuff/dateOfBirth, ... birthmonth http://dbpedia.org/property/monthofbirth, ... Mapping for Features of Person, Continued on next page 7.2. SEMANTIC HARMONIZATION OF FEATURES 123

birthplace http://dbpedia.org/property/birthPlace, http://rdvocab.info/ElementsGr2/placeOfBirth, http://www.freebase.com/schema/people/person/place_of_birth, http://www.freebase.com/schema/music/artist/origin, http://www.ontotext.com/proton/protonext#birthPlace, http://d-nb.info/standards/elementset/gnd#placeOfBirth, http://data.press.net/ontology/stuff/placeOfBirth, ... birthyear http://dbpedia.org/property/yob, http://dbpedia.org/ontology/birthYear, http://lod.taxonconcept.org/ontology/txn.owl#yearBorn, ... city of residence http://dbpedia.org/property/homeTown, http://dbpedia.org/property/residence, http://dbpedia.org/property/placeOfResidence, http://schema.org/homeLocation, http://rdvocab.info/ElementsGr2/placeOfResidence, http://schemas.talis.com/2005/address/schema#localityName, ... country of residence http://dbpedia.org/property/country, http://rdf.myexperiment.org/ontologies/base/country, http://www.w3.org/2006/vcard/ns#country-name, http://vivoweb.org/ontology/core#addressCountry, http://www.loc.gov/mads/rdf/v1#country, http://www.semanticdesktop.org/ontologies/2007/03/22/nco#country, http://lod.taxonconcept.org/ontology/txn.owl#country, ... date of death http://dbpedia.org/property/dateOfDeath, http://www.freebase.com/schema/people/deceased_person/date_of_ death, http://dbpedia.org/property/died, http://purl.org/gen/0.1#death, http://data.archiveshub.ac.uk/def/dateDeath, http://www.ontotext.com/proton/protonext#deathDate, http://schema.org/deathDate, http://purl.org/vocab/bio/0.1/death, http://rdvocab.info/ElementsGr2/dateOfDeath, http://d-nb.info/standards/elementset/gnd#dateOfDeath, http://data.press.net/ontology/stuff/dateOfDeath, ... day of birth http://dbpedia.org/property/dayofbirth, ... deathplace http://dbpedia.org/property/deathplace, http://www.freebase.com/schema/people/deceased_person/place_of_ death, http://dbpedia.org/property/cityofdeath, http://rdvocab.info/ElementsGr2/placeOfDeath, http://www.ontotext.com/proton/protonext#deathPlace, http://d-nb.info/standards/elementset/gnd#placeOfDeath, ... Mapping for Features of Person, Continued on next page 124 CHAPTER 7. SEMANTIC AND STRUCTURAL HARMONIZATION

domain tag http://dbpedia.org/property/shortDescription, http://www.holygoat.co.uk/owl/redwood/0.1/tags/tag, http://schemas.talis.com/2005/dir/schema#tag, http://www.semanticdesktop.org/ontologies/2007/08/15/nao#hasTag, http://purl.org/dc/terms/subject, http://schema.org/keywords, ... email address http://xmlns.com/foaf/0.1/mbox, http://purl.org/b2bo#mailto, http://vivoweb.org/ontology/core#email, http://www.w3.org/2006/vcard/ns#email, http://www.w3.org/2000/10/swap/pim/contact#emailAddress, http://schema.org/email, http://swrc.ontoware.org/ontology#email, http://vivoweb.org/ontology/core#primaryEmail, http://www.aktors.org/ontology/portal#has-email-address, ... email address hashcode http://rdfs.org/sioc/ns#email_sha1, http://xmlns.com/foaf/0.1/mbox_sha1sum, ... end date http://www.freebase.com/schema/music/group_membership/end, http://dbpedia.org/ontology/activeYearsEndDate, http://www.freebase.com/schema/military/military_service/to_date, http://www.freebase.com/schema/education/education/end_date, http://spitfire-project.eu/ontology/ns/activityEnd, http://www.freebase.com/schema/business/employment_tenure/to, http://dbpedia.org/property/lastCupRace, ... fax http://sw-portal.deri.org/ontologies/swportal#hasFax, http://swrc.ontoware.org/ontology#fax, http://vivoweb.org/ontology/core#faxNumber, http://www.loc.gov/mads/rdf/v1#fax, http://schemas.talis.com/2005/address/schema#fax, http://schema.org/faxNumber, ... first name http://xmlns.com/foaf/0.1/firstName, http://rdfs.org/sioc/ns#first_name, http://www.ontotext.com/proton/protontop#firstName, http://purl.org/b2bo#firstName, http://schema.org/givenName, http://xmlns.com/foaf/0.1/givenName, http://www.w3.org/2006/vcard/ns#given-name, ... gender http://dbpedia.org/property/sexuality, http://www.freebase.com/schema/people/person/gender, http://dbpedia.org/ontology/gender, http://dbpedia.org/property/sex, http://xmlns.com/foaf/0.1/gender, http://www.semanticdesktop.org/ontologies/2007/03/22/nco#gender, http://rdvocab.info/ElementsGr2/gender, http://schema.org/gender, ... Mapping for Features of Person, Continued on next page 7.2. SEMANTIC HARMONIZATION OF FEATURES 125

height http://dbpedia.org/ontology/Person/height, http://www.freebase.com/schema/people/person/height_meters, ... involved in http://www.freebase.com/schema/film/film_art_director/films_art_ directed, http://www.freebase.com/schema/theater/theater_director/plays_ directed, http://www.freebase.com/schema/music/producer/tracks_produced, http://purl.org/ontology/mo/produced_work, http://purl.org/vocab/frbr/core#producerOf, ... last name http://xmlns.com/foaf/0.1/surname, http://rdfs.org/sioc/ns#last_name, http://xmlns.com/foaf/0.1/family_name, http://www.ontotext.com/proton/protontop#lastName, http://purl.org/b2bo#lastName, , http://xmlns.com/foaf/0.1/lastName, http://schema.org/familyName ... member of http://dbpedia.org/property/movement, http://dbpedia.org/property/schoolTradition, http://dbpedia.org/ontology/religion, http://reference.data.gov.uk/def/parliament/memberOf, http://lexvo.org/ontology#memberOf, http://rdfs.org/sioc/ns#member_of, http://www.w3.org/ns/org#memberOf, http://reference.data.gov.uk/def/parliament/partyMemberOf, http://www.ontologydesignpatterns.org/ont/dul/DUL.owl#isMemberOf, http://vivoweb.org/ontology/core#currentMemberOf, http://xmlns.com/foaf/0.1/member, http://schema.org/member, ... phone number http://xmlns.com/foaf/0.1/phone, http://www.w3.org/2006/vcard/ns#tel, http://vivoweb.org/ontology/core#phoneNumber, http://www.w3.org/2000/10/swap/pim/contact#Phone, http://www.loc.gov/mads/rdf/v1#phone, http://schema.org/telephone, http://www.ontotext.com/proton/protonext#hasMobilePhone, ... picture url http://xmlns.com/foaf/0.1/depiction, http://dbpedia.org/property/image, http://xmlns.com/foaf/0.1/img, http://lod.taxonconcept.org/ontology/txn.owl#individualhasImage, http://schema.org/image, http://xmlns.com/foaf/0.1/thumbnail http://data.press.net/ontology/ stuff/hasImage, ... Mapping for Features of Person, Continued on next page 126 CHAPTER 7. SEMANTIC AND STRUCTURAL HARMONIZATION

postal code http://schemas.talis.com/2005/address/schema#postalCode, http://schema.org/postalCode, http://data.lirmm.fr/ontologies/passim#postalCode, http://ogp.me/ns#postal-code, http://www.w3.org/2006/vcard/ns#postal-code, http://sw-portal.deri.org/ontologies/swportal#hasZipcode, http://www.ontotext.com/proton/protonext#ZipCode ... public istitutional id http://purl.org/goodrelations/v1#taxID, http://schema.org/vatID, http://purl.org/b2bo#nip, http://purl.org/goodrelations/v1#vatID, http://purl.org/spar/datacite/social-security-number, http://schema.org/taxID, http://www.semanticdesktop.org/ontologies/2007/01/19/nie# identifier, ... start date http://dbpedia.org/property/termStart, http://www.freebase.com/schema/music/group_membership/start, http://dbpedia.org/property/stateDate, http://purl.org/goodrelations/v1#vatID, http://www.freebase.com/schema/tv/regular_tv_appearance/from, http://dbpedia.org/property/debutdate, http://www.oegov.org/core/owl/gc#startDate, http://ns.nature.com/terms/dateStart, http://spitfire-project.eu/ontology/ns/activityStart, http://schema.org/startDate ... street address http://rdvocab.info/ElementsGr2/addressOfThePerson, http://sw-portal.deri.org/ontologies/swportal#hasStreetAddress, http://schemas.talis.com/2005/address/schema#streetAddress, http://www.w3.org/2006/vcard/ns#street-address, http://www.semanticdesktop.org/ontologies/2007/03/22/nco# streetAddress, http://schema.org/streetAddress, http://ogp.me/ns#street-address, http://www.w3.org/2006/vcard/ns#adr, ... title http://xmlns.com/foaf/0.1/title, http://dbpedia.org/property/honorificPrefix, http://www.freebase.com/schema/royalty/noble_title_tenure/noble_ title, http://data.archiveshub.ac.uk/def/title, http://www.w3.org/2006/vcard/ns#title, http://xmlns.com/foaf/0.1/title, http://schema.org/title, http://schema.org/jobTitle , http://www.ontotext.com/proton/protonext#hasTitle, http://rdvocab.info/ElementsGr2/titleOfThePerson, http://d-nb.info/standards/elementset/gnd#titleOfNobility ... Mapping for Features of Person, Continued on next page 7.2. SEMANTIC HARMONIZATION OF FEATURES 127

website http://dbpedia.org/property/homepage, http://xmlns.com/foaf/0.1/homepage, http://www.freebase.com/schema/internet/blogger/blog, http://d-nb.info/standards/elementset/gnd#homepage, http://purl.org/ontology/mo/homepage, http://xmlns.com/foaf/0.1/weblog, http://www.semanticdesktop.org/ontologies/2007/03/22/nco#blogUrl, http://lod.taxonconcept.org/ontology/sci_people.owl#hasBlog, http://purl.org/ontology/mo/onlinecommunity, http://www.w3.org/2000/10/swap/pim/contact#webPage, ...

7.2.2 Mappings for Features of Location

A sample of the mappings produced for the features associated with the type Location is presented in table 7.5.

Table 7.5: Mapping for features of type Location

Feature Mappings URL area http://www.freebase.com/schema/location/location/area, http://dbpedia.org/property/area, http://dbpedia.org/ontology/PopulatedPlace/area, http://dbpedia.org/ontology/team, http://dbpedia.org/property/landarea, http://dbpedia.org/property/areaMetroKm, http://www.telegraphis.net/ontology/geography/geography#landArea, http://www.mindswap.org/2003/owl/geo/geoFeatures20040307.owl#sqkm, http://aims.fao.org/aos/geopolitical.owl#landArea, ... city http://dbpedia.org/ontology/city, http://www.freebase.com/schema/location/mailing_address/citytown, http://sw-portal.deri.org/ontologies/swportal#inCity, http://www.mindswap.org/2003/owl/geo/geoFeatures20040307.owl#city_ name, http://www.loc.gov/mads/rdf/v1#city ... contains http://dbpedia.org/ontology/subregion, http://dbpedia.org/property/frazioni, http://dbpedia.org/property/largestcity, http://dbpedia.org/property/capital, http://www.freebase.com/schema/location/place_with_neighborhoods/ neighborhoods, http://data.ordnancesurvey.co.uk/ontology/spatialrelations/ contains, http://ontologies.smile.deri.ie/pdo#contains, http://www.essepuntato.it/2008/12/pattern#contains, http://geovocab.org/spatial#Pi, http://www.ontotext.com/proton/protonext#containsLocation ... Mappings of Feature for Location, Continued on next page 128 CHAPTER 7. SEMANTIC AND STRUCTURAL HARMONIZATION

coordinate geometry http://geovocab.org/geometry#geometry, http://www.opengis.net/ont/geosparql#hasGeometry, http://www.opengis.net/ont/geosparql#defaultGeometry, http://www.w3.org/2003/01/geo/wgs84_pos#geometry, ... country http://www.geonames.org/ontology#inCountry, http://www.geonames.org/ontology#parentCountry, http://www.freebase.com/schema/location/postal_code/country, http://www.freebase.com/schema/location/administrative_division/ country, http://www.semanticdesktop.org/ontologies/2007/03/22/nco#country, http://lod.taxonconcept.org/ontology/txn.owl#country, http://sw-portal.deri.org/ontologies/swportal#inCountry, http://www.loted.eu/ontology#CY, http://www.geonames.org/ontology#countryCode, http://schemas.talis.com/2005/address/schema#countryName, ... description http://dbpedia.org/ontology/abstract, http://vivoweb.org/ontology/core#description, http://rdfs.org/sioc/ns#description, http://vocab.data.gov/def/fea#description, http://schema.org/description, ... elevation http://dbpedia.org/property/elevation, http://dbpedia.org/property/height, http://www.w3.org/2003/01/geo/wgs84_pos#alt, http://vocab.data.gov/def/fea#description, http://www.freebase.com/schema/location/geocode/elevation, http://schema.org/elevation ... first level administrative parent http://www.geonames.org/ontology#parentADM1, http://dbpedia.org/ontology/state, http://www.loc.gov/mads/rdf/v1#state, http://reference.data.gov/def/govdata/stateCode ... fourth level administrative parent http://www.geonames.org/ontology#parentADM4, ... geocoordinate http://www.freebase.com/schema/location/location/geolocation, http://www.georss.org/georss/point, http://schema.org/geo, http://www.mindswap.org/2003/owl/geo/geoFeatures20040307.owl# xyCoordinates, http://d-nb.info/standards/elementset/gnd#coordinates, http://rdvocab.info/Elements/stringsOfCoordinatePairs, http://www.w3.org/2003/01/geo/wgs84_pos#lat_long, http://purl.org/ontology/places#latlong ... Mappings of Feature for Location, Continued on next page 7.2. SEMANTIC HARMONIZATION OF FEATURES 129

is contained by http://dbpedia.org/property/subdivisionName, http://dbpedia.org/ontology/lieutenancyArea, http://dbpedia.org/property/region, http://www.freebase.com/schema/geography/island/island_group, http://www.freebase.com/schema/location, http://www.freebase.com/schema/location/location/containedby, http://schema.org/containedIn, http://sw-portal.deri.org/ontologies/swportal#inRegion, http://www.ontotext.com/proton/protontop#subRegionOf ... latitude http://www.w3.org/2003/01/geo/wgs84_pos#latitude, http://www.freebase.com/schema/location/geocode/latitude, http://ogp.me/ns#latitude, http://www.w3.org/2003/01/geo/wgs84_pos#lat, http://dbpedia.org/property/latitude, http://schema.org/latitude, http://www.ontotext.com/proton/protontop#latitude, http://www.w3.org/2003/12/exif/ns#gpsLatitude, http://www.w3.org/2006/vcard/ns#latitude ... latitude degree http://dbpedia.org/property/latDegrees, http://dbpedia.org/property/latDegrees, ... latitude direction http://dbpedia.org/property/latDirection, http://dbpedia.org/property/latNs, ... latitude minute http://dbpedia.org/property/latMinutes, http://dbpedia.org/property/latM, ... latitude second http://dbpedia.org/property/latitudeseconds, http://dbpedia.org/property/latS, ... location name http://www.geonames.org/ontology#alternateName, http://www.freebase.com/schema/location/country/iso3166_1_ shortname, http://www.geonames.org/ontology#officialName, http://xmlns.com/foaf/0.1/name, http://models.okkam.org/ENS-core-vocabulary#location_name, http://www.ontotext.com/proton/protonext#locationName, http://rdvocab.info/ElementsGr3/preferredNameForThePlace, http://schema.org/name, http://www.geonames.org/ontology#name ... location type http://dbpedia.org/property/subdivisionType, http://dbpedia.org/ontology/type, http://www.geonames.org/ontology#featureClass, http://www.freebase.com/schema/geography/geographical_feature/ category ... Mappings of Feature for Location, Continued on next page 130 CHAPTER 7. SEMANTIC AND STRUCTURAL HARMONIZATION

longitude http://dbpedia.org/property/long, http://www.w3.org/2003/01/geo/wgs84_pos#longitude, http://www.freebase.com/schema/location/geocode/longitude, http://schema.org/longitude, http://www.ontotext.com/proton/protontop#longitude, http://www.w3.org/2003/12/exif/ns#gpsLongitude, http://www.w3.org/2006/vcard/ns#longitude, http://ogp.me/ns#longitude ... longitude degree http://dbpedia.org/property/longtitudedegrees, http://dbpedia.org/property/longD ... longitude direction http://dbpedia.org/property/longDirection, http://dbpedia.org/property/longew ... longitude minute http://dbpedia.org/property/longtitudeminutes, http://dbpedia.org/property/lonMin ... longitude second http://dbpedia.org/property/longSeconds, http://dbpedia.org/property/longS ... picture url http://schema.org/photo, http://dbpedia.org/property/hasPhotoCollection, http://dbpedia.org/property/imageMap, http://schema.org/map, http://www.geonames.org/ontology#locationMap ... postal code http://dbpedia.org/ontology/postalCode, http://www.freebase.com/schema/location/mailing_address/postal_ code, http://www.geonames.org/ontology#postalCode, http://schema.org/postalCode, http://www.loc.gov/mads/rdf/v1#postcode, http://sw-portal.deri.org/ontologies/swportal#hasZipcode ... second level administration parent http://www.geonames.org/ontology#parentADM2 ... street address http://www.freebase.com/schema/architecture/structure/address, http://dbpedia.org/ontology/address, http://schema.org/address, http://purl.org/b2bo#address, http://ogp.me/ns#street-address, http://www.w3.org/2006/vcard/ns#street-address, http://sw-portal.deri.org/ontologies/swportal#hasStreetAddress ... third level administrative parent http://www.geonames.org/ontology#parentADM3, ... timezone http://dbpedia.org/property/timeZone, http://www.freebase.com/schema/location/location/time_zones, http://dbpedia.org/property/timezoneDst, http://vocab.org/transit/terms/timezone, http://www.w3.org/2006/vcard/ns#tz, http://www.aktors.org/ontology/support#in-timezone, http://www.w3.org/2006/timezone, http://www.ontotext.com/proton/protonext#TimeZone ... website http://xmlns.com/foaf/0.1/homepage, http://www.w3.org/2000/01/rdf-schema#seeAlso, http://www.geonames.org/ontology#wikipediaArticle, http://dbpedia.org/ontology/wikiPageExternalLink ... 7.2. SEMANTIC HARMONIZATION OF FEATURES 131

7.2.3 Mappings for Features of Organization

A sample of the mappings produced for the features associated with the type Organi- zation is presented in table 7.6.

Table 7.6: Mapping for features of type Organization

Feature Mappings URL activity sector http://www.freebase.com/schema/business/business_operation/ industry, http://dbpedia.org/ontology/industry, http://www.freebase.com/schema/sports/sports_team/sport, http://dbpedia.org/ontology/ideology, http://www.freebase.com/schema/education/education/specialization, http://dbpedia.org/property/fields, http://www.freebase.com/schema/music/artist/genre, http://reegle.info/schema#sector, http://kmm.lboro.ac.uk/ecos/1.0#sector http://www.ontotext.com/ proton/protonext#industryOf, http://umbel.org/umbel#relatesToMarketIndustry, http://purl.org/ontology/sport/discipline ... activity start year http://dbpedia.org/ontology/activeYearsStartYear, http://dbpedia.org/property/startYear, http://purl.org/ontology/mo/activity_start, http://dbpedia.org/ontology/openingYear, http://www.freebase.com/schema/book/newspaper_circulation/date ... associated with http://dbpedia.org/property/partner, http://dbpedia.org/property/sisterNames, http://dbpedia.org/property/sisterCompany, http://www.freebase.com/schema/religion/religious_organization/ associated_with, http://www.freebase.com/schema/business/company_brand_relationship/ company, http://dbpedia.org/property/associatedSchools, http://www.ontologydesignpatterns.org/ont/dul/DUL.owl# associatedWith, http://www.freebase.com/schema/music/artist/label ... award http://www.freebase.com/schema/award/award_nomination/award, http://www.freebase.com/schema/award/award_winner/awards_won, http://schema.org/awards, ... city http://dbpedia.org/ontology/locationCity, http://dbpedia.org/ontology/hometown, http://dbpedia.org/property/city, http://vivoweb.org/ontology/core#addressCity, http://www.mindswap.org/2003/owl/geo/geoFeatures20040307.owl#city_ name, ... Continued on next page 132 CHAPTER 7. SEMANTIC AND STRUCTURAL HARMONIZATION

color http://dbpedia.org/ontology/colourName, http://www.freebase.com/schema/business/brand/colors, http://www.freebase.com/schema/education/educational_institution/ colors, http://www.freebase.com/schema/sports/sports_team/colors, http://dbpedia.org/ontology/officialSchoolColour, http://vocab.org/transit/terms/color, http://rdf.muninn-project.org/ontologies/appearances#htmlColor ... controlled by http://www.freebase.com/schema/business/asset_ownership/owner, http://www.freebase.com/schema/organization/organization/parent, http://www.freebase.com/schema/business/shareholder/holding, http://dbpedia.org/ontology/parentOrganisation, http://dbpedia.org/ontology/owningCompany, ... controls http://www.freebase.com/schema/business/brand/includes_brands, http://www.freebase.com/schema/organization/organization/companies_ acquired, http://dbpedia.org/ontology/childOrganisation, http://www.freebase.com/schema/organization/organization_ relationship/child, http://dbpedia.org/property/subsid... country http://dbpedia.org/property/country, http://dbpedia.org/property/nat, http://www.w3.org/2006/vcard/ns#country-name, http://www.loted.eu/ontology#CY, http://reference.data.gov/def/govdata/country ... description http://dbpedia.org/ontology/purpose, http://dbpedia.org/ontology/abstract, http://usefulinc.com/ns/doap#description, http://vivoweb.org/ontology/core#description, http://purl.org/vocab/aiiso/schema#description, http://vocab.data.gov/def/fea#description, http://schema.org/description ... dissolution date http://dbpedia.org/ontology/extinctionDate, http://www.freebase.com/schema/exhibitions/exhibition_run/closed_ on, http://www.freebase.com/schema/music/artist/active_end, http://www.freebase.com/schema/business/defunct_company/ceased_ operations, http://purl.org/ontology/mo/activity_end ... domain tag http://dbpedia.org/property/background, http://purl.org/dc/terms/subject, http://dbpedia.org/property/freeLabel, http://schemas.talis.com/2005/dir/schema#tag, http://commontag.org/ns#label ... Continued on next page 7.2. SEMANTIC HARMONIZATION OF FEATURES 133

email address http://xmlns.com/foaf/0.1/mbox, http://purl.org/b2bo#mailto, http://vivoweb.org/ontology/core#email, http://www.w3.org/2006/vcard/ns#email, http://www.w3.org/2000/10/swap/pim/contact#emailAddress, http://schema.org/email, http://swrc.ontoware.org/ontology#email, http://vivoweb.org/ontology/core#primaryEmail, http://www.aktors.org/ontology/portal#has-email-address, ... email address hashcode http://rdfs.org/sioc/ns#email_sha1, http://xmlns.com/foaf/0.1/mbox_sha1sum, ... end date http://www.oegov.org/core/owl/gc#endDate, http://d-nb.info/standards/elementset/agrelon.owl#hasEndDate, http://spitfire-project.eu/ontology/ns/activityEnd, ... fax http://sw-portal.deri.org/ontologies/swportal#hasFax, http://swrc.ontoware.org/ontology#fax, http://vivoweb.org/ontology/core#faxNumber, http://www.loc.gov/mads/rdf/v1#fax, http://schemas.talis.com/2005/address/schema#fax, http://schema.org/faxNumber, http://www.w3.org/2001/vcard-rdf/3.0/fax ... foundation date http://dbpedia.org/property/fDate, http://dbpedia.org/ontology/foundingDate, http://www.freebase.com/schema/sports/sports_team/founded, http://www.freebase.com/schema/organization/organization/date_ founded, http://dbpedia.org/property/foundedDate, http://dbpedia.org/ontology/anniversary, http://www.oegov.org/core/owl/gc#foundedOn, http://schema.org/foundingDate, http://rdvocab.info/ElementsGr2/dateOfEstablishment ... founded by http://dbpedia.org/ontology/foundedBy, http://schema.org/founders, http://www.freebase.com/schema/organization/organization/founders, http://dbpedia.org/property/foundingPersonName, http://schema.org/founder, http://d-nb.info/standards/elementset/gnd#founder, http://d-nb.info/standards/elementset/agrelon.owl#hasFounder ... geocoordinate http://www.georss.org/georss/point, http://www.freebase.com/schema/location/location/geolocation ... has foundation place http://www.freebase.com/schema/organization/organization/place_ founded, http://dbpedia.org/ontology/foundationPlace, http://dbpedia.org/property/foundedPlace, http://dbpedia.org/property/ceo ... Continued on next page 134 CHAPTER 7. SEMANTIC AND STRUCTURAL HARMONIZATION

has key people http://dbpedia.org/ontology/leader, http://dbpedia.org/property/principal, http://www.freebase.com/schema/organization/leadership/person, http://www.freebase.com/schema/ice_hockey/hockey_team/coach, http://dbpedia.org/ontology/coach, ... has location http://dbpedia.org/property/regionServed, http://www.freebase.com/schema/organization/organization/locations, http://www.freebase.com/schema/organization/organization/ geographic_scope, http://www.freebase.com/schema/broadcast/producer/location, ... has members http://www.freebase.com/schema/government/governmental_body/ members, http://www.freebase.com/schema/music/concert_performance/band_ members, http://www.freebase.com/schema/government/political_party/ politicians_in_this_party, http://www.freebase.com/schema/basketball/basketball_roster_ position/player, http://www.freebase.com/schema/education/education/student ... has parts http://dbpedia.org/ontology/division, http://www.freebase.com/schema/education/educational_institution/ subsidiary_or_constituent_schools, http://www.freebase.com/schema/education/university/departments ... involved in http://www.freebase.com/schema/film/film_festival_sponsor/ festivals_sponsored, http://www.freebase.com/schema/film/film_festival_sponsorship/ festival, http://www.freebase.com/schema/conferences/conference_sponsor/ conferences ... is part of http://www.freebase.com/schema/baseball/baseball_team/division, http://dbpedia.org/property/district, http://dbpedia.org/property/league ... jurisdiction http://www.freebase.com/schema/government/governmental_body/ jurisdiction, http://dbpedia.org/ontology/jurisdiction, http://dbpedia.org/property/league, http://www.agls.gov.au/agls/terms/jurisdiction, http://purl.org/dc/elements/1.1/coverage, http://www.freebase.com/schema/organization/organization/ geographic_scope, ... latitude http://dbpedia.org/property/latitude, http://www.w3.org/2003/01/geo/wgs84_pos#lat, http://www.freebase.com/schema/location/geocode/latitude, http://www.w3.org/2006/vcard/ns#latitude, http://dbpedia.org/property/lat ... Continued on next page 7.2. SEMANTIC HARMONIZATION OF FEATURES 135

longitude http://www.w3.org/2003/12/exif/ns#gpsLongitude, http://www.w3.org/2006/vcard/ns#longitude, http://dbpedia.org/property/longitude, http://schema.org/longitude, http://www.freebase.com/schema/location/geocode/longitude ... name http://xmlns.com/foaf/0.1/givenName, http://dbpedia.org/property/nonProfitName, http://dbpedia.org/ontology/teamName, http://dbpedia.org/property/fullName, http://dbpedia.org/property/companyName, http://dbpedia.org/property/nativeName, http://schema.org/name, http://purl.org/b2bo#name, http://schema.org/legalName, http://purl.org/goodrelations/v1#legalName, http://www.ontotext.com/proton/protontop#doingBusinessAs ... offers http://dbpedia.org/property/product, http://www.freebase.com/schema/dining/restaurant/cuisine, http://www.freebase.com/schema/cvg/cvg_developer/games_developed, http://www.freebase.com/schema/broadcast/producer/produces, http://www.freebase.com/schema/automotive/company/make_s, http://www.freebase.com/schema/film/production_company/films, http://www.freebase.com/schema/music/artist/track, http://www.freebase.com/schema/music/artist/album, http://www.freebase.com/schema/opera/opera_company/operas_produced, http://schema.org/makesOffer, http://data.archiveshub.ac.uk/def/isPublisherOf, http://purl.org/spar/cito/isCreditedBy, http://purl.org/goodrelations/v1#offers ... organization type http://dbpedia.org/ontology/type, http://dbpedia.org/ontology/governmentType, http://dbpedia.org/property/companyType, http://www.freebase.com/schema/education/educational_institution/ school_typeof_type ... participant in http://dbpedia.org/property/battles, http://www.freebase.com/schema/music/artist/concerts, http://dbpedia.org/ontology/season, http://www.freebase.com/schema/music/artist/concert_tours, http://dbpedia.org/property/event, http://schema.org/event ... phone number http://xmlns.com/foaf/0.1/phone, http://www.w3.org/2006/vcard/ns#tel, http://vivoweb.org/ontology/core#phoneNumber, http://www.w3.org/2000/10/swap/pim/contact#Phone, http://www.loc.gov/mads/rdf/v1#phone, http://schema.org/telephone, http://www.ontotext.com/proton/protonext#hasMobilePhone, ... Continued on next page 136 CHAPTER 7. SEMANTIC AND STRUCTURAL HARMONIZATION

photo http://xmlns.com/foaf/0.1/depiction, http://dbpedia.org/property/imageName, http://dbpedia.org/ontology/thumbnail, http://schema.org/image, http://schema.org/logo, http://xmlns.com/foaf/0.1/logo, http://www.semanticdesktop.org/ontologies/2007/03/22/nco#logo, http://data.press.net/ontology/stuff/hasImage ... postal code http://dbpedia.org/ontology/postalCode, http://dbpedia.org/property/zipcode, http://sw-portal.deri.org/ontologies/swportal#hasZipcode, http://www.freebase.com/schema/location/mailing_address/postal_ code, http://www.w3.org/2001/vcard-rdf/3.0/postal-code, http://schema.org/postalCode ... previous name http://dbpedia.org/property/formerNames, http://ndl.go.jp/dcndl/terms/previousName, http://www.freebase.com/schema/organization/organization/previous_ names, http://www.freebase.com/schema/location/mailing_address/postal_ code, http://www.w3.org/2001/vcard-rdf/3.0/postal-code, http://schema.org/postalCode ... public institutional id http://schema.org/vatID, http://purl.org/goodrelations/v1#taxID, http://purl.org/b2bo#nip ... slogan http://dbpedia.org/property/currentMottos, http://www.freebase.com/schema/business/brand_slogan/slogan, http://dbpedia.org/property/companySlogan, http://www.freebase.com/schema/education/educational_institution/ motto ... street address http://www.w3.org/2001/vcard-rdf/3.0/street-address, http://models.okkam.org/ENS-core-vocabulary#street_address, http://dbpedia.org/property/address, http://www.freebase.com/schema/business/business_location/address, http://www.freebase.com/schema/book/newspaper/headquarters ... street address http://www.w3.org/2001/vcard-rdf/3.0/street-address, http://models.okkam.org/ENS-core-vocabulary#street_address, http://dbpedia.org/property/address, http://www.freebase.com/schema/business/business_location/address, http://www.freebase.com/schema/book/newspaper/headquarters ... website http://dbpedia.org/property/web, http://dbpedia.org/ontology/wikiPageExternalLink, http://dbpedia.org/property/homepage, http://xmlns.com/foaf/0.1/homepage, http://data.lirmm.fr/ontologies/passim#webSite ... 7.2. SEMANTIC HARMONIZATION OF FEATURES 137

7.2.4 Remarks on Contextual Mappings

The lists of mappings produced and briefly displayed in tables 7.4, 7.5 and 7.6 are the result of the effort of the author of this work to bootstrap the knowledge-based solution. It is clear that the effort required to maintain such an extensive knowledge base is not scalable on a single person, but it seems suitable for a community effort. For this reason, we are working on developing a web platform aimed at collecting users' contributions in the maintenance of both the ontology and the mappings. This platform will make available through REST APIs the mappings produced by the community to support semantic harmonization tasks in external, third party applications. Building communities around these tasks may not be easy at first, and although we plan to sustain the effort in the long term, alternative complementary solutions have to be explored. For example, when no mapping is available for an attribute, it could be important to categorize that attribute based on other types of probabilistic knowledge. In [6], the author proposes PropLit as a pragmatic probabilistic method to exploit an inverted index to guess the type of an attribute given its value. The method was experimentally evaluated on the Billion Triple Challenge dataset2, relying on an intuitive adaptation of cosine similarity tailored to the goal of the problem. Essentially, a large set of RDF documents was indexed with an Apache Lucene3 index, which allowed estimating how many times a specific value was used to instantiate a property. This in turn allowed estimating the type of an attribute given its value. However, as pointed out by the author, this approach has the limit of being able to predict the type only of attributes whose value was previously indexed. Alternative approaches rely on the syntactic representation of the attribute value to estimate a classification probability, see for example [144]. These techniques, based on probabilistic models such as Conditional Random Fields, are commonly used also in Natural Language Processing to solve the problem of Named Entity Recognition for different types of entities, see for example [90]. On one side, this type of solution guarantees wide applicability of the prediction based on a limited training set. On the other side, relying purely on the syntactic representation of the attribute value to guess the type, these techniques cannot completely capture the semantics of the attribute type. Namely, it is possible to recognize that an attribute is a date, but not that it is the date of birth, or the foundation date of a company. However, a combination of these two types of solutions might be explored in the future to support a semi-automatic solution of the semantic heterogeneity problem.
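As a rough illustration of the counting intuition behind this family of methods (and not of the actual PropLit implementation, which relies on an Apache Lucene index and a cosine-like score), consider the following sketch; the class and method names are illustrative.

import java.util.HashMap;
import java.util.Map;

// A minimal sketch of value-based property guessing: count how often a value instantiates
// each property, then return the most frequent property for a given value.
public class ValueTypeGuesser {
    // value -> (property -> number of times the value instantiates the property)
    private final Map<String, Map<String, Integer>> counts = new HashMap<>();

    public void index(String property, String value) {
        counts.computeIfAbsent(value, v -> new HashMap<>())
              .merge(property, 1, Integer::sum);
    }

    /** Returns the property most frequently instantiated with the given value, or null. */
    public String guessProperty(String value) {
        Map<String, Integer> propertyCounts = counts.get(value);
        if (propertyCounts == null) return null; // value never indexed: no prediction possible
        return propertyCounts.entrySet().stream()
                .max(Map.Entry.comparingByValue())
                .map(Map.Entry::getKey)
                .orElse(null);
    }
}

As noted above, such a counter can only predict properties for values it has already seen, which is exactly the limitation pointed out for the index-based approach.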

2http://km.aifb.kit.edu/projects/btc-2010/
3http://lucene.apache.org/

7.3 Structural Harmonization of Features

Once the attributes are harmonized, we can also exploit specific syntactic knowledge about the harmonized features to attempt to reduce possible differences among the descriptions caused by structural heterogeneity (e.g. aggregating first name and last name to produce a name attribute).
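A minimal sketch of such an aggregation, assuming the canonical feature names used in this chapter and a simplified single-valued representation of descriptions, could look as follows; it is an illustration of the composition idea, not the actual implementation.

import java.util.Map;

// Minimal sketch of a composition transformation: when a description has "first name" and
// "last name" but no "name", a name attribute is derived by aggregating the two parts.
public class NameComposition {
    public static void addComposedName(Map<String, String> description) {
        if (!description.containsKey("name")
                && description.containsKey("first name")
                && description.containsKey("last name")) {
            description.put("name",
                    description.get("first name") + " " + description.get("last name"));
        }
    }
}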

Table 7.7: Examples of structurally heterogeneous descriptions for date and name representation

Description 1:
  name: Tom Jobim
  name: Antônio Carlos Brasileiro de Almeida Jobim
  name: Jobim, Antonio Carlos
  birthdate: 1927-01-25

Description 2:
  firstName: Antônio Carlos
  surname: Jobim
  family name: de Almeida Jobim
  day of birth: 25
  month of birth: 1
  year of birth: 1927

The problem of structural heterogeneity is not second to the one of semantic heterogeneity when dealing with knowledge-based entity matching. In fact, if the attributes are not represented at the same level of granularity, they cannot be compared accurately even though they present matching sets of information. For example, there exist attributes that can be represented as a whole, or split into sub-parts. Consider for example names, dates, street addresses, and geo-coordinates. These types of attributes can be represented as a whole, as in description 1 of table 7.7, or split into their parts as in description 2 of table 7.7. These descriptions contain comparable attributes, but the different level of granularity in the representation of the data makes them practically not comparable. In order to ease this type of problem we defined a set of transformation functions aimed at composing and de-composing attributes when these are recognized to be of some type from a syntactical perspective. An approach based on transformation functions was proposed in [112], where the authors propose to rely on a set of transformation functions to produce a transformation graph that is used to compute a similarity score between descriptions. In this context we do not aim at relying on transformation functions to compute a score, but rather we rely on these transformations to normalize the data representation where possible. The implemented process relies on the metadata defined in the identification ontology for the different features. In particular, for attributes of the date type, an extensible set of patterns is defined to parse the date string using Java SimpleDateFormat4 and isolate its components. Examples of patterns are listed below:

dd-MM-yyyy
yyyy-MM-dd
yyyy MM dd
MM-dd-yyyy
...

If a date string can be parsed by any of the patterns, then the single parts of the string can be used to create instances of the features describing its sub-parts.
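A minimal sketch of this parsing step is shown below; the pattern list is the one given above, while the class and method names are illustrative and not part of the actual implementation.

import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Calendar;
import java.util.Date;

// Sketch of the date decomposition step: the first pattern that parses the string is used to
// derive the day/month/year sub-parts, which in the real process become new feature instances
// such as "day of birth", "month of birth" and "year of birth".
public class DateDecomposer {
    private static final String[] PATTERNS = { "dd-MM-yyyy", "yyyy-MM-dd", "yyyy MM dd", "MM-dd-yyyy" };

    public static int[] decompose(String dateString) {
        for (String pattern : PATTERNS) {
            SimpleDateFormat format = new SimpleDateFormat(pattern);
            format.setLenient(false); // reject strings that only loosely match the pattern
            try {
                Date parsed = format.parse(dateString);
                Calendar calendar = Calendar.getInstance();
                calendar.setTime(parsed);
                return new int[] { calendar.get(Calendar.DAY_OF_MONTH),
                                   calendar.get(Calendar.MONTH) + 1, // Calendar months are 0-based
                                   calendar.get(Calendar.YEAR) };
            } catch (ParseException ignored) {
                // try the next pattern
            }
        }
        return null; // no pattern matched: the attribute is left as it is
    }
}

For instance, decompose("1927-01-25") would be expected to yield 25, 1 and 1927, which then instantiate the corresponding sub-part features.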

4http://docs.oracle.com/javase/6/docs/api/java/text/SimpleDateFormat.html

Table 7.8: Examples of structurally heterogeneous descriptions transformed to ease the heterogeneity problem

Description 1:
  name: Tom Jobim
  name: Antônio Carlos Brasileiro de Almeida Jobim
  name: Jobim, Antonio Carlos
  first name: Tom
  first name: Antonio Carlos
  name: Antonio Carlos Jobim
  first name: Antônio Carlos Brasileiro
  family name: Jobim
  family name: de Almeida Jobim
  birthdate: 1927-01-25
  day of birth: 25
  month of birth: 1
  year of birth: 1927

Description 2:
  firstName: Antônio Carlos
  family name: Jobim
  name: Antônio Carlos de Almeida Jobim
  name: Antônio Carlos Jobim
  family name: de Almeida Jobim
  day of birth: 25
  month of birth: 1
  year of birth: 1927
  birthdate: 1927-01-25

In the case of the date feature, the features produced would be day of birth, month of birth and year of birth. A similar approach was taken with the names of persons and with geo-coordinates. In order to decompose a name string, we relied on the Yago NAGA Javatools implementation of name utilities5. Therefore, after applying such transformation functions, the descriptions outlined in table 7.7 become like the ones outlined in table 7.8. It is important to remember that besides decomposition patterns, it is possible to apply also composition patterns for names. In fact, given the parts of a name, it is possible to compose a few syntactical variations of the name attribute to ease syntactical matching issues. For example, a description of an entity could present all the information necessary to take a matching decision in a textual paragraph, whereas another one could present the same amount of information shredded into a list of attributes. Take for example the descriptions presented in table 7.9. For a person, comparing these descriptions is quite natural. In fact, many essential pieces of information about the Brazilian musician Tom Jobim are present both in description 1 and in description 2. However, details such as the date of birth and date of death, the complete name, and the city of death are present only in the descriptive paragraph of description 2. Thereby, considering the semantics of the attributes, only a few attributes can be compared (i.e. name, domain tag, and occupation). An effective way to ease this type of structural heterogeneity would be to extract features from the textual paragraph so that they can be compared with the other features. For the moment, we do not particularly tackle this problem.

5http://www.mpi-inf.mpg.de/yago-naga/javatools/

Table 7.9: Examples of structurally heterogeneous matching descriptions

Description 1:
  affiliation: MCA Records
  website: http://www.tomjobim.com.br/
  domain tag: Category:Música Popular Brasileira pianists
  occupation: Singer
  name: Tom Jobim
  domain tag: Category:Brazilian singer-songwriters
  birthdate: 1927-01-25
  affiliation: Philips Records
  domain tag: Category:Verve Records artists
  birthplace: Rio de Janeiro, Brazil
  name: Antônio Carlos Brasileiro de Almeida Jobim
  name: Jobim, Antonio Carlos
  domain tag: Música Popular Brasileira
  last name: Jobim
  occupation: solo singer
  website: http://www.thebraziliansound.com/jobim.htm
  domain tag: Category:Cardiovascular disease deaths in New York
  city of residence: Rio de Janeiro (state)
  affiliation: A&M Records
  date of death: 1994-12-08
  domain tag: Category:Grammy Award winners
  affiliation: Decca Records

Description 2:
  name: Antônio Carlos Jobim
  picture URL: http://userserve-ak.last.fm/serve/_/2245888/Antnio+Carlos+Jobim.jpg
  domain tag: bossa nova
  domain tag: jazz
  domain tag: brazilian
  domain tag: mpb
  domain tag: latin
  description: Antônio Carlos Brasileiro de Almeida Jobim (born January 25, 1927 in Rio de Janeiro, Brazil – December 8, 1994 in New York City), also called Tom Jobim, was a Brazilian composer, arranger, singer, pianist and perhaps the greatest legend of bossa nova. Jobim's compositions, many performed by Joao Gilberto, gave birth to the genre in the early 1960s. Jobim's roots were planted firmly in the works of Pixinguinha, a legendary musician and composer who, in the 1930s, began the development of modern Brazilian music. He was also influenced by the music of French composer Claude Debussy and by jazz.
  occupation: Musician
  occupation: Artist

Chapter 8

Building Rules for Open Entity Matching

The construction of rules for open entity matching is a central point of the solution proposed in this work. In chapter 4, we outlined the ingredients necessary to define a knowledge-based solution for entity matching. In particular, in chapter 6 we formally defined the entity matching rules, and several tools for their interpretation, application and satisfaction. We also framed the rule-based solution under the Open World Assumption, defining a precise semantics of the rules and a formal interpretation related to their satisfaction or partial satisfaction. The core of the solution is that when no rule can be completely satisfied, the classification result is unknown. In this chapter we describe two complementary and opposite approaches we relied on to construct entity matching rules. On the one hand, we adopt a top-down approach, aimed at exploiting the results of an ontological analysis of the meta-properties of the features defined in the identification ontology. On the other hand, we rely on a bottom-up approach aimed at extracting entity matching rules starting from a set of labeled samples, relying on machine learning techniques. Relying on top-down matching rules alone would hardly produce satisfying matching results, as by considering single attributes we are very unlikely to produce many positive matching decisions. In fact, inverse-functional attributes suitable to be employed in an open context are rare and rarely usable. However, it would be a pity not to match descriptions when we can easily draw conclusions from such types of attributes. On the other side, top-down rules are very likely to exploit deductions related to functional properties, which can be used to produce negative matching decisions. The top-down construction of rules is described in section 8.1. Learning very general matching rules starting from a training set of labeled samples seems to be extremely challenging.

In fact, however heterogeneous and complete a training set may be, it can never be representative of all the possible varieties of representations of the entity types considered. Nevertheless, being aware that a complete solution is practically unachievable, we believe that modern machine learning techniques can help in capturing part of the knowledge employed by people in solving this task and produce reliable entity matching rules. For this reason, in this work we implemented experiments aimed at extracting matching rules in a bottom-up fashion, testing different hypotheses and learning approaches. A detailed description of the process leading to the bottom-up construction of entity matching rules is presented in section 8.2. Both the top-down and the bottom-up approaches present shortcomings related to the fact that both have limits in dealing with the entity matching problem in the open world. Integrating the rules resulting from these complementary processes can lead to a more complete and sound set of matching rules. However, the integration of rules coming from different sources creates several issues we have to deal with. A detailed discussion about the top-down / bottom-up rule integration process is presented in section 8.3.

8.1 Top-down Rule Construction

In section 8.2, we present a method for eliciting entity matching rules in a bottom-up fashion, relying on machine learning techniques. This approach is essential to estimate numerical thresholds for the types of attributes, and their weights when considering score-based similarity metrics. However, it is clear that this approach has drawbacks. On one side, we can attempt to learn which combinations of attributes people would use to take matching decisions, but on the other side we introduce a bias on what can be learned, given by the richness of the sources chosen to form the training set. A simple example of this bias is given by the fact that attributes that are considered to be excellent identifiers, such as email address and social security number, are hardly contained in public sources. One could argue that this bias can be reduced by extending the set of considered sources to cover all possible cases. However, pursuing this type of completeness seems to be hardly achievable and does not seem convenient. First of all, establishing when all the cases are covered becomes fuzzy, as it would require knowing about all the existing data sources. Complete knowledge, and thus an optimal solution, is not achievable when dealing with the Web and its complexity. However, an incremental approach where further sources are added with time seems to be a viable path to improve the capability of learning entity matching rules. For this reason we aim at integrating the bottom-up extracted rules with a set of rules defined relying on ontological analysis principles. In particular, we are going to rely on the meta-properties for identification outlined in section 5.4 to produce a set of positive and negative matching rules aimed at extending and increasing the matching capability of the knowledge-based solution proposed so far. It is important to underline the fact that the method we are going to propose relies on pure ontological analysis methods, and thus it can be applied to extract matching rules for any type of entity depending on the ontology of reference. The meta-properties presented in section 5.4 are defined by extending the traditional ontological (functional and inverse-functional) meta-properties, proposing an analysis based on two main dimensions:

• Time dimension. It is important to take into consideration how and whether the values of certain properties can evolve in time. In particular, we propose to distinguish between synchronic and diachronic properties. Synchronic properties are useful for identification at a fixed point in time. Diachronic properties are useful for identification at different points in time.

• Scope dimension. It is also important to take into consideration the scope of interpretation of properties. In particular, we propose to distinguish between global and private properties. Global properties can be interpreted uniformly in any context. Private properties can be interpreted correctly only in a limited, bounded context.

It is important to notice that stretching the concepts proposed above from a philosophical perspective is out of the scope of this work, and we pragmatically limit our analysis to what is useful in supporting the solution of the entity matching problem in an open context. It is also important to notice that any top-down rule defined in this context is assumed to be valid neglecting the problems related to imperfect string similarity metrics, or odd matching cases. In fact, the impact of such rules will have to be evaluated through experiments. Functional diachronic properties with a global scope can be used to perform the following reasoning: if p is functional and diachronic, and there are two descriptions d1 ∈ D and d2 ∈ D such that ∃p((a1 ∈ δd(d1) ∧ p ∈ a1) ∧ (a2 ∈ δd(d2) ∧ p ∈ a2)), then va1 ≠ va2 implies that d1 and d2 do not match. Intuitively, if there can be only one value at any time for a specific property of an entity, and the value of that property does not match in two descriptions, these cannot be about the same real world entity. Thereby, it is possible to construct a negative matching rule ρ : <p, <, 1.0, NON-MATCH>. Properties of this type are, for example, birthday and birthplace for a person, longitude and latitude for a location, and foundation date and foundation place for a company, among others. Conversely, inverse-functional diachronic properties with a global scope can be used to perform the following reasoning: if p is inverse-functional and diachronic, and there are two descriptions d1 ∈ D and d2 ∈ D such that ∃p((a1 ∈ δd(d1) ∧ p ∈ a1) ∧ (a2 ∈ δd(d2) ∧ p ∈ a2)), then va1 = va2 implies that d1 and d2 match. Intuitively, if the value of a property can be associated with only one entity, then if two descriptions present the same value for that property, it is possible to conclude that the descriptions are about the same real world entity. Thereby, it is possible to construct a positive matching rule ρ : <p, ≥, 1.0, MATCH>. Properties of this type are, for example, SSN and personal email for persons, geocoordinates for locations, and VAT number for organizations, among others. In the previous paragraphs we defined principles for the definition of rules relying on ontological properties. However, functional and inverse-functional diachronic properties with a global scope are quite rare. On the other side, we can easily think of functional properties that are global but not strictly diachronic as identity criteria. Examples of these properties are surname, wife, city, state and street address of residence, among others. These properties, being not strictly diachronic, do not guarantee stability in time. Thereby, when interpreted in the open asynchronous global space of the Web, they cannot be used reliably to produce negative matching rules by themselves. A similar analysis can be done for inverse-functional, not rigidly diachronic properties. However, the fact that a property is not strictly diachronic does not imply that it is also strictly synchronic. Namely, its value does not necessarily change in time: it may change, but not necessarily. Adopting the notation of OntoClean, one could say that these properties are semi-diachronic. We could exploit the soft or semi-diachronicity of both functional and inverse-functional properties to collectively produce some matching rules. The intuition is that if a single property alone is not suitable to produce a rule, a combination of such properties is likely to produce a robust rule.
Hence, in this work we propose to explore functional and inverse-functional non-strictly diachronic properties for guessing positive and negative matching rules. In particular, we propose:

• given an enumeration of m functional and inverse-functional properties, produce all permutations of cardinality k, and interpret each permutation as a positive matching rule.

• given an enumeration of m functional properties, produce all permutations of cardinality j, and interpret each permutation as a negative matching rule.

As a further constraint, we assume that all permutations must contain the attribute name, as a basic minimal condition shared by all the guessed rules; a minimal sketch of this permutation step is given below. The optimal values of k and j have to be estimated through experiments. Once we have generated these rules based on the ontological analysis of the properties defined in the Identification Ontology, we have to combine them with the rules resulting from the bottom-up approach. Intuitively, this process should be different from the merging process proposed for rules learned using different classifiers. The conceived solutions are presented in section 8.3. The impact of the top-down generated rules will be evaluated through experiments in chapter 10.
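As a minimal illustrative sketch of this rule-guessing step (all class and method names are hypothetical and not part of the actual implementation; permutations are enumerated here as unordered combinations, under the assumption that the order of properties within a rule does not matter), positive candidate rules of cardinality k over a list of functional and inverse-functional properties could be generated as follows; negative candidates over functional properties are obtained in the same way with cardinality j:

import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative sketch: enumerate candidate positive matching rules as all
// k-subsets of the given properties that contain the mandatory "name" attribute.
public class RuleGuesser {

    /** A guessed rule: all listed properties must match for the given decision. */
    public record CandidateRule(Set<String> properties, String decision) {}

    public static List<CandidateRule> guessPositiveRules(List<String> properties, int k) {
        List<CandidateRule> rules = new ArrayList<>();
        combine(properties, 0, new ArrayDeque<>(), k, rules);
        return rules;
    }

    private static void combine(List<String> props, int start, Deque<String> current,
                                int k, List<CandidateRule> out) {
        if (current.size() == k) {
            // keep only combinations containing the mandatory "name" attribute
            if (current.contains("name")) {
                out.add(new CandidateRule(new HashSet<>(current), "MATCH"));
            }
            return;
        }
        for (int i = start; i < props.size(); i++) {
            current.addLast(props.get(i));
            combine(props, i + 1, current, k, out);
            current.removeLast();
        }
    }

    public static void main(String[] args) {
        // hypothetical list of semi-diachronic functional/inverse-functional properties
        List<String> properties = List.of("name", "surname", "city", "street_address", "employer");
        guessPositiveRules(properties, 3).forEach(System.out::println);
    }
}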

8.2 Learning Rules for Open Entity Matching

In section 8.1 we present a method for building entity matching rules relying on meta-properties of the features defined in the Identification Ontology. Such rules can contribute to taking matching decisions, but they are likely to be applicable to a relatively small set of cases. In fact, most of the top-down positive matching rules rely on some sort of global identifiers (e.g. email address, and public institutional id), which we can hardly find in web resources. In this context we want to describe a method for learning entity matching rules suitable for Open World Entity Matching, a reformulation of the traditional entity matching problem in the context of the Web, assuming its scale, mutability, heterogeneity and possible inconsistencies. As mentioned in chapter 2, there are several issues we have to deal with:

1. semantic heterogeneity: descriptions are often represented according to different vocabularies and schemas.

2. structural heterogeneity: attributes are often represented at different levels of granularity. There are descriptions that wrap most of the information in a generic descriptive paragraph, and others that rely on a wide set of attributes.

3. underspecification: a description could be underspecified with respect to its interpretation in an open context, possibly omitting implicit contextual information, and thus causing problems of ambiguity (e.g. a description of a restaurant could omit the name of the city among the attributes used for the description, relying only on the street name and number).

4. over-specification: a description could be over-specified, presenting an excessive amount of information that is relevant or interpretable only within a specific context [111].

Intuitively, the larger the context that is considered when making a matching decision, the higher the possibility of dealing with underspecified descriptions; learned rules should be able to capture this aspect, and thus we should avoid learning rules tuned on underspecified samples. Bottom-up rule extraction should make no assumption about the quality of the information involved in a matching process, and the rules should consider the Open World Assumption (the truth-value of a matching statement is independent of whether or not it is known to be true by any single observer), treating the unknown matching cases as an explicit possibility as described in section 6. The process of learning matching rules has already been studied in different contexts, both in information systems, as shown in section 3.1.3, and in the (semantic) web context, as shown in section 3.2. Defining a rule-based system to solve this type of problem is attractive, as rules are usually self-explanatory, explicit, and also manually configurable without any specific know-how about the way they will be employed. Consider for example SILK [147], the most popular solution for entity matching on Linked Data. The solution basically consists in manually defining and maintaining matching rules to match descriptions in pairs of datasets. However, the main drawback of manually defined rules is that they require a large amount of human effort to define and maintain them [58]. In the past years, many works proposed solutions to learn rules for entity matching, among others [61, 80, 143, 161, 149, 88, 41, 118]. However, all of them proposed and evaluated methods optimized for solving entity matching on pairs of datasets. Such rules proved to be effective on the evaluated datasets, but their application in an open generic context was never considered. In the following sections we propose a method for learning matching rules that are not specific to any pair of datasets and that in principle can be employed to match any pair of descriptions. Furthermore, our goal is to capture in the rules the common knowledge used by people in solving the entity matching problem, under the assumption that, in an open context affected by knowledge deficiency, no automatic method can top human supervised matching decisions. The process for the extraction of entity matching rules consists in a sequence of simple steps:

• Collect samples of descriptions representative of the heterogeneity of the Web;

• Harmonize the semantics of the attributes;

• Label pairs of samples to create a training set;

• Extract rules relying on a Decision Tree learner;

These steps are depicted in figure 8.1 and described in the following sections; a schematic view of the flow is also sketched below. In particular, section 8.2 presents a description of the training set generation process, outlining details about the data samples collection and the process of selection of sample pairs for labeling. Section 8.2.4 presents some details about the method used to learn a decision tree, including some considerations about the usage of sample filters.
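As a purely schematic illustration of how these steps chain together (all type and method names here are hypothetical, not part of the actual implementation), the pipeline can be thought of as:

import java.util.List;

// Schematic sketch of the rule-extraction pipeline; the generic type parameters
// stand for the entity descriptions, the labeled sample pairs and the extracted rules.
public interface RuleExtractionPipeline<Description, LabeledPair, Rule> {

    // Step 1: collect description samples representative of the heterogeneity of the Web.
    List<Description> collectSamples();

    // Step 2: harmonize the semantics of the attributes against the Identification Ontology.
    List<Description> harmonize(List<Description> rawDescriptions);

    // Step 3: have human users label pairs of samples (match / non match / don't know).
    List<LabeledPair> label(List<Description> harmonizedDescriptions);

    // Step 4: learn a decision tree over the labeled pairs and extract matching rules from it.
    List<Rule> extractRules(List<LabeledPair> trainingSet);
}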

Figure 8.1: Rules Extraction Process

8.2.1 Data Samples Collection

In this section we formalize a process to gather entity descriptions that are going to form the dataset on which matching rules have to be learned. Aiming at a knowledge-based solution, we need to specify a priori the types of entities we intend to rely on. At this point, we assume that a set of types T : {t1,...,tn} has been specified in the Identification Ontology as described in section 5. As mentioned in the introduction of the chapter, we do not aim at learning rules for pairs of datasets, but rather at learning more generally applicable rules for each of the types ti ∈ T.

In order to collect data samples, we need a set of data sources S : {s1,...,sm} we can use to gather description samples for each of the types t ∈ T defined in the ontology.

In particular, we define a data source si ∈ S as a tuple si : <ui, li, ri>, with ui representing the URL identifying the source, li the URL of the search/lookup service of the source, and ri the pointer to the resolver service that allows retrieving the complete description of an entity given its (local) identifier. Notice that for Cool URIs [132] the pointer to the resource also gives access to the complete description. However, not all the sources rely on Cool URIs to identify resources, and thus access to a description is mediated by some sort of resolver. Let us define Ds as the set of descriptions available through a source s ∈ S, and ds ∈ Ds as a sample of information available in the data source as it is.

Each of the data sources has to be associated with a wrapper Ws. A wrapper Ws : DS × C → D is a function that transforms a description ds ∈ DS into a description d ∈ D, d = {a1,...,an}, where each ai is an attribute of the form (αi, vi), with αi a possibly empty attribute name and vi an attribute value, and C is a set of contextual information attributes to be appended to the transformed description. Given a set of data sources S and a set of wrappers W, we now need to define a process to gather sample descriptions from the data sources. Cognitive science studies proved that names are the first type of attribute used to search for the main entity types [6]. For this reason, we assume the existence of a list of names for the type of entity considered. The process of sample collection then consists in randomly selecting a name from the available list and submitting it to the lookup services of all the data sources to gather samples. To reduce as much as possible the bias related to the sample collection, the list must be extensive. More formally, we define a query function Qs : S × N → Ds that, given a source and a name, and relying on the search service pointer ls and the resolver service pointer rs of a source s ∈ S, retrieves a set of descriptions {ds ∈ Ds}. The dataset is generated relying on the following process:

n = getRandomName(List names);
getEntity(n) {
    for each source s in S {
        List idList = s.search(Qs(n));
        for each id in idList {
            Ds raw = s.get(Qs(id));
            D e = Ws.wrap(raw);
            e = e.appendContext(Cs);
            store(e, s.u, n);
        }
    }
}

Namely, given a random name n, for each of the sources a query seeded with n retrieves a list of identifiers; those identifiers are then used to retrieve raw descriptions, which are wrapped in a common format and stored. It is important to notice that this approach is comparable to a sort of rough attribute-based blocking system, capable of collecting from distributed and heterogeneous sources samples that may possibly refer to the same real world entity. This process has to be repeated for all the types considered in the Identification Ontology. For specific types of entities, we can hardly assume the availability of an extensive list of names to be used to collect data samples. Thereby, we propose an alternative approach that can complement the outlined one. That is, we propose to rely on some sort of iterative, name-driven crawling process across sources. Intuitively, given a relatively small list of seed queries, we propose to look up description samples on one source (the pivot source), and then, iterating on the results, extract attributes of type 'name' to be used as lookup queries to all the other sources. The process is repeated for a fixed number of iterations, changing the pivot source.

getEntities() {
    n = getRandomName(List names);
    Source pivot;
    List idList = pivot.search(Qs(n));
    for each id in idList {
        Ds raw = pivot.get(Qs(id));
        D e = Wpivot.wrap(raw);
        e = e.appendContext(Cs);
        store(e, pivot.u, n);
    }
    List names = getNames(idList);
    for each name n in names {
        for each source s in S\{pivot} {
            List idList = s.search(Qs(n));
            for each id in idList {
                Ds raw = s.get(Qs(id));
                D e = Ws.wrap(raw);
                e = e.appendContext(Cs);
                store(e, s.u, n);
            }
        }
    }
}
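To make the tuple notation concrete, the following is a minimal sketch of how a data source s : <u, l, r> and its wrapper Ws could be rendered as interfaces (all names are hypothetical; this is not the actual implementation):

import java.util.List;
import java.util.Map;

// Hypothetical rendering of a data source s = <u, l, r> with its lookup and resolver access.
public interface DataSource {

    String url();          // u: URL identifying the source
    String lookupUrl();    // l: URL of the search/lookup service
    String resolverUrl();  // r: pointer to the resolver service

    // Qs restricted to this source: given a name, return the (local) identifiers of matching descriptions.
    List<String> search(String name);

    // Resolver access: given a local identifier, return the raw source description as-is.
    Map<String, String> get(String localId);
}

// Ws: transforms a raw source description into a description made of
// (attribute name, value) pairs, appending the contextual attributes C before storage.
interface Wrapper {
    Map<String, String> wrap(Map<String, String> rawDescription, Map<String, String> context);
}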

This process allows gathering possibly matching descriptions among the data sources, but does not allow us to index the retrieved descriptions around the used seed query. The dataset generator component includes description extractors for entities of type “person”, “location” and “organization”. For the entity type person we implemented extractors for data sources available online such as:

• DBPedia (http://dbpedia.org/About), as a generic source of information presenting descriptions for 416.000 persons, 526.000 places and 169.000 organizations. DBPedia is a crowd-sourced community effort to extract structured information from Wikipedia, and thus the persons, locations and organizations available through DBPedia are likely to be famous ones.

• Freebase (http://www.freebase.com/), as a generic source of information about 23 million entities, including persons from the US Census, companies, Wikipedia and other sources, as well as locations and organizations. Also in this case the descriptions are likely to be related to famous people, or at least people cited in some public source, locations and organizations.

• Musicbrainz (http://musicbrainz.org/), a structured open online database for music offering information about 660.000 musicians, including people and groups.

• LastFM (http://www.last.fm/), a music website founded in the United Kingdom in 2002 that claimed 30 million active users in March 2009. LastFM offers 'scrobbler' APIs to gather information about 500.000 artists, including people and groups.


• Factual (http://www.factual.com), an aggregator and provider of open data, accessible through web service APIs and reusable, customisable web applications. Factual contains data about US politicians, all the governors in the history of the US, baseball athletes and a large set of health care providers. Furthermore, it manages geographical information for a large set of places and organizations such as restaurants, hotels, wine producers and business organizations in general.

• OpenCongress (http://www.opencongress.org), a non-profit, non-partisan public source of information about Congress members in the US.

• Okkam ENS (http://api.okkam.org), a system for the management of globally unique identifiers for entities on the web. Each identifier is associated with a profile (a set of information) coming from different sources and manually maintained. Thus the Okkam ENS is also a generic source of profiles that can be exploited to perform matching experiments. The ENS manages over 6 million locations, 380.000 persons and more than 100.000 organizations.

• Geonames (http://www.geonames.org/), a geographical database that covers all countries and contains over 8,251,333 placenames available for download free of charge.

• OpenCorporates (http://opencorporates.com/), an open database about the corporate world containing information about more than 51 million companies around the world.

For each of the sources there exists a lookup service that, given a keyword and the type of entity of interest, returns either a list of results (e.g. Okkam, Factual, LastFM, Geonames), or a list of pointers (or identifiers) that allow retrieving documents containing information (e.g. DBPedia, Freebase, MusicBrainz, OpenCongress, OpenCorporates). Some of the sources required user registration (MusicBrainz, LastFM, Factual, OpenCongress, OpenCorporates), the others did not require any registration. Every data source returned results in a different format. Some returned XML documents (MusicBrainz, LastFM), others JSON documents (Okkam, Freebase, OpenCongress, Factual, OpenCorporates) and others RDF/XML (DBPedia). The latest versions of Factual and Geonames provided a driver library, supporting the retrieval of descriptions without dealing directly with data representation formats. The Factual dataset is structured in tables, and thus the lookup service requires the specification of the dataset to query for each lookup query.


Source         #desc    #attr   avg. #attr   attr STD
dbpedia         2927   248117        84.77      41.49
factual         3293    75623        22.96       5.1
freebase        2031   155160        76.40      67.51
lastfm          1253    42476        33.90      16.73
musicbrainz     1726    14879         8.72       5.4
okkam           1382    16178        11.70       1.05
opencongress     821    11607        14.13       2.86
TOTAL          13433   449659        33.45      37.49

Table 8.1: Person Dataset Collected Description

Person Samples Collection

Some data selection processes required multiple queries in order to gather all the information related to a person. For example, MusicBrainz and LastFM manage separately the primary information about an artist and the albums and tracks released. Thus, for each artist multiple queries to the data sources are produced in order to collect a complete set of information. The queries to the data sources listed above are composed by randomly selecting words from a list of more than 738.000 names, obtained by processing the English DBpedia “persondata” dataset (http://wiki.dbpedia.org/Downloads32#persondata) according to the process outlined in section 8.2.1. Given a random name n, for each of the sources s considered, a search query Qs(n) is submitted aimed at retrieving a list of identifiers. For each of these identifiers, a raw description is then retrieved from the considered source, wrapped as an entity description D, extended with contextual information, and stored together with the source URL and the name query used.


Figure 8.2: Number of queries per data sources returning samples
Figure 8.3: Response size distribution in blocks of samples retrieved

We collected 13433 entity descriptions with 256 seed queries (e.g. Adam, Aspasius, Jenkins, Robbemond, Zhou, etc.). The descriptions present a total number of 449659 attributes. The complete list of queries used for person is presented in appendix B. In figure 8.2 we present the distribution of the number of queries based on the number of sources responding to each of them. As shown in the picture, the largest part of the queries allowed us to retrieve samples from 6 or 7 sources, and only a small number of queries had responses from few sources. This fact increases the possibility of gathering possibly overlapping descriptions and proves the feasibility of the seed query information retrieval approach. In figure 8.3 we present a view of the response size distribution for each of the queries. More than 30 queries allowed retrieving at most 30 samples among all sources. This is probably due to the high selectivity of the query (e.g. Aspasius). On the other side, a very small number of queries were so general as to retrieve more than 240 descriptions among all the sources. As shown in the graph, the large majority of queries allowed retrieving between 30 and 240 samples. In table 8.1 we present a description of the data retrieved using this method. As it is possible to see, the sources produced different results in terms of average number of attributes per description retrieved. Furthermore, some of the sources show that the average value is not reliable, presenting a very high standard deviation. This aspect is quite representative of the structural heterogeneity and different degrees of richness presented by different data sources. Collecting descriptions of persons from different sources, with different scopes and target usages, we aim at supporting the learning of matching rules suitable to be employed as a general means of identification of persons on the web. Nevertheless, we are aware that we are covering a relatively small set of the possible representations of the entity type person. If the experimental evaluation proves the process to be successful, we aim at pursuing completeness in this direction. This, among others, would be a good reason to promote and sustain the creation of a community supporting and sharing the effort of an increasing extension of the possible representations of persons covered by the set of rules. The dataset generated is very sparse, with more than 3021 different attribute types and only a relatively small set of attribute types shared among descriptions. This is mostly due to the large variety of sometimes very specific properties used in sources such as DBpedia and Freebase. We believe that this semantic heterogeneity is representative of the one affecting entity matching on the web, and thus it is suitable to support the learning of rules that should be applied in this context. This problem mostly challenges the semantic harmonization task, which becomes particularly cumbersome, as shown in section 7.2.1.

The number of harmonized attributes at the moment of writing is 317010, with 132649 attributes not harmonized (i.e. a coverage of 317010/449659 ≈ 0.70). On average, every description presents 23.60 harmonized attributes, with a standard deviation of 27.12. The number of non-harmonized attributes is on average 9.87, with a standard deviation of 14.04. The overall coverage ratio at the moment of writing is 0.70 for the entity type person.

Source       #desc    #attr   avg. #attr   attr STD
dbpedia       2959   204008        68.95      45.88
factual       7791   173045        22.21      45.00
freebase      7001   406298        58.03      73.48
okkam         3377    42984        12.72       5.82
geonames      6742   115588        17.14       6.93
TOTAL        27870   941923        33.80      37.49

Table 8.2: Location Dataset Collected Description

Location Samples Collection

The queries to the data sources listed above are composed by randomly selecting words from a list of more than 5.422.000 names, obtained by processing the Geonames RDF dump dataset (http://download.geonames.org/all-geonames-rdf.zip) according to the process outlined in section 8.2.1. Given a random name n, for each of the sources s considered, a search query Qs(n) is submitted aimed at retrieving a list of identifiers. For each of these identifiers, a raw description is then retrieved from the considered source, wrapped as an entity description D, extended with contextual information, and stored together with the source URL and the name query used. We collected 27870 entity descriptions with 838 seed queries (e.g. Aitken Cove, Aristovo, Ban Tham, Caleta, Zhou, etc.).


Figure 8.4: Number of queries per data sources returning samples
Figure 8.5: Response size distribution in blocks of samples retrieved

The descriptions present a total number of 941923 attributes. The complete list of queries used for location is presented in appendix B. In figure 8.4 we present the distribution of the number of queries based on the number of sources responding to each of them. As shown in the picture, the largest part of the queries had responses from a small number of sources. This fact decreases the possibility of gathering possibly overlapping descriptions from multiple sources and may cause some bias in the learned rules. However, another explanation could also be that three of the considered sources present a large number of descriptions about locations (Geonames, Okkam and Factual), whereas the others are more likely to refer to famous locations. In this case, the random selection of seed names proved to be less effective as a preliminary blocking system due to the extremely large set of seed queries available. For this reason we decided to also apply a crawling-based approach aimed at gathering possibly overlapping descriptions. The procedure, described in section 8.2.1, basically consists in gathering descriptions from sources using seed queries configured with the names of locations of descriptions gathered from a pivot source. In figure 8.5 we present a view of the response size distribution for each of the queries. The largest amount of queries retrieved a relatively small amount of samples, probably due to the high selectivity of the queries (e.g. Ban Tham). On the other side, a very small number of queries were so general as to retrieve more than 50 descriptions among all the sources. As shown in the graph, the large majority of queries allowed retrieving less than 70 samples. In table 8.2 we present a description of the data retrieved using this method. As it is possible to see, also the location sources produced different results in terms of average number of attributes per description retrieved. Furthermore, some of the sources show that the average value is not reliable, presenting a very high standard deviation. This aspect is quite representative of the structural heterogeneity and different degrees of richness presented by different data sources. Also in this case we hope that this real life heterogeneity serves to learn matching rules suitable to be applied across these and many other different data sources about locations. The location dataset generated is also very sparse, with more than 3952 different attribute types and only a relatively small set of attribute types shared among descriptions. This is mostly due to the large variety of sometimes very specific properties used in sources such as DBpedia and Freebase. We believe that this semantic heterogeneity is representative of the one affecting entity matching on the web, and thus it is suitable to support the learning of rules that should be applied in this context. This problem mostly challenges the semantic harmonization task, which becomes particularly cumbersome, as shown in section 7.2.2.

The number of harmonized attributes relying on the defined mappings at the moment of writing is 459939, with 306169 attributes not harmonized. On average, every description presents 16.39 harmonized attributes, with a standard deviation of 13.06. The number of non-harmonized attributes is on average 10.98, with a standard deviation of 37.29. The overall coverage ratio at the moment of writing is 0.60 for the entity type location. This lower coverage ratio may also be related to the lower number of features related to the type location. Furthermore, location descriptions are likely to contain a number of factual pieces of information not necessarily interesting for identification purposes, such as the rain in millimeters in the last year, or the average hours of sun in the month of May.

Source           #desc    #attr   avg. #attr   attr STD
dbpedia            809    50133        61.97      37.76
factual           7487   122277        16.33       7.59
freebase          1356    74629        55.03      64.49
okkam              275     2908        10.57       2.04
musicbrainz        453   115588        17.14       6.93
opencorporates    1534    22423        14.61       3.15
TOTAL            11914   277215        18.56      21.39

Table 8.3: Organization Dataset Collected Description

Organization Samples Collection

The queries to the data sources listed above are composed by randomly selecting words from a list of more than 22.000 names, obtained by processing the Person infobox dataset of DBPedia (http://wiki.dbpedia.org/Downloads32#persondata), looking for the names of organizations using the properties listed in table 8.4.


Figure 8.6: Number of queries per data sources returning samples
Figure 8.7: Response size distribution in blocks of samples retrieved

http://dbpedia.org/property/formerAffliations
http://dbpedia.org/property/work
http://dbpedia.org/ontology/recordLabel
http://dbpedia.org/property/workplaces
http://dbpedia.org/property/affiliation
http://dbpedia.org/property/cteam
http://dbpedia.org/property/memberOf
http://dbpedia.org/ontology/team
http://dbpedia.org/property/group
http://dbpedia.org/property/currentTeam
http://dbpedia.org/property/debutteam
http://dbpedia.org/property/currentclub
http://dbpedia.org/property/politicalParty
http://dbpedia.org/ontology/party
http://dbpedia.org/property/currentteam
http://dbpedia.org/ontology/musicalBand
http://dbpedia.org/property/party
http://dbpedia.org/ontology/company
http://dbpedia.org/property/company
http://dbpedia.org/property/employer
http://dbpedia.org/property/workPlaces
http://dbpedia.org/ontology/workPlaces
http://dbpedia.org/property/workInstitution
http://dbpedia.org/ontology/employer

Table 8.4: Properties used to gather organization names in persondata dataset

Given a random name n, for each of the sources s considered, a search query Qs(n) is submitted aimed at retrieving a list of identifiers. For each of these identifiers, a raw description is then retrieved from the considered source, wrapped as an entity description D, extended with contextual information, and stored together with the source URL and the name query used. We collected 11914 entity descriptions with 69 seed queries (e.g. Champaign, Acadians, Schell, Atomic Kittens, Transylvanian, etc.). The descriptions present a total number of 277215 attributes. The complete list of queries used for organization is presented in appendix B. In figure 8.6 we present the distribution of the number of queries based on the number of sources responding to each of them. As shown in the picture, the largest part of the queries had responses from at least 4 sources. This increases the possibility of gathering possibly overlapping descriptions from multiple sources, and is probably due to the low selectivity of some seed queries (e.g. university) that allowed gathering descriptions from different sources. Differently from location, only two sources present a large number of descriptions about organizations (OpenCorporates and Factual), whereas the others are more likely to refer to famous organizations. Furthermore, OpenCorporates and Factual focus on different types of organizations: the former collects information about corporations, whereas the latter focuses more on restaurants, wine producers, and so on. In this case, the random selection of seed names can hardly be effective as a preliminary blocking system due to the extremely large set of seed queries available. For this reason we decided to also apply a crawling-based approach aimed at gathering possibly overlapping descriptions. The procedure, described in section 8.2.1, basically consists in gathering descriptions from sources using seed queries configured with the names of the descriptions gathered from a pivot source. In figure 8.7 we present a view of the response size distribution for each of the queries. The number of selective queries that retrieved a small number of descriptions (up to 70) is balanced by a large number of queries that allowed retrieving a relatively large amount of samples, probably due to the low selectivity of the queries (e.g. Records, University, Brazilian, British, etc.).

In table 8.3 we present a description of the data retrieved using this method. As it is possible to see, also the organization sources produced different results in terms of average number of attributes per description retrieved. Furthermore, some of the sources show that the average value is not reliable, presenting a very high standard deviation. Also in this case, we interpret this heterogeneity as quite representative of the structural heterogeneity and different degrees of richness presented by different data sources.

The organization dataset generated is also very sparse, with more than 2218 different attribute types and only a relatively small set of attribute types shared among descriptions. This is mostly due to the large variety of sometimes very specific properties used in sources such as Factual, DBPedia and Freebase. We believe that this semantic heterogeneity is representative of the one affecting entity matching on the web, and thus it is suitable to support the learning of rules that should be applied in this context. As for person and location, this problem mostly challenges the semantic harmonization task, which becomes particularly cumbersome, as shown in section 7.2.3. The number of harmonized attributes relying on the defined mappings at the moment of writing is 221165, with 59591 attributes not harmonized. On average, every description presents 13.56 harmonized attributes, with a standard deviation of 12.83. The number of non-harmonized attributes is on average 5.00, with a standard deviation of 10.54. The overall coverage ratio at the moment of writing is 0.73 for the entity type organization. This coverage ratio is very similar to the one obtained for person. Probably organizations, like persons, present proportionally fewer attributes that are not relevant for identification, having fewer of the statistical attributes that are more common for locations.

8.2.2 Labeling Samples for Training Set

Once we have collected data samples about specific entity types, we have the problem of labeling sample pairs to create a training set. This is a typical problem of supervised machine learning techniques [58, 67]. In fact, supervised machine learning methods are affected by the matching rarity problem, which makes the problem of spotting positive matching samples particularly hard. This is particularly true for very large datasets. A common strategy applied to reduce the complexity of duplicate detection is to rely on a cheap similarity metric to cluster blocks of entities that are more likely to contain duplicates [79, 110, 53, 3]. Given the way the dataset is generated, we can rely on some metadata, such as provenance and the keyword used to retrieve a description, to reduce the search space and ease the matching rarity problem. In doing this, we optimize the Reduction Ratio RR = 1 − C/N, where C is the number of candidate matches and N is the cardinality of the cross product of the datasets. Notice that there is a possibility of missing some matching samples from the original data sources, because we rely on possibly imperfect lookup systems, affecting the Pairwise Completeness PC = Sm/Nm, where Sm is the number of true matches among the candidates and Nm is the number of true matches in the datasets [110]. Nevertheless, as long as the main goal is to select generic positive matching samples, rather than pairwise record linkage, this issue can be neglected for the moment. Michelson and Knoblock in [110] describe a very effective supervised blocking system. However, the supervised method relies on the existence of a training set. Thereby, the bootstrap of this method must rely on a distance metric capable of ranking possible matches on the basis of a record similarity measure. In this work, we implemented a block extractor based on an Apache Lucene 3.4.0 (http://lucene.apache.org/core/) inverted index and a tf-idf similarity metric to define blocks of samples to be compared and labeled by a person. Given a set of descriptions D and a pivot description di ∈ D about an entity, we use the complementary set of descriptions D \ {di} to create an inverted index using the features of all the descriptions. Then, we select the name attributes of the pivot description di to define a query to the inverted index. The query to the inverted index produces a ranked list of results, from which we select the top k results. We estimated k = n + n ∗ 0.3, where n = |S| is the number of sources considered, plus an overhead accounting for possible duplicates contained in the sources. Thus, since for person we relied on 7 different sources, for each query we selected a top-k list with k = 7 + 0.3 ∗ 7 ≈ 9. Furthermore, we decided to introduce some aleatory factors that should help in boosting ambiguous cases, which are very important to create a training set suitable to learn effective entity matching rules.


foundation date: 2011-08-10
organization type: Private Limited Company
country: United Kingdom
jurisdiction: GB
has key people: ALEXANDRA SIOBHAN LEEKS
name: ANATHEMA LIMITED
foundation date: 2011-08-10T20:06:13+01:00
street address: 30 BRICKENDON LANE, HERTFORD, HERTFORDSHIRE, ENGLAND, SG13 8HY

Table 8.5: Example of organization description extracted from OpenCorporates dataset

retrieved at: 2011-07-25T21:11:25+01:00
organization type: PRIVATE COMPANY LIMITED BY SHARES
jurisdiction: GI
current status: Struck Off By Request
company number: 38853
inactive: false
updated at: 2011-07-25T21:11:25+01:00
name: ANATHEMA LIMITED
foundation date: 2011-03-22T21:07:51+00:00

Table 8.6: Example of organization description extracted from OpenCorporates dataset

The aleatory factors considered are:

• conjunctive boolean combination of the parts of the name;

• randomized boost factor on random query tokens;

• string similarity metrics.

Thereby, given a description like the one presented in table 8.5, we produce a query to the index in the form of:

name:( ("ANATHEMA LIMITED") OR ("ANATHEMA"^0.5~0.9 AND "LIMITED"^5~0.9 ) OR ("ANATHEMA"~0.9 "LIMITED"~0.9 ) )

Among other results, this query allowed retrieving the entity sample described in table 8.6. Relying on simple similarity metrics to block possibly similar entities is quite common in database record linkage [67, 127, 58]. However, among the existing methods, the inverted index was very convenient, as it allowed us to deal with the problem neglecting the structural heterogeneity affecting the descriptions we are dealing with. Furthermore, inverted indexes also have the advantage of granting sub-linear access time to the indexed data, allowing indexes to be created and queried dynamically. We integrated the block extractor defined so far into a web application that allows human users to take matching decisions on pairs of descriptions. The first version of the application, named SemanticMap, was developed relying on the rich interface library Icesoft Icefaces 1.8.2 (http://www.icesoft.org/java/) as a J2EE Java Server Faces 1.2 (http://www.oracle.com/technetwork/java/javaee/javaserverfaces-139869.html) implementation. The samples labeled with matching decisions are stored in a database and will then be used as a training set and an evaluation set respectively. A screenshot of the application developed is depicted in figure 8.8. The next step is to collect a large set of labeled samples aimed at capturing the matching knowledge used by persons in taking matching decisions.
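As an illustration of how such aleatory queries can be assembled, the following is a minimal sketch (hypothetical names and boost ranges; not the actual SemanticMap code) that builds a query string of the form shown above from the name attribute of the pivot description, combining the exact phrase, a conjunctive fuzzy variant with randomized boosts, and a disjunctive fuzzy variant of the tokens:

import java.util.Random;

public class BlockingQueryBuilder {

    private static final Random RND = new Random();

    // e.g. buildNameQuery("ANATHEMA LIMITED") produces a string like the example above
    public static String buildNameQuery(String name) {
        String[] tokens = name.trim().split("\\s+");

        StringBuilder conjunctive = new StringBuilder();
        StringBuilder disjunctive = new StringBuilder();
        for (String token : tokens) {
            // randomized boost factor (aleatory factor) combined with a fuzzy match (~0.9)
            double boost = 0.5 + RND.nextInt(10) * 0.5;
            conjunctive.append('"').append(token).append("\"^").append(boost).append("~0.9 AND ");
            disjunctive.append('"').append(token).append("\"~0.9 ");
        }
        // drop the trailing " AND " from the conjunctive part
        String conj = conjunctive.substring(0, conjunctive.length() - 5);

        return "name:( (\"" + name + "\") OR (" + conj + ") OR (" + disjunctive.toString().trim() + ") )";
    }

    public static void main(String[] args) {
        System.out.println(buildNameQuery("ANATHEMA LIMITED"));
    }
}

The resulting string can then be parsed with the Lucene query parser and run against the inverted index to obtain the ranked top-k candidates presented to the user for labeling.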

Figure 8.8: A screenshot of the Semantic Map application labeling pairs of samples.

The process proposed is outlined in figure 8.9. As previously mentioned, the rules learned must also consider the unknown cases as an option for classification. Thereby, the user should label samples considering three classes:

• Match: when the compared descriptions are recognized to be referring to the same real world entity without any doubt;

• Non Match: when the compared descriptions are recognized not to be referring to the same real world entity without any doubt;

• Don't Know: when the compared descriptions are too ambiguous to take any reliable matching decision.

We now formally define a training set sample σ as a triple σ : <ε1, ε2, Δ>, where Δ is a matching decision and ε : <i, s, q> is a tuple in which i is a URI identifying an entity description d ∈ D as presented above, s ∈ S is the source of origin of the description, and q is the seed query used to retrieve the entity. We then define a training set

Γ as a set of labeled samples Γ : {σ1,...,σn}; a minimal sketch of these structures is given below. Notice that the selection of a seed query as the primary blocking criterion for building an index of retrieved descriptions, as described in figure 8.9, does not necessarily hold if the sample collection process is the one based on iterative name-centered crawling. In fact, possibly matching descriptions may be indexed with different seed queries. This problem can be solved by extending the scope of the indexed dataset beyond the seed query block. Namely, for each entity, we can index the whole dataset minus the pivot entity to be matched.
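A minimal sketch (hypothetical names, under the definitions above) of how ε, σ and Γ could be represented:

import java.util.Set;

public class TrainingSet {

    // Matching decision Δ: the three labels used during the labeling process.
    public enum Decision { MATCH, NON_MATCH, DONT_KNOW }

    // ε = <i, s, q>: description URI, source of origin, seed query used to retrieve it.
    public record Sample(String descriptionUri, String sourceUrl, String seedQuery) {}

    // σ = <ε1, ε2, Δ>: a labeled pair of samples.
    public record LabeledPair(Sample first, Sample second, Decision decision) {}

    // Γ: the set of labeled samples.
    private final Set<LabeledPair> gamma;

    public TrainingSet(Set<LabeledPair> gamma) {
        this.gamma = gamma;
    }

    public Set<LabeledPair> samples() {
        return gamma;
    }
}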

Figure 8.9: Activity Diagram of Training Set Labeling Process

As mentioned at the beginning of the chapter, we aim at capturing through machine learning techniques some sort of common knowledge applied by human users in matching entities in the context of the Web. For this reason, the samples must be labeled by several users who are not necessarily experts in the domain, but are capable of applying common sense knowledge about which attributes have to match in two descriptions to take reliable matching decisions. In principle, this type of job is suitable to be crowdsourced [31]. There are approaches that aim at relying on a combination of crowdsourcing and automated methods, see for example [148]. However, reliable crowdsourcing is quite hard to obtain, even though many problems can easily be presented in the form of simple solvable tasks. In fact, although platforms such as Amazon Mechanical Turk (https://www.mturk.com/mturk/welcome) provide access to a large number of task executors, a set of quality management issues persists [87], and often special purpose techniques have to be employed to discern usable results from garbage.


We now have all the ingredients to create a training set suitable for extracting matching rules for an open environment:

• samples from different heterogeneous sources;

• an ontology presenting features about the entities of interest;

• an extensive list of mappings supporting the harmonization of the features towards the defined ontology;

• a method to block samples to be compared based on inverted index and tf/idf ;

• a graphical interface to support labeling of pairs of descriptions;

The next step in line is to have several people label a relevant number of sample pairs. Ideally, the training set should support the extraction of matching rules capturing the knowledge used by people while performing this type of task. Initially, we planned to rely on an open platform such as the Amazon Mechanical Turk (https://www.mturk.com/mturk/welcome) for crowd-sourcing this type of task. There exist solutions that rely on such platforms to solve the entity matching problem completely or partially, e.g. [148]. However, there are several issues related to the adoption of this type of platform, which include, besides the costs, a serious commitment in the definition of cognitively sound experiments. This objective is quite far from our primary goal of proving the feasibility of the knowledge-based approach. Furthermore, we would rather frame the solution around a community that would actually exploit the results of a knowledge-based solution for open entity matching and support its maintenance. The steps in this direction are going to be discussed further in the last part of this work. Thereby, we chose to collect samples in a controlled environment within the Okkam Labs in Trento. This choice has the two-fold advantage of granting sound labeling and reducing the complexity of conducting the labeling process. The main disadvantage of this approach is that we can collect only a relatively small set of samples from a limited number of persons (around 10). We believe anyway that this factor is not crucial, as what we seek is an empirical indication that a knowledge-based solution, which takes into consideration the explicit semantics of attributes to take matching decisions, is suitable for an open, structurally and semantically heterogeneous environment. The people that took part in the labeling process are:

• 3 women of different educational backgrounds (cognitive science, political science and economics) and nationalities (Italian and Brazilian);


• 6 men of different educational backgrounds (computer science, philosophy, human computer interaction) and nationalities (Italian and Ukrainian).

The participants' ages range between 26 and 47 years, and all performed the labeling using their own computers at the times they preferred. The web application offered an option that would not propose the same sample pair twice for labeling. However, this option could be arbitrarily switched off, and other users could cross-check and change the evaluation of previously labeled samples. In order to have a sound and as unbiased as possible labeling process, we collectively showed a few labeling samples to the participants and openly explained that both matching and non-matching decisions had to be sound and reliable in an open context. Thus, when a matching decision could not be taken reliably, a don't know decision would have been appropriate. The execution of this type of task is prone to the application of cognitive heuristics that would bias the results of the learning process. For this reason, we chose to display pairs of descriptions without respecting any particular order of attributes. On one side, this forces the person to scroll through all the attributes looking for matching pairs and take reliable matching decisions. On the other side, this greatly increases the cognitive effort in taking matching decisions. A detailed description of the training set defined with the process described so far is presented in the following sections.

Training Set for Entity Type Person

Source         #desc   #attr   avg. #attr   attr STD
dbpedia          500   40343        80.68      35.63
factual          276    5959        21.59       4.34
freebase         417   28059        67.28      59.31
lastfm           207    6615        31.95      15.52
musicbrainz      321    3005         9.36       6.05
okkam            273    3252        11.91       1.11
opencongress     100    1521        15.21       4.76
TOTAL           2094   88754        33.03      35.93

Table 8.7: Person Training Set Description

For the entity type Person, 7405 sample pairs were labeled, involving 2094 different descriptions and producing 549 positively labeled matching pairs, divided into 337 clusters distributed as depicted in figure 8.11. A further detail about the distribution of the clusters of positively labeled samples in terms of the data sources from which the data samples were collected is presented in figure 8.10. The training set also contains 6024 negatively labeled sample pairs, and 832 sample pairs labeled as don't know.

Figure 8.10: Distribution of clusters in terms of their sources

The don't know sample pairs are very important, as they represent particularly ambiguous descriptions. Some statistics about the quality of the descriptions composing the training set are presented in table 8.7. The mappings for semantic harmonization covered 73% of the overall attributes, with an average of 24 attributes harmonized per description and a standard deviation of 27. In figure 8.12 we present a graph outlining the distribution of the training set samples with respect to the pairs of data sources considered. As it is possible to see, there is a large number of non-matching sample pairs involving musicbrainz, freebase, dbpedia and okkam. The large amount of negative samples involving musicbrainz can be easily explained by the fact that this data source is quite extensive but presents a very vertical type of entity (i.e. musical artists). Thus, the blocking system proposed for labeling a large amount of sample pairs presenting names of artists with small variations in the name. The collection of such samples is probably also the result of the usage of non-selective queries that allowed gathering a large amount of musicbrainz samples.

Figure 8.11: Cluster size distribution in training set for Person

The positive matching sample pairs are smaller in number with respect to the negative sample pairs, and quite well distributed across the source pairs. The largest amount of positive matching samples is between freebase and dbpedia, followed by okkam and freebase, and okkam and dbpedia. This is probably due to the fact that all three data sources present information about mostly famous people. Another consistent set of positive matching sample pairs is related to music artists. In fact, there is a good number of positive matching samples between freebase, lastfm, musicbrainz and okkam. Very little contribution in terms of positive matching sample pairs comes from factual and opencongress. These data sources presented very vertical datasets about health care practitioners, athletes and US politicians. Some of the descriptions gathered overlapped with the most famous entities contained in the other sources, but in general the amount of positive matching samples is small. We will analyze through experiments the impact of such a heterogeneous distribution of data samples, applying filters and partitioning the dataset as described in sections 8.2.3 and 8.2.6.

Training Set for Entity Type Location

Source      #desc   #attr   avg. #attr   attr STD
dbpedia        91    6672        73.31      39.27
factual       238    5754        24.17      69.43
freebase      289   15326        53.03      70.42
geonames      120    2006        16.71       4.22
okkam         275    3607        13.11       5.39
TOTAL        1013   33365        25.95      50.34

Table 8.8: Location Training Set Description

For the entity type Location, 2310 sample pairs were labeled, involving 1013 different descriptions and producing 128 positively labeled matching pairs, divided into 90 clusters distributed as depicted in figure 8.14. A further detail about the distribution of the clusters of positively labeled samples in terms of the data sources from which the data samples were collected is presented in figure 8.13.

Figure 8.12: Distribution of samples per pairs of resources for Person

The training set also contains 2119 negatively labeled sample pairs, and 63 sample pairs labeled as don't know. Also in this case, the don't know sample pairs are very important, as they represent particularly ambiguous descriptions. Some statistics about the quality of the descriptions composing the training set are presented in table 8.8. The mappings for semantic harmonization covered 58.10% of the overall attributes, with an average of 15 attributes harmonized per description and a standard deviation of 12. In figure 8.15 we present a graph outlining the distribution of the training set samples with respect to the pairs of data sources considered. As it is possible to see, there is a large number of non-matching sample pairs involving okkam, factual, freebase and dbpedia.

Figure 8.13: Distribution of clusters in terms of their sources for Location

Figure 8.14: Cluster size distribution in training set for Location

Differently from the case for person, a large amount of the negatively labeled sample pairs is found within the same source. In fact, okkam, factual and freebase all present a large number of negatively labeled samples. This could be an effect of the fact that many queries gathered a small amount of descriptions from a small number of sources, as presented in figures 8.4 and 8.5 at page 153. Thus, the system defined to gather descriptions, combined with a smaller number of labeled samples, produced a large number of negative samples collected from the same data source. Furthermore, a large amount of queries were too selective, gathering up to 23 descriptions in all. However, the absolute largest number of negative matching sample pairs is between okkam and factual. The number of positive matching sample pairs is relatively small, but they are well distributed across the source pairs. The largest amount of positive matching samples is between freebase and okkam, followed by dbpedia and freebase, and geonames and okkam.

Figure 8.15: Distribution of samples per pairs of resources

This is probably due to the fact that all three data sources present overlapping sets of descriptions about famous locations. We will analyze through experiments the impact of such a heterogeneous distribution of data samples, applying filters and partitioning the dataset as described in sections 8.2.3 and 8.2.6.

Training Set for Entity Type Organization

Source           #desc   #attr   avg. #attr   attr STD
dbpedia            345   26609        77.12      38.32
factual           1501   22862        15.23       5.6
freebase           509   27701        54.42      70.55
musicbrainz        160    1856        11.6        8.03
okkam               85     990        11.64       2.14
opencorporates     451    6563        14.55       3.18
TOTAL             3051   86581        22.11      29.18

Table 8.9: Organization Training Set Description

For the entity type Organization, 5064 sample pairs were labeled, involving 3051 different descriptions and producing 177 positively labeled matching pairs, divided into 127 clusters distributed as depicted in figure 8.17. A further detail about the distribution of the clusters of positively labeled samples in terms of the data sources from which the data samples were collected is presented in figure 8.16. The training set also contains 2119 negatively labeled sample pairs, and 63 sample pairs labeled as don't know.

Figure 8.16: Distribution of clusters in terms of their sources for Organization

Also in this case, the don't know sample pairs are very important, as they represent particularly ambiguous descriptions. Some statistics about the quality of the descriptions composing the training set are presented in table 8.9. The mappings for semantic harmonization covered 73.62% of the overall attributes, with an average of 16.2 attributes harmonized per description and a standard deviation of 17.76.

Figure 8.17: Cluster size distribution in training set for Organization

In figure 8.18 we present a graph outlining the distribution of the training set samples with respect to the pairs of data sources considered. As can be seen, there is a large number of non-matching sample pairs involving musicbrainz, freebase, dbpedia and okkam. Similarly to what happened with location, a large amount of negative matching labeled sample pairs is found within the same sources. In fact,

Figure 8.18: Distribution of samples per pairs of resources

factual, opencorporates and freebase all present a large number of negatively labeled samples from the same source. However, this time the effect should not be due to the fact that many queries gathered a small number of descriptions from a small number of sources. In fact, on average queries gathered a large number of samples from 4 to 6 sources, as shown in figures 8.6 and 8.7 at page 155. Our interpretation of this fact is that factual, freebase and opencorporates returned a much higher number of samples. Furthermore, opencorporates and factual contain a large number of non-famous entities, such as wine producers and restaurants for factual, and corporations in general for opencorporates. These specializations allowed us to gather a large number of samples that are not shared with other sources. This created some sort of spiral for the person labeling the samples when defining blocks of entities to be compared. Opencorporates, in particular, contained information about many companies registered with a name and a sequential number. Thus, this time the fact that many negative matching sample pairs are collected from the same sources is due to a combination of the quality of the data collected and the fact that the labeling process covered just a part of it. The number of positive matching sample pairs is relatively small, but they are well distributed across the source pairs. The largest amounts of positive matching samples are between freebase and dbpedia, followed by dbpedia and okkam and dbpedia and factual. This is probably due to the fact that all three data sources present overlapping sets of descriptions about famous organizations such as universities and large corporations. We will analyze experimentally the impact of such a heterogeneous distribution of data samples, applying filters and partitioning the dataset as described in sections 8.2.3 and 8.2.6.

8.2.3 Training Set Filter

Supervised machine learning techniques are strongly affected by the training set used to train them [67]. Ultimately, training processes are affected by the number of samples presenting some characteristics, relying on different approaches to estimate and treat outliers. Furthermore, many techniques rely on sets of methods to reduce over-fitting problems. Namely, they adopt techniques to avoid learning classification features that are too specific to the training set and that would lead to poor predictions in a more general context. Hence, the quality of the training set directly affects the quality of the classification results. For this reason we define a set of filter functions that allow us to remove matching samples that are not considered useful for classification purposes, or that negatively affect the learned classifier. We are aware of the fact that we are introducing a bias in the learning process, and for this reason, we plan to experimentally evaluate the impact of the adoption of such filters. Let us define a filter function φ : Γ × (Σ → B) → Γ as a recursive function filtering labeled samples from a training set Γ based on a boolean function ψ : Σ → B, where B = {True, False}:

\[
\phi(\Gamma^{i},\psi) =
\begin{cases}
\emptyset & \text{if } \Gamma^{i} \text{ is empty}\\
\phi(\Gamma^{i-1},\psi) \cup \{\sigma^{i}\} & \text{if } \psi(\sigma^{i}) = \mathit{False}\\
\phi(\Gamma^{i-1},\psi) & \text{otherwise}
\end{cases}
\tag{8.1}
\]

where Γ^i = Γ^{i−1} ∪ {σ^i}. The filter function defined above can be seen as a second-order function that applies the filter business logic defined by the boolean function ψ. ψ functions can implement any type of filtering business logic, including syntactical filters on a specific type of attribute, cardinality filters related to the number of attributes composing descriptions, etc. In the following sections, we will briefly describe a few of them that we believe could impact the rule learning process.
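To make the second-order nature of φ concrete, the following is a minimal, hypothetical Java sketch (class names and the generic sample type are illustrative placeholders, not the actual implementation, and the sketch uses a more recent Java idiom than the Java 1.6 used in the thesis): the filter takes a training set and a predicate ψ over labeled sample pairs, and keeps only the samples for which ψ returns false.

import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Minimal sketch of the filter function phi (eq. 8.1): given a training set
// Gamma and a boolean function psi over labeled sample pairs, keep only the
// samples for which psi returns false.
public final class TrainingSetFilter {

    public static <S> List<S> filter(List<S> gamma, Predicate<S> psi) {
        List<S> kept = new ArrayList<>();
        for (S sigma : gamma) {
            if (!psi.test(sigma)) {      // psi == true marks sigma for removal
                kept.add(sigma);
            }
        }
        return kept;
    }
}

A concrete ψ, such as the minimum attribute number filter of the next section, can then be passed as the psi argument.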

Minimum Attribute Number Filter

An example of a filter that could be applied to a training set used to learn entity matching rules is a minimum attribute number filter. Namely, we want to filter from the training set the samples containing descriptions with a number of attributes below a

1 name: Ferrero, Martin; name: Martin Ferrero; subject: Category:Miami Vice; birthPlace: Brockport, New York; placeOfBirth: United States; subject: Category:American film actors; placeOfBirth: Brockport, New York; birthYear: 1947; birthPlace: United States; birthPlace: U.S.; label: Martin Ferrero; dateOfBirth: 1947-07-13; givenName: Martin; surname: Ferrero;
2 short description: Martin Ferrero (born July 13 1947 in Brockport, New York) is an American stage and film actor known for acting in the movie Miami Vice.; first name: Martin;

Table 8.10: Example samples filtered using the minimum number of attributes in description filter function, with k=3

certain threshold. Let us define φMC : Γ × N → Γ as the function that applies to every sample pair σ ∈ Γ the boolean function ψMC : Σ × N → B defined as follows:

\[
\psi_{MC}(\sigma,k) =
\begin{cases}
\text{true} & \text{if } \exists e \in \sigma \wedge d \in e \wedge |d| < k\\
\text{false} & \text{otherwise}
\end{cases}
\tag{8.2}
\]

Now we can define

φMC(Γ, k) : Γ \ {σ ∈ Γ | ψMC(σ, k)}   (8.3)

where k is the minimum number of attributes required in a description not to be filtered.

The training set then becomes ΓMC = φMC(Γ, k). This type of filter could help in reducing the effect of structural heterogeneity in learning the rules for entity matching. Namely, assume that there are two descriptions, one presenting a whole list of attributes, and the other presenting the same attributes in one unique textual paragraph. For a human user it is easy to take matching decisions, as a person can easily interpret both structured and unstructured information. However, this type of sample can negatively affect the capability of learning useful rules, as the classifier would not be able to unfold the information presented in natural language, and thus the sample would oddly bias the classification process both for positive and negative matching samples. An example of descriptions that should be filtered is presented in table 8.10.
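As an illustration, a hypothetical ψMC predicate for this filter might look like the following sketch; the Description and LabeledPair types and the attributeCount() accessor are placeholders and not part of the actual codebase. It marks a sample pair for removal as soon as either of its descriptions has fewer than k attributes.

import java.util.function.Predicate;

// Sketch of psi_MC (eq. 8.2): true (i.e. "filter out") when either description
// in the labeled pair has fewer than k attributes.
public final class MinAttributeNumberFilter {

    public static Predicate<LabeledPair> psiMC(final int k) {
        return pair -> pair.first().attributeCount() < k
                    || pair.second().attributeCount() < k;
    }

    // Placeholder types, standing in for the real data model.
    public interface Description { int attributeCount(); }
    public interface LabeledPair { Description first(); Description second(); }
}

Passed to the generic filter sketch above as psi, this would remove, for example, the pair of table 8.10, whose second description exposes only two attributes.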

Minimum Shared Attributes Type Filter

Following an intuition similar to the one presented in section 8.2.3, we may want to further filter labeled samples with descriptions that do not present a sufficient number of overlapping attribute types. This type of filter aims at removing samples presenting attributes that cannot be compared due to problems related to structural heterogeneity of the descriptions. This filter subsumes the Minimum Attribute Number Filter described in section 8.2.3. In fact, this filter removes samples presenting a large set of different attributes, but possibly containing information leading to a matching decision inside textual paragraphs. Again, a person is capable of interpreting both types of information, but this interpretation could not be captured by the classifier. Define n : D → {A} as a function that, given a description d ∈ D, returns the names α ∈ A of all features present in the description. Now we define a boolean function ψMSC : Σ × N → B that, given a sample pair σ ∈ Γ and an integer k, returns true if the number of attributes shared between the descriptions mentioned in σ is below a threshold k. More formally:

\[
\psi_{MSC}(\sigma,k) =
\begin{cases}
\text{true} & \text{if } e_1,e_2 \in \sigma \wedge d_1 \in e_1 \wedge d_2 \in e_2 \wedge |n(d_1) \cap n(d_2)| < k\\
\text{false} & \text{otherwise}
\end{cases}
\tag{8.4}
\]

Define the function φMSC : Γ × N → Γ as the function that filters from Γ all such samples:

φMSC(Γ, k) : Γ \ {σ ∈ Γ | ψMSC(σ, k)}   (8.5)

where k is the minimum number of shared attributes between the compared descriptions. An example of samples that should be filtered using the minimum shared attribute number filter is presented in table 8.11. This follows the intuition that if samples share a higher number of attributes, the training set may need to be split on more attributes to classify the samples correctly, increasing the height of the learned decision tree, and thus the length of the learned rules. This principle is also mentioned in [161], where cascade combinations of decision trees are optimized by learning higher decision tree classifiers (i.e. trees with longer average root-to-leaf paths). On the other side, considering shorter rules would ensure a higher applicability of the rules, but we are likely to pay a cost in terms of precision of the matching decision. However, this aspect should also be experimentally evaluated.

1 name: Ferrero, Martin; name: Martin Ferrero; subject: Category:Miami Vice; birthPlace: Brockport, New York; placeOfBirth: United States; subject: Category:American film actors; placeOfBirth: Brockport, New York; birthYear: 1947; birthPlace: United States; birthPlace: U.S.; label: Martin Ferrero; dateOfBirth: 1947-07-13; givenName: Martin; surname: Ferrero;
2 short description: Martin Ferrero (born July 13 1947 in Brockport, New York) is an American stage and film actor known for acting in the movie Miami Vice.; first name: Martin; family name: Ferrero; lives in: Los Angeles; favoriteFood: Pizza; marriedTo: Nice Lady; favoriteMovie: Indiana Jones 2; studiedAt: Cool Actors Acting School;

Table 8.11: Example samples filtered using the minimum number of shared attributes in sample filter function, with k=3
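A hypothetical ψMSC predicate corresponding to equation (8.4), again with placeholder types, could count the attribute names shared by the two descriptions and flag the pair when the overlap stays below k.

import java.util.HashSet;
import java.util.Set;
import java.util.function.Predicate;

// Sketch of psi_MSC (eq. 8.4): true (i.e. "filter out") when the two
// descriptions of the pair share fewer than k attribute names.
// attributeNames() is assumed to return the set of feature names n(d).
public final class MinSharedAttributeFilter {

    public static Predicate<LabeledPair> psiMSC(final int k) {
        return pair -> {
            Set<String> shared = new HashSet<>(pair.first().attributeNames());
            shared.retainAll(pair.second().attributeNames());   // n(d1) ∩ n(d2)
            return shared.size() < k;
        };
    }

    public interface Description { Set<String> attributeNames(); }
    public interface LabeledPair { Description first(); Description second(); }
}

For the pair in table 8.11, the two descriptions share no attribute name at all, so the pair would be filtered for any k greater than zero.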

8.2.4 Learning a Decision Tree

Decision tree is a simple and quite effective classifier, or predictive model. A decision tree presents as nodes the attributes used in the training samples, and as leaves the target classification classes (e.g. Match, Non Match). A decision tree can be induced from a training set using the C4.5 algorithm [124]. C4.5 is a recursive algorithm based on the estimation of the Information Gain for each of the attributes [113]. Information gain relies on the estimation of the Information Entropy, a concept coming from Information Theory. Information Entropy essentially measures the uncertainty related to a random variable. For example, if we toss a fair coin, the entropy of each toss is at its maximum, i.e. 1. On the other side, if the coin is not fair, and the probability of getting each of the faces is not exactly 50%, then each toss will have a more easily predictable outcome, and thus an entropy value below 1. At its extreme, if the coin presents the same symbol on both faces, then the entropy of each toss will be zero, as we will predict the outcome of the toss with 100% precision. So, in a sense, entropy is a measure of unpredictability. Information Gain (IG) is intuitively the expected reduction in information entropy for each decision represented in the decision tree, i.e. IG = H(T) − H(T|a). The attribute a with the highest information gain (i.e. the one yielding the lowest remaining entropy) is chosen recursively to split the training set and thereby to build the decision tree. The C4.5 algorithm handles the following base cases:

• All the samples of the list belong to the same class: create a leaf node for that class.
• No attribute provides any information gain: create a leaf node on the upper level.
• An instance of a previously unseen class is met: create a leaf node on the upper level.

buildTree(Tree t, samples, attributes){
  if (a base case applies) { add leaf to t; return; }
  for each attribute a in attributes {
    listIG.add(IG(a, samples));
  }
  best_a = attribute with maximum value in listIG;
  t1 = addNode(best_a, t);
  for each subset s of samples split on the values of best_a {
    buildTree(t1, s, attributes \ {best_a});
  }
}
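To make the split criterion concrete, here is a small, self-contained Java sketch of the entropy and information gain computations that C4.5 relies on; it is illustrative only and not taken from the Weka implementation the thesis actually uses.

import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative computation of Shannon entropy H(T) and information gain
// IG(T, a) = H(T) - H(T | a) over class labels; not the Weka/J48 code itself.
public final class InfoGain {

    // H(T): entropy of a list of class labels (e.g. "Match", "NonMatch").
    static double entropy(List<String> labels) {
        Map<String, Integer> counts = new HashMap<>();
        for (String l : labels) counts.merge(l, 1, Integer::sum);
        double h = 0.0;
        for (int c : counts.values()) {
            double p = (double) c / labels.size();
            h -= p * (Math.log(p) / Math.log(2));
        }
        return h;
    }

    // IG = H(T) - sum_v (|T_v| / |T|) * H(T_v), where the partition induced by
    // the candidate attribute is given as a map from attribute value to labels.
    static double informationGain(List<String> labels, Map<String, List<String>> partition) {
        double remainder = 0.0;
        for (List<String> subset : partition.values()) {
            remainder += ((double) subset.size() / labels.size()) * entropy(subset);
        }
        return entropy(labels) - remainder;
    }
}

For the fair-coin example in the text, entropy over one "heads" and one "tails" outcome evaluates to exactly 1.0, while a two-headed coin yields 0.0.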

For a complete description of the algorithm, please refer to the original work of Quinlan [124]. Learning of the decision tree based on the training set as defined in section 8.2.2 was executed relying on Weka 3.6.5 [106]. The Weka data mining software19 is a collection of machine learning algorithms for data mining tasks. The algorithms can either be applied directly to a dataset or included in Java code as Apache Maven20 components. In particular, the learning of the decision tree was executed relying on the J48 implementation of the C4.5 algorithm [124] described above. The first necessary step to learn a decision tree is to represent the pairs of labeled descriptions in a way that can be understood and consumed by the learning algorithms.

19 http://www.cs.waikato.ac.nz/ml/weka/

In particular, what is necessary is to produce a comparison vector v : [w1, . . . , wn] where wi represents the similarity score of the strings representing the values of the attribute

αi contained in both descriptions. Practically, the attributes of the same type in the descriptions are compared according to some comparison operator κ, and the score resulting from the comparison is used to witness how similar the values in the descriptions are for that specific attribute type. Recalling the definition of (6.5), κ is a comparison operator that, taken two strings s1, s2 ∈ S, returns a similarity measure between 0 and 1 according to a similarity metric ω ∈ Ω. Thereby, in order to produce a similarity vector v, we sort all the features defined for the compared type, and we produce a series of similarity scores obtained by applying a function κ that, given the different instances of the same attribute type, produces a similarity score. The combination of the scores of all the matching attributes, combined with the matching decision associated with the descriptions, produces an entry for the learner. When an attribute appears in only one of the two descriptions, or does not appear in either of them, it is clearly not possible to produce any similarity score, and thus the value in the similarity vector is set to a special character to highlight the 'unknown' value for that specific attribute. It is clear that the definition of the similarity vector is strongly affected by two factors:

• the similarity metric adopted;

• the comparison method adopted;

The similarity metric estimates the similarity of attributes when these are represented with syntactic variations across different datasets. This problem was deeply studied in the past, and a single solution satisfying all of the cases was not found. It seems intuitive to assume that some similarity metrics work better than others depending also on the type of attribute. These and other considerations about string similarity metrics are analyzed more in depth in section 9.1. The method of comparison adopted also strongly affects both the learning and the matching process. In fact, when a description contains more than one attribute of the same type, it becomes problematic to choose which one should be representative for the comparison of that pair of descriptions. In fact, there exist attributes that are meant to be instantiated several times with different values. Consider for example the property contained by, which enumerates the transitive hierarchy of locations containing another location. Another example could be the property domain tag, which enumerates in different attribute instances a list of keywords about the person, organization or location. In other cases, there simply exist legitimate variations of the same value of an attribute (e.g. U.S. and United States). Several considerations about how to deal with this type of issue are presented also in section 9.2.

20 http://maven.apache.org/

Figure 8.19: Example of a decision tree extracted from the Organization training set
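As a concrete illustration of how a labeled pair could be turned into a training entry, the following hypothetical Java sketch builds a comparison vector from two attribute maps, using a pluggable string similarity function and a sentinel for attributes missing on either side; names, types and the sentinel value are placeholders, not the actual implementation.

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.BiFunction;

// Sketch: build the comparison vector v = [w1, ..., wn] for a pair of
// descriptions. Each description is modeled here as a map from attribute
// name to a single string value (a simplification: the thesis also handles
// multiple instances of the same attribute, see section 9.2).
public final class ComparisonVectorBuilder {

    public static final double UNKNOWN = -1.0;  // sentinel for missing attributes

    public static List<Double> build(List<String> sortedFeatureNames,
                                     Map<String, String> d1,
                                     Map<String, String> d2,
                                     BiFunction<String, String, Double> kappa) {
        List<Double> v = new ArrayList<>();
        for (String feature : sortedFeatureNames) {
            String v1 = d1.get(feature);
            String v2 = d2.get(feature);
            if (v1 == null || v2 == null) {
                v.add(UNKNOWN);                // attribute absent on at least one side
            } else {
                v.add(kappa.apply(v1, v2));    // similarity in [0, 1]
            }
        }
        return v;
    }
}

Appending the human label (Match, Non Match, Don't Know) to such a vector yields one training instance for the learner.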

8.2.5 Extracting Rules From Decision Trees

In the following we briefly describe a method for extracting rules for open entity matching given a decision tree. For the moment, we consider a single training set Γ from which it is possible to extract a single decision tree classifier. A decision tree can be represented as a finite tree hierarchy, namely a collection of nodes related by a parent/child relation. All nodes have a unique common ancestor named root. Every node, starting from the root, can have an arbitrary number of child nodes. The height of a tree (or depth) is the number of nodes that have to be traversed to reach a leaf starting from the root. Child nodes with a common parent are sibling nodes. The leaves of a tree are nodes with no children. Every path from the root of the tree to each of the leaves corresponds to a matching rule. The leaf of a decision tree learned using a classifier corresponds to the class to which a sample satisfying all the elements of the rule belongs. The process of extracting production rules from decision trees is described in detail in [123]. A classifier c ∈ C produces a decision tree, from which it is possible to extract a set of rules P as defined in section 6 (eq. 6.1). So far, we defined matching rules, a set of tools to verify their satisfaction given pairs of descriptions, and a standard process for extracting rules from decision trees. In [123] it was shown that, relying on production rules extracted from decision trees, it is possible to achieve improvements in performance, while maintaining the advantages of relying on an explicit and easy to understand set of rules. However, relying on rules extracted from a single decision tree may be limiting. In fact, decision tree learning may not be optimal in very large and heterogeneous contexts. Learning an optimal decision tree has a high computational complexity, as it belongs to the NP-Complete class of problems [85]. Hence, most algorithms, including C4.5 described in section 8.2.4, rely on greedy heuristics to take locally optimal decisions, but do not guarantee a globally optimal solution. Furthermore, as many machine learning techniques, decision tree learners suffer from possible over-fitting problems, compensated by pruning heuristics [115]. However, pruning may end up marginalizing not only spurious samples, but also relevant samples only marginally represented in the training set. These facts suggest that pursuing rule extraction from a unique and large training set can be sub-optimal, and thus alternative strategies should be explored. Among others, in this thesis we explore the possibility of obtaining entity matching rules extracted from decision trees learned from partitions of the training set obtained applying several combinatorial heuristics. These partitions of the training set are obtained by applying filter functions to the samples composing the training sets. Filtering functions were presented in more detail in section 8.2.3. In section 8.2.6 we describe some methods of training set partitioning based on filter functions, and propose some combinatorial heuristics. The combination of rules learned from different decision trees creates the problem of merging rules. This aspect is treated in more detail in section 6.4.
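The path-to-rule extraction can be sketched as a simple tree traversal; the following hypothetical Java fragment collects, for every root-to-leaf path, the sequence of branch conditions as one rule whose conclusion is the class stored at the leaf. The node structure is a placeholder, not the Weka object model.

import java.util.ArrayList;
import java.util.List;

// Sketch: turn every root-to-leaf path of a learned decision tree into a
// production rule "atom_1 AND ... AND atom_n => class".
public final class RuleExtractor {

    // Placeholder node: the condition labels the incoming edge; leafClass is
    // non-null only for leaves.
    public static class Node {
        String edgeCondition;   // e.g. "sim(name) > 0.87", null for the root
        String leafClass;       // e.g. "Match"
        List<Node> children = new ArrayList<>();
    }

    public static List<String> extractRules(Node root) {
        List<String> rules = new ArrayList<>();
        collect(root, new ArrayList<>(), rules);
        return rules;
    }

    private static void collect(Node node, List<String> atoms, List<String> rules) {
        List<String> path = new ArrayList<>(atoms);
        if (node.edgeCondition != null) path.add(node.edgeCondition);
        if (node.children.isEmpty()) {                 // leaf: emit one rule
            rules.add(String.join(" AND ", path) + " => " + node.leafClass);
            return;
        }
        for (Node child : node.children) collect(child, path, rules);
    }
}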

8.2.6 Training Set Partitions and Combinations

In this section we propose a few heuristic principles to partition the training set to support the learning of effective and reliable entity matching rules. First of all, we have to define the dimensions we consider relevant for this goal. In this context, we propose to partition the training set according to:

• source of the description. The training set should be partitioned taking into consideration the data source from which the description was retrieved. This dimension is useful to create partitions that are the result of combinations of different sources, without having to consider the whole dataset at once. The intuition

behind this partition choice is that some data sources may contain entities of a specific type (e.g. about musicians), but if the number of samples matching these entities is too small, then specific features for matching musicians may get lost. Thereby, considering pairs, or combinations, of data sources seems a lead worth being experimentally evaluated.

• class of classification. The training set should be partitioned taking into consideration the classification class of the used samples. As mentioned in section 8.2.2, we ask people to label pairs of descriptions considering three classes: Match, Non Match and Don't Know. Multi-class decision tree classifiers may be affected by a higher error rate in training with respect to simple binary classification [1]. However, it is not convenient to give up the information related to uncertainty in the classification contained in the Don't Know labeled cases. Hence, partitioning the training set combining samples of different classes seems a lead worth being experimentally evaluated.

Assuming that we have more than two data sources, we propose to define training sets combining the partitions of labeled pairs of descriptions collected from combinations of data sources. In particular we consider:

• data source binary combination. We produce partitions of the training set considering all possible combinations of data sources of cardinality 2. For example, considering data sources A, B and C, we produce training set partitions containing labeled samples collected from (A,B), (B,C) and (A,C).

• data source power set. We produce partitions of the training set considering the power set combination of the data sources available. Given a set of data sources S, we compute 2^S, excluding the empty set and all the subsets with cardinality lower than 2. For example, considering data sources A, B and C, we produce training set partitions containing labeled samples collected from (A,B), (B,C), (A,C) and (A,B,C). A sketch of this enumeration is given below.
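The enumeration of the source subsets used to build these partitions can be sketched as follows; this is an illustrative helper, not the thesis code, and it simply lists every subset of the sources with cardinality at least 2 (the binary-combination variant is the special case where only subsets of size 2 are kept).

import java.util.ArrayList;
import java.util.List;

// Sketch: enumerate the subsets of the data sources used to partition the
// training set (power-set variant, excluding subsets with fewer than 2 sources).
public final class SourcePartitions {

    public static List<List<String>> powerSetPartitions(List<String> sources) {
        List<List<String>> partitions = new ArrayList<>();
        int n = sources.size();
        for (int mask = 0; mask < (1 << n); mask++) {
            List<String> subset = new ArrayList<>();
            for (int i = 0; i < n; i++) {
                if ((mask & (1 << i)) != 0) subset.add(sources.get(i));
            }
            if (subset.size() >= 2) partitions.add(subset);   // drop the empty set and singletons
        }
        return partitions;
    }
}

For three sources A, B and C this yields the pairs (A,B), (A,C), (B,C) and the triple (A,B,C), mirroring the example above.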

Considering that the number of classification classes is stable at 3, we consider three different types of partitioning:

• simple binary. We produce a training set for classification considering only samples labeled as Match and Non Match, ignoring the Don’t Know cases.

• binary combination. We produce 3 training sets for classification considering pairs of classes (Match, NonMatch), (Match, DontKnow), (NonMatch, DontKnow). The intuition underlying these training set configurations is to exploit to the fullest

the capability of the Don't Know samples to capture uncertain classifications, optimizing the precision of learning the Match and Non Match classifications.

• powerset combination. We produce 4 training sets for classification considering the power set combinations of the classes: (Match, NonMatch), (Match, DontKnow), (NonMatch, DontKnow) and (Match, NonMatch, DontKnow), excluding the empty set and all the subsets with cardinality lower than 2. Given the small number of classes, this type of combination will be experimentally tested for the sake of completeness.

At this point, we have outlined 6 possible variants of the learning process that can in principle produce diverse and overlapping sets of rules. In fact, partitioning the training sets to train several classifiers is going to produce several sets of rules {{ρ11, . . . , ρ1n}, . . . , {ρm1, . . . , ρmk}}. The main objective of this process is to obtain a set of rules more extensive and more precise than the one obtained relying on a single global training set. However, we are likely to end up with a large set of rules that need to be merged. In the following section we formally define the rule normalization and merging processes.

8.3 Combining Top-down and Bottom Up Rules

In this section we aim at describing possible strategies for the integration of the bottom-up learned rules described in section 8.2 with the rules resulting from an ontological analysis of the properties as presented in section 8.1. The optimal integration of these rules must provide improvements with respect to the adoption of each set of rules considered separately. Intuitively, combining rules resulting from different and complementary processes should not be complicated. However, if we put together both sets of rules as they are, we are likely to find some inconsistencies. These would mostly concern the similarity thresholds used to evaluate the satisfiability of rules containing atoms with features appearing both in the bottom-up and top-down rules. In fact, top-down rules are defined without considering either any specific similarity metric, or a similarity threshold coping with possible misspellings or syntactic variants. Therefore, in the remainder of the section we propose several strategies, each of them with pros and cons, that must be experimentally evaluated.

8.3.1 Plain Rules Combination

A first, simple baseline approach is to extend the set of rules resulting from the learning process, and then merge them when possible. This way, the top-down rules have the same relevance as the learned rules, and they are treated exactly as if they were the result of some learning process. Thus, let Pb be the set of rules learned in a bottom-up fashion, and Pt be the rules resulting from a top-down process. The plain rules combination process consists in defining a new set of rules Pi = Pb ∪ Pt without performing any further merging or normalization of the similarity thresholds. In order to evaluate the impact of the integration of matching rules obtained relying on meta-properties of attributes defined in the identification ontology, we need to perform experiments combining rules learned from data with positive only and negative only top-down matching rules.

8.3.2 Plain Top Down Threshold Normalization

A second, simple approach we propose is to perform a plain integration of the rules as described in section 8.3.1, performing a threshold normalization as described in section 6.4.3. The idea is that top-down rules and bottom-up rules should be preserved without performing any merging, but when possible we should consider modifying top-down rule atoms to embed similarity thresholds resulting from the learning process. Top-down rules are the result of an ontological analysis that does not take into consideration the syntactic representation of data, and thus combining these two aspects should provide a set of rules exploiting the points of strength of both approaches.

Thus, let Pb be the set of rules learned in a bottom-up fashion, and Pt be the rules resulting from a top-down process. The plain rules combination process with top-down threshold normalization consists in defining a new set of rules Pi = Pb ∪ Pt, performing for example a Relaxed Match and Conservative Non Match (i.e. RC, defined in section 6.4.3) cross-rule threshold normalization that would affect all the atoms of top-down rules whose features are contained in both top-down and bottom-up rules. This operation would consist in the normalization of top-down defined positive matching rules to rely on a more relaxed threshold for positive matching decisions, and the normalization of top-down defined negative matching rules to rely on a more conservative threshold for negative matching decisions. The expected effect of this normalization step is to smooth the application of top-down rules to accommodate syntactic heterogeneity without losing the constraining power of top-down rules. Furthermore, we intend to evaluate the impact of the integration of normalized matching rules obtained relying on meta-properties of attributes defined in the identification ontology by performing experiments combining rules learned from data with positive only and negative only top-down matching rules.

8.3.3 Top-Down Priority Rules Combination

Another approach could be to consider top-down defined rules as having a higher relevance with respect to the bottom-up learned rules. This relevance could be interpreted by defining a formal hierarchy of the provenance of rules implemented through the subsumption mechanism, and by defining a new merging function that allows rules to be merged into more general rules. Namely, we assume that ontological knowledge expressed through meta-properties of attributes is generally valid and thus we choose to endorse this type of rules in place of the ones learned in a bottom-up fashion. From a practical perspective, this corresponds to the choice of merging any rule subsumed by a top-down extracted rule towards the latter. Let Pb be the set of rules learned in a bottom-up fashion, and Pt be the rules resulting from a top-down process. Formally, we define the merging function µρTD : P × P → P as:

P ρ1, if (ρ2 ⊑ρ ρ1) ∧ ρ1 ∈ t µρTD(ρ1,ρ2)=  (8.6) ρ2, if (ρ1 ⊑ρ ρ2) ∧ ρ2 ∈ Pt  Assuming that the subsumption-based merging process described in section 6.4.4 was already executed on bottom-up extracted rules.

8.3.4 Positive and Negative Only Top Down

Another approach could be to consider only positive or only negative matching top-down defined rules as an integration of the bottom-up learned rules. In fact, one could rely on the knowledge that the training set supporting the bottom-up rules is largely formed by negative matching samples. This could imply that negative matching rules are well represented, and no further integration is necessary. Therefore, we have to evaluate the impact of the integration of top-down positive matching rules only, considering plain integration and integration with threshold normalization.

Chapter 9

Fingerprint Match Solution

Given the premises outlined in the second part of the thesis, and the vision presented in chapter 4, it is clear that the main challenges for the realization of a reliable and effective Knowledge-Based Entity Matching algorithm are:

1. the definition of an adequate identification ontology, encompassing the attribute types that are relevant for taking open world matching decisions;

(a) selection of the entity types considered;
(b) selection, for each of the entity types, of a set of properties suitable to be adopted as identity criteria along the matching process;
(c) annotation of each of the properties with meta-properties supporting the elicitation of matching rules as a result of ontological analysis;

2. the definition of an effective semantic harmonization process mapping equivalent concepts and attributes to the ones defined in the identification ontology;

(a) production of contextual mappings for entity types towards known ontologies and schemas;
(b) production of contextual mappings for features towards known ontologies and schemas;

3. the definition of heuristics easing the problems related to structural heterogeneity, exploiting knowledge about the semantics of the considered features;

4. the definition of a set of identification rules that can guarantee precise matching decisions in an open world context;

5. the definition of a matching process suitable to combine all the matching knowledge to take reliable matching decisions when possible.


Point 1 was treated in detail in chapter 5, together with a formalization of point 2 in section 5.6. In chapter 7 we presented some solutions to the problems related to semantic and structural heterogeneity. Precisely, in section 7.1 we briefly described the process producing mappings of known entity types towards the types defined in the identification ontology (point 2a). In section 7.2 we presented in detail the mappings defined for the features of each of the considered types defined in section 5.5 (point 2b). A solution to some of the problems related to the structural heterogeneity of attributes in descriptions mentioned in point 3 is presented in section 7.3. In chapter 8 we described a method to construct entity matching rules relying on complementary bottom-up and top-down approaches. Finally, in this chapter we define in detail the process leading to the application of entity matching rules to compute entity matching decisions (point 5).

9.1 Computing String Similarity

In this section we briefly describe the practical problems related to the computation of a knowledge-based entity matching decision. Once the solutions to semantic and structural heterogeneity are applied, we need to compute a similarity score between the values of the semantically equivalent features composing the compared descriptions. Then, given these similarity scores, we have to verify whether any of the matching rules defined as described in section 6 is satisfied. It is clear that the similarity metric chosen affects the similarity score of attributes when these are represented with syntactic variations across different datasets. This problem was deeply studied in the past, and a single solution satisfying all of the cases was not found. However, it is clear that using a similarity metric ωeq that returns similarity 1 if and only if the strings are equal would produce a similarity vector where each wi ∈ v is either 1 or 0. This similarity metric, however, would not allow us to capture similar but not equal attributes. In section 3.3 we reviewed some similarity metrics that were defined in past years to solve problems related to record linkage. In this context, we do not deal directly with the string matching problem, and we decided to simply use existing implementations of the most common similarity metrics. In this regard, we relied on the SimMetrics 1.6.2 Java library1, and we decided to experimentally evaluate each of the metrics to understand whether there is a similarity metric that is more suitable to be employed in an open, syntactically heterogeneous context.

1 SimMetrics is a similarity metric library, covering metrics from edit distances (Levenshtein, Gotoh, Jaro, etc.) to other metrics (e.g. Soundex, Chapman). Work provided by the University of Sheffield, UK, funded by AKT, an IRC sponsored by EPSRC, grant number GR/N15764/01. http://sourceforge.net/projects/simmetrics/

9.1.1 Best Similarity Metric Per Feature

Finding a single similarity metric that applies effectively to all the different features may be impossible. Therefore, by choosing a single similarity metric, we are aware that we are selecting the one that works better on average. This type of approach may be sub-optimal, as we could exploit our knowledge of the semantics of the features to use the best similarity metric for each feature type. There exist machine learning techniques aimed at feature selection that we could employ to attempt guessing the best similarity metric for each of the considered types. In section 8.2.2 of chapter 8 we described the labeling process of pairs of samples aimed at the creation of a training set for the extraction of entity matching rules. We could exploit the same training set in order to apply feature selection techniques and learn which is the best similarity metric for each of the considered features and entity types. Among the existing ones, we decided to rely on a feature selection process based on Support Vector Machines (SVM) [49, 77]. SVMs are binary classifiers in which a dataset is represented in a highly dimensional space. Every feature characterizing the data represents a dimension. All the dimensions are represented in a vector. Given a training set of samples labeled according to the objective of the classification, a Support Vector Machine looks for the largest margin based on the hyperplanes defining the borders of the classes. This margin is aimed at optimally separating the given samples. Every sample is represented as a vector, and every element of the vector represents the value of a specific dimension. The largest margin between the two classes in the highly dimensional space is determined by SVM learning algorithms. Then, when a previously unseen sample has to be classified, the support vector machine estimates its position in the highly dimensional space with respect to the margin. The classification decision is then taken according to the position of the sample. In order to maximize the reliability of the classification decision taken by an SVM, the margin between the hyperplanes separating the classes must be maximized. Given the margin defined by the support vector machine, it is possible to estimate the weights of the features that are used to compute the functional margin of a sample (i.e. the distance from the decision hyperplane). Intuitively, the weights of the features reflect how relevant these are for classification purposes, as their evaluation falls closer to the decision hyperplane. In this context, we do not feel the need to further unfold the details of the theory underlying Support Vector Machines, and we limit ourselves to exploiting these features for our purposes. For more details about SVM, please refer to [83]. We briefly discussed how Support Vector Machines can effectively estimate the weights of features with respect to the classification purpose. However, Support Vector Machines are known to optimally learn these weights for the most important features, losing precision in the tail when the training set is small and there is a high number of dimensions [152]. Therefore, we defined a feature selection process based on a variant of the recursive feature elimination process described in [72]. The goal of the feature selection process is to select the similarity metric that works better for classification purposes on the different types of features.
Therefore, what we did was to build, for each of the samples contained in the training set, a similarity vector containing the similarity between the features according to each of the similarity metrics considered (e.g. name equal, name jaro, name levenshtein, etc.). Practically, we compared all the features using all the known similarity metrics, and we produced a similarity vector as a training sample for a Support Vector Machine. A trained support vector machine produces a list of weights for the features used in the classification. The recursive feature elimination algorithm then simply consists in computing the modulus of each weight, ranking them, removing the best feature, and recomputing the classification. The process is formalized in the following steps (a sketch of the loop is given after the steps); given F as an empty list of ranked features, and R as the list of all considered features:

1. Represent the training set according to the features in R;

2. Train a support vector machine with the training set;

3. Remove the feature with the highest weight from R, and put it in F;

4. Repeat 1, 2 and 3 until R is empty;
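The ranking loop above could be sketched as follows; trainAndGetWeights stands in for training the SVM (in the thesis, Weka's SMO) and reading back the per-feature weights, and is left abstract here, so the snippet is illustrative only.

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Sketch of the recursive feature elimination variant described above:
// repeatedly train a classifier on the remaining features R, then move the
// feature with the highest absolute weight to the ranked list F.
public final class RecursiveFeatureRanking {

    public static List<String> rank(List<String> allFeatures,
                                    Function<List<String>, Map<String, Double>> trainAndGetWeights) {
        List<String> r = new ArrayList<>(allFeatures);   // features still to rank
        List<String> f = new ArrayList<>();              // ranked features, best first
        while (!r.isEmpty()) {
            Map<String, Double> weights = trainAndGetWeights.apply(r);  // steps 1 and 2
            String best = null;
            double bestWeight = Double.NEGATIVE_INFINITY;
            for (String feature : r) {
                double w = Math.abs(weights.getOrDefault(feature, 0.0));
                if (w > bestWeight) { bestWeight = w; best = feature; }
            }
            r.remove(best);                              // step 3
            f.add(best);
        }
        return f;                                        // step 4: repeat until R is empty
    }
}

Reading F top-down, the first occurrence of each feature type indicates the similarity metric retained for it.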

At the end of the process, we have in F a ranked list of all the features defined in the identification ontology, tied with each of the similarity metrics considered. By processing this list top-down, we can extract the best similarity metric for each feature. This process is just one among the possible feature selection processes. As the main goal of the thesis is not to solve this problem, we did not investigate other existing solutions in depth. However, Support Vector Machines and Recursive Feature Elimination are known to be effective classification and feature selection techniques. Therefore, even though we cannot claim to have defined the absolutely optimal solution, we are quite confident that the proposed technique is fair enough for evaluation purposes. For this reason, we implemented the process described above in a software component relying on Weka 3.6.5 [106], Apache Maven and Oracle Java 1.6. In particular, we relied on Platt's Sequential Minimal Optimization (SMO) algorithm described in [120] and improved in [91]. Each training step was executed relying on a 10-fold cross validation, in order to reduce the bias of the learned weights.

9.2 Computing Features Comparison

Assuming we selected one similarity metric ω ∈ Ω that works better than the others, we now have another problem related to how to compare multiple instances of semantically equivalent features. In fact, as shown in section 5.5, there exist features that are not functional, and furthermore, there could also be several syntactic variations of the same feature. See for example the names:

name: Antônio Carlos Brasileiro de Almeida Jobim
name: Jobim, Antonio Carlos

which are part of the same description. Therefore, a question arises: which one should we compare? In principle, this question cannot have an absolutely correct answer a priori. In fact, any choice is prone to error. One would say that the best option is to compare the most complete rendering of the attribute, but this exposes the comparison to inconsistencies, as shown in section 2.1. Consider for example the attributes:

foaf:givenName: Antônio Carlos Brasileiro de Almeida Jobim;
foaf:givenName: Antônio Carlos

In this case, the choice of the most complete (or longest) version of the attribute would lead to errors in the matching. Therefore, we believe that in principle all the instances of equivalent features should be compared with each other. However, at the bottom of any comparison, we still have to decide whether a rule is satisfied or not. Therefore, we have to define a way to produce a unique similarity score representative of all the variations. In the following we propose three possible variants, with pros and cons.

9.2.1 Greedy Features Comparison

The simplest and most straightforward approach is to greedily pick the best comparison result as the one representative of the similarity between two sets of semantically equivalent features. Namely, when comparing two sets of features, we select the maximum similarity score. In a sense, we have to define a greedy version of the κ comparator defined in equation (6.5) at page 102. Formally, let us define κGREEDY : Ω × S[v1, . . . , vn] × S[v1, . . . , vm] → ℜ ∈ [0, 1] as the function implementing the following process:

greedyComparator(List features1, List features2, Comparator o){
  max = 0.0;
  for each f1 in features1{
    for each f2 in features2{
      score = o.compare(f1, f2);
      if(score == 1.0){
        return score;
      } else if(score > max){
        max = score;
      }
    }
  }

  return max;
}

The process described above guarantees a maximization of the similarity score between equivalent features, neglecting incompatible syntactical variations. However, this approach suffers from issues related to the incompleteness of some attribute variations, as the comparator would greedily choose the best match regardless of the completeness of the attribute. To better clarify this aspect, consider the examples presented in table 9.1. Descriptions 1 and 2 are about two Brazilian football players. They have in principle very different names, but they are both known by the name they put on their t-shirt while playing football. Although the descriptions do not match, by choosing a greedy approach in matching this feature, we would have a false perfect matching score.

1 name: Emerson Moises Costa; name: Emerson Moisés Costa; name: Emerson; birthdate: 1972-04-12; ...
2 name: Émerson Ferreira da Rosa; name: Rosa, Emerson Ferreira da; name: Emerson; birthdate: 1976-04-04

Table 9.1: Example of a problematic greedy match of features

9.2.2 Features Comparison with Relative Completeness

In order to reduce the effects of the adoption of a greedy approach, we now introduce the concept of Relative Completeness as a measure of the completeness of a value of a feature with respect to other values of the same feature in the same description. The completeness of a feature measures how complete the value of a feature is with respect to its most complete rendering. To have a measure of completeness for any feature would require knowing the most complete value of any feature a priori. As this is not possible in principle, we limit our analysis of completeness to the description where the feature appears. Thereby, what we can consider in this context is simply the Relative Completeness of the feature. In words, the estimation of the relative completeness of a feature consists in estimating how complete the value of a feature is by computing a ratio with respect to the most complete value in the same description. In this context, we assume that the most complete value is the longest one in the description. Thus, given ci as a measurement of the syntactical length of the value of a feature, and cmax as the length of the longest feature value among the semantically equivalent ones in the description, we can compute RCi = ci / cmax. To give a practical example, given the three values of the attribute name contained in description 2 of table 9.1:

name: Émerson Ferreira da Rosa

name: Rosa, Emerson Ferreira da
name: Emerson

considering a token-based relative completeness, the first two attributes would have a relative completeness of RC1 = 4/4 = 1, whereas the third attribute would have a relative completeness of RC3 = 1/4 = 0.25. The relative completeness estimated in this way is then used to weight the similarity score obtained by the comparison. Therefore, if we compare Emerson with Emerson, the similarity score will be penalized by the low weight, and we would reduce the problem of false positive matching. Let us define κRC as the function implementing the following process:

greedyRCComparator(List features1, List features2, Comparator o){
  max = 0.0;
  relative-completeness-max-1 = 0.0;
  relative-completeness-max-2 = 0.0;

  for each f1 in features1{
    if(f1.length > relative-completeness-max-1)
      relative-completeness-max-1 = f1.length;
  }

  for each f2 in features2{
    if(f2.length > relative-completeness-max-2)
      relative-completeness-max-2 = f2.length;
  }

  for each f1 in features1{
    for each f2 in features2{
      rc1 = f1.length / relative-completeness-max-1;
      rc2 = f2.length / relative-completeness-max-2;
      score = o.compare(f1, f2);
      normalizedScore = rc1 * rc2 * score;
      if(normalizedScore == 1.0){
        return normalizedScore;
      } else if(normalizedScore > max){
        max = normalizedScore;
      }
    }
  }
  return max;
}

Notice that in principle we can estimate the completeness of a feature considering different levels of granularity. For example, we can consider character-based completeness estimation (like the one in the pseudo-code above), or token-based completeness estimation relying on the number of different words composing the feature values. Both approaches have pros and cons that will be evaluated experimentally.
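As a small illustration of the token-based variant, the following hypothetical helper computes the relative completeness of each value of a feature from its number of whitespace-separated tokens; for description 2 of table 9.1 it yields 1.0 for the two long renderings of the name and 0.25 for the bare "Emerson".

import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch: token-based relative completeness RC_i = c_i / c_max, where c_i is
// the number of tokens of the i-th value of a feature within one description.
public final class RelativeCompleteness {

    public static Map<String, Double> tokenBased(List<String> values) {
        int cMax = 0;
        for (String v : values) {
            cMax = Math.max(cMax, v.trim().split("\\s+").length);
        }
        Map<String, Double> rc = new HashMap<>();
        for (String v : values) {
            rc.put(v, (double) v.trim().split("\\s+").length / cMax);
        }
        return rc;
    }
}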

9.2.3 Features Comparison considering Average Score

The application of Relative Completeness may reduce the number of false positive matches when comparing different incomplete syntactic variations of the same

attribute values. However, this might not always be the case. In fact, all the non-functional attributes may legally present different values for the same feature. Therefore, applying a greedy approach in this context may produce further distortions. For example, consider the property contained by defined for the entity type Location. Consider the example of Rovereto, as a location in Trentino and in Emilia Romagna, presented in table 9.2.

1 name: Rovereto; contained by: Provincia di Trento; contained by: Trentino Alto Adige; contained by: Italy; contained by: Europe; ...
2 name: Rovereto; contained by: Provincia di Modena; contained by: Emilia Romagna; contained by: Italy; contained by: Europe; ...

Table 9.2: Example of a problematic greedy match of non-functional features

It is pretty clear that the locations do not match because they are contained in different locations, and this information is explicitly represented in the descriptions. However, a greedy approach in matching the name and the contained by features would lead to a false positive match. In fact, the perfect match of "Italy" as a value of the feature contained by would produce a similarity of 1.0 between the two features. To avoid this type of inconvenience we propose to consider the average best score of all the instances of the features. Let us define κAVG as the function implementing the following process:

averageComparator(List features1, List features2, Comparator o){
  if(features2.size < features1.size){
    tmp = features2;
    features2 = features1;
    features1 = tmp;
  }
  maxSum = 0.0;
  for each f1 in features1{
    max = 0.0;
    for each f2 in features2{
      score = o.compare(f1, f2);
      if(score > max){
        max = score;
      }
    }
    maxSum = maxSum + max;
  }

  return maxSum / features1.size;
}

It is important to notice that the average is computed over the smaller number of features, and if one of the lists contains just one feature, then the matching process is equivalent to the greedy one.

9.3 Computing Rule-Based Matching Decision

So far, we have described how we can deal with the comparison of multiple instances of the same feature type (a.k.a. semantically equivalent features). The management of this process allows us to produce a similarity score between features contained in different descriptions. Now we need to apply the result of this process to verify the satisfaction of rules for entity matching. A simple way to compare descriptions is to greedily attempt to apply all the matching rules until one is satisfied. Every rule explicitly supports a matching decision, and when it is satisfied a matching decision can be taken. In case no rule is satisfied, we assume that the final decision is Don't Know. Intuitively, the order of application of the rules may affect the final matching decision. In fact, as analyzed in chapter 2, there may be inconsistencies or errors in the data contained in the descriptions. Therefore, we cannot exclude the case where pairs of descriptions satisfy positive and negative matching rules at the same time. Therefore, we have to be careful in deciding both the order and the nature of the rule application process. If we apply positive matching rules first, we may have some more false positives, whereas if we apply negative matching rules first, we are likely to have some false negatives. This is, for example, the case where the fiscal codes of two persons match, but the dates of birth are different. This is a clear inconsistency in the data, but how do we choose which one wins? Our intuition is that the satisfaction of positive matching rules is harder in principle than the satisfaction of negative matching rules. In fact, whenever the syntactic structural harmonization of attributes such as a date does not work, we may end up comparing pieces of information that are equal, but syntactically very different (e.g. "12/02/1982" and "12 of February 82"). As we cannot assume that the compared descriptions rely on homogeneous syntactic representations, we should give more value to matching instances of the attributes rather than to the ones that do not match. Therefore, we believe that positive matching rules should in general be applied first, in a greedy context.
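The resulting greedy, positive-first decision procedure can be sketched as follows; Rule, Pair and the satisfied check stand in for the rule model and the satisfaction machinery of chapter 6, and are not actual code from the thesis.

import java.util.List;
import java.util.function.BiPredicate;

// Sketch of the greedy rule application order: try positive matching rules
// first, then negative ones; if no rule fires, answer DONT_KNOW.
public final class RuleBasedDecision {

    public enum Decision { MATCH, NON_MATCH, DONT_KNOW }

    public static <Rule, Pair> Decision decide(Pair pair,
                                               List<Rule> positiveRules,
                                               List<Rule> negativeRules,
                                               BiPredicate<Rule, Pair> satisfied) {
        for (Rule r : positiveRules) {
            if (satisfied.test(r, pair)) return Decision.MATCH;      // positive rules win ties
        }
        for (Rule r : negativeRules) {
            if (satisfied.test(r, pair)) return Decision.NON_MATCH;
        }
        return Decision.DONT_KNOW;                                   // no rule satisfied
    }
}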

9.4 Fingerprint Similarity

In the previous sections we outlined solutions to some of the issues related to the computation of knowledge-driven rule-based entity matching. In particular, we presented solutions related to the problem of the selection of the similarity metric to adopt, we analyzed the pros and cons of feature comparison techniques, and we briefly discussed approaches for the application of rules. In this section, we want to describe how we can exploit partially satisfied rules to define a simple distance-based solution that can be applied whenever the complete satisfaction of a rule is not possible. In fact, depending on how restrictive the rules are, we may end up comparing examples where no positive or negative rule is satisfied.

In this case, we want to draw a DONTKNOW matching decision, but we also want to estimate how similar the compared descriptions are. In section 9.1.1, we mentioned how trained Support Vector Machine (SVM) classifiers can produce weights for the features according to their relevance in supporting classification choices. Notice that in this context we do not deal with the technical details related to SVMs, but we simply use them, exploiting their known effectiveness for classification and regression purposes. Let us assume that, relying on a trained support vector machine, we can gather a weight estimation for each of the features defined in the identification ontology outlined in chapter 5. Define W : w1, . . . , wn as the list of weights of the features, where wi is the weight of the i-th feature. We now want to estimate the distance between two descriptions whenever it was not possible to satisfy any of the defined rules. When this happens, we want to backtrack and select the best partially satisfied rule to compute a distance between the compared descriptions. We define the best partially satisfied rule as the positive matching rule with the highest sum of weights of satisfied atoms, among the ones with the highest ratio of satisfied atoms. More formally, let us define the satisfaction ratio SR of a rule ρ as the ratio between the number of satisfied atoms in the rule and the number of atoms composing the rule, SR(ρ) = |{θ ∈ ρ | satisfyθ(θ)}| / |{θ ∈ ρ}|. Intuitively, the higher the number of satisfied atoms, the higher the ratio. A completely satisfied rule has SR = 1. The satisfaction ratio of rules may not be enough to discriminate the best partially satisfied rule; in fact there may be rules presenting the same number of atoms, partially satisfied by different sets of compared features. Therefore, among partially satisfied rules with equal satisfaction ratio, we want to select the one whose unsatisfied atoms embed the features with the minimum weight.

Hence, let’s define the function δSATρ as the function that extract the feature of satisfied atoms:

δSATρ(ρ) : {α ∈ A | ∃θ ∈ ρ ∧ α ∈ θ ∧ satisfyθ(θ)}   (9.1)

Then, relying on the δρ function extracting the features used in a rule, as defined in equation (6.2) at page 101, we can extract the names of the unsatisfied atoms of a rule ρ simply by computing δρ(ρ) \ δSATρ(ρ). Given all the rules with equal satisfaction ratio, we can select the one with the minimum unsatisfied weight. If unsatisfied atoms happen to have the same weight, then we can randomly select one of them. Now that we have defined how to select the best partially satisfied rule, we can proceed to compute the similarity distance between unknown matching pairs. Given the list of weights of the features of the partially satisfied rule [w1, . . . , wm], with m as the number of atoms composing the rule ρ, and the similarity scores [sim1, . . . , simn] computed for the satisfaction of the n satisfied atoms of the rule ρ using a string similarity metric ω ∈ Ω, we compute the fingerprint similarity as follows:

\[
fp(\rho, [sim_1,\ldots,sim_n], [w_1,\ldots,w_m]) = \frac{\sum_{i=0}^{n} w_i \cdot sim_i}{\sum_{i=0}^{m} w_i}
\tag{9.2}
\]

Relying on this formula, we can now compute a similarity score that can be used for ranking purposes when no sharp rule can be reliably satisfied. In the next chapter, we will evaluate the effectiveness of this similarity metric, and compare it with the Feature-Based Entity Matching algorithm [139], currently deployed as the default matching module in the Okkam Entity Name System.

Chapter 10

Experimental Evaluation

In the third part of the thesis, we described the tools intended to be part of a knowledge-based solution for open entity matching. In chapter 7 we presented simple and straightforward solutions to the problems of semantic and structural heterogeneity. In particular, we presented a part of the contextual mappings defined, and how we exploited the semantics of some attributes to reduce the structural heterogeneity due to different granularities in the representation of features like name, date, and geo-coordinates. The pursuit of a solution to entity matching in the context of the (Semantic) Web is motivated also by the unreliability of a large part of owl:sameAs statements [73, 74]. We believe that spurious owl:sameAs statements undermine the development of the Semantic Web. Therefore, we decided to test the Fingerprint Match solution on the following datasets:

• evaluation on manually annotated dataset (person, location, organization);

• evaluation based on New York Times datasets (person, location, organization);

• evaluation based on OAEI Instance Matching 20101 (person, organization);

These will be described in more detail in section 10.1. The evaluation of every possible combination of factors affecting the construction of rules described in this thesis is not feasible in the context of this work. In particular, we decided to evaluate the factors presented in the following paragraphs. At the bottom of any matching solution lies a string similarity metric. This factor surely affects both the quality of the learning process and the application of rules. Furthermore, we also want to evaluate the impact of different methods of comparison. In particular, we choose to evaluate and compare the Simple Greedy approach, the Knowledge-driven approach and the Greedy approach with Relative Completeness

¹ http://oaei.ontologymatching.org/


We chose these three because they are representative of all the variations presented above: a simple greedy approach, a combination of the greedy and average approaches, and a greedy approach relying on relative weight estimation to normalize similarity scores.

Another important aspect to evaluate, recalling that one of the hypotheses to test is that mixing bottom-up and top-down rules can provide benefits with respect to the adoption of either approach alone, is the impact of the rule inconsistency normalization process. That is, we want to evaluate whether the definition of shorter rules obtained by applying inconsistency removal, and the definition of more conservative rules obtained by applying inconsistent atom normalization, have an impact on the overall matching process. Besides atom inconsistency normalization, an important aspect to consider is how we normalize the string similarity thresholds for the satisfaction of rule atoms in both positive and negative matching rules. In fact, the selection of conservative thresholds should result in a more conservative classification, with a larger set of samples classified as don't know. Conversely, the adoption of relaxed thresholds may negatively affect the precision of the classification. Another important factor affecting the generation of rules for open entity matching is the process of integration of top-down and bottom-up learned rules. In this context we are interested in evaluating the impact of a plain integration of all top-down rules, compared with the integration of only positive and only negative top-down defined matching rules.

The evaluation of the combination of all these factors for all the datasets and the considered entity types already produces a quite extensive set of experiments to run. However, given the theoretical foundation provided in chapter 6.1 for the definition of rules, we also want to evaluate the impact of a binary versus a three-class classification. In fact, it seems interesting to evaluate how the presence of the unknown labeled samples affects the rule learning process. In section 10.2.1 we present an evaluation of the proposed method relying only on top-down extracted rules. In section 10.2.2 we propose an evaluation of the methods to learn rules from a training set. Finally, in section 10.2 we evaluate the integrated solution mixing top-down and learned rules.

10.1 Evaluation Datasets

In this section we describe the evaluation datasets, providing descriptive statistics analogous to those defined for the training sets presented in section 8.2.2. In order to make the description of the datasets more compact, we chose a tabular form this time.

Source | |D| | λ(A) | σ(A)  | DBP    | Fac     | FrB     | LFM     | MBz       | Okk      | OCs
DBP    | 208 | 84.8 | 39.7  | 0/89/0 | 7/102/4 | 87/94/1 | 14/80/2 | 10/206/18 | 22/465/1 | 3/152/4
Fac    | 76  | 21.5 | 5.0   | -      | 2/1/1   | 0/73/1  | 0/25/1  | 0/103/2   | 1/12/1   | 1/3/0
FrB    | 99  | 59.0 | 43.9  | -      | -       | 1/32/1  | 4/66/3  | 6/229/14  | 13/22/4  | 1/0/0
LFM    | 54  | 31.7 | 17.07 | -      | -       | -       | 0/1/0   | 9/76/33   | 1/4/1    | 0/0/0
MBz    | 82  | 9.4  | 5.6   | -      | -       | -       | -       | 0/16/0    | 0/9/1    | 0/0/0
Okk    | 75  | 11.9 | 1.2   | -      | -       | -       | -       | -         | 0/10/0   | 0/4/0
OCs    | 14  | 16.3 | 6.3   | -      | -       | -       | -       | -         | -        | 0/0/0

Table 10.1: Manually Annotated Evaluation Set Description for Person. Each cell under a source column reports the matching/non-matching/unknown (M/N/U) sample counts for the corresponding source pair.

In each table, the column |D| indicates the number of descriptions coming from a source, λ(A) the average number of attributes per description, and σ(A) the corresponding standard deviation. Furthermore, the source names are abbreviated to further compact the tables: DBP stands for DBpedia, Fac for Factual, FrB for Freebase, LFM for LastFM, MBz for MusicBrainz, Okk for Okkam, OCs for OpenCongress, Geo for Geonames, and OCPs for OpenCorporates.

10.1.1 Person Evaluation Datasets

The datasets used to evaluate rules extracted for entity type person are:

1. A manually labeled evaluation set, the result of the same process used for the generation of the training set, but starting from a different set of seed queries. This evaluation set is described in table 10.1, and contains 2145 samples (182 positive samples, 1874 negative samples, and 89 unknown), involving 608 different descriptions and 28752 different attributes of 1689 different types.

2. The New York Times dataset for people², aiming at discovering owl:sameAs links between Freebase and DBpedia. The dataset is formed by 4614 positive matching pairs, with λ(A) = 101.53 and σ(A) = 52.42 for DBpedia descriptions, and λ(A) = 109.29 and σ(A) = 77.91 for Freebase descriptions. Notice that the DBpedia and Freebase descriptions present 468192 attributes of 1709 different types and 508969 attributes of 1414 different types, respectively.

3. A dataset from the OAEI Instance Matching 2010 evaluation, composed of 900 positive matching pairs of person descriptions containing different types of perturbations. In particular, the dataset contains 2000 different descriptions with λ(A) = 13 and σ(A) = 0, for a total of 26000 attributes of 19 different types. Notice that the matched descriptions were constructed by traversing the RDF graph provided.

² http://data.nytimes.com/people.rdf

Source | |D| | λ(A)  | σ(A) | DBP    | Geo    | Fac    | FrB   | Okk
DBP    | 28  | 100.8 | 55.7 | 0/12/0 | 3/33/0 | 0/20/0 | 0/1/0 | 0/68/0
Geo    | 78  | 13.01 | 0.4  | -      | 0/45/0 | 2/43/2 | 0/0/0 | 9/75/0
Fac    | 29  | 16.75 | 3.51 | -      | -      | 1/29/1 | 0/0/0 | 2/34/0
FrB    | 1   | -     | -    | -      | -      | -      | 0/0/0 | 0/0/0
Okk    | 45  | 12.42 | 1.8  | -      | -      | -      | -     | 1/30/3

Table 10.2: Manually Annotated Evaluation Set Description for Location (M/N/U sample counts per source pair).

For example, the OAEI person dataset represented street addresses of people as resources; therefore, to compose an address we had to traverse the graph until literal values were found.
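To give an idea of the traversal involved, here is a minimal sketch, assuming rdflib and a hypothetical address resource: it follows object properties from a starting resource and collects the literals it reaches, which is essentially how the address resources were flattened into attribute values.

    # Minimal sketch (assumptions: rdflib is available; the data is loaded from a
    # local RDF file; 'start' is the URI of a person's address resource).
    from rdflib import Graph, URIRef, Literal

    def collect_literals(graph: Graph, start: URIRef, max_depth: int = 3) -> list:
        """Depth-first traversal from 'start', gathering every literal reachable
        through object properties up to 'max_depth' hops."""
        literals, visited = [], set()

        def visit(node, depth):
            if depth > max_depth or node in visited:
                return
            visited.add(node)
            for _, _, obj in graph.triples((node, None, None)):
                if isinstance(obj, Literal):
                    literals.append(str(obj))
                else:                      # URIRef or blank node: follow it
                    visit(obj, depth + 1)

        visit(start, 0)
        return literals

    # Example usage (hypothetical file and URI):
    # g = Graph().parse("oaei_person.rdf")
    # address_parts = collect_literals(g, URIRef("http://example.org/address/42"))
    # address = ", ".join(address_parts)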

10.1.2 Location Evaluation Datasets

The datasets used to evaluate rules extracted for entity type location are:

1. A manually labeled evaluation set, the result of the same process used for the generation of the training set, but starting from a different set of seed queries. This evaluation set is described in table 10.2, and contains 414 samples (18 positive samples, 390 negative samples, and 6 unknown), involving 181 different descriptions and 5336 different attributes of 430 different types.

2. The New York Times dataset for locations³, aiming at discovering owl:sameAs links among Freebase, Geonames and DBpedia. The dataset is formed by 3577 positive matching pairs, as described in table 10.3. The evaluation set contains 5452 different descriptions presenting 665478 attributes of 3809 different types.

Source | |D|  | λ(A)  | σ(A)  | DBP    | Geo      | FrB
DBP    | 1826 | 136.6 | 77.3  | 37/0/0 | 1127/0/0 | 1190/0/0
Geo    | 1749 | 77.7  | 85.8  | -      | 0/0/0    | 1122/0/0
FrB    | 1877 | 149.1 | 133.4 | -      | -        | 0/0/0

Table 10.3: New York Times Evaluation Set Description for Location (M/N/U sample counts per source pair).

10.1.3 Organization Evaluation Datasets

The datasets used to evaluate rules extracted for entity type organization are:

1. A manually labeled evaluation set, the result of the same process used for the generation of the training set, but starting from a different set of seed queries. This evaluation set is described in table 10.4, and contains 1511 samples (45 positive samples, 1430 negative samples, and 27 unknown), involving 812 different descriptions presenting 20062 different attributes of 920 different types.

³ http://data.nytimes.com/location.rdf

2. The New York Times dataset for organizations⁴, aiming at discovering owl:sameAs links between Freebase and DBpedia. The dataset is formed by 4614 positive matching pairs, with λ(A) = 72.08 and σ(A) = 33.83 for DBpedia descriptions, and λ(A) = 114.15 and σ(A) = 68.37 for Freebase descriptions. Notice that the 1872 descriptions from DBpedia presented 134951 attributes of 1601 different types, and the 1873 descriptions from Freebase presented 213817 attributes of 1711 different types.

3. A dataset from the OAEI Instance Matching 2010 evaluation, composed of 89 positive matching pairs of real-world restaurant descriptions containing different types of perturbations. In particular, the dataset contains 754 different descriptions with λ(A) = 7 and σ(A) = 0, for a total of 5278 attributes of 9 different types. Notice that also in this case the matched descriptions were constructed by traversing the RDF graph provided. For example, the OAEI restaurant dataset represented street addresses of restaurants as resources; therefore, to compose an address we had to traverse the graph until literal values were found.

10.2 Evaluating Fingerprint Match

In this section we evaluate the performance of the knowledge-based solution we defined. To do so, we run a set of experiments and estimate accuracy relying on the standard record linkage evaluation metrics precision, recall, and F-measure:

precision = TP / (TP + FP),  recall = TP / (TP + FN),  F-measure = 2 × (precision × recall) / (precision + recall).

Furthermore, we estimate a more custom accuracy metric we named ρ-accuracy, which computes a general accuracy measure reflecting the principle of our evaluation. Namely, we would like to evaluate a method not only on how well it discovers matching pairs, but also on how conservatively it takes negative matching decisions. In order to do so, we compute an evaluation measure that reduces the effect of conservative don't know decisions and penalizes greedy false positive match classifications. More formally:

ρ-accuracy = (TP + TN + TD) / ( (k × FP) + FN + (FD / k) + TP + TN + TD )    (10.1)

with TP the true positive rate, TN the true negative rate, TD the true DontKnow rate, FD the false DontKnow rate, FP the false positive rate, FN the false negative rate, and k > 1. This way we aim at giving a higher score to the experiment configurations that discover a higher number of positive matches, while penalizing false positives. Notice that we discount false DontKnow classifications, as we would rather have a conservative matcher aiming at reliable matching decisions. For our experiments, we relied on ρ-accuracy with k = 2.

⁴ http://data.nytimes.com/organization.rdf

Source | |D| | λ(A) | σ(A) | DBP    | Fac     | FrB     | MBz    | Okk    | OCPs
DBP    | 88  | 73.8 | 39.5 | 0/65/1 | 5/138/2 | 19/75/1 | 0/11/0 | 5/18/0 | 0/38/0
Fac    | 351 | 16.8 | 7.0  | -      | 1/171/0 | 4/162/6 | 0/28/1 | 0/0/0  | 0/108/2
FrB    | 136 | 31.0 | 36.6 | -      | -       | 2/151/7 | 5/14/1 | 0/0/0  | 4/139/0
MBz    | 7   | 7.7  | 1.7  | -      | -       | -       | 0/0/0  | 0/0/0  | 0/0/0
Okk    | 30  | 10.7 | 1.2  | -      | -       | -       | -      | 0/15/0 | 0/12/0
OCPs   | 191 | 15.7 | 4.9  | -      | -       | -       | -      | -      | 0/258/6

Table 10.4: Manually Annotated Evaluation Set Description for Organization (M/N/U sample counts per source pair).
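As a concrete reading of equation (10.1), here is a minimal sketch of the metric in Python, assuming the six counts have already been obtained from a three-class confusion matrix; the function name and argument layout are illustrative and not taken from the evaluation code.

    # Minimal sketch of the rho-accuracy of equation (10.1); k > 1 penalizes false
    # positives and discounts false DontKnow decisions (conservative behaviour).
    def rho_accuracy(tp, tn, td, fp, fn, fd, k=2.0):
        correct = tp + tn + td
        denom = (k * fp) + fn + (fd / k) + correct
        return correct / denom if denom > 0 else 0.0

    # Example: a conservative run with a few don't-know decisions
    # print(rho_accuracy(tp=50, tn=900, td=10, fp=2, fn=5, fd=8))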

10.2.1 Top-down Only Rules Experiments

In this section we evaluate our matching solution relying on matching rules extracted only from the terms defined in the identification ontology. Due to the low number of positive matching attributes (i.e. inverse-functional properties), we decided to define positive and negative matching rules also relying on combinations of functional properties. Given the large number of functional properties included in the ontology, we had to limit the size of the combinations of attributes considered. Therefore, relying on the assumption that a matching name is an essential attribute, we consider as a match any combination of 3 functional attributes plus the name, each with a similarity of at least 0.9, and we consider as non-matching the functional attributes plus the name with a similarity below 0.5. We then evaluated the rule generation process relying on 3 different matching comparison approaches, and considering each similarity metric singularly. We are aware that the choice of the number of attributes is arbitrary, as are the thresholds, but the evaluation of this type of approach is out of the scope of this work. In fact, the combinatorial generation of rules is not the goal of this thesis, and there exist very sophisticated regression methods such as the ones described in [118, 52, 88]. We believe that 0.9 is a fair threshold to consider an attribute as matching, as it was also used in other works such as [118]. Furthermore, 0.5 seems to be a fair threshold for non-matching attributes. All cases not satisfying any of these clauses are considered unknown.
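As an illustration of how such top-down rules could be enumerated, the following is a minimal sketch under the assumptions stated above (a match requires the name plus 3 functional attributes, each with similarity at least 0.9; one reading of the non-match condition requires the name and the functional attributes to stay below 0.5). The rule representation and the property names in the usage example are hypothetical and not taken from the identification ontology.

    # Minimal sketch: enumerating top-down matching rules from functional
    # properties. A rule is represented as (label, [(attribute, operator, threshold)]).
    from itertools import combinations

    def topdown_rules(functional_props, match_thr=0.9, nonmatch_thr=0.5, size=3):
        rules = []
        for combo in combinations(functional_props, size):
            # positive rule: the name plus 3 functional attributes, all similar >= 0.9
            atoms = [("name", ">=", match_thr)] + [(p, ">=", match_thr) for p in combo]
            rules.append(("MATCH", atoms))
        # negative rule: the name plus the functional attributes, all below 0.5
        atoms = [("name", "<", nonmatch_thr)] + [(p, "<", nonmatch_thr) for p in functional_props]
        rules.append(("NON_MATCH", atoms))
        return rules

    # Example usage with a few hypothetical functional properties:
    # rules = topdown_rules(["birth_date", "nationality", "profession", "gender"])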

Top-down Rules for Person

In table 10.5 we present the results of the matching experiments executed relying on the top-down defined rules. The matching algorithms applying these rules produced quite impressively

EQ. Lev. Euc. Jac. Jar. Mgk. Ovl. QGr. SMW. NWu. Tag. Precision 0.929 0.929 0.929 0.929 0.929 0.817 0.939 0.929 0.938 0.929 0.922 Match Recall 0.298 0.298 0.298 0.298 0.298 0.443 0.351 0.298 0.344 0.298 0.359 F-Measure 0.451 0.451 0.451 0.451 0.451 0.574 0.511 0.451 0.503 0.451 0.516 Precision 0.895 0.939 0.939 0.94 0.939 0.962 0.955 0.939 0.957 0.939 0.948 NonMatch Recall 0.729 0.977 0.977 0.977 0.977 0.892 0.836 0.977 0.814 0.977 0.977 F-Measure 0.804 0.958 0.958 0.958 0.958 0.926 0.892 0.958 0.88 0.958 0.962 Precision 0.031 0.252 0.252 0.26 0.252 0.186 0.139 0.252 0.143 0.252 0.277 DontKnow Recall 0.182 0.295 0.295 0.307 0.295 0.591 0.625 0.295 0.716 0.295 0.352 F-Measure 0.053 0.272 0.272 0.281 0.272 0.283 0.228 0.272 0.238 0.272 0.31 Precision 0.859 0.909 0.909 0.909 0.909 0.919 0.918 0.909 0.921 0.909 0.917 Weighted Avg Recall 0.678 0.904 0.904 0.905 0.904 0.85 0.796 0.904 0.78 0.904 0.91 F-Measure 0.748 0.896 0.896 0.896 0.896 0.875 0.838 0.896 0.828 0.896 0.905 ρ-Accuracy 0.771 0.921 0.921 0.922 0.921 0.896 0.869 0.921 0.86 0.921 0.927 Table 10.5: Greedy Comparison on Manually Annotated Dataset for Person

EQ. Lev. Euc. Jac. Jar. Mgk. Ovl. QGr. SMW. NWu. Tag. Precision 0.929 0.929 0.929 0.929 0.929 0.934 0.939 0.929 0.939 0.929 0.922 Match Recall 0.298 0.298 0.298 0.298 0.298 0.435 0.351 0.298 0.351 0.298 0.359 F-Measure 0.451 0.451 0.451 0.451 0.451 0.594 0.511 0.451 0.511 0.451 0.516 Precision 0.895 0.939 0.939 0.94 0.939 0.962 0.955 0.939 0.958 0.939 0.948 NonMatch Recall 0.729 0.977 0.977 0.977 0.977 0.896 0.836 0.977 0.814 0.977 0.977 F-Measure 0.804 0.958 0.958 0.958 0.958 0.928 0.892 0.958 0.88 0.958 0.962 Precision 0.031 0.252 0.252 0.26 0.252 0.187 0.139 0.252 0.143 0.252 0.277 DontKnow Recall 0.182 0.295 0.295 0.307 0.295 0.602 0.625 0.295 0.716 0.295 0.352 F-Measure 0.053 0.272 0.272 0.281 0.272 0.286 0.228 0.272 0.238 0.272 0.31 Precision 0.859 0.909 0.909 0.909 0.909 0.927 0.918 0.909 0.921 0.909 0.917 Weighted Avg Recall 0.678 0.904 0.904 0.905 0.904 0.853 0.796 0.904 0.78 0.904 0.91 F-Measure 0.748 0.896 0.896 0.896 0.896 0.879 0.838 0.896 0.829 0.896 0.905 ρ-Accuracy 0.771 0.921 0.921 0.922 0.921 0.903 0.869 0.921 0.86 0.921 0.927 Table 10.6: Knowledge-driven Comparison on Manually Annotated Dataset for Person good results. In particular, the similarity metrics performing slightly better than the others is Taglink, but the others also performed well. The non matching decisions were taken with a high precision, but not perfectly. This may be due to the fact that the number of don’t know labeled samples is relatively high, causing somehow troubles in taking non matching decisions. Positive matching decisions were taken with a high level of precision, but not perfectly as well. The presence of don’t know labeled samples may explain also in this case a non perfect precision. It is interesting to notice how the ρ-accuracy has a higher score than the weighted F-measure. This is probably due to the fact that many ambiguous cases both from match and non match were classified as don’t know. This may be the effect of the structural heterogeneity among the compared descriptions, decreasing the number of rules satisfied with at least 4 shared matching attributes. This fact is also reflected by the low precision of don’t know sample classification. Knowledge-driven comparison method produces the same results as shown in table 10.6. Slightly worst performances were the result of character- based relative completeness estimation method, which apparently affected the recall of positive matches without improving the precision only in few cases as expected. The experiment on the NYT dataset using top-down rules generated with combining functional properties defined in the identification ontology produced quite interestingly 202 CHAPTER 10. EXPERIMENTAL EVALUATION

EQ. Lev. Euc. Jac. Jar. Mgk. Ovl. QGr. SMW. NWu. Tag. Precision 0.929 0.927 0.938 0.923 0.939 0.821 0.938 0.941 0.927 0.938 0.927 Match Recall 0.298 0.29 0.229 0.275 0.237 0.42 0.344 0.244 0.29 0.229 0.29 F-Measure 0.451 0.442 0.368 0.424 0.378 0.556 0.503 0.388 0.442 0.368 0.442 Precision 0.895 0.937 0.931 0.936 0.93 0.958 0.954 0.932 0.95 0.93 0.939 NonMatch Recall 0.729 0.977 0.977 0.977 0.977 0.896 0.836 0.977 0.818 0.977 0.977 F-Measure 0.804 0.957 0.954 0.956 0.953 0.926 0.891 0.954 0.879 0.953 0.957 Precision 0.031 0.242 0.25 0.26 0.255 0.189 0.14 0.25 0.144 0.253 0.279 DontKnow Recall 0.182 0.273 0.273 0.295 0.273 0.58 0.625 0.273 0.705 0.273 0.33 F-Measure 0.053 0.257 0.261 0.277 0.264 0.285 0.228 0.261 0.239 0.262 0.302 Precision 0.859 0.906 0.902 0.906 0.902 0.916 0.917 0.903 0.914 0.901 0.909 Weighted Avg Recall 0.678 0.903 0.899 0.903 0.899 0.852 0.795 0.9 0.779 0.899 0.905 F-Measure 0.748 0.893 0.886 0.893 0.886 0.874 0.837 0.888 0.823 0.886 0.896 ρ-Accuracy 0.771 0.919 0.914 0.92 0.915 0.894 0.867 0.915 0.857 0.914 0.921 Table 10.7: Character-based RC weighted Comparison on Manually Annotated Dataset for Person

EQ. Lev. Euc. Jac. Jar. Mgk. Ovl. QGr. SMW. NWu. Tag. Precision 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 Greedy Recall 0.512 0.524 0.514 0.514 0.591 0.731 0.609 0.513 0.552 0.527 0.779 F-Measure 0.677 0.687 0.679 0.679 0.743 0.844 0.757 0.678 0.711 0.691 0.876 ρ-Accuracy 0.514 0.536 0.526 0.526 0.604 0.761 0.628 0.525 0.572 0.539 0.802 Precision 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 Knowledge Recall 0.512 0.524 0.514 0.514 0.591 0.735 0.609 0.513 0.552 0.527 0.779 F-Measure 0.677 0.687 0.679 0.679 0.743 0.848 0.757 0.678 0.711 0.691 0.876 ρ-Accuracy 0.514 0.536 0.526 0.526 0.604 0.765 0.628 0.525 0.572 0.539 0.802 Precision 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 Match Recall 0.512 0.472 0.416 0.478 0.507 0.683 0.581 0.421 0.441 0.39 0.695 F-Measure 0.677 0.642 0.587 0.646 0.673 0.812 0.735 0.592 0.612 0.561 0.82 ρ-Accuracy 0.514 0.483 0.425 0.488 0.517 0.712 0.599 0.43 0.457 0.4 0.713 Table 10.8: Matching Experiment on NYT Dataset for Person

a good number of positive matching decisions (table 10.8). The fact that matching with string similarity also produced relatively good results is a sign that the transformation functions applied to normalize attribute names and datasets positively affected matching performance. However, the best similarity metrics for matching person entities of the NYT dataset are Taglink and Monge-Elkan, followed by Overlap and Jaro-Winkler; the others performed close to Equal. It is important to highlight that the ρ-accuracy is lower than the standard F-measure. This shows that several positive matching samples were classified as non matching rather than don't know. However, the difference between F-measure and ρ-accuracy is lower for Taglink, which thus performs more reliably. The comparison methods produced similar results, with a slight improvement of performance using Monge-Elkan and the Knowledge-driven comparison process. In table 10.9, we present the results of the matching experiments with top-down defined rules on the OAEI 2010 dataset. As shown in the table, all similarity metrics performed very well. There is no difference among the comparison methods, because the data do not present variations, and the matched attributes present different types of syntactical perturbations. Furthermore, all samples present the same 13 semantically equivalent attributes, granting the satisfaction of the rules. The similarity

EQ. Lev. Euc. Jac. Jar. Mgk. Ovl. QGr. SMW. NWu. Tag. Precision 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 Greedy Recall 0.96 0.968 0.961 0.961 0.991 0.966 0.961 0.96 0.963 0.977 0.968 F-Measure 0.98 0.984 0.98 0.98 0.996 0.982 0.98 0.98 0.981 0.988 0.984 ρ-Accuracy 0.96 0.968 0.961 0.961 0.991 0.966 0.961 0.96 0.963 0.977 0.968 Precision 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 Knowledge Recall 0.96 0.968 0.961 0.961 0.991 0.966 0.961 0.96 0.963 0.977 0.968 F-Measure 0.98 0.984 0.98 0.98 0.996 0.982 0.98 0.98 0.981 0.988 0.984 ρ-Accuracy 0.96 0.968 0.961 0.961 0.991 0.966 0.961 0.96 0.963 0.977 0.968 Precision 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 Match Recall 0.96 0.968 0.961 0.961 0.991 0.966 0.961 0.96 0.963 0.977 0.968 F-Measure 0.98 0.984 0.98 0.98 0.996 0.982 0.98 0.98 0.981 0.988 0.984 ρ-Accuracy 0.96 0.968 0.961 0.961 0.991 0.966 0.961 0.96 0.963 0.977 0.968 Table 10.9: Greedy Comparison OAEI dataset for Person

EQ. Lev. Euc. Jac. Jar. Mgk. Ovl. QGr. SMW. NWu. Tag. Precision 1.0 1.0 1.0 1.0 1.0 0.333 1.0 1.0 1.0 1.0 1.0 Match Recall 0.056 0.056 0.056 0.056 0.056 0.111 0.056 0.056 0.056 0.056 0.056 F-Measure 0.105 0.105 0.105 0.105 0.105 0.167 0.105 0.105 0.105 0.105 0.105 Precision 0.929 0.944 0.944 0.944 0.944 0.946 0.944 0.944 0.946 0.944 0.944 NonMatch Recall 0.033 1.0 1.0 1.0 1.0 0.987 1.0 1.0 0.997 1.0 1.0 F-Measure 0.065 0.971 0.971 0.971 0.971 0.966 0.971 0.971 0.971 0.971 0.971 Precision 0.015 NaN NaN NaN NaN 0.0 NaN NaN 0.0 NaN NaN DontKnow Recall 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 F-Measure 0.03 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN Precision 0.918 NaN NaN NaN NaN 0.905 NaN NaN 0.935 NaN NaN Weighted Avg Recall 0.048 0.944 0.944 0.944 0.944 0.935 0.944 0.944 0.942 0.944 0.944 F-Measure 0.066 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ρ-Accuracy 0.092 0.944 0.944 0.944 0.944 0.927 0.944 0.944 0.944 0.944 0.944 Table 10.10: Greedy Matching Comparison on Manually Annotated Dataset for Location metric the performed better in terms of recall is Jaro-Winkler.

Top-down Rules for Location

In tables 10.10 to 10.13 we present the matching performances of the top-down matching rules defined. The weighted average performances on the manually annotated dataset are quite good for locations. The large number of non matching cases were identified. However, the poor recall on positively labeled samples is a sign that the rules defined are probably too restrictive to support matching decisions. The Equal similarity metric produced poor results, reflecting the syntactical heterogeneity affecting the representation of attribute values. The overall ρ-accuracy is aligned with the weighted average score. This is probably due to the fact that the number of negative samples is far larger than the number of positive matching samples. However, all the similarity metrics produced similar results with the greedy comparison approach. The evaluation of the same dataset with the Knowledge-driven and Character-based Relative Completeness comparison methods produced pretty much the same results. It is important to notice how all the similarity metrics anyway matched with a high level of precision.

EQ. Lev. Euc. Jac. Jar. Mgk. Ovl. QGr. SMW. NWu. Tag. Precision 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 Match Recall 0.056 0.056 0.056 0.056 0.056 0.111 0.056 0.056 0.056 0.056 0.056 F-Measure 0.105 0.105 0.105 0.105 0.105 0.2 0.105 0.105 0.105 0.105 0.105 Precision 0.929 0.944 0.944 0.944 0.944 0.946 0.944 0.944 0.946 0.944 0.944 NonMatch Recall 0.033 1.0 1.0 1.0 1.0 0.997 1.0 1.0 0.997 1.0 1.0 F-Measure 0.065 0.971 0.971 0.971 0.971 0.971 0.971 0.971 0.971 0.971 0.971 Precision 0.015 NaN NaN NaN NaN 0.0 NaN NaN 0.0 NaN NaN DontKnow Recall 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 F-Measure 0.03 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN Precision 0.918 NaN NaN NaN NaN 0.935 NaN NaN 0.935 NaN NaN Weighted Avg Recall 0.048 0.944 0.944 0.944 0.944 0.944 0.944 0.944 0.942 0.944 0.944 F-Measure 0.066 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ρ-Accuracy 0.092 0.944 0.944 0.944 0.944 0.946 0.944 0.944 0.944 0.944 0.944 Table 10.11: Knowledge-driven Comparison Method on Manually Annotated Dataset for Location

EQ. Lev. Euc. Jac. Jar. Mgk. Ovl. QGr. SMW. NWu. Tag. Precision 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 Match Recall 0.056 0.056 0.056 0.056 0.056 0.111 0.056 0.056 0.056 0.056 0.056 F-Measure 0.105 0.105 0.105 0.105 0.105 0.2 0.105 0.105 0.105 0.105 0.105 Precision 0.929 0.944 0.944 0.944 0.944 0.946 0.944 0.944 0.946 0.944 0.944 NonMatch Recall 0.033 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 F-Measure 0.065 0.971 0.971 0.971 0.971 0.972 0.971 0.971 0.972 0.971 0.971 Precision 0.015 NaN NaN NaN NaN NaN NaN NaN 0.0 NaN NaN DontKnow Recall 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 F-Measure 0.03 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN Precision 0.918 NaN NaN NaN NaN NaN NaN NaN 0.935 NaN NaN Weighted Avg Recall 0.048 0.944 0.944 0.944 0.944 0.947 0.944 0.944 0.944 0.944 0.944 F-Measure 0.066 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ρ-Accuracy 0.092 0.944 0.944 0.944 0.944 0.947 0.944 0.944 0.946 0.944 0.944 Table 10.12: Character-Based RC Estimator Comparison on Manually Annotated Dataset for Location

Matching performances on the NYT dataset for location are quite poor using this set of top-down matching rules. Positive matching decisions were taken with high precision, but unfortunately very seldom. The best similarity metric, although in a context of very low recall, is Monge-Elkan. A possible interpretation of this very low recall refers to the syntactical heterogeneity affecting the representation of attributes such as latitude and longitude. However, another interpretation is that the negative matching rules were too relaxed, causing a large number of false negative matches. In fact, the ρ-accuracy is also lower than the F-measure, highlighting poor matching performances.

Top-down Rules for Organization

In tables 10.14, 10.16 and 10.17 we present the matching performances of the top-down matching rules defined. The weighted average performances on the manually annotated dataset are quite good. This is due mostly to the large number of negative matching samples, which are generally classified correctly. The Equal similarity metric produced poor results, a sign of the syntactic heterogeneity affecting the representation of attribute values. The positive matching performances are quite poor. The number of positive matching samples is quite low, and apparently hard

EQ. Lev. Euc. Jac. Jar. Mgk. Ovl. QGr. SMW. NWu. Tag. Precision 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 Greedy Recall 0.018 0.03 0.03 0.03 0.03 0.222 0.03 0.03 0.04 0.03 0.049 F-Measure 0.035 0.057 0.057 0.057 0.057 0.364 0.057 0.058 0.077 0.057 0.094 ρ-Accuracy 0.019 0.032 0.032 0.032 0.032 0.232 0.032 0.032 0.043 0.032 0.051 Precision 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 Knowledge Recall 0.018 0.018 0.018 0.018 0.018 0.291 0.018 0.018 0.024 0.018 0.029 F-Measure 0.035 0.035 0.035 0.035 0.035 0.451 0.035 0.036 0.047 0.035 0.057 ρ-Accuracy 0.029 0.019 0.019 0.019 0.019 0.307 0.019 0.019 0.026 0.019 0.031 Precision 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 Char RC Recall 0.03 0.021 0.03 0.03 0.012 0.117 0.03 0.023 0.02 0.017 0.03 F-Measure 0.057 0.041 0.057 0.057 0.023 0.21 0.057 0.045 0.04 0.033 0.057 ρ-Accuracy 0.048 0.022 0.032 0.032 0.013 0.123 0.032 0.024 0.022 0.018 0.032 Table 10.13: Matching Experiment on NYT Dataset for Location

EQ. Lev. Euc. Jac. Jar. Mgk. Ovl. QGr. SMW. NWu. Tag. Precision 0.1 0.1 0.1 0.1 0.071 0.121 0.182 0.1 0.133 0.091 0.091 Match Recall 0.029 0.029 0.029 0.029 0.029 0.118 0.059 0.029 0.059 0.029 0.029 F-Measure 0.045 0.045 0.045 0.045 0.042 0.119 0.089 0.045 0.082 0.044 0.044 Precision 0.822 0.978 0.978 0.978 0.978 0.979 0.982 0.978 0.985 0.978 0.978 NonMatch Recall 0.121 0.975 0.975 0.975 0.973 0.916 0.913 0.975 0.905 0.975 0.975 F-Measure 0.211 0.976 0.976 0.976 0.976 0.947 0.946 0.976 0.943 0.976 0.976 Precision 0.011 0.135 0.135 0.135 0.135 0.062 0.071 0.135 0.065 0.135 0.135 DontKnow Recall 0.52 0.28 0.28 0.28 0.28 0.28 0.4 0.28 0.4 0.28 0.28 F-Measure 0.021 0.182 0.182 0.182 0.182 0.102 0.12 0.182 0.112 0.182 0.182 Precision 0.79 0.941 0.941 0.941 0.941 0.942 0.946 0.941 0.947 0.941 0.941 Weighted Avg Recall 0.126 0.94 0.94 0.94 0.938 0.886 0.884 0.94 0.875 0.939 0.939 F-Measure 0.203 0.94 0.94 0.94 0.939 0.912 0.911 0.94 0.907 0.939 0.939 ρ-Accuracy 0.216 0.95 0.95 0.95 0.945 0.9 0.922 0.95 0.913 0.949 0.949 Table 10.14: Greedy Matching Comparison on Manually Annotated Dataset for Organization

to guess with the defined rules. The very low precision in the DontKnow classification seems to suggest that many positive matching examples were classified as unknown, which might imply that the match similarity threshold of 0.9 is too restrictive for most similarity metrics. However, the similarity metrics performing worst are Monge-Elkan, Overlap and Jaro-Winkler, even though the overall performances are still pretty decent. The evaluation of the same dataset with the Knowledge-driven comparison method produced the same results. Different results were produced using the Character-based Relative Completeness (RC) comparison method, presented in table 10.15. This comparison method produces slightly better results for all the similarity metrics, allowing a better classification of negative matching samples. Also Taglink and Needleman-Wunsch emerged as effective similarity metrics. In fact, relative completeness reduces the problems related to a high number of incomplete variations of attributes that could negatively affect greedy matching comparison. More problematic appears the evaluation of the experiment on the New York Times dataset (table 10.16), where only two similarity metrics produce scores sufficient to support positive matching decisions on the analyzed descriptions. In fact, only Monge-Elkan and Smith-Waterman could produce matching measures above the considered thresholds for all three comparison methods. Consider that the NYT

EQ. Lev. Euc. Jac. Jar. Mgk. Ovl. QGr. SMW. NWu. Tag. Precision 0.1 0.1 0.1 0.1 0.071 0.167 0.182 0.1 0.133 0.091 0.091 Match Recall 0.029 0.029 0.029 0.029 0.029 0.118 0.059 0.029 0.059 0.029 0.029 F-Measure 0.045 0.045 0.045 0.045 0.042 0.138 0.089 0.045 0.082 0.044 0.044 Precision 0.822 0.978 0.977 0.978 0.978 0.979 0.982 0.978 0.984 0.977 0.977 NonMatch Recall 0.121 0.975 0.981 0.975 0.979 0.929 0.913 0.975 0.911 0.981 0.981 F-Measure 0.211 0.976 0.979 0.976 0.978 0.953 0.946 0.976 0.946 0.979 0.979 Precision 0.011 0.135 0.14 0.135 0.14 0.058 0.071 0.135 0.062 0.14 0.14 DontKnow Recall 0.52 0.28 0.24 0.28 0.24 0.24 0.4 0.28 0.36 0.24 0.24 F-Measure 0.021 0.182 0.176 0.182 0.176 0.094 0.12 0.182 0.107 0.176 0.176 Precision 0.79 0.941 0.941 0.941 0.941 0.943 0.946 0.941 0.947 0.94 0.94 Weighted Avg Recall 0.126 0.94 0.945 0.94 0.943 0.897 0.884 0.94 0.88 0.944 0.944 F-Measure 0.203 0.94 0.942 0.94 0.941 0.918 0.911 0.94 0.91 0.942 0.942 ρ-Accuracy 0.216 0.95 0.953 0.95 0.948 0.917 0.922 0.95 0.917 0.951 0.951 Table 10.15: Character-Based RC Estimator Comparison on Manually Annotated Dataset for Organization

EQ. Lev. Euc. Jac. Jar. Mgk. Ovl. QGr. SMW. NWu. Tag. Precision NaN NaN NaN NaN NaN 1.0 NaN NaN 1.0 NaN NaN Greedy Recall 0.0 0.0 0.0 0.0 0.0 0.239 0.0 0.0 0.016 0.0 0.0 F-Measure NaN NaN NaN NaN NaN 0.386 NaN NaN 0.031 NaN NaN ρ-Accuracy 0.0 0.006 0.006 0.006 0.006 0.301 0.007 0.006 0.03 0.006 0.006 Precision NaN NaN NaN NaN NaN 1.0 NaN NaN 1.0 NaN NaN Knowledge Recall 0.0 0.0 0.0 0.0 0.0 0.355 0.0 0.0 0.022 0.0 0.0 F-Measure NaN NaN NaN NaN NaN 0.524 NaN NaN 0.043 NaN NaN ρ-Accuracy 0.0 0.006 0.006 0.006 0.006 0.429 0.007 0.006 0.038 0.006 0.006 Precision NaN NaN NaN NaN NaN 1.0 NaN NaN 1.0 NaN NaN RC Char Recall 0.0 0.0 0.0 0.0 0.0 0.134 0.0 0.0 0.01 0.0 0.0 F-Measure NaN NaN NaN NaN NaN 0.237 NaN NaN 0.019 NaN NaN ρ-Accuracy 0.0 0.006 0.006 0.006 0.006 0.176 0.007 0.006 0.021 0.006 0.006 Table 10.16: Matching Experiment on NYT Dataset for Organization dataset contains only positive matching samples, and thus we cannot evaluate negative match classification and thus we present all the comparison methods in the same table. Knowledge-Driven method applies greedy approach on functional diachronic attributes, and average on others. However, it may still be interesting to discern the similarity method that produced less false negative match decision, choosing rather to classify as DontKnow. Also considering this score, Monge-Elkan is the one that produced a higher number of true positives and reduced the number of false negative match.

EQ. Lev. Euc. Jac. Jar. Mgk. Ovl. QGr. SMW. NWu. Tag. Precision NaN NaN NaN NaN NaN 1.0 NaN NaN NaN NaN 1.0 Greedy Recall 0.0 0.0 0.0 0.0 0.0 0.034 0.0 0.0 0.0 0.0 0.854 F-Measure NaN NaN NaN NaN NaN 0.065 NaN NaN NaN NaN 0.921 ρ-Accuracy 0.0 0.0 0.0 0.0 0.0 0.053 0.0 0.0 0.0 0.0 0.863 Precision NaN NaN NaN NaN NaN 1.0 NaN NaN NaN NaN 1.0 Knowledge Recall 0.0 0.0 0.0 0.0 0.0 0.034 0.0 0.0 0.0 0.0 0.854 F-Measure NaN NaN NaN NaN NaN 0.065 NaN NaN NaN NaN 0.921 ρ-Accuracy 0.0 0.0 0.0 0.0 0.0 0.053 0.0 0.0 0.0 0.0 0.863 Precision NaN NaN NaN NaN NaN 1.0 NaN NaN NaN NaN 1.0 Char RC Recall 0.0 0.0 0.0 0.0 0.0 0.034 0.0 0.0 0.0 0.0 0.854 F-Measure NaN NaN NaN NaN NaN 0.065 NaN NaN NaN NaN 0.921 ρ-Accuracy 0.0 0.0 0.0 0.0 0.0 0.053 0.0 0.0 0.0 0.0 0.863 Table 10.17: Matching Comparison on OAEI 2010 Dataset for Organization

Also the evaluation of the OAEI 2010 dataset about restaurants (table 10.17) produced disappointing results. However, the Taglink similarity metric produced outstanding results, showing that it copes well with the syntactic variations of the descriptions. Also considering the score defined in equation (10.1), the best comparison method and similarity metric are the Greedy comparison method and Taglink.

10.2.2 Bottom-up Only Rules Experiments

In this section we evaluate the matching method proposed in this thesis relying only on bottom-up learned rules. Hence, this section presents a list of experiments related to the evaluation of the factors that can affect the learning process and the combination of the extracted rules. Essentially, the learning process is the following:

1. we gathered samples of data from heterogeneous sources;

2. we defined contextual mappings to ease the problem of semantic harmonization;

3. we defined a blocking scheme to select description pairs to be labeled;

4. we labeled description pairs as match, non-match and don't know;

5. we used the labeled set to extract matching rules using decision tree classifiers (a minimal sketch of this step is shown right after this list);

6. we implemented a set of merging and normalization processes for extracted rules;
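As an illustration of step 5, the following is a minimal sketch of how conjunctive matching rules can be read off a trained decision tree, assuming scikit-learn is used for the classifier; the feature names and the tree configuration in the usage example are placeholders, and the actual rule learning pipeline is the one described in section 8.2.

    # Minimal sketch (assumes scikit-learn): each root-to-leaf path of a trained
    # decision tree becomes a candidate rule, i.e. a conjunction of atoms of the
    # form "similarity(feature) <= threshold" or "> threshold", labeled with the
    # majority class of the leaf (match / non-match / don't know).
    from sklearn.tree import DecisionTreeClassifier

    def extract_rules(clf: DecisionTreeClassifier, feature_names, class_names):
        tree = clf.tree_
        rules = []

        def walk(node, conditions):
            if tree.children_left[node] == -1:            # leaf node
                label = class_names[tree.value[node].argmax()]
                rules.append((label, list(conditions)))
                return
            name = feature_names[tree.feature[node]]
            thr = tree.threshold[node]
            walk(tree.children_left[node], conditions + [f"{name} <= {thr:.2f}"])
            walk(tree.children_right[node], conditions + [f"{name} > {thr:.2f}"])

        walk(0, [])
        return rules

    # Example usage (X holds per-pair similarity scores, y the labels):
    # clf = DecisionTreeClassifier(max_depth=4).fit(X, y)
    # for label, atoms in extract_rules(clf, ["name", "birth_date"], clf.classes_):
    #     print(label, " AND ".join(atoms))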

The learning process is described in detail in section 8.2. The experiments then have to consider several dimensions. Clearly, the matching comparison methods affect the quality of the learned rules, as a greedy approach may produce higher scores in the matching of syntactically heterogeneous attributes even when they do not match, and thus affect the evaluation of the relevance of those attributes during the learning process. Clearly, the string similarity metrics also affect the learning process: the more a metric is capable of representing a clear distinction between matching and non-matching attributes, the more precise the thresholds embedded in the rules will be. Other aspects come from the training set partition and filtering. In fact, we want to experimentally evaluate the impact of the 'don't know' labeled samples in the learning process, as two-class and three-class classification may produce different results. Furthermore, we have to evaluate the normalization processes. In particular, we focus on the inconsistency normalization, testing how removing inconsistent atoms and normalizing inconsistent atoms affect the decision process. Finally, we also evaluate the impact of the threshold normalization, choosing the most conservative and relaxed thresholds for both positive and negative matching rules. Therefore, in the following, we are going to propose a quite extensive set of experiment results. However, for the mere sake of keeping the presentation compact, we avoid presenting tables that do not present variations in the final results. These less informative tables will anyway be made available on the author's website, together with the raw experiment results and datasets.

Bottom-up Rules for Person

Evaluating the learned rules on the manually annotated dataset, considering the binary classification method and the greedy comparison approach, produced the best accuracy with the inconsistency removal normalization process, independently from the conservative or relaxed threshold selection, relying on the Monge-Elkan similarity metric (table 10.18). Notice that rules learned relying on the Needleman-Wunsch similarity metric also allowed matching with high accuracy. Notice how the ρ-accuracy scores are higher than the weighted F-measure, meaning that in general, rather than taking inaccurate matching decisions, the rules supported a more conservative don't know classification. However, Monge-Elkan was not the best in terms of precision when it came to classifying positive matching samples. In this regard, Taglink is the similarity metric that supported the highest precision in the considered settings. The substance of the evaluation does not change if we consider multiclass classification (table 10.19), as the best matching accuracy is achieved relying on the same configuration.

EQ. Lev. Euc. Jac. Jar. Mgk. Ovl. QGr. SMW. NWu. Tag. Precision 0.927 0.933 1.0 1.0 0.925 0.944 0.917 0.898 1.0 0.944 1.0 Match Recall 0.389 0.321 0.115 0.176 0.374 0.13 0.084 0.336 0.153 0.389 0.153 F-Measure 0.548 0.477 0.205 0.299 0.533 0.228 0.154 0.489 0.265 0.551 0.265 Precision 0.165 0.986 0.995 0.993 0.975 0.941 0.956 0.983 0.951 0.973 0.999 NonMatch Recall 0.008 0.498 0.684 0.66 0.786 0.923 0.846 0.418 0.824 0.856 0.475 F-Measure 0.016 0.661 0.811 0.793 0.87 0.932 0.898 0.586 0.883 0.911 0.644 Precision 0.032 0.075 0.113 0.109 0.139 0.204 0.137 0.066 0.165 0.167 0.077 DontKnow Recall 0.682 0.909 0.989 1.0 0.818 0.545 0.648 0.909 0.83 0.727 1.0 F-Measure 0.061 0.138 0.202 0.197 0.238 0.297 0.226 0.123 0.275 0.272 0.142 Precision 0.208 0.943 0.957 0.956 0.935 0.909 0.918 0.938 0.92 0.936 0.959 Weighted Avg Recall 0.062 0.504 0.66 0.643 0.761 0.856 0.788 0.434 0.781 0.821 0.477 F-Measure 0.052 0.627 0.745 0.735 0.821 0.859 0.821 0.56 0.817 0.86 0.598 ρ-Accuracy 0.111 0.667 0.794 0.781 0.852 0.897 0.865 0.6 0.859 0.89 0.646 Table 10.18: TCBI SCALL IPP NPIR TNCC Greedy for Person

If we consider the Knowledge-driven matching comparison method, the configuration that performed best is the one that applied inconsistency removal normalization, independently from the chosen thresholds, relying on the Needleman-Wunsch string similarity metric (table 10.20). If we consider multi-class classification, the performances decrease in terms of accuracy, but the precision of positive matches seems to improve, despite the general decrease in recall. In this case, the similarity metric performing best is Monge-Elkan (table 10.21), which anyway was among the best also considering binary classification.

EQ. Lev. Euc. Jac. Jar. Mgk. Ovl. QGr. SMW. NWu. Tag. Precision 0.927 0.933 1.0 0.93 0.907 0.9 1.0 0.896 1.0 0.949 0.91 Match Recall 0.389 0.321 0.183 0.305 0.298 0.206 0.046 0.328 0.206 0.282 0.466 F-Measure 0.548 0.477 0.31 0.46 0.448 0.335 0.088 0.48 0.342 0.435 0.616 Precision 0.165 0.985 0.981 0.992 0.989 0.949 0.967 0.993 0.95 0.991 0.986 NonMatch Recall 0.008 0.556 0.146 0.673 0.153 0.898 0.763 0.146 0.721 0.175 0.435 F-Measure 0.016 0.711 0.254 0.802 0.264 0.923 0.853 0.255 0.82 0.298 0.604 Precision 0.032 0.083 0.049 0.111 0.049 0.181 0.111 0.049 0.105 0.051 0.07 DontKnow Recall 0.682 0.909 0.977 0.955 0.955 0.591 0.75 0.955 0.75 0.977 0.932 F-Measure 0.061 0.152 0.094 0.198 0.093 0.277 0.193 0.093 0.184 0.098 0.131 Precision 0.208 0.943 0.942 0.95 0.943 0.913 0.932 0.945 0.917 0.947 0.942 Weighted Avg Recall 0.062 0.556 0.184 0.662 0.197 0.84 0.716 0.193 0.689 0.217 0.459 F-Measure 0.052 0.672 0.251 0.754 0.269 0.857 0.775 0.262 0.762 0.298 0.584 ρ-Accuracy 0.111 0.711 0.311 0.793 0.326 0.891 0.823 0.321 0.8 0.354 0.623 Table 10.19: TCMU SCALL NPIR TNCC Greedy Manually Annotated Dataset for Person

EQ. Lev. Euc. Jac. Jar. Mgk. Ovl. QGr. SMW. NWu. Tag. Precision 0.927 0.933 1.0 1.0 0.925 1.0 0.941 0.912 1.0 0.944 0.941 Match Recall 0.389 0.321 0.115 0.176 0.374 0.008 0.122 0.237 0.16 0.389 0.366 F-Measure 0.548 0.477 0.205 0.299 0.533 0.015 0.216 0.376 0.276 0.551 0.527 Precision 0.165 0.986 0.995 0.994 0.975 0.945 0.957 0.982 0.952 0.959 0.995 NonMatch Recall 0.008 0.498 0.684 0.657 0.786 0.913 0.831 0.426 0.824 0.913 0.511 F-Measure 0.016 0.661 0.811 0.791 0.87 0.929 0.889 0.595 0.883 0.936 0.675 Precision 0.032 0.075 0.113 0.109 0.139 0.19 0.134 0.067 0.165 0.155 0.08 DontKnow Recall 0.682 0.909 0.989 1.0 0.818 0.602 0.67 0.92 0.83 0.443 0.955 F-Measure 0.061 0.138 0.202 0.196 0.237 0.289 0.223 0.125 0.275 0.229 0.147 Precision 0.208 0.943 0.957 0.956 0.936 0.916 0.92 0.938 0.921 0.924 0.952 Weighted Avg Recall 0.062 0.504 0.66 0.641 0.761 0.842 0.778 0.436 0.781 0.859 0.521 F-Measure 0.052 0.627 0.745 0.734 0.821 0.843 0.817 0.56 0.818 0.881 0.643 ρ-Accuracy 0.111 0.667 0.794 0.78 0.852 0.891 0.86 0.603 0.86 0.907 0.683 Table 10.20: TCBI SCALL IPP NPIR TNCC Manually Annotated Dataset for Person

If we consider character-based relative completeness to weight greedily compared rules obtained relying on binary classification, the configuration that performed best in terms of ρ-accuracy is the one that applied inconsistency removal for inconsistency normalization and defined relaxed thresholds for positive matching rules and conservative thresholds for negative matching rules, relying on the Overlap similarity metric (table 10.22). The overall performance on positive matching seems to be improved with respect to the other configurations analyzed so far, but at the same time the non matching performance decreased. If we consider multiclass classification, the configuration that performed best in terms of ρ-accuracy is the same as for binary classification, but considering the Jaccard string similarity metric (table 10.23). Performing experiments with the New York Times dataset for person considering the binary classification method, the configuration that performed best is the one that applied inconsistency removal normalization (NPIR) and relaxed thresholds for positive matching rules with conservative thresholds for negative matching rules, using the Monge-Elkan string similarity metric and relying on a comparison method that weighted the greedy attribute comparison score with Character-based Relative Completeness

EQ. Lev. Euc. Jac. Jar. Mgk. Ovl. QGr. SMW. NWu. Tag. Precision 0.927 0.933 1.0 0.93 0.907 1.0 NaN 0.896 1.0 0.944 1.0 Match Recall 0.389 0.321 0.183 0.305 0.298 0.008 0.0 0.328 0.206 0.26 0.183 F-Measure 0.548 0.477 0.31 0.46 0.448 0.015 NaN 0.48 0.342 0.407 0.31 Precision 0.165 0.985 0.988 0.992 0.989 0.964 0.957 0.993 0.949 0.989 0.999 NonMatch Recall 0.008 0.556 0.139 0.673 0.153 0.866 0.783 0.148 0.725 0.192 0.445 F-Measure 0.016 0.711 0.244 0.802 0.264 0.912 0.862 0.257 0.822 0.322 0.615 Precision 0.032 0.083 0.05 0.111 0.049 0.144 0.115 0.049 0.116 0.052 0.073 DontKnow Recall 0.682 0.909 1.0 0.955 0.955 0.659 0.716 0.955 0.818 0.966 1.0 F-Measure 0.061 0.152 0.096 0.198 0.093 0.236 0.198 0.093 0.203 0.098 0.136 Precision 0.208 0.943 0.948 0.95 0.943 0.931 NaN 0.946 0.916 0.945 0.959 Weighted Avg Recall 0.062 0.556 0.179 0.662 0.197 0.801 0.73 0.194 0.695 0.23 0.452 F-Measure 0.052 0.672 0.242 0.754 0.269 0.825 NaN 0.264 0.764 0.318 0.575 ρ-Accuracy 0.111 0.711 0.304 0.793 0.326 0.875 0.83 0.323 0.803 0.373 0.623 Table 10.21: TCMU SCALL IPP NPIR TNCC Manually Annotated Dataset for Person

EQ. Lev. Euc. Jac. Jar. Mgk. Ovl. QGr. SMW. NWu. Tag. Precision 0.941 0.926 0.956 1.0 0.931 0.848 0.983 0.918 0.975 0.718 1.0 Match Recall 0.366 0.382 0.328 0.176 0.206 0.511 0.45 0.427 0.298 0.603 0.275 F-Measure 0.527 0.541 0.489 0.299 0.338 0.638 0.618 0.583 0.456 0.656 0.431 Precision 0.174 0.977 0.983 0.992 0.992 0.992 0.987 0.985 0.964 0.972 0.991 NonMatch Recall 0.008 0.586 0.784 0.681 0.218 0.704 0.779 0.671 0.679 0.769 0.767 F-Measure 0.016 0.732 0.872 0.808 0.357 0.823 0.871 0.798 0.797 0.858 0.865 Precision 0.032 0.084 0.14 0.115 0.052 0.118 0.14 0.104 0.088 0.115 0.133 DontKnow Recall 0.682 0.852 0.864 1.0 0.955 0.898 0.864 0.875 0.716 0.636 0.898 F-Measure 0.06 0.153 0.242 0.206 0.099 0.209 0.242 0.186 0.157 0.194 0.232 Precision 0.218 0.935 0.945 0.955 0.948 0.945 0.95 0.943 0.927 0.919 0.955 Weighted Avg Recall 0.06 0.584 0.758 0.662 0.249 0.7 0.762 0.664 0.656 0.752 0.741 F-Measure 0.051 0.695 0.82 0.749 0.345 0.785 0.827 0.758 0.747 0.817 0.81 ρ-Accuracy 0.109 0.729 0.855 0.794 0.396 0.812 0.86 0.791 0.781 0.828 0.848 Table 10.22: TCBI SCALL NPIR TNRC Char-RC Manually Annotated Dataset for Person

(Char-RC) (table 10.24). Notice that the ρ-accuracy is lower than the F-measure, meaning that some of the samples were classified as negative matches. Considering multiclass classification including "don't know" labeled samples, the best result was obtained with the same configuration as for binary classification, but using the Smith-Waterman string similarity metric. The overall accuracy is a little lower, but the difference between the F-measure and the accuracy is lower than in the case of binary classification (table 10.25). This means that multiclass classification supported a more conservative classification of negative matching samples. The fact that Relative Completeness weighted similarity metrics supported the learning of more accurate matching rules indicates that syntactical variations on attribute values allow learning more relaxed thresholds, increasing the recall of positive matches. Performing experiments on the OAEI 2010 dataset for person, the configuration that allowed extracting the most effective rules in terms of ρ-accuracy considering the binary classification method is the one that applied inconsistency removal normalization and relaxed thresholds for positive matching rules with conservative thresholds for negative matching rules, relying on the Needleman-Wunsch similarity metric (table 10.26).

EQ. Lev. Euc. Jac. Jar. Mgk. Ovl. QGr. SMW. NWu. Tag. Precision 0.925 0.559 0.697 0.96 0.978 0.917 0.952 0.957 0.938 0.814 1.0 Match Recall 0.282 0.145 0.176 0.183 0.344 0.42 0.458 0.336 0.458 0.366 0.237 F-Measure 0.433 0.23 0.28 0.308 0.508 0.576 0.619 0.497 0.615 0.505 0.383 Precision 0.155 0.991 0.958 0.981 0.988 0.962 0.992 0.988 0.995 0.976 0.999 NonMatch Recall 0.008 0.56 0.582 0.814 0.441 0.721 0.706 0.349 0.615 0.664 0.629 F-Measure 0.016 0.715 0.724 0.889 0.61 0.824 0.825 0.515 0.76 0.79 0.772 Precision 0.032 0.08 0.066 0.151 0.074 0.106 0.122 0.062 0.103 0.104 0.102 DontKnow Recall 0.682 0.886 0.67 0.864 0.989 0.739 0.943 0.955 0.989 0.875 1.0 F-Measure 0.06 0.147 0.12 0.257 0.138 0.185 0.216 0.117 0.186 0.186 0.186 Precision 0.199 0.924 0.903 0.944 0.948 0.922 0.952 0.945 0.952 0.928 0.96 Weighted Avg Recall 0.055 0.547 0.559 0.775 0.459 0.703 0.7 0.374 0.621 0.654 0.62 F-Measure 0.044 0.66 0.669 0.825 0.583 0.781 0.785 0.497 0.726 0.746 0.722 ρ-Accuracy 0.099 0.696 0.7 0.867 0.627 0.81 0.82 0.542 0.762 0.777 0.765 Table 10.23: TCMU SCALL NPIR TNCR Char-RC Manually Annotated Dataset for Person

EQ. Lev. Euc. Jac. Jar. Mgk. Ovl. QGr. SMW. NWu. Tag.
Greedy Precision 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
Greedy Recall 0.355 0.757 0.002 0.614 0.204 0.424 0.032 0.649 0.555 0.414 0.597
Greedy F-Measure 0.524 0.861 0.004 0.761 0.339 0.595 0.063 0.787 0.714 0.585 0.748
Greedy ρ-Accuracy 0.523 0.854 0.004 0.751 0.239 0.595 0.062 0.752 0.712 0.521 0.745
Knowledge Precision 1.0 1.0 1.0 1.0 1.0 NaN 1.0 1.0 1.0 1.0 1.0
Knowledge Recall 0.355 0.757 0.003 0.688 0.791 0.0 0.031 0.327 0.609 0.322 0.195
Knowledge F-Measure 0.524 0.861 0.006 0.815 0.884 NaN 0.061 0.493 0.757 0.488 0.327
Knowledge ρ-Accuracy 0.523 0.854 0.006 0.814 0.882 0.0 0.06 0.492 0.756 0.406 0.324
Char-RC Precision 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
Char-RC Recall 0.73 0.642 0.638 0.396 0.395 0.874 0.767 0.775 0.755 0.769 0.68
Char-RC F-Measure 0.844 0.782 0.779 0.568 0.566 0.933 0.868 0.873 0.86 0.87 0.809
Char-RC ρ-Accuracy 0.809 0.765 0.734 0.528 0.559 0.906 0.83 0.828 0.822 0.796 0.792
Table 10.24: TCBI SCALL TNCC NYT Dataset for Person

Considering the multiclass classification method, the best configuration is the one that applied the greedy comparison method with inconsistency removal normalization and relaxed thresholds for matching, relying on the QGram similarity metric (table 10.27).

Bottom-up Rules for Location

In this section we present the experiments that obtained the best results on the location evaluation datasets. In particular, for each comparison method and classification method, we select the best inconsistency normalization and threshold normalization combination based on the ρ-accuracy described in equation (10.1) at the beginning of the chapter. The detailed results of the other experiments will be available online, together with the raw experiment results and the datasets. Considering a greedy comparison approach (Greedy), and rules extracted relying on binary classification (TCBI) considering all the sources at the same time (SCALL), the configuration that produced the best results on the manually annotated dataset in terms of ρ-accuracy is the one that applied inconsistency removal normalization (NPIR) and Conservative Match Relaxed Non Match threshold normalization (TCCR),

EQ. Lev. Euc. Jac. Jar. Mgk. Ovl. QGr. SMW. NWu. Tag. Precision 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 Greedy Recall 0.355 0.754 0.464 0.717 0.011 0.556 0.001 0.74 0.47 0.468 0.01 F-Measure 0.524 0.86 0.634 0.835 0.023 0.715 0.002 0.85 0.64 0.638 0.02 ρ-Accuracy 0.523 0.859 0.633 0.83 0.022 0.713 0.002 0.791 0.639 0.496 0.02 Precision 1.0 1.0 1.0 1.0 1.0 NaN 1.0 1.0 1.0 1.0 1.0 Knowledge Recall 0.355 0.754 0.474 0.759 0.009 0.0 0.028 0.745 0.317 0.039 0.004 F-Measure 0.524 0.86 0.643 0.863 0.018 NaN 0.054 0.854 0.481 0.074 0.008 ρ-Accuracy 0.523 0.836 0.643 0.861 0.018 0.0 0.054 0.797 0.481 0.075 0.007 Precision 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 Char-RC Recall 0.6 0.119 0.305 0.401 0.702 0.779 0.761 0.744 0.855 0.659 0.687 F-Measure 0.75 0.213 0.468 0.572 0.825 0.876 0.864 0.853 0.922 0.794 0.814 ρ-Accuracy 0.664 0.211 0.465 0.528 0.783 0.817 0.811 0.835 0.898 0.783 0.804 Table 10.25: TCMU SCALL IPP NPIC TNRC NYT Dataset for Person

EQ. Lev. Euc. Jac. Jar. Mgk. Ovl. QGr. SMW. NWu. Tag. Precision 1.0 1.0 NaN NaN 1.0 NaN NaN 1.0 NaN 1.0 NaN Greedy Recall 0.341 0.486 0.0 0.0 0.43 0.0 0.0 0.61 0.0 0.508 0.0 F-Measure 0.509 0.654 NaN NaN 0.601 NaN NaN 0.758 NaN 0.674 NaN ρ-Accuracy 0.38 0.636 0.0 0.0 0.499 0.0 0.0 0.743 0.0 0.576 0.0 Precision 1.0 1.0 NaN NaN 1.0 NaN NaN 1.0 NaN 1.0 NaN Knowledge Recall 0.321 0.478 0.0 0.0 0.62 0.0 0.0 0.001 0.0 0.01 0.0 F-Measure 0.486 0.647 NaN NaN 0.765 NaN NaN 0.002 NaN 0.02 NaN ρ-Accuracy 0.486 0.644 0.0 0.0 0.759 0.0 0.0 0.002 0.0 0.019 0.0 Precision 1.0 1.0 1.0 NaN 1.0 1.0 NaN 1.0 NaN 1.0 NaN Char-RC Recall 0.338 0.529 0.326 0.0 0.399 0.387 0.0 0.538 0.0 0.758 0.0 F-Measure 0.505 0.692 0.491 NaN 0.57 0.558 NaN 0.699 NaN 0.862 NaN ρ-Accuracy 0.461 0.658 0.364 0.0 0.568 0.521 0.0 0.668 0.0 0.795 0.0 Table 10.26: TCBI SCALL NPIR TNRC OAEI 2010 for Person using Jaro-Winkler similarity metric. The detailed results are presented in table 10.28. It is interesting to see how also Levenshtein and Taglink performed well. However, using Taglink some positive match was discovered, whereas Levenshtein was just more effective in increasing the recall of negative matching decisions. If we consider the same settings, but learn rules considering also the third “don’t know” class, the configuration that produced the best results in terms of ρ-accuracy is the one that applied inconsistency removal normalization (NPIR), applying relaxed thresholds both on atoms of positive matching and negative matching rules (TCRR), using Levenshtein similarity metric. The details of the experiments with this configu- ration are presented in table 10.29. In this case, also QGram performed well, whereas Taglink lost precision in the matching classification. The increasing complexity related to the don’t know cases did not support a general improvement in terms of overall ρ- accuracy, but reduced in the precision of the positive match decision for some metrics. Considering a Knowledge-driven comparison approach (Knowledge), and rules ex- tracted relying on binary classification (TCBI) considering all the sources at the same time (SCALL), the configuration that produced the best results in terms of ρ-accuracy is the one that applied inconsistency removal normalization (NPIR), independently from the threshold normalization function, using QGram similarity metric. The de- 10.2. EVALUATING FINGERPRINT MATCH 213

EQ. Lev. Euc. Jac. Jar. Mgk. Ovl. QGr. SMW. NWu. Tag. Precision 1.0 1.0 NaN 1.0 1.0 NaN NaN 1.0 NaN 1.0 NaN Match Recall 0.341 0.458 0.0 0.341 0.43 0.0 0.0 0.61 0.0 0.436 0.0 F-Measure 0.509 0.628 NaN 0.509 0.601 NaN NaN 0.758 NaN 0.607 NaN ρ-Accuracy 0.38 0.608 0.0 0.423 0.589 0.0 0.0 0.746 0.0 0.595 0.0 Precision 1.0 1.0 NaN 1.0 1.0 NaN NaN 1.0 NaN 1.0 NaN Match Recall 0.341 0.458 0.0 0.341 0.342 0.0 0.0 0.379 0.0 0.436 0.0 F-Measure 0.509 0.628 NaN 0.509 0.51 NaN NaN 0.55 NaN 0.607 NaN ρ-Accuracy 0.38 0.512 0.0 0.423 0.499 0.0 0.0 0.539 0.0 0.533 0.0 Precision 1.0 NaN NaN NaN 1.0 1.0 NaN 1.0 NaN 1.0 1.0 Match Recall 0.321 0.0 0.0 0.0 0.476 0.392 0.0 0.34 0.0 0.48 0.38 F-Measure 0.486 NaN NaN NaN 0.645 0.563 NaN 0.507 NaN 0.649 0.551 ρ-Accuracy 0.438 0.0 0.0 0.0 0.631 0.543 0.0 0.506 0.0 0.649 0.538 Table 10.27: TCMU SCALL IPP NPIR TNRC OAEI 2010 for Person

EQ. Lev. Euc. Jac. Jar. Mgk. Ovl. QGr. SMW. NWu. Tag. Precision NaN NaN 0.2 NaN 1.0 0.5 NaN 1.0 0.667 0.545 1.0 Match Recall 0.0 0.0 0.722 0.0 0.167 0.056 0.0 0.167 0.111 0.333 0.056 F-Measure NaN NaN 0.313 NaN 0.286 0.1 NaN 0.286 0.19 0.414 0.105 Precision NaN 0.984 0.977 NaN 0.977 0.984 0.944 0.987 0.979 0.96 0.992 NonMatch Recall 0.0 0.964 0.874 0.0 0.985 0.774 0.997 0.949 0.956 0.987 0.949 F-Measure NaN 0.974 0.923 NaN 0.981 0.866 0.97 0.967 0.967 0.973 0.97 Precision 0.015 0.094 NaN 0.015 0.056 0.038 0.0 0.028 0.033 0.0 0.075 DontKnow Recall 1.0 0.5 0.0 1.0 0.167 0.667 0.0 0.167 0.167 0.0 0.5 F-Measure 0.029 0.158 NaN 0.029 0.083 0.072 NaN 0.048 0.056 NaN 0.13 Precision NaN NaN NaN NaN 0.965 0.949 NaN 0.973 0.952 0.928 0.979 Weighted Avg Recall 0.015 0.915 0.855 0.015 0.937 0.741 0.939 0.903 0.908 0.944 0.903 F-Measure NaN NaN NaN NaN 0.937 0.821 NaN 0.924 0.92 NaN 0.92 ρ-Accuracy 0.03 0.948 0.759 0.03 0.957 0.843 0.941 0.943 0.94 0.935 0.945 Table 10.28: TCBI SCALL NPIR TNCR Greedy on Manually Annotated dataset for Location tailed results are presented in table 10.30. It is interesting to see how also Taglink and Euclidean similarity metrics performed well. However, also in this case using Taglink some positive match was discovered, whereas Euclidean was just more effec- tive in increasing the recall of negative matching decisions. It is important to notice that the QGram did not support very precise positive matching decisions, although the recall was the best one. Also Monge-Elkan similarity metric supported a high recall in the number of matched pairs, however, the precision was not as good as Taglink and QGram. If we consider the same settings, but learn rules considering also the third “don’t know” class (TCMU), the configuration that produced the best results in terms of ρ- accuracy is the one that applied inconsistency removal normalization (NPIR), applying relaxed thresholds both on atoms of positive matching and negative matching rules (TCRR), using Taglink similarity metric. Also in this case, considering multiclass classification, rules learned and applied using Taglink reduced precision performances, but effectively classified a large number of positive matching samples, and nearly all negative matching samples. Also rules learned and applied using Levenshtein similarity metric produced very good results, with a better precision in supporting matching decision, but with a lower recall. 214 CHAPTER 10. EXPERIMENTAL EVALUATION

EQ. Lev. Euc. Jac. Jar. Mgk. Ovl. QGr. SMW. NWu. Tag. Precision NaN 0.857 0.2 NaN 1.0 0.5 NaN 1.0 0.667 0.714 0.8 Match Recall 0.0 0.667 0.722 0.0 0.056 0.056 0.0 0.056 0.111 0.278 0.667 F-Measure NaN 0.75 0.313 NaN 0.105 0.1 NaN 0.105 0.19 0.4 0.727 Precision NaN 0.977 0.977 0.944 1.0 0.977 NaN 0.975 0.991 0.962 0.996 NonMatch Recall 0.0 0.977 0.874 0.997 0.003 0.964 0.0 0.985 0.586 0.974 0.707 F-Measure NaN 0.977 0.923 0.97 0.005 0.97 NaN 0.98 0.737 0.968 0.827 Precision 0.015 0.1 NaN 0.0 0.015 0.037 0.015 0.0 0.022 0.0 0.041 DontKnow Recall 1.0 0.167 0.0 0.0 1.0 0.167 1.0 0.0 0.667 0.0 0.833 F-Measure 0.029 0.125 NaN NaN 0.029 0.061 0.029 NaN 0.043 NaN 0.078 Precision NaN 0.959 NaN NaN 0.986 0.942 NaN 0.962 0.963 0.937 0.974 Weighted Avg Recall 0.015 0.952 0.855 0.939 0.019 0.913 0.015 0.93 0.567 0.93 0.707 F-Measure NaN 0.955 NaN NaN 0.01 0.919 NaN NaN 0.703 NaN 0.812 ρ-Accuracy 0.03 0.957 0.759 0.941 0.037 0.941 0.03 0.952 0.719 0.939 0.818 Table 10.29: TCMU SCALL NPIR TNRR Greedy on Manually Annotated dataset for Location

EQ. Lev. Euc. Jac. Jar. Mgk. Ovl. QGr. SMW. NWu. Tag. Precision NaN NaN NaN NaN 0.6 0.65 NaN 0.778 0.75 0.5 1.0 Match Recall 0.0 0.0 0.0 0.0 0.167 0.722 0.0 0.778 0.167 0.333 0.111 F-Measure NaN NaN NaN NaN 0.261 0.684 NaN 0.778 0.273 0.4 0.2 Precision NaN 0.997 0.944 NaN 0.977 0.972 NaN 0.975 0.993 0.958 0.982 NonMatch Recall 0.0 0.807 1.0 0.0 0.967 0.979 0.0 0.99 0.728 0.985 0.974 F-Measure NaN 0.892 0.971 NaN 0.972 0.976 NaN 0.982 0.84 0.971 0.978 Precision 0.015 0.051 0.0 0.015 0.087 0.0 0.015 NaN 0.032 0.0 0.0 DontKnow Recall 1.0 0.833 0.0 1.0 0.333 0.0 1.0 0.0 0.667 0.0 0.0 F-Measure 0.029 0.096 NaN 0.029 0.138 NaN 0.029 NaN 0.062 NaN NaN Precision NaN NaN NaN NaN 0.947 0.944 NaN NaN 0.968 0.924 0.968 Weighted Avg Recall 0.015 0.772 0.942 0.015 0.923 0.954 0.015 0.966 0.702 0.942 0.923 F-Measure NaN NaN NaN NaN 0.928 NaN NaN NaN 0.804 NaN NaN ρ-Accuracy 0.03 0.871 0.943 0.03 0.941 0.939 0.03 0.956 0.82 0.929 0.951 Table 10.30: TCBI SCALL NPIR TNCC Knowledge on Manually Annotated dataset for Location

Considering a Character-based Relative Completeness comparison approach (Char-RC), and rules extracted relying on binary classification (TCBI) considering all the sources at the same time (SCALL), the configuration that produced the best results in terms of ρ-accuracy is the one that applied inconsistency removal normalization (NPIR), independently from the threshold normalization function, using the Jaro-Winkler similarity metric. The detailed results are presented in table 10.32. It is interesting to see how also the QGram and Taglink similarity metrics performed well. It is important to notice that Jaro-Winkler did not support very precise positive matching decisions, although the recall was the best one. In this case, the best precision was achieved by the Taglink, Monge-Elkan and Levenshtein similarity metrics. However, Levenshtein and Monge-Elkan did not produce the best recall of negative matching pairs.
If we consider the same settings, but learn rules considering also the third "don't know" class (TCMU), the configuration that produced the best results in terms of ρ-accuracy is the one that applied inconsistency removal normalization (NPIR), applying conservative thresholds on atoms of positive matching rules and relaxed thresholds on negative matching rules (TNCR), using the QGram similarity metric. Also in this case, considering multiclass classification, rules learned and applied using Taglink reduced performances, whereas QGram allowed learning and applying rules with a perfect precision in matching, and also the third best (surprisingly, equal similarity produced a high recall).

EQ. Lev. Euc. Jac. Jar. Mgk. Ovl. QGr. SMW. NWu. Tag. Precision NaN 0.889 NaN NaN 0.6 1.0 NaN 1.0 0.75 0.667 0.812 Match Recall 0.0 0.444 0.0 0.0 0.167 0.222 0.0 0.167 0.167 0.111 0.722 F-Measure NaN 0.593 NaN NaN 0.261 0.364 NaN 0.286 0.273 0.19 0.765 Precision NaN 0.963 0.944 0.944 0.993 NaN NaN 0.983 0.991 0.945 0.975 NonMatch Recall 0.0 0.995 1.0 0.997 0.733 0.0 0.0 0.874 0.884 0.797 0.992 F-Measure NaN 0.979 0.971 0.97 0.843 NaN NaN 0.925 0.935 0.865 0.983 Precision 0.015 0.5 0.0 0.0 0.033 0.015 0.015 0.062 0.081 0.024 0.0 DontKnow Recall 1.0 0.167 0.0 0.0 0.667 1.0 1.0 0.667 0.833 0.333 0.0 F-Measure 0.029 0.25 NaN NaN 0.063 0.029 0.029 0.114 0.147 0.045 NaN Precision NaN 0.953 NaN NaN 0.962 NaN NaN 0.97 0.968 0.92 0.954 Weighted Avg Recall 0.015 0.959 0.942 0.939 0.707 0.024 0.015 0.84 0.852 0.76 0.966 F-Measure NaN 0.951 NaN NaN 0.806 NaN NaN 0.886 0.894 0.823 NaN ρ-Accuracy 0.03 0.959 0.943 0.941 0.819 0.049 0.03 0.906 0.914 0.84 0.961 Table 10.31: TCMU SCALL NPIR TNRR Knowledge on Manually Annotated dataset for Location

EQ. Lev. Euc. Jac. Jar. Mgk. Ovl. QGr. SMW. NWu. Tag.
Match:
Precision 0.21 1.0 NaN 0.206 0.8 1.0 NaN 0.714 NaN 0.333 1.0
Recall 0.722 0.111 0.0 0.722 0.222 0.056 0.0 0.833 0.0 0.056 0.111
F-Measure 0.325 0.2 NaN 0.321 0.348 0.105 NaN 0.769 NaN 0.095 0.2
NonMatch:
Precision NaN 0.985 1.0 0.977 0.965 0.98 NaN 0.979 0.984 0.948 0.987
Recall 0.0 0.833 0.026 0.874 0.985 0.879 0.0 0.972 0.928 0.987 0.943
F-Measure NaN 0.903 0.05 0.923 0.975 0.927 NaN 0.975 0.955 0.967 0.965
DontKnow:
Precision 0.011 0.024 0.015 0.5 0.0 0.048 0.015 0.0 0.022 0.2 0.077
Recall 0.667 0.333 1.0 0.167 0.0 0.5 1.0 0.0 0.167 0.167 0.5
F-Measure 0.022 0.045 0.029 0.25 NaN 0.087 0.029 NaN 0.038 0.182 0.133
Weighted Avg:
Precision NaN 0.972 NaN 0.936 0.944 0.967 NaN 0.953 NaN 0.91 0.974
Recall 0.041 0.794 0.039 0.857 0.937 0.838 0.015 0.952 0.877 0.935 0.901
F-Measure NaN 0.859 NaN 0.887 NaN 0.879 NaN NaN NaN 0.918 0.919
ρ-Accuracy 0.059 0.879 0.075 0.766 0.948 0.903 0.03 0.944 0.926 0.934 0.942
Table 10.32: TCBI SCALL NPIR TNCC Char-RC on Manually Annotated dataset for Location

Also rules learned and applied using the Levenshtein similarity metric produced good results, with a lower precision with respect to QGram in supporting matching decisions, but with a higher recall.

EQ. Lev. Euc. Jac. Jar. Mgk. Ovl. QGr. SMW. NWu. Tag. Precision 0.21 0.833 NaN NaN 0.611 1.0 NaN 1.0 NaN 0.167 0.667 Match Recall 0.722 0.278 0.0 0.0 0.611 0.111 0.0 0.056 0.0 0.056 0.444 F-Measure 0.325 0.417 NaN NaN 0.611 0.2 NaN 0.105 NaN 0.083 0.533 Precision NaN 0.986 0.933 0.966 0.983 1.0 NaN 0.982 0.984 0.946 0.989 NonMatch Recall 0.0 0.913 0.72 0.879 0.892 0.003 0.0 0.956 0.923 0.985 0.907 F-Measure NaN 0.948 0.813 0.921 0.935 0.005 NaN 0.969 0.952 0.965 0.946 Precision 0.011 0.043 0.018 0.034 0.071 0.015 0.015 0.061 0.021 0.5 0.045 DontKnow Recall 0.667 0.333 0.333 0.333 0.5 1.0 1.0 0.333 0.167 0.167 0.333 F-Measure 0.022 0.075 0.034 0.062 0.125 0.029 0.029 0.103 0.037 0.25 0.08 Precision NaN 0.966 NaN NaN 0.954 0.986 NaN 0.969 NaN 0.905 0.961 Weighted Avg Recall 0.041 0.877 0.683 0.833 0.874 0.022 0.015 0.908 0.872 0.932 0.879 F-Measure NaN 0.912 NaN NaN 0.909 0.014 NaN 0.919 NaN 0.916 0.916 ρ-Accuracy 0.059 0.926 0.789 0.895 0.901 0.043 0.03 0.943 0.924 0.923 0.916 Table 10.33: TCMU SCALL NPIR TNCR Char-RC on Manually Annotated dataset for Location

Performing experiments with the New York Times dataset, performances decrease in terms of recall, as shown in table 10.34. Considering rules extracted relying on binary classification (TCBI) considering all the sources at the same time (SCALL), the configuration that produced the best results in terms of ρ-accuracy is the one that applied inconsistency operator normalization (NPIC) and Conservative Match and Non Match threshold normalization (TNCC), using the Jaccard similarity metric and relying on the Character-based Relative Completeness comparison method. An explanation of these results can be given considering the low number of positive matching examples in the training set, which does not support the definition of permissive thresholds for rule satisfaction. However, the precision of matching samples is perfect in all considered cases. Character-based Relative Completeness weights the score of similarity to the longest variant of the attribute value. This type of measure tends to decrease the similarity score of attributes, increasing precision. However, in this case it allowed normalizing the matching of the different variants of the attributes, so that the defined rules could be satisfied by a larger number of sample pairs. The second best result is obtained using Taglink and a greedy comparison method. Notice that in this case, the inconsistency normalization process that performed better is the operator normalization. This approach tends to build more conservative rules, as, rather than removing atoms, it switches the operator to make it consistent with the rule target. Longer rules are less likely to be satisfied, as more attributes need to be involved in the process. However, in this case, the definition of the threshold values learned according to a normalized similarity score seems to have a higher impact. Notice that the ρ-accuracy has the same value as the F-measure in the best case. This implies that no false negative matching decisions were taken, and thus, when no positive matching decision could be taken, no negative matching rule was satisfied.
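To make the weighting concrete, the following is a minimal sketch of one plausible reading of the Character-based Relative Completeness weighting, assuming a generic string similarity function; the function and variable names are illustrative and not the exact formulation used in the experiments.

```python
from difflib import SequenceMatcher

def string_similarity(a: str, b: str) -> float:
    # Stand-in for any of the string similarity metrics used in the chapter
    # (Levenshtein, Jaro-Winkler, QGram, ...); difflib keeps the sketch
    # self-contained.
    return SequenceMatcher(None, a, b).ratio()

def char_rc_similarity(value_a, variants_a, value_b, variants_b):
    # Hypothetical Character-based Relative Completeness weighting: the raw
    # similarity of the two compared values is scaled by how complete they
    # are with respect to the longest known variant of the attribute.
    # Shorter (less complete) values are penalized, which lowers scores and
    # tends to raise precision.
    raw = string_similarity(value_a, value_b)
    longest = max(len(v) for v in list(variants_a) + list(variants_b) + [value_a, value_b])
    completeness = min(len(value_a), len(value_b)) / longest
    return raw * completeness

# An abbreviated name compared against a fuller variant is discounted.
print(char_rc_similarity("E. Andrews", ["Edmund L. Andrews"],
                         "Edmund Andrews", ["Edmund Andrews"]))
```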

EQ. Lev. Euc. Jac. Jar. Mgk. Ovl. QGr. SMW. NWu. Tag. Precision NaN 1.0 1.0 NaN 1.0 1.0 1.0 1.0 1.0 1.0 1.0 Greedy Recall 0.0 0.034 0.324 0.0 0.023 0.012 0.042 0.003 0.221 0.002 0.34 F-Measure NaN 0.066 0.49 NaN 0.044 0.024 0.08 0.006 0.362 0.005 0.507 ρ-Accuracy 0.0 0.066 0.489 0.0 0.044 0.024 0.073 0.006 0.362 0.004 0.506 Precision NaN 1.0 1.0 NaN 1.0 1.0 NaN 1.0 1.0 1.0 1.0 Knowledge Recall 0.0 0.096 0.016 0.0 0.19 0.103 0.0 0.369 0.037 0.014 0.023 F-Measure NaN 0.175 0.031 NaN 0.319 0.187 NaN 0.539 0.072 0.028 0.045 ρ-Accuracy 0.0 0.175 0.029 0.0 0.315 0.181 0.0 0.483 0.071 0.024 0.039 Precision 1.0 1.0 1.0 1.0 1.0 1.0 NaN 1.0 1.0 1.0 1.0 Char-RC Recall 0.363 0.01 0.127 0.601 0.106 0.132 0.0 0.017 0.018 0.003 0.001 F-Measure 0.533 0.019 0.225 0.751 0.191 0.233 NaN 0.034 0.036 0.005 0.003 ρ-Accuracy 0.533 0.02 0.215 0.751 0.185 0.225 0.0 0.033 0.035 0.006 0.002 Table 10.34: TCBI SCALL NPIC TNCC on NYT dataset for Location

If we consider the same settings, but learn rules considering also the third "don't know" class (TCMU), the configuration that produced the best results in terms of ρ-accuracy is the one that applied inconsistent operator removal normalization (NPIR) and Conservative Match and Non Match threshold normalization (TNCC), using the Jaro-Winkler similarity metric and relying on the Character-based Relative Completeness comparison method (table 10.35). The second best performance using multiclass classification and relative completeness similarity score normalization was obtained relying on the Taglink similarity metric. The greedy approach to the matching function seems to heavily affect the matching recall of locations.

EQ. Lev. Euc. Jac. Jar. Mgk. Ovl. QGr. SMW. NWu. Tag. Precision NaN 1.0 1.0 1.0 1.0 1.0 NaN 1.0 1.0 1.0 1.0 Greedy Recall 0.0 0.066 0.324 0.008 0.004 0.041 0.0 0.019 0.056 0.006 0.097 F-Measure NaN 0.124 0.49 0.017 0.008 0.079 NaN 0.037 0.106 0.012 0.177 ρ-Accuracy 0.0 0.107 0.489 0.013 0.008 0.079 0.0 0.037 0.095 0.012 0.177 Precision NaN 1.0 1.0 1.0 1.0 1.0 NaN 1.0 1.0 1.0 1.0 Knowledge Recall 0.0 0.02 0.009 0.01 0.19 0.18 0.0 0.006 0.019 0.049 0.031 F-Measure NaN 0.04 0.017 0.02 0.319 0.305 NaN 0.012 0.038 0.094 0.061 ρ-Accuracy 0.0 0.035 0.016 0.017 0.316 0.305 0.0 0.012 0.037 0.086 0.05 Precision 1.0 1.0 1.0 1.0 1.0 1.0 NaN 1.0 1.0 1.0 1.0 Char-RC Recall 0.363 0.162 0.033 0.011 0.526 0.186 0.0 0.133 0.085 0.123 0.318 F-Measure 0.533 0.278 0.063 0.021 0.689 0.314 NaN 0.235 0.157 0.22 0.483 ρ-Accuracy 0.533 0.226 0.05 0.014 0.684 0.313 0.0 0.218 0.156 0.135 0.471 Table 10.35: TCMU SCALL NPIC TNCC on NYT dataset for Location

A consideration related to this experiment is that, with the NYT dataset, no similarity metric seems to be steadily the best option across the different learning and comparison methods. More considerations, also on the nature of the considered dataset, will be presented at the end of the section.
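As a reference for how the per-class figures in these tables arise, the following sketch illustrates a three-way matching decision over fingerprint pairs, assuming rules encoded as lists of (attribute, operator, threshold) atoms and assuming positive rules are checked before negative ones; the encoding and helper names are illustrative, not the actual implementation.

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    # Stand-in for whichever string similarity metric a rule atom refers to.
    return SequenceMatcher(None, a, b).ratio()

def satisfied(atoms, fp_a, fp_b):
    # A rule is satisfied when all of its atoms hold; each atom compares one
    # attribute of the two fingerprints against a learned threshold.
    # Assumed atom format: (attribute, operator, threshold).
    for attribute, op, threshold in atoms:
        if attribute not in fp_a or attribute not in fp_b:
            return False
        score = similarity(fp_a[attribute], fp_b[attribute])
        if op == ">=" and score < threshold:
            return False
        if op == "<=" and score > threshold:
            return False
    return True

def classify(fp_a, fp_b, positive_rules, negative_rules):
    # Assumed decision order: match if any positive rule fires, non-match if
    # any negative rule fires, "don't know" otherwise.
    if any(satisfied(r, fp_a, fp_b) for r in positive_rules):
        return "match"
    if any(satisfied(r, fp_a, fp_b) for r in negative_rules):
        return "non-match"
    return "don't know"

# Example with a single positive and a single negative rule on the name attribute.
pos = [[("name", ">=", 0.9)]]
neg = [[("name", "<=", 0.3)]]
print(classify({"name": "Wolfgang Wagner"}, {"name": "Wolfgang Wagner"}, pos, neg))
```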

Bottom-up Rules for Organization

In this section we present the experiments that obtained the best results on the organization evaluation datasets. In particular, for each comparison method and classification method, we select the best inconsistency normalization and threshold normalization combination based on the ρ-accuracy described in equation (10.1) at the beginning of the chapter. The detailed results of the other experiments will be available online, together with the raw experiment results and the datasets.
Considering a greedy comparison approach (Greedy), and rules extracted relying on binary classification (TCBI) considering all the sources at the same time (SCALL), the configuration that produced the best results in terms of ρ-accuracy is the one that applied inconsistency removal, independently from the threshold normalization function, and relying on the Levenshtein similarity metric (table 10.36). Also the Jaccard and Jaro-Winkler string similarity metrics performed very similarly, with a higher positive matching precision. The quality of the positive matching performances is rather low, whereas negative matching classification was done quite accurately. The overall accuracy of the experiment is good, but the poor positive matching performances seem to suggest that the training set is too poor in terms of positive matching samples.
Considering a Knowledge-driven comparison approach (Knowledge), and rules extracted relying on binary classification (TCBI) considering all the sources at the same time (SCALL), the configuration that produced the best results in terms of ρ-accuracy is the one that applied inconsistency removal normalization (NPIR), independently from the threshold normalization function, using the Levenshtein similarity metric.

EQ. Lev. Euc. Jac. Jar. Mgk. Ovl. QGr. SMW. NWu. Tag. Precision 0.2 0.2 0.118 0.3 0.5 NaN 1.0 0.5 NaN 0.304 0.583 Match Recall 0.029 0.029 0.059 0.088 0.176 0.0 0.088 0.235 0.0 0.706 0.206 F-Measure 0.051 0.051 0.078 0.136 0.261 NaN 0.162 0.32 NaN 0.425 0.304 Precision 0.707 0.978 0.981 0.98 0.983 NaN 0.984 0.982 NaN 0.982 1.0 NonMatch Recall 0.061 0.977 0.969 0.968 0.963 0.0 0.891 0.971 0.0 0.201 0.131 F-Measure 0.113 0.978 0.975 0.974 0.973 NaN 0.935 0.976 NaN 0.333 0.232 Precision 0.012 0.161 0.153 0.136 0.147 0.018 0.066 0.172 0.018 0.013 0.021 DontKnow Recall 0.6 0.36 0.36 0.36 0.44 1.0 0.48 0.4 1.0 0.56 1.0 F-Measure 0.023 0.222 0.214 0.198 0.22 0.035 0.116 0.241 0.035 0.026 0.04 Precision 0.682 0.945 0.945 0.949 0.957 NaN 0.968 0.956 NaN 0.948 0.972 Weighted Avg Recall 0.07 0.943 0.936 0.936 0.934 0.018 0.864 0.943 0.018 0.219 0.149 F-Measure 0.109 0.942 0.94 0.94 0.942 NaN 0.902 0.947 NaN 0.33 0.23 ρ-Accuracy 0.129 0.955 0.942 0.95 0.952 0.035 0.921 0.953 0.035 0.327 0.256 Table 10.36: TCBI SCALL NPIR TNCC Greedy on Manually Annotated dataset for Organization

The detailed results are presented in table 10.37. Also Jaro-Winkler, Jaccard and QGram performed well.

EQ. Lev. Euc. Jac. Jar. Mgk. Ovl. QGr. SMW. NWu. Tag. Precision 0.2 0.2 0.118 0.2 0.5 NaN 1.0 0.5 1.0 0.304 0.538 Match Recall 0.029 0.029 0.059 0.029 0.176 0.0 0.088 0.235 0.206 0.706 0.206 F-Measure 0.051 0.051 0.078 0.051 0.261 NaN 0.162 0.32 0.341 0.425 0.298 Precision 0.87 0.978 0.981 0.976 0.983 NaN 0.984 0.982 0.989 0.982 1.0 NonMatch Recall 0.175 0.977 0.969 0.975 0.963 0.0 0.891 0.971 0.871 0.241 0.131 F-Measure 0.292 0.978 0.975 0.976 0.973 NaN 0.935 0.976 0.926 0.387 0.232 Precision 0.014 0.161 0.153 0.127 0.147 0.018 0.066 0.172 0.071 0.013 0.02 DontKnow Recall 0.64 0.36 0.36 0.28 0.44 1.0 0.48 0.4 0.6 0.52 0.96 F-Measure 0.028 0.222 0.214 0.175 0.22 0.035 0.116 0.241 0.127 0.026 0.039 Precision 0.839 0.945 0.945 0.942 0.957 NaN 0.968 0.956 0.973 0.948 0.971 Weighted Avg Recall 0.18 0.943 0.936 0.94 0.934 0.018 0.864 0.943 0.85 0.257 0.148 F-Measure 0.281 0.942 0.94 0.939 0.942 NaN 0.902 0.947 0.898 0.381 0.23 ρ-Accuracy 0.297 0.955 0.942 0.953 0.952 0.035 0.921 0.953 0.914 0.373 0.255 Table 10.37: TCBI SCALL NPIR Knowledge on Manually Annotated dataset for Organization

Considering a Character-based Relative Completeness comparison approach (Char-RC), and rules extracted relying on binary classification (TCBI) considering all the sources at the same time (SCALL), the configuration that produced the best results in terms of ρ-accuracy is the one that applied inconsistency removal normalization (NPIR), independently from the threshold normalization function, using the Needleman-Wunsch similarity metric. The detailed results are presented in table 10.38. It is interesting to see how also the Jaro-Winkler, Jaccard and Levenshtein similarity metrics performed well. Also in this case, positive matching performances are very poor.
If we consider the same settings, but learn rules considering also the third "don't know" class (TCMU) and a greedy approach, the configuration that produced the best results in terms of ρ-accuracy is the one that applied inconsistent operator removal normalization (NPIR), independently from the threshold normalization, using the Needleman-Wunsch similarity metric (table 10.39). The second best performance using multiclass classification and relative completeness similarity score normalization was obtained relying on the Levenshtein similarity metric.

EQ. Lev. Euc. Jac. Jar. Mgk. Ovl. QGr. SMW. NWu. Tag.
Match:
Precision 0.2 0.2 0.121 0.2 0.5 NaN 1.0 0.4 1.0 0.2 0.583
Recall 0.029 0.029 0.118 0.029 0.176 0.0 0.088 0.176 0.206 0.029 0.206
F-Measure 0.051 0.051 0.119 0.051 0.261 NaN 0.162 0.245 0.341 0.051 0.304
NonMatch:
Precision 0.707 0.978 0.98 0.976 0.983 NaN 0.984 0.983 0.988 0.978 0.995
Recall 0.061 0.977 0.969 0.975 0.969 0.0 0.89 0.957 0.858 0.978 0.137
F-Measure 0.113 0.978 0.974 0.976 0.976 NaN 0.935 0.97 0.919 0.978 0.241
DontKnow:
Precision 0.012 0.161 0.167 0.127 0.152 0.018 0.066 0.125 0.061 0.164 0.02
Recall 0.6 0.36 0.28 0.28 0.4 1.0 0.48 0.4 0.56 0.36 0.96
F-Measure 0.023 0.222 0.209 0.175 0.22 0.035 0.115 0.19 0.111 0.225 0.039
Weighted Avg:
Precision 0.682 0.945 0.945 0.942 0.956 NaN 0.968 0.954 0.972 0.945 0.967
Recall 0.07 0.943 0.936 0.94 0.939 0.018 0.864 0.928 0.837 0.944 0.154
F-Measure 0.109 0.942 0.94 0.939 0.945 NaN 0.901 0.938 0.89 0.942 0.239
ρ-Accuracy 0.129 0.955 0.927 0.953 0.955 0.035 0.92 0.946 0.906 0.956 0.262
Table 10.38: TCBI SCALL NPIR Char-RC on Manually Annotated dataset for Organization

Also in this case, positive matching performances are quite poor. Considering the Knowledge-driven approach, the performances do not change, therefore we avoid presenting them twice.

EQ. Lev. Euc. Jac. Jar. Mgk. Ovl. QGr. SMW. NWu. Tag. Precision 0.2 0.2 0.105 0.296 0.083 NaN NaN 0.417 NaN 0.2 0.121 Match Recall 0.029 0.029 0.118 0.706 0.059 0.0 0.0 0.147 0.0 0.029 0.118 F-Measure 0.051 0.051 0.111 0.417 0.069 NaN NaN 0.217 NaN 0.051 0.119 Precision 0.707 0.978 0.982 0.983 0.981 0.981 NaN 0.985 0.991 0.978 0.989 NonMatch Recall 0.061 0.977 0.959 0.173 0.966 0.194 0.0 0.949 0.844 0.978 0.943 F-Measure 0.113 0.978 0.971 0.294 0.973 0.324 NaN 0.966 0.912 0.978 0.966 Precision 0.012 0.161 0.189 0.014 0.196 0.02 0.018 0.125 0.062 0.164 0.136 DontKnow Recall 0.6 0.36 0.4 0.6 0.44 0.92 1.0 0.48 0.64 0.36 0.48 F-Measure 0.023 0.222 0.256 0.027 0.272 0.04 0.035 0.198 0.113 0.225 0.212 Precision 0.682 0.945 0.947 0.949 0.945 NaN NaN 0.955 NaN 0.945 0.953 Weighted Avg Recall 0.07 0.943 0.929 0.194 0.934 0.202 0.018 0.921 0.82 0.944 0.915 F-Measure 0.109 0.942 0.937 0.292 0.939 NaN NaN 0.934 NaN 0.942 0.932 ρ-Accuracy 0.129 0.955 0.921 0.294 0.934 0.335 0.035 0.945 0.898 0.956 0.921 Table 10.39: TCMU SCALL IPP NPIR Greedy Manually Annotated Dataset for organization

If we consider also the third "don't know" class (TCMU) and a Character-based Relative Completeness weighted approach, the configuration that produced the best results in terms of ρ-accuracy is the one that applied inconsistent operator removal normalization (NPIR), independently from the threshold normalization, using the Jaro-Winkler similarity metric (table 10.40). Also in this case, the Needleman-Wunsch similarity metric allowed extracting rules performing fairly well. Matching performances for positive matching samples are quite poor, reflecting the weakness of the considered training set.
Performing entity matching experiments on the NYT dataset using matching rules that are the result of the bottom-up process described in section 8.2 produced the results presented in tables 10.41 and 10.42. The configuration that produced the best results in terms of ρ-accuracy is the one that applied inconsistent operator removal normalization (NPIR), independently from the threshold normalization, using the Needleman-Wunsch similarity metric and relying on the Knowledge-driven comparison method and the binary classification method (TCBI).

EQ. Lev. Euc. Jac. Jar. Mgk. Ovl. QGr. SMW. NWu. Tag. Precision 0.2 0.2 0.111 0.2 0.333 NaN NaN 0.455 NaN 0.2 0.5 Match Recall 0.029 0.029 0.059 0.029 0.029 0.0 0.0 0.147 0.0 0.029 0.147 F-Measure 0.051 0.051 0.077 0.051 0.054 NaN NaN 0.222 NaN 0.051 0.227 Precision 0.707 0.98 0.982 0.997 0.981 0.982 NaN 0.983 NaN 0.978 0.992 NonMatch Recall 0.061 0.953 0.279 0.269 0.978 0.867 0.0 0.953 0.0 0.978 0.93 F-Measure 0.113 0.966 0.434 0.424 0.979 0.921 NaN 0.968 NaN 0.978 0.96 Precision 0.012 0.11 0.022 0.024 0.186 0.055 0.018 0.112 0.018 0.164 0.12 DontKnow Recall 0.6 0.4 0.88 1.0 0.44 0.48 1.0 0.4 1.0 0.36 0.64 F-Measure 0.023 0.172 0.043 0.047 0.262 0.099 0.035 0.175 0.035 0.225 0.203 Precision 0.682 0.946 0.943 0.961 0.951 NaN NaN 0.955 NaN 0.945 0.965 Weighted Avg Recall 0.07 0.921 0.284 0.276 0.946 0.839 0.018 0.924 0.018 0.944 0.906 F-Measure 0.109 0.93 0.419 0.408 0.944 NaN NaN 0.936 NaN 0.942 0.929 ρ-Accuracy 0.129 0.945 0.43 0.43 0.961 0.905 0.035 0.947 0.035 0.956 0.941 Table 10.40: TCMU SCALL NPIR TNCC Char-RC Manually Annotated Dataset for Organization

The greedy comparison method, also relying on the Needleman-Wunsch similarity metric, produced very similar results in terms of accuracy. None of the other considered metrics produced any decent results. Considering the multiclass classification learning method (TCMU), the best performing configuration did not change, except for the similarity metric that obtained positive results, which in this case is Jaccard. Differently from the case of locations, multiclass classification produced better results in terms of accuracy than binary classification.

EQ. Lev. Euc. Jac. Jar. Mgk. Ovl. QGr. SMW. NWu. Tag. Precision 1.0 NaN 1.0 1.0 1.0 NaN 1.0 1.0 NaN 1.0 1.0 Greedy Recall 0.001 0.0 0.039 0.038 0.024 0.0 0.045 0.039 0.0 0.544 0.04 F-Measure 0.002 NaN 0.075 0.073 0.048 NaN 0.086 0.075 NaN 0.705 0.077 ρ-Accuracy 0.001 0.007 0.061 0.06 0.042 0.0 0.078 0.061 0.0 0.688 0.077 Precision 1.0 NaN 1.0 NaN 1.0 NaN 1.0 1.0 1.0 1.0 1.0 Knowledge Recall 0.001 0.0 0.038 0.0 0.026 0.0 0.048 0.041 0.059 0.544 0.043 F-Measure 0.002 NaN 0.073 NaN 0.05 NaN 0.091 0.08 0.111 0.705 0.082 ρ-Accuracy 0.001 0.007 0.06 0.007 0.043 0.0 0.081 0.064 0.103 0.69 0.081 Precision 1.0 1.0 1.0 NaN 1.0 NaN 1.0 1.0 1.0 NaN 1.0 Char-RC Recall 0.001 0.004 0.046 0.0 0.024 0.0 0.044 0.037 0.059 0.0 0.039 F-Measure 0.002 0.007 0.089 NaN 0.048 NaN 0.084 0.071 0.111 NaN 0.075 ρ-Accuracy 0.001 0.013 0.069 0.007 0.041 0.0 0.077 0.058 0.102 0.007 0.075 Table 10.41: TCBI SCALL IPP NPIR TNCC NYT Dataset for Organization

EQ. Lev. Euc. Jac. Jar. Mgk. Ovl. QGr. SMW. NWu. Tag. Precision 1.0 1.0 1.0 1.0 1.0 NaN NaN 1.0 NaN 1.0 1.0 Match Recall 0.001 0.004 0.048 0.548 0.028 0.0 0.0 0.004 0.0 0.001 0.035 F-Measure 0.002 0.007 0.091 0.708 0.055 NaN NaN 0.007 NaN 0.002 0.068 ρ-Accuracy 0.001 0.013 0.071 0.694 0.047 0.004 0.0 0.013 0.009 0.009 0.061 Precision 1.0 1.0 1.0 1.0 1.0 NaN NaN 1.0 NaN 1.0 1.0 Match Recall 0.001 0.005 0.048 0.548 0.026 0.0 0.0 0.005 0.0 0.001 0.005 F-Measure 0.002 0.01 0.091 0.708 0.05 NaN NaN 0.01 NaN 0.002 0.01 ρ-Accuracy 0.001 0.014 0.091 0.695 0.043 0.004 0.0 0.015 0.009 0.009 0.01 Precision 1.0 1.0 1.0 1.0 NaN NaN NaN 1.0 NaN NaN 1.0 Match Recall 0.001 0.004 0.037 0.002 0.0 0.0 0.0 0.004 0.0 0.0 0.004 F-Measure 0.002 0.007 0.071 0.005 NaN NaN NaN 0.007 NaN NaN 0.007 ρ-Accuracy 0.001 0.013 0.071 0.008 0.007 0.008 0.0 0.013 0.0 0.007 0.014 Table 10.42: TCMU SCALL IPP NPIR NYT Dataset for Organization 10.2. EVALUATING FINGERPRINT MATCH 221

10.2.3 Mixed Rules Experiments

In this section we aim at presenting the results of mixing the entity matching rules resulting from the top-down and bottom-up processes. Namely, we aim at joining and mixing the set of rules learned in a bottom-up fashion with the set of rules resulting from the ontological analysis. The goal of these experiments is to understand whether different approaches to mixing the rules can provide any improvement with respect to the simple application of just bottom-up or top-down rules. Therefore, in the following we will present the best results obtained mixing the rules for each of the datasets considered, and compare them with the best results of the application of each single approach. This time, we will not present in tables the results obtained for each comparison method, but rather the details related to the best configuration. It is important to notice that the selection of the best configuration is done over a pool of thousands of experiment results, which are not worth presenting in detail in this context, but which will be available online, together with the raw experiment results and the datasets used.
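As a reference for the integration strategies referred to below (IPP, IPTO, IPNO), the following is a minimal sketch of how such rule-set mixing could be implemented, assuming a hypothetical rule data model; it only illustrates the policies, not the actual implementation.

```python
from dataclasses import dataclass

@dataclass
class MatchingRule:
    target: str   # "match" or "non-match"
    atoms: list   # e.g. [("name", "JaroWinkler", ">=", 0.93), ...]

def integrate_rules(bottom_up, top_down, policy="IPP"):
    # Sketch of the integration policies used as experiment labels below:
    #   IPP  - plain integration of all top-down rules,
    #   IPTO - integrate only positive (match) top-down rules,
    #   IPNO - integrate only negative (non-match) top-down rules.
    if policy == "IPP":
        selected = list(top_down)
    elif policy == "IPTO":
        selected = [r for r in top_down if r.target == "match"]
    elif policy == "IPNO":
        selected = [r for r in top_down if r.target == "non-match"]
    else:
        raise ValueError(f"unknown integration policy: {policy}")
    return list(bottom_up) + selected

# Example: keep only the positive top-down rule when mixing.
learned = [MatchingRule("match", [("name", "Levenshtein", ">=", 0.85)])]
top = [MatchingRule("match", [("homepage", "Equal", ">=", 1.0)]),
       MatchingRule("non-match", [("birthdate", "Equal", "<=", 0.0)])]
print(len(integrate_rules(learned, top, policy="IPTO")))  # -> 2
```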

Mixing Rules for Person

TP FP Prec Rec F-2
M 0.026 0.001 0.945 0.397 0.559
N 0.817 0.035 0.959 0.915 0.937
U 0.019 0.102 0.158 0.443 0.233
Avg 0.731 0.036 0.923 0.861 0.882
Table 10.43: IPTO, Knowledge, Needleman-Wunsch, Mixed rules

TP FP Prec Rec F-2
M 0.026 0.001 0.945 0.397 0.559
N 0.818 0.037 0.957 0.917 0.936
U 0.019 0.099 0.158 0.432 0.232
Avg 0.732 0.037 0.922 0.862 0.882
Table 10.44: IPP, KNOWLEDGE, Needleman-Wunsch, Mixed rules

TP FP Prec Rec F-2
M 0.026 0.002 0.929 0.397 0.556
N 0.816 0.034 0.96 0.915 0.937
U 0.019 0.103 0.157 0.443 0.232
Avg 0.731 0.035 0.923 0.861 0.882
Table 10.45: IPTO, KNOWLEDGE, Needleman-Wunsch, Bottom up only

TP FP Prec Rec F-2
M 0.0 0.0 1.0 0.008 0.015
N 0.872 0.068 0.927 0.977 0.952
U 0.014 0.045 0.233 0.318 0.269
Avg 0.779 0.063 0.902 0.886 0.862
Table 10.46: KNOWLEDGE, Needleman-Wunsch, Diachronic only Top-down

TP FP Prec Rec F-2
M 0.019 0.001 0.929 0.298 0.451
N 0.872 0.057 0.939 0.977 0.958
U 0.013 0.038 0.252 0.295 0.272
Avg 0.78 0.052 0.909 0.904 0.896
Table 10.47: KNOWLEDGE, Needleman-Wunsch, Combinatorial only Top-down

TP FP Prec Rec F-2
M 0.025 0.001 0.943 0.382 0.543
N 0.817 0.035 0.959 0.915 0.937
D 0.019 0.103 0.157 0.443 0.231
Avg 0.731 0.036 0.923 0.86 0.881
Table 10.48: IPP, KNOWLEDGE, Needleman-Wunsch, Mixing with Combinatorial Top-down

Table 10.49: Experiments with mixed rules on Manually Annotated dataset for Person

Considering the manually annotated dataset, the configuration that performed better in terms of ρ-accuracy is:

• Knowledge-driven comparison method, with Needleman-Wunsch similarity metric,

applying inconsistency removal normalization (NPIR), threshold normalization conservative for positive matching rules and relaxed for negative matching rules (TNCR), integrating only positive matching top-down rules (IPTO) (table 10.43). The ρ-accuracy is 0.907. If we consider plain rules integration, the ρ-accuracy is the same, but the precision of negative match decreases, despite an increase of recall (table 10.44).

If we compare the results obtained with mixed rules with the results obtained by applying only bottom-up or top-down rules, we can find that mixed rules performed better than bottom-up rules in terms of precision of positive and negative match classification. The improvement is minimal, but not irrelevant (table 10.45). If we consider the comparison with diachronic top-down only matching results, the improvements in terms of negative matching precision are small but not irrelevant, whereas the improvements in terms of recall are quite considerable for positive matching classification (table 10.46). This shows how learning matching thresholds can improve matching performances with respect to real world data. The rule learning method proposed in this context is not at the state of the art in terms of regression methods, but it allowed us to test the integrated rules combination. Comparing the results of the mixed rules combination considering only diachronic top-down rules with the results of top-down rules obtained through the combination of functional attributes defined in the identification ontology, we can see how the latter performed better in classifying negative matching cases, especially in terms of precision. This is due to the fact that negative matching rules defined by combining a functional property with the name allow reducing the effect of problems related to string comparison (table 10.47). However, if we test the integration of learned rules together with the rules obtained through the combination of functional attributes, we obtain similar results, even though slightly worse than those obtained mixing learned rules with diachronic rules (table 10.48).
Considering the NYT dataset, we selected the configuration that performed better in terms of ρ-accuracy according to each of the comparison methods:

• considering greedy approach, the configuration that performed better is the one that applied inconsistencies normalization process (NPIC), setting relaxed thresholds for positive and negative matching rules, relying on Levenshtein similarity metric. This result was obtained with positive only top-down rules integration.

• considering knowledge-driven approach, the configuration that performed better is the one that applied inconsistencies normalization process (NPIC), setting conservative thresholds for positive and negative matching rules, and relying on Jaro-Winkler string similarity metric. This result was obtained with positive only top-down rules integration.

Class TP FP Precision Recall F-Measure ρ-accuracy
Greedy 0.755 0.0 1.0 0.757 0.861 0.86
Knowledge 0.783 0.0 1.0 0.785 0.879 0.879
Char-RC 0.872 0.0 1.0 0.874 0.933 0.902
Table 10.50: NYT Dataset mixed rules, 3 comparison methods for Person


• considering Character-based Relative Completeness method, the configuration that performed better is the one that applied inconsistencies removal process (NPIR), setting relaxed thresholds for positive and negative matching rules, and relying on Monge-Elkan string similarity metric. This result was obtained with positive only top-down rules integration.

The configuration that performed best in absolute terms is the one that relied on Character-based Relative Completeness, with a ρ-accuracy of 0.902. However, analyzing the results in table 10.50, we can notice that the ρ-accuracy is 3 points lower than the F-measure. This implies that the Char-RC comparison method allowed learning more permissive thresholds, capturing a large number of matches, but also supporting a large number of false negative classifications. The other two comparison approaches had basically no variation between F-measure and ρ-accuracy, and thus they can be considered more reliable. Analyzing their configuration, the reason of this higher reliability lies in the inconsistent atoms normalization process. In fact, the inconsistency normalization NPIC reverses the sign of the operator when it is inconsistent with the matching rule, generating more conservative rules. Given these considerations, we consider as the best the method that relied on the Knowledge-driven approach for comparison.
Comparing the results of the experiments run on the NYT dataset, we can notice that the application of a mixed approach performed a little worse than the pure bottom-up approach, as shown in table 10.51. It seems that the integration of diachronic-only top-down rules decreased the matching performances of the bottom-up learned rules. This can be due to the fact that diachronic rules imposed a more conservative threshold on some positive matching rules, reducing the recall by less than a point (Mixed-D in table 10.51). The performances are not even comparable when considering only matching with diachronic rules (Top-D in table 10.51). This may be a sign that inverse-functional properties in the dataset are rare and have a very low similarity. This also explains the better performance of the Char-RC classifier, as, in a sense, it relaxed matching requirement thresholds based on the variations of attribute values.

Class TP FP Precision Recall F-Measure ρ-accuracy
Mixed-D 0.783 0.0 1.0 0.785 0.879 0.879
Bottom-up 0.79 0.0 1.0 0.791 0.884 0.882
Top-D 0.002 0.875 0.002 1.0 0.004 0.002
Top-C 0.589 0.0 1.0 0.591 0.743 0.604
Mixed-C 0.846 0.0 1.0 0.848 0.918 0.916
Table 10.51: NYT Dataset, comparing approaches for Person

A more in-depth analysis of the characteristics of the matched dataset is going to be presented in section 10.2.4. Considering rules built relying on the combination of attributes defined in the identification ontology (Top-C in table 10.51), the results obtained are almost 27 points worse than those obtained relying on a mixed approach, showing once again that learning matching thresholds provides important improvements also in terms of "false negative" classification. In fact, the threshold used for combinatorial negative rules was set to 0.5, which proved to be too relaxed. This case may also indicate that the syntactic difference between matching attributes is particularly relevant in this dataset. If we consider mixed rules that are the result of a plain integration (IPP) of bottom-up learned rules and combinatorial top-down rules (Mixed-C in table 10.51), we obtain the best classification results, with a higher recall and only a marginal number of false negative classifications, as shown by the low difference between F-measure and ρ-accuracy.
Comparing the results of the experiments run on the OAEI 2010 dataset, we considered the configuration that performed better in terms of ρ-accuracy according to each of the comparison methods:

• considering greedy approach, the configuration that performed better is the one that applied inconsistencies normalization process (NPIC), setting conservative thresholds for positive and negative matching rules, relying on Jaro-Winkler similarity metric. This result was obtained with positive only top-down rules integration.

• considering knowledge-driven approach, the configuration that performed better is the one that applied inconsistencies normalization process (NPIC), relaxed threshold for positive matching rules and conservative threshold for negative matching rules, and relying on Needleman-Wunsch string similarity metric. This result was obtained with positive only top-down rules integration.

• considering Character-based Relative Completeness method, the configuration that performed better is the one that applied inconsistencies removal process (NPIR), setting relaxed thresholds for positive and conservative thresholds for negative matching rules, and relying on QGram string similarity metric. This result was obtained with positive only top-down rules integration.

Class TP FP Precision Recall F-Measure ρ-accuracy Greedy 0.976 0.0 1.0 0.976 0.988 0.988 Knowledge 0.972 0.0 1.0 0.972 0.986 0.986 Char-RC 0.986 0.0 1.0 0.986 0.993 0.991 Table 10.52: OAEI 2010 Dataset mixed rules, 3 comparison methods for Person

The configuration that performed best in absolute terms is the one that relied on Character-based Relative Completeness, with a ρ-accuracy of 0.991. However, analyzing the results in table 10.52, we can notice that the ρ-accuracy is slightly lower than the F-measure. This implies that the Char-RC comparison method allowed learning more permissive thresholds, capturing a large number of matches, but also supporting some false negative classifications. Furthermore, the configuration that performed best with Char-RC applied inconsistency removal normalization, which creates shorter rules that are more likely to be satisfied. The other two comparison approaches had basically no variation between F-measure and ρ-accuracy, and thus they can be considered more reliable. Analyzing their configuration, the reason of this higher reliability lies in the inconsistent atoms normalization process. In fact, the inconsistency normalization NPIC reverses the sign of the operator when it is inconsistent with the matching rule, generating more conservative rules. Given these considerations, we consider as the best the method that relied on the Greedy approach for comparison. Notice that this specific dataset does not present variations of attributes, and therefore the difference between the Greedy and the Knowledge-driven approach is simply in the definition of slightly more conservative rules for the second.
Comparing the results of the experiments run on the OAEI dataset, we can notice that the application of a mixed approach performed better than the pure bottom-up approach, as shown in table 10.53. This can be due to the fact that diachronic rules included among the matching rules some inverse-functional properties that were not covered by the bottom-up learned rules, increasing the recall by more than 40 points (Mixed-D in table 10.53). This is confirmed when considering only matching with top-down diachronic rules (Top-D in table 10.53). This confirms that inverse-functional properties in the dataset have a strong impact in supporting positive matching decisions. Considering rules built relying on the combination of attributes defined in the identification ontology (Top-C in table 10.53), the results obtained are better than those obtained mixing bottom-up and top-down rules. However, both top-down only approaches present some difference when comparing F-measure and ρ-accuracy.

Class TP FP Precision Recall F-Measure ρ-accuracy
Mixed-D 0.976 0.0 1.0 0.976 0.988 0.988
Bottom-up 0.534 0.0 1.0 0.534 0.697 0.674
Top-D 0.954 0.0 1.0 0.954 0.977 0.954
Top-C 0.991 0.0 1.0 0.991 0.996 0.991
Mixed-C 0.633 0.0 1.0 0.633 0.776 0.752
Table 10.53: OAEI 2010 Dataset, comparing approaches for Person

This implies that the definition of greedy matching decisions based on diachronic attributes may lead to false negative classifications. This problem is somewhat reduced when considering Top-C rules, as negative matching rules are in a sense more conservative. This case may also indicate that the syntactic difference between matching attributes is particularly relevant in this dataset. If we consider mixed rules that are the result of a plain integration (IPP) of bottom-up learned rules and combinatorial top-down rules (Mixed-C in table 10.53), we obtain worse classification performances than considering the mix of bottom-up and diachronic-only top-down rules. This can be explained by the fact that inverse-functional diachronic properties cannot be applied as a single means to support positive matching decisions. In fact, positive matching rules are the result of a combination of at least 4 attributes. Furthermore, the top-down combinatorial rules that satisfied many positive cases, as shown in Top-C of table 10.53, are merged with bottom-up learned rules, which proved to be more conservative. In addition, conservative threshold normalization raised the bar for positive matching classification, and top-down rule merging imposed more conservative thresholds. Therefore, the effectiveness of the applied rules is considerably reduced. This is confirmed by the fact that, simply selecting a relaxed threshold normalization for positive matching rules, the matching performances reach a ρ-accuracy of 0.897.
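The contrast between the two inconsistency normalization strategies discussed above can be sketched as follows, under an assumed rule encoding of (attribute, metric, operator, threshold) atoms; this is an illustration of the described behaviour, not the actual implementation.

```python
from collections import namedtuple

# Assumed encoding: a rule targets "match" or "non-match" and carries atoms
# of the form (attribute, similarity_metric, operator, threshold).
MatchingRule = namedtuple("MatchingRule", ["target", "atoms"])

def normalize_rule(rule, policy="NPIR"):
    # NPIR - drop atoms whose operator is inconsistent with the rule target,
    #        producing shorter rules that are easier to satisfy;
    # NPIC - flip the inconsistent operator instead, producing longer and
    #        therefore more conservative rules.
    expected = ">=" if rule.target == "match" else "<="

    def consistent(atom):
        _attr, _metric, op, _thr = atom
        return op == expected

    def flipped(atom):
        attr, metric, _op, thr = atom
        return (attr, metric, expected, thr)

    if policy == "NPIR":
        atoms = [a for a in rule.atoms if consistent(a)]
    elif policy == "NPIC":
        atoms = [a if consistent(a) else flipped(a) for a in rule.atoms]
    else:
        raise ValueError(f"unknown normalization policy: {policy}")
    return MatchingRule(rule.target, atoms)

rule = MatchingRule("match", [("name", "QGram", ">=", 0.9),
                              ("birthdate", "Equal", "<=", 0.5)])
print(len(normalize_rule(rule, "NPIR").atoms))  # 1 atom survives
print(len(normalize_rule(rule, "NPIC").atoms))  # 2 atoms, operator flipped
```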

Mixing Rules for Location

Considering the manually annotated dataset, the configurations that performed better in terms of ρ-accuracy this time relied on multiclass classification method, and are:

• Knowledge-driven comparison method, with Levenshtein similarity metric, applying inconsistency removal normalization (NPIR), threshold normalization relaxed for positive matching rules and relaxed for negative matching rules (TNRR), integrating plain positive and negative matching top-down rules (IPP) (table 10.54). The ρ-accuracy is 0.959.

TP FP Prec Rec F-2
M 0.019 0.002 0.889 0.444 0.593
N 0.937 0.036 0.963 0.995 0.979
U 0.002 0.002 0.5 0.167 0.25
Avg 0.883 0.034 0.953 0.959 0.951
Table 10.54: IPP, KNOWLEDGE, Levenshtein, Mixed rules

TP FP Prec Rec F-2
M 0.029 0.005 0.857 0.667 0.75
N 0.92 0.019 0.979 0.977 0.978
U 0.002 0.024 0.091 0.167 0.118
Avg 0.868 0.019 0.961 0.952 0.956
Table 10.55: IPP, GREEDY, Levenshtein, Mixed rules

TP FP Prec Rec F-2
M 0.029 0.005 0.857 0.667 0.75
N 0.92 0.022 0.977 0.977 0.977
U 0.002 0.022 0.1 0.167 0.125
Avg 0.868 0.021 0.959 0.952 0.955
Table 10.56: IPP, GREEDY, Levenshtein, Bottom up only

TP FP Prec Rec F-2
M 0.002 0.0 1.0 0.056 0.105
N 0.942 0.056 0.944 1.0 0.971
U 0.0 0.0 NaN 0.0 NaN
Avg 0.887 0.052 NaN 0.944 NaN
Table 10.57: IPTO, KNOWLEDGE, Jaro-Winkler, Top-down only

TP FP Prec Rec F-2
M 0.002 0.0 1.0 0.056 0.105
N 0.942 0.056 0.944 1.0 0.971
U 0.0 0.0 NaN 0.0 NaN
Avg 0.887 0.052 NaN 0.944 NaN
Table 10.58: IPTO, KNOWLEDGE, Jaro-Winkler, Combinatorial Top-down only

TP FP Prec Rec F-2
M 0.029 0.005 0.857 0.667 0.75
N 0.92 0.022 0.977 0.977 0.977
U 0.002 0.022 0.1 0.167 0.125
Avg 0.868 0.021 0.959 0.952 0.955
Table 10.59: IPTO, KNOWLEDGE, Jaro-Winkler, Combinatorial Top-down mixed

Table 10.60: Experiments for the evaluation of mixed rules on Manually Annotated dataset for Location

• Greedy comparison method, with Levenshtein similarity metric, applying inconsistency removal normalization (NPIR), threshold normalization relaxed for positive matching rules and relaxed for negative matching rules (TNRR), integrating only positive matching top-down rules (IPP) (table 10.55). The ρ-accuracy is 0.959.

Analyzing the results of the best performances for the two best comparison methods, the greedy matching approach is the one that performed better. In fact, observing the data in table 10.54 we can see that the Knowledge-driven configuration had a slightly higher positive matching precision, but also a lower recall. The Knowledge-driven method classified 'don't know' cases more precisely, but the difference between the ρ-accuracy and the F-measure is higher. Therefore, we will compare the results of the greedy approach applying the mixed rules with the results of applying only bottom-up rules, only top-down rules, and rules mixed with combinatorial top-down rules.
Considering the performances of bottom-up rules using a greedy approach (table 10.56), we can see that mixing rules did not provide any particular advantage, if not a small increase in negative matching precision. If we compare with the results of the experiment executed with diachronic top-down rules only (table 10.57), mixing the rules provided important improvements in terms of both precision and recall of positive matching rules. The application of combinatorial top-down rules did not provide any significant change in performance with respect to the simple application of diachronic rules (table 10.58), though it still differs from the mix of top-down diachronic and bottom-up rules. Also mixing combinatorial top-down rules with bottom-up learned rules did not provide any significant change. This is also probably due to the fact that the bottom-up process learned rules very similar to those obtained through combinations of functional attributes, and therefore the application of relaxed thresholds uniformed the top-down combinatorial rules to the bottom-up and mixed rules.
Considering the NYT evaluation dataset for location, the configuration that performed better than the others is the greedy approach with the character-based relative completeness comparison method, using the binary classification approach, applying the inconsistency normalization process (NPIC) and applying conservative thresholds on the Jaccard similarity metric. The best results integrated only positive matching top-down rules. Comparing the results of the experiments run on the NYT dataset, we can notice that the application of a mixed approach performed better than the pure bottom-up approach, as shown in table 10.61. It seems that the integration of positive-only diachronic attribute-based matching rules allowed taking matching decisions on attributes that were not included among the bottom-up rules (Mixed-D in table 10.61). The performances are not even comparable when considering only matching with diachronic rules (Top-D in table 10.61); in fact, the thresholds considered hardly allowed supporting matching decisions. This may be a sign that inverse-functional properties in the dataset have a very low similarity in terms of the Jaccard similarity metric. A more in-depth analysis of the characteristics of the matched dataset is going to be presented in section 10.2.4. Considering rules built relying on the combination of attributes defined in the identification ontology (Top-C in table 10.61), the results obtained are not better than top-down diachronic only, showing once again that learning matching thresholds provides important improvements also in terms of classification. In fact, the threshold used for combinatorial negative rules was set to 0.5, which proved to be too relaxed, and the threshold for positive match, 0.9, was too conservative.
This case may also indicate that the syntactic difference between matching attributes is particularly relevant in this dataset. If we consider mixing bottom-up learned rules and combinatorial top-down rules (Mixed-C in table 10.61), we obtain worse classification results, with the same precision, recall and F-measure as the bottom-up approach but a larger number of false negative classifications, as shown by the difference between the F-measure and the ρ-accuracy.

Mixing Rules for Organization

Considering the manually annotated dataset for entity type organization, there were several configurations that performed equally in terms of ρ-accuracy. The one we decided to analyze and present in this context is:

Class TP FP Precision Recall F-Measure ρ-accuracy
Mixed-D 0.619 0.0 1.0 0.619 0.765 0.765
Bottom-up 0.601 0.0 1.0 0.601 0.751 0.751
Top-D 0.03 0.0 1.0 0.03 0.057 0.032
Top-C 0.03 0.0 1.0 0.03 0.057 0.032
Mixed-C 0.601 0.0 1.0 0.601 0.751 0.612
Table 10.61: NYT Dataset, comparing approaches for Location

TP FP Prec Rec F-2
M 0.017 0.039 0.304 0.706 0.425
N 0.504 0.009 0.983 0.526 0.686
U 0.006 0.425 0.013 0.32 0.025
Avg 0.484 0.017 0.949 0.527 0.668
Table 10.62: IPNO, KNOWLEDGE, Needleman-Wunsch, Mixed rules

TP FP Prec Rec F-2
M 0.017 0.045 0.276 0.706 0.397
N 0.501 0.009 0.983 0.523 0.683
U 0.006 0.423 0.013 0.32 0.026
Avg 0.48 0.017 0.949 0.524 0.664
Table 10.63: IPP, KNOWLEDGE, Needleman-Wunsch, Mixed rules

TP FP Prec Rec F-2
M 0.017 0.039 0.304 0.706 0.425
N 0.231 0.004 0.982 0.241 0.387
U 0.009 0.699 0.013 0.52 0.026
Avg 0.222 0.018 0.948 0.257 0.381
Table 10.64: IPNO, KNOWLEDGE, Needleman-Wunsch, Bottom up only

TP FP Prec Rec F-2
M 0.001 0.006 0.1 0.029 0.045
N 0.934 0.021 0.978 0.975 0.976
U 0.005 0.032 0.135 0.28 0.182
Avg 0.895 0.021 0.941 0.94 0.94
Table 10.65: KNOWLEDGE, Needleman-Wunsch, Diachronic only Top-down

TP FP Prec Rec F-2
M 0.001 0.007 0.091 0.029 0.044
N 0.934 0.021 0.978 0.975 0.976
U 0.005 0.032 0.135 0.28 0.182
Avg 0.894 0.021 0.941 0.939 0.939
Table 10.66: KNOWLEDGE, Needleman-Wunsch, Combinatorial only Top-down

TP FP Prec Rec F-2
M 0.017 0.039 0.304 0.706 0.425
N 0.231 0.004 0.982 0.241 0.387
U 0.009 0.699 0.013 0.52 0.026
Avg 0.222 0.018 0.948 0.257 0.381
Table 10.67: IPNO, KNOWLEDGE, Needleman-Wunsch, Mixing with Combinatorial Top-down

Table 10.68: Experiments for mixed rules evaluation on Manually Annotated dataset for Organization

• Knowledge-driven comparison method, with Needleman-Wunsch similarity metric, applying inconsistency removal normalization (NPIR), threshold normalization conservative for positive matching rules and negative matching rules (TNCC), integrating only negative matching top-down rules (IPNO) (table 10.62). The ρ-accuracy is 0.959. If we consider plain rules integration, the ρ-accuracy is the same, but the precision of positive match decreases (table 10.63). This seems to suggest that positive matching rules caused some false positive matching. This aspect will be analyzed more in depth in section 10.2.4.

Comparing with the matching performances obtained relying only on the bottom-up approach (table 10.64), we can see how the mixed approach including negative matching rules supported a higher recall of negative matching cases. If we consider only top-down diachronic rules, the matching performances are surprisingly good (table 10.65). The precision and recall of the negative matching class are much higher than with the mixed rules approach, probably due to the fact that the conservative threshold for negative samples, set to 0.5 for diachronic attributes, is more relaxed than the one defined in the bottom-up rules extraction with conservative threshold normalization.

Class TP FP Precision Recall F-Measure ρ-accuracy
Mixed-D 0.026 0.0 1.0 0.026 0.05 0.046
Bottom-up 0.025 0.0 1.0 0.026 0.05 0.046
Top-D 0.005 0.598 0.008 1.0 0.016 0.006
Top-C 0.00 0.0 0.0 0.00 0.00 0.00
Mixed-C 0.027 0.0 1.0 0.027 0.052 0.046
Table 10.69: NYT Dataset, comparing approaches for Organization

However, top-down diachronic rules perform a very greedy positive match classification, resulting in a very low precision compared with the mixed rules approach. In fact, the F-measure of the positive match is 10 times worse than that obtained with mixed rules, and therefore also in terms of ρ-accuracy the performances are a little worse (0.95). Considering top-down rules built by combining functional attributes defined in the identification ontology (table 10.66), the positive trend is confirmed, but the overall matching performances are still a little worse than those obtained with mixed rules, if we consider ρ-accuracy. The main defect of matching with these rules is related to the greedy positive match decisions, which cause a relatively high number of false positive matches. If we consider mixing combinatorial top-down rules with bottom-up learned rules (table 10.67), the matching performances decrease in terms of negative matching recall. This is due to the conservative approach in the merging processes, decreasing the effectiveness in taking negative matching decisions.
Performing experiments with the NYT dataset for organization, the performances are particularly disappointing. Several configurations performed in the same (bad) way in terms of accuracy. For a matter of consistency, we chose also this time to present the results obtained comparing entities using the knowledge-driven approach. In particular we present details for the configuration:
• Knowledge-driven comparison method, applying inconsistencies removal normalization (NPIR), applying relaxed thresholds for positive matching rules and conservative thresholds for negative matching rules, integrating positive and negative top-down diachronic rules (IPP) and relying on Jaro-Winkler string similarity metric.
The matching performances in all the cases considered are very bad, as shown in table 10.69. A deeper analysis of the matched descriptions is required to provide a justification for such bad results. A poor training set could justify the performances of bottom-up and mixed matching, but the fact that also top-down only matching failed seems to suggest that perhaps there is some problem in the data; namely, what is declared to be the same is not really the same.

Class TP FP Precision Recall F-Measure ρ-accuracy
Mixed-D 0.876 0.0 1.0 0.876 0.934 0.934
Bottom-up 0.618 0.0 1.0 0.618 0.764 0.759
Top-D 0.854 0.0 1.0 0.854 0.921 0.863
Top-C 0.854 0.0 1.0 0.854 0.921 0.863
Mixed-C 0.618 0.0 1.0 0.618 0.764 0.719
Table 10.70: OAEI 2010 Dataset, comparing approaches for Organization

If we consider the OAEI 2010 restaurant dataset, the configuration that performed better is the following:

• Knowledge-driven comparison method, applying inconsistency normalization (NPIC), relying on conservative threshold normalization for both positive and negative matching rules, integrating both positive and negative top-down diachronic matching rules, and relying on Taglink similarity metric. The score in terms of ρ-accuracy is 0.934, as presented in table 10.70.

Comparing the results of the different approaches to the definition of rules (table 10.70), we can notice how the application of mixed rules involving top-down diachronic rules and bottom-up learned rules is the one performing better. All the methods matched with perfect precision, but the mixed rules involving top-down diachronic rules granted a higher recall. This is probably due to the fact that the evaluation set presented matching inverse-functional properties, which were often sufficient to take a positive matching decision. The effect of inverse-functional properties is reduced when considering top-down combinatorial rules, as positive matching rules require the satisfaction of a larger number of attributes (a minimum of 4, as described in section 8.1).

Method Person Organization Location
Greedy 0.500 0.384 0.336
Knowledge 0.508 0.379 0.367
Char-RC 0.502 0.384 0.377
Table 10.71: Comparison Method Average ρ-accuracy


Method Person Organization Location
Equal 0.443 0.178 0.146
Levenshtein 0.523 0.456 0.406
Euclidean 0.492 0.486 0.364
Jaccard 0.523 0.343 0.320
Jaro-Winkler 0.541 0.494 0.387
Monge-Elkan 0.500 0.246 0.490
Overlap 0.483 0.242 0.304
QGram 0.541 0.561 0.398
Smith-Waterman 0.547 0.249 0.404
Needleman-Wunsch 0.539 0.317 0.429
Taglink 0.529 0.614 0.412
Table 10.72: String Similarity Average ρ-accuracy

10.2.4 Experiments Results Analysis

In the previous sections we presented the results of an extensive, but not complete, set of experiments. In particular, through these experiments we aimed at evaluating the feasibility of the proposed approach as a reliable solution to real world entity matching problems. To perform such an evaluation, we chose three evaluation sets presenting different characteristics (see section 10.1). Analyzing the results, it is possible to assert that both the string similarity metric and the comparison method strongly affect both the learning process and the matching results. At the same time, relying on the best cases analyzed, no best method clearly emerged, nor did a string similarity metric show to perform particularly better than the others. However, in the previous sections we chose to analyze the best cases, without considering overall performances on all the datasets. Therefore, we decided to perform this analysis, and compute the average ρ-accuracy of all comparison methods relying on each of the single similarity metrics, and the average ρ-accuracy of each similarity metric based on each of the comparison methods. The best combination on the ranked lists is the one selected as working better, on average. In table 10.71 we show that for person the Knowledge-driven comparison method worked on average slightly better than the others. For Organization, Greedy and Char-RC performed equally, whereas for Location Char-RC performed better. In table 10.72 we show that for the entity type Person the string similarity metric that on average performed better is Smith-Waterman. Considering the entity type Organization, the similarity metric that performed better on average is Taglink, whereas for the entity type Location it is Monge-Elkan.
Comparing the results of the single approaches for rule definition presented in sections 10.2.1 and 10.2.2 with the results of mixing these rules and normalizing them, we can confirm that rules obtained mixing bottom-up and top-down supported more accurate matching decisions than each of the single approaches. Only in the OAEI 2010 dataset for entity type person did the experiment executed mixing rules resulting from the combination of attributes defined in the identification ontology perform better than the configuration involving mixed rules learned bottom-up and defined top-down relying on functional and inverse-functional diachronic properties.
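A minimal sketch of the aggregation behind tables 10.71 and 10.72, assuming a hypothetical flat record of experiment results (the record format and function name are illustrative), could look as follows:

```python
from collections import defaultdict
from statistics import mean

def average_rho_accuracy(results):
    # results: iterable of (comparison_method, similarity_metric, rho_accuracy)
    # records, one per experiment configuration (hypothetical record format).
    # Returns per-method and per-metric averages, mirroring the aggregation
    # used for tables 10.71 and 10.72.
    by_method, by_metric = defaultdict(list), defaultdict(list)
    for method, metric, rho in results:
        by_method[method].append(rho)
        by_metric[metric].append(rho)
    return ({m: round(mean(v), 3) for m, v in by_method.items()},
            {m: round(mean(v), 3) for m, v in by_metric.items()})

# Toy example with three experiment records.
runs = [("Greedy", "Levenshtein", 0.955),
        ("Knowledge", "Levenshtein", 0.955),
        ("Char-RC", "Jaro-Winkler", 0.948)]
per_method, per_metric = average_rho_accuracy(runs)
print(per_method, per_metric)
```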

name: Edmund Andrews; occupation: reporter; description: Edmund L. Andrews is a former economics reporter;

isPrimaryTopicOf: Edmund Andrews; wasDerivedFrom: Edmund Andrews?oldid=481764669; wikiPageDisambiguates: Edmund Andrews (reporter); wikiPageDisambiguates: Edmund Andrews (surgeon);

Table 10.73: Examples of Ambiguous Descriptions Filtered

Considering the OAEI 2010 datasets, the proposed method performed in an excellent way. On the OAEI 2010 dataset for person, the best configuration performed with a ρ-accuracy of 0.988. The F-measure of 0.988 shows that the matching samples that could not be matched were classified as don't know. This score is in line with RiMOM [142] (0.97), the best tool among those that ran the experiment on that dataset at the OAEI 2010 initiative5. Considering the restaurant dataset, our method produced a score of 0.934. This score is more than 10 points above the result of RiMOM [142] (0.81). This shows that the method can perform well when data perturbation is related to syntactic differences that can be handled by the different string similarity metrics.

Considering the New York Times dataset, the proposed method performed quite well for person, with a ρ-accuracy of 0.879 mixing learned rules with top-down diachronic rules, and 0.916 mixing learned rules with combinatorial rules. This is the only case where mixing combinatorial rules with bottom-up rules provided advantages. The NYT dataset for person is particularly challenging. First of all, the manually maintained links between DBpedia and Freebase are not always up to date. In fact, gathering descriptions from the sources led us to the necessity of filtering ambiguous links: several owl:sameAs links, when resolved, lead to a disambiguation description, or to an empty description on the DBpedia side (see table 10.73 for an example). In many cases, we had to refresh descriptions by following redirect links. Furthermore, curiously, several owl:sameAs links connected descriptions of entities declaring types mapped to different classes in the identification ontology, or not declaring any specific type. At this point, we do not deal with these descriptions, and simply filter them out of the evaluation set.

5 http://www.instancematching.org/oaei/imei2010/pr.html
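The following sketch gives a rough idea of the kind of filtering just described; the attribute names and the tests used to spot disambiguation pages and empty descriptions are simplifying assumptions of ours, not the actual code of the gathering pipeline.

def is_ambiguous(description: dict) -> bool:
    """Flag descriptions that resolve to disambiguation pages or are empty.

    A description is modeled as a dict mapping attribute names to lists of
    values (the attribute names used here are illustrative).
    """
    if not description:
        return True  # empty description on the DBpedia side
    if description.get("wikiPageDisambiguates"):
        return True  # owl:sameAs resolved to a 'disambiguation' resource
    # Descriptions carrying only provenance attributes are not usable either.
    informative = set(description) - {"isPrimaryTopicOf", "wasDerivedFrom"}
    return len(informative) == 0

def filter_evaluation_set(pairs):
    """Drop sample pairs where either side is ambiguous or empty."""
    kept, dropped = [], []
    for left, right in pairs:
        (dropped if is_ambiguous(left) or is_ambiguous(right) else kept).append((left, right))
    return kept, dropped

if __name__ == "__main__":
    ambiguous = {"name": ["Edmund Andrews"],
                 "wikiPageDisambiguates": ["Edmund Andrews (reporter)",
                                           "Edmund Andrews (surgeon)"]}
    usable = {"name": ["Edmund Andrews"], "occupation": ["reporter"]}
    kept, dropped = filter_evaluation_set([(ambiguous, usable), (usable, usable)])
    print(len(kept), "kept,", len(dropped), "dropped")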

In all, we discarded about 500 descriptions, corresponding to 11.6% of the samples. It is important to notice that the imperfect recall that led to an F-measure of 0.879 is due to structural heterogeneity affecting the considered descriptions. Consider the descriptions presented in tables 10.74 and 10.75. The descriptions are clearly matching, but the details about the date of birth and date of death do not allow a positive matching decision to be taken. Therefore, a don't know decision is taken.

name: Wolfgang Wagner; last name: Wagner; birthdate: 1919; description: Wolfgang Wagner (30 August 1919 - 21 March 2010); date of death: 2010; first name: Wolfgang; domain tag: Category:People from Bayreuth; domain tag: Category:Wagner family; domain tag: Category:German people of English descent; domain tag: Category:German opera directors; website: http://www.imdb.com/title/tt0111475/

Table 10.74: http://dbpedia.org/resource/Wolfgang_Wagner

In some cases, the rules supported a negative matching decision which, in this case, has to be considered a mistake. Consider for example the descriptions in tables 10.76 and 10.77. Our solution decided that the pair is not a match, but the mistake is due to an error in the data.

name: Wolfgang Wagner; last name: Wagner; birthdate: 1919-08-30; description: Wolfgang Wagner (30 August 1919 - 21 March 2010); date of death: 2010-03-21; first name: Wolfgang; occupation: Theatre Director; birthplace: Bayreuth; domain tag: Opera Director; domain tag: Category:German opera directors; gender: Male

Table 10.75: http://www.freebase.com/view/en/wolfgang_wagner

In fact, the birthdate is different, and only one of the two can be right. The knowledge-based solution is not robust with respect to this type of error. Furthermore, the name attributes are particularly different, thus the error is acceptable. All the false negative matches analyzed in the dataset are due to errors in the data. Therefore, we can assume that the method works well when the data are correct.

name: Ali Khamenei; last name: Khamenei; birthdate: 1939-07-17; description: Ayatollah Seyed Ali Hosseini Khamenei ...; first name: Wolfgang; birthplace: Ali; occupation: Politician; city of residence: Mashhad; occupation: President - President of Iran; affiliation: Islamic Republican party; domain tag: Politician; birthyear: 1939; birthmonth: 6; day of birth: 17; ...

Table 10.76: http://www.freebase.com/view/en/ali_khamenei

Analyzing the organization dataset, we can see that performances are particularly disappointing. The explanation for such poor results lies in the difference between the nature of the training set and that of the NYT evaluation dataset. The training set we evaluated produced the following entity matching rules:

name < 0.96912 and city < 0.758313 and street_address < 0.846489 then NON_MATCHING
name > 0.992982 and city > 0.758313 and street_address > 0.846489 then MATCHING
email_address_hashcode >= 1.0 then MATCHING
public_institutional_id >= 1.0 then MATCHING
picture_URL >= 1.0 then MATCHING
website >= 1.0 then MATCHING
phone_nr >= 1.0 then MATCHING
fax_nr >= 1.0 then MATCHING
email_address >= 1.0 then MATCHING

These rules are very simple, perhaps even too simple due to the limited size of the training set, yet they seem reasonable and not too strict. However, looking at the data samples, none of the compared descriptions presented together the attributes necessary to take a matching decision. Consider for example the descriptions presented in tables 10.78 and 10.79.
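To show how threshold rules of this shape translate into a three-valued decision, the sketch below encodes the listing above and applies it to per-attribute similarity scores; the rule encoding and the evaluation order (the first rule that fires wins, otherwise don't know) are our reading of the listing, not the actual Fingerprint Match implementation.

MATCH, NON_MATCH, DONT_KNOW = "MATCHING", "NON_MATCHING", "DON'T KNOW"

# Each rule is (list of (attribute, operator, threshold), outcome), encoded
# from the organization rules listed above; attributes missing from a pair
# simply leave the corresponding condition unsatisfied.
RULES = [
    ([("name", "<", 0.96912), ("city", "<", 0.758313),
      ("street_address", "<", 0.846489)], NON_MATCH),
    ([("name", ">", 0.992982), ("city", ">", 0.758313),
      ("street_address", ">", 0.846489)], MATCH),
    ([("email_address_hashcode", ">=", 1.0)], MATCH),
    ([("public_institutional_id", ">=", 1.0)], MATCH),
    ([("picture_URL", ">=", 1.0)], MATCH),
    ([("website", ">=", 1.0)], MATCH),
    ([("phone_nr", ">=", 1.0)], MATCH),
    ([("fax_nr", ">=", 1.0)], MATCH),
    ([("email_address", ">=", 1.0)], MATCH),
]

OPS = {"<": lambda x, t: x < t, ">": lambda x, t: x > t, ">=": lambda x, t: x >= t}

def classify(similarities: dict) -> str:
    """Apply the rules to per-attribute similarity scores.

    `similarities` maps attribute names to scores in [0, 1] for the attributes
    both descriptions actually share; if no rule fires, the open-world answer
    is don't know.
    """
    for conditions, outcome in RULES:
        if all(attr in similarities and OPS[op](similarities[attr], threshold)
               for attr, op, threshold in conditions):
            return outcome
    return DONT_KNOW

if __name__ == "__main__":
    # Only the names are comparable: no rule can fire, hence don't know.
    print(classify({"name": 0.98}))
    # All three attributes comparable and similar: positive match.
    print(classify({"name": 0.995, "city": 0.9, "street_address": 0.9}))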

name: Seyed Ali Hosseini Khamenei; name: Seyyed Ali Hosseini Khamenei; last name: Khamenei; birthdate: 15-07-1939; description: Grand Ayatollah Sayyed Ali Hosseini Khamenei ...; first name: Seyyed Ali Hosseini; occupation: 3rd President of Iran; birthplace: Mashhad, Iran; member of: Usuli; domain tag: Category:Presidents of Iran; gender: Male; birthyear: 1939; birthmonth: 6; day of birth: 15; ...

Table 10.77: http://dbpedia.org/resource/Ali_Khamenei

Looking at the descriptions, it is possible to see how the rules above cannot be applied, as the descriptions, besides the name, have very few attributes that can be matched considering their semantics. This is a limit of our knowledge-based solution, given that in this context matching on the name alone would lead to high recall and precision.

name: 1st Constitution Bancorp; controls: 1st Constitution Capital Trust II; has members: Charles S Crow III; has key people: Charles S Crow III; description: 1st Constitution Bancorp (NASDAQ:FCCY) is the New Jersey holding company ...; organization type: Public company; occupation: 3rd President of Iran; street address: 1285 WESTHAVEN CIRCLE, Vail, Colorado, United States of America 81657; organization type: Public Company; street address: Cranbury CDP, New Jersey; street address: 2650 ROUTE 130, Cranbury, New Jersey, United States of America 08512; activity sector: Financial Services; foundation date: 1989 ...

Table 10.78: http://www.freebase.com/view/en/1st_constitution_bancorp

Nevertheless, we believe that it would do so for the wrong reasons, and not in a reliable way. Therefore, we accept our poor performance on this dataset, hoping for better performance on other types of datasets. All the entities manually analyzed present the structure and the types of attributes shown in the tables above. If we consider the evaluation on the NYT dataset for location, the performances of the best case are also below the performances reported for example in [118]. Analyzing our best result, though, we realized that the best classification, obtained relying on greedy comparison weighted with Character-based Relative Completeness, relied on extremely simple rules:

name: 1st Constitution Bancorp; activity start year: 1989; activity sector: Bank; description: 1st Constitution Bancorp is the New Jersey holdin ...; has key people: Robert F. Mangano; has location: Cranbury, New Jersey; organization type: Public Company; domain tag: Category:Banks established in 1989; domain tag: Category:Companies based in Middlesex County, New; domain tag: Category:Companies listed on NASDAQ; domain tag: Category:Banks based in New Jersey; slogan: Community Banking With You In Mind ...

Table 10.79: http://dbpedia.org/resource/1st_Constitution_Bancorp

location name: Mont Saint-Michel de Bras-Part; location name: Montagne Saint-Michel; location name: Mont Saint-Michel-d'Arre; first level administrative parent: Bretagne; fourth level administrative parent: Saint-Rivoal; second level administrative parent: Finistère; third level administrative parent: Arrondissement de Châteaulin; latitude: 48.35; country: France; longitude: -3.95; geocoordinate: 48.35 -3.95; picture URL: http://www.geonames.org/2978007/mont-de-saint-michel.html ...

Table 10.80: http://sws.geonames.org/2978007/

location_name =< 0.714286 NON_MATCHING
location_name >= 0.888889 MATCHING
picture_URL >= 1.0 MATCHING
website >= 1.0 MATCHING
coordinate_geometry >= 1.0 MATCHING
geocoordinate >= 1.0 MATCHING

The don't know classifications are due to the cases where the name similarity falls between 0.71 and 0.88. We do not believe that this approach to classification, despite the result, is suitable for a reliable entity matching solution.

location name: Mont Saint-Michel; location name: Montagne Saint-Michel; location name: Mont Saint-Michel-d'Arre; is contained by: Manche; location type: Governmental Jurisdiction; location type: Tourist attraction; location type: Island; latitude: 48.6356; timezone: Central European Time; longitude: -1.5111; geocoordinate: 48.6356, -1.5111; picture URL: http://api.freebase.com/api/trans/image_thumb/en/ ...

Table 10.81: http://www.freebase.com/view/en/mont_saint-michel

In fact, considering the manually annotated dataset, this type of classification produced very poor results, due to several homonym locations and thus false positive matching decisions. If we perform the entity matching experiment relying on the configuration that performed best on the manual dataset of locations, we obtain a much larger number of “don't know” classified samples, due to these more restrictive rules:

latitude =< 0.733333 and location_name =< 0.940117 NON_MATCHING
longitude > 0.849817 and location_name > 0.940117 and latitude > 0.913333 MATCHING
longitude =< 0.849817 NON_MATCHING
picture_URL >= 1.0 MATCHING
website >= 1.0 MATCHING
coordinate_geometry >= 1.0 MATCHING
geocoordinate >= 1.0 MATCHING

The large number of “don't know” classified samples is due to syntactic differences in the representation of the latitude and longitude attributes. In fact, Geonames, DBpedia and Freebase tend to represent these important attributes in very heterogeneous ways, making it hard to perform matching relying on string similarity metrics. Consider for example the samples in tables 10.80 and 10.81.

location name: Le Mont Saint-Michel; is contained by: Basse-Normandie; is contained by: Lower Normandy; description: Mont Saint-Michel is a rocky tidal island and a ...; location type: Communes of Manche; latitude: 49; latitude: 48.636; coordinate geometry: POINT(-1.5114 48.636); longitude: -2; longitude: -1.5114; geocoordinate: 48.636 -1.5114; postal code: 50116; country: France ...

Table 10.82: http://dbpedia.org/resource/Mont_Saint-Michel

As shown in the descriptions, both the names and the coordinates are quite different, in terms of granularity and also in terms of latitude. This is probably due to the fact that the geo-coordinates are defined according to different coordinate systems, and thus produce very different values although referring to the same point. Honestly, were it not for the fame of the location as a tourist attraction, we would not dare say that these two descriptions refer to the same location, even though it is clearly possible. Considering the descriptions in tables 10.81 and 10.82, the comparison seems easier, even though in this case the location names are quite different from a syntactic perspective.
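The heterogeneity discussed above suggests treating geo-coordinates as numbers rather than strings. The sketch below, which is not part of the evaluated system, parses some of the coordinate formats appearing in tables 10.80-10.82 and compares points using a great-circle distance tolerance; the 5 km tolerance is an arbitrary illustrative value.

import math
import re

def parse_point(value: str):
    """Extract (lat, lon) from the formats seen above: 'lat lon',
    'lat, lon', or WKT 'POINT(lon lat)'. Returns None if unparsable."""
    wkt = re.match(r"\s*POINT\s*\(\s*(-?\d+(?:\.\d+)?)\s+(-?\d+(?:\.\d+)?)\s*\)", value)
    if wkt:
        lon, lat = float(wkt.group(1)), float(wkt.group(2))
        return lat, lon
    nums = re.findall(r"-?\d+(?:\.\d+)?", value)
    if len(nums) >= 2:
        return float(nums[0]), float(nums[1])
    return None

def haversine_km(p1, p2):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*p1, *p2))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(a))

def coordinates_match(v1: str, v2: str, tolerance_km: float = 5.0):
    """Three-valued comparison: True/False when both values parse, None otherwise."""
    p1, p2 = parse_point(v1), parse_point(v2)
    if p1 is None or p2 is None:
        return None  # don't know
    return haversine_km(p1, p2) <= tolerance_km

if __name__ == "__main__":
    print(coordinates_match("48.35 -3.95", "48.6356, -1.5111"))           # False
    print(coordinates_match("POINT(-1.5114 48.636)", "48.6356 -1.5111"))  # True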

10.3 Comparing with the FBEM Matcher

Figure 10.1: Distribution of FBEM scores NYT

In this section we present a comparative study with the feature-based entity matching solution (FBEM) described in [139]. The FBEM solution is the default matching module of the Okkam Entity Name System, and is aimed at solving the problem of entity matching under the conditions we considered in this context, that is, matching descriptions on the Web in the presence of semantic and structural heterogeneity. FBEM is a distance-based solution, combining ontological knowledge with probabilistic methods. The outcome of the FBEM algorithm is thus not a matching decision, but rather a similarity score that can be used to rank objects. In order to compare the results of the FBEM solution in a fair way, we decided to use the training set to learn the similarity thresholds that maximize the F-measure of the classification of the training set. In order to do so, we implemented a simple algorithm that iteratively decreases the matching threshold, looking for the value maximizing the F-measure of the positive matching classification. Complementarily, we iteratively learned the similarity threshold for negative matching cases by incrementing the threshold, with an upper bound defined by the positive matching threshold previously learned. We repeated the iterative threshold learning process 100 times, incrementing and decrementing the thresholds by 0.01 at each iteration. We are aware that this method is very basic, but it serves the purpose of defining a comparison between FBEM and the proposed solution. Notice that we applied the same process to the solution we propose. For reasons of space, we present only the results of the evaluation on the NYT datasets, which are the ones that challenged our approach the most.

Table 10.83 reports the results of the FBEM classification of the New York Times dataset for person. The iterative threshold algorithm supported the definition of similarity thresholds of 0.53 for positive matching decisions and 0.52 for negative matching decisions. As shown in table 10.83, the FBEM algorithm classified with good precision, but the recall is quite low, despite the fact that the positive matching similarity threshold is relatively low. An explanation of this performance can be given by analyzing the graph presented in figure 10.1.
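A minimal sketch of the kind of iterative threshold search just described is given below; the F-measure computation, the data layout and the stopping behaviour are simplified assumptions, not the code used in the experiments.

def f_measure(predictions, labels, positive_class):
    """Harmonic mean of precision and recall for one class."""
    tp = sum(p == positive_class == l for p, l in zip(predictions, labels))
    fp = sum(p == positive_class != l for p, l in zip(predictions, labels))
    fn = sum(l == positive_class != p for p, l in zip(predictions, labels))
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def learn_thresholds(scores, labels, iterations=100, step=0.01):
    """Greedy search for the positive and negative matching thresholds.

    `scores` are FBEM-style similarity scores, `labels` are 'match'/'non-match'.
    The positive threshold is decreased and the negative one increased (kept
    below the positive one), retaining the values that maximize F-measure.
    """
    pos_thr, neg_thr = 1.0, 0.0
    best_pos_f = best_neg_f = -1.0
    best_pos, best_neg = pos_thr, neg_thr
    for _ in range(iterations):
        pos_thr -= step
        preds = ["match" if s >= pos_thr else "non-match" for s in scores]
        f = f_measure(preds, labels, "match")
        if f > best_pos_f:
            best_pos_f, best_pos = f, pos_thr

        neg_thr = min(neg_thr + step, best_pos)  # bounded by the positive threshold
        preds = ["non-match" if s <= neg_thr else "match" for s in scores]
        f = f_measure(preds, labels, "non-match")
        if f > best_neg_f:
            best_neg_f, best_neg = f, neg_thr
    return best_pos, best_neg

if __name__ == "__main__":
    # Toy training data: (similarity score, gold label).
    data = [(0.91, "match"), (0.75, "match"), (0.55, "match"),
            (0.48, "non-match"), (0.30, "non-match"), (0.12, "non-match")]
    scores, labels = zip(*data)
    print(learn_thresholds(list(scores), list(labels)))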

class          TP      FP      Precision   Recall   F-measure
match          0.432   0.000   1.000       0.433    0.605
non-match      0.002   0.548   0.004       1.000    0.007
don't know     0.000   0.018   NaN         NaN      NaN
weighted avg   0.432   0.001   0.998       NaN      NaN

Table 10.83: FBEM Threshold Based comparison NYT

Evaluating Fingerprint, we performed a slightly different experiment, maximizing the satisfaction of the don't know score. In fact, the positive and negative match scores are always either 0.0 or 1.0. Therefore, maximizing the F-measure of the don't know classification, we defined a threshold of 0.52, which is very similar to the one defined for FBEM.

With this threshold, the experiment produced the results presented in table 10.84. The fact that the solution we propose defines a clear, boolean clause for taking matching decisions allows recall to be improved through threshold management while keeping the precision of the negative classification. In fact, with respect to FBEM, the solution we propose produced a much lower number of false negative matching decisions, which are due to errors in the data. This aspect is further highlighted by analyzing figure 10.2.

Figure 10.2: Fingerprint Score Distribution

Therefore, with this experiment we showed that, for the entity type Person, the Fingerprint method provides a robust solution performing better than FBEM in all aspects of the comparison. In fact, Fingerprint Match, by selecting the attributes for string comparison based on the rules, considerably reduces the number of expensive string matching operations. This allows matching decisions to be taken in 7.8 ms per comparison on average on a regular laptop, versus the 406 ms on average required by FBEM on the same machine.

class          TP      FP      Precision   Recall   F-measure
match          0.973   0.000   1.000       0.975    0.987
non-match      0.000   0.009   0.000       0.000    NaN
don't know     0.000   0.017   0.000       NaN      NaN
weighted avg   0.971   0.001   0.998       NaN      NaN

Table 10.84: Fingerprint Threshold Based comparison NYT for Person

Part IV

Conclusions and Appendix

Chapter 11

Conclusions and Future Work

In this work, we defined, implemented and evaluated a knowledge-based framework for the solution of the entity matching problem in the open and wide context of the (Semantic) Web.

As a first step, we defined and formally validated a lightweight ontology, defining 3 main entity types: person, location and organization. For these classes, we defined respectively 39, 31 and 43 features. We formally grounded the definition of the contextual mappings used to harmonize the semantics of existing vocabularies and schemas with the defined ontology. The person entity type was mapped to 23 equivalent classes defined in other ontologies, and to 207 other types that could be considered subclasses. The properties of entity type person were mapped to 999 attributes defined in other ontologies and schemas. The organization entity type was mapped to 20 equivalent classes and 2551 subclasses (a large part of these from the Yago ontology). The features of the organization type were mapped to 1125 attributes defined in different schemas and ontologies. The location entity type was mapped to 22 equivalent classes and 2325 subclasses. The features of entity type location were mapped to 368 properties defined in other ontologies. Each of the considered features was analyzed using formal ontology tools in order to annotate part of them with meta-properties supporting the definition of entity matching rules. In particular, we analyzed the defined features annotating them as functional and inverse-functional diachronic properties, establishing whether these could be suitable as diachronic identity criteria for the considered entity types.
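As a toy illustration of what such contextual mappings amount to, the sketch below rewrites a raw description into the vocabulary of the identification ontology; the class-level entries are taken from the tables in appendix A, while the property-level entries and the helper functions are invented for the example.

# Illustrative fragment of a contextual mapping from source vocabularies to
# the identification ontology; the class-level entries appear in appendix A,
# the property-level entries below are invented examples.
CLASS_MAPPING = {
    "http://xmlns.com/foaf/0.1/Person": "person",
    "http://schema.org/Person": "person",
    "http://dbpedia.org/ontology/Organisation": "organization",
    "http://www.geonames.org/ontology#Feature": "location",
}

PROPERTY_MAPPING = {  # hypothetical attribute-level mappings
    "http://xmlns.com/foaf/0.1/name": "name",
    "http://dbpedia.org/ontology/birthDate": "birthdate",
    "http://schema.org/birthDate": "birthdate",
    "http://www.geonames.org/ontology#name": "location_name",
}

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

def entity_type(description: dict):
    """Map declared rdf:type values to one of the three entity types, if possible."""
    for type_uri in description.get(RDF_TYPE, []):
        if type_uri in CLASS_MAPPING:
            return CLASS_MAPPING[type_uri]
    return None

def harmonize(description: dict) -> dict:
    """Rewrite a raw {predicate URI: values} description into the
    identification ontology's feature names, dropping unmapped predicates."""
    harmonized = {}
    for predicate, values in description.items():
        feature = PROPERTY_MAPPING.get(predicate)
        if feature is not None:
            harmonized.setdefault(feature, []).extend(values)
    return harmonized

if __name__ == "__main__":
    raw = {
        RDF_TYPE: ["http://xmlns.com/foaf/0.1/Person"],
        "http://xmlns.com/foaf/0.1/name": ["Wolfgang Wagner"],
        "http://dbpedia.org/ontology/birthDate": ["1919-08-30"],
        "http://example.org/unmapped": ["ignored"],
    }
    print(entity_type(raw))   # person
    print(harmonize(raw))     # {'name': [...], 'birthdate': [...]}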

As a second step, we provided a formal definition of entity matching rules using the Three-Valued Logic of Kleene and some principles of Intuitionistic Logic. This was done with the goal of defining rules to compose an equational theory which would encompass the Open World Assumption. In fact, the solution of the entity matching problem in the open and wide context of the Web impels us to also consider the case where no reliable entity matching decision can be taken, and thus a third, unknown case must be considered. Together with the formalization of the rules, we defined some tools that allow us to combine these rules in different ways. In particular, we defined a pragmatic principle of rule subsumption that was used to guide the process of merging rules. We also defined several principles of rule normalization, supporting the definition of conservative, reliable matching rules.

As a third step, we designed and implemented a laboratory experiment to learn entity matching rules. This experiment relied on the Decision Tree classifier, a supervised classification technique, to extract entity matching rules given a training set of labeled samples. In this experiment, people of different age, gender and education were asked to undertake a set of simple entity matching tasks on pairs of descriptions. The objective of this experiment was to collect feedback about matching decisions. This allowed us to create a training set to support the elicitation of the matching knowledge people employed in taking matching decisions. The descriptions used in the experiment were collected from heterogeneous sources using randomly selected names. The creation of the single comparison tasks relied on a simple blocking system based on Apache Lucene and tf/idf. When presented to the user, the collected descriptions were shuffled in the order of appearance of their attributes, to force a complete scan of the descriptions and to reduce the effect of cognitive heuristics based on the evaluation of only a few attributes. The training set built in this way presented 7405 labeled sample pairs, involving 2094 different descriptions, for entity type person. The training set for the location type presented 2310 labeled sample pairs, involving 1013 different descriptions. Considering entity type organization, 5064 sample pairs were labeled, involving 3051 descriptions. These datasets were used to elicit entity matching rules to support entity matching decisions.

The last part of the work described the implementation and evaluation of a software program named Fingerprint Match, relying on a combination of rules that resulted from two complementary processes. Firstly, we defined entity matching rules relying on ontological analysis of the features contained in the identification ontology. Secondly, we defined entity matching rules as the result of a bottom-up, machine learning supported process. In order to evaluate matching performances coherently with the open world assumption, we integrated the traditional accuracy evaluation metrics (precision, recall and F-measure) with a new metric named ρ-accuracy, which weights, differently but symmetrically, false positive matching and false don't know classifications. In section 10.2.4 we presented the results of experiments considering the following objectives:

• compare the results related to the adoption of the different similarity metrics

considered;

• compare the results related to 3 different methods of comparison (Greedy approach, Simple Knowledge-driven approach, and Greedy with Character-based Relative Completeness);

• compare the results of 2 different rule inconsistency normalization approaches (inconsistency removal and inconsistency normalization);

• compare the results of all possible threshold normalization processes;

• compare the results of positive only, negative only and plain top-down diachronic rules integration;

• consider the impact of binary and multiclass classification methods.

The combinations of these factors were tested on all the evaluation datasets, for the three entity types considered. At the end of the evaluation process, considering the experiments on bottom-up extracted rules, top-down defined rules and their combinations, we can count the execution of more than 38000 experiments. In analyzing the results of the experiments, the following lessons were learned:

• The first lesson is that one size does not fit all. This may be obvious, but it is clear that defining a single solution that works with every dataset is not feasible. However, the framework defined does not depend specifically on any single factor, such as similarity metric, comparison method, ontologies, mappings or rules. In fact, different combinations of these tools can be applied to define different configurations, as shown in section 10.2. Nevertheless, the experiments we executed show that, in terms of ρ-accuracy, the Smith-Waterman similarity metric on average works slightly better than the others for matching descriptions of persons. Regarding organizations, Taglink is the similarity metric that performed best on average, whereas Monge-Elkan was the best for locations.

• The integration of top-down rules formed by diachronic functional and inverse-functional properties with the bottom-up learned rules proved to be more effective than the application of each single set of rules separately. However, it is important to notice that the best matching performances were achieved integrating only positive matching top-down rules. In fact, negative rules based on functional diachronic properties proved to be too restrictive, due to frequent errors in the evaluation datasets. At the same time, rules constructed combining functional attributes defined in the ontology proved to be precise in negative matching decisions, as these

were less restrictive. Therefore, we will have to test the definition of less restrictive top-down negative matching rules.

• Another interesting lesson is that a knowledge-driven approach to entity matching can provide benefits compared to a simple greedy comparison. In fact, by exploiting the meta-properties used to annotate the features in the ontology, we could apply different matching techniques to each of them. This supported more precise learning and matching decisions.

• Among the configurations we evaluated, when considering entity type person the most effective rule normalization process was the one that applied inconsistency removal, in particular when combined with a conservative threshold normalization, whereas for the other types the normalization process that proved to be more effective was inconsistency normalization. This seems to suggest a correlation between the number of samples in the dataset and the necessity of defining more or less conservative rules. Namely, the smaller training sets required a conservative normalization of the rules to support reliable matching decisions, whereas the larger training set demanded some sort of pruning of the rules. To confirm this correlation, future experiments will have to evaluate the adoption of filters, or of training set partitions and combinations.

• Relying on multi-class classification seems to penalize the rule learning process compared to simple binary classification. Intuitively, the don't know labeled samples should support the learning of more conservative, precise rules. However, the subtle nature of the don't know labeled samples may have negatively affected the learning of positive matching rules, as these were hidden among the don't know cases. We will further investigate this issue to better understand how to leverage the don't know labeled samples.

• Performing entity matching on entity type location requires the implementation of special purpose techniques to match geo-coordinates. In fact, syntactic string matching proved not to be effective in matching these types of attributes. The possibility of specializing the matching solution for single attribute types is an important feature of the knowledge-based solution we propose.

• Performing entity matching on entity type organization also requires the implementation of special solutions for matching addresses. This problem is known to be complicated, but incremental specialization of the matching of street addresses can lead to improvements in matching performances.

• When compared with the Feature Based Entity Matching algorithm currently employed as default matching module in the Okkam Entity Name System1, the Fingerprint Match solution was proven to be more effective and efficient, showing its reliability also in terms of producing a score. Given this experiment, it is our intention to deploy the Fingerprint Match solution as matching module for the ENS.

The experience accumulated in conceiving and implementing the solution proposed in this work allows us to look positively at the future, as the solution proved to be both feasible and effective. However, there are still some issues that have to be addressed in order to incrementally improve the quality of the implementation. For this reason, we aim to continue investigating more innovative solutions:

• the manual definition of mappings for semantic harmonization is cumbersome and potentially error prone. In fact, in the datasets we often found attributes whose value was not semantically correct (e.g. birth-date: Trento, Italy). Other times, we found errors related to overloading in the interpretation of the semantics of attributes, as for example the attribute begin used as the date of birth of a person. All these issues make the manual definition of mappings for semantic harmonization even more error prone. Therefore, in the near future we plan to define a hybrid system for the automatic guessing of an attribute's type given a value. The system will have to rely on a combination of statistical and syntactic methods, to compensate for the weaknesses of each of the approaches.

• the main goal of an ontology is to be a shared representation of a domain. So far, we conceived the ontology as a parameter of the framework, without considering the need of sharing it. Therefore, we plan to design and deploy a wiki-like web site that would allow subscribers to propose and discuss possible evolutions of the ontology. The idea is to make the web site a point of reference for the collection and definition of features, and of mappings of existing ontologies to the defined features, for the solution of the entity matching problem on the web. The wide set of mappings already defined is sufficient to classify the ontology as a well-linked vocabulary. The next step is to make it openly available.

• The framework we defined in this context is suitable to solve the problem of entity matching when a sufficient amount of information is considered. Therefore, to exploit the proposed solution at its best, we’ll develop a plug-in for the Open Refine tool to support the reconciliation of datasets with the Entity Name System.

1 It is possible to access the APIs at http://api.okkam.org

We believe that implementing and sharing this plug-in extension for Open Refine would also foster the collection of valuable feedback to support the bottom-up rule extraction process.

We believe that the main objective achieved with this work is to show how a knowledge-based solution can be suitable for solving the problem of entity matching in the context of the Web. We are convinced that the defined framework will allow us to incrementally define more effective and efficient techniques, specializing the matching of specific attributes on one side, and investigating possible further meta-properties of the features defined in the identification ontology on the other. Although more experimental evaluation is necessary, we believe that the evaluation presented in this work allows us to state that we successfully reached our objectives. The lessons learned will be the basis of future improvements. Furthermore, the integration of this solution as an effective entity matching module in the Entity Name System will support the definition of a more precise and efficient reconciliation service, enabling in principle the okkamization of diverse and heterogeneous data sources, moving a step further towards the realization of the Web of Entities described in [28], and possibly towards the definition of a better Semantic Web.

Chapter 12

Acknowledgments

The accomplishment of this work is the result of an effort that would not have been possible without the collaboration and support of the many great people that are part of my life. All of them contributed, in many different ways, to making it possible. First of all, I want to thank professor Paolo Bouquet for being my advisor for so many years. Every word, every discussion, and even every moment of silent trust contributed significantly not only to the realization of this work, but most importantly pushed me ahead in a path of personal growth and maturation. I think I still have a lot to learn from him to become an accomplished researcher, but I believe the lessons learned will allow me to get there one day. Besides Paolo, I have to thank all the brilliant researchers and collaborators that were my colleagues in the past years. Among the most influential, I surely have to thank doctors Barbara Bazzanella, for contributing with fruitful discussions to the realization of this work; Heiko Stoermer, for helping me grow in terms of self-discipline and organization; Massimiliano Vignolo, for the philosophical discussions about the problem of identity and reference and the Web; and George Giannakopoulos, for introducing me to the world of machine learning. I want to thank also professors Craig Knoblock and Pedro Szekely for welcoming and advising me while I was visiting the Information Sciences Institute of the University of Southern California. My visit at ISI was not very long, but it was essential to find solutions to many practical issues related to the application of machine learning techniques. I want also to thank my future wife Paula, for being such a great companion along this journey. Thanks for patiently listening to my concerns, supporting me in the difficult moments, and sharing the joyful ones. Thanks for taking me out for a GROM ice-cream even late at night, when I needed some fresh air after many hours in front of the computer. You were always there for me, and I will always be there for you. Life can be much more complicated than writing a PhD thesis, but it will be a pleasure to live it

with you. There will be many more challenges we will have to deal with. The important thing is that we deal with them together. I have to thank also my family and friends, for supporting me in these years, and for reminding me that there is life beyond the PhD program. I want to thank my parents for constantly helping me in these long years of studies, accepting my decisions even when these led me to live abroad. You made me what I am, and even knowing I can improve in many aspects, I am very proud of what I have become, and I have to thank you for this. I have to thank all my friends, in particular Jacopo and Simone, for being such great buddies and consiglieri in this phase of my life. We had many discussions about sport, music and women, and shared improvised dinners, beers and sips of whiskey. I truly hope this will never change, as I will always need good friends like you. Last but not least, I want to thank Flavio, Nicola, Fabio, Lorenzo, Daniele, Valentina and Pavlo at the Okkam Labs for being such great colleagues, sharing long working hours in the office, and helping me especially in the realization of the experiments for this thesis. We make a great team, and I am sure there is a great future in front of us!

Appendix A

Appendix A: Semantic Harmonization Mappings


http://www.freebase.com/schema/location/location http://schema.org/Place http://dbpedia.org/ontology/Place http://dbpedia.org/class/yago/YagoGeoEntity http://www.w3.org/2006/03/wn/wn20/instances/synset-location-noun-1 http://models.okkam.org/identification-ontology.owl#location http://umbel.org/umbel/rc/Place http://www.geonames.org/ontology#Feature http://www.opengis.net/gml/ Feature http://sw-portal.deri.org/ontologies/swportal#Location http://data.archiveshub.ac.uk/def/location http://purl.org/dc/terms/Location http://purl.org/goodrelations/v1#Location http://www.ontotext.com/proton/protontop#Location http://purl.org/vocab/frbr/core#Place http://rdvocab.info/uri/schema/FRBRentitiesRDA/Place http://vivoweb.org/ontology/core#GeographicLocation http://data.press.net/ontology/stuff/Location http://www.w3.org/2006/vcard/ns#Location http://semanticweb.cs.vu.nl/2009/11/sem/Place http://www.freebase.com/schema/sports/sports facility http://www.freebase.com/schema/architecture/structure http://www.freebase.com/schema/architecture/venue http://www.freebase.com/schema/architecture/skyscraper http://www.freebase.com/schema/architecture/building http://dbpedia.org/ontology/ArchitecturalStructure http://purl.org/acco/ns#Suite http://www.freebase.com/schema/location/country http://schema.org/Museum http://vivoweb.org/ontology/core#GeographicRegion http://dbpedia.org/ontology/Museum http://purl.org/acco/ns#Hotel http://purl.org/acco/ns#Resort http://dbpedia.org/ontology/Building http://dbpedia.org/class/yago/ArtDecoBuildingsInCalifornia http://purl.org/acco/ns#House http://umbel.org/umbel/rc/Location Underspecified http://umbel.org/umbel/rc/PopulatedPlace http://dbpedia.org/class/yago/BridgesInNewMexico http://dbpedia.org/class/yago/BridgesCompletedIn1965 http://dbpedia.org/class/yago/GeoclassBridge http://dbpedia.org/class/yago/IslandsOfThePacificOcean http://dbpedia.org/class/yago/ValleysOfWales http://dbpedia.org/class/yago/GeoclassAirfield http://umbel.org/umbel/rc/Island ... 
Table A.1: Entity Type Mappings for Location 253 http://schema.org/Person http://dbpedia.org/ontology/Person http://xmlns.com/foaf/0.1/Person http://www.freebase.com/schema/people/person http://www.freebase.com/schema/en/human http://umbel.org/umbel/rc/Person http://www.w3.org/2006/03/wn/wn20/instances/synset-person-noun-1 http://d-nb.info/standards/elementset/gnd#Person http://swrc.ontoware.org/ontology#Person http://vocab.data.gov/def/drm#Person http://rdvocab.info/uri/schema/FRBRentitiesRDA/Person http://www.w3.org/2000/10/swap/pim/contact#Person http://purl.org/vocab/frbr/core#Person http://www.ontotext.com/proton/protontop#Person http://purl.org/ontology/po/Person http://voag.linkedmodel.org/voag#Person http://dati.camera.it/ocd/persona http://models.okkam.org/identification ontology.owl#person http://models.okkam.org/identification-ontology.owl#person http://www.okkam.org/ontology person1.owl#Person http://www.okkam.org/ontology person2.owl#Person http://umbel.org/umbel/rc/MusicalPerformer http://dbpedia.org/ontology/Artist http://dbpedia.org/class/yago/UnitedStatesArmySoldiers http://dbpedia.org/class/yago/MCARecordsArtists http://dbpedia.org/class/yago/LivingPeople http://dbpedia.org/class/yago/PeopleFromFrioCounty,Texas http://dbpedia.org/ontology/MusicalArtist http://dbpedia.org/class/yago/PeopleFromSanAntonio,Texas http://dbpedia.org/class/yago/AmericanCountrySingers http://dbpedia.org/class/yago/PeopleFromAtascosaCounty,Texas http://dbpedia.org/class/yago/AmericanMaleSingers http://www.freebase.com/schema/book/author http://dbpedia.org/class/yago/AmericanTelevisionActors http://dbpedia.org/class/yago/Actor109765278 http://www.w3.org/2006/03/wn/wn20/instances/synset-actor-noun-1 http://dbpedia.org/class/yago/AmericanFilmActors http://dbpedia.org/class/yago/NobelLaureatesWithMultipleNobelAwards http://dbpedia.org/class/yago/NobelPeacePrizeLaureates http://dbpedia.org/class/yago/UnitedStatesNavyOfficers http://dbpedia.org/class/yago/AmericanStageActors http://dbpedia.org/class/yago/LaSalleUniversityAlumni http://dbpedia.org/class/yago/ActorsFromPennsylvania http://dbpedia.org/class/yago/AmericanPeopleOfIrishDescent http://dbpedia.org/class/yago/SecondCityAlumni http://dbpedia.org/ontology/Politician...

Table A.2: Entity Type Mappings for Person

http://dbpedia.org/ontology/Organisation http://schema.org/Organization http://www.freebase.com/schema/organization/organization http://umbel.org/umbel/rc/Organization http://www.w3.org/2006/03/wn/wn20/instances/synset-organization-noun-1 http://models.okkam.org/identification-ontology.owl#organization http://xmlns.com/foaf/0.1/Group http://xmlns.com/foaf/0.1/Organization http://www.ontologydesignpatterns.org/ont/dul/DUL.owl#Organization http://sw-portal.deri.org/ontologies/swportal#Organization http://www.aktors.org/ontology/portal#Organization http://www.w3.org/ns/org#Organization http://purl.org/biotop/biotop.owl#Organization http://www.ontotext.com/proton/protontop#Organization http://voag.linkedmodel.org/voag#Organization http://umbel.org/umbel#Organizations http://vocab.data.gov/def/fea#OrganizationEntity http://www.w3.org/2006/vcard/ns#Organization http://data.press.net/ontology/stuff/Organization http://dbpedia.org/class/yago/OrganizationsEstablishedIn1935 http://dbpedia.org/class/yago/OrganizationsEstablishedIn1934 http://dbpedia.org/class/yago/CompaniesBasedInMoscow http://vivoweb.org/ontology/core#Consortium http://dbpedia.org/class/yago/DefunctCompaniesBasedInPennsylvania http://dbpedia.org/class/yago/CompaniesBasedInMaconCounty,Illinois http://dbpedia.org/class/yago/ConsumerOrganizations http://dbpedia.org/class/yago/CompaniesBasedInBonn http://vivoweb.org/ontology/core#AcademicDepartment http://www.freebase.com/schema/government/government http://dbpedia.org/class/yago/AmericanFootballTeamsInNewYork http://www.ontotext.com/proton/protonext#PoliticalEntity http://www.freebase.com/schema/music/musical group http://dbpedia.org/class/yago/CompaniesBasedInMemphis,Tennessee http://dbpedia.org/ontology/Newspaper http://dbpedia.org/class/yago/BanksOfJapan http://dbpedia.org/class/yago/BanksEstablishedIn1882 http://www.w3.org/2006/03/wn/wn20/instances/synset-union-noun-1 http://www.freebase.com/schema/metropolitan transit/transit system http://www.ontotext.com/proton/protonext#GeopoliticalOrganization http://www.ontotext.com/proton/protonext#InternationalOrganization http://schema.org/MusicGroup http://dbpedia.org/class/yago/Charities http://dbpedia.org/class/yago/LGBTOrganizationsInTheUnitedStates http://dbpedia.org/class/yago/GayMen’sOrganizations

Table A.3: Entity Type Mappings for Organization

Appendix B

Appendix B: Dataset Collection

Table B.2: Queries used to collect samples of Location descriptions: query (nr of samples retrieved)

Parc Ami (7) Konya (53) Bullion Township (13) Tom Smith Lake (17) Sukhar Matak (7) Church Dome (32) Department (8) Village of Terrace Park (18) Curtis Peaks (12) Point (237) Tokat Province (22) Hamlet Hill (31) Nansimo Point (3) Chinook Trough (8) Nankai Trough (4) Matak (63) BasseBanio Department (7) Falkland Trough (9) Tom Ka (5) Pennant Trough (12) Mansfield (127) Pliska Ridge (9) Deal Park (11) Zaventem (23) Mount Elephant Lake (21) Pibaore Department (4) Alto Rio Doce (13) Bodega Butte (13) South Bend (47) Oak Ridge Township (19) Pizni (4) Pontian Besar (12) Church Lake (34) Zwevegem (15) Ouerdanine (2) Back Cap (28) Pulau Natuna Besar (3) Kampung Damak (8) Little (225) Rock Springs historical Township (4) Sidr (111) Kampung Dengkil (22) Kyriad Lausanne (8) Orra (18) Kyriad Bergerac (4) Cincinnatus (45) Dixons Corners (3) Kyriad Hotel at the Disneyland Resort Paris (2) Ban Ko (54) Teascu din Deal (2) Laguna el Molina (5) Trough (145) Belmont Township (31) Walachia (10) Golden Glen Township (11) Estate Little Saint Thomas (8) Brialmont Cove (4) Paili (24) Kloulklubed Hamlet (2) Jafr Bak (4) Besteda Bar (20) Patras (72) Queens Court (32) Ban Chak Khok (24) Qobustan (20) Soraocco (2) Cumberland County (18) Oki Trough (3) Peaks (190) Chatswood Oval (8) Kn Kan (25) Matam Department (25) Norfolk Ridge (41) Galesburg Township (29) Tarrant Hinton (12) Grand Parc (22) Qadm (9) High Rock District (11) Belmont (107) Potlogeni Deal (5) Powder River County (22) Kampung (191) Lac TiGris (3) Kingman Township (10) Portage des Camps (6) Casuarina Point (28) City of Deer Park (15) Tom Green County (17) Al (226) Delaware Department of Transportation Maintenance Area (2) Durango High School District (4) Derang Madng (5) Bodega (166) Flor Airport (19) Prke Shahr (36) Craiova (48) Mutto (8) Tarrant (177) Scioto Ambulance District (3) Playa de la Bodega (4) Consul (39) Kampung Teris (8) La Caleta (12) Sta (84) Kampung Genting (33) Darreh Bak (21) Empire Corners (31) Laguna (219) Techirghiol (21) Oued Agla (2) Commune de Kouinine (3) Coxheath (37) Chikaskia Township (3) Bodega Rock (13) Oued Naam (4) Church Park (24) Church Meadows (27) Nasrallah (23) Kawstah Bn (34) Punta Negra (29) Town of Little River (22) McDonald Court (34) Ninnescah Township (21) Department (4) Sagami (43) City of Little Flock (6) Town of Marengo (10) Cap Senino (10) Blank Peaks (8) Desa Sukaharja (9) Lago Gole (15) Church (238) Teiu (47) Gris (183) Jerkuh (26) Hamilton Township (59) River Township (35) Smith Peaks (23) Ban En (11) Comestock Corners (2) Church Mountain (28) Tarrant Gunville (3) Kilis (35) Colonia (103) Four Corners Airport (17) Parc Alexander (40) Uncle Tom Lake (9) Rio Marina (45) Stan (225) Kampung Teluk Ramunia (22) Kampung Jawa (60) Golden Valley Township (9) Trabzon (46) rma (26) Baki (195) Kampung Kasing (2) Couman (4) Desa Lenangguar (2) Parc National de Waza (2) Longstaff Peaks (3) City of Little Sioux (20) East Chattanooga (62) Punta Avalo (3) Ban Tham (24) Bacu (17) South Atlantic Ocean (13) Little Axe Independent School District (5) Precious Peaks (4) Casual Branch (19) Kyriad Limoges Sud Continued on next page


(2) Pirsaat (23) Goderich Airport (11) Krasnyye Baki (23) Twin Peaks (42) Town of Lake Hamilton (15) Little York Township (19) Goli Rid (21) Lenkiyio (2) Parc Forestier de Hann (2) Town of Greece (31) DallasFort Worth International Airport (2) South Glamorgan (63) Jabal Tali (7) South Sikkim (21) Chowasokwe (4) Ban (238) Mufazat al Mawt (5) Golden Valley County (19) Qala (76) Corners (190) Kampung Ulu Tiram (5) Zonhoven (17) Comuna Tetoiu (5) Paret River (2) Chak Four (17) Tienen (41) Medina (91) Halethorpe (23) Shenley Church End (20) Chovakaranga (2) Friendly Corners (17) Rs Iouk (19) Fatick Department (20) Punta (237) Kalbuh Park (11) City of Crowley (20) Willowdale (87) Pulau Babi Besar (3) Jawf al Abd (6) Giurgiu (49) Grand Prairie Municipal Airport (14) Kf (74) Jebel Oued en Nemeur (16) Webers Peaks (5) School Lake (22) Cap Gros (46) Besar (171) Chk (11) Ban Ti Te (3) Town of Gorham (23) Gole (199) Calvary (57) Laguna Gaiba (8) Tr (236) Church Pond (34) Life Ambulance (20) Samsun (51) Hamilton (222) Makassar (28) BeuzecCapSizun (10) Crescent Reserve (69) Lac Gris (7) Erzegovina (11) Tom Bayou (27) Park historical Township (19) Tom Price (22) Cypress Ridge Township (3) Amphitheatre Peaks (4) Osanippa (7) Flannigan Corners (4) Davey Point (12) Little Blue Township (15) Laguna Suches (6) Black RiverMatheson (17) Sahqaya (3) Shaumyanovskiy Rayon (22) Borups Corners (3) Court (220) Punta Caleta Larga (2) Bailly (62) Kester Peaks (2) Rock Township (45) Grants Camps (16) Ban Ta Luang (11) Oval (154) Carmel Hamlet (22) Sagami Bank (22) Mersin (47) North Fork State Wildlife Refuge (14) Q Chk (30) Pulau Merak Besar (2) Town of Cumberland (18) Summit School Airport (13) Bailey Corners (27) Town of Gray (21) Golden (222) Crater Lake (47) Sandy Point District (16) Church Fenton (32) Soholt Peaks (6) Uchastok Sok (18) Crutch Peaks (6) Buon Ma Thuot (16) Church Lakes (28) Chak Nine (19) Dorylaeum (6) Urochishche (118) Rock Creek Township (45) La Grecia (10) City of Zenda (6) Tiko Airport (6) Maple Ridge Township (29) Yarra Reserve (24) Mouscron (28) Kolda Department (20) Eagle Township (50) Puerto Rico Ridge (41) Tehuantepec Ridge (6) Baa Orion (21) Ras Chani (12) Yozgat (48) Tlyadal (8) Torch River No 488 (13) Bodega Head (13) Mikhaylovskiy Uchastok (26) Tom Bean (24) Province of Laguna (17) Ro (237) Farrell Corners (11) Parc National des Virunga (4) Lake Killarney (49) Casual (29) Hamlet Park (19) Le Cirand (5) Gole Strane (4) historical (139) Village of Hales Corners (2) Borough of Audubon Park (11) East Rock Bluff Township (2) Lac NoirGris (23) Kyriad Avignon (8) Ambulance (121) Oil Trough Township (2) Little Salt Township (12) Borough of Park Ridge (14) Rock (222) Desa Tapaan (2) City of Little River (12) Springfield Ambulance (33) Kampung Kuala Wau (8) Little Valley Township (9) Church Shocklach (14) Palawan Trough (3) Kesteloots Trailer Court (2) Sorapa (7) Sakk (7) Kampung Pasir Gudang Baru (2) Smith Lake (40) Lamia (78) Square Deal Hill (4) Bassi Department (39) Current River Township (20) Coulee State Wildlife Refuge (15) Verkhnyaya Pkhiya (21) Little San Salvador (4) Manipur South (33) City of Sansom Park (8) City of Hurst (17) Wairarapa South County (3) Tokat (35) Bures Hamlet (28) Old River Township (20) Campbell Corners (22) Zapadnoye Lake (8) Crique La Villette (4) Huntington Park (37) City of Blue Mound (27) Arrondissement Turnhout (23) Kampung Mangsuk (5) Pkhovo (18) Lake Tom (54) Town of Little Wolf (15) Royals Court (17) Rundkino (2) Pirsaat Burnu (2) Kecamatan Sawah Besar (2) 
Salt Rock Township (19) Chaplin Oval (13) City of Demopolis (10) Sango Point (5) South Shields (32) City of North Little Rock (3) Judeul Timi (7) Ban To Mo (30) School (191) Hamlet (194) East Hamilton Township (18) Chak Seventeen (10) Leiman (11) Vasilyevka (23) Jawf (128) HauteBanio Department (4) Sulu Trough (7) Rago (50) Desa Bengkak (2) Park Township (43) Blue Ridge Township (22) Gol (73) Greenup Township (22) Kampung Sedenak (8) Bonney Lake (30) Gor (137) Hales Continued on next page 257

Corners (21) Comuna Valea Viilor (3) Northwest Ambulance District (8) Porongas (4) Kampung Panchor (21) Ouindigui Department (4) La Villette (11) Golden Lake (42) Rock Falls Township (19) Cap Foreland (5) Wood- haven Court (20) Ban Phueng (24) Bodega Canyon (5) Hanover Park (42) Turnhout (25) Parc Ahuntsic (19) Little School Lot Lake (9) Punta Arena (46) Lake Hamilton (52) Hobe Sound National Wildlife Refuge (8) Ar- rondissement de Lascahobas (4) Atlantique Department (24) Refuge Point (20) Kyriad (101) Kampung Gantuk (3) Lope Department (12) Brunswick Golden Isles Airport (8) Camberwell South (23) Bodega Azul (42) City of Hamlet (19) Department (3) Bodega Pampa (13) Sechenovo (10) Lake Bemba (6) Marmara Trough (9) Rio Saliceto (22) Umm Ghuayyah (22) Menemen (24) Cerro Bodega (4) Ban Palian (11) Lolo Bouenguidi Department (5) Parc Infantil (4) Kovalam Point (4) Corral Peaks (9) Round Rock (46) SintTruiden (5) Parc Anderson (8) Park (237) Ban Tamot (4) Bennett Township (41) Golden City Township (11) Parc (222) Mossey River (20) Ridge (208) Magnier Peaks (4) Pkhiya (15) Ro Tercero (7) Webster Peaks (11) Forest Hill (69) River (238) Desa Karangrejo (29) Vilvoorde (40) Tri Ro (5) Croteau (39) Golden Lake Township (19) Urochishche Kharkovka (6) Kampung Peradong (3) Kampung Pondoi (5) Eldoinyo (17) Punta Cana (46) Zelzate (22) Airport (184) City of New Cumberland (18) Knysna (46) Punta Tejupan (3) Laguna Mapache (2) School Pond (33) Turtle River No 469 (7) Kampung Seelong (2) Marengo (198) Jawf al Athlah (3) Al azm (14) City of Grosse Pointe Park (12) la Ciutadella (16) Lake Paoay (3) Punta Morillo (6) Yli (174) Ban Tom Klang (34) Kyriad Grenoble Seyssins (3) Hakkari (25) Dolgi Rid (6) City of Watauga (18) Burks Corners (3) Maltosrova (2) Kyriad Gap (21) South Dublin (40) Tal Chl (21) Reserve (184) Culpeo (17) Godfreys Reserve (4) Desa Mojotengah (4) Oula Department (24) City of Westworth Village (6) Ban Talat Bueng (19) Kampung Sungai Miang (2) Tinley Park (26) Crvenorovinski (2) Porphyry Peaks (5) Town of Orange Park (17) Ban Phai (47) Hulme (66) West Deal (12) Rock Pile Peaks (4) Makwana School (2) Town of Providence (27) Greenleys Corners (3) Desa Babadan (14) Lac Tom (32) Chojlloni (2) Deal Lake (17) Cumberland (234) Coyote Peaks (15) Atakora Department (22) Lake Bangweulu (20) Polevskoy Uchastok (2) Stan Sar (4) Bottou Department (2) Wd Sidr (15) Pulau Redang Airport (5) Lake Ruko (3) Isle of Wight Department of Natural Resources Management Area (4) Seraing (34) Bullocks Corners (11) Rosso Department (21) Robert Cape (41) Baldwin Park (51) Slatina (105) Uchastok (116) Baudin Peaks (3) Sagami Canyon (21) Pizozerka (2) Martin Rid (6) Borough of Shiremanstown (2) Torhout (23) City of Hamilton (9) Wuro Gole (2) Papago Hamlet (5) Rock Island Township (18) Samsun Ridge (2) Chak (237) Kampung Rial (14) Plateaux Department (18) Villette (187) Jabal Umm Raqabah (2) Desa Sokawera (3) Punta Caleta (4) Desa Wunung (2) Deal (192) South Big Rock Township (11) Haughton Court (20) Desa Manggis (21) Sunda Trough (13) Taleh (11) AugustaMargaret River Shire (22) Aydin (36) Rivera Peaks (11) Mys Sangachal (2) Lake Charles (50) Talle Shaghl (11) Church Mesa (27) Three Corners Lake (6) City of Saginaw (18) Bakonya (13) Punta Bayas (10) Murdock (59) Soraoco (2) Department (3) Town of Hamilton (14) Bodega Marine Reserve (4) Ban Mai (35) Tom Peter Lake (16) Kampung Gadek (6) Kampung Baharu (43) Bakony (58) Pettifor (2) Hypaepa (3) Erzincan (40) K Orra (11) Town of Casco (24) Cale Oval (15) Lone Tree Township (43) 
City of Little Rock (19) Col NotreDame (22) Ro Coco (5) Oued Lill (5) Stormy Peaks (2) Ban Nong Muang (70) Bukit Besar Jelai (4) Wallula (38) Qarada (37) Merklings Trailer Court (2) Sang Kan (3) Lystad Bay (48) Libya (56) Al Bb (23) Formosa do Rio Preto (12) Ban Phalai (9) Bishop Corners (23) Wintercone (3) Laguna del Barn (3) Marengo Township (54) Cleveland (97) Soudougui Department (5) Court Park (38) Sudr (4) City of Golden Valley (8) Golden Pond (19) Cap (222) Lake County (36) Ibadan Airport (9) West Norriton Ambulance (3) Diyak (6) Bakonyalja (2) Little Deal Island (9) Dr Dirang (3) Department of Transportation (21) Tournai (46) Grant Hill (35) Town of Little Mountain (20) Zemli (9) Morocco (51) Hatien (3) Evangelistical (2) Violeta (93) Mayo River District (19) Providence (98) Aitken Cove (20) Camps (189) McKinleyville (22) Nilombot Golden (4) Creswick Peaks (5) Pilot Rock Township (21) City of Forest Hill (12) Kingman (194) Rio nellElba (21) Oued Laou (5) Sredni Rid (4) Comuna Zeme (8) Le Donjon (24) Piatra Neam (20) Shindi School (9) Soranpampa (2) Rho (97) Kampung Petuh (5) Al Huff (18) Town of Pound Ridge (16) Little River Township (6) Markov Rid (6) Ban Phan Don (22) Farleys Corners (6) White Township (48) Yellowstone National Park (18) Borough of Little Continued on next page 258 APPENDIX B. APPENDIX B: DATASET COLLECTION

Meadows (11) Tarrant Monkton (13) Masai Mara Game Reserve (3) Rio (221) Troms Airport (21) Yellowstone National Park County historical (14) Borough of Prospect Park (20) City of Linden (27) Claremont Oval (38) Burch Peaks (8) Belmont Center Gas Field (3) Commissioner District 6 (5) Rock County (16) Commissioner District 7 (7) Commissioner District 8 (34) Commissioner District 9 (8) Church Basin (31) Commissioner District 5 (29) Rid (218) Al Bay (30) Durian Besar (41) Ban Phaeo (23) Ridge Peak (52) Laguna Tequesquitengo (3) Greece (80) Bone Bay (36) Shiloh (77) Soloviu (2) Blue Mound (63) Ras (235) Oued Chelif (3) Court Square (24) Caleta Olivia (15) Department (6) South River District (12) Town of Little Suamico (8) Ras Buur Gaabo (2) Pizya (4) Parsley Swamp (18) Hamilton historical Township (18) Four Corners (55) Chilchi (17) Paili Plantation (19) Porter Corners (21) Refuge Key (21) Tongeren (46) Ras Tenewi (2) Mons (99) Millennium Park (26) Bodega Dunes (13) Golden Township (21) Ro Cuarto (9) Mys Alyat (20) Neptunes Window (2) Town of Freeport (16) Park District (41) Advance Ambulance (27) Oued (234) Tribhuwan Airport (3) Tli (23) Henry Township (57) Belek (28) Solberg Inlet (23) Department (130) Durham Park Township (19) Basse-Banio Department (49) Nagh Bak (20) Oval Lake (24) South Kivu (25) Cottonwood Township (35) Kin Uly (31) Les Petits Camps (16) City of Keller (19) Ylivieska Airport (4) Smith Ambulance Service (19) Iris Refuge Island (14) Greens Corners (21) Ban Phon Ngam (52) Unorganized Territory of Kingman (18) Umm Qulayah (3) Ro Gallegos (4) Cap Aubert (18) North River District (29) Ban Chun Luang (3) Cerro Sorapa (20) kyriad sud (22) Sweet Water (44) Little Deep Township (5) Tom Thomson Lake (13) Erzurum (45) Ramana (52) Caleta (176) Bodega Bay (8) Ban Khrai (23) Geneva (97) Tom (220) Mount Tom Lake (13) Pleasant Ridge historical Township (4) Bonin Trough (8) Baltics Corners (2) Refuge (162) Senglea Point (5) Hamlet Farms (10) Church Pine Lake (15) Kyriad Deauville St Arnoult (3) Rysa (14) Dsa (15) Desa (198) Donjon (75) Kampung Ayer Keroh (17) Hotamville (5) Doonside Oval (7) Aristovo (29) Tali (220) Mibzal (17) Oval Peak (5) Rio Papuri (6) Kyriad Rimini (5) Manggar Besar (4) Virgin Islands Trough (6) South 24 Parganas (19) Zooks (21) Kastamonu (45) Puta (41) Batikent (5) Kampung Taburan Besar (3) Tiderishi (2) Gamla stan (31) Kampung Cenering (2) New Church (44) Magherabuoy (2) City of Dalworthington Gardens (2) Arrondissement (189) Rava Point (6) Oued Ifrane (11) Cerro Largo (58) Parsley (60) Desa Krandon (6) Petit lac Tom (4) Kny (102) Dixons Mills (17) Borough of Deal (13) Lomonosov Ridge (8) Cream Ridge Township (5) Caleta Chica (23) School Section Lake (31) City of Forest Park (13) Cp Tin (27) Hampton Court (35) Parc Aldred (3) Kortrijk (46) Tipperary South Riding (8) Refuge Pond (61) City of Church Hill (5) Lake (235) Tal Bolgh (2) Jubail (25) Community Ambulance (53) Urochishche Bannikovo (6) Everglades National Park (23) Holth Peaks (2) Caleta Mangle (6) Little Tom Lake (10) Raas Kaambooni (4) Cap Coster (4) Ban Bueng Bon (6) Emma Cove (7) Arrondissement Tielt (18) Portsmouth Ambulance (38) Le Cap (29) South (237) Borough of Shippensburg (9) Zamboanga (53) Hasvik Airport (6) Jadwal (112) Bodega Island (7) Gain Stan (8) Slieveanorra (4) Crestone Peaks (13) Siillaviit (4) Marengo Lake (15) Gole Khel (39) Kingman Gulch (11) Puu Greci (15) Dialgaye Department (4) Church Hill (49) Cheadle Hulme Railway Station (17) Little Black Township (17) Nashville (79) Cerralvo 
Trough (2) Sagami Trough (4) Campbells Corners (25) Hamlet Lake (18)

Ellenbogen (43), Torre (151), Dagny (25), Dolgoff (9), Fred (220), Crawley (68), Richard (238), Mustafa (145), Rongomai (11), Harry (194), Rogers (178), Julius (185), Rocque (20), Blaeser (13), Korkmaz (32), Robyn (119), Phife (22), Trish (103), James (240), Aspasius (7), Jean (192), Reddy (140), Isaac (187), Karl (206), Wickware (9), Woodforde (46), Tomter (10), Dicki (32), Hamilton (199), Keller (161), Masango (6), Parizeau (24), Kessler (120), trento (1), Eugene (199), Di (174), Madagascar (37), Acheamphong (4), Barigozzi (5), Freddy (147), Gortz (25), Menon (100), Seiichi (46), Victor (186), Gayifi (4), Ferrero (68), Deo (93), Andrew (214), Jenkins (165), Powell (187), Barteczko (5), Samuel (48), Fitzsimons (61), null (23), Konee (20), Trubetsky (23), Chtiba (1), Goulet (60), Barber (165), Rene (173), Boyer (121), Mackintosh (84), Francisco (193), Petre (89), Rupert (145), Aubelin (2), Spinola (30), Golden (161), Arthur (216), John (119), Cory (163), Percival (118), Campbell (217), Heard (115), Jim (196), Bristow (79), Tony (178), Oswaldo (83), Robbemond (2), Forster (122), George (114), Wilber (94), Fingar (11), Frank (221), Adam (193), Stewart (211), Agger (49), Gerhard (151), Hunt (175), Ratner (29), Bonaparte (71), Green (219), Oeyvind (1), Pons (75), William (295), Peter (309), Israel (168), Argiris (15), Gourault (2), Dad (110), Basile (91), Elrod (53), Cvijanovic (6), Valle (157), Scott (188), Leman (83), Anders (172), Rebecca (155), Morgan (213), Dauncey (25), Takeda (69), Vanvelthoven (6), Marianna (77), Raymond (192), Ezekias (7), Mincu (21), Shandruk (8), Benjamin (215), Wang (151), Brouard (16), Nathaniel (217), Safdar (50), Elliott (180), Carl (203), Zhuo (29), Ian (176), Don (135) Table B.1: Queries used to collect samples of Person descriptions: query (nr of samples retrieved)

Acadians (91), Press (235), Funding (222), stations (243), French (275), Jew (117), Schell (136), University (309), Centre (341), India (237), Holden (235), Niels (184), Records (242), Champaign (261), Vanity Project The (4), Urbana (244), Orchestral Manoeuvres in the Dark (24), Coral (309), Heavy Into Jeff (29), School (306), Shimbun (49), Current (258), Atomic Kittens (20), Transylvanian (45), Francis (341), Saint (297), Party (278), Parsons Jerry Blue Jeans The (7), Socialist (166), Left (177), Latin (268), The Unknown Project (8), Brazilian (250), Switchblade Kittens (7), Muckleshoot (12), Environment (240), Jerry Parsons The Blue Jeans (21), Basics (234), Im (226), Ltd (243), 2 in da Bush (22), Tunnel Allstars (27), Tilburg (112), Chicago (350), Newcastle (283), Unknown Project The (16), Dharma (218), Sense (214), Entertainment (333), Endorsement (55), Vanderbilt (230), Mirrors (165), Anathema (34), Devi (145), The Alan Parsons Project (24), Earth (336), British (341), Paulista (209), The Octopus Project (26), Hello (149), Darth (61), Nets (244), Unknown The (67), The Unknown (40), Production (284), Basilicata (165), Radiohead (25), German (237), Americans (311) Table B.3: Queries used to collect samples of Organization descriptions: query (nr of samples retrieved)

Bibliography

[1] Erin L. Allwein, Robert E. Schapire, and Yoram Singer. Reducing multiclass to binary: a unifying approach for margin classifiers. J. Mach. Learn. Res., 1:113–141, September 2001.

[2] G. Antoniou and F. van Harmelen. A Semantic Web Primer. MIT Press, 2004.

[3] Rohan Baxter, Peter Christen, and Tim Churches. A comparison of fast blocking methods for record linkage. In KDD 2003 Workshops, pages 25–27, 2003.

[4] B. Bazzanella, P. Bouquet, and H. Stoermer. A cognitive contribution to entity representation and matching. Technical Report DISI-09-004, University of Trento, 2009.

[5] B. Bazzanella, H. Stoermer, and P. Bouquet. A Bayesian model for entity type disambiguation. In Artificial Intelligence: Methodology, Systems, and Applications, pages 121–130. Springer Berlin / Heidelberg, 2010.

[6] Barbara Bazzanella. Uniqueness in Cognition. PhD thesis, Doctoral School in Psychological Sciences and Education, University of Trento, 2010.

[7] Barbara Bazzanella, Paolo Bouquet, and Heiko Stoermer. Top level categories and attributes for entity representation. Technical report, University of Trento, 2008.

[8] Barbara Bazzanella, Junaid Ahsenali Chaudhry, Themis Palpanas, and Heiko Stoermer. Towards a general entity representation model. In SWAP, 2008.

[9] Ron Bekkerman and Andrew McCallum. Disambiguating web appearances of people in a social network. In WWW ’05: Proceedings of the 14th international conference on World Wide Web, pages 463–470, New York, NY, USA, 2005. ACM.

[10] O. Benjelloun, H. Garcia-Molina, D. Menestrina, Qi Su, S. Whang, and J. Widom. Swoosh: a generic approach to entity resolution. The VLDB Journal, 18:255–276, 2009.

[11] T. Berners-Lee. What do HTTP URIs identify? http://www.w3.org/DesignIssues/HTTP-URI, January 2007. Discussion started in July 2002.

[12] T. Berners-Lee, J. A. Hendler, and O. Lassila. The Semantic Web. Scientific American, May 2001. http://www.sciam.com/2001/0501issue/0501berners-lee.html.

[13] Tim Berners-Lee. Design Issues – Linked Data. Published online, May 2007. http://www.w3.org/DesignIssues/LinkedData.html.

[14] Tim Berners-Lee. Giant Global Graph. Decentralized Information Group, http://dig.csail.mit.edu/breadcrumbs/node/215, 2007.

[15] I. Bhattacharya and L. Getoor. Iterative record linkage for cleaning and integration. In DMKD '04: Proceedings of the 9th ACM SIGMOD workshop on Research issues in data mining and knowledge discovery, pages 11–18, New York, NY, USA, 2004. ACM.

[16] I. Bhattacharya and L. Getoor. Collective entity resolution in relational data. ACM Transactions on Knowledge Discovery from Data (TKDD), 1, March 2007.

[17] I. Bhattacharya and L. Getoor. Query-time entity resolution. Journal of Artificial Intelligence Research, 30:621–657, December 2007.

[18] Mikhail Bilenko, Raymond Mooney, William Cohen, Pradeep Ravikumar, and Stephen Fienberg. Adaptive name matching in information integration. IEEE Intelligent Systems, 18(5):16–23, September 2003.

[19] Mikhail Bilenko and Raymond J. Mooney. Adaptive duplicate detection using learnable string similarity measures. In Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining, KDD '03, pages 39–48, New York, NY, USA, 2003. ACM.

[20] C. Bizer, R. Cyganiak, and T. Heath. How to publish linked data on the web. Online tutorial, July 2007.

[21] N. S. D'Andrea Du Bois, Jr. A solution to the problem of linking multivariate documents. Journal of the American Statistical Association, 64(325):163–174, 1969.

[22] David Booth. Denotation as a two-step mapping in semantic web architecture. In Identity and Reference in web-based Knowledge Representation (IR-KR2009), 2009.

[23] S. Bortoli, P. Bouquet, H. Stoermer, and H. Wache. Foaf-o-matic - solving the identity problem in the FOAF network. In Proceedings of SWAP 2007 - Fourth Italian Semantic Web Workshop, 2007.

[24] P. Bouquet, C. Ghidini, and L. Serafini. Identity and reference on the global giant graph. In Identity and Reference in web-based Knowledge Representation (IR-KR2009 at IJCAI-09), 2009.

[25] P. Bouquet and H. Stoermer. OKKAM: Enabling an Entity Name System for the Semantic Web. In Proceedings of the I-ESA 2008 Workshop on Semantic Interoperability, 2008.

[26] P. Bouquet, H. Stoermer, and B. Bazzanella. An Entity Naming System for the Semantic Web. In Proceedings of the 5th European Semantic Web Conference (ESWC2008), LNCS, 2008.

[27] P. Bouquet, H. Stoermer, D. Cordioli, and G. Tummarello. An Entity Name System for Linking Semantic Web Data. In Proceedings of LDOW2008, April 2008. http://events.linkeddata.org/ldow2008/papers/23-bouquet-stoermer-entity-name-system.pdf.

[28] P. Bouquet, H. Stoermer, C. Niederée, and A. Maña. Entity Name System: The Backbone of an Open and Scalable Web of Data. In Proceedings of the IEEE International Conference on Semantic Computing, ICSC 2008, number CSS-ICSC 2008-4-28-25, pages 554–561. IEEE Computer Society, August 2008.

[29] Paolo Bouquet, Fausto Giunchiglia, Frank van Harmelen, Luciano Serafini, and Heiner Stuckenschmidt. C-OWL: Contextualizing ontologies. In Dieter Fensel, Katia Sycara, and John Mylopoulos, editors, The Semantic Web - ISWC 2003, volume 2870 of Lecture Notes in Computer Science, pages 164–179. Springer Berlin Heidelberg, 2003.

[30] Paolo Bouquet, Themis Palpanas, Heiko Stoermer, and Massimiliano Vignolo. A conceptual model for a web-scale entity name system. In Asian Semantic Web Conference (ASWC), Shanghai, China, 2009.

[31] Daren C. Brabham. Crowdsourcing as a model for problem solving: An introduction and cases. Convergence: The International Journal of Research into New Media Technologies, 14(1):75–90, 2008.

[32] D. G. Brizan and A. Tansel. A survey of entity resolution and record linkage methodologies. Communications of the IIMA, 6(3):41–50, 2006.

[33] Horacio Camacho and Abdellah Salhi. A redundancy detection approach to mining bioinformatics data. Computer Aided Methods in Optimal Design and Operations, 7:89–98, 2006.

[34] Greg Carlson. Thematic roles and the individuation of events. In Susan Rothstein, editor, Events and Grammar, pages 35–51. Kluwer Academic Publishers, 1998.

[35] M. Carrara, P. Giaretta, V. Morato, M. Soavi, and G. Spolaore. Identity and modality in OntoClean. In A. Varzi and L. Vieu, editors, Formal Ontology in Information Systems. IOS Press, 2004.

[36] Massimiliano Carrara and Pieter E. Vermaas. The fine-grained metaphysics of artifactual and biological functional kinds. Synthese, 169(1):125–143, July 2009.

[37] S. Castano, A. Ferrara, S. Montanelli, and D. Lorusso. Instance matching for ontology population. In Proc. of the 16th Italian Symposium on Advanced Database Systems, pages 22–25, 2008.

[38] Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, and Robert E. Gruber. Bigtable: A distributed storage system for structured data. ACM Trans. Comput. Syst., 26(2):4:1–4:26, June 2008.

[39] S. Chaudhuri, A. Das Sarma, V. Ganti, and R. Kaushik. Leveraging aggregate constraints for deduplication. In Proceedings of the 2007 ACM SIGMOD international conference on Management of data, SIGMOD '07, pages 437–448, New York, NY, USA, 2007. ACM.

[40] S. Chaudhuri, V. Ganti, and R. Motwani. Robust identification of fuzzy duplicates. In Proceedings of the 21st International Conference on Data Engineering (ICDE 2005), pages 865–876, 2005.

[41] Jie Chen, Cheqing Jin, Rong Zhang, and Aoying Zhou. A learning method for entity matching. In Proceedings of the 10th International Workshop on Quality in Databases (QDB 2012), 2012.

[42] Z. Chen, D. Kalashnikov, and S. Mehrotra. Exploiting relationships for object consolidation. In Proceedings of the 2nd international workshop on Information quality in information systems, IQIS '05, pages 47–58, New York, NY, USA, 2005. ACM.

[43] Tim Churches and Peter Christen. Some methods for blindfolded record linkage. BMC Medical Informatics and Decision Making, 4, 2004.

[44] K. G. Clark. Identity crisis, September 11 2002.

[45] Carol Cleland. On the individuation of events. Synthese, 86:229–254, 1991.

[46] William W. Cohen, Henry Kautz, and David McAllester. Hardening soft information sources. In Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining, KDD '00, pages 255–259, New York, NY, USA, 2000. ACM.

[47] William W. Cohen, Pradeep D. Ravikumar, and Stephen E. Fienberg. A comparison of string distance metrics for name-matching tasks. In IIWeb, pages 73–78, 2003.

[48] David Cohn, Les Atlas, and Richard Ladner. Improving generalization with active learning. Machine Learning, 15(2):201–221, 1994.

[49] Corinna Cortes and Vladimir Vapnik. Support-vector networks. Mach. Learn., 20(3):273–297, September 1995.

[50] P. Cudré-Mauroux, P. Haghani, M. Jost, K. Aberer, and H. De Meer. idMesh: graph-based disambiguation of linked data. In Proceedings of the 18th international conference on World Wide Web, WWW '09, pages 591–600, New York, NY, USA, 2009. ACM.

[51] Donald Davidson. The individuation of events. In Essays on Actions and Events. Clarendon Press, 1980.

[52] M.G. de Carvalho, A.H.F. Laender, M.A. Goncalves, and A.S. da Silva. A genetic programming approach to record deduplication. Knowledge and Data Engineering, IEEE Transactions on, 24(3):399–412, March 2012.

[53] Timothy de Vries, Hui Ke, Sanjay Chawla, and Peter Christen. Robust record linkage blocking using suffix arrays. In Proceedings of the 18th ACM conference on Information and knowledge management, CIKM '09, pages 305–314, New York, NY, USA, 2009. ACM.

[54] Jeffrey Dean and Sanjay Ghemawat. MapReduce: simplified data processing on large clusters. Commun. ACM, 51(1):107–113, January 2008.

[55] D. Dey, S. Sarkar, and P. De. Entity matching in heterogeneous databases: a distance-based decision model. In Proceedings of the Thirty-First Hawaii International Conference on System Sciences, volume 7, pages 305–313, January 1998.

[56] Elena Deza and Michel Marie Deza. Encyclopedia of Distances. Springer, 2013.

[57] A. Doan, Y. Lu, Y. Lee, and J. Han. Profile-based object matching for information integration. Intelligent Systems, IEEE, 18(5):54–59, 2003.

[58] A. Elmagarmid, P. Ipeirotis, and V. Verykios. Duplicate record detection: A survey. IEEE Transactions on Knowledge and Data Engineering, 19:1–16, 2007.

[59] Jérôme Euzenat and Pavel Shvaiko. Ontology matching. Springer-Verlag, Heidelberg (DE), 2007.

[60] I. P. Fellegi and A. B. Sunter. A theory for record linkage. Journal of the American Statistical Association, 64:1183–1210, 1969.

[61] M. Ganesh, J. Srivastava, and T. Richardson. Mining entity-identification rules for database integration. In Proceedings of the Second International Conference on Data Mining and Knowledge Discovery, 1996.

[62] Aldo Gangemi, Nicola Guarino, Claudio Masolo, Alessandro Oltramari, and Luc Schneider. Sweetening ontologies with DOLCE. In Proceedings of the 13th International Conference on Knowledge Engineering and Knowledge Management: Ontologies and the Semantic Web, EKAW '02, pages 166–181, London, UK, 2002. Springer-Verlag.

[63] Aldo Gangemi and Valentina Presutti. Towards an OWL Ontology for Identity on the Web. In Semantic Web Applications and Perspectives (SWAP2006), 2006.

[64] Aldo Gangemi and Valentina Presutti. A grounded ontology for identity and reference of web resources. In i3: Identity, Identifiers, Identification. Proceedings of the WWW2007 Workshop on Entity-Centric Approaches to Information and Knowledge Management on the Web, Banff, Canada, May 8, 2007.

[65] L.E. Gill. Ox-link: The Oxford medical record linkage system. In Proceedings of the Record Linkage Workshop and Exposition, 1987.

[66] H. Glaser, A. Jaffri, and I. Millard. Managing co-reference on the semantic web. In WWW2009 Workshop: Linked Data on the Web (LDOW2009), April 2009.

[67] K. Goiser and P. Christen. Towards automated record linkage. In Proceedings of the fifth Australasian conference on Data mining and analytics - Volume 61, AusDM '06, pages 23–31, Darlinghurst, Australia, 2006. Australian Computer Society, Inc.

[68] N. Guarino and C. Welty. Identity, unity and individuality: Towards a formal toolkit for ontological analysis. In ECAI-2000: The European Conference on Artificial Intelligence, 2000.

[69] Nicola Guarino. The role of identity conditions in ontology design. In Proceedings of the International Conference on Spatial Information Theory: Cognitive and Computational Foundations of Geographic Information Science, COSIT '99, pages 221–234, London, UK, 1999. Springer-Verlag.

[70] Nicola Guarino and Chris Welty. An overview of OntoClean. In Steffen Staab and Rudi Studer, editors, The Handbook on Ontologies, pages 151–172. Springer-Verlag, 2004.

[71] Sudipto Guha, Nick Koudas, Amit Marathe, and Divesh Srivastava. Merging the results of approximate match operations. In Proceedings of the Thirtieth international conference on Very large data bases - Volume 30, VLDB '04, pages 636–647. VLDB Endowment, 2004.

[72] Isabelle Guyon, Jason Weston, Stephen Barnhill, and Vladimir Vapnik. Gene selection for cancer classification using support vector machines. Mach. Learn., 46(1-3):389–422, March 2002.

[73] Harry Halpin, Patrick Hayes, James McCusker, Deborah McGuinness, and Henry Thompson. When owl:sameAs isn't the same: An analysis of identity in linked data. In Peter Patel-Schneider, Yue Pan, Pascal Hitzler, Peter Mika, Lei Zhang, Jeff Pan, Ian Horrocks, and Birte Glimm, editors, The Semantic Web - ISWC 2010, volume 6496 of Lecture Notes in Computer Science, pages 305–320. Springer Berlin / Heidelberg, 2010.

[74] Harry Halpin and Patrick J. Hayes. When owl:sameAs isn't the Same: An Analysis of Identity Links on the Semantic Web. In Workshop on Linked Data on the Web, April 2010.

[75] P. J. Hayes and H. Halpin. In defense of ambiguity. International Journal on Semantic Web and Information Systems, 4(2):1–18, April-June 2008.

[76] H. B. Newcombe, J. M. Kennedy, S. J. Axford, and A. P. James. Automatic linkage of vital records. Science, 130:954–959, 1959.

[77] M.A. Hearst, S.T. Dumais, E. Osuna, J. Platt, and B. Schölkopf. Support vector machines. Intelligent Systems and their Applications, IEEE, 13(4):18–28, 1998.

[78] R. Hecht and S. Jablonski. NoSQL evaluation: A use case oriented survey. In Cloud and Service Computing (CSC), 2011 International Conference on, pages 336–341, December 2011.

[79] M. A. Hernández and S. Stolfo. The merge/purge problem for large databases. In Proceedings of the 1995 ACM SIGMOD international conference on Management of data, SIGMOD '95, pages 127–138, New York, NY, USA, 1995. ACM.

[80] Mauricio A. Hernández and Salvatore J. Stolfo. Real-world data is dirty: Data cleansing and the merge/purge problem. Data Min. Knowl. Discov., 2(1):9–37, January 1998.

[81] A. Hogan, A. Polleres, J. Umbrich, and A. Zimmermann. Some entities are more equal than others: statistical methods to consolidate linked data. In Proceedings of the Workshop on New Forms of Reasoning for the Semantic Web: Scalable & Dynamic (NeFoRS2010), 2010.

[82] Aidan Hogan, Andreas Harth, and Stefan Decker. Performing object consolidation on the semantic web data graph. In i3: Identity, Identifiers, Identification. Proceedings of the WWW2007 Workshop on Entity-Centric Approaches to Information and Knowledge Management on the Web, Banff, Canada, May 8, 2007.

[83] Chih-Wei Hsu, Chih-Chung Chang, and Chih-Jen Lin. A practical guide to support vector classification. Available online, 2003.

[84] Wei Hu, Jianfeng Chen, and Yuzhong Qu. A self-training approach for resolving object coreference on the semantic web. In Proceedings of the 20th international conference on World wide web, WWW ’11, pages 87–96, New York, NY, USA, 2011. ACM.

[85] Laurent Hyafil and Ronald L. Rivest. Constructing optimal binary decision trees is NP-complete. Inf. Process. Lett., 5(1):15–17, 1976.

[86] Ekaterini Ioannou, Claudia Niederée, and Wolfgang Nejdl. Probabilistic entity linkage for heterogeneous information spaces. In Proceedings of the 20th international conference on Advanced Information Systems Engineering, CAiSE '08, pages 556–570, Berlin, Heidelberg, 2008. Springer-Verlag.

[87] Panagiotis G. Ipeirotis, Foster Provost, and Jing Wang. Quality management on Amazon Mechanical Turk. In Proceedings of the ACM SIGKDD Workshop on Human Computation, HCOMP '10, pages 64–67, New York, NY, USA, 2010. ACM.

[88] Robert Isele and Christian Bizer. Learning expressive linkage rules using genetic programming. Proc. VLDB Endow., 5(11):1638–1649, July 2012.

[89] M.A. Jaro. UNIMATCH: A record linkage system: User's manual. Technical report, US Bureau of the Census, 1976.

[90] Jenny Rose Finkel, Trond Grenager, and Christopher Manning. Incorporating non-local information into information extraction systems by Gibbs sampling. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL 2005), 2005.

[91] S. S. Keerthi, S. K. Shevade, C. Bhattacharyya, and K. R. K. Murthy. Improvements to Platt's SMO algorithm for SVM classifier design. Neural Comput., 13(3):637–649, March 2001.

[92] Aleksandar Kellenberg. Identifying criteria of identity. Ontology Metaphysics, 10:109–122, 2009.

[93] W. Kent. The breakdown of the information model in multi-database systems. ACM SIGMOD Record, 20(4):10–15, 1991.

[94] W. Kent. The Unsolvable Identity Problem. In Extreme Markup Languages, 2003.

[95] W. Kent, R. Ahmed, J. Albert, M. Ketabchi, and M. Shan. Object identification in multidatabase systems. In Proceedings of the IFIP WG 2.6 Database Semantics Conference on Interoperable Database Systems (DS-5), Lorne, Victoria, Australia, 16-20 November 1992, pages 313–330, 1992.

[96] R. Kimball and J. Caserta. The Data Warehouse ETL Toolkit: Practical Techniques for Extracting, Cleaning, Conforming, and Delivering Data. John Wiley & Sons, 2004.

[97] S. C. Kleene. Introduction to Metamathematics. Princeton, 1950.

[98] Grzegorz Kondrak. N-gram similarity and distance. In Proceedings of the Twelfth International Conference on String Processing and Information Retrieval, 2005.

[99] Hanna Köpcke and Erhard Rahm. Training selection for tuning entity matching. In QDB/MUD, pages 3–12, 2008.

[100] Hanna Köpcke, Andreas Thor, and Erhard Rahm. Evaluation of entity resolution approaches on real-world match problems. Proc. VLDB Endow., 3(1-2):484–493, September 2010.

[101] Saul Kripke. Naming and Necessity. Basil Blackwell, Boston, 1980.

[102] Vladimir Levenshtein. Binary codes capable of correcting deletions, insertions, and reversals. Soviet Physics Doklady, 10:707–710, 1966.

[103] E.-P. Lim, J. Srivastava, S. Prabhakar, and J. Richardson. Entity identification in database integration. In Proceedings of the Ninth International Conference on Data Engineering, pages 294–301, April 1993.

[104] E. J. Lowe. Kinds of Being. A Study of Individuation, Identity, and the Logic of Sortal Terms. Blackwell, 1989.

[105] E. J. Lowe. What is a criterion of identity? The Philosophical Quarterly, 39:1–21, 1989.

[106] Mark Hall, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, Peter Reutemann, and Ian H. Witten. The WEKA data mining software: An update. SIGKDD Explorations, 11(1), 2009.

[107] Cynthia Matuszek, John Cabral, Michael Witbrock, and John DeOliveira. An introduction to the syntax and content of Cyc. In Proceedings of the 2006 AAAI Spring Symposium on Formalizing and Compiling Background Knowledge and Its Applications to Knowledge Representation and Question Answering, pages 44–49, 2006.

[108] M. Michalowski, S. Thakkar, and C. Knoblock. Exploiting secondary sources for automatic object consolidation. In Proceedings of the ACM SIGKDD-03 Workshop on Data Cleaning, Record Linkage, and Object Consolidation, 2003.

[109] M. Michalowski, S. Thakkar, and C. Knoblock. Exploiting secondary sources for unsupervised record linkage. In IIWeb, 2004.

[110] Matthew Michelson and Craig A. Knoblock. Learning blocking schemes for record linkage. In Proceedings of the 21st National Conference on Artificial Intelligence (AAAI), 2006.

[111] Zoltán Miklós, Nicolas Bonvin, Paolo Bouquet, Michele Catasta, Daniele Cordioli, Peter Fankhauser, Julien Gaugaz, Ekaterini Ioannou, Hristo Koshutanski, Antonio Maña, Claudia Niederée, Themis Palpanas, and Heiko Stoermer. From web data to entities and back. In Proceedings of the 22nd international conference on Advanced information systems engineering, CAiSE'10, pages 302–316, Berlin, Heidelberg, 2010. Springer-Verlag.

[112] S. Minton, C. Nanjo, C. Knoblock, M. Michalowski, and M. Michelson. A heterogeneous field matching method for record linkage. In IEEE International Conference on Data Mining (ICDM), pages 314–321, 2005.

[113] Thomas M. Mitchell. Machine Learning. McGraw-Hill, Inc., New York, NY, USA, 1st edition, 1997.

[114] Alvaro Monge and Charles Elkan. The field matching problem: Algorithms and applications. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, 1996.

[115] Sreerama K. Murthy. Automatic construction of decision trees from data: A multi-disciplinary survey. Data Min. Knowl. Discov., 2(4):345–389, December 1998.

[116] Saul B. Needleman and Christian D. Wunsch. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, 48:443–453, 1970.

[117] Axel-Cyrille Ngonga Ngomo. On link discovery using a hybrid approach. J. Data Semantics, 1(4):203–217, 2012.

[118] Xing Niu, Shu Rong, Haofen Wang, and Yong Yu. An effective rule miner for instance matching in a web of data. In Proceedings of the 21st ACM international conference on Information and knowledge management, CIKM '12, pages 1085–1094, New York, NY, USA, 2012. ACM.

[119] Lawrence Philips. Hanging on the Metaphone. Computer Language, 7, 1990.

[120] John C. Platt. Fast training of support vector machines using sequential minimal optimization. In Advances in Kernel Methods, pages 185–208. MIT Press, Cambridge, MA, USA, 1999.

[121] W. V. Quine. Events and reification. In E. Lepore and B. McLaughlin, editors, Actions and Events: Perspectives on the Philosophy of Davidson. Blackwell, 1985.

[122] Willard Quine. Ontological Relativity and Other Essays. Columbia University Press, 1969.

[123] J. R. Quinlan. Generating production rules from decision trees. In Proceedings of the 10th international joint conference on Artificial intelligence - Volume 1, IJCAI’87, pages 304–307, San Francisco, CA, USA, 1987. Morgan Kaufmann Publishers Inc.

[124] J. Ross Quinlan. C4.5: programs for machine learning. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1993.

[125] Vibhor Rastogi, Nilesh Dalvi, and Minos Garofalakis. Large-scale collective entity matching. Proc. VLDB Endow., 4(4):208–218, January 2011.

[126] Nalini K. Ratha and Ruud Bolle. Automatic Fingerprint Recognition Systems. Springer-Verlag, 2003.

[127] P. Ravikumar and W. Cohen. A hierarchical graphical model for record linkage. In Proceedings of the 20th conference on Uncertainty in artificial intelligence, UAI ’04, pages 454–461, Arlington, Virginia, United States, 2004. AUAI Press.

[128] Shu Rong, Xing Niu, Evan Wei Xiang, Haofen Wang, Qiang Yang, and Yong Yu. A machine learning approach for instance matching based on similarity metrics. In Philippe Cudré-Mauroux, Jeff Heflin, Evren Sirin, Tania Tudorache, Jérôme Euzenat, Manfred Hauswirth, Josiane Xavier Parreira, Jim Hendler, Guus Schreiber, Abraham Bernstein, and Eva Blomqvist, editors, The Semantic Web - ISWC 2012, volume 7649 of Lecture Notes in Computer Science, pages 460–475. Springer Berlin Heidelberg, 2012.

[129] Bertrand Russell. On denoting. Mind, 14(56):479–493, 1905.

[130] R. C. Russell, 1918.

[131] S. Sarawagi and A. Bhamidipaty. Interactive deduplication using active learning. In KDD ’02: Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 269–278, New York, NY, USA, 2002. ACM.

[132] L. Sauermann, R. Cyganiak, and M. Völkel. Cool URIs for the semantic web. http://www.dfki.uni-kl.de/~sauermann/2006/11/cooluris/, August 2007. Revision 1.1.

[133] W. Shen, X. Li, and A. Doan. Constraint-based entity matching. In Proceedings of the 20th national conference on Artificial intelligence - Volume 2, pages 862–867. AAAI Press, 2005.

[134] P. Singla and P. Domingos. Entity resolution with Markov logic. In Proceedings of the Sixth IEEE International Conference on Data Mining (ICDM '06), pages 572–582, December 2006.

[135] Jennifer Sleeman and Tim Finin. A machine learning approach to linking FOAF instances. In AAAI Spring Symposium: Linked Data Meets Artificial Intelligence, 2010.

[136] David Smiley and Eric Pugh. Apache Solr 3 Enterprise Search Server. Packt Publishing, 1st edition, November 2011.

[137] T.F. Smith and M.S. Waterman. Identification of common molecular subsequences. Journal of Molecular Biology, 147:195–197, 1981.

[138] H. Stoermer and P. Bouquet. A novel approach for entity linkage. In Proceedings of the IEEE International Conference on Information Reuse and Integration, IRI 2009, pages 151–156, 2009.

[139] Heiko Stoermer, Nataliya Rassadko, and Nachiket Vaidya. Feature-based entity matching: The FBEM model, implementation, evaluation. In Barbara Pernici, editor, Advanced Information Systems Engineering, volume 6051 of Lecture Notes in Computer Science, pages 180–193. Springer Berlin / Heidelberg, 2010.

[140] Fabian M. Suchanek, Gjergji Kasneci, and Gerhard Weikum. YAGO: A large ontology from Wikipedia and WordNet. Web Semantics: Science, Services and Agents on the World Wide Web, 6(3):203–217, 2008.

[141] Robert L. Taft. Name Search Techniques. Bureau of Systems Development, 1970.

[142] Jie Tang, Bang-Yong Liang, Juanzi Li, and Kehong Wang. Risk minimization based ontology mapping. In Chi-Hung Chi and Kwok-Yan Lam, editors, Content Computing, volume 3309 of Lecture Notes in Computer Science, pages 469–480. Springer Berlin Heidelberg, 2004.

[143] S. Tejada, C.A. Knoblock, and S. Minton. Learning object identification rules for information integration. Information Systems, 26:607–633, December 2001.

[144] Rattapoom Tuchinda, Craig A. Knoblock, and Pedro Szekely. Building mashups by demonstration. ACM Transactions on the Web (TWEB), 5(3), July 2011. http://dx.doi.org/10.1145/1993053.1993058.

[145] Nicholas Unwin. The individuation of events. Mind, 105:315–330, 1996.

[146] Dirk van Dalen. The Blackwell Guide to Philosophical Logic, chapter Intuitionistic Logic, pages 224–257. Blackwell, Oxford, 2001.

[147] J. Volz, C. Bizer, M. Gaedke, and G. Kobilarov. Discovering and maintaining links on the web of data. In The Semantic Web - ISWC 2009, volume 5823 of Lecture Notes in Computer Science, pages 650–665. Springer Berlin / Heidelberg, 2009.

[148] Jiannan Wang, Tim Kraska, Michael J. Franklin, and Jianhua Feng. CrowdER: crowdsourcing entity resolution. Proc. VLDB Endow., 5(11):1483–1494, July 2012.

[149] Jiannan Wang, Guoliang Li, Jeffrey Xu Yu, and Jianhua Feng. Entity matching: how similar is similar. Proc. VLDB Endow., 4(10):622–633, July 2011.

[150] J.R. Wang and S.E. Madnick. The inter-database instance identification problem in integrating autonomous systems. In Proceedings of the Fifth International Conference on Data Engineering, pages 46–55, February 1989.

[151] M.S. Waterman, T.F. Smith, and W.A. Beyer. Some biological sequence metrics. Advances in Mathematics, 20:367–387, 1976.

[152] Xue-wen Chen and Jong Cheol Jeong. Enhanced recursive feature elimination. In Sixth International Conference on Machine Learning and Applications (ICMLA 2007), pages 429–435, 2007.

[153] Steven Euijong Whang, Omar Benjelloun, and Hector Garcia-Molina. Generic entity resolution with negative rules. The VLDB Journal, 18(6):1261–1277, 2009.

[154] Alfred North Whitehead and Bertrand Russell. Principia Mathematica. Cambridge University Press, 1910.

[155] W. E. Winkler. String comparator metrics and enhanced decision rules in the Fellegi-Sunter model of record linkage. In Proceedings of the Section on Survey Research Methods (American Statistical Association), 1990.

[156] W.E. Winkler. Improved decision rules in the Fellegi-Sunter model of record linkage. Technical Report Statistical Research Report Series RR93/12, US Bureau of the Census, Washington D.C., 1993.

[157] W.E. Winkler. Methods for record linkage and Bayesian networks. Statistical Research Report RRS2002/05, US Bureau of the Census, Washington, D.C., 2002.

[158] Ningning Wu, John Talburt, Chris Heien, Nick Pippenger, Chia-Chu Chiang, Elizabeth Pierce, Ebony Gulley, and JaMia Moore. A method for entity identification in open source documents with partially redacted attributes. J. Comput. Small Coll., 22(5):138–144, 2007.

[159] Chuan Xiao, Wei Wang, Xuemin Lin, and Jeffrey Xu Yu. Efficient similarity joins for near duplicate detection. In Proceedings of the 17th international conference on World Wide Web, WWW ’08, pages 131–140, New York, NY, USA, 2008. ACM.

[160] Huimin Zhao and Sudha Ram. Entity identification for heterogeneous database integration: a multiple classifier system approach and empirical evaluation. Inf. Syst., 30(2):119–132, 2005.

[161] Huimin Zhao and Sudha Ram. Entity matching across heterogeneous data sources: An approach based on constrained cascade generalization. Data Knowl. Eng., 66(3):368–381, September 2008.