Using an External Knowledge Base to Link Domain-Specific Datasets
Using an external knowledge base to link domain-specific datasets

Gideon G.A. Mooijen
10686290
Bachelor thesis, Credits: 18 EC
Bachelor Opleiding Kunstmatige Intelligentie
University of Amsterdam, Faculty of Science
Science Park 904, 1098 XH Amsterdam

Supervisor: Dhr. Prof. dr. P. T. Groth
University of Amsterdam, Faculty of Science
Science Park 904, 1098 XH Amsterdam

June 28th, 2019

Abstract

Conventional record-linkage methods are word-similarity based: a metric of choice calculates the records that are most similar and presents them as matches. The emergence of comprehensive online knowledge bases gives rise to the presumption that, for domain-specific record linkage, incorporating a notion of the entities in question might offer a new approach to linking records. Because the context of the labels is available, the knowledge base of choice can be used to form the mapping between labels and entities. This knowledge-based record linking (KBRL) outperforms the conventional methods in terms of precision. It appears to be size-invariant, whereas the performance of conventional methods decreases when the labels show high variance or there are a large number of candidates per entry. Additionally, KBRL seems to be more resilient to noisy and incomplete data.

Contents

1 Introduction
  1.1 Approach
2 Related work
  2.1 Word-similarity based RL
    2.1.1 Levenshtein
    2.1.2 Jaro-Winkler
    2.1.3 Qgram
    2.1.4 LCS
  2.2 Knowledgebased RL
3 Method
  3.1 Knowledgebased RL
  3.2 Word-similarity RL
4 Experiments
  4.1 Data acquisition
    4.1.1 Dataset A
    4.1.2 Dataset B
  4.2 Wikidict construction
  4.3 Record linkage
  4.4 Golden standard
5 Results
  5.1 Performance scores
6 Evaluation
  6.1 Mismatches
7 Conclusion

Chapter 1 Introduction

Data analysis is a powerful tool that is widely used in a large number of disciplines. Amongst others, it can be used to detect anomalies in medical systems to increase cancer survival rates, or to provide demographic statistics. To perform this type of research, comprehensive and valid data is required. This data is not always available. In some cases, the desired data is scattered throughout multiple files and locations. These multiple datasets might each contain different and valuable information on the same person, institution, or country. The combination of these datasets offers new insights into the matter at hand.

The most obvious approach is to merge the datasets. This technique requires that both datasets are free of noise and compatible. It is possible to transform datasets into noise-free and compatible datasets through preprocessing. The third criterion is more complicated to establish: both datasets need to maintain the same labels for the same entities. If this is not the case, the system is not aware of the fact that different entries of dataset A and dataset B concern the same entity. If this matching is not established, combining the data fails.

An example: one dataset states that 'The Netherlands' had 17 million inhabitants in 2017. A different dataset states that 'Netherlands' scored a 7.23 on the question 'how high would the average resident rate their level of happiness'. In order to detect the correlation between these two factors, a technique is required to express the fact that both 'The Netherlands' and 'Netherlands' are different labels for the same entity. Real-life data rarely meets this 'conformity of labels' criterion.
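To make this failure mode concrete, the following minimal sketch (illustrative only; it is not code from this thesis, and the Belgium figures are made-up placeholders) shows how an exact join on labels silently drops a record whose labels differ between datasets:

```python
# Toy illustration of the 'conformity of labels' problem described above.
# Only the Netherlands values come from the example in the text; the Belgium
# figures are placeholders added for illustration.

population = {"The Netherlands": 17_000_000, "Belgium": 11_000_000}
happiness = {"Netherlands": 7.23, "Belgium": 7.5}

# Naive merge: only labels that match character-for-character are combined.
merged = {
    label: (population[label], happiness[label])
    for label in population
    if label in happiness
}

print(merged)
# {'Belgium': (11000000, 7.5)} -- the Netherlands record is silently dropped,
# because 'The Netherlands' and 'Netherlands' are different labels for the
# same entity; a record-linkage step is needed to bridge them.
```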
The lack of conformity can be resolved by a technique called record linking (RL): the task of matching elements across databases. Many RL algorithms are metrics-based; they calculate word similarity between potential matches. This type of matching lacks an actual notion of the entity in question; it merely establishes the alikeness of the labels. The incentive for this project was the presumption that a different approach, one that does include a 'notion' of the entities, might offer advantages over metrics-based techniques. The hypothesis is that for domain-specific record linkage, a well-chosen knowledgebase (KB) can map different labels to the correct entities. This mapping is a more sophisticated and reliable method, as the KB ensures that the different labels are indeed referring to the same entity, rather than merely being labels with high word similarity. This results in the following research question: Does the utilization of an external knowledgebase and search engine provide advantages over metrics-based record linking for domain-specific datasets?

1.1 Approach

To answer this question, a case study is performed. The domain of choice is transfers in European soccer. This domain meets all requirements needed to answer the research question:

1. There is much data available.
2. The different datasets maintain different labels for the same entities.
3. There is a KB available to enable the record linking.
4. The combination of the data yields interesting competition-specific statistics that can be used for data analysis.

To answer the research question, two datasets are extracted online. Dataset A contains information on transfers and is structured per year and competition. Dataset B is an extensive, unstructured database that also contains transfer fees. The approach is to extract the fees from the correct matches for all transfers in Dataset A, achieved through record linkage. This is executed multiple times with different techniques. One of these techniques is knowledgebase record linking (KBRL); the others utilize standard word-similarity based methods (WSRL). Finally, a golden standard was constructed to evaluate the performance of KBRL with respect to WSRL.

Chapter 2 Related work

Previous research was examined to develop the approach for comparing the performance of WSRL with KBRL.

2.1 Word-similarity based RL

Essential for this project is the evaluation of the WSRL. The four techniques that are used in this project range from simple to complex and are entirely different, but they share one fundamental property: they calculate word similarity.

2.1.1 Levenshtein

The Levenshtein distance [3] (LD) is one of the most straightforward word-similarity measurements. It calculates the number of operations required to transform one string into another. Possible operations are the insertion, deletion, or substitution of characters. The LD increases by 1 for every required operation.

LD_{s_1,s_2}(i, j) =
\begin{cases}
\max(i, j) & \text{if } \min(i, j) = 0 \\
\min \begin{cases}
LD_{s_1,s_2}(i-1, j) + 1 \\
LD_{s_1,s_2}(i, j-1) + 1 \\
LD_{s_1,s_2}(i-1, j-1) + 1_{(s_{1,i} \neq s_{2,j})}
\end{cases} & \text{otherwise}
\end{cases}
\qquad (2.1)

This procedure produces the LD for strings s1 and s2. Identical strings have the minimal possible LD: 0. A high LD indicates low word-similarity.
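As an illustration only (this is not the code used in the thesis), the recurrence in equation (2.1) can be implemented with a small dynamic-programming table; the function name and example strings below are chosen for demonstration:

```python
# A minimal sketch of the Levenshtein recurrence in equation (2.1),
# filled in bottom-up with a dynamic-programming table.

def levenshtein(s1: str, s2: str) -> int:
    # dp[i][j] holds the LD between the first i characters of s1
    # and the first j characters of s2.
    dp = [[0] * (len(s2) + 1) for _ in range(len(s1) + 1)]
    for i in range(len(s1) + 1):
        for j in range(len(s2) + 1):
            if min(i, j) == 0:              # base case: one prefix is empty
                dp[i][j] = max(i, j)
            else:
                dp[i][j] = min(
                    dp[i - 1][j] + 1,       # deletion
                    dp[i][j - 1] + 1,       # insertion
                    dp[i - 1][j - 1] + (s1[i - 1] != s2[j - 1]),  # substitution
                )
    return dp[len(s1)][len(s2)]

print(levenshtein("Netherlands", "The Netherlands"))  # 4: four insertions
```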
2.1.2 Jaro-Winkler

A different method is Jaro-Winkler (JW) [6]. It is similar to Levenshtein but has additional mechanisms that compensate for significant differences in length. These mechanisms solve the problem that arises when abbreviations are used. In order to calculate the Jaro-Winkler similarity (sim_w), first the Jaro distance (sim_j) is calculated.

sim_j =
\begin{cases}
0 & \text{if } m = 0 \\
\frac{1}{3}\left(\frac{m}{|s_1|} + \frac{m}{|s_2|} + \frac{m - t}{m}\right) & \text{otherwise}
\end{cases}
\qquad (2.2)

• s_1 and s_2 are the strings under evaluation.
• m is the number of matching characters.
• t is half the number of transpositions required to match two characters in the different strings.

Subsequently, the Jaro-Winkler similarity (sim_w) is calculated.

sim_w = sim_j + l \, p \, (1 - sim_j) \qquad (2.3)

• l ranges from 0 to 4 and is the number of leading characters that are identical in both strings (capped at four).
• p defines the weight of l. It ranges from 0.0 to 0.25 and quantifies how much l favours the alikeness of the strings. If p is large, strings with identical prefixes will have high similarity. If p is small, identical prefixes have a lower impact on the total calculation. For this project p = 0.1 is maintained.

2.1.3 Qgram

Qgram string matching was introduced by Ukkonen [5].

D_q(x, y) = \sum_{v \in P_q} \left| G(x)[v] - G(y)[v] \right| \qquad (2.4)

Here G(x)[v] denotes the number of occurrences of the q-gram v in string x, and P_q is the set of q-grams occurring in either string. For this project, q = 2 is maintained. This means that for every string, each 2-gram is calculated. An example is given below:

            br  ro  ot  th  he  er  mo
brother      1   1   1   1   1   1   0
mother       0   0   1   1   1   1   1

The L1 norm (or Manhattan distance) of these q-gram profiles indicates the similarity: as the number of overlapping 2-grams increases, the distance decreases.

2.1.4 LCS

The longest common subsequence (LCS) [1] can be transformed into a metric as well. f(s_1, s_2) denotes the length of the LCS; M(s_1, s_2) denotes the length of the longer of the two strings. This is transformed into a metric through the following equation:

d(s_1, s_2) = 1 - \frac{f(s_1, s_2)}{M(s_1, s_2)} \qquad (2.5)

When the strings are equal, d(s_1, s_2) = 0. As the LCS becomes smaller relative to the longer string, this value grows; if no characters coincide, the value is 1.

2.2 Knowledgebased RL

Many of the word-similarity metrics were developed in an era in which the internet was not widely adopted or did not even exist yet. The progression of online encyclopedias seems to be a valuable asset for record linking. WSRL is an 'unaware' technique, where