Using an external knowledge base to link domain-specific datasets

Gideon G.A. Mooijen 10686290

Bachelor thesis Credits: 18 EC

Bachelor Opleiding Kunstmatige Intelligentie University of Amsterdam Faculty of Science Science Park 904 1098 XH Amsterdam

Supervisor Dhr. Prof. dr. P. T. Groth University of Amsterdam Faculty of Science Science Park 904 1098 XH Amsterdam

June 28th, 2019 Abstract

Conventional record-linkage methods are word-similarity based. A metrics of choice cal- culates the records that are most similar and present them as matches. The emergence of comprehensive online knowledge bases gives rise to the presumption that for domain- specific record-linking, incorporation of the notion of the entities in dispute might offer a new approach in linking records. As the context of the labels is present, the knowl- edge base in choice can be used to form the mapping between labels and entity. This knowledge-based record-linking (KBRL) outperforms the conventional method in terms of precision. It shows to be size-invariant, where the performance of conventional methods decreases if the labels show high variance or there are a large number of candidates per entry. Additionally, KBRL seems to more resilient to noisy and incomplete data. Contents

1 Introduction 1 1.1 Approach ...... 2

2 Related work 3 2.1 Word-similarity based RL ...... 3 2.1.1 Levenshtein ...... 3 2.1.2 Jaro-Winkler ...... 3 2.1.3 Qgram ...... 4 2.1.4 LCS ...... 4 2.2 Knowledgebased RL ...... 4

3 Method 6 3.1 Knowledgebased RL ...... 6 3.2 Word-similarity RL ...... 8

4 Experiments 9 4.1 Data acquisition ...... 9 4.1.1 Dataset A ...... 9 4.1.2 Dataset B ...... 10 4.2 Wikidict construction ...... 12 4.3 Record linkage ...... 12 4.4 Golden standard ...... 13

5 Results 14 5.1 Performance scores ...... 18

6 Evaluation 19 6.1 Mismatches ...... 19

7 Conclusion 20 Chapter 1

Introduction

Data analysis is a powerful tool that is widely used in a large number of disciplines. Amongst others, it can be used to detect anomalies in medical systems to increase cancer survival rates or provide demographic statistics. To perform this type of research, compre- hensive and valid data is required. This data is not always available. In some cases, the desired data is scattered throughout multiple files and locations. These multiple datasets might each contain different and valuable information on the same person, institution, or country. The combination of these datasets offers new insights over the matter in dispute.

The most obvious approach is to merge the datasets. This technique requires that both datasets are free of noise and compatible. It is possible to transform datasets into noise-free and compatible datasets through preprocessing. The third criterium is more complicated to establish: both datasets need to maintain the same labels for the same entities. If this is not the case, the system is not aware of the fact that different entries of dataset A and dataset B concerns the same entity. If this matching is not established, combining the data fails.

An example: one dataset states that ’The Netherlands’ has 17 million inhabitants in 2017. A different dataset states that ’Netherlands’ scored a 7.23 on the subject: ’how high would the average resident score their level of happiness’. In order to detect the correla- tion between these two factors, a technique is required to express the fact that both ’The Netherlands’ and ’Netherlands’ are different labels for the same entity.

Real life data rarely meet this ’conformity of labels’ criterium. The lack of conformity can be resolved by a technique called record linking (RL). This is the task of matching elements cross-database. Many RL-algorithms are metrics-based; they calculate word sim- ilarity between potential matches. This type of matching lacks an actual notion of the entity in dispute; it merely establishes the alikeness of the labels.

The incentive for this project was the presumption that a different approach that does include the ’notion’ of the entities might offer advantages over metrics-based techniques. The hypothesis is that for domain-specific record-linkage, a well-chosen knowledgebase (KB) can map different labels to the correct entities. This mapping is a more sophisticated and reliable method as the KB ensures that the different labels are indeed referring to the same entity, rather than being labels with high word similarity. This results in the following research question:

Does the utilization of an external knowledgebase and search engine provide

1 1.1. APPROACH CHAPTER 1. INTRODUCTION

advantages over metrics based record linking for domain-specific datasets?

1.1 Approach

To answer this question, a case study is performed. The domain of choice is transfers in European soccer. This domain meets all requirements in order to answer or research question: To answer this question, a case study is performed. The domain of choice is transfers in European soccer. This domain meets all requirements in order to answer the research question:

1. There is much data available.

2. The different datasets maintain different labels for the same entities.

3. There is a KB available to enable the record linking.

4. The combination of the data yields interesting competition-specific statistics that can be used for data analysis.

To answer the research question, two datasets are extracted online. Dataset A contains information on transfers and is structured per year and competition. Dataset B is an extensive, unstructured database that also contains fees of transfers. The approach is to extract the fees from the correct matches for all transfers in Dataset A, achieved through record linkage. This is executed multiple times with different techniques. One of these techniques is knowledgebase record linking (KBRL); the others utilize standard word- similarity based methods (WSRL). Finally, a golden standard was constructed to evaluate the performance of KBRL with respect to WSRL.

2 Chapter 2

Related work

Previous research was examined to develop the approach for comparing the performance of WSRL with KBRL.

2.1 Word-similarity based RL

Essential for this project is the evaluation of the WSRL. The four techniques that are used in this project range from simple to complex and are entirely different but share one fundamental property: they calculate word-similarity.

2.1.1 Levenshtein The Levenshtein-distance [3] (LD) is one of the most straightforward word-similarity mea- surements. It calculates the number of operations required to transform one string into another. Possible operations are the addition, deletion, or swapping of letters. The LD increased with 1 for every required operation.

 max(i, j) if min(i, j) = 0    LDs1,s2(i − 1, j) + 1 LDs1,s2(i, j)  (2.1) min LDs1,s2(i, j − 1) + 1 otherwise   LDs1,s2(i − 1, j − 1) + 1(as16=bs2)

This procedure produces the LD for strings s1 and s2. Identical strings have the minimal possible LD: 0. A high LD indicates low word-similarity.

2.1.2 Jaro-Winkler A different method is Jaro-Winkler (JW) [6]. It is similar to Levenshtein but has addi- tional mechanisms that compensate for significant differences in length. These mechanisms solve the problem that arises when abbreviations are used.

In order to calculate the Jaro-Winkler-similarity (simw), firstly the Jaro-distance (simj) is calculated.

 0 if m = 0 sim = (2.2) j 1 ( m + m + m−t ) otherwise 3 |s1| |s2| m

• s are the strings that are under evaluation.

3 2.2. KNOWLEDGEBASED RL CHAPTER 2. RELATED WORK

• m is the number of matching characters.

• t is half the amount of transpositions required to match two characters in the different strings.

Subsequently, the Jaro-Winkler-similarity (simw) is calculated.

simw = simj + lp(1 − simj) (2.3)

• l ranges from {0, 4}, is defined by the range for the first characters that are identical.

• p defines the weight of l. It ranges from {0.0, 0.25} and quantifies how much l favors the alikeness of the strings. If p is large, strings with identical prefixes will have high similarity. If p is small, identical prefixes have a lower impact on the total calculation.

For this project p = 0.1 is maintained.

2.1.3 Qgram Qgram string matching is introduced by Ukkonen [5].

X Dq(x, y) = |G(x)[v] − G(y)[v]| (2.4) v∈Pq

For this project, q = 2 is maintained. This means that for every string, each 2-gram is calculated. An example is given below:

br ro ot th he er mo brother 1 1 1 1 1 1 0 mother 0 0 1 1 1 1 1 The L1 norm (or Manhattan-distance) of these q-gram profiles indicate the similarity. The similarity increases as the amount of overlapping 2-grams increases and so, decreases the distance.

2.1.4 LCS The longest common subsequence (LCS) [1] can be transformed into a metric as well. f(s1, s2) denotes the LCS. M(s1, s2) denotes the length of the largest of both strings. This is transformed into a metric through the following equation:

f(s1, s2) d(s1, s2) = 1 − (2.5) M(s1, s2)

When the strings are equal, d(s1, s2) = 1. As the LCS or the relative size of one of the strings decreases, this value becomes smaller. If no characters coincide, the value is 0.

2.2 Knowledgebased RL

Many of the word-similarity metrics were developed in an era where the internet was not widely adopted or did not even exist yet. The progression of online encyclopedias might seem to be a valuable asset for record linking. WSRL is an ’unaware’ technique, where

4 CHAPTER 2. RELATED WORK 2.2. KNOWLEDGEBASED RL an extensive KB can provide certainty. The KB might have information about multiple different labels for the same entity. Shen et al [4] showed that their developed LINDEN framework outperformed all other state-of-the-art techniques for Entity Linking. Their goal was to extend texts with additional information on the entities in that text. The challenging part was disambiguation; ensuring that Michael Jordan referred to the bas- ketball player if it fits that context and that it referred to the computer scientist if the name was used in that context.

Doan et al. [2] described a solution to the problem that arises when multiple labels refer to the same entity. The proposed solution entails a concordance table, with the different labels as entries on the columns for every row (entity).

5 Chapter 3

Method

3.1 Knowledgebased RL

The manual algorithm is named knowledge-based record-linkage. The name is derived from the mechanics: using all present labels and a knowledge base and convert that into a dictionary that maps all different labels to the correct entity. The first step is to extract all labels from the databases.

Algorithm 1 Extracting all elements from the correct columns as labels Data: D, important columns Result: labels for entry ∈ D do for element ∈ entry do if element ∈ important columns then labels.append(element) end end end return labels

Subsequently, domain-specific terms are added to disambiguate: providing the context for the KB. The algorithm restricts the search engine to the URL of the selected KB. The name that the KB maintains is extracted as the correct entity for that label.

Algorithm 2 Converting the labels to the correct entity Data: labels, terms, KB Result: labels mapped to entities mappings = [ ] for label ∈ labels do query = label + terms entity = get entity(query, KB) mappings.append(label,entity) end return mappings

6 CHAPTER 3. METHOD 3.1. KNOWLEDGEBASED RL

The entity retrieval is executed only once per label to achieve computationally efficient record linking. The resulting entities are stored in a dictionary that maps labels to the entity. This mechanism is an adaptation of the method that is described by Doan et al [2].

Algorithm 3 Obtain dictionary to facilitate record linkage between two datasets Data: D1, D2, features, terms Result: dictionary dict = {} labels D1 = extract labels(D1, features) labels D2 = extract labels(D2, features) all labels = merge(labels D1,labels D2)) unique labels = set(all labels) mappings = convert labels(unique labels) for label, entity ∈ mappings do if entity in dict then dict(entity).append(label) else dict(entity) = label end end return dict

Finally, comparing all entries in D1 to all candidates in D2 leads to the presentation of matches. If the elements are identifiers - elements that are identical in both datasets, the algorithm checks whether they are equal. If this is the case, the first criterium is met. Subsequently, it compares the labels. The algorithm checks if these labels refer to the same entity according to the previously constructed dictionary. After this stage, if no match fails, the candidate is presented as a correct match to the entry.

Algorithm 4 Matching entries to candidates Data: D1, D2, dict Result: matched (entry, candidate) tuples matches = [ ] for entry ∈ D1 do for candidate ∈ D2 do for i in range (0, len(entry)) do if entry[i] != candidate[i] and: dict(entry[i]) != dict(candidate[i]) then break end matches.append((entry, candidate)) end end end return matches

7 3.2. WORD-SIMILARITY RL CHAPTER 3. METHOD

3.2 Word-similarity RL

KBRL does not calculate a degree of certainty. A match is presented if and only if all entries are equal or they are labels to the same entity. WSRL works differently. Before performing the record linking, some parameters need to be specified.

1. Identifiers. The algorithm needs to be aware of which elements need to be identical in both datasets. Every time that an entry is compared to a candidate, it only compares those candidates for whom the identifiers are identical in both the entry and the candidate. This technique is called blocking and reduces computational complexity.

2. Elements that need matching: labels. The algorithm will calculate the word-similarity for every (label, label) pair; one in the entry, one in the candidate. The similarities for every pair are added up, and the result is the word-similarity for all labels com- bined. Other options are to calculate the sum of squared errors. This option will favour candidates who do not have (label, label) pairs with extremely low similarity.

3. Elements that do not need matching: these will be ignored for the record linking.

WSRL will always present a candidate, provided that there is at least one candidate with coinciding identifiers. If there are multiple candidates with coinciding identifiers, it will present the candidate with the highest word similarity between the labels.

8 Chapter 4

Experiments

4.1 Data acquisition

As mentioned before: two datasets are used. Dataset A (DB) and dataset B (DB) both contain information on transfers, but DB has more information than DA: it also contains the fee of the transfers. An example of entries in both datasets, after acquisition and preprocessing, is given below:

year name from club to club sum Source A 2018-2019 Dominic Iorfa Wolverhampton Wanderers Sheffield W Source B 2018-2019 Dominic Iorfa Wolves Sheff Wed e2 Mill. The function of DA is to define the scope of the data. It allows specificity: which transfers are required for the data analysis. It yields all transfers for players who left their club between the considered years.

4.1.1 Dataset A To obtain all required transfers, the website FootballSquads1 is used. DA contains all transfers for players in the Spanish, English, Dutch, German, Italian, French, Russian and Portuguese prime competitions. All transfers between 2011 and 2019 are used: 3244 transfers.

Figure 4.1: Architecture of the acquisition of DA

1http://www.footballsquads.co.uk/

9 4.1. DATA ACQUISITION CHAPTER 4. EXPERIMENTS

Link generator All URLs for the necessary files follow the same pattern. The root URL2 per nation, per year, contains all names of the participating clubs, as well as the link to their squad page for that season. The names and links are stored.

Webscraping The previous process yields all links to the pages that require extraction. Again, all squad pages satisfy the same format. This property allows for the construction of a generic web scraper. This web scraper downloads the HTML table on the website. The python-toolkit Pandas is used to convert all tables on that page to a data frame. The first data frame contains the necessary data. Below is an example of that data frame.

Number Name Nat Pos Height Date of Birth Old club 1 Petr Cech CZE G 1.96 20-05-82 Chelsea 2 H´ectorBeller´ın ESP D 1.78 19-03-95 Barcelona ...... 87 Bukayo Saka ENG M 05-09-01 ... Players no longer at this club Number Name Nat Pos Height Date of Birth New club 13 David Ospina COL G 1.83 31-08-88 Napoli (On Loan) 60 Gedion Zelalem USA M 1.85 26-01-97 http://footballsquads.co.uk/eng/2018-2019/engprem/arsenal.htm

Preprocessing Not all data is required. To decrease the computational complexity, all unneeded infor- mation is disregarded. For every transfer, the following information is stored:

year name from to 2018/2019 David Ospina Arsenal Napoli 2018/2019 Gedion Zelalem Arsenal Sporting Kansas City

Storage The tables are converted and stored in CSV files. All these transfer files can be found in: data/year/nation/ Such directories contain the following files:

data/2018-2019/england/Arsenal.csv data/2018-2019/england/Arsenal_transfers.csv data/2018-2019/england/... data/2018-2019/england/... data/2018-2019/england/Wolverhampton Wanderers.csv data/2018-2019/england/Wolverhampton Wandereres_transfers.csv

4.1.2 Dataset B The German website Transfermarkt3 provides extensive information on transfer fees.

2http://www.footballsquads.co.uk/eng/2018-2019/engprem.htm 3https://www.transfermarkt.de/

10 CHAPTER 4. EXPERIMENTS 4.1. DATA ACQUISITION

Architecture

Figure 4.2: Architecture of Dataset B acquisition

Link generator The year of the transfer and name of the player are extracted from Dataset A. Subse- quently, the name of the player is converted into a Google query. The Google queries are limited to Transfermarkt.de

Source A: 2017-2018 Martin Terrier Lille Lyon Name: Martin Terrier Google query: Martin Terrier player transfermarkt transfers Raw result: https://www.transfermarkt.de/martin-terrier/profil/spieler/442891 Processed result: https://www.transfermarkt.de/martin-terrier/transfers/spieler/442891

Webscraping The desired information is located in the body of the website.

Figure 4.3: All information present on Transfermarkt on the player Martin Terrier

All these transfers are extracted. An evaluation was needed to conclude that every row is a < tr > element with an identical class. All these elements have the same structure,

11 4.2. WIKIDICT CONSTRUCTION CHAPTER 4. EXPERIMENTS the contents of which are extracted using the Python toolkit BeautifulSoup. This toolkit allows for easy processing of HTML pages. It has built-in functions to search for certain elements (the required < tr > elements, for instance). This facilitates the extraction of all required data.

Preprocessing

The transfers are stored if and only if the value of the ’Season’ cell matches the year that is present in Dataset A. In the case of Martin Terrier all transfers are stored, with the exception of the last one, since the years do not match. In order to compare 2017 − 2018 with 17/18 or 15/16, a small converter was developed which uses string manipulation.

A similar converter was constructed to process the ’Fee’, found in the last column of the data. Often they are strings: players can be loaned to other clubs, return to their club, or have no fee at all. On other occasions, there was a price involved, which is expressed on their website as a combination of digits and text. A more desirable formulation is in solely integers, which was also achieved by a simple manual string manipulation algorithm.

DB did differentiate between the actual club and youth academies of that club, while DA does not. Example:

DA: 27 2011-2012 Shkodran Mustafi Everton Sampdoria DB: 39 2011-2012 Shkodran Mustafi Everton U21 Sampdoria 75000

The KB tends to have separate documents for these youth academies. This results in the fact that the matching would fail because DA is underspecified regarding this matter. For that reason, all labels are stripped from any occurrence of U18 - U23.

4.2 Wikidict construction

The final step in order to perform KBRL is accumulating all (label, entity) pairs to a dictionary. A small subset of the dictionary is given below:

1111 VVV-Venlo,[’VVV-Venlo’, ’VVV Venlo’] 1112 VV_Bennekom,[’VV Bennekom’] 1113 Valdres_FK,[’Valdres’, ’Valdres FK’] 1114 Valencia_CF,[’Valencia’] 1115 Valencia_CF_Mestalla,[’Valencia B’] 1116 Valenciennes_FC,[’Valenciennes’, ’Valenciennes FC’] 1117 Vancouver_Whitecaps_FC,[’Vancouver’, ’Vancouver Whitecaps’]

4.3 Record linkage

Both DA and DB are processed into a new file: dataset C (DC). This file has the same amount of entries as DA with an extra column for each of the record linking techniques. This values of the column are the fees that are extracted from the candidate in DB that was matched to that entry. Pseudo-code of this technique is described earlier on as Algorithm 4.

12 CHAPTER 4. EXPERIMENTS 4.4. GOLDEN STANDARD

4.4 Golden standard

There was no golden standard available for this project. In a way, the goal of the project was to find the type of record linkage algorithm that results in the golden standard, or as close to the golden standard as possible. In order to perform an evaluation, a subset of 100 samples from DC was constructed. For this subset, the correct fee was inserted manually. The same KB and sources were used for this manual data acquisition as the ones that DA and DB utilized.

13 Chapter 5

Results

The following tables show the performance of the five evaluated techniques: Wikidict (WD), Levenshtein (LS), Jaro-Winkler (JW), Qgram (QG), Longest Common Subse- quence (LCS). The golden standard (GS) is found in the last column.

The colours in the cells indicate whether they are false positives (red), true negatives (cyan) or false negatives (yellow). True positives are uncoloured.

• ’XXX’ means that the algorithm did not manage to match the entry to any candidate

• ’?’ and ’-’ indicates that a match is found, but the source did not have information on the fee

• End of loan (EOL) means that the player has finished his period of loan to the club in dispute

14 CHAPTER 5. RESULTS GS Free Free Free Free - 350000 8500000 1000000 Free EOL EOL EOL EOL 2500000 Free 1500000 EOL 7000000 Free 2000000 ? 950000 550000 Free 500000 - EOL Free 44730000 7800000 EOL ? Free EOL LCS Free Free Free Free XXX 350000 8500000 1000000 Free XXX EOL - EOL 2500000 Free 1500000 EOL 7000000 Free XXX ? 950000 550000 Free XXX - EOL Free 44730000 7800000 EOL ? Free EOL QG Free Free Free Free XXX 350000 8500000 1000000 Free XXX EOL - EOL 2500000 Free 1500000 EOL 7000000 Free XXX ? 950000 550000 Free XXX - EOL Free 44730000 7800000 EOL ? Free EOL JW Free Free Free Free XXX 350000 8500000 1000000 Free XXX EOL - EOL 2500000 Free 1500000 EOL 7000000 Free XXX ? 950000 550000 Free XXX - EOL Free 44730000 7800000 EOL ? Free EOL LS Free Free Free Free XXX 350000 8500000 1000000 Free XXX EOL - EOL 2500000 Free 1500000 EOL 7000000 Free XXX ? 950000 550000 Free XXX - EOL Free 44730000 7800000 EOL ? Free EOL WD Free Free Free XXX XXX XXX 8500000 1000000 XXX XXX EOL XXX EOL 2500000 Free 1500000 XXX 7000000 Free XXX ? 950000 550000 Free XXX - EOL Free 44730000 7800000 EOL ? Free EOL to Liverpool Blackpool FC Kbenhavn Levadiakos Cartagena Eintracht Frankfurt Palermo Bordeaux Sporting Gijn B Spartak Moscow Anzhi Makhachkala Dynamo Moscow Spartak Moscow Hull C Motherwell Derby Co Villarreal B Swansea C Atltico Madrid B Amrica Parma FC Utrecht Vitesse Botev Plovdiv Slavia Prague Paris St-Germain B Chelsea Wycombe W Manchester U Hull C Sunderland Luton T Brighton & HA Olympiakos from Manchester City West Bromwich Albion Mlaga Villarreal Zaragoza 1.FC Kaiserslautern Internazionale Lille Lorient Amkar Perm CSKA Moscow Krylia Sovetov Tom Tomsk Manchester United Sunderland Wigan Athletic Rayo Vallecano Valencia Valladolid VfB Stuttgart Milan ADO Den Haag Ajax Heracles Almelo Brest Paris Saint-Germain Vitria Guimares Arsenal Chelsea Everton Hull City West Ham United Celta Vigo Augsburg name Craig Bellamy Roman Bedn Kris Stadsgaard Mano Braulio Martin Amedick Emiliano Viviano Ludovic Obraniak Adama Tour Artur Valikayev Vladimir Gabulov Yuriy Kirillov Sergey Pesyakov Robbie Brady James McFadden Conor Sammon Nicki Bille Nielsen Pablo Hernndez Vctor Mongil Francisco Rodrguez Djamel Mesbah Jens Toornstra Theo Janssen Luis Pedro Tom Miola Antoine Conte Milan Lalkovi Anthony Jeffrey Juan Mata Nikica Jelavi Danny Graham Pelly Ruddock David Rodrguez Panagiotis Vlach. year 2011-2012 2011-2012 2011-2012 2011-2012 2011-2012 2011-2012 2011-2012 2011-2012 2011-2012 2011-2012 2011-2012 2011-2012 2011-2012 2012-2013 2012-2013 2012-2013 2012-2013 2012-2013 2012-2013 2012-2013 2012-2013 2012-2013 2012-2013 2012-2013 2012-2013 2012-2013 2012-2013 2013-2014 2013-2014 2013-2014 2013-2014 2013-2014 2013-2014 2013-2014 40 75 95 113 116 150 182 251 255 285 295 310 373 481 500 520 548 555 559 592 636 671 673 687 713 746 874 880 890 899 909 940 950 978

15 CHAPTER 5. RESULTS GS Free ? Free XXX 12000000 35500000 ? 1500000 3000000 EOL 250000 EOL Free Free ? EOL 700000 Free 200000 ? EOL 2100000 Free Free Loan Free 5000000 579000 1300000 Free 4700000 300000 12500000 700000 LCS Free ? Free XXX 12000000 35500000 ? 1500000 3000000 Free 250000 EOL Free Free ? EOL 700000 Free 200000 ? EOL 2100000 Free Free Loan Free 5000000 579000 1300000 Free 4700000 300000 - 700000 QG Free ? Free XXX 12000000 35500000 ? 1500000 3000000 Free 250000 EOL Free Free ? EOL 700000 Free 200000 ? EOL 2100000 Free Free Loan Free 5000000 579000 1300000 Free 4700000 300000 - 700000 JW Free ? Free XXX 12000000 35500000 ? 1500000 3000000 Free 250000 EOL Free Free ? EOL 700000 Free 200000 ? EOL 2100000 Free Free Loan Free 5000000 579000 1300000 Free 4700000 300000 - 700000 LS Free ? Free XXX 12000000 35500000 ? 1500000 3000000 Free 250000 EOL Free Free ? EOL 700000 Free 200000 ? EOL 2100000 Free Free Loan Free 5000000 579000 1300000 Free 4700000 300000 - 700000 WD Free ? XXX XXX 12000000 35500000 ? 1500000 3000000 XXX 250000 EOL Free Free ? EOL 700000 Free 200000 ? EOL 2100000 Free Free Loan Free 5000000 579000 1300000 XXX 4700000 300000 XXX 700000 to Parma Chievo Verona Nacional [URU] Oudenaarde Marseille Chelsea Senica Monaco Milan Valencia Austria Vienna Milan AS Trenn FC Vestsjlland Perth Glory Olympiakos Bordeaux Lechia Gdask CRD Libolo Zamalek Arab Contractors Fiorentina Las Palmas Zaragoza Hertha Clermont Celta Vigo Roma Beikta Arsenal [ARG] Derby Co Stuttgart Wolfsburg Perugia from VfB Stuttgart Cagliari Fiorentina Heracles Almelo Lille Anzhi Makhachkala Paos de Ferreira Sporting Braga Chelsea Getafe VfB Stuttgart Genoa FC Dordrecht SC Heerenveen Heracles Almelo PEC Zwolle Paris Saint-Germain Amkar Perm Belenenses Martimo Nacional West Ham United Levante Sporting Gijn VfB Stuttgart Lorient Lyon Dynamo Moscow Spartak Moscow Arouca Watford Bayern Munich FSV Mainz 05 Crotone name Cristian Molinaro Michael Agazzi Gustavo Muna Mathias Schamp Florian Thauvin Willian Christian Irobiso Elderson Uwa Echiejile Fernando Torres Michel Raphael Holzhauser Alessandro Matri Ryan Koolwijk Jukka Raitala Dragan Palji Nikos Ioannidis Clment Chantme Jakub Wawrzyniak Fredy Mohamed Ibrahim Mahmoud Ezzat Mauro Zrate Nabil El Zhar Alberto Guitin Vedad Ibievi Enzo Reale Claudio Beauvue William Vainqueur Aras zbiliz Agustn Vuletich Ikechi Anya Julian Green Yunus Mall Nicol Fazzi year 2013-2014 2013-2014 2013-2014 2013-2014 2013-2014 2013-2014 2013-2014 2013-2014 2014-2015 2014-2015 2014-2015 2014-2015 2014-2015 2014-2015 2014-2015 2014-2015 2014-2015 2014-2015 2014-2015 2014-2015 2014-2015 2015-2016 2015-2016 2015-2016 2015-2016 2015-2016 2015-2016 2015-2016 2015-2016 2015-2016 2016-2017 2016-2017 2016-2017 2016-2017 999 1016 1027 1092 1132 1176 1278 1285 1298 1364 1410 1434 1480 1496 1499 1505 1541 1546 1585 1599 1605 1685 1711 1728 1767 1902 1908 1945 1966 1984 2097 2153 2178 2206

16 CHAPTER 5. RESULTS GS 2750000 3400000 ? 4500000 Free 435000 ? 400000 Free Free 1000000 Free 10000000 - 800000 Free 25000000 Free 3300000 Free 1500000 EOL EOL 20000000 5000000 EOL 800000 3800000 Free - XXX ? LCS 2750000 3400000 ? 4500000 Free 435000 ? 400000 Free 1300000 XXX Free 10000000 - 800000 Free 25000000 XXX 3300000 Free 1500000 EOL XXX 20000000 5000000 EOL 800000 3800000 Free - XXX ? QG 2750000 3400000 ? 4500000 Free 435000 ? 400000 Free 1300000 XXX Free 10000000 - 800000 Free 25000000 XXX 3300000 Free 1500000 EOL XXX 20000000 5000000 EOL 800000 3800000 Free - XXX ? JW 2750000 3400000 ? 4500000 Free 435000 ? 400000 Free 1300000 XXX Free 10000000 - 800000 Free 25000000 XXX 3300000 Free 1500000 EOL XXX 20000000 5000000 EOL 800000 3800000 Free - XXX ? LS 2750000 3400000 ? 4500000 Free 435000 ? 400000 Free 1300000 XXX Free 10000000 - 800000 Free 25000000 XXX 3300000 Free 1500000 EOL XXX 20000000 5000000 EOL 800000 3800000 Free - XXX ? WD 2750000 3400000 ? 4500000 Free 435000 ? 400000 Free XXX XXX Free 10000000 XXX 800000 Free 25000000 XXX 3300000 Free 1500000 EOL XXX 20000000 5000000 EOL 800000 3800000 Free XXX XXX ? to Watford Torino Ternana Atlanta U AEK Larnaca Real Salt Lake Kairat Almaty Zenit St Petersburg FC Krasnodar So Bento St-Etienne South Shields Rennes Ulsan Hyundai Darmstadt Sassuolo Chelsea Mordovia Saransk Zenit St Petersburg FC Mariupol Betis Atltico Madrid Benfica Monaco Paris St-Germain Udinese Vicenza Virtus Sporting CP Alashkert Belenenses FC Porto Altay from Fiorentina Napoli Pescara Torino ADO Den Haag FC Groningen Amkar Perm Krylia Sovetov Tom Tomsk FC Ufa Arouca Stoke City West Ham United FC Schalke 04 Roma Torino Anzhi Makhachkala Rubin Kazan SKA-Khabarovsk Zenit St. Petersburg Sporting Braga Vitria Setbal Bayer Leverkusen Bayern Munich Atalanta Udinese Akhmat Grozny FC Ufa Moreirense Santa Clara Tondela name Mauro Zrate Mirko Valdifiori Simone Aresti Josef Martnez Hector Hevel Albert Rusnk Chuma Anene Ibrahim Tsallagov Aleksandr Zhirov Diego Carlos Jorginho Jack Devlin Diafra Sakho Park Joo-Ho Felix Platte Riccardo Cappa Davide Zappacosta Alan Bagaev Magomed Ozdoev Denys Dedechko Javi Garca Andr Moreira Csar Benjamin Henrichs Juan Bernat Ali Adnan Simone Pontisso Idrissa Doumbia Denis Tumasyan Ousmane Dram Fernando Andrade Hlder Tavares year 2016-2017 2016-2017 2016-2017 2016-2017 2016-2017 2016-2017 2016-2017 2016-2017 2016-2017 2016-2017 2016-2017 2017-2018 2017-2018 2017-2018 2017-2018 2017-2018 2017-2018 2017-2018 2017-2018 2017-2018 2017-2018 2017-2018 2017-2018 2018-2019 2018-2019 2018-2019 2018-2019 2018-2019 2018-2019 2018-2019 2018-2019 2018-2019 2215 2231 2237 2252 2259 2279 2355 2387 2413 2428 2438 2565 2580 2632 2659 2698 2710 2798 2830 2834 2845 2894 2901 2969 2972 2998 3038 3125 3170 3204 3224 3236

17 5.1. PERFORMANCE SCORES CHAPTER 5. RESULTS

5.1 Performance scores

Figure 5.1: Performance of the different RL techniques

This figure shows a significant difference between the KBRL and WSRL. All four types of WSRL score equally for all measurements.

18 Chapter 6

Evaluation

6.1 Mismatches

The test set, DC, shows irregularities for multiple samples. Some of these are accumulated in the following table. As the four different WSRL techniques yield the same results, they are concatenated into one column: ’WSRL’.

Testset # year name from to WD WSRL GS 255 2011-2012 Adama Tour´e Lorient Sporting Gij´onB XXX Free Free 1364 2014-2015 M´ıchel Getafe Valencia XXX Free EOL 2438 2016-2017 Jorginho Arouca St-Etienne XXX XXX 1000000 For the transfers with mismatches, all candidates in DB are given below:

Dataset B # year name from to fee 369 2011-2012 Adama Tour´e FC Lorient Sporting B Free 370 2011-2012 Adama Tour´e Paris SG FC Lorient Free 1870 2014-2015 M´ıchel Novorizontino Guarani-SC Free Transfer In the case of Adama Tour´e,WSRL acquires the correct fee, the manual algorithm does not. Below are the Wikidict entries for the labels involved:

Wikidict Entity Label 1 Label 2 FC Lorient FC Lorient Lorient Sporting CP Sporting B Sporting CP Sporting de Gij´on B Sporting Gij´onB It becomes clear that the KBRL algorithm does not match the entry of Adama Tour´eto the right candidate, as the ’Sporting Gij´onB’ and ’Sporting B’ are not labels to the same entity. This shows a flaw in the functionality of Wikidict. In this case, the algorithm did not map ’Sporting Gij´onB’ to ’Sporting Gij´on’,as it concerns the second division club of ’Sporting Gij´on’. DA, similar as for the youth acadamies, does not differentiate over the primary club or secondary associations. In the case of M´ıchel, there are candidates in Dataset B that agree on the identifier (year, name). However, this is probably not the same M´ıchel - presumably this is a common name in soccer. This results in a false positive for WSRL. The Wikidict manages to ’detect’ that because the clubs are not different labels to the same entity: no match. In the case of Jorginho, there is no single candidate that agrees on both the year and the name. This results in false negatives for both algorithms.

19 Chapter 7

Conclusion

Important to note is that all inferences about the performance of the different algorithms apply for this specific project. It is no general comparison for the quality of the techniques, as this relies heavily on the domain, knowledge base and data.

The KBRL shows one advantage over WSRL for this specific project: it scores perfectly in terms of precision. This full score translates to the fact that all matches presented by this algorithm are valid. The algorithm is never wrong, in contrast to WSRL. This is in agreement with the hypothesis: the use of a knowledge base to incorporate the notion of the entities does provide advantages for record linking.

However, except for precision, WSRL outperforms KBRL. This outperformance is based on the combination of the number of candidates per entry and the variance in the labels. As these increase, the probability that the algorithm presents an incorrect match increases. This increase is logical: a broader spectrum of options that show high diversity increases the chances of similar labels for wrong candidates. The KBRL technique is size-invariant. If the mapping from labels to entities is successful, the matching will succeed regardless of the number of candidates and variance in label.

Additionally, KBRL has proven to be more prone to noise and incomplete data than WSRL. The word-similarity algorithms have shown to present false matches because the data was incomplete. KBRL might be a suitable approach for a project that utilizes noisy and incomplete data because it does not produce false positives when the data is incorrect or absent.

Knowledge-based record-linking shows advantages over conventional word-similarity based record linking for domain specific tasks in terms of precision. The technique is promising for domain-specific projects that use noisy or incomplete data, or data sets that contain a large number of candidates per entry.

20 Bibliography

[1] Daniel Bakkelund. “An LCS-based string metric”. In: Olso, Norway: University of Oslo (2009). [2] AnHai Doan, Alon Halevy, and Zachary Ives. “3 - Describing Data Sources”. In: Principles of Data Integration. Ed. by AnHai Doan, Alon Halevy, and Zachary Ives. Boston: Morgan Kaufmann, 2012, pp. 65–94. isbn: 978-0-12-416044-6. doi: https: / / doi . org / 10 . 1016 / B978 - 0 - 12 - 416044 - 6 . 00003 - X. url: http : / / www . sciencedirect.com/science/article/pii/B978012416044600003X. [3] V. I. Levenshtein. “Binary Codes Capable of Correcting Deletions, Insertions and Reversals”. In: Soviet Physics Doklady 10 (Feb. 1966), p. 707. [4] Wei Shen et al. “LINDEN: Linking named entities with knowledge base via semantic knowledge”. In: WWW’12 - Proceedings of the 21st Annual Conference on World Wide Web (Apr. 2012). doi: 10.1145/2187836.2187898. [5] Esko Ukkonen. “Approximate string-matching with q-grams and maximal matches”. In: Theoretical Computer Science 92.1 (1992), pp. 191–211. issn: 0304-3975. doi: https : / / doi . org / 10 . 1016 / 0304 - 3975(92 ) 90143 - 4. url: http : / / www . sciencedirect.com/science/article/pii/0304397592901434. [6] William E. Winkler. “String Comparator Metrics and Enhanced Decision Rules in the Fellegi-Sunter Model of Record Linkage”. In: Proceedings of the Section on Survey Research. Wachington, DC, 1990, pp. 354–359.

21