
LIMES - A Framework for Link Discovery on the Semantic Web

Axel-Cyrille Ngonga Ngomo, Mohamed Ahmed Sherif Paderborn University, Data Science Group, Pohlweg 51, D-33098 Paderborn, Germany, E-mail: {firstname.lastname}@upb.de

Kleanthi Georgala, Mofeed Mohamed Hassan, Kevin Dreßler, Klaus Lyko, Daniel Obraczka, Tommaso Soru University of Leipzig, Institute of Computer Science, AKSW Group, Augustusplatz 10, D-04009 Leipzig, Germany E-mail: {lastname}@informatik.uni-leipzig.de

Abstract

The Linked Data paradigm builds upon the backbone of distributed knowledge bases connected by typed links. The mere volume of current knowledge bases as well as their sheer number pose two major challenges when aiming to support the computation of links across and within them. The first is that tools for link discovery have to be time-efficient when they compute links. Secondly, these tools have to produce links of high quality to serve the applications built upon Linked Data well. The current version of the LIMES framework is the product of seven years of research on these two challenges. The framework combines diverse algorithms for link discovery within a generic and extensible architecture. In this system paper, we give an overview of the 1.0 open-source release of the framework. In particular, we present the architecture of the framework, an intuition of its inner workings and a brief overview of the approaches it contains. Some of the applications within which the framework was used complete the paper. Our framework is open-source and available under a dual license at http://github.com/aksw/limes-dev together with a user manual and a developer manual.

Keywords: Link Discovery, Record Linking, System Description, Machine Learning, Semantic Web, OWL, RDF

1. Introduction

Establishing links between knowledge bases is one of the key steps of the Linked Data publication process.¹ A plethora of approaches has thus been devised to support this process [1]. In this paper, we present the LIMES framework, which was designed to accommodate a large number of link discovery approaches within a single extensible architecture. LIMES was designed as a declarative framework (i.e., a framework that processes link specifications, see Section 2) to address two main challenges.

1. Time-efficiency: The mere size of existing knowledge bases (e.g., 30+ billion triples in LinkedGeoData [2], 20+ billion triples in LinkedTCGA [3]) makes efficient solutions indispensable to the use of link discovery frameworks in real application scenarios. LIMES addresses this challenge by providing time-efficient approaches based on the characteristics of metric spaces [4, 5], orthodromic spaces [6] and on filter-based paradigms [7].

2. Accuracy: Efficient solutions are of little help if the results they generate are inaccurate. Hence, LIMES also accommodates (especially machine-learning) solutions that allow the generation of links between knowledge bases with a high accuracy. These solutions abide by paradigms such as batch and active learning [8, 9, 10], unsupervised learning [10] and even positive-only learning [11].

The main goal of this paper is to give an overview of the LIMES framework and some of the applications within which it was used. We begin by presenting the link discovery problem and how we address this problem within a declarative setting (see Section 2). Then, we present our solution to the link discovery problem in the form of LIMES and its architecture. The subsequent sections present the different families of algorithms implemented within the framework. We begin with a short overview of the algorithms that ensure the efficiency of the framework (see Section 5). Thereafter, we give an overview of the algorithms that address the accuracy problem (see Section 6). We round up the core of the paper with some of the applications within which LIMES was used, including benchmarking and the publication of 5-star linked datasets (Section 7). A brief overview of related work (Section 8), followed by an overview of the evaluation results of algorithms included in LIMES (Section 9) and a conclusion (Section 10), completes the paper.

1 http://www.w3.org/DesignIssues/LinkedData.html

2. The Link Discovery Problem

The formal specification of Link Discovery (LD) adopted herein is akin to that proposed in [12]. Given two (not necessarily distinct) sets S resp. T of source resp. target resources as well as a relation R, the goal of LD is to find the set M = {(s, t) ∈

S × T : R(s, t)} of pairs (s, t) ∈ S × T such that R(s, t).

Preprint submitted to Journal of Web Semantics, 28th August 2017

In most cases, computing M is a non-trivial task. Hence, a large number of frameworks (e.g., SILK [13], LIMES [12] and KnoFuss [14]) aim to approximate M by computing the mapping M′ = {(s, t) ∈ S × T : σ(s, t) ≥ θ}, where σ is a similarity function and θ is a similarity threshold. For example, one can configure these frameworks to compare the dates of birth, family names and given names of persons across census records to determine whether they are duplicates. We call the equation which specifies M′ a link specification (short LS; also called linkage rule in the literature, see e.g., [13]). Note that the LD problem can be expressed equivalently using distances instead of similarities as follows: Given two sets S and T of instances, a (complex) distance measure δ and a distance threshold τ ∈ [0, ∞[, determine M′ = {(s, t) ∈ S × T : δ(s, t) ≤ τ}.² Consequently, distances and similarities are used interchangeably within this paper.

Under this so-called declarative paradigm, two entities s and t are then considered to be linked via R if σ(s, t) ≥ θ.

Figure 1: Example of a complex LS combining the filters f(edit(:socId, :socId), 0.5) and f(trigrams(:name, :label), 0.5). The filter nodes are rectangles while the operator nodes are circles. :socId stands for social security number.

LS        | [[LS]]_M
f(m, θ)   | {(s, t) | (s, t) ∈ M ∧ m(s, t) ≥ θ}
L1 ⊓ L2   | {(s, t) | (s, t) ∈ [[L1]]_M ∧ (s, t) ∈ [[L2]]_M}
L1 ⊔ L2   | {(s, t) | (s, t) ∈ [[L1]]_M ∨ (s, t) ∈ [[L2]]_M}
L1 \ L2   | {(s, t) | (s, t) ∈ [[L1]]_M ∧ (s, t) ∉ [[L2]]_M}

Table 1: Link Specification Syntax and Semantics

We define a filter as a function f(m, θ). We call a specification atomic when it consists of exactly one filtering function.
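To make the declarative setting concrete, the mapping M′ = {(s, t) ∈ S × T : σ(s, t) ≥ θ} can be computed naively as sketched below. This is an illustrative Python sketch, not part of LIMES (which is implemented in Java); the trigram measure shown is one common Jaccard-over-trigrams variant, and all names are ours.

```python
def trigram_similarity(s: str, t: str) -> float:
    """Jaccard similarity over character trigrams (one common variant)."""
    grams = lambda x: {x[i:i + 3] for i in range(max(1, len(x) - 2))}
    gs, gt = grams(s.lower()), grams(t.lower())
    return len(gs & gt) / len(gs | gt) if gs | gt else 0.0

def naive_link_discovery(S, T, sigma, theta):
    """Return M' = {(s, t) in S x T : sigma(s, t) >= theta}.

    This brute-force variant needs O(|S||T|) similarity computations,
    which is exactly the cost the algorithms of Section 5 avoid.
    """
    return {(s, t) for s in S for t in T if sigma(s, t) >= theta}

S = ["Berlin", "Paris"]
T = ["Berlin (city)", "London"]
links = naive_link_discovery(S, T, trigram_similarity, 0.3)
```

Here, only the pair ("Berlin", "Berlin (city)") clears the threshold.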
Naïve algorithms require O(|S||T|) ∈ O(n²) computations to output M′. Given the large size of existing knowledge bases, time-efficient approaches able to reduce this runtime are hence a central component of LIMES. In addition, note that the choice of appropriate σ and θ is central to ensure that M is approximated correctly by M′. Approaches that allow the computation of accurate σ and θ are thus fundamental for the LIMES framework [1].

3. Formal Overview

Several approaches can be chosen when aiming to define the syntax and semantics of LSs in detail [9, 13, 14]. In LIMES, we chose a grammar with a syntax and semantics based on set semantics. This grammar assumes that LSs consist of two types of atomic components:

• similarity measures m, which allow the comparison of property values or portions of the concise bound description of two resources, and

• operators op, which can be used to combine these similarities into more complex specifications.

Without loss of generality, we define an atomic similarity measure a as a function a : S × T → [0, 1]. An example of an atomic similarity measure is the edit similarity dubbed edit.³ Every atomic measure is a measure. A complex measure m combines measures m1 and m2 using measure operators such as min and max. We use mappings M ⊆ S × T to store the results of the application of a similarity measure to S × T or subsets thereof. We denote the set of all mappings as M.

A complex specification can be obtained by combining two specifications L1 and L2 through an operator that allows the merging of the results of L1 and L2. Here, we use the operators ⊓, ⊔ and \, as they are complete w.r.t. the Boolean algebra and frequently used to define LS. An example of a complex LS is given in Figure 1. We denote the set of all LS as L.

We define the semantics [[L]]_M of a LS L w.r.t. a mapping M as given in Table 1. These semantics are similar to those used in languages like SPARQL, i.e., they are defined extensionally through the mappings they generate. The mapping [[L]] of a LS L with respect to S × T contains the links that will be generated by L.

2 Note that a distance function δ can always be transformed into a normed similarity function σ by setting σ(x, y) = (1 + δ(x, y))⁻¹. Hence, the distance threshold τ can be transformed into a similarity threshold θ by means of the equation θ = (1 + τ)⁻¹.
3 We define the edit similarity of two strings s and t as (1 + lev(s, t))⁻¹, where lev stands for the Levenshtein distance.

4. Architecture of the Framework

4.1. Layers

As shown in Figure 2, the LIMES framework consists of 6 main layers, each of which can be extended to accommodate new or improved functionality. The input to the framework is a configuration file, which describes how the sets S and T are to be retrieved from two knowledge bases K1 and K2 (e.g., remote SPARQL endpoints).
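The operator semantics of Table 1 amount to plain set operations over mappings. The sketch below mirrors them in Python for illustration (LIMES itself is a Java framework; all function names here are ours):

```python
# Mappings are represented as sets of (source, target) pairs;
# m is a similarity function over such pairs.

def run_filter(m, theta, M):
    """f(m, theta): keep the pairs of M whose similarity is at least theta."""
    return {(s, t) for (s, t) in M if m(s, t) >= theta}

# The three LS operators of Table 1, defined extensionally on mappings.
def conjunction(M1, M2):
    """L1 ⊓ L2: pairs contained in both mappings."""
    return M1 & M2

def disjunction(M1, M2):
    """L1 ⊔ L2: pairs contained in at least one mapping."""
    return M1 | M2

def difference(M1, M2):
    """L1 \\ L2: pairs of the first mapping not in the second."""
    return M1 - M2
```

Because the semantics are extensional, any complex LS evaluates to the set obtained by applying these operations to the mappings of its sub-specifications.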
Moreover, the configuration file describes how links are to be computed. To this end, the user can choose either to provide a LS explicitly or to provide a configuration for a particular machine learning algorithm. The result of the framework is a (set of) mapping(s). In the following, we give an overview of the inner workings of each of the layers.

Figure 2: General architecture of LIMES.

4.1.1. Controller Layer

The processing of the configuration file by the different layers is orchestrated by the controller layer. The controller instantiates the different implementations of the input modules (e.g., reading data from files or from SPARQL endpoints), the data modules (e.g., file cache, in-memory), the execution modules (e.g., planners, number of processors for parallel processing) and the output modules (e.g., the serialization format and the output location) according to the configuration file or using default values. Once the layers have been instantiated, the configuration is forwarded to the input layer.

4.1.2. Input Layer

The input layer reads the configuration and extracts all the information necessary to execute the specification or the machine learning approach chosen by the user. This information includes (1) the location of the input knowledge bases K1 and K2 (e.g., SPARQL endpoints or files), (2) the specification of the sets S and T, and (3) the measures and thresholds or the machine learning approach to use. The current version of LIMES supports RDF configuration files based on the LIMES Configuration Ontology (LCO)⁴ (see Figure 3) and XML configuration files based on the LIMES Specification Language (LSL) [5]. If the configuration file is valid (w.r.t. the LCO or LSL), the input layer then calls its query module. This module uses the configuration for S and T to retrieve instances and properties from the input knowledge bases that adhere to the restrictions specified in the configuration file. All data retrieved is then forwarded to the data layer via the controller.

4.1.3. Data Layer

The data layer stores the input data gathered by the input layer using memory, file or hybrid storage techniques. Currently, LIMES relies on a hybrid cache as its default storage implementation. This implementation generates a hash for any set of resources specified in a configuration file and serializes this set into a file. Whenever faced with a new specification, the cache first determines whether the data required for the computation is already available locally by using the aforementioned hash. If the data is available, it is retrieved from the file into memory. In other cases, the hybrid cache retrieves the data as specified, generates a hash and caches the data on the hard drive. By these means, the data layer addresses the practical problem of data availability, especially from remote data sources. Once all data is available, the controller layer chooses (based on the configuration file) between the execution of complex measures specified by the user and the running of a machine learning algorithm.

4.1.4. Execution Layer

If the user specifies the LS to execute manually, the LIMES controller layer calls the execution layer. The execution layer then rewrites, plans and executes the LS specified by the user. The rewriter aims to simplify complex LS by removing potential redundancy so as to eventually speed up the overall execution. To this end, it exploits the subset relations and the Boolean algebra which underpin the syntax and semantics of complex specifications. The planner module maps a LS to an equivalent execution plan, which is a sequence of atomic execution operations from which a mapping results. These operations include EXECUTE (running a particular atomic similarity measure, e.g., the trigram similarity) and operations on mappings such as UNION (union of two mappings) and DIFF (difference of two mappings). For an atomic LS, i.e., a LS such that σ is an atomic similarity measure (e.g., Levenshtein similarity, qgrams similarity, etc.), the planner module generates a simple plan with one EXECUTE instruction. For a complex LS, the planner module determines the order in which atomic LS should be executed as well as the sequence in which intermediary results should be processed. For example, the planner module may decide to first run some atomic LS L1 and then filter the results using another atomic LS L2 instead of running both L1 and L2 independently and merging the results. The execution engine is then responsible for executing the plan of an input LS and returns the set of potential links as a mapping.

4.1.5. Machine Learning Layer

In case the user opts for using a machine learning approach, the LIMES controller calls the machine learning layer. This layer instantiates the machine learning algorithm selected by the user and executes it in tandem with the execution layer. To this end, the machine learning approaches begin by retrieving the training data provided by the user if required. The approaches then generate LSs, which are run in the execution layer. The resulting mappings are evaluated using quality measures (e.g., F-measure, pseudo-F-measure [10]) and returned to the user once a termination condition has been fulfilled. Currently, LIMES implements the EAGLE [9], COALA [15], EUCLID [10] and WOMBAT [11] algorithms. A brief overview of these approaches is given in Section 6. More details on the algorithms can be found in the corresponding papers.

4.1.6. Output Layer

The output layer serializes the output of LIMES. Currently, it supports serialization into files in any RDF serialization chosen by the user. In addition, the framework supports CSV and XML as output formats.

4 https://github.com/AKSW/LIMES-dev/blob/master/limes-core/resources/lco.owl

4.2. Graphical User Interface

In addition to supporting configurations as input files, LIMES provides a graphical user interface (GUI) to assist the end user during the LD process [16]. The framework supports the manual creation of LS as shown in Figure 4 for users who already know which LS they would like to execute to achieve a certain linking task. However, most users do not know which LS suits their linking task best and therefore need help throughout this

Figure 3: LCO ontology
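The execution layer described in Section 4.1.4 lowers a LS to a sequence of EXECUTE, UNION and DIFF operations. The following is a minimal, illustrative sketch of such a plan interpreter (LIMES itself is implemented in Java; the stack-based representation and all names here are our assumptions for illustration):

```python
# A tiny interpreter for execution plans in the spirit of Section 4.1.4.
# Instructions: ("EXECUTE", measure, theta) pushes a mapping on a stack;
# ("UNION",) and ("DIFF",) combine the two topmost mappings.

def execute_plan(plan, S, T):
    stack = []
    for instr in plan:
        if instr[0] == "EXECUTE":
            _, measure, theta = instr
            # Run one atomic similarity measure against S x T.
            stack.append({(s, t) for s in S for t in T if measure(s, t) >= theta})
        elif instr[0] == "UNION":
            b, a = stack.pop(), stack.pop()
            stack.append(a | b)
        elif instr[0] == "DIFF":
            b, a = stack.pop(), stack.pop()
            stack.append(a - b)  # mapping pushed first minus the later one
    return stack.pop()

# Example: the disjunction of two toy atomic measures over small sets.
sigma_eq = lambda s, t: 1.0 if s == t else 0.0
sigma_odd = lambda s, t: 1.0 if s % 2 == 1 and t % 2 == 1 else 0.0
plan = [("EXECUTE", sigma_eq, 0.5), ("EXECUTE", sigma_odd, 0.5), ("UNION",)]
links = execute_plan(plan, S=[1, 2], T=[1, 3])
```

A real planner would additionally reorder these instructions, e.g., running the cheaper atomic LS first and filtering its result with the other.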

Figure 5: User feedback interface for active learning
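The feedback loop behind this interface rests on selecting the most informative link candidates, i.e., those closest to the current classifier's decision boundary, as done by the active learning approaches of Section 6 (e.g., RAVEN and ACIDS). A hedged sketch of this selection step (illustrative Python, not the framework's implementation):

```python
def most_informative(pairs, decision_value, k):
    """Select the k unlabeled pairs closest to the decision boundary.

    decision_value maps a candidate pair to a signed score: its sign is the
    predicted class, its magnitude the distance to the boundary. Small
    magnitudes mean the classifier is unsure, so labeling them is most useful.
    """
    return sorted(pairs, key=lambda p: abs(decision_value(p)))[:k]

# Toy example: each candidate carries an aggregated similarity score and the
# (hypothetical) classifier is a simple 0.5 threshold on that score.
candidates = [("s1", "t1", 0.10), ("s2", "t2", 0.45),
              ("s3", "t3", 0.90), ("s4", "t4", 0.55)]
queries = most_informative(candidates, lambda p: p[2] - 0.5, k=2)
```

The selected pairs are presented to the user as in Figure 5; their labels are added to the training data and the classifier is retrained, closing the loop.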

process. Hence, the GUI provides wizards which ease the LS creation process and allow the configuration of the machine learning algorithms. After the selection of an algorithm, the user can modify the default configuration if necessary and run it. If a batch or unsupervised learning approach is selected, the execution is carried out once and the results are presented to the user. In the case of active learning, the LIMES GUI presents the most informative link candidates to the user, who must then label these candidates as either a match or a non-match (see Figure 5). This procedure is repeated until the user is satisfied with the result. In all cases, the GUI allows the exportation of the configurations generated by machine learning approaches for future use.

Figure 4: Link Specification Interface

5. Scalability Algorithms

As mentioned in the introduction, the scalability of LIMES is based on a number of time-efficient approaches for atomic LS that the framework supports. In the following, we summarize the approaches supported by LIMES version 1.0.0. By no means do we aim to give all technical details necessary to understand how these approaches work in detail. Interested readers are referred to the publications mentioned in each subsection.

5.1. LIMES

The LIMES algorithm [4] is the first scalable approach developed explicitly for LD. The basic intuition behind the approach is to regard the LD problem as a bounded distance problem δ(s, t) ≤ τ. If δ abides by the triangle inequality, then δ(s, t) ≥ δ(s, e) − δ(e, t) for any e. Based on this insight, LIMES implements a space tiling approach that assigns each t to an exemplar e such that δ(e, t) is known for many t. Hence, for all s, LIMES can first compute a lower bound for the distance between s and t. If this lower bound is larger than τ, then the exact computation of δ(s, t) is not carried out, as (s, t) is guaranteed not to belong to the output of this atomic measure.

5.2. HR3

The HR³ algorithm [5] builds upon some of the insights of the LIMES algorithm within spaces with Minkowski distances. The basic insight behind the approach is that the equation δ(s, t) ≤ τ describes a hypersphere of radius τ around s. While computing spheres in multidimensional spaces can be time-consuming, hypercubes can be computed rapidly. Hence, the approach relies on the approximation of hyperspheres with hypercubes. In contrast to previous approaches such as HYPPO [17], the approach combines hypercubes with an indexing approach and can hence discard a larger number of comparisons. One of the most important theoretical insights behind this approach is that it is the first approach guaranteed to be able to achieve the smallest possible number of comparisons when given sufficient memory. It is hence the first reduction-ratio-optimal approach for link discovery.

5.3. ORCHID

ORCHID [6] is a link discovery algorithm for geo-spatial data. Like HR³, ORCHID is reduction-ratio-optimal. In contrast to HR³, it operates in orthodromic spaces, in which Minkowski distances do not hold. To achieve the goal of reduction-ratio optimality, ORCHID uses a space discretization approach which relies on tiling the surface of the planet based on latitude and longitude information. The approach then only compares polygons t ∈ T which lie within a certain range of s ∈ S. Like in HR³, the range is described via a set of hypercubes (i.e., squares in 2 dimensions). However, the shape that is to be approximated is a disk projected onto the surface of the sphere. The projection is accounted for by compensating for the curvature of the surface of the planet using an increased number of squares. See [18] for a survey of point-set distance measures implemented by LIMES and optimized using ORCHID.

5.4. AEGLE

AEGLE [19] is a time-efficient approach for computing temporal relations. Our approach supports all possible temporal relations between event data that can be modelled as intervals according to Allen's Interval Algebra [20]. The key idea behind the algorithm is to reduce the 13 Allen relations to 8 simpler relations, which compare either the beginning or the end of an event with the beginning or the end of another event. Given that time is ordered, AEGLE reduces the problem of computing these simple relations to the problem of matching entities across sorted lists, which can be carried out within a time complexity of O(n log n). The approach is guaranteed to compute complete results for any temporal relation.

5.5. RADON

RADON [21] addresses the efficient computation of topological relations on geo-spatial datasets, which belong to the largest sources of Linked Data. The main innovation of the approach is a novel sparse index for geo-spatial resources based on minimum bounding boxes (MBB). Based on this index, it is able to discard unnecessary computations for DE-9IM relations. Even small datasets can contain very large geometries that span a large number of hypercubes and would lead to a large spatial index when used as source. Therefore, RADON applies a swapping strategy as its first step, where it swaps source and target datasets and computes the reverse⁵ relation r′ instead of r in case it finds that the estimated total hypervolume of the target is less than that of the source dataset. In its second step, RADON utilizes a space tiling approach to insert all source and target geometries into an index I, which maps resources to sets of hypercubes. Finally, RADON implements the last speedup strategy using an MBB-based filtering technique. Like AEGLE, RADON is guaranteed to achieve a result completeness of 100%, as it is able to compute all topological relations of the DE-9IM model.

5.6. ROCKER

Keys for graphs are the projection of the concept of primary keys from relational databases. A key is thus a set of properties such that all instances of a given class are distinguishable from each other by their property values. LIMES features ROCKER [22], a refinement operator-based approach for key discovery. The ROCKER algorithm first streams the input RDF graph to build a hash index including instance URIs and a concatenation of object values for each property. The indexes are then stored in a local database to enable an efficient computation of the discriminability score. Starting from the empty set, a refinement graph is visited, adding a property to the set at each refinement step. As keys abide by several monotonicities, additional optimizations allow scalability to large datasets. ROCKER has shown state-of-the-art results in terms of discovered keys, efficiency, and memory consumption [22].

5.7. REEDED

LD frameworks rely on similarity measures such as the Levenshtein (or edit) distance to discover similar instances. The edit distance calculates the number of operations (i.e., addition, deletion or replacement of a character) needed to transform one string into another. However, the assumption that all operations must have the same weight does not always apply. For instance, the replacement of a z with an s often leads to the same word written in American and British English, respectively. This operation should weigh less than, e.g., the addition of the character I to the end of King George I. While manifold algorithms have been developed for the efficient discovery of similar strings using the edit distance [23, 24], REEDED was developed to also support weighted edit distances. REEDED joins two sets of strings applying three filters to avoid computing the similarity values for all pairs: the (1) length-aware filter discards pairs of strings whose lengths differ too much, the (2) character-aware filter selects only the pairs which do not differ by more than a given number of characters, and the (3) verification filter finally verifies the weighted edit distance among the remaining pairs against a threshold [7].

5.8. Jaro-Winkler

The Jaro-Winkler string similarity measure, which was originally designed for the deduplication of person names, is a

5 Formally, the reverse relation r′ of a relation r is defined as r′(y, x) ⇔ r(x, y).

common choice to generate links between knowledge bases based on labels. We therefore developed a runtime-optimized Jaro-Winkler algorithm for LD [25]. The approach consists of two steps: indexing and tree-pruning. The indexing step itself also consists of two phases: (1) strings of both the source and target datasets are sorted into buckets based on upper and lower bounds pertaining to their lengths, and (2) in each bucket, strings of the target dataset are indexed by adding them to a tree akin to a trie [26]. Tree pruning is then applied to cut off subtrees for which the similarity threshold cannot be achieved due to character mismatches. This significantly reduces the number of necessary Jaro-Winkler similarity computations.

5.9. HELIOS

In contrast to the algorithms above, HELIOS [27] addresses the scalability of link discovery by improving the planning of LSs. To this end, HELIOS implements a rewriter and a planner. The rewriter consists of a set of algebraic rules, which can transform portions of a LS into formally equivalent expressions which can potentially be executed faster. For example, expressions such as AND(σ(s, t) ≥ θ1, σ(s, t) ≥ θ2) are transformed into σ(s, t) ≥ max(θ1, θ2). In this particularly simple example, the number of EXECUTE calls can hence be reduced from 2 to 1. The planner transforms a rewritten LS into a plan by aiming to find a sequence of execution steps that minimizes the total runtime of the LS. To this end, the planner relies on runtime approximations derived from linear regressions on large static dictionaries.

6. Accuracy Algorithms

In addition to pushing for efficiency, the LIMES framework also aims to provide approaches that compute links accurately. To this end, the framework includes a family of machine learning approaches designed specifically for the purpose of link discovery. The core algorithms can be adapted for unsupervised, active and batch learning. In the following, we present the main ideas behind these approaches while refraining from providing complex technical details, which can be retrieved from the corresponding publications.

6.1. RAVEN

RAVEN [8] is the first approach designed to address the active learning of LS. The basic intuition behind the approach is that LS can be regarded as classifiers. Links that abide by the LS then belong to the class +1 while all other links belong to −1. Based on this intuition, the approach tackles the problem of learning LS by first using the solution of the hospital-resident problem to detect potentially matching classes and properties in K1 and K2. The algorithm then assumes that it knows the type of LS that is to be learned (e.g., conjunctive specifications, in which all specification operators are conjunctions). Based on this assumption and the available property matching, it learns the thresholds associated with each property pair iteratively by using an extension of the perceptron learning paradigm. RAVEN first applies perceptron learning to all training data that is made available by the user. Then, the most informative unlabeled pairs (by virtue of their proximity to the decision boundary of the perceptron classifier) are sent as queries to the user. The labeled answers are finally added to the training data, which closes the loop. The iteration is carried on until the user terminates the process or a termination condition such as a maximal number of questions is reached.

6.2. EAGLE

EAGLE [9] implements a machine learning approach for LS of arbitrary complexity based on genetic programming. To this end, the approach models LS as trees, each of which stands for the genome of an individual in a population. EAGLE begins by generating a population of random LS, i.e., of individuals with random genomes. Based on training data or on an objective function, the algorithm determines the fitness of the individuals in the population. Operators such as mutation and crossover are then used to generate new members of the population. The fittest individuals are finally selected as members of the next population. When used as an active learning approach, EAGLE relies on a committee-based algorithm to determine the most informative positive and negative examples. In addition to the classical entropy-based selection, the COALA [15] extension of EAGLE also allows the correlation between resources to be taken into consideration during the selection of the most informative examples. All active learning versions of EAGLE ensure that each iteration of the approach only demands a small number of labels from the user. EAGLE is also designed to support the unsupervised learning of LS. In this case, the approach aims to find individuals that maximize so-called pseudo-F-measures [10, 28].

6.3. ACIDS

The ACIDS approach [29] targets instance matching by joining active learning with classification using linear Support Vector Machines (SVM). Given two instances s and t, the similarities among their datatype values are collected in a feature vector of size N. A pair (s, t) is thus represented as a point in the similarity space [0, 1]^N. At each iteration, pairs are (1) selected based on their proximity to the SVM classifier, (2) labeled by the user, and (3) added to the SVM training set. Mini-batches of pairs are labeled as positive if the instances are to be linked with an R, or negative otherwise. The SVM model is then rebuilt after each step, where the classifier is a hyperplane dividing the similarity space into two subspaces. String similarities are computed using a weighted Levenshtein distance. Using an update rule inspired by perceptron learning, the edit operation weights are learned by maximizing the training F-score.

6.4. EUCLID

EUCLID [10] is a deterministic learning algorithm loosely based on RAVEN. EUCLID reuses the property matching techniques implemented in RAVEN and the idea of having known classifier shapes. In contrast to previous algorithms, EUCLID learns combinations of atomic LS by using a hierarchical search

approach. While EUCLID was designed for unsupervised learning, it can also be used for supervised learning based on labeled training data [10].

6.5. WOMBAT

One of the most crucial tasks when dealing with evolving datasets lies in updating the links from these datasets to other datasets. While supervised approaches have been devised to achieve this goal, they assume the provision of both positive and negative examples for links [30]. However, the links available on the Data Web only provide positive examples for relations and no negative ones, as the open-world assumption underlying the Web of Data suggests that the non-existence of a link between two resources cannot be understood as stating that these two resources are not related. In LIMES, we addressed this drawback by proposing WOMBAT [11], the first approach for learning links based on positive examples only. WOMBAT is inspired by the concept of generalisation in quasi-ordered spaces. Given a set of positive examples, it aims to find a classifier that covers a large number of positive examples (i.e., achieves a high recall on the positive examples) while still achieving a high precision. The simple version of WOMBAT relies on a two-step approach, which learns atomic LS and subsequently finds ways to combine them to achieve high F-measures. The complete version of the algorithm relies on a full-fledged upward refinement operator, which is guaranteed to find the best specification possible but scales less well than the simple version.

7. Use Cases

LIMES was already used in a large number of use cases. In the following, we present some of the datasets and other applications in which techniques developed for LIMES, or the LIMES framework itself, were used successfully.

7.1. Semantic Quran

The SEMANTICQURAN dataset [31] is a multilingual RDF representation of translations of the Quran. The dataset was created by integrating data from two different semi-structured sources and aligning them to an ontology designed to represent multilingual data from sources with a hierarchical structure. The resulting RDF data encompasses 43 different languages which belong to the most under-represented languages in the Linked Data Cloud, including Arabic, Amharic and Amazigh. Using LIMES, we were able to generate links from SEMANTICQURAN to 3 versions of the RDF representation of Wiktionary and to DBpedia. The basic intuition behind our linking of SEMANTICQURAN was to link words in a given language in SEMANTICQURAN to words in the same language with exactly the same label. We provide 7617 links to the English version of DBpedia, which in turn is linked to non-English versions of DBpedia. In addition, we generated 7809 links to the English, 9856 to the French and 1453 to the German Wiktionary.

7.2. Linked TCGA

A second dataset that was linked with LIMES is LINKEDTCGA [3], an RDF representation of The Cancer Genome Atlas. The data source describes cancer patient data gathered from 9000 patients for 30 different cancer diseases. The dataset aims to support bio-medical experts who work on characterizing the changes that occur in genes due to cancer. LINKEDTCGA is one of the largest knowledge bases on the Linked Data Web and accounts for more than 20 billion triples. The linking task for this dataset consisted of connecting it with the biomedical portion of the Linked Open Data Cloud, in particular with Bio2RDF.⁶ LIMES was used to compute the more than 16 million links which connect LINKEDTCGA with chromosomes from OMIM and HGNC. The more than 500 thousand links that connect LINKEDTCGA with Homologene and similar knowledge bases were also computed using LIMES.

7.3. Knowledge Base Repair

The COLIBRI [28] approach attempts to repair instance knowledge in n knowledge bases. COLIBRI discovers links for transitive relations (e.g., owl:sameAs) between instances in knowledge bases while correcting errors in the same knowledge bases. In contrast to most of the existing approaches, COLIBRI takes an n-set⁷ of sets of resources K1, ..., Kn with n ≥ 2 as input. Thereafter, COLIBRI relies on LIMES's unsupervised deterministic approach EUCLID (see Section 6.4) to link each pair (Ki, Kj) of sets of resources (with i ≠ j). The resource mappings resulting from the link discovery are then forwarded to a voting approach, which is able to detect poor mappings. This information is subsequently used to find sources of errors in the mappings, such as erroneous or missing information in the instances. The errors in the mappings are finally corrected and the approach starts the next linking iteration.

7.4. Question Answering

LIMES was also used to create the knowledge base behind the question answering engine DEQA [32]. This engine was designed to be a comprehensive framework for deep web question answering. To this end, the engine provides functionality for the extraction of structured knowledge from (1) unstructured data sources (e.g., using FOX [33] and AGDISTIS [34]) and (2) the deep Web using the OXPath language.⁸ The results of the extraction are linked with other knowledge sources on the Web using LIMES. In the example presented in [32], geo-spatial links such as nearBy were generated using LIMES's approach HR³. The approach was deployed on a dataset combining data on flats in Oxford and geo-spatial information from LinkedGeoData and was able to support the answering of complex questions such as "Give me a flat with 2 bedrooms near to a school".

6 http://bio2rdf.org
7 An n-set is a set of magnitude n.
8 http://www.oxpath.org/

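The kind of geo-spatial linking used in the DEQA scenario above, emitting a nearBy link whenever two resources lie within a distance threshold, can be sketched as follows. This is an illustrative brute-force sketch only: the 1 km threshold and all coordinates are assumptions, and the actual HR3 algorithm additionally avoids computing the exact distance for most candidate pairs.

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) pairs."""
    lat1, lon1, lat2, lon2 = map(radians, (*a, *b))
    h = sin((lat2 - lat1) / 2) ** 2 \
        + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371 * asin(sqrt(h))

def near_by_links(sources, targets, threshold_km=1.0):
    """Emit (s, 'nearBy', t) for every source/target pair within the
    threshold. Brute force for clarity; HR3 prunes most pairs instead."""
    return [(s, "nearBy", t)
            for s, s_pos in sources.items()
            for t, t_pos in targets.items()
            if haversine_km(s_pos, t_pos) <= threshold_km]

# Hypothetical resources: a flat in Oxford and two candidate schools.
flats = {":flat1": (51.7520, -1.2577)}
schools = {":school1": (51.7545, -1.2540),   # a few hundred metres away
           ":school2": (51.8000, -1.3500)}   # several kilometres away
print(near_by_links(flats, schools))
# → [(':flat1', 'nearBy', ':school1')]
```

A planning component such as HELIOS would additionally decide in which order such atomic comparisons are executed when they are combined in a larger link specification.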
7.5. Benchmarking

While the creation of datasets for the Linked Data Web is an obvious use of the framework, LIMES and its algorithms can be used for several other purposes. A non-obvious application is the generation of benchmarks based on query logs [35, 36]. The DBpedia SPARQL benchmark [35] relies on DBpedia query logs to detect clusters of similar queries, which are used as templates for generating queries during the execution of the benchmark. Computing the similarity between the millions of queries in the DBpedia query log proved to be an impractical endeavor when implemented in a naïve fashion. With LIMES, the runtime of the similarity computations could be reduced drastically by expressing the problem of finding similar queries as a near-duplicate detection problem.

The application of LIMES in FEASIBLE [36] was also in the clustering of queries. The approach was designed to enable users to select the number of queries that their benchmark should generate. The exemplar-based approach from LIMES [4] was used as a foundation for detecting the exact number of clusters required by the user while optimizing the homogeneity of the said clusters. The resulting benchmark was shown to outperform the state of the art in how well it encompasses the idiosyncrasies of the input query log.

7.6. Dataset Enrichment

Over the recent years, a few frameworks for RDF data enrichment such as LDIF (http://ldif.wbsg.de/) and DEER (http://aksw.org/Projects/DEER.html) have been developed. These frameworks provide enrichment methods such as entity recognition [33], link discovery [12] and schema enrichment [37]. However, devising appropriate configurations for these tools can prove a difficult endeavour, as the tools require the right sequence of enrichment functions to be chosen and these functions to be configured adequately. In DEER [38], we address this problem by presenting a supervised machine learning approach for the automatic detection of enrichment pipelines based on a refinement operator and self-configuration algorithms for enrichment functions. The output of DEER is an enrichment pipeline that can be used on whole datasets to generate enriched versions. DEER's linking enrichment function is used to connect further datasets using link predicates such as owl:sameAs. With the proper configuration, the linking enrichment function can be used to generate arbitrary predicate types, for instance using gn:nearby for linking geographical resources. The idea behind the self-configuration of the linking enrichment function is to (1) first perform an automatic property matching and then (2) perform supervised link discovery based on LIMES' WOMBAT algorithm [11] (see Section 6.5).

8. Related Work

There has been a significant amount of work pertaining to the two key challenges of LD: efficiency and accuracy. Here, we focus on frameworks for link discovery. Dedicated approaches and algorithms can be found in the related work sections of [4, 5, 12, 9, 10, 15, 19, 25, 21]. As described in [1], most LD frameworks address the efficiency challenge by aiming to detect and avoid unnecessary computations of the similarity or distance measures. The accuracy challenge is commonly addressed using machine learning techniques.

One of the first LD frameworks is SILK [39], which uses a multi-dimensional blocking technique (MultiBlock [13]) to optimize the linking runtime through a rough index pre-matching. To parallelize the linking process, SILK relies on MapReduce. Like LIMES, SILK supports XML configuration files and is also able to retrieve RDF data by querying SPARQL endpoints. Both frameworks also allow user-specified link types between resources as well as owl:sameAs links. SILK incorporates element-level matchers on selected properties using string, numeric, temporal and geo-spatial similarity measures. SILK also supports multiple matchers, as it allows the comparison of different properties between resources, which are combined using match rules. SILK also implements supervised and active learning methods for identifying LS for linking.

KNOFUSS [14] incorporates blocking approaches derived from databases. Like LIMES, it supports unsupervised machine learning techniques based on genetic programming. However, in contrast to both aforementioned tools, KNOFUSS supports no other link types apart from owl:sameAs and implements only string similarity measures between matching entity properties. Additionally, its indexing technique for minimizing time complexity is not guaranteed to achieve result completeness.

ZHISHI.LINKS [40] is another LD framework that achieves efficiency by using an index-based technique, which comes at the cost of effectiveness. Similar to KNOFUSS, it only supports owl:sameAs links, but it implements geo-spatial relations between resources. In comparison with the three previous tools, it supports both property-based and semantic-based matching, using knowledge obtained from an ontology.

SERIMI [41] is an LD tool that retrieves entities for linking by querying SPARQL endpoints. In contrast to the previous frameworks, it only supports single properties to be used for linking resources. However, it incorporates an adaptive technique that weighs the properties of a knowledge base differently and chooses the most representative property for linking. Furthermore, SLINT+ [42] is an LD tool that is very similar to SERIMI but supports comparisons between multiple properties.

Finally, there is a set of frameworks (RIMOM [43], AGREEMENTMAKER [44], LOGMAP [45], CODI [46]) that initially supported ontology matching and then evolved to support LD between resources. RIMOM is based on Bayesian decision matching in order to link ontologies and transforms the problem of linking into a decision problem. Even though RIMOM utilizes dictionaries for ontology matching, it does not support them in entity linking. Similar to RIMOM, AGREEMENTMAKER also uses dictionaries at the ontology level. AGREEMENTMAKER incorporates a variety of matching methods based on the different properties considered for comparison and the different granularity of the components. LOGMAP is an efficient tool for ontology matching that scales up to tens (even hundreds) of thousands of different classes included in an ontology. It propagates the information obtained from ontology matching to link resources, using logical reasoning to exclude links between resources that do not abide by the restrictions obtained from ontology matching. CODI is a probabilistic-logical framework for ontology matching based on the syntax and semantics of Markov logic. It incorporates typical matching approaches that are joined together to increase the quality of ontology alignment.

The basic difference between the previous LD frameworks and LIMES is that LIMES provides both theoretical and practical guarantees for efficient and scalable LD. LIMES is guaranteed to lead to exactly the same matching as a brute-force approach while at the same time significantly reducing the number of comparisons. The approaches incorporated in LIMES employ different approximation techniques to compute estimates of the similarity between instances. These estimates are then used to filter out a large number of those instance pairs that cannot satisfy the mapping conditions. By these means, LIMES can reduce the number of comparisons needed during the mapping process by several orders of magnitude. LIMES also supports the first planning technique for link discovery, HELIOS, which minimizes the overall execution time of a link specification without any loss of completeness. As shown in our previous works (see [27, 19, 47]), LIMES is one of the most scalable link discovery frameworks currently available.

Table 2: Evaluation of various algorithms used in LIMES. The numbers provided were retrieved from the experiments and results of the corresponding papers. These numbers are not upper bounds but merely reflect improvements observed empirically on particular datasets. Orders of magnitude are abbreviated as o.o.m.

Algorithm      Runtime improvement   F-measure improvement   Sec.   Ref.
LIMES          Up to 60-fold         –                       5.1    [4]
HR3            Up to 7%              –                       5.2    [5]
ORCHID         Up to 2 o.o.m.        –                       5.3    [6]
AEGLE          Up to 4 o.o.m.        Correct (100%)          5.4    [19]
RADON          Up to 3 o.o.m.        Correct (100%)          5.5    [21]
ROCKER         Up to 1 o.o.m.        Correct (100%)          5.6    [22]
REEDED         Up to 15-fold         Correct (100%)          5.7    [7]
Jaro-Winkler   Up to 5.5-fold        –                       5.8    [25]
HELIOS         Up to 300%            –                       5.9    [27]
RAVEN          Up to 33%             90%–100%                6.1    [8]
EAGLE          Up to 14-fold         > 90%                   6.2    [9]
ACIDS          –                     +7%                     6.3    [29]
COALA          –                     +25%                    6.2    [15]
EUCLID         Up to 33%             +11%                    6.4    [10]
WOMBAT         –                     +23%                    6.5    [11]

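The filter-and-verify strategy credited to LIMES above (compute cheap estimates of similarity, discard pairs that cannot satisfy the mapping condition, and verify only the survivors with the exact measure) can be illustrated with a minimal sketch. This is an assumed simplification, not the framework's actual code: for an edit-distance threshold theta, the length difference |len(s) − len(t)| is a valid lower bound on the edit distance, so it can serve as the cheap filter.

```python
def levenshtein(s, t):
    """Exact edit distance via dynamic programming (the expensive measure)."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1,            # deletion
                           cur[-1] + 1,            # insertion
                           prev[j - 1] + (cs != ct)))  # substitution
        prev = cur
    return prev[-1]

def match(sources, targets, theta=1):
    """Filter-and-verify: a pair whose length difference exceeds theta
    cannot have edit distance <= theta, so it is discarded without any
    expensive computation; only the remaining pairs are verified exactly."""
    candidates = [(s, t) for s in sources for t in targets
                  if abs(len(s) - len(t)) <= theta]   # cheap filter
    return [(s, t) for s, t in candidates
            if levenshtein(s, t) <= theta]            # exact verification

print(match(["Leipzig", "Paderborn"], ["Leipzig", "Padeborn", "Oxford"]))
# → [('Leipzig', 'Leipzig'), ('Paderborn', 'Padeborn')]
```

The algorithms listed in Table 2 use far stronger filters than this length bound, but the overall pattern is the same: the filter must never discard a true match, so the result is identical to the brute-force computation while skipping most exact comparisons.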
9. Evaluation

The approaches included in LIMES are the result of seven years of research, during which the state of the art evolved significantly. In Table 2, we give an overview of the performance improvement (w.r.t. runtime and/or accuracy) of a selection of algorithms currently implemented in LIMES. The improvements mentioned refer to improvements w.r.t. the state of the art at the time at which the corresponding papers were written. We refer the readers to the corresponding research papers for the evaluation settings and the experimental results for each of the algorithms mentioned in the table.

10. Conclusion and Future Work

In LIMES, we offer an extensible framework for the discovery of links between knowledge bases. The framework has already been used in several applications and shown to be a reliable, near industry-grade framework for link discovery. The graphical user interface and the manuals for users (http://aksw.github.io/LIMES-dev/user_manual/) and developers (http://aksw.github.io/LIMES-dev/developer_manual/), which accompany the tool, make it a viable framework for novices and experts who aim to create and deploy linked data sets. While a number of challenges have been addressed in the framework, the increasing amount of Linked Data available on the Web and the large number of applications that rely on it demand that we address ever more challenges over the years to come. The intelligent use of memory (disk, RAM, etc.) becomes a problem of central importance when faced with small amounts of RAM that cannot fit the sets S and T [48] or when faced with streaming or complex data (e.g., 5D geospatial data). The appropriate use of hardware resources, which has already been shown to be of central importance for link discovery approaches [49, 50], remains an important challenge for the future. Big Data frameworks such as SPARK (http://spark.apache.org/) and FLINK (https://flink.apache.org/) promise to alleviate this problem when used correctly.

Acknowledgments

This work has been supported by the H2020 projects SLIPO (GA no. 731581) and HOBBIT (GA no. 688227) as well as the DFG project LinkingLOD (project no. NG 105/3-2) and the BMWI projects SAKE (project no. 01MD15006E) and GEISER (project no. 01MD16014).

References

[1] M. Nentwig, M. Hartung, A. N. Ngomo, E. Rahm, A survey of current link discovery frameworks, Semantic Web 8 (3) (2017) 419–436. doi:10.3233/SW-150210.
[2] C. Stadler, J. Lehmann, K. Höffner, S. Auer, LinkedGeoData: A core for a web of spatial open data, Semantic Web 3 (4) (2012) 333–354. doi:10.3233/SW-2011-0052.

[3] M. Saleem, S. S. Padmanabhuni, A. N. Ngomo, J. S. Almeida, S. Decker, H. F. Deus, Linked cancer genome atlas database, in: M. Sabou, E. Blomqvist, T. D. Noia, H. Sack, T. Pellegrini (Eds.), I-SEMANTICS 2013 - 9th International Conference on Semantic Systems, ISEM '13, Graz, Austria, September 4-6, 2013, ACM, 2013, pp. 129–134. doi:10.1145/2506182.2506200.
[4] A. N. Ngomo, S. Auer, LIMES - A time-efficient approach for large-scale link discovery on the web of data, in: T. Walsh (Ed.), IJCAI 2011, Proceedings of the 22nd International Joint Conference on Artificial Intelligence, Barcelona, Catalonia, Spain, July 16-22, 2011, IJCAI/AAAI, 2011, pp. 2312–2317. doi:10.5591/978-1-57735-516-8/IJCAI11-385.
[5] A. N. Ngomo, Link discovery with guaranteed reduction ratio in affine spaces with Minkowski measures, in: The Semantic Web - ISWC 2012 - 11th International Semantic Web Conference, Boston, MA, USA, November 11-15, 2012, Proceedings, Part I, 2012, pp. 378–393. doi:10.1007/978-3-642-35176-1_24.
[6] A. N. Ngomo, ORCHID - reduction-ratio-optimal computation of geo-spatial distances for link discovery, in: Alani et al. [51], pp. 395–410. doi:10.1007/978-3-642-41335-3_25.
[7] T. Soru, A. N. Ngomo, Rapid execution of weighted edit distances, in: Shvaiko et al. [52], pp. 1–12. URL http://ceur-ws.org/Vol-1111/om2013_Tpaper1.pdf
[8] A. N. Ngomo, J. Lehmann, S. Auer, K. Höffner, RAVEN - active learning of link specifications, in: Shvaiko et al. [53], pp. 25–36. URL http://ceur-ws.org/Vol-814/om2011_Tpaper3.pdf
[9] A. N. Ngomo, K. Lyko, EAGLE: efficient active learning of link specifications using genetic programming, in: The Semantic Web: Research and Applications - 9th Extended Semantic Web Conference, ESWC 2012, Heraklion, Crete, Greece, May 27-31, 2012, Proceedings, 2012, pp. 149–163. doi:10.1007/978-3-642-30284-8_17.
[10] A. N. Ngomo, K. Lyko, Unsupervised learning of link specifications: deterministic vs. non-deterministic, in: Shvaiko et al. [52], pp. 25–36. URL http://ceur-ws.org/Vol-1111/om2013_Tpaper3.pdf
[11] M. Sherif, A.-C. Ngonga Ngomo, J. Lehmann, WOMBAT - A generalization approach for automatic link discovery, in: 14th Extended Semantic Web Conference, Portorož, Slovenia, 28th May - 1st June 2017, Springer, 2017. URL http://svn.aksw.org/papers/2017/ESWC_WOMBAT/public.pdf
[12] A. N. Ngomo, On link discovery using a hybrid approach, J. Data Semantics 1 (4) (2012) 203–217. doi:10.1007/s13740-012-0012-y.
[13] R. Isele, A. Jentzsch, C. Bizer, Efficient multidimensional blocking for link discovery without losing recall, in: Proceedings of the 14th International Workshop on the Web and Databases 2011, WebDB 2011, Athens, Greece, June 12, 2011. URL http://webdb2011.rutgers.edu/papers/Paper%2039/silk.pdf
[14] A. Nikolov, V. S. Uren, E. Motta, KnoFuss: a comprehensive architecture for knowledge fusion, in: Proceedings of the 4th International Conference on Knowledge Capture (K-CAP 2007), October 28-31, 2007, Whistler, BC, Canada, 2007, pp. 185–186. doi:10.1145/1298406.1298446.
[15] A. N. Ngomo, K. Lyko, V. Christen, COALA - correlation-aware active learning of link specifications, in: The Semantic Web: Semantics and Big Data, 10th International Conference, ESWC 2013, Montpellier, France, May 26-30, 2013, Proceedings, 2013, pp. 442–456. doi:10.1007/978-3-642-38288-8_30.
[16] A.-C. Ngonga Ngomo, D. Obraczka, K. Georgala, Using machine learning for link discovery on the web of data, in: Demos at the European Conference on Artificial Intelligence, 2016.
[17] A. N. Ngomo, A time-efficient hybrid approach to link discovery, in: Shvaiko et al. [53]. URL http://ceur-ws.org/Vol-814/om2011_Tpaper1.pdf
[18] M. A. Sherif, A.-C. N. Ngomo, A systematic survey of point set distance measures for link discovery, Semantic Web Journal. URL http://www.semantic-web-journal.net/content/systematic-survey-point-set-distance-measures-link-discovery-1
[19] K. Georgala, M. A. Sherif, A. N. Ngomo, An efficient approach for the generation of Allen relations, in: ECAI 2016 - 22nd European Conference on Artificial Intelligence, 29 August - 2 September 2016, The Hague, The Netherlands - Including Prestigious Applications of Artificial Intelligence (PAIS 2016), 2016, pp. 948–956. doi:10.3233/978-1-61499-672-9-948.
[20] J. F. Allen, Maintaining knowledge about temporal intervals, Commun. ACM 26 (11) (1983) 832–843. doi:10.1145/182.358434.
[21] M. A. Sherif, K. Dreßler, P. Smeros, A. N. Ngomo, RADON - rapid discovery of topological relations, in: Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, February 4-9, 2017, San Francisco, California, USA, 2017, pp. 175–181. URL http://aaai.org/ocs/index.php/AAAI/AAAI17/paper/view/14199
[22] T. Soru, E. Marx, A. N. Ngomo, ROCKER: A refinement operator for key discovery, in: Proceedings of the 24th International Conference on World Wide Web, WWW 2015, Florence, Italy, May 18-22, 2015, 2015, pp. 1025–1033. doi:10.1145/2736277.2741642.
[23] C. Xiao, W. Wang, X. Lin, Ed-Join: an efficient algorithm for similarity joins with edit distance constraints, Proceedings of the VLDB Endowment 1 (1) (2008) 933–944.
[24] G. Li, D. Deng, J. Wang, J. Feng, Pass-Join: A partition-based method for similarity joins, Proceedings of the VLDB Endowment 5 (3) (2011) 253–264.
[25] K. Dreßler, A. N. Ngomo, On the efficient execution of bounded Jaro-Winkler distances, Semantic Web 8 (2) (2017) 185–196. doi:10.3233/SW-150209.
[26] E. Fredkin, Trie memory, Commun. ACM 3 (9) (1960) 490–499. doi:10.1145/367390.367400.
[27] A. N. Ngomo, HELIOS - execution optimization for link discovery, in: Mika et al. [54], pp. 17–32. doi:10.1007/978-3-319-11964-9_2.
[28] A. N. Ngomo, M. A. Sherif, K. Lyko, Unsupervised link discovery through knowledge base repair, in: The Semantic Web: Trends and Challenges - 11th International Conference, ESWC 2014, Anissaras, Crete, Greece, May 25-29, 2014, Proceedings, 2014, pp. 380–394. doi:10.1007/978-3-319-07443-6_26.
[29] T. Soru, A. N. Ngomo, Active learning of domain-specific distances for link discovery, in: Semantic Technology, Second Joint International Conference, JIST 2012, Nara, Japan, December 2-4, 2012, Proceedings, 2012, pp. 97–112. doi:10.1007/978-3-642-37996-3_7.
[30] A. N. Ngomo, S. Auer, J. Lehmann, A. Zaveri, Introduction to linked data and its lifecycle on the web, in: Reasoning Web. Reasoning on the Web in the Big Data Era - 10th International Summer School 2014, Athens, Greece, September 8-13, 2014, Proceedings, 2014, pp. 1–99. doi:10.1007/978-3-319-10587-1_1.
[31] M. A. Sherif, A. N. Ngomo, Semantic Quran, Semantic Web 6 (4) (2015) 339–345. doi:10.3233/SW-140137.
[32] J. Lehmann, T. Furche, G. Grasso, A. N. Ngomo, C. Schallhart, A. J. Sellers, C. Unger, L. Bühmann, D. Gerber, K. Höffner, D. Liu, S. Auer, DEQA: Deep web extraction for question answering, in: The Semantic Web - ISWC 2012 - 11th International Semantic Web Conference, Boston, MA, USA, November 11-15, 2012, Proceedings, Part II, 2012, pp. 131–147. doi:10.1007/978-3-642-35173-0_9.
[33] R. Speck, A. N. Ngomo, Ensemble learning for named entity recognition, in: Mika et al. [54], pp. 519–534. doi:10.1007/978-3-319-11964-9_33.
[34] R. Usbeck, A. N. Ngomo, M. Röder, D. Gerber, S. A. Coelho, S. Auer, A. Both, AGDISTIS - graph-based disambiguation of named entities using linked data, in: Mika et al. [54], pp. 457–471. doi:10.1007/978-3-319-11964-9_29.
[35] M. Morsey, J. Lehmann, S. Auer, A.-C. Ngonga Ngomo, DBpedia SPARQL benchmark - performance assessment with real queries on real data, in: ISWC 2011, 2011. URL http://jens-lehmann.org/files/2011/dbpsb.pdf
[36] M. Saleem, Q. Mehmood, A.-C. Ngonga Ngomo, FEASIBLE: A feature-based SPARQL benchmark generation framework, in: International Semantic Web Conference (ISWC), 2015. URL http://svn.aksw.org/papers/2015/ISWC_FEASIBLE/public.pdf
[37] L. Bühmann, J. Lehmann, Pattern based knowledge base enrichment, in: Alani et al. [51], pp. 33–48. doi:10.1007/978-3-642-41335-3_3.
[38] M. A. Sherif, A. N. Ngomo, J. Lehmann, Automating RDF dataset transformation and enrichment, in: The Semantic Web. Latest Advances and New Domains - 12th European Semantic Web Conference, ESWC 2015, Portoroz, Slovenia, May 31 - June 4, 2015, Proceedings, 2015, pp. 371–387. doi:10.1007/978-3-319-18818-8_23.
[39] J. Volz, C. Bizer, M. Gaedke, G. Kobilarov, Discovering and maintaining links on the web of data, in: The Semantic Web - ISWC 2009, 8th International Semantic Web Conference, ISWC 2009, Chantilly, VA, USA, October 25-29, 2009, Proceedings, 2009, pp. 650–665. doi:10.1007/978-3-642-04930-9_41.
[40] X. Niu, S. Rong, Y. Zhang, H. Wang, Zhishi.links results for OAEI 2011, in: Shvaiko et al. [53]. URL http://ceur-ws.org/Vol-814/oaei11_paper16.pdf
[41] S. Araújo, J. Hidders, D. Schwabe, A. P. de Vries, SERIMI - resource description similarity, RDF instance matching and interlinking, in: Proceedings of the 6th International Workshop on Ontology Matching, Bonn, Germany, October 24, 2011. URL http://ceur-ws.org/Vol-814/om2011_poster6.pdf
[42] K. Nguyen, R. Ichise, B. Le, SLINT: a schema-independent linked data interlinking system, in: Proceedings of the 7th International Workshop on Ontology Matching, Boston, MA, USA, November 11, 2012. URL http://ceur-ws.org/Vol-946/om2012_Tpaper1.pdf
[43] J. Tang, B. Liang, J. Li, K. Wang, Risk minimization based ontology mapping, in: Content Computing, Advanced Workshop on Content Computing, AWCC 2004, ZhenJiang, JiangSu, China, November 15-17, 2004, Proceedings, 2004, pp. 469–480. doi:10.1007/978-3-540-30483-8_58.
[44] I. F. Cruz, F. P. Antonelli, C. Stroe, AgreementMaker: Efficient matching for large real-world schemas and ontologies, PVLDB 2 (2) (2009) 1586–1589. URL http://www.vldb.org/pvldb/2/vldb09-1003.pdf
[45] E. Jiménez-Ruiz, B. C. Grau, LogMap: Logic-based and scalable ontology matching, in: The Semantic Web - ISWC 2011 - 10th International Semantic Web Conference, Bonn, Germany, October 23-27, 2011, Proceedings, Part I, 2011, pp. 273–288. doi:10.1007/978-3-642-25073-6_18.
[46] M. Niepert, C. Meilicke, H. Stuckenschmidt, A probabilistic-logical framework for ontology matching, in: Proceedings of the Twenty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2010, Atlanta, Georgia, USA, July 11-15, 2010. URL http://www.aaai.org/ocs/index.php/AAAI/AAAI10/paper/view/1744
[47] L. Zhu, M. Ghasemi-Gol, P. A. Szekely, A. Galstyan, C. A. Knoblock, Unsupervised entity resolution on multi-type graphs, in: The Semantic Web - ISWC 2016 - 15th International Semantic Web Conference, Kobe, Japan, October 17-21, 2016, Proceedings, Part I, 2016, pp. 649–667. doi:10.1007/978-3-319-46523-4_39.
[48] A.-C. N. Ngomo, M. M. Hassan, The lazy traveling salesman - memory management for large-scale link discovery, in: H. Sack, E. Blomqvist, M. d'Aquin, C. Ghidini, S. P. Ponzetto, C. Lange (Eds.), ESWC, Vol. 9678 of Lecture Notes in Computer Science, Springer, 2016, pp. 423–438.
[49] A.-C. Ngonga Ngomo, L. Kolb, N. Heino, M. Hartung, S. Auer, E. Rahm, When to reach for the cloud: Using parallel hardware for link discovery, in: Proceedings of ESWC, 2013.
[50] M. A. Sherif, A. N. Ngomo, An optimization approach for load balancing in parallel link discovery, in: Proceedings of the 11th International Conference on Semantic Systems, SEMANTICS 2015, Vienna, Austria, September 15-17, 2015, 2015, pp. 161–168. doi:10.1145/2814864.2814872.
[51] H. Alani, L. Kagal, A. Fokoue, P. T. Groth, C. Biemann, J. X. Parreira, L. Aroyo, N. F. Noy, C. Welty, K. Janowicz (Eds.), The Semantic Web - ISWC 2013 - 12th International Semantic Web Conference, Sydney, NSW, Australia, October 21-25, 2013, Proceedings, Part I, Vol. 8218 of Lecture Notes in Computer Science, Springer, 2013. doi:10.1007/978-3-642-41335-3.
[52] P. Shvaiko, J. Euzenat, K. Srinivas, M. Mao, E. Jiménez-Ruiz (Eds.), Proceedings of the 8th International Workshop on Ontology Matching co-located with the 12th International Semantic Web Conference (ISWC 2013), Sydney, Australia, October 21, 2013, Vol. 1111 of CEUR Workshop Proceedings, CEUR-WS.org, 2013. URL http://ceur-ws.org/Vol-1111
[53] P. Shvaiko, J. Euzenat, T. Heath, C. Quix, M. Mao, I. F. Cruz (Eds.), Proceedings of the 6th International Workshop on Ontology Matching, Bonn, Germany, October 24, 2011, Vol. 814 of CEUR Workshop Proceedings, CEUR-WS.org, 2011. URL http://ceur-ws.org/Vol-814
[54] P. Mika, T. Tudorache, A. Bernstein, C. Welty, C. A. Knoblock, D. Vrandecic, P. T. Groth, N. F. Noy, K. Janowicz, C. A. Goble (Eds.), The Semantic Web - ISWC 2014 - 13th International Semantic Web Conference, Riva del Garda, Italy, October 19-23, 2014, Proceedings, Part I, Vol. 8796 of Lecture Notes in Computer Science, Springer, 2014. doi:10.1007/978-3-319-11964-9.
