LIMES - a Framework for Link Discovery on the Semantic Web

Noname manuscript No. (will be inserted by the editor) LIMES - A Framework for Link Discovery on the Semantic Web Axel-Cyrille Ngonga Ngomo · Mohamed Ahmed Sherif · Kleanthi Georgala · Mofeed Mohamed Hassan · Kevin Dreßler · Klaus Lyko · Daniel Obraczka · Tommaso Soru Received: date / Accepted: date Abstract The Linked Data paradigm builds upon the back- 1 Introduction bone of distributed knowledge bases connected by typed links. The mere volume of current knowledge bases as well Establishing links between knowledge bases is one of the key as their sheer number pose two major challenges when aim- steps of the Linked Data publication process.1 A plethora of ing to support the computation of links across and within approaches has thus been devised to support this process [12]. them. The first is that tools for link discovery have to be In this paper, we present the LIMES framework, which was time-efficient when they compute links. Secondly, these tools designed to accommodate a large number of link discovery have to produce links of high quality to serve the applications approaches within a single extensible architecture. LIMES built upon Linked Data well. Solutions to the second problem was designed as a declarative framework (i.e., a framework build upon efficient computational approaches developed to that processes link specifications, see Section 2) to address solve the first and combine these with dedicated machine two main challenges: learning techniques. The current version of the LIMES frame- 1. Time-efficiency: The mere size of existing knowledge work is the product of seven years of research on these two bases (e.g., 30+ billion triples in LinkedGeoData [39], challenges. A series of machine learning techniques and 20+ billion triples in LinkedTCGA2 [28]) makes efficient efficient computation approaches were developed and in- solutions indispensable to the use of link discovery frame- tegrated into this framework to address the link discovery works in real application scenarios. LIMES addresses this problem. The framework combines these diverse algorithms challenge by providing time-efficient approaches based within a generic and extensible architecture. In this article, on the characteristics of metric spaces [19,15], ortho- we give an overview of version 1.7.4 of the open-source dromic spaces [17] and on filter-based paradigms [37]. release of the framework. In particular, we focus on an over- 2. Accuracy: Central to this paper are the solutions to ac- view of the architecture of the framework, an intuition of curacy provided in the framework. Efficient solutions are its inner workings and a brief overview of the approaches it of little help if the results they generate are inaccurate. contains. Some descriptions of the applications within which Hence, LIMES also accommodates dedicated machine- the framework was used complete the paper. Our framework learning solutions that allow the generation of links be- https: is open-source and available under a GNU license at tween knowledge bases with a high accuracy. These //github.com/dice-group/LIMES together with a solutions abide by paradigms such as batch and active user manual and a developer manual. learning [21,22,23], unsupervised learning [23] and even positive-only learning [29]. Axel-Cyrille Ngonga Ngomo · Mohamed Ahmed Sherif Paderborn University, Data Science Group, Pohlweg 51, D-33098 Pader- The main goal of this paper is to give an overview of the born, Germany LIMES framework and some of the applications within which E-mail: {firstname.lastname}@upb.de it was used. We begin by presenting the link discovery prob- Kleanthi Georgala · Mofeed Mohamed Hassan · Kevin Dreßler · Klaus lem and how we address this problem within a declarative Lyko · Daniel Obraczka · Tommaso Soru setting (see Section 2). Then, we present our solution to University of Leipzig, Institute of Computer Science, AKSW Group, Augustusplatz 10, D-04009 Leipzig, Germany E-mail: 1 http://www.w3.org/DesignIssues/LinkedData {lastname}@informatik.uni-leipzig.de 2 An RDF representation of The Cancer Genome Atlas. 2 Ngonga Ngomo et al. the link discovery problem in the form of LIMES and its f(edit(:socId; :socId); 0:5) architecture. The subsequent sections present the different t families of algorithms implemented within the framework. f(trigrams(:name; :label); 0:5) We begin by a short overview of the algorithms that ensure the efficiency of the framework (see Section 4). Thereafter, Figure 1: Example of a complex LS. The filter nodes are we give an overview of algorithms that address the accuracy rectangles while the operator nodes are circles. :socID problem (see Section 5). We round up the core of the paper stands for social security number. with some of the applications within which LIMES was used, including benchmarking and the publication of 5-star linked LS [[LS]]M datasets (Section 6). An overview of the evaluation results of f(m; θ) f(s; t)j(s; t) 2 M ^ m(s; t) ≥ θg algorithms included in LIMES (Section 7) and a conclusion L1 u L2 f(s; t)j(s; t) 2 [[L1]]M ^ (s; t) 2 [[L2]]M g (Section 8) complete the paper. L1 t L2 f(s; t)j(s; t) 2 [[L1]]M _ (s; t) 2 [[L2]]M g L1nL2 f(s; t)j(s; t) 2 [[L1]]M ^ (s; t) 2= [[L2]]M g Table 1: Link Specification Syntax and Semantics. 2 The Link Discovery Problem The formal specification of Link Discovery (LD) adopted Several approaches can be chosen when aiming to define herein is akin to that proposed in [16]. Given two (not neces- the syntax and semantics of LSs in detail [22,7,26]. In LIMES, sarily distinct) sets S resp. T of source resp. target resources we chose a grammar with a syntax and semantics based on as well as a relation R, the goal of LD is is to find the set set semantics. This grammar assumes that LSs consist of M = f(s; t) 2 S × T : R(s; t)g of pairs (s; t) 2 S × T such two types of atomic components: (i) similarity measures m, that R(s; t). In most cases, computing M is a non-trivial which allow the comparison of property values or portions task. Hence, a large number of frameworks (e.g., SILK [7], of the concise bound description of 2 resources and (ii) op- LIMES [16] and KnoFuss [26]) aim to approximate M by erators op, which can be used to combine these similarities computing the mapping M 0 = f(s; t) 2 S × T : σ(s; t) ≥ into more complex specifications. We define an atomic sim- θg, where σ is a similarity function and θ is a similarity ilarity measure a as a function a : S × T ! [0; 1]. An threshold. For example, one can configure these frameworks example of an atomic similarity measure is the edit similar- to compare the dates of birth, family names and given names ity dubbed edit4. Every atomic measure is a measure. A of persons across census records to determine whether they complex measure m combines measures m1 and m2 using are duplicates. We call the equation that specifies M 0 a link measure operators such as min and max. We use mappings specification (short LS; also called linkage rule in the liter- M ⊆ S × T to store the results of the application of a simil- ature, see e.g., [7]). Note that the LD problem can be ex- arity measure to S × T or subsets thereof. We denote the set pressed equivalently using distances instead of similarities in of all mappings as M. the following manner: Given two sets S and T of instances, We define a filter as a function f(m; θ). We call a spe- a (complex) distance measure δ and a distance threshold cification atomic when it consists of exactly one filtering θ 2 [0; 1[, determine M 0 = f(s; t) 2 S×T : δ(s; t) ≤ τg.3 function. A complex specification can be obtained by com- Consequently, distance and similarities are used within link bining two specifications L and L through an operator specifications in this paper. 1 2 that allows the merging of the results of L and L . Here, Under this so-called declarative paradigm, two entities 1 2 we use the operators u, t and n as they are complete w.r.t. s and t are then considered to be linked via R if σ(s; t) ≥ θ. the Boolean algebra and frequently used to define LS. An Naïve algorithms require O(jSjjT j) 2 O(n2) computations example of a complex LS is given in Figure 1. We denote the to output M 0. Given the large size of existing knowledge set of all LS as L. bases, time-efficient approaches able to reduce this runtime We define the semantics [[L]]M of a LS L w.r.t. a map- are hence a central component of LIMES as they are necessary ping M as given in Table 1. Those semantics are similar to for link specifications to be computed in acceptable times. those used in languages like SPARQL, i.e., they are defined This efficient computation is in turn the proxy necessary for extensionally through the mappings they generate. The map- machine learning techniques to be used to optimize the choice ping [[L]] of a LS L with respect to S × T contains the links of appropriate σ and θ and thus ensure that M 0 approximates that will be generated by L. M well even when M is large [12]). 3 Note that a distance function δ can always be transformed into a normed similarity function σ by setting σ(x; y) = (1 + δ(x; y))−1.

Load more