
A More Decentralized Vision for Linked Data

Axel Polleres (a,b,*), Maulik Rajendra Kamdar (c), Javier David Fernández (a,b), Tania Tudorache (c), Mark Alan Musen (c)

(a) Institute for Information Business, Vienna University of Economics and Business, Vienna, Austria
(b) Complexity Science Hub Vienna, Vienna, Austria
E-mails: [email protected], [email protected]
(c) Center for Biomedical Informatics Research, Stanford University, CA, USA
E-mails: [email protected], [email protected], [email protected]
* Corresponding author. E-mail: [email protected].

Editors: Pascal Hitzler, Kansas State University, Manhattan, KS, USA; Krzysztof Janowicz, University of California, Santa Barbara, USA
Solicited reviews: Anna Lisa Gentile, IBM Research, CA, USA; Jens Lehmann, University of Bonn, Germany

Abstract. In this deliberately provocative position paper, we claim that more than ten years into Linked Data there are still (too?) many unresolved challenges towards arriving at a truly machine-readable and decentralized Web of data. We take a deeper look at key challenges in usage and adoption of Linked Data from the ever-present "LOD cloud" diagram (see footnote 1). Herein, we try to highlight and exemplify both key technical and non-technical challenges to the success of LOD, and we outline potential solution strategies. We hope that this paper will serve as a discussion basis for a fresh start towards more actionable, truly decentralized Linked Data, and as a call to the community to join forces.

Keywords: Linked Data, Decentralization, Semantic Web

1. Decentralization Myths on the Semantic Web

Let us start with a rant, arguing that the Semantic Web may well be considered a story of failed promises with regards to decentralization:

– We had hopes (as a community) to revolutionize Social Networks in a way that every data subject owns and controls their social network data in decentralized FOAF [13] files published in their personal Web space – we got siloed, centralized social networks (Facebook, LinkedIn). Attempts to re-decentralize the Social Web, for instance through the work of the W3C Social Web WG (https://www.w3.org/wiki/Socialwg), appear not to have found major adoption at a level comparable with these siloed sites (see footnote 3).

– We envisioned a decentralized network of ontologies on the Web that would enable smart agents to seamlessly talk to each other, and that would enable easy integration of data by following the guiding principles of ontology engineering and Gruber's often cited vision of ontologies as shared conceptualizations [25] (see footnote 4). While there are indeed certain areas in which ontologies are used to share conceptualizations of a domain, mostly these are insular efforts that do the job well for a particular community. However, on Web scale, ontology and vocabulary reuse is still limited or tied to inherent challenges, see also [26, 27]. Rather, we see the rise of one main central schema (schema.org), and fast-growing community projects like Wikidata [50] refusing to buy into the need for re-using terms from other ontologies (see footnote 5).

Footnote 1: The Linking Open Data cloud diagram, available at http://lod-cloud.net/, has been regularly updated since 2007 by Andreas Abele, John P. McCrae, Paul Buitelaar, Anja Jentzsch and Richard Cyganiak, with its latest version having been created in March 2019 [1].
Footnote 3: While ActivityPub has been picked up by several implementations, cf. https://en.wikipedia.org/wiki/ActivityPub#Implementations, reversing the network effects that have drawn a critical mass of users to these siloed sites still seems far away.
Footnote 4: Or, as Dan Brickley, one of the inventors of FOAF, stated slightly sarcastically in personal communication: "we took one useful feature of RDF/RDFS (fine grained vocabulary composition) and elevated it to a cult-like holy law, to the extent that anyone who created a useful RDF vocabulary and wanted to keep improving it, found themselves pushed instead into combining it with dozens of other half-finished, poorly documented efforts that weren't really designed to fit together nicely."
Footnote 5: The main reason for Wikidata not to prescribe existing vocabularies was to leave the community the freedom to link and use what they deem useful within one consistent scheme/namespace: one of the reasons was to avoid the buy-in needed for existing ontologies, whose popularity, or the agreement about which, could shift over time. Therefore, they "left it to the community whether to choose stronger semantics (e.g., OWL) or weaker semantics (e.g., SKOS [37]) or not" (personal communication, Denny Vrandečić).


– We put a lot of effort into formal semantics and clean axiomatization of those ontologies – we got logical inconsistency (see footnote 6). Meanwhile, serious attempts to apply such reasoning over Web data in the wild have either had to restrict themselves to lightweight ontologies or have not been further developed in the past five years, with (a) the semantics of OWL [45] and even parts of RDF(S) [12] turning out to be too hard to grasp for normal Web users and developers to survive in the World Wild Web [24, 36]; and (b) the DL community mostly having turned their backs on seriously taking up the challenge of decentralized applications at Web scale.

– Berners-Lee et al., in their original Semantic Web article [7], promised Web-scale automation: automated calendar synchronisation, personalised health care assistance, home automation. Some of these applications are a reality now (Amazon Alexa's home control, or Google's and Apple's widely used services), but rather than relying on a decentralized Semantic Web, they use single companies' curated knowledge bases – also now called "Knowledge Graphs" – that enhance these companies' services' backend systems.

– More specifically, we see knowledge graphs evolve and embrace them as a success story of the Semantic Web. Yet a good definition of what a knowledge graph is and what differentiates it from an "ontology" is still to be provided – apart from the single distinguishing feature that all known examples of knowledge graphs (Google's, Bing's, and Yahoo's knowledge graphs, as well as their open counterparts DBpedia and Wikidata) are NOT decentralized.

So, here we are after more than ten years. However, there is one lighthouse project that clearly has implemented the vision of a decentralized Semantic Web. This single project that we, as a community, hinge upon and tend to accept as a clear success to wipe away all the failed promises mentioned above is: Linked Data [9] – the promise to be able to publish structured data in a truly decentralized fashion, with a couple of simple principles enabling automatic retrieval and integration by just "following your nose" (i.e., dereferencing HTTP links). This is the most powerful promise that filled the community with new enthusiasm through the so-called "LOD cloud", cf. Fig. 1. If we measure the number of datasets published according to the four Linked Data principles [6] that link to each other, we find evidence of growth and prosperity (cf. Fig. 2), and hope to finally make the vision of a decentralized Web of data come true. Meanwhile, indeed, this "cloud" contains over 1,184 datasets, which should be considered good news.

However, as we will discuss in the present paper, there are still serious barriers to consuming and using Linked Data from the "cloud", wherein we have to question the usefulness of the current LOD cloud. Thus, we would like to take a step back and assess the situation. We call for a more principled and, in our opinion, more useful restart, and for more collaboration and decentralization in the community itself.

Along these lines, in the remainder of this paper, we start with some background on the genesis of the current LOD cloud in Section 2. We then highlight five perceived main challenges that we deem important to address in order to make Linked Data more usable and, therefore, useful. These challenges will be presented – by examples and by discussing their implications – in the course of Section 3. Finally, we conclude with a call to collaboratively and openly address these challenges as a community in order to (re-)decentralize the Semantic Web again, in Section 4.

Footnote 6: Even within DBpedia [8, 40], the central crystallization point of the LOD cloud [10].
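Throughout the paper, the "follow your nose" principle mentioned above – dereferencing an HTTP URI to obtain RDF that describes it – is the baseline we measure against. As a purely illustrative reference point (not a tool used in this paper; the URI is just an example), a minimal client in Python could look as follows, assuming the rdflib library:

```python
from rdflib import Graph


def follow_your_nose(uri: str) -> Graph:
    """Dereference a Linked Data URI and return the RDF obtained."""
    g = Graph()
    # rdflib performs the HTTP GET with content negotiation for RDF serializations
    # and follows redirects, as prescribed by the Linked Data principles.
    g.parse(uri)
    return g


if __name__ == "__main__":
    g = follow_your_nose("http://dbpedia.org/resource/Linked_data")
    print(len(g), "triples retrieved")
    # Every URI mentioned in the result is itself a candidate for further dereferencing.
    for s, p, o in list(g)[:5]:
        print(s, p, o)
```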

2. Background: The genesis of the LOD Cloud

The creation of a complete Web index is a never-ending story. Since the early days of the Linked Data Web, several attempts to build and sustain exhaustive Linked Data search engines – such as Sindice [39], SWSE [30], Watson [17], or Swoogle [20], just to name a few – have been made and have failed. Typically based on bespoke, crawler-based architectures, these search engines relied on either (i) collecting data published under the Linked Data principles, particularly applying the "follow-your-nose" approach enabled through these principles (i.e., finding more Linked Data by dereferencing links appearing in Linked Data), and sometimes (ii) relying on "registry" or "pingback" services to collect and advertise Linked Data assets, such as Semantic Pingback [46]. In the meantime, unfortunately, all of these search engines have been discontinued, and we are not aware of any active, public Semantic Pingback services. Recently, the LOD Laundromat project [5] offers a URL lookup service (http://lotus.lodlaundromat.org/) generated from/for the (accessible parts of) the LOD cloud. Moreover, the LOD Cache by OpenLinkSW (lod.openlinksw.com) remains available for LOD entity lookups and SPARQL queries, although it does not provide a detailed specification of which datasets it indexes.

Both of these more recent efforts, though, claim to refer to datasets in the LOD cloud. The LOD cloud diagram [1], which took a different approach, has been generated from metadata provided by the community at a (CKAN-driven) Open Data portal, namely http://datahub.io. Interestingly, this is only confirmed for its prior version in August 2017, as references to datahub.io have been removed from current, later LOD diagram versions. Also note that datahub.io has moved in the meantime, and the "old" LOD cloud dataset metadata descriptions are only available via the suggestively "deprecated" URL https://old.datahub.io/.

[Figure 1. The "LOD cloud" diagram [1], from April 2018, counting 1,184 datasets.]

[Figure 2. The growth of the "LOD cloud" in number of datasets seems to indicate steady, though not rapid or overwhelming, adoption; we still have to view this as opposed to the probably much more rapid growth of other parts of the Web [41].]

While the LOD cloud initiative itself seems to have been suffering from starvation as well, the current noble effort depends on a few individuals, such as Abele et al. [1] (the creators and maintainers of the LOD cloud diagram), which seems to put the initiative at risk. At least, there is recent active development, with regular bimonthly updates on lod-cloud.net between April 2018 and April 2019.

Still, the LOD cloud at lod-cloud.net and the metadata at datahub.io seem to remain the single most popular entry points to Semantic Web data (with the exception of domain-specific portals such as BioPortal [44]), and therefore a bottleneck.

The metadata the LOD cloud relies on comprises fields such as the following (see footnote 9):

– tags, where, as a pre-filter, only those datasets are included in the cloud that have the tag "lod",
– link descriptions, i.e., declarations of numbers of links to other datasets,
– resources, that is, URLs to access the dataset in the form of, e.g., dumps, SPARQL endpoints, or semantic descriptions (e.g., in the form of a VoID [2] description or an XML sitemap).

Footnote 9: Note that our observations are based on metadata from datahub.io collected in April 2018; since then, lod-cloud.net has discontinued its use of datahub.io and now provides its own form-based submission system for metadata on its Web page.

Apart from the LOD cloud, a similar effort exists to collect and register Linked Data vocabularies and document their interconnections: the Linked Open Vocabularies (LOV) project by Vandenbussche et al. [47]. As opposed to the purely metadata-based approach of the LOD cloud collection, LOV relies on curation and quality checks, verification of parsable vocabulary descriptions, etc. We note that the distinction between Linked "vocabularies" and "data" is not always straightforward: for instance, the entries of BioPortal [44], a registry of ontologies (which could by definition be considered as vocabularies), are a (in fact, significant) part of the LOD cloud, but are not present in LOV.

So, where does this leave us? We have seen a lot of resources being put into publishing Linked Data, but a publicly widely visible "killer app" is still missing. The reason for this, in the opinion and experience of the authors, lies all too often in the frustrating experiences when trying to actually use Linked Data for building actual applications. Many attempts and projects still end up using a centralized warehousing approach, integrating a handful of datasets directly from their raw data sources, rather than being able to leverage their "lifted" Linked Data versions: the use and benefits of RDF and Linked Data over conventional database and warehouse technologies, for which more trained people are available, remain questionable. In the following, we will elaborate on the main reasons for this current state, as we perceive them, however with a hopeful perspective, that is, sketching solution paths to overcome these challenges.

3. Key Challenges for Linked Data Adoption

Reasons for LOD not yet having reached its full potential are manifold and not simple, and we do not claim to be exhaustive herein; rather, we provide a list from the experiences of the authors to help explain some major challenges in the current state of affairs around LOD. We have chosen to divide reasons into technical and non-technical underlying challenges.

3.1. Technical challenges

The current mode of collection of LOD – metadata published once-off by dataset creators – has led to mainly a nice drawing, rather than making Linked Data accessible and usable. We see the following major challenges when attempting to use Linked Data, parts of which we underpin by some preliminary analyses of the metadata from old.datahub.io. We are obviously not the first ones to recognize these as such, and will accompany them with similar analyses and references where available. Yet, we focus on challenges which we believe need a solution first, before we can dream about federated queries or optimizing query answering over Linked Data (which is what we mostly do in our research now – before addressing practical applications over several datasets in real, existing Linked Data).

3.1.1. Availability and resource limits

As a result of a recent analysis we did over the metadata on datahub.io, we unfortunately confirmed a very low level of availability of resources, which was already identified as one of the main challenges in the biomedical domain [34]. Among the mentioned 5435 resources in the 1281 "LOD"-tagged datasets on datahub.io, there are only 1917 resource URLs that could be dereferenced. Among all the datasets, only 646 dataset descriptions contain such dereferenceable resource URLs (not counting links to PDF, CSV, or TSV files). That is, almost half (635) of the dataset descriptions contain no dereferenceable resource URLs that would point to data at all. We applied a best effort here, that is, dereferencing both HTTP and FTP URLs with a timeout of 10 seconds awaiting a potential response, counting all 2xx return codes for a HEAD request for HTTP (following redirects), or, respectively, LIST requests for the containing directory for FTP, as positives. This confirms the similar experiments by Debattista in his thesis [18, Section 9] and in a more recent article [19]; many LOD cloud datasets are indeed not even mentioned in his quality assessment framework (http://jerdeb.github.io/lodqa), which only covers 130 accessible datasets.
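To make this best-effort procedure concrete, the following Python sketch (not the script we actually used; helper names and the sample URLs are merely illustrative) probes a resource URL exactly in the way described above: a HEAD request with redirects and a 10-second timeout for HTTP(S), and a directory listing for FTP.

```python
import ftplib
from urllib.parse import urlparse

import requests  # third-party HTTP library


def is_available(url: str, timeout: int = 10) -> bool:
    """Best-effort availability check as described above."""
    scheme = urlparse(url).scheme.lower()
    try:
        if scheme in ("http", "https"):
            # HEAD request, following redirects; any 2xx code counts as available.
            resp = requests.head(url, allow_redirects=True, timeout=timeout)
            return 200 <= resp.status_code < 300
        if scheme == "ftp":
            # LIST the directory containing the resource via anonymous FTP.
            parsed = urlparse(url)
            with ftplib.FTP(parsed.hostname, timeout=timeout) as ftp:
                ftp.login()
                ftp.cwd(parsed.path.rsplit("/", 1)[0] or "/")
                ftp.nlst()
            return True
    except Exception:
        return False
    return False


if __name__ == "__main__":
    # Illustrative sample of resource URLs as they might appear in dataset metadata.
    urls = [
        "http://downloads.dbpedia.org/2016-10/",
        "ftp://ftp.ebi.ac.uk/pub/databases/RDF/biomodels/",
    ]
    for u in urls:
        print(u, "->", "available" if is_available(u) else "unavailable")
```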

We note that even such a best-effort notion of availability could be viewed as optimistic if we look into a finer-grained analysis of the different formats behind these URLs (cf. Figure 3). For example, concerning SPARQL endpoints, our small experiment reconfirms that, among the 444 potential SPARQL endpoint URLs mentioned in the metadata, only 252 responded at all, and only 195 responded "true" to a simple ASK {?S ?P ?O} query (see footnote 11). Table 1 shows the numbers of responding endpoints (without timeouts) for a set of test queries, indicating a considerable number of non-responding or non-SPARQL-protocol-conformant endpoints.

Footnote 11: Also, some endpoint implementations returned non-SPARQL-protocol-conformant results, such as http://identifiers.org/services/sparql, which returns "false" on the above ASK query, although clearly its default graph is not empty.

[Figure 3. Types of URLs in the "LOD cloud" guessed by declared metadata format and suffixes (bar chart over categories such as rdf, html, other (zipped), other (unknown), and non-RDF (pdf, csv, tsv)).]

Table 1. SPARQL protocol conformant responses out of the 251 of overall 440 endpoints that responded at all.

  Query                                                Conformant responses
  ASK {?S ?P ?O}                                       195 (true) + 7 (false)
  ASK {}                                               150 (true) + 7 (false)
  ASK {GRAPH ?G {?S ?P ?O}}                            192 (true) + 9 (false)
  ASK {GRAPH ?G {}}                                    146 (true) + 11 (false)
  SELECT (count(*) AS ?C) WHERE {?S ?P ?O}             143 (137 non-zero)
  SELECT (count(*) AS ?C) WHERE {GRAPH ?G {?S ?P ?O}}  134 (132 non-zero)

Towards a solution path: In order to avoid stale or outdated datasets and SPARQL endpoints, a "live" LOD cloud would need to be producible in a regular and automated fashion. As a part of a solution path, we view regular monitoring frameworks, such as SPARQLES [48] (http://sparqles.ai.wu.ac.at/) or the Dynamic Linked Data Observatory [32] (http://km.aifb.kit.edu/projects/dyldo/), as essential: they both (i) assess which parts of the LOD cloud are still "alive", and (ii) could notify the providers and publishers about potential problems. Similarly, Debattista's fine-grained quality framework mentioned above [18, Section 9], aimed originally at re-assessing and testing LOD data regularly, could be a valid starting point, but seems to not have been updated since 2016. The same applies to the LOD Laundromat crawl, which is not updated on a regular basis. LODVader [4] and IDOL [3] are two further projects that have attempted to monitor and generate "live" LOD clouds and metadata in an automated manner directly from the data sources, and which could serve as starting points.
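A minimal liveness probe in the spirit of such monitoring frameworks can be written against the standard SPARQL protocol; the sketch below (the endpoint list is made up for illustration and this is not the monitoring code of any of the cited systems) sends the same ASK {?S ?P ?O} query used in our experiment and checks for a protocol-conformant boolean answer.

```python
import requests  # third-party HTTP library

ASK_QUERY = "ASK { ?s ?p ?o }"


def probe_endpoint(endpoint: str, timeout: int = 10):
    """Return True/False for a conformant boolean answer, or None on failure."""
    try:
        resp = requests.get(
            endpoint,
            params={"query": ASK_QUERY},
            headers={"Accept": "application/sparql-results+json"},
            timeout=timeout,
        )
        if resp.status_code != 200:
            return None
        answer = resp.json().get("boolean")
        # A conformant SPARQL endpoint must answer an ASK query with a boolean.
        return answer if isinstance(answer, bool) else None
    except (requests.RequestException, ValueError):
        return None


if __name__ == "__main__":
    # Illustrative endpoint list; in practice this would come from the LOD cloud metadata.
    for ep in ["http://dbpedia.org/sparql", "http://pubmed.bio2rdf.org/sparql"]:
        print(ep, "->", probe_endpoint(ep))
```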

Outdated, as well as non-available, data is worthless, and the frustrating experience of not finding half the resources when trying to retrieve Linked Data jeopardizes the LOD initiative rather than inviting externals to our own close community to buy in to the ideas of Linked Data. That is, the LOD cloud itself needs to be "live", and providers that do not comply with minimal availability over a certain duration should be notified and removed. Also, notoriously outdated, stale data should not be listed.

3.1.2. Size and Scalability

The situation in terms of dataset sizes has changed dramatically since the early days of Semantic Web search engines, when relatively small amounts of triples could be feasibly managed in a single triple store: a few datasets generated from big databases reach dramatic sizes. For instance, the latest edition of DBpedia (2016-10) consists of more than 13 billion triples, Wikidata comprises +5B triples, and the whole LOD Laundromat project, which attempts to process and cleanse the accessible part of the LOD cloud, reports at the moment 38.8B indexed triples.

We also note that, to the best of our knowledge, current triple stores on commodity servers do not scale up to more than 50B triples, apart from lab experiments on hardware probably not yet available to most research labs in our community. AllegroGraph and Oracle triple stores have reported dealing with up to 1 trillion triples (see footnote 14).

We already see the number of triples reported on the LOD cloud diverging from what a simple SELECT (COUNT(*) AS ?C) WHERE {?S ?P ?O} query sent to the respective endpoints reports, in various examples; just to name some: the PubMed-Bio2RDF endpoint (http://pubmed.bio2rdf.org/sparql) reports 1.37B triples on the query above (see footnote 16), whereas the dump (http://download.bio2rdf.org/#/release/4/pubmed/) reports 1.8B triples. Yet again, on a side note, different from both of these, the metadata at datahub.io reports 5B triples for the same dataset (https://old.datahub.io/dataset/bio2rdf-pubmed), where, however, it cannot be easily determined to what extent these numbers refer to different versions or subsets of the dataset. Likewise, Wikidata's query service responds to the same query with 5.2B triples, which is significantly lower than the 5.7B triples we retrieved from the dump mentioned above.

Footnote 14: Cf. https://www.w3.org/wiki/LargeTripleStores, last retrieved 2018-05-16, where we note that these experiments have been conducted on synthetic LUBM data, which does not necessarily reflect the characteristics of Linked Data "in the wild".
Footnote 16: The same number is returned on a query for quads, i.e., SELECT (COUNT(*) AS ?C) WHERE {GRAPH ?G {?S ?P ?O}}, which is of course not necessarily the case for all SPARQL endpoints.
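Such discrepancies are easy to reproduce. The sketch below (a simplified illustration; the mapping of endpoints to declared sizes is made up and does not reproduce our exact experiment) fires the COUNT query just mentioned at an endpoint and compares the result with the triple count declared in the dataset's metadata.

```python
import requests  # third-party HTTP library

COUNT_QUERY = "SELECT (COUNT(*) AS ?c) WHERE { ?s ?p ?o }"


def count_triples(endpoint: str, timeout: int = 300):
    """Ask the endpoint itself how many triples it serves (may time out or be capped)."""
    resp = requests.get(
        endpoint,
        params={"query": COUNT_QUERY},
        headers={"Accept": "application/sparql-results+json"},
        timeout=timeout,
    )
    resp.raise_for_status()
    bindings = resp.json()["results"]["bindings"]
    return int(bindings[0]["c"]["value"]) if bindings else None


if __name__ == "__main__":
    # Declared sizes as they might appear in catalog metadata (illustrative values only).
    declared = {"http://pubmed.bio2rdf.org/sparql": 5_000_000_000}
    for endpoint, meta_count in declared.items():
        try:
            print(endpoint, "endpoint:", count_triples(endpoint), "metadata:", meta_count)
        except requests.RequestException as err:
            print(endpoint, "failed:", err)
```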

In addition to that, it is mostly impossible to retrieve all triples from a SPARQL endpoint, due to result size restrictions that many endpoints apply, either in the form of timeouts or by only returning a certain maximum number of results/triples. Buil-Aranda et al. [14] discuss some of these restrictions in detail, and also explain why in general they cannot be trivially circumvented (e.g., by "paging" results with LIMIT and OFFSET). As another example of related problems, UniProt [15], reported to have +39B triples served on its public endpoint (cf. footnote 14), times out on the simple query to count its triples mentioned above.

Another potential challenge in terms of size and scalability is the amount of duplicates in current dumps. As an example, the PubMed RDF dump from Bio2RDF mentioned above consists of +7.22B nquads spread over 1151 dump files. A lot of triples are actually duplicated across these dump files from the same dataset. Downloading all of these and de-duplicating them locally both wastes bandwidth and makes processing such dumps unnecessarily cumbersome.

Towards a solution path: It seems that, in order to avoid both such discrepancies and bottlenecks for downloads and query processing, a combination of (i) dumps provided in HDT [21], a compressed and queryable RDF format, and (ii) Triple Pattern Fragments (TPF) endpoints [49] as the standard access method for Linked Datasets could alleviate some of these problems. The Triple Pattern Fragments interface essentially limits queries to an endpoint to simple triple-matching queries, which offloads processing of complex joins and other operations to the client side, while still not having to download complete dumps. HDT (http://rdfhdt.org), on the other hand, is an already compressed dump format that allows such triple pattern queries without decompression and also guarantees duplicate-freeness. Notably, there are already several TPF endpoints available (see http://linkeddatafragments.org/), most of them powered by HDT in the backend, thus creating a small server footprint and load, for either answering triple pattern queries or downloading the whole dump. HDT has also recently been extended to handle nquads besides RDF triple dumps, thus also being usable for datasets consisting of different (sub)graphs [22]; an analogous extension of the TPF interface to nquads would be straightforward. We also note that metadata such as the number of triples encoded is stored/computed during dump generation in the header of HDT files, thus providing a single, reliable entry point to metadata.
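As an illustration of this access pattern, the following sketch resolves triple patterns directly against a local HDT dump; it assumes the Python HDT bindings (the pyHDT/hdt package) and a hypothetical local file name, both of which are merely illustrative here.

```python
from hdt import HDTDocument  # Python HDT bindings (pyHDT); assumed to be installed

# Open a (hypothetical) local HDT dump; the compressed file is queried in place,
# without decompression and without loading it into a triple store.
doc = HDTDocument("pubmed.hdt")

# A fully unbound pattern: the reported cardinality is the total number of triples,
# known from HDT's dictionary/indexes without scanning the data.
_, total = doc.search_triples("", "", "")
print("total triples:", total)

# Resolve a concrete triple pattern; "" acts as a wildcard.
triples, matches = doc.search_triples("", "http://purl.org/dc/terms/title", "")
print("triples with a dcterms:title:", matches)
for s, p, o in triples:
    print(s, o)
    break  # show just one example
```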
As an additional starting point, work on load balancing under concurrency for SPARQL endpoints [38] also promises better resource management for server-side query processing. Eventually, we expect lots of room for research and potential solutions to the challenges of truly decentralized, scalable Linked Data querying, by combining client-side and server-side query processing in a demand-driven manner.

3.1.3. Findability and (Meta-)Data Formats

While the current metadata available on the LOD cloud does not tell us a lot about how to access single datasets, over time various dataset description formats and mechanisms have been proposed, namely (i) VoID descriptions, (ii) (Semantic) Sitemaps, and (iii) SPARQL service endpoint descriptions. In the following, we analyze their state of use in the LOD cloud.

The Vocabulary of Interlinked Datasets (VoID) had been designed as a minimalistic entry point for describing datasets and how to access them, containing properties for locating dumps (void:dataDump), finding SPARQL endpoints (void:sparqlEndpoint), or describing the size of the dataset, i.e., the number of triples (void:triples), and other structural statistics. In order to find the VoID description, it is suggested to place the dataset description in the root directory of a Web server under /.well-known/void.
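Under this convention, a consuming client can discover access options with a few lines of code. The sketch below (the host name is chosen purely for illustration) fetches /.well-known/void and extracts dump, endpoint, and size information using rdflib.

```python
from rdflib import Graph, Namespace

VOID = Namespace("http://rdfs.org/ns/void#")


def read_void(host: str) -> dict:
    """Fetch and parse the well-known VoID description of a host."""
    g = Graph()
    # rdflib fetches the URL and negotiates an RDF serialization.
    g.parse(f"https://{host}/.well-known/void")
    return {
        "dumps": [str(o) for o in g.objects(None, VOID.dataDump)],
        "endpoints": [str(o) for o in g.objects(None, VOID.sparqlEndpoint)],
        "declared_triples": [int(str(o)) for o in g.objects(None, VOID.triples)],
    }


if __name__ == "__main__":
    # Hypothetical host; as discussed below, many hosts in the LOD cloud do not serve this document.
    print(read_void("example.org"))
```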

There are various problems with this approach: firstly, different datasets hosted under one common domain/server cannot provide different dataset descriptions; as an illustration, for GitHub-hosted data, https://github.com/.well-known/void would obviously not return a valid VoID description, although GitHub is gaining popularity for hosting Linked Data sets. Secondly, even the "epicenter" of the LOD cloud, dbpedia.org, does not follow the rules and provides a VoID description at the non-obviously findable URL http://dbpedia.org/void/page/Dataset instead. Lastly, among all 881 hostnames mentioned in URLs in datahub.io's metadata, only 159 respond to an HTTP GET with this recipe, at least 75 of which seem to be HTML responses, and only 56 are valid RDF (we tested all hosts from URLs that provided non-error results); without going into further detail, even if the HTML contained RDFa (which, in the cases we inspected, it did not), it seems that easy-to-parse RDF results with valid VoID descriptions are the exception.

(Semantic) Sitemaps: XML Sitemaps (https://www.sitemaps.org/protocol.html) seem to be a more commonly implemented pattern to discover data and pages accessible via an HTTP server, not least because of their recommendation by search engines. It is a simple XML format that should guide crawlers across sites, and Tummarello et al. had even proposed an extension of the Sitemaps protocol to link to RDF datasets specifically [16], which has been implemented in Sindice [39]. Sitemaps are expected to be found under the root of a dataset's directory on a host in a file called 'sitemap.xml', that is, not necessarily directly underneath the root directory of the host address. datahub.io's metadata contains hints (by filename) to such sitemaps for 57 datasets, 56 of which indeed return valid sitemaps, and 55 of which indeed use the semantic sitemap extension [16] (52 containing a sc:dataDump attribute and 53 containing a sc:sparqlEndpoint field). Overall, while semantic sitemaps are only used for a marginal ~5% of datahub.io datasets, they seem to be used fairly consistently.
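Checking a dataset for such a semantic sitemap can likewise be automated. The sketch below (URL and element handling are simplified; elements are matched only by local-name prefix so that neither the exact semantic-sitemap namespace URI nor the exact element variant needs to be hard-coded) looks for dataset dump and SPARQL endpoint hints in a sitemap.xml.

```python
import xml.etree.ElementTree as ET

import requests  # third-party HTTP library


def semantic_sitemap_hints(sitemap_url: str, timeout: int = 10) -> dict:
    """Collect dataDump*/sparqlEndpoint* hints from a (semantic) sitemap."""
    resp = requests.get(sitemap_url, timeout=timeout)
    resp.raise_for_status()
    root = ET.fromstring(resp.content)
    hints = {"dataDump": [], "sparqlEndpoint": []}
    for elem in root.iter():
        # Namespaced tags look like '{namespace}localname'; keep only the local name.
        local = elem.tag.rsplit("}", 1)[-1]
        for key in hints:
            if local.startswith(key) and elem.text:
                hints[key].append(elem.text.strip())
    return hints


if __name__ == "__main__":
    # Hypothetical sitemap location.
    print(semantic_sitemap_hints("https://example.org/sitemap.xml"))
```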
SPARQL service endpoint descriptions: according to the SPARQL 1.1 specification, "SPARQL services made available via the SPARQL Protocol SHOULD return a service description document at the service endpoint when dereferenced using the HTTP GET operation without any query parameter strings provided. This service description MUST be made available in an RDF serialization, MAY be embedded in (X)HTML by way of RDFa [RDFA], and SHOULD use content negotiation [CONNEG] if available in other RDF representations." Yet, out of the 251 potential respondent endpoint addresses mentioned above, only 136 respond to this recipe, out of which in fact 63 return HTML (mostly query forms), even when attempting CONNEG (i.e., sending an Accept header listing common RDF serializations such as text/turtle, application/n-triples, application/n-quads, and application/rdf+xml). We note that while some of these HTML responses might contain RDFa, it is still an extra step to extract and parse it, and each such extra step will bloat a potential consuming client unnecessarily. Similarly, when attempting to find data dumps without a semantic sitemap or a VoID file in place, our best guess would be to guess and try parsers based on "format" descriptors in the metadata or on filename suffixes. An additional complication here are compressed formats, where having to attempt different decompression formats (gzip, bzip, tar, zip, just to name a few), sometimes even used in combination, further complicates accessibility. Some of the guessed formats we found in all URLs are listed in Fig. 3 above.

We note that by manual inspection, some endpoint addresses or the accessibility of some datasets could be recovered, but since we emphasize machine accessibility, manual "recovery" seems an undesirable option.

Towards a solution path: We feel that, for automatic findability, Semantic Sitemaps with pointers to a VoID description – with concrete pointers to, primarily, a dump, preferably in HDT, as well as (optionally) a pointer to a SPARQL endpoint (or TPF endpoint) – should become the commonly agreed upon practice. We note here that the use of HDT makes this task even simpler, as the Header part of an HDT dump file readily holds a place for metadata descriptions about the dataset (see footnote 24). Also, SPARQL endpoints should provide service descriptions in easily accessible RDF (not RDFa) available via CONNEG, where again these SPARQL service descriptions should describe service limitations (such as, e.g., result size limits or connection limits and timeouts). Also, the service description should declare potential differences between the data in the dump and in the endpoint, if any. We emphasize here that, to the best of our knowledge, there is no agreed upon vocabulary for SPARQL endpoint restrictions and capabilities.

Footnote 24: In fact, some automatically computable VoID properties are already computed and included in HDT's header per default, and it is possible to add additional properties such as pointers to (SPARQL or Linked Data Fragments) endpoints, or used namespaces, within this header, as a single point of access through an HDT dump file.
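The service-description recipe quoted above can also be tested mechanically. The sketch below (endpoint chosen for illustration; we only request Turtle to keep it short) dereferences the bare endpoint URL with content negotiation and looks for statements in the W3C SPARQL Service Description vocabulary.

```python
import requests  # third-party HTTP library
from rdflib import Graph, Namespace

SD = Namespace("http://www.w3.org/ns/sparql-service-description#")


def fetch_service_description(endpoint: str, timeout: int = 10):
    """GET the endpoint URL without a query and try to parse an RDF service description."""
    resp = requests.get(endpoint, headers={"Accept": "text/turtle"}, timeout=timeout)
    resp.raise_for_status()
    g = Graph()
    g.parse(data=resp.text, format="turtle")  # only Turtle is requested in this sketch
    # Any sd:endpoint statement indicates a usable service description.
    return list(g.subject_objects(SD.endpoint))


if __name__ == "__main__":
    print(fetch_service_description("http://dbpedia.org/sparql"))
```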

3.1.4. Linked Data Quality & Semantics of Links

The Linked Data principles define rough guidelines on dereferenceability and linkage of datasets. Yet, in order for RDF datasets, once downloaded, to be truly machine-processable, and in order to traverse and interpret those links fruitfully, more detailed guidelines seem indispensable: in an early approach, Hogan et al. proposed the "Pedantic Web" [29], alongside the since discontinued tool RDFAlerts, to check and assess the quality, dereferenceability, and finally syntactical (e.g., use of ill-defined literals) and logical consistency (in terms of RDFS/OWL inferences, use of literals in place of object properties, availability of definitions for used properties and classes, etc.) of RDF datasets. A lot of these checks, though, were not necessarily designed to scale to datasets of billions of triples, or, respectively, should be reassessed in terms of feasibility: HDT could serve as a basis for scalable, out-of-the-box implementations of such checks on a dataset level [26].
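As a flavour of such checks, the sketch below (a toy fragment of what a Pedantic-Web-style validator might do; the file name is hypothetical) scans a downloaded dataset for one class of purely syntactic problems, namely literals whose lexical form does not match their declared numeric XSD datatype.

```python
from rdflib import Graph, Literal
from rdflib.namespace import XSD

# Numeric datatypes checked in this toy example, mapped to Python converters.
NUMERIC = {XSD.integer: int, XSD.decimal: float, XSD.float: float, XSD.double: float}


def ill_typed_numeric_literals(path: str):
    """Yield triples whose object literal does not parse under its declared datatype."""
    g = Graph()
    g.parse(path, format="turtle")  # assuming a Turtle dump
    for s, p, o in g:
        if isinstance(o, Literal) and o.datatype in NUMERIC:
            try:
                NUMERIC[o.datatype](str(o))
            except ValueError:
                yield s, p, o


if __name__ == "__main__":
    for triple in ill_typed_numeric_literals("dataset.ttl"):
        print("ill-typed literal:", triple)
```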
We arbitrarily require RDF/XML or by @prefix declaration in Turtle, respec- 33 34 at least 50 links.” An older version of the page also tively, this seems not to be a reliable way of recogniz- 34 35 provided a slightly more concrete definition of what is ing all used namespaces within an RDF datadump in a 35 36 meant by a link here: “A link, for our purposes, is an declarative machine-readable manner. Plus, as the ex- 36 37 RDF triple where subject and object URIs are in the ample illustrates, even if we had all namespaces occur- 37 38 namespaces of different datasets.” We however find ring within a dataset, various URL schemes used refer 38 39 this definition hard to assess. Since no concrete guide- to either non-dereferenceable or non-RDF publishing 39 40 line with regards to “ownership” of namespaces is pro- third-party namespaces, that cannot be simple assigned 40 41 vided here, any attempt to compute such links auto- to “belonging” to a single dataset. More issues about 41 42 matically is doomed to fail. From our observations on URI schemes and namespaces and term (non-)re-use 42 43 different datasets, it is by no means always clear have been described in [26, 33, 35]. 43 44 1. to which namespace a URI belongs, or Last, but not least, as an open problem, links in one 44 45 2. to which dataset a namespace belongs dataset always refer to a particular version of the linked 45 46 As for 1, we note that in many cases it is not even dataset, which cannot be guaranteed to persist or be- 46 47 clear entirely purely from the RDF data which part of ing dereferenceable in the future. For more sustain- 47 48 the URIs in a dataset denote namespaces: namespaces 48 49 49 and qnames in RDF have no special status as in XML, 25which is one of many, we emphasize it is not our intention to 50 they simply denote prefixes; while certain “recipes” point fingers to anyone! 50 51 for such prefixes exist, such as most commonly used 26at ftp://ftp.ebi.ac.uk/pub/databases/RDF/biomodels/ 51 A. Polleres et al. / A More Decentralized Vision for Linked Data 9

While – depending on the serialisation – namespaces could be filtered out based on being explicitly represented (e.g., marked by XML namespaces in RDF/XML or by @prefix declarations in Turtle, respectively), this seems not to be a reliable way of recognizing all used namespaces within an RDF data dump in a declarative, machine-readable manner. Plus, as the example illustrates, even if we had all namespaces occurring within a dataset, various URL schemes used refer to either non-dereferenceable or non-RDF-publishing third-party namespaces, which cannot simply be assigned as "belonging" to a single dataset. More issues about URI schemes, namespaces, and term (non-)re-use have been described in [26, 33, 35].

Last, but not least, as an open problem, links in one dataset always refer to a particular version of the linked dataset, which cannot be guaranteed to persist or remain dereferenceable in the future. For more sustainable Linked Open Data, we therefore deem versioned Linked Data as well as archives a necessity.

Towards a solution path: We feel that, in order to avoid such issues, established best practices for Linked Data publishing would need to provide more guidelines for URL minting and reuse. Namespace and ID minting should probably be restricted to machine-recognizable patterns (such as strict adherence to '/' and '#' namespaces, with dereferenceable namespace URLs). Ownership of a namespace could – for instance – be restricted to the pay-level domain, that is, the definition of namespaces being restricted to one's own pay-level domain, with URL and namespace schemas given a clear machine-readable ownership relation. We leave a concrete definition of such a machine-readable and assessable ownership open for now, but refer to similar concepts and thoughts about URI "authority", discussed before in the context of ontological inference by Hogan in his thesis [28, Section 5], as a potential starting point. Hogan's thesis also contains some details on scalable implementations of the above-mentioned checks described earlier for RDFAlerts [29], which we believe could be implemented directly and efficiently on top of indexed compressed formats such as HDT; we leave this to future work on our agenda for now.

As for archiving and versions, we refer to [23] and the references therein as starting points. While no standard exists at this point for how to publish versioned RDF archives, let us again refer to possible HDT-based solutions, particularly enabled through the recent extension of HDT to handle nquads [22].
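To illustrate how the pay-level-domain ownership proposed above would make the LOD cloud's link criterion computable, the sketch below (a simplification; it assumes the third-party tldextract package and equates namespace ownership with the registered domain of a URI) counts cross-domain links in a downloaded dataset.

```python
from collections import Counter

import tldextract  # third-party library for extracting registered (pay-level) domains
from rdflib import Graph, URIRef


def pay_level_domain(uri: str) -> str:
    """Approximate namespace 'ownership' by the registered domain of the URI."""
    return tldextract.extract(uri).registered_domain


def cross_domain_links(path: str, own_domain: str) -> Counter:
    """Count triples whose subject lives under our domain and whose object lives elsewhere."""
    g = Graph()
    g.parse(path, format="turtle")  # assuming a Turtle dump
    links = Counter()
    for s, _, o in g:
        if isinstance(s, URIRef) and isinstance(o, URIRef):
            s_dom, o_dom = pay_level_domain(str(s)), pay_level_domain(str(o))
            if s_dom == own_domain and o_dom and o_dom != own_domain:
                links[o_dom] += 1
    return links


if __name__ == "__main__":
    # Hypothetical dump and owning domain; the LOD cloud criterion requires at least 50 such links.
    for target, count in cross_domain_links("dataset.ttl", "example.org").most_common():
        print(target, count)
```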
3.2. Non-Technical Challenges

Even if we are able to solve all the above technical challenges, there are several more pertinent issues on the critical path to the success of LOD. Many non-technical challenges should be fixed in order to stimulate adoption of Linked Data, a non-exhaustive list of which we briefly describe hereafter.

3.2.1. Completeness/Consistency

Several well-known and important RDF datasets are missing in the LOD cloud. For example, EBI RDF [31] is not there (plus various other well-known databases from the biomedical and life sciences domain [34]), which have gone through the effort of publishing RDF, but have not taken the additional hurdle of manually adding and updating their metadata in yet another centralized catalog such as datahub.io. For similar reasons, Wikidata is not a dataset in the LOD cloud, although it is clearly well linked with several datasets present.

Overall, the burden of manually and pro-actively having to provide and maintain LOD cloud metadata on the publisher side has proven unsustainable.

3.2.2. Trust

Besides the pervasive issues of availability and reliability, developers are rightfully worried that published data in the cloud is not kept up to date, and as such these technical issues might overall give rise to (or have possibly already given rise to) doubts about the technologies and principles of Linked Data. Stale datasets – still available, but with outdated, once-off RDF exports of databases that have evolved in the meantime – likewise raise trustworthiness issues in Linked Data.

While it seems to have been a sufficient incentive to "appear" in the LOD cloud to publish datasets adhering to Linked Data principles, a similarly strong incentive to sustain and maintain the quality of published datasets seems to be missing. It is therefore important for us as a community to keep the LOD project up, alive, and maintained, by creating sustainable publishing and monitoring processes (the questionable alternative being to re-brand under another name after an "LOD winter" of unfulfilled expectations).

3.2.3. Governance

We note that not only trust in the LOD cloud itself, but also mutual trust between LOD providers may be a problem that is difficult to circumvent. For instance, the presence of various different unlinked "RDF dumps" or LOD datasets arising from exports of the same legacy database (BIOMODELS given as one illustrative example of many above) could potentially be related to many of our exports and datasets being created in isolation, by closed groups, without incentives for collaboration, sharing infrastructures, and evolving those exports jointly. This issue can only be solved by a more collaborative, and truly open, governance.

3.2.4. Documentation and Usability

Besides the technical issues discussed above, usability issues and documentation standards have long been overlooked in many Linked Data projects. Industry-strength tools to consume and use Linked Data with sufficient documentation are still under-developed. We believe this issue can be ameliorated by: (1) better metadata for describing the datasets; (2) better documentation for using the datasets, including sample queries; (3) better tool support for enabling reuse of existing vocabularies; and (4) supporting and promoting developer-friendly formats, such as JSON-LD.

In addition, in terms of positive examples, apart from the aforementioned HDT and TPF projects, useful SPARQL query editing tools such as YASGUI [43] or Wikidata's query interface have appeared in the last two years; we need more tools like those.

3.2.5. Funding & Competition

Last, but not least, while the EU and other funding agencies have greatly supported the creation of a Web of data, we also feel that there are problematic side effects which need discussion and counter-strategies:

– cross-continental research initiatives are not being funded;
– EU project consortia are typically judged by complementary partner expertise.

Both these factors prevent research groups working on overlapping topics from collaborating, and rather stimulate an environment of isolated, closed research instead of open collaboration to jointly build sustainable solutions and address the issues mentioned so far.

Lack of collaboration may in other cases also simply be caused by the disconnect of research communities. This is, for instance, exemplified by the Semantic Web in Life Sciences community seemingly having recently started efforts very similar to SPARQLES [48], building up a completely independent SPARQL endpoint monitoring framework [52] (available at http://yummydata.org/endpoints), not even citing SPARQLES (sic!), which seems to unnecessarily duplicate efforts instead of collaboratively developing and maintaining such services.

4. Conclusions and Next Steps

So, is Linked Data doomed to fail? In this paper we did not present a lot of new insights, but rather a deliberately provocative articulation of rethinking Linked Open Data and its principles. It is not too late to counteract and join forces. We hope that our summary of problems and challenges, reminders of valuable past attempts to address them, and outline of potential solution strategies can serve as a discussion basis for a fresh start towards more actionable Linked Data. Yet, we should not conceal that there is still a lot of research to be done and open problems to work on, for instance in demonstrating how the sketched solution paths can be realized at Web scale.

On the bright side, specific communities, such as the biomedical community, have been very successful in using OWL and Semantic Web technologies for the management of large biomedical vocabularies and ontologies; for a detailed overview of successes in this area we refer to [33, 34]. Main factors for successful projects are: (1) having a dedicated and very active development team behind them, with continuous funding over several years; (2) actively building a strong community of domain users from different areas, and using their needs as the driver for the ontology development; (3) having exemplary documentation, both about the ontology itself and about how to use Linked Data in applications targeted at domain users, as well as documentation about the processes for building and maintaining collaboratively generated Linked Data sources; (4) using a principled approach for developing the underlying ontology and maintaining the vocabularies used; and (5) using automated pipelines to check and ensure data and vocabulary quality.

Our hope is that the Linked Data community can learn from such specific projects, and that it will try to apply some of the same approaches that have proved to be so successful. We believe the community needs to work on those by joining forces, rather than by competition. We also argued that HDT, a compressed and queryable dump format for Linked Datasets, could play a central role as a starting point to address some (but not all) of the technical challenges we have outlined, i.e., implicitly suggesting a "fifth" Linked Data principle [6]:

5. Publish your dataset as an HDT dump, including VoID metadata as part of its header, declaring (i) the (authoritatively) owned namespaces, (ii) links to previous and most current versions of the dataset, and (iii) – whenever you use namespaces owned by other datasets or ontologies – links to specific versions of these other datasets.

In fact, we would argue that more principled Linked Data publishing could allow LOD clouds to be auto-generated from a set of HDT dumps, which we plan to demonstrate in future work. While we admit, of course, that HDT is not the only possible solution to this problem, what we aim to emphasize here is the necessity of a principled approach to overcome the obvious synchronization problems arising from the separate maintenance of metadata outside of the actual knowledge source in current dataset catalogues.
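As a rough illustration of that idea, the sketch below assumes that each dump's VoID metadata has already been exported to a small RDF file (e.g., from an HDT header, using external tooling not shown here) and that linksets are described with void:Linkset; it merges such descriptions and derives the "cloud" edges with their link counts automatically. File names are hypothetical.

```python
from rdflib import Graph, Namespace
from rdflib.namespace import RDF

VOID = Namespace("http://rdfs.org/ns/void#")


def build_cloud(void_files):
    """Aggregate per-dataset VoID descriptions into (source, target, link-count) edges."""
    g = Graph()
    for path in void_files:
        g.parse(path, format="turtle")  # one VoID description per dataset
    edges = []
    for linkset in g.subjects(RDF.type, VOID.Linkset):
        source = g.value(linkset, VOID.subjectsTarget)
        target = g.value(linkset, VOID.objectsTarget)
        count = g.value(linkset, VOID.triples)
        if source and target:
            edges.append((str(source), str(target), int(str(count)) if count else 0))
    return edges


if __name__ == "__main__":
    # Hypothetical VoID exports, e.g. extracted from the headers of a set of HDT dumps.
    for edge in build_cloud(["dataset-a-void.ttl", "dataset-b-void.ttl"]):
        print(edge)
```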

An interesting, very recent initiative in this context is the DBpedia Databus project (https://blog.dbpedia.org/2019/10/02/one-billion-derived-knowledge-graphs/), currently in public beta, which aims at providing an end-to-end pipeline easing auto-extraction, metadata generation, and publishing of Linked Data Knowledge Graphs at scale.

Apart from technical challenges, other issues arose that seem equally important, such as the establishment of collaborative and shared research infrastructures to guarantee sustainable funding and the persistence of Linked Data assets, as we have seen many promising efforts and initiatives mentioned in this paper having been discontinued, unfortunately. In the meanwhile, we also emphasize that initiatives like the recently US-founded "Open Knowledge Network" initiative (http://ichs.ucsf.edu/open-knowledge-network/) or the Dagstuhl seminar on "New Directions for Knowledge Representation on the Semantic Web" [11] have provided platforms to openly discuss such a fresh start, in the context of new trends and efforts around Knowledge Graphs and the FAIR principles [51], which parallel and complement the Linked Data movement.

Acknowledgements

A preliminary version of this work [42] was presented at the DESEMWEB2018 workshop: we thank the participants, and particularly Dan Brickley, Sarven Capadisli, Amrapali Zaveri, and Denny Vrandečić, for their comments on this first revision. Axel Polleres' work was supported under Stanford University's Distinguished Visiting Austrian Chair program. Javier Fernández' work was supported by the EC under the H2020 project SPECIAL and by the Austrian Research Promotion Agency (FFG) under the project "CitySpin". Maulik Kamdar, Tania Tudorache, and Mark Musen were supported in part by the grants GM086587, GM103316, and U54-HG004028 from the US National Institutes of Health.

References

[1] A. Abele, J. P. McCrae, P. Buitelaar, A. Jentzsch, and R. Cyganiak. Linking Open Data cloud diagram (2018-04-30), 2018. From http://lod-cloud.net/; retr. 2018/06/01.
[2] K. Alexander, R. Cyganiak, M. Hausenblas, and J. Zhao. Describing Linked Datasets with the VoID Vocabulary. W3C Interest Group Note, 03 March 2011. From https://www.w3.org/TR/void/; retr. 2018/06/01.
[3] C. Baron Neto, D. Kontokostas, A. Kirschenbaum, G. Publio, D. Esteves, and S. Hellmann. IDOL: Comprehensive & complete LOD insights. In Proceedings of the 13th International Conference on Semantic Systems (SEMANTiCS 2017), 2017. DOI: 10.1145/3132218.3132238.
[4] C. Baron Neto, K. Müller, M. Brümmer, D. Kontokostas, and S. Hellmann. LODVader: An interface to LOD Visualization, Analytics and DiscovERy in Real-time. In Proceedings of the 25th International Conference Companion on World Wide Web, WWW '16 Companion, pages 163–166. International World Wide Web Conferences Steering Committee, 2016. DOI: 10.1145/2872518.2890545.
[5] W. Beek, L. Rietveld, H. R. Bazoobandi, J. Wielemaker, and S. Schlobach. LOD Laundromat: a uniform way of publishing other people's dirty data. In 13th International Semantic Web Conference (ISWC), pages 213–228. Springer, 2014. DOI: 10.1007/978-3-319-11964-9_14.
[6] T. Berners-Lee. Linked Data. W3C Design Issues, July 2006. From http://www.w3.org/DesignIssues/LinkedData.html; retr. 2018/06/01.
[7] T. Berners-Lee, J. Hendler, and O. Lassila. The Semantic Web. Scientific American, pages 29–37, May 2001.
[8] S. Bischof, M. Krötzsch, A. Polleres, and S. Rudolph. Schema-agnostic query rewriting in SPARQL 1.1. In 13th International Semantic Web Conference (ISWC), LNCS. Springer, Oct. 2014. DOI: 10.1007/978-3-319-11964-9_37.
[9] C. Bizer, T. Heath, and T. Berners-Lee. Linked Data – The Story So Far. Int. J. Semantic Web Inf. Syst., 5(3):1–22, 2009. DOI: 10.4018/jswis.2009081901.
[10] C. Bizer, J. Lehmann, G. Kobilarov, S. Auer, C. Becker, R. Cyganiak, and S. Hellmann. DBpedia – A crystallization point for the Web of Data. J. Web Sem., 7(3):154–165, 2009. DOI: 10.1016/j.websem.2009.07.002.
[11] P. A. Bonatti, S. Decker, A. Polleres, and V. Presutti. Knowledge Graphs: New Directions for Knowledge Representation on the Semantic Web (Dagstuhl Seminar 18371). Dagstuhl Reports, 8(9):29–111, 2019. DOI: 10.4230/DagRep.8.9.29.
[12] D. Brickley and R. Guha. RDF Schema 1.1. W3C Recommendation, Feb. 2014. http://www.w3.org/TR/rdf-schema/.
[13] D. Brickley and L. Miller. FOAF Vocabulary Specification 0.99, Jan. 2014. http://xmlns.com/foaf/0.1/.
[14] C. Buil-Aranda, A. Polleres, and J. Umbrich. Strategies for executing federated queries in SPARQL 1.1. In 13th International Semantic Web Conference (ISWC). Springer, 2014. DOI: 10.1007/978-3-319-11915-1_25.
[15] U. Consortium et al. The Universal Protein Resource (UniProt). Nucleic Acids Research, 36(suppl 1):D190–D195, 2008. DOI: 10.1093/nar/gkm895.
[16] R. Cyganiak, H. Stenzhorn, R. Delbru, S. Decker, and G. Tummarello. Semantic Sitemaps: Efficient and flexible access to datasets on the Semantic Web. In Proceedings of the European Semantic Web Conference (ESWC), pages 690–704, 2008.
[17] M. d'Aquin and E. Motta. Watson, More Than a Semantic Web Search Engine. Semantic Web, 2(1):55–63, 2011. DOI: 10.3233/SW-2011-0031.
[18] J. Debattista. Scalable Quality Assessment of Linked Data. PhD thesis, Rheinische Friedrich-Wilhelms-Universität Bonn, Oct. 2016.
[19] J. Debattista, C. Lange, S. Auer, and D. Cortis. Evaluating the quality of the LOD cloud: An empirical investigation. Semantic Web, 9(6):859–901, 2018. DOI: 10.3233/SW-180306.
[20] L. Ding, T. Finin, A. Joshi, R. Pan, R. S. Cost, Y. Peng, P. Reddivari, V. Doshi, and J. Sachs. Swoogle: A Search and Metadata Engine for the Semantic Web. In Proceedings of the ACM International Conference on Information and Knowledge Management (CIKM), pages 652–659. ACM, 2004. DOI: 10.1145/1031171.1031289.

[21] J. D. Fernández, M. A. Martínez-Prieto, C. Gutiérrez, A. Polleres, and M. Arias. Binary RDF Representation for Publication and Exchange (HDT). J. Web Sem., 19(2), 2013. DOI: 10.1016/j.websem.2013.01.002.
[22] J. D. Fernández, M. A. Martínez-Prieto, A. Polleres, and J. Reindorf. HDTQ: Managing RDF datasets in compressed space. In European Semantic Web Conference (ESWC). Springer, 2018. DOI: 10.1007/978-3-319-93417-4_13.
[23] J. D. Fernández, J. Umbrich, A. Polleres, and M. Knuth. Evaluating query and storage strategies for RDF archives. Semantic Web, 10(2):247–291, 2019. DOI: 10.3233/SW-180309.
[24] B. Glimm, A. Hogan, M. Krötzsch, and A. Polleres. OWL: Yet to arrive on the Web of Data? In WWW2012 Workshop on Linked Data on the Web (LDOW), 2012.
[25] T. R. Gruber. Toward principles for the design of ontologies used for knowledge sharing? International Journal of Human-Computer Studies, 43(5-6):907–928, 1995. DOI: 10.1006/ijhc.1995.1081.
[26] A. Haller, J. D. Fernández, M. R. Kamdar, and A. Polleres. What are Links in Linked Open Data? A Characterization and Evaluation of Links between Knowledge Graphs on the Web. Technical Report 2/2019, Department für Informationsverarbeitung und Prozessmanagement, WU Vienna University of Economics and Business, 2019.
[27] A. Haller and A. Polleres. Are We Better Off With Just One Ontology on the Web? Semantic Web, in this volume, 2019.
[28] A. Hogan. Exploiting RDFS and OWL for Integrating Heterogeneous, Large-Scale, Linked Data Corpora. PhD thesis, Digital Enterprise Research Institute, National University of Ireland, Galway, 2011.
[29] A. Hogan, A. Harth, A. Passant, S. Decker, and A. Polleres. Weaving the pedantic web. In International Workshop on Linked Data on the Web (LDOW) at WWW, 2010.
[30] A. Hogan, A. Harth, J. Umbrich, S. Kinsella, A. Polleres, and S. Decker. Searching and browsing Linked Data with SWSE: The Semantic Web Search Engine. J. Web Sem., 9(4):365–401, 2011. DOI: 10.1016/j.websem.2011.06.004.
[31] S. Jupp, J. Malone, J. Bolleman, M. Brandizi, M. Davies, L. Garcia, A. Gaulton, S. Gehant, C. Laibe, N. Redaschi, et al. The EBI RDF platform: linked open data for the life sciences. Bioinformatics, 30(9):1338–1339, 2014. DOI: 10.1093/bioinformatics/btt765.
[32] T. Käfer, A. Abdelrahman, J. Umbrich, P. O'Byrne, and A. Hogan. Observing linked data dynamics. In Proceedings of the Extended Semantic Web Conference (ESWC), pages 213–227. Springer, 2013. DOI: 10.1007/978-3-642-38288-8_15.
[33] M. R. Kamdar. A web-based integration framework over heterogeneous biomedical data and knowledge sources. PhD thesis, Stanford University, 2019.
[34] M. R. Kamdar, J. D. Fernández, A. Polleres, T. Tudorache, and M. A. Musen. Enabling web-scale data integration in biomedicine through linked open data. NPJ Digital Medicine, 2(1):1–14, 2019. DOI: 10.1038/s41746-019-0162-5.
[35] M. R. Kamdar, T. Tudorache, and M. A. Musen. A systematic analysis of term reuse and term overlap across biomedical ontologies. Semantic Web, 8(6):853–871, 2017. DOI: 10.3233/SW-160238.
[36] A. Mallea, M. Arenas, A. Hogan, and A. Polleres. On Blank Nodes. In Proceedings of the International Semantic Web Conference (ISWC), volume 7031 of LNCS. Springer, 2011. DOI: 10.1007/978-3-642-25073-6_27.
[37] A. Miles and S. Bechhofer. Simple Knowledge Organization System Reference. W3C Recommendation, 18 August 2009.
[38] T. Minier, H. Skaf-Molli, and P. Molli. SaGe: Web preemption for public SPARQL query services. In The World Wide Web Conference, pages 1268–1278. ACM, 2019. DOI: 10.1145/3308558.3313652.
[39] E. Oren, R. Delbru, M. Catasta, R. Cyganiak, H. Stenzhorn, and G. Tummarello. Sindice.com: a document-oriented lookup index for open linked data. IJMSO, 3(1):37–52, 2008. DOI: 10.1504/IJMSO.2008.021204.
[40] H. Paulheim and A. Gangemi. Serving DBpedia with DOLCE – more than just adding a cherry on top. In 14th International Semantic Web Conference (ISWC), pages 180–196, 2015. DOI: 10.1007/978-3-319-25007-6_11.
[41] A. Polleres, A. Hogan, A. Harth, and S. Decker. Can we ever catch up with the Web? Semantic Web, 1(1-2):45–52, 2010. DOI: 10.3233/SW-2010-0016.
[42] A. Polleres, M. R. Kamdar, J. D. Fernández, T. Tudorache, and M. A. Musen. A more decentralized vision for linked data. In Decentralizing the Semantic Web (Workshop of ISWC2018), volume 2165 of CEUR Workshop Proceedings. CEUR-WS.org, Oct. 2018.
[43] L. Rietveld and R. Hoekstra. The YASGUI family of SPARQL clients. Semantic Web, 8(3):373–383, 2017. DOI: 10.3233/SW-150197.
[44] M. Salvadores, P. R. Alexander, M. A. Musen, and N. F. Noy. BioPortal as a dataset of linked biomedical ontologies and terminologies in RDF. Semantic Web, 4(3):277–284, 2013. DOI: 10.3233/SW-2012-0086.
[45] M. K. Smith, C. Welty, and D. L. McGuinness. OWL Web Ontology Language Guide. W3C Recommendation, Feb. 2004. http://www.w3.org/TR/owl-guide/.
[46] S. Tramp, P. Frischmuth, T. Ermilov, and S. Auer. Weaving a Social Data Web with Semantic Pingback. In Knowledge Engineering and Management by the Masses (EKAW), volume 6317 of LNAI, pages 135–149. Springer, 2010. DOI: 10.1007/978-3-642-16438-5_10.
[47] P. Vandenbussche, G. Atemezing, M. Poveda-Villalón, and B. Vatant. Linked Open Vocabularies (LOV): A gateway to reusable semantic vocabularies on the Web. Semantic Web, 8(3):437–452, 2017. DOI: 10.3233/SW-160213.
[48] P.-Y. Vandenbussche, J. Umbrich, L. Matteis, A. Hogan, and C. Buil-Aranda. SPARQLES: Monitoring public SPARQL endpoints. Semantic Web, 8(6):1049–1065, 2017. DOI: 10.3233/SW-170254.
[49] R. Verborgh, M. V. Sande, O. Hartig, J. V. Herwegen, L. D. Vocht, B. D. Meester, G. Haesendonck, and P. Colpaert. Triple Pattern Fragments: A low-cost knowledge graph interface for the Web. J. Web Sem., 37-38:184–206, 2016. DOI: 10.1016/j.websem.2016.03.003.
[50] D. Vrandečić and M. Krötzsch. Wikidata: A Free Collaborative Knowledgebase. Commun. ACM, 57(10):78–85, 2014. DOI: 10.1145/2629489.
[51] M. D. Wilkinson, M. Dumontier, I. J. Aalbersberg, G. Appleton, M. Axton, A. Baak, N. Blomberg, J.-W. Boiten, L. B. da Silva Santos, P. E. Bourne, et al. The FAIR guiding principles for scientific data management and stewardship. Scientific Data, 3, 2016. DOI: 10.1038/sdata.2016.18.
[52] Y. Yamamoto, A. Yamaguchi, and A. Splendiani. Umaka-Yummy Data: A Place to Facilitate Communication between Data Providers and Consumers. In International Conference on Semantic Web Applications and Tools for Life Sciences (SWAT4LS), volume 1795 of CEUR Workshop Proceedings, 2016.