The Web of Data

NISO Virtual Conference, 19 February 2014
Ralph Swick, W3C

Agenda

• Data is changing our lives
• W3C’s traditional focus
• Expanding scope of W3C’s data activities

Web has transformed our relation to computers and to data

• A computer in every pocket
• Apps leveraging context
  – geolocation and other sensors
  – social context (“I’m at the conference, too!”)
• Change in the use of search
  – people search for answers, not sites
  – answers from aggregated data (Siri, Google Now, Wolfram Alpha)

Apps are using data from many sources

• Social networking
• Mobile devices
• Sensors
• Open data

Imagine…

• A “Web” where
  – documents are available for download on the Internet
  – but there would be no hyperlinks among them

Data on the Web is not enough…

• We need a proper infrastructure for a real Web of Data where:
  – data are available on the Web
    • accessible via standard Web technologies
  – data are interlinked over the Web
  – data can be integrated over the Web
• This is

Agenda

• Data is changing our lives
• W3C’s traditional focus
• Expanding scope of W3C’s data activities

Core

• RDF: data model
• RDF Schema: vocabulary design
• RDB2RDF: relational DB export
• SPARQL: query
• SKOS: vocabulary description
• OWL: ontological inference
• RIF: rules interchange
• LDP: read-write Web of Data
• POWDER: description resources
• GRDDL: app-specific XML

Need for RDF schemas

• First step towards the “extra knowledge”:
  – define the terms we can use
  – what restrictions apply
  – what extra relationships are there?
• “RDF Vocabulary Description Language”
  – the term “Schema” is retained for historical reasons…

Vocabularies

• There is a need for “languages”
  – to define such vocabularies
  – to assign clear “semantics” on how new relationships can be deduced

SKOS

• SKOS provides a simple bridge between the “print world” and the (Semantic) Web
• Thesauri, glossaries, etc., from the library community can be made available
• SKOS can also be used to organize, e.g., tags, annotate other vocabularies, …

Semantic Web/Linked Data Today

• Standards are mature
  – some level of maintenance work is always needed
• Server-side applications dominate
• Commercial applications exist, e.g.:
  – direct integration/usage of linked data on the Web
  – consumption of other formats converted internally to a common format (RDF)

Challenge: leverage data in interoperable apps

• Public, private, behind enterprise firewalls
• From informal to highly curated
• From machine readable to human readable
  – HTML tables, Twitter feeds, local vocabularies, spreadsheets, …
• Expressed in diverse data models
  – tree, graph, table, …
• Serialized in many ways
  – XML, CSV, RDF, PDF, JSON, HTML tables, …
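The challenge above, data arriving in different serializations, can be sketched in a few lines. The following is a minimal illustration, not a reference implementation: it assumes a hypothetical base URI (`http://example.org/city/`), invented sample CSV and JSON inputs, and plain Python tuples standing in for RDF triples. It shows how two sources that agree on a URI for a thing can be merged into one common graph.

```python
import csv
import io
import json

# Hypothetical base URI and sample inputs, invented for this sketch.
BASE = "http://example.org/city/"
CSV_TEXT = "name,country\nKolkata,India\nBoston,United States\n"
JSON_TEXT = '[{"name": "Kolkata", "sameAs": "http://dbpedia.org/resource/Kolkata/"}]'


def triples_from_csv(text):
    """Yield one (subject, predicate, object) triple per non-key CSV cell."""
    for row in csv.DictReader(io.StringIO(text)):
        subject = BASE + row["name"]
        for column, value in row.items():
            if column != "name":
                yield (subject, column, value)


def triples_from_json(text):
    """Yield one triple per non-key field of each JSON record."""
    for record in json.loads(text):
        subject = BASE + record["name"]
        for key, value in record.items():
            if key != "name":
                yield (subject, key, value)


# Both sources name the city with the same URI, so their facts merge.
graph = set(triples_from_csv(CSV_TEXT)) | set(triples_from_json(JSON_TEXT))
kolkata = {p: o for s, p, o in graph if s == BASE + "Kolkata"}
print(kolkata)
```

The design point is the shared subject URI: neither converter knows about the other, yet the merged set answers questions that neither source could answer alone.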

The Linking Open Data Project

Linked Data Principles

Is your data 5 Star?

★ Available on the Web in some format (i.e., use a URI to access the data)

★★ Available as machine-readable structured data (e.g., Excel instead of an image scan)

★★★ As before, but using a non-proprietary format (e.g., CSV instead of Excel)

★★★★ All the above, plus use open standards (RDF & Co.) to identify things, so that people can point at your stuff

★★★★★ All the above, plus link your data to other people’s data to provide context

A Three Star Example

The importance of Linked Data

• Provide a core set of data that applications can build on
  – stable references for “things”
    • e.g., http://dbpedia.org/resource/Kolkata/
  – many, many relationships that applications may reuse
  – a “nucleus” for a larger, semantically enabled Web!

Linked Data Platform (LDP)

• Define an HTTP/RESTful infrastructure to publish, read, write, or modify linked data
  – typical usage: data-intensive application in a browser, application integration using shared data…
• The infrastructure should be easy to implement and install
  – provides an “entry point” for Linked Data applications!
• The work is nearing completion

RDF with HTML: RDFa

• By adding some “meta” information, the same source can be reused
  – typical example: your personal information, like address, should be readable for humans and processable by machines
• Some solutions have emerged:
  – add extra statements in microdata or RDFa that can be converted to RDF
    • microdata can be used for a (useful) subset of RDF
    • RDFa is, essentially, a complete serialization of RDF

schema.org

• Schema.org is a cooperation of search engines (Bing, Google, Yahoo!, and Yandex)
• It is a large vocabulary that they all understand
• The terms are extracted from HTML5+microdata or HTML5+RDFa
  – the various partners use it for different purposes
  – it can be used by anyone outside of the search world!
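To make the idea of terms extracted from HTML5+RDFa concrete, here is a rough sketch using only Python's standard `html.parser`. The markup sample and the `PropertyCollector` class are invented for illustration; the attribute names (`vocab`, `typeof`, `property`) are RDFa's and the terms `Person`, `name`, and `affiliation` are schema.org's. This is deliberately not a conforming RDFa processor: it ignores nesting, prefixes, and `content` attributes.

```python
from html.parser import HTMLParser

# Sample markup invented for illustration; vocabulary terms are schema.org's.
HTML = """
<div vocab="http://schema.org/" typeof="Person">
  <span property="name">Ralph Swick</span>
  <span property="affiliation">W3C</span>
</div>
"""


class PropertyCollector(HTMLParser):
    """Collect (property, text) pairs; NOT a conforming RDFa processor."""

    def __init__(self):
        super().__init__()
        self.item_type = None
        self.pairs = []
        self._current = None  # property whose text content we are waiting for

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "typeof" in attrs:
            self.item_type = attrs["typeof"]
        if "property" in attrs:
            self._current = attrs["property"]

    def handle_data(self, data):
        # Attach the first non-blank text after a property attribute to it.
        if self._current and data.strip():
            self.pairs.append((self._current, data.strip()))
            self._current = None

    def handle_endtag(self, tag):
        self._current = None


collector = PropertyCollector()
collector.feed(HTML)
print(collector.item_type, collector.pairs)
```

The same page stays readable for humans while a consumer, whether a search engine or anyone else, recovers the structured statements from it.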

Some things to remember when you publish data

• Publish your data first, do user interfaces later!
  – the “raw data” can become useful in its own right and others may use it
  – you can add your added value later by providing nice user access
• If possible, publish your data in RDF, but if you cannot, others may help you in conversions
  – trust the community…
• Add links to other data. “Just” publishing isn’t enough…

Some things to remember when you publish data (2)

• Think about persistence and versioning
  – others may depend on the data you publish…
• Be thoughtful about the URIs you choose
• Try to avoid reinventing the wheel when choosing vocabularies

Some things to remember when you publish data (3)

• Document your data, i.e., provide metadata
  – there are vocabularies to do this
    • Data Catalog Vocabulary (DCAT)
    • Vocabulary of Interlinked Datasets (VoID)
    • DCTERMS
    • vocabularies for licensing (Open Data Commons, government licenses)
  – this area is still very much in development…
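As one possible illustration of documenting a dataset with DCAT and DCTERMS, the sketch below builds a small Turtle description as a string. The two namespace URIs are the published ones for DCAT and DCTERMS; the function name `describe_dataset`, every `example.org` URI, and the title are placeholders invented for this example. It assumes the title contains no double quotes, since no escaping is done.

```python
# Namespace URIs are the published DCAT and DCTERMS namespaces; all
# example.org URIs and the title below are placeholders for this sketch.
PREFIXES = {
    "dcat": "http://www.w3.org/ns/dcat#",
    "dct": "http://purl.org/dc/terms/",
}


def describe_dataset(uri, title, license_uri, download_url):
    """Return a small Turtle document describing one dataset.

    Assumes `title` contains no double quotes (no escaping is performed).
    """
    lines = [f"@prefix {p}: <{ns}> ." for p, ns in PREFIXES.items()]
    lines += [
        "",
        f"<{uri}> a dcat:Dataset ;",
        f'    dct:title "{title}" ;',
        f"    dct:license <{license_uri}> ;",
        "    dcat:distribution [",
        "        a dcat:Distribution ;",
        f"        dcat:downloadURL <{download_url}>",
        "    ] .",
    ]
    return "\n".join(lines)


turtle = describe_dataset(
    "http://example.org/dataset/cities",
    "Example city list",
    "http://example.org/licenses/open",
    "http://example.org/dataset/cities.csv",
)
print(turtle)
```

A real publisher would more likely use an RDF library and a real license URI; the point here is only that title, license, and download location travel with the data in a form other people's tools can read.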

Agenda

• Data is changing our lives
• W3C’s work on data integration
• Expanding scope of W3C’s data activities

New work underway

• CSV on the Web

• Data on the Web Best Practices

• Vocabulary management

What we are hearing

• CSV is everywhere
  – can be huge data sets, not easily readable in a spreadsheet or Google Refine
  – meaning of data not in machine-readable form
  – data is not necessarily used for web-scale integration but rather immediate usage
• Metadata is essential
• Conversion is an issue
• European Commission Study on business models for Linked Open Government Data (BM4LOGD)
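The CSV points above (meaning not machine-readable, metadata essential, conversion an issue) can be illustrated with a small sketch. The `METADATA` dictionary layout, the `annotate` function, the `example.org` URIs, and the sample sensor readings are all invented for this example; this is not a W3C metadata format, only a hint at what a description of columns kept alongside a CSV file makes possible.

```python
import csv
import io

# Hypothetical column metadata: this layout is invented for the sketch and
# is not a W3C format. All URIs use a placeholder namespace.
METADATA = {
    "key": "id",
    "base": "http://example.org/sensor/",
    "columns": {
        "reading": {"property": "http://example.org/def/reading", "type": float},
        "unit": {"property": "http://example.org/def/unit", "type": str},
    },
}

# Invented sample data.
CSV_TEXT = "id,reading,unit\ns1,21.5,C\ns2,19.0,C\n"


def annotate(text, metadata):
    """Turn CSV rows into typed (subject, property, value) triples."""
    triples = []
    for row in csv.DictReader(io.StringIO(text)):
        subject = metadata["base"] + row[metadata["key"]]
        for name, spec in metadata["columns"].items():
            # The metadata supplies both the property URI and the datatype.
            triples.append((subject, spec["property"], spec["type"](row[name])))
    return triples


triples = annotate(CSV_TEXT, METADATA)
for triple in triples:
    print(triple)
```

Without the metadata, `21.5` is just a string in a column called `reading`; with it, a consumer knows which property the value expresses and how to parse it.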

Linked Data Benefits (BM4LOGD)

• Flexible data integration
  – Streamlined internal processes
  – Where working relationships already exist, much easier to share
  – Linking reference collections; discovery of new relationships
• Increase in data quality
  – More use of data internally brings errors to light
  – Use of open standards increases quality of system
• New services
• Cost reduction
  – Increased efficiency
  – Increase in data usage due to LOD enrichment

CSV on the Web

• How W3C can help
  – metadata vocabulary to describe CSV data (structure, reference to access rights, annotations, etc.)
  – metadata discovery (e.g., part of an HTTP header, special rows and columns, packaging formats…)
  – mapping content to RDF, JSON, XML

Best practices

• Document best practices for the data publishers
  – URI design, management of persistence, versioning
  – business models
  – use of core metadata vocabularies (provenance, access control, ownership)
• Specific vocabularies
  – quality, application descriptions, …

Vocabulary management: challenge

• Interoperable vocabularies are key for (meta)data
• At the moment, it is a fairly chaotic world…
  – many, possibly overlapping vocabularies
  – difficult to locate the one that is needed
  – vocabularies may not be properly managed, maintained, versioned, provided persistence…

Vocabulary management: how W3C can help

• Provide a space where communities can develop vocabularies (through, e.g., CGs, possibly WGs)
• Host vocabularies at W3C if requested
• Annotate vocabularies with a proper set of metadata terms
• Establish a vocabulary directory
• The exact structure is still being discussed

Summary

• Data-driven smart apps are one of the major growth engines for the worldwide software market.
• We need to meet developers where they are.
• 5 Star Benefits of LOD
  – Greater efficiency, better provision of the task
  – Greater flexibility leads to lower costs for future projects
  – New services, new connections, new discoveries
  – Improved navigation within and between datasets
  – Others can build apps based on your data

Available specifications: Primers, Guides

• Primers:
  – RDF Primer
  – OWL Guide
  – SKOS Primer
  – GRDDL Primer
  – RDFa Primer
• The W3C Semantic Web Activity Wiki has links to all the specifications

These slides are on the Web at

http://www.w3.org/2014/Talks/0219-NISO-RRS

with thanks to Ivan Herman, W3C and Phil Archer, W3C