Learning SPARQL, 2E

SECOND EDITION
Learning SPARQL
Querying and Updating with SPARQL 1.1

Bob DuCharme

Free chapters, compliments of MarkLogic.

This excerpt contains Chapters 2 and 7 of the book Learning SPARQL, Second Edition. The complete book is available at oreilly.com and through other retailers.

Learning SPARQL, Second Edition
by Bob DuCharme

Copyright © 2013 O’Reilly Media. All rights reserved.
Printed in the United States of America.
Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://oreilly.com/safari). For more information, contact our corporate/institutional sales department: 800-998-9938 or [email protected].

Editors: Simon St. Laurent and Meghan Blanchette
Indexer: Bob DuCharme
Production Editor: Kristen Borg
Interior Designer: David Futato
Proofreader: Amanda Kersey
Cover Designer: Randy Comer
Illustrator: Rebecca Demarest

August 2013: Second Edition

Revision History for the Second Edition
2013-06-27: First Release

See http://oreilly.com/catalog/errata.csp?isbn=9781449371432 for release details.

Nutshell Handbook, the Nutshell Handbook logo, and the O’Reilly logo are registered trademarks of O’Reilly Media, Inc. Learning SPARQL, the image of an anglerfish and related trade dress are trademarks of O’Reilly Media, Inc.
Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in this book, and O’Reilly Media, Inc., was aware of a trademark claim, the designations have been printed in caps or initial caps.

While every precaution has been taken in the preparation of this book, the publisher and author assume no responsibility for errors or omissions, or for damages resulting from the use of the information contained herein.

978-1-449-37143-2 [LSI] 1372271958

Table of Contents

1. The Semantic Web, RDF, and Linked Data (and SPARQL)
    What Exactly Is the “Semantic Web”?
    URLs, URIs, IRIs, and Namespaces
    The Resource Description Framework (RDF)
    Storing RDF in Files
    Storing RDF in Databases
    Data Typing
    Making RDF More Readable with Language Tags and Labels
    Blank Nodes and Why They’re Useful
    Named Graphs
    Reusing and Creating Vocabularies: RDF Schema and OWL
    Linked Data
    SPARQL’s Past, Present, and Future
    The SPARQL Specifications
    Summary

2. Query Efficiency and Debugging
    Efficiency Inside the WHERE Clause
    Reduce the Search Space
    OPTIONAL Is Very Optional
    Triple Pattern Order Matters
    FILTERs: Where and What
    Property Paths Can Be Expensive
    Efficiency Outside the WHERE Clause
    Debugging
    Manual Debugging
    SPARQL Algebra
    Debugging Tools
    Summary

CHAPTER 1
The Semantic Web, RDF, and Linked Data (and SPARQL)

The SPARQL query language is for data that follows a particular model, but the semantic web isn’t about the query language or about the model; it’s about the data. The booming amount of data becoming available on the semantic web is making great new kinds of applications possible, and as a well-implemented, mature standard designed with the semantic web in mind, SPARQL is the best way to get that data and put it to work in your applications.
The flexibility of the RDF data model means that it’s being used more and more with projects that have nothing to do with the “semantic web” other than their use of technology that uses these standards, which is why you’ll often see references to “semantic web technology.”

What Exactly Is the “Semantic Web”?

As excitement over the semantic web grows, some vendors use the phrase to sell products with strong connections to the ideas behind the semantic web, and others use it to sell products with weaker connections. This can be confusing for people trying to understand the semantic web landscape.

I like to define the semantic web as a set of standards and best practices for sharing data and the semantics of that data over the Web for use by applications. Let’s look at this definition one or two phrases at a time, and then we’ll look at these issues in more detail.

A set of standards

Before Tim Berners-Lee invented the World Wide Web, more powerful hypertext systems were available, but he built his around simple specifications that he published as public standards. This made it possible for people to implement his system on their own (that is, to write their own web servers, web browsers, and especially web pages), and his system grew to become the biggest hypertext system ever. Berners-Lee founded the W3C to oversee these standards, and the semantic web is also built on W3C standards: the RDF data model, the SPARQL query language, and the RDF Schema and OWL standards for storing vocabularies and ontologies. A product or project may deal with semantics, but if it doesn’t use these standards, it can’t connect to and be part of the semantic web any more than a 1985 hypertext system could link to a page on the World Wide Web without using the HTML or HTTP standards. (There are those who disagree on this last point.)

best practices for sharing data... over the Web for use by applications

Berners-Lee’s original web was designed to deliver human-readable documents.
If you want to fly from one airport to another next Sunday afternoon, you can go to an airline website, fill out a query form, and then read the query results off the screen with your eyes. Airline comparison sites have programs that retrieve web pages from multiple airline sites and extract the information they need, in a process known as “screen scraping,” before using the data for their own web pages. Before writing such a program, a developer at the airline comparison website must analyze the HTML structure of each airline’s website to determine where the screen-scraping program should look for the data it needs. If one airline redesigns its website, the developer must update the screen-scraping program to account for the changes.

Berners-Lee came up with the idea of Linked Data as a set of best practices for sharing data across the web infrastructure so that applications can more easily retrieve data from public sites with no need for screen scraping, for example to let your calendar program get flight information from multiple airline websites in a common, machine-readable format. These best practices recommend the use of URIs to name things and the use of standards such as RDF and SPARQL. They provide excellent guidelines for the creation of an infrastructure for the semantic web.

and the semantics of that data

The idea of “semantics” is often defined as “the meaning of words.” Linked Data principles and the related standards make it easier to share data, and the use of URIs can provide a bit of semantics by providing the context of a term. For example, even if I don’t know what “sh98003588#concept” refers to, I can see from the URI http://id.loc.gov/authorities/sh98003588#concept that it comes from the US Library of Congress.
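
As a rough illustration of how a URI carries context, the following Python sketch splits the Library of Congress URI mentioned above into its parts using the standard library. The function name uri_context and the dictionary keys are hypothetical names chosen for this example, not part of any RDF or Linked Data standard.

```python
from urllib.parse import urlsplit


def uri_context(uri: str) -> dict:
    """Split a URI into the parts that hint at where a term comes from.

    uri_context is a hypothetical helper written for this illustration.
    """
    parts = urlsplit(uri)
    return {
        "host": parts.netloc,        # the organization publishing the identifier
        "path": parts.path,          # where the term sits on that host
        "fragment": parts.fragment,  # the local name after the '#'
    }


context = uri_context("http://id.loc.gov/authorities/sh98003588#concept")
print(context["host"])      # id.loc.gov
print(context["fragment"])  # concept
```

Even without knowing what the identifier means, a program (or a person) can read the host name and see which organization stands behind the term.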
Storing the complete meaning of words so that computers can “understand” these meanings may be asking too much of current computers, but the W3C Web Ontology Language (also known as OWL) already lets us store valuable bits of meaning so that we can get more out of our data. For example, when we know that the term “spouse” is symmetric (that is, that if A is the spouse of B, then B is the spouse of A), or that zip codes are a subset of postal codes, or that “sell” is the opposite of “buy,” we know more about the resources that have these properties and the relationships between these resources.

Let’s look at these components of the semantic web in more detail.

URLs, URIs, IRIs, and Namespaces

When Berners-Lee invented the Web, along with writing the first web server and browser, he developed specifications for three things so that all the servers and browsers could work together:

• A way to represent document structure, so that a browser would know which parts of a document were paragraphs, which were headers, which were links, and so forth. This specification is the Hypertext Markup Language, or HTML.
• A way for client programs such as web browsers and servers to communicate with each other. The Hypertext Transfer Protocol, or HTTP, consists of a few short commands and three-digit codes that essentially let a client program such as a web browser say things like “Hey www.learningsparql.com server, send me the index.html file from the resources directory!” They also let the server say “OK, here you go!” or “Sorry, I don’t know about that resource.” We’ll learn more about HTTP in “SPARQL and HTTP.”
• A compact way for the client to specify which resource it wants: for example, the name of a file, the directory where it’s stored, and the server that has that file system.
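
To make the HTTP exchange described in the list above more concrete, here is a minimal Python sketch that composes the text of a GET request and translates three-digit status codes into their standard phrases. It does not contact any server. The URL is an assumption assembled from the server, directory, and file named in the text, and build_get_request and describe_status are hypothetical helper names.

```python
from http import HTTPStatus
from urllib.parse import urlsplit


def build_get_request(url: str) -> str:
    """Compose the text of a minimal HTTP/1.1 GET request for a URL."""
    parts = urlsplit(url)
    path = parts.path or "/"  # an empty path means the server's root resource
    return (
        f"GET {path} HTTP/1.1\r\n"
        f"Host: {parts.netloc}\r\n"
        "Connection: close\r\n"
        "\r\n"
    )


def describe_status(status_line: str) -> str:
    """Turn a status line such as 'HTTP/1.1 404 Not Found' into 'code: phrase'."""
    code = int(status_line.split()[1])
    return f"{code}: {HTTPStatus(code).phrase}"


# Assumed URL: the server, directory, and file mentioned in the text.
request = build_get_request("http://www.learningsparql.com/resources/index.html")
print(request.splitlines()[0])                    # GET /resources/index.html HTTP/1.1
print(describe_status("HTTP/1.1 200 OK"))         # 200: OK
print(describe_status("HTTP/1.1 404 Not Found"))  # 404: Not Found
```

The 200 and 404 codes correspond roughly to the server saying “OK, here you go!” and “Sorry, I don’t know about that resource.”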
