SQL, XQuery, and SPARQL: Making the Picture Prettier

Jim Melton, Copyright © 2007 Oracle, [email protected]

Introduction

Last year, we asked “what’s wrong with this picture?” regarding the existence of three apparently overlapping query languages: SQL, XQuery, and SPARQL. Our somewhat reluctant answer to the question was that there was essentially nothing wrong because each of the three languages (and their corresponding data models) served specific purposes better than the two alternatives. This year, our research has been aimed at “making the picture prettier” – that is, accepting our earlier conclusions and finding practical ways to make the situation work well at minimal development costs.

In early 2006, the World Wide Web Consortium (W3C) published three Candidate Recommendation documents, [SPARQL-L], [SPARQL-P], and [SPARQL-R], defining a new language called SPARQL. That new language was described as “a query language for getting information from…RDF graphs” (that is, SPARQL is an RDF query language), which seemed on the surface to be a new technology requirement. Comments raised during the Candidate Recommendation review period resulted in the W3C’s Data Access Working Group (DAWG) reverting [SPARQL-L] to Working Draft status for additional work. Recently, the revised specification [SPARQL-L2] was advanced to Last Call Working Draft status, while the other two specifications have been held in the Candidate Recommendation stage awaiting progression of [SPARQL-L2] to Candidate Recommendation.

Last year, we acknowledged that SPARQL’s existence is justified, but we also identified some areas in which additional research was required before it could be said whether or not practical integration with SQL and/or XQuery was likely. The present paper addresses that subject further. In particular, we indicate how existing investment in persistence technology can be applied to RDF data and to implementing the SPARQL language.

Data Model Integration

Query languages are designed to be applied to data represented in a particular data model. SQL is used to retrieve, create, modify, and delete data represented in (a variation of) the relational model of data. XQuery is used to locate and retrieve data that is represented in the XPath data model, XDM [XDM]. (The ability to update such data is expected to be provided in early 2008.)

Our vision is of a world in which applications can query data that is provided in the SQL/relational model, in the XPath Data Model, and in RDF, preferably in a single query expression. This implies that SQL statements must be able to access XML data and RDF data, that XQuery expressions be able to access SQL data and RDF data, and that SPARQL queries be able to access SQL data and XML data. Achieving that vision requires a significant amount of infrastructure.

We’ve long known that one language can be used to query data represented in a data model other than that for which the language was designed by mapping the data from its native data model into the query language’s data model. An important example is SQL/XML [SQL/XML], which allows relational data to be published in an XML form (that is, as an XPath data model instance) that can then be queried using XQuery. SQL/XML also provides a facility (XMLTABLE) that allows XML to be treated as though it were SQL data. Such mappings naturally run into the famous “impedance mismatch” caused by factors such as the collections of data types differing amongst query languages and their corresponding data models.
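The idea of mapping data from one model into another so that a different language can query it can be sketched in a few lines. The following is only an analogy, not SQL/XML itself: it uses Python's sqlite3 and xml.etree.ElementTree modules, and the employees table, its columns, and the publish_as_xml helper are hypothetical names chosen for illustration. Note how every column value becomes text in the XML view, a small instance of the type mismatch mentioned above.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical relational data; table and column names are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (id INTEGER, name TEXT, dept TEXT)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [(1, "Ada", "Research"), (2, "Grace", "Operations"), (3, "Edgar", "Research")],
)

def publish_as_xml(connection):
    """Map each row to an <employee> element with one child element per column."""
    root = ET.Element("employees")
    cursor = connection.execute("SELECT id, name, dept FROM employees ORDER BY id")
    columns = [d[0] for d in cursor.description]
    for row in cursor:
        emp = ET.SubElement(root, "employee")
        for col, value in zip(columns, row):
            # Every SQL value is flattened to text here (INTEGER included).
            ET.SubElement(emp, col).text = str(value)
    return root

tree = publish_as_xml(conn)

# Query the XML view with a path expression instead of SQL.
research_names = [e.findtext("name")
                  for e in tree.findall("./employee[dept='Research']")]
print(research_names)
```

A real SQL/XML implementation performs this mapping inside the database engine and preserves type information through XML Schema types; the sketch only shows the direction of the mapping.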

RDF is presented in [RDF-C] as yet another data model – a graph data model – distinct from the XPath tree-structured data model and from the SQL “flat table” data model. It is tempting to reject that assertion because of the tuple nature of RDF entities. However, a close examination of [RDF-C] shows subtle – but important – differences between collections of RDF triples and multisets of rows in SQL tables of three columns.

For example, SQL tables are defined to comprise one or more columns, each having a particular declared data type (such as INTEGER, TIMESTAMP, or some user-defined type). Every row in such a table has exactly that number of columns, and the value of each column in each such row must be of the column’s declared type. (Values of user-defined type columns may have a most-specific type that is a subtype of that user-defined type, which is a concept that doesn’t apply to columns of SQL’s built-in types.) In addition, all of SQL’s metadata is essentially structural metadata – that is, metadata about the structure of the various tables, about the data types of columns, and so forth – and not semantic metadata, information that actually describes the meaning of the SQL data. In the SQL model, the data types of columns are captured in various system tables, but very little information about the relationships of those data types is derivable from the system tables. Of course, information from those system tables can be combined with the data in the tables themselves, although the criteria through which such combinations would be meaningful are far from clear. By contrast, a given RDF collection can be augmented by RDF triples expressed using RDF Schema [RDF-S] and OWL [OWL-L] constructs that specify the class to which a given RDF entity belongs. Last year, we investigated whether the use of SQL’s user-defined types might offer some way to map such class information from RDF into the SQL model, but the results were discouraging and we have abandoned that line of research.

Another important difference arises from the relationships between the metadata associated with each model and the data available under that model. In the SQL environment, data literally cannot exist without metadata – the schema. The two are inseparable in theory and in practice. However, in both the XPath data model and in the RDF data model, data may exist independent of any schema describing that data. While the absence of a schema may limit the ways in which the data can be interpreted, it is possible to build XML documents and RDF collections without any schema that describes them. On the World Wide Web, this distinction is especially important, because, unlike in the closed world of a system, it is impossible for there to be a central point of control at which such metadata can be created...and enforced.

Persistence Models

The first commercial “relational” database management systems began to appear about 25 years ago. At the time, data management was dominated by CODASYL and other “network” DBMSs. The conventional wisdom at the time said that the new low-performance, small-volume systems didn’t have a chance against the established base. But the separation of data model from persistence model proved to offer incredible versatility and opportunity for tremendous performance, manageability, scalability, and amount of data.

Since then, there have been a number of database system innovations that were hoped to overtake relational systems in the marketplace, such as object-oriented database systems (OODBMS) and the so-called “native XML” database systems. To date, none have succeeded in doing so (although many of them have found secure niche markets with unique requirements). Instead, the implementers of relational systems have co-opted the new forms of data. The advent of object-relational systems (ORDBMS) responded to the majority of the requirements that led to the development of OODBMSs, and it appears at present that those systems have been successfully extended to handle XML data (XORDBMS) for the large majority of application environments.

What, then, should be done about RDF data? RDF, as stated above, defines a graph data model. An important question to consider is this: Are the persistence requirements for graph data models so unique that the persistence engines that have served so well for relational data, object-oriented data, and XML data must be avoided? Or do those engines have sufficient flexibility that they can be used successfully for persisting and managing RDF data, too?

We considered the possibility of creating a native RDF storage engine to deal with the graph nature of the RDF data. Such an engine would, no doubt, have some similarities to the CODASYL and other network DBMSs of the ’70s.
While we realized that there might be some advantages to this approach, we are also highly aware that relational storage engines easily overcame any perceived advantages to such “pointer-based” systems. Furthermore, development of new database storage engines is burdened by the immense amount of infrastructure required to provide the truly “industrial-strength” capabilities that existing relational engines already provide. There would have to be truly overwhelming advantages to a new storage model to justify expenditures comparable to those (literally billions of dollars) that led to the dominance of relational engines today.

We firmly believe that the storage technologies that underlie the successful relational systems of which we are aware – including commercial implementations and open source implementations – are completely adequate for RDF data management. In fact, because the nature of RDF is collections of 3-tuples, we are convinced that there is no need at all for a native RDF storage environment and that relational systems are ideally suited for the job.

Having reached that conclusion, we were next faced with a somewhat higher-level decision: Should there be a new “native” RDF data type defined for relational systems, as was done for XML data? The choice here was not so obvious, as there are advantages to defining a new data type and advantages to eschewing a new type in favor of ordinary table/column/row representations. The advantages to defining a new RDF data type include building in the ability to guarantee that the RDF data model – directed graphs – is always followed, but there is a significant disadvantage to face: the enhancements to SQL required to deal with such a new data type would complicate the language considerably. By contrast, using ordinary table representations for RDF data minimizes (or eliminates) any necessary changes to SQL, but it does require some care to ensure that the requirements of the RDF graph model are not violated. Our decision was easy to make; we chose to use a table representation for RDF data to create a persistent store for that data.

But that raised another question to answer! Should we depend on application developers and users to write SQL queries to interrogate their RDF data?
As we demonstrated last year, the SQL syntax required is not always obvious. Worse, as we’ll demonstrate in this paper, the SQL can become very complex very quickly, especially when the nature of queries depends on the relationships between data more than on actual data values. It became clear to us that SPARQL was a much preferable language for querying RDF. Consequently, we made the choice to use SPARQL for querying RDF stored in ordinary SQL tables and columns. However, to minimize development costs and complexities, we also chose not to develop a native SPARQL language layer atop our existing storage engines, but to transform SPARQL into a language already understood by those engines: SQL.

The next problem we faced was deciding the details of that table representation. The RDF data model is, as we observed earlier in this paper, a directed graph of nodes connected by labeled edges. The graph is frequently, and conveniently, represented textually as a collection of 3-tuples comprising a subject, a predicate, and an object. Predicates are always represented as URIs, while objects may be represented by URIs or by literal values. In theory, subjects may also be represented by URIs or by literal values, but the current definition of RDF limits them to URIs. Both subjects and objects may also be represented by blank nodes, each of which is uniquely identified within a given RDF graph.

Naturally, we thought, these collections of 3-tuples would best be handled in SQL tables of three columns. However, the differences between URIs and literal values gave us some pause, especially since URIs are, in effect, “pointers” (usually, but not always, to other nodes in the graph). We chose to address this issue by creating tables of four columns instead of three. One column is, of course, used as the subject of each triple and a second is used as the predicate. The other two capture the object of the triple – one is used when the object is a URI and the other when the object is a literal value. Of course, in SQL tables, every value of a column must be of the same data type, while RDF literals can be of many data types. We adopted the approach that RDF itself uses for typed literals – the lexical representation of the literal value accompanied by a data type specification (a URI as defined by XML Schema). We also postulated a sort of “referential integrity” facility on the SQL tables to help enforce the graph model of RDF:

    CREATE TABLE RDF_STORE (
        SUBJECT     VARCHAR(2000),
        VERB        VARCHAR(2000),
        OBJECT_URI  VARCHAR(2000),
        OBJECT_LIT  VARCHAR(2000),
        CHECK ( OBJECT_URI IS NULL
                OR OBJECT_URI IN ( SELECT SUBJECT FROM RDF_STORE ) ),
        ...
        CHECK ( VERB IS NULL
                OR VERB IN ( SELECT SUBJECT FROM RDF_STORE ) ),
        CHECK ( ( OBJECT_URI IS NOT NULL AND OBJECT_LIT IS NULL )
                OR ( OBJECT_LIT IS NOT NULL AND OBJECT_URI IS NULL ) )
    )

However, it was pointed out to us that RDF does not require that subject, predicate, or object URIs resolve to anything represented in the graph! Because of that fact, referential integrity is not needed (and, indeed, would violate the RDF data model).
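To make that observation concrete, here is a minimal sketch of the four-column layout with the two “referential integrity” checks dropped and only the “exactly one object column” check kept. It runs against SQLite through Python's sqlite3 module purely as a convenient stand-in engine; the triple_to_row and insert_rejected helpers and all example URIs are hypothetical and not part of the design described in this paper.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Four-column layout from the text. Only the "exactly one object column" check
# is kept, since RDF does not require URIs to resolve to nodes in the graph.
conn.execute("""
    CREATE TABLE RDF_STORE (
        SUBJECT    VARCHAR(2000),
        VERB       VARCHAR(2000),
        OBJECT_URI VARCHAR(2000),
        OBJECT_LIT VARCHAR(2000),
        CHECK ( ( OBJECT_URI IS NOT NULL AND OBJECT_LIT IS NULL )
             OR ( OBJECT_LIT IS NOT NULL AND OBJECT_URI IS NULL ) )
    )
""")

def triple_to_row(subject, verb, obj, obj_is_uri):
    """Hypothetical helper: route the object into the URI or the literal column."""
    return (subject, verb, obj if obj_is_uri else None, None if obj_is_uri else obj)

rows = [
    # This object URI never appears as a subject in the store; RDF allows that.
    triple_to_row("http://example.com/book1", "http://example.com/author",
                  "http://example.com/people/p1", True),
    triple_to_row("http://example.com/book1", "http://example.com/title",
                  "An Example Title", False),
]
conn.executemany("INSERT INTO RDF_STORE VALUES (?, ?, ?, ?)", rows)

def insert_rejected(row):
    """Return True if the CHECK constraint refuses the row."""
    try:
        conn.execute("INSERT INTO RDF_STORE VALUES (?, ?, ?, ?)", row)
        return False
    except sqlite3.IntegrityError:
        return True

both_set = ("http://example.com/book1", "http://example.com/x",
            "http://example.com/y", "a literal")
stored = conn.execute("SELECT COUNT(*) FROM RDF_STORE").fetchone()[0]
print(stored, insert_rejected(both_set))
```

The first row is accepted even though its object URI is not the subject of any stored triple, which is exactly the case the subquery-based checks would have rejected.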

SPARQL to SQL

Once the conclusion was reached that a relational engine was appropriate for persisting RDF graphs, that SPARQL was the appropriate language to use for querying that RDF, and that the development costs of building a complete SPARQL stack were prohibitive, we investigated exactly how we would deal with the problem of executing SPARQL code in an SQL engine. We describe our solution as though we were literally transforming SPARQL syntax into SQL syntax, but we do not expect any practical implementation to follow that paradigm; instead, implementations will transform SPARQL syntax into the same underlying execution trees/code used for compiling and executing SQL code. We made several assumptions as we began solving this problem:

• RDF graphs will be stored in tables of three columns (nothing in this work conflicts with the four-column design discussed above).
• The primary key of each such table comprises all three (four) columns.
• Blank nodes are represented by (generated) blank node identifiers.
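These assumptions can be illustrated with a small sketch, again using SQLite via Python's sqlite3 module as a stand-in engine. The table name GraphTable and its column names follow the query shown later in this paper; the UUID-based blank node scheme is only an assumption made for this example, since the actual identifier algorithm is not described here.

```python
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
# Three-column graph table whose primary key spans every column, so the
# table holds a set of triples rather than a multiset.
conn.execute("""
    CREATE TABLE GraphTable (
        Subject VARCHAR(2000) NOT NULL,
        Verb    VARCHAR(2000) NOT NULL,
        Object  VARCHAR(2000) NOT NULL,
        PRIMARY KEY (Subject, Verb, Object)
    )
""")

def new_blank_node_id():
    """Assumed scheme: a random UUID keeps identifiers distinct across graphs."""
    return "_:b" + uuid.uuid4().hex

b = new_blank_node_id()
triples = [
    ("http://example.com/a", "http://example.com/v", b),
    ("http://example.com/a", "http://example.com/v", b),  # exact duplicate
    (b, "http://example.com/name", "some literal"),
]
# INSERT OR IGNORE is SQLite-specific syntax; the key absorbs the duplicate.
conn.executemany("INSERT OR IGNORE INTO GraphTable VALUES (?, ?, ?)", triples)
count = conn.execute("SELECT COUNT(*) FROM GraphTable").fetchone()[0]
print(count)
```

With the all-column key in place, loading the same triple twice leaves a single row, which matches the set semantics of an RDF graph.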

The transformation from SPARQL into SQL involves a number of prerequisites.

• We chose to represent SPARQL syntax as parse trees, which significantly simplified several other aspects of the design.
• We created a simple algorithm for producing (globally!) unique blank node identifiers.
• We next created an algorithm to produce an SQL subquery corresponding to each node in the parse tree for a SPARQL statement; the result of each subquery includes a path column that identifies every node of the parse tree that contributes to the solution represented by that subquery.
• In this design, every node in the parse tree is numbered. The root node of the parse tree is node number 1. Every descendant of node i is numbered j, j > i. We also give each node a “name” (in two forms: a character string, and an SQL identifier) derived from its number: 'N1', 'N2', etc., and N1, N2, etc.
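The numbering rule in the last item can be sketched as a preorder walk, which numbers a node before any of its descendants and therefore guarantees j > i. The Node class, the number_tree function, and the shape of the example tree are assumptions made for illustration; the tree shape is inferred from the node numbers quoted in the discussion of Figure 1, not taken from an actual parser.

```python
from dataclasses import dataclass, field
from itertools import count

@dataclass
class Node:
    kind: str                       # e.g. "ROOT", "UNION", "GROUP", "TRIPLE", or a term
    children: list = field(default_factory=list)
    number: int = 0                 # assigned by number_tree()

    @property
    def name(self):
        """Character-string form of the node name, e.g. 'N4'."""
        return f"N{self.number}"

def number_tree(root):
    """Preorder walk: each node is numbered before its descendants."""
    counter = count(1)

    def visit(node):
        node.number = next(counter)
        for child in node.children:
            visit(child)

    visit(root)
    return root

def triple():
    # A TRIPLE node with its three leaf terms: subject, verb, object.
    return Node("TRIPLE", [Node("?x"), Node(":v"), Node("_:n")])

tree = number_tree(
    Node("ROOT", [Node("UNION", [Node("GROUP", [triple()]),
                                 Node("GROUP", [triple()])])])
)
union = tree.children[0]
group_a, group_b = union.children
print([(n.kind, n.name) for n in (tree, union, group_a, group_b)])
```

Under these assumptions the two TRIPLE nodes receive the names N4 and N9 and the two GROUP nodes N3 and N8, which agrees with the numbers used in the walkthrough of the example statement.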

The SQL code that is generated for a typical SPARQL statement is quite cumbersome, verbose, and complex. As we said above, however, we do not expect that any implementation will generate literal SQL code such as that illustrated here. The code we show here is intended to be illustrative of the technique, not of specific generated SQL. Consider the following simple SPARQL statement:

{ { ?x :v _:n } UNION { ?x :v _:n } }

The parse tree for that statement is shown in Figure 1.

Figure 1: Parse Tree of SPARQL Statement

The parenthesized number at each node represents the node number. There are no subqueries associated with the leaf nodes of the parse tree (e.g., ?x(5) or :v(11)). The path column of the subquery associated with each of the two TRIPLE nodes returns the name of that node (N4 and N9, respectively). The path columns of the subqueries associated with nodes GROUP(3) and GROUP(8) return the names of those nodes appended to the value of the path column of the nodes’ descendants (N4N3 and N9N8, respectively). The path column of the subquery associated with node UNION(2) returns N4N3N2 and N9N8N2. Consider node (4), which is a TRIPLE node comprising:

< ?x :v _:n >

The SQL subquery corresponding to that TRIPLE node is:

    N4 AS ( SELECT 'N4' AS "::path",
                   Subject AS "x",
                   Object as "_:n"
            FROM GraphTable
            WHERE Verb = 'http://example.com/v' )

The notation used here was created as part of the 1999 edition of the SQL standard. It can be read as “N4 is the nickname given to the subquery …”. The expression is an element of an SQL WITH clause, which “predeclares” subqueries, giving each a query name that can be used later in the overall query or in later elements of the WITH clause; in this case, N4 is the query name. The remainder of that expression is a table subquery, which generates a virtual table of zero or more rows. Continuing to break that expression down:

• 'N4' AS "::path" is the first column
    • The name of the column is "::path" (an SQL delimited identifier)
    • The value of the column in all rows is the character string N4 (an SQL character string literal) identifying that node
• Subject AS "x" is the second column
    • The name of the column is "x" (the SPARQL variable name)
    • The value of the column in all rows is the value of the SUBJECT column of the corresponding row taken from the graph table after the WHERE clause has been applied
• Object as "_:n" is the third column
    • The name of the column is "_:n" (an SQL delimited identifier that is generated from the blank node “name”)
    • The value of the column in all rows is the value of the OBJECT column of the corresponding row taken from the graph table after the WHERE clause has been applied
• FROM GraphTable causes all rows from the graph table to be identified as candidate rows to be used in the WHERE and SELECT clauses.
• WHERE Verb = 'http://example.com/v' causes the “un-identification” of all rows identified in the graph table in which the value of the Verb does not equal the URI (character string) http://example.com/v
• The result is a virtual table with as many rows as there are rows in the graph table whose verb is the given URI.

And, finally, the result of the entire expression is generated by the SELECT clause, which returns (in rows of a virtual table) the subject and object values from each row that remains after application of the WHERE clause (there is a 1:1 relationship between those remaining rows and the rows generated by the expression), as well as (in each row) the literal value that identifies the node in the SPARQL syntax tree for which those rows are a solution. Of course, analogous SQL subqueries are generated for other SPARQL syntax elements: JOIN nodes, GROUP nodes, EMPTY nodes, OPTIONAL nodes, UNION nodes, GRAPH nodes, and the root node.
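As a rough illustration, the N4 subquery can be executed as written, and one possible shape for the enclosing GROUP and UNION subqueries can be chained after it in the same WITH clause. Only N4 is taken from the text above. The definitions of N3, N9, N8, and N2, the use of UNION ALL, the sample rows, and the choice of SQLite (through Python's sqlite3 module) are assumptions made for this sketch, not the SQL that the actual transformation generates.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE GraphTable (Subject TEXT, Verb TEXT, Object TEXT)")
conn.executemany("INSERT INTO GraphTable VALUES (?, ?, ?)", [
    ("http://example.com/a", "http://example.com/v", "_:b1"),
    ("http://example.com/b", "http://example.com/v", "_:b2"),
    ("http://example.com/c", "http://example.com/w", "_:b3"),  # different verb
])

QUERY = """
WITH
  -- N4 is the TRIPLE subquery exactly as given in the text.
  N4 AS ( SELECT 'N4' AS "::path", Subject AS "x", Object as "_:n"
          FROM GraphTable WHERE Verb = 'http://example.com/v' ),
  -- Everything below is an assumed shape for the remaining nodes.
  N3 AS ( SELECT "::path" || 'N3' AS "::path", "x", "_:n" FROM N4 ),
  N9 AS ( SELECT 'N9' AS "::path", Subject AS "x", Object as "_:n"
          FROM GraphTable WHERE Verb = 'http://example.com/v' ),
  N8 AS ( SELECT "::path" || 'N8' AS "::path", "x", "_:n" FROM N9 ),
  N2 AS ( SELECT "::path" || 'N2' AS "::path", "x", "_:n" FROM N3
          UNION ALL
          SELECT "::path" || 'N2' AS "::path", "x", "_:n" FROM N8 )
SELECT "::path", "x", "_:n" FROM N2 ORDER BY "::path", "x"
"""

rows = conn.execute(QUERY).fetchall()
for row in rows:
    print(row)
```

Each result row carries a path value of N4N3N2 or N9N8N2, so a consumer can tell which branch of the UNION produced a given solution even though both branches match the same triples in this example.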

Conclusions

Relational database management systems provide both the industrial-strength storage capabilities needed for RDF use in an enterprise of any size and the language infrastructure needed to execute SPARQL statements to query that RDF. Specialized RDF storage engines are not required and can actually inhibit integration of multiple sorts of data. By choosing to represent RDF triples in an SQL table of three or four columns, RDF can take complete advantage of the facilities offered by modern RDBMSs, including very high scalability, extreme reliability, manageability (including backup and restore), security, and so forth.

We have demonstrated that it is possible to transform arbitrary SPARQL statements into SQL code, recognizing that real products will not transform literally into SQL but into the internal execution trees that also result from SQL compilation. By using an RDBMS as the RDF persistence engine and transforming SPARQL into SQL-like execution trees, it becomes an almost trivial exercise to integrate ordinary SQL queries, XQuery queries (using SQL/XML), and RDF queries (most likely using an SQL/XML-like facility that includes an SQL function for invoking SPARQL statements). By leveraging the enormous investment already made in RDBMS engines, this approach promises to greatly increase the rate of adoption of the semantic web.

Acknowledgements

The author thanks Fred Zemke, Ashok Malhotra, Susie Stephens, and Eric Prud’hommeaux for their help in researching – and understanding – topics related to the semantic web, including RDF and SPARQL.

Bibliography

[OWL-L] OWL Reference, Recommendation, World Wide Web Consortium, 2004-02-10; http://www.w3.org/TR/2004/REC-owl-ref-20040210/

[RDF-C] Resource Description Framework (RDF): Concepts and Abstract Syntax, Recommendation, World Wide Web Consortium, 2004-02-10; http://www.w3.org/TR/2004/REC-rdf-concepts-20040210/

[RDF-S] RDF/XML Syntax Specification (Revised), Recommendation, World Wide Web Consortium, 2004-02-10; http://www.w3.org/TR/2004/REC-rdf-syntax-grammar-20040210/

[SPARQL-L] SPARQL Query Language for RDF, Candidate Recommendation, World Wide Web Consortium, 2006-04-06; http://www.w3.org/TR/2006/CR-rdf-sparql-query-20060406/

[SPARQL-L2] SPARQL Query Language for RDF, Last Call Working Draft, World Wide Web Consortium, 2007-03-26; http://www.w3.org/TR/2007/WD-rdf-sparql-query-20070326/

[SPARQL-P] SPARQL Protocol for RDF, Candidate Recommendation, World Wide Web Consortium, 2006-04-06; http://www.w3.org/TR/2006/CR-rdf-sparql-protocol-20060406/

[SPARQL-R] SPARQL Query Results XML Format, Candidate Recommendation, World Wide Web Consortium, 2006-04-06; http://www.w3.org/TR/2006/CR-rdf-sparql-XMLres-20060406/

[SQL] ISO/IEC 9075-*:2003, Information technology — Database languages — SQL, International Organization for Standardization, 2003

[SQL/XML] ISO/IEC 9075-14:2006, Information technology — Database languages — SQL — Part 14: XML-Related Specifications (SQL/XML), International Organization for Standardization, 2006

[XDM] XQuery 1.0 and XPath 2.0 Data Model (XDM), Recommendation, World Wide Web Consortium, 2007-01-23; http://www.w3.org/TR/xpath-datamodel/

[XQuery] XQuery 1.0: An XML Query Language, Recommendation, World Wide Web Consortium, 2007-01-23; http://www.w3.org/TR/xquery/
