SQL, XQuery, and SPARQL: Making the Picture Prettier
Jim Melton, Oracle Corporation, Copyright © 2007 Oracle, [email protected]
Introduction
Last year, we asked “what’s wrong with this picture?” regarding the existence of three apparently overlapping query languages: SQL, XQuery, and SPARQL. Our somewhat reluctant answer to the question was that there was essentially nothing wrong because each of the three languages (and their corresponding data models) served specific purposes better than the two alternatives. This year, our research has been aimed at “making the picture prettier” – that is, accepting our earlier conclusions and finding practical ways to make the situation work well at minimal development cost.
In early 2006, the World Wide Web Consortium (W3C) published three Candidate Recommendation documents, [SPARQL-L], [SPARQL-P], and [SPARQL-R], defining a new query language called SPARQL. That new language was described as “a query language for getting information from…RDF graphs” (that is, SPARQL is an RDF query language), which seemed on the surface to be a new technology requirement. Comments raised during the Candidate Recommendation review period resulted in the W3C’s Data Access Working Group (DAWG) reverting [SPARQL-L] to Working Draft status for additional work. Recently, the revised specification [SPARQL-L2] was advanced to Last Call Working Draft status, while the other two specifications have been held in the Candidate Recommendation stage awaiting progression of [SPARQL-L2] to Candidate Recommendation.
Last year, we acknowledged that SPARQL’s existence is justified, but we also identified some areas in which additional research was required before it could be said whether or not practical integration with SQL and/or XQuery was likely. The present paper addresses that subject further. In particular, we indicate how existing investment in persistence technology can be applied to the RDF data model and to implementing the SPARQL language.
Data Model Integration
Query languages are designed to be applied to data represented in a particular data model. SQL is used to retrieve, create, modify, and delete data represented in (a variation of) the relational model of data. XQuery is used to locate and retrieve data that is represented in the XPath data model, XDM [XDM]. (The ability to update such data is expected to be provided in early 2008.)
Our vision is of a world in which applications can query data that is provided in the SQL/relational model, in the XPath Data Model, and in RDF, preferably in a single query expression. This implies that SQL statements must be able to access XML data and RDF data, that XQuery expressions must be able to access SQL data and RDF data, and that SPARQL queries must be able to access SQL data and XML data. Achieving that vision requires a significant amount of infrastructure.
We’ve long known that one language can be used to query data represented in a data model other than the one for which the language was designed, by mapping the data from its native data model into the query language’s data model. An important example is SQL/XML [SQL/XML], which allows relational data to be published in an XML form (that is, as an XPath data model instance) that can then be queried using XQuery. SQL/XML also provides a facility (XMLTABLE) that allows XML to be treated as though it were SQL data. Such mappings naturally run into the famous “impedance mismatch” caused by factors such as the collections of data types differing amongst query languages and their corresponding data models.
RDF is presented in [RDF-C] as yet another data model – a graph data model – distinct from the XPath tree-structured data model and from the SQL “flat table” data model. It is tempting to reject that assertion because of the tuple nature of RDF entities.
However, a close examination of [RDF-C] shows subtle – but important – differences between collections of RDF triples and multisets of rows in SQL tables of three columns. For example, SQL tables are defined to comprise one or more columns, each having a particular declared data type (such as INTEGER, TIMESTAMP, or some user-defined type). Every row in a given table has exactly that number of columns, and the value of each column in each such row must be of the column’s declared type. (Values of user-defined type columns may have a most-specific type that is a subtype of that user-defined type, which is a concept that doesn’t apply to columns of SQL’s built-in types.) In addition, all of SQL’s metadata is essentially structural metadata – that is, metadata about the structure of the various tables, about the data types of columns, and so forth – and not semantic metadata, that is, information that actually describes the meaning of the SQL data. In the SQL model, the data types of columns are captured in various system tables, but very little information about the relationships of those data types is derivable from the system tables. Of course, information from those system tables can be combined with the data in the tables themselves, although the criteria through which such combinations would be meaningful are far from clear.
By contrast, a given RDF collection can be augmented by RDF triples expressed using RDF Schema [RDF-S] and OWL [OWL-L] constructs that specify the class to which a given RDF entity belongs. Last year, we investigated whether the use of SQL’s user-defined types might offer some way to map such class information from RDF into the SQL model, but the results were discouraging and we have abandoned that line of research.
Another important difference arises from the relationships between the metadata associated with each model and the data available under that model. In the SQL environment, data literally cannot exist without metadata – the schema.
The two are inseparable in theory and in practice. However, in both the XPath data model and the RDF data model, data may exist independent of any schema describing that data. While the absence of a schema may limit the ways in which the data can be interpreted, it is possible to build XML documents and RDF collections without any schema that describes them. On the World Wide Web, this distinction is especially important because, unlike in the closed world of a database system, it is impossible for there to be a central point of control at which such metadata can be created...and enforced.
Persistence Models
The first commercial “relational” database management systems began to appear about 25 years ago. At the time, data management was dominated by CODASYL and other “network” DBMSs. The conventional wisdom at the time said that the new low-performance, small-volume systems didn’t have a chance against the established base. But the separation of data model from persistence model proved to offer great versatility and the opportunity for tremendous gains in performance, manageability, scalability, and data volume.
Since then, there have been a number of database system innovations that were hoped to overtake relational systems in the marketplace, such as object-oriented database systems (OODBMS) and the so-called “native XML” database systems. To date, none have succeeded in doing so (although many of them have found secure niche markets with unique requirements). Instead, the implementers of relational systems have co-opted the new forms of data. The advent of object-relational systems (ORDBMS) responded to the majority of the requirements that led to the development of OODBMSs, and it appears at present that those systems have been successfully extended to handle XML data (XORDBMS) for the large majority of application environments.
What, then, should be done about RDF data? RDF, as stated above, defines a graph data model.
An important question to consider is this: Are the persistence requirements for graph data models so unique that the persistence engines that have served so well for relational data, object-oriented data, and XML data must be avoided? Or do those engines have sufficient flexibility that they can be used successfully for persisting and managing RDF data, too?
We considered the possibility of creating a native RDF storage engine to deal with the graph nature of the RDF data. Such an engine would, no doubt, have some similarities to the CODASYL and other network DBMSs of the ’70s. While we realized that there might be some advantages to this approach, we are also highly aware that relational storage engines easily overcame any perceived advantages to such “pointer-based” systems. Furthermore, development of new database storage engines is burdened by the immense amount of infrastructure required to provide truly “industrial-strength” capabilities that existing relational engines already provide. There would have to be truly overwhelming advantages to a new storage model to justify the expenditures (literally billions of dollars) that led to the dominance of relational engines today.
We firmly believe that the storage technologies that underlie the successful relational systems of which we are aware – including commercial implementations and open source implementations – are completely adequate for RDF data management. In fact, because the nature of RDF is collections of 3-tuples, we are convinced that there is no need at all for a native RDF storage environment and that relational systems are ideally suited for the job.
Having reached that conclusion, we were next faced with a somewhat higher-level decision: Should there be a new “native” RDF data type defined for relational systems, as was done for XML data?
The choice here was not so obvious, as there are advantages to defining a new data type and advantages to eschewing a new type in favor of ordinary table/column/row representations.
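To make the “ordinary table/column/row” option concrete, here is a minimal sketch, not the design chosen in this paper, of RDF triples held in a single three-column relational table and queried with plain SQL. It uses Python’s built-in sqlite3 module only because it is self-contained. The table name `triples`, the `ex:` identifiers, the sample data, and the helper functions are all hypothetical. The query corresponds roughly to the SPARQL graph pattern `{ ex:alice ex:knows ?x . ?x ex:name ?name }`, expressed as a self-join on the triples table.

```python
import sqlite3

# Hypothetical sample data: (subject, predicate, object) triples.
TRIPLES = [
    ("ex:alice", "rdf:type", "ex:Person"),
    ("ex:alice", "ex:name", "Alice"),
    ("ex:alice", "ex:knows", "ex:bob"),
    ("ex:bob", "rdf:type", "ex:Person"),
    ("ex:bob", "ex:name", "Bob"),
    ("ex:carol", "ex:name", "Carol"),
]


def load_triples(triples):
    """Create an in-memory table of 3-tuples and load the given triples."""
    conn = sqlite3.connect(":memory:")
    # An RDF graph is a set of triples, so the full triple serves as the key.
    conn.execute(
        """
        CREATE TABLE triples (
            subject   TEXT NOT NULL,
            predicate TEXT NOT NULL,
            object    TEXT NOT NULL,
            PRIMARY KEY (subject, predicate, object)
        )
        """
    )
    conn.executemany("INSERT INTO triples VALUES (?, ?, ?)", triples)
    return conn


def names_of_known(conn, who):
    """Names of everything that `who` knows.

    Each triple pattern becomes one reference to the table; the shared
    variable ?x becomes the join condition t2.subject = t1.object.
    """
    rows = conn.execute(
        """
        SELECT t2.object
        FROM triples AS t1
        JOIN triples AS t2 ON t2.subject = t1.object
        WHERE t1.subject = ?
          AND t1.predicate = 'ex:knows'
          AND t2.predicate = 'ex:name'
        ORDER BY t2.object
        """,
        (who,),
    ).fetchall()
    return [row[0] for row in rows]


conn = load_triples(TRIPLES)
print(names_of_known(conn, "ex:alice"))
```

Note that every column here is plain text, so the class information carried by the `rdf:type` triples is ordinary data rather than SQL metadata, which is one of the differences between the two models discussed earlier.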