RDF in the JRS Server

Simon Johnston, Martin Nally, Edison Ting

May 16, 2008

This document describes how the JRS Server uses and exposes RDF. JRS is not an RDF server; however, it makes extensive use of RDF in the way it indexes resources and exposes indexed properties to client applications. This document covers not only how clients interact with this RDF view but also the motivation for using RDF and some of the implementation details.

1 Motivation

The Jazz REST Services (JRS) Server is a set of REST storage services that allow for the secure storage, indexing and query of resources, supporting the development of Application Lifecycle Management (ALM) application clients. A JRS server should not restrict the type or representation of the resources [1] it stores, and yet indexing resources does require knowledge of the “raw” format (XML vs. plain text, for example) and of the specific domain schema (requirement XML vs. test case XML, for example). To this end, when a resource is stored JRS invokes a set of indexer tasks that are specific to the raw format and which may be further configured with declarative rules for the domain schema. For example, the JRS server provides an image indexer that extracts EXIF [2] properties from photographs and indexes a fixed subset of them; the XML indexer, however, must be configured by a client to extract specific elements and attributes using a set of XPath expressions. These indexer tasks are responsible for extracting properties from the resource, which the JRS server then makes available for queries. In the design of the JRS server a number of particular needs were identified that affect the design of the indexer tasks and the server’s persistence of the index properties themselves.

1. The notion of an index property should be as simple as possible while still conveying useful information.

[1] We use the definitions of “resource” and “representation” from http://www.w3.org/TR/webarch/ (sections 2.2 and 3.2).
[2] http://www.exif.org

2. Index properties should be retrievable for a given resource; that is, the server should be able to answer the question “what index properties have been extracted from this resource?”

3. The representation of properties returned to a client should be in a standard format, if possible, rather than JRS inventing something new.

4. The server should support standard query languages that can operate on these index properties allowing client applications to perform complex queries.

In the prototype server that pre-dates JRS, the first concern above started us down a path of using key/value pairs and storing the value only as a string. This proved too simple and too restrictive, firstly because we needed properties to be correctly typed so that operators such as “>” and “<” would work correctly, and secondly because in the case of XML resources some properties were attached to secondary resources [3] within the resource, and so we needed to store this additional identifier. This led us to a design that used the {subject, predicate, object} triple notion common in knowledge representation schemes, and to an initial implementation known as the “universal table”, in which we stored all triples in a single database table. This single table led to terrible performance: any meaningful query resulted in many joins, and the design simply did not scale. In JRS we changed the design so that each resource in the repository has its own set of typed triples (types are limited to String, URI, Boolean, Integer and Timestamp).

In the JRS Server it is possible to request the set of all index properties stored for a resource using the URI form “resource-uri?properties”, where the query parameter “properties” causes a property document to be returned rather than the resource itself. It is also possible, when executing a JRS simple query (that is, a simple conjunctive, URL-encoded query), to request that the properties of each hit be returned along with its URL. This second form allows client applications to query for resources and retrieve index properties such as name, content-type and so on, so that query results can be rendered more appropriately for a user. The initial format of this response was a custom format, which was unsatisfactory, and so we looked around for a more appropriate replacement.
It seemed reasonable, as the internal format of the indexed properties was very much inspired by RDF [4], that the RDF XML format be used to represent properties in the responses described above. To this end we defined a set of RDF generation patterns (described below) that the JRS server uses so that all our property documents are returned in a regular manner. This move to an RDF XML representation also spurred us to remove the custom names we had used for certain system-maintained properties, using a mixture of RDF and Dublin Core [5] properties instead. The last step was to decide how to make these indexed properties more easily accessible and available for query by client applications. The resulting decision was to generate a

[3] See http://www.w3.org/TR/webarch/ section 3.2.2
[4] http://www.w3.org/RDF/
[5] http://dublincore.org/documents/dcmi-terms/

query service provided by JRS that allows client applications to POST a query, in a standard query language, to the server and have it run against an “RDF Store”, that is, against some collection of all the possible RDF documents used to describe the resources in the JRS repository. The key here is that an implementation of JRS can provide multiple such query languages and can choose whether or not to actually store the RDF as XML data; these details are not visible to the client. The benefit to the client is that the shape of this RDF store and the behavior of the query service are defined by the JRS specifications, and so how the query service is implemented is immaterial as long as it conforms to these behavioral requirements.

Putting all of these together, we discovered that RDF has taken a central role in our design: ANY resource stored in JRS, regardless of its representation, has an associated set of index properties (even if these are only the system-defined properties), and these properties are accessible through the “?properties” request, the URL-encoded query and the RDF store query. As we expect client applications to make extensive use of links between resources and to define “virtual collections” of resources through standard queries, understanding these services is key to the development of JRS client applications. The rest of this document describes the format of properties, the property and query services, and finally some of the details of the current JRS implementation, which uses DB2 PureXML.

2 Indexing and Properties

This section describes the set of generation patterns used to express JRS indexed properties in RDF. The purpose of these patterns is to constrain the use of RDF to a set of forms that clients can expect, and therefore to reduce their parsing requirements. All examples have been tested against the W3C RDF Validation Service (http://www.w3.org/RDF/Validator/) and written as a set of test cases with RDFLib (http://rdflib.net). The examples in this section show indexes extracted from the more complete music example presented later in this paper.

2.1 General Patterns

The following describe common patterns derived from patterns seen in the indexes themselves.

2.1.1 Empty Set of Index Properties

The following is the representation returned for resources that either have not been indexed or for which no indexer contributed properties. This is the simplest form of RDF description document.

Note that the rdf:RDF element is optional according to the RDF/XML Syntax Specification (http://www.w3.org/TR/rdf-syntax-grammar/, section 2.6), quoted below. We would prefer not to require it, and so clients should not expect to receive the outer RDF tag. Users should note, however, that some validators, including the W3C online validation service, DO require the outer RDF tag.

To create a complete RDF/XML document, the serialization of the graph into XML is usually contained inside an rdf:RDF XML element which becomes the top-level XML document element. Conventionally the rdf:RDF element is also used to declare the XML namespaces that are used, although that is not required. When there is only one top-level node element inside rdf:RDF, the rdf:RDF can be omitted although any XML namespaces must still be declared.

Therefore the basic form would be simpler, as in the following listing.

2.1.2 Simple Index Properties

Once index properties have been stored for a resource, we see these encoded as true XML elements with the predicate name and namespace extracted by the indexer. The difference between this format and the current JRS indexer is that RDF makes a conscious and semantic distinction between literal values and resource links, and this is reflected in the format as shown below. This form uses the RDF convention that stores literal values as element content but references to other resources with the URL as an attribute of the element (seen here in the difference between ns:name and ns:artist). Literal values may also be typed, as seen in the ns:releasedYear element below, using the XML type system, but only with those types supported by the JRS indexer specification.

A Matter of Life and Death 2006
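As an illustration of the literal/link distinction described above, the following is a minimal Python sketch (standard library only) that builds a property document in this shape. The namespace URI http://example.com/xmlns/music#, the "music" prefix standing in for "ns", the album URL and the helper name build_description are assumptions made for the example; the JRS server's own generator is not shown in this document.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
XSD_NS = "http://www.w3.org/2001/XMLSchema#"
MUSIC_NS = "http://example.com/xmlns/music#"  # assumed namespace for the example

ET.register_namespace("rdf", RDF_NS)
ET.register_namespace("music", MUSIC_NS)


def build_description(about, literals, links):
    """Build an rdf:Description element (hypothetical helper).

    literals: list of (local_name, value, xsd_type or None)
    links:    list of (local_name, target_url)
    """
    desc = ET.Element(f"{{{RDF_NS}}}Description", {f"{{{RDF_NS}}}about": about})
    for name, value, xsd_type in literals:
        # Literal values are stored as element content, optionally typed.
        prop = ET.SubElement(desc, f"{{{MUSIC_NS}}}{name}")
        prop.text = str(value)
        if xsd_type:
            prop.set(f"{{{RDF_NS}}}datatype", XSD_NS + xsd_type)
    for name, target in links:
        # Links to other resources are stored in an rdf:resource attribute.
        ET.SubElement(desc, f"{{{MUSIC_NS}}}{name}", {f"{{{RDF_NS}}}resource": target})
    return desc


album = build_description(
    "/jazz/resources/musicdb/albums/album-1",
    literals=[("name", "A Matter of Life and Death", None),
              ("releasedYear", 2006, "integer")],
    links=[("artist", "/jazz/resources/musicdb/artists/artist-1")],
)
print(ET.tostring(album, encoding="unicode"))
```

The printed element carries the name and release year as element content and the artist as an rdf:resource attribute, which is the distinction a client needs to rely on when parsing.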

2.1.3 Multiple Instances of Single Predicate

The current JRS index property triples allow for multiple values; unfortunately we do not distinguish between two semantically different cases for this. The first is where the same property logically exists multiple times, each with a single distinct value, and the second is where a single property may have multiple values simultaneously. The example below shows the RDF encoding closest to the first form, which is the format to be used in JRS.

A Matter of Life and Death 2006 Rock Heavy Metal

RDF does provide for the notion of containers and collections (rdf:Bag, rdf:Seq, rdf:Alt and so on), which we have chosen not to use in JRS.
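A client reading documents in this form should therefore expect any predicate to repeat. The following Python sketch shows one way to collect repeated predicates into lists; the sample document is a reconstruction based on the surviving values of the listing (the element names, the music namespace URI and the album URL are assumptions), and property_values is a hypothetical helper.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

MUSIC_NS = "http://example.com/xmlns/music#"  # assumed namespace

# Assumed reconstruction of the listing: genre appears once per value.
DOC = """<rdf:Description
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:music="http://example.com/xmlns/music#"
    rdf:about="/jazz/resources/musicdb/albums/album-1">
  <music:name>A Matter of Life and Death</music:name>
  <music:releasedYear>2006</music:releasedYear>
  <music:genre>Rock</music:genre>
  <music:genre>Heavy Metal</music:genre>
</rdf:Description>"""


def property_values(xml_text):
    """Map each predicate (in {namespace}name form) to the list of its values."""
    values = defaultdict(list)
    for child in ET.fromstring(xml_text):
        values[child.tag].append(child.text)
    return dict(values)


props = property_values(DOC)
genres = props[f"{{{MUSIC_NS}}}genre"]
print(genres)
```

Treating every predicate as a list, even when only one value is present, avoids special cases in client code.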

2.1.4 Indexed Secondary Resources

JRS supports the indexing of secondary resources in representations that support such a notion; for example, an XML document may be structured such that it has a hierarchy of logical components. Simply put, each indexed resource, be it primary or secondary, will have a distinct RDF Description document. This separation is shown in the listing below; notice that each secondary resource can be identified by the “#fragment” form of its URL. All of the rules described here apply to both parent and secondary resources.

A Matter of Life and Death 2006 Rock Heavy Metal

Different World 1
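Because secondary resources are distinguished only by the fragment part of their URL, a client can regroup Description documents under their primary resource with ordinary URL handling. The sketch below uses Python's urllib.parse.urldefrag; the fragment identifiers "track-1" and "track-2" and both helper names are hypothetical.

```python
from urllib.parse import urldefrag


def split_resource_url(url):
    """Return (primary_url, fragment); fragment is None for a primary resource."""
    primary, fragment = urldefrag(url)
    return primary, (fragment or None)


def group_by_primary(about_urls):
    """Group rdf:about URLs by primary resource, listing secondary fragments."""
    groups = {}
    for url in about_urls:
        primary, fragment = split_resource_url(url)
        groups.setdefault(primary, [])
        if fragment is not None:
            groups[primary].append(fragment)
    return groups


# Hypothetical rdf:about values for an album and two of its tracks.
abouts = [
    "/jazz/resources/musicdb/albums/album-1",
    "/jazz/resources/musicdb/albums/album-1#track-1",
    "/jazz/resources/musicdb/albums/album-1#track-2",
]
groups = group_by_primary(abouts)
print(groups)
```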

2.1.5 Indexed Compound Values

There are XML constructs that do not index well as a set of top-level properties but are also not secondary resources, and these cause considerable problems. The common case is the Atom “link” element (see examples).

What we need is the ability to store a compound property, something that is easy to do in RDF. Our example encodes the value “Disk 1 of 1” in a compound property, in effect creating a set of triples of the form (using ’nt’ notation):

_:01. _:01 1. _:01 1.

A Matter of Life and Death 2006 Rock Heavy Metal 1 1

2.2 Additional Topics

The generation patterns above are not the whole story; certain features of the generation require some explanation and are important for clients to understand. For example, the topic of namespace concatenation is important so that client applications are able to construct queries and parse results correctly.

2.2.1 URI Expansion/Concatenation

As seen in the examples above, resources and types in RDF use URIs, and those in the examples are of the general form “base-uri#element”, a format also used in JRS itself to concatenate predicate names and URIs in the current triples. RDF uses a simple form where ns:name is assumed to be equivalent to “ns-uri+name” and does not specify any separator character. Namespace specifiers should therefore include a separator character, which is why the namespace declarations in the examples all have trailing “#” characters. In the case of Dublin Core the namespace uses a trailing “/” instead, meaning that each element in Dublin Core is a resource in its own right, although at the time of writing each element resolves to a single combined RDF resource. Whether constructing an RDF namespace/element pair from a URI or extracting a namespace/name pair from an indexed URI, JRS uses the same conventions. These are summarized in the table below.

Element Namespace              Predicate Value              RDF Namespace                  RDF Name
http://example.com/ns/music    artist                       http://example.com/ns/music#   artist
http://example.com/ns/music/   artist                       http://example.com/ns/music/   artist
http://example.com/ns/music#   artist                       http://example.com/ns/music#   artist
http://www.w3.org/2005/Atom    self                         http://www.w3.org/2005/Atom#   self
http://www.w3.org/2005/Atom    http://...v0.6#predecessor   http://jazz.net/xmlns/v0.6#    predecessor
                               http://example.com/ns/name   http://example.com/ns/         name
                               http://example.com/name      http://example.com/            name

Notes:

1. In the case where a namespace has a trailing “/” or “#” this is preserved into the RDF namespace value; where no such character exists a “#” will be appended.

2. In the case that a predicate value contains a complete URI then a similar approach to the extraction of element name and namespace occurs; by first looking for “#” as a separator and then “/”.
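The two notes above can be sketched as a small Python function. This is an illustration under stated assumptions rather than the server's code: the function name to_rdf_name is hypothetical, and treating any predicate containing "://" as a "complete URI" is an assumption, since the document does not say how JRS detects one.

```python
def to_rdf_name(element_namespace, predicate):
    """Return (rdf_namespace, rdf_name) for an indexed predicate (hypothetical helper)."""
    if "://" in predicate:  # assumed test for "a complete URI"
        # Note 2: split the URI itself, looking for '#' first and then '/'.
        for separator in ("#", "/"):
            head, found, tail = predicate.rpartition(separator)
            if found and tail:
                return head + separator, tail
        raise ValueError(f"cannot split predicate URI: {predicate!r}")
    # Note 1: keep a trailing '/' or '#'; otherwise append '#'.
    if element_namespace.endswith(("#", "/")):
        return element_namespace, predicate
    return element_namespace + "#", predicate


print(to_rdf_name("http://example.com/ns/music", "artist"))
print(to_rdf_name("http://www.w3.org/2005/Atom", "http://example.com/ns/name"))
```

Applied to the rows of the table whose values are fully legible, the function yields the RDF namespace and name columns shown there.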

2.2.2 JRS System Properties

A system property is one that is often a property of the internal JRS resource storage model, and so is set and updated by the JRS server and is not present within the representation of the resource itself. These system properties are both available for query and returned as part of a “?properties” request, and as such they need to be present in the RDF format, although certain changes can be made to the particular encoding from the current format to the proposal herein. The table below lists the current system properties and the RDF values they are provided as; note the use of JRS, RDF and Dublin Core elements to describe these.

JRS Property                                           RDF Property         XPath
The URI of the resource in the repository.             rdf:about            /rdf:Description/@rdf:about
The content-type of the resource in the repository.    dcterms:format       /rdf:Description/dcterms:format
The timestamp of the last modification.                dcterms:modified     /rdf:Description/dcterms:modified
The URI to the user that made the last modification.   dcterms:contributor  /rdf:Description/dcterms:contributor/@rdf:resource
The URI of the root element of any XML resource.       rdf:type             /rdf:Description/rdf:type/@rdf:resource
The archived state of the resource.                    jrs:archived         /rdf:Description/jrs:archived
The URI to the user that created the resource.         dcterms:creator      /rdf:Description/dcterms:creator/@rdf:resource
The resource creation timestamp.                       dcterms:created      /rdf:Description/dcterms:created
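As a sketch of how a client might read these system properties from a “?properties” response, the following Python example applies the paths from the table with the standard library's ElementTree. The Dublin Core terms namespace is the published one; the jrs namespace URI, the sample document (including which timestamp is the creation time) and the function name system_properties are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dcterms": "http://purl.org/dc/terms/",
    "jrs": "http://jazz.net/xmlns/v0.6#",  # assumed; not stated in this document
}
RDF = NS["rdf"]

# Assumed sample property document for a primary resource.
SAMPLE = """<rdf:Description
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:dcterms="http://purl.org/dc/terms/"
    xmlns:jrs="http://jazz.net/xmlns/v0.6#"
    rdf:about="/jazz/resources/musicdb/albums/album-1">
  <dcterms:created>2008-03-01T00:00:00</dcterms:created>
  <dcterms:modified>2008-03-02T00:00:00</dcterms:modified>
  <dcterms:format>application/xml</dcterms:format>
  <jrs:archived>False</jrs:archived>
  <rdf:type rdf:resource="http://example.com/xmlns/music#album"/>
</rdf:Description>"""


def system_properties(xml_text):
    """Extract the system properties from one rdf:Description document."""
    desc = ET.fromstring(xml_text)

    def literal(path):
        node = desc.find(path, NS)
        return node.text if node is not None else None

    def link(path):
        node = desc.find(path, NS)
        return node.get(f"{{{RDF}}}resource") if node is not None else None

    return {
        "uri": desc.get(f"{{{RDF}}}about"),
        "format": literal("dcterms:format"),
        "modified": literal("dcterms:modified"),
        "contributor": link("dcterms:contributor"),
        "type": link("rdf:type"),
        "archived": literal("jrs:archived"),
        "creator": link("dcterms:creator"),
        "created": literal("dcterms:created"),
    }


props = system_properties(SAMPLE)
print(props)
```

Properties that are absent from a document come back as None, which matches the observation elsewhere in this document that secondary resources do not carry the system properties.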

3 Examples

The following sections introduce two examples, one of which we have already seen above, that we will use throughout the rest of this document.

3.1 The Music Example

Using the example from above, this section describes the source resource and the indexer used to generate the index. Firstly, here is a sample of the format used to describe a music album. Each track is defined within the resource as a secondary resource, and a number of links are defined, for example between the album and its artist(s) and between the album and its artwork. Not all of the elements in the resource require indexing, and so the RDF will not be as complete.

A Matter of Life and Death Sanctuary Records Kevin Shirley Martin Birch 2006 /jazz/resources/musicdb/artists/artist-1 Rock Heavy Metal 72:06 /jazz/resources/musicdb/artwork/726467828 /jazz/resources/musicdb/artwork/726467832 Different World

1 10:01 /jazz/resources/musicdb/artists/artist-101 /jazz/resources/musicdb/artists/artist-102

The XML indexer in the JRS server takes an index specification that describes, for a given XML namespace, which elements and attributes to index. The following is an index specification for the music namespace; note that the specification below does not index the compound “disk” element but does index each track in the album as a secondary resource identified by its id attribute.

Finally, as shown above, here is the resulting set of index resources stored in the RDF Store. Note that the secondary resources do not have the set of system properties, although it has been suggested that they should have the rdf:type property and possibly a dc:isPartOf relationship to their primary resource.

2008-03-01T00:00:00 2008-03-02T00:00:00 application/xml False A Matter of Life and Death 2006 Rock Heavy Metal

Different World 1

3.2 Atom entry

Using one of the more complex Atom entries that are part of the JRS server system collections, in this case a change event, we can see there are a number of particular elements we need to index carefully.

Change Event _eB0BgOu_EdytnP_R_BoVIg /jazz/users/TestJazzAdmin1 _eB0BgOu_EdytnP_R_BoVIg 2008-03-06T15:54:10-05:00 PUT-C

Given the current index specification used by the change event service, the following RDF would be produced; note that the additional whitespace has been included only for clarity. Here you can see that all links have been converted to compound properties.

2008-03-01T00:00:00 2008-03-01T00:00:00 application/atom+xml False


Change Event _eB0BgOu_EdytnP_R_BoVIg

<ns0:href rdf:resource="/jazz/projects/main/change-events/_eB0BgOu_EdytnP_R_BoVIg"/>

<ns0:href rdf:resource="/jazz/projects/main/change-events/_eB0BgOu_EdytnP_R_BoVIg"/>

application/xml

<ns0:href rdf:resource="/jazz/resources/main/test-resource-0?revision=_eBnNMOu_EdytnP_R_BoVIg"/> application/xml

PUT-C

4 Query API

JRS provides two query methods: a simple URL-encoded conjunctive query, and a POST-based query that is intended to be handled by the underlying RDF store itself. The simple URL-encoded form is easy to use and available from browsers; however, the limitations of this form are significant, and include:

• Only conjunction is supported between terms.
• Only equality is supported as an operator for terms.
• A term may only appear once in a query.

The POST-based query is intended specifically to address these limitations and is provided by allowing the client to POST a query to the same URL as above and have it executed in an appropriately enabled database server. We would also like to provide a URL-encoded version of the more complete query language, which we know is possible (and indeed specified) for SPARQL but becomes far more complex for XQuery.

4.1 Service details

The query service relies on a new service, internal to JRS, that manages the actual interface to the database and performs connection handling, validation and execution of queries. This internal service (RDFStoreService) is configured to use one or more RDFStore classes, each specific to a database server. If the connection to the configured database fails, then requests to perform queries will result in 503 (Service Unavailable). The details of the RDFStoreService and RDFStore implementations are not described as a part of the API. The actual query language supported by a JRS server is not defined by the specifications, allowing different servers to provide appropriate languages. In general we expect XQuery [6] and SPARQL [7] to be popular choices: XQuery takes advantage of the fact that all of the index documents are in XML and provides great capabilities, while, because the index documents are also valid RDF, SPARQL provides some capabilities not present in XQuery, especially when dealing with links between resources.

4.2 JRS Query specifics and validation

The current JRS server only supports XQuery and so has the following specific behavior.

• The content-type of the request MUST be “application/xquery”.
  – POSTing a request with an unknown or unsupported content type will result in 415 (Unsupported Media Type).

• The first characters of the body MUST be “xquery” (case insensitive).
  – Failure to include this at the start of the body will result in 400 (Bad Request).

• The query MUST NOT contain references to database-specific functions (for DB2 this means any function with the prefix “db2-fn”).
  – Including such references will result in 400 (Bad Request).
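The three rules above can be sketched as a single validation function. This is a minimal Python illustration, not the JRS implementation: the function name is hypothetical, the check for database-specific functions is a plain substring test (a real server would more likely inspect the parsed query), and the body is checked strictly from its first character, which is one reading of "the first characters of the body".

```python
def validate_query_request(content_type, body):
    """Return the HTTP status for a rejected request, or None if it is acceptable."""
    # Ignore any media type parameters such as "; charset=utf-8".
    media_type = (content_type or "").split(";", 1)[0].strip().lower()
    if media_type != "application/xquery":
        return 415  # Unsupported Media Type
    if body[:6].lower() != "xquery":
        return 400  # Bad Request: body does not start with "xquery"
    if "db2-fn" in body.lower():
        return 400  # Bad Request: database-specific function referenced
    return None


print(validate_query_request("text/plain", "xquery 1"))
print(validate_query_request("application/xquery", "xquery collection('JRS_RDF_INDEX')"))
```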

Additionally, as the underlying structure of the database is an implementation detail, and is in fact different for different database servers, the query has to abstract certain features that are likely to vary across servers. Primarily, a query MUST reference its source through a specifically named collection, as in the example below:

    1 xquery
    2 declare namespace rdf = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";
    3 declare namespace music = "http://example.com/xmlns/music#";
    4 for $index in collection('JRS_RDF_INDEX')/rdf:Description
    5 where $index/rdf:type/@rdf:resource = "http://example.com/xmlns/music#album" and
    6       $index/music:releasedYear > 2000
    7 return $index/fn:string(@rdf:about);

[6] http://www.w3.org/TR/xquery/
[7] http://www.w3.org/TR/rdf-sparql-query/

The JRS query service translates the reference to collection(“JRS_RDF_INDEX”) on line 4 into whatever structure or reference is required by the underlying database (see 5.2). Any alternative context provided by the client (for example doc(“name.xml”)) would most likely result in an error from the database XQuery processor.

4.3 Example POSTed query

The following example demonstrates how the client submits the query and deals with the responses; note both the specific content-type and the use of collection() in this request.

    POST /jazz/projects/main/query HTTP/1.1
    Host: example.com
    Accept: */*
    Content-Type: application/xquery
    Content-Length: 256

    xquery
    declare namespace rdf = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";
    declare namespace music = "http://example.com/xmlns/music#";
    collection('JRS_RDF_INDEX')/rdf:Description[
        rdf:type/@rdf:resource = "http://example.com/xmlns/music#album" and
        music:releasedYear > 2000]
    /fn:string(@rdf:about);

The response format that JRS uses for all query and search operations is Atom, so the result set is wrapped in an Atom feed and each row is represented as an Atom entry. The one interesting aspect of this is the type attribute on the Atom content element. In this query the response is a list of URLs, the text content of the RDF about attribute; as such the type of each result row is “text”.

    HTTP/1.1 200 OK
    Date: Thu, 20 Mar 2008 14:18:07 GMT
    Content-Length: 256
    Content-Type: application/atom+xml
    Last-Modified: Thu, 20 Mar 2008 14:18:07 GMT

XQuery Results 2008-03-20T14:18:07 Jazz Resource Services Row 1 System 2008-03-20T14:18:07 /jazz/resources/musicdb/albums/album-1

In the example below the user has returned the entire Description element as the result, and the content type of the response row has therefore been set to the specific content type for RDF.

Row 1 System 2008-03-20T14:18:07
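From the client's side, the request in this section can be assembled as in the following Python sketch, which uses only urllib from the standard library and builds the request without sending it. The host http://example.com and the helper name build_query_request are assumptions; the path and headers follow the example request shown above.

```python
import urllib.request

QUERY = """xquery
declare namespace rdf = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";
declare namespace music = "http://example.com/xmlns/music#";
collection('JRS_RDF_INDEX')/rdf:Description[
    rdf:type/@rdf:resource = "http://example.com/xmlns/music#album" and
    music:releasedYear > 2000]
/fn:string(@rdf:about);"""


def build_query_request(base_url, project, query):
    """Build (but do not send) a POST request for the project's query service."""
    url = f"{base_url}/jazz/projects/{project}/query"
    return urllib.request.Request(
        url,
        data=query.encode("utf-8"),
        headers={"Content-Type": "application/xquery", "Accept": "*/*"},
        method="POST",
    )


request = build_query_request("http://example.com", "main", QUERY)
print(request.get_method(), request.full_url)
```

Passing the request to urllib.request.urlopen would submit it; the Atom feed in the response could then be parsed with xml.etree.ElementTree. Content-Length is added by urllib when the request is sent, so it is not set here.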


4.4 Additional Examples

The following are some additional RDF samples extending the music example seen above.

2008-03-01T00:00:00 2008-03-02T00:00:00 application/xml False A Matter of Life and Death 2006 Rock Heavy Metal

2008-03-01T00:00:00 2008-03-02T00:00:00 application/xml False Iron Maiden 1978

Heavy Metal

2008-03-01T00:00:00 2008-03-02T00:00:00 application/xml False Steve Harris

Different World 1 259

These Colours Dont Run 2 412

Brighter Than a Thousand Suns 3s

512

4.4.1 Sample Queries

Given these sample resources we can construct some sample queries; these are obviously domain specific but are intended to demonstrate the kinds of intra- and inter-resource queries we expect clients to perform. The intent of the investigation started here, which we will pursue in more detail with the JRS client teams, is to see whether there are common patterns of XQuery usage such that we can restrict the JRS API to the subset of XQuery that covers these common cases and is portable across database platforms. For example, many queries can be written entirely with XPath; the following three queries are therefore equivalent.

    XQUERY
    declare namespace rdf = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";
    declare namespace music = "http://example.com/xmlns/music#";
    for $index in collection('JRS_RDF_INDEX')/rdf:Description
    where $index/rdf:type/@rdf:resource = "http://example.com/xmlns/music#album" and
          $index/music:genre = "Rock"
    return $index/fn:string(@rdf:about);

    XQUERY
    declare namespace rdf = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";
    declare namespace music = "http://example.com/xmlns/music#";
    for $index in collection('JRS_RDF_INDEX')/rdf:Description[
        rdf:type/@rdf:resource = "http://example.com/xmlns/music#album" and
        music:genre = "Rock"]
    return $index/fn:string(@rdf:about);

    XQUERY
    declare namespace rdf = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";
    declare namespace music = "http://example.com/xmlns/music#";
    collection('JRS_RDF_INDEX')/rdf:Description[
        rdf:type/@rdf:resource = "http://example.com/xmlns/music#album" and
        music:genre = "Rock"]/fn:string(@rdf:about);

Some XQuery resources suggest that the second form of the query above, using XPath predicates rather than the XQuery “where” clause, is more efficient; this is not the case for DB2, which will translate all where clauses into XPath predicates where possible. The following example queries have all been written as simple XPath wherever possible for clarity.

    -- *** Effectively get all resource descriptions
    XQUERY
    declare namespace rdf = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";
    collection('JRS_RDF_INDEX')//rdf:Description;

    -- *** Return the list of all about URLs
    XQUERY
    declare namespace rdf = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";
    collection('JRS_RDF_INDEX')/rdf:Description/fn:string(@rdf:about);

    -- *** Return the type of all resources
    XQUERY
    declare namespace rdf = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";
    collection('JRS_RDF_INDEX')/rdf:Description/rdf:type/fn:string(@rdf:resource);

    -- *** Return the URLs of all tracks
    XQUERY
    declare namespace rdf = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";
    collection('JRS_RDF_INDEX')/rdf:Description[
        rdf:type/@rdf:resource = "http://example.com/xmlns/music#track"]/fn:string(@rdf:about);

    -- *** List all distinct genres
    XQUERY
    declare namespace rdf = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";
    declare namespace music = "http://example.com/xmlns/music#";
    let $genres := collection('JRS_RDF_INDEX')/rdf:Description/music:genre/text()
    let $distinct := distinct-values($genres)
    return $distinct;

    -- *** Find all Rock albums
    XQUERY
    declare namespace rdf = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";
    declare namespace music = "http://example.com/xmlns/music#";
    collection('JRS_RDF_INDEX')/rdf:Description[
        rdf:type/@rdf:resource = "http://example.com/xmlns/music#album" and
        music:genre = "Rock"]/fn:string(@rdf:about);

    -- *** Find artist for all albums
    XQUERY
    declare namespace rdf = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";
    declare namespace music = "http://example.com/xmlns/music#";
    for $artist in collection('JRS_RDF_INDEX')/rdf:Description
    let $artistsInAlbum := collection('JRS_RDF_INDEX')/rdf:Description[
        rdf:type/@rdf:resource = "http://example.com/xmlns/music#album"]/music:artist/@rdf:resource
    where $artist/rdf:type/@rdf:resource = "http://example.com/xmlns/music#artist" and
          $artist/@rdf:about = $artistsInAlbum
    return $artist/music:name/text();

    -- *** Find artist-1, first in XQuery
    XQUERY
    declare namespace rdf = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";
    collection('JRS_RDF_INDEX')/rdf:Description[@rdf:about="/jazz/resources/musicdb/artists/artist-1"];

    -- *** and in SQL for comparison
    SELECT JRS_ID FROM JRS_RDF_INDEX WHERE RDF_ABOUT = '/jazz/resources/musicdb/artists/artist-1';

    -- *** Find artist named...
    XQUERY
    declare namespace rdf = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";
    declare namespace music = "http://example.com/xmlns/music#";
    collection('JRS_RDF_INDEX')/rdf:Description[music:name="Iron Maiden"];

    -- *** Find all mentions of artist-1
    XQUERY
    declare namespace rdf = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";
    collection('JRS_RDF_INDEX')/rdf:Description[
        */@rdf:resource = "/jazz/resources/musicdb/artists/artist-1"]
    /fn:string(@rdf:about);

    -- *** Find things released after 2000
    XQUERY
    declare namespace rdf = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";
    declare namespace music = "http://example.com/xmlns/music#";
    collection('JRS_RDF_INDEX')/rdf:Description[
        rdf:type/@rdf:resource = "http://example.com/xmlns/music#album" and
        music:releasedYear > 2000]
    /fn:string(@rdf:about);
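To make the meaning of the last query concrete, the following Python sketch evaluates the same filter over a small in-memory list of Description documents using the standard library. It is an emulation for illustration only, not how JRS or DB2 executes queries. The sample documents are assumed reconstructions, and the second album ("album-old", released 1990) is hypothetical data added so that the filter has something to exclude.

```python
import xml.etree.ElementTree as ET

NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "music": "http://example.com/xmlns/music#",
}
RDF_ABOUT = "{%s}about" % NS["rdf"]
RDF_RESOURCE = "{%s}resource" % NS["rdf"]
ALBUM_TYPE = "http://example.com/xmlns/music#album"

TEMPLATE = """<rdf:Description
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:music="http://example.com/xmlns/music#"
    rdf:about="{about}">{body}</rdf:Description>"""

# Stand-in for collection('JRS_RDF_INDEX'): one parsed document per resource.
INDEX = [ET.fromstring(TEMPLATE.format(about=about, body=body)) for about, body in [
    ("/jazz/resources/musicdb/albums/album-1",
     '<rdf:type rdf:resource="http://example.com/xmlns/music#album"/>'
     "<music:releasedYear>2006</music:releasedYear>"),
    ("/jazz/resources/musicdb/albums/album-old",  # hypothetical resource
     '<rdf:type rdf:resource="http://example.com/xmlns/music#album"/>'
     "<music:releasedYear>1990</music:releasedYear>"),
    ("/jazz/resources/musicdb/artists/artist-1",
     '<rdf:type rdf:resource="http://example.com/xmlns/music#artist"/>'
     "<music:name>Iron Maiden</music:name>"),
]]


def albums_released_after(index, year):
    """Return rdf:about of every album with some releasedYear greater than year."""
    hits = []
    for desc in index:
        types = [t.get(RDF_RESOURCE) for t in desc.findall("rdf:type", NS)]
        if ALBUM_TYPE not in types:
            continue
        # XPath general comparisons are existential: any matching value suffices.
        years = [int(y.text) for y in desc.findall("music:releasedYear", NS)]
        if any(value > year for value in years):
            hits.append(desc.get(RDF_ABOUT))
    return hits


result = albums_released_after(INDEX, 2000)
print(result)
```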

5 An Implementation with DB2 PureXML

This final section describes the design of a DB2 PureXML [8] database supporting the JRS notion of an RDF Store. This store is not an optimized relational structure supporting a pure RDF view (as in IODT/Minerva, http://www.alphaworks.ibm.com/tech/semanticstk/faq, or Jena, http://jena.sourceforge.net/) but a PureXML application storing RDF documents as-is in the XML database. The purpose is to provide a common storage format (in XML) for index properties extracted from domain resources by the JRS server. The chosen format is a subset of RDF, or rather a set of usage patterns, and for each resource stored in JRS there are potentially a number of such RDF indexes. The decision to store the RDF documents as XML allows us to give clients direct access to these index tables and to let them submit standard XQuery against the stored indexes.

The reason for the development of the RDF indexing, or in fact for the separation of indexing from the resources themselves, is the fundamental heterogeneity of JRS: the server does not have any notion of what resources clients wish to store, and so takes a hands-off approach to indexing in which the client-provided MIME type of a resource is used to determine one or more indexers to run. These indexers extract any number of triples from the resource, and these triples, once calculated, provide the basis for the RDF XML document we store in this database. Also note that a resource in JRS storage terms may hold a primary and one or more secondary resources, as is common in many XML documents, and our indexers will create separate RDF documents: one for the primary and one for each identified secondary resource.

5.1 The JRS RDF Index Table

This section details the table that stores the RDF documents. We do not describe or prescribe a specific database layout (in JRS itself we require that the CREATE DATABASE be performed before JRS starts), although we do use the database name JRSXML and the schema name JRS_RDF in our examples. For users who need to understand how to create a database, here is the example script included with JRS.

    db2 -vos "CREATE DATABASE JRSXML AUTOMATIC STORAGE YES ON 'C:\' USING CODESET UTF-8 TERRITORY US PAGESIZE 32 K"
    db2 -vos "CREATE SCHEMA JRS_RDF AUTHORIZATION ADMINISTRATOR"

The table is defined with the following DDL and described in the table below. Note the particular DB2 hints provided on the RDF column to inline XML content wherever possible, and on the table to turn data compression on. The inlining of content is valuable both in saving space and in terms of performance; where possible the database will try to store the XML inline in the row. The size of the row storage is based upon the

[8] http://www.ibm.com/software/data/db2/xml/

page size of the database and should therefore be set to the maximum value of 32K. If the entire row, including the XML content, can fit in the page then the data is stored inline; otherwise it will be extracted and stored separately. Inline data can also be compressed by DB2 (using the compression option shown in the alter table statement), but only data in the row, so our inlined documents would be compressed. Data stored outside the row in separate spaces would not benefit from this row compression, although a customer can purchase an additional compression product for DB2 to extend compression to this separated data. Note also that future versions of DB2 intend to add compression for indexes; this would be valuable to us as there is the issue of duplicate index data, described below.

    CREATE TABLE JRS_RDF.JRS_RDF_INDEX (
        JRS_ID BIGINT NOT NULL GENERATED ALWAYS AS IDENTITY
            (START WITH 1, MINVALUE 1, INCREMENT BY 1, NO CYCLE, NO CACHE, NO ORDER),
        PROJECT_NAME VARCHAR(128) NOT NULL,
        RDF XML NOT NULL INLINE LENGTH 10000,
        CONSTRAINT JRS_RDF_INDEX_PKEY PRIMARY KEY (JRS_ID)
    ) IN "USERSPACE1";

    ALTER TABLE JRS_RDF.JRS_RDF_INDEX COMPRESS YES;

Column         Comments
JRS_ID         An auto-generated identity column.
PROJECT_NAME   The identifier for the JRS project containing the indexed resource.
RDF            The actual RDF content itself; this is stored exactly as-is.

5.1.1 Index Definitions

The key to efficient query for clients is a combination of the restricted subset of RDF used by JRS and the set of indexes used by DB2 to index the RDF. The two attribute indexes, for rdf:about and rdf:resource, are relatively straightforward and also allow us to index URL values (in rdf:resource) separately from string literals (in element content). The PROP_* indexes are all identical except for the defined SQL data type; the reason for this is described below. Note that the datatype of string data is set to VARCHAR(512). An alternative, more space-efficient type would be VARCHAR HASHED, which stores an 8-byte hash of the string content rather than the string itself. Such an index has a simple restriction, one which makes it hard for us to use: it can only be used for comparison tests, i.e. string equality, and not for range checks, e.g. LIKE. As we cannot be sure that clients will not need to perform range checks in queries, we currently plan to use the non-hashed form of these indexes.

    CREATE INDEX JRS_RDF.RDF_ABOUT
      ON JRS_RDF.JRS_RDF_INDEX (RDF ASC)
      GENERATE KEY USING XMLPATTERN
        'declare namespace rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"; /rdf:Description/@rdf:about'
      AS SQL VARCHAR(512)
      ALLOW REVERSE SCANS PCTFREE 10 MINPCTUSED 10
      PAGE SPLIT SYMMETRIC COLLECT STATISTICS;

    CREATE INDEX JRS_RDF.RDF_RESOURCE
      ON JRS_RDF.JRS_RDF_INDEX (RDF ASC)
      GENERATE KEY USING XMLPATTERN
        'declare namespace rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"; /rdf:Description/*/@rdf:resource'
      AS SQL VARCHAR(512)
      ALLOW REVERSE SCANS PCTFREE 10 MINPCTUSED 10
      PAGE SPLIT SYMMETRIC COLLECT STATISTICS;

    CREATE INDEX JRS_RDF.PROP_STRING
      ON JRS_RDF.JRS_RDF_INDEX (RDF ASC)
      GENERATE KEY USING XMLPATTERN
        'declare namespace rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"; /rdf:Description/*'
      AS SQL VARCHAR(255)
      ALLOW REVERSE SCANS PCTFREE 10 MINPCTUSED 10
      PAGE SPLIT SYMMETRIC COLLECT STATISTICS;

    CREATE INDEX JRS_RDF.PROP_NUMBER
      ON JRS_RDF.JRS_RDF_INDEX (RDF ASC)
      GENERATE KEY USING XMLPATTERN
        'declare namespace rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"; /rdf:Description/*'
      AS SQL DOUBLE
      ALLOW REVERSE SCANS PCTFREE 10 MINPCTUSED 10
      PAGE SPLIT SYMMETRIC COLLECT STATISTICS;

    CREATE INDEX JRS_RDF.PROP_TIMESTAMP
      ON JRS_RDF.JRS_RDF_INDEX (RDF ASC)
      GENERATE KEY USING XMLPATTERN
        'declare namespace rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"; /rdf:Description/*'
      AS SQL TIMESTAMP
      ALLOW REVERSE SCANS PCTFREE 10 MINPCTUSED 10
      PAGE SPLIT SYMMETRIC COLLECT STATISTICS;

Logically we wanted to use XPath to extract specific typed values into typed indexes using the form:

• /rdf:Description/*[@rdf:datatype="...#integer"] -> SQL DOUBLE

• /rdf:Description/*[@rdf:datatype="...#dateTime"] -> SQL TIMESTAMP

• /rdf:Description/*[@rdf:datatype="...#string" or not(string(@rdf:datatype))] -> SQL VARCHAR

Note the last, which indicates the form where you ask for things that do not have a datatype attribute. However, XPath of this form is not allowed in the specification of indexes: the restricted subset defined by DB2 (and named xmlpattern) does not allow predicates of the form shown above9. This caused a little concern initially, but it turns out that DB2 XML indexing not only has a solution already, described in the doc10, but there is even an example in the DB2 docs showing the use of multiple indexes of this form11.

The index XML data type acts like a filter and is not a constraint since the user may have multiple indexes with different data types on the same XML column. XML values that do not have a valid lexical form for the target index XML data type are ignored and not indexed. If the value cannot be converted to the index XML data type, then the value is inserted in the table, but it is not inserted into the index. No error or warning is raised since specifying the index XML data type is not considered a constraint on the values. Note that the index can ignore only invalid XML values for the data type. Valid values must conform to DB2's representation of the value for the data type, or an error will be issued.

So, we create three typed indexes off the same wildcard pattern, which selects all element nodes that are children of the description. We use the method described above and let DB2 work out what type the element content is and create the relevant index entries for us. We do still keep strings and URIs separate, but only because RDF uses the rdf:resource attribute pattern, for which we are able to write a distinct index.

9 http://publib.boulder.ibm.com/infocenter/db2luw/v9/topic/com.ibm.db2.udb.admin.doc/doc/c0024077.htm
10 http://publib.boulder.ibm.com/infocenter/db2luw/v9/topic/com.ibm.db2.udb.admin.doc/doc/c0024203.htm

• /rdf:Description/* -> SQL VARCHAR, DOUBLE, TIMESTAMP

As noted elsewhere, the issue with this scheme is that any value which can be cast both to a string and to a number or timestamp will result in two index entries. For example, an element whose content is 10 will produce a string index entry, as the value can certainly be cast to a string, and also a numeric index entry. This has certain space implications, but future DB2 plans to compress index data will help us here.
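The filtering behaviour described above can be approximated in a short Python sketch. This is an illustration only, not how DB2 builds its indexes: the function name index_description, the sample properties music:tracks and music:released, and the use of Python's float() and datetime.fromisoformat() as stand-ins for DB2's casts to DOUBLE and TIMESTAMP are all assumptions made for the example. Skipping empty elements is also a simplification.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

# Stand-ins for the three typed indexes; each maps an index name to a cast.
# Python's casts only approximate DB2's lexical rules (assumption).
CASTS = {
    "PROP_STRING": str,
    "PROP_NUMBER": float,
    "PROP_TIMESTAMP": datetime.fromisoformat,
}

def index_description(rdf_xml):
    """Return {index name: [(property tag, typed value), ...]} for one document."""
    root = ET.fromstring(rdf_xml)
    entries = {name: [] for name in CASTS}
    for child in root:  # every child element, like the /rdf:Description/* pattern
        text = (child.text or "").strip()
        if not text:
            continue  # simplification: rdf:resource links have no element content
        for name, cast in CASTS.items():
            try:
                entries[name].append((child.tag, cast(text)))
            except ValueError:
                pass  # value is not valid for this type: silently left out
    return entries

# Hypothetical sample document in the restricted RDF subset.
SAMPLE = """<rdf:Description
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:music="http://example.com/xmlns/music#"
    rdf:about="http://example.com/albums/1">
  <rdf:type rdf:resource="http://example.com/xmlns/music#album"/>
  <music:genre>Rock</music:genre>
  <music:tracks>10</music:tracks>
  <music:released>2008-05-16T12:00:00</music:released>
</rdf:Description>"""

if __name__ == "__main__":
    for index_name, rows in index_description(SAMPLE).items():
        print(index_name, rows)
```

In this sketch the music:tracks value appears in both the string and the numeric index, which is the duplication discussed above, while music:genre appears only in the string index.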

5.2 Query Transformation

While not truly implementing any query transformation or optimization rewrites, the DB2-specific RDF Store implementation has to transform the collection() function into a DB2-specific function. It is important that queries be scoped to the JRS project, and so the transformation uses an inner SQL select to ensure this (PROJECT_NAME is, of course, indexed in the database).

    --
    -- The query as submitted by the user to /jazz/projects/main/query
    --
    XQUERY
    declare namespace rdf = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";
    declare namespace music = "http://example.com/xmlns/music#";
    collection('JRS_RDF_INDEX')/rdf:Description[
        rdf:type/@rdf:resource = "http://example.com/xmlns/music#album" and
        music:genre = "Rock"]/fn:string(@rdf:about);

    --
    -- The query submitted to DB2
    --
    XQUERY
    declare namespace rdf = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";
    declare namespace music = "http://example.com/xmlns/music#";
    db2-fn:sqlquery("SELECT RDF FROM JRS_RDF.JRS_RDF_INDEX WHERE PROJECT_NAME='main'")
        /rdf:Description[
        rdf:type/@rdf:resource = "http://example.com/xmlns/music#album" and
        music:genre = "Rock"]/fn:string(@rdf:about);

11 http://publib.boulder.ibm.com/infocenter/db2luw/v9/topic/com.ibm.db2.udb.admin.doc/doc/c0024201.htm
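A minimal sketch of the rewrite described in this section, written in Python for brevity. The function name scope_query, the regular expression, and the whitelist of characters allowed in a project name are assumptions made for this example; the actual JRS transformation is not shown in this paper and may work differently. A purely textual substitution like this one would also rewrite a collection() call that appears inside a string literal or comment, which a real implementation would need to handle.

```python
import re

# Matches collection('JRS_RDF_INDEX') or collection("JRS_RDF_INDEX").
COLLECTION_CALL = re.compile(r"collection\(\s*(['\"])JRS_RDF_INDEX\1\s*\)")

# Hypothetical whitelist: the project name is spliced into an SQL string,
# so anything that could close the quote is rejected up front.
PROJECT_NAME = re.compile(r"[A-Za-z0-9_.-]{1,128}")

def scope_query(xquery, project):
    """Replace collection('JRS_RDF_INDEX') with a project-scoped db2-fn:sqlquery."""
    if not PROJECT_NAME.fullmatch(project):
        raise ValueError("unsupported project name: %r" % project)
    scoped = (
        'db2-fn:sqlquery("SELECT RDF FROM JRS_RDF.JRS_RDF_INDEX '
        "WHERE PROJECT_NAME='" + project + "'\")"
    )
    # A function is used as the replacement so the text is inserted literally.
    return COLLECTION_CALL.sub(lambda match: scoped, xquery)

if __name__ == "__main__":
    user_query = (
        "collection('JRS_RDF_INDEX')/rdf:Description["
        'music:genre = "Rock"]/fn:string(@rdf:about)'
    )
    print(scope_query(user_query, "main"))
```

Rejecting unexpected project names before building the SQL text is one simple way to keep the inner select from being altered by the request path; other approaches are possible.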

6 Future Work

The JRS server implementation described in this paper is built according to a set of specifications for the individual REST services (storage, indexing, search, query, etc.), and within IBM a team is developing an additional sample implementation of these specifications, including the RDF store work which is the focus of this paper. The sample implementation strives to be as close as possible to the commercial server in behavior; for example, both use Apache Lucene for full-text search. The sample implementation is, however, different in a number of significant ways, partly to simplify this implementation, which will be released under an open source license, and partly to allow us to experiment with different options allowed in the specifications.

1. The commercial server is developed in Java, the reference implementation is developed in Python.

2. The commercial server uses a Servlet container and frameworks from the Jazz project, the reference implementation uses the Django Project web framework.

3. The commercial server uses Apache Derby as its default database, the reference implementation uses SQLite.

4. The commercial server uses DB2 pureXML underlying the RDF store, the reference implementation uses RDFLib and SQLite.

5. The commercial server provides XQuery as its POST-based query option, the reference implementation uses SPARQL.

   a) Additionally, the sample server supports not only the JRS-style GET and POST query API but also completely implements the SPARQL protocol GET and POST bindings. The latter allows users to type complete queries into a browser, bookmark them, share them, etc.

The last item on this list is probably the most important to clients and the JRS team, as it allows us to compare the suitability of SPARQL and XQuery for the kinds of queries that our clients are likely to need. For example, we know that some teams need to be able to calculate the average of values stored in the properties. This would be easy using XQuery aggregate functions, a capability not provided in the base SPARQL specification. However, when resources are heavily inter-linked, XQuery is unwieldy for describing queries that span these links, and SPARQL is far better suited. With this in mind we hope to be able to provide SPARQL in our commercial JRS server to support these kinds of queries, and to really make use of the RDF store as RDF and not just as XML.
