Advanced XML Technologies: Schema, XPath, XQuery, and XSL

The National Virtual Observatory Book
ASP Conference Series, Vol. 382, © 2008
M. J. Graham, M. J. Fitzpatrick, and T. A. McGlynn, eds.

Chapter 57: Advanced XML Technologies: Schema, XPath, XQuery, and XSL

Raymond L. Plante

Introduction

Much of what happens behind the scenes in the VO happens using XML. A major reason for choosing XML is to take advantage of “off-the-shelf” standards and technologies that can help us manage our metadata. Most VO users will not need to know anything about the various manipulations of XML going on underneath their applications. However, users who begin to delve into programming for the VO, be it scripting to gather data for a VO research project or developing a general application, will start to see some of these technologies at work under the hood.

In the four sections of this chapter, we’ll look at four of the most useful XML technologies. With each one, we’ll start by highlighting how you might find it useful. The intent is not to make you proficient in these tools. Rather, by getting a general sense of how these technologies work, you will at least have some ability to debug your application when things go wrong. In some cases, you may acquire enough familiarity to edit and use pre-existing samples.

In this chapter, you will get a chance to make some of those edits and try out some tools you can find on the CD. We will use some tools from the adqllib package. If you tried out the exercises in Chapter 36, then you may have already built these tools. If not, you can do that now. On Linux/Mac OS:

    > cd $NVOSS_HOME/java/src/adqllib
    > ant

and on Windows:

    > cd %NVOSS_HOME%\java\src\adqllib
    > ant

1. Defining Metadata Using XML Schema

1.1. Introduction

What it is: XML Schema is a World Wide Web Consortium (W3C) standard for defining and verifying an XML grammar. It defines a set of legal XML tags and attributes and gives that set a name, or more precisely, a namespace. It also specifies what order the tags must appear in and what values are allowed.
A Schema-aware XML parser can read this definition and use it to test whether an XML document obeys the grammar rules.

Why you might care: The VO uses XML to encode metadata and service messages, and XML Schema is used to define the metadata encoding and message syntax. Being able to read XML Schema definitions may help you understand what metadata an application needs as well as how to encode it.

What you’ll get from this section: You should gain some rudimentary skills for reading an XML Schema document as a means of discerning the proper syntax for creating XML that conforms to the schema. You will also get a look at the role of namespaces in supporting multiple schemas in the same document. Finally, we’ll try out a tool for validating XML documents to determine whether they comply with a given schema; this will allow us to experiment with changes to a document.

If you want to try out the exercises described in this section, you will find the example files in the CD software distribution under $NVOSS_HOME/java/src/advxml (on Windows, %NVOSS_HOME%\java\src\advxml).

1.2. Schema Basics

In this section, we will consider the case of using XML to encode metadata. We will use XML Schema to define the metadata names and the value types. The XML format is well suited to metadata because it can easily capture both simple and complex values. That is, some metadata will be simple: a name and a value such as a string or a real number (e.g. “title” or “frequency”). Others can be complex, where several simple values are combined and given a name (e.g. “position” might be composed of a Right Ascension value and a Declination value). Thus, our metadata will be encoded as elements and attributes in our XML document. When we define a set of related metadata that are meant to be used together, we call that a schema. XML Schema is a particular standard for defining the elements and attributes that will be used to encode our metadata.
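The distinction between simple and complex values can be sketched with a few lines of Python. This is only an illustration: the record below and its element names ("record", "ra", "dec") are hypothetical and do not come from any VO schema; the standard-library xml.etree.ElementTree module is used to read it.

```python
import xml.etree.ElementTree as ET

# Hypothetical metadata record; the element names are illustrative only.
RECORD = """<record>
  <title>Sample radio survey</title>
  <frequency>1.4e9</frequency>
  <position>
    <ra>83.633</ra>
    <dec>22.014</dec>
  </position>
</record>"""


def describe(element):
    """Return 'complex' if the element contains child elements, else 'simple'."""
    return "complex" if len(element) > 0 else "simple"


root = ET.fromstring(RECORD)

# Classify each piece of metadata by whether it wraps other elements.
kinds = {child.tag: describe(child) for child in root}
for name, kind in kinds.items():
    print(f"{name}: {kind}")

# A simple value is read from the element's text;
# a complex value is read by visiting its children.
frequency = float(root.findtext("frequency"))
position = root.find("position")
ra = float(position.findtext("ra"))
dec = float(position.findtext("dec"))
```

Here "title" and "frequency" each carry a single value, while "position" groups two simple values under one name.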
The definition is done via an XML Schema document (also in XML format), which essentially contains a list of definitions. Typically, most of the definitions are of XML elements, attributes, and types, but they can also include definitions of groups of elements and attributes. To understand how these things are defined, we’ll step through a few simple examples, starting with the schema listed in Figure 1 (xmltech-simple.xsd in the CD distribution under $NVOSS_HOME/java/src/advxml).

    <?xml version="1.0" encoding="UTF-8"?>
    <xs:schema targetNamespace="http://nvoss.org/VOResource"
               xmlns:xs="http://www.w3.org/2001/XMLSchema"
               elementFormDefault="qualified">

      <xs:element name="resource">
        <xs:complexType>
          <xs:sequence>
            <xs:element name="title" type="xs:string" />
            <xs:element name="referenceURL" type="xs:anyURI"
                        minOccurs="0"/>
            <xs:element name="type" minOccurs="0" maxOccurs="unbounded">
              <xs:simpleType>
                <xs:restriction base="xs:string">
                  <xs:enumeration value="Archive" />
                  <xs:enumeration value="Catalog" />
                  <xs:enumeration value="Organisation" />
                </xs:restriction>
              </xs:simpleType>
            </xs:element>
          </xs:sequence>
          <xs:attribute name="created" type="xs:dateTime" />
        </xs:complexType>
      </xs:element>

    </xs:schema>

Figure 1. xmltech-simple.xsd: an annotated sample of a simple XML Schema document; see text for a detailed explanation. Fig. 2 contains an example of an XML document that conforms with this schema. (The annotations in the figure label the targetNamespace, the globally defined element, the anonymous type definition, the content model, the locally defined elements, and the occurrence restrictions.)

The example starts with the root element, <xs:schema>; it contains some attributes related to namespaces, which we will get to later. For now, just note that the xs: prefix denotes things that are defined as part of the XML Schema language.

The first interesting thing in this example is the definition of our first element using the <xs:element> tag. The name attribute indicates that the element will be called “resource”. We call this element a global element because its definition appears as a direct child of the <xs:schema> tag. Only global elements can appear as root elements of an XML document (but they can appear elsewhere, too).

The full definition of the “resource” element appears in the content of the <xs:element> tag, and the first tag inside it, <xs:complexType>, indicates that it has a complex type. All elements and attributes have an associated type that indicates what their values will look like. A complex type means that the element can contain other elements. Some of those other elements might also be defined to be complex, which captures the familiar hierarchical structure of XML documents. Obviously, attributes cannot be defined to be complex because they can only contain simple values, not other elements. Note also that this kind of type is referred to as an anonymous type because it is defined directly inside the <xs:element> tag. Later, we’ll look at an example of a globally defined type definition that is given a name (and therefore is not anonymous!).

Complex types are said to be described by a content model. This defines what other elements the type can contain and how they must be arranged. <xs:sequence> is the most common type of content model used in VO schemas. Inside that tag, we list the elements that may appear inside the “resource” element. The order in which these elements are listed in the definition is the order they must have inside the “resource” element. In other words, the “resource” element can contain “title,” “referenceURL,” and “type” elements in that order. Other types of content models, which we won’t cover (though you might surmise their meaning), include xs:any, xs:all, and xs:group.
The content model is further controlled by occurrence constraints. Each <xs:element> tag inside the <xs:sequence> can have minOccurs and maxOccurs attributes that specify how many times in a row the element can appear. For example, the “type” element can appear a minimum of zero times (that is, it doesn’t have to appear at all), or it can appear an unlimited number of times. If either minOccurs or maxOccurs is not specified, it is assumed to be one. Thus, the “title” element must appear once and only once. The “referenceURL” element is optional (because minOccurs is 0), but no more than one “referenceURL” is allowed.

The “title,” “referenceURL,” and “type” elements are what we call local elements (as opposed to global elements) because they are defined inside the definition of a type. The “type” element is assigned an anonymous type, much like “resource” was (because its type definition appears immediately inside the element definition), but the “title” and “referenceURL” elements are assigned predefined types, xs:string and xs:anyURI, via the type attribute. The xs: prefix indicates that these types are defined by the XML Schema standard.

All three of these elements have what we call simple types. These represent values that do not contain other elements but, rather, can appear as simple strings, such as an integer, date, URL, or just a generic string. XML Schema defines a large set of simple types that define the format of the value; xs:decimal, xs:integer, xs:positiveInteger, xs:date, and xs:dateTime are other useful simple types available. In the example, the “type” element has a simple, anonymous type. It’s our first example of a derived type.
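The default rule for occurrence constraints described above can be sketched in Python as follows. This is an illustration under stated assumptions, not part of the adqllib tools: the helper names (occurrence_limits, within_limits) are hypothetical, and the <xs:sequence> fragment is an abridged copy of the one in Figure 1 with the anonymous type of “type” left out.

```python
import xml.etree.ElementTree as ET

NS = {"xs": "http://www.w3.org/2001/XMLSchema"}

# Abridged copy of the xs:sequence from Figure 1.
SEQUENCE = """<xs:sequence xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="title" type="xs:string"/>
  <xs:element name="referenceURL" type="xs:anyURI" minOccurs="0"/>
  <xs:element name="type" minOccurs="0" maxOccurs="unbounded"/>
</xs:sequence>"""


def occurrence_limits(element_decl):
    """Return (min, max) for an xs:element declaration; max is None when unbounded."""
    low = int(element_decl.get("minOccurs", "1"))    # missing minOccurs means 1
    high_text = element_decl.get("maxOccurs", "1")   # missing maxOccurs means 1
    high = None if high_text == "unbounded" else int(high_text)
    return low, high


def within_limits(counts, limits):
    """Check observed element counts against the (min, max) limits."""
    for name, (low, high) in limits.items():
        n = counts.get(name, 0)
        if n < low or (high is not None and n > high):
            return False
    return True


declarations = ET.fromstring(SEQUENCE).findall("xs:element", NS)
limits = {decl.get("name"): occurrence_limits(decl) for decl in declarations}
print(limits)

# One title and three types is allowed; a missing title or two referenceURLs is not.
ok = within_limits({"title": 1, "type": 3}, limits)
missing_title = within_limits({"type": 1}, limits)
two_urls = within_limits({"title": 1, "referenceURL": 2}, limits)
```

The limits recovered here match the reading given in the text: “title” is required exactly once, “referenceURL” is optional but not repeatable, and “type” may appear any number of times.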
