The National Virtual Observatory Book ASP Conference Series, Vol. 382, © 2008 M. J. Graham, M. J. Fitzpatrick, and T. A. McGlynn, eds.

Chapter 57: Advanced XML Technologies: Schema, XPath, XQuery, and XSL

Raymond L. Plante

Introduction

Much of what happens behind the scenes in the VO happens using XML. A major reason for choosing XML is to take advantage of “off-the-shelf” standards and technologies that can help us manage our metadata. Most VO users will not need to know anything about the various manipulations of XML going on underneath their applications. However, users who begin to delve into programming for the VO, be it scripting to gather data for a VO research project or developing a general application, will start to see some of these technologies at work under the hood.

In the four sections of this chapter, we’ll look at four of the most useful XML technologies. With each one, we’ll start by highlighting how you might find it useful. The intent is not to make you proficient in these tools. Rather, by getting a general sense of how these technologies work, you will at least have some ability to debug your application when things go wrong. In some cases, you may acquire enough familiarity to edit and use pre-existing samples.

In this chapter, you will get a chance to make some of those edits and try out some tools you can find on the CD. We will use some tools from the adqllib package. If you tried out the exercises in Chapter 36, then you may have already built these tools. If not, you can do that now.

On Linux/Mac OS:

    > cd $NVOSS_HOME/java/src/adqllib
    > ant

and on Windows:

    > cd %NVOSS_HOME%\java\src\adqllib
    > ant

1. Defining Metadata Using XML Schema

1.1. Introduction

What it is: XML Schema is a World Wide Web Consortium (W3C) standard for defining and verifying an XML grammar. It defines a set of legal XML tags and attributes and gives it a name, or more precisely, a namespace. It also specifies what order the tags must appear in and what values are allowed. A Schema-aware XML parser can read this definition and use it to test if an XML document obeys the grammar rules.

Why you might care: The VO uses XML to encode metadata and service messages, and XML Schema is used to define the metadata encoding and message syntax. Having the ability to read XML Schema definitions may help you understand what metadata is needed by an application as well as how to encode it.

What you’ll get from this section: You will hopefully gain some rudimentary skills for reading an XML Schema document as a means of discerning the proper syntax for creating XML that conforms to the schema. You will also get a look at the role of namespaces in supporting multiple schemas in the same document. Finally, we’ll try out a tool for validating XML documents to determine if they comply with a given schema; this will allow us to experiment with changes to a document.

If you want to try out the exercises described in this section, you will find the example files in the CD software distribution under $NVOSS_HOME/java/src/advxml (on Windows, %NVOSS_HOME%\java\src\advxml).

1.2. Schema Basics

In this section, we will consider the case of using XML to encode metadata. We will use XML Schema to define the metadata names and the value types. The XML format is great for metadata because it can easily capture both simple and complex values. That is, some metadata will be simple: a name and a value like a string or a real number (e.g. “title” or “frequency”). Others can be complex, where several simple values are combined together and given a name (e.g. “position” might be composed of a Right Ascension value and a Declination value). Thus, our metadata will be encoded as elements and attributes in our XML document.

When we define a set of related metadata that are meant to be used together, we call that a schema. XML Schema is a particular standard for defining the elements and attributes that will be used to encode our metadata. The definition is done via an XML Schema document (also in XML format) which essentially contains a list of definitions. Typically, most of the definitions are of XML elements, attributes, and types, but they can also include definitions of groups of elements and attributes.

To understand how these things are defined, we’ll step through a few simple examples, starting with the schema listed in Figure 1 (xmltech-simple.xsd in the CD distribution under $NVOSS_HOME/java/src/advxml). The example starts with the root element, <xs:schema>; it contains some attributes related to namespaces which we will get to later. For now, just note that the xs: prefix denotes things that are defined as part of the XML Schema language.

The first interesting thing in this example is the definition of our first element using the <xs:element> tag. The name attribute indicates that the element will be called “resource”. We call this element a global element because its definition appears as a direct child of the <xs:schema> tag. Only global elements can appear as root elements of an XML document (but they can appear elsewhere, too).


The full definition of the “resource” element appears in the content of the <xs:element> tag, and the first tag inside it, <xs:complexType>, indicates that it has a complex type. All elements and attributes have an associated type that indicates what their values will look like. A complex type means that the element can contain other elements. Some of those other elements inside might also be defined to be complex, which captures the familiar hierarchical structure of XML documents. Obviously, attributes cannot be defined to be complex because they can only contain simple values, not other elements. We note also that this kind of type is referred to as an anonymous type because it’s being defined directly inside the <xs:element> tag. Later, we’ll look at an example of a globally defined type definition that is given a name (and therefore is not anonymous!).

[Figure 1 listing not reproduced. Its annotations mark the targetNamespace, the anonymous type definition, the content model, the elements, and the occurrence restrictions.]

Figure 1. xmltech-simple.xsd: an annotated sample of a simple XML Schema document; see text for a detailed explanation. Fig. 2 contains an example of an XML document that conforms with this schema.

Complex types are said to be described by a content model. This defines what other elements the type can contain and how they must be arranged. <xs:sequence> is the most common type of content model used in VO schemas. Inside that tag, we list the elements that may appear inside the “resource” element. The order in which these elements are listed in the definition is the order they must have inside the “resource” element. In other words, the “resource” element can contain “title,” “referenceURL,” and “type” elements in that order. Other types of content models which we won’t cover (though you might surmise their meaning) include xs:any, xs:all, and xs:group.

The content model is further controlled by occurrence constraints. Each <xs:element> tag inside the <xs:sequence> can have minOccurs and maxOccurs attributes that specify how many consecutive occurrences of the element can appear. For example, the “type” element can appear a minimum of zero times (that is, it doesn’t have to appear at all), or it can appear an unlimited number of times. If either minOccurs or maxOccurs is not specified, it is assumed to be one. Thus, the “title” element must appear once and only once. The “referenceURL” is optional (because minOccurs=0), but no more than one “referenceURL” is allowed.

The “title,” “referenceURL,” and “type” elements are what we call local elements (as opposed to global elements) because they are defined inside the definition of a type. The “type” element is being assigned an anonymous type, much like “resource” was (because its type definition appears immediately inside the element definition), but the “title” and “referenceURL” elements are being assigned predefined types, xs:string and xs:anyURI, via the type attribute. The xs: prefix indicates that these types are defined by the XML Schema standard. All three of these elements have what we call simple types. These represent values that do not contain other elements but, rather, have values that can appear as simple strings, such as an integer, date, URL, or just a generic string. XML Schema defines a large set of simple types that define the format of the value. xs:decimal, xs:integer, xs:positiveInteger, xs:date, and xs:dateTime are other useful simple types available.

In the example, the “type” element has a simple, anonymous type. It’s our first example of a derived type.
Here we want to say that the value is a string that is “restricted” to only specified values, namely “Archive,” “Catalog,” and “Organisation.” The <xs:restriction> element allows us to derive a new type from the xs:string type.

Now that you know a little bit more about types, it’s worth clarifying what is meant by a local element. Because it is defined as part of a containing type definition, a local element can only appear in the context of that type. That is, a <title> element tag can only appear inside a <resource> element. Now, another complex type could be defined to contain an element called “title”; however, technically, that “title” would not be the same element as the “title” inside a “resource.” The meaning of “title” could be different in the two contexts. In fact, the two titles could have different types assigned to them: one could be an xs:string, and the other, an xs:boolean.

Finally, the definition of the “resource” element concludes with the definition of a single attribute called “created” whose type is xs:dateTime. By default, this will be an optional attribute of “resource.” If we wanted to make the inclusion of this attribute mandatory, we would add use="required" to the <xs:attribute> element.

We have a schema now in Fig. 1, so let’s have a look at an XML document that uses this schema. Figure 2 illustrates what we refer to as an instance document. This is an XML document that actually uses the tags to encode the values of metadata. It starts with <resource> as our root element and includes the optional created attribute set to the date and time the document was created. The root element also includes some extra attributes related to namespaces which we will discuss in the next section. Inside that element we have all the local elements in the proper order. In particular, it includes two occurrences of the <type> element. Thus, it appears to conform to the schema defined in Fig. 1, but how can we know for sure? The rules are spelled out in a structured way in the schema file, so a computer program should be able to tell us. To understand how such a program can check our instance document, we need to understand a bit about namespaces.

[Figure 2 listing not reproduced. The surviving values are the title “NCSA Astronomy Digital Image Library,” the referenceURL http://adil.ncsa.uiuc.edu/, and the types “Archive” and “Catalog.” Its annotations mark the default namespace and the attribute that gives the location of the VOResource schema.]

Figure 2. xmltech-simple.xml: a sample instance document that conforms to the schema defined in Fig. 1. The xsi:schemaLocation attribute indicates the local file in which the schema document can be found.

1.3. Namespaces and Validation

The simple example in Figure 1 represents a schema: a language of terms that can describe a VO resource.
We would like to give this schema a name to distinguish it from other schemas. More precisely, we want to assign the schema to a namespace. In general, a namespace is a set of names that are internally unique, and we give that set a name. In the case of XML, the namespace includes the names of the elements and attributes that make up our schema. We specify the name of our schema in our Schema document with a targetNamespace attribute. In Fig. 1, this attribute assigns the name “http://nvoss.org/VOResource” to the namespace that contains our schema. As you might guess, a namespace name takes the form of a Uniform Resource Identifier (URI).

Now consider our question: how can we tell if our instance document conforms to our schema? We call the process of answering this question validation, that is, checking whether the document makes valid use of the grammar defined in our schema. If a program is going to validate our instance document for us, it needs to know what schema it is going to check the document against. We signify this in the instance document (Fig. 2) using the xmlns attribute (an abbreviation for “XML namespace”). We announced at the top of this file that this document uses elements from the “http://nvoss.org/VOResource” namespace. After all, there could be another schema out there called “http://fred.com” that defines an element called “resource.” That would not matter to our checker program because it only need concern itself with the space of names matching our namespace.

You might realize now that because namespaces provide a way to distinguish two different elements that happen to have the same name, it is actually possible to include elements from the “http://fred.com” and “http://nvoss.org/VOResource” namespaces in the same XML document. The xmlns attribute identifies the default namespace. We could include in our instance document elements or attributes from other namespaces if we had a way to indicate what namespace they are from.
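To make the idea of a default namespace concrete, the short Python sketch below parses a document shaped like Fig. 2 and prints the names that a namespace-aware parser actually sees. The document text is our own reconstruction from the description in Section 1.2 (the created value is made up, and the xsi:schemaLocation attribute is left out), and split_name is a helper introduced only for this illustration. Note that the standard-library parser used here does not perform XML Schema validation; it only shows how xmlns qualifies element names.

```python
import xml.etree.ElementTree as ET

# Reconstruction of an instance document like Fig. 2 (values partly assumed).
DOC = """\
<resource xmlns="http://nvoss.org/VOResource"
          created="2006-09-01T12:00:00">
  <title>NCSA Astronomy Digital Image Library</title>
  <referenceURL>http://adil.ncsa.uiuc.edu/</referenceURL>
  <type>Archive</type>
  <type>Catalog</type>
</resource>
"""


def split_name(tag):
    """Split ElementTree's '{namespace}local' form into (namespace, local)."""
    if tag.startswith("{"):
        namespace, local = tag[1:].split("}", 1)
        return namespace, local
    return None, tag


root = ET.fromstring(DOC)
root_name = split_name(root.tag)
child_names = [split_name(child.tag) for child in root]

# Every element inherits the default namespace declared with xmlns.
for namespace, local in [root_name] + child_names:
    print(f"{local:14s} in {namespace}")

# An attribute without a prefix is NOT placed in the default namespace.
print(root.attrib)
```

Every element name comes back paired with “http://nvoss.org/VOResource,” which is how a checker program can tell our “resource” apart from one defined under some other namespace.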
Namespace prefixes are the way to indicate which namespace an element or attribute comes from. Any XML document can define a short prefix that stands for a namespace name using an attribute of the form xmlns:prefix. For example, in Fig. 1, we defined a prefix at the top of the document:

    xmlns:xs="http://www.w3.org/2001/XMLSchema"

That is, we defined a prefix xs that is short for the XML Schema namespace. In this document, we chose not to define a default namespace using xmlns. Instead, we explicitly labeled each element with the xs: prefix to denote the namespace it belongs to. We also needed the prefix to indicate the simple types we used from that namespace.

Getting back to our question, if our validating program is going to check our instance document against our schema, it needs to know where it can find the Schema document. In Fig. 2, you can see that we did that with the xsi:schemaLocation attribute. In general, this attribute contains one or more pairs of values, where the first value is a namespace name and the second value is the location of the Schema file. Often the location half is set to a URL; however, in our case, we set the location to be simply a file on local disk. Note that the location provided by this attribute is just a recommendation. A tool that reads and uses this document may prefer to consult its own cached copy of the schema. Also, notice that this attribute has the xsi: prefix, which indicates that it is not from our default namespace. Just above xsi:schemaLocation, we defined the xsi prefix to refer to the “http://www.w3.org/2001/XMLSchema-instance” namespace.

Now we’re ready to actually try validating our instance document. We will do this using a command called validate which you built and installed when you followed the software setup instructions at the start of this chapter. The sample files covered in this chapter are located in the $NVOSS_HOME/java/src/advxml directory.
If you change into that directory and type,

    validate xmltech-simple.xml

then you should see,

    xmltech-simple.xml: 478 ms (4 elems, 1 attrs, 0 spaces, 86 chars)

Essentially, if the program does not report any errors, then the instance document is valid.

Suggested Exercise: Try editing the xmltech-simple.xml file by adding or removing elements. Be sure to make changes that are not compliant with the schema so that you can see what an error looks like.

1.4. Global Reusable Types and Elements

In our first schema document example, we only defined anonymous types; that is, each type definition appears as part of an element definition. However, we could define them as global types. When we define a global type, we give it a name; then, we can assign that type to an element by providing the name to the <xs:element> tag’s type attribute. One advantage of the global type is that it can be assigned to multiple elements. These elements may have different names and different semantic meanings; however, they would share a common format for the values they hold. This is why we call them “reusable.” Global types are also important for extending schemas. Figure 3 illustrates how we can alter the schema we introduced in Fig. 1 to make use of globally defined types.
In our original example, we had defined anonymous types for two elements: “resource,” our root element, and “type,” one of the “resource” child elements. In this example, we break those out into stand-alone type definitions which we will call “Resource” and “Type,” respectively.

    <?xml version="1.0" encoding="UTF-8"?>
    <xs:schema targetNamespace="http://nvoss.org/VOResource"
               xmlns:res="http://nvoss.org/VOResource"
               xmlns:xs="http://www.w3.org/2001/XMLSchema"
               elementFormDefault="qualified">

      <!-- xmlns:res above defines a prefix for this namespace -->

      <!-- Type defined here... -->
      <xs:simpleType name="Type">
        <xs:restriction base="xs:string">
          <xs:enumeration value="Archive" />
          <xs:enumeration value="Catalog" />
          <xs:enumeration value="Organisation" />
        </xs:restriction>
      </xs:simpleType>

      <xs:complexType name="Resource">
        <xs:sequence>
          <xs:element name="title" type="xs:string" />
          <xs:element name="referenceURL" type="xs:anyURI" minOccurs="0"/>
          <!-- ...and used here -->
          <xs:element name="type" type="res:Type"
                      minOccurs="0" maxOccurs="unbounded"/>
        </xs:sequence>
        <xs:attribute name="created" type="xs:dateTime" />
      </xs:complexType>

      <xs:element name="resource" type="res:Resource" />

    </xs:schema>

Figure 3. xmltech-globaltypes.xsd: the simple schema introduced in Fig. 1, modified to use global types.

For both of these, we simply pulled out the type definition from inside the element definition, made it a direct child of the <xs:schema> tag, and added an attribute called name. For instance, in Fig. 1, the <xs:element> tag that defined the local “type” element contained an <xs:simpleType> tag. In Fig. 3, this <xs:simpleType> tag is the first definition to appear under the <xs:schema> tag.
We make use of this now global type in the definition of the “type” element by adding the type attribute to the <xs:element> tag that defines it:

    <xs:element name="type" type="res:Type"
                minOccurs="0" maxOccurs="unbounded"/>

Notice that we had to include a prefix, res:, in front of the type name to indicate what schema the type definition comes from. We defined this prefix at the top of the document to refer to our current schema.

The “Type” type definition is followed by the definition of the “Resource” complex type. To assign this type to an element we can use, we have the <xs:element> at the end of the document. Its guts have been extracted, and we added the type="res:Resource" attribute to assign its type.

It might now occur to you that the namespace prefix makes it possible to set the value of the type attribute to a type defined in an entirely different schema created by someone else. To accomplish this, we would simply define a namespace prefix for that external schema and then refer to the type we want to borrow in the type attribute by its proper name, properly prefixed. If we want our validating tool to find the external schema, then we should also add a namespace-location pair to the xsi:schemaLocation attribute (see the VODataService-v1.0.xsd file as an example). In order to reuse a type from an external schema in this way, that type must be defined as a global type.

Defining our schema in terms of global types as we did in Fig. 3 does not offer us any advantages over the anonymous types of Fig. 1 as far as how we use it. The example in Fig. 2 is compliant with both versions of the schema. Nevertheless, it is common to use the global type approach when defining schemas, and one important reason is that it allows other schemas to reuse your types and even extend them, as we will see later.

Global element definitions can also be reused. An example of this is provided in the xmltech-elrefs.xsd file.
In that file, we made the “type” element a global element by adding the line:

    <xs:element name="type" type="res:Type"/>

We then can make use of this element inside the “Resource” type definition by replacing the type attribute with a ref attribute:

    <xs:element ref="res:type" minOccurs="0" maxOccurs="unbounded"/>

Global elements offer the same advantages as global types: reusability and extensibility. In addition, a global element can be used as an XML document’s root element. For example, we could now create a valid XML document with <type> as the root element.

1.5. Schema Documentation

If you were faced with producing a document like the one in Fig. 2, you would probably want to know more about what the terms “referenceURL” or “type” mean. Unfortunately, XML Schema validation cannot determine whether your document makes semantic sense, so documenting a schema is critical to helping authors create meaningful documents with it. XML Schema provides mark-up for including explanatory comments directly in the schema (see also the xmltech-documented.xsd file):

    <xs:complexType name="Resource">
      <xs:annotation>
        <xs:documentation>
          Any entity or component of a VO application that is
          describable and identifiable by a IVOA Identifier.
        </xs:documentation>
      </xs:annotation>
      <xs:sequence>
        <xs:element name="title" type="xs:string" >
          <xs:annotation>
            <xs:documentation>
              the full name given to the resource
            </xs:documentation>
          </xs:annotation>
        </xs:element>
        <xs:element name="referenceURL" type="xs:anyURI" minOccurs="0">
          <xs:annotation>
            <xs:documentation>
              URL pointing to a human-readable document describing
              this resource.
            </xs:documentation>
          </xs:annotation>
        </xs:element>

The xs:annotation and xs:documentation elements in the example above allow us to place the semantic definition of the “referenceURL” element directly inside that element’s syntactic definition.
Not only can we provide semantic definitions for each element, we can also provide one for the complex type, “Resource,” as a whole. You might wonder why we didn’t just use simple XML comments, <!-- … -->, to provide this information. By using actual XML elements to provide this information in a more structured way, it’s possible for applications to extract this information and incorporate it into displays of the metadata.

1.6. Extending Schemas with Derived Types

The ability to extend an XML schema is a vitally important capability for our use of XML within the VO. It allows us to take a general purpose schema and customize it for a special purpose by adding more specific metadata. It gives us a mechanism for evolving our metadata standards. It is the means by which we build extensive vocabularies out of reusable building blocks.

At the heart of schema extension is the creation of new XML types derived from existing types. There are two ways with XML Schema to derive a new type from an existing one: extension and restriction. The extension method can only be applied to complex types and is the derivation method used most in VO schemas. Extension allows one to add additional elements to the content of an existing type. Consider this example in which we define a new complex type called “Service” derived from our previously defined “Resource” type.

    <xs:complexType name="Service">
      <xs:complexContent>
        <xs:extension base="res:Resource">
          <xs:sequence>
            <xs:element name="accessURL" type="xs:anyURI" />
          </xs:sequence>
        </xs:extension>
      </xs:complexContent>
    </xs:complexType>

    <xs:element name="service" type="res:Service"/>

The use of the xs:extension element here indicates that the content of the “Service” type is the same as that of the “Resource” type (the base type) but is followed by an additional element called “accessURL”.
In this example, we’ve also assigned our new type to an element called “service” so we can use it in a document instance. Here’s what this would look like:

    <service xmlns="http://nvoss.org/VOResource"
             created="1994-11-01T12:00:00" >
      <title>ADIL Query Page</title>
      <referenceURL>http://adil.ncsa.uiuc.edu/help.html</referenceURL>
      <type>Archive</type>
      <accessURL>http://adil.ncsa.uiuc.edu/QueryPage.html</accessURL>
    </service>
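The effect of extension on the content model can be sketched in a few lines of Python. The sketch below is not an XML Schema validator: it hand-codes the sequence and occurrence rules described in Section 1.2 as a list of (name, minOccurs, maxOccurs) entries, builds the “Service” model by appending “accessURL” to the “Resource” model, and checks the instance above against both. The function matches_sequence and the two model lists are our own illustrative names, not part of any library.

```python
import xml.etree.ElementTree as ET

NS = "http://nvoss.org/VOResource"

# (name, minOccurs, maxOccurs); None stands for "unbounded".
RESOURCE = [("title", 1, 1), ("referenceURL", 0, 1), ("type", 0, None)]
# Extension: the base sequence, followed by the added element.
SERVICE = RESOURCE + [("accessURL", 1, 1)]


def matches_sequence(element, model):
    """Return True if the element's children obey the ordered model."""
    names = [child.tag.split("}", 1)[-1] for child in element]
    pos = 0
    for name, min_occurs, max_occurs in model:
        count = 0
        while pos < len(names) and names[pos] == name:
            if max_occurs is not None and count == max_occurs:
                break
            pos += 1
            count += 1
        if count < min_occurs:
            return False
    return pos == len(names)  # no unexpected elements left over


SERVICE_DOC = f"""\
<service xmlns="{NS}" created="1994-11-01T12:00:00">
  <title>ADIL Query Page</title>
  <referenceURL>http://adil.ncsa.uiuc.edu/help.html</referenceURL>
  <type>Archive</type>
  <accessURL>http://adil.ncsa.uiuc.edu/QueryPage.html</accessURL>
</service>
"""

service = ET.fromstring(SERVICE_DOC)
print(matches_sequence(service, SERVICE))   # checked against the extended model
print(matches_sequence(service, RESOURCE))  # checked against the base model
```

The instance satisfies the extended model but not the base one, because the base sequence has no place for the trailing “accessURL” element. A real validator does far more (types, attributes, namespaces), so this should be read only as a picture of the ordering rule.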

So, you can see that a <service> element looks just like a <resource> element except that it includes an extra <accessURL> element at the end.

The second form of derivation uses the xs:restriction element instead of xs:extension, and it can be applied to both simple and complex types. The application to complex types is relatively rare in VO metadata schemas. It can be used to restrict the content of the type by disallowing the use of optional elements, limiting the number of occurrences of particular elements, or setting fixed or default values where none were previously set in the base type. By contrast, restriction of simple types is much more common and interesting. Restriction is what allows you to restrict string values to a finite set of allowed words as we did with the “type” element in our very first example in Section 1.2. You can also restrict a string value to a certain pattern or an integer value to a specific range.

Note that in both derivation examples, we referred to the base type with a namespace-qualified name, e.g. res:Resource. Thus, the base type can come from either the same schema as the derived type or from an entirely different schema. We will return to this idea later in this section.

In our extension example above, we really can for the most part only use our new <service> element as the root element of a document; that’s because we haven’t defined any types that will take our <service> element (or the underlying “Service” type) as content. A more flexible extensibility would allow us to insert our extended metadata anywhere within the XML document. That is, suppose we would like to take the content of an element of some base type deep inside an XML document and substitute in the content of our derived type. (Object-oriented programmers can view this as a kind of polymorphism.) XML Schema offers two techniques for doing this. The “xsi:type” technique involves a special label placed in the instance document that marks where a derived type is being used, while the “substitution group” technique involves a label in the schema document.

To see the “xsi:type” technique in action, imagine our schema includes the “Resource” and “Service” definitions given above and we have defined an additional element that assigns the “Resource” type to one of its sub-elements:

With this definition, the containing element can hold one or more <resource> sub-elements; consequently, the instance document can look like this:

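[Instance listing not reproduced. The surviving values, in document order, are:]

    first resource:  NCSA Astronomy Digital Image Library; http://adil.ncsa.uiuc.edu; Archive
    second resource: ADIL Query Page; http://adil.ncsa.uiuc.edu; Archive; http://adil.ncsa.uiuc.edu/QueryPage.html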
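As a rough illustration of the xsi:type label, the Python sketch below builds an instance of this general shape and reports the declared type of each record. Several details are assumptions made only for the example: “resourceList” is a hypothetical name for the containing element, and the res prefix is bound to the same namespace as the default so that the value res:Service has something to refer to.

```python
import xml.etree.ElementTree as ET

NS = "http://nvoss.org/VOResource"
XSI = "http://www.w3.org/2001/XMLSchema-instance"

# "resourceList" is a hypothetical name for the containing element.
DOC = f"""\
<resourceList xmlns="{NS}" xmlns:res="{NS}" xmlns:xsi="{XSI}">
  <resource>
    <title>NCSA Astronomy Digital Image Library</title>
    <referenceURL>http://adil.ncsa.uiuc.edu</referenceURL>
    <type>Archive</type>
  </resource>
  <resource xsi:type="res:Service">
    <title>ADIL Query Page</title>
    <referenceURL>http://adil.ncsa.uiuc.edu</referenceURL>
    <type>Archive</type>
    <accessURL>http://adil.ncsa.uiuc.edu/QueryPage.html</accessURL>
  </resource>
</resourceList>
"""

root = ET.fromstring(DOC)
declared_types = []
for resource in root.findall(f"{{{NS}}}resource"):
    # xsi:type is a namespace-qualified attribute; if it is absent, the
    # element keeps the type given to it in the schema ("Resource").
    declared = resource.get(f"{{{XSI}}}type", "(declared type: Resource)")
    declared_types.append(declared)
    print(resource.findtext(f"{{{NS}}}title"), "->", declared)
```

The parser hands back the attribute value as the plain string res:Service; it is up to a schema-aware tool to resolve the res prefix against the namespace declarations in scope.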

In the first resource listed, we are using a normal <resource> element with its unextended metadata. In the second resource, however, we have added an xsi:type attribute that indicates we are substituting in a derived resource type called “Service”, allowing us to add the extended metadatum, <accessURL>. The official IVOA version of the VOResource schema uses this technique to insert extension metadata.

We can now consider the “substitution group” technique. Again, assume our schema includes the “Resource” and “Service” type definitions. To make this technique work, we must define a global element for each of these types. Also, our definition of the containing element will look slightly different:

The substitutionGroup attribute attached to the “service” element definition says that a <service> element may be substituted in wherever a <resource> element is allowed. To enable this inside the containing element, we define the contents in terms of the global <resource> element using the ref attribute. Our instance document now looks like this:

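[Instance listing not reproduced. The surviving values, in document order, are:]

    first record:  NCSA Astronomy Digital Image Library; http://adil.ncsa.uiuc.edu; Archive
    second record: ADIL Query Page; http://adil.ncsa.uiuc.edu; Archive; http://adil.ncsa.uiuc.edu/QueryPage.html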


This time, instead of using the xsi:type attribute in the second record, we use the <service> element.

In our two examples above, the base “Resource” type and the derived “Service” type were both defined in the same schema; however, it’s more interesting when the base and derived types are in different schemas authored by different people. (I like to refer to the schema containing the base types as the base schema and the schema containing the derived types as the extension schema.) One reason this is interesting is that it highlights a conceptual difference between the xsi:type and substitution group techniques. The substitution group technique cannot be used unless the author of the base schema has defined a global element set to the base type; thus, the base schema author controls where this type of extensibility can be applied. The xsi:type technique has no such restriction; thus, the extension schema author takes full control over where the extensibility can be applied.

To create an extension schema, you must include two things. First, you need to be sure to define a namespace prefix for the schema (or schemas) you will be drawing from. Second, you need to include the <xs:import> element in order to “load” the base schema’s definitions. The instance documents must also define namespace prefixes for both the base and the extension schemas.

The IVOA registry framework makes great use of this type of extensibility via the VOResource family of schemas. The set is anchored by a core VOResource schema that defines types for describing resources via generic metadata. A variety of extension schemas (e.g. VODataService, VORegistry, ConeSearch, SIA, etc.) add new metadata for describing specific types of resources (see Chapter 41 for more details). This approach allows us to evolve the VOResource vocabulary in a backward compatible way. The introduction of a new extension schema does not invalidate existing documents that only use the older schemas. It can, however, affect applications that process VOResource instance documents. To protect themselves from this evolution, they need to be flexible enough to ignore parts of the instance document that contain extension metadata they do not yet understand.
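One way an application might stay flexible in this sense is sketched below in Python. The reader collects the core metadata it knows about and simply sets aside anything else rather than failing. The set KNOWN and the function read_core_metadata are illustrative names of our own, and the sample document reuses the “service” example from this section.

```python
import xml.etree.ElementTree as ET

NS = "http://nvoss.org/VOResource"
KNOWN = {"title", "referenceURL", "type"}  # core metadata this reader understands


def read_core_metadata(element):
    """Collect known core metadata; note, but do not fail on, anything else."""
    core, skipped = {}, []
    for child in element:
        local = child.tag.split("}", 1)[-1]  # drop the '{namespace}' part
        if local in KNOWN:
            core.setdefault(local, []).append(child.text)
        else:
            skipped.append(local)
    return core, skipped


DOC = f"""\
<service xmlns="{NS}">
  <title>ADIL Query Page</title>
  <type>Archive</type>
  <accessURL>http://adil.ncsa.uiuc.edu/QueryPage.html</accessURL>
</service>
"""

core, skipped = read_core_metadata(ET.fromstring(DOC))
print(core)     # the metadata the reader understood
print(skipped)  # extension metadata it chose to ignore
```

A reader written this way keeps working when a new extension schema adds elements such as “accessURL”; it just does not make use of them.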

Suggested Exercise: The advxml directory on the companion CD includes an example of a resource description document called xmltech-adil.xml that is compliant with the official VOResource schema. Also included are copies of the VOResource core schema (VOResource-v1.0.xsd) and the VODataService extension schema (VODataService-v1.0.xsd). To see what you’ve learned in this section, try adding and removing metadata from the instance document, using the schemas as a guide. Check your results with the validate tool.

Finally, as we leave the topic of XML Schema, we will include one parting word about namespace prefixes. A VOResource instance document will often draw on elements and types defined in as many as three different schemas. Consequently, each of these schemas must have a prefix defined for it. Usually when one creates an instance document that draws on multiple schemas, one must carefully label elements with the proper prefixes (or else make careful use of the xmlns attribute). This can get complicated (because you need to know which elements come from which schemas) and is prone to error. The VOResource schemas get around this problem by setting in the XSD files the attribute elementFormDefault="unqualified"; as a result, only the values of xsi:type require tagging with a namespace prefix. It also makes forming XPath pointers easier, which leads us to our next XML technology.
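The practical difference between the two settings shows up as soon as you try to point at a child element. In the Python sketch below, the two one-line documents are made up for the comparison: the first follows the qualified style of our earlier examples, and the second follows the unqualified style, in which only the root element carries a prefix.

```python
import xml.etree.ElementTree as ET

NS = "http://nvoss.org/VOResource"

# Qualified style: local elements live in the schema's namespace.
qualified = f'<resource xmlns="{NS}"><title>ADIL</title></resource>'
# Unqualified style: only the (global) root element is in the namespace.
unqualified = f'<r:resource xmlns:r="{NS}"><title>ADIL</title></r:resource>'

q_root = ET.fromstring(qualified)
u_root = ET.fromstring(unqualified)

# With qualified children, a bare name finds nothing...
print(q_root.find("title"))
# ...and the path must spell out the namespace.
print(q_root.find(f"{{{NS}}}title").text)
# With unqualified children, the bare name is enough.
print(u_root.find("title").text)
```

The same holds for full XPath processors: with unqualified local elements, the steps of a path can be written as plain names without binding any prefixes first.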

For More Information: There are many books available on XML Schema as well as resources on-line. For a comprehensive yet readable tutorial on XML Schema, this author recommends the “XML Schema Part 0: Primer,” which is the first of the three documents from the W3C that make up the official XML Schema specification.

2. Extracting Information using XPath

What it is: XPath is a W3C standard syntax for pointing to elements, attributes and/or their values in an XML file. Once you can point to a bit of XML, you can extract it, transform it, or ask questions about it.

Why you might care: XPath is a building block for two other useful XML technologies: XQuery and XSL. It is also used in ADQL queries to search VO resource registries.

What you’ll get from this section: You will be able to form simple XPath queries.

2.1. An Example: Can You Tell Me How to Find Enlightenment?

An XPath is a string that in its simplest form looks much like a file path, but we can think of it as a set of directions for getting from one place in an XML document to another. Consider this sample bit of XML:

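<Chicago>
  <neighborhood name="Wrigleyville">
    <light/>
    <light/>
    <light>
      <WrigleyField>
        <snack>kosher hot dog</snack>
      </WrigleyField>
    </light>
  </neighborhood>
</Chicago>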


Our quest for enlightenment can be reduced to the following directions and the XPath that encodes them:

1. Start at the top of the document
2. Go to Chicago
3. Find the neighborhood named “Wrigleyville”
4. Go to the 3rd light
5. And you’re there!

/Chicago/neighborhood[@name="Wrigleyville"]/light[3]/WrigleyField

In other words, this XPath tells us where the <WrigleyField> element is relative to the top of the document (identified by the leading slash).
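To see these directions followed mechanically, here is a minimal Python sketch using the standard library's xml.etree.ElementTree module. The XML string is an assumption: a minimal document consistent with the XPaths used in this section, not necessarily the chapter's exact listing. ElementTree implements only a subset of XPath and evaluates paths relative to an element, so the leading /Chicago step is taken by parsing the document (the root is <Chicago>) and the rest of the path is written relative to it.

```python
import xml.etree.ElementTree as ET

# Assumed sample document (a hypothetical reconstruction for illustration).
SAMPLE = """
<Chicago>
  <neighborhood name="Wrigleyville">
    <light/>
    <light/>
    <light>
      <WrigleyField>
        <snack>kosher hot dog</snack>
      </WrigleyField>
    </light>
  </neighborhood>
</Chicago>
"""


def find_enlightenment(xml_text):
    """Follow the directions: Chicago -> Wrigleyville -> 3rd light."""
    root = ET.fromstring(xml_text)  # the root element is <Chicago>
    path = "./neighborhood[@name='Wrigleyville']/light[3]/WrigleyField"
    return root.findall(path)


matches = find_enlightenment(SAMPLE)
print([m.tag for m in matches])
print(matches[0].findtext("snack"))
```

The function returns a list because, as discussed later in this section, an XPath selects every node that fits the description, not just one.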

2.2. XPath Syntax

As we can see from our example, an XPath is a sequence of fields separated by slashes, each representing a descent into the XML hierarchy. (Double slashes, //, mean drop an arbitrary number of levels.) Each field between the slashes refers to an XML node (that is, an element, an attribute, or a text value). You can refer to an attribute by its name preceded with an at-sign, @; otherwise, a simple name is taken to be the name of an element. Other useful node-pointing symbols include an asterisk, *, a wildcard that represents any name; a single dot, ., referring to the current node; and double dots, .., which represent a step up in the hierarchy.

More generally, each field represents a step in the XML structure, and with every step, there is a notion of the context node: the location from which the step is made. If an XPath does not start with a slash, then it is taken to be relative to some context-specific starting point. In our example above, @name is a small relative XPath within our XPath; its context node is taken to be the <neighborhood> element pointed to by /Chicago/neighborhood.

The path notation described so far is actually a short-hand version, which is the notation most commonly used. There is also a more verbose syntax that allows one to move in any direction in an XML file: upward, downward, or laterally (i.e. to elements at the same level before or after the context node).

A portion of an XPath enclosed by square brackets, [ ], is referred to as a predicate. It represents a query or a test against the element or attribute pointed to by the XPath up to that point. In fact, it’s often helpful to read an opening bracket as “where…”, as in “the Chicago neighborhood where the name is ‘Wrigleyville.’” XPath provides a variety of other functions and operators that can be used in a predicate test. There are the familiar comparison operators, =, !=, <, >, <=, >=, and the boolean operators, and and or. There is also a large number of useful functions, such as not(), contains(string, string), position(), count(), last(), and local-name(). The predicate notation [3] is short for [position()=3] or, translated, “where the position of the node is equal to 3.”
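The short-hand symbols described above can be tried out with a few lines of Python. This is a sketch under assumptions: the sample document is a guess at a minimal XML file consistent with this section's XPaths, and the standard library's ElementTree supports only part of the XPath syntax (for instance it accepts [3] and [last()] but not the longer [position()=3] form).

```python
import xml.etree.ElementTree as ET

# Assumed sample document (hypothetical, for illustration only).
SAMPLE = """
<Chicago>
  <neighborhood name="Wrigleyville">
    <light/>
    <light/>
    <light>
      <WrigleyField><snack>kosher hot dog</snack></WrigleyField>
    </light>
  </neighborhood>
</Chicago>
"""

root = ET.fromstring(SAMPLE)


def tags(path):
    """Return the element names selected by a path relative to <Chicago>."""
    return [elem.tag for elem in root.findall(path)]


any_depth = tags(".//snack")             # //  : descend any number of levels
wildcard = tags("./neighborhood/*")      # *   : any element name
step_up = tags(".//snack/..")            # ..  : parent of each match
where = tags(".//WrigleyField[snack='kosher hot dog']")  # predicate test

# With three lights, the 3rd light and the last light are the same node.
third = root.find("./neighborhood/light[3]")
last = root.find("./neighborhood/light[last()]")

print(any_depth, wildcard, step_up, where, third is last)
```

Reading each predicate as “where…” works here too: the last path selects the WrigleyField where the snack is “kosher hot dog”.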

2.3. XPath as a Query

In the presentation above, we talked about an XPath as a pointer, as if there is actually an XML document with a <WrigleyField> element at the location the path points to. We can also think of an XPath as a query against an XML document: is there a <WrigleyField> element at that location? In this sense, we can think of an XPath as “returning” a set of nodes that match the positions pointed to by the XPath. That is, if the XPath is ambiguous (say, /Chicago/neighborhood/light), then all of the matching <light> elements get returned. Here are a few examples of queries against our sample XML:

/Chicago/neighborhood[@name="Wrigleyville"]/light[3]/WrigleyField
    produces <WrigleyField><snack>kosher hot dog</snack></WrigleyField>
    The XPath matches the <WrigleyField> element and its contents.

/Chicago/neighborhood/light
    produces each <light> element with its contents, including the <WrigleyField> element that holds “kosher hot dog”
    The XPath matches all three occurrences of the <light> element.

/Chicago/neighborhood/light/WrigleyField/snack
    produces <snack>kosher hot dog</snack>

string(/Chicago/neighborhood/light/WrigleyField/snack)
    produces kosher hot dog
    The string function returns the text inside the element.

/Chicago/neighborhood/light/WrigleyField[snack="kosher hot dog"]
    produces <WrigleyField><snack>kosher hot dog</snack></WrigleyField>
    The <snack> element was automatically converted to a string before the comparison operator was applied.

Note that in some contexts, as we see in the last example above, a matching element will get converted to its string value automatically. As we’ll see later in the section on XSL, this will be the typical way to pull out metadata values from an XML file.

The CARNIVORE VO registry shows how XPath can be used as a query (see Useful Links for the URL). It accepts XPaths as a way of selecting resources or for pulling specific information out of the resource descriptions. Consider these example queries:

/Resource[contains(content/description, 'cluster')]
    Return all resources where the description contains the word ‘cluster’.

/Resource[facility]
    Return all resources that have a facility element.

/Resource/capability[@xsi:type='cs:ConeSearch']/interface/accessURL
    Return the interface access URLs of all ConeSearch services.

/Resource[@xsi:type='vs:DataCollection']/coverage//stc:AllSky
    Return all data collections that purport to have data distributed over the entire sky.

It’s worth pointing out the use of namespace prefixes in the examples above. Typically, if an element is part of a defined namespace, then the element must be qualified with the appropriate namespace prefix when it appears in an XPath. This is not necessary, however, when the XML Schema that defines the element sets elementFormDefault="unqualified", as the VOResource schema does (see the end of section 1.6). In this case, only global types (such as those that appear as values of the xsi:type attribute) and global elements need to be qualified with a namespace prefix. This makes our XPaths simpler. Not only are they more readable, we don’t need to keep track of which elements come from which schema. We need only keep track of a few major types, like the Resource types. The last example shows an exception to this. The AllSky element comes from the Space-Time Coordinates schema (developed independently of the VOResource schema), which sets elementFormDefault="qualified".
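The difference between unqualified and qualified names can be seen in a short Python sketch. Everything about the record below is invented for illustration: the title is made up, and http://example.org/stc is a placeholder, not the real Space-Time Coordinates namespace URI. Only the xsi and VODataService URIs are taken from standard usage and from this chapter's figures. The standard library's ElementTree resolves prefixes in a path through a mapping that the caller supplies.

```python
import xml.etree.ElementTree as ET

# Prefix-to-namespace mapping used when evaluating paths.
NS = {
    "xsi": "http://www.w3.org/2001/XMLSchema-instance",
    "stc": "http://example.org/stc",  # placeholder URI (assumption)
}

# Hypothetical resource record: local elements are unqualified,
# only <stc:AllSky> and the xsi:type attribute carry a namespace.
RECORD = """
<Resource xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
          xmlns:vs="http://www.ivoa.net/xml/VODataService/v1.0"
          xmlns:stc="http://example.org/stc"
          xsi:type="vs:DataCollection">
  <title>Example Collection</title>
  <coverage>
    <stc:AllSky/>
  </coverage>
</Resource>
"""

root = ET.fromstring(RECORD)

# Unqualified elements are addressed by their plain names.
title = root.findtext("title")

# A qualified element needs its prefix, resolved through NS.
all_sky = root.findall("coverage//stc:AllSky", NS)

# The attribute name is qualified; its value is just a string.
xsi_type = root.get("{%s}type" % NS["xsi"])

print(title, len(all_sky), xsi_type)
```

Note that the prefix used in the path only has to map to the same namespace URI as in the document; the prefix strings themselves need not agree.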

Suggested Exercise: You should now be ready to create your own XPaths. Have a look at the xmltech-adil.xml file from the advxml directory. See if you can create XPaths for the various values in the file. The xmltech.html file (which you can view with a web browser) offers some suggestions.

3. Search for Information Using XQuery

What it is: XQuery (also known as XML Query) is a W3C standard language for querying XML documents and extracting selected information from them. XQuery does for XML documents what SQL does for tabular data.

Why you might care: XQuery is one of the query languages supported in the IVOA Registry Interface standard for discovering resources. In particular, it can handle certain complex registry queries that the ADQL alternative cannot.

What you’ll get from this section: You should be able to query XML documents by modifying an existing XQuery.

3.1. XQuery versus SQL

As you read in Chapter 41, we use XML to encode descriptions of resources in registries; users can then search registries to locate useful data and services. Besides simple free-text searches, registries allow one to form complex queries that constrain the values of specific metadata. Some registries support searching by loading all of the metadata into tables in a relational database; this means that SQL can be used to search it (see Chapter 58). Other registries store the metadata in their original XML document form inside an XML database. XQuery was specifically designed for searching such XML databases.

XQuery functions in a manner very similar to SQL. SQL is a language used to query tables, and the result of an SQL query is itself a table. The columns of the returned table are controlled via the SELECT clause of the query. Which table rows are selected is controlled by the WHERE clause. In contrast, XQuery is used to query XML documents, and the result is returned as an XML document. The form of that document is set by the return clause, and the contents are controlled by the for, let, and where clauses.

3.2. XQuery Syntax: Think FLWOR

XQuery syntax can take several forms. XPath, of the sort we explored in the last section, is one form. There are others that we will not explore here; however, we will introduce the most common form, referred to as FLWOR (pronounced “flower”). It stands for for, let, where, order by, return: the clauses that make up a typical query. Figure 4 illustrates an example of this form. It starts with a declare statement (apparently not considered important enough to earn a place in the acronym) to define any namespace prefixes we will be using in the query. We will refer to the bits of metadata that we want to base our query on using XPaths; thus, we first need to identify the namespaces we will be drawing from.

(: Define a prefix for this namespace :)
declare namespace cs="http://www.ivoa.net/xml/ConeSearch/v1.0";

(: Loop over all ConeSearch resources :)
for $vr in //Resource[capability/@xsi:type="cs:ConeSearch"]

(: Restrict output to ConeSearch services about quasars :)
where contains($vr//description, "quasar")

(: Extract and display desired information :)
return
    {string($vr/title)}
    {string($vr/capability/interface/accessURL)}

Figure 4. An example of the FLWOR form of XQuery.

The for clause selects from all the documents being searched the <Resource> elements that contain <capability> elements with the type “ConeSearch” and sets up a loop over all those found. In particular, each matched <Resource> element is stored in turn in a variable, $vr, which forms the basis of the other XPaths in the query. Though not shown, a let clause is similar to the for: it, too, can define variables but without the looping. Note that if a let clause is used to define a variable inside the for loop and includes $vr in its definition, then its value would change with each iteration of the loop.

The where clause is used to further restrict the output. In our example, we will select only those resources in which the <description> element’s value contains the word “quasar.” The where clause is optional since, as you can see from our example, some filtering is already done by the for clause. In fact, our where constraint could easily have been incorporated into the for clause’s XPath. Inside the predicate, $vr is not yet bound, so the description is addressed relative to the <Resource> element being tested:

for $vr in //Resource[capability/@xsi:type="cs:ConeSearch" and contains(.//description, "quasar")]

The return clause formats all of the matched <Resource> elements into the desired output format. The content of this clause is a template for the output XML document. The braces, { }, indicate where we want data from the input documents inserted into the output XML. Again, we identify the data to plug in with XPaths. In our example, each ConeSearch resource having something to do with quasars will be summarized by its title and its service URL.
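For readers more at home in a procedural language, the roles of the for, where, and return clauses can be mimicked in Python. This is only an analogy, not XQuery, and nearly everything in it is invented for illustration: the <records> wrapper, the three resource records, the example.org URLs, and the <results>, <hit>, <name>, and <url> output element names.

```python
import xml.etree.ElementTree as ET

XSI_TYPE = "{http://www.w3.org/2001/XMLSchema-instance}type"

# Hypothetical registry contents (made up for this sketch).
REGISTRY = """
<records xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xmlns:cs="http://www.ivoa.net/xml/ConeSearch/v1.0">
  <Resource>
    <title>Quasar Catalog Search</title>
    <content><description>A cone search over a quasar catalog.</description></content>
    <capability xsi:type="cs:ConeSearch">
      <interface><accessURL>http://example.org/quasars?</accessURL></interface>
    </capability>
  </Resource>
  <Resource>
    <title>Galaxy Cluster Search</title>
    <content><description>Positions of galaxy clusters.</description></content>
    <capability xsi:type="cs:ConeSearch">
      <interface><accessURL>http://example.org/clusters?</accessURL></interface>
    </capability>
  </Resource>
  <Resource>
    <title>Quasar Image Archive</title>
    <content><description>Images of quasar fields.</description></content>
  </Resource>
</records>
"""


def cone_searches_about(registry_xml, word):
    """Build a small XML summary of ConeSearch resources mentioning word."""
    root = ET.fromstring(registry_xml)
    out = ET.Element("results")
    # "for": loop over resources that have a ConeSearch capability
    for vr in root.iter("Resource"):
        caps = [c for c in vr.findall("capability")
                if c.get(XSI_TYPE) == "cs:ConeSearch"]
        if not caps:
            continue
        # "where": keep only resources whose description contains the word
        if not any(word in (d.text or "") for d in vr.iter("description")):
            continue
        # "return": fill the output template with values from the input
        hit = ET.SubElement(out, "hit")
        ET.SubElement(hit, "name").text = vr.findtext("title")
        ET.SubElement(hit, "url").text = caps[0].findtext("interface/accessURL")
    return out


results = cone_searches_about(REGISTRY, "quasar")
print(ET.tostring(results, encoding="unicode"))
```

The point of the comparison is the division of labour: the loop plays the part of the for clause, the two filters together do what the for clause's predicate and the where clause do, and building the output elements corresponds to the return template.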

Suggested Exercise: You can try this query in the CARNIVORE registry search interface. Consult the Useful Links section at the end of this chapter for the URL. Try making edits to the query for different results. You may need to consult the VOResource schema (VOResource-v1.0.xsd) to see what other metadata can be used to constrain the search.

The FLWOR expression is just one type of expression supported by XQuery. The full language provides additional programming constructs and many functions for testing and manipulating data from the input XML. In fact, XQuery can be thought of as a full XML processing language that can transform data drawn from a whole collection of XML documents. Our last technology focuses on transforming one document at a time.

4. Transforming Metadata with XSL

What it is: The Extensible Stylesheet Language (XSL) is a W3C standard language for transforming XML documents into other forms: other forms of XML, HTML, or text. An XSL stylesheet is an XML document that uses the language to define a specific transformation.

Why you might care: XSL can be used to provide human-readable renderings of XML data from the VO. XSL is used in the ADQL Java library and the Java SkyNode server toolkit to translate ADQL into local SQL. It can also be used as part of a client tool that talks to Web Services, like a Sky Node or Sky Portal.

What you’ll get from this section: You will learn to use a tool for applying a stylesheet to an XML document. You should then be able to adapt an existing XSL stylesheet by making simple changes.


4.1. An Overview of XSL

XSL is an XML language for creating stylesheets. You might be familiar with the Cascading Style Sheet (CSS) as a tool for controlling the look of HTML in a web browser. XSL is similar in concept (a stylesheet is applied to an XML document), but it is a common misconception that XSL is just about converting XML into HTML. XSL is about transforming XML data from one schema into another (including XHTML) or into plain text. (In fact, the relevant W3C standard is called XSL Transformations, or XSLT; in polite conversation, XSL and XSLT are often used interchangeably.) XSL is also different from CSS in that an XSL stylesheet is an XML document itself controlled by the XSL schema.

An XSL stylesheet is constructed as a list of templates written to be applied to an XML document compliant with a particular schema. Each template describes how to transform one type of node (say, an element or attribute) from the input document into the format of the output document. For example, a template for an input element might be written to insert the data into an <H1> element in the output document. The node the template is designed for is identified by an XPath. An XSL document usually includes a template targeted to /, the top of the document. Inside that template, it can call other templates to handle different parts of the input document. Those templates, of course, can call other templates as well.

The XSL language provides a number of programming features that can be used inside a template. These include conditionals, looping, variables, common built-in functions, and user-defined functions. XSL also provides several hooks for plugging in extensions to the language. Thus, XSL can be thought of as a kind of programming language for processing data.

FYI: While XSL does qualify as a programming language, there are aspects of it that may seem unusual. The most commonly used languages (including Java, Python, C, etc.) fall into the category known as procedural or imperative languages, in which one programs in terms of a process of steps to be carried out. XSL falls into the category of declarative programming. In this paradigm, one describes what needs to be done with less regard for how and, in particular, in what order it should be done. One seemingly odd consequence of this is that XSL variables, once set, are static; they cannot be changed. This means that you cannot create loops controlled by iterating a counter in the traditional way. Instead, one typically uses function recursion to accomplish the same thing.

Applying an XSL stylesheet to an XML document requires an XSLT engine. One such engine, xalan, is provided on the companion CD. If you followed the short build instructions at the start of this chapter, then this tool is ready for use. At the end of this section, we describe a sample exercise you can run to see XSL in action.

It’s worth noting that the example we will discuss in this chapter is compatible with version 1.0 of the W3C standards for XPath and XSL. Recently, the W3C announced the acceptance of version 2 of each of these standards to Recommendation status. Most of XSL version 1.0 is directly supportable by a version 2.0 engine, and where it is not, version 2.0 provides a backward-compatibility mode.

4.2. A Tour of an XSL Stylesheet

We will explore the XSL syntax by stepping through a sample stylesheet designed to produce simple text summaries of resource descriptions using the VOResource schema. You can find this sample in the advxml directory on the companion CD as the file xmltech-VOResource.xsl. Afterward, we’ll look at the result of applying the stylesheet to our sample resource description, xmltech-adil.xml.

Figure 5 lists a portion of our stylesheet. In this example, elements that are part of the XSL language have the namespace prefix, xsl.
We can see that the <xsl:stylesheet> element serves as the root element for the document. Included amongst its attributes are definitions of prefixes for all the namespaces involved in the transformation. This includes the XSL namespace as well as all the namespaces we’ll have to refer to in the input document (the one we will transform, e.g. xmltech-adil.xml). In general, it would also include the namespaces to be used in the output document; however, in our example, we are converting an XML document into plain text, so no such namespaces apply.

The output format is controlled by the <xsl:output> element, which is usually one of the first children of the root element to appear in the stylesheet. Its method attribute specifies the format and can be set to one of “text”, “xml”, or “html.” There are other preliminary elements that can optionally appear near the top of the document which we will not consider; however, once they are out of the way, the remainder of the document lists the templates that control the transformation of the various parts of the input document. These can appear in any order; however, it is not a bad idea for readability purposes to list your most general templates first. Usually this means a template that is meant to match the start of the input document.

The intended target of a template is given via its match attribute as an XPath; thus, the start of the input document is matched with the XPath, /. Our version of this template does two things. First, it prints out a bit of text, “Resource Description Record.” When raw text like this appears in a template, the XSLT engine will tend to preserve spacing, such as carriage returns, when it is printed to the output document. Sometimes this is desirable, and sometimes it isn’t; however, we will later see one way to control this explicitly. The second thing our template does is pass control to another template. The use of the <xsl:apply-templates> element within this template says, if there is an element matching the XPath, Resource, find and apply a matching template. Thus, the next template in our example will be engaged.

The Resource template illustrates how a template “takes charge” of handling a particular XML node from the input document, in this case the <Resource> element. It can print out the values of various attributes and child elements using the <xsl:value-of> element; its select attribute takes an XPath that identifies the attribute or element to be printed. These XPaths are normally relative to the current context node (see Section 2.2) which, in this case, is the node that was matched to the template, or the <Resource> element. Thus, <xsl:value-of select="title"> prints the value of the <title> element that is a child of the <Resource> element. The Resource template provides some literal text as values in <xsl:text> elements. These elements are useful for taking direct control of the spacing, especially when the text you want to insert is purely spacing, as in the first occurrence of <xsl:text>. Finally, even though this template does directly format some of the information inside the input <Resource> element (like the title, short name, and identifier), it passes the formatting of other data on to other templates.

It is common within a template to want to loop over all the occurrences of an element. Use of the <xsl:apply-templates> element actually does this automatically for you: the template will be applied to the matching elements in the order that they appear in the input document. Sometimes, however, you will want to take finer control, and the <xsl:for-each> element can help. For example, in the content template, we want to print the values of each of the <contentLevel> elements and separate them by commas.
The <xsl:for-each> element selects each occurrence of the <contentLevel> element in turn and applies the XSL directives within. We use an <xsl:if> block to test if the element we are processing is the last one; if it is, we don’t print the comma after the value. One side effect of the <xsl:for-each> element is that it switches the context node from the <content> element to the instance of the <contentLevel> element currently being handled in the loop; thus, XPaths within the loop, like the one in the <xsl:value-of> element, must be relative to <contentLevel>.

Another useful control flow capability is the if-then-else pattern provided by the <xsl:choose>, <xsl:when>, and <xsl:otherwise> elements. Their use is not shown in Figure 5, but you can find an example of their use in the sample xmltech-VOResource.xsl file.

4.3. Applying a Stylesheet

Note that an XSL stylesheet is not usually written for a particular instance of an XML document, but rather for a class of documents that conform to a particular input schema. Our sample XSL file, xmltech-VOResource.xsl, will print a simple text summary of a resource description record. To see the results, we can use the xalan XSL engine to apply it to the sample resource description, xmltech-adil.xml. Assuming you have completed the build instructions at the start of this chapter, change into the $NVOSS_HOME/java/src/advxml directory (or %NVOSS_HOME%\java\src\advxml on Windows) and type:

xalan -in xmltech-adil.xml -xsl xmltech-VOResource.xsl

This should produce the following output:

Resource Description Record
DataCollection NCSA Astronomy Digital Image Library (ADIL)
IVOA Identifier: ivo://adil.ncsa/adil

Description: The ADIL collects published image data in FITS format from a variety of telescopes and wavebands and makes them available to the astronomical community. Users can search, browse, and download images as well as upload their own published images.

Target Communities: University, Research
Published by: NCSA Radio Astronomy Imaging

<?xml version="1.0" encoding="UTF-8"?>
<!-- Define prefixes for all namespaces that we'll be using -->
<xsl:stylesheet xmlns="http://www.ivoa.net/xml/VOResource/v1.0"
                xmlns:vr="http://www.ivoa.net/xml/VOResource/v1.0"
                xmlns:vs="http://www.ivoa.net/xml/VODataService/v1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
                version="1.0">

  <!-- Our output format will be plain text -->
  <xsl:output method="text"/>

  <!-- Our root document template sets up the output document and calls the next template -->
  <xsl:template match="/">
Resource Description Record
    <xsl:apply-templates select="Resource" />
  </xsl:template>

  <!-- Resource template -->
  <xsl:template match="Resource" >
    <xsl:value-of select="substring-after(@xsi:type,':')"/>
    <!-- text tags can be used to take explicit control of spacing -->
    <xsl:text> </xsl:text>
    <xsl:value-of select="title"/>
    <xsl:text> (</xsl:text>
    <xsl:value-of select="shortName"/>
    <xsl:text>)
IVOA Identifier: </xsl:text>
    <!-- value-of will print the string value of nodes -->
    <xsl:value-of select="identifier"/>
    <!-- Pass control on to other templates -->
    <xsl:apply-templates select="content" />
    <xsl:apply-templates select="curation" />
  </xsl:template>

  <xsl:template match="content">
    <xsl:apply-templates select="description" />
    <xsl:text>
</xsl:text>
    <xsl:text>Target Communities: </xsl:text>
    <!-- Loop over all occurrences of contentLevel -->
    <xsl:for-each select="contentLevel">
      <xsl:value-of select="."/>
      <!-- if block -->
      <xsl:if test="position()!=last()">
        <xsl:text>, </xsl:text>
      </xsl:if>
    </xsl:for-each>
  </xsl:template>

Figure 5. A portion of the stylesheet found in xmltech-VOResource.xsl

Suggested Exercise: Try editing the sample stylesheet to make it print out more information from the resource description, such as contact information (found in the <contact> element) or subject topics (values of the <subject> element).

4.4. Transformation as a Paradigm

Much of how we use metadata in astronomy, particularly as part of managing datasets within a collection or archive, can be thought of in terms of transformations from one form to another. When XML is the vehicle for moving this information around, XSLT can be the mechanism for getting it into the form needed to use it. We saw how XML metadata can be transformed into plain text or HTML for user display. When coupled with a technology like XForms (Dubinko 2002) or XUL (see Useful Links), XSLT can even insert metadata into web or application interfaces. If one is loading metadata into a database, XSLT can be used to transform metadata into SQL INSERT statements. The programming structures of XSL make it flexible enough to create complex workflow scripts, even compilable code, tuned to process the datasets that the metadata describe.

With a general tool like XSLT, transformation becomes a powerful paradigm for metadata processing. In this context, an XSL stylesheet plays a role that is somewhere between a configuration file and a script that drives some metadata application. Simple edits to a stylesheet can enable rapid adaptation of an application. Another feature of XSL that can enable adaptation is the <xsl:import> element. This allows a stylesheet to import templates from another stylesheet. Consequently, one can create template “libraries” that can be loaded by, added to, and even overridden by an application-specific stylesheet. The ADQL library included on the companion CD uses this capability to convert ADQL to various vendor-specific dialects of SQL (see Chapter 36). The library includes a single master stylesheet that converts the XML form of ADQL to standard SQL. Vendor-specific stylesheets (that is, stylesheets that convert ADQL to the SQL supported by MySQL, SQLServer, Oracle, etc.) import templates from the master stylesheet and then override specific templates to handle deviations from the standard.

5. Conclusion

A major motivation for using XML as a format for encoding information in the VO is the availability of diverse and freely available tools for processing XML. Many of these tools can get quite complex, often too complex for direct use by the general user. However, they do provide great flexibility in how XML can be used and processed in applications. As you have hopefully observed from this chapter, these tools can be effectively built on top of each other. For this reason they are helpful for building long-standing tools and services.

References

Boag, S., Chamberlin, D., Fernández, M. F., Florescu, D., Robie, J., Siméon, J. 2007, XQuery 1.0: An XML Query Language, W3C Recommendation, http://www.w3.org/TR/2007/REC-xquery-20070123/
Boyer, J. M., Landwehr, D., Merrick, R., Raman, T. V., Dubinko, M., Klotz, L. L. 2006, XForms 1.0 (Second Edition), W3C Recommendation, http://www.w3.org/TR/xforms/
Clark, J., & DeRose, S. 1999, XML Path Language (XPath) Version 1.0, W3C Recommendation, http://www.w3.org/TR/xpath
Clark, J. 1999, XSL Transformations (XSLT) Version 1.0, W3C Recommendation, http://www.w3.org/TR/xslt
Dubinko, M. 2002, What are XForms, http://www.xml.com/pub/a/2001/09/05/xforms.html
Fallside, D. C., & Walmsley, P. 2006, XML Schema Part 0: Primer, W3C Recommendation, http://www.w3.org/TR/xmlschema-0/

Useful Links

The CARNIVORE Registry’s Advanced Querying. June 2005; http://nvo.caltech.edu:8080/CARNIVORE/advancedquery [Accessed 1 Feb 2007].
XML User Interface Language (XUL).
January 2007: http://www.mozilla.org/projects/xul/ [Accessed 29 June 2007] </p> </div> </article> </div> </div> </div> </html>