CIS 680 XML Pointer Language (XPointer)

XPointer, the XML Pointer Language, defines an addressing scheme for individual parts of an XML document. These addresses can be used by any application that needs to identify parts of or locations in an XML document. For instance, an XML editor could use an XPointer to identify the current position of the insertion point or the range of the selection. An XInclude processor can use an XPointer to determine what part of a document to include. And the URI in an XLink can include an XPointer fragment identifier that locates one particular element in the targeted document. XPointers use the same XPath syntax that you're familiar with from XSL transformations to identify the parts of the document they point to, along with a few additional pieces.

For the moment, an XPointer can be used as an index into a complete document, the whole of which is loaded and then positioned at the location identified by the XPointer, and even this much is more than most browsers can handle. In the long term, extensions to XML, XLink, HTTP, and other protocols may allow more sophisticated uses of XPointers. For instance, XInclude will let you quote a remote document by using an XPointer to tell browsers where to copy the quote from in the original document, rather than retyping the text of the quote. You could include cross-references inside a document that automatically update themselves as the document is revised. These uses, however, will have to wait for the development of several next-generation technologies. For now, you must be content with precisely identifying the part of a document you want to jump to when following an XLink.

Points

Selecting a particular element or node is almost always good enough for pointing into well-formed XML documents. However, on occasion you may need to point into XML data in which large chunks of non-XML text are embedded via CDATA sections, comments, processing instructions, or some other means. In these cases, you may need to refer to particular ranges of text in the document that don't map onto any particular markup element. Or, you may need to point into non-XML substructure in the text content of particular elements; for example, the month in a BORN element that looks like this:

<BORN>11 Feb 1858</BORN>

An XPath expression can identify an element node, an attribute node, a text node, a comment node, or a processing instruction node. However, it can't indicate the first two characters of the BORN element (the day of the month) or the substring of text between the first space and the last space in the BORN element (the month).

XPointer generalizes XPath to allow pointers like this. An XPointer can address points in the document and ranges between points. These may not correspond to any one node. For instance, the place between the X and the P in the word XPointer at the beginning of this paragraph is a point. The place between the t and the h in the word this at the end of the first sentence of this paragraph is another point. The text fragment "Pointer generalizes XPath to allow pointers like t" between those two points is a range.

Every point is either between two nodes or between two characters in the parsed character data of a document. To make sense of this, you have to remember that parsed character data is part of a text node. For instance, consider this very simple but well-formed XML document:

<GREETING>
Hello
</GREETING>

There are exactly 3 nodes and 14 distinct points in this document. The nodes are the root node, which contains the GREETING element node, which contains a text node. In order, the points are:

1. The point before the root node
2. The point before the GREETING element node
3. The point before the text node containing the text "Hello" (as well as any white space)
4. The point before the line break between <GREETING> and Hello
5. The point before the first H in Hello
6. The point between the H and the e in Hello
7. The point between the e and the l in Hello
8. The point between the l and the l in Hello
9. The point between the l and the o in Hello
10. The point after the o in Hello
11. The point after the line break between Hello and </GREETING>
12. The point after the text node containing the text "Hello"
13. The point after the GREETING element
14. The point after the root node
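The character points in the list (items 4 through 11) can be enumerated mechanically. The following Python lines are only a rough sketch: they assume the text node holds a line break, the word Hello, and another line break, and the describe() helper is a hypothetical name introduced for illustration. The node-level points (items 1 to 3 and 12 to 14) are not modeled.

```python
# Character content of the GREETING element, including the two line breaks
# (an assumption based on the list of points above).
text = "\nHello\n"

# A character point sits before each character and after the last one,
# so a string of n characters has n + 1 character points.
character_points = list(range(len(text) + 1))


def describe(offset):
    """Show the characters on either side of a point (hypothetical helper)."""
    before = repr(text[offset - 1]) if offset > 0 else "start of text"
    after = repr(text[offset]) if offset < len(text) else "end of text"
    return f"between {before} and {after}"


for offset in character_points:
    print(offset, describe(offset))
```

Seven characters give eight character points, which matches the eight entries in the list that fall inside the text node.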

Points allow XPointers to indicate arbitrary positions in the parsed character data of a document. They do not, however, enable pointing at a position in the middle of a tag. In essence, what points add is the ability to break up the text content into smaller nodes, one for each character.

A point is selected by using the string-range() function to select a range, then using the start-point() or end-point() function to extract the first or last point from the range. For example, this XPointer selects the point immediately before the D in Domeniquette Celeste Baudean's NAME element:

xpointer(start-point(string-range(id('p1')/NAME,"Domeniquette")))

This XPointer selects the point after the last e in Domeniquette:

xpointer(end-point(string-range(id('p1')/NAME,"Domeniquette")))
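As a rough illustration of what these two XPointers pick out, the Python sketch below treats the NAME element's text as a plain string and reports the two points as character offsets. The function name range_points() is hypothetical, and the sketch assumes the NAME element contains exactly the text shown; it is not an XPointer processor.

```python
def range_points(text, substring):
    """Return (start_point, end_point) offsets of the first match.

    Hypothetical helper: offsets count characters from the start of the
    element's text, so offset 0 is the point before the first character.
    """
    start = text.find(substring)
    if start == -1:
        raise ValueError(f"{substring!r} does not occur in the text")
    return start, start + len(substring)


# Assumed text content of the NAME element of the person with ID p1.
name_text = "Domeniquette Celeste Baudean"
start_point, end_point = range_points(name_text, "Domeniquette")

print(start_point)  # point immediately before the D
print(end_point)    # point immediately after the last e
print(name_text[start_point:end_point])
```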

You can also take the start-point() or end-point() of an element, text, comment, processing instruction, or root node to get the first or last point in that node.

Ranges

Some applications need to specify a range across a document rather than a particular point in the document. For instance, the selection a user makes with a mouse is not necessarily going to match up with any one element or node. It may start in the middle of one paragraph, extend across a heading and a picture, and then into the middle of another paragraph two pages down.

Any contiguous area of a document can be described with a range. A range begins at one point and continues until another point. The start and end points are each identified by a location path. If the starting path points to a node set rather than a point, then range-to() will return multiple ranges, one starting from the first point of each node in the set.

To specify a range, you append /range-to(end-point) to a location path specifying the start point of the range. The parentheses contain a location path specifying the end point of the range. For example, suppose you want to select everything between the first <PERSON> start tag and the last </PERSON> end tag in the family tree document. This XPointer accomplishes that:

xpointer(/child::FAMILYTREE/child::PERSON[position()=1]/range-to(/child::FAMILYTREE/child::PERSON[position()=last()]))

Range functions

XPointer includes several functions specifically for working with ranges. Most of these operate on location sets. A location set is just a node set that can also contain points and ranges, as well as nodes.

The range(location-set) function returns a location set containing one range for each location in the argument. The range is the minimum range necessary to cover the entire location. In essence, this function converts locations to ranges.

The range-inside(location-set) function returns a location set containing the interiors of each of the locations in the input. That is, if one of the locations is an element, then the location returned is the content of the element (but not including the start and end tags). However, if the input location is a range or point, then the interior of the location is just the same as the range or point.

The start-point(location-set) function returns a location set that contains the first point of each location in the input location set. For example, start-point(//PERSON[1]) returns the point immediately after the first <PERSON> start tag in the document. start-point(//PERSON) returns the set of points immediately after each <PERSON> start tag.

The end-point(location-set) function acts the same as start-point() except that it returns the points immediately after each location in its input.

String ranges

XPointer provides some very basic string-matching capabilities through the string-range() function. This function takes as arguments a location set to search and a substring to search for. It returns a location set containing one range for each nonoverlapping matching substring. You can also provide optional index and length arguments indicating how many characters after the start of the match the range should begin and how many characters after that the range should continue. The basic syntax is:

string-range(location-set, substring, index, length)

The first argument is an XPath expression that returns a location set specifying which part of the document to search for a matching string. The second argument, substring, is the actual string to search for. By default, the range returned starts before the first matched character and encompasses all the matched characters. However, the index argument can give a positive number to start after the beginning of the match. For instance, setting it to 2 indicates that the range starts with the second matched character. The length argument can specify how many characters to include in the range.

A string range points to an occurrence of a specified string, or a substring of a given string in the text (not markup) of the document. For example, this XPointer finds all occurrences of the string Harold:

xpointer(string-range(/,"Harold"))

You can change the first argument to specify what nodes you want to look in. For example, this XPointer finds all occurrences of the string Harold in NAME elements:

xpointer(string-range(//NAME,"Harold"))

String ranges may have predicates. For example, this XPointer finds only the first occurrence of the string Harold in the document:

xpointer(string-range(/,"Harold")[position()=1])

This targets the position immediately preceding the word Harold in Charles Walter Harold's NAME element. This is not the same as pointing at the entire NAME element as an element-based selector would do.

A third numeric argument targets a particular position in the string. For example, this targets the point between the l and d in the first occurrence of the string Harold because d is the sixth letter:

xpointer(string-range(/,"Harold",6)[position()=1])

An optional fourth argument specifies the number of characters to select. For example, this XPointer selects the old from the first occurrence of the entire string Harold:

xpointer(string-range(/,"Harold",4,3)[position()=1])

If the string argument is the empty string, then the relevant positions in the context node's text contents are selected. For example, the following XPointer targets the first six characters of the document's parsed character data:

xpointer(string-range(/,"",1,6)[position()=1])

For another example, let's suppose that you want to find the year of birth for all people born in the nineteenth century. The following will accomplish that:

xpointer(string-range(//BORN, " 18", 2, 4))

This says to look in all BORN elements for the string " 18". (The initial space is important to avoid accidentally matching someone born in 1918 or on the 18th day of the month.) When it's found, move one character ahead (to skip the space) and return a range covering the next four characters. When matching strings, case is considered. Markup characters are ignored.
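The index and length arithmetic described above can be tried out on plain strings. The Python function below is a simplified sketch, not an XPointer implementation: it searches a single string instead of a location set, returns (start, end) character offsets instead of ranges, and assumes that an omitted length means the range runs to the end of the match. The name string_range() and its return format are choices made for this illustration.

```python
def string_range(text, substring, index=1, length=None):
    """Return (start, end) offsets, one pair per nonoverlapping match.

    index is 1-based: 1 starts the range at the first matched character.
    If length is omitted, the range is assumed to end where the match ends.
    """
    ranges = []
    pos = 0
    while pos <= len(text):
        found = text.find(substring, pos)  # case-sensitive search
        if found == -1:
            break
        match_end = found + len(substring)
        start = found + index - 1
        end = match_end if length is None else start + length
        ranges.append((start, min(end, len(text))))
        # Continue after this match; advance at least one character so an
        # empty substring cannot loop forever.
        pos = found + max(len(substring), 1)
    return ranges


name = "Charles Walter Harold"
born = "11 Feb 1858"

start, end = string_range(name, "Harold", 4, 3)[0]
print(name[start:end])   # the three characters starting at the fourth letter

start, end = string_range(born, " 18", 2, 4)[0]
print(born[start:end])   # skips the space, then takes four characters
```

With index 2 the range begins one character into the match, which is how the leading space in " 18" is skipped.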

Child Sequences

The two most common ways to identify an element in an XML document are by ID and by location. Identifying an element by ID is accomplished through the id() function. Identifying an element by location is generally accomplished by counting children down from the root. For example, the following URIs both point to John P. Muller's PERSON element:

http://www.theharolds.com/genealogy.xml#xpointer(id("p4"))
http://www.e.com/genealogy.xml#xpointer(/child::*[position()=1]/child::*[position()=4])

A child sequence is a shortcut for XPointers like the second example above; that is, for an XPointer that consists of nothing but a series of child relative location steps counting down from the root node, each of which selects a particular child by position only. The shortcut is to use only the position number and the slashes that separate individual elements from each other, like this:

http://www.theharolds.com/genealogy.xml#/1/4

/1/4 is a child sequence that selects the fourth child element of the first child element of the root. This syntax can be extended for any depth of child elements. For example, these two URIs point to John P. Muller's NAME and SPOUSE elements respectively:

http://www.theharolds.com/genealogy.xml#/1/4/1
http://www.theharolds.com/genealogy.xml#/1/4/2

Child sequences may include an initial ID. In that case, the counting begins from the element with that ID rather than from the root. For example, John P. Muller's PERSON element has an ID attribute with the value p4. Consequently, p4/1 points to his NAME element and p4/2 points to his SPOUSE element.

Each child sequence always points to a single element. You cannot use child sequences with any other relative location steps. You cannot use them to select elements of a particular type. You cannot use them to select attributes or strings. You can only use them to select a single element by its relative location in the tree.
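The counting rule behind child sequences is simple enough to sketch in a few lines of Python using the standard xml.etree.ElementTree module. This is an illustration only: resolve_child_sequence() is a hypothetical helper, the SAMPLE document is a cut-down stand-in for the family tree, and the sketch assumes that IDs are stored in an attribute literally named ID (a real processor would consult the DTD to learn which attributes are IDs).

```python
import xml.etree.ElementTree as ET

# Cut-down stand-in for the family tree document (an assumption for this sketch).
SAMPLE = """<FAMILYTREE>
  <PERSON ID="p1"><NAME>Domeniquette Celeste Baudean</NAME></PERSON>
  <PERSON ID="p2"><NAME>Jean Francois Bellau</NAME></PERSON>
  <PERSON ID="p3"><NAME>Elodie Bellau</NAME></PERSON>
  <PERSON ID="p4"><NAME>John P. Muller</NAME><SPOUSE/></PERSON>
</FAMILYTREE>"""


def resolve_child_sequence(root_element, sequence):
    """Return the element a child sequence such as /1/4 or p4/2 selects, or None."""
    head, *steps = sequence.split("/")
    if head:
        # Leading ID: start counting from the element carrying that ID.
        current = next((e for e in root_element.iter() if e.get("ID") == head), None)
    else:
        # No ID: the first number counts children of the root node, whose
        # only element child is the root element itself.
        if not steps or steps[0] != "1":
            return None
        current, steps = root_element, steps[1:]
    for step in steps:
        if current is None:
            return None
        children = list(current)  # element children only
        n = int(step)
        current = children[n - 1] if 1 <= n <= len(children) else None
    return current


root = ET.fromstring(SAMPLE)
print(resolve_child_sequence(root, "/1/4").get("ID"))
print(resolve_child_sequence(root, "/1/4/1").text)
print(resolve_child_sequence(root, "p4/2").tag)
```

Because each step picks exactly one child by position, the result is always a single element or nothing at all.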

XPointer Examples

HTML links generally point to one particular document. Additional granularity (that is, pointing to a particular section, chapter, or paragraph of a particular document) isn't well supported. Provided you control both the linking and the linked document, you can insert a named anchor into an HTML file at the position to which you want to link. For example:

<A NAME="xtocid20.2">XPointer Examples</A>

You can then link to this position in the file by adding a # and the name of the anchor to the URL. The piece of the URL after the # is called the fragment identifier. For example, in this link the fragment identifier is xtocid20.2.

XPointer Examples

However, this solution is messy. It's not always possible to modify the target document so that the source document can link to it. The target document may be on a different server controlled by someone other than the author of the source document. And the author of the target document may change or move it without notifying the author of the source. Furthermore, named anchors violate the principle of separating markup from content. Placing a named anchor in a document says nothing about the document or its content. It's just a marker for other documents to refer to. It adds nothing to the document's own content.

XPointers allow much more sophisticated connections between parts of documents. An XPointer can refer to any element of a document; to the first, second, or seventeenth element; to the seventh element named P; to the first element that's a child of the second DIV element; and so on. XPointers provide very precisely targeted addresses of particular parts of documents. They do not require the targeted document to contain additional markup just so its individual pieces can be linked to.

Furthermore, unlike HTML anchors, XPointers don't point to just a single point in a document. They can point to entire elements, to possibly discontiguous sets of elements, or to the range of text between two points. Thus, you can use an XPointer to select a particular part of a document, perhaps so it can be copied or loaded into a program. Here are a few examples of XPointers:

xpointer(id("ebnf"))
xpointer(descendant::language[position()=2])
ebnf
xpointer(/child::spec/child::body/child::*/child::language[2])
xpointer(/spec/body/*/language[2])
/1/14/2
xpointer(id("ebnf"))xpointer(id("EBNF"))

Each of these selects a particular element in a document. The first finds the element with the ID ebnf. The second finds the second language element in the document. The third is a shorthand form of finding the element with the ID ebnf. The fourth and fifth both specify the second language child element of any child element of the body child elements of the spec child of the root node. The sixth finds the second child element of the fourteenth child element of the root element. The final XPointer also points to the element with the ID ebnf. However, if no such element is present, it then finds the element with the ID EBNF.
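The fallback behavior of the last example, where a second xpointer() part is tried only if the first finds nothing, can be sketched as follows. This Python fragment is an illustration under several assumptions: it understands only parts of the form id("..."), it assumes IDs live in an attribute literally named id, the sample document is invented for the demonstration, and the helper names split_parts() and first_match() are hypothetical.

```python
import re
import xml.etree.ElementTree as ET


def split_parts(pointer):
    """Split 'xpointer(...)xpointer(...)' into the expression inside each part.

    Parentheses are matched by depth; parentheses inside quoted strings are
    not handled in this sketch.
    """
    parts, depth, start = [], 0, None
    for i, ch in enumerate(pointer):
        if ch == "(":
            if depth == 0:
                start = i + 1
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth == 0:
                parts.append(pointer[start:i])
    return parts


def first_match(root_element, pointer):
    """Try each part from left to right and return the first element found."""
    for expression in split_parts(pointer):
        match = re.fullmatch(r'id\("([^"]*)"\)', expression)
        if match is None:
            continue  # this sketch only understands id("...")
        for element in root_element.iter():
            if element.get("id") == match.group(1):
                return element
    return None


# Invented sample in which only the upper-case ID exists.
doc = ET.fromstring('<spec><body><div><language id="EBNF"/></div></body></spec>')
found = first_match(doc, 'xpointer(id("ebnf"))xpointer(id("EBNF"))')
print(found.tag, found.get("id"))
```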

Location Paths, Steps, and Sets

Many (though not all) XPointers are location paths. Location paths are built from location steps. Each location step specifies a point in the targeted document, always relative to some other well-known point such as the start of the document or the previous location step. This well-known point is called the context node. In general, a location step has three parts: the axis, the node test, and an optional predicate. These are combined in this form:

axis::node-test[predicate]

For example, in the location step child::PERSON[position()=2], the axis is child, the node-test is PERSON, and the predicate is [position()=2]. This location step selects the second PERSON element along the child axis, starting from the context node or, less formally, the second PERSON child element of the context node. Of course, which element this actually is depends on what the context node is. Consequently, this is what's referred to as a relative location step. There are also absolute location steps that do not depend on the context node.

The axis tells you in what direction to search from the context node. For instance, an axis can say to look at things that follow the context node, things that precede the context node, things that are children of the context node, things that are attributes of the context node, and so forth.

The node test tells you which nodes to consider along the axis. The most common node test is simply an element name. However, the node test may also be the asterisk (*) wild card to indicate that any element is to be matched, or one of several functions for selecting comments, text, attributes, processing instructions, points, and ranges. The group of nodes along the given axis that satisfy the node test form a location set.

The predicate is a boolean expression (exactly like the expressions in XSLT) that tests each node in that set. If that expression returns false, then the node is removed from the set.

Often, after the entire location step (axis, node test, and predicate) has been evaluated, what's left is a single, unique node. A location set like this with only one node is called a singleton. However, not all location steps produce singletons. In some cases, you may finish with multiple nodes in the final location set. On occasion, there may be no nodes in the location set; in other words, the location set is the empty set.

A single location step is often not enough to identify the node you want. Commonly, location steps are strung together, separated by slashes, to form a location path. Each location step's location set becomes the context node set for the next step in the path. For example, consider this XPointer:

xpointer(/child::FAMILYTREE/child::PERSON[position()=3])

The location path of this XPointer is /child::FAMILYTREE/child::PERSON[position()=3]. It is built from two location steps:

• /child::FAMILYTREE
• child::PERSON[position()=3]

The first location step is an absolute step that selects all child elements of the root node whose name is FAMILYTREE. When applied, there's exactly one such element. The second location step is then applied relative to the FAMILYTREE element returned by the first location step. All of its child nodes are considered. Those that satisfy the node test (that is, elements whose name is PERSON) are returned. There are 12 of these nodes. Each of these 12 nodes is then compared against the predicate to see if its position is equal to 3. This turns out to be true for only one node, Elodie Bellau's PERSON element, so that is the single node this XPointer points to.

It is not always the case, however, that an XPointer points to exactly one node. For instance, consider this XPointer:

xpointer(/child::FAMILYTREE/child::PERSON[position()>3])

This is exactly the same as before except that the equals sign has been changed to a greater than sign. Now when each of the 12 PERSON elements is compared, the predicate returns true for nine of them. Each of these nine is included in the location set that this XPointer returns. This XPointer points to nine nodes, not to one.
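The way a location path narrows a location set step by step can be imitated in Python with the standard xml.etree.ElementTree module. The sketch below is not an XPath engine: child_step() is a hypothetical helper that handles only the child axis, the predicate is passed in as an ordinary Python function of the node's position, and the document is a simplified stand-in that keeps only the 12 PERSON elements and their names (as read from the family tree listing).

```python
import xml.etree.ElementTree as ET

# Names of the 12 people, in document order (taken from the family tree listing).
NAMES = [
    "Domeniquette Celeste Baudean", "Jean Francois Bellau", "Elodie Bellau",
    "John P. Muller", "Adolf Eno", "Maria Bellau", "Eugene Bellau",
    "Louise Pauline Bellau", "Charles Walter Harold", "Victor Joseph Bellau",
    "Ellen Gilmore", "Honore Bellau",
]

# Simplified stand-in document: FAMILYTREE with one PERSON per name.
family_tree = ET.Element("FAMILYTREE")
for person_name in NAMES:
    person = ET.SubElement(family_tree, "PERSON")
    ET.SubElement(person, "NAME").text = person_name

# ElementTree has no root-node object, so a wrapper element stands in for it.
root_node = ET.Element("document-root")
root_node.append(family_tree)


def child_step(context_nodes, node_test, predicate=None):
    """Apply one child:: location step to every node in the context set."""
    result = []
    for context in context_nodes:
        # Node test: keep children whose name matches (or any, for "*").
        matched = [c for c in context if node_test == "*" or c.tag == node_test]
        # Predicate: keep nodes for which it is true at their 1-based position.
        for position, node in enumerate(matched, start=1):
            if predicate is None or predicate(position):
                result.append(node)
    return result


# /child::FAMILYTREE/child::PERSON[position()=3]
step_one = child_step([root_node], "FAMILYTREE")
third = child_step(step_one, "PERSON", lambda position: position == 3)
print([p.find("NAME").text for p in third])

# /child::FAMILYTREE/child::PERSON[position()>3]
later = child_step(step_one, "PERSON", lambda position: position > 3)
print(len(later))
```

The output of each step is fed to the next as its context set, which is the same hand-off the location path performs.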

A family tree

Domeniquette Celeste Baudean
21 Apr 1836
Unknown
Jean Francois Bellau
Elodie Bellau
11 Feb 1858
12 Apr 1898
John P. Muller
Adolf Eno
Maria Bellau
Eugene Bellau
Louise Pauline Bellau
29 Oct 1868
3 May 1938
Charles Walter Harold
about 1861
about 1938
Victor Joseph Bellau
Ellen Gilmore
Honore Bellau