The XML Framework and Its Implications for Corpus Access and Use
Total Page:16
File Type:pdf, Size:1020Kb
The XML Framework and Its Implications for Corpus Access and Use Nancy Ide Department of Computer Science Vassar College Poughkeepsie, NY 12604-0520 USA [email protected] Abstract The eXtensible Markup Language (XML) (Bray, et al., 1998) is the emerging standard for data representation and exchange on the World Wide Web. The XML Framework includes very powerful mechanisms for accessing and manipulating XML documents that are likely to significantly impact the way annotated corpora are created and accessed. This paper outlines a few of the possibilities. Introduction in the document itself), and allows addressing text The eXtensible Markup Language (XML) (Bray, et al., fragments within a particular element by providing 1998) is the emerging standard for data representation and predicates for manipulating chains of characters; • exchange on the World Wide Web. At its most basic level XPointer (DeRose, Daniel, & Maler, 1999), which XML is a document markup language directly derived extends XPath syntax to allow addressing points and from SGML (i.e., allowing tagged text (elements), ranges as well as nodes, locating information by element nesting, and element references). As such, string matching, and use of addressing expressions in translation of an SGML encoded document into XML is URI-references as fragment identifiers. relatively trivial. However, various features and For example, the Xpath expression extensions of XML make it a far more powerful tool for /div/p[2]/s[3] specifies the third <s> (sentence) data representation and access than SGML. The following element within the second <p> (paragraph) element within outlines some of these mechanisms and suggests ways in each <div> (text division) element; /descendant::p which they can be used for creation and exploitation of specifies all <p> elements in the document. In addition, annotated corpora. Xpath allows addressing text fragments within a particular element by providing predicates for manipulating chains XML links of characters. The expression The recommended practice in encoding annotated substring(/p/s[2]/text(),6) corpora is to maintain all or most annotations in separate selects the string "one would expect that the whole sky documents, each of which references appropriate locations would be as bright as the sun, even at night." from the in the document containing the original data (Ide & Brew, following text: 2000). This strategy yields, in essence, a finely linked <p><s id="d3p13s4">The difficulty hypertext format where the links specify a semantic role is that in an infinite static universe rather than navigational options. That is, links signify the nearly every line of sight would end on the surface of a star.</s><s location(s) where markup contained in a given annotation id="d3p13s5">Thus one would expect document would appear in the document to which it is that the whole sky would be as bright linked. As such, annotation information comprises remote as the sun, even at night.</s></p> or "stand-off" markup that is virtually added to the The expression document to which it is linked. In principle, the original data could contain no markup at all (or, more likely, substring(/p/s[2]/text(), 10, 12) markup for gross logical structure only); all markup could • selects "would expect". Thus the reference is made by be retained in separate documents with links into the specifying (1) the address (absolute or relative) of the original based on offsets. element closest to the substring to be referred to, and The standoff scheme, then, requires addressing XML (2) the substring within this element. elements, as well as characters and chains of characters within those elements. It also requires that elements and The Xlink mechanism can be used to link corresponding characters can be addressed both within the same segments of two or more primary texts (As for alignment document and in other XML documents. XML provides of text or speech), or to link annotation documents to a the following linking mechanisms, which are substantially base document containing the primary text. For example, more powerful than the mechanisms provided in SGML, in the following, annotation information (e.g., morpho- which satisfy these requirements: syntactic information) about a specific token (<tok>) is • XLink (DeRose, et al., 2000), a mechanism for linked to the string of characters in the original text to specifying a link (uni-directional or more complex which it applies: linking structures) between two or more resources or <tok portions of resources; xlink:href= • the XML Path Language (XPath) (Clark & DeRose, "substring(/p/s[2]/text(),10,12)"> 1999), an extended addressing syntax that defines a Although this example shows linkage for text, the concise notation for element localization in the mechanism provides for linking resources in any medium document tree (as defined by the nesting of elements (audio, video, etc.), which allows for linking speech, external images, video, applets, form-processing A suite of documents representing a base text (or programs, style sheets, etc. texts) and its annotations can be manipulated to serve any In addition to specifying the target location for application that relies on part or all of its contents. Thus, information in the same or external documents, XLink XSLT is likely to have the most to offer for manipulation attributes can be used to specify the role of the link, i.e., of and access to annotated corpora. how the link should be activated (by hand, or XSLT is relatively complex and will not be described automatically by the browser) and what to do with the in detail here.1 A short example can provide some idea of target fragment (replace it or insert it into the source the possibilities. Using as input a document containing document). morpho-syntactic information (e.g., a document In XML, annotated fragments are referenced by the containing the fragment in Figure 12), the XSLT URI (remote or local) of the target resource, and an document in Figure 2 can be used to create an HTML extended pointer identifying a element and, where document that displays a text in "word | lemma | pos" necessary, the selected substring of that element's content, form. When the resulting HTML document is loaded into as in the following: a browser, it will display the following: <tok It|it|PPER3 was|be|PAST3 a|a|DINT xlink:href= bright|bright|ADJE cold|cold|ADJE "http://www.loria.fr/doc.xml#xptr day|day|NN… (substring(/p/s[2]/text(), 10, 12))"> Annotation resulting from automatic processing <?xml version="1.0"> (marking of sentence boundaries, tokens, links between <chunk type="BODY" lang="en" parallel texts, etc.) often includes thousands of links to the xml:base= same external document. Repetition of the document "http://www.cs.vassar.edu/~ME/Oen.xcesDoc#"> <par xlink:href="xptr(substring(//p[1]"> name on every relevant element in an annotation <s xlink:href="xptr(substring(//p/s[1]"> document would obviously significantly multiply its size. <tok type="WORD" XML includes an attribute xml:base (Marsh, 2000) that xlink:href= can be used to specify inheritance of an attribute. For "xptr(substring(//p/s[1]/text(),1,2"> example, in the following text: <orth>It</orth> <disamb> <chunk <base>it</base> xml:base= <msd>Pp3ns</msd> "http://www.mysite.edu/doc.xml#"> <ctag>PPER3</ctag></lex> <tok <lex> xlink:href="xptr(substring <base>it</base> (/p/s[2]/text(), 10, 12))"/> <msd>Pp3ns</msd> <tok <ctag>PPER3</ctag></lex></tok> xlink:href="xptr(substring <tok type="WORD" (/p/s[2]/text(), 24, 4))"/> xlink:href= </chunk> "xptr(substring(//p/s[1]/text(),4,2"> the value of the attribute xml:base specified on the <orth>was</orth> <disamb> <chunk> element is inherited by the two <tok> elements <base>be</base> that are its children, and therefore need not be re- <msd>Vmis3s</msd> specified. The inclusion of xml:base in the XML <ctag>PAST3</ctag></lex> specification ensures that conformant XML processors <lex> <base>be</base> will handle it (unlike SGML). <msd>Vais1s</msd> <ctag>AUX1</ctag></lex> XML transformations <lex> <base>be</base> The Extensible Style Language (XSL) is a part of the <msd>Vais3s</msd> XML framework, consisting of two parts: the best known <ctag>AUX3</ctag></lex> is the XSL formatting or "style sheet" language; and a <lex> powerful tree-traversal language, XSLT (Clark, 1999), <base>be</base> <msd>Vmis1s</msd> that can be used to convert any XML document into <ctag>PAST1</ctag></lex> another document in any form (e.g., XML, well-formed <lex> HTML, plain text, etc.) by selecting, rearranging, and/or <base>be</base> adding information to it. The transformed documents may <msd>Vmis3s</msd> or may not be intended for rendering data on a computer <ctag>PAST3</ctag></lex></tok>… screen, but may be used simply to move data from one computer system or program to another (e.g., to transduce between encoding and/or annotation formats, etc.). Figure 1 : Fragment of a cesAna document XSLT supports the following kinds of document manipulation: • selection of elements or portions of element content using the XPath syntax; 1 Full documentation is available at http://www.w3.org/TR/xslt. • rearrangement or transformation of extracted 2 Note that this document, encoded according to the xcesAna information (including not only text content but also specifications (Ide, Bonhomme, & Romary, 2000), contains full element names, etc.) in the target document; sementation and annotation information, including full morpho- • addition of information in the target document. syntactic specifications for all potential annotations and the results of automatic disambiguation. different content models and attributes in order to provide <xsl:stylesheet version="1.0" for tight validation of form in each context. For example, xmnls:xsl= in the SGML-based CES (Ide, 1998a, b), header elements "http://www.w3.org/1999/XSL/Transform"> are prefixed with "h." so that names in headers are tagged <xsl:template match= “/”> with <h.name>, whose content model is different from <html> that of the <name> element that can appear in the body of <body> the text.