The following paper was originally published in the Proceedings of the Sixth Annual Tcl/Tk Workshop San Diego, California, September 14–18, 1998
XML Support for Tcl
Steve Ball Zveno Pty Ltd
For more information about USENIX Association contact: 1. Phone: 510 528-8649 2. FAX: 510 548-5738 3. Email: [email protected] 4. WWW URL:http://www.usenix.org/ XML Support For Tcl
Steve Ball Zveno Pty Ltd http://www.zveno.com/ [email protected] document. These elements are defined using a Abstract Document Type Definition (DTD), along with XML is emerging as a significant technology for use allowed general entities. The DTD includes rules on both the World Wide Web and in many other which govern what elements or text are allowed to application areas, such as network protocols. occur inside each element. In contrast, HTML gives Documents written in XML have a rich, hierarchical the author a fixed set of tags to use, and those tags structure, the document tree. An application which is have fixed semantics and behaviour. to process XML documents must be able to access and manipulate the document tree in order to be able XML has been designed with the goal of being easier to examine and change the structure. for tool developers to write software for processing documents, so that there will be plentiful applications The DOM is a language-independent specification of available for handling XML. The syntax is far more how an application accesses and manipulates the restricted than SGML, with most optional features document structure. TclDOM is a Tcl language removed. For example, tag minimisation and binding for the DOM. The TclDOM specification omission are not allowed so it is easy to detect the provides a standard API for Tcl applications to end of an element. Empty elements must be explicitly process a XML or HTML document. marked.
TclXML is a Tcl package which provides a sample There are two levels of conformity for a XML implementation of TclDOM. It provides XML document. At the very least, a XML document must parsers along with the tools needed to create a be well-formed. A well-formed XML document hierarchical representation of documents which can adheres to all of the syntactic rules of the XML be conveniently processed by a Tcl script. There are specification. All elements must have an end tag, also facilities to check the validity of a document, unless they are empty elements in which case the tag along with commands to produce document output. must include a trailing slash, such as
. Element TclXML provides a framework for parser and attributes must have a value and the value must be validator modules which allows some or all of the quoted. There may be no occurrances of illegal various components to be implemented in an characters, and so on. XML has been designed so that extension language. a program can check that a document is well-formed without having to use the document's DTD. Keywords: Tcl, World Wide Web, WWW, XML, DOM, Parsing In addition to being well-formed, a XML document may also be valid. A valid XML document conforms A Brief Introduction To XML and DOM to the rules laid out in its associated DTD. This means that all elements have content which is XML allowed according to the element's definition of its content model in the document's DTD. All attributes The eXtensible Markup Language, XML [XML], is a used in elements must also be permitted for that Recommendation from the World Wide Web element and their value must be of the correct type. Consortium (W3C) [W3C] which is a significant All attribute values which are used for identification simplification of SGML, intended to make a subset must be unique, and so on. Validity is a much SGML suitable for use on the Web. XML provides a stronger assertion for a document than well- much more meaningful and rich way in which to formedness, but requires access to the document's represent semantic structure in a document when DTD and more processing time to check the various compared to HTML. With XML, a document author rules. can "call a Spade a
Access to all of the features of the DOM has the Plume advantage of familiarity to developers who have a previous knowledge of the DOM. The DOM is also a Plume [Ball98A] is a general-purpose WWW collaborative design effort which ensures that the browser. Plume version 1.0 (never officially released interface will be comprehensive, allowing sufficient publically) extended the html_library parser to expressiveness of the interface for a general-purpose build a tree representation of the parsed HTML or document scripting library such as TclXML. XML document. This representation was then presented to the application with the tree structure, in Providing all of the DOM features goes a long way a format known as "XAPI-Tcl", the XML API for towards allowing access to DOM implementations in Tcl. XAPI-Tcl used a nested Tcl list structure to other languages, because there is then a represent the document tree, with certain commands straightforward mapping between the used to distinguish between elements, character data, implementations. It is fair to say that the majority of processing instructions, and so on. For example, the work in writing XML processors is being done using document example given above would be represented Java, so an interface to Java is essential for TclXML as: to leverage this activity. Fortunately, TclBlend and parse:pi xml {version 1.0} {} Jacl give easy access to Java classes from Tcl scripts. parse:pi DOCTYPE {SYSTEM paper.dtd} TclDOM Interface Design {} parse:element paper {} { Tcl and Tk use a number of conventions to make Tcl parse:element title {} { parse:text {TclXML Example} {} {} scripting easier for developers. TclDOM will adopt } these conventions in the design of its API so that Tcl parse:element abstract {} { developers will find it easy to use. For example, parse:text {This is an example of using Tcl commands to manipulate objects and the a XML document instance.} use of configuration options. The design of the } TclDOM API must take into consideration that an parse:element introduction {} { application may need to process more than one parse:text {XML uses the } document concurrently, so a single hierarchical parse:element acronym {} { model, such as Tk's widget hierarchy, would be parse:element short {} { unsuitable. For example, a XML document browser parse:text DTD {} {} } parse:text { parse:element full {} { # Process character data parse:text {Document Type # arg1 is the data Definition} # arg2 and arg3 are unused } } } parse:pi { parse:text {to define classes of # A processing instruction documents. Such a document is known # arg1 is the PI name as a } {} {} # arg2 is data for the PI parse:element definition {} { # arg3 is unused parse:text \ } {document instance} {} {} } } } parse:text . {} {} } As discussed above, an alternative way to process a parse:element point {} { document is to evaluate the parsed data structure. All parse:text {A DTD is used to arguments are appropriately quoted, so this is a safe define the elements and entities operation to perform. In this case, the application that are allowed to appear in a defines procedures with the same name as those document. Empty elements have a names used to distinguish the features of the trailing slash: } document. By default, these are: parse:element figure {image parse:element figure-1.png} {} } To denote an element, option - parse:element conclusion {} { elementcommand. parse:text {XML is way better parse:text than HTML.} } To denote character data, option - parse:element references {} {} textcommand. } parse:pi This format may be interpreted as a (nested) Tcl list To denote a processing instruction, option - or evaluated as a Tcl script. Internally the parser, picommand. xml::parse, manipulates the document as a Tcl list. Since in Tcl version 8.0 (and above) list traversal parse:comment is relatively fast the parser also returns the parsed To denote a comment, option - data structure as a Tcl list. However, this format is commentcommand. not suitable for evaluation so a command, xml::cvtscript, is provided to convert this Different names may be used by specifying options representation into one which may be evaluated, as to the xml::parse command. This processing shown above. method may be used as such:
To facilitate access to the data structure using Tcl list proc parse:element {name attributes commands, in particular the foreach command, content} { dummy arguments are appended to the entries used eval $content for text and processing instructions. This allows a } construct to be used such as: proc parse:text {text unused unused} { foreach {type arg1 arg2 arg3} puts $text [xml::parse $MyDocument] { } switch $type { parse:element { eval [xml::cvtscript [xml::parse # Process element $MyDocument]] # arg1 is the tag name The disadvantage of using an explicit Tcl # arg2 is the attribute list representation for the parsed data structure is that the # arg3 is the element content } opportunity is lost to implement the document tree using another language, such as C or Java. Accessing and manipulating the tree structure may be quite slow oriented, whereas Tcl is not. Common Tcl using Tcl. Also, it can be difficult to dynamically conventions are used to overcome this problem, by modify the document structure, for example for defining class creation commands and object instance document editing purposes, and to navigate the tree commands. TclDOM defines a number of commands from an arbitrary starting point, for example if a tree within the dom namespace which correspond to the node is passed as an argument. IDL interfaces defined in the DOM specification. These commands produce and accept "tokens" as CoST arguments for referring to nodes in the document tree. The DOM implementation may use these tokens CoST [English96] is a system for processing SGML to lookup the node in an internal data structure. This documents written by Joe English based on James allows for efficient implementation in extension Clark's sgmls library. It has been adapted for use languages. with Tk in an application called CoSTWish [PM- R96], and is being used for processing XML The DOM specification provides a class for accessing documents for the DOM specification [Nicol98]. a list of nodes and a class to manipulate strings. These are unnecessary in Tcl, as Tcl already has a CoST's API is not directly suitable for use with rich set of primitives for manipulating lists and TclDOM. CoST has the notion of a document node, strings, but these interfaces may be emulated for the but nodes are not exposed to the application and there sake of compatibility. TclDOM defines methods for is no equivalent to a DOM "Iterator" for traversing retrieving a list of the children of a node, the parent sequences or trees of nodes. Instead, CoST supplies of a node, the list of attributes for an element, and so explicit methods for performing iterative operations on. An application may use these methods to traverse on document tree nodes, such as withNode or the document tree, much like a Tk script traverses the foreachNode. In addition, CoST does not provide widget hierarchy. Tcl lists are ordered, which is an support for creating or modifying documents, nor important property for representing the children of an does it provide access to the document's DTD. element node. One potential problem with this approach is that in an application where the Another difference between DOM and CoST is the document is being dynamically updated the notion of a current node. In CoST there is always a document tree may change after a list of nodes has node which is current, and operations may be been generated. However, this may also be seen as an performed upon that node by CoST methods. advantage when compared to Plume's nested list However, in the DOM no actual node is specified as approach where changing the document tree can be being current, instead there is a pointer to the position cumbersome. between two nodes. The DOM then allows the node before or after the current pointer to be retrieved. The Attribute lists are also defined in terms of a Tcl list, DOM specifies node positions in this way to allow but are represented as name/value pairs. This is traversal of a dynamically changing document tree. convenient for use in conjunction with the array With the DOM there is no danger of the current node set Tcl command for accessing attributes via a Tcl being deleted and so the current node reference array. Attribute lists are unordered, so storing these in becoming invalid. a Tcl array is satisfactory.
The TclDOM API The command necessary to perform parsing and serialisation a XML document instance is not A preliminary API for TclDOM has been designed to explicitly provided by the DOM specification. For satisfy the constraints outlined above. The namespace TclDOM these functions may be made implicit by dom shall be reserved for use by the TclDOM defining that XML documents are stored as an package. Layered packages may also be used for internal representation of a Tcl Object. In this way, a supporting specific markup languages. The XML document will be stored initially as a string, but namespaces xml and html are initially reserved for when accessed by a DOM function it will be parsed XML and HTML respectively. into the implementation's internal data structure. DOM is defined using the OMG IDL interface When the string representation is required the internal specification, a language-neutral definition language. structure is serialised. The only problem with this This presents some difficulties for defining the approach is that an implementation would be difficult TclDOM specification because IDL is object- to write as a pure Tcl script, since the internal representation of a Tcl Object cannot be accessed from the Tcl script level. instance} dom::document createTextNode $in . Example # Add an ID attribute to
# Create a document in memory # Write out the XML document set docRoot [dom::document set ch [open example.xml w] createDocumentFragment] set paper [dom::document # No explicit serialisation createElement $docRoot paper] puts $ch $docRoot set title [dom::document createElement $paper title] close $ch dom::document createTextNode $title {TclXML Example} TclXML dom::document createTextNode TclXML is a Tcl package which provides the [dom::document createElement facilities needed by a Tcl script to parse XML $paper abstract] {This is an documents, traverse and manipulate their structures example of a XML document instance} set in [dom::document and to generate XML documents. The package createElement $paper introduction] provides an implementation of TclDOM (see below) dom::document createTextNode $in for accessing and manipulating the document {XML uses the } structure. It also provides a framework for the various set acronym [dom::document components needed to parse and generate documents, createElement $in acronym] allowing different implementations to be used dom::document createTextNode together in a seamless fashion. [dom::document createElement $acronym short] DTD Applications wishing to use TclXML will have dom::document createTextnode different requirements, both in terms of functionality [dom::document createElement and in terms of performance. Some applications may $acronym full] {Document Type require an event-based parser, others may require a Definition} tree-based representation. An application may only dom::document createTextNode $in need to check that a document is well-formed, {to define classes of documents. whereas another may need to validate its documents. Such a document is known as a } Figure 1 outlines the various requirements needed. dom::document createTextNode [dom::document createElement $in definition] {document Figure 1: XML Processing Requirements of Applications same commands to construct a document in-memory At present TclXML provides two non-validating itself. XML parsers. A "native" parser written entirely in expat Tcl and a Tcl interface to James Clark's Compliant Parsers [Clark98] parser called TclExpat. Both of these parsers are event based, ie. they produce a stream of As can be seen from Figure 2, TclXML aims to allow "document events", such as the start and end of any compliant parser to be used as a component of elements. TclXML also provides a utility to construct the framework. An example might be to replace the a tree based representation of the document, the Tree built-in Tcl parser with a SAXDOM-compliant parser Builder. This utility uses the event stream produced written in Java. Once a driver is written for the by either of the parsers to construct the document SAXDOM interface, any parser written using that tree. Finally, TclXML will provide a document interface can be used with TclXML. validator. The validator will include a DTD parser and will check the complete tree representation of a TclExpat document for validity according to the document's DTD. An Tcl extension, called TclExpat, has been created to provide an interface to James Clark's All of these facilities have been exposed to the expat XML parser. This extension has been written application developer, since the different parts may for Tcl version 8.0 and 8.1 (alpha2) as a dynamically be useful for different applications. Some application loadable library. The extension has been tested on may process a XML document as a stream, in which Linux, Solaris and HP/UX. Compiling the extension an event-based parser is most useful. Other on other platforms supported by Tcl should present applications may need to traverse the document no difficulties. An expat parser is created using the structure, in which case a tree-based parser is expat command. When a parser is created a required. Also, the modular construction of the corresponding Tcl command is registered to access TclXML toolkit, makes it a simple matter to replace the parser, in the usual Tcl fashion. expat allows components of it with packages written in other callbacks to be registered for various "document languages. TclXML already includes expat, which is events" such as the start of an element, the end of an written in C, as an example. element, character data, processing instructions, and so on. The purpose of TclExpat is to allow Tcl An event-based parser which is used within the scripts to be invoked as these callbacks. An TclXML framework is configured to issue calls to the application may configure a parser with various TclXML Tree Builder, which then issues calls to the options to set the callback scripts. Certain callback TclDOM module to construct the document tree. A scripts have arguments appended to them before tree-based parser would make calls directly to the evaluation, such as the name of an element and its TclDOM module. The Tcl application can call the attribute list for the start of an element. Figure 2: TclXML Modular Structure This callback is invoked when data is The main callback options for the TclExpat parser encountered for which there are no other are as follows: handlers registered. -elementstartcommand name attributes In addition to supporting straightforward callback Invoked when an element starts. Arguments are scripts, TclExpat also allows these scripts to use the name of the element and its attribute list, the continue and break commands to alter how given as a name-value paired list. the document is processed. If the callback script returns a TCL_CONTINUE code then callback -elementendcommand name invocation is suspended until the currently open Invoked when an element ends. The name of the element is closed. A return code of TCL_BREAK element is appended. causes all further callback invocation to be skipped. -datacommand data If a callback script returns an error condition all further callbacks are skipped and the parser also Invoked when character data is encountered. The returns an error condition. TclExpat handles data is appended. aborting the processing of document callbacks even -processinginstructioncommand name though expat itself does not support this feature. data An example of using TclExpat is as follows: Invoked when a processing instruction is encountered. Arguments are the name of the package require expat processing instruction and the instruction's data. -defaultcommand data set parser [expat xmlParser] $parser conf -elementstartcommand CountElements skipped $parser conf