Exploring and Extracting Nodes from Large XML Files
Guy Lapalme
January 2010

Abstract

This article shows how to deal simply with large XML files that cannot be read as a whole into memory and for which the usual XML exploration and extraction mechanisms either cannot work or are very inefficient in processing time. We define the notion of a skeleton document that is maintained as the file is read using a pull-parser. It is used for showing the structure of the document and for selecting parts of it.

1 Introduction

XML has been developed to facilitate the annotation of information to be shared between computer systems. Because it is intended to be easily generated and parsed by computer systems on all platforms, its format is based on character streams rather than internal binary ones. Being character-based, it also has the nice property of being readable and editable by humans using standard text editors.

XML is based on a uniform, simple and yet powerful model of data organization: the generalized tree. Such a tree is defined as either a single element or an element having other trees as its sub-elements, called children. This is the same model as the one chosen for the Lisp programming language 50 years ago. This hierarchical model is very simple and allows a simple annotation of the data.

The left part of Figure 1 shows a very small XML file illustrating the basic notation: an arbitrary name between < and > symbols is given to a node of a tree. This is called a start-tag. Everything up until a corresponding end-tag (the same tag except that it starts with </) forms the content of the node, which can itself be a tree. Such nodes in the tree (r, a, b and c in Figure 1) are called elements. Elements can also contain character data (c in Figure 1) and even mix character data and elements (b in Figure 1).
An XML element with no content can be indicated with an end-tag immediately following a start-tag and can be abridged as an empty-element tag: a start-tag with a terminating / (a in Figure 1). Additional information can be added to a start-tag with attribute pairs comprising the name of the attribute (id in tags a, b and c in Figure 1), an equal sign and the corresponding character string value within quotes (e.g. "01"). Attributes can also be added to an empty element (a in Figure 1).

    <r>
      <a id="i1"/>
      <b id="i2">
        <c id="i3">e</c>d<c id="i4">f</c>
      </b>
    </r>

    Node                 XPath expression
    r                    /r
      a  @id="i1"        /r/a
      b  @id="i2"        /r/b
        c  @id="i3"      /r/b/c
          e              /r/b/c/text()
        d                /r/b/text()
        c  @id="i4"      /r/b/c
          f              /r/b/c/text()

    (r, a, b and c are element nodes; e, d and f are text nodes)

Figure 1: Small XML file on the left with, on the right, the corresponding tree structure and the XPath expressions describing the access to each node.

The right part of Figure 1 shows that this notation is equivalent to a tree data structure where each node is labelled with its name and attributes. Character data appears as leaf nodes. An empty element is a node with no sub-tree.

XML is widely used in computing systems to systematize structured data as an alternative to databases. Many relational databases also offer XML-specific features for indexing and searching. Because of the portability of its encoding and the fact that XML parsers are freely available, it is also used for many tasks requiring flexible data manipulation: transferring data between systems, serving as configuration files for programs and keeping information about other files.

XML has the (well deserved) reputation of being verbose, but it must be kept in mind that this notation is primarily aimed at communication between machines, for which verbosity is not a problem but uniformity of notation is a real asset. However, the size of XML files can grow fast (in some cases, they can be many megabytes long) and this complicates the access to their content (i.e. identification and output of a few nodes).
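As a rough illustration of the correspondence shown in Figure 1, the following Python sketch walks the tree of that small document and writes out the XPath expression of each node. The function name `node_paths` is ours, chosen for illustration, and ignoring whitespace-only character data is an assumption. The sketch relies on the standard `xml.etree.ElementTree` module, which loads the whole document in memory, something that is only reasonable for a file as small as this one.

```python
import xml.etree.ElementTree as ET

# The document of Figure 1, written on one line (no whitespace between tags).
DOC = '<r><a id="i1"/><b id="i2"><c id="i3">e</c>d<c id="i4">f</c></b></r>'


def node_paths(elem, parent_path=""):
    """Yield (xpath, detail) for elem and its descendants in document order.

    detail is the attribute dictionary for an element node and the
    character string for a text node.
    """
    path = f"{parent_path}/{elem.tag}"
    yield path, dict(elem.attrib)
    # ElementTree keeps the text found before the first child in .text ...
    if elem.text and elem.text.strip():
        yield f"{path}/text()", elem.text.strip()
    for child in elem:
        yield from node_paths(child, path)
        # ... and the text found after a child in that child's .tail
        if child.tail and child.tail.strip():
            yield f"{path}/text()", child.tail.strip()


paths = list(node_paths(ET.fromstring(DOC)))
for xpath, detail in paths:
    print(xpath, detail)
```

Each element of Figure 1 appears once with its attributes, and the three text nodes (e, d and f) appear as `text()` steps under their parent element.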
XML file exploration and extraction tools rely on XPath [5] processors that evaluate a tree-oriented expression over the global tree structure to select nodes in the tree. In order to evaluate such an expression, the XPath processor first reads the file and builds, in memory, an internal tree structure over which the expression is evaluated. This is not a problem with small files (less than a megabyte or two) but it becomes problematic for larger ones because, in memory, the tree structure is much larger than the original text file.

2 Related work

The problem of transforming large XML files has long been recognized and the best attempt to deal with it is STX [7], an XML-based language for transforming XML documents into other XML documents without building the full tree in memory. It defines a new interrogation and transformation language that adds some restrictions to the standard XPath [5] and XSLT [12]. These modifications are aimed at guaranteeing that the transformation templates can be applied in streaming mode over events that are received sequentially from an XML parser. It relies on a data model that keeps in memory, at any given time, a single node of the tree and a stack of its parents up to the root. Joost [4] is a prototype implementation of STX in Java and XML::STX [6] is a Perl implementation.

Our approach relies on a similar representation but it is more modest in its goal: we do not aim at transforming an XML document; we restrict ourselves to the extraction of a small part of it. The extraction mechanism is a single XPath expression with small constraints that give the programmer intuitive control over the amount of memory needed to process the file.

XStream [11] is another approach for the efficient streaming transformation of XML files, using a declarative functional language based on OCaml which is quite different from the usual XPath language.
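The streaming data model mentioned above (the current node plus the stack of its ancestors) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the STX engine nor the tool described in this article: the function name `extract_text` is hypothetical, the selection is limited to a plain absolute path such as /r/b/c, and the events come from the standard `xml.dom.pulldom` pull-parser.

```python
from xml.dom import pulldom

# The document of Figure 1, written on one line (no whitespace between tags).
DOC = '<r><a id="i1"/><b id="i2"><c id="i3">e</c>d<c id="i4">f</c></b></r>'


def extract_text(xml_text, wanted_path):
    """Return (texts, max_depth) for a plain absolute path.

    texts holds the character data found directly inside each element
    whose path equals wanted_path; max_depth is the largest number of
    open elements, i.e. the size reached by the ancestor stack.
    """
    stack = []      # names of the currently open elements, root first
    texts = []
    max_depth = 0
    for event, node in pulldom.parseString(xml_text):
        if event == pulldom.START_ELEMENT:
            stack.append(node.tagName)
            max_depth = max(max_depth, len(stack))
            if "/" + "/".join(stack) == wanted_path:
                texts.append("")          # a new matching element starts
        elif event == pulldom.CHARACTERS:
            # Character data may arrive in several chunks, so accumulate.
            if "/" + "/".join(stack) == wanted_path:
                texts[-1] += node.data
        elif event == pulldom.END_ELEMENT:
            stack.pop()
    return texts, max_depth


texts, max_depth = extract_text(DOC, "/r/b/c")
print(texts, max_depth)
```

The only state the sketch itself keeps is the stack of open element names and the selected text, so its memory use depends on the depth of the document and on the size of the extracted part, not on the size of the file. For a file on disk, `pulldom.parse(filename)` can be used in place of `pulldom.parseString`.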
Our approach does not require learning any formalism other than the original XPath, but we do not try to transform the original large file, only to extract a small part of it.

A different approach to the extraction of nodes from large XML files is projection [15], applied in the context of XQuery. The system analyzes the query and then parses the XML file, keeping in memory only the relevant parts of the document, so as to ensure that the result of the query is the same in the projected document as in the original document. This analysis is relatively delicate and has been implemented in OCaml in the Galax XQuery processor. Saxon (Enterprise Edition) also supports this capability in its XQuery implementation. Our approach can be considered a simplified projection which does not create a projected document but which outputs on the fly only the projected part of the document corresponding to an XPath expression.

Another interesting approach to dealing with large XML files is the Gadget XML inspector [16], which produces an index of all XPath expressions for locating nodes in a file. That index is kept in a database and then used for faster access to the content of the XML file. We use a similar idea to explore the XML file but our approach is much simpler: we do not keep track of these expressions, we merely write them out and let other tools deal with this textual output.

3 Context of use

In order to extract small portions of large text files, we often resort to reading each line sequentially and deciding for each whether it is to be kept or not. A well-known utility for this operation is grep, which reads a file line by line and copies only the lines matching a given regular expression. Our goal is to find a utility similar to grep but operating on XML files. In the case of tree-oriented XML files, grep is not very useful because the regular expression cannot take into account the embedded structure of the tags.
Worse, in many cases XML files are composed of only one line (to save space...) because end-of-lines do not have any special meaning except within character content. It is the tree structure that matters and it must be taken into account in the extraction process.

In this article we propose an equivalent of a line in a tree-structured XML file in order to be able to process the file sequentially by evaluating a somewhat restricted, but nevertheless useful, form of XPath expression for selecting some parts of an XML document without having to build the tree in memory before evaluation. There is no need

    Left: all expressions, in reading order
        /r
        /r/a
        /r/b
        /r/b/c
        /r/b/c/text()
        /r/b/text()
        /r/b/c
        /r/b/c/text()

    Center: count of each distinct expression
        2 /r/b/c
        2 /r/b/c/text()
        1 /r
        1 /r/a
        1 /r/b
        1 /r/b/text()

    Right: expressions with attribute and text values
        /r
        /r/a[@id="i1"]
        /r/b[@id="i2"]
        /r/b[@id="i2"]/c[@id="i3"]
        /r/b[@id="i2"]/c[@id="i3"]/text()[.="e"]
        /r/b[@id="i2"]/text()[.="d"]
        /r/b[@id="i2"]/c[@id="i4"]
        /r/b[@id="i2"]/c[@id="i4"]/text()[.="f"]

Figure 2: XPath expressions describing the content of Figure 1: the left part shows all XPath expressions as they are encountered during reading of the file; the center part displays the count of each distinct XPath expression and the right part also shows the values of the attributes and the text nodes.
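The three listings of Figure 2 can be reproduced with a short Python sketch that reads the document of Figure 1 one event at a time. It is offered as an illustration under stated assumptions and not as the program described in this article: the name `xpath_events` and its `with_values` parameter are hypothetical, whitespace-only character data is skipped, and the events come from the standard `xml.dom.pulldom` pull-parser.

```python
from collections import Counter
from xml.dom import pulldom

# The document of Figure 1, written on one line (no whitespace between tags).
DOC = '<r><a id="i1"/><b id="i2"><c id="i3">e</c>d<c id="i4">f</c></b></r>'


def xpath_events(xml_text, with_values=False):
    """Yield the XPath expression of each node in reading order.

    With with_values=True, attribute values and text content are added
    as predicates, as in the right part of Figure 2.
    """
    stack = []  # one path step per currently open element
    for event, node in pulldom.parseString(xml_text):
        if event == pulldom.START_ELEMENT:
            step = node.tagName
            if with_values:
                for name, value in node.attributes.items():
                    step += f'[@{name}="{value}"]'
            stack.append(step)
            yield "/" + "/".join(stack)
        elif event == pulldom.CHARACTERS and node.data.strip():
            predicate = f'[.="{node.data}"]' if with_values else ""
            yield "/" + "/".join(stack) + "/text()" + predicate
        elif event == pulldom.END_ELEMENT:
            stack.pop()


in_order = list(xpath_events(DOC))                       # left part
counts = Counter(in_order).most_common()                 # center part
detailed = list(xpath_events(DOC, with_values=True))     # right part

for count, xpath in ((c, p) for p, c in counts):
    print(count, xpath)
```

Because the expressions are produced as plain lines of text, they can be passed on to line-oriented tools for sorting, counting or filtering.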