Schema-Based Compression of XML Data with Relax NG
JOURNAL OF COMPUTERS, VOL. 2, NO. 10, DECEMBER 2007

Christopher League and Kenjone Eng
Long Island University Computer Science, Brooklyn, NY, USA
Email: {christopher.league, kenjone.eng}@liu.edu

Abstract: The extensible markup language XML has become indispensable in many areas, but a significant disadvantage is its size: tagging a set of data increases the space needed to store it, the bandwidth needed to transmit it, and the time needed to parse it. We present a new compression technique based on the document type, expressed as a Relax NG schema. Assuming the sender and receiver agree in advance on the document type, conforming documents can be transmitted extremely compactly. On several data sets with high tag density this technique compresses better than other known XML-aware compressors, including those that consider the document type.

Index terms: XML, data compression, tree compression, Relax NG, compact binary formats

I. MOTIVATION

In recent years, the extensible markup language XML [2] has become indispensable for web services, document markup, conduits between databases, application data formats, and in many other areas. Unfortunately, a significant disadvantage in some domains is that XML is extremely verbose. Although disk capacity is less often a concern these days, transmitting XML-tagged data still requires significantly more bandwidth and longer parse times (compared to a custom binary format, for example).

    start = stm

    stm = element seq { stm+ }
        | element assign { var, exp }
        | element print { exp* }

    exp = element num { attribute val {text}, empty }
        | element id { var, empty }
        | element binop { attribute id {text}?, op, exp, exp }
        | element exp { attribute ref {text} }

    var = attribute var {text}

    op = attribute op { "add" | "sub" | "mul" | "div" }

Figure 1. Relax NG schema in compact syntax. It defines the grammar for a simple sequential language – after Wang et al. [15] – expressed in XML.
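As a rough illustration of why a shared schema helps, consider the choices that the grammar of figure 1 leaves open. When sender and receiver both know the list of alternatives at some point in the grammar, only the index of the chosen alternative has to be transmitted. The Python sketch below uses a plain fixed-width code for that index. It is not the encoding developed in this paper; the function names and the ordering of alternatives are assumptions made for the example.

```python
# Fixed-width encoding of schema-constrained choices (illustrative only).
# Alternatives are listed in the order they appear in figure 1; that
# ordering is an assumption both sides would have to share.
STM_ALTERNATIVES = ("seq", "assign", "print")
OPERATORS = ("add", "sub", "mul", "div")


def bits_for_choice(n_alternatives: int) -> int:
    """Return ceil(log2(n)): bits needed to select one of n alternatives."""
    if n_alternatives < 1:
        raise ValueError("need at least one alternative")
    return (n_alternatives - 1).bit_length()


def encode_choice(alternatives: tuple, chosen: str) -> str:
    """Encode the index of `chosen` as a fixed-width bit string."""
    width = bits_for_choice(len(alternatives))
    if width == 0:
        return ""  # a single alternative carries no information
    return format(alternatives.index(chosen), "b").zfill(width)


def encode_binop_header(op: str, has_id: bool) -> str:
    """One presence bit for the optional id attribute, then the operator."""
    presence = "1" if has_id else "0"
    return presence + encode_choice(OPERATORS, op)


if __name__ == "__main__":
    print(encode_choice(STM_ALTERNATIVES, "assign"))  # root tag choice
    print(encode_binop_header("mul", has_id=True))    # <binop id=... op='mul'>
```

Under these assumptions the root tag costs two bits and a binop header costs three, compared with dozens of bytes for the corresponding tags and attribute names in textual XML. Attribute values such as the id string itself are not covered by this sketch.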
Compression of XML has been studied from a variety of perspectives. Some researchers aim to achieve minimal size [3], [4], [5], others focus on efficient streaming [6], [7], [8] – a balance between bandwidth and encode/decode times – and still others answer XML queries directly from compressed representations [9]. Representations that support queries are necessarily larger than those that do not; Ferragina et al. [10] report increases of 25 to 96% compared to opaque representations. This is not all that bad if querying is a requirement, but we assume it is not and aim for the smallest possible size. Following Levene and Wood [11], we study how a schema (document type) can be used to reach that goal.

Traditionally, one writes a document type definition (DTD) to constrain the sequencing of tags and attributes in an XML document. For most commonly used XML formats, a DTD already exists. The DTD language is simple, but not terribly expressive, so a few competitors have arisen. We focus in particular on Relax NG by Clark and Murata [12]. It is expressive but, unlike XML Schema [13], it has a clean formal model [14] that is the foundation of our compression technique.

Figure 1 contains a simple Relax NG schema written using the compact syntax. (There is an equivalent XML syntax that would make this example about four times longer.) This schema specifies the grammar for a small sequential language with variables, assignment, numbers, and arithmetic expressions. Note the use of regular expression operators (+*?|) and the id/ref attributes for reusing common sub-expressions.

Our original motivation for studying XML compression was to investigate the use of XML to represent abstract syntax trees and intermediate languages in compilers. The language in figure 1 is based on an example given by Wang et al. [15]. They developed an abstract syntax description language (ASDL) for specifying the grammars of languages, generating type definitions in multiple languages, and automatically marshaling data structures to an opaque but portable binary format. We hypothesize that XML and all its related libraries, languages, and tools could replace ASDL and prove useful elsewhere in the implementation of programming languages. Brabrand et al. [16] support this vision; they provide a tool for managing dual syntax for languages – one in XML, and one ‘human-readable’, much like the distinction Relax NG itself employs between its XML and compact syntax.

Figure 2 contains a small program in the sequential language, valid with respect to the Relax NG schema in figure 1. In conventional notation, it might be written as x := 7; print [x*x + (x*x - 5)], but the common sub-expression x*x is explicitly shared via the id/ref mechanism. (One of the limitations of ASDL is that it could represent trees only, not graphs; however, managing sharing is essential for the efficient implementation of sophisticated intermediate languages [17].) These figures will serve as a running example in the remainder of the paper.

    <seq>
      <assign var='x'>
        <num val='7'/>
      </assign>
      <print>
        <binop op='add'>
          <binop id='r1' op='mul'>
            <id var='x'/>
            <id var='x'/>
          </binop>
          <binop op='sub'>
            <exp ref='r1'/>
            <num val='5'/>
          </binop>
        </binop>
      </print>
    </seq>

Figure 2. XML code representing a program in the sequential language of figure 1. When run, the program would output 93 (= 49 + 49 - 5).

The main contribution of this paper is to describe and analyze a new technique for compressing XML documents that are known to conform to a given Relax NG schema. As a simple example of what can be done with such knowledge, the schema of figure 1 requires that a conforming document begin with either <seq>, <assign>, or <print>; so we need at most two bits to tell us which it is. Similarly, the <binop> tag has an optional id attribute and a required op attribute. Here, we need just one bit to indicate the presence or absence of the id and two more bits to specify the arithmetic operator. (The data values are separated from the tree structure and compressed separately.) For this system to work, the sender and receiver must agree in advance on precisely the same schema. In that sense, the schema is like a shared key for encryption and decryption.

We are not the first to propose compressing XML relative to the document type – an analysis of related work appears in section II – but ours is one of the few successful implementations, and the first such effort in the context of Relax NG. In section IV we report results that, on several data sets, improve significantly on other known techniques. The algorithm itself is detailed in section III and we close in section V by discussing limitations, consequences, and directions for future research.

This paper is based on “Type-Based Compression of XML Data,” by C. League and K. Eng, which appeared in the Proceedings of the IEEE Data Compression Conference (DCC), Snowbird, Utah, March 2007 [1].

© 2007 ACADEMY PUBLISHER

II. RELATED WORK

There are a few common themes among XML compression techniques. One is separating the tree structure from the text or data. Another is finding ways to regroup the data elements for better performance.

There are also a few distinguishing features. Algorithms differ in whether they preserve the ‘ignorable’ white space outside of text nodes. (Those that discard it are not lossless in the traditional sense, but are perfectly acceptable given the content model of XML.) Also, algorithms differ in the extent to which they use the schema or DTD. Some ignore it entirely, others optionally use it to optimize their operations, and some (like ours) regard it as essential.

A. General XML Compression

Liefke and Suciu [3] implemented XMill, one of the earliest XML-aware compressors. Its primary innovation was to group related data items into containers that are compressed separately. To cite their example, “all <name> data items form one container, while all <phone> data items form a second container.” This way, a general-purpose compressor such as deflate [18] will find redundancy more easily. Moreover, custom semantic compressors may be applied to separate containers: date attributes, for example, could be converted from character data into a more compact representation. One of the main disadvantages of this approach is that it requires custom tuning for each data set to perform well.

Girardot and Sundaresan [6] describe Millau, a modest extension of the WAP binary XML format that uses byte codes to represent XML tag names, attribute names, and some attribute values. Text nodes are compressed with a deflate/zlib algorithm and the types of certain attribute values (integers, dates, etc.) are inferred and represented more compactly than as character data. Millau does not require a DTD, but if one is present it can be used to build and optimize the token dictionaries in advance.

Cheney’s xmlppm [4] is an adaptation of the general-purpose prediction by partial match compression [19] to XML documents. The element names along the path from root to leaf serve as a context for compressing the text nodes. They provide benefits similar to those of XMill’s containers. Adiego et al. [5] augment this technique with a heuristic to combine certain context models; this yields better results on some large data sets. In our experiments, the performance of xmlppm was nearly always competitive; it was one of the main contenders.

Toman [20] describes a compressor that dynamically infers a custom grammar for each document, rather than using a given DTD. There is a surface similarity with our technique in that a finite state automaton models the element structure, and is used to predict future transitions.