A Method for Xquery Transform Implementation Based on Shadow Mechanism

A Method for XQuery Transform Implementation Based on Shadow Mechanism Maria Rekouts, Maxim Grinev, Alexander Boldakov Institute for System Programming of Russian Academy of Sciences, Russia Abstract a node (and its descendants) and allows the user to access both modified and unmodified nodes. This modification is The XQuery Update Facility has been recently published specified as an update operation. Generally speaking, the by the World Wide Web Consortium as a working draft. user can obtain ”modified” data without modifying the data Among other features, the working draft presents a novel itself in a single query. It is worth pointing out, that there is powerful transform expression. A transform expression cre- an alternative for transform expressions in XQuery 1.0: the ates a new modified copy of an XML subtree, where the same query can by written by means of recursive function modifications are specified as update operations. In this pa- that constructs the modified copies of existing nodes. Com- per we investigate the use of transform expressions and pro- paring the two approaches, transform expression provides vide a method to support transform expressions efficiently: the following advantages: the method avoids copying of the whole modified XML subtree. Our method is based on the idea of shadow mechanism • A query with a transform expression is much easier to proposed in early work on recovery. write, debug and understand in contrast to a query with a recursive function (in Section 1.2 we give an example of such queries). 1 Introduction • With transform expressions it is possible to avoid bulk copying of data. That is, due to its higher declar- Over the past several years, both the research and the ative form, transform expressions give implementors standards communities have been showing a great interest the possibility to implement the expression in differ- in extending XQuery [1] with additional advanced facilities. ent ways, in particular in a way that does not carry out One of the most discussed and, obviously the one XQuery real copying of data, or at least minimizes the copying. lacked most, is the XQuery Update Facility that has been Otherwise, a query with recursive XML tree traver- recently presented by the World Wide Web Consortium as sal and explicit construction of new elements implies a working draft [3]. straightforward way of execution that leads to copying The XQuery Update Facility classifies all XQuery ex- of data. Optimization of such a query is very chal- pressions into the following two mutually exclusive cate- lenging, as it is quite a hard task to determine which of gories: an updating expression and a non-updating expres- the newly created elements are modified and which are sion. An updating expression is an expression that can mod- copied. ify the state of existing nodes, while a non-updating expression preserves original contents. The XQuery Update Obviously, supporting transform expressions without Facility presents five new kinds of expressions, called: in- performing the actual copying can tremendously increase sert, delete, replace, rename, and transform. The first four the speed of its evaluation and save memory resources. This of these are updating expressions, and the last (transform) is what the method presented in our paper aims to achieve. is a non-updating expression. All of the four updating expressions are quite customary to XQuery users as they all 1.1 Outline had appeared in one form or another both in research pro- posals [4, 5], and in software products that support XQuery In the next section we present a motivating example that [6, 7, 8]. While the transform expression provides novel demonstrates the effectiveness of using transform expres- powerful XQuery semantics, it has not been studied enough sions for content engineering tasks. In Section 2 we provide by researches and is not present in today’s software prod- our method to support transformswithout copying and illus- ucts. trate it by examples. In Section 3 related work is reviewed. A transform expression creates a new modified copy of Finally, Section 4 concludes the paper. 1.2 Motivating Example <file>/biography/james_cameron.html</file> <text> James Francis Cameron (born August 16, 1954) Our motivating example comes from the field of content is a Canadian-born American film director noted for his action/science fiction films, engineering [9]. The goal of content engineering is to au- which are often extremely successful financially... tomate content creation, management and publishing tasks </text> </biography> usually performed by hand. One of the basic ideas is single </director> sourcing, where there is a single version of content in the ... contentengineeringsystem, and the system providesa set of mechanisms that publish the content to various media. Iden- The task is to generate an XHTML version of review text tity based markup technique contributes to single sourcing replacing all director elements with the hyperlinks ac- support by solving the problems with links in the content: cordingto the directors’ data from directors.xml. In the gen- instead of explicit references, the content fragments are ”la- eral case we do not know the exact structure of a review beled” using names that reflect and clarify the meaning of element, that is, we do not know in advance where the the content [9]. director elements will appear within the review el- Consider a project where the goal is to create a collection ement. Thus it is necessary to scan over a hierarchy of ele- of movie reviews and to publish it to various media: web ments, applying the required transformation at each level of page and CD-ROM. When published on a web page, each the hierarchy. In XQuery 1.0 this can be accomplished by director name in the reviews must reference the director’s defining a recursive function. An example of such a func- biography in the internet movie database. When published tion performing the transformation looks as follows: to a CD-ROM, each director name must reference his bi- declare function local:traverse-replace($n as node()) ography stored locally on the CD-ROM. To accomplish this as node() { task, we use the following identity based markup technique: typeswitch($n) initially each director in the movie reviews is labeled with case $d as element(director) return director tag, that contains this director’s identificator as let $b := attribute; the information relative to each director is stored document("directors.xml")/directors /director[@id=$d/@id]/biography in a separate document ’directors.xml’ and contains the URI return to the biography in the internet movie database and a refer- <a href="{$b/url/text()}">{$d/text()}</a> case $e as element() ence to a local file with biography. return element Consider the example structure of the XML documents. { fn:local-name($e) } { for $c in $e/(* | @* | text()) In ’reviews.xml’ every review node contains review meta- return local:traverse-replace($c) } data and the text of the review in XHTML format with in- case $d as document-node() return document corporated director tags: { for $c in $d/* return local:traverse-replace($c) } default return $n <review> }; <title>Titanic</title> <genre>romance</genre> for $r in doc("reviews.xml")/reviews <text> /review[genre="romantic"] ... return <p><director id="James Cameron">James Cameron's</director> local:traverse-replace($r) 194-minute, $200 million film of the tragic voyage is in the tradition of the great Hollywood epics.</p> ... In the above query the recursive function ’traverse- </text> replace’ scans over the whole data and returns the appro- <review> priate result based on the type of the current node. This function reconstructs the whole review nodes copying all In ’directors.xml’ each director node is labeled with data. Note that director elements constitute a relatively id attribute and contains biography element with differ- small part of the review elements. It means that the re- ent forms of biography references for various media. The cursive function constructs the result preserving most of the url element contains the following subelements: the URI data in its original form and transforming only small pieces to the director’s biography intended for publishing on the of it. This leads to an extra overhead incurred by unnec- web page, the file element with a path to the biogra- essary copying of data. Optimization of such a query is phy data to publish on the CD-ROM and the text element not trivial and often practically impossible, as it is quite a with brief biography description intended to publish on the hard task to determine which of the newly created elements printed media. are modified and which are copied. Another disadvantage ... of transformation queries written via recursive functions is <director id="James Cameron"> <biography> their cumbersome syntax. The XQuery code looses its com- <url>http://en.wikipedia.org/wiki/James_Cameron</url> pactness and becomes hard to understand. 2 The transform expressions introducedin the XQuery Up- bounds of the XML tree that represents the copied date Facility allows one to write the required transforma- value. More precisely, the method must ensure that tion in much more compact way. Moreover, it gives more parent, ancestor, following and preceding axes do not chances to implement it in an efficient way. The same trans- lead out of the root of this XML tree. formation query written with transform expression looks as • While navigating over the copied value, the replace- follows. ment nodes must be visible instead of the target nodes. for $r in doc("reviews.xml")//review[genre="romance"] return Our method is based on the idea of data shadow mecha- transform copy $rom:=$r nism, which was first proposedin the early papers on recov- do ery and data versioning where there is an ability to return for $d in $rom//director do replace {$d} to original versions [10].

Load more