XQuery Interpreter, XQuery Processors
Comp. by: sunselvakumar Stage: Revises1 ChapterID: 000000A811 Date:15/6/09 Time:19:26:29

XQuery Interpreter
▶ XQuery Processors

XQuery Processors

Torsten Grust 1, H. V. Jagadish 2, Fatma Özcan 3, Cong Yu 4
1 University of Tübingen, Tübingen, Germany
2 University of Michigan, Ann Arbor, MI, USA
3 IBM Almaden Research Center, San Jose, CA, USA
4 Yahoo! Research, New York, NY, USA

Synonyms
XML database system; XQuery compiler; XQuery interpreter

Definition
XQuery processors are systems for efficient storage and retrieval of XML data using XML queries written in the XQuery language. A typical XQuery processor includes the data model, which dictates the storage component; the query model, which defines how queries are processed; and the optimization modules, which leverage various algorithmic and indexing techniques to improve the performance of query processing.

Historical Background
The first W3C working draft of XQuery was published in early 2001 by a group of industrial experts. It is heavily influenced by several earlier XML query languages, including Lorel, Quilt, XML-QL, and XQL. XQuery is a strongly typed functional language whose basic principles include simplicity, compositionality, closure, schema conformance, XPath compatibility, generality, and completeness. Its type system is based on XML Schema, and it contains the XPath language as a subset. Over the years, several software vendors have developed products based on XQuery with varying degrees of conformance. A current list of XQuery implementations is maintained on the W3C XML Query working group's homepage (http://www.w3.org/XML/XQuery).

There are three main approaches to building XQuery processors. The first is to leverage existing relational database systems as much as possible and evaluate XQuery in a purely relational way: queries are translated into SQL, evaluated by the SQL database engine, and the output tuples are reformatted back into XML results. The second approach is to retain the native structure of the XML data both in the storage component and during the query evaluation process. One native XQuery processor is Michael Kay's Saxon, which provides one of the most complete and conforming implementations of the language available at the time of writing. Compared with the first approach, this native approach avoids the overhead of translating back and forth between XML and relational structures, but it also faces the significant challenge of designing new indexing and evaluation techniques. Finally, the third approach is a hybrid style that integrates native XML storage and XPath navigation techniques with existing relational techniques for query processing.

Foundations

Pathfinder: Purely Relational XQuery
The Pathfinder XQuery compiler has been developed under the main hypothesis that well-understood relational database kernels also make for efficient XQuery processors. Such a relational account of XQuery processing can indeed yield scalable XQuery implementations, provided that the system exploits suitable relational encodings of both (i) the XQuery Data Model (XDM), i.e., tree fragments as well as ordered item sequences, and (ii) the dynamic semantics of XQuery that allow the database back-end to play its trump: set-oriented evaluation. Pathfinder determinedly implements this approach, effectively realizing the dashed path in Fig. 1b. Any relational database system may assume the role of Pathfinder's back-end database; the compiler does not rely on XQuery-specific built-in functionality and requires no changes to the underlying database kernel.

XQuery Processors. Figure 1. (a) DB2 XML system architecture. (b) Pathfinder: purely relational XQuery on top of vanilla database back-ends. (c) Timber architecture overview [11].

Internal XQuery and Data Model Representation. To represent XML fragments, i.e., ordered unranked trees of XML nodes, Pathfinder can operate with any node-level tabular encoding of trees (node ≙ row) that, along with other XDM specifics like tag name, node kind, etc., preserves node identity and document order. A variety of such encodings are available, among them variations of pre/post region encodings [8] and ORDPATH identifiers [15]. Ordered sequences of items are mapped into tables in which a dedicated column preserves sequence order.

Pathfinder compiles incoming XQuery expressions into plans of relational algebra operators. To actually operate the database back-end in a set-oriented fashion, Pathfinder draws the necessary amount of independent work from XQuery's for-loops. The evaluation of a subexpression e in the scope of a for-loop yields an ordered sequence of zero or more items in each loop iteration. In Pathfinder's relational encoding, these items are laid out in a single table for all loop iterations, one item per row. A plan consuming this loop-lifted or "unrolled" representation of e may, effectively, process the results of the individual iterated evaluations of e in any order it sees fit, or in parallel [10]. In some sense, such a plan is the embodiment of the independence of the individual evaluations of an XQuery for-loop body.

Exploitation of Type and Schema Information. Pathfinder's node-level tree encoding is schema-oblivious and does not depend on XML Schema information to represent XML documents or fragments. If the DTD of the XML documents consumed by an expression is available, the compiler annotates its plans with node location and fan-out information that assists XPath location path evaluation but is also used to reduce a query's runtime effort invested in node construction and atomization. Pathfinder additionally infers static type information for a query to further prune plans, e.g., to control the impact of polymorphic item sequences, which present a challenge for the strictly typed table column content in standard relational database systems.

Query Runtime. Internally, Pathfinder analyzes the data flow between the algebraic operators of the generated plan to derive a series of operator annotations (keys, multi-valued dependencies, etc.) that drive plan simplification and reshaping. A number of XQuery-specific optimization problems, e.g., the stable detection of value-based joins or XPath twigs and the exploitation of local order indifference, may be approached with simple or well-known relational query processing techniques.

The Pathfinder XQuery compiler is retargetable: its internal table algebra has been designed to match the processing capabilities of modern SQL database systems. Standard B-trees provide excellent index support. Code generators exist that emit sequences of SQL:1999 statements [9] (no SQL/XML functionality is used, but the code benefits if OLAP primitives like DENSE_RANK are available). Bundled with its code generator targeting the extensible column store kernel MonetDB, Pathfinder constitutes XQuery technology that processes XML documents in the gigabyte range in interactive time [7].

Timber: A Native XML Database System
Timber [11] is an XML database system that manages XML data natively, i.e., the XML data instances are stored in their actual format and XQuery queries are processed directly. Because of this native representation, there is no overhead for converting the data between the XML format and the relational representation or for translating queries from the XQuery format into the SQL format. While many components of the traditional database can be reused (for example, the transaction management facilities), other components […]

[…] along with the labels, are stored natively, and in the order of their start keys (which correspond to their document order), into the Timber storage backend.

XQuery Representation. A central concept in the Timber system is the TLC (Tree Logical Class) algebra [12,18,17], which manipulates sets (or sequences) of heterogeneous, ordered, labeled trees. Each operator in the TLC algebra takes as input one or more sets (or sequences) of trees and produces as output a set (or sequence) of trees. The main operators in TLC include filter, select, project, join, reordering, duplicate-elimination, grouping, construct, flatten, and shadow/illuminate. There are several important features of the TLC algebra. First, like the relational algebra, TLC is a "proper" algebra with compositionality and closure. Second, TLC efficiently manages the heterogeneity arising in XML query processing. In particular, heterogeneous input trees are reduced to homogeneous sets for bulk manipulation through tree pattern matching. This tree pattern matching mechanism is also useful in selecting portions of interest in a large XML tree. Third, TLC can efficiently handle both set and sequence semantics, as well as a hybrid semantics where part of the input collection of trees is ordered [17]. Finally, the TLC algebra covers a large fragment of XQuery, including nested FLWOR expressions. Each incoming XQuery is parsed and compiled into the TLC algebra representation (i.e., logical query plans) before being evaluated against the data.

The Timber system and its underlying algebra have been extended to deal with text manipulation [2] and probabilistic data [14].