Pushing Xpath Accelerator to Its Limits

Pushing XPath Accelerator to its Limits Christian Grun¨ Alexander Holupirek Marc Kramis Marc H. Scholl Marcel Waldvogel Department of Computer and Information Science University of Konstanz, Germany <Firstname>.<Lastname>@uni-konstanz.de ABSTRACT paper describes and evaluates two lines of research aiming at Two competing encoding concepts are known to scale well further improvements. We implemented two experimental with growing amounts of XML data: XPath Accelerator en- prototypes, focusing on current weaknesses of the MonetDB coding implemented by MonetDB for in-memory documents system: the lack of value-based indexes and the somewhat and X-Hive’s Persistent DOM for on-disk storage. We iden- less thrilling performance once the system gets I/O-bound. tified two ways to improve XPath Accelerator and present Separate prototypes have been chosen because they allow us prototypes for the respective techniques: BaseX boosts in- to apply independent code optimizations. memory performance with optimized data and value index structures while Idefix introduces native block-oriented per- Evaluation. The widely referenced XMark benchmark [20] sistence with logarithmic update behavior for true scalabil- and the well-known DBLP database [16] were chosen for ity, overcoming main-memory constraints. evaluation to get an insight into both the current limitations and the proposed improvements of XPath Accelera- An easy-to-use Java-based benchmarking framework was de- tor. BaseX is directly compared to MonetDB [2] as both veloped and used to consistently compare these competing are main-memory based and operate on a similar relational techniques and perform scalability measurements. The es- encoding. Idefix is compared to X-Hive [23] as both are tablished XMark benchmark was applied to all four systems mainly based on secondary storage. The main difference under test. Additional fulltext-sensitive queries against the to Idefix is that X-Hive stores the XML as a Persistent well-known DBLP database complement the XMark results. DOM [13]. Both MonetDB and X-Hive were chosen because current performance comparisons [3] suggest them to be the Not only did the latest version of X-Hive finally surprise with best available solutions for either in-memory or persistent good scalability and performance numbers. Also, both Ba- XML processing. In the course of this work, published re- seX and Idefix hold their promise to push XPath Acceler- sults about MonetDB and X-Hive could be revisited. Both ator to its limits: BaseX efficiently exploits available main BaseX and Idefix strive for outperforming their state-of- memory to speedup XML queries while Idefix surpasses the-art competitors. main-memory constraints and rivals the on-disk leadership of X-Hive. The competition between XPath Accelerator and Contributions. Our main contribution is the evaluation of Persistent DOM definitely is relaunched. our approaches against the state-of-the-art competitors, us- ing our own benchmarking framework Perfidix. In more 1. INTRODUCTION detail: The first prototype, BaseX, pushes in-memory XML XML has become the standard for universal exchange of tex- processing to its limits by applying thorough optimizations tual data, but it is also increasingly used as a storage format and an index-based approach for processing predicates and for large volumes of data. One popular and successful ap- nested loops. The second prototype, Idefix, linearly scales proach to store and access XML data is to map document XPath Accelerator beyond current memory limitations by efficiently placing it on secondary storage and introducing nodes to relational encodings. Based on the XPath Acceler- 1 ator encoding [11], the integration of the Staircase Join [12], logarithmic update time . Finally, the use of our bench- an optimizing to-algebra translation [4], and an underlying marking framework Perfidix assures a consistent and re- relational engine that is tuned to exploit vast main mem- producible benchmarking environment. ories, the MonetDB system currently provides unrivalled benchmark performance for large XML documents [3]. This Outline. The improvements to the XPath Accelerator encoding and its implementation are described by introducing our two prototypes (Section 2). We intensively evaluated and compared the state-of-the-art competitors in the field of XML-aware databases against our optimized prototypes (Section 3). In Section 4, we summarize our findings and have a look at our future work. 1The update performance of Idefix was not yet measured due to the current alpha-level update implementation. XML Document Mapping TagIndex AttNameIndex pre par token kind att attVal id tag id att 0 0 db elem ...0000 db ...0000 id 1 1 address elem id add0 ...0001 address Node Table ...0001 title 2 2 name elem title Prof. ...0010 name 3 3 text ...0011 street par kind/token attribute Hack Hacklinson 4 2 street elem ...0100 city ...0000 0.....0000 nil 5 3 Alley Road 43 text ...0001 0.....0001 0000...0000 6 2 city elem TextIndex ...0010 0.....0010 0001...0001 7 3 0-62996 Chicago text ...0011 1.....0000 nil id text 8 1 address elem id add1 ...0010 0.....0011 nil 9 2 name elem ...0000 Hack Hacklinson ...0011 1.....0001 nil AttValIndex 10 3 Jack Johnson text ...0001 Alley Road 43 ...0010 0.....0100 nil ...0010 id attVal 11 2 street elem 0-62996 Chicago ...0011 1.....0010 nil 12 3 Pick St. 43 text ...0011 Jack Johnson ...0001 0.....0001 0000...0010 ...0000 add0 13 2 city elem ...0100 Pick St. 43 ...0010 0.....0010 nil ...0001 Prof. 14 3 4-23327 Phoenix text ...0101 4-23327 Phoenix ... ... ... ...0010 add1 Figure 1: Relational mapping of an XML document (left), internal table representation in BaseX (right) 2. PUSHING XPATH ACCELERATOR Data structures. The system is mainly built on two sim- Attracted by the approved and generic XPath Accelerator ple yet very effective data structures that guarantee a com- concept, we were initially driven by two basic questions: how pact mapping of XML files: an XML node table and a slim far can we optimize main-memory structures and algorithms hash index structure. The relational structure is represented to efficiently work with the relational encoding? Next, which in the node table, storing the pre/parent references of all data structures are suitable to persistently map the encod- XML nodes; the left table of Figure 1 displays a mapping ing on disk and allow continuous updates? Two different for the XML snippet shown in Figure 2. The table further implementations are the result of our considerations: Ba- references the token of a node (which is the tag name or seX creates a compact main-memory representation of rela- text content) and the node kind (element or text). At- tionally mapped XML documents and offers a general value tribute names and values are stored in a two-dimensional index, and Idefix offers a sophisticated native persistent array; a nil reference is assigned if no attributes are given. data structure for extending the XML storage to virtually All textual tokens – tags, texts, attribute names and values unlimited sizes. – are uniformly stored in a hash structure and referenced by integers. To optimize CPU processing, the table data is exclusively encoded with integer values. (see Figure 1, right 2.1 BaseX – Optimized in-memory processing tables). BaseX was developed to push pure main memory based XML querying to its limits both in terms of memory con- sumption and efficient query processing. Main memory is <db> always limited, compared to the size offered by secondary <address id=’add0’> storage media, so we aimed at optimizing the main-memory <name title=’Prof.’>Hack Hacklinson</name> representation of the XML data to overcome the limita- <street>Alley Road 43</street> tions. Main-memory XQuery processors such as Galax [7] or <city>0-62996 Chicago</city> Saxon [15] work efficiently on small XML files, but querying </address> gets troublesome for larger files as the temporary XML rep- <address id=’add1’> resentations occupy 6 to 8 times the size of the original file <name>Jack Johnson</name> in main memory. MonetDB [2] is designed as a highly effi- <street>Pick St. 43</street> cient relational main memory database, and its Pathfinder/ <city>4-23327 Phoenix</city> XQuery extension [3] applies relational operations to process </address> XML node sets at an amazing speed. </db> Figure 2: XML snippet, mapped in Figure 1 Our first Java prototype can be seen as a hybrid between relational and native XML processing. It uses a relational, edge-based pre/parent encoding for XML nodes, but issues To save memory, some table values store more than one arising from the set-oriented approach, such as the ordered- attribute. Based on the extensive evaluation of numerous ness of XML node sets, can be evaded as all the data can large XML instances up to 38 GB, we noted that the maxi- be processed sequentially, thus simplifying orderedness for mum values for token references showed out to be constantly location step traversals. The current implementation can be smaller than 32 bit, and as the node kind requires only 1 bit, used both as a real-time XML query tool and as a query ap- the two values are merged into one integer. The remaining plication. XPath expressions can be passed on via the com- space will be used for additional node information in future. mand line. Alternatively, an interactive mode is available The attribute name and value references are merged as well, to directly enter queries which are processed locally or by a sharing 10 and 22 bits, respectively. The original values BaseX server instance. XML files can be saved as database can be efficiently accessed via CPU-supported bit shifting files to avoid future parsing of the original documents. operators. The second data structure, the hash index, is basically an The value indexes are implemented pretty straightforward.

Load more