Building Large XML Stores in the Amazon Cloud


Jesús Camacho-Rodríguez, Dario Colazzo, Ioana Manolescu
LRI, Université Paris-Sud 11 and INRIA Saclay–Île-de-France
[email protected], [email protected], [email protected]

Abstract—It has been by now widely accepted that an increasing part of the world's interesting data is either shared through the Web or directly produced through and for Web platforms using formats like XML (structured documents). We present a scalable store for managing large corpora of XML documents built on top of off-the-shelf cloud infrastructure. We implement different indexing strategies to evaluate a query workload over the stored documents in the cloud. Each strategy presents different trade-offs between efficiency in query answering and cost for storing the index.

I. INTRODUCTION

The recent development of commercial and scientific cloud computing environments has strongly impacted research and development in distributed platforms. From a business perspective, cloud-based platforms release the application owner from the burden of administering the hardware, by providing resilience to failures as well as elastic scaling up and down of resources according to the demand. From a (data management) research perspective, the cloud provides a distributed, shared-nothing infrastructure for data storage and processing. This made cloud platforms the target of many recent works focusing e.g. on implementing basic database primitives on top of off-the-shelf cloud services [1], or on algebraic extensions of the Map-Reduce paradigm [2] for efficient parallelized processing of queries, based on simple relational-style data models as in [3] or on XML-style nested objects as in [4], [5].

In this work, we investigate architectures for large-scale distributed XML stores based on off-the-shelf cloud infrastructure. One of our important goals was understanding the support that such a cloud-based platform provides, and at what costs. Following other works [1], [6], we have used the Amazon cloud platform (Amazon Web Services [7], or AWS in short), among the most prominent commercial platforms today.

Our work was guided by the following objectives. First, our architecture should leverage AWS resources to scale up to very large data volumes. Second, we aim at efficiency, in particular for the document storage and querying operations. We quantify this efficiency by the response time provided by the cloud-hosted application. Our third objective is to minimize cloud resource usage, or, in classical distributed terms, the total work required for our operations. This is all the more important since in a cloud, total work translates into monetary costs. In particular, a query asked against a repository holding a large number of documents should ideally be evaluated only on those documents on which it yields results. Fourth, to be generic, our architecture should be able to work based on any existing XML query processor.

To attain these objectives, we have made the following choices:

• For scalability, we store XML documents within S3, the Amazon Simple Storage Service.
• To minimize resource usage and for efficiency, we index the stored XML documents into SimpleDB, a simple database system supporting quite advanced SQL-style queries based on a key-value model. This allows deciding, from the index only, which documents are concerned by a given query, at the cost of building and using the index.
• To re-use existing XML querying capabilities, we install an XML query processor into virtual machines within EC2, the Amazon Elastic Compute Cloud. We use the index mostly to direct the queries to the XML documents which may hold results.

We have devised four XML indexing strategies within SimpleDB, and we present experiments on XML indexing and index-based query processing within AWS.

This document is organized as follows. Section II discusses related works. Section III describes the basic AWS features on which we build our work. Section IV details our basic XML indexing strategies within AWS, while Section V discusses how to scale up and distribute such indexes when the size of the XML database increases. Section VI briefly outlines the monetary costs in AWS. Section VII describes our experiments. Finally, Section VIII contains conclusive discussions.

II. RELATED WORKS

To the best of our knowledge, distributed XML indexing and query processing directly based on a commercial cloud's services has not been attempted before. However, the greater problem of large-scale data management (and in particular XML data management) can be addressed from many other angles. We have mentioned before algebra-based approaches which aim at leveraging large-scale distributed infrastructures (a typical example of which are clouds) by intra-query parallelism [3], [4], [5]. In the case of XML, this means that within the processing of each query on each document, parallelism can be used. In contrast, in our work, we consider the evaluation of one query on one document as an atomic (inseparable) unit of processing, and focus on the horizontal scaling of (i) the index in SimpleDB and (ii) the overall indexing and query processing pipeline distributed over several AWS layers. Another line of works which may allow reaching the same global goal of managing XML in the cloud comprises commercial database products, such as Oracle Database, IBM DB2 and SQL Server. These products have included XML storage and query processing capabilities in their relational databases over the last ten years, and have then ported their servers to cloud-based architectures [8]. Clearly, such systems have many functionalities beyond being XML stores, in exchange for complex architectures possibly influenced by the important existing code base. For example, as stated in [8]: "The design of Cloud SQL Server was heavily influenced by that of SQL Server, on which it is built". In this work, starting directly from the AWS platform allows us to get a detailed picture of attainable performance and associated monetary costs.

III. PRELIMINARIES

We first describe AWS, then outline our architecture.

A. AWS overview

As of today, Amazon offers more than 20 different services in the cloud. In the following we describe those relevant for our system architecture. The first two are storage services, which we will use for storing the index and the data respectively, the third is a computing service used for creating the index and evaluating queries, while the fourth is a queue service used for coordinating processes across machines.

1) Amazon SimpleDB: SimpleDB is a simple database that supports storing and querying simple, structured data. SimpleDB data is organized in domains. Each domain is a collection of items identified by their name. The length of the item name should be at most 1024 bytes. In turn, each item contains one or more attributes, each one consisting of a name-value pair. Furthermore, several values can be associated to the same attribute name under the same item. An attribute name or value can have up to 1024 bytes. Figure 1 shows the structure of a SimpleDB database. Each index entry (key, attribute name and attribute value) incurs a 45 byte storage charge for billing purposes only. Note that in contrast to relational databases, SimpleDB does not have a schema. For instance, different items within a domain may have different attribute names.

Fig. 1. Structure of a SimpleDB database (a database holds domains; each domain holds items identified by a name; each item holds attributes, i.e. name-value pairs).

We can query a specific domain using a simple get operator: get(D,k) retrieves all items in the domain D having a key value k. Moreover, one can execute a SQL-like Select query over a domain. For example, the query select * from mydomain where Year > '1955' returns all items from domain mydomain where Year is greater than 1955. On the other hand, to attach attributes to an item having a given key, the put operator can be used: put(D,k,(a,v)+) inserts the attributes (a,v)+ into an item with key k in domain D (the item is created if it does not exist); a batch-put variant inserts 25 items at a time. Finally, a conditional put operator can be used: put(D,k,(a,v)+,c) inserts or updates the item with key k in domain D if condition c is met (a code sketch of these operations is given at the end of this overview).

Multiple domains cannot be queried by a single query. The combination of query results on different domains has to be done in the application layer. However, AWS ensures that operations over different domains run in parallel. Hence, splitting data into multiple domains could improve performance.

a) SimpleDB limits: AWS establishes some global limits on the resources used for SimpleDB. We briefly discuss the ones having an impact on the design of our platform. The maximal size a domain can have is 10 GB, while the maximal number of attribute name-value pairs per domain is 10^9. Concerning items, these can contain at most 256 attribute name-value pairs. The same limit is posed on the number of name-value pairs that can be associated to an item.

Querying using Select is limited as follows. A query result may contain at most 2500 items, and can take at most 1 MB. If a query exceeds these limits, the remaining results can be retrieved by iterating query execution.

2) Amazon Simple Storage Service: Amazon Simple Storage Service (or S3, in short) is a storage service for raw data, hence ideal for large objects or files. S3 stores the data in buckets identified by their name. Each bucket contains objects, which have an associated unique name (within that bucket), metadata (both system-defined and user-defined), an access control policy for AWS users and a version ID. The number of objects that can be stored within a bucket is unlimited. Amazon states that there is no performance difference between storing objects in many buckets or just a few.

3) Amazon Elastic Compute Cloud: Amazon Elastic Compute Cloud (or EC2, in short) is a web service that provides resizable computing capacity in the cloud. Using EC2, one can launch as many virtual instances as desired, with a variety of operating systems, and execute applications on them. Amazon provides some pre-configured images ready to be used immediately. Images can be run in as many EC2 virtual computer instances as desired. AWS provides different types of instances, each of them with different hardware specifications.

4) Amazon Simple Queue Service: Amazon Simple Queue Service (or SQS, in short) provides reliable and scalable queues that enable asynchronous message-based communication between the distributed components of an application.
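To make the SimpleDB operations above concrete, the following is a minimal sketch using the SimpleDB client of the AWS SDK for Java 1.x (the SDK family the paper reports using, v1.1.9, cf. Section VII). The credentials, domain name, item key and attribute values are made up for illustration only and are not taken from the paper.

    import com.amazonaws.auth.BasicAWSCredentials;
    import com.amazonaws.services.simpledb.AmazonSimpleDB;
    import com.amazonaws.services.simpledb.AmazonSimpleDBClient;
    import com.amazonaws.services.simpledb.model.*;

    import java.util.Arrays;
    import java.util.List;

    public class SimpleDBSketch {
        public static void main(String[] args) {
            // Hypothetical credentials and domain, for illustration only.
            AmazonSimpleDB sdb = new AmazonSimpleDBClient(
                    new BasicAWSCredentials("ACCESS_KEY", "SECRET_KEY"));
            String domain = "mydomain";
            sdb.createDomain(new CreateDomainRequest(domain));

            // put(D, k, (a,v)+): attach attributes to the item with key "painting1".
            List<ReplaceableAttribute> attrs = Arrays.asList(
                    new ReplaceableAttribute("Year", "1956", false),
                    new ReplaceableAttribute("Painter", "Delacroix", false));
            sdb.putAttributes(new PutAttributesRequest(domain, "painting1", attrs));

            // get(D, k): retrieve all attributes of the item with key "painting1".
            GetAttributesResult item =
                    sdb.getAttributes(new GetAttributesRequest(domain, "painting1"));
            for (Attribute a : item.getAttributes()) {
                System.out.println(a.getName() + " = " + a.getValue());
            }

            // SQL-like Select over the domain, iterating with the next token
            // since a single call returns at most 2500 items / 1 MB.
            String select = "select * from " + domain + " where Year > '1955'";
            String token = null;
            do {
                SelectResult page = sdb.select(new SelectRequest(select).withNextToken(token));
                for (Item i : page.getItems()) {
                    System.out.println(i.getName());
                }
                token = page.getNextToken();
            } while (token != null);
        }
    }

Note that SimpleDB compares attribute values lexicographically as strings, which is why the year is written as the quoted literal '1955' in the Select expression.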
B. Queries

In this paper we assume that queries are formulated by means of an expressive fragment of XQuery, amounting to value joins over tree patterns. The tree patterns feature the ancestor and descendant axes, value-equality selections, and allow to return from one tree pattern its value ("string value" in the XPath/XQuery sense) and/or its full serialized XML content. Figure 2 exemplifies some queries. We use cont to denote that the full XML subtree from the respective node should be returned, and val to return just the string value.

Fig. 2. Sample queries: tree patterns q1 to q4 over painting elements, combining name (val), painter/last (val), description (cont) and a year=1854 selection.

C. Proposed architecture

Fig. 3. Architecture overview.

Figure 3 depicts our global architecture. We store the XML documents as uninterpreted BLOB objects in S3.

After the document references (and possibly some other data) have been extracted from the index, the local query evaluator receives this information (12) and the XML documents cited are retrieved from S3 (13). Our framework includes "standard" XML query evaluation, i.e. the capability of evaluating a given query q over a given document d. This generalizes easily into an XQuery evaluation operator that computes the results of q over a set of documents {d1, d2, ..., dk}. If needed, other operations can be performed on the results obtained by this operator. Finally, we write the results obtained for the input query in S3 (14) and we create a message with the reference to those results and insert it into the query response queue (15). When the message is read by the front-end (16), the results are retrieved from S3 (17) and returned (18). Note that the number of Amazon EC2 instances running our loader and query processor modules will scale dynamically depending on the workload.
IV. INDEXING STRATEGIES

Several XML indexing schemes within AWS can be envisioned, with different trade-offs between query efficiency and indexing costs. We start with a few useful notations. We denote by URI(d) the Uniform Resource Identifier (or URI, in short) of a document d. For a given node n ∈ d, we denote by inPath(n) the label path going from the root of d to the node n, and by id(n) the node identifier (or ID, in short). In this work, we used simple (start, end, depth) identifiers as used e.g. in [9] and many follow-up works. Such IDs allow identifying whether node n1 is an ancestor of node n2 by testing whether n1.start < n2.start and n2.end < n1.end.
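As an illustration of the (start, end, depth) identifiers just described, here is a small self-contained Java sketch; the class and method names are ours and do not come from the paper's implementation.

    // Structural node identifier in the (start, end, depth) scheme of [9]:
    // start/end delimit the interval covered by the node's subtree.
    public final class NodeId {
        final int start, end, depth;

        NodeId(int start, int end, int depth) {
            this.start = start;
            this.end = end;
            this.depth = depth;
        }

        // n1 is an ancestor of n2 iff n1's interval strictly contains n2's.
        boolean isAncestorOf(NodeId other) {
            return this.start < other.start && other.end < this.end;
        }

        // The parent test additionally requires the depths to differ by exactly one.
        boolean isParentOf(NodeId other) {
            return isAncestorOf(other) && other.depth == this.depth + 1;
        }

        public static void main(String[] args) {
            NodeId painting = new NodeId(1, 10, 1);   // e.g. a painting element
            NodeId painterLast = new NodeId(4, 5, 3); // e.g. its painter's last name
            System.out.println(painting.isAncestorOf(painterLast)); // true
            System.out.println(painting.isParentOf(painterLast));   // false (depth gap is 2)
        }
    }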

Fig. 6. Sample tuples extracted by the 2LUPI strategy from the documents in Figure 4.

Fig. 7. Outline of the look-up on the 2LUPI strategy: lookup operations on the query's root-to-leaf paths (qp1, ..., qpn, used as first keys) and on its labels (a1, ..., ak, used as second keys), followed by projections on the document URI (πURI), an intersection, and a holistic twig join.

Figure 7 outlines this process; a1, a2, ..., ak are the labels extracted from the query, while qp1, qp2, ..., qpn are the root-to-leaf paths extracted from the query.

E. Queries with value joins

The processing of such queries is different, since one tree pattern only matches one XML document, whereas a query consisting of several tree patterns connected by a value join needs to be answered by combining tree pattern query results from different documents. Indeed, this is our evaluation strategy for such queries and any given indexing strategy I: evaluate first each tree pattern individually, exploiting the index; then, apply the value joins on the tree pattern results thus obtained.

V. INDEX DISTRIBUTION OVER MULTIPLE DOMAINS

To exploit the indexing strategies described in Section IV, we face two different scale-up problems due to global limits posed by AWS (Section III-A1). The first is associated to the domain limitations (e.g. 10 GB of data per domain), while the second is associated to the item limitations (e.g. 256 attribute name-value pairs per item). Our solution consists of distributing the index over several domains, and distributing each tuple produced by the indexing strategy into several items. This allows us to overcome the AWS limitations, and to scale well while obtaining maximum performance from SimpleDB. To attain these goals we introduce and adopt a variant of static hash indexing. An index consists of a set of directory domains, each one pointing to a set of chained data domains.

Figure 8 depicts an example. The directory consists of n primary domains, identified by numbers from 0 to n - 1, such that n = 2^d for some integer d, also called the directory depth; in the example we have n = 4. We use a hash function h which, given a key value k, returns a number between 0 and n - 1, determining the directory domain for k. This directory domain holds pointers to all the data domains storing items associated to k.

Fig. 8. Example of hash-based partitioning scheme (n = 4).

In the following, the set of data domains pointed to from a single directory domain is called a bucket, and its cardinality is called the bucket length. It is worth observing that in our specific AWS context, a bucket formed by multiple domains can be efficiently accessed during look-up operations, as AWS guarantees parallel access to these domains.

Each tuple (k, (a, v)+) extracted from a document by the index strategy I is stored in a set of n disjoint items. Items for the tuple (k, (a, v)+) are put in domains belonging to the bucket related to the domain h(k). Several items are needed either because one item may not suffice for the entire tuple (see Section III-A1), or because the item containing the tuple cannot be stored entirely in the current bucket domain D with free space. Each of these items contains a subset of the attributes in (k, (a, v)+), and has as key the k value prefixed with an integer i ∈ [1+m ... n+m] (this prefix is needed to meet uniqueness of keys in a domain), where m is the number of existing items in D for the key value k.

If one domain cannot include all the items for (k, (a, v)+), then these are distributed over multiple domains. To make look-up operations possible, the h(k) domain will contain an item having key k, and attributes (dat, c) where: dat is the name of a bucket domain containing some of the items for a tuple (k, (a, v)+), and c is the number of such items in dat (we will see next how these data are used for look-up operations).
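The following Java sketch illustrates the routing scheme just described (hash-based choice of a directory domain and prefixed item keys). The hash function, class names and domain-naming convention are ours, chosen only to make the example self-contained; they are not the paper's implementation.

    import java.util.ArrayList;
    import java.util.List;

    // Routing of index tuples to directory domains and prefixed item keys,
    // following the static hash scheme of Section V.
    public class DirectoryRouter {
        private final int n; // number of directory domains, n = 2^d

        DirectoryRouter(int depth) {
            this.n = 1 << depth;
        }

        // h(k): directory domain number in [0, n-1] for a key value k.
        int directoryDomain(String key) {
            return Math.floorMod(key.hashCode(), n);
        }

        // Hypothetical naming convention for directory domains, e.g. "dir-3".
        String directoryDomainName(String key) {
            return "dir-" + directoryDomain(key);
        }

        // Keys of the items storing a tuple (k,(a,v)+) split into `pieces` items,
        // when `existing` items for k are already present in the current bucket
        // domain: the prefixes range over [existing+1, existing+pieces].
        List<String> prefixedKeys(String key, int existing, int pieces) {
            List<String> keys = new ArrayList<>();
            for (int i = existing + 1; i <= existing + pieces; i++) {
                keys.add(i + key);
            }
            return keys;
        }

        public static void main(String[] args) {
            DirectoryRouter router = new DirectoryRouter(2);         // d = 2, n = 4
            System.out.println(router.directoryDomainName("ename")); // e.g. "dir-1"
            System.out.println(router.prefixedKeys("ename", 0, 2));  // [1ename, 2ename]
        }
    }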
To illustrate, consider the example in Figure 8, containing items created by the following operations. A document "delacroix.xml" arrives to the store and the following two tuples are extracted from it:

    ename  "delacroix.xml"  /epainting/ename
    ename  "delacroix.xml"  /epainting/epainter/ename

The result of h(ename) maps the item to directory domain 11. As no item with key ename exists in the directory, an item with key 1ename is created and enriched with the attribute pairs of the above tuples. This item is inserted into D1, which is the current bucket domain with available space. Finally, a pointer item (ename, (D1, 1)) is created in the directory domain. When a second document "manet.xml" arrives to the store, the following tuples are extracted:

    ename  "manet.xml"  /epainting/ename
    ename  "manet.xml"  /epainting/epainter/ename

The directory domain 11 contains an entry for ename. This indicates that the bucket domain where new items have to be inserted is D1. This becomes full after adding the first attribute above to an item with key 2ename. Therefore, a new bucket domain D2 is created and enriched with an item with key 1ename containing the second attribute above; the bucket length is incremented accordingly. Finally, the attribute (D2, 1) is added to the directory item with key ename.

Look-up strategy: To apply the look-up algorithms presented in Section IV, we need to obtain the (a, v)+ pairs associated to a given key k. To this end, we proceed as follows. First, we retrieve the item with key k from the domain h(k) in the directory. For each attribute (dat_i, c_i) contained in such item, we execute a set of get operations get(j·k, dat_i) with j = 1 ... c_i to retrieve all the items for the k value in the data domain dat_i (j·k denotes the k value prefixed with j). The look-up operation result (i.e. the set of (a, v)+ pairs associated to k) is the union of the attribute pairs contained in the items returned by every get operation (a code sketch of this procedure is given below).

Consider again the example depicted in Figure 8. If we want to extract the pairs associated to the item with key ename, first we access the directory domain and obtain the item with such key. Then we retrieve the items referenced by the ename item content, i.e. 1ename and 2ename from D1 and 1ename from D2. The expected result is the union of the attributes associated to each of them.

Directory limitations: The proposed strategy provides a means to scale the index by adding domains to buckets when needed. If a directory domain becomes full, then the whole index must be reorganized into a new index with a larger directory. However, we are quite confident that this is not likely to happen in practice, because in a directory domain we have one item for each possible key value, and because the number of attributes in an item is upper bounded by the bucket length. Also, in our implementation, for each item in a directory domain, if needed we can cluster its attributes (dat_i, c_i) and compress them into a few attributes.
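Below is a sketch of the two-step look-up just described, again assuming the SimpleDB API of the AWS SDK for Java 1.x. The hash function, the "dir-" naming convention for directory domains and the credentials are ours, introduced only for illustration.

    import com.amazonaws.auth.BasicAWSCredentials;
    import com.amazonaws.services.simpledb.AmazonSimpleDB;
    import com.amazonaws.services.simpledb.AmazonSimpleDBClient;
    import com.amazonaws.services.simpledb.model.Attribute;
    import com.amazonaws.services.simpledb.model.GetAttributesRequest;

    import java.util.ArrayList;
    import java.util.List;

    // Two-step look-up: read the directory item for k, then fetch the
    // prefixed items 1k, 2k, ... from each referenced bucket domain.
    public class IndexLookup {
        private final AmazonSimpleDB sdb =
                new AmazonSimpleDBClient(new BasicAWSCredentials("ACCESS_KEY", "SECRET_KEY"));
        private final int n = 4; // number of directory domains (directory depth 2)

        List<Attribute> lookup(String key) {
            // Step 1: directory item in domain h(k); the domain naming is ours.
            String directoryDomain = "dir-" + Math.floorMod(key.hashCode(), n);
            List<Attribute> pointers = sdb.getAttributes(
                    new GetAttributesRequest(directoryDomain, key)).getAttributes();

            // Step 2: for each (dat_i, c_i) pointer, issue get(j.k, dat_i) for j = 1..c_i
            // and take the union of the returned (a, v) pairs.
            List<Attribute> result = new ArrayList<>();
            for (Attribute pointer : pointers) {
                String dataDomain = pointer.getName();            // dat_i
                int count = Integer.parseInt(pointer.getValue()); // c_i
                for (int j = 1; j <= count; j++) {
                    result.addAll(sdb.getAttributes(
                            new GetAttributesRequest(dataDomain, j + key)).getAttributes());
                }
            }
            return result;
        }

        public static void main(String[] args) {
            List<Attribute> pairs = new IndexLookup().lookup("ename");
            pairs.forEach(a -> System.out.println(a.getName() + " = " + a.getValue()));
        }
    }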

VI. STORAGE AND PRICES

Uploading, indexing, hosting and querying data based on AWS services occupies some space and incurs some processing, leading to a price. For space reasons, we delegate the details of the price calculations to our technical report [11]. The prices we paid as part of this study were those applying for the Ireland AWS facility, where all our experiments took place, in July-November 2011. The main figures are: $0.14 charged per month to store 1 GB of data in S3; $0.275 per month to store 1 GB of data in SimpleDB; $0.154 per hour of SimpleDB instance utilization; $0.38 per hour of a "large" EC2 instance, on which our query processor runs.

Importantly, SimpleDB imposes a constant storage overhead of 45 bytes for each distinct key, distinct attribute name, and distinct value. As we will show, this strongly penalizes some of our indexing strategies.

VII. EXPERIMENTS

We implemented all algorithms previously described in Java, based on the Amazon Web Services SDK for Java v1.1.9 [12]. Our experiments ran in the AWS machines hosted in the EU (Ireland) region. We used the (centralized) Java-based XML query processor developed within our ViP2P project (http://vip2p.saclay.inria.fr), implementing an extension of the algorithm of [13] to our larger subset of XQuery. On our dialect, ViP2P's performance is close to (or better than) Saxon-B v9.1, thus we consider it a reasonable choice.

To test the selectivity provided by our indexing, we needed an XML corpus with some heterogeneity. We used XMark [14] data, and took advantage of an XMark generator feature enabling the generation of a "fragmented" auction site document. For a document size and a fragmentation factor k, one obtains k XML documents, all of which have a root element named site, and which, together, contain the data of an XMark document of the specified size. Thus, for instance, the first document contains the subtree rooted in /site/regions/africa, the second document contains the subtree rooted in /site/regions/asia, and so on until the k-th document, which contains data from the /site/closed_auctions subtree. We generated three data sets for the values k = 10, k = 40 and k = 160. The larger k is, the more the documents differ from each other, since each represents a smaller "slice" of the global document. Each data set has a global size of 20 MB.

XML indexing: We first study the cost associated to the creation and storage of the different indexes in SimpleDB. The indexing module ran on a single EC2 instance with 1.7 GB of memory and 1 virtual core with 1 EC2 Compute Unit. An EC2 Compute Unit is equivalent to the CPU capacity of a 1.0-1.2 GHz 2007 Xeon processor. For all the experiments, the depth of the directory (Section V) was fixed to 2.

The documents were initially stored in S3, from which they were gathered by the indexing module, which performs: get operations to check if an element already exists, batchPut operations to efficiently insert 25 items at a time into a SimpleDB domain, and conditionalPut operations to either insert a new item into a domain or update it (to associate a new attribute name and/or value) if it was already present.

Fig. 9. SimpleDB usage time, and cost in $ for indexing our XML data sets ((a) 10 documents, (b) 40 documents, (c) 160 documents; the time is decomposed into Get, BatchPut and ConditionalPut operations).
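As an illustration of the batchPut calls issued by the indexing module, here is a minimal sketch assuming the SimpleDB API of the AWS SDK for Java 1.x. The domain name, item keys and attribute layout are ours, not the exact format used by the paper's indexer.

    import com.amazonaws.auth.BasicAWSCredentials;
    import com.amazonaws.services.simpledb.AmazonSimpleDB;
    import com.amazonaws.services.simpledb.AmazonSimpleDBClient;
    import com.amazonaws.services.simpledb.model.BatchPutAttributesRequest;
    import com.amazonaws.services.simpledb.model.ReplaceableAttribute;
    import com.amazonaws.services.simpledb.model.ReplaceableItem;

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;

    // Buffers index items and flushes them 25 at a time, the maximum
    // accepted by a single SimpleDB BatchPutAttributes call.
    public class BatchIndexWriter {
        private static final int BATCH_SIZE = 25;

        private final AmazonSimpleDB sdb =
                new AmazonSimpleDBClient(new BasicAWSCredentials("ACCESS_KEY", "SECRET_KEY"));
        private final String domain;
        private final List<ReplaceableItem> buffer = new ArrayList<>();

        BatchIndexWriter(String domain) {
            this.domain = domain;
        }

        // Queue one index entry: item key -> (attribute name, attribute value).
        void add(String itemKey, String attributeName, String attributeValue) {
            buffer.add(new ReplaceableItem(itemKey, Collections.singletonList(
                    new ReplaceableAttribute(attributeName, attributeValue, false))));
            if (buffer.size() == BATCH_SIZE) {
                flush();
            }
        }

        void flush() {
            if (!buffer.isEmpty()) {
                sdb.batchPutAttributes(new BatchPutAttributesRequest(domain, new ArrayList<>(buffer)));
                buffer.clear();
            }
        }

        public static void main(String[] args) {
            BatchIndexWriter writer = new BatchIndexWriter("index-data-1");
            // Hypothetical entries, loosely mirroring the (key, URI, path) tuples
            // of the Section V example; the attribute layout is illustrative only.
            writer.add("1ename", "delacroix.xml", "/epainting/ename");
            writer.add("1epainter", "delacroix.xml", "/epainting/epainter");
            writer.flush();
        }
    }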

Figure 9 shows the machine utilization and cost associated to indexing the data sets according to each strategy. We notice, first, that the usage and costs increase as the data is fragmented into more (smaller) documents. This increase is due to the distinct URIs which lead to more attribute names within SimpleDB, and thus more calls to SimpleDB overall. The costs of LU and LUP (and therefore 2LUPI) increase faster than for LUI. The reason is the BatchPut operation: while its cost for the LUI index is almost constant, because the number of structural identifiers is practically the same among the document sets, the rest of the strategies benefit from the repetition of attributes that arises as the documents become larger. Finally, as expected, for each corpus of 10, 40 or 160 documents, the operation costs of fine-grained indexing strategies, such as LUI and 2LUPI, are about twice as large as those associated to lower-granularity strategies.

Figure 10 shows the size and storage cost per month of each index for each data set, also compared with the original XML size. The Figure shows that the indexes are quite larger than the data, in all cases except LU indexing on the 10 documents database. As expected, LU and LUP lead to more compact indexes. The SimpleDB space overhead (Section VI) plays a very visible role in this index space occupancy. Moreover, the index size difference increases with the fragmentation, because more documents means more URIs and thus more distinct SimpleDB entries. LUI index data is smaller than in the case of LUP, because IDs are more compact than paths. However, there are more nodes (thus, more node IDs) than paths, leading to more entries and significantly more space for LUI than for LUP. With k = 10, each path appears in "relatively few" documents; with k = 160, a given path, such as /site/regions/namerica/item, may appear in tens of different documents, leading to more index content. Putting it all together, LUP occupies more space than LUI when there are few documents (some paths are frequent within a document), and the opposite holds with more documents.

Fig. 10. Index size and storage cost for our XML data sets ((a) 10 documents, (b) 40 documents, (c) 160 documents; the size is decomposed into index content and SimpleDB overhead data, and compared with the XML data size).

XML query processing: We used a subset of the XMark benchmark queries. Our workload consisted of 10 queries with an average of 10 nodes each (3 queries contained value joins). The query processor ran on EC2 instances with 7.5 GB of RAM and 2 virtual cores with 2 EC2 Compute Units.

Figure 11 shows the workload response time for our four indexing strategies. For reference, we also show the time to process the query without using the index at all, that is: the query is evaluated against all documents. The Figure distinguishes the time to consult the index (SimpleDB get), the time to run the physical plan which identifies the relevant document URIs out of the data retrieved from the index, and finally the time to transfer the relevant documents from S3 to the EC2 instance and to extract results into S3. The Figure shows that only low-granularity indexes such as LU and LUP are competitive. The problem encountered by LUI and 2LUPI is that the structural XML joins which identify the relevant documents (recall Section IV-C) need sorted inputs, whereas SimpleDB index data is stored unordered. This requires sort operators, which are expensive and blocking. The important SimpleDB get times for LUI and 2LUPI are due to the fact that more index data is read and/or more calls are made.

Fig. 11. Decomposition of the query response time for our workload and indexing strategies ((a) 10 documents, (b) 40 documents, (c) 160 documents; the time is split into SimpleDB Get, physical plan execution, and S3 document transfer and result extraction).

Figure 12 shows the response time (perceived by the user), the total work (summing up parallel processing times), and the monetary costs associated to our workload, our four indexing strategies plus the "no index" option, and different degrees of parallelism in the query evaluation step. In the Figure, QP=1 denotes a single EC2 instance running the query processor (no parallelism), whereas QP=2 and QP=4 show the times using 2, respectively 4, EC2 instances in parallel. The graphs in the top row depict the perceived response time. As expected, parallelism reduces response time, since the transfers between S3 and EC2 and the query evaluation on EC2 are parallelized. The total work graphs (middle row) show that parallelism brings a low overhead, in exchange for the response time reduction. Finally, the query processing costs in the third row track, as expected, the total work.

Fig. 12. Response time (top), total work, summing up parallel execution times (center) and query processing costs (bottom) for our workload ((a) 10 documents, (b) 40 documents, (c) 160 documents; QP=1, QP=2 and QP=4 denote the number of EC2 query processor instances).

Conclusion of the experiments: Our experiments demonstrate the feasibility of our AWS-based architecture, using SimpleDB to index large XML corpora, S3 to store them and EC2 to process queries. The simple indexing strategies, LU and LUP, are quite efficient, reducing query response time and total work by a factor of 2 on average in our experiments. The benefit increased with the number and heterogeneity of documents, which is encouraging from the perspective of building even larger and more heterogeneous stores. Finally, we split the index over multiple SimpleDB domains and parallelized queries up to a factor of 4, demonstrating that our architecture scales well horizontally within the AWS cloud.

VIII. CONCLUSION

We have presented techniques to manage large XML stores in the cloud by means of AWS resources. We have investigated and compared by means of experiments several indexing strategies designed to statically filter out documents with no result for given queries. Our initial experimental results show that simpler index strategies may be the best ones both in terms of time and monetary costs, and that the parallelization capabilities made available by AWS bring effective improvements in query evaluation time. In the future, we plan to study query optimization based on algebraic query representation and rewriting techniques. A cost model will play a crucial role in this task, and studying the impact of the monetary dimension will be of particular interest.

ACKNOWLEDGMENTS

This work has been partially supported by the KIC EIT ICT Labs activity "Clouds for Connected Cities" 2011 (TCDT-11880) and an AWS in Education research grant.

REFERENCES

[1] M. Brantner, D. Florescu, D. A. Graf, D. Kossmann, and T. Kraska, "Building a database on S3," in SIGMOD, 2008.
[2] J. Dean and S. Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters," in OSDI, 2004.
[3] A. Alexandrov, S. Ewen, M. Heimel, F. Hueske, O. Kao, V. Markl, E. Nijkamp, and D. Warneke, "MapReduce and PACT - Comparing Data Parallel Programming Models," in BTW, 2011.
[4] V. R. Borkar, M. J. Carey, R. Grover, N. Onose, and R. Vernica, "Hyracks: A flexible and extensible foundation for data-intensive computing," in ICDE, 2011.
[5] S. Khatchadourian, M. P. Consens, and J. Siméon, "Having a ChuQL at XML on the cloud," in A. Mendelzon Int'l Workshop, 2011.
[6] J. Schad, J. Dittrich, and J.-A. Quiané-Ruiz, "Runtime measurements in the cloud: Observing, analyzing, and reducing variance," PVLDB, vol. 3, no. 1, 2010.
[7] "Amazon Web Services," http://aws.amazon.com/.
[8] P. A. Bernstein, I. Cseri, N. Dani, N. Ellis, A. Kalhan, G. Kakivaya, D. B. Lomet, R. Manne, L. Novik, and T. Talius, "Adapting Microsoft SQL Server for cloud computing," in ICDE, 2011.
[9] S. Al-Khalifa, H. V. Jagadish, J. M. Patel, Y. Wu, N. Koudas, and D. Srivastava, "Structural Joins: A Primitive for Efficient XML Query Pattern Matching," in ICDE, 2002.
[10] N. Bruno, N. Koudas, and D. Srivastava, "Holistic twig joins: optimal XML pattern matching," in SIGMOD, 2002.
[11] http://jesus.camachorodriguez.name/media/xml-aws/tech.pdf, 2011.
[12] "AWS SDK for Java," http://aws.amazon.com/sdkforjava/.
[13] Y. Chen, S. B. Davidson, and Y. Zheng, "An efficient XPath query processor for XML streams," in ICDE, 2006.
[14] A. Schmidt, F. Waas, M. L. Kersten, M. J. Carey, I. Manolescu, and R. Busse, "XMark: A benchmark for XML data management," in VLDB, 2002.