Choosing A Cloud DBMS: Architectures and Tradeoffs
Junjay Tan (1), Thanaa Ghanem (2,*), Matthew Perron (3), Xiangyao Yu (3), Michael Stonebraker (3,6), David DeWitt (3), Marco Serafini (4), Ashraf Aboulnaga (5), Tim Kraska (3)
(1) Brown University; (2) Metropolitan State University (Minnesota), CSC; (3) MIT CSAIL; (4) University of Massachusetts Amherst, CICS; (5) Qatar Computing Research Institute, HBKU; (6) Tamr, Inc.
* Work done while at the Qatar Computing Research Institute.

ABSTRACT
As analytic (OLAP) applications move to the cloud, DBMSs have shifted from employing a pure shared-nothing design with locally attached storage to a hybrid design that combines the use of shared storage (e.g., AWS S3) with the use of shared-nothing query execution mechanisms. This paper sheds light on the resulting tradeoffs, which have not been properly identified in previous work. To this end, it evaluates the TPC-H benchmark across a variety of DBMS offerings running in a cloud environment (AWS) on fast 10Gb+ networks, specifically database-as-a-service offerings (Redshift, Athena), query engines (Presto, Hive), and a traditional cloud-agnostic OLAP database (Vertica). While these comparisons cannot be apples-to-apples in all cases due to cloud configuration restrictions, we nonetheless identify patterns and design choices that are advantageous. These include prioritizing low-cost object stores like S3 for data storage; using system-agnostic yet still performant columnar formats like ORC that allow easy switching to other systems for different workloads; and making features that benefit subsequent runs, like query precompilation and caching remote data to faster storage, optional rather than required, because they disadvantage ad hoc queries.

[Figure 1: Shared Disk Architecture. Query executor nodes, each with local instance storage, access the database on a remote object/block store.]

PVLDB Reference Format:
Junjay Tan, Thanaa Ghanem, Matthew Perron, Xiangyao Yu, Michael Stonebraker, David DeWitt, Marco Serafini, Ashraf Aboulnaga, Tim Kraska. Choosing A Cloud DBMS: Architectures and Tradeoffs. PVLDB, 12(12): 2170-2182, 2019. DOI: https://doi.org/10.14778/3352063.3352133

1. INTRODUCTION
Organizations are moving their applications to the cloud. Despite objections to such a move (security, data location constraints, etc.), sooner or later cost and flexibility considerations will prevail, as evidenced by even national security agencies committing to vendor-hosted cloud deployments [17]. The reasons deal with economies of scale (cloud vendors are deploying servers by the millions, not by the thousands) and specialization (cloud vendors' business priority is infrastructure management, whereas other organizations perform it to support their main lines of business).

For analytic applications running on the cloud, data resides on external shared storage systems such as the S3 or EBS offerings on Amazon Web Services (AWS). Query executors are spun up on demand on compute nodes, such as EC2 nodes in AWS. Compute nodes should be kept running only when strictly necessary, because the running cost of moderately sized instances is orders of magnitude greater than the cost of storage services. This has resulted in a fundamental architectural shift for database management systems (DBMSs). In traditional DBMSs, queries are executed on the same nodes that store the database. If the DBMS is distributed, the dominating paradigm has been the shared-nothing architecture, whereby the database is partitioned among query execution servers. Cloud architectures fit more naturally in the alternative "shared-disk" architecture, where the database is stored by separate storage servers that are distinct from the query execution servers (see Figure 1).
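To make the compute-versus-storage cost asymmetry concrete, the short sketch below uses assumed, illustrative prices (not measurements from this paper) for an always-on, moderately sized EC2 instance and S3 standard storage of a 1 TB database:

    # Back-of-the-envelope comparison of always-on compute vs. object storage.
    # Prices are assumptions for illustration; actual AWS prices vary by
    # region, instance type, and time.
    EC2_PRICE_PER_HOUR = 1.00      # assumed $/hour for a moderately sized instance
    S3_PRICE_PER_GB_MONTH = 0.023  # assumed $/GB-month for standard object storage
    HOURS_PER_MONTH = 730
    DATASET_GB = 1000              # a 1 TB analytic database

    compute_cost = EC2_PRICE_PER_HOUR * HOURS_PER_MONTH   # about $730/month
    storage_cost = S3_PRICE_PER_GB_MONTH * DATASET_GB     # about $23/month
    print(f"compute: ${compute_cost:.0f}/mo, storage: ${storage_cost:.0f}/mo, "
          f"ratio: {compute_cost / storage_cost:.0f}x")

Even this single-instance example yields roughly a 30x gap; analytic clusters of several such instances widen it further, which is why query executors are started only when needed while the data stays on cheap storage.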
For the cloud provider selling a DBMS-as-a-service, an architecture that "decouple[s] the storage tier from the compute tier" provides many advantages, including simplifying node failure tolerance, hot disk management, and software upgrades. Additionally, it allows the reduction of network traffic by moving certain DBMS functions, like the log applicator, to the storage tier, as in the case of Aurora [27].

Shared-disk DBMSs for cloud analytics face non-obvious design choices that are relevant to users. This paper sheds light on the resulting tradeoffs, which have not been clearly identified in previous work. To this end, we evaluate six popular production OLAP DBMSs (Athena, Hive [23], Presto, Redshift [10], Redshift Spectrum, and Vertica [14]) with different AWS resource and storage configurations using the TPC-H benchmark. Despite being limited to these specific DBMSs and a single cloud provider, our work highlights general tradeoffs that arise in other settings. This study aims to provide users insights into how different DBMSs and cloud configurations perform for a business analytics workload, which can help them choose the right configurations. It also provides developers an overview of the design space for future cloud DBMS implementations.

We group the DBMS design choices and tradeoffs into three broad categories, which result from the need to deal with (A) external storage, (B) query executors that are spun up on demand, and (C) DBMS-as-a-service offerings.

Dealing with external storage: Cloud providers offer multiple storage services with different semantics, performance, and cost. The first DBMS design choice involves selecting one of these services. Object stores, like AWS S3, allow storing arbitrary binary blobs that can be associated with application-specific metadata and are assigned unique global identifiers. These data are accessible via a web-based REST API. Alternatively, one can use remote block-level storage services like AWS EBS, which can be attached to compute nodes and accessed through their local file system. Google Cloud Platform (GCP) and Microsoft Azure offer similar storage choices. Which abstraction performs best and is most cost effective?
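As a minimal illustration of these two abstractions, the sketch below reads the same (hypothetical) file once over the S3 REST API via boto3 and once through the file system from an attached block volume; bucket names and paths are placeholders, and it assumes boto3 is installed with AWS credentials configured:

    import boto3

    # (a) Object store: fetch a blob over the S3 web API by bucket and key.
    s3 = boto3.client("s3")
    obj = s3.get_object(Bucket="my-tpch-bucket", Key="lineitem/part-0001.orc")
    payload = obj["Body"].read()   # bytes streamed over the network

    # (b) Block-level storage: an EBS volume (or local instance storage) is
    # attached to the compute node and mounted, so the same data is read
    # through the node's local file system.
    with open("/mnt/ebs/lineitem/part-0001.orc", "rb") as f:
        payload = f.read()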
Compute node instances (EC2) are increasingly being offered with larger local instance storage volumes. These nodes have faster I/O than those with EBS storage and may be cheaper per unit of storage, but they are ephemeral, limited to certain instance types and sizes, and do not persist data after system restarts. How should they be used in the DBMS design? By initially pre-loading the database onto local instance storage, a DBMS can keep the traditional shared-nothing model. Alternatively, the local storage can be used as a cache to avoid accessing remote storage. Is local caching advantageous in a time of ever-increasing network speeds?

The DBMS design also has to deal with different data formats. Keeping data on external shared storage means that data can be shared across multiple applications. In fact, many DBMSs are able to access data stored in DBMS-agnostic formats, such as Parquet or ORC (see the ORC sketch at the end of this section). Which compatibility issues arise in this context?

Dealing with on-demand query executors: Query executors should be kept running as little as possible to minimize costs, so they may be started and stopped often. A consequence is that query executors have different startup times and often run queries with a cold cache. Which DBMSs start quickly? Which DBMSs are designed for optimal performance with a cold vs. warm cache? This relates to how data is cached after it is first read from remote storage. Another set of design choices concerns scalability: under typical cloud pricing, running a scale-out DBMS on twice as many instances costs about the same as running a scale-up DBMS on instances that are twice as powerful. How do existing DBMSs scale vertically and horizontally in cloud settings?

Dealing with DBMS-as-a-service offerings: Many cloud providers have DBMS-as-a-service offerings, such as Athena or Redshift on AWS. These come with different pricing structures than other services such as EC2 or S3. For example, Athena bills queries based only on the amount of data scanned (see the Athena sketch at the end of this section). These services also offer less flexibility to users in terms of the resources they use, and they hide key low-level details entirely. How do these different classes of systems compare?

Summary of findings: This paper provides a detailed account of these tradeoffs and sheds light on these questions. The main findings include the following:

• Cheap remote shared object storage like S3 provides order-of-magnitude cost savings over remote block stores like EBS, which are commonly used for DBMS storage in shared-nothing architectures, without significant performance disadvantages in mixed workloads. (Shared-nothing architectures adapted for the cloud often do not use true local storage because local data is not persisted upon node shutdown.)

• Physically attached local instance storage provides faster performance than EBS. Additionally, its cost is coupled into compute costs, which provides slight cost advantages over EBS due to AWS's contractual compute resource pricing schemes.

• Caching from cheap remote object storage like S3 to node block storage is disadvantageous in cold-start cases and should not always be done by default.

• A carefully chosen general-use columnar format like ORC provides flexibility for future system optimization over the proprietary storage formats used by shared-nothing DBMSs, and it appears performant on TPC-H even without optimized partitioning. Shared-nothing systems try to bridge the gap with hybrid features (Vertica Eon and Redshift Spectrum), but their cost-performance characteristics are very different from those of systems focused on utilizing these general data formats, like Presto.

• Redshift is unique among the systems tested in that it compiles queries to machine code. Because it is very efficient in the single-user use case with both warm and cold caches, query compilation time is not disadvantageous on TPC-H.
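To illustrate the portability behind the ORC finding above: a system-agnostic columnar file written by one engine can be read directly by other tools, selecting only the columns a query needs. A minimal sketch, assuming the pyarrow library is available with ORC support and using a hypothetical TPC-H lineitem file:

    import pyarrow.orc as orc

    # Open an ORC file produced by any engine (Hive, Presto, Spark, ...) and
    # read only the columns of interest; the path is illustrative and the
    # column names follow the TPC-H lineitem schema.
    reader = orc.ORCFile("lineitem/part-0001.orc")
    table = reader.read(columns=["l_returnflag", "l_linestatus", "l_extendedprice"])
    print(table.num_rows, table.schema)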
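Similarly, Athena's scan-based pricing discussed above can be observed per query. The sketch below, assuming boto3 with configured credentials and hypothetical database, table, and result-bucket names, submits a query and reads back the bytes-scanned statistic that Athena bills on:

    import time
    import boto3

    athena = boto3.client("athena")
    qid = athena.start_query_execution(
        QueryString="SELECT count(*) FROM lineitem",
        QueryExecutionContext={"Database": "tpch"},
        ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
    )["QueryExecutionId"]

    # Poll until the query finishes, then inspect the execution statistics.
    while True:
        execution = athena.get_query_execution(QueryExecutionId=qid)["QueryExecution"]
        state = execution["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(1)

    scanned_gb = execution["Statistics"]["DataScannedInBytes"] / 1e9
    print(f"state={state}, data scanned={scanned_gb:.2f} GB")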