H-Store: A High-Performance, Distributed Main Memory Transaction Processing System

Robert Kallman, Hideaki Kimura, Jonathan Natkins, Andrew Pavlo, Alexander Rasin, Stanley Zdonik
Brown University
{rkallman, hkimura, jnatkins, pavlo, alexr, sbz}@cs.brown.edu

Evan P. C. Jones, Samuel Madden, Michael Stonebraker, Yang Zhang
Massachusetts Institute of Technology
{evanj, madden, stonebraker, yang}@csail.mit.edu

John Hugg
Vertica Inc.
[email protected]

Daniel J. Abadi
Yale University
[email protected]

ABSTRACT

Our previous work has shown that architectural and application shifts have resulted in modern OLTP databases increasingly falling short of optimal performance [10]. In particular, the availability of multiple cores, the abundance of main memory, the lack of user stalls, and the dominant use of stored procedures are factors that portend a clean-slate redesign of RDBMSs. This previous work showed that such a redesign has the potential to outperform legacy OLTP databases by a significant factor. These results, however, were obtained using a bare-bones prototype that was developed just to demonstrate the potential of such a system. We have since set out to design a more complete execution platform, and to implement some of the ideas presented in the original paper. Our demonstration presented here provides insight on the development of a distributed main memory OLTP database and allows for the further study of the challenges inherent in this operating environment.

1. INTRODUCTION

The use of specialized data engines has been shown to outperform traditional or “one size fits all” database systems [8, 9]. Many of these traditional systems use a myriad of architectural components inherited from the original System R database, regardless of whether the target application domain actually needs such unwieldy techniques [1]. For example, the workloads for on-line transaction processing (OLTP) systems have particular properties, such as repetitive and short-lived transaction executions, that are hindered by the I/O performance of legacy RDBMS platforms. One obvious strategy to mitigate this problem is to scale a system horizontally by partitioning both the data and processing responsibilities across multiple shared-nothing machines. Although some RDBMS platforms now provide support for this paradigm in their execution framework, research shows that building a new OLTP system that is optimized from its inception for a distributed environment is advantageous over retrofitting an existing RDBMS [2].

Using a disk-oriented RDBMS is another key bottleneck in OLTP databases. All but the very largest OLTP applications are able to fit their entire data set into the memory of a modern shared-nothing cluster of server machines. Therefore, disk-oriented storage and indexing structures are unnecessary when the entire database is able to reside strictly in memory. There are some main memory database systems available today, but again many of these systems inherit the architectural baggage of System R [6]. Other distributed main memory databases have also focused on the migration of legacy architectural features to this environment [4].

Our research is focused on developing H-Store, a next-generation OLTP system that operates on a distributed cluster of shared-nothing machines where the data resides entirely in main memory. The system model is based on the coordination of multiple single-threaded engines to provide more efficient execution of OLTP transactions. An earlier prototype of the H-Store system was shown to significantly outperform a traditional, disk-based RDBMS installation using a well-known OLTP benchmark [10, 11]. The results from this previous work demonstrate that our ideas have merit; we expand on this work and in this paper we present a more full-featured version of the system that is able to execute across multiple machines within a local area cluster. With this new system, we are investigating interesting aspects of this operating environment, including how to exploit non-trivial properties of transactions.

We begin in Section 2 by first providing an overview of the internals of the H-Store system. The justification for many of the design decisions for H-Store is found in earlier work [10]. In Section 3 we then expand our previous discussion on the kinds of transactions that are found in OLTP systems and the desirable properties of an H-Store database design. We conclude the paper in Section 4 by describing the demonstration system that we developed to explore these properties further and observe the execution behavior of our system.

Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the VLDB copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Very Large Data Base Endowment. To copy otherwise, or to republish, to post on servers or to redistribute to lists, requires a fee and/or special permission from the publisher, ACM.
VLDB '08, August 24-30, 2008, Auckland, New Zealand
Copyright 2008 VLDB Endowment, ACM 000-0-00000-000-0/00/00.

2. SYSTEM OVERVIEW

The H-Store system is a highly distributed, row-store-based relational database that runs on a cluster of shared-nothing, main memory executor nodes.

We define a single H-Store instance as a cluster of two or more computational nodes deployed within the same administrative domain. A node is a single physical computer system that hosts one or more sites. A site is the basic operational entity in the system; it is a single-threaded daemon that an external OLTP application connects to in order to execute a transaction. We assume that the typical H-Store node has multiple processors and that each site is assigned to execute on exactly one processor core on a node. Each site is independent from all other sites, and thus does not share any data structures or memory with collocated sites running on the same machine.

Every relation in the database is divided into one or more partitions. A partition is replicated and hosted on multiple sites, forming a replica set.

OLTP applications make calls to the H-Store system to repeatedly execute pre-defined stored procedures. Each procedure is identified by a unique name and consists of structured control code intermixed with parameterized SQL commands. An instance of a stored procedure initiated by an OLTP application is called a transaction.

Using these definitions, we now describe the deployment and execution schemes of the H-Store system (see Figure 1).

Figure 1: H-Store system architecture.

2.1 System Deployment

H-Store provides an administrator with a cluster deployment framework that takes as input a set of stored procedures, a database schema, a sample workload, and a set of available sites in the cluster. The framework outputs a list of unique invocation handlers that are used to reference the procedures at runtime.

[...] of multiple distributed skeleton plans, one for each query in the procedure. Each plan’s sub-operations are not specific to the database’s distributed layout, but are optimized using common techniques found in non-distributed databases. Once this first optimization phase is complete, the compiled procedures are transmitted to each site in the local cluster.

At runtime, the query planner on each site converts skeleton plans for the queries into multi-site, distributed query plans. This two-phase optimization approach is similar to strategies used in previous systems [3]. Our optimizer uses a preliminary cost model based on the number of network communications that are needed to process the request. This has proven to be sufficient in our initial tests as transaction throughput is not affected by disk access.

2.2 Runtime Model

At execution time, the client OLTP application requests a new transaction using the stored procedure handlers generated during the deployment process. If a transaction requires parameter values as input, the client must also pass these to the system along with the request. We assume for now that all sites are trusted and reside within the same administrative domain, and thus we trust any network communication that the system receives.

Any site in an H-Store cluster is able to execute any OLTP application request, regardless of whether the data needed by a particular transaction is located on that site. Unless all of the queries’ parameter values in a transaction are known at execution time, we use a lazy-planning strategy for creating optimized distributed execution plans. If all of the parameter values are known in the beginning, then we perform additional optimizations based on certain transaction properties (see Section 3.1).

When a running transaction executes a SQL command, it makes an internal request through an H-Store API. At this point, all of the parameter values for that query are known by the system, and thus the plan is annotated with the locations of the target sites for each query sub-operation. The updated plan is then passed to a transaction manager that is responsible for coordinating access with the other sites. The manager transmits plan fragments to the appropriate site over sockets and the intermediate results (if any) are passed back to the initiating site. H-Store uses a generic distributed transaction framework to ensure serializability. We are currently comparing protocols that are based on both optimistic and pessimistic concurrency control [5, 12]. The final results for a query are passed back to the blocked transaction process, which will abort the transaction, execute more queries, or commit the transaction and return results back to the client application.
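
The runtime step in which a skeleton plan is annotated with target sites once a query's parameter values are known, together with a cost estimate based on network communications, can be illustrated with a small sketch. This is not H-Store code: the names `REPLICA_SETS`, `partition_for`, `SkeletonPlan`, `annotate`, and `network_cost` are hypothetical, hash partitioning on a single column is an assumption, and routing a read to one replica is an assumption, since the text does not say how a replica is chosen.

```python
from dataclasses import dataclass, field
import zlib

# Hypothetical layout: partition id -> replica set (sites hosting that partition).
REPLICA_SETS = {0: ["site0", "site2"], 1: ["site1", "site3"]}


def partition_for(key) -> int:
    """Assumed hash partitioning; crc32 keeps the mapping stable across runs."""
    return zlib.crc32(str(key).encode()) % len(REPLICA_SETS)


@dataclass
class SkeletonPlan:
    """Per-query plan compiled at deployment time, with no site locations yet."""
    sql: str
    partition_param: str  # name of the parameter that determines the partition
    target_sites: list = field(default_factory=list)


def annotate(plan: SkeletonPlan, params: dict, local_site: str) -> SkeletonPlan:
    """Return a copy of the plan with target sites filled in from parameter values."""
    replicas = REPLICA_SETS[partition_for(params[plan.partition_param])]
    # Assumption: prefer a local replica so the fragment needs no network hop.
    site = local_site if local_site in replicas else replicas[0]
    return SkeletonPlan(plan.sql, plan.partition_param, [site])


def network_cost(plan: SkeletonPlan, local_site: str) -> int:
    """Preliminary cost: number of remote sites that must be contacted."""
    return sum(1 for site in plan.target_sites if site != local_site)


if __name__ == "__main__":
    skeleton = SkeletonPlan("SELECT * FROM warehouse WHERE w_id = ?", "w_id")
    ready = annotate(skeleton, {"w_id": 7}, local_site="site0")
    print(ready.target_sites, network_cost(ready, "site0"))
```

Keeping the skeleton plan unchanged and producing an annotated copy mirrors the two-phase split: the layout-independent plan is built once at deployment, and only the site annotation is deferred until the parameter values arrive.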
