10 Rules for Scalable Performance in 'Simple Operation' Datastores

contributedarticles Doi:10.1145/1953122.1953144 sharing the following features: Partition data and operations, keep ˲˲ Disk-oriented storage; ˲˲ Tables stored row-by-row on disk, administration simple, do not assume hence, a row store; one size fits all. ˲˲ B-trees as the indexing mechanism; By MiChAeL StonebrakeR AnD Rick Cattell ˲˲ Dynamic locking as the concurren- cy-control mechanism; ˲˲ A write-ahead log, or WAL, for crash recovery; ˲˲ SQL as the access language; and ˲˲ A “row-oriented” query optimizer 10 Rules for 7 and executor, pioneered in System R. The 1970s and 1980s were characterized by a single major DBMS mar- scalable ket—business data processing—today called online transaction processing, or OLTP. Since then, DBMSs have come to be used in a variety of new markets, Performance including data warehouses, scientific databases, social-networking sites, and gaming sites; the modern-day DBMS market is characterized in the figure here. in ‘simple The figure includes two axes: hori- zontal, indicating whether an application is read-focused or write-focused, and vertical, indicating whether an ap- operation’ plication performs simple operations (read or write a few items) or complex operations (read or write thousands of items); for example, the traditional Datastores OLTP market is write-focused with simple operations, while the data warehouse market is read-focused with complex operations. Many appli- key insights Many scalable sQL and nosQL datastores the relatioNal MoDel of data was proposed in 1970 have been introduced over the past five 5 years, designed for Web 2.0 and other by Ted Codd as the best solution for the DBMS problems applications that exceed the capacity of of the day—business data processing. Early relational single-server RDBMss. 2 9 Major differences characterize these systems included System R and Ingres, and almost all new datastores as to their consistency guarantees, per-server performance, commercial relational DBMS (RDBMS) implementations scalability for read versus write loads, today trace their roots to these two systems. automatic recovery from failure of a server, programming convenience, and As such, unless you squint, the dominant commercial administrative simplicity. vendors—Oracle, IBM, and Microsoft—as well as the Applications must be designed for scalability, partitioning application major open source systems—MySQL and PostgreSQL— data into “shards,” avoiding operations that span partitions, designing for all look about the same today; we term these systems parallelism, and weighing requirements general-purpose traditional row stores, or GPTRS, for consistency guarantees. 72 Communications of The acm | juNe 2011 | voL. 54 | No. 6 high-level Look for languages are Plan to carefully shared-nothing GooD LeVerage main scalability. and need not hurt memory databases. 1 2performance. 3 high availability and automatic onLine recovery are essential for eVeRyThinG. SO4 scalability. 5 Don’t try to Avoid Look for build ACiD Multi-noDe administrative consistency operations. simpliCiTy. 6 7yourself. 8 Pay attention Open source to noDe gives you more performance. ControL over your future. 9 10juNe 2011 | voL. 54 | No. 6 | Communications of The acm 73 contributedarticles cations are, of course, in between; for objects, each with a key and a payload, transactions and/or by relaxing ACID example, social-networking applica- providing little or no ability to inter- semantics, by, say, supporting only tions involve mostly simple operations pret the payload as a multi-attribute “eventual consistency” on multiple but also a balance of reads and writes. object, with no query mechanism for versions of data. Hence, the figure should be viewed as non-primary attributes; These systems are driven by a vari- a continuum in both directions, with Document stores. Includes Couch- ety of motivations. For some, it is dis- any given application somewhere in DB, MongoDB, SimpleDB, and Ter- satisfaction with the relational model between. rastore in which the data model con- or the “heaviness” of RDBMSs. For oth- The major commercial engines sists of objects with a variable number ers, it is the needs of large Web prop- and open source implementations of of attributes, some allowing nested erties with some of the most demand- the relational model are positioned objects. Collections of objects are ing SO problems around. Large Web as “one-size-fits-all” systems; that is, searched via constraints on multiple properties were frequently start-ups their implementations are claimed to attributes through a (non-SQL) query lucky enough to experience explosive be appropriate for all locations in the language or procedural mechanism; growth, the so-called hockey-stick ef- figure. Extensible record stores. Includes fect. They typically use an open source However, there is also some dis- BigTable, Cassandra, HBase, HyperT- DBMS, because it is free or already satisfaction with one-size-fits-all. Wit- able, and PNUTS providing variable- understood by the staff. A single-node ness, for example, the commercial width record sets that can be par- DBMS solution might be built for ver- success of the so-called column stores titioned vertically and horizontally sion 1, which quickly exhibits scalabil- in the data-warehouse market. With across multiple nodes. They are gener- ity problems. The conventional wis- these products, only those columns ally not accessed through SQL; and dom is then to “shard,” or partitioning needed in the query are retrieved from SQL DBMSs. Focus on SO applica- the application data over multiple disk, eliminating overhead for unused tion scalability, including MySQL Clus- nodes that share the load. A table can data. In addition, superior compres- ter, other MySQL derivatives, VoltDB, be partitioned this way; for example, sion and indexing is obtained, since NimbusDB, and Clustrix. They retain employee names can be partitioned only one kind of object exists on each SQL and ACID (Atomicity, Consisten- onto 26 nodes by putting all the “A”s on storage block, rather than several-to- cy, Isolation, and Durability)a transac- node 1 and so forth. It is now up to ap- many. Finally, main-memory band- tions, but their implementations are plication logic to direct each query and width is economized through a query often very different from those of GP- update to the correct node. However, executor that operates on compressed TRS systems. such sharding in application logic has data. For these reasons, column stores We do not claim this classification a number of severe drawbacks: are remarkably faster than row stores is precise or exhaustive, though it does ˲˲ If a cross-shard filter or join must on typical data-warehouse workloads, cover the major classes of newcomer. be performed, then it must be coded in and we expect them to dominate the Moreover, the market is changing rap- the application; data-warehouse market over time. idly, so the reader is advised to check ˲˲ If updates are required within a Our focus here is on simple-opera- other sources for the latest. For a more transaction to multiple shards, then tion (SO) applications, the lower por- thorough discussion and references the application is responsible for tion of the figure. Quite a few new, non- for these systems, see Cattell4 and the somehow guaranteeing data consis- GPTRS systems have been designed table here. tency across nodes; to provide scalability for this market. The NoSQL movement is driven ˲˲ Node failures are more common Loosely speaking, we classify them largely by the systems in the first three as the system scales. A difficult prob- into the following four categories: categories, restricting the traditional lem is how to maintain consistent Key-value stores. Includes Dynamo, notion of ACID transactions by allow- replicas, detect failures, fail over to Voldemort, Membase, Membrain, Sca- ing only single-record operations to be replicas, and replace failed nodes in a laris, and Riak. These systems have the running system; simplest data model: a collection of a ACID; see http://en.wikipedia.org/wiki/ACID ˲˲ Making schema changes without taking shards “offline” is a challenge; A characterization of DBMs applications. and ˲˲ Reprovisioning the hardware to Complex operations data warehouses add additional nodes or change the configuration is extremely tedious and, likewise, much more difficult if social networking the shards cannot be taken offline. Many developers of sharded Web simple operations applications experience severe pain OLTP because they must perform these func- tions in application-level logic; much write-focus read-focus of the NoSQL movement targets this pain point. However, with the large number of new systems and the wide 74 Communications of The acm | juNe 2011 | voL. 54 | No. 6 contributedarticles range of approaches they take, cus- but limits the scalability of a shared cel, Netezza, and Teradata. Moreover, tomersb might have difficulty under- disk configuration to a small number DB2 can run shared-nothing, as do standing and choosing a system to of nodes, typically fewer than 10. many NoSQL engines. Shared-nothing meet their application requirements. Oracle RAC is a popular example of engines normally perform automat- Here, we present 10 rules we advise a DBMS running shared disk, and it ic sharding (partitioning) of data to any customer to consider with an SO is difficult to find RAC configurations achieve parallelism. Shared-nothing application and in examining non-GP- with a double-digit number of nodes. systems scale only if data objects are TRS systems. They are a mix of DBMS Oracle recently announced Exadata partitioned across the system’s nodes requirements and guidelines concern- and Exadata 2 running shared disk in a manner that balances the load. If ing good SO application design. We at the top level of a two-tier hierarchy there is data skew or “hot spots,” then state them in the context of customers while running shared-nothing at the a shared-nothing system degrades in running software in their own environ- bottom level.

10 Rules for Scalable Performance in 'Simple Operation' Datastores

Aware, Workstation-Based Distributed Database System

Providing High Availability for SAP Resources with Oracle Clusterware 11 Release 2

Using Oracle Goldengate 12C for Oracle Database

Providing High Availability for SAP Resources

Magic Quadrant for Data Warehouse Database Management Systems

Oracle Database Net Services Reference, 12C Release 1 (12.1) E17611-13

Technical Comparison of Oracle Real Application Cluster 11G Vs. IBM

Oracle Database 10G Vs. IBM DB2 UDB Technical Overview

Progress Datadirect for JDBC for Oracle User's Guide

Implementation of Oracle Real Application Cluster

Business Continuity with IBM Hyperswap, Oracle Real Application Cluster (RAC), and Vmware Site Recovery Manager a Solution Overview

Oracle Real Application Clusters 11G