
doi:10.1145/1953122.1953144

10 Rules for Scalable Performance in 'Simple Operation' Datastores

By Michael Stonebraker and Rick Cattell

key insights
• Many scalable SQL and NoSQL datastores have been introduced over the past five years, designed for Web 2.0 and other applications that exceed the capacity of single-server RDBMSs.
• Major differences characterize these new datastores as to their consistency guarantees, per-server performance for read versus write loads, automatic recovery from failure of a server, programming convenience, and administrative simplicity.
• Applications must be designed for scalability, partitioning application data into "shards," avoiding operations that span partitions, designing for parallelism, and weighing requirements for consistency guarantees.

The relational model of data was proposed in 1970 by Ted Codd5 as the best solution for the DBMS problems of the day—business data processing. Early relational systems included System R2 and Ingres,9 and almost all commercial relational DBMS (RDBMS) implementations today trace their roots to these two systems. As such, unless you squint, the dominant commercial vendors—Oracle, IBM, and Microsoft—as well as the major open source systems—MySQL and PostgreSQL—all look about the same today; we term these systems general-purpose traditional row stores, or GPTRS, sharing the following features:
• Disk-oriented storage;
• Tables stored row-by-row on disk, hence, a row store;
• B-trees as the indexing mechanism;
• Dynamic locking as the concurrency-control mechanism;
• A write-ahead log, or WAL, for crash recovery;
• SQL as the access language; and
• A "row-oriented" query optimizer and executor, pioneered in System R.7

The 1970s and 1980s were characterized by a single major DBMS market—business data processing—today called online transaction processing, or OLTP. Since then, DBMSs have come to be used in a variety of new markets, including data warehouses, scientific databases, social-networking sites, and gaming sites; the modern-day DBMS market is characterized in the figure here.

The figure includes two axes: horizontal, indicating whether an application is read-focused or write-focused, and vertical, indicating whether an application performs simple operations (read or write a few items) or complex operations (read or write thousands of items); for example, the traditional OLTP market is write-focused with simple operations, while the data-warehouse market is read-focused with complex operations.

[Sidebar graphic: the 10 rules]
1. Look for shared-nothing scalability.
2. High-level languages are good and need not hurt performance.
3. Plan to carefully leverage main memory databases.
4. High availability and automatic recovery are essential for SO scalability.
5. Online everything.
6. Avoid multi-node operations.
7. Don't try to build ACID consistency yourself.
8. Look for administrative simplicity.
9. Pay attention to node performance.
10. Open source gives you more control over your future.

Many applications are, of course, in between; for example, social-networking applications involve mostly simple operations but also a balance of reads and writes. Hence, the figure should be viewed as a continuum in both directions, with any given application somewhere in between.

[Figure: A characterization of DBMS applications. Vertical axis: simple operations to complex operations; horizontal axis: write-focus to read-focus. OLTP is simple and write-focused, data warehouses are complex and read-focused, and social networking falls in between.]

The major commercial engines and open source implementations of the relational model are positioned as "one-size-fits-all" systems; that is, their implementations are claimed to be appropriate for all locations in the figure.

However, there is also some dissatisfaction with one-size-fits-all. Witness, for example, the commercial success of the so-called column stores in the data-warehouse market. With these products, only those columns needed in the query are retrieved from disk, eliminating overhead for unused data. In addition, superior compression and indexing are obtained, since only one kind of object exists on each storage block, rather than several-to-many. Finally, main-memory bandwidth is economized through a query executor that operates on compressed data. For these reasons, column stores are remarkably faster than row stores on typical data-warehouse workloads, and we expect them to dominate the data-warehouse market over time.

Our focus here is on simple-operation (SO) applications, the lower portion of the figure. Quite a few new, non-GPTRS systems have been designed to provide scalability for this market. Loosely speaking, we classify them into the following four categories:

Key-value stores. Includes Dynamo, Voldemort, Membase, Membrain, Scalaris, and Riak. These systems have the simplest data model: a collection of objects, each with a key and a payload, providing little or no ability to interpret the payload as a multi-attribute object, with no query mechanism for non-primary attributes;

Document stores. Includes CouchDB, MongoDB, SimpleDB, and Terrastore, in which the data model consists of objects with a variable number of attributes, some allowing nested objects. Collections of objects are searched via constraints on multiple attributes through a (non-SQL) query language or procedural mechanism;

Extensible record stores. Includes BigTable, Cassandra, HBase, HyperTable, and PNUTS, providing variable-width record sets that can be partitioned vertically and horizontally across multiple nodes. They are generally not accessed through SQL; and

SQL DBMSs. Focus on SO application scalability, including MySQL Cluster, other MySQL derivatives, VoltDB, NimbusDB, and Clustrix. They retain SQL and ACID (Atomicity, Consistency, Isolation, and Durability)a transactions, but their implementations are often very different from those of GPTRS systems.

a ACID; see http://en.wikipedia.org/wiki/ACID

We do not claim this classification is precise or exhaustive, though it does cover the major classes of newcomer. Moreover, the market is changing rapidly, so the reader is advised to check other sources for the latest. For a more thorough discussion and references for these systems, see Cattell4 and the table here.

The NoSQL movement is driven largely by the systems in the first three categories, restricting the traditional notion of ACID transactions by allowing only single-record operations to be transactions and/or by relaxing ACID semantics, by, say, supporting only "eventual consistency" on multiple versions of data.

These systems are driven by a variety of motivations. For some, it is dissatisfaction with the relational model or the "heaviness" of RDBMSs. For others, it is the needs of large Web properties with some of the most demanding SO problems around. Large Web properties were frequently start-ups lucky enough to experience explosive growth, the so-called hockey-stick effect. They typically use an open source DBMS, because it is free or already understood by the staff. A single-node DBMS solution might be built for version 1, which quickly exhibits scalability problems. The conventional wisdom is then to "shard," or partition, the application data over multiple nodes that share the load. A table can be partitioned this way; for example, employee names can be partitioned onto 26 nodes by putting all the "A"s on node 1 and so forth. It is now up to application logic to direct each query and update to the correct node. However, such sharding in application logic has a number of severe drawbacks:

• If a cross-shard filter or join must be performed, then it must be coded in the application;
• If updates are required within a transaction to multiple shards, then the application is responsible for somehow guaranteeing data consistency across nodes;
• Node failures are more common as the system scales. A difficult problem is how to maintain consistent replicas, detect failures, fail over to replicas, and replace failed nodes in a running system;
• Making schema changes without taking shards "offline" is a challenge; and
• Reprovisioning the hardware to add additional nodes or change the configuration is extremely tedious and, likewise, much more difficult if the shards cannot be taken offline.

Many developers of sharded Web applications experience severe pain because they must perform these functions in application-level logic; much of the NoSQL movement targets this pain point.
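As a concrete illustration of these drawbacks, the following minimal Python sketch shows what sharding in application logic looks like; the 26-way split on first letters and the in-memory "nodes" are our illustrative assumptions, not a real deployment. The application, not the DBMS, routes every operation, and any filter on a non-sharding attribute must fan out to all nodes and merge the results itself.

# Minimal sketch of application-level sharding, as described above.
# The 26-way split and the in-memory "nodes" are illustrative only.
import string

NODES = {letter: {} for letter in string.ascii_uppercase}  # one "node" per letter

def node_for(name: str) -> dict:
    # Application logic, not the DBMS, picks the shard.
    return NODES[name[0].upper()]

def insert_employee(name: str, salary: int) -> None:
    node_for(name)[name] = {"salary": salary}          # single-shard write

def salary_by_name(name: str) -> int:
    return node_for(name)[name]["salary"]              # single-shard read

def everyone_earning_over(threshold: int) -> list:
    # Filter on a non-sharding attribute: the application must fan out to
    # all 26 nodes and merge the results itself (the first drawback above).
    return [name
            for node in NODES.values()
            for name, rec in node.items()
            if rec["salary"] > threshold]

insert_employee("Alice", 120_000)
insert_employee("Zoe", 95_000)
print(salary_by_name("Zoe"))               # routed to one node
print(everyone_earning_over(100_000))      # touches every node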

However, with the large number of new systems and the wide range of approaches they take, customersb might have difficulty understanding and choosing a system to meet their application requirements.

b We use the term "customer" to refer to any organization evaluating or using one of these systems, even though some of them are open source, with no vendor.

System information sources.
Asterdata  http://asterdata.com
BigTable  http://labs.google.com/papers/bigtable.html
Clustrix  http://clustrix.com
CouchDB  http://couchdb.apache.org
DB2  http://ibm.com/software/data/db2
Dynamo  http://portal.acm.org/citation.cfm?id=1294281
Exadata  http://oracle.com/exadata
Greenplum  http://greenplum.com
Hadoop  http://hadoop.apache.org
HBase  http://hbase.apache.org
HyperTable  http://hypertable.org
MongoDB  http://mongodb.org
MySQL  http://mysql.com/products/enterprise
MySQL Cluster  http://mysql.com/products/database/cluster
Netezza  http://netezza.com
NimbusDB  http://nimbusdb.com
Oracle  http://oracle.com
Oracle RAC  http://oracle.com/rac
Paraccel  http://paraccel.com
PNUTS  http://research.yahoo.com/pub/2304
PostgreSQL  http://postgresql.org
Riak  http://basho.com/Riak.html
Scalaris  http://code.google.com/p/scalaris
SimpleDB  http://amazon.com/simpledb
SQL Server  http://microsoft.com/sqlserver
Teradata  http://teradata.com
Terrastore  http://code.google.com/p/terrastore
Tokyo Cabinet  http://1978th.net/tokyocabinet
Vertica  http://vertica.com
Voldemort  http://project-voldemort.com
VoltDB  http://voltdb.com

Here, we present 10 rules we advise any customer to consider with an SO application and in examining non-GPTRS systems. They are a mix of DBMS requirements and guidelines concerning good SO application design. We state them in the context of customers running software in their own environment, though most also apply to software-as-a-service environments. We lay out each rule, then indicate why it is necessary:

Rule 1. Look for shared-nothing scalability. A DBMS can run on three hardware architectures: The oldest—shared-memory multiprocessing (SMP)—means the DBMS runs on a single node consisting of a collection of cores sharing a common main memory and disk system. SMP is limited by main memory bandwidth to a relatively small number of cores. Clearly, the number of cores will increase in future systems, but it remains to be seen if main memory bandwidth will increase commensurately. Hence, multicore systems face performance limitations with DBMS software. Customers choosing an SMP system were forced to perform sharding themselves to obtain scalability across SMP nodes and face the painful problems noted earlier. Popular systems running on SMP configurations are MySQL, PostgreSQL, and Microsoft SQL Server.

A second option is to choose a DBMS that runs on disk clusters, where a collection of CPUs with private main memories share a common disk system. This architecture was popularized in the 1980s and 1990s by DEC, HP, and Sun but involves serious scalability problems in the context of a DBMS. Due to the private buffer pool in each node's main memory, the same disk block can be in multiple buffer pools. Hence, careful synchronization of these buffer-pool blocks is required. Similarly, a private lock table is included in each node's main memory. Careful synchronization is again required but limits the scalability of a shared-disk configuration to a small number of nodes, typically fewer than 10.

Oracle RAC is a popular example of a DBMS running shared disk, and it is difficult to find RAC configurations with a double-digit number of nodes. Oracle recently announced Exadata and Exadata 2, running shared disk at the top level of a two-tier hierarchy while running shared-nothing at the bottom level.

The final architecture is a shared-nothing configuration, where each node shares neither main memory nor disk; rather, the nodes in a collection of self-contained nodes are connected to one another through networking. Essentially, all DBMSs oriented toward the data-warehouse market since 1995 run shared-nothing, including Greenplum, Vertica, Asterdata, Paraccel, Netezza, and Teradata. Moreover, DB2 can run shared-nothing, as do many NoSQL engines. Shared-nothing engines normally perform automatic sharding (partitioning) of data to achieve parallelism. Shared-nothing systems scale only if data objects are partitioned across the system's nodes in a manner that balances the load. If there is data skew or "hot spots," then a shared-nothing system degrades in performance to the speed of the overloaded node. The application must also make the overwhelming majority of transactions "single-sharded," a point covered further in Rule 6.

Unless limited by application data/operation skew, well-designed, shared-nothing systems should continue to scale until networking bandwidth is exhausted or until the needs of the application are met.
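A hedged sketch of the partitioning point in Rule 1 follows; node counts and keys are invented for illustration. Hashing the sharding key spreads objects roughly evenly across nodes, while partitioning on a skewed attribute, where one value dominates, sends most of the load to a single "hot spot" node.

# Sketch: balanced hash partitioning vs. a skew-prone partition.
# Node counts and keys are illustrative, not from any particular system.
from collections import Counter
import hashlib

N_NODES = 8

def hash_node(key: str) -> int:
    # Stable hash so every client routes the same key to the same node.
    return int(hashlib.sha1(key.encode()).hexdigest(), 16) % N_NODES

keys = [f"order-{i}" for i in range(10_000)]
hash_load = Counter(hash_node(k) for k in keys)
print("hash partitioning:", dict(hash_load))        # roughly 1,250 keys per node

# Partitioning on a skewed attribute: all rows with the same value land on
# the same node, so one node absorbs most of the load (Rule 1's hot spot).
skewed_attr = ["hot-product"] * 9_000 + [f"product-{i}" for i in range(1_000)]
skewed_load = Counter(hash_node(v) for v in skewed_attr)
print("skewed partitioning:", dict(skewed_load))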


Many NoSQL systems reportedly run 100 nodes or more, and BigTable reportedly runs on thousands of nodes. The DBMS needs of Web applications can drive DBMS scalability upward in a hurry; for example, Facebook was recently sharding 4,000 MySQL instances in application logic. If it chose a DBMS, it would have to scale at least to this number of nodes. An SMP or shared-disk DBMS has no chance at this level of scalability. Shared-nothing DBMSs are the only game in town.

Rule 2. High-level languages are good and need not hurt performance. Work in a SQL transaction can include the following components:
• Overhead resulting from the optimizer choosing an inferior execution plan;
• Overhead of communicating with the DBMS;
• Overhead inherent in coding in a high-level language;
• Overhead for services (such as concurrency control, crash recovery, and data integrity); and
• Truly useful work to be performed, no matter what.

Here, we cover the first three, leaving the last two for Rule 3. Hierarchical and network systems were the dominant DBMS solutions in the 1960s and 1970s, offering a low-level procedural interface to data. The high-level language of RDBMSs was instrumental in displacing these DBMSs for three reasons:
• A high-level language system requires the programmer to write less code that is easier to understand;
• Users state what they want instead of writing a disk-oriented algorithm on how to access the data they need; a programmer need not understand complex storage optimizations; and
• A high-level language system has a better chance of allowing a program to survive a change in the schema without maintenance or recoding; as such, low-level systems require far more maintenance.

One charge leveled at RDBMSs in the 1970s and 1980s was they could not be as efficient as low-level systems. The claim was that automatic query optimizers could not do as good a job as smart programmers. Though early optimizers were primitive, they were quickly as good as all but the best human programmers. Moreover, most organizations could never attract and retain this level of talent. Hence, this source of overhead has largely disappeared and today is only an issue on very complex queries rarely found in SO applications.

The second source of overhead is communicating with the DBMS. For security reasons, RDBMSs insist on the application being run in a separate address space, using ODBC or JDBC for DBMS interaction. The overhead of these communication protocols is high; running a SQL transaction requires several back-and-forth messages over TCP/IP. Consequently, any programmer seriously interested in performance runs transactions using a stored-procedure interface, rather than SQL commands over ODBC/JDBC. In the case of stored procedures, a transaction is a single over-and-back message. The DBMS further reduces communication overhead by batching multiple transactions in one call. The communication cost is a function of the interface selected, can be minimized, and has nothing to do with the language level of the interaction.

The third source of overhead is coding in SQL rather than in a low-level procedural language. Since most serious SQL engines compile to machine code or at least to a Java-style intermediate representation, this overhead is not large; that is, standard language compilation converts a high-level specification into a very efficient low-level runtime executable.

Hence, one of the key lessons in the DBMS field over the past 25 years is that high-level languages are good and do not hurt performance. Some new systems provide SQL or a more limited higher-level language; others provide only a "database assembly language," or individual index and object operations. This low-level interface may be adequate for very simple applications, but, in all other cases, high-level languages provide compelling advantages.
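The communication argument in Rule 2 can be made concrete with a toy model; the "connection" below is entirely hypothetical and only counts messages, not real DBMS traffic, but it shows why a stored-procedure interface turns a multi-statement transaction into a single over-and-back exchange.

# Toy model of Rule 2's communication overhead. The "connection" is
# hypothetical; it only counts round trips and does not contact a DBMS.
class CountingConnection:
    def __init__(self):
        self.round_trips = 0

    def execute(self, sql: str) -> None:
        # One network round trip per interactive statement (ODBC/JDBC style).
        self.round_trips += 1

    def call_procedure(self, name: str, *args) -> None:
        # The whole transaction runs inside the server: one round trip total.
        self.round_trips += 1

interactive = CountingConnection()
interactive.execute("BEGIN")
interactive.execute("UPDATE accounts SET balance = balance - 10 WHERE id = 'A'")
interactive.execute("UPDATE accounts SET balance = balance + 10 WHERE id = 'B'")
interactive.execute("COMMIT")
print("interactive SQL:", interactive.round_trips, "round trips")   # 4

stored_proc = CountingConnection()
stored_proc.call_procedure("transfer", "A", "B", 10)
print("stored procedure:", stored_proc.round_trips, "round trip")   # 1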
The high-level lan- code or at least to a Java-style interme- Based on simple measures of other GP- guage of RDBMSs was instrumental in diate representation, this overhead is TRS systems, the Shore results are rep- displacing these DBMSs for three rea- not large; that is, standard language resentative of those systems as well. sons: compilation converts a high-level spec- Harizopoulos et al.6 used a database ˲˲ A high-level language system re- ification into a very efficient low-level size that allowed all data to fit in main quires the programmer write less code runtime executable. memory, as it was consistent with that is easier to understand; Hence, one of the key lessons in the most SO applications. Since Shore, like ˲˲ Users state what they want instead DBMS field over the past 25 years is other GPTRS systems, is disk-based, of writing a disk-oriented algorithm that high-level languages are good and it meant all data would reside in the on how to access the data they need; do not hurt performance. Some new main memory buffer pool. Their goal a programmer need not understand systems provide SQL or a more limited was to categorize DBMS overhead on complex storage optimizations; and higher-level language; others provide TPC-C; they ran the DBMS in the same ˲˲ A high-level language system has only a “database assembly language,” address space as the application driv- a better chance of allowing a program or individual index and object opera- er, avoiding any TCP/IP cost. They then to survive a change in the schema with- tions. This low-level interface may be looked at the components of CPU us- out maintenance or recoding; as such, adequate for very simple applications, age that perform useful work or deliver low-level systems require far more but, in all other cases, high-level lan- DBMS services. maintenance. guages provide compelling advantag- Following is the CPU cycle usage for One charge leveled at RDBMSs in es. various tasks in the new-order trans- the 1970s and 1980s was they could Rule 3. Plan to carefully leverage action of TPC-C; since Harizopoulos not be as efficient as low-level systems. main memory databases. Consider a et al.6 noted some shortcomings that The claim was that automatic query cluster of 16 nodes, each with 64GB are fixed in most commercial GPTRSs, optimizers could not do as good a job of main memory. Any shared-nothing these results have been scaled to as- as smart programmers. Though early DBMS thereby has access to about 1TB sume removal of these sources of over- optimizers were primitive, they were of main memory. Such a hardware con- head: quickly as good as all but the best hu- figuration would have been considered Useful work (13%). This is the CPU

Rule 3. Plan to carefully leverage main memory databases. Consider a cluster of 16 nodes, each with 64GB of main memory. Any shared-nothing DBMS thereby has access to about 1TB of main memory. Such a hardware configuration would have been considered extreme a few years ago but is commonplace today. Moreover, memory per node will increase in the future, and the number of nodes in a cluster is also likely to increase. Hence, typical future clusters will have increasing terabytes of main memory.

As a result, if a database is a couple of terabytes or less (a very large SO database), customers should consider main-memory deployment. If a database is larger, customers should consider main-memory deployment when practical. In addition, flash memory has become a promising storage medium, as prices have decreased.

Given the random-access speed of RAM versus disk, a DBMS can potentially run thousands of times faster. However, the DBMS must be architected properly to utilize main memory efficiently; only modest improvements are achievable by simply running a DBMS on a machine with more memory.

To understand why, consider the CPU overhead in DBMSs. In 2008, Harizopoulos et al.6 measured performance using part of a major SO benchmark, TPC-C, on the Shore open-source DBMS. This DBMS was chosen because the source code was available for instrumentation and because it was a typical GPTRS implementation. Based on simple measures of other GPTRS systems, the Shore results are representative of those systems as well.

Harizopoulos et al.6 used a database size that allowed all data to fit in main memory, as it was consistent with most SO applications. Since Shore, like other GPTRS systems, is disk-based, it meant all data would reside in the main memory buffer pool. Their goal was to categorize DBMS overhead on TPC-C; they ran the DBMS in the same address space as the application driver, avoiding any TCP/IP cost. They then looked at the components of CPU usage that perform useful work or deliver DBMS services.

Following is the CPU cycle usage for various tasks in the new-order transaction of TPC-C; since Harizopoulos et al.6 noted some shortcomings that are fixed in most commercial GPTRSs, these results have been scaled to assume removal of these sources of overhead:

Useful work (13%). This is the CPU cost for actually finding relevant records and performing retrieval or update of relevant attributes;

Locking (20%). This is the CPU cost of setting and releasing locks, detecting deadlock, and managing the lock table;

Logging (23%). When a record is updated, the before and after images of the change are written to a log. Shore then groups transactions together in a "group commit," forcing the relevant portions of the log to disk;

Buffer pool overhead (33%). Since all data resides in the buffer pool, any retrievals or updates require finding the relevant block in the buffer pool. The appropriate record(s) must then be located and the relevant attributes in the record found. Blocks on which there is an open database cursor must be "pinned" in main memory. Moreover, Least-Recently-Used or another replacement algorithm is utilized, requiring the recording of additional information; and

Multithreading overhead (11%). Since most DBMSs are multithreaded, multiple operations are going on in parallel. Unfortunately, the lock table is a shared data structure that must be "latched" to serialize access by the various parallel threads. In addition, B-tree indexes and resource-management information must be similarly protected. Latches (mutexes) must be set and released when shared data structures are accessed. See Harizopoulos et al.6 for a more detailed discussion, including on why the latching overhead may be understated.

A conventional disk-based DBMS clearly spends the overwhelming majority of its cycles on overhead activity. To go a lot faster, the DBMS must avoid all the overhead components discussed here; for example, a main memory DBMS with conventional multithreading, locking, and recovery is only marginally faster than its disk-based counterpart. A NoSQL or other database engine will not dramatically outperform a GPTRS implementation, unless all these overhead components are addressed or the GPTRS solution has not been properly architected (by, say, using conversational SQL rather than a compiled stored procedure interface).

We look at single-machine performance in our analysis, but this performance has a direct effect on the multi-machine scalability discussed in Rule 1, as well as in our other rules.


Rule 4. High availability and automatic recovery are essential for SO scalability. In 1990, a typical DBMS application would run on what we would now consider very expensive hardware. If the hardware failed, the customer would restore working hardware, reload the operating system and DBMS, then recover the database to the state of the last completed transaction by performing an undo of incomplete transactions and a redo of completed transactions using a DBMS log. This process could take time (several minutes to an hour or more) during which the application would be unavailable.

Few customers today are willing to accept any downtime in their SO applications, and most want to run redundant hardware and use data replication to maintain a second copy of all objects. On a hardware failure, the system switches over to the backup and continues operation. Effectively, customers want "nonstop" operation, as pioneered in the 1980s by Tandem Computers.

Furthermore, many large Web properties run large numbers of shared-nothing nodes in their configurations, where the probability of failure rises as the number of "moving parts" increases. This failure rate renders human intervention impractical in the recovery process; instead, shared-nothing DBMS software must automatically detect and repair failed nodes.

Any DBMS acquired for SO applications should have built-in high availability, supporting nonstop operation. Three high-availability caveats should be addressed. The first is that there is a multitude of kinds of failure, including:
• Application, where the application corrupts the database;
• DBMS, where the bug can be recreated (so-called Bohr bugs);
• DBMS, where the bug cannot be recreated (so-called Heisenbugs);
• Hardware, of all kinds;
• Lost network packets;
• Denial-of-service attacks; and
• Network partitions.

Any DBMS will continue operation for some but not for all these failure modes. The cost of recovering from all possible failure modes is very high. Hence, high availability is a statistical effort, or how much availability is desired against particular classes of failures.

The second caveat is the so-called CAP, or consistency, availability, and partition-tolerance, theorem.3 In the presence of certain failures, it states that a distributed system can have only two out of these three characteristics: consistency, availability, and partition-tolerance. Hence, there are theoretical limits on what is possible in the high-availability arena.

Moreover, many site administrators want to guard against disasters (such as earthquakes and floods). Though rare, recovery from disasters is important and should be viewed as an extension of high availability, supported by replication over a wide-area network.

Rule 5. Online everything. An SO DBMS should have a single state: "up." From the user's point of view, it should never fail and never have to be taken offline. In addition to failure recovery, we need to consider operations that require the database be taken offline in many current implementations:

Schema changes. Attributes must be added to an existing database without interruption in service;

Index changes. Indexes should be added or dropped without interruption in service;

Reprovisioning. It should be possible to increase the number of nodes used to process transactions, without interruption in service; for example, a configuration might go from 10 nodes to 15 nodes to accommodate an increase in load; and

Software upgrade. It should be possible to move from version N of a DBMS to version N + 1 without interruption of service.

Though supporting these operations is a challenge, 100% uptime should be the goal. As an SO system scales to dozens of nodes and/or millions of users on the Internet, downtime and manual intervention are not practical.

Rule 6. Avoid multi-node operations. Two characteristics are necessary for achieving SO scalability over a cluster of servers:

Even split. The database and application load must be split evenly over the servers. Read-scalability can be achieved by replicating data, but general read/write scalability requires sharding (partitioning) the data over nodes according to a primary key; and

Scalability advantage. Applications rarely perform operations spanning more than one server or shard. If a large number of servers is involved in processing an operation, the scalability advantage may be lost because of redundant work, cross-server communication, or required operation synchronization.

Suppose a customer has an employee table and partitions it based on employee age. If it wants to know the salary of a specific employee, it must then send the query to all nodes, requiring a slew of messages. Only one node will find the desired data; the others will run a redundant query that finds nothing. Furthermore, if an application performs an update that crosses shards, giving, say, a raise to all employees in the shoe department, then the system must pay all of the synchronization overhead of ensuring the transaction is performed on every node.

Hence, a database administrator (DBA) should choose a sharding key to make as many operations single-shard as possible. Fortunately, most applications naturally involve single-shard transactions, if the data is partitioned properly; for example, if purchase orders (POs) and their details are both sharded on PO number, then the vast majority of transactions (such as new PO and update a specific PO) go to a single node.

The percentage of single-node transactions can be increased further by replicating read-only data; for example, a list of customers and their addresses can be replicated at all sites. In many business-to-business environments, customers are added or deleted or change their addresses infrequently. Hence, complete replication allows inserting the address of a customer into a new PO as a single-node operation. Therefore, selective replication of read-mostly data can be advantageous.

In summary, programmers should avoid multi-shard operations to the greatest extent possible, including queries that must go to multiple shards, as well as multi-shard updates requiring ACID properties. Customers should carefully think through their application and database design to accomplish this goal. If that goal is unachievable with the current application design, they should consider a redesign that achieves higher "single-shardedness."

Rule 7. Don't try to build ACID consistency yourself. In general, the key-value stores, document stores, and extensible record stores we mentioned have abandoned transactional ACID semantics for a weaker form of atomicity, isolation, and consistency, providing one or more of the following alternative mechanisms:
• Creating new versions of objects on every write that result in parallel versions when there are multiple asynchronous writes; it is up to application logic to resolve the resulting conflict;
• Providing an "update-if-current" operation that changes an object only if it matches a specified value; this way, an application can read an object it plans to later update and then make changes only if the value is still current;
• Providing ACID semantics but only for read and write operations of a single object, attribute, or shard; and
• Providing "quorum" read-and-write operations that guarantee the latest version among "eventually consistent" replicas.

It is possible to build your own ACID semantics on any of these systems, given enough additional code. However, the task is so difficult, we wouldn't wish it on our worst enemy. If you need ACID semantics, you want to use a DBMS that provides them; it is much easier to deal with this at the DBMS level than at the application level.

Any operation requiring coordinated updates to two objects is likely to need ACID guarantees. Consider a transaction that moves $10 between two user accounts; with an ACID system, a programmer can simply write:

Begin transaction
Decrement account A
Increment account B
Commit transaction

Without an ACID system, there is no easy way to perform this coordinated action. Other cases requiring ACID semantics include charging customers' accounts only if their orders ship and synchronously updating bilateral "friend" references. Standard ACID semantics give the programmer the all-or-nothing guarantee needed to maintain data integrity in these cases. Although some applications do not need such coordination, a commitment to a non-ACID system precludes extending such applications in the future in a way that requires coordination. DBMS applications often live a long time and are subject to unknown future requirements.

We understand the NoSQL movement's motivation for abandoning transactions, given its belief that transactional updates are expensive in traditional GPTRS systems. However, newer SQL engines can offer both ACID and high performance by carefully eliminating all overhead in Rule 3, at least for applications that obey Rule 6 (avoid multi-node operations). If you need ACID transactions and cannot follow Rule 6, then you will likely incur substantial overhead, no matter whether you code the ACID yourself or let the DBMS do it. Letting the DBMS do it is a no-brainer.

We have heard the argument for abandoning ACID transactions based on the CAP theorem,3 stating you can have only two of three characteristics: C consistency, A availability, and P partition-tolerance. The argument is that partitions happen, hence one must abandon consistency to achieve high availability. We take issue for three reasons: First, some applications really do need consistency and cannot give it up. Second, the CAP theorem deals with only a subset of possible failures, as noted in Rule 4, and one is left with how to cope with the rest. And third, we are not convinced that partitions are a substantial issue for data sharded on a LAN, particularly with redundant LANs and applications on the same site; in this case, partitions may be rare, and one is better off choosing consistency (all the time) over availability during a very rare event.

Though true that WAN partitions are much more likely than LAN partitions, WAN replication is normally used for read-only copies or disaster recovery (such as when an entire data center goes offline); WAN latency is too high for synchronous replication or sharding. Few users expect to recover from major disasters without short availability hiccups, so the CAP theorem may be less relevant in this situation.

We advise customers who need ACID to seek a DBMS that provides it, rather than code it themselves, minimizing the overhead of distributed transactions through good database and application design.
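Rule 7's "update-if-current" mechanism is essentially a compare-and-swap. The hedged sketch below (the in-memory store and its API are invented for illustration) shows the application-side retry loop it demands, and why coordinating two such objects, as in the $10 transfer above, is where the trouble starts.

# Sketch of the "update-if-current" (compare-and-swap) alternative to ACID.
# The in-memory "store" and its API are invented for illustration only.
store = {"account_A": 100, "account_B": 50}

def read(key):
    return store[key]

def update_if_current(key, expected, new) -> bool:
    # Succeeds only if nobody changed the value since we read it.
    if store[key] == expected:
        store[key] = new
        return True
    return False

def decrement_with_retry(key, amount):
    while True:                                  # application-side retry loop
        current = read(key)
        if update_if_current(key, current, current - amount):
            return

decrement_with_retry("account_A", 10)
store["account_B"] += 10   # but nothing ties this step to the one above:
                           # a crash in between leaves the $10 in limbo,
                           # which is exactly what ACID would prevent
print(store)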
Rule 8. Look for administrative simplicity. One of our favorite complaints about relational DBMSs is their poor out-of-the-box behavior. Most products include many tuning knobs that allow adjustment of DBMS behavior; moreover, our experience is that a DBA skilled in a particular vendor's product can make it go a factor of two or more faster than one unskilled in the given product.

As such, it is a daunting task to bring in a new DBMS, especially one distributed over many nodes; it requires installation, schema construction, application design, data distribution, tuning, and monitoring. Even getting a high-performance version of TPC-C running on a new engine takes weeks, though code and schema are readily available. Moreover, once an application is in production, it still requires substantial DBA resources to keep it running.

When considering a new DBMS, one should carefully consider the out-of-the-box experience. Never let the vendor do a proof-of-concept exercise for you. Do the proof of concept yourself, so you see the out-of-the-box situation up close. Also, carefully consider application-monitoring tools in your decision.

Lastly, pay particular attention to Rule 5. Some of the most difficult administrative issues (such as schema changes and reprovisioning) in most systems require human intervention.

Rule 9. Pay attention to node performance. A common refrain heard these days is "Go for linear scalability; that way you can always provision to meet your application needs, while node performance is less important." Though true that linear scalability is important, ignoring node performance is a big mistake. One should always remember that linear scalability means overall performance is a multiple of the number of nodes times node performance. The faster the node performance, the fewer nodes one needs.

It is common for solutions to differ in node performance by an order of magnitude or more; for example, in DBMS-style queries, parallel DBMSs outperform Hadoop by more than an order of magnitude,1 and, similarly, H-Store (the prototype predecessor to VoltDB) has been shown to have even higher throughput on TPC-C compared to the products from major vendors.8 For example, consider a customer choosing between two database solutions, each offering linear scalability. If solution A offers node performance a factor of 20 better than solution B, the customer might require 50 hardware nodes with solution A versus 1,000 nodes with solution B.

Such a wide difference in hardware cost, rack space, cooling, and power consumption is obviously non-trivial between the two solutions. More important, if each node fails on average every three years, then solution B will see a failure every day, while solution A will see a failure less than once a month. This dramatic difference will heavily influence how much redundancy is installed and how much administrative time is required to deal with reliability. Node performance makes everything else easier.
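The node-count example in Rule 9 is worth working through explicitly; the script below reproduces the node counts from the text, with the per-day failure arithmetic being our own rough calculation under the stated three-year mean time between failures per node.

# Worked version of Rule 9's example: same linear scalability, different
# node performance. Node counts and the three-year MTBF are from the text;
# the per-day arithmetic is a rough illustration of scale.
nodes_a, nodes_b = 50, 1_000            # 20x faster nodes -> 20x fewer of them
mtbf_days = 3 * 365                     # each node fails about once every three years

def days_between_failures(node_count: int) -> float:
    return mtbf_days / node_count

print("solution A: a failure roughly every %.0f days" % days_between_failures(nodes_a))
print("solution B: a failure roughly every %.1f days" % days_between_failures(nodes_b))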


Rule 10. Open source gives you more control over your future. This final rule is not a technical point but still important to mention, and, hence, perhaps, should be a suggestion rather than a rule. The landscape is littered with situations where a company acquired a vendor's product, only to face expensive upgrades in the following years, large maintenance bills for often-inferior technical support, and the inability to avoid these fees because switching to a different product would require extensive recoding. The best way to avoid "vendor malpractice" is to use an open source product. Open source eliminates expensive licenses and upgrades and often provides multiple alternatives for support, new features, and bug fixes, including the option of doing them in-house.

For these reasons, many newer Web-oriented shops are adamant about using only open source systems. Also, several vendors have proved it possible to make a viable business with an open source model. We expect it to be more popular over time, and customers are well advised to consider its advantages.

Conclusion
The 10 rules we've presented specify the desirable properties of any SO datastore. Customers looking at distributed data-storage solutions are well advised to view the systems they are considering in the context of this rule set, as well as in that of their own unique application requirements. The large number of systems available today range considerably in capabilities and limitations.

Acknowledgments
We would like to thank Andy Pavlo, Rick Hillegas, Rune Humborstad, Stavros Harizopoulos, Dan DeMaggio, Dan Weinreb, Daniel Abadi, Evan Jones, Greg Luck, and Bobbi Heath for their valuable input on this article.

References
1. Abadi, D. et al. A comparison of approaches to large-scale data analysis. In Proceedings of the 2009 SIGMOD Conference on Management of Data (Providence, RI, June 29). ACM Press, New York, 2009, 165–178.
2. Astrahan, M.M. et al. System R: A relational approach to data management. ACM Transactions on Database Systems 1, 2 (June 1976), 97–137.
3. Brewer, E. Towards Robust Distributed Systems. Keynote at the Conference on Principles of Distributed Computing (Portland, OR, July 2000); http://www.cs.berkeley.edu/~brewer/cs262b-2004/PODC-keynote.pdf
4. Cattell, R. Scalable SQL and NoSQL datastores. ACM SIGMOD Record 40, 2 (June 2011).
5. Codd, E.F. A relational model of data for large shared databanks. Commun. ACM 13, 6 (June 1970), 377–387.
6. Harizopoulos, S. et al. OLTP: Through the looking glass and what we found there. In Proceedings of the 2008 SIGMOD Conference on Management of Data (Vancouver, B.C., June 10). ACM Press, New York, 2008, 981–992.
7. Selinger, P. Access path selection in a relational data management system. In Proceedings of the 1979 ACM SIGMOD Conference on Management of Data (Boston, May 30). ACM Press, New York, 1979, 23–24.
8. Stonebraker, M. et al. The end of an architectural era (It's time for a complete rewrite). In Proceedings of the 2007 Very Large Databases Conference (Vienna, Austria, Sept. 23). ACM Press, New York, 2007, 399–410.
9. Stonebraker, M. et al. The design and implementation of Ingres. ACM Transactions on Database Systems 1, 3 (Sept. 1976), 189–222.

Michael Stonebraker ([email protected]) is an adjunct professor in the Computer Science and Artificial Intelligence Laboratory at the Massachusetts Institute of Technology, consultant and founder, Paradigm4, Inc., consultant and founder, Goby, Inc., and consultant and founder, VoltDB, Inc.

Rick Cattell ([email protected]) is a database technology consultant at Cattell.Net and on the technical advisory board of Schooner Information Technologies.

© 2011 ACM 0001-0782/11/06 $10.00
