Robustness in the Salus scalable block store
Yang Wang, Manos Kapritsos, Zuocheng Ren, Prince Mahajan, Jeevitha Kirubanandam, Lorenzo Alvisi, and Mike Dahlin
The University of Texas at Austin

Abstract: This paper describes Salus, a block store that seeks to maximize simultaneously both scalability and robustness. Salus provides strong end-to-end correctness guarantees for read operations, strict ordering guarantees for write operations, and strong durability and availability guarantees despite a wide range of server failures (including memory corruptions, disk corruptions, firmware bugs, etc.). Such increased protection does not come at the cost of scalability or performance: indeed, Salus often actually outperforms HBase (the codebase from which Salus descends). For example, Salus' active replication allows it to halve network bandwidth while increasing aggregate write throughput by a factor of 1.74 compared to HBase in a well-provisioned system.

1 Introduction

The primary directive of storage—not to lose data—is hard to carry out: disks and storage sub-systems can fail in unpredictable ways [7, 8, 18, 23, 34, 37], and so can the CPUs and memories of the nodes that are responsible for accessing the data [33, 38]. Concerns about robustness become even more pressing in cloud storage systems, which appear to their clients as black boxes even as their larger size and complexity create greater opportunities for error and corruption.

This paper describes the design and implementation of Salus,¹ a scalable block store in the spirit of Amazon's Elastic Block Store (EBS) [1]: a user can request storage space from the service provider, mount it like a local disk, and run applications upon it, while the service provider replicates data for durability and availability. What makes Salus unique is its dual focus on scalability and robustness. Some recent systems have provided end-to-end correctness guarantees on distributed storage despite arbitrary node failures [13, 16, 31], but these systems are not scalable—they require each correct node to process at least a majority of updates. Conversely, scalable distributed storage systems [3, 4, 6, 11, 14, 20, 25, 30, 43] typically protect some subsystems like disk storage with redundant data and checksums, but fail to protect the entire path from client PUT to client GET, leaving them vulnerable to single points of failure that can cause data corruption or loss.

Salus provides strong end-to-end correctness guarantees for read operations, strict ordering guarantees for write operations, and strong durability and availability guarantees despite a wide range of server failures (including memory corruptions, disk corruptions, firmware bugs, etc.), and leverages an architecture similar to scalable key-value stores like Bigtable [14] and HBase [6] towards scaling these guarantees to thousands of machines and tens of thousands of disks.

Achieving this unprecedented combination of robustness and scalability presents several challenges. First, to build a high-performance block store from low-performance disks, Salus must be able to write different sets of updates to multiple disks in parallel. Parallelism, however, can threaten the basic consistency requirement of a block store, as "later" writes may survive a crash, while "earlier" ones are lost.

Second, aiming for efficiency and high availability at low cost can have unintended consequences on robustness by introducing single points of failure. For example, in order to maximize throughput and availability for reads while minimizing latency and cost, scalable storage systems execute read requests at just one replica. If that replica experiences a commission failure that causes it to generate erroneous state or output, the data returned to the client could be incorrect. Similarly, to reduce cost and for ease of design, many systems that replicate their storage layer for fault tolerance (such as HBase) leave unreplicated the computation nodes that can modify the state of that layer: hence, a memory error or an errant PUT at a single HBase region server can irrevocably and undetectably corrupt data (see §5.1).

Third, additional robustness should ideally not result in higher replication cost. For example, in a perfect world Salus' ability to tolerate commission failures would not require any more data replication than a scalable key-value store such as HBase already employs to ensure durability despite omission failures.

To address these challenges Salus introduces three novel ideas: pipelined commit, active storage, and scalable end-to-end verification.

Pipelined commit. Salus' new pipelined commit protocol allows writes to proceed in parallel at multiple disks but, by tracking the necessary dependency information during failure-free execution, guarantees that, despite failures, the system will be left in a state consistent with the ordering of writes specified by the client.

Active storage. To prevent a single computation node from corrupting data, Salus replicates both the storage and the computation layer.

¹ Salus is the Roman goddess of safety and welfare.

Salus applies an update to the system's persistent state only if the update is agreed upon by all of the replicated computation nodes. We make two observations about active storage. First, perhaps surprisingly, replicating the computation nodes can actually improve system performance by moving the computation near the data (rather than vice versa), a good choice when network bandwidth is a more limited resource than CPU cycles. Second, by requiring the unanimous consent of all replicas before an update is applied, Salus comes near to its perfect world with respect to overhead: Salus remains safe (i.e. keeps its blocks consistent and durable) despite two commission failures with just three-way replication—the same degree of data replication needed by HBase to tolerate two permanent omission failures. The flip side, of course, is that insisting on unanimous consent can reduce the times during which Salus is live (i.e. its blocks are available)—but liveness is easily restored by replacing the faulty set of computation nodes with a new set that can use the storage layer to recover the state required to resume processing requests.

Scalable end-to-end verification. Salus maintains a Merkle tree [32] for each volume so that a client can validate that each GET request returns consistent and correct data: if not, the client can reissue the request to another replica. Reads can then safely proceed at a single replica without leaving clients vulnerable to reading corrupted data; more generally, such end-to-end assurances protect Salus clients from the opportunities for error and corruption that can arise in complex, black-box cloud storage solutions. Further, Salus' Merkle tree, unlike those used in other systems that support end-to-end verification [19, 26, 31, 41], is scalable: each server only needs to keep the sub-tree corresponding to its own data, and the client can rebuild and check the integrity of the whole tree even after failing and restarting from an empty state.
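As a concrete illustration of this client-side check, the sketch below shows a minimal volume tree kept entirely at the client: every PUT folds the new block hash into the tree, and every GET reply is accepted only if it matches the hash the client expects. The class and method names are ours, chosen for illustration; as discussed in §4.3, Salus' actual block driver can keep only the upper levels of the tree and fetch region sub-trees on demand.

import java.security.MessageDigest;
import java.util.Arrays;

// Minimal, self-contained sketch of the client-side Merkle check described above.
public class VolumeTreeSketch {
    private final byte[][] leafHashes;   // one hash per block of the volume
    private byte[] rootHash;             // attached to PUTs and compared against the servers' copies

    public VolumeTreeSketch(int numBlocks) {
        leafHashes = new byte[numBlocks][];
        for (int i = 0; i < numBlocks; i++) leafHashes[i] = hash(new byte[0]);
        rootHash = recomputeRoot();
    }

    // Called on every PUT: fold the new block into the tree before sending,
    // so the root hash attached to the PUT reflects this write.
    public byte[] onPut(int block, byte[] data) {
        leafHashes[block] = hash(data);
        rootHash = recomputeRoot();
        return rootHash;
    }

    // Called on every GET reply: accept the data only if it matches the
    // expected hash; otherwise the client retries at another replica.
    public boolean verifyGet(int block, byte[] returnedData) {
        return Arrays.equals(hash(returnedData), leafHashes[block]);
    }

    private byte[] recomputeRoot() {
        byte[][] level = leafHashes;
        if (level.length == 0) return hash(new byte[0]);
        while (level.length > 1) {
            byte[][] next = new byte[(level.length + 1) / 2][];
            for (int i = 0; i < next.length; i++) {
                byte[] left = level[2 * i];
                byte[] right = (2 * i + 1 < level.length) ? level[2 * i + 1] : left;
                byte[] both = new byte[left.length + right.length];
                System.arraycopy(left, 0, both, 0, left.length);
                System.arraycopy(right, 0, both, left.length, right.length);
                next[i] = hash(both);
            }
            level = next;
        }
        return level[0];
    }

    private static byte[] hash(byte[] data) {
        try {
            return MessageDigest.getInstance("SHA-256").digest(data);
        } catch (Exception e) {
            throw new AssertionError(e);
        }
    }
}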
We have prototyped Salus by modifying the HBase key-value store. The evaluation confirms that Salus can tolerate servers experiencing commission failures like memory corruption, disk corruption, etc. Although one might fear the performance price to be paid for Salus' robustness, Salus' overheads are low in all of our experiments. In fact, despite its strong guarantees, Salus often outperforms HBase, especially when disk bandwidth is plentiful compared to network bandwidth. For example, Salus' active replication allows it to halve network bandwidth while increasing aggregate write throughput by a factor of 1.74 in a well-provisioned system.

2 Requirements and model

Salus provides the abstraction of a large collection of virtual disks, each of which is an array of fixed-sized blocks. Each virtual disk is a volume that can be mounted by a client running in the datacenter that hosts the volume. The volume's size (e.g., several hundred GB to several hundred TB) and block size (e.g., 4 KB to 256 KB) are specified at creation time.

A volume's interface supports GET and PUT, which on a disk correspond to read and write. A client may have many such commands outstanding to maximize throughput. At any given time, only one client may mount a volume for writing, and during that time no other client can mount the volume for reading. Different clients may mount and write different volumes at the same time, and multiple clients may simultaneously mount a read-only snapshot of a volume.

We explicitly designed Salus to support only a single writer per volume for two reasons. First, as demonstrated by the success of Amazon EBS, this model is sufficient to support disk-like storage. Second, we are not aware of a design that would allow Salus to support multiple writers while achieving its other goals: strong consistency,² scalability, and end-to-end verification for read requests.

Even though each volume has only a single writer at a time, a distributed block store has several advantages over a local one. Spreading a volume across multiple machines not only allows disk throughput and storage capacity to exceed the capabilities of a single machine, but balances load and increases resource utilization.

To minimize cost, a typical server in existing storage deployments is relatively storage heavy, with a total capacity of up to 24 TB [5, 42]. We expect a storage server in a Salus deployment to have ten or more SATA disks and two 1 Gbit/s network connections. In this configuration disk bandwidth is several times more plentiful than network bandwidth, so the Salus design seeks to minimize network bandwidth consumption.

2.1 Failure model

Salus is designed to operate on an unreliable network with unreliable nodes. The network can drop, reorder, modify, or arbitrarily delay messages.

For storage nodes, we assume that 1) servers can crash and recover, temporarily making their disks' data unavailable (transient omission failure); 2) servers and disks can fail, permanently losing all their data (permanent omission failure); 3) disks and the software that controls them can cause corruption, where some blocks are lost or modified, possibly silently [35], and servers can experience memory corruption, software bugs, etc., sending corrupted messages to other nodes (commission failure). When calculating failure thresholds, we only take into account commission failures and permanent omission failures. Transient omission failures are not treated as failures: in asynchronous systems a node that fails and recovers is indistinguishable from a slow node.

² More precisely, ordered commit (defined in §2.2), which for multiple clients implies FIFO-compliant linearizability.


In line with Salus' aim to provide end-to-end robustness guarantees, we do not try to explicitly enumerate and patch all the different ways in which servers can fail. Instead, we design Salus to tolerate arbitrary failures, both of omission, where a faulty node fails to perform actions specified by the protocol, such as sending, receiving or processing a message; and of commission [16], where a faulty node performs arbitrary actions not called for by the protocol. However, while we assume that faulty nodes will potentially generate arbitrarily erroneous state and output, given the data center environment we target we explicitly do not attempt to tolerate cases where a malicious adversary controls some of the servers. Hence, we replace the traditional BFT assumption that faulty nodes cannot break cryptographic primitives [36] with the stronger (but fundamentally similar) assumption that a faulty node never produces a checksum that appears to be a correct checksum produced by a different node. In practice, this means that where in a traditional Byzantine-tolerant system [12] we might have used signatures or arrays of message authentication codes (MACs) with pairwise secret keys, we instead weakly sign communication using checksums salted with the checksum creator's well-known ID.
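A minimal sketch of what such a weak signature can look like is shown below, assuming (purely for illustration, not as Salus' actual wire format) that the checksum is a CRC32 computed over the creator's well-known ID concatenated with the payload.

import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

// Sketch of a "weak signature": a checksum salted with the creator's
// well-known ID. It only defends against nodes that do not deliberately
// forge checksums, which is exactly the non-adversarial assumption above.
public final class WeakSig {
    public static long sign(String creatorId, byte[] payload) {
        CRC32 crc = new CRC32();
        crc.update(creatorId.getBytes(StandardCharsets.UTF_8));
        crc.update(payload);
        return crc.getValue();
    }

    public static boolean verify(String claimedCreator, byte[] payload, long sig) {
        return sign(claimedCreator, payload) == sig;
    }
}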
Salus relies on weak synchrony assumptions for both safety and liveness. For safety, Salus assumes that clocks are sufficiently synchronized that a ZooKeeper lease is never considered valid by a client when the server considers it invalid. Salus only guarantees liveness during synchronous intervals where messages sent between correct nodes are received and processed within some timeout [10].

2.2 Consistency model

To be usable as a virtual disk, Salus needs to preserve the standard disk semantics provided by physical disks. These semantics allow some requests to be marked as barriers. A disk must guarantee that all requests received before a barrier are committed before the barrier, and all requests received after the barrier are committed after the barrier. Additionally, a disk guarantees freshness: a read to a block returns the latest committed write to that block.

During normal operation (up to two commission or omission failures), Salus guarantees both freshness and a property we call ordered-commit: by the time a request R is committed, all requests that were received before R have committed. Note that ordered-commit eliminates the need for explicit barriers since every write request functions as a barrier. Although we did not set out to achieve ordered-commit and its stronger guarantees, Salus provides them without any noticeable effect on performance.

Under severe failures Salus provides the weaker prefix semantics: in these circumstances, a client that crashes and restarts may observe only a prefix of the committed writes; a tail of committed writes may be lost. This semantics is not new to Salus: it is the semantics familiar to every client that interacts with a crash-prone server that acknowledges writes immediately but logs them asynchronously; it is also the semantics to which every other geo-replicated storage system we know of [11, 29, 31] retreats when failures put it under duress. The reason is simple: while losing writes is always disappointing, prefix semantics has at least the merit of leaving the disk in a legal state. Still, data loss should be rare, and Salus falls back on prefix semantics only in the following scenario: the client crashes, one or more of the servers suffer at the same time a commission failure, and the rest of the servers are unavailable. If the client does not fail or at least one server is correct and available, Salus continues to guarantee standard disk semantics.

Salus mainly focuses on tolerating arbitrary failures of server-side storage systems, since they entail most of the complexity and are primarily responsible for preserving the durability and availability of data. Client commission failures can also be handled using replication, but this falls beyond the scope of this paper.

3 Background

Salus' starting point is the scalable architecture of HBase/HDFS, which Salus carefully modifies to boost robustness without introducing new bottlenecks. We chose the HBase/HDFS architecture for three main reasons: first, because it provides a key-value interface that can be easily modified to support a block store; second, because it has a large user base that includes companies such as Yahoo!, Facebook, and Twitter; and third because, unlike other successful large-scale storage systems with similar architectural features, such as Windows Azure [11] and Google's Bigtable/GFS [14, 20], HBase/HDFS is open source.

HDFS. HDFS [39] is an append-only distributed file system. It stores the system metadata in a NameNode and replicates the data over a set of datanodes. Each file consists of a set of blocks and HDFS ensures that each block is replicated across a specified number of datanodes (three by default) despite datanode failures. HDFS is widely used, primarily because of its scalability.

HBase. HBase [6] is a distributed key-value store. It exports the abstraction of tables accessible through a PUT/GET interface. Each table is split into multiple regions of non-overlapping key-ranges (for load balancing). Each region is assigned to one region server that is responsible for all requests to that region. Region servers use HDFS as a storage layer to ensure that data is replicated persistently across enough nodes. Additionally, HBase uses a Master node to manage the assignment of key-ranges to various region servers.


Region servers receive clients' PUT and GET requests and transform them into equivalent requests that are appropriate for the append-only interface exposed by HDFS. On receiving a PUT, a region server logs the request to a write-ahead log stored on HDFS and updates its sorted, in-memory map (called memstore) with the new PUT. When the size of the memstore exceeds a predefined threshold, the region server flushes the memstore to a checkpoint file stored on HDFS.

On receiving a GET request for a key, the region server looks up the key in its memstore. If a match is found, the region server returns the corresponding value; otherwise, it looks up the key in various checkpoints, starting from the most recent one, and returns the first matching value. Periodically, to minimize the storage overheads and the GET latency, the region server performs compaction by reading a number of contiguous checkpoints and merging them into a single checkpoint.
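The sketch below restates this write and read path with in-memory stand-ins for the HDFS log and checkpoint files; it follows the flow just described (log, memstore, flush, newest-checkpoint-first lookup, compaction) but is not HBase's actual implementation.

import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// In-memory stand-in for the region server write/read path sketched above.
public class RegionSketch {
    private final List<String> wal = new ArrayList<>();            // stand-in for the HDFS write-ahead log
    private TreeMap<String, byte[]> memstore = new TreeMap<>();    // sorted in-memory map
    private final List<TreeMap<String, byte[]>> checkpoints = new ArrayList<>();
    private final int flushThreshold;

    public RegionSketch(int flushThreshold) { this.flushThreshold = flushThreshold; }

    public void put(String key, byte[] value) {
        wal.add(key);                       // 1. make the update durable in the log
        memstore.put(key, value);           // 2. apply it to the sorted in-memory map
        if (memstore.size() >= flushThreshold) flush();
    }

    public byte[] get(String key) {
        byte[] v = memstore.get(key);       // the memstore holds the most recent values
        if (v != null) return v;
        for (int i = checkpoints.size() - 1; i >= 0; i--) {   // then newest checkpoint first
            v = checkpoints.get(i).get(key);
            if (v != null) return v;
        }
        return null;
    }

    private void flush() {                  // the memstore becomes an immutable checkpoint
        checkpoints.add(memstore);
        memstore = new TreeMap<>();
    }

    // Compaction: merge the checkpoints into one, keeping the newest value per key.
    public void compact() {
        TreeMap<String, byte[]> merged = new TreeMap<>();
        for (TreeMap<String, byte[]> cp : checkpoints) merged.putAll(cp);
        checkpoints.clear();
        checkpoints.add(merged);
    }
}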

ZooKeeper. ZooKeeper [22] is a replicated coordination service. It is used by HBase to ensure that each key-range is assigned to at most one region server.

Fig. 1: The architecture of Salus. Salus differs from HBase in three key ways. First, Salus' block driver performs end-to-end checks to validate the GET reply. Second, Salus performs pipelined commit across different regions to ensure ordered commit. Third, Salus replicates region servers via active storage to eliminate spurious state updates. For efficiency, Salus tries to co-locate the replicated region servers with the replicated datanodes (DNs).

4 The design of Salus

The architecture of Salus, as Figure 1 shows, bears considerable resemblance to that of HBase. Like HBase, Salus uses HDFS as its reliable and scalable storage layer, partitions key ranges within a table in distinct regions for load balancing, and supports the abstraction of a region server responsible for handling requests for the keys within a region. As in HBase, blocks are mapped to their region server through a Master node, leases are managed using ZooKeeper, and Salus clients need to install a block driver to access the storage system, not unlike the client library used for the same purpose in HBase.

These similarities are intentional: they aim to retain in Salus the ability to scale to thousands of nodes and tens of thousands of disks that has secured HBase's success. Indeed, one of the main challenges in designing Salus was to achieve its robustness goals (strict ordering guarantees for write operations across multiple disks, end-to-end correctness guarantees for read operations, strong availability and durability guarantees despite arbitrary failures) without perturbing the scalability of the original HBase design. With this in mind, we have designed Salus so that, whenever possible, it buttresses architectural features it inherits from HBase—and does so scalably. So, the core of Salus' active storage is a three-way replicated region server (RRS), which upgrades the original HBase region server abstraction to guarantee safety despite up to two arbitrary server failures. Similarly, Salus' end-to-end verification is performed within the familiar architectural feature of the block driver, though upgraded to support Salus' scalable verification mechanisms.
Figure 1 also helps describe the role played by our novel techniques (pipelined commit, scalable end-to-end verification, and active storage) in the operation of Salus. Every client request in Salus is mediated by the block driver, which exports a virtual disk interface by converting the application's API calls into Salus GET and PUT requests. The block driver, as we saw, is the component in charge of performing Salus' scalable end-to-end verification (see §4.3): for PUT requests it generates the appropriate metadata, while for GET requests it uses the request's metadata to check whether the data returned to the client is consistent.

To issue a request, the client (or rather, its block driver) contacts the Master, which identifies the RRS responsible for servicing the block that the client wants to access. The client caches this information for future use and forwards the request to that RRS. The first responsibility of the RRS is to ensure that the request commits in the order specified by the client. This is where the pipelined commit protocol becomes important: as we will see in more detail in §4.1, the protocol requires only minimal coordination to enforce dependencies among requests assigned to distinct RRSs. If the request is a PUT, the RRS also needs to ensure that the data associated with the request is made persistent, despite the possibility of individual region servers suffering commission failures. This is the role of active storage (see §4.2): the responsibility of processing PUT requests is no longer assigned to a single region server, but is instead conditioned on the set of region servers in the RRS achieving unanimous consent on the update to be performed. Thanks to Salus' end-to-end verification guarantees, GET requests can instead be safely carried out by a single region server (with obvious performance benefits), without running the risk that the client sees incorrect data.

4.1 Pipelined commit

The goal of the pipelined commit protocol is to allow clients to concurrently issue requests to multiple regions, while preserving the ordering specified by the client (ordered-commit). In the presence of even simple crash failures, however, enforcing the ordered-commit property can be challenging.


Consider, for example, a client that, after mounting a volume V that spans regions 1 and 2, issues a PUT u1 for a block mapped to region 1 and then, without waiting for the PUT to complete, issues a barrier PUT u2 for a block mapped at region 2. Untimely crashes, even transient ones, of the client and of the region server for region 1 may lead to u1 being lost even as u2 commits.³ Volume V now violates both standard disk semantics and the weaker prefix semantics; further, V is left in an invalid state that can potentially cause severe data loss [15, 35].

³ For simplicity, in this example and throughout this section we consider a single logical region server to be at work in each region. In practice, in Salus this abstraction is implemented by a RRS.

A simple way to avoid such inconsistencies would be to allow clients to issue one request (or one batch of requests) at a time, but, as we show in §5.2.4, performance would suffer significantly. Instead, we would like to achieve the good performance that comes with issuing multiple outstanding requests, without compromising the ordered-commit property. To achieve this goal, Salus parallelizes the bulk of the processing (such as cryptographic checks and disk-writes) required to handle each request, while ensuring that requests commit in order.

Salus ensures ordered-commit by exploiting the sequence number that clients assign to each request. Region servers use these sequence numbers to guarantee that a request does not commit unless the previous request is also guaranteed to eventually commit. Similarly, during recovery, these sequence numbers are used to ensure that a consistent prefix of issued requests are recovered (§4.4).

Salus' approach to ensure ordered-commit for GETs is simple. Like other systems before it [9], Salus neither assigns new sequence numbers to GETs, nor logs GETs to stable storage. Instead, to prevent returning stale values, a GET request to a region server simply carries a prevNum field indicating the sequence number of the last PUT executed on that region: region servers do not execute a GET until they have committed a PUT with the prevNum sequence number. Conversely, to prevent the value of a block from being overwritten by a later PUT, clients block PUT requests to a block that has outstanding GET requests.⁴

⁴ This requirement has minimal impact on performance, as such PUT requests are rare in practice.

Salus' pipelined commit protocol for PUTs is illustrated in Figure 2. The client, as in HBase, issues requests in batches. Unlike HBase, each client is allowed to issue multiple outstanding batches. Each batch is committed using a 2PC-like protocol [21, 24], consisting of the phases described below. Compared to 2PC, pipelined commit reduces the overhead of the failure-free case by eliminating the disk write in the commit phase and by pushing complexity to the recovery protocol, which is usually a good trade-off.

Fig. 2: Pipelined commit (each batch leader is actually replicated to tolerate arbitrary faults).

PC1. Choosing the batch leader and participants. To process a batch, a client divides its PUTs into various sub-batches, one per region server. Just like a GET request, a PUT request to a region also includes a prevNum field to identify the last PUT request issued to that region. The client identifies one region server as batch leader for the batch and sends each sub-batch to the appropriate region server along with the batch leader's identity. The client sends the sequence numbers of all requests in the batch to the batch leader, along with the identity of the leader of the previous batch.

PC2. Preparing. A region server preprocesses the PUTs in its sub-batch by validating each request, i.e. by checking whether the request is signed and by using the prevNum field to verify it is the next request that the region server should process. If validation succeeds for all requests in the sub-batch, the region server logs the request (which is now prepared) and sends its YES vote to the batch's leader; otherwise, the region server votes NO.

PC3. Deciding. The batch leader can decide COMMIT only if it receives a YES vote for all the PUTs in its batch and a COMMIT-CONFIRMATION from the leader of the previous batch; otherwise, it decides ABORT. Either way, the leader notifies the participants of its decision. Upon receiving COMMIT for a request, a region server updates its memory state (memstore), sends a PUT_SUCCESS notification to the client, and asynchronously marks the request as committed on persistent storage. On receiving ABORT, a region server discards the state associated with that PUT and sends the client a PUT_FAILURE message.
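The decision rule of PC3 can be summarized by the sketch below: a batch leader commits only after collecting a YES vote from every participant and a COMMIT-CONFIRMATION from the previous batch's leader. The classes and futures here are illustrative stand-ins rather than Salus' code; the first batch of a volume would simply be given an already completed previous-batch future.

import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of the batch leader's decision rule (step PC3): a batch commits only
// if every participant voted YES and the previous batch's leader has confirmed
// its own commit. Confirmation is chained batch to batch; everything else runs
// in parallel.
public class BatchLeaderSketch {
    public enum Decision { COMMIT, ABORT }

    private final Map<String, Boolean> votes = new ConcurrentHashMap<>(); // region server id -> YES/NO
    private final int expectedVotes;
    private final CompletableFuture<Decision> previousBatch; // COMMIT-CONFIRMATION from batch i-1
    private final CompletableFuture<Decision> decision = new CompletableFuture<>();

    public BatchLeaderSketch(int expectedVotes, CompletableFuture<Decision> previousBatch) {
        this.expectedVotes = expectedVotes;
        this.previousBatch = previousBatch;
    }

    // Called when a participant's vote for its sub-batch arrives.
    public void onVote(String regionServerId, boolean yes) {
        votes.put(regionServerId, yes);
        if (!yes) {
            decision.complete(Decision.ABORT);          // any NO vote aborts the batch
        } else if (votes.size() == expectedVotes) {
            // All sub-batches prepared; still wait for the previous batch's confirmation.
            previousBatch.thenAccept(prev ->
                decision.complete(prev == Decision.COMMIT ? Decision.COMMIT : Decision.ABORT));
        }
    }

    // Participants (and the next batch's leader) learn the outcome from this future.
    public CompletableFuture<Decision> decision() { return decision; }
}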
Notice that all disk writes—both within a batch and across batches—can proceed in parallel and that the voting and commit phases for a given batch can be similarly parallelized. Different region servers receive and log the PUT and COMMIT asynchronously. The only serialization point is the passing of COMMIT-CONFIRMATION from the leader of a batch to the leader of the next batch.

Despite its parallelism, the protocol ensures that requests commit in the order specified by the client. The presence of COMMIT in any correct region server's log implies that all preceding PUTs in this batch must have prepared. Furthermore, all requests in preceding batches must have also prepared. Our recovery protocol (§4.4) ensures that all these prepared PUTs eventually commit without violating ordered-commit.

The pipelined commit protocol enforces ordered-


USENIX Association 10th USENIX Symposium on Networked Systems Design and Implementation (NSDI ’13) 361 commit assuming the abstraction of (logical) region the storage node (datanode) that it needs to access. servers that are correct. It is the active storage proto- Active storage thus reduces bandwidth utilization in col (§4.2) that, from physical region servers that can lose exchange for additional CPU usage (§5.2.2)—an attrac- committed data and suffer arbitrary failures, provides tive trade-off for bandwidth starved data-centers. In par- this abstraction to the pipelined commit protocol. ticular, because a region server can now update the colo- 4.2 Active storage cated datanode without requiring the network, the band- width overheads of flushing (§3) and compaction (§3) in Active storage provides the abstraction of a region server HBase are avoided. that does not experience arbitrary failures or lose data. Salus uses active storage to ensure that the data remains We have implemented active storage in HBase by available and durable despite arbitrary failures in the s- changing the NameNode API for allocating blocks. As torage system by addressing a key limitation of existing in HBase, to create a block a region server sends a re- scalable storage systems: they replicate data at the stor- quest to the NameNode, which responds with the new age layer (e.g. HDFS) but leave the computation layer block’s location; but where the HBase NameNode makes (e.g. HBase) unreplicated. As a result, the computation its placement decisions in splendid solitude, in Salus the layer that processes clients’ requests represents a single request to the NameNode includes a list of preferred point of failure in an otherwise robust system. For ex- datanodes as a location-hint. The hint biases the NameN- ample, a bug in computing the checksum of data or a ode toward assigning the new block to datanodes hosted corruption of the memory of a region server can lead to on the same machines that also host the region servers data loss and data unavailability in systems like HBase. that will access the block. The NameNode follows the The design of Salus embodies a simple principle: all hint unless doing so violates its load-balancing policies. changes to persistent state should happen with the con- Loosely coupling in this way the region servers and sent of a quorum of nodes. Salus uses these compute quo- datanodes of a block yields Salus significant network rums to protect its data from faults in its region servers. bandwidth savings (§5.2.2): why then not go all the Salus implements this basic principle using active s- way—eliminate the HDFS layer and have each region torage. In addition to storing data, storage nodes in Salus server store its state on its local file system? The rea- also coordinate to attest data and perform checks to en- son is that maintaining flexibility in block placement is sure that only correct and attested data is being replicat- crucial to the robustness of Salus: our design allows the ed. Perhaps surprisingly, in addition to improving fault- NameNode to continue to load balance and re-replicate resilience, active storage also enables us to improve per- blocks as needed, and makes it easy for a recovering re- formance by trading relatively cheap CPU cycles for ex- gion server to read state from any datanode that stores it, pensive network bandwidth. not just its own disk. 
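Active storage's unanimous consent requirement (§4.2.2) can be made concrete with the sketch below: an update certificate is accepted only if every region server in the RRS attests to the same update with a verifiable weak signature. The message format and the signature check are invented for illustration and are not Salus' actual protocol.

import java.util.List;

// Sketch of a unanimous-consent check in the spirit of active storage.
public class UnanimousConsentSketch {

    // One replica's attestation to an update; the fields are illustrative.
    public record Attestation(String regionServerId, long updateSeq, long digest, long weakSig) {}

    // A certificate is valid only if every expected replica attests, in chain order,
    // to the same sequence number and digest, and each weak signature checks out.
    public static boolean certificateValid(List<Attestation> cert, List<String> expectedReplicas) {
        if (cert.isEmpty() || cert.size() != expectedReplicas.size()) return false;
        long seq = cert.get(0).updateSeq();
        long digest = cert.get(0).digest();
        for (int i = 0; i < cert.size(); i++) {
            Attestation a = cert.get(i);
            if (!a.regionServerId().equals(expectedReplicas.get(i))) return false;
            if (a.updateSeq() != seq || a.digest() != digest) return false;
            if (!weakSigValid(a)) return false;
        }
        return true;
    }

    // Stand-in for verifying a checksum salted with the creator's well-known ID (see §2.1).
    private static boolean weakSigValid(Attestation a) {
        long expected = a.regionServerId().hashCode() ^ a.updateSeq() ^ a.digest();
        return a.weakSig() == expected;
    }
}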
Using active storage, Salus can provide strong avail- 4.2.2 Using unanimous consent to reduce replica- ability and durability guarantees: a data block with a tion overheads quorum of size n will remain available and durable as long as no more than n 1 nodes fail. These guarantees To control the replication and storage overheads, we use hold irrespective of whether− the nodes fail by crashing unanimous consent quorums for PUTs. Existing system- (omission) or by corrupting their disk, memory, or logi- s replicate data to three nodes to ensure durability de- cal state (commission). spite two permanent omission failures. Salus provides Replication typically incurs network and storage over- the same durability and availability guarantees despite t- heads. Salus uses two key ideas—(1) moving computa- wo failures of either omission or commission without in- tion to data, and (2) using unanimous consent quorums— creasing the number of replicas. To tolrate f commission to ensure that active storage does not incur more network faults with just f + 1 replicas, Salus requires the repli- cost or storage cost compared to existing approaches that cas to reach unanimous consent prior to performing any do not replicate computation. operation that updates the state and to store a certificate proving the legitimacy of the update. 4.2.1 Moving computation to data to minimize net- Of course, the failure of any of the replicated region work usage servers can prevent unanimous consent. To ensure live- Salus implements active storage by blurring the bound- ness, Salus replaces any RRS that is not making adequate aries between the storage layer and the compute layer. progress with a new set of region servers, which read al- Existing storage systems [6, 11, 14] require a designated l state committed by the previous region server quorum primary datanode to mediate updates. In contrast, Salus from the datanodes and resume processing requests. This modifies the storage system API to permit region servers fail-over protocol is a slight variation of the one already to directly update any replica of a block. Using this mod- present in HBase to handle failures of unreplicated re- ified interface, Salus can efficiently implement active s- gion servers. If a client detects a problem with a RRS, torage by colocating a compute node (region server) with it sends a RRS-replacement request to the Master, which




Region 3 4 Fig. 3: Steps to process a PUT request in Salus using active storage. 1 2 trees 3 4 first attempts to get all the nodes of the existing RRS RS3 RS4 to relinquish their leases; if that fails, the Master coor- Fig. 4: Merkle tree structure on client and region servers dinates with ZooKeeper to prevent lease renewal. Once the previous RRS is known to be disabled, the Master whether the proposed operation should be performed. appoints a new RRS. Then Salus performs the recovery 4.3 End-to-end verification protocol as described in §4.4. Local file systems fail in unpredictable ways [35]. Dis- 4.2.3 Active storage protocol tributed systems like HBase are even more complex To provide to the other components of Salus the abstrac- and are therefore more prone to failures. To provide tion of a correct region server, region servers within a strong correctness guarantees, Salus implements end-to- RRS are organized in a chain. In response to a client’s end checks that (a) ensure that clients access correct and PUT request or to attend a periodic task (such as flush- current data and (b) do so without affecting performance: ing and compaction), the primary region server (the first GETs can be processed at a single replica and yet retain replica in the chain) issues a proposal, which is forward- the ability to identify whether the returned data is correct ed to all region servers in the chain. After executing the and current. request, the region servers in the RRS coordinate to cre- Like many existing systems [19, 26, 31, 41], Salus’ ate a certificate attesting that all replicas executed the re- mechanism for end-to-end checks leverages Merkle trees quest in the same order and obtained identical respons- to efficiently verify the integrity of the state whose hash es. The components of Salus (such as client, NameNode, is at the tree’s root. Specifically, a client accessing a and Master) that use active storage to make data persis- volume maintains a Merkle tree on the volume’s block- tent require all messages from a RRS to carry such a s, called volume tree, that is updated on every PUT and certificate: this guarantees no spurious changes to per- verified on every GET. sistent data as long as at least one region server and its For robustness, Salus keeps a copy of the volume tree corresponding datanode do not experience a commission stored distributedly across the region servers that host the failure. volume so that, after a crash, a client can rebuild its vol- Figure 3 shows how active storage refines the ume tree by contacting the region servers responsible for pipelined commit protocol for PUT requests. The PUT the regions in that volume. Replicating the volume tree at issued by a client is received by the primary region serv- the region servers also allows a client, if it so chooses, to er as part of a sub-batch ( 1 ). Upon receiving a PUT, only store a subset of its volume tree during normal oper- each replica validates it and forwards it down the chain ation, fetching on demand what it needs from the region of replicas ( 2 ). The region servers then agree on the lo- servers serving its volume. cation and order of the PUT in the append-only logs ( 3 ) Since a volume can span multiple region servers, for and create a PUT-log certificate that attests to that loca- scalability and load-balancing each region server only s- tion and order. 
Each region server sends the PUT and tores and validates a region tree for the regions that it the certificate to its corresponding datanode to guarantee hosts. The region tree is a sub-tree of the volume tree their persistence and waits for the datanode’s confirma- corresponding to the blocks in a given region. In addi- tion ( 4 ) before marking the request as prepared. Each tion, to enable the client to recover the volume tree, each region server then independently contacts the leader of region server also stores the latest known hash for the the batch to which the PUT belongs and, if it voted YES, root of the full volume tree, together with the sequence waits for the decision. On receiving COMMIT, the region number of the PUT request that produced it. servers mark the request as committed, update their in- Figure 4 shows a volume tree and its region trees. The memory state and generate a PUT_SUCCESS certificate client stores the top levels of the volume tree that are not ( 5 ); on receiving ABORT the region servers generate in- included in any region tree so that it can easily fetch the stead a PUT_FAILED certificate. In either case, the pri- desired region tree on demand. A client can also cache mary then forwards the certificate to the client ( 6 ). recently used region trees for faster access. Similar changes are also required to leverage active s- To process a GET request for a block, the client sends torage in flushing and compaction. Unlike PUTs, these the request to any of the region servers hosting that block. operations are initiated by the primary region server: the On receiving a response, the client verifies it using the lo- other region servers use predefined deterministic crite- cally stored volume tree. If the check fails (because of a ria, such as the current size of the memstore, to verify commission failure) or if the client times out (because of


USENIX Association 10th USENIX Symposium on Networked Systems Design and Implementation (NSDI ’13) 363 an omission failure), the client retries the GET using an- 1 do 2 foreach failed-region i other region server. If the GET fails at all region server- 3 remapRegion(i) s, the client contacts the Master triggering the recovery 4 end 5 foreach failed-region i protocol(§4.4). To process a PUT, the client updates its 6 region_logs[i] recoverRegionLog(region i) ← volume tree and sends the weakly-signed root hash of its 7 end 8 LCP identifyLCP(region_logs) ← updated volume tree along with the PUT request to the 9 while rebuildVolume(LCP) fails RRS. Attaching the root hash of the volume tree to each Fig. 5: Pseudocode for the recovery protocol. PUT request enables clients to ensure that, despite com- mission failures, they will be able to mount and access a commission failures, different datanodes within the same consistent volume. RRS may store different logs. A prepared PUT, for ex- A client’s protocol to mount a volume after losing the ample, may have been made persistent at one datanode, volume tree is simple. The client begins by fetching the but not at another. region trees, the root hashes, and the corresponding se- Identifying committable requests Because COM- quence numbers from the various RRSs. Before respond- MIT decisions are logged asynchronously, some PUTs ing to a client’s fetch request, a RRS commits any pre- for which a client received PUT_SUCCESS may not be pared PUTs pending to be committed using the commit- marked as committed in the logs. It is possible, for ex- recovery phase of the recovery protocol (§4.4). Using the ample, that a later PUT be logged as committed when sequence numbers received from all the RRSs, the client an earlier one is not; or that a suffix of PUTs for which identifies the most recent root hash and compares it with the client has received a PUT_SUCCESS be not logged as the root hash of the volume tree constructed by combin- committed. Worse, because of transient omission fail- ing the various region trees. If the two hashes match, then ures, some region may temporarily have no operational the client considers the mount to be complete; otherwise correct replica when the recovery protocol attempts to it reports an error indicating that a RRS is returning a collect logged PUTs. potentially stale tree. In such cases, the client reports One major challenge in addressing these issues is that, an error to the Master to trigger the replacement of the while PC is defined on a global volume log, Salus does servers in the corresponding RRS, as described in §4.4. not actually store any such log: instead, for efficiency, each region keeps its own separate region log. Hence, 4.4 Recovery after retrieving its region log, a recovering region server The recovery protocol ensures that, despite comission or needs to cooperate with other region servers to determine permanent omission failures in up to f pairs of corre- whether the recovered region log is correct and whether sponding region servers and datanodes, Salus continues the PUTs it stores can be committed. to provide the abstraction of a virtual disk with standard Figure 5 describes the protocol that Salus uses to re- disk semantics, except in the extreme failure scenario in cover faulty datanodes and region servers. 
The first two which the client crashes and one or more of the region phases describe the recovery of individual region logs, server/datanode pairs of a region experience a commis- while the last two phases describe how the RRSs coordi- sion failure and all other region server/datanode pairs of nate to identify commitable requests. that region are unavailable: in this case, Salus’ recovery 1. Remap (remapRegion). As in HBase, when a RRS protocol guarantees the weaker prefix semantics. crashes or is reported by the client as non-responsive, the To achieve this goal, Salus’ recovery protocol collect- Master swaps out the servers in that RRS and assigns its s the longest available prefix P of prepared PUT re- C regions to one or more replacement RRSs. quests that satisfy the ordered-commit property. Recall 2. Recover region log (recoverRegionLog). To re- from §4.1 that every PUT for which the client received a cover all prepared PUTs of a failed region, the new region PUT_SUCCESS must appear in the log of at least one cor- servers choose, among the instances (one for each op- rect replica in the region that processed that PUT. Hence, erational datanode) of that region’s old region logs, the if a correct replica is available for each of the volume’s longest available log that is valid. A log is valid if it regions, PC will contain all PUT requests for which the is a prefix of PUT requests issued to that region. 5 We client received a PUT_SUCCESS, thus guaranteeing stan- use the PUT-log certificate attached to each PUT record dard disk semantics. If however, because of a confluence to separate valid logs from invalid ones: each region of commission and transient omission failures, the only server independently replays the log and checks if each available replicas in a region are those who have suffered PUT record’s location and order matches the location and commission failures, then the PC that the recovery proto- order included in that PUT record’s PUT-log certificate. col collects may include only a prefix (albeit ordered- Having found a valid log, the servers in the RRS agree commit-compliant) of those PUT requests, resulting in on the longest prefix and advance to the next stage. the weaker prefix semantics. Specifically, recovery must address two key issues. 5Salus’ approach for truncating logs is similar to how HBase man- Resolving log discrepancies Because of omission or ages checkpoints and is discussed in an extended TR [44].
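A sketch of the validity check used when recovering a region log is shown below: each candidate log (one per operational datanode) is accepted only if every record matches its PUT-log certificate and the records form a gap-free prefix, and the RRS then keeps the longest valid candidate. The record layout is hypothetical; it only mirrors the description above.

import java.util.List;

// Sketch of the validity check in recoverRegionLog: a recovered region log is
// accepted only if every record sits at the location and position that its
// PUT-log certificate attests to, i.e. the log is a prefix of what the region
// was actually asked to write.
public class RegionLogRecoverySketch {

    public record LogRecord(long seqNum, long logOffset, long certSeqNum, long certLogOffset) {}

    // Among the valid candidate logs, the RRS keeps the longest one.
    public static List<LogRecord> chooseLongestValidLog(List<List<LogRecord>> candidates) {
        List<LogRecord> best = List.of();
        for (List<LogRecord> candidate : candidates) {
            if (isValidPrefix(candidate) && candidate.size() > best.size()) {
                best = candidate;
            }
        }
        return best;
    }

    private static boolean isValidPrefix(List<LogRecord> log) {
        long expectedSeq = -1;
        for (LogRecord r : log) {
            // The record must match what its own certificate attests...
            if (r.seqNum() != r.certSeqNum() || r.logOffset() != r.certLogOffset()) return false;
            // ...and sequence numbers must grow without gaps, so the log is a prefix.
            if (expectedSeq != -1 && r.seqNum() != expectedSeq + 1) return false;
            expectedSeq = r.seqNum();
        }
        return true;
    }
}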

8 364 10th USENIX Symposium on Networked Systems Design and Implementation (NSDI ’13) USENIX Association 3. Identify the longest commitable prefix (LCP) of the Salus ensures freshness, ordered-commit, and liveness when there are no more than 2 failures within any RRS §5.1 volume log (identifyLCP). If the client is available, and the corresponding datanodes. Salus can determine the LCP and the root of the corre- Salus achieves comparable or better single-client through- §5.2.1 sponding volume tree simply by asking the client. Oth- put compared to HBase with slightly increased latency. erwise, all RRSs must coordinate to identify the longest Salus’ active replication can reduce network usage by 55% and increase aggregate throughput by 74% for sequential §5.2.2 prefix of the volume log that contains either committed write workload compared to HBase. Salus can achieve or prepared PUTs (i.e. PUTs whose data has been made similar aggregate read throughput compared to HBase. Salus’ overhead over HBase does not grow with the scale persistent in at least one correct datanode). Since Salus §5.2.3 keeps no physical volume log, the RRSs use ZooKeeper of the system. as a means of coordination, as follows. The Master asks Fig. 6: Summary of main results. each RRS to report its maximum committed sequence number as well as its list of prepared sequence number- these features. We intend to use UpRight [16] to repli- s by writing the requested information to a known file cate NameNode, ZooKeeper, and Master. in ZooKeeper. Upon learning from Zookeeper that the Our evaluation tries to answer two basic questions. file is complete (i.e. all RRSs have responded), 6 each First, does Salus provide the expected guarantees despite RRS uses the file to identify the longest prefix of com- a wide range of failures? Second, given its stronger guar- mitted and prepared PUTs in the volume log. Finally, the antees, is Salus’ performance competitive with HBase? sequence number of the last PUT in the LCP and the at- Figure 6 summarizes the main results. tached Merkle tree root are written to ZooKeeper. 5.1 Robustness 4. Rebuild volume state (rebuildVolume). The goal In this section, we evaluate Salus’ robustness, which in- of this phase is to ensures that all PUTs in the LCP are cludes guaranteeing freshness for read operations and committed and available. The first half is simple: if a liveness and ordered-commit for all operations. PUT in the LCP is prepared, then the corresponding re- Salus is designed to ensure these properties as long as gion server marks it as committed. With respect to avail- there are no more than two failures in the region servers ability, Salus makes sure that all PUTs in the LCP are within an RRS and their corresponding datanodes, and available, in order to reconstruct the volume consistent- fewer than a third of the nodes in the implementation ly. To that end, the Master asks the RRSs to replay their of each of UpRight NameNode, UpRight ZooKeeper, log and rebuild their region trees; it then uses the same and UpRight Master nodes are incorrect; however, since checks used by the client in the mount protocol (§4.3) we have not yet integrated in Salus UpRight versions of to determine whether the current root of the volume tree NameNode, ZooKeeper, and Master, we only evaluate matches the one stored in ZooKeeper during Phase 3. Salus’ robustness when datanode or region server fails. As mentioned above, a confluence of commission and We test our implementation via fault injection. 
We in- transient omission failures could cause a RRS to recover troduce failures and then determine what happens when only a prefix of its region log. In that case, the above we attempt to access the storage. For reference, we checks could fail, since some PUTs within the LCP could compare Salus with HBase (which replicates stored data be ignored. If the checks fail, recovery starts again from 7 across datanodes but does not support pipelined commit, Phase 1. Note, however, that all the ignored PUTs must active storage, or end-to-end checks). have been prepared and so, as long as the number of per- In particular, we inject faults into clients to force them manent omission or commission failures does not exceed to crash and restart. We inject faults into datanodes to f , a correct datanode will eventually become available force them either to crash, temporarily or permanently, or and a consistent volume will be recovered. to corrupt block data. We cause data corruption in both log files and checkpoint files. We inject faults into region 5 Evaluation servers to force them to either 1) crash; 2) corrupt data in We have implemented Salus by modifying HBase [6] and memory; 3) write corrupted data to HDFS; 4) refuse to HDFS [39] to add pipelined commit, active storage, and process requests or forward requests out of order; or 5) end-to-end checks. Our current implementation lags be- ask the NameNode to delete files. Once again, we cause hind our design in two ways. First, our prototype sup- corruption in both log files and checkpoint files. Note ports unanimous consent between HBase and HDFS but that data on region servers is not protected by checksums. not between HBase and ZooKeeper. Second, while our Figure 7 summarizes our results. design calls for a BFT-replicated Master, NameNode, First, as expected, when a client crashes and restart- and ZooKeeper, our prototype does not yet incorporate s in HBase, a volume’s on-disk state can be left in an inconsistent state, because HBase does not guarantee or- 6 If some RRS are unavailable during this phase, recovery starts a- dered commit. HBase can avoid these inconsistencies gain from Phase 1, replacing the unavailable servers. 7For a more efficient implementation that leverages version vectors, by blocking all requests that follow a barrier request un- see [44]. til the barrier completes, but this can hurt performance

9 USENIX Association 10th USENIX Symposium on Networked Systems Design and Implementation (NSDI ’13) 365 HBase Salus Affected nodes Faults GET PUT GET PUT Client Crash and restart Fresh Not ordered Fresh Ordered 1 or 2 permanent crashes Fresh Ordered Fresh Ordered Corruption of 1 or 2 replicas of log or checkpoint Fresh Ordered Fresh Ordered DataNode 3 arbitrary failures Fresh* Lost Fresh* Lost 1 (for HBase) or 3 (for Salus) region server permanent Fresh Ordered Fresh Ordered crashes Region server+DataNode 1 (for HBase) or 2 (for Salus) region server arbitrary Corrupted Lost Fresh Ordered failures that potentially affect datanodes 3 (for Salus) region server arbitrary failures that po- --Fresh* Lost tentially affect datanodes Client+Region server+DataNode Client crashes and restarts, 1 (for HBase) or 2 (for Corrupted Lost Fresh Ordered Salus) region server arbitrary failures causing the cor- responding datanodes to not receive a suffix of data Fig. 7: Robustness towards failures affecting the region servers within an RRS, and their corresponding datanodes. (- = not applicable, * = corresponding operations may not be live). Note that a region server failure has the potential to cause the failure of the corresponding datanode. when barriers are frequent (see §5.2.4). Second, HBase’s sequential- and random-, read and write microbench- replicated datanodes tolerate crash and benign file cor- marks. We compare Salus’ single-client throughput ruptions that alter the data but don’t affect the checksum, and latency, aggregate throughput, and network usage which is stored separately. Thus, when considering only to those of HBase. We also include measured num- datanode failures, HBase provides the same guarantees bers from Amazon EBS to put Salus’ performance in as Salus. Third, HBase’s unreplicated region server is a perspective. single point of failure, vulnerable to commission failures Salus targets clusters of storage nodes with 10 or more that can violate freshness as well as ordered-commit. disks each. In such an environment, we expect a node’s In Salus, end-to-end checks ensure freshness for GET aggregate disk bandwidth to be much larger than its net- operations in all the scenarios covered in Figure 7: a cor- work bandwidth. Unfortunately, we have only three stor- rect client does not accept GET reply unless it can pass age nodes matching this description, the rest of our small the Merkle tree check. Second, pipelined commit en- nodes have a single disk and a single active 1Gbit/s net- sures the ordered-commit property in all scenarios in- work connection. volving one or two failures, whether of omission or of Most of our experiments run on a 15-node clus- commission: if a client fails or region servers reorder ter of small nodes equipped with a 4-core Intel X- requests, the out-of-order requests will not be accepted eon X3220 2.40GHz CPU, 3GB of memory, and one and eventually recovery will be triggered, causing these WD2502ABYS 250GB hard drive. In these experiments, requests to be discarded. Third, active storage protect- we use nine small nodes as region servers and datanodes, s liveness failure scenarios involving one or two region another small node as the Master, ZooKeeper, and Na- server/datanode pairs: if a client receives an unexpected meNode, and up to four small nodes acting as clients. In GET reply, it retries until it obtains the correct data. 
Fur- these experiments, we set the Java size to 2GB for thermore, during recovery, the recovering region servers the region server and 1GB for the datanode. find the correct log by using the certificates generated To understand system behavior when disk bandwidth by active storage protocol. As expected, ordered-commit is more plentiful than network bandwidth, we run several and liveness cannot be guaranteed if all replicas either experiments using the three storage nodes, each equipped permanently fail or experience commission failures. with an 16-core AMD Opteron 4282 @3.0GHz, 64G- B of memory, and 10 WDC WD1003FBYX 1TB hard 5.2 Performance drives. These storage nodes have 1Gbit/s networks, but Salus’ architecture can in principle result in both bene- the network topology constrains them to an aggre- fits and overhead when it comes to throughput and la- gate bandwidth of about 1.2Gbit/s. tency: on the one hand, pipelined commit allows multi- To measure the scalability of Salus with a large num- ple batches to be processed in parallel and active stor- ber of machines, we run several experiments on Amazon age reduces network bandwidth consumption. On the EC2 [2]. The detailed configuration is shown in §5.2.3. other hand, end-to-end checks introduce checksum com- For all experiments, we use a small 4KB block size, putations on both clients and servers; pipelined commit which we expect to magnify Salus’ overheads compared requires additional network messages for preparing and to larger block sizes. For read workloads, each client committing; and active storage requires additional com- formats the volume by writing all blocks and then forc- putation and messages for certificate generation and val- ing a flush and compaction before the start of the ex- idation. Compared to the cost of disk-accesses for data, periments. For write workloads, since compaction in- however, we expect these ovrheads to be modest. troduces significant overhead in both HBase and Salus This section quantifies these tradeoffs using and the compaction interval is tunable, we first run these

5.2.1 Single client throughput and latency

We first evaluate the single-client throughput and latency of Salus. Since a single client usually cannot saturate the system, we find that executing requests in a pipeline is beneficial to Salus' throughput. However, the additional overhead of checksum computation and message transfer increases Salus' latency.

We use the nine small nodes as servers and start a single client that issues sequential and random reads and writes to the system. For the throughput experiments, the client issues requests as fast as it can and performs batching to maximize throughput. In all experiments, we use a batch size of 250 requests, so each batch accesses about 1MB. For the latency experiments, the client issues a single request, waits for it to return, and then waits for 10ms before issuing the next request.

Fig. 8: Single client throughput on small nodes. HBase-N and Salus-N disable compactions. EBS's numbers are measured on different hardware and are included for reference.

Figure 8 shows the single-client throughput. For sequential reads, Salus outperforms HBase with a speedup of 2.5. The reasons are that Salus' three actively replicated region servers increase read parallelism and that reads are pipelined so that multiple batches are outstanding; the HBase client instead issues only one batch of requests at a time. For random reads, disk seeks are the bottleneck, and HBase and Salus have comparable performance.

For sequential and random writes, Salus is slower than HBase by 3.5% to 22.8%, the price of its stronger guarantees. For Salus, pipelined execution does not help write throughput as much as it helps sequential reads, since write operations need to be forwarded to all three nodes and, unlike reads, cannot be executed in parallel.

As a sanity check, Figure 8 also shows the performance we measured from a small compute instance accessing Amazon's EBS. Because the EBS hardware differs from our testbed hardware, we can only draw limited conclusions, but we note that the Salus prototype achieves a respectable fraction of EBS's sequential read and write bandwidth, modestly outperforms EBS's random read throughput (likely because it is utilizing more disk arms), and substantially outperforms EBS's random write throughput (likely because it transforms random writes into sequential ones).

Fig. 9: Single client latency (90th percentile, ms) on small nodes. HBase-S and Salus-S enable sync. EBS's numbers are measured on different hardware and are included for reference.

Figure 9 shows the 90th-percentile latency for random reads and writes. In both cases, Salus' latency is within two or three milliseconds of HBase's. This is reasonable considering Salus' additional work to perform Merkle tree calculation, certificate generation and validation, and network transfer.

One thing should be noted about the random write latency experiment: the HBase datanode does not call sync when performing disk writes, which is why its write latency is low. This may be a reasonable design decision when the probability of three simultaneous crashes is small [28]. In this experiment, we also show what happens when this call is added to both HBase and Salus: calling sync adds more than 10ms of latency to both. To be consistent, we do not call sync in the other throughput experiments.

Again, as a sanity check, we note that Salus (and HBase) are reasonably competitive with EBS (though we emphasize again that EBS's underlying hardware is not known to us, so not too much should be read into this experiment).

Overall, these results show that despite all the extra computation and message transfers needed to achieve stronger guarantees, Salus' single-client throughput and latency are comparable to those of HBase. This is because the additional processing Salus requires adds relatively little to the time required to complete disk operations. In an environment in which computational cycles are plentiful, trading processing for improved reliability, as Salus does, appears to be reasonable.
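A sketch of the two measurement loops just described may help: one mode drives throughput by keeping large batches in flight, the other measures latency with a single outstanding request and a 10ms think time. The BlockClient interface is a hypothetical stand-in introduced for illustration; it is not the actual Salus or HBase client API.

```java
import java.util.Random;

// Illustrative sketch of the single-client benchmark described above.
// BlockClient is a hypothetical stand-in for the real client library.
interface BlockClient {
    void put(long blockId, byte[] data);   // 4KB block write
    byte[] get(long blockId);              // 4KB block read
    void flushBatch();                     // push any buffered/batched requests
}

final class SingleClientBenchmark {
    static final int BLOCK_SIZE = 4 * 1024;   // 4KB blocks, as in the experiments
    static final int BATCH_SIZE = 250;        // 250 requests per batch, roughly 1MB

    // Throughput mode: issue requests as fast as possible, batching BATCH_SIZE at a time.
    static double measureWriteThroughputMBps(BlockClient client, long numBlocks) {
        byte[] payload = new byte[BLOCK_SIZE];
        long start = System.nanoTime();
        for (long b = 0; b < numBlocks; b++) {
            client.put(b, payload);
            if ((b + 1) % BATCH_SIZE == 0) client.flushBatch();
        }
        client.flushBatch();
        double seconds = (System.nanoTime() - start) / 1e9;
        return (numBlocks * BLOCK_SIZE) / (1024.0 * 1024.0) / seconds;
    }

    // Latency mode: one outstanding request at a time, then a 10ms pause.
    static long[] measureReadLatenciesMicros(BlockClient client, long volumeBlocks, int samples)
            throws InterruptedException {
        Random rnd = new Random(42);
        long[] micros = new long[samples];
        for (int i = 0; i < samples; i++) {
            long block = Math.floorMod(rnd.nextLong(), volumeBlocks);
            long start = System.nanoTime();
            client.get(block);
            micros[i] = (System.nanoTime() - start) / 1000;
            Thread.sleep(10);                // 10ms think time between requests
        }
        return micros;                       // e.g., sort and report the 90th percentile
    }
}
```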


5.2.2 Aggregate throughput/network bandwidth

We then evaluate the aggregate throughput and network usage of Salus. The servers are saturated in these experiments, so pipelined execution does not improve Salus' throughput at all. On the other hand, we find that active replication of region servers, introduced to improve robustness, can reduce network bandwidth consumption and significantly improve performance when the total disk bandwidth exceeds the aggregate network bandwidth.

Fig. 10: Aggregate throughput on small nodes. HBase-N and Salus-N disable compactions.

Figure 10 reports experiments on our small-server testbed, with nine nodes acting as combined region servers and datanodes; we increase the number of clients until the throughput stops increasing.

For sequential reads, both systems achieve about 110MB/s. Pipelining reads in Salus does not improve aggregate throughput, since HBase also has multiple clients to parallelize network and disk operations. For random reads, disk seek and rotation are the bottleneck, and both systems achieve only about 3MB/s.

The relative slowdown of Salus versus HBase for sequential and random writes ranges from 11.1% to 19.4%, and it is significantly lower when compaction is enabled, since compaction adds more disk operations to both HBase and Salus. Salus reduces network bandwidth at the expense of higher disk and CPU usage, but this trade-off does not help in this setting because disk and network bandwidth are comparable. Even so, we find this to be an acceptable price for the stronger guarantees provided by Salus.

Fig. 11: Aggregate sequential write throughput and network bandwidth usage with fewer server machines but more disks per machine. Throughput (MB/s): HBase 27, Salus 47. Network consumption (network bytes per byte written by the client): HBase 5.3, Salus 2.4.

Figure 11 shows what happens when we run the sequential write experiment using the three 10-disk storage nodes as servers. Here, the tables are turned and Salus outperforms HBase (47MB/s versus 27MB/s). Our profiling shows that in both experiments the bottleneck is the network topology, which constrains the aggregate bandwidth to 1.2Gbit/s.

Figure 11 also compares the network bandwidth usage of HBase and Salus under the sequential write workload. HBase sends more than five bytes for each byte written by the client (two network transfers each for logging and flushing, and fewer than two for compaction, since some blocks are overwritten). Salus uses only two bytes per byte written, to forward the request to the replicas; logging, flushing, and compaction are performed locally. The actual number is slightly higher than 2 because of Salus' additional metadata. Salus halves network bandwidth usage compared to HBase, which explains why its throughput is 74% higher than that of HBase when network bandwidth is limited.

Note that we do not measure the aggregate throughput of EBS, because we do not know its internal architecture and thus do not know how to saturate it.


5.2.3 Scalability

In this section we evaluate the degree to which the mechanisms that Salus uses to achieve its stronger robustness guarantees affect its scalability. Growing the size of the testbed used in our previous experiments by an order of magnitude, we run Salus and HBase on Amazon EC2 [2] with up to 108 servers. While short of our goal of showing conclusively that Salus can scale to thousands of servers, we believe these experiments offer valuable insights into the relevant trends.

For our testbed we use EC2's extra large instances, with datanodes and region servers configured to use 3GB of memory each. Preliminary tests run to measure the characteristics of our testbed show that each EC2 instance can reach a maximum network and disk bandwidth of about 100MB/s, meaning that network bandwidth is not a bottleneck; thus, we do not expect Salus to outperform HBase in this setting.

Given our limited resources, we focus our attention on measuring the throughput of sequential and random writes: we believe this is reasonable since the only additional overhead for reads is the end-to-end checks performed by the clients, which are easy to make scalable. We run each experiment with an equal number of clients and servers, and for each 11-minute experiment we report the throughput of the last 10 minutes.

Because we do not have full control over EC2's internal architecture, and because one user's virtual machines in EC2 may share resources such as disks and networks with other users, these experiments have limitations: the performance of EC2's instances fluctuates noticeably, and it becomes hard to even determine what the stable throughput for a given experimental configuration is. Further, while in most cases, as expected, we find that HBase performs better than Salus, some experiments show Salus with a higher throughput than HBase, possibly because the network is being heavily used and pipelined commit helps Salus handle high network latencies more efficiently; to be conservative, we report only results for which HBase performs better than Salus.

Fig. 12: Write throughput per server with 9 servers and 108 servers (compaction disabled).

Figure 12 shows the per-server throughput of the sequential and random write workloads in configurations with 9 and 108 servers. For the sequential write workload, the throughput per server remains almost unchanged in both HBase and Salus as we move from 9 to 108 servers, meaning that for this workload both systems scale well up to 108 servers. For the random write workload, however, both HBase and Salus experience a significant drop in throughput per server as the number of servers grows. The culprit is the high number of small I/O operations that this workload requires. As the number of servers increases, the number of requests randomly assigned to each server in a sub-batch decreases, even as increasing the number of clients causes each server to process more sub-batches. The net result is that as the number of servers increases, each server performs an ever larger number of ever smaller I/O operations, which of course hurts performance. Note, however, that the extent of Salus' slowdown with respect to HBase is virtually the same (28%) in both the 9-server and the 108-server experiments, suggesting that Salus' overhead does not grow with the scale of the system.

5.2.4 Pipelined commit

Fig. 13: Single client sequential write throughput as the frequency of barriers varies.

Salus achieves increased parallelism by pipelining PUTs across barrier operations: Salus' PUTs always commit in the order they are issued, so the barriers' constraints are satisfied without stalling the pipeline. Figure 13 compares HBase and Salus while varying the number of operations between barriers. Salus' throughput remains constant at 18MB/s, as it is not affected by barriers, whereas HBase's throughput suffers with increasing barrier frequency: HBase achieves 3MB/s with a batch size of 1 and 14MB/s with a batch size of 32.

6 Related work

Scalable and consistent storage. Many existing systems provide the abstraction of scalable distributed storage [6, 11, 14, 27] with strong consistency. Unfortunately, these systems do not tolerate arbitrary node failures. While these systems use checksums to safeguard data written on disk, a memory corruption or a software glitch can lead to the loss of data in these systems (§5.1). In contrast, Salus is designed to be robust (safe and live) even if nodes fail arbitrarily.

Protections in local storage systems. Disks and storage subsystems can fail in various ways [7, 8, 18, 23, 34, 37], and so can memories and CPUs [33, 38], with disastrous consequences [35]. Unfortunately, end-to-end protection mechanisms developed for local storage systems [35, 41] are inadequate for protecting the full path from a PUT to a GET in complex systems like HBase.

End-to-end checks. ZFS [41] incorporates an on-disk Merkle tree to protect the file system from disk corruptions. SFSRO [19], SUNDR [26], Depot [31], and Iris [40] also use end-to-end checks to guard against faulty servers. However, none of these systems is designed to scale to thousands of machines because, to support multiple clients sharing a volume, they depend on a single server to update the Merkle tree. Salus, instead, is designed for a single client per volume, so it can rely on the client to update the Merkle tree and keep the server side scalable. We do not claim this as a major novelty of Salus; we see it as an example of how different goals lead to different designs.

BFT systems. While some distributed systems tolerate arbitrary faults (Depot [31], SPORC [17], SUNDR [26], BFT replicated state machines [13, 16]), they require a correct node to observe all writes to a given volume, preventing a volume from scaling with the number of nodes.

Supporting multiple writers. We are not aware of any system that can support multiple writers while achieving ordered-commit, scalability, and end-to-end verification for read requests. One could tune Salus to support multiple writers either by using a single server to serialize requests to a volume, as in SUNDR [26], which of course hurts scalability, or by using weaker consistency models like fork-join-causal [31] or fork* [17].

7 Conclusions

Salus is a distributed block store that offers an unprecedented combination of scalability and robustness. Surprisingly, Salus' robustness does not come at the cost of performance: pipelined commit allows updates to proceed at high speed while ensuring that the system's committed state is consistent; end-to-end checks allow reading safely from a single replica; and active replication not only eliminates reliability bottlenecks but also eases performance bottlenecks.

Acknowledgements

We thank our shepherd Arvind Krishnamurthy and the anonymous reviewers for their insightful comments. This work was supported in part by NSF grants CiC-FRCC-1048269 and CSR-0905625.


References

[1] Amazon Elastic Block Store (EBS). http://aws.amazon.com/ebs/.
[2] Amazon Elastic Compute Cloud (EC2). http://aws.amazon.com/ec2/.
[3] D. Anderson, J. Chase, and A. Vahdat. Interposed Request Routing for Scalable Network Storage. In OSDI, 2000.
[4] T. Anderson, M. Dahlin, J. Neefe, D. Patterson, D. Roselli, and R. Wang. Serverless Network File Systems. ACM Transactions on Computer Systems, 14(1):41-79, Feb. 1996.
[5] Apache Hadoop FileSystem and its Usage in Facebook. http://cloud.berkeley.edu/data/hdfs.pdf.
[6] Apache HBASE. http://hbase.apache.org/.
[7] L. N. Bairavasundaram, A. C. Arpaci-Dusseau, R. H. Arpaci-Dusseau, G. R. Goodson, and B. Schroeder. An analysis of data corruption in the storage stack. TOS, 2008.
[8] L. N. Bairavasundaram, G. R. Goodson, S. Pasupathy, and J. Schindler. An analysis of latent sector errors in disk drives. In SIGMETRICS, 2007.
[9] W. Bolosky, D. Bradshaw, R. Haagens, N. Kusters, and P. Li. Paxos Replicated State Machines as the Basis of a High-Performance Data Store. In NSDI, 2011.
[10] M. Burrows. The Chubby lock service for loosely-coupled distributed systems. In OSDI, 2006.
[11] B. Calder, J. Wang, A. Ogus, N. Nilakantan, A. Skjolsvold, S. McKelvie, Y. Xu, S. Srivastav, J. Wu, H. Simitci, J. Haridas, C. Uddaraju, H. Khatri, A. Edwards, V. Bedekar, S. Mainali, R. Abbasi, A. Agarwal, M. F. u. Haq, M. I. u. Haq, D. Bhardwaj, S. Dayanand, A. Adusumilli, M. McNett, S. Sankaran, K. Manivannan, and L. Rigas. Windows Azure Storage: a highly available cloud storage service with strong consistency. In SOSP, 2011.
[12] M. Castro and B. Liskov. Proactive Recovery in a Byzantine-Fault-Tolerant System. In OSDI, 2000.
[13] M. Castro and B. Liskov. Practical Byzantine fault tolerance and proactive recovery. TOCS, 2002.
[14] F. Chang, J. Dean, S. Ghemawat, W. Hsieh, D. Wallach, M. Burrows, T. Chandra, A. Fikes, and R. Gruber. Bigtable: A distributed storage system for structured data. In OSDI, 2006.
[15] V. Chidambaram, T. Sharma, A. C. Arpaci-Dusseau, and R. H. Arpaci-Dusseau. Consistency Without Ordering. In FAST, 2012.
[16] A. Clement, M. Kapritsos, S. Lee, Y. Wang, L. Alvisi, M. Dahlin, and T. Riché. UpRight Cluster Services. In SOSP, 2009.
[17] A. J. Feldman, W. P. Zeller, M. J. Freedman, and E. W. Felten. SPORC: Group Collaboration Using Untrusted Cloud Resources. In OSDI, 2010.
[18] D. Ford, F. Labelle, F. I. Popovici, M. Stokely, V.-A. Truong, L. Barroso, C. Grimes, and S. Quinlan. Availability in Globally Distributed Storage Systems. In OSDI, 2010.
[19] K. Fu, M. F. Kaashoek, and D. Mazieres. Fast and secure distributed read-only file system. TOCS, 2002.
[20] S. Ghemawat, H. Gobioff, and S.-T. Leung. The Google file system. In SOSP, 2003.
[21] J. Gray. Notes on Database Systems. IBM Research Report RJ2188, Feb. 1978.
[22] P. Hunt, M. Konar, F. P. Junqueira, and B. Reed. ZooKeeper: wait-free coordination for internet-scale systems. In USENIX ATC, 2010.
[23] W. Jiang, C. Hu, Y. Zhou, and A. Kanevsky. Are disks the dominant contributor for storage failures? A comprehensive study of storage subsystem failure characteristics. TOS, 2008.
[24] B. Lampson and H. Sturgis. Crash Recovery in a Distributed System. Xerox PARC Research Report, 1976.
[25] E. Lee and C. A. Thekkath. Petal: Distributed virtual disks. In ASPLOS, 1996.
[26] J. Li, M. Krohn, D. Mazières, and D. Shasha. Secure Untrusted Data Repository (SUNDR). In OSDI, 2004.
[27] H. Lim, B. Fan, D. G. Andersen, and M. Kaminsky. SILT: A memory-efficient, high-performance key-value store. In SOSP, 2011.
[28] B. Liskov, S. Ghemawat, R. Gruber, P. Johnson, and L. Shrira. Replication in the Harp File System. In SOSP, 1991.
[29] W. Lloyd, M. J. Freedman, M. Kaminsky, and D. G. Andersen. Don't settle for eventual: scalable causal consistency for wide-area storage with COPS. In SOSP, 2011.
[30] J. MacCormick, N. Murphy, M. Najork, C. A. Thekkath, and L. Zhou. Boxwood: Abstractions as the foundation for storage infrastructure. In OSDI, 2004.
[31] P. Mahajan, S. Setty, S. Lee, A. Clement, L. Alvisi, M. Dahlin, and M. Walfish. Depot: cloud storage with minimal trust. In OSDI, 2010.
[32] R. Merkle. Protocols for public key cryptosystems. In Symposium on Security and Privacy, 1980.
[33] E. B. Nightingale, J. R. Douceur, and V. Orgovan. Cycles, cells and platters: an empirical analysis of hardware failures on a million consumer PCs. In EuroSys, 2011.
[34] E. Pinheiro, W.-D. Weber, and L. A. Barroso. Failure trends in a large disk drive population. In FAST, 2007.
[35] V. Prabhakaran, L. Bairavasundaram, N. Agrawal, H. Gunawi, A. Arpaci-Dusseau, and R. Arpaci-Dusseau. IRON file systems. In SOSP, 2005.
[36] R. L. Rivest, A. Shamir, and L. M. Adleman. A Method for Obtaining Digital Signatures and Public-Key Cryptosystems (Reprint). CACM, 1983.
[37] B. Schroeder and G. A. Gibson. Disk failures in the real world: What does an MTTF of 1,000,000 hours mean to you? In FAST, 2007.
[38] B. Schroeder, E. Pinheiro, and W.-D. Weber. DRAM errors in the wild: a large-scale field study. In SIGMETRICS, 2009.
[39] K. Shvachko, H. Kuang, S. Radia, and R. Chansler. The Hadoop Distributed File System. In MSST, 2010.
[40] E. Stefanov, M. van Dijk, A. Oprea, and A. Juels. Iris: A Scalable Cloud File System with Efficient Integrity Checks. In ACSAC, 2012.
[41] Sun Microsystems, Inc. ZFS on-disk specification. Technical report, Sun Microsystems, 2006.
[42] HDFS Usage in Yahoo! http://www.aosabook.org/en/hdfs.html.
[43] C. A. Thekkath, T. Mann, and E. K. Lee. Frangipani: A Scalable Distributed File System. In SOSP, 1997.
[44] Y. Wang, M. Kapritsos, Z. Ren, P. Mahajan, J. Kirubanandam, L. Alvisi, and M. Dahlin. Robustness in the Salus scalable block store. Technical Report TR-12-24, The University of Texas at Austin, Department of Computer Science, 2012.
