ForkBase: An Efficient Storage Engine for Blockchain and Forkable Applications

∗ Sheng Wang #, Tien Tuan Anh Dinh #, Qian Lin #, Zhongle Xie #, Meihui Zhang † , Qingchao Cai #, Gang Chen §, Beng Chin Ooi #, Pingcheng Ruan # #National University of Singapore, †Beijing Institute of Technology, §Zhejiang University #{wangsh,dinhtta,linqian,zhongle,caiqc,ooibc,ruanpc}@comp.nus.edu.sg, †meihui [email protected], §[email protected]

ABSTRACT

Existing data storage systems offer a wide range of functionalities to accommodate an equally diverse range of applications. However, new classes of applications have emerged, e.g., blockchain and collaborative analytics, featuring data versioning, fork semantics, tamper-evidence or any combination thereof. They present new opportunities for storage systems to efficiently support such applications by embedding the above requirements into the storage. In this paper, we present ForkBase, a storage engine designed for blockchain and forkable applications. By integrating core application properties into the storage, ForkBase not only delivers high performance but also reduces development effort. The storage manages multiversion data and supports two variants of fork semantics which enable different fork workflows. ForkBase is fast and space efficient, due to a novel index class that supports efficient queries as well as effective detection of duplicate content across data objects, branches and versions. We demonstrate ForkBase's performance using three applications: a blockchain platform, a wiki engine and a collaborative analytics application. We conduct extensive experimental evaluation against respective state-of-the-art solutions. The results show that ForkBase achieves superior performance while significantly lowering the development effort.

PVLDB Reference Format:
Sheng Wang, Tien Tuan Anh Dinh, Qian Lin, Zhongle Xie, Meihui Zhang, Qingchao Cai, Gang Chen, Beng Chin Ooi, Pingcheng Ruan. ForkBase: An Efficient Storage Engine for Blockchain and Forkable Applications. PVLDB, 11(10): 1137-1150, 2018.
DOI: https://doi.org/10.14778/3231751.3231762

∗Corresponding author.

1. INTRODUCTION

Developing a new application today is made easier by the availability of many storage systems that offer different data models and operation semantics. At one extreme, key-value stores [22, 38, 8, 37] provide a simple data model and semantics, but are highly scalable. At the other extreme, relational databases [59] support more complex, relational models and strong semantics, i.e. ACID, which render them less scalable. In between are systems that make other trade-offs between data model, semantics and performance [16, 20, 15, 7]. Despite these many choices, we observe that a gap has emerged between modern applications' requirements and what existing storage systems have to offer.

Many classes of modern applications demand properties (or features) that are not a natural fit for existing storage systems. First, blockchain systems, such as Bitcoin [47], Ethereum [2] and Hyperledger [5], implement a distributed ledger abstraction — a globally consistent history of changes made to some global states. Because blockchain systems operate in an untrusted environment, they require the ledger to be tamper evident, i.e. the states and their histories cannot be changed without being detected. Second, collaborative applications, ranging from traditional platforms like Dropbox [26], GoogleDocs [4] and Github [3] to more recent and advanced analytics platforms like Datahub [43], allow many users to work together on shared data. Such applications need explicit data versioning to track data derivation history, and fork semantics to let users work on independent data copies. Third, recent systems that favor availability over consistency allow concurrent write access to the data, which results in implicit forks that must be eventually resolved by upper-layer applications [22, 21].

Without proper storage support, applications have to implement the above properties on top of a generic storage such as a key-value store or a file system. One problem with this approach is the additional development cost. Another problem is that the resulting implementations may fail to generalize to other applications. More importantly, the bespoke implementation may incur unnecessary performance overhead. For example, current blockchain platforms (e.g., Ethereum, Hyperledger) build their tamper-evident data structures on top of a key-value store (e.g., LevelDB [6] and RocksDB [9]). However, we observe that these ad-hoc implementations do not always scale well, and the current blockchain data structures are not optimized for analytical queries. As another example, collaborative applications over large relational datasets use file-based version control systems such as git, but these do not scale to big datasets and only offer limited query functionality.

Clearly, there are benefits in unifying these properties and pushing them down into the storage layer. One direct benefit is that it reduces development effort for applications requiring any combination of these features. Another benefit is that it helps applications generalize better, gaining additional features at no extra effort. Finally, the storage engine can exploit optimizations that are otherwise hard to achieve at the application layer.

1137 In this paper we present ForkBase, a novel and efficient stor- In the following, we first discuss relevant background and moti- age engine that meets the high demand in modern applications for vation in Section 2. We introduce the design in Section 3, followed versioning, forking and tamper evidence1. One challenge in build- by the interfaces and implementation in Section 4 and 5 respec- ing ForkBase is to keep the storage overhead small when main- tively. We describe the modeling and evaluation of three applica- taining a large number of data versions. Another challenge is to tions in Section 6 and 7. We discuss related work in Section 8 provide elegant and flexible semantics to many classes of applica- before concluding in Section 9. tions. ForkBase overcomes these challenges in two novel ways. First, it defines a new class of index called Structurally-Invariant 2. BACKGROUND AND MOTIVATIONS Reusable Indexes (SIRI), which facilitates both fast lookups and In this section, we discuss several common properties underpin- effective identification of duplicate content. The latter helps drasti- ning many modern applications. We motivate the design of Fork- cally lower the storage overhead for multiversion data. ForkBase Base by highlighting the gap between what the application requires implements an instance of SIRI called POS-Tree, that combines of these properties and what existing solutions offer. ideas from content-based slicing [45], Merkle tree [44] and B+- tree [19]. POS-Tree directly offers tamper evidence, making Fork- 2.1 Deduplication for Multiversion Data Base a natural choice for applications in untrusted environments. Data versioning is an important concept in applications that keep Second, ForkBase provides a generic fork semantics that affords track of data changes. Each update to the data creates a new ver- the applications the flexibility of having both implicit and explicit sion, and the version history can be either linear or non-linear (i.e. forks. Fork operations are efficient, thanks to POS-Tree’s use of consisting of forks and branches). Systems that support linear ver- copy-on-write that eliminates unnecessary copies. sion histories include multiversion file systems [55, 60, 58] and ForkBase exposes simple APIs and an extended key-value data temporal databases [11, 54, 61]. Systems that have non-linear his- model. It provides built-in data types that help reduce development tories include version control such as git, svn and mercu- effort and enable multiple trade-offs between query efficiency and rial for files, and collaborative dataset management such as Deci- storage overhead. ForkBase also scales well to many nodes, as bel [43] and OrpheusDB [31] for relational tables. Blockchains can it employs a two-layer partitioning scheme which helps distribute be also seen as versioning systems in which each block represents skewed workloads evenly across nodes. a version of the global states. To demonstrate the values of our design, we build three represen- One major challenge in supporting data versioning is to reduce tative applications on top of ForkBase, namely a blockchain plat- storage overhead. The most common approach used in dataset form, a wiki service, and a collaborative analytics application. We versioning systems, e.g. Decibel and OrpheusDB, is record-level observe that only hundreds of lines of code are required to port ma- delta encoding. In this approach, the new version stores only the jor components of these applications onto our system. 
The appli- records that are modified from the previous version. As a result, it cations benefit much from the features offered by the engine, e.g., is highly effective when the differences between consecutive ver- fork semantics for collaborations and tamper evidence for block- sions are small, but requires combining multiple deltas to recon- chain. Moreover, as richer semantics are captured in the storage struct the version’s content. OrpheusDB optimizes for reconstruc- layer, it is feasible to provide efficient query processing. In par- tions by flattening the deltas into a full list of record references. ticular, ForkBase enables fast provenance tracking for blockchain However, this approach does not scale well for large tables. Delta without scanning the whole chain, rendering it analytics-ready. encoding suffers from the problem that it creates a new copy when- In summary, we make the following contributions: ever a version is modified, even if the new content is identical to that • We identify common properties in modern applications, i.e., of some older versions or of versions in a different branch. In other versioning, forking and tamper evidence. We examine the words, delta encoding is not effective for non-consecutive versions, benefits of a storage that integrates all these properties. divergent branches or datasets. For instance, in multi-user applica- tions like Datahub [13], duplicates across branches and datasets are • We introduce a novel index class called SIRI that effectively common. Even in the extreme case that users upload same data as removes duplicates across multiversion data. We present POS- a new dataset, delta encoding offers no benefits. Another exam- Tree, an instance of SIRI that additionally offers tamper evi- ple is application implementations that have no explicit versioning, dence. We propose a generic fork semantics that captures the such as WordPress and Wikipedia [63], which store versions as in- workflows of many different applications. dependent records. In these cases, the storage sees no relationship between the two versions and cannot exploit delta encoding to elim- • We implement ForkBase that efficiently supports blockchains inate duplicate content. and forkable applications. ForkBase derives its efficiency Another approach for detecting and removing duplicates is chunk- from the POS-Tree and flexibility from the generic fork se- based deduplication. Unlike delta encoding, this approach works mantics. With elegant interfaces and rich data types, Fork- across independent objects. It is widely used in file systems [52, Base offers a powerful building block for high-level systems 62], and is a core feature of git. In this approach, files are di- and applications. vided into units called chunk, each of which is given a unique identifier (e.g., by applying a collision-resistant hash function over • ForkBase We demonstrate the usability of by implementing the chunk content) to detect identical chunks. Chunk-based dedu- three representative applications, namely a blockchain plat- plication is highly effective in removing duplicates for large files form, a wiki service and a collaborative analytics applica- that are rarely modified. When an update leads to change existing tion. We show via extensive experimental evaluation that chunks, content-based slicing [45] can be used to avoid expensive ForkBase helps these applications outperform the respective re-chunking, i.e. the boundary-shifting problem [29]. 
state-of-the-arts in terms of coding complexity, storage over- In this work, we adopt chunk-based deduplication for structured head and query efficiency. datasets, such as relational tables. Instead of deduplicating on indi- 1ForkBase is the second version of UStore [24], which has evolved vidual records as in delta encoding, we apply deduplication on the significantly from the initial design and implementation. level of data pages in primary indexes. One direct benefit is that, to

1138 4 4 in form of conflicting writes. TARDiS [21] proposes a branch- and-merge semantics for weakly consistent applications. A branch 1 2 3 4 7 8 1 2 3 4 7 8 containing the entire state is created whenever there is a conflict- +5 +9 ing write. By isolating changes in different branches, the applica- 4 7 4 8 tion logic is greatly simplified, especially since locks and rollbacks 1 2 3 4 5 7 8 1 2 3 4 7 8 9 are eliminated from the critical path. In cryptocurrency applica-

+9 +5 tions such as Bitcoin [47] and Ethereum [2], forks arise implicitly 4 7 4 8 when multiple blocks are appended simultaneously to the ledger. The forks are then resolved by taking the longest chain or by more 1 2 3 4 5 7 8 1 2 3 4 7 8 9 9 5 complex mechanisms such as GHOST [57]. Figure 1: Two B+-trees containing same entries could have dif- ForkBase is the first storage engine with native support for both ferent internal structures. on-demand and on-conflict fork semantics (§3.3). The application decides when and how branches are created and merged, while stor- age optimizes branch related operations. By providing generic fork access a version, we no longer need to reconstruct it from records, semantics, ForkBase helps simplify application logic and lower de- and can directly access index and data pages for fast lookups and velopment cost while preserving high performance. scans. However, deduplication could be less effective when apply- ing to data pages in indexes, e.g. B+-tree [19]. The main reason is 2.3 Blockchains that the content of a data page is not only determined by the items Security conscious applications demand data integrity against stored in the index, but also by the update history of the index. malicious modifications, not only from external attackers but also Figure 1 shows an example in which two B+-trees contain iden- from malicious insiders. Examples include outsourced services like tical sets of items, but have different data and index pages. As a storage [34] or file system [41], which is able to detect data tam- consequence, even when the indexes contain the same data items, pering. Blockchain systems [47, 37, 2] rely on tamper evidence the probability of having identical pages remains small. Moreover, to guarantee that the ledger is immutable. A common approach to their structural differences make it more complex to compare the achieve tamper evidence is to use cryptographic hashing [36] and versions for diff and merge operations. Merkle tree [44]. Indeed, current blockchain systems contain be- In ForkBase, we address the above issues by defining a new in- spoke implementations of Merkle-tree-like data structures on top of dex class called Structurally-Invariant Reusable Indexes or SIRI a simple key-value storage such as LevelDB [6] or RocksDB [9]. (§3.1), whose properties enable effective deduplication. We then However, such implementations are not designed to be general and design an index structure POS-Tree (§3.2) belonging to this index therefore difficult to be reused or ported to different blockchains. class. This structure not only detects duplicates across independent As blockchain systems are gaining traction, there is an increas- objects2, but also provides efficient manipulation operations, such ing demand for performing analytics on blockchain data [56, 1, as lookup, update, diff and merge. 35]. However, current blockchain storage engines are not designed for such tasks. More specifically, blockchain data is serialized and 2.2 Ubiquitous Forking stored as uninterpretable bytes in the storage, therefore it is im- The concept of forks and branches captures the non-linearity of possible for the storage engine to support efficient analytics. Con- version histories. 
It can be found in a broad range of applications, sequently, to perform blockchain analytics directly on the storage, from collaborative applications such as git and Datahub, highly the only option is to understand the data model and serialization available replicated systems such as Dynamo [22] and TARDiS [21], scheme, and to reconstruct the data structures by scanning the en- to blockchain systems such as Ethereum [2]. Two core operations tire storage. in these applications are fork and merge. The former creates a new ForkBase facilitates the development of blockchain systems and logical copy of the data, called branch, which can be manipulated applications by providing multiversion, tamper evident data types independently such that modifications on one branch are isolated and fork semantics. In fact, all data types in ForkBase are tam- from other branches. The latter integrates content from different per evident. ForkBase helps reduce development effort, because its branches and resolves potential conflicts. data types make it easy to build complex blockchain data models Current applications have two distinct types of fork operations, while abstracting away integrity issues (§6.1). More importantly, namely on demand (or explicit) and on conflict (or implicit). On- ForkBase makes the blockchain analytics-ready, as rich structural demand forks are used in applications with explicit need for iso- information is captured at the storage. lated or private branches. One example is software version control systems, e.g. git, which allow forking a branch for development 3. DESIGN and only merging changes to the main code base (the trunk) after In this section, we present the design of ForkBase. We start by they are well tested. Another example is collaborative analytics defining a new index class SIRI that facilitates deduplication. We applications, e.g. Datahub, which allow branching off from a rela- then discuss a specific instance of SIRI called POS-Tree that addi- tional dataset to perform data transformation tasks such as cleans- tionally offers tamper evidence. Finally, we describe the model for ing, correction and integration. generic fork semantics. On-conflict forks are used in applications that automatically cre- ate branches on conflicting updates. Examples include distributed 3.1 SIRI Indexes applications that trade consistency for better availability, latency, Existing primary indexes in databases focus on improving read partition tolerance and scalability (or ALPS [42]). In particular, and write performance. They do not consider data page sharing, Dynamo [22] and Ficus [30] expose on-conflict forks to the users which makes page-level deduplication ineffective as shown in Fig- ure 1. We propose a new class of indexes, called Structurally- 2Chunk-based deduplication is less effective than delta encoding Invariant Reusable Indexes (SIRI), which facilitates page sharing when the deltas are of much smaller size than the chunks. among different index instances.

Let I be an index structure. An instance I of I stores a set of records rec(I) = {r1, r2, ..., rn}. The internal structure of I consists of a collection of pages (i.e. index and data pages) page(I) = {p1, p2, ..., pm}. Two pages are equal if they have identical content and hence can be shared (i.e. deduplicated). I is called an instance of SIRI if it has the following properties:

1. Structurally Invariant. For any instances I1, I2 of I:

   rec(I1) = rec(I2) ⇐⇒ page(I1) = page(I2)

2. Recursively Identical. For any instances I1, I2 of I such that rec(I2) = rec(I1) + r for any record r ∉ I1:

   |page(I2) − page(I1)| ≪ |page(I1) ∩ page(I2)|

3. Universally Reusable. For any instance I1 of I and page p ∈ page(I1), there exists another instance I2 such that:

   (|page(I2)| > |page(I1)|) ∧ (p ∈ page(I2))

The first property means that the internal structure of an index instance is uniquely determined by its set of records. By avoiding the structural variance caused by the order of modifications, all pages between two logically identical index instances can be pairwise shared. The second property means that an index instance can be represented recursively by smaller instances with little overhead, while the third property ensures that a page can be reused by many index instances. By avoiding the structural variance caused by index cardinalities, a large index instance can reuse pages from smaller instances. As a result, instances with overlapping content can share a large portion of their sub-structures.

B+-trees and many other balanced search trees do not have the first property, since their structures depend on the update sequence. Similarly, indexes that require periodical reconstruction, such as hash tables and LSM-trees [50], do not have the second property. Most hash tables do not have the third property. For example, a bucket page in a small table is unlikely to be reused in a large table, because the records will be placed in multiple smaller buckets. There are existing index structures that meet all three properties, such as radix trees and tries. However, they are unbalanced and therefore cannot bound operation costs.

3.2 Pattern-Oriented-Split Tree

We propose an instance of SIRI indexes called Pattern-Oriented-Split Tree (POS-Tree). Besides the SIRI properties above, it has three additional properties: it is a probabilistically balanced search tree; it is efficient to find differences and to merge two instances; and it is tamper evident. This structure is inspired by content-based slicing [45], and resembles a combination of a B+-tree and a Merkle tree [44]. In POS-Tree, the node (i.e. page) boundary is defined by patterns detected from the contained entries, which avoids structural differences. Specifically, to construct a node, we scan the target entries until a pre-defined pattern occurs, and then create a new node to hold the scanned entries. Because of the distinct characteristics of leaf nodes and internal nodes, we define different patterns for them.

Figure 2: Pattern-Oriented-Splitting Tree (POS-Tree) resembling a B+-tree and Merkle tree.

3.2.1 Tree Structure

Figure 2 illustrates the structure of a POS-Tree. Each node in the tree is stored as a page, which is the unit for deduplication. The node is terminated with a detected pattern, unless it is the last node of a certain level. Similar to a B+-tree, an index node contains one entry for each child node. Each entry consists of a child node's identifier and the corresponding split key. To look up a specific key, we adopt the same strategy as in the B+-tree, i.e., following the path guided by the split keys. POS-Tree is also a Merkle tree in the sense that the child node's identifier is the cryptographic hash value of the child (e.g., derived from the SHA-1 hash function) instead of a memory or file pointer. The mapping from node identifiers to storage pointers is maintained externally.

3.2.2 Leaf Node Split

In order to avoid structural variance for leaf nodes, we define patterns similar to content-based slicing [45] used in file deduplication systems. These patterns help split the nodes into smaller ones of similar sizes. Given a k-byte sequence (b1, ..., bk), let P be a function taking k bytes as input and returning a pseudo-random integer of at least q bits. The pattern occurs if and only if:

   P(b1...bk) MOD 2^q = 0

In other words, the pattern occurs when the function P returns 0 for the q least significant bits. This pattern can be implemented via rolling hashes (e.g. Rabin-Karp, cyclic polynomial and moving sum) which support continuous computation over sequence windows and offer satisfactory randomness. In particular, we use the cyclic polynomial [18] hash, which is of the form:

   P(b1...bk) = s^(k−1)(h(b1)) ⊕ s^(k−2)(h(b2)) ⊕ ... ⊕ s^0(h(bk))

where ⊕ is the exclusive-or operator, and h maps a byte to an integer in [0, 2^q). s is a function that shifts its input by 1 bit to the left, and then pushes the q-th bit back to the lowest position. This function can be computed continuously for a sliding window:

   P(b1...bk) = s(P(b0...bk−1)) ⊕ s^k(h(b0)) ⊕ s^0(h(bk))

Each time, we remove the oldest byte and add the latest one.

Initially, the entire list of data entries is treated as a byte sequence, and the pattern detection process scans it from the beginning. When a pattern occurs, a leaf node is created from the recently scanned bytes. If a pattern occurs in the middle of an entry, the page boundary is extended to cover the whole entry, so that no entries are stored across multiple pages. In this way, each leaf node (except for the last node) ends with a pattern, as shown in Figure 2.

3.2.3 Index Node Split

The rolling hash used for splitting leaf nodes has good randomness, which keeps the structure balanced against skewed application data. However, we observe that it is costly: it accounts for 20% of the cost of building POS-Trees. Thus, for index nodes, we apply a simpler function Q that exploits the intrinsic randomness of the cryptographic hashes used as child node ids. In particular, for a list of index entries, Q examines each child node id (i.e. a byte sequence) until a pattern occurs:

   id MOD 2^r = 0

When a pattern is detected, all scanned index entries are stored in a new index node.
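To make the two pattern functions concrete, the following is a minimal C++ sketch of a leaf-level rolling-hash detector (pattern P) and the index-level check (pattern Q). The class and function names, the window size k, the parameters q and r (assumed to be smaller than 64), and the byte-mixing function h() are illustrative assumptions rather than ForkBase's actual code.

#include <cstdint>
#include <cstddef>
#include <deque>
#include <vector>

// Sketch of the pattern functions P and Q described above.
class LeafPatternDetector {
 public:
  LeafPatternDetector(size_t k, unsigned q) : k_(k), q_(q) {}

  // Feed one byte of the entry stream; returns true when the q lowest bits of
  // the cyclic-polynomial rolling hash are zero, i.e. a leaf boundary occurs.
  bool Feed(uint8_t b) {
    uint64_t hv = h(b);
    if (window_.size() == k_) {
      // Remove the oldest byte: it has been rotated k times since it entered.
      uint64_t oldest = h(window_.front());
      window_.pop_front();
      fp_ = Rotate(fp_, 1) ^ Rotate(oldest, k_ % q_) ^ hv;
    } else {
      fp_ = Rotate(fp_, 1) ^ hv;
    }
    window_.push_back(b);
    return (fp_ & Mask()) == 0;
  }

 private:
  uint64_t Mask() const { return (1ULL << q_) - 1; }
  // Barrel shift within the lowest q bits: shift left, wrap the q-th bit.
  uint64_t Rotate(uint64_t v, size_t n) const {
    n %= q_;
    v &= Mask();
    if (n == 0) return v;
    return ((v << n) | (v >> (q_ - n))) & Mask();
  }
  // A fixed pseudo-random byte-to-integer table would be used in practice;
  // an integer mix stands in for it here.
  uint64_t h(uint8_t b) const { return (b * 0x9E3779B97F4A7C15ULL) & Mask(); }

  size_t k_;
  unsigned q_;
  uint64_t fp_ = 0;
  std::deque<uint8_t> window_;
};

// Index-node pattern Q: reuse the randomness of the child's cryptographic id.
// A boundary occurs when the r lowest bits of the id are zero.
bool IndexBoundary(const std::vector<uint8_t>& child_id, unsigned r) {
  uint64_t tail = 0;
  for (size_t i = 0; i < 8 && i < child_id.size(); ++i) {
    tail = (tail << 8) | child_id[child_id.size() - 1 - i];
  }
  return (tail & ((1ULL << r) - 1)) == 0;
}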

Algorithm 1: POS-Tree Construction
   Input: a list of data elements data
   Output: id of the constructed POS-Tree's root
   PatternDetector detector; List elements, new_entries;
   Node node; id last_commit;
   new_entries = data;
   detector = new P();                      /* use pattern P for leaf nodes */
   do
      move all elements in new_entries to elements;
      for each e in elements do
         node.append(e);
         feed e into detector to detect pattern;
         if pattern detected or e is the last element then
            last_commit = node.commit();
            add index entry of node into new_entries;
      detector = new Q();                   /* use pattern Q for index nodes */
   while new_entries.size() > 1;            /* loop until root is found */
   return last_commit;

Figure 3: Three-way merge of two POS-Trees reuses disjointly modified sub-trees to build the merged tree.

3.2.4 Construction and Update

Given the node splitting strategies above, a POS-Tree is constructed as follows. First, data records are sorted by key and treated as a sequence. Next, the pattern function P is applied to create a list of leaf nodes and their respective index entries. After that, function Q is repeatedly applied on each level of index entries to construct index nodes, until the root node is reached. Algorithm 1 demonstrates this bottom-up construction. The expected node size is controlled by the parameters q and r in the pattern functions. To ensure that a node will not grow infinitely large, an additional constraint is enforced: the node size cannot be α times larger than the average size; otherwise it splits forcefully. The probability of a force split is equal to (1/e)^α, which can be very low (e.g. 0.03% when α = 8).

To update a single entry, POS-Tree first seeks to the target node, applies the change, and finally propagates it to the index nodes on the path back to the root node. Copy-on-write is used to ensure that old nodes are not deleted. When modified, a node might split if a new pattern occurs, or merge with the next node if its pattern is destroyed. In any case, at most two nodes are affected at each level. The update complexity is therefore O(log(N)) since the tree is probabilistically balanced. To further amortize the cost of many changes, multi-update is supported, in which index nodes are updated only after all changes are applied on multiple data nodes.

POS-Tree enables a special type of update in which an entire set of records is exported, modified externally and then re-imported. The final re-import operation is efficient. In particular, POS-Tree rebuilds an entire new tree from the given records, but the tree shares most of its nodes with the old tree. Thanks to the SIRI properties, rebuilding a new tree has the same result as applying the updates directly to the old tree.

3.2.5 Diff and Merge

POS-Tree supports a fast diff operation which identifies the differences between two POS-Tree instances. Because two sub-trees with identical content must have the same root id, the diff operation can be performed recursively by following the sub-trees with different ids and pruning those with the same ids. The complexity of diff is therefore O(D log(N)), where D is the number of different leaf nodes and N is the total number of data entries.

POS-Tree supports three-way merge, which consists of a diff phase and a merge phase. In the first phase, two objects A and B are diff-ed against a common base object C, which results in ∆A and ∆B respectively. In the merge phase, the differences are applied to one of the two objects, i.e., ∆A is applied to B or ∆B is applied to A. In conventional approaches, the two phases are performed element-wise. In POS-Tree, both phases can be done efficiently at the sub-tree level. More specifically, we do not need to reach leaf nodes during the diff phase, as the merge phase can be performed directly on the largest disjoint sub-trees that cover the differences, instead of on individual leaf nodes, as illustrated in Figure 3.

3.2.6 Sequence POS-Tree

POS-Tree is designed for indexing records with unique keys, and is therefore suitable for collection abstractions such as Map and Set. It can also be slightly modified to support sequence abstractions such as List and Blob (i.e. byte sequence). We call this variant the sequence POS-Tree. Each index entry in this variant replaces the split key with a counter that indicates the total number of leaf-level data entries in that sub-tree. This allows for computing the path for positional accesses, e.g., reading the i-th element. The diff operation is also different from the original POS-Tree. Finding the differences between two sequences is commonly calculated using edit distance, e.g., by the diff tool in Linux [46]. The sequence POS-Tree is able to perform this operation recursively on index entries, instead of on the flattened sequence of data entries.

3.3 Generic Fork Semantics

We propose a generic fork semantics that supports both fork on demand (FoD) and fork on conflict (FoC). The application chooses which semantics to use, while the storage focuses on optimizing the fork related operations.

3.3.1 Fork on Demand

In this scenario, a branch is forked explicitly on demand to create an isolated, modifiable data copy. Every branch has a user-defined tag, thus we refer to it as a tagged branch. The latest version of a branch is called the branch head, which is the only modifiable state. For example, in Figure 4(a) version S1 from an existing branch is forked to a new branch. Then an update W is applied to the new branch, creating a version S2 which becomes the new branch's head. The most important operations are as follows:

• Fork – create an isolated modifiable branch from another branch (or a version) and attach a tag;
• Read – return committed data from a branch (or a version);
• Commit – update (or advance) a branch with new data;
• Diff – find the differences between branches (or versions);
• Merge – merge two branches and their commit histories.

Most version control systems such as git and Decibel follow this semantics and provide the same set of operations.
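As a concrete illustration of Algorithm 1, the following C++ sketch builds a POS-Tree bottom-up. The Entry and NodeId types, CommitNode and BuildPosTree are our own simplified names, the per-entry boundary predicates stand in for the rolling-hash pattern P and the id-based pattern Q, and std::hash stands in for a cryptographic hash; it is a sketch of the procedure under these assumptions, not ForkBase's implementation.

#include <cstdint>
#include <functional>
#include <string>
#include <vector>

// Minimal sketch of the bottom-up POS-Tree construction of Algorithm 1.
struct Entry {
  std::string key;    // split key (or data key at the leaf level)
  std::string bytes;  // payload, or a child id at index levels
};

using NodeId = uint64_t;

NodeId CommitNode(const std::vector<Entry>& node, bool leaf) {
  std::string buf(leaf ? "L" : "I");
  for (const auto& e : node) buf += e.key + e.bytes;
  return std::hash<std::string>{}(buf);  // stand-in for a content hash
}

NodeId BuildPosTree(std::vector<Entry> entries,
                    const std::function<bool(const Entry&)>& leaf_boundary,
                    const std::function<bool(const Entry&)>& index_boundary) {
  bool leaf_level = true;
  NodeId last_commit = 0;
  while (true) {
    std::vector<Entry> next_level;  // index entries for the level above
    std::vector<Entry> node;        // entries of the node being built
    const auto& boundary = leaf_level ? leaf_boundary : index_boundary;
    for (size_t i = 0; i < entries.size(); ++i) {
      node.push_back(entries[i]);
      if (boundary(entries[i]) || i + 1 == entries.size()) {
        last_commit = CommitNode(node, leaf_level);
        // The new node is referenced from its parent by (split key, id).
        next_level.push_back({node.back().key, std::to_string(last_commit)});
        node.clear();
      }
    }
    if (next_level.size() <= 1) return last_commit;  // root reached
    entries = std::move(next_level);
    leaf_level = false;  // switch from pattern P to pattern Q
  }
}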

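Similarly, the recursive diff of §3.2.5 can be sketched as follows. The in-memory NodeStore and the positional alignment of children are simplifications made here for brevity; ForkBase aligns children by split keys and resolves node ids through its chunk storage.

#include <cstdint>
#include <map>
#include <utility>
#include <vector>

using NodeId = uint64_t;

struct PosNode {
  bool leaf = false;
  std::vector<NodeId> children;  // empty for leaf nodes
};

using NodeStore = std::map<NodeId, PosNode>;

// Collects pairs of differing sub-tree roots (a-side, b-side), pruning any
// sub-tree whose root id matches on both sides. A three-way merge would then
// apply the collected differences from one side onto the other.
void DiffSubtrees(const NodeStore& store, NodeId a, NodeId b,
                  std::vector<std::pair<NodeId, NodeId>>* out) {
  if (a == b) return;  // identical content implies identical Merkle id: prune
  const PosNode& na = store.at(a);
  const PosNode& nb = store.at(b);
  if (na.leaf || nb.leaf || na.children.size() != nb.children.size()) {
    // Cannot descend in lockstep; report this pair as a largest
    // disjointly-modified region.
    out->push_back({a, b});
    return;
  }
  for (size_t i = 0; i < na.children.size(); ++i) {
    DiffSubtrees(store, na.children[i], nb.children[i], out);
  }
}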
1141 struct FNode { 푭풐풓풌 enum type; // object type ࡿ૚ 푺ퟏ 푺′ퟏ byte[] key; // object key ࢃ ࢃ byte[] data; // object value ૚ ૛ int depth; // distance to the first version 푾 vector bases; // versions it is derived from byte[] context; // reserved for application 푺 ࡿ૛ ࡿ૜ ퟐ } (a) (b) Figure 5: The FNode structure. Figure 4: Generic fork semantics supported for both (a) graph. Each node in the graph is a structure called FNode, and it fork on demand and (b) fork on conflict. is associated with a unique identifier uid. Links between FNode represent their derivation relationships. The structure of a FNode 3.3.2 Fork on Conflict is shown in Figure 5. The context field is reserved for application metadata, e.g., commit messages in git or nonce values for block- In this scenario, a branch is implicitly created from concurrent chain proof-of-work [28]. and conflicting modifications, in order to avoid blocking any oper- ations and delay conflict resolutions. For example, in Figure 4(b) 4.2 Tamper Evident Version two conflicting updates W and W are applied to the head ver- 1 2 Each FNode is associated with a uid representing its version, sion S concurrently. The result is that two different branches with 1 which can be used to retrieve the value. The uid uniquely identifies heads S and S are created. Such branches can only be identi- 2 3 both the object value and its derivation history, based on the con- fied by their head versions, and thus we refer to them as untagged tent stored in the FNode. Two FNodes are considered equivalent, branches. The most important operations are as follows: i.e., having the same uid, when they have both the same value and • Read – choose and read a version based on a policy which derivation history. This is due to the use of POS-Tree – a struc- can be one of the following: turally invariant Merkle tree – to store the values. In addition, the – any-branch: read from any branch head; derivation history is essentially a hash chain formed by linking the – exact-version: read the given version; bases fields, thus two equal FNodes must have the same history. – version-descendant: read from any branch head derived It can be seen that uid is tamper evident. Given a uid, the user from the given version; can verify the content and history of the returned FNode. This in- tegrity property is guaranteed under the threat model that the stor- • Commit – update (or advance) a branch based on a policy, age is malicious, but the users keep track of the last uid of every or create a new branch if the policy fails: branch that has been committed. This model is similar to that in – exact-version: write to the given version if it is a branch fork-consistent storage systems [41], and does not provide other head; stronger guarantee, such as freshness in Concerto [12]. Instead of – version-descendant: write to a non-conflicting branch introducing a new tamper evidence design, ForkBase supports this head derived from the given version; property efficiently as a direct benefit from the POS-Tree design. • ListBranches – return all the branch heads; 4.3 Value Type • Merge – resolve conflicts and merge branches; ForkBase provides many built-in data types. They can be cate- gorized into two classes: primitive and chunkable. Many forkable applications can be implemented using the above Primitive types include simple values – String, Tuple and In- operations. For example, a multi-master replicated data service teger. They are atomic values optimized for fast access. 
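To make the FNode layout of Figure 5 concrete, the following self-contained C++ sketch writes out its fields; the enum values, the serialization format, and the use of std::hash in place of a cryptographic hash (the text names SHA-256-style hashing) are illustrative assumptions, not ForkBase's wire format.

#include <cstdint>
#include <functional>
#include <string>
#include <vector>

// Sketch of the FNode structure of Figure 5. Field names follow the figure.
enum class ObjectType : uint8_t { kString, kTuple, kInteger, kBlob, kList, kMap, kSet };

struct FNode {
  ObjectType type = ObjectType::kBlob;  // object type
  std::string key;                      // object key
  std::string data;                     // primitive value, or POS-Tree root cid
  uint32_t depth = 0;                   // distance to the first version
  std::vector<std::string> bases;       // uids of the versions it derives from
  std::string context;                  // reserved for application metadata

  // Length-prefixed concatenation of all fields; any deterministic encoding
  // would do for the purpose of this sketch.
  std::string Serialize() const {
    auto field = [](const std::string& s) {
      return std::to_string(s.size()) + ":" + s;
    };
    std::string out;
    out += std::to_string(static_cast<int>(type)) + "|";
    out += field(key) + field(data) + std::to_string(depth) + "|";
    for (const auto& b : bases) out += field(b);
    out += field(context);
    return out;
  }

  // The uid is a content hash of the serialized node, so it commits to both
  // the value and, through `bases`, the entire derivation history, which is
  // what makes it tamper evident (Section 4.2).
  std::string Uid() const {
    return std::to_string(std::hash<std::string>{}(Serialize()));
  }
};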
These can apply remote changes via Commit (version-descendant), spec- values are not explicitly deduplicated, since the benefits of sharing ifying the last version that the remote node committed. A new small data does not offset the extra overheads. Apart from the basic branch is created if the changes conflict with local changes. Period- get and set operations, many type-specific operations are provided, ically, the service checks and resolves outstanding conflicts using such as append, insert for String and Tuple, and add, multiply for ListBranches and Merge. Another example is cryptocurrency, in numerical types. which whenever a client receives a block, it invoke Commit (exact- Chunkable types are complex data structures – Blob, List, Map version) where the previous version is extracted from the block it- and Set. Each chunkable value is stored as a POS-Tree (or a se- self. The longest chain can be identified using ListBranches. The quence POS-Tree) and thus deduplicated. The chunkable types are final example is TARDiS [21], following which we can extend suitable for data that may grow fairly large and have many updates. our semantics to implement complex branch policies and support Reading a chunkable value simply returns a handler, while actual a wide range of consistency levels. data pages are fetched gradually on demand through an iterator in- terface. Fine-grained access methods are naturally supported by the 4. DATA MODEL AND APIS POS-Tree, such as seek, insert, update and remove. In this section, we introduce ForkBase’s data model and sup- The rich collection of built-in data types makes it easy to build ported operations, showing how it integrates above designs. high level data abstractions, such as relational tables and block- chains (§6). Note that some data types could have same logical rep- 4.1 FNode resentation but different performance trade-offs, for example String ForkBase adopts an extended key-value data model: each object and Blob, or Tuple and List. The applications can flexibly choose is identified by a key, and contains a value of a specific type. A key those types that are more suitable for their workloads. may have multiple branches. Given a key we can retrieve not only the current value in each branch, but also its historical versions. 4.4 APIs Similar to other data versioning systems, ForkBase organizes ver- Table 1 lists the basic operations supported by ForkBase. A new sions in a directed acyclic graph (DAG) called version derivation FNode can be created given its value and the base uid from which

1142 Application Data Access Requests Table 1: ForkBase APIs with fork semantics. ForkBase Method FoD FoC Ref Request Dispatcher Get(key,branch) M1 Master X Servlet Get Get(key,uid) XX M2 Get(key,policy) M3 Request Handler X Get/Put/Fork/Merge/Rename/... Put(key,branch,value) X M4 Put Put(key,base uid,value) X M5 Access Control + Branch Table Put(key,policy,value) X M6 Merge(key,tgt brh,ref brh) M7 X Object Manager Servlet Servlet ... Merge Merge(key,tgt brh,ref uid) X M8 Merge(key,ref uid1,...) X M9 Chunkable Primitive ListKeys() XX M10 blob/list/map/set/... bool/number/string/... View ListTaggedBranches(key) M11 X Chunk Storage Client ListUntaggedBranches(key) X M12 Local Storage Fork(key,ref brh,new brh) X M13 Data Type Manager Fork(key,ref uid,new brh) M14 Fork X Rename(key,tgt brh,new brh) X M15 Distributed Chunk Storage ... Remove(key,tgt brh) X M16 Track(key,branch,dist rng) X M17 Figure 7: Architecture of a ForkBase cluster. Track Track(key,uid,dist rng) XX M18 LCA(key,uid1,uid2) XX M19 tion and branch management. Figure 6 shows an example of fork- ing and editing a Blob object. Since Put is used for both insertion ForkBaseConnector db; and update, its value field could be either an entire new value or // Put a blob to the default master branch Blob blob {"my value"}; the base object that has undergone a sequence of updates. We can db.Put("my key", blob); commit multiple updates on the same object in a batch and Fork- // Fork to a new branch Base only retains the final version. db.Fork("my key", "master", "new branch"); // Get the blob FNode value = db.Get("my key", "new branch"); 5. SYSTEM IMPLEMENTATION if (value.type() != Blob) throw TypeNotMatchError; In this section, we present the implementation details of Fork- blob = value.Blob(); Base. The system can either run as an embedded storage or as a // Remove first 10 bytes and append new bytes distributed service. // Changes are buffered in client blob.Remove(0, 10); blob.Append("some more"); 5.1 Architecture // Commit changes to that branch Figure 7 shows the architecture of a ForkBase cluster consisting db.Put("my key", "new branch", blob); of four main components: master, dispatcher, servlet and chunk storage. The master maintains the cluster runtime information, Figure 6: Fork and modify a Blob object. while the request dispatcher receives and forwards requests to the corresponding servlets. Each servlet manages a disjoint subset of the key space, as determined by a routing policy. A servlet further it is derived (M5). Existing FNodes can be retrieved using their contains three sub-modules for executing requests: the access con- uids (M2). A branch tag can be used instead of uid, in this case the troller verifies request permission before execution; the branch ta- branch head is returned for Get (M1) and used as the base version ble maintains branch heads for both tagged and untagged branches; for Put (M4). When neither branch tag nor uid is specified, the and the object manager handles object manipulations, hiding the default branch is used. internal data representation from the main execution logic. The The fork related operations discussed in §3.3 can be mapped to chunk storage persists and provides access to data chunks. All these APIs. For FoD operations, a tagged branch can be forked chunk storage instances form a large pool of shared storage, which from another branch (M13) or from a non-head version (M14). is accessible by remote servlets. 
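In the same style as Figure 6, the snippet below sketches a fork-on-conflict workflow built from the Table 1 operations (M3, M5, M9, M12). The exact C++ signatures, the Policy enum and the FNode::uid() accessor are assumptions for illustration, not ForkBase's published API.

// A fork-on-conflict workflow (sketch).
ForkBaseConnector db;

// Read some head under a policy (M3) and remember the version that was read.
FNode head = db.Get("counter", Policy::AnyBranch);
auto base = head.uid();

// Two replicas commit against the same base version (M5). The first commit
// advances that head; the second finds the base already derived, so ForkBase
// records it as a new untagged (conflicting) branch instead of blocking.
db.Put("counter", base, Blob {"value from replica A"});
db.Put("counter", base, Blob {"value from replica B"});

// Periodically reconcile: list the untagged branch heads (M12) and merge
// them (M9) with a built-in or user-supplied resolution function.
auto heads = db.ListUntaggedBranches("counter");
if (heads.size() > 1) {
  db.Merge("counter", heads[0], heads[1]);
}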
In fact, each servlet is co-located Commit and Read are supported by (M4) and (M1) respectively. with a local chunk storage to enable fast data access and persis- The diff operations can be implemented by first finding the base tence. When ForkBase is used as an embedded storage, e.g., in version (M19) and then performing three-way diff (on POS-Trees blockchain nodes, only one servlet and one chunk storage are in- for chunkable types). A tagged branch can merge with another stantiated. branch using (M7) or with a specific version using (M8). In ei- ther case, only the active branch’s head is updated such that the 5.2 Internal Data Representation new head contains data from both branches. For FoC operations, Data objects are stored in the form of data chunks. A primitive Commit and Read are supported by (M6) and (M3) respectively. object consists a single chunk, while a chunkable object comprises All the policy-based operations are actually implemented on top multiple chunks. of version-based accesses (M2, M5), which also allows the appli- Chunk and cid. A chunk is the basic unit of storage in ForkBase, cation to implement its own policies. The branches are listed via which contains a byte sequence. A chunk is uniquely identified by (M12), and they can be merged in a single operation (M9). Each its cid, computed from the byte sequence in the chunk using a cryp- key in ForkBase has isolated space for both fork semantics, such tographic hash function, e.g. SHA-256. Since the hash function is that a branch update from one semantics will not affect the states of collision resistant, each chunk has a unique cid, i.e., two chunks the other. As a result, a key can contain both tagged and untagged with the same cid should contain identical byte sequences. The branches at the same time. chunks are stored and deduplicated in chunk storage (§5.3) and can In summary, data in ForkBase can be manipulated at two granu- be retrieved via their cids. larities: at an individual object or at a branch of objects. ForkBase FNode and POS-tree. A FNode is serialized and stored as a exposes easy-to-use interfaces that combine both object manipula- meta chunk. The uid of the FNode is in fact an alias for the meta

1143 chunk’s cid.A POS-Tree is stored in multiple chunks, one chunk back to the storage. In addition, simple conflicts can be resolved per node. In particular, an index node is stored in an index chunk, using built-in resolution functions (such as append, aggregate and and a leaf node in a blob/list/map/set chunk. The child node id choose-one). ForkBase also allows users to hook customized reso- stored in the index entry is the cid of the respective chunk. lution functions which are executed upon conflicts. Data Types. For a primitive object, its value is embedded in the meta chunk’s data field for fast access. For a chunkable object, 5.5 Cluster Management the data field contains a cid which indicates the root of the corre- When ForkBase is deployed as a distributed service, it uses a sponding POS-Tree. Accessing a large chunkable object is efficient hash-based two layer partitioning that distributes workloads evenly because only the relevant POS-Tree nodes are fetched on demand, among nodes in the cluster: as opposed to fetching the entire tree at once. By default, the ex- • Request dispatcher to servlet: requests received by a dis- pected chunk size is 4 KB for all POS-Tree nodes, but type-specific patcher are partitioned and sent to the corresponding servlet chunk sizes are also supported. For example, blob chunks storing based on the request keys’ hash. content of large files can have large sizes, whereas index chunks • Servlet to chunk storage: chunks created in a servlet are may need smaller sizes since they only contain tiny index entries. partitioned based on cids, and then forwarded to the corre- 5.3 Chunk Storage sponding chunk storage. However, all meta chunks generated by a servlet are always stored The chunk storage persists data chunks and supports retrieval us- in its local chunk storage, as they are not accessed by other servlets. ing cid. It exposes a key-value interface, where the key is the cid of By keeping the meta chunks locally, it is efficient to return primitive the committed chunk. The chunk storage is content addressable: it objects or to track historical versions. In addition, servlets caches derives cid from content of the chunk. As a result, when a request the frequently accessed remote chunks as they are immutable. When contains an existing chunk, the storage will detect it and return im- reading a POS-Tree node, request dispatchers forward Get-Chunk mediately. The fact that the chunks are immutable is leveraged in request directly to the chunk storage, bypassing the servlet. two ways. First, the chunks are persisted in a log-structured layout which provides locality for consecutively generated chunks from a POS-Tree, and cached with a LRU policy. Second, each chunk is k- 6. APPLICATION DEVELOPMENT way replicated for better fault tolerance without introducing consis- In this section, we use ForkBase to build three applications: a tency issues, because there is no update. Delta encoding can be ap- blockchain platform, a wiki engine and a collaborative analytics plied on similar chunks to further reduce space consumption [63]. application. We describe how it meets the applications’ demands We leave this enhancement for future work. and reduces development efforts. 5.4 Branch Management 6.1 Blockchain For each key there is a branch table that holds all its branches’ The blockchain data consists of some global states and transac- heads. The branch table comprises two structures for tagged and tions that modify the states. 
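A minimal sketch of the content-addressable chunk storage described in §5.3 follows, assuming an in-memory map and std::hash in place of SHA-256; the log-structured persistence, replication and LRU caching discussed above are omitted.

#include <functional>
#include <string>
#include <unordered_map>

// Content-addressable chunk store (sketch).
class ChunkStorage {
 public:
  using Cid = std::string;

  // The cid is derived from the chunk content, so committing an existing
  // chunk is detected and returns immediately without storing a duplicate.
  Cid Put(const std::string& bytes) {
    Cid cid = std::to_string(std::hash<std::string>{}(bytes));
    chunks_.emplace(cid, bytes);  // no-op if the cid is already present
    return cid;
  }

  // Chunks are immutable, so any cached or replicated copy is always valid.
  const std::string* Get(const Cid& cid) const {
    auto it = chunks_.find(cid);
    return it == chunks_.end() ? nullptr : &it->second;
  }

 private:
  std::unordered_map<Cid, std::string> chunks_;
};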
They are packed into blocks linked untagged branches respectively. with each other via cryptographic hash pointers, forming a chain TB-table. Tagged branches are maintained in a map structure that ends at the genesis block. In systems that support smart con- called TB-table, in which each entry consists of a tag (i.e. branch tracts (user-defined codes), each contract is given a key-value stor- name) and a head cid. The Put-Branch operation (M4) first updates age to manage its own states separately from the global states. We the value (in POS-Tree) and then creates a FNode, and finally re- refer readers to [25] for a more comprehensive treatment of the places the old branch head with this new FNode’s cid in TB-table. blockchain design space. In this paper, we focus on integrating The Fork-Branch operation (M13) simply creates a new table en- ForkBase with Hyperledger for two reasons. First, Hyperledger try pointing to the referenced FNode. Therefore, fork operations is one of the most popular blockchains with support for Turing- are extremely lightweight in ForkBase. Concurrent updates on a complete smart contracts, making it easy to evaluate the storage tagged branch are serialized by the servlet. To prevent from over- component by writing contracts that stress the storage. Second, the writing others’ changes by accident, additional guarded APIs are platform targets enterprise applications whose demands for both provided, which ensures that the Put succeeds only if the current data processing and analytics are more pronounced than public branch head is not advanced after its last read. blockchain applications like cryptocurrency. UB-table. Untagged branches are maintained in a set structure Data Model in Hyperledger. Figure 8(a) illustrates the main 3 called UB-table, in which each entry is simply a head cid for a data structures in Hyperledger v0.6 . The states are protected by a conflicting branch. The Put-Policy (M6) and Put-Version (M5) op- Merkle tree and any modification results in a new tree; the old val- erations update the UB-table accordingly. Once a new FNode is ues and trees are kept in a separate structure called state delta.A created, its cid is added to the UB-table, and its base cid is removed blockchain transaction can issue read or write operations (of key- from the table. If the base cid is not found in the table, it means that value tuples) to the states. Only transactions that update the states the base version has already been derived by others, hence a new are stored in the block. A write is buffered in an in-memory struc- conflicting branch occurs. If the new FNode already exists in chunk ture. The system batches multiple transactions and then issues a storage (from equivalent operations), the UB-table simply ignores commit. The commit operation first creates a new Merkle tree, fol- it. lowed by a new state and a new block, and finally writes all changes Conflict Resolution. A three-way merge strategy is used in to the storage. Merge (M7-M9) operations. To merge two branch heads v1 and v2, Blockchain Analytics. One initial goal of blockchain systems is the POS-Trees of three versions (v1, v2 and LCA(v1, v2)) are fed to securely record the states, thus the designs are focused on tam- into the merge function. A conflict occurs if both branches modify per evidence and data versioning. As blockchain applications gain a key (in Map and Set) or a position (in List and Blob). If the merge 3Newer versions, i.e. 
v1.0+, make significant changes to the data fails, it returns a conflict list, calling for conflict resolution. This model, but they do not fit our definition of a blockchain system can be handled at the application layer with the merged result sent which requires a Byzantine fault tolerance consensus protocol.
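Returning to the branch management of §5.4, the following sketch captures the UB-table rule for untagged branches: a commit whose base is still a head advances that head, otherwise it creates a new conflicting branch, and an equivalent commit is ignored. The class shape and return convention are our own simplifications.

#include <set>
#include <string>

using Uid = std::string;

// Sketch of the UB-table logic of Section 5.4 (untagged branch heads).
class UBTable {
 public:
  // Returns true if the commit advanced an existing head, false if it
  // produced a new (conflicting) untagged branch.
  bool Commit(const Uid& base, const Uid& new_node) {
    if (heads_.count(new_node)) return true;  // equivalent commit, ignore
    bool advanced = heads_.erase(base) > 0;   // base still a head: advance it
    heads_.insert(new_node);                  // otherwise a new conflicting branch
    return advanced;
  }

  const std::set<Uid>& Heads() const { return heads_; }  // ListUntaggedBranches

 private:
  std::set<Uid> heads_;
};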

1144 Block 6.2 Wiki Engine ⋯ prev_hash Txns ⋯ A wiki engine allows collaborative editing of documents (or wiki State hash Blockchain pages). Each entry contains a linear chain of versions. The wiki Internal engine can be built on top of a multiversion key-value storage in State structure State Merkle which each entry maps to a wiki page. This can be directly imple- Delta tree mented with Redis [8], for instance, using list-type values. How- Rocksdb ever, in some existing implementations, e.g. Wikipedia, the ver- ⋯ Contract ID Key Value sions are stored as independent records and therefore additional ⋯ mechanisms are needed to capture and exploit their relationships. KV Store Data Model in ForkBase. This multiversion key-value model (a) Hyperledger (v0.6) data model fits naturally into ForkBase. Reading and writing an entry directly maps to Get (M1) and Put (M4) operations on the default branch. Blob The Blob type is used instead of String because it helps eliminate ⋯ prev_hash Txns ⋯ duplicates across versions. Meta information, e.g. timestamp and

FID Blockchain author, can be stored in the context field. When accessing consec- ForkBase utive versions, ForkBase clients can reuse shared data chunks to ...... Contract ID serve out more quickly. Comparing two versions is efficiently sup- Map

...... ported by diff-ing two POS-Trees. Finally, ForkBase’s two-level partitioning scheme helps scale skewed workloads from hot pages...... ⋯ Key ⋯ Map 6.3 Collaborative Analytics ...... Collaborative analytics is an emerging application which allows ⋯ Data (Blob) Data (Blob) Data (Blob) ⋯ a group of scientists (or analysts) to work on a shared dataset with (b) Hyperledger data model on ForkBase different analysis goals [13, 48]. For example, on a dataset of cus- tomer purchasing records, some analysts may perform customer be- Figure 8: Blockchain data models. havioral analysis, while others use it to improve inventory manage- ment. At the same time, the dataset may be continually cleaned and enriched by other analysts. As the analysts simultaneously work traction, the massive volume of data stored on the ledger present on different versions, there is a clear need for versioning and on- valuable opportunities for analytics [56, 1, 35]. However, tradi- demand fork semantics. Decibel [43] and OrpheusDB [31] are two tional key-value stores which underlie current blockchain systems, state-of-the-art systems that support relational datasets. They pro- are not optimized for analytics. vide SQL-like query interfaces, and use record-level delta encoding In this work we consider two representative analytical queries on to eliminate duplicate records across versions. blockchain data. The first query is a state scan, which returns the Data Model in ForkBase. ForkBase has a rich collection of history of a given state, i.e. how the current value comes about. built-in data types, which offers the flexibility to construct different The second query is a block scan, which returns the values of the data models. In particular, we implement two layouts for relational states at a specific block. The current Hyperledger data structures datasets: row-oriented and column-oriented. In the former, a record are designed for fast access to the latest states. However, the two is stored as a Tuple, embedded in a Map keyed by its primary key. above queries require traversing to the previous states and involve In the latter, column values are stored as a List, embedded in a Map computations with state delta. We implemented both queries in Hy- keyed by the column name. Applications can choose the layout that perledger by adding a pre-processing step that parses all the internal best serves their queries. For instance, the column-oriented layout structures of all the blocks and constructs an in-memory index. is more efficient for analytical queries. Hyperledger on ForkBase. Figure 8(b) illustrates how we use Branch-related operations are readily supported by ForkBase’s ForkBase to implement Hyperledger’s data structures. The key in- fork-on-demand semantics. Other common operations such as data sight here is that a FNode fully captures the structure of a block transformation, analytical queries, and version comparisons, are and its corresponding states. We replace Merkle tree and state delta easy to implement as well. For example, POS-Trees enable fast with Map objects organized in two levels. The state hash is now update to a table, as well as point queries, range queries and ver- replaced by the uid of the first-level Map. This Map contains key- sion diffs. While other dataset management systems require ver- value tuples where the key is the smart contract ID, and the value sion reconstruction upon access, ForkBase enables direct access to is the uid of the second-level Map. 
This second-level Map contains any data pages in any versions. Besides, POS-Tree’s deduplication key-value tuples where the key is the data key, and the value is the works across datasets, which results in significantly lower storage uid of a Blob storing the state. requirement at scale. One immediate benefit of this model is that the code for main- taining data history and integrity becomes remarkably simple. In 7. EVALUATION particular, for 18 lines of code that uses ForkBase, we eliminate We implemented ForkBase in about 30k lines of C++ code. In 1900+ lines of code from the Hyperledger codebase. Another ben- this section, we first evaluate the performance of ForkBase opera- efit is that the data is now readily usable for analytics. For state tions. Next, we evaluate the three applications discussed in §6 in scan query, we simply follow the uid stored in the latest block terms of storage consumption and query efficiency. We compare to get the latest Blob for the requested key. From there, we fol- them against respective state-of-the-art implementations. low base version to retrieve the previous values. For block scan Our experiments were conducted in an in-house cluster with 64 query, we follow the uid stored in the requested block to retrieve nodes, each of which runs Ubuntu 14.04 and is equipped with E5- the second-level Map. We then iterate through all entries to obtain 1650 3.5GHz CPU, 32GB RAM, and 2TB hard disk. All nodes corresponding Blobs. are connected via 1Gb Ethernet. For fair comparison against other

Figure 9: Scalability with multiple servlets (throughput vs. number of nodes).
Figure 10: Latency of blockchain commits (b=50, r=w=0.5).
Figure 11: Client perceived throughput (b=50, r=w=0.5).

Table 2: Performance of ForkBase Operations.
                 | Throughput (ops/sec) | Avg. latency (ms)
                 | 1KB      | 20KB      | 1KB    | 20KB
  Put-String     | 75.0K    | 8.3K      | 0.24   | 0.9
  Put-Blob       | 37.5K    | 5.7K      | 0.28   | 1.0
  Put-Map        | 35.8K    | 4.7K      | 0.38   | 1.28
  Get-String     | 78.3K    | 56.9K     | 0.23   | 0.8
  Get-Blob-Meta  | 99.7K    | 100.4K    | 0.16   | 0.17
  Get-Blob-Full  | 38.4K    | 4.9K      | 0.62   | 2.9
  Get-Map-Full   | 38.2K    | 5.0K      | 0.61   | 3.2
  Track          | 97.8K    | 96.0K     | 0.16   | 0.17
  Fork           | 113.6K   | 109.4K    | 0.17   | 0.17

Table 3: Breakdown of Put Operation (µs).
                  | String           | Blob
                  | 1KB    | 20KB    | 1KB    | 20KB
  Serialization   | 0.8    | 0.8     | 1.1    | 1.5
  Deserialization | 5.7    | 9.2     | 6.2    | 13.2
  CryptoHash      | 8.5    | 56.4    | 9.5    | 80.6
  RollingHash     | -      | -       | 7.5    | 42.2
  Persistence     | 10.4   | 60.7    | 10.5   | 93.7

7.1 Micro-Benchmark
We deployed one ForkBase servlet and used multiple clients to send requests. We benchmark nine operations; Table 2 lists the aggregated throughput and average latency over 5 million requests issued by 32 clients. We observe that large requests achieve higher network throughput – the product of throughput and request size – because of smaller overheads in message parsing. The throughputs of primitive types are higher than those of chunkable types, due to the overhead of chunking and traversing the POS-Tree. Get-X-Meta, Track and Fork achieve the highest throughputs regardless of the request size, because these operations require little or no data transfer. The average latencies of the different operations do not vary much, because latency is measured at the client side, where network delays are the major contributor.
Table 3 details the cost breakdown of the Put operation, excluding the network cost. It can be seen that the main contributor to the latency gap between primitive and chunkable types is the rolling hash computation incurred in the POS-Tree.
We also measured ForkBase's scalability by increasing the number of servlets up to 64. Figure 9 shows almost linear scalability for both Put and Get operations (with value sizes of 256 and 2560 bytes). This is expected, because there is no communication between the servlets.
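The RollingHash row in Table 3 corresponds to the content-defined chunking step that only chunkable types pay. As an illustration of that kind of step (not ForkBase's actual chunker), the sketch below cuts chunk boundaries where a simple rolling sum over a sliding window matches a mask; the window size, mask and chunk bounds are arbitrary choices made here for the example.

// Illustrative content-defined chunking with a rolling hash; parameters and
// the additive hash are assumptions for this sketch, not ForkBase code.
#include <cstddef>
#include <cstdint>
#include <iostream>
#include <string>
#include <vector>

std::vector<std::string> Chunk(const std::string& data) {
  const size_t kWindow = 48, kMinChunk = 512, kMaxChunk = 8192;
  const uint64_t kMask = (1u << 12) - 1;         // ~4KB expected chunk size
  std::vector<std::string> chunks;
  uint64_t hash = 0;                             // rolling sum of the last kWindow bytes
  size_t start = 0;                              // start of the current chunk
  for (size_t i = 0; i < data.size(); ++i) {
    size_t pos = i - start;                      // offset within the current chunk
    hash += static_cast<unsigned char>(data[i]);
    if (pos >= kWindow) hash -= static_cast<unsigned char>(data[i - kWindow]);
    bool boundary = ((hash & kMask) == 0 && pos + 1 >= kMinChunk) || pos + 1 >= kMaxChunk;
    if (boundary) {                              // cut a chunk at a content-defined point
      chunks.push_back(data.substr(start, pos + 1));
      start = i + 1;
      hash = 0;
    }
  }
  if (start < data.size()) chunks.push_back(data.substr(start));
  return chunks;
}

int main() {
  std::string page(20000, 'a');
  page.insert(10000, "some edited text");        // simulate an in-place page edit
  std::cout << Chunk(page).size() << " chunks\n";
}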
7.2 Blockchain
We compare ForkBase-backed Hyperledger with the original implementation using RocksDB, and also with another implementation that uses ForkBase as a pure key-value store. We refer to them as ForkBase, Rocksdb and ForkBase-KV respectively. We first evaluate how the different storage engines affect normal Hyperledger operations and user-perceived performance; we then evaluate their efficiency in supporting analytical queries. We used Blockbench [23], a benchmarking framework for permissioned blockchains, to generate and drive workloads. Specifically, we used the smart contract implementing a key-value store, which is designed to stress the storage. Transactions for this contract are generated based on YCSB workloads. We varied the number of keys and the number and ratio of read and write operations (r and w). Unless stated otherwise, the number of keys is the same as the number of operations. We issued up to 1 million write operations with 1KB values for each test. For the blockchain configuration, we deployed one server, varied the maximum block size b, and kept the default values for the other settings.

7.2.1 Blockchain Operations
Figure 10 shows the 95th percentile latency of commit operations. Read and write operations are orders of magnitude faster because of memory buffers, and we omit the details here. Although Rocksdb is designed for fast batch commits, ForkBase and Rocksdb still have similar latencies. Both are better than ForkBase-KV, since using ForkBase as a pure key-value store introduces overhead from conducting hash computation both inside and outside of the storage layer. Figure 11 shows the overall throughput, measured as the total number of transactions committed per second. We see no difference in throughput, because the overheads in read, write and commit are small relative to the total time a transaction takes to be included in the blockchain. In fact, we observe that the cost of executing a batch of transactions is much higher than that of committing the batch.

7.2.2 Merkle Trees
A commit operation involves updating Map objects in ForkBase or the Merkle trees in the original Hyperledger. Hyperledger provides two Merkle tree implementations: the default is a bucket tree, in which the number of leaves is fixed and pre-determined at start-up time and a data key's hash determines its bucket number; the other option is a trie. Figure 12 shows how the different structures affect commit latency. With the bucket tree, the number of buckets (nb = 10, 1K, 1M) has a significant impact on commit latency. With fewer buckets, the latency increases and the distribution becomes less uniform. This is because, with more updates, write amplification becomes more severe, which increases the cost of updating the tree. In fact, for any pre-defined number of buckets, the bucket tree is expected to fail to scale beyond workloads of a certain size. In contrast, Map objects in ForkBase scale gracefully by dynamically adjusting the tree height and bounding node sizes. The trie structure exhibits low amplification, but its latency is higher than ForkBase's because the structure is not balanced and may therefore require longer tree traversals during updates.
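A rough sketch of the write-amplification argument follows: with a fixed number of buckets, every updated key dirties a whole bucket whose digest must be recomputed, so fewer buckets mean more bytes re-hashed per commit. The model below is our own simplification for illustration, not Hyperledger code; the key counts and value size are arbitrary.

// Back-of-envelope model of bucket-tree write amplification (assumptions ours).
#include <cstddef>
#include <functional>
#include <iostream>
#include <set>
#include <string>

// Bytes that must be re-hashed to refresh bucket digests after a batch of
// updates, assuming nb buckets, total_keys uniformly spread keys and
// value_size-byte states.
size_t BytesRehashed(size_t nb, size_t total_keys, size_t value_size,
                     const std::set<std::string>& updated_keys) {
  std::set<size_t> dirty;
  for (const auto& k : updated_keys)
    dirty.insert(std::hash<std::string>{}(k) % nb);   // bucket touched by each update
  size_t keys_per_bucket = (total_keys + nb - 1) / nb;
  return dirty.size() * keys_per_bucket * value_size; // whole bucket is re-hashed
}

int main() {
  std::set<std::string> updates;
  for (int i = 0; i < 500; ++i) updates.insert("key" + std::to_string(i));
  const size_t kBucketCounts[] = {10, 1000, 1000000};
  for (size_t nb : kBucketCounts)
    std::cout << "nb=" << nb << " -> "
              << BytesRehashed(nb, 1 << 20, 1024, updates) << " bytes re-hashed\n";
}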

Figure 12: Commit latency distribution with different Merkle trees (CDF over commit time).
Figure 13: Scan queries, (a) state scan and (b) block scan. 'X, 2^y' means using storage X and 2^y keys.

Figure 14: Performance of editing wiki pages: (a) throughput and (b) storage consumption.
Figure 15: Throughput of reading consecutive versions of a wiki page.

7.2.3 Analytical Queries
We populated the storage with varying numbers of keys and a large number of updates, resulting in a medium-size chain of 12000 blocks. Figure 13(a) compares the performance of the state scan query. The x axis represents the number of unique keys scanned per query. For a small number of keys, the difference between ForkBase and Rocksdb is up to four orders of magnitude. This is because the cost in Rocksdb is dominated by the pre-processing phase, which is not required in ForkBase. However, this cost amortizes with more keys, which explains why the performance gap gets smaller. In particular, the gap reduces to zero when the number of unique keys being scanned equals the total number of keys in the storage, since scanning them requires retrieving all the data from the storage.
Figure 13(b) shows the performance of the block scan query. The x axis represents the block number being scanned, where x = 0 is the oldest (or first) block. We see a huge difference in performance, starting at four orders of magnitude and decreasing with higher block numbers. The cost in ForkBase increases because higher blocks contain more keys to be read. For 2^10 unique keys, the scanning cost peaks at around 2500 blocks, because these blocks contain all the keys. We note that the gap is at least two orders of magnitude regardless of how many blocks are being scanned.

7.3 Wiki Engine
We compare ForkBase with Redis, both deployed as multi-versioned wiki engines. We employed 32 clients on separate nodes to simultaneously edit 3200 pages hosted in the engine. In each request, a client loads or creates a random page whose initial size is 15KB, edits or appends the text, and finally uploads the revised version. In all the tests, each page has approximately 40 versions on average, resulting in more than 2GB of application data. We enabled data persistence in Redis to ensure that all data are written to disk.

7.3.1 Edit and Read Pages
Figure 14(a) shows the throughput of editing pages, in which xU indicates the ratio of in-place updates to insertions (e.g., 100U means all updates are in place). It is expected that Redis outperforms ForkBase in terms of write throughput, since the latter has to chunk the text and build the POS-Tree. On the other hand, the chunking overhead pays off through deduplication along the version history. As shown in Figure 14(b), ForkBase consumes 55% less storage than Redis, thanks to the deduplication using POS-Trees. The performance of reading wiki pages is illustrated in Figure 15. It can be seen that Redis is fast for reading a single version. As we track more versions during a single exploration, ForkBase starts to outperform Redis. The reason is that the data chunks composing a Blob value can be reused at the clients: when reading an old version, a large number of chunks may already be cached, resulting in lower read latency and less network traffic.

7.3.2 Hot Pages
We deployed a distributed wiki service in a 16-node cluster and ran a skewed workload (zipf = 0.5). Figure 16 shows the effect of skewness on the storage size distribution. With one-layer partitioning on the page name (1LP), where page content is stored locally, ForkBase suffers from imbalance. Two-layer partitioning (2LP) overcomes the problem by distributing chunks evenly among different nodes.
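A toy illustration of the difference between the two schemes (our own simplification, not ForkBase's partitioner): under 1LP a hot page and all of its chunks land on the single node chosen by the page name, while under 2LP each chunk is placed by its own hash, so a hot page's data spreads across nodes.

// Contrast of 1LP vs. 2LP placement under a single hot page; node count,
// chunk size and placement functions are assumptions for this sketch.
#include <cstddef>
#include <functional>
#include <iostream>
#include <string>
#include <vector>

int main() {
  const size_t kNodes = 4;
  std::vector<size_t> bytes_1lp(kNodes, 0), bytes_2lp(kNodes, 0);
  std::hash<std::string> h;

  // One hot page receiving many 4KB chunk updates.
  std::string page = "hot_page";
  for (int v = 0; v < 1000; ++v) {
    std::string chunk = page + ":chunk:" + std::to_string(v);
    bytes_1lp[h(page) % kNodes]  += 4096;    // 1LP: everything on the page's node
    bytes_2lp[h(chunk) % kNodes] += 4096;    // 2LP: chunks scattered by their own hash
  }
  for (size_t n = 0; n < kNodes; ++n)
    std::cout << "node " << n << ": 1LP=" << bytes_1lp[n]
              << "B  2LP=" << bytes_2lp[n] << "B\n";
}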

Figure 16: Storage size distribution in skewed workloads.
Figure 17: Performance of querying datasets: (a) diff and (b) aggregation.

7.4 Collaborative Analytics
We compare ForkBase with OrpheusDB, a state-of-the-art dataset management system, in terms of their performance in storing and querying relational datasets. We used a dataset containing 5 million records loaded from a csv file. Each record is around 180 bytes in length, consisting of a 12-byte primary key, two integer fields and other textual fields of variable lengths. The space consumption of the initial commit of the dataset is 927MB in ForkBase and 1167MB in OrpheusDB. We focus on queries that do not modify the dataset, because OrpheusDB is not designed for efficient commit operations.

7.4.1 Version Comparison
Figure 17(a) shows the cost of comparing two dataset versions with a varying degree of differences. OrpheusDB's cost is roughly constant, because the storage maintains a vector of version-to-record mappings for each dataset version and relies on a full vector comparison to find the differences. In contrast, ForkBase's cost is low for small differences, because ForkBase can quickly locate them using the POS-Tree. However, the cost increases when the differences are large, as more tree nodes need to be traversed.

7.4.2 Analytical Queries
Figure 17(b) compares the performance of aggregation queries on the numerical fields. For ForkBase, both row (ForkBase-ROW) and column (ForkBase-COL) layouts were used. It can be seen that ForkBase-ROW and OrpheusDB have similar performance, whereas ForkBase-COL has 10x better performance. The gap is due to the physical layouts and the fact that ForkBase does not need to check out the version and reconstruct it from its deltas. More specifically, even though ForkBase-ROW does not explicitly reconstruct a version, the target values are scattered over all data pages, so it incurs the same I/O cost as OrpheusDB. However, ForkBase-COL can efficiently locate the target columns from the top-level Map and then iterate over the values in the Lists.
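The following sketch illustrates why the column layout wins for aggregation, using hypothetical in-memory stand-ins (RowTable, ColTable) rather than the ForkBase API: summing one numeric column touches every Tuple in the row layout but only a single List in the column layout.

// Row vs. column layout for a simple aggregation; types are toy stand-ins.
#include <cstdint>
#include <iostream>
#include <map>
#include <string>
#include <vector>

using Tuple    = std::vector<std::string>;                      // one record's fields
using RowTable = std::map<std::string, Tuple>;                  // primary key -> Tuple
using ColTable = std::map<std::string, std::vector<int64_t>>;   // column name -> values

int64_t SumRow(const RowTable& t, size_t col_idx) {
  int64_t sum = 0;
  for (const auto& [pk, tuple] : t)             // must read every record
    sum += std::stoll(tuple[col_idx]);
  return sum;
}

int64_t SumCol(const ColTable& t, const std::string& col) {
  int64_t sum = 0;
  for (int64_t v : t.at(col)) sum += v;         // reads one column list only
  return sum;
}

int main() {
  RowTable rows = {{"k1", {"k1", "10", "x"}}, {"k2", {"k2", "32", "y"}}};
  ColTable cols = {{"qty", {10, 32}}};
  std::cout << SumRow(rows, 1) << " " << SumCol(cols, "qty") << "\n";
}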
8. RELATED WORK
Dataset Versioning. Decibel [43] and OrpheusDB [31] are state-of-the-art systems built for relational dataset versioning, with git-like version control and SQL-like query interfaces. They use record-level delta encoding to remove duplicates between consecutive versions. The trade-off between space saving and reconstruction cost of such approaches has been studied in [14]. DEX [17] executes queries directly on delta-based storage to avoid version reconstruction. dbDedup [63] extends delta encoding to achieve global deduplication via similarity search. In contrast, ForkBase uses page-level deduplication, which helps detect global duplicates and enables direct data access without checkout or reconstruction.
Persistent data structures. Persistent data structures, which preserve all past states and thus support versioning, have been widely studied [27, 39, 53, 33]. Purely functional data structures [49] are a special class of such structures, used mainly in functional programming languages. POS-Tree can be seen as a purely functional data structure with additional SIRI properties, which makes it effective for global deduplication.
Blockchain storage. Current blockchain systems use simple key-value storage such as LevelDB [6] and RocksDB [9] as their storage backend. These systems implement the log-structured merge tree [50], which consists of exponentially-sized index components that are merged periodically. They achieve superior write performance from sequential I/Os, but lack tamper evidence and analytics support. ForkBase is designed to replace them as a more natural and efficient storage for blockchain systems.
Data Integrity. Outsourced database systems [32, 40, 51] often rely on Merkle trees to achieve data integrity, as does ForkBase. Recently, Concerto [12] employed trusted hardware (e.g., Intel SGX) to achieve both integrity and freshness against malicious storage providers. Although orthogonal to ForkBase, Concerto's design can be integrated to enable stronger security.

9. CONCLUSIONS
In this paper, we identified three common properties in blockchain and forkable applications: data versioning, fork semantics and tamper evidence. We proposed a new index class called SIRI that is effective at detecting duplicate content among multiversion data. We designed POS-Tree, an instance of SIRI that additionally offers tamper evidence. We described a generic fork semantics to support a broad range of applications. We designed and implemented ForkBase, which integrates these ideas and is able to deliver better performance than ad-hoc, application-layer solutions. By implementing three applications on top of ForkBase, we demonstrated that our storage simplifies application logic, thereby reducing development effort. We showed via experimental evaluation that ForkBase delivers better performance than the state of the art in terms of storage consumption and query efficiency. In summary, ForkBase provides a powerful building block for blockchains and emerging forkable applications [10].

ACKNOWLEDGMENTS
This work was supported by the Singapore Ministry of Education Tier-3 Grant MOE2017-T3-1-007 (WBS No. R-252-000-A08-112). Meihui Zhang was supported by the China Thousand Talents Program for Young Professionals (3070011181811). Gang Chen was supported by the National Key Research and Development Program of China (2017YFB1201001). We thank the anonymous reviewers for their insightful feedback, and thank Wanzeng Fu, Ji Wang and Hao Zhang for their early contributions.

10. REFERENCES
[1] Chainalysis - blockchain analysis. https://www.chainalysis.com.
[2] Ethereum. https://www.ethereum.org.
[3] Github. https://github.com.
[4] Googledocs. https://www.docs.google.com.
[5] Hyperledger. https://www.hyperledger.org.
[6] LevelDB. https://github.com/google/leveldb.
[7] MongoDB. http://mongodb.com.
[8] Redis. http://redis.io.
[9] RocksDB. http://rocksdb.org.
[10] The Morning Paper review on ForkBase. https://blog.acolyer.org/2018/06/01/forkbase-an-efficient-storage-engine-for-blockchain-and-forkable-applications.
[11] I. Ahn and R. Snodgrass. Performance evaluation of a temporal database management system. SIGMOD Record, 15(2):96–107, 1986.
[12] A. Arasu, K. Eguro, R. Kaushik, D. Kossmann, P. Meng, V. Pandey, and R. Ramamurthy. Concerto: A high concurrency key-value store with integrity. In SIGMOD, pages 251–266, 2017.
[13] A. Bhardwaj, S. Bhattacherjee, A. Chavan, A. Deshpande, A. J. Elmore, S. Madden, and A. Parameswaran. DataHub: Collaborative data science & dataset version management at scale. In CIDR, 2015.
[14] S. Bhattacherjee, A. Chavan, S. Huang, A. Deshpande, and A. Parameswaran. Principles of dataset versioning: Exploring the recreation/storage tradeoff. PVLDB, 8(12):1346–1357, 2015.
[15] N. Bronson, Z. Amsden, G. Cabrera, P. Chakka, P. Dimov, H. Ding, J. Ferris, A. Giardullo, S. Kulkarni, H. Li, M. Marchukov, D. Petrov, L. Puzar, Y. J. Song, and V. Venkataramani. TAO: Facebook's distributed data store for the social graph. In USENIX ATC, pages 49–60, 2013.
[16] F. Chang, J. Dean, S. Ghemawat, W. C. Hsieh, D. A. Wallach, M. Burrows, T. Chandra, A. Fikes, and R. E. Gruber. Bigtable: A distributed storage system for structured data. ACM TOCS, 26(2):4, 2008.
[17] A. Chavan and A. Deshpande. DEX: Query execution in a delta-based storage system. In SIGMOD, pages 171–186, 2017.
[18] J. D. Cohen. Recursive hashing functions for N-grams. ACM Trans. Inf. Syst., 15(3):291–320, 1997.
[19] D. Comer. Ubiquitous B-tree. ACM Computing Surveys (CSUR), 11(2):121–137, 1979.
[20] B. F. Cooper, R. Ramakrishnan, U. Srivastava, A. Silberstein, P. Bohannon, H.-A. Jacobsen, N. Puz, D. Weaver, and R. Yerneni. PNUTS: Yahoo!'s hosted data serving platform. PVLDB, 1(2):1277–1288, 2008.
[21] N. Crooks, Y. Pu, N. Estrada, T. Gupta, L. Alvisi, and A. Clement. TARDiS: A branch-and-merge approach to weak consistency. In SIGMOD, pages 1615–1628, 2016.
[22] G. DeCandia, D. Hastorun, M. Jampani, G. Kakulapati, A. Lakshman, A. Pilchin, S. Sivasubramanian, P. Vosshall, and W. Vogels. Dynamo: Amazon's highly available key-value store. In SOSP, volume 41, pages 205–220, 2007.
[23] T. T. A. Dinh, J. Wang, G. Chen, L. Rui, K.-L. Tan, and B. C. Ooi. Blockbench: A benchmarking framework for analyzing private blockchains. In SIGMOD, pages 1085–1100, 2017.
[24] T. T. A. Dinh, J. Wang, S. Wang, G. Chen, W.-N. Chin, Q. Lin, B. C. Ooi, P. Ruan, K.-L. Tan, Z. Xie, and M. Zhang. UStore: A distributed storage with rich semantics. CoRR, abs/1702.02799, 2017.
[25] T. T. A. Dinh, M. Zhang, B. C. Ooi, and G. Chen. Untangling blockchain: A data processing view of blockchain systems. IEEE Transactions on Knowledge and Data Engineering (TKDE), pages 1366–1385, 2018.
[26] I. Drago, M. Mellia, M. M. Munafo, A. Sperotto, R. Sadre, and A. Pras. Inside Dropbox: Understanding personal cloud storage services. In Proceedings of the 2012 Internet Measurement Conference, pages 481–494, 2012.
[27] J. R. Driscoll, N. Sarnak, D. D. Sleator, and R. E. Tarjan. Making data structures persistent. Journal of Computer and System Sciences, 38(1):86–124, 1989.
[28] C. Dwork and M. Naor. Pricing via processing or combatting junk mail. In Annual International Cryptology Conference, pages 139–147, 1992.
[29] K. Eshghi and H. K. Tang. A framework for analyzing and improving content-based chunking algorithms. Hewlett-Packard Labs Technical Report TR, 30(2005), 2005.
[30] R. G. Guy, J. S. Heidemann, W.-K. Mak, T. W. Page Jr, G. J. Popek, D. Rothmeier, et al. Implementation of the Ficus replicated file system. In USENIX Summer, pages 63–72, 1990.
[31] S. Huang, L. Xu, J. Liu, A. J. Elmore, and A. Parameswaran. OrpheusDB: Bolt-on versioning for relational databases. PVLDB, 10(10):1130–1141, 2017.
[32] R. Jain and S. Prabhakar. Trustworthy data from untrusted databases. In ICDE, pages 529–540, 2013.
[33] L. Jiang, B. Salzberg, D. Lomet, and M. Barrena. The BT-tree: A branched and temporal access method. 2000.
[34] M. Kallahalla, E. Riedel, R. Swaminathan, Q. Wang, and K. Fu. Plutus: Scalable secure file sharing on untrusted storage. In FAST, pages 29–42, 2003.
[35] H. Kalodner, S. Goldfeder, A. Chator, M. Möser, and A. Narayanan. BlockSci: Design and applications of a blockchain analysis platform. CoRR, abs/1709.02489, 2017.
[36] J. Katz and Y. Lindell. Introduction to modern cryptography. CRC Press, 2014.
[37] A. Kemper and T. Neumann. HyPer: A hybrid OLTP & OLAP main memory database system based on virtual memory snapshots. In ICDE, pages 195–206, 2011.
[38] A. Lakshman and P. Malik. Cassandra: A decentralized structured storage system. SIGOPS Operating Systems Review, 44(2):35–40, 2010.
[39] S. Lanka and E. Mays. Fully persistent B+-trees. In SIGMOD, pages 426–435, 1991.
[40] F. Li, M. Hadjieleftheriou, G. Kollios, and L. Reyzin. Dynamic authenticated index structures for outsourced databases. In SIGMOD, pages 121–132, 2006.
[41] J. Li, M. Krohn, D. Mazieres, and D. Shasha. Secure untrusted data repository. In OSDI, 2004.
[42] W. Lloyd, M. J. Freedman, M. Kaminsky, and D. G. Andersen. Don't settle for eventual: Scalable causal consistency for wide-area storage with COPS. In SOSP, pages 401–416, 2011.
[43] M. Maddox, D. Goehring, A. J. Elmore, S. Madden, A. G. Parameswaran, and A. Deshpande. Decibel: The relational dataset branching system. PVLDB, 9(9):624–635, 2016.
[44] R. C. Merkle. A digital signature based on a conventional encryption function. In A Conference on the Theory and Applications of Cryptographic Techniques on Advances in Cryptology, pages 369–378, 1988.

[45] A. Muthitacharoen, B. Chen, and D. Mazieres. A low-bandwidth network file system. In SIGOPS Operating Systems Review, volume 35, pages 174–187, 2001.
[46] E. W. Myers. An O(ND) difference algorithm and its variations. Algorithmica, 1(1-4):251–266, 1986.
[47] S. Nakamoto. Bitcoin: A peer-to-peer electronic cash system. https://bitcoin.org/bitcoin.pdf, 2009.
[48] F. A. Nothaft, M. Massie, T. Danford, et al. Rethinking data-intensive science using scalable analytics systems. In SIGMOD, pages 631–646, 2015.
[49] C. Okasaki. Purely functional data structures. Cambridge University Press, 1999.
[50] P. O'Neil, E. Cheng, D. Gawlick, and E. O'Neil. The log-structured merge-tree (LSM-tree). Acta Informatica, 33(4):351–385, 1996.
[51] H. Pang, A. Jain, K. Ramamritham, and K.-L. Tan. Verifying completeness of relational query results in data publishing. In SIGMOD, pages 407–418, 2005.
[52] J. Paulo and J. Pereira. A survey and classification of storage deduplication systems. ACM Computing Surveys (CSUR), 47(1):11, 2014.
[53] O. Rodeh. B-trees, shadowing, and clones. ACM Transactions on Storage (TOS), 3(4), 2008.
[54] B. Salzberg and V. J. Tsotras. Comparison of access methods for time-evolving data. ACM Computing Surveys (CSUR), 31(2):158–221, 1999.
[55] D. J. Santry, M. J. Feeley, N. C. Hutchinson, and A. C. Veitch. Elephant: The file system that never forgets. In HotOS, pages 2–7, 1999.
[56] S. Shah, A. Dockx, A. Baldet, F. Bi, C. Allchin, S. Misra, M. Huebner, B. Sherpherd, and B. Holroyd. Unlocking economic advantage with blockchain: A guide for asset managers. Oliver Wyman and JP Morgan, 2016.
[57] Y. Sompolinsky and A. Zohar. Secure high-rate transaction processing in Bitcoin. In International Conference on Financial Cryptography and Data Security, pages 507–527, 2015.
[58] C. A. N. Soules, G. R. Goodson, J. D. Strunk, and G. R. Ganger. Metadata efficiency in versioning file systems. In FAST, 2003.
[59] M. Stonebraker and L. A. Rowe. The design of POSTGRES. In SIGMOD, pages 340–355, 1986.
[60] J. D. Strunk, G. R. Goodson, M. L. Scheinholtz, C. A. N. Soules, and G. R. Ganger. Self-securing storage: Protecting data in compromised systems. In OSDI, 2000.
[61] A. U. Tansel, J. Clifford, S. Gadia, S. Jajodia, A. Segev, and R. Snodgrass. Temporal databases: Theory, design, and implementation. Benjamin-Cummings Publishing Co., Inc., 1993.
[62] W. Xia, H. Jiang, D. Feng, F. Douglis, P. Shilane, Y. Hua, M. Fu, Y. Zhang, and Y. Zhou. A comprehensive study of the past, present, and future of data deduplication. Proceedings of the IEEE, 104(9):1681–1710, 2016.
[63] L. Xu, A. Pavlo, S. Sengupta, and G. R. Ganger. Online deduplication for databases. In SIGMOD, pages 1355–1368, 2017.
