Blockchain Meets Database: Design and Implementation of a Blockchain Relational Database

Senthil Nathan (1), Chander Govindarajan (1), Adarsh Saraf (1), Manish Sethi (2), and Praveen Jayachandran (1)
(1) IBM Research India, (2) IBM Industry Platforms USA
(snatara7, chandg12, adasaraf, praveen.j)@in.ibm.com, [email protected]

ABSTRACT

In this paper, we design and implement the first-ever decentralized replicated relational database with blockchain properties that we term blockchain relational database. We highlight several similarities between features provided by blockchain platforms and a replicated relational database, although they are conceptually different, primarily in their trust model. Motivated by this, we leverage the rich features, decades of research and optimization, and available tooling in relational databases to build a blockchain relational database. We consider a permissioned blockchain model of known, but mutually distrustful organizations each operating their own database instance that are replicas of one another. The replicas execute transactions independently and engage in decentralized consensus to determine the order for transactions. We design two approaches, the first where the commit order for transactions is agreed upon prior to executing them, and the second where transactions are executed without prior knowledge of the commit order while the ordering happens in parallel. We leverage serializable snapshot isolation (SSI) to guarantee that the replicas across nodes remain consistent and respect the ordering determined by consensus, and devise a new variant of SSI based on block height for the latter approach. We implement our system on PostgreSQL and present detailed performance experiments analyzing both approaches.

1. INTRODUCTION

Blockchain has gained immense popularity over recent years, with its application being actively explored in several industries. At its core, it is an immutable append-only log of cryptographically signed transactions that is replicated and managed in a decentralized fashion through distributed consensus among a set of untrusted parties. Opinion on the technology has varied widely, from being hailed as the biggest innovation since the Internet to being considered as another database in disguise. For the first time, we undertake the challenge of explaining blockchain technology from the perspective of concepts well known in databases, and highlight the similarities and differences between the two. Existing blockchain platforms [17, 51, 39, 14], in an attempt to build something radically new and transformative, rebuild a lot of features provided by databases and treat them as just a store of information. We take a contrarian view in this paper. By leveraging the rich features and transactional processing capabilities already built into relational databases over decades of research and development, we demonstrate that we can build a blockchain relational database with all features provided by existing blockchain platforms and with better support for complex data types, constraints, triggers, complex queries, and provenance queries. Furthermore, our approach makes building blockchain applications as easy as building database applications: developers can leverage the rich tooling available for backup, analytics, and integration with existing enterprise systems. Applications that have strong compliance and audit requirements and a need for rich analytical queries, such as in financial services, that are already built atop relational databases, or ones that are reliant on rich provenance information, such as in supply chains, are likely to benefit most from the blockchain relational database proposed in this paper.

With a focus on enterprise blockchain applications, we target a permissioned setup of known but untrusted organizations, each operating their own independent database nodes but connected together to form a blockchain network. These database nodes need to be replicas of each other but distrust one another. The replication of transaction logs and state among multiple database nodes is possible using master-slave [16, 25, 45, 50] and master-master [25, 48, 50] protocols. In these protocols, the master node is trusted as it is the only node that executes transactions and propagates the final updates to other masters and slaves, either synchronously or asynchronously [28]. For synchronous replication in the master-master architecture, the serializable order is decided using techniques such as 2-phase commit [38], 3-phase commit [20], or a quorum protocol [44] to ensure consistency. Asynchronous replication can result in an intermediate inconsistent state between masters. Various solutions [42, 46] exist to reconcile inconsistency. However, in a blockchain network, one node cannot trust the execution results of another node. Hence, all transactions need to be independently executed on all nodes. Further, such an execution and subsequent commits across multiple untrusted nodes must result in the same serializable order to ensure consistency. As a result, existing replication protocols cannot be used for blockchain networks, and hence we need a novel protocol with a notion of decentralized trust. Although crash fault-tolerant synchronous commit protocols [29, 49] exist, they involve multiple rounds of communication between database nodes per transaction, or a centralized trusted controller. Moreover, applying synchronous commit per transaction on a blockchain network would be very costly as nodes may be globally distributed. Further, a trusted controller cannot be used.

In this paper, we address this key challenge of ensuring that all the untrusted database nodes execute transactions independently and commit them in the same serializable order asynchronously. Towards achieving this, we make two key design choices. First, we modify the database to separate the concerns of concurrent transaction execution and decentralized ordering of blocks of transactions. We leverage ordering through consensus only to order blocks of transactions, and not for the serializable ordering of transactions within a single block. Second, independently at each node, we leverage serializable snapshot isolation (SSI) [23, 43] to execute transactions concurrently, and then serially validate and commit each transaction in a block, to obtain a serializable order that will be the same across all untrusted nodes and respect the ordering of blocks obtained from consensus. Using these ideas, we present two novel approaches, the first where block ordering is performed before transaction execution, and the second where transaction execution happens parallelly without prior knowledge of block ordering, and discuss their trade-offs. We leverage and modify serializable snapshot isolation [43] to create a novel variant which guarantees the same serializable ordering across all nodes. We implement our ideas on PostgreSQL and discuss several subtle challenges we had to overcome. We make the following contributions in this paper:

1. We devise two novel approaches of building a blockchain platform starting from a database, ensuring that a network of untrusted database nodes can remain consistent.

2. We devise a new variant of SSI based on block height to permit transaction execution to happen parallelly with ordering based on consensus.

3. We implement the system in PostgreSQL, which required an addition of only 4000 lines of C code, and present the modified architecture.

The rest of this paper is organized as follows. We highlight the properties required in a blockchain platform, motivate how several of these are either partially or fully available in relational databases, and identify shortcomings that need to be addressed in §2. We discuss our design in §3 and describe two approaches to leveraging serializable snapshot isolation. We provide an account of our prototype implementation on PostgreSQL in §4. In §5, we present a detailed performance study. We discuss related work in §6, future work in §7, and conclude in §8.

2. MOTIVATION

Table 1 presents a comparison of properties required by a blockchain platform and features available in relational databases. We argue that there is much to be gained by leveraging and enhancing the features of relational databases to build a permissioned blockchain platform, rather than building one from scratch. This approach helps us leverage the decades of research in relational databases, stable enterprise-ready code with rich features, the large community, libraries, tools, and integrations with enterprise applications. Next, we compare blockchains and databases across eight features and the enhancements needed.

Table 1: Similarities between blockchain properties and relational database features.

    Blockchain property                | Relational database feature                      | Enhancement needed
    -----------------------------------+--------------------------------------------------+-----------------------------------------
    Smart contract                     | PL/SQL procedure                                 | Deterministic execution
    Authenticity, non-repudiability    | User management with groups and roles            | Crypto-based transaction authentication
    Access control                     | Role & content-based ACL policies                | None
    Immutable transaction logs         | Transaction logs                                 | Digitally signed transactions
    Consistent replicated ledger       | Master-slave & master-master replication with    | Decentralized trust and transaction
    between untrusted nodes            | trust on transaction ordering and update logs    | ordering determined by consensus
    Serializable isolation level       | Strict 2-phase locking, optimistic concurrency   | Order must respect block ordering
                                       | control, serializable snapshot isolation         | obtained through consensus
    Async transaction, notification    | LISTEN & NOTIFY commands                         | None
    Provenance queries                 | Maintains all versions of a row                  | Enable query on historical records

(1) Smart contract & transaction: Many blockchain platforms such as Hyperledger Fabric [17], Ethereum [51] and Corda [30] support smart contracts, which are functions that manage state on the blockchain ledger. Transactions are invocations of smart contract functions. Independent execution of the smart contract functions by different mutually distrustful peers needs to be deterministic. This is similar to stored PL/SQL procedures in relational databases: both accept input arguments, retrieve data by executing queries, apply the predefined logic, and write data back to the database. However, the execution of a PL/SQL procedure may not be deterministic when executed independently on multiple nodes if utility functions such as random and timestamp are used. Hence, we need to constrain PL/SQL procedures to ensure that the execution is deterministic (a sketch of such a procedure follows item (5) below).

(2) User authenticity & non-repudiability: Permissioned blockchain systems employ public key infrastructure for user management and ensuring authenticity. Participating users belong to organizations that are permissioned to participate in the blockchain network. Transactions are digitally signed by the invoker and recorded on the blockchain, making them non-repudiable. Relational databases have sophisticated user management capabilities, including support for users, roles, groups, and various user authentication options. However, submitted and recorded transactions are not signed by the invoker, making them repudiable. Hence, we need the invoker's signature on the transaction.

(3) Confidentiality & access control: Some permissioned blockchains support the notion of confidentiality, where smart contracts, transactions and data are only accessible by authorized users. In addition, access to invoke functions and to modify data is restricted to specific users. Relational databases have comprehensive support for access control on tables, rows, columns and PL/SQL procedures. Some relational databases even provide features like content-based access control, encryption and isolation of data. Advanced confidentiality features, where only a subset of the nodes have access to a particular data element or smart contract, can be implemented, although we don't consider that in this paper.

(4) Immutability of transactions and ledger state: The blockchain ledger is an append-only log of blocks containing sets of transactions that are chained together by adding the hash of each block to its succeeding block. This makes it difficult for an attacker to tamper with a block and avoid detection. The ledger state can only be modified by invoking smart contract functions recorded as transactions on the blockchain, and is immutable otherwise. Although a database logs all submitted transactions, executed queries, and the user information, it isn't possible to detect changes to the log as there are no digital signatures.

(5) Consistent replicated ledger across distrustful nodes: All non-faulty peers in a blockchain must maintain a consistent replica of all transactions and the ledger state. This is ensured by all peers committing transactions in the same sequential order as agreed through consensus. Note that consensus is performed on blocks of transactions rather than on individual transactions to be more performant. Databases support replication of transaction logs and state among nodes using master-slave [16, 25, 45, 50] and master-master [25, 48, 50] protocols. In these protocols, the master node is trusted as it is the only node that executes transactions and propagates the final updates to other masters and slaves. Hence, we need to propose a replication protocol with a notion of decentralized trust, while ensuring that the replicas remain consistent.
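To make the smart-contract analogy in items (1)-(3) concrete, here is a minimal sketch of a deterministic PL/pgSQL procedure that could serve as a smart contract. It is our illustration, not from the paper; the wallets table and transfer_funds function are hypothetical:

    CREATE TABLE wallets (
        owner  TEXT PRIMARY KEY,
        amount NUMERIC NOT NULL CHECK (amount >= 0)
    );

    -- A deterministic smart contract: its outcome depends only on its
    -- arguments and the database state it reads.
    CREATE FUNCTION transfer_funds(frm TEXT, to_acct TEXT, qty NUMERIC)
    RETURNS VOID AS $$
    BEGIN
        UPDATE wallets SET amount = amount - qty WHERE owner = frm;
        UPDATE wallets SET amount = amount + qty WHERE owner = to_acct;
        -- Calls such as random(), now() or clock_timestamp() would make
        -- independent executions diverge across nodes and must be avoided.
    END;
    $$ LANGUAGE plpgsql;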

(6) Serializable isolation, ACID: Blockchain transactions require serializable isolation, in which dirty reads, non-repeatable reads, phantom reads, and serialization anomalies are not possible. Transactions, when executed in parallel, must follow the same serializable order across all nodes. Further, transactions should be ACID [22] compliant. Serializable isolation can be achieved in databases by employing: (i) strict 2-phase locking [38], (ii) optimistic concurrency control [32], or (iii) SSI [23, 43]. This needs to be enhanced to follow the block order as determined through consensus, but can be leveraged to a large extent.

(7) Asynchronous transactions: As transaction processing and consensus may involve a non-trivial delay, clients submit transactions asynchronously and then leverage notification mechanisms to learn whether their transaction was successfully committed. Databases provide support for notification channels and triggers that applications can leverage.

(8) Provenance: The auditable append-only log of transactions in blockchains can be harnessed as a provenance store, for several use cases including supply chain tracking and financial compliance. However, most blockchain platforms today do not support complex queries on historic data, or are not optimized for provenance queries. In certain relational databases which support multi-version concurrency control [19], such as snapshot isolation [18], all versions of a row are maintained and old rows are purged on demand or periodically. However, a SQL query cannot access the old rows. For provenance queries, we need to disable purging and enable queries on the old rows.

3. DESIGN OF A BLOCKCHAIN DATABASE

We identify two approaches to achieve our goal of building a consistent replicated ledger across untrusted nodes starting from a relational database. In the first approach, order-then-execute, a consensus service orders all the transactions and then nodes execute them concurrently, whereas in the second approach, execute-order-in-parallel, transaction execution happens on nodes without prior knowledge of ordering while block ordering happens parallelly through a consensus service. Intuitively, while the first approach is simpler and requires fewer modifications to the relational database, the second approach has the potential to achieve better performance. We design and implement both approaches to study their trade-offs.

We first describe the key components of our system in §3.1. We provide background on serializable snapshot isolation in §3.2 and show later in §3.4 that SSI, if directly applied, does not guarantee serializability and consistency across nodes for the execute-order-in-parallel approach. Figure 1 juxtaposes the two proposed approaches, and we describe them in detail in §3.3 and §3.4, including a novel SSI-based-on-block-height technique. We discuss security properties of the proposed approaches in §3.5. We describe a mechanism for peer node recovery in §3.6 and then discuss how a permissioned blockchain network can be bootstrapped in §3.7.

3.1 Key Components

We consider a permissioned network of organizations that are known to one another but may be mutually distrustful. The network is private to the participating organizations, and a new organization must be permissioned to join the network. Each organization may include clients, database peer nodes, and ordering service nodes, which we describe below, that together form the decentralized network.

Client: Each organization has an administrator responsible for onboarding client users onto the network. The administrator and each client have a digital certificate registered with all the database peers in the system, which they use to digitally sign and submit transactions on the network (we describe the network setup process in §3.7). This helps support the authenticity, non-repudiability and access control properties. Clients may also listen on a notification channel to receive transaction status.

Database Peer Node: An organization may run one or more database nodes in the network. All communication to send and receive transactions and blocks happens via a secure communication protocol such as TLS. Each node also has a cryptographic identity (i.e., public key) and all communications are signed and authenticated cryptographically. Each database node maintains its own replica of the ledger as database files, independently executes smart contracts as stored procedures, and validates and commits blocks of transactions formed by the ordering service.

Ordering Service: Consensus is required to ensure that the untrusted database nodes agree on an ordering of blocks of transactions. In order to leverage the rich literature on consensus algorithms with different trust models, such as crash fault tolerant (e.g., Raft [40], Paxos [34], Zab [31]) and byzantine fault tolerant (e.g., PBFT [24], XFT [37]) consensus, we make the ordering service in our system pluggable and agnostic to the database implementation. The ordering service consists of consensus or orderer nodes, each owned by a different organization. Each orderer node, similar to database nodes, has its own digital certificate or identity. The output of consensus yields a block of transactions, which is then atomically broadcast to all the database nodes. A block consists of (a) a sequence number, (b) a set of transactions, (c) metadata associated with the consensus protocol, (d) the hash of the previous block, (e) the hash of the current block, i.e., hash(a, b, c, d); and (f) a digital signature on the hash of the current block by the orderer node.
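Purely to illustrate fields (a)-(f), the following is a hypothetical relational rendering of a block; the implementation actually stores blocks in an append-only file (pgBlockstore, §4.2), so this schema is only a sketch:

    CREATE TABLE blocks (
        sequence_no  BIGINT PRIMARY KEY,  -- (a) block sequence number
        transactions BYTEA NOT NULL,      -- (b) serialized set of transactions
        metadata     BYTEA,               -- (c) consensus-protocol metadata
        prev_hash    BYTEA NOT NULL,      -- (d) hash of the previous block
        block_hash   BYTEA NOT NULL,      -- (e) hash over (a, b, c, d)
        orderer_sig  BYTEA NOT NULL       -- (f) orderer's signature on (e)
    );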


Figure 1: Proposed transaction flows—order-then-execute and execute-and-order-in-parallel

3.2 Serializable Snapshot Isolation (SSI)

Strict 2-phase locking (S2PL), optimistic concurrency control (OCC), and SSI are approaches to achieve serializability. As SSI offers greater concurrency than S2PL and OCC, we have chosen SSI in our design to ensure serializability. SSI achieves serializability using a modified snapshot isolation (SI) technique.

Snapshot Isolation and Anomalies. In SI, each transaction reads from a consistent snapshot of the database comprising the last committed values that existed at the time the transaction started. Although SI prevents dirty reads, non-repeatable reads, and phantom reads, it cannot guarantee serializability due to SI anomalies [18] which violate consistency (i.e., C in ACID), specifically integrity constraints (refer to [18, 23, 43] for examples). Hence, Cahill et al. [23] proposed SSI to detect and resolve anomalies automatically to ensure serializability.

Background on SI Anomalies. To detect SI anomalies, Adya et al. [15] proposed the Multi-Version Serialization history Graph (MVSG). This graph contains a node per transaction, and an edge from transaction T1 to transaction T2 if T1 must have preceded T2 in the apparent serial order of execution. Three types of dependencies can create these edges:

• rw-dependency: if T1 writes a version of some object, and T2 reads the previous version of that object, then T1 appears to have executed after T2, because T2 did not see the update. To represent this, an edge from T2 to T1 with a label rw needs to be added. As we will see, these rw-conflicts are central to SSI. Further, a dependency can also be caused by predicate reads.

• wr-dependency: if T1 writes a version of an object, and T2 reads that version, then T1 appears to have executed before T2 (an edge from T1 to T2 with a label wr).

• ww-dependency: if T1 writes a version of some object, and T2 replaces that version with the next version, then T1 appears to have executed before T2 (an edge from T1 to T2 with a label ww).

If a cycle is present in the graph, then the execution does not correspond to any serial order, i.e., an SI anomaly has caused a serializability violation. Otherwise, the serial order can be determined using a topological sort.

In SI, a wr- or ww-dependency from T1 to T2 denotes that T1 must have committed before T2 began. In practice, writes acquire an exclusive lock to prevent a ww-dependency between two concurrent transactions. A wr-dependency also cannot occur between two concurrent transactions. In contrast, only rw-dependencies occurring between concurrent transactions are relevant when studying serialization anomalies.

Adya et al. [15] observed that every cycle in a serialization graph contains at least two rw-dependency edges. Fekete et al. [26] subsequently found that two such edges must be adjacent. Figure 2(a) shows the only possible cycle with two transactions, and Figures 2(b) and (c) show the two possible cycles with three transactions. If any of the transactions is aborted, a serializable order could be achieved.

Figure 2: SI Anomalies between 2 and 3 transactions.
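As a concrete illustration of the two-transaction cycle in Figure 2(a), consider the classic write-skew pattern; this is our example, not the paper's, on a hypothetical doctors table. Under plain SI both transactions commit, leaving no doctor on call, while PostgreSQL's SERIALIZABLE level (SSI) aborts one of them:

    -- Setup (hypothetical):
    CREATE TABLE doctors (name TEXT PRIMARY KEY, on_call BOOLEAN NOT NULL);
    INSERT INTO doctors VALUES ('alice', true), ('bob', true);

    -- Session 1:
    BEGIN ISOLATION LEVEL SERIALIZABLE;
    SELECT count(*) FROM doctors WHERE on_call;              -- sees 2
    UPDATE doctors SET on_call = false WHERE name = 'alice';

    -- Session 2 (concurrently):
    BEGIN ISOLATION LEVEL SERIALIZABLE;
    SELECT count(*) FROM doctors WHERE on_call;              -- also sees 2
    UPDATE doctors SET on_call = false WHERE name = 'bob';

    -- Session 1: COMMIT;  -- succeeds
    -- Session 2: COMMIT;  -- fails with a serialization failure: each
    -- transaction rw-depends on the other, forming the cycle of
    -- Figure 2(a), so SSI aborts one of them to break it.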

SSI - Resolving Anomalies. SSI automatically detects and resolves anomalies. As tracking rw- and wr-dependencies, and subsequently detecting cycles, is costly, SSI applies heuristics that are anomaly-safe, but could result in false positives. They are:

(1) abort immediately: Cahill et al. [23] used two flags per transaction T: (a) inConflict—set when there is a rw-dependency from a concurrent transaction to T; (b) outConflict—set when there is a rw-dependency from T to a concurrent transaction. As soon as both of these flags are set for T, which could lead to an SI anomaly, T is aborted.

(2) abort during commit: Ports et al. [43] maintained two lists per transaction T: (a) inConflictList—maintains a list of concurrent transactions from which there is a rw-dependency to T; (b) outConflictList—maintains a list of concurrent transactions to which T has a rw-dependency. The transactions present in the inConflictList of T are called nearConflicts. The transactions present in the inConflictList of each nearConflict are called farConflicts. For example, in Figure 2(b), for T2, T1 is a nearConflict and T3 is a farConflict. Recall that a rw-dependency occurs only between concurrently executing transactions (such as in Figures 2(a) and (b)). For each pair of nearConflict and farConflict, if both transactions are not yet committed, then this heuristic aborts the nearConflict so that an immediate retry can succeed. In contrast, a wr-dependency only occurs between a committed and a just-commenced transaction (Figure 2(c)). In this case, only if T3 has committed first is its nearConflict T2 aborted. Otherwise, no transactions are aborted. In other words, while the heuristic does not track wr-dependencies, it accounts for their possibility and aborts a transaction whose outConflict has committed.

In the next two sections, we describe the transaction flows for (1) order-then-execute and (2) execute-and-order-in-parallel, as shown in Figure 1. In both flows, transactions are committed asynchronously across nodes.

3.3 Order then Execute

A transaction submitted by a client in the order-then-execute approach comprises (a) a unique identifier, (b) the username of the client, (c) the PL/SQL procedure execution command with the name of the procedure and arguments, and (d) a digital signature on the hash(a, b, c) using the client's private key. The transaction flow consists of four pipelined phases: ordering, execution, committing, and checkpointing, wherein a submitted transaction needs to first be ordered, then executed, and finally committed before recording a checkpoint.

3.3.1 Ordering Phase

Clients submit transactions directly to any one of the ordering service nodes (it is possible for a peer to act as a proxy for the client to submit transactions to the ordering service, as long as the client trusts the peer to not drop its transactions). On a periodic timeout (say every 1 second), the ordering service nodes initiate a consensus protocol among themselves to construct a block of transactions. After constructing a block, the ordering service delivers it to all nodes in the network via atomic broadcast. In Figure 1(a), steps 2 and 3 denote this phase.

3.3.2 Execution Phase

On receiving a block of transactions, each database node verifies whether the received block is in sequence and sent by the ordering service. On successful verification, the node appends the block to a block store which is maintained in the local file system. In parallel, the node retrieves unprocessed blocks one at a time, in the order of their block sequence number, from the block store and performs the following four operations:

1. The database node assigns a thread per transaction (provided that the identifier present in a transaction's field is unique) to authenticate and execute it. In an append-only ledger table, it records each transaction in a block. This ledger table is used for recovery as explained in §3.6 and also for provenance queries as demonstrated in §4.2.

2. Each thread retrieves the public key associated with the username in the transaction, to verify the user's digital signature. On successful authentication, the username is set for the thread's session, which is needed for access control. We leverage the database's access control capabilities without modification.

3. Each thread executes the PL/SQL procedure with the passed arguments as per the transaction. To ensure that on all nodes the transactions are executed on the same committed state, all valid transactions in a block are executed concurrently using the abort during commit SSI variant. This helps in ensuring that the set of transactions marked to be committed is the same, and follows the same serializable order, across all nodes.

4. Once a thread completes execution of a transaction, the transaction would be ready to either commit or abort (as per the procedure's execution logic), but waits without proceeding. This is because proceeding could result in a different commit order on different nodes if the execution completion order differs (which could result in different aborts by SSI on each node).

Only after committing a block of transactions does the execution phase process the next block. In Figure 1(a), step 4 denotes the execution phase.

3.3.3 Committing Phase

To ensure that the commit order is the same on all nodes, the order in which the transactions get committed is the order in which the transactions appear in the block. Only when all valid transactions are executed and ready to be either committed or aborted does the node serially notify one thread at a time to proceed further. Every transaction applies the abort during commit approach to determine whether to commit, and only then does the next transaction enter the commit stage. While it is possible to apply SSI for all transactions at once to obtain a serializable order, we defer such optimizations for simplicity. The node records the transaction status in the ledger table and emits an event via a notification channel to inform the client. In Figure 1(a), step 5 denotes the committing phase.

There is one additional challenge. In SSI, a ww-dependency is handled using an exclusive lock. For example, if T1 and T2 are competing to write an object, only one transaction can acquire a lock (say T1) and the other (say T2) must wait for the lock to be released, which can happen only after the commit/abort of T1. However, in our system, to ensure consistency across all nodes, the committing phase cannot start unless all transactions complete execution. So, we cannot use locks for ww-dependencies. However, as the ordering between transactions is determined by consensus and fixed across all nodes, we leverage this to permit both transactions to write to different copies of the object, but subsequently commit only the first as determined by the ordering. We show how such an implementation is possible in §4.
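A minimal sketch of such a ledger table and client notification, under our assumptions (the implementation's actual table, called pgLedger in §4, may differ):

    -- Append-only ledger table recording every transaction in every block.
    CREATE TABLE ledger (
        block_no   BIGINT NOT NULL,
        tx_seq     INT    NOT NULL,  -- position of the transaction in the block
        tx_id      TEXT   NOT NULL,  -- client-supplied unique identifier
        username   TEXT   NOT NULL,
        invocation TEXT   NOT NULL,  -- PL/SQL procedure call with arguments
        signature  BYTEA  NOT NULL,
        status     TEXT,             -- 'committed' or 'aborted', set at commit time
        PRIMARY KEY (block_no, tx_seq)
    );

    -- After deciding a transaction's fate, record it and notify the client:
    UPDATE ledger SET status = 'committed' WHERE block_no = 42 AND tx_seq = 0;
    NOTIFY tx_status, '42:0:committed';  -- clients LISTEN on this channel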

It is noteworthy that certain existing blockchain platforms such as Ethereum also follow an order-then-execute approach. However, they execute transactions sequentially in the order they appear in the block to ensure serializability and consistency across nodes, which affects performance. In contrast, leveraging SSI to execute transactions concurrently, and then sequentially issuing commits as per the ordering from consensus, enables us to optimize performance.

3.3.4 Checkpointing Phase

Once all transactions in a block are processed, each node computes the hash of the write set, which is the union of all changes made to the database by the block, and submits it to the ordering service as a proof of execution and commit. When a node receives subsequent blocks, it would receive the hashes of the write set computed by other nodes. The hash computed by all non-faulty nodes would be the same, and the node then proceeds to record a checkpoint (note that this is different from Write-Ahead Logging checkpointing). It is not necessary to record a checkpoint every block. Instead, the hash of write sets can be computed for a preconfigured number of blocks and then sent to the ordering service. In Figure 1(a), steps 6, 7, and 8 denote the checkpointing phase.

3.4 Execute and Order in Parallel

A transaction submitted by a client in the execute-order-in-parallel approach comprises (a) the username of the client, (b) the PL/SQL procedure execution command with the name of the procedure and arguments, (c) a block number, (d) a unique identifier which is computed as hash(a, b, c), and (e) a digital signature on the hash(a, b, c, d) using the client's private key. The transaction flow consists of four phases: execution, ordering, committing, and checkpointing. A submitted transaction is executed by the database nodes and, in parallel, ordered by the ordering nodes and placed in a block. This is followed by the commit and checkpoint phases. We describe each phase in detail below.

3.4.1 Execution Phase

Clients submit transactions directly to one of the database nodes. When a node receives a transaction, it assigns a thread to authenticate, forward and execute the transaction. On successful authentication (same as in the order-then-execute approach), the transaction is forwarded to other database nodes and the ordering service in the background. The four steps described in §3.3.2 for the execution phase apply for the execute-order-in-parallel approach as well, except for the default SSI and the logic to handle duplicate transaction identifiers. This is shown in steps 1-3 in Figure 1(b). Unlike the order-then-execute approach, where execution of transactions happens after ordering and on the committed state from the previous block, in the execute-order-in-parallel approach we endeavor to execute and order in parallel. To ensure that the transaction is executing on the same committed data on all nodes, the submitted transaction includes the block height on which it should be executed (the client can obtain this from the peer it is connected with). We propose SSI based on block height, as depicted in Figure 3, because the default SSI cannot ensure that transactions are executed on the same committed state on all nodes: each node can be at a different block height depending on its processing speed.

Figure 3: Logical Database Storage to Enable Snapshot Isolation Based on Block Height. (The figure depicts one logical database state per block height; for example, a transaction with snapshot-height 1 can access only the state as of block height 1.)

SSI Based on Block Height: In this isolation level, each row of a table contains a creator block number and a deleter block number, which denote the block that created and deleted this row, respectively (irrespective of the transaction within the block that created it). Note that the creator and deleter block numbers are part of the implementation for the order-then-execute approach as well, but are used only for provenance queries and are not required for SSI. During a transaction execution, based on the block number specified, the transaction can view the database state comprising all commits up to this block height, as shown in Figure 3. We refer to this specified block number as the transaction's snapshot-height. In other words, a transaction sees all rows with creator less than or equal to its snapshot-height. For all such rows, the deleter should be either empty or greater than the snapshot-height. For this SSI to work, we need to maintain all versions of a row: every update should be a flagging of the old row and an insertion of the new row, and every delete should be a marking of the deleter. This facilitates the transaction being executed on the same committed data on all nodes.
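A sketch of this visibility rule as SQL, assuming hypothetical creator and deleter columns on a table named accounts and a snapshot-height of 100:

    CREATE TABLE accounts (
        id      INT,
        balance NUMERIC,
        creator BIGINT NOT NULL,  -- block that created this row version
        deleter BIGINT            -- block that deleted it, NULL if live
    );

    -- Rows visible to a transaction executing at snapshot-height 100:
    SELECT *
    FROM accounts
    WHERE creator <= 100                        -- created at or before height 100
      AND (deleter IS NULL OR deleter > 100);   -- not yet deleted at that height

    -- An update at block height 101 never modifies a row in place; it marks
    -- the old version deleted and inserts its successor:
    UPDATE accounts SET deleter = 101 WHERE id = 7 AND deleter IS NULL;
    INSERT INTO accounts (id, balance, creator, deleter)
    VALUES (7, 950, 101, NULL);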

However, the current block (current-block) processed for commit by a node might be either lower or higher than the specified snapshot-height of a transaction. If the current-block is lower, the transaction starts executing once the node completes processing all blocks and transactions up to the specified snapshot-height. If the current-block is higher, the transaction is executed immediately, but the serializability requirement could be violated because of a phantom or stale data read, which needs to be detected and handled. For example, assume that a transaction is updating all rows in a table which satisfy a given predicate mentioned in a WHERE clause. There is a possibility that a row that satisfies the predicate was committed by a block with a number greater than the specified snapshot-height and less than or equal to current-block. In this scenario, a phantom read [18] has occurred that violates serializability. Similarly, a transaction can read a row from a snapshot as of snapshot-height, but if that row was either updated or deleted by a subsequent block, it would result in a stale read for this transaction.

In order to ensure serializability, the proposed SSI approach detects such phantom reads and stale data reads to abort the corresponding transaction. To detect and abort such a transaction, the proposed approach applies row visibility logic on the committed state:

1. when a row with creator greater than the specified snapshot-height satisfies the given predicate, abort the transaction provided that the deleter is empty (handles phantom read);

2. when a row with creator less than snapshot-height satisfies the given predicate, abort the transaction provided that the deleter is non-empty and greater than snapshot-height (handles stale read).

Further, concurrent transactions which are going to be committed during or after current-block can also create a phantom read or a stale read problem. Such cases are tracked by our proposed SSI as described in the committing phase.

3.4.2 Ordering Phase

Database nodes submit transactions to the ordering service, unlike the order-then-execute approach. Note that the transactions are being executed on the database nodes while they are being ordered by the ordering service. The rest of the logic is the same as explained in §3.3.1 for the order-then-execute approach. In Figure 1(b), steps 4 and 5 denote the ordering phase.

3.4.3 Committing Phase

Similar to the order-then-execute approach, an important pre-condition for entering the commit phase is that all transactions in a block must have completed execution and be waiting to proceed with commit/abort. However, there are two key differences compared to the commit phase of order-then-execute. First, after receiving a block, if all transactions are not running (this occurs if the node has not received communication of a transaction from another peer due to maliciousness or network delay), the committer starts executing all missing transactions and waits for their completion before proceeding to the committing phase. Second, unlike the previous approach, it is possible for concurrent transactions to be executing at different snapshot-heights (as specified by the respective clients). Further, transactions that are concurrent on one node may not be concurrent on another, but we need the set of transactions that are decided to be committed to be the same on all nodes. As a result, we don't support blind updates such as UPDATE table SET column = value; which might result in a lock for a ww-dependency only on a subset of nodes. Note, ww-dependencies are handled in the same way as described in §3.3.3. Further, instead of employing the abort during commit variant of SSI, we propose a block-aware abort during commit variant of SSI, described next.

SSI variant—block-aware abort during commit. Table 2 presents the abort rules for the proposed SSI variant. In addition to the nearConflict and farConflict used in abort during commit, our variant considers two additional parameters:

1. whether the nearConflict and farConflict are in the same block as T;

2. if they are in the same block, which among them is earlier as per the ordering.

Table 2: Abort rule for our proposed SSI variant when transaction T is committing.

    nearConflict in | farConflict in | To commit first  | Abort
    same block?     | same block?    | among conflicts  |
    ----------------+----------------+------------------+--------------
    yes             | yes            | nearConflict     | farConflict
    yes             | yes            | farConflict      | nearConflict
    yes             | no             | nearConflict     | farConflict
    no              | yes            | farConflict      | nearConflict
    no              | no             | -                | nearConflict
    no              | none           | -                | nearConflict

When either or both of the conflicts are in the same block as transaction T, it is straightforward to abort the one that comes later in the ordering, and this is deterministic on all nodes. The tricky case is when neither of the conflicts is in the same block. In this case, we abort the nearConflict transaction. Note, the nearConflict is not in the same block but executes concurrently with T; this means that the snapshot-height specified for both could be less than or equal to the current block height at the node. With no synchronization of transaction executions between nodes, it is possible for an SI anomaly to occur at only a subset of nodes. To ensure consistency between nodes, we need to ensure that the same set of transactions is aborted on all nodes. Let us consider the possible scenarios on other nodes (say T2 is the nearConflict transaction):

1. If T2 is concurrent with T and an anomaly structure is detected, then T2 is aborted as per our heuristic.

2. If T commits before T2 starts execution, then T2, being a nearConflict for T, read a stale value and would be aborted as discussed earlier.

3. If T2 is concurrent with T, but T is committed first, then this is a case of a rw-dependency where the outConflict has already committed, leading to an anomaly structure (similar to Figure 2(c)), and T2 will be aborted in this case as well.

Even if there is no farConflict, the nearConflict would get aborted (if it is not in the same block as T), as it could result in a stale read at only a subset of nodes, whereas on other nodes it might result in a rw-dependency without an anomaly structure. Hence, we must abort the nearConflict irrespective of the presence of a farConflict, whenever the nearConflict is not in the same block.

The unique identifier used for a transaction must be the hash of (a) the username of the client, (b) the PL/SQL procedure execution command with the name of the procedure and arguments, and (c) the block number specified for the SSI by the client. The reason is that if two different transactions are submitted to two different nodes with the same unique identifier, whichever transaction executes first on a given node is the one that would succeed, whereas the other would fail due to the duplicate identifier. As this can result in an inconsistent state across nodes, the unique identifier is composed of the three fields in the transaction to ensure that no two different transactions have the same identifier.

3.4.4 Checkpointing Phase

This phase is the same as the one explained in §3.3.4.
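The abort rules of Table 2 (§3.4.3) can be condensed into a small decision function; the following PL/pgSQL encoding is our own illustration, not the paper's implementation:

    -- Returns which conflicting transaction to abort when T commits.
    -- near_in_block / far_in_block: is the conflict in the same block as T?
    -- near_first: among the two conflicts, is nearConflict ordered first?
    CREATE FUNCTION block_aware_abort(near_in_block BOOLEAN,
                                      far_in_block  BOOLEAN,
                                      near_first    BOOLEAN)
    RETURNS TEXT AS $$
    BEGIN
        IF near_in_block AND far_in_block THEN
            -- Both ordered by consensus: abort whichever commits later.
            RETURN CASE WHEN near_first THEN 'farConflict'
                        ELSE 'nearConflict' END;
        ELSIF near_in_block THEN
            RETURN 'farConflict';   -- the in-block nearConflict commits first
        ELSE
            RETURN 'nearConflict';  -- a nearConflict outside the block is
        END IF;                     -- always aborted, farConflict or not
    END;
    $$ LANGUAGE plpgsql IMMUTABLE;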

3.5 Discussion on Security Properties

(1) Submission of invalid transactions. A malicious client can potentially submit a large number of invalid transactions (e.g., ones that will eventually fail due to stale reads, or that try to perform operations they do not have access for) in an attempt to launch a denial-of-service attack. This can be thwarted in one of two ways. First, similar to permissionless blockchain networks, a transaction fee could be levied for each transaction using a native currency (it is possible to implement a native currency in our system if desired). Second, by monitoring clients and their behavior on the network, it is possible to exclude them from participation. We leave such enhancements for future work as they do not affect the core design of our system.

(2) Obscuration of valid transactions. In the order-then-execute approach, when a malicious orderer node receives a transaction from the user, it might not include the transaction in a block. Similarly, in the execute-order-in-parallel approach, when a malicious database node receives a transaction from the user, it might not forward the transaction to other database and orderer nodes. In both scenarios, at the client side, the transaction request would time out. The client can then submit the same transaction to some other orderer or database node, depending on the approach. Note that even if the client-side timeout was a false alarm (i.e., the transaction was forwarded, included in a block and executed), the resubmission of the same transaction does not affect data consistency, as all nodes check for the unique identifier included in the transaction before execution. In the execute-order-in-parallel approach, if the database node forwards the transaction to an orderer node but not to other database nodes, eventually the request would be added into a block and delivered to all database nodes. The default committing phase described in §3.4 would take care of executing missing transactions while processing the block.

(3) Withholding of a block or transaction commit. In both approaches, when a malicious database node receives a block, it might skip committing the block or a particular transaction in that block. In such a scenario, the hash computed during the checkpointing phase would differ from that of the other nodes, and it would become evident during the checkpointing process that the malicious node did not commit the block correctly. As other organizations can detect such maliciousness, there is no incentive for the malicious organization to engage in such a scenario. Since database nodes validate and commit blocks independently of one another, a malicious database node cannot hamper the liveness of the network.

(4) Byzantine ordering nodes. A malicious ordering node could send an incorrect block to the database nodes connected to it. If an organization's database node trusts the ordering node it operates, but the ordering node is malicious, then that database node would also become faulty. This scenario would then be similar to the previous scenario of the database node being malicious. If a database node does not trust the ordering node, it should obtain blocks from 2f + 1 ordering nodes (assuming the consensus algorithm tolerates f failures), to ensure it obtains the correct block after consensus.

(5) Tampering of user data. In both approaches, the user can submit a read query to a single database node to fetch the stored data. A malicious database node can tamper with the data and return an incorrect result to the user. The following are two ways to detect such malicious behavior. The user can submit the query to multiple database nodes and verify whether the results are the same. Otherwise, any of the existing query authentication methods [35, 36, 41, 52, 53] can be used to verify the integrity of query results. Further, when stored data is tampered with, it would eventually be identified through the checkpointing phase.

(6) Tampering of the blockchain. Each database node stores all blocks in a separate store called the block store. If any malicious database node tampers with its block store, it will not affect the replicas maintained at other organizations' nodes. In order to tamper with the block store and not be detected, the database node would need the private cryptographic keys of the orderer node as well as of the client who submitted the transaction, to forge their signatures. Even if the malicious node achieves this, if the majority of the nodes are non-malicious, honest organizations could prove that the malicious organization has tampered with the block store by comparing the replicated chains when a conflict occurs. Only if 51% of the nodes are malicious can a blockchain be successfully tampered with.

3.6 Recovery After a Failure

A blockchain network is required to tolerate node failures, and we need to ensure that a failed node recovers to a consistent state when it restarts. In both the order-then-execute and execute-order-in-parallel approaches, the block processing stage has the following two common operations per block: (1) atomically store all transactions' information in the ledger table along with the block number; (2) atomically store all transactions' status (i.e., commit/abort) in the ledger table. Note, only after all transactions are written to write-ahead logging and the default transaction logs does step 2 get executed. A node can fail during either of the above two stages. When the node comes back online, it executes the following operations to recover to a consistent state:

1. Retrieve the last processed block number from the ledger table and check whether all transactions in the ledger table have a status. Note, as we store all transactions' status atomically, either all transactions must have a status or none.

2. If a status is found in the ledger table for the transactions, it means that the block was committed successfully and no recovery is needed. Note, if a node fails after committing a transaction, during restart the default recovery mechanism that uses write-ahead logging (WAL) would take care of disk consistency.

3. If a status is not found in the ledger table for the transactions, the node might have (a) failed after committing all those transactions but before updating the ledger table with the status, or (b) failed before committing all or some of those transactions. If a transaction was successfully committed, the relational database would have recorded the status in the default transaction logs. The node first checks the status of each transaction by reading the default transaction logs. If a status is present for all transactions, the node updates the ledger table with the status (i.e., case (a)).

4. If a status is not present in the transaction log for all or some transactions (i.e., case (b)), the node needs to roll back all other committed transactions in the same block. Then the node can start re-executing all transactions as described in §3.3.2 and §3.4.1, commit them, and record the status in the ledger table. The rollback of committed transactions is required because we need to execute all transactions in a block parallelly using SSI at the same time, to get a result consistent with the other nodes.
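Assuming the hypothetical ledger table sketched in §3.3.3 above, the first two recovery checks could be expressed as follows (illustrative only):

    -- Step 1: find the last block recorded in the ledger table ...
    SELECT max(block_no) AS last_block FROM ledger;

    -- ... and check whether its transactions already have a status. Because
    -- statuses for a block are written atomically, either every row of the
    -- block has one or none does.
    SELECT count(*) FILTER (WHERE status IS NULL) AS missing_status
    FROM ledger
    WHERE block_no = (SELECT max(block_no) FROM ledger);
    -- missing_status = 0  -> block fully committed, no recovery needed (step 2)
    -- missing_status > 0  -> consult the transaction logs (steps 3 and 4)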

By the time a node restarts and recovers after a failure, a few more blocks may have been added in the network. When the node receives a block from the ordering service, it would notice a gap in the sequence numbers due to the missing blocks. To ensure consistency, the node then retrieves any missing blocks, and processes and commits them one by one. We observe that it is possible to do a single atomic commit for all transactions in a block after validating them serially. This can simplify the recovery process described above, but we have omitted describing it in the interest of simplicity.

3.7 Network Bootstrapping

To bootstrap a permissioned network, first, we need to bring up the database and orderer nodes, each with their respective admin users and secure communication via TLS between them. We need to then create users & roles for each organization, deploy PL/SQL procedures as smart contracts, and define access control policies.

Setting up database nodes and ordering service. At network startup, each organization shares data consisting of TLS certificates, domain names of database and orderer nodes, and credentials of admin users (i.e., public keys) with all other organizations. Each organization then starts its database and orderer nodes with the credentials of all organizations' admin users.

Support for blockchain and non-blockchain schema. As a part of database node startup, the node creates a default database with two schemas named blockchain and non-blockchain (the latter being optional, for any private data held by the organization). Any transaction submitted to a blockchain schema is treated as a blockchain transaction, which must follow the transaction flow described in §3.3 or §3.4. In the blockchain schema, both users and admins can execute only PL/SQL procedures and individual SELECT statements. Note, individual SELECT statements would not be recorded on the blockchain and can only be read-only operations. Other individual DML & DDL statements are not allowed to be executed directly, but only through PL/SQL procedures. On a non-blockchain schema, transactions are executed using the default single-node transaction flow implemented in a relational database. The non-blockchain schema in a database node is accessible only by the organization which owns that node. Users of other organizations can query or submit transactions only on the blockchain schema of the database node, and not on the non-blockchain schema. Users of an organization can execute reports or analytical queries combining the blockchain and non-blockchain schemas on the node owned by their organization.

Setting up of PL/SQL procedures. To facilitate deployment of smart contracts, each node exposes the following system smart contracts. These contracts are created in the blockchain schema during database node startup, can only be invoked by organization admins, are considered blockchain transactions, and follow the transaction flow described earlier.

1. create_deployTx() creates, replaces or drops a smart contract. The argument to this procedure can be a CREATE FUNCTION statement (to create a new smart contract), CREATE OR REPLACE FUNCTION (to update an existing smart contract), or DROP FUNCTION (to delete an existing smart contract). On a successful execution, this procedure creates a record in the deployment table with the passed argument, but does not yet execute it to CREATE/REPLACE/DROP.

2. submit_deployTx() executes the SQL statement present in the deployment table after verifying that an admin from each organization has approved the deployment transaction. If not all organizations have approved, the invocation returns an error. If a smart contract is updated, any uncommitted transactions that executed on an older version of the contract are aborted.

3. approve_deployTx() approves a deployment transaction by adding a digital signature provided by the organization's admin.

4. reject_deployTx() rejects a deployment transaction by adding a digital signature provided by the organization's admin and a reason for the rejection.

5. comment_deployTx() adds a comment to a deployment transaction. When an admin wants to suggest a change to a smart contract present in a deployment transaction, this system contract can be used.

Further, each node creates three more system smart contracts to create, delete, and update users with cryptographic credentials. As invocations of all system smart contracts are recorded as blockchain transactions, the network holds an immutable transaction log and a history of approvals from organizations before deploying smart contracts. For access control policies, the network cannot rely on a system contract, as it is unaware of the tables to be created and the privileges for each user. Hence, access control policies need to be embedded within a smart contract itself. Once users' smart contracts are created successfully using the above system contracts, applications can start invoking them.
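For instance, deploying a new smart contract could proceed as follows; the exact signatures of these system contracts are not fixed by the paper, so the call syntax here is illustrative:

    -- An org admin proposes a new smart contract (recorded, not yet executed):
    SELECT create_deployTx($$
        CREATE FUNCTION transfer_funds(frm TEXT, to_acct TEXT, qty NUMERIC)
        ...
    $$);

    -- Admins of every other organization approve (or reject) the proposal,
    -- here assuming the deployment transaction was assigned id 1:
    SELECT approve_deployTx(1);

    -- Once every organization has approved, any admin executes the deployment:
    SELECT submit_deployTx(1);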


Figure 4: Transaction Flow in PostgreSQL for execute-order-in-parallel

A snapshot comprises the set of transaction IDs committed as of the start of the current transaction, whose effects are visible. Each row has two additional elements in its header, namely xmin and xmax, which are the IDs of the transactions that created and deleted the row, respectively. Note that every update to a row is a delete followed by an insert (both in the table and in the index). Deleted rows are flagged by setting xmax instead of being actually deleted. In other words, PostgreSQL maintains all versions of a row, unlike implementations such as Oracle that update rows in place and keep a rollback log. This is ideal for our goal of building a blockchain that maintains all versions of data. For each row, a snapshot checks xmin and xmax to see which of these transaction IDs are included in the snapshot, to determine row visibility.

Clients connect to the PostgreSQL server, postgres, using an application interface such as libpq [10]. A backend process [9] is then assigned to each client connection to execute its queries. PostgreSQL supports background workers [8] (additional processes other than backends) that perform other activities such as logical replication and statistics collection. The postgres server, backends, and background workers coordinate and share data through shared memory regions. Both the shared memory data structures and the background workers are easily extensible.

4.2 New Components Introduced

Communication Middleware & Block Processor. We introduced two new background workers: (1) the communication middleware, which communicates with other nodes and the orderer to transfer and receive transactions and blocks. Received blocks are stored in an append-only file named pgBlockstore. The middleware is also responsible for starting a transaction using libpq in the order-then-execute flow, and for starting any missing transactions in the execute-order-in-parallel flow; (2) the block processor, which processes a block and executes the commit phase as described in §3.3.3 and §3.4.3.

Shared Memory Data Structures. We introduced the following four data structures in shared memory (the last two are used only for the execute-order-in-parallel flow):

1. TxMetadata enables communication and synchronization between the block processor and the backends executing transactions (as needed in the commit phase). The block processor uses this data structure to check whether all transactions have completed their execution and to signal each backend to proceed further. Each entry consists of the global transaction identifier (as present in the transaction), the transaction ID assigned locally by the node, the process ID of the backend, a semaphore denoting whether the transaction has completed its execution and is waiting for a signal to proceed further, and a final status (i.e., commit/abort).

2. BlockProcessorMetadata helps the backend signal the block processor once it commits/aborts a transaction. Additionally, this data structure holds the last committed block number, the current block number (so that a backend can use it to set the creator & deleter block numbers in rows), and a semaphore to enable signaling from the middleware to the block processor.

3. TxWriteSet holds the write-set of each transaction so that, after SSI validation and before writing to the WAL, the backend can store the creator or deleter in each row. Note that TxWriteSet keeps only a pointer to the row in the buffer: for an update, it stores both the old and the new row pointer; for a delete, only the old row pointer. TxWriteSet is also used to compute the hash of the write-set for each block.

4. TxOrder helps the backend apply our proposed block-aware abort-during-commit variant of SSI. It stores the global transaction identifier of each transaction in the block, as per the commit order, to aid in finding whether a nearConflict or farConflict is present in the same block.

Blockchain Related Catalog Tables. We introduced two system catalog tables, pgLedger and pgCerts. The pgLedger table is the ledger table described in §3.6, and stores information about each transaction such as its global identifier, local transaction ID, the query passed in the transaction, the client who submitted it, and the commit/abort status. It is used for recovery and for supporting provenance queries. The pgCerts table stores the cryptographic credentials of all blockchain users.
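The exact DDL of these catalog tables is not given here; the following is a minimal illustrative sketch, with column names inferred from the description above and from the provenance queries in Table 3:

    -- Illustrative sketch only: the exact catalog definitions are not in
    -- the text; columns are inferred from §4.2 and the queries in Table 3.
    CREATE TABLE pgLedger (
        blockNumber BIGINT,      -- block in which the transaction was ordered
        txid        BIGINT,      -- transaction ID assigned locally by the node
        uuid        TEXT,        -- global transaction identifier
        "user"      TEXT,        -- client who submitted the transaction
        query       TEXT,        -- queries passed in the transaction
        commitTime  TIMESTAMP,   -- used in time-bounded audit queries
        status      TEXT         -- final status: commit or abort
    );

    CREATE TABLE pgCerts (
        "user" TEXT PRIMARY KEY, -- blockchain user
        cert   BYTEA             -- cryptographic credentials
    );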

Table 3: Example provenance queries to audit user transactions.

Audit scenario: Get all invoice details from table invoices which were updated by a supplier S between blocks 100 and 300.
Query:
    SELECT invoices.* FROM invoices, pgLedger
    WHERE pgLedger.blockNumber BETWEEN 100 AND 300
      AND pgLedger.user = S
      AND invoices.xmax = pgLedger.txid
      AND invoices.deleter IS NULL;

Audit scenario: Get all historical details of an invoice from table invoices whose invoiceID (primary key) is k and which was updated by either supplier S or manufacturer M in the last 24 hours.
Query:
    SELECT invoices.* FROM invoices, pgLedger
    WHERE invoiceID = k
      AND pgLedger.user IN (S, M)
      AND pgLedger.commitTime > now() - interval '24 hours'
      AND invoices.xmax = pgLedger.txid
      AND invoices.deleter IS NULL;

Provenance Query. We introduced a special type of read-only query called a provenance query. This query type can see all committed rows present in tables, irrespective of whether a row is inactive (i.e., marked with xmax) or active. As it can access all historical content, it enables support for very complex analytical and audit queries with the help of pgLedger. Table 3 presents two examples of provenance queries.

4.3 Components Modified

Application Interface & Deterministic PL/SQL Procedures. We have enhanced the default application interface, i.e., libpq, with additional APIs to submit blockchain transactions & provenance queries, and to fetch the latest block height at the node. To make PL/SQL procedures deterministic, we have restricted the usage of the date/time library, random functions from the mathematics library, sequence manipulation functions, and system information functions. Further, SELECT statements must specify ORDER BY primary_key when using LIMIT or FETCH. Additionally, procedures cannot use row headers such as xmin and xmax in a WHERE clause.

SSI Based on Block Height. We have added two fields to each row: (i) the creator block number and (ii) the deleter block number. During commit, these two fields are filled for entries in the TxWriteSet, depending on whether an entry is an update, insert, or delete. SI applies a row visibility logic using xmin and xmax to identify whether a row should be seen by a transaction (as described in §4.1). We enhance the row visibility logic with additional conditions that compare the row's creator and deleter block numbers against the snapshot-height of the transaction (as described in §3.4.1). This is in addition to the default row visibility logic, which helps avoid seeing rows updated/deleted by concurrent transactions.
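Concretely, the enhanced check can be stated as follows (a plausible formalization of the rule just described, applied on top of the default xmin/xmax conditions): a transaction executing against snapshot-height h sees a row version r only if

    visible(r, h)  <=>  r.creator <= h  AND  (r.deleter is unset OR r.deleter > h)

i.e., the row version was created at or below the snapshot height and had not yet been deleted as of that height.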
During predicate reads, the default visibility logic in PostgreSQL traverses rows via the index entries that satisfy the given predicate, or traverses the whole table when an index is missing. For our approach to work (mainly to avoid the phantom and stale reads described in §3.4.1), all read accesses are enforced to happen only via index entries that satisfy the given predicate clause. Otherwise, there is a high possibility of transactions getting aborted due to false-positive phantom or stale reads. Hence, if an index is not available for the given predicate clause, nodes abort the transaction. Further, SELECT * FROM table_name; is not allowed from PL/SQL procedures, as it always traverses the whole table. It is possible to implement our approach without the need for an index, but for simplicity we defer such optimizations. Note that individual SELECT statements without any writes on the blockchain schema are not affected by SSI, as such a transaction is marked read-only and executed on one node only.

SSI Block-Aware Abort During Commit & ww-dependency. For the execute-order-in-parallel approach, we have modified the abort rule to follow our proposed rules in Table 2. The modified SSI utilizes the TxOrder data structure in shared memory. For ww-conflicts, we allow writes to the same object by different transactions (as updates are in any case not in-place in PostgreSQL) by maintaining an array of xmax values comprising the transaction IDs of all competing transactions in the row being updated. During commit, for each old row entry in the TxWriteSet, the backend (corresponding to the transaction that is committing now) checks the xmax values and marks all other competing transactions for abort, since only one transaction can write to the row if lost updates are to be avoided. Finally, it retains only its own transaction ID in the xmax field.

4.4 Ordering Service

As described in §3.1, any consensus algorithm can be leveraged. For this work we use two different ordering services: a crash fault-tolerant one based on Apache Kafka [1] and ZooKeeper [2], and a byzantine fault-tolerant one based on BFT-SMaRt [21]. Clients/peers connect to independent orderer nodes to submit transactions and receive created blocks. Orderer nodes either (1) connect to a Kafka cluster and publish all received transactions to a Kafka topic, which delivers the transactions in FIFO order, or (2) submit the transactions to a BFT-SMaRt cluster, which is then responsible for totally ordering the received transactions. For creating a block of transactions, we use two parameters: block size, the maximum number of transactions in a block, and block timeout, the maximum time since the first transaction to appear in a block was received. Each orderer node either publishes a time-to-cut message to the Kafka topic or sends it to the BFT-SMaRt cluster when its timer expires. The first time-to-cut message triggers the cutting of a block, and all duplicates are ignored. Once a block is cut, orderer nodes append their signatures to the block, persist the block in the filesystem, and then send it to the connected peers.

4.5 Transaction Flow

Figure 4 depicts the new and modified components described in the previous sections and outlines the execute-order-in-parallel transaction flow. Since the order-then-execute flow is simpler, we omit its details in the interest of brevity. The application leverages the blockchain interfaces in libpq to fetch the latest block height and to submit the transaction to the PostgreSQL backend. After verifying the client signature using pgCerts, the transaction is forwarded to the other PostgreSQL nodes and to the ordering service by the communication middleware (the client can also submit the transaction to all peers directly, rather than one peer forwarding it to all others). The backend updates TxMetadata, executes the transaction leveraging the SSI variant based on block-aware abort during commit, and collects the write-set into TxWriteSet. It then sets the ready-to-proceed status in TxMetadata and waits for a signal from the block processor.

Upon receiving the block from the ordering service, the middleware verifies the orderer signature and stores the block in pgBlockstore. The block processor retrieves each unprocessed block from pgBlockstore one at a time and adds all its transactions to pgLedger. It confirms from TxMetadata that all transactions have completed execution and then serially signals each backend, as per the order in the block, to proceed further. Upon receiving this signal, the backend validates the transaction based on the block-aware abort-during-commit logic explained in §3.4.3, sets either commit or abort status in TxMetadata, and signals the block processor. Once all transactions have been processed, the block processor updates the pgLedger table with the status of each transaction. We are yet to implement the checkpoint flow described in §3.4.4.

5. EVALUATION

In this section, we study the performance and cost of both our design approaches. We measure performance in terms of throughput and latency. Throughput is defined as the average number of unique transactions committed per second in the blockchain network, and latency is defined as the average time taken to commit a transaction, measured from the time a client submits it. A transaction is said to be committed in the blockchain network when a majority of the nodes commit it. Cost is measured in terms of resource utilization, such as CPU, memory, disk, and network bandwidth utilization. We study the effect on performance and cost of varying several system parameters, namely (a) the block size, (b) the transaction arrival rate, (c) the smart contract query complexity, (d) the deployment model (local/wide area network), and (e) the number of database nodes (network size).

We use three smart contracts: (1) a simple contract, which inserts values into a table; (2) a complex-join contract, which performs complex joins between two tables to execute aggregate operations and writes the result to a third table; and (3) a complex-group contract, which performs aggregate queries over subgroups in a group and uses order by and limit to write the maximum aggregate across the subgroups into a table. We illustrate these queries in Appendix A. Note that while the simple contract can be implemented in most blockchain platforms today, both the complex-join and complex-group contracts are impossible to implement efficiently.

We consider two deployment models: the first, where all organizations host their database nodes on a common public cloud data center (LAN), and the second, where nodes are hosted independently, leading to a multi/hybrid-cloud setup (WAN). In our experiments, the WAN setup involved four data centers spread across four continents. These models help us study the communication overheads of the two approaches. In both deployments, database and orderer nodes were hosted on virtual machines, each with 32 vCPUs of Intel Xeon E5-2683 v3 @ 2.00GHz and 64 GB of memory. In the multi-cloud deployment, the network bandwidth between nodes was between 50 and 60 Mbps, whereas it was 5 Gbps in the single-cloud deployment.

In addition to throughput and latency, we also measure the following seven micro metrics (all as averages) to gain a deeper understanding of system behavior:

1. the block receive rate (brr)—the number of blocks received per second at the middleware from the orderer;

2. the block processing rate (bpr)—the number of blocks processed and committed per second at the block processor;

3. the block processing time (bpt, in ms)—the time taken to process and commit a block;

4. the block execution time (bet, in ms)—the time taken to start all transactions in a block until they suspend for commit/abort;

5. the transaction execution time (tet, in ms)—the time taken by the backend to execute a transaction until it suspends for commit/abort;

6. the block commit time (bct, in ms)—the time taken to perform the serial commit of all transactions in a block, measurable as bpt − bet;

7. the missing transactions (mt)—the number of transactions missing per second while processing blocks at the block processor (relevant only for execute-order-in-parallel).

We define two additional terms: (1) the peak throughput, the maximum achieved throughput for a given smart contract type and block size across all possible arrival rates; and (2) the system utilization (su), the fraction of time the block processor is busy (bpb), expressed as a percentage. bpb can be estimated as the average number of blocks processed per second (bpr) multiplied by the time taken to process each block (bpt). In an ideal scenario, when peak throughput is achieved, the system utilization should be close to 100%.

Unless mentioned otherwise, our experiments use a single data center with three organizations, each running one database and one orderer node. At the orderer, the block timeout was set to 1s and the block size was varied. We used pgTune [7] to configure PostgreSQL, and max_connection (i.e., the number of backends) was set to 2600. The arrival rate was load balanced among the orderer nodes in the case of order-then-execute and among the database nodes in the case of execute-order-in-parallel.

[Figure 5: Performance with simple contract. Panels (a) order then execute and (b) execute and order in parallel plot throughput (tps) and latency (sec) against the transaction arrival rate (tps) for block sizes 10, 100, and 500.]

Table 4: Order then execute: micro metrics for an arrival rate of 2100 tps.

    bs    brr    bpr    bpt    bet   bct   tet   su
    10    209.7  163.5  6      5     1     0.2   98.1%
    100   20.9   17.9   55.4   47    8.3   0.2   99.1%
    500   4.2    3.5    285.4  245   44.3  0.4   99.7%
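As a consistency check, su can be recomputed from Table 4: for a block size of 100, bpb ≈ bpr × bpt = 17.9 blocks/s × 55.4 ms/block ≈ 991 ms of busy time per second, i.e., su ≈ 99.1%, matching the reported value; likewise bct = bpt − bet = 55.4 − 47 ≈ 8.3 ms.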

5.1 Block Size and Arrival Rate

Order then Execute. Figure 5(a) plots the throughput and latency achieved by the order-then-execute approach using the simple contract. Table 4 presents the micro metrics for an arrival rate of 2100 tps.

With an increase in the transaction arrival rate, the throughput increased linearly, as expected, until it flattened out at around 1800 tps, which was the peak throughput (system utilization close to 100%, as shown in Table 4). When the arrival rate was close to or above the peak throughput, the latency increased significantly, from the order of hundreds of milliseconds to tens of seconds. For an arrival rate lower than the peak throughput, the latency increased with the block size. The reason is that with a larger block size, it took longer for the orderer to fill a block with transactions, leading to an increased block creation time and hence a longer waiting time for each transaction at the orderer. For an arrival rate greater than the peak throughput, the latency decreased with an increase in block size. This is because there was no waiting for transactions to arrive to form a block, and more transactions were executed in parallel. This is also observable in that the block processing time (bpt) of a block of size n was always smaller than the sum of the bpt of m blocks each of size n/m. For the same reason, the throughput also increased with the block size.

Execute and Order in Parallel. Figure 5(b) plots the same for the execute-and-order-in-parallel approach. Table 5 presents the micro metrics for an arrival rate of 2400 tps. The maximum throughput achieved was 2700 tps, i.e., 1.5× higher than what was achieved with order-then-execute. Note that the system utilization (su) was only 90%. When the arrival rate was greater than 2700 tps, the throughput started to degrade and su never reached 100%. We believe the reason to be the large number of active backends, which resulted in heavy contention on the default shared memory data structures (we intend to study this further as part of future work). Though the block processing time (bpt) and the block execution time (bet) were observed to be lower with execute-and-order-in-parallel than with order-then-execute, the block commit time (bct) was observed to be higher. This could again be because of the large number of active backends.

Table 5: Execute and order in parallel: micro metrics for an arrival rate of 2400 tps.

    bs    brr     bpr     bpt     bet    bct    tet   mt   su
    10    232.26  232.26  3.86    2.05   1.81   0.58  479  89%
    100   24.00   24.00   35.26   18.57  16.69  3.08  519  84%
    500   4.83    4.83    149.64  50.83  98.81  6.27  230  72%

Comparison With Ethereum's Order then Execute. Blockchain platforms such as Ethereum also adopt an order-then-execute approach, but execute transactions serially once the block is formed. To emulate this behavior, we made our block processor also execute and commit transactions one at a time. This resulted in a peak throughput of 800 tps (for a block size of 100, although the block size does not matter when executing serially). This is only about 40% of the throughput achieved with our approach, which supports parallel execution of transactions leveraging SSI.

5.2 Smart Contract Complexity

Order then Execute. Figure 6(a) plots the peak throughput and micro metrics achieved when using the complex-join contract with order-then-execute. As observed earlier, the throughput increased with the block size and reached a peak of 400 tps. This was less than 25% of what was observed with the simple contract, primarily because the transaction execution time (tet) increased by about 160× compared to the simple contract. The block processing time (bpt) and the block execution time (bet) also increased. The CPU and memory utilization were 10% and 15 GB, respectively, for the simple contract, compared to 30% and 15 GB for the complex-join contract.

[Figure 6: Performance with complex-join contract. Panels (a) order-then-execute and (b) execute-and-order-in-parallel plot the peak throughput (tps) and the bet, bpt, and tet times (ms) against the block size (10, 50, 100 transactions).]

Execute and Order in Parallel. Figure 6(b) plots the same for execute-order-in-parallel. Both the block processing time (bpt) and the block execution time (bet) were lower than observed with order-then-execute. This is because, by the time the block reached the block processor, all transactions were either executing or had already completed execution. The peak throughput achieved was more than twice that of order-then-execute. Unlike the order-then-execute approach, this approach permits the concurrent execution of a larger number of transactions, unrestricted by the block size (in order-then-execute, the maximum number of concurrently executing transactions is capped by the block size). This manifested as a significantly larger increase in the transaction execution time (tet) compared to order-then-execute. For the same reason, the CPU utilization increased to 100%, i.e., 3.3× higher than what was observed for order-then-execute.

With the third smart contract, the complex-group contract, for a block size of 100, the maximum throughput achieved by order-then-execute and execute-order-in-parallel was 1.75× and 1.6× higher, respectively, than what was observed with the complex-join contract. We plot the corresponding graphs in Figure 7.

[Figure 7: Performance with complex-group contract. Same panels and axes as Figure 6.]

5.3 Deployment Model and Network Size

Deployment Model. Figure 8(a) plots the peak throughput achieved with both approaches in a multi-cloud deployment. Compared to a single-cloud deployment (refer to Figure 6), only the latency increased, on average by 100 ms, due to the WAN; the throughput remained the same for the most part, except for a 4% reduction in the peak throughput when the block size was 100. In our system, each transaction was only 196 bytes in size, and hence the size of a block with 500 transactions was only about 100 KB. The lower bandwidth in the multi-cloud deployment therefore did not have a significant impact on performance, demonstrating that our system can work equally well in a geographically distributed environment.

Network Size. Figure 8(b) plots the throughput achieved with the kafka- and bft-based ordering services while varying the number of orderer nodes and fixing the transaction arrival rate at 3000 tps. With an increase in the number of orderer nodes, we observed no impact on the performance of the kafka-based ordering service, but the performance of the bft-based ordering service reduced from 3000 tps to 650 tps, as expected, due to the communication overhead. With an increase in the number of database nodes alone, the overall system throughput was not affected, but remained limited by the peak throughput of the ordering service.

[Figure 8: Performance (a) with complex contract in a multi-cloud deployment, plotting the peak throughput and the increase in latency for OE and EO against the block size; (b) with different network sizes, plotting the Kafka and BFT orderer throughput against the number of orderer nodes (4 to 36).]

6. RELATED WORK

Bitcoin [39] and Ethereum [51] adopt an order-execute model, where transactions are first ordered in a block through consensus (such as proof-of-work) and each node validates and commits transactions to its copy of the blockchain by serially executing each transaction. In our order-then-execute approach, we leverage SSI to execute transactions concurrently. Further, such platforms only support a simple key-value data model.

Hyperledger Fabric [17] adopts an execute-then-order approach, where transactions are first executed and endorsed by multiple nodes, then ordered by consensus, followed by serial validation and commit. In contrast, in our execute-order-in-parallel approach, transaction execution and ordering happen in parallel. Fabric only supports leveldb and couchdb as the underlying database, with support for composite keys and range queries. In contrast, we support the full spectrum of SQL queries in a smart contract (for the first time), with the exception of libraries that could introduce non-determinism, and of blind updates. Further, by leveraging capabilities already available within a relational database, we provide support for complex data types, schemas, constraints, and provenance queries. Performance studies on Fabric [17, 47] have shown that a throughput of 3000 tps can be achieved with goLevelDB as the state database for a simple smart contract, while the throughput drops to 700 tps with CouchDB.

Hyperledger Composer [4] is a set of collaboration tools for building blockchain business networks utilizing Hyperledger Fabric. It allows a user to write a basic SQL query with limited syntax [5] and internally converts the SQL query to a JSON selector query [3] supported by CouchDB.

Corda [30] refers to its platform as a decentralized database. Transactions are executed by one node at a time (not in parallel by all nodes) and the results are shared with other nodes on a need-to-know basis. There is no notion of a blockchain in Corda, although, optionally, a notary can order transactions and validate them for double spending. Corda uses the H2 embedded SQL engine as the state database. State objects can define a relational mapping, and an object serialization framework is used to store the states in a relational database. This permits querying the database using SQL and also permits rich queries (such as joins) with an organization's private non-blockchain data. However, it does not enable rich query support within the smart contract itself.

Veritas [27] proposes shared verifiable tables using a set of key-value stores, each owned by an organization. Only the verifiability property, i.e., immutable logs, is supported. For ensuring consistency across replicas, it uses a centralized trusted timestamp server to order transactions. Further, a transaction is executed only on one of the nodes, and each node periodically ships logs of multiple read-write sets to the other nodes via a Kafka-based broadcast service. Nodes are trusted to execute and share results correctly with other nodes, and hence the system is not decentralized. Nodes vote on transactions to resolve any conflicts.

BigchainDB [14] employs Tendermint consensus [33, 12] over a set of independent MongoDB [6] instances, each owned by a different organization. It supports immutability and decentralization. While the overall goals of BigchainDB are similar to ours, there are fundamental architectural differences. It supports only user tokens/assets, similar to Bitcoin, with no support for smart contracts. A transaction submitted to BigchainDB is forwarded to the Tendermint nodes, which ensure that all BigchainDB nodes agree on the next block in a byzantine fault-tolerant way. Transactions are serially executed post-ordering, similar to Ethereum.

It is evident from the related work that ours is the first blockchain relational database with the rich features of both blockchains and databases. Further, the execute-and-order-in-parallel approach proposed in this paper is highly novel, and we believe it is applicable even to centralized replicated databases.

7. DISCUSSION AND FUTURE WORK

In ongoing work, we are studying the performance of both our design approaches in the presence of rw/ww dependencies with two setups: (a) a homogeneous network, in which all nodes are at approximately the same current block; and (b) a heterogeneous network, which could lead to a significant gap in the current block of different nodes over time.

As stated in §3.3.3 and §3.6, there is potential to optimize the current implementation by (a) executing the serialization check using all transactions' inConflictList and outConflictList at once in a separate background process instead of serially at the backend, and (b) committing the whole block atomically rather than one transaction at a time. Optimizations are also possible in our communication layer by leveraging gossip protocols (for larger numbers of nodes) and by batching transaction replication. We expect the performance to improve with these optimizations.

Over time, the amount of historical data in the network will grow significantly. Hence, we need to enhance existing pruning tools such as vacuum [13] to remove rows based on fields such as creator and deleter.

As privacy is an important consideration for blockchain solutions, there is potential to enhance relational database features such as user management to provide stronger privacy guarantees. Data privacy can also be achieved using database schemas and the concept of channels, as in Hyperledger Fabric [17].

8. CONCLUSION

In this paper, we presented the design of a blockchain relational database, a decentralized database with replicas managed by different organizations that do not trust one another. The key challenge we addressed is ensuring that all untrusted replicas commit transactions in the same serializable order, one that respects the block ordering determined by consensus. We proposed two design approaches that leverage and modify SSI to achieve this, and devised a new variant of SSI based on block height. Leveraging features already available in databases enables us to better support complex data types, schemas, complex queries, and provenance queries that are not provided by blockchain platforms today. We implemented the system on PostgreSQL and presented detailed performance results.

9. REFERENCES

[1] Apache Kafka. https://kafka.apache.org.
[2] Apache ZooKeeper. http://zookeeper.apache.org.
[3] CouchDB selector query. https://docs.couchdb.org/en/2.2.0/api/database/find.html.
[4] Hyperledger Composer. https://www.hyperledger.org/projects/composer.
[5] Hyperledger Composer query language. https://hyperledger.github.io/composer/v0.19/reference/query-language.
[6] MongoDB. https://www.mongodb.com/.
[7] pgtune. https://github.com/gregs1104/pgtune.
[8] PostgreSQL background worker processes. https://www.postgresql.org/docs/10/bgworker.html.
[9] PostgreSQL frontend/backend protocol. https://www.postgresql.org/docs/10/protocol.html.
[10] PostgreSQL libpq - C library. https://www.postgresql.org/docs/10/libpq.html.
[11] PostgreSQL v10. https://www.postgresql.org/.
[12] Tendermint. https://tendermint.com/.
[13] VACUUM. https://www.postgresql.org/docs/10/sql-vacuum.html.
[14] BigchainDB: The blockchain database. https://www.bigchaindb.com/whitepaper/bigchaindb-whitepaper.pdf, 2018.
[15] A. Adya, B. Liskov, and P. O'Neil. Generalized isolation level definitions. In Proc. 16th International Conference on Data Engineering (ICDE), pages 67–78, 2000.
[16] P. A. Alsberg and J. D. Day. A principle for resilient sharing of distributed resources. In Proc. 2nd International Conference on Software Engineering (ICSE '76), pages 562–570, 1976.
[17] E. Androulaki, A. Barger, V. Bortnikov, C. Cachin, K. Christidis, A. De Caro, D. Enyeart, C. Ferris, G. Laventman, Y. Manevich, S. Muralidharan, C. Murthy, B. Nguyen, M. Sethi, G. Singh, K. Smith, A. Sorniotti, C. Stathakopoulou, M. Vukolić, S. W. Cocco, and J. Yellick. Hyperledger Fabric: A distributed operating system for permissioned blockchains. In Proc. 13th EuroSys Conference (EuroSys '18), pages 30:1–30:15, 2018.
[18] H. Berenson, P. Bernstein, J. Gray, J. Melton, E. O'Neil, and P. O'Neil. A critique of ANSI SQL isolation levels. In Proc. ACM SIGMOD '95, pages 1–10, 1995.
[19] P. A. Bernstein and N. Goodman. Multiversion concurrency control—theory and algorithms. ACM Trans. Database Syst., 8(4):465–483, 1983.
[20] P. A. Bernstein, V. Hadzilacos, and N. Goodman. Concurrency Control and Recovery in Database Systems. Addison-Wesley, 1986.
[21] A. Bessani, J. Sousa, and E. E. P. Alchieri. State machine replication for the masses with BFT-SMaRt. In Proc. 44th IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), pages 355–362, 2014.
[22] E. Brewer. CAP twelve years later: How the "rules" have changed. Computer, 45(2):23–29, 2012.
[23] M. J. Cahill, U. Röhm, and A. D. Fekete. Serializable isolation for snapshot databases. In Proc. ACM SIGMOD '08, pages 729–738, 2008.
[24] M. Castro and B. Liskov. Practical byzantine fault tolerance. In Proc. 3rd Symposium on Operating Systems Design and Implementation (OSDI '99), pages 173–186, 1999.
[25] E. Cecchet, G. Candea, and A. Ailamaki. Middleware-based database replication: The gaps between theory and practice. In Proc. ACM SIGMOD '08, pages 739–752, 2008.
[26] A. Fekete, D. Liarokapis, E. O'Neil, P. O'Neil, and D. Shasha. Making snapshot isolation serializable. ACM Trans. Database Syst., 30(2):492–528, 2005.
[27] J. Gehrke, L. Allen, P. Antonopoulos, A. Arasu, J. Hammer, J. Hunter, R. Kaushik, D. Kossmann, R. Ramamurthy, S. T. V. Setty, J. Szymaszek, A. van Renen, J. Lee, and R. Venkatesan. Veritas: Shared verifiable databases and tables in the cloud. In Proc. 9th Biennial Conference on Innovative Data Systems Research (CIDR 2019), 2019.
[28] J. Gray, P. Helland, P. O'Neil, and D. Shasha. The dangers of replication and a solution. In Proc. ACM SIGMOD '96, pages 173–182, 1996.
[29] J. Gray and L. Lamport. Consensus on transaction commit. ACM Trans. Database Syst., 31(1):133–160, 2006.
[30] M. Hearn. Corda, 2016. https://www.corda.net/content/corda-technical-whitepaper.pdf.
[31] F. P. Junqueira, B. C. Reed, and M. Serafini. Zab: High-performance broadcast for primary-backup systems. In Proc. 41st IEEE/IFIP International Conference on Dependable Systems & Networks (DSN '11), pages 245–256, 2011.
[32] H. T. Kung and J. T. Robinson. On optimistic methods for concurrency control. ACM Trans. Database Syst., 6(2):213–226, 1981.
[33] J. Kwon. Tendermint: Consensus without mining. 2014.
[34] L. Lamport. The part-time parliament. ACM Trans. Comput. Syst., 16(2):133–169, 1998.
[35] F. Li, M. Hadjieleftheriou, G. Kollios, and L. Reyzin. Dynamic authenticated index structures for outsourced databases. In Proc. ACM SIGMOD '06, pages 121–132, 2006.
[36] F. Li, M. Hadjieleftheriou, G. Kollios, and L. Reyzin. Authenticated index structures for aggregation queries. ACM Trans. Inf. Syst. Secur., 13(4):32:1–32:35, 2010.
[37] S. Liu, P. Viotti, C. Cachin, V. Quéma, and M. Vukolic. XFT: Practical fault tolerance beyond crashes. In Proc. 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI '16), pages 485–500, 2016.
[38] D. A. Menasce, G. J. Popek, and R. R. Muntz. A locking protocol for resource coordination in distributed databases. In Proc. ACM SIGMOD '78, 1978.
[39] S. Nakamoto. Bitcoin: A peer-to-peer electronic cash system. http://bitcoin.org/bitcoin.pdf.
[40] D. Ongaro and J. Ousterhout. In search of an understandable consensus algorithm. In Proc. USENIX Annual Technical Conference (ATC '14), pages 305–320, 2014.
[41] H. H. Pang and K.-L. Tan. Authenticating query results in edge computing. In Proc. 20th International Conference on Data Engineering (ICDE), pages 560–571, 2004.
[42] K. Petersen, M. J. Spreitzer, D. B. Terry, M. M. Theimer, and A. J. Demers. Flexible update propagation for weakly consistent replication. In Proc. 16th ACM Symposium on Operating Systems Principles (SOSP '97), pages 288–301, 1997.
[43] D. R. K. Ports and K. Grittner. Serializable snapshot isolation in PostgreSQL. Proc. VLDB Endow., 5(12):1850–1861, 2012.
[44] S. H. Son. Replicated data management in distributed database systems. SIGMOD Rec., 17(4):62–69, 1988.
[45] M. Stonebraker. Concurrency control and consistency of multiple copies of data in distributed INGRES. IEEE Transactions on Software Engineering, SE-5(3):188–194, 1979.
[46] D. B. Terry, M. M. Theimer, K. Petersen, A. J. Demers, M. J. Spreitzer, and C. H. Hauser. Managing update conflicts in Bayou, a weakly connected replicated storage system. In Proc. 15th ACM Symposium on Operating Systems Principles (SOSP '95), pages 172–182, 1995.
[47] P. Thakkar, S. Nathan, and B. Viswanathan. Performance benchmarking and optimizing Hyperledger Fabric blockchain platform. In Proc. 26th IEEE International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS), pages 264–276, 2018.
[48] R. H. Thomas. A majority consensus approach to concurrency control for multiple copy databases. ACM Trans. Database Syst., 4(2):180–209, 1979.
[49] B. Vandiver, H. Balakrishnan, B. Liskov, and S. Madden. Tolerating byzantine faults in transaction processing systems using commit barrier scheduling. In Proc. 21st ACM SIGOPS Symposium on Operating Systems Principles (SOSP '07), pages 59–72, 2007.
[50] M. Wiesmann, F. Pedone, A. Schiper, B. Kemme, and G. Alonso. Understanding replication in databases and distributed systems. In Proc. 20th IEEE International Conference on Distributed Computing Systems (ICDCS), pages 464–474, 2000.
[51] G. Wood. Ethereum: A secure decentralised generalised transaction ledger. Ethereum project yellow paper, 2014.
[52] C. Xu, J. Xu, H. Hu, and M. H. Au. When query authentication meets fine-grained access control: A zero-knowledge approach. In Proc. ACM SIGMOD '18, pages 147–162, 2018.
[53] Y. Zhang, J. Katz, and C. Papamanthou. IntegriDB: Verifiable SQL for outsourced databases. In Proc. 22nd ACM SIGSAC Conference on Computer and Communications Security (CCS '15), pages 1480–1491, 2015.

APPENDIX

A. SMART CONTRACTS USED FOR EVALUATION

Figure 9: Simple query with single inserts into table
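The body of Figure 9 is not preserved in this text; the following is a hypothetical sketch of such a simple contract as a PL/pgSQL function (table and parameter names are invented for illustration):

    -- Illustrative sketch only; not the paper's actual contract.
    -- Assumes a table accounts(key INT, name TEXT, value INT) exists.
    CREATE FUNCTION simple_contract(_key INT, _name TEXT, _value INT)
    RETURNS void LANGUAGE plpgsql AS $$
    BEGIN
        INSERT INTO accounts (key, name, value) VALUES (_key, _name, _value);
    END;
    $$;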

Figure 10: Complex queries with joins and aggregates
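Again, the figure body is not preserved; a hypothetical sketch in this style joins two tables, aggregates, and writes the result to a third table (all names invented):

    -- Illustrative sketch only; not the paper's actual contract.
    -- Assumes customers(id), orders(customer_id, amount), and
    -- order_totals(customer_id, total) exist.
    CREATE FUNCTION complex_join_contract() RETURNS void
    LANGUAGE plpgsql AS $$
    BEGIN
        INSERT INTO order_totals (customer_id, total)
        SELECT c.id, SUM(o.amount)
        FROM customers c
        JOIN orders o ON o.customer_id = c.id
        GROUP BY c.id
        ORDER BY c.id;  -- keep insertion order deterministic (§4.3)
    END;
    $$;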

Figure 11: Complex queries with group-by, order-by, limit and aggregates
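A hypothetical sketch in the style this figure describes, computing the maximum aggregate across subgroups; note the ORDER BY with a key tie-break accompanying LIMIT, per the determinism rules of §4.3 (all names invented):

    -- Illustrative sketch only; not the paper's actual contract.
    -- Assumes sales(region, amount) and max_sales(region, amount) exist.
    CREATE FUNCTION complex_group_contract() RETURNS void
    LANGUAGE plpgsql AS $$
    BEGIN
        INSERT INTO max_sales (region, amount)
        SELECT region, SUM(amount) AS total
        FROM sales
        GROUP BY region
        ORDER BY total DESC, region  -- deterministic tie-break on the key
        LIMIT 1;
    END;
    $$;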
