Cornus: One-Phase Commit for Cloud Databases with Storage Disaggregation

Zhihan Guo* (University of Wisconsin-Madison, Madison, Wisconsin), [email protected]
Xinyu Zeng* (University of Wisconsin-Madison, Madison, Wisconsin), [email protected]
Ziwei Ren (University of Wisconsin-Madison, Madison, Wisconsin), [email protected]
Xiangyao Yu (University of Wisconsin-Madison, Madison, Wisconsin), [email protected]
*Both authors contributed equally.

ABSTRACT

Two-phase commit (2PC) has been widely used in distributed databases to ensure atomicity for distributed transactions. However, 2PC suffers from two limitations. First, 2PC incurs long latency as it requires two logging operations on the critical path. Second, when a coordinator fails, a participant may be blocked waiting for the coordinator's decision, leading to indefinitely long latency and low throughput. We make a key observation that modern cloud databases feature a storage disaggregation architecture, which allows a transaction's final decision to not rely on the central coordinator. We propose Cornus, a one-phase commit (1PC) protocol specifically designed for this architecture. Cornus can solve the two problems mentioned above by leveraging the fact that all compute nodes are able to access and modify the log data on any storage node. We present Cornus in detail, formally prove its correctness, develop certain optimization techniques, and evaluate it against 2PC on YCSB and TPC-C workloads. The results show that Cornus can achieve a 1.5× speedup in latency.

1 INTRODUCTION

Modern database management systems (DBMS) are increasingly distributed due to the growing data volume and the diverse demands of modern Internet services. To ensure the atomicity of distributed transactions, an atomic commitment protocol (ACP) is required for transactions that access data across distributed machines. Two-phase commit (2PC) is so far the most widely used ACP.

Albeit widely implemented, 2PC has two major problems that limit its performance. First, 2PC requires two round-trip network messages and the associated logging operations. Previous works [8, 9, 16, 20, 23, 24, 33] have demonstrated that 2PC can account for the majority of a transaction's execution time due to the incurred network messages and disk logging, which directly affects the query response time that a user experiences. Second, 2PC has a well-known blocking problem [11, 12, 25]. If a coordinator crashes before notifying participants of the decision, the participants may not know the decision and will be blocked until the coordinator recovers. Meanwhile, uncertain transactions cannot release their locks, which blocks other transactions from making forward progress.

The two problems above have inspired two separate lines of research seeking solutions. To mitigate the long latency problem, previous works have proposed one-phase commit (1PC) protocols [8, 13, 22, 26] that remove one phase from the commit procedure. Existing 1PC protocols, however, make extra assumptions beyond those of 2PC [8]. Most of these assumptions are impractical in a production environment, which stymies the wide adoption of existing 1PC protocols. To solve the blocking problem, previous works have proposed three-phase commit (3PC) protocols [25] such that an uncertain transaction can learn the decision even if the coordinator crashes. However, as the name suggests, these protocols must pay the overhead of one extra phase in the commit procedure, further exacerbating the latency problem. According to the fundamental non-blocking theorem [25], no protocol can achieve non-blocking behavior without introducing extra assumptions or communication.

In this paper, we make the key insight that the architectural paradigm shift happening in cloud databases (i.e., storage disaggregation [2, 3, 5, 10, 15, 30, 32]) is fundamentally changing the design space of atomic commitment protocols. Specifically, disaggregating the storage from computation allows a server to directly access all the storage nodes rather than only its own storage as in a conventional shared-nothing architecture. This insight allows us to design a new 1PC protocol that achieves low latency and non-blocking behavior altogether, without making further assumptions besides the disaggregation of storage.

To this end, we propose Cornus, a new non-blocking 1PC protocol designed for the storage-disaggregation architecture. Cornus solves the latency problem by eliminating the decision logging at the coordinator. Instead, a transaction relies on the collective logs of all the participating nodes for its final decision. This change is made possible because all the logs are accessible to all transactions in a disaggregation architecture. If any failure occurs, an uncertain transaction can rebuild the decision by accessing all the logs. Cornus is also non-blocking — if any participant fails to flush its log, other uncertain transactions can insert an abort record on behalf of the non-responding node. We introduce the LogIfNotExist() function to avoid race conditions in corner cases. In summary, this paper makes the following key contributions:

• We develop Cornus, a one-phase commit (1PC) protocol designed for a storage-disaggregation architecture to reduce the latency overhead in 2PC and alleviate the blocking problem at the same time.
• We prove the correctness of Cornus by showing that it satisfies all five properties of an atomic commitment protocol that 2PC can satisfy [11, 12].
• We evaluate Cornus in a great variety of settings on both YCSB and TPC-C workloads. Cornus shows an improvement in latency of 50% compared to 2PC.

Figure 1: Illustration of Two-Phase Commit (2PC) — The lifecycle of a committing transaction (a) and a scenario of coordinator failure (b). The compute node and the corresponding storage node are drawn close to each other.

2 BACKGROUND AND MOTIVATION

Section 2.1 describes two-phase commit (2PC). We discuss two major concerns of the protocol, long latency and blocking, and briefly describe existing works addressing each problem. Section 2.2 discusses storage disaggregation as a trending architecture in cloud databases and how it motivates the design of Cornus.

2.1 Two-Phase Commit

In a distributed database management system (DDBMS), data are partitioned across multiple sites, which can be accessed by a distributed transaction. After the execution, all the sites involved must reach a consensus on committing or aborting the transaction to ensure the transaction's atomicity. An atomic commitment protocol (ACP) is required to achieve this goal.

Two-phase commit (2PC) [23] is a widely used ACP in current DDBMSs. It contains a prepare phase and a commit phase. A demonstration of the protocol is shown in Figure 1. When no failure happens and all participating nodes agree to commit the transaction, the protocol behaves as in Figure 1a. For each transaction, one node is pre-designated to be the coordinator and the other nodes involved in the transaction become participants.

During the prepare phase, the coordinator of the transaction starts by sending prepare requests (also called vote requests) to all the participants and, in parallel, writing a START-2PC log record to its own log file. Upon receiving the prepare request, each participant logs a VOTE-YES record (assuming a committing transaction) to the corresponding log file and then responds to the coordinator.

After receiving all the votes, the coordinator enters the commit phase by logging the final decision (i.e., commit/abort) and notifying the caller. The coordinator then forwards the decision to each participant, which logs the decision accordingly and releases all the locks held by the concurrency control protocol.

If the coordinator fails before sending the decision to participants, as shown in Figure 1b, participants may not know the decision of the transaction. Upon a timeout, a participant will initiate a termination protocol to contact other nodes to learn the decision; it will repeat this process until at least one node replies with the decision. In case the coordinator takes a long time to recover, the nodes under uncertainty cannot learn the decision, and the associated transactions will block. In 2PC, the coordinator's decision log record serves as the ground truth of the commit/abort decision — the final outcome of the transaction relies on the success of logging this record.

Limitation 1: The Latency of Two Phases

In the standard 2PC protocol, the transaction caller experiences an average latency of one network round-trip and two logging operations, as shown in Figure 1a. Such a delay directly affects the query response time that an end user experiences.

Previous works have proposed various one-phase commit (1PC) protocols to reduce this latency. Some works combine the voting phase with the execution of the transaction [7, 9, 22, 26, 27] to remove one phase, yet they make assumptions that are too strong to be practical [8]. These protocols assume that serialization and consistency are ensured before an acknowledgment of each operation is sent from the participant to the coordinator, and that no abort due to consistency or serialization is allowed after the successful execution of all operations. Thus, they do not support common concurrency control protocols such as Optimistic Concurrency Control [21] and Timestamp Ordering [12], according to Abdallah et al. [8]. Moreover, most protocols either pay extra overhead such as blocking I/O during execution [26] or violate site autonomy [9, 22, 26, 27] — a property that allows each node to manage its data and recovery independently [8].

Limitation 2: The Blocking Problem

In 2PC, a participant learns the decision of a transaction either directly from the coordinator or indirectly from other participants. In an unfortunate corner case shown in Figure 1b, where the coordinator fails before sending any notifications, no participant can make or learn the decision, since it is unclear whether the decision has been made. Meanwhile, the participants must hold the locks on tuples until the coordinator has recovered. This is well known as the blocking problem; it causes certain data to be inaccessible because some other node not holding the data is down, limiting the performance and data availability of 2PC. Existing works [11, 19, 25] resolve the blocking problem by introducing extra inter-node communication and imposing more assumptions on the failure model.

For instance, the three-phase commit (3PC) protocol eliminates blocking by introducing an extra prepared-to-commit phase [25]. The coordinator will not send out a commit message until all participants have acknowledged that they are prepared to commit. Although this approach can eliminate blocking, it introduces another network round-trip into each transaction and exacerbates the long latency of 2PC. Moreover, 3PC assumes a synchronous system where the network delays among nodes are bounded. In practical systems with unbounded delay, 3PC cannot guarantee atomicity [18].
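To make the two limitations in Section 2.1 concrete, below is a minimal sketch of the coordinator side of textbook 2PC. It is written in Python purely for illustration; the injected helpers (durable logging, messaging, vote collection) are hypothetical stand-ins, not part of any system discussed in this paper.

```python
from typing import Callable, Dict, List

def two_phase_commit(
    txn_id: str,
    participants: List[str],
    log_sync: Callable[[str, str], None],       # durably append a record to the local log
    send: Callable[[str, str, str], None],      # send(participant, msg_type, txn_id)
    gather_votes: Callable[[str, List[str]], Dict[str, str]],  # blocks for all votes
    reply_to_caller: Callable[[str, str], None],
) -> str:
    """Coordinator side of textbook 2PC (Section 2.1), for illustration only."""
    # Prepare phase: log START-2PC and request votes in parallel.
    log_sync(txn_id, "START-2PC")
    for p in participants:
        send(p, "VOTE-REQ", txn_id)
    votes = gather_votes(txn_id, participants)

    # Commit phase: the decision record is the ground truth, so it must be
    # durable before the caller is notified -- the second log write on the
    # caller's critical path.
    decision = "COMMIT" if all(v == "VOTE-YES" for v in votes.values()) else "ABORT"
    log_sync(txn_id, decision)
    reply_to_caller(txn_id, decision)
    for p in participants:
        send(p, decision, txn_id)               # participants log the decision and release locks
    return decision
```

The two durable log writes (START-2PC and the decision) sit on the caller's critical path, and a participant that has voted yes can learn the outcome only from the final decision messages, which is why it blocks if the coordinator fails before sending them.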

Figure 2: Shared-Nothing vs. Storage-Disaggregation Architectures. (a) Shared-nothing; (b) storage-disaggregation.

2.2 Opportunities for Improving 2PC with Storage Disaggregation

Modern cloud-native databases are shifting to a storage-disaggregation architecture where the storage and computation are separately managed as different layers of services (Figure 2b). This brings significant benefits including lower cost, simpler fault tolerance, and higher hardware utilization, compared to the common shared-nothing architecture (Figure 2a). A number of cloud-native databases are adopting such an architecture, with examples including Aurora [32], Redshift Spectrum [3], Athena [2], Presto [5], Hive [30], SparkSQL [10], and Snowflake [15]. A storage-disaggregation architecture has the following two key properties:

First, all the data in the storage layer are accessible to all the computation nodes. In terms of logging, this means a computation node can access not only its own log file but also the log files corresponding to other computation nodes. This contrasts with a shared-nothing architecture (Figure 2a), where a computation node can only directly access its own storage — accessing remote storage requires sending an explicit request to a remote computation node. Moreover, modern disaggregated storage services like Amazon S3 [4] are typically highly available and fault-tolerant.

Second, the storage layer can perform certain computation tasks. The storage layer is typically implemented as a cluster of servers that can perform general-purpose computation. Although the storage servers are not as powerful as computation servers and cannot communicate with each other arbitrarily, such computation capability can substantially improve the performance of a database, for both OLTP [32] and OLAP [34] workloads. This contrasts with a conventional shared-disk architecture, where the disks are passive devices and perform no computation.

Through scrutinizing the 2PC protocol, we learned that it is designed with a basic assumption of a shared-nothing architecture; namely, a computation node can learn the content of a remote log only through explicitly contacting the remote node. If the remote node is down, the log on its local storage is no longer accessible.

With the disaggregation architecture, however, a computation node can learn the content of a remote log by directly accessing the log itself. Furthermore, it may even directly manipulate the log content if necessary. With this subtle difference, we can optimize 2PC to solve both limitations discussed in Section 2.1 — first, we can eliminate one phase from the protocol to reduce latency; second, we are able to solve the blocking problem. The details of our solution are discussed in the following section.

3 CORNUS

We first present the high-level ideas of Cornus in Section 3.1. We then describe the APIs of the protocol in detail in Sections 3.2 and 3.3. Section 3.4 describes how the protocol handles failures and recovery. Section 3.5 proves the correctness of Cornus. Section 3.6 discusses some optimization techniques in Cornus.

Figure 3: Illustration of Cornus — The lifecycle of a committing transaction.

3.1 Design Overview

This section describes the high-level intuition behind Cornus. In particular, we explain why the two properties of a disaggregation architecture (explained in Section 2.2) can reduce the protocol latency and eliminate blocking at the same time.

Latency reduction. Storage disaggregation enables a 1PC design that reduces latency. In the conventional 2PC protocol, the ground truth of a transaction's outcome (i.e., commit or abort) is the coordinator's decision log; in Cornus, the ground truth is instead the collective votes in all participants' logs. For example, a transaction reaches the commit decision once each participant's local log contains VOTE-YES. Such a decision cannot roll back once reached. If a particular node is uncertain about the outcome (e.g., the node times out while waiting for the coordinator's decision), the node can directly check all participating nodes' logs to learn the final decision and thus does not have to rely on the coordinator's decision log. This means the coordinator's decision logging no longer has to be on the critical path of the protocol. In other words, the coordinator can respond to the caller of the transaction immediately after receiving votes from participants, without logging first. Given that a logging operation is quite expensive in a highly available distributed DBMS, this can substantially reduce a transaction's latency. Figure 3 shows the procedure of a committing transaction using Cornus, which saves the latency of one logging operation. This figure can be compared with Figure 1a to see the difference between 2PC and Cornus. Note that the optimization cannot be applied to 2PC in a conventional shared-nothing architecture, because a node cannot directly access the log of another node if the remote node has failed.

Non-blocking. Cornus addresses the blocking problem in 2PC without introducing significant complexity. When a participant experiences a timeout while waiting for the coordinator's decision, it executes the termination protocol. In 2PC, the participants contact the coordinator for the decision and must block if the coordinator cannot be reached (e.g., the coordinator has failed). In Cornus, in contrast, the termination protocol checks all the votes of other participants to learn the final decision. This is doable in a disaggregation architecture because all the logs can be accessed from any computation node, and because the storage layer itself is highly available.

In case a particular vote is missing in the storage (e.g., the corresponding participant failed before logging), the current node running the termination protocol will write an ABORT into the log of the failed node. To guarantee atomicity, we implement the LogIfNotExist() function in the storage layer, which guarantees that only one of the two votes (i.e., VOTE-YES or ABORT) can exist for any transaction on any node. The protocol guarantees that neither the coordinator nor the participants will block due to unknown decisions.

Note that Cornus does not introduce extra assumptions like the previous 1PC protocols mentioned in Section 2.1, besides the storage-disaggregation architecture, making it applicable to more general settings. In the next two sections, we describe the APIs of Cornus on the storage and compute nodes, respectively.
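As an illustration of the collective-votes rule above, the global outcome can be viewed as a pure function over the latest log record of each participating node. The sketch below is for intuition only; the record names follow the paper, but the function itself is not part of Cornus's API.

```python
from typing import Dict, Optional

def global_decision(latest_records: Dict[str, Optional[str]]) -> Optional[str]:
    """Derive a transaction's outcome from the collective logs (Section 3.1):
    latest_records maps each participating node to its most recent record for
    the transaction, or None if nothing has been logged yet."""
    records = latest_records.values()
    if any(r == "ABORT" for r in records):
        return "ABORT"                         # one ABORT anywhere decides abort
    if records and all(r in ("VOTE-YES", "COMMIT") for r in records):
        return "COMMIT"                        # every node has voted yes: commit is decided
    return None                                # some vote is still missing: undecided
```

An ABORT anywhere fixes the outcome to abort, a full set of VOTE-YES records fixes it to commit, and anything else leaves the transaction undecided, which is exactly the state the termination protocol in Section 3.3 resolves.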

3.2 API of the Storage Nodes

We first describe our RPC notation, followed by the functions that are supported on the storage nodes in Cornus.

Remote Procedure Calls (RPC)
In this paper, we model the communication across nodes through RPCs. An RPC can be either synchronous or asynchronous. A synchronous call means the program logic will block until the RPC returns; an asynchronous call means the program can continue executing until it is made to wait explicitly for the response. We represent an RPC using the following notation:

    RPC_{sync/async}^{n}::FuncName()

where the subscript can be sync or async for synchronous and asynchronous RPCs, respectively. The superscript n denotes the destination node of the RPC. Finally, FuncName() is the function that will be called through this RPC on the remote node; the function can take arbitrary arguments if needed.

Log(txn, type)
The Log(txn, type) function simply appends a log record of a certain type to the end of transaction txn's log. It is the log function used in conventional 2PC protocols.

LogIfNotExist(txn, type)
In Cornus, we introduce a new log function, LogIfNotExist(txn, type), to guarantee that different nodes do not write conflicting log records. Cornus uses both types of log functions. LogIfNotExist() is called only when a node logs VOTE-YES (lines 2 and 17 of Algorithm 2) or when a node logs ABORT on behalf of a remote node when running the termination protocol (line 30 of Algorithm 2).

Algorithm 1 shows the pseudocode for the LogIfNotExist() function. When a storage node receives an RPC call on this function, it first checks if a conflicting decision has already been logged for the transaction. If so, the transaction's most recent status in the log (i.e., ABORT, COMMIT, or VOTE-YES) is returned. Otherwise, it appends the requested log record and returns its content.

Algorithm 1: Cornus API on the Storage Nodes — The implementation of the Log() and LogIfNotExist() functions on each storage node.
 1  Function Storage::Log(txn, content)
 2      append content to the local log
 3  Function Storage::LogIfNotExist(txn, content)
 4      if content == VOTE-YES then
 5          # begin atomic section
 6          return ABORT if an ABORT record exists for txn in the log
 7          otherwise log and return VOTE-YES
 8          # end atomic section
 9      else
10          # begin atomic section
11          return VOTE-YES or COMMIT if such a record exists for txn in the log
12          otherwise log ABORT if not exists and return ABORT
13          # end atomic section

Specifically, when a compute node tries to log VOTE-YES on its own storage node (line 4), it checks whether other nodes have already logged ABORT on its behalf. If so, the function returns ABORT (line 6); otherwise, the function logs and returns VOTE-YES (line 7). Note that the check and the append to the log must be done atomically, such that a conflicting log record cannot be appended after the check is performed.

LogIfNotExist() is also called to log ABORT on behalf of another compute node during the termination protocol. In this case, if a VOTE-YES or COMMIT log record already exists, ABORT will not be logged and the existing log record is returned (line 11); otherwise the ABORT decision is logged and returned (line 12). Later, in Section 3.6, we describe in more detail how to implement the LogIfNotExist() function efficiently on storage nodes.
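To make the semantics of Algorithm 1 concrete, here is a minimal single-process sketch in Python. The class and method names are ours, and a mutex stands in for the atomic section; it illustrates the intended behavior rather than the storage service's actual interface.

```python
import threading

class StorageNodeLog:
    """Illustrative sketch of Algorithm 1. Each transaction's records form an
    append-only list; one lock models the atomic check-and-append section."""
    def __init__(self):
        self._log = {}                  # txn_id -> list of records
        self._mutex = threading.Lock()  # stands in for the atomic section

    def log(self, txn_id, record):
        # Log(txn, type): plain append, as used by conventional 2PC.
        with self._mutex:
            self._log.setdefault(txn_id, []).append(record)

    def log_if_not_exist(self, txn_id, record):
        # LogIfNotExist(txn, type): check and append happen atomically, so
        # VOTE-YES and ABORT can never both be written for one transaction.
        with self._mutex:
            existing = self._log.setdefault(txn_id, [])
            if record == "VOTE-YES":
                if "ABORT" in existing:     # another node aborted on our behalf
                    return "ABORT"
                existing.append("VOTE-YES")
                return "VOTE-YES"
            # Otherwise the termination protocol is trying to log ABORT.
            for prior in ("COMMIT", "VOTE-YES"):
                if prior in existing:       # a vote or decision already exists
                    return prior
            existing.append("ABORT")
            return "ABORT"
```

The essential property is that the membership check and the append happen under the same critical section, so VOTE-YES and ABORT can never both be recorded for the same transaction on the same node.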
3.3 API of the Compute Nodes

This section explains in detail how Cornus works on the compute nodes. The pseudocode is shown in Algorithm 2. We highlight the key changes in Cornus in contrast to standard 2PC with a gray background color. In the following, we go through the pseudocode for the coordinator's procedure, the participant's procedure, and the termination protocol.

Algorithm 2: API of the Compute Nodes in Cornus — Assuming a committing transaction. Differences between 1PC and 2PC are highlighted in gray.
 1  Function Coordinator::Start1PC(txn)
 2      RPC_async^{local SN}::LogIfNotExist(VOTE-YES)
 3      for p in txn.participants do
 4          send VOTE-REQ to p asynchronously
 5      wait for all responses from participants and storage node
 6      on receiving ABORT: decision ← ABORT
 7      on receiving all responses: decision ← COMMIT
 8      on timeout: decision ← TerminationProtocol(txn)
 9      reply decision to the txn caller
10      RPC_async^{local SN}::Log(decision)
11      for p in txn.participants do
12          send decision to p asynchronously
13  Function Participant::Start1PC(txn)
14      wait for VOTE-REQ from coordinator
15      on timeout: RPC_sync^{local SN}::Log(ABORT); return
16      if participant votes yes for txn then
17          resp ← RPC_sync^{local SN}::LogIfNotExist(VOTE-YES)
18          if resp is ABORT then
19              reply ABORT to coordinator    # another node has logged ABORT on its behalf
20          else
21              reply VOTE-YES to coordinator
22              wait for decision from coordinator
23              on timeout: decision ← TerminationProtocol(txn)
24              RPC_async^{local SN}::Log(decision)
25      else
26          RPC_sync^{local SN}::Log(ABORT)
27          reply ABORT to coordinator
28  Function TerminationProtocol(txn)
29      for every node p participating in txn other than self do
30          RPC_async^{p.SN}::LogIfNotExist(ABORT)
31      wait for responses
32      on receiving ABORT: decision ← ABORT
33      on receiving COMMIT: decision ← COMMIT
34      on receiving all responses: decision ← COMMIT
35      on timeout: retry from the beginning
36      return decision

Coordinator::Start1PC(txn)
After a transaction txn finishes the execution phase, it starts the atomic commitment protocol by calling Start1PC(txn) at the coordinator. The coordinator logs VOTE-YES asynchronously (line 2) and sends vote requests, along with a list of all nodes involved in the transaction, to all participants simultaneously (lines 3–4). There are two major differences compared to a standard 2PC protocol. First, the coordinator logs through LogIfNotExist() instead of append-only logging. Second, the coordinator logs VOTE-YES instead of START-2PC — this second change is possible because all nodes can recover independently without the involvement of the coordinator, as we will explain in Section 3.4.

Then the coordinator waits for responses from all the participants (line 5). If an ABORT is received, the transaction reaches an abort decision (line 6); if all responses are received and none of them is an ABORT (i.e., all responses are VOTE-YES), the transaction reaches a commit decision (line 7); if there is a timeout, the termination protocol is executed to finalize a decision (line 8). Note that the last condition is different from 2PC, which would unilaterally abort the transaction without running the termination protocol.

Once the decision is reached, it can be replied to the transaction caller immediately, before the decision is logged durably (line 9). This is a key difference between Cornus and 2PC; the latter replies to the caller only after the decision log is flushed. This optimization reduces the caller-observed latency by one logging time. Finally, the coordinator asynchronously writes the decision to its local storage node (line 10) and also broadcasts the decision to all the participants (lines 11–12).

Participant::Start1PC(txn)
The logic executed by a participant is very similar between Cornus and 2PC. A participant waits for a VOTE-REQ message from the coordinator (line 14). If a timeout occurs, the participant can unilaterally abort the transaction (line 15), which can involve rolling back the database state, releasing locks, and logging ABORT.

Once VOTE-REQ is received, the participant votes VOTE-YES or VOTE-NO based on its local state of the transaction. For a VOTE-NO, an ABORT record is written to the storage node and ABORT is replied to the coordinator (lines 26–27). This log can also be asynchronous, following the presumed-abort optimization in conventional 2PC [23].

For a VOTE-YES, the record is logged to the corresponding storage node through LogIfNotExist() (line 17). There are two possible outcomes. If the function returns ABORT, this means another node has already aborted the transaction on behalf of the current node through the termination protocol. In this case, the current node aborts the transaction and returns ABORT to the coordinator (lines 18–19). Otherwise the storage node returns VOTE-YES, in which case the participant also returns VOTE-YES to the coordinator (line 21) and starts waiting for the decision message from the coordinator (line 22). Upon a timeout, the termination protocol is executed (lines 22–23). Upon receiving the decision, it is logged to the storage node (line 24); here we mark the log as asynchronous because other transactional logic, like releasing locks, does not need to wait for this decision log to complete.
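The coordinator-side control flow (Algorithm 2, lines 1–12) can be sketched as follows. The storage, rpc, and caller objects and their method names are hypothetical stand-ins rather than Sundial's API; the point of the sketch is that the caller is answered as soon as a decision is reached, before the decision record becomes durable.

```python
def coordinator_start_1pc(txn, participants, storage, rpc, caller, timeout=1.0):
    """Sketch of Coordinator::Start1PC (Algorithm 2, lines 1-12)."""
    storage.log_if_not_exist_async(txn, "VOTE-YES")              # line 2
    for p in participants:
        rpc.send_async(p, "VOTE-REQ", txn, participants)         # lines 3-4
    try:
        votes = rpc.wait_for_votes(txn, participants, timeout)   # line 5
        decision = "ABORT" if "ABORT" in votes else "COMMIT"     # lines 6-7
    except TimeoutError:
        # Unlike 2PC, the coordinator does not unilaterally abort here.
        decision = termination_protocol(txn, participants, rpc, timeout)  # line 8 (sketched below)
    caller.reply(txn, decision)        # line 9: reply before the decision log is flushed
    storage.log_async(txn, decision)   # line 10: decision log is off the critical path
    for p in participants:
        rpc.send_async(p, decision, txn)                         # lines 11-12
    return decision
```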
TerminationProtocol(txn)
In both 2PC and Cornus, the termination protocol is executed when a compute node times out while waiting for a message and the node cannot unilaterally abort the transaction. In 2PC, the node running the termination protocol contacts all the other nodes for the outcome of the transaction. If any node returns the final outcome, the uncertainty is resolved. However, in certain corner cases, no active node has the outcome — for example, the coordinator has failed right before sending out the final outcome to any participant. In this case, the transaction can neither commit nor abort, and must block with all the locks held, until the failed node has been recovered.

Cornus avoids the problem described above. Specifically, the node running the termination protocol contacts all the participating storage nodes rather than peer compute nodes, by trying to log an ABORT record to each storage node through LogIfNotExist() (lines 29–30). If the remote storage node has already received a decision log record (i.e., COMMIT or ABORT) for this transaction, that decision will be returned and followed by the current node (lines 32–33). If the remote storage node has not received any log record yet, the ABORT record will be logged and returned (line 32). The last case is that a VOTE-YES record is logged at the remote node and is returned; if the current node receives such responses from all the remote storage nodes, the transaction also reaches a commit decision (line 34). Finally, if the current node experiences a timeout again, it retries the termination protocol (line 35).

Note that as long as the storage nodes are accessible, the protocol above is non-blocking. A compute node can always reach a decision with a small number of messages. The only case in which Cornus will run into blocking (line 35) is when the storage service cannot be reached. However, as discussed in Section 2.2, we assume this case is rare with a storage service that maintains high availability on its own.
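A matching sketch of TerminationProtocol (Algorithm 2, lines 28–36), using the same hypothetical rpc layer as the coordinator sketch above; p.storage_node is likewise an assumed attribute.

```python
def termination_protocol(txn, participants, rpc, timeout=1.0):
    """Sketch of TerminationProtocol (Algorithm 2, lines 28-36). The uncertain
    node appeals directly to every other participant's storage node instead of
    to the (possibly failed) coordinator."""
    while True:
        for p in participants:  # every node participating in txn other than self
            # Try to log ABORT on p's behalf; the storage node returns the
            # existing VOTE-YES/COMMIT/ABORT record if one is already there.
            rpc.send_async(p.storage_node, "LogIfNotExist", txn, "ABORT")   # lines 29-30
        try:
            responses = rpc.wait_for_responses(txn, participants, timeout)  # line 31
        except TimeoutError:
            continue                             # line 35: storage unreachable, retry
        if "ABORT" in responses:
            return "ABORT"                       # line 32
        if "COMMIT" in responses:
            return "COMMIT"                      # line 33
        if len(responses) == len(participants):
            return "COMMIT"                      # line 34: every log contains VOTE-YES
```

Because the requests go to the storage nodes rather than to the coordinator, the loop terminates as soon as the storage layer is reachable, which is the non-blocking argument made above.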

Figure 4: Cornus under Failures — The behavior of Cornus under two failure scenarios: (a) the coordinator fails before sending the decision; (b) a participant fails before logging its vote.

3.4 Failure and Recovery

This section discusses the behavior of Cornus when failures occur. For simplicity, we discuss cases where only a single node fails at a time. Specifically, Table 1 and Table 2 summarize the system behavior when the coordinator and a participant fails, respectively. The tables also describe the behavior of the failed node after it is recovered.

Coordinator Failure
Table 1 lists the system behaviors if the coordinator fails at different points of the Cornus protocol.

Case 1: The coordinator fails before the protocol starts. In this case, a participant will experience a timeout waiting for VOTE-REQ (line 15 in Algorithm 2). Therefore, all the participants will unilaterally abort the transaction locally. After the coordinator is later recovered, it can run the termination protocol to learn this outcome.

Case 2: The coordinator fails after sending some but not all vote requests. For participants that did not receive the request, the behavior is the same as in Case 1, namely, they will unilaterally abort the transaction. Participants that received the request will log their votes to the storage nodes, send responses back to the coordinator, and experience a timeout while waiting for the final decision because the coordinator has failed (line 23 in Algorithm 2). Then they will run the termination protocol. The protocol checks the votes of the participants that have unilaterally aborted the transaction; it either learns the abort decision or appends an ABORT to their logs, thereby aborting the transaction. Once the coordinator is recovered from the failure, it learns the outcome through the termination protocol.

Case 3: The coordinator fails after sending all the vote requests but before sending out any decision. In this case, all participants have logged their votes to the storage nodes. They will all time out while waiting for the decision from the coordinator. They will all run the termination protocol to learn the final outcome of the transaction and act accordingly. After the coordinator is recovered, it will also run the termination protocol to learn the outcome. Note that if this scenario occurs in 2PC, the participants will block instead.

Figure 4a illustrates an example of such a case. The figure can be compared with Figure 1b, showing how Cornus avoids blocking. After a participant's timeout, instead of contacting the coordinator, which has failed, it contacts all the storage nodes using the LogIfNotExist() function. Since all nodes have VOTE-YES in their logs, each participant learns the decision of COMMIT and avoids blocking.

Case 4: The coordinator fails after sending out the decision to some but not all participants. For participants that have already received the decision, their local 1PC protocol will terminate. The other participants will time out waiting for the decision and run the termination protocol to learn the final decision.

Case 5: The coordinator fails after sending out the decision to all participants. In this case, all participants have completed the local 1PC protocol, so the failure has no effect on them. After the coordinator is recovered, if the final decision has been logged in its storage node, that decision is used. Otherwise, the coordinator runs the termination protocol to learn the final decision.

Participant Failure
Table 2 lists the effects of a participant failure at different points of the Cornus protocol.

Case 1: The participant fails before receiving the vote request from the coordinator. In this case, the coordinator will experience a timeout waiting for all responses from participants (line 8 in Algorithm 2) and then run the termination protocol. The coordinator will log an ABORT record for the failed participant, thereby aborting the transaction. The coordinator will then broadcast the decision to the remaining participants. It is also possible that another participant also has a timeout and initiates the termination protocol; the end effect would be the same. After the failed participant is recovered, it runs the termination protocol to learn the final decision.

Case 2: The participant fails after it receives the vote request but before its vote is logged. In this case, the behavior of the system is the same as in Case 1. This is because, from the other nodes' perspective, the behavior of the failed node is identical.

Figure 4b shows an example of such a case. The coordinator receives a VOTE-YES from participant 2 and times out while waiting for the response from participant 1. At this point, the coordinator runs the termination protocol by issuing LogIfNotExist() to the storage node of every participant, trying to log ABORT on each participant's behalf. As the coordinator logs an ABORT for participant 1 and learns that participant 2 has already logged VOTE-YES, it reaches a decision of abort and sends it to participant 2. For simplicity, the example assumes only the coordinator has a timeout. It is possible that participant 2 also experiences a timeout, which will lead to the same outcome.
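The per-case recovery behaviors, summarized in Tables 1 and 2 below, reduce to one simple rule for a recovered compute node. The sketch uses the same hypothetical helpers as the earlier ones.

```python
def recover_node(txn, my_log, participants, rpc):
    """Sketch of the recovery rule behind Tables 1 and 2 (Section 3.4).
    Any COMMIT or ABORT record in the node's own log already fixes the
    transaction's outcome, so only the uncertain case needs extra work."""
    records = my_log.read(txn)
    if "COMMIT" in records:
        return "COMMIT"
    if "ABORT" in records:          # unilateral abort or a logged abort decision
        return "ABORT"
    # Only VOTE-YES (or nothing) is logged: the outcome is unknown locally,
    # so the recovered node runs the termination protocol to learn it.
    return termination_protocol(txn, participants, rpc)
```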

Table 1: Effects of Coordinator Failures.

| Time of Coordinator Failure | Effect of Failure | After Node is Recovered |
| --- | --- | --- |
| Before 1PC starts | Participants (if any) will time out and unilaterally abort the transaction. | Abort the transaction through the termination protocol. |
| After sending some vote requests | Participants that did not receive the request will time out and abort unilaterally. Participants that received the request will time out waiting for the decision and execute the termination protocol, which aborts the transaction. | Run the termination protocol to learn the decision, which is abort. |
| After sending all vote requests but before sending any decision | All participants will time out while waiting for the decision and execute the termination protocol to learn the outcome. | Run the termination protocol to learn the decision. |
| After sending some decisions | Participants that did not receive the decision will run the termination protocol to learn the decision. | Same as above. |
| After sending all decisions | No effect. | If the decision is logged, follow that decision; otherwise run the termination protocol to learn it. |

Table 2: Effects of Participant Failures.

| Time of Participant Failure | Effect of Failure | After Node is Recovered |
| --- | --- | --- |
| Before receiving the vote request | The coordinator will time out and run the termination protocol to abort the transaction. | Abort the transaction. |
| After receiving the vote request, before logging the vote | Same as above. | Same as above. |
| After logging the vote, before replying to the coordinator | The coordinator will run the termination protocol to see the vote and learn the final outcome. | Abort the transaction if the local vote is abort; otherwise run the termination protocol to learn the outcome. |
| After replying the vote to the coordinator | No effect. | If a decision log exists, follow the decision. Otherwise, same as above. |

Case 3: The participant fails after it logs the vote but before replying to the coordinator. In this case, the coordinator will experience a timeout waiting for votes and run the termination protocol. Then it can see all the participants' votes from their storage nodes and learn the outcome. The remaining participants will learn the decision either from the coordinator or from running the termination protocol themselves. After the failed participant is recovered, it will abort the transaction if its local vote is an abort; otherwise, it will run the termination protocol to learn the outcome.

Case 4: The participant fails after sending out the vote. This failure does not affect the rest of the nodes — the coordinator and remaining participants execute the rest of the protocol normally. After the failed participant is recovered, it will follow its own decision log if one exists; otherwise, it will run the termination protocol to learn the outcome.

3.5 Proof of Correctness

This section formally proves the correctness of Cornus. Our proof follows the structure used in [12], where an atomic commit protocol is proven through five separate properties (AC1–5, see below). We start by introducing the following definition of a global decision of a distributed transaction.

Definition 1 [Global Decision]: A transaction reaches a COMMIT global decision if all nodes have logged VOTE-YES; it reaches an ABORT global decision if any node has logged ABORT. Otherwise the decision is undetermined.

We first introduce and prove the following lemma.

Lemma 1 [Irreversible Global Decision]: Once a global decision is reached for a transaction, the decision will not change.
Proof: There are two cases to consider. In case 1, an abort global decision has been reached, meaning one log contains an ABORT record. According to the semantics of LogIfNotExist(), no VOTE-YES can be appended to that log anymore, meaning that the global decision cannot switch to commit. In case 2, a commit global decision is reached and all nodes have VOTE-YES in their logs. The only way to append an ABORT record is through the termination protocol. But according to the logic of LogIfNotExist() (i.e., Algorithm 1), ABORT will not be appended since VOTE-YES already exists. □

We now prove the five properties in order.

Theorem 1 [AC1]: The decision of each participant is identical to the global decision.
Proof: A participant can learn the global decision in two ways: (1) by receiving the decision from the coordinator (line 22 in Algorithm 2) or (2) by running the termination protocol (line 23 in Algorithm 2). Following the protocol, the decision at the coordinator is identical to the global decision, and thus in the first case the participant's decision is also identical to the global decision. In the second case, the termination protocol collects the votes of each individual participant and reaches a local decision that is identical to the global one. □

Theorem 2 [AC2]: A participant cannot reverse its decision after it has reached one.
Proof: Due to Lemma 1, once a global decision is reached, it cannot reverse. Due to Theorem 1, each participant will reach the same decision as the global decision, finishing the proof. □

The correctness of 2PC requires the following two properties according to [12].
AC3: The commit decision can only be reached if all participants voted Yes.
AC4: If there are no failures and all participants voted Yes, then the decision will be to commit.
For the proof of Cornus, we combine the two properties into the following theorem.

Theorem 3 [AC3&4]: The decision of a transaction is a commit if and only if all participants vote Yes and write VOTE-YES to their corresponding logs.
Proof: According to Definition 1, a transaction's global decision is a commit if and only if all participants write VOTE-YES to their corresponding logs. According to Theorem 1, the decision of each participant is identical to the global decision, finishing the proof. □

Finally, 2PC requires the following property.
AC5: Consider any execution containing only failures that the algorithm is designed to tolerate. At any point in this execution, if all existing failures are repaired and no new failures occur for sufficiently long, then all processes will eventually reach a decision.
For Cornus, we prove the following theorem, which achieves a stronger property.

Theorem 4 [AC5]: Assuming the storage layer is fault tolerant, with any failures that occur to the compute nodes, the remaining participants will always reach a decision without requiring the failed nodes to be recovered.
Proof: Since the storage layer is fault tolerant, once a global decision is reached, an active participant can always learn the global decision through the termination protocol. The only case where a decision cannot be reached is when one participant fails to log its vote. In this case, a coordinator or participant that experiences a timeout will run the termination protocol and directly write an ABORT into the pending participant's log, which enforces a global decision. □

3.6 Optimizations

This section discusses some optimization techniques for Cornus, including optimizing the LogIfNotExist() function and optimizing for read-only transactions.

Optimizing LogIfNotExist()
The behavior of LogIfNotExist() depends on the current content of the log (e.g., whether a particular log record exists). With a naive implementation, the system would scan the entire log on the storage node to decide whether a vote has already been logged for a particular transaction. Since the log can be large, the naive solution can lead to very long processing times. Below we describe two techniques to reduce this overhead.

We observe that LogIfNotExist() is called in only two cases (according to Algorithm 2): (1) a coordinator or participant logs VOTE-YES, or (2) the termination protocol logs ABORT. The termination protocol is called only after a timeout, hence the first case is much more common than the second. We propose the following two techniques to optimize for both cases.

Optimize for logging VOTE-YES: We propose the following design to optimize for the common case. Each storage node maintains a hash table for ABORT records from remote nodes (i.e., initiated through the termination protocol). During normal processing, a VOTE-YES looks up the hash table (which should be empty in the common case). If no ABORT exists in the hash table for the transaction, then VOTE-YES can be written without scanning the entire log. Otherwise, if an ABORT exists in the hash table for the transaction, the storage node immediately replies with it to the compute node. With this optimization, the overhead of logging VOTE-YES through LogIfNotExist() becomes minimal.

Optimize for logging ABORT in the termination protocol: The naive implementation scans the entire log searching for VOTE-YES upon receiving a LogIfNotExist() request for an ABORT record. Although the termination protocol is rarely called, we still want to reduce this overhead. Our idea is to identify a watermark position in the log such that the VOTE-YES cannot be located before the watermark. Therefore, only the tail of the log after the watermark needs to be scanned. There are many ways to identify such a safe watermark. For example, every network message can be associated with the current log sequence number (LSN) of the corresponding storage node. The coordinator of a transaction can collect these LSNs and send them to participants during the commit procedure. For every node, we know for sure that no log record can possibly exist for the transaction before the collected LSN, because the transaction does not touch the corresponding node before that LSN. Such LSNs can serve as our watermarks.

In summary, with the two techniques described above, LogIfNotExist() requests sent during normal execution (lines 2 and 17 in Algorithm 2) only require an in-memory hash table lookup; LogIfNotExist() requests sent during the termination protocol (line 30 in Algorithm 2) require scanning only the tail of the log after the watermark, which is much smaller than the entire log.

Optimizing for Read-only Transactions
Conventional 2PC can be optimized for read-only transactions [23]. Specifically, if a node is read-only, it does not need to log anything during the prepare phase and can simply release locks and end the local transaction. In 1PC, this optimization has a small subtlety. If the coordinator does not know that a participant is actually read-only, the participant cannot avoid logging VOTE-YES, because otherwise it might be aborted by a remote node executing the termination protocol. This may cause a performance degradation. However, we believe that in many practical scenarios the coordinator does know that a participant is read-only during the execution phase. Then, when 1PC starts, the coordinator can send the list of non-read-only nodes (rather than the list of all nodes) together with the prepare requests to participants. This way, 1PC no longer needs to log VOTE-YES for read-only nodes. Even if some nodes run the termination protocol, they will not check the logs of read-only nodes.
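The two LogIfNotExist() optimizations above (the in-memory hash table for remote ABORTs and the LSN watermark that bounds the scan) can be sketched as follows. The data structures and names are ours, not the paper's implementation; the watermark is assumed to be piggybacked on messages as described above.

```python
import threading

class OptimizedStorageNode:
    """Illustrative sketch of the Section 3.6 optimizations for LogIfNotExist()."""
    def __init__(self):
        self._mutex = threading.Lock()
        self._remote_aborts = set()   # txn ids aborted on behalf of this node
        self._log = []                # list of (lsn, txn_id, record)

    def current_lsn(self):
        return len(self._log)         # piggybacked on messages as the watermark

    def log_vote_yes(self, txn_id):
        # Common case: consult the in-memory hash table instead of the log.
        with self._mutex:
            if txn_id in self._remote_aborts:
                return "ABORT"
            self._log.append((len(self._log), txn_id, "VOTE-YES"))
            return "VOTE-YES"

    def log_abort_on_behalf(self, txn_id, watermark_lsn):
        # Termination protocol: only the log tail after the watermark can
        # possibly contain this transaction's records, so scan just that tail.
        with self._mutex:
            for _, tid, record in self._log[watermark_lsn:]:
                if tid == txn_id and record in ("VOTE-YES", "COMMIT"):
                    return record
            self._remote_aborts.add(txn_id)
            self._log.append((len(self._log), txn_id, "ABORT"))
            return "ABORT"
```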
In the x-axis we change The communication across nodes is implemented through gRPC [1] the amount of distributed transactions from 0% to near 100%. We and the communication can be either synchronous or asynchronous. report the average and the 99th tail latency of both Cornus and Each node in the system has a gRPC client for issuing remote re- 2PC in lines and the latency speedup of Cornus over 2PC in bars. quests and a gRPC server for receiving remote requests. Each node As the figure shows, Cornus’s latency speedup increases asthe manages a pool of server threads to handle remote requests. number of distributed transactions increases. The maximum speedup 4.1.2 Hardware. Most of the experiments are performed on a is about 1.4×, which is achieved when nearly all the transactions cluster with up to eight servers running Ubuntu 18.04 on Cloud- are distributed. The speedup of tail latency is similar to that of the Lab [17]. Half of the servers serve as compute nodes and the others average latency. serve as storage nodes. Each server contains two Intel Xeon Silver Figure 5b shows the latency breakdown for both local and dis- 4114 CPUs (10 cores × 2 HT) and 196 GB DRAM. The servers are tributed transactions for both Cornus and 2PC, when majority (i.e., connected together with a 10 Gigabit Ethernet. The logs are stored 97%) of transactions are distributed. For local transactions, there is on Intel DC S3500 480 GB SATA SSD. no need to run the commit protocol so there is no prepare phase but only the commit phase. We notice that for local transactions, Cornus 4.1.3 Workloads. We use two different OLTP workloads for per- is slight slower than 2PC. This is because Cornus allows more trans- formance evaluation. All transactions are executed as stored proce- actions to run concurrently, incurring more resource contention. For dures that contain program logic intermixed with queries. distributed transactions, Cornus can almost completely eliminate the YCSB: The Yahoo! Cloud Serving Benchmark [14] is a synthetic latency of the commit phase that a user experiences. This is mainly benchmark modeled after cloud services. It contains a single table due to two reasons: (1) As discussed in Section 3.1, the decision that is partitioned across servers in a round-robin fashion. Each logging in the commit phase can be done asynchronously in both the partition contains 10 GB data with 1 KB tuples. Each transaction coordinator and the participants, thereby eliminating the extra delay. accesses 16 tuples as a mixture of reads (50%) and writes (50%) (2) The logging overhead in cloud storage is significantly higher with on average 5% of the accesses being remote (selected uniformly than a simple network round-trip. at random). The queries access tuples following a power law distri- Figure 5c shows the throughput of 2PC and Cornus as the per- bution controlled by a parameter 휃. By default, we use 휃 = 0, which centage of distributed transactions increases. We see that Cornus means data access is uniformally distributed. constantly outperforms 2PC in throughput. High throughput im- TPC-C: This is a standard benchmark for evaluating the perfor- provement is achieved when more transactions are distributed. The mance of OLTP DBMSs [29]. TPC-C models a warehouse-centric gain in throughput speedup with repect to increased distributed order processing application that contains five transaction types. All transactions is relatively small compared to latency speedup. 
Note the tables except ITEM are partitioned based on the warehouse ID. that throughput is not the primary metric that Cornus is trying to By default, the ITEM table is replicated at each server. We use a sin- improve; in later experiments, we will focus more on the latency gle warehouse per server to model high contention. Each warehouse speedup instead of throughput speedup. contains around 100 MB of data. For all the five transactions, 10% of NEW-ORDER and 15% of PAYMENT transactions access the 4.3 Percentage of read-only transactions data across multiple servers; other transactions access the data on a We now evaluate the performance of Cornus under YCSB with single server. different percentage of read-only transactions. In our system, read 4.1.4 Implementation Details and Parameter Setup. Unless ratio is a per request setting and we control the percentage of read- otherwise specified, we will use the following default parameter only transactions indirectly by controlling the read ratio of each settings: we will evaluate the system on four compute nodes and data access request. In the experiment, we expect Cornus to obtain four storage nodes. The number of worker threads executing the latency speedup only for read-write transactions since both Cornus Guo, Zeng, Ren and Yu, et al.


Figure 5: Percentage of Distributed Transactions — YCSB with varying percentage of distributed transactions. (a) Latency; (b) latency breakdown (97% distributed transactions); (c) throughput.


Figure 6: YCSB with varying percentage of read-only transactions. (a) Latency; (b) latency breakdown; (c) throughput.


Figure 7: YCSB with varying logging delay. (a) Latency; (b) latency breakdown; (c) throughput.

The results shown in Figure 6a match the expectation that the improvement of Cornus increases as the percentage of read-only transactions decreases. When all transactions are read-only, Cornus and 2PC have the same performance. When there are more than 80% read-write transactions, the average latency speedup of Cornus over 2PC is nearly 1.5×.
The results in Figure 6b show that when all transactions are read-only, Cornus and 2PC perform the same, as expected. The commit phases in both protocols are short due to the optimizations for read-only transactions described in Section 4.1.4. The prepare phases are the same because in our implementation we do not omit the prepare phase, and there is no logging in the prepare phase but only vote requests. When all transactions are read-write, we can clearly see that Cornus eliminates the commit phase in the latency breakdown. We still see that Cornus is slightly slower in the prepare phase, which we ascribe to increased network traffic and resource contention.
Figure 6c shows that there is still no significant benefit for Cornus on throughput for this workload; the numbers are close for both protocols. For the speedup in throughput, although there are some variations in the middle due to the close numbers, we can still see that when all the transactions are read-write, Cornus achieves the best speedup in throughput.

4.4 Logging Delay

In real-world applications, the time spent on logging varies due to factors like variance in network latency and different choices of underlying storage services. For example, a geo-replicated, highly available storage service can be orders of magnitude slower than a local disk. In this experiment, we evaluate the performance of Cornus as the latency of logging increases. We simulate the effect by introducing artificial delays in logging.
Figure 7a shows that the speedup of Cornus increases as the latency of logging increases. With 32 ms of extra logging delay introduced, Cornus can achieve a 2× speedup with respect to 2PC in average latency. Figure 7b further demonstrates that the improvements in Cornus come from the savings on logging in the commit phase. Figure 7c again shows a constant but slight benefit of Cornus over 2PC in throughput.
These results indicate that Cornus is particularly beneficial when the storage layer incurs longer delays. This is the case when compute nodes and storage nodes are geo-distributed and the storage needs to maintain consistency among multiple replicas, as done in modern cloud storage services.

Figure 8: YCSB varying data distribution θ. (a) Latency (log scale on y-axis); (b) latency breakdown; (c) throughput.

Figure 9: TPC-C varying the number of warehouses. (a) Latency; (b) latency breakdown; (c) throughput.

These results indicate that Cornus is particularly beneficial when the storage layer incurs a longer delay. This is the case when compute nodes and storage nodes are geo-distributed and the storage must maintain consistency among multiple replicas, as modern cloud storage services do.

4.5 Contention

In this section we evaluate the performance of Cornus under different levels of contention on the YCSB and TPC-C workloads, respectively.

We vary the level of contention in YCSB by adjusting the Zipfian distribution of data accesses through θ, as mentioned in Section 4.1.3; the larger the θ, the higher the level of contention. Figure 8a shows the results. Overall, the improvement of Cornus can be close to 2× at high contention. Interestingly, the speedup of Cornus follows a v-curve as θ increases: it first decreases until θ reaches 0.9 and then increases as θ grows beyond 0.9.

Figure 8b suggests an interpretation of the v-curve. When contention is low (θ < 0.9), aborts are rare and the main benefit of Cornus comes from the time saved in the commit phase. However, when contention is high (θ ≥ 0.9), the time spent on aborts increases significantly and becomes a dominant factor in the overall latency. In this case, transactions in Cornus hold locks for a shorter time due to the smaller average latency, so Cornus starts to gain improvements from having fewer aborts. Figure 8c shows a similar pattern for throughput: Cornus benefits because of the shorter lock holding time.
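As a reference for how θ controls skew, the following is a minimal and deliberately unoptimized Zipfian sampler. It is an illustrative sketch, not the generator used in our YCSB implementation.

    import random

    def make_zipfian_sampler(num_keys, theta):
        """Return a sampler of key indices in [0, num_keys) with Zipfian skew.
        theta = 0 gives uniform access; larger theta concentrates accesses on
        a few hot keys and therefore raises contention."""
        weights = [1.0 / (rank ** theta) for rank in range(1, num_keys + 1)]
        total = sum(weights)
        cdf, acc = [], 0.0
        for w in weights:
            acc += w / total
            cdf.append(acc)

        def sample():
            r = random.random()
            for idx, c in enumerate(cdf):  # linear scan: O(n), fine for a sketch
                if r <= c:
                    return idx
            return num_keys - 1

        return sample

    # theta = 0.9 makes a small set of hot keys absorb most accesses.
    hot_skewed = make_zipfian_sampler(num_keys=1024, theta=0.9)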

We also evaluate the effect of contention on TPC-C by controlling the number of warehouses. In Figure 9a, the speedup of Cornus increases as the level of contention decreases. The benefit from saving one logging operation diminishes as the level of contention increases and the time spent on aborts dominates the runtime, as shown in Figure 9b. Figure 9c shows that Cornus and 2PC have similar throughput due to the low percentage of distributed transactions in TPC-C.

Figure 10: Scalability. (a) YCSB latency; (b) YCSB throughput; (c) TPC-C latency; (d) TPC-C throughput.

4.6 Scalability Analysis

Finally, we evaluate the scalability of Cornus on both YCSB and TPC-C as the number of compute nodes varies from 2 to 8. We set the parameters to the default values described in Section 4.1.3. We run each setup five times with a 40-second runtime each. The results in Figure 10 demonstrate that both Cornus and 2PC scale well on YCSB and TPC-C. The latency of both 2PC and Cornus remains the same as the number of nodes increases, and the throughput of both protocols increases linearly with the number of nodes. The speedup in latency and throughput also remains constant.

5 RELATED WORK

This section describes additional related work on reducing the latency of 2PC and on solving the blocking problem of 2PC.

5.1 Prior Work on One-Phase Commit

Many previous works propose to remove one phase from 2PC by combining the voting phase with the execution of the transaction, which means the log must be forced before the commit procedure starts. There are two common approaches. The first approach was originally introduced as early prepare (EP) [26]: participants are required to force logging before acknowledging every remote operation, which introduces blocking I/O for every remote operation. To address this overhead, other works propose that each participant send its log along with the acknowledgement to the coordinator, and the coordinator forces the log before the commit phase. This approach is applied in coordinator log (CL) [26, 27], implicit yes vote (IYV) [9], and Lee and Yeom's protocol [22] for in-memory databases. Although some protocols try to reduce the amount of log data at the coordinator, these approaches all suffer from larger acknowledgement messages. Furthermore, these protocols lose site autonomy, a property that requires the recovery of a participant to rely largely on its own logs; instead, they rely on the coordinator's log to recover a participant [7, 8]. To preserve site autonomy and avoid the communication cost of piggybacking redo logs during normal processing, Abdallah et al. [7, 8] propose to use logical logging and to log operations instead of values at the coordinator before issuing remote operations.

However, all these protocols that embed the voting into the execution phase place strict restrictions on the choice of concurrency control protocol. Such protocols assume a transaction can only commit after acknowledgments for all operations are executed successfully, and that no aborts due to serialization or consistency are allowed afterwards [8]. This assumption is incompatible with concurrency control protocols in which aborts due to serializability validation may occur after the execution of an operation, such as optimistic concurrency control. These strong assumptions make such 1PC protocols impractical for real-world systems.

Other works save one phase for a specific use case or system. For example, Congiu et al. [13] designed a 1PC protocol tailored for metadata services; in their design, the voting phase is cut because only two nodes (metadata servers) are involved. The recent parallel commits protocol [6, 31] in CockroachDB [28] removes one network roundtrip from 2PC. It bears similarity to Cornus in that it also considers a transaction committed once all the participants' writes succeed. However, the proposal is designed for a specific system with a specific architecture and concurrency control protocol. A formal specification of the protocol and proofs have not been published yet, and thus it is unclear whether the protocol can be made as general as 2PC and integrated into any cloud database. A comparison with Cornus can be done in the future if details of the protocol are published.
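To make the early-prepare idea described at the beginning of this subsection concrete, the sketch below shows a participant that forces its redo log before acknowledging each remote operation, so that every acknowledgement doubles as a durable yes vote and the explicit prepare round can be skipped. This is our own illustration of the EP approach under assumed log and storage interfaces, not code from any of the cited systems.

    class EarlyPrepareParticipant:
        """Illustrative early-prepare (EP) participant: the vote is folded
        into execution by forcing the log before every acknowledgement."""

        def __init__(self, log, storage):
            self.log = log          # durable log handle (assumed interface)
            self.storage = storage  # local data store (assumed interface)

        def handle_remote_op(self, txn_id, op):
            result = self.storage.execute(txn_id, op)  # run the operation
            self.log.force_write(txn_id, op)           # blocking I/O on every remote operation
            return ("ACK", result)                     # the ack is an implicit yes vote

        def handle_decision(self, txn_id, decision):
            # Because every acknowledged operation is already durable, the
            # coordinator sends the decision directly, with no prepare phase.
            self.log.write(txn_id, decision)
            if decision == "COMMIT":
                self.storage.commit(txn_id)
            else:
                self.storage.abort(txn_id)

The sketch also makes the cost visible: every remote operation now pays a forced log write, which is the overhead that coordinator-log-style variants shift to the coordinator at the price of larger acknowledgements and lost site autonomy.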
5.2 Prior Work on Non-blocking Atomic Commitment Protocols

Skeen [25] showed the necessary and sufficient conditions for a correct non-blocking commitment protocol, known as the fundamental nonblocking theorem. Specifically, nodes in 2PC have four possible states: initial, wait, abort, and commit. The paper points out that, under the same assumptions made in 2PC, a non-blocking protocol can be achieved by introducing another state, a buffer state, in the transition from the wait state to the commit state. Adding such a state requires one more network roundtrip, and the protocol becomes three-phase commit (3PC). Although it solves the blocking issue, 3PC exacerbates the latency problem of 2PC.

Babaoglu and Toueg [11] proposed a non-blocking atomic commitment protocol based on 2PC. The protocol applies three strategies: (1) synchronizing clocks on different nodes so that out-of-time messages can be ignored; (2) having participants forward the decision to other participants upon receiving the message from the coordinator (the Uniform Timed Reliable Broadcast algorithm); and (3) presuming abort instead of running a termination protocol upon timeout. The first two strategies enable the last strategy to eliminate the chance of blocking. However, the protocol introduces more communication across nodes even when no failure happens, and its correctness relies on synchronized clocks, which is a non-trivial requirement for real-world applications.

Similar to Babaoglu and Toueg's protocol, EasyCommit [19] solves the problem by requiring participants to forward the decision to each other upon receiving it but before logging it. When a participant times out, a leader is elected. The leader then consults all the active nodes for the decision and can decide to abort if no active node has a decision. However, the protocol satisfies the atomic commitment properties only under the assumption that the forwarded message will be delivered to at least one node without delay or loss before the log of the decision is flushed. Moreover, it still introduces extra communication overhead when no failure occurs, as well as the complexity of leader election when a failure happens.
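To make the state argument concrete, the toy transition tables below contrast the 2PC participant states with the buffered state that 3PC inserts between wait and commit. The state name PRE_COMMIT and the table encoding are our own illustration, not notation from [25].

    # Allowed participant state transitions, restricted to the states discussed above.
    TWO_PC = {
        "INITIAL": {"WAIT", "ABORT"},    # vote yes -> WAIT, vote no -> ABORT
        "WAIT":    {"COMMIT", "ABORT"},  # blocked here if the coordinator fails before deciding
        "COMMIT":  set(),
        "ABORT":   set(),
    }

    THREE_PC = {
        "INITIAL":    {"WAIT", "ABORT"},
        "WAIT":       {"PRE_COMMIT", "ABORT"},  # the extra buffer state
        "PRE_COMMIT": {"COMMIT"},               # one more roundtrip, but no blocking
        "COMMIT":     set(),
        "ABORT":      set(),
    }

The buffer state is what allows an uncertain participant to learn the outcome from its peers, at the cost of the additional roundtrip described above.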

6 CONCLUSION

We proposed Cornus, a one-phase commit protocol designed for the storage disaggregation architecture that is widely used in cloud databases. Cornus solves both the long latency and the blocking problem in 2PC at the same time, while not introducing extra impractical assumptions. We formally proved the correctness of Cornus and proposed some optimizations. Our evaluations on two benchmarks show a speedup of 1.5× in latency.

REFERENCES
[1] 2015. gRPC: A high performance, open-source universal RPC framework. https://grpc.io/.
[2] 2018. Amazon Athena — Serverless Interactive Query Service. https://aws.amazon.com/athena.
[3] 2018. Amazon Redshift. https://aws.amazon.com/redshift.
[4] 2018. Amazon S3. https://aws.amazon.com/s3/.
[5] 2018. Presto. https://prestodb.io.
[6] 2020. Parallel Commits. https://www.cockroachlabs.com/docs/v20.2/architecture/transaction-layer.html#parallel-commits
[7] Maha Abdallah. 1997. A non-blocking single-phase commit protocol for rigorous participants. In Proceedings of the National Conference Bases de Données Avancées. Citeseer.
[8] Maha Abdallah, Rachid Guerraoui, and Philippe Pucheral. 1998. One-phase commit: does it make sense?. In Proceedings 1998 International Conference on Parallel and Distributed Systems (Cat. No. 98TB100250). IEEE, 182–192.
[9] Y. Al-Houmaily and P. Chrysanthis. 1995. Two-phase commit in gigabit-networked distributed databases. In Int. Conf. on Parallel and Distributed Computing Systems (PDCS).
[10] Michael Armbrust, Reynold S. Xin, Cheng Lian, Yin Huai, Davies Liu, Joseph K. Bradley, Xiangrui Meng, Tomer Kaftan, Michael J. Franklin, Ali Ghodsi, et al. 2015. Spark SQL: Relational Data Processing in Spark. In SIGMOD.
[11] Ozalp Babaoglu and Sam Toueg. 1993. Understanding non-blocking atomic commitment. Distributed Systems (1993).
[12] Philip A. Bernstein. 1987. Concurrency Control and Recovery in Database Systems. Vol. 370. Addison-Wesley, New York.
[13] Giuseppe Congiu, Matthias Grawinkel, Sai Narasimhamurthy, and André Brinkmann. 2012. One phase commit: A low overhead atomic commitment protocol for scalable metadata services. In 2012 IEEE International Conference on Cluster Computing Workshops. IEEE, 16–24.
[14] Brian F. Cooper, Adam Silberstein, Erwin Tam, Raghu Ramakrishnan, and Russell Sears. 2010. Benchmarking cloud serving systems with YCSB. In Proceedings of the 1st ACM Symposium on Cloud Computing. 143–154.
[15] Benoit Dageville, Thierry Cruanes, Marcin Zukowski, Vadim Antonov, Artin Avanes, Jon Bock, Jonathan Claybaugh, Daniel Engovatov, Martin Hentschel, Jiansheng Huang, et al. 2016. The Snowflake Elastic Data Warehouse. In SIGMOD.
[16] Aleksandar Dragojević, Dushyanth Narayanan, Edmund B. Nightingale, Matthew Renzelmann, Alex Shamis, Anirudh Badam, and Miguel Castro. 2015. No Compromises: Distributed Transactions with Consistency, Availability, and Performance. In SOSP. 54–70.
[17] Dmitry Duplyakin, Robert Ricci, Aleksander Maricq, Gary Wong, Jonathon Duerig, Eric Eide, Leigh Stoller, Mike Hibler, David Johnson, Kirk Webb, Aditya Akella, Kuangching Wang, Glenn Ricart, Larry Landweber, Chip Elliott, Michael Zink, Emmanuel Cecchet, Snigdhaswin Kar, and Prabodh Mishra. 2019. The Design and Operation of CloudLab. In Proceedings of the USENIX Annual Technical Conference (ATC). 1–14. https://www.flux.utah.edu/paper/duplyakin-atc19
[18] Rachid Guerraoui, Mikel Larrea, and André Schiper. 1995. Non blocking atomic commitment with an unreliable failure detector. In Proceedings 14th Symposium on Reliable Distributed Systems. IEEE, 41–50.
[19] Suyash Gupta and Mohammad Sadoghi. 2018. EasyCommit: A Non-blocking Two-phase Commit Protocol. In EDBT. 157–168.
[20] Rachael Harding, Dana Van Aken, Andrew Pavlo, and Michael Stonebraker. 2017. An Evaluation of Distributed Concurrency Control. VLDB (2017), 553–564.
[21] Hsiang-Tsung Kung and John T. Robinson. 1981. On optimistic methods for concurrency control. ACM Transactions on Database Systems (TODS) 6, 2 (1981), 213–226.
[22] Inseon Lee and Heon Young Yeom. 2002. A single phase distributed commit protocol for main memory database systems. In Proceedings 16th International Parallel and Distributed Processing Symposium. IEEE, 8 pp.
[23] C. Mohan, Bruce Lindsay, and Ron Obermarck. 1986. Transaction management in the R* distributed database management system. ACM Transactions on Database Systems (TODS) 11, 4 (1986), 378–396.
[24] George Samaras, Kathryn Britton, Andrew Citron, and C. Mohan. 1993. Two-phase commit optimizations and tradeoffs in the commercial environment. In Proceedings of IEEE 9th International Conference on Data Engineering. IEEE, 520–529.
[25] Dale Skeen. 1981. Nonblocking commit protocols. In Proceedings of the 1981 ACM SIGMOD International Conference on Management of Data. 133–142.
[26] James W. Stamos and Flaviu Cristian. 1990. A low-cost atomic commit protocol. In Proceedings Ninth Symposium on Reliable Distributed Systems. IEEE, 66–75.
[27] James W. Stamos and Flaviu Cristian. 1993. Coordinator log transaction execution protocol. Distributed and Parallel Databases 1, 4 (1993), 383–408.
[28] Rebecca Taft, Irfan Sharif, Andrei Matei, Nathan VanBenschoten, Jordan Lewis, Tobias Grieger, Kai Niemi, Andy Woods, Anne Birzin, Raphael Poss, et al. 2020. CockroachDB: The resilient geo-distributed SQL database. In Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data. 1493–1509.
[29] The Transaction Processing Council. 2007. TPC-C Benchmark (Revision 5.9.0).
[30] Ashish Thusoo, Joydeep Sen Sarma, Namit Jain, Zheng Shao, Prasad Chakka, Ning Zhang, Suresh Antony, Hao Liu, and Raghotham Murthy. 2010. Hive — A Petabyte Scale Data Warehouse Using Hadoop. In ICDE.
[31] Nathan VanBenschoten. 2019. Parallel Commits: An Atomic Commit Protocol For Globally Distributed Transactions. https://www.cockroachlabs.com/blog/parallel-commits/
[32] Alexandre Verbitski, Anurag Gupta, Debanjan Saha, Murali Brahmadesam, Kamal Gupta, Raman Mittal, Sailesh Krishnamurthy, Sandor Maurice, Tengiz Kharatishvili, and Xiaofeng Bao. 2017. Amazon Aurora: Design considerations for high throughput cloud-native relational databases. In Proceedings of the 2017 ACM International Conference on Management of Data. 1041–1052.
[33] Xiangyao Yu, Yu Xia, Andrew Pavlo, Daniel Sanchez, Larry Rudolph, and Srinivas Devadas. 2018. Sundial: Harmonizing concurrency control and caching in a distributed OLTP database management system. Proceedings of the VLDB Endowment 11, 10 (2018), 1289–1302.
[34] Xiangyao Yu, Matt Youill, Matthew Woicik, Abdurrahman Ghanem, Marco Serafini, Ashraf Aboulnaga, and Michael Stonebraker. 2020. PushdownDB: Accelerating a DBMS using S3 Computation. In 2020 IEEE 36th International Conference on Data Engineering (ICDE). IEEE, 1802–1805.