An Analysis of Concurrency Control Protocols for In-Memory Databases with CCBench (Extended Version)

Takayuki Tanabe1,∗ Takashi Hoshino2, Hideyuki Kawashima4, Jun Nemoto4, Masahiro Tanaka4, Osamu Tatebe3
1NAUTILUS Technologies, Inc., 2Cybozu Labs, Inc., 3University of Tsukuba, 4Keio University
[email protected], [email protected], [email protected], 4{hideyuki.kawashima, nemoto, masa16.tanaka}@keio.jp
∗ Work done while at University of Tsukuba.

ABSTRACT

This paper presents yet another concurrency control analysis platform, CCBench. CCBench supports seven protocols (Silo, TicToc, MOCC, Cicada, SI, SI with latch-free SSN, 2PL) and seven versatile optimization methods and enables the configuration of seven workload parameters. We analyzed the protocols and optimization methods using various workload parameters and a thread count of 224. Previous studies focused on thread scalability and did not explore the space analyzed here. We classified the optimization methods on the basis of three performance factors: CPU cache, delay on conflict, and version lifetime. Analyses using CCBench and 224 threads produced six insights. (I1) The performance of an optimistic concurrency control protocol for a read-only workload rapidly degrades as cardinality increases even without L3 cache misses. (I2) Silo can outperform TicToc for some write-intensive workloads by using the invisible reads optimization. (I3) The effectiveness of two approaches to coping with conflict (wait and no-wait) depends on the situation. (I4) OCC reads the same record two or more times if a concurrent transaction interruption occurs, which can improve performance. (I5) Mixing different implementations is inappropriate for deep analysis. (I6) Even a state-of-the-art garbage collection method cannot improve the performance of multi-version protocols if there is a single long transaction mixed into the workload. On the basis of I4, we defined the read phase extension optimization, in which an artificial delay is added to the read phase. On the basis of I6, we defined the aggressive garbage collection optimization, in which even visible versions are collected. The code for CCBench and all the data in this paper are available online at GitHub.

1. INTRODUCTION

1.1 Motivation

Transaction processing systems containing thousands of CPU cores in a single server have been emulated [76], implemented [45,70], and analyzed [25]. Along with these systems, a variety of concurrency control protocols for a single server [44,47,66,70,77] have been proposed in recent years for use in many-core architectures. These modern protocols use a variety of optimization methods, and their behaviors depend on the workload characteristics (e.g., thread count, skew, read ratio, cardinality, payload size, transaction size, and read-modify-write or not). Recent new proposal studies have compared these protocols with conventional ones [31,33,34,36,38,40,44,45,47,57,64,66,70,71,74,75,77,78], and recent analytical studies have compared the performance of conventional protocols on a common platform [24,25,73,76]. These studies mostly evaluated protocol scalability. This paper acknowledges that modern protocols are scalable and focuses on other factors that contribute to performance in a many-core environment. This type of analysis is novel to the best of our knowledge.

Fairness is important in comparison. However, some recent studies were unable to perform such analysis fairly because they compared their new protocols with others using different platforms. For example, evaluations of ERMIA [44], mostly-optimistic concurrency control (MOCC) [70], and Cicada [47] used two or three platforms. Experiments using a mix of platforms can produce only approximate results because the performance of protocols on a many-core architecture depends greatly on the exploitation of the underlying hardware.
In such conditions, only macroscopic evaluations (i.e., scalability) can be conducted, and a detailed analysis is difficult. A single unified comparison platform is needed to conduct a detailed analysis.

For a fair comparison, a common analysis platform is necessary. It should provide shared core modules such as access methods (a concurrent index) and thread-local data structures (read and write sets) for any protocol. It should also provide a variety of optimization methods to enable obtaining a deep understanding of the protocols. Finally, the platform should be publicly available for reproducing the experiments. Although there are several open-source platforms, including DBx1000 [7], Peloton [12], and Cavalia [4], they do not provide certain modern protocols [47,69,70,77]. Appuswamy et al. [24] evaluated protocols in four types of architecture using Trireme, which is not publicly available.

1.2 Contributions

The first contribution of this paper is CCBench: a platform for fairly evaluating concurrency control protocols. The protocols are two-phase locking (2PL), Silo, MOCC, TicToc, snapshot isolation (SI), latch-free serial safety net (SSN), and Cicada. CCBench supports seven versatile optimization methods, including two identified in this work (Table 1, §3.2). Optimization methods can thus be applied to non-original protocols. For example, the NoWait method [29] can be applied to Silo, the rapid garbage collection (GC) method [47] can be applied to multi-version protocols, and the AdaptiveBackoff optimization method [47] can be applied to all protocols. Fairness in analyzing protocols is achieved through shared modules including a workload generator, access methods (Masstree [11,48]), local datasets (read/write sets), and a memory manager (mimalloc [21] and the AssertiveVersionReuse method presented in this paper). Evaluation of protocols under various conditions is enabled by providing seven workload configuration parameters: skew, payload size, transaction size, cardinality, read ratio, read-modify-write (RMW) or not, and the number of worker threads. CCBench and all of the data in this paper are available online at GitHub [5].

The second contribution is an analysis of cache, delay, and version lifetime using 224 threads. We clarified the effects of the protocols on the performance factors by configuring the optimization methods and workload settings. As suggested elsewhere [76] and [45], the era of a thousand cores is just around the corner, so conventional analytical studies focused on evaluating thread scalability [39,73,76]. In contrast, we performed all analyses with 224 threads on 224 cores.

We first investigated the effects of the optimization methods related to cache. By determining that a centralized counter increases cache-line conflicts and degrades performance, we gained two insights. I1: The performance of optimistic concurrency control (OCC) for a read-only workload rapidly degrades as cardinality increases even without L3 cache misses (§5.2). I2: Silo outperforms TicToc for write-intensive workloads by using InvisibleReads (§5.3).

We then investigated the effects of the optimization methods related to delay, and gained two more insights. I3: The effectiveness of two approaches to coping with conflict (Wait and NoWait) depends on the situation (§6.1). I4: OCC reads the same record two or more times due to concurrent transaction interruption. Surprisingly, OCC can improve performance in certain situations with it, and we have defined a new optimization method, ReadPhaseExtension, based on it (§6.2).

Finally, we investigated the effects of optimization methods related to version lifetime for multi-version concurrency control (MVCC) protocols, and gained two final insights. I5: Mixing different implementations is inappropriate. Silo outperforms Cicada on the Yahoo! Cloud Serving Benchmark B (YCSB-B) workload in an unskewed environment, which is inconsistent with previously reported testing results on different systems [47]. I6: Even a state-of-the-art GC technique cannot improve the performance of MVCC if there is a single long transaction mixed into the workload. To overcome this problem, we defined a new optimization method, AggressiveGC. It requires an unprecedented protocol that weaves GC into MVCC, thus going beyond the current assumption that versions cannot be collected if they might be read by transactions (§7.3).

1.3 Organization

The rest of this paper is organized as follows. §2 reviews existing protocols. §3 discusses the fairness condition and presents CCBench. §4 describes the reproduction of experiments. §5 investigates the effect of cache. §6 investigates the effect of delay. §7 investigates the effect of version lifetime management. §8 reviews related work, and §9 concludes this paper and shows future directions.
2. PRELIMINARIES

2.1 Concurrency Control Protocols

The concurrency control protocols we analyzed are classified as (1) pessimistic (2PL [35]), (2) optimistic (Silo [66] and TicToc [77]), (3) multi-version (SI [42] and ERMIA [44]), (4) integration of optimistic and pessimistic (MOCC [70]), and (5) integration of optimistic and multi-version (Cicada [47]).

Silo [66] is an OCC protocol that has influenced subsequent concurrency control protocols. For example, FOEDUS [45] and MOCC [70] extend the commit protocol of Silo. Silo selects the design of InvisibleReads [49], so it does not update the metadata during a read operation. The invisible read avoids cache-line conflicts, so it provides scalability, as shown by Wang and Kimura [70].

TicToc [77] is an OCC protocol based on timestamp ordering with data-driven timestamp management. TicToc has a larger scheduling space than Silo, so it can commit schedules that Silo cannot. TicToc provides three optimization methods. PreemptiveAbort initiates abort processing immediately if an abort is detected before write locking in the validation phase. NoWaitTT does not wait for lock release in the validation phase; it instead releases the locks and retries the validation phase after a fixed-duration wait, without an abort. TimestampHistory expands the scheduling space by recording the write timestamp of an older version so that some of the read operations on that version can be verified after the record is overwritten.

MOCC [70] exhibits high performance for a wide range of workloads by adaptively switching its policy, adding pessimistic locking and temperature management to Silo. MOCC locks high-temperature (highly conflicted) records while keeping the order of locked records to avoid deadlocks in the read phase. The MOCC queuing lock (MQL) integrates an MCS (Mellor-Crummey and Scott) lock, which can time out [59], and a reader/writer fairness MCS lock [50].

Cicada [47] combines the OCC protocol, a multi-version protocol with timestamp ordering, and distributed timestamp generation. The distributed timestamp generation eliminates the need to access the centralized counter that conventional MVCC protocols require, thereby dramatically mitigating cache-line conflicts. Cicada has a variety of optimization methods. EarlyAbort initiates an abort if one is predicted during the read phase. BestEffortInlining embeds an inline version in the record header at the top of the version list to reduce indirect reference cost. PrecheckValidation checks whether read versions are still valid in the early validation phase. RapidGC is a quick and parallel GC optimization method. AdaptiveBackoff dynamically determines how long to wait before retrying. SortWriteSetByContention detects conflicts at an early stage, before performing actions that would be wasted by an abort.

SI [27] is an MVCC protocol that can generate write-skew and read-only-transaction anomalies, so it does not produce serializable schedules. Under SI, a transaction reads a snapshot of the latest committed version. The write operations are also reflected in the snapshot. SI requires a monotonically increasing timestamp assignment for each transaction to provide snapshots. To determine the timestamp, a centralized shared counter is required.

SI with latch-free SSN [44] integrates SI and SSN [68] and exploits the latch-free mechanism [69]. SSN detects and aborts dangerous transactions that could lead to non-serializable schedules. We refer to the integration of SI and SSN as ERMIA in this paper.

2PL [35,72] releases all read/write locks at the end of each transaction. The NoWait method, which immediately aborts the running transaction when a conflict is detected, was originally proposed as a deadlock resolution mechanism [28]. We use it as an optimization method.
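To make the OCC structure shared by Silo-style protocols concrete, the following is a minimal, illustrative C++ sketch of a read phase and validation: TID words are recorded on read (an invisible read), the write set is locked in a fixed order, and the read set is then re-verified. It is a simplified sketch based on the description above (no epochs, no commit TID calculation), with hypothetical types, not CCBench's actual code.

```cpp
#include <algorithm>
#include <atomic>
#include <cstdint>
#include <vector>

struct Record {
    std::atomic<uint64_t> tid_word{0};  // version number with lock bit (bit 0)
    uint64_t payload{0};
};

struct ReadSetEntry  { Record* rec; uint64_t observed_tid; };
struct WriteSetEntry { Record* rec; uint64_t new_payload; uint64_t locked_tid; };

// Read phase: remember the TID word observed for each read. Nothing is written
// to shared memory here, which is the "invisible read" property.
bool occ_read(Record* rec, uint64_t& out, std::vector<ReadSetEntry>& rs) {
    for (;;) {
        uint64_t t1 = rec->tid_word.load(std::memory_order_acquire);
        if (t1 & 1ULL) continue;                      // locked by a writer: retry
        out = rec->payload;
        uint64_t t2 = rec->tid_word.load(std::memory_order_acquire);
        if (t1 == t2) { rs.push_back({rec, t1}); return true; }
        // otherwise a concurrent writer interrupted us: this is the "extra read"
    }
}

// Validation phase: lock the write set in address order, then verify the read set.
bool occ_validate_and_commit(std::vector<ReadSetEntry>& rs,
                             std::vector<WriteSetEntry>& ws) {
    std::sort(ws.begin(), ws.end(),
              [](const WriteSetEntry& a, const WriteSetEntry& b) { return a.rec < b.rec; });
    std::size_t locked = 0;
    auto unlock_all = [&]() {
        for (std::size_t i = 0; i < locked; ++i)
            ws[i].rec->tid_word.store(ws[i].locked_tid, std::memory_order_release);
    };
    for (auto& w : ws) {                               // acquire write locks
        uint64_t expected = w.rec->tid_word.load(std::memory_order_relaxed) & ~1ULL;
        if (!w.rec->tid_word.compare_exchange_strong(expected, expected | 1ULL)) {
            unlock_all();
            return false;                              // NoWait-style: abort immediately
        }
        w.locked_tid = expected;
        ++locked;
    }
    for (auto& r : rs) {                               // read validation
        uint64_t cur = r.rec->tid_word.load(std::memory_order_acquire);
        bool locked_by_us = false;
        for (auto& w : ws) if (w.rec == r.rec) locked_by_us = true;
        if ((cur & ~1ULL) != (r.observed_tid & ~1ULL) || ((cur & 1ULL) && !locked_by_us)) {
            unlock_all();
            return false;                              // someone changed what we read
        }
    }
    for (auto& w : ws) {                               // install and release locks
        w.rec->payload = w.new_payload;
        w.rec->tid_word.store(w.locked_tid + 2, std::memory_order_release);
    }
    return true;
}
```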
2.2 Environment

The evaluation environment consisted of a single server with four Intel(R) Xeon(R) Platinum 8176 CPUs running at 2.10 GHz. Each CPU had 28 physical cores with hyper-threading, and the server had 224 logical cores. Each physical core had a 32 KB private L1d cache and a 1 MB private L2 cache. The 28 cores in each processor shared a 38.5 MB L3 cache. The total cache size was about 154 MB. Forty-eight DDR4-2666 32 GB DIMMs were connected, so the total memory size was 1.5 TB.

In all graphs in this paper showing the results of CCBench, each plot point shows the average of five runs, each lasting more than 3 s. We confirmed that these numbers produce stable CCBench results. The error bars indicate the range between the maximum and minimum values. We counted the number of committed and aborted transactions to calculate average throughput. We used the rdtscp instruction to measure the latency of the transactions and their portions. Each worker thread was pinned to a specific CPU core in every experiment.
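As an illustration of the measurement setup described above, the following sketch pins the calling thread to a core and reads timestamps with the rdtscp instruction. It is a minimal example assuming Linux and x86-64, not CCBench's actual measurement code.

```cpp
#include <pthread.h>
#include <sched.h>
#include <x86intrin.h>
#include <cstdint>
#include <cstdio>

// Pin the calling worker thread to one logical core.
static void pin_to_core(int core_id) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core_id, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

// Measure a code region in cycles with rdtscp; dividing by the measured clock
// rate (e.g., 2.095 GHz in the DBx1000 runs of Section 4) converts to seconds.
int main() {
    pin_to_core(0);
    unsigned aux;
    uint64_t start = __rdtscp(&aux);
    // ... transaction, or transaction portion, under measurement ...
    uint64_t end = __rdtscp(&aux);
    std::printf("latency: %llu cycles\n", static_cast<unsigned long long>(end - start));
    return 0;
}
```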
3. CCBENCH

3.1 Fairness Condition

When analyzing concurrency control protocols, their performance should be evaluated under a fairness condition. The meaning of fairness depends on the situation. In this paper, fairness means that basic modules are shared in the analysis. This is because the performance of modern protocols is closely related to the exploitation of the underlying hardware; i.e., they are sensitive to the engineering details. Therefore, the basic modules of an evaluation platform should be shared for a fair evaluation. Access methods (e.g., Masstree) and local data structures (read and write sets) need to be shared among protocols. The effects of memory allocation should be reduced as much as possible. The workload should be consistent among experiments.

Several analysis studies satisfy our fairness condition. Yu et al. [76] developed DBx1000 [7]. It was used in the evaluation of the TicToc paper [77] and in another evaluation with a real 1568-core machine [25]. DBx1000 currently does not support some modern protocols (Cicada, MOCC, ERMIA). Wu et al. developed Cavalia [4] to compare the hardware transactional memory (HTM)-assisted OCC-style protocol (HTCC) [75] with conventional protocols. Several modern protocols (Cicada, ERMIA, TicToc, MOCC) were beyond their scope. Wu et al. [73] developed Peloton [12] to compare MVCC protocols. Several modern protocols (Cicada, ERMIA, Silo, TicToc, MOCC) were beyond their scope. Appuswamy et al. [24] developed Trireme to compare protocols in four types of architecture, including shared-everything. Several modern protocols (TicToc, Cicada, MOCC, ERMIA) were beyond their scope.

A protocol can include a variety of optimization methods. Even if protocol P does not initially include optimization method O, an analysis platform should provide O to P if users request it, because the performance of a protocol greatly depends on the optimization method. For example, RapidGC, included in Cicada, can also be applied to both SI and ERMIA. NoWait [28] can be applied to Silo to improve performance, as shown in Fig. 10c. Conventional platforms do not support this concept.

Our fairness condition was not satisfied in several studies. The Cicada paper (§4.2) [47] states, "Cicada and existing schemes in DBx1000 share the benchmark code, but have separate and transaction processing engines." The MOCC paper [70] states that the pure OCC, MOCC, and PCC (2PL variant) implementations used the FOEDUS system, the Dreadlock/WaitDie/BlindDie (2PL variant) implementations used the Orthrus system, and the ERMIA implementation used the ERMIA system, which means three benchmark systems were mixed. The ERMIA paper [44] states that the Silo implementation used the Silo system while the SI/SSI implementation used the ERMIA system, which means two benchmark systems were mixed. In studies using a mix of systems, experts must expend a huge amount of effort to conduct a fair comparison, which would not be easy, especially for a deep understanding of protocols or optimizations.

3.2 CCBench: A Platform for Fair Analysis

Our CCBench analysis platform for in-memory CC protocols satisfies our fairness condition because it shares the basic modules among protocols. The architecture of CCBench is illustrated in Fig. 1. The code for CCBench is available on GitHub [5].

[Figure 1: CCBench Architecture. Shared objects: key-value store (record design depends on each protocol), central data object (SI: counter, Silo: epoch, ERMIA: mapping table, Cicada: backoff delta), memory manager (allocator (mimalloc), pre-alloc, numactl, thread affinity), access method (Masstree). Worker threads: workload generator (all protocols), read/write set (all protocols), version cache (ERMIA, SI, Cicada).]

In CCBench, each thread executes both the client and the server logic. The client generates a transaction with read/write operations using the workload parameters at runtime. The server provides APIs as C++ functions, such as read, write, commit, and abort. The client calls the APIs to run transactions. The server runs the transactions requested from the client inside a worker thread.
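To illustrate the client/server structure just described, here is a hypothetical usage sketch of such a C++ transaction API. The function names read/write/commit/abort follow the text, but the class name, signatures, and the trivial single-threaded stub bodies are assumptions for illustration, not CCBench's real interface.

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>

// Hypothetical single-threaded stub of the API surface described above.
class TxnExecutor {
public:
    void begin() { local_writes_.clear(); aborted_ = false; }
    bool read(uint64_t key, std::string& value_out) {
        auto it = db_.find(key);
        if (it == db_.end()) return false;
        value_out = it->second;                      // a real protocol records a read set entry
        return true;
    }
    bool write(uint64_t key, const std::string& value) {
        local_writes_[key] = value;                  // buffered in the local write set
        return true;
    }
    bool commit() {                                  // a real protocol validates here
        if (aborted_) return false;
        for (auto& kv : local_writes_) db_[kv.first] = kv.second;
        return true;
    }
    void abort() { aborted_ = true; local_writes_.clear(); }
private:
    static std::unordered_map<uint64_t, std::string> db_;   // stands in for the shared KVS
    std::unordered_map<uint64_t, std::string> local_writes_;
    bool aborted_ = false;
};
std::unordered_map<uint64_t, std::string> TxnExecutor::db_{{42, "v"}};

// Client logic: generate operations from the workload parameters and drive the API.
bool run_one_transaction(TxnExecutor& tx) {
    tx.begin();
    std::string v;
    if (!tx.read(42, v)) { tx.abort(); return false; }
    if (!tx.write(42, v + "!")) { tx.abort(); return false; }
    return tx.commit();   // on false the client retries, possibly after a backoff
}

int main() { TxnExecutor tx; return run_one_transaction(tx) ? 0 : 1; }
```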
Table 1: Versatile optimization methods in CCBench. —: irrelevant or incompatible. Org: supported by the original protocol. CCB: supported by CCBench. (α) Delay inspired by extra reads in OCC (§6.2); proposed in this work. (β) CCBench performs read lock for hot records and invisible reads for non-hot records. (γ) The NoWait locking optimization detects a write lock conflict and immediately aborts and retries the transaction; NoWaitTT releases all locks and re-locks them. (δ) Lightweight; proposed in this work (§3.3). (ε) PreemptiveAbort and TimestampHistory cannot be applied to others. (ζ) SortWriteSetByContention, PrecheckVersionConsistency, EarlyAborts, and BestEffortInlining cannot be applied to others. We applied optimization methods as follows. NoWait: Silo in Fig. 10c. RapidGC: ERMIA and SI in all cases. AssertiveVersionReuse: Cicada, ERMIA, and SI in all cases. AdaptiveBackoff: TicToc in Fig. 2c and all protocols in Figs. 4a, 4b, 4c, and 4d.

Protocol        | DecentralizedOrdering (Cache) | InvisibleReads (Cache) | NoWait or Wait (Delay) | AdaptiveBackoff (Delay) | ReadPhaseExtension (α) (Delay) | AssertiveVersionReuse (δ) (Version lifetime) | RapidGC (Version lifetime)
2PL [72]        | Org, CCB | —            | Org, CCB | CCB      | —   | —   | —
Silo [66]       | Org, CCB | Org, CCB     | CCB      | CCB      | CCB | —   | —
MOCC [70]       | Org, CCB | Org, CCB (β) | —        | CCB      | CCB | —   | —
TicToc (ε) [77] | Org, CCB | —            | CCB (γ)  | CCB      | CCB | —   | —
SI [42]         | —        | —            | —        | CCB      | —   | CCB | CCB
ERMIA [44]      | —        | —            | —        | CCB      | —   | CCB | CCB
Cicada (ζ) [47] | Org, CCB | —            | —        | Org, CCB | CCB | CCB | Org, CCB
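As a concrete illustration of the NoWait-or-Wait column in Table 1, the following sketch shows the two conflict-handling policies on a simple CAS-based write lock. It is a minimal assumed example, not CCBench's lock implementation.

```cpp
#include <atomic>

struct WriteLock {
    std::atomic<bool> locked{false};

    bool try_lock() {                 // NoWait: a single attempt; the caller aborts on failure
        bool expected = false;
        return locked.compare_exchange_strong(expected, true);
    }
    void lock() {                     // Wait: spin until the holder releases the lock
        bool expected = false;
        while (!locked.compare_exchange_weak(expected, true)) {
            expected = false;         // reset for the next CAS attempt
        }
    }
    void unlock() { locked.store(false, std::memory_order_release); }
};

// NoWait policy: detect the conflict, abort immediately, and let the caller retry
// the whole transaction instead of blocking while holding other locks.
bool acquire_nowait(WriteLock& l) { return l.try_lock(); }

// Wait policy: block until the lock is released. Deadlock must be avoided by
// ordering lock acquisitions, as done for 2PL-Wait in §6.1.
void acquire_wait(WriteLock& l) { l.lock(); }
```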

The client code is separated from the server code, and both codes are compiled into a single binary. The memory allocator, mimalloc [21], allocates interleaved memory among CPU sockets by using the Linux numactl command. Memory allocation is avoided as much as possible so that the penalty [30] imposed by the Linux system can be ignored and so that the execution performance of protocols is not degraded by undesirable side effects of memory allocation. CCBench initializes the database for each experiment, pre-allocates the memory for objects, and reuses the memory. Allocated memory for local data structures (read/write sets) and the generated operations for a transaction are reused for the next transaction. The metadata and record data are carefully aligned to reduce false sharing. A wrapper of Masstree [11] was implemented, and all protocols used it as the access method.

CCBench supports seven versatile optimization methods, as shown in Table 1. (1) DecentralizedOrdering: prevents contended accesses to a single shared counter. (2) InvisibleReads: read operations that do not update memory, so cache-line conflicts do not occur. (3) NoWait or Wait: immediate abort upon detecting a conflict followed by retry, or waiting for lock release. (4) ReadPhaseExtension: an artificial delay added to the read phase, inspired by the extra read process that retries the read operation. (5) AdaptiveBackoff: an artificial delay before restarting an aborted transaction. (6) AssertiveVersionReuse: allocates thread-local space, denoted as the version cache in Fig. 1, to retain versions so that memory manager access is not needed. (7) RapidGC: frequent updating of the timestamp watermark for GC in MVCC protocols. ReadPhaseExtension (4) and AssertiveVersionReuse (6) are presented in this paper.

CCBench enables the use of additional optimization methods for 2PL, Silo, TicToc, SI, ERMIA, and Cicada. It does not support the read-only optimization methods introduced in Silo, FOEDUS, MOCC, and ERMIA because they improve performance only in a specific case (i.e., 99% read-only transactions, 1% write transactions, skew at 0.99), and such a workload is not considered here.

Users of CCBench can easily add new protocols to the platform. After replicating a directory that has an existing protocol, the user rewrites the functions corresponding to the transactions with begin/read/write/commit/abort operations in the transaction executor class. The user can then reuse the basic modules in the platform: workload generator, memory management, and so on. Users can easily attach or detach optimization methods provided by CCBench to their protocols simply by rewriting the preprocessor definitions in CMakeLists.txt. A developer guide for CCBench users is available [1]. One of our team members implemented a simple OCC protocol following this guide [3] and published a description of the experience for other users [17]. Users can also switch the configuration of the optimization methods, key-value size, and protocol details in the same way as done in the online example. In an experiment, the user can set the workload by specifying the runtime arguments via gflags [16] using the seven workload parameters listed in Table 2. This design makes it easier to conduct experiments than DBx1000 [7], which manages these values as preprocessor definitions in a file.
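As an illustration of the cache-line alignment mentioned above (and of the global-epoch padding applied to Silo in §3.3), this is a minimal sketch assuming a 64-byte cache line; the structure names are hypothetical.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kCacheLineSize = 64;

// Pad a frequently written shared word (e.g., a Silo-style global epoch) to a
// full cache line so that updating it does not invalidate neighboring data.
struct alignas(kCacheLineSize) PaddedEpoch {
    std::atomic<uint64_t> value{1};
    char pad[kCacheLineSize - sizeof(std::atomic<uint64_t>)];
};

// Align per-record metadata so that two hot records never share a cache line,
// reducing false sharing between worker threads on different cores.
struct alignas(kCacheLineSize) RecordHeader {
    std::atomic<uint64_t> tid_word{0};
    uint64_t payload_offset{0};
};

static_assert(sizeof(PaddedEpoch) == kCacheLineSize, "epoch must fill one line");
static_assert(sizeof(RecordHeader) % kCacheLineSize == 0, "header must be line-aligned");

int main() { PaddedEpoch e; return static_cast<int>(e.value.load() - 1); }
```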
3.3 Optimization Method Implementations

2PL: For efficiency, we implemented from scratch the reader/writer lock system with the compare-and-swap (CAS) operation used in the protocol.

Silo: We padded the global epoch (64 bits) to the cache-line size to reduce the false sharing that can be caused by a global epoch state change. Calculation of the commit transaction id (TID) is expensive since it requires loops over both the read set and the write set. We reduced these two loops by implementing the calculation at write lock acquisition and read verification.

ERMIA: We used latch-free SSN to prevent expensive SSN serial validation. This implementation prevents the validation from being a critical section. We integrated pstamp and sstamp to reduce memory usage, as described in Section 5.1 of the original paper [69]. We optimized the transaction mapping table. Straightforwardly, the mapping table is designed as a two-dimensional global array with a thread number, TID, cstamp, and last cstamp, which requires a lock manager for updates and reads over the rows. This degrades performance due to serialization. Therefore, we designed a data structure that expresses the rows in a one-dimensional array so that it can be latched with a CAS operation. The performance was improved by avoiding serialization and cache-line conflicts for row element access. We used RapidGC [47]. The transaction mapping table objects exploited our AssertiveVersionReuse method, so the memory manager almost did not need to work during the experiments.

TicToc: We implemented all three optimization methods (NoWaitTT, PreemptiveAbort, TimestampHistory). We also improved Algorithm 5 in the original paper [77] by removing redundant read timestamp updates. This is effective because if the recorded read timestamp is the same as the previous one, it does not need to be updated.

MOCC: MOCC periodically resets temperature information [70], which switches many cache lines to the invalid state simultaneously and thereby degrades performance. In contrast, our implementation stores the epoch ID of the latest reset timing together with the temperature and thereby avoids multiple resets in the same epoch. This reduces the cost of cache-line conflicts and the number of reset operations, thereby maintaining fine control of the temperature. We did not implement MQL since our environment had only four NUMA (non-uniform memory access) nodes, so the effect would be limited. We instead used the reader/writer lock that was used in our 2PL implementation.

Cicada: We implemented all six optimization methods: SortWriteSetByContention, PrecheckVersionConsistency, AdaptiveBackoff, EarlyAborts, RapidGC, and BestEffortInlining. Moreover, we fixed a logical bug, i.e., the incomplete version consistency check in the validation phase. In the original paper [47], only the read version is confirmed to still be the visible version. However, whether the latest version is still the visible version needs to be confirmed as well. The existence of a newer version in the observable range means that it should have been read, and the transaction view will thus be broken. We turned off the one-sided synchronization to improve throughput.

CCBench: In the MVCC protocols (SI, ERMIA, and Cicada), creating new versions and deleting old versions put a load on the memory manager. We therefore developed a new optimization method, dubbed AssertiveVersionReuse, that avoids overloading the memory manager in MVCC protocols. This method enables each worker thread to maintain a container for future version reuse. When GC begins, the versions collected are stored in this container. A new version is taken from this container unless it is empty. If it is empty, a request for space is sent to the memory manager as usual. The space needed for GC is estimated before the experiment, and memory space is allocated in advance. This optimization minimizes the burden on the memory manager for the MVCC protocols. Moreover, we introduce the use of ReadPhaseExtension to delay execution; it was inspired by the extra read process described in §6.2. We also introduce the AggressiveGC optimization, which collects even visible versions, as described in §7.3.
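The following is a minimal sketch of a thread-local version cache in the spirit of AssertiveVersionReuse as described above: reclaimed versions go back into a per-thread free list, and new versions are taken from it before touching the memory manager. The types and sizes are assumptions, not the actual CCBench implementation.

```cpp
#include <cstdint>
#include <vector>

struct Version {
    uint64_t wts = 0;     // write timestamp
    Version* next = nullptr;  // older version in the chain
    char payload[64];     // fixed-size payload for this sketch
};

class VersionCache {              // one instance per worker thread (no locking needed)
public:
    explicit VersionCache(std::size_t prealloc) {
        pool_.reserve(prealloc);
        for (std::size_t i = 0; i < prealloc; ++i)
            pool_.push_back(new Version{});          // space estimated before the run
    }
    Version* acquire() {
        if (!pool_.empty()) {                        // reuse a collected version
            Version* v = pool_.back();
            pool_.pop_back();
            return v;
        }
        return new Version{};                        // fall back to the memory manager
    }
    void release(Version* v) { pool_.push_back(v); } // called when GC collects a version
    ~VersionCache() { for (Version* v : pool_) delete v; }
private:
    std::vector<Version*> pool_;
};

int main() {
    thread_local VersionCache cache(1024);
    Version* v = cache.acquire();
    v->wts = 1;
    cache.release(v);                                // returned by GC, reused next time
    return 0;
}
```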
3.4 Configurable Workloads

CCBench supports the seven parameters shown in Table 2. Skew is an access pattern that follows a Zipf distribution. Cardinality is the number of records in the database. Payload is the size of a record (key plus value). Transaction size is the number of operations in a transaction. Read ratio is the ratio of reads in a transaction. Write model is whether to perform RMW or not. Thread count is the number of worker threads (fixed to 224 here except for reproduction). To determine the skew, CCBench uses a fast approximation method [37].

Table 2: Analyzed parameter sets. (α) Cardinality (from 10^3 to 10^9 records); (β) Cardinality (10^3 or 10^6 records); (γ) Read ratio (from 0% to 100%); (δ) Transaction size (from 10 to 100 operations); (ε) Payload size (from 4 to 1000 bytes); (ζ) Skew (from 0.6 to 0.99).

Parameter         | F5 (Cache) | F7 (Cache) | F9 (Delay) | F10 (Delay) | F11 (Version)
Skew              | 0          | 0          | 0.8        | 0.9         | ζ
Cardinality       | α          | β          | 10^8       | 10^8        | 10^8
Payload (bytes)   | 4          | 4          | 4          | ε           | 4
Xact size         | 10         | 10         | δ          | 10          | 10
Read ratio (%)    | 0,5        | γ          | 50,95      | 50          | 50,95
Thread count      | always 224, except for the reproduction experiments
Read-modify-write | always off, except for the reproduction experiments

We analyzed the CC protocols by using 224 threads with CCBench. Most of the benchmarks were the YCSB workload, and some were variants of YCSB. The analyzed parameter sets are summarized in Table 2. We varied the skew, cardinality, payload, transaction size, and read ratio and fixed the thread count at 224. The caption of the table describes the varied parameters using α...ζ. (Footnote 1: More spaces were explored, and they are found online [2].)

This paper focuses on identifying factors that significantly affect performance in a many-core environment. We have chosen YCSB-like benchmarks because they can generate various workloads despite the simplicity of their data model, which offers only read and update operations for a single table with a primary index. It is preferable that a benchmark also support realistic workloads, which often contain insertion, deletion, range search, and secondary indexes for multiple tables. Difallah et al. [32] summarized such workloads, which include an industry-standard benchmark, TPC-C [22]. CCBench currently supports only a subset of TPC-C including the New-Order and Payment transactions, and we obtained the following results: (1) CCBench exhibited scalability in both the thread count and the warehouse count; (2) the majority of the execution time was spent on index traversal, and its acceleration was important for high performance; (3) different analysis platforms exhibited different behavior even for the same workload, depending on the design and the implementation. The details are described in Appendices A and B. CCBench will support TPC-C full-mix for protocols other than Silo in future work, and it will be available at [1].
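To illustrate the runtime-argument style of configuration described in §3.2 and §3.4, here is a small gflags-based sketch defining the seven workload parameters. The flag names and defaults are hypothetical, not CCBench's actual flags.

```cpp
#include <gflags/gflags.h>
#include <cstdio>

// Hypothetical flag names for the seven workload parameters of Table 2.
DEFINE_double(zipf_skew, 0.0, "Zipf skew of the access pattern");
DEFINE_uint64(cardinality, 100000000, "number of records in the database");
DEFINE_uint64(payload_size, 4, "record payload size in bytes");
DEFINE_uint64(xact_size, 10, "operations per transaction");
DEFINE_uint64(read_ratio, 50, "percentage of reads in a transaction");
DEFINE_bool(rmw, false, "perform read-modify-write instead of blind write");
DEFINE_uint64(thread_num, 224, "number of worker threads");

int main(int argc, char** argv) {
    gflags::ParseCommandLineFlags(&argc, &argv, true);
    // e.g., ./bench --thread_num=224 --zipf_skew=0.9 --read_ratio=50
    std::printf("skew=%.2f records=%llu payload=%llu ops/trans=%llu read%%=%llu rmw=%d threads=%llu\n",
                FLAGS_zipf_skew,
                (unsigned long long)FLAGS_cardinality,
                (unsigned long long)FLAGS_payload_size,
                (unsigned long long)FLAGS_xact_size,
                (unsigned long long)FLAGS_read_ratio,
                (int)FLAGS_rmw,
                (unsigned long long)FLAGS_thread_num);
    return 0;
}
```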
4. REPRODUCTION OF PRIOR WORK

Before presenting our analysis, we explain how we validated the correctness of the CCBench implementation by confirming that it successfully reproduces the results of experiments performed in prior work. We did this by evaluating DBx1000 [7], a standard analysis platform, as used by Bang et al. [25], to validate the behavior of CCBench. For all graphs showing DBx1000 results, we set CPU_FREQ to 2.095 after measuring the real clock rate used for the rdtscp instruction; each plot point shows the average of ten runs. The duration of each run was 4.5 to 10 s. As the access method, we used a hash index for DBx1000 and Masstree for CCBench. DBx1000 does not support RMW operations in the YCSB workload,

[Figure 2 panels: (a) YCSB (Read-Only). 2 queries per transaction, skew 0. (b) YCSB (Medium Contention). 16 queries per transaction, 90% reads, 10% writes, skew 0.8. (c) YCSB (High Contention). 16 queries per transaction, 50% reads, 50% writes, skew 0.9.]

Figure 2: Reproduction of TicToc experiment. 10 M records, payload 1 K bytes. Different access methods (hash for DBx1000, Masstree for CCBench). CCBench without an index (39.6 Mtps) outperformed DBx1000 (30 Mtps) at 80 threads.


(a) Read Only Workload. (b) Read-Write YCSB. 224 threads. (c) Read-Write YCSB. 224 threads.

Figure 3: Reproduction of MOCC experiment. YCSB, 50 records, 10 ops/trans, Skew 0, Payload 4 bytes, reads were pure read, writes were read-modify-write. so their plots are not presented in the reproduction results exceeded a certain value, the performance started to dete- for MOCC and Cicada. riorate due to an increase in read/write conflicts. There- fore, we presumed that some sort of backoff was applied to 4.1 TicToc TicToc in the original paper, and attached AdaptiveBackoff Our first experiment was aimed at reproducing TicToc re- to TicToc. This is shown as T+BO in Fig. 2c. The per- sults. We experimented with the settings for Figs. 6, 7, and formance curve for T+BO is similar to that in Fig. 9 in the 8 in the original paper [77]. We set the epoch interval longer original paper. The lower performance of the original Tic- than 40 ms so that cache-line conflicts for the global epoch Toc protocol shown as Torg, is attributed to excessive delays rarely occurred in Silo. The experimental results are shown and read validation failures. NoWaitTT inserts fixed delays in Figs. 2a, 2b, and 2c. TicToc uses NoWait optimization into the validation phase, which tends to result in read val- instead of NoWaitTT. Note that hereafter, TicToc means idation failure. Therefore, Torg shows low performance in to attach NoWait instead of NoWaitTT. Torg is the origi- this graph, and TicToc shows better performance due to the nal TicToc with all the optimizations, including NoWaitTT. shorter read phase. The results for DBx1000 are also plotted T+BO is TicToc using AdaptiveBackoff. in the figures, and their behaviors are similar to our results. Fig. 2a shows that Silo and TicToc had a comparable per- Fig. 2a shows that Silo and TicToc exhibited linear scal- formance for read-only workloads, consistent with the re- ability for a read-only workload, and Fig. 6 in the original sults shown in Fig. 6 in the original paper. Fig. 2b shows paper shows almost linear scalability. Fig. 2b above and that Silo and TicToc both scale in performance. This is also Fig. 7(a) in the original paper show almost the same results. consistent with the results shown in Fig. 7 in the original In summary, CCBench tended to closely reproduce the re- paper. The results for DBx1000 are also shown in the fig- sults for TicToc, and the results of CCBench are consistent ures, and their behaviors were similar to those for CCBench. with those of DBx1000. DBx1000 outperformed CCBench, which we attribute to the difference in access methods. A hash index was used for 4.2 MOCC DBx1000 while Masstree was used for CCBench, and hash The second experiment was aimed at reproducing MOCC indexes are typically faster than tree indexes. We deter- results [70]. Our MOCC implementation differs from the mined that CCBench without indexing had 30% higher per- original one in terms of temperature and MQL, as described formance (39.6 Mtps) than DBx1000 (30 Mtps). As we have in §3.3. To reproduce the results in the original paper, we already stated, the effect of the access method must be con- experimented with the settings shown in Figs. 6 and 7 in the sidered for a fair comparison. original paper. The details of the workload are described in One finding is particularly noteworthy. In Fig. 2c, the be- graph captions. haviors of Silo and TicToc differ from those in Fig. 8 in the The results in Fig. 3a appropriately reproduce those in original paper [77]. The performance of TicToc improved as Fig. 6 in the original paper. 
The performance scales with the number of threads in the read-only workloads. The un-

[Figure 4 panels: (a) Write-Intensive. 16 requests/trans, skew 0.99, 50% RMW, 50% pure reads. (b) Write-Intensive. 28 threads, 16 requests/trans, 50% RMW, 50% pure reads. (c) Read-Intensive. 28 threads, 16 requests/trans, 5% RMW, 95% pure reads. (d) Read-Intensive. 1 request/trans, skew 0.99, 5% RMW, 95% pure reads.]

Figure 4: Reproduction of Cicada experiment. 10 M records, payload 100 bytes. derperformance of MOCC with more than 168 threads com- Here, we discuss two effects of the cache. Cache-line pared with Silo is attributed to the overhead of MOCC. conflict occurs when many worker threads access to the When MOCC reads a record, it accesses the corresponding same cache-line with some writes (e.g., accessing a single temperature information and checks whether the tempera- shared counter). The cache-line becomes occupied by a sin- ture is high. gle thread, and the other threads are internally blocked. The results in Figs. 3b and 3c closely reproduce those in Cache-line replacement occurs when transactions access Fig. 7 in the original paper. When the write ratio increased a large amount of data (e.g., high cardinality, low skew, or from 0% to 10%, the abort rate rapidly increased, and the large payload). The data accesses tend to replace the con- performance deteriorated. This is because the write opera- tents of the L1/L2/L3 cache. tion produces a large number of read validation failures. As shown in Fig. 3b, the performance of Silo deteriorated at 5.1 Cache-Line Conflict a write rate of 0-40%, while it improved at a write rate of Some protocols use centralized ordering (e.g., a shared 40-60%. There is thus a trade-off between reduced record- counter) [42, 44] while others use decentralized ordering access contention due to reduced write operations and fre- [47, 66, 70, 77]. We first analyzed the negative effect of cen- quent read validation failures due to increased read opera- tralized ordering. The results are shown in Fig. 5. Since tions. In contrast, MOCC showed stable performance, ap- the results for YCSB-A have the same tendency as those for parently because abort is less likely to occur by temperature YCSB-B, they are omitted. Fig. 5 shows that the through- control. puts of ERMIA and SI did not scale although their schedul- MOCC and Silo exhibited scalability, as shown in Fig. 3a ing spaces are wider than those of single-version concur- above, while MOCC and FOEDUS exhibited scalability in rency control (SVCC) serializable protocols. This is be- the original paper [70] (in Fig. 6). Fig. 3b shows the through- cause both protocols depend on a centralized data structure, puts of MOCC and Silo for various write ratios. They are which causes cache-line conflicts, which in turn degrade per- almost the same as those in Fig. 7 in the original paper. The formance. This is consistent with previous findings [66, 70]. abort ratios of Silo and MOCC shown in Fig. 3c is consis- To investigate the cause of the inefficiency, we measured tent with that in Fig. 7 in the original paper. In summary, the throughput of fetch add. The results are shown in CCBench closely reproduced the results for MOCC. Figs. 6. The L3 cache miss rate was measured using the Linux perf tool. Fig. 6a shows that one thread exhibited 4.3 Cicada the best performance. This is because frequent reads/writes to the same by multiple threads cause many Our third experiment was aimed at reproducing Cicada cache-line conflicts. Fig. 6b shows that the L3 cache miss results. The workload parameters were set almost the same rate increased with the number of threads. Cache-line con- as those in the original paper [47]. The details of the work- flicts often result in L3 cache misses and longer communica- load are described in the graph caption. 
The differences be- tion delays because the CPU core uses a cache-line, and the tween this paper and the original paper are as follows: (1) other cores running in different CPU sockets also require the we redesigned and reimplemented all protocols; (2) Adaptive- line. In the experiment setting, 56 threads used two sock- Backoff was implemented in all protocols because it is easily ets, and 112 or more threads used four sockets. Thus, a attachable and effective for performance improvement. higher L3 cache miss rate (≥ 30%) indicates communication Figs. 4a, 4b, 4c, and 4d present the experimental results among sockets, which greatly degrades the performance of reproducing those in Figs. 6a, 6b, 6c, and 7 in the original fetch add. This means that centralized ordering should be paper, respectively. In Figs. 4b, 4c, and 4d, the tendencies avoided. in our results for Silo, TicToc, MOCC, ERMIA, and 2PL are the same as those in the corresponding figures in the original 5.2 Cache-Line Replacement paper. In contrast, the results for Cicada differ. Fig. 4d shows that Cicada had worse performance than Silo and We discuss the results for YCSB-B. In Figs. 5a and 5c, TicToc for a read-intensive workload, and this is inconsistent as cardinality increased, (1) the throughput first improved with the results shown in Fig. 7 in the original paper. We and then deteriorated, and (2) the abort ratio monotonically discuss this inconsistency in Insight 5 (§7.1). improved. When cardinality was quite low, the footprint of accessible records in database was small so that all of the records fit inside the L3 cache. The contention is more likely with L1 or L2 cache. As cardinality increases, such con- 5. ANALYSIS OF CACHE tention is alleviated since the number of accessible records 80 120 1.0 1.0 Cicada 100 Silo ERMIA 60 SI 2PL 0.8 MOCC 0.8 80 TicToc 0.6 0.6 40 60 0.4 0.4 40


[Figure 5 panels: (a) YCSB-B. (b) YCSB-C. (c) YCSB-B. (d) YCSB-C.]

Figure 5: Varying cardinality: payload 4 bytes, 224 threads, skew 0, 10 ops/trans.
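The centralized-ordering bottleneck discussed in §5.1 can be reproduced in a few lines: every thread performs fetch_add on one shared counter, so the cache line holding it bounces between cores and sockets. This is a minimal, self-contained microbenchmark sketch, not the exact code behind Fig. 6, and the iteration counts are arbitrary.

```cpp
#include <atomic>
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <thread>
#include <vector>

// Centralized ordering: all threads increment one shared counter, so the cache
// line that holds it is repeatedly invalidated across cores and sockets.
std::atomic<uint64_t> g_counter{0};

int main() {
    const unsigned threads = std::thread::hardware_concurrency();
    const uint64_t per_thread = 1000000;
    std::vector<std::thread> workers;
    auto start = std::chrono::steady_clock::now();
    for (unsigned i = 0; i < threads; ++i)
        workers.emplace_back([per_thread] {
            for (uint64_t j = 0; j < per_thread; ++j)
                g_counter.fetch_add(1, std::memory_order_relaxed);
        });
    for (auto& w : workers) w.join();
    double sec = std::chrono::duration<double>(
        std::chrono::steady_clock::now() - start).count();
    std::printf("%u threads: %.1f Mops/s total\n", threads, g_counter.load() / sec / 1e6);
    // A decentralized design (e.g., per-thread counters or Cicada's distributed
    // timestamps) avoids this single contended cache line entirely.
    return 0;
}
```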

TicToc, MOCC, and 2PL converge due to frequent L3 cache-line replacements.

90 0.20 60 [MOPS/SEC] 30 0.10 5.3 Effect of Invisible Reads 0 50 100 150 200 250 Cache-Miss Ratio 0 50 100 150 200 250 Updating shared data among threads running in different # Threads # Threads CPU sockets trigger L3 cache misses. It is mainly due to (a) Throughput. (b) Cache miss ratio. the cache-line conflicts, leading to performance degradation. The read operation in Silo does not update the correspond- Figure 6: Scalability of fetch add. ing metadata, which can prevent cache-line conflicts. This read method is referred to as InvisibleReads [49]. It effec- tively improves performance of protocols for read-intensive increases, and the performance improves. It is likely that workloads; however, its behaviors have not been explored with more cardinality, the total number of records starts to yet. The read operations of TicToc typically update the overflow L3 cache. This results in L3 cache misses, and con- corresponding metadata (i.e., read timestamp), so they are secutive accesses to remote socket cache or DRAM degrades not InvisibleReads. the performance. To investigate the details of InvisibleReads, we measured We investigated the strong effect of the L3 cache miss the performance of Silo and TicToc for various read ratios ratio. As shown in Fig. 5d, when cardinality was no more and cardinalities. The results of experiments are shown in than 105, the L3 cache miss ratios for Silo, TicToc, and Figs. 7a-7c. Fig. 7a shows that TicToc-1 K (where 1 K Cicada were almost zero. This is because the L3 cache size means 1000 records) outperformed Silo-1 K for low cardi- was 38.5 MB and the size of a record including the header nality and write-intensive workloads. The abort ratio of was 64 bytes (64 ×105 < 38.5×106). However, as cardinality TicToc-1 K was much worse than that of Silo-1 K, as shown increased, their throughputs decreased linearly, as shown in in Fig. 7b. As shown in Fig. 7a, in the read-most (90%) Fig. 5b. This is attributed to L1/L2 cache misses. When case, Silo-1 K outperformed TicToc-1 K, as evidenced by the cardinality was more than 105, we observed that L3 cache rapid increase in the abort ratio (Fig. 7b). The InvisibleReads misses affected performance even for read-only workloads. method contributed to the performance improvement in this The negative effect of read contention is shown in Figs. 1 case. For high cardinality workloads, Silo-1 M (where 1 M and 6 in the original paper [70]. The cache-line conflicts means 1 million records) always outperformed TicToc-1 M. that occurred due to read lock contention degraded perfor- The abort ratios of Silo-1 M and TicToc-1 M were almost mance, and the difference in performance between OCC and zero, as shown in Fig. 7b. This result seems mysterious. The 2PL rapidly increased as the number of threads increased. abort ratios for both Silo and TicToc were almost the same, However, there is no mention in the paper that the greater and their L3 cache miss ratios were almost the same, as the number of L3 cache misses for a read-only workload, the shown in Fig. 7c. However, Silo-1 M outperformed TicToc-1 smaller the performance difference between OCC and 2PL. M. This is because dealing with contention resolution pro- As shown in Fig. 5b, when cardinality was no less than 107, cessing in TicToc requires three functions: timestamp his- Silo, TicToc, and 2PL had almost the same throughput. tory management, early abort judgment, and read times- Moreover, their L3 cache miss ratios were almost the same, tamp update. 
These functions require executing additional instructions and thus degrade performance. If InvisibleReads is woven into the process, a protocol has difficulty exploiting them. Insight 2: For a detailed analysis of protocols, contention should be carefully considered. The occurrence of contention depends not only on the read ratio but also on cardinality and skew. We observed that Silo outperformed TicToc not only for read-only workloads but also for write-intensive workloads. This fact has not been reported in previous studies [24, 47, 77]. These results are due to the use of the InvisibleReads method, which prevents cache misses in low-contention cases. The efficiency of InvisibleReads is provided

[Figure 7 panels: (a) Throughput. (b) Abort ratio. (c) Cache-miss ratio.]

Figure 7: Analysis of InvisibleReads. 224 threads, payload 4 bytes, skew 0, 10 ops/trans, cardinality 10^3 or 10^6 records.
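The following sketch contrasts the two read styles analyzed here, under assumed data structures: an invisible read only loads shared data and verifies it later, while a TicToc-style visible read must also advance a read timestamp with an atomic update, which moves the cache line into exclusive mode on the reader's core. It is illustrative only, not the actual Silo or TicToc code.

```cpp
#include <atomic>
#include <cstdint>

struct Record {
    std::atomic<uint64_t> tid;   // Silo-style version word
    std::atomic<uint64_t> rts;   // TicToc-style read timestamp
    uint64_t payload;
};

// Invisible read (Silo): no store to shared memory, so concurrent readers keep
// the cache line in shared state and do not conflict with each other.
uint64_t invisible_read(Record& r, uint64_t& observed_tid) {
    observed_tid = r.tid.load(std::memory_order_acquire);  // re-checked at validation
    return r.payload;
}

// Visible read (TicToc-like): the reader may need to extend the record's read
// timestamp, so it performs an atomic read-modify-write on shared metadata.
uint64_t visible_read(Record& r, uint64_t commit_ts) {
    uint64_t cur = r.rts.load(std::memory_order_acquire);
    while (cur < commit_ts && !r.rts.compare_exchange_weak(cur, commit_ts)) {
        // another reader raced; 'cur' now holds the latest value, so retry
    }
    return r.payload;  // every such update can move the cache line between cores
}

int main() {
    Record r{{7}, {0}, 42};
    uint64_t t;
    return static_cast<int>(invisible_read(r, t) - visible_read(r, 5));  // 0
}
```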

MOCC 2PL-Wait 2PL-NoWait MOCC manages fewer lock objects for workloads with less 15 1.0 skew. 2PL-Wait and 2PL-NoWait showed similar perfor- 10 mances at skews of 0.60-0.85, so the cost of abort-retry and 0.5 5 the cost of waiting for lock were similar. So, we need more

0 RatioAbort 0.0 contention to answer the question. 0.6 0.65 0.7 0.75 0.8 0.85 0.6 0.65 0.7 0.75 0.8 0.85 To provide more contention, we increased the transaction Skew Skew size. The cost of aborts should be smaller for short transac- Throughput [MTPS] tions and larger for long ones. Therefore, with long trans- (a) Throughput. (b) Abort ratio. actions, the abort-retry cost is higher and may exceed the lock waiting cost. Fig. 8 shows that the lock waiting cost of Figure 8: Wait or NoWait: YCSB-B, 100 M records, 224 Wait and the abort-retry cost of NoWait were similar. Fig. 9a threads, payload 4 bytes, 10 ops/trans. shows that Wait began to outperform NoWait as the transac- tion size increased. In contrast, Silo-NoWait outperformed 101 1.0 Silo(-Wait), as shown in Fig. 10 (described in §6.2). The MOCC 2PL-Wait 0 0.8 answer then is that it depends on the situations. Therefore, 10 0.6 2PL-NoWait the control of waiting time is important for efficiency. 10-1 0.4 0.2 Insight 3: One cannot give a general statement about 10-2 RatioAbort 0.0 which is better, NoWait or Wait since it depends on the 20 40 60 80 100 20 40 60 80 100 situations. As shown in Fig. 9a, Wait begins to outperform

Throughput [MTPS] # Operations/Transaction # Operations/Transaction NoWait as the transaction size increases. On the other hand, (a) Throughput. (b) Abort ratio. as shown in Fig. 10c, Silo with NoWait comes to outperform the original Silo. This suggests the importance of adap- Figure 9: Impact of transaction size: YCSB-B, 100 M tive behavior. Although adaptive concurrency controls have records, 224 threads, payload 4 bytes, skew 0.8. been studied [38, 56, 60, 70], their importance has not been discussed in depth regarding optimizations except for back- off [47]. Studies on adaptive optimization methods remain by its simple design that does not require computations for on the frontier. conflict resolution or anti-dependency tracking. 6.2 Effect of Payload Size 6. ANALYSIS OF DELAY We analyzed the effect of payload size, which had not pre- viously been done, to the best of our knowledge. Fig. 10a 6.1 Effect of NoWait shows the relationship between throughput and payload size A method for mitigating performance degradation caused for Silo and TicToc. Throughput initially increased with by conflicts is to place an additional artificial delay before a payload size, and then started to decrease at a certain point. retry, such as with NoWaitTT [77] and AdaptiveBackoff [47]. An increase in payload size would likely degrade OCC per- In contrast to waiting for a lock release, the NoWait method formance, which is consistent with the throughput decrease immediately releases all of acquired locks when a conflict shown in the right half of the graph. However, the through- is detected and immediately retries. The question then in put increase shown in the left half of the graph is counter- dealing with contention is whether to choose NoWait or Wait. intuitive. To answer this question, we compared the performance We hypothesized that this increase was due to the de- of three methods for various degrees of skew: 2PL-Wait, lay caused by the extra reads process, where two or more 2PL-NoWait, and MOCC (mix of OCC and 2PL-Wait). All reads were performed if there was an interruption in the operations of each transaction for 2PL-Wait were sorted in concurrent update transactions. An increase in payload size advance to avoid deadlock. As shown in Fig. 8a, the perfor- lengthens the time to read the payload. This increases the mance of all three protocols started to degrade at a skew of probability of an interruption due to an update transaction, 0.7 due to a large number of conflicts. MOCC demonstrated which increases the number of extra reads. This behavior excellent performance, as shown in Fig. 8a. This is because extends the read phase and reduces the number of transac- 2 60 2 Silo Silo Silo+NoWait 1.6 Silo 1.5 TicToc TicToc 40 1.2 TicToc 1 0.8 20 0.5 0.4 0 0 0 4 200 400 600 800 1000 4 200 400 600 800 1000 0 10000 20000 Throughput [MTPS] Throughput [MTPS] Payload Size Extra Reads [M/SEC] Payload Size # Clocks of Sleep in Read Phase (a) Throughput. (b) Extra reads. (c) ReadPhaseExtension.

Figure 10: Effect of delay provided by extra read: YCSB-A, 100 M records, 224 threads, skew 0.9, 10 ops/trans. tions that run through the validation phase in parallel. As make a version by overwriting its previous version, with the the number of committed update transactions decreases, the former becoming visible and the latter becoming non-visible number of read validation failures also decreases, which leads at the same time. MVCC protocols typically make a version to throughput improvement. Delaying the retry of atomic on a different memory fragment from other visible versions reads reduces the number of concurrent worker threads in- of the same record. Therefore, their life does not end unless volved in the validation phase for the entire system, which GC conducts its work. also reduces contention. The performance of MVCC is thought to suffer from the Besides Wait, NoWait, and AdaptiveBackoff, we present a existence of many visible versions [46,55]. They lead to a sig- novel delay-related optimization method, ReadPhaseExten- nificant burden due to memory exhaustion or an increase in sion, which was inspired by the positive effect of extra reads. cache-line replacements for traversal or installation. MVCC Comparing Figs. 10a and 10b reveal that the curves are simi- protocols can provide higher performance than SVCC ones lar. This indicates that throughput and the number of extra due to their larger scheduling spaces. It should be noted that reads are correlated. If this is correct, adding an artificial architectural factors sometimes negate the benefits, as men- delay into the read phase should produce similar results. We tioned in §5.1. We analyzed this situation in depth. Note conducted such an experiment for a payload size of 4 bytes. that the cost for allocating new pages is negligible in the re- The result in Fig. 10c shows that a certain amount of addi- sults below since our AssertiveVersionReuse (§3.2) conducts tional delay (less than 9000 clocks) improved performance. the space allocation beforehand. This effect is similar to that of the extra read process. We MVCC protocols often exhibit high performance compared refer to this artificial delay as ReadPhaseExtension and de- with SVCC protocols for high contention workloads. To fine it as a new optimization method. ReadPhaseExtension is begin, we measured the throughput for various degrees of configured by setting the delay on the basis of local conflict skew (0 to 0.99), as shown in Fig. 11. Here we focus on the information. This optimization method can exploit informa- difference between Cicada (MVCC) and Silo (SVCC). In tion on the access conflicts for each record during the read Figs. 11a and 11b for YCSB-A, Cicada outperformed Silo phase whereas AdaptiveBackoff uses only global information when the skew was higher than 0.8. The larger the payload, across all worker threads. the greater the superiority of Cicada. This is attributed to Insight 4: The extra read process plays a key role in the Cicada having larger scheduling spaces than Silo; neverthe- performance of OCC protocols. It is known that the con- less, these results confirm that the performance of MVCC tention regulation caused by delay can contribute to perfor- suffers from having many versions [46, 55]. In Figs. 11c and mance improvement depending on the situation. A remark- 11d, Cicada underperformed Silo, TicToc, and MOCC on able finding here is that the ReadPhaseExtension inspired by YCSB-B. the extra read process can also improve performance. 
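A minimal sketch of the ReadPhaseExtension idea defined here: after the read phase, the worker spins for an artificial number of clock cycles chosen from conflict information it observed locally (here simply the number of extra reads) before entering validation. The proportional rule and the tuning constants are assumptions for illustration, not CCBench's implementation.

```cpp
#include <cstdint>
#include <x86intrin.h>

// Per-transaction, thread-local conflict information gathered in the read phase.
struct LocalConflictInfo {
    uint64_t extra_reads = 0;   // times a concurrent writer forced us to re-read
};

// Spin for roughly 'cycles' clock cycles using the TSC (no syscalls, no yields).
inline void spin_cycles(uint64_t cycles) {
    unsigned aux;
    uint64_t start = __rdtscp(&aux);
    while (__rdtscp(&aux) - start < cycles) { /* busy wait */ }
}

// ReadPhaseExtension: delay the start of validation based on locally observed
// conflicts, bounded by a cap (cf. the sub-9000-clock range in Fig. 10c).
inline void read_phase_extension(const LocalConflictInfo& info) {
    const uint64_t kClocksPerExtraRead = 1000;   // hypothetical tuning constant
    const uint64_t kMaxDelay = 9000;
    uint64_t delay = info.extra_reads * kClocksPerExtraRead;
    if (delay > kMaxDelay) delay = kMaxDelay;
    if (delay > 0) spin_cycles(delay);
}

int main() {
    LocalConflictInfo info;
    info.extra_reads = 3;
    read_phase_extension(info);   // then proceed to the validation phase
    return 0;
}
```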
Read- To clarify the reason, we analyzed the characteristics of PhaseExtension differs from NoWaitTT since it can exploits the versions. Larger scheduling spaces are supported by information on conflicting records inside transactions to ad- many versions, and they can be an overhead. Such overhead just the delay whereas the delay in NoWaitTT is fixed. Such in modern protocols remains to be analyzed. We show the additional delay in the read phase and the use of thread-local latency breakdown with a skew of 0, where MVCC is at a conflict information combine to create a unique optimization disadvantage, in Fig. 12. In Figs. 12a and 12b, we can method. see that the major overhead of Cicada lies in the read and write operations rather than validation or GC. To clarify the reasons for the overhead, we designed an SVCC version 7. ANALYSIS OF VERSION LIFETIME of Cicada, Cicada-SV, that overwrites the inline versions of Cicada. This eliminates the need to allocate new areas, so 7.1 Determining Version Overhead Cicada-SV performs like an SVCC protocol. We prepared The life of a version begins when a corresponding write op- a special workload in which transactions were completely eration creates it. The version state is called visible during partitioned to avoid contention because Cicada-SV does not the period when other transactions can read it. Otherwise, cope well with contention. The results are shown in Fig. 13. the version state is called non-visible. Recent CC proto- The latencies of Cicada-SV and Silo were almost the same. cols typically make the version visible at the pre-commit We attribute the overhead shown by Cicada in Figs. 12a and phase [41, 66]. After a certain period, the life of the version 12b to the cost of version chain traversals. ends, and it is made non-visible. SVCC protocols typically Since having many visible versions degrades performance, Cicada ERMIA MOCC SI Silo 2PL TicToc 20 107 25 108 6 7 15 10 20 10 105 15 106 10 104 10 105 5 103 5 104 0 102 0 103

Figure 11: Effect of workload skew: 100 M records, 224 threads, payload 4 bytes, 10 ops/trans. (a) YCSB-A. (b) YCSB-A (high skew). (c) YCSB-B. (d) YCSB-B (high skew). (x-axis: skew; y-axis: throughput.)


Figure 12: Latency breakdown: skew 0, 100 M records, 224 threads, payload 4 bytes, 10 ops/trans. (a) YCSB-A. (b) YCSB-B. (Categories: Gen. Workload, Validation, Read, GC, Write, Other.)

Figure 14: Analysis of RapidGC: YCSB-A, skew 0, 224 threads, 1 M records, payload 4 bytes, 10 ops/trans. Inserted delays for long transactions: 0 to 10 ms. (x-axis: GC interval [us]; y-axis: throughput [MTPS].)

Figure 13: Latency breakdown of Cicada-SV and Silo for a partitioned workload. (a) YCSB-A. (b) YCSB-B. Other settings were the same as for Fig. 12.

Since having many visible versions degrades performance, reducing their number should improve performance, and the RapidGC optimization effectively mitigates this problem. To evaluate its efficiency, we varied the interval setting and measured the throughput of Cicada. The 0 and 100 us lines in Fig. 14 show that high throughput was obtained when the interval was no more than 100 us, as reported in prior studies [47, 73]. This is because the number of visible versions was small and most of them were kept in cache. So, can RapidGC with its short intervals perform efficiently for any workload? The answer is no, and we explain why in §7.2.

Insight 5: The overhead of multi-version management is not negligible. Silo and TicToc outperformed Cicada in the high-skew (0.99), read-intensive (YCSB-B), non-RMW, high-cardinality (100 M records) case. A previous study [47] found that all three exhibited similar performance for a similar workload (skew 0.99, read 95%, RMW, 10 M records, payload 100 bytes), as shown in Fig. 7 of the original paper. This inconsistency is due to the use of different platforms in the previous study. Using a single platform, we observed the difference and found that the version management cost of Cicada is not negligible even in a low-contention case, as shown in Figs. 12a and 12b. It is difficult to obtain precise knowledge using different reference implementations or platforms, and the deep analysis of CC protocols must be done on the same platform.

7.2 Limit of Current Approach
We investigated the behavior of RapidGC using the same workloads as in our third experiment, this time with a long transaction mixed in. Even state-of-the-art GC does not sufficiently reduce the number of visible versions if there is even a single long transaction. This is because modern protocols assume that transaction execution is one-shot and that transactions are relatively short (e.g., 16 operations for the original YCSB). Long transactions other than read-only ones have been ignored in modern transaction studies.

To generate a long transaction, we added an artificial delay at the end of the read phase. Both long and short transactions used the same number of operations with the same read-write ratio. One worker thread executed the long transaction, and the remaining worker threads executed the short transactions. The skew was set to 0 so that contention on record accesses rarely occurred and thus did not affect performance, even though there was a long transaction.

We measured the performance of Cicada under the workload described above, varying the RapidGC interval settings and the delay added to the long transaction. As shown in Fig. 14, performance saturated when a delay was inserted. Saturation occurred when the GC interval was the same as the added delay. For example, the light blue line includes a long transaction with a 1 ms delay, and performance saturated when the GC interval was 1 ms. Similar behaviors were observed with longer delays. This is because current GC methods do not collect visible versions that may be read by active long transactions; the current GC scheme does not collect these versions until the transaction finishes. We consider this limitation the primary hurdle to improving MVCC performance by reducing the number of visible versions. We could not obtain results for a 1 s delay because such a delay requires a huge amount of memory, which causes the Linux OOM (out-of-memory) killer to kill the process.
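The reason a single long transaction defeats short GC intervals can be made explicit with a small sketch, assuming a watermark-style reclamation rule (the common pattern in MVCC GC; the exact bookkeeping in Cicada differs in detail).

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Reclamation boundary: no version that a still-active transaction might read
// may be collected, so the boundary is the minimum begin timestamp over all
// active transactions.
uint64_t reclamation_boundary(const std::vector<uint64_t>& active_begin_ts,
                              uint64_t current_ts) {
  if (active_begin_ts.empty()) return current_ts;
  return *std::min_element(active_begin_ts.begin(), active_begin_ts.end());
}
// A version is collectable only if it has a newer committed successor whose
// timestamp is below this boundary. A single long transaction keeps the
// boundary low for its whole lifetime and therefore pins every version written
// after it started, no matter how short the GC interval is.
```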
7.3 Aggressive Garbage Collection
Given the limitation of the current GC scheme described above, we suggest a novel GC scheme, AggressiveGC, which aggressively collects versions beyond the current ones to deal with long transactions. For example, the multi-version timestamp ordering (MVTO) protocol could be integrated with a GC method that aggressively collects visible versions, making some versions non-visible even though active or future transactions need to read them. Such a protocol might incur read operation failures, unlike conventional MVTO; these could be handled by aborting the transaction and retrying it with a new timestamp. Restricting the number of versions has been discussed for kVSR [51] and 2V2PL [26, 61]. However, only the latest contiguous versions are kept there, so this approach is less flexible than our suggested scheme. We claim that the visible versions do not need to be contiguous and that the number of versions can be flexible depending on the context. An interesting topic in our proposed scheme is the risk of starvation. One way to mitigate this problem is to manage priorities among transactions, as in wait-die [58], which has not yet been discussed in modern MVCC studies.

We suggest two optimizations for write operations in such aggressive protocols. The first is version overwriting, i.e., creating a new version by overwriting the memory segment of the previous version, which becomes non-visible at the same time, as is done in SVCC protocols. Version overwriting is efficient because two operations are combined into one. The second is non-visible write, i.e., making versions non-visible from the beginning of their life. The idea of non-visible write was originally proposed as the Thomas write rule [65] and was recently generalized as the non-visible write rule (NWR) [53] to deal with blind writes. Novel version lifetime management is a promising way to improve MVCC protocols.

Insight 6: Even state-of-the-art GC cannot hold down the number of versions in MVCC protocols if a single long transaction is mixed into the workload. An aggressive approach can solve this problem by changing the version state to non-visible for both reads and writes, even if transactions still require that state.
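As an illustration of the aggressive read path sketched in §7.3, the toy code below shows an MVTO-style read that may fail because the needed version was collected, followed by a retry with a larger timestamp. This is a hypothetical sketch of the proposal, not an implemented CCBench protocol; the bounded retry loop reflects the starvation risk noted above.

```cpp
#include <cstdint>
#include <map>
#include <optional>

// Toy version store: record id -> (wts -> payload).
using VersionStore = std::map<uint64_t, std::map<uint64_t, int>>;

// Newest version with wts <= read_ts, or nullopt if it was already collected.
std::optional<int> mvto_read(const VersionStore& vs, uint64_t rid, uint64_t read_ts) {
  auto rec = vs.find(rid);
  if (rec == vs.end()) return std::nullopt;
  auto it = rec->second.upper_bound(read_ts);
  if (it == rec->second.begin()) return std::nullopt;  // needed version is gone
  return std::prev(it)->second;
}

// Under AggressiveGC, a failed read aborts the transaction and retries with a
// new, larger timestamp. A real protocol would bound retries or apply
// wait-die-style priorities to avoid starvation.
std::optional<int> read_with_retry(const VersionStore& vs, uint64_t rid,
                                   uint64_t ts, int max_retries = 8) {
  for (int i = 0; i < max_retries; ++i, ++ts) {
    if (auto v = mvto_read(vs, rid, ts)) return v;
  }
  return std::nullopt;  // caller must handle the starvation case
}
```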
8. RELATED WORK
Yu et al. [76] evaluated CC protocols using DBx1000 [7], which is open source. They evaluated the scalability of CC protocols using a real machine and an emulated many-core machine. Three recent protocols [44, 47, 70] supported in CCBench are not included in DBx1000 [7]. Wu et al. [73] empirically evaluated MVCC protocols using Peloton [12], which is open source. They evaluated not only scalability but also the contention effect, read ratio, attributes, memory allocation, and index. They did not evaluate SVCC protocols or the modern MVCC protocols evaluated in this paper [47, 66, 70, 77]. Appuswamy et al. [24] evaluated CC protocols in four types of architecture using Trireme, which is not open source. They determined that the shared-everything architecture is still the best option for contention-tolerant in-memory transaction engines. CC protocols for distributed systems have been evaluated elsewhere [39, 67]; this paper focuses on a single many-core architecture.

Whereas previous studies mostly evaluated scalability and did not explore the behavior of protocols when thread parallelism was set to a high degree [31, 33, 34, 36, 38, 40, 44, 45, 47, 57, 64, 66, 70, 71, 74, 75, 77, 78], we fixed the thread parallelism at 224 and analyzed the protocols under various settings. We classified a variety of optimization methods on the basis of three performance factors: cache, delay, and version lifetime. This analysis let us identify three new optimization methods.

9. CONCLUSION
Using CCBench, we analyzed concurrency control protocols and optimization methods for various settings of the workload parameters with the number of threads fixed at 224, whereas previous studies mostly focused on thread scalability and none of them explored the space analyzed here. We classified versatile optimization methods on the basis of three performance factors: cache, delay, and version lifetime. Through the analysis of protocols with CCBench, we gained six insights. I1: The performance of optimistic concurrency control for a read-only workload rapidly degrades as cardinality increases, even without L3 cache misses. I2: Silo outperforms TicToc for write-intensive workloads, which is attributed to InvisibleReads in unskewed, high-cardinality cases. I3: The effectiveness of the two approaches to coping with conflict (Wait and NoWait) depends on the situation. I4: Extra reads can regulate contention. I5: Results produced from mixed implementations may be inconsistent with the theory. I6: Even RapidGC, a state-of-the-art garbage collection method, cannot improve the performance of multi-version concurrency control if a single long transaction is mixed into the workload. On the basis of I4, we defined the ReadPhaseExtension optimization, in which an artificial delay is added to the read phase. On the basis of I6, we defined the AggressiveGC optimization, in which even visible versions are collected.

In future work, we will support the TPC-C full-mix for protocols other than Silo and include logging and recovery modules based on our preliminary studies [52, 62]. The code for CCBench and all the data in this paper are available online at GitHub [5]. We expect that CCBench will help to advance transaction processing research.

10. REFERENCES
[1] CCBench Developer Guide. https://github.com/thawk105/ccbench/tree/master/cc_format.
[2] CCBench Experimental Data. https://github.com/thawk105/ccdata.
[3] CCBench OCC. https://github.com/thawk105/ccbench/tree/master/occ.
[4] Code of Cavalia. https://github.com/Cavalia.
[5] Code of CCBench. https://github.com/thawk105/ccbench.
[6] Code of Cicada. https://github.com/efficient/cicada-engine.
[7] Code of DBx1000. https://github.com/yxymit/DBx1000.
[8] Code of DBx1000 Extension for Cicada. https://github.com/efficient/cicada-exp-sigmod2017-DBx1000.
[9] Code of ERMIA. https://github.com/ermia-db/ermia.
[10] Code of FOEDUS. https://github.com/hkimura/foedus_code.
[11] Code of Masstree. https://github.com/kohler/masstree-beta.
[12] Code of Peloton. https://pelotondb.io.
[13] Code of SGT. https://github.com/durner/No-False-Negatives.
[14] Code of Silo. https://github.com/stephentu/silo.
[15] Code of STO. https://readablesystems.github.io/sto.
[16] gflags. https://github.com/gflags/gflags.
[17] How to Extend CCBench. https://medium.com/@jnmt.
[18] Masstree Debug about Cast. https://github.com/thawk105/masstree-beta/commit/d4bcf7711dc027818b1719a5a4c29aee547c58f6.
[19] Masstree Debug about string_slice.hh. https://github.com/thawk105/masstree-beta/commit/77ef355f868a6db4eac7b44669c508d8db053502.
[20] Masstree when Controlling Key Length is 9. https://github.com/kohler/masstree-beta/issues/42.
[21] mimalloc. https://github.com/microsoft/mimalloc.
[22] The Transaction Processing Council. TPC-C Benchmark (Revision 5.11), February 2011.
[23] tpcc-runner. https://github.com/wattlebirdaz/tpcc-runner.
[24] R. Appuswamy, A. G. Anadiotis, D. Porobic, M. Iman, and A. Ailamaki. Analyzing the Impact of System Architecture on the Scalability of OLTP Engines for High-Contention Workloads. PVLDB, 11(2):121–134, 2017.
[25] T. Bang, N. May, I. Petrov, and C. Binnig. The Tale of 1000 Cores: An Evaluation of Concurrency Control on Real(ly) Large Multi-Socket Hardware. In DaMoN, 2020.
[26] R. Bayer, H. Heller, and A. Reiser. Parallelism and Recovery in Database Systems. ACM TODS, 5(2):139–156, 1980.
[27] H. Berenson, P. Bernstein, J. Gray, J. Melton, E. O'Neil, and P. O'Neil. A Critique of ANSI SQL Isolation Levels. In SIGMOD Record, volume 24, pages 1–10, 1995.
[28] P. A. Bernstein and N. Goodman. Concurrency Control in Distributed Database Systems. ACM Comput. Surv., 13(2):185–221, 1981.
[29] P. A. Bernstein, V. Hadzilacos, and N. Goodman. Concurrency Control and Recovery in Database Systems. 1987.
[30] A. T. Clements, M. F. Kaashoek, and N. Zeldovich. RadixVM: Scalable Address Spaces for Multithreaded Applications. In EuroSys, pages 211–224, 2013.
[31] M. Dashti, S. Basil John, A. Shaikhha, and C. Koch. Transaction Repair for Multi-Version Concurrency Control. In SIGMOD Conf., pages 235–250, 2017.
[32] D. E. Difallah, A. Pavlo, C. Curino, and P. Cudre-Mauroux. OLTP-Bench: An Extensible Testbed for Benchmarking Relational Databases. PVLDB, 7(4):277–288, 2013.
[33] B. Ding, L. Kot, and J. Gehrke. Improving Optimistic Concurrency Control through Transaction Batching and Operation Reordering. PVLDB, 12(2):169–182, 2018.
[34] D. Durner and T. Neumann. No False Negatives: Accepting All Useful Schedules in a Fast Serializable Many-Core System. In ICDE, pages 734–745, 2019.
[35] K. P. Eswaran, J. N. Gray, R. A. Lorie, and I. L. Traiger. The Notions of Consistency and Predicate Locks in a Database System. Comm. ACM, 19(11):624–633, 1976.
[36] J. M. Faleiro, D. J. Abadi, and J. M. Hellerstein. High Performance Transactions via Early Write Visibility. PVLDB, 10(5):613–624, 2017.
[37] J. Gray, P. Sundaresan, S. Englert, K. Baclawski, and P. J. Weinberger. Quickly Generating Billion-Record Synthetic Databases. In SIGMOD Record, volume 23, pages 243–252, 1994.
[38] J. Guo, P. Cai, J. Wang, W. Qian, and A. Zhou. Adaptive Optimistic Concurrency Control for Heterogeneous Workloads. PVLDB, 12(5):584–596, 2019.
[39] R. Harding, D. Van Aken, A. Pavlo, and M. Stonebraker. An Evaluation of Distributed Concurrency Control. PVLDB, 10(5):553–564, 2017.
[40] Y. Huang, W. Qian, E. Kohler, B. Liskov, and L. Shrira. Opportunities for Optimism in Contended Main-Memory Multicore Transactions. PVLDB, 13(5):629–642, 2020.
[41] R. Johnson, I. Pandis, R. Stoica, M. Athanassoulis, and A. Ailamaki. Aether: A Scalable Approach to Logging. PVLDB, 3(1-2):681–692, 2010.
[42] H. Jung, H. Han, A. Fekete, U. Röhm, and H. Y. Yeom. Performance of Serializable Snapshot Isolation on Multicore Servers. In DASFAA, pages 416–430, 2013.
[43] T. Kersten, V. Leis, A. Kemper, T. Neumann, A. Pavlo, and P. Boncz. Everything You Always Wanted to Know about Compiled and Vectorized Queries but Were Afraid to Ask. PVLDB, 11(13):2209–2222, 2018.
[44] K. Kim, T. Wang, R. Johnson, and I. Pandis. ERMIA: Fast Memory-Optimized Database System for Heterogeneous Workloads. In SIGMOD Conf., pages 1675–1687, 2016.
[45] H. Kimura. FOEDUS: OLTP Engine for a Thousand Cores and NVRAM. In SIGMOD Conf., pages 691–706, 2015.
[46] P. Larson, S. Blanas, C. Diaconu, C. Freedman, J. M. Patel, and M. Zwilling. High-Performance Concurrency Control Mechanisms for Main-Memory Databases. PVLDB, 5(4):298–309, 2011.
[47] H. Lim, M. Kaminsky, and D. G. Andersen. Cicada: Dependably Fast Multi-Core In-Memory Transactions. In SIGMOD Conf., pages 21–35, 2017.
[48] Y. Mao, E. Kohler, and R. T. Morris. Cache Craftiness for Fast Multicore Key-Value Storage. In EuroSys, pages 183–196, 2012.
[49] V. J. Marathe, W. N. Scherer, and M. L. Scott. Design Tradeoffs in Modern Software Transactional Memory Systems. In LCR, pages 1–7, 2004.
[50] J. M. Mellor-Crummey and M. L. Scott. Scalable Reader-Writer Synchronization for Shared-Memory Multiprocessors. In SIGPLAN Notices, volume 26, pages 106–113, 1991.
[51] T. Morzy. The Correctness of Concurrency Control for Multiversion Database Systems with Limited Number of Versions. In ICDE, pages 595–604, 1993.
[52] Y. Nakamura, H. Kawashima, and O. Tatebe. Integration of TicToc Concurrency Control Protocol with Parallel Write Ahead Logging Protocol. Journal of Network Computing, 9(2):339–353, 2019.
[53] S. Nakazono, H. Uchiyama, Y. Fujiwara, Y. Nakamura, and H. Kawashima. NWR: Rethinking Thomas Write Rule for Omittable Write Operations. http://arxiv.org/abs/1904.08119, 2020.
[54] N. Narula, C. Cutler, E. Kohler, and R. Morris. Phase Reconciliation for Contended In-Memory Transactions. In OSDI, pages 511–524, 2014.
[55] T. Neumann, T. Mühlbauer, and A. Kemper. Fast Serializable Multi-Version Concurrency Control for Main-Memory Database Systems. In SIGMOD Conf., pages 677–689, 2015.
[56] A. Pavlo, G. Angulo, J. Arulraj, H. Lin, J. Lin, L. Ma, P. Menon, T. C. Mowry, M. Perron, I. Quah, S. Santurkar, A. Tomasic, S. Toor, D. V. Aken, Z. Wang, Y. Wu, R. Xian, and T. Zhang. Self-Driving Database Management Systems. In CIDR, 2017.
[57] G. Prasaad, A. Cheung, and D. Suciu. Handling Highly Contended OLTP Workloads Using Fast Dynamic Partitioning. In SIGMOD Conf., pages 527–542, 2020.
[58] D. J. Rosenkrantz, R. E. Stearns, and P. M. Lewis. System Level Concurrency Control for Distributed Database Systems. ACM TODS, 3(2):178–198, 1978.
[59] M. L. Scott and W. N. Scherer. Scalable Queue-based Spin Locks with Timeout. In SIGPLAN Notices, volume 36, pages 44–52, 2001.
[60] Y. Sheng, A. Tomasic, T. Zhang, and A. Pavlo. Scheduling OLTP Transactions via Learned Abort Prediction. In aiDM, 2019.
[61] R. E. Stearns and D. J. Rosenkrantz. Distributed Database Concurrency Controls Using Before-Values. In SIGMOD Conf., pages 74–83, 1981.
[62] T. Tanabe, H. Kawashima, and O. Tatebe. Integration of Parallel Write Ahead Logging and Cicada Concurrency Control Method. In BITS, pages 291–296, 2018.
[63] D. Tang and A. J. Elmore. Toward Coordination-Free and Reconfigurable Mixed Concurrency Control. In ATC, pages 809–822, 2018.
[64] D. Tang, H. Jiang, and A. J. Elmore. Adaptive Concurrency Control: Despite the Looking Glass, One Concurrency Control Does Not Fit All. In CIDR, 2017.
[65] R. H. Thomas. A Majority Consensus Approach to Concurrency Control for Multiple Copy Databases. ACM TODS, 4(2):180–209, 1979.
[66] S. Tu, W. Zheng, E. Kohler, B. Liskov, and S. Madden. Speedy Transactions in Multicore In-Memory Databases. In SOSP, pages 18–32, 2013.
[67] C. Wang, K. Huang, and X. Qian. A Comprehensive Evaluation of RDMA-enabled Concurrency Control Protocols. http://arxiv.org/abs/2002.12664, 2020.
[68] T. Wang, R. Johnson, A. Fekete, and I. Pandis. The Serial Safety Net: Efficient Concurrency Control on Modern Hardware. In DaMoN, 2015.
[69] T. Wang, R. Johnson, A. Fekete, and I. Pandis. Efficiently Making (Almost) Any Concurrency Control Mechanism Serializable. VLDB Journal, 26(4):537–562, 2017.
[70] T. Wang and H. Kimura. Mostly-Optimistic Concurrency Control for Highly Contended Dynamic Workloads on a Thousand Cores. PVLDB, 10(2):49–60, 2016.
[71] Z. Wang, S. Mu, Y. Cui, H. Yi, H. Chen, and J. Li. Scaling Multicore Databases via Constrained Parallel Execution. In SIGMOD Conf., pages 1643–1658, 2016.
[72] G. Weikum and G. Vossen. Transactional Information Systems. Elsevier, 2001.
[73] Y. Wu, J. Arulraj, J. Lin, R. Xian, and A. Pavlo. An Empirical Evaluation of In-Memory Multi-Version Concurrency Control. PVLDB, 10(7):781–792, 2017.
[74] Y. Wu, C.-Y. Chan, and K.-L. Tan. Transaction Healing: Scaling Optimistic Concurrency Control on Multicores. In SIGMOD Conf., pages 1689–1704, 2016.
[75] Y. Wu and K.-L. Tan. Scalable In-Memory Transaction Processing with HTM. In ATC, pages 365–377, 2016.
[76] X. Yu, G. Bezerra, A. Pavlo, S. Devadas, and M. Stonebraker. Staring into the Abyss: An Evaluation of Concurrency Control with One Thousand Cores. PVLDB, 8(3):209–220, 2014.
[77] X. Yu, A. Pavlo, D. Sanchez, and S. Devadas. TicToc: Time Traveling Optimistic Concurrency Control. In SIGMOD Conf., pages 1629–1642, 2016.
[78] Y. Yuan, K. Wang, R. Lee, X. Ding, J. Xing, S. Blanas, and X. Zhang. BCC: Reducing False Aborts in Optimistic Concurrency Control with Low Cost for In-Memory Databases. PVLDB, 9(6):504–515, 2016.

APPENDIX

A. EVALUATION OF A SUBSET OF TPC-C
The TPC-C benchmark emulates the workload of a wholesale supplier and is the current industry standard for evaluating transaction processing systems [22]. Here, we introduce how previous studies and CCBench support TPC-C, describe the results of its evaluation, including a comparison with other systems [7, 8], and analyze the effect of index structures on performance.

A.1 TPC-C-NP
TPC-C or its variants have been widely used to evaluate concurrency control protocols. Some studies fully support all five transactions, whereas others support only a subset of them. We summarize the studies in Table 3. Most of the papers with new proposals evaluated the full-mix. There were three experimental papers, among which only Wu et al. [73] evaluated the full-mix, using Peloton [12]. Appuswamy et al. [24] did not evaluate TPC-C. Finally, Yu et al. [76] and Bang et al. [25] evaluated only New-Order and Payment from among the five transactions, using DBx1000 [7].

Table 3: Supported TPC-C transactions and execution platforms in modern CC studies for a single node. The Type column indicates the type of research: papers marked Exp. describe analytical studies, and those marked Pro. propose new protocols or optimization methods. The TPC-C column indicates the type of TPC-C transactions supported in the paper. NP indicates New-Order and Payment; Full indicates the full-mix; α includes an original StockScan transaction in addition to the full-mix; β includes an original Reward transaction in addition to NP; φ indicates no TPC-C evaluation. The System column shows the evaluation system used in the paper; no citation indicates that the code seems to be publicly unavailable.

| Paper | Type | TPC-C | System |
| MVCC Eval. [73] | Exp. | Full (α) | Peloton [12] |
| CCBench | Exp. | Full | CCBench [5] |
| Abyss [76] | Exp. | NP | DBx1000 [7] |
| 1000 Cores [25] | Exp. | NP | DBx1000 [7] |
| Trireme [24] | Exp. | φ | Original |
| Repair [31] | Pro. | Full | Cavalia [4] |
| Healing [74] | Pro. | Full | Cavalia [4] |
| HTCC [75] | Pro. | Full | Cavalia [4] |
| Cicada [47] | Pro. | Full | Cicada [6] |
| ACC [64] | Pro. | Full | Doppel [54] |
| Latch-free SSN [69] | Pro. | Full | ERMIA [9] |
| FOEDUS [45] | Pro. | Full | FOEDUS [10] |
| MOCC [70] | Pro. | Full | FOEDUS [10] |
| Silo [66] | Pro. | Full | Silo [14] |
| IC3 [71] | Pro. | Full | Silo [14] |
| STOv2 [40] | Pro. | Full | STO [15] |
| CormCC [63] | Pro. | Full | Original |
| TicToc [77] | Pro. | Full | DBx1000 [7] |
| Strife [57] | Pro. | Full | DBx1000 [7] |
| AOCC [38] | Pro. | NP (β) | DBx1000 [7] |
| EWV [36] | Pro. | NP | Original |
| BCC [78] | Pro. | NP | Silo [14] |
| Batching [33] | Pro. | NP | Cicada [6] |
| SGT [34] | Pro. | φ | Original [13] |
| Doppel [54] | Pro. | φ | Original |

Among the CC protocols, CCBench currently supports only Silo for TPC-C. Here, we evaluate New-Order and Payment transactions in CCBench and validate the correctness of its implementation by comparing it with other platforms [7, 8]. In this Appendix, we denote the subset of the TPC-C workload that includes only New-Order and Payment transactions as NP and illustrate an evaluation of two variants of NP, denoted NP-NoInsert and NP-Insert. Both were supported by DBx1000 and evaluated by Bang et al. [25].
NP-NoInsert does not include any insertion operations. NP originally includes insertions into the Order, Order-Line, and New-Order tables in the New-Order transaction and insertions into the History table in the Payment transaction. However, from the viewpoint of semantics, it is possible to omit these insertions. Records inserted into the History table by the Payment transaction are not handled thereafter (i.e., other transactions do not perform CRUD operations on the History table). Records inserted into the Orders, New-Order, and Order-Line tables by the New-Order transaction are not read, updated, or deleted by the Payment transaction. Therefore, NP can omit all insertion operations.

NP-Insert includes the insertion operations that are omitted in NP-NoInsert. To run NP-NoInsert or NP-Insert, different functions must be implemented. We illustrate the differences in Table 4. As shown in the table, the functions required for NP-NoInsert are the same as those for a part of YCSB. To implement NP-Insert, an insertion operation is necessary.

A.2 Implementation and Settings
We implemented the TPC-C client code with reference to the original DBx1000 system [7] and an extension of DBx1000 by the Cicada team [8], which we denote DBx1000(C). We implemented NP in CCBench as follows. First, for the access method, we used a Masstree implementation that is available online [11] with some modifications (fixing bugs regarding a cast [18] and long keys of more than 9 bytes [19, 20]). All nine tables were searchable by primary key. The History table did not originally need primary keys, but the table was stored using Masstree; thus, it used dummy primary keys produced by the scalable, unique identifier generator that we developed. The primary keys for all tables were encoded into 8 bytes in CCBench.

NP requires a secondary index on the Customer table with the c_w_id, c_d_id, and c_last columns. We stored multiple customer primary keys in a std::vector container for each secondary key. The size of a secondary key was at most 20 bytes: 2 bytes for c_w_id, 1 byte for c_d_id, and up to 17 bytes for c_last. Our NP implementation is publicly available on GitHub [5].

The settings of the other platforms were as follows. DBx1000 separates the table data structures from the primary index structures and can omit primary indexes for tables if they are unnecessary. We did not use the B+tree index owing to its low performance; therefore, we used a hash index for all indexes in our experiments. DBx1000 was configured by omitting unnecessary primary indexes in NP: the Order, New-Order, Order-Line, and History tables had no primary index. The Customer secondary index composed of the (c_w_id, c_d_id, c_last) key was encoded into 8 bytes. A pointer to the records was retrieved using the encoded key through the hash index and was accessed through the linked list. DBx1000(C) also separates the tables and indexes; it properly uses both the Masstree and the hash index.
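The key layouts described above can be sketched as follows. The sizes come from the text; the exact bit packing used in CCBench is not specified here, so the layout below is an illustrative assumption.

```cpp
#include <cstdint>
#include <cstring>
#include <map>
#include <string>
#include <vector>

// 8-byte primary key for CUSTOMER: (c_w_id, c_d_id, c_id) packed into 64 bits.
inline uint64_t encode_customer_pk(uint16_t w_id, uint8_t d_id, uint32_t c_id) {
  return (uint64_t(w_id) << 40) | (uint64_t(d_id) << 32) | c_id;
}

// Secondary key: 2 bytes c_w_id + 1 byte c_d_id + up to 17 bytes c_last
// (at most 20 bytes in total, as stated above).
inline std::string encode_customer_sk(uint16_t w_id, uint8_t d_id,
                                      const std::string& c_last) {
  std::string key(3, '\0');
  std::memcpy(&key[0], &w_id, sizeof(w_id));
  key[2] = static_cast<char>(d_id);
  return key + c_last.substr(0, 17);
}

// Each secondary key maps to the primary keys of all matching customers,
// held in a std::vector as described in the text.
std::map<std::string, std::vector<uint64_t>> customer_secondary_index;
```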
Figure 15: TPC-C-NP-NoInsert (full scale). (a) Warehouse count = 1. (b) Warehouse count = 4. (c) Warehouse count = thread count. (Series: CCBench, DBx1000, DBx1000(C).)

Figure 16: TPC-C-NP-NoInsert (small scale). (a) Warehouse count = 1. (b) Warehouse count = 4. (c) Warehouse count = thread count.

Table 4: Workload and required functions. YCSB' indicates only YCSB-A, B, C, and F, which are key-wise; it does not include insert (required by YCSB-D) or range query (required by YCSB-E). NP-NoInsert does not include any insertion operations. NP originally includes insertions into the Order, Order-Line, and New-Order tables in the New-Order transaction and insertions into the History table in the Payment transaction. NP-Insert includes the insertion operations omitted in NP-NoInsert. Phantom avd. indicates phantom avoidance. Each cell denotes whether the corresponding function is required.

| Function | YCSB' | NP-NoInsert | NP-Insert | Full-mix |
| Insertion | No | No | Yes | Yes |
| Deletion | No | No | No | Yes |
| Range search | No | No | No | Yes |
| Phantom avd. | No | No | No | Yes |

A.3 Analysis
We used the experimental environment described in §2.2. Each worker thread chose either a New-Order or a Payment transaction at random in a 50:50 ratio each time.

A.3.1 Varying Thread Count
We evaluated the workloads by varying the thread count.

NP-NoInsert: For DBx1000(C), we set the two parameters TPCC_INSERT_ROWS and TPCC_INSERT_INDEX to false. We commented out the code of the insertion operations for DBx1000 and set the corresponding command-line argument for CCBench to omit insertion operations. The results are shown in Figs. 15 and 16. Fig. 15c shows that all the systems exhibited scalability for a low-contention case, and Figs. 15a and 15b show that they were less efficient for high-contention cases. This is consistent with prior studies (Figs. 4 and 9 of the TicToc study [77], Fig. 5 of the Cicada study [47], and Figs. 2 and 3 of the 1000-cores paper [25]). Fig. 15 also shows that DBx1000 outperformed CCBench in all cases. One of the reasons is the difference in indexes: CCBench used Masstree, and DBx1000 used hash tables.

When the warehouse count was equal to the thread count, DBx1000(C) exhibited the best performance, as illustrated in Fig. 15c, when the thread count was less than or equal to 96. When the thread count was more than 96, it did not run owing to an ASSERT failure, and increasing the NMaxCores parameter did not solve this issue. Fig. 16 shows that all three platforms scaled in all settings when the thread count was no more than 20.

NP-Insert: For DBx1000(C), we set the two parameters TPCC_INSERT_ROWS and TPCC_INSERT_INDEX to true. We uncommented the code of the insertion operations for DBx1000 and set the corresponding command-line argument for CCBench to perform insertion operations. The results are shown in Figs. 17 and 18. The performance of DBx1000 for NP-Insert was overwhelmingly worse than that for NP-NoInsert, which should be due to the insertion operations. This degradation was reported by Bang et al. [25] (Fig. 9), who improved the workload by weaving modern optimizations [40, 43, 45] into DBx1000; their code is not publicly available. CCBench and DBx1000 exhibited similar performance trends in all cases, and CCBench outperformed DBx1000 under this setting. DBx1000(C) exhibited a mysterious performance when the warehouse count was set to 1 or 4, as shown in Figs. 17a, 17b, 18a, and 18b; we could not determine the reason for this behavior.

The performance of NP-Insert was worse than that of NP-NoInsert because of the additional insert operations. CCBench exhibited approximately 14 Mtps for NP-NoInsert and approximately 8 Mtps for NP-Insert with 224 threads and 224 warehouses, as shown in Figs. 15c and 17c, respectively. Insert operations require an index traversal as well as memory allocation for index nodes and record data, which typically causes faults; they are therefore considered expensive compared with other operations.
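For reference, the variant selection described above for DBx1000(C) amounts to toggling the two parameters named in the text; the snippet below is illustrative, and placing them in a configuration header is an assumption rather than the exact DBx1000 code.

```cpp
// NP-NoInsert (set both to true for NP-Insert):
#define TPCC_INSERT_ROWS  false
#define TPCC_INSERT_INDEX false
```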
Figure 17: TPC-C-NP-Insert (full scale). (a) Warehouse count = 1. (b) Warehouse count = 4. (c) Warehouse count = thread count.

Figure 18: TPC-C-NP-Insert (small scale). (a) Warehouse count = 1. (b) Warehouse count = 4. (c) Warehouse count = thread count.


Figure 19: Breakdown of TPC-C-NP-Insert, 224 threads, 224 warehouses, NewOrder and Payment transactions. (Relative latency breakdown into search, update, insert, commit, and other work.)

Figure 20: Varying warehouse count, NP-NoInsert. (CCBench with 224 threads, DBx1000 with 224 threads, DBx1000(C) with 96 threads.)

Figure 21: Varying warehouse count, NP-Insert. (Same systems and thread counts as Fig. 20.)

Impact of the Index on Performance: The difference in indexes produces a difference in performance; in theory, the access cost of a tree index is higher than that of a hash index. To understand the impact of the Masstree index on performance, we measured the latency breakdown of the transactions in the NP-Insert workload of CCBench with 224 threads and 224 warehouses using the Linux perf tool. As shown in Fig. 19, 68.4% of the New-Order execution time and 58.9% of the Payment execution time were spent on search, update, and insert. These operations need to find a record and thus require frequent Masstree traversals. Because a Masstree node is a few hundred bytes, traversal decreases the spatial locality of memory accesses, so cache misses tend to increase. NP can be executed using hash indexes; in such a case, the fraction of execution time spent on search, update, and insert would be significantly reduced, and the performance of NP would improve.

A.3.2 Varying Warehouse Count
We evaluated NP-NoInsert and NP-Insert while varying the warehouse count. The settings are the same as those in Appendix A.3.1. The results for NP-NoInsert and NP-Insert are shown in Figs. 20 and 21, respectively. All results for DBx1000(C) were measured with 96 threads due to errors. The results show that all three systems tended to scale. For NP-NoInsert, DBx1000 outperformed both DBx1000(C) and CCBench, as shown in Fig. 20. A similar experiment is shown in Fig. 5 of the TicToc study [77]; it exhibited scalability, which is consistent with this result. For NP-Insert, DBx1000 underperformed both DBx1000(C) and CCBench, as shown in Fig. 21.

A.3.3 Difference between NP and Full-mix
The transactions of TPC-C include Order-Status, Delivery, and Stock-Level in addition to New-Order and Payment. We did not obtain full-mix results in this experiment. Under a full-mix workload, which includes all of these transactions, performance deteriorates to a fraction of that of NP. Figs. 4 and 5 of the original Cicada study [47] show the results of full-mix and NP, respectively; the result in Fig. 4 was approximately 1/3 of that in Fig. 5. In TPC-C, the New-Order and Payment transactions account for 88% of the mix, and the remaining three account for only 12%. The reason for the performance degradation is that the Delivery transaction contains many record accesses compared with the New-Order and Payment transactions. Owing to the existence of such long transactions, many conflicts will occur when there are few warehouses, which would deteriorate performance. For many warehouses, we expect the abort ratio to be low because Kimura [45] determined that the ratio was 0.12% when the warehouse count was equal to the thread count. The effect of such long transactions was described in §7.2 in the context of version lifetime management. The issues related to long transactions have rarely been discussed thus far and remain a research problem.
B. EVALUATION OF A TPC-C FULL-MIX
The TPC-C full-mix consists of five transactions, namely New-Order, Payment, Delivery, Stock-Level, and Order-Status. Their percentages are 45, 43, 4, 4, and 4%, respectively. We re-implemented Silo for the TPC-C full-mix in CCBench with our own implementation of Masstree. Currently, the code for this version of CCBench is not publicly available at our repository [5]; however, we will release it soon.

B.1 Design
We prepared nine tables that are accessed using primary keys: WAREHOUSE, CUSTOMER, NEW-ORDER, DISTRICT, ORDER, ORDERLINE, ITEM, STOCK, and HISTORY. The HISTORY table uses surrogate keys, which are generated in a scalable manner: we provide two bytes for the worker thread ID and an increasing per-thread counter for the rest of the key to avoid contention.

We prepared secondary tables for CUSTOMER and ORDER. For the CUSTOMER secondary table, the key consists of the warehouse ID, the district ID, and c_last, and the value consists of c_first and the list of primary keys. For the ORDER secondary table, the key consists of the warehouse ID, the district ID, the customer ID, and the order ID, and the value is the primary key for the ORDER table.
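A minimal sketch of such a surrogate-key generator is shown below, assuming a 16-bit worker thread ID and a 48-bit per-thread counter packed into the 8-byte key; the exact packing in CCBench may differ.

```cpp
#include <cstdint>

// 8-byte surrogate key: [16-bit worker thread ID | 48-bit per-thread counter].
// Each worker advances only its own counter, so no cross-thread synchronization
// is needed and key generation does not become a contention point.
class HistoryKeyGenerator {
 public:
  explicit HistoryKeyGenerator(uint16_t thread_id) : thread_id_(thread_id) {}
  uint64_t next() {
    return (uint64_t(thread_id_) << 48) | (counter_++ & 0x0000FFFFFFFFFFFFULL);
  }
 private:
  uint16_t thread_id_;
  uint64_t counter_ = 0;
};
```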
B.2 Phantoms
In the TPC-C full-mix workload, the phantom problem can occur in the following cases.

• ORDER secondary table. Scan operations in the Order-Status transaction and insert operations in the New-Order transaction can occur concurrently.

• NEW-ORDER table. Scan operations in the Delivery transaction and insert operations in the New-Order transaction can occur concurrently.

However, the phantom problem does not occur in the following cases.

• ORDERLINE table in the Stock-Level transaction. The New-Order transaction inserts new records into the ORDERLINE table, and the Stock-Level transaction scans the ORDERLINE table. However, the range of keys scanned by the Stock-Level transaction differs from the keys inserted by the New-Order transaction: the New-Order transaction increments d_next_id and inserts a record with that key into the ORDERLINE table, while the Stock-Level transaction scans keys less than d_next_id from the ORDERLINE table. The key of the ORDERLINE table consists of the warehouse ID, district ID, order ID, and orderline ID.

• ORDERLINE table in the Order-Status transaction. The New-Order transaction inserts new records into the ORDERLINE table, and the Order-Status transaction scans the table. However, it scans records inserted by past New-Order transactions, not by concurrently running ones.

• ORDERLINE table in the Delivery transaction. The Delivery transaction processes the oldest record that has not yet been delivered. Therefore, the scan of the ORDERLINE table is executed over past records. The New-Order transaction inserts new records, so these scan and insert operations do not conflict.

• CUSTOMER secondary table in the Payment transaction. No transactions insert into the CUSTOMER table.

• CUSTOMER secondary table in the Order-Status transaction. No transactions insert into the CUSTOMER table.
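For the two cases above where phantoms can occur, one standard way to handle them under Silo is node-set validation, Silo's original phantom-avoidance technique. The sketch below illustrates it under the assumption of a Masstree-like per-node version counter; whether the CCBench full-mix implementation uses exactly this mechanism is not stated here, so treat it as a generic sketch rather than the actual code.

```cpp
#include <atomic>
#include <cstdint>
#include <utility>
#include <vector>

struct IndexNode {
  std::atomic<uint64_t> version{0};  // bumped when the node is split or a key is inserted
};

// During a scan, remember the version of every index node traversed; at
// validation time, a changed version signals a possible phantom and the
// transaction aborts.
class NodeSet {
 public:
  void record(const IndexNode& n) {
    observed_.emplace_back(&n, n.version.load(std::memory_order_acquire));
  }
  bool validate() const {
    for (const auto& [node, v] : observed_)
      if (node->version.load(std::memory_order_acquire) != v) return false;
    return true;  // no structural change: the scanned range saw no phantom
  }
 private:
  std::vector<std::pair<const IndexNode*, uint64_t>> observed_;
};
```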

B.3 Evaluation
We implemented a TPC-C full-mix that deals with phantom problems, following the scheme described in Section B.2. We evaluated the throughput and abort ratio while varying the thread count; the results are shown in Figures 22 and 23, respectively. Each line in these figures represents a different setting of the warehouse count (1, 7, 28, 56, 84, 112, 140, 168, 196, 224, and the same number as the thread count). Figure 22 shows that the throughput increases linearly with the thread count if the warehouse count is sufficient for the thread count. The performance curve deteriorates when the thread count exceeds 112 because our environment has only 112 physical cores among the 224 logical cores. As illustrated in Figure 23, the abort ratio increases if the warehouse count is insufficient for the thread count. This is because contention increases with the increasing number of threads that access the same warehouse.

As illustrated in Figure 24, with the number of worker threads fixed at 224, the throughput increases and the abort ratio decreases as the warehouse count increases. This is because contention, and thus the abort ratio, decreases as accesses are spread over more warehouses.

Figure 22: TPC-C full-mix: Throughput against thread count.

Figure 23: TPC-C full-mix: Abort ratio against thread count.

B.4 Summary of TPC-C
TPC-C is an important and realistic benchmark that includes phantom occurrences caused by concurrent scan and insert operations. CCBench currently supports the TPC-C benchmark only for the Silo protocol. This Appendix validated the correctness of its implementation by comparing CCBench with DBx1000 [7] and DBx1000(C) [8] for TPC-C NP, and its scalability for the TPC-C full-mix. The results of the experiments demonstrated that the implementation of CCBench is appropriate. In addition, different analysis platforms exhibited different behaviors, even for the same workload, depending on the design and implementation of the systems. We claim Insight 5 again: it is difficult to obtain deep knowledge using multiple reference implementations or platforms. We believe that CCBench will help such analyses and related research in the many-core architecture era.

An evaluation of the TPC-C full-mix for protocols other than Silo remains future work. For TicToc, Cicada, MOCC, and 2PL-NoWait, the technique presented in this Appendix can be used. For 2PL-Wait, next-key locking with Masstree can be used. For SI, no extension is necessary for phantom avoidance because each transaction under SI reads a fixed snapshot, so the phantom problem does not occur.

Acknowledgments
We thank Sho Nakazono for valuable advice on implementing the TPC-C full-mix benchmark. We thank Hideaki Kimura for valuable advice on transaction processing and for his encouragement. We are grateful to the anonymous associate editor, the reviewers, and the shepherd for their insightful comments and valuable feedback. We thank the DBx1000 team [7], the Cicada team [8], and the tpcc-runner team [23] for publishing their code. This work was supported by JST JPMJCR1414, JSPS 19H04117, and the New Energy and Industrial Technology Development Organization (NEDO).

Figure 24: TPC-C full-mix: Throughput and abort ratio with 224 threads.