
Asymmetric-Partition Replication for Highly Scalable Distributed Transaction Processing in Practice

Juchang Lee (SAP Labs Korea), Hyejeong Lee (SAP Labs Korea), Seongyun Ko (POSTECH), Kyu Hwan Kim (SAP Labs Korea), Mihnea Andrei (SAP Labs France), Friedrich Keller (SAP SE), Wook-Shin Han (POSTECH; corresponding author)

ABSTRACT

Database replication is widely known and used for high availability or load balancing in many practical database systems. In this paper, we show how a replication engine can be used for three important practical cases that have not previously been studied very well. The three practical use cases include: 1) scaling out OLTP/OLAP-mixed workloads with partitioned replicas, 2) efficiently maintaining a distributed secondary index for a partitioned table, and 3) efficiently implementing an online re-partitioning operation. All three use cases are crucial for enabling a high-performance shared-nothing system.

To support the three use cases more efficiently, we propose the concept of asymmetric-partition replication, so that replicas of a table can be independently partitioned regardless of whether or how its primary copy is partitioned. In addition, we propose the optimistic synchronous commit protocol, which avoids the expensive two-phase commit without sacrificing transactional consistency. The proposed asymmetric-partition replication and its optimized commit protocol are incorporated in the production versions of the SAP HANA in-memory database system. Through extensive experiments, we demonstrate the significant benefits that the proposed replication engine brings to the three use cases.

PVLDB Reference Format:
Juchang Lee, Hyejeong Lee, Seongyun Ko, Kyu Hwan Kim, Mihnea Andrei, Friedrich Keller, Wook-Shin Han. Asymmetric-Partition Replication for Highly Scalable Distributed Transaction Processing in Practice. PVLDB, 13(12): 3112-3124, 2020.
DOI: https://doi.org/10.14778/3415478.3415538

1. INTRODUCTION

Database replication is one of the most widely studied and used techniques in the database area. It has two major use cases: 1) achieving a higher degree of system availability and 2) achieving better scalability by distributing the incoming workloads to the multiple replicas. With geographically distributed replicas, replication can also help reduce query latency when a query accesses a local replica.

Beyond those well-known use cases, this paper shows how an advanced replication engine can serve three other practical use cases that have not been studied very well. In addition, in order to lay a solid foundation for such extended use cases of replication, we propose the novel notion of asymmetric-partition replication, where replicas of a table can be independently partitioned regardless of whether or how the primary copy of the table is partitioned. Note that, in this paper, the term table partition denotes a part (or a subset) of such a partitioned table.

Furthermore, to address a challenge encountered while applying asymmetric-partition replication to a practical system, we propose a novel optimization called optimistic synchronous commit that avoids the expensive two-phase commit protocol while preserving strict transactional consistency. This optimization is possible by fully exploiting the fact that the replica table is a derivable and reconstructable data structure based on the corresponding primary table data.

The three practical use cases are presented as follows.
1.1 Independently Partitioned Replica

First, when handling both OLTP and OLAP workloads in the same database system, asymmetric-partition replication enables the system to maintain an updateable primary copy of a table in a single, larger physical node, while its replica copies are independently partitioned and distributed across multiple smaller physical nodes, as shown in Figure 1. With this architecture, we can avoid the cross-partition two-phase commit for OLTP workloads, while serving partition-friendly OLAP workloads in a more scalable way with the multiple smaller replica nodes.

Figure 1: Scalable Processing of OLTP/OLAP-mixed workloads (Use Case 1). A non-partitioned primary table T1 on the primary DB node serves the OLTP workload, while its replica is split into partitions P1-P4 across multiple replica DB nodes serving the partitioned OLAP workload.

Figure 2: Partitioned Secondary Index with Asymmetric-Partition Replication (Use Case 2). Table T1 is partitioned by its primary key (PK), while its sub-table replicas, holding (SK1, Ref) and (SK2, Ref), are partitioned by the secondary keys SK1 and SK2, respectively.

A recent work with an academic prototype [17] suggested a similar replication architecture, called STAR. In the proposed replication architecture, partitions of a primary table are co-located in the same node while partitions of a replica table can be scattered across multiple nodes. However, it assumes that the primary and replica copies of a table are still partitioned in the same layout. Compared to that, our proposed asymmetric-partition replication allows the replica tables to be differently partitioned and differently distributed from the primary table. As a result, in our proposed architecture, the replica table can be re-partitioned and re-distributed without affecting the primary table's partitioning scheme.

1.2 Distributed Secondary Index

The second use case of asymmetric-partition replication is to efficiently maintain a secondary index for a partitioned and distributed table. When a table is partitioned by its primary key or a subset of it, it is relatively easy to identify a target partition for an incoming point query that includes the primary key predicate. Its partition pruning can be performed with the input parameter value of the predicate at either the client library or the server that initially received the query. However, when a point query needs to be evaluated with a secondary key, its target partition is not so obvious, since the table is partitioned by the primary key. That can force the query to access all the database nodes to identify the corresponding table partition.

Asymmetric-partition replication provides a convenient and efficient solution to the distributed secondary index management problem. A secondary index for a partitioned table can be modeled as a sub-table replica of its primary table. The sub-table replica can be considered a vertical sub-table replica, since it only needs to contain a few necessary columns from the primary table: the secondary key columns and a reference (source table partition ID and row ID) pointing to the corresponding primary table record. Subsequently, the sub-table replica can be partitioned by the secondary key itself, not necessarily with the same partitioning key as its primary table. This architecture is illustrated in Figure 2 with an example of two secondary indexes (for two secondary keys, SK1 and SK2).

One may implement the described distributed secondary index from scratch without relying on a replication engine, but it requires significant time and developer resources. Moreover, the proposed optimistic synchronous commit protocol enables elimination of the expensive two-phase commit cost when a change made at a primary table needs to be synchronized with its remote sub-table replica.

1.3 Online Table Re-partitioning

The third use case of asymmetric-partition replication involves re-partitioning an existing table online. To support dynamic workloads, the system needs to be able to adjust the partitioning scheme of an existing table dynamically. One solution uses a table-level exclusive lock, but this would block other concurrent DML transactions during the re-partitioning operation. Thus, the re-partitioning should be processed very efficiently.

Here again, asymmetric-partition replication provides a convenient and efficient solution. The online repartitioning operation can be logically mapped to a sequence of operations: 1) creating an online replica by copying the table snapshot, 2) replicating delta changes that are incoming to the original primary table during the snapshot copy operation, 3) transferring the primary ownership of the table to the new replica, and 4) dropping the original primary table in the background. In this procedure, asymmetric-partition replication allows the new replica to have a partition scheme different from that of its original table. Similar to the second use case, this type of online DDL can be implemented from scratch, but we can save development time and resources by leveraging asymmetric-partition replication.

1.4 Contribution and Paper Organization

The key contribution of this paper is as follows.

• We propose the novel notion of asymmetric-partition replication, so that replicas of a table can be independently partitioned regardless of whether or how the primary copy of the table is partitioned.

• We show how asymmetric-partition replication enables a scalable distributed database system - economically and efficiently. More specifically, we describe 1) how OLTP and OLAP workloads can efficiently be processed with the proposed replication architecture, 2) how the secondary key access query for a partitioned table can be accelerated in distributed systems, and 3) how the online repartitioning DDL operation can be implemented while minimizing its interference with other concurrent DML transactions.

• To overcome the challenge in the write performance overhead encountered while applying the replication engine to those practical use cases, we propose the optimistic synchronous commit protocol, which avoids the expensive two-phase commit protocol while preserving strict transactional consistency. The basic idea of the optimistic synchronous commit protocol has also been briefly discussed in [13], but we detail implementation issues in this paper (Section 4.1).

• The three use cases are studied with extensive experiments using a concrete implementation in SAP HANA.

The rest of this paper is organized as follows. In Section 2, we provide technical background for the work presented in this paper. In Section 3, we present the three practical use cases in more detail. Section 4 describes the optimizations implemented to more efficiently support the three use cases. Section 5 provides the experimental evaluation results. Section 6 gives an overview of related works. Section 7 discusses future works and Section 8 concludes the paper.

2. BACKGROUND

In this section, we briefly summarize the key characteristics of the SAP HANA distributed system architecture and HANA Table Replication, both of which lay the background for the work presented in this paper.

2.1 SAP HANA Distributed Database System Architecture

Figure 3 gives a simplified view of the SAP HANA distributed in-memory database, which generally follows the shared-nothing partitioned database architecture. For the purpose of increasing the total computation resource and/or the total in-memory database space, it exploits multiple independent database nodes which are inter-connected through a commodity network.

Figure 3: A simplified view of the HANA distributed in-memory database architecture. Each DB node holds tables or table partitions (for example, partitioned table T3 and replicated table T4) in its in-memory table space together with its own log and checkpoint storage, and applications connect through the DB client library.

The multiple database nodes belong to the same database schema whose tables can be distributed across the database nodes. In addition, a single table can be horizontally partitioned into multiple partitions, each of which contains a disjoint set of records of the table. The partitions of a table can be distributed across multiple database nodes. A partition key needs to be designated to partition the table, along with a partitioning function such as hash, range, or their combination. Typically, the primary key, or a subset of it, is chosen as the partition key of the table.

Regardless of how tables are partitioned and distributed, the multiple database nodes belong to the same transaction domain, ensuring strict ACID properties even for cross-node transactions. For this, two-phase commit and distributed snapshot isolation [14, 9] are incorporated. While queries can be routed to any of the database nodes by the client library directly, the client library finds an optimal target database node for a given query by looking up a part of its compiled query plan, which is transparently cached and refreshed at the client library. For a partitioned table, its partitioning function is also cached at the client library, so that the target table partition can be pruned at run time by evaluating a part of the query predicate. This client-side statement routing is described in more detail in [14]. When a query needs to involve multiple nodes, a server-side query execution engine coordinates the distributed query processing by exchanging intermediate query results.
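To make the client-side statement routing concrete, the following minimal sketch (in Python, with hypothetical names; not the actual SAP HANA client library API) illustrates how a cached partitioning function can be evaluated against a bound parameter value to prune the target partition and pick its hosting node.

# Minimal sketch of client-side partition pruning.
# Class and field names are assumptions for illustration only.
class CachedPartitionSpec:
    def __init__(self, key_column, num_partitions, partition_to_node):
        self.key_column = key_column                 # e.g., "Warehouse_ID"
        self.num_partitions = num_partitions         # e.g., 4
        self.partition_to_node = partition_to_node   # {partition_id: "host:port"}

    def prune(self, key_value):
        # Hash partitioning assumed here; range partitioning would compare
        # the value against cached range boundaries instead.
        partition_id = hash(key_value) % self.num_partitions
        return partition_id, self.partition_to_node[partition_id]

# Usage: route a point query using the spec cached at the client library.
spec = CachedPartitionSpec("Warehouse_ID", 4,
                           {0: "host1:30015", 1: "host2:30015",
                            2: "host3:30015", 3: "host4:30015"})
pid, node = spec.prune(key_value=42)
print("route to partition", pid, "on", node)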

2.2 HANA Table Replication

HANA Table Replication [15, 13] has been designed and implemented primarily for the purpose of enabling real-time reporting over operational data by minimizing the propagation delay between the primary and its replicas. Its key features are highlighted as follows.

First, it supports replicating a table or only a sub-table (a subset of columns of a table, for example) without necessarily replicating the entire database or the entire table contents. The replica of a table or sub-table can be located in one database node, while the primary copy of the table is located in another node.

Second, it supports cross-format replication with a storage-neutral logical log format. To maximize the performance of mixed OLTP/OLAP workloads, it enables creation of a column-oriented replica layout for a row-oriented primary table. For this, the replication log is defined in a logical format decoupled from its underlying storage layout.

Third, it uses record-wise value logging with a row ID (called RVID in [15, 13]) to maintain the mapping between a primary record and its replica record. To avoid the non-deterministic behavior of function logging, it captures record-level DML execution results. Together with the cross-format logical logging, it enables the replica table to have a different table structure (or partitioning scheme) from its primary table.

Fourth, in order to minimize the propagation delay between the source and the replica, HANA Table Replication employs a few novel optimizations such as lock-free parallel log replay with record-wise ordering and an early log shipping mechanism, which are explained in detail in [15, 13].

Finally, HANA Table Replication supports transaction-consistent online replica creation. By using the characteristics of multi-version concurrency control, HANA Table Replication allows other concurrent DML transactions to continue even while a new replica is being added online. Section 4.2 provides more detail about the online replica creation protocol.

3. ASYMMETRIC-PARTITION REPLICATION AND ITS PRACTICAL USE CASES

In this section, we describe how the presented HANA Table Replication is evolved and extended to asymmetric-partition replication and then present its three practical use cases in detail.

Table 1: Summary of Key Properties of Asymmetric-Partition Replication.

Property: Replication object granularity with table or sub-table.
Usefulness: Enables applying the asymmetric-partition replication only to the necessary tables and columns (Use Cases 1 to 3).

Property: Cross-format logical logging and record-wise value logging.
Usefulness: Enables the replica table structure to be decoupled from its primary table structure because the primary-to-replica relationship is maintained at record level (Use Cases 1 to 3).

Property: Lock-free parallel log replay.
Usefulness: Reduces the write transaction overhead by fast update propagation (Use Cases 1 and 2) or reduces the DDL elapsed time (Use Case 3).

Property: Two commit options with asynchronous commit and optimistic synchronous commit.
Usefulness: Enables avoiding the expensive two-phase commit (Use Cases 1 to 3).

Property: Online replica addition with MVCC.
Usefulness: Provides a basis for the lock-free repartitioning operation (Use Case 3).
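As an illustration of the cross-format, record-wise value logging listed in Table 1 (and described in Section 2.2), the following sketch shows one plausible shape of a logical DML replication log entry. The field names are our own assumptions for exposition and do not reflect the actual HANA log format.

from dataclasses import dataclass
from typing import Dict, Optional

# Hypothetical shape of a logical, storage-neutral DML replication log entry.
@dataclass
class DmlReplicationLogEntry:
    table_id: int                       # which (sub-)table the change belongs to
    operation: str                      # "INSERT", "UPDATE", or "DELETE"
    before_update_rvid: Optional[int]   # RVID of the overwritten record version (None for INSERT)
    new_rvid: Optional[int]             # RVID of the newly created record version (None for DELETE)
    column_values: Dict[str, object]    # captured record-level execution results, not the SQL text

# Example: an UPDATE captured as record-level results rather than as a
# (possibly non-deterministic) SQL statement.
entry = DmlReplicationLogEntry(
    table_id=42, operation="UPDATE",
    before_update_rvid=1001, new_rvid=1002,
    column_values={"VKONT": "C-7781", "STATUS": "OPEN"},
)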

3.1 Asymmetric-Partition Replication

Asymmetric-partition replication enables replicas of a table to be independently partitioned regardless of whether or how its primary table is partitioned. It is an extended form of HANA Table Replication which inherits and expands upon the beneficial properties of HANA Table Replication described in Section 2.2. Table 1 provides a summary of its key properties and how they contribute to an efficient and convenient implementation of the three practical use cases.

To extend HANA Table Replication to asymmetric-partition replication, the mapping between a primary-table record and its replica record is managed in a different way compared to HANA Table Replication. When a replica partition is initially populated, its initial snapshot is generated from the primary table by performing the partition pruning function, instead of physically copying the entire table image. When a DML operation occurs at the primary table, the corresponding DML replication log entry is generated and then its target replica partition is dynamically determined by performing the partition pruning function for the changed record. When a column value corresponding to the replica's partitioning key is updated at the primary table, a single DML operation at the primary can generate two DML log entries depending on the before-update and after-update values: a deletion log entry for one replica partition and an insertion log entry for another replica partition (a sketch of this log routing is given after the SQL example below).

Note that the asymmetric-partition replication implementation has been incorporated and officially available in production versions of SAP HANA since April 2018. The following is one example of a possible SQL statement that creates an asymmetric-partition replica with hash partitioning; the angle-bracketed parts are placeholders. In addition to hash partitioning, range, round robin, or their cascaded partitioning are also possible [4]. The choice of the partitioning scheme for the replica is totally independent from how its corresponding primary table is partitioned.

CREATE TABLE <replica table name>
LIKE <primary table name>
[ASYNCHRONOUS|SYNCHRONOUS] REPLICA PARTITION BY
HASH (<partitioning columns>)
PARTITIONS <number of partitions>
AT <replica node locations>;
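The following minimal sketch (our own illustration in Python, not SAP HANA code) shows how a primary-side DML can be turned into replication log entries routed by the replica's own partition pruning function, including the delete-plus-insert case when the replica partitioning key value changes.

# Sketch of routing a primary-side UPDATE to independently partitioned
# replica partitions (hypothetical helper names, for illustration only).
def replica_partition(partition_key_value, num_replica_partitions):
    # The replica's partition pruning function; hash partitioning assumed.
    return hash(partition_key_value) % num_replica_partitions

def route_update(before_row, after_row, replica_part_key, num_replica_partitions):
    """Return (target_partition, log_entry) pairs for one primary UPDATE."""
    old_part = replica_partition(before_row[replica_part_key], num_replica_partitions)
    new_part = replica_partition(after_row[replica_part_key], num_replica_partitions)
    if old_part == new_part:
        # Partitioning key value maps to the same replica partition:
        # a single update log entry is enough.
        return [(new_part, ("UPDATE", before_row, after_row))]
    # The replica partitioning key value changed: the record leaves one
    # replica partition and enters another, so emit delete + insert.
    return [(old_part, ("DELETE", before_row, None)),
            (new_part, ("INSERT", None, after_row))]

# Example: replica partitioned by Warehouse_ID into 4 partitions.
logs = route_update({"Warehouse_ID": 3, "qty": 5},
                    {"Warehouse_ID": 77, "qty": 5},
                    replica_part_key="Warehouse_ID", num_replica_partitions=4)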
3.2 Use Case 1: Partitioned Replication for Scaling Out Mixed OLTP/OLAP Workloads

The capability to handle both OLTP and OLAP workloads in the same database system is an attractive feature for modern database systems. Compared to traditional ETL-based OLAP systems, it can reduce the visibility delay between the source OLTP and target OLAP systems by eliminating the need for an intermediate application-driven ETL processing layer. In addition, it eliminates the application-side burden of maintaining consistency and compatibility between the OLTP and OLAP database schemas.

For such a database system that handles OLTP and OLAP workloads, asymmetric-partition replication offers an interesting optimization opportunity. While the updateable primary copy of a table is still maintained in a single larger physical node, its replicas can be independently partitioned and distributed across multiple smaller physical nodes, as already shown in Figure 1.

A similar replication architecture for mixed OLTP/OLAP workloads has already been proposed by the authors' earlier works [15, 13]. In the previous replication architecture, multiple replicas of a table can be created to scale out OLAP performance. In contrast, the architecture proposed in this paper can leverage multiple smaller replica nodes by horizontally partitioning a single logical replica table into multiple smaller sub-tables. While the replication architecture in [15, 13] requires each of the replica tables to maintain a full copy of the entire table contents, the partitioned replication architecture proposed in this paper enables a reduction of memory consumption at each replica and also a decrease in the required network resource consumption between the primary node and its replica nodes. Therefore, for a partitionable OLAP workload, the proposed asymmetric-partition replication becomes a practical option.

Moreover, when only a subset of a table (e.g., the most recent set of data in the table) is needed by the OLAP queries, asymmetric-partition replication enables the creation of replicas for the needed table partitions alone. Therefore, the proposed asymmetric-partition replication provides more flexibility in choosing the right configuration for the replica table. Note that the cross-format replication presented in [15, 13] (replicating from a row-store table to a column-store table) can be combined with the proposed asymmetric-partition replication. For example, the primary table can be formatted in a row-store non-partitioned layout while its replica table is formatted in a column-store partitioned layout.

Note that the proposed asymmetric-partition replication architecture does not exclude having multiple copies of a replica partition for a higher availability level.

Figure 4: Local secondary index for a table partitioned by its primary key. Table T1 = {A1, A2, ...} with primary key A1 is partitioned by A1 into three table partitions {P1, P2, P3} on nodes N1-N3, and the secondary index for A2 is also partitioned by A1, so a client query with a secondary key predicate on A2 must consult every node.

Figure 5: Independently partitioned secondary index for a table partitioned by its primary key. The same table T1 is partitioned by A1, while the distributed secondary index for A2 is partitioned by A2, so a client query with a secondary key predicate on A2 is routed to a single index partition first.

3.3 Use Case 2: Scalable Distributed Secondary Index for Partitioned Tables

For queries having a partition key predicate, the target database node that owns the matching records can be easily identified by performing the partitioning function. On the other hand, queries having a non-partition key predicate require a full table scan at all nodes where the table is distributed. To more efficiently handle such secondary key access queries, maintaining a distributed secondary index is necessary.

A straightforward way of maintaining such a distributed secondary index is to co-partition a table and its secondary indexes together using the same partitioning key and then to maintain a separate local index for each table partition. Figure 4 shows an example of the local secondary index. A database table T1 is partitioned into three partitions by taking the primary key (A1) as its partition key. Each of the three table partitions has its own local secondary index that spans the records belonging to the co-located table partition.

With this approach, the full table scan can be avoided, but all the database nodes still need to be accessed for the secondary key access query because the partition key and the secondary key do not match. As a result, as the number of nodes where the table is distributed increases, the cost of a secondary key access query grows proportionally both in network resource and CPU consumption. While the response time can be optimized by performing multiple local index accesses in parallel, the number of accessed nodes remains unchanged, and the resource cost remains high.

To overcome such a limitation, it might be desirable to partition the secondary index by its own secondary key independently from its base table. Figure 5 shows an example of an independently partitioned distributed secondary index. By early pruning of the target secondary key index partition (at step 1 in the figure), the number of accessed nodes for a secondary key access query can be reduced to one (when the secondary index lookup result points to a base record in the co-located table partition) or two (when the result points to a base record in a remote table partition), regardless of the number of table partitions and their distribution. However, this approach may need to pay a price on the write transaction side because an update at a table partition could involve updating a secondary index partition located in a remote node.
Asymmetric-partition replication makes implementing such a distributed secondary index more efficient and less costly to develop. The secondary index is created as a form of replica table which is independently partitioned from its primary table. Then, on a DML operation at a base table partition, the DML results are automatically propagated and applied to the corresponding secondary index partition (i.e., a replica partition) in a transaction-consistent way. Regarding how to generate and where to forward the DML replication log entries, it follows the same DML replication protocol of the asymmetric-partition replication, which was explained in Section 3.1. A sketch of the resulting secondary key read path is shown at the end of this subsection.

Such asymmetric-partition replication brings the following important performance benefits to the distributed secondary index implementation.

• Optimistic transaction commit: At the time of committing a transaction that updated both a local table partition and a remote secondary index partition, the optimistic synchronous commit ensures cross-node atomicity without relying on the expensive two-phase commit protocol. This commit protocol will be presented in more detail in Section 4.

• Fine-grained sub-table replication: The fine-grained sub-table replication capability enables replication of only the necessary columns for the secondary index and thus saves memory footprint and network bandwidth. At the secondary index partitions, only the secondary key columns and a reference (base table partition ID and row ID) to the original source record are maintained.

• Client-side statement routing: Combined with the client-side statement routing described in Section 2.1, the input value of the secondary key parameter of a query can be evaluated with the corresponding partition function cached at the client library. Then, the client library can early route the query to an optimal database node where the corresponding secondary index partition is located.

One may implement such a distributed secondary index by using SQL triggers and a separate table for maintaining the secondary key values and the references. However, this is subject to the performance overhead incurred by (1) the general-purpose SQL trigger and (2) the two-phase commit for cross-node write transactions. Another option is a more native and specialized implementation from scratch, but it requires non-trivial time and developer resources.
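To make the read path concrete, the following sketch (Python, with hypothetical names; not HANA's query processor code) walks through the two-step access that Figure 5 implies: prune the secondary index partition by the secondary key, read the (partition ID, row ID) reference, then fetch the base record from its owning partition.

# Sketch of the secondary key read path over a replication-based
# distributed secondary index (hypothetical helpers, for illustration).
def prune(key_value, num_partitions):
    return hash(key_value) % num_partitions      # hash partitioning assumed

def secondary_key_lookup(sk_value, index_partitions, base_partitions):
    # Step 1: route to exactly one secondary index partition by the secondary key.
    idx_part = index_partitions[prune(sk_value, len(index_partitions))]
    # Step 2: the index entry stores only (secondary key, base partition ID, row ID).
    refs = [(p, r) for (sk, p, r) in idx_part if sk == sk_value]
    # Step 3: fetch each referenced base record from its owning base partition;
    # at most one additional node is touched when the base record is remote.
    return [base_partitions[p][r] for (p, r) in refs]

# Example: build 2 index partitions from 2 base partitions, then look up.
base = [{10: {"A1": 10, "A2": "x"}}, {77: {"A1": 77, "A2": "x"}}]
index = [[], []]
for part_id, records in enumerate(base):
    for row_id, rec in records.items():
        index[prune(rec["A2"], len(index))].append((rec["A2"], part_id, row_id))
print(secondary_key_lookup("x", index, base))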

Figure 6: Online table repartitioning with asymmetric-partition replication. 0) Initial state; 1) populate the initial replica; 2) replicate the incoming delta in parallel with concurrent DMLs; 3) switch the primary role; 4) drop the original table in the background.

Figure 7: Replication protocol using optimistic synchronous commit (OSC-R). DML1 and DML2 are async-replicated to the replica (steps 1-3, with Query1 in between); at commit time the replica waits until DML replay completes and precommits (steps 4-5), a commit log is written (step 6), and a post-commit log is written afterwards (steps 7-8).

Figure 8: Replication protocol using a synchronous DML trigger and two-phase commit (2PC-R). Each DML incurs a synchronous round trip to the replica, and the commit writes a prepare-commit log, a commit log, and a post-commit log.

3.4 Use Case 3: Online Table Re-partitioning

As discussed in Section 1, the overall performance of a shared-nothing distributed database system can be significantly affected by how well the database is partitioned for a given workload. Considering the dynamic nature of many enterprise application workloads and also the ever-growing data volume, it is often required to continuously re-partition a given table. Under such a situation, it is desirable to support an efficient online re-partitioning operation that minimizes service downtime or interference with other concurrent normal transactions during the re-partitioning operation. Relying on a table-level exclusive lock during the re-partitioning operation cannot meet such a demand.

Asymmetric-partition replication offers a beneficial option for implementing such an online re-partitioning operation by exploiting the MVCC-based online replica addition protocol, instead of relying on a long-duration table lock. The online repartitioning operation can be logically mapped to a sequence of operations, as illustrated in Figure 6 and sketched at the end of this subsection. First, an online replica is initially populated by copying the table snapshot (step 1). Second, delta changes that are incoming during step 1 are asynchronously replicated by replication log entries (step 2). Third, once step 1 is completed, a synchronization point between the primary and its replica is made and then the primary ownership of the table is transferred to the new replica (step 3). After that, the original primary table is dropped asynchronously in the background (step 4). In this procedure, asymmetric-partition replication allows the new replica to have a partition scheme different from that of its original table.

Like the second use case, the described online re-partitioning operation can be implemented by a more native and specialized implementation from scratch, but it requires non-trivial time and developer resources. Note that the other types of online DDL operations that require copying the table data (for example, conversion from a row-oriented table format to a column-oriented table format) can benefit from such an online replica creation protocol in general.
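The following orchestration sketch (Python with hypothetical engine calls; not the actual SAP HANA implementation) mirrors the four logical steps above, showing where the asynchronous delta replication and the ownership switch fit.

# Sketch of online re-partitioning driven by the replication engine.
# All engine_* calls are hypothetical placeholders for illustration.
def online_repartition(engine, table, new_partition_spec):
    # Step 1: create an online replica with the new partitioning scheme and
    # populate it from a consistent snapshot of the primary table (lock-free).
    replica = engine.create_replica(table, partition_spec=new_partition_spec)
    engine.copy_snapshot(table, replica)

    # Step 2: delta changes arriving during the snapshot copy are replicated
    # asynchronously via the replication log (conceptually concurrent with step 1).
    engine.replay_pending_logs(replica)

    # Step 3: synchronize primary and replica, then switch the primary role
    # of the table to the newly partitioned replica.
    engine.wait_until_in_sync(table, replica)
    engine.switch_primary(table, replica)

    # Step 4: drop the original (now obsolete) primary table in the background.
    engine.drop_async(table)
    return replica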
4. OPTIMIZATIONS

In this section, we present two key optimizations that make asymmetric-partition replication more efficient for the three use cases.

4.1 Optimistic Synchronous Commit Protocol

The optimistic synchronous commit protocol, called OSC-R in this paper, takes a hybrid approach by combining asynchronous DML replication with synchronous transaction commit. A DML log entry is generated at DML time, but its propagation to the replica is initiated asynchronously without affecting the DML response time. After waiting until all its previous DML log entries are successfully applied to the replicas, the transaction proceeds to the commit processing. To highlight the performance gain from OSC-R, we compare OSC-R with an alternative replication protocol that is based on synchronous DML and two-phase commit, called 2PC-R. Figure 7 and Figure 8 show the OSC-R and 2PC-R protocols, respectively.

Under OSC-R, replication does not affect the DML response time, as in steps 1 and 2 of Figure 7, while each DML involves a synchronous network round trip to the replica in 2PC-R (steps 1 and 2 in Figure 8).

At the transaction commit phase, in contrast to 2PC-R, which incurs two network round trips (Steps 4 and 7) and three log I/O operations (Steps 5, 6, and 8), OSC-R can commit the transaction immediately after one network round trip (Step 4) and one log I/O operation (Step 6), both of which are interleaved with each other.

Such an optimistic optimization is possible by leveraging the fact that a replica is basically a data structure derived from its primary table. When a failure occurs in the middle of the OSC-R transaction's commit phase, there is still a chance that the replica can be recovered to the latest committed state by re-synchronizing with the primary. For the same reason, the recovery redo log and the post-commit log entries generated at the replica can be persisted asynchronously (Step 8). The replica recovery protocol for HANA Table Replication is described in more detail in Section 4.1 of [13].

The transaction commit at OSC-R can proceed only after waiting until its previous DML replay operations complete (Step 5 in Figure 7). However, for multi-statement transactions, which are quite common in many enterprise applications, there is a high chance that the asynchronous replication of an earlier DML statement overlaps with the next statement execution at the primary transaction, and thus the delay at Step 5 is minimized.

Differently from the commit protocol under the lazy replication presented in [15, 13], OSC-R must ensure visibility atomicity of the changes made by the primary transaction and its corresponding replayer transactions. Under the lazy replication, if a database client accesses a replica node after it receives the commit acknowledgement of its primary transaction, then its written changes may not yet be visible at the replica node. To ensure visibility atomicity across the primary and its replicas at OSC-R, however, when a concurrent query tries to access a replica record version in precommitted state after Step 5 and before Step 8 (Figure 7), the access to the record version is postponed until the state of the record version is finally determined to be either post-committed or aborted by Step 7 at the primary node and then informed to the replica node by Step 8.
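As a rough illustration of the commit-phase interplay described above (and of Figure 7), the following sketch outlines the primary-side commit under OSC-R; the helper names are assumptions for exposition, not the actual HANA internals.

# Rough sketch of commit under OSC-R (hypothetical helpers, primary's view).
# DML log entries were already shipped to the replica asynchronously.
def osc_r_commit(txn, replica):
    # One network round trip (Step 4): ask the replica to precommit. On the
    # replica side this waits until the transaction's DML log entries are
    # replayed (Step 5) and marks the new record versions as precommitted.
    precommit_future = replica.request_precommit(txn.id, txn.dml_log_entries)

    # One synchronous log I/O (Step 6), interleaved with the round trip above;
    # there is no separate prepare/vote phase as in two-phase commit.
    txn.write_commit_log()

    precommit_future.wait()
    txn.ack_client()          # the transaction is committed at this point

    # Post-commit (Steps 7-8): resolve the precommitted replica versions to
    # committed; the replica persists its redo/post-commit logs lazily.
    replica.postcommit_async(txn.id)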
isolation transaction for the replica initializer, the MVCC Figure 7 omitted the transaction abort scenario, but it can version space can keep growing during Step 3 in order to be explained as follows. If the primary transaction aborts maintain a consistent set of record versions as of the acquired after sending the precommit request to the replica (for ex- snapshot timestamp. However, by using the hybrid garbage ample, by a failure while writing a commit log to the disk at collection technique [16] implemented in SAP HANA, the Step 6), then the corresponding replica record versions are space overhead at the version space is effectively minimized. marked as aborted by Step 8. And thus, those replica record The operation of copying the initial table snapshot (Step versions are ignored by the pending concurrent query. The 3) and the DML replication for the delta changes (Step 3’) overall protocol for reading the replica record versions under are performed in parallel, as illustrated in Figure 9. This OSC-R is provided by Algorithm 1. parallel processing does not cause any consistency issue be- cause both Step 3 and Step 3’ follow the record-wise order- Algorithm 1 Read a replica record version under OSC-R ing scheme based on RVID (record version ID) [13]. RVID is an unique identifier of a record version in a table. A new Require: A query, Q. RVID value is assigned when a new record version is cre- Require: A replica record version, V , found by evaluating ated. Using the RVID management, before applying a DML the search condition of Q. log entry to the replica, the log replayer checks whether the 1: if V .State = precommitted then RVID value of the target replica record equals to the Before- 2: Wait until V .State gets updated. Update RVID of the propagated DML log entry. If not, the 3: end if DML log replay operation is postponed since it means that 4: if V .State = committed AND an earlier change is not yet applied to the replica. This V .Timestamp <= Q.Timestamp then RVID-based record-wise ordering is described in more de- 5: return V .Value. tail in Section 3.3 of [13]. At Step 3’, the delta changes 6: else are asynchronously propagated to the replica and thus the 7: return NULL. other concurrent DML transactions do not need to wait for 8: end if its change propagation to the replica. Finally, at Step 4, once Step 3 is completed, the replica starts accepting new queries after waiting until the pre- One potential drawback of OSC-R compared to 2PC-R is viously generated delta changes are being applied to the that the uncommitted changes made by a transaction might replica - to make the replica state up-to-date. When creating not yet be visible if the same transaction tries to read its a lazy-mode replica, the replica can already start accepting own uncommitted changes at a replica - by the consequence new queries as soon as Step 3 is completed. of asynchronous DML replication. This read-own-write can

Figure 9: MVCC-based online replica creation. 1: Activate replication logging for the table. 2: Acquire the transaction snapshot timestamp. 3: Retrieve a table snapshot based on the snapshot timestamp acquired at step 2 and apply it to the replica. 3': On any new change at the primary table, generate its replication log and apply it to the replica. 4: Mark the replica as ACTIVE; the replica then starts accepting new queries.

5. EXPERIMENTAL EVALUATION

To evaluate the performance of the proposed architectures, we conduct a number of experiments with a development version of SAP HANA. The tested SAP HANA system is configured as a distributed system which consists of up to nine database nodes. Each of them has 60 physical CPU cores (120 logical cores with hyper-threading enabled), 1TB of main memory and four 10Gbit NICs. They are interconnected by a commodity network within the same local data center.

5.1 Partitioned Replication for OLTP/OLAP-mixed Workloads

To show the benefit of asymmetric-partition replication for mixed OLTP/OLAP workloads, we tested its performance with the TPC-CH benchmark, which was initially used in [20]. The benchmark generates both TPC-C and TPC-H workloads simultaneously over the same data set consisting of the nine TPC-C tables and three other tables used exclusively by TPC-H (NATION, REGION, and SUPPLIER). For those tables, the initial data were populated with 100 warehouses as in [20].

The test system consists of one primary node and up to eight replica nodes. At the primary node, all 12 tables are created and placed without any partitioning. Each of those 12 primary tables creates its replica table. Among them, the eight tables which have the Warehouse_ID column create their replicas in a partitioned form by the Warehouse_ID column. For the other four tables (the three TPC-H tables and the ITEM table), a full replica table is created for each replica node. For example, for the ORDERLINE table, we created its range-partitioned asymmetric replica using the DDL below. The number of partitions is set to the number of the used replica nodes in each experiment.

CREATE TABLE REP.ORDERLINE LIKE
SRC.ORDERLINE ASYNCHRONOUS REPLICA
PARTITION BY RANGE (Warehouse_ID)
(PARTITION 1 <= VALUES < 26,
PARTITION 26 <= VALUES < 51,
PARTITION 51 <= VALUES < 76,
PARTITION 76 <= VALUES < 101)
AT 'host1:port1', 'host2:port2',
'host3:port3', 'host4:port4';

TPC-H read-only queries are distributed to the replica nodes simultaneously, while TPC-C read/write transactions are directly routed to the primary node. To evenly distribute the TPC-H queries over the four replica nodes, a query predicate (Warehouse_ID = ?) is added to each of them. The TPC-C workload is generated by 30 concurrent clients in total and the TPC-H workload is generated by 90 concurrent clients per replica node. Each of the TPC-H clients repeats execution of the 22 TPC-H queries serially during the test.

Figure 10: TPC-CH performance with asymmetric-partition replication (normalized OLTP tps and OLAP qps over 1, 2, 4, and 8 replica partitions).

Figure 10 shows the experimental result. The TPC-H throughput values are normalized by the 1-replica TPC-H throughput. The TPC-C throughput values are normalized by the 1-replica TPC-C throughput. As the number of replica nodes (and thus the number of the corresponding replica partitions) increases, the total aggregated processing throughput of TPC-H (OLAP qps) increases almost scalably. Meanwhile, the overhead imposed on the write transactions is kept at a minimum, as the TPC-C processing throughput in the result shows (OLTP tps). Note that the TPC-C throughput does not increase with the increasing number of replica nodes in this experiment, because the number of primary nodes remains unchanged and the TPC-C transactions are directed to the primary nodes only.

The plain forms of symmetric replication show nearly the same results as Figure 10. However, they impose other overheads. For example, in one form of symmetric replication, both the primary and its replica tables can be created as non-partitioned tables and each replica node can maintain a full copy of the primary table. Then, such a symmetric replication configuration requires additional memory consumption at each replica node and additional network resource consumption between the primary and its replica nodes. As an alternative under symmetric replication, the primary table can be partitioned by the same partitioning scheme as its replica table and the partitions of the primary table can be co-located within the same single primary node. However, this alternative also adds a disadvantage in that the primary table must be re-partitioned whenever the partitioning scheme of the replica table changes. In the proposed asymmetric-partition replication, the replica table can be re-partitioned without necessarily affecting the partitioning scheme of the corresponding primary table.

5.2 Distributed Secondary Index

To show the benefit of the proposed replication-based distributed secondary index implementation, we extracted a real-application table and its involved queries from a real enterprise financial application workload [5]. The table, called DFKKOP, represents items in contract account documents.

There are in total 170 columns and its primary key consists of five columns among them. The table is hash-partitioned on a column called OPBEL (representing a contract document ID), which is one of the five primary key columns. The initially populated table size is about 56GB and it was evenly distributed across the database nodes by the hash-partitioning function. The table has a secondary index on a column called VKONT (representing a contract account ID), which does not belong to the primary key column set.

The experiment consists of two parts. One is to measure the improvement in the secondary key search performance and the other is to measure the overhead incurred at the write transaction. In both measurements, we compared the performance of the proposed replication-based distributed secondary index implementation (DSI) with that of a local secondary index implementation (LSI). At LSI, the secondary index is co-partitioned with the base table by the same partitioning key, OPBEL. At DSI, the secondary index is partitioned by the VKONT column.

In DSI, the following DDL creates the partitioned and distributed sub-table replicas that act as secondary indexes for the base table (DFKKOP).

CREATE TABLE DFKKOP_REPLICA LIKE
DFKKOP SYNCHRONOUS REPLICA (VKONT)
PARTITION BY HASH (VKONT)
PARTITIONS 4 AT 'host1:port1', 'host2:port2',
'host3:port3', 'host4:port4';

Note that the secondary index can also be created as a range-partitioned replica to support range queries. As the DDL indicates, the replica is created as a "synchronous" replica to ensure atomic transactional visibility with its source table. Also, only the necessary secondary key column (VKONT) is stored in the replica, together with two internal columns ($reference_partid$ and $reference_rowid$) that are used for referring to the corresponding base record.

5.2.1 Search Performance

First, we measure the secondary key search performance with the following secondary key access query chosen from the captured real-application workload. The query returns a record for a given contract account ID with an additional constant predicate for a certain category and status. We omit the details of the constant predicate because it is specific to the application semantics and did not affect the experimental result significantly.

SELECT * FROM DFKKOP WHERE VKONT = ? AND ...;

Note that, since the necessary integration at the query processor for the distributed secondary index was not yet available at the time of writing this paper, we revise the test query into two consecutive queries for the DSI experiment as follows. The first query finds one or more references to the base records by looking up the replica table (DFKKOP_REPLICA). The second query retrieves the target base record by giving the $reference_rowid$ value as its second parameter after finding the target table partition by giving the retrieved $reference_partid$ value as its first parameter (PARTITION(?)). $rowid$ (a 64-bit numeric integer) denotes a unique identifier of a record within a table.

SELECT $reference_partid$, $reference_rowid$
FROM DFKKOP_REPLICA WHERE VKONT = ?;
SELECT * FROM DFKKOP PARTITION(?)
WHERE $rowid$ = ? AND ...;

Figure 11: Multi-client scalability of secondary key search throughput (4 database nodes), comparing LSI and DSI for 8 to 128 clients.

Figure 12: Multi-node scalability of secondary key search throughput (16 clients per node), comparing LSI and DSI for 2 to 8 nodes.

Figure 11 shows the multi-client scalability of the secondary key search performance by varying the number of clients from 8 to 128. The number of database nodes is fixed to 4. The query throughput is normalized by that of the 8-client LSI case. In the result, DSI shows a scalable throughput with the increasing number of clients and outperforms LSI significantly, by factors of 1.5 to 3.1. While the secondary key access query at DSI involves accessing at most two database nodes regardless of the total number of the database nodes, the query at LSI accessed all the database nodes, including nodes where no matching record exists. Considering the fact that DSI involves two separate queries, it is expected that the gain would increase further with the proper integration into the SQL processor in the future.

Figure 12 shows the multi-node scalability with the number of clients fixed to 16 per node. The query throughput is normalized by that of the 2-node LSI case. Similarly to the previous experiment, DSI scales better than LSI as the number of database nodes increases and outperforms LSI significantly, by factors of 2.2 to 2.5. It is worth noting that LSI exhibited 1.4 to 3.8 times higher CPU usage than DSI in the experiment.

5.2.2 Write Performance

In the second part of the experiment, we measure the performance overhead added to the write transaction. For this purpose, we use the following write transaction that involves 5 insert statements to the DFKKOP table.

INSERT INTO DFKKOP VALUES ...;
// insert 5 different records
SELECT COUNT(*) FROM DFKKOP WHERE OPBEL = ?;
COMMIT;

Figure 13: Multi-client scalability of write transaction throughput (4 database nodes), comparing LSI, DSI-2PC, and DSI-OSC.

Figure 14: Multi-node scalability of write transaction throughput (16 clients per node), comparing LSI, DSI-2PC, and DSI-OSC.

Figure 15: Average response time of the DML operation for LSI, DSI-2PC, and DSI-OSC.

Figure 16: Average response time of the commit operation for LSI, DSI-2PC, and DSI-OSC.

To highlight the performance benefit of the DSI implementation that uses the optimistic synchronous commit protocol, we compared three approaches: (1) LSI, (2) DSI with a synchronous trigger and two-phase commit (DSI-2PC), and (3) DSI with asymmetric-partition replication and the optimistic synchronous commit protocol (DSI-OSC).

Figure 13 and Figure 14 respectively show the multi-client scalability and the multi-node scalability of the above write transaction. The throughput in Figure 13 is normalized by that of the 8-client LSI case. The throughput in Figure 14 is normalized by that of the 2-node LSI case. In Figure 13, DSI-OSC scales comparably to LSI with the increasing number of clients and eventually outperforms DSI-2PC significantly. In Figure 14, DSI-OSC again scales similarly to LSI with the increasing number of database nodes and eventually outperforms DSI-2PC significantly.

For a more detailed analysis of the write transaction performance, we measured the average response time of the DML and commit operations, as shown in Figure 15 and Figure 16, respectively. The number of database nodes varies from two to eight with the number of clients fixed to 16 per node.

Figure 15 shows that the DML response time of DSI-2PC takes on average 1.9 times longer than that of DSI-OSC. This is because DSI-2PC is implemented based on the synchronous SQL trigger, and thus a DML operation can involve cross-node communication if the target secondary key index entry is located at a remote node.

Figure 16 shows that LSI yields the shortest commit response time, as expected, because it accesses only a single database node and performs a local commit. Compared to that, while DSI-2PC shows a significantly longer response time than LSI, DSI-OSC adds relatively small overhead to the transaction commit operation. This is because the optimistic synchronous commit enables DSI-OSC to avoid the expensive two-phase commit. Compared to LSI, DSI-OSC still shows a slightly higher response time because the primary transaction can commit only after confirming that all its DML operations are successfully applied to the in-memory replicas, in order to ensure atomic transaction visibility as described in Section 4.

Figure 17: Impact of online repartitioning on TPC-C throughput (normalized throughput over time for DDL-APR, DDL-Lock, and DDL-Lock-Opti).

5.3 Online Table Repartitioning

To show the benefit of the proposed online table repartitioning implementation, we compare three alternative approaches: (1) online table repartitioning based on asymmetric-partition replication (DDL-APR); (2) table repartitioning with a long-duration table lock (DDL-Lock); (3) table repartitioning with a long-duration table lock but a specialized optimization (DDL-Lock-Opti). At DDL-Lock, the exclusive table lock is acquired during the entire table repartitioning operation. While both DDL-APR and DDL-Lock perform MVCC-based record-wise snapshot retrieval, DDL-Lock-Opti copies the physical full table image at once by exploiting the fact that there is no other concurrent DML operation during the table copy operation due to the long-duration table lock acquisition.

To see how those three DDL implementations affect the other concurrent DML transactions, we perform the online repartitioning DDL while continuously running the TPC-C workload on the same system. The initial TPC-C database is populated with 100 warehouses and the TPC-C workload is generated by 30 concurrent clients, consistently with the experiment of Section 5.1. The online repartitioning DDL is performed against the CUSTOMER table, which is updated by the TPC-C Payment and Delivery transactions. The CUSTOMER table is initially formatted as a non-partitioned table and then it is partitioned into four range partitions online.

Figure 17 shows the experiment result. The vertical axis represents the throughput of the TPC-C transactions, normalized by that measured while the DDL operation is not yet involved. In all of the three approaches, the repartitioning DDL is performed 30 seconds after the high-load phase of the TPC-C workload starts.

In DDL-Lock and DDL-Lock-Opti, during the DDL execution, the TPC-C performance dropped to 0 because all the TPC-C transactions eventually get blocked by the long-term exclusive table lock acquired by the online re-partitioning operation. Compared to that, DDL-APR shows almost non-blocking behavior except for small performance drops at two time points: one when starting the DDL operation (at time point 30) and another when completing the DDL operation (at time point 120). At those two points, the current implementation in SAP HANA relies on a short-term, instant-duration table lock when acquiring the transaction snapshot timestamp and when transferring the primary ownership to the new replica. As already explained in Section 4.2, the first short-term table lock is removable in the future by additional implementation.

Remark that the overall elapsed time of DDL-APR increased compared to DDL-Lock or DDL-Lock-Opti. One reason is that, in DDL-APR, the TPC-C clients keep generating new DML changes and thus the total amount of data that needs to be applied to the new table increases compared to DDL-Lock or DDL-Lock-Opti. In DDL-Lock-Opti, the overall blocking period is shortened significantly compared to DDL-Lock by applying the table-level physical image copy operation. However, this optimization is applicable only when the exclusive lock is acquired during the table copy operation. Considering that such a repartitioning DDL operation in practical systems is typically performed in the background, it is more important to reduce the interference to the normal DML transactions than to reduce the elapsed time of the DDL operation itself.

6. RELATED WORK

6.1 Asymmetric-Partitioned Replication

Conventional logical replication exploits replication of logical statements from the primary to its replica systems, and thus it enables a replica table to be independently structured from its primary table. Such logical replication has already been widely used by academic or industrial systems. However, to the best of our knowledge, none of those previous works exactly matches the concept of the asymmetric-partition replication proposed in this paper.

For example, MySQL allows a table on the master to have more or fewer columns than the slave's copy of the table. However, in MySQL, "replication between tables having different partitioning is generally not supported" [2]. It is also known that the logical replication in MySQL allows a setup where the historical cold data of a table is replicated but no longer deleted at the replica side. It can be seen as a special form of asymmetric-partition replication because the primary table has less data or fewer partitions than its replica table, which holds both recent (hot) and historical (cold) data. However, such a configuration is enabled simply by turning off the replication log generation for a particular database session that executes the delete operations for the historical data. It does not match the generic form of the asymmetric-partition replication proposed by our paper.

[18] proposed another type of logical replication, called BatchDB. OLTP and OLAP replicas can have different storage layouts (a row-oriented format for OLTP and a column-oriented format for OLAP) to efficiently handle hybrid OLTP and OLAP workloads. In addition, it focuses only on lazy (asynchronous) replication. Compared to [18], our paper explores how logical replication can evolve into the concept of asymmetric-partition replication and how synchronous replication can be further optimized without sacrificing transactional consistency.

6.2 Distributed Secondary Index

The technique of implementing a distributed secondary index by using a separate system table is not new. Our contribution is not about using a separate table for the implementation of a distributed secondary index, but about using an advanced form of replication engine for an economical and efficient implementation of a distributed secondary index in a practical commercial DBMS. In addition, we addressed the challenge of the node-to-node write synchronization overhead by proposing the optimistic synchronous commit protocol without sacrificing any transactional consistency, as shown in Section 4.1.

For example, [12] discusses two secondary index approaches in the context of a distributed NoSQL key-value store. One is to store an inverted list in a separate system table and the other is to co-locate data and its local secondary index. The first approach is equivalent to what we presented as the distributed secondary index and the second to the local secondary index in Section 3.3. [12] shares the same conclusion with our study: the first, table-based approach incurs higher communication overhead between nodes on write operations while the second, co-location approach incurs development cost because it has to be implemented from scratch. However, while [12] does not further explore how the write overhead is addressed, our work proposes a novel optimistic synchronous replication protocol as in Section 4.1.

6.3 Online DDL

Many cloud DBMSs exploit replication for the purpose of online upgrades or online database migration [19]. Compared to those approaches, we propose exploiting the table replication engine for online DDL operations performed on a particular table.

For online DDL or online schema change operations, various techniques are already known. For example, Google F1 [21] proposes a new protocol for online and asynchronous schema evolution. In order to guarantee database consistency, it transforms a schema change operation into multiple steps of consistency-preserving schema changes. PRISM/PRISM++ [11, 10] presents a special language for schema change operations, tools to evaluate the effect of changes, and a method. ControVol [22] develops a framework for controlled schema changes for NoSQL applications.

In contrast to those prior works, we showed that online DDL can be implemented economically and efficiently by extending a replication engine and utilizing the characteristics of MVCC.

7. DISCUSSION AND FUTURE WORK

We expect that our work can be further extended by contributions from academia in the following aspects. First, it is desirable in practice to automate the decisions of which tables need to be replicated for a given workload, how many replicas are needed, and how the replica tables need to be partitioned. Since replication itself pays a cost in additional resource consumption, it is not always trivial to make a decision that maximizes the added value over the paid cost. Second, the presented optimistic synchronous commit protocol can be further generalized to other scenarios that require transactional consistency between base data and its derived remote data, beyond the scope of replication. For example, tables that require transactional consistency with each other through triggers or cascaded-update foreign-key constraints present another practical use case. Third, the proposed asymmetric-partition replication and the optimistic synchronous commit protocol can be extended to support multi-master replication, for example by adding a conflict detection and resolution mechanism for the changes made at different master replica nodes.
8. CONCLUSION

In this paper, we presented the novel concept of an asymmetric-partition replication engine as a foundation for serving three important practical use cases in distributed database systems. In asymmetric-partition replication, replicas of a table can be independently partitioned regardless of whether or how its primary copy is partitioned. The three practical use cases include (1) scaling out OLTP/OLAP-mixed workloads with partitioned replicas, (2) efficiently maintaining a distributed secondary index for a partitioned table, and (3) efficiently implementing the online re-partitioning operation. In addition, to address the challenge of the node-to-node write synchronization overhead in asymmetric-partition replication, we proposed the optimistic synchronous commit protocol, which fully exploits the fact that a replica is a data structure derivable and reconstructable from the corresponding primary table data. The proposed asymmetric-partition replication and its optimizations are incorporated in the SAP HANA in-memory database system. Through extensive experiments, we demonstrated the significant benefits that the proposed replication engine brings to those three practical use cases.

Overall, we believe that our work reveals that database replication can serve even more practical use cases, beyond the traditional ones, in modern distributed database systems.

9. ACKNOWLEDGMENTS

The authors would like to thank the entire SAP HANA development team for providing the solid foundation for the work presented in this paper. We would especially like to express our appreciation to Eunsang Kim, Hyoung Jun Na, Chang Gyoo Park, Kyungyul Park, Deok Koo Kim, Jungsu Lee, Joo Yeon Lee, and Carsten Mueller. In addition, the authors would like to sincerely thank the anonymous VLDB reviewers who provided invaluable comments and suggestions. Seongyun Ko and Wook-Shin Han are partly supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (No. NRF-2017R1A2B3007116).

10. REFERENCES

[1] Introduction to Teradata. https://www.teradatapoint.com/teradata/secondary-index-in-teradata.htm.
[2] MySQL: Replication and partitioning. https://dev.mysql.com/doc/refman/8.0/en/replication-features-partitioning.html, 2012.
[3] Using global secondary indexes in DynamoDB. https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/GSI.html, 2012.
[4] SAP HANA SQL and system views reference 2.0 SPS 03. https://help.sap.com/viewer/product/SAP_HANA_PLATFORM/2.0.03, 2018.
[5] SAP S/4HANA. https://www.sap.com/products/s4hana-erp/features.html, 2019.
[6] ScyllaDB. https://www.scylladb.com/, 2019.
[7] J. Baker, C. Bond, J. C. Corbett, J. Furman, A. Khorlin, J. Larson, J.-M. Leon, Y. Li, A. Lloyd, and V. Yushprakh. Megastore: Providing scalable, highly available storage for interactive services. In CIDR, volume 11, pages 223–234, 2011.
[8] H. Berenson, P. Bernstein, J. Gray, J. Melton, E. O'Neil, and P. O'Neil. A critique of ANSI SQL isolation levels. In ACM SIGMOD Record, volume 24, pages 1–10. ACM, 1995.
[9] C. Binnig, S. Hildenbrand, F. Färber, D. Kossmann, J. Lee, and N. May. Distributed snapshot isolation: global transactions pay globally, local transactions pay locally. The VLDB Journal, 23(6):987–1011, 2014.
[10] C. Curino, H. J. Moon, A. Deutsch, and C. Zaniolo. Automating the database schema evolution process. The VLDB Journal, 22(1):73–98, 2013.
[11] C. A. Curino, H. J. Moon, and C. Zaniolo. Graceful database schema evolution: The PRISM workbench. Proc. VLDB Endow., 1(1):761–772, Aug. 2008.
[12] J. V. Dsilva, R. Ruiz-Carrillo, C. Yu, M. Y. Ahmad, and B. Kemme. Secondary indexing techniques for key-value stores: Two rings to rule them all. In International Workshop on Design, Optimization, Languages and Analytical Processing of Big Data (DOLAP), 2017.
[13] J. Lee, W.-S. Han, H. J. Na, C. G. Park, K. H. Kim, D. H. Kim, J. Y. Lee, S. K. Cha, and S. Moon. Parallel replication across formats for scaling out mixed OLTP/OLAP workloads in main-memory databases. The VLDB Journal, 27(3):421–444, 2018.
[14] J. Lee, Y. S. Kwon, F. Färber, M. Muehle, C. Lee, C. Bensberg, J. Y. Lee, A. H. Lee, and W. Lehner. SAP HANA distributed in-memory database system: Transaction, session, and metadata management. In 2013 IEEE 29th International Conference on Data Engineering (ICDE), pages 1165–1173. IEEE, 2013.
[15] J. Lee, S. Moon, K. H. Kim, D. H. Kim, S. K. Cha, and W.-S. Han. Parallel replication across formats in SAP HANA for scaling out mixed OLTP/OLAP workloads. Proc. VLDB Endow., 10(12):1598–1609, Aug. 2017.
[16] J. Lee, H. Shin, C. G. Park, S. Ko, J. Noh, Y. Chuh, W. Stephan, and W.-S. Han. Hybrid garbage collection for multi-version concurrency control in SAP HANA. In Proceedings of the 2016 International Conference on Management of Data, pages 1307–1318. ACM, 2016.
[17] Y. Lu, X. Yu, and S. Madden. STAR: Scaling transactions through asymmetric replication. Proc. VLDB Endow., 12(11):1316–1329, July 2019.
[18] D. Makreshanski, J. Giceva, C. Barthels, and G. Alonso. BatchDB: Efficient isolated execution of hybrid OLTP+OLAP workloads for interactive applications. In Proceedings of the 2017 ACM International Conference on Management of Data, pages 37–50. ACM, 2017.
[19] T. Mishima and Y. Fujiwara. Madeus: Database live migration middleware under heavy workloads for cloud environment. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, pages 315–329, 2015.
[20] I. Psaroudakis, F. Wolf, N. May, T. Neumann, A. Böhm, A. Ailamaki, and K.-U. Sattler. Scaling up mixed workloads: A battle of data freshness, flexibility, and scheduling. In Technology Conference on Performance Evaluation and Benchmarking, pages 97–112. Springer, 2014.
[21] I. Rae, E. Rollins, J. Shute, S. Sodhi, and R. Vingralek. Online, asynchronous schema change in F1. Proc. VLDB Endow., 6(11):1045–1056, Aug. 2013.
[22] S. Scherzinger, T. Cerqueus, and E. C. de Almeida. ControVol: A framework for controlled schema evolution in NoSQL application development. In 2015 IEEE 31st International Conference on Data Engineering (ICDE), pages 1464–1467. IEEE, 2015.
[23] J. Shute, R. Vingralek, B. Samwel, B. Handy, C. Whipkey, E. Rollins, M. Oancea, K. Littlefield, D. Menestrina, S. Ellner, J. Cieslewicz, I. Rae, T. Stancescu, and H. Apte. F1: A distributed SQL database that scales. Proc. VLDB Endow., 6(11):1068–1079, Aug. 2013.
