Scalable but Wasteful: Current State of Replication in the Cloud

Scalable but Wasteful: Current State of Replication in the Cloud Venkata Swaroop Matte Aleksey Charapko Abutalib Aghayev The Pennsylvania State University University of New Hampshire The Pennsylvania State University Strongly Consistent Replication - Used in Cloud datastores and Configuration management - Rely on Consensus protocols ( replication protocols ) - Achieve High-throughput 2 How to optimize for Throughput? Leader Multi-Paxos 3 One way to optimize: Shift the load Leader Multi-Paxos EPaxos - Many protocols shift work from the bottleneck to the under-utilized node - Examples: EPaxos, SDPaxos and PigPaxos 4 Resource utilization of replication protocols Core 1 Core 1 Core 1 Core 2 Core 2 Core 2 3 nodes with 2 cores each 5 Resource utilization of replication protocols Core 1 Core 1 Core 1 Utilized core Core 2 Core 2 Core 2 Idle core Leader Followers Multi-Paxos 6 Resource utilization of replication protocols Core 1 Core 1 Core 1 Core 1 Core 1 Core 1 Core 2 Core 2 Core 2 Core 2 Core 2 Core 2 Multi-Paxos EPaxos EPaxos also utilizes the idle cores to achieve high throughput 7 Confirming performance gains • Single Instance • 5 AWS EC2 m5a.large nodes • Each 2 vCPU, 8GB RAM • 50% write workload Throughput of Multi-Paxos and EPaxos EPaxos achieves 20% higher throughput compared to Multi-Paxos 8 Missing piece: Resource efficiency EPaxos Multi-Paxos 500% Utilization 200% Utilization 18 kops/s 14 kops/s Multi-Paxos shows better resource efficiency compared9 to EPaxos Metric to analyze Resource efficiency Throughput-per-unit-of-constraining-resource-utilization - Used CPU utilization to identify resource efficiency - This metric determines the added cost of removing bottleneck 10 Throughput-per-unit-of-aggregate-CPU-Utilization Metric shows the resource efficiency of replication protocols 11 Relevance of resource efficiency in Cloud - Important in a pay-as-you-go utility model like Cloud - Replication protocols are optimized for dedicated VMs - Whereas Cloud is sharded and resource packed - Spanner, CockroachDB, and YugabyteDB support many instances from different shards on the same physical machine 12 Example: Packing in a resource constrained setting Core 1 Core 2 Core 3 Core 4 Core 5 Core 6 Core 1 Core 2 Core 3 Core 4 Core 5 Core 6 Core 1 Core 2 Core 3 Core 4 Core 5 Core 6 5 nodes with 6 cores each Core 1 Core 2 Core 3 Core 4 Core 5 Core 6 Core 1 Core 2 Core 3 Core 4 Core 5 Core 6 13 Example: Packing in a resource constrained setting Core 1 Core 2 Core 3 Core 4 Core 5 Core 6 Core 1 Core 2 Core 3 Core 4 Core 5 Core 6 Instance 1 Core 1 Core 2 Core 3 Core 4 Core 5 Core 6 Core 1 Core 2 Core 3 Core 4 Core 5 Core 6 Core 1 Core 2 Core 3 Core 4 Core 5 Core 6 14 Example: Packing in a resource constrained setting Core 1 Core 2 Core 3 Core 4 Core 5 Core 6 Core 1 Core 2 Core 3 Core 4 Core 5 Core 6 Instance 1 Core 1 Core 2 Core 3 Core 4 Core 5 Core 6 Instance 2 Core 1 Core 2 Core 3 Core 4 Core 5 Core 6 Core 1 Core 2 Core 3 Core 4 Core 5 Core 6 15 Example: Packing in a resource constrained setting Core 1 Core 2 Core 3 Core 4 Core 5 Core 6 Core 1 Core 2 Core 3 Core 4 Core 5 Core 6 Instance 1 Core 1 Core 2 Core 3 Core 4 Core 5 Core 6 Instance 2 Instance 3 Core 1 Core 2 Core 3 Core 4 Core 5 Core 6 Core 1 Core 2 Core 3 Core 4 Core 5 Core 6 16 Example: Packing in a resource constrained setting Core 1 Core 2 Core 3 Core 4 Core 5 Core 6 Core 1 Core 2 Core 3 Core 4 Core 5 Core 6 Instance 1 Core 1 Core 2 Core 3 Core 4 Core 5 Core 6 Instance 2 Instance 3 Core 1 Core 3 Core 2 Core 4 Core 5 Core 6 Instance 4 Core 1 Core 2 Core 3 Core 4 Core 5 Core 6 17 Example: Packing in a resource constrained setting Core 1 Core 2 Core 3 Core 4 Core 5 Core 6 Core 1 Core 2 Core 3 Core 4 Core 5 Core 6 Instance 1 Core 1 Core 2 Core 3 Core 4 Core 5 Core 6 Instance 2 Instance 3 Core 1 Core 2 Core 3 Core 4 Core 5 Core 6 Instance 4 Instance 5 Core 1 Core 2 Core 3 Core 4 Core 5 Core 6 18 Experiment: Packing 5 instances in Cloud • 5 Instance of Multi-Paxos/EPaxos • 5 AWS EC2 m5a.2xlarge nodes • Each 8 vCPU, 32GB RAM • 50% write workload 19 Aggregate Throughput Aggregate throughput of Multi-Paxos and EPaxos with 5 instances packed together 20 Why throughput-per-unit-of-constraining-resource-utilization? It is a good proxy for the performance of replication protocols in Cloud setting dedicated resource setting shared resource setting 21 Conclusion: Scalable but Wasteful dedicated resource setting Fixed-budget shared resource setting Resource efficiency plays a key role for replication protocols when moving from a dedicated to shared resource setting 22.

Scalable but Wasteful: Current State of Replication in the Cloud

ACID Compliant Distributed Key-Value Store

Grpc - a Solution for Rpcs by Google Distributed Systems Seminar at Charles University in Prague, Nov 2016 Jan Tattermusch - Grpc Software Engineer

Database Software Market: Billy Fitzsimmons +1 312 364 5112

Implementing and Testing Cockroachdb on Kubernetes

How Financial Service Companies Can Successfully Migrate Critical

Degree Project: Bachelor of Science in Computing Science

Cockroachdb: the Resilient Geo-Distributed SQL Database

Solution Deploying Cockroachdb Using Openebs

Automatically Detecting and Fixing Concurrency Bugs in Go Software Systems Extended Abstract

Spanner: Google's Globally Distributed Database

Strategy, Struggle & Success

SQL Joins with Interleaved Tables: a Natural Extension of Newsql Databases