PERFORMANCE ANALYSIS OF CONTAINERIZED APPLICATIONS ON LOCAL AND REMOTE STORAGE
Qiumin Xu (USC), Manu Awasthi (IIT Gandhinagar), Krishna T. Malladi (Samsung), Janki Bhimani (Northeastern), Jingpei Yang (Samsung), Murali Annavaram (USC)

1 Docker Becomes Very Popular

Software container platform with many desirable features

Ease of deployment, developer friendliness and light weight
Mainstay in cloud platforms

Google Cloud Platform, Amazon EC2, Microsoft Azure
Storage hierarchy is the key component

High Performance SSDs

NVMe, NVMe over Fabrics

2 Agenda

Docker, NVMe and NVMe over Fabrics (NVMf)
How to best utilize NVMe SSDs for a single container?

The best configuration performs similarly to the raw device

Where do the performance anomalies come from?
Do Docker containers scale well on NVMe SSDs?

Exemplify using Cassandra

Best strategy to divide the resources
Scaling Docker containers on NVMe over Fabrics

3 What is a Docker Container?

In a virtual machine, each virtualized application includes an entire guest OS (~10s of GB)

Docker container comprises just application and bins/libs

Shares the kernel with other containers

Much more portable and efficient
(figure from https://docs.docker.com)

4 Non-Volatile Memory Express (NVMe)

A storage protocol standard on top of PCIe
NVMe SSDs connect through PCIe and support the standard

Available since 2014 (e.g., from Samsung)

Enterprise and consumer variants
NVMe SSDs leverage the interface to deliver superior performance

5X to 10X over SATA SSDs [1]

[1] Qiumin Xu et al., "Performance Analysis of NVMe SSDs and Their Implication on Real World Databases," SYSTOR '15
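As a quick sanity check that an NVMe SSD is attached over PCIe and exposed through the NVMe driver, the standard nvme-cli and lspci tools can be used; the device paths below are illustrative assumptions, not part of the original setup description.

  # List NVMe controllers/namespaces known to the kernel driver
  nvme list
  # Confirm the SSD's PCIe attachment (vendor/device strings will vary)
  lspci | grep -i "non-volatile memory"
  # Query controller identity data for the first controller (path is an assumption)
  nvme id-ctrl /dev/nvme0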

5 Why NVMe over Fabrics (NVMf)?

Retains NVMe performance over network fabrics
Eliminates unnecessary protocol translations
Enables low-latency, high-IOPS remote storage

J. Metz and D. Minturn, "Under the Hood with NVMe over Fabrics," SNIA Ethernet Storage Forum

6 Storage Architecture in Docker

Storage Options:
1. Through Docker Filesystem (Aufs, Btrfs, Overlayfs)
2. Through Virtual Block Devices (2.a Loop-lvm, 2.b Direct-lvm)
3. Through Docker Data Volume (-v)

[Figure: Docker storage architecture. Container read/write operations reach the NVMe SSDs through (1) the storage driver (Aufs, Btrfs, Overlayfs; -g option), (2) Devicemapper virtual block devices (2.a loop-lvm over sparse files, 2.b direct-lvm over a base device / thin pool), or (3) a data volume (-v option) on the host backing filesystem (EXT4, XFS, etc.).]

7 Optimize Storage Configuration for a Single Container

Experimental Environment

Dual-socket Xeon E5-2670 v3 (12 HT cores per socket)
Enterprise-class NVMe SSD: Samsung XS1715
Linux kernel v4.6.0, Docker v1.11.2
fio used for traffic generation
Asynchronous IO engine (libaio), 32 concurrent jobs, iodepth of 32
Measure steady-state performance
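A minimal fio invocation matching the parameters described above might look like the sketch below; the target device path and runtime are illustrative assumptions rather than the authors' exact job file.

  # Steady-state random-read job against the raw NVMe device
  # (/dev/nvme0n1 and the 300s runtime are illustrative assumptions)
  fio --name=rr --filename=/dev/nvme0n1 --ioengine=libaio --direct=1 \
      --rw=randread --bs=4k --numjobs=32 --iodepth=32 \
      --group_reporting --time_based --runtime=300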

8 Performance Comparison — Host Backing Filesystems

[Figure: Average IOPS (RR, RW) and average bandwidth in MB/s (SR, SW) for the raw device, EXT4, and XFS backing filesystems.]

EXT4 performs 25% worse for RR
XFS closely resembles RAW for all workloads but RW

9 Tuning the Performance Gap — Random Reads

[Figure: RR IOPS for EXT4 with default mount options vs. with dioread_nolock, compared to XFS.]

XFS allows multiple processes to read a file at once
Uses allocation groups which can be accessed independently
EXT4 requires mutex locks even for read operations
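A plausible way to apply this tuning is the EXT4 dioread_nolock mount option, which avoids holding the inode mutex during direct-IO reads; the device and mount point below are illustrative assumptions.

  # Mount the EXT4 backing filesystem with lock-free direct-IO reads
  # (/dev/nvme0n1 and /mnt/ext4 are illustrative paths)
  mkfs.ext4 /dev/nvme0n1
  mount -t ext4 -o dioread_nolock /dev/nvme0n1 /mnt/ext4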

10 Tuning the Performance Gap —Random Writes

[Figure: RW IOPS for RAW, EXT4, and XFS as the number of fio jobs varies from 1 to 64.]

XFS performs poorly with high thread count
Contention in exclusive locking kills the write performance

Exclusive locks are used by lookup and write checks

A patch is available, but not for kernel 4.6 [1]

[1] https://www.percona.com/blog/2012/03/15/ext4-vs-xfs-on-ssd/
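A simple sweep like the following, with illustrative paths and runtime, reproduces the kind of job-count scaling experiment behind the chart above.

  # Sweep the number of concurrent random-write jobs on an XFS-backed file
  # (/mnt/xfs/testfile, file size, and runtime are illustrative assumptions)
  for jobs in 1 2 4 8 16 24 28 32 48 64; do
      fio --name=rw_${jobs} --filename=/mnt/xfs/testfile --size=10G \
          --ioengine=libaio --direct=1 --rw=randwrite --bs=4k \
          --numjobs=${jobs} --iodepth=32 --group_reporting \
          --time_based --runtime=60
  done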

11 Storage Architecture in Docker

Storage Options:
1. Through Docker Filesystem (Aufs, Btrfs, Overlayfs)
2. Through Virtual Block Devices (2.a Loop-lvm, 2.b Direct-lvm)
3. Through Docker Data Volume (-v)

[Figure: Docker storage architecture diagram (repeated from slide 6).]

12 Docker Storage Options

Option 1: Through Docker File System

Aufs (Advanced multi-layered Unification FileSystem):

A fast, reliable unification file system
Btrfs (B-tree file system):

A modern CoW file system which implements many advanced features for fault tolerance, repair and easy administration
Overlayfs:

Another modern unification file system, which has a simpler design and is potentially faster than Aufs
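On the Docker 1.11 release used in this study, the storage driver and its on-disk location are chosen when the daemon starts; a minimal sketch, assuming the NVMe-backed filesystem is mounted at /mnt/nvme:

  # Put Docker's image/container storage on the NVMe-backed filesystem
  # and select Overlayfs as the storage driver (option 1 above)
  docker daemon --storage-driver=overlay --graph=/mnt/nvme/docker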

13 Performance Comparison

Option 1: Through Docker File System

[Figure: Average IOPS (RR, RW) and average bandwidth in MB/s (SR, SW) for the raw device, Aufs, Btrfs, and Overlayfs.]

Aufs and Overlayfs perform close to the raw block device in most cases

Btrfs has the worst performance for random workloads

14 Tuning the Performance Gap of Btrfs — Random Reads

[Figure: Random-read bandwidth (MB/s) for RAW, EXT4, and Btrfs across block sizes.]

Btrfs doesn't work well for small block sizes yet

Btrfs must read the file extent before reading the file data.

Large block size reduces the frequency of reading metadata

15 Tuning the Performance Gap of Btrfs — Random Writes

[Figure: RW IOPS for Btrfs with default mount options vs. with nodatacow.]

Btrfs doesn’t work well for random writes due to CoW overhead
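One way to apply this tuning is the Btrfs nodatacow mount option, which disables copy-on-write for newly written data; the device and mount point are illustrative assumptions.

  # Disable copy-on-write for data on the Btrfs backing filesystem
  # (/dev/nvme0n1 and /mnt/btrfs are illustrative paths)
  mkfs.btrfs /dev/nvme0n1
  mount -t btrfs -o nodatacow /dev/nvme0n1 /mnt/btrfs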

16 Storage Architecture in Docker

Storage Options:
1. Through Docker Filesystem (Aufs, Btrfs, Overlayfs)
2. Through Virtual Block Devices (2.a Loop-lvm, 2.b Direct-lvm)
3. Through Docker Data Volume (-v)

[Figure: Docker storage architecture diagram (repeated from slide 6).]

17 Docker Storage Configurations

Option 2: Through Virtual Block Device

The Devicemapper storage driver leverages the thin provisioning and snapshotting capabilities of the kernel-based Device Mapper framework
Loop-lvm uses sparse files to build the thin-provisioned pools

Direct-lvm uses a block device to directly create the thin pools (recommended by Docker)
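A hedged sketch of a direct-lvm setup in the Docker 1.11 timeframe: an LVM thin pool is carved out of the NVMe device and handed to the Devicemapper driver. The volume group name, pool name, and pool size are illustrative assumptions.

  # Build an LVM thin pool on the NVMe device (names and sizes are illustrative)
  pvcreate /dev/nvme0n1
  vgcreate docker /dev/nvme0n1
  lvcreate --type thin-pool -n thinpool -l 90%VG docker

  # Point the Devicemapper driver at the thin pool instead of loopback sparse files
  docker daemon --storage-driver=devicemapper \
      --storage-opt dm.thinpooldev=/dev/mapper/docker-thinpool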

18 Docker Storage Configurations

Option 3: Through Docker Data Volume (-v)

Data persists beyond the lifetime of the container and can be shared and accessed from other containers

* figure from https://github.com/libopenstorage/openstorage
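A minimal sketch of option 3: bind-mount a directory from the NVMe-backed host filesystem into the container with -v; the paths and image tag are illustrative assumptions.

  # Bind-mount an NVMe-backed host directory into the container as /data
  # (/mnt/nvme/data and the ubuntu:16.04 image are illustrative)
  docker run --rm -it -v /mnt/nvme/data:/data ubuntu:16.04 bash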

19 Performance Comparison

Option 2 & Option 3

[Figure: Average IOPS (RR, RW) and average bandwidth in MB/s (SR, SW) for RAW, Direct-lvm, Loop-lvm, and data volumes (-v) on Aufs and Overlayfs.]

Direct-lvm has worse performance for RR/RW
LVM, device mapper, and the dm-thinp kernel module introduce additional code paths and overhead, and may not suit IO-intensive workloads

20 Application Performance

Cassandra

NoSQL database
Scales linearly with the number of nodes in the cluster (theoretically) [1]
Requires data persistence

Uses a Docker data volume to store data

[1] Rabl, Tilmann et al. "Solving Big Data Challenges for Enterprise Application Performance Management”, VLDB’13

21 Scaling Docker Containers on NVMe

Multiple containerized Cassandra databases

Experiment Setup
Dual-socket Xeon E5 server, 10Gb Ethernet
N = 1, 2, 3, ..., 8 containers
Each container is driven by a YCSB client (see the sketch after the workload list)
Record count: 100M records, 100GB in each DB
Client thread count: 16

Workloads

Workload A: 50% read, 50% update, Zipfian distribution
Workload D: 95% read, 5% insert, normal distribution
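A hedged sketch of how one instance in this setup could be launched and driven, assuming the official cassandra image and the YCSB cassandra-cql binding; the container name, image tag, paths, and port mapping are illustrative, and these are not the authors' exact scripts.

  # Start one containerized Cassandra instance with its data on an NVMe-backed volume
  docker run -d --name cass1 -p 9042:9042 \
      -v /mnt/nvme/cass1:/var/lib/cassandra cassandra:3.0

  # Load 100M records, then run Workload A with 16 client threads
  ./bin/ycsb load cassandra-cql -P workloads/workloada \
      -p hosts=127.0.0.1 -p recordcount=100000000 -threads 16
  ./bin/ycsb run cassandra-cql -P workloads/workloada \
      -p hosts=127.0.0.1 -p recordcount=100000000 -threads 16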

22 Results-Throughput

Workload D, directly attached SSD

[Figure: Aggregate throughput (ops/sec) vs. number of Cassandra containers (1-8), with per-container contributions C1-C8 stacked.]

Aggregated throughput peaks at 4 containers

Cgroup limits per container: 6 CPU cores, 6 GB memory, 400 MB/s bandwidth

23 Strategies for Dividing Resources

[Figure: Throughput (ops/sec) vs. number of Cassandra containers when limiting CPU, MEM, CPU+MEM, BW, all resources, or leaving them uncontrolled.]

MEM has the most significant impact on throughput
Best strategy for dividing resources using cgroups: assign 6 CPU cores to each container and leave the other resources uncontrolled (see the sketch below)
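A hedged sketch of how these cgroup limits map to docker run flags; the core range, memory cap, and bandwidth value follow the slide's configuration, while the container name, device path, and image are illustrative assumptions.

  # Best strategy found above: pin each container to 6 CPU cores only
  docker run -d --name cass1 --cpuset-cpus=0-5 \
      -v /mnt/nvme/cass1:/var/lib/cassandra cassandra:3.0

  # Flags used by the other strategies (memory and write-bandwidth caps):
  #   --memory=6g                              6 GB memory limit
  #   --device-write-bps /dev/nvme0n1:400mb    400 MB/s write-bandwidth limit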

24 Scaling Containerized Cassandra using NVMf

Experiment Setup

Cassandra + Docker

[Figure: YCSB clients connect to the application server (Cassandra + Docker) over 10GbE; the application server accesses the NVMf target storage server over 40GbE.]
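A hedged sketch of how the application server could attach the remote namespace with nvme-cli over RDMA; the target address, port, and NQN are illustrative assumptions.

  # On the application server: load the NVMe-over-Fabrics RDMA initiator
  modprobe nvme-rdma

  # Discover and connect to the subsystem exported by the storage server
  # (address, port, and NQN are illustrative assumptions)
  nvme discover -t rdma -a 192.168.1.10 -s 4420
  nvme connect -t rdma -a 192.168.1.10 -s 4420 -n nqn.2016-06.io.example:nvmf-target
  # The remote namespace then appears as a local block device, e.g. /dev/nvme1n1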

25 Results-Throughput

[Figure: Relative throughput (TPS) vs. number of Cassandra instances for DAS and NVMf under Workloads A and D.]

The throughput with NVMf is within 6%-12% of that with directly attached SSDs

26 Results-Latency

[Figure: Relative latency vs. number of Cassandra instances for DAS and NVMf under Workloads A and D.]

NVMf incurs only 2%-15% longer latency than directly attached SSDs.

27 Results-CPU Utilization

NVMf incurs less than 1.8% CPU utilization on the target machine

28 SUMMARY

Best Option in Docker for NVMe Drive Performance
Overlayfs + XFS + Data Volume

Best Strategy for Dividing Resources using Cgroups
Control only the CPU resources

Scaling Docker Containers on NVMf
Throughput: within 6% - 12% vs. DAS
Latency: 2% - 15% longer than DAS

THANK YOU!
[email protected]
