PERFORMANCE ANALYSIS OF CONTAINERIZED APPLICATIONS ON LOCAL AND REMOTE STORAGE
Qiumin Xu (USC), Manu Awasthi (IIT Gandhinagar), Krishna T. Malladi (Samsung), Janki Bhimani (Northeastern), Jingpei Yang (Samsung), Murali Annavaram (USC)
Docker Becomes Very Popular
Software container platform with many desirable features
Ease of deployment, developer friendliness, and lightweight virtualization
Mainstay in cloud platforms: Google Cloud Platform, Amazon EC2, Microsoft Azure
The storage hierarchy is a key component
High Performance SSDs
NVMe, NVMe over Fabrics
Agenda
Docker, NVMe, and NVMe over Fabrics (NVMf)
How to best utilize NVMe SSDs for a single container?
- The best configuration performs similarly to raw device performance
- Where do the performance anomalies come from?
Do Docker containers scale well on NVMe SSDs?
- Exemplified using Cassandra
- Best strategy for dividing the resources
Scaling Docker containers on NVMe over Fabrics
What Is a Docker Container?
Each virtualized application includes an entire guest OS (~10s of GB)
A Docker container comprises just the application and its bins/libs
Shares the kernel with other containers
Much more portable and efficient
(figure from https://docs.docker.com)

Non-Volatile Memory Express (NVMe)
A storage protocol standard on top of PCIe; NVMe SSDs connect through PCIe and support the standard
Shipping since 2014 (Intel, Samsung), in enterprise and consumer variants
NVMe SSDs leverage the interface to deliver superior performance: 5x to 10x over SATA SSDs [1]
[1] Qiumin Xu et al. “Performance analysis of NVMe SSDs and their implication on real world databases.” SYSTOR’15
Why NVMe over Fabrics (NVMf)?
Retains NVMe performance over network fabrics
Eliminates unnecessary protocol translations
Enables low-latency, high-IOPS remote storage
J. M. Dave Minturn, "Under the Hood with NVMe over Fabrics," SNIA Ethernet Storage Forum

Storage Architecture in Docker
Storage options:
1. Through a Docker filesystem (Aufs, Btrfs, Overlayfs)
2. Through virtual block devices (2.a loop-lvm, 2.b direct-lvm)
3. Through a Docker data volume (-v)
[Diagram: container read/write operations reach the NVMe SSD via (1) a storage driver (Aufs, Btrfs, Overlayfs; selected with the -g option), (2) devicemapper thin pools built from sparse files (loop-lvm) or from a block device (direct-lvm), or (3) a data volume (-v); paths 1 and 3 go through the host backing filesystem (EXT4, XFS, etc.)]
Optimizing the Storage Configuration for a Single Container
Experimental environment:
- Dual-socket Xeon E5-2670 v3, 12 HT cores per socket
- Samsung XS1715 enterprise-class NVMe SSD
- Kernel v4.6.0, Docker v1.11.2
- fio used for traffic generation: asynchronous I/O engine (libaio), 32 concurrent jobs, iodepth 32
- Steady-state performance measured
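The fio setup above can be sketched as a single command; the device path, block size, and runtime below are assumptions not stated on the slide:

```shell
# Hypothetical fio invocation mirroring the setup: libaio engine,
# 32 concurrent jobs, iodepth 32, direct I/O against the raw NVMe device.
fio --name=randread --filename=/dev/nvme0n1 \
    --ioengine=libaio --direct=1 \
    --rw=randread --bs=4k \
    --numjobs=32 --iodepth=32 \
    --runtime=300 --time_based --group_reporting
```

Running long enough to reach steady state (--time_based with a generous --runtime) matters for SSDs, whose fresh-out-of-box performance is not representative.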
Performance Comparison: Host Backing Filesystems
[Figure: average IOPS (RR/RW) and average bandwidth in MB/s (SR/SW) for RAW, EXT4, and XFS]
EXT4 performs 25% worse for RR; XFS closely resembles RAW for all workloads but RW
Tuning the Performance Gap: Random Reads
[Figure: RR IOPS for EXT4 with default mount options vs. dioread_nolock, approaching the ~700K-800K IOPS of XFS and RAW]
XFS allows multiple processes to read a file at once; it uses allocation groups which can be accessed independently
EXT4 requires mutex locks even for read operations; the dioread_nolock mount option removes this bottleneck
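A minimal sketch of the EXT4 tuning, assuming the filesystem sits on /dev/nvme0n1 and is mounted at /mnt/nvme (both paths are assumptions):

```shell
# Remount EXT4 with dioread_nolock to skip the read-side mutex
# on direct-I/O reads; paths are hypothetical.
mount -o remount,dioread_nolock /dev/nvme0n1 /mnt/nvme
```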
Tuning the Performance Gap: Random Writes
[Figure: RW IOPS vs. number of jobs (1-64) for RAW, EXT4, and XFS]
XFS performs poorly at high thread counts: contention on an exclusive lock, used by extent lookups and write checks, kills write performance
A patch is available, but not for Linux 4.6 [1]
[1] https://www.percona.com/blog/2012/03/15/ext4-vs-xfs-on-ssd/
Storage Architecture in Docker (recap of the three storage options)
Docker Storage Options
Option 1: Through the Docker Filesystem
Aufs (Advanced multi-layered Unification FileSystem): a fast, reliable unification filesystem
Btrfs (B-tree filesystem): a modern CoW filesystem implementing many advanced features for fault tolerance, repair, and easy administration
Overlayfs: another modern unification filesystem, with a simpler design and potentially faster than Aufs
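Selecting one of these drivers happens at daemon startup; a sketch, where the -g/--graph path on the NVMe filesystem is an assumption (and note that Docker 1.11, used in these experiments, launched the daemon as `docker daemon` rather than `dockerd`):

```shell
# Start the Docker daemon with the overlay storage driver and put its
# graph directory on the NVMe-backed filesystem (path is hypothetical).
dockerd --storage-driver=overlay --graph=/mnt/nvme/docker
```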
Performance Comparison, Option 1: Through the Docker Filesystem
[Figure: average IOPS (RR/RW) and bandwidth in MB/s (SR/SW) for Raw, Aufs, Btrfs, and Overlay]
Aufs and Overlayfs perform close to the raw block device in most cases
Btrfs has the worst performance for random workloads
Tuning the Performance Gap of Btrfs: Random Reads
[Figure: RR bandwidth (MB/s) vs. block size for RAW, EXT4, and Btrfs]
Btrfs doesn't work well for small block sizes yet: it must read the file extent before reading the file data, and a large block size reduces the frequency of metadata reads
Tuning the Performance Gap of Btrfs: Random Writes
[Figure: RW IOPS for Btrfs with default mount options vs. nodatacow]
Btrfs doesn't work well for random writes due to CoW overhead
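A sketch of the nodatacow tuning, with hypothetical device and mount paths; note that nodatacow also disables data checksumming, so it trades integrity features for write performance:

```shell
# Mount Btrfs with copy-on-write disabled for data blocks.
mount -o nodatacow /dev/nvme0n1 /mnt/btrfs
# Alternatively, disable CoW per directory (takes effect for files
# created afterwards):
chattr +C /mnt/btrfs/dbdata
```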
Storage Architecture in Docker (recap of the three storage options)
Docker Storage Configurations
Option 2: Through a Virtual Block Device
The devicemapper storage driver leverages the thin provisioning and snapshotting capabilities of the kernel's device-mapper framework
Loop-lvm uses sparse files to build the thin-provisioned pools
Direct-lvm uses a block device to create the thin pools directly (recommended by Docker)
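A direct-lvm setup might be sketched as follows; the thin pool name is an assumption, and the pool itself must be created with LVM on the NVMe device beforehand:

```shell
# Point the devicemapper driver at a pre-built LVM thin pool
# (direct-lvm) instead of loopback sparse files; pool name is
# hypothetical.
dockerd --storage-driver=devicemapper \
        --storage-opt dm.thinpooldev=/dev/mapper/docker-thinpool
```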
Docker Storage Configurations
Option 3: Through a Docker Data Volume (-v)
Data persists beyond the lifetime of the container and can be shared with and accessed from other containers
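A sketch of running a container with a data volume placed on the NVMe-backed filesystem; the host path and image tag are assumptions (Cassandra's official image stores data under /var/lib/cassandra):

```shell
# Bind-mount a host directory on the NVMe filesystem into the
# container with -v, so data I/O bypasses the storage driver.
docker run -d -v /mnt/nvme/cassandra:/var/lib/cassandra cassandra:3.0
```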
* figure from https://github.com/libopenstorage/openstorage
Performance Comparison, Options 2 & 3
[Figure: average IOPS (RR/RW) and bandwidth in MB/s (SR/SW) for RAW, direct-lvm, loop-lvm, Aufs -v, and Overlay -v]
Direct-lvm has worse performance for RR/RW: LVM, device mapper, and the dm-thinp kernel module introduce additional code paths and overhead, and may not suit I/O-intensive workloads
Application Performance: the Cassandra Database
NoSQL database; theoretically scales linearly with the number of nodes in the cluster [1]
Requires data persistence; uses a Docker data volume to store data
[1] Rabl, Tilmann et al. "Solving Big Data Challenges for Enterprise Application Performance Management”, VLDB’13
Scaling Docker Containers on NVMe: Multiple Containerized Cassandra Databases
Experiment setup:
- Dual-socket Xeon E5 server, 10 Gb Ethernet
- N = 1, 2, 3, ... 8 containers, each driven by its own YCSB client
- Record count: 100M records, 100 GB in each DB
- Client thread count: 16
Workloads:
- Workload A: 50% read, 50% update, Zipfian distribution
- Workload D: 95% read, 5% insert, normal distribution
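Driving one instance might look like this YCSB sketch, assuming YCSB's cassandra-cql binding; the host address and operation count are assumptions not stated on the slide:

```shell
# Load 100M records into one Cassandra container, then run
# Workload A with 16 client threads (paths/host are hypothetical).
./bin/ycsb load cassandra-cql -P workloads/workloada \
    -p hosts=127.0.0.1 -p recordcount=100000000 -threads 16
./bin/ycsb run cassandra-cql -P workloads/workloada \
    -p hosts=127.0.0.1 -p operationcount=10000000 -threads 16
```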
Results: Throughput (Workload D, directly attached SSD)
[Figure: aggregate throughput (ops/sec) vs. number of Cassandra containers (C1-C8), with and without cgroup limits]
Aggregate throughput peaks at 4 containers
Cgroup limits per container: 6 CPU cores, 6 GB memory, 400 MB/s bandwidth
Strategies for Dividing Resources
[Figure: throughput (ops/sec) vs. number of Cassandra containers when limiting CPU, MEM, CPU+MEM, BW, all resources, or leaving everything uncontrolled]
MEM limits have the most significant impact on throughput
Best strategy for dividing resources with cgroups: assign 6 CPU cores to each container and leave the other resources uncontrolled
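These cgroup partitions can be expressed directly as docker run flags; the volume path, image tag, and device name below are assumptions:

```shell
# Full partitioning: 6 cores, 6 GB memory, 400 MB/s write cap per
# container (device path is hypothetical).
docker run -d --cpuset-cpus=0-5 -m 6g \
    --device-write-bps /dev/nvme0n1:400mb \
    -v /mnt/nvme/c1:/var/lib/cassandra cassandra:3.0

# CPU-only variant, the best-performing strategy above: pin cores,
# leave memory and I/O bandwidth uncontrolled.
docker run -d --cpuset-cpus=0-5 \
    -v /mnt/nvme/c1:/var/lib/cassandra cassandra:3.0
```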
Scaling Containerized Cassandra using NVMf
Experiment setup: Cassandra + Docker on the application server
[Diagram: YCSB clients -> 10 GbE -> application server -> 40 GbE -> NVMf target storage server]
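On the application server, attaching the remote namespace might look like this nvme-cli sketch; the transport address and NQN are placeholders, not values from the experiment:

```shell
# Connect to an NVMf target over RDMA; address/port/NQN are
# hypothetical.
nvme connect -t rdma -a 192.168.1.100 -s 4420 \
    -n nqn.2016-06.example:nvme-target1
# The remote namespace then appears as a local block device
# (e.g. /dev/nvme1n1) and can back the Docker data volume exactly
# like a direct-attached SSD.
```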
Results: Throughput
[Figure: relative TPS vs. number of Cassandra instances (1-8) for DAS_A, NVMf_A, DAS_D, NVMf_D]
NVMf throughput is within 6%-12% of directly attached SSDs
Results: Latency
[Figure: relative latency vs. number of Cassandra instances (1-8) for DAS_A, NVMf_A, DAS_D, NVMf_D]
NVMf incurs only 2%-15% longer latency than directly attached SSDs
Results: CPU Utilization
NVMf incurs less than 1.8% CPU utilization on the target machine
SUMMARY
Best option in Docker for NVMe drive performance: Overlayfs + XFS + data volume
Best strategy for dividing resources using cgroups: control only the CPU resources
Scaling Docker containers on NVMf: throughput within 6%-12% of DAS; latency 2%-15% longer than DAS

THANK YOU! [email protected]