PERFORMANCE ANALYSIS OF CONTAINERIZED APPLICATIONS ON LOCAL AND REMOTE STORAGE
Qiumin Xu (USC), Manu Awasthi (IIT Gandhinagar), Krishna T. Malladi (Samsung), Janki Bhimani (Northeastern), Jingpei Yang (Samsung), Murali Annavaram (USC)

1 Docker Becomes Very Popular

Software container platform with many desirable features

Ease of deployment, developer friendliness and light weight
Mainstay in cloud platforms

Google Cloud Platform, Amazon EC2, Microsoft Azure
Storage hierarchy is the key component

High Performance SSDs

NVMe, NVMe over Fabrics

2 Agenda

Docker, NVMe and NVMe over Fabrics (NVMf)
How to best utilize NVMe SSDs for a single container?

The best configuration performs similarly to the raw device

Where do the performance anomalies come from?
Do Docker containers scale well on NVMe SSDs?

Exemplify using Cassandra

Best strategy to divide the resources
Scaling Docker containers on NVMe over Fabrics

3 What is a Docker Container?

In a virtual machine, each virtualized application includes an entire guest OS (~10s of GB)

Docker container comprises just application and bins/libs

Shares the kernel with other containers

Much more portable and efficient
(figure from https://docs.docker.com)

4 Non-Volatile Memory Express (NVMe)

A storage protocol standard on top of PCIe
NVMe SSDs connect through PCIe and support the standard

Available since 2014 (e.g., from Samsung)

Enterprise and consumer variants
NVMe SSDs leverage the interface to deliver superior performance

5X to 10X over SATA SSDs [1]

[1] Qiumin Xu et al., "Performance Analysis of NVMe SSDs and Their Implication on Real World Databases," SYSTOR '15
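As a quick sanity check that an NVMe SSD is attached over PCIe and exposed through the NVMe driver, the standard nvme-cli and lspci tools can be used; the device paths below are illustrative assumptions, not part of the original setup description.

  # List NVMe controllers/namespaces known to the kernel driver
  nvme list
  # Confirm the SSD's PCIe attachment (vendor/device strings will vary)
  lspci | grep -i "non-volatile memory"
  # Query controller identity data for the first controller (path is an assumption)
  nvme id-ctrl /dev/nvme0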

5 Why NVMe over Fabrics (NVMf)?

Retains NVMe performance over network fabrics
Eliminates unnecessary protocol translations
Enables low-latency, high-IOPS remote storage

J. Metz and D. Minturn, "Under the Hood with NVMe over Fabrics," SNIA Ethernet Storage Forum

6 Storage Architecture in Docker

Storage Options:
1. Through Docker Filesystem (Aufs, Btrfs, Overlayfs)
2. Through Virtual Block Devices (2.a Loop-lvm, 2.b Direct-lvm)
3. Through Docker Data Volume (-v)

[Figure: Docker storage architecture. Container read/write operations reach the NVMe SSDs through (1) the storage driver (Aufs, Btrfs, Overlayfs; -g option), (2) Devicemapper virtual block devices (2.a loop-lvm over sparse files, 2.b direct-lvm over a base device / thin pool), or (3) a data volume (-v option) on the host backing filesystem (EXT4, XFS, etc.).]

7 Optimize Storage Configuration for a Single Container

Experimental Environment

Dual-socket Xeon E5-2670 v3 (12 HT cores per socket)
Enterprise-class NVMe SSD: Samsung XS1715
Linux kernel v4.6.0, Docker v1.11.2
fio used for traffic generation
Asynchronous IO engine (libaio), 32 concurrent jobs, iodepth of 32
Measure steady-state performance
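A minimal fio invocation matching the parameters described above might look like the sketch below; the target device path and runtime are illustrative assumptions rather than the authors' exact job file.

  # Steady-state random-read job against the raw NVMe device
  # (/dev/nvme0n1 and the 300s runtime are illustrative assumptions)
  fio --name=rr --filename=/dev/nvme0n1 --ioengine=libaio --direct=1 \
      --rw=randread --bs=4k --numjobs=32 --iodepth=32 \
      --group_reporting --time_based --runtime=300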

8 Performance Comparison — Host Backing Filesystems

[Figure: Average IOPS (RR, RW) and average bandwidth in MB/s (SR, SW) for the raw device, EXT4, and XFS backing filesystems.]

EXT4 performs 25% worse for RR
XFS closely resembles RAW for all workloads but RW

9 Tuning the Performance Gap — Random Reads

[Figure: RR IOPS for EXT4 with default mount options vs. with dioread_nolock, compared to XFS.]

XFS allows multiple processes to read a file at once
Uses allocation groups which can be accessed independently
EXT4 requires mutex locks even for read operations
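A plausible way to apply this tuning is the EXT4 dioread_nolock mount option, which avoids holding the inode mutex during direct-IO reads; the device and mount point below are illustrative assumptions.

  # Mount the EXT4 backing filesystem with lock-free direct-IO reads
  # (/dev/nvme0n1 and /mnt/ext4 are illustrative paths)
  mkfs.ext4 /dev/nvme0n1
  mount -t ext4 -o dioread_nolock /dev/nvme0n1 /mnt/ext4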

10 Tuning the Performance Gap —Random Writes

[Figure: RW IOPS for RAW, EXT4, and XFS as the number of fio jobs varies from 1 to 64.]

XFS performs poorly with high thread count
Contention in exclusive locking kills the write performance

Exclusive locks are used by lookup and write checks

A patch is available, but not for kernel 4.6 [1]

[1] https://www.percona.com/blog/2012/03/15/ext4-vs-xfs-on-ssd/
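A simple sweep like the following, with illustrative paths and runtime, reproduces the kind of job-count scaling experiment behind the chart above.

  # Sweep the number of concurrent random-write jobs on an XFS-backed file
  # (/mnt/xfs/testfile, file size, and runtime are illustrative assumptions)
  for jobs in 1 2 4 8 16 24 28 32 48 64; do
      fio --name=rw_${jobs} --filename=/mnt/xfs/testfile --size=10G \
          --ioengine=libaio --direct=1 --rw=randwrite --bs=4k \
          --numjobs=${jobs} --iodepth=32 --group_reporting \
          --time_based --runtime=60
  done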

11 Storage Architecture in Docker

Storage Options:
1. Through Docker Filesystem (Aufs, Btrfs, Overlayfs)
2. Through Virtual Block Devices (2.a Loop-lvm, 2.b Direct-lvm)
3. Through Docker Data Volume (-v)

[Figure: Docker storage architecture diagram (repeated from slide 6).]

12 Docker Storage Options

Option 1: Through Docker File System

Aufs (Advanced multi-layered Unification FileSystem):

A fast, reliable unification file system
Btrfs (B-tree file system):

A modern CoW file system which implements many advanced features for fault tolerance, repair and easy administration
Overlayfs:

Another modern unification file system, which has a simpler design and is potentially faster than Aufs
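On the Docker 1.11 release used in this study, the storage driver and its on-disk location are chosen when the daemon starts; a minimal sketch, assuming the NVMe-backed filesystem is mounted at /mnt/nvme:

  # Put Docker's image/container storage on the NVMe-backed filesystem
  # and select Overlayfs as the storage driver (option 1 above)
  docker daemon --storage-driver=overlay --graph=/mnt/nvme/docker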

13 Performance Comparison

Option 1: Through Docker File System

[Figure: Average IOPS (RR, RW) and average bandwidth in MB/s (SR, SW) for the raw device, Aufs, Btrfs, and Overlayfs.]

Aufs and Overlayfs perform close to the raw block device in most cases

Btrfs has the worst performance for random workloads

14 Tuning the Performance Gap of Btrfs — Random Reads

[Figure: Random-read bandwidth (MB/s) for RAW, EXT4, and Btrfs across block sizes.]

Btrfs doesn't work well for small block sizes yet

Btrfs must read the file extent before reading the file data.

Large block size reduces the frequency of reading metadata

15 Tuning the Performance Gap of Btrfs — Random Writes

[Figure: RW IOPS for Btrfs with default mount options vs. with nodatacow.]

Btrfs doesn’t work well for random writes due to CoW overhead
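One way to apply this tuning is the Btrfs nodatacow mount option, which disables copy-on-write for newly written data; the device and mount point are illustrative assumptions.

  # Disable copy-on-write for data on the Btrfs backing filesystem
  # (/dev/nvme0n1 and /mnt/btrfs are illustrative paths)
  mkfs.btrfs /dev/nvme0n1
  mount -t btrfs -o nodatacow /dev/nvme0n1 /mnt/btrfs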

16 Storage Architecture in Docker

Storage Options:
1. Through Docker Filesystem (Aufs, Btrfs, Overlayfs)
2. Through Virtual Block Devices (2.a Loop-lvm, 2.b Direct-lvm)
3. Through Docker Data Volume (-v)

[Figure: Docker storage architecture diagram (repeated from slide 6).]

17 Docker Storage Configurations

Option 2: Through Virtual Block Device

The Devicemapper storage driver leverages the thin provisioning and snapshotting capabilities of the kernel-based Device Mapper framework
Loop-lvm uses sparse files to build the thin-provisioned pools

Direct-lvm uses a block device to directly create the thin pools (recommended by Docker)
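A hedged sketch of a direct-lvm setup in the Docker 1.11 timeframe: an LVM thin pool is carved out of the NVMe device and handed to the Devicemapper driver. The volume group name, pool name, and pool size are illustrative assumptions.

  # Build an LVM thin pool on the NVMe device (names and sizes are illustrative)
  pvcreate /dev/nvme0n1
  vgcreate docker /dev/nvme0n1
  lvcreate --type thin-pool -n thinpool -l 90%VG docker

  # Point the Devicemapper driver at the thin pool instead of loopback sparse files
  docker daemon --storage-driver=devicemapper \
      --storage-opt dm.thinpooldev=/dev/mapper/docker-thinpool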

18 Docker Storage Configurations

Option 3: Through Docker Data Volume (-v)

Data persists beyond the lifetime of the container and can be shared and accessed from other containers

* figure from https://github.com/libopenstorage/openstorage
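A minimal sketch of option 3: bind-mount a directory from the NVMe-backed host filesystem into the container with -v; the paths and image tag are illustrative assumptions.

  # Bind-mount an NVMe-backed host directory into the container as /data
  # (/mnt/nvme/data and the ubuntu:16.04 image are illustrative)
  docker run --rm -it -v /mnt/nvme/data:/data ubuntu:16.04 bash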

19 Performance Comparison

Option 2 & Option 3

[Figure: Average IOPS (RR, RW) and average bandwidth in MB/s (SR, SW) for RAW, Direct-lvm, Loop-lvm, and data volumes (-v) on Aufs and Overlayfs.]

Direct-lvm has worse performance for RR/RW
LVM, device mapper, and the dm-thinp kernel module introduce additional code paths and overhead, and may not suit IO-intensive workloads

20 Application Performance

Cassandra

NoSQL database
Scales linearly with the number of nodes in the cluster (theoretically) [1]
Requires data persistence

Uses a Docker data volume to store data

[1] Rabl, Tilmann et al. "Solving Big Data Challenges for Enterprise Application Performance Management”, VLDB’13

21 Scaling Docker Containers on NVMe

Multiple containerized Cassandra databases

Experiment Setup
Dual-socket Xeon E5 server, 10Gb Ethernet
N = 1, 2, 3, ..., 8 containers
Each container is driven by a YCSB client (see the sketch after the workload list)
Record count: 100M records, 100GB in each DB
Client thread count: 16

Workloads

Workload A: 50% read, 50% update, Zipfian distribution
Workload D: 95% read, 5% insert, normal distribution
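A hedged sketch of how one instance in this setup could be launched and driven, assuming the official cassandra image and the YCSB cassandra-cql binding; the container name, image tag, paths, and port mapping are illustrative, and these are not the authors' exact scripts.

  # Start one containerized Cassandra instance with its data on an NVMe-backed volume
  docker run -d --name cass1 -p 9042:9042 \
      -v /mnt/nvme/cass1:/var/lib/cassandra cassandra:3.0

  # Load 100M records, then run Workload A with 16 client threads
  ./bin/ycsb load cassandra-cql -P workloads/workloada \
      -p hosts=127.0.0.1 -p recordcount=100000000 -threads 16
  ./bin/ycsb run cassandra-cql -P workloads/workloada \
      -p hosts=127.0.0.1 -p recordcount=100000000 -threads 16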

22 Results-Throughput

Workload D, directly attached SSD

[Figure: Aggregate throughput (ops/sec) vs. number of Cassandra containers (1-8), with per-container contributions C1-C8 stacked.]

Aggregated throughput peaks at 4 containers

Cgroup limits per container: 6 CPU cores, 6 GB memory, 400 MB/s bandwidth

23 Strategies for Dividing Resources

[Figure: Throughput (ops/sec) vs. number of Cassandra containers when limiting CPU, MEM, CPU+MEM, BW, all resources, or leaving them uncontrolled.]

MEM has the most significant impact on throughput
Best strategy for dividing resources using cgroups: assign 6 CPU cores to each container and leave the other resources uncontrolled (see the sketch below)
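A hedged sketch of how these cgroup limits map to docker run flags; the core range, memory cap, and bandwidth value follow the slide's configuration, while the container name, device path, and image are illustrative assumptions.

  # Best strategy found above: pin each container to 6 CPU cores only
  docker run -d --name cass1 --cpuset-cpus=0-5 \
      -v /mnt/nvme/cass1:/var/lib/cassandra cassandra:3.0

  # Flags used by the other strategies (memory and write-bandwidth caps):
  #   --memory=6g                              6 GB memory limit
  #   --device-write-bps /dev/nvme0n1:400mb    400 MB/s write-bandwidth limit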

24 Scaling Containerized Cassandra using NVMf

Experiment Setup

Cassandra + Docker

[Figure: YCSB clients connect to the application server (Cassandra + Docker) over 10GbE; the application server accesses the NVMf target storage server over 40GbE.]
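A hedged sketch of how the application server could attach the remote namespace with nvme-cli over RDMA; the target address, port, and NQN are illustrative assumptions.

  # On the application server: load the NVMe-over-Fabrics RDMA initiator
  modprobe nvme-rdma

  # Discover and connect to the subsystem exported by the storage server
  # (address, port, and NQN are illustrative assumptions)
  nvme discover -t rdma -a 192.168.1.10 -s 4420
  nvme connect -t rdma -a 192.168.1.10 -s 4420 -n nqn.2016-06.io.example:nvmf-target
  # The remote namespace then appears as a local block device, e.g. /dev/nvme1n1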

25 Results-Throughput

[Figure: Relative throughput (TPS) vs. number of Cassandra instances for DAS and NVMf under Workloads A and D.]

The throughput with NVMf is within 6%-12% of that with directly attached SSDs

26 Results-Latency

[Figure: Relative latency vs. number of Cassandra instances for DAS and NVMf under Workloads A and D.]

NVMf incurs only 2%-15% longer latency than directly attached SSDs.

27 Results-CPU Utilization

NVMf incurs less than 1.8% CPU utilization on the target machine

28 SUMMARY

Best Option in Docker for NVMe Drive Performance
Overlayfs + XFS + Data Volume

Best Strategy for Dividing Resources using Cgroups
Control only the CPU resources

Scaling Docker Containers on NVMf
Throughput: within 6% - 12% vs. DAS
Latency: 2% - 15% longer than DAS

THANK YOU!
[email protected]
