Improving Copy-On-Write Performance in Container Storage Drivers

Improving Copy-on-Write Performance in Container Storage Drivers Frank Zhao*, Kevin Xu, Randy Shain EMC Corp (Now Dell EMC) 2016 Storage Developer Conference. © EMC Corp All Rights Reserved. Disclaimer THE TECHNOLOGY CONCEPTS BEING DISCUSSED AND DEMONSTRATED ARE THE RESULT OF RESEARCH CONDUCTED BY THE ADVANCED RESEARCH & DEVELOPMENT (ARD) TEAM FROM THE EMC OFFICE OF THE CTO. ANY DEMONSTRATED CAPABILITY IS ONLY FOR RESEARCH PURPOSE AND AT A PROTOTYPE PHASE, THEREFORE : THERE ARE NO IMMEDIATE PLANS NOR INDICATION OF SUCH PLANS FOR PRODUCTIZATION OF THESE CAPABILITIES AT THE TIME OF PRESENTATION. THINGS MAY OR MAY NOT CHANGE IN THE FUTURE 2 2016 Storage Developer Conference. © EMC Corp All Rights Reserved. Outline Container (Docker) Storage Drivers Copy-on-Write Performance Drawback Our Solution: Data Relationship and Reference Design & Prototyping with DevMapper Test Results Launch storm Data access IO heatmap Summary and Future Work 3 2016 Storage Developer Conference. © EMC Corp All Rights Reserved. Container/Docker Image and layers Running container’s logical view An image references a list of read-only layers File 1 File 2 File 3 File 4 File 5 File 6 or differences (deltas) A container begins as a R/W snapshot of the underlying image L3 File 1 File 4 File 6 I/O in the container causes data to be copied up (CoW) from File 1 File 2 File 3 L2 image CoW granularity may L1 File 2 File 5 File 6 be file (e.g. AUFS), or block (e.g. DevMapper) Image and its layers (e.g. Ubuntu 15.04) CoW impacts I/O performance 4 2016 Storage Developer Conference. © EMC Corp All Rights Reserved. Current Docker Storage Drivers Pluggable storage driver design Configurable at start of daemon Shared for all containers/images Using snapshot and copy-on-write for space efficiency Driver Backing FS Linux distribution AUFS Any (EXT4, XFS) Ubuntu, Debian Device Mapper Any (EXT4, XFS) CentOS, RHEL, Fedora OverlayFS Any(EXT4, XFS) CoreOS ZFS ZFS (FS+vol, Block CoW) Extra install Btrfs Btrfs (block CoW) Not very stable, Docker has no commercial support 5 2016 Storage Developer Conference. © EMC Corp All Rights Reserved. Recommendations From Docker https://docs.docker.com/engine/userguide/storagedriver/selectadriver/ NO single driver is well suited to every use-case Considering: Stability, Expertise, Future-proofing “Use the default driver for your distribution” 6 2016 Storage Developer Conference. © EMC Corp All Rights Reserved. Deployment in Practical Environments Likely DM & AUFS are the most widely used Mature, stable DM is in kernel Expertise Widely available CentOS, RHEL, Fedora Ubuntu, Debian … 7 2016 Storage Developer Conference. © EMC Corp All Rights Reserved. CoW Performance Penalty DM/AUFS as example DevMapper (dm-thin) AUFS Lookup Good: Single layer (Btree split) Bad: Cross-layer traverse I/O Good: Block level CoW Good: Share page cache Bad: Can’t share (page) cache Bad: File level CoW Memory Bad: Multiple data copies Good: Single/shared data copy efficiency 8 2016 Storage Developer Conference. © EMC Corp All Rights Reserved. CoW Performance Penalty Example Initial copy-up from image to container adds I/O latency Sibling containers suffers copy-up again even accessing the same data First read by C1 penalized by CoW Re-read of the same data by C1 satisfied from C1’s page cache (20X faster) First read by C2 penalized again due to DM CoW (CentOS, HDD, DM-thin) MB/s C1: Read MB/s MB/s C2: Read MB/s 3000 3000 2500 2500 2500 2500 2000 2000 1500 1500 1000 1000 500 118 500 120 0 0 9 C1.first read C1.re-read C2 .first read C2. re-read 2016 Storage Developer Conference. © EMC Corp All Rights Reserved. Our Goals Target dense containers(snaps) environment Multiple containers from same image, similar workload Improve common CoW performance Speedup cross-layer lookup (akin to AUFS) Reduce disk IO (akin to DM) Future, facilitate single mem copy Software atop of existing CoW infrastructure NO extra data cache: but efficient metadata Cross-container: C1C2, C2C3, … Cross-image: e.g. between Ubuntu and Tensorflow As long as derived from the same base layer Good scalability 10 2016 Storage Developer Conference. © EMC Corp All Rights Reserved. DRR: Data Relationship & Reference *Research project (i.e. not mature enough for production) A metadata describes latest ownership of data Layer relationship: A tree for base/snap relationship, R/W containers reside at leaf Lookup layer tree: when driver indicates data is shared Update layer tree: when create/delete a layer E.g. image pull, commit changes, or delete image Data access record: tracks latest IO on shared data Record: {LBN, Len, srcLayer, targetLayer, pageBitmap} DRR common operations: Add/update: add new record when IO done on shared data Lookup: for new IO (read, small write) on shared data to reference from recent valid page cache, instead of disk Remove: write IO invalidates and removes the record since data is now owned by that container 11 2016 Storage Developer Conference. © EMC Corp All Rights Reserved. DRR I/O Diagram File-A File-A More containers C3C3C3 Kernel C1 3 C2 6 5 4 Global Shared DRR per host 2 DRR metadata (logically per snapFamily) 1 CoW infrastructure (DevMapper, AUFS, ….) 1. C1 first time accesses file data, loads from disk 2. Adds record to DRR metadata after IO 3. Return data to C1 4. C2 to access the same data 5. Look up DRR metadata for valid reference 6. Memory copy data from C1 page cache via normal file interface at corresponding offset, update DRR to reflect most recent reference 12 2016 Storage Developer Conference. © EMC Corp All Rights Reserved. Linux DM-thin One of DM target drivers, in kernel for 5Y, mature enough Shared block pool for multiple thin devices Internally single layer (metadata CoW) so no traverse Nice support many snaps in depth! (snap-of-snap) Metadata (Btree) and data CoW Granularity: default 64KB chunk 2-level Btree Dev4 Dev-ID Dev Tree Dev3 ThinDev2 ThinDev5 FS-LBN Base Dev(ID=1) Block Tree Thin Pool PBN (in 64KB) 13 2016 Storage Developer Conference. © EMC Corp All Rights Reserved. DRR Key Structures: LayerTree, AccessTable Layer_Tree Access_Table (hierarchy) (LBN -> recent access entry) PoolBase (one per host) DevID1 bio.LBN SnapFamily1 Hash ID2 ID3 RO layer Chunk Chunk1 Chunk2 ID5 ID4 Versions & Ref ID6 R/W layer OwnerDevID OwnerDevID refDevID[] refDevID[] ID7 ID8 ID9 ID10 pageRefList pageRefList state state Ubuntu Tensorflow 14 2016 Storage Developer Conference. © EMC Corp All Rights Reserved. DRR with DM-Thin Transparent to Docker/App Kernel module, between Docker/FS and DM-thin No change to core DM-thin CoW essential Leverage existing metadata, hierarchy, addressing … Determine block is shared or not. dm_thin_find_block() Break block sharing: break_sharing() File data, and shared block only Current Btree Mapping Actual-PBN (40bit) CTime(24bit) BIO Denote owner dev; page DevID Address_space 1:1 w/ DevID LBN address_space *mapping inode *host page fileOffset … 15 2016 Storage Developer Conference. © EMC Corp All Rights Reserved. New I/O Flow with DRR App, FileSystem (XFS, Ext4) bio(dev, LBN, page ) DRR + DM-thin Lookup Btree Shared Blk? && N (Read || small-write) RD WR Y Lookup DRR Miss (In: DevID, LBN) Disk IO Break-sharing Hit Memcpy from recent Add DRR entry Remove page cache (LBN, curDevID, DRR entry (file ino, offset from ownerDevID) current bio, clean) 16 2016 Storage Developer Conference. © EMC Corp All Rights Reserved. DRR: Preliminary Benchmark of PoC Kernel driver integrated w/ DM-thin 1800+ LOC, No Docker code change Env: CentOS7, Docker1.10, DM-thin (64KB chunk), Ext4 HDD based system E5410 @ 2,33GHz, 8 cores, 16GB DRAM 600GB SATA disk configured as LVM-direct CentOS7, kernel 3.10.0-327 PCIe SSD based system E5-2665 @ 2.40GHz, 32 cores, 48 GB DRAM OCZ PCIe SSD RM84 (1.2TB) as LVM-direct CentOS7, Kernel 3.10.229 17 2016 Storage Developer Conference. © EMC Corp All Rights Reserved. DRR: Launch Storm Test (HDD) Launch TensorFlow containers Each TensorFlow container needs to load ~104MB trained model & parameters Results: up to 47% faster with DRR Time (Sec) Baseline DRR 200 200 150 142 98 100 Seconds 66 49 50 24 26 13 14 6 5 7 0 2 4 8 16 32 64 18 Number of containers 2016 Storage Developer Conference. © EMC Corp All Rights Reserved. Container Launch Internals Steps to start a container 1. Create new thin dev: DM-thin locking/serialization 2. Mount Ext4 FS Out of current DRR scope FS metadata blocks, not file data 3. Binary file and libraries: Handled through DRR 4. Config file, parameters data …: 19 2016 Storage Developer Conference. © EMC Corp All Rights Reserved. DRR: Container I/O Test (HDD) 10X faster (~120MB/s -> 1.2GB/s) Launch C1, C2 & C3 in order; then rm C1 and launch C4 Issue same IO to read image big file on each container local page cache hit gets 2.5GB/s (1.2/2.5 = 48%, CPU: 2%) MB/s CPU 2500 Read I/O BW12% & CPU2500 2500 12% 10% 10% 2000 10% 2000 8% 8% 1500 1500 1200 1100 1100 6% 6% 1000 5% 5% 1000 5% 4% PC hit 4% 500 500 2% 2% 2% 118 rm C1 0 warmup 0% 0 0% 1 20 1C1 C22 C33 C44 C4.re-read 20 2016 Storage Developer Conference. © EMC Corp All Rights Reserved. DRR: Computing + IO test (HDD) 2.7X faster in execution time: md5sum <big file> Higher CPU (+4% vs. local PC hit, due to block bio + DRR) Md5sum (Time & CPU) Time(Sec) CPU 6 18% 16% 16% 5 4.8 4.8 14% 4 12% 12% 10% 3 8% 2 1.78 1.58 6% 5% 5% 4% 1 Baseline DRR ON DRR hit PC hit DRR off Warm up 2% 0 0% 21 C1.md5 C2.md5 C3.md5 C3.md5 re-run 2016 Storage Developer Conference.

Improving Copy-On-Write Performance in Container Storage Drivers

Copy on Write Based File Systems Performance Analysis and Implementation

The Page Cache Today’S Lecture RCU File System Networking(Kernel Level Syncmem

The Kernel Report

Rootless Containers with Podman and Fuse-Overlayfs

Dm-X: Protecting Volume-Level Integrity for Cloud Volumes and Local

ECE 598 – Advanced Operating Systems Lecture 19

How to Create a Custom Live CD for Secure Remote Incident Handling in the Enterprise

Dynamically Tuning the JFS Cache for Your Job Sjoerd Visser Dynamically Tuning the JFS Cache for Your Job Sjoerd Visser

In Search of the Ideal Storage Configuration for Docker Containers

MINCS - the Container in the Shell (Script)

Kernel Boot-Time Tracing

Linux Perf Event Features and Overhead