Dynamic I/O Congestion Control in Scalable Lustre File System

Yingjin Qian [1], Ruihai Yi [1], Yimo Du [2], Nong Xiao [2], Shiyao Jin [2]
[1] Satellite Marine Tracking & Control Department of China
[2] State Key Laboratory of High Performance Computing, National University of Defense Technology

HPC and Lustre
• Scalability is important and challenging for data storage systems.
• Storage clustering is an effective method for storage scalability.
• Lustre has good scalability.
• Lustre is the leading clustered file system in the HPC market.

Lustre File System Architecture
• Three components
  – Metadata servers (MDS)
  – Object storage servers (OSS)
  – Clients
• One MDS per file system manages one metadata target (MDT)
• OSTs store file data objects

Lustre I/O system
• Lustre lite
  – Hooks with VFS to present VFS interfaces
• Logical Object Volume (LOV)
  – Functions as a virtual object-based device for the upper layer, stacked upon OSCs
• import/export pair
  – Manages the stateful connections and sets up the communication channel between a client and a server
• Distributed Lock Manager (DLM)
  – Supports fine-grained locking for efficient concurrent file access
[Figure: client I/O stack. Lustre lite sits above LOV, which is stacked on OSC-1 … OSC-N; each OSC import connects over the network to an export on OST-1 … OST-N, which hold the data objects.]

Lustre Scalable I/O model
• Large requests are common
• Each client-side import has an I/O controller
• Each client is given an I/O service quota: request concurrency credits (RCC)
[Figure: client-0, client-1 … client-N, each with an OSC and a client-side I/O controller, send I/O pages over the network to the OST (storage server). Server-side labels: I/O request queue, request scheduler, thread pool of I/O service threads, backend file system / disk system.]

Process I/O request in parallel
• A write RPC is shown in the figure.
• Command requests are separated from data requests.
• Requests to the server can be processed in parallel.
[Figure: write RPC between client, server, and disk. Labels: I/O request, Enqueue, Tw, Dequeue, Bulk GET, Tb, RDMA data, Tp, Td, Disk I/O, I/O reply, I/O service thread context.]

Parallel I/O Processing
• All requests from different clients to the backend storage can be scheduled according to the elevator algorithm.
• Compared to traditional network file systems, Lustre is more scalable.
[Figure: server I/O queue, scheduler, I/O threads, storage.]

Congestive collapse
• Storage is slow.
• The RPC queue is long.
• The timeout value is not set long enough.
[Figure: two server panels (scheduler, storage). Labels: Reply, Time Out, Overflow.]

What are the solutions?
• Set the timeout value longer
  – Cray Jaguar system: 600 seconds
  – Cons
    • Makes the failure detection mechanism less responsive
    • Increases the recovery and failover time
• Solution
  – Limit the number of outstanding I/O requests on the server
• Goal
  – Keep the latency from exceeding the timeout value
  – Prevent the occurrence of congestive collapse

Analytic I/O model
• N: maximal number of I/O service threads
• Q: queue depth on the server
• D: queue depth on the server
• C: the number of active I/O clients
• A simple static RCC cannot achieve good queue length control.
[Figure: write RPC between server and disk. Labels: I/O request, Trequest, Enqueue, Tw, Dequeue, Bulk GET, Tb, RDMA data, Tp, Td, Disk I/O, Treply, I/O reply, I/O service thread context, Total I/O.]

Dynamic I/O congestion control
1. OSC import send: check RCC left
2. OST export receive: queue the new I/O request
3. OST export send: call the RCC calculation policy
4. OSC import receive: update RCC
[Figure: four panels, one per step, each showing an OSC import sending an I/O request to an OST export. Credit labels: RCC, RCC-1, New RCC.]

RCC Calculation Policy
• Big RCC
  – Increases the locality of the disk access pattern
  – The latency of a single I/O can be excessive
• Small RCC
  – Decreases the latency of a single I/O
  – Impairs the locality of the disk access pattern
• Goal
  – Stay within the latency bound
  – Make RCC as large as possible
  – Avoid congestive collapse based on the length of the server queue

Implementation
• Implemented our congestion control on the Tianhe-1 supercomputer system
• Configuration of Tianhe-1 and the corresponding Lustre file system
  – Scale: 80 compute cabinets including 2560 compute nodes + 512 operation nodes
  – Node configuration: 2 AMD GPUs + 2 Intel Xeon processors; 32 GB memory
  – Interconnect: Mellanox InfiniBand switches, 40 Gb/s
  – Storage server: Nehalem CPU; 8 GB memory; Linux kernel with Lustre-specific patches; single OST bandwidth 206 MB/s
  – Storage space capacity: 1 PB

Experiment Setup
• Tianhe-1 Lustre configuration
  – Client scale: 1 to 1024
  – Synthetic workload: IOR benchmark
• Workload characterization
  – File size: fixed 512 GB
  – I/O size: 512 GB / c per client on average
  – I/O mode: File per Processor (FPP)
  – Transfer size: 1 MB
  – Default parameters: RCCmin 1, RCCmax 32, STL 60 s, CTL 0, Dlow 128

Baseline Evaluation
[Figure: static RCC scheme with different n values (c=1)]
[Figure: static RCC scheme at scale (n=8)]

Baseline Evaluation
[Figure: performance of the static RCC scheme with different RCC values at large scale]

Dynamic RCC Evaluation
[Figure: comparison between static RCC and dynamic RCC with 512 clients]

Dynamic RCC Evaluation
[Figure: comparison between static RCC and dynamic RCC with 1024 clients]

Any Questions?
• Thank you!
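
Appendix sketch: the credit exchange from the "Dynamic I/O congestion control" slide and the goals of the "RCC Calculation Policy" slide can be illustrated with a small Python model. This is only a sketch under stated assumptions, not the authors' implementation. The slides do not give the policy formula, so the even split of a queue depth budget across active clients is a hypothetical stand-in. The names `queue_depth_bound`, `compute_rcc`, and `ClientImport` are invented for illustration. The defaults `rcc_min=1` and `rcc_max=32` come from the "Experiment Setup" slide; treating the listed "STL 60s" as the latency bound and using a 0.5 s service time are further assumptions.

```python
from dataclasses import dataclass


def queue_depth_bound(service_threads: int, latency_limit_s: float,
                      service_time_s: float) -> int:
    """Deepest server queue whose last request still meets the latency limit.

    Assumption: every request takes service_time_s and service_threads
    requests are served at once, so draining a queue of depth d takes
    about d * service_time_s / service_threads seconds.
    """
    return int(latency_limit_s * service_threads / service_time_s)


def compute_rcc(active_clients: int, depth_bound: int,
                rcc_min: int = 1, rcc_max: int = 32) -> int:
    """Hypothetical policy: share the depth budget evenly, then clamp."""
    if active_clients <= 0:
        return rcc_max
    return max(rcc_min, min(rcc_max, depth_bound // active_clients))


@dataclass
class ClientImport:
    """Client-side I/O controller holding the current credit count."""
    rcc: int = 8          # arbitrary starting value for the sketch
    in_flight: int = 0    # requests sent but not yet replied to

    def try_send(self) -> bool:
        # Step 1 (OSC import send): send only if a credit is left.
        if self.in_flight >= self.rcc:
            return False
        self.in_flight += 1
        return True

    def on_reply(self, new_rcc: int) -> None:
        # Step 4 (OSC import receive): the reply frees one credit
        # and carries the RCC computed by the server in step 3.
        self.in_flight -= 1
        self.rcc = new_rcc


if __name__ == "__main__":
    bound = queue_depth_bound(service_threads=8, latency_limit_s=60,
                              service_time_s=0.5)
    for clients in (16, 64, 1024):
        print(clients, compute_rcc(clients, bound))
```

With few active clients the clamp at `rcc_max` preserves disk access locality, while at large client counts the per-client share shrinks toward `rcc_min` so that the total number of outstanding requests, at most C times RCC, stays near the depth budget.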
