Lustre: the Future – Peter Braam

Lustre Today

Scalable to 10,000's of CPUs: Scalable architecture provides users more than enough headroom for future growth - simple incremental HW addition
Supports Multiple Network Protocols: Native high-performance network IO guarantees fast client & server side I/O; routing enables a multi-cluster, multi-network namespace
SW-only Platform: HW agnosticism allows users flexibility of choice - eliminates vendor lock-ins

Proven Capability: Experience in multiple industries; 100s of deployed systems demonstrate production quality that can exceed 99% uptime

Failover Capable: Lustre's failover protocols eliminate single points of failure and increase system reliability
POSIX Compliant: Users benefit from strict semantics and data coherency
Broad Industry Support: With the support of many leading OEMs and a robust user base, Lustre is an emerging standard with a very promising future

Lustre Architecture Today

Architecture diagram: Lustre clients (10,000s) reach the servers over multiple networks supported simultaneously - Quadrics Elan, Myrinet GM/MX, InfiniBand, Gigabit Ethernet, and 10Gb Ethernet. Lustre MetaData Servers (MDS 1 active, MDS 2 standby) are paired for failover. Lustre Object Storage Servers (OSS 1 through OSS 7 in the diagram; 100s in practice) use direct-connected commodity storage servers and arrays or enterprise-class storage arrays and SAN fabrics, also arranged in failover pairs.

Lustre Today: version 1.4.6

Scalability: Clients: 10,000 (130,000 CPUs); Servers: 256 OSSs; MetaData Servers: 1 (+ failover); Files: 500M
Performance: Single Server: 2 GB/s+; Single Client: 2 GB/s+; Single Namespace (file system): 100 GB/s; File Operations: 9,000 ops/second
Redundancy: No single point of failure with shared storage
Security: POSIX ACLs
Management: Quota
Kernel/OS Support: RHEL 3, RHEL 4, SLES 9; patches required
NFS/CIFS: Limited NFS support; full CIFS support
POSIX Compliant: Yes
User Profile: Primarily high-end HPC

Intergalactic Strategy

Roadmap (from HPC scalability in the near & mid term toward Enterprise Data Management):

• Lustre v1.6 (mid-late 2006), building on Lustre v1.4: Online Server Addition, Simple Configuration, Kerberos, Lustre RAID, Patchless Client; NFS Export
• Lustre v2.0 (late 2007): 1 TB/s, 30 GB/s mixed files, Clustered MDS, 5-10X Metadata Improvement, WB caches, Small files, Pools, Snapshots, Backups
• Lustre v3.0 (~2008/9): 10 TB/s, 1 Trillion files, 1M file creates / sec, HSM, 1 PFlop Systems, Proxy Servers, Disconnected operation

Near & mid term

12 month strategy

 Beginning to see much higher quality

 Adding desirable features and improvements
 Some re-writes (recovery)

 Focus largely on doing existing installations better

 Much improved Manual
 Extended training
 More community activity

In testing now

 Improved NFS performance (1.4.7)
 Support for NFSv4 (pNFS when available)
 Broader support for data center HW; site-wide storage enablement

 Networking (1.4.7, 1.4.8)
   OpenFabrics (formerly OpenIB Gen 2) support
   Cisco InfiniBand
   Myricom MX

 Much simpler configuration (1.6.0)

 Clients without patches (1.4.8)

 Large partition support (1.4.8)

 Linux software RAID5 improvements (1.6.0)

More 2006

 Metadata Improvements (several … from 1.4.7)
   Client-side metadata & OSS read-ahead
   More parallel operations

 Recovery improvements (see the sketch below)
   Move from timeouts to health detection
   More recovery situations covered
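As a rough illustration of health detection replacing fixed timeouts (a generic sketch, not the Lustre recovery code; the heartbeat interval, miss threshold, and HealthMonitor class are assumptions):

```python
# Generic health-detection sketch: declare a server unhealthy after missed
# heartbeats instead of waiting for a long per-request timeout.
import time

HEARTBEAT_INTERVAL = 1.0   # seconds between expected heartbeats (assumed)
MAX_MISSED = 3             # missed heartbeats before triggering recovery

class HealthMonitor:
    def __init__(self):
        self.last_seen = {}            # server -> timestamp of last heartbeat

    def heartbeat(self, server):
        self.last_seen[server] = time.monotonic()

    def unhealthy(self):
        """Servers whose heartbeats have stopped; candidates for recovery."""
        now = time.monotonic()
        limit = HEARTBEAT_INTERVAL * MAX_MISSED
        return [s for s, t in self.last_seen.items() if now - t > limit]

monitor = HealthMonitor()
monitor.heartbeat("OSS3")
# ... a recovery thread would poll monitor.unhealthy() and start failover or
# replay for the listed servers rather than waiting for RPC timeouts.
```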

 Kerberos5 Authentication, OSS capabilities

 Remote UID handling
 WAN: Lustre awarded at SC|05 for long-distance efficiency; 5 GB/s over 200 miles on an HP system

Late 2006 / early 2007

 Manageability
   Backups – restore files with exact or similar striping
   Snapshots – use LVM with distributed control
   Storage Pools

 Trouble-shooting
   Better error messages
   Error analysis tools

 High Speed CIFS Services – pCIFS Windows client

 OS X Tiger client – no patches to OS X

 Lustre Network RAID 1 and RAID 5 across OSS Stripes

Pools – naming a group of OSSs

 Heterogeneous servers
   Old and new OSSs
   Homogeneous QOS
 Stripe inside pools
 Pools – name an OST group
 Associate qualities with a pool
 Policies
   A subdirectory must place stripes in a pool (setstripe)
   A Lustre network must use one pool (lctl)
   Destination pool depends on file extension (see the sketch below)
   Keep LAID stripes in a pool

Diagram: Pool A (OSS 2, OSS 3, OSS 4, with failover) uses commodity SAN or disks; Pool B (OSS 5, OSS 6, OSS 7) uses enterprise-class RAID storage.
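The sketch below illustrates the extension-based pool policy named above. It is a hypothetical policy helper, not Lustre code: the pool names, the extension table, and choose_pool are invented, and a real deployment would apply the chosen pool to a directory's striping (setstripe).

```python
# Hypothetical policy: map a file's extension to an OST pool.
# Pool names and the extension table are illustrative only.
import os

EXTENSION_POOLS = {
    ".log": "pool_a",      # e.g. commodity SAN/disk pool
    ".dat": "pool_b",      # e.g. enterprise-class RAID pool
}
DEFAULT_POOL = "pool_a"

def choose_pool(path: str) -> str:
    """Return the OST pool a new file should be striped into."""
    _, ext = os.path.splitext(path)
    return EXTENSION_POOLS.get(ext.lower(), DEFAULT_POOL)

print(choose_pool("/lustre/results/run42.dat"))   # pool_b
# A policy engine would then place the file's stripes in that pool, e.g. by
# creating it under a directory whose striping (setstripe) points at the pool.
```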

Lustre Network RAID

OSS redundancy today: shared storage partitions for OSS targets (OSTs)
 OSS1 – active for tgt1, standby for tgt2
 OSS2 – active for tgt2, standby for tgt1
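A minimal sketch of this active/standby pairing, assuming hypothetical Node and FailoverPair types and a simulated failure; it is not the Lustre failover protocol itself.

```python
# Minimal sketch of active/standby target failover over shared storage,
# with a simulated failure; all names here are hypothetical.
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    alive: bool = True
    serving: set = field(default_factory=set)   # targets currently mounted

@dataclass
class FailoverPair:
    primary: Node
    standby: Node
    target: str                                 # shared-storage target (OST)

    def ensure_service(self):
        """Keep the target served by exactly one live node of the pair."""
        if self.primary.alive:
            self.primary.serving.add(self.target)
            self.standby.serving.discard(self.target)
        elif self.standby.alive:
            # Primary failed: the standby mounts the shared target and takes over.
            self.standby.serving.add(self.target)
        else:
            raise RuntimeError(f"no live server for {self.target}")

oss1, oss2 = Node("OSS1"), Node("OSS2")
pairs = [FailoverPair(oss1, oss2, "tgt1"), FailoverPair(oss2, oss1, "tgt2")]
oss1.alive = False                      # simulate an OSS failure
for p in pairs:
    p.ensure_service()
print(oss2.serving)                     # OSS2 now serves both targets
```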

Lustre network RAID (LAID), late 2006: striping redundancy across multiple OSS nodes (OSS1, OSS2, … OSSX)
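To illustrate what striping redundancy across OSS nodes means, here is a generic RAID-5-style parity sketch over stripe chunks; the chunk size, layout, and helper names are assumptions, not the LAID design.

```python
# Generic RAID-5-style parity across stripe chunks, as an illustration of
# network RAID across OSS nodes; chunk size and layout are assumptions.
CHUNK = 4                                  # tiny stripe size for the example

def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def stripe_with_parity(data: bytes, n_data_oss: int):
    """Split data into chunks round-robin over n_data_oss servers, plus one
    parity chunk per stripe row that would go to an extra server."""
    chunks = [data[i:i + CHUNK].ljust(CHUNK, b"\0")
              for i in range(0, len(data), CHUNK)]
    rows = []
    for i in range(0, len(chunks), n_data_oss):
        row = chunks[i:i + n_data_oss]
        row += [bytes(CHUNK)] * (n_data_oss - len(row))   # pad short rows
        parity = row[0]
        for c in row[1:]:
            parity = xor_bytes(parity, c)
        rows.append(row + [parity])        # last element goes to the parity OSS
    return rows

rows = stripe_with_parity(b"hello lustre network raid", n_data_oss=3)
# Any single lost chunk in a row can be rebuilt by XOR-ing the others.
lost = rows[0][1]
rebuilt = xor_bytes(xor_bytes(rows[0][0], rows[0][2]), rows[0][3])
assert rebuilt == lost
```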

pCIFS deployment (diagram): Windows clients run a pCIFS redirector, a filter driver for the NT CIFS client. Over the traditional CIFS protocol, with Windows semantics, clients obtain the file layout from Samba gateways that export the Lustre namespace, and then perform parallel I/O directly using that layout. The namespace is backed by the Lustre Metadata Servers (MDS) and the Lustre Object Storage Servers (OSS).
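As a generic illustration of "parallel I/O using the file layout": once the redirector knows a file's stripe layout it can compute which object holds each byte range and issue the reads concurrently. The Layout structure, locate, and read_chunk helpers below are hypothetical, not the pCIFS wire format.

```python
# Generic illustration of parallel I/O against a striped file layout.
# The layout structure and I/O helpers are hypothetical.
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass

@dataclass
class Layout:
    stripe_size: int       # bytes per stripe chunk
    objects: list          # one backing object (on some OSS) per stripe

def locate(layout: Layout, offset: int):
    """Map a file offset to (object index, offset within that object)."""
    chunk = offset // layout.stripe_size
    obj = chunk % len(layout.objects)
    obj_off = (chunk // len(layout.objects)) * layout.stripe_size + \
              offset % layout.stripe_size
    return obj, obj_off

def read_chunk(layout, obj, obj_off, length):
    # Placeholder for a real network read from the OSS holding the object.
    return b"\0" * length

def parallel_read(layout: Layout, offset: int, length: int) -> bytes:
    """Split a byte range into per-object reads and run them concurrently."""
    jobs, pos = [], offset
    while pos < offset + length:
        obj, obj_off = locate(layout, pos)
        n = min(layout.stripe_size - pos % layout.stripe_size,
                offset + length - pos)
        jobs.append((obj, obj_off, n))
        pos += n
    with ThreadPoolExecutor() as pool:
        parts = pool.map(lambda j: read_chunk(layout, *j), jobs)
    return b"".join(parts)

layout = Layout(stripe_size=1 << 20, objects=["ost0", "ost1", "ost2"])
data = parallel_read(layout, offset=0, length=5 << 20)   # spans all objects
assert len(data) == 5 << 20
```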


Longer term

Future Enhancements

 Clustered metadata – towards 1M pure MD ops / sec

 FSCK free file systems – immense capacity

 Client memory MD WB cache – smoking performance

 Hot data migration – HSM & management

 Global namespace

 Proxy clusters – wide area grid applications
   Disconnected operation

Clustered Metadata: v2.0

Diagram: clients connect to load-balanced, clustered MDS server pools and to OSS server pools.

Previous scalability plus an enlarged MD pool: enhances NetBench/SpecFS performance and client scalability. Limits: 100's of billions of files, M-ops / sec.
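A simplified sketch of how a clustered metadata service can spread one directory's entries over several MDS nodes by hashing the entry name; the server list and hash choice are illustrative assumptions, not the actual Lustre clustered-MDS algorithm.

```python
# Generic sketch of directory striping across a metadata server cluster.
# The hash choice and server list are illustrative assumptions.
import hashlib

MDS_SERVERS = ["mds0", "mds1", "mds2", "mds3"]   # one per directory stripe

def mds_for_entry(parent_dir_id: int, name: str, stripe_count: int) -> str:
    """Pick the MDS that stores the directory entry for `name`."""
    key = f"{parent_dir_id}/{name}".encode()
    h = int.from_bytes(hashlib.sha1(key).digest()[:8], "big")
    return MDS_SERVERS[h % stripe_count]

# Creates in one big directory spread over 4 MDS nodes instead of 1, which is
# the effect behind the Dirstripe 1..4 results that follow.
for name in ("file-000001", "file-000002", "file-000003"):
    print(name, "->", mds_for_entry(parent_dir_id=42, name=name, stripe_count=4))
```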

Modular Architecture

Lustre 1.X client stack: Lustre Client File System over Logical Object Volumes (multiple OSCs and a single MDC), with lock service and network & recovery layers.

Lustre 2.X client stack: Lustre Client File System over Logical Object Volumes and a Logical MD Volume (multiple OSCs and multiple MDCs), with lock service and network & recovery layers.

Preliminary Results – creations

open(O_CREAT) - 0 objects

# Files:      1M     2M     3M     4M     5M     6M     7M     8M     9M     10M
Dirstripe 1:  3700   3620   3350   3181   3156   2997   2970   2897   2824   2796
Dirstripe 2:  7550   7356   6803   6478   6235   6070   5896   5171   5734   5618
Dirstripe 3:  10802  10528  9660   9311   8953   8597   8284   8140   8094   7666
Dirstripe 4:  14502  13966  12927  12450  11757  11459  11117  10714  10466  10267

Preliminary Results – stats

stat (random, 0 objects)

# Files:      1M     2M     3M     4M     5M     6M     7M     8M     9M     10M
Dirstripe 1:  4898   4879   4593   2502   1344   2564   1230   1229   1119   1064
Dirstripe 2:  9456   9410   9411   9384   9365   4876   1731   1507   1449   1356
Dirstripe 3:  14867  14817  14311  14828  14742  14773  14725  14736  4123   1747
Dirstripe 4:  19886  19942  19914  19826  19843  19833  19868  19926  19803  19831

Extremely large File Systems

 Remove limits (#files, sizes) & NO fsck ever

 Start moving towards "iron" ext3, like ZFS
   Work from the University of Wisconsin is a first step
   Checksum much of the data (see the sketch after this list)
   Replicate metadata
   Detect and repair corruption
   A new component handles relational corruption, caused by accidental re-ordering of writes

 Unlike a "port ZFS" approach, we expect
   A sequence of small fixes
   Each with its own benefits
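A minimal sketch of the "checksum much of the data" step, assuming an in-memory block store: a checksum is stored with each block on write and verified on read, so corruption is detected without a full fsck.

```python
# Illustrative block checksumming: detect (and report) corrupted blocks on
# read. The in-memory "disk" and block size are assumptions.
import zlib

BLOCK_SIZE = 4096
disk = {}          # block number -> (payload, stored checksum)

def write_block(blockno: int, payload: bytes) -> None:
    disk[blockno] = (payload, zlib.crc32(payload))

def read_block(blockno: int) -> bytes:
    payload, stored = disk[blockno]
    if zlib.crc32(payload) != stored:
        # A repair path would re-read a replica or reconstruct the block.
        raise IOError(f"checksum mismatch in block {blockno}")
    return payload

write_block(7, b"metadata" * 8)
corrupted, crc = disk[7]
disk[7] = (b"X" + corrupted[1:], crc)     # simulate silent corruption
try:
    read_block(7)
except IOError as e:
    print(e)                              # corruption detected, no fsck needed
```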

Lustre Client Memory Writeback Cache

 Quantum Leap in Performance and Capability

 AFS is the wrong approach
   Local server (a.k.a. cache) is slow
   Disk writes, context switches

 File I/O writeback cache exists and is step 1
 Critical new element is asynchronous file creation
   File identifiers are difficult to create without network traffic
   Everything gets flushed to the server asynchronously

 Target: 1 client: 30 GB/sec on a mix of small / big files

Lustre Writeback Cache: Nuts & Bolts

 Local FS does everything in memory
   No context switches

 Sub-tree locks

 Client capability to allocate file identifiers (see the sketch below)
   Also required for proxy clusters

 100% asynchronous server API
   Except cache miss
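A toy sketch of these nuts and bolts: the client allocates file identifiers from a locally held range, completes creates in memory, and a background thread flushes the operations to the server asynchronously. The WritebackClient class, the pre-granted FID range, and the flush callback are assumptions for illustration.

```python
# Toy client writeback cache: creates complete in client memory using a
# locally allocated file identifier, and are flushed asynchronously.
import itertools, queue, threading

class WritebackClient:
    def __init__(self, fid_range, flush_to_server):
        self._fids = itertools.count(fid_range[0])   # FIDs granted in advance
        self._pending = queue.Queue()                # operations not yet flushed
        self._flush = flush_to_server
        threading.Thread(target=self._flusher, daemon=True).start()

    def create(self, path):
        """Asynchronous create: no server round trip on the fast path."""
        fid = next(self._fids)
        self._pending.put(("create", fid, path))
        return fid

    def wait_for_flush(self):
        self._pending.join()

    def _flusher(self):
        while True:
            op = self._pending.get()       # replayed to the server in order
            self._flush(op)
            self._pending.task_done()

server_log = []
client = WritebackClient(fid_range=(1000, 2000), flush_to_server=server_log.append)
for i in range(3):
    client.create(f"/scratch/out.{i}")
client.wait_for_flush()                    # everything replayed asynchronously
print(server_log)
```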

Migration - preface to HSM…

OBJECT MIGRATION: coordinating client; migrate objects
RESTRIPING: intercept cache miss; transparent to clients
POOL MIGRATION: control migration; control configuration change; migrate groups of objects; migrate groups of inodes

Diagram: a virtual migrating MDS pool, in which a hot object migrator moves objects from MDS Pool A to MDS Pool B, and a virtual migrating OSS pool (seen by clients as a virtual OST), in which OSS2's objects move from OSS Pool A to OSS Pool B.

Cache miss handling & migration - HSM

SMFS (Storage Management File System): logs, attributes, cache miss interception
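A rough sketch of cache-miss interception in the HSM flow, with invented names (archive, restore_from_archive, read_object): an access to an object whose data has been migrated triggers a transparent restore before the read proceeds.

```python
# Sketch of cache-miss interception for HSM: objects whose data was migrated
# are restored on first access. All names are hypothetical.
archive = {"obj-17": b"cold data moved to tape"}     # migrated objects
online = {}                                          # objects with data present

def restore_from_archive(obj_id: str) -> None:
    online[obj_id] = archive.pop(obj_id)             # bring the data back

def read_object(obj_id: str) -> bytes:
    if obj_id not in online:
        # Cache miss intercepted: the object's attributes/logs say the data
        # lives in the archive, so restore it transparently to the client.
        restore_from_archive(obj_id)
    return online[obj_id]

print(read_object("obj-17"))     # first access triggers a restore
print(read_object("obj-17"))     # subsequent accesses hit online storage
```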

Proxy Servers

Diagram: remote clients connect to proxy Lustre servers for local performance; the proxies synchronize with the master Lustre servers.

1. Disconnected operation
2. Asynchronous replay, capable of re-striping
3. Conflict detection and handling (see the sketch below)
4. Mandatory or cooperative lock policy
5. Namespace sub-tree locks
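A simplified sketch of conflict detection during asynchronous replay from a proxy, using per-entry version numbers in the spirit of the subtree versioning noted below; the data structures and replay function are invented for illustration.

```python
# Simplified conflict detection for asynchronous replay from a proxy: each
# namespace entry carries a version; a replayed update conflicts if the
# master's version moved since the proxy cached it. Illustrative only.
master = {"/projects/a/report.txt": {"version": 3, "data": b"v3"}}

def replay(path, based_on_version, new_data):
    entry = master.get(path)
    if entry is not None and entry["version"] != based_on_version:
        # Someone else updated the master while the proxy was disconnected.
        return "conflict"
    master[path] = {"version": based_on_version + 1, "data": new_data}
    return "applied"

print(replay("/projects/a/report.txt", based_on_version=3, new_data=b"v4"))   # applied
print(replay("/projects/a/report.txt", based_on_version=3, new_data=b"v4b"))  # conflict
```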

Resynchronization avoids a tree walk and a server log by using subtree versioning.

Wide Area Applications

 Real/fast grid storage
 Worldwide R&D environments in a single namespace
 National security
 Internet content hosting / searching
 Work at home, offline, employee network
 Much, much more …

Summary of Targets

        Total Throughput   Transactions per second   Client Throughput   System Scale
Today   ~100 GB/s          9K ops/s                  800 MB/s            300 Teraflops, 250 OSSs    POSIX
2008    1 TB/s             1M ops/s                  30 GB/s             1 Petaflop, 1000 OSSs      Site-wide, WB cache, Secure
2010    10 TB/s                                                          10 Petaflops, 5000 OSSs    Proxies, disconnected

Cluster File Systems, Inc.

 History:
   Founded in 2001
   Shipping production code since 2003

 Financial Status:
   Privately held, self-funded, US-based corporation
   No external investment
   Profitable since the first day of business

 Company Size:
   40+ full-time engineers
   Additional staff in management, finance, sales, and administration

 Installed Base:
   ~200 commercially supported worldwide deployments
   Supporting the largest systems on 3 major continents [NA, Europe, Asia]
   20% of the Top100 supercomputers (according to the November '05 Top500 list)
   Growing list of private-sector users

Thank You!