Lustre – the Future
Peter Braam
Lustre Today
• Scalable – Scalable architecture provides users more than enough headroom for future growth (10,000s of CPUs); simple incremental HW addition
• Supports Multiple Network Protocols – Native high-performance network I/O guarantees fast client- and server-side I/O; routing enables a multi-cluster, multi-network namespace
• SW-only Platform – HW agnosticism allows users flexibility of choice and eliminates vendor lock-in
• Proven Capability – Experience in multiple industries and 100s of deployed systems demonstrate production quality that can exceed 99% uptime
• Failover Capable – Lustre’s failover protocols eliminate single points of failure and increase system reliability
• POSIX Compliant – Users benefit from strict semantics and data coherency
• Broad Industry Support – With the support of many leading OEMs and a robust user base, Lustre is an emerging file system standard with a very promising future

Lustre Architecture Today
[Diagram: Lustre clients (10,000s) connect over multiple simultaneously supported networks – Quadrics Elan, Myrinet GM/MX, InfiniBand, Gigabit Ethernet, 10Gb Ethernet – to a failover pair of Lustre MetaData Servers (MDS 1 active, MDS 2 standby) and to Lustre Object Storage Servers (OSS 1–7 shown; 100s supported). OSSs use direct-connected commodity storage arrays or enterprise-class storage arrays and SAN fabrics.]
Lustre Today: version 1.4.6
• Scalability – Clients: 10,000 (130,000 CPUs); Object Storage Servers: 256 OSSs; MetaData Servers: 1 (+ failover); Files: 500M
• Performance – Single server: 2+ GB/s; single client: 2+ GB/s; single namespace: 100 GB/s; file operations: 9,000 ops/second
• Redundancy – No single point of failure with shared storage
• Security – POSIX ACLs
• Management – Quota
• Kernel/OS Support – RHEL 3, RHEL 4, SLES 9 (patches required); limited NFS support; full CIFS support
• POSIX Compliant – Yes
• User Profile – Primarily high-end HPC
Intergalactic Strategy
Roadmap – from HPC scalability (near term) toward enterprise data management (long term):

• Lustre v1.4 (today) – Patchless client; NFS export
• Lustre v1.6 (mid-late 2006) – Online server addition; simple configuration; Kerberos; Lustre RAID; pools; snapshots; backups
• Lustre v2.0 (late 2007) – Clustered MDS (5–10X metadata improvement); WB caches; small files; 1 TB/s; 30 GB/s mixed files; 1M file creates/sec
• Lustre v3.0 (~2008/9) – HSM; proxy servers; disconnected operation; 1 trillion files; 10 TB/s; 1 PFlop systems

Near & Mid Term
12 month strategy
• Beginning to see much higher quality
• Adding desirable features and improvements
• Some re-writes (recovery)
• Focus largely on doing existing installations better
• Much improved manual
• Extended training
• More community activity

In Testing Now
• Improved NFS performance (1.4.7); support for NFSv4 (pNFS when available)
• Broader support for data center HW; site-wide storage enablement
• Networking (1.4.7, 1.4.8): OpenFabrics (formerly OpenIB Gen 2) support, Cisco InfiniBand, Myricom MX
• Much simpler configuration (1.6.0)
• Clients without patches (1.4.8)
• Ext3 large-partition support (1.4.8)
• Linux software RAID5 improvements (1.6.0)
More 2006
• Metadata improvements (several, starting from 1.4.7): client-side metadata & OSS read-ahead; more parallel operations
• Recovery improvements: move from timeouts to health detection; more recovery situations covered
• Kerberos5 authentication; OSS capabilities
• Remote UID handling
• WAN: Lustre awarded at SC|05 for long-distance efficiency – 5 GB/s over 200 miles on an HP system

Late 2006 / Early 2007
• Manageability
  - Backups: restore files with exact or similar striping
  - Snapshots: use LVM with distributed control
  - Storage pools
• Troubleshooting: better error messages; error analysis tools
• High-speed CIFS services: pCIFS Windows client
• OS X Tiger client – no patches to OS X
• Lustre network RAID 1 and RAID 5 across OSS stripes
• Pools – naming a group of OSSs
Pools

• Pools name a group of OSTs and associate qualities with the pool
• Heterogeneous servers (old and new OSSs, commodity SAN or disks, enterprise-class RAID storage) yield homogeneous QOS within a pool
• Stripes stay inside a pool; failover stays within a pool (e.g. Pool A: commodity storage; Pool B: enterprise-class RAID storage)
• Policies:
  - A subdirectory must place stripes in a pool (setstripe)
  - A Lustre network must use one pool (lctl)
  - Destination pool depends on file extension
  - Keep LAID stripes in a pool
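The setstripe/lctl policies above might be driven like this sketch; the `lctl pool_*` and `lfs setstripe --pool` syntax is what later shipped in Lustre, and the pool, filesystem, and path names here are hypothetical.

```shell
# Create a pool "poolA" in filesystem "lustre" and add enterprise-class OSTs
lctl pool_new lustre.poolA
lctl pool_add lustre.poolA lustre-OST000[0-3]

# Policy: stripes created under this subdirectory must land in poolA
lfs setstripe --pool poolA /mnt/lustre/projects/db

# Verify the layout policy that new files will inherit
lfs getstripe /mnt/lustre/projects/db
```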
Lustre Network RAID

• OSS redundancy today: shared storage partitions for OSS targets (OSTs)
  - OSS1 – active for tgt1, standby for tgt2
  - OSS2 – active for tgt2, standby for tgt1
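The active/standby pairing can be declared at format time. `mkfs.lustre` and `--failnode` are the real 1.6-style options; the node names and devices below are hypothetical.

```shell
# tgt1: primary on OSS1, standby on OSS2 (shared storage array)
mkfs.lustre --fsname=lustre --ost --mgsnode=mds1@tcp \
            --failnode=oss2@tcp /dev/sdb

# tgt2: primary on OSS2, standby on OSS1
mkfs.lustre --fsname=lustre --ost --mgsnode=mds1@tcp \
            --failnode=oss1@tcp /dev/sdc

# If OSS1 fails, OSS2 simply mounts tgt1 from the shared array:
mount -t lustre /dev/sdb /mnt/ost-tgt1
```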
• Lustre network RAID (LAID): striping redundancy across multiple OSS nodes (late 2006)
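The redundancy idea behind LAID RAID5 striping can be sketched as XOR parity across OSS stripes: one parity stripe per stripe-row lets any single lost OSS stripe be reconstructed. This is illustrative only; names and layout are not Lustre's actual on-wire format.

```python
def xor_blocks(blocks):
    """XOR a list of equal-sized byte blocks together."""
    out = bytearray(len(blocks[0]))
    for b in blocks:
        for i, byte in enumerate(b):
            out[i] ^= byte
    return bytes(out)

def make_parity(stripes):
    """Parity stripe stored on an extra OSS."""
    return xor_blocks(stripes)

def rebuild(surviving, parity):
    """Recover a lost stripe from the survivors plus parity."""
    return xor_blocks(surviving + [parity])

stripes = [b"OSS1data", b"OSS2data", b"OSS3data"]  # stripes on three OSSs
parity = make_parity(stripes)                      # stored on a fourth OSS
recovered = rebuild([stripes[0], stripes[2]], parity)  # OSS2 has failed
assert recovered == stripes[1]
```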
pCIFS Deployment

• A filter driver in the Windows NT CIFS client – the pCIFS redirector
• The traditional CIFS protocol, with Windows semantics, obtains the file layout
• The pCIFS redirector then performs parallel I/O for clients using that file layout
[Diagram: Windows clients reach a single Lustre namespace through multiple Samba servers; metadata comes from the Lustre MetaData Servers (MDS), file data flows in parallel from the Lustre Object Storage Servers (OSS).]

Longer Term
Future Enhancements
• Clustered metadata – towards 1M pure MD ops/sec
• FSCK-free file systems – immense capacity
• Client memory MD writeback cache – smoking performance
• Hot data migration – HSM & management
• Global namespace
• Proxy clusters – wide-area grid applications; disconnected operation

Clustered Metadata: v2.0
[Diagram: clients connect to a pool of MDS servers and a pool of OSS servers.]

• Previous scalability, plus an enlarged MD pool: enhances NetBench/SpecFS and client scalability
• Limits: 100s of billions of files; millions of ops/sec
• Load balanced
Clustered Metadata: Modular Architecture

• Lustre 1.x client stack: Lustre client file system → logical object volumes → multiple OSCs and a single MDC, over lock service, network & recovery layers
• Lustre 2.x client stack adds a logical MD volume layer over multiple MDCs, so one client file system talks to many metadata servers

Preliminary Results – Creations
open(O_CREAT), 0 objects – creates/second vs. number of files:

# Files       1M     2M     3M     4M     5M     6M     7M     8M     9M    10M
Dirstripe 1   3700   3620   3350   3181   3156   2997   2970   2897   2824   2796
Dirstripe 2   7550   7356   6803   6478   6235   6070   5896   5171   5734   5618
Dirstripe 3  10802  10528   9660   9311   8953   8597   8284   8140   8094   7666
Dirstripe 4  14502  13966  12927  12450  11757  11459  11117  10714  10466  10267
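A minimal sketch of the measurement behind these numbers: time `open(O_CREAT)` calls that write no data objects and report creates/second. It runs against a local temporary directory here; pointing `path` at a Lustre mount (one subdirectory per metadata stripe) would give the real test.

```python
import os
import tempfile
import time

def create_rate(path, n=2000):
    """Create n empty files under path; return creates per second."""
    t0 = time.time()
    for i in range(n):
        fd = os.open(os.path.join(path, f"f{i}"), os.O_CREAT | os.O_WRONLY)
        os.close(fd)
    return n / (time.time() - t0)

with tempfile.TemporaryDirectory() as d:
    print(f"{create_rate(d):.0f} creates/sec")
```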
Preliminary Results – Stats

stat (random, 0 objects) – stats/second vs. number of files:

# Files       1M     2M     3M     4M     5M     6M     7M     8M     9M    10M
Dirstripe 1   4898   4879   4593   2502   1344   2564   1230   1229   1119   1064
Dirstripe 2   9456   9410   9411   9384   9365   4876   1731   1507   1449   1356
Dirstripe 3  14867  14817  14311  14828  14742  14773  14725  14736   4123   1747
Dirstripe 4  19886  19942  19914  19826  19843  19833  19868  19926  19803  19831

Extremely Large File Systems
• Remove limits (number of files, sizes) and never require fsck
• Start moving towards an “iron” ext3, like ZFS
  - Work from the University of Wisconsin is a first step
  - Checksum much of the data
  - Replicate metadata
  - Detect and repair corruption
  - A new component handles relational corruption, caused by accidental re-ordering of writes
• Unlike a “port ZFS” approach, we expect a sequence of small fixes, each with its own benefits
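The checksum-and-replicate idea can be sketched in a few lines: checksum blocks, replicate metadata, and repair a corrupt primary copy from its replica at read time instead of running fsck. The structures here are hypothetical, not the actual ext3 changes.

```python
import zlib

class Block:
    """A stored block carrying a CRC32 checksum of its contents."""
    def __init__(self, data):
        self.data = data
        self.csum = zlib.crc32(data)

    def verify(self):
        return zlib.crc32(self.data) == self.csum

def read_metadata(primary, replica):
    """Detect corruption via checksum; repair the primary from the replica."""
    if primary.verify():
        return primary.data
    if replica.verify():
        primary.data, primary.csum = replica.data, replica.csum  # online repair
        return replica.data
    raise IOError("both copies corrupt")

meta = Block(b"inode table chunk")
mirror = Block(b"inode table chunk")
meta.data = b"inode table chunX"        # simulate silent corruption
assert read_metadata(meta, mirror) == b"inode table chunk"
assert meta.verify()                    # primary repaired in place
```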
Lustre Client Memory Writeback Cache

• A quantum leap in performance and capability
• AFS is the wrong approach: a local server acting as the cache is slow – disk writes, context switches
• A file I/O writeback cache exists and is step 1
• The critical new element is asynchronous file creation; file identifiers are difficult to create without network traffic
• Everything gets flushed to the server asynchronously
• Target: 30 GB/s on one client for a mix of small and big files

Lustre Writeback Cache: Nuts & Bolts
• The local FS does everything in memory – no context switches
• Subtree locks
• Client capability to allocate file identifiers (also required for proxy clusters)
• 100% asynchronous server API, except on cache miss
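A minimal sketch of these nuts and bolts, under the assumption that the server pre-grants the client a range of file identifiers: creates then complete entirely in client memory, and the operation log is replayed to the server asynchronously. The class and method names are hypothetical, not Lustre's API.

```python
from collections import deque

class WritebackClient:
    def __init__(self, fid_range):
        self.next_fid, self.last_fid = fid_range  # FID range pre-granted by server
        self.log = deque()                        # pending asynchronous operations
        self.namespace = {}                       # in-memory view of the directory

    def create(self, name):
        """Create completes locally -- no network round trip needed."""
        assert self.next_fid <= self.last_fid, "need a new FID grant"
        fid = self.next_fid
        self.next_fid += 1
        self.namespace[name] = fid
        self.log.append(("create", fid, name))
        return fid

    def flush(self, server):
        """Asynchronous replay of the operation log to the server."""
        while self.log:
            server.append(self.log.popleft())

server_log = []
client = WritebackClient((0x1000, 0x1FFF))
for i in range(3):
    client.create(f"file{i}")
client.flush(server_log)
assert server_log[0] == ("create", 0x1000, "file0")
```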
Migration - preface to HSM…
• Object migration: a coordinating client intercepts cache misses and migrates objects; transparent to clients
• Restriping: control migration and configuration change; migrate groups of objects
• Pool migration: migrate groups of inodes
[Diagram: a virtual migrating MDS pool (MDS Pool A → MDS Pool B) and a virtual migrating OSS pool (OSS Pool A → OSS Pool B), with a hot-object migrator moving data between pools.]

Cache Miss Handling & Migration – HSM

• SMFS: Storage Management File System – logs, attributes, cache-miss interception
Proxy Servers
[Diagram: remote clients reach the master Lustre servers either directly or through proxy clients and proxy Lustre servers at their sites, giving local performance.]

1. Disconnected operation
2. Asynchronous replay, capable of re-striping
3. Conflict detection and handling
4. Mandatory or cooperative lock policy
5. Namespace sub-tree locks

• Resynchronization avoids a tree walk and server log by using subtree versioning

Wide Area Applications
• Real/fast grid storage
• Worldwide R&D environments in a single namespace
• National security
• Internet content hosting / searching
• Work at home, offline, on the employee network
• Much, much more…
Summary of Targets
        Total       Transactions  Client
        Throughput  per second    Throughput  System Scale
Today   ~100 GB/s   9K ops/s      800 MB/s    300 TFlops, 250 OSSs    POSIX
2008    1 TB/s      1M ops/s      30 GB/s     1 PFlop, 1000 OSSs      Site-wide, WB cache, secure
2010    10 TB/s     –             –           10 PFlops, 5000 OSSs    Proxies, disconnected

Cluster File Systems, Inc.
• History: founded in 2001; shipping production code since 2003
• Financial status: privately held, self-funded, US-based corporation with no external investment; profitable since the first day of business
• Company size: 40+ full-time engineers; additional staff in management, finance, sales, and administration
• Installed base: ~200 commercially supported deployments worldwide; supporting the largest systems on 3 major continents (NA, Europe, Asia); 20% of the Top 100 supercomputers (November ’05 Top500 list); a growing list of private-sector users
Thank You!