Lustre: the Future – Peter Braam

Lustre Today

Scalable to 10,000's of CPUs: Scalable architecture provides users more than enough headroom for future growth - simple incremental HW addition
Supports Multiple Network Protocols: Native high-performance network IO guarantees fast client & server side I/O; routing enables a multi-cluster, multi-network namespace
SW-only Platform: HW agnosticism allows users flexibility of choice - eliminates vendor lock-ins

Proven Capability: Experience in multiple industries; 100s of deployed systems demonstrate production quality that can exceed 99% uptime

Failover Capable: Lustre's failover protocols eliminate single points of failure and increase system reliability
POSIX Compliant: Users benefit from strict semantics and data coherency
Broad Industry Support: With the support of many leading OEMs and a robust user base, Lustre is an emerging standard with a very promising future

Lustre Architecture Today

Architecture diagram: Lustre clients (10,000s) reach the servers over multiple networks supported simultaneously - Quadrics Elan, Myrinet GM/MX, InfiniBand, Gigabit Ethernet, and 10Gb Ethernet. Lustre MetaData Servers (MDS 1 active, MDS 2 standby) are paired for failover. Lustre Object Storage Servers (OSS 1 through OSS 7 in the diagram; 100s in practice) use direct-connected commodity storage servers and arrays or enterprise-class storage arrays and SAN fabrics, also arranged in failover pairs.

Lustre Today: version 1.4.6

Scalability: Clients: 10,000 (130,000 CPUs); Servers: 256 OSSs; MetaData Servers: 1 (+ failover); Files: 500M
Performance: Single Server: 2 GB/s+; Single Client: 2 GB/s+; Single Namespace (file system): 100 GB/s; File Operations: 9,000 ops/second
Redundancy: No single point of failure with shared storage
Security: POSIX ACLs
Management: Quota
Kernel/OS Support: RHEL 3, RHEL 4, SLES 9; patches required
NFS/CIFS: Limited NFS support; full CIFS support
POSIX Compliant: Yes
User Profile: Primarily high-end HPC

Intergalactic Strategy

Roadmap (from HPC scalability in the near & mid term toward Enterprise Data Management):

• Lustre v1.6 (mid-late 2006), building on Lustre v1.4: Online Server Addition, Simple Configuration, Kerberos, Lustre RAID, Patchless Client; NFS Export
• Lustre v2.0 (late 2007): 1 TB/s, 30 GB/s mixed files, Clustered MDS, 5-10X Metadata Improvement, WB caches, Small files, Pools, Snapshots, Backups
• Lustre v3.0 (~2008/9): 10 TB/s, 1 Trillion files, 1M file creates / sec, HSM, 1 PFlop Systems, Proxy Servers, Disconnected operation

Near & mid term

12 month strategy

 Beginning to see much higher quality

 Adding desirable features and improvements
 Some re-writes (recovery)

 Focus largely on doing existing installations better

 Much improved Manual
 Extended training
 More community activity

In testing now

 Improved NFS performance (1.4.7)
 Support for NFSv4 (pNFS when available)
 Broader support for data center HW; site-wide storage enablement

 Networking (1.4.7, 1.4.8)
   OpenFabrics (formerly OpenIB Gen 2) support
   Cisco InfiniBand
   Myricom MX

 Much simpler configuration (1.6.0)

 Clients without patches (1.4.8)

 Large partition support (1.4.8)

 Linux software RAID5 improvements (1.6.0)

More 2006

 Metadata Improvements (several … from 1.4.7)
   Client-side metadata & OSS read-ahead
   More parallel operations

 Recovery improvements (see the sketch below)
   Move from timeouts to health detection
   More recovery situations covered
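As a rough illustration of health detection replacing fixed timeouts (a generic sketch, not the Lustre recovery code; the heartbeat interval, miss threshold, and HealthMonitor class are assumptions):

```python
# Generic health-detection sketch: declare a server unhealthy after missed
# heartbeats instead of waiting for a long per-request timeout.
import time

HEARTBEAT_INTERVAL = 1.0   # seconds between expected heartbeats (assumed)
MAX_MISSED = 3             # missed heartbeats before triggering recovery

class HealthMonitor:
    def __init__(self):
        self.last_seen = {}            # server -> timestamp of last heartbeat

    def heartbeat(self, server):
        self.last_seen[server] = time.monotonic()

    def unhealthy(self):
        """Servers whose heartbeats have stopped; candidates for recovery."""
        now = time.monotonic()
        limit = HEARTBEAT_INTERVAL * MAX_MISSED
        return [s for s, t in self.last_seen.items() if now - t > limit]

monitor = HealthMonitor()
monitor.heartbeat("OSS3")
# ... a recovery thread would poll monitor.unhealthy() and start failover or
# replay for the listed servers rather than waiting for RPC timeouts.
```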

 Kerberos5 Authentication, OSS capabilities

 Remote UID handling
 WAN: Lustre awarded at SC|05 for long-distance efficiency; 5 GB/s over 200 miles on an HP system

Late 2006 / early 2007

 Manageability
   Backups – restore files with exact or similar striping
   Snapshots – use LVM with distributed control
   Storage Pools

 Trouble-shooting
   Better error messages
   Error analysis tools

 High Speed CIFS Services – pCIFS Windows client

 OS X Tiger client – no patches to OS X

 Lustre Network RAID 1 and RAID 5 across OSS Stripes

Pools – naming a group of OSSs

 Heterogeneous servers
   Old and new OSSs
   Homogeneous QOS
 Stripe inside pools
 Pools – name an OST group
 Associate qualities with a pool
 Policies
   A subdirectory must place stripes in a pool (setstripe)
   A Lustre network must use one pool (lctl)
   Destination pool depends on file extension (see the sketch below)
   Keep LAID stripes in a pool

Diagram: Pool A (OSS 2, OSS 3, OSS 4, with failover) uses commodity SAN or disks; Pool B (OSS 5, OSS 6, OSS 7) uses enterprise-class RAID storage.
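The sketch below illustrates the extension-based pool policy named above. It is a hypothetical policy helper, not Lustre code: the pool names, the extension table, and choose_pool are invented, and a real deployment would apply the chosen pool to a directory's striping (setstripe).

```python
# Hypothetical policy: map a file's extension to an OST pool.
# Pool names and the extension table are illustrative only.
import os

EXTENSION_POOLS = {
    ".log": "pool_a",      # e.g. commodity SAN/disk pool
    ".dat": "pool_b",      # e.g. enterprise-class RAID pool
}
DEFAULT_POOL = "pool_a"

def choose_pool(path: str) -> str:
    """Return the OST pool a new file should be striped into."""
    _, ext = os.path.splitext(path)
    return EXTENSION_POOLS.get(ext.lower(), DEFAULT_POOL)

print(choose_pool("/lustre/results/run42.dat"))   # pool_b
# A policy engine would then place the file's stripes in that pool, e.g. by
# creating it under a directory whose striping (setstripe) points at the pool.
```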

Lustre Network RAID

OSS redundancy today: shared storage partitions for OSS targets (OSTs)
 OSS1 – active for tgt1, standby for tgt2
 OSS2 – active for tgt2, standby for tgt1
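A minimal sketch of this active/standby pairing, assuming hypothetical Node and FailoverPair types and a simulated failure; it is not the Lustre failover protocol itself.

```python
# Minimal sketch of active/standby target failover over shared storage,
# with a simulated failure; all names here are hypothetical.
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    alive: bool = True
    serving: set = field(default_factory=set)   # targets currently mounted

@dataclass
class FailoverPair:
    primary: Node
    standby: Node
    target: str                                 # shared-storage target (OST)

    def ensure_service(self):
        """Keep the target served by exactly one live node of the pair."""
        if self.primary.alive:
            self.primary.serving.add(self.target)
            self.standby.serving.discard(self.target)
        elif self.standby.alive:
            # Primary failed: the standby mounts the shared target and takes over.
            self.standby.serving.add(self.target)
        else:
            raise RuntimeError(f"no live server for {self.target}")

oss1, oss2 = Node("OSS1"), Node("OSS2")
pairs = [FailoverPair(oss1, oss2, "tgt1"), FailoverPair(oss2, oss1, "tgt2")]
oss1.alive = False                      # simulate an OSS failure
for p in pairs:
    p.ensure_service()
print(oss2.serving)                     # OSS2 now serves both targets
```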

Lustre network RAID (LAID), late 2006: striping redundancy across multiple OSS nodes (OSS1, OSS2, … OSSX)
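To illustrate what striping redundancy across OSS nodes means, here is a generic RAID-5-style parity sketch over stripe chunks; the chunk size, layout, and helper names are assumptions, not the LAID design.

```python
# Generic RAID-5-style parity across stripe chunks, as an illustration of
# network RAID across OSS nodes; chunk size and layout are assumptions.
CHUNK = 4                                  # tiny stripe size for the example

def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def stripe_with_parity(data: bytes, n_data_oss: int):
    """Split data into chunks round-robin over n_data_oss servers, plus one
    parity chunk per stripe row that would go to an extra server."""
    chunks = [data[i:i + CHUNK].ljust(CHUNK, b"\0")
              for i in range(0, len(data), CHUNK)]
    rows = []
    for i in range(0, len(chunks), n_data_oss):
        row = chunks[i:i + n_data_oss]
        row += [bytes(CHUNK)] * (n_data_oss - len(row))   # pad short rows
        parity = row[0]
        for c in row[1:]:
            parity = xor_bytes(parity, c)
        rows.append(row + [parity])        # last element goes to the parity OSS
    return rows

rows = stripe_with_parity(b"hello lustre network raid", n_data_oss=3)
# Any single lost chunk in a row can be rebuilt by XOR-ing the others.
lost = rows[0][1]
rebuilt = xor_bytes(xor_bytes(rows[0][0], rows[0][2]), rows[0][3])
assert rebuilt == lost
```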

pCIFS deployment (diagram): Windows clients run a pCIFS redirector, a filter driver for the NT CIFS client. Over the traditional CIFS protocol, with Windows semantics, clients obtain the file layout from Samba gateways that export the Lustre namespace, and then perform parallel I/O directly using that layout. The namespace is backed by the Lustre Metadata Servers (MDS) and the Lustre Object Storage Servers (OSS).
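As a generic illustration of "parallel I/O using the file layout": once the redirector knows a file's stripe layout it can compute which object holds each byte range and issue the reads concurrently. The Layout structure, locate, and read_chunk helpers below are hypothetical, not the pCIFS wire format.

```python
# Generic illustration of parallel I/O against a striped file layout.
# The layout structure and I/O helpers are hypothetical.
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass

@dataclass
class Layout:
    stripe_size: int       # bytes per stripe chunk
    objects: list          # one backing object (on some OSS) per stripe

def locate(layout: Layout, offset: int):
    """Map a file offset to (object index, offset within that object)."""
    chunk = offset // layout.stripe_size
    obj = chunk % len(layout.objects)
    obj_off = (chunk // len(layout.objects)) * layout.stripe_size + \
              offset % layout.stripe_size
    return obj, obj_off

def read_chunk(layout, obj, obj_off, length):
    # Placeholder for a real network read from the OSS holding the object.
    return b"\0" * length

def parallel_read(layout: Layout, offset: int, length: int) -> bytes:
    """Split a byte range into per-object reads and run them concurrently."""
    jobs, pos = [], offset
    while pos < offset + length:
        obj, obj_off = locate(layout, pos)
        n = min(layout.stripe_size - pos % layout.stripe_size,
                offset + length - pos)
        jobs.append((obj, obj_off, n))
        pos += n
    with ThreadPoolExecutor() as pool:
        parts = pool.map(lambda j: read_chunk(layout, *j), jobs)
    return b"".join(parts)

layout = Layout(stripe_size=1 << 20, objects=["ost0", "ost1", "ost2"])
data = parallel_read(layout, offset=0, length=5 << 20)   # spans all objects
assert len(data) == 5 << 20
```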


Longer term

Future Enhancements

 Clustered metadata – towards 1M pure MD ops / sec

 FSCK free file systems – immense capacity

 Client memory MD WB cache – smoking performance

 Hot data migration – HSM & management

 Global namespace

 Proxy clusters – wide area grid applications
   Disconnected operation

Clustered Metadata: v2.0

Diagram: clients connect to load-balanced, clustered MDS server pools and to OSS server pools.

Previous scalability plus an enlarged MD pool: enhances NetBench/SpecFS performance and client scalability. Limits: 100's of billions of files, M-ops / sec.
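A simplified sketch of how a clustered metadata service can spread one directory's entries over several MDS nodes by hashing the entry name; the server list and hash choice are illustrative assumptions, not the actual Lustre clustered-MDS algorithm.

```python
# Generic sketch of directory striping across a metadata server cluster.
# The hash choice and server list are illustrative assumptions.
import hashlib

MDS_SERVERS = ["mds0", "mds1", "mds2", "mds3"]   # one per directory stripe

def mds_for_entry(parent_dir_id: int, name: str, stripe_count: int) -> str:
    """Pick the MDS that stores the directory entry for `name`."""
    key = f"{parent_dir_id}/{name}".encode()
    h = int.from_bytes(hashlib.sha1(key).digest()[:8], "big")
    return MDS_SERVERS[h % stripe_count]

# Creates in one big directory spread over 4 MDS nodes instead of 1, which is
# the effect behind the Dirstripe 1..4 results that follow.
for name in ("file-000001", "file-000002", "file-000003"):
    print(name, "->", mds_for_entry(parent_dir_id=42, name=name, stripe_count=4))
```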

Modular Architecture

Lustre 1.X client stack: Lustre Client File System over Logical Object Volumes (multiple OSCs and a single MDC), with lock service and network & recovery layers.

Lustre 2.X client stack: Lustre Client File System over Logical Object Volumes and a Logical MD Volume (multiple OSCs and multiple MDCs), with lock service and network & recovery layers.

Preliminary Results – creations

open(O_CREAT) - 0 objects

# Files:      1M     2M     3M     4M     5M     6M     7M     8M     9M     10M
Dirstripe 1:  3700   3620   3350   3181   3156   2997   2970   2897   2824   2796
Dirstripe 2:  7550   7356   6803   6478   6235   6070   5896   5171   5734   5618
Dirstripe 3:  10802  10528  9660   9311   8953   8597   8284   8140   8094   7666
Dirstripe 4:  14502  13966  12927  12450  11757  11459  11117  10714  10466  10267

Preliminary Results – stats

stat (random, 0 objects)

# Files:      1M     2M     3M     4M     5M     6M     7M     8M     9M     10M
Dirstripe 1:  4898   4879   4593   2502   1344   2564   1230   1229   1119   1064
Dirstripe 2:  9456   9410   9411   9384   9365   4876   1731   1507   1449   1356
Dirstripe 3:  14867  14817  14311  14828  14742  14773  14725  14736  4123   1747
Dirstripe 4:  19886  19942  19914  19826  19843  19833  19868  19926  19803  19831

Extremely large File Systems

 Remove limits (#files, sizes) & NO fsck ever

 Start moving towards "iron" ext3, like ZFS
   Work from the University of Wisconsin is a first step
   Checksum much of the data (see the sketch after this list)
   Replicate metadata
   Detect and repair corruption
   A new component handles relational corruption, caused by accidental re-ordering of writes

 Unlike a "port ZFS" approach, we expect
   A sequence of small fixes
   Each with its own benefits
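A minimal sketch of the "checksum much of the data" step, assuming an in-memory block store: a checksum is stored with each block on write and verified on read, so corruption is detected without a full fsck.

```python
# Illustrative block checksumming: detect (and report) corrupted blocks on
# read. The in-memory "disk" and block size are assumptions.
import zlib

BLOCK_SIZE = 4096
disk = {}          # block number -> (payload, stored checksum)

def write_block(blockno: int, payload: bytes) -> None:
    disk[blockno] = (payload, zlib.crc32(payload))

def read_block(blockno: int) -> bytes:
    payload, stored = disk[blockno]
    if zlib.crc32(payload) != stored:
        # A repair path would re-read a replica or reconstruct the block.
        raise IOError(f"checksum mismatch in block {blockno}")
    return payload

write_block(7, b"metadata" * 8)
corrupted, crc = disk[7]
disk[7] = (b"X" + corrupted[1:], crc)     # simulate silent corruption
try:
    read_block(7)
except IOError as e:
    print(e)                              # corruption detected, no fsck needed
```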

Lustre Client Memory Writeback Cache

 Quantum Leap in Performance and Capability

 AFS is the wrong approach
   Local server (a.k.a. cache) is slow
   Disk writes, context switches

 File I/O writeback cache exists and is step 1
 Critical new element is asynchronous file creation
   File identifiers are difficult to create without network traffic
   Everything gets flushed to the server asynchronously

 Target: 1 client: 30 GB/sec on a mix of small / big files

Lustre Writeback Cache: Nuts & Bolts

 Local FS does everything in memory
   No context switches

 Sub-tree locks

 Client capability to allocate file identifiers (see the sketch below)
   Also required for proxy clusters

 100% asynchronous server API
   Except cache miss
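A toy sketch of these nuts and bolts: the client allocates file identifiers from a locally held range, completes creates in memory, and a background thread flushes the operations to the server asynchronously. The WritebackClient class, the pre-granted FID range, and the flush callback are assumptions for illustration.

```python
# Toy client writeback cache: creates complete in client memory using a
# locally allocated file identifier, and are flushed asynchronously.
import itertools, queue, threading

class WritebackClient:
    def __init__(self, fid_range, flush_to_server):
        self._fids = itertools.count(fid_range[0])   # FIDs granted in advance
        self._pending = queue.Queue()                # operations not yet flushed
        self._flush = flush_to_server
        threading.Thread(target=self._flusher, daemon=True).start()

    def create(self, path):
        """Asynchronous create: no server round trip on the fast path."""
        fid = next(self._fids)
        self._pending.put(("create", fid, path))
        return fid

    def wait_for_flush(self):
        self._pending.join()

    def _flusher(self):
        while True:
            op = self._pending.get()       # replayed to the server in order
            self._flush(op)
            self._pending.task_done()

server_log = []
client = WritebackClient(fid_range=(1000, 2000), flush_to_server=server_log.append)
for i in range(3):
    client.create(f"/scratch/out.{i}")
client.wait_for_flush()                    # everything replayed asynchronously
print(server_log)
```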

Migration - preface to HSM…

OBJECT MIGRATION: coordinating client; migrate objects
RESTRIPING: intercept cache miss; transparent to clients
POOL MIGRATION: control migration; control configuration change; migrate groups of objects; migrate groups of inodes

Diagram: a virtual migrating MDS pool, in which a hot object migrator moves objects from MDS Pool A to MDS Pool B, and a virtual migrating OSS pool (seen by clients as a virtual OST), in which OSS2's objects move from OSS Pool A to OSS Pool B.

Cache miss handling & migration - HSM

SMFS (Storage Management File System): logs, attributes, cache miss interception
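A rough sketch of cache-miss interception in the HSM flow, with invented names (archive, restore_from_archive, read_object): an access to an object whose data has been migrated triggers a transparent restore before the read proceeds.

```python
# Sketch of cache-miss interception for HSM: objects whose data was migrated
# are restored on first access. All names are hypothetical.
archive = {"obj-17": b"cold data moved to tape"}     # migrated objects
online = {}                                          # objects with data present

def restore_from_archive(obj_id: str) -> None:
    online[obj_id] = archive.pop(obj_id)             # bring the data back

def read_object(obj_id: str) -> bytes:
    if obj_id not in online:
        # Cache miss intercepted: the object's attributes/logs say the data
        # lives in the archive, so restore it transparently to the client.
        restore_from_archive(obj_id)
    return online[obj_id]

print(read_object("obj-17"))     # first access triggers a restore
print(read_object("obj-17"))     # subsequent accesses hit online storage
```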

Proxy Servers

Diagram: remote clients connect to proxy Lustre servers for local performance; the proxies synchronize with the master Lustre servers.

1. Disconnected operation
2. Asynchronous replay, capable of re-striping
3. Conflict detection and handling (see the sketch below)
4. Mandatory or cooperative lock policy
5. Namespace sub-tree locks
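A simplified sketch of conflict detection during asynchronous replay from a proxy, using per-entry version numbers in the spirit of the subtree versioning noted below; the data structures and replay function are invented for illustration.

```python
# Simplified conflict detection for asynchronous replay from a proxy: each
# namespace entry carries a version; a replayed update conflicts if the
# master's version moved since the proxy cached it. Illustrative only.
master = {"/projects/a/report.txt": {"version": 3, "data": b"v3"}}

def replay(path, based_on_version, new_data):
    entry = master.get(path)
    if entry is not None and entry["version"] != based_on_version:
        # Someone else updated the master while the proxy was disconnected.
        return "conflict"
    master[path] = {"version": based_on_version + 1, "data": new_data}
    return "applied"

print(replay("/projects/a/report.txt", based_on_version=3, new_data=b"v4"))   # applied
print(replay("/projects/a/report.txt", based_on_version=3, new_data=b"v4b"))  # conflict
```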

Resynchronization avoids a tree walk and a server log by using subtree versioning.

Wide Area Applications

 Real/fast grid storage
 Worldwide R&D environments in a single namespace
 National security
 Internet content hosting / searching
 Work at home, offline, employee network
 Much, much more …

Summary of Targets

        Total Throughput   Transactions per second   Client Throughput   System Scale
Today   ~100 GB/s          9K ops/s                  800 MB/s            300 Teraflops, 250 OSSs    POSIX
2008    1 TB/s             1M ops/s                  30 GB/s             1 Petaflop, 1000 OSSs      Site-wide, WB cache, Secure
2010    10 TB/s                                                          10 Petaflops, 5000 OSSs    Proxies, disconnected

Cluster File Systems, Inc.

 History:
   Founded in 2001
   Shipping production code since 2003

 Financial Status:
   Privately held, self-funded, US-based corporation
   No external investment
   Profitable since the first day of business

 Company Size:
   40+ full-time engineers
   Additional staff in management, finance, sales, and administration

 Installed Base:
   ~200 commercially supported worldwide deployments
   Supporting the largest systems on 3 major continents [NA, Europe, Asia]
   20% of the Top100 supercomputers (according to the November '05 Top500 list)
   Growing list of private-sector users

Thank You!