
The Lustre Filesystem for Petabyte Storage at the Florida HPC Center

Paul Avery, Dimitri Bourilkov, Yu Fu, Bockjoo Kim, Craig Prescott (University of Florida)

HEPiX 2014 Workshop, University of Nebraska - Lincoln, October 13-17, 2014

Outline

• Storage system requirements & design: close link between the CMS T2 and the HPC center
• Lustre overview
• Storage at the University of Florida HPC center
  ◦ /cms: 1.7 PB, growing to 2 PB
  ◦ /lfs: 2.0 PB
• Performance & monitoring
• Lustre on the WAN

Requirements I

• Scalable to Petabytes, with in-house supported systems
• Nodes must share data over the network
  ◦ 1000+ compute nodes and 20k+ processors
• Throughput, IOPS, and capacity requirements
• Resilience in the face of component failure
  ◦ RAID
  ◦ High availability (/lfs)

• POSIX semantics
  ◦ All nodes must maintain consistent views of the file data and metadata
• Locking subsystem

Requirements II

• Support for high-performance RDMA (Remote Direct Memory Access) networking
• Basic choices
  ◦ Traditional network file system (e.g. NFS) or a parallel file system
  ◦ Parallel file systems offer scalability
• Solutions
  ◦ Good proprietary parallel solutions are available, but they are expensive, cause vendor lock-in and reduce flexibility
  ◦ Open source: numerous possibilities
• UF is a member of OpenSFS (Open Scalable File Systems)

Lustre I

• Open-source parallel file system implementation (GPL v2)
• POSIX compliant
• High performance and open licensing: used in clusters from small workgroups to the largest supercomputing centers
• Horizontally scalable in capacity and throughput
  ◦ Additional storage and nodes can be added "online"
  ◦ Scales to very large numbers of clients

Lustre II

• Supports high-performance RDMA networking devices
  ◦ RDMA: Remote Direct Memory Access
  ◦ The NIC transfers data directly to/from application memory
  ◦ Implemented in the InfiniBand interconnect, widely used in HPC environments
• Built-in high-availability support
  ◦ If a server goes down, another server takes over
• Lustre has a proven track record and a decade of in-house expertise at UF

Lustre – Who Uses It?

• Largest market share in HPC systems
• Primary file system of the "petascale club"
  ◦ 50-70% of the top 10, 100 and 500 entries at top500.org in recent years
• Some very large deployments with many thousands or tens of thousands of compute nodes
  ◦ ORNL, Blue Waters (NCSA), LLNL, Stampede (TACC), …
• Top-tier HPC vendors use Lustre
  ◦ DDN, Xyratex, Fujitsu, HP, Bull, ...
• Minimal hardware requirements: commodity compute and I/O hardware (any block storage device)

Lustre – Basic Architecture

[Diagram: Lustre basic architecture. A commodity pool of Object Storage Servers (OSSs, 100's) and a pair of Metadata Servers (MDS, active/standby) serve Lustre clients (10's - 10,000's) over InfiniBand, Ethernet and GigE via routers; Lustre supports multiple network types simultaneously, and shared enterprise-class storage arrays and SAN fabrics enable failover between servers.]

Lustre Components - Metadata Server & Target

• Metadata Server (MDS)
  ◦ Server node with storage and a NIC
  ◦ Manages the namespace only: filenames, directories, access permissions (ACLs) and file layout, no data
  ◦ Determines the file layout on OSTs at creation time; default policies balance I/O and space usage over OSSs/OSTs (see the sketch after this slide)
  ◦ Only involved in pathname/permission checks and the initial lookup of the file layout, not in any file I/O operations, which avoids I/O scalability bottlenecks on the metadata server
• Metadata Target (MDT): a local filesystem
  ◦ Storage device hosted by the MDS
  ◦ Ldiskfs backing file system
  ◦ Generally a RAID1+0 device

Lustre Components - Servers

• Object Storage Server (OSS)
  ◦ Server node that hosts storage device(s) and NIC(s)
  ◦ Manages object data and attributes
  ◦ Manages RDMA/zero-copy from clients/routers
  ◦ Typically serves 2-8 OSTs, each OST managing a single local disk filesystem
  ◦ Files are read and written as byte ranges of objects on the OSTs
  ◦ Can be configured in active-active mode for failover
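As a small illustration of the layout decision made at creation time, the sketch below uses the standard lfs utility on a client; the directory path and stripe settings are placeholders, not the defaults used on our filesystems.

    # Set a default layout on a hypothetical directory: stripe new files
    # across 4 OSTs with a 1 MiB stripe size (example values only).
    lfs setstripe -c 4 -S 1M /cms/store/example_dir

    # Create a file; the MDS chooses the OSTs for its objects at this point.
    dd if=/dev/zero of=/cms/store/example_dir/testfile bs=1M count=100

    # Show which OSTs hold the file's objects; no data passes through the MDS.
    lfs getstripe /cms/store/example_dir/testfile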

Lustre Components - Object Storage Targets

• Object Storage Target (OST)
  ◦ Formatted storage device hosted by an OSS
  ◦ Ldiskfs ("ext4+") file systems; the on-disk format is hidden from clients
  ◦ Each OSS can have multiple OSTs; RAID5 or RAID6 arrays are common
  ◦ OSTs are independent, enabling linear scaling: parallel I/O operations go directly to the OSTs
  ◦ Optimized for large I/O (1 MiB RPCs)
  ◦ Client writeback cache allows I/O aggregation (see the parameters below)
  ◦ Usually hardware RAID devices, though any block device will work
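The large-RPC and writeback-cache behaviour can be inspected on any client with lctl; the parameters below are standard Lustre 2.x client tunables, shown as a generic example rather than our production settings.

    # Inspect client-side RPC and writeback cache settings (osc.* = all OSTs).
    lctl get_param osc.*.max_pages_per_rpc    # 256 x 4 KiB pages = 1 MiB RPCs
    lctl get_param osc.*.max_dirty_mb         # per-OST writeback cache limit
    lctl get_param osc.*.max_rpcs_in_flight   # concurrent RPCs towards each OST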

/cms

• Growing all the time, heterogeneous hardware
• InfiniBand QDR/FDR at 40 Gbps, IB/Ethernet bridge at 20 Gbps
• Five types of OSTs: 7.2, 21.8, 21.5, 28.6, 29.1 TB usable
• 19 OSS nodes (3 more coming soon; older nodes 1-5 being reused, non-production), 135 OSTs
• Older OSTs: RAID5 ("4+1") x 2 TB drives
• Newer OSTs (six per OSS): RAID6 ("8+2") x 4-6 TB drives
• 1.7 Petabytes usable, soon 2+ PB
• Heavy loads, ~ 85-94 % full (per-OST usage can be checked as shown below)
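For reference, the overall and per-OST capacity and fullness quoted above can be read off with the standard lfs df command on any client; the mount point is the only assumption here.

    # Capacity and usage per MDT/OST and for the filesystem as a whole.
    lfs df -h /cms
    # Inode usage, useful for spotting MDT exhaustion.
    lfs df -i /cms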

/lfs/scratch

• 24 OSS nodes, 144 OSTs
• Active-active failover (see the sketch below)
• 128 GB RAM per OSS (read cache)
• Six RAID6 ("8+2") OSTs per OSS
• 1440 2 TB drives in total
• 2.0 Petabytes usable
• ~50 GB/s streaming reads/writes observed
• ~100k random IOPS for small I/Os
• Highly available
• FDR InfiniBand interconnect (56 Gb/s)
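A minimal sketch of how active-active OST service and a failover-aware client mount are typically expressed in Lustre 2.x; the NIDs, device path and index are placeholders and do not describe our actual configuration.

    # Format an OST that can be served by either OSS of a failover pair
    # (hypothetical NIDs and device path).
    mkfs.lustre --ost --fsname=lfs --index=0 \
        --mgsnode=mgs@o2ib \
        --servicenode=oss01@o2ib --servicenode=oss02@o2ib \
        /dev/mapper/ost0000

    # Client mount listing both MGS NIDs, so the mount survives an MGS failover.
    mount -t lustre mds01@o2ib:mds02@o2ib:/lfs /lfs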

Filesystem Snapshots

• /cms: ~ 86 % full, heavy I/O
• /lfs: ~ 18 % full, varying I/O

Read/Write Performance

• A dedicated set of shell scripts monitors the I/O performance (a sketch follows below)
• Each OST is visited in turn to write a file (usually 4.5 GB) and then again to read it back
• Typical runs cover 50 cycles (~ 2 days)
• The collected data are stored and analyzed in ROOT trees
• Plots show the OST read/write performance and its time dependence
• Both /cms and /lfs are monitored
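A minimal sketch of the per-OST write/read loop described above, assuming a hypothetical test directory and file size; the production scripts additionally record the timings and feed them into ROOT trees.

    #!/bin/bash
    # Visit each OST in turn: write a test file pinned to that OST, read it back,
    # and report the elapsed times. FS, TESTDIR and SIZE_MB are placeholders.
    FS=/cms
    TESTDIR=$FS/monitor
    SIZE_MB=4500                               # ~4.5 GB test file
    NOSTS=$(lfs df $FS | grep -c OST)          # number of OSTs in the filesystem

    for ((i = 0; i < NOSTS; i++)); do
        f=$TESTDIR/ost_${i}.dat
        lfs setstripe -c 1 -i $i $f            # create the file on OST index i
        /usr/bin/time -f "OST $i write: %e s" \
            dd if=/dev/zero of=$f bs=1M count=$SIZE_MB oflag=direct 2>&1
        /usr/bin/time -f "OST $i read:  %e s" \
            dd if=$f of=/dev/null bs=1M iflag=direct 2>&1
        rm -f $f
    done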

Read performance /cms: variations under load


Write performance /cms: variations under load


Read performance /lfs: stable


Write performance /lfs: stable


Lustre over the WAN

• Can serve a regional T2/T3 cluster
• Lustre 2.X used
• Lustre clients over the WAN: RTT 14-50 ms (a mount sketch follows below)
• Authentication
• Tests between UF and FIU
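For orientation, a remote T3 client typically mounts the filesystem with the ordinary Lustre client mount over TCP; the MGS host name below is a placeholder, and routing and authentication details are site specific.

    # On a WAN client with the Lustre client modules loaded
    # (hypothetical MGS host; an IB-connected site would use @o2ib instead of @tcp).
    mount -t lustre mgs.hpc.ufl.example@tcp:/cms /cms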


Outlook

• Lustre has been deployed and used successfully at the Florida HPC center
  ◦ A decade of experience
• Two large filesystems (1.7 and 2.0 PB) are the main storage "workhorses"
  ◦ They host the CMS data, ~ 90 % full, under heavy loads
• Lustre can be deployed over the WAN to serve a regional T2 and several T3 centers
• Many new features in Lustre 2.X, active development (even Hadoop MapReduce over Lustre)


/cms MDS Server

• Dual quad-core E5520 2.3 GHz system with 24 GB of memory
• I/O goes through a QDR InfiniBand link at 40 Gbps
• The MDT data is stored on a hardware-mirrored RAID1 built from a pair of Intel X25-M solid-state drives
• We back up snapshots of the MDT data, so if the MDS server goes down we can rebuild it on new hardware within a few hours (a sketch follows below)
• In 5 years of continuous service, this MDS server never went down except during unavoidable power, network or cooling outages
• A replacement on new hardware is planned for the next round of purchases
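One possible way to take such an MDT snapshot is a raw device-level copy while the target is stopped, as sketched below; the device and paths are placeholders and this is not necessarily the exact procedure used at UF.

    # Stop the MDT (e.g. during a maintenance window or after failover),
    # copy the mirrored SSD volume, then restart it. Paths are placeholders.
    umount /mnt/lustre-mdt
    dd if=/dev/md0 of=/backup/cms-mdt-$(date +%F).img bs=4M
    mount -t lustre /dev/md0 /mnt/lustre-mdt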
