Lustre: the Intergalactic File System for the National Labs?

Peter J. Braam [email protected] http://www.clusterfilesystems.com

Cluster File Systems, Inc. (6/10/2001)

Cluster File Systems, Inc…

- Goals
  - Consulting & development
  - Storage and file systems
  - Open source
  - Extreme level of expertise
- Leading
  - InterMezzo – high-availability file system
  - Lustre – next-generation cluster file system
- Important role in …, UDF, and …

Partners…

- CMU
- National Labs
- Intel

Talk overview

- Trends
- Next-generation data centers
- Key issues in cluster file systems
- Lustre
- Discussion

Trends…

Hot animals…

- NAS
  - Cheap servers deliver a lot of features
- NFS v4
  - Finally, NFS is getting it right…
  - Security, concurrency
- DAFS
  - High-level storage protocol over a VI storage network
- Open Source OS
  - Best-of-breed file systems (XFS, JFS, Reiser)

DAFS / NFS v4

[Diagram: a client per server. DAFS server: a high-level, fast & efficient storage protocol. NFS v4 server: concurrency control with notifications to clients.]

Key question

- How to use the parts…

Next-generation data centers

[Diagram: clients connect to cluster control databases that handle security and resource control, access control, and coherence management; data transfers flow over a Storage Area Network (FC, GigE, IB) to storage targets under storage management.]

Orders of magnitude

- Clients (a.k.a. compute servers): 10,000's
- Storage controllers: 1,000's, controlling PB's of storage (1 PB = 10^15 bytes)
- Cluster control nodes: 10's
- Aggregate bandwidth: 100's of GB/sec
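Taken together, these numbers imply only about 100 MB/s per storage target (100 GB/s spread across 1,000 targets) and about 10 MB/s per client on average: commodity rates per box, with the scale coming from parallelism rather than heroic individual machines.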

Applications

- Scientific computing
- Bioinformatics
- Rich media
- Entire ISPs

Key issues

Scalability

- I/O throughput: how to avoid bottlenecks?
- Metadata scalability: how can 10,000's of nodes work on files in the same folder?
- Cluster recovery: if something fails, how can recovery happen transparently?
- Management: adding, removing, and replacing systems; data migration & backup

Features

- The basics…
  - Recovery, management, security
- The desired…
  - Gene computations on the storage controllers
  - Data mining for free
  - Content-based security
  - …
- The obstacle…
  - An almost 30-year-old, pervasive block-device protocol

Look back

- Andrew Project at CMU
  - 1980's – file servers with 10,000 clients (the CMU campus)
  - Key question: how to reduce the footprint of a client on the server
  - By 1988, the entire campus was on AFS
- Lustre
  - Scalable clusters?
  - How to reduce the cluster footprint of shared resources (scalability)
  - How to subdivide bottlenecked resources (parallelism)

Lustre

Intelligent Object Storage – http://www.lustre.org

What is Object-Based Storage?

- Object-Based Storage Device (OBSD)
  - More intelligent than a block device
  - Speaks storage at the "inode level": create, unlink, read, write, getattr, setattr
  - Iterators, security, almost arbitrary processing
- So…
  - The device allocates physical blocks itself; the protocol carries no names for files (see the sketch below)
- Requires
  - Management & security infrastructure
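A minimal sketch of what such an inode-level interface might look like; the names and signatures here are invented for illustration, not the actual OBD driver API:

    /* Hedged sketch of an "inode level" storage interface. The device
     * itself allocates and persists blocks; callers only name objects
     * and operate on them -- no pathnames, no block addresses. */
    #include <stddef.h>
    #include <stdint.h>
    #include <sys/types.h>

    struct obd_device;                  /* opaque device handle */
    typedef uint64_t obd_id;            /* object identifier    */

    struct obd_attr {
        uint64_t size;
        uint32_t mode, uid, gid;
    };

    struct obsd_ops {
        int     (*create)(struct obd_device *d, obd_id *new_id);
        int     (*unlink)(struct obd_device *d, obd_id id);
        ssize_t (*read)(struct obd_device *d, obd_id id,
                        void *buf, size_t len, uint64_t off);
        ssize_t (*write)(struct obd_device *d, obd_id id,
                         const void *buf, size_t len, uint64_t off);
        int     (*getattr)(struct obd_device *d, obd_id id,
                           struct obd_attr *a);
        int     (*setattr)(struct obd_device *d, obd_id id,
                           const struct obd_attr *a);
    };

Note what is absent: there is no pathname anywhere; naming lives in the file system above this line, allocation and persistence below it.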

Project history

- Started between CMU, Seagate, and Stelias Computing
  - Another road to NASD-style storage
  - NASD is now at Panasas – it originated many of these ideas
- Los Alamos
  - More research
  - Nearly built small object storage controllers
  - Currently looking at genomics applications
- Sandia, Tri-Labs
  - Can Lustre meet the SGS-FS requirements?

Components of OB Storage

- Storage object device drivers
  - Class drivers – attach a driver to an interface
  - Targets, clients – remote access
  - Direct drivers – manage physical storage
  - Logical drivers – intelligence & storage management (see the stacking sketch below)
- Object storage applications
  - (Cluster) file systems
  - Advanced storage: parallel I/O, snapshots
  - Specialized apps: caches, databases, file servers
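The stacking idea can be sketched as every driver layer exporting the same operation table plus a pointer to the device below it; the structures are invented for illustration, not the real driver code:

    #include <stdint.h>

    typedef uint64_t obd_id;
    struct obd_attr;
    struct obd_device;

    /* Every layer, direct or logical, exports the same operations. */
    struct obd_ops {
        int (*getattr)(struct obd_device *d, obd_id id, struct obd_attr *a);
        /* ... create, unlink, read, write, setattr in the same style ... */
    };

    struct obd_device {
        const struct obd_ops *ops;   /* this layer's methods            */
        struct obd_device *below;    /* next driver down; NULL for a
                                        direct driver on physical store */
        void *private_data;          /* per-layer state                 */
    };

    /* A logical layer (snapshot, RAID, ...) reinterprets some calls
     * and forwards the rest unchanged to the layer beneath it: */
    static int logical_getattr(struct obd_device *d, obd_id id,
                               struct obd_attr *a)
    {
        /* layer-specific work (e.g. redirecting a versioned id) here */
        return d->below->ops->getattr(d->below, id, a);
    }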

Object File System

[Diagram: a monolithic file system bundles the buffer cache and all file-system logic. An object file system keeps a page cache and handles file/dir data and lookup and setting/reading attributes, asking the OBSD for the remainder; the object-based storage device, through its object device methods, does all allocation and all persistence.]

Accessing objects

- Session
  - Connect to the object storage target and present a security token
- Name the object
  - Objects have a unique (group, id) pair
- Perform the operation
- So that's what the object file system does! (A sketch of the sequence follows.)
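Put together, the sequence might look like this; the connection API and helpers are stand-ins invented for the sketch, with stubs in place of real networking:

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* Hypothetical client-side session with one storage target. */
    struct obd_conn { const char *target; };

    struct obd_fid { uint64_t group, id; };  /* unique (group, id) */

    static struct obd_conn *obd_connect(const char *target,
                                        const char *token)
    {
        /* a real client would authenticate with the token here */
        (void)token;
        struct obd_conn *c = malloc(sizeof *c);
        if (c) c->target = target;
        return c;
    }

    static int obd_getattr(struct obd_conn *c, struct obd_fid fid)
    {
        printf("getattr on %s: object (%llu, %llu)\n", c->target,
               (unsigned long long)fid.group, (unsigned long long)fid.id);
        return 0;
    }

    int main(void)
    {
        /* 1. Session: connect to the target, presenting a token.  */
        struct obd_conn *c = obd_connect("ost0", "example-token");
        if (!c) return 1;
        /* 2. Name the object by (group, id); 3. perform the op.   */
        struct obd_fid fid = { .group = 1, .id = 4711 };
        int rc = obd_getattr(c, fid);
        free(c);
        return rc;
    }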

Objects may be files, or not…

- Common case
  - An object, like an inode, represents a file

- An object can also
  - Represent a stripe (RAID) – see the striping sketch below
  - Bind an (MPI) File_View
  - Redirect to other objects
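For the striping case the object's job is an address calculation; a sketch assuming simple round-robin (RAID-0 style) striping, not Lustre's actual layout code:

    #include <stdint.h>
    #include <stdio.h>

    /* Map a logical file offset to (stripe index, offset within that
     * stripe object) for round-robin striping. */
    struct stripe_loc { uint32_t stripe; uint64_t off; };

    static struct stripe_loc stripe_map(uint64_t logical_off,
                                        uint64_t stripe_size,
                                        uint32_t stripe_count)
    {
        uint64_t chunk = logical_off / stripe_size;     /* which chunk  */
        struct stripe_loc loc;
        loc.stripe = (uint32_t)(chunk % stripe_count);  /* which object */
        loc.off    = (chunk / stripe_count) * stripe_size
                   + logical_off % stripe_size;   /* offset in object   */
        return loc;
    }

    int main(void)
    {
        /* e.g. 1 MB stripes over 4 objects: where does byte 5,500,000 land? */
        struct stripe_loc l = stripe_map(5500000, 1 << 20, 4);
        printf("stripe %u, offset %llu\n",
               l.stripe, (unsigned long long)l.off);
        return 0;
    }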

Snapshots as a logical module

[Diagram: the object file system presents multiple views of the file system, e.g. /current/ and /yesterday/. A snapshot/versioning logical driver exposes versioned objects that follow redirectors: objectX redirects to objY and objZ, the 7am and 9am versions. A direct object driver underneath gives access to the raw objects.]
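A toy version of the redirector idea, assuming (purely for illustration) that a versioned object is a table mapping snapshot epochs to the raw objects holding each version:

    #include <stdint.h>
    #include <stdio.h>

    struct version { uint64_t epoch; uint64_t raw_obj_id; };

    struct versioned_obj {
        struct version v[8];   /* sorted by ascending epoch */
        int nversions;
    };

    /* Return the raw object visible at time 'epoch': the newest
     * version that is not newer than the snapshot being presented. */
    static uint64_t redirect(const struct versioned_obj *o, uint64_t epoch)
    {
        uint64_t best = 0;
        for (int i = 0; i < o->nversions; i++)
            if (o->v[i].epoch <= epoch)
                best = o->v[i].raw_obj_id;
        return best;  /* 0 = object did not exist yet */
    }

    int main(void)
    {
        /* objectX with a 7am version (objY) and a 9am version (objZ) */
        struct versioned_obj x = { { { 7, 101 }, { 9, 102 } }, 2 };
        printf("/yesterday/ (8am view) -> raw object %llu\n",
               (unsigned long long)redirect(&x, 8));   /* objY */
        printf("/current/ (9am+ view) -> raw object %llu\n",
               (unsigned long long)redirect(&x, 9));   /* objZ */
        return 0;
    }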

System Interface

- Modules
  - Load kernel modules to get drivers of a certain type
  - Name devices to be of a certain type
  - Build stacks of devices with assigned types
- For example:

    insmod obd_xfs ; obdcontrol dev=obd1,type=…
    insmod obd_snap ; obdcontrol current=obd2,old=obd3,driver=obd1
    insmod obdfs ; mount -t obdfs -o dev=obd3 /mnt/old

[Diagram: clustered object-based file systems on host A and host B, each over an OBD client driver (type RPC on host A, type VIA on host B). Fast storage networking connects them to shared object storage: OBD targets of type RPC and type VIA, layered over a direct OBD.]

Storage target implementations

- Classical storage targets
  - Controllers – expensive, redundant, proprietary
  - EMC: as sophisticated & feature-rich as block storage can get
  - A bunch of disks
- Lustre target
  - A bunch of disks
  - A powerful (SMP, multiple busses) commodity PC
  - Programmable/secure
  - Could be done on the disk drives themselves, but…

Inside the storage controller…

[Diagram: a dual-path FC/SCSI JBOD shared by two SMP Linux controllers, each running RAID, on-controller logic, and networking; the pair provides load balancing and failover, with multipath, load-balanced links to the storage network interface.]

Objects in clusters…

Lustre clusters

[Diagram: Lustre clients (10,000's) contact Lustre cluster control nodes (10's) for metadata transactions and to locate devices & objects, and reach Lustre object storage targets directly over a SAN for low-latency client/storage interactions and third-party file & object I/O. TCP/IP carries the file and security protocols, with GSS-API-compatible key/token management. InterMezzo clients get remote access: a private namespace (extreme security) and persistent client caches (mobility). HSM backends give a Lustre/DMAPI solution.]

Cluster control nodes

- A database of references to objects
- E.g., the Lustre file system
  - Directory data points to the objects that contain the stripes/extents of files (sketched below)
- More generally
  - Use a database of references to objects
  - Write object applications that access the objects directly
  - LANL asked for gene processing on the controllers
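A sketch of what such a reference-database record might hold in the file-system case; the layout and names are hypothetical, following the "directory data points to objects" bullet:

    #include <stdint.h>

    /* Hypothetical metadata record kept on a cluster control node:
     * a name maps not to disk blocks but to the objects holding the
     * file's stripes/extents on the storage targets. */
    #define MAX_STRIPES 8

    struct obj_ref {              /* one stripe of the file         */
        uint32_t target;          /* which object storage target    */
        uint64_t group, id;       /* object identity on that target */
    };

    struct mds_dirent {
        char     name[256];       /* directory entry                */
        uint32_t stripe_count;
        uint64_t stripe_size;
        struct obj_ref stripes[MAX_STRIPES];
    };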

Examples of logical modules

- Tri-Lab/NSA: SGS file system (see the next slide)
  - Storage management, security
  - Parallel I/O for scientific computation
- Other requests
  - Data mining while the target is idle
  - LANL: gene sequencing in object modules
  - Rich media industry: prioritizing video streams

SGS – File System

[Diagram: on the client node, an MPI-IO interface (MPI file types, collective & shared I/O) and a POSIX interface sit on an ADIO-OBD adaptor and the Lustre client FS (metadata sync, auditing), over object storage client networking. Across the SAN, object layering on the storage target node provides object storage target networking, synchronization, content-based authorization, collective I/O, data migration, and aggregation on top of XFS-based OSDs.]

Other applications…

- Genomics: can we reproduce the previous slide?
- Data mining: can we exploit idle cycles and disk geometry?
- Rich media: what storage networking helps streaming?

Why Lustre…

- It's fun, it's new
- Infrastructure for storage-target-based computing
- Storage management: components, not monolithic
  - File-system snapshots, RAID, backup, hot migration, resizing
  - Much simpler
- File system
  - A clustering FS that is considerably simpler and more scalable
  - But: close to NFS v4 and DAFS in several critical ways

And finally – the reality, what exists…

- At http://www.lustre.org (everything GPL'd)
- Prototypes
  - Direct driver for objects, snapshot logical driver
  - Management infrastructure, object file system
- Current happenings
  - Collaboration with the Intel Enterprise Architecture Lab: they are building Lustre storage networking (DAFS, RDMA, TUX)
  - The grand scheme of things has been planned and is moving
- Also on the web
  - OBD storage specification
  - Lustre SGS file system implementation plan

Linux clusters

Clusters – purpose

Require:
- A scalable, almost single-system image
- Fail-over capability
- Load-balanced redundant services
- Smooth administration

Ultimate Goal

- Provide generic components
  - OPEN SOURCE
- Inspiration: VMS VAX Clusters
- New
  - Scalable (100,000's of nodes)
  - Modular
- Need distributed, cluster & parallel FSs
  - InterMezzo, GFS/Lustre, POBIO-FS

Technology Overview

Modularized VAX cluster architecture (Tweedie):

Core          | Support      | Clients
--------------|--------------|--------------------
Transition    | Cluster db   | Distr. computing
Integrity     | Quorum       | Cluster admin/apps
Link layer    | Barrier svc  | Cluster FS & LVM
Channel layer | Event system | DLM

Events

- Cluster transition
  - Happens whenever connectivity changes
  - Starts by electing a "cluster controller"
  - Only fully connected sub-clusters are merged
  - The cluster id counts "incarnations"
- Barriers
  - Distributed synchronization points (toy sketch below)
- Partial implementations available
  - Ensemble, KimberLite, IBM-DLM, Compaq Cluster Mgr
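A toy sketch of the barrier semantics: nobody passes a named synchronization point until every expected participant has arrived. The real service does this count across nodes through its messaging layer; here it is one process, purely for illustration:

    #include <stdio.h>

    struct barrier {
        const char *name;
        int expected;   /* participants that must arrive */
        int arrived;
    };

    /* Returns 1 when the barrier fires (last arrival), 0 while waiting. */
    static int barrier_arrive(struct barrier *b, const char *node)
    {
        b->arrived++;
        printf("%s arrived at barrier %s (%d/%d)\n",
               node, b->name, b->arrived, b->expected);
        return b->arrived == b->expected;
    }

    int main(void)
    {
        struct barrier b = { "recovery-phase-1", 3, 0 };
        barrier_arrive(&b, "nodeA");
        barrier_arrive(&b, "nodeB");
        if (barrier_arrive(&b, "nodeC"))
            printf("barrier %s released: all nodes proceed\n", b.name);
        return 0;
    }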

Scalability – e.g. a Red Hat cluster

[Diagram: a SAN connects three core clusters, /redhat/usa, /redhat/scotland, and /redhat/canada, each fronted by a peer node P.]

- P = peer
  - Proxy for the remote core cluster
  - Involved in recovery
- File service
  - Cluster FS within a cluster
  - Clustered Samba/Coda etc.
- Communication
  - Point-to-point within core clusters
  - Routable within the cluster
  - Hierarchical flood-fill
- Other stuff
  - Membership / recovery
  - DLM / barrier service
  - Cluster admin tools

InterMezzo

http://www.inter-mezzo.org

Target

- Replicate or cache directories
  - Automatic synchronization
  - Disconnected operation
  - Proxy servers
  - Scalable
- Purpose
  - Entire system binaries, laptop/desktop
  - Redundant object storage controllers
- Very simple
  - Coda-style protocols
  - Wraps around local file systems as a cache

Basic InterMezzo

[Diagram: operations (mkdir, create, rmdir, unlink, link, …) pass from the VFS through Presto, a filter over the local file system that asks "is the data fresh?"; if not, it requests the file-system object from Lento, the cache manager / file server. Mutations are appended to the Kernel Modification Log (KML), a disk file; Lento sends out the KML for replay and fetches the KML to sync up.]
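A hedged sketch of the KML mechanism: mutations are appended as records to a disk file and replayed on the peer. The record layout below is invented for illustration and is not InterMezzo's actual format:

    #include <stdint.h>
    #include <stdio.h>

    /* Illustrative KML record: each file-system mutation is logged so
     * Lento can ship the log and replay it on a replica. */
    enum kml_op { KML_MKDIR, KML_CREATE, KML_RMDIR, KML_UNLINK, KML_LINK };

    struct kml_rec {
        uint64_t recno;          /* position in the log             */
        enum kml_op op;
        char path[256];          /* object the operation applies to */
    };

    /* Replay one record against the local file system (stubbed). */
    static void kml_replay(const struct kml_rec *r)
    {
        const char *names[] =
            { "mkdir", "create", "rmdir", "unlink", "link" };
        printf("replay #%llu: %s %s\n",
               (unsigned long long)r->recno, names[r->op], r->path);
    }

    int main(void)
    {
        /* Records fetched from the server's KML, applied in order: */
        struct kml_rec log[] = {
            { 1, KML_MKDIR,  "/cache/dir"      },
            { 2, KML_CREATE, "/cache/dir/file" },
        };
        for (unsigned i = 0; i < sizeof log / sizeof log[0]; i++)
            kml_replay(&log[i]);
        return 0;
    }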

Distributed Lock Manager

IBM released the HACMP DLM (VAX-style) as open source:
http://www.ibm.com/developerworks/open/source

Locks & resources

- Purpose: a generic, rich lock service
  - Will subsume "callbacks", "leases", etc.
- Lock resources: a resource database
  - Resources are organized in trees
  - Most lock traffic is local, giving high performance
  - The node that acquires a resource manages its tree

Typical simple lock sequence

[Diagram, as a sequence:]
1. Sys A holds a lock on R; Sys B needs a lock on R.
2. Sys B asks the resource database "who has R?"; the answer is Sys A.
3. B's request blocks; the blocking request ("I want a lock on R") triggers the owning process on A.
4. The owning process releases its lock.
5. The lock is granted to Sys B.

A few details…

- Six lock modes (compatibility sketch below)
- Acquisition of locks
- Promotion of locks
- Compatibility of locks
- First lock acquisition: the holder will manage the resource tree
- Mastering resources: remotely managed, with a copy kept at the owner
- Callbacks
  - On blocking requests
  - On release, on acquisition
- Recovery (simplified), depending on what the dead node was doing:
  - If it was mastering resources: re-master the resource
  - If it was owning locks: drop its zombie locks
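The six modes in a VAX-style DLM are NL, CR, CW, PR, PW, and EX; the sketch below shows the classic compatibility matrix and the grant-or-block decision that drives the blocking callbacks. This is standard VAX DLM semantics, assumed rather than taken from this deck:

    #include <stdio.h>

    /* Classic VAX DLM lock modes. compatible[held][requested] is 1
     * when a new lock in mode `requested` can be granted while
     * another holder has mode `held`. */
    enum mode { NL, CR, CW, PR, PW, EX };

    static const int compatible[6][6] = {
        /*         NL CR CW PR PW EX */
        /* NL */ {  1, 1, 1, 1, 1, 1 },
        /* CR */ {  1, 1, 1, 1, 1, 0 },
        /* CW */ {  1, 1, 1, 0, 0, 0 },
        /* PR */ {  1, 1, 0, 1, 0, 0 },
        /* PW */ {  1, 1, 0, 0, 0, 0 },
        /* EX */ {  1, 0, 0, 0, 0, 0 },
    };

    int main(void)
    {
        /* A protected-read holder blocks a protected-write request,
         * which is what fires the blocking callback on the holder: */
        printf("PR held, PW requested: %s\n",
               compatible[PR][PW] ? "grant" : "block (fire callback)");
        /* A concurrent-read holder does not block a protected write: */
        printf("CR held, PW requested: %s\n",
               compatible[CR][PW] ? "grant" : "block (fire callback)");
        return 0;
    }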
