Lustre: the Intergalactic File System for the National Labs?
Total Page:16
File Type:pdf, Size:1020Kb
Lustre: the Intergalactic File System for the National Labs? Peter J. Braam [email protected] http://www.clusterfilesystems.com Cluster File Systems, Inc 6/10/2001 1 Cluster File Systems, Inc… n Goals n Consulting & development n Storage and file systems n Open source n Extreme level of expertise n Leading n InterMezzo – high availability file system n Lustre – next generation cluster file system n Important role in Coda, UDF and Ext3 for Linux 2 - 6/13/2001 Partners… n CMU n National Labs n Intel 3 - 6/13/2001 Talk overview n Trends n Next generation data centers n Key issues in cluster file systems n Lustre n Discussion 4 - 6/13/2001 Trends… 5 - 6/13/2001 Hot animals… n NAS n Cheap servers deliver a lot of features n NFS v4 n Finally, NFS is getting it right… n Security, concurrency n DAFS n High level storage protocol over VI storage network n Open Source OS n Best of breed file systems (XFS, JFS, Reiser) 6 - 6/13/2001 DAFS / NFS v4 DAFS server NFS v4 server high level fast & efficient client client storage protocol concurrency control with notifications to clients 7 - 6/13/2001 Key question n How to use the parts… 8 - 6/13/2001 Next generation data centers 9 - 6/13/2001 Security and Resource Cluster Control Databases Clients Access Control Coherence Management D a Storage Area Network ta (FC, GigE, IB) T Storage ra n s Management fe rs Object Storage Targets 10 - 6/13/2001 Orders of magnitude n Clients (aka compute servers) n 10,000’s n Storage controllers n 1000’s to control PB’s of storage (PB = 10**15 Bytes) n Cluster control nodes n 10’s n Aggregate bandwidth n 100’s GB/sec 11 - 6/13/2001 Applications n Scientific computing n Bio Informatics n Rich media n Entire ISP’s 12 - 6/13/2001 Key issues 13 - 6/13/2001 Scalability n I/O throughput n How to avoid bottlenecks n Meta data scalability n How can 10,000’s of nodes work on files in same folder n Cluster recovery n If something fails, how can transparent recovery happen n Management n Adding, removing, replacing, systems; data migration & backup 14 - 6/13/2001 Features n The basics… n Recovery, management, security n The desired… n Gene computations on storage controllers n Data mining for free n Content based security n … n The obstacle… n An almost 30 year old pervasive block device protocol 15 - 6/13/2001 Look back n Andrew Project at CMU n 80’s – file servers with 10,000 clients (CMU campus) n Key question: how to reduce foot print of client on server n By 1988 entire campus on AFS n Lustre n Scalable clusters? n How to reduce cluster footprint of shared resources (scalability) n How to subdivide bottlenecked resources (parallelism) 16 - 6/13/2001 Lustre Intelligent Object Storage http://www.http://www.lustrelustrelustre.org.org 17 - 6/13/2001 What is Object Based Storage?? n Object Based Storage Device n More intelligent than block device n Speak storage at “inode level” n create, unlink, read, write, getattr, setattr n iterators, security, almost arbitrary processing n So… n Protocol allocates physical blocks, no names for files n Requires n Management & security infrastructure 18 - 6/13/2001 Project history n Started between CMU – Seagate – Stelias Computing n Another road to NASD style storage n NASD now at Panasas – originated many ideas n Los Alamos n More research n Nearly built little object storage controllers n Currently looking at genomics applications n Sandia, Tri-Labs n Can Lustre meet the SGS-FS requirements? 19 - 6/13/2001 Components of OB Storage n Storage Object Device Drivers n class drivers – attach driver to interface n Targets, clients – remote access n Direct drivers – to manage physical storage n Logical drivers – for intelligence & storage management n Object storage applications: n (cluster) file systems n Advanced storage: parallel I/O, snapshots n Specialized apps: caches, db’s, filesrv 20 - 6/13/2001 Object File System Monolithic Buffer cache File system Object File System: Page Object based Cache storage device • file/dir data: lookup • set/read attrs Object • all allocation • remainder:ask obsd Device • all persistence Methods 21 - 6/13/2001 Accessing objects n Session n connect to the object storage target, present security token n Mention object id n Objects have a unique (group,id) n Perform operation n So that’s what the object file system does! 22 - 6/13/2001 Objects may be files, or not… n Common case: n Object, like inode, represents a file n Object can also: n represent a stripe (RAID) n bind an (MPI) File_View n redirect to other objects 23 - 6/13/2001 Snapshots as logical module Present multiple /current/ Object File System views of file systems /yesterday/ objectX Versioned objects: Snapshot / versioning follow redirectors logical driver objY objZ Access to raw objects Direct object driver 7am 9am bla bla bla bla 24 - 6/13/2001 System Interface n Modules n Load the kernel modules to get drivers of a certain type n Name devices to be of a certain type n Build stacks of devices with assigned types n For example: n insmod obd_xfs ; obdcontrol dev=obd1,type=xfs n insmod obd_snap ; obdcontrol current=obd2,old=obd3,driver=obd1 n insmod obdfs ; mount –t obdfs –o dev=obd3 /mnt/old 25 - 6/13/2001 Clustered Object Clustered Object Based File System Based File System on host A on host B OBD Client Driver OBD Client Driver Type RPC Type VIA Fast storage networking Shared object storage OBD Target OBD Target Type RPC Type VIA Direct OBD 26 - 6/13/2001 Storage target Implementations n Classical storage targets n Controller – expensive, redundant, proprietary n EMC: as sophisticated & feature rich as block storage can get n A bunch of disks n Lustre target n Bunch of disks n Powerful (SMP, multiple busses) commodity PC n Programmable/Secure n Could be done on disk drives but… 27 - 6/13/2001 Inside the storage controller… FC/ SCSI dual path JBOD SMP Linux Load SMP Linux RAID Balance RAID on-controller logic & on-controller logic networking Failover networking Multipath Multipath Load balanced Load balanced Storage network Storage network 28 - 6/13/2001 interface Objects in clusters… 29 - 6/13/2001 Lustre clusters Lustre Cluster Control Nodes (10’s) Lustre Locate Object Storage Targets Lustre clients Device & object (10,000’s) Meta Data transactions SAN – low latency client / storage interactions 3rd Party File & Object I/O TCP/IP file and security protocols GSS API Compatible Key/token management InterMezzo clients remote access • Private namespace (extreme security) • Persistent cache clients (mobility) HSM backends for Lustre/DMAPI solution 30 - 6/13/2001 Cluster control nodes n Database of references to objects n E.g. Lustre File System n Directory data n Points to objects that contain stripes/extents of files n More generally n Use a database of references to objects n Write object applications that access the objects directly n LANL asked for gene processing on the controllers 31 - 6/13/2001 Examples of logical modules n Tri-Lab/NSA: SGS – File system (see next slide) n Storage management, security n Parallel I/O for scientific computation n Other requests: n Data mining while target is idle n LANL: gene sequencing in object modules n Rich media industry: prioritize video streams 32 - 6/13/2001 MPI-IO interface MPI POSIX file types Collective & interface shared I/O ADIO-OBD adaptor Lustre client FS metadata sync auditing Client node object storage client networking SGS – File System SAN Object Layering object storage target networking synchronization content based collective I/O authorization data-migration aggregation Storage target node XFS-based OSD XFS-based OSD 33 - 6/13/2001 Other applications… n Genomics n Can we reproduce the previous slide? n Data mining n Can we exploit idle cycles and disk geometry? n Rich media n What storage networking helps streaming? 34 - 6/13/2001 Why Lustre… n It’s fun, it’s new n Infrastructure for storage target based computing n Storage management: components – not monolithic n File system snapshots, raid, backup, hot migration, resizing n Much simpler n File System: n Clustering FS considerably simpler, more scalable n But: close to NFS v4 and DAFS in several critical ways 35 - 6/13/2001 And finally – the reality, what exists… n At http://www.lustre.org (everything GPL’d) n Prototypes: n Direct driver for ext2 objects, Snapshot logical driver, n Management infrastructure, Object file system n Current happenings: n Collaboration with Intel Enterprise Architecture LAB: n They are building Lustre storage networking (DAFS, RDMA, TUX) n The grand scheme of things has been planned and is moving n Also on the WWW: n OBD storage specification n Lustre SGS – File System implementation plan. 36 - 6/13/2001 Linux clusters 37 - 6/13/2001 Clusters - purpose Require: n A scalable almost single system image n Fail-over capability n Load-balanced redundant services n Smooth administration 38 - 6/13/2001 Ultimate Goal n Provide generic components n OPEN SOURCE n Inspiration: VMS VAX Clusters n New: n Scalable (100,000’s nodes) n Modular n Need distributed, cluster & parallel FS’s n InterMezzo, GFS/Lustre, POBIO-FS 39 - 6/13/2001 Technology Overview Modularized VAX cluster architecture (Tweedie) Core Support Clients Transition Cluster db Distr. Computing Integrity Quorum Cluster Admin/Apps Link Layer Barrier Svc Cluster FS & LVM Channel Layer Event system DLM 40 - 6/13/2001 Events n Cluster transition: n Whenever connectivity changes n Start by electing “cluster controller” n Only merge fully connected sub-clusters n Cluster id: counts “incarnations” n Barriers: n Distributed synchronization points n Partial implementations available: n Ensemble, KimberLite, IBM-DLM, Compaq Cluster Mgr 41 - 6/13/2001 Scalability – e.g.