Lustre: the Intergalactic File System for the National Labs?
Peter J. Braam [email protected] http://www.clusterfilesystems.com
Cluster File Systems, Inc. 6/10/2001

Cluster File Systems, Inc…
- Goals
  - Consulting & development
  - Storage and file systems
  - Open source
  - Extreme level of expertise
- Leading
  - InterMezzo – high availability file system
  - Lustre – next generation cluster file system
- Important role in Coda, UDF and Ext3 for Linux
2 - 6/13/2001

Partners…

- CMU
- National Labs
- Intel
Talk overview

- Trends
- Next generation data centers
- Key issues in cluster file systems
- Lustre
- Discussion
Trends…

Hot animals…

- NAS
  - Cheap servers deliver a lot of features
- NFS v4
  - Finally, NFS is getting it right…
  - Security, concurrency
- DAFS
  - High-level storage protocol over a VI storage network
- Open source OS
  - Best-of-breed file systems (XFS, JFS, Reiser)
DAFS / NFS v4

[Diagram: a DAFS server and an NFS v4 server, each exporting a high-level, fast & efficient storage protocol to clients, with concurrency control and notifications to clients]
Key question

- How to use the parts…
Next generation data centers

[Diagram: clients go through access control to cluster control nodes holding security and resource databases and providing coherence management; data transfers run over a storage area network (FC, GigE, IB) to object storage targets with storage management]
Orders of magnitude

- Clients (aka compute servers): 10,000's
- Storage controllers: 1000's, to control PB's of storage (PB = 10^15 bytes)
- Cluster control nodes: 10's
- Aggregate bandwidth: 100's of GB/sec
Applications

- Scientific computing
- Bioinformatics
- Rich media
- Entire ISPs
Key issues

Scalability

- I/O throughput: how to avoid bottlenecks?
- Metadata scalability: how can 10,000's of nodes work on files in the same folder?
- Cluster recovery: if something fails, how can transparent recovery happen?
- Management: adding, removing, and replacing systems; data migration & backup
Features

- The basics…
  - Recovery, management, security
- The desired…
  - Gene computations on storage controllers
  - Data mining for free
  - Content-based security
  - …
- The obstacle…
  - An almost 30-year-old pervasive block device protocol
Look back

- Andrew Project at CMU
  - 80's: file servers with 10,000 clients (the CMU campus)
  - Key question: how to reduce the footprint of a client on the server
  - By 1988 the entire campus was on AFS
- Lustre
  - Scalable clusters?
  - How to reduce the cluster footprint of shared resources (scalability)
  - How to subdivide bottlenecked resources (parallelism)
Lustre

Intelligent Object Storage
http://www.lustre.org
What is Object-Based Storage?

- Object-Based Storage Device
  - More intelligent than a block device
  - Speaks storage at the "inode level":
    - create, unlink, read, write, getattr, setattr
    - iterators, security, almost arbitrary processing
- So…
  - The protocol allocates physical blocks; there are no names for files
- Requires
  - Management & security infrastructure
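The operation set above can be sketched as a toy in-memory model (purely illustrative: the class and attribute names are my own, not Lustre's API). The point is that clients speak in object ids and attributes, never in block addresses or file names, and the device, not the client, allocates ids and storage:

```python
# Minimal sketch of an object-based storage device (OSD), assuming an
# in-memory store. A real OSD manages physical blocks behind this API.
class ObjectStorageDevice:
    def __init__(self):
        self._objects = {}   # (group, id) -> bytearray of object data
        self._attrs = {}     # (group, id) -> attribute dict
        self._next_id = 0

    def create(self, group=0):
        oid = (group, self._next_id)   # device allocates the id
        self._next_id += 1
        self._objects[oid] = bytearray()
        self._attrs[oid] = {"size": 0}
        return oid

    def unlink(self, oid):
        del self._objects[oid]
        del self._attrs[oid]

    def write(self, oid, offset, data):
        buf = self._objects[oid]
        if len(buf) < offset + len(data):
            buf.extend(b"\0" * (offset + len(data) - len(buf)))
        buf[offset:offset + len(data)] = data
        self._attrs[oid]["size"] = len(buf)

    def read(self, oid, offset, length):
        return bytes(self._objects[oid][offset:offset + length])

    def getattr(self, oid):
        return dict(self._attrs[oid])

    def setattr(self, oid, **attrs):
        self._attrs[oid].update(attrs)
```

Note that nothing here knows about directories or file names; in this model those live entirely in the file system layered above the device.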
Project history

- Started between CMU, Seagate, and Stelias Computing
  - Another road to NASD-style storage
  - NASD, now at Panasas, originated many of these ideas
- Los Alamos
  - More research
  - Nearly built little object storage controllers
  - Currently looking at genomics applications
- Sandia, Tri-Labs
  - Can Lustre meet the SGS-FS requirements?
Components of OB Storage

- Storage object device drivers
  - Class drivers: attach a driver to an interface
  - Targets, clients: remote access
  - Direct drivers: manage physical storage
  - Logical drivers: intelligence & storage management
- Object storage applications:
  - (Cluster) file systems
  - Advanced storage: parallel I/O, snapshots
  - Specialized apps: caches, databases, file servers
Object File System

[Diagram: a monolithic file system with its buffer cache, contrasted with an object file system (page cache) layered over an object-based storage device. The object file system handles file/dir data (lookup, set/read attrs) and asks the OBSD for the remainder; the object device methods handle all allocation and all persistence]
Accessing objects

- Session
  - Connect to the object storage target; present a security token
- Mention object id
  - Objects have a unique (group, id)
- Perform operation
- So that's what the object file system does!
Objects may be files, or not…

- Common case:
  - An object, like an inode, represents a file
- An object can also:
  - Represent a stripe (RAID)
  - Bind an (MPI) File_View
  - Redirect to other objects
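The stripe case can be sketched as a logical object that fans reads out across member objects, RAID-0 style. The layout below (round-robin, fixed stripe size) is an illustrative assumption, not Lustre's actual format:

```python
# Sketch of a logical striped object: reads on the logical byte range
# are redirected to the member objects that hold the stripes.
class StripedObject:
    def __init__(self, stripes, stripe_size):
        self.stripes = stripes          # per-stripe byte buffers
        self.stripe_size = stripe_size  # bytes per stripe chunk

    def read(self, offset, length):
        out = bytearray()
        width = self.stripe_size * len(self.stripes)  # bytes per full round
        while length > 0:
            stripe_no = (offset // self.stripe_size) % len(self.stripes)
            # position inside that stripe object:
            stripe_off = (offset // width) * self.stripe_size \
                         + offset % self.stripe_size
            chunk = min(length, self.stripe_size - offset % self.stripe_size)
            out += self.stripes[stripe_no][stripe_off:stripe_off + chunk]
            offset += chunk
            length -= chunk
        return bytes(out)
```

A redirector object would be even simpler: it holds no data at all, only a reference to the (group, id) of another object.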
Snapshots as logical module

[Diagram: the object file system presents multiple views of the file system, e.g. /current/ and /yesterday/. A snapshot/versioning logical driver follows redirectors from objectX to versioned objects objY (7am) and objZ (9am); a direct object driver gives access to the raw objects]
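The redirector-following step in the diagram can be sketched as follows (the list-of-pairs representation is an illustrative assumption): the snapshot driver picks the newest version at or before the requested snapshot time, and the "current" view simply has no time bound.

```python
# Sketch of snapshot lookup through a versioned-object redirector.
def resolve(redirector, snap_time=None):
    """redirector: list of (timestamp, version_object), sorted by time.
    Returns the version visible at snap_time (None means 'current')."""
    candidates = [obj for ts, obj in redirector
                  if snap_time is None or ts <= snap_time]
    return candidates[-1] if candidates else None
```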
System Interface

- Modules
  - Load the kernel modules to get drivers of a certain type
  - Name devices to be of a certain type
  - Build stacks of devices with assigned types
- For example:
  - insmod obd_xfs ; obdcontrol dev=obd1,type=xfs
  - insmod obd_snap ; obdcontrol current=obd2,old=obd3,driver=obd1
  - insmod obdfs ; mount -t obdfs -o dev=obd3 /mnt/old
[Diagram: a clustered object-based file system on host A uses an OBD client driver of type RPC; the same file system on host B uses an OBD client driver of type VIA. Over fast storage networking, both reach shared object storage through OBD targets of type RPC and type VIA, layered on a direct OBD]
Storage target implementations

- Classical storage targets
  - Controller: expensive, redundant, proprietary
  - EMC: as sophisticated & feature-rich as block storage can get
  - A bunch of disks
- Lustre target
  - Bunch of disks
  - Powerful (SMP, multiple busses) commodity PC
  - Programmable/secure
  - Could be done on disk drives, but…
Inside the storage controller…

[Diagram: two SMP Linux controllers, each with RAID, on-controller logic, and multipath networking, arranged as a load-balance/failover pair; both attach to a dual-path FC/SCSI JBOD and serve a load-balanced storage network]
Objects in clusters…

Lustre clusters

[Diagram: Lustre clients (10,000's) locate objects via Lustre cluster control nodes (10's), which handle device & object metadata transactions. File & object I/O is third-party, over a SAN for low-latency client/storage interactions with the object storage targets; TCP/IP carries file and security protocols, with GSS API compatible key/token management. InterMezzo clients get remote access with a private namespace (extreme security) and persistent cache (mobility); HSM backends provide a Lustre/DMAPI solution]
Cluster control nodes

- Database of references to objects
  - E.g. the Lustre file system
  - Directory data points to objects that contain stripes/extents of files
- More generally
  - Use a database of references to objects
  - Write object applications that access the objects directly
  - LANL asked for gene processing on the controllers
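The division of labor above can be sketched in a few lines (all names, paths, and record shapes below are invented for illustration): the control node resolves a name to a list of object references, and the client then does I/O directly against the storage targets, bypassing the control node.

```python
# Sketch of control-node metadata: a database mapping names to object refs.
metadata_db = {
    "/data/genome.dat": [                        # file -> ordered stripes
        {"target": "ost3", "object": (0, 17)},   # stripe 0
        {"target": "ost7", "object": (0, 42)},   # stripe 1
    ],
}

def lookup(path):
    """Control node: resolve a name to its object references."""
    return metadata_db[path]

def read_file(path, read_object):
    """Client: one metadata lookup, then direct I/O to each target."""
    return b"".join(read_object(ref["target"], ref["object"])
                    for ref in lookup(path))
```

Because only the small lookup step touches the control nodes, tens of nodes can serve tens of thousands of clients while the bulk data moves in parallel.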
Examples of logical modules

- Tri-Lab/NSA: SGS file system (see next slide)
  - Storage management, security
  - Parallel I/O for scientific computation
- Other requests:
  - Data mining while the target is idle
  - LANL: gene sequencing in object modules
  - Rich media industry: prioritize video streams
SGS – File System

[Diagram: on the client node, an MPI-IO interface (MPI file types, POSIX interface, collective & shared I/O) sits on an ADIO-OBD adaptor over the Lustre client FS (metadata sync, auditing) and object storage client networking; a SAN with object layering connects it to the storage target node, where object storage target networking provides synchronization, content-based authorization, collective I/O aggregation, and data migration over an XFS-based OSD]

Other applications…
- Genomics
  - Can we reproduce the previous slide?
- Data mining
  - Can we exploit idle cycles and disk geometry?
- Rich media
  - What storage networking helps streaming?
Why Lustre…

- It's fun, it's new
- Infrastructure for storage-target-based computing
- Storage management: components, not monolithic
  - File system snapshots, RAID, backup, hot migration, resizing
  - Much simpler
- File system:
  - A clustering FS is considerably simpler and more scalable
  - But: close to NFS v4 and DAFS in several critical ways
And finally – the reality, what exists…

- At http://www.lustre.org (everything GPL'd)
- Prototypes:
  - Direct driver for ext2 objects, snapshot logical driver
  - Management infrastructure, object file system
- Current happenings:
  - Collaboration with the Intel Enterprise Architecture Lab:
    - They are building Lustre storage networking (DAFS, RDMA, TUX)
  - The grand scheme of things has been planned and is moving
- Also on the WWW:
  - OBD storage specification
  - Lustre SGS file system implementation plan
Linux clusters

Clusters – purpose

Require:
- A scalable, almost single system image
- Fail-over capability
- Load-balanced redundant services
- Smooth administration
Ultimate Goal

- Provide generic components
  - OPEN SOURCE
- Inspiration: VMS VAX Clusters
- New:
  - Scalable (100,000's of nodes)
  - Modular
- Need distributed, cluster & parallel FS's
  - InterMezzo, GFS/Lustre, POBIO-FS
Technology Overview

Modularized VAX cluster architecture (Tweedie):

  Core          | Support      | Clients
  --------------|--------------|--------------------
  Transition    | Cluster db   | Distr. Computing
  Integrity     | Quorum       | Cluster Admin/Apps
  Link Layer    | Barrier Svc  | Cluster FS & LVM
  Channel Layer | Event system | DLM
Events

- Cluster transition:
  - Whenever connectivity changes
  - Start by electing a "cluster controller"
  - Only merge fully connected sub-clusters
  - Cluster id counts "incarnations"
- Barriers:
  - Distributed synchronization points
- Partial implementations available:
  - Ensemble, KimberLite, IBM-DLM, Compaq Cluster Mgr
Scalability – e.g. Red Hat cluster

[Diagram: three core clusters sharing a SAN, with peers (P) serving /redhat/usa, /redhat/scotland, and /redhat/canada]

- P = peer
  - Proxy for remote core cluster
  - Involved in recovery
- Communication
  - Point to point within core clusters
  - Routable within cluster
  - Hierarchical flood-fill
- File service
  - Cluster FS within cluster
  - Clustered Samba/Coda etc.
- Other stuff
  - Membership / recovery
  - DLM / barrier service
  - Cluster admin tools
InterMezzo
http://www.inter-mezzo.org
Target

- Replicate or cache directories
  - Automatic synchronization
  - Disconnected operation
  - Proxy servers
  - Scalable
- Purpose
  - Entire system binaries, laptop/desktop
  - Redundant object storage controllers
- Very simple
  - Coda-style protocols
  - Wraps around local file systems as a cache
Basic InterMezzo

[Diagram: Lento, the cache manager/file server, sends out the KML for replay and fetches the KML to sync up, serving file system/object requests. In the kernel, the Presto filter sits between the VFS and the local file system, checks whether cached data is fresh, and records mkdir, create, rmdir, unlink, link, … in the Kernel Modification Log (KML), a disk file]
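The log-and-replay cycle in the diagram can be sketched as follows, assuming (purely for illustration) that KML records are simple (operation, path) tuples: Presto appends each local modification to the KML, and Lento later replays the log against a replica to bring it up to date.

```python
# Sketch of InterMezzo-style update logging and replay.
kml = []  # the Kernel Modification Log, appended by the kernel filter

def log_op(op, path):
    """Presto: record a namespace modification in the KML."""
    kml.append((op, path))

def replay(log, replica):
    """Lento on the remote side: apply each logged op in order."""
    for op, path in log:
        if op == "mkdir":
            replica.setdefault(path, "dir")
        elif op == "create":
            replica.setdefault(path, "file")
        elif op in ("rmdir", "unlink"):
            replica.pop(path, None)
    return replica
```

Because the log is an ordinary disk file, replay also works after disconnected operation: the accumulated KML is simply shipped and applied when connectivity returns.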
Distributed Lock Manager

IBM released the HACMP DLM as open source (VAX style):
http://www.ibm.com/developerworks/open/source
Locks & resources

- Purpose: a generic, rich lock service
  - Will subsume "callbacks", "leases", etc.
- Lock resources: resource database
  - Organize resources in trees
  - Most lock traffic is local
  - High performance
  - The node that acquires a resource manages its tree
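The "most lock traffic is local" point can be sketched with a toy mastering rule (the data structures and function names are illustrative, not the HACMP DLM API): the first node to lock a resource becomes its master, so that node's later lock traffic on the resource tree never leaves the machine.

```python
# Sketch of DLM resource mastering: first locker masters the resource.
masters = {}   # resource name -> mastering node

def acquire(node, resource):
    """Return whether this acquisition is handled locally or remotely."""
    master = masters.setdefault(resource, node)
    return "local" if master == node else f"remote via {master}"
```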
Typical simple lock sequence

[Diagram: Sys A holds a lock on R; Sys B needs a lock on R]
1. Sys B asks the resource database: who has R? Answer: Sys A.
2. Sys B tells Sys A: I want a lock on R.
3. Sys A blocks B's request and triggers the owning process.
4. The owning process releases its lock.
5. The lock is granted to Sys B.
A few details…

- Six lock modes
- Acquisition of locks
- Promotion of locks
- Compatibility of locks
- First lock acquisition
  - Holder will manage the resource tree
- Remotely managed locks
  - Keep a copy at the owner
- Callbacks:
  - On blocking requests
  - On release, acquisition
- Recovery (simplified):
  - If the dead node was mastering resources: re-master the resources
  - If the dead node was owning locks: drop the zombie locks
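The six modes and their compatibility can be sketched concretely. In the VAX-style DLM the modes are NL (null), CR (concurrent read), CW (concurrent write), PR (protected read), PW (protected write), and EX (exclusive); a request is granted only if its mode is compatible with every lock already held on the resource, otherwise the holders get blocking callbacks.

```python
# VAX-style DLM lock mode compatibility matrix: for each held mode,
# the set of request modes that may be granted alongside it.
COMPAT = {
    "NL": {"NL", "CR", "CW", "PR", "PW", "EX"},  # null blocks nothing
    "CR": {"NL", "CR", "CW", "PR", "PW"},
    "CW": {"NL", "CR", "CW"},
    "PR": {"NL", "CR", "PR"},                    # shared read
    "PW": {"NL", "CR"},                          # update lock
    "EX": {"NL"},                                # exclusive
}

def grantable(requested, held_modes):
    """True if the requested mode is compatible with all held locks."""
    return all(requested in COMPAT[h] for h in held_modes)
```

Promotion is just a re-request in a stronger mode, checked against the same matrix.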