Lustre: the Intergalactic File System for the National Labs?

Peter J. Braam [email protected] http://www.clusterfilesystems.com

Cluster File Systems, Inc. (6/10/2001)

Cluster File Systems, Inc…

- Goals
  - Consulting & development
  - Storage and file systems
  - Open source
  - Extreme level of expertise
- Leading
  - InterMezzo – high-availability file system
  - Lustre – next-generation cluster file system
- Important role in …, UDF, and …

Partners…

- CMU
- National Labs
- Intel

Talk overview

- Trends
- Next-generation data centers
- Key issues in cluster file systems
- Lustre
- Discussion

Trends…

Hot animals…

- NAS
  - Cheap servers deliver a lot of features
- NFS v4
  - Finally, NFS is getting it right…
  - Security, concurrency
- DAFS
  - High-level storage protocol over a VI storage network
- Open Source OS
  - Best-of-breed file systems (XFS, JFS, Reiser)

DAFS / NFS v4

[Diagram: a client per server. DAFS server: a high-level, fast & efficient storage protocol. NFS v4 server: concurrency control with notifications to clients.]

Key question

- How to use the parts…

Next-generation data centers

[Diagram: clients connect to cluster control databases that handle security and resource control, access control, and coherence management; data transfers flow over a Storage Area Network (FC, GigE, IB) to storage targets under storage management.]

Orders of magnitude

- Clients (a.k.a. compute servers): 10,000's
- Storage controllers: 1,000's, controlling PB's of storage (1 PB = 10^15 bytes)
- Cluster control nodes: 10's
- Aggregate bandwidth: 100's of GB/sec
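Taken together, these numbers imply only about 100 MB/s per storage target (100 GB/s spread across 1,000 targets) and about 10 MB/s per client on average: commodity rates per box, with the scale coming from parallelism rather than heroic individual machines.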

Applications

- Scientific computing
- Bioinformatics
- Rich media
- Entire ISPs

Key issues

Scalability

- I/O throughput: how to avoid bottlenecks?
- Metadata scalability: how can 10,000's of nodes work on files in the same folder?
- Cluster recovery: if something fails, how can recovery happen transparently?
- Management: adding, removing, and replacing systems; data migration & backup

Features

- The basics…
  - Recovery, management, security
- The desired…
  - Gene computations on the storage controllers
  - Data mining for free
  - Content-based security
  - …
- The obstacle…
  - An almost 30-year-old, pervasive block-device protocol

Look back

- Andrew Project at CMU
  - 1980's – file servers with 10,000 clients (the CMU campus)
  - Key question: how to reduce the footprint of a client on the server
  - By 1988, the entire campus was on AFS
- Lustre
  - Scalable clusters?
  - How to reduce the cluster footprint of shared resources (scalability)
  - How to subdivide bottlenecked resources (parallelism)

Lustre

Intelligent Object Storage – http://www.lustre.org

What is Object-Based Storage?

- Object-Based Storage Device (OBSD)
  - More intelligent than a block device
  - Speaks storage at the "inode level": create, unlink, read, write, getattr, setattr
  - Iterators, security, almost arbitrary processing
- So…
  - The device allocates physical blocks itself; the protocol carries no names for files (see the sketch below)
- Requires
  - Management & security infrastructure
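A minimal sketch of what such an inode-level interface might look like; the names and signatures here are invented for illustration, not the actual OBD driver API:

    /* Hedged sketch of an "inode level" storage interface. The device
     * itself allocates and persists blocks; callers only name objects
     * and operate on them -- no pathnames, no block addresses. */
    #include <stddef.h>
    #include <stdint.h>
    #include <sys/types.h>

    struct obd_device;                  /* opaque device handle */
    typedef uint64_t obd_id;            /* object identifier    */

    struct obd_attr {
        uint64_t size;
        uint32_t mode, uid, gid;
    };

    struct obsd_ops {
        int     (*create)(struct obd_device *d, obd_id *new_id);
        int     (*unlink)(struct obd_device *d, obd_id id);
        ssize_t (*read)(struct obd_device *d, obd_id id,
                        void *buf, size_t len, uint64_t off);
        ssize_t (*write)(struct obd_device *d, obd_id id,
                         const void *buf, size_t len, uint64_t off);
        int     (*getattr)(struct obd_device *d, obd_id id,
                           struct obd_attr *a);
        int     (*setattr)(struct obd_device *d, obd_id id,
                           const struct obd_attr *a);
    };

Note what is absent: there is no pathname anywhere; naming lives in the file system above this line, allocation and persistence below it.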

Project history

- Started between CMU, Seagate, and Stelias Computing
  - Another road to NASD-style storage
  - NASD is now at Panasas – it originated many of these ideas
- Los Alamos
  - More research
  - Nearly built small object storage controllers
  - Currently looking at genomics applications
- Sandia, Tri-Labs
  - Can Lustre meet the SGS-FS requirements?

Components of OB Storage

- Storage object device drivers
  - Class drivers – attach a driver to an interface
  - Targets, clients – remote access
  - Direct drivers – manage physical storage
  - Logical drivers – intelligence & storage management (see the stacking sketch below)
- Object storage applications
  - (Cluster) file systems
  - Advanced storage: parallel I/O, snapshots
  - Specialized apps: caches, databases, file servers
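The stacking idea can be sketched as every driver layer exporting the same operation table plus a pointer to the device below it; the structures are invented for illustration, not the real driver code:

    #include <stdint.h>

    typedef uint64_t obd_id;
    struct obd_attr;
    struct obd_device;

    /* Every layer, direct or logical, exports the same operations. */
    struct obd_ops {
        int (*getattr)(struct obd_device *d, obd_id id, struct obd_attr *a);
        /* ... create, unlink, read, write, setattr in the same style ... */
    };

    struct obd_device {
        const struct obd_ops *ops;   /* this layer's methods            */
        struct obd_device *below;    /* next driver down; NULL for a
                                        direct driver on physical store */
        void *private_data;          /* per-layer state                 */
    };

    /* A logical layer (snapshot, RAID, ...) reinterprets some calls
     * and forwards the rest unchanged to the layer beneath it: */
    static int logical_getattr(struct obd_device *d, obd_id id,
                               struct obd_attr *a)
    {
        /* layer-specific work (e.g. redirecting a versioned id) here */
        return d->below->ops->getattr(d->below, id, a);
    }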

Object File System

[Diagram: a monolithic file system bundles the buffer cache and all file-system logic. An object file system keeps a page cache and handles file/dir data and lookup and setting/reading attributes, asking the OBSD for the remainder; the object-based storage device, through its object device methods, does all allocation and all persistence.]

Accessing objects

- Session
  - Connect to the object storage target and present a security token
- Name the object
  - Objects have a unique (group, id) pair
- Perform the operation
- So that's what the object file system does! (A sketch of the sequence follows.)
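Put together, the sequence might look like this; the connection API and helpers are stand-ins invented for the sketch, with stubs in place of real networking:

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* Hypothetical client-side session with one storage target. */
    struct obd_conn { const char *target; };

    struct obd_fid { uint64_t group, id; };  /* unique (group, id) */

    static struct obd_conn *obd_connect(const char *target,
                                        const char *token)
    {
        /* a real client would authenticate with the token here */
        (void)token;
        struct obd_conn *c = malloc(sizeof *c);
        if (c) c->target = target;
        return c;
    }

    static int obd_getattr(struct obd_conn *c, struct obd_fid fid)
    {
        printf("getattr on %s: object (%llu, %llu)\n", c->target,
               (unsigned long long)fid.group, (unsigned long long)fid.id);
        return 0;
    }

    int main(void)
    {
        /* 1. Session: connect to the target, presenting a token.  */
        struct obd_conn *c = obd_connect("ost0", "example-token");
        if (!c) return 1;
        /* 2. Name the object by (group, id); 3. perform the op.   */
        struct obd_fid fid = { .group = 1, .id = 4711 };
        int rc = obd_getattr(c, fid);
        free(c);
        return rc;
    }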

Objects may be files, or not…

- Common case
  - An object, like an inode, represents a file

- An object can also
  - Represent a stripe (RAID) – see the striping sketch below
  - Bind an (MPI) File_View
  - Redirect to other objects
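For the striping case the object's job is an address calculation; a sketch assuming simple round-robin (RAID-0 style) striping, not Lustre's actual layout code:

    #include <stdint.h>
    #include <stdio.h>

    /* Map a logical file offset to (stripe index, offset within that
     * stripe object) for round-robin striping. */
    struct stripe_loc { uint32_t stripe; uint64_t off; };

    static struct stripe_loc stripe_map(uint64_t logical_off,
                                        uint64_t stripe_size,
                                        uint32_t stripe_count)
    {
        uint64_t chunk = logical_off / stripe_size;     /* which chunk  */
        struct stripe_loc loc;
        loc.stripe = (uint32_t)(chunk % stripe_count);  /* which object */
        loc.off    = (chunk / stripe_count) * stripe_size
                   + logical_off % stripe_size;   /* offset in object   */
        return loc;
    }

    int main(void)
    {
        /* e.g. 1 MB stripes over 4 objects: where does byte 5,500,000 land? */
        struct stripe_loc l = stripe_map(5500000, 1 << 20, 4);
        printf("stripe %u, offset %llu\n",
               l.stripe, (unsigned long long)l.off);
        return 0;
    }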

Snapshots as a logical module

[Diagram: the object file system presents multiple views of the file system, e.g. /current/ and /yesterday/. A snapshot/versioning logical driver exposes versioned objects that follow redirectors: objectX redirects to objY and objZ, the 7am and 9am versions. A direct object driver underneath gives access to the raw objects.]
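A toy version of the redirector idea, assuming (purely for illustration) that a versioned object is a table mapping snapshot epochs to the raw objects holding each version:

    #include <stdint.h>
    #include <stdio.h>

    struct version { uint64_t epoch; uint64_t raw_obj_id; };

    struct versioned_obj {
        struct version v[8];   /* sorted by ascending epoch */
        int nversions;
    };

    /* Return the raw object visible at time 'epoch': the newest
     * version that is not newer than the snapshot being presented. */
    static uint64_t redirect(const struct versioned_obj *o, uint64_t epoch)
    {
        uint64_t best = 0;
        for (int i = 0; i < o->nversions; i++)
            if (o->v[i].epoch <= epoch)
                best = o->v[i].raw_obj_id;
        return best;  /* 0 = object did not exist yet */
    }

    int main(void)
    {
        /* objectX with a 7am version (objY) and a 9am version (objZ) */
        struct versioned_obj x = { { { 7, 101 }, { 9, 102 } }, 2 };
        printf("/yesterday/ (8am view) -> raw object %llu\n",
               (unsigned long long)redirect(&x, 8));   /* objY */
        printf("/current/ (9am+ view) -> raw object %llu\n",
               (unsigned long long)redirect(&x, 9));   /* objZ */
        return 0;
    }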

System Interface

- Modules
  - Load kernel modules to get drivers of a certain type
  - Name devices to be of a certain type
  - Build stacks of devices with assigned types
- For example:

    insmod obd_xfs ; obdcontrol dev=obd1,type=…
    insmod obd_snap ; obdcontrol current=obd2,old=obd3,driver=obd1
    insmod obdfs ; mount -t obdfs -o dev=obd3 /mnt/old

[Diagram: clustered object-based file systems on host A and host B, each over an OBD client driver (type RPC on host A, type VIA on host B). Fast storage networking connects them to shared object storage: OBD targets of type RPC and type VIA, layered over a direct OBD.]

Storage target implementations

- Classical storage targets
  - Controllers – expensive, redundant, proprietary
  - EMC: as sophisticated & feature-rich as block storage can get
  - A bunch of disks
- Lustre target
  - A bunch of disks
  - A powerful (SMP, multiple busses) commodity PC
  - Programmable/secure
  - Could be done on the disk drives themselves, but…

Inside the storage controller…

[Diagram: a dual-path FC/SCSI JBOD shared by two SMP Linux controllers, each running RAID, on-controller logic, and networking; the pair provides load balancing and failover, with multipath, load-balanced links to the storage network interface.]

Objects in clusters…

Lustre clusters

[Diagram: Lustre clients (10,000's) contact Lustre cluster control nodes (10's) for metadata transactions and to locate devices & objects, and reach Lustre object storage targets directly over a SAN for low-latency client/storage interactions and third-party file & object I/O. TCP/IP carries the file and security protocols, with GSS-API-compatible key/token management. InterMezzo clients get remote access: a private namespace (extreme security) and persistent client caches (mobility). HSM backends give a Lustre/DMAPI solution.]

Cluster control nodes

- A database of references to objects
- E.g., the Lustre file system
  - Directory data points to the objects that contain the stripes/extents of files (sketched below)
- More generally
  - Use a database of references to objects
  - Write object applications that access the objects directly
  - LANL asked for gene processing on the controllers
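A sketch of what such a reference-database record might hold in the file-system case; the layout and names are hypothetical, following the "directory data points to objects" bullet:

    #include <stdint.h>

    /* Hypothetical metadata record kept on a cluster control node:
     * a name maps not to disk blocks but to the objects holding the
     * file's stripes/extents on the storage targets. */
    #define MAX_STRIPES 8

    struct obj_ref {              /* one stripe of the file         */
        uint32_t target;          /* which object storage target    */
        uint64_t group, id;       /* object identity on that target */
    };

    struct mds_dirent {
        char     name[256];       /* directory entry                */
        uint32_t stripe_count;
        uint64_t stripe_size;
        struct obj_ref stripes[MAX_STRIPES];
    };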

Examples of logical modules

- Tri-Lab/NSA: SGS file system (see the next slide)
  - Storage management, security
  - Parallel I/O for scientific computation
- Other requests
  - Data mining while the target is idle
  - LANL: gene sequencing in object modules
  - Rich media industry: prioritizing video streams

SGS – File System

[Diagram: on the client node, an MPI-IO interface (MPI file types, collective & shared I/O) and a POSIX interface sit on an ADIO-OBD adaptor and the Lustre client FS (metadata sync, auditing), over object storage client networking. Across the SAN, object layering on the storage target node provides object storage target networking, synchronization, content-based authorization, collective I/O, data migration, and aggregation on top of XFS-based OSDs.]

Other applications…

- Genomics: can we reproduce the previous slide?
- Data mining: can we exploit idle cycles and disk geometry?
- Rich media: what storage networking helps streaming?

Why Lustre…

- It's fun, it's new
- Infrastructure for storage-target-based computing
- Storage management: components, not monolithic
  - File-system snapshots, RAID, backup, hot migration, resizing
  - Much simpler
- File system
  - A clustering FS that is considerably simpler and more scalable
  - But: close to NFS v4 and DAFS in several critical ways

And finally – the reality, what exists…

- At http://www.lustre.org (everything GPL'd)
- Prototypes
  - Direct driver for objects, snapshot logical driver
  - Management infrastructure, object file system
- Current happenings
  - Collaboration with the Intel Enterprise Architecture Lab: they are building Lustre storage networking (DAFS, RDMA, TUX)
  - The grand scheme of things has been planned and is moving
- Also on the web
  - OBD storage specification
  - Lustre SGS file system implementation plan

Linux clusters

Clusters – purpose

Require:
- A scalable, almost single-system image
- Fail-over capability
- Load-balanced redundant services
- Smooth administration

Ultimate Goal

- Provide generic components
  - OPEN SOURCE
- Inspiration: VMS VAX Clusters
- New
  - Scalable (100,000's of nodes)
  - Modular
- Need distributed, cluster & parallel FSs
  - InterMezzo, GFS/Lustre, POBIO-FS

Technology Overview

Modularized VAX cluster architecture (Tweedie):

Core          | Support      | Clients
--------------|--------------|--------------------
Transition    | Cluster db   | Distr. computing
Integrity     | Quorum       | Cluster admin/apps
Link layer    | Barrier svc  | Cluster FS & LVM
Channel layer | Event system | DLM

Events

- Cluster transition
  - Happens whenever connectivity changes
  - Starts by electing a "cluster controller"
  - Only fully connected sub-clusters are merged
  - The cluster id counts "incarnations"
- Barriers
  - Distributed synchronization points (toy sketch below)
- Partial implementations available
  - Ensemble, KimberLite, IBM-DLM, Compaq Cluster Mgr
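A toy sketch of the barrier semantics: nobody passes a named synchronization point until every expected participant has arrived. The real service does this count across nodes through its messaging layer; here it is one process, purely for illustration:

    #include <stdio.h>

    struct barrier {
        const char *name;
        int expected;   /* participants that must arrive */
        int arrived;
    };

    /* Returns 1 when the barrier fires (last arrival), 0 while waiting. */
    static int barrier_arrive(struct barrier *b, const char *node)
    {
        b->arrived++;
        printf("%s arrived at barrier %s (%d/%d)\n",
               node, b->name, b->arrived, b->expected);
        return b->arrived == b->expected;
    }

    int main(void)
    {
        struct barrier b = { "recovery-phase-1", 3, 0 };
        barrier_arrive(&b, "nodeA");
        barrier_arrive(&b, "nodeB");
        if (barrier_arrive(&b, "nodeC"))
            printf("barrier %s released: all nodes proceed\n", b.name);
        return 0;
    }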

Scalability – e.g. a Red Hat cluster

[Diagram: a SAN connects three core clusters, /redhat/usa, /redhat/scotland, and /redhat/canada, each fronted by a peer node P.]

- P = peer
  - Proxy for the remote core cluster
  - Involved in recovery
- File service
  - Cluster FS within a cluster
  - Clustered Samba/Coda etc.
- Communication
  - Point-to-point within core clusters
  - Routable within the cluster
  - Hierarchical flood-fill
- Other stuff
  - Membership / recovery
  - DLM / barrier service
  - Cluster admin tools

InterMezzo

http://www.inter-mezzo.org

Target

- Replicate or cache directories
  - Automatic synchronization
  - Disconnected operation
  - Proxy servers
  - Scalable
- Purpose
  - Entire system binaries, laptop/desktop
  - Redundant object storage controllers
- Very simple
  - Coda-style protocols
  - Wraps around local file systems as a cache

Basic InterMezzo

[Diagram: operations (mkdir, create, rmdir, unlink, link, …) pass from the VFS through Presto, a filter over the local file system that asks "is the data fresh?"; if not, it requests the file-system object from Lento, the cache manager / file server. Mutations are appended to the Kernel Modification Log (KML), a disk file; Lento sends out the KML for replay and fetches the KML to sync up.]
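A hedged sketch of the KML mechanism: mutations are appended as records to a disk file and replayed on the peer. The record layout below is invented for illustration and is not InterMezzo's actual format:

    #include <stdint.h>
    #include <stdio.h>

    /* Illustrative KML record: each file-system mutation is logged so
     * Lento can ship the log and replay it on a replica. */
    enum kml_op { KML_MKDIR, KML_CREATE, KML_RMDIR, KML_UNLINK, KML_LINK };

    struct kml_rec {
        uint64_t recno;          /* position in the log             */
        enum kml_op op;
        char path[256];          /* object the operation applies to */
    };

    /* Replay one record against the local file system (stubbed). */
    static void kml_replay(const struct kml_rec *r)
    {
        const char *names[] =
            { "mkdir", "create", "rmdir", "unlink", "link" };
        printf("replay #%llu: %s %s\n",
               (unsigned long long)r->recno, names[r->op], r->path);
    }

    int main(void)
    {
        /* Records fetched from the server's KML, applied in order: */
        struct kml_rec log[] = {
            { 1, KML_MKDIR,  "/cache/dir"      },
            { 2, KML_CREATE, "/cache/dir/file" },
        };
        for (unsigned i = 0; i < sizeof log / sizeof log[0]; i++)
            kml_replay(&log[i]);
        return 0;
    }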

Distributed Lock Manager

IBM released the HACMP DLM (VAX-style) as open source:
http://www.ibm.com/developerworks/open/source

Locks & resources

- Purpose: a generic, rich lock service
  - Will subsume "callbacks", "leases", etc.
- Lock resources: a resource database
  - Resources are organized in trees
  - Most lock traffic is local, giving high performance
  - The node that acquires a resource manages its tree

Typical simple lock sequence

[Diagram, as a sequence:]
1. Sys A holds a lock on R; Sys B needs a lock on R.
2. Sys B asks the resource database "who has R?"; the answer is Sys A.
3. B's request blocks; the blocking request ("I want a lock on R") triggers the owning process on A.
4. The owning process releases its lock.
5. The lock is granted to Sys B.

A few details…

- Six lock modes (compatibility sketch below)
- Acquisition of locks
- Promotion of locks
- Compatibility of locks
- First lock acquisition: the holder will manage the resource tree
- Mastering resources: remotely managed, with a copy kept at the owner
- Callbacks
  - On blocking requests
  - On release, on acquisition
- Recovery (simplified), depending on what the dead node was doing:
  - If it was mastering resources: re-master the resource
  - If it was owning locks: drop its zombie locks
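The six modes in a VAX-style DLM are NL, CR, CW, PR, PW, and EX; the sketch below shows the classic compatibility matrix and the grant-or-block decision that drives the blocking callbacks. This is standard VAX DLM semantics, assumed rather than taken from this deck:

    #include <stdio.h>

    /* Classic VAX DLM lock modes. compatible[held][requested] is 1
     * when a new lock in mode `requested` can be granted while
     * another holder has mode `held`. */
    enum mode { NL, CR, CW, PR, PW, EX };

    static const int compatible[6][6] = {
        /*         NL CR CW PR PW EX */
        /* NL */ {  1, 1, 1, 1, 1, 1 },
        /* CR */ {  1, 1, 1, 1, 1, 0 },
        /* CW */ {  1, 1, 1, 0, 0, 0 },
        /* PR */ {  1, 1, 0, 1, 0, 0 },
        /* PW */ {  1, 1, 0, 0, 0, 0 },
        /* EX */ {  1, 0, 0, 0, 0, 0 },
    };

    int main(void)
    {
        /* A protected-read holder blocks a protected-write request,
         * which is what fires the blocking callback on the holder: */
        printf("PR held, PW requested: %s\n",
               compatible[PR][PW] ? "grant" : "block (fire callback)");
        /* A concurrent-read holder does not block a protected write: */
        printf("CR held, PW requested: %s\n",
               compatible[CR][PW] ? "grant" : "block (fire callback)");
        return 0;
    }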
