Storage Aggregation for Performance & Availability: The Path from Physical RAID to Virtual Objects

Univ Minnesota, Third Intelligent Storage Workshop

Garth Gibson, CTO, Panasas, and Assoc Prof, CMU
[email protected]
May 4, 2005

Cluster Computing: The New Wave

Traditional Computing (monolithic computers) vs. Cluster Computing (Linux compute cluster)

Traditional Computing:
- Single data path
- Limited BW & I/O
- Inflexible: scale to a bigger box
- Expensive
- Monolithic storage

Cluster Computing:
- Parallel data paths
- Complex scaling issues
- Islands of storage
- Storage must scale: file & total BW, file & total capacity, load & capacity balancing ??
- But lower $ / Gbps

Slide 2 April 26, 2005 Panasas: Next Gen Cluster Storage

ActiveScale Storage Cluster
- Scalable performance
  - Parallel data paths to compute nodes
  - Scale clients, network and capacity
  - As capacity grows, performance grows
- Simplified and dynamic management
  - Robust, shared file access by many clients
  - Seamless growth within a single namespace eliminates time-consuming admin tasks
- Integrated HW/SW solution
  - Optimizes performance and manageability
  - Ease of integration and support

Diagram: a compute cluster with parallel data paths to the Panasas storage cluster (metadata managers and devices) and a separate control path; a single step: perform the job directly from high-I/O Panasas storage.

Slide 3 April 26, 2005 Rooted Firmly in RAID

May 4, 2005 Birth of RAID (1986-1991)

- Member of 4th Berkeley RISC CPU design team (SPUR: 84-89)
- Dave Patterson decides CPU design is a "solved" problem
  - Sends me to figure out how storage plays in SYSTEM PERFORMANCE
- IBM 3380 disk is 4 arms in a 7.5 GB washing-machine box
  - SLED: Single Large Expensive Disk
- New PC industry demands cost-effective 100 MB 3.5" disks
  - Enabled by new SCSI embedded controller architecture
- Use many PC disks for parallelism
  - SIGMOD88: A Case for RAID
- PS. $10-20 per MB (~1000X now), 100 MB/arm (~1000X now), 20-30 IO/sec/arm (5X now)

Slide 5 April 26, 2005 But RAID is really about Availability

- Arrays have more Hard Disk Assemblies (HDAs) -- and therefore more failures
- Apply replication and/or error/erasure detection codes
  - Mirroring wastes 50% of the space; RAID 5 wastes only 1/N
  - Mirroring halves small-write bandwidth; RAID 5 quarters it (sketched below)
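A minimal sketch of why RAID 5 quarters small-write bandwidth, using toy in-memory "disks" rather than any real array: every small write turns into four I/Os, whereas mirroring only doubles the write.

```python
# Toy illustration of the RAID 5 small-write penalty: updating one data block
# costs four I/Os -- read old data, read old parity, write new data, write new
# parity -- so small-write bandwidth is roughly a quarter of raw bandwidth,
# while mirroring merely doubles each write.

def xor_blocks(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def raid5_small_write(disks, stripe, data_disk, parity_disk, new_data):
    """disks[d][s] is the block on disk d, stripe s (an in-memory stand-in)."""
    old_data = disks[data_disk][stripe]        # I/O 1: read old data block
    old_parity = disks[parity_disk][stripe]    # I/O 2: read old parity block
    # Fold the change into the parity without touching the rest of the stripe:
    # P' = P xor D_old xor D_new
    new_parity = xor_blocks(xor_blocks(old_parity, old_data), new_data)
    disks[data_disk][stripe] = new_data        # I/O 3: write new data block
    disks[parity_disk][stripe] = new_parity    # I/O 4: write new parity block
```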

Slide 6 April 26, 2005 Off to CMU & More Availability

- Parity declustering "spreads RAID groups" to reduce MTTR
  - Each parity block protects fewer than all of the disks: group size G is smaller than the C disks in the array
  - Virtualizing the RAID group lessens per-disk recovery work
  - Faster recovery, or better user response time during recovery, or a mixture of both (see the sketch below)

- RAID over X? X = independent fault domains
  - "Disk" is the easiest "X"
  - Parity declustering is the first step in RAID virtualization
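A toy sketch of the idea, assuming a simple rotated layout rather than the balanced block designs used in the declustering work: stripes of width G are spread over C > G disks, so each surviving disk supplies only a fraction of the reconstruction reads after a failure.

```python
# Toy parity-declustering sketch (a simple rotated layout, not the balanced
# block designs from the declustering papers): parity groups of width G are
# spread over C > G disks, so after a failure each surviving disk supplies only
# a fraction of the reconstruction reads. An ideal balanced design spreads the
# work evenly, each survivor reading about (G-1)/(C-1) of its contents.
from collections import Counter

def declustered_layout(num_stripes: int, C: int, G: int):
    """For each stripe, the set of G disks holding its data + parity units."""
    return [{(s + j) % C for j in range(G)} for s in range(num_stripes)]

def rebuild_shares(layout, failed_disk: int, C: int):
    """Fraction of the failed disk's stripes each survivor must read to rebuild."""
    affected = [group for group in layout if failed_disk in group]
    reads = Counter(d for group in affected for d in group if d != failed_disk)
    return {d: reads[d] / len(affected) for d in range(C) if d != failed_disk}

layout = declustered_layout(num_stripes=1000, C=10, G=4)
print(rebuild_shares(layout, failed_disk=0, C=10))   # every share is well below 1.0
```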

Slide 7 April 26, 2005 Network-Attached Secure Disks (NASD, 95-99)

May 4, 2005 Storage Interconnect Evolution

- Outboard circuitry increases over time (VLSI density)
- Hardware sharing (#hosts, #disks, #paths) increases over time
- Logical (information) sharing limited by host SW
- 1995: Fibre Channel packetizes SCSI over a near-general network
  - Packetized SCSI will go over any network (iSCSI)
- What do we do next with device intelligence?

Slide 9 April 26, 2005 Storage as First Class Network Citizen

- Direct transfer between client and storage
- Exploit scalable switched cluster-area networking
- Split file service into primitives (in the drive) and policies (in the manager)

Slide 10 April 26, 2005 NASD Architecture

- Before NASD there was store-and-forward Server-Attached Disk (SAD)
- Move access control, consistency, and cache decisions out of band
- Raise the storage abstraction: encapsulate layout, offload data access

Slide 11 April 26, 2005 Fine Grain Access Enforcement

- State of the art is a VPN of all out-of-band clients and all sharable data and metadata
  - Accident prone & vulnerable to a subverted client; analogous to single-address-space computing
- NASD enforces access per object with capabilities, over private communication with integrity/privacy; the client uses a digitally signed, object-specific capability on each request (sketched below):
  1: Client requests access from the file manager
  2: Manager returns CapArgs, CapKey, where CapKey = MAC_SecretKey(CapArgs) and CapArgs = ObjID, Version, Rights, Expiry, ...
  3: Client sends CapArgs, Req, NonceIn, ReqMAC to the NASD, where ReqMAC = MAC_CapKey(Req, NonceIn)
  4: NASD, which shares the secret key with the manager, returns Reply, NonceOut, ReplyMAC, where ReplyMAC = MAC_CapKey(Reply, NonceOut)
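A sketch of the four-step exchange above in Python, using HMAC-SHA256 as the MAC; the field names follow the slide (CapArgs, CapKey, ReqMAC, ReplyMAC), while the byte serialization and key sizes are assumptions for illustration.

```python
# Sketch of the capability exchange above, using HMAC-SHA256 as the MAC.
# Field names follow the slide; serialization and key sizes are assumed.
import hmac, hashlib, os

def mac(key: bytes, *fields: bytes) -> bytes:
    return hmac.new(key, b"|".join(fields), hashlib.sha256).digest()

secret_key = os.urandom(32)      # shared by the file manager and the NASD drive

# Steps 1-2: the manager checks policy and hands the client a capability.
cap_args = b"ObjID=42|Version=7|Rights=READ|Expiry=1117000000"
cap_key = mac(secret_key, cap_args)                    # CapKey = MAC_SecretKey(CapArgs)

# Step 3: the client signs each request with CapKey (it never sees SecretKey).
req, nonce_in = b"READ offset=0 length=65536", os.urandom(8)
req_mac = mac(cap_key, req, nonce_in)                  # ReqMAC = MAC_CapKey(Req, NonceIn)

# Drive side: recompute CapKey from CapArgs alone, verify the request MAC,
# then check Rights/Expiry in CapArgs -- no round trip to the manager needed.
drive_cap_key = mac(secret_key, cap_args)
assert hmac.compare_digest(mac(drive_cap_key, req, nonce_in), req_mac)

# Step 4: the drive signs its reply with the same CapKey.
reply, nonce_out = b"DATA ...", os.urandom(8)
reply_mac = mac(cap_key, reply, nonce_out)             # ReplyMAC = MAC_CapKey(Reply, NonceOut)
```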

Slide 12 April 26, 2005 Scalable Taxonomy

May 4, 2005 Today’s Ubiquitous NFS

ADVANTAGES
- Familiar, stable & reliable
- Widely supported by vendors
- Competitive market

DISADVANTAGES
- Capacity doesn't scale
- Bandwidth doesn't scale
- Cluster only by customer-exposed namespace partitioning

Diagram: clients on a host network, file servers each exporting a sub-file-system, disk arrays on a storage network.

Slide 14 April 26, 2005 Scale Out 1: Forwarding NFS Servers

- Bind many file servers into a single system image with forwarding
- Mount point binding is less relevant; allows DNS-style balancing; more manageable
- Control and data traverse the mount point path (in band), passing through two servers
- Single-file and single-file-system bandwidth limited by the backend server & storage
- Examples: Tricord, Spinnaker

Diagram: clients on a host network, a file server cluster forwarding requests between servers, disk arrays on a storage network.

Slide 15 April 26, 2005 Scale Out File Service w/ Out-of-Band

- Client sees many storage addresses and accesses them in parallel
- Zero file servers in the data path allows high bandwidth through scalable networking
- Mostly built on block-based SANs where servers trust all clients
  - E.g. IBM SanFS, IBM GPFS, EMC HighRoad, Sun QFS, SGI CXFS, etc.
- Beginning to be built on Object storage (SANs too)
  - E.g. Panasas DirectFlow, ...

Diagram: clients transfer data directly to/from storage, with file servers out of the data path.

Slide 16 April 26, 2005 Object Storage Standards

May 4, 2005 Object Storage Architecture

- An evolutionary improvement to the standard SCSI storage interface (OSD)
- Offload most data-path work from the server to intelligent storage
- Finer granularity of security: protect & manage one file at a time
- Raises the level of abstraction: an object is a container for "related" data
  - Storage understands how the blocks of a "file" are related -> self-management
- Per-object extensible attributes are a key expansion of storage functionality

              Block Based Disk           Object Based Disk
Operations:   Read block, Write block    Create object, Delete object, Read object, Write object
Addressing:   Block range                [object, byte range]
Allocation:   External                   Internal
Security:     At Volume Level            At Object Level
(Source: Intel)
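A minimal sketch of the interface shift in the table above; the class and method names are illustrative stand-ins, not the T10 OSD command set.

```python
# Minimal sketch of the table above. Names are illustrative, not the T10 OSD
# command set: an object device addresses [object, byte range], allocates
# internally, and keeps extensible per-object attributes, while a block device
# only reads/writes externally allocated block ranges.

class BlockDevice:
    def __init__(self, num_blocks: int, block_size: int = 512):
        self.blocks = [bytes(block_size)] * num_blocks

    def read_block(self, lba: int) -> bytes:
        return self.blocks[lba]

    def write_block(self, lba: int, data: bytes) -> None:
        self.blocks[lba] = data              # allocation decided by the host

class ObjectDevice:
    def __init__(self):
        self.objects = {}                    # object_id -> bytearray (internal allocation)
        self.attrs = {}                      # object_id -> extensible attributes

    def create(self, object_id: int, **attributes) -> None:
        self.objects[object_id] = bytearray()
        self.attrs[object_id] = dict(attributes)

    def delete(self, object_id: int) -> None:
        del self.objects[object_id]
        del self.attrs[object_id]

    def write(self, object_id: int, offset: int, data: bytes) -> None:
        obj = self.objects[object_id]
        obj.extend(bytes(max(0, offset + len(data) - len(obj))))   # grow as needed
        obj[offset:offset + len(data)] = data
        self.attrs[object_id]["size"] = len(obj)    # device maintains the size attribute

    def read(self, object_id: int, offset: int, length: int) -> bytes:
        return bytes(self.objects[object_id][offset:offset + length])
```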

Slide 18 April 26, 2005 Why I Like Objects

- Objects are flexible, autonomous (SNIA Shared Storage Model, 2001)
  - Variable-length data with layout metadata encapsulated
  - Share one access-control decision
  - Amortize object metadata
  - Defer allocation, smarter caching
- With extensible attributes (e.g. size, timestamps, ACLs, ...)
  - Some updated inline by the device
  - Extensible programming model
- Metadata server decisions are signed & cached at clients, enforced at the device
  - Rights and object map are small relative to a block allocation map
  - Clients can be untrusted because exposure to bugs & attacks is limited (only authorized object data)
  - Cached decisions (& maps) are replaced transparently -- dynamic remapping -- virtualization

Slide 19 April 26, 2005 OSD is now an ANSI Standard

Timeline (1995-2005): CMU NASD -> NSIC NASD -> SNIA/T10 OSD -> OSD Standard, with Lustre and Panasas building on object storage along the way (key players shown as logos).

- T10/SCSI ratifies the OSD command set 9/27/04
- Co-chaired by IBM and Seagate; the protocol is a general framework (transport independent)
- www.snia.org/tech_activities/workgroups/osd & www.t10.org/ftp/t10/drafts/osd/osd-r10.pdf
- KEY: the resulting standard will build options for customers looking to deploy objects

Slide 20 April 26, 2005 ActiveScale Storage Cluster

May 4, 2005 Object Storage Systems

Expect a wide variety of Object Storage Devices:
- Disk array subsystem (i.e. LLNL with Lustre)
- "Smart" disk for objects: 2 SATA disks, 240/500 GB
- Prototype Seagate OSD: highly integrated, single disk

Shelf: 16-port GE switch blade, 4 Gbps per shelf to the cluster; orchestrates system activity; balances objects across OSDs; stores up to 5 TB per shelf.

Slide 22 April 26, 2005 BladeServer Storage Cluster

Shelf front: 1 DirectorBlade (DB) and 10 StorageBlades (SB).
Shelf rear: integrated GE switch and battery module (2 power units).
Midplane routes GE and power.

Slide 23 April 26, 2005 How Do Panasas Objects Scale?

Scale capacity, bandwidth, and reliability by striping according to a small map.

File comprised of: user data, attributes, layout.

Scalable Object Map:
1. Purple OSD & object
2. Gold OSD & object
3. Red OSD & object
Plus stripe size and RAID level.
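A sketch of how a small per-file map like the one above (a list of component OSDs and objects plus stripe unit and RAID level) resolves a file offset; the arithmetic is plain RAID-0 striping and the names are illustrative, not the Panasas layout format.

```python
# Sketch of resolving a file offset through a small per-file object map like
# the one above. Plain RAID-0 striping arithmetic; the field names and stripe
# unit are illustrative, not the Panasas on-wire layout format.
from dataclasses import dataclass

@dataclass
class ObjectMap:
    components: list              # [(osd, object_id), ...] e.g. purple, gold, red
    stripe_unit: int = 64 * 1024
    raid_level: int = 0

    def locate(self, file_offset: int):
        """Map a byte offset in the file to (osd, object_id, object_offset)."""
        stripe_no = file_offset // self.stripe_unit
        within = file_offset % self.stripe_unit
        idx = stripe_no % len(self.components)       # which component object
        row = stripe_no // len(self.components)      # full rows of stripe units before it
        osd, object_id = self.components[idx]
        return osd, object_id, row * self.stripe_unit + within

m = ObjectMap([("purple-osd", 0x101), ("gold-osd", 0x102), ("red-osd", 0x103)])
print(m.locate(200 * 1024))   # the 4th stripe unit wraps back to the purple component
```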

Slide 24 April 26, 2005 Object Storage Bandwidth

Chart: scalable bandwidth demonstrated with GE switching -- aggregate throughput (GB/sec, 0 to 12) plotted against the number of Object Storage Devices (0 to 350). Lab results.

Slide 25 April 26, 2005 ActiveScale SW Architecture

Diagram: ActiveScale software architecture.
- DirectorBlades: realm & performance managers plus web management server; virtual sub-managers (file, quota, storage, and cache managers, DB); ActiveScale protocol servers (NFS, CIFS, NDMP, NTP + management, DHCP agent); DirectFLOW file system with RAID, zero-copy buffer cache, and NVRAM; OSD/iSCSI and RPC over TCP/IP.
- StorageBlades: DirectFLOW object file system served over OSD/iSCSI on TCP/IP.
- DirectFLOW client: POSIX and Windows applications over VFS; DirectFLOW module alongside local, NFS, and CIFS file systems; client-side RAID and buffer cache; OSD/iSCSI over TCP/IP to the StorageBlades.

Slide 26 April 26, 2005 DirectFLOW Client

- Standard Linux POSIX syscalls, with common performance exceptions
  - Reduce serialization & media updates (e.g. durable read timestamps)
  - Add create parameters for file width, RAID level, and stripe unit size
  - Add open in concurrent-write mode, where the application handles consistency
- Mount options
  - Local (NFS-style) mount anywhere in the client, or global (AFS-style) mount at "/panfs" on all clients
- Cache consistency and callbacks (lease sketch below)
  - A client needs a callback on a file in order to cache any file state
  - Modes: exclusive, read-only, read-write (non-cacheable), concurrent write
  - Callbacks are really leases, so AWOL clients lose their promises
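A sketch of the lease-style callback rule above: a client may cache file state only while it holds an unexpired callback in a cacheable mode. The mode names come from the slide; the lease length and the API are assumptions.

```python
# Sketch of lease-style callbacks as described above: a client may cache file
# state only while it holds an unexpired callback in a cacheable mode, so a
# client that goes AWOL simply loses its promise when the lease runs out.
# Mode names come from the slide; the lease length and API are assumptions.
import time

MODES = {"exclusive", "read-only", "read-write", "concurrent-write"}

class CallbackTable:
    def __init__(self, lease_seconds: float = 30.0):
        self.lease_seconds = lease_seconds
        self.grants = {}                     # (file_id, client_id) -> (mode, expiry)

    def grant(self, file_id, client_id, mode):
        assert mode in MODES
        self.grants[(file_id, client_id)] = (mode, time.time() + self.lease_seconds)

    def may_cache_data(self, file_id, client_id) -> bool:
        mode, expiry = self.grants.get((file_id, client_id), (None, 0.0))
        # Read-write and concurrent-write callbacks are non-cacheable for data;
        # an expired lease is treated as if it were never granted.
        return mode in ("exclusive", "read-only") and time.time() < expiry

    def expire(self):
        """Drop leases of AWOL clients; their cached promises are now void."""
        now = time.time()
        self.grants = {k: v for k, v in self.grants.items() if v[1] > now}
```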

Slide 27 April 26, 2005 Fault Tolerance

- Overall up/down state of blades
  - A subset of managers tracks overall state with heartbeats
  - They maintain identical state with quorum/consensus
- Per-file RAID: no parity for unused capacity
  - RAID level chosen per file: small files mirror, RAID 5 for large files (see the sketch below)
  - First step toward policy: quality of storage associated with the data
- Scalable rebuild rate
  - Large files are striped across all blades
  - Reconstruction XOR runs on all managers
  - Rebuilt files are spread across all blades
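A sketch of the per-file RAID policy above; the size threshold and stripe widths are illustrative assumptions, not Panasas defaults.

```python
# Sketch of the per-file RAID policy above: small files are mirrored, large
# files use RAID 5 striped widely so reconstruction XOR and rebuilt data are
# spread across the system, and unused capacity carries no parity at all
# because parity exists only inside files. The 64 KB threshold and stripe
# widths are illustrative assumptions, not Panasas defaults.

SMALL_FILE_THRESHOLD = 64 * 1024

def choose_layout(file_size: int, available_blades: int) -> dict:
    if file_size <= SMALL_FILE_THRESHOLD:
        # Mirroring keeps small-write cost low and needs only two blades.
        return {"raid_level": 1, "width": 2}
    # Wide RAID 5 stripes let every blade and manager share rebuild work.
    return {"raid_level": 5, "width": min(available_blades, 11)}

print(choose_layout(4 * 1024, available_blades=20))    # -> mirrored small file
print(choose_layout(1 << 30, available_blades=20))     # -> wide RAID 5 stripe
```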

Slide 28 April 26, 2005 Manageable Storage Clusters

- Snapshots: consistency for copying and backing up
  - Copy-on-write duplication of the contents of objects
  - Named as "…/.snapshot/JulianSnapTimestamp/filename"
  - Snaps can be scheduled and auto-deleted
- Soft volumes: grow management without physical constraints
  - Volumes can be quota bounded, unbounded, or just send email on a threshold
  - Multiple volumes can share the space of a set of shelves (double-disk-failure domain)
- Capacity and load balancing: seamless use of a growing set of blades (see the sketch below)
  - All blades track capacity & load; the manager aggregates & ages utilization metrics
  - Unbalanced systems influence allocation and can trigger moves
  - Adding a blade simply makes the system unbalanced for a while
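A sketch of the capacity-balancing idea above: blades report utilization, the manager ages the metrics, and new allocations are weighted toward under-used blades. The aging factor and weighting scheme are assumptions for illustration.

```python
# Sketch of the capacity/load balancing above: blades report utilization, the
# manager aggregates and ages the metrics, and new allocations are weighted
# toward under-used blades, so a newly added blade just leaves the system
# "unbalanced for a while". Aging factor and weighting are assumptions.
import random

class Balancer:
    def __init__(self, aging: float = 0.8):
        self.aging = aging
        self.utilization = {}              # blade_id -> aged fraction full (0..1)

    def report(self, blade_id: str, fraction_full: float):
        previous = self.utilization.get(blade_id, fraction_full)
        self.utilization[blade_id] = self.aging * previous + (1 - self.aging) * fraction_full

    def pick_blades(self, count: int):
        """Choose blades for a new allocation, weighted toward free space (with replacement)."""
        blades = list(self.utilization)
        weights = [1.0 - self.utilization[b] + 1e-6 for b in blades]
        return random.choices(blades, weights=weights, k=count)

b = Balancer()
for blade, full in [("sb1", 0.9), ("sb2", 0.5), ("sb3", 0.1)]:   # sb3 newly added
    b.report(blade, full)
print(b.pick_blades(3))   # sb3 is chosen most often until capacity evens out
```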

Slide 29 April 26, 2005 Out-of-band DF & Clustered NAS

Slide 30 April 26, 2005 Scale Out 1b: Any NFS Server Will Do

- A single server does all data transfer within a single system image
- Servers share access to all storage and "hand off" the role of accessing storage
- Control and data traverse the mount point path (in band), passing through one server
- Hand-off is cheap: the difference between 100% and 0% colocated front and back end was about zero SFS97 NFS op/s and < 10% latency increase for Panasas NFS
- Allows single-file & single-file-system capacity & bandwidth to scale with the server cluster

Diagram: clients on a host network, a file server cluster sharing access to disk arrays on a storage network.

Slide 31 April 26, 2005 Performance & Scalability for all Workloads

Objects: breakthrough data throughput AND random I/O

Source: SPEC.org & Panasas

Slide 32 April 26, 2005 ActiveScale In Practice

May 4, 2005 Los Alamos Lightning

1400 nodes and 60TB (120 TB): Ability to deliver ~ 3 GB/s* (~6 GB/s)

Diagram: 12 Panasas shelves connected through a switch to the 1400-node Lightning cluster over multiple trunked GE links (groups of 4 and 8).

Slide 34 April 26, 2005 Progress in the Field

- Seismic processing
- Genomics sequencing
- HPC simulations
- Post-production rendering

Slide 35 April 26, 2005 pNFS - Parallel NFS

May 4, 2005 Out-of-Band Interoperability Issues

ADVANTAGES
- Capacity scaling
- Faster bandwidth scaling

DISADVANTAGES
- Requires a client kernel addition
- Many non-interoperable solutions
- Not necessarily able to replace NFS

EXAMPLE FEATURES
- POSIX, plus & minus
- Global mount point
- Fault-tolerant cache coherence
- RAID 0, 1, 5 & snapshots
- Distributed metadata; online growth and upgrade

Diagram: clients running a Vendor X kernel patch/RPM access storage directly, with Vendor X file servers out of band.

Slide 37 April 26, 2005 Standards Efforts: Parallel NFS (pNFS)

- December 2003 U. Michigan workshop; whitepapers at www.citi.umich.edu/NEPS
  - U. Michigan CITI, EMC, IBM, Johns Hopkins, Network Appliance, Panasas, Spinnaker Networks, Veritas, ZForce
- NFSv4 extensions enabling clients to access data in parallel on storage devices of multiple standard flavors
  - Parallel NFS request routing / file virtualization
  - Extend block or object "layout" information to clients, enabling parallel direct access

IETF pNFS documents:
- draft-gibson-pnfs-problem-statement-01.txt
- draft-gibson-pnfs-reqs-00.txt
- draft-welch-pnfs-ops-01.txt

Slide 38 April 26, 2005 Multiple Data Server Protocols

- Inclusiveness favors success
- Three (or more) flavors of out-of-band "disk" metadata attributes, granted & revoked by the pNFS server via NFSv4 extended with orthogonal layout attributes (sketched below):
  1. BLOCKS: SBC/FCP/FC or SBC/iSCSI… -- a disk driver for files built on blocks
  2. OBJECTS: OSD/iSCSI/TCP/IP/GE for files built on objects
  3. FILES: NFS/ONCRPC/TCP/IP/GE for files built on subfiles
- Inode-level encapsulation in server and client code

Diagram: client apps sit above a pNFS IFS with per-type drivers; the pNFS server fronts a local file system and grants/revokes layouts.
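A sketch of the three layout flavors above: the metadata server grants a typed layout and a per-type client driver plans direct, parallel reads against the data servers. The structures and field names are illustrative, not the wire format from the pNFS drafts.

```python
# Sketch of the three pNFS layout flavors above: the NFSv4 metadata server
# grants a typed layout and a per-type client "disk driver" performs direct,
# parallel I/O against the data servers. Structures and field names are
# illustrative, not the wire format defined in the pNFS drafts.
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class BlockLayout:                 # data servers speak SBC over FCP or iSCSI
    extents: List[Tuple[str, int, int, int]]    # (lun, file_offset, lba, length)

@dataclass
class ObjectLayout:                # data servers speak OSD over iSCSI
    components: List[Tuple[str, int]]           # (osd_address, object_id)
    stripe_unit: int

@dataclass
class FileLayout:                  # data servers speak NFSv4 and hold subfiles
    data_servers: List[str]
    stripe_unit: int

def plan_reads(layout, offset: int, length: int):
    """Return the parallel requests the matching client driver would issue."""
    if isinstance(layout, BlockLayout):
        return [("SBC-READ", lun, lba, ext_len)
                for lun, file_off, lba, ext_len in layout.extents
                if file_off < offset + length and offset < file_off + ext_len]
    if isinstance(layout, (ObjectLayout, FileLayout)):
        targets = (layout.components if isinstance(layout, ObjectLayout)
                   else layout.data_servers)
        unit, n = layout.stripe_unit, len(targets)
        first, last = offset // unit, (offset + length - 1) // unit
        verb = "OSD-READ" if isinstance(layout, ObjectLayout) else "NFS-READ"
        return [(verb, targets[s % n]) for s in range(first, last + 1)]
    raise ValueError("unknown layout type: fall back to in-band NFSv4 reads")

layout = FileLayout(data_servers=["ds1", "ds2", "ds3"], stripe_unit=64 * 1024)
print(plan_reads(layout, offset=0, length=200 * 1024))   # four reads spread over ds1..ds3
```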

Slide 39 April 26, 2005 Object Storage Value Proposition

Diagram: a Linux compute cluster uses DirectFLOW to the Panasas storage cluster for large datasets and checkpointing; analysis & visualization tools and a results gateway reach the same data over NFS and CIFS.

Scalable Shared Storage Value Proposition
- Larger surveys; more active analysis
- Applications leverage full bandwidth
- Simplified storage management
- Results gateway

Slide 40 April 26, 2005 Next Generation Network Storage

Garth Gibson [email protected] May 4, 2005