Storage Aggregation for Performance & Availability: The Path from Physical RAID to Virtual Objects
Univ. Minnesota, Third Intelligent Storage Workshop
Garth Gibson, CTO, Panasas, and Assoc. Prof., CMU
[email protected]
May 4, 2005

Slide 2: Cluster Computing: The New Wave
- Traditional computing: monolithic computers with a single data path, limited BW & I/O, islands of storage; inflexible (scale to a bigger box), expensive, monolithic storage.
- Cluster computing: Linux compute cluster with parallel data paths, but complex scaling issues -- file & total BW, file & total capacity, load & capacity balancing -- at a lower $/Gbps.

Slide 3: Panasas: Next Gen Cluster Storage
ActiveScale Storage Cluster
- Scalable performance: parallel data paths to compute nodes; scale clients, network and capacity; as capacity grows, performance grows.
- Simplified and dynamic management: robust, shared file access by many clients; seamless growth within a single namespace eliminates time-consuming admin tasks.
- Integrated HW/SW solution: optimizes performance and manageability; ease of integration and support.
- [Diagram: in a single step, the compute cluster performs its job directly from the high-I/O Panasas storage cluster, with parallel data paths to Object Storage Devices and a control path to metadata managers.]

Section: Rooted Firmly in RAID

Slide 5: Birth of RAID (1986-1991)
- Member of the 4th Berkeley RISC CPU design team (SPUR, 84-89); Dave Patterson decides CPU design is a "solved" problem and sends me to figure out how storage plays in SYSTEM PERFORMANCE.
- The IBM 3380 disk is 4 arms in a 7.5 GB washing-machine box (SLED: Single Large Expensive Disk).
- The new PC industry demands cost-effective 100 MB 3.5" disks, enabled by the new SCSI embedded-controller architecture.
- Use many PC disks for parallelism: SIGMOD88, "A Case for RAID."
- P.S. $10-20 per MB (~1000X now), 100 MB/arm (~1000X now), 20-30 IO/sec/arm (5X now).

Slide 6: But RAID is really about Availability
- Arrays have more Hard Disk Assemblies (HDAs) -- more failures.
- Apply replication and/or error/erasure detection codes.
- Mirroring wastes 50% of space; RAID wastes 1/N.
- Mirroring halves, and RAID 5 quarters, small-write bandwidth (see the parity sketch after this section's slides).

Slide 7: Off to CMU & More Availability
- Parity declustering "spreads RAID groups" to reduce MTTR: each parity disk block protects fewer than all data disk blocks (C).
- Virtualizing the RAID group lessens recovery work: faster recovery, better user response time during recovery, or a mixture of both.
- RAID over X? X = independent fault domains; "disk" is the easiest "X."
- Parity declustering is the first step in RAID virtualization.

Section: Network-Attached Secure Disks (NASD, 95-99)

Slide 9: Storage Interconnect Evolution
- Outboard circuitry increases over time (VLSI density).
- Hardware (#hosts, #disks, #paths) sharing increases over time; logical (information) sharing is limited by host SW.
- 1995: Fibre Channel packetizes SCSI over a near-general network, and packetized SCSI will go over any network (iSCSI).
- What do we do next with device intelligence?
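To make the space-overhead and small-write numbers on the availability slide concrete, here is a minimal Python sketch (an illustration for this writeup, not code from the talk), assuming byte-wise XOR parity as in RAID 5:

```python
# Minimal sketch: space overhead and the small-write penalty for
# mirroring vs. RAID 5, using byte-wise XOR parity.

def xor_parity(blocks: list[bytes]) -> bytes:
    """Parity block = byte-wise XOR of all blocks in the stripe."""
    parity = bytearray(len(blocks[0]))
    for block in blocks:
        for i, b in enumerate(block):
            parity[i] ^= b
    return bytes(parity)

def reconstruct(surviving: list[bytes], parity: bytes) -> bytes:
    """Rebuild the one missing block: XOR of the parity and the survivors."""
    return xor_parity(surviving + [parity])

# Space overhead: mirroring stores every block twice (50% wasted);
# an N-disk RAID 5 stripe gives up 1/N of its raw capacity to parity.
N = 5
print(f"mirroring overhead: 50%, RAID 5 overhead: {100 / N:.0f}%")

# Small-write penalty for updating one block in place:
#   mirroring: 2 writes            -> ~1/2 of raw write bandwidth
#   RAID 5:    2 reads + 2 writes  -> ~1/4 of raw write bandwidth
#   (read old data + old parity, write new data + new parity)

# Demo: lose one block of a 4-block stripe and rebuild it from parity.
stripe = [bytes([d] * 4) for d in (1, 2, 3, 4)]
p = xor_parity(stripe)
lost = stripe.pop(2)          # "fail" one disk
assert reconstruct(stripe, p) == lost
print("reconstructed lost block:", reconstruct(stripe, p).hex())
```

The recovery side of this arithmetic is what parity declustering improves: when each parity block protects fewer than all of the data blocks, fewer surviving blocks must be read per failed block, so rebuild work shrinks.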
Slide 10: Storage as First Class Network Citizen
- Direct transfer between client and storage.
- Exploit scalable switched cluster-area networking.
- Split file service into primitives (in the drive) and policies (in the manager).

Slide 11: NASD Architecture
- Before NASD there was store-and-forward Server-Attached Disk (SAD).
- Move access control, consistency and cache decisions out of band.
- Raise the storage abstraction: encapsulate layout, offload data access.

Slide 12: Fine Grain Access Enforcement
- The state of the art is a VPN of all out-of-band clients and all sharable data and metadata: accident-prone and vulnerable to a subverted client (analogy to single-address-space computing).
- Instead, the client uses a digitally signed, object-specific capability on each request (a sketch of this scheme in code follows Slide 19 below).
- [Protocol diagram; the file manager and each NASD share a secret key:]
  1. Client requests access from the file manager (private communication).
  2. File manager returns CapArgs and CapKey, where CapArgs = ObjID, Version, Rights, Expiry, ... and CapKey = MAC_SecretKey(CapArgs).
  3. Client sends CapArgs, Req, NonceIn, ReqMAC to the NASD, where ReqMAC = MAC_CapKey(Req, NonceIn) (integrity/privacy).
  4. NASD returns Reply, NonceOut, ReplyMAC, where ReplyMAC = MAC_CapKey(Reply, NonceOut).

Section: Scalable File System Taxonomy

Slide 14: Today's Ubiquitous NFS
- Advantages: familiar, stable & reliable; widely supported by vendors; competitive market.
- Disadvantages: capacity doesn't scale; bandwidth doesn't scale; clustering only by customer-exposed namespace partitioning.
- [Diagram: clients on a host net reach file servers, which reach disk arrays over a storage net; each server exports a sub-file system.]

Slide 15: Scale Out 1: Forwarding NFS Servers
- Bind many file servers into a single system image with forwarding (e.g., Tricord, Spinnaker).
- Mount-point binding becomes less relevant, allowing DNS-style balancing; more manageable.
- Control and data traverse the mount-point path (in band), passing through two servers.
- Single-file and single-file-system bandwidth is limited by the back-end server & storage.
- [Diagram: clients, host net, file server cluster, storage net, disk arrays.]

Slide 16: Scale Out File Service w/ Out-of-Band
- The client sees many storage addresses and accesses them in parallel.
- Zero file servers in the data path allows high bandwidth through scalable networking.
- Mostly built on block-based SANs where servers trust all clients: e.g., IBM SanFS, IBM GPFS, EMC HighRoad, Sun QFS, SGI CXFS, etc.
- Beginning to be built on object storage (SANs too): e.g., Panasas DirectFlow, Lustre.
- [Diagram: clients, file servers, storage.]

Section: Object Storage Standards

Slide 18: Object Storage Architecture
- An evolutionary improvement to the standard SCSI storage interface (OSD).
- Offload most data-path work from the server to intelligent storage.
- Finer granularity of security: protect & manage one file at a time.
- Raises the level of abstraction: an object is a container for "related" data, so the storage understands how the blocks of a "file" are related -> self-management.
- Per-object extensible attributes are a key expansion of storage functionality.
- Block-based disk vs. object-based disk (source: Intel):
    Operations:  read block, write block   vs.  create, delete, read, write object
    Addressing:  block range               vs.  [object, byte range]
    Allocation:  external                  vs.  internal
    Security:    at volume level           vs.  at object level

Slide 19: Why I Like Objects
- Objects are flexible and autonomous (SNIA Shared Storage Model, 2001): variable-length data with layout metadata encapsulated; share one access-control decision; amortize object metadata; defer allocation and cache more smartly.
- With extensible attributes (e.g. size, timestamps, ACLs, and more), some updated inline by the device: an extensible programming model.
- Metadata server decisions are signed & cached at clients and enforced at the device; the rights and object map are small relative to a block allocation map.
- Clients can be untrusted because exposure to bugs & attacks is limited to authorized object data.
- Cache decisions (& maps) are replaced transparently -- dynamic remapping -- virtualization.
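As a concrete illustration of the capability scheme on the Fine Grain Access Enforcement slide, the following minimal Python sketch uses HMAC-SHA256 as a stand-in for the unspecified MAC; the field encodings and names are hypothetical, not the NASD wire format:

```python
# Sketch of NASD-style capability enforcement: the file manager signs
# capability arguments with a secret key it shares with the drive, the
# client MACs each request with the derived CapKey, and the drive
# recomputes both to enforce access without contacting the manager.
import hmac, hashlib

def mac(key: bytes, *fields: bytes) -> bytes:
    return hmac.new(key, b"|".join(fields), hashlib.sha256).digest()

secret_key = b"shared-by-manager-and-drive"   # never revealed to clients

# Steps 1-2: manager grants a capability for one object.
cap_args = b"ObjID=0x2a|Version=7|Rights=READ|Expiry=2005-05-04"  # hypothetical encoding
cap_key = mac(secret_key, cap_args)           # CapKey = MAC_SecretKey(CapArgs)

# Step 3: client signs its request with CapKey.
req, nonce_in = b"READ bytes 0..65535", b"nonce-1"
req_mac = mac(cap_key, req, nonce_in)         # ReqMAC = MAC_CapKey(Req, NonceIn)

# Drive side: re-derive CapKey from CapArgs and verify the request MAC.
drive_cap_key = mac(secret_key, cap_args)
assert hmac.compare_digest(req_mac, mac(drive_cap_key, req, nonce_in))

# Step 4: drive MACs its reply the same way.
reply, nonce_out = b"65536 bytes of object data", b"nonce-2"
reply_mac = mac(drive_cap_key, reply, nonce_out)  # ReplyMAC = MAC_CapKey(Reply, NonceOut)
assert hmac.compare_digest(reply_mac, mac(cap_key, reply, nonce_out))
print("capability accepted; request and reply MACs verified")
```

The point of the design carries through regardless of MAC choice: the drive can enforce per-object rights from the shared secret and the CapArgs presented with each request, so the file manager stays out of the data path.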
Slide 20: OSD is now an ANSI Standard
- [Timeline, 1995-2005: CMU NASD -> NSIC NASD -> SNIA/T10 OSD standard, alongside Panasas OSD and Lustre.]
- T10/SCSI ratified the OSD command set on 9/27/04.
- Key players: co-chaired by IBM and Seagate; the protocol is a general framework (transport independent).
- See www.snia.org/tech_activities/workgroups/osd and www.t10.org/ftp/t10/drafts/osd/osd-r10.pdf
- KEY: the resulting standard builds options for customers looking to deploy objects.

Section: ActiveScale Storage Cluster

Slide 22: Object Storage Systems
- Expect a wide variety of Object Storage Devices: a disk-array subsystem (i.e., LLNL with Lustre); a blade with 2 SATA disks (240/500 GB); a "smart" disk for objects (prototype Seagate OSD, highly integrated, single disk).
- 16-port GE switch blade: 4 Gbps per shelf to the cluster.
- Orchestrates system activity and balances objects across OSDs; stores up to 5 TB per shelf.

Slide 23: BladeServer Storage Cluster
- [Photos: shelf front (1 DB, 10 SB) and shelf rear.]
- Integrated GE switch; battery module (2 power units).
- DirectorBlade plus StorageBlades; the midplane routes GE and power.

Slide 24: How Do Panasas Objects Scale?
- Scale capacity, bandwidth and reliability by striping according to a small map.
- A file is comprised of user data, attributes and a layout.
- Scalable object map: 1. purple OSD & object, 2. gold OSD & object, 3. red OSD & object; plus stripe size and RAID level (see the striping sketch after the final slide).

Slide 25: Object Storage Bandwidth
- Scalable bandwidth demonstrated with GE switching.
- [Chart, lab results: aggregate bandwidth in GB/sec (0-12) versus number of Object Storage Devices (0-350).]

Slide 26: ActiveScale SW Architecture
- [Architecture diagram: ActiveScale DirectorBlades run the realm & performance managers and a web management server, virtual sub-managers (file, quota, storage and cache managers, DB), and ActiveScale protocol servers (DirectFLOW, NFS, CIFS, NDMP) plus NTP, management and DHCP agents over TCP/IP. DirectFLOW StorageBlades run the DFLOW object file system with RAID 0, a zero-copy buffer cache and NVRAM behind OSD/iSCSI over TCP/IP. DirectFLOW clients sit under the Linux VFS with a local buffer cache and client-side RAID, alongside NFS and CIFS access for UNIX POSIX and Windows NT applications.]

DirectFLOW Client
- Standard installable file system for Linux.
- POSIX syscalls with common performance exceptions: reduce serialization & media updates (e.g. ...
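Finally, the "How Do Panasas Objects Scale?" slide describes a file's layout as a small map of component (OSD, object) pairs plus a stripe size and RAID level. The minimal sketch below shows how such a map could translate a file offset into a component object and offset; class and field names are hypothetical, and the mapping covers only the plain-striping (RAID 0) case, not the actual DirectFLOW layout:

```python
# Sketch of a small per-file map that stripes a file across component
# objects: a list of (OSD, object) pairs plus a stripe size and RAID level.
from dataclasses import dataclass

@dataclass
class ObjectMap:
    components: list[tuple[str, int]]   # (OSD id, object id) per component
    stripe_size: int                    # bytes per stripe unit
    raid_level: int                     # e.g. 0; locate() shows the RAID-0 case

    def locate(self, offset: int) -> tuple[str, int, int]:
        """Map a file byte offset to (OSD, object id, offset within object)."""
        stripe_unit = offset // self.stripe_size
        osd, obj = self.components[stripe_unit % len(self.components)]
        units_on_component = stripe_unit // len(self.components)
        return osd, obj, units_on_component * self.stripe_size + offset % self.stripe_size

# The slide's example: three components on the purple, gold and red OSDs.
m = ObjectMap(components=[("purple", 0x101), ("gold", 0x202), ("red", 0x303)],
              stripe_size=64 * 1024, raid_level=0)
for off in (0, 64 * 1024, 200 * 1024):
    print(off, "->", m.locate(off))
# Growing the file or adding OSDs changes only this small map, not a
# per-block allocation table -- which is why the map stays small enough
# to sign and cache at clients.
```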