Scalable Performance of the Panasas Parallel File System

Scalable Performance of the Panasas Parallel File System Brent Welch1, Marc Unangst1, Zainul Abbasi1, Garth Gibson12, Brian Mueller1, Jason Small1, Jim Zelenka1, Bin Zhou1 1Panasas, Inc. 2Carnegie Mellon {welch,mju,zabbasi,garth,bmueller,jsmall,jimz,bzhou}@panasas.com Abstract manufacturing, and in general purpose computing environments. In these environments, hundreds or The Panasas file system uses parallel and redundant thousands of file system clients share data and generate access to object storage devices (OSDs), per-file RAID, very high aggregate I/O load on the file system. The distributed metadata management, consistent client Panasas system is designed to support several thousand caching, file locking services, and internal cluster clients and storage capacities in excess of a petabyte. management to provide a scalable, fault tolerant, high performance distributed file system. The clustered The unique aspects of the Panasas system are its use of design of the storage system and the use of client- per-file, client-driven RAID, its parallel RAID rebuild, driven RAID provide scalable performance to many its treatment of different classes of metadata (block, concurrent file system clients through parallel access to file, system) and a commodity parts based blade file data that is striped across OSD storage nodes. hardware with integrated UPS. Of course, the system RAID recovery is performed in parallel by the cluster has many other features (such as object storage, fault of metadata managers, and declustered data placement tolerance, caching and cache consistency, and a yields scalable RAID rebuild rates as the storage system simplified management model) that are not unique, but grows larger. This paper presents performance are necessary for a scalable system implementation. measures of I/O, metadata, and recovery operations for storage clusters that range in size from 10 to 120 2 Panasas File System Background storage nodes, 1 to 12 metadata nodes, and with file system client counts ranging from 1 to 100 compute This section makes a brief tour through the system to nodes. Production installations are as large as 500 provide an overview for the following sections. The storage nodes, 50 metadata managers, and 5000 clients. two overall themes to the system are object storage, which affects how the file system manages its data, and 1 Introduction clustering of components, which allows the system to scale in performance and capacity. Storage systems for high performance computing environments must be designed to scale in performance The storage cluster is divided into storage nodes and so that they can be configured to match the required manager nodes at a ratio of about 10 storage nodes to 1 load. Clustering techniques are often used to provide manager node, although that ratio is variable. The scalability. In a storage cluster, many nodes each storage nodes implement an object store, and are control some storage, and the overall distributed file accessed directly from Panasas file system clients system assembles the cluster elements into one large, during I/O operations. The manager nodes manage the seamless storage system. The storage cluster can be overall storage cluster, implement the distributed file hosted on the same computers that perform data system semantics, handle recovery of storage node processing, or they can be a separate cluster that is failures, and provide an exported view of the Panasas devoted entirely to storage and accessible to the file system via NFS and CIFS. Figure 1 gives a basic compute cluster via a network protocol. view of the system components. The Panasas storage system is a specialized storage 2.1 Object Storage cluster, and this paper presents its design and a number of performance measurements to illustrate the An object is a container for data and attributes; it is scalability. The Panasas system is a production system analogous to the inode inside a traditional UNIX file that provides file service to some of the largest compute system implementation. Specialized storage nodes clusters in the world, in scientific labs, in seismic data called Object Storage Devices (OSD) store objects in a processing, in digital animation studios, in local OSDFS file system. The object interface computational fluid dynamics, in semiconductor addresses objects in a two-level (partition ID/object ID) namespace. The OSD wire protocol provides byte- 2.2 System Software Components oriented access to the data, attribute manipulation, creation and deletion of objects, and several other The major software subsystems are the OSDFS object specialized operations [OSD04]. We use an iSCSI storage system, the Panasas file system metadata transport to carry OSD commands that are very similar manager, the Panasas file system client, the NFS/CIFS to the OSDv2 standard currently in progress within gateway, and the overall cluster management system. SNIA and ANSI-T10 [SNIA]. The Panasas client is an installable kernel module that runs inside the Linux kernel. The kernel module implements the standard VFS interface, so that the Compute node client hosts can mount the file system and use a POSIX Client interface to the storage system. We don’t require any patches to run inside the 2.4 or 2.6 Linux kernel, and have tested with over 200 Linux variants. RPC Each storage cluster node runs a common platform that is based on FreeBSD, with additional services to NFS/CIFS SysMgr provide hardware monitoring, configuration management, and overall control. The storage nodes use a specialized local file system Client PanFS iSCSI/ OSD (OSDFS) that implements the object storage primitives. They implement an iSCSI target and the OSD command set. The OSDFS object store and iSCSI Manager node target/OSD command processor are kernel modules. OSDFS OSDFS is concerned with traditional block-level file system issues such as efficient disk arm utilization, media management (i.e., error handling), high throughput, as well as the OSD interface. Storage node The cluster manager (SysMgr) maintains the global configuration, and it controls the other services and nodes in the storage cluster. There is an associated Figure 1: Panasas System Components management application that provides both a command The Panasas file system is layered over the object line interface (CLI) and an HTML interface (GUI). storage. Each file is striped over two or more objects to These are all user level applications that run on a subset provide redundancy and high bandwidth access. The of the manager nodes. The cluster manager is file system semantics are implemented by metadata concerned with membership in the storage cluster, fault managers that mediate access to objects from clients of detection, configuration management, and overall the file system. The clients access the object storage control for operations like software upgrade and system using the iSCSI/OSD protocol for Read and Write restart [Welch07]. operations. The I/O operations proceed directly and in The Panasas metadata manager (PanFS) implements parallel to the storage nodes, bypassing the metadata the file system semantics and manages data striping managers. The clients interact with the out-of-band across the object storage devices. This is a user level metadata managers via RPC to obtain access application that runs on every manager node. The capabilities and location information for the objects that metadata manager is concerned with distributed file store files. The performance of striped file access is system issues such as secure multi-user access, presented later in the paper. maintaining consistent file- and object-level metadata, client cache coherency, and recovery from client, storage node, and metadata server crashes. Fault Object attributes are used to store file-level attributes, tolerance is based on a local transaction log that is and directories are implemented with objects that store replicated to a backup on a different manager node. name to object ID mappings. Thus the file system metadata is kept in the object store itself, rather than The NFS and CIFS services provide access to the file being kept in a separate database or some other form of system for hosts that cannot use our Linux installable storage on the metadata nodes. Metadata operations are file system client. The NFS service is a tuned version of described and measured later in this paper. the standard FreeBSD NFS server that runs inside the kernel. The CIFS service is based on Samba and runs at user level. In turn, these services use a local instance of the file system client, which runs inside the FreeBSD The logging mechanism is described and measured in kernel. These gateway services run on every manager detail later in the paper. The system monitors the node to provide a clustered NFS and CIFS service. battery charge level, and will not allow a shelf chassis to enter service without an adequately charged battery 2.3 Commodity Hardware Platform to avoid data loss during back-to-back power failures. The storage cluster nodes are implemented as blades The other important aspect of the hardware is that that are very compact computer systems made from blades are a Field Replaceable Unit (FRU). Instead of commodity parts. The blades are clustered together to trying to repair a blade, if anything goes wrong with the provide a scalable platform. Up to 11 blades fit into a hardware, the whole blade is replaced. We settled on a 4U (7 inches) high shelf chassis that provides dual two-drive storage blade as a compromise between cost, power supplies, a high capacity battery, and one or two performance, and reliability. Having the blade as a 16-port GE switches. The switches aggregate the GE failure domain simplifies our fault tolerance ports from the blades into a 4 GE trunk. The 2nd switch mechanisms, and it provides a simple maintenance provides redundancy and is connected to a 2nd GE port model for system administrators. Reliability and data on each blade.

Load more