Panache: A Parallel File System Cache for Global File Access
Marc Eshel  Roger Haskin  Dean Hildebrand  Manoj Naik  Frank Schmuck  Renu Tewari
IBM Almaden Research
{eshel, roger, manoj, schmuck}@almaden.ibm.com, {dhildeb, tewarir}@us.ibm.com

Abstract

Cloud computing promises large-scale and seamless access to vast quantities of data across the globe. Applications will demand the reliability, consistency, and performance of a traditional cluster file system regardless of the physical distance between data centers.

Panache is a scalable, high-performance, clustered file system cache for parallel data-intensive applications that require wide area file access. Panache is the first file system cache to exploit parallelism in every aspect of its design—parallel applications can access and update the cache from multiple nodes while data and metadata are pulled into and pushed out of the cache in parallel. Data is cached and updated using pNFS, which performs parallel I/O between clients and servers, eliminating the single-server bottleneck of vanilla client–server file access protocols. Furthermore, Panache shields applications from fluctuating WAN latencies and outages and is easy to deploy as it relies on open standards for high-performance file serving and does not require any proprietary hardware or software to be installed at the remote cluster.

In this paper, we present the overall design and implementation of Panache and evaluate its key features with multiple workloads across local and wide area networks.

1 Introduction

Next generation data centers, global enterprises, and distributed cloud computing all require sharing of massive amounts of file data in a consistent, efficient, and reliable manner across a wide-area network. The two emerging trends of offloading data to a distributed storage cloud and using the MapReduce [11] framework for building highly parallel data-intensive applications have highlighted the need for an extremely scalable infrastructure for moving, storing, and accessing massive amounts of data across geographically distributed sites. While large cluster file systems, e.g., GPFS [26], Lustre [3], PanFS [29], and Internet-scale file systems, e.g., GFS [14], HDFS [6], can scale in capacity and access bandwidth to support a large number of clients and petabytes of data, they cannot mask the latency and fluctuating performance of accessing data across a WAN.

Traditionally, NFS (for Unix) and CIFS (for Windows) have been the protocols of choice for remote file serving. Originally designed for local area access, both are rather "chatty" and therefore unsuited for wide-area access. NFSv4 has numerous optimizations for wide-area use, but its scalability continues to suffer from the "single server" design. NFSv4.1, which includes pNFS, improves I/O performance by enabling parallel data transfers between clients and servers. Unfortunately, while NFSv4 and pNFS can improve network and I/O performance, they cannot completely mask WAN latencies nor operate during intermittent network outages.

As "storage cloud" architectures evolve from a single high-bandwidth data center towards a larger multi-tiered storage delivery architecture, e.g., Nirvanix SDN [7], file data needs to be efficiently moved across locations and be accessible using standard file system APIs. Moreover, for data-intensive applications to function seamlessly in "compute clouds", the data needs to be cached closer to or at the site of the computation. Consider a typical multi-site compute cloud architecture that presents a virtualized environment to customer applications running at multiple sites within the cloud. Applications run inside a virtual machine (VM) and access data from a virtual LUN, which is typically stored as a file, e.g., VMware's .vmdk file, in one of the data centers. Today, whenever a new virtual machine is configured, migrated, or restarted on failure, the OS image and its virtual LUN (greater than 80 GB of data) must be transferred between sites, causing long delays before the application is ready to be online. A better solution would store all files at a central core site and then dynamically cache the OS image and its virtual LUN at an edge site closer to the physical machine. The machine hosting the VMs (e.g., the ESX server) would connect to the edge site to access the virtual LUNs over NFS while the data would move transparently between the core and edge sites on demand. This enormously simplifies both the time and complexity of configuring new VMs and dynamically moving them across a WAN.

Research efforts on caching file system data have mostly been limited to improving the performance of a single client machine [18, 25, 22]. Moreover, most available solutions are NFS client based caches [15, 18] and cannot function as a standalone file system (without network connectivity) that can be used by a POSIX-dependent application. What is needed is the ability to pull and push data in parallel across a wide-area network and store it in a scalable underlying infrastructure, while guaranteeing file system consistency semantics.

In this paper we describe Panache, a read-write, multi-node file system cache built for scalability and performance. The distributed and parallel nature of the system completely changes the design space and requires re-architecting the entire stack to eliminate bottlenecks. The key contribution of Panache is a fully parallelizable design that allows every aspect of the file system cache to operate in parallel. These include:

• parallel ingest wherein, on a miss, multiple files and multiple chunks of a file are pulled into the cache in parallel from multiple nodes,

• parallel access wherein a cached file is accessible immediately from all the nodes of the cache,

• parallel update where all nodes of the cache can write and queue, for remote execution, updates to the same file in parallel or update the data and metadata of multiple files in parallel,

• parallel delayed data write-back wherein the written file data is asynchronously flushed in parallel from multiple nodes of the cache to the remote cluster, and

• parallel delayed metadata write-back where all metadata updates (file creates, removes etc.) can be made from any node of the cache and asynchronously flushed back in parallel from multiple nodes of the cache. The multi-node flush preserves the order in which dependent operations occurred to maintain correctness.

There is, by design, no single metadata server and no single network end point to limit scalability as is the case in typical NAS systems. In addition, all data and metadata updates made to the cache are asynchronous. This is essential to support WAN latencies and outages, as high performance applications cannot function if every update operation requires a WAN round-trip (with latencies running from 30ms to more than 200ms).

While the focus in this paper is on the parallel aspects of the design, Panache is a fully functioning POSIX-compliant caching file system with additional features including disconnected operations, persistence across failures, and consistency management, all of which are needed for a commercial deployment. Panache also borrows from Coda [25] the basic premise of conflict handling and conflict resolution when supporting disconnected mode operations and manages them in a clustered setting. However, these are beyond the scope of this paper. In this paper, we present the overall design and implementation of Panache and evaluate its key features with multiple workloads across local and wide area networks.

The rest of the paper is organized as follows. In the next two sections we provide a brief background of pNFS and GPFS, the two essential components of Panache. Section 4 provides an overview of the Panache architecture. The details of how synchronous and asynchronous operations are handled are described in Section 5 and Section 6. Section 7 presents the evaluation of Panache using different workloads. Finally, Section 8 discusses the related work and Section 9 presents our conclusions.

2 Background

In order to better understand the design of Panache let us review its two basic components: GPFS, the parallel cluster file system used to store the cached data, and pNFS, the nascent industry-standard protocol for transferring data between the cache and the remote site.

GPFS: General Parallel File System [26] is IBM's high-performance shared-disk cluster file system. GPFS achieves its extreme scalability through a shared-disk architecture. Files are wide-striped across all disks in the file system, where the number of disks can range from tens to several thousand in the largest GPFS installations. In addition to balancing the load on the disks, striping achieves the full throughput that the disk subsystem is capable of by reading and writing data blocks in parallel.

The switching fabric that connects file system nodes to disks may consist of a storage area network (SAN), e.g., Fibre Channel or iSCSI, or a general-purpose network using I/O server nodes. GPFS uses distributed locking to synchronize access to shared disks, where all nodes share responsibility for data and metadata consistency. GPFS distributed locking protocols ensure file system consistency is maintained regardless of the number of nodes simultaneously reading from and writing to the file system, while at the same time allowing the parallelism necessary to achieve maximum throughput.

pNFS: The pNFS protocol, now an integral part of NFSv4.1, enables clients to perform direct and parallel access to storage while preserving operating system, hardware platform, and file system independence [16]. pNFS clients and servers are responsible for control and file management operations, but delegate I/O functionality to a storage-specific layout driver on the client.

To perform direct and parallel I/O, a pNFS client first requests layout information from a pNFS server. A layout contains the information required to access any byte of a file. The layout driver uses the information to translate I/O requests from the pNFS client into I/O requests directed to the data servers. For example, the NFSv4.1 file-based storage protocol stripes files across NFSv4.1 data servers, with only READ, WRITE, COMMIT, and session operations sent on the data path. The pNFS metadata server can generate layout information itself or request assistance from the underlying file system.
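To make the layout idea concrete, the following is a minimal sketch of how a round-robin, file-based layout can map a byte range onto data servers, in the spirit of the striping described above. It is illustrative only, not GPFS or pNFS layout-driver code, and the function name, parameters, and server identifiers are assumptions.

# Illustrative sketch of a round-robin file layout (names and parameters are hypothetical).
def servers_for_range(offset, length, stripe_size, data_servers, first_server=0):
    """Map a byte range onto (server, offset, count) pieces under round-robin striping."""
    pieces = []
    end = offset + length
    while offset < end:
        stripe_unit = offset // stripe_size
        server = data_servers[(first_server + stripe_unit) % len(data_servers)]
        unit_end = (stripe_unit + 1) * stripe_size      # end of the current stripe unit
        count = min(end, unit_end) - offset             # bytes of this request in that unit
        pieces.append((server, offset, count))
        offset += count
    return pieces

if __name__ == "__main__":
    # A 4 MB read starting at offset 3 MB, with 1 MB stripe units and 4 data servers.
    for server, off, cnt in servers_for_range(3 << 20, 4 << 20, 1 << 20,
                                              ["ds1", "ds2", "ds3", "ds4"]):
        print(server, off, cnt)

A pNFS client holding such a layout can then issue the per-piece READ or WRITE operations to the different data servers in parallel, which is what removes the single-server bottleneck.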
[Figure 1 plots: aggregate throughput (MB/sec) vs. number of clients, comparing NFSv4 (1 server) and pNFS (8 servers); (a) pNFS Reads, (b) pNFS Writes.]
Figure 1: pNFS Read and Write performance. pNFS performance scales with available hardware and network bandwidth while NFSv4 performance remains constant due to the single server bottleneck.

3 pNFS for Scalable Data Transfers

Panache leverages pNFS to increase the scalability and performance of data transfers between the cache and remote site. This section describes how pNFS performs in comparison to vanilla NFSv4.

NFS and CIFS have become the de-facto file serving protocols and follow the traditional multiple client–single server model. With the single-server design, which binds one network endpoint to all files in a file system, the back-end cluster file system is exported by a single NFS server or multiple independent NFS servers.

In contrast, pNFS removes the single server bottleneck by using the storage protocol of the underlying cluster file system to distribute I/O across the bisectional bandwidth of the storage network between clients and data servers. In combination, the elimination of the single server bottleneck and direct storage access by clients yields superior remote file access performance and scalability [16].

Figure 2 displays the pNFS-GPFS architecture. The nodes in the cluster exporting data for pNFS access are divided into (possibly overlapping) groups of state and data servers. pNFS client metadata requests are partitioned among the available state servers while I/O is distributed across all of the data servers. The pNFS client requests the data layout from the state server using a LAYOUTGET operation. It then accesses data in parallel by using the layout information to send NFSv4 READ and WRITE operations to the correct data servers. For writes, once the I/O is complete, the client sends an NFSv4 COMMIT operation to the state server. This single COMMIT operation flushes data to stable storage on every data server. The underlying cluster file system management protocol maintains the freshness of NFSv4 state information among servers.

Figure 2: pNFS-GPFS Architecture. Servers are divided into (possibly overlapping) groups of state and data servers. pNFS/NFSv4.1 clients use the state servers for metadata operations and use the file-based layout to perform parallel I/O to the data servers.

To demonstrate the effectiveness of pNFS for scalable file access, Figures 1(a) and 1(b) compare the aggregate I/O performance of pNFS and standard NFSv4 exporting a seven server GPFS file system. GPFS returns a file layout to the pNFS client that stripes files across all data servers using a round-robin order and continually alternates the first data server of the stripe. Experiments use the IOR micro-benchmark [2] to increase the number of clients accessing individual large files. As the number of NFSv4 clients accessing a single NFSv4 server is increased, performance remains constant. On the other hand, pNFS can better utilize the available bandwidth. With reads, pNFS clients completely saturate the local network bandwidth. Write throughput ascends to 3.8x of standard NFSv4 performance with five clients before reaching the limitations of the storage controller.
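The write path described in this section—obtain a layout from the state server, send WRITEs to the data servers in parallel, and finish with a single COMMIT to the state server—can be summarized with the short sketch below. It is a simplified model of the protocol flow, not the kernel layout-driver implementation; the rpc_* callables and the layout object are hypothetical stand-ins for the NFSv4.1 operations.

# Simplified sketch of the pNFS client-side write flow (helpers are hypothetical).
from concurrent.futures import ThreadPoolExecutor

def pnfs_write(state_server, data, offset, rpc_layoutget, rpc_write, rpc_commit):
    layout = rpc_layoutget(state_server, offset, len(data))      # LAYOUTGET from the state server
    with ThreadPoolExecutor(max_workers=len(layout.pieces)) as pool:
        futures = []
        for piece in layout.pieces:                               # one piece per data server
            start = piece.start - offset
            chunk = data[start:start + piece.count]
            futures.append(pool.submit(rpc_write, piece.server, piece.start, chunk))
        for f in futures:
            f.result()                                            # wait for all parallel WRITEs
    rpc_commit(state_server)                                      # one COMMIT flushes every data server

Reads follow the same pattern, with READ operations in place of WRITE and no COMMIT.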

Figure 3: Panache Caching Architecture. (a) Block diagram of an application and gateway node. On the gateway node, Panache communicates with the pNFS client kernel module through the VFS layer. The application and gateway nodes communicate via custom RPCs through the user-space daemon. (b) The cache cluster architecture. The gateway nodes of the cache cluster act as pNFS/NFS clients to access the data from the remote cluster. The application nodes access data from the cache cluster.

4 Panache Architecture Overview

The design of the Panache architecture is guided by the following performance and operational requirements:

• Data and metadata read performance, on a cache hit, matches that of a cluster file system. Thus, reads should be limited only by the aggregate disk bandwidth of the local cache site and not by the WAN.

• Read performance, on a cache miss, is limited only by the network bandwidth between the sites.

• Data and metadata update performance matches that of a cluster file system update.

• The cache can operate as a standalone fileserver (in the presence of intermittent or no network connectivity), ensuring that applications continue to see a POSIX compliant file system.

Panache is implemented as a multi-node caching layer, integrated within GPFS, that can persistently and consistently cache data and metadata from a remote cluster. Every node in the Panache cache cluster has direct access to cached data and metadata. Thus, once data is cached, applications running on the Panache cluster achieve the same performance as if they were running directly on the remote cluster. If the data is not in the cache, Panache acts as a caching proxy to fetch the data in parallel, both by using a parallel read across multiple cache cluster nodes to drive the ingest and by reading from multiple remote cluster nodes using pNFS. Panache allows updates to be made to the cache cluster at local cluster performance by asynchronously pushing all updates of data and metadata to the remote cluster.

More importantly, Panache, compared to other single-node file caching solutions, can function both as a standalone clustered file system and as a clustered caching proxy. Thus applications can run on the cache cluster using POSIX semantics and access, update, and traverse the directory tree even when the remote cluster is offline. As the cache mimics the same namespace as the remote cluster, browsing through the cache cluster (say with ls -R) shows the same listing of directories and files, as well as most of their remote attributes. Furthermore, NFS/pNFS clients can access the cache and see the same view of the data (as defined by NFS consistency semantics) as NFS clients accessing the data directly from the remote cluster. In essence, both in terms of consistency and performance, applications can operate as if the WAN did not exist.

Figure 3(b) shows the schematic of the Panache architecture with the cache cluster and the remote cluster. The remote cluster can be any file system or NAS filer exporting data over NFS/pNFS. Panache can operate on a multi-node cluster (henceforth called the cache cluster) where all nodes need not be identical in terms of hardware, OS, or support for remote network connectivity. Only a set of designated nodes, called Gateway nodes, need to have the hardware and software support for remote access. These nodes internally act as NFS/pNFS client proxies to fetch the data in parallel from the remote cluster. The remaining nodes of the cluster, called Application nodes, service the application data requests from the Panache cluster. The split between application and gateway nodes is conceptual, and any node in the cache cluster can function both as a gateway node or an application node based on its configuration. The gateway nodes can be viewed as the edge of the cache cluster that can communicate with the remote cluster while the application nodes interface with the application. Figure 3(a) illustrates the internal components of a Panache node. Gateway nodes communicate with the pNFS kernel module via the VFS layer, which in turn communicates with the remote cluster. Gateway and application nodes communicate with each other via 26 different internal RPC requests from the user space daemon.

When an application request cannot be satisfied by the cache, due to a cache miss or to invalid cached data, the application node sends a read request to one of the gateway nodes. The gateway node then accesses the data from the remote cluster and returns it to the application node. Panache supports different mechanisms for gateway nodes to share the data with application nodes. One option is for the gateway nodes to write the remote data to the shared storage, which the application nodes can then read and return to the application. Another option is for gateway nodes to transfer the data directly to the application nodes using the cluster interconnect. Our current Panache prototype shares data through the storage subsystem, which can generally give higher performance than a typical network link.

All updates to the cache cause an application node to send and queue a command message on one or more gateway nodes. Note that this message includes no file data or metadata. At a later time, the gateway node(s) will read the data in parallel from the storage system and push it to the remote cluster over pNFS.

The selection of a gateway node to service a request needs to ensure that dependent requests are executed in the intended order.
The application node selects a gateway node using a hash function based on a unique identifier of the object on which a file system operation is requested. Sections 5 and 6 describe how this identifier is chosen and how Panache executes read and update operations in more detail.

4.1 Consistency

Consistency in Panache can be controlled across various dimensions and can be defined relative to the cache cluster, the remote cluster and the network connectivity.

Definition 1 Locally consistent: The cached data is considered locally consistent if a read from a node of the cache cluster returns the last write from any node of the cache cluster.

Definition 2 Validity Lag: The time delay between a read at the cache cluster reflecting the last write at the remote cluster.

Definition 3 Synchronization Lag: The time delay between a read at the remote cluster reflecting the last write at the cache cluster.

Definition 4 Eventually Consistent: After recovering from a node or network failure, in the absence of further failures, the cache and remote cluster data will eventually become consistent within the bounds of the lags.

Panache, by virtue of relying on the cluster-wide distributed locking mechanism of the underlying clustered file system, is always locally consistent for the updates made at the cache cluster. Accesses are serialized by electing one of the nodes to be the token manager and issuing read and write tokens [26]. Local consistency within the cache cluster basically translates to the traditional definition of strong consistency [17].

For cross-cluster consistency across the WAN, Panache allows both the validity lag and the synchronization (synch) lag to be tunable based on the workload. For example, setting the validity lag to zero ensures that data is always validated with the remote cluster on an open, and setting the synch lag to zero ensures that updates are flushed to the remote cluster immediately.

NFS uses an attribute timeout value (typically 30s) to recheck with the server if the file attributes have changed. Dependence on NFS consistency semantics can be removed via the O_DIRECT parameter (which disables NFS client data caching) and/or by disabling attribute caching (effectively setting the attribute timeout value to 0). NFSv4 file delegations can reduce the overhead of consistency management by having the remote cluster's NFS/pNFS server transfer ownership of a file to the cache cluster. This allows the cache cluster to avoid periodically checking the remote file's attributes and safely assume that the data is valid.

When the synch lag is greater than zero, all updates made to the cache are asynchronously committed at the remote cluster. In fact, the semantics will no longer be close-to-open, as updates will ignore the file close and will be time delayed. Asynchronous updates can result in conflicts which, in Panache, are resolved using policies as discussed in Section 6.3.

When there is a network or remote cluster failure, both the validity lag and synch lag become indeterminate. When connectivity is restored, the cache and remote clusters are eventually synchronized.

5 Synchronous Operations

Synchronous operations block until the remote operation completes, either because an object does not exist in the cache, i.e., a cache miss, or the object exists in the cache but needs to be revalidated. In either case, the object or its attributes need to be fetched or validated from the remote cluster on an application request. All file system data and metadata "read" operations, e.g., lookup, open, read, readdir, getattr, are synchronous. Unlike typical caching systems, Panache ingests the data and metadata in parallel from multiple gateway nodes so that the cache miss or pre-populate time is limited only by the network bandwidth between the caching and remote clusters.

5.1 Metadata Reads

The first time an application node accesses an object via the VFS lookup or open operations, the object is created in the cache cluster as an empty object with no data. The mapping with the remote object is through the NFS filehandle that is stored with the inode as an extended attribute. The flow of messages proceeds as follows: i) the application node sends a request to the designated gateway node based on a hash of the inode number, or of its parent inode number if the object doesn't exist, ii) the gateway node sends a request to the remote cluster's NFS/pNFS server(s), iii) on success at the remote cluster, the filehandle and attributes of the object are returned back to the gateway node, which then creates the object in the cache, marks it as empty, and stores the remote filehandle mapping, iv) the gateway node then returns success back to the application node. On a later read or prefetch request the data in the empty object will be populated.
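As a rough illustration of this miss path, the sketch below walks through steps i)–iv) for a lookup of an object that is not yet cached. It is a simplified model rather than the Panache implementation; the gateway and cache helpers and the extended-attribute name are hypothetical.

# Illustrative sketch of the Section 5.1 lookup-miss flow (all names are hypothetical).
def lookup_miss(name, parent_inode, gateway_nodes, cache):
    # i) pick the designated gateway by hashing a unique identifier of the object
    #    (here the parent inode number, since the object does not exist locally yet).
    gw = gateway_nodes[hash(parent_inode) % len(gateway_nodes)]
    # ii) that gateway performs the lookup at the remote cluster's NFS/pNFS server.
    remote_fh, attrs = gw.remote_lookup(name, parent_inode)
    # iii) create an empty placeholder object in the cache and remember the remote
    #      filehandle as an extended attribute of the new inode.
    inode = cache.create_empty(name, parent_inode, attrs)
    cache.set_xattr(inode, "panache.remote_fh", remote_fh)
    # iv) success is returned to the application node; the file data itself is only
    #     populated on a later read or prefetch request.
    return inode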
5.2 Parallel Data Reads

On an application read request, the application node first checks if the object exists in the local cache cluster. If the object exists but is empty or incomplete, the application node, as before, requests the designated gateway node to read in the requested offset and size. The gateway node, based on the prefetch policy, fetches the requested bytes or the entire file and writes it to the cache cluster. With prefetching, the whole file is asynchronously read after the byte range requested by the application is ingested. Panache supports both whole file and partial file (segments consisting of a set of contiguous blocks) caching. Once the data is ingested, the application node reads the requested bytes from the local cache and returns them to the application as if they were present locally all along. Recall that the application and gateway nodes exchange only request and response messages while the actual data is accessed locally via the shared storage subsystem. On a later cache hit, the application node(s) can directly service the file read request from the local cache cluster. The cache miss performance is, therefore, limited by the network bandwidth to the remote cluster, while the cache hit performance is limited only by the local storage subsystem bandwidth (as shown in Table 1).

File Read          | 2 gateway nodes | 3 gateway nodes
Miss               | 1.456 Gb/s      | 1.952 Gb/s
Hit                | 8.24 Gb/s       | 8.24 Gb/s
Direct over pNFS   | 1.776 Gb/s      | 2.552 Gb/s

Table 1: Panache (with pNFS) and pNFS read performance using the IOR benchmark. Clients read 20 files of 5GB each using 2 and 3 gateway nodes with gigabit Ethernet connecting to a 6-node remote cluster. Panache scales on both cache miss and cache hit. On cache miss, Panache incurs the overhead of passing data through the SAN, while on a cache hit it saturates the SAN.

Panache scales I/O performance by using multiple gateway nodes to read chunks of a single file in parallel from the multiple remote nodes over NFS/pNFS. One of the gateway nodes (based on the hash function) becomes the coordinator for a file. It, in turn, divides the requests among the other gateway nodes, which can proceed to read the data in parallel. Once a node is finished with its chunk it requests the coordinator for more chunks to read. When all the requested chunks have been read, the gateway node responds to the application node that the requested blocks of the object are now in cache. If the remote cluster file system does not support pNFS but does support NFS access to multiple servers, data can still be read in parallel. Given N gateway nodes at the cache cluster and M nodes exporting data at the remote cluster, a file can be read either in 1xM parallel streams (single gateway with pNFS), or min{N,M} 1x1 parallel streams (multiple gateway parallel reads with NFS), or NxM parallel streams (multiple gateway parallel reads with pNFS), as shown in Figure 4.

Figure 4: Multiple gateway node configurations. The top setup is a single pNFS client reading a file from multiple data servers in parallel. The middle setup is multiple gateway nodes acting as NFS clients reading parts of the file from the remote cluster's NFS servers. The bottom setup has multiple gateway nodes acting as pNFS clients reading parts of the file in parallel from multiple data servers.
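The coordinator and chunk hand-out scheme of Section 5.2 can be sketched as follows, assuming a fixed chunk size and synchronous per-chunk reads. The helper names are invented for the example, and unlike the sketch, the real system moves the fetched bytes through the shared SAN rather than through function return values.

# Illustrative sketch of multi-gateway parallel ingest (Section 5.2); names are hypothetical.
import threading, queue

def parallel_ingest(file_size, chunk_size, gateways, fetch_remote, write_cache):
    chunks = queue.Queue()
    for off in range(0, file_size, chunk_size):            # coordinator builds the work list
        chunks.put((off, min(chunk_size, file_size - off)))

    def worker(gw):
        while True:
            try:
                off, cnt = chunks.get_nowait()              # ask the coordinator for more work
            except queue.Empty:
                return
            write_cache(off, fetch_remote(gw, off, cnt))    # ingest the chunk into the cache

    threads = [threading.Thread(target=worker, args=(gw,)) for gw in gateways]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # Only after all chunks are in the cache does the coordinating gateway tell the
    # application node that the requested blocks are available.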
5.3 Namespace Caching

Panache provides a standard POSIX file system interface for applications. When an application traverses the namespace directory tree, Panache reflects the view of the corresponding tree at the remote cluster. For example, an "ls -R" done at the cache cluster presents the same list of entries as one done at the remote cluster. Note that Panache does not simply return the directory listing with dirents containing the <name, inode num> pairs from the remote cluster (as an NFS client would). Instead, Panache first creates the directory entries in the local cluster and then returns the cached name and inode number to the application. This is done to ensure application nodes can continue to traverse the directory tree if a network or server outage occurs. In addition, if the cache simply returns the remote inode numbers to the application, and later a file is created in the cache with that inode number, the application may observe different inode numbers for the same file.

One approach to returning consistent inode numbers to the application on a readdir (directory listing) or lookup and getattr, e.g., file stat, is by mandating that the remote cluster and the cache cluster mirror the same inode space. This can be impossible to implement where remote inode numbers conflict with inode numbers of reserved files, and it clearly limits the choice of the remote cluster file systems. A simple approach is to fetch the attributes of all the directory entries, i.e., an extra lookup across the network, and create the files locally on a readdir request. This approach of creating files on a directory access has an obvious performance penalty for directories with a large number of files.

To solve the performance problems with creates on a readdir and allow the cache cluster to operate with a separate inode space, we create only the directory entries in the local cluster and create placeholders for the actual files and directories. This is done by allocating, but not creating or using, inodes for the new entries. This allows us to satisfy the readdir request with locally allocated inode numbers without incurring the overhead of creating all the entries. These allocated, but not yet created, entries are termed orphans. On a subsequent lookup, the allocated inode is "filled" with the correct attributes and created on disk. Orphan inodes cause interesting problems on fsck, file deletes, and cache eviction and have to be handled separately in each case. Table 2 shows the performance (in secs) of reading a directory for 3 cases: i) where the files are created on a readdir, ii) when only orphan inodes are created, and iii) when the readdir is returned locally from the cache.

Files per dir | readdir & creates | readdir & orphan inodes | readdir from cache
100           | 1.952 (s)         | 0.77 (s)                | 0.032 (s)
1,000         | 3.122             | 1.26                    | 0.097
10,000        | 7.588             | 2.825                   | 0.15
100,000       | 451.76            | 25.45                   | 1.212

Table 2: Cache traversal with a readdir. Performance (in secs.) of a readdir on a cache miss where the individual files are created vs. the orphan inodes. The last column shows the performance of readdir on a cache hit.

5.4 Data and Attribute Revalidation

The data validity in the cache cluster is controlled by a revalidation timeout, in a manner similar to the NFS attribute timeout, whose value is determined by the desired validity lag of the workload. The cache cluster's inode stores both the local modification time mtime_local and inode change time ctime_local along with the remote mtime_remote and ctime_remote. When the object is accessed after the revalidation timeout has expired, the gateway node gets the remote object's time attributes and compares them with the stored values. A change in mtime_remote indicates that the object's data was modified, and a change in ctime_remote indicates that the object's inode was changed as the attributes or data was modified.¹ In case the remote cluster supports NFSv4 with delegations, some of this overhead can be removed by assuming the data is valid when there is an active delegation. However, every time the delegation is recalled, the cache falls back to timeout based revalidation.

During a network outage or remote server failure, the revalidation lag becomes indeterminate. By policy, either the requests are made blocking, where they wait till connectivity is restored, or all synchronous operations are handled locally by the cache cluster and no request is sent to the gateway node for remote execution.

¹ Currently we ignore the possibility that the mtime may not change on update. This may require content based signatures or a kernel supported change info to verify.
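A compact way to view the revalidation rule above is the sketch below, which trusts the cache within the validity lag and otherwise compares the stored remote timestamps against freshly fetched ones. It only illustrates the logic of Section 5.4; the attribute-fetch helper and the field names are assumptions.

# Illustrative sketch of the Section 5.4 revalidation check (names are hypothetical).
import time

def revalidate(entry, get_remote_attrs, validity_lag):
    if time.time() - entry.last_checked < validity_lag:
        return "valid"                                   # within the validity lag: trust the cache
    attrs = get_remote_attrs(entry.remote_fh)            # gateway fetches the remote time attributes
    entry.last_checked = time.time()
    if attrs.mtime != entry.mtime_remote:
        entry.mtime_remote, entry.ctime_remote = attrs.mtime, attrs.ctime
        return "data-changed"                            # remote data was modified; refetch needed
    if attrs.ctime != entry.ctime_remote:
        entry.ctime_remote = attrs.ctime
        return "attrs-changed"                           # only attributes changed remotely
    return "valid"

With NFSv4 delegations the check can be skipped entirely while a delegation is held, as noted above.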
gateway node we trivially get the time order require- If there is a single gateway node and all the requests ments satisfied for all commands on that object. Thus, are queued in FIFO order, then operations will execute Panache hashes on the object’s unique identifier, e.g., in- remotely in the same order as they did in the cache clus- ode number and generation number, to select a gateway ter. When multiple gateway nodes can push commands node on which to queue an object. It is dependent ob- to the remote cluster, the distributed multi-node queue jects queued on different gateway nodes that dis- has to be controlled to maintain the desired ordering. To tributed queue ordering a challenge. To further compli- better understand this, let’s first define some terms. cate the issue, some commands such as rename and link involve multiple objects. Definition 5 A pair of update commands To maintain the distributed time ordering among de- C (X), C (X), on an object X, executed at the i j pendent objects across multiple gateway node queues, cache cluster at time t < t are said to be time i j we build upon the GPFS distributed token management ordered , denoted by C → C , if they need to be i j infrastructure. This infrastructure currently coordinates executed in the same relative order at the remote cluster. access to shared objects such as inodes and byte-range For example, commands CREATE(File X) and locks and is explained in detail elsewhere [26]. Panache WRITE(File X, offset, length) are time ordered as the extends this distributed token infrastructure to coordi- data writes cannot be pushed to the remote cluster until nate execution of queued commands among multiple the file gets created. gateway nodes. The key idea is that an enqueued com- mand acquires a shared token on objects on which it Observation 1 If commands Ci, Cj , Ck are pair-wise operates. Prior to the execution of a command to the time ordered, i.e., Ci → Cj and Cj → Ck then the three remote cluster, it upgrades these tokens to exclusive, commands form a time ordered sequence Ci → Cj → which in turn forces a token revoke on the shared tokens Ck that are currently held by other commands on dependent objects on other gateway nodes. When a command re- Definition 6 A pair of objects Ox, Oy, are said to be ceives a token revoke, it then also upgrades its tokens to dependent objects if there exists queued commands Ci exclusive, which results in a chain reaction of token re- and Cj such that Ci(Ox) and Cj (Oy) are time ordered. vokes. Once a command acquires an exclusive token on its objects, it is executed and dequeued. This process re- For example, creating a file File and its parent di- X sults in all commandsbeing pushed out of the distributed rectory Dir make X and Y dependent objects as the Y queues in dependent order. parent directory create has to be pushed before the file The link and rename commands operate on multiple create. objects. Panache uses the hash function to queue these

6.2 Data Write Operations

On a write request, the application node first writes the data locally to the cache cluster and then sends a message to the designated gateway node to perform the write operation at the remote cluster. At a later time, the gateway node reads the data from the cache cluster and completes the remote write over pNFS.

The delayed nature of the queued write requests allows some optimizations that would not otherwise be possible if the requests had been synchronously serviced. One such optimization is write coalescing, which groups the write requests to match the optimal GPFS and NFS buffer sizes. The queue is also evaluated before requests are serviced to eliminate transient data updates, e.g., the creation and deletion of temporary files. All such "canceling" operations are purged without affecting the behavior of the remote cluster.

In case of remote cluster failures and network outages, all asynchronous operations can still update the cache cluster and return successfully to the application. The requests simply remain queued at the gateway nodes pending execution at the remote cluster. Any such failure, however, will affect the synchronization lag, making the consistency semantics fall back to a looser eventual consistency guarantee.

6.3 Discussion

Conflict Handling: Clearly, asynchronous updates can result in non-serializable executions and conflicting updates. For example, the same file may be created or updated by both the cache cluster and the remote cluster. Panache cannot prevent such conflicts, but it will detect them and resolve them based on simple policies. For example, one policy could have the cache cluster always override any conflict; another policy could move a copy of the conflicting file to a special ".conflicts" directory for manual inspection and intervention, similar to the lost+found directory generated on a normal file system check (fsck) scan. Further, it is possible to merge some types of conflicts without intervention. For example, a directory with two new files, one created by the cache and another by the remote system, can be merged to form the directory containing both files. Earlier research on conflict handling of disconnected operations in Coda [25] and Intermezzo has inspired some of the techniques used in Panache, after being suitably modified to handle a cluster setting.

Access control and authentication: One aspect of the caching system is that data is no more vulnerable to wrongful access than it was at the remote cluster. Panache requires userid mappings to make sure that file access permissions and ACLs set up at the remote cluster are enforced at the cache. Similarly, authentication via NFSv4's RPCSEC_GSS mechanism can be forwarded to the remote cluster to make sure end-to-end authentication can be enforced.

Recovery on Failure: The queue of pending updates can be lost due to memory pressure or a cache cluster node reboot. To avoid losing track of application updates, Panache stores sufficient persistent state to recreate the updates and synchronize the data with the remote cluster. The persistent state is stored in the inode on disk and relies on the GPFS fast inode scan to determine which inodes have been updated. Inode scans are very efficient as they can be done in parallel across multiple nodes and are basically a sequential read of the inode file. For example, in our test environment, a simple inode scan (with file attributes) on a single application node of 300K files took 2.24 seconds.

7 Evaluation

In this section we assess the performance of Panache as a scalable cache. We first use the IOR micro-benchmark [2] to analyze the amount of overhead Panache incurs along the data path to the remote cluster. We then use the mdtest micro-benchmark [4] to measure the overhead Panache incurs to queue and flush metadata operations on the gateway nodes. Finally, we run a parallel visualization application and a Hadoop application to analyze Panache with an HPC access pattern.

7.1 Experimental Setup

All experiments use a sixteen-node cluster connected via gigabit Ethernet, with each node assigned a different role depending on the experiment. Each node is equipped with dual 3 GHz Xeon processors and 4 GB of memory, and runs an experimental version of Linux 2.6.27 with pNFS. GPFS uses a 1 MB stripe size. All NFS experiments use 32 server threads and 512 KB wsize and rsize. All nodes have access to the SAN, which is comprised of a 16-port FC switch connected to a DS4800 storage controller with 12 LUNs configured for the cache cluster.

7.2 I/O Performance

Ideally, the design of Panache is such that it should match the storage subsystem throughput on a cache hit and saturate the network bandwidth on a cache miss (assuming that the network bandwidth is less than the disk bandwidth of the cache cluster).

In the first experiment, we measure the performance of reading separate 8 GB files in parallel from the remote cluster. Our local Panache cluster uses up to 5 application and gateway nodes, while the remote 5 node GPFS cluster has all nodes configured to be pNFS data servers. As we increase the number of application (client) nodes, the number of gateway nodes increases as well since the miss requests are evenly dispatched.



Aggregate Throughput MB/sec 100 Aggregate Throughput MB/sec 50 Aggregate Throughput MB/sec 100 0 0 0 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5

Clients Clients Clients (a) pNFS and NFSv4 (b) Panache Cache Miss (c) Panache Cache Hit vs. Standard GPFS

Figure 5: Aggregate Read Throughput. (a) pNFS and NFSv4 scale with available remote bandwidth. (b) Panache using pNFS and NFSv4 scales with available local bandwidth. (c) Panache local read performance matches standard GPFS.





Figure 6: Aggregate Write Throughput. (a) pNFS and NFSv4 scale with available disk bandwidth. (b) Panache local write performance matches standard GPFS, demonstrating the negligible overhead of queuing write messages on the gateway nodes.

Figure 5(a) displays how the underlying data transfer mechanisms used by Panache can scale with the available bandwidth. NFSv4 with a single server is limited to the bandwidth of the single remote server, while NFSv4 with multiple servers and pNFS can take advantage of all 5 available remote servers. With each NFSv4 client mounting a separate server, aggregate read throughput reaches a maximum of 516.49 MB/s with 5 clients. pNFS scales in a similar manner, reaching a maximum aggregate read throughput of 529.37 MB/s with 5 clients.

Figure 5(b) displays the aggregate read throughput of Panache utilizing pNFS and NFSv4 as its underlying transfer mechanism. The performance of Panache using NFSv4 with a single server is 5-10% less than standard NFSv4 performance. This performance hit comes from our Panache prototype, which does not fully pipeline the data between the application and gateway nodes. When Panache uses pNFS and NFSv4 with multiple servers, increasing the number of clients gives a maximum aggregate throughput of 247.16 MB/s due to a saturation of the storage network. A more robust SAN would shift the bottleneck back to the network between the local and remote clusters.

Finally, Figure 5(c) demonstrates that once a file is cached, Panache stays out of the I/O path, allowing the aggregate read throughput of Panache to match the aggregate read throughput of standard GPFS.

In the second experiment we increase the number of clients writing to separate 8 GB files. As shown in Figure 6(b), the aggregate write throughput of Panache matches the aggregate write throughput of standard GPFS. For Panache, writes are done locally to GPFS while a write request is queued on a gateway node for asynchronous execution to the remote cluster. This experiment demonstrates that the extra step of queuing the write request on the gateway node does not impact write performance. Therefore, application write throughput is not constrained by the network bandwidth or the number of pNFS data servers, but rather by the same constraints as standard GPFS.

Eventually, data written to the cache must be synchronized to the remote cluster. Depending on the capabilities of the remote cluster, Panache can use three I/O methods: standard NFSv4 to a single server, standard NFSv4 with each client mounting a separate remote server, and pNFS. Figure 6(a) displays the ag-





(a) File metadata ops. (b) Gateway metadata ops.

Figure 7: Metadata performance. Performance of mdtest benchmark for file creates. Each node creates 1000 files in parallel. In (a), we use a single gateway node. In (b), the number of application and gateway nodes are increased in unison, with each cluster node playing both application and gateway roles.

gregate write performance of writing separate 8 GB files to the remote cluster using these three I/O methods. Unsurprisingly, aggregate write throughput for standard NFSv4 with a single server remains flat. With

each NFSv4 client mounting a separate server, aggregate write throughput reaches a maximum of 413.77 MB/s with 5 clients. pNFS scales in a similar manner, reaching a maximum aggregate write throughput of 380.78 MB/s with 5 clients. Neither NFSv4 with multiple servers nor pNFS saturates the available network bandwidth, due to limitations in the disk subsystem.

It is important to note that although the performance Nodes (1000 files per node) of pNFS and NFSv4 with multiple servers appears on the surface to be similar, the lack of coordinated access Figure 8: Metadata flush performance. Performance of in NFSv4 creates several performance hurdles. For in- mdtest benchmark for file creates with flush. Each node flushes stance, if there are a greater number of gateway nodes 1000 files in parallel back to the home cluster. than remote servers, NFSv4 clients will not be evenly load balanced among the servers, creating possible hot number of nodes increases, we can saturate the single spots. pNFS avoids this by always balancing I/O re- gateway node. To see the impact of increasing the num- quests among the remote servers evenly. In addition, ber of gateway nodes, Figure 7(b) demonstrates the scale NFSv4 unaligned file writes across multiple servers can up when the number of application nodes and gateway create false sharing of data blocks, causing the cluster nodes increase in tandem, up to a maximum of 8 cache file system to lock and flush data unnecessarily. and remote nodes. As all updates are asynchronous, we also demonstrate 7.3 Metadata Performance the performance of flushing file creates to the remote To measure the metadata update performance in the cluster in Figure 8. By increasing the numberof gateway cache cluster we use the mdtest benchmark, which per- and remote nodes in tandem, we can scale the numberof forms file creates from multiple nodes in the cluster. Fig- creates per second from 400 to 2000, a five fold increase ure 7(a) shows the aggregate throughput of 1000 file for 7 additional nodes. The lack of linear increase is create operations per cluster node. With 4 application due to our prototype’s inefficient use of the GPFS token nodes simultaneously creating a total of 4000 files, the management service. Panache throughput (2574 ops/s) is roughly half that of the local GPFS (4370 ops/s) performance. The Panache 7.4 WAN Performance code path has the added overhead of first creating the To validate the effectiveness of Panache over a WAN file locally and then sending a RPC to queue the oper- we used the IOR parallel file read benchmark and the ation on a gateway node. As the graph shows, as the Linux tc command. The WAN represented the 30ms la- Panache Reads over WAN (ior) 7.6 MapReduce Application

700 Base GPFS (Local) Base NFS The MapReduce framework provides a programmable 600 Panache Miss (WAN) Panache Hit infrastructure to build highly parallel applications that 500 operate on large data sets [11]. Using this framework,

400 applications define a map function that defines a key and operates on a chunk of the data. The reduce function 300 aggregates the results for a given key. Developers may 200 write several MapReduce programs to extract different Aggregate Bandwidth MB/sec 100 properties from a single data set, building a use case 0 for remote caching. We use the MapReduce framework 1 2 4 6 8 from Hadoop 0.20.1 [6] and configured it to use Panache Nodes as the underlying distributed store (instead of the HDFS Figure 9: IOR file Reads over a WAN. The 8 node cache file system it uses by default). cluster and 8 node remote cluster are separated by a 30ms Table 4 presents the performance of Distributed Grep, latency link. Each file is 5GB in size. a canonical MapReduce example application, overa data set of 16 files, 500MB each, running in parallel across 8 nodes with the remote cluster also consisting of 8 nodes. tency link between the IBM San Jose and Tucson facili- The GPFS result was the baseline result where the data ties. The cache and remote clusters both contain 8 nodes, was already available in the local GPFS cluster. In the keeping the gateway and remote nodes in tandem. Fig- Panache miss case, as the distributed grep application ure 9 shows the aggregate bandwidth on both a hit and accessed the input files, the gateway nodes dynamically a miss for an increasing number of nodes in the cluster. ingested the data in parallel from the remote cluster. In The hit bandwidth matches that of a local GPFS read. the hit case, Panache revalidated the data every 15 secs For cache miss, while Panache can utilize parallel ingest with the remote cluster. This experiment validates our to increase performance initially, both Panache and NFS assertion that data can be dynamically cached and imme- eventually suffer from slow network bandwidth. diately available for parallel access from multiple nodes within the cluster.

7.5 Visualization for Cognitive Models Hadoop+GPFS Hadoop+Panache Local Miss LAN Miss WAN Hit This section evaluates Panache with a real supercomput- 81.6 (s) 113.1 (s) 140.6 (s) 86.5 (s) ing application that visualizes the 8x106 neural firings of a large scale cognitive model of a mouse brain [23]. The Table 4: MapReduce application. Distributed Grep using cognitive model runs at a remote cluster (a BlueGene/L the Hadoop framework over GPFS and Panache. The WAN results are over a 30ms latency link. system with 4096 nodes) and the visualization applica- tion runs at the cache cluster and creates a ”movie” as output. In the experiment in Table 3, we copied a frac- tion of the data (64 files of 200MB each) generated by 8 Related Work the cognitive model to our 5 node remote cluster and Distributed file systems have been an active area of re- ran the visualization application on the Panache cluster. search for almost two decades. NFS is among the most The application reads in the data and creates a movie widely-used distributed networked file systems. Other file of 250MB. Visualization is a CPU-bound operation, variants of NFS, Spritely NFS [28] and NQNFS [20] but asynchronous writes helped Panache reduce runtime added stronger consistency semantics to NFS by adding over pNFS by 14 percent. Once the data is cached, time server callbacks and leases. NFSv4 greatly enhances to regenerate the visualization files is reduced by an ad- wide-area access support, optimizes consistency support ditional 17.6 percent. via delegations, and improves compatibility with Win- dows. The latest revision, NFSv4.1, also adds parallel pNFS Panache (miss) Panache (hit) data access across a variety of clustered file and stor- 46.74 (s) 40.2 (s) 31.96 (s) age systems. In the non-Unix world, the Common In- Table 3: Supercomputing application. pNFS includes re- ternet File System (CIFS) protocol is used to allow MS- mote cluster reads and writes. Panache Miss reads from the Windows hosts to share data over the Internet. While remote and asynchronous write back. Panache Hit reads from these distributed file systems provide remote file access the cache and asynchronous write back. and some limited in-memory client caching they cannot operate across multiple nodes and in the presence of net- work and server failures. A number of research efforts have focused on build- Apart from NFS, another widely studied globally ing large scale distributed storage facilities using cus- distributed file system is AFS [17]. It provides tomized protocols and . The Bayou [12] close-to-open consistency, supports client-side persis- project introduced eventual consistency across repli- tent caching, and relies on client callbacks as the primary cas, an idea that we borrowed in Panache for converg- mechanism for cache revalidation. Later, Coda [25] and ing to a consistent state after failure. The Oceanstore Ficus [24] dealt with replication for better scalability project [19] used Byzantine agreement techniques to co- while focusing on disconnected operations for greater ordinate access between the primary replica and the sec- data availability in the event of a network partition. ondaries. The PRACTI replication framework [9] sep- More recently, the work on TierStore applies some of arated the flow of cache invalidation traffic from that the same principles for the development and deployment of data itself. Others like Farsite [8] enabled unreli- of applications in bandwidth challenged networks [13]. 
It defines Delay Tolerant Networking with a store-and-forward network overlay and a publish/subscribe-based multicast replication protocol. In limited bandwidth environments, LBFS takes a different approach, focusing on reducing bandwidth usage by eliminating cross-file similarities [22]. Panache can easily absorb some of its similarity techniques to reduce the data transferred to and from the cache.

A number of research efforts have focused on building large scale distributed storage facilities using customized protocols. The Bayou [12] project introduced eventual consistency across replicas, an idea that we borrowed in Panache for converging to a consistent state after failure. The OceanStore project [19] used Byzantine agreement techniques to coordinate access between the primary replica and the secondaries. The PRACTI replication framework [9] separated the flow of cache invalidation traffic from that of the data itself. Others, such as Farsite [8], enabled unreliable servers to combine their resources into a highly available and reliable file storage facility.

More recently, the success of file sharing on the Web, especially the widely studied BitTorrent [5], has triggered renewed efforts to apply similar ideas to building peer-to-peer storage systems. BitTorrent's chunk-based data retrieval method, which enables clients to fetch data in parallel from multiple remote sources, is similar to the implementation of parallel reads in Panache.

A plethora of commercial WAFS and WAN acceleration products provide caching for NFS and CIFS using custom devices and proprietary protocols [1]. Panache differs from these WAFS solutions in that it relies on standard protocols between the remote and cache sites. Muntz and Honeyman [21] looked at multi-level caching to solve scaling problems in distributed file systems but questioned its effectiveness. However, their observations may not hold today, as advances in network bandwidth, web-based applications, and the emerging trend of cloud stores have substantially increased remote collaboration. Furthermore, cooperative caching, both in the web and file system space, has been extensively studied [10]. The primary focus, however, has been to expand the available cache space by sharing data across sites to improve hit rates.

Lustre [3] and PanFS [29] are highly scalable object-based cluster file systems. These efforts have focused on improving file-serving performance and are not designed for remotely accessing data from existing file servers and NAS appliances over a WAN.

FS-Cache is a single-node caching file system layer for Linux that can be used to enhance the performance of a distributed file system such as NFS [18]. FS-Cache is not a standalone file system; instead, it is meant to work with the front and back file systems. Unlike Panache, it does not mimic the namespace of the remote file system and does not provide direct POSIX access to the cache. Moreover, FS-Cache is a single-node system and is not designed for multiple nodes of a cluster accessing the cache concurrently. Similar implementations, such as CacheFS, are available on other platforms such as Solaris and as a stackable file system with improved cache policies [27].

9 Conclusions

This paper introduced Panache, a scalable, high-performance, clustered file system cache that promises seamless access to massive and remote datasets. Panache supports a POSIX interface and employs a fully parallelizable design, enabling applications to saturate the available network and compute hardware. Panache can also mask fluctuating WAN latencies and outages by acting as a standalone file system under adverse conditions. We evaluated Panache using several data and metadata micro-benchmarks in local and wide area networks, demonstrating the scalability of using multiple gateway nodes to flush and ingest data from a remote cluster. We also demonstrated the benefits for both a visualization and an analytics application. As Panache achieves the performance of a clustered file system on a cache hit, large scale applications can leverage a clustered caching solution without paying the performance penalty of accessing remote data using out-of-band techniques.

References

[1] Blue Coat Systems, Inc. www.bluecoat.com.
[2] IOR Benchmark. sourceforge.net/projects/ior-sio.
[3] Lustre file system. www.lustre.org.
[4] Mdtest benchmark. sourceforge.net/projects/mdtest.
[5] BitTorrent. www.bittorrent.com.
[6] Hadoop Distributed Filesystem. hadoop.apache.org.
[7] Nirvanix Storage Delivery Network. www.nirvanix.com.
[8] A. Adya, W. J. Bolosky, M. Castro, G. Cermak, R. Chaiken, J. R. Douceur, J. Howell, J. R. Lorch, M. Theimer, and R. P. Wattenhofer. FARSITE: Federated, available, and reliable storage for an incompletely trusted environment. In Proc. of the 4th Symposium on Operating Systems Design and Implementation, 2002.
[9] N. Belaramani, M. Dahlin, L. Gao, A. Nayate, A. Venkataramani, P. Yalagandula, and J. Zheng. PRACTI replication. In Proc. of the 3rd USENIX Symposium on Networked Systems Design and Implementation, 2006.
[10] M. Dahlin, R. Wang, T. E. Anderson, and D. A. Patterson. Cooperative caching: Using remote client memory to improve file system performance. In Proc. of the 1st Symposium on Operating Systems Design and Implementation, 1994.
[11] J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. In Proc. of the 6th Symposium on Operating System Design and Implementation, 2004.
[12] A. J. Demers, K. Petersen, M. J. Spreitzer, D. B. Terry, M. M. Theimer, and B. B. Welch. The Bayou architecture: Support for data sharing among mobile users. In Proc. of the IEEE Workshop on Mobile Computing Systems & Applications, 1994.
[13] M. Demmer, B. Du, and E. Brewer. TierStore: A distributed filesystem for challenged networks in developing regions. In Proc. of the 6th USENIX Conference on File and Storage Technologies, 2008.
[14] S. Ghemawat, H. Gobioff, and S. Leung. The Google file system. In Proc. of the 19th ACM Symposium on Operating Systems Principles, 2003.
[15] A. Gulati, M. Naik, and R. Tewari. Nache: Design and Implementation of a Caching Proxy for NFSv4. In Proc. of the Fifth Conference on File and Storage Technologies, 2007.
[16] D. Hildebrand and P. Honeyman. Exporting storage systems in a scalable manner with pNFS. In Proc. of the 22nd IEEE/13th NASA Goddard Conference on Mass Storage Systems and Technologies, 2005.
[17] J. H. Howard, M. L. Kazar, S. G. Menees, D. A. Nichols, M. Satyanarayanan, R. N. Sidebotham, and M. J. West. Scale and performance in a distributed file system. ACM Trans. Comput. Syst., 6(1):51–81, 1988.
[18] D. Howells. FS-Cache: A Network Filesystem Caching Facility. In Proc. of the Linux Symposium, 2006.
[19] J. Kubiatowicz, D. Bindel, Y. Chen, P. Eaton, D. Geels, R. Gummadi, S. Rhea, H. Weatherspoon, W. Weimer, C. Wells, and B. Zhao. OceanStore: An architecture for global-scale persistent storage. In Proc. of the 9th International Conference on Architectural Support for Programming Languages and Operating Systems, 2000.
[20] R. Macklem. Not Quite NFS, soft cache consistency for NFS. In Proc. of the USENIX Winter Technical Conference, 1994.
[21] D. Muntz and P. Honeyman. Multi-level Caching in Distributed File Systems. In Proc. of the USENIX Winter Technical Conference, 1992.
[22] A. Muthitacharoen, B. Chen, and D. Mazières. A low-bandwidth network file system. In Proc. of the 18th ACM Symposium on Operating Systems Principles, 2001.
[23] A. Rajagopal and D. Modha. Anatomy of a cortical simulator. In Proc. of Supercomputing '07, 2007.
[24] P. Reiher, J. Heidemann, D. Ratner, G. Skinner, and G. Popek. Resolving file conflicts in the Ficus file system. In Proc. of the USENIX Summer Technical Conference, 1994.
[25] M. Satyanarayanan, J. J. Kistler, P. Kumar, M. E. Okasaki, E. H. Siegel, and D. C. Steere. Coda: A Highly Available File System for a Distributed Workstation Environment. IEEE Transactions on Computers, 39(4):447–459, 1990.
[26] F. Schmuck and R. Haskin. GPFS: A Shared-Disk File System for Large Computing Clusters. In Proc. of the First Conference on File and Storage Technologies, 2002.
[27] G. Sivathanu and E. Zadok. A Versatile Persistent Caching Framework for File Systems. Technical Report FSL-05-05, Stony Brook University, 2005.
[28] V. Srinivasan and J. Mogul. Spritely NFS: Experiments with cache-consistency protocols. In Proc. of the 12th Symposium on Operating Systems Principles, 1989.
[29] B. Welch, M. Unangst, Z. Abbasi, G. Gibson, B. Mueller, J. Small, J. Zelenka, and B. Zhou. Scalable Performance of the Panasas Parallel File System. In Proc. of the 6th Conference on File and Storage Technologies, 2008.