US008458239B2

(12) United States Patent    Ananthanarayanan et al.
(10) Patent No.: US 8,458,239 B2
(45) Date of Patent: Jun. 4, 2013

(54) DIRECTORY TRAVERSAL IN A SCALABLE MULTI-NODE CACHE FOR A REMOTE CLUSTER FILE SYSTEM

(75) Inventors: Rajagopal Ananthanarayanan, San Jose, CA (US); Marc Eshel, San Jose, CA (US); Roger L. Haskin, Morgan Hill, CA (US); Dean Hildebrand, San Jose, CA (US); Manoj P. Naik, San Jose, CA (US); Frank B. Schmuck, Campbell, CA (US); Renu Tewari, San Jose, CA (US)

(73) Assignee: International Business Machines Corporation, Armonk, NY (US)

(*) Notice: Subject to any disclaimer, the term of this patent is extended or adjusted under 35 U.S.C. 154(b) by 165 days.

(21) Appl. No.: 12/639,904

(22) Filed: Dec. 16, 2009

(65) Prior Publication Data: US 2011/0145307 A1, Jun. 16, 2011

(51) Int. Cl.: G06F 17/30 (2006.01)

(52) U.S. Cl.: USPC 707/828; 707/827; 707/737; 711/118

(58) Field of Classification Search: USPC 707/737, 827-828; 711/118. See application file for complete search history.

Primary Examiner: Joshua Bullock
(74) Attorney, Agent, or Firm: Kenneth L. Sherman, Esq.; Michael Zarrabian, Esq.; Sherman & Zarrabian LLP

(57) ABSTRACT

Directory traversal in a scalable multi-node file system cache for a remote cluster file system is provided. One implementation involves maintaining a scalable multi-node file system cache in a local cluster file system, and caching local file data in the cache by fetching file data on demand from a remote cluster file system into the cache over the network. The local file data corresponds to file data in the remote cluster file system. The local namespace emulates the remote namespace and is created on demand. Access to data further involves performing directory traversal for the cache for the remote cluster file system by creating a directory structure on demand.

9 Claims, 8 Drawing Sheets

FIG. 1 (drawing sheet): GPFS file system nodes connected over a switching fabric (system or storage area network) to shared disks (SAN-attached or network block device), with a connection to a remote file system cluster.



FIG. 2 (drawing sheet): applications on a local cache cluster are served by application nodes and I/O nodes; the I/O nodes act as pNFS clients and perform remote asynchronous coalesced writes over the WAN to a remote cluster file system and its pNFS metadata server.

FIG. 3 (flowchart):
Modify in-memory file state (open, close, lock, and unlock) locally in the local cached file system.
Fetch data on demand from the remote file system cluster into the local cached file system over the WAN.
A locally opened file is read from, or written to, the remote cluster by opening the file remotely prior to, and closing it after, performing the read/write, using the same name used when the file was opened locally.

FIG. 4 (flowchart):
Create cache file system in local file system cluster.
Mount and associate cache file system with the root of the remote cluster space.
For file access, application node looks up file in the local cache.
If the file is not in the local cache (cache miss), then application node requests selected I/O node to obtain it from the remote cluster; otherwise access file in local cache.
If cache miss, selected I/O node converts and forwards request to remote cluster.
I/O node creates local file in local cache, associates mapping between local copy and remote file, and provides file information obtained from remote cluster to the requesting application node.
On read request, if file in local cache file system is not complete or cache miss, then I/O node fetches needed file data from the remote cluster and application node stores fetched information in local cache file system.

FIG. 5 (flowchart):
On a write request, application node writes data to local cache cluster.
Application node signals selected I/O node to write that data to remote cluster.
I/O node queues the signaled write request to write to remote cluster.
I/O node returns acknowledgement to application node.
I/O node asynchronously reads data from local cache cluster and writes to the remote cluster.

FIG. 6 (flowchart):
Upon directory traversal request, create directory entries based on files in local cache cluster.
Directory entries are populated with pairs of names and inode numbers that are returned to the requesting application.
Provide the directory entries to the requesting application.
If remote cluster unavailable, application traverses local directory.

FIG. 7 (drawing sheet): NFS operations in flight queued between the local cache cluster and the remote cluster file system.

FIG. 8 (drawing sheet): computing system with cursor control and input devices and a communication interface 118, connected through a local network and the Internet to host, server, and client computers (PCs).
DIRECTORY TRAVERSAL IN A SCALABLE MULTI-NODE FILE SYSTEM CACHE FOR A REMOTE CLUSTER FILE SYSTEM

BACKGROUND

1. Field of the Invention

The present invention relates generally to data storage. In particular, the present invention relates to directory traversal in a scalable cache for cluster file systems.

2. Background

Information technology (IT) systems require sharing of large amounts of file data in a consistent, efficient, and reliable manner across a wide-area network (WAN). WAN data storage systems, including cluster file systems, need to scale in capacity and bandwidth to support a large number of client nodes. A cluster is a group of interconnected independent nodes working together as a single system. A cluster file system manages data stored within a cluster and provides client nodes with access to all files located on storage devices in the file system.

BRIEF SUMMARY

Directory traversal in a scalable multi-node file system cache for a remote cluster file system is provided. One embodiment comprises maintaining a scalable multi-node file system cache in a local cluster file system, and caching local file data in the cache by fetching file data on demand from a remote cluster file system into the cache over the network. The local file data corresponds to file data in the remote cluster file system. The local namespace emulates the remote namespace and is created on demand. Access to data further involves performing directory traversal for the cache for the remote cluster file system by creating a directory structure on demand.
Further, a system for directory traversal in a scalable multi-node file system cache for a remote cluster file system is provided. One embodiment comprises a cache subsystem including a cache configured for maintaining data for a local cluster file system including multiple computing nodes. The system further comprises a caching layer function configured for caching local file data in the cache by fetching file data on demand from a remote cluster file system into the cache over the network, wherein the local file data corresponds to file data in the remote cluster file system and the local namespace emulates the remote namespace. The caching layer function performs directory traversal of the cache for the remote cluster file system by creating a local directory structure on demand.

Further, a computer program product comprising computer usable storage including a computer readable program is provided. The computer program product comprises a computer readable storage medium having computer usable program code embodied therewith. The computer usable program code comprises computer usable program code configured to maintain a cache subsystem configured for maintaining data in a cache for a local cluster file system, and to provide a caching layer function for caching local file data in the cache by fetching file data on demand from the remote cluster file system into the cache over the network. The local file data corresponds to file data in the remote cluster file system, and the local namespace emulates the remote namespace. The caching layer function further performs directory traversal of the cache for the remote cluster file system by creating a directory structure on demand.

Other aspects and advantages of the present invention will become apparent from the following detailed description, which, when taken in conjunction with the drawings, illustrates by way of example the principles of the invention.

BRIEF DESCRIPTION OF THE SEVERAL VIEWS OF THE DRAWINGS

For a fuller understanding of the nature and advantages of the invention, as well as a preferred mode of use, reference should be made to the following detailed description read in conjunction with the accompanying drawings, in which:

FIG. 1 illustrates a block diagram of a system for caching of remote file data in a local cluster file system;
FIG. 2 illustrates a block diagram of a system for caching of remote file data in a cluster file system;
FIG. 3 illustrates an example open/close operation process;
FIG. 4 illustrates an example data read operation process;
FIG. 5 illustrates an example data write operation process;
FIG. 6 illustrates an example directory retrieval operation process;
FIG. 7 illustrates build-up of data access operation queues at gateway nodes of a local cached file system cluster; and
FIG. 8 illustrates a computing system in which scalable caching of remote file data in a cluster file system may be implemented.

DETAILED DESCRIPTION

The following description is made for the purpose of illustrating the general principles of the invention and is not meant to limit the inventive concepts claimed herein. Further, particular features described herein can be used in combination with other described features in each of the various possible combinations and permutations. Unless otherwise specifically defined herein, all terms are to be given their broadest possible interpretation, including meanings implied from the specification as well as meanings understood by those skilled in the art and/or as defined in dictionaries, treatises, etc.

The description may disclose several preferred embodiments for caching of remote file data in an information technology (IT) computing environment, including multiple computing resources, as well as operation and/or component parts thereof. While the following description is presented in terms of caching of remote file data for clarity and to place the invention in context, it should be kept in mind that the teachings herein may have broad application to all types of systems, devices, and applications.

A cluster file system that implements caching of remote file data is provided. The system caches data on demand while guaranteeing well-defined file system consistency semantics. A preferred embodiment provides a scalable cache architecture for a cache in a file system to cache remote file system data while providing the consistency semantics of a distributed file system. The scalable caching architecture enables the file system to cache remote file system data, wherein the cache can scale in capacity and bandwidth similar to a clustered file system. Further, such a cache can support a remote server file system from different vendors. The cached data can be exported and accessed transparently by a file system client for both read and write access. The cache utilizes open, standard protocols for over-the-wire file access. Further, the cache can significantly mask network latency and continue to function during network outages. Data is cached to provide the performance and scale of a cluster file system. The cache presents essentially the same semantics as the remote cluster file system, emulating the namespace of the remote cluster file system.

The invention further provides directory traversal for a scalable multi-node file system cache for a remote cluster file system. The scalable writable file system cache, which caches remote file data, supports directory traversals through the cache by creating the directory structure on demand without impacting performance. A directory listing consists of the name and inode information of each object. The inode information of an object is typically obtained only after the object is created. Directory entries are created in the cache on demand without actually creating the file objects in the file system. The directory entries mimic (i.e., emulate, simulate or closely represent) the entries in the remote file system while using the inode (e.g., data structure) information obtained from the cache file system. The actual inodes of the objects in the directory are not created when the directory entry listing them is created; the actual inode of an object is created at a later time, when the object is first accessed. Directory traversals are thereby enabled in a way that reduces or eliminates negative impact on client performance.
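To make the on-demand directory entry mechanism concrete, the following Python fragment is offered purely as an illustration; it is not the patented GPFS or cache-manager code, and the names (CacheDirectory, CacheEntry, allocate_inode, fetch_object) are hypothetical. It shows directory entries being populated with names and locally allocated inode numbers while the backing objects are materialized only on first access.

```python
# Illustrative sketch only (hypothetical names, not the patented implementation):
# directory entries are created on demand from the remote names, using inode
# numbers allocated by the cache file system; the actual file objects are
# created only when an entry is first accessed.
import itertools
from dataclasses import dataclass, field

@dataclass
class CacheEntry:
    name: str
    inode: int                   # inode number reported to the application
    materialized: bool = False   # True once the object exists in the cache

@dataclass
class CacheDirectory:
    path: str
    allocate_inode: callable
    entries: dict = field(default_factory=dict)

    def populate(self, remote_names):
        """Create name/inode entries without creating the file objects."""
        for name in remote_names:
            self.entries.setdefault(name, CacheEntry(name, self.allocate_inode()))

    def readdir(self):
        """Return (name, inode) pairs, exactly what a traversal needs."""
        return [(e.name, e.inode) for e in self.entries.values()]

    def open(self, name, fetch_object):
        """First access: only now is the object created in the cache."""
        entry = self.entries[name]
        if not entry.materialized:
            fetch_object(self.path, name)       # pull data/attributes on demand
            entry.materialized = True
        return entry

# A traversal sees the remote namespace even though nothing has been fetched yet.
inode_counter = itertools.count(1001)
d = CacheDirectory("/cache/projects", lambda: next(inode_counter))
d.populate(["a.txt", "b.txt"])
print(d.readdir())                        # [('a.txt', 1001), ('b.txt', 1002)]
d.open("a.txt", lambda path, name: None)  # object materialized on first open
```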
In an example implementation described below, the scalable caching architecture is integrated with a General Parallel File System (GPFS) clustered file system. The remote data is accessed over a Network File System (NFS) so that any remote server exporting data over NFS can be the caching target. For better performance, the cache can switch to parallel NFS (pNFS) for data transfer if the remote system exports the data using pNFS. The cache is visible to any file system client as a Portable Operating System Interface (POSIX) compliant file system; thus any file system client can browse the cache and access the data as if it were in a local file system. The cached data can be further exported via NFS to a remote NFS client.

Referring to FIG. 1, a GPFS parallel cluster file system 10 comprises a local cached file system cluster 15 that stores cached data, wherein pNFS is used to move the data between the cluster 15 and a remote cluster file system 18. GPFS is used for both file system clusters 15 and 18 to provide concurrent high-speed file access across multiple file system computing nodes of a cluster. The cached cluster includes the file system 15, computing nodes 11 (e.g., processors) that support the GPFS file system, and the applications 16 that use it. The nodes 11 are connected to storage media 12, such as shared disks or disk subsystems, over a switching fabric 13. A cache manager 14 maintains cached data in the storage media 12. All nodes 11 in the cluster have equal access to all disks 12. Files are striped across all disks 12 in the file system, where the number of disks 12 can range from tens to several thousand disks. In addition to balancing the load on the disks, striping achieves the full throughput of which a disk subsystem is capable by reading and writing the blocks in parallel.
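As a rough illustration of the striping just described, the sketch below places consecutive blocks of a file round-robin across the shared disks so that block reads and writes can proceed against many disks in parallel. It is a simplification for exposition only; actual GPFS block allocation, replication, and failure handling are far more involved, and the function name is hypothetical.

```python
# Simplified sketch: round-robin placement of a file's blocks across N shared
# disks (real GPFS allocation is considerably more sophisticated).
def stripe_blocks(file_size: int, block_size: int, num_disks: int):
    """Return a list mapping each block index to the disk that holds it."""
    num_blocks = (file_size + block_size - 1) // block_size
    return [(block, block % num_disks) for block in range(num_blocks)]

# A 10 MB file with 1 MB blocks over 4 disks: blocks land on disks 0,1,2,3,0,...
for block, disk in stripe_blocks(10 * 2**20, 2**20, 4):
    print(f"block {block} -> disk {disk}")
```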
The switching fabric 13 that connects file system nodes 11 to the shared disks 12 may comprise a storage area network (SAN), such as Fibre Channel or iSCSI. Alternatively, individual disks 12 may be attached to some number of I/O server nodes that allow access from file system nodes 11 through a software layer running over a general-purpose communication network, such as IBM Virtual Shared Disk (VSD). Regardless of how the shared disks 12 are implemented, GPFS assumes only a conventional block I/O interface with no particular intelligence at the disks. Parallel read-write disk accesses from multiple nodes 11 in the cluster 15 are synchronized to prevent corruption of both user data and file system metadata. The cluster 15 uses distributed locking to synchronize access to the shared disks 12, wherein distributed locking protocols ensure file system consistency regardless of the number of nodes 11 that simultaneously read from and write to a file system on the disks 12 on behalf of an application 16, while at the same time allowing data migration parallelism to achieve maximum throughput.

For remote accesses over a wide area network (WAN), pNFS clients access storage devices in a remote cluster file system in parallel. This is achieved by separating the data and metadata paths, and by moving the metadata server out of the data path. As a result, each pNFS client can leverage the full aggregate bandwidth of the cluster file system. Using pNFS, clients can query the metadata server to determine the layout of how files are distributed across data servers. Layouts are maintained internally by the metadata server. Based on the file layout, the client is able to directly access the data servers in parallel. A pNFS client communicates with the data servers using a variety of storage access protocols, including NFSv4 and iSCSI/Fibre Channel. The pNFS specification allows for the addition of new layout distributions and storage access protocols, in addition to flexibility in the implementation of the back-end storage system.

A more detailed description of the cached file system follows.

FIG. 2 shows a functional block diagram of an IT system 20. The IT system 20 includes a local cache cluster 21 and a remote cluster file system 22. Every computing node 23 in the cache cluster 21 has the same access to data cached by a local cache file system 24 of the local cache cluster 21. However, only some of the nodes 23 (i.e., I/O nodes 23A) may have the hardware and software support for remote network connectivity. The I/O nodes 23A act as pNFS clients to fetch the data from the remote cluster 22. The remote cluster 22 may store the data in any POSIX-compliant cluster file system that is exported via pNFS (NFS may also be used, but with a performance penalty). The I/O nodes 23A are responsible for reading the data from the remote cluster 22 and writing it to the local cache file system 24, and vice versa. The other nodes of the cluster (i.e., application nodes 23B) service the file access requests of applications 16 by reading and writing the cached data in the local cache file system (i.e., cache) 24. The remote cluster file system 22 similarly includes I/O nodes 25A and application nodes 25B.

The application nodes 23B of the local cache cluster 21 are also responsible for forwarding access requests by requesting applications 16 to the I/O nodes 23A (i.e., writes to be synchronized with a remote server node 25A of the remote cluster 22, and reads to be fetched from the remote server 25A on a local cache miss).

The split between I/O and application nodes 23A, 23B in the local system 21 is conceptual, and any node 23 in the local cluster 21 can function as either an I/O node or an application node based on its configuration. The I/O nodes 23A can be viewed as the edge of the cluster cloud that can communicate with the remote cluster 22, while the application nodes 23B interface with the applications.

To access the remote data consistently, the system 20 associates a cache state with every object in the local cache file system 24, wherein the cache state includes the NFS file handle and inode (e.g., data structure) attributes of the corresponding object in the remote file system 26. As multiple nodes 23 in the local system 21 can be accessing the cached data in the local cached file system 24, the accesses may be serialized by standard GPFS distributed lock management, with one of the nodes 23 being the token manager and issuing read and write tokens. The data can be concurrently written at the remote file system 26 of the remote cluster 22 and at the local cache file system 24 of the local cache cluster 21. Between the remote cluster 22 and the local cache cluster 21, the system 20 supports the well-known close-to-open consistency guarantees provided by NFS. To reduce frequent checking of cached attributes against the remote file system 26, the I/O nodes 23A leverage the read and write delegation support of NFSv4. With delegations, the pNFS server 25A of the remote cluster 22 can transfer the ownership of a file to the local cache cluster 21, so that the local cache cluster 21 can safely assume that the data is valid and service local requests.
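The per-object cache state and the delegation-assisted revalidation described above can be sketched as follows. This is an illustrative Python fragment, not the patented implementation; RemoteState, data_is_valid, and fetch_remote_attrs are hypothetical names, and real NFSv4 delegation handling involves recall and state management that is omitted here.

```python
# Sketch (hypothetical names): per-object cache state pairing a cached object
# with the remote NFS file handle and last-seen remote attributes, plus an
# attribute check used for close-to-open style revalidation. A held NFSv4
# delegation lets the cache skip the remote attribute check entirely.
from dataclasses import dataclass

@dataclass
class RemoteState:
    nfs_file_handle: bytes
    mtime: float               # remote modification time last observed
    size: int
    delegated: bool = False    # read/write delegation held by the cache cluster

def data_is_valid(remote: RemoteState, fetch_remote_attrs) -> bool:
    """Decide whether cached data can be served without a remote round trip."""
    if remote.delegated:
        return True                       # ownership transferred to the cache
    fresh_mtime, fresh_size = fetch_remote_attrs(remote.nfs_file_handle)
    return (fresh_mtime, fresh_size) == (remote.mtime, remote.size)

# Example: with a delegation the check is purely local.
state = RemoteState(b"\x01\x02", mtime=1700000000.0, size=4096, delegated=True)
print(data_is_valid(state, fetch_remote_attrs=lambda fh: (0.0, 0)))   # True
```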
A cache manager 27 integrated into the local cache file system 24 intercepts the application file access requests, wherein the applications simply experience the local cache file system 24 as a traditional GPFS file system. The cache manager 27 of the local cluster 21 mimics the same namespace as the remote cluster 22. Thus, browsing through the cache cluster 21 will show the same listing of directories and files as the remote cluster 22. The caching function can be further exported via NFS to enable access by NFS clients. Example file system operations are now described, including Open/Close operations, Data Read operations, Data Write operations, and Directory Traversal operations.

Open/Close Operations

FIG. 3 illustrates an example open/close operation process 30. File operations that modify in-memory file state (e.g., open, close, lock, and unlock) are performed by the cache manager 27 locally in the local cached file system 24 without consulting the remote cluster 22 (block 31). The cache manager 27 functions as a file system, with the remote cluster 22 being primarily used as a source of data that is fetched on demand into the local cached file system 24 from the remote cluster 22 over the WAN (block 32). When a locally opened file needs to be read from, or written to, the remote server 25A of the remote cluster 22, an I/O node 23A opens the file remotely prior to, and closes it after, performing the read/write, using the same name used when the file was opened locally (block 33).

Files are typically read and written in their entirety using whole-file caching and write coalescing, respectively. In a typical usage scenario, intersite conflicts are expected to be minimal, and continued operation is required in the face of WAN outages (similar to an NFSv4 client's handling of its file state in the presence of delegations).

The system allows disconnected operation, wherein user processes (applications) supported by the local cluster 21 may continue to function in the absence of network connectivity to the remote cluster 22.
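A minimal sketch of this open/close behavior is shown below, assuming hypothetical LocalFileState and remote-session helpers (none of these names come from the patent). Local opens and closes touch only in-memory state, while a remote open/close brackets the actual remote read and reuses the name under which the file was opened locally.

```python
# Illustrative sketch only: open/close/lock/unlock are local in-memory
# operations; a remote open/close brackets the actual remote I/O and reuses
# the same name used for the local open.
from contextlib import contextmanager

class LocalFileState:
    def __init__(self):
        self.open_files = {}                 # name -> local open count

    def open(self, name):                    # purely local, no remote round trip
        self.open_files[name] = self.open_files.get(name, 0) + 1

    def close(self, name):                   # purely local as well
        self.open_files[name] -= 1

@contextmanager
def remote_session(remote, name):
    """Open the file remotely just before remote I/O and close it right after."""
    handle = remote.open(name)
    try:
        yield handle
    finally:
        remote.close(handle)

def remote_read(local, remote, name, offset, length):
    assert name in local.open_files, "file must already be open locally"
    with remote_session(remote, name) as handle:   # same name as the local open
        return remote.read(handle, offset, length)

class FakeRemote:                                  # stand-in for the remote cluster
    def open(self, name): return name
    def close(self, handle): pass
    def read(self, handle, offset, length): return b"x" * length

local = LocalFileState()
local.open("a.txt")
print(remote_read(local, FakeRemote(), "a.txt", 0, 4))   # b'xxxx'
local.close("a.txt")
```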
Data Read Operations

FIG. 4 illustrates an example data read operation process 40. The cache file system 24 is initially created in the local cluster 21 and contains no data (block 41). When mounted, the cache file system 24 is associated with the root of the remote cluster's exported space using a mount operation (block 42). When an object such as a file is first accessed by an application node 23B of the local cluster 21, as a result of a user-invoked open on a file or a directory (i.e., via a GPFS lookup request), the cache manager 27 performs a virtual file system (VFS) lookup or read directory (i.e., readdir) operation in the local cache 24 (block 43). If the object is not found in the local cache file system (as would be the case on an initialized cache file system), the application node 23B requests a selected I/O node 23A to service the request from the remote cluster 22 (block 44). The selection of the I/O node 23A is based on a hashing function that ensures that requests for an object are always sent to the same I/O node 23A.

The selected I/O node 23A converts the GPFS lookup request to an NFS LOOKUP request and forwards it to the remote cluster 22 to obtain the file information (block 45). On success in obtaining the file information from the remote cluster 22, the I/O node 23A creates the object in the local file system cache 24 via the cache manager 27 and associates a mapping between the local GPFS inode and the remote (or home) cluster state (the cache is local, but it contains information about the remote object, such as object modification times, a unique identifier, etc.). The I/O node 23A provides the obtained file handle and attributes of the object to the application node 23B and returns success status back to the application node 23B (block 46). In effect, once the lookup operation completes successfully, the object will have been created in the local cache file system 24 but will not contain any data. The state associated with a cached object indicates whether the object is incomplete or empty.

On an application read request, in block 47 the application node 23B first checks with the cache manager 27 to determine if the object exists in the local cache file system 24. If the object exists but is empty or incomplete (i.e., a cache miss), the application node 23B requests the designated I/O node 23A to fetch the data from the remote cluster 22. The I/O node 23A, based on a prefetch policy, fetches the entire file or only the requested bytes from the remote cluster 22 (via pNFS over the WAN) and writes the fetched information to the local cache file system via the cache manager 27. If only a portion of the file (object) was retrieved from the remote cluster 22, the rest of the file may be prefetched asynchronously after the application request is completed.

The system 20 supports both whole-file and partial-file caching (segments comprising a set of contiguous blocks). The application node 23B, when notified of completion, reads the requested bytes from the local cache 24 via the cache manager 27 and returns them to the requesting application 16 as if the requested data file (i.e., object) had been present in the local cluster 21 all along. It should be noted that the I/O and application nodes 23A, 23B only exchange request and response messages, while the actual data is accessed locally by the cache manager 27 via the shared disks 12. Thereafter, if the previously requested file (object) is read again, the application node 23B checks via the cache manager 27 whether the complete valid object exists in the local cache 24. On a cache hit, the application node 23B can itself service the file read request from the local cache 24 via the cache manager 27. The system 20 uses file and directory attribute checking, performed by an NFS client at the I/O node 23A, to guarantee close-to-open consistency of the data in the local cache 24 of the local cluster 21 with the file system 26 of the remote cluster 22. All the "read" class of requests, which include lookup, get attribute (getattr), and read, follow a similar data flow. These requests can be considered synchronous on a cache miss, because the application is blocked waiting for the response back from the I/O node 23A.
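The read path can be summarized with the short sketch below. It is illustrative only: select_io_node, read_through_cache, and fetch_from_remote are hypothetical names, and the whole-file fetch stands in for whatever prefetch policy the I/O node applies. The hash of the object identifier is what keeps all requests for one object on the same I/O node.

```python
# Sketch (hypothetical names): requests for an object always hash to the same
# I/O node; a cache miss triggers an on-demand fetch (whole file here, standing
# in for the prefetch policy), after which reads are served from the cache.
import hashlib

def select_io_node(object_id: str, io_nodes: list) -> str:
    """Hash the object identifier so one object always maps to one I/O node."""
    digest = hashlib.sha1(object_id.encode()).digest()
    return io_nodes[int.from_bytes(digest[:4], "big") % len(io_nodes)]

def read_through_cache(object_id, offset, length, cache, io_nodes, fetch_from_remote):
    """Serve from the local cache; on a miss, ask the designated I/O node."""
    data = cache.get(object_id)
    if data is None or len(data) < offset + length:            # empty or incomplete
        node = select_io_node(object_id, io_nodes)
        cache[object_id] = fetch_from_remote(node, object_id)  # on-demand fetch
    return cache[object_id][offset:offset + length]

# Example with an in-memory cache and a stubbed remote fetch.
cache = {}
nodes = ["io-node-1", "io-node-2", "io-node-3"]
stub_fetch = lambda node, oid: bytes(100)   # pretend the remote returned 100 bytes
print(read_through_cache("/data/report.csv", 10, 20, cache, nodes, stub_fetch))
print(select_io_node("/data/report.csv", nodes))   # always the same node
```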
Data Write Operations

FIG. 5 illustrates an example data write operation process 50. On a write request, the application node 23B first writes the data to the local cache 24 via the cache manager 27 (block 51), and then sends a message to the designated I/O node 23A to perform the write operation at the remote cluster 22 (block 52). The I/O node 23A queues the request (block 53) and returns an acknowledgement immediately, allowing the requesting application 16 (e.g., a user process) to complete (block 54). At a later time, the I/O node 23A reads the data from the local cache 24 and completes the remote write asynchronously (block 55). Reading from the local cache provides better throughput and lower latency because the data is locally resident, avoiding both the network latency and the limited bandwidth of the WAN. A remote write operation is sent as an NFS operation to the remote (or home) cluster, where the data is written to the home cluster file system. This is performed for all "write" class of requests, such as create, make directory (mkdir), write, and unlink. Since data-modifying operations are performed asynchronously, optimizations such as write coalescing and elimination of transient creates may be performed.
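The asynchronous write-back just described is sketched below, with simple coalescing of repeated writes to the same file. All names (IONodeWriteBack, queue_write, the in-memory stand-ins for the local cache and remote cluster) are hypothetical; the real system queues NFS operations at the I/O (gateway) nodes rather than file paths.

```python
# Sketch only (hypothetical names): the application node writes locally and
# queues a write-back message; the I/O node acknowledges immediately and later
# reads the data from the local cache and writes it to the remote cluster
# asynchronously, coalescing repeated writes to the same file.
import threading
import time

class IONodeWriteBack:
    def __init__(self, read_local, write_remote, interval=0.1):
        self.read_local = read_local       # read latest data from the local cache
        self.write_remote = write_remote   # push data to the remote cluster
        self.dirty = set()
        self.lock = threading.Lock()
        self.interval = interval
        threading.Thread(target=self._flush_loop, daemon=True).start()

    def queue_write(self, path):
        """Called by an application node; acknowledged immediately."""
        with self.lock:
            self.dirty.add(path)           # repeated writes coalesce into one entry
        return "ack"

    def _flush_loop(self):
        while True:
            time.sleep(self.interval)
            with self.lock:
                batch, self.dirty = self.dirty, set()
            for path in batch:             # asynchronous remote writes
                self.write_remote(path, self.read_local(path))

# Example with an in-memory "local cache" and a print-only "remote cluster".
local_cache = {"/cache/log.txt": b"hello"}
io_node = IONodeWriteBack(local_cache.get,
                          lambda path, data: print("remote write:", path, data))
print(io_node.queue_write("/cache/log.txt"))   # 'ack' returned right away
time.sleep(0.3)                                # give the background flush time to run
```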
The system can mask the fluctuating wide-area-net system 20 does not simply return the directory listing (con work (WAN) latencies and outages by Supporting asynchro taining the