RechenZentrum Garching of the Max Planck Society

High Performance AFS

Hartmut Reuter [email protected]

• Supercomputing environment at RZG

• Why AFS is slow compared to NFS and SAN-filesystems
• Direct I/O from the client to the fileserver partition
• Implementation in MR-AFS and OpenAFS
• Performance measurements and results

• „fs import“

• MR-AFS and Castor

Geneva, February 5, 2004

RZG

RZG is the supercomputing center of the Max Planck Society in Germany

It also acts as the local computing center for a number of Max Planck institutes located at Garching, especially for IPP (Institut für Plasmaphysik)

The local AFS-cell therefore historically has the name ipp-garching.mpg.de

Using MR-AFS, this AFS cell also provides archival space for the MPG


MPI for Polymer Research, Mainz

Multiscale model of bisphenol-A polycarbonate (BPA-PC) on nickel

(a) The coarse grained representation of a BPA-PC segment

(b) Coarse grained model of a N=20 BPA-PC molecule

(c) Phenol adsorbed on the bridge site of a (111) nickel surface

Code: CPMD


MPI for Astrophysics Garching

Core-collapse supernova simulation: snapshots of the hydrodynamic evolution of a rotating massive star, 0.25 s after the start of the explosion

Code: Rady/2D


MPI for Metals Research, Stuttgart

Large-scale atomistic study of the inertia properties of mode I cracks.

A crack propagating at several kilometers per second is suddenly brought to rest.


MPI for Plasma Physics, Garching and Greifswald

Simulation of the time development of the turbulent radial heat flux

Code: TORB


4 Decades of Supercomputing Tradition at RZG

1962: IBM 7090 (0.1 Mflop/s, 128 kB RAM)
1969: IBM 360/91 (15 Mflop/s, 2 MB RAM)
1979: Cray-1 (80 Mflop/s, 8 MB RAM)
1998: Cray T3E/816 (0.47 TFlop/s, 104 GB RAM)
2002/2003: IBM p690 (4 TFlop/s, 2 TB RAM)


IBM p690: 4 TFlop/s, 2 TB RAM

24 compute nodes and 2 I/O nodes; each node has 32 Power4 processors.
Compute nodes have 64 GB of memory each, the two I/O nodes 256 GB each.
22 TB of FC disks and 5 TB of SSA disks are attached to the I/O nodes.
Federation switch: measured throughput 4.4 GB/s bidirectional between 2 nodes, measured latency 12 µs.


AFS is too slow on the Regatta cluster

For large files AFS is much slower than GPFS on the Regatta cluster:
+ GPFS stripes data over multiple nodes.
- AFS exchanges data with a single fileserver.
- With AFS all data go through the AFS cache.
- AFS is also slower than NFS for protocol reasons.


Why AFS is slow compared to NFS

• Disk caches on local disks are slower than the network.

• write() sleeps while data are transferred to the server.

• Unnecessary read RPCs before a chunk is written.
• Memory mapping of cache files breaks large I/O down into hundreds of requests.
• The Rx protocol is considered sub-optimal.


How to make AFS faster for large files

• Use fastest filesystem for /vicep-partition on server

– On Regatta cluster use GPFS

• Bypass AFS caching on the client by direct I/O to the fileserver's /vicep-partitions.

– helps on all fileserver machines for files in volumes stored there

– helps in clusters if the /vicep-partitions are mounted cluster-wide.

– requires modifications in the client and server code.

– Should be done only on trusted hosts


Writing a new file to AFS

1) create_file RPC
2) write chunks into the cache

This process is interrupted and followed by store_data RPCs, each one doing:
3) read from the cache
4) transfer over the network
5) write to /vicepa

(Figure: client with local cache on one side, fileserver with /vicepa on the other.)


Writing a file directly to the AFS server partition

1) create file
2) check meta-data, permissions, and quota and return the file's path in /vicepa
3) write the file into /vicepa
4) update meta-data on the server

(Figure: the client writes directly into /vicepa on the fileserver.)


Design of direct I/O to /vicep-partitions

• Fileservers owning /vicep-partitions are identified by a sysid file in the partition.
• afsd with option "-vicepaccess" informs the AFS kernel extension (new subcall).
• Volumes with instances on fileservers with visible partitions are flagged.
• Open of files in these volumes first tries a new RPC to get path information from the fileserver.

– If that or the open of the vnode/dentry fails, open resumes in the old way.
• I/O is done directly using the opened vnode/dentry.
• Close for write informs the fileserver about the new file length.


Implementation of direct I/O in MR-AFS and OpenAFS

• Why MR-AFS

– because RZG runs only MR-AFS fileservers

– because existing ResidencyCmd RPC could be used without changing afsint.xg

– because MR-AFS has large file support
• Which version of OpenAFS?

– the CVS version from July 2003, where my last patches regarding the AIX 5.2 port had been committed


MR-AFS server modifications

• partition.c copies /usr/afs/local/sysid into all active /vicep-partitions.
• In src/viced/afsfileprocs.c:

– For direct read and write, new RPC subcalls of SAFS_ResidencyCmd were implemented which return the path in the /vicep-partition as a string (see the sketch after this list).

– They need the same checks as all flavours of SAFS_StoreData and SAFS_FetchData.

– Therefore the common code was put into generic routines StoreData() and FetchData().

– In the long run new RPCs SAFS_DirectStore and SAFS_DirectFetch should be implemented, also in the OpenAFS fileserver.
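The following is a conceptual userspace model of this server-side idea, not actual MR-AFS code: check_access_and_quota(), residencycmd_get_path(), struct model_fid, and the generated path are all illustrative stand-ins. It only shows the shape of the new subcall: run the checks shared with StoreData()/FetchData(), then hand back the partition-relative path as a string.

    /* A conceptual userspace model, not MR-AFS code: the subcall runs the
     * checks shared with StoreData()/FetchData() (a single stub here) and
     * returns the file's path inside the /vicep-partition as a string.
     * All names and the generated path are illustrative. */
    #include <stdio.h>

    struct model_fid { unsigned int volume, vnode, unique; };

    /* stub for the permission/quota checks shared with StoreData()/FetchData() */
    static int check_access_and_quota(const struct model_fid *fid, int writing)
    {
        (void)fid; (void)writing;
        return 0;                         /* 0 = OK */
    }

    /* model of the new ResidencyCmd subcall: fill in the path string */
    static int residencycmd_get_path(const struct model_fid *fid, int writing,
                                     char *path, size_t len)
    {
        int code = check_access_and_quota(fid, writing);
        if (code)
            return code;
        /* a real server derives this from the namei layout; this is fake */
        snprintf(path, len, "/vicepa/AFSIDat/%u.%u.%u",
                 fid->volume, fid->vnode, fid->unique);
        return 0;
    }

    int main(void)
    {
        struct model_fid f = { 536870918, 1, 1 };
        char path[256];
        if (residencycmd_get_path(&f, 1, path, sizeof(path)) == 0)
            printf("client may open %s directly\n", path);
        return 0;
    }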


Open() on the client

src/afs/VNOPS/afs_vnop_open.c, after everything else is done:

1) Does the volume's server have a visible /vicep-partition? If not: done.
2) GetServerPath RPC to the fileserver. If it does not succeed: done.
3) Open the file in the vicep-partition. If that does not succeed: done.
4) Save the dentry pointer in the vcache.


OpenAFS client modification for open()

• /vicep-partitions are scanned for sysid files. The UUIDs found there are handed over to the kernel (afsd.c, afs_call.c).
• Some additional flag bits in some structs make it possible to identify the AFS files which might be read or written directly in a visible /vicep-partition.
• If these flag bits are set, the open vnode operation tries, after everything else is done, to get the path information from the fileserver using the new RPC (afs_vnop_open.c); the flow is the same as on the previous slide and is sketched in code below.

– If the RPC succeeds, the file's vnode/dentry is looked up and the pointer is stored in the vcache struct.

– If the RPC or the lookup of the file's path fails, the old way of open is resumed.
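A minimal userspace model of this open() logic; rpc_get_server_path(), open_in_partition(), open_old_way(), and struct vcache_model are illustrative stand-ins, not OpenAFS symbols. The important property is the fall-through: any failure along the direct path ends in the unchanged cached open.

    /* A userspace model of the modified open() path; all names are
     * illustrative stand-ins, not OpenAFS symbols. */
    #include <stdio.h>

    struct vcache_model {
        int direct_fd;                 /* stands in for the saved dentry pointer */
    };

    /* stub for the new GetServerPath RPC */
    static int rpc_get_server_path(char *path, size_t len)
    {
        snprintf(path, len, "/vicepa/AFSIDat/placeholder");
        return 0;                      /* pretend the fileserver answered */
    }

    /* stub for looking up / opening the file inside the /vicep-partition */
    static int open_in_partition(const char *path)
    {
        (void)path;
        return -1;                     /* pretend the local lookup failed */
    }

    static int open_old_way(void)
    {
        puts("falling back to the normal cached open");
        return 0;
    }

    int model_afs_open(struct vcache_model *vc, int volume_visible)
    {
        char path[256];

        if (volume_visible && rpc_get_server_path(path, sizeof(path)) == 0) {
            int fd = open_in_partition(path);
            if (fd >= 0) {
                vc->direct_fd = fd;    /* remember it for later read()/write() */
                return 0;
            }
        }
        vc->direct_fd = -1;            /* RPC or lookup failed: old open path */
        return open_old_way();
    }

    int main(void)
    {
        struct vcache_model vc = { -1 };
        return model_afs_open(&vc, 1);
    }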


write() on the client

src/afs/LINUX/osi_vnodeops.c, before anything else is done:

• Whenever a pointer to a vnode/dentry is available in struct vcache, it is used to do the I/O directly, bypassing the AFS cache and the RPCs to the fileserver:

– exchange the dentry pointer in struct file,
– call generic_file_write(),
– restore the dentry pointer in struct file.
• If this fails, the old way is used instead.
(A userspace analogue of this swap-call-restore pattern is sketched below.)
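A userspace analogue of the swap-call-restore pattern, with illustrative names only: struct file_model stands in for struct file, generic_write() for generic_file_write(). read() on the next slide uses the same pattern with generic_file_read().

    /* Userspace analogue of the write() trick described above: the file
     * object is temporarily pointed at the directly opened /vicep file,
     * the generic write routine is called, and the original pointer is
     * restored.  Names are illustrative, not OpenAFS symbols. */
    #include <stdio.h>
    #include <unistd.h>

    struct file_model {
        int backing_fd;        /* stands in for the dentry in struct file */
    };

    /* stands in for generic_file_write(): writes through whatever
     * backing object the file currently points to */
    static ssize_t generic_write(struct file_model *fp, const void *buf, size_t n)
    {
        return write(fp->backing_fd, buf, n);
    }

    static ssize_t old_cached_write(const void *buf, size_t n)
    {
        (void)buf;
        return (ssize_t)n;     /* pretend the normal cache path succeeded */
    }

    ssize_t model_afs_write(struct file_model *fp, int direct_fd,
                            const void *buf, size_t n)
    {
        if (direct_fd >= 0) {                  /* vcache has a dentry pointer */
            int saved = fp->backing_fd;        /* exchange the pointer ...    */
            fp->backing_fd = direct_fd;
            ssize_t r = generic_write(fp, buf, n);
            fp->backing_fd = saved;            /* ... and restore it          */
            if (r >= 0)
                return r;                      /* cache and RPCs were bypassed */
        }
        return old_cached_write(buf, n);       /* failure: do it the old way  */
    }

    int main(void)
    {
        struct file_model f = { -1 };
        const char msg[] = "direct write\n";
        return model_afs_write(&f, STDOUT_FILENO, msg, sizeof(msg) - 1) < 0;
    }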


read() on the client

src/afs/LINUX/osi_vnodeops.c, before anything else is done:

• Whenever a pointer to a vnode/dentry is available in struct vcache, it is used to do the I/O directly, bypassing the AFS cache and the RPCs to the fileserver:

– exchange the dentry pointer in struct file,
– call generic_file_read(),
– restore the dentry pointer in struct file.
• If this fails, the old way is used instead.


close() on the client

After everything else is done (src/afs/VNOPS/afs_vnop_write.c, src/afs/afs_segments.c):

• Close for write triggers a dummy StoreData64 RPC (SAFS_StoreData without data) to update the meta-data: storemini() does the StoreData RPC which updates the file length and modification time in the AFS vnode of the file.
• Any close() releases the vnode/dentry (dput(dentry pointer)) and clears the field in the vcache.
(A sketch of this close handling follows.)
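A userspace model of this close() handling; the RPC and the dentry release are stubs with illustrative names, not OpenAFS symbols.

    /* Userspace model of the close() handling described above. */
    #include <stdio.h>
    #include <unistd.h>

    struct vcache_model {
        int direct_fd;         /* stands in for the saved vnode/dentry pointer */
    };

    /* stub for the dummy StoreData64 RPC that carries only the new
     * file length and modification time, no data */
    static void rpc_store_data_dummy(long long new_length)
    {
        printf("StoreData64 (no data): new length %lld\n", new_length);
    }

    void model_afs_close(struct vcache_model *vc, int was_open_for_write,
                         long long new_length)
    {
        if (was_open_for_write)
            rpc_store_data_dummy(new_length);   /* update size + mtime on server */
        if (vc->direct_fd >= 0) {
            close(vc->direct_fd);               /* dput() in the real client */
            vc->direct_fd = -1;                 /* clear the field in the vcache */
        }
    }

    int main(void)
    {
        struct vcache_model vc = { -1 };
        model_afs_close(&vc, 1, 1000000000LL);
        return 0;
    }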


How we started

• 1st successful implementation on my laptop for Linux 2.4.

– /vicep-partition was .
• 2nd successful implementation on the Regatta system for AIX 5.1:

– /vicepm is a GPFS with TSM-HSM support, visible on all Regattas on the switch.

– Only possible with MR-AFS as shared residency, because special precautions are necessary for the delayed open of migrated files.
• 3rd try on GPFS in a Linux cluster was not successful (incomplete VFS implementation).
• 4th try on the StorNext filesystem at CASPUR in Rome was not successful (incomplete VFS implementation).
• 5th try on an NFS-mounted filesystem was successful.

Problems

To open, read, or write files I copied the technique used for the cache files. But this leads to some problems: at least StoreDirect, but probably also GPFS on Linux, do not properly fill in the a_ops operation pointers. To verify this I added some debugging code in afs_vnop_open.c:

    afs_Trace2(afs_iclSetp, CM_TRACE_POINTER, ICL_TYPE_STRING,
               "tvc->nameivp->d_inode->i_mapping->a_ops",
               ICL_TYPE_POINTER, tvc->nameivp->d_inode->i_mapping->a_ops);
    if (tvc->nameivp->d_inode->i_mapping->a_ops) {
        afs_Trace2(afs_iclSetp, CM_TRACE_POINTER, ICL_TYPE_STRING,
                   "tvc->nameivp->d_inode->i_mapping->a_ops->prepare_write",
                   ICL_TYPE_POINTER,
                   tvc->nameivp->d_inode->i_mapping->a_ops->prepare_write);
        afs_Trace2(afs_iclSetp, CM_TRACE_POINTER, ICL_TYPE_STRING,
                   "tvc->nameivp->d_inode->i_mapping->a_ops->commit_write",
                   ICL_TYPE_POINTER,
                   tvc->nameivp->d_inode->i_mapping->a_ops->commit_write);
        if (tvc->nameivp->d_inode->i_mapping->a_ops->prepare_write
            && tvc->nameivp->d_inode->i_mapping->a_ops->commit_write)
            found = 1;
    }

Output from "fstrace dump":

    time 151.581058, pid 8590: Pointer tvc->nameivp->d_inode->i_mapping == 0xf7983634
    time 151.581058, pid 8590: Pointer tvc->nameivp->d_inode->i_mapping->a_ops == 0xf8a9db80
    time 151.581058, pid 8590: Pointer tvc->nameivp->d_inode->i_mapping->a_ops->prepare_write == 0x0
    time 151.581058, pid 8590: Pointer tvc->nameivp->d_inode->i_mapping->a_ops->commit_write == 0x0


Performance measurement: write_test, read_test

• write_test writes length bytes into file filename at offset offset.

– Usage: write_test filename offset length

– Buffer size is 1 MB. Buffer is filled with offset information at each 4 KB.

– After each 100 MB the time needed and the current data rate are printed

– At the end the total time and data rate are printed.
• read_test reads a file produced by write_test and checks for correct contents.

– Usage: read_test filename offset
• The offset parameter was used to test the large file support in AFS without having to wait for the writing of the first 2 GB!
(A minimal sketch of such a tool follows.)
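A minimal sketch of a write_test-like tool, based only on the description above (1 MB buffer, offset stamps every 4 KB, a progress line every 100 MB); it is not the original RZG source, and read_test is omitted.

    /* Sketch of a write_test-like tool.  Usage: write_test filename offset length */
    #define _FILE_OFFSET_BITS 64
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/time.h>

    #define BUFSIZE (1024 * 1024)          /* 1 MB I/O buffer */
    #define REPORT  (100 * 1024 * 1024)    /* report every 100 MB */

    static double now(void)
    {
        struct timeval tv;
        gettimeofday(&tv, NULL);
        return tv.tv_sec + tv.tv_usec * 1e-6;
    }

    int main(int argc, char **argv)
    {
        if (argc != 4) {
            fprintf(stderr, "Usage: %s filename offset length\n", argv[0]);
            return 1;
        }
        long long offset = atoll(argv[2]), length = atoll(argv[3]);
        long long done = 0, reported = 0;
        char *buf = malloc(BUFSIZE);
        int fd = open(argv[1], O_WRONLY | O_CREAT, 0644);
        if (fd < 0 || !buf) {
            perror("write_test");
            return 1;
        }
        double t0 = now(), tlast = t0;
        while (done < length) {
            long long n = length - done > BUFSIZE ? BUFSIZE : length - done;
            for (long long i = 0; i < n; i += 4096) {   /* stamp the file offset */
                long long stamp = offset + done + i;
                memcpy(buf + i, &stamp, sizeof(stamp));
            }
            if (pwrite(fd, buf, n, offset + done) != n) {
                perror("pwrite");
                return 1;
            }
            done += n;
            if (done / REPORT > reported) {             /* another 100 MB finished */
                double t = now();
                reported = done / REPORT;
                printf("%lld writing of %d bytes took %.3f sec. (%.0f Kbytes/sec)\n",
                       reported, REPORT, t - tlast, REPORT / 1024.0 / (t - tlast));
                tlast = t;
            }
        }
        double t1 = now();
        printf("write of %lld bytes took %.3f sec.\n", done, t1 - t0);
        printf("Total data rate = %.0f Kbytes/sec. for write\n",
               done / 1024.0 / (t1 - t0));
        close(fd);
        free(buf);
        return 0;
    }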


Example for write_test output

    ~/test/r: ~hwr/afs/@sys/write_test 1GB 0 1000000000
    1 writing of 104857600 bytes took 0.703 sec. (145603 Kbytes/sec)
    2 writing of 104857600 bytes took 0.616 sec. (166260 Kbytes/sec)
    3 writing of 104857600 bytes took 0.924 sec. (110830 Kbytes/sec)
    4 writing of 104857600 bytes took 1.054 sec. (97117 Kbytes/sec)
    5 writing of 104857600 bytes took 0.958 sec. (106873 Kbytes/sec)
    6 writing of 104857600 bytes took 0.989 sec. (103571 Kbytes/sec)
    7 writing of 104857600 bytes took 0.985 sec. (104005 Kbytes/sec)
    8 writing of 104857600 bytes took 0.961 sec. (106508 Kbytes/sec)
    9 writing of 104857600 bytes took 0.891 sec. (114942 Kbytes/sec)
    write of 1000000000 bytes took 8.676 sec.
    close took 0.000 sec.
    Total data rate = 112557 Kbytes/sec. for write
    ~/test/r: pwd
    /afs/ipp-garching.mpg.de/home/h/hwr/test/r
    ~/test/r: df -k /vicepm
    Filesystem    1024-blocks       Free %Used Iused %Iused Mounted on
    /dev/hsmgpfs   3292001280 3274890496    1%    36     1% /vicepm
    ~/test/r:
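As a quick cross-check (not part of the original output): 1,000,000,000 bytes in 8.676 s corresponds to 1,000,000,000 / 1024 / 8.676 ≈ 112,559 KB/s, which matches the reported total of 112557 Kbytes/sec up to timing rounding.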


Performance measurement: raid_test

• raid_test is a combination of write_test and read_test to get aggregate throughput numbers.
• Usage: raid_test filename streams [length]

– It forks streams times. Each child concatenates its stream number to filename.
– If this file exists, it reads the file; otherwise it writes it with length bytes.
• raid_test is called in a script "raid.sh" which starts with 1 stream and ends with 8 parallel streams.
• The results and log files are kept in a directory.
• Can be found in /afs/ipp-garching.mpg.de/u/hwr/public/performance-tests
(A minimal sketch of the fork logic follows.)
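A minimal sketch of a raid_test-like driver, again based only on the description above and not the original RZG source: each child appends its stream number to filename; an existing file is read back, otherwise it is written with length bytes.

    /* Sketch of a raid_test-like driver.  Usage: raid_test filename streams [length] */
    #define _FILE_OFFSET_BITS 64
    #include <stdio.h>
    #include <stdlib.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/wait.h>

    #define BUFSIZE (1024 * 1024)

    static void run_stream(const char *base, int i, long long length)
    {
        char name[1024];
        char *buf = calloc(1, BUFSIZE);
        if (!buf)
            _exit(1);
        snprintf(name, sizeof(name), "%s.%d", base, i);

        int fd = open(name, O_RDONLY);
        if (fd >= 0) {                              /* file exists: read it back */
            while (read(fd, buf, BUFSIZE) > 0)
                ;
        } else {                                    /* otherwise: write it */
            fd = open(name, O_WRONLY | O_CREAT, 0644);
            if (fd < 0) {
                perror(name);
                _exit(1);
            }
            for (long long done = 0; done < length; done += BUFSIZE)
                if (write(fd, buf, BUFSIZE) < 0) {
                    perror(name);
                    break;
                }
        }
        close(fd);
        free(buf);
    }

    int main(int argc, char **argv)
    {
        if (argc < 3) {
            fprintf(stderr, "Usage: %s filename streams [length]\n", argv[0]);
            return 1;
        }
        int streams = atoi(argv[2]);
        long long length = argc > 3 ? atoll(argv[3]) : 1073741824LL;

        for (int i = 0; i < streams; i++)           /* one child per stream */
            if (fork() == 0) {
                run_stream(argv[1], i, length);
                _exit(0);
            }
        while (wait(NULL) > 0)                      /* wait for all children */
            ;
        return 0;
    }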

raid_test output

running on machine Linux videamus2 2.4.21-144-smp4G-kgdb #2 SMP Mon Dec 8 14:01:55 CET 2003 i686 i686 i386 GNU/Linux
at Mi Jan 28 09:35:12 CET 2004
in directory /vicepz/tmp
Using files with option and size 1073741824

1st test: write raidfile.0                                    real 0m11.487s  user 0m0.260s  sys 0m4.940s
2nd test: read raidfile.0                                     real 0m6.072s   user 0m0.280s  sys 0m2.720s
3rd test: write raidfile.1 and read raidfile.0                real 0m17.169s  user 0m0.580s  sys 0m10.380s
4th test: write raidfile.2 and raidfile.3 and read the others real 0m34.344s  user 0m1.950s  sys 0m38.540s
5th test: read 4 files in parallel                            real 0m26.271s  user 0m1.910s  sys 0m20.680s
6th test: write raidfile.4 to raidfile.7 and read the others  real 1m41.067s  user 0m3.760s  sys 1m35.880s
7th test: read 8 files in parallel                            real 1m25.868s  user 0m3.360s  sys 0m45.190s

Average values:
write 1 stream   94420 KB/s    read 1 stream  182830 KB/s
write 2 streams  63760 KB/s    read 2 streams  65629 KB/s
write 4 streams  30896 KB/s    read 4 streams  44548 KB/s
write 8 streams  10598 KB/s    read 8 streams  13282 KB/s

The test environment on Linux

Server: xSeries 335
– 2 processors Intel Xeon 2.8 GHz
– 1.5 GB main memory
– Linux SuSE 9.0, kernel 2.4.21-144-smp4G-kgdb

Storage: Triplestor MASSCOPE 3.0 TB IDE-FC RAID system
– 256 MB cache
– 12 Hitachi ATA 100 disks, 250 GB each
– RAID 5 over 11 disks + 1 hot spare
– FC interface 2 Gb/s
– /vicepz: reiserfs 3.6 partition, 200 GB


AFS client on the fileserver without -vicepaccess

/vicep-partition is on an IDE-RAID on the same machine.
running on machine Linux videamus2 2.4.21-144-smp4G-kgdb #2 SMP Mon Dec 8 14:01:55 CET 2003 i686 i686 i386 GNU/Linux
at Mi Jan 28 10:43:11 CET 2004
in directory /afs/ipp/tests/fileserver/videamus2.rzg.mpg.de/perftests
Using files with option and size 1073741824

Tests: aggregate data rates
1) write 1 file                                    29110 KB/s
2) read 1st file                                   49155 KB/s
3) read 1st file and write 2nd file                35956 KB/s
4) read 1st and 2nd file, write 3rd and 4th file   39390 KB/s
5) read 4 files                                    43148 KB/s
6) read 1st to 4th file, write 5th to 8th file     38016 KB/s
7) read all files                                  41912 KB/s
(raw output snipped)

Average values:
write 1 stream  29110 KB/s    read 1 stream  49155 KB/s
write 2 streams 17261 KB/s    read 2 streams 18695 KB/s
write 4 streams  8908 KB/s    read 4 streams 10787 KB/s
write 8 streams  4265 KB/s    read 8 streams  5239 KB/s

Calculation example: the aggregate data rate for test 6 is (4265 + 5239) * 4 = 38016 KB/s.
The tests with more than 4 streams are slowed down by the limitation of 4 rx-calls per connection.


AFS client on the fileserver, afsd -vicepaccess

/vicep-partition is on an IDE-RAID on the same machine.
running on machine Linux videamus2 2.4.21-144-smp4G-kgdb #2 SMP Mon Dec 8 14:01:55 CET 2003 i686 i686 i386 GNU/Linux
at Mi Jan 28 09:56:19 CET 2004
in directory /afs/ipp/tests/fileserver/videamus2.rzg.mpg.de/perftests
Using files with option -g and size 8

Tests: aggregate data rates
1) write 1 file                                    97541 KB/s
2) read 1st file                                  183940 KB/s
3) read 1st file and write 2nd file               127590 KB/s
4) read 1st and 2nd file, write 3rd and 4th file  138868 KB/s
5) read 4 files                                   159596 KB/s
6) read 1st to 4th file, write 5th to 8th file     89288 KB/s
7) read all files                                  91728 KB/s
(raw output snipped)

Average values:
write 1 stream  97541 KB/s    read 1 stream  183940 KB/s
write 2 streams 65965 KB/s    read 2 streams  61625 KB/s
write 4 streams 29735 KB/s    read 4 streams  39899 KB/s
write 8 streams 10856 KB/s    read 8 streams  11466 KB/s

Calculation example: the aggregate data rate for test 6 is (10856 + 11466) * 4 = 89288 KB/s.
This run was made with a file size of 8 GB.


I/O to AFS, AFS with -vicepaccess, and directly to the partition

(Bar chart "I/O Performance": KB/s, x-axis from 0 to 200000, for the tests: write 1st file; read 1st file; write 2nd file, read 1st file; write 3rd and 4th file, read 1st and 2nd file; read 4 files; write 5th to 8th file, read 1st to 4th file; read 8 files. Three bars per test: AFS normal, AFS direct, and /vicepm/tmp.)


How to exploit this new feature

• On each AFS fileserver the AFS client benefits from this technique.
• On Regatta systems:

– All nodes which can mount a vicep-partition in GPFS benefit.

– This is expected to be possible in the future also remotely and also for Linux.
• The implementation should also be tested for other SAN filesystems such as

– StorageTank, CXFS, QFS, StoreDirect, and others.

– Still some work to be done.
• Even where NFS is faster than AFS this technique can be used.


Open Questions

• Presently the fileserver doesn't keep information about files opened on the client

– What happens when the volume is moved to another server?

– What happens when the file is going to be wiped (MR-AFS)?
• We need something similar to the callback mechanism:

– How to synchronize after a server restart?

– How to synchronize after a client reboot?
• Still some work to be done before we can use this in a production environment!


Limitations and Chances

• Export of vicep-partitions is limited to trusted hosts because

– the root user can access all data in the vicep-partition bypassing AFS.

• AFS can be used as an access control mechanism for data in globally shared filesystems because

– the local uid of a user as defined in /etc/passwd doesn't matter

– Data access is strongly protected by Kerberos authentication

– Data are accessible from any AFS client world wide.


Import of existing files into AFS

Expensive, long-running batch jobs should better be independent of AFS. On the Regattas we therefore allow users to write files into a special subdirectory of the /vicep-partition: /r maps to /vicepm/r. Files written there can later be imported into MR-AFS by

fs import

This creates a vnode in AFS and renames the file from /vicepm/r/... to /vicepm/AFSIDat/..., where the namei algorithm expects it (see the sketch below). This also works with files migrated by TSM-HSM.

This could be implemented for the OpenAFS fileserver as well because it doesn't depend on special MR-AFS features.
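A conceptual sketch of the central point of fs import, namely that the data is renamed into the namei tree rather than copied; the paths are purely illustrative and the vnode bookkeeping on the fileserver is omitted.

    /* Conceptual model only: the real "fs import" first creates the AFS
     * vnode on the fileserver; here we only show the rename(2) that moves
     * the user-written file into the namei tree without copying data. */
    #include <stdio.h>
    #include <string.h>
    #include <errno.h>

    int main(void)
    {
        const char *staged = "/vicepm/r/bigfile.dat";         /* written by the batch job */
        const char *namei  = "/vicepm/AFSIDat/placeholder";   /* where the namei layout expects it */

        if (rename(staged, namei) != 0)
            fprintf(stderr, "rename %s -> %s: %s\n", staged, namei, strerror(errno));
        return 0;
    }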


Data Migration / HSM-Systems at RZG

• IPP-developed HSM systems with data migration to tape have been in use since the early days.

– AMOS 1971

– HADES 1981

– AMOS2 1984
• Unix-based HSM systems since the nineties:

– DMF on Cray 1992

– DMF on SGI 1993

– TSM-HSM 2002
• HSM functionality in AFS (MR-AFS) since the mid-nineties:

– Support for CASTOR is under development.


Supercomputers, HSM-Servers, and HSM-software at RZG (timeline 1962 to 2003)

Supercomputers: IBM 7090, IBM 360-91, Cray 1, Cray XMP 24, Cray YMP, Cray T3D/128, Cray T3E, IBM Regatta
HSM servers: IBM 370-145, Amdahl 470 V6, Siemens 7870, IBM 4381 (B), IBM 4381 (C), IBM 3090-15E, Cray EL, Cray Jedi, SGI Origin 2000, IBM Regatta
HSM software: AMOS, HADES, AMOS2, Cray DMF (YMP), Cray DMF (EL), Cray DMF (Jedi), SGI DMF, TSM-HSM, MR-AFS


Multiple-Resident-AFS (MR-AFS)

• Developed at the Pittsburgh Supercomputing Center (psc.edu) by Jonathan Goldick, Chris Kirby, Bill Zumach, et al.

• Fileserver extensions to Transarc's AFS

• Since 1995 development and maintenance at RZG.

• Since 2001 based on OpenAFS code and libraries.

• Client extensions integrated in OpenAFS (large file support, commands, etc.)

• Used in production only at RZG.


Main Features of MR-AFS

• Files may be stored outside the volume’s partition.

• Fileserver can do I/O remotely (remioserver).

• Fileservers can share HSM resources and disks.

• Files from any fileserver partition can be migrated into the HSM system (AFS internal data migration).

• Volumes can be moved between fileservers without moving the files stored in the HSM system or other shared disks.

• Intelligent queueing of HSM recall requests.


AFS Cell “ipp-garching.mpg.de”

• ~20 fileservers all with MR-AFS binaries

– 3 of them using data migration

– all others behave like OpenAFS fileservers.

• ~36 TB of files, 6 TB on disk, the rest on tape.

• File-based backup done by TSM allows users to restore old file versions.

• RO-volumes in the RW-partition and on a separate server:

– Each night all RW-volumes whose RO clones are not up to date are released.

– If a partition is lost, the RO clones on the separate server can be converted to RW-volumes within a few minutes. (This has already happened!)


Future: MR-AFS and Castor

(Architecture diagram: clients in n clusters; a fileserver whose local_disk holds meta-data, directories, and small files and serves as shared archival residency; large files in a shared filesystem used as shared Castor residency; migration and staging handled via the Castor meta-data server and the Castor tape movers driving the tape drives.)


Conclusions

• AFS can make use of the speed of other shared file systems such as GPFS

– If the fileserver's partitions can be exported to trusted clients

– Results show native filesystem speed also through the AFS client.

– This technique can also be used to add secure access control to globally shared file systems.
• High Performance AFS is presently available only in combination with MR-AFS.

– But it should be easy to port it to OpenAFS fileservers as well.
• MR-AFS has some other interesting features which could make it worth using.
