RechenZentrum Garching of the Max Planck Society

High Performance AFS

Hartmut Reuter [email protected]

• Supercomputing environment at RZG

• Why AFS is slow compared to NFS and SAN-filesystems
• Direct I/O from the client to the fileserver partition
• Implementation in MR-AFS and OpenAFS
• Performance measurements and results

• „fs import“

• MR-AFS and Castor

Geneva, February 5, 2004

RZG

RZG is the supercomputing center of the Max Planck Society in Germany

It also acts as the local computing center for a number of Max Planck institutes located at Garching, especially for IPP (Institut für Plasmaphysik)

The local AFS-cell therefore historically has the name ipp-garching.mpg.de

Using MR-AFS, this AFS cell also provides archival space for the MPG


MPI for Polymer Research, Mainz

Multiscale model of bisphenol-A polycarbonate (BPA-PC) on nickel

(a) The coarse grained representation of a BPA-PC segment

(b) Coarse grained model of a N=20 BPA-PC molecule

(c) Phenol adsorbed on the bridge site of a (111) nickel surface

Code: CPMD


MPI for Astrophysics Garching

Core-collapse supernova simulation: snapshots of the hydrodynamic evolution of a rotating massive star, 0.25 s after the start of the explosion

Code: Rady/2D


MPI for Metals Research, Stuttgart

Large-scale atomistic study of the inertia properties of mode I cracks.

A crack propagating at several kilometers per second is suddenly brought to rest.


MPI for Plasma Physics, Garching and Greifswald

Simulation of the time development of the turbulent radial heat flux

Code: TORB


4 Decades of Supercomputing Tradition at RZG

1962: IBM 7090 (0.1 Mflop/s, 128 kB RAM)
1969: IBM 360/91 (15 Mflop/s, 2 MB RAM)
1979: Cray-1 (80 Mflop/s, 8 MB RAM)
1998: Cray T3E/816 (0.47 TFlop/s, 104 GB RAM)
2002/2003: IBM p690 (4 TFlop/s, 2 TB RAM)


IBM p690: 4 TFlop/s, 2 TB RAM

24 compute nodes and 2 I/O nodes; each node has 32 Power4 processors.
Compute nodes have 64 GB of memory each, the two I/O nodes 256 GB each.
22 TB of FC disks and 5 TB of SSA disks are attached to the I/O nodes.
Federation switch: measured throughput 4.4 GB/s bidirectional between 2 nodes, measured latency 12 µs.


AFS is too slow on the Regatta cluster

For large files AFS is much slower than GPFS on the Regatta cluster:
+ GPFS stripes data over multiple nodes.
- AFS exchanges data with a single fileserver.
- With AFS all data go through the AFS cache.
- AFS is also slower than NFS for protocol reasons.


Why AFS is slow compared to NFS

• Disk caches on local disks are slower than the network.

• write() sleeps while data are transferred to the server.

• Unnecessary read RPCs before a chunk is written.
• Memory mapping of cache files breaks large I/O down into hundreds of requests.
• The Rx protocol is considered sub-optimal.


How to make AFS faster for large files

• Use fastest filesystem for /vicep-partition on server

– On Regatta cluster use GPFS

• Bypass AFS caching on the client by direct I/O to the fileserver's /vicep-partitions.

– helps on all fileserver machines for files in volumes stored there

– helps in clusters if the /vicep-partitions are mounted cluster-wide.

– requires modifications in the client and server code.

– Should be done only on trusted hosts


Writing a new file to AFS

1) create_file RPC
2) write chunks into the cache

This process is interrupted and followed by store_data RPCs, each one doing:
3) read from the cache
4) transfer over the network
5) write to /vicepa

(Figure: client with local cache on one side, fileserver with /vicepa on the other.)


Writing a file directly to the AFS server partition

1) create file
2) check meta-data, permissions, and quota and return the file's path in /vicepa
3) write the file into /vicepa
4) update meta-data on the server

(Figure: the client writes directly into /vicepa on the fileserver.)


Design of direct I/O to /vicep-partitions

• Fileservers owning /vicep-partitions are identified by a sysid file in the partition.
• afsd with option "-vicepaccess" informs the AFS kernel extension (new subcall).
• Volumes with instances on fileservers with visible partitions are flagged.
• Open of files in these volumes first tries a new RPC to get path information from the fileserver.

– If that or the open of the vnode/dentry fails, open resumes in the old way.
• I/O is done directly using the opened vnode/dentry.
• Close for write informs the fileserver about the new file length.


Implementation of direct I/O in MR-AFS and OpenAFS

• Why MR-AFS

– because RZG runs only MR-AFS fileservers

– because existing ResidencyCmd RPC could be used without changing afsint.xg

– because MR-AFS has large file support
• Which version of OpenAFS?

– the CVS version from July 2003, where my last patches regarding the AIX 5.2 port had been committed


MR-AFS server modifications

• partition.c copies /usr/afs/local/sysid into all active /vicep-partitions.
• In src/viced/afsfileprocs.c:

– For direct read and write, new RPC subcalls of SAFS_ResidencyCmd were implemented which return the path in the /vicep-partition as a string (see the sketch after this list).

– They need the same checks as all flavours of SAFS_StoreData and SAFS_FetchData.

– Therefore the common code was put into generic routines StoreData() and FetchData().

– In the long run new RPCs SAFS_DirectStore and SAFS_DirectFetch should be implemented, also in the OpenAFS fileserver.
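The following is a conceptual userspace model of this server-side idea, not actual MR-AFS code: check_access_and_quota(), residencycmd_get_path(), struct model_fid, and the generated path are all illustrative stand-ins. It only shows the shape of the new subcall: run the checks shared with StoreData()/FetchData(), then hand back the partition-relative path as a string.

    /* A conceptual userspace model, not MR-AFS code: the subcall runs the
     * checks shared with StoreData()/FetchData() (a single stub here) and
     * returns the file's path inside the /vicep-partition as a string.
     * All names and the generated path are illustrative. */
    #include <stdio.h>

    struct model_fid { unsigned int volume, vnode, unique; };

    /* stub for the permission/quota checks shared with StoreData()/FetchData() */
    static int check_access_and_quota(const struct model_fid *fid, int writing)
    {
        (void)fid; (void)writing;
        return 0;                         /* 0 = OK */
    }

    /* model of the new ResidencyCmd subcall: fill in the path string */
    static int residencycmd_get_path(const struct model_fid *fid, int writing,
                                     char *path, size_t len)
    {
        int code = check_access_and_quota(fid, writing);
        if (code)
            return code;
        /* a real server derives this from the namei layout; this is fake */
        snprintf(path, len, "/vicepa/AFSIDat/%u.%u.%u",
                 fid->volume, fid->vnode, fid->unique);
        return 0;
    }

    int main(void)
    {
        struct model_fid f = { 536870918, 1, 1 };
        char path[256];
        if (residencycmd_get_path(&f, 1, path, sizeof(path)) == 0)
            printf("client may open %s directly\n", path);
        return 0;
    }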


Open() on the client

src/afs/VNOPS/afs_vnop_open.c, after everything else is done:

1) Does the volume's server have a visible /vicep-partition? If not: done.
2) GetServerPath RPC to the fileserver. If it does not succeed: done.
3) Open the file in the vicep-partition. If that does not succeed: done.
4) Save the dentry pointer in the vcache.


OpenAFS client modification for open()

• /vicep-partitions are scanned for sysid files. The UUIDs found there are handed over to the kernel (afsd.c, afs_call.c).
• Some additional flag bits in some structs make it possible to identify the AFS files which might be read or written directly in a visible /vicep-partition.
• If these flag bits are set, the open vnode operation tries, after everything else is done, to get the path information from the fileserver using the new RPC (afs_vnop_open.c); the flow is the same as on the previous slide and is sketched in code below.

– If the RPC succeeds, the file's vnode/dentry is looked up and the pointer is stored in the vcache struct.

– If the RPC or the lookup of the file's path fails, the old way of open is resumed.
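A minimal userspace model of this open() logic; rpc_get_server_path(), open_in_partition(), open_old_way(), and struct vcache_model are illustrative stand-ins, not OpenAFS symbols. The important property is the fall-through: any failure along the direct path ends in the unchanged cached open.

    /* A userspace model of the modified open() path; all names are
     * illustrative stand-ins, not OpenAFS symbols. */
    #include <stdio.h>

    struct vcache_model {
        int direct_fd;                 /* stands in for the saved dentry pointer */
    };

    /* stub for the new GetServerPath RPC */
    static int rpc_get_server_path(char *path, size_t len)
    {
        snprintf(path, len, "/vicepa/AFSIDat/placeholder");
        return 0;                      /* pretend the fileserver answered */
    }

    /* stub for looking up / opening the file inside the /vicep-partition */
    static int open_in_partition(const char *path)
    {
        (void)path;
        return -1;                     /* pretend the local lookup failed */
    }

    static int open_old_way(void)
    {
        puts("falling back to the normal cached open");
        return 0;
    }

    int model_afs_open(struct vcache_model *vc, int volume_visible)
    {
        char path[256];

        if (volume_visible && rpc_get_server_path(path, sizeof(path)) == 0) {
            int fd = open_in_partition(path);
            if (fd >= 0) {
                vc->direct_fd = fd;    /* remember it for later read()/write() */
                return 0;
            }
        }
        vc->direct_fd = -1;            /* RPC or lookup failed: old open path */
        return open_old_way();
    }

    int main(void)
    {
        struct vcache_model vc = { -1 };
        return model_afs_open(&vc, 1);
    }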


write() on the client

src/afs/LINUX/osi_vnodeops.c, before anything else is done:

• Whenever a pointer to a vnode/dentry is available in struct vcache, it is used to do the I/O directly, bypassing the AFS cache and the RPCs to the fileserver:

– exchange the dentry pointer in struct file,
– call generic_file_write(),
– restore the dentry pointer in struct file.
• If this fails, the old way is used instead.
(A userspace analogue of this swap-call-restore pattern is sketched below.)
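A userspace analogue of the swap-call-restore pattern, with illustrative names only: struct file_model stands in for struct file, generic_write() for generic_file_write(). read() on the next slide uses the same pattern with generic_file_read().

    /* Userspace analogue of the write() trick described above: the file
     * object is temporarily pointed at the directly opened /vicep file,
     * the generic write routine is called, and the original pointer is
     * restored.  Names are illustrative, not OpenAFS symbols. */
    #include <stdio.h>
    #include <unistd.h>

    struct file_model {
        int backing_fd;        /* stands in for the dentry in struct file */
    };

    /* stands in for generic_file_write(): writes through whatever
     * backing object the file currently points to */
    static ssize_t generic_write(struct file_model *fp, const void *buf, size_t n)
    {
        return write(fp->backing_fd, buf, n);
    }

    static ssize_t old_cached_write(const void *buf, size_t n)
    {
        (void)buf;
        return (ssize_t)n;     /* pretend the normal cache path succeeded */
    }

    ssize_t model_afs_write(struct file_model *fp, int direct_fd,
                            const void *buf, size_t n)
    {
        if (direct_fd >= 0) {                  /* vcache has a dentry pointer */
            int saved = fp->backing_fd;        /* exchange the pointer ...    */
            fp->backing_fd = direct_fd;
            ssize_t r = generic_write(fp, buf, n);
            fp->backing_fd = saved;            /* ... and restore it          */
            if (r >= 0)
                return r;                      /* cache and RPCs were bypassed */
        }
        return old_cached_write(buf, n);       /* failure: do it the old way  */
    }

    int main(void)
    {
        struct file_model f = { -1 };
        const char msg[] = "direct write\n";
        return model_afs_write(&f, STDOUT_FILENO, msg, sizeof(msg) - 1) < 0;
    }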


read() on the client

src/afs/LINUX/osi_vnodeops.c, before anything else is done:

• Whenever a pointer to a vnode/dentry is available in struct vcache, it is used to do the I/O directly, bypassing the AFS cache and the RPCs to the fileserver:

– exchange the dentry pointer in struct file,
– call generic_file_read(),
– restore the dentry pointer in struct file.
• If this fails, the old way is used instead.


close() on the client

After everything else is done (src/afs/VNOPS/afs_vnop_write.c, src/afs/afs_segments.c):

• Close for write triggers a dummy StoreData64 RPC (SAFS_StoreData without data) to update the meta-data: storemini() does the StoreData RPC which updates the file length and modification time in the AFS vnode of the file.
• Any close() releases the vnode/dentry (dput(dentry pointer)) and clears the field in the vcache.
(A sketch of this close handling follows.)
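A userspace model of this close() handling; the RPC and the dentry release are stubs with illustrative names, not OpenAFS symbols.

    /* Userspace model of the close() handling described above. */
    #include <stdio.h>
    #include <unistd.h>

    struct vcache_model {
        int direct_fd;         /* stands in for the saved vnode/dentry pointer */
    };

    /* stub for the dummy StoreData64 RPC that carries only the new
     * file length and modification time, no data */
    static void rpc_store_data_dummy(long long new_length)
    {
        printf("StoreData64 (no data): new length %lld\n", new_length);
    }

    void model_afs_close(struct vcache_model *vc, int was_open_for_write,
                         long long new_length)
    {
        if (was_open_for_write)
            rpc_store_data_dummy(new_length);   /* update size + mtime on server */
        if (vc->direct_fd >= 0) {
            close(vc->direct_fd);               /* dput() in the real client */
            vc->direct_fd = -1;                 /* clear the field in the vcache */
        }
    }

    int main(void)
    {
        struct vcache_model vc = { -1 };
        model_afs_close(&vc, 1, 1000000000LL);
        return 0;
    }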


How we started

• 1st successful implementation on my laptop for Linux 2.4.

– /vicep-partition was .
• 2nd successful implementation on the Regatta system for AIX 5.1:

– /vicepm is a GPFS with TSM-HSM support, visible on all Regattas on the switch.

– Only possible with MR-AFS as shared residency, because special precautions are necessary for the delayed open of migrated files.
• 3rd try on GPFS in a Linux cluster was not successful (incomplete VFS implementation).
• 4th try on the StorNext filesystem at CASPUR in Rome was not successful (incomplete VFS implementation).
• 5th try on an NFS-mounted filesystem was successful.

Problems

To open, read, or write files I copied the technique used for the cache files. But this leads to some problems: at least StoreDirect, but probably also GPFS on Linux, do not properly fill in the a_ops operation pointers. To verify this I added some debugging code in afs_vnop_open.c:

    afs_Trace2(afs_iclSetp, CM_TRACE_POINTER, ICL_TYPE_STRING,
               "tvc->nameivp->d_inode->i_mapping->a_ops",
               ICL_TYPE_POINTER, tvc->nameivp->d_inode->i_mapping->a_ops);
    if (tvc->nameivp->d_inode->i_mapping->a_ops) {
        afs_Trace2(afs_iclSetp, CM_TRACE_POINTER, ICL_TYPE_STRING,
                   "tvc->nameivp->d_inode->i_mapping->a_ops->prepare_write",
                   ICL_TYPE_POINTER,
                   tvc->nameivp->d_inode->i_mapping->a_ops->prepare_write);
        afs_Trace2(afs_iclSetp, CM_TRACE_POINTER, ICL_TYPE_STRING,
                   "tvc->nameivp->d_inode->i_mapping->a_ops->commit_write",
                   ICL_TYPE_POINTER,
                   tvc->nameivp->d_inode->i_mapping->a_ops->commit_write);
        if (tvc->nameivp->d_inode->i_mapping->a_ops->prepare_write
            && tvc->nameivp->d_inode->i_mapping->a_ops->commit_write)
            found = 1;
    }

Output from "fstrace dump":

    time 151.581058, pid 8590: Pointer tvc->nameivp->d_inode->i_mapping == 0xf7983634
    time 151.581058, pid 8590: Pointer tvc->nameivp->d_inode->i_mapping->a_ops == 0xf8a9db80
    time 151.581058, pid 8590: Pointer tvc->nameivp->d_inode->i_mapping->a_ops->prepare_write == 0x0
    time 151.581058, pid 8590: Pointer tvc->nameivp->d_inode->i_mapping->a_ops->commit_write == 0x0


Performance measurement: write_test, read_test

• write_test writes length bytes into file filename at offset offset.

– Usage: write_test filename offset length

– Buffer size is 1 MB. Buffer is filled with offset information at each 4 KB.

– After each 100 MB the time needed and the current data rate are printed

– At the end the total time and data rate are printed.
• read_test reads a file produced by write_test and checks for correct contents.

– Usage: read_test filename offset
• The offset parameter was used to test the large file support in AFS without having to wait for the writing of the first 2 GB!
(A minimal sketch of such a tool follows.)
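A minimal sketch of a write_test-like tool, based only on the description above (1 MB buffer, offset stamps every 4 KB, a progress line every 100 MB); it is not the original RZG source, and read_test is omitted.

    /* Sketch of a write_test-like tool.  Usage: write_test filename offset length */
    #define _FILE_OFFSET_BITS 64
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/time.h>

    #define BUFSIZE (1024 * 1024)          /* 1 MB I/O buffer */
    #define REPORT  (100 * 1024 * 1024)    /* report every 100 MB */

    static double now(void)
    {
        struct timeval tv;
        gettimeofday(&tv, NULL);
        return tv.tv_sec + tv.tv_usec * 1e-6;
    }

    int main(int argc, char **argv)
    {
        if (argc != 4) {
            fprintf(stderr, "Usage: %s filename offset length\n", argv[0]);
            return 1;
        }
        long long offset = atoll(argv[2]), length = atoll(argv[3]);
        long long done = 0, reported = 0;
        char *buf = malloc(BUFSIZE);
        int fd = open(argv[1], O_WRONLY | O_CREAT, 0644);
        if (fd < 0 || !buf) {
            perror("write_test");
            return 1;
        }
        double t0 = now(), tlast = t0;
        while (done < length) {
            long long n = length - done > BUFSIZE ? BUFSIZE : length - done;
            for (long long i = 0; i < n; i += 4096) {   /* stamp the file offset */
                long long stamp = offset + done + i;
                memcpy(buf + i, &stamp, sizeof(stamp));
            }
            if (pwrite(fd, buf, n, offset + done) != n) {
                perror("pwrite");
                return 1;
            }
            done += n;
            if (done / REPORT > reported) {             /* another 100 MB finished */
                double t = now();
                reported = done / REPORT;
                printf("%lld writing of %d bytes took %.3f sec. (%.0f Kbytes/sec)\n",
                       reported, REPORT, t - tlast, REPORT / 1024.0 / (t - tlast));
                tlast = t;
            }
        }
        double t1 = now();
        printf("write of %lld bytes took %.3f sec.\n", done, t1 - t0);
        printf("Total data rate = %.0f Kbytes/sec. for write\n",
               done / 1024.0 / (t1 - t0));
        close(fd);
        free(buf);
        return 0;
    }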


Example for write_test output

    ~/test/r: ~hwr/afs/@sys/write_test 1GB 0 1000000000
    1 writing of 104857600 bytes took 0.703 sec. (145603 Kbytes/sec)
    2 writing of 104857600 bytes took 0.616 sec. (166260 Kbytes/sec)
    3 writing of 104857600 bytes took 0.924 sec. (110830 Kbytes/sec)
    4 writing of 104857600 bytes took 1.054 sec. (97117 Kbytes/sec)
    5 writing of 104857600 bytes took 0.958 sec. (106873 Kbytes/sec)
    6 writing of 104857600 bytes took 0.989 sec. (103571 Kbytes/sec)
    7 writing of 104857600 bytes took 0.985 sec. (104005 Kbytes/sec)
    8 writing of 104857600 bytes took 0.961 sec. (106508 Kbytes/sec)
    9 writing of 104857600 bytes took 0.891 sec. (114942 Kbytes/sec)
    write of 1000000000 bytes took 8.676 sec.
    close took 0.000 sec.
    Total data rate = 112557 Kbytes/sec. for write
    ~/test/r: pwd
    /afs/ipp-garching.mpg.de/home/h/hwr/test/r
    ~/test/r: df -k /vicepm
    Filesystem    1024-blocks       Free %Used Iused %Iused Mounted on
    /dev/hsmgpfs   3292001280 3274890496    1%    36     1% /vicepm
    ~/test/r:
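As a quick cross-check (not part of the original output): 1,000,000,000 bytes in 8.676 s corresponds to 1,000,000,000 / 1024 / 8.676 ≈ 112,559 KB/s, which matches the reported total of 112557 Kbytes/sec up to timing rounding.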


Performance measurement: raid_test

• raid_test is a combination of write_test and read_test to get aggregate throughput numbers.
• Usage: raid_test filename streams [length]

– It forks streams times. Each child concatenates its stream number to filename.
– If this file exists, it reads the file; otherwise it writes it with length bytes.
• raid_test is called in a script "raid.sh" which starts with 1 stream and ends with 8 parallel streams.
• The results and log files are kept in a directory.
• Can be found in /afs/ipp-garching.mpg.de/u/hwr/public/performance-tests
(A minimal sketch of the fork logic follows.)
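A minimal sketch of a raid_test-like driver, again based only on the description above and not the original RZG source: each child appends its stream number to filename; an existing file is read back, otherwise it is written with length bytes.

    /* Sketch of a raid_test-like driver.  Usage: raid_test filename streams [length] */
    #define _FILE_OFFSET_BITS 64
    #include <stdio.h>
    #include <stdlib.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/wait.h>

    #define BUFSIZE (1024 * 1024)

    static void run_stream(const char *base, int i, long long length)
    {
        char name[1024];
        char *buf = calloc(1, BUFSIZE);
        if (!buf)
            _exit(1);
        snprintf(name, sizeof(name), "%s.%d", base, i);

        int fd = open(name, O_RDONLY);
        if (fd >= 0) {                              /* file exists: read it back */
            while (read(fd, buf, BUFSIZE) > 0)
                ;
        } else {                                    /* otherwise: write it */
            fd = open(name, O_WRONLY | O_CREAT, 0644);
            if (fd < 0) {
                perror(name);
                _exit(1);
            }
            for (long long done = 0; done < length; done += BUFSIZE)
                if (write(fd, buf, BUFSIZE) < 0) {
                    perror(name);
                    break;
                }
        }
        close(fd);
        free(buf);
    }

    int main(int argc, char **argv)
    {
        if (argc < 3) {
            fprintf(stderr, "Usage: %s filename streams [length]\n", argv[0]);
            return 1;
        }
        int streams = atoi(argv[2]);
        long long length = argc > 3 ? atoll(argv[3]) : 1073741824LL;

        for (int i = 0; i < streams; i++)           /* one child per stream */
            if (fork() == 0) {
                run_stream(argv[1], i, length);
                _exit(0);
            }
        while (wait(NULL) > 0)                      /* wait for all children */
            ;
        return 0;
    }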

raid_test output

running on machine Linux videamus2 2.4.21-144-smp4G-kgdb #2 SMP Mon Dec 8 14:01:55 CET 2003 i686 i686 i386 GNU/Linux
at Mi Jan 28 09:35:12 CET 2004
in directory /vicepz/tmp
Using files with option and size 1073741824

1st test: write raidfile.0                                    real 0m11.487s  user 0m0.260s  sys 0m4.940s
2nd test: read raidfile.0                                     real 0m6.072s   user 0m0.280s  sys 0m2.720s
3rd test: write raidfile.1 and read raidfile.0                real 0m17.169s  user 0m0.580s  sys 0m10.380s
4th test: write raidfile.2 and raidfile.3 and read the others real 0m34.344s  user 0m1.950s  sys 0m38.540s
5th test: read 4 files in parallel                            real 0m26.271s  user 0m1.910s  sys 0m20.680s
6th test: write raidfile.4 to raidfile.7 and read the others  real 1m41.067s  user 0m3.760s  sys 1m35.880s
7th test: read 8 files in parallel                            real 1m25.868s  user 0m3.360s  sys 0m45.190s

Average values:
write 1 stream   94420 KB/s    read 1 stream  182830 KB/s
write 2 streams  63760 KB/s    read 2 streams  65629 KB/s
write 4 streams  30896 KB/s    read 4 streams  44548 KB/s
write 8 streams  10598 KB/s    read 8 streams  13282 KB/s

The test environment on Linux

Server: xSeries 335
– 2 processors Intel Xeon 2.8 GHz
– 1.5 GB main memory
– Linux SuSE 9.0, kernel 2.4.21-144-smp4G-kgdb

Storage: Triplestor MASSCOPE 3.0 TB IDE-FC RAID system
– 256 MB cache
– 12 Hitachi ATA 100 disks, 250 GB each
– RAID 5 over 11 disks + 1 hot spare
– FC interface 2 Gb/s
– /vicepz: reiserfs 3.6 partition, 200 GB


AFS client on the fileserver without -vicepaccess

/vicep-partition is on an IDE-RAID on the same machine.
running on machine Linux videamus2 2.4.21-144-smp4G-kgdb #2 SMP Mon Dec 8 14:01:55 CET 2003 i686 i686 i386 GNU/Linux
at Mi Jan 28 10:43:11 CET 2004
in directory /afs/ipp/tests/fileserver/videamus2.rzg.mpg.de/perftests
Using files with option and size 1073741824

Tests: aggregate data rates
1) write 1 file                                    29110 KB/s
2) read 1st file                                   49155 KB/s
3) read 1st file and write 2nd file                35956 KB/s
4) read 1st and 2nd file, write 3rd and 4th file   39390 KB/s
5) read 4 files                                    43148 KB/s
6) read 1st to 4th file, write 5th to 8th file     38016 KB/s
7) read all files                                  41912 KB/s
(raw output snipped)

Average values:
write 1 stream  29110 KB/s    read 1 stream  49155 KB/s
write 2 streams 17261 KB/s    read 2 streams 18695 KB/s
write 4 streams  8908 KB/s    read 4 streams 10787 KB/s
write 8 streams  4265 KB/s    read 8 streams  5239 KB/s

Calculation example: the aggregate data rate for test 6 is (4265 + 5239) * 4 = 38016 KB/s.
The tests with more than 4 streams are slowed down by the limitation of 4 rx-calls per connection.


AFS client on the fileserver, afsd -vicepaccess

/vicep-partition is on an IDE-RAID on the same machine.
running on machine Linux videamus2 2.4.21-144-smp4G-kgdb #2 SMP Mon Dec 8 14:01:55 CET 2003 i686 i686 i386 GNU/Linux
at Mi Jan 28 09:56:19 CET 2004
in directory /afs/ipp/tests/fileserver/videamus2.rzg.mpg.de/perftests
Using files with option -g and size 8

Tests: aggregate data rates
1) write 1 file                                    97541 KB/s
2) read 1st file                                  183940 KB/s
3) read 1st file and write 2nd file               127590 KB/s
4) read 1st and 2nd file, write 3rd and 4th file  138868 KB/s
5) read 4 files                                   159596 KB/s
6) read 1st to 4th file, write 5th to 8th file     89288 KB/s
7) read all files                                  91728 KB/s
(raw output snipped)

Average values:
write 1 stream  97541 KB/s    read 1 stream  183940 KB/s
write 2 streams 65965 KB/s    read 2 streams  61625 KB/s
write 4 streams 29735 KB/s    read 4 streams  39899 KB/s
write 8 streams 10856 KB/s    read 8 streams  11466 KB/s

Calculation example: the aggregate data rate for test 6 is (10856 + 11466) * 4 = 89288 KB/s.
This run was made with a file size of 8 GB.


I/O to AFS, AFS with -vicepaccess, and directly to the partition

(Bar chart "I/O Performance": KB/s, x-axis from 0 to 200000, for the tests: write 1st file; read 1st file; write 2nd file, read 1st file; write 3rd and 4th file, read 1st and 2nd file; read 4 files; write 5th to 8th file, read 1st to 4th file; read 8 files. Three bars per test: AFS normal, AFS direct, and /vicepm/tmp.)


How to exploit this new feature

• On each AFS fileserver the AFS client benefits from this technique.
• On Regatta systems:

– All nodes which can mount a vicep-partition in GPFS benefit.

– This is expected to be possible in the future also remotely and also for Linux.
• The implementation should also be tested for other SAN filesystems such as

– StorageTank, CXFS, QFS, StoreDirect, and others.

– Still some work to be done.
• Even where NFS is faster than AFS this technique can be used.


Open Questions

• Presently the fileserver doesn't keep information about files opened on the client

– What happens when the volume is moved to another server?

– What happens when the file is going to be wiped (MR-AFS)?
• We need something similar to the callback mechanism:

– How to synchronize after a server restart?

– How to synchronize after a client reboot?
• Still some work to be done before we can use this in a production environment!


Limitations and Chances

• Export of vicep-partitions is limited to trusted hosts because

– the root user can access all data in the vicep-partition bypassing AFS.

• AFS can be used as an access control mechanism for data in globally shared filesystems because

– the local uid of a user as defined in /etc/passwd doesn't matter

– Data access is strongly protected by Kerberos authentication

– Data are accessible from any AFS client world wide.


Import of existing files into AFS

Expensive, long-running batch jobs should better be independent of AFS. On the Regattas we therefore allow users to write files into a special subdirectory of the /vicep-partition: /r maps to /vicepm/r. Files written there can later be imported into MR-AFS by

fs import

This creates a vnode in AFS and renames the file from /vicepm/r/... to /vicepm/AFSIDat/..., where the namei algorithm expects it (see the sketch below). This also works with files migrated by TSM-HSM.

This could be implemented for the OpenAFS fileserver as well because it doesn't depend on special MR-AFS features.
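A conceptual sketch of the central point of fs import, namely that the data is renamed into the namei tree rather than copied; the paths are purely illustrative and the vnode bookkeeping on the fileserver is omitted.

    /* Conceptual model only: the real "fs import" first creates the AFS
     * vnode on the fileserver; here we only show the rename(2) that moves
     * the user-written file into the namei tree without copying data. */
    #include <stdio.h>
    #include <string.h>
    #include <errno.h>

    int main(void)
    {
        const char *staged = "/vicepm/r/bigfile.dat";         /* written by the batch job */
        const char *namei  = "/vicepm/AFSIDat/placeholder";   /* where the namei layout expects it */

        if (rename(staged, namei) != 0)
            fprintf(stderr, "rename %s -> %s: %s\n", staged, namei, strerror(errno));
        return 0;
    }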


Data Migration / HSM-Systems at RZG

• IPP-developed HSM systems with data migration to tape have been in use since the early days.

– AMOS 1971

– HADES 1981

– AMOS2 1984
• Unix-based HSM systems since the nineties:

– DMF on Cray 1992

– DMF on SGI 1993

– TSM-HSM 2002
• HSM functionality in AFS (MR-AFS) since the mid-nineties:

– Support for CASTOR is under development.


Supercomputers, HSM-Servers, and HSM-software at RZG (timeline 1962 to 2003)

Supercomputers: IBM 7090, IBM 360-91, Cray 1, Cray XMP 24, Cray YMP, Cray T3D/128, Cray T3E, IBM Regatta
HSM servers: IBM 370-145, Amdahl 470 V6, Siemens 7870, IBM 4381 (B), IBM 4381 (C), IBM 3090-15E, Cray EL, Cray Jedi, SGI Origin 2000, IBM Regatta
HSM software: AMOS, HADES, AMOS2, Cray DMF (YMP), Cray DMF (EL), Cray DMF (Jedi), SGI DMF, TSM-HSM, MR-AFS


Multiple-Resident-AFS (MR-AFS)

• Developed at the Pittsburgh Supercomputing Center (psc.edu) by Jonathan Goldick, Chris Kirby, Bill Zumach, et al.

• Fileserver extensions to Transarc's AFS

• Since 1995 development and maintenance at RZG.

• Since 2001 based on OpenAFS code and libraries.

• Client extensions integrated in OpenAFS (large file support, commands, etc.)

• Used in production only at RZG.


Main Features of MR-AFS

• Files may be stored outside the volume’s partition.

• Fileserver can do I/O remotely (remioserver).

• Fileservers can share HSM resources and disks.

• Files from any fileserver partition can be migrated into the HSM system (AFS internal data migration).

• Volumes can be moved between fileservers without moving the files stored in the HSM system or other shared disks.

• Intelligent queueing of HSM recall requests.


AFS Cell “ipp-garching.mpg.de”

• ~20 fileservers all with MR-AFS binaries

– 3 of them using data migration

– all others behave like OpenAFS fileservers.

• ~36 TB of files, 6 TB on disk, the rest on tape.

• File-based backup done by TSM allows users to restore old file versions.

• RO-volumes in the RW-partition and on a separate server:

– Each night all RW-volumes whose RO clones are not up to date are released.

– If a partition is lost, the RO clones on the separate server can be converted to RW-volumes within a few minutes. (This has already happened!)


Future: MR-AFS and Castor

(Architecture diagram: clients in n clusters; a fileserver whose local_disk holds meta-data, directories, and small files and serves as shared archival residency; large files in a shared filesystem used as shared Castor residency; migration and staging handled via the Castor meta-data server and the Castor tape movers driving the tape drives.)


Conclusions

• AFS can make use of the speed of other shared file systems such as GPFS

– If the fileserver's partitions can be exported to trusted clients

– Results show native filesystem speed also through the AFS client.

– This technique can also be used to add secure access control to globally shared file systems.
• High Performance AFS is presently available only in combination with MR-AFS.

– But it should be easy to port it to OpenAFS fileservers as well.
• MR-AFS has some other interesting features which could make it worth using.
