EOS Open Storage: Status of the extended FUSE access daemon eosxd

http://eos.cern.ch

Andreas-Joachim Peters CERN IT-ST

EOS Workshop 2019 - 04.02.2019

Why eosxd - the EOS extended filesystem daemon

• mounted filesystem access is useful to enable storage access with common software out of the box

• (distributed) filesystem development is difficult and lengthy

• AFS v1,2,3 - 36 years
• NFS v1,2,3,4 - 35 years
• CephFS - 13 years - 10 years to production

• EOS filesystem client rewrite started in Q4 2016: eosd => eosxd - two years of development so far

From eosd to eosxd - the EOS extended filesystem daemon (the name is FuseX or eosxd)

eosd (old client):
• path-based communication
• passive client, one-directional communication
• cache invalidation by timeout / lookup of the meta-data service

eosxd (new client):
• inode-based communication
• active client, bi-directional communication
• cache invalidation by timeout / lookup / callback
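The difference between the two invalidation schemes can be sketched in a few lines: the old eosd model relies purely on a per-entry TTL, while eosxd additionally accepts invalidation pushed by the server. This is a minimal toy model; all class and method names are illustrative, not the actual eosxd API.

```python
import time

class MetaDataCache:
    """Toy model of the two cache-invalidation schemes: passive TTL
    expiry (eosd-style) plus active server callbacks (eosxd-style)."""

    def __init__(self, ttl=5.0):
        self.ttl = ttl
        self.entries = {}  # inode -> (metadata, insert_time)

    def put(self, inode, md, now=None):
        self.entries[inode] = (md, now if now is not None else time.time())

    def lookup(self, inode, now=None):
        now = now if now is not None else time.time()
        hit = self.entries.get(inode)
        if hit is None:
            return None
        md, t0 = hit
        if now - t0 > self.ttl:        # passive expiry: stale after the TTL
            del self.entries[inode]
            return None
        return md

    def on_server_callback(self, inode):
        # active invalidation pushed by the meta-data server:
        # the entry is dropped immediately, no waiting for the TTL
        self.entries.pop(inode, None)

cache = MetaDataCache(ttl=5.0)
cache.put(42, {"name": "file.txt"}, now=0.0)
assert cache.lookup(42, now=1.0) == {"name": "file.txt"}  # fresh hit
cache.on_server_callback(42)                              # server push
assert cache.lookup(42, now=1.0) is None                  # gone at once
```

The callback path is what makes the bi-directional protocol worthwhile: a client no longer has to serve stale entries until a timeout fires.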

Architecture - better POSIXness

eosxd is implemented on the libfuse low-level API; it talks to the MGM (FuseServer) for meta-data (CAP store, meta-data queue, heartbeat communication) and to the FSTs (xrootd) for data via XrdCl::Proxy / XrdCl::Filesystem / XrdCl::File, mixing sync and async calls. Features:
• file locks and byte-range locks
• hard links within directories
• rich ACL client support
• local data caching
• bulk deletion protection
• strong security & mount-by-key
• user, group & project quota

eosxd provides POSIXness very similar to AFS.

Benchmarks - take them as indicative

[Figure: 'make -j 4' - wall time (~2m35s), CPU consumption and context switches on /tmp, eosxd, ceph-fuse and sshfs]

[Figure: EOS rpm build and untar times, and streaming throughput in MB/s (write bs=1M, read bs=4M, read bs=4M cached) on /tmp, eosxd, ceph-fuse, cephfs kernel 4.9 and AFS]

eosxd delivers acceptable performance with low resource usage for a filesystem implemented in user space - many of the compared filesystems are implemented in kernel space.

AFS phaseout

• the CERN storage group is working on the AFS exit until Run 3 (the so-called AFS phaseout project)

• use cases can be moved to four platforms

• CernVM-FS (mainly software & static data)

• EOS (project, work, user spaces)

• high-performance network filesystems: NFS (ZFS filer), CephFS

• CERN cloud (S3)

Why EOS is not exactly like AFS/NFS …

eosxd couples the client to many disk server nodes - a latency killer compared to the single-server AFS/NFS client:

• each single FS performance fluctuation might hurt everyone

• each misbehaving client might hurt everyone

Scale: an AFS/NFS volume is served by one node; EOSHOME is split into instances i00-i04, EOSUSER runs on 30 servers, EOSATLAS on 347 in 2 computer centres, EOSCMS on 277 in 2 computer centres - the Wigner decommissioning is very welcome!

… because EOS was designed as a file-based storage system for LAN/WAN access

• drastic architectural changes are not possible within frequent releases - EOS is in production

• every file has to have a file checksum ( open-close file transaction model )

• required for GRID frameworks, Sync&Share, data replication integrity

• not a single Linux filesystem provides a file checksum on close
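The open-close transaction model mentioned above can be sketched as follows: the client keeps a running checksum while data is written and commits it only at close(), which is exactly what plain POSIX write/close semantics do not offer. This is an illustrative sketch; Adler-32 stands in here for whichever algorithm an instance is configured with, and the class name is hypothetical.

```python
import zlib

class ChecksummingFile:
    """Sketch of the open-close file transaction model: a running
    checksum is updated on every write and committed atomically
    when the file is closed."""

    def __init__(self):
        self._cksum = zlib.adler32(b"")
        self._committed = None

    def write(self, data: bytes):
        # fold each chunk into the running checksum
        self._cksum = zlib.adler32(data, self._cksum)

    def close(self):
        # only now does the checksum become part of the file's
        # metadata - the transaction commits on close
        self._committed = self._cksum
        return self._committed

f = ChecksummingFile()
f.write(b"hello ")
f.write(b"world")
assert f.close() == zlib.adler32(b"hello world")
```

Random-offset overwrites are what make this hard in practice: a running checksum only works for streaming writes, which is one reason the transaction model constrains the client.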

• path-oriented remote access protocols: XRootD, HTTP(S)/CERNBOX

• deletions can be immediately reverted ( recycle bin ) ( filesystems in general provide snapshots )

• file versioning ( as before: filesystems provide snapshots with former version )

The POSIX filesystem API does not easily match the above.

eosxd developments 2018

• functional improvement of multi-client cache-invalidation protocol, mtime consistency, negative kernel cache management

• improvements to manage, monitor and limit thousands of users and tens of thousands of clients per instance

• essential bug fixes for locking and memory issues, logging verbosity

• evolution of client-driven recovery to hide common hardware unavailability

• simplification of strong authentication and support for containerised applications ( eosfusebind not needed anymore - first time available in Kubernetes yesterday )

eosxd current status

• many important improvements made it into production during the past 6 months

• maturity & stability

• still more to be done → see known issues

in my opinion: this is difficult to judge from a user perspective - not everything works yet, which is easily summarised as 'it does not work', since filesystem perception is binary

eosxd use-case distribution

• trivial (no-problem) use cases: thousands of users

• medium complicated cases

• limited/unsupported use cases: a few tens (hundreds?) of 'power users', able to bring down an AFS volume or the EOS home service

use cases tend to accumulate at the extreme ends

eosxd difficult/unsupported use cases

• ROOT hadd is more scalable with "root://eos…/file.root" instead of /eos/…/file.root when merging hundreds or thousands of files - this has triggered memory explosions of eosxd in the past

• CONDOR batch job submission is broken, most likely due to the way credentials are switched in the condor daemon - under investigation - up to now low priority

• /eos/ set as home directory is currently not yet recommended, for various reasons - up to now low priority

• to compare: in AFS directories can be seen without Kerberos credentials, in EOS they cannot - policy changes are under consideration

• hundreds or thousands of batch jobs writing into a single directory, or running Python code with many modules from a single directory in a thousand parallel jobs ( creating Python .pyc files )

• sharing/updating SQLite databases on mount points with many batch jobs - SQLite uses byte-range locking, which does not scale well - eosxd locking is by default global (centralised), not local to each client node
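Byte-range locks of the kind SQLite takes can be demonstrated with plain POSIX fcntl locks: each lock covers only a slice of the file, so disjoint ranges can be held independently. On eosxd every such request travels to the central lock service, which is what makes heavy SQLite use from many batch jobs expensive. The function below is an illustrative sketch, not part of eosxd.

```python
import fcntl
import os
import tempfile

def demo_byte_range_locks(path):
    """Take and release exclusive POSIX locks on two disjoint byte
    ranges of one file - the locking pattern SQLite relies on."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        os.write(fd, b"\0" * 100)
        # two exclusive locks on disjoint ranges [0,10) and [50,60)
        fcntl.lockf(fd, fcntl.LOCK_EX, 10, 0)
        fcntl.lockf(fd, fcntl.LOCK_EX, 10, 50)
        # release both ranges again
        fcntl.lockf(fd, fcntl.LOCK_UN, 10, 0)
        fcntl.lockf(fd, fcntl.LOCK_UN, 10, 50)
        return True
    finally:
        os.close(fd)

fd, path = tempfile.mkstemp()
os.close(fd)
assert demo_byte_range_locks(path) is True
```

On a local filesystem each lockf call is a cheap in-kernel operation; on a network filesystem with centralised locking, each one is a round trip, so a lock-heavy workload multiplies across hundreds of jobs.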

• GIT usage of large repositories can get slow after a remount (logical inode problem)

eosxd upcoming improvements

• reduce eosxd memory consumption

• using AUTOFS to remove idle mounts - problem: autofs and FUSE seem not to work properly on SLC6

• due to FUSE limitations, directory meta-data once loaded cannot be removed, because the current working directories (CWDs) of processes are unknown to the daemon

• prototype solution: swap-in and -out directory meta-data from eosxd memory into local RocksDB
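The swap-in/swap-out idea can be sketched as a bounded in-memory map that evicts least-recently-used directory records to a local key-value store and transparently reloads them on access. This is a toy model under stated assumptions: sqlite3 stands in for RocksDB, and the class is hypothetical, not the eosxd prototype's code.

```python
import collections
import json
import sqlite3

class SwappingMetaDataStore:
    """Keep at most `capacity` directory records in memory; push the
    least-recently-used ones into a local KV store (sqlite3 stands in
    for RocksDB) and swap them back in on demand."""

    def __init__(self, capacity=2, db_path=":memory:"):
        self.capacity = capacity
        self.mem = collections.OrderedDict()  # inode -> record
        self.db = sqlite3.connect(db_path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS md (ino INTEGER PRIMARY KEY, rec TEXT)")

    def put(self, ino, rec):
        self.mem[ino] = rec
        self.mem.move_to_end(ino)
        while len(self.mem) > self.capacity:
            # swap out the LRU entry to the local store
            old_ino, old_rec = self.mem.popitem(last=False)
            self.db.execute("REPLACE INTO md VALUES (?, ?)",
                            (old_ino, json.dumps(old_rec)))

    def get(self, ino):
        if ino in self.mem:
            self.mem.move_to_end(ino)
            return self.mem[ino]
        row = self.db.execute(
            "SELECT rec FROM md WHERE ino=?", (ino,)).fetchone()
        if row is None:
            return None
        rec = json.loads(row[0])
        self.put(ino, rec)  # swap back in
        return rec

store = SwappingMetaDataStore(capacity=2)
for ino in (1, 2, 3):
    store.put(ino, {"ino": ino})
assert 1 not in store.mem            # evicted to the local store
assert store.get(1) == {"ino": 1}    # transparently swapped back in
```

The point of the prototype is exactly this trade: bounded memory at the cost of an occasional local-disk read, instead of unbounded in-memory meta-data that FUSE never lets go of.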

• better GIT support - drop logical inodes - currently local_ino != eos_ino

• prototype solution: eos_ino can be used client side if all creation calls are synchronous - will drop asynchronous creation - might use pre-creation to accelerate use-cases like untar

• large GIT repositories become unusable after a re-mount because GIT re-reads all files of the repository if an inode change is detected

eosxd current status

• known issues with the client version currently deployed at CERN, 4.4.17 - server 4.4.15/4.4.18

• concerning scalability

• the EOSHOME meta-data server works only with moderate parallel batch access, on a scale of a hundred jobs, when large directories (1k children) are in use

• issue: the meta-data server recomputed quota tables for all users per access, instead of only for the requesting user - fixed, but not yet deployed - therefore too many batch jobs create a DOS: e.g. 100 jobs listing a directory with 1000 sub-directories create 100k requests, with a single '' in the code creating 10^9 quota computations - effect: meta-data performance degradation (e.g. 14th of January)

• concerning data availability for updates

• issue: wrong error code mapping in data recovery - currently does not recover all unavailability cases

• concerning memory footprint on lxplus/lxbatch/aiadm - memory is scarce on these nodes - the footprint needs to be minimal

• concerning locked-up clients (D state) on lxplus/lxbatch/aiadm - might lead to VM deactivation - should not occur

• concerning local disk space for caching on lxplus/lxbatch/aiadm - in general local disk space is scarce and the IO overloaded

• 4.4.23 is in QA deployment

filesystem micro tests - continuously running functional performance test

[Figures: filesystem micro tests - a bad AFS case, a good AFS case, and most cases on par]

• allows tracking long-term performance metrics

• allows seeing performance fluctuations under changing load conditions

• allows identifying performance regressions of new client/server releases
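A filesystem micro test of the kind described above can be as small as timing rounds of create/stat/unlink on a mount point and reporting a per-operation latency. This is an illustrative sketch, not the actual test suite; the function name and the choice of operations are assumptions.

```python
import os
import tempfile
import time

def micro_test(mountpoint, n=100):
    """Time n rounds of create/stat/unlink in `mountpoint` and return
    the mean latency per operation in milliseconds."""
    t0 = time.monotonic()
    for i in range(n):
        path = os.path.join(mountpoint, f"probe.{i}")
        with open(path, "w") as f:
            f.write("x")
        os.stat(path)
        os.unlink(path)
    elapsed = time.monotonic() - t0
    return elapsed / (3 * n) * 1000.0  # ms per operation

# run against a local scratch directory as a baseline
with tempfile.TemporaryDirectory() as d:
    print(f"{micro_test(d):.3f} ms/op")
```

Run continuously against /tmp, an eosxd mount and an AFS mount, a probe like this yields exactly the long-term comparison curves the bullets above refer to.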

• helps to identify bugs in new client/server releases

eosxd data recovery

• in NFS/AFS - single data server up or down in a given volume

• in EOS - 30-300 data servers up or down, 1000-15000 disks ok, partially failing or broken

• some hardware not working is the standard case

• however, as a filesystem it is better to behave in a binary way - either 100% or 0%; 99.9% working is a disaster for applications and frustrating for users

• the MGM (meta-data server) behaves in a rather binary way

• the FSTs (storage servers) create the <100% working experience

eosxd data recovery

• files are stored either with replicas or as erasure encoded file stripes

• data recovery is driven by the eosxd client observing an IO error or timeout

• recovery is serialised between clients to avoid interference (still one exception to be handled …)

• eosxd clients can recover writes because writes are journaled on the local disk and journal cleanup happens only when all journal writes are confirmed from storage nodes

• recovery can be a) a simple journal-replay to new target filesystems, or b) a local staging of the existing data, re-upload and journal-replay to new target filesystems
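The journaled-write recovery described in the last two bullets can be sketched in miniature: every write is appended to a local journal, journal cleanup drops only entries the storage nodes have confirmed, and on failure the unconfirmed tail is replayed against a new target. The class below is a toy model with illustrative names, not the eosxd journal implementation.

```python
class WriteJournal:
    """Minimal model of journaled writes with replay: entries survive
    in the journal until acknowledged, and replay() re-sends whatever
    is still unconfirmed to a new target."""

    def __init__(self):
        self.entries = []  # list of [offset, data, acked]

    def write(self, offset, data):
        # every write is journaled locally before being sent out
        self.entries.append([offset, data, False])

    def ack(self, index):
        # a storage node confirmed this write; clean up the
        # acknowledged prefix of the journal
        self.entries[index][2] = True
        while self.entries and self.entries[0][2]:
            self.entries.pop(0)

    def replay(self, target):
        # journal-replay: re-send everything not yet confirmed
        for offset, data, _ in self.entries:
            target[offset] = data

journal = WriteJournal()
journal.write(0, b"aaaa")
journal.write(4, b"bbbb")
journal.ack(0)              # first write confirmed and cleaned up
new_target = {}
journal.replay(new_target)  # only the unconfirmed write is replayed
assert new_target == {4: b"bbbb"}
```

Keeping the journal on local disk is what makes case a) cheap: after a target failure the client needs no server-side state, only its own unacknowledged entries.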

• deciding when to try recovery is difficult, because many errors are only transient, where wait & retry would be the best option

Conclusion

• eosxd has been under development for 2 years, with measurable improvements

• some surprises on the way - not everything we planned turned out to be the best choice

• today we are working on the most difficult part: integration, scalability and usability

• from the recent evolution we are confident we can meet the requirements of many users

• we expect uninterruptible operation in case of most MGM or FST failures - long-term stability is absolutely crucial for a filesystem interface

• technical details coming in the tutorial session

THANK YOU - QUESTIONS ?