Global Software Distribution with CernVM-FS

Jakob Blomer, CERN

2016 CCL Workshop on Scalable Computing, October 19th, 2016

The Anatomy of a Scientific Software Stack (In High Energy Physics)

The stack, from changing (top) to stable (bottom):
∙ My Analysis Code: < 10 Python classes
∙ CMS Software Framework: O(1000) C++ classes
∙ Simulation and I/O Libraries: ROOT, Geant4, MC-XYZ
∙ CentOS 6 and Utilities: O(10) libraries

How to install (and install again) on...
∙ my laptop: compile into /opt, ∼ 1 week
∙ my local cluster: ask the sys-admin to install in /nfs/software, > 1 week
∙ someone else's cluster: ?

Beyond the Local Cluster

Worldwide LHC Computing Grid (WLCG) sites
∙ ∼ 200 sites: from 100 to 100 000 cores
∙ Different countries, institutions, batch schedulers, operating systems, ...
∙ Augmented by clouds, supercomputers, LHC@Home


What about Docker?

Example: R in Docker

$ docker pull r-base          −→ 1 GB image
$ docker run -it r-base
$ ... (fitting tutorial)      −→ only 30 MB used

It’s hard to scale Docker:

∙ iPhone app: 20 MB, changes every month, phones update staggered
∙ Docker image: 1 GB, changes twice a week, servers update synchronized

−→ Your preferred cluster or supercomputer might not run Docker

A File System for Software Distribution

[Diagram: a dedicated software file system sits next to the basic system utilities on top of the OS kernel; it is fed by a global HTTP cache hierarchy: the worker node's memory buffer (megabytes) → the worker node's disk cache (gigabytes) → a central web server holding the entire software stack (terabytes)]

Pioneered by CCL's GROW-FS for CDF at the Tevatron; refined in CernVM-FS, in production for CERN's LHC experiments and others

1. Single point of publishing
2. HTTP transport, access and caching on demand (see the fetch sketch below)
3. Important for scaling: bulk meta-data download (not shown)
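To make point 2 concrete, here is a hedged sketch of fetching repository state by hand with curl; the host and proxy names are placeholders, and the .cvmfspublished manifest and the data/<xx>/<hash> object layout are assumed to follow the standard CernVM-FS repository layout:

[ ~ ]$ # fetch the signed repository manifest through a site-local caching proxy
[ ~ ]$ curl --proxy http://squid.example.org:3128 http://web.example.org/cvmfs/icecube.opensciencegrid.org/.cvmfspublished
[ ~ ]$ # fetch a single content-addressed object; any HTTP cache on the path may answer
[ ~ ]$ curl --proxy http://squid.example.org:3128 -O http://web.example.org/cvmfs/icecube.opensciencegrid.org/data/80/6fbb67373e9...

Because the objects are immutable and addressed by their hash, ordinary web caches never need to invalidate them.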

One More Ingredient: Content-Addressable Storage

[Diagram: on the software publisher / master source, a read/write file system is transformed into content-addressed objects; HTTP transport with caching and replication delivers them to the worker nodes, which see a read-only file system]

Two independent issues

1 How to mount a file system (on someone else’s computer)?

2 How to distribute immutable, independent objects?

Content-Addressable Storage: Data Structures

Example: /cvmfs/icecube.opensciencegrid.org/amd64-gcc6.0/4.2.0/ChangeLog −→ compression, SHA-1 −→ object 806fbb67373e9...

Object Store
∙ Compressed files and chunks
∙ De-duplicated

File Catalog
∙ Directory structure, symlinks
∙ Content hashes of regular files
∙ Digitally signed ⇒ integrity, authenticity
∙ Time to live
∙ Partitioned / Merkle hashes (possibility of sub-catalogs)

A repository consists of the object store plus the file catalogs.

⇒ Immutable files, trivial to check for corruption, versioning
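As a rough sketch of the compress-then-hash scheme (illustrative only: the commands approximate what the server does, and the data/<xx>/ directory layout is an assumption about the storage format):

[ ~ ]$ gzip -c ChangeLog > ChangeLog.z                 # compress the file
[ ~ ]$ HASH=$(sha1sum ChangeLog.z | cut -d' ' -f1)     # hash the compressed content
[ ~ ]$ mkdir -p data/${HASH:0:2}
[ ~ ]$ cp ChangeLog.z data/${HASH:0:2}/${HASH:2}       # store the object under its own hash

Identical files map to the same object, which is where the de-duplication comes from; verifying an object is just re-hashing it.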

Transactional Publish Interface

[Diagram: a union file system (AUFS or OverlayFS) overlays a read/write scratch area on top of the read-only CernVM-FS mount; the resulting read/write interface is backed by a plain file system or S3]

Reproducible: as in git, you can always come back to this state

Publishing New Content
[ ~ ]# cvmfs_server transaction icecube.opensciencegrid.org
[ ~ ]# make DESTDIR=/cvmfs/icecube.opensciencegrid.org/amd64-gcc6.0/4.2.0 install
[ ~ ]# cvmfs_server publish icecube.opensciencegrid.org

Uses the cvmfs-server tools and an Apache web server
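A minimal round trip with the server tools might look as follows; the repository name is a placeholder, and details such as key handling and web server setup are glossed over:

[ ~ ]# cvmfs_server mkfs demo.example.org           # one-time: create the repository
[ ~ ]# cvmfs_server transaction demo.example.org    # open a writable transaction
[ ~ ]# cp -r /tmp/new-release /cvmfs/demo.example.org/
[ ~ ]# cvmfs_server publish demo.example.org        # hash, compress, sign, and commit the new revision
[ ~ ]# cvmfs_server abort demo.example.org          # or: discard the scratch area instead of publishing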

Content Distribution over the Web

Server side: stateless services

[Diagram: worker nodes fetch over HTTP from caching proxies in the data center (O(100) worker nodes per proxy server), which in turn fetch from mirror web servers (O(10) data centers per mirror server); proxies are load-balanced, clients fail over between proxies and between mirrors, mirrors are selected via Geo-IP, and a prefetched cache can serve worker nodes directly]
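On the client side, this proxy and mirror chain is a few lines of configuration. A hedged sketch of /etc/cvmfs/default.local with hypothetical host names: CVMFS_HTTP_PROXY separates load-balanced proxies with '|' and failover groups with ';', and CVMFS_SERVER_URL lists the mirror servers to fail over between.

CVMFS_REPOSITORIES=icecube.opensciencegrid.org
CVMFS_HTTP_PROXY="http://squid1.example.org:3128|http://squid2.example.org:3128;DIRECT"
CVMFS_SERVER_URL="http://stratum1-a.example.org/cvmfs/@fqrn@;http://stratum1-b.example.org/cvmfs/@fqrn@"

Here @fqrn@ stands for the fully qualified repository name, and DIRECT allows a final fallback to a direct connection.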

Mounting the File System: Fuse Client

Available for RHEL (and other Linux distributions) and OS X; on Intel, ARM, and Power

Works on most grids and on virtual machines in clouds

[Diagram: an open(/ChangeLog) call goes through glibc and the kernel VFS (inode and dentry caches) to the Fuse module and via /dev/fuse to the user-space CernVM-FS process, which issues an HTTP GET, inflates and verifies the object against its SHA-1 hash, and hands back a file descriptor]
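A typical Fuse client setup, sketched with yum as the package manager and DIRECT (no proxy) as a deliberately simplistic test configuration:

[ ~ ]# yum install cvmfs cvmfs-config-default        # client packages (assumes the CernVM-FS yum repository is configured)
[ ~ ]# cvmfs_config setup                            # configure autofs on /cvmfs
[ ~ ]# echo 'CVMFS_HTTP_PROXY=DIRECT' >> /etc/cvmfs/default.local
[ ~ ]# ls /cvmfs/atlas.cern.ch                       # autofs mounts the repository on first access
[ ~ ]# cvmfs_config probe                            # verify that the configured repositories mount cleanly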

Mounting the File System: Parrot Client

Available for Linux / Intel

Works on supercomputers, on opportunistic clusters, and in containers

[Diagram: inside the Parrot sandbox, the open(/ChangeLog) system call is intercepted by Parrot (via ptrace) instead of reaching the kernel; libparrot forwards it to libcvmfs, which issues the HTTP GET, inflates and verifies the object against its SHA-1 hash, and returns a file descriptor, all in user space without Fuse or other kernel support]
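A hedged usage sketch with Parrot from cctools; the proxy host is a placeholder and the repository is assumed to be in Parrot's built-in CVMFS repository list:

[ ~ ]$ export HTTP_PROXY="http://squid.example.org:3128"    # site-local caching proxy (placeholder)
[ ~ ]$ parrot_run ls /cvmfs/atlas.cern.ch                   # no root, no Fuse: Parrot serves /cvmfs itself
[ ~ ]$ parrot_run bash run_analysis.sh                      # run an entire job inside the Parrot sandbox (illustrative script name)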

Scale of Deployment

∙ > 350 million files under management
∙ > 50 repositories
∙ Installation service provided by OSG and EGI

Docker Integration

Under Construction!

[Diagram: today, the Docker daemon pulls and pushes whole containers from a Docker registry; a funded project targets an improved Docker daemon that fetches container contents file by file from the CernVM File System]

Client Cache Manager Plugins

Under Construction!

[Diagram: the cvmfs/fuse client and libcvmfs/parrot talk through a C library to a cache manager (a key-value store) over a transport channel (TCP, socket, ...); third-party plugins can provide back-ends such as in-memory caches or RAMCloud]

Draft C Interface

cvmfs_add_refcount(struct hash object_id, int change_by);
cvmfs_pread(struct hash object_id, int offset, int size, void *buffer);

// Transactional writing in fixed-sized chunks
cvmfs_start_txn(struct hash object_id, int txn_id, struct info object_info);
cvmfs_write_txn(int txn_id, void *buffer, int size);
cvmfs_abort_txn(int txn_id);
cvmfs_commit_txn(int txn_id);

Summary

CernVM-FS
∙ Global, HTTP-based file system for software distribution
∙ Works great with Parrot
∙ Optimized for small files and heavy meta-data workloads
∙ Open source (BSD), used beyond high-energy physics

Use Cases
∙ Scientific software
∙ Distribution of static data, e.g. conditions and calibration data
∙ VM / container distribution, cf. CernVM
∙ Building block for long-term data preservation

Source code: https://github.com/cvmfs/cvmfs Downloads: https://cernvm.cern.ch/portal/filesystem/downloads Documentation: https://cvmfs.readthedocs.org Mailing list: [email protected]

Backup Slides

CernVM-FS Client Tools

Fuse Module
∙ Normal namespace: /cvmfs/<repository>, e.g. /cvmfs/atlas.cern.ch
∙ Private mount as a user possible
∙ One process per Fuse module, plus a watchdog process
∙ Cache on local disk
∙ Cache LRU managed
∙ NFS export mode
∙ Hotpatch functionality: cvmfs_config reload

Mount helpers
∙ Set up the environment (number of file descriptors, access rights, ...)
∙ Used by autofs on /cvmfs
∙ Used by /etc/fstab or mount as root:
  mount -t cvmfs atlas.cern.ch /cvmfs/atlas.cern.ch

Diagnostics
∙ Nagios check available
∙ cvmfs_config probe
∙ cvmfs_config chksetup
∙ cvmfs_fsck
∙ cvmfs_talk, connect to a running instance

Parrot
∙ Built in by default
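A quick client health check, using the tools listed above; the repository name is just an example and the cvmfs_talk sub-command shown is one of several:

[ ~ ]# cvmfs_config chksetup                        # sanity-check the client configuration
[ ~ ]# cvmfs_config probe atlas.cern.ch             # mount and stat the repository
[ ~ ]# cvmfs_talk -i atlas.cern.ch cache size       # query the running instance about its local cache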

Experiment Software from a File System Viewpoint

Software directory tree (atlas.cern.ch): repo → software → x86_64-gcc43 → 17.1.0, 17.2.0, ...

[Chart: file system entries over two years, growing to roughly 15 × 10^6, broken down into files, duplicates, directories, and symlinks]

Between consecutive software versions: only ≈ 15 % new files

Fine-grained software structure (Conway's law)

Directory Organization

[Chart: fraction of files (%) versus directory depth (0 to 20) for Athena 17.0.1, CMSSW 4.2.4, and LCG Externals R60]

Typical (non-LHC) software: majority of files in directory level ≤ 5

Cumulative File Size Distribution

[Chart: cumulative file size distribution, file size in bytes (2^4 to 2^18, log scale) versus percentile (0 to 100), for files requested by ATLAS, CMS, LHCb, and ALICE, compared to UNIX and web server workloads]

cf. Tanenbaum et al. 2006 for “Unix” and “Webserver”

Good compression rates (factor 2–3)

Runtime Behavior

Working Set
∙ ≈ 10 % of all available files are requested at run time
∙ Median file size: < 4 kB

Flash Crowd Effect
∙ Up to 500 kHz meta-data request rate
∙ Up to 1 kHz file open request rate

[Diagram: many jobs hitting a shared /share software area at once, effectively a distributed denial of service on the shared file server]

Software vs. Data

Based on ATLAS Figures 2012

Software                            Data
POSIX interface                     put, get, seek, streaming
File dependencies                   Independent files
10^7 objects                        10^8 objects
10^12 B volume                      10^16 B volume
Whole files                         File chunks
Absolute paths                      Any mount point
Open source                         Confidential
WORM ("write once, read many")      Versioned
