The CernVM File System and the CernVM Virtual Appliance
Jakob Blomer CERN
pre-GDB April 11th, 2017
[email protected] CernVM-FS 1 / 18

The CernVM File System
At a Glance
∙ Network file system optimized for software distribution
∙ Emerged from CERN R&D 2008–2011
∙ Mission-critical system for the four big LHC experiments: >100 M files to >100 000 nodes

Best used for
∙ Many small files, meta-data heavy workloads
∙ Public data
∙ Single point of publication, many globally distributed readers
∙ “Cachable”, i.e. only a subset of files is needed at any given moment
∙ Examples: software, detector conditions data, static data (geometry, PDFs)
The Problem with Packaging Software
Example: in Docker

$ docker pull r-base            → 1 GB image
$ docker run -it r-base
$ ... (fitting tutorial)        → only 30 MB used

(figure: container “App” layered on top of Linux libraries)

It’s hard to scale:

                iPhone App               Docker Image
Size            20 MB                    1 GB
Update rate     changes every month      changes twice a week
Rollout         phones update staggered  servers update synchronized

sed s/Docker/(Package Manager|VM|Tarball)/
The Problem with a Shared Software Area
Working Set
∙ Not more than O(100 MB) of software requested for any task
∙ Very meta-data heavy: look for 1 000 shared libraries in 25 search paths

Flash Crowd Effect (“meta-data DDoS”)
∙ O(kHz) file open request rate
∙ O(MHz) meta-data request rate

(figure: many worker nodes hitting a shared /share software area at once)
A Purpose-Built Software File System
File System Approach to Software Distribution
∙ Software producers do not package images; they copy files to CernVM-FS
∙ Clients do not download images; they read individual files from /cvmfs/... as they are accessed
∙ Files are cached all along the network path
→ In the example above: machines read only 30 MB from the file system
(figure: worker nodes access /cvmfs/... over HTTP via web proxies in the data center, which in turn fetch from HTTP web servers)
End-to-End Picture
(figure: a read/write file system at the software publisher / master source is transformed into content-addressed objects, which are cached and replicated over HTTP transport and mounted as a read-only file system on the worker nodes)
Two independent issues:
1. How to mount a file system?
2. How to distribute immutable, independent objects?
Transactional Publish Interface
(figure: a union file system (AUFS or OverlayFS) combines a read/write scratch area with the read-only CernVM-FS mount to form the read/write interface; published data goes to a file system or S3)

Publishing New Content
[ ~ ]# cvmfs_server transaction icecube.opensciencegrid.org
[ ~ ]# make DESTDIR=/cvmfs/icecube.opensciencegrid.org/amd64-gcc6.0/4.2.0 install
[ ~ ]# cvmfs_server publish icecube.opensciencegrid.org

Uses the cvmfs-server tools and an Apache web server
Reproducible: as in git, you can always come back to this state
Mounting the File System
Client: Fuse
∙ Available for RHEL, Ubuntu, OS X; Intel, ARM, Power
∙ Works on most grids and virtual machines (cloud)

(figure: an open(/ChangeLog) call passes from glibc through the kernel VFS, its inode and dentry caches, and /dev/fuse to the user-space CernVM-FS process via libfuse; the client fetches the object with HTTP GET, inflates and verifies it against its SHA-1 hash, and returns a file descriptor)
Parrot: File System in Pure User Space
An alternative to Fuse
∙ Available for Linux / Intel
∙ Works on supercomputers and opportunistic clusters

(figure: the same open(/ChangeLog) call is intercepted inside the Parrot sandbox by libparrot and served by libcvmfs entirely in user space, again via HTTP GET plus inflate and verify against the SHA-1 hash; no kernel module or /dev/fuse is needed)
The CernVM Virtual Appliance at a Glance
CernVM (Container or VM)
∙ Curated Linux platform with all dependencies to run LHC applications
∙ RHEL 6/7 compatible
∙ “Batteries included”: ready for most IaaS clouds
∙ Strongly versioned (via CernVM-FS)
∙ Graphical (development environment) and batch flavors
Reminder: Building Blocks of CernVM
Twofold system: μCernVM boot loader + OS delivered by CernVM-FS

(figure: a 20 MB boot loader (kernel, AUFS, Fuse, and an initrd with CernVM-FS and μContextualisation) mounts the OS + extras (EL 4, EL 5, EL 6, EL 7) from CernVM-FS; an AUFS writable overlay disk provides scratch space; user data comes from EC2, OpenStack, ...; experiment repositories such as atlas and alice are mounted alongside)
Use Cases
CernVM: complete and portable environment for developing and running HEP data processing tasks
Use Cases
1. IaaS Clouds
2. Development Environment
3. Volunteer Computing
4. Long-Term Analysis Preservation
5. Outreach & Education

1. Infrastructure-as-a-Service Clouds, examples:
∙ ATLAS online farm
∙ Cloud resources seamlessly integrated with experiment task queues (e.g. ATLAS CloudScheduler, LHCb VAC)
∙ ALICE software release testing on CERN OpenStack
∙ Commercial providers (Amazon, Microsoft, ...)
2. Development Environment: interactive users, via VirtualBox and the CernVM Launcher
3. Volunteer Computing: LHC@Home projects
4. Long-Term Analysis Preservation: ALEPH software in CernVM
Demonstrates that VMs can bridge 15+ years
5. Outreach & Education: CERN OpenData Portal, CERN@School
CernVM Hypervisor Support Status
The success of CernVM is largely based on the fact that it runs in practically all cloud environments.
Hypervisor / Cloud Controller    Status
VirtualBox                       ✓
VMware                           ✓
KVM                              ✓
Xen                              ✓
Microsoft Hyper-V                ✓
Vagrant                          ✓
OpenStack                        ✓
OpenNebula                       ✓
CloudStack                       ✓
Amazon EC2                       ✓
Google Compute Engine            ✓
Microsoft Azure                  ✓
Docker                           ✓
CernVM as a Container
Root file system (/) is assembled from the host’s /cvmfs/cernvm-prod.cern.ch:
∙ usr: symlink into /cvmfs/cernvm-prod.cern.ch/usr
∙ lib64: symlink into /cvmfs/cernvm-prod.cern.ch/lib64
∙ etc: copy of /cvmfs/cernvm-prod.cern.ch/etc
∙ var: copy of /cvmfs/cernvm-prod.cern.ch/var
∙ tmp, ...: local scratch

Limitations: can be used to run tasks, but does not allow derived containers
Docker Graph Driver Plugin
Work by N Hardi, expected H2/2017
(figure: on the host machine, a CernVM-FS graph driver plugin sits next to the Docker daemon and its AUFS graph driver; the Docker client pulls regular layers from the Docker registry, while CVMFS layers are fetched over HTTP from a repository backed by S3)
Regular image            Thin image
read-write layer         thin image layer
local read-only layer    read-only layer on CVMFS

Summary
CernVM-FS
∙ Global, HTTP-based file system
∙ Optimized for software: small files, heavy meta-data workload
∙ Open source (BSD)
∙ Successful collaborations beyond high-energy physics

CernVM
∙ μCernVM + OS template on CernVM-FS + contextualization
∙ 20 MB image that adapts
∙ Image for IaaS clouds, volunteer computing, long-term data preservation, development environments
Possibilities for Collaboration and Re-use
∙ OSG and EGI operate managed CernVM-FS “software installation services” for the grid
∙ The cvmfs and cvmfs-server packages are generic; keys, server addresses, repository names, and configuration come with the cvmfs-config-... packages
  e.g. EUCLID (astrophysics) operates an independent CernVM-FS infrastructure with ~10 sites
∙ Collaborative development on GitHub: contributions from U of Nebraska, FermiLab, U of Notre Dame
  e.g. features added for LIGO data distribution, improved Debian support
∙ Plugin interfaces:
  ∙ Cache manager for exotic deployments, e.g. supercomputers (upcoming 2.4 release)
  ∙ Client authorization helpers for “secure CernVM-FS” setups, e.g. possibility to implement OAuth instead of X.509
∙ Re-use of CernVM: through contextualization and through custom operating system templates

Links
Source code:      https://github.com/cvmfs/cvmfs
                  https://github.com/cernvm
Downloads:        https://cernvm.cern.ch/portal/filesystem/downloads
                  https://cernvm.cern.ch/portal/downloads
Documentation:    https://cvmfs.readthedocs.org
Mailing lists:    [email protected], [email protected]
JIRA bug tracker: https://sft.its.cern.ch/jira/projects/CVM
Backup Slides
Content-Addressable Storage: Data Structures
Object Store
∙ Compressed files and chunks
∙ De-duplicated

File Catalog
∙ Directory structure, symlinks
∙ Content hashes of regular files
∙ Digitally signed ⇒ integrity, authenticity
∙ Time to live
∙ Partitioned / Merkle hashes (possibility of sub-catalogs)

(figure: /cvmfs/icecube.opensciencegrid.org/amd64-gcc6.0/4.2.0/ChangeLog is compressed and hashed into repository object 806fbb67373e9...)
⇒ Immutable files, trivial to check for corruption, versioning
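The compress-and-hash step can be sketched in a few lines of Python. This is a simplified illustration of content addressing, not the exact CernVM-FS on-disk format; the function names are invented:

```python
import hashlib
import zlib

def publish_object(content: bytes) -> tuple[str, bytes]:
    """Store content as a compressed, content-addressed object.

    Simplified sketch: the object name is the SHA-1 of the content,
    the stored blob is the zlib-compressed content. The real
    CernVM-FS object store differs in detail.
    """
    object_id = hashlib.sha1(content).hexdigest()
    return object_id, zlib.compress(content)

def fetch_object(object_id: str, blob: bytes) -> bytes:
    """Inflate and verify, as a client would after an HTTP GET."""
    content = zlib.decompress(blob)
    if hashlib.sha1(content).hexdigest() != object_id:
        raise ValueError("corrupt object")
    return content

oid, blob = publish_object(b"ChangeLog v4.2.0")
assert fetch_object(oid, blob) == b"ChangeLog v4.2.0"
# Identical content always maps to the same object id -> de-duplication.
assert publish_object(b"ChangeLog v4.2.0")[0] == oid
```

Because the object name is derived from the content, objects are immutable by construction and corruption is detected on every read.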
Partitioning of Meta-Data
∙ Meta-data partitioning is critical to performance
∙ Automatic approaches? Should we add support for hard quotas (volume, number of entries)?

(figure: example catalog tree with sub-catalogs per architecture (x86_64, i586), per package (AliRoot, ROOT, Geant3), and per version (v4-21-16-AN, v5-27-06d, v1-11-21))

Partitioning is up to the repository owner (.cvmfscatalog marker):
∙ locality by software version
∙ locality by frequency of changes
CernVM-FS In Containers
Bind Mount
  docker run -v /cvmfs:/cvmfs:shared ...
or
  docker run -v /cvmfs/sft.cern.ch:/cvmfs/sft.cern.ch ...
∙ Cache shared by all containers on the same host

Docker Volume Driver
  https://gitlab.cern.ch/cloud-infrastructure/docker-volume-cvmfs/
  docker run --volume-driver cvmfs -v cms.cern.ch:/cvmfs/cms.cern.ch ...
∙ Integrates with Kubernetes

From Inside the Container
  docker run --privileged ...
∙ Probably not much used in practice
CernVM-FS Client Tools
Fuse Module
∙ Normal namespace: /cvmfs/...

Mount helpers
∙ Setup environment (number of file ...)
Distributed Publish Interface – Under Construction
(figure: remote application interface machines, e.g. for data taking, compress and hash new content into signed object packs; a gateway server with a REST interface hands out leases/tickets, e.g. a 2-hour lease on /ocdb/2018/run001, authenticated with pre-shared keys (K1, K2, ..., KG) against a user and lease database, e.g. Mnesia; objects travel over HTTP to the Stratum 0 master storage and are replicated to Stratum 1)
CernVM-FS for Data Federations
Contribution from Brian Bockelman & Derek Weitzel / OSG
Use CernVM-FS as a POSIX-compliant, consistent, cryptographically secured namespace for data files.
(figure: a namespace agent grafts experiment data from Site A, Site B, and cloud storage into a secure POSIX namespace, accessed over HTTPS with X.509 authorization; a bookkeeper, a namespace gateway, and a web API coordinate the mapping)
Note the limitations: CernVM-FS is not designed for maximum throughput.

Authorization Helper Interface
(figure: the cvmfs2 Fuse module sends the “membership” string plus uid, gid, and pid, optionally with an SSL certificate, to a separate authz helper process and receives allow/deny plus a TTL; verdicts are kept in an authz cache)
Authz Helper
∙ Separate process, communicates via stdin/stdout
∙ Controls access to a repository based on the uid, gid, and pid of the accessing process
∙ The “membership” string and which helper to use are stored in the root catalog
∙ Can pass an X.509 proxy certificate for HTTPS authentication
∙ Controls the cache lifetime of the information
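The separate-process pattern can be illustrated with a toy helper. The line protocol below is invented for illustration only; the actual CernVM-FS authz helper wire protocol differs:

```python
import sys

# Hypothetical line protocol: one request per line, "uid gid pid membership";
# the helper answers "allow <ttl>" or "deny <ttl>". Illustrative only, the
# real CernVM-FS authz helper protocol is different.
ALLOWED_GIDS = {1000, 1307}  # example: gids granted access

def decide(request: str) -> str:
    uid, gid, pid, membership = request.split()
    verdict = "allow" if int(gid) in ALLOWED_GIDS else "deny"
    ttl = 120  # seconds the Fuse module may cache this verdict
    return f"{verdict} {ttl}"

def main() -> None:
    # The Fuse module would spawn this process and talk over stdin/stdout.
    for line in sys.stdin:
        if line.strip():
            print(decide(line.strip()), flush=True)

if __name__ == "__main__":
    main()
```

The point of the design is isolation: access policy lives in a small replaceable process, so a site can swap in, say, an OAuth check without touching the Fuse module.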
CernVM-FS Cache Plugins
Possible 3rd-party plugins

(figure: the cvmfs/fuse and libcvmfs/parrot clients talk through a C library and a transport channel (UNIX/TCP socket) to the cache manager, an external process, at up to 100 k calls/s and 4.5 GB/s; possible backends include RAMCloud, Cassandra, memory, Ceph, ...)
Motivation for cache plugins
∙ More flexibility for client deployment:
  ∙ Diskless server farms
  ∙ HPC “burst buffers”: utilize fast, possibly non-POSIX storage
∙ Opens the door to external contributions!

For standard deployment on the Grid, nothing changes!
CernVM-FS Cache Plugin C Interface
Callbacks to be implemented by plugin developer
// Reading data
int cvmcache_chrefcnt(struct hash object_id, int change_by);
int cvmcache_object_info(struct hash object_id, struct object_info *info);
int cvmcache_pread(struct hash object_id, int offset, int size, void *buffer);

// Transactional writing in fixed-sized parts
int cvmcache_start_txn(struct hash object_id, int txn_id, struct info object_info);
int cvmcache_write_txn(int txn_id, void *buffer, int size);
int cvmcache_abort_txn(int txn_id);
int cvmcache_commit_txn(int txn_id);

// Optional: quota management
int cvmcache_shrink(int shrink_to, int *used);
int cvmcache_listing_begin(...);
int cvmcache_listing_next(int listing_id, ...);
int cvmcache_listing_end(int listing_id);
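To make the transaction semantics concrete, here is a toy in-memory cache manager in Python that mirrors the shape of the callbacks above. A real plugin implements the C interface; this sketch is purely illustrative:

```python
class ToyCacheManager:
    """In-memory object cache with transactional writes, loosely
    mirroring the cvmcache_* callback shape (illustrative only)."""

    def __init__(self) -> None:
        self.objects = {}  # object_id -> bytes
        self.refcnt = {}   # object_id -> reference count
        self.txns = {}     # txn_id -> (object_id, bytearray)

    def chrefcnt(self, object_id: str, change_by: int) -> int:
        """Pin/unpin an object; returns 0 on success, -1 if unknown."""
        if object_id not in self.objects:
            return -1
        self.refcnt[object_id] = self.refcnt.get(object_id, 0) + change_by
        return 0

    def pread(self, object_id: str, offset: int, size: int) -> bytes:
        """Partial read from a committed object."""
        return self.objects[object_id][offset:offset + size]

    # Transactional writing in fixed-sized parts: an object becomes
    # visible only after commit, never half-written.
    def start_txn(self, object_id: str, txn_id: int) -> int:
        self.txns[txn_id] = (object_id, bytearray())
        return 0

    def write_txn(self, txn_id: int, buffer: bytes) -> int:
        self.txns[txn_id][1].extend(buffer)
        return 0

    def abort_txn(self, txn_id: int) -> int:
        del self.txns[txn_id]
        return 0

    def commit_txn(self, txn_id: int) -> int:
        object_id, data = self.txns.pop(txn_id)
        self.objects[object_id] = bytes(data)
        return 0
```

The transactional write path is what lets a plugin back the cache with non-POSIX storage: the store only ever sees whole, committed objects.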
Experiment Software from a File System Viewpoint
(figure: file-system statistics of the atlas.cern.ch software directory tree over 2 years, split into files, directories, symlinks, and duplicates; example path: repo/software/x86_64-gcc43/17.1.0, 17.2.0, ...)

Fine-grained software structure (Conway’s law)
Between consecutive software versions: only ≈ 15 % new files
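Content addressing turns this overlap between releases directly into storage savings: only objects whose hash is new need to be stored. A small sketch, with paths and contents invented for illustration:

```python
import hashlib

def object_ids(tree: dict[str, bytes]) -> set[str]:
    """Map a {path: content} tree to its set of content hashes."""
    return {hashlib.sha1(data).hexdigest() for data in tree.values()}

release_1 = {"lib/libCore.so": b"core-v1",
             "bin/athena":     b"launcher",
             "etc/conf":       b"a=1"}
release_2 = {"lib/libCore.so": b"core-v2",   # changed
             "bin/athena":     b"launcher",  # unchanged -> de-duplicated
             "etc/conf":       b"a=1"}       # unchanged -> de-duplicated

new_objects = object_ids(release_2) - object_ids(release_1)
print(f"{len(new_objects)} of {len(object_ids(release_2))} objects are new")
# prints "1 of 3 objects are new"
```

At the ≈ 15 % delta quoted above, publishing a new release costs roughly a seventh of its nominal size in new storage.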
Directory Organization
(figure: fraction of files [%] versus directory depth 0–20 for Athena 17.0.1, CMSSW 4.2.4, and LCG Externals R60)

Typical (non-LHC) software: the majority of files sit at directory depth ≤ 5
Cumulative File Size Distribution
(figure: cumulative file-size distribution, file size [B] from 2^4 to 2^18 versus percentile 0–100, for ATLAS, CMS, LHCb, ALICE and the files actually requested, compared with “Unix” and “Web server” workloads)

cf. Tanenbaum et al. 2006 for “Unix” and “Web server”
Good compression rates (factor 2–3)

The High Energy Physics Software Stack
(figure: the software stack, from frequently changing to stable: My Analysis Code, <10 Python classes; CMS Software Framework, O(1000) C++ classes; Simulation and I/O Libraries: ROOT, Geant4, MC-XYZ; CentOS 6 and utilities, O(10) libraries)

Key Figures
∙ Hundreds of (novice) developers
∙ Hundred million files
∙ 1 TB / day of nightly builds
∙ Daily production releases, which remain available “eternally”
Software vs. Data
Based on ATLAS Figures 2012
Software                          Data
POSIX interface                   put, get, seek, streaming
File dependencies                 Independent files
10^7 objects                      10^8 objects
10^12 B volume                    10^16 B volume
Whole files                       File chunks
Absolute paths                    Any mountpoint
Open source                       Confidential
WORM (“write-once-read-many”)     Versioned
CernVM Build Process: EL on CernVM-FS
Maintenance of the repository should not become a Linux distributor’s job.
But: it should be reproducible and well-documented.

Idea: automatically generate a fully versioned, closed package list from a “shopping list” of unversioned packages.

(figure: Scientific Linux, EPEL, and CernVM extras (≈ 50 packages) feed a dependency-closure step; the resulting packages are installed with yum onto CernVM-FS, and the package archive is kept)

Formulate dependencies as an integer linear program
CernVM Build Process: Package Dependency ILP
Normalized (integer) linear program:

Minimize c^T x = c_1 x_1 + ... + c_n x_n
subject to A x ≤ b, where A = (a_ij) is the m × n constraint matrix and x = (x_1, ..., x_n).

Here: every available (package, version) pair is mapped to an x_i ∈ {0, 1}.
Cost vector: newer versions are cheaper than older versions.
(Obviously: fewer packages are cheaper than more packages.)
Dependencies: package x_a requires x_b or x_c: x_b + x_c − x_a ≥ 0.
Conflicts: packages x_a and x_b conflict: x_a + x_b ≤ 1.
(...)

Figures
≈ 17 000 available packages (n = 17 000), 500 packages on the “shopping list”
≈ 160 000 inequalities (m = 160 000), solving time < 10 s (glpk)
Meta RPM: ≈ 1 000 fully versioned packages, dependency closure
Idea: Mancinelli, Boender, di Cosmo, Vouillon, Durak (2006)
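The encoding can be checked on a toy instance. The brute-force sketch below (package names and costs invented) enumerates all 0/1 assignments instead of calling an ILP solver such as glpk:

```python
from itertools import product

# Hypothetical (package, version) pairs; one 0/1 variable each.
pkgs = ["root-6.08", "root-6.06", "gcc-6.2", "gcc-4.9"]
# Newer versions are cheaper than older versions.
cost = {"root-6.08": 1, "root-6.06": 3, "gcc-6.2": 1, "gcc-4.9": 2}

def feasible(x: dict[str, int]) -> bool:
    # Shopping list: some version of root must be installed.
    if x["root-6.08"] + x["root-6.06"] < 1:
        return False
    # Dependency: root-6.08 requires gcc-6.2 or gcc-4.9,
    # encoded as x_b + x_c - x_a >= 0.
    if x["gcc-6.2"] + x["gcc-4.9"] - x["root-6.08"] < 0:
        return False
    # Conflict: the two gcc versions exclude each other (x_a + x_b <= 1).
    if x["gcc-6.2"] + x["gcc-4.9"] > 1:
        return False
    return True

best = min(
    (dict(zip(pkgs, bits)) for bits in product((0, 1), repeat=len(pkgs))
     if feasible(dict(zip(pkgs, bits)))),
    key=lambda x: sum(cost[p] * x[p] for p in pkgs),
)
print([p for p in pkgs if best[p]])  # prints ['root-6.08', 'gcc-6.2']
```

The real problem uses the same constraints, just at a scale (n ≈ 17 000) where an actual ILP solver replaces the exhaustive search.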
CernVM Contextualization
User-Data Sources
∙ Well-known web server
∙ ISO image
∙ HDD image (VirtualBox OVA format)
∙ CernVM Launcher user-provided snippet
∙ Baked into the image

User-Data Formats
∙ cloud-init (EC2, GCE, OpenStack)
∙ μCernVM bootloader format
∙ amiconfig (OpenNebula, HEPiX)
∙ Mixable in MIME multipart user-data

Plugins: CernVM-FS, condor, cctools, CernVM main user, CernVM GUI (desktop icons, autostart, ...), inject grid certificate, grid UI version
Sample Context

user-data.txt:
[cernvm]
organisations=ALICE
repositories=alice,alice-ocdb,sft
shell=/bin/bash
config_url=http://cernvm.cern.ch/config
users=alice:alice:ion
edition=Desktop
keyboard=us
startXDM=on
auto_login=on
[ucernvm-begin]
cvmfs_tag=cernvm-system-3.1.1.4
[ucernvm-end]
Boot on CERN OpenStack:
nova boot AliceVM --image "cvm3" --flavor m1.small \
  --key-name ssh-key --user-data user-data.txt