The CernVM File System and the CernVM Virtual Appliance

Jakob Blomer, CERN

pre-GDB April 11th, 2017

The CernVM File System

At a Glance
∙ Global, HTTP-based file system optimized for software distribution

∙ Emerged from CERN R&D 2008 – 2011
∙ Mission-critical system for the four big LHC experiments: >100 M files delivered to >100 000 nodes

Best used for
∙ Many small files, meta-data heavy workloads
∙ Public data
∙ Single point of publication, many globally distributed readers
∙ "Cachable" data, e.g. only a subset of files needed at any given moment
∙ Examples: software, detector conditions data, static data (geometry, PDFs)

The Problem with Packaging Software

Example: a fitting tutorial in R, run in Docker

$ docker pull r-base          → 1 GB image
$ docker run -it r-base
$ ... (fitting tutorial)      → only 30 MB used

It’s hard to scale:

iPhone App:   20 MB, changes every month, phones update staggered
Docker Image: 1 GB, changes twice a week, servers update synchronized

sed s/Docker/(|VM|Tarball)/

The Problem with a Shared Software Area

Working Set
∙ Not more than O(100 MB) of software requested for any given task
∙ Very meta-data heavy: look for 1 000 shared libraries in 25 search paths

[Figure: shared software area (/share) hit by a flash crowd effect: O(MHz) meta-data request rate and O(kHz) file request rate, effectively a distributed denial of service ("dDoS") on the shared file server]

A Purpose-Built Software File System

File System Approach to Software Distribution
∙ Software producers do not package images; they copy files to CernVM-FS
∙ Clients do not download images; they load individual files from /cvmfs/... as they are accessed
∙ Files are cached all along the network path
→ In the example above: machines only read 30 MB from the file system (see the sketch after the figure below)

[Figure: worker nodes access /cvmfs/... through web proxies inside the data center, which fetch content from the web servers, all over HTTP]
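To make the "only what is accessed" behaviour concrete, a small hedged sketch; the repository name and file are merely examples, and cvmfs_config stat reports the actual cache usage:

$ ls /cvmfs/sft.cern.ch                # autofs mounts the repository, meta-data comes on demand
$ cat /cvmfs/sft.cern.ch/README        # hypothetical file: only files that are actually opened get downloaded
$ cvmfs_config stat -v sft.cern.ch     # shows how little data the local cache really holds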

End-to-End Picture

[Figure: end-to-end picture: a read/write file system at the software publisher (master source) is transformed into content-addressed objects, which are cached and replicated over HTTP transport and mounted as a read-only file system on the worker nodes]

Two independent issues
1. How to mount a file system?
2. How to distribute immutable, independent objects?

Transactional Publish Interface

[Figure: a union file system (AUFS or OverlayFS) combines the read-only CernVM-FS mount with a read/write scratch area; the read/write interface stores the published content on a file system or S3]

Publishing New Content

[ ~ ]# cvmfs_server transaction icecube.opensciencegrid.org
[ ~ ]# make DESTDIR=/cvmfs/icecube.opensciencegrid.org/amd64-gcc6.0/4.2.0 install
[ ~ ]# cvmfs_server publish icecube.opensciencegrid.org

Uses cvmfs-server tools and an Apache web server


Reproducible: as in git, you can always come back to this state
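For context, a hedged sketch of the one-time server-side setup that precedes such transactions; the package installation details and the repository owner account are assumptions:

[ ~ ]# yum install cvmfs-server httpd        # publish tools + Apache
[ ~ ]# systemctl start httpd
[ ~ ]# cvmfs_server mkfs -o repo_owner icecube.opensciencegrid.org   # creates the union mount under /cvmfs and the web export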


Mounting the File System

Client: Fuse

Available for RHEL, OS X; Intel, ARM, Power
Works on most grids and virtual machines (cloud)

[Figure: open(/ChangeLog) goes from the application through glibc and the kernel VFS to /dev/fuse; the CernVM-FS Fuse module fetches the file with an HTTP GET, inflates and verifies it (SHA-1), and hands back a file descriptor; the kernel inode and dentry caches keep hot meta-data]
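A minimal client-side setup sketch, assuming the default configuration packages and a site proxy at my-squid.example.org (both assumptions):

[ ~ ]# yum install cvmfs cvmfs-config-default
[ ~ ]# cat /etc/cvmfs/default.local
CVMFS_REPOSITORIES=atlas.cern.ch,sft.cern.ch
CVMFS_HTTP_PROXY="http://my-squid.example.org:3128"
[ ~ ]# cvmfs_config setup              # wires the Fuse module into autofs on /cvmfs
[ ~ ]# cvmfs_config probe              # mounts and checks the configured repositories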

Parrot: File System in Pure User Space

Alternative to Fuse

Available for Linux / Intel

Works on opportunistic clusters

[Figure: the application runs inside a Parrot sandbox; open(/ChangeLog) is intercepted in user space by libparrot and served by libcvmfs, which fetches the file with an HTTP GET, inflates and verifies it (SHA-1), and hands back a file descriptor, without requiring a kernel Fuse module]
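A hedged usage sketch, assuming cctools is installed and the repository is known to Parrot's built-in CernVM-FS configuration; the proxy environment variable is likewise an assumption:

$ export HTTP_PROXY="http://my-squid.example.org:3128"
$ parrot_run bash
$ ls /cvmfs/atlas.cern.ch              # served by libcvmfs inside the sandbox: no root privileges, no kernel module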

The CernVM Virtual Appliance at a Glance

CernVM (Container or VM)
∙ Curated Linux platform with all dependencies to run LHC applications
∙ RHEL 6/7 compatible
∙ "Batteries included": ready for most IaaS clouds
∙ Strongly versioned (CernVM-FS)
∙ Graphical (development environment) and batch flavors

Reminder: Building Blocks of CernVM

Twofold system: µCernVM boot loader + OS delivered by CernVM-FS

[Figure: the ~20 MB µCernVM boot loader (kernel, AUFS, Fuse, plus an initrd with CernVM-FS and µContextualisation) mounts the OS + extras (EL 4 / EL 5 / EL 6 / EL 7) from CernVM-FS repositories (atlas, alice, ...); an AUFS writable overlay disk and scratch space hold local changes; user data from the cloud (EC2, OpenStack, ...) contextualises the instance]
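A hedged sketch of booting the boot loader locally with KVM; the ISO file name is an assumption, and the operating system itself is then fetched on demand from CernVM-FS:

$ qemu-img create -f qcow2 scratch.qcow2 20G          # local scratch/overlay disk
$ qemu-system-x86_64 -enable-kvm -m 2048 \
      -cdrom ucernvm-prod.iso -hda scratch.qcow2      # ~20 MB boot loader image (assumed file name)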

Use Cases

CernVM: complete and portable environment for developing and running HEP data processing tasks

1. IaaS clouds: various clouds, e.g. the ATLAS online farm; cloud resources seamlessly integrated with experiment task queues (e.g. ATLAS CloudScheduler, LHCb VAC); ALICE software release testing on CERN OpenStack; commercial providers (Amazon, Microsoft, ...)

2. Development environment: interactive users via VirtualBox and the CernVM Launcher

3. Volunteer computing: LHC@Home projects

4. Long-term analysis preservation: ALEPH software in CernVM, demonstrating that VMs can bridge 15+ years

5. Outreach & education: CERN OpenData Portal, CERN@School

CernVM Support Status

The success of CernVM is largely based on the fact that it runs in practically all cloud environments.

Hypervisor / cloud controller support status: VirtualBox, VMware, KVM, Microsoft Hyper-V, Vagrant, OpenStack, OpenNebula, CloudStack, Amazon EC2, Google Compute Engine, Microsoft Azure, and Docker are all supported.

CernVM as a Container

Root file system (/) provided from the host's CernVM-FS mount /cvmfs/cernvm-prod.cern.ch:
∙ /usr and /lib64 are symlinks into the repository
∙ /etc and /var are copies
∙ /tmp and the remaining directories are local

Limitations: can be used to run tasks, but does not allow derived containers
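As an illustration only, launching such a container could look roughly like this; the image name "cernvm" and the bind-mounted repository path are assumptions, not an official recipe:

$ docker run -it \
      -v /cvmfs/cernvm-prod.cern.ch:/cvmfs/cernvm-prod.cern.ch:ro \
      cernvm /bin/bash                 # hypothetical thin image with the symlink/copy layout above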

Docker Graph Driver Plugin

Work by N Hardi, expected H2/2017

[Figure: on the host machine, the Docker daemon talks over the plugin API to a CernVM-FS graph driver plugin; the plugin combines the CernVM-FS client, AUFS, and an S3 client and reaches the CernVM-FS repository / S3 storage over HTTP via the internet, while the Docker client and registry stay unchanged. A regular image stacks a read-write layer on local read-only layers; a thin image stacks it on a thin image layer whose read-only layers live on CVMFS.]

Summary

CernVM-FS
∙ Global, HTTP-based file system
∙ Optimized for software, small files, heavy meta-data workload
∙ Open source (BSD)
∙ Successful collaborations beyond high-energy physics

CernVM
∙ µCernVM + OS template on CernVM-FS + contextualization
∙ 20 MB image that adapts
∙ Image for IaaS clouds, volunteer computing, long-term data preservation, development environment

Possibilities for Collaboration and Re-use

∙ OSG and EGI operate managed CernVM-FS "software installation services" for the grid
∙ The cvmfs and cvmfs-server packages are generic; keys, server addresses, repository names, and configuration come with the cvmfs-config-... packages
  e.g. EUCLID (astrophysics) operates an independent CernVM-FS infrastructure with ∼10 sites

∙ Collaborative development on GitHub: contributions from U of Nebraska, Fermilab, U of Notre Dame
  e.g. features added for LIGO data distribution, improved support
∙ Plugin interfaces
  ∙ Cache manager for exotic deployments, e.g. supercomputers (upcoming 2.4 release)
  ∙ Client authorization helpers for "secure CernVM-FS" setups, e.g. the possibility to implement OAuth instead of X.509

∙ Re-use of CernVM: through contextualization and through custom templates

Links

Source code:      https://github.com/cvmfs/cvmfs
                  https://github.com/cernvm
Downloads:        https://cernvm.cern.ch/portal/filesystem/downloads
                  https://cernvm.cern.ch/portal/downloads
Documentation:    https://cvmfs.readthedocs.org
Mailing lists:    [email protected]
                  [email protected]
JIRA bug tracker: https://sft.its.cern.ch/jira/projects/CVM

Backup Slides

Content-Addressable Storage: Data Structures

[Figure: the repository path /cvmfs/icecube.opensciencegrid.org/amd64-gcc6.0/4.2.0/ChangeLog is compressed and hashed into an object such as 806fbb67373e9...; the repository consists of the object store and the file catalogs]

Object Store
∙ Compressed files and chunks
∙ De-duplicated

File Catalog
∙ Directory structure, symlinks
∙ Content hashes of regular files
∙ Digitally signed ⇒ integrity, authenticity
∙ Time to live
∙ Partitioned / Merkle hashes (possibility of sub catalogs)

⇒ Immutable files, trivial to check for corruption, versioning
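A toy shell illustration of the content-addressing idea; the layout and the order of compression and hashing are simplified and not the exact CernVM-FS on-disk format:

$ h=$(sha1sum ChangeLog | cut -d' ' -f1)         # e.g. 806fbb67373e9...
$ mkdir -p data/${h:0:2}
$ gzip -c ChangeLog > data/${h:0:2}/${h:2}       # object stored compressed under its content hash
$ sha1sum -c <<< "$h  ChangeLog"                 # corruption check: re-hash and compare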

Partitioning of Meta-Data

Automatic Approaches
∙ Meta-data partitioning critical to performance
∙ Should we add support for hard quotas (volume, number of entries)?

[Figure: example software tree with x86_64 and i586 branches containing AliRoot v4-21-16-AN, ROOT v5-27-06d, and Geant3 v1-11-21; sub catalogs cut the tree at these directories]

Partitioning up to the repository owner (.cvmfscatalog marker), as sketched below
∙ locality by software version
∙ locality by frequency of changes
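A hedged sketch of how a repository owner places such markers; the repository name is assumed, the directory names follow the figure above:

[ ~ ]# cvmfs_server transaction alice.cern.ch
[ ~ ]# touch /cvmfs/alice.cern.ch/x86_64/AliRoot/v4-21-16-AN/.cvmfscatalog
[ ~ ]# touch /cvmfs/alice.cern.ch/x86_64/ROOT/v5-27-06d/.cvmfscatalog
[ ~ ]# cvmfs_server publish alice.cern.ch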

CernVM-FS In Containers

Bind Mount

    docker run -v /cvmfs:/cvmfs:shared ...
or
    docker run -v /cvmfs/sft.cern.ch:/cvmfs/sft.cern.ch ...

∙ Cache shared by all containers on the same host

Docker Volume Driver
https://gitlab.cern.ch/cloud-infrastructure/docker-volume-cvmfs/

    docker run --volume-driver cvmfs -v cms.cern.ch:/cvmfs/cms.cern.ch ...

∙ Integrates with Kubernetes

From Inside the Container

    docker run --privileged ...

∙ Probably not used much in practice

CernVM-FS Client Tools

Fuse Module
∙ Normal namespace: /cvmfs/<repository>, e.g. /cvmfs/atlas.cern.ch
∙ Private mount as a user possible (see the sketch after this list)
∙ One process per fuse module + watchdog process
∙ Cache on local disk
∙ Cache LRU managed
∙ NFS export mode
∙ Hotpatch functionality: cvmfs_config reload

Mount Helpers
∙ Set up the environment (number of file descriptors, access rights, ...)
∙ Used by autofs on /cvmfs
∙ Used by /etc/fstab or mount as root:
  mount -t cvmfs atlas.cern.ch /cvmfs/atlas.cern.ch

Diagnostics
∙ Nagios check available
∙ cvmfs_config probe
∙ cvmfs_config chksetup
∙ cvmfs_fsck
∙ cvmfs_talk, connect to a running instance

Parrot
∙ Built in by default
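As a hedged example of the "private mount as a user" case; the configuration snippet is an assumption and relies on the system-wide defaults for keys and server URLs:

$ cat private.conf
CVMFS_CACHE_BASE=/tmp/cvmfs-cache
CVMFS_RELOAD_SOCKETS=/tmp/cvmfs-cache
CVMFS_HTTP_PROXY=DIRECT
$ mkdir -p /tmp/cvmfs-cache /tmp/sft
$ cvmfs2 -o config=private.conf sft.cern.ch /tmp/sft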

Distributed Publish Interface – Under Construction

[Figure: remote application interface machines (e.g. data taking) compress, hash, and sign object packs (tarballs) locally; they request a lease on a sub-path (e.g. /ocdb/2018/run001 for 2 hours) from an authentication server that checks pre-shared keys (K1, K2, ...) against a user/lease database (e.g. Mnesia); the signed object packs are pushed over a REST interface to a gateway server, which writes data and catalogs over HTTP to the master storage (Stratum 0), from where objects are replicated to Stratum 1]

CernVM-FS for Data Federations
Contribution from Brian Bockelman & Derek Weitzel / OSG

Use CernVM-FS as a POSIX compliant, consistent, cryptographically secured name space for data files.

[Figure: experiment data at Site A, Site B, and cloud storage is exposed as a secure POSIX namespace; an agent grafts the namespace via a book keeper / namespace gateway micro web API; clients access the files over HTTPS with X.509 authorization]

Note the limitations: CernVM-FS is not for maximum throughput

Authorization Helper Interface

[Figure: the cvmfs2 Fuse module passes the "membership" string and the uid, gid, pid of the accessing process to a separate authz helper process; the helper replies allow/deny with a time to live (and can hand back an SSL certificate), and the result is kept in an authz cache]

Authz Helper
∙ Separate process, communicates via stdin, stdout
∙ Controls access to a repository based on uid, gid, pid of the accessing process
∙ The "membership" and which helper to use are stored in the root catalog
∙ Can pass an X.509 proxy certificate for HTTPS authentication
∙ Controls the cache life time of the information

CernVM-FS Cache Plugins

Possible 3rd-party plugins: RAMCloud, Cassandra, in-memory caches, ...

[Figure: the cvmfs/fuse or libcvmfs/parrot client talks through a C library over a transport channel (UNIX or TCP socket, ≈100 k calls/s, 4.5 GB/s) to the cache manager, which runs as an external process]

Motivation for cache plugins
∙ More flexibility for client deployment:
  ∙ Diskless server farms
  ∙ HPC "burst buffers": utilize fast, possibly non-POSIX storage
∙ Opens the door to external contributions!

For standard deployment on the Grid nothing changes!
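Purely as an illustration, selecting an external cache plugin on a client might look roughly as follows in the upcoming release; the parameter names and the plugin binary are assumptions to be checked against the 2.4 documentation:

# /etc/cvmfs/default.local (assumed parameter names)
CVMFS_CACHE_PRIMARY=myplugin
CVMFS_CACHE_myplugin_TYPE=external
CVMFS_CACHE_myplugin_CMDLINE=/usr/libexec/cvmfs/cache/cvmfs_cache_ram,/etc/cvmfs/cache-ram.conf
CVMFS_CACHE_myplugin_LOCATOR=unix=/var/lib/cvmfs/cache-plugin.socket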

CernVM-FS Cache Plugin C Interface

Callbacks to be implemented by plugin developer

// Reading data
int cvmcache_chrefcnt(struct hash object_id, int change_by);
int cvmcache_object_info(struct hash object_id, struct object_info *info);
int cvmcache_pread(struct hash object_id, int offset, int size, void *buffer);

// Transactional writing in fixed-sized parts
int cvmcache_start_txn(struct hash object_id, int txn_id, struct info object_info);
int cvmcache_write_txn(int txn_id, void *buffer, int size);
int cvmcache_abort_txn(int txn_id);
int cvmcache_commit_txn(int txn_id);

// Optional: quota management
int cvmcache_shrink(int shrink_to, int *used);
int cvmcache_listing_begin(...);
int cvmcache_listing_next(int listing_id, ...);
int cvmcache_listing_end(int listing_id);

Experiment Software from a File System Viewpoint

[Figure: ATLAS software directory tree (atlas.cern.ch / repo / software / x86_64-gcc43 / 17.1.0, 17.2.0, ...) and statistics over 2 years of file system entries: files, directories, symlinks, and duplicates]

Fine-grained software structure (Conway's law)
Between consecutive software versions: only ≈ 15 % new files

Directory Organization

[Plot: fraction of files (%) vs. directory depth, for Athena 17.0.1, CMSSW 4.2.4, and LCG Externals R60]

Typical (non-LHC) software: majority of files in directory level ≤ 5

Cumulative File Size Distribution

[Plot: cumulative file size distribution, percentile vs. file size (B, log scale), for ATLAS, CMS, LHCb, ALICE, the requested subset, and the "Unix" and "Web server" reference workloads]

cf. Tanenbaum et al. 2006 for “Unix” and “Webserver”

Good compression rates (factor 2–3)

The High Energy Physics Software Stack

[Stack, from top (changing) to bottom (stable):
∙ My Analysis Code: < 10 Python classes
∙ CMS Software Framework: O(1000) C++ classes
∙ Simulation and I/O Libraries: ROOT, Geant4, MC-XYZ
∙ CentOS 6 and Utilities: O(10) libraries]

Key Figures
∙ Hundreds of (novice) developers
∙ Hundred million files
∙ 1 TB / day of nightly builds
∙ Daily production releases, remain available "eternally"

Software vs. Data

Based on ATLAS Figures 2012

Software                    Data
POSIX interface             put, get, seek, streaming
File dependencies           Independent files
10^7 objects                10^8 objects
10^12 B volume              10^16 B volume
Whole files                 File chunks
Absolute paths              Any mountpoint
Open source                 Confidential
Versioned                   WORM ("write-once-read-many")

CernVM Build Process: EL on CernVM-FS

Maintenance of the repository should not become a Linux distributor's job. But it should be reproducible and well-documented.

Idea: automatically generate a fully versioned, closed package list from a “shopping list” of unversioned packages

[Figure: packages from Scientific Linux, EPEL, and CernVM Extras (≈50) plus a package archive; dependencies are formulated as an integer linear program to compute the dependency closure, which is then installed with yum onto CernVM-FS]

CernVM Build Process: Package Dependency ILP

Normalized (integer) linear program:
\[
\text{Minimize } (c_1 \cdots c_n) \cdot \begin{pmatrix} x_1 \\ \vdots \\ x_n \end{pmatrix}
\quad \text{subject to} \quad
\begin{pmatrix} a_{11} & \cdots & a_{1n} \\ \vdots & \ddots & \vdots \\ a_{m1} & \cdots & a_{mn} \end{pmatrix}
\cdot \begin{pmatrix} x_1 \\ \vdots \\ x_n \end{pmatrix}
\le
\begin{pmatrix} b_1 \\ \vdots \\ b_m \end{pmatrix}
\]

Here: every available (package, version) pair is mapped to a variable x_i ∈ {0, 1}.
Cost vector: newer versions are cheaper than older versions. (And: fewer packages are cheaper than more packages.)
Dependencies: package x_a requires x_b or x_c: x_b + x_c − x_a ≥ 0.
Packages x_a and x_b conflict: x_a + x_b ≤ 1. (...)

Figures
∙ ≈17 000 available packages (n = 17 000), 500 packages on the "shopping list"
∙ ≈160 000 inequalities (m = 160 000), solving time < 10 s (glpk)
∙ Meta RPM: ≈1 000 fully versioned packages, dependency closure

Idea: Mancinelli, Boender, di Cosmo, Vouillon, Durak (2006)

CernVM Contextualization

User-Data Sources
∙ Well-known web server (EC2, GCE, OpenStack)
∙ ISO image (OpenNebula, HEPiX)
∙ HDD image (VirtualBox OVA format)
∙ CernVM Launcher user-provided snippet
∙ Baked into the image

User-Data Formats
∙ cloud-init
∙ amiconfig
∙ µCernVM boot loader format
∙ Mixable in MIME multipart user-data

Plugins: CernVM-FS, condor, cctools, CernVM main user, CernVM GUI (desktop icons, autostart, . . . ), inject grid certificate, grid UI version

Sample Context

user-data.txt:

[cernvm]
organisations=ALICE
repositories=alice,alice-ocdb,sft
shell=/bin/bash
config_url=http://cernvm.cern.ch/config
users=alice:alice:ion
edition=Desktop
keyboard=us
startXDM=on
auto_login=on

[ucernvm-begin]
cvmfs_tag=cernvm-system-3.1.1.4
[ucernvm-end]

Boot on CERN OpenStack:

boot AliceVM --image "cvm3" --flavor m1.small \
    --key-name ssh-key --user-data user-data.txt
