CERN’s Virtual File System for Global-Scale Software Delivery
Jakob Blomer for the CernVM-FS team, CERN, EP-SFT
MSST 2019, Santa Clara University

Agenda
High Energy Physics Computing Model
Software Distribution Challenge
CernVM-FS: A Purpose-Built Software File System
[email protected] CernVM-FS / MSST’19 1 / 20

High Energy Physics Computing Model

Accelerate & Collide

Measure & Analyze
• Billions of independent “events”
• Each event subject to complex software processing
→ High-Throughput Computing

Federated Computing Model
• Physics and computing: international collaborations
• “The Grid”: ∼160 data centers
• To a first approximation: a global batch system
• Code moves to the data rather than vice versa
• Additional opportunistic resources, e.g. HPC backfill slots

Software Distribution Challenge

The Anatomy of a Scientific Software Stack
Software layers, from frequently changing (top) to stable (bottom):
• Individual analysis code: 0.1 MLOC
• Experiment software framework: 4 MLOC
• High energy physics libraries: 5 MLOC
• Compiler, system libraries, OS kernel: 20 MLOC
Key figures for LHC experiments:
• Hundreds of (novice) developers
• > 100 000 files per release
• 1 TB / day of nightly builds
• ∼100 000 machines world-wide
• Daily production releases, which remain available “eternally”

Container Image Distribution
• Containers are easier to create than to roll out at scale
• Due to network congestion: long startup times in large clusters
• Impractical image cache management on worker nodes
• Ideally: containers for isolation and orchestration, but not for distribution

Shared Software Area on General Purpose DFS
Working set:
• ≈2 % to 10 % of all available files are requested at runtime
• Median of file sizes: < 4 kB
Flash crowd effect (DDoS) on the shared software area:
• O(MHz) meta-data request rate
• O(kHz) file open rate

Software vs. Data
Software             | Data
POSIX interface      | put, get, seek, streaming
File dependencies    | Independent files
O(kB) per file       | O(GB) per file
Whole files          | File chunks
Absolute paths       | Relocatable
Common to both: WORM (“write-once-read-many”), billions of files, versioned
Software is massive not in volume but in number of objects and meta-data rates

CernVM-FS: A Purpose-Built Software File System

Design Objectives
[Figure: the software publisher / master source works on a read/write file system; a transformation step turns it into content-addressed objects organized as a Merkle tree (a1240, 41ae3, ..., 7e95b); HTTP transport with caching & replication delivers them to a read-only file system on the worker nodes]
Several CDN options for the HTTP transport:
• Apache + Squids
• Ceph/S3
• Commercial CDN
1. World-wide scalability
2. Infrastructure compatibility
3. Application-level consistency
4. Efficient meta-data access

Scale of Deployment
Distribution tiers: Source / Stratum 0, Replica / Stratum 1, Site / Edge Cache
LHC infrastructure:
• > 1 billion files
• ∼100 000 nodes
• 5 replicas, 400 web caches

High-Availability by Horizontal Scaling
Server side: stateless services
• Caching proxies in the data center: O(100) nodes / server
• Web servers / mirror servers: O(10) DCs / server
Worker nodes fetch content over HTTP, with redundancy at every level:
• Load balancing across the caching proxies of a data center
• Failover between caching proxies
• Geo-IP based selection of mirror servers, with failover between them
• Pre-populated cache on the worker nodes
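
The failover idea on this slide can be sketched in a few lines. This is an illustration only, not the CernVM-FS client code: the function `fetch_with_failover`, the `fetch` callable, and the retry order (all proxies for one mirror server before moving to the next) are assumptions made for the example.

```python
from typing import Callable, Iterable, Optional


class FetchError(Exception):
    """Raised when no proxy/server combination could deliver the object."""


def fetch_with_failover(
    path: str,
    proxies: Iterable[str],
    servers: Iterable[str],
    fetch: Callable[[str, str, str], bytes],
) -> bytes:
    """Return the object at `path`, trying proxies and mirror servers in order.

    `fetch(proxy, server, path)` is a hypothetical transport function that
    returns the object bytes or raises OSError on failure.
    """
    proxy_list = list(proxies)
    server_list = list(servers)
    last_error: Optional[Exception] = None
    for server in server_list:      # mirror servers, e.g. ordered by Geo-IP
        for proxy in proxy_list:    # site-local caching proxies
            try:
                return fetch(proxy, server, path)
            except OSError as err:
                last_error = err    # remember the failure, try the next pair
    raise FetchError(f"no proxy/server pair could serve {path}") from last_error
```

Because the services are stateless, any proxy or mirror can answer any request, which is what makes this simple retry loop sufficient.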

Reading
[Figure: applications and basic system utilities run on top of the CernVM-FS file system, which is attached through Fuse in the OS kernel and fed by the global HTTP cache hierarchy]
Cache tiers:
• Memory buffer: ∼1 GB
• Persistent cache: ∼20 GB
• CernVM-FS repository (HTTP or S3), all versions available: ∼10 TB
• Fuse based, independent mount points, e.g. /cvmfs/atlas.cern.ch
• High cache efficiency because the entire cluster is likely to use the same software
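
The read path through the cache tiers can be sketched as follows. This is a minimal model under stated assumptions: the class `TieredCache`, the slot-based LRU memory buffer, and the `origin` callable are hypothetical and stand in for the memory buffer, the persistent cache, and the HTTP repository of the slide.

```python
from collections import OrderedDict
from typing import Callable, Dict


class TieredCache:
    """Memory buffer in front of a persistent cache in front of a remote origin."""

    def __init__(self, origin: Callable[[str], bytes], memory_slots: int = 2) -> None:
        self.origin = origin                  # stands in for the HTTP download
        self.memory: "OrderedDict[str, bytes]" = OrderedDict()
        self.disk: Dict[str, bytes] = {}      # stands in for the persistent cache
        self.memory_slots = memory_slots
        self.origin_requests = 0

    def read(self, content_hash: str) -> bytes:
        if content_hash in self.memory:       # tier 1: memory buffer
            self.memory.move_to_end(content_hash)
            return self.memory[content_hash]
        if content_hash in self.disk:         # tier 2: persistent cache
            data = self.disk[content_hash]
        else:                                 # tier 3: remote repository
            data = self.origin(content_hash)
            self.origin_requests += 1
            self.disk[content_hash] = data
        self.memory[content_hash] = data
        if len(self.memory) > self.memory_slots:
            self.memory.popitem(last=False)   # evict the least recently used entry
        return data
```

Objects evicted from the small memory buffer are still served locally from the persistent cache, so the remote repository is contacted only once per object.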

Writing
[Figure: on the publisher, a union file system overlays a read/write staging area on the read-only CernVM-FS mount; published content goes to a file system or S3]
Publishing new content:
[ ~ ]# cvmfs_server transaction containers.cern.ch
[ ~ ]# cd /cvmfs/containers.cern.ch && tar xvf ubuntu1610.tar.gz
[ ~ ]# cvmfs_server publish containers.cern.ch

Use of Content-Addressable Storage
[Figure: files under /cvmfs/alice.cern.ch (amd64-gcc6.0, 4.2.0, ChangeLog, ...) pass through compression and hashing (806fbb67373e9...) into the repository, which consists of an object store and file catalogs]
Object store:
• Compressed files and chunks
• De-duplicated
File catalog:
• Directory structure, symlinks
• Content hashes of regular files
• Large files: chunked with rolling checksum
• Digitally signed
• Time to live
• Partitioned / Merkle hashes (possibility of sub catalogs)
⊕ Immutable files, trivial to check for corruption, versioning, efficient replication
⊖ Compute-intensive, garbage collection required
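
A minimal sketch of a content-addressed object store shows why de-duplication and corruption checks come almost for free. The class `ObjectStore` is hypothetical, and the choice of zlib plus SHA-1 over the compressed bytes is an assumption made for illustration, not a statement about the exact on-disk format.

```python
import hashlib
import zlib
from typing import Dict


class ObjectStore:
    """Stores compressed objects under the hash of their compressed content."""

    def __init__(self) -> None:
        self.objects: Dict[str, bytes] = {}

    def put(self, content: bytes) -> str:
        compressed = zlib.compress(content)
        digest = hashlib.sha1(compressed).hexdigest()
        # Identical content maps to the same key, so duplicates are stored once.
        self.objects.setdefault(digest, compressed)
        return digest

    def get(self, digest: str) -> bytes:
        compressed = self.objects[digest]
        # The key doubles as a checksum: recompute it to detect corruption.
        if hashlib.sha1(compressed).hexdigest() != digest:
            raise ValueError(f"object {digest} is corrupted")
        return zlib.decompress(compressed)
```

Since an object never changes once it is stored under its hash, caches and replicas can keep it indefinitely; the cost is the hashing work at publish time and the need to garbage-collect objects that no catalog references any more.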
Partitioning of Meta-Data
[Figure: example directory tree with top-level entries certificates, aarch64, and x86_64; below them gcc (v8.3) and Python (v3.4)]
• Locality by software version
• Locality by frequency of changes
• Partitioning is up to the software librarian, who steers it through .cvmfscatalog magic marker files
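
How Merkle hashes tie nested catalogs together can be sketched as follows. The nested-dictionary representation, the function `catalog_hash`, and the JSON serialization are assumptions for illustration; real catalogs are not stored this way.

```python
import hashlib
import json
from typing import Dict, Union

# A catalog maps names to either a file's content hash (str) or a sub catalog.
Catalog = Dict[str, Union[str, "Catalog"]]


def catalog_hash(catalog: Catalog) -> str:
    """Hash a catalog; sub catalogs are referenced by their own hash."""
    entries = {}
    for name, entry in catalog.items():
        if isinstance(entry, dict):
            entries[name] = "catalog:" + catalog_hash(entry)
        else:
            entries[name] = "file:" + entry
    serialized = json.dumps(entries, sort_keys=True).encode("utf-8")
    return hashlib.sha1(serialized).hexdigest()


# Hypothetical repository modeled on the directory names of the slide.
repo: Catalog = {
    "certificates": {"ca.pem": "c0ffee"},
    "x86_64": {
        "gcc": {"v8.3": {"bin": "a1240"}},
        "Python": {"v3.4": {"bin": "41ae3"}},
    },
}
root = catalog_hash(repo)
```

Changing one file changes only the hashes on the path from that file's catalog up to the root, so a client that already holds the Python sub catalog does not have to reload it when only gcc is updated.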

Deduplication and Compression
24 months of software releases for a single LHC experiment
[Figure: two bar charts. File system entries: all entries, regular files, without duplicates. Volume [GB]: regular files, compressed, without duplicates]

Site-local network traffic: CernVM-FS compared to NFS
[Figure: NFS server traffic before and after the switch]
[Figure: site squid web cache traffic before and after the switch]
Source: Ian Collier

Latency sensitivity: CernVM-FS compared to AFS
Use case: starting “stressHepix” standard benchmark
[Figure: startup overhead vs. latency. Δt [min] for AFS and CernVM-FS and throughput [Mbit/s], plotted against round trip time [ms] (LAN, 25, 50, 100, 150)]

Principal Application Areas
Legend: # Current focus of developments

Production Software (example: /cvmfs/ligo.egi.eu)
✓ Most mature use case
# Fully unprivileged deployment of fuse module

Unpacked Container Images (example: /cvmfs/singularity.opensciencegrid.org)
✓ Works out of the box with Singularity
✓ CernVM-FS driver for Docker
# Integration with containerd / kubernetes

Integration Builds (example: /cvmfs/lhcbdev.cern.ch)
✓ High churn, requires regular garbage collection
# Update propagation from minutes to seconds

Auxiliary data sets (example: /cvmfs/alice-ocdb.cern.ch)
✓ Benefits from internal versioning
• Depending on volume, requires more planning for the CDN components

Summary
• CernVM-FS: special-purpose virtual file system that provides a global shared software area for many scientific collaborations
• Content-addressed storage and asynchronous writing (publishing) are key to meta-data scalability
• Current areas of development:
  • Fully unprivileged deployment
  • Integration with containerd/kubernetes image management engine
https://github.com/cvmfs/cvmfs

Backup Slides

Links
Source code: https://github.com/cvmfs/cvmfs
Downloads: https://cernvm.cern.ch/portal/filesystem/downloads
Documentation: https://cvmfs.readthedocs.org
Mailing list: [email protected]
JIRA bug tracker: https://sft.its.cern.ch/jira/projects/CVM

Supported Platforms
• A Platforms:
  • EL 5–7 (soon: 8) AMD64
  • Ubuntu 16.04, 18.04 AMD64
• B Platforms:
  • macOS, latest two versions
  • SLES 11–12
  • Fedora, latest two versions
  • Debian 8–9
  • Ubuntu 12.04 and 14.04
  • EL7 AArch64
  • IA32 architecture
• Experimental: POWER, Raspberry Pi, RISC-V
• Blue sky idea: Windows based on ProjFS
Based on the current needs. Any platform with Fuse support should be straightforward to address.

CernVM-FS à la Carte
HPC Client Deployment
• Piz Daint: Europe’s fastest supercomputer
• Runs LHC experiment jobs from native Fuse client
• Custom client cache: hierarchical cache + RAM disk plugin
  • Upper cache layer in WN memory
  • Lower cache layer on /gpfs/...
Credit: M Gila (CSCS)

Data Namespace: /cvmfs/*.osgstorage.org
• Namespace grafted from an XRootD data source and served to Site A, Site B, ...
• Secure POSIX access to multiple PB of data
• HTTPS + client authz
• Authentication plugin + high-bandwidth CDN

Distributed Publishing
Coordinating Multiple Publisher Nodes
• Concurrent publisher nodes (ssh or CI slave) access storage through gateway services
• Gateway services:
  • API for publishing
  • Issues leases for sub paths
  • Receives change sets as set of signed object packs
• Storage: S3 bucket (Ceph, AWS, ...), which populates the HTTP CDN
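
The idea of leases for sub paths can be sketched as follows. The class `LeaseManager` and its rule (a lease is refused when the requested path contains, or is contained in, a path that is already leased) are assumptions for illustration; the sketch ignores time limits and authentication, which a real service would need.

```python
from typing import Dict, List


def _parts(path: str) -> List[str]:
    """Split a repository path into its non-empty components."""
    return [p for p in path.split("/") if p]


class LeaseManager:
    """Grants at most one publisher write access to any sub tree."""

    def __init__(self) -> None:
        self.leases: Dict[str, str] = {}   # leased sub path -> publisher name

    @staticmethod
    def _overlaps(a: str, b: str) -> bool:
        # Two paths conflict if one is a prefix of the other, component-wise.
        pa, pb = _parts(a), _parts(b)
        n = min(len(pa), len(pb))
        return pa[:n] == pb[:n]

    def acquire(self, publisher: str, path: str) -> bool:
        if any(self._overlaps(held, path) for held in self.leases):
            return False
        self.leases[path] = publisher
        return True

    def release(self, publisher: str, path: str) -> None:
        if self.leases.get(path) == publisher:
            del self.leases[path]
```

With such a rule, two CI jobs can publish to /x86_64/gcc and /x86_64/Python at the same time, while a second job targeting /x86_64/gcc/v8.3 has to wait until the first lease is released.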

Notification Service
Fast distribution channel for repository manifest: useful for CI pipelines, data QA
[Figure: the notification service connects publisher and clients; transports shown: HTTP, AMQP, WebSocket]
Publisher side: > cvmfs_swissknife notify -p ...
Client side: CVMFS_NOTIFICATION_SERVER=...
• Optional service supporting a regular repository
• Subscribe component integrated with the client, automatic reload on changes
→ CernVM-FS writing remains asynchronous but with fast response time in O(seconds)

Unpacked Container Images in CernVM-FS

CernVM-FS Container Integration
• Goal: avoid network congestion by starting unpacked containers from CernVM-FS
• Client / worker node: requires CernVM-FS plug-ins for
  • Docker (available)
  • containerd (in contact with upstream developers)
• CernVM-FS repository: efficient publishing of containers

Container Publishing Service
Add-on service on the publisher node to facilitate container conversion from a Docker registry

Custom Docker Graph Driver I
“Thin image”: empty layer containing only a descriptor that points to the actual unpacked layers in /cvmfs
[Figure: on the host machine, the Docker client talks to the Docker daemon, which calls the graphdriver plugin through the plugin API; the plugin uses AUFS and the CVMFS client; the CVMFS client reaches the CVMFS repository (S3) over HTTP, and the Docker registry is reached over the Internet]
Regular image: read-write layer, local read-only layer
Thin image: thin image layer, read-only layer on CVMFS

Custom Docker Graph Driver II
λ docker plugin install cvmfs/graphdriver
λ docker run cvmfs/thin_cernvm "echo Hello, World!"
See https://cvmfs.readthedocs.io/en/latest/cpt-graphdriver.html

Container Publishing Service: Workflow
Steps (user and CernVM-FS): Push Docker Image, MR to Wishlist, Unpack and Publish, Push Thin Image
Wishlist: https://gitlab.cern.ch/unpacked/sync
version: 1
user: cvmfsunpacker
cvmfs_repo: 'unpacked.cern.ch'
output_format: >
  https://gitlab-registry.cern.ch/unpacked/sync/$(image)
input:
  - 'https://registry.hub.docker.com/library/fedora:latest'
  - 'https://registry.hub.docker.com/library/debian:stable'
  - 'https://registry.hub.docker.com/library/centos:latest'
Result in /cvmfs/unpacked.cern.ch:
# Singularity
/registry.hub.docker.com/fedora:latest -> \
    /cvmfs/unpacked.cern.ch/.flat/d0/d0932...
# Docker with thin image
/.layers/f0/1af7...

Enabling Feature for Container Publishing: Tarball Ingestion
Direct path for the common pattern of publishing tarball contents
Conventional publishing:
$ cvmfs_server transaction
$ tar -xf ubuntu.tar.gz
$ cvmfs_server publish
Tarball ingestion:
$ zcat ubuntu.tar.gz | \
    cvmfs_server ingest -t -
Uses libarchive: support for rpm, zip, etc. could be easily added
[Figure: publish::SyncUnion variants Overlayfs, Aufs, and Tarball; the read/write scratch area sits on top of the read-only CernVM-FS mount]
Performance Example
Ubuntu 18.04 container – 4 GB in 250 k files: 56 s untar + 1 min publish vs. 74 s ingest
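
The saving implied by the performance example can be worked out directly. The numbers are taken from the slide; reading “1 min” as exactly 60 s is an assumption.

```python
untar_s = 56      # tar -xf into the open transaction
publish_s = 60    # "1 min" for cvmfs_server publish, assumed to be exactly 60 s
ingest_s = 74     # single-step cvmfs_server ingest

two_step_s = untar_s + publish_s
saving = 1 - ingest_s / two_step_s

print(f"two-step: {two_step_s} s, ingest: {ingest_s} s, saving: {saving:.0%}")
# prints "two-step: 116 s, ingest: 74 s, saving: 36%"
```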

Directory Organization
[Figure: fraction of files [%] vs. directory depth (0 to 20) for Athena 17.0.1, CMSSW 4.2.4, and LCG Externals R60]
Typical non-LHC software: majority of files in directory level ≤ 5

Cumulative File Size Distribution
[Figure: file size [B] (2^4 to 2^18) vs. percentile (0 to 100); series: ATLAS, CMS, LHCb, ALICE, Requested, UNIX, Web Server]
cf. Tanenbaum et al. 2006 for “Unix” and “Webserver”

LHC Data Flow
Source: Harvey et al.

Selected publications
Blomer, J., Aguado-Sanchez, C., Buncic, P., and Harutyunyan, A. (2011). Distributing LHC application software and conditions databases using the CernVM file system. Journal of Physics: Conference Series, 331(042003).
Blomer, J., Buncic, P., Meusel, R., Ganis, G., Sfiligoi, I., and Thain, D. (2015). The evolution of global scale filesystems for scientific software distribution. Computing in Science and Engineering, 17(6):61–71.
Weitzel, D., Bockelman, B., Dykstra, D., Blomer, J., and Meusel, R. (2017). Accessing data federations with cvmfs. Journal of Physics: Conference Series, 898(062044).
Blomer, J., Ganis, G., Hardi, N., and Popescu, R. (2017). Delivering LHC Software to HPC Compute Elements with CernVM-FS, pages 724–730. Number 10524 in Lecture Notes in Computer Science. Springer.
Hardi, N., Blomer, J., Ganis, G., and Popescu, R. (2018). Making containers lazy with docker and CernVM-FS. Journal of Physics: Conference Series, 1085.