Agent Cloud

Book Keeper

Gateway CERN’s Virtual for Global-Scale Software Delivery WebAPI

Jakob Blomer for the CernVM-FS team CERN, EP-SFT Micro MSST 2019, Santa Clara University Agent Cloud

Agenda Book Keeper

Gateway

WebAPI High Energy Physics Computing Model Micro

Software Distribution Challenge

CernVM-FS: A Purpose-Built Software File System

[email protected] CernVM-FS / MSST’19 1 / 20 High Energy Physics Computing Model Agent Cloud

Accelerate & Collide Book Keeper

Gateway

WebAPI

Micro

[email protected] CernVM-FS / MSST’19 2 / 20 Agent Cloud

Measure & Analyze Book Keeper

Gateway

WebAPI

Micro

• Billions of independent “events” • Each event subject to complex software processing → High-Throughput Computing

[email protected] CernVM-FS / MSST’19 3 / 20 Agent Cloud

Federated Computing Model Book Keeper

Gateway

WebAPI

Micro • Physics and computing: international collaborations • “The Grid”: ∼160 data centers • Approx.: global batch system • Code moves to the data rather than vice versa

[email protected] CernVM-FS / MSST’19 4 / 20 Agent Cloud

Federated Computing Model Book Keeper

Gateway

Additional opportunistic WebAPI resources, e. g. HPC backfill slots Micro • Physics and computing: international collaborations • “The Grid”: ∼160 data centers • Approx.: global batch system • Code moves to the data rather than vice versa

[email protected] CernVM-FS / MSST’19 4 / 20 Software Distribution Challenge Agent Cloud

The Anatomy of a Scientific Software Stack Book Keeper

Gateway

WebAPI

Individual 0.1 MLOC Analysis Code Micro Key Figures for LHC Experiments changing • Hundreds of (novice) developers Experiment 4 MLOC Software Framework • > 100 000 files per release • 1 TB / day of nightly builds

High Energy Physics 5 MLOC Libraries • ∼100 000 machines world-wide • Daily production releases, Compiler remain available “eternally” 20 MLOC System Libraries stable OS Kernel

[email protected] CernVM-FS / MSST’19 5 / 20 Agent Cloud

Container Image Distribution Book Keeper

Gateway Libs ... WebAPI

Micro

• Containers are easier to create than to role-out at scale • Due to network congestion: long startup-times in large clusters • Impractical image cache management on worker nodes • Ideally: Containers for isolation and orchestration, but not for distribution

[email protected] CernVM-FS / MSST’19 6 / 20 Agent Cloud

Shared Software Area on General Purpose DFS Book Keeper

Gateway Working Set • ≈2 % to 10 % of all available files are requested at runtime WebAPI • Median of file sizes: < 4 kB Micro

Software

Flash Crowd Effect dDoS • 풪(MHz) meta data request rate • 풪(kHz) file open rate ∙ ∙ ∙

Shared Software Area

[email protected] CernVM-FS / MSST’19 7 / 20 Agent Cloud

Software vs. Data Book Keeper

Gateway Software Data WebAPI POSIX interface put, get, seek, streaming File dependencies Independent files O(kB) per file O(GB) per file Micro Whole files File chunks Absolute paths Relocatable

WORM (“write-once-read-many”) Billions of files Versioned

Software is massive not in volume but in number of objects and meta-data rates

[email protected] CernVM-FS / MSST’19 8 / 20 CernVM-FS: A Purpose-Built Software File System Agent Cloud

Design Objectives Book Keeper

Gateway

Transformation WebAPI HTTP Transport a1240 Caching & Replication 41ae3 ... 7e95b Micro Read-Only Content-Addressed Objects, Read/Write File System Merkle Tree File System

Worker Nodes Software Publisher / Master Source

1. World-wide scalability 3. Application-level consistency 2. Infrastructure compatibility 4. Efficient meta-data access

[email protected] CernVM-FS / MSST’19 9 / 20 Agent Cloud

Design Objectives Book Keeper

Gateway

Transformation SeveralHTTP CDN Transport options: a1240 WebAPI Caching & Replication 41ae3 • Apache + Squids ... 7e95b Micro Read-Only • /S3 Content-Addressed Objects, Read/Write File System • Commercial CDN Merkle Tree File System

Worker Nodes Software Publisher / Master Source

1. World-wide scalability 3. Application-level consistency 2. Infrastructure compatibility 4. Efficient meta-data access

[email protected] CernVM-FS / MSST’19 9 / 20 Agent Cloud

Scale of Deployment Book Keeper

Source / Stratum 0 Gateway Replica / Stratum 1 Site / Edge Cache WebAPI

Micro

LHC infrastructure: • > 1 billion files • ∼100 000 nodes • 5 replicas, 400 web caches

[email protected] CernVM-FS / MSST’19 10 / 20 Agent Cloud

High-Availability by Horizontal Scaling Book Keeper

Gateway Server side: stateless services

WebAPI Data Center Caching Proxy Web Servery 풪(100) nodes / server 풪(10) DCs / server Micro

Worker Nodes HTTP HTTP

[email protected] CernVM-FS / MSST’19 11 / 20 Agent Cloud

High-Availability by Horizontal Scaling Book Keeper

Gateway Server side: stateless services

WebAPI Data Center Load Balancing Web Servery 풪(100) nodes / server 풪(10) DCs / server Micro HTTP

HTTP Worker Nodes HTTP

HTTP

[email protected] CernVM-FS / MSST’19 11 / 20 Agent Cloud

High-Availability by Horizontal Scaling Book Keeper

Gateway Server side: stateless services

WebAPI Data Center Caching Proxies Web Servery 풪(100) nodes / server 풪(10) DCs / server Micro

Failover

Worker Nodes HTTP

HTTP

[email protected] CernVM-FS / MSST’19 11 / 20 Agent Cloud

High-Availability by Horizontal Scaling Book Keeper

Gateway Server side: stateless services

WebAPI Data Center Caching Proxies Mirror Serversy 풪(100) nodes / server 풪(10) DCs / server Micro

Failover

Worker Nodes

Geo-IP HTTP

[email protected] CernVM-FS / MSST’19 11 / 20 Agent Cloud

High-Availability by Horizontal Scaling Book Keeper

Gateway Server side: stateless services

WebAPI Data Center Caching Proxies Mirror Serversy 풪(100) nodes / server 풪(10) DCs / server Micro

Failover

Worker Nodes

HTTP

[email protected] CernVM-FS / MSST’19 11 / 20 Agent Cloud

High-Availability by Horizontal Scaling Book Keeper

Gateway Server side: stateless services

WebAPI Data Center Caching Proxies Mirror Serversy 풪(100) nodes / server 풪(10) DCs / server Micro

Worker Nodes Pre-populated Cache

[email protected] CernVM-FS / MSST’19 11 / 20 Agent Cloud

Reading Book Keeper

Gateway

CernVM-FS WebAPI rAA Basic System Utilities Global HTTP Cache Hierarchy Micro Fuse OS Kernel

File System CernVM-FS Repository (HTTP or S3) Memory Buffer Persistent Cache All Versions Available ∼1 GB ∼20 GB ∼10 TB

• Fuse based, independent mount points, e. g. /cvmfs/atlas.cern.ch • High cache effiency because entire cluster likely to use same software

[email protected] CernVM-FS / MSST’19 12 / 20 Agent Cloud

Writing Book Keeper

Gateway

Staging Area WebAPI CernVM-FS Read-Only Union File System Micro Read/Write Interface File System, S3

Publishing new content [ ~ ]# cvmfs_server transaction containers.cern.ch [ ~ ]# cd /cvmfs/containers.cern.ch && tar xvf ubuntu1610.tar.gz [ ~ ]# cvmfs_server publish containers.cern.ch

[email protected] CernVM-FS / MSST’19 13 / 20 Agent Cloud

Use of Content-Addressable Storage Book Keeper

/cvmfs/alice.cern.ch Object Store Gateway amd64-gcc6.0 • Compressed files and chunks 4.2.0 WebAPI • De-duplicated ChangeLog . . File Catalog Micro Compression, Hashing • Directory structure, symlinks 806fbb67373e9... • Content hashes of regular files

Repository • Large files: chunked with rolling checksum • Digitally signed Object Store File catalogs • Time to live

⊕ Immutable files, trivial to check for corruption, • Partitioned / Merkle hashes versioning, efficient replication (possibility of sub catalogs) ⊖ compute-intensive, garbage collection required [email protected] CernVM-FS / MSST’19 14 / 20 Agent Cloud

Partitioning of Meta-Data Book Keeper

Gateway ∙ WebAPI

certificates aarch64 Micro • Locality by software version • Locality by frequency of changes x86_64 • Partitioning up to software librarian, steering through .cvmfscatalog magical marker files

gcc Python

v8.3 v3.4

[email protected] CernVM-FS / MSST’19 15 / 20 Agent Cloud

Deduplication and Compression Book Keeper

24 months of software releases for a single LHC experiment Gateway

WebAPI 600

6 15 10 · Micro 10 400

5 200 Volume [GB]

File system entries 100 1

All entries Regular files Regular files Compressed

Without duplicates Wuthout duplicates

[email protected] CernVM-FS / MSST’19 16 / 20 Agent Cloud

Site-local network traffic: CernVM-FS compared to NFS Book Keeper

NFS server before and after the switch: Gateway

WebAPI

Micro

Site squid web cache before and after the switch:

Source: Ian Collier

[email protected] CernVM-FS / MSST’19 17 / 20 Agent Cloud

Latency sensitivity: CernVM-FS compared to AFS Book Keeper

Use case: starting “stressHepix” standard benchmark Gateway

WebAPI Startup overhead vs. latency 600 Micro 12 AFS CernVM-FS 500 10 Throughput 400 8

[min] 6 300 t Δ 4 200

2 100 Throughput [Mbit/s] 0 0 LAN 25 50 100 150 Round trip time [ms]

[email protected] CernVM-FS / MSST’19 18 / 20 Agent Cloud

Principal Application Areas Book Keeper

Gateway

‚ Production Software „ Unpacked Container Images Example: /cvmfs/ligo.egi.eu Example: /cvmfs/singularity.opensciencegrid.org WebAPI

X Most mature use case X Works out of the box with Singularity Micro # Fully unprivileged deployment of fuse module X CernVM-FS driver for Docker # Integration with containerd / kubernetes

ƒ Integration Builds Auxiliary data sets Example: /cvmfs/lhcbdev.cern.ch Example: /cvmfs/alice-ocdb.cern.ch

X High churn, requires regular garbage collection X Benefits from internal versioning # Update propagation from minutes to seconds • Depending on volume requires more planning for the CDN components

# Current focus of developments

[email protected] CernVM-FS / MSST’19 19 / 20 Agent Cloud

Summary Book Keeper

Gateway

WebAPI • CernVM-FS: special-purpose that provides a global shared software area for many scientific collaborations Micro • Content-addressed storage and asynchronous writing (publishing) key to meta-data scalability • Current areas of development: • Fully unprivileged deployment • Integration with containerd/kubernetes image management engine

https://github.com/cvmfs/cvmfs

[email protected] CernVM-FS / MSST’19 20 / 20 Backup Slides Links

Source code: https://github.com/cvmfs/cvmfs Downloads: https://cernvm.cern.ch/portal/filesystem/downloads Documentation: https://cvmfs.readthedocs.org Mailing list: [email protected] JIRA bug tracker: https://sft.its.cern.ch/jira/projects/CVM Supported Platforms

• A Platforms: • EL 5–7 (soon: 8) AMD64 • 16.04, 18.04 AMD64 • B Platforms • macOS, latest two versions • SLES 11 – 12 • Fedora, latest two versions • 8–9 • Ubuntu 12.04 and 14.04 • EL7 AArch64 • IA32 architecture • Experimental: POWER, Raspberry Pi, RISC-V • Blue sky idea: Windows based on ProjFS

Based on the current needs. Any platform with Fuse support should be straight-forward to address. CernVM-FS à la Carte

‚ HPC Client Deployment ƒ Data Namespace: /cvmfs/*.osgstorage.org

Site A

• Piz Daint: Upper cache layer in WN Memory Europe’s fastest Data Source Namespace

supercomputer ··· graft namespace • Runs LHC experiment jobs Lower cache layer on //... from native Fuse M Gila (CSCS) XRootD client

HTTPS + Client Authz

Site B

Secure POSIXAgent access Cloud

··· Book Keeper

Gateway

WebAPI

Micro CernVM-FS à la Carte

‚ HPC Client Deployment ƒ Data Namespace: /cvmfs/*.osgstorage.org

Site A Custom client cache: • Piz Daint: hierarchical cache Upper cache layer in WN Memory Europe’s fastest Data Source Namespace + RAM disk plugin supercomputer ··· graft namespace • Runs LHC experiment jobs Lower cache layer on /gpfs/... from native Fuse M Gila (CSCS) XRootD client

HTTPS + Client Authz

Site B

Secure POSIXAgent access Cloud

··· Book Keeper

Gateway

WebAPI

Micro CernVM-FS à la Carte

‚ HPC Client Deployment ƒ Data Namespace: /cvmfs/*.osgstorage.org

accessSite toA multiple PB of data Custom client cache: • Piz Daint: hierarchical cache Upper cache layer in WN Memory Europe’s fastest Data Source Namespace + RAM disk plugin supercomputer ··· graft namespace • Runs LHC experiment jobs Lower cache layer on /gpfs/... from native Fuse M Gila (CSCS) XRootD client

HTTPS + Client Authz

Site B

Secure POSIXAgent access Cloud

··· Book Keeper

Gateway

WebAPI

Micro CernVM-FS à la Carte

‚ HPC Client Deployment ƒ Data Namespace: /cvmfs/*.osgstorage.org

accessSite toA multiple PB of data Custom client cache: • Piz Daint: hierarchical cache Upper cache layer in WN Memory Europe’s fastest Data Source Namespace + RAM disk plugin supercomputer ··· graft namespace • Runs LHC experiment jobs Lower cache layer on /gpfs/... from native Fuse M Gila (CSCS) XRootD client Authentication plugin + high-bandwidth CDN HTTPS + Client Authz

Site B

Secure POSIXAgent access Cloud

··· Book Keeper

Gateway

WebAPI

Micro Distributed Publishing

Coordinating Multiple Publisher Nodes ssh or CI slave • Concurrent publisher nodes access storage Gateway Agent Cloud through gateway services Book Keeper • Gateway services: Gateway • API for publishing • IssuesWebAPIleases for sub paths

S3 Bucket • Receives change sets as set of Micro (Ceph, AWS, . . . ) signed object packs

Populate HTTP CDN Notification Service

Fast distribution channel for repository manifest: useful for CI pipelines, data QA

HTTP Agent Cloud WebSocket Notification Service Book Keeper AMQP WebSocket > cvmfs_swissknife notify -p ... Gateway

WebAPI

Micro CVMFS_NOTIFICATION_SERVER=...

• Optional service supporting a regular repository • Subscribe component integrated with the client, automatic reload on changes → CernVM-FS writing remains asynchronous but with fast response time in 풪(seconds) Unpacked Container Images in CernVM-FS

CernVM-FS Container Integration • Goal: avoid network congestion by starting unpacked containers from CernVM-FS • Client / worker node: requires CernVM-FS plug-ins for • Docker (available) • containerd (in contact with upstream developers) • CernVM-FS repository: efficient publishing of containers

Container Publishing Service Add-on service on the publisher node to facilitate container conversion from a Docker registry

Presentation Custom Docker Graph Driver I

Host machine Internet Graphdriver plugin “Thin image”: empty layer con- S3 taining only a descriptor that Graphdriver CVMFS S3 client plugin Repository points to the actual unpacked CVMFS AUFS layers in /cvmfs HTTP CVMFS Client Docker Docker Docker client daemon registry plugin API

Regular image Thin image

read-write layer thin image layer

local read-only layer read-only layer on CVMFS Custom Docker Graph Driver II

휆 docker plugin install cvmfs/graphdriver 휆 docker run cvmfs/thin_cernvm "echo Hello, World!"

See https://cvmfs.readthedocs.io/en/latest/cpt-graphdriver.html Container Publishing Service: Workflow

Push Docker MR to Unpack Push Thin Image Wishlist and Publish Image

User CernVM-FS

Wishlist https://gitlab.cern.ch/unpacked/sync /cvmfs/unpacked.cern.ch

version: 1 # Singularity user: cvmfsunpacker /registry.hub.docker.com/fedora:latest -> \ cvmfs_repo: 'unpacked.cern.ch' /cvmfs/unpacked.cern.ch/.flat/d0/d0932... output_format: > # Docker with thin image https://gitlab-registry.cern.ch/unpacked/sync/$(image) /.layers/f0/1af7... input: - 'https://registry.hub.docker.com/library/fedora:latest' - 'https://registry.hub.docker.com/library/debian:stable' - 'https://registry.hub.docker.com/library/centos:latest' Enabling Feature for Container Publishing: Tarball Ingestion

Direct path for the common pattern of publishing tarball contents

$ cvmfs_server transaction $ zcat ubuntu.tar.gz | \ $ tar -xf ubuntu.tar.gz cvmfs_server ingest -t - $ cvmfs_server publish

Uses libarchive: support for rpm, zip, publish::SyncUnion etc. could be easily added

Overlayfs Aufs Tarball

Read/Write Scratch Area CernVM-FS Read-Only

Performance Example Ubuntu 18.04 container – 4 GB in 250 k files: 56 s untar + 1 min publish vs. 74s ingest

1 Directory Organization

Typical non-LHC soft- ware: majority of files Athena 17.0.1 ≤ 5 in directory level CMSSW 4.2.4 50 LCG Externals R60

40

30

20 Fraction of Files [%]

10

0 0 5 10 15 20 Directory Depth Cumulative File Size Distribution

ATLAS CMS Requested LHCb UNIX ALICE Web Server

218 216 214 212 210 28 File size [B] 26 24

0 10 20 30 40 50 60 70 80 90 100 Percentile

cf. Tanenbaum et al. 2006 for “Unix” and “Webserver” LHC Data Flow

Source: Harvey et al. Selected publicationsi

Blomer, J., Aguado-Sanchez, C., Buncic, P., and Harutyunyan, A. (2011). Distributing LHC application software and conditions databases using the CernVM file system. Journal of Physics: Conference Series, 331(042003). Blomer, J., Buncic, P., Meusel, R., Ganis, G., Sfiligoi, I., and Thain, D. (2015). The evolution of global scale filesystems for scientific software distribution. Computing in Science and Engineering, 17(6):61–71. Weitzel, D., Bockelman, B., Dykstra, D., Blomer, J., and Meusel, R. (2017). Accessing data federations with cvmfs. Journal of Physics: Conference Series, 898(062044). Selected publications ii

Blomer, J., Ganis, G., Hardi, N., and Popescu, R. (2017). Delivering LHC Software to HPC Compute Elements with CernVM-FS, pages 724–730. Number 10524 in Lecture Notes in Computer Science. Springer. Hardi, N., Blomer, J., Ganis, G., and Popescu, R. (2018). Making containers lazy with docker and CernVM-FS. Journal of Physics: Conference Series, 1085.