Updates on CVMFS and container integration

EP-SFT weekly meeting, 14 September 2020

Simone Mosciatti

1 Background on CVMFS

● FUSE filesystem targeted at software distribution
● Works over HTTP
● Lazy pull of files based on actual file-system calls (see the sketch below)
● Great bandwidth efficiency
● Aggressive use of caches + content-addressable storage (CAS) to improve latency
● The file-system can grow indefinitely in size while keeping good performance, as long as the subcatalogs are managed correctly
● Widely used and deployed inside CERN, with great interest outside as well (EESSI, Microsoft, HFT firms, other scientific experiments)
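A minimal illustration of the lazy-pull behaviour, assuming a mounted unpacked.cern.ch repository (the file path is illustrative):

    ls /cvmfs/unpacked.cern.ch/            # fetches only catalog metadata, no file data
    cat /cvmfs/unpacked.cern.ch/README     # downloads just this one file, then caches it locally
    cvmfs_config stat unpacked.cern.ch     # inspect cache usage for the repository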

2 Background on containers

● Use namespaces to create isolated environments in which to run computations
● Distribute the container root file-system as hash-verified tarballs
● In a nutshell (see the sketch below):
  a. Create the root filesystem by stacking the content of hash-verified tarballs on top of each other
  b. Create an isolated environment using namespaces
  c. Run (reproducible) computations
● Different implementations (docker, singularity, podman, k8s-crio)
● Underneath they all use the same pieces:
  a. containerd
  b. containers/storage
● Moving towards rootless implementations, simplifying deployment on the GRID, in data centers and on supercomputers where CVMFS is already present
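A rough sketch of steps a-c using plain kernel primitives, assuming two layer tarballs have already been hash-verified and extracted into /tmp/layer1 and /tmp/layer2 (all paths illustrative, root privileges required):

    mkdir -p /tmp/merged /tmp/upper /tmp/work
    # a. stack the extracted layers into a single root filesystem
    mount -t overlay overlay \
          -o lowerdir=/tmp/layer2:/tmp/layer1,upperdir=/tmp/upper,workdir=/tmp/work \
          /tmp/merged
    # b. + c. create new mount and PID namespaces, run the computation inside
    unshare --mount --pid --fork chroot /tmp/merged /bin/sh -c 'echo running isolated'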

3 How containers can be consumed (1)

Unpacked file-system

a. The whole file-system of the container is provided as a directory
b. The container runtime picks up the directory filesystem and creates all the infrastructure
c. Works out of the box with Singularity
d. Works with podman with some tricks (specifying the runtime engine)

    singularity exec /cvmfs/unpacked.cern.ch/registry.hub.docker.com/atlas/athena:21.0.77/

The whole file-system of the container is already present in the directory.
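For podman, one way to consume such a directory is to point the runtime directly at it; a sketch using podman's --rootfs option, shown here only as an assumption about the "tricks" the slide refers to:

    podman run --rm --rootfs \
        /cvmfs/unpacked.cern.ch/registry.hub.docker.com/atlas/athena:21.0.77/ \
        /bin/sh -c 'echo hello from the container'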

4 How containers can be consumed (2)

Layers

a. The image is described in a JSON manifest file
b. The runtime fetches the layers as tarballs (or uses the ones already in cache)
c. The runtime generates the whole file-system by stacking the layers one on top of the other using overlayfs or similar unionfs technologies

    docker run atlas/athena:21

Step b is usually a problem, since it requires a lot of time and consumes a lot of bandwidth, especially with large images or when spawning many containers at once.
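For reference, a sketch of step a against a registry implementing the standard docker registry v2 API (repository name, tag and output abbreviated for illustration):

    # obtain a pull token, then fetch the image manifest
    TOKEN=$(curl -s "https://auth.docker.io/token?service=registry.docker.io&scope=repository:atlas/athena:pull" | jq -r .token)
    curl -s -H "Authorization: Bearer $TOKEN" \
         -H "Accept: application/vnd.docker.distribution.manifest.v2+json" \
         "https://registry-1.docker.io/v2/atlas/athena/manifests/21.0.77"
    # the response is a JSON document listing the config object and the
    # layer digests, e.g. { "layers": [ { "digest": "sha256:..." }, ... ] }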

5 Why merge the two technologies

● A very elegant model, taking the best from both technologies:

Efficient distribution from CVMFS and resource isolation from containers.

● Users are accustomed to container technology

6 State of the art

● Distribution of the container root file-system by pointing to a directory (such as /cvmfs/unpacked.cern.ch/registry.hub.docker.com/atlas/athena:21.0.77/)
  ○ Mostly used with singularity
  ○ Possible to use with podman
● Distribution of unpacked layers
  ○ Used with docker through the graph-driver plugin
● Automatic creation and management of the unpacked container image content on CVMFS with DUCC (see the layout sketch below)
  ○ /cvmfs/unpacked.cern.ch
  ○ DUCC, a golang application that interfaces directly with the CVMFS publisher / Stratum-0
  ○ It takes care of translating the high-level concept of container ingestion into low-level file-system manipulation
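A sketch of the repository layout this produces, based on the unpacked.cern.ch conventions (the hidden directory names are assumptions, shown only for illustration):

    ls /cvmfs/unpacked.cern.ch/registry.hub.docker.com/atlas/
    # human-readable image directories, e.g. athena:21.0.77
    ls /cvmfs/unpacked.cern.ch/.layers/
    # unpacked layers, addressed by their sha256 digest
    ls /cvmfs/unpacked.cern.ch/.flat/
    # flat root file-systems, addressed by image digest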

7 Recent advancements (DUCC)

1. Improve throughput and latency of DUCC

a. More parallelization
b. Faster check of images already in the filesystem (from several minutes to less than 30 seconds)
c. Reasonable performance for a fresh installation: 250 images in ~24 hours

8 Recent advancements (containerd)

2. Creation of a remote snapshotter for containerd

● Allows containerd-based containers to start up using layers directly from CVMFS
  ○ docker, kubernetes
● It is not necessary to download the layers from a central service
● Provides savings in bandwidth
● Promising preliminary tests; complete test cycles still needed; in contact with IT

Standard docker images (no thin images needed) can now be used out of the box.

containerd, if correctly configured, will use the unpacked layers stored in CVMFS to create the union file-system.
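A sketch of what "correctly configured" could look like, using containerd's standard proxy-plugin mechanism (the snapshotter name and socket path are assumptions):

    # register the remote snapshotter in /etc/containerd/config.toml
    cat >> /etc/containerd/config.toml <<'EOF'
    [proxy_plugins]
      [proxy_plugins.cvmfs-snapshotter]
        type = "snapshot"
        address = "/run/containerd-cvmfs/snapshotter.sock"
    EOF
    # then select it when pulling and running a container, e.g. with ctr:
    ctr image pull --snapshotter cvmfs-snapshotter docker.io/atlas/athena:21.0.77
    ctr run --snapshotter cvmfs-snapshotter docker.io/atlas/athena:21.0.77 athena-test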

Only the files actually needed will be downloaded from the network, not all of them.

9 Recent advancements (containers/storage)

3. Generation of container image meta-data as part of DUCC image conversion (GSoC)

● Allows containers/storage-based containers to start up using layers directly from CVMFS
  ○ podman, k8s cri-o (needs testing)
● It is not necessary to download the layers from a central service
● Provides savings in bandwidth
● Good preliminary tests; found some minor bugs

The set of cached layers can now be hosted in CVMFS.

When the runtime checks whether a layer is in local storage, it now also checks in CVMFS.
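A sketch of how this could be wired up through containers/storage's additionalimagestores option (the CVMFS store path is an assumption; merge into any existing [storage.options] section rather than appending blindly):

    # add a read-only image store on CVMFS to /etc/containers/storage.conf
    cat >> /etc/containers/storage.conf <<'EOF'
    [storage.options]
    additionalimagestores = [ "/cvmfs/unpacked.cern.ch/podmanStore" ]
    EOF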

Layers found in CVMFS are not downloaded from the network.

10 Roadmap for the next 6 months

1. Wider tests on the containerd remote snapshotter
2. Iron out bugs in the `containers/storage` file-system implementation and merge it into DUCC
3. Implement a "docker registry shim"
4. Store each step of a container file-system (the "chain") in CVMFS using the new (2.8, unreleased) template transactions

11 Active development: docker-registry shim (1)

● The asynchronous nature of DUCC is an issue
  ○ A user requests an image to be pushed into unpacked.cern.ch
  ○ The request is, eventually, satisfied
  ○ It usually takes about 10 minutes, but sometimes much longer when the repository is busy
  ○ The only way to check is to look into /cvmfs/unpacked.cern.ch
● Users would prefer to push images to a docker registry and know that, when the push finishes, the image is in unpacked.cern.ch
  ○ Preliminary investigation suggests that this is possible with the docker registry API
  ○ The real implementation will require major adjustments in the DUCC code
  ○ Requires coordination with operations
● At the end of a CI/CD pipeline, the image is pushed into a registry and automatically appears in unpacked.cern.ch
● Fits well with the change in pricing recently announced by Docker Inc.

12 Active development: docker-registry shim (2)

● Will allow nicer integration with the containers ecosystem
  ○ We will "speak the same language"
● Will allow integration with Harbor (registry.cern.ch, the cloud registry unofficially blessed by IT) and the GitLab docker registry

Working principles

● When the user pushes a layer, before accepting it, the layer can be ingested into CVMFS (see the API sketch below)
● When the user pushes a manifest, before accepting it, we can create all the supporting structure for the image
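The interception points, in terms of the standard docker registry v2 HTTP API (a sketch of the plan; the shim behaviour described in the comments is not existing code):

    # layer push: the client starts a blob upload; the shim can ingest the
    # layer into CVMFS before acknowledging it
    #   POST /v2/<name>/blobs/uploads/
    # manifest push: the client uploads the manifest; the shim can create the
    # supporting image structure before accepting it
    #   PUT /v2/<name>/manifests/<tag>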

13 Active development: storing all the stages of the container filesystem for fast ingestion of derived containers

● Several containers are based on standard images, e.g. `FROM centos:centos7`
● When creating the unpacked flat root file-system - e.g. for use with singularity - a lot of work is repeated to ingest all the files of the base image every time
  ○ The files are eventually deduplicated, but we pay a price in ingestion/conversion time
● A new feature called "template transactions" will allow us to avoid all this repeated work and to ingest only the files of the so-far unseen layers (see the sketch below)
  ○ cvmfs_server transaction repo.ch/foo/:bar/
  ○ Creates a new transaction in which the content of the whole foo directory is already present in the bar directory
● Needs to be integrated with DUCC
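A sketch of how this could be used for chain ingestion, assuming the transaction syntax shown above (the .chains directory layout and the layer tarball name are illustrative):

    # open a transaction where the new chain step starts as a copy of the
    # previous one (template transaction)
    cvmfs_server transaction unpacked.cern.ch/.chains/sha256:00001/:.chains/sha256:00002/
    # ingest only the files of the new layer on top of the template content
    tar -C /cvmfs/unpacked.cern.ch/.chains/sha256:00002 -xf layer-sha256:00002.tar.gz
    cvmfs_server publish unpacked.cern.ch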

14 Active development: storing all the stages of the container filesystem for fast ingestion of derived containers (1)

FROM centos:centos7      → sha256:00001
RUN yum install python3  → sha256:00002
ADD analysis1.py         → sha256:00003

● At the moment we only store the last ring of the chain (sha256:00003)
● If another image based on centos7 that installs python3 needs to be ingested, all the files have to be ingested again
● This is quite a common scenario: the building pieces are similar, while the thin layer of user code at the top changes frequently

15 Active development: storing all the stages of the container filesystem for fast ingestion of derived containers (2)

FROM centos:centos7       → sha256:00001
RUN yum install python3   → sha256:00002
ADD better_analysis.py    → sha256:00042

● With the new template transactions, we want to store all the rings of the chain (sha256:00001, sha256:00002 and sha256:00003)

● When a new image comes along, we can build on top of rings already ingested in CVMFS
● In this case we could ingest just the last layer, since sha256:00002 is already in the repository

16 Long term ideas (1) - CERN / HEP wide registry + unpacked

With the docker shim it would be possible to operate a CERN or HEP wide container registry together with a CVMFS repository as a canonical home for all the experiment containers.

Moreover, it would expose the docker registry API to allow for deletion of old images.

Together with rootless containers, users would then be more easily able to (sketched after the list):

1. Create and test their own containers in the local environment
2. Push the container to unpacked
3. Run the same computation on the GRID or on lxplus
4. Store the exact same environment for software preservation
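A hypothetical command-line version of this workflow (the registry name, image name and paths are assumptions):

    podman build -t myanalysis:v1 .                                      # 1. build and test locally
    podman run --rm myanalysis:v1 ./run_tests.sh
    podman push myanalysis:v1 registry.cern.ch/unpacked/myanalysis:v1    # 2. push to unpacked
    # 3. run the same computation on lxplus / the GRID from the unpacked directory
    singularity exec /cvmfs/unpacked.cern.ch/registry.cern.ch/unpacked/myanalysis:v1/ ./analysis.py
    # 4. the image directory stays in CVMFS, preserving the exact environment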

17 Long term ideas (2) - Predictive cache

While the bandwidth savings are great, start-up latency can still be problematic, especially for interactive use cases.

A predictive cache based on subcatalogs could fit this use case very well and help a lot.

Containers have a well-defined set of "hot files": the files necessary to run the ENTRYPOINT. This "hot set" would fit such a cache very well.

This was proposed as a summer student project; unfortunately it did not work out due to COVID.

18 Recap

● Soon we will be able to start images on unpacked with podman (configuration needed)
● containerd is able to pick up layers from unpacked (plugin installation needed)
● Working on a synchronous docker registry shim
● Working on fast image ingestion through chains

19 Questions?
