Evaluation and Benchmarking of MPI Containers on EU Research e-Infrastructures

Víctor Sande Veiga, Manuel Simon, Abdulrahman Azab, Giuseppa Muscianisi, Carlos Fernandez, Giuseppe Fiameni, and Simone Marocchi

CIMNE - International Centre for Numerical Methods in Engineering; Nokia, France; CESGA - Centro de Supercomputación de Galicia, Spain; University of Oslo, Norway; CINECA - Interuniversity Consortium, Italy

CANOPIE-HPC@SC19 | www.mso4sc.eu | www.prace-ri.eu
PRACE - Partnership for Advanced Computing in Europe
MSO4SC - Mathematical Modelling, Simulation and Optimization for Societal Challenges with Scientific Computing

CESGA - Supercomputing Center of Galicia

▶ 328 TFLOPS

▶ 44.8 TB of RAM

▶ 1.5 PB of disk capacity

CINECA - Interuniversity Consortium, Italy

MARCONI is the Tier-0 CINECA system, based on the Lenovo NeXtScale platform. The current configuration of MARCONI consists of:

• 3600 Knights Landing nodes, each equipped with one Intel Xeon Phi 7250 @ 1.4 GHz (68 cores) and 96 GB of RAM, also known as MARCONI A2 - KNL

• 3216 Intel Skylake nodes, each equipped with two Intel Xeon 8160 @ 2.1 GHz (24 cores each) and 192 GB of RAM, also known as MARCONI A3 - SKL

Containers

Native vs. Containers vs. VMs

Why containers?

▶ Without containers: "I need software X, and here is the installation guide, please install it!"

▶ With containers: "I need software X, here is the name of its Docker image, please pull it"

▶ Very little performance degradation compared to native

▶ Security: SELinux, capability whitelist, syscall whitelist, and user namespaces

What containers don't solve

▶ Hardware architecture and kernel incompatibility.

▶ Operational maintenance mess (e.g. two different versions of MPI).

▶ Containers are not suited for huge software packages, e.g. Bio-. To package those in one unit, VMs are more suitable.

MPI: Message Passing Interface

(Figure source: https://devblogs.nvidia.com/introduction-cuda-aware-mpi/)

OSI Network model

(Figure: where MPI sits relative to the OSI layers; source: https://medium.com/@int0x33/day-51-understanding-the-osi-model-f22d5f3df756)

General MPI program structure

https://computing.llnl.gov/tutorials/mpi/

History

▶ May 1994: Final version of MPI-1.0 released

▶ MPI-1.1 (Jun 1995)

▶ MPI-1.2 (Jul 1997)

▶ MPI-1.3 (May 2008).

▶ 1998: MPI-2 picked up where the first MPI specification left off, and addressed topics which went far beyond the MPI-1 specification.

▶ MPI-2.1 (Sep 2008)

▶ MPI-2.2 (Sep 2009)

▶ Sep 2012: The MPI-3.0 standard was approved.

▶ MPI-3.1 (Jun 2015)

▶ Current: The MPI-4.0 standard is under development.

https://computing.llnl.gov/tutorials/mpi/

Most popular distributions

▶ OpenMPI

▶ MPICH

▶ MVAPICH2

▶ Intel MPI

Common MPI Application Binary Interface (ABI)

▶ Starting with Intel MPI 5.0 and MPICH 3.1 (and MVAPICH2 1.9 and higher, which is based on MPICH 3.1), the libraries are interchangeable at the binary level thanks to a common ABI.

▶ In practice this means that one can build an application with MPICH but run it using the Intel MPI libraries, thus taking advantage of Intel MPI functionality.

https://software.intel.com/en-us/articles/using-intelr-mpi-library-50-with-mpich3-based-applications
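As a rough illustration of what the common ABI enables, the sketch below builds a program against MPICH and then runs the same binary against the Intel MPI runtime by pointing the loader at the Intel MPI libraries. Module names and install paths are assumptions for a generic cluster, not taken from the slides.

$ module load mpich/3.1                       # assumed module name
$ mpicc -o hello_mpi hello_mpi.c              # build against MPICH
$ module swap mpich/3.1 intel-mpi/5.0         # assumed module name
$ export LD_LIBRARY_PATH=/opt/intel/impi/5.0/lib64:$LD_LIBRARY_PATH   # assumed install path
$ mpirun -np 4 ./hello_mpi                    # same binary, Intel MPI runtime via the common ABI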

PMI: Process Management Interface

PMI allows different process-management frameworks to interact with parallel programming libraries such as MPI

Pavan Balaji et al. 2010. PMI: a scalable parallel process-management interface for extreme-scale systems. In Proceedings of the 17th European MPI Users' Group Meeting (EuroMPI'10), Rainer Keller, Edgar Gabriel, Michael Resch, and Jack Dongarra (Eds.). Springer-Verlag, Berlin, Heidelberg, 31-41.

Container platform choice

Docker:

▶ Not designed for running HPC distributed applications.

▶ Relies on the Docker engine and daemon to be installed and running on each node.

▶ Each container has its own process which is fired up by the docker daemon.

▶ Containers run by default as root. This can be avoided with user namespaces, but users with access to the main docker command can easily escape this limitation, which introduces a security risk.

▶ Each Docker engine has its own image repository that, so far, cannot be shared with other Docker engines. Running a parallel application X on e.g. 100 nodes therefore requires the X Docker image to be replicated 100 times, once per node.

▶ Docker so far has no support for MPI.

Container platform choice

Podman:

▶ Developed and maintained by Red Hat.

▶ A lightweight runtime for managing and running docker containers without the overhead of the docker engine and daemon.

▶ Not primarily designed for HPC workloads, and currently lacking support for them.

▶ Podman can run rootless, but to do that (similar to Charliecloud) it uses user namespaces.

▶ Its rootless mode is not compatible with most parallel filesystems.

Podman MPI example

$ mpirun --hostfile hostfile \
    --mca orte_tmpdir_base /tmp/podman-mpirun \
    podman run --env-host \
    -v /tmp/podman-mpirun:/tmp/podman-mpirun \
    --userns=keep-id \
    --net=host --pid=host --ipc=host \
    quay.io/adrianreber/mpi-test:30 /home/ring

Rank 2 has cleared MPI_Init
Rank 2 has completed ring
Rank 2 has completed MPI_Barrier
Rank 3 has cleared MPI_Init
Rank 3 has completed ring
…

Source: https://podman.io/blogs/2019/09/26/podman-in-hpc.html

Container platform choice

Charliecloud:

▶ An open source project based on a user-defined software stack (UDSS).

▶ Designed to manage and run Docker-based containers; one important advantage is that it is very lightweight.

▶ It enables running containers as the user, by implementing user namespaces.

▶ The main disadvantage is the requirement of a recent kernel (4.4 is recommended) that supports user namespaces. Many EU HPC clusters still run their compute nodes with old kernels (< 3.10) that do not support user namespaces, which is a showstopper for Charliecloud.

▶ Uses Slurm srun to run containers with MPI installed, as in the sketch below.
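A minimal sketch of what such a Charliecloud run can look like; the image tarball, unpack directory and application binary are hypothetical, not taken from the slides:

$ ch-tar2dir ./mpi_app.tar.gz /var/tmp                                  # unpack the image tarball on each node
$ srun --ntasks=48 ch-run /var/tmp/mpi_app -- /usr/local/bin/mpi_app    # launch rootless via Slurm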

Container platform choice

Shifter:

▶ A platform for running Docker containers, developed by the National Energy Research Scientific Computing Center (NERSC) and deployed in production on a Cray system running Slurm.

▶ Shifter uses its own image format, to which both Docker images and VMs are converted. The Docker engine is replaced by an image manager that handles the newly formatted images.

▶ Although Shifter has proven useful for managing HPC workflows, to use images in Shifter users must submit them to a root-controlled image gateway via a RESTful API.

▶ The main drawback of Shifter is the lack of community support.

▶ Supports MPI via the MPICH Application Binary Interface (ABI); see the sketch below.
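For illustration, a typical Shifter invocation on a Slurm system looks roughly like the following sketch; the image name and application binary are hypothetical:

$ shifterimg pull docker:myuser/mpi_app:latest            # image is converted by the image gateway
$ srun -n 64 shifter --image=docker:myuser/mpi_app:latest /usr/local/bin/mpi_app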

Container platform choice

Singularity:

▶ Singularity is the third and last HPC-specific container platform (after shifter and charliecloud). It is developed and maintained by Sylabs Inc.

▶ Unlike Shifter, it does not have a complex architecture (with an image manager and gateway) and can simply be installed as an environment module.

▶ Unlike Podman and Charliecloud, it does not require kernel user namespaces to run rootless containers, and thus can be deployed on old Linux kernels.

▶ Singularity supports both the Docker image format and its own native Singularity Image Format (SIF).

▶ Singularity is the most widely deployed container platform on HPC systems; currently there are 25,000+ systems running Singularity.

▶ Has built-in support for MPI (OpenMPI, MPICH, Intel MPI); a basic workflow sketch follows.
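A minimal sketch of the basic Singularity workflow; image names and the application path are hypothetical, and the exact pull/exec behaviour depends on the Singularity version installed:

$ singularity pull docker://ubuntu:16.04                  # convert a Docker image into a local image file
$ singularity exec ubuntu_16.04.sif cat /etc/os-release   # run a command inside the container
$ mpirun -np 48 singularity exec mpi_app.sif /usr/local/bin/mpi_app   # hybrid MPI launch, detailed later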

Singularity Workflow

Singularity and MPI

Container platform choice

Singularity:

▶ Data analysis example: Computing principal components for the first 10,000 variants from the 1000 Genomes Project chromosomes 21 and 22:

wget https://…/chr21.head.vcf.gz
wget https://…/chr22.head.vcf.gz

LANG=C CHUNKSIZE=10000000 \
mpirun -x LANG -x CHUNKSIZE -np 2 \
    singularity run -H $(pwd) variant_pca.img

OpenMPI Singularity workflow

▶ From shell (or resource manager) mpirun gets called

▶ mpirun forks and execs the ORTE (Open Run-Time Environment) daemon

▶ ORTED (ORTErun) process creates PMI

▶ ORTED forks as many children as the number of processes per node requested

▶ The ORTED children exec the original command passed to mpirun (Singularity)

▶ Each Singularity execs the command passed inside the given container

▶ Each MPI program links in the dynamic Open MPI libraries (ldd)

▶ The OpenMPI libraries continue to open the non-ldd shared libraries (dlopen)

▶ OpenMPI libraries connect back to original ORTED via PMI

▶ All non- communication occurs through the PMI and then to local interfaces (e.g. InfiniBand)

Problem Statement

Singularity containers: portability issue

▶ Requirements:

▶ To use InfiniBand, the container must support it.

▶ To use MPI, the container must support it.

▶ Matching version inside and outside the container

▶ Hybrid MPI approach for parallel multinode containerized applications

▶ mpirun singularity run thecontainer.simg parallel_program

Singularity & OpenMPI: Exact matching version

Host OpenMPI (rows) vs. Container OpenMPI (columns):

            1.10.2  2.0.0  2.0.1  2.0.3  2.1.1
  1.10.2      ✓       ✗      ✗      ✗      ✗
  2.0.0       ✗       ✓      ✗      ✗      ✗
  2.0.1       ✗       ✗      ✓      ✗      ✗
  2.0.3       ✗       ✗      ✗      ✓      ✗
  2.1.1       ✗       ✗      ✗      ✗      ✓

(✓ = works, ✗ = fails)

! OpenMPI and PMI compatibility involved

Singularity & PMIx: Exact matching version

Host OpenMPI (rows) vs. Container OpenMPI (columns):

            2.0.0  2.0.1  2.0.3  2.1.1  3.0.0
  2.0.0       ✓      ✗      ✗      ✗      ✓
  2.0.1       ✗      ✓      ✗      ✗      ✓
  2.0.3       ✗      ✗      ✓      ✗      ✓
  2.1.1       ✗      ✗      ✗      ✓      ✓
  3.0.0       ✓      ✓      ✓      ✓      ✓

(✓ = works, ✗ = fails)

! OpenMPI starts supporting cross-version compatibility from v3.0.0

Install the same OpenMPI version

▶ Pros:

▶ The containers' philosophy is not compromised.

▶ We can ensure 100% ABI compatibility.

Install the same OpenMPI version

▶ Cons:

▶ High system administrator interaction: the system administrator is responsible for installing every needed version of Open MPI on the host, as well as installing and configuring each new version that is released.

▶ We need to inspect every single container in order to know which Open MPI version is installed inside. Once this is known, we need to deploy the exact matching version on the host (see the sketch below).
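To make the approach concrete, here is a minimal sketch of a Singularity definition file that pins the container's Open MPI to the same version as the host; the base image, Open MPI version and download URL are assumptions for illustration:

Bootstrap: docker
From: ubuntu:16.04

%post
    apt-get update && apt-get install -y build-essential wget
    # The Open MPI version below must exactly match the version installed on the host
    wget https://download.open-mpi.org/release/open-mpi/v2.1/openmpi-2.1.1.tar.gz
    tar xzf openmpi-2.1.1.tar.gz && cd openmpi-2.1.1
    ./configure --prefix=/usr/local && make -j4 && make install && ldconfig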

MPI Binding

▶ Bind the OpenMPI libraries and their dependencies located at /usr and /lib64 (continuous lines in the figure).

▶ Create directories with symbolic links to the selected libraries. These links will be properly resolved inside the container (the "filtered" folder).

▶ Bind the directories with symbolic links (dashed lines in the figure).

▶ Export LD_LIBRARY_PATH, prepending the bind-mounted Open MPI libraries, their dependencies and the directories with symbolic links; see the sketch below.
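A rough sketch of what this binding approach can look like at run time; the library paths, the "filtered" directory location and the image name are assumptions for illustration:

# Bind the host Open MPI libraries, their dependencies and the "filtered" link directory
$ export SINGULARITY_BINDPATH="/usr/lib64/openmpi,/lib64,/opt/filtered"
# Prepend the bound locations to the loader path seen inside the container
$ export SINGULARITYENV_LD_LIBRARY_PATH="/usr/lib64/openmpi/lib:/opt/filtered:/usr/lib64"
$ srun -n 48 singularity exec mpi_app.simg /usr/local/bin/mpi_app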

SBATCH script for reserving resources on the cluster
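The script shown on the slide is not reproduced in the text; the following is a minimal sketch of an SBATCH script along these lines, with assumed module names, resource sizes and image name:

#!/bin/bash
#SBATCH --job-name=singularity-mpi
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=24
#SBATCH --time=01:00:00

module load singularity openmpi/2.1.1        # assumed module names
srun singularity exec mpi_app.simg /usr/local/bin/mpi_app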

Portability

Singularity & OpenMPI: Bind-mount host MPI

Host OpenMPI (rows) vs. Container OpenMPI (columns):

            1.10.2  2.0.0  2.0.1  2.0.3  2.1.1
  1.10.2      ✓      ✓ !    ✓ !    ✓ !    ✓ !
  2.0.0       ✓       ✓      ✓      ✓      ✓
  2.0.1       ✓       ✓      ✓      ✓      ✓
  2.0.3       ✓       ✓      ✓      ✓      ✓
  2.1.1       ✓       ✓      ✓      ✓      ✓

(✓ = works)

! Symbol size warning

Singularity portability

PMIx cross-version compatibility https://pmix.org/support/faq/how-does-pmix-work-with-containers/

▶ Starting with v2.1.1, all versions are fully cross-compatible.

▶ PMIx v1.2.5 servers can only serve v1.2.x clients, but v1.2.5 clients can connect to v2.0.3, v2.1.1 or higher servers.

▶ PMIx v2.0.3 servers can only serve v2.0.x and v1.2.5 clients, but v2.0.3 clients can connect to v2.1.1 or higher servers.

! Tests partially performed @ Galileo (CINECA)
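When the launcher on the host side is Slurm, the PMIx version in play is the one behind Slurm's PMIx plugin. A hedged example of launching a containerized MPI application through it; the plugin name available depends on how Slurm was built, and the image is hypothetical:

$ srun --mpi=list                                         # list the PMI plugins this Slurm installation provides
$ srun --mpi=pmix -n 48 singularity exec mpi_app.simg /usr/local/bin/mpi_app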

Singularity portability

Portability: More complete view

OpenMPI    PMIx    Slurm
v3.x       v2.x    v17.11.0+
v2.x       v1.x    v16.05+

MPI Binding

▶ Pros:

▶ Medium system administrator interaction: the system administrator must install and configure at least one Open MPI version per available major version.

▶ We can use the srun process manager as we are using exactly the same PMI.

▶ High interoperability level.

▶ The container does not need to have specific hardware-related libraries installed.

MPI Binding

▶ Cons:

▶ The containers' philosophy is compromised. However, this does not damage the containers, and they can still be used in different environments and on different hosts without changes.

▶ In general, we cannot ensure 100% ABI compatibility. Refer to the ABI tracker page for more information.

▶ As with the previous approach, we need to inspect every single container in order to know which Open MPI major version is installed inside.

▶ All containers must go through a bootstrap process to make their environment suitable for this configuration (create a series of folders, make links, etc.)

Validation and benchmarking

Test case 1 @ CINECA: Quantum Espresso containerized versus bare-metal execution using the Intel MPI library and the Intel OPA interconnection network

▶ A Singularity container has been built by installing the Intel MPI library and all drivers necessary to use the Intel OPA interconnection network available on MARCONI. The version of the Quantum Espresso code used is 6.3, compiled with Intel Parallel Studio 2018 Update 4. The container has been built using Singularity 3.0.1. As a test case, a slab of zirconium silicide, a mineral consisting of zirconium and silicon atoms, has been used. The slab consists of 108 atoms in total, with a k-point mesh of 6 x 6 x 5. A sketch of the containerized launch is shown below.
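For illustration only, a containerized Quantum Espresso run of this kind is typically launched roughly as follows; the image name, input file and bind path are hypothetical, and the core count is one of the points reported on the slide:

$ mpirun -np 6144 singularity exec --bind /scratch qe-6.3-intelmpi.sif pw.x -input zrsi2_slab.in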

▶ Comparison of the performance, in terms of execution time, of bare metal vs. container while increasing the number of cores up to 6144 for the input provided (i.e. up to 128 MARCONI Skylake nodes). The total execution time is reported for the entire code (PWSCF) and for the Fastest Fourier Transform in the West (FFTW), which is one of the most time-consuming parts. Each computation has been repeated 10 times, and the displayed values in the figures include minimum, median, and maximum.

▶ The calculations were repeated 5 times and averaged over the set of data. The figure shows a negligible difference in the total execution time between bare metal and container up to 6144 cores (128 nodes).

Test case 2 @ CINECA: containerized versus bare-metal execution using Open MPI and Intel MPI

▶ The performance of two different MPI libraries has been considered: Intel MPI and Open MPI. In both cases the Quantum Espresso code is used, simulating a system of 24 zirconium and silicon atoms with a K-point mesh of 6 x 19 x 13, pure MPI. A Singularity container with Open MPI 2.1.1 and Quantum Espresso 6.4 has been built.

▶ The total execution time is reported for the entire code (PWSCF) and for the Fastest Fourier Transform in the West (FFTW). Each computation has been repeated 10 times, and the displayed values in the figures include: minimum, median, and maximum. The overhead introduced by containerized execution compared to bare metal is negligible.

▶ Comparing the performance of the two versions of Quantum Espresso compiled with the two different MPI libraries, it is noted that QE compiled with Intel MPI is about 5 times faster than the version compiled with Open MPI. This is expected and is not affected by the containerization procedure.

Use cases and benchmarking results at FinisTerrae II @ CESGA

▶ The benchmarks were performed in order to demonstrate that Singularity is able to take advantage of the HPC resources, in particular the InfiniBand network and RAM. For these benchmarks we used a base Singularity image with an Ubuntu 16.04 (Xenial) OS and several OpenMPI versions, taking the MPI cross-version compatibility issue into account.

▶ In this case, two FinisTerrae II nodes (48 cores) were used to run 10 repetitions of this benchmark, natively and within a Singularity container, with a global array size of 7.6 x 10^8, which is large enough not to fit in cache.

OSU micro-benchmarks: InfiniBand bandwidth and latency

STREAM: memory bandwidth

Use cases and benchmarking results at FinisTerrae II @ CESGA

▶ As can be seen in the figures, the obtained bandwidth rates are very close between the native execution and the execution performed from a Singularity container; the differences are negligible.

▶ The InfiniBand network also has a decisive impact on parallel application performance, so we benchmarked it from Singularity containers as well. We used the base Singularity container with three different OpenMPI versions (1.10.2, 2.0.0 and 2.0.1) together with the OSU micro-benchmarks, a suite of synthetic standard tests for InfiniBand networks developed by the MVAPICH team. In particular, among the tests included we ran those related to point-to-point communication, in order to obtain results for typical properties such as latency and bandwidth. Only two cores on different nodes were used for this benchmark.

▶ Latency tests are carried out in a ping-pong fashion. Many iterations of message send and receive cycles were performed, varying the size of the exchanged messages (window size) and the OpenMPI version used; a sketch of such a run follows.
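A minimal sketch of how such a point-to-point OSU run can be launched from a container; the image name, host names and benchmark install path are assumptions for illustration:

$ mpirun -np 2 --host node1,node2 \
    singularity exec osu-ubuntu16.04.simg \
    /usr/local/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_latency
$ mpirun -np 2 --host node1,node2 \
    singularity exec osu-ubuntu16.04.simg \
    /usr/local/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_bw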

Future work

▶ Singularity:

▶ GPU benchmarks

▶ (More on) OpenMPI & PMIx cross-version compatibility

▶ Support:

▶ Other container technologies (Udocker, same approach)

▶ Other resource managers

▶ Containers storage and transfer

▶ Container-Hub technologies (Docker registry, SRegistry, etc.)

▶ Alternatives? (GridFTP, etc.)

Acknowledgements

This work has been supported and funded by the following projects:

▶ PRACE (Partnership for Advanced Computing in Europe), which is funded in part by the EU’s Horizon 2020 Research and Innovation programme (2014-2020) under grant agreement 730913.

▶ MSO4SC (Mathematical Modelling, Simulation and Optimization for Societal Challenges with Scientific Computing) that has received funding from the EU's Horizon 2020 research and innovation programme under grant agreement 731063.

▶ HPC-Europa3 project (grant agreement 730897).

THANK YOU FOR YOUR ATTENTION

www.mso4sc.eu www.prace-ri.eu

High Performance Linpack: CPU performance

(Figure: results shown on a log scale)

Validation and benchmarking

Parallel filesystem benchmark with IOZone
