Evaluation and Benchmarking of Singularity MPI Containers on EU Research E-Infrastructures
Víctor Sande Veiga, Manuel Simon, Abdulrahman Azab, Giuseppa Muscianisi, Carlos Fernandez, Giuseppe Fiameni, and Simone Marocchi
CIMNE - International Centre for Numerical Methods in Engineering, Spain | Nokia, France | CESGA - Centro de Supercomputación de Galicia, Spain | University of Oslo, Norway | CINECA - Interuniversity Consortium, Italy
CANOPIE-HPC@SC19 | www.mso4sc.eu | www.prace-ri.eu

PRACE - Partnership for Advanced Computing in Europe
MSO4SC - Mathematical Modelling, Simulation and Optimization for Societal Challenges with Scientific Computing

CESGA: Supercomputing Center of Galicia
▶ 328 TFLOPS
▶ 44.8 TB of RAM
▶ 1.5 PB of disk capacity

CINECA: Interuniversity Consortium, Italy
MARCONI is the Tier-0 CINECA system based on the Lenovo NeXtScale platform. The current configuration of MARCONI consists of:
▶ 3600 Intel Knights Landing nodes, each equipped with 1 Intel Xeon Phi 7250 @ 1.4 GHz, with 68 cores and 96 GB of RAM, also named MARCONI A2 - KNL
▶ 3216 Intel Skylake nodes, each equipped with 2 Intel Xeon 8160 @ 2.1 GHz, with 24 cores each and 192 GB of RAM, also named MARCONI A3 - SKL

Containers

Native vs. Containers vs. VMs

Why containers?
▶ Without containers: "I need software X, and here is the installation guide, please install it!"
▶ With containers: "I need software X, here is the name of its Docker image, please pull it."
▶ Very little performance degradation compared to native execution.
▶ Security: SELinux, capability whitelists, syscall whitelists, and user namespaces.

What containers don't solve
▶ Hardware architecture and kernel incompatibility.
▶ Operational maintenance mess (e.g. two different versions of MPI).
▶ Containers are not for huge software packages, e.g. Bio-Linux. To package those in one unit, VMs are more suitable.

MPI: Message Passing Interface
(figure source: https://devblogs.nvidia.com/introduction-cuda-aware-mpi/)

OSI network model and MPI
(figure source: https://medium.com/@int0x33/day-51-understanding-the-osi-model-f22d5f3df756)

General MPI program structure
(figure source: https://computing.llnl.gov/tutorials/mpi/)

History
▶ May 1994: final version of MPI-1.0 released
▶ MPI-1.1 (Jun 1995)
▶ MPI-1.2 (Jul 1997)
▶ MPI-1.3 (May 2008)
▶ 1998: MPI-2 picked up where the first MPI specification left off and addressed topics which went far beyond the MPI-1 specification.
▶ MPI-2.1 (Sep 2008)
▶ MPI-2.2 (Sep 2009)
▶ Sep 2012: the MPI-3.0 standard was approved.
▶ MPI-3.1 (Jun 2015)
▶ Current: the MPI-4.0 standard is under development.
(source: https://computing.llnl.gov/tutorials/mpi/)

Most popular distributions
▶ OpenMPI
▶ MPICH
▶ MVAPICH2
▶ Intel MPI

Common MPI Application Binary Interface (ABI)
▶ Starting from Intel MPI 5.0 and MPICH 3.1 (and MVAPICH2 1.9 and higher, which is based on MPICH 3.1), the libraries are interchangeable at the binary level through a common ABI.
▶ In practice this means that one can build an application with MPICH but run it using the Intel MPI libraries, thus taking advantage of Intel MPI functionality (a sketch of this swap follows below).
https://software.intel.com/en-us/articles/using-intelr-mpi-library-50-with-mpich3-based-applications
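To make the ABI point above concrete, here is a minimal, hedged sketch of the swap: build against MPICH, then launch the same binary with the Intel MPI runtime. The file name hello_mpi.c and the Intel MPI install path are hypothetical placeholders, not taken from the slides.

$ mpicc -o hello_mpi hello_mpi.c                        # build with the MPICH compiler wrapper
$ source /opt/intel/impi/2019/intel64/bin/mpivars.sh    # assumed Intel MPI install path; puts Intel's mpirun and libmpi first in the environment
$ mpirun -np 4 ./hello_mpi                              # the unmodified MPICH-built binary now runs on the Intel MPI runtime

Because the two libraries share a common ABI, the dynamic loader resolves the MPI symbols against the Intel MPI build at run time; no recompilation is needed.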
PMI: Process Management Interface
PMI allows different process-management frameworks to interact with parallel programming libraries such as MPI.
Pavan Balaji et al. 2010. PMI: a scalable parallel process-management interface for extreme-scale systems. In Proceedings of the 17th European MPI Users' Group Meeting on Recent Advances in the Message Passing Interface (EuroMPI'10), Rainer Keller, Edgar Gabriel, Michael Resch, and Jack Dongarra (Eds.). Springer-Verlag, Berlin, Heidelberg, 31-41.

Container platform choice

Container platform choice: Docker
▶ Not designed for running HPC distributed applications.
▶ Relies on the Docker engine and daemon being installed and running on each node.
▶ Each container has its own process, which is fired up by the Docker daemon.
▶ Containers run by default as root. While this can be mitigated by using user namespaces, users with access to the main docker command can easily escape this limitation, which introduces a security risk.
▶ Each Docker engine has its own image repository that, so far, cannot be shared with other Docker engines. This means that to run a parallel application X on e.g. 100 nodes, the X Docker image needs to be replicated 100 times, one per node.
▶ Docker so far has no support for MPI.

Container platform choice: Podman
▶ Developed and maintained by Red Hat.
▶ A lightweight runtime for managing and running Docker containers without the overhead of the Docker engine and daemon.
▶ Not primarily designed for HPC workloads and currently lacking support for them.
▶ Podman can run rootless, but to do that (similar to Charliecloud) it uses user namespaces.
▶ Its rootless mode is not compatible with most parallel filesystems.

Podman MPI example
$ mpirun --hostfile hostfile \
    --mca orte_tmpdir_base /tmp/podman-mpirun \
    podman run --env-host \
    -v /tmp/podman-mpirun:/tmp/podman-mpirun \
    --userns=keep-id \
    --net=host --pid=host --ipc=host \
    quay.io/adrianreber/mpi-test:30 /home/ring
Rank 2 has cleared MPI_Init
Rank 2 has completed ring
Rank 2 has completed MPI_Barrier
Rank 3 has cleared MPI_Init
Rank 3 has completed ring
…
https://podman.io/blogs/2019/09/26/podman-in-hpc.html

Container platform choice: Charliecloud
▶ An open source project based on a user-defined software stack (UDSS).
▶ Designed to manage and run Docker-based containers; one important advantage is that it is very lightweight.
▶ Enables running containers as the user by implementing user namespaces.
▶ The main disadvantage is the requirement of a recent kernel (4.4 is recommended) that supports user namespaces. Many EU HPC clusters still run old kernels (< 3.10) on their compute nodes, i.e. without user namespace support, which is a showstopper for Charliecloud.
▶ Uses Slurm srun to run containers with MPI installed, as sketched below.
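As a hedged illustration of the srun-based launch mentioned in the last Charliecloud bullet, the sketch below assumes a hypothetical unpacked image directory /var/tmp/mpi_hello containing an MPI binary at /usr/local/bin/mpi_hello, built against an MPI library with PMI2 support; all names are illustrative, not from the slides.

# Slurm provides process management via PMI2; ch-run starts the container as the calling user
$ srun --ntasks=8 --mpi=pmi2 ch-run /var/tmp/mpi_hello -- /usr/local/bin/mpi_hello

In this mode no mpirun is needed on the host: Slurm hands the launched ranks to the containerized MPI library through its PMI2 interface.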
Container platform choice: Shifter
▶ A platform for running Docker containers that is developed by the National Energy Research Scientific Computing Center (NERSC) and is deployed in production on a Slurm cluster of Cray supercomputers.
▶ Shifter uses its own image format, to which both Docker images and VMs are converted. The Docker engine is replaced by an image manager that handles the newly formatted images.
▶ Despite the fact that Shifter has proven useful for managing HPC workflows, to use images in Shifter users must submit them to a root-controlled image gateway via a RESTful API.
▶ The main drawback of Shifter is the lack of community support.
▶ Supports MPI, relying on the MPICH Application Binary Interface (ABI).

Container platform choice: Singularity
▶ Singularity is the third and last HPC-specific container platform considered here (after Shifter and Charliecloud). It is developed and maintained by Sylabs Inc.
▶ Unlike Shifter, it does not have a complex architecture (image manager and gateway) and can simply be installed as an environment module.
▶ Unlike Podman and Charliecloud, it does not require kernel user namespaces to run rootless containers, and thus can be deployed on old Linux kernels.
▶ Singularity supports both the Docker image format and its own native Singularity Image Format (SIF).
▶ Singularity is the most widely deployed container platform on HPC systems; currently there are 25,000+ systems running Singularity.
▶ Has built-in support for MPI (OpenMPI, MPICH, Intel MPI).

Singularity Workflow

Singularity and MPI

Container platform choice: Singularity
▶ Data analysis example: computing principal components for the first 10,000 variants from the 1000 Genomes Project chromosomes 21 and 22:
wget https://…/chr21.head.vcf.gz
wget https://…/chr22.head.vcf.gz
LANG=C CHUNKSIZE=10000000 mpirun -x LANG -x CHUNKSIZE -np 2 singularity run -H $(pwd) variant_pca.img

OpenMPI Singularity workflow
▶ From the shell (or resource manager), mpirun gets called.
▶ mpirun forks and execs the ORTE (Open Run-Time Environment) daemon.
▶ The ORTED (ORTErun) process creates the PMI.
▶ ORTED forks a number of children equal to the number of processes per node requested.
▶ The ORTED children exec the original command passed to mpirun (Singularity).
▶ Each Singularity execs the command passed inside the given container.
▶ Each MPI program links in the dynamic Open MPI libraries (ldd).
▶ The Open MPI libraries continue to open the non-ldd shared libraries (dlopen).
▶ The Open MPI libraries connect back to the original ORTED via PMI.
▶ All non-shared-memory communication occurs through the PMI and then over local interfaces (e.g. InfiniBand).

Problem Statement