Universitat Politècnica de Catalunya (UPC) - BarcelonaTech

Facultat d'Informàtica de Barcelona (FIB)

Master in Innovation and Research in Informatics

High-Performance Computing

Containers in HPC: Is it worth it?

Author: Kilian Peiro
Advisor: Filippo Mantovani

Tutor: Eduard Ayguade
Co-advisor: Marta Garcia-Gasulla

Department: Computer Architecture
Company: Barcelona Supercomputing Center

June 2020

Abstract

The usage of container technologies has been growing in data centers and supercomputer facilities during the last years. More and more people are benefiting from the portability, flexibility and other traits of containers. Although some evaluation has been performed in the field of HPC, there is still a lot of research to be done on the matter. Moreover, the system-wide deployment and usage of a container infrastructure in an HPC system requires a non-negligible effort, so part of the study is also to analyze whether the deployment effort is worth it. In this thesis I study the performance and energy consumption of containers with HPC workloads of real scientific applications (OpenFOAM, GROMACS) on emerging HPC architectures (Armv8). In addition, I make some suggestions on the Singularity deployment from a user and a System Administrator perspective. Results show almost no performance or energy differences between the executions of applications run inside and outside of a container when tested on a small number of compute nodes. Containerization with Singularity seems overall a good tool to have in an HPC system.

In appreciation for:

My family, for the mutual affection and support through the years.

My friends, for each and every one good moment.

My coworkers, for unknowingly offering the best environment one could imagine.

Music, for being and not being there as pleased.

Contents

1 Introduction
  1.1 Competences

2 Background
  2.1 HPC computing systems
  2.2 HPC cluster deployment
  2.3 HPC cluster usage
  2.4 Environment modules
  2.5 Virtualization
  2.6 Containerization
  2.7 Benchmarks
  2.8 Applications
  2.9 State of the art

3 Research questions

4 Implementation
  4.1 Bare-metal Software Stack
  4.2 Singularity Image Setup
  4.3 Methodology
  4.4 Testing the OSU benchmarks
  4.5 Testing OpenFOAM
  4.6 Testing GROMACS

5 Evaluation
  5.1 Startup time
  5.2 OSU Benchmarks
  5.3 OpenFOAM
  5.4 GROMACS

6 Conclusions
  6.1 Further work

Acronyms

Appendix A Reproducibility

1 Introduction

This Master Thesis is the result of part of the work done at the Barcelona Supercomputing Center (BSC) 1 from February 2020 to June 2020. I work at BSC as a High-Performance Computing (HPC) System Administrator, and I am in charge of maintaining several Arm-based HPC clusters from the Mont-Blanc 2, Mont-Blanc 3 and Mont-Blanc 2020 European projects 2, as well as the hardware and software infrastructure for the European Processor Initiative (EPI) 3 project. Due to my experience with the Arm architecture, I have also collaborated in Mont-Blanc 3 Work Package 6, analyzing the performance of applications and mini-applications and benchmarking on the project test platforms. I have also been part of the UPC-BSC ISC Student Cluster Competition 4 teams of 2018, 2019 and 2020, the first one as a student and the latter two as an advisor. These two environments have made me grow as a System Administrator but also as a researcher, searching for state-of-the-art methods to configure HPC clusters so that users can maximize the performance of their applications with minimal effort.

Virtual Machines and containers were developed many years ago for different purposes, which include portability, Bring Your Own Environment (BYOE), composability and version control integration. Some of these traits are beneficial for the world of HPC, but both containers and Virtual Machines used to have a performance downside incompatible with the HPC vision. New HPC-oriented containers, such as Singularity, have been emerging recently, and it is our duty as researchers to test and measure whether it is worth using containers in the HPC world. It is also important to measure the energy consumption of these containers.

In this thesis I try to tackle these challenges by performing tests with two real scientific applications – OpenFOAM and GROMACS – comparing bare-metal executions with executions inside a Singularity container. I selected OpenFOAM and GROMACS because they are complex applications that use a relevant fraction of computing time on several data centers around the globe, and also on the MareNostrum supercomputer at BSC. I also explain parts of the procedure of creating Docker and Singularity images, and how to run applications inside a container. All tests are performed on Arm-based clusters, since the Arm architecture has grown in importance in HPC during the last months, up to reaching the top of the Top500 list in the June 2020 edition with the Fugaku 5 supercomputer by Fujitsu, powered by Arm.

My results show that the average setup time with Singularity is negligible (less than one second), and that in general the bare-metal version performs between 1% and 12% faster than the Singularity version. Accordingly, the bare-metal version consumes proportionally less energy than the Singularity version.

To conclude, the experiments show that Singularity is capable of running applications within the HPC standards, obtaining results similar to the bare-metal version. From the point of view of a System Administrator, it can be a valuable piece of software to install on an HPC machine, since users may benefit from the traits that Singularity offers. Nevertheless, throughout the thesis I have found several handicaps when using Singularity containers, so, as a personal opinion, the software still needs more development before becoming a production tool.

1 https://www.bsc.es/
2 https://www.montblanc-project.eu/project/presentation
3 https://www.european-processor-initiative.eu/
4 https://www.isc-hpc.com/student-cluster-competition.html
5 https://www.r-ccs.riken.jp/en/fugaku/project

The document is structured as follows:

• Section 1 introduces the thesis ideas and contents.

• Section 2 explains the background research done and the state of the art.

• Section 3 explains the ideas derived from the background and the research questions that guided my exploration.

• Section 4 explains all the setup and methodology done for the thesis experiments.

• Section 5 explains and analyzes the results obtained.

• Section 6 explains the final thoughts of the thesis, based on the results obtained.

• Before the appendix, there is an acronyms section.

• Appendix A is a guide on how to reproduce the work performed for this thesis.

1.1 Competences

This thesis is part of the Master in Innovation and Research in Informatics (MIRI) 6 of the Facultat d'Informàtica de Barcelona (FIB) 7 and the Universitat Politècnica de Catalunya (UPC) 8, and it covers part of the list of competences of the Master. The list of competences covered is written below.

Transversal Competences:

• CTR3: Capacity of being able to work as a team member, either as a regular member or performing directive activities, in order to help the development of projects in a pragmatic manner and with sense of responsibility; capability to take into account the available resources.

• CTR4: Capability to manage the acquisition, structuring, analysis and visualization of data and information in the area of informatics engineering, and critically assess the results of this effort.

• CTR5: Capability to be motivated by professional achievement and to face new challenges, to have a broad vision of the possibilities of a career in the field of informatics engineering. Capability to be motivated by quality and continuous improvement, and to act strictly on professional development. Capability to adapt to technological or organizational changes. Capacity for working in absence of information and/or with time and/or resources constraints.

• CTR6: Capacity for critical, logical and mathematical reasoning. Capability to solve problems in their area of study. Capacity for abstraction: the capability to create and use models that reflect real situations. Capability to design and implement simple experiments, and analyze and interpret their results. Capacity for analysis, synthesis and evaluation.

6 https://www.fib.upc.edu/en/studies/masters
7 https://www.fib.upc.edu/en/
8 https://www.upc.edu/en

• CB6: Ability to apply the acquired knowledge and capacity for solving problems in new or unknown environments within broader (or multidisciplinary) contexts related to their area of study.

• CB8: Capability to communicate their conclusions, and the knowledge and rationale underpinning these, to both skilled and unskilled public in a clear and unambiguous way.

• CB9: Possession of the learning skills that enable the students to continue studying in a way that will be mainly self-directed or autonomous.

Generic Competences:

• CG1: Capability to apply the scientific method to the study and analysis of phenomena and systems in any area of Computer Science, and in the conception, design and implementation of innovative and original solutions.

• CG4: Capacity for general and technical management of research, development and innovation projects, in companies and technology centers in the field of Informatics Engineering.

• CG5: Capability to apply innovative solutions and make progress in the knowledge to exploit the new paradigms of computing, particularly in distributed environments.

High Performance Computing Competences:

• CEE4.2: Capability to analyze, evaluate, design and optimize software considering the architecture and to propose new optimization techniques.

• CEE4.3: Capability to analyze, evaluate, design and manage system software in supercomputing environments.

2 Background

2.1 HPC computing systems

An HPC system consists of several machines connected by a network using network switches. Usually, the machines also share some kind of data storage. Each of the machines is called a node, and inside a node we can find Central Processing Unit (CPU) sockets, with one CPU per socket. Figure 1 shows a cluster of two racks, each of them with 8 nodes, sharing network storage.

Figure 1: Scheme of an HPC cluster

An HPC CPU can have a large number of cores, and different memory hierarchies and architectures. Figure 2 shows a 32-core CPU with a shared L3 cache and 8 DDR4 memory channels.

Figure 2: Scheme of an HPC CPU

The HPC computing system for this thesis is the Dibona platform, a cluster integrated by Bull/ATOS within the framework of the European project Mont-Blanc 3 [1]. It integrates 24 Arm-based compute nodes, each powered by two Marvell ThunderX2 (TX2) CN9980 processors 9, with 32 Armv8 cores at 2.0 GHz, 32 MB of L3 cache and 8 DDR4-2666 memory channels per processor. Each compute node has 256 GB of Random-access memory (RAM).

9https://en.wikichip.org/wiki/cavium/thunderx2/cn9980

The interconnect consists of a fat-tree network, implemented with Mellanox IB EDR-100 switches. A secondary 1 GbE network is employed for the management of the cluster and the Network FileSystem (NFS). Dibona runs Red Hat Enterprise Linux Server release 7.5 with kernel v4.14.0 and it uses SLURM 17.02.11 as job scheduler.

2.2 HPC cluster deployment

HPC clusters and supercomputers are used all over the world for different purposes. The composition of a cluster makes a large number of resources available to several users at the same time or, in the opposite case, allows all the resources to be used by just one application. Over the years, different levels of parallelism have been developed, as well as different specialized compilers for HPC use.

2.2.1 Parallelism

An important trait of an HPC cluster is the ability to exploit parallelism, since some embarrassingly parallel applications can be run on thousands of cores simultaneously. There are different levels of parallelism, and different tools and languages to have in a cluster that exploit parallelization in some way. In this study I focus on OpenMP and Open MPI.

OpenMP 10 is a specification for a set of compiler directives, library routines and environment variables that can be used to specify parallelism in Fortran and C/C++ programs. OpenMP implements multithreading, a method of parallelizing where a master thread forks a number of slave threads and assigns work to them. The threads share the memory of the machine, and this type of parallelization is used inside a compute node, not between nodes. The first releases of OpenMP go back as far as 1998, and the specification has kept improving since then, with version 5.0 released in November 2018.

Open MPI 11 is an open-source implementation of the Message Passing Interface (MPI) library developed and maintained by a consortium of academia, research and industry. Among the goals of Open MPI are creating a good-quality MPI implementation with competitive performance aimed at HPC, and supporting different HPC platforms and environments. The Message Passing Interface that Open MPI implements lets processes communicate with each other, so this type of parallelization can be used between nodes, and it is ideal in a distributed memory system. The first releases of Open MPI go back as far as 2005 with the v1.0 series, and it has kept improving since then, with the v4.0.4 series released in June 2020.
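As a minimal illustration of how these two levels of parallelism are combined in practice, the commands below compile a hypothetical hybrid MPI+OpenMP source file with the Open MPI compiler wrapper and launch it with a few ranks and threads; the file name and the rank and thread counts are only examples and are not taken from the thesis experiments.

# Compile a hypothetical hybrid MPI+OpenMP program (hybrid_example.c is a placeholder name)
mpicc -fopenmp hybrid_example.c -o hybrid_example

# Run 4 MPI ranks, each spawning 8 OpenMP threads (example values)
export OMP_NUM_THREADS=8
mpirun -np 4 ./hybrid_example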

2.2.2 Compilers

Arm Allinea Studio 12 is a set of tools developed by Arm with the goal of squeezing the most performance out of Arm machines. The toolset is composed of a C/C++ compiler, a Fortran compiler, math libraries known as the Arm Performance Libraries, and a debugger, profiler and reporting tool known as Arm Forge.

I have made good use of the Arm HPC Compiler and the Arm Performance Libraries, since these tools have been updated and supported for more than five years, and have shown better performance in both benchmarks and applications.

10 https://www.openmp.org/
11 https://www.open-mpi.org/
12 https://developer.arm.com/tools-and-software/server-and-hpc/arm-allinea-studio

2.3 HPC cluster usage

The usual usage of a cluster from the user perspective can be seen in Figure 3. The user accesses the login node (e.g. via ssh), compiles source code on the login node if needed, and performs executions with the binary on the compute nodes, preferably in parallel. The user needs or expects certain compilers or libraries in order to do his or her work properly. The administrators of the cluster are in charge of deploying a set of tools in order to satisfy the needs of the users.

Figure 3: Usual usage of a cluster

There are multiple ways to organize this set of tools; some examples are environment modules and containers.

2.4 Environment modules

Environment modules is a tool that lets users modify their Linux user environment using modulefiles. A modulefile contains instructions to change shell environment variables such as PATH and LD_LIBRARY_PATH. Using this tool, one user can be compiling his or her software with the version of the GNU Compiler Collection (GCC) that he or she likes or needs, while another user compiles with the Arm HPC Compiler, both on the same machine and at the same time. Some examples of Environment Modules implementations are modules 13 and lmod 14.
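A minimal sketch of a typical modules session is shown below; the module name and version are hypothetical examples rather than the ones installed on Dibona.

module avail            # list the modulefiles available on the cluster
module load gcc/9.2.0   # hypothetical module: prepends the GCC 9.2.0 paths to PATH and LD_LIBRARY_PATH
module list             # show the modules currently loaded in this shell
module purge            # unload everything and return to a clean environment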

2.5 Virtualization

As compute nodes scaled up in resources such as the number of cores per node and the amount of RAM, data centers started to use virtual machines, because several operating systems could be run on the same machine.

In Figure 4 we can see how a Virtual Machine, a Docker daemon and a Singularity container work, respectively. The green boxes are the processes running on the bare-metal machine. The orange boxes are processes running inside the virtual machine or inside the container. The blue arrows represent the communications between processes or system layers.

13 http://modules.sourceforge.net/
14 https://lmod.readthedocs.io/en/latest/

7 Figure 4: Scheme of the Virtual Machine, Docker and Singularity layers, respectively

2.5.1 Virtual Machines

A virtual machine is an emulator of a computer system. This type of virtualization lets a user run any application with any architecture and any compiler they want. Since the emulator is running on the actual machine, there is a non-negligible use of machine resources. This means that applications that require high bandwidth or I/O may underperform inside a Virtual Machine. As can be seen in Figure 4, the Virtual Machine virtualizes and emulates every layer of the machine, from the file system to the hardware. The Virtual Machine acts as a driver between the virtual and the physical layers, letting virtual applications use physical resources via the hypervisor. Some examples of Virtual Machines are Linux KVM 15 (Kernel-based Virtual Machine) and QEMU 16.

2.6 Containerization

As we have seen, Virtual Machines emulate a whole machine. In the case of containers, only the operating system is virtualized. This reduces management overhead and improves performance. The main reasons behind using containers for HPC are portability, composability and BYOE. Portability is the ability to rebuild, layer or share a container between machines or users. Composability is the ability to define explicitly the composition of the software environment inside the image. BYOE is the ability to run a container on any machine out-of-the-box.

2.6.1 Docker

Docker 17 is the industry-leading container engine. A container works on top of an operating system, and it is run by the system kernel. Since it does not virtualize hardware, it does not use as many resources as a virtual machine. Docker has several problems that affect its usage for HPC. The first one is that it introduces several security concerns that are critical in an HPC environment, such as needing root permissions or being root inside the container. The second one is that it uses TCP/IP for networking, and it is not trivial to make it use the custom interconnect of the HPC machine (for example, InfiniBand 18). Some developments have been made in order to solve these issues, but in the meantime other container software, such as Singularity, has been developed alongside. As can be seen in Figure 4, the Docker daemon creates a virtual OS that is the Docker container root, on top of the kernel space. Applications can then be executed inside the container.

15 https://www.linux-kvm.org/page/Main_Page
16 https://www.qemu.org/
17 https://www.docker.com/

2.6.2 Singularity

Singularity 19 is a different approach to container techniques: it was designed to work in HPC environments and it provides advantages over other container tools, such as the ability to restrict user privileges and simplified access to custom networks and Graphics Processing Units (GPUs). As can be seen in Figure 4, Singularity runs as a user command, creating a container within the user namespace. The container has the same user id as the user, and executes commands as the user. It also shares part of the host filesystem, such as the home of the user.
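As a quick illustration of this behavior, the commands below pull a small public image and run commands inside it as a regular user; the image is just an example, and the exact output depends on the site configuration.

# Pull a small public image from Docker Hub into a local SIF file (example image)
singularity pull docker://ubuntu:18.04

# Commands run inside the container execute with the calling user's identity,
# and the user's home directory is available inside the container by default
singularity exec ubuntu_18.04.sif whoami
singularity exec ubuntu_18.04.sif ls $HOME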

2.7 Benchmarks

In this section, I explain the background of the benchmarks that have been put to the test. These are not real applications, and their only purpose is to stress or test the machine.

2.7.1 OSU Micro-benchmarks

The Ohio State University (OSU) Micro-benchmarks 20 are used for testing the network of a machine or cluster. There are latency and bandwidth tests for several MPI calls such as gather, scatter, reduce, barrier, etc. The message size can also be varied, so the suite has good general coverage. It also offers Compute Unified Device Architecture (CUDA) support, and it is written mostly in C.

2.8 Applications

In this section, I explain the background of the applications that have been put to the test. These are real applications that scientists use on a daily basis for science and research around the globe.

2.8.1 OpenFOAM

OpenFOAM 21 is a free, open-source Computational Fluid Dynamics (CFD) software package. It is widely used in the field of HPC due to its wide range of solvers, from complex fluid flows to acoustics, solid mechanics and electromagnetics. There are three main variants of the application, and I am going to use the one by OpenCFD Ltd, since it is the most recent and the one with the most Docker support. OpenFOAM is written mostly in C++ and C. It is so large that it uses its own bash environment, and multiple third-party libraries are needed or highly recommended in order to perform better.

18 https://www.mellanox.com/pdf/whitepapers/IB_Intro_WP_190.pdf
19 https://singularity.lbl.gov/
20 https://mvapich.cse.ohio-state.edu/benchmarks/
21 https://www.openfoam.com/

2.8.2 GROMACS

GROMACS 22 stands for GROningen MAchine for Chemical Simulation, and it is a software package used to perform molecular dynamics. The project began in 1991 and it was primarily designed for biochemical molecules, but due to its good performance it is also being used nowadays for non-biological studies, and the developers are trying to extend it to Quantum Chemistry and Bioinformatics.

The software makes good use of all the machine components, with support for Single instruction, multiple data (SIMD), different topologies, etc. It also has CUDA-based GPU acceleration. It can run on multiple nodes using MPI, and also on a single node using OpenMP or thread-MPI. GROMACS is written mostly in C++ and C.

2.9 State of the art

Several papers have dealt with the question of container performance in the HPC environment, starting in 2013 with Xavier et al. [2], who experimented with LXC, OpenVZ and VServer containers and a Virtual Machine, showing little performance impact of the containers for the STREAM and IOzone benchmarks. In 2017, some researchers started to evaluate Singularity containers together with CUDA, such as Souza et al. [3] and Arango et al. [4], showing good performance results for benchmarks and applications running with CUDA on GPUs. Other container software designed for HPC environments, such as Shifter and Charliecloud, has been tested in Torrez et al. [5]. Different architectures have also been tested, like Arm in Acharya et al. [6], and Arm and Power9 in Rudyy et al. [7]. Containers in HPC clouds have also been evaluated: in Younge et al. [8], a comparison of performance between supercomputers running Singularity and Amazon EC2 using Docker is made. In Zhang et al. [9], different network types and memory modes are tested. Wang et al. [10] raise a question about performance when container images are too big.

To sum up, the results of these experiments show almost no performance drop when comparing Singularity containers to their bare-metal counterparts. Nevertheless, in order to get similar performance, some actions need to be taken on the base image, such as using the native MPI and CUDA libraries to take advantage of the custom network and the GPUs.

22http://www.gromacs.org/

3 Research questions

After studying the state of the art, and having a deeper knowledge of the research performed in the field of containers, I can formulate the research questions that extend the global knowledge of the use of containers in HPC.

Observation 1 – Containers have been in our life for several years now, and general-purpose containers, such as Docker, have shown bad performance in the world of HPC. So the logical path to take is to test state-of-the-art containers specialized in HPC, such as Singularity. Also, while in the literature we can find several evaluations of benchmarks leveraging container technologies, not a lot of effort has been put into testing real scientific applications.

Question 1 – How does Singularity behave when handling complex HPC applications? To address this question I have tried and tested Singularity with two well-known applications, OpenFOAM and GROMACS.

Observation 2 – After the slowing down of Moore's law, and after more than a decade of predominance of the x86 architecture, data centers are starting to populate with diverse architectures. The Top500 list 23 is known as the list of the 500 most powerful commercially available computer systems, based on a benchmark called LINPACK [11] and ordered by performance (TFlop/s). The list is updated twice a year, and the June 2020 edition shows that the top system is based on the Arm A64FX, while number 2 is an IBM Power9 22C system with NVIDIA Volta GV100 accelerators. The High-Performance Conjugate Gradient (HPCG) list 24 is the analogous ranking based on the HPCG benchmark [12], and its June 2020 edition shows the same two systems at the top. Neither machine is Intel-based, which means that knowing how to deal with different architectures is very important in the HPC scene.

Question 2 – How does Singularity behave on emerging HPC architectures? Since I have access to an HPC Arm cluster and the Arm architecture has not yet been deeply analyzed, I have performed my tests on the Dibona Arm-based cluster. Testing different architectures means not only switching hardware but also adapting the software infrastructure and ensuring its correct behavior. Arm provides a compiler tailored for HPC and scientific codes, known as the Arm HPC Compiler, and there is no research evidence on using it inside a Singularity container, so all applications have been compiled with the Arm HPC Compiler.

Observation 3 – Since Obama's executive order of 2015 25, several efforts have been made to advance the computational power and the efficiency of computing systems. The energy consumption of large HPC clusters is of primary importance, since achieving an exaflop is limited to 20 MW of power. It is important to measure whether adding software layers to a cluster affects energy consumption.

Question 3 – Which is the impact on the energy budget of container technologies when running HPC workloads?

23 https://www.top500.org/lists/top500
24 https://www.top500.org/lists/hpcg/
25 https://obamawhitehouse.archives.gov/the-press-office/2015/07/29/executive-order-creating-national-strategic-computing-initiative

The literature offers neither measurements nor analysis of the energy consumption of HPC applications running within container technologies, so I have obtained energy results alongside performance and execution time results.

4 Implementation

The purpose of my experiments is to compare bare-metal execution, i.e. execution without any container technology, against executions run in a Singularity container. To have a fair comparison between bare-metal and Singularity executions, I took care of configuring both environments in the most similar way possible. As introduced in Section 2.1, the cluster used for my tests is the Dibona cluster. For the bare-metal runs I had to study the software layers already deployed on Dibona and replicate them in a working Singularity image. The Singularity image comes from a Docker image, since Docker images are more manageable and migrating from Docker to Singularity is very well supported. The image runs Ubuntu 18.04.4 LTS, and there is one image per application, since putting all applications in the same image results in a very large image. Other studies have claimed that large images could cause performance issues [10].

Sections 4.1 and 4.2 are dedicated to providing the details of the software stack of the bare-metal and the Singularity container that I used for my evaluation.

4.1 Bare-metal Software Stack

The Dibona cluster offers multiple compilers and libraries through environment modules. The most up-to-date installed versions of the Arm Compiler and the MPI libraries were chosen for the experiments. Table 1 summarizes the bare-metal and Singularity configurations for each application.

                      OSU Benchmarks            OpenFOAM                  GROMACS
                      Bare-metal  Singularity   Bare-metal  Singularity   Bare-metal  Singularity
Open MPI              4.0.0       4.0.0         4.0.0       4.0.0         4.0.0       4.0.0
Arm HPC Compiler      19.1        19.3          19.1        19.3          19.1        19.3
Arm Perf. Libraries   No          No            19.1        19.3          19.1        19.3
cmake                 No          No            3.14.5      3.10.2        3.14.5      3.10.2

Table 1: Summary of Bare-metal and Singularity configurations for each application

4.2 Singularity Image Setup

For the Singularity part, each library needed on the bare-metal cluster was also compiled in the container. Some libraries are not exactly the same, because the source files could not be obtained. For applications that could benefit from the Arm Performance Libraries, the machine-specific flag thunderx2t99 was added. Table 1 summarizes the bare-metal and Singularity configurations for each application. The workflow for creating the image was as follows (a minimal sketch is given after the list):

• A Docker image is created via a Dockerfile, with every necessary library and the application compiled.

  – sudo docker build -t image:tag . is used for creating the image.
  – sudo docker run -ti image:tag /bin/bash is used for testing the image.

• The Docker image is transformed into a Singularity image.

  – singularity build test.sif docker-daemon://test:v1 is used for creating the image.
  – singularity shell test.sif is used for testing the image.
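The snippet below is a minimal, hypothetical sketch of this workflow written as a shell sequence: it writes a tiny Dockerfile, builds the Docker image and converts it into a Singularity image. The base image, package list and build step are placeholders and not the exact recipe used for the thesis images (those can be found in the repository referenced in Appendix A).

# Hypothetical sketch: write a minimal Dockerfile and turn it into a Singularity image.
# The package list and build steps are placeholders, not the exact thesis recipe.
cat > Dockerfile <<'EOF'
FROM ubuntu:18.04
RUN apt-get update && apt-get install -y build-essential cmake && \
    rm -rf /var/lib/apt/lists/*
COPY app/ /opt/app/
RUN cd /opt/app && make
CMD ["/bin/bash"]
EOF

sudo docker build -t app:v1 .                       # build the Docker image
singularity build app.sif docker-daemon://app:v1    # convert it to a read-only SIF image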

Kurtzer et al. [13] was very helpful for understanding in depth how Singularity works and the commands needed to manage an image.

Figure 5: Image creation and usage workflow.

Figure 5 shows the workflow of image creation and usage, from the Dockerfile to the Singularity command needed for running an application. The rectangle represents the Dockerfile. The rhomboids represent Docker or Singularity commands. The rounded-border rectangles represent Docker or Singularity images. The cloud represents the compilers and libraries in the image, used to build and run applications inside the image. The cylinder represents files or paths on the host that are needed for running an application correctly.

Guidelines for replicating each application can be found in Appendix A.

4.3 Methodology

For this study, one benchmark and two applications have been tested. Each test has been run five times, on different days, at different hours and on different nodes; a test was run again if the result was an outlier or if the application ended abruptly. After gathering five valid runs, I took the mean of the results.

4.3.1 Power monitoring

Dibona power drain is monitored by High Definition Energy Efficiency Monitoring (HDEEM) [14]. The HDEEM library is a software interface used to measure the power consumption of HPC clusters with bullx blades. Measurements are made via the Baseboard Management Controller (BMC) and a Field-Programmable Gate Array (FPGA) located on each compute node motherboard. The power monitoring devices installed on a Dibona node allow us to monitor the power drain of:

• the global board.

• the two TX2 CPUs.

• each DDR memory domain (four DIMM domains).

• the mezzanine board used for the InfiniBand interconnection.

Figure 6 shows the procedure used for the energy accounting measurements gathered for this thesis. I use the GPIO signals to restart and stop data collection in-band easily. Also, there is an ssh script that allows users to retrieve their measurements saved in the FPGA through the BMC.

Figure 6: Power monitor procedure for gathering energy measurements in Dibona

As background, the FPGA constantly monitors the energy drains listed above with a sampling rate of 1 ms for the global board and 10 ms for the other sensors. The job scheduler (SLURM) running on the cluster offers the possibility of running a task prolog/epilog script before starting a job and right after the job ends on each allocated node. Also, Dibona offers another method to obtain actual power values at CPU level. This is provided by the TX2's power management unit, called M3. The requests are performed using an Intelligent Platform Management Interface (IPMI) implementation to the correct address space.

For this thesis I always use the method based on the energy counters, because I want to take into account the energy consumption of the whole board, including not only the CPU but also the memory and the network adapter.
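The sketch below outlines how a SLURM job on Dibona might wrap an application between the start of the data collection and the retrieval of the measurements. The wrapper script names are hypothetical placeholders: the actual HDEEM prolog/epilog and retrieval scripts are site-specific and not reproduced here.

#!/bin/bash
#SBATCH --job-name=energy-test
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=64

# Hypothetical site-specific wrappers: the real scripts toggle the GPIO signal so the
# FPGA (re)starts collecting samples, and later fetch them from the BMC over ssh.
./start_energy_collection.sh

srun ./application

./retrieve_energy_measurements.sh > energy_report.txt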

4.4 Testing the OSU benchmarks

The OSU benchmarks have been introduced in Section 2.7.1. The rationale behind testing them is to get familiar with the Docker and Singularity environments, and to make sure that the InfiniBand network is being used. It turns out that InfiniBand does not work out-of-the-box in the Singularity container: several files have to be passed from the host to the container for InfiniBand to work. The list of files is shown below:

• MPI library.

• libibverbs libraries.

• Unified Communication X (UCX) libraries.

Also some environment variables have to be changed:

• LD_LIBRARY_PATH, in order to add the host's Arm HPC Compiler and Arm Performance Libraries.

• MPI_ROOT, in order to use the host's MPI library.

• PATH, in order to use the host’s MPI library.

On top of that, the mpirun command issued errors, so I used srun. The command

time srun singularity exec -B $sbind $image $sosub $param

executes the OSU benchmark ($sosub) inside the Singularity image ($image) with the necessary bindings ($sbind) and with the OSU parameters ($param). For the bare-metal version, the command

time srun $bosub $param

executes the OSU benchmark ($bosub) in bare-metal with the OSU parameters ($param). Guidelines for executing the OSU benchmarks can be found in Appendix A. It is easy to tell whether the container is using InfiniBand instead of Ethernet just by looking at the latency and bandwidth results.
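As a concrete illustration, the snippet below sketches how the bindings and environment variables described above might be set before launching one of the benchmarks. All paths, the image name and the rank count are hypothetical examples, since the real locations depend on the Dibona installation.

# Hypothetical host paths for the MPI, libibverbs and UCX libraries to be bound into the container
sbind="/opt/openmpi-4.0.0,/usr/lib64/libibverbs,/opt/ucx"

# Variables prefixed with SINGULARITYENV_ are injected into the container environment
export SINGULARITYENV_LD_LIBRARY_PATH="/opt/arm/compiler/lib:/opt/arm/armpl/lib:$LD_LIBRARY_PATH"
export SINGULARITYENV_MPI_ROOT="/opt/openmpi-4.0.0"
export SINGULARITYENV_PATH="/opt/openmpi-4.0.0/bin:$PATH"

# Run the bandwidth test on 2 ranks with message sizes between 1 and 1024 bytes
time srun -n 2 singularity exec -B $sbind osu.sif ./osu_bw -m 1:1024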

4.5 Testing OpenFOAM

OpenFOAM has been introduced in Section 2.8.1. For testing it, I compiled the latest release (v1912) both in bare-metal and in the container. Several configuration files needed to be changed in order to compile the application and the Third-Party software with the Arm Compiler and the Arm Performance Libraries.

For the input set I used the DrivAer 26 model, which is provided by the Technical University of Munich and used for automotive aerodynamics [15][16]. It is a mesh of 64M cells, to which several OpenFOAM commands are applied:

• decomposePar, in order to distribute the mesh between the processors.

• renumberMesh, for mesh tuning after it has been partitioned.

• checkMesh, in order to check if the partitioning has been done correctly.

• restore0Dir, for copying the initial values to all processors.

• potentialFoam, a basic solver for the velocity potential in incompressible flow fields; it is used for initialization.

• simpleFoam, an incompressible solver for the continuity and momentum equations.

I have focused the study on the simpleFoam solver, since it is the one that takes the most execution time and is the most important one. Since the number of cells is fixed, the study is done as strong scaling, i.e., changing the number of processing elements while leaving the input constant across different executions.

Since OpenFOAM edits the mesh during the execution and the Singularity image is in read-only format, the input set cannot be inside the container image. All the OpenFOAM commands are issued from inside the image on an input set stored on a host partition that is accessible from the container image. Guidelines for executing OpenFOAM in bare-metal and in the Singularity image can be found in Appendix A.
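A hypothetical sketch of this setup is shown below: the case directory stays on the host and is bind-mounted into the container, and the solver is launched in parallel through srun. The image name, case path and rank count are illustrative placeholders, not the exact commands of the thesis runs, and the image is assumed to have the OpenFOAM environment already configured.

# The DrivAer case lives on a host partition and is bind-mounted into the container
case_dir=/scratch/$USER/drivaer

# Serial pre-processing step executed inside the image on the host-side case
singularity exec -B $case_dir:/case openfoam.sif decomposePar -case /case

# Parallel solver run: one simpleFoam process per MPI rank
time srun -N 2 -n 128 singularity exec -B $case_dir:/case openfoam.sif \
    simpleFoam -parallel -case /case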

4.6 Testing GROMACS

GROMACS has been introduced in Section 2.8.2. For testing it, I compiled the latest release (2020) both in bare-metal and in the container. The application is easy to compile with the Arm Compiler and the Arm Performance Libraries.

26https://www.mw.tum.de/en/aer/research-groups/automotive/drivaer/

For the input set I used the lignocellulose case [17], which is provided by the PRACE UEABS 27. It is an inhomogeneous system of 3.3M atoms. It uses the Parrinello-Rahman pressure coupling, meaning that the box size of the simulation is not fixed and some scaling is done as the number of processors increases. Nevertheless, the number of atoms stays fixed, so the study is still done as strong scaling. Since GROMACS does not edit the input set, both the application and the input set can be stored inside the image. Guidelines for executing GROMACS in bare-metal and in the Singularity image can be found in Appendix A.
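The following sketch shows how a GROMACS run of this case inside the container might be launched with a hybrid MPI/OpenMP configuration. The image name, the .tpr file name and the rank and thread counts are hypothetical examples (the thesis settles on 32 ranks with 2 OpenMP threads per node in Section 5.4).

# Hybrid launch on one node: 32 MPI ranks, 2 OpenMP threads per rank (example values)
export OMP_NUM_THREADS=2

time srun -N 1 -n 32 singularity exec gromacs.sif \
    gmx_mpi mdrun -s lignocellulose.tpr -ntomp 2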

27https://repository.prace-ri.eu/git/UEABS/ueabs/

5 Evaluation

5.1 Startup time

One of the first things to evaluate is the time spent setting up the Singularity image, also called the startup time. A simple test with the sleep command has been executed, both in bare-metal and in the Singularity image.

            Bare-metal              Singularity
Command     Time [s]    Std Dev     Time [s]    Std Dev
sleep 5     5.00        0.00        5.60        0.03

Table 2: Time spent setting up the Singularity image

As can be seen in Table 2, the bare-metal execution of the sleep command lasts 5.00 seconds, while in the Singularity image it lasts 5.60 seconds. This means that the average startup time of a Singularity image on the Dibona cluster is 0.60 seconds. This test is a very basic one, and I also want to know the time spent setting up the image when multiple nodes and MPI calls are involved. For that purpose, I wrote a simple MPI code that initializes an MPI process, sleeps for five seconds and finalizes. The code for this test is shown below.

#include "mpi.h" #include #include int main (int argc, char *argv[]) { MPI_Init(&argc, &argv); sleep(5); MPI_Finalize(); exit(EXIT_SUCCESS); }

Figure 7: Time setting up the Singularity image when increasing the MPI ranks

This test lets me see how the startup time scales with the number of MPI ranks. Figure 7 shows that as the number of MPI ranks increases, the startup time of the Singularity container also increases.

5.2 OSU Benchmarks

As explained in Section 4, I used the OSU benchmarks to test whether InfiniBand was being used inside the Singularity image. Figure 8 shows the test called osu_bw, which measures the bandwidth of the network for message sizes between 1 and 1024 bytes, since this is the typical message size range in applications such as OpenFOAM [18]. As can be seen, the Singularity image has overall a higher bandwidth than the bare-metal version.

Figure 8: Bandwidth of the OSU Benchmarks with message size between 1 and 1024 bytes

In Figure 9 we observe the same test, but with bigger message sizes. One can see that we are using InfiniBand because we reach the InfiniBand EDR limit of 12.5 GB/s. The gap visible at message sizes between 2^10 and 2^15 bytes may be due to the bare-metal environment and the Singularity container not having exactly the same libraries and Linux distribution. This is, however, a hypothesis left to be verified in future work.

Figure 9: Bandwidth of the OSU Benchmarks with large message size

On top of studying the bandwidth, I also made a latency evaluation with the osu_allreduce collective, which is an MPI call widely used in HPC applications. Figure 10 shows that both versions have similar latency times.

Figure 10: Latency of the OSU Benchmarks with the osu_allreduce test

5.3 OpenFOAM

As explained in Section 3, one of the motivations of this thesis is to explore the effect of using vendor-specific compilers versus open-source ones within a container image. In my case, I am evaluating the Arm HPC Compiler and GCC, both inside a Singularity image. In order to see the differences between both compilers, I made a test comparing a GCC version of OpenFOAM and an Arm HPC Compiler version of OpenFOAM, both in bare-metal.

Figure 11: Duration in seconds of OpenFOAM in bare-metal Arm HPC Compiler and GCC

As can be seen in Figure 11, on a single node the Arm HPC Compiler version shows a better execution time of 84.62 seconds versus 95.22 seconds for the GCC version. This behavior does not occur when scaling out, since both versions of OpenFOAM spend the same time executing.

Seeing that both deliver similar performance, I continue using the Arm HPC Compiler version for the scalability study of OpenFOAM.

Figure 12: Scalability of OpenFOAM in bare-metal and Singularity

In Figure 12, I study the scalability of OpenFOAM for the bare-metal and Singularity versions on one and two nodes. The solid lines represent the execution time in seconds of the application for one and two nodes. The dashed lines show the ideal execution time for two nodes. The first thing that can be seen is that the bare-metal version shows a better execution time on two nodes than the Singularity version. The second thing is that it also shows better scalability, since the bare-metal solid line is closer to its respective dashed line than the Singularity one.

5.4 GROMACS

As explained in Section 2, applications benefit from different levels of parallelism. The GROMACS documentation recommends hybrid runs, i.e. using Open MPI and OpenMP together. In order to discover the best combination of MPI ranks and OpenMP threads, I have tested all combinations on a single node, for both the bare-metal and Singularity versions. As can be seen in Table 3, the best combination for the Singularity version is 32 MPI ranks with 2 OpenMP threads each, since it gives the same performance as full-MPI but with a shorter execution time (13.14 seconds). For the bare-metal version, the best combination is 16 MPI ranks with 4 OpenMP threads each, with a performance of 0.68 ns/day. Since I have to select only one combination for all tests, I have chosen the configuration with 32 MPI ranks and 2 OpenMP threads, because it favors Singularity a little.

Figure 13 shows that the Singularity version and the bare-metal version perform similarly on a single node, and this behavior is also found when scaling out to two and four nodes. As we can see in Figure 14, Singularity runs slightly slower than bare-metal. This execution time is obtained from the time command and not from the output of the application, so it also includes the container startup time and the bindings from the host to the container. Another observation is that both versions perform similarly when scaling out to two and four nodes.

                               Bare-metal                  Singularity
# MPI ranks   # OMP threads    Time [s]   Perf [ns/day]    Time [s]   Perf [ns/day]
1             64               5.50       0.47             8.15       0.50
2             32               6.45       0.46             8.79       0.46
4             16               6.13       0.50             8.80       0.50
8             8                6.49       0.67             9.47       0.52
16            4                7.02       0.68             10.36      0.52
32            2                8.05       0.65             13.34      0.66
64            1                10.37      0.66             16.76      0.66

Table 3: Time and performance results using different combinations of MPI and OMP

Figure 13: Performance of GROMACS in bare-metal and Singularity

Figure 14: Duration in seconds of GROMACS in bare-metal and Singularity

Figure 15 shows the energy consumption of the bare-metal and Singularity executions. There is an interesting trade-off between executing faster and consuming less energy: going from one to two nodes in the bare-metal version gives a better execution time while using nearly the same energy, but this does not happen for four nodes. The explanation behind this could be that the application is not scaling properly from two to four nodes in terms of execution time, so resources are not being used as well as they should be in either version.

Figure 15: Energy consumption of GROMACS in bare-metal and Singularity

6 Conclusions

Based on the results, I can answer the research questions of Section 3. Singularity can run complex HPC applications with scientific input sets, and achieves execution times as good as its bare-metal counterpart for HPC workloads requiring a small number of compute nodes. I was not able to scale out my experiments to more than four nodes, because the experimental cluster used for the evaluation offered only a limited number of nodes. However, we know from the literature [7] that Singularity also performs well at a bigger scale. Regarding energy consumption, there is also almost no difference between using Singularity or bare-metal. Singularity also shows good performance when using an Arm-based machine, and can benefit from the Arm HPC Compiler and the Arm Performance Libraries.

Containers are used mainly for portability, composability and BYOE. Using HPC-oriented containers hinders some of these attributes. Using different architectures hinders portability, since images must be rebuilt. Using different clusters hinders portability, composability and BYOE, since similar versions of libraries are needed for InfiniBand and CUDA support. In this thesis I have presented the use of containers from the points of view of the System Administrator and the common HPC user, and after having built several images and run different applications, I have some suggestions about a viable use of containers for HPC.

Figure 16: Workflow in an HPC cluster using containers

Figure 16 shows one possible way of setting up a viable workflow in an HPC cluster using Docker and Singularity. First, the System Administrators need to make a basic Docker image with all the libraries and software needed for the optimal use of containers on the HPC cluster. Secondly, experienced users should be able to build any application or tool they need on top of the basic Docker image, and build them into Singularity images. Lastly, the System Administrators should provide some information to all users, such as which files to bind, which environment variables to pass from the host to the container, and the fact that Singularity images are read-only. One of the flaws in this workflow is that creating Docker and Singularity images can be

a little time-consuming, and making a mistake may require a complete rebuild of the Docker and Singularity images. This is partially solved by using the Docker cache. Another flaw is that images can be big in file size, which can lead to worse startup times and worse performance overall, as seen in the literature [10]. The only solution for this is to keep track of and limit the image size.

As a user, I have run applications on a DGX-1 at the National Supercomputing Centre Singapore (NSCC), inside a Docker image using CUDA. Having complete control of the environment is a nice trait to have, but it is also important to notice that some libraries or applications need to be compiled from source, something that not all HPC users may be able or willing to do. On top of that, it has been mentioned in the literature [7][8] that Docker does not perform as well as Singularity or bare-metal once you scale out to more than one node.

To sum up, containers are a very interesting tool to have in an HPC cluster, and are totally worth it. System Administrators and users that know how to manage containers can profit greatly from them. There are almost no differences, performance-wise and energy-wise, between running complex HPC applications in a bare-metal cluster or in a Singularity container. Nevertheless, there is still a lot of room for improvement in usability, since HPC-oriented containers such as Singularity require users with good expertise on the matter. This may scare away basic users, and I think that HPC must be a resource available and easy to use for as many researchers as possible.

6.1 Further work

After having answered the research questions, new ideas have arisen. In June 2019, NVIDIA announced its support for Arm CPUs, providing the HPC community a new path to build energy-efficient supercomputers. Knowing that Singularity has CUDA support, an interesting study would be to compare an application with CUDA support, such as GROMACS, with and without Singularity, but this time with CUDA enabled, and obtain performance and energy measurements. A study of the scalability with more nodes could be done if I had access to an Arm-based cluster with a higher number of nodes. When the deployment of MareNostrum 4 is finished, it is supposed to include a cluster formed of Armv8 processors like the Japanese Post-K supercomputer, and then further study can be done. Scientific HPC applications from other fields could also be studied, such as astronomy, AI, etc. Traces could be obtained using Extrae 28, in order to see whether the behavior of the application is the same inside and outside the container.

28https://tools.bsc.es/extrae

Acronyms

BMC Baseboard Management Controller
BSC Barcelona Supercomputing Center
BYOE Bring Your Own Environment
CFD Computational Fluid Dynamics
CPU Central Processing Unit
CUDA Compute Unified Device Architecture
EPI European Processor Initiative
FIB Facultat d'Informàtica de Barcelona
FPGA Field-Programmable Gate Array
GCC GNU Compiler Collection
GPU Graphics Processing Unit
HDEEM High Definition Energy Efficiency Monitoring
HPC High-Performance Computing
HPCG High-Performance Conjugate Gradient
IPMI Intelligent Platform Management Interface
MIRI Master in Innovation and Research in Informatics
MPI Message Passing Interface
NFS Network FileSystem
NSCC National Supercomputing Centre Singapore
OSU Ohio State University
RAM Random-access memory
SIMD Single instruction, multiple data
TX2 ThunderX2
UCX Unified Communication X
UPC Universitat Politècnica de Catalunya

A Reproducibility

For this section, I have created a public GitLab repository 29 with all the necessary files for executing the OSU Benchmarks and GROMACS inside a Singularity container out-of-the-box. There is also an OpenFOAM container that can only be executed with a valid input set outside of the container. The libraries used for compiling the applications in bare-metal and Singularity can be found in Section 4.1.

Bare-metal builds: The configuration and build for the applications in bare-metal have been done following these guides:

• OSU Benchmarks 30

• OpenFOAM 31

• GROMACS 32

Singularity builds: In order to build the Docker and Singularity images, you need to be on an Armv8 cluster with Docker and Singularity installed. For the GROMACS and OpenFOAM images, the Arm HPC Compiler installer is needed; it can be downloaded here 33. For the OpenFOAM image, the input set must be downloaded here 34.

Enter the folder of the image you want to build and run the following commands:

docker build -t image:tag . for creating the image
sudo docker run -ti image:tag /bin/bash for opening a bash shell into the image

Once the Docker image is ready:

singularity build image.sif docker-daemon://image:tag for creating the image
singularity shell image.sif for opening a bash shell into the image

Examples of job scripts for the SLURM queue system can be found in the respective application folders, with the energy measurement scripts added for GROMACS. Please note that OpenFOAM is the only image that cannot be run out-of-the-box, since it needs an input set outside of the container.

29 https://repo.hca.bsc.es/gitlab/kpeiro/masterthesisreproducibility
30 https://mvapich.cse.ohio-state.edu/static/media/mvapich/README-OMB.txt
31 https://gitlab.com/arm-hpc/packages/-/wikis/packages/openfoam
32 https://gitlab.com/arm-hpc/packages/-/wikis/packages/gromacs
33 https://developer.arm.com/-/media/Files/downloads/hpc/arm-allinea-studio/19-3/Ubuntu16.04/Arm-Compiler-for-HPC_19.3_Ubuntu_16.04_aarch64.tar
34 https://www.cfdsupport.com/download-cases-tcfd-drivaer-car-model.html

27 References

[1] Fabio Banchelli, Marta Garcia, Marc Josep, Filippo Mantovani, Julian Morillo, Kilian Peiro, Guillem Ramirez, Xavier Teruel, Giacomo Valenzano, Joel Wanza Weloli, et al. MB3 D6.9 – Performance analysis of applications and mini-applications and benchmarking on the project test platforms. Technical report, Version 1.0. Available at: http://bit.ly/mb3-dibona-apps, 2019.

[2] Miguel G. Xavier, Marcelo V. Neves, Fabio D. Rossi, Tiago C. Ferreto, Timoteo Lange, and Cesar A. F. De Rose. Performance Evaluation of Container-Based Virtualization for High Performance Computing Environments. In 2013 21st Euromicro International Conference on Parallel, Distributed, and Network-Based Processing, pages 233–240, February 2013. ISSN: 2377-5750.

[3] P. Souza, G. M. Kurtzer, C. Gomez-Martin, and P. M. Cruz e Silva. HPC Containers with Singularity. Volume 2017, pages 1–5. European Association of Geoscientists & Engineers, October 2017.

[4] Carlos Arango, Rémy Dernat, and John Sanabria. Performance Evaluation of Container-based Virtualization for High Performance Computing Environments. arXiv:1709.10140 [cs], September 2017.

[5] Alfred Torrez, Timothy Randles, and Reid Priedhorsky. HPC Container Runtimes have Minimal or No Performance Impact. In 2019 IEEE/ACM International Workshop on Containers and New Orchestration Paradigms for Isolated Environments in HPC (CANOPIE-HPC), pages 37–42, November 2019.

[6] Ashijeet Acharya, Jérémy Fanguède, Michele Paolino, and Daniel Raho. A Performance Benchmarking Analysis of Containers and Unikernels on ARMv8 and x86 CPUs. In 2018 European Conference on Networks and Communications (EuCNC), pages 282–9, June 2018. ISSN: 2575-4912.

[7] Oleksandr Rudyy, Marta Garcia-Gasulla, Filippo Mantovani, Alfonso Santiago, Raül Sirvent, and Mariano Vázquez. Containers in HPC: A Scalability and Portability Study in Production Biological Simulations. In 2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pages 567–577, May 2019. ISSN: 1530-2075.

[8] Andrew Younge, Kevin Pedretti, Ryan Grant, and Ron Brightwell. A Tale of Two Systems: Using Containers to Deploy HPC Applications on Supercomputers and Clouds. December 2017.

[9] Jie Zhang, Xiaoyi Lu, and Dhabaleswar K. Panda. Is Singularity-based Container Technology Ready for Running MPI Applications on HPC Clouds? In Proceedings of the 10th International Conference on Utility and Cloud Computing, UCC '17, pages 151–160, Austin, Texas, USA, December 2017. Association for Computing Machinery.

[10] Yinzhi Wang, Richard Evans, and Lei Huang. Performant Container Support for HPC Applications. Pages 1–6, July 2019.

[11] Jack Dongarra, editor. Linpack users' guide. SIAM, Philadelphia, 10th printing edition, 1993. OCLC: 233386311.

[12] Michael Allen Heroux and Jack Dongarra. Toward a new metric for ranking high performance computing systems. Technical Report SAND2013-4744, 1089988, June 2013.

[13] Gregory M. Kurtzer, Vanessa Sochat, and Michael W. Bauer. Singularity: Scientific containers for mobility of compute. PLOS ONE, 12(5):e0177459, May 2017.

[14] Daniel Hackenberg, Thomas Ilsche, Joseph Schuchart, Robert Schöne, Wolfgang E. Nagel, Marc Simon, and Yiannis Georgiou. HDEEM: high definition energy efficiency monitoring. In Proceedings of the 2nd International Workshop on Energy Efficient Supercomputing, pages 1–10. IEEE Press, 2014.

[15] Christopher Collin, Steffen Mack, Thomas Indinger, and Joerg Mueller. A numerical and experimental evaluation of open jet wind tunnel interferences using the DrivAer reference model. SAE International Journal of Passenger Cars - Mechanical Systems, 9, April 2016.

[16] Martin Peichl, Steffen Mack, Thomas Indinger, and Friedhelm Decker. Numerical investigation of the flow around a generic car using dynamic mode decomposition. American Society of Mechanical Engineers, Fluids Engineering Division (Publication) FEDSM, 1, August 2014.

[17] Benjamin Lindner, Loukas Petridis, Roland Schulz, and Jeremy C. Smith. Solvent-Driven Preferential Association of Lignin with Regions of Crystalline Cellulose in Molecular Dynamics Simulation. Biomacromolecules, 14(10):3390–3398, October 2013.

[18] Fabio Banchelli, Kilian Peiro, Andrea Querol, Guillem Ramirez-Gargallo, Guillem Ramirez-Miranda, Joan Vinyals, Pablo Vizcaino, Marta Garcia-Gasulla, and Filippo Mantovani. Performance study of HPC applications on an Arm-based cluster using a generic efficiency model. In 2020 28th Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP), pages 167–174, March 2020. ISSN: 2377-5750.
