

Article

HPC Cloud Architecture to Reduce HPC Workflow Complexity in Containerized Environments

Guohua Li 1, Joon Woo 1 and Sang Boem Lim 2,*

1 National Supercomputing Center, Korea Institute of Science and Technology Information, 245 Daehak-ro, Yuseong-gu, Daejeon 34141, Korea; [email protected] (G.L.); [email protected] (J.W.)
2 Department of Smart ICT Convergence, Konkuk University, 120 Neungdong-ro, Gwangjin-gu, Seoul 05029, Korea
* Correspondence: [email protected]; Tel.: +82-2-450-3840

Abstract: The complexity of high-performance computing (HPC) workflows is an important issue in the provision of HPC cloud services in most national supercomputing centers. This complexity problem is especially critical because it affects HPC resource scalability, management efficiency, and convenience of use. To solve this problem, while exploiting the advantage of bare-metal-level high performance, container-based cloud solutions have been developed. However, various problems still exist, such as an isolated environment between HPC and the cloud, security issues, and workload management issues. We propose an architecture that reduces this complexity by using Docker and Singularity, which are the container platforms most often used in the HPC cloud field. This HPC cloud architecture integrates both image management and job management, which are the two main elements of HPC cloud workflows. To evaluate the serviceability and performance of the proposed architecture, we developed and implemented a platform in an HPC cluster experiment. Experimental results indicated that the proposed HPC cloud architecture can reduce complexity to provide supercomputing resource scalability, high performance, user convenience, various HPC applications, and management efficiency.

Keywords: HPC workflow; cloud platform; supercomputing; HPC cloud architecture; containerized environment

1. Introduction

Supercomputing services are currently among the most important services provided by national supercomputing centers worldwide [1]. This designation typically refers to the use of aggregated computing power to solve advanced computational problems related to scientific research [2]. Most of these services are composed of cluster-based supercomputers and are important for providing the computational resources and applications necessary in various scientific fields [3]. These supercomputing services are often divided into high-performance computing (HPC) services and high-throughput computing (HTC) services [4] according to the workflow of the job. An HPC service has a strong impact on jobs that are tightly coupled in parallel processing. These jobs (such as message passing interface (MPI) parallel jobs) must process large numbers of computations within a short time. In contrast, an HTC service involves independence between jobs that are loosely coupled in parallel processing. These jobs must process large and distributed numbers of computations over a specific period (such as a month or a year) [5].

Most national supercomputing centers are plagued with inconvenience and inefficiency in handling these types of jobs. Therefore, in this study, we focused on the current problems facing HPC users and supercomputing center administrators seeking to provide efficient systems. HPC users face four main problems when performing traditional HPC jobs; these are complexity, compatibility, application expansion, and a pay-as-used arrangement.


Conversely, center administrators face five main problems with traditional HPC systems; these are cost, flexibility, scalability, integration, and portability. The solution to these problems with traditional HPC is to adopt technology that provides HPC services in a cloud environment [6].

With the development of cloud computing technology and the stabilization of its services, the provision of HPC services in the cloud environment has become one of the main interests of many HPC administrators. Many researchers have begun to enter this field, using descriptors such as HPC over Cloud, HPC Cloud, HPC in Cloud, or HPC as a Service (HPCaaS) [7]. Although these approaches solve typical scalability problems in the cloud environment and reduce deployment and operational costs, they fail to meet high-performance requirements because of the performance degradation of virtualization itself. In particular, the performance degradation of overlay networking solutions involving virtual machines (VMs) and network virtualization has become a most serious issue. To address this, studies are currently underway to program fast packet processing between VMs on the data plane using a data plane development kit [8]. Another serious problem results from the inflexible mobility of VM images, which include various combinations or versions of, e.g., the operating system, compiler, library, and HPC applications. Thus, the biggest challenge for HPC administrators is to achieve cloud scalability, performance, and image service in virtualized environments.

One solution to overcome this challenge is to implement an HPC cloud in a containerized environment. Integrating the existing HPC with a containerized environment leads to complexity in terms of designing the HPC workflow, which is defined as the flow of tasks that need to be executed to compute on HPC resources. The diversity of these tasks increases the complexity of the infrastructure to be implemented. We defined some requirements of an HPC workflow in a containerized environment and compared them with those of other projects, such as Shifter [9], Sarus [10], EASEY [11], and JEDI [12], which were suggested by several national supercomputing centers. On the basis of these related research analyses, it is difficult to design an architecture that includes all functionalities that satisfy the requirements of users and administrators.

In this study, we propose an HPC cloud architecture that can reduce the complexity of HPC workflows in containerized environments to provide supercomputing resource scalability, high performance, user convenience, various HPC applications, and management efficiency. To evaluate the serviceability of our proposed architecture, we developed a platform that was part of the Partnership and Leadership of Supercomputing Infrastructure (PLSI) project [13] led by the Korean National Supercomputing Center, Korea Institute of Science and Technology Information (KISTI), and built a test bed based on the PLSI infrastructure. In addition, we developed a user-friendly platform that is easy to use and uses a minimal knowledge base interface to ensure user convenience.

The remainder of this paper is organized as follows: In Section 2, we describe related work on container-based HPC cloud solutions involving national supercomputing resources. In Section 3, the implemented platform is explained, including information about the system architecture and workflows, and the detailed method is described in Section 4.
In Section 5, we present the results of evaluating various aspects of the platform we developed. Finally, we provide concluding remarks in Section 6.

2. Related Work

To propose an HPC cloud architecture in a containerized environment, we first analyzed container solutions such as Docker [14], Singularity [15], and Charliecloud [16] that are suitable for an HPC cluster environment. Projects such as Shifter [9], Sarus [10], EASEY [11], and JEDI [12] provide HPC cloud services based on these container-based solutions. In these projects, we analyzed the HPC workflow from an architectural perspective. Next, we defined several requirements of the HPC workflow in a containerized environment and compared each of the projects with the proposed architecture.

2.1. Container Solutions

2.1.1. Docker

Docker is a lightweight container technology that easily bundles and runs libraries, packages, and applications in an isolated environment. Compared with VM-based virtualization technology, Docker is an isolation technology that shares the resources of the host to provide better performance than that of a VM. Nevertheless, the following problems exist in the HPC cluster environment with Docker technology [14]:

• Docker is not well integrated with other workload managers.
• It does not isolate users on shared file systems.
• Docker is designed without consideration of clients who do not have a local disk.
• The Docker network is a bridge type by default. Thus, to enable communication on multiple hosts, another overlay network solution is required.

2.1.2. Singularity

Singularity, which is a new container technology with a format different from that of the Docker image, was introduced in 2016 for the development of HPC services [15] and solves these problems. Singularity can be integrated with any resource manager on the host; for example, it is feasible to integrate it with HPC interconnects, schedulers, file systems, and Graphics Processing Units (GPUs). Singularity also focuses on container mobility and can be run in any workload with no modification to the host. To provide a deeper understanding of the technology that implements this function, the architectures of the VM, Docker container, and Singularity are compared in Figure 1 [17].

Figure 1. Architecture comparison of the Virtual Machine (a), Docker container (b), and Singularity (c).

2.1.3. Charliecloud

Charliecloud (developed at the Los Alamos National Laboratory) provides unprivileged containers for user-defined software stacks in HPC [16]. The motivation for developing Charliecloud was to meet the increasing demand at supercomputing centers for user-defined software stacks. These demands included complex dependencies or build requirements, externally required configurations, portability, consistency, and security. Charliecloud is designed as a lightweight, open source software stack based on the Linux user namespace. It uses Docker to build images, uses a shell script for automating configurations, and runs user code with the C program [16]. It also uses a host network, like Singularity (described in the previous section), to meet HPC performance standards.



2.2. Container-Based HPC Cloud Solutions

2.2.1. Shifter

Shifter is a user-defined container image solution developed by the National Energy Research Scientific Computing Center (NERSC) for HPC services. It has been implemented as a prototype on the Cray supercomputers at NERSC. It allows an HPC system to permit users to run a Docker image efficiently and safely. Shifter's workflow [18] is shown in Figure 2. It provides a user-defined image manager, workload management for resources, and a user interface to distribute containers in an HPC environment, and it provides HPC services to users. This solution covers various image types, as well as providing great detail for a container-based image management workflow. However, because the batch scheduler used in the traditional HPC field is used without integrating it with the container-based scheduler, the amount of automation for the job workflow is still limited.

Figure 2. NERSC Shifter workflow.

The Swiss National Supercomputing Center (CSCS) has developed a CSCS Shifter [19] based on the NERSC Shifter to work with GPU resources on the Piz Daint supercomputer. The CSCS Shifter can be used by loading the module shifter-ng. To use Shifter commands, users must include the GPU-enabled software stack by loading daint-gpu [19]. As shown in Figure 3, the workload manager can distribute job requests from users, but it can also be connected to the runtime service. This architecture not only manages images but also runs containers directly with these converted images. Although this solution implements automation of the job workflow through Shifter Runtime, it still has a dependency issue in that it has to run in its own supercomputer environment, and a network performance issue between parallel processes running on multiple nodes.

Figure 3. CSCS Shifter workflow.




2.2.2. Sarus

Sarus is a software package designed to run Linux containers in HPC environments and was developed by CSCS in 2019. This software was developed specifically for the requirements of HPC systems and was released as an open source application [10]. The main reason for the development of Sarus is the increased demand for container-based HPC platforms that solve the following issues [10]:

• Suitable for an HPC environment: It includes diskless compute nodes and workload manager compatibility, is parallel file system friendly, and has no privilege escalation (multitenant).
• Vendor support: NVIDIA GPU, Cray MPI, and Remote Direct Memory Access (RDMA) support (it should meet the standard Open Container Initiative (OCI) hooks).
• User experience: The Docker-like Command Line Interface (CLI), Docker Hub integration, and a writable container file system should be satisfied.
• Admin experience: It includes easy installation (e.g., single binary) and container customization (e.g., plugins).
• Maintenance effort: It can apply third-party technology and leverage community efforts.

The architecture of Sarus solves all the above issues; the Sarus workflow is shown in Figure 4. It includes an image manager and a container runtime manager. Users working through the CLI can send container image requests from Docker Image Registry, and a parallel file system can handle Docker image formats as well as TAR-compressed file formats. Users operating through the JavaScript Object Notation (JSON) format can send OCI bundle requests that can be handled with the root file system. Sarus delivers requests from users to the OCI runtime, such as runc, to execute the container process, which creates hooks such as the MPI job, NVIDIA GPU job, and other jobs. In this architecture, it is notable that the concept of hooks can support various HPC workloads for job workflows. However, in the case of the Image Manager in Figure 4, it does not solve the workload of the image management server when numerous users send their requests at the same time. Furthermore, it is taxing for this single Image Manager to download images from Docker Image Registry or move images compressed with TAR.

Figure 4. CSCS Sarus workflow.



2.2.3. EASEY

One application of Charliecloud is in developing a software program called EASEY, which is a containerized application for HPC systems [11]. As shown in Figure 5, the EASEY workflow for job submissions on HPC systems uses Charliecloud to connect user systems with HPC software stacks, such as batch schedulers and storage servers (file systems), using EASEY middleware. We selected this study case because it is a good example of developing a user-defined software stack for the HPC environment, even though no automation is implemented for the job workflow.

Figure 5. EASEY Charliecloud workflow.

2.2.4. JEDI

An example of using Singularity (also Charliecloud) is provided by jedi-stack, developed by the JEDI team in the Joint Center for Satellite Data Assimilation (JCSDA) [12]. As shown in Figure 6, with jedi-stack, JEDI users can use Docker, Charliecloud, and Singularity to use cloud resources or HPC resources. This provides instant images for cloud computing and environment modules with selected HPC systems. For continuous integration testing, Travis-CI is used for leveraging Docker containers. This team uses packaged user environment concepts for defining the software container. The advantage of this architecture is that resource pools are divided into cloud and HPC environments, and each pool is provided by the different solutions Charliecloud (Docker based) and Singularity. As shown in this architecture, job workflow management is not mentioned or implemented.

Figure 6. Jedi-stack workflow.




2.3. Requirements of the HPC Workflow in a Containerized Environment

From our review of related projects, we listed 10 requirements and compared them with those of our proposed platform in Table 1. These requirements were derived from our experiences as HPC users and system administrators. We define an HPC workflow as the flow of tasks that need to be executed to compute on HPC resources. Tasks in a containerized environment can be divided into image, template, and job (container and application) management.

Table 1. Comparison of requirements of projects vs. the proposed platform.

Requirements | Shifter | Sarus | EASEY | JEDI | Proposed Platform
A self-service interface | Image | Image, template, job | Image, template | Image | Image, template, job
Container solutions | Docker | Docker | Charliecloud | Docker, Charliecloud, Singularity | Docker, Singularity
A rapid expansion of new configuration environments: (1) template-based image management; (2) image-based job management; (3) image format conversion | (1), (3) | (1), (2), (3) | (1) | None | (1), (2)
A pay-as-used model (on-demand billing) | None | None | None | None | Yes
Performance similar to that of bare-metal servers | None | Yes | Yes | None | Yes
Auto-provisioning that includes virtual instances and applications | Yes | Yes | None | None | Yes
Workload management | Image | Image, job (batch scheduler) | None | None | Image, job (batch scheduler)
Resource auto-scaling | None | None | None | None | Yes (image management)
Multitenancy capability | Yes | Yes | None | None | Yes
Portability of application | Yes | Yes | Yes | None | Yes

In Table 1, for the self-service interface requirement, the Sarus project designed image, template, and job management, implemented as command line interfaces. Our proposed platform was also designed to support image, template, and job management with integrated command line interfaces. Shifter and Sarus use Docker as their basic container solution. The EASEY project was designed based on a Charliecloud solution, which can be used in an HPC environment. The JEDI project supports various container solutions such as Docker, Charliecloud, and Singularity. We selected only Docker and Singularity as our base container solutions because we considered architectural ways to provide unprivileged containers with Docker instead of Charliecloud. For a rapid expansion of new configuration environments, we listed three functionalities: template-based image management, image-based job management, and image format conversion. Sarus supports all three features. We first attempted to implement the two main features and left image format conversion between Docker and Singularity as future work.

For a pay-as-used model, we implemented an on-demand billing system by collecting metering data with an existing authentication and authorization policy, which is not supported in the other projects. Sarus and EASEY use the host network with containers for multinode computing. We also evaluated container networking performance with tuning in our previous research. Shifter and Sarus support auto-provisioning that includes containers and the applications running inside them. For workload management, Sarus supports both image and job management; we referred to the architecture of this project. For resource auto-scaling, none of the projects support it; we implemented it for image management to distribute the request load. For multitenancy capability, Shifter and Sarus were designed to use a shared resource pool but provide separate services to users. All projects support portability of applications through container solutions.

3. Platform Design

To meet the requirements mentioned in the previous section, we designed a container-based HPC cloud platform based on system analysis. The system architecture and workflow designs were proposed with consideration of the requirements of current users and administrators. The workflows of the image management system, the job management system, and metering data management are explained in detail.

3.1. Architecture

This platform was designed in accordance with the three service roles of the cloud architecture (Figure 7); these are the service creator, service provider, and service consumer roles, which must be distinguished to enable self-service. In this figure, items in blue boxes were implemented with existing open source software, those in green boxes were developed by integrating the necessary parts, and those in yellow boxes were newly developed.

The service creator can manage a template consisting of Template Create and Template Evaluate processes. Previously verified templates can be searched using Template List and Template Search and can also be deleted using Template Delete. Template List, Template Search, and Template Delete were developed as CLI tools and provided as a service. All verified templates are automatic installation and configuration scripts for versions of, e.g., the Operating System (OS), library, compiler, and application. All container images can be built based on verified templates. All jobs (including container and application) can be executed based on these built images.

HPCaaS is provided by our container-based HPC cloud service provider to service consumers through the container-based HPC cloud management platform, which consists primarily of job integration management and image management processes. Our platform provides services based on two container platforms, the hardware-based management of which is accomplished with Container Platform Management. Image management is based on a distributed system and is the basis for the implementation of the workload-distributing, task-parallelizing, auto-scaling, and image-provisioning functions of Image Manager. We designed container and application provisioning by developing integration packages according to the different types of jobs, because different container solutions need different container workload managers. The on-demand billing function was implemented using measured metering data. We designed real-time resource monitoring for CPU use and memory usage, and a function for providing various types of Job History Data records based on collected metering data.

In addition, interfaces for connecting with the other service roles were also implemented. Service Development Interface sends the template created by the service creator to the service provider. The image service and job service created on this template are delivered in the form of HPC Image Service and HPC Job Service through Service Delivery Interface.


Figure 7. System architecture.
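The template-management CLI tools mentioned above (Template List, Template Search, Template Delete) can be pictured as a small self-service command-line front end. The following Python sketch is purely illustrative of that shape; the program name, subcommands, and options are hypothetical and do not reproduce the platform's actual tools, which would query Image Metadata Storage.

```python
import argparse

# Illustrative CLI surface for the template management described in Section 3.1.
# All names here (hpc-template, subcommands, arguments) are hypothetical.

def main():
    parser = argparse.ArgumentParser(prog="hpc-template",
                                     description="Manage verified image templates")
    sub = parser.add_subparsers(dest="command", required=True)

    sub.add_parser("list", help="list all verified templates")

    search = sub.add_parser("search", help="search templates by OS, compiler, library, or application")
    search.add_argument("keyword", help="e.g. 'centos7', 'openmpi', 'gromacs'")

    delete = sub.add_parser("delete", help="delete a verified template")
    delete.add_argument("template_id")

    args = parser.parse_args()
    # In the real platform, these calls would read or update Image Metadata Storage;
    # here we only echo the requested operation.
    print(f"requested: {args.command} {vars(args)}")

if __name__ == "__main__":
    main()
```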

The workflow diagram presents a detailed design of the container-based HPC cloud platform, which includes image management (yellow boxes) and job integration management (blue boxes), as demonstrated in Figure 8. We proposed a distributed system for image management to reduce the workload resulting from requests. When HPC users request a desired container image, Image Manager automatically generates and provides the image based on the existing templates created by administrators. For example, when a user tries to execute an MPI job, a container image including MPI should be checked first. If the requested image exists, the MPI job is submitted with this image. If not, the user can request an image build with an existing template. If there is no template, the user can request a template from the administrator. Each user can request an image and can also share it with other users. In addition, we designed an auto-scaling scheduler for the Image Manager nodes, which are regarded as workers. To ensure job integration management, we integrated a batch scheduler and a container orchestration mechanism to deploy the container and application simultaneously. After creating a container and executing applications, all processes and containers are automatically deleted to release the resource allocation. Additionally, a data collector for data metering was designed. Finally, we describe My Resource View, which was designed to show the resources generated by each user and is used for implementing multitenancy. To share the resource pool with different or isolated services for each user, we designed My Resource View to check usage statistics for each user's resources.

Figure 8. System workflow.

3.2. Image Management

The workflow of image management is shown in Figure 9. The user can submit image requests on the login node; when user requests are received, Image Manager delivers them to Docker Daemon or Singularity Daemon to deploy the images and then reports the history to Image Metadata Storage on the image manager node. Docker Daemon uses templates written in Dockerfile format to create Docker images automatically and in accordance with the user's request. All the created Docker images are stored in Docker Image Temporary Storage for the steps that follow. Singularity Daemon also uses templates written in definition file format to create Singularity images automatically, based on the request. Similarly, all created Singularity images are stored in Singularity Image Temporary Storage for upcoming steps. When the user requests Docker Image Share, Image Manager uploads the requested image to the Private Docker Registry that has already been built as a local hub. This local hub is used as Docker Image Storage to ensure high-speed transmission and security. When a user requests Singularity Image Share, Image Manager uploads the requested image to the parallel file system that has been mounted on all nodes. Once the image is uploaded, a job request can be submitted using this image.
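For illustration, the build-and-share step described above reduces to a handful of standard Docker and Singularity CLI calls. The Python sketch below is only a minimal illustration of that flow, not the platform's actual Image Manager code; the registry address, template paths, and image names are hypothetical.

```python
import subprocess

# Hypothetical locations; the real platform resolves these from Image Metadata Storage.
PRIVATE_REGISTRY = "registry.example.local:5000"   # private Docker registry (local hub)
PARALLEL_FS_DIR = "/gpfs/images"                    # parallel file system mounted on all nodes

def run(cmd):
    """Run a CLI command and fail loudly, mimicking a worker task that reports errors."""
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

def build_and_share_docker(template: str, name: str, tag: str = "latest"):
    """Build a Docker image from a verified Dockerfile template and push it to the local hub."""
    image = f"{PRIVATE_REGISTRY}/{name}:{tag}"
    run(["docker", "build", "-f", template, "-t", image, "."])  # Docker Image Create
    run(["docker", "push", image])                              # Docker Image Share
    return image

def build_and_share_singularity(template: str, name: str):
    """Build a Singularity image from a definition file and copy it to the parallel file system."""
    sif = f"{name}.sif"
    # --fakeroot assumes unprivileged builds are enabled on the builder node.
    run(["singularity", "build", "--fakeroot", sif, template])  # Singularity Image Create
    run(["cp", sif, f"{PARALLEL_FS_DIR}/{sif}"])                 # Singularity Image Share
    return f"{PARALLEL_FS_DIR}/{sif}"

if __name__ == "__main__":
    build_and_share_docker("templates/mpi.Dockerfile", "user-a/openmpi-app")
    build_and_share_singularity("templates/mpi.def", "openmpi-app")
```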





Figure 9. Image management workflow.

Image Manager constitutes the key point of our platform. We presented a distributed and parallel architecture for our platform to reduce the workload of Image Manager resulting from multiple users accessing Image Manager Server. As depicted in Figure 10, six features were designed as parallel workloads, thereby allowing users to send requests simultaneously. On the login node, we deployed client parts from Client A to Client F; every client has its matching worker port, designated Worker A to Worker F. We designed Docker Image Create, Docker Image Share, and Docker Image Delete features for Docker workers; for Singularity workers, Singularity Image Create, Singularity Image Share, and Singularity Image Delete features were designed. Between client and worker, we presented Task Queue for listing user requests by task unit. When User A creates task ① and User B creates task ②, according to the queue order, Image Manager first receives task ① and then receives task ②. Likewise, tasks ③ and ④ requested by incoming users are queued in that order and are executed immediately after the preceding tasks are completed. Unlike the other workload distributions, the Image List and Image Search features are designed separately to connect only to Image Metadata Storage.


Figure 10. Image Manager workflow.

We designed an auto-scaling scheduler composed of scale-up and scale-down rules. As shown in Figure 11, users can submit image requests to the Auto-scaling Group through the Login Node; the group consists of a master worker node and several slave worker nodes. Redis-server runs on the Master Worker node and synchronizes the slave workers. Each worker automatically sends the created image to the public mounted file system or the Docker Private Registry, depending on the type of platform.

Figure 11. Auto-scaling scheduler workflow.

To calculate the waiting time (latency), the execution time of the image is added. After comparing the execution time and the waiting time, it is determined whether to scale up or scale down the slave workers. Figure 12 presents the flowchart of the auto-scaling scheduler. We designed auto-scaling algorithms for each defined task queue that cannot be executed in parallel. When a user requests a task queue, our system performs the following determination steps, starting with Create Image Start. First, the system checks whether there are currently active tasks in the auto-scaling worker group using Check Task Active. If no tasks in the worker group are currently active, any worker is selected to activate the task using Add Consumer, and the task is sent using Send Task to the selected worker using the routing key. When the task sent to the worker node is finished, Cancel Consumer is used to deactivate the task, and if it is not the master worker node, the worker node registered using Remove Worker is released. The task queue process for image generation then scales down and waits for the next user's request.
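The step names in Figure 12 (Add Consumer, Send Task, Cancel Consumer) map naturally onto a distributed task queue such as Celery with a Redis broker, which is consistent with the Redis-server mentioned above. The sketch below is a simplified illustration of how such a dispatch step could look; the broker URL, queue name, and worker hostname are hypothetical, and this is not the platform's actual scheduler code.

```python
from celery import Celery

# Hypothetical broker/backend on the master worker node (Redis-server).
app = Celery("image_manager",
             broker="redis://master-worker:6379/0",
             backend="redis://master-worker:6379/1")

@app.task(name="image.create")
def create_image(template: str, image_name: str) -> str:
    # Placeholder for the Docker/Singularity build logic sketched earlier.
    return f"built {image_name} from {template}"

def dispatch_to_idle_worker(template: str, image_name: str,
                            worker: str = "worker1@image-mgr-01",
                            queue: str = "image-create-q"):
    """Mirror the flowchart: Add Consumer -> Send Task; Cancel Consumer after completion."""
    # Add Consumer: make the chosen worker start consuming from the task queue.
    app.control.add_consumer(queue, destination=[worker])

    # Send Task: route the build request to that queue.
    result = create_image.apply_async(args=[template, image_name], queue=queue)

    # Block until the build finishes (a real scheduler would poll asynchronously).
    print(result.get(timeout=3600))

    # Cancel Consumer: stop the worker from consuming, i.e., deactivate the task queue.
    app.control.cancel_consumer(queue, destination=[worker])
```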





Figure 12. Flowchart of the auto-scaling scheduler.

However, if there are currently active tasks in the auto-scaling worker group, an arbitrary worker among the active workers is first selected using the following steps: add a worker node, activate the task queue using Add Consumer, and send the task by specifying the routing key. When the work is finished and the worker node is released, it returns to the scale-down state and waits for another request. If there are currently running tasks in all workers, the worker node with the smallest sum of waiting times is selected and sent the task. The variables and descriptions used in Equations (1) and (2) to obtain the minimum-total-waiting-time worker are summarized in Table 2 below.


Table 2. Variables and descriptions of equations.

Variables | Description
N | The number of current workers with active tasks
n | The number of current active tasks
Φ_CT | Current time
Φ_ST | Start time of the current active task
Φ_ET | Execution time for the image to run
Φ_WT | Total waiting time for the active tasks
Φ_W | Worker that has the minimum value of the total waiting time

Equation (1) defines Φ_WT, the total waiting time for the active tasks. After getting the list of currently running tasks, how long each task has been executing up to the present time is calculated using the expression Φ_CT − Φ_ST. Here, Φ_ST represents the start time of the current active task, Φ_CT the current time of the system, and n the number of current active tasks. Φ_WT denotes the sum of the latencies of the active tasks for each worker node that has active tasks. Finally, we obtain the worker Φ_W that has the minimum total waiting time among the workers using Equation (2).

$$\Phi_{WT} = \sum_{n=1}^{\infty} \left( \Phi_{ET_n} - \left( \Phi_{CT} - \Phi_{ST_n} \right) \right) \qquad (1)$$

$$\Phi_{W} = \min\left( \sum_{N=1}^{\infty} \Phi_{WT_N} \right) \qquad (2)$$

However, when these equations are applied to an actual cluster environment, the following limitations exist. We implemented one worker per node; therefore, N also stands for the number of Image Manager nodes. The higher the value of N, the better, but it is limited by the number of network switch ports connecting nodes in a cluster configuration. Considering the availability of our HPC resources, we tested N up to 3. We also implemented one task per process; therefore, n also stands for the number of processes on the host. The maximum value of n is the number of cores per node. However, considering that each task is not executed completely independently but shares the resources of the host, we actually created and tested 4 tasks. The optimization of resource use with respect to n is an area of further research.
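As a concrete reading of Equations (1) and (2), the following sketch computes the remaining waiting time of each worker's active tasks and picks the worker with the smallest total. It is an illustrative implementation under the assumptions stated in the comments (task records carrying a start time and an estimated execution time), not the platform's actual scheduler.

```python
import time

def total_waiting_time(active_tasks, now=None):
    """Equation (1): sum of the remaining time of a worker's active tasks.
    Each task is a dict with 'start_time' (Φ_ST) and 'exec_time' (Φ_ET) in seconds."""
    now = time.time() if now is None else now          # Φ_CT
    return sum(t["exec_time"] - (now - t["start_time"]) for t in active_tasks)

def select_worker(workers):
    """Equation (2): pick the worker with the minimum total waiting time.
    'workers' maps a worker name to its list of active tasks."""
    return min(workers, key=lambda w: total_waiting_time(workers[w]))

# Hypothetical example: three Image Manager workers (N = 3), a few active tasks each (n <= 4).
now = time.time()
workers = {
    "worker1": [{"start_time": now - 30, "exec_time": 120}],
    "worker2": [{"start_time": now - 100, "exec_time": 120},
                {"start_time": now - 10, "exec_time": 60}],
    "worker3": [],
}
print(select_worker(workers))  # worker3: no active tasks, so zero total waiting time
```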

3.3. Job Management

We present the design of the job integration management system. After creating and uploading the image, the user can use Submit Job Request to schedule jobs, allocate computing resources, create containers, and execute HPC applications. The platform then automatically deletes containers to release the allocated resources. Depending on the container platform type, Docker or Singularity, our system uses different processes for submitting jobs. The key aspect of job integration management is the integration with a traditional batch scheduler so that traditional HPC users can also use our system. In addition, our system is designed to automate container, application, and resource release through Submit Job Request. Our system consists of three main features, i.e., Job Submit, Job List, and Job Delete, for which the respective flowcharts are presented in the following sections.

Different containers require different schedulers; the Singularity platform can directly use traditional HPC batch schedulers for submitting jobs, while the Docker platform requires integration with its default container scheduler. Job integration management in our system was designed to support both platforms. As shown in Figure 13, the user can submit jobs through Job Submit, but since the job submission process varies based on the platform, we designed it as both Docker Job Submit and Singularity Job Submit. Job List, Job Delete, and History Delete were also designed.
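On the Singularity side, submitting a containerized MPI job through a traditional batch scheduler typically amounts to generating a batch script that wraps the application in singularity exec and handing it to the scheduler. The sketch below illustrates this common pattern for PBS-style schedulers such as PBSPro; the resource selections, queue name, and image path are hypothetical, and the script is a generic example rather than the platform's Job Submit implementation.

```python
import subprocess
import tempfile

PBS_TEMPLATE = """#!/bin/bash
#PBS -N {job_name}
#PBS -q {queue}
#PBS -l select={nodes}:ncpus={ncpus}:mpiprocs={ncpus}
#PBS -l walltime={walltime}
cd $PBS_O_WORKDIR
# Run the MPI application inside the Singularity image stored on the parallel file system.
mpirun -np {np} singularity exec {image} {app}
"""

def submit_singularity_job(image, app, nodes=2, ncpus=32,
                           queue="workq", walltime="01:00:00", job_name="hpc-cloud-job"):
    script = PBS_TEMPLATE.format(job_name=job_name, queue=queue, nodes=nodes,
                                 ncpus=ncpus, walltime=walltime,
                                 np=nodes * ncpus, image=image, app=app)
    with tempfile.NamedTemporaryFile("w", suffix=".pbs", delete=False) as f:
        f.write(script)
        path = f.name
    # qsub prints the job ID, which can later be used for Job List / Job Delete (qstat / qdel).
    job_id = subprocess.run(["qsub", path], check=True,
                            capture_output=True, text=True).stdout.strip()
    return job_id

# Example (hypothetical image and application paths):
# submit_singularity_job("/gpfs/images/openmpi-app.sif", "./mpi_hello")
```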

Figure 13. Job management workflow.

Metering data management was implemented by Data Collector, which consists of Resource Measured Data Collector, Real-time Data Collector, and Job History Data Collector, as specified in Figure 14. Resource Measured Data Collector collects resource measured data provided by the batch scheduler. Real-time Data Collector collects the current CPU and memory use of containers running in the container platform group by sending requests with the sshpass command every 10 s while the container is running. Job History Data Collector organizes measured data and real-time data designated by JobID into a history dataset that can be reconstructed into history data for a certain period. Sending requests with the sshpass command every 10 s creates overhead on both the metering node and the compute nodes. Performance could be improved by collecting logs over another communication protocol or by installing a log collection agent on the compute node side that transmits to the metering node; this has not yet been applied and is planned as future work.

Figure 14. Metering management workflow.
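The sketch below illustrates the kind of polling loop Real-time Data Collector performs, querying container CPU and memory use over SSH every 10 s. The use of sshpass with `docker stats`, the host and credential handling, and the `store` client are assumptions for the example; the actual collector follows the workflow in Figure 14.

```python
import subprocess
import time

POLL_INTERVAL = 10  # seconds, as described in the text

def collect_container_stats(host: str, password: str) -> str:
    """Query current CPU and memory use of running containers on a compute node."""
    cmd = [
        "sshpass", "-p", password,            # non-interactive SSH, as in the paper
        "ssh", "-o", "StrictHostKeyChecking=no", host,
        "docker", "stats", "--no-stream",
        "--format", "{{.Name}} {{.CPUPerc}} {{.MemUsage}}",
    ]
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout

def poll(job_id: str, hosts: list, password: str, store) -> None:
    """Real-time collection loop: runs while the job's containers are alive."""
    while store.job_is_running(job_id):       # 'store' is a hypothetical metadata client
        for host in hosts:
            sample = collect_container_stats(host, password)
            store.insert_sample(job_id, host, time.time(), sample)
        time.sleep(POLL_INTERVAL)
```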

4. Platform Implementation

Our research goal was to develop a system that provides container-based HPCaaS in the cloud. To evaluate our system, we created a cluster environment and verified the serviceability of our container-based HPC cloud platform using supercomputing resources. We configured the cluster to check the availability of our platform. As depicted in Figure 15, our container-based HPC cloud platform was constructed based on the network configuration of the KISTI cluster. There are three types of network configurations: a public network connected by a 1G Ethernet switch (black line), a management network connected by a 10G Ethernet switch (black line), and a data network connected by an InfiniBand switch (red line). We constructed three Image Manager Server nodes as an auto-scaling group, and these are connected to Docker Private Registry and the parallel file system (GPFS) storage through the management network. As the figure shows, we configured two container types: Docker Container Platform (green), which is configured with the overlay network for container network communication, and Singularity Container Platform (blue), which is configured with the host network for container network communication. Both data networks for container communication are connected to the InfiniBand switch, and the Job Submit Server is deployed with PBSPro and Kubernetes. For Docker Container Platform, the integrated scheduler of PBSPro and Kubernetes creates jobs; in contrast, for Singularity Container Platform, only PBSPro creates jobs. We also built Image Metadata Server and Job Metadata Server for storing and managing image metadata and job metadata, respectively.

Figure 15. Cluster environment.

Table 3 shows the installed software information. We have software licenses for PBSPro and GPFS; therefore, we used other open source software compatible with this licensed software, and the choice of open source software was based on easy and common use. We installed MariaDB v5.5.52 on Image Metadata Server as the MySQL-compatible relational database and MongoDB v2.6.12 on Job Metadata Server as the Not Only SQL (NoSQL) database. Job Metadata Server is more suited to a NoSQL database than to a relational database because it stores metering data on resource use; NoSQL is designed to improve latency and throughput performance by providing highly optimized key-value storage for simple retrieval and additional operations. We selected MongoDB to implement the NoSQL database because it is designed around the document data model and provides a manual for various programming languages, as well as a simple query syntax in Python. Image Manager Server, Image Submit Server, and Job Submit Server, which implement the distributed task system,



were installed using a combination of Python v2.7.5, Celery v4.1, and Redis-server v3.2.3. PBSPro v14.1.0 was installed as the batch scheduler, and Kubernetes v1.7.4 was installed as the container scheduler. Docker v17.06-ce was installed with Calico v2.3 to configure the overlay network with Kubernetes for Docker containers. For Singularity Container Platform, we installed Singularity v2.2, the most stable version at the time. PLSICLOUD is a user command line tool with which users submit image and job requests; it was installed on Image Submit Server and Job Submit Server.
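As an illustration of why a document store suits per-job metering records, the following sketch stores and queries resource samples with MongoDB through pymongo. The database, collection, host, and field names are invented for the example; only the choice of MongoDB itself comes from the platform described here.

```python
from datetime import datetime, timezone
from pymongo import MongoClient

client = MongoClient("mongodb://job-metadata-server:27017")  # hypothetical host name
metering = client["hpc_cloud"]["metering"]

# Each sample is a self-contained document keyed by JobID, so records with
# different shapes (CPU-only, CPU+memory, per-container) coexist without schema changes.
metering.insert_one({
    "job_id": "1234.pbs",
    "node": "cn01",
    "timestamp": datetime.now(timezone.utc),
    "cpu_percent": 87.5,
    "mem_mb": 20480,
})

# Reconstruct the history of one job for a billing period.
history = metering.find({"job_id": "1234.pbs"}).sort("timestamp")
for sample in history:
    print(sample["timestamp"], sample["cpu_percent"], sample["mem_mb"])
```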

Table 3. Installed software information.

Role                                       Software      Version    Installed Node (in Figure 15)
Image Metadata Storage                     MariaDB       5.5.52     Image Metadata Server
Job Metadata Storage                       MongoDB       2.6.12     Job Metadata Server
Container Platform                         Docker        17.06-ce   Docker Container Platform, Docker Private Registry, Image Manager Server
Container Platform                         Singularity   2.2        Singularity Container Platform, Image Manager Server
Batch Scheduler                            PBSPro        14.1.0     Job Submit Server, Docker Container Platform, Singularity Container Platform
Docker Container Scheduler                 Kubernetes    1.7.4      Job Submit Server, Docker Container Platform
Overlay Network Solution                   Calico        2.3        Job Submit Server, Docker Container Platform
Distributed Task Queue                     Celery        4.1        Image Submit Server, Image Manager Server, Job Submit Server
Programming Language                       Python        2.7.5      Image Submit Server, Image Manager Server, Job Submit Server
Message Broker                             Redis-server  3.2.3      Image Manager Server
User Command Line Tool (developed in Python)  PLSICLOUD  1.0        Image Submit Server, Job Submit Server

5. Evaluation

In this study, we evaluated the platform with respect to essential attributes such as on-demand self-service, rapid elasticity and scalability, auto-provisioning, workload management, multitenancy, portability of applications, and performance, which address the requirements of HPC users and administrators listed in Table 1.

5.1. On-Demand Self-Service

On-demand self-service means that a consumer can unilaterally provision computing capabilities, such as computing time and storage, as needed automatically without requiring human interaction with each service provider [20]. HPC users can self-service both current image resources and computational resources, which are automatically provided by the service provider, as shown in Figure 7. Users can also request their own images and submit jobs by allocating computational resources through the plsicloud_create_image and plsicloud_submit_job commands. We also present on-demand billing management on this platform, which enables a pay-as-used model. By integrating the tracejob command of PBSPro, we implemented a resource usage calculation for each user, shown in Figure 16. As the figure shows, we provide average CPU use, total number of used processes, total used wall-time, CPU time, memory size, and virtual memory size for the on-demand billing evaluation. Based on this resource usage information, our supercomputing center can apply a pricing policy for the pay-as-used model.
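A minimal sketch of the tracejob integration is shown below: it calls PBSPro's tracejob for a finished job and pulls the resources_used fields out of the job-end record. The exact attribute names present in tracejob output can vary by PBSPro version, so the regular expression, the field list, and the job identifier are assumptions for illustration.

```python
import re
import subprocess

def job_usage(job_id: str) -> dict:
    """Extract resources_used.* values from PBSPro tracejob output for billing."""
    out = subprocess.run(["tracejob", job_id], capture_output=True, text=True, check=True).stdout
    # Matches e.g. "resources_used.cput=01:23:45" or "resources_used.mem=1048576kb".
    return dict(re.findall(r"resources_used\.(\w+)=(\S+)", out))

if __name__ == "__main__":
    usage = job_usage("1234.pbs")                     # hypothetical job identifier
    for key in ("cput", "walltime", "ncpus", "mem", "vmem"):
        print(key, usage.get(key, "n/a"))
```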

5.2. Rapid Elasticity and Scalability

Rapid elasticity and scalability mean that capabilities can be elastically provisioned and released, in some cases automatically, to scale rapidly outward and inward commensurate with demand [20]. One important factor affecting rapid elasticity on our platform is the rapid movement of image files. When a user makes a job request for a resource with the required image, the system must quickly move the image created in Image

Temporary Storage to the compute nodes. For Singularity, there is nothing special to consider when transferring an image, which can also be run directly as a container. The problem is deploying the Docker image. Since Docker has a layered image structure managed by the Docker engine, the image must be compressed into a file before it can be moved to the compute nodes. Considering the file compression time, file transfer time, and file decompression time, this is very inefficient. To solve this problem, we built Docker Private Registry, which stores images and is connected to the management network. We measured the time reduction with and without Docker Private Registry for handling a 1.39 GB image: with Docker Private Registry, the transfer took 122 s, 20 s faster than without it.
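The following sketch outlines how such a comparison can be timed: distributing an image through a private registry (docker push/pull) versus exporting, copying, and re-importing it with docker save, scp, and docker load. Host names, the registry address, and the image tag are placeholders; the 122 s figure above comes from the paper's own measurement, not from this script.

```python
import subprocess
import time

IMAGE = "hpc-app:latest"                      # placeholder image tag
REGISTRY = "registry.cluster.local:5000"      # placeholder private registry
COMPUTE_NODE = "cn01"                         # placeholder compute node

def timed(*cmds):
    """Run shell commands sequentially and return the elapsed wall time."""
    start = time.time()
    for cmd in cmds:
        subprocess.run(cmd, shell=True, check=True)
    return time.time() - start

# Path 1: push to the private registry, then pull on the compute node.
with_registry = timed(
    f"docker tag {IMAGE} {REGISTRY}/{IMAGE}",
    f"docker push {REGISTRY}/{IMAGE}",
    f"ssh {COMPUTE_NODE} docker pull {REGISTRY}/{IMAGE}",
)

# Path 2: save to a tar file, copy it over, and load it on the compute node.
without_registry = timed(
    f"docker save -o /tmp/image.tar {IMAGE}",
    f"scp /tmp/image.tar {COMPUTE_NODE}:/tmp/image.tar",
    f"ssh {COMPUTE_NODE} docker load -i /tmp/image.tar",
)

print(f"with registry: {with_registry:.0f} s, without registry: {without_registry:.0f} s")
```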

Figure 16. Resource usage.

For reducing the workload of creating images, we developed an auto-scaling scheduler with three workers implemented as a group on our platform. We applied this custom algorithm to compare the waiting time of each task with and without auto-scaling. With one worker, the second task must wait until the previous task is completed; the third task must wait 2059 s, almost twice the first waiting time; and the fourth task in the queue waits 3083 s, for a total waiting time of 6166 s. With the auto-scaling algorithm, three workers work in parallel, so only 1080 s are needed for the fourth task; thus the total waiting time is only 1200 s.

5.3. Auto-Provisioning

Automatic provisioning of resources is an integral part of our platform. In a containerized environment for the HPC cloud, the provisioning process requires not only an image and a container but also their applications. More specifically, provisioning of a job is needed in application units rather than in container units. To address this, we evaluated the auto-provisioning of image and job resources by modeling their life cycles.

Figure 17 shows the auto-provisioning life cycle of the image process. Once users request to build an image, the configurations for image creation are verified and the state becomes Creating. After image creation is complete, the state changes to Created and the image is automatically uploaded to Docker Private Registry or the shared file system; during this process, the image status is displayed as Uploading. Once it is uploaded, the state changes to Uploaded. If an error occurs during the Creating, Uploading, or Deleting states, the state is automatically set to Down. Images in the Created, Uploaded, and Down states can be deleted, and deleted images are automatically expunged from the repository.


Figure 17. Auto-provisioning life cycle of the image process.

Figure 18 shows the auto-provisioning life cycle of the job process. Once users submit a job, containers are executed using a private or a shared image. In this process, the state changes from Building to Creating. After the creation is complete, the application is executed by changing to the Created state and then automatically going to the Running state. If the application execution is finished, the daemon of the job is automatically updated to the Finished state to complete the operation. Then, allocated resources are automatically released with expunging of the container, including its applications, in the Expunged state. Jobs can be deleted in the Created, Running, Finished, and Down states. If a forced request for resource release is received, the state shows Expunged, and then the allocated resource is released.

Figure 18. Auto-provisioning life cycle of the job process.
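To make the two life cycles concrete, the sketch below encodes the image and job state transitions described above as simple transition tables. The event names are invented for the example; the states themselves (Creating, Created, Uploading, Uploaded, Building, Running, Finished, Expunged, Down) follow Figures 17 and 18, with unexpected events collapsed into Down as a simplification.

```python
# Allowed transitions, keyed by (current_state, event); event names are illustrative.
IMAGE_TRANSITIONS = {
    ("Creating", "created"): "Created",
    ("Created", "upload"): "Uploading",
    ("Uploading", "uploaded"): "Uploaded",
    ("Creating", "error"): "Down",
    ("Uploading", "error"): "Down",
    ("Deleting", "error"): "Down",
}
IMAGE_DELETABLE = {"Created", "Uploaded", "Down"}

JOB_TRANSITIONS = {
    ("Building", "resources_allocated"): "Creating",
    ("Creating", "container_created"): "Created",
    ("Created", "application_started"): "Running",
    ("Running", "application_finished"): "Finished",
    ("Finished", "resources_released"): "Expunged",
}
JOB_DELETABLE = {"Created", "Running", "Finished", "Down"}

def next_state(transitions: dict, state: str, event: str) -> str:
    """Return the next state, or Down if the event is not allowed in this state."""
    return transitions.get((state, event), "Down")

# Example: walk a job through its normal life cycle.
state = "Building"
for event in ("resources_allocated", "container_created",
              "application_started", "application_finished", "resources_released"):
    state = next_state(JOB_TRANSITIONS, state, event)
print(state)  # Expunged
```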

5.4. Workload Management


There are three workloads considered in our platform: image tasks, container execution, and application execution. Distributing and parallelizing workloads for image tasks are implemented by defining each client, task queue, and worker with Celery as a distributed framework and Redis as a message broker. Workloads for container and application execution are handled by the existing batch scheduler. We evaluated the management of these workloads by integrating commands of the container scheduler (Kubernetes) and the batch scheduler (PBSPro). Figure 19 shows the resource monitoring, which includes information about containers and batch jobs obtained with the plsicloud my_resouce command.
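A minimal sketch of that Celery/Redis arrangement is shown below: a task queue backed by the Redis broker on Image Manager Server, with image-build work executed by remote workers. The module layout, broker URL, and task body are assumptions for illustration, not the platform's actual code.

```python
# tasks.py -- start workers with:  celery -A tasks worker --concurrency=4
import subprocess
from celery import Celery

# Redis on Image Manager Server acts as the message broker (host name assumed).
app = Celery("image_tasks",
             broker="redis://image-manager:6379/0",
             backend="redis://image-manager:6379/1")

@app.task
def build_image(image_name: str, template_dir: str) -> str:
    """Build a container image from a template on whichever worker picks up the task."""
    subprocess.run(["docker", "build", "-t", image_name, template_dir], check=True)
    return image_name

# Client side (e.g., Image Submit Server): enqueue the build and wait for the result.
if __name__ == "__main__":
    result = build_image.delay("user/app:1.0", "/templates/centos7-openmpi")
    print(result.get(timeout=3600))
```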

Figure 19. Resource monitoring information about containers and batch jobs.
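The combined view in Figure 19 can be assembled from the two schedulers' own query commands, roughly as in the sketch below, which concatenates `kubectl get pods` and `qstat` output for one user. The option flags are standard for those tools, but the owner label and the report formatting are invented here.

```python
import subprocess

def run(cmd):
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout

def my_resources(user: str) -> str:
    """Combine container (Kubernetes) and batch job (PBSPro) status for one user."""
    containers = run(["kubectl", "get", "pods", "-o", "wide",
                      "-l", f"owner={user}"])             # assumes pods are labeled by owner
    batch_jobs = run(["qstat", "-u", user])                # PBSPro jobs belonging to the user
    return (f"=== Containers (Kubernetes) ===\n{containers}\n"
            f"=== Batch jobs (PBSPro) ===\n{batch_jobs}")

if __name__ == "__main__":
    print(my_resources("alice"))
```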

5.5. Multitenancy

Multitenancy means that the provider's computing resources are pooled to serve multiple consumers, assigned and reassigned according to consumer demand [20]. In our platform, we provided an isolated environment to each user with shared resource pools. Figure 20 shows the concept of the multitenancy implemented in our platform. Image metadata and job metadata are stored in two different types of databases, MySQL and NoSQL, according to the data characteristics of the resources. We can use PLSICLOUD to obtain the information sent to the service by each user. The development of the PLSICLOUD CLI tool allows evaluation of the cloud multitenancy model.

5.6. Portability of Applications

We evaluated the portability of applications by providing containerized applications. We verified containerization of frequently used container versions, OS versions, and compilers dependent on parallel libraries to meet the requirements of users, as shown in Table 4. We tested serviceability through the containerized HPL task, a benchmark tool that tests HPC system performance and that is installed automatically from the OS to the application based on our templates [21]. Currently, our system supports Docker v17.06.1-ce and the stable Singularity v2.2, along with CentOS 6.8 and 7.2; OpenMPI is provided in v1.10.6 and v2.0.2. Each template includes the mathematical library GotoBLAS2 built on top of the parallel library.

Figure 20. Multitenancy concept implemented in our platform.







Table 4. Template of containerized applications.

Container                              OS           Library                  Parallel Library        Application Library   Application
Docker 17.06.1-ce / Singularity 2.2    CentOS 6.8   gcc, Development Tools   OpenMPI (gcc) 1.10.6    GotoBLAS2-1.13        HPL 2.2
Docker 17.06.1-ce / Singularity 2.2    CentOS 6.8   gcc, Development Tools   OpenMPI (gcc) 2.0.2     GotoBLAS2-1.13        HPL 2.2
Docker 17.06.1-ce / Singularity 2.2    CentOS 7.2   gcc, Development Tools   OpenMPI (gcc) 1.10.6    GotoBLAS2-1.13        HPL 2.2
Docker 17.06.1-ce / Singularity 2.2    CentOS 7.2   gcc, Development Tools   OpenMPI (gcc) 2.0.2     GotoBLAS2-1.13        HPL 2.2

5.7. Performance Evaluation with MPI Tests

We constructed two nodes for running point-to-point MPI parallel tasks to check different types of bandwidth and latency with the benchmark tool osu-micro-benchmarks v5.4 [22]. The latency test mainly measures the latency caused by sending and receiving messages of different sizes between two nodes using the ping-pong test. The latency results for bare-metal, singularity, and docker-calico are almost identical and are not discussed further in this paper [21]. The bandwidth results (Figure 21) include bandwidth, bi-directional bandwidth, and multiple-bandwidth tests. As the figure shows, the peak bandwidth occurs in a certain message size interval; this interval is almost the same in the three cases of bare-metal, singularity, and docker-calico, except for some instability of docker-calico.

Results between two nodes are insufficient to evaluate the performance of the container-based HPC platform in an HPC environment. Therefore, we constructed 8 nodes with 16, 32, 64, and 72 cores for measuring latency with various MPI blocking collective operations (barrier in Figure 22, gather in Figure A1, all-gather in Figure A2, reduce in Figure A3, all-reduce in Figure A4, reduce-scatter in Figure A5, scatter in Figure A6, all-to-all in Figure A7, and broadcast in Figure 23) using the same benchmark tool, osu-micro-benchmarks.
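For reference, a run of these OSU tests can be driven by a small wrapper like the one below, which launches selected benchmarks with mpirun over a host file and records the output. The binary names are the standard ones shipped with osu-micro-benchmarks; the host file path, process counts, and output layout are placeholders rather than the settings used in the paper's experiments.

```python
import subprocess
from pathlib import Path

HOSTFILE = "hosts.txt"            # placeholder: one compute node per line
OUTDIR = Path("osu_results")
BENCHMARKS = {
    "osu_bw": 2,                  # point-to-point bandwidth (2 ranks)
    "osu_bibw": 2,                # bi-directional bandwidth
    "osu_latency": 2,             # ping-pong latency
    "osu_barrier": 72,            # collective latency across all cores
    "osu_bcast": 72,              # broadcast latency
}

OUTDIR.mkdir(exist_ok=True)
for binary, nprocs in BENCHMARKS.items():
    result = subprocess.run(
        ["mpirun", "-np", str(nprocs), "-hostfile", HOSTFILE, binary],
        capture_output=True, text=True, check=True,
    )
    (OUTDIR / f"{binary}_{nprocs}.txt").write_text(result.stdout)
    print(f"{binary} with {nprocs} ranks done")
```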






Figure 21. MPI bandwidth includes (a) MPI Bandwidth, (b) MPI BI Bandwidth, and (c) MPI Multi Bandwidth.

Figure 22. MPI barrier latency.

In Figure 22, the barrier latency of singularity shows almost the same performance as the bare-metal case except with 64 cores, while docker-calico shows a performance gap relative to singularity and bare-metal. For the remaining operations, as the number of cores increases on the same number of nodes, the performance of bare-metal, singularity, and docker-calico is almost the same when looking at the overall graph trends. However, in Figure 23, the docker-calico results are worse than the other two cases in a specific message size range.

Figure 23. MPI broadcast latency with (a) 16 cores, (b) 32 cores, (c) 64 cores, and (d) 72 cores.






6. Conclusions and Future Work

The container-based approach to the HPC cloud is expected to ensure efficient management and use of supercomputing resources, which are areas that present challenges in the HPC field. This study demonstrates the value of technology convergence by attempting to provide users with a single cloud environment through the integration of container-based technology and traditional HPC resource management technology. It provides container-based solutions to the problems of HPC users and administrators, and these can be of practical assistance in resolving issues such as complexity, compatibility, application expansion, pay-as-used billing management, cost, flexibility, scalability, workload management, and portability.

The deployment of a container-based HPC cloud platform is still a challenging task. Thus far, our proposed architecture has set and used measurement values for various resources by mainly considering KISTI's compute-intensive HPC jobs. In future work, we must consider HTC jobs, network-intensive jobs, and GPU-intensive jobs, especially for machine learning or deep learning applications, and add measurement values for new resources that satisfy the characteristics of these jobs. Another potential research task is to automate the process of creating and evaluating templates for HPC service providers as service creators. In our platform, there is not enough generalization in the degree

of automation to adapt to the application characteristics whenever a new template is added. In the current job integration management part, additional development is required for the batch scheduler for general jobs and for further interworking between Kubernetes and container jobs. In our platform, the user cannot access the container directly; container creation, execution, application execution, and container deletion are all automated. However, this issue can be addressed by developing a linked package for the Application Programming Interface (API) of the existing Kubernetes and batch job schedulers; these would connect to the machine on which the job is running when submitting the job according to the user's requirements. The evaluation of our platform was conducted in a small cluster environment. If we apply our platform to a large cluster environment, further evaluations of availability, deployment efficiency, and execution efficiency will be needed. We hope that our proposed architecture will contribute to the widespread deployment and use of future container-based HPC cloud services.

Author Contributions: Conceptualization, G.L. and J.W.; methodology, G.L. and S.B.L.; software, G.L.; validation, J.W.; formal analysis, G.L.; resources, J.W.; writing-original draft preparation, G.L.; writing-review and editing, G.L. and S.B.L.; supervision, S.B.L.; project administration, J.W. All authors have read and agreed to the published version of the manuscript.

Funding: This research has been performed as a subproject of Project No. K-21-L02-C01-S01 (The National Supercomputing Infrastructure Construction and Service) supported by the KOREA INSTITUTE of SCIENCE and TECHNOLOGY INFORMATION (KISTI).

Institutional Review Board Statement: Not applicable.

Informed Consent Statement: Not applicable.

Data Availability Statement: No new data were created or analyzed in this study. Data sharing is not applicable to this article.

Acknowledgments: This study was performed as a subproject of the KISTI project entitled the National Supercomputing Infrastructure Construction and Service.

Conflicts of Interest: This manuscript has not been published or presented elsewhere in part or in entirety and is not under consideration by another journal. We have read and understood your journal's policies, and we believe that neither the manuscript nor the study violates any of these. There are no conflicts of interest to declare.

Appendix A


Figure A1. MPI gather latency with (a) 16 cores, (b) 32 cores, (c) 64 cores, and (d) 72 cores.

Figure A2. MPI all-gather latency with (a) 16 cores, (b) 32 cores, (c) 64 cores, and (d) 72 cores.





Figure A3. MPI reduce latency with (a) 16 cores, (b) 32 cores, (c) 64 cores, and (d) 72 cores.

Figure A4. MPI all-reduce latency with (a) 16 cores, (b) 32 cores, (c) 64 cores, and (d) 72 cores.

Figure A5. MPI reduce-scatter latency with (a) 16 cores, (b) 32 cores, (c) 64 cores, and (d) 72 cores.

Figure A6. MPI scatter latency with (a) 16 cores, (b) 32 cores, (c) 64 cores, and (d) 72 cores.

Figure A7. MPI all-to-all latency with (a) 16 cores, (b) 32 cores, (c) 64 cores, and (d) 72 cores.

References

1. Joseph, E.; Conway, S.; Sorensen, B. High Performance Computing in the EU: Progress on the Implementation of the European HPC Strategy; European Commission: Brussels, Belgium, 2014.
2. Fusi, M.; Mazzocchetti, F.; Farrés, A.; Kosmidis, L.; Canal, R.; Cazorla, F.J.; Abella, J. On the Use of Probabilistic Worst-Case Execution Time Estimation for Parallel Applications in High Performance Systems. Mathematics 2020, 8, 314. [CrossRef]
3. Smirnov, S.; Sukhoroslov, O.; Voloshinov, V. Using Resources of Supercomputing Centers with Everest Platform. In Proceedings of the 4th Russian Supercomputing Days, Moscow, Russia, 24–25 September 2018; pp. 687–698.
4. Fienen, M.N.; Hunt, R.J. High-Throughput Computing Versus High-Performance Computing for Groundwater Applications. Ground Water 2015, 53, 180–184. [CrossRef] [PubMed]
5. Raicu, I.; Foster, I.T.; Zhao, Y. Many-task computing for grids and supercomputers. In Proceedings of the 2008 Workshop on Many-Task Computing on Grids and Supercomputers, Austin, TX, USA, 17 November 2008; pp. 1–11.
6. Balakrishnan, S.R.; Veeramani, S.; Leong, J.A.; Murray, I.; Sidhu, A.S. High Performance Computing on the Cloud via HPC+Cloud software framework. In Proceedings of the 2016 Fifth International Conference on Eco-Friendly Computing and Communication Systems (ICECCS), Bhopal, India, 8–9 December 2016; pp. 48–52.
7. Jamalian, S.; Rajaei, H. ASETS: A SDN Empowered Task Scheduling System for HPCaaS on the Cloud. In Proceedings of the 2015 IEEE International Conference on Cloud Engineering, Tempe, AZ, USA, 9–13 March 2015; pp. 329–334.
8. Shanmugalingam, S.; Ksentini, A.; Bertin, P. DPDK Open vSwitch performance validation with mirroring feature. In Proceedings of the 2016 23rd International Conference on Telecommunications (ICT), Thessaloniki, Greece, 16–18 May 2016; pp. 1–6.
9. Gerhardt, L.; Bhimji, W.; Canon, S.; Fasel, M.; Jacobsen, D.; Mustafa, M.; Porter, J.; Tsulaia, V. Shifter: Containers for HPC. J. Phys. Conf. Ser. 2017, 898, 082021. [CrossRef]
Benedicic, (ICECCS), L.; Cruz, F.A.;Bhopal Madonna,, India, A.; 8–9 Mariotti, December K. Sarus: Highly2016; pp. Scalable 48–52. Docker Containers for HPC Systems. In Mining Data 7. Jamalian, S.; Rajaei, H.for FinancialASETS: Applications A SDN ;Empowered Springer Nature: Task Cham, Schedulin Switzerland,g 2019;System pp. 46–60. for HPCaaS on the Cloud. In Proceedings of the 2015 IEEE International Conference on Cloud Engineering, Tempe, AZ, USA, 9–13 March 2015; pp. 329–334. 8. Shanmugalingam, S.; Ksentini, A.; Bertin, P. DPDK Open vSwitch performance validation with mirroring feature. In Proceed- ings of the 2016 23rd International Conference on Telecommunications (ICT), Thessaloniki, Greece, 16–18 May 2016; pp. 1–6. 9. Gerhardt, L.; Bhimji, W.; Canon, S.; Fasel, M.; Jacobsen, D.; Mustafa, M.; Porter, J.; Tsulaia, V. Shifter: Containers for HPC. J. Phys. Conf. Ser. 2017, 898, 082021, doi:10.1088/1742-6596/898/8/082021. 10. Benedicic, L.; Cruz, F.A.; Madonna, A.; Mariotti, K. Sarus: Highly Scalable Docker Containers for HPC Systems. In Mining Data for Financial Applications; Springer Nature: Cham, Switzerland, 2019; pp. 46–60. 11. Höb, M.; Kranzlmüller, D. Enabling EASEY Deployment of Containerized Applications for Future HPC Systems. In Mining Data for Financial Applications; Springer Nature: Cham, Switzerland, 2020; Volume 12137, pp. 206–219. 12. Miesch, M. JEDI to Go: High-Performance Containers for Earth System Prediction. In Proceedings of the JEDI Academy, Boul- der, CO, USA, 10–13 June 2019. 13. Woo, J.; Jang, J.H.; Hong, T. Performance Analysis of Data Network at the PLSI Global File System. In Proceedings of the Korea Processing Society Conference; Korea Institute of Science and Technology Information: Daejeon, Korea, 2017, pp. 71–72, doi.org/10.3745/PKIPS.Y2017M04A.71. Appl. Sci. 2021, 11, 923 28 of 28

11. Höb, M.; Kranzlmüller, D. Enabling EASEY Deployment of Containerized Applications for Future HPC Systems. In Mining Data for Financial Applications; Springer Nature: Cham, Switzerland, 2020; Volume 12137, pp. 206–219.
12. Miesch, M. JEDI to Go: High-Performance Containers for Earth System Prediction. In Proceedings of the JEDI Academy, Boulder, CO, USA, 10–13 June 2019.
13. Woo, J.; Jang, J.H.; Hong, T. Performance Analysis of Data Network at the PLSI Global File System. In Proceedings of the Korea Processing Society Conference; Korea Institute of Science and Technology Information: Daejeon, Korea, 2017; pp. 71–72.
14. Piras, M.E.; Pireddu, L.; Moro, M.; Zanetti, G. Container Orchestration on HPC Clusters. Min. Data Financ. Appl. 2019, 11887, 25–35.
15. Hu, G.; Zhang, Y.; Chen, W. Exploring the Performance of Singularity for High Performance Computing Scenarios. In Proceedings of the 2019 IEEE 21st International Conference on High Performance Computing and Communications, IEEE 17th International Conference on Smart City, IEEE 5th International Conference on Data Science and Systems (HPCC/SmartCity/DSS), Zhangjiajie, China, 10–12 August 2019; pp. 2587–2593.
16. Priedhorsky, R.; Randles, T. Charliecloud. In Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, Denver, CO, USA, 12–17 November 2017; p. 36.
17. Aragon, C.; Dernat, R.; Sanabria, J. Performance Evaluation of Container-based Virtualization for High Performance Computing Environments. arXiv 2017, arXiv:1709.10140.
18. Benedicic, L.; Cruz, F.A.; Schulthess, T.C.; Jacobsen, D. Shifter: Fast and Consistent HPC Workflows Using Containers; Cray User Group (CUG): Redmond, WA, USA, 11 May 2017.
19. Madonna, A.; Benedicic, L.; Cruz, F.A.; Mariotti, K. Shifter at CSCS-Docker Containers for HPC. In Proceedings of the HPC Advisory Council Swiss Conference, Lugano, Switzerland, 9–12 April 2018.
20. Mell, P.; Grance, T. The NIST Definition of Cloud Computing, in Recommendations of the National Institute of Standards and Technology. Technical Report, Special Publication 800-145. September 2011. Available online: https://nvlpubs.nist.gov/nistpubs/Legacy/SP/nistspecialpublication800-145.pdf (accessed on 20 January 2021).
21. Lim, S.B.; Woo, J.; Li, G. Performance analysis of container-based networking solutions for high-performance computing cloud. Int. J. Electr. Comput. Eng. (IJECE) 2020, 10, 1507–1514. [CrossRef]
22. Bureddy, D.; Wang, H.; Venkatesh, A.; Potluri, S.; Panda, D.K. OMB-GPU: A Micro-Benchmark Suite for Evaluating MPI Libraries on GPU Clusters. In Proceedings of the EuroMPI 2012 Conference, Vienna, Austria, 23–26 September 2012; pp. 110–120.