EGI federated platforms supporting accelerated computing

Paolo Andreetto, INFN, Sezione di Padova, Via Marzolo 8, 35131 Padova, Italy
Jan Astalos, Institute of Informatics, Slovak Academy of Sciences, Bratislava, Slovakia
Miroslav Dobrucky, Institute of Informatics, Slovak Academy of Sciences, Bratislava, Slovakia
Andrea Giachetti, CERM Magnetic Resonance Center, CIRMMP and University of Florence, Italy
David Rebatto, INFN, Sezione di Milano, Via Celoria 16, 20133 Milano, Italy
Antonio Rosato, CERM Magnetic Resonance Center, CIRMMP and University of Florence, Italy
Viet Tran, Institute of Informatics, Slovak Academy of Sciences, Bratislava, Slovakia
Marco Verlato, INFN, Sezione di Padova, Via Marzolo 8, 35131 Padova, Italy
Lisa Zangrando, INFN, Sezione di Padova, Via Marzolo 8, 35131 Padova, Italy

While accelerated computing instances providing access to NVIDIA GPUs have been available for a couple of years in commercial public clouds like Amazon EC2, the EGI Federated Cloud put in production its first OpenStack-based site providing GPU-equipped instances at the end of 2015. However, many EGI sites which provide GPUs or MIC coprocessors to enable high performance processing are not yet supported in a federated manner by the EGI HTC and Cloud platforms. In fact, to use the accelerator card capabilities available at resource centre level, users must directly interact with the local provider to get information about the type of resources and software libraries available, and about which submission queues must be used to submit accelerated computing workloads. The EU-funded project EGI-Engage has worked since March 2015 to implement the support to accelerated computing on both its HTC and Cloud platforms, addressing two levels: the information system, based on the OGF GLUE standard, and the middleware. By developing a common extension of the information system structure, it was possible to expose the correct information about the accelerated computing technologies available, both software and hardware, at site level. Accelerator capabilities can now be published uniformly, so that users can extract all the information directly from the information system without interacting with the sites, and can easily use resources provided by multiple sites. On the other hand, HTC and Cloud middleware support for accelerator cards has been extended, where needed, in order to provide a transparent and uniform way to allocate these resources, together with CPU cores, efficiently to the users. In this paper we describe the solution developed for enabling accelerated computing support in the CREAM Computing Element for the most popular batch systems and, for what concerns the information system, the new objects and attributes proposed for implementation in version 2.1 of the GLUE schema. For what concerns the Cloud platform, we describe the solutions implemented to enable GPU virtualisation on the KVM hypervisor via PCI pass-through technology on both OpenStack and OpenNebula based IaaS cloud sites, which are now part of the EGI Federated Cloud offer, and the latest developments about GPU direct access through LXD container technology as a replacement of the KVM hypervisor. Moreover, we showcase a number of applications and best practices implemented by the structural biology and biodiversity scientific user communities that already started to use the first accelerated computing resources made available through the EGI HTC and Cloud platforms.

International Symposium on Grids and Clouds 2017 (ISGC 2017), 5-10 March 2017, Academia Sinica, Taipei, Taiwan

PoS(ISGC2017)020 - © Copyright owned by the author(s) under the terms of the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License (CC BY-NC-ND 4.0). http://pos.sissa.it/

1. Introduction

The EGI infrastructure is a federation of 20 cloud providers and over 300 data centres, spread across Europe and worldwide. As of November 2016, EGI offers more than 730,000 CPU-cores for High Throughput Compute, 6,600 cores for cloud compute, and more than 500 PB of data storage. The infrastructure operations have been supported since 2010 by two European projects: EGI-InSPIRE (2010-2014) and EGI-Engage (2015-2017). In 2012 an EGI Working Group was set up to assess the interest of the EGI community (both users and resource providers) in General-Purpose Graphics Processing Units (GPGPUs, shortened to GPUs in the rest of the article). A survey completed in September 2012 [1] showed that:
• 92.9% of the users would be interested in accessing remote GPU based resources through a computing infrastructure;
• 30.2% of the resource centres were already providing GPU resources, and 56.3% planned to do so in the incoming 24 months.

In the following years Accelerated Computing has become increasingly popular. The term refers to a computing model used in scientific and engineering applications whereby calculations are carried out on specialized processors (known as accelerators) coupled with traditional CPUs to achieve faster real-world execution times. Nowadays accelerators are highly specialized microprocessors designed with data parallelism in mind: other than GPUs, they include Intel Xeon Phi coprocessors, based on the Many Integrated Core (MIC) architecture, and specialized Field Programmable Gate Array (FPGA) PCIe cards. They allow to reduce execution times by offloading the parallelizable, computationally intensive portions of an application to the accelerators, while the remainder of the code continues to run on the CPU.

Unfortunately, in the current implementation of the EGI infrastructure there is neither a way to describe and discover this kind of resources, nor to allocate them together with CPU cores to efficiently execute user jobs. A dedicated task of the EGI-Engage project was therefore designed to address these issues, with the ultimate goal of providing a complete new Accelerated Computing platform for EGI. The task activities started in March 2015, with two main objectives: to implement accelerated computing support in the information system, by extending the current OGF GLUE standard schema [2], and in the middleware of both the EGI HTC and Cloud platforms.

User communities, grouped within EGI-Engage in entities called Competence Centres (CCs), had the driving role in setting the requirements and providing a number of use cases. In particular, the LifeWatch CC captured the requirements of the Biodiversity and Ecosystems research community for deploying GPU based e-infrastructure services supporting data management, processing and modelling for Ecological Observatories; it provided an Image Classification Deep Learning Tool as use case.
The MoBrain CC instead captured the requirements of the Structural Biology domain, aiming at deploying scientific portals for bio-molecular simulations leveraging GPU resources; their use cases involved popular molecular dynamics packages like AMBER and GROMACS, and 3D bio-molecular structure model fitting software like DisVis and PowerFit.

The paper is organized as follows. Section 2 describes the development work carried out to enhance the EGI HTC platform in order to fully exploit accelerator cards, by introducing middleware support for them. Section 3 illustrates how the support to GPUs, based on PCI pass-through virtualisation technology, has been introduced in the EGI Federated Cloud by properly configuring and tuning the OpenStack and OpenNebula cloud frameworks. Section 4 shows a number of applications that firstly used the accelerated platform to enhance their performance, and finally Section 5 provides a summary and the future perspectives.

2. The HTC Accelerated Platform

A snapshot of the EGI Grid Information System taken in October 2015 [1] has shown that 75% of the 567 Compute Elements (CEs) composing the EGI HTC platform are based on CREAM (Computing Resource Execution And Management): a web service based CE for the UMD/gLite grid middleware [3]. CREAM has been for years the most popular grid interface in EGI for a number of Local Resource Management Systems (LRMS): TORQUE, SGE, LSF, Slurm, HTCondor. Nowadays, the most recent versions of these LRMS provide native support for GPUs, MIC coprocessors and other accelerators, meaning that computing nodes hosting these cards and managed by one of the above LRMS can be selected by specifying the proper LRMS directives and parameters. This suggested the definition of a work plan to implement an HTC Accelerated Platform based on an enhanced CREAM CE in the following steps:
1. Identification of the relevant GPU/MIC/FPGA related parameters supported by any LRMS and useful to abstract them as JDL attributes.
2. Extension of the current GLUE 2.1 draft schema for describing the accelerator information.
3. Implementation of the relevant changes needed in the CREAM code components.
4. Development of information providers according to the extended GLUE 2.1 draft specifications.
5. Test and certification of the enhanced CREAM CE prototype.
6. Release of an updated version of CREAM CE with full support for accelerators.

2.1 Job submission

User job submission towards the EGI HTC platform typically happens through VO level services like DIRAC [4] or gLite-WMS [5], which act as resource brokers to select the best suited CE where to execute the user job, according to the job requirements specified in the JDL file. We consider here only direct job submission to a CREAM CE, through the glite-ce-job-* command line. The Job Description Language (JDL) is a high-level, user-oriented language based on Condor classified advertisements (classads) for describing jobs to be submitted to a resource broker or directly to the CREAM CE service. Being the JDL an extensible language [6], the user is allowed to use any attribute for the description of a request without incurring in errors from the JDL parser. However, only a certain set of attributes, which we will refer to as "supported attributes" from now on, is taken into account by the CREAM CE service. Figure 1 below shows the high level architecture of the CREAM CE service. It is essentially a grid gateway with a Web Service interface that allows remote users, authorized via the VOMS service, to submit jobs to a compute cluster managed by one of the popular LRMS listed above.

Figure 1: High level architecture of the CREAM CE service. Circles highlight the components enhanced to support accelerated computing.

The CREAM core is a Java component where the JDL supported attributes are defined and parsed from the user request. The BLAH (Batch Local Ascii Helper) component [7] is an abstraction layer for submission and control of job requests to a local batch system. It is a daemon written in C, coupled to BASH scripts, that parses the JDL supported attributes and translates them into the proper directives to be used as input arguments to the commands of the specific LRMS.

Considering that TORQUE is the most popular choice of LRMS in EGI for the CREAM CE instances, we started to work with a testbed composed of 3 nodes, each one with 2 Intel Xeon E5-2620 v2 CPUs and 2 NVIDIA Tesla K20m GPUs, and a server hosting TORQUE 4.2.10 (compiled with the NVIDIA NVML libraries) with MAUI 3.3.1 as scheduler. The testbed was made available by the CIRMMP team at the University of Florence, Italy, and was used to implement the first prototype of a GPU-enabled CREAM CE in December 2015. Starting from the TORQUE/Maui syntax for the submission of a job requesting GPUs at the command line, e.g. for one node with one GPU:
    $ qsub -l nodes=1 -W x='GRES:gpu@1'

the new JDL attribute "GPUNumber" was defined, and the CREAM core and BLAH components were patched in order to allow the CREAM CE to understand the new attribute and properly map it into the suitable TORQUE/Maui directive. This way, a user JDL file containing the line GPUNumber=1 implies that the job is enqueued in the LRMS until a worker node hosting at least one GPU card becomes available.

Unfortunately, the native support for GPUs or other accelerator cards of the TORQUE LRMS coupled with Maui [8], a popular open source job scheduler whose support was discontinued in favour of its commercial successor Moab, was found to be very limited and did not allow to abstract other interesting attributes at JDL level. On the other hand, the integration of CREAM CE with Moab, which implements full support for accelerators, was not in the scope of this activity.

A deeper analysis of the latest versions of the other CREAM supported LRMS, namely HTCondor, LSF, Slurm and SGE, allowed instead to identify two additional JDL attributes that could be defined and implemented to exploit the more advanced native accelerator support of these batch systems:
• The attribute "GPUModel" was defined to target worker nodes with a given model of GPU card.
• The attribute "MICNumber" was defined to target worker nodes with a given number of MIC coprocessors.
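For illustration, a minimal JDL using the new attributes could look as follows (the executable and sandbox file names are placeholders, and the GPUModel string is site dependent):

    [
      Executable    = "disvis_wrapper.sh";      // hypothetical wrapper script
      StdOutput     = "out.txt";
      StdError      = "err.txt";
      InputSandbox  = {"disvis_wrapper.sh"};
      OutputSandbox = {"out.txt", "err.txt"};
      GPUNumber     = 1;                        // request at least one GPU card
      GPUModel      = "teslak80";               // request a given GPU model (site-dependent value)
    ]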
The implementation and test of a new CREAM CE prototype supporting the two additional attributes was carried out since March 2016 thanks to the collaboration of four EGI resource centres that deployed the prototype alongside their production CREAM CE, on top of clusters managed respectively by HTCondor at the GRIF/LLR data centre, Slurm at the ARNES data centre, LSF at the INFN-CNAF data centre, and SGE and Slurm at the Queen Mary University of London (QMUL) data centre.

The implementation of the GPU/MIC related JDL attributes assumed that the local batch systems had already been configured for GPU/MIC support according to the official LRMS specific documentation for HTCondor [9], LSF [10] and Slurm [11]. Of course, to let the user know whether a grid site has GPUs on board, and which vendor/model is available, the information system has to be properly extended.

Figure 2 below shows an example of a JDL file describing a structural biology job targeting a worker node hosting at least one NVIDIA Tesla K80 GPU. It was submitted to the QMUL grid site, where the CREAM CE prototype has been deployed as interface to the Slurm LRMS. Figure 3 displays an extract of the QMUL gres.conf and slurm.conf files showing that two production worker nodes among the ones forming the entire cluster are equipped respectively with 8 CPU cores plus one NVIDIA Tesla K40c, and 32 CPU cores plus four NVIDIA Tesla K80 GPUs.

Figure 2: Example of JDL file targeting a worker node with one NVIDIA Tesla K80 GPU.

Figure 3: Excerpt of the Slurm configuration files gres.conf and slurm.conf where GPU support is implemented.

For what concerns the applications run on the worker node, the usual grid way of distributing the software via the CVMFS tool could not be used because of the complex software dependencies. In fact, GPU specific application software components generally depend on the drivers of a given NVIDIA GPU card, and different driver versions for the same card can be installed on different worker nodes at the various resource centres. To avoid such conflicts the applications were built inside a number of Docker containers, each one in turn built with a different NVIDIA driver. Six containers with the more recent NVIDIA drivers were made available in the Docker Hub for the structural biology applications, in collaboration with the INDIGO-DataCloud project [12], in order to allow the exploitation of the three grid sites hosting NVIDIA GPUs at CIRMMP, QMUL and ARNES. To avoid security issues, and to drop the requirement of having the Docker engine installed and running on each worker node, the Udocker tool was used. Udocker is a basic user tool developed by INDIGO-DataCloud to execute simple Docker containers in user space without requiring root privileges. Figure 4 shows the script acting as executable in the JDL file example of Figure 2: after landing on the worker node, the driver information is collected and used through the Udocker tool to pull the right Docker image from the Hub and to create the container; the application is then executed inside the container.

Figure 4: Example of script for running the driver dependent DisVis application on the grid using Docker containers.
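The sketch below illustrates the same pattern, under the assumption of a hypothetical image naming scheme that encodes the driver version in the tag; only the driver detection and the standard Udocker pull/create/run sequence are shown:

    #!/bin/bash
    # Detect the NVIDIA driver version installed on this worker node
    DRIVER=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -1)

    # Select the container image built against that driver (hypothetical repository/tag)
    IMAGE="example/disvis:driver-${DRIVER}"

    # Pull the image and create the container in user space (no root privileges required)
    ./udocker pull "$IMAGE"
    CONTAINER=$(./udocker create "$IMAGE")

    # Run the application inside the container, mounting the job working directory
    ./udocker run -v "$PWD":/data "$CONTAINER" /opt/run_disvis.sh /data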
2.2 Information System

In order to introduce the concept of accelerator device in a Computing Service, an enhancement of the GLUE 2.0 schema is required. The proposed changes, as encompassed in the GLUE 2.1 draft, consist of:
• The definition of a new GLUE object, the Accelerator Environment, representing a set of homogeneous accelerator devices.
• New attributes and associations for existing objects such as Computing Manager, Computing Share and Execution Environment.

Two reasons lead to considering a new GLUE object instead of extending the Execution Environment described in GLUE 2.0 [13]:
• Many accelerator types, each one with its own specific features, can be installed in or linked to an environment. The outcome of extending the environment with many complex attributes would have been cumbersome and very far from the inner meaning of the GLUE specification.
• A many-to-many relationship between accelerator and environment must be taken into consideration in order to support external GPU appliances. In that case the Accelerator Environment is completely decoupled from the Execution Environment, not only from the logical point of view but also from the physical one.

An Accelerator Environment reports the number of physical accelerators, the amount of memory available and any detail about the hardware, such as the vendor, the model and the clock speed. The information published with an Accelerator Environment object must be considered statically defined: it changes only if a hardware restructuring occurs.

The main concept behind the newly defined attributes in Computing Share and Computing Manager is the accelerator slot. It represents the minimum amount of GPU resource that can be allocated to a job. Since a GPU can be efficiently shared among multiple processes, the definition of accelerator slot may be quite complex. In accordance with the meaning of the related JDL attribute, GPUNumber, previously described, the accelerator device is considered to work in "exclusive execution mode": each process has the control of an entire accelerator device. The information published with the new attributes of Computing Share and Computing Manager can vary according to the resource usage, for example the number of free accelerator slots for a Computing Share, or the runtime configuration of a batch system, such as the total number of accelerator slots for a Computing Manager.

The information about the accelerator driver, or generally any related software tool, can be published through the Application Environment object, as declared in the GLUE 2.0 schema. It is not necessary to modify the definition of Application Environment in order to associate it with one or more Accelerator Environments: since there is a many-to-many association between the Application Environment and the Execution Environment, the latter can be used as a bridge for expressing the relationship between the driver and the device.

The conceptual model of the Grid Computing Service with the accelerator information being defined in the GLUE 2.1 draft is shown in Figure 5.

Figure 5: UML diagram of the GLUE 2.1 draft Computing Entities. In red, the classes involved in the description of the accelerators.
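As an illustrative discovery example: once these objects are published through the LDAP-based information system described below, users can already query the GLUE 2.0 rendering of a site with a command of the following kind (the endpoint is hypothetical; the accelerator-related classes and attributes of the 2.1 draft would appear alongside the existing ones once the new information providers are deployed, with names fixed by the final specification):

    $ ldapsearch -x -LLL -h site-bdii.example.org -p 2170 -b "GLUE2GroupID=grid,o=glue" \
        '(objectClass=GLUE2ExecutionEnvironment)' \
        GLUE2ExecutionEnvironmentCPUModel GLUE2ExecutionEnvironmentMainMemorySize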
The framework adopted by the EGI HTC platform for managing the information system is the Berkeley Database Information Index (BDII) [14]. The BDII is structured as a hierarchy of nodes, with the "top BDII" at the root level, referencing all the information provided by the other nodes, and the "resource BDII" nodes at the bottom, collecting data from the grid resources. Information gathering by the resource BDII is carried out in two different ways:
• Running a set of executables, called information providers, and catching their output.
• Reading statically defined descriptions of resources.

The Accelerator Environment is published by the resource BDII using a statically defined description. For a CREAM CE node the description is built by the deployment and configuration system, such as the Puppet suite. Figure 6 shows an excerpt from the Puppet (hiera) configuration file which describes an accelerator device installed in a worker node:

    creamce::hardware_table :
      environment_001 : {
        ce_cpu_model : XEON,
        ce_cpu_speed : 2500,
        # other definitions for the worker node
        accelerators : {
          acc_device_001 : {
            type : GPU,
            log_acc : 2,
            phys_acc : 2,
            vendor : NVIDIA,
            model : "Tesla K20m",
            memory : 5119
          }
        }
      }

Figure 6: Excerpt of a Puppet configuration file for a GPU installed in a worker node.

Every attribute related to the accelerator slots and published within the Computing Manager and Computing Share objects is instead calculated by an information provider executed by the resource BDII. Many batch systems integrate support for information retrieval from popular accelerator devices like NVIDIA GPU devices. The TORQUE management system, for example, can identify the structure of the NVIDIA devices installed in the worker nodes, and many parameters reporting the quality of service, GPU and memory usage. However, in many cases this support is not able to give complete control over all the required details of an accelerator device. Besides, it is advisable to avoid the development of many information providers, each one batch system specific, if a common solution can be identified. For NVIDIA devices the common solution consists of an information provider which is able to:
• Run the NVIDIA System Management Interface (nvidia-smi) on each active worker node.
• Aggregate the harvested data according to the GLUE 2.1 schema.

Since the executable does not depend on any batch system specific feature, the solution is portable. The main drawback is that it requires a further configuration of both the resource BDII node and each worker node: any nvidia-smi command is run through a secure channel, established with openssh tools and with normal user privileges, therefore accounts and public keys must be distributed over the cluster.
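A minimal sketch of such a provider is shown below (worker node names are placeholders; the final rendering into GLUE 2.1 objects is only hinted at in the last comment, since it depends on the draft attribute names):

    #!/bin/bash
    # Collect GPU model, total memory and utilization from every active worker node
    for wn in wn01.example.org wn02.example.org; do
        ssh -o BatchMode=yes "$wn" \
            nvidia-smi --query-gpu=name,memory.total,utilization.gpu --format=csv,noheader
    done | sort | uniq -c
    # The aggregated output is then converted into the Computing Manager/Share
    # accelerator attributes (e.g. total and free accelerator slots) and printed as LDIF.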
2.3 Accounting

APEL (Accounting Processor for Event Logs) [15] is the fundamental tool of the CPU usage accounting infrastructure deployed within EGI. As a log processing application, it interprets the logs of the CREAM CE and of its underlying LRMS to produce CPU job accounting records identified with grid user identities. APEL then publishes the accounting records into a centralised repository at a Grid Operations Centre (GOC) for access from the EGI Accounting Portal. The functions of log file parsing, record generation and publication are implemented by the APEL Parser, APEL Core and APEL Publisher components respectively.

The APEL developers were therefore involved in the GPU accounting discussion. Their analysis concluded that little effort would have been required to modify the APEL Parser and Core functions if the LRMS were able to report the GPU usage attributable to each job in their log files. Unfortunately, a detailed analysis of the job accounting records reported in the log files of TORQUE, LSF, Slurm, HTCondor and SGE showed that they do not contain GPU usage information. For the most recent NVIDIA GPUs, the NVIDIA Management Library (NVML) allows, through the NVIDIA System Management Interface (nvidia-smi) tool, to enable per-process accounting of GPU usage based on the Linux PID, as shown in the example output of Figure 7. Regrettably, this functionality is not yet integrated in any of the considered LRMS. In principle, developing suitable prologue/epilogue scripts for any given LRMS would allow to implement per-job accounting starting from the per-process GPU accounting provided by NVML. However, the estimated effort for developing and sustaining such a solution in the long term was considered not affordable by the APEL team with their current and planned funds. The accounting of GPU usage in the EGI HTC platform was therefore temporarily abandoned, pending its possible native support in future versions of the considered LRMS.

Figure 7: Per-process GPU utilization accounting example using the nvidia-smi tool.
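For reference, the per-process accounting shown in Figure 7 relies on the accounting mode of nvidia-smi; an illustrative sequence is reported below (the exact set of queryable fields depends on the driver version, see nvidia-smi --help-query-accounted-apps):

    # Enable per-process accounting on the GPUs (root privileges required)
    $ nvidia-smi -am 1

    # Query the accounted processes: PID, GPU and memory utilization, run time
    $ nvidia-smi --query-accounted-apps=pid,gpu_utilization,mem_utilization,max_memory_usage,time \
                 --format=csv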
3. The Cloud Accelerated Platform

KVM with PCI pass-through virtualisation technology for GPUs is rather mature, but a maximum of one VM can be attached to one physical card, as shown on the left of Figure 8. Other virtualisation technologies that allow to share the physical GPUs among many virtual machines are available, e.g. NVIDIA GRID vGPU for XenServer and VMWare hypervisors, AMD MxGPU based on SR-IOV for VMWare hypervisors, and Intel GVT-G recently added to the 4.10 kernel. However, these are not yet supported by KVM, the most popular open source hypervisor, deployed in the large majority of the cloud resource centres that are part of the EGI Federated Cloud.

Figure 8: PCI pass-through virtualisation of GPUs (left) versus sharing virtual GPU technology (right).

3.1 Accelerated computing with OpenStack

The first testbed supporting GPU cards was integrated into EGI at IISAS as the "IISAS-GPUCloud" site, which comprises four IBM dx360 M4 servers, each with 2x Intel Xeon E5-2650v2 (16 CPU cores), 64GB RAM, 1TB storage and 2x NVIDIA Tesla K20m, running Ubuntu 14.04 LTS. As mentioned above, the technology for sharing virtual GPUs is not supported by KVM for NVIDIA Tesla cards, therefore PCI pass-through has been used. OpenStack has built-in support for PCI pass-through since the Grizzly release. That includes the following changes in OpenStack Nova:
• Implementation of support for the PCI pass-through mechanism on compute nodes, for configuring and creating VMs with PCI pass-through enabled. The PCI devices available for pass-through are defined in the OpenStack Nova configuration file, e.g. as "pci_pass-through_whitelist = { "vendor_id": "8086", "product_id": "10fb" }", and the corresponding alias is assigned to VM flavours as a property.
• Implementation of a scheduler supporting PCI pass-through, to correctly allocate VMs with PCI pass-through enabled to compute nodes with available resources. With the support of the scheduler, VMs with PCI pass-through can be mixed with normal VMs on the same OpenStack deployment.
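As an illustration of the Nova configuration involved, the following sketch whitelists and aliases an NVIDIA card and attaches one pass-through GPU to a flavour (the vendor/product IDs must match the actual devices of the site, and the flavour name is arbitrary):

    # /etc/nova/nova.conf on the compute and controller nodes
    pci_passthrough_whitelist = { "vendor_id": "10de", "product_id": "1028" }
    pci_alias = { "vendor_id": "10de", "product_id": "1028", "name": "K20m" }

    # Flavour requesting one pass-through GPU through the alias
    $ openstack flavor create gpu.medium --vcpus 8 --ram 16384 --disk 40
    $ openstack flavor set gpu.medium --property "pci_passthrough:alias"="K20m:1"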
The initial setup demonstrated that KVM PCI pass-through was not stable: compute nodes crashed randomly. The VMs had direct access to the PCI devices and could send NMI (non-maskable interrupt) signals to the hardware, which were propagated to the host machine, so that every misconfiguration caused a system crash. The workaround solution is to change the BIOS setting for isolating the NMI of the pass-through device. However, a more consistent solution, where the host can retain control over the hardware, is critical for the overall stability of the system. A viable alternative is to use the LXD [16] container hypervisor as a replacement for the KVM hypervisor, as discussed in subsection 3.3.

3.2 Accelerated computing with OpenNebula

Since version 4.14 the OpenNebula cloud computing framework supports PCI pass-through virtualisation, including GPU accelerators. By default, PCI devices can be specified in OS templates using vendor, device and class PCI codes. This approach requires the system administrator to define special templates for every combination of virtual appliance images and GPUs provided by the worker nodes, and it gets complicated if the site provides more than one type of accelerator and worker nodes have more than one accelerator installed. Also, the integration of GPU accelerators in the OpenStack cloud computing framework follows an approach where GPU accelerators are requested using resource templates, therefore for compatibility reasons the rOCCI server for OpenNebula had to be modified by its developers to allow the specification of the PCI codes of GPU devices inside resource templates. To test the integration of the OpenNebula framework in the EGI Federated Cloud, an experimental configuration was set up at the CESNET-Metacloud site in May 2016, and a new IISAS-Nebula cloud site was set up and put into production in January 2017.
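For illustration, a PCI pass-through request in an OpenNebula VM template has the following form (the vendor, device and class codes are examples for an NVIDIA Tesla card and must match the devices actually installed on the hosts):

    # Fragment of an OpenNebula VM template requesting one GPU by PCI codes
    PCI = [
      VENDOR = "10de",   # NVIDIA
      DEVICE = "1028",   # e.g. Tesla K20m
      CLASS  = "0302"    # 3D controller
    ]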

3.3 Accelerated computing with LXD containers

In recent years, container technologies have been emerging as a viable alternative to full virtualisation for cloud computing. From the point of view of applications running in the cloud, containers like LXD behave very much like full VMs: users can create and log into containers, install and configure software, run applications and terminate the containers when they are no longer needed. In comparison with full VMs managed by hypervisors like KVM, containers have a much faster start, lower overheads, and easier access to hardware resources via device mapping. The main limitation of container technologies is that the containers must share the same OS kernel as the hosting machine, and users cannot easily change kernel configurations (e.g. loading kernel modules). For accelerated computing use cases, where VMs are mostly used for computational tasks, such limitations are not relevant.

There are several implementations of Nova drivers for container hypervisors in OpenStack; however, most of them were still immature at the time of testing (2016). An experimental cloud site has been set up at IISAS to enable GPU support with the LXD hypervisor in OpenStack through the nova-lxd driver. Unfortunately, the support for configuring GPU device mapping in the Nova configuration files was not yet implemented in the tested version of nova-lxd, thus the devices had to be hard-coded in the code for testing. The LXD containers were outperforming the KVM VMs in both stability and performance, practically reaching native performance, which is very promising. However, some critical features, mainly the support for block devices, were still missing, and this has so far prevented the use of the experimental site in production.

3.4 Accelerated computing with Docker containers

Docker is another widely adopted container technology. Unlike LXD, which is a full OS-level container technology and behaves more like a hypervisor, Docker containers are application containers whose aim is to run a single application process inside. Docker wraps up the application software in a complete file system that contains everything it needs to run: code, runtime, system tools, system libraries; therefore the application in Docker is highly portable and independent from the hosting environment.

Docker containers with GPU support can be easily executed in cloud sites supporting VMs with GPUs. Users can create a GPU-enabled flavour and image, and run docker with proper access to the GPUs via the device mapping option, as for example:

    $ docker run -it --device=/dev/nvidia0:/dev/nvidia0 \
                 --device=/dev/nvidiactl:/dev/nvidiactl \
                 --device=/dev/nvidia-uvm:/dev/nvidia-uvm IMAGE /bin/bash

3.5 Integration with EGI federated cloud

Like for the HTC Accelerated platform, the integration with the EGI Federated Cloud requires additional support for GPUs in the Information and Accounting systems.

The conceptual model for a generic Cloud Provider Site does not fit correctly into the GLUE 2.0 schema: the concepts of virtual machines, images and flavours are brand new from the logical point of view. Extending the objects of a classical Computing Service is not enough, and it is necessary to redesign the entire schema and create a Cloud Compute Service. The conceptual model of the Cloud Compute Service is being defined in the GLUE 2.1 draft specification, as depicted in Figure 9. There is a strong similarity between the classes of a Computing Service and those of a Cloud Compute Service. The Cloud Compute Instance Type, or flavour, describes the hardware environment of the VM, like the Execution Environment does within a Computing Service. The Cloud Compute Manager, or hypervisor, corresponds to the Computing Manager, or batch system manager, in the classical GLUE 2.0 schema.

Figure 9: UML diagram of the GLUE 2.1 draft Cloud Computing Entities. In light red the classes introduced with GLUE 2.1, in dark red the ones also involved in the description of the accelerators.

In order to keep the similarity even for modelling "virtual accelerator devices", a new entity, the Cloud Compute Virtual Accelerator, has been defined. It describes a set of homogeneous virtual accelerator devices in terms of number of GPUs, amount of memory and hardware details, like vendor and model. A many-to-many association ties the Cloud Compute Instance Type to the Cloud Compute Virtual Accelerator. Since with PCI pass-through virtualisation only one VM at a time can be attached to one or more physical devices, it does not make sense to have the concept of accelerator slot within the Cloud Compute Service: an accelerator slot corresponds to an entire device.

GPU accounting is also easier in the cloud environment. Considering that the cloud frameworks currently return wall clock time only to the APEL based accounting system, if the wall clock time for how long a GPU was attached to a VM were available, then the GPU reporting would be in line with cloud CPU reporting, i.e. wall clock time only. The APEL team has therefore been involved to define an extended usage record and new views to display GPU usage in the EGI Accounting Portal.
To facilitate the testing and development of accelerated applications in EGI, a dedicated Virtual Organization "acc-comp.egi.eu" has been established. A dedicated VO allows to create a separate set of virtual appliance images in the EGI Application Database [17], with pre-installed drivers and libraries (e.g. CUDA, OpenCL), and allows to limit the list of sites to the ones that provide the relevant hardware. Furthermore, the separate VO allows users to share and exchange their knowledge and experiences related to accelerated computing more easily than general-purpose VOs like fedcloud.egi.eu. More information can be found in [18].

4. Applications

As pointed out in Section 1, two EGI scientific communities, organized respectively in the MoBrain (Structural Biology) and LifeWatch (Biodiversity and Ecosystems) Competence Centres, were the early adopters of the EGI Accelerated Platform, setting the requirements and providing a number of use cases exploiting the GPU hardware to enhance the performance of their calculations. These applications are described in the next subsections.

4.1 Molecular dynamics

The Structural Biology user community participating in the MoBrain Competence Centre provided several applications designed in a highly parallel manner to profit from the GPU architecture. Some of them, i.e. DisVis, implementing visualisation and quantification of the accessible interaction space of distance restrained binary bio-molecular complexes, and PowerFit, implementing automatic rigid body fitting of bio-molecular structures in Cryo-Electron Microscopy densities, have already been described in a recent publication [19] and will not be reported here. Molecular Dynamics (MD) simulations with the GROMACS and NAMD packages exploiting GPU resources were carried out by the MolDynGrid Virtual Laboratory [20] team, from the National Academy of Sciences of Ukraine, and by the CESNET team of MoBrain. MD simulations with the PMEMD tool of the AMBER package (ambermd.org) were carried out by the CIRMMP team, who have a long-standing collaboration with EGI, and will be described in more detail here. CIRMMP benchmarked two applications of MD simulations that are often performed with the AMBER package:
1. Unrestrained, also called free, MD simulations.
2. Restrained MD (rMD), i.e. including experimental data that define the conformational space available to the molecule.

The main application of rMD is the optimization of 3D structural models generated by techniques such as distance geometry or extensive molecular modelling; optimized models can then be deposited to the Protein Data Bank (PDB) [21]. Free MD simulations instead have an extensive range of biological applications. Besides their specific research purpose, the two types of simulations differ in the input data required: for free MD only the atomic coordinates of the initial 3D structural model of the biological system of interest are needed, while for rMD simulations the experimental data are required as an additional input, in a suitable format.
rMD simulations are enabled by the AMPS-NMR portal, which was developed in the context of the WeNMR project [22, 23]. For the rMD calculation a standardized protocol was applied to a small protein with publicly available experimental data (PDB entry 2KT0, DOI: 10.2210/pdb2kt0/pdb), consisting of a constant-time simulation where the solvated protein is initially heated, then allowed to move for a predefined number of steps and then cooled again to 0 K. The acceleration factor under this real case scenario, as shown in Figure 10, is about 100. Note that rMD simulations do not scale well with the number of cores because of the way the restraint potential is coded in the software. For the AMBER code, the GPU server does not affect the CPU performance. The agreement between the experimental restraint data and the refined structural models in the two simulations, as measured by the residual deviations of the back-calculated data with respect to the input, was essentially identical. This proves that the use of GPUs does not introduce numerical errors due to single precision. The GPU-enabled service is already available to users via the AMPS-NMR portal. A user commented: "It was very fast. About 40 minutes for 20 structures!". Typical run times in the traditional CPU-based grid were of the order of 10 hours.

Figure 10: Calculation times for rMD simulations for PDB entry 2KT0 on a single-core CPU (AMD Opteron 6366-HE) vs one GPU card (NVIDIA Tesla K20). Note the logarithmic scale of the y axis.

To benchmark unrestrained (free) MD simulations we used a more challenging macromolecular system than the one used for rMD, owing to its large total mass. The system of choice was the M homopolymer ferritin from bullfrog (PDB entry 4DAS, DOI: 10.2210/pdb4das/pdb). This system is composed of 24 identical protein chains forming a single macromolecular adduct, with an overall protein mass of 499,008 Da (with respect to the 10,051 Da of 2KT0). After solvation, this corresponds to ca. 180,000 atoms in the system vs 1,440, so about a factor 125 in size. The physiological role of ferritin is to store iron ions and release them when needed by the cell. Molecular systems of the size of ferritin are not amenable to routine structural studies by NMR, and therefore their dynamic properties can be explored only by MD simulations. In practice, the current setup constituted the basis for an extensive investigation of the molecular mechanism of iron release, which will be published elsewhere. The acceleration afforded by the GPU in our benchmark is depicted in Figure 11.

Figure 11: Extended comparison of the performance achieved using single/multi-core CPUs (Intel Xeon CPU E5-2620 and AMD Opteron 6366-HE) vs one GPU card (NVIDIA Tesla K20). The graph reports the nanoseconds of simulation that can be computed in one day as a function of the number of cores used on each CPU type.

It appears that the scaling with the number of cores is actually less than optimal on either CPU type. For the Intel CPU, the simulation performance (ns/day of simulation) as a function of the number of cores used grows as 1:5.1:9.6, against a ratio of 1:6:12 cores; for the AMD CPU the corresponding data are 1:11.2:32, against a ratio of 1:16:64 cores (Table 1, compare the first and third columns). Thus, using a 64-core system constituted by four AMD Opteron 6366-HE CPUs (1.8GHz, 16C, L2 16MB/L3 16MB cache, 32x4GB RDIMM LV dual rank memory) on a single blade provides only about 50% of the expected increase in simulation length.
Table 1: Benchmark results for AMBER on unrestrained MD simulations for the ferritin system. The table reports the performance achieved using only single/multi-core CPUs (Intel Xeon CPU E5-2620 and AMD Opteron 6366-HE) vs the same cores plus one GPU card (NVIDIA Tesla K20m), measured by the nanoseconds of simulation that can be computed in one day, as well as the scale factor resulting from the use of multiple cores vs. a single core and the acceleration provided by the GPU card for the various hardware configurations.

    CPU                   Number of cores   Simulation performance   Ratio vs.      GPU
                                            (ns/day)                 single-core    acceleration
    Intel Xeon E5-2620            1             0.08                     1.0           93.70x
    Intel Xeon E5-2620            6             0.41                     5.1           18.30x
    Intel Xeon E5-2620           12             0.77                     9.6            9.74x
    AMD Opteron 6366-HE           1             0.05                     1.0          150.00x
    AMD Opteron 6366-HE           4             0.18                     3.6           41.70x
    AMD Opteron 6366-HE           8             0.30                     6.0           25.00x
    AMD Opteron 6366-HE          16             0.56                    11.2           13.40x
    AMD Opteron 6366-HE          64             1.60                    32.0            4.69x

The acceleration provided by a single GPU card with respect to the full 64-core system is 4.7 (Table 1, last column). With the above configuration we could thus compute multiple 100-ns simulations (reproducing different experimental conditions) of human ferritin, in less than two weeks each; with a traditional system, each calculation would have taken about two months. The energy consumed for one ns of simulation was also measured, by monitoring the power with the iDRAC utility for the 64-core AMD Opteron system and for the Intel Xeon system accelerated with the GPU. The difference between the active power and the idle power of both systems (generally known as "dynamic power") was calculated; the dynamic power was considered because the two systems had different base configurations, like power supplies and CPUs. It turned out that the electric cost of one ns of simulation with the GPU is only 8% of the cost when using the full AMD 64-core system.

4.2 Biodiversity

In March 2017 the European Commission granted the legal status of European Research Infrastructure Consortium (ERIC) to LifeWatch: the e-Science and Technology European Infrastructure for Biodiversity and Ecosystem Research [24]. It aims to advance biodiversity and ecosystem research and to provide major contributions to address big environmental challenges such as climate change, by providing access through a pan-European distributed e-infrastructure to large sets of data, services and tools that enable the creation of virtual laboratories and decision-support applications. Within the EGI-Engage project the LifeWatch community, organized as a Competence Centre, explored the integration and deployment of assisted pattern recognition tools on EGI resources, including servers with GPUs on the cloud. After a state-of-the-art survey of computer vision and deep learning tools, the Caffe framework was selected to build a prototype of a classification service for flower species through image recognition. Caffe is a deep learning framework using Convolutional Neural Networks and can be deployed on the cloud. The prototype provides two services: "train" and "classify". The first one is a compute intensive service allowing to train a new model; its performance can be enhanced using GPUs supporting CUDA. The speed-up using GPU versus CPU is a factor 9 using e.g. the AlexNet model, while with the most recent NVIDIA Kepler or Maxwell GPU architectures the use of the cuDNN v3 library allows to achieve a factor 32. The second service allows to classify user provided images according to the trained model; it is an I/O bounded task that can be run efficiently without GPUs. Starting from the Caffe framework, the LifeWatch team developed a demonstrator service to help users train a web based image classifier.
The demonstrator implements a two step process: at first, the Neural Network is trained with the image datasets and tags provided by the user; then, the trained Neural Network is connected with the web service offering an API to be used with the provided web interface and android app. The user can e.g. upload a flower image from his smart-phone through the demonstrator android app, and the returned result is a ranked list of the potential Latin names of the flower with their accuracy. The service has been tested and validated on the EGI Federated Cloud site of IISAS hosting the GPU servers, using several datasets, including images from the Portuguese flora and from the Real Jardin Botanico of Madrid, as described in detail in [25].

5. Conclusions and future work

The CREAM GPU-enabled prototype was tested at four sites hosting the following LRMS: TORQUE, LSF, HTCondor, Slurm and SGE. Three new JDL attributes were defined: GPUNumber, GPUModel, MICNumber. At three sites the prototype is run in "production": QMUL and ARNES (with Slurm LRMS) and CIRMMP (with TORQUE/Maui LRMS/scheduler). A number of new classes and attributes describing accelerators were proposed and included in the GLUE 2.1 draft after discussion with the OGF Working Group. A major release of CREAM is almost ready, with the GPU/MIC support for the LRMS listed above and with the information system providers implemented as prototypes according to the GLUE 2.1 draft. The official approval of the GLUE 2.1 specification is expected to occur after the prototype is revised based on the lessons learned. The major release of CREAM will support the CentOS 7 operating system, in order to be included in the UMD-4 EGI distribution.

For what concerns the Cloud Accelerated platform, three cloud sites hosting NVIDIA GPUs were put in production using PCI pass-through virtualisation with the KVM hypervisor, and integrated in the EGI Federated Cloud. An experimental cloud site was set up at IISAS to enable GPU support with the LXD hypervisor in OpenStack. LXD is a full container solution supported by Linux that is expected to provide better performance and stability than KVM (faster start-up time, better integration with the OS), especially in terms of GPU support (simpler site setup, more stable than KVM PCI pass-through).

Acknowledgments

This work is supported by the European Union's Horizon 2020 Framework Programme e-Infrastructure grants (EGI-Engage, Grant No. 654142; INDIGO-DataCloud, Grant No. 653549; West-Life, Grant No. 675858). Andrea Sartirana (GRIF/LLR), Daniel Traynor (QMUL), Barbara Krasovec (ARNES) and Stefano Dal Pra (INFN-CNAF) are acknowledged for having supported the deployment of the GPU/MIC enabled CREAM CE prototype on their resource centres. Boris Parak (CESNET) is acknowledged for having implemented the PCI pass-through support for GPUs on the OpenNebula based CESNET-Metacloud site of the EGI Federated Cloud.

References

[1] J. Walsh, Accelerated computing on computational grid infrastructures, thesis, School of Computer Science & Statistics, Trinity College Dublin, Ireland, 2016, pp 187.

[2] S. Andreozzi et al., Towards GLUE 2: evolution of the computing element information model, Journal of Physics: Conference Series 119 (2008) 062009.

[3] P. Andreetto et al., Status and Developments of the CREAM Computing Element Service, Journal of Physics: Conference Series 331 (2011) 062024.

[4] A. Casajus et al., DIRAC Pilot Framework and the DIRAC Workload Management System, Journal of Physics: Conference Series 219 (2010) 062049.

[5] F. Giacomini et al., The gLite Workload Management System, Journal of Physics: Conference Series 119 (2008) 062007.

[6] CREAM JDL guide, https://wiki.italiangrid.it/twiki/bin/view/CREAM/JdlGuide

[7] M. Mezzadri et al., Job submission and control on a generic batch system: the BLAH experience, Journal of Physics: Conference Series 331 (2011) 062039.

[8] Maui scheduler, http://www.adaptivecomputing.com/products/open-source/maui/

[9] GPU/MIC support in HTCondor, https://htcondor-wiki.cs.wisc.edu/index.cgi/wiki?p=HowToManageGpus

[10] GPU/MIC support in LSF, https://www.ibm.com/support/knowledgecenter/it/SSETD4_9.1.2/lsf_admin/define_gpu_mic_resources.html

[11] GPU/MIC support in Slurm, https://slurm.schedmd.com/gres.html

[12] D. Salomoni et al., INDIGO-DataCloud: foundations and architectural description of a Platform as a Service oriented to scientific computing, arXiv:1603.09536v3 [cs.SE].

[13] S. Andreozzi, S. Burke, L. Field, G. Galang, B. Konya, M. Litmaath, P. Millar, J. P. Navarro, GLUE Specification Version 2.0, OGF Document Series, GFD.147, 2009.

[14] L. Field and M. W. Schulz, Grid deployment experiences: The path to a production quality ldap based grid information system, Proc. Int. Conf. on Computing in High Energy and Nuclear Physics (2004).

[15] Ming Jiang et al., An APEL Tool Based CPU Usage Accounting Infrastructure for Large Scale Data Driven e-Science Computing Grids, Springer New York, 2011, pp. 175-186.

[16] Gupta and Gera, A Comparison of LXD, Docker and Virtual Machine, International Journal of Scientific & Engineering Research, Volume 7, Issue 9, September 2016.

[17] EGI Applications Database (AppDB), https://appdb.egi.eu/

[18] EGI Accelerated computing Virtual Organization, https://wiki.egi.eu/wiki/Accelerated_computing_VO

[19] G.C.P. van Zundert et al., The DisVis and PowerFit Web Servers: Explorative and Integrative Modeling of Biomolecular Complexes, Journal of Molecular Biology, 2017, 429(3), pp. 399-407.

[20] MolDynGrid Virtual Laboratory, http://moldyngrid.org

[21] Berman H, Henrick K, Nakamura H, Markley JL, The worldwide Protein Data Bank (wwPDB): ensuring a single, uniform archive of PDB data, Nucleic Acids Res., 2007 Jan, 35(Database issue):D301-3.

[22] Wassenaar et al., WeNMR: Structural Biology on the Grid, Journal of Grid Computing, 2012, 10:743-767.

[23] Bertini I, Case DA, Ferella L, Giachetti A, Rosato A, A Grid-enabled web portal for NMR structure refinement with AMBER, Bioinformatics, 2011 Sep 1, 27(17):2384-90.

[24] A. Basset, W. Los, Biodiversity e-Science: LifeWatch, the European infrastructure on biodiversity and ecosystem research, Plant Biosystems - An International Journal Dealing with all Aspects of Plant Biology, 146:4, 780-782, 2012.

[25] Francisco Sanz, Eduardo Lostal, Jesus Marco, Integration of assisted pattern recognition tools, https://documents.egi.eu/document/2647