D3.2 TANGO Toolbox – Alpha Version Scientific Report Version: v1.2 – Final, Date: 23/12/2016

Transparent heterogeneous hardware Architecture deployment for eNergy Gain in Operation D3.2 TANGO Toolbox – Scientific Report Alpha version (year-1)

Lead Editor: Jean-Christophe DEPREZ (CETIC)
Authors: Jean-Christophe DEPREZ (CETIC), Lotfi Guedria (CETIC), Renaud De Landtsheer (CETIC), David Garcia Perez (ATOS), Roi Sucasas Font (ATOS), Richard Kavanagh (ULE), Jorge Ejarque (BSC), Yiannis Georgiou (BULL)
Version: 1.2
Reviewers: Bruno Wery (DELTATEC), Yiannis Georgiou (BULL)
Work package: WP3
Due date: 31/12/2016
Submission date: 23/12/2016
Distribution level (CO, PU): PU – Report


Abstract

This deliverable D3.2 presents the progress of the TANGO project at the end of its first year. The scientific contributions of several TANGO tools and components are provided, notably for the energy modeller, the design-time optimiser, the COMPSs/OMPSs programming model integration and the code optimiser. Furthermore, two testbeds are made available, a first one with CPU, GPU and Xeon Phi nodes and another one with CPU and FPGA, to conduct experimental runs and collect the time and energy performance of different small computations: matrix multiplication, Hydro and an NBody simulation.

Keywords: Heterogeneous hardware, time and energy profiling, benchmarking

Licensing information: This work is licensed under Creative Commons Attribution-ShareAlike 3.0 Unported (CC BY-SA 3.0) http://creativecommons.org/licenses/by-sa/3.0/

Document Description

Document Revision History

Modifications Introduced

Version | Date | Description of change | Modified by
v0.1 | 2016/10/28 | First draft with initial table of contents | Jean-Christophe DEPREZ (CETIC)
v0.2 | 2016/12/11 | Initial integration with Section 3 content | ULE, CETIC, BSC
v0.3 | 2016/12/13 | Section 1 and 2 | CETIC
v0.4 | 2016/12/14 | Initial input for Section 4, Conclusion, Exec summary | CETIC
v0.5 | 2016/12/16 | Integrate input Hydro measurements on Nova 2 and acronym table | CETIC, BULL
v0.6 | 2016/12/18 | Integration of initial set of Bruno's comments and integration of input on FPGA Matrix multiplication and NBody | Jean-Christophe DEPREZ (CETIC)
v0.7 | 2016/12/20 | Integration of Bruno's comments by BSC and CETIC and integration of Nova 2 Benchmarking for Matrix multiplication | CETIC, BSC
v1.0 | 2016/12/21 | Integration of modifications to handle comments on Energy Modeller | ULE and Jean-Christophe DEPREZ (CETIC)
v1.1 | 2016/12/21 | Integration of modifications to handle Bruno's comments on Section 2 and 4 | Jean-Christophe DEPREZ (CETIC)
v1.2 | 2016/12/22 | Updates to address comments from Yiannis, Bruno and Karim | BSC, ULE and Jean-Christophe DEPREZ (CETIC)


Table of Contents

Table of Figures
Table of Tables
Terms and abbreviations
Executive Summary
1 Introduction
  1.1 About this deliverable
  1.2 Document structure
2 TANGO Architecture – Year 1 Implementation and Usage Scenarios
  2.1 WP3 Benchmarking Vision – Profiling Time and Energy for Design-time Implementation Decisions
  2.2 TANGO Component Integration for Benchmarking/Profiling on Bull's Nova 2
  2.3 TANGO Component Integration for Benchmarking/Profiling on CETIC's FPGA-augmented server
3 Specific Scientific Contributions
  3.1 Energy Modelling
    3.1.1 Motivation and purpose
    3.1.2 Related Work
    3.1.3 Scientific Contribution
    3.1.4 Evaluation
    3.1.5 Conclusion and Future Work
  3.2 Design-Time and Deployment-time Optimiser – Placer
    3.2.1 Input problem
      3.2.1.1 Hardware model
      3.2.1.2 Software model
      3.2.1.3 Objective function: energy model
      3.2.1.4 Constraints
    3.2.2 Output: a placement
    3.2.3 Implementation of Placer
      3.2.3.1 About the underlying technology
      3.2.3.2 Declaring the placement problem to the solver
      3.2.3.3 Declaring a processor
      3.2.3.4 Declaring an atomic process
      3.2.3.5 Declaring a bus
      3.2.3.6 Declaring a flow between two atomic processes
    3.2.4 Related work
    3.2.5 Conclusion
  3.3 Programming Model – COMPSs/OMPSs Integration
    3.3.1 Motivation and purpose
    3.3.2 Related Work
    3.3.3 Scientific Contribution
    3.3.4 Evaluation
    3.3.5 Conclusion and Future Work
  3.4 Code Optimiser Plugin for Energy Profiling of Java
    3.4.1 Motivation and purpose
    3.4.2 Related Works
    3.4.3 Scientific Contribution
    3.4.4 Evaluation
    3.4.5 Conclusion and Future Work
4 Benchmarking Mini Apps on Time and Energy performance
  4.1 Matrix Multiplication
    4.1.1 Benchmarking TANGO C/OMPSs Programming Model of Matrix Multiplication on Nova 2
      4.1.1.1 Benchmarking Context
      4.1.1.2 Benchmark results
      4.1.1.3 Conclusion
    4.1.2 Benchmarking C/Poroto Matrix Multiplication on small FPGA-based System
      4.1.2.1 Benchmarking Context
      4.1.2.2 Benchmark results
      4.1.2.3 Discussion
      4.1.2.4 Conclusion
  4.2 Benchmarking Hydro and NBody mini Applications
    4.2.1 Benchmarking on Hydro MPI Implementation on Nova 2
      4.2.1.1 Benchmarking Context
      4.2.1.2 Compilation details
    4.2.2 Execution Settings
      4.2.2.1 Benchmark results
      4.2.2.2 Discussion
      4.2.2.3 Conclusion
    4.2.3 Benchmarking NBody C/Poroto Implementation on FPGA-augmented dual-Core server
      4.2.3.1 Benchmarking Context
      4.2.3.2 Benchmark results
      4.2.3.3 Discussion
      4.2.3.4 Conclusion
5 Conclusions
6 References
APPENDIX: Runtime Overhead of GPU Energy Monitoring Probe
APPENDIX: Evaluation of Energy Modeller's Accuracy by Inducing Load


Table of Figures

Figure 1: General TANGO architecture with generic component names
Figure 2: Overview of the TANGO architecture instance on Nova 2 at Bull
Figure 3: Overview of the TANGO architecture available on the FPGA-based testbed at CETIC
Figure 4: Outline of data flows in the Energy Modeller
Figure 5: Calibration data gathered on Nova2 - node nd32
Figure 6: Trace of IPMI and Watt meter measurements with incrementing CPU load
Figure 7: Trace of Watt meter and temperature measurements with incrementing CPU load
Figure 8: CPU load vs power and energy consumption
Figure 9: CPU load vs energy consumption adjusting to compensate for idle host power consumption
Figure 10: CPU load vs power and energy consumption – compensating for inaccuracies in IPMI measured values
Figure 11: Matrix multiplication main code
Figure 12: Matrix multiplication coarse-grain interface file
Figure 13: Implementation of the coarse-grain task for the different devices
Figure 14: COP power graph screenshot
Figure 15: Comparison of the execution times and energy consumptions for different configurations and gains
Figure 16: On the left, ADM-XRC-6T1: XMC board format plugged on the ADC-XMC-II carrier board; on the right, ADC-XMC-II: carrier card for PCI Express based systems, allowing direct communication between two XMC boards
Figure 17: ML605 FPGA board
Figure 18: RIFFA based system architecture on the ML605 board
Figure 19: Testing machine and its power measurement equipment
Figure 20: Hydro execution time and energy measurements on Nova 2
Figure 21: Hydro CPU usage on 2 nodes on Nova 2
Figure 22: Hydro memory usage on 2 nodes of Nova 2
Figure 23: Hydro power consumption usage per node on Nova 2
Figure 24: Hydro data storage usage per task on Nova 2 nodes
Figure 25: Hydro GPU write operations
Figure 26: Trace of a workload induced by the Phoronix test suite (CPU)
Figure 27: A distribution of errors in the model's accuracy as compared to the power meter reading


Table of Tables

Table 1: The fit data for both linear regression and segmented
Table 5: Execution time (seconds) and energy consumption (kJoules) for a single coarse-grain task (block matrix multiplication) using different configurations of a single type of device. Best values for each device type are in bold
Table 6: Execution time (seconds) and energy consumption (kJoules) for a whole matrix multiplication execution split in different blocks and using different configurations of a single type of device
Table 5: Execution time (seconds) and energy consumption (kJoules) for a whole matrix multiplication execution split in different blocks and using all the resources of the computing node
Table 6: Execution time and power evaluation for N-Body (1 iteration) using Poroto
Table 6: Specifications of the system used to test GPU energy probes overhead
Table 7: Results of the top command on nvml_app (the process related to the GPU energy monitoring probe)
Table 8: Error between Watt meter reading and the model-generated estimate of power consumption


Terms and abbreviations

ADSL – Asymmetric Digital Subscriber Line
ALDE – Application Lifecycle Deployment Engine
API – Application Programming Interface
BMC – Baseboard Management Controller
BSD – Berkeley Software Distribution
CAN – Controller Area Network
COP – Code Optimiser
(C)OMPSs – (Cloud) OpenMP Superscalar (from BSC)
CP – Constraint Programming
CPU – Central Processing Unit
CUDA – Compute Unified Device Architecture
DDR RAM – Double Data-Rate Random-Access Memory
DVI – Digital Visual Interface
EM – Energy Modeller
FPGA – Field Programmable Gate Array
GPU – Graphical Processing Unit
DTC – Design-Time Characteriser
DTO – Design-Time Optimiser
EC – European Commission
IDE – Integrated Development Environment
JVM – Java Virtual Machine
IPMI – Intelligent Platform Management Interface
MPI – Message Passing Interface
PCIe – Peripheral Component Interconnect Express
RAPL – Running Average Power Limit
RIFFA – Reusable Integration Framework for FPGA Accelerators
RMS – Root Mean Square (value)
ROCCC – Riverside Optimizing Compiler for Configurable Circuits
SLURM – Simple Linux Utility for Resource Management
SSD – Solid State Disk
UID – User Identification (*nix OS)
USB – Universal Serial Bus
VHDL – VHSIC Hardware Description Language


Executive Summary

This document, D3.2 – TANGO Alpha version – Scientific Report, presents the progress made by work package 3 (WP3) of the TANGO project, which ran through the project's first year. The first objective of TANGO is to provide means to profile the time and energy consumption of an application, or part of an application, when run on different heterogeneous hardware target architectures. To achieve this goal, two testbeds are provided. Bull's Nova 2 testbed provides computing nodes with CPU, GPU and Xeon Phi processors; CETIC's testbed is composed of a small server with an added FPGA board. On both testbeds it is possible to measure the power and energy consumed during a selected period of time. Consequently, it was possible to execute different computational operations or small applications, namely matrix multiplication, a fluid dynamics simulation (Hydro) and an NBody simulation, and to collect the time and energy performance of each run.

In addition to being able to obtain benchmarking data from the time and energy profiling effort, the implementation of different tools and components of the TANGO framework has progressed significantly and provided initial scientific contributions. Notably, the energy modeller shows how to calibrate a software Watt meter on specific hardware. The design-time optimiser provides the means to find an optimal design-time placement of dataflow-oriented computational tasks on the different heterogeneous hardware nodes available, the C/OMPSs programming models make it possible to execute coarse- and fine-grain tasks efficiently, and the code optimiser enables the profiling of source code to identify potential energy leaks.

At the end of this first year, WP3 will be completed and WP4 will take over. Initially, ongoing work on the integration of the various TANGO components will continue, to automate many of the tedious tasks such as compiling code for the correct set of hardware architectures on which it may run, and monitoring, aligning and analysing the data collected on the time and energy performance of a job or workload.

Alongside this deliverable, work package 3 has produced a software deliverable. Information about the installation of the software provided by TANGO is described in the accompanying deliverable D3.1.


1 Introduction

Recent years have seen the emergence of Cyber-Physical Systems (CPS), the Internet of Things (IoT), and the Smart Anything Everywhere Initiative, which have the potential to transform the way we live and work. They will also heavily influence industries by uplifting Europe's innovation capacity across the economy, from traditional industrial and professional service sectors to emerging consumer sectors [1]. In the long term, the IoT's transformational impact is expected to increase significantly with mass adoption, tens of billions of things connected, and a multi-trillion-dollar economic value. These are key drivers to new business models taking advantage of the data collected by the IoT, sophisticated application development platforms, analytics applied to things, and distributed/parallel architectures [2].

TANGO’s goal is to understand the factors which affect power consumption in software development and operation for heterogeneous parallel environments. Our main novel contribution is the combination of the principles of requirements engineering and design modelling for self-adaptive software systems and power consumption awareness related to these environments.

TANGO will facilitate the use of heterogeneous parallel architectures by providing a complete methodology that enables software designers to easily implement and verify applications running on such platforms, including general-purpose processors and acceleration modules implemented in the latest reconfigurable technology. The methodology will consider low-power consumption as a key factor for applications, as well as other requirements such as performance, dependability, security, or other qualities of service.

1.1 About this deliverable

This document D3.2 presents the scientific achievements in Year 1 of the TANGO project, entirely performed under work package 3. D3.1 is another deliverable related to work package 3; it provides the installation and configuration manual for the various TANGO software tools and components developed during Year 1 [3]. Thus, while this document D3.2 explains how the software developed during Year 1 helps to achieve the work package 3 objectives, the technical details on how to install these software packages are found in [3].

The main objective of work package 3 relates to static benchmarking of time and energy performance. This means that during Year 1, initial testbeds where time and energy measurements are possible were put in place. Secondly, one of TANGO's objectives is to provide the necessary tooling to facilitate executing sample applications, or portions of applications, on the provided testbeds to obtain time and energy profiles, which could be used to establish benchmarks on different heterogeneous hardware platforms or between alternative algorithm implementations of a particular operation used in a sample application. An overview of the testbeds provided in Year 1 is given in Section 2, and measurements of time and energy performance on these testbeds are presented in Section 4.

To help with the development of tools to facilitate performing static benchmarking, the following sub-objectives are targeted:

• Identify generic categories of application scenarios and benchmarking goals on performance (e.g. time) and energy efficiency when developing systems achieving a computing continuum across heterogeneous parallel hardware architectures.
• Refine the initial vision of the reference architecture and complete the first toolbox implementation for the purpose of easy benchmarking of system prototypes composed of emerging heterogeneous parallel hardware architectures.
• Provide a programming model with built-in support for different generic data movement approaches for the various hardware architectures considered, including heterogeneous clusters, heterogeneous chips (multi-processing system-on-chip, Application Specific Instruction-set on Chip) and programmable logic devices (ASIP, FPGA).

1.2 Document structure

An overview of the testbeds provided at the end of Year 1 is given in Section 2, together with general considerations when using the TANGO software tools and packages. Section 3 presents the scientific achievements and associated technical contributions for the TANGO components that achieved significant progress in Year 1. Notably, results are provided for the energy modeller (part of the Self-Adaptation Manager to be designed in Year 2), the design-time optimiser (part of the Requirements and Design Modelling tools), the programming model and its runtime abstraction layer, and the code optimiser. Section 4 then illustrates how matrix multiplication and two mini-applications (Hydro and NBody) can be profiled on time and energy on the two testbeds available in Year 1.


2 TANGO Architecture – Year 1 Implementation and Usage Scenarios

In this first year of the TANGO project, every component of the target TANGO architecture presented in D2.1 [4] followed its own implementation path and progress. Several of these components, either alone or already integrated, have achieved significant progress, described in Section 3 on scientific contributions.

Besides these partial results, initial testbeds have taken shape during Year 1 in order to profile different implementations of simple applications on time and energy performance and to obtain benchmarks of these implementations on different types of heterogeneous hardware. Although these testbeds do not yet provide a complete implementation of the TANGO operational framework, several of the software packages installed on these testbeds can already be viewed as initial, partial instances of the TANGO architecture recalled in Figure 1.

Figure 1: General TANGO Architecture with generic component names.

Therefore the next sections (2.2 and 2.3) present the various specific software packages installed and used at runtime on each testbed and identify the correspondence between each of these specific software packages and the generic component names of the general TANGO architecture. It is worth noting that in these two sections only a high-level overview of the software packages is given. A more detailed version is presented in Section 4, and the complete details on how to install and configure these software packages are provided in D3.1 of TANGO [3].

2.1 WP3 Benchmarking Vision – Profiling Time and Energy for Design-time Implementation Decisions

Below we review the reasons why benchmarking different parallel implementations on different heterogeneous hardware remains necessary. The number of design-time decisions on how to parallelise the implementation of different parts of a software solution has increased drastically in recent years due to the diversity of new heterogeneous hardware appearing on the market. Thus, the problem of determining which parallelisation techniques to implement to achieve targeted time and energy performance is extremely complex, in particular if complete freedom on software and hardware decisions is given at design time.

Even if design-time decisions are constrained by having to use existing portions of a software implementation or by fixing the range of heterogeneous hardware on which software will run, it remains very difficult for a development team to estimate the time and energy performance without actually executing software on particular hardware to obtain objective measurements.

From the WP6 validation use cases and from the set of interviews with industry members and researchers performed in the first three months of the project, it is clear that most projects focused on the use of heterogeneous hardware platforms rarely start from a blank page. Instead, existing implementations of an entire application or part of an application are exploited.

In many cases, the existing code cannot be executed as such on any heterogeneous hardware. For instance, existing C code will only run on traditional or multi-core CPUs; it will need to be either completely or partially re-written, as well as significantly annotated, in order to be executed on a GPU, Xeon Phi or programmable devices such as FPGAs. The level of intrinsic parallelism in the implemented algorithm, the existing design and the code will significantly influence the effort of the re-writing exercise needed to exploit a newly targeted type of heterogeneous hardware. The type of interconnection (maximum bandwidth) between the various hardware devices will also heavily influence how to re-write code. In particular, the bandwidth and maximum throughput for transmitting data across the distributed hardware processing nodes will influence the granularity of the tasks to distribute and perform in parallel. The different memory sizes internal to the different types of heterogeneous hardware make it very complex for a human developer to determine in one shot the optimal task granularity. Furthermore, different input/output data sizes may also affect which task granularity will perform best on different hardware targets. Whether the different types of memory internal to an FPGA or a GPU can easily be exploited to buffer incoming and outgoing data and handle input/output data transfers in a batch-oriented way will also influence the code implementation.

In other cases, the existing code can nearly be exploited directly on a new kind of heterogeneous hardware. For instance, an existing OpenCL implementation can be compiled and run on a new type of heterogeneous hardware as soon as the needed code generation module for the new hardware type is implemented in the OpenCL compilation chain. Existing CUDA code can be run on a new NVIDIA GPU system, existing VHDL can be directly recompiled for a newer FPGA, and existing MPI code can be executed on new Xeon Phi-based nodes. However, in many cases, the fact that an existing implementation performed well on a given heterogeneous hardware platform does not mean it will automatically perform well on another one. For instance, even in the CUDA example, switching to a newer NVIDIA GPU is no guarantee of benefiting from the added power of the newer GPU generation. Indeed, past design decisions may have been hard-coded in the current CUDA implementation in order to perform data movement optimally given the memory size and data transfer capabilities of the older GPU. It is often difficult to fine-tune performance without freezing certain decisions in the code. In the end, a given code may be extremely well suited to exploit the full capacity of a given GPU but incapable of exploiting the increased capacity of a new GPU version. This shows that even when existing code can be directly executed on a new hardware platform, it will often require a rewriting pass to make sure that the newer hardware is exploited to its fullest extent.

At a given moment in time, it may be theoretically possible to determine automatically how to break down (and hence annotate) application code for processing on different devices to achieve an optimal time and energy performance trade-off. However, newer hardware with augmented capacities and increases in the data size to process may make past design decisions suboptimal.

Furthermore, hardware-software co-design is mostly reserved for certain parts (sub-systems) of a larger solution. For instance, the development team will be requested to focus on the energy performance of a certain type of IoT object, which is only one type of sub-system of the overall IoT solution. In other cases, the time performance of a particular sub-system that must operate in a real-time environment will have priority. Only in very few flagship cases will an entire system have the freedom to have its hardware and software co-designed. In general, hardware devices must be fit to run various kinds of (software) workloads and, conversely, a workload should be capable of running on and exploiting various kinds of heterogeneous hardware devices.

Therefore, a profiling and benchmarking approach proves to be an adequate way to determine how a solution performs in terms of time and energy. In particular, profiling the time and energy consumption of different algorithms for the same problem, with computation tasks split and distributed slightly differently, executed on different types of heterogeneous hardware devices, with different networking setups and with input datasets of different sizes, will provide useful measurements to establish benchmarks that guide developers in deciding what to fix at design time or deployment time and what to leave flexible for runtime decisions.

As highlighted above, software exploiting heterogeneous hardware is influenced by

• The raw capabilities, in terms of time and energy performance, of the heterogeneous hardware devices used to process, store and exchange data;
• The variety of workloads that a hardware setup will have to handle;
• The development team's willingness to spend time re-writing code and experimenting with various ways to break down code or implement alternative algorithms to benchmark on the various hardware targets available.

To obtain time and energy profiles, the two testbeds (Bull's Nova 2 and the CETIC FPGA-based system) used during the first year of TANGO offer an adequate variety of heterogeneous hardware to show that TANGO delivers an initial version of the necessary software components to perform time and energy profiling, benchmark selected software-hardware alternatives and guide developers in their design-time decisions. Prior to presenting our initial effort on time and energy profiling in Section 4, a review of the current scientific results achieved by several TANGO components is presented in Section 3.


2.2 TANGO Component Integration for Benchmarking/Profiling on Bull's Nova 2

During this first year of the TANGO project, to measure the time and energy of applications running on Bull's Nova 2 infrastructure, the software components shown in the Nova 2 Operational/Runtime Environment segment of Figure 2 were installed and configured. Before executing an application on Nova 2, it is first developed and tested using the C/OMPSs development environment (IDE segment), which is installed locally on TANGO researchers' laptops. Alternatively, developers who prefer using their current IDE to implement applications with MPI can also continue to do so. Furthermore, TANGO aims to support various programming language frameworks that work at a lower level, such as OpenCL, CUDA, or C/Poroto, as mentioned in the next section.

Figure 2 depicts the layered Nova 2 stack: the application integrated programming model development environment (OMPSs and COMPSs development environments, IDE), the application-aware middleware (MPI, OMPSs and COMPSs runtimes), the hardware-aware middleware (SLURM), the Nova 2 software measurement probes (IPMI-based or RAPL energy measurement probes for whole nodes, GPU energy measurement probes using the NVIDIA API, and Xeon Phi measurement probes using the Intel API), and the Nova 2 computing hardware (multi-core CPU nodes, CPU+GPU nodes and heterogeneous Xeon Phi nodes, all running Linux).

Figure 2: Overview of the TANGO architecture instance on Nova 2 at Bull

The correspondence with the general TANGO architecture presented in Figure 1 is listed in Table 1. As of Year 1, the energy modelling functionality simply relates to energy measurements being obtained from software probes installed on the hardware. Neither application deployment nor self-adaptation has been made available on the Nova 2 integrated testbed. These features are currently being implemented and tested independently and will be added in the remaining time of the project. So during Year 1, the existing functionality of SLURM is used to achieve device supervision, infrastructure monitoring and also energy modelling, as the series of energy measurements collected by the probes is aggregated by SLURM in order to determine the energy consumed by whole nodes. The work of matching these energy profiles to the execution of an application remains a manual effort in Year 1. The application lifecycle deployment engine will automate this process in the future.

Although not installed on the operational Nova 2 testbed, the application integrated development environment, which consists of IDE-related tools and plugins, is also illustrated in Figure 2 for the sake of presenting the complete picture of how measurement and benchmarking were conducted during this first year on Nova 2. In particular, these IDE tools, available in the developers' local environments, were used to develop the source code of an example application using the COMPSs and OMPSs programming models; the developers could then manually upload either the source code or the pre-compiled object code to be executed on Nova 2 in order to define benchmark values based on the time and energy measurements obtained. As an alternative to using the COMPSs and OMPSs programming models, developers can also continue implementing with their existing technologies and then follow the same manual procedure to upload and run their applications.

Table 1: correspondence between components of general TANGO Architecture and of Nova 2 in Year-1

General TANGO Architecture | Nova 2 in Year-1
Heter. Paral. Device Cluster – CPU-APU | Multi-Core CPU nodes with Linux
Heter. Paral. Device Cluster – GPGPU | CPU+GPU nodes with Linux
Heter. Paral. Device Cluster – CPU-APU | Heterogeneous CPU (XeonPhi) nodes with Linux
Energy Modeller | IPMI-based (or RAPL) energy measurement probes for whole node
Energy Modeller | GPU energy measurement probes (NVIDIA API)
Energy Modeller | XeonPhi measurement probes (Intel API)
Energy Modeller | SLURM
Device Supervisor | SLURM
Infrastructure Monitor | SLURM
Runtime Abstraction Layer | OMPSs runtime
Runtime Abstraction Layer | COMPSs runtime

2.3 TANGO Component Integration for Benchmarking/Profiling on CETIC's FPGA-augmented server

While programming CPUs and GPUs can rely on the fact that processing is done using a fixed instruction set made available by the CPU or GPU, the situation is different for FPGAs: the logic gates of an FPGA can be configured to implement the desired instructions or operations.


Figure 3 depicts the CETIC stack: the application integrated programming model development environment (C/Poroto annotations), the hardware-aware middleware and runtime (the Poroto compiler, the data transfer API and runtime from Alphadata and RIFFA, and VHDL compilers such as ROCCC, Bamboo and Cλash), the CETIC hardware/software power measurement system (the CETIC Power Monitor and the CETIC Amp-based power probe), and the CETIC FPGA-augmented system itself (a CPU board and an FPGA board connected over ePCI).

Figure 3: Overview of the TANGO architecture available on the FPGA-based testbed at CETIC.

To determine if it is worth executing a given set of operations on an FPGA, it is useful to follow a rapid prototyping approach where a selected portion of an algorithm is ported to an FPGA and executed with datasets of different sizes in order to measure its time and energy performance on the FPGA.

There exist different types of systems augmented with FPGAs. Some are closer to blade systems that can be assembled to reach the performance of HPC centres, while other FPGA systems are much smaller so as to be embedded in IoT devices or in IoT edge network devices. While waiting for such a smaller type of system to be made available as part of a testbed to explore what the general TANGO architecture refers to as a Smart device, CETIC built an FPGA-based system where a standard standalone CPU-based server has been augmented with a Xilinx FPGA board connected through ePCI. Energy measurements for the whole system are taken using a cheap Ampere-meter-based side system in which two Arduinos perform the computations needed to determine various power measurements (Watts, Volts, Amperes) as well as to signal whether the FPGA is used for computation or not. The hardware probe is currently connected to a side computer that collects these measurements, makes them available in near real time, and persists them after execution runs for later consultation.

In the case of the CETIC testbed, it is assumed that a developer has existing C code of an application and wants to explore offloading a part of its execution to an FPGA. In this programming model, the developer annotates the C code with C/Poroto pragmas to indicate which portions of the C code to transform into VHDL understandable by the Xilinx tool chain, which generates the needed executable FPGA configuration. The Poroto runtime handles the transformation of the desired portion of C code into VHDL and generates the code needed to handle the data transfer from the main server to the FPGA, respecting the FPGA's clock-synchronisation considerations.

Table 2: correspondence between general TANGO Architecture and CETIC FPGA-Augmented server in Year-1

General TANGO Architecture | CETIC FPGA-Augmented server in Year-1
Heter. Paral. Device Cluster – CPU | CETIC FPGA-Augmented System with Linux
Heter. Paral. Device Cluster – FPGA | CETIC FPGA-Augmented System with Linux
Energy Modeller | CETIC Amp-based Power probe
Infrastructure Monitor | CETIC Power Monitor
Runtime Abstraction Layer | Poroto
Runtime Abstraction Layer | Data Transfer API – Alphadata – RIFFA
Runtime Abstraction Layer | VHDL Compiler – ROCCC, Bamboo, Cλash

This CETIC testbed provides a solution in between high-end FPGA-augmented systems and small IoT systems. Initially, it was thought of as a temporary solution to initiate reflection on how to perform measurements and to obtain useful benchmark information that helps analysts and developers decide which parts of a computation to perform on an FPGA and what can be expected in terms of acceleration when exploiting an FPGA on top of CPU power. However, it may prove a useful testbed to develop further in the remaining time of the project, as it will help to address other types of applications than the current IoT industry use case. While the IoT use case focuses on small embedded systems, the CETIC testbed provides a more powerful infrastructure that usually does not run on battery and is instead powered continuously (such as a set-top box).


3 Specific Scientific Contributions

This section reviews the scientific and technical results of the various components of the TANGO architecture that achieved significant progress in this first year of the project. In particular, the energy modeller, the design-time optimiser (part of the requirements and design modelling tools), the code optimiser, and the programming model with its runtime abstraction layer present their current results.

3.1 Energy Modelling

The motivation and purpose of developing energy modelling in TANGO, as well as related work on similar subjects, are described first. Subsequently, the contribution of the initial effort on energy modelling and its evaluation are presented. Finally, conclusions and future work are highlighted.

3.1.1 Motivation and purpose

The Energy Modeller's primary focus is predicting energy usage and generating historic logs of usage in the TANGO framework.

The TANGO project is built around the concept of low-power computing and saving energy; in order to achieve this, some components such as the device supervisor need detailed power and energy usage information, especially forecasts, to understand the impact of any decisions that are to be made.

3.1.2 Related Work

An important part of prediction and monitoring of power utilisation is the characterisation of the physical resources and the characterisation of the workload. Physical resource characterisation has given rise to energy profiling and testing frameworks such as JouleUnit [5]. In addition to frameworks aimed specifically at energy profiling, there are more generalised monitoring frameworks such as Zabbix [6] and Ganglia [7]. These more generalised frameworks fit the requirements of TANGO well as they can monitor compute environments at scale.

Data on resources' power consumption is principally obtained either by direct measurement [8] or inferred via software and physical performance counters [9] [10]. Direct measurement can be difficult as it needs specialised equipment such as Watt meters, or devices such as PowerInsight [11] and PowerPack [12] for the power consumption of a node's individual components. These measuring processes however utilise additional hardware, which is costly and can cause difficulties in scaling to the size of a data centre.

Performance counters [11] [12] are a non-invasive means of determining energy usage and thus have the ability to scale better than direct measurement. They utilise counters located within the CPU and operating system, which then feed models that provide power consumption information.

The Running Average Power Limit (RAPL) interface is a hardware implementation of such performance counters, first introduced with the Intel Sandy Bridge processors [13]. It provides operating-system-level access to energy consumption information based on a software model driven by hardware counters. RAPL reports energy at the level of the socket and not at the level of the core, which makes allocating power consumption to individual processes more involved. Its power estimates [16] have been shown to be fairly precise, which provides the prospect of core-based power consumption estimation.


The Intelligent Platform Management Interface (IPMI) [13] is a message-based, hardware-level interface specification that operates independently of the operating system. It is used by system administrators for system recovery and for monitoring e.g. temperature, voltage, fan speed and power consumption. IPMI is the main API for querying and utilising a baseboard management controller (BMC), a specialized microcontroller embedded on the motherboard which collects the data from various sensors such as the power sensors. IPMI in this regard is integrated into the system's hardware and can be found in nearly all current Intel architectures. If the power sensors are supported, it can provide a very cheap built-in way to collect power data. Various open source software packages, such as libopenipmi [18], provide access to this interface.

Many prediction models have been developed with the aim of allocating power consumption to processes, especially in the context of virtualised environments. The majority of these cases have used linear models [8, 9, 10, 11], while others have used lookup table structures [12] and other techniques [13]. In most cases, these linear models have shown a high degree of accuracy in providing the power profile of resources, usually within 5%, or less than 3W of the actual value.

Many models can also be described as additive models, such as [8, 10]. These models are characterised by summing the power consumption of each of the major physical components separately. The idle power consumption in these cases is treated as an additional parameter of the model that is simply added to the other load characteristics. This practice often ignores the fact that utilising one physical component of the host, such as I/O, often requires another, such as the CPU, thus making it hard to isolate the true cause of an increase in power consumption.

In other cases, more complicated bias mechanisms [9] or techniques such as principal component analysis [13] are utilised to create the underlying model of a host's profile.

In addition to the characterisation of physical resources, the workload is required to be characterised as well, as a means of making projections regarding future power consumption. Forward Linear Regression based CPU Utilization Prediction (LiRCUP) [11] is one such example of CPU load prediction, aimed at maintaining service level agreements.

3.1.3 Scientific Contribution

The energy modeller in the TANGO architecture is shown in Figure 4.

Figure 4 - Outline of data flows in the Energy Modeller


The energy modeller takes raw power values from the physical infrastructure; in the case of TANGO this is via SLURM's inbuilt monitoring capabilities. SLURM provides various sources for this information, such as modules that integrate with IPMI and with RAPL. There is also the possibility for the energy modeller to use Watt meters directly, or to emulate the presence of a Watt meter in cases where calibration data has previously been gathered.

The energy modeller's principal mechanism is a series of calibration runs which gather power consumption information and compare it to the current system load. The principal cause of variation in power consumption is CPU utilisation; this is therefore the metric chosen in the first year of the TANGO project. We expect to expand this to further take account of other factors such as disk, I/O, network traffic, load per core and hyper-threading. The focus of the work in the first year has been to improve calibration through a better understanding of the errors that can be induced, rather than initially creating a complicated model that may lead to overfitting.

The energy modeller during its calibration phase applies a sequence of predefined levels of load which allows the mapping between power consumption and utilisation factors to be determined.

The calibration process on a physical host works as follows:

1. Launching threads on each core, using the utility stress to provide load.
2. Limiting the CPU load on the physical host using cpulimit and taskset: cpulimit restricts the CPU usage to a set limit while taskset ensures the thread is fixed to a particular core.
3. Iteratively inducing load for a period of time and then providing a back-off period of no load. During each iteration the CPU usage is kept constant, thus keeping a plateau in which power consumption and utilisation remain stable. (A simplified sketch of this load-induction phase is given after step 6 below.)

Once this run is performed at multiple levels of utilisation, the utilisation traces are first extracted from the SLURM job accounting and then cleaned and loaded into the energy modeller, thus forming the calibration dataset.

4. The first step extracts the power and utilisation traces from SLURM, using sh5util. These traces are then merged so that the mapping between power and utilisation can be performed.
5. The cleaning process eliminates all data points where the recorded power consumption varies significantly in comparison to the previously recorded value. The aim of this is to eliminate values where either the power or the utilisation value does not represent the same level of workload on the CPU. This might occur due to differences in when each measurement was taken, and is likely to happen when the workload changes after one of the periods of constant load induced in step 3.
6. This data is then loaded into the energy modeller's dataset for the given physical host.
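The listing below is a minimal, illustrative sketch of the load-induction phase (steps 1-3). The actual TANGO calibration scripts drive stress, cpulimit and taskset through SLURM; here, purely for illustration, a Python worker is pinned to each core with os.sched_setaffinity and duty-cycled to approximate the requested utilisation level. All names and constants are ours and not part of the TANGO tooling.

```python
#!/usr/bin/env python3
"""Illustrative stand-in for the calibration load-induction phase (steps 1-3)."""
import multiprocessing as mp
import os
import time

PLATEAU_S = 120.0   # hold each load level long enough to form a stable plateau
BACKOFF_S = 60.0    # no-load back-off period between load levels
PERIOD_S = 0.1      # duty-cycle period used to approximate a utilisation level


def load_core(core: int, percent: int, duration: float) -> None:
    """Keep one core at roughly `percent` utilisation for `duration` seconds."""
    os.sched_setaffinity(0, {core})            # pin this worker to one core
    busy_s = PERIOD_S * percent / 100.0
    end = time.time() + duration
    while time.time() < end:
        t0 = time.time()
        while time.time() - t0 < busy_s:       # busy part of the duty cycle
            pass
        time.sleep(PERIOD_S - busy_s)          # idle part of the duty cycle


if __name__ == "__main__":
    for level in range(10, 101, 10):           # 10% .. 100% in 10% increments
        workers = [mp.Process(target=load_core, args=(c, level, PLATEAU_S))
                   for c in range(os.cpu_count())]
        for w in workers:
            w.start()
        for w in workers:
            w.join()
        time.sleep(BACKOFF_S)                  # back-off period with no load
```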

Once the calibration data has been gathered, a fit is applied to the data to determine how utilisation relates to power consumption. This can be either a linear model, a polynomial fit of order 2, or a polynomial fit that uses spline points in cases where the trend clearly changes between low and high workload. The root mean square error can then be utilised to select the model that is most suited to the data.
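As an illustration of this model-selection step, the sketch below fits a linear and a quadratic model to a calibration dataset with NumPy and keeps the one with the lowest root mean square error. The utilisation and power values are illustrative placeholders (loosely inspired by the idle and capped values reported in Section 3.1.4), not actual Energy Modeller data, and the segmented (spline-point) fit is omitted for brevity.

```python
import numpy as np

# Illustrative calibration points: CPU utilisation (%) -> measured power (W).
util = np.array([0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100], dtype=float)
power = np.array([117, 124, 136, 149, 163, 178, 194, 208, 219, 226, 228],
                 dtype=float)


def rmse(coeffs, x, y):
    """Root mean square error of a polynomial fit against the data."""
    return float(np.sqrt(np.mean((np.polyval(coeffs, x) - y) ** 2)))


candidates = {
    "linear": np.polyfit(util, power, 1),
    "quadratic": np.polyfit(util, power, 2),
    # A segmented (spline-point) fit would be added here when the trend
    # clearly changes between low and high load.
}
best = min(candidates, key=lambda name: rmse(candidates[name], util, power))

print("selected model:", best)
print("predicted power at 65%% load: %.1f W" % np.polyval(candidates[best], 65.0))
```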

Once this phase is done, the energy modeller can, given an estimated workload of an application, forecast its future power consumption.


The energy modeller has undergone an initial level of integration into the job management system SLURM. This initial integration enables it to get raw measurement data from the physical hosts in the testbed. The result of this integration is therefore shown in the next section. It also remains an integral part of the code optimizer plugin.

The scientific contributions to the energy modeller focus on calibration, in particular calibration via IPMI. The assessment of IPMI as a means to perform calibration of the energy modeller has resulted in both an understanding of the sources of error during calibration and a set of rules that allow the energy modeller to be calibrated utilising IPMI.

IPMI is readily available in modern hardware, yet the sensors in the related Baseboard Management Controllers (BMC) on the motherboard are limited in regards to the measurement of power consumption. This work therefore provides the means to create an emulated Watt meter that can then be used to provide power measurements at a finer granularity than is currently available from the power sensors that may be read via IPMI.

Errors during the calibration phase result in an inaccurate model that does not correctly represent the relationship between load and power consumption. The following reasons for this have been identified:

Unsynchronized metric update intervals for different metric types: This occurs when measuring CPU utilisation and power together. For calibration to be accurate it requires the measurements to be perfectly synced and values to be pushed at the same rate. An alternative to this is for the utilisation to remain stable during a measurement phase, so that both measurements represent the physical state of the host machine.

Measurement arrival latency (monitoring infrastructure overhead): This is a different kind of synchronisation issue. Differing from the above, it is caused by the inherent delays in taking a measurement, transferring the value across a network and recording it in the monitoring infrastructure. This affects the detection of the start and end of periods of induced load. It can be mitigated by performing the calibration run locally, without the use of a full monitoring infrastructure such as Zabbix or Ganglia. This however will only work during calibration and will not work during normal operations.

Locally monitoring load will however have the side effect of measuring a small amount of load induced by the monitoring itself.

Averaging and time windows of measurement values: Measurements arrive with a given polling interval; however, measurements such as CPU load also have a time window over which the measurement was taken, e.g. over the last minute.

This averaging causes errors in the model and requires the CPU utilisation measurement window to be made as small as possible. One alternative is for measurements used in the calibration dataset to only start to be taken after load has been induced for a time that is longer than the length of the averaging period. The former option is simpler but requires custom scripts in the case of the Zabbix monitoring environment.

The Sensor's Update Interval: Sensors such as power measurements taken over IPMI update slower than the interval at which the baseboard can be queried. Thus rapid polling of the interface can result in the previously reported value being reported again, without any prospect of change. Hence the poll interval should be no shorter than this update interval. In the case of IPMI power values, polling in our testbed should be restricted to every 5 seconds.
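As a small illustration of this constraint, the sketch below polls the node's power at 5-second intervals through ipmitool's DCMI power reading. It assumes ipmitool is installed, that the BMC supports DCMI power readings, and that the output contains the usual "Instantaneous power reading" line; it is not one of the TANGO probes.

```python
import subprocess
import time

POLL_S = 5  # never poll faster than the BMC sensor's own update interval


def read_node_power_w() -> float:
    """Return the instantaneous node power (W) reported via ipmitool DCMI."""
    out = subprocess.run(["ipmitool", "dcmi", "power", "reading"],
                         capture_output=True, text=True, check=True).stdout
    for line in out.splitlines():
        if "Instantaneous power reading" in line:
            return float(line.split(":")[1].split()[0])
    raise RuntimeError("no power reading found in ipmitool output")


for _ in range(60):                       # a five-minute trace at 5 s intervals
    print(time.time(), read_node_power_w(), "W")
    time.sleep(POLL_S)
```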

This therefore provides the basis of several recommendations for the calibration process:

• Use metrics that represent the physical host in its most recent state, which we call spot metrics, and avoid metrics that average over long periods of time.
• Induce load, wait a set period of time for the values to stabilise, and only then take measurements. A further addition to this is to detect plateaus in the measured values and only use congruent data points, which can serve as a mechanism to determine how long to wait before accepting measurements as valid (a small sketch of such plateau filtering is shown after this list).
• Take measurements locally, thus avoiding monitoring system overheads, including network delays.
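The sketch below illustrates one simple way to keep only congruent data points from a plateau, by discarding any sample whose power reading differs from its predecessor by more than a tolerance. The 2 W tolerance, the function name and the example trace are illustrative choices, not values taken from the Energy Modeller.

```python
def plateau_points(samples, tol_w=2.0):
    """Keep only (utilisation, power) samples that sit on a plateau, i.e.
    whose power differs from the previous sample by less than `tol_w` watts."""
    kept, prev_power = [], None
    for util, power in samples:
        if prev_power is not None and abs(power - prev_power) < tol_w:
            kept.append((util, power))
        prev_power = power
    return kept


# Example: the first sample and the sample taken during the 40% -> 50%
# load transition are both discarded.
trace = [(40, 160.5), (40, 161.0), (40, 161.2), (50, 175.0), (50, 175.4)]
print(plateau_points(trace))   # [(40, 161.0), (40, 161.2), (50, 175.4)]
```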

3.1.4 Evaluation

The first element of the evaluation demonstrates the calibration of a node on the Nova 2 testbed, namely nd32. The calibration data, shown as power vs CPU utilisation for node nd32, was gathered by scripts that utilise SLURM to gather calibration data for the energy modeller as described in the previous section. These scripts induce a series of separate loads on the selected host, in increments of 10%, with all cores being equally loaded. This data was then added to the energy modeller's training set, allowing it to provide future predictions.

Figure 5 - Calibration data gathered on Nova2 - Node nd32

The rest of this evaluation focuses on the improvements that can be made to calibration in the energy modeller, and in particular on the accuracy of Baseboard Management Controller (BMC) devices and attached sensors that are accessed via IPMI.


The aim of the experimentation is to explore the suitability of the physical host's built-in power measuring functionality.

The experimentation follows the energy modeller's calibration process, which involves inducing load at selected preset values on a physical host and measuring the power consumption that the load causes.

The experimentation was performed on a Cloud testbed that uses OpenNebula 4.10.2 [14] and Zabbix 2.4.4 [6] for monitoring. The physical host that was measured is a Dell PowerEdge R430 commodity server that is monitored through IPMI. It has two 2.4GHz Intel Xeon E5-2630 v3 CPUs with 128GB of RAM, a 120GB SSD hard disk and an iDRAC port card that is IPMI 2.0 compliant.

For the purpose of creating a baseline against which to compare IPMI-based power values, a WattsUp Pro [15] Watt meter is attached, with an accuracy of +/- 1.5%. The readings from the WattsUp meter were taken every second and reported to Zabbix, and from the IPMI sensor every 5 seconds. In post-processing, the values reported by IPMI were interpolated in order to compare them to the Watt meter data. The IPMI sensor uses an inbuilt time window of 60 seconds. Zabbix was installed on a server separate from the host undergoing measurement, so as to avoid unnecessary additional load.
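A minimal sketch of this interpolation step is given below, assuming 1-second Watt meter timestamps and 5-second IPMI timestamps; the series are synthetic placeholders, not measurements from this experiment.

```python
import numpy as np

# Synthetic timestamps (s) and power readings (W), for illustration only.
meter_t = np.arange(0.0, 60.0, 1.0)              # Watt meter sample every second
ipmi_t = np.arange(0.0, 60.0, 5.0)               # IPMI sample every 5 seconds
ipmi_w = np.linspace(117.0, 140.0, ipmi_t.size)  # placeholder IPMI power values

# Linearly interpolate the sparse IPMI series onto the Watt meter timestamps
# so that the two traces can be compared point by point.
ipmi_on_meter_grid = np.interp(meter_t, ipmi_t, ipmi_w)
```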

Figure 6: Trace of IPMI and Watt meter measurements with incrementing CPU load

The load induced on the physical host ranges from 0% CPU usage up to 100% in increments of 10%. In order to generate this load a tool called stress [16] is used, along with cpulimit and taskset. In order to generate full load, 32 threads were launched and then mapped using taskset to the CPU cores on the physical host. cpulimit was used to set the intended load, and each level of load was induced for 120 seconds. In order to represent a realistic setup for the physical host, the CPU scheduling governor was set to the default option of ondemand and hyper-threading was enabled with all sleep states being available. We however expect in the future to further investigate such settings and disable aspects such as hyper-threading, to better consider their impact upon power consumption, especially at application level.

In Figure 6 the overall trace of the calibration run is shown. It shows multiple measurements for each set CPU utilisation level being gathered via IPMI and the Watt meter, along with the CPU load induced on the physical host. The Watt meter shows spikes at the start of some periods of induced load, especially at 10% and 20% CPU load, before the load settles. This is in contrast to the IPMI sensor, which is unable to detect any change in power consumption at 10% CPU load. This is due to the granularity of the sensor: it exhibits only 9 distinct value bands within the measurement range used (112W - 224W in 14W increments). The initial measured idle power is 117W, while at 10% load it is 124W; with only a 7W difference this is undetectable using IPMI.

IPMI undergoes averaging, which results in the peak associated with IPMI being offset to the right of the Watt meter's reported values. This suggests that if accurate calibration is desired, these values should only be used after the averaging window has passed while sustained, consistent load is in effect. The IPMI power values also under-report the power consumption, seemingly by rounding down to the last permissible increment.

At 60% CPU utilisation and above we notice that the system’s power consumption becomes capped at around 228W. It can therefore be seen that a purely linear model as seen in much of the literature does not apply in the context of our machine.

Figure 7: Trace of Watt meter and temperature measurements with incrementing CPU load

In Figure 7, we examine the effect of the temperature measured by IPMI on the power consumption, to investigate the higher than expected variance in power during the sustained 120-second workloads. The correlation between CPU load and CPU temperature can clearly be seen. The temperature at the start of our experiment, before any load is induced, is 63°C, yet it lowers to 53°C at the lowest point during our experiment, which occurs soon after a load period has completed and is a result of the fans cooling the CPU past its normal idle temperature. At 50% CPU utilisation and above in our test setup, the power consumption as reported by the Watt meter shows an initial slope and then a tail in which the power consumption does not immediately drop down to idle once the load has finished. Thus, as the CPU further heats at the start of a load period, an initial slope is created due to the heating and the increase in fan speed. The power consumption stabilises and then drops at the end of the load period, yet the remaining additional heat takes time to dissipate, thus causing the tail.

Figure 8: CPU load vs power and energy consumption

In Figure 8 we show the CPU load and power consumption calibration data processed from the raw data (shown in Figure 6), where all the data points over the 120 seconds of each workload are averaged. The standard deviation is illustrated via vertical error bars. This data is used to estimate power consumption from CPU load. We see that IPMI consistently under-reports the power consumption and the overall energy consumed. The error is larger when the CPU load is higher. This error is due to the averaging window that the IPMI device uses when taking measurements. It can also be seen, as in Figure 6, that at 10% CPU utilisation IPMI does not register the change in power consumption. We can however see that the lines, although offset, follow the same trend.
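A minimal sketch of this processing step, assuming a simple sample representation (the Sample type and its field names are illustrative, not part of the energy modeller), is shown below: samples are grouped per induced load level, then the mean power, its standard deviation and the energy of the 120-second step are computed.

case class Sample(loadPercent: Int, watts: Double)

// mean power (W), standard deviation (W) and energy (J) per induced load level
def summarise(samples: Seq[Sample], stepSeconds: Double = 120.0): Map[Int, (Double, Double, Double)] =
  samples.groupBy(_.loadPercent).map { case (load, s) =>
    val mean   = s.map(_.watts).sum / s.size
    val stdDev = math.sqrt(s.map(x => math.pow(x.watts - mean, 2)).sum / s.size)
    load -> (mean, stdDev, mean * stepSeconds)
  }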


Figure 9: CPU load vs energy consumption, adjusted to compensate for idle host power consumption

Figure 9 shows the effect of making two adjustments so that the calibration data obtained via IPMI more closely matches the data obtained from the Watt meter. First, we remove the idle power consumption of the server, so that only the additional energy consumption of the application is considered; second, we increase the window size for the IPMI measurements from 120 seconds to 180 seconds, which takes account of the entire averaging window used by IPMI, which is fixed at 60 seconds.

After these changes we can see that the two lines correlate almost directly, with the Watt meter and the IPMI sensor closely agreeing in the range of 20-80% CPU utilisation and showing slightly more error at the high and low ends. The application of these two simple rules thus illustrates how IPMI can be used to produce a result similar to an actual Watt meter, albeit for the energy consumption of the physical host.
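A minimal sketch of these corrections, under the assumption that raw IPMI readings are available as timestamped samples (types and names are illustrative), is given below; it subtracts the idle power and discards readings taken before the IPMI averaging window has elapsed, which has the same intent as extending the measurement window.

case class TimedSample(tSeconds: Double, watts: Double)

// Additional energy (J) attributed to the induced load between loadStart and loadEnd.
def correctedEnergy(ipmi: Seq[TimedSample],
                    idleWatts: Double,
                    loadStart: Double,
                    loadEnd: Double,
                    averagingWindow: Double = 60.0): Double = {
  val usable = ipmi.filter(s => s.tSeconds >= loadStart + averagingWindow && s.tSeconds <= loadEnd)
  val meanAboveIdle = usable.map(_.watts - idleWatts).sum / math.max(usable.size, 1)
  meanAboveIdle * (loadEnd - loadStart)
}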

Deriving the current power consumption of an application from the model is more useful than its energy consumption alone. Figure 10 demonstrates how this can be achieved. We show a graph of calibration data for power consumption vs CPU utilisation, along with 95% confidence intervals for the linear regressions of the Watt meter and IPMI results. In this instance, however, we ignore the initial data points that artificially lower the power consumption reported by the IPMI sensor.


Figure 10: CPU load vs power and energy consumption – Compensating for inaccuracies in IPMI measured values

The confidence intervals of the adjusted IPMI data are very similar and are thus excluded to avoid cluttering the graph. The fit was generated in R using segmented linear regression. The adjusted IPMI data this time ignores the first 60 seconds of data points. We can see how IPMI without this processing under-reports the power consumption, whereas the correct answer is reported by the Watt meter. Removing the first 60 seconds of IPMI data points works because the averaging window used by IPMI then no longer reflects a period of time before the load was induced, so measurements only reflect the CPU at the specified load. Once this is done, the IPMI calibration line fits much more closely to the Watt meter's line. In the context of calibration, this means that the load should be induced for at least the length of the averaging window in order to get a decent calibration. The R² values for the fitted lines are shown in Table 1.

Table 1: The fit data for both linear and segmented regression

Model                       Multiple R²    Adjusted R²
Watt Meter Segmented        0.9989         0.978
Watt Meter Linear           0.9358         0.9287
IPMI Segmented              0.9946         0.9891
IPMI Linear                 0.9417         0.9352
IPMI Adjusted Segmented     0.9928         0.9857
IPMI Adjusted Linear        0.9285         0.9206


Once this model has been constructed using the IPMI data, CPU counters can then be used in conjunction with the generated model in order to obtain rapid and accurate values for the power consumption. In the APPENDIX: Evaluation of Energy Modeller's Accuracy by Inducing Load, the experimentation is continued in a virtualised environment in which load is applied and the outputs of the energy modeller are compared based upon the source of the initial calibration data. This may be relevant to TANGO should the use of containers be considered as part of the research agenda.
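The sketch below shows how such a calibrated two-segment linear model can be evaluated to estimate host power from CPU load; the break point and slopes are illustrative placeholders, not the coefficients fitted in the experiment.

case class SegmentedModel(breakLoad: Double,   // utilisation (%) where the slope changes
                          idleWatts: Double,
                          slopeLow: Double,    // W per % utilisation below the break
                          slopeHigh: Double) { // W per % utilisation above the break
  def powerAt(loadPercent: Double): Double =
    if (loadPercent <= breakLoad) idleWatts + slopeLow * loadPercent
    else idleWatts + slopeLow * breakLoad + slopeHigh * (loadPercent - breakLoad)
}

// Illustrative coefficients only: an idle of 117 W with power capping above 60% load.
val model = SegmentedModel(breakLoad = 60.0, idleWatts = 117.0, slopeLow = 1.8, slopeHigh = 0.1)
val estimatedWatts = model.powerAt(45.0)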

3.1.5 Conclusion and Future Work

In conclusion, the work on the energy modeller has focused on calibration as well as on scalability with regard to the usage of IPMI. This has led to a reduction of noise in the calibration data when using common but inaccurate sensors.

The error in the estimates of power over time was found to mildly underestimate the measured value, but it is largely centred around zero, so energy estimates over a period of time are likely to reflect the real value.

Future work will extend this to handling heterogeneous environments, where the usage of different accelerators such as GPUs and FPGAs will give rise to more complicated energy models. Notably, we will explore the use of the energy measurement probes for GPUs described in D3.1, which have just been made available on Nova 2. They will be used to calibrate power and energy consumption models for the energy modeller. As presented in APPENDIX: Runtime Overhead of GPU Energy Monitoring Probe, the GPU monitoring probe developed by TANGO generates a negligible overhead. We will also consider the RAPL-based sensors provided by SLURM as a means of providing higher accuracy, especially for CPU-bound applications. The models will then be tested for their accuracy via the use of HPC benchmarks and mini-apps that are part of the Bull use-case.

3.2 Design-Time and Deployment-time Optimiser – Placer

The requirement and design modelling tools are composed of several components that guide software analysts, architects and developers in making decisions on how to better exploit a heterogeneous hardware infrastructure. One of these components is in charge of exploring design-time and deployment-time optimisation. Such information can then be used by developers to determine which parts of an application should provide implementations deployable on different types of heterogeneous hardware, as well as compilation guidance for specific types of heterogeneous hardware.

In line with the above goal, this section presents Placer, a software tool that optimises the placement of software components onto heterogeneous multi-processing hardware platforms. It focuses exclusively on software structured as data flows, that is, made of software processes and data flows between these processes. The problem amounts to placing every such process onto processors of various classes (CPU, FPGA, co-processor, etc.) and routing every data flow onto the available communication resources that connect these processors (PCI bus, USB, Ethernet, ADSL, etc.).

Software processes, as supported by Placer, can have several implementations for different hardware classes (CPU, FPGA, etc.). Placer also finds the proper implementation for each such process.

Placer furthermore minimises energy consumption. It relies on a model of energy consumption to express its objective function. The energy consumption is defined as the sum of the energy

consumed by every processor. Each hardware class (CPU, FPGA, co-processor) has its own energy model, defined by a polynomial formula provided by the analyst/developers. In general the energy model will be derived from benchmarking different software components on the different types of hardware available.

This section is structured as follows:

• it presents the problem statement, detailing every aspect of the software and hardware models supported by Placer, as well as the objective function; the problem is illustrated using both mathematical notation and the Scala scripting notation supported by Placer;
• it presents some small examples and discusses the answers provided by Placer;
• it concludes with a recap, the set of current limitations, and future work.

3.2.1 Input problem

This section presents the software placement problem as supported by Placer. It presents the software and hardware models, as well as the energy model to minimise.

3.2.1.1 Hardware model

This section presents the hardware model as supported by Placer. We also refer to it as the hardware metadata.

3.2.1.1.1 Processor classes

First of all, the hardware model introduces processor classes. These define what CPUs, FPGAs, etc. are from the viewpoint of the placement problem, that is, with respect to which kind of size constraints need to be enforced and to their energy model. Both are defined through metrics. Each class supports its own metrics; for instance FPGAs have a number of gates and multipliers, while CPUs have a number of flops. These metrics are cumulative resources, that is, each piece of software that is run on a given hardware adds its value to these metrics.

Formally, each processor class c defines a set of metrics M_c = {m_c}.

These metrics are meant to express the capacity of each processor (number of gates, memory, flops, etc.). Each processor defines a maximal value for the metrics defined by its class.

For instance, the attributes of the FPGA and CPU classes can be defined as follows:

M_FPGA = {gates, multipliers}

M_CPU = {flops}

A processor class can be tagged as multi-task or as single-task. A single-task processor class can only run a single process, while a multi-task class can run several processes. Of course, processes must fit within the available metrics of the processor, both for single-task and multi-task processors. There is an exception to this rule because Placer supports the notion of a group of identical processes: a single-task processor can host any number of identical processes that belong to the same group of processes.

So there is an additional Boolean property for processor classes: multiTask_c

Here is the Scala code that declares the cpu and fpga classes introduced here above, together with a gpgpu class:

val cpu = ComputingHardware("cpu", SortedSet("flops"),
                            multiTask = true)
val fpga = ComputingHardware("fpga", SortedSet("kgate","multiplier"),
                            multiTask = true)
val gpgpu = ComputingHardware("gpgpu", SortedSet("core"),
                              multiTask = false)

We can see that a computing hardware class has a name (for debugging purposes), a set of metrics and the multiTask attribute. The script also defines the gpgpu class, with the number of cores used as its metric; it is set to not being a multi-task processor class.

3.2.1.1.2 Processors

A processor belongs to a defined processor class. Each processor defines a maximal value for the metrics defined by its class.

Formally, each processor p has a class c_p and a maximal value for each of the metrics defined by c_p. These are noted MAX_p = {max_p,m}.

For instance, one can define a processor class representing a brand of FPGA, which defines two metrics: number of gates and number of multipliers. There can be several instances of this FPGA class, say a small FPGA with a small number of gates and multipliers, and a larger FPGA with larger maximal values for these metrics.

Each processor also supports a model of power consumption in Watts. The dimension is somewhat arbitrary here since these values are manipulated without applying any operation related to the physics of power; for instance, we could equally consider that they represent an energy consumption in Joules.

The model is declared through a formula built using standard operators (+, -, *), constants, and the metrics defined by the processor class of the processor. The value considered for a metric mentioned in the energy model is the actual value once all software elements are placed, that is, the sum of the values contributed to this metric by each software process placed on this processor.

For the FPGA class example that defines the gates and multipliers metrics, we can imagine defining a small FPGA with the following values:

MAX_smallFPGA = {gates: 10000, multipliers: 500}

Energy: gates * 1 + multiplier * 200

This formula defines an energy consumption of 1 mJ per gate and 200 mJ per multiplier. Notice that the unit in use is only a convention; it suffices that the same convention is used throughout the model for the energy model to be coherent.
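As a worked example (the usage figures are purely illustrative): if the processes placed on this FPGA together account for 2000 gates and 100 multipliers, the modelled consumption is

2000 * 1 + 100 * 200 = 22000 units

and at full capacity (10000 gates and 500 multipliers) it would be 10000 * 1 + 500 * 200 = 110000 units.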

Here is a set of processors declared in the scripting language of Placer. ProcessorD is the small FPGA presented here above.

val processorA = Processor(cpu,
                           SortedMap("flops" -> 100),
                           "procA",
                           Usage("flops")*30)
val processorB = Processor(gpgpu,
                           SortedMap("core" -> 110),
                           "GPGPU",
                           Usage("core")*5)
val processorC = Processor(cpu,
                           SortedMap("flops" -> 60),
                           "procC",
                           Usage("flops")*20)
val processorD = Processor(fpga,
                           SortedMap("kgate" -> 10000, "multiplier" -> 500),
                           "FPGA",
                           Usage("kgate") + Usage("multiplier")*200)

3.2.1.1.3 Communication buses

A communication bus is the hardware through which processors can communicate. It can be an actual hardware bus (PCI, I2C, etc.) or represent a network bus (Ethernet, CAN, etc.). Only the mathematical properties are relevant to Placer.

Communications are quantified in bits per second, which we call bandwidth. The data flows are also quantified by their bandwidth. This metric is a cumulative resource: all communications occurring on a given communication bus sum up their bandwidths, and this sum must be smaller than or equal to the bandwidth supported by the communication bus.

Each bus b defines a value MaxBandwidth_b.

We distinguish two kinds of busses: symmetric shared support busses and single-way busses.

Symmetric shared support busses relate a set of processing elements together. All related processors can exchange information among them on such busses, in any direction. The support is however shared, so all communications occurring on such a bus have to share the available bandwidth.

Single-way busses only support flows from a set of from-processors to a set of to-processors. Again, the hardware support is shared, so concurrent flows sum up their bandwidths.

Here is a set of busses declared in the scripting language of Placer, relating the processors declared here above:

val globalBus = SymmetricSharedSupportBus(
                  List(processorA, processorB, processorC, processorD),
                  50,
                  "globalBus")
val busAB = SymmetricSharedSupportBus(
                  List(processorA, processorB),
                  100,
                  "busAtoGPGPU")
val busBC = SymmetricSharedSupportBus(
                  List(processorB, processorC),
                  100,
                  "busCToCoGPGPU")
val busAD = SingleWayBus(
                  List(processorA),
                  List(processorD),
                  100,
                  "busAToFPGA")

3.2.1.2 Software model

The model of software supported by Placer is a graph where nodes are processes and edges are directed and represent data flows.

We illustrate the software model using a dataflow application that mimics an image-processing workflow. The values given for the metrics of this software are arbitrary and are only meant to illustrate Placer.

3.2.1.2.1 Processes

A process can be either an atomic process or a group of atomic processes.

Atomic process

An atomic process is to be placed on a single processor. An atomic process is defined by the set of available implementations for the defined processor classes; it can have at most one implementation per processor class. For each supported processor class, it defines the values of the metrics of that processor class.

Let i be a process; it defines a mapping X_i between processor classes c and implementations.

For instance, we can represent a process called inputting that has two implementations, one for FPGA and one for CPU, as follows:

X_inputting = {FPGA → {gates = 1000, multipliers = 300}, CPU → {flops = 30}}

Here is a set of atomic processes declared in the scripting language of Placer, using the processor classes declared here above. The inputting process is the first one declared:

val inputting = AtomicProcess(
      SortedMap(cpu  -> SortedMap("flops" -> 30),
                fpga -> SortedMap("kgate" -> 1000, "multiplier" -> 300)),
      "inputting")
val decoding = AtomicProcess(
      SortedMap(cpu   -> SortedMap("flops" -> 50),
                fpga  -> SortedMap("kgate" -> 1000, "multiplier" -> 200),
                gpgpu -> SortedMap("core" -> 100)),
      "decoding")
val watermarking = AtomicProcess(
      SortedMap(cpu   -> SortedMap("flops" -> 30),
                fpga  -> SortedMap("kgate" -> 20, "multiplier" -> 30),
                gpgpu -> SortedMap("core" -> 100)),
      "watermarking")
val encoding = AtomicProcess(
      SortedMap(cpu -> SortedMap("flops" -> 31)),
      "encoding")

Group of processes

A group of processes is a set of identical atomic processes. The number of such processes in the group is defined by the group. Processors that are declared as single-task can run several atomic processes of the same process group.

A group of processes g defines an atomic process a_g with the properties explained above, and specifies how many instances of this process actually exist through a constant n_g.

As explained above, a single-task processor can run any number of atomic processes that belong to the same group.

This notion of group is needed because it allows the engine to reason at a more symbolic level than if the individual processes were explicitly created. Such an explicit approach would introduce a lot of symmetries: swapping two identical processes would lead to a solution that is indistinguishable from the previous one. Such symmetries make Placer very inefficient because it would try to explore all such permutations, and there is a large number of them. Using groups of processes when needed is a way of avoiding this source of inefficiency.

For instance, we can define a group of processes called transforming that is made of 12 instances of the same atomic process a_atomicTransforming, which is defined by its implementations as follows:

X_atomicTransforming = {GPGPU → {cores = 1}, CPU → {flops = 4}}

and specifies the number of such atomic processes through n_transforming = 12.

The corresponding Scala script supported by Placer is as follows:

val atomicTransform = AtomicProcess(
      SortedMap(cpu   -> SortedMap("flops" -> 4),
                gpgpu -> SortedMap("core" -> 1)),
      "atomicTransforming")
val transforming = ProcessGroup(atomicTransform, 12, "transforming")

3.2.1.2.2 Data flows

A data flow is defined by its source and destination processes and by its bandwidth. Each flow is mapped onto a bus that relates the processors where the considered processes are executed. A flow cannot be spread over several busses. Also, flows are not transitive: they must be mapped on a bus that directly relates the considered processors; a flow cannot perform any intermediary hops. In case the processes related by a flow are located on the same processor, the flow is not mapped to any bus, since such communication happens internally to the processor.

Besides, Placer supports specific models of flows to represent that some data can be shared by members of a group if they are on the same processor. For this reason, there are three classes of flows supported by Placer:

• between two atomic processes
• between an atomic process and a group of processes
• between an atomic process and the atomic process that constitutes a group of processes

All these classes of flow relate two processes and are directed. Placer supports both directions for each of these classes of flows.

A flow between two atomic processes must be mapped on a bus that relates the two processors where the atomic processes are placed. The bus must support a flow in the proper direction, and the bandwidth of the flow adds to the used bandwidth of the bus.

A flow between an atomic process and the atomic process defining a group of processes actually represents a set of flows. Each of these flows relates the atomic process to one instance of the atomic process of the group; there are thus as many such flows as there are members of the group.

A flow between an atomic process and a group of processes is a special case, because the data coming through this flow is actually shared by every member of the group. It makes more sense if the flow originates from the atomic process and goes to the group of processes. Such a flow is therefore mapped onto communication busses in such a way that the required bandwidth is reserved between the atomic process and each processor where at least one member of the group is placed.

Here is a set of flows related to the considered software application. There is no distinction between the three supported classes of flows in the scripting language.

val inputToDecode = Flow(inputting, decoding, 50, "inputToDecode")
val decodeToTransform = Flow(decoding, atomicTransform, 2, "decodeToTransform")
val transformToWatermark = Flow(atomicTransform, watermarking, 2, "transformToWatermark")
val watermarkToEncode = Flow(watermarking, encoding, 20, "watermarkToEncode")
val sideComm = Flow(inputting, encoding, 5, "side_comm")
val comToGroup = Flow(decoding, transforming, 1, "group_comm")

3.2.1.3 Objective function: energy model

The objective function is the power consumption. It is provided to Placer through the power models of the processors. Placer selects the placement that minimises the total power consumption, which is the sum of the power consumed by each processor of the considered hardware.
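Written with the notation introduced for the hardware model, and denoting by power_p the power formula declared for processor p, evaluated on the metric values accumulated on p by the placed processes, the quantity minimised by Placer is simply:

totalPower == Sum over all processors p of power_p(accumulated metrics on p)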

3.2.1.4 Constraints

The constraints have already been introduced here above; this section recaps them, and a small checking sketch is given at the end of the section.

Capacity constraints

• Capacity constraints on the processors: for each of the metrics defined on the processor class, the sum of the values of this metric over all software elements placed on a given processor must be lower than or equal to the maximal value of this metric for the given processor.
• Capacity constraints on the bandwidth of busses: the sum of the bandwidths of the data flows placed on a given bus must be lower than or equal to the bandwidth of the bus.


Communication adjacency constraints

• each signal routed on a bus must come from a process located on a processor that can emit on this bus, and must go to a process placed on a processor that can receive from this bus

Implementation constraint

• if a process is placed on a processor, it must have an implementation for the class of processor that the processor belongs to
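To make the constraint semantics and the objective concrete, here is a minimal sketch, with illustrative types that are not Placer's API, that checks the capacity and implementation constraints above for a fixed placement of atomic processes and evaluates the energy objective (flow routing and bus capacity are omitted for brevity):

case class ProcessorSpec(name: String,
                         clazz: String,
                         maxMetric: Map[String, Int],
                         power: Map[String, Int] => Int)       // power model over accumulated metrics

case class ProcessSpec(name: String,
                       implems: Map[String, Map[String, Int]]) // processor class -> metric usage

def checkAndEvaluate(placement: Map[ProcessSpec, ProcessorSpec]): Option[Int] = {
  val byProcessor = placement.groupBy(_._2)
  val feasible = byProcessor.forall { case (proc, assigned) =>
    assigned.keys.forall(p => p.implems.contains(proc.clazz)) &&   // implementation constraint
      proc.maxMetric.forall { case (metric, max) =>                // capacity constraint
        assigned.keys.map(_.implems(proc.clazz).getOrElse(metric, 0)).sum <= max
      }
  }
  if (!feasible) None
  else Some(byProcessor.map { case (proc, assigned) =>
    val usage = proc.maxMetric.keys.map(m =>
      m -> assigned.keys.map(_.implems(proc.clazz).getOrElse(m, 0)).sum).toMap
    proc.power(usage)                                              // energy model of each processor
  }.sum)
}

Applying the same accounting by hand to the running example and to the mapping reported in Section 3.2.2 gives 61*30 = 1830 for procA (inputting and encoding), 1000 + 200*200 = 41000 for the FPGA (decoding), 100*5 = 500 for the GPGPU (watermarking) and 48*20 = 960 for procC (the 12 transforming instances), i.e. the energyConsumption of 44290 reported with that mapping.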

3.2.2 Output: a placement

Once given the problem statement defined here above, Placer searches for a mapping of each software process onto the available processors, and of the data flows onto the available communication busses. The image-processing-inspired problem leads to the following placement, which is optimal with respect to energy consumption:

Some(Mapping(
    inputting -> procA
    decoding -> FPGA
    watermarking -> GPGPU
    encoding -> procA
    transforming -> {procC:12}
    Flow(group_comm decoding->transforming) to procC -> globalBus
    Flow(watermarkToEncode watermarking->encoding) -> globalBus
    Flow(inputToDecode inputting->decoding) -> busAToFPGA
    Flow(side_comm inputting->encoding) -> local loop
    Flow(transformToWatermark atomicTransforming->watermarking) from procC(12 flows) -> {globalBus:2, busCToCoGPGPU:10}
    Flow(decodeToTransform decoding->atomicTransforming) to procC(12 flows) -> {globalBus:12}
    energyConsumption:44290))

This placement defines, for each process, where the process is to be executed.

inputting -> procA
decoding -> FPGA
watermarking -> GPGPU
encoding -> procA

The 12 processes of the process group "transforming" have been placed on processorC:

transforming -> {procC:12}

The simple flows are routed onto busses:

Flow(group_comm decoding->transforming) to procC -> globalBus
Flow(watermarkToEncode watermarking->encoding) -> globalBus
Flow(inputToDecode inputting->decoding) -> busAToFPGA


One of them is assigned to local loop. It means that the source and target of the flow are actually located on the same processor, so that the information flows locally on the same processor:

Flow(side_comm inputting->encoding) -> local loop

The flows between atomic processes and members of the group of process have been dispatched on busses as well. Notice that each of these flows is treated as an atomic flow. In this case, the flows from “transformToWatermark” have been spread between globalBus and busCToCoGPGPU.

Flow(transformToWatermark atomicTransforming->watermarking) from procC(12 flows) -> {globalBus:2, busCToCoGPGPU:10}
Flow(decodeToTransform decoding->atomicTransforming) to procC(12 flows) -> {globalBus:12}

The energy consumption, as computed by the model is also given in the mapping:

energyConsumption:44290))

3.2.3 Implementation of Placer

3.2.3.1 About the underlying technology

Placer has been implemented using a CP solver, namely OscaR.CP [18]. Such a solver performs a symbolic exhaustive search: exhaustive because it tries all possible mappings, hence finds the best possible one; symbolic because it does not enumerate them explicitly but is instead able to evaluate batches of such mappings without enumerating them.

In such an engine, and in our implementation, the objective function is handled through a branch-and-bound approach. The engine searches for a mapping; once one is found, the objective function is evaluated and the search is restarted with an additional constraint stating that the next solution must strictly improve on the evaluated objective value.

The implementation is a straight translation of the constraints into OscaR.CP, with a few additional redundant constraints that are meant to speed up the engine. These redundant constraints are related to the connectivity of the communication busses.

So far, only the core engine has been developed, with a few hard-coded examples.

3.2.3.2 Declaring the placement problem to the solver

We briefly explain here how the placement problem is declared to the CP engine, focusing only on atomic processes and on the flows between them.

In particular, declaring processes, flows and all the other elements of the problem space in a form understood by the CP engine of OscaR is required because the solver must remain totally agnostic of the different types of optimisation problems it needs to solve. The CP engine is only able to reason on Boolean and integer variables. To make this possible, a global numbering is performed: processors (resp. processes) are numbered from 0 to the number of processors (resp. processes) minus one, for instance. Detailed information on programming for the OscaR CP engine is available on the OscaR wiki1.

1 https://www.info.ucl.ac.be/~pschaus/cp4impatient/


3.2.3.3 Declaring a processor

A processor proc is declared by specifying a set of integer variables representing the metrics of the class of the processor. Let m be a metric; the variable representing its value is C_m. These variables are bounded by the maximal value given in the processor specification:

C_m <= MAX_proc,m

For each metric, an accumulator list acc_m is also created, where the translation of each process can accumulate a variable representing the amount of this metric it needs on the processor. When all processes are posted, a constraint is posted that forces the variable representing the metric to be equal to the sum of the accumulated variables:

C_m == Sum(acc_m)

3.2.3.4 Declaring an atomic process

Let process be an atomic process. An integer variable V_process is associated with this process; it represents the id of the processor where the process is to be run.

A processor of a given class can host a process only if the process has an implementation for this class of processor, so a set of constraints is posted on V_process to exclude the forbidden processors for this process:

V_process != forbidden_processor

For each allowed processor proc:

    A Boolean variable IsAssigned_process,proc is created. It represents whether the process runs on proc; the appropriate channelling constraint is posted between this variable and V_process:

    IsAssigned_process,proc == (V_process == proc)

    Let X_process(class_of_proc) be the sizes of the process for the metrics of the class of proc.

    For each metric m of processor proc:

        The value IsAssigned_process,proc * X_process(class_of_proc)(m) is added to the accumulator acc_m of processor proc.

    End for

End for

3.2.3.5 Declaring a bus

We only consider asymmetric busses, with a set of origins and a set of destinations. Symmetric busses are asymmetric busses whose origin and destination are the same set of processors.

A bus b is therefore represented by a set of origin processors origin_b, a set of destination processors dest_b, and a maximal throughput; the integer variable used_bandwidth_b, bounded by this maximal throughput, represents the bandwidth actually used on the bus.

An accumulator busAcc_b is created; it is a list of integer variables where the translation of each flow can accumulate a variable representing the amount of bandwidth it needs on this bus. When all flows are translated, a constraint is posted that forces the variable representing the used bandwidth of the bus to be equal to the sum of the accumulated variables:

used_bandwidth_b == Sum(busAcc_b)

3.2.3.6 Declaring a flow between two atomic processes

Let flow be a flow between the atomic process origin_flow and the atomic process destination_flow, with bandwidth bandwidth_flow.

This flow is assigned an integer variable V_flow representing the bus where the flow is routed.

The flow must obey two adjacency constraints:

from the processor of origin_flow to the selected bus, and from the selected bus to the processor of destination_flow.

To represent this adjacency, we use the table constraint of OscaR. A table constraint inputs two integer variables and a set of couples representing the allowed combinations of values for these variables.

The first adjacency is implemented through

Table(V_origin_flow, V_flow, ProcToBusConnectivity)

where ProcToBusConnectivity represents the set of couples (processor, bus) such that processor belongs to origin_bus.

The second adjacency is implemented through

Table(V_flow, V_dest_flow, BusToProcConnectivity)

where BusToProcConnectivity represents the set of couples (bus, processor) such that processor belongs to dest_bus.

To make the engine faster, a redundant constraint is added between V_origin_flow and V_dest_flow. This constraint represents the fact that there is at least one bus that can relate the processor of the origin to the processor of the destination. This redundant constraint is implemented through a table constraint

Table(V_origin_flow, V_dest_flow, ProcToProcConnectivity)

where ProcToProcConnectivity is the set of couples (proc1, proc2) such that there is a bus bus where proc1 belongs to origin_bus and proc2 belongs to dest_bus.

The bandwidth of the bus is reserved by adding a variable to the accumulator of each potential bus where the flow can be routed. This variable is bandwidth_flow,bus; it is equal to the bandwidth of the flow if the flow is routed on the bus, and to zero otherwise:

bandwidth_flow,bus == ((V_flow == bus) * bandwidth_flow)
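As an illustration of how these tables can be derived from the bus metadata, here is a minimal sketch with illustrative types (it is not the OscaR.CP code used by Placer), building the three connectivity tables from the origin and destination sets of the busses:

case class Bus(id: Int, origin: Set[Int], dest: Set[Int])   // processors identified by their ids

def connectivityTables(busses: Seq[Bus]): (Set[(Int, Int)], Set[(Int, Int)], Set[(Int, Int)]) = {
  val procToBus  = busses.flatMap(b => b.origin.map(p => (p, b.id))).toSet          // (processor, bus)
  val busToProc  = busses.flatMap(b => b.dest.map(p => (b.id, p))).toSet            // (bus, processor)
  val procToProc = busses.flatMap(b => for (p1 <- b.origin; p2 <- b.dest) yield (p1, p2)).toSet
  (procToBus, busToProc, procToProc)
}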

3.2.4 Related work

The company Silexica provides a tool that maps software applications onto heterogeneous MPSoCs and performs data-flow reasoning [18]. It first extracts a data-flow model from the software, hence identifies opportunities for parallelisation, computes a mapping of the data-flow onto the hardware, and realises the mapping by generating appropriate source code. Our approach focuses exclusively on computing the mapping between a data-flow model and the target hardware platform. The supported models differ in a few aspects: first, Silexica supports the notion of timing throughout its model, and thus includes delays, task durations, etc.


However, it only supports a single implementation per task, although it can support heterogeneity. Its optimisation engine performs a greedy search and is claimed to perform a "best effort". Our data-flow mapper does not include timing aspects so far, yet its search heuristic is complete in its current version, meaning that it finds the optimal mapping.

SDFpy is a library for the management and analysis of synchronous dataflow graphs, written in Python. Synchronous dataflow (SDF) graphs model stream processing systems [19]. An SDF graph represents a function, or computation, which is applied to an infinite stream of data. Nodes in an SDF graph act as functions: they map input (data) streams, taken from incoming edges, to output streams, which are produced onto outgoing edges. The model allows for temporal analysis: by assigning (worst case) processing times (execution times) to the graph’s nodes, one may compute how much data can be processed by an SDF graph, per time unit. SDFpy is capable of performing several analyses on SDF graphs, including throughput analysis, which computes the performance bottleneck of the graph. Performance bottlenecks indicate which sequences of computations limit the maximum throughput of the graph, and thus provide valuable insight into which parts of the computation should be improved, or assigned more resources. This tool does not actually find a mapping of the data-flow graph onto the hardware.

Complementary to research on optimal mapping such as Placer, the effort to characterise hardware and software realistically is important: without appropriate characterisation, the result obtained by any mapping tool cannot be guaranteed to be optimal. In order to obtain an appropriate characterisation, TANGO proposes either to rely on profiling the particular algorithms used in an application through a rapid prototyping approach, as presented in Section 4, or, during year 2, to explore an approach relying on device emulators or simulators.

3.2.5 Conclusion

The Placer tool presented here aims to help during the development phase, where the development team provides different implementations of the same algorithms used in an application, each implementation targeted to run on a different type of hardware found in a heterogeneous infrastructure. When several algorithms used in an application each propose several implementations targeted at different hardware, it can quickly become complex to determine the optimal allocation of these implementations onto the hardware available. In some cases, this placement will be fixed at development time or at deployment/installation time; it is during this phase that Placer is useful. It will assist developers in deciding which placement to fix prior to runtime and what to keep flexible for runtime adaptation. At the moment, the use of Placer in the context of TANGO is mostly targeted at situations where custom FPGA implementations for portions of an application are envisaged. However, scenarios more typical of HPC will also be explored if such use cases find Placer useful in any way.

The main limitation of the model used by Placer is that it does not encompass timing aspects. Software tasks are not continuous; they are actually repeating batches that take some time, hence introduce a delay. Communications on busses also introduce a delay, and communications on a given bus might actually not interact with each other if they do not overlap in time, etc. Introducing a model of time is a very important open issue for future work. Within such timing aspects, one can also encompass hardware knobs that allow the frequency of the processors to be adapted dynamically; such knobs might also be included in the optimisation process.


A related open issue is the memory sizing. It is directly related to timing because some memory buffer might be needed to store data in-between a transmission and the start of the computation that will consume it.

A third open issue is that a transmission takes place on a single bus and can only relate two adjacent processors; it cannot transit through intermediary processors. Similarly, network topologies such as a network-on-chip or the network of a data centre are not handled by Placer so far.

Placer is still a prototype without any declared API, file format or associated IDE. Providing such supporting technology is also to be considered, although it might be postponed until the model supported by Placer is rich enough to provide useful support in the design of multi-core applications.

3.3 Programming Model – COMPSs/OMPSs Integration

This section presents the scientific report of the research performed in Y1 for the Programming Model and Runtime Abstraction Layer components. This research focuses on providing a programming model which facilitates the implementation of applications for distributed heterogeneous parallel architectures, and a runtime system which abstracts the architecture singularities and manages the spawning of the computation on the different heterogeneous devices as well as the required data movements between the different storage devices and memories.

3.3.1 Motivation and purpose

Over the last years, the computing ecosystem has become more and more heterogeneous. On the one hand, trends in computer architectures focus on providing different computing devices (CPUs, GPUs and FPGAs) and memories in a single chip or computing node, with the aim of providing better-suited computing devices for the different types of algorithms and applications. On the other hand, supercomputers, which have traditionally been composed of a large number of homogeneous nodes, are starting to be composed of heterogeneous nodes with different cores, accelerators and memory capacities in order to achieve better performance with lower energy consumption. A consequence of this heterogeneity is that we have machines with powerful computing devices, but users must know how to use them well, because executing the same algorithm on one device or another can yield different results in terms of performance and energy consumption. Therefore, selecting the proper device for each part of an application is a key factor in achieving an efficient application execution.

Moreover, programming these heterogeneous platforms is not an easy task. For each accelerator, the developer has to add code to manage transfers between device memories, to spawn processes on these devices, etc. For that reason, in the TANGO project we propose a programming model to facilitate the development and execution of applications for the next generation of distributed heterogeneous parallel architectures.

3.3.2 Related Work

As introduced in the previous paragraphs, computing nodes are incorporating different types of devices in order to be more efficient when computing different types of applications, either by accelerating the computation or by reducing the energy consumed. However, this has brought more complexity to application development: each of these devices has its own programming language or API to spawn computation on it. For instance, FPGAs are traditionally programmed with the VHDL language, and to deploy and run the computation, developers have to use the tool chain provided by the FPGA vendor. A similar problem occurs with general-purpose GPUs: NVIDIA offers the CUDA framework for programming and running applications on its devices, and other vendors offer similar frameworks to do the same.


Current research focuses its efforts on reducing the complexity of programming these heterogeneous nodes, as well as on providing portability between architectures, allowing code to be reused for similar devices. One example is OpenCL [20]. It was born with the ambition of providing a common programming interface for heterogeneous devices (including not only GPUs, but also DSPs and FPGAs). With a syntax heavily based on C, it has had a significant impact because the same code can be used on several accelerators. However, similarly to CUDA, it requires the programmer to write specific code for device handling, which reduces programmability. OpenACC [21] is another example of a programming standard for parallel computing, designed to simplify parallel programming of heterogeneous CPU/GPU systems. Based on directives, the programmer can annotate the code to indicate those parts that should run on the heterogeneous device. The OpenMP standard tackles the programmability issues in a similar way to OpenACC with regard to heterogeneous devices and also covers many other aspects of parallelism, which makes it a stronger option.

Finally, OmpSs [22] is a task-based programming model proposed by BSC which promotes both programmability and portability of code by hiding the details of the underlying architecture from the programmer (i.e., aspects such as allocation of memory on the device or data transfers are performed automatically by the runtime), while exploiting the inherent parallelism of the software application architecture by analysing the data dependencies between its software tasks.

However, these solutions only manage the heterogeneity inside a node. If the application needs to run on several nodes (e.g. because of a large amount of data or a large degree of parallelism), the solutions mentioned before must be combined with other frameworks that manage the spawning of processes and the data movements between the different computing nodes. Developers can either try to do this by hand using TCP/IP and threading libraries, which requires a lot of programming effort and skill, or use one of the parallel distributed computing frameworks. One of these frameworks is MPI, which provides an API for exchanging data between the different processes of SPMD applications. Another option are PGAS programming models such as UPC, which allow a global address space to be created and shared-memory programs to be used across different nodes. Both options work quite well for running Single Program Multiple Data (SPMD) applications on homogeneous clusters interconnected with a very fast network. However, in heterogeneous environments distributed across different locations they struggle to reach good performance. Finally, COMPSs is the sibling of OmpSs for distributed computing: it applies the same concepts but, instead of distributing tasks on the different devices in a node, it distributes tasks across the different computing nodes, taking node heterogeneity into account.

3.3.3 Scientific Contribution

The TANGO programming model consists of the combination of the StarSs programming models and runtimes developed at the Barcelona Supercomputing Center (BSC). StarSs is a family of task-based programming models where developers define some parts of the application as tasks, indicating the direction of the data required by those tasks. Based on these annotations, the programming model runtime analyses the data dependencies between the defined tasks, detecting the inherent parallelism and scheduling the tasks on the available computing resources, managing the required data transfers and performing the task execution. The StarSs family is currently composed of two frameworks: COMP Superscalar (COMPSs), which provides the programming model and runtime implementation for distributed platforms such as clusters, grids and clouds, and Omp Superscalar (OmpSs), which provides the programming model and runtime implementation for shared-memory environments such as multicore architectures and accelerators (such as GPUs and FPGAs).


In the case of TANGO, we propose to combine COMPSs and OmpSs in a hierarchical way, where an application is composed of a workflow of coarse-grain tasks developed with COMPSs. Each of these coarse-grain tasks can be implemented as a workflow of fine-grain tasks developed with OmpSs. At runtime, coarse-grain tasks are managed by the COMPSs runtime, which optimises the execution at the platform level by distributing tasks across the different compute nodes according to the task requirements and the cluster heterogeneity. Fine-grain tasks, on the other hand, are managed by OmpSs, which optimises the execution at the node level by scheduling them on the different devices available in the assigned node.

This combination presents different advantages with respect to other approaches:

First, it allows developers to implement parallel applications for a distributed heterogeneous resource environment without changing the programming model paradigm. The programmer does not need to learn or invoke any API inside the code; it is only necessary to decide which segments of code correspond to different tasks and to annotate them, indicating the direction (in/out) of the task data parameters.

Second, developers do not have to program data movements explicitly as in MPI. The programming model analyses the data dependencies and keeps track of the data locations during the execution, so it will try to schedule tasks as close to the data as possible, or transparently perform the required data transfers, in order to exploit the maximum parallelism.

Third, we have extended the versioning and constraints capabilities of these programming models. With these extensions, developers are able to define different versions of tasks for different computing devices (CPUs, GPUs, FPGAs) or combinations of them, so that the same application can adapt to the different capabilities of the heterogeneous platform without having to be modified. During the execution, the programming model runtime is in charge of optimising the execution for the available resources in a coordinated way. At the platform level, the runtime schedules the tasks on the different compute node resources, deciding which tasks can run in parallel in each node and ensuring, through the affinity of tasks to devices, that the different tasks do not collide in their use of resources. At the node level, the runtime is in charge of scheduling the fine-grain tasks on the resources assigned by the platform-level scheduling.

3.3.4 Evaluation


To evaluate the functionalities developed during Y1 in the Programming Model and Runtime Abstraction Layer components, we have implemented a benchmark application with the TANGO programming model which performs a matrix multiplication by blocks at two levels. The first level splits the matrices into blocks and computes the matrix multiplication block by block; each block multiplication is defined as a coarse-grain task. Each matrix block can in turn be decomposed into smaller blocks, and each block multiplication can be decomposed as a workflow of smaller block multiplications. Figure 11 shows the main code of the benchmark application, where a loop of multiplyBlocks coarse-grain tasks is invoked.

int main(int argc, char **argv) {
    int N = atoi(argv[1]);   // assumed: number of blocks per dimension
    int M = atoi(argv[2]);   // assumed: block size
    compss_on();

    cout << "Loading Matrices...\n";
    Matrix A = Matrix::init(N,M);
    Matrix B = Matrix::init(N,M);
    Matrix C = Matrix::init(N,M);

    cout << "Executing Multiplication...\n";
    // assumed loop bounds: N blocks per dimension
    for (int i=0; i<N; i++) {
        for (int j=0; j<N; j++) {
            for (int k=0; k<N; k++) {
                C.data[i][j]->multiplyBlocks(*A.data[i][k], *B.data[k][j]);
            }
        }
    }
    compss_off();
}

Figure 11. Matrix Multiplication main code where the multiplyBlocks coarse-grain tasks are invoked


interface Matmul {
    @Constraints(processors={
        @Processor(ProcessorType=CPU, ComputingUnits=4)});
    void Block::multiply(in Block block1,
                         in Block block2);

    @Constraints(processors={
        @Processor(ProcessorType=GPU, ComputingUnits=1)});
    @Implements(Block::multiply);
    void Block::multiplyGPU(in Block block1,
                            in Block block2);
};

Figure 12. Matrix Multiplication coarse-grain interface file

Figure 12 shows the interface file where the developer defines the methods that are declared as tasks. In this case we have defined a task with two implementations: one which runs on 4 CPU cores and another which runs on a GPU. Finally, Figure 13 depicts the implementation of the big-block multiplication. In the first case, the fine-grain tasks are the computations of the different elements of the resulting matrix block. In the second case, the big matrix block is decomposed into smaller blocks in order to fit in the GPU device memory, and the finer-grain tasks are defined as the multiplications of these small blocks; the fine-grain task in this case is the CUDA kernel defined by the Muld function.

We have run this application with different configurations in the TANGO Nova 2 testbed. First, we ran the application using only CPUs and then only GPUs, as we could do with state-of-the-art programming models such as CUDA, OpenCL or OpenMP. Then, we ran the application implemented with the TANGO C/OMPSs Programming Model, where we can exploit different combinations of CPU and GPU resources.

Figure 13. Implementation of the block multiplication coarse-grain task: the CPU version (Block::multiply, a nested loop over the block elements) and the GPU version based on the Muld CUDA kernel

In summary, we have seen that by using the TANGO C/OMPSs Programming Model and its runtime, we can achieve an important gain when the application requires intensive usage of all the computing resources. In contrast, we have seen that in some situations the usage of one type of resource penalises the execution, as is the case with small blocks, where using just CPUs is more efficient than the combined CPU/GPU execution. So, to get the best efficiency in all cases, we should incorporate a mechanism to indicate when a task version is not efficient, either in the Design Phase, in the Implementation Phase when using the Programming Model, or at Runtime.

3.3.5 Conclusion and Future Work

In this first year, we have implemented the prototype of the TANGO C/OMPSs programming model and the Runtime Abstraction Layer. The programming model facilitates the implementation of parallel applications on distributed heterogeneous resources without changing the programming model paradigm and without having to use any API to spawn the computation on the different heterogeneous devices distributed across different compute nodes.

Moreover, the runtime abstraction layer analyses the data dependencies between tasks and keeps track of the data locations during the execution, so it tries to schedule tasks as close to the data as possible, or transparently performs the required data transfers, to exploit the maximum parallelism. In addition, thanks to its versioning capabilities, it selects the implementation which best fits the available resources, choosing the one which provides better performance. So, the same application is transparently adapted to the different capabilities of the heterogeneous platform without having to be modified.

The results obtained in the first year are the basis for the second and third year research. In the next years, we are going to exercise the programming model and runtime with complex applications which define different tasks. We will evaluate how the runtime currently performs with different tasks, which can have differing efficiency on the different resources. We will look for heuristics and algorithms to tackle the multi-objective scheduling of coarse-grain tasks, in order to find the best scheduling of the different tasks taking different factors into account. Then, we will also study how the programming model can interact with other TANGO components, like the Device Supervisor, in order to achieve more efficient execution by releasing or partially powering off the devices which are not currently used by the application.

3.4 Code Optimiser Plugin for Energy Profiling of Java

3.4.1 Motivation and purpose

The Code Optimiser Plugin (COP) is a standalone plug-in that helps reduce the power and energy consumption of an application. This is achieved by facilitating the measurement of power consumption as part of the software development process, providing Java software developers with the ability to directly understand the energy footprint of the code they write.

3.4.2 Related Works

There are two works that relate strongly to the functionality and potential scientific outcome of the COP component. They go some way towards providing application assessment and profiling at the software development level, but have deficiencies. The first, JVM Monitor [23], is a general-purpose Java profiler integrated with Eclipse to monitor the CPU, thread and memory

usage of Java applications, but it does not consider energy consumption. The second tool is JouleUnit [5], an Eclipse-based workbench that provides tools for the visualisation of energy profiling results on a per-unit-test basis and that has been designed for the Android mobile smartphone operating system.

3.4.3 Scientific Contribution

The novelty of this component lies in its generic Java profiling capabilities (beyond those available in the discipline of mobile computing), which enable the energy assessment of code out-of-band of an application's normal operation, within a developer's IDE. It facilitates the measurement of the energy consumption of an application at the method level, enabling developers to understand where most power is being consumed within the application. This gives rise to the prospect of focusing code optimisation efforts on the areas that will gain the most benefit.

3.4.4 Evaluation

The COP provides a visual demonstration of the energy consumption of an application or of part of an application, as can be seen in the screenshot provided in the following figure.

Figure 14: COP Power Graph Screenshot

This power-usage graph is generated by taking calibration data from the physical host that maps resource usage to power consumption, then closely monitoring the Java Runtime Environment's (JRE) utilisation and mapping it to power consumption.

3.4.5 Conclusion and Future Work

To conclude, the COP is a component that has emerged this year, providing many exciting avenues to explore both power and total energy consumption within applications that are under development. This gives rise to the prospect of enhancing the software development of Java applications by identifying code energy hotspots and addressing them. In the future this will be extended to include other programming languages that may be developed within the Eclipse IDE.


4 Benchmarking Mini Apps on Time and Energy Performance

This section shows how to benchmark jobs implemented using various programming models (COMPSs+OmpSs, C/Poroto, MPI) on various heterogeneous hardware. Initially, a very simple example based on matrix multiplication illustrates how it can be implemented in the different programming models and how the resulting executables can be executed on different heterogeneous hardware. Subsequently, slightly more complex applications, namely Hydro and an NBody simulation, are each used to show a more realistic case where conducting benchmarking at development time provides the necessary information to help analysts and developers make decisions on the granularity of the jobs to prepare for execution on different types of hardware such as multi-core CPU, GPU or FPGA.

Importantly, the benchmarking exercise is not meant to compare programming models but rather to show how an implementation in a given programming model can obtain measurements of how it performs when executing jobs on different types of hardware. Benchmarking results should help developers and analysts determine whether the current implementation should favour executing on one particular type of heterogeneous hardware or another, taking into account application knowledge such as the dataset size. A benchmarking exercise will also help to determine if it is worth rethinking the current algorithmic implementation to better exploit other types of heterogeneous hardware.

4.1 Matrix Multiplication

Matrix multiplication provides a simple case where, first, parallelisation can easily be implemented in different programming models and, second, the size of the input datasets can be varied easily. Thus, the next subsections explore how different implementations of matrix multiplication can be profiled for time and energy performance to establish benchmark measurements on different types of heterogeneous hardware.

4.1.1 Benchmarking TANGO C/OMPSs Programming Model of Matrix Multiplication on Nova 2 In this section, we provide the results of benchmarking the Matrix Multiplication implemented with the TANGO Programming Model, which is based on a combination of the COMPSs and OmpSs programming models and runtimes. The application has implemented as detailed in Section ¡Error! No se encuentra el origen de la referencia., where a multiplication of a big matrix is computed by blocks and each of this block-to-block multiplications is defined as a coarse-grain tasks, which internally parallelized with fine-grain tasks. Different coarse-grain tasks version has been implemented to enable the execution on the different heterogeneous devices available in the Nova 2 testbed.

4.1.1.1 Benchmarking Context For the benchmarking addresses in this section, we have selected a heterogeneous node of the Nova2 clusters which contains 2 Intel Xeon CPUs with 8 cores each, 128 GB of RAM memory and 2 NVIDIA Tesla K20 GPUs. To measure the energy consumption and to extract the power profile, we will use the monitoring infrastructure provided by the TANGO middleware tools installed in the Nova 2 cluster.

4.1.1.2 Benchmark results We have run this application with different configurations and execution environments (matrix sizes, different number of block decompositions and CPU /GPU configurations) in order to see how the application and runtime perform in each situation. Results are summarized in the following tables.

TANGO Consortium 2016 Page 49 of 74 D3.2 TANGO Toolbox – Alpha Version Scientific Report Version: v1.2 – Final, Date: 23/12/2016

Resource Type CPU GPU Task Configuration 16 14 7 2 1 Block Size Time Energy Time Energy Time Energy Time Energy Time Energy 256 11,30 2,68 11,40 4,64 10,70 2,67 22,10 5,01 19,10 4,36 512 13,40 3,17 13,30 3,09 13,35 3,06 24,10 5,45 21,70 4,90 1024 23,40 5,59 23,70 5,62 24,40 5,31 32,80 7,47 32,20 7,16 2048 81,30 19,32 87,10 19,42 117,60 23,14 79,80 17,22 95,70 19,06 4096 436,30 103,02 448,00 110,13 806,60 180,25 282,00 58,20 435,20 83.8 Table 2: Execution Time (seconds) and Energy Consumption (KJoules) for a Single Coarse-Grain Task (Block matrix multiplication) Using Different Configuration of Single Type of Device. Best Values for each device type are in Bold

Table 2 shows the execution time and the energy consumption when running the execution of the Matrix Multiplication with a single coarse-grain task configured to use different resources. For instance the 16 CPU column show the execution of the Matrix Multiplication with a single coarse-grain tasks which allow to use 16 cores to run the fine-grain tasks defined inside the coarse-grain task implementation. In the case of GPUs, we run the coarse grain task configured to use one or two GPUS. The result obtained shown that, there is a trade-off between the level parallelization and the block size. For small blocks we can see that we are not improving either in energy or time by using a larger number of cores. This mainly due to thread data management. The serialization of the matrix blocks has a big impact when using small blocks. Moreover, the runtimes have a larger management overheads when using more threads. However, when we increase the block size, the serialization overhead grows linearly but the computation grows faster, so when increasing the block size, the overhead is compensated and we are getting performance and energy improvements when using parallelism. The same happens with GPUs, but in this case there is an extra overhead because of an extra data movements is required to load the data from the node main memory and the GPU memory and the overhead of spawning the process in the GPU. Regarding using CPU or GPU, we can also see that using GPUs is more efficient for large blocks, where we can get profit of the large number of processors. For small blocks running just in CPUs is more efficient.

Resource Type CPU GPU Task Configuration 16 14 7 2 1 #Blocks Block Size Time Energy Time Energy Time Energy Time Energy Time Energy 256 18,70 3,99 19,40 4,64 14,80 3,41 107,10 23,85 50,10 11,43 512 34,60 8,15 35,90 8,56 23,30 5,55 125,00 27,85 60,20 13,72 4 (2x2) 1024 117,40 28,16 124,90 29,66 81,90 18,76 197,80 43,90 96,30 22,22 2048 512,80 128,90 545,70 147,50 441,00 107,20 483,50 101,31 278,20 79,72 256 81,30 17,68 82,20 17,64 48,10 10,49 856,80 190,80 400,80 91,44 512 220,10 52,17 222,40 51,27 127,61 29,08 1000,00 222,80 481,60 109,76 16(4x4) 1024 917,40 226,49 990,90 232,53 589,10 139,00 1582,40 351,20 770,40 177,76 2048 4007,18 1036,74 4329,34 1156,38 3172,08 794,29 3868,00 810,48 2225,60 637,76 Table 3: Execution Time (seconds) and Energy Consumption (KJoules) for a whole Matrix Multiplication execution Split in Different Blocks and Using Different Configuration of Single Type of Device.

Table 3 shows the execution time and energy consumption of the Matrix Multiplication when we only use a single type of resource (GPU or CPU) but splitting the Matrices in different number of blocks. In this scenario, the number of coarse-grain tasks executed in parallel depends on the used configuration. In the cases of using 16 and 14 CPU cores per coarse-grain task, we just can run one coarse-grain task, but in the case of 7 CPUs, the node capabilities allow the runtime to execute two coarse-grain tasks in parallel. Due to this fact, we can observe in the table that running more tasks with less CPUs is better than running a large task,

TANGO Consortium 2016 Page 50 of 74 D3.2 TANGO Toolbox – Alpha Version Scientific Report Version: v1.2 – Final, Date: 23/12/2016 this is due to the overhead is also parallelized. So, in this cases, we are also serializing blocks in parallel. A similar issue also happens for the 2 GPU and 1 GPU task configuration.

Resource Type Whole Node Task Configuration 2 GPUs / 14 CPUs 1 GPU / 14 CPUs 2 GPUs / 7 CPUs 1 GPU / 7 CPUs #Blocks Block Size Time Energy Time Energy Time Energy Time Energy 256 34,80 8,07 33,90 7,96 31,90 7,12 29,70 6,78 512 38,70 10,08 38,30 9,66 37,50 8,81 36,80 8,16 4 (2x2) 1024 100.4 26,82 91,40 23,21 88,30 21,69 70,50 18,03 2048 312.3 85.5 291,30 78,53 295,50 73,69 224,70 70,54 256 108,60 27,29 60,50 15,19 87,60 22,28 47,60 11,32 512 231,30 62,31 125,20 34,99 163,00 44,97 123,40 30,39 16(4x4) 1024 624,40 185,35 480,10 138,19 460,10 132,98 363,30 108,41 2048 1705,35 498,59 1530,12 467,56 1539,75 451,79 1157,92 424,14 Table 4: Execution Time (seconds) and Energy Consumption (KJoules) for a whole Matrix Multiplication execution Split in Different Blocks and Using all the Resources of the Computing Node

Table 4 shows the execution time with the application when use the TANGO Programming Model and Runtime capabilities which includes task versioning, multi-level task-based decomposition and the transparent execution of tasks in different devices at the same time (CPUS and GPUS in this case). The execution has been performed for the same matrix sizes as in previous executions. In this case, we combine the two versions (CPU and GPU) of the block multiplication tasks with different possible configurations. For instance, the configuration 2 GPUs / 14 CPUs configuration means that we have defined a version which uses 2 GPUs and another which can run 14 CPU. So, according to the node capabilities, the runtime will be able to run 2 parallel coarse-grain tasks. For the 1 GPU / 14 CPUs and 2 GPUs / 7 CPUs configurations the runtime will execute 3 tasks in parallel. Finally, for the 1 GPU / 7 CPUs configuration the runtime is able to run 4 tasks in parallel. So, in all sizes we can see we have better results for the 1 GPU / 7 CPUs.

We have compared the obtained results with full TANGO Programming Model and Runtime capabilities with the results obtained by using just CPU or GPUs. Results of this comparison are summarized in Figure 15. In general, we can see that by using the TANGO Programming Model and Runtime, we can achieve an important gain when the application requires an intensive usage of all of resources. In contrast, we can see that in some situations the usage of one type of resource is penalizing the execution, such as the case when we have small blocks. In this case, using just CPUs is more efficient in terms of time and energy than the combined CPU/GPU execution. So, to get the best efficiency in all the cases, we should increase the intelligence of the TANGO tools in order to detect these situations and execute the tasks with the most efficient configuration.

TANGO Consortium 2016 Page 51 of 74 D3.2 TANGO Toolbox – Alpha Version Scientific Report Version: v1.2 – Final, Date: 23/12/2016

Figure 15: Comparison of the execution times and energy consumptions for different configurations and gains.

4.1.1.3 Conclusion In this section, we have evaluated how the Matrix Multiplication benchmark performs in terms of time and energy for different resource configurations enabled by the TANGO programming model and Runtime. This benchmark splits the application in different blocks in order to introduce a certain degree of parallelism. Each block multiplication is defined as a coarse-grain tasks of the TANGO programming model and at the same time, each coarse grain task has been parallelized in fine-gain tasks.

We have executed this application for different number of blocks and block sizes and we have measure the energy and execution time by using different resources. We have observed that depending on the application configuration, execution times can vary considerably. For small blocks, using just CPUs is more efficient in terms of time and energy than the combined CPU/GPU execution. So, to get the best efficiency in all the cases, we should increase the intelligence of the TANGO tools in order to detect these situations and execute the tasks always with the most efficient configuration. Moreover, we are going the same benchmarking to other applications and resource types which combine other type of processors and accelerators such as FPGAs.

4.1.2 Benchmarking C/Poroto Matrix Multiplication on small FPGA-based System This section presents a similar type of exercise where matrix multiplication is executed on CPU- only and then on FPGA-only. It is then possible for the development team to extrapolate on what performance would be achieve if exploiting a parallel usage of a CPU+FPGA system. If this is the case then migrating to an implementation following to the TANGO C/OMPS programming model and providing the programming model with the appropriate FPGA kernel to use would be beneficial.

TANGO Consortium 2016 Page 52 of 74 D3.2 TANGO Toolbox – Alpha Version Scientific Report Version: v1.2 – Final, Date: 23/12/2016

4.1.2.1 Benchmarking Context For the benchmarking addressed in the context of TANGO, CETIC is using a platform consisting of server machine embedding FPGA accelerator boards. The server machine has the following hardware characteristics:

 Intel(R) Xeon(R) CPU X5550 @ 2.67Ghz  8 processor cores  8GB RAM

The machine is running a Debian GNU/Linux distribution (Release 8.5).

Below an overview on the embedded FPGA boards:

4.1.2.1.1 Alpha Data FPGA Board The Alpha Data FPGA system used in the benchmarking platform is a modular one.Its architecture is based on FPGA XMC boards plugged into a carrier board:

Figure 16 On the left: ADM-XRC-6T1: XMC Board format plugged on the ADC-XMC-II carrier board; on the right: ADC- XMC-II: Carrier card for PCI Express based systems. Allows direct communication between two XMC boards.

The ADM acceleration board is based on a Xilinx Virtex 6 FPGA (XC6VLX240T) connected to 4 separate banks of 256 MB DDR3 memory. A second Virtex6 (LX130) FPGA translates communication from PCIe to Alpha Data interface. This saves resources of the main FPGA and enable the FPGA programming through PCIe bus. Two others connectors are available to plug a mezzanine board linked to the fast transceivers of the Virtex6. This architecture is a typical one for implementation of FPGA accelerator boards.

As for the development environment provided with the board, it includes:

 The software driver (Windows and Linux),  The C language API for data communication between CPU application and the FPGA process,  The functional blocs (RTL or vhdl source) to implement data transfers between CPU memory and FPGA board memory.

4.1.2.1.2 Xilinx ML605 FPGA Board The ML605 board provides many resources to enable developers to create any design targeting the embedded Virtex 6 FPGA (XC6VLX240T). The main features are:

 DDR3 SODIMM 512 MB memory,  an 8-lane PCI Express interface,  SFP connector  UART, USB ports

TANGO Consortium 2016 Page 53 of 74 D3.2 TANGO Toolbox – Alpha Version Scientific Report Version: v1.2 – Final, Date: 23/12/2016

 DVI port  general purpose I/O  LCD display  high-speed VITA-57 FPGA Mezzanine Connector (FMC), high pin count (HPC) expansion connector, or the on-board VITA-57 FMC low pin count (LPC) connector. What is interesting with this board is that the RIFFA environment (Reusable Integration Framework for FPGA Accelerators: a simple framework for communicating data from a host CPU to a FPGA via a PCI Express bus) is already ported, implemented and tested on it. As described in deliverable D3.1, the Poroto tool, used for evaluation of designs on our system, leverages both, the proprietary Alpha Data API on one hand and the open source RIFFA framework on the other hand.

Figure 17 ML605 FPGA board

Consequently, with this board, the hardware description code is already optimized for communication blocs such as PCIe interface and RIFFA endpoint (which switches data to the adequate RIFFA channel).

Figure 18 RIFFA based System architecture on the ML605 Board

4.1.2.1.3 Benchmarking setup description

TANGO Consortium 2016 Page 54 of 74 D3.2 TANGO Toolbox – Alpha Version Scientific Report Version: v1.2 – Final, Date: 23/12/2016

The considered test system was instrumented with an external equipment allowing to sample and compute electrical measurements associated with instantaneous overall power consumption. This equipment is based on an Arduino board design that interfaces with current clamps and transformer to sample instantaneous values of current and tension then computes several other electrical parameters ( active power, RMS values, Phase, etc.). Measurement cycles are controlled by software directly in the test program through a special library. This library controls the power measurement over a USB connection. Instantaneous values could be displayed in real time through a user interface. Average sampling frequency for power measurements is ∼ 10Hz.

Figure 19 Testing machine and its power measurement equipment

For each design example to be evaluated through the test platform, a test-bench file (c/c++ code) is defined. It implements the following steps:

 Make initial settings of the measurement equipment  Send the bit stream to the FPGA: currently automated for the AlphaData board using the provided driver with the PCI interface. For the ML605 board an external tool is used to send the bit stream over a USB link to the Board.  Sample power consumption for a given period when the machine is in idle state (i.e. doing no computations)  Execute the code on CPU only (the number of iteration is defined as a parameter) while continuing to sample power values.  Execute the implementation using the offloaded design on the FPGA (the number of iteration is defined as a parameter) while continuing to sample power values.  compute and display execution time measurements

The data collected on power measurements is logged in a spreadsheet file then the average power values at each step is computed. We consider the overhead of power induced by the CPU only and the (CPU+FPGA) implementations compared to the idle state power values.

TANGO Consortium 2016 Page 55 of 74 D3.2 TANGO Toolbox – Alpha Version Scientific Report Version: v1.2 – Final, Date: 23/12/2016

4.1.2.2 Benchmark results The following paragraph illustrates the results obtained by offloading the following matrix multiplication function on the Alpha-data FPGA Board using the Poroto tool.

Below the code of the annotated function as given in input to the tool.

#pragma poroto memory test_A int 100000 #pragma poroto memory test_B int 100000 #pragma poroto memory test_C int 100000

#pragma poroto stream::roccc_bram_in MatrixMultiplication::A(test_A, height_A*width_A_height_B) #pragma poroto stream::roccc_bram_in MatrixMultiplication::B(test_B, width_A_height_B*width_B) #pragma poroto stream::roccc_bram_out MatrixMultiplication::C(test_C, height_A*width_B) void MatrixMultiplication(int** A, int** B, int height_A, int width_A_height_B, int width_B, int** C) { int i; int j; int k;

for(i = 0; i < height_A; ++i) { for (j = 0; j < width_B; ++j) { int currentSum = 0; for (k = 0; k < width_A_height_B; ++k) { currentSum += A[i][k] * B[k][j]; } C[i][j] = currentSum; } } }

The code above manipulates input and result matrices of up to 100 000 elements each but accepts any dimensions that fit with this limit.

The following results are obtained for:

A x B = C with sizes of 10x10000, 10000x10 and 10x10 respectively

Exec time (ms) mW (avg) CPU only 5.777 11144 FPGA 41.183 4242

Transfer time through PCI to the FPGA is negligible compared to computation time (~ 3%).

With the same kernel, the change in matrix dimensions to test with the biggest square matrices that match the constraint of 100.000 element maximum, gives the following results:

A x B = C with sizes of 316 x316 each (99856 elements)

TANGO Consortium 2016 Page 56 of 74 D3.2 TANGO Toolbox – Alpha Version Scientific Report Version: v1.2 – Final, Date: 23/12/2016

Exec time (ms) mW (avg) CPU only 184.468 10483 FPGA 1262.155 4516

Transfer time through PCI to the FPGA is negligible compared to computation time (<1%).

With a kernel that is compiled for a maximum number of 1000 elements per matrix, the results are the following:

A x B = C with sizes of 20x50, 50x20 and 20x20 respectively

Exec time (ms) mW (avg) CPU only 0.113 11307 FPGA (no optim) 0.460 4437 FPGA (Loop 0.261 4473 unrolling)

The “loop unrolling” version is simply using an additional directive to the underlying ROCCC compiler to guide more the generation of VHDL (see code below). The directive forces explicitly a loop unrolling which in hardware results in more instantiation of some resources inside the FPGA (almost 3.25x more DSP blocks, 28% more Slices, 15% more LUTs and 33% more registers).

#pragma poroto roccc LoopUnrolling L1 10 void MatrixMultiplication(int** A, int** B, int height_A, int width_A_height_B, int width_B, int** C) { int i; int j; int k;

for(i = 0; i < height_A; ++i) { for (j = 0; j < width_B; ++j) { int currentSum = 0; L1: for (k = 0; k < width_A_height_B; ++k) { currentSum += A[i][k] * B[k][j]; } C[i][j] = currentSum; } } }

4.1.2.3 Discussion With the quickly generated design for FPGA matrix multiplication, with no particular optimizations for the offloaded kernel, we can roughly see that the obtained execution time is almost 7 times slower while being ~2.5 times more efficient in power consumption.

TANGO Consortium 2016 Page 57 of 74 D3.2 TANGO Toolbox – Alpha Version Scientific Report Version: v1.2 – Final, Date: 23/12/2016

With the same design, it is possible to address different sizes of the input and output matrices giving a maximum number of elements for each since the FPGA design was generated against that constraint.

Although the performance of the automatically generated design are not optimal, they allow to identify what would be an achievable performance without investing lot of development effort or test time.

The benchmarking process shows, through the “loop unrolling” version of the offloaded function, that the performance can be enhanced with a little bit more advanced usage of the Poroto prototyping tool and little overhead in engineering of the source code.

4.1.2.4 Conclusion The prototyping process based on Poroto enables rapid evaluation of various implementation scenarios for the offloaded function during the application development. The user can quickly instantiate and characterise the benefits that could be achieved from offloading some computation on the FPGA and at the same time get easily relevant figures on memory transfer overheads, resource utilization, potential design speed, etc.

Bearing in mind that the tool is essentially intended for quick prototyping, the benchmarking results should be seen as a first doable implementation that could be achieved quickly and easily without needing a deep expertise in FPGA technology and tools. In this perspective, the tool brings an added value during the development process and gives the user the ability to evaluate several scenarios to explore the design space for his application.

Further developments of the tool will focus on introducing enhancements to the underlying engine for the hardware implementation generation, better benchmarking process automation and support of other communication interfaces that PCIe. 4.2 Benchmarking Hydro and NBody mini Applications

4.2.1 Benchmarking on Hydro MPI Implementation on Nova 2 HYDRO is a mini-application which implements a simplified version of RAMSES, a code developed to study large scale structure and galaxy formation. HYDRO uses a fixed rectangular two-dimensional space domain and solves the compressible Euler equations of hydrodynamics using a finite volume discretization of the space domain and a second-order Godunov scheme with splitting direction technique.

In this section we study different versions of the same mini application exploring different programming models (MPI/OpenMP, OpenCL and CUDA) when executed upon heterogeneous hardware (CPUs and GPUs).

The goal of these experiments is to highlight some compilation details along with launching executions, monitoring some resources consumption and providing posttreatment analysis on accounting and profiling results.

4.2.1.1 Benchmarking Context The benchmark is a full hydro code implementing an 2D Eulerian scheme using a Godunov method. It has been implemented in various versions yet it should result in similar results.

Currently only the following versions have been compiled and experimented.

 HydroC99_2DMpi

TANGO Consortium 2016 Page 58 of 74 D3.2 TANGO Toolbox – Alpha Version Scientific Report Version: v1.2 – Final, Date: 23/12/2016

o A fine grain OpenMP + MPI version using C99  cuHydroC_2DMpi o An implementation using CUDA + MPI  oclHydroC_2D o An implementation using OpenCL + MPI

4.2.1.2 Compilation details All versions have been compiled using the Intel compiler and Intel MPI implementation of MPI.

For this the user needs to use the following commands prior to start the compilation: module load intel_compiler module load intel_mpi

For the case of Cuda and OpenCL versions the compilation needs to take place upon a node that has the right versions of cuda. The user needs to use the following command: module load cuda

Once the right libraries have been charged then a simple make, make install on each directory will perform the compilation and generate the executables.

4.2.2 Execution Settings For the execution of Hydro we create different sbatch scripts for each version of the mini-app. The case of the MPI-OpenMP sbatch script is shown here for example. The other sbatch scripts are quite similar. #!/bin/sh #SBATCH -N2 #SBATCH -n32 module load intel_compiler/2016.0.047 module load gcc/4.8.2 module load intel_mpi/mpi_5.0.1.035 export I_MPI_PMI_LIBRARY=/usr/local/slurm1702_161117/lib/libpmi.so RUNDIR=${PWD}/slurm-$SLURM_JOBID EXEDIR=${PWD}/../Src/ INPDIR=${PWD}/../../../Input #RUNCMD="env OMP_NUM_THREADS=16 KMP_AFFINITY=compact srun -n 4 - N 2 --profile=Energy --acctg-freq=Energy=1" RUNCMD="srun -n 32 -N 2 --profile=Energy,Task --acctg- freq=Energy=1,Task=1" mkdir -p ${RUNDIR} cd ${RUNDIR}

${RUNCMD} ${EXEDIR}/hydro -i ${INPDIR}/input_15000x15000_corner.nml

The above script shows us that within the sbatch script we charge the right environments, we declare the number of requested resources to be allocated and we describe the data-set to be used which gives the details upon the specific parameters for the Hydro simulation. The MPI execution takes place using the srun command which is tightly integrated with mpirun and can execute MPI processes through the PMI library. The “profile” parameter activates the detailed profiling for both energy and task related monitoring (CPU, Memory, Local Disk).

The parameters of Hydro execution, which is the same for all version executions are given within the input_15000x15000_corner.nml which is provided here:

TANGO Consortium 2016 Page 59 of 74 D3.2 TANGO Toolbox – Alpha Version Scientific Report Version: v1.2 – Final, Date: 23/12/2016

&RUN tend=1600 noutput=10 nstepmax=10 # dtoutput=4. /

&MESH nx=15000 ny=15000 nxystep=32 prt=0 dx=0.05 boundary_left=1 boundary_right=1 boundary_down=1 boundary_up=1 testcase=1 /

&HYDRO courant_factor=0.8 niter_riemann=10 /

The MPI-OpenMP version is executed upon simple x86 CPUs whereas the CUDA and OpenCL versions are executed upon x86 CPUs plus Tesla K20 GPUs. During the executions we collect aggregated energy data (the total energy consumption) along with execution time. Furthermore we collect profiling data on power consumption, CPU, memory and local disk consumptions.

4.2.2.1 Benchmark results We performed 5 repetitions for each execution and the following histogram shows the average collected results for each version. The following graph on the left shows the total execution time of each version whereas the one on the right shows the energy consumption respectively

Figure 20: Hydro execution time and energy measurements on Nova 2.

We can observe that while OpenCL has the lowest execution time its MPI-OpenMP version that has the lowest energy consumption. However let’s try to go a little bit deeper in the analysis by considering the profiling data. The following graph shows the percentage of CPU usage during the execution of the MPI-OpenMP version.

TANGO Consortium 2016 Page 60 of 74 D3.2 TANGO Toolbox – Alpha Version Scientific Report Version: v1.2 – Final, Date: 23/12/2016

Figure 21 Hydro CPU Usage on 2 nodes on Nova 2.

The following graph shows the memory consumption during the same execution:

Figure 22: Hydro Memory Usage on 2 nodes of Nova 2.

Further, the power consumption throughout the execution is shown in the following graph

Figure 23: Hydro power consumption usage per node on Nova 2.

TANGO Consortium 2016 Page 61 of 74 D3.2 TANGO Toolbox – Alpha Version Scientific Report Version: v1.2 – Final, Date: 23/12/2016

And finally the local disk writes per task are shown here:

Figure 24: Hydro Data storage usage per task on Nova 2 nodes.

Based on this results and the fact that we write a lot to disk, naturally we can propose to test with Lustre filesystem to improve the time of writes. And this is what we did in the following graphs.

It is interesting to note that MPI-OpenMP version has become the most time-consuming and energy hogging whereas OpenCL appears to be equally fast than CUDA while the best energy consumption is achieved with the OpenCL version.

The following graph shows the Lustre writes during the Hydro execution and OpenCL version. It’s interesting to observe how increased are the iterations (up to 500MB per second/per task) in comparison with the previous case where data have been written on local disks (no more than 35MB per second/per task).

TANGO Consortium 2016 Page 62 of 74 D3.2 TANGO Toolbox – Alpha Version Scientific Report Version: v1.2 – Final, Date: 23/12/2016

Figure 25: Hydro GPU write operations

4.2.2.2 Discussion The above experiment of the same benchmark collecting the same results through different implementations and using different types of architectures shows us how we can exploit heterogeneity in an infrastructure; and how by combining fine monitoring capabilities we managed to select the best combination of programming model and hardware usage for a particular choice of allocated resources, data-set and execution parameters.

4.2.2.3 Conclusion We expect that depending the number of type of hardware resources, datasets and the execution parameters the results in performance and energy consumption will vary a lot and it will need multiple repetitions to elect to optimize a particular metric. That is why a framework such as Tango will be very interesting in the context of HPC.

4.2.3 Benchmarking NBody C/Poroto Implementation on FPGA-augmented dual-Core server This section present time and energy measurements collected when running NBody simulation on the CETIC FPGA-Based system.

4.2.3.1 Benchmarking Context The benchmarking context for the N-Body implementation is identical to the one described for Matrix-Multiplication in paragraph 4.1.2.1

4.2.3.2 Benchmark results The N-Body problem is a simulation of a dynamical system of particles, usually under the influence of physical forces, such as gravity.

Basically the code considered in this evaluation iterates over a sequence of three steps: first the forces applying to particles are computed, then the velocities of individual particles are updated using the forces values and finally the positions of particles are calculated based on their updated velocities. The new updated positions are then used for the next iteration and so on.

TANGO Consortium 2016 Page 63 of 74 D3.2 TANGO Toolbox – Alpha Version Scientific Report Version: v1.2 – Final, Date: 23/12/2016

To evaluate execution of N-Body on FPGA accelerator using Poroto tool, the offloaded function consisted of the force and speed updates. The integration of velocities is done on the CPU side at each iteration. The forces and speed updates are the most compute intensive functions. Our FPGA showed to be not big enough to host the complete set of N-Body computations.

The offloaded function is slightly adapted to allow its compilation through the underlying C to VHDL compiler (i.e. ROCCC). Some external FPGA kernels are required to be generated and integrated to the design generated by ROCCC in order to support mathematical float operations such as multiplication, division and square root.

The following code illustrates the offloaded function, with the associated annotations, as given in input to Poroto tool:

#pragma poroto file src sqrt.vhdl #pragma poroto template ipcore_dir sqrt_impl.xco #pragma poroto latency 28 void sqrt(float a, float &result);

#pragma poroto memory x float 1024 #pragma poroto memory y float 1024 #pragma poroto memory z float 1024 #pragma poroto memory xp float 1024 #pragma poroto memory yp float 1024 #pragma poroto memory zp float 1024 #pragma poroto memory vx_in float 1024 #pragma poroto memory vy_in float 1024 #pragma poroto memory vz_in float 1024 #pragma poroto memory vx_out float 1024 #pragma poroto memory vy_out float 1024 #pragma poroto memory vz_out float 1024

#pragma poroto stream::roccc_bram_in nbody::x(x, n) #pragma poroto stream::roccc_bram_in nbody::y(y, n) #pragma poroto stream::roccc_bram_in nbody::z(z, n) #pragma poroto stream::roccc_bram_in nbody::xp(xp, n) #pragma poroto stream::roccc_bram_in nbody::yp(yp, n) #pragma poroto stream::roccc_bram_in nbody::zp(zp, n) #pragma poroto stream::roccc_bram_in nbody::vx_i(vx_in, n) #pragma poroto stream::roccc_bram_in nbody::vy_i(vy_in, n) #pragma poroto stream::roccc_bram_in nbody::vz_i(vz_in, n) #pragma poroto stream::roccc_bram_out nbody::vx_o(vx_out, n) #pragma poroto stream::roccc_bram_out nbody::vy_o(vy_out, n) #pragma poroto stream::roccc_bram_out nbody::vz_o(vz_out, n) void nbody(float *x,float *y,float *z, float *xp,float *yp,float *zp, float *vx_i,float *vy_i,float *vz_i, float *vx_o,float *vy_o,float *vz_o, float dt, int n) { int i; int j; float Fx; float Fy; float Fz;

for (i = 0; i < n; i++) { Fx = 0.0f; Fy = 0.0f; Fz = 0.0f; for (j = 0; j < n; j++) { float dx = xp[j] - x[i];

TANGO Consortium 2016 Page 64 of 74 D3.2 TANGO Toolbox – Alpha Version Scientific Report Version: v1.2 – Final, Date: 23/12/2016

float dy = yp[j] - y[i]; float dz = zp[j] - z[i]; float distSqr = dx*dx + dy*dy + dz*dz + SOFTENING; float sqrVal; sqrt(distSqr, sqrVal); float invDist = 1.0f / sqrVal; float invDist3 = invDist * invDist * invDist; Fx = Fx + dx * invDist3; Fy = Fy + dy * invDist3; Fz = Fz + dz * invDist3; } vx_o[i] = vx_i[i] + dt*Fx; vy_o[i] = vy_i[i] + dt*Fy; vz_o[i] = vz_i[i] + dt*Fz; } }

Poroto automatically interfaces these IP blocks to the design generated by ROCCC. When considering N=1024, the full design occupied almost 80% of the FPGA logic resources and the total compilation and mapping time started to be very long (several hours) with timing violations that required us to decrease the overall frequency of the design. The overall design includes mainly the offloaded function and attached kernels, the PCI interface, and other glue logic for integration. The FPGA final design is running at 100 MHz. We measured the power and time required by execution on CPU only and on (CPU + FPGA) with N = 1024 and we obtained the results in Table 3.

Table 5 Execution time and power evaluation for N-Body (1 iteration) using Poroto

Exec time (ms) mW (avg) CPU only 64.571 22051 CPU&FPGA 3521 5807

4.2.3.3 Discussion The values are average measurements over several iterations for the execution time. The average power is measured during several iterations of the N-Body implementation.

While the power consumption is almost 4 times lower when offloading on the FPGA, the execution time is far longer. Thus, if comparing the overall energy consumption of a CPU-only solution vs the CPU&FPGA solution (where the CPU is mostly idled as it is only used to perform a simple aggregation of data resulting from the FPGA computation), we observe that CPU-only solution performs an order of magnitude better (10x).

However, the point is not necessarily to choose one solution over the other but rather to determine if it is worth the development investment to develop a solution that would not only use the CPU-only approach or the mostly-FPGA approach as currently benchmarked by CPU&FPGA but rather a solution that combines both; in particular, a solution that continues to compute most of the job on the CPU but nonetheless offloads a portion of the computation on the FPGA instead of keeping it idled. Based on the current extremely poor execution time obtained on the FPGA, this remains questionable. But, if time performance on FPGA can be improved then it may be worth it.

TANGO Consortium 2016 Page 65 of 74 D3.2 TANGO Toolbox – Alpha Version Scientific Report Version: v1.2 – Final, Date: 23/12/2016

The poor time performance on the FPGA could be explained by the fact that the kernel generated by ROCCC in this particular example is not very efficient due to low level of parallelism that can be achieved by the compiler in the final design. Also, there are many memory transfer overheads at each iteration due to the transfer of the arguments and results from CPU to FPGA forth and back at each step which is not the case for the CPU only execution that have immediate access to data. Furthermore the FPGA is running at 1/25 the frequency of the CPU.

This rapid benchmarking process allows to identify quickly, with a minimal implementation effort, whether it is worth or not to offload the code execution of a giving computation to the FPGA. Also, the user can get quick feedback on resource usage on the target hardware and could then either fine-tune his implementation to take advantage of non-used resources on the FPGA, identify a better suited FPGA for his application or better load-balance the computations over current hardware components to take better benefit from the platform as a whole. For instance, even if in this quick prototyping of the N-Body execution, the hardware offloading is less performant than the CPU execution, this gives the user a realistic idea of the achievable performance that can be obtained without further developments, simply by exploiting adequately the available resources. In the current example, the CPU can be loaded with almost 20 times more N-Body computations that the FPGA. Therefore running NBody simulation on both the CPU and the FPGA (current implementation) would roughly give a potential performance enhancement of 5% compared to a CPU-only execution. The energy enhancement could be estimated in the same way.

4.2.3.4 Conclusion With the proposed rapid prototyping and benchmarking approach based on Poroto, the user can quickly evaluate the performance enhancement potential associated with the offloading of a selected computation on the FPGA. Of course this approach is not intended to provide user with a highly optimised design, but instead it gives a quick and valuable feedback at early development phases without investing much effort. It also helps to identify potential distribution of computation between the different hardware components available, in this case, CPU and FPGA.

The approach can be generalised in this way: If the Poroto based quick benchmarking gives a performance ratio rPerf of the FPGA based implementation over the CPU only one (i.e. execution time of the offloaded computation on FPGA is equal to rPerf times its execution time on the CPU) and power ratio rPow (i.e. power values in Watts while execution the offloaded function on FPGA is equal rPow times its execution on CPU), then the user can realistically target a performance enhancement of 1/rPerf in execution time that results in rPerf. (1+ rPow)/(1+ rPerf) times the energy consumption of execution on CPU only.

To illustrate such an estimate, let us consider an example (other than NBody) with a situation more favourable to relying on an FPGA offloading where:

FPGA execution is 4 times faster (rPerf = 0.25) and 2 times lesser in power (rPow = 0.5).

That means we can execute 5 instances of a given computation function in a time slot t_cpu equal to the execution time of 1 execution on CPU i.e., 1 on the CPU and 4 on the FPGA.

The total energy for execution 5 times the computation on CPU only is

5 x t_cpu x P(CPU)

TANGO Consortium 2016 Page 66 of 74 D3.2 TANGO Toolbox – Alpha Version Scientific Report Version: v1.2 – Final, Date: 23/12/2016 while the energy for executing these same computations on CPU and FPGA (1 on CPU and 4 on FPGA) is

1 x t_cpu x P(CPU) + 1 x t_cpu x P(FPGA) where P(CPU) and P(FPGA) are the powers associated with execution on CPU and FPGA respectively.

That results in energy consumption ratio an execution on CPU+ FPGA over CPU only:

t_cpu x (P(CPU) + P(FPGA) ) / ( 5 x t_cpu x P(CPU) ) = 1.5/5 = 30%

That validates the ratio of the formula (i.e. rPerf. (1+ rPow)/(1+ rPerf) )

This reasoning could even apply to the NBody example where a CPU+FPGA solution would approximatively achieve only about a 2% time improvement and nearly no energy improvement. Thus, unless significant effort is spend on improving the current FPGA implementation, it is currently not worth offloading NBody simulation on a CPU+FPGA platform.

TANGO Consortium 2016 Page 67 of 74 D3.2 TANGO Toolbox – Alpha Version Scientific Report Version: v1.2 – Final, Date: 23/12/2016

5 Conclusions At the end of Year 1, most components from the TANGO framework have initiated their implementation and several have already contributed initial scientific results. Notably, significant progress is achieved by the design-time and deployment-time optimiser for finding optimal placement of computational tasks on various hardware; by the energy modeller for estimating the energy consumption; by the integration of C/OMPSs programming models to handle the scheduling of coarse and fine grain tasks respectively at the data centre level and computing node levels.

Furthermore, in this first year, the TANGO consortium has extensively made use of SLURM, which already supports part of the device supervision to let an application request a set of nodes to run on. Furthermore, SLURM is also capable to capture time measurement for running a job as well as capture the power and energy consumed by the various nodes on which a job is computed.

The scientific results in the Year 1 have been obtained on Nova2 platform provided by partner Bull. A second testbed at CETIC provides a small server to which an FPGA board is connected on the PCIe bus. This sever has the basic capabilities to capture time and power/energy consumption of a running software computation. To facilitate a rapid prototyping approach on this testbed, C/Poroto annotation can be used to identify portion of code for which to offload computation on the FPGA part of the testbed.

Some basic computations, namely, matrix multiplication, Hydro (fluid dynamics simulation) and Nbody simulations were performed on these two testbeds and corresponding time and energy measurement were taken.

At the end of this first year, results are not final. Firstly, many of the tasks for collecting, aggregating, aligning and analysing measurement data between time and energy still require significant manual effort. In the next few months, the various tools and components of the TANGO framework will implement the necessary software to automate most of these tedious data collection and analysis tasks.

Secondly, the application lifecycle deployment engine will finalise its first release in the next few months as well. Throughout the second year, the implementation of a device emulator will also be explored.

Finally, the component with existing scientific contributions will continue their progress. The energy modeller will perform its calibration on GPU and Xeon Phi based processing nodes, the design-time optimiser will help scenarios where task-scheduling optimisation is needed at design time, and the code optimiser plans on profiling source code for time and energy performance on GPU and Xeon Phi. Finally, SLURM will be augmented to handle the allocation of a job on a set of heterogeneous nodes while at the moment, it is only capable to handle heterogeneous allocation but on different executions.

TANGO Consortium 2016 Page 68 of 74 D3.2 TANGO Toolbox – Alpha Version Scientific Report Version: v1.2 – Final, Date: 23/12/2016

6 References

[1] European Commission, “Future Vision: Smart Everywhere.,” March 2015. [Online]. Available: http://ec.europa.eu/digital-agenda/en/smart-anything-everywhere. [Accessed 14 12 2016].

[2] J. Tully, “Mass Adoption of the Internet of Things Will Create New Opportunities and Challenges for Enterprises,” Gartner Report, 27 February 2015.

[3] TANGO Consortium, “D3.1 - TANGO Toolbox – Alpha version (Sofware),” December 2016.

[4] TANGO Consortium, “D2.1 - TANGO Requirements and Architecture Specification Alpha Version,” April 2016.

[5] “JouleUnit - A generic framework for profiling ICT applications,” [Online]. Available: https://code.google.com/p/jouleunit/. [Accessed 12 04 2014].

[6] Zabbix LLC, “Zabbix The Enterprise-class Monitoring Solution for Everyone,” [Online]. Available: http://www.zabbix.com/. [Accessed 14 12 2016].

[7] “Ganglia,” [Online]. Available: http://ganglia.info/. [Accessed 14 12 2016].

[8] F. Z. J. L. N. K. A. A. B. A. Kansal, “Virtual Machine Power Metering and Provisioning,” 1st ACM Symposium on Cloud Computing, p. 39–50, 2010.

[9] V. C. A. E. H. Bohra, “VMeter: Power modelling for virtualized clouds,” IEEE International Symposium on Parallel Distributed Processing, Workshops and PhD Forum (IPDPSW), pp. 1-8, 2010.

[10] A. K.-H. J. S. W. I. S. J. W. Smith, “CloudMonitor: Profiling Power Usage,” IEEE 5th International Conference on Cloud Computing (CLOUD), pp. 947-948, 2012.

[11] P. L. J. P. F. Farahnakian, “LiRCUP: Linear Regression Based CPU Usage Prediction Algorithm for Live Migration of Virtual Machines in Data Centers,” 39th EUROMICRO Conference on Software Engineering and Advanced Applications (SEAA), p. 357–36, 2013.

[12] C. L. Y. C. Z. Jiang, “VPower: Metering power consumption of VM,” IEEE 4th international Conference on Software Engineering and Service Science, p. 483–486, May 2013.

[13] Q. Z. Z. L. D. Q. H. Yang, “iMeter: An integrated VM power model based on performance profiling,” Future Generation Computer Systems, vol. 36, pp. 267-286, 2014.

[14] O. N. Project, “Open Nebula – Flexible Enterprise Cloud Made Simple,” 2014. [Online]. Available: http://opennebula.org/. [Accessed 14 12 2016].

[15] ThinkTank Energy Products Inc., “WattsUp?,” 2014. [Online]. Available: https://www.wattsupmeters.com/secure/index.php. [Accessed 14 12 2016].

[16] A. Waterland, “Stress Project Homepage,” 2014. [Online]. Available: http://people.seas.harvard.edu/~apw/stress/. [Accessed 14 12 2016].

TANGO Consortium 2016 Page 69 of 74 D3.2 TANGO Toolbox – Alpha Version Scientific Report Version: v1.2 – Final, Date: 23/12/2016

[17] Phoronix Media, “Phoronix Testsuite Homepage,” 2016. [Online]. Available: http://www.phoronix.com/scan.php?page=home. [Accessed 14 12 2016].

[18] OscaR, “Operational Research in Scala,” [Online]. Available: https://bitbucket.org/oscarlib/oscar/wiki/Home. [Accessed 15 12 2016].

[19] R. L. a. G. A. Jeronimo Castrillon, “MAPS: Mapping Concurrent Dataflow Applications to Heterogeneous MPSoCs, , vol 9, issue 1, 527-545, 2013,” IEEE Transactions on Industrial Informatics, vol. 9, no. 1, pp. 527-545, 2013.

[20] R. d. Groote, On the Analysis of Synchronous Dataflow Graphs -- A System-Theoretic Perspective. PhD Thesis, ISBN 978-90-365-4041-4, Twente, The Netherlands: University of Twente, 2016.

[21] Khronos OpenCL Working Group, “The OpenCL Specification, version 2.0,” 2014.

[22] “OpenACC website,” [Online]. Available: http://openacc-standard.org. [Accessed 23 2 2016].

[23] OMPs, “The OmpSs Programming Model,” [Online]. Available: https://pm.bsc.es/ompss. [Accessed 25 4 2016].

[24] “JVM Monitor - Java profiler intergrated with Eclipse,” [Online]. Available: http://www.jvmmonitor.org/. [Accessed -4 12 2014].

TANGO Consortium 2016 Page 70 of 74 D3.2 TANGO Toolbox – Alpha Version Scientific Report Version: v1.2 – Final, Date: 23/12/2016

APPENDIX: Runtime Overhead of GPU Energy Monitoring Probe It is important to understand the overhead of measurement probes added to a testbed and then verify that it does not adversely generate an important overhead. Below the system on which the effect of the GPU energy measurement probe developed by TANGO are tested is first reviewed followed by actual measurements of the process launched by the probe.

The application used to monitor the NVDIA GPUs was tested in a Linux environment with the following characteristics: - 2 NVIDIA GPU devices - 24 GB RAM - CPU: Table 6: Specifications of the system use to test GPU energy probes overhead.

Architecture x86_64 CPU op-mode(s) 32-bit, 64-bit Byte Order Little Endian CPU(s) 8 On-line CPU(s) list 0-7 Thread(s) per core 1 Core(s) per socket 4 Socket(s) 2 NUMA node(s) 2 Vendor ID GenuineIntel CPU family 6 Model 44 Stepping 2 CPU MHz 2660.000 BogoMIPS 5333.20 Virtualization VT-x L1d cache 32K L1i cache 32K L2 cache 256K L3 cache 12288K NUMA node0 CPU(s) 0-3 NUMA node1 CPU(s) 4-7 A script application based on the Linux top command is used to get the memory and cpu used by the monitoring application (energy probe) responsible of getting the power consumption of the NVIDIA GPUs. top shows many metrics measurements, including the CPU and Memory usage of a running process. The following table shows some of the results obtained by this script application: Table 7: results of the top command on nvml_app (the process related to the GPU energy monitoring probe.)

VIRT RES SHR S %CPU %MEM TIME+ COMMAND 11212 500 388 S 0.0 0.0 0:00.00 nvml_app 11344 824 628 R 68.2 0.0 0:02.05 nvml_app 11344 916 720 S 29.3 0.0 0:02.93 nvml_app 11344 916 720 S 0.3 0.0 0:02.94 nvml_app 11344 916 720 S 0.0 0.0 0:02.94 nvml_app 11344 916 720 S 0.3 0.0 0:02.95 nvml_app 11344 916 720 S 0.0 0.0 0:02.95 nvml_app 11344 916 720 S 0.3 0.0 0:02.96 nvml_app

TANGO Consortium 2016 Page 71 of 74 D3.2 TANGO Toolbox – Alpha Version Scientific Report Version: v1.2 – Final, Date: 23/12/2016

11344 916 720 S 0.0 0.0 0:02.96 nvml_app 11344 916 720 S 0.3 0.0 0:02.97 nvml_app 11344 916 720 S 0.0 0.0 0:02.97 nvml_app 11344 916 720 S 0.3 0.0 0:02.98 nvml_app 11344 916 720 S … … … nvml_app …

VIRT Total amount of virtual memory used by the process RES Resident size (kb) - Non-swapped physical memory which the process has used SHR Shared memory size (kb) - Amount of shared memory which the process has used (shared memory is memory which could be allocated to other processes) S Process status: 'R' = running 'S' = sleeping %CPU Percentage of CPU time the process was using at the time top last updated %MEM Percentage of memory (RAM) the process was using at the time top last updated The lines marked in yellow correspond to the NVML library initialization (the call to function nvmlInit()). This “init” operation can take a few seconds to complete, and it makes use of a high amount of CPU. After initializing the library, the monitoring application gets the power consumption values of the NVIDIA GPUs every one second. It sleeps one second before getting the metrics again. In relation to the RAM memory used by the monitoring application, the table shows that this application uses an insignificant amount of it (the RAM used by the application corresponds to approx. 0,01% of the total).

TANGO Consortium 2016 Page 72 of 74 D3.2 TANGO Toolbox – Alpha Version Scientific Report Version: v1.2 – Final, Date: 23/12/2016

APPENDIX: Evaluation of Energy Modeller’s Accuracy by Inducing Load This appendix item continues on from section 3.1 and focus on assessing the validity of the changes made to the calibration data in the context of IPMI through analysing the accuracy of the power and energy predictions made from a less synthetic workload. This load is induced via a virtualised infrastructure that is not necessarily the same as the Nova2 testbed but still maintains a degree of relevance so is presented here.

Table 8: Error between Watt meter reading and the model generated estimate of power consumption WM IPMI IPMI-adj Average error (W) -0.20 -18.35 -6.50 Average absolute error (W) 15.68 21.88 18.92 Average error/idle power -0.17% -15.75% -5.58% Absolute average error/idle power 13.46% 18.78% 16.24%

We use the Phoronix testsuite [17] as a means of inducing a workload. The benchmarking suite then runs for an hour inducing load on the system, with the resultant trace shown in ¡Error! No se encuentra el origen de la referencia.. ¡Error! No se encuentra el origen de la referencia. shows the use of the Watt meter emulator with the results from three different calibration datasets. These datasets having been gathered via a Watt meter, by IPMI and via IPMI with the same adjustments as used in Figure 10. It clearly shows how the estimated power consumption for the adjusted IPMI more closely matches the Watt meter generated calibration data’s trace. The average error and absolute average error for this trace is shown in ¡Error! No se encuentra el origen de la referencia., for Watt Meter calibrated (WM), IPMI, and adjusted IPMI (IPMI-adj).

We can see in terms of estimating energy consumption of an application the adjustments made to IPMI have made a substantial improvement to the average error (11.85W or 10.17%). Thus over time the estimation of energy consumption will be far more accurate. Considering the absolute error it can be seen while the model used to estimate the actual power consumption has errors a reduction in the error from IPMI alone is also realised (2.96W or 2.54%). This demonstrates how a single power value may have inaccuracies but for the overall energy consumption it will eventually converge to the real value in the context of this workload. The difference in error between IPMI and the Watt meter remains, principally as a result of the lack of resolution of the IPMI based power sensors, having eliminated averaging issues during the calibration run. This can only be resolved by hardware vendor based improvements of these power sensors. Until this improvement is realized, this leaves our models and careful calibration as the only solution for gaining reliable estimates of current power consumption. Models such as ours will retain their usefulness for the prediction of future power consumption of a given workload.

Finally we illustrate the errors associated with the trace as shown in Figure 26. The deviation from the actual power consumption for each estimated power value is calculated and shown on the x axis, while on the y axis the count of how many estimates with that error are shown. Therefore the more estimates that are close to zero Watts of error the better the prediction of power consumption and the more symmetrical a distribution is for error the more accurate the prediction for energy consumption will be.

TANGO Consortium 2016 Page 73 of 74 D3.2 TANGO Toolbox – Alpha Version Scientific Report Version: v1.2 – Final, Date: 23/12/2016

Figure 26: Trace of a workload induced by the Phoronix testsuite (CPU)

Figure 27: A distribution of errors in the model’s accuracy as compared to the power meter reading

Figure 27 shows how the IPMI based calibration data biggest peak has a slight offset from 0 underestimating the power consumed. The adjusted IPMI makes an improvement on this with a peak centred closer to 0W of error. The Watt meter based calibration performed best in that its average error was -0.20W. Aside from observing the proximity to the ideal of centring around zero Watts of error, we can see other errors shown. These tend to result from transition periods between distinctly different levels of CPU load and timing issues between the different types of metric values been gathered given that measurements were stored in a real distributed system.

TANGO Consortium 2016 Page 74 of 74