ENHANCING ENERGY-PERFORMANCE FOR POWER CONSTRAINED SOC SYSTEMS

Rami Jioussy

Technion - Computer Science Department - M.Sc. Thesis MSC-2015-02 - 2015


ENHANCING ENERGY-PERFORMANCE FOR POWER CONSTRAINED SOC SYSTEMS

Research Thesis

Submitted in partial fulfillment of the requirements for the degree of Master of Science in Computer Science

Rami Jioussy

Submitted to the Senate of the Technion - Israel Institute of Technology. Sh’vat 5775, Haifa, February 2015


This research was carried out under the supervision of Prof. Avi Mendelson and Dr. Yariv Aridor (Intel), in the Faculty of Computer Science.

Acknowledgments

I would like to thank my supervisors: Prof. Avi Mendelson, for enriching me with knowledge and valuable help, and Dr. Yariv Aridor (Intel), for continually pushing me forward and for guidance that kept me focused on this thesis’ objectives.

All this research wouldn’t have even started without the encouragement, help and endless patience of my dear ex-wife, Lana Kattawi Jioussy (Rest In Peace).

I would like to thank my team at Intel (especially Shai Satt and Shiri Manor) for their patience with my intermittent absences and for granting me hours during regular work-days to complete my master’s studies.

Last but not least, I would like to thank my loving wife, Heba, and my family for pushing me forward and being there for me in the tough moments.


Contents

Abstract ...... 1
1 Introduction ...... 2
2 Related works ...... 4
3 Glossary and Assumptions ...... 5
3.1 Terms ...... 5
3.2 Assumptions ...... 6
4 Technical Background ...... 7
4.1 SoC Architecture ...... 7
4.1.1 Shared power envelope ...... 8
4.2 OS power-policies ...... 8
4.3 OpenCL ...... 11
4.3.1 OpenCL kernel splitting for Hybrid execution ...... 12
5 This research focus ...... 14
5.1 The hybrid execution model ...... 14
5.1.1 Mapping to the OpenCL model ...... 14
5.2 EDP: energy-performance metric ...... 15
6 WHP: an energy-performance optimizing method ...... 16
6.1 Observations ...... 16
6.1.1 Constant power consumption ...... 16
6.1.2 Performance is linear ...... 16
6.1.3 Package power offset ...... 17
6.1.4 TDP budget interferes with hybrid execution on SoC systems ...... 17
6.2 Device completion assumption ...... 19
6.3 Algorithm ...... 19
6.3.1 Stage 1: Constructing FT and FP ...... 20
6.3.2 Stage 2: Determine optimal configuration ...... 20
6.3.3 Stage 3: Apply (cpufreq, gpufreq, α) settings ...... 21
6.4 Walk-through example ...... 21
7 Experimental testbed ...... 23
7.1 The testbed hardware ...... 23
7.2 Measurement Methodology ...... 23
7.2.1 Measuring power ...... 24
7.2.2 Measuring performance (execution time) ...... 24
7.3 Workloads ...... 24
8 Experiments ...... 26
8.1 Demonstrating the observations ...... 26
8.1.1 Constant power consumption ...... 26
8.1.2 Performance is linear ...... 26
8.1.3 Package power offset ...... 27
8.2 Applying WHP ...... 28
8.3 WHP vs. Balanced ...... 29
8.3.1 Results discussion ...... 30
8.4 WHP vs. other OS policies ...... 34
9 Summary and future work ...... 36
9.1 Ideas for future work ...... 36
10 References ...... 37

List of Tables

Table 1: Mapping the hybrid execution model onto OpenCL ...... 14
Table 2: FT and FP for the AES256 kernel ...... 21
Table 3: WHP result for the AES256 kernel ...... 21
Table 4: The testbed platform specification ...... 23
Table 5: Characteristics of the testbed kernels ...... 25
Table 6: Testbed kernels’ power consumption: CPUwatt + GPUwatt / Packagewatt. Confirms Equation 3 ...... 28
Table 7: WHP scores (execution time, energy, EDP). The runtime EDP is calculated by Time x Energy. The EDP matching ratio is calculated by MIN(runtime EDP, computed EDP)/MAX(runtime EDP, computed EDP) ...... 29
Table 8: Balanced-mode policy scores (execution time, energy, EDP). The runtime EDP is calculated by Time x Energy ...... 29
Table 9: High-performance policy scores ...... 35
Table 10: Power-save policy scores ...... 35

List of Figures

Figure 1: IvyBridge SoC Architecture ...... 7
Figure 2: High-performance power policy. CPU is set to 2800MHz (CPU-HFM) and GPU to 1150MHz (GPU-HFM) ...... 9
Figure 3: Power-save power policy. CPU is set to 800MHz (CPU-LFM) and GPU to 350MHz (GPU-LFM) ...... 9
Figure 4: Balanced power policy. CPU is set to ~2000MHz and GPU to GPU-HFM ...... 10
Figure 5: OpenCL Kernel NDRange ...... 12
Figure 6: Partitioning OpenCL kernel NDRange execution between CPU and GPU ...... 13
Figure 7: Depicting kernel partitioning by a partition factor ...... 15
Figure 8: SoC power behavior on hybrid execution (TDP 17W) ...... 18
Figure 9: Discrete power behavior on hybrid execution (TDP 77W) ...... 19
Figure 10: The WHP algorithm ...... 20
Figure 11: WHP solution for the AES256 workload, CPUfreq=2000MHz, GPUfreq=1150MHz, α=6; the energy-performance score is 55.1 ...... 22
Figure 12: WHP instrumentation for kernel execution ...... 23
Figure 13: Power MSRs ...... 24
Figure 14: Frequency MSRs ...... 24
Figure 15: Convolution kernel: power behavior for CPU and GPU at both LFM and HFM frequencies ...... 26
Figure 16: Performance linearity of the AES256 kernel: power behavior for 3 execution modes: CPU only at 2000MHz, GPU only at 1150MHz, and hybrid at 2000/1150MHz respectively ...... 27
Figure 17: AES256: WHP vs. Balanced-mode ...... 31
Figure 18: Improvements of WHP over OS balanced-mode. Overall, WHP achieves an average 23% improvement of EDP over balanced-mode ...... 31
Figure 19: Convolution: WHP vs. Balanced ...... 32
Figure 20: Convolution performance scaling ...... 33
Figure 21: Convolution energy scaling ...... 33
Figure 22: EDP summary: WHP vs. all power policies ...... 35

Abstract

System on Chip (SoC) based heterogeneous architectures are becoming the de facto standard for low-end systems, so optimal power management for these platforms is a highly important goal. Prior research has suggested approaches and techniques for scheduling computation for parallel execution on the different compute devices (henceforth, hybrid execution) with the goal of improving performance, while only a few works focused on energy-performance. All of those works targeted hybrid platforms with CPUs and discrete GPUs, for which performance or energy can be optimized independently per device. They suggest that optimizing energy-performance for hybrid execution is achieved by various schemes for work partitioning (for example, using OpenCL kernels), followed by runtime power management (for example, the OS balanced-mode power policy).

In the SoC environment, optimizing both energy and performance cannot be addressed directly with these prior techniques: they either become inapplicable or yield suboptimal solutions.

This research presents a novel approach: an offline method that learns the program’s behavior (performance, energy) on each of the platform’s devices and determines both the optimal work partitioning and the device frequencies that maximize energy-performance during hybrid execution.

Our proposed approach demonstrates a 23% average improvement in energy-performance for a set of OpenCL workloads running on an off-the-shelf SoC platform, compared to the OS balanced-mode power policy.


1 Introduction

Processing machines, ranging from desktop PCs and servers (e.g., Xeons), through lower-power devices such as laptops, netbooks and ultrabooks, to smartphones, are frequently equipped with multiple types of processing devices: CPUs (Central Processing Units), integrated or discrete GPUs (Graphics Processing Units) and coprocessors (such as DSPs). Such a heterogeneous composition of powerful sub-devices helps applications capable of utilizing the whole system achieve an immense performance boost compared to running on a single device.

In recent years, a new class of hybrid systems has emerged in consumer electronics such as netbooks, ultrabooks and smartphones, mainly as low-end platforms like the NVidia Jetson TK1 [1], the Intel microarchitecture code-named Bay Trail [2] and AMD Kaveri [3]. These systems, known as SoCs (systems on a chip), are integrated and hybrid; they contain different types of processing elements, such as a set of CPUs, a set of GPU units and sometimes other audio/video/imaging HW accelerators, all sharing the same main memory and other resources to take advantage of the integration, and thus they have a shared power and thermal envelope. Programming these heterogeneous platforms is a non-trivial task, and so new programming languages, such as OpenCL, have been developed to enable generating code that can run simultaneously on different devices and efficiently share data among them.

Nevertheless, programmers face two types of challenges: executing faster and saving more power. The need for faster execution might arise for many different reasons, for example, to achieve better response time or to meet real-time constraints, while the need for power efficiency (total energy consumed per amount of completed work) can range from a user’s need for longer device battery life (low-power devices) up to a corporate need for cost savings dominated by the cost of power (desktops and servers).

Finding the right tradeoff between energy and performance (henceforth, energy-performance) for these programs is challenging, yet it is an essential optimization for many systems, such as low-end SoC platforms, for two reasons. First, energy and performance cannot always be optimized independently; e.g., increasing a device’s frequency may improve its performance linearly but degrade overall energy efficiency polynomially, so finding the optimal working point (power, frequency) in terms of both energy and performance is not trivial. Second, since SoC platforms usually use a single thermal plan for the entire system, increasing the power of one device may force lowering the power (via frequency) of other devices to meet the overall power and Thermal Design Power (TDP) limits of the platform. Consequently, concurrent utilization of the devices might incur a performance penalty on each of them; for example, a device’s frequency might be lowered to compensate for the extra power consumed by another device running in parallel.

It is very common to refer to the combined vector of power and performance as energy-performance. Equation 1 shows the EDP metric, which is very often used for evaluating the energy-performance of a workload execution (more details in subsection 5.2).

EDP = (∫_0^exectime Power(t) dt) × exectime = Energy × exectime

Equation 1: EDP formula
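Computed discretely, Equation 1 amounts to integrating sampled power readings over the run and multiplying by the execution time. A minimal sketch (the sample values and interval are illustrative, not measured data):

```python
# Minimal sketch: evaluating Equation 1 from sampled power readings.
# Power(t) is approximated by discrete samples taken every `dt` seconds.

def edp(power_samples, dt):
    """EDP = Energy x exec_time, where Energy = integral of Power(t) dt."""
    exec_time = len(power_samples) * dt   # seconds
    energy = sum(power_samples) * dt      # joules (rectangle rule)
    return energy * exec_time             # joule-seconds

# Example: a 2-second run at a constant 10 W
samples = [10.0] * 8                      # 8 samples, dt = 0.25 s
print(edp(samples, 0.25))                 # 20 J x 2 s = 40.0
```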


This metric has lately been shown to be essential in its own right [4] [5] [6], since it guarantees that we do not trade one of the individual targets for the sake of the other; for example, it penalizes doubling a workload’s speed at the cost of half the battery life, and favors a 30% speedup that costs only 15% of the battery time.

In this work we study the problem of how to optimize the energy-performance of an application running on a modern SoC that integrates a CPU and a GPU under power and thermal constraints. To this end, we propose a novel offline method, named Workload-based Heterogeneous Policy (WHP), which uses three steps: (1) characterize the program’s behavior (performance, energy) when executed solely on each of the system’s devices, (2) use a simple heuristic to partition the workload among the devices of the SoC platform, and (3) find near-optimal device frequencies in order to improve the overall hybrid execution of the program on this platform.
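The three steps can be sketched as an exhaustive search over (cpufreq, gpufreq, partition) configurations. The sketch below is an illustrative reconstruction, not the thesis implementation: the profiling tables (`time_of`/`power_of`), the toy numbers and the linear-scaling assumption are ours, and the shared-TDP interaction between the devices is ignored for brevity.

```python
# Illustrative sketch of the WHP search space, not the thesis implementation.
# Assumes offline profiling produced, per device and frequency:
#   time_of[dev][freq]  - time to run the WHOLE kernel on that device alone
#   power_of[dev][freq] - average power drawn by that device at that frequency
# alpha/PARTS of the work goes to the CPU, the rest to the GPU. Performance is
# assumed to scale linearly with the assigned share.

PARTS = 16  # partition granularity

def whp_search(time_of, power_of, cpu_freqs, gpu_freqs):
    best = None
    for cf in cpu_freqs:
        for gf in gpu_freqs:
            for alpha in range(PARTS + 1):
                t_cpu = time_of['cpu'][cf] * alpha / PARTS
                t_gpu = time_of['gpu'][gf] * (PARTS - alpha) / PARTS
                t = max(t_cpu, t_gpu)  # devices run in parallel
                energy = (power_of['cpu'][cf] * t_cpu +
                          power_of['gpu'][gf] * t_gpu)
                edp = energy * t
                if best is None or edp < best[0]:
                    best = (edp, cf, gf, alpha)
    return best  # (EDP, cpu freq, gpu freq, CPU share out of PARTS)

# Toy profile (made-up numbers): the GPU is faster for this kernel
time_of = {'cpu': {2000: 8.0}, 'gpu': {1150: 4.0}}
power_of = {'cpu': {2000: 10.0}, 'gpu': {1150: 12.0}}
print(whp_search(time_of, power_of, [2000], [1150]))
```

With these toy numbers the search settles on giving the slower CPU a minority share, balancing the two devices’ completion times.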

Overall, the research makes three main contributions:

- Suggests a new offline approach for optimizing the energy-performance of hybrid execution on SoC platforms.
- Shows that this optimization method is applicable to existing heterogeneous programming models, e.g., OpenCL.
- Demonstrates an average 23% energy-performance improvement for a set of public OpenCL kernels running on top of off-the-shelf SoC platforms, compared with the OS balanced-mode policy.

The rest of this thesis is organized as follows. Section 2 surveys related works. Section 3 defines basic terms we use throughout this document. Section 4 provides a short summary of the related technical background. Section 5 defines the research focus: the hybrid execution model assumed by our optimizing method and its mapping to OpenCL, along with the EDP metric we use for evaluating our method. Section 6 describes the energy-performance optimizing method itself. Section 7 provides the details of our experimental testbed and workloads. Section 8 presents and discusses the experimental results. Section 9 concludes with a summary and a short discussion of future work.


2 Related works

Prior research on optimizing heterogeneous systems can be categorized into two major domains: the first contains works that focus on harnessing heterogeneous system resources to improve performance while ignoring any power implications; the second contains works that aim for the best energy efficiency and so balance the energy a program consumes against its performance (delay).

In the first domain, some works suggested adaptive methods in which the ratio of work distribution per device is calculated based on offline and historical profiling of the heterogeneous application with different data sets [7] [8] [9]. Other works suggested static methods in which the partitioning of work is determined by compile-time analysis of code features and performance models that match code segments to the most efficient devices [10] [11] [12] [13]. Finally, a few works introduced online methods, in which the ratio of work distribution is calculated and refined on-the-fly during runtime [14] [15] [16]. All of these works, except for [16], targeted high performance on CPUs and discrete GPUs, and left designing methods for maximizing energy-performance to future research.

On the other hand, in the energy efficiency domain, prior studies have suggested models and techniques for optimizing energy on either the CPU [17] [18] [19] [20] or the GPU [21] [22]. These studies focus on a single device only, without considering the other device as a supporting compute unit, so their methods cannot be directly applied to CPU-GPU heterogeneous architectures. Very few works suggested methods for energy-performance improvement on heterogeneous platforms [5] [23]. These methods first map programs onto CPUs and discrete GPUs, and only then apply DVFS on each of these devices. Such methods assume that devices are independent in terms of power optimization, and so cannot be applied as-is to SoC platforms with a single power control for all devices.

We believe that our research is the first to address the problem of optimizing energy-performance for heterogeneous power-constrained SoC systems, where energy efficiency is crucial and optimizing it is a very challenging task.


3 Glossary and Assumptions

This section defines the terms we use throughout this thesis:

3.1 Terms

CPU: Central Processing Unit

GPU: Graphics Processing Unit

GPGPU: General-purpose computing on GPUs; refers to using the GPU’s computing capabilities for tasks beyond graphics, which was traditionally the only task performed by GPUs.

TDP: Thermal Design Power - represents the maximum amount of power the cooling system in a computer is required to dissipate.

EDP: Energy-delay product, a metric used to measure energy-performance efficiency. It’s the product E x T of the energy E consumed during the time span T.

Turbo Boost: A technology that enables the processor to run above its base operating frequency for a limited (usually short) timeframe via dynamic control of the processor’s operating frequency.

Power: Rate at which electric energy is transferred by an electric circuit; measured by the watt unit (W).

Hybrid Execution: A mode where the executed workload makes use of more than one processing device, such as CPU and GPU, multiple GPUs or multiple CPUs.

LLC: Last level cache. In Intel architectures, starting from Sandy Bridge, the last cache level also serves both the CPU and GPU in order to allow fast data sharing between them.

Performance: A measure of how long a workload took to complete execution.

HFM: High frequency mode of the device. Refers to a state where the device is being set to its highest frequency.

LFM: Low frequency mode of the device. Refers to a state where the device is being set to its lowest frequency.

PCH: Platform Controller Hub. A microchip by Intel that serves as a bridge between the system processor and the rest of the platform I/O, connected to the processor via DMI.

Energy-Performance: A measure of the energy efficiency of a particular workload execution on a specific system, combining the energy consumed with the execution time (see the EDP metric above).

OpenCL: Open Computing Language, a standard maintained by the Khronos Group for writing programs that execute across heterogeneous platforms.

SoC: System on a chip; refers to a chip which integrates multiple components, such as both CPU and GPU, on a single chip.

OpenCL Kernel: A kernel is a function declared in a program and executed on an OpenCL device


Package: Refers to the silicon die that includes the CPU, GPU and other platform components, such as caches, the system agent and buses. The package power, used later throughout this document, refers to the combined power of all components within the package.

3.2 Assumptions

For reasons of simplicity, throughout this document we consider SoC platforms that have one CPU and one GPU.

We assume that the code running on each device is independent of the code running on the other device.

This work focuses on devices with a thermal envelope fitted to a 17W TDP. We believe the same techniques can suit other thermal envelopes, but that is beyond the scope of this thesis.


4 Technical Background

In this section, we provide basic technical background on the different concepts used throughout the thesis. We start with an introduction to the SoC architecture, demonstrating the presented concepts using Intel Ivy Bridge. Then we explain the differences between the power policies supported by today’s operating systems. We conclude by going over basic concepts of the OpenCL framework, which we use as the hybrid execution model for our experiments in section 8.

4.1 SoC Architecture

System-on-a-chip (SoC) technology is the packaging of all the electronic circuits and parts necessary for a "system" (such as a smartphone, netbook or ultrabook) on a single integrated circuit (IC), generally known as a microchip. Recently, firms such as Intel, Apple, AMD and others have been using this technology to build different classes of systems, ranging from special server chips to small-sized systems, such as handhelds (iPads/smartphones), which can meet user requirements such as cost, device dimensions, performance and battery life.

Most current SoCs use a heterogeneous architecture that combines different types of devices, such as CPUs and GPUs, and sometimes also includes DSPs or other special-purpose hardware. For example, Intel introduced such technology starting from Sandy Bridge (32nm) [SNB [24]][IVB [25]][HSW [26]], AMD uses it in its APUs as part of its Fusion technology [Kev [27]], and Qualcomm offers the Snapdragon [SNP [28]]. All these SoCs facilitate resource sharing, such as caches and memory, in order to reduce cost and improve communication between the different integrated devices.

Figure 1 depicts the architecture of the IvyBridge SoC, one of Intel’s recent chips. This SoC also happens to be the one we use in our testbed (see 7.1). Note that the concepts described here are general, and we believe they apply to other SoC platforms as well.

The diagram illustrates an SoC composed of a set of four processors and a set of graphics cores. The SoC also includes caches (some private, some shared) and I/O interfaces. All these components are packaged into a single chip that connects through a specialized interconnect to the other system components, such as the RAM, PCH or other peripheral ports.

Figure 1 IvyBridge SoC Architecture

In these systems, the CPU and GPU share the same channels to the system RAM (the GPU is not equipped with an isolated memory chip as in discrete graphics). As a result, those channels may


become a contention spot during hybrid execution, when all components are active. For example, the channels may limit the bandwidth allocated to each device when all devices perform memory-intensive operations. In order to mitigate such contention, some SoCs share the last level cache (LLC) among all devices (for example, Intel IVB has a fully coherent cache between CPU and GPU). This cache level serves as a communication medium for both CPU and GPU, and can also be used for sharing data among them.

4.1.1 Shared power envelope

Device integration presents many advantages, such as resource sharing and consequently more efficient execution; however, it also presents new challenges. The components composing the SoC share the same power source and the same heat-sink and/or vent. Hence, they share the power and thermal budget available to the system.

As a result, their combined power consumption and thermal state is subject to a certain ceiling specified by the SoC manufacturer, referred to as the TDP (see definition in subsection 3.1). For example, in Figure 1, the CPU, GPU and other platform components (such as the System Agent) share the same power and thermal envelope, while the remaining platform components, such as the PCH and memory (DDR), are powered separately.

The SoC TDP budget is very often constrained due to the nature of these devices, being small and battery powered. It is also important to note that the SoC devices, such as the CPU or GPU, are often prioritized differently with respect to their thermal/power stake within the budget, which makes predicting their behavior more challenging.

4.2 OS power-policies

Managing power in operating systems can be very complicated, since a policy needs to be generic, answering the needs of various application types, while also being optimized for the targeted usage. To simplify this, leading operating systems such as Windows, Linux and Android traditionally support a set of power policies (this holds across consumer sectors: desktop, laptop and mobile), so that the user/operator can choose what best suits their needs. These power policies are designed to be optimal for different consumer usages, where power and performance demands may vary dramatically. Operating systems may differ in the power options they provide; however, the main concepts remain identical and map into three main categories. For the sake of the discussion, we focus on the Windows OS, since it is the one adopted in this research (see subsection 7.1); the same general concepts apply to other operating systems.

The Windows OS supports three power-policies: High-Performance, Power-save and Balanced [29]. The first two modes are the trivial ones.

- High-Performance: sets the devices to the highest allowed frequency (HFM), independent of device utilization. For the Convolution workload (see 7.3), Figure 2 demonstrates how the CPU and GPU frequencies are set by the system policy to ~2800MHz and ~1150MHz respectively. Under this policy, once a device leaves the idle state, determined by some minimal utilization threshold, the system automatically grants it HFM; all this, however, is subject to the system TDP limitations.


Figure 2 High-performance power policy. CPU is set to 2800MHz (CPU-HFM) and GPU to 1150MHz (GPU-HFM)

We can observe from the graph how the CPU is granted HFM even during the kernel execution setup, times [29-33] msec, which is characterized by low utilization, since it includes no computation, only buffer preparation and NDRange parameter setup (the kernel execution itself starts at time 34 msec).

- Power-save (sometimes called "maximum battery-life"): sets the devices to the lowest frequency, or very close to it; this usually depends on the frequency regulation range available to the OS for setting the devices’ frequency, which in most cases is bounded by 50% of HFM.

Figure 3 Power-save power policy. CPU is set to 800MHz (CPU-LFM) and GPU to 350MHz (GPU-LFM)

For the Convolution kernel (see 7.3) execution, Figure 3 demonstrates how the CPU and GPU frequencies are set to LFM (800MHz and 350MHz respectively) for the


complete run, even during the kernel execution phase itself, which has significantly higher utilization than the phases that come either before or after it.

- Balanced-mode: aims to balance energy consumption and system performance by adapting the processor speed to the application activity. This policy works by periodically sampling the system state, device utilization and thermal status, and then configuring the system devices for the next time window by adjusting frequencies according to the demands of the recently sampled window. The frequency adjustment is done via a HW mapping table (provided by the HW vendor) that, for each device separately, matches a frequency level to a utilization (load) level. The combination of the newly chosen frequencies must not violate the system TDP. If the system decides to enter turbo mode, it must follow an architectural priority model, defined by the hardware manufacturer, which dictates how the energy budget is divided between the devices.

Figure 4 Balanced power policy. CPU is set to ~2000MHz and GPU to HFMgpu.

Figure 4 demonstrates the frequency behavior over time for the Convolution workload execution (see 7.3). It clearly shows how the kernel execution phase is distinguishable from the other phases, where the CPU and GPU are set to their corresponding LFM frequencies (800MHz and 350MHz respectively). During the kernel execution phase, times [34-43] msec, the CPU frequency is set to ~2000MHz, which falls within the range [800MHz (LFM) - 2800MHz (HFM)], while the GPU is set to ~HFM.

The description above explains the general concepts of the power policies found in today’s operating systems; Windows is not special in this respect. For example, the counterparts of High-performance, Balanced and Power-save on Android/Linux systems are performance, ondemand (among others) and powersave [30].
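The mapping-table adjustment described for balanced-mode can be sketched as follows. The table values and bucket boundaries below are invented for illustration (not a real vendor table), and the TDP check on the combined device choice is omitted:

```python
# Hypothetical sketch of the balanced-mode adjustment step: each sampling
# window, the measured utilization is mapped to a frequency through a
# vendor-style per-device table. Values are illustrative only.

# (utilization upper bound, frequency in MHz)
CPU_TABLE = [(0.25, 800), (0.50, 1400), (0.75, 2000), (1.01, 2800)]

def pick_freq(table, utilization):
    """Return the frequency of the first utilization bucket that fits."""
    for bound, freq in table:
        if utilization < bound:
            return freq
    return table[-1][1]

# Low utilization (e.g. kernel setup) stays at LFM; high utilization gets HFM
print(pick_freq(CPU_TABLE, 0.10))   # 800 (LFM)
print(pick_freq(CPU_TABLE, 0.95))   # 2800 (HFM)
```

A real governor would evaluate such a table per device each window and clamp the combined choice to the TDP, as described above.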


4.3 OpenCL

Open Computing Language (OpenCL) is a standard for general-purpose, heterogeneous parallel programming [31]. OpenCL consists of an API for coordinating parallel computation across heterogeneous processors, with a well-specified computation environment. Using OpenCL, workload developers may write their code once and have it run on any device in the system that supports OpenCL, without the need to re-code it in a device-specific native language, such as DirectX or CUDA for NVidia.

Systems supporting OpenCL expose a set of OpenCL devices, such as CPU, GPU and DSP, which carry out the user’s commands and execute the application kernels; these are OpenCL programs written in the OpenCL-C language, which define the functions to run on the OpenCL devices. A typical OpenCL application is composed of two parts: a host part, which takes care of the input/output setup as well as the setup code initializing the OpenCL kernel; and the OpenCL kernel program itself.

The OpenCL kernel execution can be carried out by any of the OpenCL devices (except for rare cases where the kernel is using device-specific extensions) or by a cooperation of several devices (see 4.3.1‎ ).

Different devices may have different ISAs (instruction set architectures), and the OpenCL compiler needs to know how to generate code for each device in the system. Each device has one or more command-queues, which hold commands received from the application to be executed by the device.

In OpenCL, developers are also required to define the execution space, which can be a 1D, 2D or 3D space over which the kernel function is performed. Figure 5 illustrates the execution space used by the OpenCL specification, a.k.a. the NDRange; the user kernel is applied to each point in that space. Whenever the kernel function is applied to a given point in the NDRange space, it receives as input the location coordinates of that point within the NDRange space, facilitating kernel functions whose output depends on the processed element’s location. Invocations of a kernel are done in parallel and are denoted "work-items", or "work-groups" whenever we refer to a collection of them.

In addition, OpenCL allows multiple OpenCL devices to reside within the same OpenCL context, facilitating memory object setup and execution synchronization between the devices. For example, in a shared OpenCL context with both CPU and GPU, any instantiated memory object is automatically shared between the devices within the context, without any explicit synchronization between them at the host application level.



Figure 5 OpenCL Kernel NDRange1

4.3.1 OpenCL kernel splitting for Hybrid execution

Figure 6 illustrates an example of how an NDRange can be split between the OpenCL devices, CPU and GPU. Splitting is initially specified at the application level (host) and then submitted to the OpenCL framework to be executed by the OpenCL devices. The example shows how the NDRange global range is divided into 16 equal portions (16 is an arbitrary division factor used for demonstration), assigning partitionFactor portions to the CPU, while the remaining portions are assigned to the GPU. The split results in two distinct sequential NDRanges, offset from each other.

1 Figure borrowed from OpenCL 1.2 specification document : https://www.khronos.org/registry/cl/specs/opencl-1.2.pdf



Figure 6 Partitioning OpenCL kernel NDRange execution between CPU and GPU.
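The split depicted in Figure 6 reduces to simple offset arithmetic. The sketch below assumes a 1D global range and illustrative names; in real host code, each sub-range’s global offset and size would be passed to the per-device kernel enqueue:

```python
# Sketch of splitting a 1D NDRange between CPU and GPU by a partition factor,
# as in Figure 6. Names are illustrative; a real host would submit each
# (offset, size) sub-range to the corresponding device's command-queue.

def split_ndrange(global_size, partition_factor, parts=16):
    """Return (offset, size) sub-ranges for CPU and GPU."""
    portion = global_size // parts
    cpu_size = portion * partition_factor
    cpu = (0, cpu_size)                       # CPU takes the first portions
    gpu = (cpu_size, global_size - cpu_size)  # GPU takes the offset remainder
    return cpu, gpu

cpu, gpu = split_ndrange(1024, 6)             # 6 of 16 portions go to the CPU
print(cpu, gpu)                               # (0, 384) (384, 640)
```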


5 This research focus

In this section we describe the hybrid execution model assumed by our proposed method, WHP, which is introduced in section 6. We also describe the EDP metric used for evaluating WHP’s energy-performance enhancements.

5.1 The hybrid execution model

As previously described in section 1, this research focuses on SoC heterogeneous systems equipped with a CPU and a GPU. In this section we describe the hybrid execution model we assume for the applications we optimize on such systems. The characteristics of this model are:

• Parallelism: heterogeneous computation is represented by a large pool of tasks which can be executed in parallel on different platform devices. Each task works on its own dataset, with zero or little data sharing between tasks.
• Homogeneity: all tasks are homogeneous in size; they perform about the same amount of computation.
• Device-agnostic: tasks are not bound to a specific device and can be executed on either the CPU or the GPU.
• Scalability: tasks can fully utilize the computing resources of each device.

Such an execution model is naturally realized in heterogeneous programming models such as OpenCL.

5.1.1 Mapping to the OpenCL model

Table 1 maps each characteristic of the execution model described above to the OpenCL framework, which was introduced in subsection 4.3.

Hybrid execution model | OpenCL model
Parallelism | The global range of an OpenCL kernel is split and processed in parallel by multiple instances of the kernel on different devices. Each kernel work-item or work-group (depending on the specific OpenCL implementation) operates on its own data sub-range.
Homogeneity | An OpenCL kernel which has no code divergence based on its location within the global range (NDRange).
Device-agnostic | OpenCL kernels are device-agnostic by definition.
Scalability | A typical kernel work-group is composed of hundreds of work-items. Often, a few work-groups suffice to occupy all processing elements of a device.

Table 1: Mapping the hybrid execution model onto OpenCL

Additionally, in order to allow hybrid execution of a single OpenCL kernel over the CPU and GPU while experimenting with different allocation ratios (see Figure 7), we use the NDRange split capability described in subsection 4.3.1.


Figure 7 Depicting kernel partitioning by a partition factor

5.2 EDP: energy-performance metric

Equation 1 introduces the energy-delay-product metric, EDP, initially proposed by [4]:

EDP = E × T

Equation 1: The energy-delay-product metric

This metric is used for evaluating energy-performance and aims at finding the right tradeoff between energy and performance. In EDP, the execution time is weighted by the amount of energy consumed; in other words, EDP penalizes every delay in execution time by multiplying it by the total consumed energy.

Reaching an optimal execution point that minimizes this equation can be achieved either by saving more energy or by boosting performance (lowering execution time). This is the basis for our proposed approach in Section 6. The main reason why monitoring energy alone isn't sufficient, even though it internally accounts for execution latency (Equation 2), is that energy is proportional to T while EDP is proportional to T². Hence, energy improvements can often have a significant adverse effect on performance, which is something we would like to avoid (the same approach is adopted by [32]).

E = ∫₀^exectime Power(t) dt ≈ cv²

Equation 2: Energy formula

EDP is different from other energy-oriented metrics, such as PDP (power-delay-product). PDP is the energy consumed for executing an application, obtained by integrating the actual power consumption over the execution time. PDP is proportional to cv² and can be decreased by lowering the supplied voltage (assuming constant capacitance), but the latter has an adverse effect on performance, which we try to avoid. Hence, PDP is an appropriate metric whenever battery life is the main concern and performance is a lower priority.

Given that our target is energy-performance, EDP is the metric we chose.
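The distinction between the two metrics can be made concrete with a small numeric sketch. The values below are illustrative, and constant power during execution is assumed, as in our execution model:

```python
# Sketch: why EDP penalizes slowdowns more than energy (PDP) alone.
# With constant power P, energy scales with T while EDP scales with T^2.
def energy(power_w, time_s):          # PDP: E = P * T  (Joules)
    return power_w * time_s

def edp(power_w, time_s):             # EDP: E * T = P * T^2
    return energy(power_w, time_s) * time_s

# Halving power while doubling execution time keeps energy unchanged...
assert energy(10, 2.0) == energy(5, 4.0) == 20.0
# ...but doubles EDP, so EDP rejects this trade-off:
assert edp(5, 4.0) == 2 * edp(10, 2.0)
```

This is exactly the behavior we want from an energy-performance metric: a configuration that merely trades performance for energy does not improve its EDP score.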


6 WHP: an energy-performance optimizing method

This section describes WHP, an offline method for improving the energy-performance efficiency of hybrid execution on heterogeneous SoC platforms.

Optimal execution of a heterogeneous task, in terms of energy-performance, should involve workload-specific considerations. On the thermal/power side, static system rules, such as always prioritizing the GPU over the CPU, or setting a device frequency as a function of the device utilization level, ignore the workload's relative power and performance efficiency across the devices, and may therefore miss significant energy-performance enhancements. Moreover, as described in subsection 4.1, SoC platforms are very often tightly power-constrained, making the co-utilization of the system devices for hybrid execution a challenging task.

The lack of such a workload-aware policy that also considers the system limitations creates the need for a central system policy, one which can both control the devices' power operation and load-balance a given task, all while taking the specific task's needs into consideration.

In this research, we target our optimizations at tasks featuring more compute than I/O, following the assumption that such tasks suffer less from contention on shared system resources such as caches, RAM and buses.

6.1 Observations

We begin by enumerating the base observations WHP relies on; we substantiate each observation later, in subsection 8.1. These observations are used for designing WHP, our proposed method for enhancing energy-performance.

6.1.1 Constant power consumption

Given that the execution model is scalable (see subsection 5.1), WHP assumes constant power consumption of the SoC package throughout program execution. This holds because the devices are able to remain saturated during execution, keeping utilization constant, which in turn results in constant power. We also observe that, for a given frequency, the device power level is the same in single-device execution and in hybrid execution.

6.1.2 Performance is linear

Performance (execution time), for the type of kernels we focus on, is linear even when both devices execute in parallel, cooperating to complete the overall task. This means that if a device requires time T to complete a task, then the same device requires α×T time to complete a fraction α of the same task. The justifications for this assumption are:

• We use a scalable framework that fully saturates a device when executing a large group of homogeneous tasks (see 5.1.1). Hence, executing a fraction α of such a large group still results in a large number of homogeneous tasks to be executed.
• The kernels we focus on are compute kernels. For such kernels, where high contention is not expected on the devices' shared resources, such as memory and caches, scalability remains viable even when both devices operate in parallel.
• WHP aims at keeping the system from violating the TDP limit (Turbo mode) by selecting an appropriate system configuration, resulting in a situation where no power tradeoff takes place (see 6.3).

6.1.3 Package power offset

WHP statically computes the platform power during hybrid execution by Equation 3. The CPU power and GPU power represent the power during program execution only on the CPU and only on the GPU, respectively (see ahead). The equation also assumes an additional constant power for the other platform components, such as memory caches, the I/O controller and the DMA controller (denoted platform-power-offset_constant).

Platform power (hybrid) = CPU-only power + GPU-only power + platform-power-offset_constant

Equation 3: Platform power during hybrid execution calculated by the WHP algorithm

Equation 3 expresses the relation between the package power consumption and its sub-components. This fixed-offset behavior can be ascribed to the compute-bound nature of the kernels, which do not impose any significant contention on the uncore components (caches/DMI, see Figure 1), eliminating any noticeable variance in the power measurements of those components.

6.1.4 The TDP budget interferes with hybrid execution on SoC systems

On SoC platforms, hybrid execution using multiple devices very often hits the TDP limit of the system, resulting in a situation where a device is not granted the same power share as in single-device execution mode. This behavior is unlikely to happen on discrete systems, where the TDP limit is comparatively high.

Figure 8(a) shows how the CPU is granted ~10.5W in CPU-only execution (GPU idle). Similarly, Figure 8(b) shows how the GPU is granted ~15W in GPU-only execution (CPU idle). On the other hand, Figure 8(c) shows the power behavior over time for the hybrid execution of the same workload, which does not resemble a composition of the per-device power behaviors. We can observe that during hybrid execution [35-120]msec, the GPU is granted only 11W (not 15W), while the CPU is granted only 6W (not 10.5W). Once the GPU has finished execution, at time 120msec, the workload returns to CPU-only execution mode and the CPU is again granted the ~10.5W observed before.


(a) CPU executing the FloatVirus OpenCL kernel while the GPU is idle
(b) GPU executing the FloatVirus OpenCL kernel while the CPU is idle
(c) CPU and GPU executing the FloatVirus OpenCL kernel in parallel

Figure 8 SoC Power behavior on Hybrid execution (TDP 17W)

On the other hand, Figure 9(a)-(c) shows the same scenario as Figure 8 but on a discrete system, with a 77W TDP. We can observe that the power behavior over time for the hybrid execution is a composition of the single-device executions of the CPU and GPU. In hybrid execution (Figure 9(c)), the CPU and GPU are granted the same 30W and 7W, respectively, that they were granted in single-device execution (Figure 9(a) and Figure 9(b)).

In short, unlike on discrete platforms, on SoC platforms, which are characterized by a low TDP budget, it is not trivial to deduce the power behavior over time of a hybrid execution from the corresponding single-device power charts.


(a) CPU executing the FloatVirus OpenCL kernel while the GPU is idle
(b) GPU executing the FloatVirus OpenCL kernel while the CPU is idle

(c) CPU and GPU executing the FloatVirus OpenCL kernel in parallel

Figure 9 Discrete power behavior on Hybrid execution (TDP 77W)

6.2 Device completion assumption

WHP assumes that optimal execution is achieved when the CPU and the GPU finish execution at the same time. The justification is intuitive: an idling device still consumes static and leakage power, so we prefer all devices to finish their work at the same time. Moreover, the platform-power-offset (see Equation 3) is constant as long as any of the devices is running [33]; hence, we would like both to finish as soon as possible. The same heuristics and rationale were used in other works [5].

6.3 Algorithm

The proposed algorithm is based on three stages: (1) characterize the application in terms of power and performance; (2) given the collected application characteristics, determine the system configuration, in terms of work partitioning and device frequencies, among all possible combinations; and (3) configure the system and the application partitioning according to the chosen settings.

Let cpufreq denote the CPU frequency and gpufreq the GPU frequency. Also, let α denote the work-partitioning factor between the devices, in the range [0…1]. WHP aims at finding the configuration tuple (cpufreq, gpufreq, α) which provides the minimal EDP value for a given program.


6.3.1 Stage 1: Constructing FT and FP

In the first phase, we measure the power (FP) and execution time (FT) of the entire program on the CPU and on the GPU, running at a constant device frequency cpufreq or gpufreq. We repeat the measurements for all frequency values, ranging from the lowest frequency (LFM) to the highest frequency (HFM) per device¹. We also average the differences between the platform power and the sum of the CPU power and GPU power over all measurements, to compute the constant offset platform-power-offset_constant (Equation 3).
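The offset computation at the end of Stage 1 can be sketched as follows. The sample values below are illustrative, not our measurements:

```python
# Sketch of the Stage-1 offset estimation: average the difference between
# the measured package power and the sum of CPU and GPU power over all
# single-device measurement runs.
def platform_power_offset(measurements):
    """measurements: list of (package_w, cpu_w, gpu_w) samples."""
    diffs = [pkg - cpu - gpu for pkg, cpu, gpu in measurements]
    return sum(diffs) / len(diffs)

samples = [(8.5, 5.9, 0.0),   # e.g. a CPU-only run: package 8.5W, CPU 5.9W
           (7.2, 4.8, 0.0),   # another CPU-only run
           (9.6, 0.0, 7.0)]   # e.g. a GPU-only run
offset = platform_power_offset(samples)  # ~2.5W, within the 2-3W range we observe
```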

6.3.2 Stage 2: Determine the optimal configuration

This is done by a brute-force search over all possible combinations of cpufreq, gpufreq and α values. WHP derives the hybrid execution time from the execution times of each device (FT). For example, if a computation takes execution time T on the CPU at frequency f, then it requires Tα = α×FT(f) time to complete a fraction α of that computation on the CPU at the same frequency. This assumption derives from the homogeneity and scalability of our execution model (Section 5.1).

Similarly, WHP derives the power consumption throughout the hybrid execution from the FP tables of each device, using Equation 3. For example, the package power consumption for a hybrid execution at frequencies f1 and f2 of the CPU and GPU, respectively, is estimated as: FP_cpu(f1) + FP_gpu(f2) + platform-power-offset_constant.

We relate further to this and the above observations in our experiments (Section 8).

[1]  for α in [1/NumPartitions, 2/NumPartitions, …, 1]:   // try NumPartitions values
[2]    for cpufreq in [LFM_CPU … HFM_CPU, step=200MHz]:   // 200MHz is the default frequency step
[3]      for gpufreq in [LFM_GPU … HFM_GPU, step=200MHz]:
[4]        cputime = FT(cpufreq) × α
[5]        gputime = FT(gpufreq) × (1 − α)
[6]        if abs(cputime − gputime) > TimeDiffThreshold:  // ignore when times are different
[7]          continue
[8]        if FP(cpufreq) + FP(gpufreq) + platform-power-offset_constant > TDP:
[9]          continue
[10]       Time = (cputime + gputime) / 2
[11]       Power = FP(cpufreq) + FP(gpufreq) + platform-power-offset_constant
[12]       Energy = Power × Time
[13]       EDP = Energy × Time
[14]       Table[{cpufreq, gpufreq, α}] = EDP
[15]     end
[16]   end
[17] end
[18] ret{cpufreq, gpufreq, α} = min(Table)
[19] return ret{cpufreq, gpufreq, α}

Figure 10: The WHP algorithm

1 In practice, the number of measurements can be reduced by using other methods, such as per-device performance and power models.


The second phase of WHP is described in Figure 10. We apply a brute-force search over the range of cpufreq, gpufreq, and α values (lines 1-3) and apply a few simple heuristics based on the above assumptions. For each configuration triple (cpufreq, gpufreq, α), we compute the relative execution times on the CPU and GPU (lines 4-5) based on the device execution times from the first phase (FT). For the reasons explained above, we only consider cases of "equal" device time (lines 6-7) and platform power that does not exceed the platform TDP (lines 8-9). For those cases, we compute the overall hybrid execution time (line 10) and energy (lines 11-12) based on the power measurements from the first phase (FP). Finally, the result is the triple (cpufreq, gpufreq, α) which yields the minimal EDP value, and hence the best energy-performance efficiency, for the given program (lines 18-19).
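For concreteness, the search of Figure 10 can be sketched in Python as below. The function name whp_search and all values in the usage example are illustrative, not the thesis implementation or measurements; the FT/FP tables are dictionaries from device frequency (MHz) to measured single-device execution time (seconds) and power (watts).

```python
# Illustrative Python sketch of the Stage-2 brute-force search (Figure 10).
def whp_search(FT_cpu, FP_cpu, FT_gpu, FP_gpu, offset_w, tdp_w,
               num_partitions=16, time_diff_threshold=0.15):
    """Return (EDP, (cpufreq, gpufreq, alpha)) minimizing EDP."""
    best = None
    for i in range(1, num_partitions + 1):
        alpha = i / num_partitions
        for cpufreq, cpu_t in FT_cpu.items():
            for gpufreq, gpu_t in FT_gpu.items():
                cputime = cpu_t * alpha              # line [4]
                gputime = gpu_t * (1 - alpha)        # line [5]
                if abs(cputime - gputime) > time_diff_threshold:
                    continue                         # devices must finish together
                power = FP_cpu[cpufreq] + FP_gpu[gpufreq] + offset_w
                if power > tdp_w:
                    continue                         # stay within the TDP budget
                time = (cputime + gputime) / 2
                edp = power * time * time            # Energy * Time = P * T^2
                if best is None or edp < best[0]:
                    best = (edp, (cpufreq, gpufreq, alpha))
    return best

# Toy FT/FP tables with two frequencies per device (illustrative values):
best = whp_search(FT_cpu={2000: 3.5, 2800: 2.5}, FP_cpu={2000: 8.0, 2800: 14.9},
                  FT_gpu={350: 9.0, 1150: 2.9}, FP_gpu={350: 1.8, 1150: 6.2},
                  offset_w=2.5, tdp_w=17.0)
# best[1] == (2000, 1150, 0.4375): α = 7/16 at 2000/1150MHz survives both filters
```

In this toy run, the higher CPU frequency is pruned by the TDP filter (lines 8-9 of Figure 10) and only one α value yields near-equal completion times, mirroring how the real search narrows the configuration space.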

6.3.3 Stage 3: Apply the (cpufreq, gpufreq, α) settings

Given the settings chosen by WHP (Figure 10), we apply the kernel-partitioning factor α to execute the kernel in hybrid fashion (see subsection 4.3.1), while setting the CPU and GPU frequencies to cpufreq and gpufreq, respectively.

6.4 Walk-through example

The following is a walk-through example of applying the WHP method presented in Figure 10 to the AES256 workload, which we later use in the experiments section.

Stage 1: We start by constructing FT and FP, as required by WHP. We measure the execution time (FT) and power consumption (FP) of the application executing on the CPU and on the GPU separately. Recall that the FT and FP construction phase involves no load-balancing or work partitioning. For the AES256 kernel, Table 2 shows the FT and FP values for each of the CPU and GPU devices.

Table 2 FT and FP for AES256 kernel

Stage 2: Using the Table 2 data as input, the WHP method (Figure 10) returns the following as the best energy-performance configuration:

{cpufreq, gpufreq, α} = (2000, 1150, 6/16), energy-performance = ~54

Table 3: WHP result for the AES256 kernel


This implies we should expect an energy-performance score of ~54 for the following configuration:
• α = 6/16 (6/16 for the CPU and 10/16 for the GPU)
• CPU frequency of 2000MHz
• GPU frequency of 1150MHz

According to WHP, this is the best we can get in terms of energy-performance among the configurations it examined.

Stage 3: Finally, in Figure 11, we confirm the WHP results by configuring the system devices according to the recommended frequencies (Table 3) and partitioning the kernel according to the suggested partitioning factor α. Indeed, the resulting energy-performance score is 55.1, very close to the value of 54 predicted in Table 3.

Figure 11 WHP solution for the AES256 workload, CPUfreq=2000MHz, GPUfreq=1150MHz, α=6/16. The energy-performance score is 55.1

The executed configuration finished the workload in 1837msec while consuming 30.1J of energy, which amounts to an energy-performance score of ~55; this is about 2.5% off the WHP projection, very close to our expectations. Later, in Figure 18, we show that this WHP result represents about a 19% energy-performance improvement over our chosen baseline, the OS balanced policy.


7 Experimental testbed

In this section we introduce the hardware testbed, along with the tools and the methodology we used for conducting our experiments.

7.1 The testbed hardware

Table 4 shows the details of the SoC platform we used in our experiments. This SoC has a CPU and a GPU with a shared cache (LLC), all located on the same silicon die and sharing a single power-management scheme (see Figure 1).

Component | Details
System | Intel Ultrabook, 17W TDP
Platform | Intel microarchitecture codename Ivy Bridge
CPU | i7-3667U, 2 cores / 4 threads, 2.0GHz - 3.2GHz
GPU | HD Graphics 4000, 350MHz - 1.15GHz
RAM | 4.0GB DDR3L
OS | Windows 8 64-bit
Power | AC
Max memory bandwidth | 25.6 GB/sec
Memory frequency | 1333MHz

Table 4: The testbed platform specification

7.2 Measurement methodology

In this research we focus the energy-performance optimizations on the kernel-execution phase, which is the only phase where hybrid execution applies. In our experiments, we isolate the kernel execution from the whole application and measure only from the point a kernel is submitted for execution on the target device (or devices) up to the point the command has completed and control returns to the host.

[1]  while i < N:
[2]    Pre-Processing(…)
[3]    clEnqueueWriteBuffer(queue, inputbuffer)
[4]    clFinish(queue)
[5]    <WHP_START_MEASUREMENTS>
[6]    clEnqueueNDRange(queue, kernel)
[7]    clFinish(queue)
[8]    <WHP_STOP_MEASUREMENTS>
[9]    clEnqueueReadBuffer(queue, outputbuffer)
[10]   clFinish(queue)
[11]   Post-Processing(…)
[12] endwhile

Figure 12 WHP instrumentation for kernel execution


Figure 12 shows the instrumentation we add to the OpenCL workloads in order to measure the execution time, power, frequency and energy of the kernel-execution parts. These measurements are collected via platform model-specific registers (MSRs)¹.

Ideally, we would have liked to combine LLC and RAM measurement tools (see the description in 4.1), but due to the lack of such tools on real hardware we were not able to do so. Adding LLC-contention measurements, as a metric for estimating collisions when co-executing a workload on the CPU and GPU in parallel, together with RAM frequency control, in order to save RAM power when memory bandwidth is not a bottleneck, might provide a substantial improvement to our method (Section 6). However, as noted, such enhancements are left as future work. In retrospect, this limitation is tolerated well in our experiments, mainly because we focus on compute-intensive kernels, which stress the processors (CPU or GPU) more than shared components such as memory.

7.2.1 Measuring power

These are the MSRs we used to measure the power of the CPU, GPU and package:

MSR_PP0_ENERGY_STATUS | Reports the actual energy usage of the processor cores.
MSR_PP1_ENERGY_STATUS | Reports the actual energy usage of a specific device on the uncore (the GT GPU).
MSR_PKG_ENERGY_STATUS | Reports the actual energy usage of the complete package, including the processor cores and the uncore.

Figure 13 Power MSRs

7.2.2 Measuring performance (execution time)

These are the registers we used to sample and control the frequencies of the CPU and GPU:

IA32_PERF_STATUS, IA32_PERF_CTL | Read/write the processor-core (CPU) frequency.
Two Intel-internal MCHBAR registers (confidential) | Read/write the GPU core frequency, bypassing OS or driver interference.

Figure 14 Frequency MSRs

In order to acquire such control over the GPU frequency, we apply modifications to the SpeedStep technology (codenamed Geyserville) which are not described here, for intellectual-property reasons.

7.3 Workloads

Table 5 lists the kernels we use in the testbed, summarizing the performance (execution time) and energy consumption of each kernel. The data represents kernel executions on each device separately, at the device's highest frequency. The last two columns of the table show the execution-time and consumed-energy differences between the CPU and GPU, which reach an x4 gap in the worst case, still within the acceptable range for applying load-balancing. The metrics presented here, and later in Section 8, refer to the OpenCL kernel-execution portion only.

1 http://www.intel.com/content/www/us/en/processors/architectures-software-developer-manuals.html


Workload | CPU time (msec) | CPU energy (Joules) | GPU time (msec) | GPU energy (Joules) | Max time ratio diff between the devices | Max energy diff between the devices
AES256¹ | 3491 | 59.9 | 2915 | 26.2 | 120% | 229%
Gaussian Noise² | 923 | 16.2 | 245 | 5.8 | 377% | 279%
Vectorhypot³ | 200 | 5 | 861 | 15.2 | 431% | 304%
ColorConversion⁴ | 512 | 10.5 | 1318 | 23.6 | 257% | 225%
BillateralFilter⁵ | 457 | 8.1 | 156 | 4.8 | 293% | 169%
MersenneTwister⁶ | 322 | 3.7 | 221 | 2.2 | 146% | 168%
Convolution⁷ | 619 | 10.1 | 1419 | 16 | 229% | 158%

Table 5 Characteristics of the testbed kernels

¹ https://github.com/softboysxp/OpenCL-AES
² AMD APP SDK
³ Nvidia Sample SDK
⁴ Intel IPP OpenCL Library
⁵ https://github.com/GNOME/gegl/blob/master/opencl/bilateral-filter.cl
⁶ Nvidia Sample SDK
⁷ Self-authored convolution


8 Experiments

In this section we present and discuss all the experimental results. We also demonstrate the observations presented earlier in subsection 6.1.

8.1 Demonstrating the observations

In this subsection, we demonstrate each observation made in subsection 6.1, by way of a proof example using one or more of the workloads from the testbed kernels (7.3).

8.1.1 Constant power consumption

Starting with observation 6.1.1, Figure 15 demonstrates that the CPU power consumption is stable, not fluctuating, during execution of the Convolution kernel, both at the low and at the high frequency levels (charts (a) and (b)). The same behavior holds for the GPU, as (c) and (d) demonstrate. This verifies the claimed constant-power behavior.

(a) CPU LFM; (b) CPU HFM; (c) GPU LFM; (d) GPU HFM

Figure 15 Convolution kernel: power behavior of the CPU and GPU at both LFM and HFM frequencies.

8.1.2 Performance is linear

As for assumption 6.1.2, Figure 16 demonstrates that CPU and GPU performance is linear in the amount of work assigned to them. In Figure 16(a), it takes the CPU about CPU_execution_time = 4.891 secs to complete execution of the AES256 kernel at a frequency of 2000MHz, while it takes the GPU about GPU_execution_time = 2.915 secs at its HFM frequency (Figure 16(b)). Both of these time measurements refer to execution on each device separately, with the other device kept idle. Finally, in Figure 16(c), we let both devices cooperate in executing the same task, with α = 6/16 of the task assigned to the CPU and the rest assigned to the GPU. The resulting per-device execution times are CPU_time = 1.841 secs and GPU_time = 1.820 secs, which closely match the estimated relative times α × CPU_execution_time and (1−α) × GPU_execution_time given by the linearity assumption.
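The linearity check above amounts to the following arithmetic, using the measurements just quoted:

```python
# Checking the linearity estimate against the AES256 measurements quoted
# above: each device's hybrid execution time should equal its single-device
# time scaled by its share of the work.
cpu_only_time = 4.891   # sec, CPU alone at 2000MHz
gpu_only_time = 2.915   # sec, GPU alone at HFM
alpha = 6 / 16          # fraction of the NDRange assigned to the CPU

est_cpu = alpha * cpu_only_time        # ~1.834 sec (measured: 1.841)
est_gpu = (1 - alpha) * gpu_only_time  # ~1.822 sec (measured: 1.820)
```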

(a) CPU only execution (b) GPU only execution

(c) hybrid execution

Figure 16 Performance linearity of the AES256 kernel: power behavior in three execution modes: CPU-only at 2000MHz, GPU-only at 1150MHz, and hybrid at 2000/1150MHz, respectively.

8.1.3 Package power offset

Finally, with respect to 6.1.3, Table 6 lists the power consumption, in the form [CPU_watt + GPU_watt] / Package_watt, for each workload's single-device execution, as a function of the corresponding device frequency. The data confirms a package offset of 2-3W, in line with the fixed package-power offset expressed by Equation 3.


CPU freq (MHz) | AES256 | Gaussian Noise | Vectorhypot | ColorConversion | BillateralFilter | MersenneTwister | Convolution
800 | 3/5 | 2.8/4.9 | 2.7/4.9 | 3/5.1 | 2.4/4.3 | 2.7/5.4 | 2.3/4.7
1000 | 3.8/5.7 | 3.5/5.5 | 3.4/5.7 | 3.7/5.8 | 3/4.9 | 3.3/6.2 | 2.8/5.2
1200 | 4.5/6.4 | 4.1/6.2 | 4/6.3 | 4.4/6.5 | 3.5/5.4 | 3.9/7 | 3.3/5.8
1400 | 5.3/7.1 | 4.9/6.8 | 4.6/7 | 5.1/7.2 | 4.1/6 | 4.5/7.9 | 3.8/6.5
1600 | 6/7.8 | 5.5/7.5 | 5.3/7.7 | 5.8/7.9 | 4.7/6.6 | 5/8.4 | 4.2/7
1800 | 6.7/8.6 | 6.3/8.2 | 5.9/8.5 | 6.5/8.7 | 5.3/7.2 | 5.5/9 | 4.8/7.6
2000 | 8/9.9 | 7.3/9.3 | 7.0/9.6 | 7.7/9.9 | 6.2/8.1 | 6.4/9.8 | 5.6/8.4
2200 | 9.2/11.2 | 8.5/10.5 | 8.2/10.9 | 9.0/11.4 | 7.3/9.2 | 7.4/10.9 | 6.4/9.4
2400 | 11/12.8 | 10.1/12 | 9.6/12.3 | 10.6/12.9 | 8.6/10.5 | 8.2/11.7 | 7.6/10.4
2600 | 12.6/14.6 | 11.8/13.7 | 11.1/13.9 | 12.2/14.5 | 10.1/11.9 | 9.7/13.2 | 8.6/11.7
2800 | 14.9/16.7 | 13.7/15.6 | 13/16 | 14.3/16.7 | 11.7/13.6 | 11.1/14.5 | 10/13

GPU freq (MHz) | AES256 | Gaussian Noise | Vectorhypot | ColorConversion | BillateralFilter | MersenneTwister | Convolution
350 | 1.8/4.3 | 3.4/6 | 3.2/5.5 | 3.2/5.1 | 3.6/5.4 | 2.9/5.1 | 2.4/5.3
550 | 2.5/4.8 | 5.3/7.9 | 4.8/7.2 | 5.1/6.9 | 5.6/7.5 | 4.4/6.6 | 3/6.2
750 | 3.3/5.7 | 7.8/11 | 7/9.6 | 7.3/9.7 | 8.3/10.2 | 6.6/8.9 | 3.9/7.4
950 | 4.8/7.1 | 11.7/14.6 | 10.2/12.7 | 11/12.8 | 12.3/14.2 | 9.6/12 | 5.2/8.7
1150 | 6.2/8.7 | 15.6/18.2 | 13.8/16.2 | 14.5/16.6 | 16.6/18.5 | 12.7/15.3 | 6.7/9.6

Table 6 Testbed kernels' power consumption, (CPU_watt + GPU_watt) / Package_watt. Confirms Equation 3.

For example, for the Vectorhypot kernel executing on the CPU at frequency 1800MHz (single-device execution), the combined power consumption of the CPU and GPU is 5.9W, while the total package power consumption is 8.5W. Subtracting these two quantities gives 2.6W for platform-power-offset_constant.

8.2 Applying WHP

Table 7 shows the resulting EDP when applying the WHP-chosen settings to each experimented kernel. As shown in Figure 10, in addition to the workload-partitioning factor, the WHP output also specifies the frequency settings of the CPU and GPU devices.

Table 7 provides the execution time, the consumed energy and the work-partitioning factor for each experimented kernel. It also includes the EDP values computed by WHP. All runtime EDP values are greater than the corresponding computed EDP values.


Kernel | Time (msec) | Energy (joules) | Work partitioning | Runtime EDP | Computed EDP | EDP matching ratio
AES256 | 1844 | 29.9 | 0.37 | 55135.6 | 54024.6 | 0.98
Gaussian Noise | 156 | 4.2 | 0.06 | 655.2 | 326.6 | 0.50
Vectorhypot | 166 | 4.4 | 0.93 | 730.4 | 608.6 | 0.83
ColorConversion | 384 | 9.8 | 0.75 | 3763.2 | 3668.7 | 0.97
BillateralFilter | 115 | 4.3 | 0.18 | 494.5 | 402.1 | 0.81
MersenneTwister | 162 | 4.6 | 0.37 | 745.2 | 383.8 | 0.52
Convolution | 610 | 7.9 | 0.681 | 4819 | 2261.3 | 0.47

Table 7: WHP scores (execution time, energy, EDP). The runtime EDP is calculated as Time × Energy. The EDP matching ratio is calculated as min(runtime EDP, computed EDP) / max(runtime EDP, computed EDP).
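The two derived columns of Table 7 can be reproduced directly from its data; for example, for the AES256 row:

```python
# The EDP matching ratio of Table 7: min(runtime, computed) / max(runtime, computed).
def matching_ratio(runtime_edp, computed_edp):
    return min(runtime_edp, computed_edp) / max(runtime_edp, computed_edp)

# AES256 row of Table 7: runtime EDP = Time x Energy = 1844 * 29.9 = 55135.6
runtime_edp = 1844 * 29.9
ratio = matching_ratio(runtime_edp, 54024.6)  # ~0.98, as listed in the table
```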

Some of these differences are fairly large. One reason is that both execution time and power during hybrid execution had to be measured from a central point on the host; these measurements incur the overhead of launching the kernels and of the synchronization that waits for their completion on each device. Another reason is the heuristics of WHP. Figure 17 and Figure 19 show the power graphs for two kernel executions. While the device power and platform power look constant, and the devices have "equal" execution times, there are still small variations, mostly at the beginning and end of a kernel execution, and these affect the computed EDP. In any case, this comparison is included only for completeness: the gaps between the two EDP values are not the success criterion of WHP. WHP's success is indicated by the energy-performance improvements over the balanced-mode policy, reported below.

8.3 WHP vs. Balanced

We evaluate WHP by comparing it with the OS balanced-mode policy (see subsection 4.2), as both share the goal of optimizing the energy-performance of the platform. The balanced policy is very successful for workloads with multiple kernels, as it adjusts the power between kernel executions and for each kernel separately. WHP is currently designed for a single kernel; extending WHP to support a broader class of workloads is left for future work.

We ran two major sets of experiments. One set runs each kernel under the balanced policy, at 16 different values of α (the work-split factor) [1/16..15/16]; Table 8 summarizes, for each kernel, the α value which produced the best EDP, together with the execution time and the energy consumed throughout the execution. The other set runs each kernel with the device frequencies and the work-partitioning factor α calculated by WHP, as shown in Table 7.

Kernel | Time (msec) | Energy (joules) | Work partitioning | EDP (Time × Energy)
AES256 | 2131 | 32 | 0.44 | 68192
Gaussian Noise | 218 | 5.3 | 0.18 | 1155.4
Vectorhypot | 186 | 4.5 | 0.94 | 837
ColorConversion | 445 | 10.2 | 0.75 | 4539
BillateralFilter | 139 | 4.4 | 0.25 | 611.6
MersenneTwister | 183 | 4.9 | 0.375 | 901.6
Convolution | 688 | 10.9 | 0.62 | 7449.2

Table 8: Balanced-mode policy scores (execution time, energy, EDP). The runtime EDP is calculated as Time × Energy.


Figure 18 summarizes the comparison between balanced-mode and WHP. Overall, WHP achieves a 23% (average) energy-performance improvement over the best balanced-mode EDP. It is also interesting to note that WHP improves both the performance and the consumed energy of all kernels.
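The per-kernel improvements and the 23% average can be reproduced from the EDP columns of Tables 7 and 8:

```python
# Reproducing the EDP improvement of WHP over balanced-mode from
# Tables 7 and 8: improvement = 1 - EDP_whp / EDP_balanced.
whp_edp = {"AES256": 55135.6, "Gaussian Noise": 655.2, "Vectorhypot": 730.4,
           "ColorConversion": 3763.2, "BillateralFilter": 494.5,
           "MersenneTwister": 745.2, "Convolution": 4819.0}
bal_edp = {"AES256": 68192.0, "Gaussian Noise": 1155.4, "Vectorhypot": 837.0,
           "ColorConversion": 4539.0, "BillateralFilter": 611.6,
           "MersenneTwister": 901.6, "Convolution": 7449.2}

improvement = {k: 1 - whp_edp[k] / bal_edp[k] for k in whp_edp}
average = sum(improvement.values()) / len(improvement)  # ~0.23 (the 23% average)
```

For instance, this yields ~19% for AES256 and ~35% for Convolution, matching the per-kernel discussion below.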

8.3.1 Results discussion In this section, we closely examine the behavior of WHP versus balanced-mode for two representative kernels.

8.3.1.1 AES256 kernel

For this kernel, the GPU is the dominant device, as indicated by Table 5. The balanced-mode policy starts the kernel execution in turbo mode, as indicated by the power of 20W > TDP in Figure 17(b). At the starting point, since most of the power budget is devoted to the CPU, the system assigns a lower frequency and power to the GPU, compared with WHP. At time point 51, the CPU finishes execution (as indicated by the decline to idle power¹ in Figure 17(b)): apparently, even with the best work-partitioning factor, the CPU finishes before the GPU. At this point, the balanced-mode policy can increase the GPU frequency to 1150MHz. Overall, the CPU, being the less efficient device for this kernel, consumes too much power at the beginning without a significant improvement in execution time.

In contrast, WHP makes better decisions based on the program's behavior (performance, power) on the CPU and GPU. It determines a CPU frequency of 2000MHz and a higher GPU frequency of 1150MHz than balanced-mode, and keeps both constant throughout the hybrid execution (see Figure 17(c)). Consequently, it outperforms the balanced-mode policy in both energy consumption and execution time, as indicated by Figure 18.

1 The CPU frequency still remains at 2000MHz, due to internal system architecture constraints.



Figure 17: AES256, WHP vs. balanced-mode. (a) balanced-mode frequency, (b) balanced-mode power, (c) WHP frequency, (d) WHP power.

Figure 18: improvements of WHP over OS balanced-mode (bars per kernel: Energy, Performance, EDP). Overall, WHP achieves an average 23% improvement of EDP over balanced-mode.


8.3.1.2 Convolution kernel

The Convolution kernel represents a different scenario than the AES256 kernel. In this case, the GPU is less efficient than the CPU, as indicated by the kernel characteristics in Table 5. Moreover, with balanced-mode, both devices finish at the same time, as indicated by Figure 19.

With this kernel, WHP outperforms balanced-mode because it better matches the CPU and GPU frequencies. In practice, WHP keeps the GPU at a very low frequency throughout the entire execution, whereas the balanced-mode policy chooses higher frequencies for both the CPU and the GPU (Figure 19(a)-(b)).

Overall, the power consumption of the CPU is approximately the same for both policies (Figure 19(b)-(d)). However, the power consumption of the GPU is much higher with balanced-mode than with WHP during most of the execution time. All in all, lower GPU power and 11% lower execution time yield 36% better energy-performance with WHP than with balanced-mode (see Figure 18).

Figure 19: Convolution, WHP vs. balanced-mode. (a) balanced-mode frequency, (b) balanced-mode power, (c) WHP frequency, (d) WHP power.


A deeper look into the WHP solution follows. As described in Figure 10, the first phase of WHP generates the scaling functions FT and FP. To better visualize the explanation, we focus the discussion on the energy chart FE (the energy function), instead of the power chart; FE is derived directly from FT and FP according to Equation 2. Figure 20 presents the performance scaling charts for the CPU and GPU:

Figure 20: Convolution performance scaling (CPU and GPU).

Figure 21: Convolution energy scaling (CPU and GPU).
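The derivation of the energy scaling charts from the measured time and power scaling functions can be illustrated with a small sketch, assuming Equation 2 takes the form E = P × T. The frequency tables below are illustrative values, not the measured Convolution data:

```python
# Sketch: deriving the energy scaling function FE from the time and power
# scaling functions FT and FP, assuming Equation 2 is E = P * T.
# The frequency -> value tables are illustrative, not measured data.

FT_cpu = {1600: 1.30, 1800: 1.18, 2000: 1.12}   # execution time (s) per frequency (MHz)
FP_cpu = {1600: 4.0, 1800: 4.8, 2000: 6.5}      # power (W) per frequency (MHz)

def energy_scaling(ft, fp):
    """FE(f) = FT(f) * FP(f): energy (joules) at each measured frequency."""
    return {f: ft[f] * fp[f] for f in ft}

FE_cpu = energy_scaling(FT_cpu, FP_cpu)
# The frequency with minimal energy need not be the fastest one:
f_min_energy = min(FE_cpu, key=FE_cpu.get)
```

This is exactly why the energy chart can show a "sweet spot": past it, power grows faster than execution time shrinks, so energy rises even though performance improves.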


Figure 21 shows that the CPU energy is optimal within the range [1600-1800]MHz, with a corresponding performance of [80-85]; however, the next step toward better performance would be accompanied by a sharper increase in energy, which WHP disfavors in its search for the minimal EDP score. As a result, WHP chooses one of the frequencies in the range [1600-1800]MHz as the target frequency for the CPU part. Similar reasoning holds for the GPU part, where WHP settles on the lowest frequency of 350MHz. We now show that the system configuration proposed by WHP does not violate the TDP-limit constraint of the optimization problem. Checking the corresponding power for these frequencies in Table 3 and Table 4, we get the following:

FPcpu(1800MHz) = 4.8W

FPgpu(350MHz) = 1.4W

Summing these gives a total of 6.2W; adding an estimated package offset of 2.5W yields 8.7W, which is very close to the package power observed in Figure 19.
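This feasibility check can be sketched as follows; the constants reproduce the worked example above (2.5W estimated package offset, 17W TDP platform), and the function names are illustrative:

```python
# Sketch of the TDP-feasibility check WHP applies to a candidate
# configuration: device powers plus the estimated constant package offset
# must not exceed the TDP limit.

PACKAGE_OFFSET_W = 2.5   # estimated constant package power offset
TDP_W = 17.0             # TDP of the measured low-power platform

def package_power(p_cpu, p_gpu, offset=PACKAGE_OFFSET_W):
    """Estimated die-package power for a candidate (CPU, GPU) power pair."""
    return p_cpu + p_gpu + offset

def is_feasible(p_cpu, p_gpu, tdp=TDP_W):
    """True if the configuration respects the shared TDP envelope."""
    return package_power(p_cpu, p_gpu) <= tdp

# FPcpu(1800MHz) = 4.8W, FPgpu(350MHz) = 1.4W  ->  8.7W package power
total = package_power(4.8, 1.4)
```

Configurations failing this check are rejected during WHP's search, so the chosen frequency pair is guaranteed to stay inside the shared power envelope.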

This discussion illustrates the considerations taken by WHP, which has prior knowledge of the workload's performance and energy, in contrast to the balanced policy, which is a kind of on-demand profile that works by a predefined set of static rules.

An intuitive summary of the balanced policy's behavior here is that it allows a higher frequency for higher utilization, ignoring the actual performance achieved at that frequency. This is not always the correct approach: a workload may be programmed inefficiently, or compile sub-optimally for one device compared to the other, producing a lot of generated code that increases device utilization without seriously boosting performance. This is what we observed in the last example, where the GPU was set to its HFM while showing much slower performance than the CPU, which did well at lower frequencies too.

8.4 WHP vs. other OS policies

In a similar way to the balanced policy (Table 8), Table 9 and Table 10 report the scores for the Power-save and High-performance power policies.

Examining Table 8-Table 10, we observe that the partition factor α, chosen as the value that minimizes the EDP score of the executed workload, changes when moving from one power policy to another. For instance, under the High-performance, Power-save, and Balanced power policies, the AES256 workload yields α values of 0.44, 0.56, and 0.44, respectively.

This behavior can be ascribed not only to the relative performance of the devices (when executing the kernel) at the different frequencies, but also to the way the system prioritizes the devices when dividing the TDP budget among them. For instance, under the Power-save policy, the devices are configured to run at their LFM frequency; hence there is no TDP-limit issue, and the optimal partitioning depends mainly on the devices' performance and energy consumption, as if each were executing the workload standalone. Under High-performance mode, on the other hand, the devices run at their HFM frequency, so there is a much higher chance of hitting the TDP limit, in which case the system (OS, driver) prioritizes the devices according to a policy dictated by the hardware manufacturer.


This observation strengthens the ground assumptions of this research, which couple the way devices trade off the power budget with the workload characteristics.

Workload          Execution time (msec)   Consumed energy (joules)   α        Energy-Performance (EDP)
AES256            1854                    36.5                       0.44     67671
Gaussian Noise    147                     4.7                        0.0625   690.9
Vectorhypot       165                     4.7                        0.94     775.5
ColorConversion   425                     9.8                        0.81     4165
BillateralFilter  131                     4.3                        0.25     563.3
MersenneTwister   171                     5.4                        0.375    923.4
Convolution       617                     12                         0.81     7404

Table 9: High-performance policy scores.

Workload          Execution time (msec)   Consumed energy (joules)   α        Energy-Performance (EDP)
AES256            4400                    35.5                       0.56     156200
Gaussian Noise    396                     4.5                        0.125    1782
Vectorhypot       521                     4.7                        0.94     2448.7
ColorConversion   1180                    10.1                       0.69     11918
BillateralFilter  310                     3.9                        0.1875   1209
MersenneTwister   344                     4.2                        0.56     1444.8
Convolution       586                     9.1                        0.75     5332.6

Table 10: Power-save policy scores.

Finally, for each kernel in the testbed, Figure 22 summarizes the EDP scores of WHP vs. the High-performance, Power-save, and Balanced policies.

Figure 22: EDP summary, WHP vs. all power policies.


9 Summary and future work

As SoC-based heterogeneous architectures are becoming the de facto standard of low-power platforms, achieving optimal power management is an important task. This work shows that better energy-performance on SoC platforms is achievable when work partitioning and device frequency management are considered together. We propose an offline method that maximizes energy-performance during a program's hybrid execution, based on the program's behavior on each single device. The method demonstrates a 23% average improvement in energy-performance for several OpenCL kernels, compared with the OS balanced-mode policy.

This research process, along with the experimental results, provides the following main conclusions:

- SoC hybrid execution is a new area in terms of energy-performance.
- Conventional methods, which optimize performance and energy separately, are not applicable to power/thermal-constrained SoC platforms.
- Per-workload care may be essential for improving the energy-performance score. (Nowadays, many major apps for SoC handheld devices receive device-specific tuning.)
- WHP demonstrates an average 23% energy-performance improvement for a set of public OpenCL kernels running on top of off-the-shelf SoC platforms, compared with the OS balanced-mode policy.

We believe that WHP, despite being manual, would gain wide adoption, because devices in the SoC sector, especially tablets and smartphones, already undergo manual tuning of both hardware and software at the vendor before being released to consumers.

9.1 Ideas for future work

During this research, we identified a few potential extensions that could enhance WHP to apply to a broader range of applications. For example, incorporating RAM usage/energy into WHP could yield more accurate WHP scores for I/O-bound applications. Similarly, adding support for Turbo mode would allow WHP to consider cases where the combined single-device executions exceed the TDP limit, which might lead to better EDP scores due to better performance.

Moreover, as continued research, we suggest evolving WHP into an automatic online technique, potentially as part of a hybrid framework such as OpenCL/Renderscript.




The main conclusions of this work are:

1. Hybrid (parallel) execution on SoC systems is a new research area in terms of energy-performance.
2. Conventional methods that optimize energy and performance separately are not suitable for power-constrained SoC systems.
3. Per-application treatment is required to optimize the energy-performance metric for that application's execution. (A reasonable requirement, already practiced today for some popular applications before their release for a new mobile device.)
4. The WHP method exposes a latent average improvement of 23% in the energy-performance metric relative to the balanced policy.

Then, after collecting all the data for each device separately, we obtain four functions:

FTcpu(freq), FTgpu(freq) - the execution time of the task (the kernel code) at a given frequency of that device.

FPcpu(freq), FPgpu(freq) - the power consumed by the task (the kernel code) at a given frequency of that device.

We control the frequencies through a hardware interface (registers) available on the systems we used. Then, using the collected functions, we solve an optimization problem that improves the system's energy-performance metric by finding the work-split factor α and the frequency values of each device, freqcpu and freqgpu, whose combination yields the best energy-performance score without exceeding the devices' power envelope. To formulate the optimization problem mathematically, we justify several observations along the way:

- In compute-intensive tasks on SoC systems, the die package consumes power at a constant offset above the combined consumption of the devices; we estimate this offset as 2.5W in our calculations. This observation lets us estimate the total package power and reject configurations that exceed the system's power limits.
- When executing a compute-intensive kernel under hybrid models such as OpenCL, the power consumption is constant throughout the execution, and the runtime scales accordingly. This observation lets us linearly estimate the energy and the execution time when only a fraction α of the total task is assigned to a device.

Results

We compare our approach with the three power policies available on standard computing systems: the high-performance policy, the power-save policy, and the balanced policy. Throughout the thesis we focus the discussion on the comparison with the balanced policy, because it purports to bridge the gap between the two extremes represented by the other policies: the most energy-efficient versus the best-performing. To evaluate the success of the proposed method, we demonstrate its results on the kernel execution of 7 well-known benchmarks, showing an average improvement of about 23% over the standard balanced policy. For the measurements we use a standard low-power (17W) Intel ultrabook system. Moreover, we show that on average, our method, WHP, is not inferior to the other policies with respect to the metric each of them targets; for example, WHP exceeds the high-performance policy by 3% on average and the power-save policy by 1%. That is, using WHP to improve the energy-performance metric does not require compromising on the other metrics, even when each is examined separately.

Conclusions

Most current research on hybrid CPU-GPU execution focuses on performance (runtime) aspects rather than on energy improvement. One reason is that some of these studies assumed that optimizing execution would also optimize energy consumption, believing that if the execution time shortens, less time is spent wasting energy. In this thesis we refute this claim for SoC systems with low power limits, and show that focusing on accelerating execution alone is not sufficient to optimize the energy-performance metric. Furthermore, regarding mapping a task for parallel execution on a hybrid system, prior studies focused on platforms with a discrete graphics card, where the system has no strict power limits that require special attention. In this thesis, we presented a manual approach for finding an optimal parallel mapping of a task across the CPU and GPU on SoC systems with low power limits.

Abstract

GPGPU software developers who target their applications at hybrid systems face two kinds of challenges: completing the task faster and saving more energy. One kind of hybrid system is the integrated, SoC-based systems that have recently become widespread. On such systems both vectors, performance and energy, must be optimized. Previous studies have proposed approaches and techniques for parallel execution of computations across the different devices (hence, hybrid execution) with the goal of improving performance, while only a few focused on energy-performance optimization. All of these research works targeted hybrid platforms with a CPU and a discrete GPU, for which performance or energy can be tuned independently per device. On SoC systems, any attempt to improve the energy-performance metric is non-trivial and immediately encounters new challenges, so previous methods are not applicable to them.

In this work we present a novel method that helps hybrid SoC systems expose their hidden potential for improving the energy-performance metric. The method focuses on finding an optimal system configuration that combines distributed execution of the computations across the different resources in an energy-controlled manner, together with resource balancing, e.g., controlling device frequencies, in order to improve the energy-performance metric for a given workload. The method improves the energy-performance metric by about 23% on average for a well-known set of benchmarks, compared with the power-management methods currently implemented in hardware.

The main contributions of this work are:

1. This work presents an in-depth analysis together with a method for optimizing the energy-performance metric for hybrid execution on SoC systems with a CPU and an integrated GPU that share the same power envelope.
2. It introduces a new method for parallel execution on SoC systems, suitable for modern programming models such as OpenCL and Renderscript, that provides optimal improvements to the energy-performance metric.
3. WHP exposes a latent potential for achieving better energy-performance scores for hybrid applications, demonstrating a 23% average improvement on benchmarks.

In this research we focus on optimizing the kernel-execution phase, considered the most intensive phase in the run of hybrid applications: its computation uses several devices simultaneously, involves a large amount of work, and accounts for a significant share of the runtime and energy consumption due to the heavy load it places on the system during execution. Moreover, this is the only phase of the run with the potential for hybrid execution on both devices, and therefore it has the most prominent potential for energy-performance optimization. The combination of these considerations leads us to focus on this phase, similarly to previous studies, knowing that optimizing it has the potential for significant improvement for the application as a whole.

The energy-performance metric

The energy-performance metric targets the simultaneous optimization of both performance and energy by treating them together rather than separately, because in power-constrained systems a conflict often arises between them that makes optimizing them separately difficult. The use of this metric is not new; it has also been employed by several previous works.

The proposed approach

Our proposal focuses on finding the system configuration (the division of work between the devices and the regulation of their frequencies) that yields optimal results in the energy-performance metric. The method starts by characterizing the task's performance and energy consumption on each device separately, where the characterization is performed while varying the device's frequencies and examining the task's execution for each combination of device and frequency.


This work was carried out under the supervision of Prof. Avi Mendelson (Technion) and Dr. Yariv Aridor (Intel).



Enhancing Energy-Performance for Power Constrained SoC Systems

Research Thesis

Submitted in partial fulfillment of the requirements for the degree of Master of Science in Computer Science

Rami Jioussy

Submitted to the Senate of the Technion - Israel Institute of Technology

Sh'vat 5775, Haifa, February 2015
