Regression Modelling of Power Consumption for Heterogeneous Processors

by

Tahir Diop

A thesis submitted in conformity with the requirements for the degree of Master of Applied Science, Graduate Department of Electrical and Computer Engineering, University of Toronto

© Copyright 2013 by Tahir Diop

Abstract

Regression Modelling of Power Consumption for Heterogeneous Processors

Tahir Diop
Master of Applied Science
Graduate Department of Electrical and Computer Engineering
University of Toronto
2013

This thesis is composed of two parts that relate to both parallel and heterogeneous processing. The first describes DistCL, a distributed OpenCL framework that allows a cluster of GPUs to be programmed like a single device. It uses programmer-supplied meta-functions that associate work-items to memory. DistCL achieves speedups of up to 29× using 32 peers. By comparing DistCL to SnuCL, we determine that the compute-to-transfer ratio of a benchmark is the best predictor of its performance scaling when distributed.

The second is a statistical power model for the AMD Fusion heterogeneous processor. We present a systematic methodology to create a representative set of compute micro-benchmarks using data collected from real hardware. The power model is created with data from both micro-benchmarks and application benchmarks. The model showed an average predictive error of 6.9% on heterogeneous workloads. The Multi2Sim heterogeneous simulator was modified to support configurable power modelling.

Dedication

To my wife and best friend Petra.

Contents

1 Introduction
  1.1 Contributions
  1.2 Organization

2 Background
  2.1 GPU Architecture
    2.1.1 AMD Evergreen
    2.1.2 Nvidia Fermi
  2.2 Fusion APU
    2.2.1 CPU
    2.2.2 GPU
  2.3 Programming Models
    2.3.1 OpenCL
    2.3.2 CUDA
  2.4 Simulators
    2.4.1 Multi2Sim
    2.4.2 GPGPUSim
  2.5 SciNet

3 Distributing OpenCL kernels
  3.1 Background
  3.2 DistCL
    3.2.1 Partitioning
    3.2.2 Dependencies
    3.2.3 Scheduling Work
    3.2.4 Transferring Buffers
  3.3 Experimental Setup
    3.3.1 Linear Compute and Memory
    3.3.2 Compute-Intensive
    3.3.3 Inter-Node Communication
    3.3.4 Cluster
    3.3.5 SnuCL
  3.4 Results and Discussion
  3.5 Performance Comparison with SnuCL
  3.6 Conclusion

4 Selecting Representative Benchmarks for Power Evaluation
  4.1 Power Measurements
  4.2 Micro-benchmark Selection
    4.2.1 Memory Benchmarks
    4.2.2 Compute Benchmarks
  4.3 Conclusion

5 Power Modelling
  5.1 Background
  5.2 Selecting Benchmarks
    5.2.1 Micro-Benchmarks
    5.2.2 Application Benchmarks
  5.3 Measuring
    5.3.1 Hardware Performance Counters
    5.3.2 Multi2Sim Simulation
  5.4 Modelling
  5.5 Conclusion

6 Power Multi2Sim
  6.1 Epochs
  6.2 Using Power Modelling
    6.2.1 Configuration
    6.2.2 Runtime Usage
    6.2.3 Reports
  6.3 Validation
  6.4 Conclusion

7 Conclusion and Future Work

Bibliography

Appendices

A Clustering Details

B Multi2Sim CPU Configuration Details

List of Tables

2.1 AMD A6-3650 Specification

3.1 Benchmark Description
3.2 Cluster Specifications
3.3 Measured Cluster Performance
3.4 Execution Time Spent Managing Dependencies
3.5 Execution Time Spent Managing Dependencies
3.6 Benchmark Performance Characteristics

4.1 Data Acquisition Unit Specifications
4.2 ACS711 Current Sensor Specifications
4.3 AMD Fusion Cache Specification
4.4 Possible Factor Values for Benchmarks
4.5 Operation Groupings
4.6 Sensitivity Scores for the CPU
4.7 Sensitivity Scores for the GPU

5.1 Application Benchmarks Used
5.2 Instruction Categories
5.3 CPU Configuration Summary
5.4 Memory Latency Comparison
5.5 Memory Configuration
5.6 GPU Model Coefficients
5.7 CPU Model Coefficients
5.8 GPU Model Coefficients
5.9 CPU Model Coefficients
5.10 APU Model Coefficients

A.1 Most Common Property per Cluster for the CPU

B.1 CPU Configuration Details

List of Figures

2.1 AMD Evergreen based streaming processor
2.2 AMD Evergreen based SIMD core
2.3 Nvidia Fermi based CUDA core

3.1 Vector's 1-dimensional NDRange is partitioned into 4 subranges
3.2 The read meta-function is called for buffer a in subrange 1 of vector
3.3 Speedup of distributed benchmarks using DistCL
3.4 Breakdown of runtime
3.5 HotSpot with various pyramid heights
3.6 DistCL and SnuCL speedups
3.7 DistCL and SnuCL compared relative to compute-to-transfer ratio

4.1 Idle power measurements done using the DI-145
4.2 Idle power measurements done using the DI-149
4.3 MSI A75MA-G55 motherboard schematic [63]
4.4 Schematic of the measuring setup
4.5 A picture of the measuring setup in action
4.6 An example of a stack used to store the order of recent memory accesses
4.7 Energy consumption of ALU benchmarks on the CPU
4.8 Energy consumption of ALU benchmarks on the GPU
4.9 Frequency of cluster sizes from the CPU results
4.10 Frequency of property being the most common in a cluster
4.11 Percentage of benchmarks in a cluster that share the most common property

5.1 Steps involved in the modelling process
5.2 Comparison of the literal and best memory configurations
5.3 The regression process
5.4 Fitting error of the training benchmarks for the CPU models
5.5 Fitting error of the training benchmarks for the GPU models
5.6 Fitting error of the validation benchmarks for the CPU models
5.7 Fitting error of the validation benchmarks for the GPU models
5.8 Linear regression of workloads at various frequencies
5.9 Predicted and true values for the total energy of the Rodinia benchmarks

6.1 Measured power consumption of back propagation on real hardware
6.2 Simulated power consumption of back propagation using Multi2Sim

List of Abbreviations

ABI — Application Binary Interface

AGU — Address Generation Unit

ALU — Arithmetic Logic Unit

APU — Accelerated Processing Unit

AMD — Advanced Micro Devices

API — Application Programming Interface

ATX — Advanced Technology eXtended

CMP — Chip Multi-Processor

CPU — Central Processing Unit

DAQ — Data AcQuisition

DSE — Design Space Exploration

DSM — Distributed Shared Memory

DSP — Digital Signal Processor

DVFS — Dynamic Voltage and Frequency Scaling

EPI — Energy Per Instruction

FPU — Floating Point Unit

FU — Functional Unit

GND — GrouND

GPU — Graphics Processing Unit

GPGPU — General Purpose Graphics Processing Unit

HPC — High Performance Computing

IPC — Instructions Per Clock

ILP — Instruction-Level Parallelism

ISA — Instruction Set Architecture

ISP — Image Signal Processor

MLP — Memory-Level Parallelism

NOC — Network On-Chip

OoO — Out-of-Order

PCIe — Peripheral Component Interconnect Express

PMU — Performance Monitoring Unit

PSU — Power Supply Unit

OS — Operating System

SATA — Serial Advanced Technology Attachment

SC — SIMD Core

SFU — Special Function Unit

SIMD — Single Instruction Multiple Data

SIMT — Single Instruction Multiple Thread

SMT — Simultaneous Multi-Threading

SoC — System on Chip

SP — Streaming Processor

SPU — Stream Processing Unit

SSE — Streaming SIMD Extensions

TLP — Thread-Level Parallelism

VLIW — Very Long Instruction Word

VRM — Voltage Regulator Module

Chapter 1

Introduction

Since the introduction of microprocessors in the 1970s, their processing power has increased exponentially. These performance gains came from rising transistor counts and rising clock frequencies. However, in modern processors, power density constraints have led to a dramatic slowing of frequency increases. To keep power density in check, multicore designs have been introduced, allowing performance to keep increasing by adding cores rather than raising frequency [1].

Graphics processing units (GPUs) are highly parallel processors, and they have recently seen an explosion of use for parallel workloads. This has led to the creation of GPU programming frameworks, such as CUDA [2] and OpenCL [3], which allow GPUs to be programmed with an emphasis on general purpose computation, rather than graphics work. As general purpose GPU (GPGPU) computation has gained broader acceptance, GPUs have started to be included in compute clusters [4]. However, current GPGPU programming frameworks do not assist in programming clusters of GPUs, so multiple programming models must be combined. One programming model manages the cluster environment by transferring data between nodes and assigning work to different GPUs, and another is responsible for programming the GPUs themselves.

To take advantage of a GPGPU, an application must be highly parallelizable, which is not always the case. This has led to an increased use of accelerators and the emergence of heterogeneous architectures. These architectures combine multiple specialized processors that are particularly fast for a limited set of applications. Most of the processors shipped by Intel and AMD in the consumer space today are heterogeneous [5][6]: they include both central processing unit (CPU) and GPU cores. In the ultra-mobile space, we see even more convergence with entire systems on a chip (SoC) [7]. These systems are highly heterogeneous and contain a CPU, GPU, digital signal processor (DSP), image signal processor (ISP), video decoder/encoder, and/or wireless controllers. As transistor sizes keep shrinking and power limits do not, we are entering the age of dark silicon [8], where a chip contains more transistors than it can reasonably power. This will make heterogeneous architectures even more attractive, since we can spare the area and desperately need the power savings associated with less general purpose hardware.

This thesis is composed of two main parts: DistCL [9], a distributed OpenCL framework that allows a cluster of GPUs to be programmed as if it were a single device, and a power model for a heterogeneous processor. While at first glance these two parts may appear unrelated, they are part of a larger project focusing on heterogeneous computing. The project seeks to investigate the best way to schedule work on a heterogeneous processor. One of the difficulties of using heterogeneous processors is that there are very few programming models that are common to the different types of processors. OpenCL is a heterogeneous programming model that aims to make heterogeneous processors programmable using a single framework. Dividing up an OpenCL application so that it may run on a heterogeneous system is in many ways similar to dividing it up to run on a cluster. In both cases, we must ensure that program correctness is preserved even though the work itself is being divided up amongst multiple processors. This involves determining the memory dependencies required for each part of the work and ensuring that this memory is made available to the correct device. In order to do this efficiently, it is important to understand the overheads involved and the trade-offs that can be made. In this thesis, two frameworks that allow OpenCL kernels to be distributed across a cluster, with varying degrees of programmer involvement, are compared. This allows us to gain better insights into how performance scales for distributed OpenCL applications.

Before one can determine how to best distribute work on a heterogeneous processor, it is necessary to define a metric for best. Possible metrics include shortest runtime, minimum energy consumption to complete a given task, or maximum performance within a given power envelope. Simulators can easily be used to determine which approach produces the lowest runtime, but since there are no publicly available power models of a heterogeneous processor, it is impossible to assess the other two metrics. A heterogeneous power model must take into account not only the power consumption of individual processors, such as CPUs and GPUs, but also that of shared resources, such as memory controllers. Developing a power model allows architects to investigate which hardware configurations are best using similar metrics. To this end, power modelling capabilities were added to the Multi2Sim [10] heterogeneous architecture simulator. A power model of the AMD Fusion accelerated processing unit (APU) was created and tested within Multi2Sim. A statistical approach was used to create the power model and to determine which micro-benchmarks are necessary to create a valid model. Multi2Sim was then configured using this model and its power modelling capabilities were validated. This approach is not unique to the Fusion and could be used to create similar power models for any desired hardware.

1.1 Contributions

The major contributions of this work are:

1. An analysis of performance scaling for distributed OpenCL kernels using two approaches.

2. A systematic methodology to create a representative set of power micro-benchmarks using data collected from real hardware.

3. The first power model for a CPU/GPU heterogeneous processor.

4. The addition of configurable power modelling capabilities to the Multi2Sim heterogeneous architecture simulator.

1.2 Organization

The thesis is organized as follows: Chapter 2 provides background that is common to the thesis as a whole. This includes background on GPU architecture, the Fusion heterogeneous APU, OpenCL and competing programming models, current GPGPU simulators, and the computing infrastructure used to conduct experiments. Further chapter-specific background will also be provided at the beginning of each chapter. Chapter 3 describes the DistCL framework and compares it to SnuCL [11], another framework that allows GPU clusters to be programmed. Chapter 4 describes the power measuring setup that was used when measuring power consumption on the Fusion. This chapter also describes how measured power information of over 1600 benchmarks was used to create a representative set of power benchmarks that contained fewer than 350 benchmarks. Chapter 5 describes the process used to create a power model for the Fusion APU. This includes describing how Multi2Sim was configured to simulate the APU and explaining the regression analysis used to create the actual model. Chapter 6 explains how Multi2Sim was modified to support power modelling. It also describes the approach used and how users of the simulator can take advantage of this new feature. Finally, Chapter 7 summarizes the conclusions and insights made throughout this work and provides further insights into the future work that is now possible.

Chapter 2

Background

This chapter provides background to this work as a whole. Where necessary, subsequent chapters will provide chapter-specific background sections. Section 2.1 describes current GPU architectures and contrasts AMD and Nvidia designs. Section 2.2 describes the architecture of the Fusion APU. Section 2.3 introduces OpenCL and CUDA, which are frameworks for writing and executing heterogeneous and GPGPU programs. Section 2.4 discusses the simulators used in GPU architecture research and introduces Multi2Sim, a heterogeneous simulator used in this work. Finally, Section 2.5 describes the SciNet [12] computing clusters that were used for this work.

2.1 GPU Architecture

Modern CPUs are highly versatile processors that are optimized for low-latency, single-threaded computation [13]. Features such as out-of-order (OoO) execution, branch prediction, superscalar designs, and large caches help achieve this goal. While these features increase performance, they come at a price: increased complexity, area, and power consumption. These factors limit the number of cores in a single chip multi-processor (CMP). On the other hand, GPUs focus on maximizing throughput and reducing core area, to fit as many cores as possible onto a single chip. This comes at the expense of architectural efficiency, latency, and per-core memory bandwidth.

A GPU core is a single instruction multiple thread (SIMT) pipeline [14]. AMD calls these SIMD cores, while Nvidia calls them CUDA cores. SIMT, a variation on single instruction multiple data (SIMD), groups multiple threads together into wavefronts (AMD terminology) or warps (Nvidia terminology), which execute the same instructions in lock step. SIMT allows threads to take divergent branches. To ensure the threads keep executing in lock step, they may need to take both branches. To maintain correctness, each thread will only write back the result of the appropriate branch. Such computation of unnecessary results is one way architectural efficiency is reduced. Due to limitations on the number of I/O pins per chip, GPUs have limited per-core memory bandwidth [15]. To make the most of this limited bandwidth, they sacrifice latency to increase throughput. The memory controller will try to group multiple requests into a single larger contiguous request to reduce the number of memory accesses.

Due to the in-order nature of GPUs and the high memory latencies, cores typically run multiple wavefronts simultaneously. A wavefront waiting on a memory request to be filled will be swapped out for one that is ready to run. To achieve high architectural efficiency, each core generally requires hundreds of simultaneously executing threads. For traditional graphics workloads, where a GPU needs to produce frames for a display at a resolution of, say, 1920x1080, this is not an issue. Each pixel can be an independent thread, meaning that there are over 2 million threads per frame, giving the GPU plenty of wavefronts to choose from. However, extracting this level of parallelism out of general purpose applications is not always trivial.

The rest of this section takes a closer look at GPU architectures. Section 2.1.1 describes the AMD Evergreen architecture, which is found in the AMD Fusion APU used for this work and can be simulated using Multi2Sim. Section 2.1.2 describes the competing Fermi architecture from Nvidia. Fermi GPUs are available on SciNet and can be simulated using GPGPUSim [16]. Some of the naming conventions of AMD and Nvidia introduced in the following sections can be confusing when presented concurrently. The reader should focus on the AMD terminology, which is what the remainder of this work uses.

2.1.1 AMD Evergreen

The AMD Evergreen micro-architecture, or 5000 series, specifies the design of the SIMD core, the associated memory controllers, and rastering hardware. The Evergreen architecture uses a very long instruction word (VLIW) instruction set architecture (ISA) [17], which influences the design.

Starting at the lowest level, individual threads are mapped onto streaming processors (SPs). Each SP is composed of five stream processing units (SPUs), named x, y, z, w, and t, as well as a register file, as shown in Figure 2.1. The first four SPUs are able to perform simple integer and floating point operations, including multiplication, and load/store operations. The t, or transcendental, SPU is more complex and has additional functionality. It can perform all the remaining complex operations such as division, trigonometric operations, and square roots. This design maps very well to pixel shading, as the x, y, z, and w SPUs can calculate the three colour components and transparency of a pixel, while the t SPU handles the more complex lighting operations. Instruction level parallelism (ILP) is required to keep all the SPUs busy. In the Evergreen ISA, each instruction clause can contain up to five separate calculations, one for each SPU. If the ILP is less than five, some SPUs will remain idle. Since this ILP must be expressed in the machine code, it must be discoverable at compile time to be taken advantage of.


Figure 2.1: AMD Evergreen based streaming processor.

A SIMD core (SC) is composed of sixteen SPs, as shown in Figure 2.2. The SIMD core is the smallest unit to which work can be assigned, as all the SPs in the SC will be executing in lock step. The SC can be assigned multiple wavefronts. Each wavefront contains 64 threads, which are split into four groups of sixteen and always run concurrently. The scheduler switches between wavefronts to hide high latency events. The SIMD core also contains 32 kB of shared memory, which can be used to store data that will be shared among SPs. It is much faster than main memory and is used for OpenCL's local memory. The texture unit is normally responsible for applying graphic textures, but in GPGPU computations it is used to make global memory reads. The texture cache is not used as a data cache when performing GPGPU computations, because textures are read-only.


Figure 2.2: AMD Evergreen based SIMD core.

An entire Evergreen GPU consists of one to twenty SCs and one to four memory controllers. There is also a global read-only data share, which can be used in GPGPU programming as constant memory. Other shared resources include the thread dispatch scheduler, which assigns wavefronts to SCs.

2.1.2 Nvidia Fermi

A direct competitor to AMD’s Evergreen based GPUs are Nvidia’s Fermi based, or 400 series, GPUs. Nvidia does not use a VLIW architecture, so there are significant differences in its design.

Again starting from the bottom, individual threads are mapped onto shader processors (SPs).¹ Each SP can handle simple integer and floating point operations, including multiplication, similar to AMD's SPUs, but unlike them it cannot perform load or store operations. There are separate load/store units, as well as special function units to handle more complex operations. These are all combined into a CUDA core (CC).

¹For the remainder of this section, SP will refer to a shader processor. Thereafter, this abbreviation will return to meaning an AMD streaming processor.


Figure 2.3: Nvidia Fermi based CUDA core.

Figure 2.3 shows the components of a CC. It is composed of 32 SPs, 16 load/store units, and 4 special function units (SFUs). The heterogeneity in execution resources is seen at the CC level rather than at the streaming processor level, as in AMD’s design. Just like a SC, a CC will execute a single instruction at a time. This means 32 threads for most instructions, 16 threads for load/stores, and 4 threads for complex operations at a time. The Fermi architecture makes no effort to take advantage of ILP and instead focuses on data-parallelism by including more SPs per CC. Nvidia’s CC scheduler can also be assigned multiple warps; it interleaves them to hide high latency operations.

A major difference of the CC, compared to the SC, is that it contains both a shared memory and an L1 data cache. The shared memory and cache are 64 kB in total, and can be configured as either a 16 kB shared memory and a 48 kB cache or as a 48 kB shared memory and a 16 kB cache. The texture cache cannot be used for GPGPU computing. Due to the differences in the hardware, AMD and Nvidia GPUs should be programmed differently. When programming the AMD GPU it is important to ensure that the program contains ILP to take advantage of all the SPUs. If this is not done, it is possible to obtain as little as 20% of the available performance. With the AMD GPU it is also more important to make use of the shared memory, as there is no data cache. However, due to the simpler architecture of AMD's SPs, it is possible to fit more of them into a given area. This means one can get more performance per dollar; the challenge, however, is unlocking it all.

2.2 Fusion APU

This work presents a power model developed for the AMD Fusion A6-3650, a Llano APU [5][18][19]. The Fusion APU is a heterogeneous processor that contains four CPU cores, four GPU cores, and a shared memory controller. The CPU is based on the family 12h [20] architecture and the GPU is based on the Evergreen architecture. The APU's specifications are summarized in Table 2.1. The rest of this section describes the CPU architecture and the specific GPU configuration that the Fusion employs.

Table 2.1: AMD A6-3650 Specification

Component                  Value
CPU cores                  4
CPU architecture           family 12h
CPU operating frequency    2.6 GHz
L1 I-cache                 64 kB per core
L1 D-cache                 64 kB per core
L2 cache                   1 MB per core
GPU cores                  4
GPU architecture           Evergreen
GPU operating frequency    443 MHz
Streaming processors       16 per core
Stream processing units    5 per streaming processor
Memory controller          Dual-channel DDR3
TDP                        100 W

2.2.1 CPU

The CPU in the A6-3650 is an OoO, x86, four-core family 12h processor, which is based on the K8 64-bit architecture. It is a three-wide architecture, with three integer and three floating point pipelines and support for SSE instructions. Each integer pipeline contains a scheduler, an integer arithmetic-logic unit (ALU), and an address generation unit (AGU). Each pipe also handles one of the following instruction types: multiplication, division, or branches.

The floating point pipelines are all different, but share a single scheduler. While some instructions can be handled by multiple pipes, in general one is responsible for simple arithmetic, another is responsible for complex arithmetic, and the last one is responsible for load and store operations. The floating point unit (FPU) executes x87 and SSE instructions.

SSE instructions allow the CPU to take advantage of data parallelism by executing vector instructions. This is done by operating on 128-bit registers. In the SSE terminology, these registers are packed with smaller data-types. The number of operands that can be packed depends on the data-type; it is possible to pack two 64-bit values, four 32-bit values, and up to sixteen 8-bit values into an SSE register. A register does not need to be fully packed in order to perform operations; for example, it is possible to use an SSE instruction to operate on three 32-bit floats. The OpenCL vector data-types, such as int4, map directly to packed SSE instructions when executed on the CPU. The OpenCL compiler also uses the SSE instructions to perform any floating point operations, instead of x87 instructions.

The processor includes two levels of cache and an integrated memory controller shared with the GPU. Each core has a private 64 kB L1 instruction cache, an equally sized private L1 data cache, and a private 1 MB unified L2 cache. The caches use an exclusive design, so the L2 cache acts as a victim cache and does not contain any data found in the L1 caches.

2.2.2 GPU

The GPU in the APU is based on the Evergreen micro-architecture. It contains four cores, each made up of sixteen streaming processors. See Section 2.1.1 for more details on the Evergreen architecture. When programming Evergreen GPUs with OpenCL, it is possible to obtain much better performance when using vector data-types. This is because the compiler interprets a vector operation as a group of independent operations. Consider the example val = a + b. If the values are of type int, this is a single operation that will be executed by a single SPU in an SP, using 20% of the SP's maximum throughput. However, if instead they are of type int4, then this can be treated as four independent additions, assigned to four SPUs in a single SP, which means 80% of the SP's throughput is utilized. Therefore, by using vector data-types, it is possible to increase the architectural efficiency of the GPU considerably.
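To make this concrete, the short OpenCL C kernel below uses the int4 vector type; it is an illustrative sketch, not one of this thesis's micro-benchmarks. On Evergreen the compiler can issue the four component additions to four SPUs of one SP, and, per Section 2.2.1, the same vector type maps to a packed SSE addition when the kernel runs on the CPU.

    /* illustrative kernel, not from this work's benchmark suite */
    __kernel void vec_add4(__global const int4 *a,
                           __global const int4 *b,
                           __global int4 *out)
    {
        int i = get_global_id(0);
        /* one int4 addition is four independent integer additions, which the
           VLIW compiler can pack into a single instruction clause */
        out[i] = a[i] + b[i];
    }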

The Fusion's architectural details are important for two reasons. First, in Chapter 4 we need to understand the code the OpenCL compiler produces, so we know how to write kernels to target specific hardware components. Second, we need this information in Chapter 5 so we can configure Multi2Sim to approximate the Fusion as closely as possible.

2.3 Programming Models

There are two common frameworks available for GPGPU computing: OpenCL [3] and CUDA [2]. OpenCL is an open heterogeneous programming standard maintained by the Khronos Group that can be used to program not only GPUs but many other types of devices as well. This section explains OpenCL in detail, as it is needed in Chapter 3 to understand how OpenCL kernels can be transparently distributed. This explanation also helps in understanding how OpenCL is used in Chapter 4 to create micro-benchmarks. OpenCL is also briefly contrasted with CUDA, which is used to program Nvidia GPUs. CUDA is also used under the hood by SnuCL [11] and by GPGPUSim [16].

2.3.1 OpenCL

OpenCL is a framework that allows the programming of heterogeneous processors. OpenCL programs have two major components: the host program and the kernels. The host program is normal C or C++ code and runs on the CPU. It is the code that makes calls to the OpenCL application programming interface (API), manages the devices on which kernels will be executed, and launches kernels. The kernels consist of functions written in OpenCL C, a C99 derivative, and can run on any device that supports OpenCL. Kernels usually contain algorithms to be run by a GPU or other accelerator, but they can also be run on a CPU.

OpenCL represents hardware components in a hierarchy. At the top of the hierarchy we find the OpenCL platform. The platform is made up of host processors and one or more compute devices. Compute devices are hardware accelerators that execute OpenCL kernels. A compute device is composed of one or more compute units. Work is scheduled at the compute unit level, but a compute unit can be further subdivided into one or more processing elements. Each processing element will run a single thread of execution. On x86 CPUs, compute units map to cores, and each compute unit contains a single processing element, the core itself. On Evergreen GPUs compute units map to SCs and processing elements map to SPs.
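A minimal sketch of walking this hierarchy from the host, using standard OpenCL API calls (error handling reduced to asserts, in the style of Listing 2.2); the helper name is illustrative rather than taken from this work's code.

    #include <assert.h>
    #include <CL/cl.h>

    /* illustrative helper: first platform -> first GPU device -> compute units */
    static cl_uint gpu_compute_units(void)
    {
        cl_platform_id platform;
        cl_device_id device;
        cl_uint num_cu;

        assert(clGetPlatformIDs(1, &platform, NULL) == CL_SUCCESS);
        assert(clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL) == CL_SUCCESS);
        assert(clGetDeviceInfo(device, CL_DEVICE_MAX_COMPUTE_UNITS,
                               sizeof(num_cu), &num_cu, NULL) == CL_SUCCESS);
        return num_cu;   /* e.g. the number of SIMD cores on an Evergreen GPU */
    }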

This hardware hierarchy informs both OpenCL's memory and programming models. OpenCL supports three distinct types of memory: global, local, and private. Global memory is shared at the device level. All the compute units in a single device share the same global memory; however, it is not guaranteed to be consistent. Reads and writes to and from global memory can be executed in any order, as long as calls from a single compute unit remain ordered. This means that compute units cannot communicate through global memory. OpenCL also supports constant memory, which is essentially read-only global memory. Local memory is per-compute-unit memory. This allows processing elements within a compute unit to communicate using local memory. Local memory is smaller and faster than global memory. Private memory is per-processing-element memory. This memory can be used by individual threads to store private data. On CPUs there is no distinction between the different types of memory at the hardware level. However, the device driver limits the size of local and private memory such that they fit into the L1 cache. On GPUs the different types of memory usually map to different physical memories. For Evergreen GPUs, the register file is used for private memory, the per-SC shared memory is used for local memory, the global data share is used for constant memory, and main memory is used for global memory.
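The address-space qualifiers in OpenCL C make these memory types explicit. The kernel below is a contrived example, not one of this work's benchmarks, that touches each of them:

    __kernel void scale_copy(__global const float *in,
                             __global float *out,
                             __constant float *coeff,   /* constant memory */
                             __local float *scratch)    /* local memory, shared per work-group */
    {
        int gid = get_global_id(0);
        int lid = get_local_id(0);
        float tmp = in[gid] * coeff[0];   /* tmp lives in private memory */
        scratch[lid] = tmp;               /* visible to the whole work-group */
        barrier(CLK_LOCAL_MEM_FENCE);     /* synchronize within the work-group */
        out[gid] = scratch[lid];
    }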

The host and compute devices do not have a shared address space, so OpenCL provides buffer objects. Buffers are allocated by the host, in either device or host memory. OpenCL provides API calls to manipulate these buffers, which handle pointer marshalling and copying when necessary. Copies between buffers and host memory are explicit. Local memory requirements must be static, because they may limit how many work-groups can be simultaneously assigned to a compute unit.

The OpenCL programming model follows a hierarchy similar to that of the memory. When work is assigned to a compute device, a kernel and NDRange must be specified. The kernel is an OpenCL function. The NDRange specifies the number of threads, or work-items in OpenCL, that will run the kernel. The range can be one, two, or three dimensional and can be thought of as a Cartesian space. Work-items are identified by their position in the range using a unique global-ID, which corresponds to their coordinates in the space. For example, if we create a two-dimensional NDRange of size sixteen in both the x and y dimensions, it will contain 256 work-items. Each work-item will have an x-ID ranging from zero to fifteen and a y-ID from zero to fifteen, though each combination will be unique. This NDRange is subdivided into work-groups. Work-groups can be thought of as NDRanges which are mapped to compute units. Each work-item has a local-ID which identifies its position within the work-group. Work-groups have the same number of dimensions as the NDRange and have their own unique multi-dimensional ID. In OpenCL, work is scheduled at the work-group granularity because work-items within a work-group must be able to communicate using shared memory. The work-items themselves are assigned to processing elements. OpenCL uses a SIMT execution model, just like GPUs. The local- and global-IDs are used to allow the programmer to express data parallelism, and are usually used to index data in a buffer, but occasionally as a direct operand.

__kernel void inc(__global int *a)
{
    a[get_global_id(0)]++;
}
Listing 2.1: OpenCL kernel to increment array elements.

 1  void increment_array(int *input, int buffersize, cl_context context, cl_command_queue queue, cl_kernel kernel)
 2  {
 3      cl_int errcode;
 4      cl_mem buffer = clCreateBuffer(context, CL_MEM_READ_WRITE, sizeof(int) * buffersize, NULL, &errcode);
 5      errcode |= clEnqueueWriteBuffer(queue, buffer, CL_TRUE, 0, sizeof(int) * buffersize, input, 0, NULL, NULL);
 6      assert(errcode == CL_SUCCESS);
 7
 8      assert(clSetKernelArg(kernel, 0, sizeof(cl_mem), &buffer) == CL_SUCCESS);
 9
10      size_t global[] = {1, 1, 1};
11      size_t local[]  = {1, 1, 1};
12
13      global[0] = buffersize;
14      local[0]  = buffersize / 4;
15
16      assert(clEnqueueNDRangeKernel(queue, kernel, 1, NULL, global, local, 0, NULL, NULL) == CL_SUCCESS);
17      clFinish(queue);
18
19      assert(clEnqueueReadBuffer(queue, buffer, CL_TRUE, 0, sizeof(int) * buffersize, input, 0, NULL, NULL) == CL_SUCCESS);
20  }
Listing 2.2: OpenCL host code to increment array elements.


An example OpenCL kernel can be seen in Listing 2.1 and the associated host code in Listing 2.2. The code in the example increments each element in an array by one. This stripped-down example omits the OpenCL boilerplate required to discover the devices, create a context, compile the kernel code, and create the device's command queue. We can assume the calling function has taken care of these steps and is passing a valid context, command queue, and kernel object as arguments. In this example, the array we want to increment is pointed to by input and contains buffersize elements; we will assume buffersize is 100.

The kernel is very simple. It accepts an array as an argument and increments the value at index global-ID of the array, as seen in Listing 2.1. In order to execute this kernel, the host code must first create a buffer and copy the array values to it, as is done in lines 4 and 5 of Listing 2.2. This buffer is then passed as the argument to the kernel in line 8. This kernel needs a one-dimensional NDRange of size 100, and for this particular example the local size is not important, so we can set it to any divisor of 100, say, 25. This step is done in lines 10 through 14. This will create four work-groups, each with twenty-five work-items. Each work-item will execute identical instructions, but since they each have a different global-ID, each work-item will increment a different array element. The kernel is then submitted to the command queue of the device we are using in line 16. To get the results from the OpenCL device, the buffer needs to be copied back to host memory, as is done in line 19.

We can see from the example that the kernel code will work for any size of buffer, as the kernel code itself only increments a single element. The host program will run as many instances of the kernel as necessary for a given array. In this example, the number of work-groups will always be four, which limits the problem size to the maximum work-group size the device can run. It would also be possible to create a variable number of work-groups by specifying the local size. If in this example the local size were always four, then we would have created 25 work-groups instead of four. The ideal local size depends on the device being used. For the CPU, it makes no difference, since each work-item must be executed separately. On the other hand, on an Evergreen GPU, any work-group size that is not a multiple of 16 will leave some SPs idle when executing, thereby decreasing performance. Also, the larger the work-groups, the more wavefronts the SC will have to interleave during execution.
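For reference, the global, local, and group IDs are related by a simple identity (assuming no global work offset is passed to clEnqueueNDRangeKernel). The kernel below is a variant of Listing 2.1, written for illustration only, that recomputes the index explicitly:

    __kernel void inc_by_group(__global int *a)
    {
        /* with 100 work-items and a local size of 25: work-group 2,
           local ID 3 addresses element 2 * 25 + 3 = 53 */
        size_t gid = get_group_id(0) * get_local_size(0) + get_local_id(0);
        a[gid]++;   /* same element as a[get_global_id(0)] */
    }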

Since OpenCL can be used to program both CPUs and GPUs, it is an ideal framework to program heterogeneous processors such as the AMD Fusion. Both the Fusion’s CPU and GPU can be used as OpenCL devices. This allows for the same kernel to be executed on either processor and the work can be more easily shared since there is no need to copy data over an external interface such as PCIe [21]. This is useful for comparing the relative strengths of both processors and to allow them to collaborate on the same workload.

2.3.2 CUDA

The CUDA programming model is very similar to OpenCL. In fact, Nvidia GPUs can be programmed using either OpenCL or CUDA. When it comes to writing kernels, the only differences between CUDA and OpenCL are the names of the IDs a work-item or work-group has. To avoid confusion, I am omitting the CUDA nomenclature as it is not used anywhere in this work. However, there are more differences in the host code. Since CUDA can only be used to program GPUs, it does not need to be as general, which simplifies the host code considerably. The same things are still taking place behind the scenes, but the framework can make more assumptions because it knows an Nvidia GPU is being programmed.

The biggest difference is that CUDA gives access to certain Nvidia-specific features. For example, the shared memory/L1 cache split can only be configured using CUDA; in OpenCL you always get 48 kB of shared memory. There are also CUDA plug-ins for programs like MATLAB, and many libraries are available. This is partially due to the fact that CUDA pre-dates OpenCL, but also because it is easier to write libraries that only need to work on one type of hardware.
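For example, the split can be requested through the CUDA runtime API with a single call; the helper below is a sketch for illustration, not code used in this work.

    #include <cuda_runtime.h>

    /* request a 48 kB L1 / 16 kB shared-memory split on the current device;
       cudaFuncCachePreferShared requests the opposite split */
    void prefer_l1(void)
    {
        if (cudaDeviceSetCacheConfig(cudaFuncCachePreferL1) != cudaSuccess) {
            /* the device keeps its default configuration */
        }
    }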

2.4 Simulators

GPU simulators come in two flavours: those that simulate graphics programs and those that simulate general purpose computation. Simulators such as Qsilver [22] or ATTILA [23] can be used to simulate graphics programs written in OpenGL [24]. GPGPU computations written in CUDA or OpenCL can be run in simulators such as GPGPUSim [16] or Multi2Sim [25][10]. As this work focuses on GPGPU, this section will introduce the GPGPU simulators starting with Multi2Sim, since it was used for this research.

2.4.1 Multi2Sim

Multi2Sim is an open source heterogeneous architecture simulator. It supports the simulation of much more than just GPGPU programs. When this research started in 2012, it supported the simulation of 32-bit x86 CPUs (i386) and GPUs based on AMD's Evergreen architecture. Since that time, Multi2Sim has added support for ARM, MIPS, AMD's Southern Islands, and Nvidia's Fermi GPUs. Multi2Sim is capable of performing cycle-accurate simulations for both CPUs and GPUs. Multi2Sim is not a full system simulator, meaning that it does not run an operating system (OS), but runs the target application directly. Multi2Sim emulates the desired ISA, program loading, and system calls. Multi2Sim must emulate program loading and system calls directly because these are services an OS would normally provide. Multi2Sim emulates the system calls specified in the application binary interface (ABI), allowing it to run most Linux applications.

Multi2Sim is the only GPGPU simulator that simulates AMD GPUs and specifically the Evergreen micro-architecture, which is found in Llano generation APUs. While the x86 CPU model in Multi2Sim is not based on any existing hardware, it is highly configurable. This makes Multi2Sim an ideal platform for work with the Fusion.

Originally, Multi2Sim only supported the execution of OpenCL kernels on the GPU. This limited its ability to run a single workload on both the CPU and GPU, or to run workloads where the CPU and GPU collaborate on the same OpenCL kernel. Steven Gurfinkel and I addressed this limitation by writing a CPU OpenCL runtime that was incorporated into Multi2Sim. Since AMD does not publish the specification for its OpenCL runtime, it had to be reverse engineered. Our OpenCL runtime is compatible with binaries produced by the OpenCL compiler in the AMD APP SDK versions 2.5 through 2.7. The runtime works both in and outside of Multi2Sim. More details on its operation are available in the Multi2Sim documentation [26].

Limitations

Multi2Sim exhibits a number of limitations that were discovered over the course of this work. The most important limitation is the inaccuracy of the memory model. Originally, Multi2Sim simulated the cache hierarchy as being part of a complex network on-chip (NOC), where each cache and core had its own router. This caused very high latency cache accesses, as routing took multiple cycles at each stage. The lowest latency cache access we could originally achieve for the L1 cache was twelve cycles, which was much higher than the three it takes on the Fusion. After bringing this issue to the attention of the Multi2Sim team, a bus model was added. This significantly sped up the communication between the caches. However, there were still some issues with modelling all the cache latencies correctly, as the L2 cache's latency is closely tied to that of main memory and the prefetcher in Multi2Sim does not perform nearly as well as the one found in real hardware.

Multi2Sim simulates main memory the same way it simulates caches. This means that there is a constant latency associated with main memory accesses. This is not accurate for DRAM, where the latency can vary greatly depending on multiple factors. DRAM is organized into multiple banks and only one bank can be active (precharged) at a time. Memory access latency is lowest when we are accessing a bank that is currently active and highest when we have to deactivate the current bank and activate another one. This is one reason contiguous memory accesses are faster than sparse memory accesses. The memory latency is also affected by the need to periodically refresh the DRAM's values. If we make a request to an address that is being refreshed, the latency will increase.

Another issue is the fact that Multi2Sim simulates an inclusive cache hierarchy while the family 12h CPU we are modelling has an exclusive cache. This means that the effective size of the L2 cache for each processor is simulated as being 128 kB, or 1/8th, smaller than in reality, since it must also contain all the data found in both L1 caches. Unfortunately, it was not possible to correct for this by increasing the number of sets or the associativity of the L2 cache, since Multi2Sim only handles powers of two for both of these values. Given the choice between a cache that was 1/8 too small or 7/8 too large, the former option was chosen.
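The arithmetic behind this trade-off, using only the cache sizes from Table 2.1:

    exclusive (real) L2 capacity:      1 MB of unique data per core
    inclusive (simulated) L2 capacity: 1024 kB - (64 kB L1-I + 64 kB L1-D) = 896 kB of unique data,
                                       i.e. 128 kB, or 1/8, less than the real hardware
    compensating would require 1152 kB, but with power-of-two sets and associativity the only
    options are 1 MB (1/8 too small) or 2 MB (roughly 7/8 too large)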

The other issue is that Multi2Sim does not support the simultaneous execution of the CPU and the Evergreen GPU. Currently, when a kernel is enqueued, the CPU is suspended while the kernel executes. The new Southern Islands based GPU model does not have this limitation, but it uses a different OpenCL runtime. This was not an issue for most of the benchmarks, since they usually have the CPU block until the kernel completes. It was, however, an issue when simulating power in Multi2Sim, since it was impossible to model power consumption that was truly the sum of both processors.

It is unlikely that the Evergreen issues will be solved, as development focus has shifted to the more recent Southern Islands (SI) micro-architecture. Unfortunately, the SI micro-architecture is a radical departure from previous AMD GPU architectures. The VLIW ISA has been replaced by a non-VLIW one, which focuses primarily on exploiting data-parallelism rather than instruction level parallelism (ILP). This architecture has more in common with the Fermi architecture, described in Section 2.1.2, than with the Evergreen architecture. Therefore, in spite of its improved features and continued development, it was a poor candidate for emulating Evergreen-based hardware.

2.4.2 GPGPUSim

GPGPUSim [16] is a GPU architectural simulator that can simulate the execution of CUDA or OpenCL kernels. It simulates Nvidia hardware from the 8800GTX [27] up to and including Fermi. GPGPUSim only simulates the execution of the kernel, while the host program executes in real-time on a real CPU. Calls to libcudart are intercepted, to allow the host program to communicate with GPGPUSim.

There have been some efforts to combine GPGPUSim with a CPU simulator to create heterogeneous simulators. Work by Zakharenko et al. [28] combines GPGPUSim with PTLSim [29], but does not include a power model. Work by Wang et al. [30] combines GPGPUSim with gem5 [31]. They also include a power model to study power budgeting, but the power model is very coarse, assuming constant per-core power consumption.

Since GPGPUSim was primarily developed to simulate Nvidia hardware, it expects the kernel to be compiled to Nvidia's PTX assembly. It has no support for AMD's Evergreen assembly, nor does it support the simulation of VLIW architectures, so it could not be used to simulate the Fusion's GPU.

2.5 SciNet

Most of the computations in this work were performed on systems belonging to the SciNet HPC Consortium [12]. Two of their clusters were used: GPC and Gravity. GPC, the General Purpose Cluster, consists of octo-core Xeon processor nodes and was used to run the Multi2Sim simulations; without it, it would have taken nearly two months to run all the simulations required for the power modelling. The Gravity cluster is a GPU cluster where each node contains a dodeca-core Xeon processor and two Tesla GPUs. This cluster was used to run experiments with the SnuCL [11] and DistCL [9] distributed OpenCL runtimes.

Chapter 3

Distributing OpenCL kernels

GPUs were first used to offload graphics tasks from the CPU. Thanks to the demand of computer gamers for ever higher quality graphics, GPUs evolved from simple fixed function accelerators to fully programmable, massively parallel processors [32]. As the level of programmability of GPUs increased, it became possible to run non-graphics workloads on them. Recently, there has been significant interest in using GPUs for general purpose and high performance computing (HPC). Significant speedups have been demonstrated when porting applications to a GPU [33], even in HPC workloads such as linear algebra [34], computational finance [35], and molecular modelling [36]. Therefore, it is no surprise that modern computing clusters such as the TianHe-1A [37] and Titan [4] are incorporating GPUs. However, additional speedups are still possible beyond the computational capabilities afforded by a single GPU. As GPU programming has grown in popularity in the HPC space, there has been much interest in expanding the OpenCL and CUDA [2] programming models to support cluster programming.

This chapter introduces DistCL [9], a framework for the distribution of OpenCL kernels across a cluster. To simplify this task, DistCL takes advantage of three insights:

1) OpenCL tasks (called kernels) contain threads (called work-items) that are organized into small groups (called work-groups). Work-items from different work-groups cannot communicate during a kernel invocation. Therefore, work-groups only require that the memory they read be up-to-date as of the beginning of the kernel invocation. Thus, DistCL must know what memory a work-group reads, to ensure that the memory is up-to-date on the device that runs the work-group. DistCL must also know what memory each work-group writes, so that future reads can be satisfied. However, no intra-kernel synchronization is required.

2) Most OpenCL kernels make only data-independent memory accesses; the addresses they access can be predicted using only the immediate values they are passed and the geometry they are invoked with. Their accesses can be efficiently determined before they run. DistCL requires that kernel writes be data-independent (the kernels sketched after this list illustrate the distinction).

3) Kernel memory accesses are often contiguous. Contiguous accesses fully harness the wide memory buses of GPUs [33]. DistCL does not require contiguous accesses for correctness, but they improve distributed performance because contiguous accesses made in the same work-group can be treated like a

single, large access when tracking writes and transferring data between devices.
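The two kernels below are illustrative examples, not taken from the benchmark suite used in this work; they show the distinction that insight 2 relies on. In the first, the written address is a pure function of the work-item's ID and can be predicted from the NDRange alone; in the second, it depends on buffer contents and cannot be known without running the kernel.

    /* data-independent write: predictable before the kernel runs */
    __kernel void scale(__global const float *in, __global float *out, float k)
    {
        int i = get_global_id(0);
        out[i] = k * in[i];
    }

    /* data-dependent write: the target index comes from the idx buffer */
    __kernel void scatter(__global const float *in, __global const int *idx,
                          __global float *out)
    {
        int i = get_global_id(0);
        out[idx[i]] = in[i];
    }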

In OpenCL (and DistCL) memory is divided into large device-resident arrays called buffers. DistCL introduces the concept of meta-functions: simple functions that describe the memory access patterns of an OpenCL kernel. Meta-functions are programmer-written kernel-specific functions that relate a range of work-groups to the parts of a buffer that those work-groups will access. When a meta-function is passed a range of work-groups and a buffer to consider, it divides the buffer into intervals, marking each interval either as accessed or not accessed by the work-groups. DistCL takes advantage of kernels with sequential access patterns, which have fewer (larger) intervals, because it can satisfy their memory accesses with fewer I/O operations. By dividing up buffers, meta-functions allow DistCL to distribute an unmodified kernel across a cluster. To our knowledge, DistCL is the first framework to do so.

In addition to describing DistCL, this chapter evaluates the effectiveness of kernel distribution across a cluster based on the kernels’ memory access patterns and their compute-to-transfer ratio. It also examines how the performance of various OpenCL and network operations affect the distribution of kernels.

This work was done in partnership with Steven Gurfinkel. While I did participate in the design process of DistCL and the writing of the first version, development on the two subsequent versions was done primarily by Gurfinkel. My main contributions are:

• The evaluation of how the properties of different kernels affect their performance when distributed.

• A performance comparison between DistCL and another framework that distributes OpenCL kernels, SnuCL [11].

The rest of this chapter is organized as follows: It first describes related work in Section 3.1. Then in Section 3.2, it describes DistCL using vector addition as an example, in particular looking at how DistCL handles each step involved with distribution. Focus then shifts to analysis; Section 3.3 introduces the benchmarks, which are grouped into three categories: linear runtime benchmarks, compute intensive benchmarks, and benchmarks that involve inter-node communication. Results for these benchmarks are presented in Section 3.4. A comparison with SnuCL is provided in Section 3.5.

3.1 Background

Programming GPUs is not a simple task, especially in a cluster environment. Often, a distributed programming model such as MPI [38] is combined with a GPU programming model such as OpenCL or CUDA. This makes memory management difficult because the programmer must manually transfer data not only between nodes in the cluster but also to and from the GPUs in each node. There have been multiple frameworks proposed that allow all the GPUs to be accessed as if they are part of a single platform. rCUDA [39] is one such framework for CUDA, while Mosix VCL [40] provides similar functionality for OpenCL. These frameworks are limited by the fact that they work with a single CUDA or OpenCL implementation; that is to say, one cannot mix devices from different vendors. More recent frameworks such as Hybrid OpenCL [41], dOpenCL [42], Distributed OpenCL [43], and SnuCL [11] address this limitation. These frameworks allow devices from different vendors using different OpenCL implementations to be combined into a single context. This makes memory management much easier, as OpenCL buffer copy operations can be used to transparently transfer data between nodes. clOpenCL [44] takes a similar approach but allows the user to specify which nodes to include in contexts using host names. With any of these frameworks, work can still only be dispatched to a single device at a time. Therefore, in order to take advantage of multiple devices, work must be manually broken down and dispatched in smaller pieces.

With CUDASA [45] it is possible to launch a single kernel and have it run on multiple devices in a network. However, this is not transparent to the programmer. The CUDA programming model is extended with network and bus layers to represent a cluster and a node, respectively. This is in addition to the existing kernel (NDRange), block (work-group), and thread (work-item) layers, which map to devices, compute units, and processing elements, respectively. Unfortunately, this means that kernel code must be modified accordingly if one wants to take advantage of more than a single device. Another drawback of CUDASA is that it does not handle any memory transfers transparently. CUDASA includes a distributed shared memory (DSM) to allow all nodes to share a single address space. When allocating memory, the programmer can specify the desired address range, to ensure the memory is on the right node. Functions are provided to easily copy memory across the DSM using MPI.

Single OpenCL kernels are transparently run on multiple devices in work by Kim et al. [46]. This framework is targeted at multiple GPUs in a single computer and has no support for distributing a kernel across a cluster. To determine which device requires which memory, sample runs are used. Before enqueuing work onto any GPU, the work-items at the corners of the NDRange are run on the CPU to determine the memory access pattern. In the event that this analysis is inconclusive, the work is still distributed and the output buffers are diff’ed. The fact that in certain cases this framework relies on diff’ing entire buffers makes it ill suited for distribution. If distributed, the process involves not only transferring the entire buffer from the GPU, but also from each worker node to the master, and this operation consumes a significant amount of time.

SnuCL [11] is another framework that distributes OpenCL kernels across a cluster. SnuCL can create the illusion that all the OpenCL devices in a cluster belong to a local context, and can automatically copy buffers between nodes based on the programmer’s placement of kernels. As opposed to SnuCL, DistCL not only abstracts inter-node communication, but also the fact that there are multiple devices in the cluster. A more detailed description of SnuCL is provided in Section 3.3.5. In this chapter, the performance of DistCL will be compared to that of SnuCL.

3.2 DistCL

DistCL executes on a cluster of networked computers. OpenCL host programs use DistCL by creating one context with one command queue for one device. This device represents the aggregate of all the devices in the cluster. When a program is run with DistCL, identical processes are launched on every node. When the OpenCL context is created, one of those nodes becomes the master. The master is the only node that is allowed to continue executing the host program. All other nodes, called peers, enter an event loop that services requests from the master. Nodes communicate in two ways: messages to and from the master, and raw data transfers that can happen between any pair of nodes.

To run a kernel, DistCL divides its NDRange into smaller grids called subranges. Kernel execution

__kernel void vector(__global int *a, __global int *b, __global int *out)
{
    int i = get_global_id(0);
    out[i] = a[i] + b[i];
}
Listing 3.1: OpenCL kernel for vector addition.


Figure 3.1: Vector’s 1-dimensional NDRange is partitioned into 4 subranges.

gets distributed because these subranges run on different peers. DistCL must know what memory a subrange will access in order to distribute the kernel correctly. This knowledge is provided to DistCL with meta-functions. Meta-functions are programmer-written, kernel-specific callbacks that DistCL uses to determine what memory a subrange will access. DistCL uses meta-functions to divide buffers into arbitrarily-sized intervals. Each interval of a buffer is either accessed or not. DistCL stores the intervals calculated by meta-functions in objects called access-sets. Once all the access-sets have been calculated, DistCL can initiate the transfers needed to allow the peers to run the subranges they have been assigned. Note the important distinction between subranges, which contain work-items, and intervals, which contain data. The remainder of this section describes the execution process in more detail, illustrating each step with a vector addition example, whose kernel source code is given in Listing 3.1.

3.2.1 Partitioning

Partitioning divides the NDRange of a kernel execution into smaller grids called subranges. DistCL never fragments work-groups, as that would violate OpenCL’s execution model and could lead to incorrect kernel execution. For linear (1D) NDRanges, if the number of work-groups is a multiple of the number of peers, each subrange will be equal in size. Otherwise, some subranges will be one work-group larger than others. DistCL partitions a multidimensional NDRange along its highest dimension first, in the same way it would partition a linear NDRange. If the subrange count is less than the peer count, DistCL will continue to partition lower dimensions.
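As an illustration of this scheme, the sketch below splits the work-groups of one dimension among the peers; the function name and interface are invented for this example and are not part of DistCL's actual code.

/* Sketch of highest-dimension-first partitioning for one dimension: work-groups are
 * split as evenly as possible, with the first (num_groups % peers) subranges one
 * work-group larger. Names are illustrative, not DistCL's real interface. */
#include <stddef.h>

static void partition_dimension(size_t num_groups, size_t peers, size_t *subrange_sizes)
{
    size_t base  = num_groups / peers;   /* every subrange gets at least this many work-groups */
    size_t extra = num_groups % peers;   /* remainder handed out one work-group at a time */
    for (size_t p = 0; p < peers; p++)
        subrange_sizes[p] = base + (p < extra ? 1 : 0);
}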

Multidimensional arrays are often organized in row-major order, so highest-dimension-first partitioning frequently results in subranges accessing contiguous regions of memory. Transferring fragmented regions of memory requires multiple I/O operations to avoid transferring unnecessary regions, whereas large contiguous regions can be sent all at once.

Our vector addition example has a one-dimensional NDRange. Assume it runs with 1M = 2^20 work-items on a cluster with 4 peers. Assuming 1 subrange per peer, the NDRange will be partitioned into 4 subranges, each with a size of 256k work-items, as shown in Figure 3.1.

3.2.2 Dependencies

The host program allocates OpenCL buffers and can read from or write to them through OpenCL function calls. Kernels are passed these buffers when they are invoked. For example, the three parameters a, b, and out in Listing 3.1 are buffers.

DistCL must know what parts of each buffer a subrange will access in order to create the illusion of many compute devices with separate memories sharing a single memory. The sets of addresses in a buffer that a subrange reads and writes are called its read-set and write-set, respectively. DistCL represents these access-sets with concrete data structures and calculates them using meta-functions. Access-sets are calculated on every kernel invocation, for every subrange-buffer combination, because the access patterns of a subrange depend on the invocation's parameters, partitioning, and NDRange. In our vector addition example with 4 subranges and 3 buffers, 24 access-sets will be calculated: 12 read-sets and 12 write-sets.

An access-set is a list of intervals within a buffer. DistCL represents addresses in buffers as offsets from the beginning of the buffer; thus an interval is represented with a low and high offset into the buffer. These intervals are half open; low offsets are part of the intervals, but high offsets are not.
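As a concrete illustration, an access-set can be represented along the following lines; the structure and field names are a sketch, not taken from DistCL's source.

/* Sketch of the interval and access-set representation described above. Offsets are
 * bytes from the start of the buffer; intervals are half open, so [low, high)
 * includes low but excludes high. */
#include <stddef.h>

typedef struct {
    size_t low;              /* first byte offset inside the interval */
    size_t high;             /* first byte offset past the end of the interval */
} interval_t;

typedef struct {
    interval_t *intervals;   /* sorted, non-overlapping intervals of one buffer */
    size_t      count;       /* number of intervals in the list */
} access_set_t;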

For instance, subrange 1 in Figure 3.1 contains global IDs from the interval [256k, 512k). As seen in Listing 3.1, each work-item produces a 4-byte (sizeof (int)) integer, so subrange 1 produces the data for interval [1 MB, 2 MB) of out. Subrange 1 will also read the same 1 MB region from buffers a and b to produce this data. The intervals [0 MB, 1 MB) and [2 MB, 4 MB) of a, b and out are not accessed by subrange 1.

Calculating Dependencies

To determine the access-sets of a subrange, DistCL uses programmer-written, kernel-specific meta-functions. Each kernel has a read meta-function to calculate read-sets and a write meta-function to calculate write-sets.

DistCL passes meta-functions information regarding the kernel invocation's geometry. This includes the invocation's global size (global in Listing 3.2), the current subrange's size (subrange), and the local size (local). DistCL also passes the immediate parameters of the kernel (params) to the meta-function. The subrange being considered is indicated by its starting offset in the NDRange (subrange_offset) and the buffer being considered is indicated by its zero-indexed position in the kernel's parameter list (param_num).

DistCL builds access-sets one interval at a time, progressing through the intervals in order, from the beginning of the buffer to the end. Each call to the meta-function generates a new interval. If and only if the meta-function indicates that this interval is accessed, DistCL includes it in the access-set.

To call a meta-function, DistCL passes the low offset of the current interval through start and the meta-function sets next_start to its end. The meta-function's return value specifies whether the interval is accessed. Initially setting start to zero, DistCL advances through the buffer by setting the start of subsequent calls to the previous value of next_start. When the meta-function sets next_start to the size of the buffer, the buffer has been fully explored and the access-set is complete.

 1 int is_buffer_range_read_vector(
 2     const void **params, const size_t *global,
 3     const size_t *subrange, const size_t *local,
 4     const size_t *subrange_offset, unsigned int param_num,
 5     size_t start, size_t *next_start)
 6 {
 7     int ret = 0;
 8     *next_start = sizeof(int) * global[0];
 9     if (param_num != 2) {
10         start /= sizeof(int);
11         ret = require_region(1, global, subrange_offset, subrange, start, next_start);
12         *next_start *= sizeof(int);
13     }
14     return ret;
15 }
Listing 3.2: Read meta-function.
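To make the calling convention concrete, the loop below sketches how a runtime could drive the meta-function of Listing 3.2 to build a read-set. It reuses the interval_t/access_set_t sketch from earlier, and add_interval() is a hypothetical helper; this is an illustration of the protocol, not DistCL's implementation.

/* Walk a buffer of buffer_size bytes, calling the read meta-function once per interval
 * and recording the intervals it reports as read. */
static void build_read_set(const void **params, const size_t *global,
                           const size_t *subrange, const size_t *local,
                           const size_t *subrange_offset, unsigned int param_num,
                           size_t buffer_size, access_set_t *set)
{
    size_t start = 0;
    while (start < buffer_size) {
        size_t next_start = 0;
        int read = is_buffer_range_read_vector(params, global, subrange, local,
                                               subrange_offset, param_num,
                                               start, &next_start);
        if (read)
            add_interval(set, start, next_start);   /* record [start, next_start) */
        start = next_start;                         /* advance to the next interval */
    }
}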

Meta-Function Verification

DistCL provides a tool that allows meta-functions to be verified. It can be configured to verify the meta-function of a kernel for any desired NDRange. A configuration file instructs the tool which kernel to run, what buffers to create, what subrange to consider, and whether the read or write meta-function should be considered. The kernel code must be modified to write the value 1 to each memory location it accesses. Both the kernel and the meta-function are then run and their outputs are compared. Both a graphical and a text representation are then provided to show any regions the meta-function missed and any regions the meta-function included that were not modified by the kernel. This tool allows meta-functions to be tested independently, which is particularly helpful for multi-kernel benchmarks, where errors in an early meta-function would otherwise propagate and make it difficult to tell which meta-function caused the benchmark to execute incorrectly.

Rectangular Regions

Many OpenCL kernels structure multidimensional arrays into linear buffers using row-major order. When these kernels run, their subranges typically access one or more linear, rectangular, or prism-shaped areas of the array. Though these areas are contiguous in multidimensional space, they are typically made up of many disjoint intervals in the linear buffer. Recognizing this, DistCL has a helper function, called require_region, that meta-functions can use to identify which linear intervals of a buffer constitute any such area.

require_region, whose prototype is shown in Listing 3.3, operates over a hypothetical dim-dimensional grid. Typically, each element of this grid represents an element in a DistCL buffer. The size of this grid in each dimension is specified by the dim-element array total_size. require_region considers a rectangular region of that grid whose size and offset into the grid are specified by the dim-element arrays required_size and required_start, respectively. Given this, require_region calculates the linear intervals that correspond to that region if the elements of this dim-dimensional grid were arranged linearly, in row-major order. Because there may be many such intervals, the return value, start parameter, and next_start parameter of require_region work the same way as in a meta-function, allowing the caller to move linearly through the intervals, one at a time. If a kernel does not access memory in rectangular regions, it does not have to use the helper function.

Even though vector has a one-dimensional NDRange, require_region is still used for its read meta-function in Listing 3.2.

int require_region(int dim, const size_t *total_size, const size_t *required_start,
                   const size_t *required_size, size_t start, size_t *next_start);

Listing 3.3: require_region helper function.

Figure 3.2: The read meta-function is called for buffer a in subrange 1 of vector.

This is because require_region not only identifies the interval that will be used, but also identifies the intervals on either side that will not be used. require_region is passed global as the hypothetical grid's size, making each grid element correspond to an integer, the data type modified by a single work-item. Therefore, lines 10 and 12 of Listing 3.2 translate between elements and bytes, which differ by a factor of sizeof(int).

Figure 3.2 shows what actually happens when the meta-function is called on buffer a for subrange 1. In Figure 3.2a, the first time the meta-function is called, DistCL passes in 0 as the start of the interval and the meta-function calculates that the current interval is not in the read-set, and that the next interval starts at an offset of 1 MB. Next, in Figure 3.2b, DistCL passes in 1 MB as the start of the interval. The meta-function calculates that this interval is in the read-set and that the next interval starts at 2 MB. Finally, in Figure 3.2c, DistCL passes in 2 MB as the start of the interval. The meta-function calculates that this interval is not in the read-set and that it extends to the end of the buffer, which has a size of 4 MB.
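For a kernel that does write rectangular tiles, a write meta-function might use require_region as sketched below. The kernel, its 2D NDRange over an n × n row-major float matrix, and the assumption that NDRange dimensions map directly onto the grid dimensions are all invented for this illustration and do not correspond to any benchmark in this chapter.

/* Hypothetical write meta-function for a kernel whose subrange writes one rectangular
 * tile of an n-by-n row-major float matrix. The grid is described in elements; the
 * element/byte conversions mirror lines 10 and 12 of Listing 3.2. */
#include <stddef.h>

int is_buffer_range_written_tile(
    const void **params, const size_t *global, const size_t *subrange,
    const size_t *local, const size_t *subrange_offset, unsigned int param_num,
    size_t start, size_t *next_start)
{
    (void)params; (void)local; (void)param_num;       /* unused in this simplified sketch */

    size_t total[2]      = { global[0],          global[1] };
    size_t tile_start[2] = { subrange_offset[0], subrange_offset[1] };
    size_t tile_size[2]  = { subrange[0],        subrange[1] };

    start /= sizeof(float);                           /* bytes -> elements */
    int ret = require_region(2, total, tile_start, tile_size, start, next_start);
    *next_start *= sizeof(float);                     /* elements -> bytes */
    return ret;
}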

3.2.3 Scheduling Work

The scheduler is responsible for deciding when to run subranges and on which peer to run them. The scheduler runs on the master and broadcasts messages to the peers when it assigns work. DistCL uses a simple scheme for determining where to run subranges. If the number of subranges equals the number of peers, each peer gets one subrange; however, if the number of subranges is fewer, some peers are never assigned work.

3.2.4 Transferring Buffers

When DistCL executes a kernel, the data produced by the kernel is distributed across the peers in the cluster. The way this data is distributed depends on how the kernel was partitioned into subranges, how these subranges were scheduled, and the write-sets of these subranges. DistCL must keep track of how the data in a buffer is distributed, so that it knows when it needs to transfer data between nodes to satisfy subsequent reads - which may not occur on the same peer that originally produced the data.

DistCL represents the distribution of a buffer in a similar way to how it represents dependency information. The buffer is again divided into a set of intervals, but this time each interval is associated with the ID of the node that has last written to it. This node is referred to as the owner of that interval.

Buffers

Every time the host program creates a buffer, the master allocates a region of host (not device) memory, equal in size to the buffer, which DistCL uses to cache writes that the host program makes to the buffer. Whether the host program initializes the buffer or not, the buffer’s dependency information specifies the master as the sole owner of the buffer. Additionally, each peer allocates, but does not initialize, an OpenCL buffer of the specified size. Generally, most peers will never initialize the entire contents of their buffers because each subrange only accesses a limited portion of each buffer. However, this means that using DistCL does not give access to any more memory than can be found in a single GPU.

Satisfying Dependencies

When a subrange is assigned to a peer, before the subrange can execute, DistCL must ensure that the peer has an up-to-date copy of all the memory in the subrange’s read-set. For every buffer, DistCL compares the ownership information to the subrange’s read-set. If data in the read-set is owned by another node, DistCL initiates a transfer between that node and the assigned node. Once all the transfers have completed, the assigned peer can execute the subrange. When the kernel completes, DistCL also updates the ownership information to reflect the fact that the assigned peer now has the up-to-date copy of the data in the subrange’s write-set. DistCL also implements host-enqueued buffer reads and writes using this mechanism.
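The comparison between ownership information and a read-set can be sketched as follows; the ownership record, the transfer callback, and the overall shape of the code are illustrative only (interval_t as in the earlier sketch), not DistCL's internals.

/* Sketch: for one buffer, fetch every region that the subrange reads but that is
 * owned by a different node. Overlaps are computed between the buffer's ownership
 * intervals and the subrange's read-set. */
#include <stddef.h>

typedef struct { size_t low, high; int owner; } owned_interval_t;

static void satisfy_reads(int assigned_peer,
                          const owned_interval_t *ownership, size_t n_owned,
                          const interval_t *read_set, size_t n_read,
                          void (*transfer)(int from, int to, size_t low, size_t high))
{
    for (size_t i = 0; i < n_owned; i++) {
        if (ownership[i].owner == assigned_peer)
            continue;                               /* this region is already up to date locally */
        for (size_t j = 0; j < n_read; j++) {
            size_t low  = ownership[i].low  > read_set[j].low  ? ownership[i].low  : read_set[j].low;
            size_t high = ownership[i].high < read_set[j].high ? ownership[i].high : read_set[j].high;
            if (low < high)                         /* the overlap must be fetched before running */
                transfer(ownership[i].owner, assigned_peer, low, high);
        }
    }
}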

Transfer Mechanisms

Peer-to-peer data transfers involve both intra-peer and inter-peer operations. For memory reads, data must first be transferred from the GPU into a host buffer. Then, a network operation can transfer that host buffer. For writes, the host buffer is copied back to the GPU. DistCL uses an OpenCL mechanism called mapping to transfer between the host and GPU.
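The read side of such a transfer could look roughly like the snippet below, which maps one half-open interval of a device buffer into host memory, hands it to the network layer, and unmaps it. clEnqueueMapBuffer and clEnqueueUnmapMemObject are standard OpenCL calls; net_send() is a hypothetical stand-in for DistCL's networking code.

#include <CL/cl.h>

/* Hypothetical network helper: sends len bytes to the given node. */
extern void net_send(int dest_node, const void *data, size_t len);

/* Map the interval [low, high) of buf for reading, send it, then unmap it. */
static int send_interval(cl_command_queue q, cl_mem buf,
                         size_t low, size_t high, int dest_node)
{
    cl_int err;
    void *host_ptr = clEnqueueMapBuffer(q, buf, CL_TRUE, CL_MAP_READ,
                                        low, high - low, 0, NULL, NULL, &err);
    if (err != CL_SUCCESS)
        return -1;
    net_send(dest_node, host_ptr, high - low);
    return clEnqueueUnmapMemObject(q, buf, host_ptr, 0, NULL, NULL) == CL_SUCCESS ? 0 : -1;
}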

3.3 Experimental Setup

Eleven applications, from the Rodinia benchmark suite v2.3 [47][48], the AMD APP SDK [49], and GNU libgcrypt [50], were used to evaluate our framework. Each benchmark was run three times, and the median time was taken. This time starts when the host initializes the first buffer and ends when it reads back the last buffer containing the results, thereby including all buffer transfers and computations required to make it seem as if the cluster were one GPU with a single memory. The time for each benchmark is normalized against a standard OpenCL implementation using the same kernel running on a single GPU, including all transfers between the host and device. We group the benchmarks into three categories:

1. Linear compute and memory characteristics: nearest neighbor, hash, Mandelbrot;

2. Compute-intensive: binomial option, Monte Carlo;

3. Inter-node communication: n-body, bitonic sort, back propagation, HotSpot, k-means, LU decomposition.

These benchmarks were chosen to represent a wide range of data-parallel applications. They will provide insight into what type of workloads benefit from distributed execution. The three categories of problems give a spread of asymptotic complexities. This allows the effect of distribution, which primarily affects memory transfers, to be studied with tasks of varying compute-to-transfer ratios. The important characteristics of the benchmarks are summarized in Table 3.1, and each is described below. For the Rodinia benchmarks, many of the problem sizes are quite small, but they were all run with the largest possible problem size, given the input data distributed with the suite. The worst relative standard deviation in runtime for any benchmark is 11%, with the Rodinia benchmarks’ runtime varying the most due to their smaller problem sizes. For the non-Rodinia benchmarks, it was usually under 1%.

Table 3.1: Benchmark Description
Benchmark | Source | Inputs | Complexity | Work-Items | Kernels | Problem Size per Kernel (bytes)
Nearest neighbor | Rodinia | 42764 locations (n) | O(n) | n | 1 | 12n
Mandelbrot | AMD | 24M points (n); x: 0 to 0.5, y: −0.25 to 0.25; max iterations (k) 1000 | O(kn) | n | 1 | 4n
Hash | Libgcrypt | 24M hashes (n) | O(n) | n | 1 | 32 + n
Binomial | AMD | 786432 samp. (k); 767 iterations (n) | O(kn²) | k(n + 1) | 1 | 32k
Monte Carlo | AMD | 4k sums (n); 1536 samp. (m); 10 steps (k) | O(knm²) | nm²/8 | k | 2m²(2n + 1)
n-body | AMD | 768k bodies (n); 1, 8 iter. (k) | O(kn²) | n | k | 16n
Bitonic | AMD | 32M elem. (n) | O(n lg²n) | n/2 | lg²n | 4n
k-means | Rodinia | 819200 points (n); 34 features (k); 5 clusters (c) | kern. 1: O(nk) | n | 1 | 8nk
 | | | kern. 2: O(nck) | n | var. | 4(2nk + c)
Back propagation | Rodinia | 4M input nodes (n); 16 hidden nodes (k) | kern. 1: O(nk) | nk | 2 | 4(kn + 3n + 2k + 4)
 | | | kern. 2: O(nk) | nk | | 4(kn + 2n + k + 3)
HotSpot | Rodinia | chip dim. (n) 1k; time-steps: per-kernel (x) 5, total (k) 60 | O(n²) | 256⌈n/(16−2x)⌉² | ⌈k/x⌉ | 4(⌈n/(16−2x)⌉(32/(8−x)))² + 4n²
LUD | Rodinia | matrix dim. (n) 2k; n/16 − 1 iterations (k); current iter. denoted (i) | kern. 1: O(n) | 16² | k + 1 | 4n²
 | | | kern. 2: O(n²) | 2n − 32(i + 1) | k | 4n²
 | | | kern. 3: O(n³) | (n − 16(i + 1))² | k | 4n²

3.3.1 Linear Compute and Memory

All linear benchmarks consist of n work-items and a single kernel invocation. For these benchmarks, the amount of data transferred scales linearly with the problem size. The compute-to-transfer ratio remains constant regardless of problem size.

Nearest neighbor. This benchmark determines the nearest locations to a specified point from a list of available locations. Each work-item calculates the Euclidean distance between a single location and the specified point. The input buffer consists of n coordinates and the output is an n-element buffer of distances. Since 12 bytes are transferred per distance calculation, this benchmark has a very low compute-to-transfer ratio and is therefore poorly suited to distribution.
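A kernel of this shape can be sketched as below; the exact parameter layout of the Rodinia kernel may differ, but the sketch shows where the 12 bytes per work-item come from: an 8-byte coordinate pair in and a 4-byte distance out.

/* Sketch of a nearest-neighbour style distance kernel. Each work-item reads one
 * (latitude, longitude) pair and writes one float distance. */
__kernel void nearest_neighbor(__global const float2 *locations,
                               float lat, float lng,
                               __global float *distances)
{
    int gid = get_global_id(0);
    float2 d = locations[gid] - (float2)(lat, lng);
    distances[gid] = sqrt(d.x * d.x + d.y * d.y);
}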

Hash. The hash benchmark attempts to find the hash collision of a sha-256 hash, similar to Bitcoin [51] miners. Each work-item hashes its global ID and compares the result to the provided hash. Hash is well-suited to distribution because the only data transmitted is the input hash and a single byte from each work-item that indicates whether a collision was found.

Mandelbrot. This benchmark uses an iterative function to determine whether or not a point is a member of the Mandelbrot set. Each work-item iterates over a single point which it determines using its global ID. This benchmark is well suited to distribution because it has similar characteristics to the hash benchmark. There are no input buffers and only an n-element buffer that is written back after the kernel execution, giving it a high compute-to-transfer ratio.
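The structure of such a kernel is sketched below; the parameter names and the mapping from global ID to coordinates are assumptions for the illustration, not the AMD sample's exact code.

/* Sketch of a Mandelbrot kernel: each work-item derives its point from its global ID,
 * iterates z = z*z + c up to max_iter times, and writes only a 4-byte result. */
__kernel void mandelbrot(__global int *out, float x0, float y0, float step,
                         int max_iter, int width)
{
    int gid = get_global_id(0);
    float cx = x0 + (gid % width) * step;
    float cy = y0 + (gid / width) * step;
    float zx = 0.0f, zy = 0.0f;
    int it = 0;
    while (zx * zx + zy * zy < 4.0f && it < max_iter) {
        float t = zx * zx - zy * zy + cx;
        zy = 2.0f * zx * zy + cy;
        zx = t;
        it++;
    }
    out[gid] = it;    /* no input buffers, one int written: high compute-to-transfer ratio */
}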

3.3.2 Compute-Intensive

Binomial option. Binomial Option is used to value American options and is common in the financial industry. It involves creating a recombinant binomial tree that is n levels deep, where n is the number of iterations. This creates a tree with n + 1 leaf nodes and one work-item calculates each leaf. The n + 1 work-items take the same input, and only produce one result. Therefore, as the number of iterations is increased, the amount of computation grows quadratically, as the tree gets both taller and wider, while the amount of data that needs to be transferred remains constant. This benchmark is very well suited to distribution. Since all samples can be valued independently, only a single kernel invocation is required.

Monte Carlo. This benchmark uses the Monte Carlo method to value Asian options. Asian options are far more challenging to value than American options, so a stochastic approach is employed. This benchmark requires mn²/8 work-items and m kernel invocations, where n is the number of options and m the number of steps used.

3.3.3 Inter-Node Communication

These benchmarks all have inter-node communication between kernel invocations, as opposed to the other benchmarks, where nodes only need to communicate with the master. The inter-node communication allows the full path diversity of the network to be used when data is being updated between kernels. These benchmarks, like the others, require a high compute-to-transfer ratio to see a benefit from distribution.

n-body. This benchmark models the movement of bodies as they are influenced by each other's gravity. For n bodies and k iterations, this benchmark runs k kernels with n work-items each. Each work-item is responsible for updating the position and velocity of a single body, using the position and mass of all other bodies in the problem. Data transfers occur initially (when each peer receives the initial position, mass, and velocity for the bodies it is responsible for), between kernel invocations (when position information must be updated globally), and at the end (when the final positions are sent back to the host). As the number of bodies increases, the amount of computation required increases quadratically, while the amount of data to transfer only increases linearly, meaning that larger problems are better suited to distribution.

Bitonic sort. Bitonic sort is a type of merge sort well suited to parallelization. For an n-element array, it requires (n/2) lg² n comparisons, each of which is performed by a work-item, through lg² n kernel invocations. Each kernel invocation is a global synchronization point and could potentially involve data transfers. Bitonic sort divides its input into blocks which it operates on independently. While there are more blocks than peers, no inter-node communication takes place; only when the blocks are split between peers does communication begin.

k-means. This benchmark clusters n points into c clusters using k features. This benchmark contains two kernels: the first, which is only executed once, simply transposes the features matrix. The second kernel is responsible for the clustering. This kernel is executed until the result converges, which varies depending on the input data. For the largest input set available it took 20 kernel invocations before convergence. Both kernels consist of n work-items. For the first kernel, each work-item reads a row of the input array and writes it to a column of the output array. This results in a non-ideal memory access pattern for the writes. The second kernel reads columns of the features matrix, and the entirety of the cluster matrix, which contains the centroid coordinates of each of the existing clusters. The writes of this kernel are contiguous because each work-item uses its one-dimensional global ID as an index into the array where it writes its answer.

Back propagation. This benchmark consists of the training of a two-layer neural network and contains two kernels. For a network with n input nodes and k hidden nodes, each kernel requires nk work-items. The work is divided such that each work-item is responsible for the connection between an input node and one of the hidden nodes. As the number of input nodes grows, the amount of computation required increases linearly, since the number of hidden nodes is fixed for this benchmark.

HotSpot. This benchmark models processor temperature based on a simulated power dissipation profile. The chip is divided into a grid and there is a work-item responsible for calculating the temperature in each cell of the grid. The temperature depends on the power produced by the chip at that cell, as well as the temperature of the four neighboring cells. To avoid having to transfer data between work-groups at each time-step, this benchmark uses a “pyramid” approach. Since we need the temperature of all neighboring cells when updating a cell's temperature, we will always read the temperature for a larger region than we will write. If we read an extra x cells in each direction, we can find the temperature after x time-steps without any memory transfers. For each time-step, we calculate the updated temperature for a region that is smaller by one cell in each direction, and that region then becomes the input for the next time-step. This creates a “pyramid” of concentric input regions of height x. While this results in fewer memory transfers, it does mean that some work will be duplicated, as there will be multiple work-groups calculating the temperature of overlapping regions during intermediate time-steps. The total amount of computation performed increases with x, while the amount of memory transferred decreases.

LU decomposition. This benchmark factors a square matrix into unit lower triangular, unit upper triangular, and diagonal matrices. LU decomposition consists of three kernels that calculate the diagonal, perimeter, and remaining values, respectively. These kernels operate over a square region of the matrix, called the area of interest. The problem is solved in 16 × 16 element blocks, so for a matrix of size n × n, LU decomposition requires n/16 − 1 iterations. At each iteration the area of interest shrinks, losing 16 rows from the top and 16 columns from the left. Each iteration, the diagonal kernel updates a single block; the perimeter kernel updates the top 16 rows and left-most 16 columns of the area of interest; and the internal kernel updates the entire area of interest. After all the iterations, the diagonal kernel is run again to cover the bottom right block. While the perimeter and internal kernels can scale well, performance is limited by the diagonal kernel, which consists of a single work-group and cannot be parallelized. This benchmark is not well suited to DistCL because of its inter-node communication, complex access pattern, and lack of parallelism.

Table 3.2: Cluster Specifications
Number of Nodes | 49
GPUs Per Node | 2 (1 used)
GPU | NVIDIA Tesla M2090
GPU memory | 6 GB
Shader / Memory clock | 1301 / 1848 MHz
Compute units | 16
Processing elements | 512
Network | 4× QDR Infiniband (4 × 10 Gbps)
CPU | Intel E5-2620
CPU clock | 2.0 GHz
System memory | 32 GB

Table 3.3: Measured Cluster Performance
Transfer type | Test | 64 MB Latency ms (Gbps) | 8 B Latency ms (Mbps)
In-memory | Single thread memcpy() | 26.5 (20.3) | 0.0030 (21)
Inter-device | OpenCL map for reading | 36.1 (14.9) | 0.62 (0.10)
Inter-node | Infiniband round trip time | 102 (10.5) | 0.086 (3.0)

3.3.4 Cluster

Our framework is evaluated using a cluster with an Infiniband interconnect [12]. The configurations and theoretical performance are summarized in Table 3.2. The cluster consists of 49 nodes. Though there are two GPUs per node, we use only one to focus on distribution between machines. We present results for 1, 2, 4, 8, 16 and 32 nodes.

We use three microbenchmarks to test the cluster and to aid in understanding the overall performance of our framework. The results of the microbenchmarks are reported in Table 3.3. We first test the performance of memcpy() by copying a 64 MB array between two points in host memory. We initialize both arrays to ensure that all the memory is paged in before running the timed portion of the code. The measured memory bandwidth was 20.3 Gbps.
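A minimal version of that test looks roughly as follows; the buffer size and the pre-touching of both arrays are as described above, while the rest is an illustrative sketch rather than the exact program used.

/* Time a single 64 MB memcpy() between two pre-touched host buffers and report the
 * resulting bandwidth in Gbps. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

int main(void)
{
    size_t size = 64UL << 20;                 /* 64 MB */
    char *src = malloc(size), *dst = malloc(size);
    memset(src, 1, size);                     /* touch both buffers so every page is resident */
    memset(dst, 0, size);

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    memcpy(dst, src, size);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double seconds = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("%.1f Gbps\n", size * 8 / seconds / 1e9);
    free(src);
    free(dst);
    return 0;
}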

To test OpenCL map performance, a program was written that allocates a buffer, executes a GPU kernel that increments each element of that buffer, and then reads that buffer back with a map. The program executes a kernel to ensure that the GPU is the only device with an up-to-date version of the buffer. Every time the host program maps a portion of the buffer back, it reads that portion, to force it to be paged into host memory. The program reports the total time it took to map and read the updated buffer. To test the throughput of the map operation, the mapping program reads a 64 MB buffer with a single map operation. Only the portion of the program after the kernel execution completes gets timed. We measured 14.9 Gbps of bandwidth between the host and the GPU. The performance of an 8-byte map was measured to determine its overhead. An 8-byte map takes 620 µs, equivalent to 100 kbps. This shows that small fragmented maps lower DistCL’s performance.

The third program tests network performance. It sends a 64 MB message from one node to another and back. The round trip over Infiniband took 102 ms, and each one-way trip took only 51 ms on average, yielding a transfer rate of 10.5 Gbps. Since Infiniband uses 8b/10b encoding, this corresponds to a signalling rate of 13.1 Gbps, which still falls short of the maximum signalling rate of 40 Gbps. Even using a high-performance Infiniband network, transfers are slower than maps and memory copies. For this reason, it is important to keep network communication to a minimum to achieve good performance when distributing work. Infiniband is designed to be low-latency, and as such its invocation overhead is lower than that of maps.
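The quoted rates follow directly from the message size and the measured one-way time:

\[
\frac{64 \times 2^{20} \times 8\ \text{bits}}{0.051\ \text{s}} \approx 10.5\ \text{Gbps},
\qquad
10.5\ \text{Gbps} \times \frac{10}{8} \approx 13.1\ \text{Gbps (8b/10b signalling rate)}.
\]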

3.3.5 SnuCL

We compare the performance of DistCL with SnuCL (version 1.2 beta, downloaded November 15th, 2012) [11], another framework that allows OpenCL code to be distributed across a cluster. SnuCL can create the illusion that all OpenCL devices on the cluster belong to a single local context. It is designed with heterogeneous hardware in mind and translates OpenCL code to CUDA when running on Nvidia GPUs and to C when running on a CPU. To distribute a task with SnuCL, the programmer must partition the work into many kernels, ensuring that no two kernels write to the same buffer. If the kernel is expected to run on a variable number of devices, the programmer is responsible for ensuring that their division technique handles all the necessary cases. SnuCL transfers memory between nodes automatically, but requires the programmer to divide their dataset into many buffers to ensure that each buffer is written to by at most one node. For efficiency, the programmer should also divide up the buffers that are being read, to avoid unnecessary data transfers. The more regular a kernel’s access pattern, the larger each buffer can be, and the fewer buffers there will be in total. With SnuCL, the programmer uses OpenCL buffer copies to transfer data. SnuCL will determine if a buffer copy is internal to a node, in which case it uses a normal OpenCL copy, or if it is between nodes, in which case it uses MPI. If subsequent kernel invocations require a different memory division, this task again falls to the programmer, who will have to create new buffers and copy the appropriate regions from existing buffers.

The buffers in SnuCL are analogous to the intervals generated by meta-functions in DistCL. However, in SnuCL, buffers must be explicitly created, resized, and redistributed when access patterns change, whereas DistCL manages changes to intervals automatically. SnuCL does not abstract the fact that there are multiple devices; it only automates transfers and keeps track of memory placement. When using SnuCL, the programmer is presented with a single OpenCL platform that contains as many compute devices as are available on the entire cluster. The programmer is then responsible for dividing up work between the compute devices. If SnuCL is linked to existing OpenCL code, all computation will simply happen on a single compute device, as the code must be modified before SnuCL can distribute it. Existing code can be linked to DistCL without any algorithmic modification, reducing the likelihood of introducing new bugs. The meta-functions are written in a separate file that is pointed to by the DistCL configuration file. This allows the meta-functions file to be managed separately from both the kernel and host files.

We ported four of our benchmarks to SnuCL. We did not compare DistCL and SnuCL using the Rodinia benchmarks: while porting the other benchmarks to DistCL involved only the inclusion of meta-functions, porting the Rodinia benchmarks to SnuCL is a much more involved process that would require modifications altering their characteristics. We were also unable to compare the inter-node communication benchmarks due to a (presumably unintentional) limitation in SnuCL’s buffer transfer mechanism. The one exception was n-body, which ran correctly for a single iteration, removing the need for inter-node communication and turning it into a compute-intensive benchmark.

Since kernels and buffers are subdivided to run using SnuCL, sometimes kernel arguments or the kernel code itself must be modified to preserve correctness. For example, Mandelbrot requires two arguments that specify the initial x and y values used by the kernel. To ensure work is not duplicated, no two kernels can be passed the same initial coordinates, and the programmer must determine the appropriate value for each kernel. Kernels such as n-body require an additional offset parameter because per-peer buffers can be accessed using the global ID, but globally shared buffers must be accessed with what would be the global ID if the problem was solved using a single NDRange. Hash also required an offset parameter since it used a work-item’s global ID as the preimage. Similar changes must be made for any kernel that uses the value of its global ID or an input parameter to determine what part of a problem it is working on, rather than data from an input buffer.

Having ported benchmarks to both DistCL and SnuCL, we find the effort is less for DistCL. Using DistCL, we can focus solely on the memory access patterns of a kernel when writing its meta-functions, and the host code does not need to be understood. However, with SnuCL, we must understand both the host and kernel code. Debugging was also simpler with DistCL thanks to its meta-function verification tool. This allowed us to write meta-functions one kernel at a time. We did not port any multi-kernel benchmarks to SnuCL, but we did attempt to debug some, before we found the buffer transfer issue, and it was quite difficult to find the exact source of error for incorrect benchmarks. Another advantage of DistCL is the fact that once a benchmark has been successfully distributed, we can be confident it will distribute correctly for any number of peers.

While it is simpler to port applications to DistCL than to SnuCL, the latter should experience lower runtime overhead. This is because the work done by meta-functions at runtime has already been done by the programmer. With SnuCL, there is less need to synchronize globally, as buffer ownership does not need to be kept coherent across nodes. There is also less communication taking place, as nodes need not be informed of buffer ownership unless a transfer is required, in which case the host program will explicitly specify the source and destination devices.

3.4 Results and Discussion

Figure 3.3 shows the speedups obtained by distributing the benchmarks using DistCL, compared to using normal OpenCL on a single node. Compute-intensive benchmarks see significant benefit from being distributed, with binomial achieving a speedup of over 29× when run on 32 peers. The more compute-intensive linear benchmarks, hash and Mandelbrot, also see speedup when distributed. Of the inter-node communication benchmarks, only n-body benefits from distribution, but it does see almost perfect scaling from 1 to 8 peers and a speedup of just under 15× on 32 peers.


Figure 3.3: Speedup of distributed benchmarks using DistCL.

For the above benchmarks, we see better scaling when the number of peers is low. While the amount of data transferred remains constant, the amount of work per peer decreases, so communication begins to dominate the runtime.

The remaining inter-node communication and linear benchmarks actually run slower when distributed versus using a single machine. These benchmarks all have very low compute-to-transfer ratios, so they are not good candidates for distribution. For the Rodinia benchmarks in particular, the problem sizes are very small. Aside from LU decomposition, they took less than three seconds to run. Thus, there is not enough work to amortize the overheads.

Figure 3.4 shows a run-time breakdown of the benchmarks for the 8 peer case. Each run is broken down into five parts:

Buffer — the time taken by host program buffer reads and writes.

Execution — the time during which there was at least one subrange execution but no inter-node transfers.

Transfer — the time during which there was at least one inter-node transfer but no subrange executions.

Overlapped transfer/execution — the time during which both subrange execution and memory transfers took place.

Other/sync — the average time the master waited for other nodes to update their dependency information.


Figure 3.4: Breakdown of runtime.


The benchmarks which saw the most speedup in Figure 3.3 also have the highest proportion of time spent in execution. The breakdowns for binomial, Monte Carlo, and n-body are dominated by execution time; whereas, the breakdowns for nearest neighbor, back propagation and LU decomposition are dominated by transfers and buffer operations, which is why they did not see a speedup. One might wonder why Mandelbrot sees a speedup, but bitonic and k-means do not, despite the proportion of time they spent in execution being similar. This is because Mandelbrot and hash are dominated by host buffer operations, which also account for a significant portion of execution with a single GPU. In contrast, Bitonic and k-means have higher proportions of inter-node communication, which map to much faster intra-device communication on a single GPU.

Table 3.4 shows the amount of time spent managing dependencies. This includes running meta-functions, building access-sets, and updating buffer information. Table 3.4 also shows the time spent per kernel invocation, and the time as a proportion of the total runtime. Benchmarks that have fewer buffers, like Mandelbrot and Bitonic Sort, spend less time applying dependency information per kernel invocation than benchmarks with more buffers. LU decomposition has the most complex access pattern of any benchmark. Its kernels operate over non-coalescable regions that constantly change shape. Further, the fact that none of LU decomposition’s kernels update the whole array means that ownership information from previous kernels is passed forward, forcing the ownership information to become more fragmented and take longer to process.

Table 3.4: Execution Time Spent Managing Dependencies
Benchmark | Total Time (µs) | Per Kernel Invocation Time (µs) | Percent of Runtime
Mandelbrot | 109 | 109 | 0.097
Hash | 112 | 112 | 0.15
Nearest neighbor | 120 | 120 | 0.62
Binomial | 126 | 126 | 0.0043
Monte Carlo | 1500 | 150 | 0.028
n-body | 166 | 166 | 0.00045
Bitonic Sort | 29900 | 91.9 | 2.9
k-means | 30400 | 1450 | 4.6
Back propagation | 434 | 217 | 0.017
HotSpot | 23500 | 981 | 5.9
LUD | 2.31 × 10^7 | 60400 | 30

With the exception of LU decomposition, the time spent managing dependencies is low, demonstrating that the meta-function based approach is intrinsically efficient.

An interesting characteristic of HotSpot is that the compute-to-transfer ratio can be altered by changing the pyramid height. The taller the pyramid, the higher the compute-to-transfer ratio. However, this comes at the price of doing more computation than necessary. Figure 3.5 shows the speedup of HotSpot run with a pyramid height of 1, 2, 3, 4, 5, and 6. The distributed results are for 8 peers. Single-GPU results were acquired using conventional OpenCL. In both cases, the speedups are relative to that framework’s performance using a pyramid height of 1. The number of time-steps used was 60 to ensure that each height was a divisor of the number of time-steps. We can see that for a single GPU, the preferred pyramid height is 2. However, when distributed, the preferred height is 5. This is because with 8 peers we have more compute available but the cost of memory transfers is much greater, which shifts the sweet spot toward a configuration that does less transfer per computation.

Benchmarks like HotSpot and LU decomposition that write to rectangular areas of two-dimensional arrays need special attention when being distributed. While the rectangular regions appear contiguous in two-dimensional space, in a linear buffer a square region is, in general, not a single interval. This means that multiple OpenCL map and network operations need to be performed every time one of these areas is transferred.

We modified the DistCL scheduler to divide work along the y-axis to fragment the buffer regions transferred between peers. This results in performance that is 204× slower on average across all pyramid heights, for 8 peers. This demonstrates that the overhead of invoking I/O operations on a cluster is a significant performance consideration.

In summary, DistCL exposes important characteristics regarding distributed OpenCL execution. Distribution amplifies the performance characteristics of GPUs. Global memory reads become even more expensive compared to computation, and the aggregate compute power is increased. Further, the performance gain seen by coalesced accesses is not only realized in the GPU’s bus, but across the network as well. Synchronization, now a whole-cluster operation, becomes even higher latency. There are also aspects of distributed programming not seen with a single GPU. Sometimes, it is better to transfer more data with few transfers than it is to transfer little data with many transfers.


Figure 3.5: HotSpot with various pyramid heights.

3.5 Performance Comparison with SnuCL

The performance of SnuCL and DistCL was compared using the four benchmarks from the AMD APP SDK and one (hash) from GNU libgcrypt that were ported to SnuCL. Each benchmark was run three times, and the median time was taken. The time for each benchmark is normalized against a standard OpenCL implementation using the same kernel running on a single GPU, including all transfers between the host and device.

Figure 3.6 shows the speedups obtained using SnuCL or DistCL relative to that of normal OpenCL run on a single GPU. For the linear benchmarks, SnuCL outperforms DistCL by up to 3.5×. When using one or two peers, performance is within 10%, but DistCL does not scale as well. While SnuCL keeps benefiting from additional peers, DistCL sees peak performance at 16 peers in both cases. The story is different for the compute-intensive benchmarks, as both frameworks see improved performance from adding additional devices all the way up to 32 peers. Performance is also more similar, with near-identical performance from one to eight peers and a maximum difference of 25% with 32 peers. However, even with the compute-intensive benchmarks, it is clear that DistCL does not scale as well as SnuCL.

One might assume that the additional runtime overhead of meta-functions must be responsible. This would also explain why the difference in performance grows with the number of peers: more peers means more subranges and therefore more access-sets that must be calculated. To verify this hypothesis, the amount of time spent running meta-functions was measured for each benchmark distributed across eight peers. The results are presented in Table 3.5. Running meta-functions and managing dependencies accounts for less than 0.1% of runtime in the worst case, so this is clearly not to blame for the reduced performance.

Table 3.5: Execution Time Spent Managing Dependencies
Benchmark | Total Runtime (s) | Total Time (µs) | Per Kernel Invocation Time (µs) | Percent of Runtime
Mandelbrot | 0.11 | 109 | 109 | 0.097
Hash | 0.08 | 112 | 112 | 0.15
Binomial | 2.95 | 126 | 126 | 0.0043
Monte Carlo | 5.80 | 1500 | 150 | 0.028
n-body | 3.70 | 166 | 166 | 0.00045

Table 3.6: Benchmark Performance Characteristics
Benchmark | DistCL Speedup | Percent of SnuCL’s performance | Compute-to-transfer ratio | Compute-to-sync. ratio | Percent of runtime in sync.
Mandelbrot | 1.9 | 73.3 | 0.22 | 32.5 | 0.56
Hash | 3.6 | 67.0 | 0.83 | 55.3 | 0.86
Binomial | 7.9 | 97.6 | 140 | 1430 | 0.07
Monte Carlo | 6.8 | 97.6 | 21 | 664 | 0.14
n-body | 7.8 | 99.8 | 68 | 4660 | 0.02

To explain the performance difference, the characteristics of the benchmarks must be examined more closely. If we refer again to Figure 3.4, we can see a runtime breakdown for the benchmarks, again for eight peers. It is no surprise that the benchmarks that spend the most time actually running the kernel see the most speedup. Table 3.6 shows some performance characteristics of interest for the benchmarks. For the purpose of calculating the ratios, compute is the sum of execution and overlapped transfer/execution, and transfer is the sum of buffer, transfer, and overlapped transfer/execution. In this manner, transfer accounts for all the memory transfer necessary to run the kernel, not just transfers between peers. This is a better representation of how much work is truly performed versus how much time is spent shuffling memory around. We can clearly see that benchmarks with high compute-to-transfer ratios see the best speedups. Figure 3.7 plots the speedups achieved with both frameworks against the compute-to-transfer ratio of each benchmark. We see a similar trend in both cases, with low ratios leading to poor speedups and high ratios leading to near-ideal scaling. However, the compute-to-transfer ratio is not a good predictor of performance relative to SnuCL. This is not surprising, considering the same amount of memory transfers and computation take place regardless of which framework is used.

A much better predictor of performance relative to SnuCL is the compute-to-synchronization ratio. Synchronization is not just composed of the time it takes to update the dependency information, which was already shown to be insignificant, but also the time it takes for all the peers to notify the master that they have done so, and any time spent waiting for the last peer. Synchronization time is an average of 7× the meta-function execution time and as high as 16× in the case of binomial. The round-trip latency on the cluster for an 8-byte transfer was measured to be 86 µs, which is always less than the meta-function execution time, so it alone does not account for the difference. This means that most of this time is spent waiting for all the peers to reach the synchronization points.


Figure 3.6: DistCL and SnuCL speedups.

This is a traditional synchronization overhead [52]. This also explains why relative performance degrades as the number of peers is increased. However, the synchronization time itself does not account for the entire performance difference. For example, consider hash, where synchronization accounts for only 1% of the runtime, yet it only manages 67% of SnuCL’s performance.

The remaining factor to consider is the fact that SnuCL actually translates the OpenCL kernel into a CUDA kernel. Even when running on the same hardware, it has been shown that large performance differences can remain. Work by Fang et al. [53] has shown that synthetic performance is not affected by the choice of CUDA or OpenCL. However, when running real applications, CUDA consistently outperforms OpenCL. Several reasons are listed, including faster kernel launches with CUDA and a better compiler leading to fewer instructions in the intermediate representation. This agrees with our results; the fastest kernels, where launching the kernel is a more significant portion of runtime, have the largest performance deficit. Work by Karimi et al. [54] shows that transferring data between the host and GPU is also faster with CUDA, by about 40% on average. From our tests, we can see that benchmarks that spend more time transferring buffers are further from SnuCL’s performance. This difference between CUDA and OpenCL performance also explains why binomial saw slightly super-linear scaling from 1 to 16 peers under SnuCL.

Using two different distributed OpenCL frameworks has shown that the compute-to-transfer ratio of a benchmark is the best predictor of performance scaling.


Figure 3.7: DistCL and SnuCL compared relative to compute-to-transfer ratio.

SnuCL slightly outperforms DistCL, since it is statically scheduled and uses the CUDA runtime. In contrast, it is easier to port OpenCL applications to DistCL than to SnuCL, since there is no need to modify either the host or kernel code.

3.6 Conclusion

This chapter presented DistCL, a framework for distributing the execution of an OpenCL kernel across a cluster, causing that cluster to appear as if it were a single OpenCL device. DistCL shows that it is possible to efficiently run kernels across a cluster while preserving the OpenCL execution model. To do this, DistCL uses meta-functions that abstract away the details of the cluster and allow the programmer to focus on the algorithm being distributed. We believe the meta-function approach imposes less of a burden than any other OpenCL distribution system to date. Speedups of up to 29x on 32 peers are demonstrated.

With a cluster, transfers take longer than they do with a single GPU, so more compute-intensive approaches perform better. Also, certain access patterns generate fragmented memory accesses. The overhead of doing many fragmented I/O operations is profound and is strongly affected by the choice of partitioning.

We also compared DistCL to another open-source framework, SnuCL, using five benchmarks. From a usability standpoint, DistCL has the advantage of being able to distribute unmodified OpenCL applications. For compute-intensive benchmarks, performance between DistCL and SnuCL is comparable, but otherwise SnuCL has better performance. This difference cannot be fully attributed to the overhead of meta-functions, which account for a very small portion of the runtime. It is mostly due to DistCL requiring tighter synchronization between nodes and the fact that SnuCL uses CUDA under the hood. The increased synchronization of DistCL also means that it does not scale as well as SnuCL as the number of peers in the cluster increases.

Nevertheless, by introducing meta-functions, DistCL opens the door to distributing unmodified OpenCL kernels. DistCL allows a cluster with 2^14 processing elements to be accessed as if it were a single GPU. Using this novel framework, we gain insight into both the challenges and potential of unmodified kernel distribution. In the future, DistCL can be extended with new partitioning and scheduling algorithms to further exploit locality and more aggressively schedule subranges. DistCL is available at http://www.eecg.toronto.edu/~enright/downloads.html.

Chapter 4

Selecting Representative Benchmarks for Power Evaluation

Benchmarks play an important role in computer architecture. They are the tools used to measure the performance of various designs. However, the relative performance of architectures can vary depending on the type of benchmark being used, as well as the input sets. This is why benchmark suites such as SPEC [55] contain multiple benchmarks and input data sets. Simulating an entire benchmark suite can be impractically time-consuming. In an effort to reduce the number of benchmarks necessary to cover a similar breadth of workloads, statistical methods have been used to compare the similarity of benchmarks. The work by Phansalkar et al. [56] showed that it was possible to obtain similar information by running just fourteen of the 29 SPEC benchmarks, when benchmarks were clustered by instruction mix and locality. While application benchmarks, such as SPEC, give an idea of the overall performance of an architecture, micro-benchmarks can be used to obtain information about individual components in a design [57].

Existing benchmarking focuses on performance, but there is also a need to consider power. The energy consumption of various operations and data must be understood in order to create a set of benchmarks that cover a wide range of possible power scenarios. This chapter presents the methodology used to create a representative set of micro-benchmarks that will be used to create a power model for the Fusion APU. Section 4.1 first describes the setup used to measure the power consumption of the Fusion APU. Section 4.2 describes the methodology used to determine what benchmark characteristics are important from a power perspective.

4.1 Power Measurements

In order to build a power model, we need to measure the power consumption of the components we want to model. There have been various methods proposed for measuring power consumption in computer hardware. The methods range in complexity and accuracy, from measuring full-system power at the wall [58] to measuring per-component power in a temperature-controlled environment to account for thermal leakage [59]. In this work, power was measured for the APU at the package level at normal operating temperatures.


Table 4.1: Data Acquisition Unit Specifications
 | DATAQ DI-145 | DATAQ DI-149
Channels | 4 | 8
Measurement Range | ±10 V | ±10 V
Maximum Sample Rate | 240 Hz | 10 000 Hz
Interface | USB | USB

Table 4.2: ACS711 Current Sensor Specifications
Input range | ±12.5 A
Sensitivity | 0.03·Vcc V/A (0.167 V/A at Vcc = 5 V)
Minimum logic voltage | 3 V
Maximum logic voltage | 5.5 V
Supply current | 4 mA
Internal resistance | 1.2 mΩ
Bandwidth | 100 kHz
Error | ±5 %

Benchmarks were kept short and the APU was allowed to return to idle between benchmarks to prevent heat buildup in the chip. The measured value includes the power consumption of the CPU, GPU, and memory controller simultaneously.

To measure the power consumption of the APU, both the current and voltage delivered to the package were measured. Measurements were made using the DataQ DI-145 [60] and DI-149 [61] data acquisition units (DAQs). The DI-145 was used for all the benchmark clustering experiments and the GPU power measurements. The CPU measurements were made using the DI-149 because there was more variation in the CPU’s power consumption and the CPU models were less accurate. Both units can measure differential voltage on up to four and eight separate channels, respectively. More detailed specifications are available in Table 4.1.

The DI-149 incurs an additional power overhead: because of the DAQs’ limited on-board storage, data must be read periodically, and the DI-149’s higher sampling rate means this happens more often. Figures 4.1 and 4.2 show measured idle power consumption using the DI-145 and DI-149, respectively. The spikes in power consumption occur when the CPU has to wake up and read data from the DAQ. For this benchmark, it only happened once for the DI-145 but multiple times for the DI-149. Due to the higher sampling frequency, the DI-149 also allows us to see the overshoot and undershoot caused by sudden increases and decreases in power consumption, respectively [62]. The power overhead of using the higher sampling rate on the DI-149 was found to be 3.2%.

Four channels were used to measure the APU’s power consumption: two were used to measure current, one to measure voltage on the 12 V line, and another to measure voltage on the 5 V line. Current consumption was measured by inserting current sensors in the 12 V line of the 2 × 2 connector between the PSU and motherboard. In accordance with the ATX power specification [64], only this connector delivers power to the APU’s voltage regulators. Figure 4.3 depicts a schematic of the MSI A75MA-G55 motherboard [63] used in our Fusion system. The 2 × 2 power connector is circled in red, while the APU socket is circled in blue. Since this connector has two wires for both the 12 V and ground (GND) lines, two current sensors were required.

DI-145 Idle power 15 14 13 12 11

Power [W] 10 9 8 Time

Figure 4.1: Idle power measurements done using the DI-145.

Idle power DI-149 35 30 25 20 15 10 Power [W] 5 0 -5 Time

Figure 4.2: Idle power measurements done using the DI-149. it required 2 sensors. We used Allegro’s ACS711 current sensors [65] mounted on carrier boards from Pololu Robotics & Electronics. The specifications of the sensor are available in Table 4.2. To measure current on the 12 V line, a voltage divider was required, as the DAQ units can only make readings up to 10 V. A 100 kΩ potentiometer was used to divide the voltage in two. When using the DI-145 the maximum sampling rate of 240 Hz, 60 Hz per channel, was used. For the DI-149, a sampling rate was set to 938Hz per-channel, as this allows for precise power tracking but keeps the monitoring overhead low.

Power consumption is measured upstream of the voltage regulator modules (VRMs). The efficiency of the VRMs used on the motherboard varies between 80 and 92% [66]. In our system, the 80% efficiency is only reached at idle. Under GPU-only loads efficiency ranges between 84 and 88%, while under CPU-only loads it ranges between 90 and 92%. Because the variation in efficiency for each device type is small, the modelling assumes a constant VRM efficiency.

Figure 4.4 shows a schematic of the measuring setup. The same setup was used for both the DI-145 and DI-149, since they are physically almost identical. The sensors and the potentiometer were soldered to a prototype board along with a Molex 8981 connector. The Molex connector was used to supply 12V, 5V, and GND to the prototype board reliably, while allowing the measuring setup to be removable. The 12V lines to the APU were cut and each had both ends soldered to one of the current sensors. Terminal blocks were added to the prototype board for each of the signals that we wanted to measure and for a ground for each channel. A picture of the measuring setup installed in the system can be seen in Figure 4.5. The 2 × 2 power connector and APU socket are again circled in red and blue respectively, while the prototype board and DI-145 are circled in yellow.

Figure 4.3: MSI A75MA-G55 motherboard schematic [63], with the 2 × 2 power connector and APU socket labelled.

The DataQ is attached to the host computer using a USB port. Under Linux it is detected as a terminal device and can be written to or read from as such. A driver program was written that allows easy access to both the DI-145 and DI-149. Its important public functions are a constructor and start and stop methods. The constructor detects whether the device is a DI-145 or DI-149, initializes it, and then configures the channels as requested. The start method begins data recording and also creates a new thread that reads data from the DAQ periodically to prevent its buffer from overflowing. The stop method stops recording and returns all the captured data in an array, along with the sample count. To reduce the power contribution of other applications running on the system, a clean installation of Ubuntu 12.04 LTS was used. The system was run without a display, so the GPU would have no additional tasks, and an SSH connection without X forwarding was used to access the system. Cool'n'Quiet [67], AMD's P-state [68] dynamic voltage and frequency scaling (DVFS) implementation, was disabled, so the processor always operated at its maximum frequency. This prevents the frequency from varying over the course of a benchmark. However, it was not possible to disable C-states [68], so an idle CPU or GPU could still be clock gated.
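The driver program described above is specific to our setup, but its structure is generic. The sketch below shows one plausible shape for such a driver in C, assuming the DAQ appears as a POSIX terminal device; the device path, buffer sizes, and the omitted DI-145/DI-149 command strings are all simplifications, and the real driver must also send the DAQ's own start/stop and channel-configuration commands.

```c
#include <fcntl.h>
#include <pthread.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <termios.h>
#include <unistd.h>

/* Sketch of a DAQ driver: open the serial device, then drain it from a
 * background thread so the DAQ's small on-board buffer never overflows. */
struct daq {
    int fd;
    pthread_t reader;
    volatile int running;
    unsigned char *samples;   /* captured raw bytes */
    size_t count, capacity;
};

static void *reader_thread(void *arg)
{
    struct daq *d = arg;
    unsigned char buf[512];
    while (d->running) {
        ssize_t n = read(d->fd, buf, sizeof(buf));  /* returns within ~0.5 s */
        if (n <= 0) continue;
        if (d->count + (size_t)n > d->capacity) {
            d->capacity = (d->capacity + (size_t)n) * 2;
            d->samples = realloc(d->samples, d->capacity);
        }
        memcpy(d->samples + d->count, buf, (size_t)n);
        d->count += (size_t)n;
    }
    return NULL;
}

int daq_open(struct daq *d, const char *path)        /* e.g. "/dev/ttyACM0" */
{
    memset(d, 0, sizeof(*d));
    d->fd = open(path, O_RDWR | O_NOCTTY);
    if (d->fd < 0) return -1;
    struct termios t;
    tcgetattr(d->fd, &t);
    cfmakeraw(&t);               /* raw byte stream, no line discipline */
    t.c_cc[VMIN] = 0;
    t.c_cc[VTIME] = 5;           /* read() times out after 0.5 s */
    tcsetattr(d->fd, TCSANOW, &t);
    /* ...device- and channel-configuration commands would be sent here... */
    return 0;
}

void daq_start(struct daq *d)
{
    /* ...the DAQ's own "start recording" command would be sent here... */
    d->running = 1;
    pthread_create(&d->reader, NULL, reader_thread, d);
}

size_t daq_stop(struct daq *d, unsigned char **out)
{
    /* ...the DAQ's own "stop recording" command would be sent here... */
    d->running = 0;
    pthread_join(d->reader, NULL);
    *out = d->samples;
    return d->count;
}

int main(void)
{
    struct daq d;
    unsigned char *data;
    if (daq_open(&d, "/dev/ttyACM0") != 0) return 1;
    daq_start(&d);
    sleep(2);                    /* record for two seconds */
    size_t n = daq_stop(&d, &data);
    (void)n;
    free(data);
    return 0;
}
```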

To further simplify the collection of data, we wrote two benchmarking programs: one to run OpenCL kernels in isolation and another to run entire benchmarks directly. Both programs reported the runtime, the performance counter values, and the power and energy consumed while running the benchmarks. More details on the performance counters used are available in Section 5.3.1. The program used to run benchmarks directly started a timer and the power measurements before immediately forking a process to run the benchmark.

Figure 4.4: Schematic of the measuring setup (the DAQ channels are connected to the two current sensors in the 12 V line, the divided 12 V line, and the 5 V line).

To run OpenCL kernels, an XML schema was developed to represent all the information needed to run benchmarks and perform power measurements. A single XML file can contain multiple kernel runs. Information that is common across all kernels only has to be specified once: the device type being used and the configuration of the power sensors. There is also kernel-specific information: the kernel names, the kernel source file, and the kernel arguments. For any buffer arguments, the size is also required, as well as the kernel used to initialize the values in the buffer. This makes it possible to run a kernel with different inputs. The program also starts and stops the power measurements, the performance counters, and a timer. The output is an XML file with the same information as the input, but with power and timing information added to each kernel. Using an XML file to specify the benchmark makes it possible to run various kernels without needing to create a corresponding host program for each kernel. This reduces the likelihood of error, which would be high with hundreds of similar host programs. Finally, a parser was created to parse the output XML file, convert the current readings into energy and power values, and prepare the data for graphing.

Figure 4.5: A picture of the measuring setup in action, with the power connector, APU socket, prototype board, and DI-145 labelled.

4.2 Micro-benchmark Selection

A micro-benchmark is designed to exercise one specific component of a processor. By using a representative set of micro-benchmarks, one that exercises every component of the processor, it becomes possible to characterize the entire processor. This can be done not just for performance but also for power.

There are many factors that affect the power consumption of each component. Certain factors are static and therefore will be captured by any benchmark that targets the component, while others are dynamic and depend on the benchmark itself. The most obvious static factor is the die area of each component. The larger the component, the more power it can possibly draw. Dynamic factors include the activity ratio of the component and the type of data being operated on. To create a truly representative set of micro-benchmarks, all the factors that affect power must be considered.

Micro-benchmarks are evaluated based on the total energy required to run them. Energy is the ultimate concern, whether it be because of limited battery life in mobile devices or high power bills in data-centres. Accounting for energy also allows the analysis of both power consumption and execution time. This means that a benchmark that consumes 10 W and takes 2 s to perform a certain task is correctly identified as more efficient than a benchmark that only consumes 8 W, but requires 3 s to complete the same task.

Memory and compute micro-benchmarks were considered separately. Where possible, OpenCL benchmarks were used, so they could be run on both the CPU and GPU. Section 4.2.1 focuses on the selection process for memory benchmarks, while Section 4.2.2 describes the selection process for compute benchmarks.

4.2.1 Memory Benchmarks

Memory activity is a very important factor with respect to total energy consumption. On both CPUs and GPUs, accessing main memory takes more than 100 cycles and requires off-chip communication, which makes the operation both long-running and high-power. In general, the memory hierarchies of the CPU and GPU are quite different. The GPU's memory hierarchy is directly exposed to the programmer and maps to the OpenCL memory model. The CPU's memory hierarchy is more hidden and includes multiple levels of caches, so more micro-benchmarks are required to fully understand it.

GPU Memory Benchmarks

The GPU has two types of memory: main memory, which is non-coherent, and local data shares which are coherent but private to individual compute units. These correspond to OpenCL’s global and local memory respectively.

Main memory supports contiguous 256-bit accesses; to get the maximum memory bandwidth, kernels should make contiguous memory accesses. An important factor that affects energy consumption is the latency of memory accesses, so we must have benchmarks with different memory latencies. Three types of main memory accesses were considered: contiguous, sparse, and conflict. Each benchmark consisted of an OpenCL kernel. For the contiguous case, each work-group makes contiguous reads or writes to separate regions of memory allowing for maximum throughput. For the sparse case, work-items in a work-group access memory with a stride of 256 bits. This prevents coalescing of accesses from a single work-group. For the conflict case, every work-item in a work-group attempts to read or write the same memory location. This forces the writes to be serialized and reduces the amount of data transferred for reads. On the GPU no data is cached.
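As an illustration, the sketches below show what the three access patterns might look like as OpenCL kernels. They are not the thesis's actual benchmarks; the element type, indexing details, and the parameter `n` are assumptions made for the example.

```c
/* Illustrative OpenCL sketches of the three global-memory access patterns. */

/* Contiguous: consecutive work-items touch consecutive elements, so each
 * work-group reads its own contiguous region at maximum throughput. */
__kernel void read_contiguous(__global const float *in, __global float *out)
{
    size_t i = get_global_id(0);
    out[i] = in[i];
}

/* Sparse: consecutive work-items are 256 bits (8 floats) apart, which
 * prevents the accesses of a work-group from being coalesced. */
__kernel void read_sparse(__global const float *in, __global float *out,
                          const uint n)
{
    size_t i = (get_global_id(0) * 8u) % n;
    out[get_global_id(0)] = in[i];
}

/* Conflict: every work-item in a work-group writes the same location, so the
 * writes are serialized. */
__kernel void write_conflict(__global float *out)
{
    out[get_group_id(0)] = (float)get_local_id(0);
}
```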

These benchmarks cover the best case and two types of worst case for memory accesses, providing a range of energy consumption for global memory operations. They also allow the instruction trace characteristics of each type of memory access to be isolated.

Two types of local memory access were considered: contiguous and conflict. There was no need for a sparse access benchmark as there is no penalty for non-contiguous local memory accesses, only for conflicts. Similar to the global memory benchmarks, the contiguous benchmark accesses adjacent memory locations and the conflict benchmark has the work-items access the same location.

CPU Memory Benchmarks

Due to the CPU’s use of caches, not only do the types of memory accesses matter, but the temporal and spatial locality of subsequent accesses also matters. While the CPU requires some additional benchmarks to expose this factor, all the benchmarks used for the GPU can be used as well. In the CPU, different types of OpenCL memory do not map to different physical memories. There is a single memory space Chapter 4. Selecting Representative Benchmarks for Power Evaluation 45

1 3 2 14 L1 3 8 4 4

10 19 L2 11 11

18 1 19 7 Main n-1 13 memory n 16

Figure 4.6: An example of a stack used to store the order of recent memory accesses. that is shared by both local and global memory and the CPU cache operates normally. However, local memory is small enough that it can fit into a processor’s L1 d-cache and hopefully stay there. Since local memory is not shared between cores, it will only be moved if its entry is being replaced. The memory benchmarks that we created for the GPU focus most of the activity on the L1 cache and main memory. Therefore, additional benchmarks were required to evaluate L2 cache activity.

A new benchmark was created that allows accesses to each level of the cache hierarchy to be controlled. This benchmark makes accesses to a large array. To ensure the cache behaviour can be predicted, only locations in the array that map to the same cache set are accessed. Each core in the AMD Fusion has a private L1 and L2 cache; the cache specifications are given in Table 4.3. Memory accesses with a 64 kB stride were used, since these accesses map to the same set in both the L1 and L2 cache.

A stack is used to keep track of the data accesses. An example stack is shown in Figure 4.6. Each cell in the stack contains the index of a memory access that was made. In this example, the most recent access was made at index three, and the oldest access was made at index sixteen.

We know all these accesses map to the same set, so only the values of the two most recent accesses will still be in the L1 d-cache. Since the L2 cache is exclusive, it cannot contain the same values as the L1 cache; it will instead contain the values of the third through eighteenth most recent accesses. Any locations that are more than eighteen accesses old will only be in main memory. With this knowledge, we can generate an access to any level of the memory hierarchy.

Table 4.3: AMD Fusion Cache Specification
Attribute    L1 D-cache    Unified L2 cache
Line size    64 B          64 B
Size         64 kB         1 MB
Ways         2             16
Sets         512           1024
Way size     32 kB         64 kB

Using this benchmark, an arbitrary number of accesses can be made to the L2 cache and main memory. For each access, one of the admissible indices is randomly selected, and whether the access is a read or a write is also randomly selected, with a 50% chance of each. The benchmark makes 1 000 000 accesses to the array and allows the percentage of accesses targeting L1, L2, and main memory to be specified. The benchmark was run for all combinations of access rates in 10% increments. For example, one run of the benchmark makes 10% of the accesses (100 000) to L1, 40% of them (400 000) to L2, and 50% (500 000) to main memory. This resulted in 66 benchmarks being run.
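A minimal sketch of how such a benchmark could be structured is shown below. It is not the thesis's code: the array size, the use of rand(), and the assumption that hardware prefetching does not disturb the same-set access pattern are simplifications made for illustration.

```c
#include <stdio.h>
#include <stdlib.h>

/* Sketch of a cache-targeting benchmark. Every access uses a 64 kB stride so
 * all touched lines map to the same L1/L2 set; a recency stack over the
 * stride indices then tells us where a line currently lives: depth 0-1 in the
 * 2-way L1, depth 2-17 in the exclusive 16-way L2, anything older only in
 * main memory. */
#define STRIDE_INTS  (64 * 1024 / sizeof(int))   /* 64 kB stride            */
#define NUM_BLOCKS   64                          /* distinct same-set lines */
#define L1_WAYS      2
#define L2_WAYS      16

static int stack[NUM_BLOCKS];                    /* stack[0] = most recent  */

static void touch(volatile int *array, int idx, int is_write)
{
    if (is_write) array[idx * STRIDE_INTS] = idx;
    else          (void)array[idx * STRIDE_INTS];
    /* Move idx to the top of the recency stack. */
    int pos = 0;
    while (stack[pos] != idx) pos++;
    for (; pos > 0; pos--) stack[pos] = stack[pos - 1];
    stack[0] = idx;
}

int main(void)
{
    volatile int *array = malloc(NUM_BLOCKS * STRIDE_INTS * sizeof(int));
    if (!array) return 1;
    for (int i = 0; i < NUM_BLOCKS; i++) { stack[i] = i; array[i * STRIDE_INTS] = i; }

    /* Example mix: 10% L1, 40% L2, 50% main-memory accesses; reads and
     * writes are each chosen with 50% probability. */
    for (long n = 0; n < 1000000; n++) {
        int r = rand() % 100, depth;
        if      (r < 10) depth = rand() % L1_WAYS;
        else if (r < 50) depth = L1_WAYS + rand() % L2_WAYS;
        else             depth = L1_WAYS + L2_WAYS +
                                 rand() % (NUM_BLOCKS - L1_WAYS - L2_WAYS);
        touch(array, stack[depth], rand() % 2);
    }
    free((void *)array);
    return 0;
}
```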

To verify that the benchmarks were causing the desired behaviour, they were tested with performance counters enabled. More details on performance counters are available in Section 5.3.1. The measured L2 and main memory accesses were always within 3% of the specified values. L1 accesses did not track as closely, because the rest of the program made requests to the L1 d-cache as well. This is acceptable because the previous local-memory benchmarks already isolate L1 d-cache activity.

4.2.2 Compute Benchmarks

For the compute benchmarks, there are more factors to consider than for the memory benchmarks. If we were to generate every combination of factors, the number of benchmarks would quickly become overwhelming, especially if we want to simulate them. It therefore becomes necessary to determine which factors are the most important, meaning those that have the greatest effect on energy consumption. The first part of this section details which factors were considered for the compute benchmarks; the second explains the process used to determine the importance of each factor in order to create a smaller set of benchmarks.

Generating Benchmarks

For the compute benchmarks, the following factors were considered: the operation, the data-type, the size of the operand vector, the data values of the operands, the ILP, and the number of cores used. The possible values for each of these parameters are given in Table 4.4. The operations used include arithmetic, relational, and bitwise operators. The data-types used were either int or float, in vectors of size one or four. On the CPU, instructions will be executed on the integer ALUs and the FPU/SSE units. On the GPU, the SPs will see a variable number of SPUs used, depending on the ILP and vector size. The input values for the data are chosen so that the bits are all 0, all 1, alternating between 0 and 1, or have a variety of bit values. For floating-point values, it was not possible to create an input where all bits are 1, as that would represent a NaN; instead, the value -1.9999999 was used, which is 0xBFFFFFFF, so only a single bit is 0. Since most operators use adjacent values, these inputs are intended to test situations where most bits are 0, most bits are 1, bits frequently change between 0 and 1, and bits are more or less random, respectively. If the ILP is one, each operation uses the result of the previous operation as an operand. On the CPU, only one pipeline will be active at a time and there is no opportunity for out-of-order (OoO) execution. On the GPU, multiple operations cannot be combined into a single VLIW instruction. If the ILP is five, then each operation uses the result of the fifth most recent operation as an input. This allows the CPU to use its three pipelines concurrently, and since all the operands are in local memory (L1 cache) there will be no need for OoO execution. On the GPU, VLIW bundles of up to five operations can be created. The value of five was chosen because Evergreen is a VLIW-5 architecture, while the CPU is only 3-wide. This allows us to measure power consumption at minimal and peak architectural efficiency for both the CPU and GPU. The number of cores was varied between one and four because both the CPU and GPU in the Fusion have four compute units.

Table 4.4: Possible Factor Values for Benchmarks
Parameter          Possible Values
Operation          +, −, ∗, /, &, |, ∧, <<, %, ∼, ++, <, !=
Data-types         int, float
Vector size        1, 4
Input data values  all 0, all 1, alternating between 0 and 1, ascending values from 1
ILP                1, 5
Cores              1, 2, 3, 4

All the benchmarks were written in OpenCL. Each benchmark begins by reading a value from global memory and storing it as a local variable. Next, the main loop executes the desired operation on the local variables before writing the result back to main memory. To ensure all benchmarks were directly comparable, they all completed the same number of operations, in this case 122 880 000 000. The actual number of loop iterations per work-item depends on the benchmark’s configuration and can be calculated using equation (4.1). The work-group size is 256 for every benchmark, so at most we will complete 480 000 000 iterations per work-item, equation (4.2a). The global size of the kernel is used to specify the number of cores used; when the global size is equal to the local size only one core is used, because only a single work group is created. The maximum global size used was four times the local size. This means each core is assigned a single work-group, and to complete the same amount of work with quadruple the cores, each core only needs to complete a quarter as many iterations. When the ILP value is five, we execute five operations per loop iteration, so only a fifth of the iterations are required. The last parameter that can affect the number of iterations is the width of the vector data-type. While it is true that with the use of SSE we can add four floats in a single operation, it still results in the addition of 128 bits, so for the purpose of modelling it is considered as four operations. On the GPU using a size four-wide data-type still results in four ALU operations. This means that benchmarks that have ILP of five, use a four-wide data-type, and run on four cores, only require 6 000 000 iterations, equation (4.2b).

number of iterations = number of operations / (local size × core count × ILP × vector size)    (4.1)

Maximum number of iterations = 122 880 000 000 / (256 × 1 × 1 × 1) = 480 000 000    (4.2a)

Minimum number of iterations = 122 880 000 000 / (256 × 4 × 5 × 4) = 6 000 000    (4.2b)
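For illustration, the kernel sketch below shows the general shape of such a benchmark for an integer add with an ILP of five. It is not the thesis's actual kernel; the accumulator seeding and the final write-back are assumptions.

```c
/* Illustrative shape of a compute micro-benchmark with an ILP of five. */
__kernel void int_add_ilp5(__global int *data, const uint iterations)
{
    size_t gid = get_global_id(0);
    int c = data[gid];
    /* Five independent accumulators: each add depends only on a value
     * produced five operations earlier, so the CPU's pipelines (or a VLIW-5
     * bundle on the GPU) can work on all five concurrently. */
    int a0 = c, a1 = c + 1, a2 = c + 2, a3 = c + 3, a4 = c + 4;
    for (uint i = 0; i < iterations; i++) {
        a0 += c;
        a1 += c;
        a2 += c;
        a3 += c;
        a4 += c;
    }
    data[gid] = a0 + a1 + a2 + a3 + a4;   /* write the result back to global memory */
}
```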

The Cartesian product of all the factor values would result in 1664 benchmarks. However, some operations, such as bitwise AND and modulo, have no floating-point equivalent, and operations such as increment do not work with vector data-types. Therefore, the final number of benchmarks created is 832.

Selecting Benchmarks

The number of compute benchmarks is too large to simulate them all, since each CPU simulation takes between 24 and 48 hours to complete. To make simulating a representative set of compute micro-benchmarks tractable, the number of benchmarks needs to be reduced while still covering all the important factors. This section details the statistical analysis used to achieve this goal.

There has been some prior work on clustering instructions based on energy consumption. The work by Bona et al. [69] aims to create an energy-per-instruction (EPI) power model. They compute k-means with five to sixteen clusters for a 60-instruction ISA. They show that the energy consumption of only 11 representative instructions can represent the ISA while keeping the standard deviation per cluster below 13%.

The work in this section considers more than just the instruction type when clustering instructions. It also considers a scalar (CPU) as well as a VLIW (GPU) architecture. When selecting clusters, normality is used as a criterion instead of standard deviation. This ensures that clusters contain only similar instructions, and not just a lot of very similar instructions plus a few dissimilar ones.

The statistical analysis in this section was done using the GNU R statistical computing environment [70]. While R itself provides a lot of the functionality necessary to do statistical programming, it also has a package system. This allows more complex and less common algorithms to be made available for download by users.

Analyzing Benchmarks

All 832 benchmarks described in the previous section were run on the AMD A6-3650, described in Section 2.2, on both the CPU and GPU. During benchmark execution, runtime and power consumption were recorded, which allowed the total energy consumption of each benchmark to be calculated. The average power consumption of the compute benchmarks ranged from 25.5 to 44.1 W on the CPU and from 13.7 to 17.6 W on the GPU. The total energy consumption of the benchmarks is plotted in Figures 4.7 and 4.8 for the CPU and GPU respectively. By inspection, we can already see that the CPU benchmarks form multiple clusters, especially among the higher-energy benchmarks. For the GPU benchmarks there is less variety, as most of the benchmarks consume less than 500 J, but there are still a few clusters of higher-energy benchmarks. While this data shows that clustering is indeed taking place, a quantitative approach to clustering and to analyzing the commonalities within a cluster is required.

Figure 4.7: Energy consumption of ALU benchmarks on the CPU (histogram of benchmark frequency versus energy in joules).

To this end, k-means [71] was the first clustering approach considered. k-means is a clustering algorithm that groups n observations into k clusters. The goal is to create clusters such that each observation is as close to its cluster's mean as possible. This problem is NP-hard, so k-means implementations employ heuristic algorithms. A common approach is to start with k randomly selected cluster means. The distance between each cluster's mean and each observation is calculated, and observations are assigned to the cluster with the nearest mean. Observations and means are d-dimensional vectors, and the distance between them can be calculated as the Euclidean distance between the two. Once all the observations have been assigned to a cluster, a new mean is calculated for each cluster using the values of the cluster's members. The process then repeats, using the new cluster means, until the clusters are stable.
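Since the clustering here is over a single value (the total energy per benchmark), the procedure reduces to one-dimensional k-means. The sketch below illustrates the assignment and update steps on made-up energy values; the real analysis used R's k-means implementation rather than code like this.

```c
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

/* One-dimensional k-means (Lloyd-style iteration) over per-benchmark energy
 * values. The energy values and initial centres are invented for the example. */
#define N 8
#define K 2

int main(void)
{
    double energy[N] = { 310, 295, 330, 2900, 3100, 320, 3050, 305 };  /* J */
    double mean[K]   = { energy[0], energy[3] };   /* naive initial centres */
    int assign[N] = { 0 };

    for (int iter = 0, changed = 1; changed && iter < 100; iter++) {
        changed = 0;
        /* Assignment step: each observation goes to the nearest cluster mean. */
        for (int i = 0; i < N; i++) {
            int best = 0;
            for (int k = 1; k < K; k++)
                if (fabs(energy[i] - mean[k]) < fabs(energy[i] - mean[best]))
                    best = k;
            if (best != assign[i]) { assign[i] = best; changed = 1; }
        }
        /* Update step: recompute each cluster mean from its members. */
        for (int k = 0; k < K; k++) {
            double sum = 0.0; int cnt = 0;
            for (int i = 0; i < N; i++)
                if (assign[i] == k) { sum += energy[i]; cnt++; }
            if (cnt) mean[k] = sum / cnt;
        }
    }
    for (int k = 0; k < K; k++) printf("cluster %d mean: %.1f J\n", k, mean[k]);
    return 0;
}
```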

The limitation of k-means is that the number of clusters must be specified at the outset. If k is smaller than the natural number of clusters, the resulting clusters will group benchmarks that are dissimilar enough that their characteristics cannot all be captured by a single representative benchmark. This results in an inaccurate model, as it will not capture all the significant benchmark characteristics. If k is larger than the natural number of clusters, benchmarks that are similar enough to be covered by a single representative benchmark will be separated. This results in more benchmarks than necessary being run, which increases the cost of simulating the benchmark set.

To avoid this limitation of k-means, Gaussian-means (G-means) [72], a variant of the algorithm, was used. The G-means algorithm is capable of determining the appropriate number of clusters for a data-set. It starts by selecting a k and running the normal k-means algorithm. Then, each cluster is tested to determine whether it follows a Gaussian, or normal, distribution. If any of the clusters do not appear to be Gaussian, k is incremented by one and the process is started again. G-means thus clusters the data into the minimum number of representative clusters.

Figure 4.8: Energy consumption of ALU benchmarks on the GPU (histogram of benchmark frequency versus energy in joules).

There is no R package that provides an implementation of G-means, so one had to be written. This meant that a test for determining whether a data-set follows a Gaussian distribution had to be selected. There are multiple statistical tests available to determine if a data-set fits a Gaussian distribution [73]. The most powerful of these tests are Shapiro and Wilk's W statistic [74], the Cramér-von Mises W² statistic [75][76], and Anderson and Darling's A² statistic [77]. All three of these tests can be used in the case where both the sample mean (µ) and variance (σ²) must be estimated, which is the case with our data set. Since all three statistics meet the necessary requirements for our G-means implementation, a further criterion was introduced to select among them: computational complexity. This easily eliminated the W² statistic, as the A² statistic is a simplified form of the more general and computationally expensive W² statistic. The A² statistic was also shown to be less computationally complex than the W statistic [78], so it was chosen.
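The A² statistic itself is straightforward to compute from the sorted, standardized sample. The sketch below is an illustrative C implementation (the thesis's version was written in R); it omits the small-sample correction factor and the comparison against critical values that a full normality test also needs, and the sample values are invented.

```c
#include <math.h>
#include <stdio.h>
#include <stdlib.h>

/* Standard normal CDF via erfc(). */
static double norm_cdf(double z) { return 0.5 * erfc(-z / sqrt(2.0)); }

static int cmp_double(const void *a, const void *b)
{
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

/* Anderson-Darling A^2 statistic for normality, with the mean and variance
 * estimated from the sample itself. */
static double anderson_darling(double *x, int n)
{
    double mean = 0.0, var = 0.0;
    for (int i = 0; i < n; i++) mean += x[i];
    mean /= n;
    for (int i = 0; i < n; i++) var += (x[i] - mean) * (x[i] - mean);
    var /= (n - 1);

    qsort(x, n, sizeof(double), cmp_double);

    double s = 0.0;
    for (int i = 0; i < n; i++) {
        double fi = norm_cdf((x[i] - mean) / sqrt(var));
        double fj = norm_cdf((x[n - 1 - i] - mean) / sqrt(var));
        s += (2.0 * i + 1.0) * (log(fi) + log(1.0 - fj));
    }
    return -n - s / n;
}

int main(void)
{
    double sample[] = { 9.8, 10.1, 10.4, 9.7, 10.0, 10.2, 9.9, 10.3 };
    printf("A^2 = %f\n", anderson_darling(sample, 8));
    return 0;
}
```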

G-means was implemented with the following parameters. The Hartigan–Wong k-means algorithm [79] was used, as it is fast and efficient and an R implementation is available. Fifty random starts were used for each k, since we have no prior knowledge of the probable centres; using multiple starts increases confidence both in the clustering generated and in the correctness of the decision to reject a clustering. The normality of each cluster was assessed as follows. The null hypothesis is that the members of cluster c have a normal distribution. The significance level of each test was chosen to be α = 0.0001 [72], which corresponds to α = 0.01 with a Bonferroni correction [80] applied for each k-means instance. The Bonferroni correction is made to reduce the chance of a Type I error (rejecting a true null hypothesis) over multiple tests. Since for each value of k, k null hypotheses will be tested before making a decision, we have k chances to make a Type I error. Rejection of the null hypothesis for any one of the k clusters will result in the value of k being deemed too small and all clusters being rejected. There is not the same worry for Type II errors (accepting a false null hypothesis), because if we make an error for one cluster we still have k − 1 chances to reject the hypothesis. This means that up to k Type II errors would have to be made for a set of clusters to be erroneously accepted, but only one Type I error for a set of clusters to be falsely rejected. For this reason, the Bonferroni correction increases the level of confidence required before a cluster can be rejected, because falsely rejecting a clustering is a more serious mistake than falsely accepting one.

The G-means algorithm was applied to the results generated by running the 832 benchmarks on real hardware. Since k-means has an element of randomness the number of clusters generated varied from about 40 to 60 for both the CPU and GPU, with an average standard deviation of 6. The randomness in k-means stems from the fact that the initial cluster centres are randomly generated, and their value affects the final clustering. For each benchmark, a benchmark object was created. Each benchmark object was described by the values of its factors as shown in Table 4.4, as well as average power consumption, runtime, and total energy consumption. To allow for similar ALU operations to be clustered, five operations groups were created. The groupings are described in Table 4.5. This allows for operations such as addition and subtraction to be grouped together. This was done because from a hardware perspective, there is very little difference between an addition and a subtraction, or between any bitwise operations. Each benchmark object also contains information describing which operations groups it belongs to. The benchmarks were clustered based on the total energy required to run them. Energy was used because the goal is to develop an energy per-instruction (EPI) model and it combines both power consumption and runtime. Clustering based on total energy consumption is possible, because the number of operations completed is kept constant for each benchmark. The results are presented for the CPU benchmarks and the number of clusters is 49.

Table 4.5: Operation Groupings
Group        Operations Included
Logic        and, or, xor
Unary        not, increment
Comparison   less than, not equal
Arithmetic   add, subtract
Complex      division, modulus

Figure 4.9 shows the various cluster sizes in histogram form. We can see that the clusters range in size from 2 to 32. Figure 4.10 shows how often a given property was the most common property in a cluster. A property is either a factor from Table 4.4 or an operations group. The data-type property includes both data-type and vector size, which gives us four data-types: int, float, int4, and float4. The sum of the frequencies in this figure is more than 49 because in many clusters more than one property tied as the most common. This shows that multiple properties are important to the energy consumption of a benchmark. From this figure, it appears that the most important properties are related to the operation and data-type, while the input data is of least importance. The fact that the operations group and the operation itself are of almost equal importance confirms that grouping operations can safely be done, as there is little loss of useful information. Figure 4.11 shows what percentage of the benchmarks in a cluster share the dominant property. We can see that in many cases 100% of the benchmarks share the most common property, and most of the time more than 50% do, meaning that they are clustering because of that property. Appendix A has a more detailed look at the make-up of each cluster. While this cursory analysis provides some helpful information, it gives little guidance as to which benchmarks to run to represent each cluster.

Figure 4.9: Frequency of cluster sizes from the CPU results.

Figure 4.10: Frequency of each property being the most common in a cluster.

Figure 4.11: Percentage of benchmarks in a cluster that share the most common property.

To determine which benchmarks to run, we turn to hierarchical clustering. In this approach, the most common property in a cluster is determined, then the elements of the cluster that do not share that property are removed and the most common property in the remaining subset is determined. Repeating this gives the most common property configuration in a cluster.

To determine the overall order of importance of the properties, we use a scoring system. Whenever a property is found at the top of a cluster's hierarchy it receives a score of six, the next one down a score of five, and so on, down to one for the property at the bottom of the hierarchy. Each cluster is scored individually, then the scores are summed per property. The higher the score, the more sensitive energy consumption is to that property. Tables 4.6 and 4.7 show these sums for the CPU and GPU respectively. The highest possible score for a CPU property is 294 and for a GPU property 318, since there are 49 and 53 clusters respectively. In both cases, energy consumption is most sensitive to ILP, followed by core count, input data-type, operations group, and operation, and is least sensitive to input data. The score of the input data is particularly low in both cases, at less than a quarter of the score of ILP; it is clearly the least important property by any measure. Sensitivity to the operation being performed is not as high as we first observed. The operation was usually either at the top or near the bottom of the hierarchy, meaning that for many clusters it was of little importance. Sensitivity is higher for the operations group, which rarely finds itself at the bottom of the hierarchy. This shows that grouping operations is a useful approach and increases the similarity within a cluster.

Table 4.6: Sensitivity Scores for the CPU
Property   Inputs   Data-types   ILP   Cores   Operation   Operation group
Score      60       210          269   237     138         191

Table 4.7: Sensitivity Scores for the GPU
Property   Inputs   Data-types   ILP   Cores   Operation   Operation group
Score      67       225          297   252     149         202
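The hierarchy-and-scoring procedure described above can be sketched as follows. This is one plausible implementation, not the thesis's R code: the property encoding, the toy cluster, and the tie-breaking rule (the first property found wins) are all assumptions made for illustration.

```c
#include <stdio.h>
#include <string.h>

/* Each benchmark is described by six categorical properties, encoded as
 * small integers. */
#define NUM_PROPS 6
enum { P_INPUT, P_DTYPE, P_ILP, P_CORES, P_OP, P_GROUP };
static const char *prop_name[NUM_PROPS] =
    { "Inputs", "Data-types", "ILP", "Cores", "Operation", "Operation group" };

struct bench { int prop[NUM_PROPS]; };

/* Score one cluster: find the most common (property, value) pair, award that
 * property points equal to the remaining hierarchy depth (6 at the top, 1 at
 * the bottom), drop members that do not share the value, and repeat. */
static void score_cluster(const struct bench *b, int n, int score[NUM_PROPS])
{
    int used[NUM_PROPS] = { 0 }, alive[64];
    for (int i = 0; i < n; i++) alive[i] = 1;

    for (int level = 0; level < NUM_PROPS; level++) {
        int best_p = -1, best_v = 0, best_cnt = -1;
        for (int p = 0; p < NUM_PROPS; p++) {
            if (used[p]) continue;
            for (int i = 0; i < n; i++) {
                if (!alive[i]) continue;
                int v = b[i].prop[p], cnt = 0;
                for (int j = 0; j < n; j++)
                    if (alive[j] && b[j].prop[p] == v) cnt++;
                if (cnt > best_cnt) { best_cnt = cnt; best_p = p; best_v = v; }
            }
        }
        score[best_p] += NUM_PROPS - level;
        used[best_p] = 1;
        for (int i = 0; i < n; i++)
            if (alive[i] && b[i].prop[best_p] != best_v) alive[i] = 0;
    }
}

int main(void)
{
    /* A toy three-member cluster; real clusters come from the G-means output. */
    struct bench cluster[] = {
        { { 0, 1, 1, 3, 2, 1 } },
        { { 1, 1, 1, 3, 2, 1 } },
        { { 2, 1, 1, 2, 4, 2 } },
    };
    int score[NUM_PROPS] = { 0 };
    score_cluster(cluster, 3, score);
    for (int p = 0; p < NUM_PROPS; p++)
        printf("%-16s %d\n", prop_name[p], score[p]);
    return 0;
}
```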

These results show that there is no need to vary the input data or to test every instruction. The ascending input data was used for the representative benchmarks, as it represents a mix of values. The following instructions were used to represent each operations group: or (|), not (∼), less than (<), add (+), multiply (∗), and divide (/). The final set of representative compute benchmarks contains 160 benchmarks, less than 20% of the 832 considered.

4.3 Conclusion

This chapter described the power-measuring hardware that was used throughout this work and introduced the methodology used to measure benchmark energy consumption. The chapter also explained how a representative set of micro-benchmarks was created for the Fusion APU. These include both memory- and compute-focused benchmarks. For the compute benchmarks, statistical analysis was used to reduce the number of required benchmarks from 832 to 160. These benchmarks will be used in the power modelling described in the next chapter.

Chapter 5

Power Modelling

Today, power is an important constraint in computer architecture [81]. Power density is the reason we have seen very little increase in clock frequency in the past few years. Energy consumption is also more important than ever, as mobile devices from cell phones to laptops now rely on batteries. In data centres, peak power consumption and total energy consumption are also very important [82]. In shared hosting environments, power consumption is as important as CPU time, disk space, or memory requirements, and there are often steep penalties for exceeding maximum power consumption or monthly energy quotas [83].

For architects to be able to give power the attention it deserves, they need to be able to evaluate power throughout the design process. Architectural simulators must therefore help architects evaluate not only the performance but also the power characteristics of their designs, which means that power models have to be created and incorporated into simulators. CPU power models have been available for over a decade [84][85][86][87][88][89], and as GPGPU applications have become more common, GPU power models have also been developed in the last few years [58][62][90][91][92][93][94][95]. This work presents a power model for the AMD Fusion, the first power model for an APU. It takes into account the power consumption of the CPU, GPU, and shared resources. The model was built using performance information from the Multi2Sim simulator and hardware performance monitoring counters (PMCs), and was then incorporated into Multi2Sim.

While the model incorporated into Multi2Sim was ultimately based only on performance data from the simulator, the modelling process also involved the use of PMCs. The performance counters were used for the CPU's memory system because it is the least accurate part of the simulator. Over the course of this research, the accuracy of Multi2Sim's memory model was improved based on our feedback to the Multi2Sim team, and the simulator's memory model could then be used. The PMC values were used to guide the development of power models and to assess the quality of the micro-benchmarks.

Figure 5.1: Steps involved in the modelling process (flowchart: micro-benchmark selection and refinement, power and performance measurement, regression modelling, and validation, with feedback on coverage and accuracy).

The modelling process involved four main steps:

1. Benchmark selection

2. Measuring power and performance for each benchmark

3. Statistical modelling

4. Model validation

Figure 5.1 shows the entire modelling process with feedback. For the benchmark selection step, the micro-benchmark set developed in Chapter 4 was used first. Additional application benchmarks were also used to provide a set of mixed-instruction benchmarks. Measuring involved making power measurements with the setup described in Chapter 4, making PMC measurements on real hardware, and simulating the workloads in Multi2Sim for performance information. For the simulation statistics to be useful, Multi2Sim had to be configured to match the Fusion APU as closely as possible. The statistical modelling used all the measured values to generate a power model for the hardware. Finally, the model was validated using application benchmarks that were not part of the training set.

This chapter is organized as follows. Section 5.1 gives a background on power modelling and discusses prior work in the area. Section 5.2 describes the benchmarks used in the modelling process, focusing on the benchmarks not described in Chapter 4. Section 5.3 describes how PMCs were used and how Multi2Sim was configured to approximate the AMD Fusion. Section 5.4 then describes the modelling process itself.

5.1 Background

The aim of power modelling is to create a mathematical model that takes performance information as input and gives energy information as output. The performance information can come from measurements made on real hardware, statistics from a performance simulator, or from analyzing the source code. There are two common approaches to generating power models for processors: circuit-level modelling and statistical modelling using performance data directly.

Circuit-level modelling is a more accurate but much more computationally expensive approach and it is used in the popular Wattch CPU power modelling framework [85]. It requires many details about the architecture and process technology to produce accurate results. To generate the power model, the energy dissipation of all possible switching events is calculated. To do this, the lumped capacitance and resistance of the circuit, as seen from the power supply, must be calculated. Approximations are used to simplify this process. For example, instead of calculating the energy dissipated by writing value X to a register containing value Y for all permutations of X and Y , it is assumed that half the bits in the register will toggle. This has very little effect on the model accuracy except in edge cases.

Circuit-level models can be quite burdensome in the early stage of design, since the implementation details of the process or certain components may still be unknown. There have been attempts to reduce the burden on the architect, by guessing any missing values or by allowing them to use existing component models in works such as McPAT [86]. While this approach allows for the quick evaluation of certain design decisions, accuracy is reduced as the novelty of the design increases, since the framework’s guesses are based on existing design practices. It is also not always clear to the designer what assumptions McPAT is making, so it is difficult to evaluate how reasonable the guesses are.

One advantage of the circuit-level modelling is that it is possible to separate switching power from leakage power and incorporate a thermal model of the chip. To do this, more information, particularly about the process technology, is required. Then, two simulations can be carried out in parallel, one that models the energy dissipated by switching events and another that models the thermals of the chip and energy dissipated by leakage current. The total energy dissipated is the sum of the energy calculated by these two simulations and can be used as an input to the thermal model. The output of the thermal model can then be used to update the leakage power. While these simulations can be very accurate, they are now incredibly computationally expensive, as the thermal model involves solving complex equations [96]. This is why most power models do not take thermal feedback into account.

In general, circuit-level models, while accurate, are too slow to produce information in real time. This makes them ill-suited to any framework that requires real-time power estimates. It is possible to create per-component power models and then assume they dissipate a certain amount of power per measurable event. This approach can greatly increase speed, as the circuit-level simulations take place beforehand, but accuracy is sacrificed since generalizations and assumptions are made.

On the other hand, statistical modelling has a high up-front cost while the model is being trained, but is much faster to use. There are two common approaches to producing statistical power models that use benchmarks and performance counters: the top-down and bottom-up approaches [97]. The top-down approach treats the hardware like a black box, so no details about the modelled system need to be known. While the top-down approach is simple and yields results comparable to the bottom-up approach, it has the limitation of not being composable. That is to say, with the bottom-up approach the total power consumption can be decomposed into the power consumption of the various functional units, plus any fixed consumption. This makes the bottom-up approach a good fit for power models that will be used with an architectural simulator. Since each functional unit is modelled, it is possible to simulate architectures with varying numbers of functional units. The bottom-up approach can also be used to model both CMP and SMT effects [89], meaning it can be used to estimate power consumption even as the number of simulated cores is varied.

Most of the previous power modelling work has focused on CPUs. Only recently have these approaches been applied to GPU power modelling. Some works took advantage of existing CPU circuit-level modelling infrastructure and added it to the GPGPUSim simulator. Work by Wang et al. [92] added power modelling to GPGPUSim using modified versions of Wattch and Orion [98]. Both GPUWattch [62] and GPUSimPow [91] combine GPGPUSim with modified versions of McPAT to create GPU power models. All these works model Nvidia hardware. Another circuit modelling approach was PowerRed [90] which uses traces as inputs to the model, but was not validated against any real hardware.

There has also been work to create statistical power models for GPUs. Hotpower [99] is an Nvidia GPU power model based on GPU PMCs, as is the model developed by Nagasaka et al. [95]. Wang et al. [94] developed a power model based on the PTX assembly code generated by the Nvidia kernel compiler. The model developed by Hong and Kim [58] is based on access rates obtained from their PTX assembly code analysis tool [100]. These top-down approaches generate models that can be useful for online power estimation in real hardware and can also be used to evaluate the effect of program code on power consumption, but they have limited use for architecture evaluation.

There has been comparatively little prior work looking at AMD GPUs or integrated GPUs. The only model of an integrated GPU is in the work by Bircher and John [59], where it was part of a top-down full-system power model. The only model of a discrete AMD GPU is in the work by Ma et al. [93], which used GPU performance counters but measured power at the system level. Measuring power at the system level can be very inaccurate, as it assumes that all components other than the GPU have a constant power consumption, which is not the case. Both of these approaches are also top-down and thus have the associated limitations.

There has also been some work by Wang et al. [30] studying power budgeting for heterogeneous processors. However, their power model is very coarse, assuming constant per-core power consumption. This means that the only way to alter the CPU's or GPU's power consumption is to increase or decrease the core count. Neither the effect of the workload on power consumption nor frequency scaling is considered.

This work presents a bottom-up power model for the Fusion APU. This is the first such model for a heterogeneous processor, as well as for a GPU. The model is decomposable, so power can be estimated separately for the CPU and GPU, as well as for each core. However, creating a power model for a heterogeneous processor is not as simple as combining the power models for each type of processor found in the chip. Shared resources need to be considered, as otherwise their power consumption will be counted multiple times. The statistical methods used in this chapter are largely based on the work of Bertran et al. [101][89], and the micro-benchmark approach is based on the work of Lee and Brooks [87][102]. While these works created CPU power models, their techniques were extended to GPU and APU modelling in this research.

5.2 Selecting Benchmarks

Since the power model relies on data generated using benchmarks on both real and simulated hardware, it is necessary to keep the number of benchmarks low enough to allow all the simulations to complete in a reasonable amount of time. However, it is necessary to ensure that the benchmarks used represent a wide enough range of workloads for the resulting model to be valid. To ensure all components of the architecture are exercised, micro-benchmarks are used to target specific components. To ensure the importance of individual components is not exaggerated, application benchmarks that make use of a wide range of components are also used.

5.2.1 Micro-Benchmarks

The micro-benchmark set described in Chapter4 was used for modelling. However, this set of benchmarks did not fully cover the functional units that Multi2Sim simulates. Multi2Sim simulates a larger number of functional units than are found in the Fusion’s CPU. For example, it assumes there are separate SSE and floating-point units, while on the Fusion the same hardware is used.

This limitation was addressed in two ways. First, two further instruction micro-benchmarks were added. One was a cast between integer and floating point, to exercise the conversion hardware. The other was swizzling of vector data-types, which on the CPU is treated like an SSE packing instruction. Swizzling is the reordering of the components of a vector: for example, an int4 with value wxyz can be swizzled to create an int4 with value xwzy. The values of the four ints that make up the vector do not change, but their order does. Second, all the floating-point benchmarks were written in regular C, so they would compile to x87 instructions and be run on the FPUs in Multi2Sim.
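As a small illustration, the OpenCL kernel below performs the kind of swizzle described above using OpenCL's numbered-component syntax; the kernel name and the particular component order are arbitrary choices for the example.

```c
__kernel void swizzle_bench(__global int4 *data)
{
    size_t i = get_global_id(0);
    int4 v = data[i];
    /* Reorder the four components (wxyz -> xwzy); the values themselves are
     * unchanged, only their positions move. */
    data[i] = v.s1032;
}
```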

During the modelling process, we noticed that the training set was weighted towards memory micro-benchmarks for the CPU, which had an additional 264 cache-specific benchmarks. To reduce this number, the cache benchmarks were only run for one and four cores, and only a sixth of them were used. Values of one half and one quarter were also tried, but one sixth produced the best results. This reduced the total number of cache micro-benchmarks to 22, on top of the existing global and local memory benchmarks.

The total number of micro-benchmarks used to create the GPU power models was 224; 184 compute and 40 memory. The CPU required an additional 48 x87 micro-benchmarks and 22 cache micro-benchmarks, for a total of 294.

5.2.2 Application Benchmarks

Application benchmarks are used to simulate mixed workloads that include a range of memory and ALU operations. It is important to include some general benchmarks that use more than one type of operation to ensure that the energy consumption of individual operations is not overestimated. When a single operation is used, its modelled cost includes not only the cost of executing the operation, but also any overheads that are not captured by another metric. Application benchmarks are also ideal for model validation, as they cover a broad set of modelled parameters. The application benchmarks had three primary sources: the AMD APPSDK v2.7 [49], the Parsec benchmark suite [103], and the Rodinia heterogeneous benchmark suite [47]. In total, the twenty-five applications listed in Table 5.1 were used.

Table 5.1: Application Benchmarks Used (columns: Application name, Source, CPU, GPU, CPU valid., GPU valid., APU valid.)
Back propagation                        Rodinia    x x x
Binomial Option                         APPSDK     x
Black Sholes                            APPSDK     x x
Black Sholes                            Parsec     x
Bodytrack                               Parsec     x
Canneal                                 Parsec     x
Dedup                                   Parsec     x
DCT                                     APPSDK     x x
Fluidanimate                            Parsec     x
K-means                                 Rodinia    x x x
LU-decomposition                        Rodinia    x x x
Mandelbrot                              APPSDK     x x x
Matrix multiplication                   APPSDK     x x x
Matrix multiplication (local memory)    APPSDK     x x x
Matrix transpose                        APPSDK     x x x
Needleman-Wunsch                        Rodinia    x x x
N-body                                  APPSDK     x
Pathfinder                              Rodinia    x x
Prefix sum                              APPSDK     x
Reduce                                  APPSDK     x x
Scan large array                        APPSDK     x
Sobel Filter                            APPSDK     x x x
Swaptions                               Parsec     x
URNG                                    APPSDK     x x
Vector                                  DistCL     x x x

The APPSDK benchmarks and the vector benchmark are written in OpenCL and were run on both the CPU and GPU. Of the fourteen benchmarks, thirteen ran correctly in Multi2Sim using the CPU and nine using the GPU. These benchmarks all run using the four cores of the CPU and GPU.

Validation benchmarks are used to assess the predictive power of the models. They are not part of the set of benchmarks used to train models. This allows them to be used to reject over-fitted models. The validation sets for the CPU and GPU consisted of four APPSDK benchmarks each.

The Parsec benchmarks are x86 applications and could therefore only be run on the CPU. They were added because the initial CPU models had very poor predictive ability. All the Parsec benchmarks that would simulate in the 48 hour window allotted to jobs on SciNet [12] were used. The number of threads used for the benchmarks can be specified, so the benchmarks were run using one or four cores.

The Rodinia benchmarks consist of heterogeneous workloads that run OpenCL kernels on the GPU and perform the remaining computations on the CPU. These benchmarks were used to validate the APU power model, as they target both devices. This allowed us to test the predictive power of the final CPU+GPU model with benchmarks that make use of both devices. The five Rodinia benchmarks used are the only Rodinia benchmarks that would simulate correctly on both the CPU and the GPU.

5.3 Measuring

The characteristics of all the benchmarks were measured in three ways: physical power measurements, hardware performance counters, and Multi2Sim simulations. The power measuring setup is identical to the one described in Chapter 4. This section first introduces the hardware performance counters and details how they were used. It then describes how Multi2Sim was configured to emulate the A6-3650, and which simulation statistics reported by Multi2Sim were used in the modelling process.

5.3.1 Hardware Performance Counters

Modern processors contain performance monitoring counters (PMCs) that can be programmed to monitor certain hardware events. Using hardware performance counters has very low overhead [104]. Since Linux 2.6.31, it has been possible to access these counters through the perf event library [105]. The counters of interest for this work are those related to memory operations because the Fusion’s memory hierarchy is difficult to model accurately in Multi2Sim. The Fusion APU has four performance counters that can be programmed to monitor any performance events listed in the BIOS and Kernel Developer’s Guide For AMD Family 12h processors [106]. However, in practice, at most three memory related counters could be used simultaneously. While it is still possible to monitor more than three memory events at a time, the counters would be time multiplexed.

The three events that are monitored are:

1. Requests to L2 Cache – which counts the number of L2 requests from L1 instruction and data caches.

2. L2 Cache Misses – which counts the number of L2 cache misses that originated from the L1 instruction and data caches.

3. DRAM Access – which counts the number of read and write requests to the DRAM.

By measuring L2 cache misses and DRAM accesses separately, it is possible to distinguish between cache misses that could be satisfied by other cores from those that were satisfied by DRAM. The energy consumption of DRAM accesses is higher because they require off-chip communication.

The perf_event library is designed to provide easy access to common performance events. Unfortunately, the performance events for AMD processors are not mapped correctly in current versions of the Linux kernel. This meant that it was not possible to use the architecture-agnostic event types defined by perf_event; instead, the Performance Event Select Registers had to be programmed directly with raw hex values. The developer's guide was used to determine the appropriate hex values [106].
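The sketch below shows how a raw event can be programmed through perf_event on Linux. It is only an illustration of the mechanism: the event encoding is a placeholder, not an actual family 12h event code, which must be taken from the BIOS and Kernel Developer's Guide [106].

```c
#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <string.h>
#include <stdio.h>
#include <stdint.h>

/* Placeholder raw event-select value; a real value comes from the BKDG. */
#define RAW_EVENT_PLACEHOLDER 0x0000ULL

/* perf_event_open has no glibc wrapper, so it is called via syscall(). */
static long perf_event_open(struct perf_event_attr *attr, pid_t pid,
                            int cpu, int group_fd, unsigned long flags)
{
    return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
}

int main(void)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.size = sizeof(attr);
    attr.type = PERF_TYPE_RAW;            /* bypass the generic event mapping */
    attr.config = RAW_EVENT_PLACEHOLDER;  /* raw hex value of the desired event */
    attr.disabled = 1;
    attr.exclude_kernel = 1;

    int fd = (int)perf_event_open(&attr, 0 /* this process */, -1, -1, 0);
    if (fd < 0) { perror("perf_event_open"); return 1; }

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    /* ...region of interest (e.g. the OpenCL kernel launch)... */
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

    uint64_t count = 0;
    if (read(fd, &count, sizeof(count)) == sizeof(count))
        printf("event count: %llu\n", (unsigned long long)count);
    close(fd);
    return 0;
}
```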

From the Multi2Sim simulations we can get a count of the number of memory requests made by the CPU. This value will be accurate regardless of the memory hierarchy simulated. This can be used along with the L2 requests to determine the L1 hit and miss rates, at which point we have a complete picture of what is taking place in the CPU’s memory.

5.3.2 Multi2Sim Simulation

Simulating the benchmarks in Multi2Sim gives detailed information about the instructions executed and certain runtime characteristics. For the GPU, this was the only way to get this information. For the CPU, some of the same information could have been obtained using hardware performance counters, but not all of it. Furthermore, doing so would have required multiple runs of each benchmark, since at most four counters can be used concurrently.

To ensure that the counts correspond to the region of code being measured on the real hardware, every OpenCL benchmark was run twice. First, the entire benchmark was run unmodified. Second, everything except the OpenCL kernel itself was run. By subtracting the counts of the second run from the first, the counts for running just the kernel were obtained. To reduce the likelihood of error in this step, a Python script was written to perform these calculations automatically.

The remainder of this section will first describe how the simulator was configured for GPU simulation and what performance statistics were collected from GPU simulations. Then, the same process will be described for the CPU simulations.

GPU Simulation

The GPU simulations were fairly straightforward, since Multi2Sim is capable of simulating the Evergreen architecture directly. The Evergreen simulator was developed with the assistance of AMD and as such is reliable. The only things that needed to be specified were the number of compute units in the GPU and the operating frequency.

Multi2Sim can be used to obtain the following information from each benchmark:

a) Total instruction and cycle counts for each core.

b) Instruction and cycle counts for the CFEngine, ALUEngine, and TEXEngine.

c) Reads and writes to main and local memory.

The CFEngine, or control flow engine, is responsible for control flow instructions, such as branches, and for global memory writes. The ALUEngine consists of all the SPs in an SC and is responsible for performing calculations on data. The TEXEngine consists of the texture units and, under GPGPU workloads, is responsible for global memory reads. These statistics provide a high-level view of what is going on in each engine and in memory when benchmarks are run.

To get a more detailed breakdown of what sort of instructions are being run on the different engines, Multi2Sim was modified to report individual instruction counts. This resulted in counts for 291 instructions, more than should be part of any model.

Instead of using individual instruction counts, instructions were grouped by type, using information from the Evergreen architecture reference guide [17]. With this information, instructions were put into the categories listed in Table 5.2.

Table 5.2: Instruction Categories
Name            Description
ALU             the instruction uses the ALUEngine
TEX             the instruction uses the TEXEngine (main memory reads)
CF              the instruction uses the CFEngine (control and main memory writes)
Local-mem       the instruction accesses local memory
Global-mem      the instruction accesses main memory
FP              a floating point instruction
Int             an integer instruction
Predicate       a predicate instruction
Conditional     a conditional instruction
Bitwise         a logic instruction
Arithmetic      addition and subtraction instructions
Mult            multiplication instructions
Div             division instructions
Transcendental  complex math instructions (logarithms, square roots, etc.)
Conversion      type conversion instructions (int <-> float)
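The grouping itself is mechanical; the sketch below shows the idea with a deliberately tiny category map whose mnemonics are illustrative examples, not the full 291-entry mapping derived from the ISA guide [17].

```python
# Collapse per-instruction counts into the categories of Table 5.2.
# The mnemonics and their category sets are illustrative examples only.
CATEGORY_MAP = {
    "ADD_INT":   {"ALU", "Int", "Arithmetic"},
    "MUL_IEEE":  {"ALU", "FP", "Mult"},
    "SQRT_IEEE": {"ALU", "FP", "Transcendental"},
    "FETCH":     {"TEX", "Global-mem"},
    "LDS_WRITE": {"ALU", "Local-mem"},
}

def category_counts(instruction_counts):
    totals = {}
    for mnemonic, count in instruction_counts.items():
        for category in CATEGORY_MAP.get(mnemonic, set()):
            totals[category] = totals.get(category, 0) + count
    return totals

print(category_counts({"ADD_INT": 1200, "FETCH": 300, "MUL_IEEE": 450}))
```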

CPU Simulation

Configuring Multi2Sim for CPU simulations was a little more involved, since the Multi2Sim CPU architecture is not based on any existing x86 CPU. To approximate the architecture of Fusion's CPU, the simulator needed to be configured as closely as possible to the processor in the APU. The CPU is part of AMD's family 12h line, which is a derivative of their K8 architecture. The majority of the configuration parameters were found in AMD's optimization guide for this family of processors [20]. The remaining parameters were obtained from an analysis piece on the architecture [107]. The parameters that need to be configured include the core count, the width of the various pipeline stages, the size of queues, the branch predictor, and the available functional units. Table 5.3 summarizes the important configuration parameters and lists the number of functional units of each type that were simulated. A detailed list of all the parameters and their values is available in Appendix B.

Each functional unit has three parameters: count, operation latency, and issue latency. Count is simply the number of such functional units. Operation latency is the latency in cycles required to complete an operation using that functional unit (FU). Issue latency is the number of cycles after which a new operation can be issued to the FU. This reflects the pipelining that takes place in certain functional units. For example, we can issue one integer multiply per cycle, but each operation takes three cycles to complete.
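As a small illustration of how these three parameters interact, the sketch below reproduces the integer-multiply example (one unit, 3-cycle operation latency, 1-cycle issue latency): a multiply can be issued every cycle, so after a three-cycle fill one result completes per cycle.

```python
def completion_cycles(n_ops, count=1, op_latency=3, issue_latency=1):
    """Cycle at which each of n_ops back-to-back operations completes,
    given `count` identical functional units."""
    cycles = []
    for i in range(n_ops):
        issue_cycle = (i // count) * issue_latency   # when this op can issue
        cycles.append(issue_cycle + op_latency)      # result ready op_latency later
    return cycles

print(completion_cycles(4))                              # [3, 4, 5, 6]
print(completion_cycles(6, count=3, op_latency=1))       # [1, 1, 1, 2, 2, 2]
```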

One issue that was encountered when configuring the functional units is that the functional units in the Fusion's CPU can execute many types of micro-instructions. For example, integer addition, comparison, and logic operations all take place in the same three ALUs. In Multi2Sim, each of these operations would be executed by a different FU. To maintain the appropriate maximum throughput for each type of operation, Multi2Sim is configured to have three of each of these functional units.

Table 5.3: CPU Configuration Summary
Parameter               Description                                        Value
General
Frequency               in MHz                                             2600
Cores                   core count                                         4
Threads                 number of simultaneous threads a core supports     1
Pipeline
DecodeWidth             decode width                                       3
DispatchWidth           dispatch width                                     3
IssueWidth              issue width                                        9
CommitWidth             commit width                                       3
Functional Units
IntAdd.Count            integer add/subtract                               3
IntMult.Count           integer multiplication                             1
IntDiv.Count            integer division                                   1
EffAddr.Count           address calculation                                3
Logic.Count             logic                                              3
FloatSimple.Count       floating point simple (e.g. round)                 2
FloatAdd.Count          floating point add/subtract                        1
FloatComp.Count         floating point compare                             1
FloatMult.Count         floating point multiply                            1
FloatDiv.Count          floating point divide                              1
FloatComplex.Count      floating point complex (e.g. trig)                 1
XMMIntAdd.Count         SSE integer add/subtract                           2
XMMIntMult.Count        SSE integer multiply                               1
XMMLogic.Count          SSE logic                                          2
XMMFloatAdd.Count       SSE floating point add/subtract                    1
XMMFloatComp.Count      SSE floating point compare                         1
XMMFloatMult.Count      SSE floating point multiply                        1
XMMFloatDiv.Count       SSE floating point divide                          1
XMMFloatConv.Count      SSE floating point convert                         2
XMMFloatComplex.Count   SSE floating point complex (e.g. sqrt)             1

The situation is comparable on the floating point side. Here, the real hardware has three FPUs with different capabilities. These three units are used to execute x87 and SSE instructions. There is a store unit (FSTORE), an addition unit (FADD), and a multiplication unit (FMUL). x87 and SSE instructions mostly map to a single FPU. On real hardware, floating point complex instructions do not all have the same latency; in fact, they range from 19 to 114 cycles. The most common complex operations in benchmarks are square roots, used to calculate Euclidean distances, and trigonometric operations for Fourier transforms. The latency of complex operations in Multi2Sim was set to 50 cycles, between the latency of square root and trigonometric operations.

Originally, Multi2Sim assumed that all SSE instructions were handled by a single type of FU. This meant that only a single operation latency could be set for all SSE instructions. However, since the OpenCL compiler uses SSE instructions to implement floating point operations, it made sense to allow finer-grained control. Therefore, the SSE functional unit was split into integer units (add, multiply, and logic) and floating point units (add, multiply, compare, divide, convert, and complex). Instructions were mapped to the appropriate FU using AMD [20] and Intel [108] documentation. This allowed different latencies to be assigned to different types of SSE operations.

Multi2Sim reports statistics such as total cycle counts, total bytes of memory used, and counts for each type of micro-instruction. Multi2Sim also reports detailed information from each core. This includes pipeline, functional unit, branch, and queue information. Each micro-instruction is tracked through three pipeline stages: dispatch, issue, and commit. Each stage also includes higher-level counts, such as the number of floating point, integer, or memory instructions, as well as statistics such as the number of system calls or context switches performed. For each type of functional unit, there is a count of the number of micro-instructions that used it. Most of the modelling work uses statistics from either the pipeline stages or the functional units. All these statistics are written to a single file in the INI format. A Python script was used to parse this file for the relevant information.

Memory Configuration

For the GPU, the default memory configuration could be used. Things are a little more difficult for the CPU, for the reasons described in Section 2.4.1. The initial approach was to specify the memory configuration to match that of the real hardware where possible and approximate it where necessary. This resulted in a memory configuration that was much slower than expected. To overcome this issue, an experimental methodology was adopted to create the CPU's memory model.

Five micro-benchmarks were written in assembly and then run both on real hardware and in Multi2Sim. Three were used to evaluate the CPU configuration: add dependent, which performs additions where each subsequent addition depends on the result of the last; add independent, which is similar except that dependencies are four additions apart; and register, which moves a value between a pair of registers. These allow us to verify that the latency of add instructions is configured correctly, that three concurrent adds can take place when they are independent, and that register moves have the correct latency. The memory was evaluated in terms of both random and sequential access. The random access benchmark (mem latency) consists of random pointer chasing within a bounded address space. This allows us to measure the latency of different types of memory accesses.

Figure 5.2: Comparison of the literal and best memory configurations. (Percent of performance on real hardware for the add dependent, add independent, register, mem latency (32 kB to 8 MB), and mem throughput (8 kB to 2 MB) micro-benchmarks.)

The sequential access benchmark (mem throughput) simply reads a contiguous region of memory. This allows us to measure the memory throughput of the CPU and evaluate the performance of the prefetcher. These benchmarks were used to tune the memory model until it performed as close to real hardware as possible. Figure 5.2 compares the memory configuration where latencies were set to the values found on real hardware, which we call the literal configuration, to the most accurate configuration we generated, which we call the best configuration. The three benchmarks used to evaluate the CPU configuration are run once each, and the memory benchmarks are each run five times for different sizes of memory. This allows the performance of different levels of cache to be isolated. All values are normalized to the performance measured on real hardware. Table 5.4 gives the unnormalized data and also includes the values documented in the optimization guide.
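The actual micro-benchmarks were written in assembly; the following is only an illustrative Python sketch of the two access patterns they exercise (random pointer chasing and sequential streaming).

```python
import random

def build_chain(n):
    """Build a random cyclic pointer chain covering all n slots."""
    order = list(range(n))
    random.shuffle(order)
    chain = [0] * n
    for src, dst in zip(order, order[1:] + order[:1]):
        chain[src] = dst
    return chain

def chase(chain, steps):
    """mem latency pattern: every load depends on the previous one."""
    i = 0
    for _ in range(steps):
        i = chain[i]
    return i

def stream(buf):
    """mem throughput pattern: read a contiguous region sequentially."""
    total = 0
    for value in buf:
        total += value
    return total

chain = build_chain(1 << 13)   # array size chosen to target a given cache level
chase(chain, 1_000_000)
stream([1] * (1 << 13))
```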

What is clear from looking at the figure is that the micro-architecture is modelled accurately. The add and register micro-benchmarks are all within 1.5% of the measured values. The same cannot be said for the memory hierarchy. L1 cache performance using the best model is fairly close, with 20% faster accesses at 32 kB but 40% slower accesses at 64 kB. L2 cache performance is a little further off, with access latencies that are 60% slower, for both 512 kB and 1 MB, using the best model. Main memory access latencies are about five times faster than what was measured on real hardware.

The reason for this is that every time the access latency of main memory is increased, the latency of L2 accesses also increases. With this configuration we reach a sort of balance, as L2 cache accesses are slower than on real hardware while main memory accesses are faster. Memory throughput varies between 33% and 78% of the measured value, with a clear dip for arrays that fit in L2 but are too large to fit in L1. These results indicate that the prefetcher used in Multi2Sim does not perform as well as the one found in the Fusion, since even with faster main memory the simulation could not match the throughput of the real hardware. The literal model's performance is much worse. With the exception of the main memory access latency, where it achieves 71% of the measured performance, it never achieves more than 26% of the measured performance.

Table 5.4: Memory Latency Comparison
Micro-benchmark            Level   Documented   Measured   Literal Simulated   Best Simulated
dependent add              CPU     1            1          1                   1
independent add                    0.33         0.38       0.37                0.37
register moves                     1            1          1                   1
32 kB random reads         L1      3            3          11.7                2.5
64 kB random reads                 3            3          21.2                5.4
8 kB contiguous reads              -            1          7.2                 1.5
16 kB contiguous reads             -            1          15.4                1.9
512 kB random reads        L2      12           15         274.2               40.4
1 MB random reads                  12           28         291.4               65.8
128 kB contiguous reads            -            1.3        26.3                4
256 kB contiguous reads            -            1.8        26.3                4
8 MB random reads          Mem     107          219        306.2               44.4
2 MB contiguous reads              -            3.1        26.4                4

The actual memory configuration for the best model is given in Table 5.5. This is the fastest memory configuration Multi2Sim will support. The memory frequency is 10 GHz which is the maximum supported value and all the latencies are set to 1. There was no easy way to account for the fact that the Fusion has an exclusive cache hierarchy, since the number of sets and the associativity both needed to be powers of two.

5.4 Modelling

Once all the power measurements were made and the simulations were run, it became possible to start regression modelling to create our final power model. We used multiple linear regression, which consists of solving a system of equations where the dependent variable is the total energy consumption, the known quantities are the runtime of a benchmark and its hardware activity counts, and the unknowns are the energy consumption of each predictor. Predictors are either statistics obtained from the simulator or the PMCs. Examples include cycle counts, the number of integer instructions, the number of local memory accesses, and the number of L1 cache accesses. Since the problem is overdetermined by design, there is no exact solution. Instead, the regression finds an approximate solution by minimizing the sum of squares of the residuals. The residual for each benchmark is the difference between the predicted energy consumption of the benchmark and its true energy consumption. This is known as the least squares fit.

Table 5.5: Memory Configuration
Parameter    Description                      Value
Frequency    in MHz                           10 000
PageSize     in bytes                         4096
L1 cache
Sets                                          512
Assoc        associativity                    2
BlockSize    in kB                            64
Latency      latency for this level alone     1
Policy       cache eviction policy            LRU
Ports                                         2
L2 cache
Sets                                          1024
Assoc        associativity                    16
BlockSize    in kB                            64
Latency                                       1
Policy       cache eviction policy            LRU
Ports                                         2
Main memory
BlockSize    in kB                            64
Latency                                       1

The final result is an equation of the form

y = c_1 x_1 + c_2 x_2 + \dots + c_n x_n + b

where y is the dependent variable (total energy), c_i is the coefficient for the predictor x_i, and b is the intercept (static power × runtime).

The basic problem formulation for device power is expressed in equation 5.1. Total power (P) is the sum of static power (P_{stat}) and dynamic power (P_{dyn}). P_{dyn} can be further decomposed into the dynamic power of each core (P_{dyn_c}), as shown in equation 5.2. As shown in equation 5.3, P_{dyn_c} is the sum of the dynamic power of each predictor. The dynamic power of each predictor is the product of the activity count of the predictor (n_k) and the energy consumption of the predictor (E_k), divided by time. P_{stat} can also be decomposed into the static power of each core and the static power of shared resources (P_{stat_S}), as shown in equation 5.4. The complete breakdown of power consumption is shown in equation 5.5. To generate a power model, we need to solve for P_{stat} and E_k. P is measured on real hardware, and the values of n_{ck} are obtained from simulation. Since power is energy over time (P = E/t), it is possible to formulate the problem in terms of energy, as shown in equation 5.6.

The advantage of the energy formulation is that it makes it easier to incorporate the model into a simulator. Models that are formulated in terms of power, as shown in equation 5.7, use the activity factor of each component (AF_k). The activity factor is a value between zero and one that represents the ratio of cycles where the component was active to total cycles. The easiest way to simulate this would be to count the accesses to a component and then multiply by the number of cycles that component requires to complete an operation. To calculate power consumption, the latency of the component (l_k), the runtime, and the operating frequency are required. Using the energy formulation, only E_k and the runtime are required.

This reduces the interaction between components of the simulator and simplifies calculations.

P = P_{stat} + P_{dyn} \quad (5.1)

P_{dyn} = \sum_{c=1}^{\#cores} P_{dyn_c} \quad (5.2)

P_{dyn_c} = \sum_{k=1}^{\#preds.} \frac{n_k \times E_k}{t} \quad (5.3)

P_{stat} = P_{stat_S} + \sum_{c=1}^{\#cores} P_{stat_c} \quad (5.4)

P = P_{stat_S} + \sum_{c=1}^{\#cores} \left( P_{stat_c} + \sum_{k=1}^{\#preds.} \frac{n_{ck} \times E_k}{t} \right) \quad (5.5)

E = t P_{stat_S} + \sum_{c=1}^{\#cores} \left( t P_{stat_c} + \sum_{k=1}^{\#preds.} n_{ck} \times E_k \right) \quad (5.6)

P = P_{stat_S} + \sum_{c=1}^{\#cores} \left( P_{stat_c} + \sum_{k=1}^{\#preds.} AF_{ck} \times l_k \right) \quad (5.7)

A bottom-up approach was used to obtain a decomposable model [97]. This required the CPU and GPU power models to be developed separately. As discussed in section 5.3.2, there are many statistics provided by the simulator, more than can be used to create a reliable power model. As shown by Lee and Brooks [87], a regression model is not reliable if the ratio between benchmarks and predictors is less than twenty. Based on this, we can safely use models with up to eleven predictors for the GPU and fifteen for the CPU. For both the CPU and GPU models, multiple combinations of statistics were considered as model inputs. In total, 61 CPU models and 72 GPU models were considered. The complexity in models varied significantly.

The modelling process consists of solving equation 5.6. What distinguishes the various models is the set of predictors k being considered. For each model, the process shown in Figure 5.3 was followed. We do not have enough information to solve for all the predictors and for static power at once. Instead, we have to perform multiple regressions assuming different static power levels; the model is most accurate when the correct static power value is used. Since the number of cores used is variable, we can solve for per-core static power. A predictor called core is added. Since we are solving for energy, the value of core is the product of the number of cores used and the duration of the benchmark. This allows us to solve directly for P_{stat_c}.

Each model is assessed by its error rate. The error rate is the mean of the relative errors obtained when using the model to predict the energy consumption of the validation benchmarks. The model is used to predict the energy consumption of each validation benchmark. Then the residual is calculated, and the relative error is found by dividing the absolute value of the residual by the true energy consumption of the benchmark.

Figure 5.3: The regression process. (Flowchart: starting from a static power of 0 W, a linear regression is performed; if the mean error is lower than the minimum seen so far, the static power and coefficients are saved; the static power is then incremented by 0.1 W and the loop repeats until the maximum static power is exceeded, at which point the saved coefficients and static power are returned.)

Finally, the arithmetic mean of all the relative errors is calculated. We use an arithmetic mean instead of a geometric mean because we want to increase the cost of high-error outliers and reduce the benefit from low-error outliers. A consistent but less accurate model is more useful than a model that is more accurate on average but occasionally has a large error.

At each iteration of the process in Figure 5.3, P_{stat_S} is fixed and the values of E_k are derived. The process is repeated for values of P_{stat_S} between 0 and 35 W for the CPU and between 0 and 25 W for the GPU, in increments of 0.1 W. The maximum P_{stat_S} values are based on the range of power consumption seen in the CPU and GPU benchmarks. The lm function in R is used to perform the linear regression. It takes as input a formula that describes the regression we want to solve, the static power value, and the measured data for each benchmark. The output is a model object which contains the coefficients for all the predictors and the residual and predicted value for each benchmark. The values of the most accurate model are then returned.
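The thesis performs this sweep with R's lm; the fragment below is an equivalent, simplified sketch using numpy's least-squares solver, with the predictor matrix and measurements assumed to be already assembled.

```python
import numpy as np

def fit_energy_model(counts, runtimes, cores_used, energy, max_static_w=35.0):
    """Sweep shared static power in 0.1 W steps and keep the best least-squares fit.

    counts:     (benchmarks, predictors) activity counts n_ck
    runtimes:   (benchmarks,) runtimes in seconds
    cores_used: (benchmarks,) active core counts
    energy:     (benchmarks,) measured energy in joules
    """
    # The extra 'core' column (cores * runtime) lets the fit recover per-core
    # static power directly, as described in the text.
    design = np.column_stack([counts, cores_used * runtimes])
    best = (None, None, np.inf)
    for p_stat in np.arange(0.0, max_static_w + 0.05, 0.1):
        target = energy - p_stat * runtimes            # remove shared static energy
        coeffs, *_ = np.linalg.lstsq(design, target, rcond=None)
        predicted = design @ coeffs + p_stat * runtimes
        mean_err = np.mean(np.abs(predicted - energy) / energy)
        if mean_err < best[2]:
            best = (p_stat, coeffs, mean_err)
    return best   # (shared static power, predictor energies, mean relative error)
```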

Figures 5.4 and 5.5 are histograms of the percentage error of the CPU and GPU models respectively. For these figures, the error is the mean relative error between the predicted and true energy consumption of the benchmarks in the training set.

Figure 5.4: Fitting error of the training benchmarks for the CPU models. (Histogram: number of models versus percentage error, 0 to 300%.)

Figure 5.5: Fitting error of the training benchmarks for the GPU models. (Histogram: number of models versus percentage error, 0 to 35%.)

Comparing the two histograms, it is clear that the GPU models are more accurate, with more than half of the models having less than 10% error. The most accurate GPU model has an error rate of 4.0% and is of medium complexity. It considers the number of cycles spent in each of the three engines and all four types of memory access; interaction terms were also modelled for the memory accesses. Counting the interactions, the model consists of twelve terms. The resulting model is summarized in Table 5.6.

On the CPU side, only one model has less than 50% error. This model has an error of 38% and considers only memory accesses and their interactions. Including the interaction terms, this model uses fifteen predictors. The model is summarized in Table 5.7.

Since many of these models are quite complex, we run the risk of over-fitting our data. This happens when the complexity of the model is used to "remember" the values for the training data rather than to find the trend.

Table 5.6: GPU Model Coefficients
Predictor                                                                            Coefficient
CFEngine cycles                                                                      −2.658e−8
ALUEngine cycles                                                                     2.605e−8
TEXEngine cycles                                                                     9.877e−8
global memory writes                                                                 1.402e−6
global memory reads                                                                  1.665e−7
local memory effective writes                                                        −1.933e−7
local memory effective reads                                                         6.758e−6
Interaction Terms
TEXEngine cycles × global memory reads                                               −4.213e−15
ALUEngine cycles × local memory effective reads                                      −3.454e−16
ALUEngine cycles × local memory effective writes                                     1.942e−18
ALUEngine cycles × local memory effective writes × local memory effective reads      2.444e−26
Static Power
Global                                                                               14.1 W
Per-core                                                                             0.31 W

Table 5.7: CPU Model Coefficients
Predictor                                        Coefficient
Dispatched memory µops                           4.715e−10
RAM accesses                                     4.610e−7
L2 misses                                        −7.618e−6
L2 hits                                          −1.610e−7
Interaction Terms
Dispatched memory × RAM                          −9.361e−16
Dispatched memory × L2 misses                    3.011e−14
Dispatched memory × L2 hits                      −1.379e−14
RAM × L2 misses                                  −2.339e−14
RAM × L2 hits                                    −1.379e−14
L2 misses × L2 hits                              2.048e−13
Dispatched memory × RAM × L2 misses              4.334e−22
Dispatched memory × RAM × L2 hits                −2.656e−23
Dispatched memory × L2 misses × L2 hits          −1.026e−21
RAM × L2 misses × L2 hits                        −2.656e−23
Dispatched memory × RAM × L2 misses × L2 hits    1.267e−32
Static Power
Global                                           34.5 W
Per-core                                         2.68 W

Figure 5.6: Fitting error of the validation benchmarks for the CPU models. (Histogram: number of models versus percentage error, 0 to 3000%.)

Figure 5.7: Fitting error of the validation benchmarks for the GPU models. (Histogram: number of models versus percentage error, 0 to 10000%.)

While an over-fitted model will have very low fitting error, it will have poor predictive ability. To test the predictive abilities of the models, they were evaluated against the validation benchmarks. Since these benchmarks were not used in the linear regression, the model cannot "remember" anything about them. Histograms of the percentage error of the CPU and GPU models when applied to the validation benchmarks can be seen in Figures 5.6 and 5.7 respectively. It is immediately evident from these histograms that certain models have quite poor predictive performance.

While the CPU models performed worse in training, they were less prone to over-fitting than the GPU models. The best model after training has a predictive error rate of 44%, and it is the same model found when looking only at training data. It is summarized in Table 5.7. Note that this model does not include any statistics obtained from Multi2Sim, only those obtained using performance counters.

When examining the GPU models, we see that while they seemed more accurate when looking at the training error, many of the models are useless when it comes to predictions. Interestingly, what were previously considered our best models now perform among the worst, overestimating power consumption by more than 95×.

The best model when it comes to predictive error is now one of the simplest ones. It considers only the number of cores and the number of execution cycles. This is similar to the integrated GPU power model found by Bircher and John [59]. This model is still very accurate, with a predictive error rate of 7.8% and a training error rate of 10.3%. It has the same global static power as the best training model, 14.1 W, but per-core static power is down to 0.05 W. The model is summarized in Table 5.8.

Table 5.8: GPU Model Coefficients
Predictor     Coefficient
Cycles        8.931e−10
Static Power
Global        14.1 W
Per-core      0.05 W

Figure 5.8: Linear regression of workloads at various frequencies. (Average power in W versus CPU frequency in MHz for the APP SDK training benchmarks.)

Since the GPU model contains only positive coefficients, we can be fairly confident that the static power value captures only the true static power. The CPU model has some negative coefficients, so it is possible that its static power contains a portion of the average dynamic power, since total power consumption could otherwise fall below the claimed static power. Another way to capture static power is to measure the power of the real hardware at various frequencies, since static power is leakage power. Using the Butts and Sohi model [109], power can be represented by equation 5.8, and static and dynamic power by equations 5.9 and 5.10 respectively. V_{cc} is the supply voltage, N is the number of transistors, and k_{design} is a circuit-specific value that depends on transistor sizing and the sizing/count of NMOS transistors relative to PMOS transistors. α is the activity factor and represents the percentage of cycles in which switching occurred, C is the load capacitance, and f is the operating frequency. One can see that only dynamic power is affected by the clock frequency f and that P_{stat} can only vary with V_{cc} for a given design.

If one could reduce the clock frequency to zero without modifying V_{cc}, one could measure P_{stat} directly. Unfortunately, that is not practical, but one can run the same workload at varying frequencies and then perform a linear regression; the intercept will be P_{stat}. The same workload is used to ensure the switching factor α remains constant, and the capacitance C is a property of the chip, so it remains constant as well. However, \hat{I}_{leak} varies with temperature, so the workload should be short to avoid thermal buildup in the chip, which would increase static power.

P = V_{cc} N \hat{I}_{leak} k_{design} + \frac{1}{2} \alpha C V_{cc}^2 f \quad (5.8)

P_{stat} = V_{cc} N \hat{I}_{leak} k_{design} \quad (5.9)

P_{dyn} = \frac{1}{2} \alpha C V_{cc}^2 f \quad (5.10)

There are two ways to modify the operating frequency of the Fusion. The first consists of modifying the base clock (bclk) from which all the other clocks on the chip are derived. This is the ideal way of modifying the frequency, since all the clock domains on the chip change simultaneously. However, there was very little play in the bclk for two reasons: the motherboard did not allow any value below the default of 100 MHz, and bclk values above 105 MHz caused the system to become unstable. It would not boot at frequencies above 110 MHz. This issue appears to be due to the SATA controller, which communicates over the PCIe bus, not handling the change in bus speed [110].

Instead of using the bclk, it is possible to modify the frequencies of the CPU and GPU directly. It was possible to down-clock the CPU to frequencies ranging from 1.6 GHz to 2.4 GHz and the GPU to frequencies from 400 MHz to 600 MHz. Three points were selected in this range, each 400 MHz apart for the CPU and 100 MHz apart for the GPU. The frequencies were confirmed using /proc/cpuinfo for the CPU and the aticonfig --od-getclocks command for the GPU. The ten APP SDK benchmarks that belonged to the training set were run at each frequency and the average power consumption was measured. The median of three measurements was taken and a linear regression was fitted to each benchmark, as shown in Figure 5.8. The resulting static power lies between 21.3 and 26.4 W, with a mean value of 24.8 W. This is much lower than the 34.5 W found during the regression process.
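A sketch of that extrapolation is shown below; the frequency and power values are made-up placeholders, not the measured data.

```python
import numpy as np

freqs_mhz = np.array([1600.0, 2000.0, 2400.0])   # CPU frequencies used (placeholder)
avg_power_w = np.array([28.5, 31.2, 34.0])       # hypothetical per-workload averages

slope, intercept = np.polyfit(freqs_mhz, avg_power_w, deg=1)
print(f"estimated static power: {intercept:.1f} W")   # power extrapolated to f = 0
```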

Based on these results, the CPU models were reformulated with a static power of 24.8 W. Only three of the new models have a validation error of less than 100%. They were based on the total number of dispatched, issued, or committed instructions. The best model, both in terms of training and validation error, is the one based on committed instructions. It has a training error of 61% and a validation error of 74%. This model is summarized in Table 5.9.

Once the two processor models were generated, it was possible to combine them into an APU-wide model. When doing this, the dynamic power consumption of each processor can be summed, but the same cannot be done for the static power, as shared components would be double counted. The chip-wide static power used was based on the idle power consumption of 9.2 W, as shown in Figure 4.1 in Chapter 4.

Table 5.9: CPU Model Coefficients
Predictor                      Coefficient
Committed CPU instructions     1.08e−9
Static Power
Global                         24.8 W
Per-core                       4.14 W

Table 5.10: APU Model Coefficients
Predictor                      Coefficient
CFEngine cycles                8.931e−10
Committed CPU instructions     1.08e−9
Static Power
Chip wide                      9.2 W
CPU global                     15.6 W
CPU per-core                   4.1 W
GPU global                     4.2 W
GPU per-core                   0.05 W

This value is subtracted from the static power of both devices. If the model used any main memory predictors, they would need to agree, since the CPU and GPU on the Fusion share a memory controller. The final model is summarized in Table 5.10.

This model was validated using the Rodinia heterogeneous benchmarks. Figure 5.9 shows the predicted and true values of total energy consumption for the four Rodinia benchmarks. The predicted values trend well, but tend to overestimate the total power consumption. The mean relative error of the model is only 6.9%, so it is very accurate over this diverse set of benchmarks. The error is lowest for the LU decomposition benchmark, at 2.5%, and highest for the Needleman-Wunsch benchmark, at 11.7%, which is still rather low. The biggest difference between these two benchmarks is the relative length of the GPU portion of the code, which is much longer for LU decomposition. Since the GPU portion of the model is more accurate than the CPU portion, this behaviour is expected.

Figure 5.9: Predicted and true values for the total energy of the Rodinia benchmarks. (Total energy for k-means, LUD, NW, and Pathfinder.)

5.5 Conclusion

This chapter described the power modelling process used to create a power model for the Fusion APU. The main steps were: selecting a set of representative micro-benchmarks and diverse application benchmarks; configuring the Multi2Sim simulator to approximate the Fusion APU as closely as possible; and, finally, using multiple linear regression to develop a statistical model. The final power model for the APU has a low mean relative error of 6.9%.

This approach is not unique to the Fusion, or to heterogeneous processors for that matter. This methodology can be used to create statistical power models for any type of processor. However, a cycle-accurate simulator and reference hardware are required. What constitutes a set of representative micro-benchmarks may also vary between processors; the methodology presented in Chapter 4 can be used to develop a set of micro-benchmarks for different hardware.

The final model created is decomposable, so it is possible to conduct a limited architectural exploration with it. The model considers only the core count and execution cycles or committed instructions. This is because, when operating, the power consumption of a core does not vary much, especially for the GPU. The increased energy consumption of certain benchmarks is primarily due to the fact that they run longer and do not make efficient use of the architecture. One of the best ways to increase the energy efficiency of a program is to increase instruction-level parallelism and take advantage of vector operations. In general, the faster a program runs, the less energy it will consume, because we found that static power was greater than dynamic power. This means that simply powering on a device consumes more power than is consumed doing useful work, no matter how much work is being done. We also found that the GPU consumes between half and one third of the power of the CPU. This means that work can take up to three times as long to complete on the GPU but still be more power efficient. Due to the simplicity of the model for the APU, architectural exploration would be limited to varying the operating frequency or core count of the CPU and GPU. However, this does open the door to interesting research on DVFS or power budgeting techniques for the Fusion. Questions such as how much of the total power should be allocated to each processor, or which processor should be throttled down when maximum power is exceeded, can now be considered.

Chapter 6

Power Multi2Sim

Once a power model for the Fusion APU was created, Multi2Sim was modified in order to support power modelling. Power modelling support was implemented in such a way that it is not only possible to get power numbers from the simulator after simulation is complete, but so that the simulated program itself can request power information during execution. The power model used by the simulator is also configurable, so it is not tied directly to the power model developed in the previous chapter.

When a program is simulated with power modelling enabled, the simulator tracks the average power consumption and reports it in the simulation summary. It is also possible to log instantaneous power consumption throughout the execution to a file. If a simulated program wants power information, it can use Multi2Sim-specific system calls that return power information. To offer a wide range of power statistics in a general way, an epoch-based approach to power modelling was chosen. This approach is inspired by the way PMCs work on real CPUs, where users can specify which performance counters they want to monitor and are free to start and stop them whenever they like. The epoch approach allows power and energy consumption to be calculated over an arbitrary window of the program. Multi2Sim uses this feature internally to calculate power over short periods of time to estimate instant power, and over the entire length of the program to report average power for the entire program. Users are also able to define their own epochs to measure power in whatever part of the program they like.

This chapter explains how power modelling support is implemented in Multi2Sim. Section 6.1 explains how epochs are implemented in Multi2Sim. Section 6.2 describes how simulated programs can take advantage of the power modelling. Finally, Section 6.3 analyzes simulation results obtained using Multi2Sim with power modelling.

6.1 Epochs

The power modelling capabilities added to Multi2Sim were inspired by the PMCs found in modern processors. However, there is one big difference between PMCs and a simulator. On real hardware, the number of PMCs is limited, but in a simulator there is no such limitation. Therefore, instead of having a single "power counter," we can have an unlimited number of power epochs. An epoch represents a period in the execution of a program for which we want power information.

Epochs can be overlapping and concurrently active.

Epochs can be used both in the simulated program, using special syscalls, and by Multi2Sim internally. Each epoch is identified by an ID and the type of device it is for. Currently, epochs are supported for devices of type all, x86, Evergreen, and memory, since Multi2Sim models the memory hierarchy separately from the processor. Each epoch also contains a flag to indicate whether or not it is still running, and it keeps track of the cycle it started and ended. The most important components of an epoch are its activity snapshots.

Snapshots contain all the activity counts associated with the device. One snapshot is taken at the beginning of the epoch and another at the end. The snapshots are device specific, which is why each epoch is associated with a device type. An XXX_stat_take_snapshot function, where XXX is the device name, was added for each device. As an example, when we take a snapshot of the x86 device, we record the number of accesses to each functional unit, as well as counts of the different instruction types dispatched, issued, and committed.

Once the epoch has ended, both the start and end fields will have been populated. By subtracting the values of the first snapshot from the second, we get counts for what took place during the epoch. This then allows us to calculate energy and power consumption during the epoch. Total energy consumed is calculated in two parts. First, static energy is calculated by multiplying the static power by the elapsed time, which is derived from the elapsed cycles. Then, dynamic energy is calculated by multiplying the instruction and hardware activity counts by their per-event energy consumption. Power is then calculated by dividing total energy by time.
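The sketch below illustrates this calculation; the class layout and statistic names are simplified and are not Multi2Sim's actual structures.

```python
class Epoch:
    """Simplified model of a power epoch: two snapshots plus start/end cycles."""
    def __init__(self, start_cycle, start_snapshot):
        self.start_cycle, self.start = start_cycle, dict(start_snapshot)
        self.end_cycle, self.end = None, None

    def finish(self, end_cycle, end_snapshot):
        self.end_cycle, self.end = end_cycle, dict(end_snapshot)

    def energy_j(self, static_power_w, energy_per_event_j, frequency_hz):
        # Static energy over the elapsed time, plus per-event dynamic energy.
        elapsed_s = (self.end_cycle - self.start_cycle) / frequency_hz
        static = static_power_w * elapsed_s
        dynamic = sum((self.end[k] - self.start[k]) * energy_per_event_j.get(k, 0.0)
                      for k in self.end)
        return static + dynamic

    def average_power_w(self, static_power_w, energy_per_event_j, frequency_hz):
        elapsed_s = (self.end_cycle - self.start_cycle) / frequency_hz
        return self.energy_j(static_power_w, energy_per_event_j, frequency_hz) / elapsed_s
```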

To report total average power consumption for each simulation, an epoch is started for each device at the beginning of the simulation and ended when the simulation completes. Bookkeeping is minimized by only considering the beginning and end of the entire simulation. In order to calculate instant power, there is also always a short-period epoch running. Its exact length can be configured, but the default is 128 k cycles. When this epoch completes, an instant power value is calculated and a new epoch of the same length is started. These instant power values are logged to a file when detailed power reporting is enabled.

6.2 Using Power Modelling

There are a few steps involved in using the power modelling feature added to Multi2Sim. First, the model must be configured. Then, there are two main ways to use it: either during execution or through reports. This section describes how these steps are performed.

6.2.1 Configuration

The configuration files for the power model are based on Multi2Sim's existing device configuration files. Static power consumption and per-cycle energy consumption can be specified. Each architectural component can also be assigned a per-access energy consumption, as can each instruction type.

Multi2Sim’s existing configuration reader is used and all values default to zero if not present in the configuration. This allows any desired combination of factors to be easily considered, without requiring detailed configuration files. Since we are specifying per-operation energy, dynamic power will scale with frequency. Static power remains constant with frequency, so it is specified as power not as energy.
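As an illustration of the default-to-zero behaviour, the sketch below reads a hypothetical power section with Python's configparser; the section and key names are invented for the example and are not the identifiers used by the modified simulator.

```python
import configparser

EXAMPLE = """
[x86]
StaticPower = 15.6
IntAdd.EnergyPerAccess = 1.1e-10
"""

config = configparser.ConfigParser()
config.read_string(EXAMPLE)

def energy(section, key):
    # Anything not present in the configuration is treated as zero.
    return config.getfloat(section, key, fallback=0.0)

print(energy("x86", "IntAdd.EnergyPerAccess"))     # 1.1e-10
print(energy("x86", "FloatMult.EnergyPerAccess"))  # 0.0 (unspecified)
```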

In order to simulate C-states, it is also possible to set a threshold power. If the calculated power during an epoch falls below the threshold power, it is assumed the processor has been put to sleep due to inactivity. When this is the case, we assume the core is dissipating zero power. This also includes static power. The threshold can be set to any value, but by default it is assumed that processors cannot be put to sleep.

Configuring power for the memory is a little bit different. In Multi2Sim, there are three main parts to memory configuration: the cache geometry, the topology, and the links. The geometry specifies features such as cache line size, associativity, and access latency. The topology specifies the number of caches of each geometry and at which level in the hierarchy they find themselves. The links specify the bandwidth between the caches. Power consumption is specified per-geometry and it is assumed that caches with the same geometry will exhibit the same type of power consumption.

When a memory snapshot is taken, the statistics for each module, either a cache or main memory, must be recorded. This means that rather than a single snapshot, the epoch for the memory device contains a list of snapshots. Energy is then calculated for each module based on its geometry. The sum of all the modules' energy is the total energy consumption for the memory device.

6.2.2 Runtime Usage

In order to allow a running program to take advantage of the power modelling capabilities, three new system calls were added to Multi2Sim. These allow programs to either get instant power consumption numbers or manage user-specified epochs. The instant power syscall takes a device type as an argument and returns the saved current power value for that device. It is also possible to manage an epoch directly, with one syscall to start an epoch and another to end it; this is also device specific. Once the epoch is ended, its energy consumption is calculated and returned. It is possible to have multiple outstanding user-specified epochs. These syscalls were all assigned to unused syscall codes.

6.2.3 Reports

If we do not want to modify the code being simulated, it is possible to use Multi2Sim’s power reporting capabilities. A simulation-wide summary is given at the end of any simulation. If power modelling is enabled, this summary also includes average power. Average power is provided for the whole chip, and also for each device that used a detailed simulation.

For more information, it is also possible to generate detailed reports. In this case, instant power values are written to the file as they are updated. This allows power consumption throughout the simulation to be tracked at a fine granularity. For example, if we model the Fusion CPU, which runs at 2.6 GHz, and update the current power every 128 k cycles, we simulate power measurements with a sampling frequency of about 20 kHz. This is fast enough to capture small changes in power consumption.

6.3 Validation

The power modelling in Multi2Sim was validated using the back propagation benchmark from the Rodinia benchmark suite. Multi2Sim was configured to simulate the Fusion APU. Power consumption was configured using the model created in Chapter 5. Since the Fusion uses C-states, threshold values were used to simulate cores being put to sleep.

Figure 6.1: Measured power consumption of back propagation on real hardware. (Power in W over time; the dips corresponding to GPU kernel execution are marked.)

Figure 6.2: Simulated power consumption of back propagation using Multi2Sim. (Power in W over time; the dips corresponding to GPU kernel execution are marked.)

Figures 6.1 and 6.2 show the power consumption over time of the back propagation benchmark, on real hardware and in Multi2Sim respectively. Power consumption in the simulation is much more stable, and the times do not line up due to inaccuracies in Multi2Sim; nevertheless, both figures show the same trend. We can clearly see the two large dips in power consumption, marked in the figures, that occur when kernels are executed on the GPU. We can also see that in both cases power consumption dips slightly before each kernel is run, and then spikes very briefly just before the kernel is launched. The simulated power is also slightly higher than the measured power, which is expected since this power model tends to overestimate power consumption. This validates not only the implementation of power modelling in Multi2Sim, but also further validates the power model itself.

6.4 Conclusion

Power modelling in Multi2Sim was implemented in a very flexible way. It is possible to configure energy consumption for almost every performance statistic tracked by the simulator. This allows a wide range of power models to be implemented within Multi2Sim.

There are multiple options when it comes to getting power numbers out of Multi2Sim. Total runtime summaries are provided at the end of each simulation but it is also possible to generate detailed reports that sample instant power at any specified rate. It is possible for the code being simulated to access power numbers during execution. This makes it possible to simulate power-aware algorithms using Multi2Sim.

Power modelling in Multi2Sim was validated using the back propagation benchmark. While the timing did not line up perfectly due to inaccuracies in the simulator, the power trends clearly line up well. With power modelling support, it will be possible to use Multi2Sim to conduct new research into power consumption, both at an architectural and a software level.

Chapter 7

Conclusion and Future Work

The work presented in this thesis can be broken down into two parts: DistCL, a distributed OpenCL runtime that was used to analyze the overheads involved in distributing an OpenCL kernel across a cluster, and a power model for the Fusion APU and the addition of power modelling support to Multi2Sim. The main contributions of this thesis are:

1. An analysis of performance scaling for distributed OpenCL kernels using two approaches.

2. A systematic methodology to create a representative set of power micro-benchmarks using data collected from real hardware.

3. Creating the first power model for a CPU/GPU heterogeneous processor.

4. Adding configurable power modelling capabilities to the Multi2Sim heterogeneous architecture simulator.

DistCL allows a cluster containing multiple OpenCL devices to be programmed as if it were a single OpenCL device. To do this, DistCL uses meta-functions that abstract away the details of the cluster and allow the programmer to focus on the algorithm being distributed. Speedups of up to 29× on 32 peers were demonstrated. DistCL was also compared to SnuCL, another open-source framework that allows OpenCL kernels to be distributed. For compute intensive benchmarks, performance between DistCL and SnuCL is comparable, but otherwise, SnuCL shows better performance. This difference cannot be fully attributed to the overhead of meta-functions, which account for a very small portion of the runtime. It is mostly due to DistCL requiring tighter synchronization between nodes and the fact that SnuCL uses CUDA under the hood. By analyzing how kernels are distributed using two different frameworks, we show that the compute-to-transfer ratio of a benchmark is the best predictor of performance scaling.

DistCL could be extended in a few interesting ways. It has been demonstrated that compiler analysis can be used to determine the memory dependencies of OpenCL kernels [46]. By adding this type of compiler analysis to DistCL, it would be possible to distribute unmodified OpenCL kernels without having to supply any additional information. This would make the entire process fully transparent. In this case, it would be useful if there were a way to automatically determine whether a kernel would even see a speedup when distributed. This may be as simple as calculating the compute-to-transfer ratio of a kernel, but this question would benefit from more thorough investigation.

Finally, what partitioning strategy to use for which kernel remains an open question. In DistCL, we assume that most memory accesses will be performed in row-major order, but that will not always be the case. A way of automatically determining the best partitioning strategy would allow even more kernels to see speedups when distributed.

The power model for the Fusion APU was created using a systematic methodology. Power measurements for over 1600 benchmarks run on real hardware were used to generate a representative set of compute micro-benchmarks consisting of just over 300 benchmarks. This was done using an implementation of the G-means clustering algorithm. These micro-benchmarks, along with memory and application benchmarks, were then used to create a power model for the APU. Multiple linear regression was used to create a decomposable statistical model. Validated against four heterogeneous benchmarks, the model's mean error rate was only 6.9%. This approach is not unique to the Fusion and can be used to create power models for any type of processor.

The Multi2Sim simulator was modified to add support for power modelling. It provides a flexible way of configuring the power model, which allows a wide range of power models to be simulated. Power statistics can also be obtained by workloads being simulated using the syscall interface. This allows Multi2Sim to simulate power aware algorithms. We show that the power modelling in Multi2Sim is capable of capturing the power trends present in real programs.

With the combination of a decomposable power model and the support for power-aware algorithms, the door is opened for some interesting new research, both at an architectural and a software level. Further research looking at different power budgeting schemes between the CPU and GPU could be performed using the new capabilities of Multi2Sim. These capabilities also allow future work examining scheduling policies with runtime and/or power constraints, such as research determining the best way of distributing computation on a heterogeneous processor using metrics that consider both power and performance. These future directions could also be applicable in shared hosting environments, where there can be multiple constraints, such as power, memory, or cores, but where high performance is still desired. The ubiquity of processors in our technology guarantees that research in this direction is paramount, as we seek to improve performance while reducing energy consumption in all our devices.

Bibliography

[1] J. L. Hennessy and D. A. Patterson, Computer architecture: a quantitative approach, 4th ed. Morgan Kaufmann, 2007.

[2] Nvidia Corporation, “CUDA programming guide,” 2008.

[3] Khronos OpenCL Working Group and others, “The OpenCL specification,” A. Munshi, Ed, pp. 1–302, 2008.

[4] A. S. Bland, J. Wells, O. E. Messer, O. Hernandez, and J. Rogers, “Titan: Early experience with the Cray XK6 at oak ridge national laboratory,” Cray User Group, 2012.

[5] A. Branover, D. Foley, and M. Steinman, “AMD Fusion APU: Llano,” Micro, IEEE, vol. 32, no. 2, pp. 28–37, 2012.

[6] L. Gwennap, “Sandy bridge spans generations,” Microprocessor Report, vol. 9, no. 27, pp. 10–01, 2010.

[7] W. Wolf, A. A. Jerraya, and G. Martin, “Multiprocessor system-on-chip (MPSoC) technology,” Computer-Aided Design of Integrated Circuits and Systems, IEEE Transactions on, vol. 27, no. 10, pp. 1701–1713, 2008.

[8] H. Esmaeilzadeh, E. Blem, R. S. Amant, K. Sankaralingam, and D. Burger, “Dark silicon and the end of multicore scaling,” in Proc. of Int. Symp. on Computer Architecture (ISCA). IEEE, 2011, pp. 365–376.

[9] T. Diop, S. Gurfinkel, J. Anderson, and N. Enright Jerger, “DistCL: A framework for the distributed execution of OpenCL kernels,” in Proc. of the Int. Symp. on Modeling, Analysis & Simulation of Computer and Telecommunication Systems (MASCOTS). IEEE, 2013.

[10] R. Ubal, B. Jang, P. Mistry, D. Schaa, and D. Kaeli, “Multi2Sim: a simulation framework for CPU-GPU computing,” in Proc. of the Int. Symp. on Parallel architectures and compilation techniques (PACT). ACM, 2012, pp. 335–344.

[11] J. Kim, S. Seo, J. Lee, J. Nah, G. Jo, and J. Lee, “SnuCL: an OpenCL framework for heterogeneous CPU/GPU clusters,” in Proc. of the Int. Conf. on Supercomputing (ICS), 2012, pp. 341–352.


[12] C. Loken, D. Gruner, L. Groer, R. Peltier, N. Bunn, M. Craig, T. Henriques, J. Dempsey, C.-H. Yu, J. Chen et al., “SciNet: Lessons learned from building a power-efficient top-20 system and data centre,” vol. 256, no. 1, p. 012026, 2010.

[13] V. Lee, C. Kim, J. Chhugani, M. Deisher, D. Kim, A. Nguyen, N. Satish, M. Smelyanskiy, S. Chennupaty, P. Hammarlund et al., “Debunking the 100x GPU vs. CPU myth: an evaluation of throughput computing on CPU and GPU,” in ACM SIGARCH Computer Architecture News, vol. 38, no. 3. ACM, 2010, pp. 451–460.

[14] E. Lindholm, J. Nickolls, S. Oberman, and J. Montrym, “Nvidia tesla: A unified graphics and computing architecture,” vol. 28, no. 2. IEEE, 2008, pp. 39–55.

[15] T. M. Aamodt, “Architecting graphics processors for non-graphics compute acceleration,” in Pacific Rim Conf. on Communications, Computers and Signal Processing (PacRim). IEEE, 2009, pp. 963–968.

[16] A. Bakhoda, G. Yuan, W. Fung, H. Wong, and T. Aamodt, “Analyzing CUDA workloads using a detailed GPU simulator,” in Int. Symp. on Performance Analysis of Systems and Software (ISPASS). IEEE, 2009, pp. 163–174.

[17] AMD, Evergreen Family Instruction Set Architecture Instructions and Microcode, 2011. [Online]. Available: http://developer.amd.com/wordpress/media/2012/10/AMD_Evergreen-Family_Instruction_Set_Architecture.pdf

[18] N. Brookwood, “AMD white paper: AMD Fusion family of APUs,” Technical report, AMD Corporation, Tech. Rep., 2010.

[19] ——, “AMD Fusion family of APUs: enabling a superior, immersive pc experience,” Insight, vol. 64, no. 1, pp. 1–8, 2010.

[20] AMD, Software Optimization Guide for AMD Family 10h and 12h Processors, 2010. [Online]. Available: http://developer.amd.com/resources/documentation-articles/developer-guides-manuals/

[21] PCI SIG, PCI express base 2.0 specification: includes PCI express base 2.0 and PCI express card electromechanical 2.0 specifications. Wakefield, MA.: PICMG, 2008.

[22] J. Sheaffer, D. Luebke, and K. Skadron, “A flexible simulation framework for graphics architectures,” in Proc. of the Conf. on Graphics hardware. ACM, 2004, pp. 85–94.

[23] V. M. Del Barrio, C. González, J. Roca, A. Fernández, and E. Espasa, “ATTILA: a cycle-level execution-driven simulator for modern GPU architectures,” in Proc. of the Int. Symp. on Performance Analysis of Systems and Software (ISPASS). IEEE, 2006, pp. 231–241.

[24] F. Hill and S. Kelley, Computer Graphics Using OpenGL, 3/E. Pearson, 2007.

[25] R. Ubal, J. Sahuquillo, S. Petit, and P. López, “Multi2Sim: A simulation framework to evaluate multicore-multithread processors,” in Proc. of the Int. Symp. on Computer Architecture and High Performance Computing, 2007, pp. 62–68.

[26] C. Barton, S. Chen, Z. Chen, T. Diop, X. Gong, S. Gurfinkel, B. Jang, D. Kaeli, P. López, N. Materise, R. Miftakhutdinov, P. Mistry, S. Petit, J. Sahuquillo, D. Schaa, S. Shukla, R. Ubal, Y. Ukidave, M. Wilkening, N. Rubin, A. Shen, T. Swamy, and A. K. Ziabari, The Multi2Sim Simulation Framework, 2013. [Online]. Available: http://www.multi2sim.org/

[27] W. W. Fung, I. Sham, G. Yuan, and T. M. Aamodt, “Dynamic warp formation and scheduling for efficient GPU control flow,” in Proc. of the Int. Symp. on Microarchitecture (MICRO). IEEE Computer Society, 2007, pp. 407–420.

[28] V. Zakharenko, T. Aamodt, and A. Moshovos, “Characterizing the performance benefits of fused CPU/GPU systems using FusionSim,” in Proc. of the Conf. on Design, Automation and Test in Europe (DATE). EDA Consortium, 2013, pp. 685–688.

[29] M. T. Yourst, “PTLsim: A cycle accurate full system x86-64 microarchitectural simulator,” in Proc. of Int. Symp. on Performance Analysis of Systems & Software (ISPASS). IEEE, 2007, pp. 23–34.

[30] H. Wang, V. Sathish, R. Singh, M. J. Schulte, and N. S. Kim, “Workload and power budget partitioning for single-chip heterogeneous processors,” in Proc. of Int. conf. on Parallel architectures and compilation techniques (PACT). ACM, 2012, pp. 401–410.

[31] N. Binkert, B. Beckmann, G. Black, S. K. Reinhardt, A. Saidi, A. Basu, J. Hestness, D. R. Hower, T. Krishna, S. Sardashti et al., “The gem5 simulator,” ACM SIGARCH Computer Architecture News, vol. 39, no. 2, pp. 1–7, 2011.

[32] J. Nickolls and W. J. Dally, “The GPU computing era,” Micro, IEEE, vol. 30, no. 2, pp. 56–69, 2010.

[33] S. Ryoo, C. Rodrigues, S. Baghsorkhi, S. Stone, D. Kirk, and W. Hwu, “Optimization principles and application performance evaluation of a multithreaded GPU using CUDA,” in Proc. of the Symp. on Principles and practice of parallel programming (PPoPP), 2008, pp. 73–82.

[34] V. Volkov and J. W. Demmel, “Benchmarking GPUs to tune dense linear algebra,” in Proc. of the Int. Conf. on Supercomputing (ICS). ACM/IEEE, 2008, p. 31.

[35] A. Gaikwad and I. M. Toke, “Parallel iterative linear solvers on GPU: a financial engineering case,” in Proc. of the Int. Conf. on Parallel, Distributed and Network-Based Processing (PDP). IEEE, 2010, pp. 607–614.

[36] J. E. Stone, D. J. Hardy, I. S. Ufimtsev, and K. Schulten, “GPU-accelerated molecular modeling coming of age,” Journal of Molecular Graphics and Modelling, vol. 29, no. 2, pp. 116–125, 2010.

[37] X.-J. Yang, X.-K. Liao, K. Lu, Q.-F. Hu, J.-Q. Song, and J.-S. Su, “The TianHe-1A supercomputer: its hardware and software,” Journal of Computer Science and Technology, vol. 26, no. 3, pp. 344–351, 2011.

[38] W. Gropp, E. Lusk, and A. Skjellum, Using MPI: Portable Parallel Programming with the Message-Passing Interface, 2nd edition. MIT Press, 1999.

[39] J. Duato, A. Pena, F. Silla, R. Mayo, and E. Quintana-Orti, “rCUDA: Reducing the number of GPU-based accelerators in high performance clusters,” in Proc. of the Int. Conf. on High Performance Computing and Simulation (HPCS). IEEE, 2010, pp. 224–231.

[40] A. Barak, T. Ben-Nun, E. Levy, and A. Shiloh, “A package for OpenCL based heterogeneous computing on clusters with many GPU devices,” in Proc. of the Int. Conf. on Cluster Computing. IEEE, 2010, pp. 1–7.

[41] R. Aoki, S. Oikawa, T. Nakamura, and S. Miki, “Hybrid OpenCL: Enhancing OpenCL for distributed processing,” in Proc. of the Int. Symp. on Parallel and Distributed Processing with Applications (ISPA), 2011, pp. 149–154.

[42] P. Kegel, M. Steuwer, and S. Gorlatch, “dOpenCL: Towards a uniform programming approach for distributed heterogeneous multi-/many-core systems,” in Proc. of the Int. Symp. on Parallel and Distributed Processing (PDP) Workshops & PhD Forum. IEEE, 2012, pp. 174–186.

[43] B. Eskikaya and D. Altılar, "Distributed OpenCL: Distributing OpenCL platform on network scale," Int'l Journal of Computer Applications, vol. ACCTHPCA, no. 2, pp. 25–30, July 2012.

[44] A. Alves, J. Rufino, A. Pina, and L. P. Santos, "clOpenCL - supporting distributed heterogeneous computing in HPC clusters," in Euro-Par: Parallel Processing Workshops. Springer, 2013, pp. 112–122.

[45] M. Strengert, C. Müller, C. Dachsbacher, and T. Ertl, "CUDASA: Compute unified device and systems architecture," in Proc. of the Symp. on Parallel Graphics and Visualization (EGPGV), 2008, pp. 49–56.

[46] J. Kim, H. Kim, J. Lee, and J. Lee, “Achieving a single compute device image in OpenCL for multiple GPUs,” in Proc. of the Symp. on Principles and practice of parallel programming (PPoPP), 2011, pp. 277–288.

[47] S. Che, M. Boyer, J. Meng, D. Tarjan, J. Sheaffer, S. Lee, and K. Skadron, “Rodinia: A benchmark suite for heterogeneous computing,” in Proc. of the Int. Symp. on Workload Characterization (IISWC). IEEE, 2009, pp. 44–54.

[48] S. Che, J. W. Sheaffer, M. Boyer, L. G. Szafaryn, L. Wang, and K. Skadron, “A characterization of the Rodinia benchmark suite with comparison to contemporary CMP workloads,” in Proc. of the Int. Symp. on Workload Characterization (IISWC). IEEE, 2010, pp. 1–11.

[49] AMD, “AMD accelerated parallel processing (APP) SDK,” 2011. [Online]. Available: http://developer.amd.com/sdks/amdappsdk/pages/default.aspx

[50] Free Software Foundation, “GNU libgcrypt,” 2011. [Online]. Available: http://www.gnu.org/ software/libgcrypt

[51] S. Nakamoto, “Bitcoin: A peer-to-peer electronic cash system,” 2009. [Online]. Available: www.bitcoin.org

[52] J. M. Mellor-Crummey and M. L. Scott, "Algorithms for scalable synchronization on shared-memory multiprocessors," ACM Transactions on Computer Systems (TOCS), vol. 9, no. 1, pp. 21–65, 1991.

[53] J. Fang, A. L. Varbanescu, and H. Sips, “A comprehensive performance comparison of CUDA and OpenCL,” in Proc. of Int. Conf. on Parallel Processing (ICPP). IEEE, 2011, pp. 216–225.

[54] K. Karimi, N. G. Dickson, and F. Hamze, "A performance comparison of CUDA and OpenCL," CoRR, vol. abs/1005.2581, 2010.

[55] J. L. Henning, “SPEC CPU2006 benchmark descriptions,” ACM SIGARCH Computer Architecture News, vol. 34, no. 4, pp. 1–17, 2006.

[56] A. Phansalkar, A. Joshi, and L. K. John, “Analysis of redundancy and application balance in the SPEC CPU2006 benchmark suite,” in ACM SIGARCH Computer Architecture News, vol. 35, no. 2. ACM, 2007, pp. 412–423.

[57] H. Wong, M. Papadopoulou, M. Sadooghi-Alvandi, and A. Moshovos, "Demystifying GPU microarchitecture through microbenchmarking," in Proc. of the Int. Symp. on Performance Analysis of Systems & Software (ISPASS). IEEE, 2010, pp. 235–246.

[58] S. Hong and H. Kim, “An integrated GPU power and performance model,” in ACM SIGARCH Computer Architecture News, vol. 38, no. 3. ACM, 2010, pp. 280–289.

[59] W. Bircher and L. John, "Complete system power estimation using processor performance events," IEEE Transactions on Computers, vol. 61, no. 4, pp. 563–577, April 2012.

[60] DATAQ Instruments, “DI-145 USB data acquisition starter kit,” 2012.

[61] ——, “DI-149 8-channel USB data acquisition starter kit,” 2012.

[62] J. Leng, T. Hetherington, A. ElTantawy, S. Gilani, N. S. Kim, T. M. Aamodt, and V. J. Reddi, "GPUWattch: Enabling energy optimizations in GPGPUs," in Proc. of the Int. Symp. on Computer Architecture (ISCA), vol. 40, 2013.

[63] MSI, A75MA-G55 Mainboard User Guide, 2011.

[64] Intel, ATX Specification, 2003.

[65] Allegro MicroSystems, "Hall effect linear current sensor with overcurrent fault output for <100 V isolation applications," ACS711 datasheet, 2008.

[66] Fairchild Semiconductor, “Generation II XS DrMos family,” 2011.

[67] AMD, “Why is AMD CPU/APU not always running at full speed?” Aug. 2012. [Online]. Available: http://support.amd.com/us/kbarticles/Pages/CPUAPUnotrunningatfullspeed.aspx

[68] Hewlett-Packard, Intel, Microsoft, Phoenix, and Toshiba, Advanced Configuration and Power Interface Specification, Dec. 2011. [Online]. Available: http://www.acpi.info/spec50.htm

[69] A. Bona, M. Sami, D. Sciuto, C. Silvano, V. Zaccaria, and R. Zafalon, “Energy estimation and optimization of embedded VLIW processors based on instruction clustering,” in Proc. of the Design Automation Conf. (DAC). IEEE, 2002, pp. 886–891.

[70] R Development Core Team, R: A Language and Environment for Statistical Computing, R Foundation for Statistical Computing, Vienna, Austria, 2011, ISBN 3-900051-07-0. [Online]. Available: http://www.R-project.org/

[71] J. A. Hartigan, Clustering algorithms. John Wiley & Sons, Inc., 1975.

[72] G. Hamerly and C. Elkan, "Learning the k in k-means," in Advances in Neural Information Processing Systems 16, S. Thrun, L. Saul, and B. Schölkopf, Eds. Cambridge, MA: MIT Press, 2004.

[73] M. A. Stephens, “EDF statistics for goodness of fit and some comparisons,” Journal of the American Statistical Association, vol. 69, no. 347, pp. 730–737, 1974.

[74] S. S. Shapiro and M. B. Wilk, “An analysis of variance test for normality (complete samples),” Biometrika, vol. 52, no. 3/4, pp. 591–611, 1965.

[75] H. Cramér, Sannolikhetskalkylen och några av dess användningar [The probability calculus and some of its applications]. Gjallarhornet, 1927.

[76] R. von Mises, Mathematical Theory of Probability and Statistics, vol. 1. New York: Academic Press, 1964.

[77] T. W. Anderson and D. A. Darling, “Asymptotic theory of certain goodness of fit criteria based on stochastic processes,” The Annals of Mathematical Statistics, vol. 23, no. 2, pp. 193–212, 1952.

[78] A. N. Pettitt, "Testing the normality of several independent samples using the Anderson-Darling statistic," Journal of the Royal Statistical Society, vol. 26, no. 2, pp. 156–161, 1977.

[79] J. A. Hartigan and M. A. Wong, "Algorithm AS 136: A k-means clustering algorithm," Applied Statistics, pp. 100–108, 1979.

[80] J. Neyman and E. S. Pearson, "On the use and interpretation of certain test criteria for purposes of statistical inference: Part I," Biometrika, pp. 175–240, 1928.

[81] T. Mudge, “Power: A first-class architectural design constraint,” Computer, vol. 34, no. 4, pp. 52–58, 2001.

[82] J. S. Chase, D. C. Anderson, P. N. Thakar, A. M. Vahdat, and R. P. Doyle, “Managing energy and server resources in hosting centers,” in ACM SIGOPS Operating Systems Review, vol. 35, no. 5. ACM, 2001, pp. 103–116.

[83] R. Buyya, A. Beloglazov, and J. Abawajy, “Energy-efficient management of data center resources for cloud computing: a vision, architectural elements, and open challenges,” in Proc. of the Int. Conf. on Parallel and Distributed Processing Techniques and Applications (PDPTA). CSREA Press, pp. 6–17.

[84] S. Gurumurthi, A. Sivasubramaniam, M. Irwin, N. Vijaykrishnan, and M. Kandemir, "Using complete machine simulation for software power estimation: The SoftWatt approach," in Proc. of the Int. Symp. on High-Performance Computer Architecture (HPCA). IEEE, 2002, pp. 141–150.

[85] D. Brooks, V. Tiwari, and M. Martonosi, "Wattch: a framework for architectural-level power analysis and optimizations," in ACM SIGARCH Computer Architecture News, vol. 28, no. 2. ACM, 2000, pp. 83–94.

[86] S. Li, J. Ahn, R. Strong, J. Brockman, D. Tullsen, and N. Jouppi, “McPAT: an integrated power, area, and timing modeling framework for multicore and manycore architectures,” in Proc. of the Int. Symp. on Microarchitecture (MICRO). IEEE/ACM, 2009, pp. 469–480.

[87] B. Lee and D. Brooks, “Accurate and efficient regression modeling for microarchitectural performance and power prediction,” in ACM SIGOPS Operating Systems Review, vol. 40, no. 5. ACM, 2006, pp. 185–194.

[88] B. Lee, J. Collins, H. Wang, and D. Brooks, “CPR: Composable performance regression for scalable multiprocessor models,” in Proc. of the Int. Symp. on Microarchitecture (MICRO). IEEE, 2008, pp. 270–281.

[89] R. Bertran, A. Buyuktosunoglu, M. Gupta, M. Gonzàlez, and P. Bose, "Systematic energy characterization of CMP/SMT processor systems via automated micro-benchmarks," in Proc. of the Int. Symp. on Microarchitecture (MICRO). IEEE, 2012, pp. 199–211.

[90] K. Ramani, A. Ibrahim, and D. Shimizu, “PowerRed: A flexible modeling framework for power efficiency exploration in GPUs,” in Proc. of the Workshop on General Purpose Processing on GPUs, GPGPU, 2007.

[91] J. Lucas, S. Lal, M. Andersch, M. Alvarez-Mesa, and B. Juurlink, "How a single chip causes massive power bills – GPUSimPow: A GPGPU power simulator," in Proc. of the Int. Symp. on Performance Analysis of Systems and Software (ISPASS). IEEE, 2013, pp. 97–106.

[92] G. Wang, “Power analysis and optimizations for GPU architecture using a power simulator,” in Proc. of the Int. Conf. on Advanced Computer Theory and Engineering (ICACTE), vol. 1. IEEE, 2010, pp. V1–619.

[93] Y. Zhang, Y. Hu, B. Li, and L. Peng, “Performance and power analysis of ATI GPU: A statistical approach,” in Int. Conf. on Networking, Architecture and Storage (NAS). IEEE, 2011, pp. 149–158.

[94] H. Wang and Q. Chen, “Power estimating model and analysis of general programming on GPU,” Journal of Software, vol. 7, no. 5, pp. 1164–1170, 2012.

[95] H. Nagasaka, N. Maruyama, A. Nukada, T. Endo, and S. Matsuoka, “Statistical power modeling of GPU kernels using performance counters,” in Proc. of the Int. Green Computing Conf. IEEE, 2010, pp. 115–122.

[96] Y. Zhang, D. Parikh, K. Sankaranarayanan, K. Skadron, and M. Stan, "HotLeakage: A temperature-aware model of subthreshold and gate leakage for architects," University of Virginia, Dept. of Computer Science, Tech. Report CS-2003-05, 2003.

[97] R. Bertran, M. Gonzàlez, X. Martorell, N. Navarro, and E. Ayguadé, "Counter-based power modeling methods: Top-down vs. bottom-up," The Computer Journal, vol. 56, no. 2, pp. 198–213, 2013.

[98] H.-S. Wang, X. Zhu, L.-S. Peh, and S. Malik, "Orion: a power-performance simulator for interconnection networks," in Proc. of the Int. Symp. on Microarchitecture (MICRO). IEEE, 2002, pp. 294–305.

[99] X. Ma, M. Dong, L. Zhong, and Z. Deng, “Statistical power consumption analysis and modeling for GPU-based computing,” in Proc. of ACM SOSP Workshop on Power Aware Computing and Systems (HotPower), 2009.

[100] S. Hong and H. Kim, "An analytical model for a GPU architecture with memory-level and thread-level parallelism awareness," ACM SIGARCH Computer Architecture News, vol. 37, no. 3, pp. 152–163, 2009.

[101] R. Bertran, M. G. Tallada, X. Martorell, N. Navarro, and E. Ayguade, “A systematic methodology to generate decomposable and responsive power models for CMPs,” IEEE Transactions on Computers, 2012.

[102] B. Lee and D. Brooks, "Spatial sampling and regression strategies," IEEE Micro, vol. 27, no. 3, pp. 74–93, 2007.

[103] C. Bienia, S. Kumar, J. P. Singh, and K. Li, “The PARSEC benchmark suite: characterization and architectural implications,” in Proc. of the Int. Conf. on Parallel architectures and compilation techniques (PACT). ACM, 2008, pp. 72–81.

[104] V. M. Weaver, “Linux perf event features and overhead,” in FastPath Workshop. University of Maine, 2013.

[105] S. Eranian, “Linux new monitoring interface: Performance counter for linux,” in CSCADS Workshop, 2009.

[106] AMD, BIOS and Kernel Developer's Guide for AMD Family 12h Processors, 2011. [Online]. Available: http://developer.amd.com/resources/documentation-articles/developer-guides-manuals/

[107] H. de Vries, "Understanding the detailed architecture of AMD's 64 bit core," Sept. 2003, accessed Feb. 2013. [Online]. Available: http://www.chip-architect.com/news/2003_09_21_detailed_architecture_of_amds_64bit_core.html

[108] Intel, Intel 64 and IA-32 Architectures Software Developer's Manual: Combined Volumes: 1, 2A, 2B, 2C, 3A, 3B, and 3C, 2013. [Online]. Available: http://www.intel.com/content/www/us/en/processors/architectures-software-developer-manuals.html

[109] J. A. Butts and G. S. Sohi, “A static power model for architects,” in Proc. of the Int. Symp. on Microarchitecture (MICRO). ACM, 2000, pp. 191–201.

[110] C. Angelini, "AMD A8-3850 review: Llano rocks entry-level desktops," Tom's Hardware, June 2011. [Online]. Available: http://www.tomshardware.com/reviews/amd-a8-3850-llano,2975-7.html

Appendices

Appendix A

Clustering Details

Table A.1 shows which properties were most significant for each cluster, based on the CPU results. Where multiple properties are listed, they were common to the same number of benchmarks; where several values of a single property were common to the same number of benchmarks, those values are listed in parentheses. The second column gives the fraction of the cluster's members that share the listed properties: for example, a cluster of 24 benchmarks in which 33% share a property has 8 members with that property. Many clusters have several properties in common, which confirms that more than one property influences the energy consumption of a benchmark.
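The tallies in Table A.1 can be reproduced mechanically once each micro-benchmark is labelled with its properties and its k-means cluster assignment. The sketch below is illustrative only (the data layout, property labels, and function names are hypothetical, not the scripts used for this work): it counts, per cluster, how many members share each property and reports the most common ones together with the percentage of the cluster that has them.

from collections import Counter, defaultdict

# Hypothetical input: one record per micro-benchmark with its property
# labels (data type, ILP level, operation, core count, input pattern, ...)
# and the k-means cluster it was assigned to.
benchmarks = [
    {"cluster": 0, "properties": {"int", "low ILP", "shift left", "1 core"}},
    {"cluster": 0, "properties": {"int", "low ILP", "shift left", "1 core"}},
    {"cluster": 1, "properties": {"float4", "4 core", "addition"}},
    {"cluster": 1, "properties": {"int", "2 core", "subtraction"}},
    {"cluster": 1, "properties": {"float4", "4 core", "multiplication"}},
]

def most_common_properties(benchmarks):
    """Return {cluster: (common properties, percent of members, cluster size)}."""
    clusters = defaultdict(list)
    for bench in benchmarks:
        clusters[bench["cluster"]].append(bench["properties"])
    summary = {}
    for cluster_id, members in clusters.items():
        counts = Counter(prop for props in members for prop in props)
        top = max(counts.values())
        # Keep every property tied for the highest count, as in Table A.1.
        common = sorted(prop for prop, count in counts.items() if count == top)
        summary[cluster_id] = (common, round(100 * top / len(members)), len(members))
    return summary

for cluster_id, (props, percent, size) in sorted(most_common_properties(benchmarks).items()):
    print(f"cluster {cluster_id}: {', '.join(props)}  {percent}%  size {size}")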

Table A.1: Most Common Property per Cluster for the CPU

Most Common Factor | Percent of Benchmarks with Property | Size of Cluster
int, low ILP, shift left, 1 core | 100 | 4
subtraction, comparison | 33 | 24
simple | 53 | 15
simple | 70 | 10
increment | 70 | 10
multiplication | 38 | 21
low ILP, simple | 100 | 9
int | 42 | 19
int, unary, simple | 100 | 5
int, 1 core | 78 | 9
addition | 40 | 15
multiplication, simple | 40 | 20
int4, high ILP, addition, 4 core, simple | 100 | 4
comparison | 44 | 18
int, high ILP, shift left, 2 core | 100 | 4
int data-type, low ILP, shift left, 2 core | 100 | 4
(zero, alternating, ascending, negative) inputs, (addition, and, or, xor) | 25 | 16
int, 3 core, comparison | 67 | 12
alternating inputs, subtraction | 33 | 15
logic | 9 | 27
addition | 33 | 24
int4, low ILP, multiplication, 1 core | 100 | 8
float, low ILP, 4 core | 100 | 8
int, unary, simple | 100 | 7
int4, high ILP, addition, 3 core, simple | 100 | 4
simple | 67 | 12
int, low ILP | 100 | 8
subtraction | 67 | 12
comparison | 30 | 23
ascending input, int, low ILP, increment, 3 core, unary, simple | 100 | 1
int, low ILP, core 1, comparison, simple | 100 | 6
addition | 32 | 25
low ILP | 100 | 5
(zero, alternating, ascending, negative) inputs, (addition, and, or, xor) | 25 | 16
4 core, comparison, simple | 67 | 24
subtraction | 50 | 16
simple | 100 | 8
(zero, alternating, ascending, negative) input | 25 | 20
ascending | 40 | 15
int, low ILP, multiplication, core 2 | 100 | 4
int, core 2, comparison, simple | 67 | 12
(zero, alternating, ascending, negative) input, addition | 25 | 32
subtraction | 35 | 23
zero input, float, low ILP, core 1 | 100 | 2
float4, 4 core | 60 | 15
multiplication | 80 | 10
subtraction, comparison | 32 | 25
(addition, subtraction) | 40 | 20
increment | 80 | 10

Appendix B

Multi2Sim CPU Configuration Details

Table B.1 lists the configuration used for the CPU model in Multi2Sim to emulate the CPU of the A6-3650. A sketch of how these parameters map onto a Multi2Sim configuration file follows the table.

Table B.1: CPU Configuration Details

Parameter | Description | Value

General
Frequency | frequency in MHz | 2600
Cores | core count | 4
Threads | number of simultaneous threads a core supports | 1

Pipeline
DecodeWidth | decode width | 3
DispatchWidth | dispatch width | 3
IssueWidth | issue width | 9
CommitWidth | commit width | 3

Queues
FetchQueueSize | entries in fetch queue | 32
UopQueueSize | entries in micro-op queue | 42
RobSize | entries in reorder buffer | 84
IqSize | entries in instruction queue | 24
LsqSize | entries in load/store queue | 32
RfIntSize | entries in integer register file | 40
RfFpSize | entries in floating point register file | 16
RfXmmSize | entries in SSE register file | 32

Branch Predictor
Kind | uses a two-level predictor | TwoLevel
RAS.Size | number of entries in the return address stack | 24
TwoLevel.L1Size | number of entries in the first level | 2048
TwoLevel.L2Size | number of entries in the second level | 16384
TwoLevel.HistorySize | number of previous branches tracked per entry | 8

Functional Units
IntAdd.Count | integer add/subtract | 3
IntAdd.OpLat | | 1
IntAdd.IssueLat | | 1
IntMult.Count | integer multiplication | 1
IntMult.OpLat | | 3
IntMult.IssueLat | | 1
IntDiv.Count | integer division | 1
IntDiv.OpLat | | 17
IntDiv.IssueLat | | 17
EffAddr.Count | address calculation | 3
EffAddr.OpLat | | 1
EffAddr.IssueLat | | 1
Logic.Count | logic | 3
Logic.OpLat | | 1
Logic.IssueLat | | 1
FloatSimple.Count | floating point simple (e.g. round) | 2
FloatSimple.OpLat | | 2
FloatSimple.IssueLat | | 2
FloatAdd.Count | floating point add/subtract | 1
FloatAdd.OpLat | | 4
FloatAdd.IssueLat | | 4
FloatComp.Count | floating point compare | 1
FloatComp.OpLat | | 2
FloatComp.IssueLat | | 2
FloatMult.Count | floating point multiply | 1
FloatMult.OpLat | | 4
FloatMult.IssueLat | | 4
FloatDiv.Count | floating point divide | 1
FloatDiv.OpLat | | 16
FloatDiv.IssueLat | | 16
FloatComplex.Count | floating point complex (e.g. trig) | 1
FloatComplex.OpLat | | 50
FloatComplex.IssueLat | | 40
XMMIntAdd.Count | SSE integer add/subtract | 2
XMMIntAdd.OpLat | | 1
XMMIntAdd.IssueLat | | 1
XMMIntMult.Count | SSE integer multiply | 1
XMMIntMult.OpLat | | 3
XMMIntMult.IssueLat | | 3
XMMLogic.Count | SSE logic | 2
XMMLogic.OpLat | | 2
XMMLogic.IssueLat | | 1
XMMFloatAdd.Count | SSE floating point add/subtract | 1
XMMFloatAdd.OpLat | | 4
XMMFloatAdd.IssueLat | | 4
XMMFloatComp.Count | SSE floating point compare | 1
XMMFloatComp.OpLat | | 2
XMMFloatComp.IssueLat | | 2
XMMFloatMult.Count | SSE floating point multiply | 1
XMMFloatMult.OpLat | | 4
XMMFloatMult.IssueLat | | 4
XMMFloatDiv.Count | SSE floating point divide | 1
XMMFloatDiv.OpLat | | 18
XMMFloatDiv.IssueLat | | 18
XMMFloatConv.Count | SSE floating point convert | 2
XMMFloatConv.OpLat | | 7
XMMFloatConv.IssueLat | | 7
XMMFloatComplex.Count | SSE floating point complex (e.g. sqrt) | 1
XMMFloatComplex.OpLat | | 24
XMMFloatComplex.IssueLat | | 24
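To make the mapping from Table B.1 to an actual simulator input concrete, the following sketch serializes a subset of these parameters into the INI-style file that Multi2Sim's detailed x86 model reads (typically supplied through the --x86-config option). This is an illustrative script, not the configuration tooling used for this thesis; section and key names are taken from the Parameter column of Table B.1, and the Multi2Sim manual [26] remains the authoritative reference for the exact file layout.

# Illustrative only: writes a Multi2Sim-style CPU configuration file from a
# subset of the parameters in Table B.1. Section and key names follow the
# Parameter column of the table; see the Multi2Sim manual [26] for the
# authoritative format.

cpu_config = {
    "General": {"Frequency": 2600, "Cores": 4, "Threads": 1},
    "Pipeline": {"DecodeWidth": 3, "DispatchWidth": 3,
                 "IssueWidth": 9, "CommitWidth": 3},
    "Queues": {"FetchQueueSize": 32, "UopQueueSize": 42, "RobSize": 84,
               "IqSize": 24, "LsqSize": 32,
               "RfIntSize": 40, "RfFpSize": 16, "RfXmmSize": 32},
    "BranchPredictor": {"Kind": "TwoLevel", "RAS.Size": 24,
                        "TwoLevel.L1Size": 2048, "TwoLevel.L2Size": 16384,
                        "TwoLevel.HistorySize": 8},
    "FunctionalUnits": {"IntAdd.Count": 3, "IntAdd.OpLat": 1, "IntAdd.IssueLat": 1,
                        "FloatDiv.Count": 1, "FloatDiv.OpLat": 16, "FloatDiv.IssueLat": 16},
}

def write_config(config, path):
    """Emit [Section] headers followed by Key = Value lines."""
    with open(path, "w") as f:
        for section, params in config.items():
            f.write(f"[ {section} ]\n")
            for key, value in params.items():
                f.write(f"{key} = {value}\n")
            f.write("\n")

if __name__ == "__main__":
    write_config(cpu_config, "x86-config.ini")
    # The file would then be passed to the simulator, for example:
    #   m2s --x86-sim detailed --x86-config x86-config.ini <benchmark>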