IT 14 015, Degree Project 30 hp, June 2014

GPU: Power vs Performance

Siddhartha Sankar Mondal

Department of Information Technology

Abstract

GPU: Power vs Performance

Siddhartha Sankar Mondal

GPUs are widely used to meet the ever increasing demands of high performance computing. High-end GPUs are one of the largest consumers of power in a computer, and power dissipation has always been a major concern for computer architects. Due to power efficiency demands, modern CPUs have moved towards multicore architectures; GPUs are already many-core architectures. There have been some encouraging results for power efficiency in CPUs by applying DVFS. The vision is that a similar approach would also provide encouraging results for power efficiency in GPUs.

In this thesis we have analyzed the power and performance characteristics of a GPU at different frequencies. To help us achieve that, we have made a set of microbenchmarks with different levels of memory boundedness and thread counts. We have also used benchmarks from the CUDA SDK and Parboil. We have used a GTX580 Fermi based GPU. We have also built a hardware infrastructure that accurately measures the power being consumed by the GPU.

Supervisor: Stefanos Kaxiras
Subject reviewer: Stefanos Kaxiras
Examiner: Ivan Christoff
IT 14 015
Sponsor: UPMARC
Printed by: Reprocentralen ITC

AhŽ•™ëuou“u•±

I would like to thank my supervisor Prof. Stefanos Kaxiras for giving me the wonderful opportunity to work on this interesting topic. I would also like to thank his PhD students Vasileios Spiliopoulos and Konstantinos Koukos for helping me get started with the benchmarks and the hardware setup. The LaTeX community has been of great help while making this document. Also, I would like to thank the spirit that lives in the computer. This thesis was funded by UPMARC.


Contents

Acknowledgements
Contents
1 Introduction
2 Background
   2.1 CUDA Programming model
   2.2 GPU Architecture
   2.3 Power issues
   2.4 Latency
   2.5 Previous work
3 Methodology
   3.1 Experimental Setup
   3.2 Power measurement
   3.3 DVFS
   3.4 Microbenchmarks
   3.5 Benchmarks
      3.5.1 Matrix Multiplication
      3.5.2 Matrix Transpose
      3.5.3 Histogram
      3.5.4 Radix Sort
      3.5.5 Merge Sort
      3.5.6 Conjugate Gradient
      3.5.7 BFS (Breadth First Search)
      3.5.8 Eigenvalues
      3.5.9 Black-Scholes option pricing
      3.5.10 3D FDTD (3D Finite Difference Time Domain method)
      3.5.11 Scalar Product
4 Evaluation
   4.1 Microbenchmarks
      4.1.1 Microbenchmark 1
      4.1.2 Microbenchmark 2
      4.1.3 Microbenchmark 3
      4.1.4 Microbenchmark 4
   4.2 Benchmarks
   4.3 Memory Bandwidth
      4.3.1 Black-Scholes
      4.3.2 Eigenvalue
      4.3.3 3D FDTD
      4.3.4 Scalar Product
5 Conclusion
Bibliography
Assembly code
Power measurements

1 Introduction

Over the last few years the CPU and GPU architectures have been evolving very rapidly. With the introduction of programmable shaders in GPUs since the turn of the century, the GPU has been used by the scientific computing community as a powerful computational accelerator to the CPU. For many compute intensive workloads GPUs give a few orders of magnitude better performance than a CPU. The introduction of languages like CUDA and OpenCL has made it easier to program on a GPU.

In our evaluation we have used the GTX580 GPU. It is based on the Fermi architecture [15]. With 512 compute units called CUDA cores it can handle 24,576 active threads [9, 15]. Its off-chip memory bandwidth is around 192 GB/s, which is considerably higher than a CPU's main memory bandwidth (around 32 GB/s for a Core i7). Theoretically it can do over 1.5 TFLOPS (with single precision FMA).

Even with such great numbers to boast about, GPUs have a few bottlenecks that one has to consider before sending all workload down to the GPU. One of them is power, and the other is the cost to transfer data from host to device (i.e. CPU to GPU) and back from device to host (i.e. GPU to CPU). The bandwidth of 16x PCI Express is around 8 GB/s.

Power is a very important constraint for every processor [7]. Modern high-end GPUs have a thermal design power (TDP) of over 200 watts. The GTX580 has a TDP of 244 watts, whereas a high-end multicore CPU like the Intel Core i7 has a TDP of around 100 watts. In terms of GFLOPS/watt GPUs can be considered more power efficient than CPUs, but for GPUs to be considered as co-processors they consume too much power out of the total power budget.

Though GPUs have such high off-chip memory bandwidth, accessing the off-chip global memory is very expensive (a latency of around 400 to 800 clock cycles). GPUs hide this latency by scheduling a very large number of threads at a time. For memory bound applications it becomes hard to hide this latency, and they end up with slack. On CPUs we can take advantage of such slack by applying DVFS to reduce the dynamic power.


In this thesis work we investigate the power and performance characteristics of a GPU with DVFS (dynamic voltage and frequency scaling). To help us achieve that we make use of microbenchmarks with different levels of memory boundedness and thread counts. We follow it up by applying DVFS to some more benchmarks.

Apart from the introduction in this chapter, the thesis is structured as follows. In Chapter 2 we discuss GPU programming, GPU architecture, the concept behind DVFS based power savings, and some previous work. In Chapter 3 we describe the method used to measure power and implement DVFS on the GPU. We also throw some light on the benchmarks that were used for the evaluation. In Chapter 4 we evaluate the performance and power consumption of the benchmarks under different numbers of threads, frequencies and memory bandwidths. In Chapter 5 we reflect on the conclusions drawn from our evaluation. In Appendix: Assembly code we provide the assembly code of the microbenchmarks we used. In Appendix: Power measurements we provide all the execution times and power consumption readings from the evaluation.

2 Background

2.1 CUDA programming model

CUDA is a GPGPU programming language that can be used to write programs for NVIDIA GPUs. It is an extension of the C programming language. A program written in CUDA consists of a mix of host (CPU) code and device (GPU) code [9]. The NVCC compiler compiles the device code and separates it from the host code. At a later stage a C compiler can be used to compile the host code, and the device code parts are replaced by calls to the compiled device code. A data parallel function that is to be run on a device is called a kernel. A kernel creates many threads to compute its data parallel workload. A block consists of many threads, and a group of thread blocks forms a grid.

__global__ void SimpleKernel(float* A, float* B, float* C, int N)
{
    //calculate thread id
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N)
        C[i] = A[i] + B[i];
}

int main()
{
    ......
    //kernel invocation with the thread details
    SimpleKernel<<<numberOfBlocks, threadsPerBlock>>>(A, B, C, N);
    ......
}

Listing 2.1: A simple CUDA kernel


In Listing 2.1 we have shown a very simple CUDA program. Inside the kernel SimpleKernel we have CUDA specific variables that help us identify the data element for the particular thread. threadIdx identifies a thread inside a block; each thread inside a block has a unique thread id. blockIdx identifies the block in a grid. blockDim gives the size of a block in a particular dimension. When we invoke the kernel from the host code, along with the kernel arguments we also pass the thread details. The variable threadsPerBlock gives the number of threads in a block, and numberOfBlocks indicates the number of blocks that are to be launched. All the blocks are of the same size (i.e. they have the same number of threads). Both numberOfBlocks and threadsPerBlock can be 1D, 2D or 3D data types. A more detailed explanation of CUDA programming can be found in [9].

ó.ó £¶ Z§h„†±uh±¶§u

Figure 2.1: On the left we have the CUDA core. The SM, shown in the centre, is made up of CUDA cores. The Fermi GPU architecture, shown on the right, contains many SMs. Source: NVIDIA Fermi architecture whitepaper

In our setup we use the Nvidia GTX580 GPU with CUDA compute capability 2.0. It is based on the Fermi architecture [15]. Figure 2.1 gives an overview of the Fermi architecture. The basic computing unit is the CUDA core. There are 512 CUDA cores on the Fermi. Each CUDA core can execute integer or floating point instructions for a single thread. 32 CUDA cores make up a single Streaming Multiprocessor (SM). There are in total 16 SMs on the GTX580. Each SM has its own configurable shared memory and L1 cache, which can be configured as 16/48 KB. The GPU also has an L2 cache of size 768 KB. In the Fermi all global memory accesses pass through the L2 cache [9]. It has six 64-bit memory interfaces. With 2004 MHz GDDR5 memory this gives a peak memory bandwidth of 192 GB/s.

Technical Specication Values Maximum threads per block Õþó¦ Warp size ìó Maximum resident blocks per SM ˜ Maximum resident warps per SM ¦˜ Maximum resident threads per SM Õ¢ìä Registers(ìó bit) per SM ìó K

Table 2.1: Fermi features for Compute Capability 2.0

An SM schedules 32 threads at a time, called a warp. Nvidia calls this the SIMT (Single Instruction Multiple Thread) architecture. Each CUDA thread corresponds to a SIMD lane [2]. All the threads in a warp are internally synchronized. From a programming perspective, keeping the warp structure in mind, the size of a block should be a multiple of the warp size. A block can be allocated to only one SM. Table 2.1 lists some GTX580 specific features. Based on the register requirements of a kernel and the block size, the scheduler decides on the number of resident blocks for an SM so that all the specifications in table 2.1 are met.

Figure 2.2: The GPU memory hierarchy (per-SM registers, L1 cache and shared memory, with a shared L2 below)

e gure ó.ó gives an overview of the memory hierarchy. Each SM has its own LÕ cache and the Ló cache is shared by all the SMs. Each thread has its own local memory and all threads in a block have per-block shared memory. All the threads have access to the same global memory. Both the local and global data reside in the device(GPU) memory. In the Fermi all the local and global memory data are cached in both the LÕ and Ló. e on-chip memory on the Fermi is used for both LÕ and shared memory. e user can specify the size of LÕ and shared memory.

2.3 Power issues

Modern high-end GPUs have a TDP of over 200 watts, which is very high compared to the total power budget of the whole system. The GTX 580 has a TDP of around 244 watts. Power dissipation has always been a major area of concern for computer architects [6, 7]. In figure 2.3 we see the GPU power density of different technology nodes.

Figure 2.3: GPU power trends. Source: Nvidia [1]

Nvidia set a maximum power density of 0.3 W/mm² for the worst case process, voltage and temperature [1]. The AC power density represents the dynamic power and the DC power density represents the static power. The GTX 580 is made using the 40 nm technology; at this node the static and dynamic power densities are almost similar. When we go to the newer 28 nm technology the dynamic power density exceeds the static power density by a huge margin, so dynamic power is a major area of concern. Dynamic power is given by the equation [6]:

    P = C · V² · A · f

where C is the capacitance, V the supply voltage, A the activity factor and f the clock frequency.

Together, V²f makes a cubic impact on the power dissipation, since V and f are linearly related. Memory bound applications suffer from long latency stalls and do not benefit from a high clock frequency. So, for memory bound applications we can make use of dynamic voltage and frequency scaling (DVFS) to reduce power dissipation.

ó.¦ Z±u•hí

Even though GPUs have such high memory bandwidth, the major bottleneck is the latency of off-chip memory accesses. Fetching operands from off-chip memory costs around 600 clock cycles. GPUs keep a large number of resident threads, in the form of warps, to hide this latency. Figure 2.4 shows how the warp scheduler swaps between different resident warps.

Figure 2.4: Warp scheduler. Source: Nvidia [15]

The number of warps needed to hide this latency depends on the ratio of memory call instructions to other instructions. For our device, if we have a kernel with 1 memory call out of 15 instructions, then we need about 40 resident warps in that SM to hide the latency [9]. Having more warps might help in hiding the latency, but then all these extra resident warps' memory calls have to be serviced too. There is also a limit on the number of warps that can be handled at a time (48 per SM for Fermi). Higher clock frequencies do not give much performance benefit to memory bound kernels. In this work we investigate the power and performance behavior of the GPU by applying DVFS to kernels with different numbers of memory calls and different numbers of resident warps.

2.5 Previous work

A lot of work has been done to investigate energy savings in CPUs, but for GPUs it is a little sparse. There are a few well documented pieces of research related to energy savings in GPUs that have helped me understand and frame my thesis work.

Suda and Ren [12] present a method to get more accurate power measurements; using their method we measure the power of only the GPU. They built a power model, but it is limited to a simple floating point addition kernel with variable numbers of threads and block sizes. It does not investigate the effect of memory boundedness on power and performance.

Jiao et al. [5] have investigated the effect of DVFS on memory bound, compute bound and mixed kernels for a GPU. Their work is in line with my thesis work, but they did not use consistent measurement units and have an error prone measurement setup that measures the power of the whole computer. Also, they ignore the effect of thread and block sizes on power and performance.

Nagasaka et al. [8] and Hong and Kim [4][3] present models to predict power and performance in GPUs. Nagasaka et al. proposed a statistical model and Hong and Kim an analytical model. Both works contain an excellent amount of information for extending my thesis data to derive a power and performance model for DVFS in GPUs.

3 Methodology

3.1 Experimental setup

CPU                    AMD Phenom II
Memory                 16 GB
GPU                    Nvidia GTX 580
PSU                    Corsair 750 W
Data Acquisition Unit  Advantech USB-4716
OS                     Windows 7 (64 bit)
Compiler               CUDA 4.2 / Visual Studio 2010
CUDA driver            195.92

Table 3.1: Experimental setup

We performed all the evaluations on the GTX 580 graphics card. We used the Windows environment because DVFS is not supported on Linux for Fermi based graphics cards.

3.2 Power measurement

The GTX 580 gets its power from 3 different sources. As shown in table 3.2, the PCI Express slot, the 6-pin connector and the 8-pin connector provide 75 W, 75 W and 150 W respectively. This amounts to a total of 300 W of available power supply.

Power Source                     Maximum Power
PCI Express slot                 75 W
PCI Express 6-pin connector      75 W
PCI Express 8-pin connector      150 W

Table 3.2: Distribution of GPU power supply

Figure 3.1: 8-pin PCI Express auxiliary power supply

Figure 3.1 describes the 8-pin power supply. When Sense0 and Sense1 are detected, it acts like the 8-pin power supply and provides 150 W. When Sense1 is not detected, it behaves like the 6-pin power supply and provides 75 W [10].

Figure 3.2: GPU power measurement

In the gure ì.ó we have described the power measurement infrastructure we have in place. It helps us accurately measure the power going on to the GPU and also from each power source of the GPU. We x probes to the power supply wires that go to the GPU to measure the power. We have used a PCI-Express riser card to measure the ì.ì. oꀫ ÕÕ

power being supplied through the PCI-Express. e Data Acquisition Unit measures the current passing through all the power supplies. We assume that the voltage through the ÕóV and ì.ìV are constant. It gives rise to a total power measurement error of ì watts[Õó].

3.3 DVFS

Sr.   Shader Frequency (MHz)   Core Voltage (V)
1     810                      0.963
2     1000                     0.975
3     1200                     0.988
4     1400                     1.000
5     1600                     1.013

Table 3.3: DVFS settings used

We have used a shader frequency range of 810-1600 MHz with a step size of 200 MHz. The safe voltage scaling range provided by the GPU is 0.963 V to 1.013 V. As shown in table 3.3, we assigned a voltage value to each shader frequency to make the DVFS settings.

3.4 Microbenchmarks

We have made 4 microbenchmarks to help us understand the power vs performance behavior. All the microbenchmarks are simple vector additions with different levels of global memory calls. All of them were made with the help of the CUDA Occupancy Calculator to ensure the intended behavior [13]. In Appendix A we have provided the assembly code for each of the microbenchmarks. The microbenchmark 1 kernel code is given in Listing 3.1. It has a per-thread register requirement of 10 registers; 2 out of 18 instructions are memory calls.

__global__ void micro1(const Mytype* A, const Mytype* B, Mytype* C, int N)
{
    int x = blockDim.x * blockIdx.x + threadIdx.x;
    int y = blockDim.x * gridDim.x * blockIdx.y;
    int i = x + y;
    if (i < N)
        C[i] = A[i] + B[i];
}

Listing 3.1: Microbenchmark 1

The microbenchmark 2 kernel code is given in Listing 3.2. It has a per-thread register requirement of 13 registers; 4 out of 26 instructions are memory calls. The microbenchmark 3 kernel code is given in Listing 3.3. It has a per-thread register requirement of 15 registers; 6 out of 34 instructions are memory calls.

__global__ void micro2(const Mytype* A, const Mytype* B, const Mytype* D,
                       const Mytype* E, Mytype* C, int N)
{
    int x = blockDim.x * blockIdx.x + threadIdx.x;
    int y = blockDim.x * gridDim.x * blockIdx.y;
    int i = x + y;
    if (i < N)
        C[i] = A[i] + B[i] + D[i] + E[i];
}

Listing 3.2: Microbenchmark 2

__global__ void micro3(const Mytype* A, const Mytype* B, const Mytype* D,
                       const Mytype* E, const Mytype* F, const Mytype* G,
                       Mytype* C, int N)
{
    int x = blockDim.x * blockIdx.x + threadIdx.x;
    int y = blockDim.x * gridDim.x * blockIdx.y;
    int i = x + y;
    if (i < N)
        C[i] = A[i] + B[i] + D[i] + E[i] + F[i] + G[i];
}

Listing 3.3: Microbenchmark 3

The microbenchmark 4 kernel code is given in Listing 3.4. It has a per-thread register requirement of 19 registers; 10 out of 50 instructions are memory calls.

__global__ void micro4(const Mytype* A, const Mytype* B, const Mytype* D,
                       const Mytype* E, const Mytype* F, const Mytype* G,
                       const Mytype* H, const Mytype* I, const Mytype* J,
                       const Mytype* K, Mytype* C, int N)
{
    int x = blockDim.x * blockIdx.x + threadIdx.x;
    int y = blockDim.x * gridDim.x * blockIdx.y;
    int i = x + y;
    if (i < N)
        C[i] = A[i] + B[i] + D[i] + E[i] + F[i] + G[i] + H[i] + I[i] + J[i] + K[i];
}

Listing 3.4: Microbenchmark 4

3.5 Benchmarks

We have used a few benchmarks from the CUDA SDK [13] and the Parboil benchmark suite [11]. The original benchmarks had very short execution times. To accurately measure the power we need to run the applications for a longer duration, i.e. more than 1 ms; this limitation comes from the Data Acquisition Unit we used, which can measure power at intervals of 0.5 ms. So we increased the number of iterations and the size of the data. The benchmarks are as follows:

3.5.1 Matrix Multiplication

The size of the matrix was increased to 2560 × 2560. The number of iterations was increased to 20.

3.5.2 Matrix Transpose

The optimized version of Matrix Transpose was used, i.e. coalesced transpose with no bank conflicts. The matrix size was increased from 1024 × 1024 to 2048 × 2048. The number of repetitions was increased from 100 to 4000.

3.5.3 Histogram

The 256-bin histogram was used. The number of runs was increased from 16 to 1000.

3.5.4 Radix Sort

The default version uses 32-bit unsigned int keys and values. We used 32-bit float keys and values. The number of elements to sort was increased from 1,048,576 to 40,960,000. We used 10 iterations.

3.5.5 Merge Sort

The default version uses 65,536 values. We increased it to 1,048,576 values and changed the number of iterations from 1 to 100.

3.5.6 Conjugate Gradient

The size of the matrix was increased from 1024 × 1024 to 4096 × 4096.

3.5.7 BFS (Breadth First Search)

BFS was taken from the Parboil benchmark suite and was used without any increase in the default data size, because the input is read from a file (unlike the CUDA SDK benchmarks, which generate their own input on the CPU).

3.5.8 Eigenvalues

The default values were used for Eigenvalues. The matrix was of size 2048 × 2048 and the benchmark performed 100 iterations.

3.5.9 Black-Scholes option pricing

The default value of 4,000,000 options was used, but we increased the number of iterations from 512 to 8192.

3.5.10 3D FDTD (3D Finite Difference Time Domain method)

The default version applies FDTD on a 376 × 376 × 376 volume with a symmetric filter radius of 4 for 10 timesteps. We increased the timesteps to 64.

3.5.11 Scalar Product

The scalar product is performed on 256 vector pairs. The default version has 4,096 elements per vector and we increased that to 16,384 elements per vector. Also, we made it iterate 1,000 times.

4 Evaluation

The parameters considered during the evaluation are execution time, power, energy and EDP (energy-delay product). The measurements are provided in Appendix B: Power measurements. Apart from that, we evaluated the effect of DVFS on the memory bandwidth. From figure 4.1 we notice that the memory bandwidth is less than the theoretical peak of 192.4 GB/s, but it stays consistent for a particular shader frequency across different sizes of data.

Figure 4.1: Effect of DVFS on memory bandwidth


4.1 Microbenchmarks

As described in section 3.4, the memory call to compute ratio increases from Microbenchmark 1 to Microbenchmark 4. Each of the microbenchmarks has been evaluated for different numbers of warps per SM, i.e. 16, 32 and 48 warps. The scheduler swaps between multiple warps in an SM to hide the latency of memory calls. But, looking through the plots for each of the microbenchmarks, it can be seen that beyond a certain point even increasing the number of warps does not give us an increase in performance. As the memory boundedness increases there is very little difference between the execution times of the 32 warp and 48 warp versions. Another thing worth noticing is that the 16 warp version consumes considerably less power than the 32 warp and 48 warp versions.

4.1.1 Microbenchmark 1

11.11% of the instructions in this microbenchmark are memory calls. The 48 warp version has the best execution times. Beyond 1200 MHz the execution times of both the 32 warp and 48 warp versions start to flatten. We get the best EDP for the 48 warp version at 1000 MHz.

(Figures 4.2-4.5 plot execution time, power, energy and EDP against shader frequency for the 16, 32 and 48 warp versions of Microbenchmark 1.)

Figure 4.2: Micro1: Execution time
Figure 4.3: Micro1: Power
Figure 4.4: Micro1: Energy
Figure 4.5: Micro1: EDP

4.1.2 Microbenchmark 2

15.38% of the instructions in this microbenchmark are memory calls. The 32 warp version shows marginally better execution times than the 48 warp version. Also, for the 32 warp version the best execution time is seen at 1200 MHz. We observe the best EDP at 1200 MHz for the 32 warp version.

(Figures 4.6-4.9 plot execution time, power, energy and EDP against shader frequency for the 16, 32 and 48 warp versions of Microbenchmark 2.)

Figure 4.6: Micro2: Execution time
Figure 4.7: Micro2: Power
Figure 4.8: Micro2: Energy
Figure 4.9: Micro2: EDP

4.1.3 Microbenchmark 3

17.65% of the instructions in this microbenchmark are memory calls. The 32 warp and 48 warp versions show similar execution times. The 16 warp version starts showing better execution times than the other two versions beyond 1400 MHz. At 1200 MHz the 16 warp version has the best EDP.

(Figures 4.10-4.13 plot execution time, power, energy and EDP against shader frequency for the 16, 32 and 48 warp versions of Microbenchmark 3.)

Figure 4.10: Micro3: Execution time
Figure 4.11: Micro3: Power
Figure 4.12: Micro3: Energy
Figure 4.13: Micro3: EDP

4.1.4 Microbenchmark 4

20% of the instructions in this microbenchmark are memory calls. The 16 warp version starts getting better execution times than the other two versions beyond 1400 MHz. We get the best EDP at 1200 MHz.

(Figures 4.14-4.17 plot execution time, power, energy and EDP against shader frequency for the 16, 32 and 48 warp versions of Microbenchmark 4.)

Figure 4.14: Micro4: Execution time
Figure 4.15: Micro4: Power
Figure 4.16: Micro4: Energy
Figure 4.17: Micro4: EDP

4.2 Benchmarks

All the benchmarks in this section have been executed at the normal memory clock of 2004 MHz, i.e. a bandwidth of 192.4 GB/s. Also, note that all the plotted data are normalized values; please refer to table 5 in Appendix B for the measurements. In terms of applying DVFS the results are a little disappointing: the best EDP for all the benchmarks we evaluated was at the highest shader frequency.

(Figures 4.18-4.28 plot normalized time, power, energy and EDP against shader frequency for each benchmark.)

Figure 4.18: BFS
Figure 4.19: Black-Scholes
Figure 4.20: Conjugate Gradient
Figure 4.21: Eigenvalues
Figure 4.22: 3D FDTD
Figure 4.23: Histogram
Figure 4.24: Matrix Multiplication
Figure 4.25: Matrix Transpose
Figure 4.26: Merge sort
Figure 4.27: Radix Sort
Figure 4.28: Scalar Product

4.3 Memory bandwidth

We further evaluated 4 of the benchmarks to see the effect of memory bandwidth on power and performance. The plots use normalized values; please refer to table 8 in Appendix B to compare the values for the different benchmarks.

4.3.1 Black-Scholes

Black-Scholes produced similar execution times at 1500 MHz and 2004 MHz memory frequency. Black-Scholes is a compute bound benchmark and benefits from increased shader frequency, but beyond 1500 MHz memory frequency it does not benefit significantly from increased memory bandwidth. The best EDP value was obtained at 1500 MHz memory frequency.

Figure 4.29: Black-Scholes, 1000 MHz memory frequency
Figure 4.30: Black-Scholes, 1500 MHz memory frequency
Figure 4.31: Black-Scholes, 2004 MHz memory frequency

4.3.2 Eigenvalue

The result for the Eigenvalue benchmark is interesting. It has similar execution times at 1000 MHz, 1500 MHz and 2004 MHz memory frequency. The best EDP value was obtained at 1000 MHz memory frequency. Eigenvalue is a compute bound benchmark and benefits a lot from increasing the shader frequency, but memory bandwidth has very little effect on it. So, at lower memory bandwidth it shows a significant reduction in EDP.

Figure 4.32: Eigenvalues, 1000 MHz memory frequency
Figure 4.33: Eigenvalues, 1500 MHz memory frequency
Figure 4.34: Eigenvalues, 2004 MHz memory frequency

4.3.3 3D FDTD

3D FDTD performed marginally better at 2004 MHz memory frequency than at 1500 MHz. Also, we obtained the best EDP at 2004 MHz memory frequency. It benefits from both increased shader frequency and increased memory bandwidth.

Figure 4.35: 3D FDTD, 1000 MHz memory frequency (plot: normalized time, power, energy, and EDP vs. shader frequency, 800-1600 MHz).

Figure 4.36: 3D FDTD, 1500 MHz memory frequency (same axes).

Figure 4.37: 3D FDTD, 2004 MHz memory frequency (same axes).

4.3.4 Scalar Product

The Scalar Product benchmark performs best at 2004 MHz memory frequency. It is a memory-bound benchmark and benefits from higher memory bandwidth. Also, the best EDP value was obtained at 2004 MHz memory frequency.
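One rough, hypothetical way to quantify this memory-boundedness from the appendix data is to compare the speedup gained by raising the shader clock from 810 MHz to 1600 MHz against the clock ratio itself: a perfectly compute-bound kernel would scale with the clock, while Scalar Product does not, especially at low memory bandwidth.

```python
# Rough memory-boundedness estimate for Scalar Product: speedup from
# raising the shader clock, divided by the clock ratio. Times are from
# Tables 5 and 7 (appendix), 810 MHz vs 1600 MHz shader frequency.

freq_ratio = 1600 / 810                 # ~1.98x more shader cycles per second

times = {  # memory clock MHz -> (time at 810 MHz, time at 1600 MHz)
    2004: (3.674, 2.445),
    1000: (13.171, 12.887),
}

scaling = {}
for mem, (t_slow, t_fast) in times.items():
    speedup = t_slow / t_fast
    scaling[mem] = speedup / freq_ratio  # 1.0 would be perfectly compute bound
    print(mem, round(speedup, 2), round(scaling[mem], 2))
```

At the 1000 MHz memory clock the kernel barely speeds up at all (about 1.02x for a 1.98x clock increase), which is exactly the behavior expected of a bandwidth-limited kernel.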

Figure 4.38: Scalar product, 1000 MHz memory frequency (plot: normalized time, power, energy, and EDP vs. shader frequency, 800-1600 MHz).

Figure 4.39: Scalar product, 1500 MHz memory frequency (same axes).

Figure 4.40: Scalar product, 2004 MHz memory frequency (same axes).

Chapter 5: Conclusion

In this thesis work we have evaluated the application of DVFS to GPUs for general-purpose workloads. From the results of the microbenchmark evaluation, we can conclude that DVFS can be effectively applied to memory-bound kernels. We also show how the number of warps (threads) affects the performance and power of a kernel: for more memory-bound applications it is preferable to have fewer warps, which gives better performance and lower power. The benchmarks we evaluated showed the best EDP at the highest shader frequency, but it would be interesting to investigate the impact of warp sizes on their performance and power. We also evaluated the power and performance of some of the benchmarks at different memory bandwidths. Certain compute-intensive benchmarks are not significantly affected by reducing the memory bandwidth, which saves some power. We have also provided an infrastructure to accurately measure the GPU power consumption from each power source. At the very end of the thesis work the graphics card broke due to excessive DVFS, so it might not be advisable to apply DVFS to current desktop-grade hardware. NVIDIA has been making it easier to profile GPU kernels. As future work, the data from this thesis can be combined with profiler data to build power models [8][4][3] for DVFS in GPUs.
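The frequency-selection step implied here can be sketched as a small offline policy: profile the kernel once at each available shader frequency, then pick the setting that minimizes the metric of interest. The numbers below are the 32-warp rows of Table 1 in the appendix; the function name is illustrative, not part of any existing tool.

```python
# Minimal offline DVFS policy in the spirit of the conclusion: profile a
# kernel per shader frequency, then select the frequency minimizing the
# chosen metric. Data: Microbenchmark 1 at 32 warps (Table 1, appendix).

profile = {  # shader MHz -> (execution time in s, power in W)
    810:  (1.177, 158.79),
    1000: (0.978, 174.19),
    1200: (0.891, 188.19),
    1400: (0.863, 197.68),
    1600: (0.852, 207.56),
}

def pick_frequency(profile, metric="edp"):
    def cost(item):
        f, (t, p) = item
        return {"time": t, "energy": p * t, "edp": p * t * t}[metric]
    return min(profile.items(), key=cost)[0]

print(pick_frequency(profile, "edp"))     # EDP-optimal setting
print(pick_frequency(profile, "energy"))  # energy-optimal setting
```

For this kernel the three metrics disagree: the fastest setting is 1600 MHz, the energy-optimal one is 1200 MHz, and the EDP-optimal one is 1400 MHz, which is why the choice of objective matters for any DVFS governor.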


B†f†™§Z£„í

[1] J.Y. Chen. "GPU technology trends and future requirements". In: Electron Devices Meeting (IEDM), 2009 IEEE International. IEEE, 2009, pp. 1-6.
[2] J.L. Hennessy and D.A. Patterson. Computer Architecture: A Quantitative Approach. 5th ed. Morgan Kaufmann, 2012.
[3] S. Hong and H. Kim. "An integrated GPU power and performance model". In: ACM SIGARCH Computer Architecture News. Vol. 38, no. 3. ACM, 2010, pp. 280-289.
[4] Sunpyo Hong and Hyesoon Kim. "An analytical model for a GPU architecture with memory-level and thread-level parallelism awareness". In: Proceedings of the 36th Annual International Symposium on Computer Architecture (ISCA '09). Austin, TX, USA: ACM, 2009, pp. 152-163. ISBN: 978-1-60558-526-0. DOI: 10.1145/1555754.1555775. URL: http://doi.acm.org/10.1145/1555754.1555775.
[5] Y. Jiao et al. "Power and performance characterization of computational kernels on the GPU". In: Green Computing and Communications (GreenCom), 2010 IEEE/ACM Int'l Conference on & Int'l Conference on Cyber, Physical and Social Computing (CPSCom). IEEE, 2010, pp. 221-228.
[6] Stefanos Kaxiras and Margaret Martonosi. Computer Architecture Techniques for Power-Efficiency. 1st ed. Morgan and Claypool Publishers, 2008. ISBN: 1598292080, 9781598292084.
[7] T. Mudge. "Power: A first class design constraint for future architectures". In: High Performance Computing (HiPC 2000), 2000, pp. 215-224.
[8] H. Nagasaka et al. "Statistical power modeling of GPU kernels using performance counters". In: Green Computing Conference, 2010 International. IEEE, 2010, pp. 115-122.
[9] NVIDIA CUDA C Programming Guide, version 4.2. 2012.
[10] M. Scott. Upgrading and Repairing PCs. Que Publishing, 2011.
[11] J. A. Stratton et al. Parboil Benchmarks. http://impact.crhc.illinois.edu/Parboil/parboil.aspx. [Online; accessed 6-May-2012]. 2012.

ìÕ ìó f†f†™§Z£„í

[12] R. Suda and D.Q. Ren. "Accurate Measurements and Precise Modeling of Power Dissipation of CUDA Kernels toward Power Optimized High Performance CPU-GPU Computing". In: 2009 International Conference on Parallel and Distributed Computing, Applications and Technologies. IEEE, 2009, pp. 432-438.
[13] CUDA SDK v4.2. http://developer.nvidia.com/gpu-computing-sdk. [Online; accessed 24-April-2012]. 2012.
[14] CUDA Toolkit v4.2. http://developer.nvidia.com/cuda-toolkit. [Online; accessed 24-April-2012]. 2012.
[15] Whitepaper: NVIDIA's Next Generation CUDA Compute Architecture: Fermi. http://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf. v1.1. [Online; accessed 6-May-2012]. 2009.

Assembly code

The CUDA Toolkit comes with many useful tools. We made use of the tool called cuobjdump to generate the assembly code executed by the GPU [14]. As a reminder, the PTX file generated during compilation is only intermediate code, not the actual assembly. The toolkit's documentation describes the assembly instructions. In our case we are interested in the LD instruction, which indicates memory loads.
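As an illustration of this use of the disassembly, a small, hypothetical helper can estimate a kernel's memory intensity by counting LD instructions relative to all instructions. The parsing below is naive and simply assumes the comment-prefixed layout shown in the listings that follow.

```python
# Hypothetical helper: estimate the memory intensity of a kernel from
# cuobjdump disassembly by counting LD instructions. Assumes the
# "/*offset*/ /*encoding*/ [@pred] OPCODE ..." layout used in this appendix.

import re

def memory_intensity(disassembly: str) -> float:
    ops = []
    for line in disassembly.splitlines():
        # Drop /*...*/ comments and predicates like @P0, then take the opcode.
        line = re.sub(r"/\*.*?\*/", " ", line)
        tokens = [t for t in line.split() if not t.startswith("@")]
        if tokens and tokens[0] not in ("Function", ":"):
            ops.append(tokens[0])
    loads = sum(op.startswith("LD") for op in ops)
    return loads / len(ops) if ops else 0.0

sample = """\
/*0070*/ /*0xa00200a3200b8000*/ @P0 IMAD R8.CC, R0, R5, c [0x0] [0x28];
/*0078*/ /*0x0020c18584000000*/ @P0 LD.E.CG R3, [R2];
/*00a8*/ /*0x0060008594000000*/ @P0 ST.E [R6], R0;
/*00b0*/ /*0x00001de780000000*/ EXIT;
"""
print(memory_intensity(sample))  # 1 load out of 4 instructions -> 0.25
```

Applied to the four microbenchmark listings below, such a count reflects their increasing memory-boundedness (two LDs in Micro1 up to ten in Micro4).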

Assembly code for Micro1

Function : _Z6micro1PKfS0_Pfi
/*0000*/ /*0x00005de428004404*/ MOV R1, c [0x1] [0x100];
/*0008*/ /*0x94001c042c000000*/ S2R R0, SR_CTAid_X;
/*0010*/ /*0x8400dc042c000000*/ S2R R3, SR_Tid_X;
/*0018*/ /*0x50011de428004000*/ MOV R4, c [0x0] [0x14];
/*0020*/ /*0x98009c042c000000*/ S2R R2, SR_CTAid_Y;
/*0028*/ /*0x2000dca320064000*/ IMAD R3, R0, c [0x0] [0x8], R3;
/*0030*/ /*0x20401ca350004000*/ IMUL R0, R4, c [0x0] [0x8];
/*0038*/ /*0x08001ca320060000*/ IMAD R0, R0, R2, R3;
/*0040*/ /*0xe001dc23188e4000*/ ISETP.LT.AND P0, pt, R0, c [0x0] [0x38], pt;
/*0048*/ /*0x8000a1e740000001*/ @!P0 BRA.U 0xb0;
/*0050*/ /*0x100141e218000000*/ @P0 MOV32I R5, 0x4;
/*0058*/ /*0x100100e35000c000*/ @P0 IMUL.HI R4, R0, 0x4;
/*0060*/ /*0x800080a3200b8000*/ @P0 IMAD R2.CC, R0, R5, c [0x0] [0x20];
/*0068*/ /*0x9040c04348004000*/ @P0 IADD.X R3, R4, c [0x0] [0x24];
/*0070*/ /*0xa00200a3200b8000*/ @P0 IMAD R8.CC, R0, R5, c [0x0] [0x28];
/*0078*/ /*0x0020c18584000000*/ @P0 LD.E.CG R3, [R2];
/*0080*/ /*0xb042404348004000*/ @P0 IADD.X R9, R4, c [0x0] [0x2c];
/*0088*/ /*0xc00180a3200b8000*/ @P0 IMAD R6.CC, R0, R5, c [0x0] [0x30];
/*0090*/ /*0x0080818584000000*/ @P0 LD.E.CG R2, [R8];
/*0098*/ /*0xd041c04348004000*/ @P0 IADD.X R7, R4, c [0x0] [0x34];
/*00a0*/ /*0x0830000050000000*/ @P0 FADD R0, R3, R2;
/*00a8*/ /*0x0060008594000000*/ @P0 ST.E [R6], R0;
/*00b0*/ /*0x00001de780000000*/ EXIT;

ìì ì¦ Z««u“fí h™ou

Assembly code for Micro2

Function : _Z6micro2PKfS0_S0_S0_Pfi
/*0000*/ /*0x00005de428004404*/ MOV R1, c [0x1] [0x100];
/*0008*/ /*0x94001c042c000000*/ S2R R0, SR_CTAid_X;
/*0010*/ /*0x8400dc042c000000*/ S2R R3, SR_Tid_X;
/*0018*/ /*0x50011de428004000*/ MOV R4, c [0x0] [0x14];
/*0020*/ /*0x98009c042c000000*/ S2R R2, SR_CTAid_Y;
/*0028*/ /*0x2000dca320064000*/ IMAD R3, R0, c [0x0] [0x8], R3;
/*0030*/ /*0x20401ca350004000*/ IMUL R0, R4, c [0x0] [0x8];
/*0038*/ /*0x08001ca320060000*/ IMAD R0, R0, R2, R3;
/*0040*/ /*0x2001dc23188e4001*/ ISETP.LT.AND P0, pt, R0, c [0x0] [0x48], pt;
/*0048*/ /*0x8000a1e740000002*/ @!P0 BRA.U 0xf0;
/*0050*/ /*0x100301e218000000*/ @P0 MOV32I R12, 0x4;
/*0058*/ /*0x100240e35000c000*/ @P0 IMUL.HI R9, R0, 0x4;
/*0060*/ /*0x800280a320198000*/ @P0 IMAD R10.CC, R0, R12, c [0x0] [0x20];
/*0068*/ /*0x9092c04348004000*/ @P0 IADD.X R11, R9, c [0x0] [0x24];
/*0070*/ /*0xa00180a320198000*/ @P0 IMAD R6.CC, R0, R12, c [0x0] [0x28];
/*0078*/ /*0x00a2018584000000*/ @P0 LD.E.CG R8, [R10];
/*0080*/ /*0xb091c04348004000*/ @P0 IADD.X R7, R9, c [0x0] [0x2c];
/*0088*/ /*0xc00080a320198000*/ @P0 IMAD R2.CC, R0, R12, c [0x0] [0x30];
/*0090*/ /*0x0061c18584000000*/ @P0 LD.E.CG R7, [R6];
/*0098*/ /*0xd090c04348004000*/ @P0 IADD.X R3, R9, c [0x0] [0x34];
/*00a0*/ /*0xe00100a320198000*/ @P0 IMAD R4.CC, R0, R12, c [0x0] [0x38];
/*00a8*/ /*0x0020818584000000*/ @P0 LD.E.CG R2, [R2];
/*00b0*/ /*0xf091404348004000*/ @P0 IADD.X R5, R9, c [0x0] [0x3c];
/*00b8*/ /*0x000180a320198001*/ @P0 IMAD R6.CC, R0, R12, c [0x0] [0x40];
/*00c0*/ /*0x0041018584000000*/ @P0 LD.E.CG R4, [R4];
/*00c8*/ /*0x1c80c00050000000*/ @P0 FADD R3, R8, R7;
/*00d0*/ /*0x1091c04348004001*/ @P0 IADD.X R7, R9, c [0x0] [0x44];
/*00d8*/ /*0x0830000050000000*/ @P0 FADD R0, R3, R2;
/*00e0*/ /*0x1000000050000000*/ @P0 FADD R0, R0, R4;
/*00e8*/ /*0x0060008594000000*/ @P0 ST.E [R6], R0;
/*00f0*/ /*0x00001de780000000*/ EXIT;
......

Assembly code for Micro3

Function : _Z6micro3PKfS0_S0_S0_S0_S0_Pfi
/*0000*/ /*0x00005de428004404*/ MOV R1, c [0x1] [0x100];
/*0008*/ /*0x94001c042c000000*/ S2R R0, SR_CTAid_X;
/*0010*/ /*0x8400dc042c000000*/ S2R R3, SR_Tid_X;
/*0018*/ /*0x50011de428004000*/ MOV R4, c [0x0] [0x14];
/*0020*/ /*0x98009c042c000000*/ S2R R2, SR_CTAid_Y;
/*0028*/ /*0x2000dca320064000*/ IMAD R3, R0, c [0x0] [0x8], R3;
/*0030*/ /*0x20401ca350004000*/ IMUL R0, R4, c [0x0] [0x8];
/*0038*/ /*0x08001ca320060000*/ IMAD R0, R0, R2, R3;
/*0040*/ /*0x6001dc23188e4001*/ ISETP.LT.AND P0, pt, R0, c [0x0] [0x58], pt;
/*0048*/ /*0x8000a1e740000003*/ @!P0 BRA.U 0x130;
/*0050*/ /*0x100341e218000000*/ @P0 MOV32I R13, 0x4;
/*0058*/ /*0x100300e35000c000*/ @P0 IMUL.HI R12, R0, 0x4;
/*0060*/ /*0x800200a3201b8000*/ @P0 IMAD R8.CC, R0, R13, c [0x0] [0x20];
/*0068*/ /*0x90c2404348004000*/ @P0 IADD.X R9, R12, c [0x0] [0x24];
/*0070*/ /*0xa00180a3201b8000*/ @P0 IMAD R6.CC, R0, R13, c [0x0] [0x28];
/*0078*/ /*0x0082418584000000*/ @P0 LD.E.CG R9, [R8];
/*0080*/ /*0xb0c1c04348004000*/ @P0 IADD.X R7, R12, c [0x0] [0x2c];
/*0088*/ /*0xc00100a3201b8000*/ @P0 IMAD R4.CC, R0, R13, c [0x0] [0x30];
/*0090*/ /*0x0063818584000000*/ @P0 LD.E.CG R14, [R6];
/*0098*/ /*0xd0c1404348004000*/ @P0 IADD.X R5, R12, c [0x0] [0x34];
/*00a0*/ /*0xe00080a3201b8000*/ @P0 IMAD R2.CC, R0, R13, c [0x0] [0x38];
/*00a8*/ /*0x0041418584000000*/ @P0 LD.E.CG R5, [R4];
/*00b0*/ /*0xf0c0c04348004000*/ @P0 IADD.X R3, R12, c [0x0] [0x3c];
/*00b8*/ /*0x000280a3201b8001*/ @P0 IMAD R10.CC, R0, R13, c [0x0] [0x40];
/*00c0*/ /*0x0020c18584000000*/ @P0 LD.E.CG R3, [R2];
/*00c8*/ /*0x10c2c04348004001*/ @P0 IADD.X R11, R12, c [0x0] [0x44];
/*00d0*/ /*0x200180a3201b8001*/ @P0 IMAD R6.CC, R0, R13, c [0x0] [0x48];
/*00d8*/ /*0x00a2c18584000000*/ @P0 LD.E.CG R11, [R10];
/*00e0*/ /*0x30c1c04348004001*/ @P0 IADD.X R7, R12, c [0x0] [0x4c];
/*00e8*/ /*0x0060818584000000*/ @P0 LD.E.CG R2, [R6];
/*00f0*/ /*0x3891000050000000*/ @P0 FADD R4, R9, R14;
/*00f8*/ /*0x1441000050000000*/ @P0 FADD R4, R4, R5;
/*0100*/ /*0x0c40c00050000000*/ @P0 FADD R3, R4, R3;
/*0108*/ /*0x400100a3201b8001*/ @P0 IMAD R4.CC, R0, R13, c [0x0] [0x50];
/*0110*/ /*0x2c30000050000000*/ @P0 FADD R0, R3, R11;
/*0118*/ /*0x50c1404348004001*/ @P0 IADD.X R5, R12, c [0x0] [0x54];
/*0120*/ /*0x0800000050000000*/ @P0 FADD R0, R0, R2;
/*0128*/ /*0x0040008594000000*/ @P0 ST.E [R4], R0;
/*0130*/ /*0x00001de780000000*/ EXIT;

Assembly code for Micro4

Function : _Z6micro4PKfS0_S0_S0_S0_S0_S0_S0_S0_S0_Pfi
/*0000*/ /*0x00005de428004404*/ MOV R1, c [0x1] [0x100];
/*0008*/ /*0x94001c042c000000*/ S2R R0, SR_CTAid_X;
/*0010*/ /*0x8400dc042c000000*/ S2R R3, SR_Tid_X;
/*0018*/ /*0x50011de428004000*/ MOV R4, c [0x0] [0x14];
/*0020*/ /*0x98009c042c000000*/ S2R R2, SR_CTAid_Y;
/*0028*/ /*0x2000dca320064000*/ IMAD R3, R0, c [0x0] [0x8], R3;
/*0030*/ /*0x20401ca350004000*/ IMUL R0, R4, c [0x0] [0x8];
/*0038*/ /*0x08001ca320060000*/ IMAD R0, R0, R2, R3;
/*0040*/ /*0xe001dc23188e4001*/ ISETP.LT.AND P0, pt, R0, c [0x0] [0x78], pt;
/*0048*/ /*0x8000a1e740000005*/ @!P0 BRA.U 0x1b0;
/*0050*/ /*0x100441e218000000*/ @P0 MOV32I R17, 0x4;
/*0058*/ /*0x100400e35000c000*/ @P0 IMUL.HI R16, R0, 0x4;
/*0060*/ /*0x800280a320238000*/ @P0 IMAD R10.CC, R0, R17, c [0x0] [0x20];
/*0068*/ /*0x9102c04348004000*/ @P0 IADD.X R11, R16, c [0x0] [0x24];
/*0070*/ /*0xa00200a320238000*/ @P0 IMAD R8.CC, R0, R17, c [0x0] [0x28];
/*0078*/ /*0x00a2c18584000000*/ @P0 LD.E.CG R11, [R10];
/*0080*/ /*0xb102404348004000*/ @P0 IADD.X R9, R16, c [0x0] [0x2c];
/*0088*/ /*0xc00180a320238000*/ @P0 IMAD R6.CC, R0, R17, c [0x0] [0x30];
/*0090*/ /*0x0082018584000000*/ @P0 LD.E.CG R8, [R8];
/*0098*/ /*0xd101c04348004000*/ @P0 IADD.X R7, R16, c [0x0] [0x34];
/*00a0*/ /*0xe00100a320238000*/ @P0 IMAD R4.CC, R0, R17, c [0x0] [0x38];
/*00a8*/ /*0x0064818584000000*/ @P0 LD.E.CG R18, [R6];
/*00b0*/ /*0xf101404348004000*/ @P0 IADD.X R5, R16, c [0x0] [0x3c];
/*00b8*/ /*0x000080a320238001*/ @P0 IMAD R2.CC, R0, R17, c [0x0] [0x40];
/*00c0*/ /*0x0042818584000000*/ @P0 LD.E.CG R10, [R4];
/*00c8*/ /*0x1100c04348004001*/ @P0 IADD.X R3, R16, c [0x0] [0x44];
/*00d0*/ /*0x200300a320238001*/ @P0 IMAD R12.CC, R0, R17, c [0x0] [0x48];
/*00d8*/ /*0x0020c18584000000*/ @P0 LD.E.CG R3, [R2];
/*00e0*/ /*0x3103404348004001*/ @P0 IADD.X R13, R16, c [0x0] [0x4c];
/*00e8*/ /*0x400380a320238001*/ @P0 IMAD R14.CC, R0, R17, c [0x0] [0x50];
/*00f0*/ /*0x00c2418584000000*/ @P0 LD.E.CG R9, [R12];
/*00f8*/ /*0x5103c04348004001*/ @P0 IADD.X R15, R16, c [0x0] [0x54];
/*0100*/ /*0x600180a320238001*/ @P0 IMAD R6.CC, R0, R17, c [0x0] [0x58];
/*0108*/ /*0x00e3c18584000000*/ @P0 LD.E.CG R15, [R14];
/*0110*/ /*0x7101c04348004001*/ @P0 IADD.X R7, R16, c [0x0] [0x5c];
/*0118*/ /*0x800100a320238001*/ @P0 IMAD R4.CC, R0, R17, c [0x0] [0x60];
/*0120*/ /*0x0061c18584000000*/ @P0 LD.E.CG R7, [R6];
/*0128*/ /*0x9101404348004001*/ @P0 IADD.X R5, R16, c [0x0] [0x64];
/*0130*/ /*0xa00300a320238001*/ @P0 IMAD R12.CC, R0, R17, c [0x0] [0x68];
/*0138*/ /*0x0041418584000000*/ @P0 LD.E.CG R5, [R4];
/*0140*/ /*0xb103404348004001*/ @P0 IADD.X R13, R16, c [0x0] [0x6c];
/*0148*/ /*0x00c0818584000000*/ @P0 LD.E.CG R2, [R12];
/*0150*/ /*0x20b1000050000000*/ @P0 FADD R4, R11, R8;
/*0158*/ /*0x4841000050000000*/ @P0 FADD R4, R4, R18;
/*0160*/ /*0x2841000050000000*/ @P0 FADD R4, R4, R10;
/*0168*/ /*0x0c40c00050000000*/ @P0 FADD R3, R4, R3;
/*0170*/ /*0xc00100a320238001*/ @P0 IMAD R4.CC, R0, R17, c [0x0] [0x70];
/*0178*/ /*0x2430c00050000000*/ @P0 FADD R3, R3, R9;
/*0180*/ /*0x3c30c00050000000*/ @P0 FADD R3, R3, R15;
/*0188*/ /*0x1c30c00050000000*/ @P0 FADD R3, R3, R7;
/*0190*/ /*0x1430000050000000*/ @P0 FADD R0, R3, R5;
/*0198*/ /*0xd101404348004001*/ @P0 IADD.X R5, R16, c [0x0] [0x74];
/*01a0*/ /*0x0800000050000000*/ @P0 FADD R0, R0, R2;
/*01a8*/ /*0x0040008594000000*/ @P0 ST.E [R4], R0;
/*01b0*/ /*0x00001de780000000*/ EXIT;
......

Power measurements

Table 1: Microbenchmark 1

Warps  Shader freq (MHz)  Exec. time (s)  Power (W)  Energy (J)  EDP
16     810                1.882           138.25     260.187     489.671
16     1000               1.560           151.04     235.622     367.571
16     1200               1.382           162.85     225.059     311.031
16     1400               1.254           176.17     220.917     277.030
16     1600               1.172           187.57     219.832     257.643
32     810                1.177           158.79     186.896     219.976
32     1000               0.978           174.19     170.358     166.610
32     1200               0.891           188.19     167.677     149.400
32     1400               0.863           197.68     170.598     147.226
32     1600               0.852           207.56     176.841     150.669
48     810                1.031           166.65     171.816     177.142
48     1000               0.869           183.83     159.748     138.821
48     1200               0.838           195.41     163.754     137.226
48     1400               0.841           204.66     172.119     144.752
48     1600               0.839           210.58     176.677     148.232
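The derived columns in these tables are related by Energy = Power * Time and EDP = Energy * Time. As a quick consistency check (a sketch, using a few 16-warp rows of Table 1):

```python
# Spot check: the Energy and EDP columns of the measurement tables follow
# from the measured time and power as E = P*t and EDP = E*t.
# Rows: 16-warp entries of Table 1 (Microbenchmark 1).

rows = [  # (shader MHz, time s, power W, energy J, EDP)
    (810,  1.882, 138.25, 260.187, 489.671),
    (1000, 1.560, 151.04, 235.622, 367.571),
    (1600, 1.172, 187.57, 219.832, 257.643),
]

for f, t, p, e, edp in rows:
    assert abs(p * t - e) < 0.5, f    # energy column matches P*t
    assert abs(e * t - edp) < 0.5, f  # EDP column matches E*t
print("energy and EDP columns are consistent with P*t and E*t")
```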


Table 2: Microbenchmark 2

Warps  Shader freq (MHz)  Exec. time (s)  Power (W)  Energy (J)  EDP
16     810                2.250           152.09     342.203     769.956
16     1000               1.861           167.69     312.071     580.764
16     1200               1.672           181.99     304.287     508.768
16     1400               1.565           195.26     305.582     478.236
16     1600               1.500           206.39     309.585     464.378
32     810                1.744           171.27     298.695     520.924
32     1000               1.460           189.01     275.955     402.894
32     1200               1.405           201.79     283.515     398.339
32     1400               1.415           211.39     299.117     423.250
32     1600               1.413           221.51     312.994     442.260
48     810                1.781           171.91     306.172     545.292
48     1000               1.499           189.05     283.386     424.796
48     1200               1.422           203.13     288.851     410.746
48     1400               1.423           213.58     303.924     432.484
48     1600               1.419           220.34     312.662     443.668

Table 3: Microbenchmark 3

Warps  Shader freq (MHz)  Exec. time (s)  Power (W)  Energy (J)  EDP
16     810                2.682           159.51     427.806     1,147.375
16     1000               2.226           176.38     392.622     873.976
16     1200               2.039           191.84     391.162     797.579
16     1400               1.993           203.27     405.117     807.398
16     1600               1.972           214.03     422.067     832.316
32     810                2.542           170.01     432.165     1,098.564
32     1000               2.129           187.98     400.209     852.046
32     1200               2.005           201.68     404.368     810.759
32     1400               1.994           213.29     425.300     848.049
32     1600               1.987           223.03     443.161     880.560
48     810                2.529           171.06     432.611     1,094.073
48     1000               2.110           189.22     399.254     842.426
48     1200               2.002           203.28     406.967     814.747
48     1400               1.997           215.21     429.774     858.259
48     1600               1.992           225.49     449.176     894.759

Table 4: Microbenchmark 4

Warps  Shader freq (MHz)  Exec. time (s)  Power (W)  Energy (J)  EDP
16     810                4.173           163.14     680.783     2,840.908
16     1000               3.455           180.17     622.487     2,150.694
16     1200               3.194           194.97     622.734     1,989.013
16     1400               3.136           207.15     649.622     2,037.216
16     1600               3.099           218.49     677.101     2,098.334
32     810                3.952           174.14     688.201     2,719.771
32     1000               3.284           192.97     633.713     2,081.115
32     1200               3.122           206.05     643.288     2,008.345
32     1400               3.125           218.02     681.313     2,129.102
32     1600               3.118           225.84     704.169     2,195.599
48     810                3.945           172.35     679.921     2,682.287
48     1000               3.278           192.41     630.720     2,067.500
48     1200               3.119           204.84     638.896     1,992.716
48     1400               3.133           218.42     684.310     2,143.943
48     1600               3.128           228.55     714.904     2,236.221

Table 5: Benchmarks at memory clock 2004 MHz (data rate 4008 MHz, maximum bandwidth 192.4 GB/s)

Benchmark                   Shader freq (MHz)  Exec. time (s)  Power (W)  Energy (J)  EDP
Matrix Multiplication       810                4.548           131.01     595.833     2,709.851
Matrix Multiplication       1000               3.696           145.22     536.733     1,983.766
Matrix Multiplication       1200               3.119           161.51     503.750     1,571.195
Matrix Multiplication       1400               2.667           177.77     474.113     1,264.458
Matrix Multiplication       1600               2.351           194.36     456.940     1,074.267
Matrix Transpose            810                1.923           149.96     288.373     554.541
Matrix Transpose            1000               1.562           165.45     258.433     403.672
Matrix Transpose            1200               1.322           183.22     242.217     320.211
Matrix Transpose            1400               1.149           200.55     230.432     264.766
Matrix Transpose            1600               1.034           217.65     225.050     232.702
3D FDTD                     810                1.649           133.74     220.537     363.666
3D FDTD                     1000               1.350           146.91     198.329     267.743
3D FDTD                     1200               1.157           161.31     186.636     215.937
3D FDTD                     1400               1.010           176.62     178.386     180.170
3D FDTD                     1600               0.912           190.19     173.453     158.189
Conjugate Gradient          810                0.206           147.02     30.286      6.239
Conjugate Gradient          1000               0.171           161.94     27.692      4.735
Conjugate Gradient          1200               0.148           177.31     26.242      3.884
Conjugate Gradient          1400               0.132           192.72     25.439      3.358
Conjugate Gradient          1600               0.123           206.82     25.439      3.129
BFS (Breadth First Search)  810                0.107           100.17     10.718      1.147
BFS (Breadth First Search)  1000               0.087           105.68     9.194       0.800
BFS (Breadth First Search)  1200               0.074           109.35     8.092       0.599
BFS (Breadth First Search)  1400               0.064           113.73     7.279       0.466
BFS (Breadth First Search)  1600               0.057           119.27     6.798       0.388
Radix Sort                  810                0.934           152.98     142.883     133.453
Radix Sort                  1000               0.762           166.53     126.896     96.695
Radix Sort                  1200               0.642           184.07     118.173     75.867
Radix Sort                  1400               0.551           201.63     111.098     61.215
Radix Sort                  1600               0.505           225.53     113.893     57.516
Histogram                   810                5.362           134.97     723.709     3,880.528
Histogram                   1000               4.341           149.02     646.896     2,808.175
Histogram                   1200               3.651           165.18     603.072     2,201.817
Histogram                   1400               3.105           182.85     567.749     1,762.861
Histogram                   1600               2.726           200.44     546.399     1,489.485
Merge Sort                  810                4.551           133.84     609.106     2,772.041
Merge Sort                  1000               3.695           146.49     541.281     2,000.032
Merge Sort                  1200               3.136           161.21     505.555     1,585.419
Merge Sort                  1400               2.718           176.39     479.428     1,303.085
Merge Sort                  1600               2.424           191.18     463.420     1,123.331
Eigenvalues                 810                5.048           98.07      495.057     2,499.050
Eigenvalues                 1000               4.094           102.08     417.916     1,710.946
Eigenvalues                 1200               3.443           106.86     367.919     1,266.745
Eigenvalues                 1400               2.933           111.67     327.528     960.640
Eigenvalues                 1600               2.574           116.44     299.717     771.470
Black-Scholes               810                16.003          154.32     2,469.583   39,520.736
Black-Scholes               1000               12.971          171.61     2,225.953   28,872.840
Black-Scholes               1200               10.908          191.48     2,088.664   22,783.145
Black-Scholes               1400               9.287           214.39     1,991.040   18,490.788
Black-Scholes               1600               8.163           238.15     1,944.018   15,869.023
Scalar Product              810                3.674           156.36     574.467     2,110.590
Scalar Product              1000               3.019           172.03     519.359     1,567.944
Scalar Product              1200               2.731           188.08     513.646     1,402.769
Scalar Product              1400               2.562           202.85     519.702     1,331.476
Scalar Product              1600               2.445           217.85     532.643     1,302.313

Table 6: Benchmarks at memory clock 1500 MHz (data rate 3000 MHz, maximum bandwidth 144.4 GB/s)

Benchmark       Shader freq (MHz)  Exec. time (s)  Power (W)  Energy (J)  EDP
Black-Scholes   810                16.014          148.08     2,371.353   37,974.849
Black-Scholes   1000               12.989          166.31     2,160.201   28,058.845
Black-Scholes   1200               10.933          185.74     2,030.695   22,201.593
Black-Scholes   1400               9.323           207.76     1,936.946   18,058.152
Black-Scholes   1600               8.216           231.41     1,901.265   15,620.790
Eigenvalues     810                5.057           92.51      467.823     2,365.781
Eigenvalues     1000               4.103           96.42      395.611     1,623.193
Eigenvalues     1200               3.451           100.37     346.377     1,195.347
Eigenvalues     1400               2.937           106.04     311.439     914.698
Eigenvalues     1600               2.581           108.58     280.245     723.312
3D FDTD         810                1.683           129.16     217.376     365.844
3D FDTD         1000               1.397           141.92     198.262     276.972
3D FDTD         1200               1.211           153.89     186.361     225.683
3D FDTD         1400               1.071           167.75     179.660     192.416
3D FDTD         1600               0.978           181.29     177.302     173.401
Scalar Product  810                3.773           147.79     557.612     2,103.869
Scalar Product  1000               3.379           160.87     543.580     1,836.756
Scalar Product  1200               3.207           174.09     558.307     1,790.489
Scalar Product  1400               3.069           189.09     580.317     1,780.994
Scalar Product  1600               3.001           202.09     606.472     1,820.023

Table 7: Benchmarks at memory clock 1000 MHz (data rate 2000 MHz, maximum bandwidth 96 GB/s)

Benchmark       Shader freq (MHz)  Exec. time (s)  Power (W)  Energy (J)  EDP
Black-Scholes   810                27.473          92.96      2,553.890   70,163.022
Black-Scholes   1000               27.611          99.91      2,758.615   76,168.119
Black-Scholes   1200               27.881          106.73     2,975.739   82,966.583
Black-Scholes   1400               27.997          114.75     3,212.656   89,944.723
Black-Scholes   1600               28.158          122.95     3,462.026   97,483.731
Eigenvalues     810                5.058           55.32      279.809     1,415.272
Eigenvalues     1000               4.101           57.89      237.407     973.606
Eigenvalues     1200               3.452           61.69      212.954     735.117
Eigenvalues     1400               2.938           66.16      194.378     571.083
Eigenvalues     1600               2.583           70.43      181.921     469.901
3D FDTD         810                3.006           78.60      236.272     710.232
3D FDTD         1000               2.954           82.58      243.941     720.603
3D FDTD         1200               2.949           86.75      255.826     754.430
3D FDTD         1400               2.957           91.01      269.117     795.778
3D FDTD         1600               2.968           95.11      282.286     837.826
Scalar Product  810                13.171          90.03      1,185.785   15,617.976
Scalar Product  1000               13.022          98.99      1,289.048   16,785.980
Scalar Product  1200               12.961          108.31     1,403.806   18,194.728
Scalar Product  1400               12.917          118.67     1,532.860   19,799.958
Scalar Product  1600               12.887          128.25     1,652.758   21,299.089

Table 8: Benchmarks at memory clocks 1000 MHz (data rate 2000 MHz, maximum bandwidth 96 GB/s), 1500 MHz (data rate 3000 MHz, maximum bandwidth 144.4 GB/s), and 2004 MHz (data rate 4008 MHz, maximum bandwidth 192.4 GB/s) for Black-Scholes, Eigenvalues, 3D FDTD, and Scalar Product. (Side-by-side comparison; the values duplicate those of Tables 5-7, rounded to two decimal places.)