Comparing Performance and Energy Efficiency of FPGAs and GPUs for High Productivity Computing

Brahim Betkaoui, David B. Thomas, Wayne Luk
Department of Computing, Imperial College London
London, United Kingdom
{bb105,dt10,wl}@imperial.ac.uk

Abstract—This paper provides the first comparison of performance and energy efficiency of high productivity computing systems based on FPGA (Field-Programmable Gate Array) and GPU (Graphics Processing Unit) technologies. The search for higher performance compute solutions has recently led to great interest in heterogeneous systems containing FPGA and GPU accelerators. While these accelerators can provide significant performance improvements, they can also require much more design effort than a pure software solution, reducing programmer productivity. The CUDA system has provided a high productivity approach for programming GPUs. This paper evaluates the High-Productivity Reconfigurable Computing (HPRC) approach to FPGA programming, where a commodity CPU instruction set architecture is augmented with instructions which execute on a specialised FPGA co-processor, allowing the CPU and FPGA to co-operate closely while providing a programming model similar to that of traditional software. To compare the GPU and FPGA approaches, we select a set of established benchmarks with different memory access characteristics, and compare their performance and energy efficiency on an FPGA-based Hybrid-Core system with a GPU-based system. Our results show that while GPUs excel at streaming applications, high-productivity reconfigurable computing systems outperform GPUs in applications with poor locality characteristics and low memory bandwidth requirements.

I. INTRODUCTION

In recent years clusters built from commodity processors have been a common choice for High Performance Computing (HPC), as they are cheap to purchase and operate. However, HPC systems constructed from conventional microprocessors now face two key problems: the reduction in year-on-year performance gains for CPUs, and the increasing cost of power supply and cooling as clusters grow larger.

One way of addressing this problem is to use a heterogeneous computer system, where commodity processors are augmented with specialized hardware that can accelerate specific kernels. Examples of specialized hardware include Graphics Processing Units (GPUs) and Field Programmable Gate Arrays (FPGAs). Such hardware accelerators can offer higher performance and energy efficiency compared to commodity processors. However, programming these co-processing architectures, in particular FPGAs, requires software developers to learn a whole new set of skills and hardware design concepts, and accelerated application development takes more time than producing a pure software version.

One potential solution to the problem of programming FPGA-based accelerated systems is High Productivity Reconfigurable Computing (HPRC). We define a HPRC system as a high performance computing system that relies on reconfigurable hardware to boost the performance of commodity general-purpose processors, while providing a programming model similar to that of traditional software. In a HPRC system, problems are described using a more familiar high-level language, which allows software developers to quickly improve the performance of their applications using reconfigurable hardware. Our main contributions are:

• Performance evaluation of benchmarks with different memory characteristics on two high-productivity platforms: an FPGA-based Hybrid-Core system, and a GPU-based system.
• Energy efficiency comparison of these two platforms based on actual power measurements.
• A discussion of the advantages and disadvantages of using FPGAs and GPUs for high-productivity computing.

II. RELATED WORK

Several researchers have explored ways to make FPGA programming easier by giving developers a more familiar C-style language instead of hardware description languages [1], [2]. However, these languages require a significant proportion of an existing application to be rewritten using a programming model specific to the accelerator, which is problematic in an environment where large amounts of legacy code must be maintained.

There is previous work on comparing the performance of FPGAs and GPUs. For example, it is reported that while FPGA and GPU implementations for real-time optical flow have similar performance, the FPGA implementation takes 12 times longer to develop [3]. Another study shows that FPGA can be 15 times faster and 61 times more energy efficient than GPU for uniform random number generation [4]. A third study shows that for many-body simulation, a GPU implementation is 11 times faster than an FPGA implementation, but the FPGA is 15 times better than the GPU in terms of performance per Watt [5].

Prior work on comparing FPGAs and GPUs for high productivity computing used a set of non-standard benchmarks that target different architectures [6], such as asynchronous and partially synchronous tree architectures. Using the benchmarks in [6] results in an analysis that considers processes at the architectural level. However, these benchmarks do not cover applications with different memory access characteristics, and they do not address the performance of GPUs for non-streaming applications in comparison to a HPRC system.

In our work, we take a different approach by selecting specific benchmarks with different memory locality characteristics. Our benchmark results lead us to a different conclusion from the one reported in [6] about HPRC systems being marginalised by GPUs. Moreover, we also provide a comparison of energy efficiency for FPGA and GPU technologies.

TABLE I
LIMITS TO PERFORMANCE OF BENCHMARKS

Program                        Main performance limiting factor
STREAM                         Memory Bandwidth limited
Dense matrix multiplication    Computationally limited
Fast Fourier Transform         Memory Latency limited
Monte-Carlo Methods            Parallelism limited

III. CHARACTERISING PRODUCTIVITY AND LOCALITY

Creating a meaningful productivity measure which can be applied to programming models and architectures would require the aggregation of results from many individual projects, and is outside the scope of this paper. Instead we restrict our productivity study to the idea that if the same programmer effort is applied to two systems, then the system providing the highest performance solution is the most productive. This stems from the following equation that was developed in [7]:

Productivity = Relative Speedup / Relative Effort    (1)

So in our work, we will restrict our productivity comparison to using a common programming model based on familiar high-level programming languages and development tools. The benchmarks are developed using the C language plus platform-specific extensions, using comparable development efforts in terms of design time and programmer training. In other words, productivity will be dictated by the relative speedup achieved by a platform:

Productivity(GPU) / Productivity(FPGA) = Relative Speedup(GPU) / Relative Speedup(FPGA)    (2)

In contrast to previous work [6], we select a number of established benchmarks based on the HPC Challenge (HPCC) Benchmark Suite [8] and the Berkeley technical report on Parallel Computing Research [9] to measure the relative speedup for each platform. Four benchmark programs are used:

• The STREAM benchmark.
• Dense matrix multiplication.
• Fast Fourier Transform (FFT).
• Monte-Carlo methods for pricing Asian options.

The authors in [9] identify the main performance limiting factor for these benchmarks, shown in Table I. Some of these benchmarks are part of the HPCC benchmarks, which aim to explore spatial and temporal locality by looking at streaming and random memory access type benchmarks, as illustrated in Fig. 1. Where possible these benchmark programs are implemented using vendor libraries, which are optimised for each platform.

Fig. 1. Benchmark applications as a function of memory access characteristics

A. STREAM

The STREAM benchmark [8] is a simple synthetic benchmark that is used to measure sustained memory bandwidth to main memory using the following four long vector operations:

COPY:  c ← a
SCALE: b ← αc
ADD:   c ← a + b
TRIAD: a ← b + αc

where a, b, c ∈ R^m; α ∈ R.

The STREAM benchmark is designed in such a way that data re-use cannot be achieved. A general rule of this benchmark is that each vector must have about 4 times the size of all the last-level caches used in the run. In our work, we used arrays of 32 million floating-point elements (4 bytes for each element), which require over 300MB of memory. Each vector kernel is timed separately, and the memory bandwidth is estimated by dividing the total number of bytes read and written by the time it takes to complete the corresponding operation.
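To make the kernels concrete, the loops below are a minimal single-threaded C sketch of the four STREAM operations; the array length, scalar value and function names are ours, not the tuned implementations benchmarked on either platform.

#include <stddef.h>

#define N (32 * 1024 * 1024)   /* 32 million single-precision elements, as used in this work */

static float a[N], b[N], c[N];
static const float alpha = 3.0f;

/* COPY:  c <- a          (2 arrays touched per element) */
void stream_copy(void)  { for (size_t i = 0; i < N; i++) c[i] = a[i]; }
/* SCALE: b <- alpha*c    (2 arrays touched per element) */
void stream_scale(void) { for (size_t i = 0; i < N; i++) b[i] = alpha * c[i]; }
/* ADD:   c <- a + b      (3 arrays touched per element) */
void stream_add(void)   { for (size_t i = 0; i < N; i++) c[i] = a[i] + b[i]; }
/* TRIAD: a <- b + alpha*c (3 arrays touched per element) */
void stream_triad(void) { for (size_t i = 0; i < N; i++) a[i] = b[i] + alpha * c[i]; }

Timing each kernel separately and dividing the bytes moved (two or three arrays of 4-byte elements) by the elapsed time gives the sustained bandwidth figures reported later in Tables III and IV.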
B. Dense Matrix Multiplication

Dense floating-point matrix-matrix multiplication is a vital kernel in many scientific applications. It is one of the most important kernels in the LINPACK benchmark, as it is the building block for many higher-level kernels. The importance of this benchmark has led HPC system vendors to both optimize their hardware and provide optimised libraries for this benchmark. In our work, we used vendor-provided libraries for the matrix multiplication benchmark, in order to achieve optimum or near-optimal performance results for each platform.

The SGEMM routine in the BLAS library performs single precision matrix-matrix multiplication, defined as follows:

C ← βC + αAB

where A, B, C ∈ R^(n×n); α, β ∈ R.
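As an illustration of the interface being exercised, the fragment below calls SGEMM through the standard CBLAS binding; it is a generic sketch assuming square n×n row-major matrices, not the exact MKL, CML or CUBLAS invocations used in our measurements.

#include <cblas.h>   /* CBLAS interface, e.g. from Intel MKL or any other BLAS vendor */

/* C <- beta*C + alpha*A*B for single-precision n-by-n matrices. */
void sgemm_example(int n, float alpha, float beta,
                   const float *A, const float *B, float *C)
{
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n, n, n, alpha, A, n, B, n, beta, C, n);
}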

C. Fast Fourier Transform

The Discrete Fourier Transform (DFT) plays an important role for a variety of areas, such as biomedical engineering, mechanical analysis, analysis of stock market data, geophysical analysis, and radar [10]. The Fast Fourier Transform [11] is an efficient algorithm for calculating the DFT and its inverse. In particular, the FFT performs only O(N log N) floating-point operations on O(N) data, so it requires not only high computation throughput but also high memory bandwidth [12]. In addition, unlike SGEMM, FFT requires non-unit stride memory access, and hence exhibits low spatial locality [13].

D. Monte-Carlo Methods for Asian Options

Monte Carlo methods [14] are a class of algorithms that use pseudo-random numbers to perform simulations, allowing the approximate solution of problems which have no tractable closed-form solution. In this work, we examine financial Monte-Carlo simulations for pricing Asian options. Asian options are a form of derivative which provides a payoff related to the arithmetic average of the price of an underlying asset during the option life-time:

P_call = max( (1/(n+1)) Σ_{i=0}^{n} S(t_i) − K, 0 )    (3)

where P_call is the payoff of the Asian call option, S(t_i) is the asset price at time t_i, and K is the strike price.

Unlike European options, which can be priced using the Black-Scholes equation [15], there is no closed-form solution for pricing Asian options. Instead, Monte-Carlo methods are used to calculate the payoff of Asian options. Using Monte-Carlo methods leads to accurate results at the expense of a large number of simulations, due to the slow convergence of this method [16].

We choose Monte Carlo methods as one of the benchmarks for two reasons. First, they are widely used in many applications including finance, physics, and biology. Second, they support highly parallel execution [9] with low memory bandwidth requirements, which would map well onto FPGAs and GPUs.
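To illustrate how equation (3) is evaluated by simulation, the sketch below prices an arithmetic-average Asian call with a plain sequential Monte-Carlo loop. The geometric Brownian motion price model, its parameters and the use of the C library rand() are our illustrative assumptions; the benchmarked implementations rely on vendor random number generators and platform-specific parallelisation.

#include <math.h>
#include <stdlib.h>

/* Monte-Carlo estimate of the expected payoff of an arithmetic-average
   Asian call: n_paths price paths, each with n_steps time steps.       */
double asian_call_payoff(double S0, double K, double r, double sigma,
                         double T, int n_steps, long n_paths)
{
    double dt = T / n_steps, sum = 0.0;
    for (long p = 0; p < n_paths; p++) {
        double S = S0, avg = S0;
        for (int i = 1; i <= n_steps; i++) {
            /* Box-Muller transform: two uniforms -> one standard normal draw */
            double u1 = (rand() + 1.0) / ((double)RAND_MAX + 2.0);
            double u2 = (rand() + 1.0) / ((double)RAND_MAX + 2.0);
            double z  = sqrt(-2.0 * log(u1)) * cos(6.283185307179586 * u2); /* 2*pi */
            /* assumed geometric Brownian motion step for the asset price */
            S  *= exp((r - 0.5 * sigma * sigma) * dt + sigma * sqrt(dt) * z);
            avg += S;
        }
        avg /= (n_steps + 1);                 /* average over the n+1 prices S(t_0)..S(t_n) */
        sum += (avg > K) ? (avg - K) : 0.0;   /* max(avg - K, 0), as in equation (3)        */
    }
    return sum / n_paths;                     /* averaged payoff; discounting omitted       */
}

The slow (1/sqrt(n_paths)) convergence mentioned above is what drives the large path counts, and therefore the heavy but highly parallel workloads, used in the benchmark.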
IV. PRODUCTIVE RECONFIGURABLE COMPUTING

A. Convey HC-1 Architecture

Convey's Hybrid-Core computer is an example of a HPRC system that achieves a compromise between application-specific hardware and architectural integration. In the Hybrid-Core approach, the commodity instruction set, Intel's x86-64, is augmented with application-specific instruction sets (referred to as personalities), which are implemented on an FPGA-based coprocessor to accelerate HPC applications. A key feature of this architecture is the support of multiple instruction sets in a single address space [17]. The Convey Hybrid-Core computer, HC-1, has access to two pools of physical memory: the host memory pool with up to 128GB of physical memory located on the x86 motherboard, and the coprocessor memory pool with up to 128GB of memory, located on the coprocessor board. An overview of the architecture of the HC-1 is shown in Fig. 2.

Fig. 2. HC-1 system organisation

The coprocessor contains three main components: the Application Engine Hub (AEH), the Memory Controllers (MCs), and the Application Engines (AEs). The AEH represents the central hub for the coprocessor. It implements the interface to the host processor via a Front-Side Bus (FSB) port, an instruction fetch/decode unit, and a scalar processor to execute scalar instructions. All extended instructions are passed to the application engines. The coprocessor has eight memory controllers that support a total of 16 DDR2 memory channels, offering an aggregate of over 80GB/s of bandwidth to ECC-protected memory. Finally, the AEs implement the extended instructions that are responsible for any performance gains. Each Application Engine is connected to all eight memory controllers via a network of point-to-point links to provide high sustained bandwidth without a cache hierarchy.

The Convey HC-1 used in this work has a single multi-core Intel Xeon 5138 processor running at 2.13GHz with 8GB of RAM. The HC-1 coprocessor is configured with 16GB of accelerator-local memory. Its AEs consist of four Xilinx V5LX330 FPGAs running at 300MHz. The memory controllers are implemented on eight Xilinx V5LX155 FPGAs, while the AEH is implemented on two Xilinx V5LX155 FPGAs [18].

B. HC-1 Development Model

The HC-1 programming model is shown in Fig. 3. A number of key features distinguish the Convey programming model from other coprocessor programming models [19]. Applications are coded in standard C, C++, or Fortran. Performance can then be improved by adding directives and pragmas to guide the compiler. In an existing application, a routine or section of code may be compiled to run on the host processor, the coprocessor, or both.

Fig. 3. Programming model of the Convey Hybrid-Core

A unified compiler is used to generate both x86 and coprocessor instructions, which are integrated into a single executable and can execute on both standard x86 nodes and accelerated Hybrid-Core nodes. The coprocessor's extended instruction set is defined by the personality specified when compiling the application.

Porting an application to take advantage of the Convey coprocessor capabilities can be achieved using one or more of the following approaches:

• Use the Convey Mathematical Libraries (CML) [20], which provide a set of functions optimised for the accelerator.
• Compile one or more routines with Convey's compiler. This can be performed by using the Convey auto-vectorization tool to automatically select and vectorize DO/FOR loops for execution on the coprocessor. Directives and pragmas can also be manually inserted in the source code, to explicitly indicate which part of a routine should execute on the coprocessor (see the sketch after this list).
• Hand-code routines in assembly language, using both standard instructions and personality-specific accelerator instructions. These routines can then be called from C, C++, or Fortran code.
• Develop a custom personality using Convey's Personality Development Kit (PDK).
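To make the second approach concrete, the fragment below shows the kind of unit-stride FOR loop that the auto-vectorizer targets; the vendor-specific directive spelling is deliberately omitted, since it is defined by Convey's compiler documentation rather than by this paper.

/* A simple vectorisable loop: the Convey compiler can select it automatically,
   or a Convey-specific pragma/directive placed just above the loop can mark it
   explicitly for coprocessor execution (exact directive syntax is vendor-defined). */
void saxpy(long n, float alpha, const float *x, float *y)
{
    for (long i = 0; i < n; i++)
        y[i] = alpha * x[i] + y[i];
}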

C. Convey Personalities

A personality groups together a set of instructions designed for a specific application or class of applications, such as finance analytics, bio-informatics, or signal processing. Personalities are stored as pre-compiled FPGA bit files, and the server dynamically reconfigures between bit-files at run-time to provide applications with the required personality. Convey provides a number of personalities optimized for certain types of algorithms, such as floating-point vector personalities and a financial analytics personality. In our work, we used three Convey personalities: the single-precision vector personality, the double-precision vector personality, and the financial analytics personality.

D. Optimisation for the 4 benchmarks

The implementation of the STREAM benchmark on the HC-1 is straightforward, as it only consists of vector operations. For the SGEMM and FFT benchmarks, we use the Convey Mathematical Libraries (CML) [20]. The CML library provides a collection of linear algebra and signal processing routines such as BLAS and FFT. In addition, Convey provides a single-precision floating-point vector personality which implements the CML libraries, as well as vector operations for both integer and floating-point data types. This personality is based on a load-store architecture with modern latency-hiding features. Fig. 4 shows a diagram of one of the 32 function pipes that are available in the single-precision personality.

Fig. 4. A function pipe in the Convey single-precision personality

The double-precision version has only two Fused Multiply-Add (FMA) units and one additional unit per pipe. We can calculate the peak floating-point performance of the HC-1 for the single and double precision vector personalities as follows:

FLOPS(peak) = Number of FP units × Clock Frequency

The single-precision personality has 32×4 fused multiply-add units, which means it can perform 32×4×2 floating-point operations per clock cycle. The vector personalities are clocked at 300 MHz, and so the floating-point peak performance of the HC-1, when the vector personalities are loaded onto the coprocessor, is 76.8 GFlops and 38.4 GFlops for single and double precision respectively.
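Spelling this out with the FMA counts quoted above (two floating-point operations per fused multiply-add, 300 MHz clock, and assuming the double-precision personality also provides 32 function pipes, which is consistent with the quoted 38.4 GFlops):

Single precision: 32 pipes × 4 FMA units × 2 flops × 300 MHz = 76.8 GFlops
Double precision: 32 pipes × 2 FMA units × 2 flops × 300 MHz = 38.4 GFlops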
Finally, the Asian option pricing benchmark contains two parts: pseudo-random number generation, and Monte-Carlo based simulation paths. The financial analytics personality provides custom hardware for generating random numbers using the Mersenne-Twister algorithm. The second part of the benchmark contains nested loops that can be vectorised for efficient execution on the HC-1 coprocessor.

V. GPU-BASED COMPUTING

A. CUDA architecture

A CUDA GPU [21] has a massively-parallel many-core architecture that supports numerous fine-grained threads, and has fast shared on-chip memories that allow data exchange between threads.

In this work, we use NVIDIA's Tesla C1060 GPU, which has 240 streaming processors running at 1.3GHz. It has a theoretical peak single-precision floating-point performance of 933 GFlops. It also has access to 4GB of GDDR3 memory at 800MHz, offering up to 102GB/sec of memory bandwidth [22].

B. CUDA Development Model

The NVIDIA CUDA SDK environment provides software developers with a set of high-level languages such as C and Fortran. These programming languages are extended to support the key abstractions of CUDA such as hierarchical thread groups, shared memories, and barrier synchronization. CUDA C extends C by allowing the programmer to implement kernels as C functions that are executed in parallel by CUDA threads. These threads can be grouped in thread blocks, where they share the processing and memory resources of the same processor core. The blocks are organized into a one, two, or three dimensional grid of thread blocks, with the number of thread blocks chosen according to the size of the processed data and the processing resources. The CUDA programming model assumes that the CUDA threads execute on a physically separate device, operating as a coprocessor to the host running the C program. This is the case, for example, when the kernels execute on a GPU and the rest of the C program executes on a CPU.
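As a minimal illustration of these abstractions, the kernel below mirrors the TRIAD loop shown earlier: each CUDA thread handles one element, threads are grouped into blocks, and blocks form a one-dimensional grid. The kernel name, block size and launch wrapper are ours, not the benchmark code itself.

#include <cuda_runtime.h>

/* TRIAD-style kernel: one element per thread, a[i] = b[i] + alpha*c[i]. */
__global__ void triad_kernel(float *a, const float *b, const float *c,
                             float alpha, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   /* global thread index */
    if (i < n)
        a[i] = b[i] + alpha * c[i];
}

void launch_triad(float *d_a, const float *d_b, const float *d_c,
                  float alpha, int n)
{
    int threads = 256;                          /* threads per block (tunable)      */
    int blocks  = (n + threads - 1) / threads;  /* 1-D grid covering all n elements */
    triad_kernel<<<blocks, threads>>>(d_a, d_b, d_c, alpha, n);
    cudaDeviceSynchronize();                    /* wait for the kernel to complete  */
}

Launching with the operands already resident in GPU memory isolates kernel time from the host-device transfer time discussed in the results below.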
C. Development and optimisation of the benchmarks

The STREAM benchmark is implemented as 4 CUDA kernels, with each kernel being executed by a large number of threads to achieve optimum performance. The number of threads is set so that the first vector kernel yields the same performance as the CUBLAS:sscopy routine. For dense matrix multiplication we use the CUBLAS library [23], provided by nVidia for linear algebra kernels. Given the low double-precision floating-point peak performance (78 GFLOPS) of the Tesla C1060 compared to its single-precision peak performance (933 GFLOPS), we only consider the single-precision version (SGEMM). (For a double-precision comparison, it would be more appropriate to use the new Tesla C20x series, which has a double-precision floating-point peak performance of 515 GFlops, but these cards were still unavailable at the time of writing.)

The Asian option pricing application is implemented using two kernels: a pseudo-random number generator (PRNG), and a path simulator. The PRNG is implemented using nVidia's RNG. For the path simulator kernel, each thread simulates a price movement path and then the results from each thread are summed in a hierarchical fashion, starting with threads from the same block using shared memory. The partial results are then stored in the global memory, and final aggregation takes place for the valuation of the option price.

VI. POWER MEASUREMENT

Full system power is measured using the Olson remote power monitoring meter [24]. A diagram of the power measurement system is shown in Fig. 5, which consists of three main components: the system under test (SUT), the remote power meter, and the computer that receives the power readings over the network. The SUT is connected to the mains supply via the remote power meter, which captures and logs the power measurements once every second through an Ethernet connection. The SUT is either the HC-1 coprocessor, the CPU (using one or multiple threads), or the Tesla C1060 GPU. Power is measured by calculating the difference between idle power and active power, so we only measure the dynamic power. The reason for only measuring dynamic power is that the GPU and the HC-1 coprocessor have different base configurations, such as power supplies and CPUs. In addition, since the resolution of the power meter is one second, we make sure that each accelerated kernel has a minimum execution time of 5-10 seconds, so that fluctuations are averaged out.

Fig. 5. Power measurement setup

Table II shows the measured dynamic power of each platform for three benchmarks. The HC-1 coprocessor has the highest power consumption for all three benchmarks – this is not unexpected, as the HC-1 coprocessor contains 13 Virtex-5 FPGAs and 1024 memory banks. Given the average power consumption and the execution time of a benchmark, we can calculate its energy efficiency, measured as either the number of floating-point operations per second per Watt (Flops/Watt), or the number of Monte-Carlo simulation paths per Joule.

TABLE II
DYNAMIC POWER MEASUREMENTS

Component              SGEMM   FFT    Monte-Carlo methods
CPU - single-threaded  27W     31W    20W
CPU - multi-threaded   55W     65W    31W
HC-1 Coprocessor       79W     103W   170W
Tesla C1060            71W     85W    138W

VII. PERFORMANCE AND ENERGY EFFICIENCY COMPARISON

In the following, the results for performance and energy efficiency comparison are provided.

A. STREAM

Tables III and IV show the results for the STREAM benchmark on the HC-1 server and the Tesla C1060 GPU respectively. The size of the vectors used in our work is 32 million elements. From these results, we can see that the sequential memory bandwidth sustained by the GPU is about two times larger than that of the HC-1. These figures, when combined with the peak floating-point performance of GPUs and the HC-1, suggest that the GPU is likely to outperform the HC-1 for streaming programs with intensive computation and bandwidth requirements.

TABLE III
STREAM BENCHMARK RESULTS FOR HC-1

Function  Average Rate (GB/s)  Min Rate (GB/s)  Max Rate (GB/s)
COPY      35.82                35.77            35.88
SCALE     35.10                35.06            35.16
ADD       42.09                42.06            42.12
TRIAD     44.75                44.61            44.86

TABLE IV
STREAM BENCHMARK RESULTS FOR TESLA C1060

Function  Average Rate (GB/s)  Min Rate (GB/s)  Max Rate (GB/s)
COPY      73.84                73.24            74.34
SCALE     74.00                73.52            74.34
ADD       71.14                70.87            71.46
TRIAD     71.23                71.05            71.43

B. Dense Matrix Multiplication

In our work, we use the Intel MKL [25] for software matrix multiplication. The matrix-matrix multiplication benchmark results are shown in Fig. 6 and Fig. 7. The GPU is a clear winner in terms of both performance (up to 370 GFLOPS) and power efficiency (over 5 GFLOPS/Watt). This is expected, since the performance of the matrix multiplication benchmark is computation bound, and the Tesla C1060 has over 10 times more floating-point peak performance than the HC-1. In addition, the HC-1 implementation offers no significant speed-up over a multi-threaded MKL implementation running on an Intel Xeon E5420 Quad-Core CPU.

Fig. 6. SGEMM - floating-point performance (MKL sequential, MKL threaded, CUBLAS and CML, with and without memory transfer, versus matrix size N)

Fig. 7. SGEMM - energy efficiency (MFLOPS/Watt for the same implementations, versus matrix size N)

The GPU is about 5 times faster than both the CPU and the Convey coprocessor. This speed-up decreases to about 2.5 to 4.2 times if we include data transfer from the main memory to the GPU memory, while the HC-1 coprocessor can be slower than the CPU when data transfers from the host processor memory to the coprocessor memory are taken into account.

C. Fast Fourier Transform

In this work, we use the popular FFTW [26] for the CPU implementation, as FFTW supports multi-threading and is reported to be more efficient than its Intel MKL counterpart. Fig. 8 shows the performance of a one-dimensional in-place single-precision complex-to-complex FFT on the three platforms. The HC-1 outperforms the GPU (CUFFT) by up to a factor of three for large FFT sizes, and is about 16 times faster than the single-threaded FFTW, or about 4 times faster than the multi-threaded version.

Fig. 8. 1-dimensional in-place single-precision complex-to-complex FFT (GFLOPS versus FFT size for HC-1, CuFFT and FFTW, with and without memory transfer)

The HC-1 implementation of the FFT achieves over 65 GFlops, thanks to its coprocessor memory subsystem that is optimized for non-unit stride and random memory accesses [17]. The Tesla C1060 uses GDDR memories which are optimised for sequential memory access operations and stream programming for graphics applications. This leads to a performance penalty if GPU applications, such as FFT, involve non-sequential memory accesses [12].
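For reference, the GPU side of this benchmark reduces to a few CUFFT calls; the sketch below performs a 1-D in-place single-precision complex-to-complex transform on device memory (batch size, error handling and timing are omitted, and this is not the paper's exact harness).

#include <cufft.h>

/* In-place forward C2C FFT of length n on device data d_signal.          */
/* FFT performance figures are conventionally reported as 5*n*log2(n)     */
/* floating-point operations per transform divided by the execution time. */
void gpu_fft_inplace(cufftComplex *d_signal, int n)
{
    cufftHandle plan;
    cufftPlan1d(&plan, n, CUFFT_C2C, 1);                    /* 1-D plan, batch of 1   */
    cufftExecC2C(plan, d_signal, d_signal, CUFFT_FORWARD);  /* same pointer: in-place */
    cufftDestroy(plan);
}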

To further investigate the effect of strided memory access on the performance of the HC-1 and Tesla C1060 GPU, we conduct the following experiment. We measure the effective bandwidth that can be achieved with strided memory access, and how it compares to a baseline sequential memory access bandwidth. We used the BLAS routine blas:sscopy available to each platform. This routine copies a real vector into another real vector. The increment between two consecutive elements in each vector can be specified, i.e. the stride parameter. The results of this experiment for vectors of 32 million elements are shown in Fig. 9. As the stride parameter gets larger, the data transfer rate of the GPU becomes slower than that of the HC-1.

Fig. 9. Memory bandwidth for stride memory access (bandwidth in MB/s versus memory stride, for the Tesla C1060 and the HC-1)
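The sketch below shows how such a measurement can be reproduced with the standard CBLAS copy routine, passing the stride as the increment arguments; the timer and the bandwidth accounting are our own assumptions, not the exact harness used here.

#include <cblas.h>
#include <time.h>

/* Effective bandwidth (MB/s) of copying n single-precision elements with a
   given stride; x and y must each hold at least n*stride elements.        */
double strided_copy_bandwidth(const float *x, float *y, int n, int stride)
{
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    cblas_scopy(n, x, stride, y, stride);      /* y[i*stride] <- x[i*stride]     */
    clock_gettime(CLOCK_MONOTONIC, &t1);
    double secs  = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    double bytes = 2.0 * n * sizeof(float);    /* n elements read plus n written */
    return bytes / secs / 1e6;
}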

However, the floating-point performance of the HC-1 drops significantly if data transfer times between host memory and coprocessor memory are taken into account. Fortunately, there are various applications that execute many FFT operations between memory transfers from CPU memory to coprocessor memory. An example of such an application is the protein-protein docking simulator ZDock [27], which uses a scoring scheme to determine the best docking positions. At the heart of ZDock is an FFT used for computing convolutions, and once the host processor has sent the initial input data, the coprocessor can perform multiple FFTs without significant data transfer.

In terms of energy efficiency, the HC-1 generally fares better than the GPU and the CPU, as shown in Fig. 10. It is twice as energy efficient as the GPU, and about 6 times more energy efficient than a multi-threaded CPU implementation for large FFT sizes.

Fig. 10. Energy efficiency for 1-dimensional in-place single-precision complex-to-complex FFT (MFLOPS/Watt versus FFT size)

D. Monte-Carlo Methods: Asian Option Pricing

Table V shows the performance of an HC-1 implementation of an Asian option pricing application, compared to the performance of CPU and GPU implementations. For the Asian option pricing benchmark, we select the same parameters as in [16], i.e. one million simulations over a time period of 356 steps. We use an Intel Xeon 5138 dual-core processor running at 2.13GHz with 8GB of memory for the CPU implementation. We also include performance results of an optimised FPGA implementation [16] coded in a hardware description language (HDL) targeting a Xilinx xc5lx330t FPGA clocked at 200MHz.

TABLE V
PERFORMANCE RESULTS FOR ASIAN OPTION PRICING

Implementation         Execution time          Speed-up
                       Single      Double      Single   Double
CPU single-threaded    8,333 ms    14,727 ms   0.53x    0.57x
CPU multi-threaded     4,446 ms    8,378 ms    1x       1x
Convey HC-1            -           471 ms      -        17.8x
Tesla C1060 [16]       440 ms      1,078 ms    10x      7.7x
HDL-coded FPGA [16]    115 ms      -           38.6x    -

These results show that using the HC-1 coprocessor yields a performance improvement of up to 18 times over an optimized multi-threaded software implementation. The performance of the HC-1 is about the same as a single-precision GPU implementation, and is twice as fast as the double-precision version. The major reason for this performance improvement is the vectorization of the FOR loops, which form the bottleneck in the option pricing benchmark. Moreover, the random number generator is implemented in the HC-1 as a custom hardware library, whereas the CUDA GPU must use an instruction-based approach. Note that currently the finance analytics personality for the HC-1 does not support single-precision floating-point operations, so it cannot be compared directly with a hand-crafted single-FPGA version in single-precision floating-point arithmetic [16], also shown in Table V; however, there is no doubt that the hand-crafted version involves much more development effort.

In terms of energy efficiency, the GPU and the HC-1 coprocessor are only about 2 to 4 times more energy efficient than the CPU, as shown in Table VI, with the HC-1 about two times more energy efficient than the GPU. The measured GPU and HC-1 power consumption is relatively higher than for the other benchmarks. This can be explained by the fact that Monte-Carlo methods are 'embarrassingly parallel', which leads to near full utilisation of the hardware resources on the HC-1 and the GPU.
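As a cross-check, the simulation-paths-per-Joule figures in Table VI are consistent with the dynamic power in Table II and the execution times in Table V, given the one million simulation paths stated above, for example:

Convey HC-1 (double precision):      10^6 paths / (170 W × 0.471 s) ≈ 12,489 paths per Joule
Tesla C1060 (single precision) [16]: 10^6 paths / (138 W × 0.440 s) ≈ 16,469 paths per Joule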

TABLE VI
ENERGY EFFICIENCY FOR ASIAN OPTION PRICING

Implementation         Sim. Paths per Joule    Comparison vs CPU
                       Single      Double      Single   Double
CPU single-threaded    6,002       3,396       1x       1x
CPU multi-threaded     7,255       3,886       1.2x     1.14x
Convey HC-1            -           12,489      -        3.7x
Tesla C1060 [16]       16,469      6,722       2.74x    2x

VIII. CONCLUSIONS

This paper provides a comparison of FPGAs and GPUs for high productivity computing. Using a number of established benchmarks and a common development model, we evaluate the performance of a commercially-available high-productivity reconfigurable computing system (HPRCS), the Convey HC-1 server. We also compare its performance and energy efficiency to those of a multi-core CPU as well as a GPU-based HPC system. The selected benchmarks for our work have different memory access patterns, ranging from high locality memory characteristics to low locality characteristics. The HC-1 and the GPU outperform the CPU for all of our benchmark programs. From our results, we can summarize the following about some of the pros and cons of the HPRCS and the GPU:

• GPUs often perform faster than FPGAs for streaming applications. GPUs usually enjoy a higher floating-point performance and memory bandwidth than FPGA-based HPC systems. The streaming performance of the HC-1 is often limited by the floating-point computational performance achieved by FPGAs.
• The HC-1 employs a memory system optimised for non-sequential memory accesses, which makes it faster and more energy efficient than GPUs for applications such as the Fast Fourier Transform. GPUs, on the other hand, use GDDR memories which are optimized for sequential memory access operations, incurring a higher performance penalty for non-contiguous memory block transfers.
• The HC-1 demonstrates superior performance and energy efficiency for highly parallel applications requiring low memory bandwidth, as illustrated by the Monte Carlo benchmark for pricing Asian options.

ACKNOWLEDGEMENT

The support of the Imperial College London Research Excellence Award, the FP7 REFLECT (Rendering FPGAs for Multi-Core Embedded Computing) Project, the UK Engineering and Physical Sciences Research Council, HiPEAC, Convey Computer Corporation, and Xilinx, Inc. is gratefully acknowledged.

REFERENCES

[1] J. Xu, N. Subramanian, S. Hauck, and A. Alessio, "Impulse C vs. VHDL for accelerating tomographic reconstruction," 2009, University of Washington. [Online]. Available: http://ee.washington.edu/faculty/hauck/publications/ImpulsePET.pdf
[2] A. Putnam, D. Bennett, E. Dellinger, J. Mason, P. Sundararajan, and S. J. Eggers, "CHiMPS: A C-level compilation flow for hybrid CPU-FPGA architectures," in FPL'08. IEEE, 2008, pp. 173–178.
[3] J. Chase, B. Nelson, J. Bodily, Z. Wei, and D. Lee, "Real-time optical flow calculations on FPGA and GPU architectures: A comparison study," in 16th International Symposium on Field-Programmable Custom Computing Machines, 2008, pp. 173–182.
[4] D. Thomas, L. Howes, and W. Luk, "A comparison of CPUs, GPUs, FPGAs, and massively parallel processor arrays for random number generation," in Proceedings of the ACM/SIGDA International Symposium on Field Programmable Gate Arrays. ACM, 2009, pp. 63–72.
[5] T. Hamada, K. Benkrid, K. Nitadori, and M. Taiji, "A comparative study on ASIC, FPGAs, GPUs and general purpose processors in the O(N^2) gravitational N-body simulation," NASA/ESA Conference on Adaptive Hardware and Systems, vol. 0, pp. 447–452, 2009.
[6] D. H. Jones, A. Powell, C.-S. Bouganis, and P. Y. Cheung, "GPU versus FPGA for high productivity computing," in FPL'10, 2010.
[7] M. V. Zelkowitz, V. R. Basili, S. Asgari, L. Hochstein, J. K. Hollingsworth, and T. Nakamura, "Measuring productivity on high performance computers," in IEEE METRICS. IEEE Computer Society, 2005, pp. 19–22.
[8] P. Luszczek, J. J. Dongarra, D. Koester, R. Rabenseifner, B. Lucas, J. Kepner, J. McCalpin, D. Bailey, and D. Takahashi, "Introduction to the HPC Challenge benchmark suite," Lawrence Berkeley National Laboratory, Tech. Rep., 2005.
[9] K. Asanovic, R. Bodik, B. C. Catanzaro, J. J. Gebis, P. Husbands, K. Keutzer, D. A. Patterson, W. L. Plishker, J. Shalf, S. W. Williams, and K. A. Yelick, "The landscape of parallel computing research: a view from Berkeley," EECS Department, University of California, Berkeley, Tech. Rep. UCB/EECS-2006-183, Dec 2006.
[10] M. Yokokawa, K. Itakura, A. Uno, T. Ishihara, and Y. Kaneda, "16.4-Tflops direct numerical simulation of turbulence by a Fourier spectral method on the Earth Simulator," in SC'02: Proceedings of the 2002 ACM/IEEE Conference on Supercomputing. Los Alamitos, CA, USA: IEEE Computer Society Press, 2002, pp. 1–17.
[11] J. Cooley and J. Tukey, "An algorithm for the machine calculation of complex Fourier series," Mathematics of Computation, vol. 19, no. 90, pp. 297–301, 1965.
[12] A. Nukada, Y. Ogata, T. Endo, and S. Matsuoka, "Bandwidth intensive 3-D FFT kernel for GPUs using CUDA," in SC'08: Proceedings of the 2008 ACM/IEEE Conference on Supercomputing. Piscataway, NJ, USA: IEEE Press, 2008, pp. 1–11.
[13] V. Volkov and J. W. Demmel, "Benchmarking GPUs to tune dense linear algebra," in SC'08: Proceedings of the 2008 ACM/IEEE Conference on Supercomputing. Piscataway, NJ, USA: IEEE Press, 2008, pp. 1–11.
[14] N. Metropolis and S. Ulam, "The Monte Carlo method," Journal of the American Statistical Association, vol. 44, no. 247, pp. 335–341, 1949.
[15] F. Black and M. Scholes, "The pricing of options and corporate liabilities," Journal of Political Economy, vol. 81, no. 3, pp. 637–654, May–June 1973.
[16] A. H. Tse, D. B. Thomas, K. Tsoi, and W. Luk, "Efficient reconfigurable design for pricing Asian options," in HEART'10, 2010.
[17] Convey Reference Manual, Convey Computer Corporation, 2010.
[18] T. M. Brewer, "Instruction set innovations for the Convey HC-1 computer," IEEE Micro, vol. 30, pp. 70–79, 2010.
[19] Convey Programmers Guide, Convey Computer Corporation, 2010.
[20] Convey Mathematical Libraries User's Guide, Convey Computer Corporation, 2010.
[21] E. Lindholm, J. Nickolls, S. Oberman, and J. Montrym, "NVIDIA Tesla: A unified graphics and computing architecture," IEEE Micro, vol. 28, no. 2, pp. 39–55, 2008.
[22] NVIDIA CUDA: Compute Unified Device Architecture. [Online]. Available: http://www.developer.nvidia.com/object/cuda.html
[23] CUDA CUBLAS Library, NVIDIA, 2007.
[24] Olson Electronics Ltd., Olson Remote Power Monitoring. [Online]. Available: http://www.olson.co.uk/remote monitor range.htm
[25] Intel Math Kernel Library (MKL). [Online]. Available: http://software.intel.com/en-us/intel-mkl/
[26] FFTW, "Fastest Fourier Transform in the West – Homepage," 2003. [Online]. Available: http://www.fftw.org/
[27] R. Chen, L. Li, and Z. Weng, "ZDOCK: an initial-stage protein-docking algorithm," Proteins: Structure, Function, and Bioinformatics, vol. 52, no. 1, pp. 80–87, 2003.