General Purpose Computing on Low-Power Embedded GPUs: Has It Come of Age?
Arian Maghazeh, Unmesh D. Bordoloi, Petru Eles, Zebo Peng
Department of Computer and Information Science, Linköpings Universitet, Sweden
E-mail: {arian.maghazeh, unmesh.bordoloi, petru.eles, zebo.peng}@liu.se
978-1-4799-0103-6/13/$31.00 ©2013 IEEE

Abstract: In this paper we evaluate the promise held by low-power GPUs for non-graphic workloads that arise in embedded systems. Towards this, we map and implement 5 benchmarks, which find utility in very different application domains, to an embedded GPU. Our results show that apart from accelerated performance, embedded GPUs are also promising because of their energy efficiency, which is an important design goal for battery-driven mobile devices. We show that adopting the same optimization strategies as those used for programming high-end GPUs might lead to worse performance on embedded GPUs. This is due to restricted features of embedded GPUs, such as limited or no user-defined memory, a small instruction set, and a limited number of registers, among others. We propose techniques to overcome such challenges, e.g., by distributing the workload between GPUs and multi-core CPUs, in the spirit of heterogeneous computation.

I. INTRODUCTION

Over the span of the last decade, researchers have re-engineered sequential algorithms from a wide spectrum of applications to harness the parallelism offered by GPUs (Graphics Processing Units) and demonstrated tremendous performance benefits [14]. In fact, brisk successes from early endeavors meant that GPGPU (General Purpose computing on Graphics Processing Units) was established as a specialization on its own. It is not uncommon, nowadays, to find commercial GPGPU applications in electronic design, scientific computing and defense, among others. As GPGPU research has matured, the emerging trend for the future is “heterogeneous computation”, where multi-cores, GPUs and other units are utilized synergistically to accelerate computationally expensive problems. Heterogeneous computation is well poised to become the de facto computational paradigm in the realm of servers and desktops [3].

Yet, in embedded devices, the role of GPUs has been limited until recently. Over the last 18 months, programmable GPUs have penetrated embedded platforms, providing graphics software designers with a powerful programmable engine. Typically, such embedded GPUs are programmed with OpenGL, Renderscript and other similar tools that require prior experience in graphics programming and, in some cases, even expertise on the underlying graphics pipeline hardware [14]. However, today, the stage seems to be set for a change, as several industrial players either have already released or are on the verge of releasing embedded platforms with low-power GPUs that are programmable by OpenCL. OpenCL is a framework for coding that enables seamless programming across heterogeneous units including multi-cores, GPUs and other processors. Vivante’s embedded graphics cores [17], ARM’s Mali [12] graphics processors, and the StemCell Processor [18] from ZiiLabs (Creative) are some examples of low-power embedded GPUs targeted for mobile devices.

Our contributions: We believe that the arrival of OpenCL-enabled GPUs gives us an opportunity to usher in the era of heterogeneous computation for embedded devices as well. However, unless GPGPU on low-power embedded platforms can be unleashed, heterogeneous computation will remain debilitated. The question has remained open whether (and what kind of) non-graphics workloads may benefit from embedded GPUs, given that powerful multi-core CPUs on the same chip are already a promising choice. Towards this, we mapped and implemented 5 benchmarks (Rijndael algorithm, bitcount, genetic programming, convolution and pattern matching) to an embedded GPU from Vivante. These algorithms are deployed in numerous potential applications including automotive (bitcount, convolution, security), radar or flight tracking (pattern matching) and image processing/augmented reality (convolution). Our choice was also driven by a desire to investigate whether computationally heavy optimization algorithms (genetic programming), typically not suitable on embedded platforms, may become feasible with the advent of low-power GPUs. Our results show that embedded GPUs are indeed sometimes attractive alternatives for non-graphics computation but, at the same time, intelligent tradeoffs/choices must be considered for each application between its GPU, multi-core and hybrid (heterogeneous) implementations. To the best of our knowledge, ours is the first paper to implement such non-graphic workloads on embedded GPUs and compare against sequential and multi-core implementations.

In the last 10 years, several hundred papers have been published on GPGPU but they were almost exclusively concerned with high-end GPUs, where maximizing the speedup has been the sole overarching aim. Our experiments reveal that adopting the same optimization strategies as those used for high-performance GPU computing might lead to worse performance on embedded GPUs. This is due to restricted features of embedded GPUs like limited or no user-defined memory and limited size of cache, among several others. Moreover, embedded GPUs share the same physical memory with the CPUs, which implies higher contention for memory, imposing new demands on programmers. We discuss techniques, such as distribution of the workload between GPUs and multi-core CPUs, in the spirit of heterogeneous computation, to overcome some of the challenges.

II. RELATED WORK

Major strides have been made in GPGPU [14] programming over the years. Almost all threads of work, however, singularly focused on improving performance without discussing other concerns that arise specifically in embedded GPUs. Hence, the existing body of work does not provide any insight into opportunities and challenges of embedded GPGPU programming. Of late, a few attempts have been made to bridge this gap; however, almost all of them evaluated their work on high-end GPUs.

In a recent paper, Gupta and Owens [4] discussed strategies for memory optimizations for a speech recognition application. However, they did not discuss the impact of their algorithm on power or energy consumption. In fact, they evaluated their performance results on a 9400M Nvidia GPU, which is not targeted towards low-power devices such as hand-held smartphones. Yet, we refer the interested reader to their paper for an excellent discussion on limitations that arise out of (i) on-chip memory sharing between GPU and CPU and (ii) limited or no L2 cache. These two characteristics are among several restrictions in embedded GPUs that we target in this paper.

We also note that Mu et al. [13] implemented the benchmarks from the High Performance Embedded Computing Challenge Benchmark from MIT [15] on a GPU. However, their paper suffers from two significant drawbacks. First and foremost, they did not study power or energy consumption, which is crucial for embedded devices. In fact, all the experiments were performed on the power-hungry Fermi machine from Nvidia. Second, their reported speedups do not include the overheads of data transfer between the CPU and GPU.

There has been some prior work [1], [6], [11], [16] related to modeling power and energy consumption on GPUs. Again, this thread of work has focused on high-performance GPU computing and desktop/server workloads, shedding no light on their suitability for low-power applications.

Researchers from Nokia published results [10] that they obtained from OpenCL programming of an embedded GPU. Unfortunately, this work was limited to an image processing

some benchmarks have a high ratio of compute-to-memory access while others have a low ratio of compute-to-memory access. Second, we believe that the benchmarks should cover a wide application range. As mentioned in Section I, the 5 benchmarks that we implemented may be utilized in automotive software, radar/flight tracking, image processing, augmented reality and optimization tools. We hope that our work will spur the embedded systems community to explore other potential applications on embedded GPUs. With these methodological choices in mind, we chose 5 benchmarks. The characteristics of each benchmark are discussed individually below.

We chose the Rijndael algorithm [2] that is specified in the Advanced Encryption Standard. Data security is becoming increasingly important as mobile and hand-held devices are beginning to host e-commerce activities. In fact, it is also gaining significance in vehicular systems as the internet continues to penetrate automotive applications. Moreover, the characteristics of the version we chose to implement inherently make it an excellent candidate to stress and test the memory bandwidth of the GPU because the algorithm relies heavily on look-up tables. It essentially has a low compute-to-memory access ratio.

Second, we selected the bitcount application from the Automotive and Industrial Control suite of MiBench [5]. Bitcount is an important benchmark because low-level embedded control applications often require bit manipulation and basic math functionalities. It counts the total number of bits set to 1 in an array of integers. Thus, it needs to sum several integers, thereby stressing the integer datapath of the GPU. The bitcount algorithm is also interesting because it involves two typical GPGPU algorithmic paradigms: (i) “synchronization” across the threads in the GPU and (ii) “reduction” in