A Case Study of Opencl on an Android Mobile GPU

A Case Study of OpenCL on an Android Mobile GPU James A. Ross∗, David A. Richiey, Song J. Parkz, Dale R. Shiresz, and Lori L. Pollockx ∗Engility Corporation, Chantilly, VA [email protected] yBrown Deer Technology, Forest Hill, MD [email protected] zU.S. Army Research Laboratory, APG, MD fsong.j.park.civ,[email protected] xUniversity of Delaware, Newark, DE [email protected] Abstract—An observation in supercomputing in the past GPUs through Android platforms may allow computationally decade illustrates the transition of pervasive commodity products demanding applications to achieve higher performance. How- being integrated with the world’s fastest system. Given today’s ever, the programmability of these heterogeneous systems for exploding popularity of mobile devices, we investigate the possi- bilities for high performance mobile computing. Because parallel effective HPC remains an open question. processing on mobile devices will be the key element in developing The main contributions of this paper are the following: a mobile and computationally powerful system, this study was • A case study in porting an OpenCL parallel benchmark designed to assess the computational capability of a GPU on a low-power, ARM-based mobile device. The methodology for to the mobile GPU on an Android device using a higher- executing computationally intensive benchmarks on a handheld level programming API that leverages OpenCL. mobile GPU is presented, including the practical aspects of • Empirical results from tuning an OpenCL N-body bench- working with the existing Android-based software stack and mark on a Qualcomm Adreno 320 mobile GPU in com- leveraging the OpenCL-based parallel programming model. The parison to other common architectures. empirical results provide the performance of an OpenCL N- body benchmark and an auto-tuning kernel parameterization • An evaluation of mobile Android platforms directed to- strategy. The achieved computational performance of the low- wards gathering an initial picture of the GPU capability power mobile Adreno GPU is compared with a quad-core ARM, in terms of architecture, hardware features, usability, and an x86 Intel processor, and a discrete AMD GPU. performance. Keywords-handheld GPU; OpenCL; Android; N-body; II. MOTIVATION I. INTRODUCTION Processing technologies continue to decrease in size, thus Mobile graphics processing units (GPUs) paired with ARM increasing the mobility of compute resources. The history of processors on low-power system-on-a-chip (SoC) devices have computing suggests that computational power trickles down been available for several years in smartphones and tablets. from stationary systems to mobile devices. As a case in Despite widespread adoption and deployment of discrete point, consider modern smartphones that now rival past super- GPUs as accelerators, there remains no clear path for pro- computer performance. Many individuals are walking around gramming heterogeneous systems with co-processors. The with compute-capable devices, which provides mobility of development of application programming interfaces (APIs) sophisticated computing. and programming methodologies for mixed architectures are a With the end of the frequency scaling single-core era [1], topic of active research and development. A major challenge [2], parallel computing technologies are ubiquitous and can be is the lack of availability and maturity in software technolo- found across a wide range of devices such as smartphones, gies that should, in principle, allow for programmability and laptops, and supercomputers. The economics of computing efficiency across the coupled disparate cores in these devices. and market forces favor low-cost, high-volume, and general- Handheld mobile GPU architectures now possess general- purpose processors rather than slowly evolving, performance- purpose compute capabilities and a software stack sufficient to centric, specialized products [3], [4]. The rise of clusters using begin exploring these devices as computational accelerators. commodity personal computers is an example of how the This development in the technology mirrors that of desk- popularity of the everyday PC transformed the supercomputing top, workstation, and large-scale high performance comput- community. Modern GPUs appear to satisfy this economic ing (HPC) platforms where GPUs are increasingly used to observation where video cards are reasonably priced, widely offload parallel algorithms for improved application perfor- available, and support general-purpose computing. An execu- mance. Leveraging the compute capability of modern mobile tion of floating-point intensive simulation on a low-power em- U.S. Government work not protected by U.S. copyright bedded GPU technology provides a comparative performance This recent development provides a significant opportunity to case study for potential future-generation HPC systems. finally benchmark these devices for computationally expen- Dedicated or distributed computing over mobile devices, sive applications using OpenCL. Notwithstanding this positive which are becoming increasingly heterogeneous, is an area development, the overall software stack and support remains that pushes this idea of high performance computing to the relatively complicated and immature. edge. Indeed, the U.S. military has seized on this trend Programming with OpenCL on mobile GPUs poses multiple and has recently approved Android phones for Soldiers [5]. challenges due to several limitations. The Android operating Limited data sharing over ad hoc networks, or specialized system and software stack are an obstacle for accessing mobile mobile devices with larger data storage, may alleviate memory GPU performance with vendor-supplied OpenCL implemen- storage problems for high resolution databases, but comput- tations. Since OpenCL is a C-based library, the developer ing efficiency and required time-to-solution remains an issue must utilize the Android Native Development Kit (NDK) to when high performance servers are not available to handle cross-compile the native executable. For general-purpose GPU processing requests. In these situations when mobile devices (GPGPU) computation, Google recommends RenderScript, have to do the heavy lifting in processing, understanding which is a Java-based API primarily used for supplemental and optimizing efficiency of mobile GPU technology can graphics effects within an Android App. However, using be critical. This is true for the military and first responders OpenCL allows the application developer to leverage an ex- where timely situational awareness is key in avoiding risk isting GPU code rather than porting code to RenderScript. To in dynamic environments where decisions have to be made the best of our knowledge, only a few people have executed rapidly. For example, this might involve optimized way-point OpenCL code on a mobile GPU under Android, as evident in planning for entry to or exit from a hostile environment. Line- notably few related publications [9], [10]. of-sight (LOS) types of applications will also be paramount As in the desktop computing environment, there are multi- in determining threats from explosives or hostile forces. Haz- ple vendors with GPU architectures available for integration ardous chemical leaks into the air will also require extensive into low power SoCs. The most notable examples are the processing to determine dispersion based on active weather Qualcomm Adreno, the ARM Mali, the NVIDIA GeForce patterns in the area. In all of these applications, mobile ULP, Vivante’s ScalarMorphic architecture, and the Imagina- processing can assist as a vital computational resource. tion Technologies PowerVR architectures. Each of these GPU architectures has been used in a number of mobile Android III. MOBILE GPUS FOR GENERAL-PURPOSE products, but presently only Adreno and Mali appear to have COMPUTATION conformant OpenCL implementations available for current Due to very tight restrictions on energy consumption, hardware in consumer products. mobile devices are not generally used for computationally Qualcomm has shipped an OpenCL implementation with intensive calculations. The theoretical throughput of mobile new phones that contain the Adreno 320 GPU. The Qualcomm ARM processors remains a staggering two orders of magnitude Snapdragon S4 Pro APQ8064i SoC is in the LG Nexus 4 lower than workstation CPUs, although this gap is generally mobile smartphone. The Snapdragon SoC contains a 1.5 GHz decreasing. Peak FLOPS for ARM Cortex processors are quad-core Krait ARM CPU and Adreno 320 GPU. While reported in [6] and GFLOPS for Xeon E5-4600 processors hardware details about the Adreno 320 GPU are sparse, are provided in [7]. Mobile GPUs can be used to improve existence of the OpenCL library is sufficient to access and interactivity where previous calculations could not be com- benchmark the capabilities of the device. puted within interactive time frames. The performance data presented here will illuminate that the computational through- IV. AUTO-TUNING OPENCL N-BODY BENCHMARK put of a mobile GPU goes further in closing the gap in mobile N-body is an algorithm used to solve Newton’s laws of device performance as compared to modern x86 desktops or motion for N particles subject to an inter-particle force. The workstations. basic algorithm requires the update of all particle positions OpenCL provides an industry standard for parallel pro- and velocities based on the distance of a given particle to gramming of heterogeneous

Load more