Unleashing the Benefits of GPU Computing with ARM® Mali™-T600 series
Practical Use-Cases and Applications

Roberto Mijat
Visual Computing Marketing Manager
30 October 2013

Abstract

GPU computing on the ARM Mali-T600 series of GPUs offers a host of benefits: it accelerates data-parallel computation while reducing system workload; reduces platform energy consumption while increasing system throughput; and enhances your system's value by consolidating functionality while reducing programmer effort. In this paper, we show how ARM Mali-T600 GPUs deliver such benefits on shipping devices by analyzing ecosystem Partners' experiences in key use-case areas: advanced image processing, computational photography, multimedia codecs and computer vision.

Addressing today’s computational challenges

The relentless progress in mobile and embedded technologies poses a significant challenge, as computational requirements are fast outpacing capabilities. Screen resolutions continue to increase [1, 2], meaning many more pixels need to be computed. More sensors are integrated in devices, all capturing information that needs to be processed: understood, correlated, and acted upon. Content being consumed, and produced, is becoming richer and more complex. The increasing complexity of applications is driven by end-user expectations of an improved experience: when users upgrade their devices, they anticipate considerable improvements in features, usage and visual experience, without any compromise on battery life. The emergence of the smartphone as the primary camera device [3] brings additional requirements and underpins emerging fields such as computational photography [4]. All of these factors contribute to fuelling the need for more computational power.

Unfortunately, trends clearly show that processing power requirements are greatly outgrowing battery capacity. Since 2010, the battery capacity of smartphone devices has doubled.
In the same timeframe, processing power has increased by a factor of 12 [5]. Traditional solutions are no longer sustainable. Increasing the operational frequency of processors is bound by thermal limitations. Simply adding duplicate cores quickly runs into diminishing returns, as most problems don’t scale linearly or fall victim to Amdahl’s argument on program speed-up. It is therefore extremely important to focus on processing efficiency.

Parallelism is at the core of modern processor architecture design: it enables increased processing performance and efficiency. Superscalar CPUs implement instruction level parallelism (ILP). Single Instruction Multiple Data (SIMD) architectures enable faster computation of vector data. Simultaneous multithreading (SMT) is used to mitigate memory latency overheads. Multi-core SMP can provide significant performance uplift and energy savings by executing multiple threads/programs in parallel. SoC designers combine diverse accelerators on the same die, sharing a unified bus matrix. All these approaches enable increased performance and more efficient computation by doing things in parallel.

GPUs are extraordinary parallel compute engines. Their heritage is associated with processing large streams of pixels and triangles. Modern designs such as the ARM Mali-T600 series extend this by enabling the processing and acceleration of non-graphical tasks. Mali-T600 GPUs provide a very powerful tool to improve platform performance and energy efficiency, through a scalable multithreaded architecture that complements the characteristics of the CPU, and in particular NEON™, well. Emerging compute APIs such as OpenCL [6] and RenderScript [7, 8, 9] enable programmers to extract the phenomenal computational capabilities of the Mali-T600 GPUs, and facilitate the most efficient use of the processing resources of the host heterogeneous platform.

Copyright © 2013 ARM Limited. All rights reserved. The ARM logo is a registered trademark of ARM Ltd.
All other trademarks are the property of their respective owners and are acknowledged.

Benefits of GPU Compute

GPU computing on the ARM Mali-T600 series of GPUs brings a host of benefits. The architectural characteristics of the Mali-T600 series of GPUs enable them to process many parallel workloads much more efficiently than alternative processor solutions. GPU Compute accelerated applications can therefore consume less energy, which translates to longer battery life.

Where raw performance is the target, the computation of embarrassingly parallel workloads can also be significantly accelerated through GPU Compute. This may translate into an increased frame rate, or the ability to carry out more work in the same temporal/power budget, and can result in benefits such as improved UI responsiveness, more accurate physics simulation, and the ability to apply complex pre-/post-processing effects to multimedia on-device and in real time. In essence: a significantly improved end-user experience.

Heterogeneous compute APIs such as OpenCL and RenderScript are designed for concurrency. They allow the developer to migrate some of the load from the CPU to the GPU or another accelerator, or to distribute it between processors in order to enable better load-balancing across system resources. For example, a video codec may offload motion vector calculations to the GPU, enabling the CPU to operate with fewer cores and at lower frequencies, or to be available to compute additional tasks.

System designers may be influenced by various cost, flexibility and portability concerns to consider migrating functionality from dedicated hardware accelerators to software solutions which leverage the CPU/GPU subsystem. This approach is made viable and compelling by the additional computational power provided by the GPU, now exposed through industry-standard heterogeneous compute APIs.
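
The earlier remark that adding cores runs into diminishing returns can be made concrete with Amdahl's formula, S(n) = 1 / ((1 - p) + p / n), where p is the fraction of run time that can be parallelised and n is the number of parallel workers. The sketch below is illustrative only: the function names are hypothetical, and the value p = 0.80 is an assumption chosen for the example, not a figure from this paper or a measurement on Mali hardware.

```python
def amdahl_speedup(parallel_fraction: float, workers: int) -> float:
    """Overall speed-up when only `parallel_fraction` of the run time
    is spread across `workers` (Amdahl's formula)."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if workers < 1:
        raise ValueError("workers must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)


def speedup_limit(parallel_fraction: float) -> float:
    """Upper bound on the speed-up as the number of workers grows without limit."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    serial_fraction = 1.0 - parallel_fraction
    return float("inf") if serial_fraction == 0.0 else 1.0 / serial_fraction


if __name__ == "__main__":
    p = 0.80  # assumed parallel fraction, for illustration only
    for n in (1, 2, 4, 8, 16):
        print(f"workers={n:2d}  speed-up={amdahl_speedup(p, n):.2f}")
    print(f"limit for p={p}: {speedup_limit(p):.2f}")
```

Under this assumed 80% parallel fraction, doubling the workers from 8 to 16 only raises the speed-up from about 3.3x to 4x, and no number of workers can exceed 5x. This is why the serial remainder, and the efficiency with which the parallel part is executed, matter more than simply adding duplicate cores.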

Use-Cases and Applications

The application areas that can benefit from GPU Computing are countless. In this paper we will focus on emerging embedded and mobile applications like computer vision, computational photography, and media processing. Real-life experiences of our ecosystem Partners will be used to showcase the benefits of GPU Compute on Mali.

Advanced Image Processing

The Mali-T600 series has been proven to accelerate a variety of complex image processing algorithms. MulticoreWare® Inc. is a software vendor and provider of professional services that optimizes algorithms to leverage heterogeneous computing. They have implemented a library of complex image filters for RenderScript, optimized both for ARM CPU/NEON and Mali-T600 GPUs [10]. Measurements carried out on an off-the-shelf Google® Nexus™ 10 tablet device show significant and consistent performance uplift when utilizing Mali-T604™ GPU Compute acceleration, as shown in Table 1.

Table 1 - Image filters performance uplift using RenderScript on Mali-T600

Mali GPU Compute has also been demonstrated to improve performance and save energy for video pre-/post-processing. MulticoreWare’s transcoding Android™ application [11] utilises video processing plug-ins written in RenderScript. When the scripts were compiled and executed on the Mali-T604 GPU (so that the application was able to leverage both CPU and GPU processing power), the performance uplift was dramatic. For example, when applying de-shake correction to the video stream, the application executed 3.5x faster, whilst when up-scaling the video stream the performance uplift exceeded 6.7x.

GPU Compute with Mali-T604 was also demonstrated to improve battery life.
A battery drain test carried out using the de-shake algorithm recorded that, when the CPU and GPU were used together, the energy consumed by the application was reduced by over 80% compared to the standard execution mode (where both application and script reside on the CPU).

Figure 1 - Battery drain test measured on Google Nexus 10. 30 iterations of de-shake transcoding. Lower is better. (Source: MulticoreWare)

Synthesis Corporation®, a pioneering Japanese industry-academic cooperative venture and provider of semiconductor and software IP with a recent focus on image and video processing, has been collaborating with ARM on the optimization of selected proprietary technologies using GPU Compute on Mali [12]. Synthesis’ Super-resolution Scaler (a high-performance, high-quality image up-scaling solution) has been implemented using OpenGL™ ES and RenderScript compute for the Mali-T604 GPU, and is capable of achieving up to a five times performance increase in comparison to a conventional CPU-only implementation. Synthesis Corporation’s proprietary adaptive luminance/dynamic-range enhancement software library has been optimized for the Mali-T600 series of GPUs using OpenCL™ and delivers a 16x speed-up when using the Mali-T604 GPU for acceleration.

Figure 2 - OpenCL accelerated adaptive luminance/dynamic range enhancement algorithm using Mali-T604. (Source: Synthesis Corporation)

Computational Photography

Smartphone photography is becoming the most popular form of photo taking, and it is disrupting traditional photography. According to National Geographic [3], 37 percent of the images taken in the U.S. in 2011