Unleashing the Benefits of GPU Computing with ARM® Mali™-T600 series

Practical Use-Cases and Applications

Roberto Mijat

Visual Computing Marketing Manager

30 October 2013

Abstract

GPU computing on the ARM Mali-T600 series of GPUs offers a host of benefits: it accelerates data-parallel computation while reducing system workload; reduces platform energy consumption while increasing system throughput; and enhances your system's value by consolidating functionality while reducing programmer effort.

In this paper, we show how ARM Mali-T600 GPUs deliver such benefits on shipping devices by analyzing ecosystem Partners' experiences in key use-case areas: advanced image processing, computational photography, multimedia codecs and computer vision.

Addressing today’s computational challenges

The relentless progress in mobile and embedded technologies poses a significant challenge, as computational requirements are fast outpacing capabilities.

Screen resolutions continue to increase [1, 2], meaning many more pixels need to be computed. More sensors are integrated in devices, all capturing information that needs to be processed: understood, correlated, and acted upon. Content being consumed, and produced, is becoming richer and more complex. The increasing complexity of applications is driven by end-user expectations of an improved experience: when users upgrade their devices, they anticipate considerable improvements in features, usage and visual experience, without compromising battery life. The emergence of the smartphone as the primary camera device [3] brings additional requirements and underpins emerging fields such as computational photography [4]. All of these factors contribute to fuelling the need for more computational power.

Unfortunately, trends clearly show that processing power requirements are greatly outgrowing battery capacity. Since 2010, the battery capacity of smartphone devices has doubled. In the same timeframe, processing power has increased by a factor of 12 [5]. Traditional solutions are no longer sustainable. Increasing the operational frequency of processors is bound by thermal limitations. Simply adding duplicate cores quickly runs into diminishing returns, as most problems don’t scale linearly or fall victim to Amdahl’s argument on program speed-up.
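The diminishing returns from adding cores can be illustrated with Amdahl's law. The sketch below is ours rather than part of the original study, and the 80% parallel fraction is an assumed figure chosen only for illustration.

```python
def amdahl_speedup(parallel_fraction: float, cores: int) -> float:
    """Upper bound on speed-up when only part of a program runs in parallel.

    Amdahl's law: speedup = 1 / ((1 - p) + p / n), where p is the fraction
    of the run time that can be parallelised and n is the number of cores.
    """
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if cores < 1:
        raise ValueError("cores must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


if __name__ == "__main__":
    # Assumed workload: 80% of the run time is parallelisable.
    for cores in (1, 2, 4, 8, 16):
        print(f"{cores:2d} cores -> {amdahl_speedup(0.80, cores):.2f}x")
```

Under this assumption, going from 8 to 16 cores adds far less than going from 1 to 2, and no number of cores can exceed a 5x speed-up, because the serial 20% always remains.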

It is therefore extremely important to focus on processing efficiency. Parallelism is at the core of modern processor architecture design: it enables increased processing performance and efficiency. Superscalar CPUs implement instruction level parallelism (ILP). Single Instruction Multiple Data (SIMD) architectures enable faster computation of vector data. Simultaneous multithreading (SMT) is used to mitigate memory latency overheads. Multi-core SMP can provide significant performance uplift and energy savings by executing multiple threads/programs in parallel. SoC designers combine diverse accelerators together on the same die sharing a unified bus matrix. All these approaches enable increased performance and more efficient computation: by doing things in parallel.

GPUs are extraordinary parallel compute engines. Their heritage is associated with processing large streams of pixels and triangles. Modern designs such as the ARM Mali-T600 series extend this by enabling the processing and acceleration of non-graphical tasks. Mali-T600 GPUs provide a very powerful tool to improve platform performance and energy efficiency, through a scalable multithreaded architecture that complements well the characteristics of the CPU and in particular NEON™. Emerging compute APIs such as OpenCL [6] and RenderScript [7, 8, 9] enable programmers to extract the phenomenal computational capabilities of the Mali-T600 GPUs, and facilitate the most efficient use of the hosting heterogeneous platform processing resources.

Copyright © 2013 ARM Limited. All rights reserved. The ARM logo is a registered trademark of ARM Ltd. All other trademarks are the property of their respective owners and are acknowledged


Benefits of GPU Compute

GPU computing on the ARM Mali-T600 series of GPUs brings a host of benefits.

The architectural characteristics of the Mali-T600 series of GPUs enable them to process many parallel workloads much more efficiently than alternative processor solutions. GPU Compute accelerated applications can therefore benefit by consuming less energy, which translates to longer battery life.

Where raw performance is the target, the computation of embarrassingly parallel workloads can also be significantly accelerated through GPU Compute. This may translate into an increased frame rate, or the ability to carry out more work in the same time and power budget, and can result in benefits such as improved UI responsiveness, more accurate physics simulation, and the ability to apply complex pre-/post-processing effects to multimedia on-device and in real time. In essence: a significantly improved end-user experience.

Heterogeneous compute APIs such as OpenCL and RenderScript are designed for concurrency. They allow the developer to migrate some of the load from the CPU to the GPU or other accelerator, or to distribute it between processors in order to enable better load-balancing across system resources. For example, a video codec may offload motion vector calculations to the GPU, enabling the CPU to operate with fewer cores and at lower frequencies, or to be available to compute additional tasks.

System designers may be influenced by various cost, flexibility and portability concerns to consider migrating functionality from dedicated hardware accelerators to solutions which leverage the CPU/GPU subsystem. This approach is made viable and compelling due to the additional computational power provided by the GPU, now exposed through industry standard heterogeneous compute APIs.

Use-Cases and Applications

The application areas that can benefit from GPU Computing are countless.


In this paper we will focus on emerging embedded and mobile applications like computer vision, computational photography, and media processing. Real life experiences by our ecosystem Partners will be used to showcase the benefits of GPU Compute on Mali.

Advanced Image Processing

Mali-T600 series has been proven to accelerate a variety of complex image processing algorithms.

MulticoreWare® Inc. is a software vendor and provider of professional services that optimizes algorithms to leverage heterogeneous computing. They have implemented a library of complex image filters for RenderScript, optimized both for ARM CPU/NEON and Mali-T600 GPUs [10]. Measurements carried out on an off-the-shelf Google® Nexus™ 10 tablet device show significant and consistent performance uplift when utilizing Mali-T604™ GPU Compute acceleration, as shown in Table 1.

Table 1 - Image filters performance uplift using RenderScript on Mali-T600

Mali GPU Compute has also been demonstrated to improve performance and save energy for video pre-/post-processing. MulticoreWare’s transcoding Android™ application [11] utilises video processing plug-ins written in RenderScript. When the scripts were compiled and executed on the Mali-T604 GPU (so that the application was able to leverage both CPU and GPU processing power) the performance uplift was dramatic. For example, when applying de-shake correction to the video stream, the application executed 3.5x faster, whilst when up-scaling the video stream the performance uplift exceeded 6.7x. GPU Compute with Mali-T604 was also demonstrated to improve battery life. A battery drain test was carried out using the de-shake algorithm, and it was recorded that when using CPU and GPU together the energy consumed by the application was reduced by over 80% when compared to the standard execution mode (where both application and script reside on the CPU).

Figure 1 - Battery drain test measured on Nexus 10. 30 iterations of de-shake transcoding. Lower is better. (Source: MulticoreWare)

Synthesis Corporation®, a pioneering Japanese industry-academic cooperative venture and provider of semiconductor and software IP with a recent focus on image and video processing, has been collaborating with ARM on the optimization of selected proprietary technologies using GPU Compute on Mali [12]. Synthesis’ Super-resolution Scaler (a high-performance, high-quality image up-scaling solution) has been implemented using OpenGL™ ES and RenderScript compute for the Mali-T604 GPU, and is capable of achieving up to a five times performance increase in comparison to a conventional CPU-only implementation. Synthesis Corporation’s proprietary adaptive luminance/dynamic-range enhancement software library has been optimized for the Mali-T600 series of GPUs using OpenCL™ and delivers a 16x speed-up when using the Mali-T604 GPU for acceleration.

Figure 2 - OpenCL accelerated adaptive luminance/dynamic range enhancement algorithm using Mali-T604. (Source: Synthesis Corporation)

Computational Photography

Smartphone photography is becoming the most popular form of photo taking. It is disrupting traditional photography. According to National Geographic [3], 37 percent of the images taken in the U.S. in 2011 were captured with camera phones, and this number is expected to rise to 50 percent by 2015. The top 3 devices in the Most Popular Cameras in the Flickr Community chart are smartphones. Millions of people are getting their hands on phones with cameras, and developers are taking advantage of it [13].

Users expect good quality photos, in all conditions, and expect editing capabilities traditionally akin to professional workstations to be carried out seamlessly and in real-time on the device. Ideally, intelligent photo fixing/improvement can be done transparently. This is the motive behind the subject of computational photography: addressing the limitations of hardware components, ambient conditions and user skills (or lack thereof) algorithmically, through computation. Coupled with the underlying growth in sensors’ pixel resolutions, computational photography trends are another strong contributor to further computational requirements.

Computational photography includes applications such as: resolution enhancement, multi-frame image processing (like free-motion panorama stitching, HDR reconstruction, scene detection), tone mapping, de-noising, matting, light-field photography (capture scenes or objects under multiple illuminations and use them to perform relighting after the capture), re-focusing after capture, plenoptic/array-sensor photography and more. GPU Compute is particularly suited to efficiently process and accelerate these types of algorithms, as evidenced by the great work done by the Mali GPU Compute ecosystem.

Apical®, a leader in advanced image and video processing technology for embedded applications, are porting and optimizing many of their advanced camera features for Mali-T600 GPUs: video HDR, computational flash, multi-focus fusion, pixel-level motion compensation and spatio-temporal noise reduction. At the Mobile World Congress 2013 in Barcelona, Apical were able to demonstrate their Assertive Display™ technology optimized for Mali-T600 GPU Compute [14, 15]. Assertive Display is a display management technology that dynamically adjusts the colour and brightness of each pixel to adapt to ambient conditions, and GPU Compute enables this in a much more energy-efficient way than traditional approaches that use the backlight.

The OpenCL API enables offload/acceleration on Mali-T600 series of GPUs of many ISP (Image Signal Processing) functions. A software solution brings many advantages as it gives more flexibility than hardware and enables algorithm modifications right up to consumer device release. Sensor and camera module vendors have the option to invest in optimized portable software libraries in order to complement hardware designs. SoC implementers have the opportunity to reduce BoM and area costs by offloading selected ISP blocks to the GPU. ARM is collaborating with multiple sensor vendors in order to enable them to leverage GPU Compute on Mali.

Aptina Imaging®, an industry leader in CMOS sensor technologies, has developed and demonstrated OpenCL accelerated computational photography on Mali-T604. Aptina used their AR0833™ iHDR™ sensor (a high performance 8MP BSI sensor that can run full resolution at 30fps), to feed raw data to the Samsung® Exynos™ 5250 on an InSignal® Arndale™ platform, with the whole image processing pipeline offloaded to the GPU using OpenCL [15].

Figure 3 - ISP pipeline offloaded to Mali-T600 GPU using OpenCL (source: Aptina Imaging)

There are many other Mali Partners working on the optimization of computational photography solutions; the list includes ArcSoft® [16] and Morpho® Inc. [17].

The Khronos Group, an industry consortium creating open standards, has recently established the Camera Working Group, with the mission to create an open, royalty-free standard for advanced, low-level control of mobile and embedded cameras and sensors [18]. ARM is a supporter of this activity.


New and emerging video codecs

HEVC and VP9 are some of the new, emerging codecs that promise compelling bandwidth reductions, new market opportunities and significant cost savings for content distributors. However, it is essential to ensure high quality and wide deployment in consumer devices to realize these opportunities [19].

HEVC (High Efficiency Video Coding, also known as H.265) is the latest video compression standard, successor to H.264/AVC. HEVC is designed for higher resolutions, can support up to 8K UHDTV (8192×4320), and promises improved video quality and data compression over H.264: 50% compression for video resolutions of 1080p and above, 35% bitrate reduction for the same PSNR output and 50% bitrate reduction for similar (subjective) perceptual video quality. These improvements come at the cost of increased computation, in particular in the encode stage, due to more complex tools and far more pixels to be processed.
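To put these figures in perspective, the following sketch works through the arithmetic. The resolutions and percentage reductions are those quoted above; the 8,000 kbit/s H.264 stream is a hypothetical value chosen purely for illustration.

```python
def pixels_per_frame(width: int, height: int) -> int:
    """Number of pixels that must be processed for one frame."""
    return width * height


def reduced_bitrate(h264_kbps: float, reduction: float) -> float:
    """Bitrate after applying a fractional reduction (0.35 means 35% lower)."""
    if not 0.0 <= reduction <= 1.0:
        raise ValueError("reduction must be between 0 and 1")
    return h264_kbps * (1.0 - reduction)


if __name__ == "__main__":
    full_hd = pixels_per_frame(1920, 1080)
    max_hevc = pixels_per_frame(8192, 4320)
    print(f"1080p frame:     {full_hd:,} pixels")
    print(f"8192x4320 frame: {max_hevc:,} pixels ({max_hevc / full_hd:.1f}x)")

    h264_kbps = 8000.0  # hypothetical H.264 stream, for illustration only
    print(f"Same PSNR (35% lower):         {reduced_bitrate(h264_kbps, 0.35):.0f} kbit/s")
    print(f"Similar perceived (50% lower): {reduced_bitrate(h264_kbps, 0.50):.0f} kbit/s")
```

A frame at the largest supported resolution carries roughly 17 times as many pixels as a 1080p frame, which helps explain why the bandwidth savings come with a substantial computational cost.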

Deployment to mass-market consumer devices of new codec technologies brings several challenges. The HEVC specification has only recently been ratified, and as with many new Standards, initially some maturity issues are to be expected. As an example, Argon Design already found over 40 problems with the reference code and specification [19]. Mass adoption/maturation of HEVC will take several years, and with multiple encoders, varying quality and spec interpretation and limited conformance and content to test against, there is a clear risk of broken/inconsistent hardware implementations. With a cycle time of ~2 years between design and deployment in consumer devices, early adoption of hardware codecs is a particularly high risk strategy. In addition, silicon footprint overhead in deploying HEVC is high, and this will contribute to limit market adoption in early years.

As there are no hardware codecs in shipping devices today, high-performance software implementations become vital to the success of HEVC. A software approach also enables more flexibility, as changes can be made closer to device deployment or even over the air at a later stage. The challenges are performance and power consumption.

Due to the parallel nature of several of its stages (such as motion compensation, inverse quantization and transform and de-blocking) HEVC decode is particularly suitable for GPU acceleration/offload.
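The sketch below illustrates why a stage such as motion compensation maps well onto a parallel processor: each output block depends only on the reference frame and on its own motion vector, so blocks can be computed independently. This is a deliberately simplified, whole-pixel model written for illustration; it is not Ittiam's decoder, nor the HEVC algorithm itself, which additionally uses fractional-sample interpolation. All names (`predict_block`, `motion_compensate`, `BLOCK`) are our own, and a thread pool stands in for the GPU work-items that an OpenCL implementation would use.

```python
from concurrent.futures import ThreadPoolExecutor

BLOCK = 4  # illustrative block size; real codecs use a range of block sizes


def predict_block(reference, bx, by, mv):
    """Copy a BLOCK x BLOCK patch from the reference frame, displaced by mv.

    Coordinates falling outside the frame are clamped to the nearest edge.
    """
    dx, dy = mv
    height, width = len(reference), len(reference[0])
    block = []
    for y in range(by, by + BLOCK):
        sy = min(max(y + dy, 0), height - 1)
        row = []
        for x in range(bx, bx + BLOCK):
            sx = min(max(x + dx, 0), width - 1)
            row.append(reference[sy][sx])
        block.append(row)
    return block


def motion_compensate(reference, motion_vectors, workers=4):
    """Predict every block independently.

    motion_vectors maps a block origin (bx, by) to its vector (dx, dy).
    No block reads another block's output, so the jobs can run in any order.
    """
    jobs = sorted(motion_vectors.items())
    with ThreadPoolExecutor(max_workers=workers) as pool:
        blocks = list(pool.map(
            lambda job: predict_block(reference, job[0][0], job[0][1], job[1]),
            jobs))
    return {pos: block for (pos, _), block in zip(jobs, blocks)}


if __name__ == "__main__":
    frame = [[y * 8 + x for x in range(8)] for y in range(8)]
    predicted = motion_compensate(frame, {(0, 0): (0, 0), (4, 4): (1, 1)})
    for origin, block in sorted(predicted.items()):
        print(origin, block[0])
```

Python threads will not make this faster; the point is only that the per-block work has no dependencies between blocks, which is the property that lets a GPU process many blocks at once.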

ARM is collaborating with several codec vendors in order to ensure the widest availability of HEVC across multiple ARM platforms, through ARM NEON™ and GPU Compute on Mali-T600.

Figure 4 - HEVC decoder processing blocks suitable for GPU Compute acceleration. (Source: Ittiam Systems)

Ittiam Systems® [20] have been demonstrating HEVC decode on ARM Mali-T600 since the Mobile World Congress in 2013 [14]. When using NEON acceleration on the CPU in conjunction with off-loading compute-intensive steps like Motion Compensation to the Mali-T600 GPU using OpenCL, the CPU load can be dramatically reduced. Thanks to the additional computational power of the GPU, the decoder can either significantly increase the frame-rate or resolution achievable or reduce the energy consumed.

Figure 5 - Source: Analysis and estimations based on measurements on silicon hardware by ARM and Ittiam Systems on a Mali-T604 system.

Ittiam Systems have accelerated their HEVC decoder using the Mali-T604 GPU, enabling power-efficient, sustainable and long-duration HEVC playback of Full-HD (1080p) at 30 FPS on smartphone devices, Full-HD (1080p) at 60 FPS on tablet devices and up to Ultra-HD (4K resolution) playback on set-top boxes and digital TVs, based on ARM Cortex™-A series CPUs and Mali-T600 GPUs. Ittiam observed that “Mali GPUs are well suited for video acceleration with significant power/performance benefits” [20].

Many other partners are currently developing software HEVC decoders and encoders using ARM NEON and OpenCL on Mali-T600 GPUs. The list includes: Aricent Group®, Squid Systems®, ArcSoft®, Octanz Labs®, PIXTREE® Inc, VisualOn® and many more. For more information you can consult the malidevelopers.arm.com website or contact the partner directly.

Computer Vision Applications

Computer Vision groups all those techniques aimed at the acquisition, processing, analysis and understanding of sensor data (images) in order to derive information that can enable decision making.

GPU Compute is particularly suited for the acceleration of Computer Vision workloads and Mali-T600 GPUs have been proved to provide major performance and energy improvements for this class of algorithms.


As most Computer Vision applications share the same core characteristics, we have focused on a specific example: face detection. A study carried out in our laboratories demonstrated significant performance and energy benefits using Mali-T600 GPU Compute to improve this algorithm. The test was carried out on an instrumented InSignal Arndale Community Board, featuring a Mali-T604 MP4 GPU. The original algorithm was sourced from the OpenCV library. It was then rewritten to better suit GPU offload, and the OpenCL kernels were re-written and optimized for Mali-T604. The system was configured to operate at various CPU and GPU frequencies, and on average we recorded a 6-9x performance improvement. For each individual detection, the variant of the algorithm that also used the GPU consumed 83% less energy.

Figure 6 - Face detection algorithm optimized using OpenCL on Mali-T604 based silicon
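The relationship between the speed-up and the energy saving can be explored with a short calculation. Energy is average power multiplied by time, so a run that is s times faster and uses a fraction e of the baseline energy draws e × s times the baseline average power. The sketch below assumes, purely for illustration, that the 83% energy saving and the 6-9x speed-up describe the same runs; the study as reported here does not state this.

```python
def implied_power_ratio(energy_saving: float, speedup: float) -> float:
    """Average power of the accelerated run relative to the baseline.

    energy = average_power * time, therefore
    power_ratio = energy_ratio / time_ratio = (1 - energy_saving) * speedup.
    """
    if not 0.0 <= energy_saving < 1.0:
        raise ValueError("energy_saving must be in the range [0, 1)")
    if speedup <= 0.0:
        raise ValueError("speedup must be positive")
    return (1.0 - energy_saving) * speedup


if __name__ == "__main__":
    # Assumption: both figures refer to the same measurement runs.
    for speedup in (6.0, 9.0):
        ratio = implied_power_ratio(0.83, speedup)
        print(f"{speedup:.0f}x faster, 83% less energy -> "
              f"{ratio:.2f}x the baseline average power")
```

Under that assumption, the average power drawn during a detection would be similar to, or moderately higher than, the CPU-only case, and the saving would come mainly from each detection finishing much sooner.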

Gesture User Interfaces

Computer Vision lies at the heart of many emerging embedded and mobile applications, including gesture-based user interfaces, and many ARM Mali ecosystem partners are active in this field. eyeSight Technologies [21] is a leading provider of gesture recognition technologies. They are particularly excited by GPU Computing and have been at the forefront of innovation in this area by enabling and optimizing their solution for Mali-T600 GPUs [22].

Mali-T600 has proven particularly suitable to accelerate and enhance eyeSight’s machine vision capabilities. eyeSight were able to take advantage of many benefits of using Mali GPU Compute in their UI engine:

- Reduced CPU load, which in turn contributed to reduced power consumption,
- Enhancement of image pre-processing, allowing high-quality gesture detection even when depending on relatively lower-quality CMOS sensors,
- More accurate and robust finger detection in challenging lighting conditions,
- Reduced latency, improving User Experience, through higher performance (operating at 60-120 fps),
- Wide field of view coverage made possible through higher input resolutions.

Figure 7 - Detection rate performance comparison when using GPU Compute in low light environments

Figure 8 - eyeSight gesture UI technology using GPU Computing to improve detection in challenging environments

A low light noise study carried out by eyeSight concluded that when using the Mali-T600 GPU for acceleration via OpenCL, the detection rate performance doubled in low light environments. Image processing on GPU is ideal for machine vision, and it has been proved to enhance video quality while conserving system resources. Thanks to Mali-T600 GPU Compute, gesture detection is now enabled in previously unsupported environments.

Conclusion

GPU computing on ARM Mali-T600 GPUs offers a host of benefits: it accelerates data-parallel computation while reducing system workload; reduces platform energy consumption while increasing system throughput; and enhances your system's value by consolidating functionality while reducing programmer effort.

GPU Computing with Mali-T600 GPUs has been proven to deliver major tangible benefits for real world applications such as: advanced imaging, computer vision, computational photography and multimedia codecs. Improved performance and energy efficiency were measured on real devices.

The Mali Ecosystem is making GPU Compute a reality today. Industry leaders take advantage of Mali-T600 GPU capabilities to innovate and deliver. Be one of them!

FOR MORE INFORMATION: [email protected]


References

[1] http://www.androidguys.com/2013/01/25/android-mobile-display-resolutions/
[2] http://arstechnica.com/gadgets/2013/08/smartphone-display-wars-go-to-ludicrous-speed-2560x1440-in-5-5-inches/
[3] http://www.ce.org/i3/Features/2013/Digital-America/Camera-Industry-Pivots-as-Smartphones-Target-Point.aspx#sthash.Q6QOtsZV.dpuf
[4] http://www.bbc.co.uk/news/technology-23235771
[5] Source: McKinsey & Company Inc. [Making smartphones brilliant: Ten trends - McKinsey & Company], http://goo.gl/rkSP4
[6] OpenCL with ARM Mali: GPU Computing...with no compromises [http://goo.gl/64O36]
[7] http://developer.android.com/guide/topics/renderscript/compute.html
[8] http://android-developers.blogspot.co.uk/2013/01/evolution-of-renderscript-performance.html
[9] GPU Computing in Android? With ARM Mali-T604 & RenderScript Compute You Can! [http://goo.gl/5A5UB]
[10] http://malideveloper.arm.com/learn-about-mali/mali-partners/computer-vision/multicoreware/
[11] http://www.youtube.com/watch?v=0awU2YCncHk
[12] http://malideveloper.arm.com/learn-about-mali/mali-partners/computational-photography/synthesis-corporation/
[13] https://play.google.com/store/apps/category/PHOTOGRAPHY/collection/topselling_paid?hl=en_GB
[14] http://www.apical.co.uk/2013/03/apical-and-arm-collaborate-on-computational-photography/
[15] http://blogs.arm.com/multimedia/907-the-mali-ecosystem-makes-gpu-computing-a-reality/
[16] http://malideveloper.arm.com/learn-about-mali/mali-partners/computational-photography/arcsoft/
[17] http://malideveloper.arm.com/learn-about-mali/mali-partners/computational-photography/morpho-inc/
[18] http://www.khronos.org/camera
[19] http://blogs.arm.com/multimedia/1047-addressing-the-challenges-in-bringing-dedicated-hevc-hardware-to-market/
[20] http://malideveloper.arm.com/learn-about-mali/mali-partners/computer-vision/ittiam-systems/
[21] http://malideveloper.arm.com/tag/eyesight/
[22] http://blogs.arm.com/multimedia/1005-eyesights-gesture-recognition-mali-touch-free-capabilities-optimized/
