Utilization of Graphic Processing Unit on Smartphones

Abstract

While the convergence of mobile phone and computing technology has been building for years, we are only now starting to see exciting and disruptive new applications on mobile devices. Starting with desktop GPUs as a reference, we discuss how mobile GPUs are designed to deliver the performance required for current and future mobile use cases, given that a mobile device is battery powered and must be small enough to be portable. In this paper, we discuss the architecture of one such GPU, that of the Tegra X1, explore its key characteristics, and compare the Tegra X1 with its competitor, the Qualcomm Adreno 530. We show the rising demand for GPUs in smartphones and tablets. The main function of the graphics processing unit on a mobile device is visual computing, but we also discuss other use cases, including the way a mobile GPU is exploited for parallel processing in malware detection. The limitations and constraints imposed by current GPU technology are also addressed.

Introduction and background

Mobile devices are becoming our most valuable devices today, more valuable than personal computers. Smartphones and tablets can perform almost every task that a PC can, and many users prefer a mobile device because of its portability. Since these devices have become vital and their growth has been exponential, faster and more powerful mobile devices are required. The GPU and the CPU are the two most important processing units of the device; in this paper we concentrate on the jobs performed by the GPU.

With the emergence of extreme-scale computing, modern graphics processing units (GPUs) have been widely used to build powerful supercomputers and data centers. With a large number of processing cores and a high-performance memory subsystem, the modern GPU is a natural candidate for high-performance computing (HPC). As leading manufacturers in the GPU industry [1], Nvidia and ATI have consistently competed by introducing new generations of graphics cards for computers. In recent times, a similar need has arisen for mobile and handheld devices. Raw GPU performance alone does not suffice: other factors, including the power efficiency and size of the GPU, must be taken into consideration when developing powerful graphics processing units for mobile devices. Later in the paper, we discuss the basic GPU architecture of a mobile device and weigh performance against power consumption.


New mobile devices are more interactive, as are the immersive applications newly developed for them. Users expect smooth, uncompromised performance from these applications. To meet these expectations, the two main components of the device, the CPU and the GPU, come into action. The overall performance of a mobile device depends mostly on its memory, CPU and GPU.

How does a GPU work?

Unlike central processors, which have a few cores running at high speed, GPUs have many processing cores running at lower speeds. These cores are aimed at two basic functions: the processing of vertices and the processing of pixels. Vertex processing essentially revolves around coordinate systems. The GPU handles the geometric calculations needed to reproduce three-dimensional space on the screen. This gives games depth and spatial information and makes rotation in three-dimensional space possible. Pixel processing, or to put it more simply, producing the graphics we see, is very complex and requires even more processing power than vertex processing. Pixel processing renders the various layers and applies the effects needed to create complex textures and the most realistic graphics possible.
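As a rough illustration of the coordinate-system work that vertex processing involves, the sketch below rotates a 3D point and projects it onto the screen. It is a minimal CPU-side sketch, not how any particular GPU implements the stage; the function names, the pinhole-camera model and the `focal_length` parameter are assumptions made for this example.

```python
import math

def rotate_y(point, angle_rad):
    """Rotate a 3D point around the vertical (y) axis."""
    x, y, z = point
    c, s = math.cos(angle_rad), math.sin(angle_rad)
    return (c * x + s * z, y, -s * x + c * z)

def project(point, focal_length, width, height):
    """Perspective-project a camera-space point to pixel coordinates.

    Assumes a pinhole camera at the origin looking down +z.
    """
    x, y, z = point
    if z <= 0:
        raise ValueError("point must be in front of the camera")
    # Perspective divide: distant points move toward the screen centre.
    ndc_x = focal_length * x / z
    ndc_y = focal_length * y / z
    # Map [-1, 1] to pixel coordinates; screen y grows downward.
    px = (ndc_x + 1.0) * 0.5 * width
    py = (1.0 - ndc_y) * 0.5 * height
    return (px, py)
```

A GPU applies this kind of transformation to every vertex of every object in the scene, which is why the work is spread across many cores.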


Need for more powerful GPUs

Mobile gaming has been one of the fastest growing segments, with visually rich mobile games now being compared to PC and console games. Simple 2D games are giving way to 3D games with rich gameplay and graphics that demand the fastest GPU processing a mobile device can offer. Powerful GPU processing is needed to support various features of the device, such as quick webpage and application rendering, smooth user-interface interactions, high-resolution displays and, of course, high-performance gaming. While including such a high-performance GPU, the challenges of power consumption and size must also be considered in designing a suitable mobile device. We consider two leading GPU brands and compare the performance of NVIDIA's Tegra X1 with that of the Qualcomm Adreno 530 GPU on a few benchmarks later in the Analysis section of this paper.

GPU vs CPU

Graphics processing units (GPUs) are becoming increasingly important in today's platforms, as their increased generality allows them to be used as powerful coprocessors. A powerful GPU contributes to a rich 2D and 3D user interface [5], faster rendering of web pages and display output, and more immersive 3D gaming. Powerful GPUs are also necessary for providing a smooth and responsive interface while applications execute. When the CPU and GPU work independently during computations on a mobile device, the efficiency and performance of the device are lower than when the two are integrated and used in parallel.

The CPU is suited to most logic operations and to scheduling code. Most of a website's HTML and JavaScript is parsed better when handled by the CPU. The CPU deals with data-processing jobs such as reading and processing a database to present emails and text, and parsing documents into PDFs. The GPU, on the other hand, is best for rendering. Effects such as window transitions, zooming, scrolling and animations are handled smoothly by the GPU. Similarly, running high-end 3D games on mobile devices requires the rendering capabilities of the graphics processing unit.

To compensate for the losses incurred when application execution and rendering rely on the independent efforts of the CPU and GPU, the two have more recently been integrated so that most instructions are processed in parallel using both simultaneously. The introduction of augmented reality (AR) technology also brought parallel processing on the CPU and GPU. AR applications help users understand their environment better by enhancing the real world with virtual information on their mobile devices [6]. AR applications must run intensive image-processing algorithms and overlay specific virtual objects on a real-time camera capture. To execute these demanding applications, one scheme assigns feature extraction to the CPU and feature description to the GPU and runs the two processes in parallel.

The integration of CPU and GPU with parallel processing was introduced by the Nvidia Tegra 2 [7]. Modern systems-on-chip (SoCs) integrate the CPU and GPU for better rendering and an immersive 3D gaming experience on mobile devices. Multiprocessor SoCs in high-performance mobile platforms have seen unparalleled advances over the past few years. Current mobile SoCs combine diverse processing elements such as CPU, GPU and DSP blocks on a single chip. A digital signal processor (DSP) is designed for high-speed processing of large sets of numeric values that represent analog signals in real time. The figure below shows the block diagram of the NVIDIA Tegra 2 SoC.

Nvidia's Tegra series of processors introduced the technology of integrating a powerful 3D-rendering GPU with CPU cores on the same chip, providing a sophisticated real-time experience for mobile users [7].
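The AR scheme described above, with feature extraction on the CPU and feature description on the GPU running in parallel, has the shape of a two-stage pipeline. The sketch below mimics only that structure using two Python threads and a queue; it is not the implementation from [6]. The functions `extract_features` and `describe_features` are hypothetical placeholders, and on a real device the second stage would be a GPU kernel rather than a thread.

```python
import queue
import threading

def extract_features(frame):
    # Placeholder for the CPU stage: indices of bright samples.
    return [i for i, value in enumerate(frame) if value > 128]

def describe_features(frame, keypoints):
    # Placeholder for the GPU stage: sum of each keypoint's neighbourhood.
    return [sum(frame[max(0, k - 1):k + 2]) for k in keypoints]

def run_pipeline(frames):
    handoff = queue.Queue(maxsize=2)  # small buffer between the stages
    results = []

    def extraction_stage():
        for frame in frames:
            handoff.put((frame, extract_features(frame)))
        handoff.put(None)  # sentinel: no more frames

    def description_stage():
        while True:
            item = handoff.get()
            if item is None:
                break
            frame, keypoints = item
            results.append(describe_features(frame, keypoints))

    stages = [threading.Thread(target=extraction_stage),
              threading.Thread(target=description_stage)]
    for stage in stages:
        stage.start()
    for stage in stages:
        stage.join()
    return results
```

While the second stage describes the features of one frame, the first stage can already extract features from the next, which is where the benefit of running the two processors in parallel comes from.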


Analysis

Mobile GPU architecture

In the early stages of mobile GPUs, the CPU was involved in setting the required parameters on the GPU, and these GPUs were based on immediate-mode rendering (IMR). Pixels were written to the display span-wise (top to bottom and left to right), and the CPU had to wait for the GPU to complete each conversion before it could issue a new set of pixels. This caused heavy dependency and synchronization overhead between the two components. In modern GPUs, the conversion of triangles into pixels (rasterization) is executed using a tile-based approach, rendering all triangles tile by tile into the framebuffer. Using the framebuffer in this way reduced memory bandwidth consumption. Although some designs still followed the IMR approach, modern GPUs adopted tile-based deferred rendering, which significantly reduced the dependency between the GPU and the CPU. With the rapid development of systems-on-chip, the powerful GPU and CPU are now integrated into a single SoC [8].

An application running on a mobile device (a game, for example) feeds the GPU with the triangles to be rendered, together with state describing how and where they should be rendered. Transforming the vertices to their positions is done in the vertex processing unit. The setup unit assembles the triangles and computes data that are constant over each triangle. Computing the color of each pixel found inside a triangle is performed in the pixel processing unit. Below is a conceptual overview of the working of a graphics processing unit [8].
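As a rough software analogy for the setup and pixel stages described above, and for walking the screen tile by tile, the following sketch rasterizes 2D triangles using edge functions. It is a simplified illustration under stated assumptions: vertices are already in screen space, pixels lying exactly on an edge count as covered (real GPUs use a stricter fill rule), and the framebuffer is a plain dictionary. All names are invented for this example.

```python
def edge(a, b, p):
    """Edge function: sign tells which side of the edge a->b the point p is on."""
    return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])

def setup_triangle(v0, v1, v2):
    """Setup stage: compute data that is constant over the triangle."""
    xs = [v0[0], v1[0], v2[0]]
    ys = [v0[1], v1[1], v2[1]]
    return {"verts": (v0, v1, v2),
            "bbox": (min(xs), min(ys), max(xs), max(ys)),
            "area": edge(v0, v1, v2)}

def covers(tri, p):
    """Pixel stage test: is the sample point p inside the triangle?"""
    v0, v1, v2 = tri["verts"]
    w = (edge(v1, v2, p), edge(v2, v0, p), edge(v0, v1, p))
    if tri["area"] > 0:
        return all(value >= 0 for value in w)
    if tri["area"] < 0:
        return all(value <= 0 for value in w)
    return False  # degenerate triangle covers nothing

def render_tiled(triangles, width, height, tile_size=4):
    """Render (vertices, color) pairs tile by tile into a framebuffer dict."""
    setups = [(setup_triangle(*verts), color) for verts, color in triangles]
    framebuffer = {}
    for ty in range(0, height, tile_size):
        for tx in range(0, width, tile_size):
            for tri, color in setups:
                x0, y0, x1, y1 = tri["bbox"]
                # Skip triangles whose bounding box misses this tile.
                if (x1 < tx or x0 >= tx + tile_size or
                        y1 < ty or y0 >= ty + tile_size):
                    continue
                for y in range(ty, min(ty + tile_size, height)):
                    for x in range(tx, min(tx + tile_size, width)):
                        if covers(tri, (x + 0.5, y + 0.5)):
                            framebuffer[(x, y)] = color
    return framebuffer
```

Because each tile is finished before the next one starts, a hardware implementation can keep the tile's pixels in fast on-chip memory and write them out once, which is the source of the bandwidth saving.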

Modern mobile GPUs are delivered as part of a system on chip (SoC), which integrates a graphics processing unit and a CPU along with other components. A good example of such an SoC is the NVIDIA Tegra family of processors. Using the NVIDIA Tegra 4 as a reference for a modern processor, we give an overview of the GPU architecture. The GPU in the Tegra 4 was an evolution of the GPU in its predecessors, the Tegra 3 and Tegra 2 [9], with efficiency-oriented additions such as an increased number of processing resources. It employed 72 custom cores running at 672 MHz. The video processor provided full support for hardware encoding and decoding and supported 4K (Ultra HD) displays. This processor was deployed in handheld devices such as the HP Slate. The lower-end variant, the Tegra 4i, was designed for phones, with almost 60 custom cores.


Figure 2: GPU architecture of Tegra 4 [9]

Evolution of NVIDIA Tegra

Tegra is a system-on-chip (SoC) series developed by Nvidia for mobile devices such as smartphones, personal digital assistants, and mobile Internet devices. The Tegra integrates an ARM architecture central processing unit (CPU), a graphics processing unit (GPU), northbridge, southbridge, and memory controller onto one package. Early Tegra SoCs were designed as efficient multimedia processors, while more recent models emphasize gaming performance without sacrificing power efficiency.

On January 7, 2010, Nvidia announced its next-generation Tegra system on a chip with a dual-core CPU, the Nvidia Tegra 2. It primarily supported Android smartphones. Nvidia later announced the first quad-core SoC in February 2011. The chip was codenamed Kal-El and is now branded as Tegra 3. Benchmark results showed impressive performance gains over the Tegra 2 [2], and the chip was used in many of the tablets released in the second half of 2011. In January 2012, Nvidia announced that Audi had selected the Tegra 3 processor for its in-vehicle infotainment systems and digital instrument displays [3]. In March 2015, Nvidia announced the Tegra X1, the first SoC to have a graphics performance of 1 teraflop. At the announcement event, Nvidia showed Epic Games' Unreal Engine 4 "Elemental" demo running on a Tegra X1.


Table 1: Evolution of NVIDIA Tegra [4]

Other use cases of a GPU

Graphics processors are known mainly for driving the graphical user interface of mobile devices and improving the gaming experience on portable devices. GPUs also help the CPU with large floating-point computations for faster performance. Below, we discuss other use cases of the graphics processing unit in a mobile device, beyond improving computation speed and the graphical interface.

Malware detection on a mobile device has been improved by using the computational capabilities of the GPU [10]. The set of malware signatures grows in proportion to the number of threats. As it grows, the load of computations and comparisons grows with it, straining a mobile device's limited power and memory. To reduce the burden on the CPU, the GPU was used to analyze and compare malware signatures. A parallel host-based anti-malware application was developed for Android mobile devices in which the GPU plays a vital role in accelerating malware detection. A total throughput of 333 Mbit/s was recorded when the GPU was used, three times faster than when only the mobile CPU was used for malware detection.

GPU-based Gabor face feature extraction is based on the fast Fourier transform (FFT): the face image is transformed into Fourier space, multiplied there with the filter, and the product is then inverse-transformed back to the spatial domain [11]. Using the Nvidia Tegra GPU, the Gabor wavelet computation could be completed in 1.2 seconds, a speedup of 4.25x over the CPU implementation. This reduced the total time for complete face recognition on a smartphone from 8.5 seconds to 4.6 seconds. Not only is the time reduced, but the energy consumed is also lower: the combined CPU and GPU implementation consumes 16.3 J, whereas 29.8 J is consumed when only the CPU is used.

Augmented reality is a way of portraying a real-world environment and its elements in a semantic context. With this technology, information about the user's live surroundings can be made interactive and immersive for educational and safety purposes. Virtual reality, on the other hand, replaces real-world elements with a simulated world, creating an immersive walk-through for the user. These technologies are mostly used on powerful mobile devices, and the GPU plays a major role in the smooth execution of AR and VR applications. The GPU's efficient image processing and object detection are used to improve the execution of AR applications [12]. A rectangle detection algorithm is one of the common tools in AR applications, and using the GPU and CPU in parallel reduced its execution time.
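The face-recognition measurements quoted above from [11] can be cross-checked with a few lines of arithmetic. The sketch below uses only the numbers stated in the text. The derived quantities (implied CPU time for the Gabor step, overall speedup, energy saving, average power) are our own calculations, and they assume the energy figures cover the whole recognition run.

```python
# Figures quoted in the text (from [11]).
gabor_gpu_s = 1.2      # Gabor step on the GPU, seconds
gabor_speedup = 4.25   # GPU speedup for the Gabor step
total_cpu_s = 8.5      # full recognition, CPU only
total_gpu_s = 4.6      # full recognition, CPU + GPU
energy_cpu_j = 29.8    # energy, CPU only
energy_gpu_j = 16.3    # energy, CPU + GPU

# Implied CPU time for the Gabor step alone.
gabor_cpu_s = gabor_gpu_s * gabor_speedup

# If only the Gabor step moved to the GPU, the new total should be:
predicted_total_gpu_s = total_cpu_s - gabor_cpu_s + gabor_gpu_s

overall_speedup = total_cpu_s / total_gpu_s
energy_saving = 1.0 - energy_gpu_j / energy_cpu_j

# Average power, assuming the energy figures span the whole run.
avg_power_cpu_w = energy_cpu_j / total_cpu_s
avg_power_gpu_w = energy_gpu_j / total_gpu_s

print(f"predicted CPU+GPU total: {predicted_total_gpu_s:.1f} s")
print(f"overall speedup: {overall_speedup:.2f}x")
print(f"energy saving: {energy_saving:.1%}")
print(f"average power: {avg_power_cpu_w:.2f} W vs {avg_power_gpu_w:.2f} W")
```

Under these assumptions the quoted figures are mutually consistent, and the average power is about the same in both configurations. This suggests the energy saving comes mainly from finishing sooner rather than from drawing less power.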

Challenges of building a Graphic Processing Unit

Designing a powerful mobile GPU is not as easy as adding more cores, adopting the latest technology and raising the clock rate for faster and more efficient execution. A number of constraints arise in designing a GPU, and they must be considered carefully. Power consumption and efficiency, the silicon area of the chip, and cost are the most important challenges faced by mobile GPU designers [13].

Power Efficiency

A mobile device cannot be constantly plugged in, so power efficiency plays a vital role in the specification of a device. With the mobile GPU and the CPU consuming most of the power, designers find it hard to develop SoCs that meet the power constraint. The major challenge is getting rid of the heat dissipated by the mobile GPU without large fans or heatsinks [15]. On average, a mobile device can dissipate about 3 W to 6 W of heat, depending on its size and build. To simply increase performance, we can add more logic or raise the clock frequency, but this comes at the price of increased silicon area and power consumption. Power consumption increases almost quadratically with clock frequency.

A paper on GPU thermal management [16] proposed controlling thermal power dissipation with a dynamic thermal management technique. When the CPU and GPU are not coordinated, the on-chip temperature varies, degrading the performance of the device and the application. The technique coordinates the frequencies of the CPU and GPU and satisfies the thermal constraint during surges in device activity. This approach could substantially help the performance of graphics-intensive gaming applications, because it modulates the frequencies of both the CPU and GPU.

Area

With today's technology, the performance of mobile GPUs can be compared to that of their desktop counterparts, but they must perform under hard operating conditions given the limited size and power of a mobile phone. GPUs designed for these devices must therefore use as little space as possible, which affects performance. Silicon area is a key challenge for SoC designers, as more silicon area directly increases power consumption and cost, while restricting the area limits performance. To increase performance, we could either add more logic gates, further increasing the chip size, or increase the frequency. But there is always a size constraint that must be satisfied for portable mobile devices. Hence, most ideas for improving performance come at the cost of some hardware block becoming more complex, which again increases size and power consumption [14].
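To make the power discussion above concrete, the sketch below models power as growing with the square of clock frequency, as the text states, and scales the CPU and GPU clocks together so that the total stays inside a thermal budget. This is only an illustration of the idea of coordinated frequency control, not the algorithm of [16]. The coefficients, the uniform scaling policy and the function names are assumptions made for this example.

```python
def power_w(freq_ghz, coeff):
    """Assumed model: power grows with the square of clock frequency."""
    return coeff * freq_ghz ** 2

def scale_to_budget(cpu_ghz, gpu_ghz, cpu_coeff, gpu_coeff, budget_w):
    """Scale both clocks by one common factor so modelled power fits the budget.

    Returns the (cpu_ghz, gpu_ghz) pair to run at.
    """
    total = power_w(cpu_ghz, cpu_coeff) + power_w(gpu_ghz, gpu_coeff)
    if total <= budget_w:
        return cpu_ghz, gpu_ghz  # already within the thermal budget
    # Power scales with the factor squared, so take the square root.
    factor = (budget_w / total) ** 0.5
    return cpu_ghz * factor, gpu_ghz * factor
```

For example, with hypothetical coefficients of 1.0 for a 2.0 GHz CPU and 4.0 for a 1.0 GHz GPU, the modelled power is 8 W. Fitting a 4.5 W budget, which lies inside the 3 W to 6 W range quoted above, requires both clocks to drop to 75% of their original values. Under this model a 25% loss in frequency buys a reduction in power of almost half, which is why clock rate is such an expensive way to gain performance on a mobile device.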

Cost

We draw up a cost model for a mobile GPU and observe the share of the GPU in the cost of the whole SoC. The Vivante GC1000 graphics processing unit is used as an example. Its silicon area is around 12.25 mm², at a cost of almost $0.07 per mm². This GPU also includes second-level caches of about 6 KB to 256 KB, with a cost in the range of $0.18 to $0.30 per chip just for the GPU subsystem. Apart from the silicon, costs include support for technologies such as OpenGL ES 2.0 (OGLES2), OpenCL and DirectX, as well as royalties such as licensing and maintenance fees.
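The cost figures above can be combined in a short calculation. The sketch below uses the quoted area and per-area cost. Treating the $0.18 to $0.30 cache figure as an addition to the silicon-area cost is our assumption, since the text does not say how the two combine. The helper `gpu_share_of_soc` is hypothetical and takes the SoC cost as an input, because the text gives no such figure.

```python
AREA_MM2 = 12.25                     # quoted silicon area
COST_PER_MM2_USD = 0.07              # quoted cost per square millimetre
CACHE_COST_RANGE_USD = (0.18, 0.30)  # quoted range for the caches

def silicon_cost(area_mm2, cost_per_mm2):
    """Cost of the silicon area alone."""
    return area_mm2 * cost_per_mm2

def gpu_share_of_soc(gpu_cost_usd, soc_cost_usd):
    """Fraction of the SoC cost attributable to the GPU (hypothetical helper)."""
    return gpu_cost_usd / soc_cost_usd

area_cost = silicon_cost(AREA_MM2, COST_PER_MM2_USD)
# Assumption: the cache cost is added on top of the area cost.
low_total = area_cost + CACHE_COST_RANGE_USD[0]
high_total = area_cost + CACHE_COST_RANGE_USD[1]

print(f"silicon area cost: ${area_cost:.4f}")
print(f"with caches: ${low_total:.4f} to ${high_total:.4f}")
```

Under these assumptions the silicon for the GPU costs a little under one dollar, before licensing and maintenance fees are counted.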

GPU: NVIDIA Tegra X1 and Qualcomm Adreno 530

The Tegra X1 is Nvidia's newest mobile processor, integrating a GM20B Maxwell GPU and two ARM quad-core CPU clusters. The power-efficient Maxwell GPU has 256 CUDA cores and delivers 1000 GFLOPS for 16-bit workloads and over 500 GFLOPS for 32-bit workloads. The Tegra X1 CPU complex combines a quad-core ARM Cortex-A57 with a quad-core ARM Cortex-A53. These high-performance CPUs, together with the Maxwell GPU, provide top-notch and highly energy-efficient performance while supporting all the latest graphics and compute APIs. The Tegra X1 supports advanced graphics and compute APIs such as OpenGL 4.5, DirectX 12, OpenGL ES 3.1 and CUDA 6.0. The Maxwell GPU includes native hardware support for 16-bit floating-point computation, and two high-performance image signal processors (ISPs) are provided, which are important for computer-vision applications. The Cortex-A57 cores share a common 2 MB L2 cache, and each core has its own 48 KB L1 instruction cache and 32 KB L1 data cache. The more power-efficient Cortex-A53 cores share a common 512 KB L2 cache, and each core has a 32 KB L1 instruction cache and a 32 KB L1 data cache.

The Qualcomm Adreno 530 is the GPU integrated in the Snapdragon 820 processor. It comprises 256 arithmetic logic units running at 624 MHz. Regarded as a strong competitor to GPU leader Nvidia, the Adreno 530 delivers a 40% improvement in performance and power consumption over the Adreno 430. It supports OpenGL ES 3.1 + AEP, OpenCL 2.0, RenderScript and DirectX 11.2. The Snapdragon 820, which contains the Adreno 530 GPU, also includes the Kryo CPU, the Spectra ISP and the Hexagon 680 DSP (digital signal processor). The 64-bit quad-core Kryo CPU clocks up to 2.2 GHz. Qualcomm's Spectra ISP in the Snapdragon 820 allows smartphones to support up to three different 25 MP cameras running at 30 FPS. The Snapdragon 820 is used in several top mobile devices, namely the Samsung Galaxy S7, LG V20, HTC 10 and OnePlus 3, while Nvidia's Tegra X1 is used in the Nvidia Shield and in a tablet. Below are benchmark results for the Nvidia Tegra X1's Maxwell GPU and the Qualcomm Snapdragon 820's Adreno 530 GPU [17][18].
Benchmark test | Tegra X1 Maxwell GPU | Adreno 530 GPU
GFXBench 3.0 - Manhattan Offscreen OGL | 53.5 | 43.3
GFXBench 3.0 - Manhattan Onscreen OGL | 48.4 | 34.6
Basemark X 1.1 High quality | 41415.5 | 31469.6
Basemark X 1.1 Medium quality | 43721 | 34695
GFXBench (DX / GLBenchmark) 2.7 | 52.9 | 52.8
3DMark - Ice Storm Unlimited Graphics Score | 57697 | 31299.7
PassMark PerformanceTest Mobile V1 - 3D GFX Tests | 3320 | 2182.8
PassMark PerformanceTest Mobile V1 - 2D GFX Tests | 5215 | 5512
Smartbench 2012 - Gaming Index | 4497 | 4410
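To read the benchmark scores above as relative differences, the sketch below computes the Tegra X1's lead over the Adreno 530 for each test. The scores are copied from the table. The `tegra_advantage` helper and the assumption that a higher score is better in every test are ours.

```python
# (Tegra X1 Maxwell GPU, Adreno 530 GPU) scores from the table.
results = {
    "GFXBench 3.0 - Manhattan Offscreen OGL": (53.5, 43.3),
    "GFXBench 3.0 - Manhattan Onscreen OGL": (48.4, 34.6),
    "Basemark X 1.1 High quality": (41415.5, 31469.6),
    "Basemark X 1.1 Medium quality": (43721, 34695),
    "GFXBench (DX / GLBenchmark) 2.7": (52.9, 52.8),
    "3DMark - Ice Storm Unlimited Graphics Score": (57697, 31299.7),
    "PassMark PerformanceTest Mobile V1 - 3D GFX Tests": (3320, 2182.8),
    "PassMark PerformanceTest Mobile V1 - 2D GFX Tests": (5215, 5512),
    "Smartbench 2012 - Gaming Index": (4497, 4410),
}

def tegra_advantage(tegra, adreno):
    """Relative lead of the Tegra X1 score (negative if the Adreno leads)."""
    return tegra / adreno - 1.0

advantages = {name: tegra_advantage(t, a) for name, (t, a) in results.items()}
tegra_wins = sum(1 for value in advantages.values() if value > 0)

for name, value in advantages.items():
    print(f"{name}: {value:+.1%}")
print(f"Tegra X1 leads in {tegra_wins} of {len(results)} tests")
```

On these figures the Tegra X1 leads in every test except the PassMark 2D test, with the largest gap in 3DMark Ice Storm Unlimited.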


Figure: Adreno 530 vs Tegra X1 Maxwell GPU (benchmark results). Bar chart with one bar per benchmark listed above, on an axis from 0% to 100%, with the series Adreno 530 and Tegra X1 Maxwell GPU.
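The peak-throughput figures quoted earlier for the Tegra X1 (1000 GFLOPS at 16 bits and over 500 GFLOPS at 32 bits from 256 CUDA cores) can be approximated with the usual peak formula: cores times clock times operations per cycle. In the sketch below, the 1.0 GHz clock is an assumption, since the text does not state the GPU clock. Counting a fused multiply-add as two floating-point operations, and two packed 16-bit operations per 32-bit lane, are also assumptions of this example.

```python
def peak_gflops(cores, clock_ghz, flops_per_core_per_cycle=2, packing=1):
    """Theoretical peak throughput in GFLOPS.

    flops_per_core_per_cycle=2 counts one fused multiply-add as two operations.
    packing=2 models two 16-bit operations issued per 32-bit lane.
    """
    return cores * clock_ghz * flops_per_core_per_cycle * packing

ASSUMED_CLOCK_GHZ = 1.0  # assumption: not stated in the text

fp32_gflops = peak_gflops(256, ASSUMED_CLOCK_GHZ)
fp16_gflops = peak_gflops(256, ASSUMED_CLOCK_GHZ, packing=2)

print(f"FP32 peak: {fp32_gflops:.0f} GFLOPS")
print(f"FP16 peak: {fp16_gflops:.0f} GFLOPS")
```

Under these assumptions the formula gives 512 GFLOPS at 32 bits and 1024 GFLOPS at 16 bits, in line with the "over 500" and "1000" figures quoted. These are theoretical peaks, and real workloads such as the benchmarks above reach only a fraction of them.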

Future work and Conclusion

As mobile devices grow more capable and overshadow personal computers, the importance and growth of powerful tablets and mobile devices has been enormous. Performance and efficiency have also been rising with the use of the latest technology and more powerful architectures. Although current processing units are sufficient to run today's high-end applications, they will not suffice for the applications to come, and it remains to be seen whether mobile devices can handle them. The two leading mobile GPU manufacturers, NVIDIA and Qualcomm, have been providing top-notch processors, and their SoCs are used in the top mobile devices and tablets. We are at a stage where the small GPU architectures built for mobile devices are comparable to the graphics cards used in personal computers. In late September 2016, Nvidia announced a new SoC named Xavier, featuring a Volta GPU with 512 CUDA cores, 7 billion transistors and 8 custom ARMv8 CPU cores. Compared with the most powerful computer GPU, the Nvidia GeForce GTX 1080, Xavier comes close on some specifications, such as transistor count, which is astounding considering the size, power consumption and other limitations of a mobile GPU. With such a distinct and sudden performance increase in mobile SoCs, we can expect that mobile graphics will soon rival, and may overtake, the graphics performance of computers.

Virtual and augmented reality are on the rise and are predicted to soon become the way most infotainment is displayed. In such a world, applications featuring virtual and augmented reality can only be executed on high-end mobile devices featuring the most powerful processing units and GPUs. Considering the growth of virtual reality applications, HTC Vive, Samsung VR and other competitors are designing powerful, feature-rich mobile devices to handle highly intensive and immersive 3D games and applications. It is therefore predicted that the rise of powerful GPUs will not stop here but will continue, given the amount of research and development in this field.

In this paper, we have discussed the design of mobile GPUs and the way they deliver the performance required for current and future mobile use. To give a clear understanding of the technology used in today's GPUs, we focused on the architecture of the Nvidia Tegra 4 processor. We looked at the limitations faced in designing processors for mobile devices, such as power, cost and size. Having analyzed two of the most powerful mobile GPUs, the Nvidia Tegra X1 and the Qualcomm Adreno 530, and found them almost comparable to GPUs for PCs, we conclude that the stage is now set for bigger, more powerful mobile GPUs.


References

1. Best GPU manufacturers. http://www.ranker.com/list/the-best-gpu-manufacturers/computer-hardware
2. Tegra 3 is faster than a core 2 duo T. https://vrworld.com/2011/02/21/why-nvidiae28099s-tegra-3-is-faster-than-a-core-2-duo-t7200/
3. Audi selects Tegra processor for infotainment and dashboard.
4. https://en.wikipedia.org/wiki/Tegra
5. https://www.quora.com/How-important-is-the-GPU-compared-to-the-CPU-in-a-mobile-device-like-the-iPad-or-iPhone
6. CPU and GPU Parallel Processing for Mobile Augmented Reality.
7. The Benefits of Multiple CPU Cores in Mobile Devices. http://www.nvidia.com/content/pdf/tegra_white_papers/benefits-of-multi-core-cpus-in-mobile-devices_ver1.2.pdf
8. Graphic Processing Units for Handhelds. http://ieeexplore.ieee.org/document/4483498/
9. Nvidia Tegra 4 family GPU architecture. http://www.nvidia.com/content/pdf/tegra_white_papers/tegra-k1-whitepaper.pdf
10. On the Use of Mobile GPU for Accelerating Malware Detection Using Trace Analysis. http://ieeexplore.ieee.org/document/7371440/?section=abstract
11. A paper on using the mobile GPU for general computing, which suggested a way of using the GPU for an accelerated face recognition system. http://lbmedia.ece.ucsb.edu/resources/ref/vlsidat11.pdf
12. CPU and GPU parallel processing for mobile Augmented Reality. http://ieeexplore.ieee.org/document/6743972/
13. https://bastianzuehlke.wordpress.com/2011/10/18/mobile-gpus-introduction-challenges/
14. http://www.slideshare.net/AlessioVillardita/ca-1st-presentation-final-published
15. https://kristerw.blogspot.com/2016/09/mobile-gpus-power-performance-area.html
16. Improving mobile gaming performance through cooperative CPU-GPU thermal management. http://ieeexplore.ieee.org/document/7544290/
17. http://www.notebookcheck.net/NVIDIA-Tegra-X1-Maxwell-GPU.137006.0.html
18. http://www.notebookcheck.net/Qualcomm-Adreno-530.156189.0.html


Appendix
