2015-Samos-01

Experiences in Speeding Up Computer Vision Applications on Mobile Computing Platforms Luna Backes Alejandro Rico Bjorn¨ Franke Barcelona Supercomputing Center, Barcelona Supercomputing Center, The University of Edinburgh, Barcelona, Spain Barcelona, Spain Edinburgh, United Kingdom Email: [email protected] Email: [email protected] Email: [email protected] Abstract—Computer vision (CV) is widely expected to be the state-of-the-art CV algorithms require powerful desktop GPUs next big thing in mobile computing. The availability of a camera able to provide high performance. and a large number of sensors in mobile devices will enable CV applications that understand the environment and enhance Existing CV applications for mobile devices are simpler people’s lives through augmented reality. One of the problems versions of more complex algorithms in an attempt to provide yet to solve is how to transfer demanding state-of-the-art CV high frame rates at the expense of functionality and accu- algorithms —designed to run on powerful desktop computers racy [4]. We tried a set of existing CV apps on a powerful with several GPUs— onto energy-efficient, but slow, processors smartphone, the Samsung Galaxy Note 3, and found that and GPUs found in mobile devices. To accommodate to the lack user experience is poor due to erroneous outputs, overheating of performance, current CV applications for mobile devices are simpler versions of more complex algorithms, which generally that led some apps to abort, and slow frame rates. This run slowly and unreliably and provide a poor user experience. experience shows that existing attempts of CV on mobile In this paper, we investigate ways to speed up demanding devices are not completely satisfactory, and motivates the work CV applications to run faster on mobile devices. We selected of several projects, such as PAMELA [5] and Google’s project KinectFusion (KF) as a representative CV application. The KF Tango [6], towards providing more optimized CV software and application constructs a 3D model from the images captured specialised hardware. by a Kinect. After porting it to an ARM platform, we applied several optimisation and parallelisation techniques using OpenCL In this paper, we investigate ways to speed up demanding to exploit all the available computing resources. We evaluated the CV applications on mobile devices. We selected the Kinect- impact on performance and power and demonstrate a 4× speed- Fusion (KF) application [7] as a representative CV application up with just a 1.38× power increase. We also evaluated the that requires high performance. KF processes images of objects performance portability of our optimisations by running on a taken from multiple angles by a Kinect and constructs a 3D different platform, and assessed similar improvements despite model of the observed objects. It usually runs on desktop the different multi-core configuration and memory system. By measuring processor temperature, we found overheating to be the machines with powerful GPUs because it requires high per- main limiting factor for running such high-performance codes on formance to process the images and construct the model at a mobile device not designed for full continuous utilisation. interactive rates (approximately 30 FPS). We ported KF to run on a mobile system-on-chip (SoC), the Samsung Exynos 5250. We profiled the application and ported it to OpenCL to make I. INTRODUCTION use of the embedded GPU available in the chip. We performed Mobile devices represent a new era in computing with a set of optimisations by porting several parts to the application a market rapidly growing [1]. At the same time, mobile to run on the GPU, parallelised other parts to exploit multiple processor performance has increased by a factor of 25× in CPUs, and minimised synchronizations and data transfers. 3 years [2]. The challenge now is power: the power envelope is going to remain fixed, at about 2W for smartphones and With this work we attempt to quantify how far current 5-7W for tablets. This is due to the need for cooling, which is mobile SoCs are capable of running demanding CV applica- limited in compact mobile devices. This thermal limit equally tions, such as KF, at interactive rates; anticipate what problems determines the power limit that stays constant. This also affects developers may face when porting such applications to mobile performance: a mobile device working at maximum speed for devices; and provide a sense of the potential improvements of a long time heats up quickly, which forces the system to reduce our optimisations on such applications. its operating frequency. In this context, the contributions of this paper are: Smartphones and other mobile devices integrate more sensors in each generation. Some examples are GPS, accelerome- • A profiling of the KF application running on the Sam- ter, gyroscope, proximity, light and camera. These hardware sung Exynos 5250 and description of the optimisations components make mobile devices an appealing option for applied based on that profiling. We ported several computer vision (CV). The possibilities of CV in mobile parts of the application to OpenCL to exploit more devices are endless [3], for example: games, improve disabled parallelism using the embedded GPU, removed data people’s life or new learning techniques. However, demanding transfers between CPU and GPU using shared memory and removed unnecessary CPU-GPU synchronisations This work was funded by the EPSRC grant PAMELA EP/K008730/1 and by allowing consecutive computations on the GPU to the Mont-Blanc project (EU FP7 grant agreements 288777 and 610402). reuse the data in GPU memory. 978-1-4673-7311-1/15/$31.00 ©2015 IEEE 1 TABLE I. PERCENTAGE OF TIME SPENT IN THE DIFFERENT STAGES OF • A temperature evaluation of KF running on Samsung THE KF ORIGINAL VERSION. Exynos 5250. We found that overheating is a major issue in terms of stability and performance because Stages Time (%) Acquisition 2.70% the processor applies thermal throttling to avoid tem- Preprocessing 23.08% perature exceeding 85 degrees to avoid damaging Tracking 5.50% the CPU. We also evaluated the impact of several Integration 49.87% Rendering 18.64% cooling solutions. Adding a heat sink and a fan Drawing 0.14% reduced execution time by 20% of the unoptimised KF version. We also found that our optimisations made KF less sensitive to temperature. Moving computation fabric that interconnects all the processing resources and pe- to the GPU significantly reduces overheating and, ripherals. The architecture is implemented for high throughput thus, thermal throttling happens much less frequently rather than latency so it runs at hundreds of megahertz targeting than in the unoptimised version. Therefore, the GPU power efficiency. version not only improves performance by exploiting more parallelism, but also due to a reduced thermal This chip is integrated in the specialised mobile device throttling activity. developed in Google’s project Tango [6]. Their objective is to provide a 5” Android phone with highly customized hardware, • A performance and power evaluation of KF running on including Movidius’ Myriad chip, and software designed to Samsung Exynos 5250 on each optimisation step. We track the full 3-dimensional motion of the device as you hold measure the execution time and power consumption it while simultaneously creating a map of the environment. of each one of our optimisations. We also break down The project also includes the development of a 7” tablet with the execution time on the different stages of the KF the NVIDIA Tegra K1 processor, 4GB of RAM, 128GB of algorithm to analyse which of them our optimisations storage, motion tracking camera, integrated depth sensing and affect. We achieved an overall 4× speed-up with only wireless interfaces. a 1.38× power increase. This results in a 3× reduction on energy-to-solution which is crucial for battery- Another approach to improve the capabilities of CV ap- restricted environments such as mobile devices. We plications on mobile devices is to optimise the software for also repeated the analysis on a different mobile SoC, existing mobile processors. Previous works [13], [14] ported the Samsung Exynos 5422, and experienced similar CV applications to mobile SoCs and provide significant speed- improvements that show a good performance porta- ups mainly by using the GPU. Cheng and Wang [13] ported bility of our optimisations. a face recognition algorithm to OpenGL. The application processes a single image to identify faces using a pattern recognition algorithm. They tested their OpenGL version in II. RELATED WORK Android-powered devices and compared their efficiency in Computer vision (CV) for mobile devices is becoming a GPUs from Qualcomm, NVIDIA and Imagination Technolo- trending topic since the performance of mobile processors gies. Wang et al. [14] ported an image inpainting-based object is becoming high enough to run simple CV applications. removal algorithm to OpenCL. The application processes a Existing examples of such applications are video stabilisation single image to fill holes left by the removal of objects. It with rolling shutter correction, face detection and recognition, uses patterns nearby the hole to build an approximation of the low-light image/video enhancement, car plate detection and pixels the object was covering. They tested their application recognition [8]; augmented reality (e.g., placing of furniture on in an Android-powered device with a Qualcomm GPU [14]. the recorded space [9]); and real-time translation of text [10]. Our work differs from these previous works in that we port An example is Word Lens [10], which translates text in and optimise a highly-demanding application, KinectFusion, real time from the video captured by the mobile camera. that processes a video, instead of a single image, and we The application has to recognise the text from the image and provide a comprehensive evaluation of performance, power translate it, ideally, at an interactive rate.

2015-Samos-01

Enabling the Use of Low Power Mobile and Embedded Technologies For

DISI - University of Trento

Introducing Slambench, a Performance and Accuracy Benchmarking Methodology for SLAM

Report on Tuned Linux-ARM Kernel and Delivery of Kernel Patches to the Linux Kernel Version 1.0

Proyecto Fin De Grado

Que E Un Arduino

Energy Neutral Wireless Sensing for Server Farms Monitoring

D5.5– Intermediate Report on Porting and Tuning of System Software to ARM Architecture Version 1.0

On First Page

Design, Construction, and Use of a Single Board Computer Beowulf Cluster: Application of the Small- Footprint, Low-Cost, Insignal 5420 Octa Board

Samsung Exynos 5 Board - 10-30-2012 by Vincent - Streamcomputing

Fast Software Polynomial Multiplication on ARM Processors Using the NEON Engine Danilo Câmara, Conrado Gouvêa, Julio López, Ricardo Dahab