Edinburgh Research Explorer

Experiences in Speeding Up Computer Vision Applications on

Mobile Computing Platforms

Citation for published version: Backes, L, Rico, A & Franke, B 2015, Experiences in Speeding Up Computer Vision Applications on Mobile Computing Platforms. in Proceedings of the International Symposium on Systems, Architectures, Modeling, and Simulation (SAMOS XV), Samos, Greece, July 20-23, 2015. Institute of Electrical and Electronics Engineers (IEEE), pp. 1-8. https://doi.org/10.1109/SAMOS.2015.7363653

Digital Object Identifier (DOI): 10.1109/SAMOS.2015.7363653


Document Version: Peer reviewed version

Published In: Proceedings of the International Symposium on Systems, Architectures, Modeling, and Simulation (SAMOS XV), Samos, Greece, July 20-23, 2015



Experiences in Speeding Up Computer Vision Applications on Mobile Computing Platforms

Luna Backes (Barcelona Supercomputing Center, Barcelona, Spain), Email: [email protected]
Alejandro Rico (Barcelona Supercomputing Center, Barcelona, Spain), Email: [email protected]
Björn Franke (The University of Edinburgh, Edinburgh, United Kingdom), Email: [email protected]

This work was funded by the EPSRC grant PAMELA EP/K008730/1 and by the Mont-Blanc project (EU FP7 grant agreements 288777 and 610402).

Abstract—Computer vision (CV) is widely expected to be the next big thing in mobile computing. The availability of a camera and a large number of sensors in mobile devices will enable CV applications that understand the environment and enhance people's lives through augmented reality. One of the problems yet to be solved is how to transfer demanding state-of-the-art CV algorithms, designed to run on powerful desktop computers with several GPUs, onto the energy-efficient, but slow, processors and GPUs found in mobile devices. To compensate for the lack of performance, current CV applications for mobile devices are simpler versions of more complex algorithms, which generally run slowly and unreliably and provide a poor user experience. In this paper, we investigate ways to speed up demanding CV applications to run faster on mobile devices. We selected KinectFusion (KF) as a representative CV application. The KF application constructs a 3D model from the images captured by a Kinect. After porting it to an ARM platform, we applied several optimisation and parallelisation techniques using OpenCL to exploit all the available computing resources. We evaluated the impact on performance and power and demonstrate a 4× speed-up with just a 1.38× power increase. We also evaluated the performance portability of our optimisations by running on a different platform, and assessed similar improvements despite the different multi-core configuration and memory system. By measuring processor temperature, we found overheating to be the main limiting factor for running such high-performance codes on a mobile device not designed for full continuous utilisation.

I. INTRODUCTION

Mobile devices represent a new era in computing with a rapidly growing market [1]. At the same time, mobile processor performance has increased by a factor of 25× in 3 years [2]. The challenge now is power: the power envelope is going to remain fixed, at about 2W for smartphones and 5-7W for tablets. This is due to the need for cooling, which is limited in compact mobile devices. This thermal limit equally determines the power limit, which stays constant. It also affects performance: a mobile device working at maximum speed for a long time heats up quickly, which forces the system to reduce its operating frequency.

Smartphones and other mobile devices integrate more sensors in each generation. Some examples are GPS, accelerometer, gyroscope, proximity, light and camera. These hardware components make mobile devices an appealing option for computer vision (CV). The possibilities of CV in mobile devices are endless [3], for example games, improving the lives of disabled people, or new learning techniques. However, demanding state-of-the-art CV algorithms require powerful desktop GPUs able to provide high performance.

Existing CV applications for mobile devices are simpler versions of more complex algorithms in an attempt to provide high frame rates at the expense of functionality and accuracy [4]. We tried a set of existing CV apps on a powerful smartphone, the Samsung Galaxy Note 3, and found that the user experience is poor due to erroneous outputs, overheating that led some apps to abort, and slow frame rates. This experience shows that existing attempts at CV on mobile devices are not completely satisfactory, and motivates the work of several projects, such as PAMELA [5] and Google's project Tango [6], towards providing more optimised CV software and specialised hardware.

In this paper, we investigate ways to speed up demanding CV applications on mobile devices. We selected the KinectFusion (KF) application [7] as a representative CV application that requires high performance. KF processes images of objects taken from multiple angles by a Kinect and constructs a 3D model of the observed objects. It usually runs on desktop machines with powerful GPUs because it requires high performance to process the images and construct the model at interactive rates (approximately 30 FPS). We ported KF to run on a mobile system-on-chip (SoC), the Samsung Exynos 5250. We profiled the application and ported it to OpenCL to make use of the embedded GPU available in the chip. We performed a set of optimisations by porting several parts of the application to run on the GPU, parallelised other parts to exploit multiple CPUs, and minimised synchronisations and data transfers.

With this work we attempt to quantify how far current mobile SoCs are capable of running demanding CV applications, such as KF, at interactive rates; anticipate what problems developers may face when porting such applications to mobile devices; and provide a sense of the potential improvements of our optimisations on such applications.

In this context, the contributions of this paper are:

• A profiling of the KF application running on the Samsung Exynos 5250 and a description of the optimisations applied based on that profiling. We ported several parts of the application to OpenCL to exploit more parallelism using the embedded GPU, removed data transfers between CPU and GPU using shared memory, and removed unnecessary CPU-GPU synchronisations by allowing consecutive computations on the GPU to reuse the data in GPU memory.

• A temperature evaluation of KF running on the Samsung Exynos 5250. We found that overheating is a major issue in terms of stability and performance because the processor applies thermal throttling to keep the temperature from exceeding 85 degrees Celsius and damaging the CPU. We also evaluated the impact of several cooling solutions. Adding a heat sink and a fan reduced the execution time of the unoptimised KF version by 20%. We also found that our optimisations made KF less sensitive to temperature. Moving computation to the GPU significantly reduces overheating and, thus, thermal throttling happens much less frequently than in the unoptimised version. Therefore, the GPU version improves performance not only by exploiting more parallelism, but also due to reduced thermal throttling activity.

• A performance and power evaluation of KF running on the Samsung Exynos 5250 at each optimisation step. We measure the execution time and power consumption of each one of our optimisations. We also break down the execution time into the different stages of the KF algorithm to analyse which of them our optimisations affect. We achieved an overall 4× speed-up with only a 1.38× power increase. This results in a 3× reduction in energy-to-solution, which is crucial for battery-restricted environments such as mobile devices. We also repeated the analysis on a different mobile SoC, the Samsung Exynos 5422, and experienced similar improvements that show a good performance portability of our optimisations.

II. RELATED WORK

Computer vision (CV) for mobile devices is becoming a trending topic since the performance of mobile processors is becoming high enough to run simple CV applications. Existing examples of such applications are video stabilisation with rolling shutter correction, face detection and recognition, low-light image/video enhancement, and car plate detection and recognition [8]; augmented reality (e.g., placing furniture in the recorded space [9]); and real-time translation of text [10].

An example is Word Lens [10], which translates text in real time from the video captured by the mobile camera. The application has to recognise the text from the image and translate it, ideally at an interactive rate. Another example is video stabilisation [8], which processes the scene in real time and recognises the objects to understand how the camera moves relative to the scene and eliminate the trembling. These kinds of video processing are especially difficult when it is dark, there are plenty of similar objects together, or the resolution of the camera is low. For this reason, they require complex processing algorithms that are computationally expensive.

An approach to increase the computational capacity of mobile devices is to build specialised hardware. Movidius [11], a vision processor company, develops processors with an architecture specifically tailored for CV. Their latest generation processor, the Myriad 2 [12], comprises 12 specialised vector VLIW processors for high throughput, a wide range of interfaces for high connectivity, a set of imaging/vision accelerators for specific processing tasks, and a high-bandwidth memory fabric that interconnects all the processing resources and peripherals. The architecture is designed for high throughput rather than low latency, so it runs at hundreds of megahertz targeting power efficiency.

This chip is integrated in the specialised mobile device developed in Google's project Tango [6]. Their objective is to provide a 5" Android phone with highly customised hardware, including Movidius' Myriad chip, and software designed to track the full 3-dimensional motion of the device as you hold it while simultaneously creating a map of the environment. The project also includes the development of a 7" tablet with the NVIDIA Tegra K1 processor, 4GB of RAM, 128GB of storage, a motion tracking camera, integrated depth sensing and wireless interfaces.

Another approach to improve the capabilities of CV applications on mobile devices is to optimise the software for existing mobile processors. Previous works [13], [14] ported CV applications to mobile SoCs and provide significant speed-ups, mainly by using the GPU. Cheng and Wang [13] ported a face recognition algorithm to OpenGL. The application processes a single image to identify faces using a pattern recognition algorithm. They tested their OpenGL version on Android-powered devices and compared its efficiency on GPUs from Qualcomm, NVIDIA and Imagination Technologies. Wang et al. [14] ported an image inpainting-based object removal algorithm to OpenCL. The application processes a single image to fill the holes left by the removal of objects. It uses patterns near the hole to build an approximation of the pixels the object was covering. They tested their application on an Android-powered device with a Qualcomm GPU [14].

Our work differs from these previous works in that we port and optimise a highly demanding application, KinectFusion, that processes a video instead of a single image, and we provide a comprehensive evaluation of performance, power and also temperature for multiple optimisation steps. As in the work of Cheng and Wang [13], we also compare multiple platforms with different GPUs, but we developed the GPU version of our application in OpenCL rather than in OpenGL. This provides more maintainable software because it does not require mapping the algorithm to textures and geometric primitives as in OpenGL. Moreover, our optimisations do not only exploit the GPU, as in previous works, but also the multiple CPUs in the SoC.

III. KINECTFUSION OPTIMISATIONS

KinectFusion (KF) [7] provides 3D object scanning and model creation using a Kinect for Windows sensor. The user can scan a scene with the Kinect camera and simultaneously see and interact with a detailed 3D model of the scene.

We ported a C++ version of KinectFusion, which we refer to as the original version, to run on a mobile processor, the Samsung Exynos 5250 SoC (see Section IV for more details). We profiled the code and identified the time spent on each of the six stages to process each frame [7]. Table I shows the results. Integration accounts for almost 50% of the total execution time, while preprocessing and rendering account for 41%. Therefore, these stages are the main target of our optimisations.

TABLE I. PERCENTAGE OF TIME SPENT IN THE DIFFERENT STAGES OF THE KF ORIGINAL VERSION.

  Stage           Time (%)
  Acquisition        2.70
  Preprocessing     23.08
  Tracking           5.50
  Integration       49.87
  Rendering         18.64
  Drawing            0.14

Based on this profile, we ported a set of code sections, referred to as kernels, to run on the GPU using OpenCL, and parallelised some other parts of the application using OpenMP to run on the multiple CPUs on the chip. We also applied optimisations to avoid memory copies of kernel data between the CPU and the GPU, and reduced the amount of synchronisation between kernel executions.

Unlike heterogeneous systems with discrete GPUs, the Samsung Exynos 5250 has a unified memory shared between the CPU and the GPU. The implementation of our first KF kernel in OpenCL used memory copies between the CPU and the GPU for simplicity. This implies creating memory objects before calling the kernel by allocating a buffer on the GPU of the same size as the data to be copied from the CPU (therefore, it uses double the memory necessary), copying the data to the GPU before the kernel call, and copying it back to the CPU after kernel completion.

In the case of shared memory, we can allocate the data using clCreateBuffer, instead of malloc, with a flag (CL_MEM_ALLOC_HOST_PTR) that allows the memory to be created and accessed by both CPUs and GPUs. In OpenCL, such a buffer cannot be accessed from the CPU unless it is mapped to the CPU using clEnqueueMapBuffer, and before calling the GPU kernel it must be released back to the GPU using clEnqueueUnmapMemObject. Therefore, to use shared memory in our KF OpenCL version, we first allocate the data with clCreateBuffer and immediately map it to the CPU to initialise it. Then, before calling the kernel for execution on the GPU, we unmap the data so the kernel can access it, and whenever the kernel completes, it is mapped back to the CPU.
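As an illustration of this allocation pattern (a minimal sketch, not the actual KF code; it assumes an existing cl_context ctx, cl_command_queue queue, cl_kernel kernel and a buffer of n floats, all hypothetical names):

    #include <CL/cl.h>
    #include <stddef.h>

    /* Sketch: allocate a buffer both CPU and GPU can access on a unified-memory SoC,
     * initialise it on the CPU, run one kernel on it, and map it back for reading.   */
    static float *run_on_shared_buffer(cl_context ctx, cl_command_queue queue,
                                       cl_kernel kernel, size_t n)
    {
        cl_int err;

        /* CL_MEM_ALLOC_HOST_PTR asks the driver to allocate host-accessible memory,
         * so no explicit CPU-GPU copies are needed on a shared-memory chip.          */
        cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR,
                                    n * sizeof(float), NULL, &err);

        /* Map the buffer so the CPU can initialise it. */
        float *p = (float *)clEnqueueMapBuffer(queue, buf, CL_TRUE, CL_MAP_WRITE,
                                               0, n * sizeof(float), 0, NULL, NULL, &err);
        for (size_t i = 0; i < n; i++)
            p[i] = 0.0f;

        /* Unmap before launching the kernel so the GPU can use the data. */
        clEnqueueUnmapMemObject(queue, buf, p, 0, NULL, NULL);
        clSetKernelArg(kernel, 0, sizeof(cl_mem), &buf);
        clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &n, NULL, 0, NULL, NULL);

        /* Blocking map to read the results on the CPU once the kernel has finished. */
        return (float *)clEnqueueMapBuffer(queue, buf, CL_TRUE, CL_MAP_READ,
                                           0, n * sizeof(float), 0, NULL, NULL, &err);
    }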
As we ported kernels to OpenCL, we identified several points where CPU synchronisation between kernels was not necessary. Some kernels execute back-to-back on the GPU, so mapping the data structures from/to the GPU in between them is not necessary. The data generated by a kernel to be used by a following kernel does not need to be mapped to the CPU after the first kernel finishes and back to the GPU before the second one starts, if the CPU is not going to process that data in between. Instead, the data can remain mapped to the GPU from the start of the first kernel until the finalisation of the second one.

Also, to ensure correctness, the CPU waits using a barrier primitive (clFinish) that blocks until a kernel (or set of kernels) finishes, before calling other kernels that require the output of the previous ones. We removed this barrier synchronisation by using a single command queue for all kernels. When the CPU enqueues kernel executions in a command queue, the kernels are executed on the GPU in the order in which they were enqueued. This way, by enqueuing the kernels in program order, we ensure that sequential order is maintained and there are no data races, and therefore no need for barrier synchronisations.
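A minimal sketch of this pattern (hypothetical kernel names, not the KF kernels): two dependent kernels enqueued on the same in-order command queue, with no clFinish between them.

    #include <CL/cl.h>

    /* kernelA produces data that kernelB consumes.  A command queue created without
     * CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE is in-order, so the kernels run on the
     * GPU in enqueue order and no intermediate barrier is required.                  */
    static void enqueue_dependent_kernels(cl_command_queue queue,
                                          cl_kernel kernelA, cl_kernel kernelB,
                                          size_t global)
    {
        clEnqueueNDRangeKernel(queue, kernelA, 1, NULL, &global, NULL, 0, NULL, NULL);
        /* No clFinish here: the in-order queue already serialises A before B. */
        clEnqueueNDRangeKernel(queue, kernelB, 1, NULL, &global, NULL, 0, NULL, NULL);

        /* Synchronise with the CPU only when the results are actually needed. */
        clFinish(queue);
    }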

Table II shows the list of kernel portings and optimisations in implementation order, and the label used in the results to refer to them. All optimisation steps include the previous optimisations. All kernels ported after OCL-SharedMem use shared memory (no data copies).

TABLE II. LIST OF OPTIMISATIONS IN IMPLEMENTATION ORDER. EACH OPTIMISATION STEP INCLUDES THE PREVIOUS ONES.

  Label           Description
  OCL-Initial     Porting of integrate kernel (Integration stage)
  OCL-SharedMem   Use shared memory (no data copies)
  OCL-Raycast     Porting of raycast kernel (Integration stage)
  OCL-Track       Porting of track kernel (Tracking stage)
  OCL-OpenMP      Porting of CPU-side code to OpenMP (Preprocessing and Rendering stages)
  OCL-Barriers    Removed barriers and unnecessary memory mappings (Integration stage)
  OCL-Bilateral   Porting of bilateral filter kernel (Preprocessing stage)

IV. EXPERIMENTAL SETUP

Our hardware testbed for evaluating performance, power and temperature is the Arndale board development kit, which features the Samsung Exynos 5250 SoC. This SoC includes two ARM Cortex-A15 cores, an ARM Mali-T604 GPU and 2GB of main memory. We repeated our experiments on a second development kit, the Odroid-XU3 [15], to assess the performance portability of our code. This board features the Samsung Exynos 5422 SoC with a big.LITTLE architecture including four ARM Cortex-A15 cores, four ARM Cortex-A7 cores, an ARM Mali-T628 GPU and 2GB of main memory. Full specifications of both boards are shown in Table III.

TABLE III. CHARACTERISTICS OF THE ARNDALE BOARD AND THE ODROID-XU3 DEVELOPMENT KITS.

            Arndale Board               Odroid-XU3
  Chip      Samsung Exynos 5250         Samsung Exynos 5422
  CPU       2x ARM [email protected]    4x ARM Cortex-A15@2GHz + 4x ARM [email protected]
  L1 Cache  32KB(I)+32KB(D)             32KB(I)+32KB(D)
  L2 Cache  1MB unified                 2MB + 512KB
  GPU       ARM Mali-T604               ARM Mali-T628
  Memory    2GB 32-bit 800MHz LPDDR3    2GB 32-bit 933MHz LPDDR3
  Storage   16GB uSD                    16GB eMMC v.03
  OS        Ubuntu 12.04                Ubuntu 14.04

We executed the KF application with four different video input files in raw format as benchmarks. The four benchmarks are: chairs, desktop, person and weird, named after the objects appearing in each video. These videos were recorded by moving a camera around the objects to capture as many different angles as possible. This way, the application can recognise the depth from the camera to the object and use it to create the 3D model. The execution of the benchmarks takes between 15 and 45 minutes in the original C++ version.

Figure 1 shows a frame of the output video for an execution of the chairs benchmark. The KF application outputs a video with three images. The left image is the depth map, coloured depending on the distance to the camera; the middle one is the tracking, showing the status of the recognition for each pixel; and the right one is a raycast of the constructed 3D volume.

Fig. 1. Images of the output video obtained by executing KF with the chairs benchmark.

A. Performance Measurement

To measure the impact on execution time of each optimisation, we profiled the application at each optimisation step. All the executions were done in exclusive mode: no other applications were running at the same time. The profiling was done with clock_gettime() between the six different stages of the KF algorithm for each frame: acquisition, preprocessing, tracking, integration, rendering and drawing.

The metrics are the execution time in seconds of each computation stage, and frames per second (FPS) as a measure of whole-application performance. The results are the average of 5 executions. We also calculated energy-to-solution [17] to evaluate how our optimisations affect battery life.

The application binaries for each optimisation step are generated by compiling the sources with the following CPU flags: -O3 -mcpu=cortex-a15 -mtune=cortex-a15 -mfloat-abi=hard -mfpu=neon-vfpv4; and the GPU flags: -cl-fast-relaxed-math -cl-mad-enable -cl-no-signed-zeros -cl-denorms-are-zero -cl-single-precision-constant. The frequency governor of the chip is set to performance.
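The per-stage timing described above can be sketched as follows (an illustrative sketch, not the instrumentation used in KF; the six stage functions are hypothetical placeholders):

    #define _POSIX_C_SOURCE 199309L
    #include <time.h>

    extern void acquisition(void), preprocessing(void), tracking(void),
                integration(void), rendering(void), drawing(void);

    /* Elapsed wall-clock time in seconds between two timestamps. */
    static double elapsed_s(struct timespec a, struct timespec b)
    {
        return (b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) * 1e-9;
    }

    /* Accumulate the time spent in each of the six KF stages for one frame. */
    void process_frame_timed(double stage_time[6])
    {
        void (*stage[6])(void) = { acquisition, preprocessing, tracking,
                                   integration, rendering, drawing };
        struct timespec t0, t1;

        for (int s = 0; s < 6; s++) {
            clock_gettime(CLOCK_MONOTONIC, &t0);
            stage[s]();
            clock_gettime(CLOCK_MONOTONIC, &t1);
            stage_time[s] += elapsed_s(t0, t1);
        }
    }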
B. Temperature Measurement

We measured temperature using the available sensors in the chip. The sensors command of the lm-sensors package outputs the temperature of the chip measured by the hardware sensors in Celsius degrees. We measured temperature every two seconds. This interval is a good compromise: it allows enough time for temperature changes while avoiding missing temperature peaks.

We measured chip temperature for the original KF version without cooling, with a heat sink, and with a heat sink and fan. After that, all measurements are done with heat sink plus fan for stability, except the last one. As mobile devices do not have space to include cooling solutions such as heat sinks and fans, we did the final comparison without any cooling to get a real insight into the performance that can be achieved by both the original and optimised versions in a real device.

C. Power Measurement

We measured the power consumption of the board using a Yokogawa WT230 power meter [16]. Figure 2 shows the connection scheme. The meter is connected as a bridge between the power socket and the board, and it measures the power of the entire platform, including the power supply and the fan. The power meter has a sampling frequency of 10Hz and 0.1% precision. There is no measurement overhead, as all readings are captured on a laptop through the serial port without affecting the execution on the board. The start and end of the readings are triggered by the running script on the board, which connects to the laptop through SSH.

Fig. 2. Power meter connection scheme.

V. RESULTS

In this section, we present the results of our temperature, performance and power evaluation of KF running on mobile processors. First, we show the impact of different cooling solutions on the original C++ version; second, the impact on performance and power of our optimisations; third, the comparison of the original and optimised versions without any cooling; and, finally, the evaluation of the performance portability of our code.

A. Original Version Evaluation

In this section we present the results of the original C++ version in terms of temperature, performance and power.

1) Temperature: We decided to measure chip temperature because we were experiencing a large variation across executions. We found that thermal throttling caused frequency downscaling that varied from run to run. In an attempt to achieve consistent performance results, we evaluated different cooling solutions: without cooling, with a heat sink, and with a heat sink plus a fan.

Figure 3.a shows the temperature of the chip over time during the execution of the person benchmark. To avoid chip damage due to overheating, the operating system monitors temperature and, when it reaches 85°C, the maximum allowed, it downscales the frequency to lower the temperature to 75°C. Then, the frequency is increased back and the temperature rises again to 85°C because the chip is still running a compute-intensive workload. This up-and-down frequency scaling degrades performance due to the overhead of periodically changing frequency and due to the reduced amount of time working at maximum speed.

Figure 3.b shows the same measurements after covering the chip with a heat sink. There is still thermal throttling, but it happens less frequently. The time to reach 85°C is longer, which consequently lowers the number of times that it is necessary to change the frequency. In the time the frequency changed ten times without cooling, it changes two times with the heat sink. This way, the chip works more time at higher speed and the overhead of changing frequency is paid fewer times.

Figure 3.c shows the same measurements after adding a fan on top of the heat sink. This finally avoids thermal throttling: the temperature never reaches the maximum and stays almost constant. This way, the chip works at the maximum frequency all the time, so there is no overhead associated with frequency scaling.

Fig. 3. Temperature in Celsius degrees over time of the original version (person benchmark): (a) without cooling there is intense thermal throttling activity; (b) adding a heat sink reduces the frequency of thermal throttling; (c) adding a fan completely avoids overheating and thermal throttling disappears.

2) Performance: Figure 4 shows the FPS rate of the four different benchmarks for the original version with the three cooling setups. Thermal throttling has a clear impact on performance, leading the heat-sink-only setup to outperform the execution without cooling by 11%. The setup with heat sink and fan experiences a 12% speed-up compared to the execution with heat sink and is 24% better than the execution without cooling.

Fig. 4. Performance in FPS of the original version for the different benchmarks and cooling configurations.

3) Power: Figure 5 shows the static and dynamic power of the original version for the different cooling setups and benchmarks. We consider the static power to be the idle power. The static power consumption is the same for the no-cooling and heat sink setups (6W), but dynamic power is higher with the heat sink. This is because there are fewer changes of frequency, so the chip remains more time at the highest frequency. The setup with fan consumes more power for two reasons: the fan adds extra power, so static power is higher (9W); and dynamic power is also higher because the chip runs at maximum frequency during the whole execution. The dynamic power increase of the setup with fan over the setup with heat sink is 23% for chairs, 20% for desktop, 4% for person, and 8% for weird.

Fig. 5. Average power consumption in Watts comparing the original version (for all benchmarks) without cooling, with heat sink and with fan.

Including a heat sink and a fan in a mobile device is not feasible, and the performance improvement in these experiments is approximately the same as the power increase. Nevertheless, we perform our optimisation evaluation using the heat sink and fan so that we remove the interference of thermal throttling when reasoning about the effect of each optimisation, and also to have lower variability in our results.

B. Optimisation Results

In this section we present the performance and power results of the multiple optimisation steps listed in Table II.

1) Performance: Figure 6 shows the execution time for all benchmarks and optimisations, decomposed into the different program stages.

Fig. 6. Execution time breakdown of each stage in seconds of all the optimisations for the different benchmarks.

The first three optimisations, OCL-Initial, OCL-SharedMem and OCL-Raycast, affect the Integration stage. Even though OCL-Initial executes the integrate kernel faster on the GPU, the overhead of the copies cancels that benefit and it gets the same performance as the original version. Removing data copies, OCL-SharedMem, avoids that overhead and represents a 45% execution time reduction of the Integration stage. OCL-Raycast gets a further 71% over OCL-SharedMem. Both together reduce the execution time of the Integration stage by 84%. This reduces overall execution time by 30%.

The porting of another function (track), in OCL-Track, reduces the execution time of the Rendering stage by 9%. Parallelising some parts of the application to use the multiple cores in the chip with OpenMP, optimisation OCL-OpenMP, represents a further overall execution time reduction of 14% versus OCL-Track. The reduction of synchronisations in OCL-Barriers has a negligible impact. It appears the synchronisations we removed had little or no overhead. This optimisation would probably be more important for discrete GPUs without shared memory, where the overhead of these calls is larger.

Last, porting the bilateral filter, OCL-Bilateral, dramatically reduces the execution time of the Preprocessing stage, by 96%. This represented an execution time reduction of 46% versus OCL-Barriers.

Comparing the original version to the last version with all optimisations, we achieved a 74% execution time reduction, which translates into a speed-up of 3.93×. This translates into an improvement in the FPS rate. Figure 7 shows the achieved FPS for each version. From the original 0.6–0.7 FPS, combining all optimisations we achieved between 2.2–3.1 FPS.

Fig. 7. Performance in FPS of all the optimisations for the different benchmarks.

2) Power: Figure 8 compares the power consumption for each optimisation. The power consumption of the first four versions is similar. It seems the additional power of running on the GPU is compensated by the power saved by keeping the core idle meanwhile.

The first optimisation with a significant impact on power is OCL-OpenMP. This is because not only is the GPU used extensively, but both cores on the CPU side are also working in parallel. Then, OCL-Bilateral also implies an increase in power consumption, but in a much lower proportion than the reduction in execution time of almost half mentioned before.

Fig. 8. Average power consumption in Watts comparing all optimisations and benchmarks.

Comparing the original version with the last optimisation (OCL-Bilateral), power consumption increased by 38%. If we compare the power consumption increase with the performance increase, we see a clear benefit: power increased by 38% while performance grew by 293%. This large difference is mainly due to the large portion of static power, which is paid regardless of the performance achieved.

In terms of energy-to-solution for all benchmarks, the original version consumes 58409 J and the optimised version 19850 J, which represents a 66% reduction in energy. For other metrics such as the energy-delay (ED) and energy-delay-square (ED²) products, the optimised version reduces them by 91% and 98%, respectively. This means 3× better energy-to-solution, 11× better ED and 44× better ED².
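As a rough consistency check of these figures (a sketch using the aggregate numbers above; small deviations come from per-benchmark rounding), with E the energy-to-solution and T the execution time:

    \frac{E_{opt}}{E_{orig}} = \frac{19850}{58409} \approx 0.34, \qquad
    \frac{T_{opt}}{T_{orig}} \approx \frac{1}{3.93} \approx 0.25

    \frac{(ET)_{opt}}{(ET)_{orig}} \approx 0.34 \cdot 0.25 \approx 0.09 \;(\approx 91\%\ \text{reduction}), \qquad
    \frac{(ET^2)_{opt}}{(ET^2)_{orig}} \approx 0.09 \cdot 0.25 \approx 0.02 \;(\approx 98\%\ \text{reduction})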
C. Final Results

Considering only the optimisations in the Integration stage, the speed-up for that stage is 7.13×. For the whole application, this means a speed-up of 1.60×. We calculated, using Amdahl's law, the maximum theoretical speed-up assuming infinite resources, and the result was 1.99×. With our optimisations, and considering the limited resources of an embedded platform, achieving 1.60× out of 1.99× is a very good result.
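For reference, a minimal sketch of that Amdahl's-law bound, assuming p is the fraction of the original execution time spent in the Integration stage (Table I) and that only this stage is accelerated, by a factor s:

    S(s) = \frac{1}{(1-p) + p/s}, \qquad
    \lim_{s\to\infty} S(s) = \frac{1}{1-p} = \frac{1}{1 - 0.4987} \approx 1.99\times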

Overall, comparing the setups with fan (original and optimised), the speed-up is 4×. Since having these cooling solutions in a mobile device is not feasible, we also ran our optimised version on the board without any cooling to get a sense of what the improvement on a real device would be. Figure 9 shows the FPS rate of both the original and the optimised (OCL-Bilateral) versions, for the setups without cooling and with heat sink plus fan. The average speed-up of the optimised version without cooling compared to the original without cooling is 3×.

Fig. 9. Performance comparison in FPS of the original version and the optimised version for the different benchmarks and cooling configurations.

Figure 10(a) shows the temperature for the original version, and Figure 10(b) shows the temperature for the optimised one. With our optimisations, not only did we improve performance, but we also reduced the frequency of thermal throttling. In the optimised version, the GPU is active for long periods of time, leaving the CPU idle waiting for the GPU to finish. Our hypothesis is that the heat produced by the CPU is lower in the optimised version than in the original due to its lower activity. Although the GPU is active in the optimised version and not in the original, it seems that this activity does not generate an amount of heat that compensates for the lower utilisation of the CPU. This would result in overall lower heating in the optimised version and would therefore explain the lower thermal throttling activity.

Fig. 10. Temperature over time in Celsius degrees (person benchmark) without cooling: (a) original version, (b) optimised version. In the original version there is intense thermal throttling activity, but with our optimisations it is less frequent. Execution time is shorter thanks to the better exploitation of parallelism of our optimisations and also due to the lower thermal throttling activity resulting from reduced CPU activity.

Figure 11 shows that the power consumption of the original and optimised versions without cooling is practically the same. So, for the same power, we achieve better performance and more stable results, as the frequency of thermal throttling is lower. This shows the benefits of the better energy efficiency of the GPU.

Fig. 11. Average power consumption in Watts comparing the original and optimised versions without cooling.

D. Performance Portability

We also evaluated our optimised code on a different development board, the Odroid-XU3. This board integrates a more powerful SoC featuring the next generation of the GPU.

Figure 12 shows the execution time breakdown of our optimised version on the Odroid-XU3 compared to the execution of the original and optimised versions on the Arndale board. The main differences between the optimised versions are in the Rendering and Tracking stages; all other stages take a similar time to execute. We executed the program on the Odroid-XU3 with two CPUs that are exactly the same as the ones in the Arndale board but running at a higher speed. In the optimised version, the track kernel from the Tracking stage runs on the GPU and the Rendering stage is completely executed on the CPUs. So, the stages that run mostly on the GPU take a similar time on both platforms. However, the stages that run mostly or completely on the CPU run faster on the Odroid because the CPUs run at a higher frequency.

Fig. 12. Execution time breakdown of each stage in seconds of the original and optimised versions on the Arndale board without cooling, and the optimised version on the Odroid-XU3.

Figure 13 shows the comparison of the performance achieved by the Arndale board and the Odroid-XU3. The optimised version runs at a speed between 1.6 and 1.9 FPS. This is a 2.87× speed-up compared to the original run on the Odroid-XU3 and 1.2× faster than the optimised version on the Arndale board. The speed-up of the optimised versus the original on the Arndale board was 2.95×. This similar performance improvement demonstrates the performance portability of our optimisations.

Fig. 13. Performance comparison in FPS of the original and the optimised versions on Arndale and Odroid-XU3 for the different benchmarks.

VI. CONCLUSION

Mobile devices are truly heterogeneous SoCs with specialised processing elements or accelerators for faster and more energy-efficient computation. Initially, GPUs were used only to accelerate graphics. Nowadays, the performance of embedded GPUs, and mobile SoCs in general, is improving, and embedded GPUs are becoming general-purpose GPUs. This context, and the many sensors mobile devices have, creates the perfect environment for CV on mobile devices.

CV applications consist of applying several computations to a large amount of data, commonly a matrix of pixels. These algorithms are parallel both in their input data and in their own nature. GPUs are very useful for speeding up graphics, but also for speeding up this kind of computer vision application.

We optimised the KF application, representative of other state-of-the-art CV applications, for a specific SoC commonly found in mobile devices, using the Arndale development board. The main optimisations consisted of porting parts of the application to the GPU, on which they execute faster and require less energy thanks to its higher energy efficiency compared to the CPU for compute-intensive operations. We also executed our optimised version on a newer and more powerful platform, the Odroid-XU3 development board. Just by recompiling and rerunning the optimised code, we achieved better performance, which demonstrates the performance portability of our optimisations. The optimised version resulting from our work serves as a base code that can be easily adapted to other platforms with embedded GPUs.

Regarding hardware constraints, apart from the compute capabilities, the packaging of a mobile device makes it difficult to dissipate heat. Current mobile chips are not ready for sustained full utilisation because the chip overheats and starts throttling, slowing down performance. Due to the large variation in the performance of different executions, we decided to measure the temperature of the chip and realised that there was thermal throttling. We added a heat sink to address this issue, but it only helped to reduce the frequency of thermal throttling. Adding a fan removed thermal throttling and kept the temperature below the limit. Nevertheless, these cooling elements cannot be integrated in a mobile device as they do not fit. Also, the performance benefit of using the fan was cancelled by the extra power the fan consumes.

Despite all our optimisations, we learnt that mobile devices are not yet ready for demanding CV applications such as KinectFusion. Combining all our optimisations using the GPU, we obtained between 2.2 and 3.1 FPS, which represents a 4× speed-up compared to the original CPU version. Although this is still far from an interactive rate (30 FPS), this work is one step forward towards this target.

REFERENCES

[1] J. Rivera and R. van der Meulen, "Gartner Says Worldwide Traditional PC, Tablet, Ultramobile and Mobile Phone Shipments to Grow 4.2 Percent in 2014." http://www.gartner.com/newsroom/id/2791017, 2014.
[2] N. Trevett, "Accelerating mobile augmented reality." Keynote at InsideAR Conference, 2012.
[3] B. Smith et al., "More iPhone Cool Projects," 2010.
[4] P. Hasper et al., "Remote Execution vs. Simplification for Mobile Real-time Computer Vision," in 9th International Conference on Computer Vision Theory and Applications (VISAPP), 2014.
[5] "PAMELA project." http://gow.epsrc.ac.uk/NGBOViewGrant.aspx?GrantRef=EP/K008730/1. Last accessed: Sep 2014.
[6] "Project Tango." https://www.google.com/atap/projecttango/. Last accessed: May 2014.
[7] S. Izadi et al., "KinectFusion: real-time 3D reconstruction and interaction using a moving depth camera," in Proceedings of the 24th Annual ACM Symposium on User Interface Software and Technology, 2011.
[8] "Irida Labs." http://www.iridalabs.gr. Last accessed: Oct 2014.
[9] "IKEA catalogue app." http://www.ikea.com/gb/en/catalogue-2015/index.html. Last accessed: Oct 2014.
[10] "Word Lens." http://questvisual.com/. Last accessed: Oct 2014.
[11] "Movidius." http://www.movidius.com. Last accessed: Oct 2014.
[12] D. Moloney, B. Barry, R. Richmond, C. Brick, D. Donohoe, et al., "Myriad 2: Eye of the Computational Vision Storm." HotChips 26: A Symposium on High Performance Chips, 2014.
[13] K. Cheng and Y. Wang, "Using mobile GPU for general-purpose computing – a case study of face recognition on smartphones," in VLSI Design, Automation and Test (VLSI-DAT), 2011 International Symposium on, pp. 1–4, IEEE, 2011.
[14] G. Wang et al., "Computer Vision Accelerators for Mobile Systems Based on OpenCL GPGPU Co-Processing," J. Signal Process. Syst., vol. 76, no. 3, pp. 283–299, 2014.
[15] "Odroid-XU3." http://www.hardkernel.com/main/products/prdt_info.php?g_code=G140448267127&tab_idx=1. Last accessed: Jan 2015.
[16] Yokogawa, "Yokogawa WT210/WT230 Digital Power Meters." Bulletin 7604-00E: http://tmi.yokogawa.com/files/uploaded/BU7604_00E_050_1.pdf.
[17] E. Padoin et al., "Time-to-solution and energy-to-solution: a comparison between ARM and Xeon," in Applications for Multi-Core Architectures (WAMCA), Third Workshop on, pp. 48–53, IEEE, 2012.