An Independent Evaluation of Implementing Computer Vision Functions with OpenCL on the Adreno 420

By the staff of

Berkeley Design Technology, Inc.

July 2015

OVERVIEW Computer vision algorithms are becoming increasingly important in mobile, embedded, and wearable devices and applications. These compute-intensive workloads are challenging to implement with good performance and power-efficiency. In many applications, implementing critical portions of computer vision workloads on a general-purpose graphics processing unit (GPU) is an attractive solution. Qualcomm enables programming of the Adreno GPU in its Snapdragon application processors via the open standard OpenCL language and API. OpenCL support enables programmers to offload computer vision algorithm kernels to the GPU, which in turn provides speed and power-consumption advantages over a CPU implementation. BDTI developed an Android application demonstrating computer vision functionality utilizing the Adreno 420 GPU in Qualcomm's Snapdragon 805. The BDTI application can run compute-intensive computer vision functions on the GPU or the CPU, enabling comparisons of GPU and CPU performance and power-efficiency. This paper discusses the BDTI application, implementation and optimization techniques used in its development, and the substantial benefits observed when offloading compute-intensive kernels to the GPU.

© 2015 Berkeley Design Technology, Inc. Page 1

Contents

1. Introduction ...... 2
2. Algorithm Overview ...... 2
3. Implementation Overview ...... 4
4. Adreno GPU OpenCL-Accelerated Implementation ...... 5
5. ARM CPU NEON-Accelerated Implementation ...... 8
6. Benefits of OpenCL Acceleration on the Adreno GPU ...... 8
7. Conclusions ...... 9
8. References ...... 9

1. Introduction

Computer vision promises to bring exciting new features and user experiences to mobile devices such as smart phones, tablets, and wearables. Many vision-enabled applications are already commonplace: photo-stitching techniques enable automatic generation of panoramic views from a series of images, feature detection and tracking techniques enable augmented reality experiences, and so on. And computer vision algorithms and techniques are advancing rapidly, promising to deliver many new features and applications. But computer vision algorithms tend to be very compute-intensive, threatening to bog down mobile devices' CPU cores and memory bandwidth, and to drain their batteries.

Mobile application processors increasingly include general-purpose graphics processing units: graphics processors (GPUs) capable of massively-parallel general-purpose computing. These specialized processing engines are often a great fit for computer vision algorithms, which are typically characterized by very high data parallelism. ("Data parallelism" refers to the ability to distribute data among many parallel compute resources, such as those available in a GPU.) To implement computer vision applications, GPUs can be programmed using the open standard OpenCL language and API from the Khronos Group. For example, Qualcomm's Snapdragon 805 application processor includes the Adreno 420 GPU. The Adreno 420 can be programmed with OpenCL to offload vision algorithms from the CPU cores, increasing performance and reducing power consumption. In this paper we explore how BDTI used OpenCL to offload vision algorithms from the CPU to the GPU in a demonstration application, and discuss the improvements in performance and power consumption obtained by using the Qualcomm Adreno GPU in this way.

The BDTI Background Blur OpenCL Android application was designed to run on the Snapdragon 805 MDP tablet reference design from Intrinsyc. The application detects the largest foreground object in the camera's view—usually the user—and displays the video with the foreground object shown unaltered, and the background shown with severe blurring applied. This functionality could be useful, for example, in a video conference call where we may want to obscure confidential information on a whiteboard behind the user, while the image of the user is seen clearly.

The BDTI Background Blur OpenCL application includes two modes of operation: in the default "GPU" mode, the computationally-intensive portions of the algorithm are offloaded to the GPU using OpenCL. For comparison, a "CPU" mode is provided in which the computationally-intensive portions of the algorithm are executed on a single CPU core. This demo therefore clearly illustrates the capabilities of the Qualcomm Adreno GPU for computer vision applications.

2. Algorithm Overview

The BDTI Background Blur OpenCL application employs cutting-edge background subtraction techniques to separate foreground objects from background. The algorithm is illustrated in Figure 1.

The algorithm begins with a background subtraction kernel. This kernel employs a modified version of the background subtraction algorithm described in [1]. The kernel continuously updates a model of the background, and compares each pixel in the current frame to the background model to estimate the background/foreground classification of the pixel. The background model contains multiple samples of pixel intensity and local binary pattern descriptors for each color component at each pixel position. At each pixel position, the kernel searches the samples in the background model for samples that match the color intensities and local binary pattern descriptors for the pixel position in the current frame. If sufficient matches are found, the pixel is estimated to be in the background; otherwise it is estimated as foreground. In addition to the background/foreground segmentation mask, the background subtraction kernel outputs an estimated background image.
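The per-pixel matching test described above can be sketched in C as follows. This is an illustrative approximation for a single channel: the names, the number of samples per pixel, and the match thresholds (NUM_SAMPLES, MIN_MATCHES, INTENSITY_TOL, MAX_LBP_DIFF) are assumptions for exposition, not BDTI's actual code.

```c
#include <stdlib.h>

/* Hypothetical sketch of per-pixel background classification: each pixel
 * position keeps several (intensity, local binary pattern) samples, and a
 * pixel is classified as background when enough samples match the current
 * frame. All constants below are illustrative assumptions. */

#define NUM_SAMPLES    8   /* samples kept per pixel position (assumed)   */
#define MIN_MATCHES    2   /* matches required to call a pixel background */
#define INTENSITY_TOL 20   /* max intensity difference for a match        */
#define MAX_LBP_DIFF   3   /* max differing LBP bits for a match          */

typedef struct {
    unsigned char intensity[NUM_SAMPLES];
    unsigned char lbp[NUM_SAMPLES];       /* 8-bit local binary pattern */
} BgModelPixel;

static int popcount8(unsigned char v) {
    int n = 0;
    while (v) { n += v & 1; v >>= 1; }
    return n;
}

/* Returns 1 if the pixel is estimated to be background, 0 if foreground. */
int classify_pixel(const BgModelPixel *model,
                   unsigned char intensity, unsigned char lbp) {
    int matches = 0;
    for (int s = 0; s < NUM_SAMPLES; s++) {
        int dI = abs((int)model->intensity[s] - (int)intensity);
        int dL = popcount8((unsigned char)(model->lbp[s] ^ lbp));
        if (dI <= INTENSITY_TOL && dL <= MAX_LBP_DIFF)
            matches++;
    }
    return matches >= MIN_MATCHES;
}
```

In the actual application this test runs once per color component per pixel position, with the model itself also being randomly updated over time.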


[Figure 1: algorithm flowchart — Input Camera Frame → Background Subtraction (producing Est. Background Frame and mask) → Refinement Preprocess → Refinement Pass #1 → Refinement Pass #2 → Refined Mask → Contour Processing → Blend Mask, with a Blur step producing the Blurred Frame; the blend mask, input frame, and blurred frame feed Render to Display.]

Figure 1 BDTI Background Blur Algorithm

When areas of the foreground closely match the background in both color and intensity, the background subtraction kernel can sometimes misclassify an entire region of foreground as background. To greatly reduce this artifact, the application refines the estimated foreground mask using a proprietary method developed by BDTI and inspired by k-means clustering. The refinement compares pixels that are marked as background in the estimated mask to nearby pixels marked as foreground in the estimated mask; it marks a background pixel as foreground if it is more similar to nearby foreground pixels than it is to the estimated background image. This refinement procedure fills in portions of the foreground mask where the foreground pixels are too close to the background model to be correctly marked as foreground by the background subtraction kernel. The refinement method is performed in three stages:

Preprocessing: in this stage the estimated foreground mask is eroded with a 3×3 structuring element to reduce false positives, and a dynamic threshold is computed for each pixel position. The dynamic threshold is used in the following stages.

First-pass refinement search: in this stage, for each pixel marked as background in the estimated mask, the algorithm searches for nearby foreground pixel matches, and marks the background pixel as foreground if enough matches are found.

Second-pass refinement search: this stage is identical to the previous stage, further refining the estimated foreground mask output by the previous refinement pass.

Finally, the algorithm applies morphological operations, finds the contours of foreground objects, selects the largest contour, and removes all other contours from the foreground mask.
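Since BDTI's refinement method is proprietary, the sketch below only illustrates the general rule stated above: a background-marked pixel is re-marked as foreground when it is more similar to enough nearby foreground pixels than to the estimated background image at its own position. The window size, match count, and similarity measure are all assumptions.

```c
#include <stdlib.h>

/* Illustrative sketch of the first-pass refinement rule (assumed details;
 * the actual method is proprietary). frame, bg_est, and mask are w x h
 * single-channel images; mask is 1 for foreground, 0 for background. */

#define WIN            2   /* search +/-2 pixels around the position (assumed) */
#define MIN_FG_MATCHES 3   /* nearby-foreground matches required (assumed)     */

int refine_pixel(const unsigned char *frame, const unsigned char *bg_est,
                 const unsigned char *mask, int w, int h, int x, int y)
{
    if (mask[y * w + x]) return 1;            /* already foreground */
    int center  = frame[y * w + x];
    int bg_dist = abs(center - (int)bg_est[y * w + x]);
    int matches = 0;
    for (int dy = -WIN; dy <= WIN; dy++) {
        for (int dx = -WIN; dx <= WIN; dx++) {
            int nx = x + dx, ny = y + dy;
            if (nx < 0 || ny < 0 || nx >= w || ny >= h) continue;
            if (!mask[ny * w + nx]) continue; /* compare only to foreground */
            if (abs(center - (int)frame[ny * w + nx]) < bg_dist)
                matches++;                    /* closer to a foreground pixel */
        }
    }
    return matches >= MIN_FG_MATCHES;         /* enough matches: foreground */
}
```

In the application, the comparison threshold is the per-pixel dynamic threshold computed in the preprocessing stage rather than a raw distance as shown here.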


The resulting mask is used to blend the original video frame with a blurred version of the same video frame. The final refined and contour-processed mask is also passed to the background subtraction kernel along with the next video frame, where the processed mask is used to control thresholds and background model updates.

To reduce computational requirements, the background subtraction and refinement kernels downsample the input frame by a factor of two horizontally, and by a factor of four vertically. This downsampling is performed by simply accessing a subset of the input pixels—a downsampled image is never physically generated in memory. Due to this downsampling of the input, the blend mask is generated at the downsampled resolution. The blend mask is passed to OpenGL-ES as a texture, and is automatically upsampled by the GPU to the original frame size during rendering. Because the original input image is never physically downsampled, the sharpness of the rendered foreground pixels is not impacted. Therefore, downsampling dramatically reduces computational demand with negligible impact on output quality.

3. Implementation Overview

The BDTI Background Blur OpenCL demo application is architected to realistically portray the advantages of the Adreno GPU in vision-enabled applications. The background subtraction and refinement kernels are very computationally intensive and comprise the bulk of the processing in the application. Therefore, optimizations focus on these kernels. To ensure representative performance in both the default GPU mode and in the CPU mode, thorough and reasonable optimizations are employed in both modes. Optimizations of the application's software architecture are discussed in this section, and optimizations of the GPU and CPU implementations of the kernels are discussed in Section 4 and Section 5, respectively.

When offloading computation from a CPU to a GPU, a key consideration on most hardware platforms is the need to copy data between CPU and GPU memory spaces. Memory copies introduce latency and consume power, especially for large buffers such as video frames. The BDTI Background Blur OpenCL demo application is therefore designed to minimize the need to move data between CPU and GPU memory spaces.

The input video frame is needed by the GPU in both the CPU and GPU modes of operation, since the GPU renders the output in both modes. The input video is also needed by the CPU in CPU mode, but isn't needed by the CPU when operating in GPU mode. Video frames are fetched from the camera directly into GPU memory space, eliminating the need to copy input frames in GPU mode.

Figure 2 depicts the partitioning of the algorithm among processors and APIs on the Snapdragon application processor. The background subtraction and refinement kernels are implemented in OpenCL in the GPU mode, and in C using ARM NEON compiler intrinsics in the CPU mode.

The contour processing portion of the algorithm is implemented on the CPU using Qualcomm's FastCV library. Contour tracing kernels are difficult to parallelize efficiently, and therefore are not an attractive target for GPU optimization. Qualcomm's FastCV library includes a highly optimized CPU implementation of contour tracing. In GPU mode, the refined foreground mask must be copied from GPU memory to CPU memory for contour processing. In both modes, after contour processing the resulting blend mask must be copied to GPU memory for rendering. Because the subsampled masks are only a fraction of the size of a video frame, these copies have a relatively small impact on speed and power consumption.

In the GPU mode of the application, the CPU can perform contour processing in parallel with the background subtraction and refinement kernels running on the GPU. To enable this parallel operation of the CPU and GPU, the application is pipelined as illustrated in Figure 3. For each frame, background subtraction and refinement consume the frame directly from the camera, while the contour processing and rendering to the display consume a mask, an input frame, and a blurred frame from the previous frame period.

The application uses OpenGL-ES to render output to the display. In addition, the application uses OpenGL-ES to render the input video frame to two textures. One is a low-resolution texture, which results in blurring of the background when this texture is interpolated back to full resolution during rendering to the display. The other texture is a full-resolution texture, which implements a one-frame delay needed due to the software pipelining of the application.
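The index arithmetic behind the "virtual" 2×-horizontal, 4×-vertical downsampling described above can be sketched as a simple strided read of the full-resolution frame; no downsampled copy is ever materialized. The helper name and the single-plane layout are illustrative assumptions.

```c
/* Sketch of virtual downsampling: the kernels read every 2nd pixel
 * horizontally and every 4th row vertically straight from the
 * full-resolution frame, so no downsampled image exists in memory. */

static inline unsigned char sample_downsampled(
    const unsigned char *frame,  /* full-resolution W x H plane */
    int full_width,
    int dx, int dy)              /* coordinates in the downsampled grid */
{
    /* (dx, dy) in the (W/2) x (H/4) grid maps to (2*dx, 4*dy) in the frame */
    return frame[(4 * dy) * full_width + (2 * dx)];
}
```

Because only the mask is produced at this reduced resolution, the full-resolution frame remains available for sharp rendering of the foreground.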


[Figure 2: partitioning diagram — the Camera delivers the Input Frame; Background Subtraction (with Est. BG Frame), Refinement Preprocess, Refinement Pass #1, and Refinement Pass #2 run on GPU/OpenCL or CPU/NEON, producing the Refined Mask; Contour Processing runs on CPU/Qualcomm FastCV, producing the Blend Mask; Render to Texture (Delayed Frame, Blurred Frame) and Render to Display run on GPU/OpenGL-ES.]

Figure 2 BDTI Background Blur implementation partitioning

4. Adreno GPU OpenCL-Accelerated Implementation

In OpenCL, data-parallel algorithm kernels are broken down into a large number of very small "work items." A work item typically represents the set of operations for processing a single pixel or small group of pixels. The implementation is designed to minimize—and hopefully eliminate—any dependencies between work items so that work items can execute in parallel. Programmers generally think of work items as independent parallel threads, and GPGPUs typically execute many work items in parallel. Spinning off hundreds or in some cases even thousands of work items enables the GPU to hide memory latency.

OpenCL also provides Single Instruction Multiple Data (SIMD) capabilities, with explicit support for two-, four-, eight-, and sixteen-element vectors. This enables programmers to take advantage of additional data parallelism within a work item.

In the GPU mode of the BDTI Background Blur OpenCL application, background subtraction and refinement are offloaded to the Adreno GPU via OpenCL. The OpenCL code comprises three kernels: background subtraction with local binary patterns, refinement pre-process, and refinement search. The refinement search kernel is executed twice per frame. All of the OpenCL kernels are carefully refactored¹ and SIMD-optimized to expose the inherent parallelism of the algorithms and efficiently utilize the resources of the Adreno 420 GPU architecture. All three kernels operate on one pixel position per work-item.

¹ Code refactoring is the process of restructuring existing code without changing its external behavior.
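The work-item model described above can be illustrated with a plain C emulation: the "kernel" body is written for one pixel, and a host loop plays the role of the GPU launching one work item per global ID. The inversion workload and all names here are illustrative, not taken from the application.

```c
/* Conceptual sketch of the OpenCL work-item model (plain C emulation).
 * The per-pixel function below is written exactly as an OpenCL work-item
 * would be; on a GPU the iterations of the loop in run_ndrange would
 * execute in parallel, which is why the body must not depend on the
 * results of other work items. */

typedef struct { int x, y; } GlobalId;

/* Per-pixel "work item": invert one pixel (illustrative workload). */
static void invert_work_item(const unsigned char *in, unsigned char *out,
                             int width, GlobalId id)
{
    out[id.y * width + id.x] = (unsigned char)(255 - in[id.y * width + id.x]);
}

/* Host-side stand-in for enqueuing a 2-D NDRange of work items. */
static void run_ndrange(const unsigned char *in, unsigned char *out,
                        int width, int height)
{
    for (int y = 0; y < height; y++)
        for (int x = 0; x < width; x++) {
            GlobalId id = { x, y };
            invert_work_item(in, out, width, id);
        }
}
```

In real OpenCL the loop disappears: the host calls clEnqueueNDRangeKernel with a global size of width × height, and the runtime schedules the work items across the GPU's compute units.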


[Figure 3: pipeline diagram — for frame N and frame N−1, each frame period runs Camera → Input Frame → BG Subtract & Refine → Refined Mask → Contour Processing, while Render to Texture (Delayed Frame, Blurred Frame) and Render to Display consume the previous frame period's results.]

Figure 3 Pipelining of Background Blur Implementation

Additional OpenCL implementation and optimization considerations are discussed below.

Memory Footprint and Local Memory

Data-intensive algorithms usually require efficient use of fast local memories for optimal performance. On a CPU, for example, algorithms can be refactored to optimize utilization of L1 caches. On a GPU, fast local memory must be explicitly managed, or else performance suffers dramatically. A collection of work-items is referred to in OpenCL parlance as a work-group, and each work-group has access to a pool of fast local memory.

To minimize the impact of long DDR access latencies, each work-item copies the state and input data it needs from DDR into variables and arrays residing in local memory. The work-item then operates on this data locally. This idiom is utilized for all three OpenCL kernels in the BDTI Background Blur OpenCL application, with some important differences among the kernels described below.

In the background subtraction kernel (unlike most computer vision kernel functions), each work-item accesses only a minimal amount of data from neighboring pixel positions. Although the background model includes many background samples per pixel position, there is little overlap in state and input data among the kernel's work-items. Therefore, each work-item can copy its state and input data into local variables and arrays without duplicating the copies performed for neighboring pixels, thus avoiding overflowing the local memory and/or causing redundant accesses to slow DDR memory.

The OpenCL code for the background subtraction kernel copies state and data from DDR into local variables, but it does not explicitly declare its local copies of input and state data as residing in local memory. For this kernel it was not necessary to manage local memory more explicitly, probably because most local variables and arrays fit in GPU registers for this kernel, and the minimal overlap with neighboring pixels means that even without more explicit techniques few DDR memory access conflicts occur. This is in contrast to the refinement pre-processing and search kernels, where more explicit memory management is required.

The refinement pre-process and refinement search kernels both process a neighborhood of pixels centered on each pixel position. Therefore, most of the input data for each pixel position overlaps with the input data of neighboring pixel positions in these kernels.
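The cost of that overlap can be estimated with simple arithmetic, assuming an 8×8 work-group and a 3×3 neighborhood (one-pixel halo); the figures are illustrative, since the refinement kernels' exact neighborhood size is not stated here.

```c
/* Back-of-envelope comparison of global-memory (DDR) reads for an 8x8
 * work-group with an assumed 3x3 neighborhood per pixel: naive per-work-item
 * fetching vs. one cooperative tile load with a halo. */

/* Each of the tile's work-items independently fetches its whole
 * neighborhood: tile_w * tile_h * (2r+1)^2 reads. */
static int naive_fetches(int tile_w, int tile_h, int radius) {
    int per_item = (2 * radius + 1) * (2 * radius + 1);
    return tile_w * tile_h * per_item;
}

/* The work-group cooperatively loads the tile plus its halo exactly once:
 * (tile_w + 2r) * (tile_h + 2r) reads shared by all work-items. */
static int cooperative_fetches(int tile_w, int tile_h, int radius) {
    return (tile_w + 2 * radius) * (tile_h + 2 * radius);
}
```

For an 8×8 tile and radius 1 this is 576 naive reads versus 100 cooperative reads, which is why the tiled, barrier-synchronized loading scheme described in the next passage pays off.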


Explicit management of GPU local memory is needed to avoid redundant copies as each work-item copies its input into local variables and arrays.

Work-items in these kernels are grouped into an eight-by-eight tile of pixel positions per work-group. Local arrays are used to store the input data required for an entire eight-by-eight tile, and are shared by all of the work-items in the respective work-group. Each work-item copies a small portion of the input data for the entire work-group into the local array, and the portions fetched by work-items do not overlap. After copying the data, work-items within a work-group synchronize using OpenCL's "barrier" mechanism to ensure that all input data has been loaded before work-items proceed to perform their computations. The work-items thus cooperate to fetch overlapping inputs from slow DDR memory. This implementation technique eliminates redundant copies of data in local memory, reduces redundant accesses to DDR, and helps the GPU hide the latency of slow DDR memory accesses.

Per-Pixel Pseudo-Random Number Generation

To achieve desirable statistical properties, updates of the background model are randomized in the background subtraction kernel. Because a pseudo-random number generator updates its state with each invocation, calling a single random number generator from each work item would create a dependency as all work items attempt to access and update the same state. This dependency would block the work items from executing in parallel. Therefore, each work item includes an independent random number generator with its own state. To ensure that the random number generators for all of the work items are uncorrelated, each random number generator must be randomly seeded at initialization. The OpenCL random number generator code is based on [2].

Conditional Operations and Branches

GPU architectures typically require that many OpenCL work items share a single instruction stream. Multiple execution paths due to data-dependent branches or conditional operations within a work item can therefore reduce performance, oftentimes drastically. OpenCL kernels (and OpenGL shaders) are often refactored to eliminate branches.

The background subtraction kernel includes many branches per pixel. However, this kernel's performance on the Adreno GPU was about twice the performance of the CPU version without requiring refactoring to eliminate the branches. This may be due to very good correlation between the execution paths for the work items at neighboring pixel positions—when a certain branch is taken for one pixel position, it is likely for the same branch to be taken for neighboring pixels. Therefore it is likely that many work items naturally execute the same instruction stream despite the presence of branches. This is not always the case, however, and it may be possible to further improve the performance of this kernel by refactoring to eliminate some of the branches. But in order to eliminate a branch, the refactored kernel must sometimes perform the work of both the branch-taken and branch-not-taken execution paths, thus increasing the computational workload. Optimizing the background subtraction kernel further would therefore require laborious statistical analysis to balance the increase in parallelism gained from eliminating each branch against the resulting increase in computation.

The refinement search kernel is explicitly refactored to eliminate branches. This kernel attempts to match each background pixel against foreground pixels in a neighborhood centered on the background pixel position. The CPU implementation of this kernel includes a branch in the kernel's inner loop: for each pixel position the search is stopped once enough matches are found. On the GPU, however, eliminating branches is more efficient than reducing the workload by terminating the search. Therefore, the OpenCL implementation does not include the stopping condition, and always iterates through the entire inner loop.

Byte-Wide Fixed-Point and Logical Operations

The Adreno 420 GPU includes native support for 16-bit fixed-point data. To support 8-bit data, 16-bit operations are performed by the hardware, with additional operations such as sign extension sometimes added by the compiler in order to guarantee correct functionality. To minimize unnecessary operations, the OpenCL code for all three kernels promotes some 8-bit data to 16-bit fixed-point or 32-bit floating-point data types.
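The per-work-item generator idea described above, based on the MWC64X generator of [2], can be sketched in C as follows. The constant and the multiply-with-carry structure follow the published MWC64X design; treat the details as illustrative rather than as BDTI's exact kernel code.

```c
#include <stdint.h>

/* Sketch of a per-work-item multiply-with-carry generator in the spirit of
 * MWC64X [2]. Each work-item owns its own (state, carry) pair, so no state
 * is shared between work-items and no serializing dependency is created. */

typedef struct { uint32_t x, c; } Mwc64xState;

static uint32_t mwc64x_next(Mwc64xState *s) {
    const uint64_t A = 4294883355u;          /* MWC multiplier from [2] */
    uint32_t result = s->x ^ s->c;           /* output mixes state and carry */
    uint64_t t = A * (uint64_t)s->x + s->c;  /* 64-bit multiply-with-carry step */
    s->x = (uint32_t)t;                      /* low word becomes new state */
    s->c = (uint32_t)(t >> 32);              /* high word becomes new carry */
    return result;
}
```

As the text notes, each work item must be seeded independently at initialization (for example from its pixel coordinates combined with a per-frame random value) so that the per-pixel sequences are uncorrelated.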
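The branch-elimination idea described above for the refinement search inner loop can be illustrated as follows: instead of breaking out of the loop once enough matches are found, the GPU-style version always scans the whole neighborhood and accumulates the comparison result arithmetically. The function names and inputs are illustrative.

```c
/* CPU-style inner loop: the early exit saves work on average but
 * introduces a data-dependent branch. */
static int count_matches_early_exit(const int *diffs, int n,
                                    int thresh, int enough) {
    int matches = 0;
    for (int i = 0; i < n; i++) {
        if (diffs[i] < thresh) matches++;
        if (matches >= enough) return matches;   /* data-dependent branch */
    }
    return matches;
}

/* GPU-style loop: no early exit; the comparison result (0 or 1) is simply
 * added, which compiles to a compare/select rather than a branch, so all
 * work-items execute the same instruction stream. */
static int count_matches_branchless(const int *diffs, int n, int thresh) {
    int matches = 0;
    for (int i = 0; i < n; i++)
        matches += (diffs[i] < thresh);          /* predicate as arithmetic */
    return matches;
}
```

On the GPU the branchless form wins even though it does more arithmetic, because divergent work-items would otherwise serialize.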
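One byte-wide operation used by the background subtraction kernel, described at the start of the next passage, is a population count implemented as a 256-entry lookup table. A host-side C sketch of that table (illustrative names; the kernel's table is precomputed OpenCL-side) looks like this:

```c
/* Sketch of a 256-entry population-count lookup table: built once, then
 * each 8-bit value is counted with a single load instead of a per-bit loop. */

static unsigned char popcount_lut[256];

static void init_popcount_lut(void) {
    for (int v = 0; v < 256; v++) {
        int n = 0;
        for (int b = 0; b < 8; b++)
            n += (v >> b) & 1;                 /* count bits of v */
        popcount_lut[v] = (unsigned char)n;
    }
}

/* Number of local-binary-pattern bits that differ between a frame pixel's
 * descriptor and a background-model sample's descriptor. */
static int lbp_distance(unsigned char a, unsigned char b) {
    return popcount_lut[a ^ b];
}
```

On the CPU side no table is needed, since NEON provides a native byte-wise population-count instruction, as Section 5 notes.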


Additionally, the background subtraction OpenCL kernel uses a 256-entry lookup table to perform a population-count operation (the population-count operation counts the number of bits in the input word that have a value of one).

5. ARM CPU NEON-Accelerated Implementation

In the CPU mode of the BDTI Background Blur OpenCL application, background subtraction and refinement are implemented on the CPU and refactored to efficiently utilize the CPU caches. The refinement pre-process step is split into independent operations: the erosion operation is implemented with a call to Qualcomm's FastCV library, and the threshold computation is interleaved with the refinement search as described below.

The threshold computation and two refinement search passes are pipelined on a scan-line basis, with a five scan-line delay between the first and second refinement search passes. Pipelining these functions greatly improves cache utilization and is paramount to achieving good performance on the CPU.

Background subtraction and all refinement steps are carefully optimized using ARM NEON instructions to perform SIMD-parallelized operations. NEON optimizations make use of NEON's native support for eight-bit data and native population-count instruction.

6. Benefits of OpenCL Acceleration on the Adreno GPU

The BDTI Background Blur OpenCL Android application illustrates the advantages of the Adreno 420 GPU over the ARM CPU for massively parallel algorithms programmed in OpenCL. Comparing the application's behavior in the GPU and CPU modes of operation reveals the performance benefit of the GPU.

The computational workloads of the background subtraction and refinement kernels are data dependent. Furthermore, the computational workload's dependencies on input data vary somewhat between the ARM NEON-optimized code and the OpenCL code. Therefore, precise comparisons of performance of the two modes can be made only for precisely defined operating conditions.

BDTI has not attempted extensive, rigorous performance and power measurements on the application under carefully controlled conditions. Therefore, results measured by BDTI and presented below do not represent a comprehensive range of operating conditions and should be considered a coarse estimate. However, BDTI has observed the performance of both the CPU and GPU modes under conditions that can be considered typical. The typical difference in performance between the GPU and CPU is striking, as discussed below.

GPU vs. CPU Speed Comparison

A comparison of video display frame rates achieved under typical operating conditions in the GPU and CPU modes, respectively, is shown in Table 1. Overall, the GPU mode typically achieves a frame rate nearly two times higher than that of the CPU mode.

Mode   Typical average frame rate   Observed range
GPU    30 fps                       25-33 fps
CPU    16 fps                       14-18 fps

Table 1 Frame rate comparison of GPU and CPU modes

Table 2 shows the approximate time in milliseconds per invocation of the compute-intensive background subtraction and refinement kernels on the GPU and CPU. As described in Section 5 above, the two refinement search passes are tightly interleaved on the CPU, along with part of the refinement pre-processing. Therefore, it is not practical to individually profile these processing steps on the CPU. The background subtraction kernel appears to be slightly more than two times faster on the GPU compared to the CPU, although significant data-dependent timing variations occur on both the CPU and GPU. The three refinement steps combined are likewise roughly twice as fast on the GPU compared to the CPU.

Contour processing requires several additional milliseconds of computation on the CPU. In GPU mode, contour processing still executes on the CPU but occurs in parallel with the OpenCL kernels running on the GPU. However, the application incurs some additional overhead in both modes for rendering, synchronization, and housekeeping, reducing the overall speedup of the application to slightly less than a factor of two.


Mode   Background    Refinement    Refinement   Refinement   Contours
       Subtraction   Pre-process   Pass #1      Pass #2
GPU    14 ms         0.9 ms        5 ms         5 ms         n/a
                     (refinement total: 11 ms)
CPU    30 ms         22 ms (all refinement steps combined)   4-7 ms

Table 2 Kernel duration comparison of GPU and CPU implementations, under typical conditions

Note that the CPU mode of the application uses only one of the Snapdragon processor's ARM cores. Using two CPU cores instead of one, it is possible to achieve a frame rate roughly equivalent to that of the GPU, at the cost of slightly higher code complexity and greatly increased power consumption.

7. Conclusions

As new computer-vision-enabled user experiences emerge in mobile, embedded, and wearable devices, computational demands will continue to rise, while size, cost, and power constraints will become more stringent. In many products, massively-parallel GPGPU implementations of key algorithm kernels will be critical to meeting application requirements. Qualcomm's support for OpenCL on the Adreno GPU makes this possible on Snapdragon application processors.

As illustrated by the BDTI Background Blur OpenCL demo application, offloading compute-intensive kernels to the Adreno 420 GPU can dramatically reduce CPU utilization in a computer-vision-enabled application, freeing CPU resources to tackle additional applications and features. Additionally, offloading compute-intensive tasks from the CPU can dramatically improve power consumption. Because of their specialized massively-parallel architectures and lower clock rates, GPUs tend to be more power-efficient than CPUs. Although BDTI did not independently measure the power consumption of the BDTI Background Blur OpenCL demo application, Qualcomm has reported that the GPU mode of the demo consumes half as much power as the CPU mode when throttling the frame rate of the GPU mode to match the highest frame rate achieved in the CPU mode.

However, effective GPU programming and code optimization can be tricky. Algorithm implementations must be refactored to maximize parallelism and to conform to the memory system and core architectures of the GPU, as exemplified by the considerations discussed in this paper:

• The application must be architected to minimize memory copies between GPU and CPU memory spaces.
• GPU code must carefully manage limited fast local memory.
• Programmers must be aware of GPU core architectural characteristics, even when programming in a high-level language such as OpenCL. For example, code must minimize the use of branches and take care to utilize the most appropriate SIMD data types.

When implemented with best practices, computer-vision functions run efficiently on the GPU. Qualcomm's Adreno GPU with support for OpenCL will enable vision functions in a wide range of mobile devices and applications.

8. References

[1] P.-L. St-Charles and G.-A. Bilodeau. Improving background subtraction using local binary similarity patterns. In Applications of Computer Vision (WACV), 2014 IEEE Winter Conference on, 2014.

[2] David B. Thomas. The MWC64X Random Number Generator. Retrieved from http://cas.ee.ic.ac.uk/people/dt10/research/rngs-gpu-mwc64x.htm
