DEGREE PROJECT IN COMPUTER SCIENCE AND ENGINEERING, SECOND CYCLE, 30 CREDITS STOCKHOLM, SWEDEN 2017

ART vs. NDK vs. GPU acceleration: A study of performance of image processing algorithms on Android

ANDREAS PÅLSSON

KTH ROYAL INSTITUTE OF TECHNOLOGY
SCHOOL OF COMPUTER SCIENCE AND COMMUNICATION

Master in Computer Science
Date: June 26, 2017
Supervisor: Cyrille Artho
Examiner: Johan Håstad
Swedish title: ART, NDK eller GPU acceleration: En prestandastudie av bildbehandlingsalgoritmer på Android
School of Computer Science and Communication


Abstract

The Android ecosystem contains three major platforms for execution, each suitable for different purposes. Android applications are normally written in the Java programming language, but computationally intensive parts of Android applications can be sped up by using a native language or by utilising the parallel architecture found in graphics processing units (GPUs). The experiments conducted in this thesis measure the performance benefits gained by switching from Java to C++ or RenderScript, Android’s GPU acceleration framework.

The experiments consist of common tasks in image processing. For some of these tasks, optimized libraries and implementations already exist. The performance of the implementations provided by third parties is compared to that of our own.

Our results show that for advanced image processing on large images, the benefits are large enough to warrant C++ or RenderScript usage instead of Java on modern smartphones. However, if the image processing is conducted on very small images (e.g. thumbnails) or the image processing task contains few calculations, moving to a native language or RenderScript is not worth the added development time and static code complexity.

RenderScript is the best choice if the GPU vendors provide an optimized implementation of the processing task. If no such implementation is provided, both C++ and RenderScript are viable choices. If full precision is required in the floating-point arithmetic, a C++ implementation is recommended. If it is possible to achieve the desired effect without compliance with the IEEE floating-point arithmetic standard, RenderScript provides better run time performance.

Summary (Sammanfattning)

The Android ecosystem contains three execution platforms suited to different purposes. Android applications are usually written in the Java programming language, but computationally intensive parts of an Android application can be sped up by using a statically compiled language or by exploiting the parallel architecture found in graphics processors. The experiments carried out in this project aim to measure the performance improvements that can be achieved by switching from Java to C++ or RenderScript, Android’s graphics acceleration framework.

The experiments consist of frequently used image processing algorithms. For some of these, optimized libraries and other ready-made implementations exist. The performance of the third-party libraries is compared with that of our implementations.

Our results show that for advanced image processing, the performance improvements are large enough to justify using C++ or RenderScript instead of Java on modern smartphones. When the image processing is done on very small images (for example thumbnails) or involves few calculations, switching from Java to RenderScript or C++ is not worth the extra development time and the static code complexity.

RenderScript is the best choice when the graphics processor vendors provide implementations of the algorithm to be run. If no such implementation exists, both C++ and RenderScript are applicable choices. If exact precision is required, a C++ implementation is recommended. If, on the other hand, full precision is not needed in floating-point calculations, RenderScript is recommended instead.

Contents

1 Introduction
    1.1 Problem
    1.2 Research Question
    1.3 Scope
    1.4 Ethics and sustainability
    1.5 Structure

2 Background
    2.1 Native and interpreted languages
        2.1.1 Java
        2.1.2 C++
        2.1.3 Performance
    2.2 Android
    2.3 Android application compilation
    2.4 Dalvik
    2.5 Android Runtime
        2.5.1 Ahead-of-time (AOT) compilation
        2.5.2 Improved garbage collection
        2.5.3 Just-in-time (JIT) compilation
    2.6 Android Native Development Kit
    2.7 RenderScript
        2.7.1 Compilation and deployment
        2.7.2 Floating point precision
    2.8 Image Processing
        2.8.1 Image smoothing
        2.8.2 Grayscaling
        2.8.3 Thresholding
    2.9 Color spaces

3 Related Work
    3.1 Java and C++ benchmarks
    3.2 Dalvik vs ART
    3.3 Using GPU for calculations
        3.3.1 RenderScript
        3.3.2 OpenCL

4 Method
    4.1 Choice of method and algorithms
    4.2 Development environment and devices
    4.3 Implementation
        4.3.1 Color space conversion
        4.3.2 Blurring
        4.3.3 Grayscaling and thresholding
    4.4 Measuring Runtime Performance
        4.4.1 Image processing
        4.4.2 Setup
    4.5 Verifying results

5 Results
    5.1 Color space conversion
    5.2 Blurring
        5.2.1 Box filter
        5.2.2 Median filter
        5.2.3 Gaussian filter
    5.3 Grayscaling
    5.4 Thresholding

6 Discussion
    6.1 Color space conversion
    6.2 Blurring
    6.3 Grayscaling and thresholding
    6.4 Overall Performance
    6.5 Threats to validity
        6.5.1 Choice of algorithms
        6.5.2 High variance
        6.5.3 Devices
        6.5.4 Image sizes
        6.5.5 Optimization
    6.6 Future Research

7 Conclusion

Bibliography

A Tables
    A.1 Blurring
        A.1.1 Box filter
        A.1.2 Median filter
        A.1.3 Gaussian filter
    A.2 Grayscaling
    A.3 Thresholding

Chapter 1

Introduction

The first version of the mobile operating system Android was released in fall 2008. It is, as of January 2017, the most widely used operating system [14]. It is used all over the world, on devices and networks of varying quality. For these reasons, it is important for mobile application developers to be able to develop high quality applications that work well on low-end devices in third world countries.

Android application developers can choose to write the business logic of an application in Java or in a native language (a source language that is compiled directly to machine code). Google recommends the use of Java [8]. However, for computationally intensive tasks it can be advantageous to use a native language, which is generally faster than Java [8], so as not to impede the user experience.

Moreover, a developer can use GPU (graphics processing unit) accelerated computing to utilize the full capabilities of the device. This means offloading compute-intensive portions of code to the device’s GPU, while the remainder of the code stays on the CPU (central processing unit). This allows the device to take advantage of the massively parallel architecture of the GPU.

Code written to run on a GPU does not have to be custom-written for each different type of GPU, but can be compiled from a higher-level language. This means that GPU acceleration is more readily available to developers today than it traditionally has been.


As today’s users of technological products see more and more virtual and augmented reality products, it is of utmost importance to keep the experience as smooth as possible. Many new technologies offer a more visual experience than before, which further increases the need for performance, since graphics processing requires large amounts of heavy calculations.

1.1 Problem

Java is the recommended programming language for building Android applications. However, the Java programming language contains features designed to improve safety and convenience at the expense of performance, e.g., automatic memory management.

Therefore, Google suggests that it might be useful for a developer to use a native language over Java in two cases [8]:

• Squeeze extra performance out of a device to achieve low latency or run computationally intensive applications, such as games or physics simulations.

• Reuse your own or other developers’ C or C++ libraries.

This thesis intends to examine the first bullet point and investigate how large the performance benefits can be when conducting real time image processing. Furthermore, the usage of GPU acceleration can provide greater performance improvements due to increased levels of parallelization in the hardware. The problem is that with increased performance from the Android system, it is hard to know whether the performance benefit of using a different language than Java is worth the extra complexity needed to add another programming language to a software project.

Image processing contains many computationally intensive processes and is therefore a candidate where it might be useful to switch to a native language or a framework that allows use of GPU acceleration.

1.2 Research Question

The question this thesis intends to answer is the following:

Can performance increases in run time warrant the usage of C++ or GPU acceleration frameworks over Java when writing image processing algorithms on Android?

1.3 Scope

The reason a developer might not want to choose a native language or a GPU acceleration framework over Java despite performance benefits is likely that the added performance improvements do not outweigh the complexity added to the software project. This project does not intend to extensively measure the code development complexity added by using these components in an Android project.

1.4 Ethics and sustainability

The work presented in this thesis aims to be as ethical as possible, in the sense that all results presented in this thesis are reproducible from the description in chapter 4.

Regarding sustainability, there are three pillars of sustainability: social, economic and environmental. The work presented in this thesis only touches the last two pillars through its performance analysis, since it does not affect social sustainability. Achieving higher performance in a mobile device might lead to lower battery usage and fewer clock cycles used, saving energy and therefore leading to greater environmental and economic sustainability.

1.5 Structure

This report consists of seven chapters. Chapter 2 contains technical information needed to understand the project. Chapter 3 contains previous research conducted in the area. Chapter 4 outlines the experiments conducted in this project. Chapter 5 contains the results from the aforementioned experiments. Chapter 6 contains discussions regarding the results and their possible applications, as well as possible extensions to the research. Chapter 7 contains our conclusion and final answer to the research question.

Chapter 2

Background

This chapter contains the background information needed to understand the rest of this thesis.

2.1 Native and interpreted languages

A fundamental difference between interpreted and native languages is that a native language is compiled to instructions that can be executed directly by the processor. An overview of their compilation processes and differences is presented in this section.

2.1.1 Java

Java is an interpreted, object-oriented programming language, originally developed by Sun Microsystems and now maintained by Oracle. Java source code is compiled to bytecode that runs on the Java Virtual Machine (JVM). It is commonly associated with the slogan "Write once, run anywhere", since compiled Java code can run on any platform with a JVM without being recompiled for each architecture. The JVM takes care of translating the bytecode to instructions that the host CPU can understand.


[Figure 2.1: Steps in compilation to Java bytecode. *.java files → Java compiler → .class files containing Java bytecode]

Figure 2.1 shows the process of compiling Java source code to its corresponding bytecode that will be processed by the JVM.

2.1.2 C++

C++ is a native language, and can skip the translation step required in the JVM, as it is compiled directly to native processor instructions. This also means that it is not architecture independent, and the code has to be compiled for each architecture it is supposed to run on.

[Figure 2.2: Steps in compiling native language source files to an executable. C++-files → compiler → assembler → linker]

The native-language source files are compiled to assembly code by the compiler. The code generated by the compiler is then assembled into object code for the platform. This object code is then linked together with library dependencies and other code needed to produce the actual executable, as shown in Figure 2.2.

2.1.3 Performance

The runtime performance achieved using C++ can be higher than that of Java running on a JVM. C++ lacks automatic garbage collection, a feature that gives Java developers convenience at the cost of performance. Another reason that Java performance is penalized is that it does not let developers allocate objects on the stack. Allocating and accessing memory on the heap is more costly, creating overhead for Java implementations. In C++ the stack is freely available to developers, making memory access and allocation faster.

2.2 Android

Android is an operating system developed by Google, designed primarily for use on mobile smartphones and tablets. It is based on the Linux kernel. The Android operating system has also been customized to run on smart watches, TVs and in cars. It is the most widely used mobile operating system with a market share of 88% [4].

On top of the Linux kernel at the root of the Android architecture there are native libraries and middleware, for example WebKit and OpenGL. On top of the native libraries lies the application framework. The application framework provides APIs (Application Programming Interfaces) for developers to use when building Android applications. Applications that can be found on Google Play are written on top of all this, in the application layer, as shown in Figure 2.3.

The runtime, responsible for running applications on the smartphone, lies between the native libraries and the application framework.

[Figure 2.3: The Android software stack. From top to bottom: Applications (Phone, Contacts, ...); Application Framework (managers for activities, packages, ...); Libraries (SQLite, OpenGL) alongside the Runtime (Dalvik/ART); Linux Kernel (Display, WiFi, Camera, ...)]

2.3 Android application compilation

Compiling for the Android platform requires adding extra steps to the Java compilation process described above. The first compilation step produces standard JVM bytecode (.class-files) from the source code. This bytecode is not compatible with Android devices, since Google developed the Dalvik Virtual Machine (DVM), which uses a different bytecode format. The next step in compilation takes any .jar-libraries and the .class-files and converts them to DVM bytecode.

[Figure 2.4: Converting the bytecode to DVM format. .class-files and .jar-libraries → Dalvik converter → .dex-file]

The resulting DVM bytecode is contained in a single .dex-file, as shown in Figure 2.4. This file is packaged together with any application resources (e.g., layouts, images) into an Android Package file (.apk-file). The package can then be deployed and installed to devices running the Android operating system.

When the application is running, the bytecode is interpreted by the host device and then passed to the CPU for execution. In the 2.2 release of Android a Just In Time (JIT)-compiler was added to the DVM. This meant that code that was run often could be compiled to native code and the DVM could achieve higher performance because it could effectively skip the interpretation step.

2.4 Dalvik

Dalvik was the standard virtual machine (VM) on Android devices running Android versions 4.4 and earlier. It is different from Oracle’s standard JVM in certain aspects. The Dalvik VM was constructed for mobile devices limited in memory and storage space. It uses a register-based architecture, as opposed to the regular JVM’s stack-based architecture, and therefore requires fewer virtual machine instructions. The uncompressed .dex-files used by the Dalvik VM are often a few percent smaller than a compressed Java archive, making it more suitable for the limited storage on Android devices.

Dalvik has had trace-based just-in-time (JIT) compilation since the release of Android 2.2. The JIT compilation allows Dalvik to compile frequently executed code ("traces") to native code. Even though Dalvik interprets the remaining bytecode, this dynamic compilation provided significant performance improvements [3].

There are also drawbacks to using a VM with a JIT-compiler such as Dalvik, as opposed to using a native language. Running the optimization alongside the program imposes time constraints, lowering the degree of optimization that is possible compared to static compilers.

2.5 Android Runtime

Android Runtime (ART) is Dalvik’s successor and is the standard runtime for Android applications and certain system services. ART was, like its predecessor Dalvik, created specifically for the Android operating system and was optimized for devices with a limited amount of memory and storage space. ART implemented a number of features to improve performance in the Android system.

2.5.1 Ahead-of-time (AOT) compilation

As opposed to only using JIT to compile certain parts of bytecode to native code, ART compiles applications to native code at install time. By eliminating the interpretation and JIT-compilation of Dalvik, run time performance and battery consumption were improved [12].

It is important to note that there are certain optimizations that are possible in JIT-compilation that AOT-compilation cannot offer. Static analysis is very difficult in the general case, and therefore optimization done at install time is difficult. The JIT compiler does not have this problem, as it does not have to statically analyze the code; it is observable at runtime.

JIT compilers, however, have a bigger problem with resource consumption. An AOT compiler can take more time without worrying about stealing resources from the program at hand, whereas a JIT compiler must not slow down the application it is optimizing.

[Figure 2.5: Steps in conducting ahead-of-time compilation. Build process: .dex-file and resources → zip → .apk-file. Installation on smartphone: .apk-file → resources and .dex-file → .elf-file]

In Figure 2.5 the process of AOT-compilation is outlined. An .apk-file is created from the .dex-file and resources. When the package is installed, it is unpacked and processed by a tool called dex2oat, which compiles the .dex-file to native code and produces an .elf-file. An .elf-file is an executable file that can be executed natively by the processor instead of relying on a virtual machine interpreting bytecode.

2.5.2 Improved garbage collection

Garbage collection (GC) is the process of reclaiming system memory occupied by objects that are no longer in use by the program. Poor use of objects in an application, making the GC do a lot of work, impairs the application’s performance, resulting in choppy display and poor responsiveness.

The garbage collector in Dalvik is invoked if any of these conditions are true:

• An OutOfMemoryError is about to be triggered,

• The heap size hits a limit,

• A GC was explicitly requested.

The typical garbage collection is triggered by the allocation limit being reached. The actual collection is done using a Mark-Sweep algorithm. The algorithm consists of two phases: mark and sweep. In the first phase it finds and marks all accessible objects. In the second phase it scans through the heap and reclaims all objects that have not been marked. Both of these phases halt the execution of the program.
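The two phases described above can be illustrated with a toy model. This is a minimal sketch, not Dalvik's implementation: the `Obj` and `MarkSweep` classes, the list-based heap and the explicit root list are all assumptions introduced for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Toy heap object (hypothetical): it may hold references to other objects.
final class Obj {
    final String name;
    final List<Obj> refs = new ArrayList<>();
    boolean marked = false;

    Obj(String name) { this.name = name; }
}

final class MarkSweep {
    // Mark phase: flag every object reachable from the given object.
    // The marked check also stops the recursion on reference cycles.
    static void mark(Obj o) {
        if (o == null || o.marked) return;
        o.marked = true;
        for (Obj child : o.refs) mark(child);
    }

    // Sweep phase: remove all unmarked objects from the heap and
    // clear the marks on the survivors. Returns the number reclaimed.
    static int sweep(List<Obj> heap) {
        int before = heap.size();
        heap.removeIf(o -> !o.marked);
        for (Obj o : heap) o.marked = false;
        return before - heap.size();
    }

    // One full collection: mark from all roots, then sweep the heap.
    static int collect(List<Obj> heap, List<Obj> roots) {
        for (Obj r : roots) mark(r);
        return sweep(heap);
    }
}
```

With a heap holding objects a, b and c, where a is a root that references b, a call to `MarkSweep.collect` reclaims only c.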

As opposed to the two pauses in Dalvik, the ART GC only pauses once. The mark phase in ART’s GC is done concurrently by letting threads mark their own objects [2].

2.5.3 Just-in-time (JIT) compilation

In Android 7.0 Google added a JIT compiler to complement ART’s AOT compilation process. The JIT compiler can do runtime optimizations in order to improve run time performance. ART utilizes profile-guided optimization, which allows it to use a profiler to precompile and cache select methods. This feature further reduces applications’ memory usage.

2.6 Android Native Development Kit

The Android Native Development Kit (NDK) is a set of tools allowing developers to write parts of their Android application in native languages such as C or C++. It can be of use in order to achieve low latencies or run computationally intense code. Furthermore, it enables reusing previously written C/C++ code.

The NDK is used to compile C/C++ code into a native library and package it into the application package. Java code can then use the Java Native Interface (JNI) to call functions in the native library. It is worth noting that crossing this Java-Native boundary might incur performance degradation, as compared to calling a Java method [11].

Native code is platform specific, so for the native code to work on all devices the code must be compiled for every supported device architecture (e.g., ARM, x86).

2.7 RenderScript

RenderScript (RS) is a framework in the Android ecosystem for running computationally intensive tasks at high performance using heterogeneous computing, and is primarily oriented towards data-parallel computation. The RenderScript runtime on Android parallelizes work across the multi-core GPUs and CPUs available on a device.

The reason that using the GPU for compute-intensive functions can be beneficial is that the architecture differs from that of the CPU. A CPU consists of cores optimized for sequential processing, whereas a GPU has a parallel architecture consisting of many smaller cores, optimized for handling multiple tasks simultaneously.

RenderScript itself is a C99-derived language, and code written in it is compiled on devices at runtime to allow platform independence. The performance gain, compared to Java, comes from executing native code on the device. As opposed to the NDK, RenderScript is cross-platform. The RenderScript code is compiled to a device-agnostic intermediate state before being packaged in the application package. The scripts are compiled to machine code and optimized on the device when the application is run. The device decides at runtime whether the computation should be run on the CPU or GPU.

2.7.1 Compilation and deployment

Compilation and deployment of RenderScript code contains 3 steps:

• Offline compiler

• Online JIT compiler

• RenderScript Runtime

The offline compiler converts RenderScript .rs-files to portable bitcode and reflected Java-files. The JIT compiler translates the portable bitcode output by the offline compiler to machine code appropriate for the processor the code is running on (e.g., CPU, GPU). The RenderScript runtime manages memory allocation, provides implementation of libraries (e.g., math, time, drawing) and manages RenderScript objects created from Android Runtime or Dalvik.

Below is a RenderScript implementation that changes saturation of a bitmap:

    const static float3 gMonoMult = {0.299f, 0.587f, 0.114f};
    float saturationValue = 0.f;

    uchar4 __attribute__((kernel)) saturation(uchar4 in) {
        float4 f4 = rsUnpackColor8888(in);
        float3 result = dot(f4.rgb, gMonoMult);
        result = mix(result, f4.rgb, saturationValue);

        return rsPackColorTo8888(result);
    }

The corresponding usage of the above RenderScript code from Java can look like the following:

    Bitmap outputBitmap = ..
    Bitmap inputBitmap = ..

    // Initialize the RenderScript context
    RenderScript rs = RenderScript.create(mContext);
    // Create the specific script from the bitcode
    ScriptC_process script = new ScriptC_process(rs);

    // Create allocations (the memory abstraction in RenderScript)
    // that correspond to inputBitmap and outputBitmap
    Allocation allocationIn = Allocation.createFromBitmap(rs, inputBitmap);
    Allocation allocationOut = Allocation.createFromBitmap(rs, outputBitmap);

    script.set_saturationValue(1.0f);
    script.forEach_saturation(allocationIn, allocationOut);
    // Copy the result back into outputBitmap
    allocationOut.copyTo(outputBitmap);
    rs.finish();

The code above creates objects of type Allocation, which is the primary means of passing data to RenderScript code. The forEach_saturation call passes each element of allocationIn as the parameter in to the saturation function in the RenderScript code and stores the returned result in allocationOut, which is then copied back to outputBitmap.

2.7.2 Floating point precision

Developers can control the required level of precision in RenderScript if full compliance with the IEEE 754-2008 [15] standard is not required. A developer can lower the required precision in order to improve performance and allow for additional optimizations on certain architectures. RenderScript implementations that do not require IEEE 754-2008 compliance will be referred to as Relaxed RenderScript implementations.

2.8 Image Processing

2.8.1 Image smoothing

Image processing is the use of algorithms to process images. This can be done in order to improve clarity, remove noise or compress images to optimize them for network communication. Image processing often involves image smoothing as one of its steps. Image smoothing is the process of blurring an image in order to remove noise. The intent is to capture important patterns in the data while removing rarely occurring phenomena. It is often used before conducting further processing, e.g., face or edge detection.

Image smoothing functions are often linear, where each pixel’s output value is a weighted sum of nearby input pixels:

$$g(i, j) = \sum_{k,l} f(i + k, j + l)\, h(k, l)$$

where h(k, l) is the kernel of the algorithm. The kernel contains the relative weights of each pixel.

Box Filter

Box filter [29] is an image smoothing algorithm that achieves a blurring effect by replacing a pixel with the average of itself and its surrounding pixels. This means that the new value can be one that was not in the image before. Calculating the average of the pixels is also known as applying a box filter. A 3 × 3 kernel for a box filter looks like the following:

$$\begin{bmatrix} 1 & 1 & 1 \\ 1 & 1 & 1 \\ 1 & 1 & 1 \end{bmatrix}$$

meaning that the relative weight of all pixels is the same. Applying a box filter to an 1800 × 1018 image yields the result shown in Figure 2.6.

Figure 2.6: Averaging all pixels to achieve image smoothing, with a 5×5 kernel
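The linear filter formula and the box kernel above can be sketched in plain Java as follows. This is an illustrative sketch rather than the implementation measured in this thesis: the `LinearFilter` class, the single-channel `float[][]` image representation and the border handling (coordinates clamped to the image edge) are assumptions. Because the kernel holds relative weights, the weighted sum is divided by the total kernel weight.

```java
final class LinearFilter {
    // Computes g(i, j) = sum over (k, l) of f(i + k, j + l) * h(k, l),
    // divided by the sum of the kernel weights.
    // The kernel is assumed to be square with an odd side length.
    static float[][] apply(float[][] f, float[][] h) {
        int rows = f.length;
        int cols = f[0].length;
        int r = h.length / 2; // kernel radius

        float weightSum = 0f;
        for (float[] row : h) {
            for (float w : row) weightSum += w;
        }

        float[][] g = new float[rows][cols];
        for (int i = 0; i < rows; i++) {
            for (int j = 0; j < cols; j++) {
                float acc = 0f;
                for (int k = -r; k <= r; k++) {
                    for (int l = -r; l <= r; l++) {
                        // Assumption: clamp out-of-range coordinates to the border.
                        int y = Math.min(rows - 1, Math.max(0, i + k));
                        int x = Math.min(cols - 1, Math.max(0, j + l));
                        acc += f[y][x] * h[k + r][l + r];
                    }
                }
                g[i][j] = acc / weightSum;
            }
        }
        return g;
    }

    // Box kernel: every weight is 1, so all pixels contribute equally.
    static float[][] boxKernel(int size) {
        float[][] h = new float[size][size];
        for (float[] row : h) java.util.Arrays.fill(row, 1f);
        return h;
    }
}
```

Calling `LinearFilter.apply(image, LinearFilter.boxKernel(3))` averages each pixel with its eight neighbours; passing a different kernel reuses the same loop for other linear filters.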

Median Filter

Median image smoothing [29] uses the same process as the averaging, but calculates the median of the pixels instead of the average. This means that the newly calculated value of the pixel is always present in the non-processed image. A median filter applied to an image of size 1800 × 1018 pixels is shown in Figure 2.7.

Figure 2.7: Calculating the median of the pixels to achieve image smoothing
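A 3 × 3 median filter can be sketched as below. The `MedianFilter` class, the single-channel `int[][]` image and the choice to copy border pixels unchanged are assumptions made for this illustration; they are not taken from the implementations evaluated later.

```java
import java.util.Arrays;

final class MedianFilter {
    // Replaces each interior pixel with the median of its 3x3 neighbourhood.
    // Assumption: border pixels are copied unchanged.
    static int[][] apply3x3(int[][] img) {
        int rows = img.length;
        int cols = img[0].length;
        int[][] out = new int[rows][cols];
        int[] window = new int[9];

        for (int i = 0; i < rows; i++) {
            for (int j = 0; j < cols; j++) {
                boolean border = i == 0 || j == 0 || i == rows - 1 || j == cols - 1;
                if (border) {
                    out[i][j] = img[i][j];
                    continue;
                }
                // Gather the nine neighbourhood values.
                int n = 0;
                for (int k = -1; k <= 1; k++) {
                    for (int l = -1; l <= 1; l++) {
                        window[n++] = img[i + k][j + l];
                    }
                }
                // After sorting, the median is the middle element.
                Arrays.sort(window);
                out[i][j] = window[4];
            }
        }
        return out;
    }
}
```

Unlike the box filter, an outlier such as a single very bright pixel is discarded rather than averaged into its neighbours, and the output value is always one of the input values.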

Gaussian filter

Gaussian filter [29] is an often used technique of image smoothing, using the Gaussian function. It uses a Gaussian kernel, where the relative weights of each pixel decrease as the distance to the center increases. In a 2D picture the value of the kernel is calculated as

$$G(x, y) = \frac{1}{2\pi\sigma^2} e^{-\frac{x^2 + y^2}{2\sigma^2}}$$

where x is the distance from the center in the horizontal axis, y is the distance from the center in the vertical axis, and σ is the standard deviation of the Gaussian distribution.

0.077847 0.123317 0.077847 0.123317 0.195346 0.123317 0.077847 0.123317 0.077847

Table 2.1: Sample 3x3 Gaussian kernel with σ = 1.0.

As can be seen in Table 2.1, the weights are largest in the center of the matrix and decrease as the distance to the center increases. An example of a Gaussian filter applied to an image of size 1800 × 1018 pixels can be seen in Figure 2.8.

Figure 2.8: Example of Gaussian blur of an image with σ = 1.0 and a 5×5 kernel
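A Gaussian kernel can be generated from the formula for G(x, y) by sampling it at integer offsets from the center and normalizing the weights so they sum to one. The sketch below assumes this point-sampling scheme and a hypothetical `GaussianKernel` class. The resulting values are close to, but not identical with, those in Table 2.1, which may have been produced with a different discretization.

```java
final class GaussianKernel {
    // Builds a size x size kernel by sampling G(x, y) at integer offsets
    // from the center, then normalizing so that all weights sum to 1.
    // The size is assumed to be odd.
    static double[][] build(int size, double sigma) {
        int r = size / 2;
        double[][] h = new double[size][size];
        double sum = 0.0;

        for (int y = -r; y <= r; y++) {
            for (int x = -r; x <= r; x++) {
                double g = Math.exp(-(x * x + y * y) / (2.0 * sigma * sigma))
                        / (2.0 * Math.PI * sigma * sigma);
                h[y + r][x + r] = g;
                sum += g;
            }
        }

        // Normalize so the filter preserves overall image brightness.
        for (double[] row : h) {
            for (int i = 0; i < row.length; i++) {
                row[i] /= sum;
            }
        }
        return h;
    }
}
```

For example, `GaussianKernel.build(3, 1.0)` yields a symmetric kernel whose center weight is the largest and whose corner weights are the smallest.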

2.8.2 Grayscaling

Grayscaling an image is the process of converting a colored image to an image composed of different shades of gray. Each pixel in the image carries only intensity information. To convert a colored image to a grayscale image, the intensity of each pixel has to be calculated.

$$Y = 0.299R + 0.587G + 0.114B \tag{2.1}$$

Equation 2.1 shows how to calculate the intensity (Y) from the colors of a pixel [31]. R is the amount of red in the pixel, G is the amount of green and B is the amount of blue.

Figure 2.9: An image converted to grayscale. (a) Original, (b) Grayscale

Figure 2.9 shows an example of a colored image converted to grayscale. The intensities of the gray pixels are calculated from the colors according to Equation 2.1.
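Equation 2.1 can be applied per pixel as in the following sketch. It assumes pixels packed as 0xAARRGGBB integers and rounds the intensity to the nearest integer; the `Grayscale` class and both of these choices are illustrative assumptions.

```java
final class Grayscale {
    // Converts one packed 0xAARRGGBB pixel to gray using Equation 2.1,
    // writing the intensity to all three color channels and keeping alpha.
    static int toGray(int argb) {
        int a = (argb >>> 24) & 0xFF;
        int r = (argb >> 16) & 0xFF;
        int g = (argb >> 8) & 0xFF;
        int b = argb & 0xFF;

        int y = Math.round(0.299f * r + 0.587f * g + 0.114f * b);

        return (a << 24) | (y << 16) | (y << 8) | y;
    }

    // Converts a whole pixel array, returning a new array.
    static int[] toGray(int[] pixels) {
        int[] out = new int[pixels.length];
        for (int i = 0; i < pixels.length; i++) {
            out[i] = toGray(pixels[i]);
        }
        return out;
    }
}
```

Since the three weights sum to one, black and white pixels are left unchanged, while a pure red pixel maps to a fairly dark gray.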

2.8.3 Thresholding

Thresholding is a method of image segmentation, converting a grayscale image to a binary image (i.e., an image with only two colors). The simplest thresholding methods replace each pixel in an image with a black or white pixel depending on the intensity of the pixel. Given a grayscale image and a fixed threshold T, every pixel in the image with an intensity I < T is replaced with a black pixel, and the others are replaced with white pixels.

Figure 2.10: Grayscaling and thresholding applied to an image. (a) Original, (b) Thresholding effect

In Figure 2.10 the original image is converted to grayscale with the help of Equation 2.1. Every pixel with an intensity < 0.5 is then replaced with a black pixel, and otherwise a white pixel.
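The thresholding rule can be sketched as follows, assuming intensities normalized to the range [0, 1] and output pixels packed as opaque 0xAARRGGBB integers. The `Threshold` class and its constants are hypothetical names used only for this example.

```java
final class Threshold {
    static final int BLACK = 0xFF000000;
    static final int WHITE = 0xFFFFFFFF;

    // Maps each intensity I to black if I < t, and to white otherwise.
    static int[] apply(float[] intensity, float t) {
        int[] out = new int[intensity.length];
        for (int i = 0; i < intensity.length; i++) {
            out[i] = intensity[i] < t ? BLACK : WHITE;
        }
        return out;
    }
}
```

With T = 0.5 as in the example above, an intensity of exactly 0.5 becomes white, because only intensities strictly below T are replaced with black.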

2.9 Color spaces

A color space is a model describing how to represent colors as tuples of numbers. An example of a commonly seen color space is RGB [21]. RGB is an additive color space where a color is defined by its amounts of red, green and blue. The color red is defined as #FF0000 in hexadecimal. The first two digits (i.e., the first byte) represent the amount of red, the middle two digits (the second byte) represent the amount of green and the last two digits (the last byte) represent the amount of blue.
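The byte layout of a hexadecimal RGB value can be made concrete with a few bit operations. The `RgbChannels` helper below is a hypothetical name; it assumes a 24-bit color stored in the low bits of an `int`.

```java
final class RgbChannels {
    // Each channel occupies one byte of the 24-bit value 0xRRGGBB.
    static int red(int rgb)   { return (rgb >> 16) & 0xFF; }
    static int green(int rgb) { return (rgb >> 8) & 0xFF; }
    static int blue(int rgb)  { return rgb & 0xFF; }

    // Packs three channel values in the range 0..255 into 0xRRGGBB.
    static int pack(int r, int g, int b) {
        return (r << 16) | (g << 8) | b;
    }
}
```

For #FF0000, `red` returns 255 while `green` and `blue` return 0, matching the description of pure red above.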

RGB is not a supported format on all Android devices. All devices, however, support capturing video in YUV420 format and it is the standard format for the Android camera preview. The YUV model defines a color space in terms of a brightness component (Y) and two color components (U, V) [21].

An image in the RGB color space is stored as interleaved values, i.e., the red, green and blue values of a pixel lie next to each other. The YUV model, however, groups together the U and V values, while the Y values are placed at the beginning.
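To illustrate this layout, the sketch below reads the Y, U and V values for one pixel from a YUV420 buffer. It assumes the NV21 variant, in which a full-resolution Y plane is followed by interleaved V and U bytes shared by each 2 × 2 block of pixels. Whether the frames used in this thesis are NV21 is an assumption, as is the `Nv21Layout` class name, and the width and height are assumed to be even.

```java
final class Nv21Layout {
    // Returns {Y, U, V} for the pixel at column x, row y.
    // Assumed layout: width*height Y bytes, then (V, U) byte pairs,
    // one pair per 2x2 block of pixels.
    static int[] yuvAt(byte[] nv21, int width, int height, int x, int y) {
        int yIndex = y * width + x;
        // The chroma plane starts after the Y plane; each chroma row
        // covers two image rows and holds width bytes (width/2 pairs).
        int vuIndex = width * height + (y / 2) * width + (x / 2) * 2;

        int yValue = nv21[yIndex] & 0xFF;      // mask to read as unsigned
        int vValue = nv21[vuIndex] & 0xFF;
        int uValue = nv21[vuIndex + 1] & 0xFF;
        return new int[] {yValue, uValue, vValue};
    }
}
```

Every pixel has its own Y value, while four neighbouring pixels share one U and one V value, which is why the buffer needs only 1.5 bytes per pixel.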

In order to do color processing in the RGB color space on Android, the camera frames fetched must therefore first be converted from YUV420. The following formula can be used [22]:

$$\begin{aligned} B &= 1.164(Y - 16) + 2.018(U - 128) \\ G &= 1.164(Y - 16) - 0.813(V - 128) - 0.391(U - 128) \\ R &= 1.164(Y - 16) + 1.596(V - 128) \end{aligned} \tag{2.2}$$

Chapter 3

Related Work

Earlier related research is presented in this chapter.

3.1 Java and C++ benchmarks

Reinholtz [23] claimed that the run-time performance of the Java programming language will likely surpass that of C++. The author based this claim on the fact that the dynamic compilation gives the Java compiler access to runtime information that is not available to the C++ compiler. The author claims that this is bound to occur since the market for embedded systems will be driven to extend battery life, and that a more performant language would be desirable.

Hundt [13] conducted a comparison of the programming languages Java, C++, Go and Scala. The comparison was based on an implementation of a loop recognition algorithm in each of the languages mentioned. The implementations all used idiomatic container classes, but did not attempt to exploit specific language or run time features. The results showed that the Java implementation contained 25.6% more lines of code than the C++ version, and that it used 6 times more virtual memory than the C++ version. The Java version was 5.8 times slower than the C++ version. The author claims that even though the benchmark itself was simple and compact, it utilized many language features such as higher-level data structures (lists, maps etc.) and some well known algorithms (e.g., DFS, union/find), which means that the comparison shown could be applicable in other situations as well. Following the reasoning of Reinholtz [23], this gap between the Java and C++ performance will decrease as the JIT compilers used in the JVMs are improved.

Gherardi, Brugali, and Comotti [9] showed a smaller difference in run time performance than Hundt [13]. Gherardi, Brugali, and Comotti [9] implemented algorithms for processing sensor data that are often used in robotics. In certain test runs the Java implementation was measured to be 9% slower than the C++ implementation. However, the same program with different data also showed a performance decrease of 280% when using the Java version. Moreover, Gherardi, Brugali, and Comotti [9] presented similar benchmarks conducted with earlier versions of the JVM, showing that the performance gain of using C++ was getting smaller every year, giving reason to believe that Reinholtz [23] is correct.

Lin et al. [18] have earlier shown that Android applications written in C++ increased run time performance by up to 34.2% when moving from Java. These tests were, however, run on the Dalvik VM, and did not utilize the ART runtime released with Android 5.0. Son and Lee [30] also showed that they could increase the run time performance of their augmented reality engine by 86.9% when rewriting it using C++ instead of Java.

3.2 Dalvik vs ART

Konradsson [17] compared the performance of the Dalvik VM and the then newly introduced Android Runtime. The author compared run time, memory usage and application size, using well established benchmarking frameworks. Solving a dense 1000×1000 linear equation system yielded an average improvement of 12.35% when using ART over Dalvik when measuring the number of floating point operations per second. It is worth noting that Dalvik outperformed ART in certain test cases. In two of five test cases, Dalvik performed 0.6% and 0.8% more floating point operations per second. The tests were run on Android versions 4.4.2 and 5.1.1, meaning the ART version did not have access to a JIT compiler. The reason that Dalvik outperformed ART in certain test cases is possibly that the JIT compiler in Dalvik was able to optimize the code for the architecture on that device. The author furthermore measured the RAM usage for 6 popular applications, namely Drive, , WhatsApp, Netflix, Dropbox and Skype. It is not clear what actions the user performed, but the average RAM usage was 45% higher on Dalvik.

3.3 Using GPU for calculations

3.3.1 RenderScript

In 2012, an Android engineer showed that varying the saturation of a bitmap could run 7 times as fast when using RenderScript instead of Java [27]. In 2013 Google further optimized the RenderScript engine, significantly improving its performance [26]. When code was executed only on the CPU, the RenderScript engine showed improvements in the range of 90%–220% when updating from Android 4.1 to Android 4.2.

Figure 3.1: Comparison of CPU and GPU code doing image processing on Android 4.2 [6]

The tests run in the comparisons used Android versions 4.0, 4.1 and 4.2, meaning that ART was not used in the tests. The tests did therefore not compare the RenderScript engine with AOT-compiled code. As can be seen in Figure 3.1, using the GPU provides performance benefits compared to using only the CPU. The performance is shown relative to the performance measured on Android 4.0.

3.3.2 OpenCL

OpenCL [35] is a framework for writing programs that execute across heterogeneous platforms. The programs can be run on CPUs and GPUs, and OpenCL is widely used for parallelization. Its support for Android is limited [1], since Google opted for RenderScript instead.

Wang et al. [36], using OpenCL, implemented an algorithm that removes objects from images and fills the hole left by removing the object, creating a plausible image. Using OpenCL that only ran on the CPU, processing an image took 393.8 seconds. Utilizing the GPU and varying certain parameters in their algorithm, it took 4.266 seconds. The authors conclude that frameworks such as OpenCL are suitable for use on modern mobile GPUs.

Ross et al. [25] measure the performance of mobile CPUs and GPUs using the N-body algorithm. N-body is an algorithm used to solve problems regarding particles subject to an inter-particle force. The authors considered the algorithm representative of many real-world computational kernels. The results presented show that code running on the GPU is considerably faster than the code running on the CPU, and that the performance of handheld GPUs is closing in on desktop CPUs. The authors furthermore note that OpenCL is immature for mobile and embedded devices, but that it will likely get better.

Kim and Kim [16] compared the performance of OpenCL and RenderScript when computing matrix multiplications. When performing the multiplications on a PC, the OpenCL implementation far outperformed the RenderScript version. The OpenCL implementation was 2 times faster when multiplying a 10 × 10 matrix, and approximately 13 times faster when multiplying a 100 × 100 matrix. On average, OpenCL was 9.11 times faster.

However, when conducting the same experiments on a mobile device, the RenderScript implementation was 5.8 times faster on average. The PC versions, however, used an emulator in order to run the RenderScript versions, which might penalize the performance. The authors conclude that RenderScript is more optimized for the architectures found on Android devices.

Chapter 4

Method

This chapter explains how the experiment that is intended to answer the research question was conducted.

4.1 Choice of method and algorithms

Selecting a language other than Java when developing Android applications is often done when developing for computationally intensive purposes. Image processing is a computationally intensive area, making it suitable for use when measuring language performance. The algorithms implemented in this project are popular algorithms that are available in open source libraries.

The different image processing algorithms implemented have different properties that make them interesting. The grayscaling algorithm, for example, only requires accessing one pixel to determine the color of the new pixel. The Gaussian blurring function, however, requires accessing neighboring pixels to calculate a weighted average for the pixel. This means that the algorithms have differing cache localities, making them suitable candidates for this test.

4.2 Development environment and devices

The Android applications built for this thesis were built using 2.2.3 [10] and CMake 3.4.1 [5]. The compiler used for compiling the native parts of the application was 3.8.256229.


The image processing algorithms outlined in this chapter were tested on mul- tiple Android devices running different versions of the operating system. The following smartphones were tested:

• Samsung Galaxy S5, Android 6.0.1

• Sony Xperia Z1, Android 4.4

The Samsung Galaxy S5 device was using ART whereas the Sony Xperia Z1 device was using Dalvik as its runtime. The technical specifications can be seen in Tables 4.1 and 4.2.

OS        Android 6.0.1 (Marshmallow)
Chipset   Qualcomm MSM8974AC Snapdragon 801
CPU       Quad-core 2.5 GHz Krait 400
GPU       Adreno 330
RAM       2 GB

Table 4.1: Technical specifications for the Samsung Galaxy S5

OS        Android 4.4 (KitKat)
Chipset   Qualcomm MSM8974 Snapdragon 800
CPU       Quad-core 2.2 GHz Krait 400
GPU       Adreno 330
RAM       2 GB

Table 4.2: Technical specifications for the Sony Xperia Z1

4.3 Implementation

The benchmark implementations were done by implementing a color space conversion algorithm, different versions of blurring filters, grayscaling and thresholding. The color space conversion is used to convert YUV-data to RGB- format. The implementations are described in more detail below.

The number of bugs encountered during implementation of these algorithms was larger in RenderScript and C++ than in Java. However, the number of bugs encountered also depends on our previous experience with the development languages. Moreover, the tooling for the NDK and RenderScript is not as extensive as that for Java, making debugging harder. The build times of the project increased as the NDK or RenderScript was added to the project.

4.3.1 Color space conversion

The color space conversion application captures frames from the smartphone’s camera. When a frame is fetched it is passed to an instance of the interface Camera.PreviewCallback. The frame is passed as a byte array containing data in YUV format. A common operation is converting the frame data from YUV format to an RGB format before conducting further processing of the image. Seeing as this is a common operation, there exist many implementations provided by different vendors. It also means that this is a suitable test that can be generalized. The following implementations were tested, and are more thoroughly explained below:

• Java Threaded

• C++ Threaded

• C++ implemented in OpenCV

• RenderScript intrinsics

• Relaxed RenderScript

The implementations were built into an Android application that captured frames from the camera and then let each algorithm process the frame.

The methods all set the pixels of a bitmap displayed on the screen. After processing, the bitmap is invalidated and is redrawn by the operating system. In the cases where it is possible to change, the processing is done by the maximum number of threads possible on the device. In the case of color space conversion, each thread processes a part of the image. The main thread then waits for each thread to complete before rendering the final image on the screen, using the join-method present in the language. The garbage collector is manually requested to run before each algorithm in order to avoid garbage collection during the processing of the image.

A reference C++ implementation developed by Google was adapted for our Java and C++ implementations¹. The reference implementation provided by Google uses the formula in Equation 2.2 to calculate the RGB values from YUV.

¹ https://android.googlesource.com/platform/frameworks/rs/+/master/cpu_ref/rsCpuIntrinsicYuvToRGB.cpp

Floating point operations often perform poorly and are as such replaced with integer and bitwise operations, both in our implementation and the reference implementation provided by Google.
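To illustrate the idea of replacing floating point arithmetic with integer and bitwise operations, the following Java sketch evaluates Equation 2.2 in two ways: once with floats, and once with the coefficients scaled by 256 and rounded to integers. It is a minimal sketch under these assumptions and is not the code used in the thesis or in the reference implementation; the class and method names are hypothetical.

```java
// Sketch: Equation 2.2 evaluated with floats and with 8-bit fixed-point
// integers. The integer constants are the coefficients of Equation 2.2
// multiplied by 256 and rounded (an assumption of this example).
final class YuvToRgb {
    static int clamp(int v) {
        return v < 0 ? 0 : (v > 255 ? 255 : v);
    }

    // Floating point version following Equation 2.2. Returns 0xRRGGBB.
    static int convertFloat(int y, int u, int v) {
        float yf = 1.164f * (y - 16);
        int r = clamp(Math.round(yf + 1.596f * (v - 128)));
        int g = clamp(Math.round(yf - 0.813f * (v - 128) - 0.391f * (u - 128)));
        int b = clamp(Math.round(yf + 2.018f * (u - 128)));
        return (r << 16) | (g << 8) | b;
    }

    // Integer version: scaled coefficients, +128 for rounding, >> 8 to rescale.
    static int convertFixed(int y, int u, int v) {
        int c = 298 * (y - 16);
        int d = u - 128;
        int e = v - 128;
        int r = clamp((c + 409 * e + 128) >> 8);
        int g = clamp((c - 100 * d - 208 * e + 128) >> 8);
        int b = clamp((c + 517 * d + 128) >> 8);
        return (r << 16) | (g << 8) | b;
    }

    public static void main(String[] args) {
        System.out.printf("float: %06X%n", convertFloat(128, 100, 200));
        System.out.printf("fixed: %06X%n", convertFixed(128, 100, 200));
    }
}
```

Because of the rounded coefficients, the two versions can differ by a small amount per channel, which is usually not visible in the resulting image.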

Java, C++

The Java and C++ implementations create the maximum number of threads usable by the hardware. The maximum number of threads usable by the CPU is detectable at runtime. Creating 4 new threads on a Samsung Galaxy S5 takes on average 3 ms, over 100 test runs. Thread creation was therefore not considered a large overhead, and the threads are recreated when needed.
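The threading strategy described above can be sketched in Java as follows: the image is split into horizontal bands, one band per available processor, and the calling thread waits for all workers with join. The per-pixel operation (a color inversion) is only a placeholder, and all names are hypothetical; this is an illustration of the approach, not the code used in the experiments.

```java
// Sketch: splitting an image into horizontal bands, one band per thread.
// The per-pixel operation is a placeholder chosen for this example.
final class BandedProcessing {
    // Placeholder operation: inverts R, G and B and keeps the alpha byte.
    static int invert(int argb) {
        return (argb & 0xFF000000) | (~argb & 0x00FFFFFF);
    }

    // Processes the rows rowStart (inclusive) to rowEnd (exclusive).
    static void processRows(int[] src, int[] dst, int width, int rowStart, int rowEnd) {
        for (int i = rowStart * width; i < rowEnd * width; i++) {
            dst[i] = invert(src[i]);
        }
    }

    static void processThreaded(int[] src, int[] dst, int width, int height) {
        // Number of threads usable by the hardware, detected at runtime.
        int cores = Runtime.getRuntime().availableProcessors();
        int n = Math.max(1, Math.min(height, cores));
        Thread[] workers = new Thread[n];
        for (int t = 0; t < n; t++) {
            final int rowStart = t * height / n;
            final int rowEnd = (t + 1) * height / n;
            workers[t] = new Thread(() -> processRows(src, dst, width, rowStart, rowEnd));
            workers[t].start();
        }
        try {
            for (Thread worker : workers) {
                worker.join(); // wait until every band has been processed
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for workers", e);
        }
    }

    public static void main(String[] args) {
        int width = 1920, height = 1080;
        int[] src = new int[width * height];
        int[] dst = new int[width * height];
        processThreaded(src, dst, width, height);
        System.out.printf("first pixel: %08X%n", dst[0]);
    }
}
```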

C++, using OpenCV

OpenCV is an open source computer vision library [20]. It contains optimized code for many tasks often done in image processing, and contains architecture-specific optimizations. The OpenCV Android SDK v3.2.0 was used to call the OpenCV C++ API.

RenderScript intrinsics

Google provides implementations of often-used algorithms with RenderScript intrinsics. Intrinsics are built-in functions that perform operations often used when conducting image processing. They provide high performance with a very small amount of code [24].

Relaxed RenderScript

The Relaxed RenderScript implementation uses lower precision in floating point operations in favor of increased performance. The implementation therefore uses 32-bit precision instead of 64-bit precision which is common in CPUs.

4.3.2 Blurring

A pre-defined bitmap is shown and blurred using different blurring algorithms. The algorithms are run sequentially using AsyncTask, a class in the Android framework used for processing on a background thread. Each algorithm takes two parameters: an input bitmap and an output bitmap.

The methods all set the pixels of a bitmap displayed on the screen. After processing, the bitmap is invalidated and is redrawn by the operating system. The processing is done by the maximum number of threads possible on the device in cases where it is possible to change. The main thread then waits for each thread to complete before rendering the final image on the screen, using the join-method present in the language.

Before running an algorithm the system garbage collector is manually requested to run in order to not pollute the run times of the algorithms. Note that this does not guarantee that the garbage collector is invoked, but it is visible in the logs when it is run. The logs were manually checked to make sure that the garbage collector did not run during the processing.

Different implementations can in some instances return different results. For instance, it is not possible for a developer to specify the kernel used when using the intrinsic RenderScript implementation of Gaussian blurring, and we must therefore consider the possibility that the images slightly differ. The difference between two images is calculated pixel per pixel. The red, green and blue values of each pixel are summed, and the sum is compared between the images. This is called the Manhattan norm.

\[
\lVert x \rVert_1 = \sum_{i=1}^{n} |x_i|
\]

It is considered acceptable if the resulting images differ up to 10% in each channel (R, G, B). Further distortion when conducting blurring will be noticeable in the resulting image.
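One possible way to compute such a difference between two images is sketched below. It sums the absolute differences of the red, green and blue values of corresponding pixels and relates the sum to the largest possible difference. The exact comparison used in the experiments may differ; relating the sum to all three channels at once, as well as the class and method names, are assumptions of this example.

```java
// Sketch: Manhattan distance between two images stored as packed ARGB
// integers. Names and the normalization are assumptions of this example.
final class ImageDifference {
    // Sum of the absolute R, G and B differences over all pixels.
    static long manhattanDistance(int[] a, int[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("images must have the same size");
        }
        long sum = 0;
        for (int i = 0; i < a.length; i++) {
            sum += Math.abs(((a[i] >> 16) & 0xFF) - ((b[i] >> 16) & 0xFF)); // red
            sum += Math.abs(((a[i] >> 8) & 0xFF) - ((b[i] >> 8) & 0xFF));   // green
            sum += Math.abs((a[i] & 0xFF) - (b[i] & 0xFF));                 // blue
        }
        return sum;
    }

    // Distance relative to the largest possible distance
    // (255 per channel, 3 channels per pixel); a value in [0, 1].
    static double relativeDifference(int[] a, int[] b) {
        if (a.length == 0) {
            return 0.0;
        }
        return manhattanDistance(a, b) / (255.0 * 3 * a.length);
    }

    static boolean withinTolerance(int[] a, int[] b, double tolerance) {
        return relativeDifference(a, b) <= tolerance;
    }

    public static void main(String[] args) {
        int[] first = {0xFF102030, 0xFF405060};
        int[] second = {0xFF112233, 0xFF405060};
        System.out.println("distance: " + manhattanDistance(first, second));
        System.out.println("within 10%: " + withinTolerance(first, second, 0.10));
    }
}
```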

Gaussian filter

As described in chapter 2, a 2D Gaussian kernel is calculated as follows:

\[
G(x, y) = \frac{1}{2\pi\sigma^2} e^{-\frac{x^2 + y^2}{2\sigma^2}}
\]

In order to speed up image processing, one can use a one dimensional filter and apply it twice, both horizontally and vertically. This means that a 1D vector is computed and applied horizontally to each row in the image. The resulting image is then used for the vertical pass for all columns in the image.

The Gaussian kernel used was pre-calculated using an online service [7]. The calculation of the Gaussian kernel is therefore not taken into account.
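The two-pass strategy can be sketched in Java as follows. The sketch works on a single-channel image stored as a float array and takes a pre-calculated 1D kernel of odd length as input. Handling the image borders by repeating the edge pixels is an assumption of this example, and the class and method names are hypothetical.

```java
// Sketch: separable filtering with one horizontal and one vertical pass.
// Single-channel image, pre-calculated 1D kernel of odd length, and
// clamped border handling are assumptions of this example.
final class SeparableBlur {
    static int clampIndex(int i, int size) {
        return Math.max(0, Math.min(size - 1, i));
    }

    // Applies the 1D kernel along the rows (horizontal) or along the columns.
    static float[] pass(float[] src, int width, int height, float[] kernel, boolean horizontal) {
        int radius = kernel.length / 2;
        float[] dst = new float[src.length];
        for (int y = 0; y < height; y++) {
            for (int x = 0; x < width; x++) {
                float sum = 0f;
                for (int k = -radius; k <= radius; k++) {
                    int sx = horizontal ? clampIndex(x + k, width) : x;
                    int sy = horizontal ? y : clampIndex(y + k, height);
                    sum += kernel[k + radius] * src[sy * width + sx];
                }
                dst[y * width + x] = sum;
            }
        }
        return dst;
    }

    // The output of the horizontal pass is the input of the vertical pass.
    static float[] blur(float[] src, int width, int height, float[] kernel) {
        float[] horizontal = pass(src, width, height, kernel, true);
        return pass(horizontal, width, height, kernel, false);
    }

    public static void main(String[] args) {
        float[] image = {0, 0, 0, 0, 16, 0, 0, 0, 0}; // 3 x 3 image, single bright pixel
        float[] kernel = {0.25f, 0.5f, 0.25f};
        System.out.println(java.util.Arrays.toString(blur(image, 3, 3, kernel)));
    }
}
```

For a kernel of length k, the two passes read 2k pixels per output pixel instead of the k² pixels read by a full 2D kernel, which is where the speed-up comes from.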

The Gaussian filter had 8 different implementations tested:

• Single-threaded Java

• Multi-threaded Java

• Single-threaded C++

• Multi-threaded C++

• C++, using OpenCV

• RenderScript

• Relaxed RenderScript

• RenderScript Intrinsics

The Java and C++ implementations were based on a reference C++ implementation found in the Android system source code².

Box filter

Applying a box filter can be done using the same strategy as applying the Gaussian filter, using one vertical and one horizontal pass. Recall that a box filter corresponds to a Gaussian filter in which all pixels are given the same weight. The box filter had 7 different implementations:

• Single-threaded Java

² https://android.googlesource.com/platform/frameworks/rs/+/master/cpu_ref/rsCpuIntrinsicBlur.cpp

• Multi-threaded Java

• Single-threaded C++

• Multi-threaded C++

• C++, using OpenCV

• Relaxed RenderScript

• RenderScript

Google does not provide an intrinsic RenderScript box filter function, and it could therefore not be included. The Java and C++ versions were based on the Gaussian blurring implementation provided by Google, with changes to adapt it to a box filter.
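The change from a Gaussian filter to a box filter amounts to replacing the kernel weights with a constant. The following Java sketch builds such a uniform kernel and applies it to a single row of a single-channel image. It is only an illustration; the names are hypothetical and repeating the edge pixels at the borders is an assumption of this example.

```java
import java.util.Arrays;

// Sketch: a box filter as a kernel in which all weights are equal.
// Names and the clamped border handling are assumptions of this example.
final class BoxBlur1D {
    // Uniform kernel: each of the 2 * radius + 1 taps has the same weight.
    static float[] boxKernel(int radius) {
        float[] kernel = new float[2 * radius + 1];
        Arrays.fill(kernel, 1.0f / kernel.length);
        return kernel;
    }

    // Blurs one row; the same routine applied to columns gives the second pass.
    static float[] blurRow(float[] row, int radius) {
        float[] kernel = boxKernel(radius);
        float[] out = new float[row.length];
        for (int x = 0; x < row.length; x++) {
            float sum = 0f;
            for (int k = -radius; k <= radius; k++) {
                int sx = Math.max(0, Math.min(row.length - 1, x + k));
                sum += kernel[k + radius] * row[sx];
            }
            out[x] = sum;
        }
        return out;
    }

    public static void main(String[] args) {
        float[] row = {0, 0, 9, 0, 0};
        System.out.println(Arrays.toString(blurRow(row, 1)));
    }
}
```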

Median filter

The median filter is applied by looking at every pixel surrounding a center pixel within the radius supplied. The color of the center pixel is then set to the median color of these pixels.
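A straightforward way to implement this is to collect the pixels inside the window, sort them and pick the middle element, as in the Java sketch below. The sketch works on a single-channel image; applying it to each color channel separately and repeating the edge pixels at the borders are assumptions of this example, and the names are hypothetical.

```java
import java.util.Arrays;

// Sketch: median filter on a single-channel image using a square window.
// Border handling (clamping) and all names are assumptions of this example.
final class MedianFilter {
    static int[] apply(int[] src, int width, int height, int radius) {
        int side = 2 * radius + 1;
        int[] window = new int[side * side];
        int[] dst = new int[src.length];
        for (int y = 0; y < height; y++) {
            for (int x = 0; x < width; x++) {
                int n = 0;
                // Collect every pixel within the radius of the center pixel.
                for (int dy = -radius; dy <= radius; dy++) {
                    for (int dx = -radius; dx <= radius; dx++) {
                        int sx = Math.max(0, Math.min(width - 1, x + dx));
                        int sy = Math.max(0, Math.min(height - 1, y + dy));
                        window[n++] = src[sy * width + sx];
                    }
                }
                Arrays.sort(window);
                dst[y * width + x] = window[window.length / 2]; // middle element
            }
        }
        return dst;
    }

    public static void main(String[] args) {
        int[] image = {1, 1, 1, 1, 100, 1, 1, 1, 1}; // 3 x 3 image with one outlier
        System.out.println(Arrays.toString(apply(image, 3, 3, 1)));
    }
}
```

Sorting the whole window for every pixel is simple but not the fastest approach; optimized libraries typically use more elaborate techniques.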

The median filter has 5 different implementations:

• Single-threaded Java

• Multi-threaded Java

• Single-threaded C++

• Multi-threaded C++

• C++, using OpenCV

No RenderScript intrinsic was provided by Google. RenderScript does not allow using vectors as function parameters, which made calculating medians impractical in the C99-derived language, and a RenderScript implementation was therefore left out. The Java and C++ versions were not based on a reference implementation.

4.3.3 Grayscaling and thresholding

The grayscaling and thresholding algorithms were implemented in five variants:

• Java Threaded

• C++ Threaded

• C++ OpenCV

• RenderScript

• Relaxed RenderScript

As opposed to the blurring implementations described above, these algorithms only access one pixel at a time to calculate the color of the new pixel. This could lead to better cache locality.

The thresholding implementation uses the grayscaling implementation to convert a colored image to a grayscale image before deciding whether a certain pixel should be black or white.
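The per-pixel nature of both operations can be seen in the following Java sketch. The luminance weights are an assumption of this example (integer approximations of the common 0.299, 0.587 and 0.114 weighting) and may differ from the weights in Equation 2.1; the names are hypothetical as well.

```java
// Sketch: grayscaling and thresholding of packed ARGB pixels.
// ASSUMPTION: luminance weights 77/256, 150/256 and 29/256, which
// approximate 0.299, 0.587 and 0.114; Equation 2.1 may use other weights.
final class GrayThreshold {
    // Intensity of a pixel in the range 0..255.
    static int luminance(int argb) {
        int r = (argb >> 16) & 0xFF;
        int g = (argb >> 8) & 0xFF;
        int b = argb & 0xFF;
        return (77 * r + 150 * g + 29 * b) >> 8;
    }

    // Writes the intensity to all three color channels and keeps alpha.
    static int toGray(int argb) {
        int l = luminance(argb);
        return (argb & 0xFF000000) | (l << 16) | (l << 8) | l;
    }

    // Intensities below half of the range become black, all others white.
    static int threshold(int argb) {
        return luminance(argb) < 128 ? 0xFF000000 : 0xFFFFFFFF;
    }

    public static void main(String[] args) {
        int pixel = 0xFF00FF00; // opaque green
        System.out.printf("gray: %08X, thresholded: %08X%n", toGray(pixel), threshold(pixel));
    }
}
```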

4.4 Measuring Runtime Performance

There are multiple ways of measuring run time in Java. Using wall-clock time, with System.currentTimeMillis(), is not reliable seeing as it can be altered at seemingly random times by the operating system. Instead, the elapsed time is measured in this thesis with a monotonic clock, using the Android method SystemClock.elapsedRealtimeNanos() as recommended by Google [33].

4.4.1 Image processing

The run time of the algorithms can change depending on a number of factors. Among others, the JIT compiler present in the OS will optimize the code as it is running. In order to minimize its effect on the collected run times, a number of warmup rounds are run before the run times are measured. Furthermore, to prevent other processes influencing the run times of the algorithms, no other applications were running during the testing of the algorithms. The smartphone was also running in flight mode.

In the color space conversion test, 50 warmup rounds are run before starting the test. The blurring, grayscaling and thresholding tests use 10 warmup rounds. The run times were successively smaller in the first warmup rounds, due to the JIT compilation. After the 10 warmup rounds, the optimization did not further improve the performance.
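The measurement procedure can be sketched as a small Java harness. The sketch uses System.nanoTime() so that it also runs on a desktop JVM; on Android, SystemClock.elapsedRealtimeNanos() would take its place. The class and method names are hypothetical, and requesting a garbage collection before every measured round is an assumption of this example.

```java
// Sketch: timing a task after a number of warm-up rounds.
// System.nanoTime() stands in for SystemClock.elapsedRealtimeNanos(),
// which is only available on Android. Names are hypothetical.
final class WarmupBenchmark {
    // Returns the run time in nanoseconds of each measured round.
    static long[] measure(Runnable task, int warmupRounds, int measuredRounds) {
        for (int i = 0; i < warmupRounds; i++) {
            task.run(); // lets the JIT compiler optimize the code first
        }
        long[] nanos = new long[measuredRounds];
        for (int i = 0; i < measuredRounds; i++) {
            System.gc(); // only a request; the runtime may ignore it
            long start = System.nanoTime();
            task.run();
            nanos[i] = System.nanoTime() - start;
        }
        return nanos;
    }

    static double meanMillis(long[] nanos) {
        if (nanos.length == 0) {
            return 0.0;
        }
        double sum = 0.0;
        for (long n : nanos) {
            sum += n;
        }
        return sum / nanos.length / 1_000_000.0;
    }

    public static void main(String[] args) {
        int[] data = new int[1_000_000];
        long[] times = measure(() -> java.util.Arrays.fill(data, 42), 10, 20);
        System.out.println("mean run time (ms): " + meanMillis(times));
    }
}
```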

4.4.2 Setup

Every time a blurring, grayscaling or thresholding algorithm is run, some setup is required (e.g., for allocating memory). The time required to set up the necessary environments for each algorithm differs, and is also collected. For instance, when running the Java versions, the following is always used:

    int[] srcpixels = new int[width * height];
    int[] dstpixels = new int[width * height];
    src.getPixels(srcpixels, 0, width, 0, 0, width, height);

The C++ setup conducted is the same as when doing image processing in Java.

The RenderScript versions require more sophisticated setup. This is not part of the actual calculations done by the algorithm, but is necessary for the algorithms to work, and is therefore taken into account. Nothing is saved between two runs of the same algorithm, which means that the setup is conducted each time the algorithm is run. To take this into account, the time taken for each algorithm to set up the necessary allocations is measured. An example setup used for RenderScript in this thesis is shown below:

    Allocation input = Allocation.createFromBitmap(rs, src);
    Allocation output = Allocation.createFromBitmap(rs, dst);
    ScriptC_gaussian_blur script = new ScriptC_gaussian_blur(rs);
    script.set_width(w);
    script.set_height(h);

    // set input for blurring
    script.set_ScratchPixel1(input);
    script.set_ScratchPixel2(input);

The Allocation objects are handled by the RenderScript runtime and provide a buffer for the GPU to read from.

In the color space conversion test, the setup is reused between tests, meaning that the initialization and memory allocation does not have to be done before every run of the algorithms. The setup times are therefore not taken into account in that experiment.

4.5 Verifying results

In order to confirm that conclusions can be drawn from the results, a statistical test must be performed. A Wilcoxon Signed Rank Test is used in this project [38]. The test is performed by conducting pairwise comparisons between the average run times of different implementations of the same algorithm.
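For readers unfamiliar with the test, the following Java sketch computes the two rank sums on which the Wilcoxon Signed Rank Test is based. In the thesis an online tool was used instead; this code is only an illustration with hypothetical names. It follows the common convention of discarding pairs without a difference and giving tied absolute differences their average rank.

```java
import java.util.Arrays;

// Sketch of the rank sums used in a Wilcoxon signed-rank test for paired
// samples. Zero differences are dropped and ties share their average rank.
// Class and method names are hypothetical.
final class WilcoxonSignedRank {
    // Returns {W+, W-}: rank sums of the positive and negative differences x[i] - y[i].
    static double[] rankSums(double[] x, double[] y) {
        if (x.length != y.length) {
            throw new IllegalArgumentException("samples must be paired");
        }
        double[] d = new double[x.length];
        int n = 0;
        for (int i = 0; i < x.length; i++) {
            double diff = x[i] - y[i];
            if (diff != 0.0) {
                d[n++] = diff; // pairs without a difference are discarded
            }
        }
        // Order the remaining differences by their absolute value.
        Integer[] order = new Integer[n];
        for (int i = 0; i < n; i++) {
            order[i] = i;
        }
        Arrays.sort(order, (a, b) -> Double.compare(Math.abs(d[a]), Math.abs(d[b])));

        double wPlus = 0.0;
        double wMinus = 0.0;
        int i = 0;
        while (i < n) {
            int j = i;
            while (j + 1 < n && Math.abs(d[order[j + 1]]) == Math.abs(d[order[i]])) {
                j++;
            }
            double rank = (i + j) / 2.0 + 1.0; // average of the 1-based ranks i+1 .. j+1
            for (int k = i; k <= j; k++) {
                if (d[order[k]] > 0) {
                    wPlus += rank;
                } else {
                    wMinus += rank;
                }
            }
            i = j + 1;
        }
        return new double[] {wPlus, wMinus};
    }

    // The test statistic is the smaller of the two rank sums.
    static double statistic(double[] x, double[] y) {
        double[] w = rankSums(x, y);
        return Math.min(w[0], w[1]);
    }

    public static void main(String[] args) {
        double[] first = {10, 20, 30, 45};
        double[] second = {12, 19, 35, 41};
        System.out.println(Arrays.toString(rankSums(first, second)));
    }
}
```

The statistic is then compared with a critical value, or with a normal approximation when the number of pairs is large enough.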

Table 4.3 shows the algorithms and what languages they were implemented in. All algorithms, except the YUV to RGB conversion, were tested on images with 3 different resolutions. The YUV to RGB conversion was tested on a camera feed, with 2 different resolutions. The results of these tests were treated as independent data points, so the Java Threaded implementation had 17 independent data points, for example.

               Java  Java Threaded  C++  C++ Threaded  C++ OpenCV  RS  Relaxed RS  RS Intrinsic
YUV to RGB           x                   x             x               x           x
Gaussian blur  x     x              x    x             x           x   x           x
Box blur       x     x              x    x             x           x   x
Median blur    x     x              x    x             x
Thresholding         x                   x             x           x   x
Grayscaling          x                   x             x           x   x

Table 4.3: Algorithms and what languages they have been implemented in

The data points of an implementation are pairwise compared with the other implementations, i.e., the run times of two implementations of a certain algorithm on a certain resolution frame are compared.

The run times are normalized to the interval [0, 1] in order to reduce the relative importance of the run times of the tasks done on larger images. The normalization is done by dividing the runtime of the implementation with the largest runtime in the pair. Table 4.4 shows an example of converting absolute runtimes to normalized values, used for further calculation of statistical significance. Note that the results are examples only, and that the tables do not contain any real data.

(a) Absolute runtimes

Algorithm       Java     C++
Gaussian blur   12 ms    16 ms
Median blur     20 ms    18 ms
Thresholding    17 ms    19 ms
Grayscaling     13 ms    14 ms

(b) Normalized runtimes

Algorithm       Java     C++
Gaussian blur   0.75     1
Median blur     1        0.9
Thresholding    0.8947   1
Grayscaling     0.9286   1

Table 4.4: An example of converting absolute run times to normalized times
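The normalization of a pair of run times can be expressed in a few lines of Java. The sketch below uses the example values of Table 4.4; the class and method names are hypothetical.

```java
// Sketch: normalizing a pair of run times by the larger of the two,
// so that the slower implementation gets the value 1.
final class PairNormalization {
    static double[] normalize(double first, double second) {
        double max = Math.max(first, second);
        if (max <= 0.0) {
            throw new IllegalArgumentException("run times must be positive");
        }
        return new double[] {first / max, second / max};
    }

    public static void main(String[] args) {
        // Example values from Table 4.4a: Gaussian blur, 12 ms (Java) and 16 ms (C++).
        double[] normalized = normalize(12, 16);
        System.out.println(normalized[0] + " " + normalized[1]);
    }
}
```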

The results shown in Table 4.4b are used in the pairwise calculations to cal- culate whether the results are significant or not. An online tool is used for convenience to determine statistical significance [37]. The performance of an implementation of an algorithm on a single resolution can also be compared with other implementations with the Wilcoxon Signed Rank Test by comparing their absolute runtimes.

Some of the algorithms were not implemented in all languages. The single-threaded versions of Java and C++ performed worse than their corresponding multi-threaded versions, and were therefore left out. RenderScript Intrinsics are only available for a select few operations, and could therefore not be used for all algorithms.

In cases where the number of data points is too small, the test statistic does not converge to a normal distribution, as it otherwise does. When the number of data points is lower than 10, the calculated test statistic has to be compared with predefined critical values to determine whether the result is significant or not, as is standard when using a Wilcoxon Signed Rank Test.

Chapter 5

Results

In this chapter the results from the experiments conducted are presented.

5.1 Color space conversion

The color space conversion was done on 100 sequential frames captured by the smartphone’s camera, tested with different resolutions. The results are presented in this section.

Table 5.1 shows the average run time of the color space conversion. The Android 4.4 run times were collected from a Sony Xperia Z1 and the Android 6.0.1 run times were collected from a Samsung Galaxy S5. The resolutions of the camera feed are also displayed in the table.

                         Android 4.4                  Android 6.0.1
Resolution               640 × 480     1280 × 720     640 × 480     1920 × 1080
Java Threaded            26 ± 8 ms     63 ± 13 ms     32 ± 9 ms     70 ± 14 ms
C++ Threaded             14 ± 5 ms     40 ± 8 ms      23 ± 8 ms     51 ± 11 ms
C++ OpenCV               11 ± 5 ms     34 ± 7 ms      16 ± 6 ms     42 ± 10 ms
Relaxed RenderScript     29 ± 3 ms     79 ± 13 ms     19 ± 8 ms     65 ± 18 ms
RenderScript Intrinsic   11 ± 3 ms     32 ± 8 ms      12 ± 5 ms     40 ± 7 ms

Table 5.1: Run times for converting from YUV to RGB on different resolutions and different smartphones


The RenderScript Intrinsic provided the best runtime performance out of the tried implementations. The average runtimes of the C++ implementations were not far behind. The Java Threaded and Relaxed RenderScript implementations provided the worst runtime performance in the color space conversion test.

5.2 Blurring

The blurring algorithms were applied to three images of different sizes, ranging from 100 × 67 pixels up to 1920 × 1080 pixels. The run times of the algorithms are presented in this section. The graphs presented below display run time as a function of image size. Tables with more detailed results can be found in Appendix A.

5.2.1 Box filter

Table 5.2 shows the run times of the box filter implementations on a 1920 × 1080 image, on both Android 4.4 and 6.0.1. Note that the different operating system versions were used on different smartphones. Tables with run times for other resolutions can be found in Appendix A.

Algorithm              Android 4.4   Android 6.0.1
Java                   3092 ms       4907 ms
Java Threaded          869 ms        1084 ms
C++                    945 ms        939 ms
C++ Threaded           353 ms        311 ms
C++ OpenCV             201 ms        236 ms
RenderScript           402 ms        324 ms
Relaxed RenderScript   168 ms        151 ms

Table 5.2: Run times for applying box filter to a 1920 × 1080 image

The Relaxed RenderScript runtime performance is the best out of the implementations shown above, and the single-threaded Java implementation is the slowest implementation. Increasing the number of threads in the Java and C++ implementations shows a linear increase in run time performance, as expected. The C++ Threaded and OpenCV implementations outperformed the RenderScript implementation that used full floating point precision. Figure 5.1 displays the run times of applying a box filter to images of different resolutions.

[Two line plots, Android 4.4 and Android 6.0.1: time (ms) against pixels in image (·10^6) for Java, Java Threaded, C++, C++ Threaded, RS, Relaxed RS and OpenCV]

Figure 5.1: Run times of applying a box filter on two versions of Android

5.2.2 Median filter

Table 5.3 shows the run times of the median filter implementations on a 1920 × 1080 image, on both Android 4.4 and 6.0.1. Note that the different operating system versions were used on different smartphones. Tables with run times for other resolutions can be found in Appendix A.

The trends in the runtime performance of different implementations of the median filter are similar to the trends in the box filter results. An outlier in Table 5.3 is the OpenCV runtime performance on Android 4.4.

Algorithm       Android 4.4   Android 6.0.1
Java            4283 ms       2983 ms
Java Threaded   1903 ms       1230 ms
C++             1560 ms       1835 ms
C++ Threaded    646 ms        717 ms
C++ OpenCV      3176 ms       201 ms

Table 5.3: Run times for applying median filter to a 1920 × 1080 image

[Two line plots, Android 4.4 and Android 6.0.1: time (ms) against pixels in image (·10^6) for Java, Java Threaded, C++, C++ Threaded and OpenCV]

Figure 5.2: Run times of applying a median filter on two versions of Android

Figure 5.2 shows the average run times of the different implementations on both Android 4.4 and Android 6.0.1. The time is displayed as a function of image size.

5.2.3 Gaussian filter

Table 5.4 shows the run times of the Gaussian filter implementations on a 1920 × 1080 image, on both Android 4.4 and 6.0.1. Note that the different operating system versions were used on different smartphones. Tables with run times for other resolutions can be found in Appendix A.

Algorithm                Android 4.4   Android 6.0.1
Java                     3067 ms       4959 ms
Java Threaded            877 ms        1115 ms
C++                      936 ms        939 ms
C++ Threaded             392 ms        325 ms
C++ OpenCV               240 ms        285 ms
RenderScript             420 ms        356 ms
RenderScript Intrinsic   124 ms        49 ms
Relaxed RenderScript     166 ms        168 ms

Table 5.4: Run times for applying Gaussian filter to a 1920 × 1080 image

[Two line plots, Android 4.4 and Android 6.0.1: time (ms) against pixels in image (·10^6) for Java, Java Threaded, C++, C++ Threaded, RS, RS Intrinsic, Relaxed RS and OpenCV]

Figure 5.3: Run times of applying a Gaussian filter on two versions of Android

The Intrinsic RenderScript implementation was the fastest implementation by a significant margin. The Relaxed RenderScript implementation was the second fastest implementation, with an average run time 119 ms higher than the Intrinsic implementation on Android 6.0.1, and 42 ms higher on Android 4.4.

Figure 5.3 shows the average run times of the implementations when applying a Gaussian filter to images of varying resolution. The run times are plotted as a function of image size.

5.3 Grayscaling

Table 5.5 shows the run times of the different implementations when converting a colored image to grayscale. The resolution of the image was 1920 × 1080. The average run times of applying grayscaling to images of other resolutions can be found in Appendix A.

                       Android 4.4                  Android 6.0.1
Implementation         Setup time    Runtime        Setup time    Runtime
Java Threaded          55 ± 2 ms     133 ± 12 ms    29 ± 4 ms     107 ± 26 ms
C++ Threaded           52 ± 6 ms     98 ± 11 ms     30 ± 3 ms     93 ± 18 ms
C++ OpenCV             14 ± 3 ms     20 ± 3 ms      20 ± 2 ms     23 ± 11 ms
Relaxed RenderScript   57 ± 12 ms    72 ± 15 ms     46 ± 12 ms    59 ± 15 ms
RenderScript           51 ± 4 ms     81 ± 7 ms      41 ± 6 ms     65 ± 5 ms

Table 5.5: Run times for converting a 1920 × 1080-image to grayscale

In this relatively simple algorithm, the C++ OpenCV implementation achieved the lowest average runtime. The multi-threaded Java implementation was the slowest contender.

5.4 Thresholding

Table 5.6 shows the run times of the different implementations when applying thresholding to an image. The resolution of the image was 1920 × 1080. The average run times of applying thresholding to images of other resolutions can be found in Appendix A.

The results and trends shown in Table 5.6 are similar to those of the grayscaling performance.

                       Android 4.4                  Android 6.0.1
Implementation         Setup time    Runtime        Setup time    Runtime
Java Threaded          57 ± 7 ms     142 ± 13 ms    25 ± 4 ms     110 ± 14 ms
C++ Threaded           55 ± 8 ms     106 ± 13 ms    29 ± 5 ms     93 ± 21 ms
C++ OpenCV             24 ± 4 ms     24 ± 3 ms      26 ± 4 ms     26 ± 4 ms
Relaxed RenderScript   52 ± 12 ms    94 ± 11 ms     33 ± 4 ms     63 ± 7 ms
RenderScript           48 ± 6 ms     95 ± 8 ms      32 ± 3 ms     63 ± 7 ms

Table 5.6: Run times for applying thresholding to a 1920 × 1080-image

Chapter 6

Discussion

In this chapter we discuss the experiments conducted and their results.

6.1 Color space conversion

Real time image processing, such as real time color space conversion, requires high performance. Each frame captured by the camera must be processed in at most roughly 33 ms in order to reach 30 FPS (frames per second). If any real time image processing task takes longer than 33 ms, the user will start to notice stuttering. Table 5.1 shows the difference in run time when applying a formula for converting YUV-frames to RGB.

The results for both Android versions show a significant difference between some of the implementations. In Table 5.1, using Android 4.4, the average run time of the Java implementation is 146% higher than that of the C++ implementation. Furthermore, the OpenCV implementation shows an average run time performance increase of 16.2% compared to our C++ implementation. Our RenderScript implementation performed worse than the C++ implementations and the RenderScript intrinsics.

The RenderScript intrinsic on Android 4.4 performed 60.5% better than our RenderScript implementation. However, on Android 6.0.1 the difference was only 35.5%, although this was measured on a larger frame size. The main reason for this is likely the improvements made to the RenderScript engine between Android 4.4 and Android 6.0.1, together with an increase in processor speed.


The RenderScript intrinsics have likely been fine-tuned by hand by RenderScript engine developers, explaining their performance. For certain architectures, the intrinsics implementation in the Android source code¹ is heavily optimized in assembly code. The GPU version of this code is proprietary and developed by the vendors.

6.2 Blurring

Applying a filter to an image is done in real time less often than color space conversion. Many applications exist today that allow applying filters to an image from the smartphone’s camera roll. It is therefore not as crucial that these filters run fast enough to achieve 30 FPS, but performance is still important so as not to impede the user experience.

Note that many of the results presented earlier in this report were run times of algorithms that were run on an image smaller than what is normally captured by a modern smartphone camera. Table 5.4 shows the run time comparisons of the algorithms run on a larger image. The differences between the implementations grow as the image size grows, which means that a Java implementation might not be suitable when conducting image processing on images captured by a modern smartphone camera.

However, resorting to RenderScript implementations might not always be necessary. As can be seen in the tables in the previous chapter, the C++ OpenCV implementation can be considered a strong competitor to the RenderScript implementation. This means that an optimized implementation in a native language can be as fast as, or faster than, the RenderScript code. If the setup time is taken into account, the RenderScript and C++ OpenCV run times are often very close.
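To illustrate the effect of including setup time, the following sketch adds the average setup time to the average run time for the box filter on a 1920 × 1080 image on Android 6.0.1, using the values from Table A.7. It assumes that setup and run time are simply additive; the dictionary layout and function name are illustrative only.

```python
# Average (setup_ms, runtime_ms) pairs copied from Table A.7
# (box filter, 1920 x 1080 image, Android 6.0.1).
BOX_FILTER_1080P = {
    "C++ OpenCV": (13, 236),
    "RenderScript": (55, 324),
    "Relaxed RenderScript": (68, 151),
}


def total_times(measurements):
    """Return {implementation: setup + runtime} in milliseconds."""
    return {name: setup + run for name, (setup, run) in measurements.items()}


if __name__ == "__main__":
    totals = total_times(BOX_FILTER_1080P)
    # Print the implementations from fastest to slowest total time.
    for name, total in sorted(totals.items(), key=lambda item: item[1]):
        print(f"{name:22s} {total:4d} ms")
```

With these figures the totals are 219 ms for Relaxed RenderScript and 249 ms for OpenCV, a gap of 30 ms, compared with a gap of 85 ms when only the run time is counted.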

However, the results differ considerably depending on the precision required in the RenderScript computation scripts. In Table 5.4, the time taken for the Relaxed RenderScript implementation was 47.2% of the average runtime of the RenderScript implementation on Android 6.0.1.
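The 47.2% figure can be reproduced from the averages listed for the Gaussian filter on a 1920 × 1080 image in Table A.19 (168 ms for Relaxed RenderScript and 356 ms for RenderScript). That these are the same measurements as in Table 5.4 is an assumption based on the matching percentage. A small sketch with an illustrative function name:

```python
def runtime_ratio_percent(candidate_ms: float, baseline_ms: float) -> float:
    """Express the candidate run time as a percentage of the baseline run time."""
    if baseline_ms <= 0:
        raise ValueError("baseline must be positive")
    return 100.0 * candidate_ms / baseline_ms


if __name__ == "__main__":
    # Averages taken from Table A.19 (Android 6.0.1).
    relaxed_ms, full_precision_ms = 168, 356
    print(f"{runtime_ratio_percent(relaxed_ms, full_precision_ms):.1f}%")
```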

The other blurring filters show similar trends. The outlier is the C++ OpenCV implementation on Android 4.4 when applying median filtering to an image. Its performance is significantly better on Android 6.0.1.

¹ https://android.googlesource.com/platform/frameworks/rs/+/master/cpu_ref/rsCpuIntrinsics_neon_YuvToRGB.S

The RenderScript intrinsics provided by Google are by far the best option regarding run time. However, Google only provides intrinsics for 11 common tasks [28]. Given that many applications today conduct more sophisticated image processing, the intrinsics might not give developers what they want. If a cross-platform solution must be developed, it is easy to argue for a native language such as C or C++, because RenderScript is only available for the Android platform.

Given that the RenderScript intrinsics outperform our RenderScript implementations by a large margin, it is worth considering that our implementation might lack optimizations used in the intrinsic implementation. However, as presented in Figure 3.1, the GPU-utilizing implementations do not always outperform their CPU counterparts, meaning the algorithm itself must be considered before deciding whether RenderScript is worth using.

The Gaussian filtering reference implementation provided by Google contains highly optimized assembly code for different architectures². However, the GPU version of this code is proprietary and developed by the vendors. The RenderScript intrinsics always outperformed their counterparts, meaning that the GPU vendors’ implementations are favorable.

6.3 Grayscaling and thresholding

The results of the conversion from color to grayscale can be seen in Table 5.5. The best implementation was the C++ OpenCV implementation, followed by the Relaxed RenderScript implementation. However, the average setup time for the RenderScript implementation is 26 ms higher than the average setup time for the OpenCV version. Without counting the setup, the RenderScript implementation performs nearly as well as the C++ OpenCV implementation. The runtime of this algorithm is small compared to the blurring, where we see runtimes of > 100 ms. The reason that the RenderScript implementation is lacking in run time performance here may be that it is too costly to pass data to the buffers needed in the RenderScript runtime. Note that the grayscaling was only conducted on images of resolution 1920 × 1080, and the performance difference would likely be smaller on smaller images, as we have seen in the blurring and color space conversion results. The thresholding results look very much like the grayscaling results and the same trends can be found in Table 5.6.

² https://android.googlesource.com/platform/frameworks/rs/+/master/cpu_ref/rsCpuIntrinsics_advsimd_Blur.S

Our Java and C++ implementations do not differ much in run time performance when doing grayscaling and thresholding. The calculations for these algorithms are small compared to the blurring, meaning that Java can perhaps be considered a viable candidate for very simple image processing tasks.
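To make the small amount of per-pixel work concrete, the sketch below implements grayscaling and thresholding on a list of RGB tuples. It assumes the common BT.601 luma weights (0.299, 0.587, 0.114) and a fixed threshold of 128; the exact coefficients and threshold used in the thesis implementations are not stated in this chapter, so both should be read as placeholders.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]

# Assumed BT.601 luma weights; placeholders for the thesis coefficients.
R_WEIGHT, G_WEIGHT, B_WEIGHT = 0.299, 0.587, 0.114


def to_gray(pixels: List[Pixel]) -> List[int]:
    """Weighted sum per pixel: three multiplications and two additions."""
    return [
        min(255, int(round(R_WEIGHT * r + G_WEIGHT * g + B_WEIGHT * b)))
        for r, g, b in pixels
    ]


def threshold(gray: List[int], limit: int = 128) -> List[int]:
    """One comparison per pixel: white at or above the limit, black below."""
    return [255 if value >= limit else 0 for value in gray]


if __name__ == "__main__":
    image = [(0, 0, 0), (255, 255, 255), (255, 0, 0)]
    gray = to_gray(image)
    print(gray, threshold(gray))
```

By comparison, a naive k × k box or Gaussian filter reads k² neighbouring pixels for every output pixel, which is consistent with the larger differences seen in the blurring results.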

6.4 Overall Performance

The single-threaded implementations of C++ and Java were the worst candidates, and their multi-threaded counterparts achieved significantly higher runtime performance. However, our C++ implementation performed consistently worse than the OpenCV implementation, with a p-value as low as 0.0003.

The average runtime of the OpenCV C++ implementation, over all the algorithms, proved to be better than both the RenderScript and the Relaxed RenderScript implementations as well, with p-values of 0.0036 and 0.0114, respectively.

[Figure: plot with Relaxed RenderScript on the horizontal axis and OpenCV on the vertical axis, both axes ranging from 0 to 300]

Figure 6.1: Plot showing run times of Relaxed RenderScript and OpenCV. Blue circles indicate a 640 × 480 resolution on the image processed, red indicates a resolution of 500 × 333, and green indicates a resolution of 1920 × 1080.

Figure 6.1 shows the average runtimes of OpenCV and Relaxed RenderScript on Android 6.0.1. OpenCV performed better than the Relaxed RenderScript version in the majority of cases. However, this was the performance of the implementations over all the algorithms and all resolutions. If we compare the runtimes of the Box filter and Gaussian filter on a 1920 × 1080 image, the Relaxed RenderScript version performs better, with p-values < 0.05. OpenCV performs better than the Relaxed RenderScript implementations on the smaller images, likely due to reduced setup time. In addition to this, OpenCV performs better in the thresholding and grayscaling tasks on large images as well.

Despite the relatively poor performance of our C++ implementation, it cannot be inferred from the data with 95% confidence that it is worse than the RenderScript or Relaxed RenderScript implementations in the average case. However, in the case of blurring, thresholding and grayscaling 1920 × 1080-images, the Relaxed RenderScript implementation outperforms our Threaded C++ implementation, yielding p-values < 0.05.

The RenderScript intrinsics outperformed all other implementations in every algorithm for which they were available. However, with intrinsics only being available for the Gaussian blurring and color space conversion, there are not enough data points to say with 95% confidence that they are better.

The RenderScript and Relaxed RenderScript implementations did not show any significant difference when comparing their runtimes pairwise. However, as in the case of Relaxed RenderScript and OpenCV, the performance is significantly different in certain cases. In the Gaussian and Box blurring, the Relaxed RenderScript implementation was significantly faster than RenderScript. In the other cases, the difference is insignificant. This is likely due to the fact that the Gaussian and Box blurring contain more floating point operations than the other algorithms implemented.

For clarity, the outcomes of the statistical significance tests are presented in Table 6.1. The results are calculated as described in Chapter 4, and are thus calculated from the run times of all algorithms on all image resolutions.

                Java    Java Threaded   C++     C++ Threaded   C++ OpenCV   RS      Relaxed RS
Java            -       JT              C++     C++T           C++O         RS      RSR
Java Threaded   JT      -               -       C++T           C++O         -       -
C++             C++     -               -       -              C++O         -       -
C++ Threaded    C++T    C++T            -       -              C++O         -       -
C++ OpenCV      C++O    C++O            C++O    C++O           -            C++O    C++O
RS              RS      -               -       -              C++O         -       -
Relaxed RS      RSR     -               -       -              C++O         -       -

Table 6.1: Statistical significance. Each cell names the implementation that performed significantly better in the pairwise comparison; a dash indicates no significant difference. The names have been abbreviated as follows: Java Threaded: JT, C++ Threaded: C++T, C++ OpenCV: C++O, RenderScript Relaxed: RSR, RenderScript: RS

The Intrinsic RenderScript implementations did not show any statistical significance, due to the few test cases available, and have therefore not been included in Table 6.1. Note again that certain implementations proved to be better than others in certain cases, even though this is not visible in the table. The Relaxed RenderScript implementation outperformed our Threaded C++ implementation in many tasks on higher resolution images, for example. The table shows that the third-party C++ implementation found in OpenCV performed best on average.
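The bibliography cites the Wilcoxon signed-rank test [37, 38], so the pairwise comparisons behind Table 6.1 were presumably of that kind; the actual procedure is the one described in Chapter 4. The following is only a minimal sketch of an exact two-sided signed-rank test for small samples, written in Python with illustrative function names. It drops zero differences and uses average ranks for ties, which may differ from the procedure used in the thesis.

```python
from itertools import product
from typing import List, Sequence, Tuple


def signed_ranks(a: Sequence[float], b: Sequence[float]) -> List[Tuple[float, int]]:
    """Return (rank, sign) for each non-zero paired difference a[i] - b[i]."""
    diffs = [x - y for x, y in zip(a, b) if x != y]
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    i = 0
    while i < len(order):
        j = i
        # Group tied absolute differences and give them their average rank.
        while j + 1 < len(order) and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    return [(ranks[i], 1 if diffs[i] > 0 else -1) for i in range(len(diffs))]


def wilcoxon_exact(a: Sequence[float], b: Sequence[float]) -> Tuple[float, float]:
    """Return (W, exact two-sided p-value); W is the smaller signed-rank sum."""
    pairs = signed_ranks(a, b)
    ranks = [rank for rank, _ in pairs]
    w_plus = sum(rank for rank, sign in pairs if sign > 0)
    w_minus = sum(rank for rank, sign in pairs if sign < 0)
    w = min(w_plus, w_minus)
    total = sum(ranks)
    hits = 0
    count = 0
    # Enumerate all 2^n sign assignments (feasible only for small n).
    for signs in product((0, 1), repeat=len(ranks)):
        plus = sum(rank for rank, s in zip(ranks, signs) if s)
        if min(plus, total - plus) <= w:
            hits += 1
        count += 1
    return w, hits / count


if __name__ == "__main__":
    # Average run times (ms) on Android 6.0.1 from Tables A.5-A.7,
    # A.11-A.13 and A.17-A.19: box, median and Gaussian filter,
    # each at the three image resolutions.
    cpp_threaded = [8, 43, 311, 10, 49, 717, 4, 48, 325]
    opencv = [1, 23, 236, 0, 12, 201, 1, 31, 285]
    print(wilcoxon_exact(cpp_threaded, opencv))
```

In this subset all nine differences have the same sign, so the exact two-sided p-value is 2/2⁹ ≈ 0.0039. This is not a reproduction of the p-values reported above, which are based on all algorithms and image resolutions.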

6.5 Threats to validity

6.5.1 Choice of algorithms

The algorithms implemented as part of this project were selected to be representative of common image processing tasks. The algorithms chosen have been implemented in open-source projects and were therefore deemed suitable for testing the performance of the available execution platforms. Even though the algorithms implemented have different properties that could affect their run time performance, similar trends can be found in many of the results. However, there might exist other image processing algorithms with different properties where another implementation language might be favorable.

6.5.2 High variance

The variance of the collected run times was, in some measurements, as high as 20% of the average run time. The high variances could most often be seen where the average run time of the algorithm was low, i.e., < 100 ms. This could be the result of the Android system performing background tasks, leaving less processing power for the application. However, despite the high variance in some cases, trends are still clearly visible in the results.

6.5.3 Devices

Two devices were used to test the runtime performance of the three execution platforms, while the number of different Android devices exceeds 20000. The devices used in this project use similar chipsets and GPUs, meaning that the same results might not be replicable on processing units from other vendors. However, the Samsung Galaxy models are the most popular series of Android devices, and Qualcomm GPUs are the second most commonly seen GPUs [19]. The devices used in this project can therefore represent a wide variety of commonly used devices.

6.5.4 Image sizes

The image processing algorithms were applied to images with sizes ranging from 100 × 67 to 1920 × 1080. The smallest images represent thumbnails, i.e., reduced-size versions of images used to help recognition, whereas the largest image represents pictures taken with a smartphone camera. High-end devices in the current generation of smartphones can take pictures with a higher resolution than 1920 × 1080, and these were not taken into account. The resolutions used in this project were chosen to represent an average Android device. It may not be possible to extrapolate the trends visible in the results of this project to determine the performance of the algorithms on larger images.

6.5.5 Optimization

The OpenCV C++ implementation performed better than our own multi-threaded C++ implementation. The OpenCV implementation has, however, been optimized by many contributors over a long period of time. The implementations tested in this project can therefore likely be optimized further. However, our Java, C++ and RenderScript implementations contained identical algorithms and did not use any language-specific features or optimizations, and can therefore be used as a benchmark of language performance.

6.6 Future Research

Any future research conducted will likely use a better JIT compiler. Dynamic compilation allows for optimizations that are platform specific, and Java might therefore be able to surpass the performance of native languages, since those are statically compiled with no access to runtime information. Reinholtz [23] claims that Java performance will eventually surpass that of C++, and therefore comparing Java with native languages will be interesting in the future as well.

Increasing the performance of applications in the Android system is important for battery life. Developers have to consider the user’s battery when conducting computationally intensive tasks in an application. It could therefore be interesting to examine the RenderScript runtime and its effect on battery life. The amount of extra memory required to utilize the RenderScript runtime could be interesting to measure as well.

The native language chosen in this thesis was C++ because of the availability of the Android NDK. However, developers are free to use other native languages for the Android platform as well. The programming language Go [34] has support for mobile tools in versions above 1.5, allowing developers to generate bindings to use existing Go code in an Android project or write entire applications in Go. Go is a statically compiled language with automatic memory management, meaning that it, much like Java, trades performance for safety. However, it does not run on the JVM and can therefore be a viable competitor to other native languages on the Android system. Apple also states that it is possible to use the Swift [32] programming language on Android devices, which could be a contender. Swift is a programming language used for building iOS applications, meaning that using Swift on Android enables sharing code across platforms. Both Swift and Go only support ARM architectures, however, meaning that not all Android devices are supported.

A more practical approach to GPU acceleration than RenderScript can be to use a cross-platform framework such as OpenCL. OpenCL currently has limited support for the Android platform, but the code can be reused for, e.g., iOS devices. Many large applications are developed for multiple platforms, meaning that other GPU acceleration frameworks should be evaluated as well.

Chapter 7

Conclusion

Recall the research question posed in the introduction:

Can performance increases in run time warrant the usage of C++ or GPU acceleration frameworks over Java when writing image processing algorithms on Android?

All tests showed that our C++ implementation performed significantly better than the corresponding Java implementations. The RenderScript implementations were significantly faster than Java on large images, but did not perform better in the average case. As such, Java cannot be considered a viable option when conducting advanced image processing on large images on the current generation of smartphones. However, the difference in run time performance between Java and C++ is minor when the calculations are very simple (e.g., grayscaling) or the images are very small.

Our RenderScript implementations with full floating point precision did not turn out to perform better than their C++ counterparts. If compliance with the IEEE Standard for Floating-Point Arithmetic is required, C++ is the recommended implementation language. If there is no strict requirement on floating point precision, RenderScript can outperform the C++ implementation, although our results did not show a statistically significant difference between the two in the average case. However, when the algorithms were applied to larger images, the RenderScript implementation with low floating point arithmetic precision proved to be better than the C++ implementation.

Bibliography

[1] https://streamcomputing.eu/blog/2013-08-01/google-blocked--on-android-4-3/.
[2] ART GC overview. https://source.android.com/devices/tech/dalvik/gc-debug.html. Accessed on 2017-04-01.
[3] Bill Buzbee and Ben Cheng. A JIT Compiler for Android’s Dalvik VM. http://www.android-app-developer.co.uk/android-app-development-docs/android-jit-compiler-androids-dalvik-vm.pdf. Accessed on 2017-02-07.
[4] Ananya Bhattacharya. Android just hit a record 88% market share of all smartphones. https://source.android.com/devices/tech/dalvik/. Accessed on 2017-02-07.
[5] CMake. https://cmake.org/. Accessed on 2017-03-21.
[6] Evolution of Renderscript Performance. https://android-developers.googleblog.com/2013/01/evolution-of-renderscript-performance.html. Accessed on 2017-03-03.
[7] Gaussian Kernel Calculator. http://dev.theomader.com/gaussian-kernel-calculator/. Accessed on 2017-03-23.
[8] Getting Started with the NDK. https://developer.android.com/ndk/guides/index.html. Accessed on 2017-03-20.
[9] Luca Gherardi, Davide Brugali, and Daniele Comotti. “A java vs. c++ performance evaluation: a 3d modeling benchmark”. In: International Conference on Simulation, Modeling, and Programming for Autonomous Robots. Springer. 2012, pp. 161–172.
[10] Gradle Build Tool. https://gradle.org/. Accessed on 2017-03-20.
[11] Nassim A Halli, Henri-Pierre Charles, and Jean-François Mehaut. “Performance comparison between Java and JNI for optimal implementation of computational micro-kernels”. In: arXiv preprint arXiv:1412.6765 (2014).
[12] How ART works. https://source.android.com/devices/tech/dalvik/configure.html#how_art_works. Accessed on 2017-03-02.
[13] Robert Hundt. “Loop recognition in C++/Java/Go/Scala”. In: Proceedings of Scala Days 2011 (2011), p. 38.
[14] IDC. Smartphone OS Market Share, 2016 Q3. http://www.idc.com/promo/smartphone-market-share/os. Accessed on 2017-02-03.
[15] IEEE SA - 754-2008 - IEEE Standard for Floating-Point Arithmetic. https://standards.ieee.org/findstds/standard/754-2008.html. Accessed on 2017-05-07.
[16] SeongKi Kim and Seok-Kyoo Kim. “Comparison of OpenCL and RenderScript for mobile devices”. In: Multimedia Tools and Applications 75.22 (2016), pp. 14161–14179.
[17] Tobias Konradsson. ART and Dalvik performance compared. 2015.
[18] Cheng-Min Lin et al. “Benchmark Dalvik and native code for Android system”. In: Innovations in Bio-inspired Computing and Applications (IBICA), 2011 Second International Conference on. IEEE. 2011, pp. 320–323.
[19] Mobile Hardware Statistics. http://hwstats.unity3d.com/mobile/index.html.
[20] OpenCV Library. http://opencv.org. Accessed on 2017-03-23.
[21] Charles Poynton. Digital video and HD: Algorithms and Interfaces. Elsevier, 2012.
[22] Recommendation ITU-R BT.601-5: Studio Encoding Parameters of Digital Television for Standard 4:3 and wide-screen 16:9 Aspect Ratios. https://www.itu.int/dms_pubrec/itu-r/rec/bt/R-REC-BT.601-5-199510-S!!PDF-E.pdf.
[23] Kirk Reinholtz. “Java will be faster than C++”. In: ACM Sigplan Notices 35.2 (2000), pp. 25–28.
[24] RenderScript Intrinsics. https://android-developers.googleblog.com/2013/08/renderscript-intrinsics.html. Accessed on 2017-03-18.
[25] James A Ross et al. “A case study of OpenCL on an Android mobile GPU”. In: High Performance Extreme Computing Conference (HPEC), 2014 IEEE. IEEE. 2014, pp. 1–6.
[26] R Jason Sams. Evolution of Renderscript Performance. https://android-developers.googleblog.com/2013/01/evolution-of-renderscript-performance.html. Accessed on 2017-02-07.
[27] R Jason Sams. Levels in Renderscript. https://android-developers.googleblog.com/2011/03/renderscript.html. Accessed on 2017-02-07.
[28] ScriptIntrinsic. https://developer.android.com/reference/android/renderscript/ScriptIntrinsic.html. Accessed on 2017-04-20.
[29] Linda Shapiro and George C Stockman. Computer Vision. Prentice Hall, 2001.
[30] Ki-Cheol Son and Jong-Yeol Lee. “The method of Android application speed up by using NDK”. In: Awareness Science and Technology (iCAST), 2011 3rd International Conference on. IEEE. 2011, pp. 382–385.
[31] Studio encoding parameters of digital television for standard 4:3 and wide-screen 16:9 aspect ratios. http://www.itu.int/dms_pubrec/itu-r/rec/bt/R-REC-BT.601-7-201103-I!!PDF-E.pdf.
[32] Swift. https://swift.org/. Accessed on 2017-04-20.
[33] SystemClock. https://developer.android.com/reference/android/os/SystemClock.html.
[34] The Go Programming Language. https://golang.org/. Accessed on 2017-04-20.
[35] The open standard for parallel programming of heterogeneous systems. https://www.khronos.org/opencl/. Accessed on 2017-02-25.
[36] Guohui Wang et al. “Accelerating computer vision algorithms using OpenCL framework on the mobile GPU-a case study”. In: Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on. IEEE. 2013, pp. 2629–2633.
[37] Wilcoxon Signed-Rank Test. http://vassarstats.net/wilcoxon.html.
[38] Frank Wilcoxon. “Individual comparisons by ranking methods”. In: Biometrics bulletin 1.6 (1945), pp. 80–83.

Appendix A

Tables

A.1 Blurring

Android 4.4, average run time (ms)

                        Median               Box                  Gauss
Implementation          low    med    high   low    med    high   low    med    high
Java                    13     356    3092   9      269    3092   10     277    3067
Java Threaded           14     178    869    10     128    869    10     130    877
C++                     5      157    1560   3      106    945    2      111    936
C++ Threaded            7      99     646    5      75     353    4      76     392
C++ OpenCV              8      212    3176   0      17     201    0      30     240
RenderScript            -      -      -      6      75     402    9      76     420
RenderScript Intrinsic  -      -      -      -      -      -      1      43     124
Relaxed RenderScript    -      -      -      6      59     168    6      58     166

Android 6.0.1, average run time (ms)

                        Median               Box                  Gauss
Implementation          low    med    high   low    med    high   low    med    high
Java                    12     235    2983   16     388    4907   17     395    4959
Java Threaded           9      90     1230   18     88     1084   25     84     1115
C++                     7      154    1835   3      77     939    2      75     939
C++ Threaded            10     49     717    8      43     311    4      48     325
C++ OpenCV              0      12     201    1      23     236    1      31     285
RenderScript            -      -      -      15     33     324    16     36     356
RenderScript Intrinsic  -      -      -      -      -      -      2      8      49
Relaxed RenderScript    -      -      -      19     29     151    21     29     168

Table A.1: Average run times for image smoothing operations. Resolution low indicates a 100 × 67-image, medium indicates a 500 × 333-image and high resolution indicates a 1920 × 1080-image.

A.1.1 Box filter

Android 4.4

Table A.2 shows the run times of the box filter implementations on a 100 × 67 image, running Android 4.4. The run times were captured on a Sony Xperia Z1.

Algorithm               Setup time (ms)  Avg (ms)     Min (ms)  Max (ms)
Java                    0 ± 0            9 ± 0.38     9         10
Java Threaded           0 ± 0            10 ± 1.59    7         13
C++                     0 ± 1            3 ± 1.02     2         6
C++ Threaded            0 ± 0            5 ± 1.86     2         11
C++ OpenCV              0 ± 0            0 ± 0        0         0
RenderScript            3 ± 2            6 ± 0.82     5         8
Relaxed RenderScript    4 ± 2            6 ± 1.60     4         10

Table A.2: Run times for applying box filter to a 100 × 67 image, on Android 4.4

Table A.3 shows the run times of the box filter implementations on a 500×333 image, running Android 4.4. The run times were captured on a Sony Xperia Z1.

Algorithm               Setup time (ms)  Avg (ms)     Min (ms)  Max (ms)
Java                    23 ± 3           269 ± 17     261       320
Java Threaded           25 ± 3           128 ± 11     109       155
C++                     23 ± 4           106 ± 23     90        156
C++ Threaded            28 ± 6           75 ± 9       63        106
C++ OpenCV              0 ± 0            17 ± 4       15        34
RenderScript            29 ± 5           75 ± 9       61        95
Relaxed RenderScript    43 ± 9           59 ± 14      37        79

Table A.3: Run times for applying box filter to a 500 × 333 image, on Android 4.4

Table A.4 shows the run times of the box filter implementations on a 1920 × 1080 image, running Android 4.4. The run times were captured on a Sony Xperia Z1.

Algorithm               Setup time (ms)  Avg (ms)     Min (ms)  Max (ms)
Java                    54 ± 6           3092 ± 46    3022      3176
Java Threaded           54 ± 4           869 ± 21     823       902
C++                     54 ± 4           945 ± 52     901       1134
C++ Threaded            50 ± 6           353 ± 32     291       416
C++ OpenCV              16 ± 1           201 ± 15     195       251
RenderScript            52 ± 12          402 ± 9      382       422
Relaxed RenderScript    54 ± 12          168 ± 25     135       223

Table A.4: Run times for applying box filter to a 1920 × 1080 image, on Android 4.4

Android 6.0.1

Table A.5 shows the run times of the box filter implementations on a 100 × 67 image, running Android 6.0.1. The run times were captured on a Samsung Galaxy S5.

Algorithm               Setup time (ms)  Avg (ms)     Min (ms)  Max (ms)
Java                    0 ± 1            16 ± 6.02    12        37
Java Threaded           0 ± 0            18 ± 5.17    12        37
C++                     0 ± 0            3 ± 1.14     2         4
C++ Threaded            0 ± 4            8 ± 1.87     6         14
C++ OpenCV              0 ± 0            1 ± 0.80     0         4
RenderScript            12 ± 2           15 ± 8.39    10        50
Relaxed RenderScript    13 ± 1           19 ± 4.12    12        30

Table A.5: Run times for applying box filter to a 100 × 67 image, on Android 6.0.1

Table A.6 shows the run times of the box filter implementations on a 500 × 333 image, running Android 6.0.1. The run times were captured on a Samsung Galaxy S5.

Algorithm               Setup time (ms)  Avg (ms)     Min (ms)  Max (ms)
Java                    1 ± 2            388 ± 32     327       493
Java Threaded           1 ± 5            88 ± 11      69        113
C++                     2 ± 1            77 ± 25      62        177
C++ Threaded            1 ± 2            43 ± 10      30        79
C++ OpenCV              1 ± 0            23 ± 13      14        66
RenderScript            12 ± 6           33 ± 5       27        42
Relaxed RenderScript    18 ± 9           29 ± 8       22        57

Table A.6: Run times for applying box filter to a 500 × 333 image, on Android 6.0.1

Table A.7 shows the run times of the box filter implementations on a 1920 × 1080 image, running Android 6.0.1. The run times were captured on a Samsung Galaxy S5.

Algorithm               Setup time (ms)  Avg (ms)     Min (ms)  Max (ms)
Java                    35 ± 6           4907 ± 124   4693      5164
Java Threaded           34 ± 8           1084 ± 60    994       1259
C++                     33 ± 2           939 ± 25     896       979
C++ Threaded            32 ± 7           311 ± 35     261       402
C++ OpenCV              13 ± 4           236 ± 38     171       320
RenderScript            55 ± 12          324 ± 20     289       372
Relaxed RenderScript    68 ± 18          151 ± 30     113       212

Table A.7: Run times for applying box filter to a 1920 × 1080 image, on Android 6.0.1

A.1.2 Median filter

Android 4.4

Table A.8 shows the run times of the median filter implementations on a 100 × 67 image, running Android 4.4. The run times were captured on a Sony Xperia Z1.

Algorithm               Setup time (ms)  Avg (ms)     Min (ms)  Max (ms)
Java                    0 ± 1            13 ± 0.18    13        14
Java Threaded           0 ± 0            14 ± 2.44    10        21
C++                     0 ± 0            5 ± 0        5         5
C++ Threaded            0 ± 1            7 ± 1.91     4         12
C++ OpenCV              0 ± 0            8 ± 0        8         8

Table A.8: Run times for applying median filter to a 100 × 67 image, on Android 4.4

Table A.9 shows the run times of the median filter implementations on a 500 × 333 image, running Android 4.4. The run times were captured on a Sony Xperia Z1.

Algorithm               Setup time (ms)  Avg (ms)     Min (ms)  Max (ms)
Java                    23 ± 4           356 ± 23     341       414
Java Threaded           27 ± 5           178 ± 11     157       203
C++                     23 ± 2           157 ± 20     147       209
C++ Threaded            24 ± 3           99 ± 10      85        116
C++ OpenCV              0 ± 1            212 ± 1      211       212

Table A.9: Run times for applying median filter to a 500 × 333 image, on Android 4.4

Table A.10 shows the run times of the median filter implementations on a 1920 × 1080 image, running Android 4.4. The run times were captured on a Sony Xperia Z1.

Algorithm               Setup time (ms)  Avg (ms)     Min (ms)  Max (ms)
Java                    64 ± 12          4382 ± 381   3923      4995
Java Threaded           87 ± 20          1903 ± 253   1555      2328
C++                     59 ± 6           1560 ± 33    1537      1684
C++ Threaded            73 ± 12          646 ± 36     582       729
C++ OpenCV              17 ± 1           3176 ± 199   2646      3369

Table A.10: Run times for applying median filter to a 1920 × 1080 image, on Android 4.4

Android 6.0.1

Table A.11 shows the run times of the median filter implementations on a 100 × 67 image, running Android 6.0.1. The run times were captured on a Samsung Galaxy S5.

Algorithm               Setup time (ms)  Avg (ms)     Min (ms)  Max (ms)
Java                    0 ± 0            12 ± 4.63    8         24
Java Threaded           0 ± 0            9 ± 3.19     5         18
C++                     0 ± 0            7 ± 3.95     5         25
C++ Threaded            0 ± 0            10 ± 5.49    4         31
C++ OpenCV              0 ± 1            0 ± 0.87     0         4

Table A.11: Run times for applying median filter to a 100 × 67 image, on Android 6.0.1

Table A.12 shows the run times of the median filter implementations on a 500 × 333 image, running Android 6.0.1. The run times were captured on a Samsung Galaxy S5.

Algorithm               Setup time (ms)  Avg (ms)     Min (ms)  Max (ms)
Java                    1 ± 1            235 ± 17     195       259
Java Threaded           1 ± 0            90 ± 14      69        120
C++                     1 ± 1            154 ± 18     129       189
C++ Threaded            1 ± 0            49 ± 11      37        86
C++ OpenCV              1 ± 0            12 ± 5       10        18

Table A.12: Run times for applying median filter to a 500 × 333 image, on Android 6.0.1

Table A.13 shows the run times of the median filter implementations on a 1920 × 1080 image, running Android 6.0.1. The run times were captured on a Samsung Galaxy S5.

Algorithm               Setup time (ms)  Avg (ms)     Min (ms)  Max (ms)
Java                    34 ± 4           2983 ± 88    2851      3180
Java Threaded           48 ± 9           1230 ± 37    1145      1308
C++                     31 ± 3           1835 ± 33    1779      1909
C++ Threaded            44 ± 4           717 ± 37     650       842
C++ OpenCV              23 ± 1           201 ± 62     122       350

Table A.13: Run times for applying median filter to a 1920 × 1080 image, on Android 6.0.1

A.1.3 Gaussian filter

Android 4.4

Table A.14 shows the run times of the Gaussian filter implementations on a 100 × 67 image, running Android 4.4. The run times were captured on a Sony Xperia Z1.

Algorithm               Setup time (ms)  Avg (ms)     Min (ms)  Max (ms)
Java                    0 ± 0            10 ± 0.31    9         10
Java Threaded           0 ± 1            8 ± 1.38     5         11
C++                     0 ± 1            2 ± 0        2         2
C++ Threaded            0 ± 0            4 ± 0.72     3         5
C++ OpenCV              0 ± 0            0 ± 0        0         0
RenderScript            5 ± 1            9 ± 2.36     6         17
RenderScript Intrinsic  0 ± 1            1 ± 0.30     1         2
Relaxed RenderScript    4 ± 2            6 ± 1.45     5         10

Table A.14: Run times for applying Gaussian filter to a 100 × 67 image, on Android 4.4

Table A.15 shows the run times of the Gaussian filter implementations on a 500 × 333 image, running Android 4.4. The run times were captured on a Sony Xperia Z1.

Algorithm               Setup time (ms)  Avg (ms)     Min (ms)  Max (ms)
Java                    23 ± 4           277 ± 25     262       337
Java Threaded           26 ± 3           130 ± 10     113       151
C++                     24 ± 1           111 ± 25     90        168
C++ Threaded            25 ± 2           76 ± 9       60        98
C++ OpenCV              1 ± 1            30 ± 6       28        58
RenderScript            30 ± 3           76 ± 6       64        89
RenderScript Intrinsic  32 ± 4           43 ± 6       25        54
Relaxed RenderScript    44 ± 9           58 ± 16      34        102

Table A.15: Run times for applying Gaussian filter to a 500 × 333 image, on Android 4.4

Table A.16 shows the run times of the Gaussian filter implementations on a 1920 × 1080 image, running Android 4.4. The run times were captured on a Sony Xperia Z1.

Algorithm               Setup time (ms)  Avg (ms)     Min (ms)  Max (ms)
Java                    52 ± 8           3067 ± 41    3044      3224
Java Threaded           63 ± 12          877 ± 24     847       946
C++                     57 ± 14          936 ± 56     903       1119
C++ Threaded            66 ± 17          392 ± 21     371       446
C++ OpenCV              16 ± 1           240 ± 19     229       289
RenderScript            52 ± 8           420 ± 8      406       436
RenderScript Intrinsic  51 ± 12          124 ± 7      111       138
Relaxed RenderScript    57 ± 3           166 ± 25     136       257

Table A.16: Run times for applying Gaussian filter to a 1920 × 1080 image, on Android 4.4

Android 6.0.1

Table A.17 shows the run times of the Gaussian filter implementations on a 100 × 67 image, running Android 6.0.1. The run times were captured on a Samsung Galaxy S5.

Algorithm               Setup time (ms)  Avg (ms)     Min (ms)  Max (ms)
Java                    0 ± 1            17 ± 7.18    12        43
Java Threaded           0 ± 2            25 ± 8.62    13        47
C++                     0 ± 1            2 ± 0.55     2         4
C++ Threaded            0 ± 0            4 ± 2.45     2         13
C++ OpenCV              0 ± 0            1 ± 1.88     0         8
RenderScript            12 ± 3           16 ± 6.48    11        38
RenderScript Intrinsic  1 ± 1            2 ± 1.67     1         9
Relaxed RenderScript    15 ± 1           21 ± 6.39    16        51

Table A.17: Run times for applying Gaussian filter to a 100 × 67 image, on Android 6.0.1

Table A.18 shows the run times of the Gaussian filter implementations on a 500 × 333 image, running Android 6.0.1. The run times were captured on a Samsung Galaxy S5.

Algorithm               Setup time (ms)  Avg (ms)     Min (ms)  Max (ms)
Java                    2 ± 1            395 ± 40     339       487
Java Threaded           1 ± 2            84 ± 11      69        117
C++                     1 ± 1            75 ± 15      61        109
C++ Threaded            2 ± 1            48 ± 10      34        77
C++ OpenCV              1 ± 0            31 ± 8       26        55
RenderScript            14 ± 6           36 ± 6       29        57
RenderScript Intrinsic  4 ± 1            8 ± 2        5         11
Relaxed RenderScript    17 ± 4           29 ± 4       23        40

Table A.18: Run times for applying Gaussian filter to a 500 × 333 image, on Android 6.0.1

Table A.19 shows the run times of the Gaussian filter implementations on a 1920 × 1080 image, running Android 6.0.1. The run times were captured on a Samsung Galaxy S5. APPENDIX A. TABLES 63

| Algorithm | Setup time (ms) | Avg (ms) | Min (ms) | Max (ms) |
|---|---|---|---|---|
| Java | 32 ± 4 | 4959 ± 116 | 4639 | 5198 |
| Java Threaded | 34 ± 5 | 1115 ± 39 | 1049 | 1187 |
| C++ | 33 ± 2 | 939 ± 35 | 891 | 1064 |
| C++ Threaded | 32 ± 6 | 325 ± 31 | 278 | 384 |
| C++ OpenCV | 23 ± 1 | 285 ± 44 | 210 | 372 |
| RenderScript | 46 ± 12 | 356 ± 23 | 304 | 387 |
| RenderScript Intrinsic | 23 ± 9 | 49 ± 8 | 33 | 70 |
| Relaxed RenderScript | 80 ± 24 | 168 ± 36 | 122 | 228 |

Table A.19: Run times for applying Gaussian filter to a 1920 × 1080 image, on Android 6.0.1
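The averages in Table A.19 are easier to compare as speedup factors over the single-threaded Java implementation. The following is a minimal sketch of that calculation, not part of the original measurements; the names `avg_ms` and `speedups` are hypothetical, and the choice of single-threaded Java as the baseline is an assumption made for illustration.

```python
# Average run times (ms) copied from Table A.19
# (Gaussian filter, 1920 x 1080 image, Android 6.0.1, Samsung Galaxy S5).
avg_ms = {
    "Java": 4959,
    "Java Threaded": 1115,
    "C++": 939,
    "C++ Threaded": 325,
    "C++ OpenCV": 285,
    "RenderScript": 356,
    "RenderScript Intrinsic": 49,
    "Relaxed RenderScript": 168,
}


def speedups(times, baseline="Java"):
    """Return the baseline's time divided by each implementation's time."""
    base = times[baseline]
    return {name: base / t for name, t in times.items()}


if __name__ == "__main__":
    # List implementations from slowest to fastest relative to the baseline.
    for name, factor in sorted(speedups(avg_ms).items(), key=lambda kv: kv[1]):
        print(f"{name:<24}{factor:6.1f}x")
```

Because the factors are ratios of averages, they ignore the spread reported in the ± column and the setup time, so they should be read as rough indicators only.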

A.2 Grayscaling

Table A.20 shows the average run and setup times for applying the grayscaling algorithm to a 500 × 333 image.

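| Implementation | Android 4.4 setup time (ms) | Android 4.4 run time (ms) | Android 6.0.1 setup time (ms) | Android 6.0.1 run time (ms) |
|---|---|---|---|---|
| Java Threaded | 0 ± 1 | 9 ± 2 | 4 ± 1 | 16 ± 4 |
| C++ Threaded | 0 ± 0 | 5 ± 1 | 3 ± 1 | 15 ± 3 |
| C++ OpenCV | 0 ± 1 | 1 ± 0 | 1 ± 0 | 2 ± 1 |
| Relaxed RenderScript | 25 ± 6 | 41 ± 9 | 19 ± 4 | 19 ± 3 |
| RenderScript | 29 ± 3 | 35 ± 7 | 20 ± 5 | 20 ± 7 |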

Table A.20: Run times for applying grayscaling to a 500 × 333 image
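Table A.20 lists setup and run times separately, so the cost per image depends on how often the setup has to be repeated. The sketch below assumes, purely for illustration, that setup is paid once and the run time once per image; the thesis tables do not state this, and the names `timings` and `per_image_cost` are hypothetical. The values are the Android 6.0.1 averages from Table A.20.

```python
# (setup, run) averages for Android 6.0.1, copied from Table A.20
# (grayscaling, 500 x 333 image).
timings = {
    "Java Threaded": (4, 16),
    "C++ Threaded": (3, 15),
    "C++ OpenCV": (1, 2),
    "Relaxed RenderScript": (19, 19),
    "RenderScript": (20, 20),
}


def per_image_cost(setup, run, n_images):
    """Average cost per image, ASSUMING setup is paid once for the whole batch."""
    if n_images < 1:
        raise ValueError("n_images must be at least 1")
    return (setup + run * n_images) / n_images


if __name__ == "__main__":
    for n in (1, 10, 100):
        print(f"batch of {n} image(s):")
        for name, (setup, run) in timings.items():
            print(f"  {name:<22}{per_image_cost(setup, run, n):6.1f}")
```

Under this assumption the setup cost fades as the batch grows, so the per-image cost approaches the run time column; for a single image the setup roughly doubles the RenderScript cost.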

Table A.21 shows the average run and setup times for applying the grayscaling algorithm to a 100 × 67 image.

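| Implementation | Android 4.4 setup time (ms) | Android 4.4 run time (ms) | Android 6.0.1 setup time (ms) | Android 6.0.1 run time (ms) |
|---|---|---|---|---|
| Java Threaded | 0 ± 0 | 2 ± 1 | 0 ± 0 | 3 ± 2 |
| C++ Threaded | 0 ± 0 | 0 ± 0 | 0 ± 0 | 1 ± 1 |
| C++ OpenCV | 0 ± 0 | 0 ± 0 | 0 ± 0 | 0 ± 0 |
| Relaxed RenderScript | 3 ± 2 | 3 ± 1 | 8 ± 3 | 9 ± 5 |
| RenderScript | 2 ± 1 | 2 ± 0 | 8 ± 2 | 9 ± 4 |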

Table A.21: Run times for applying grayscaling to a 100 × 67 image

A.3 Thresholding

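| Implementation | Android 4.4 setup time (ms) | Android 4.4 run time (ms) | Android 6.0.1 setup time (ms) | Android 6.0.1 run time (ms) |
|---|---|---|---|---|
| Java Threaded | 0 ± 0 | 9 ± 3 | 2 ± 1 | 18 ± 6 |
| C++ Threaded | 0 ± 0 | 5 ± 1 | 2 ± 1 | 16 ± 6 |
| C++ OpenCV | 0 ± 0 | 1 ± 0 | 1 ± 0 | 1 ± 0 |
| Relaxed RenderScript | 39 ± 4 | 53 ± 9 | 21 ± 4 | 23 ± 5 |
| RenderScript | 42 ± 4 | 52 ± 4 | 22 ± 3 | 30 ± 5 |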

Table A.22: Run times for applying thresholding to a 500 × 333 image

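| Implementation | Android 4.4 setup time (ms) | Android 4.4 run time (ms) | Android 6.0.1 setup time (ms) | Android 6.0.1 run time (ms) |
|---|---|---|---|---|
| Java Threaded | 0 ± 0 | 1 ± 1 | 0 ± 0 | 3 ± 1 |
| C++ Threaded | 0 ± 0 | 0 ± 0 | 0 ± 0 | 1 ± 1 |
| C++ OpenCV | 0 ± 0 | 0 ± 0 | 0 ± 0 | 0 ± 0 |
| Relaxed RenderScript | 2 ± 3 | 4 ± 1 | 9 ± 2 | 9 ± 1 |
| RenderScript | 2 ± 2 | 4 ± 1 | 8 ± 2 | 9 ± 4 |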

Table A.23: Run times for applying thresholding to a 100 × 67 image