Automatic GPU Code Generation for Android

HIPAcc – http://hipacc-lang.org

Oliver Reiche, Richard Membarth, Frank Hannig, and Jürgen Teich

A DSL for Image Processing

HIPAcc: The Heterogeneous Image Processing Acceleration Framework

DSL code embedded into C++
Domain knowledge captured by classes

Classes
- Image: input / output buffers
- Accessor: ROI of input image
  - boundary handling
  - interpolation / filtering
- IterationSpace: ROI of output image
- Kernel: compute kernel description
- Mask: convolution mask
- Domain: iteration domain
- Pyramid: image pyramid description

Optimizations tailored to GPU architectures

Compiler Features
- memory padding / alignment
- utilization of textures
- exploit full GPU memory hierarchy
- MPMD code generation
- loop unrolling
- constant propagation
- thread-coarsening (multiple pixels per thread)
- vectorization support (point operators)
- implicit use of unified CPU/GPU memory

Figure: Overview of the HIPAcc Framework (C++ embedded DSL; HIPAcc source-to-source compiler based on Clang/LLVM, using domain knowledge and architecture knowledge; targets CUDA (GPU), OpenCL (x86/GPU), C/C++ (x86), and Renderscript (x86/ARM/GPU); CUDA/OpenCL/Renderscript runtime library)
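To make the thread-coarsening feature listed above concrete, here is a minimal plain C++ sketch in which a loop index stands in for a GPU thread index. It is an illustration of the idea only, not the code HIPAcc generates; the names `invert`, `run_per_pixel`, `run_coarsened`, and the parameter `ppt` (pixels per thread) are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical point operator: invert an 8-bit pixel.
inline std::uint8_t invert(std::uint8_t p) {
    return static_cast<std::uint8_t>(255 - p);
}

// Baseline mapping: one logical "thread" per pixel.
std::vector<std::uint8_t> run_per_pixel(const std::vector<std::uint8_t>& in) {
    std::vector<std::uint8_t> out(in.size());
    for (std::size_t tid = 0; tid < in.size(); ++tid)  // tid stands in for a thread index
        out[tid] = invert(in[tid]);
    return out;
}

// Coarsened mapping: each logical thread handles `ppt` consecutive pixels.
std::vector<std::uint8_t> run_coarsened(const std::vector<std::uint8_t>& in,
                                        std::size_t ppt) {
    assert(ppt > 0);
    std::vector<std::uint8_t> out(in.size());
    const std::size_t threads = (in.size() + ppt - 1) / ppt;  // round up
    for (std::size_t tid = 0; tid < threads; ++tid) {
        for (std::size_t k = 0; k < ppt; ++k) {
            const std::size_t i = tid * ppt + k;
            if (i < in.size())  // guard the partially filled last thread
                out[i] = invert(in[i]);
        }
    }
    return out;
}
```

Both mappings produce the same image; coarsening only changes how much work each thread performs and how many threads are launched.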

DSL Code Example: Gaussian Blur
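For orientation, the computation described by this example can be sketched as a plain C++ loop nest. This is a hedged reference sketch, not HIPAcc output: it assumes a single-channel 8-bit image instead of the four-channel pixels used in the DSL code, takes the mask as a parameter because the poster elides the coefficients, and adds a saturation step that the DSL code does not show. The names `clamp_coord` and `blur3x3` are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Clamp a coordinate into [0, size - 1]; this mirrors the idea behind BOUNDARY_CLAMP.
inline int clamp_coord(int v, int size) {
    return std::min(std::max(v, 0), size - 1);
}

// 3x3 convolution over a single-channel 8-bit image stored row by row.
// `mask` is supplied by the caller (the original coefficients are not given).
std::vector<std::uint8_t> blur3x3(const std::vector<std::uint8_t>& in,
                                  int width, int height,
                                  const float (&mask)[3][3]) {
    assert(width > 0 && height > 0);
    assert(in.size() == static_cast<std::size_t>(width) * height);
    std::vector<std::uint8_t> out(in.size());
    for (int y = 0; y < height; ++y) {          // iteration space: every output pixel
        for (int x = 0; x < width; ++x) {
            float sum = 0.0f;
            for (int dy = -1; dy <= 1; ++dy) {  // sum over the mask window
                for (int dx = -1; dx <= 1; ++dx) {
                    const int sx = clamp_coord(x + dx, width);   // clamped read
                    const int sy = clamp_coord(y + dy, height);
                    sum += mask[dy + 1][dx + 1] *
                           static_cast<float>(in[sy * width + sx]);
                }
            }
            // Round to nearest via +0.5f, then saturate to [0, 255] (added here for safety).
            const float rounded = std::min(std::max(sum + 0.5f, 0.0f), 255.0f);
            out[y * width + x] = static_cast<std::uint8_t>(rounded);
        }
    }
    return out;
}
```

In the DSL version, the two outer loops correspond to the IterationSpace, the clamped reads to the BoundaryCondition and Accessor, and the inner window sum to the convolve call over the Mask.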

DSL Host Code

    // ...
    const float filter_mask[3][3] = {...};
    Mask mask(filter_mask);

    BoundaryCondition bound(in, mask, BOUNDARY_CLAMP);
    Accessor acc(bound);
    IterationSpace iter(out);

    GaussianBlur filter(iter, acc, mask);
    filter.execute();

DSL Kernel Code

    class GaussianBlur : public Kernel {
        // ...
        void kernel() {
            float4 sum = convolve(mask, HipaccSUM, [&]() -> float4 {
                return mask() * convert_float4(input(mask));
            });
            output() = convert_uchar4(sum + 0.5f);
        }
    };

Figure: DSL code (C++) is processed by HIPAcc, which rewrites it into runtime calls (C++) and generates target device code.

Seamless Integration into the Android Development Tools

Android Development Tools (ADT)
Android Software Development Kit (SDK)
- based on Eclipse
- supports C++ via Java Native Interface (JNI)
- automatic compilation and packaging into app
- compilation done by the Android Native Development Kit (NDK)

GPU Computing on Android

Renderscript Compute
- code mapping to native worker threads
- targets DSPs, CPUs, and GPUs (since Android 4.2)

Filterscript
- stricter limitations
- relaxed precision
- no scatter writes
- pointers are illegal
- ensures wider compatibility
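The "no scatter writes" restriction listed for Filterscript can be illustrated with a small plain C++ sketch: the same operation is written once in scatter form (each input chooses where to write) and once in gather form (each output is written only at its own index and reads from wherever it needs). This is an analogy in ordinary C++, not Renderscript or Filterscript code; the function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Scatter form: the write position is computed from the input index.
// A "no scatter writes" rule excludes kernels shaped like this.
std::vector<std::uint8_t> mirror_scatter(const std::vector<std::uint8_t>& row) {
    std::vector<std::uint8_t> out(row.size());
    for (std::size_t x = 0; x < row.size(); ++x)
        out[row.size() - 1 - x] = row[x];  // write target differs from x
    return out;
}

// Gather form: each output element is written only at its own index x,
// while the read position may be anywhere in the input.
std::vector<std::uint8_t> mirror_gather(const std::vector<std::uint8_t>& row) {
    std::vector<std::uint8_t> out(row.size());
    for (std::size_t x = 0; x < row.size(); ++x)
        out[x] = row[row.size() - 1 - x];  // read elsewhere, write at x
    return out;
}
```

Both functions mirror a row and return identical results; only the gather form fits a model in which a kernel may produce a value solely for its own output position.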

HIPAcc Integration

Modular Makefile
Used to seamlessly integrate HIPAcc into ADT
- called during preprocessing step
- set appropriate target compiler flags
- append generated files to NDK sources

Figure: Host and device build flow (labels: .cpp source, g++, binary executable; DSL source, HIPAcc, .rs source, llvm-rs-cc, .bc file, libbcc, native lib; reflection, libRScpp, system lib, libRS; Eclipse)

Demonstration Setup and Results

Exynos 5250 MPSoC
- ARM Cortex-A15
  - dual core @1.7 GHz
  - 64/128 bit SIMD NEON
- ARM Mali T-604 GPU
  - 4 cores @533 MHz
  - 16 SIMD lanes per core
- 2 GB of 800 MHz DDR3 DRAM

Live Demo Application
- five image filters in DSL code
- single description of filters
- target-independent code
- automatic code generation for Renderscript and Filterscript
- up to 4× faster

Figure: Samsung Exynos 5250 Arndale Board