Presentation parts

Part I. Acceleration of Blender Cycles Render Engine using OpenMP and MPI
Part II. Acceleration of Blender Cycles Render Engine using Intel® Xeon Phi™
. GPU rendering in original code
. Rendering using Intel® Xeon Phi™
. Rendering using Intel® Xeon Phi™ and MPI
. Benchmark (Tatra T87, House, Worm)
Appendix
. Building and running CyclesPhi

Part II. Acceleration of Blender Cycles Render Engine using Intel® Xeon Phi™
VŠB – Technical University of Ostrava, IT4Innovations, Czech Republic
Milan Jaroš, Lubomír Říha, Tomáš Karásek, Renáta Plouharová

POSIX Threads in Blender

[Figure: thread structure of Cycles inside Blender — the Blender thread (Draw, Sync / Tag Update of the Blender scene), the Session thread (Device Update of the device scene), and the device threads writing into the display buffer.]
https://wiki.blender.org/index.php/Dev:Source/Render/Cycles/Threads

GPU rendering in original code

The CUDADevice class implements rendering on one node with one GPU (CUDA support). On the device it keeps KernelData (cam, background, integrator with its emission, bounces and sampler settings, ...) and KernelTextures (bvh, objects, triangles, lights, particles, sobol_directions, texture_images, ...), together with the render buffer and rng_state. Its memory interface consists of mem_alloc, mem_free, mem_copy_to, mem_copy_from, const_copy_to, tex_alloc and tex_free. The render task is decomposed into tile subtasks that are kept on a stack and consumed by thread_run.

[Figure: CUDADevice architecture — KernelData and KernelTextures on the device, the memory-interface methods, and the decomposition of the task into tile subtasks pulled from a stack by thread_run.]

GPU rendering in original code

//blender/intern/cycles/device/device_cuda.cpp

void mem_alloc(const char *name, device_memory& mem, MemoryType /*type*/) {
  cuda_push_context();
  CUdeviceptr device_pointer;
  //...
  cuda_assert(cuMemAlloc(&device_pointer, size));
  //...
  cuda_pop_context();
}

void mem_free(const char *name, device_memory& mem) {
  cuda_push_context();
  cuda_assert(cuMemFree(cuda_device_ptr(mem.device_pointer)));
  //...
  cuda_pop_context();
}

void mem_copy_to(const char *name, device_memory& mem) {
  cuda_push_context();
  cuda_assert(cuMemcpyHtoD(cuda_device_ptr(mem.device_pointer),
                           (void*)mem.data_pointer, mem.memory_size()));
  cuda_pop_context();
}

void mem_copy_from(const char *name, device_memory& mem,
                   int y, int w, int h, int elem) {
  //...
  cuda_push_context();
  cuda_assert(cuMemcpyDtoH((uchar*)mem.data_pointer + offset,
                           (CUdeviceptr)(mem.device_pointer + offset), size));
  //...
}

void const_copy_to(const char *name, void *host, size_t size) {
  //...
  cuda_push_context();
  cuda_assert(cuModuleGetGlobal(&mem, &bytes, cuModule, name));
  cuda_assert(cuMemcpyHtoD(mem, host, size));
  cuda_pop_context();
}

void tex_alloc(const char *name,
               device_memory& mem,
               InterpolationType interpolation,
               ExtensionType extension);

void tex_free(const char *name, device_memory& mem);
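For readers unfamiliar with the CUDA driver API these wrappers call into, here is a minimal self-contained sketch of the same alloc / copy-to / copy-from / free cycle (an illustration written for this text, not Cycles code; error handling is reduced to a check-and-bail macro):

#include <cuda.h>
#include <cstdio>
#include <vector>

// Print the failing call and exit on any driver-API error.
#define CU_CHECK(call) do { CUresult r = (call); if (r != CUDA_SUCCESS) { \
  fprintf(stderr, "CUDA error %d in %s\n", (int)r, #call); return 1; } } while (0)

int main() {
  CU_CHECK(cuInit(0));
  CUdevice dev;  CU_CHECK(cuDeviceGet(&dev, 0));
  CUcontext ctx; CU_CHECK(cuCtxCreate(&ctx, 0, dev));  // cf. cuda_push_context()

  std::vector<float> host(1024, 1.0f);
  size_t size = host.size() * sizeof(float);

  CUdeviceptr dptr;
  CU_CHECK(cuMemAlloc(&dptr, size));                   // cf. mem_alloc
  CU_CHECK(cuMemcpyHtoD(dptr, host.data(), size));     // cf. mem_copy_to
  CU_CHECK(cuMemcpyDtoH(host.data(), dptr, size));     // cf. mem_copy_from
  CU_CHECK(cuMemFree(dptr));                           // cf. mem_free

  CU_CHECK(cuCtxDestroy(ctx));                         // cf. cuda_pop_context()
  return 0;
}

Cycles wraps each such call in cuda_assert() and brackets it with cuda_push_context()/cuda_pop_context(), which bind and release the device's context on the calling thread.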
GPU rendering in original code

In the original implementation, the synthesized image of resolution x_r × y_r is decomposed into tiles of size x_t × y_t. One tile (x_t · y_t pixels) is computed by one GPU device, and one GPU core computes one pixel.

[Figure: decomposition of the image into tiles on one node with one GPU (CUDA) — the tiles form a stack; 1× tile = x_t · y_t pixels on one device, 1× pixel per core.]

GPU rendering in original code

//blender/intern/cycles/device/device_cuda.cpp

void path_trace(RenderTile& rtile, int sample, bool branched) {
  //...
  cuda_assert(cuModuleGetFunction(&cuPathTrace, cuModule,
                                  "kernel_cuda_path_trace"));
  //...
  cuda_assert(cuFuncGetAttribute(&threads_per_block,
                                 CU_FUNC_ATTRIBUTE_MAX_THREADS_PER_BLOCK,
                                 cuPathTrace));
  //...
  int xthreads = (int)sqrt((float)threads_per_block);
  int ythreads = (int)sqrt((float)threads_per_block);
  int xblocks = (rtile.w + xthreads - 1)/xthreads;
  int yblocks = (rtile.h + ythreads - 1)/ythreads;

  cuda_assert(cuFuncSetCacheConfig(cuPathTrace, CU_FUNC_CACHE_PREFER_L1));

  cuda_assert(cuLaunchKernel(cuPathTrace,
                             xblocks, yblocks, 1,   // blocks
                             xthreads, ythreads, 1, // threads
                             0, 0, args, 0));

  cuda_assert(cuCtxSynchronize());
  cuda_pop_context();
}

//blender/intern/cycles/kernel/kernels/cuda/kernel.cu

extern "C" __global__ void
CUDA_LAUNCH_BOUNDS(CUDA_THREADS_BLOCK_WIDTH, CUDA_KERNEL_MAX_REGISTERS)
kernel_cuda_path_trace(float *buffer, uint *rng_state, int sample,
                       int sx, int sy, int sw, int sh, int offset, int stride)
{
  int x = sx + blockDim.x*blockIdx.x + threadIdx.x;
  int y = sy + blockDim.y*blockIdx.y + threadIdx.y;

  if (x < sx + sw && y < sy + sh)
    kernel_path_trace(NULL, buffer, rng_state, sample, x, y, offset, stride);
}
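The launch above sizes a square thread block from the kernel's maximum threads per block and covers the tile with ceil-divided blocks. A standalone sketch of that arithmetic, with invented tile dimensions:

#include <cmath>
#include <cstdio>

int main() {
  int threads_per_block = 256;    // as queried via CU_FUNC_ATTRIBUTE_MAX_THREADS_PER_BLOCK
  int tile_w = 240, tile_h = 135; // rtile.w, rtile.h (example values)

  int xthreads = (int)sqrt((float)threads_per_block);  // 16
  int ythreads = (int)sqrt((float)threads_per_block);  // 16
  int xblocks = (tile_w + xthreads - 1) / xthreads;    // ceil(240/16) = 15
  int yblocks = (tile_h + ythreads - 1) / ythreads;    // ceil(135/16) = 9

  // 15*9 blocks of 16*16 threads = 34560 threads for 240*135 = 32400 pixels;
  // the bounds check in kernel_cuda_path_trace discards the overhanging threads.
  printf("grid %dx%d, block %dx%d\n", xblocks, yblocks, xthreads, ythreads);
  return 0;
}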
Rendering using Intel® Xeon Phi™

Native mode, Offload mode and Symmetric mode

[Figure: execution models for a single source code with common compilers, libraries and parallel models — Multicore Only (main() on Xeon), Multicore Hosted with Many-Core Offload (main() on Xeon, work offloaded to Xeon Phi), Symmetric (main() on both Xeon and Xeon Phi), and Many-Core Only / Native (main() on Xeon Phi).]
http://www.cism.ucl.ac.be

[Figure: the three deployment modes of Blender on an accelerated node — Offload mode: Blender runs on the CPU and drives MIC0/MIC1 through OpenMP + Offload; Symmetric mode: Blender on the CPU talks over MPI to Blender clients running under OpenMP on MIC0 and MIC1; Native mode: Blender runs directly on MIC0 under OpenMP.]

Offload pragma/directives in C++

C/C++                   Syntax                                                Semantics
Offload pragma          #pragma offload <clauses> <statement>                 Execute next statement on MIC (which
                                                                              could be an OpenMP parallel construct)
Function and variable   __declspec(target(mic)) <func/var>                    Compile function and variable
                        __attribute__((target(mic))) <func/var>               for CPU and MIC
                        #pragma offload_attribute(target(mic)) <func/var>

https://software.intel.com/en-us/articles/ixptc-2013-presentations

Offload pragma/directives in C++

Clauses                  Syntax                          Semantics
Target specification     target(mic[:<expr>])            Where to run construct
If specifier             if (condition)                  Offload statement if condition is TRUE
Inputs                   in (var-list modifiers)         Copy CPU to target
Outputs                  out (var-list modifiers)        Copy target to CPU
Inputs & outputs         inout (var-list modifiers)      Copy both ways
Non-copied data          nocopy (var-list modifiers)     Data is local to target

Modifiers                Syntax                          Semantics
Specify pointer length   length (element-count-expr)     Copy that many pointer elements
Control pointer memory   alloc_if (condition)            Allocate/free new block of memory
allocation               free_if (condition)             for pointer if condition is TRUE

https://software.intel.com/en-us/articles/ixptc-2013-presentations

Parallelization for MIC using OpenMP and Offload

The OMPDevice class mirrors CUDADevice for one node driven by OpenMP + Offload: it keeps KernelData (cam, background, integrator with its emission, bounces and sampler settings, ...) and KernelTextures (bvh, objects, triangles, lights, particles, sobol_directions, texture_images, ...) together with the render buffer and rng_state, and exposes the same memory interface (mem_alloc, mem_free, mem_copy_to, mem_copy_from, const_copy_to, tex_alloc, tex_free). The task is again decomposed into tile subtasks consumed from a stack by thread_run.

[Figure: OMPDevice architecture — structurally identical to the CUDADevice diagram, with the GPU replaced by OpenMP + Offload on one node.]
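Before turning to the CyclesPhi sources, a minimal self-contained example that combines the pragma and clauses from the tables above (my own illustration — the buffer names and the doubling loop are invented; the syntax is the Intel compiler's classic offload model):

#include <cstdio>

// Compiled for both the CPU and the MIC, so the offloaded region can call it.
__attribute__((target(mic)))
double scale(double v) { return 2.0 * v; }

int main() {
  const int n = 1000;
  double *in_buf  = new double[n];
  double *out_buf = new double[n];
  for (int i = 0; i < n; i++) in_buf[i] = i;

  // Run the next statement on MIC 0: in_buf is copied over, out_buf copied back.
  #pragma offload target(mic:0) in(in_buf:length(n)) out(out_buf:length(n))
  {
    // The offloaded statement may be an OpenMP parallel construct.
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
      out_buf[i] = scale(in_buf[i]);
  }

  printf("out_buf[%d] = %.1f\n", n - 1, out_buf[n - 1]);  // 1998.0
  delete[] in_buf;
  delete[] out_buf;
  return 0;
}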
Parallelization for MIC using OpenMP and Offload

//blender/intern/cycles/kernel/kernels/mic/kernel_mic.cpp

#define ALLOC alloc_if(1) free_if(0)
#define FREE alloc_if(0) free_if(1)
#define REUSE alloc_if(0) free_if(0)
#define ONE_USE

device_ptr mic_alloc_kg(int numDevice) {
  device_ptr kg_bin;
  #pragma offload target(mic:numDevice) out(kg_bin)
  {
    KernelGlobals *kg = new KernelGlobals();
    kg_bin = (device_ptr) kg;
  }
  return (device_ptr) kg_bin;
}

void mic_free_kg(int numDevice, device_ptr kg_bin) {
  #pragma offload target(mic:numDevice) in(kg_bin)
  {
    KernelGlobals *kg = (KernelGlobals *) kg_bin;
    //...
  }
}

void mic_mem_alloc(int numDevice, char *mem, size_t memSize) {
  // First transfer allocates a persistent device-side copy (empty body: transfer only).
  #pragma offload target(mic:numDevice) in(mem:length(memSize) ALLOC)
  { }
}

void mic_mem_copy_to(int numDevice, char *mem, size_t memSize, char* signal_value) {
  if (signal_value == NULL) {
    // Synchronous copy into the already-allocated buffer.
    #pragma offload target(mic:numDevice) in(mem:length(memSize) REUSE)
    { }
  } else {
    // Asynchronous copy; completion is reported through signal_value.
    #pragma offload_transfer target(mic:numDevice) in(mem:length(memSize) REUSE) signal(signal_value)
  }
}

void mic_mem_copy_from(int numDevice, char *mem, size_t offset, size_t memSize,
                       char* signal_value) {
  if (signal_value == NULL) {
    #pragma offload target(mic:numDevice) out(mem[offset:memSize]: REUSE)
    { }
  }
  //...
}
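The ALLOC/REUSE/FREE macros give the coprocessor buffer a lifetime independent of any single offload, and the signal clause makes a transfer asynchronous. A minimal self-contained sketch of that idiom (buffer and tag names are invented; offload_wait is the matching synchronization pragma):

#include <cstddef>
#include <cstdio>

#define ALLOC alloc_if(1) free_if(0)
#define FREE  alloc_if(0) free_if(1)
#define REUSE alloc_if(0) free_if(0)

int main() {
  const size_t n = 1 << 20;
  char *mem = new char[n];
  char tag;  // its address serves only as an asynchronous signal tag

  // Allocate a persistent device-side copy of mem (first transfer allocates).
  #pragma offload target(mic:0) in(mem:length(n) ALLOC)
  { }

  // Start an asynchronous host->MIC update of the existing buffer...
  #pragma offload_transfer target(mic:0) in(mem:length(n) REUSE) signal(&tag)

  // ...unrelated host work may overlap with the transfer here...

  // ...and block until the transfer keyed by &tag has completed.
  #pragma offload_wait target(mic:0) wait(&tag)

  // Release the persistent device-side buffer without copying data.
  #pragma offload target(mic:0) nocopy(mem:length(n) FREE)
  { }

  delete[] mem;
  printf("done\n");
  return 0;
}

This is the same split mic_mem_copy_to makes above: the signal_value == NULL branch is the synchronous case, while the signalled branch returns immediately and lets the caller wait later.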