Presentation parts

Part I. Acceleration of Blender Cycles Render Engine using OpenMP and MPI
Part II. Acceleration of Blender Cycles Render Engine using Intel® Xeon Phi™
  . GPU rendering in original code
  . Rendering using Intel® Xeon Phi™
  . Rendering using Intel® Xeon Phi™ and MPI
  . Benchmark (Tatra T87, House, Worm)
Appendix
  . Building and running CyclesPhi

Part II. Acceleration of Blender Cycles Render Engine using Intel® Xeon Phi™
VŠB – Technical University of Ostrava, IT4Innovations, Czech Republic
Milan Jaroš, Lubomír Říha, Tomáš Karásek, Renáta Plouharová

Presentation parts

Part I. Acceleration of Blender Cycles Render Engine using OpenMP and MPI
Part II. Acceleration of Blender Cycles Render Engine using Intel® Xeon Phi™
  . GPU rendering in original code

POSIX Threads in Blender

[Diagram: Blender/Cycles threading model – the Blender thread syncs and tag-updates the Blender scene and handles drawing; the session thread performs the device update of the scene into the device scene; the device threads render into the display buffer.]

https://wiki.blender.org/index.php/Dev:Source/Render/Cycles/Threads

GPU rendering in original code

[Diagram: one node with a GPU (CUDA support) driven by CUDADevice. The device holds KernelData (cam, background, integrator – emission, bounces, sampler – ...), KernelTextures (bvh, objects, triangles, lights, particles, sobol_directions, texture_images, ...) and the buffer/rng_state, managed through mem_alloc, mem_copy_to, mem_copy_from, mem_free, const_copy_to, tex_alloc and tex_free. The render task is decomposed into TILES, pushed to a STACK of subtasks, and processed by thread_run.]


GPU rendering in original code

//blender/intern/cycles/device/device_cuda.cpp
void mem_alloc(const char *name, device_memory& mem, MemoryType /*type*/) {
    cuda_push_context();
    CUdeviceptr device_pointer;
    //...
    cuda_assert(cuMemAlloc(&device_pointer, size));
    //...
    cuda_pop_context();
}

void mem_free(const char *name, device_memory& mem) {
    cuda_push_context();
    cuda_assert(cuMemFree(cuda_device_ptr(mem.device_pointer)));
    //...
    cuda_pop_context();
}

void tex_alloc(const char *name, device_memory& mem,
               InterpolationType interpolation, ExtensionType extension);
void tex_free(const char *name, device_memory& mem);

void mem_copy_to(const char *name, device_memory& mem) {
    cuda_push_context();
    cuda_assert(cuMemcpyHtoD(cuda_device_ptr(mem.device_pointer),
                             (void*)mem.data_pointer, mem.memory_size()));
    cuda_pop_context();
}

void mem_copy_from(const char *name, device_memory& mem, int y, int w, int h, int elem) {
    //...
    cuda_push_context();
    cuda_assert(cuMemcpyDtoH((uchar*)mem.data_pointer + offset,
                             (CUdeviceptr)(mem.device_pointer + offset), size));
    //...
    cuda_pop_context();
}

void const_copy_to(const char *name, void *host, size_t size) {
    //...
    cuda_push_context();
    cuda_assert(cuModuleGetGlobal(&mem, &bytes, cuModule, name));
    cuda_assert(cuMemcpyHtoD(mem, host, size));
    cuda_pop_context();
}
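As a minimal standalone sketch (not Cycles code; error handling and names are simplified and assumed), the same CUDA driver API pattern – allocate, copy to the device, copy back, free – looks like this:

#include <cuda.h>
#include <cstdio>
#include <vector>

// Abort-on-error helper for driver API calls (assumed, not from Cycles).
#define CU_CHECK(call) do { CUresult r = (call); \
    if (r != CUDA_SUCCESS) { std::printf("CUDA error %d at line %d\n", (int)r, __LINE__); return 1; } } while (0)

int main()
{
    CU_CHECK(cuInit(0));
    CUdevice dev;   CU_CHECK(cuDeviceGet(&dev, 0));
    CUcontext ctx;  CU_CHECK(cuCtxCreate(&ctx, 0, dev));     // roughly cuda_push_context()

    std::vector<float> host(1024, 1.0f);
    size_t size = host.size() * sizeof(float);

    CUdeviceptr dptr;
    CU_CHECK(cuMemAlloc(&dptr, size));                        // cf. mem_alloc
    CU_CHECK(cuMemcpyHtoD(dptr, host.data(), size));          // cf. mem_copy_to
    CU_CHECK(cuMemcpyDtoH(host.data(), dptr, size));          // cf. mem_copy_from
    CU_CHECK(cuMemFree(dptr));                                // cf. mem_free

    CU_CHECK(cuCtxDestroy(ctx));                              // roughly cuda_pop_context()
    return 0;
}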


GPU rendering in original code

The original implementation decomposes the synthesized image of resolution x_r × y_r into tiles of size x_t × y_t. One tile (x_t × y_t pixels) is computed by one GPU device, and one GPU core computes one pixel.

[Diagram: one node – the image (x_r × y_r) is decomposed into tiles; the tiles form a STACK of subtasks; one tile (x_t × y_t pixels) goes to one GPU with CUDA support, one pixel to one GPU core.]
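A minimal sketch of this decomposition (hypothetical Tile/TileStack types, not the Cycles implementation): the image is cut into x_t × y_t tiles that are pushed onto a shared stack, from which the device threads pop their work.

#include <algorithm>
#include <mutex>
#include <stack>

struct Tile { int x, y, w, h; };              // one x_t * y_t subtask

class TileStack {                             // shared stack: provides the load balancing
public:
    void push(const Tile& t) { std::lock_guard<std::mutex> lock(m); s.push(t); }
    bool pop(Tile& t) {
        std::lock_guard<std::mutex> lock(m);
        if (s.empty()) return false;
        t = s.top(); s.pop();
        return true;
    }
private:
    std::mutex m;
    std::stack<Tile> s;
};

// Decompose an x_r x y_r image into tiles of at most x_t x y_t pixels.
void decompose_to_tiles(int xr, int yr, int xt, int yt, TileStack& stack) {
    for (int y = 0; y < yr; y += yt)
        for (int x = 0; x < xr; x += xt)
            stack.push({ x, y, std::min(xt, xr - x), std::min(yt, yr - y) });
}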


GPU rendering in original code

//blender/intern/cycles/device/device_cuda.cpp
void path_trace(RenderTile& rtile, int sample, bool branched) {
    //...
    cuda_assert(cuModuleGetFunction(&cuPathTrace, cuModule, "kernel_cuda_path_trace"));
    //...
    cuda_assert(cuFuncGetAttribute(&threads_per_block,
                                   CU_FUNC_ATTRIBUTE_MAX_THREADS_PER_BLOCK, cuPathTrace));
    //...
    int xthreads = (int)sqrt((float)threads_per_block);
    int ythreads = (int)sqrt((float)threads_per_block);
    int xblocks = (rtile.w + xthreads - 1)/xthreads;
    int yblocks = (rtile.h + ythreads - 1)/ythreads;

    cuda_assert(cuFuncSetCacheConfig(cuPathTrace, CU_FUNC_CACHE_PREFER_L1));

    cuda_assert(cuLaunchKernel(cuPathTrace,
                               xblocks, yblocks, 1,    // blocks
                               xthreads, ythreads, 1,  // threads
                               0, 0, args, 0));

    cuda_assert(cuCtxSynchronize());
    cuda_pop_context();
}

//blender/intern/cycles/kernel/kernels/cuda/kernel.cu
extern "C" __global__ void
CUDA_LAUNCH_BOUNDS(CUDA_THREADS_BLOCK_WIDTH, CUDA_KERNEL_MAX_REGISTERS)
kernel_cuda_path_trace(float *buffer, uint *rng_state, int sample,
                       int sx, int sy, int sw, int sh, int offset, int stride)
{
    int x = sx + blockDim.x*blockIdx.x + threadIdx.x;
    int y = sy + blockDim.y*blockIdx.y + threadIdx.y;

    if (x < sx + sw && y < sy + sh)
        kernel_path_trace(NULL, buffer, rng_state, sample, x, y, offset, stride);
}

Presentation parts

Part I. Acceleration of Blender Cycles Render Engine using OpenMP and MPI
Part II. Acceleration of Blender Cycles Render Engine using Intel® Xeon Phi™
  . Rendering using Intel® Xeon Phi™

Native mode, Offload mode and Symmetric mode

[Diagram: the same source code, built with Intel compilers, libraries and parallel models, can be deployed as Multicore Only (Xeon), Multicore Hosted with Many-Core Offload, Symmetric (Xeon + Xeon Phi) or Many-Core Only / Native (Xeon Phi), with the results collected in each case.]
http://www.cism.ucl.ac.be

Native mode, Offload mode and Symmetric mode

[Diagram: the three modes on one node.
 Offload mode: Blender runs on the CPU and drives MIC0 and MIC1 through OpenMP + Offload.
 Native mode: Blender runs on the CPU while Blender clients run natively on MIC0 and MIC1 (OpenMP), connected over MPI.
 Symmetric mode: Blender on the CPU plus Blender clients on the CPU, MIC0 and MIC1 (OpenMP), all communicating over MPI.]

Offload pragma/directives in C++

                        C/C++ Syntax                               Semantics
Offload pragma          #pragma offload                            Execute next statement on MIC (which could be an OpenMP parallel construct)
Function and variable   __declspec(target(mic))                    Compile function and variable for CPU and MIC
                        __attribute__((target(mic)))
                        #pragma offload_attribute(target(mic))

https://software.intel.com/en-us/articles/ixptc-2013-presentations

Offload pragma/directives in C++

Clauses                     Syntax                          Semantics
  Target specification      target ( mic [: ] )             Where to run construct
  If specifier              if ( condition )                Offload statement if condition is TRUE
  Inputs                    in (var-list modifiers)         Copy CPU to target
  Outputs                   out (var-list modifiers)        Copy target to CPU
  Inputs & outputs          inout (var-list modifiers)      Copy both ways
  Non-copied data           nocopy (var-list modifiers)     Data is local to target

Modifiers
  Specify pointer length    length (element-count-expr)     Copy that many pointer elements
  Control pointer memory    alloc_if ( condition )          Allocate/free new block of memory
  allocation                free_if ( condition )           for pointer if condition is TRUE

https://software.intel.com/en-us/articles/ixptc-2013-presentations
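A minimal example of how these clauses combine (a sketch, not CyclesPhi code; it assumes the Intel compiler with Language Extensions for Offload and at least one coprocessor): the input array is copied in, the loop runs on MIC 0 inside an OpenMP parallel region, and the result is copied out.

#include <cstdio>

__attribute__((target(mic)))               // compiled for both CPU and MIC
static float square(float v) { return v * v; }

int main()
{
    const int n = 1024;
    float a[n], b[n];
    for (int i = 0; i < n; i++) a[i] = (float)i;

    // in() copies host -> target before the region, out() copies target -> host after it.
    #pragma offload target(mic:0) in(a:length(n)) out(b:length(n))
    {
        #pragma omp parallel for
        for (int i = 0; i < n; i++)
            b[i] = square(a[i]);
    }

    std::printf("b[10] = %f\n", b[10]);
    return 0;
}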

Parallelization for MIC using OpenMP and Offload

[Diagram: one node driven by OMPDevice using OpenMP + Offload. The device holds KernelData (cam, background, integrator – emission, bounces, sampler – ...), KernelTextures (bvh, objects, triangles, lights, particles, sobol_directions, texture_images, ...) and the buffer/rng_state, managed through mem_alloc, mem_copy_to, mem_copy_from, mem_free, const_copy_to, tex_alloc and tex_free. The render task is decomposed into TILES, pushed to a STACK of subtasks, and processed by thread_run.]


Parallelization for MIC using OpenMP and Offload

//blender/intern/cycles/kernel/kernels/mic/kernel_mic.cpp
#define ALLOC alloc_if(1) free_if(0)
#define FREE alloc_if(0) free_if(1)
#define REUSE alloc_if(0) free_if(0)
#define ONE_USE

device_ptr mic_alloc_kg(int numDevice) {
    device_ptr kg_bin;
    #pragma offload target(mic:numDevice) out(kg_bin)
    {
        KernelGlobals *kg = new KernelGlobals();
        kg_bin = (device_ptr) kg;
    }
    return (device_ptr) kg_bin;
}

void mic_free_kg(int numDevice, device_ptr kg_bin) {
    #pragma offload target(mic:numDevice) in(kg_bin)
    {
        KernelGlobals *kg = (KernelGlobals *) kg_bin;
        delete kg;
    }
}

void mic_const_copy(int numDevice, /*…*/) {
    #pragma offload target(mic:numDevice) \
        in(host_bin:length(size) ONE_USE) in(kg_bin) in(size)
    {
        KernelGlobals *kg = (KernelGlobals *)kg_bin;
        memcpy(&kg->__data, host_bin, size);
        kg->__data_size = size;
    }
}

void mic_mem_alloc(int numDevice, char *mem, size_t memSize) {
    #pragma offload target(mic:numDevice) in(mem:length(memSize) ALLOC)
}

void mic_mem_copy_to(int numDevice, char *mem, size_t memSize, char* signal_value) {
    if (signal_value == NULL) {
        #pragma offload target(mic:numDevice) in(mem:length(memSize) REUSE)
    } else {
        #pragma offload_transfer target(mic:numDevice) in(mem:length(memSize) REUSE) signal(signal_value)
    }
}

void mic_mem_copy_from(int numDevice, char *mem, size_t offset, size_t memSize, char* signal_value) {
    if (signal_value == NULL) {
        #pragma offload target(mic:numDevice) out(mem[offset:memSize]: REUSE)
    } else {
        #pragma offload_transfer target(mic:numDevice) out(mem[offset:memSize]: REUSE) signal(signal_value)
    }
}

void mic_mem_free(int numDevice, char *mem, size_t memSize) {
    #pragma offload target(mic:numDevice) in(mem:length(0) FREE)
}
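The ALLOC/REUSE/FREE macros implement the usual persistent-buffer pattern: allocate on the coprocessor once, reuse the allocation across many offloads, free it at the end. A minimal standalone sketch of that pattern (assumed names, not CyclesPhi code):

#include <cstdio>
#include <cstdlib>

#define ALLOC alloc_if(1) free_if(0)   // allocate on the target, keep it alive
#define REUSE alloc_if(0) free_if(0)   // reuse the existing target allocation
#define FREE  alloc_if(0) free_if(1)   // release the target allocation

int main()
{
    const long n = 1 << 20;
    float *buf = (float *)std::malloc(n * sizeof(float));
    for (long i = 0; i < n; i++) buf[i] = 0.0f;

    // 1) Allocate the target copy and upload the initial contents once.
    #pragma offload_transfer target(mic:0) in(buf:length(n) ALLOC)

    for (int pass = 0; pass < 4; pass++) {
        // 2) Work on the persistent target copy; length(0) transfers no data.
        #pragma offload target(mic:0) in(buf:length(0) REUSE)
        {
            #pragma omp parallel for
            for (long i = 0; i < n; i++) buf[i] += 1.0f;
        }
    }

    // 3) Download the result and free the target allocation.
    #pragma offload_transfer target(mic:0) out(buf:length(n) FREE)

    std::printf("buf[0] = %f (expected 4)\n", buf[0]);
    std::free(buf);
    return 0;
}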


Parallelization for MIC using OpenMP and Offload

The synthesized image with resolution x_r × y_r is decomposed into rows (y_t = 1). In our case there are three devices: the CPU (24 cores) and two Intel Xeon Phi / MIC cards (61 + 61 cores). Each device pops one row from the stack at a time, so the stack provides the load balancing.

[Diagram: one node – the camera view (x_r × y_r) is decomposed into rows; the rows form a STACK of TILES, and one row at a time is assigned to one device (CPU, MIC0 or MIC1) via OpenMP + Offload.]
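A minimal sketch of this stack-based load balancing (hypothetical render_row, not the CyclesPhi scheduler): one host thread per device keeps taking the next unrendered row from a shared counter, so a faster device simply processes more rows.

#include <atomic>
#include <cstdio>
#include <omp.h>

// Hypothetical per-device work: render one image row on the given device
// (CPU path, or an offload to MIC0/MIC1 as on the previous slides).
static void render_row(int device, int row) { (void)device; (void)row; }

int main()
{
    const int y_r = 2048;                 // number of rows (y_t = 1)
    const int num_devices = 3;            // CPU, MIC0, MIC1
    std::atomic<int> next_row(0);         // plays the role of the shared stack

    #pragma omp parallel num_threads(num_devices)
    {
        const int device = omp_get_thread_num();
        for (int row = next_row++; row < y_r; row = next_row++)
            render_row(device, row);
    }

    std::printf("rendered %d rows on %d devices\n", y_r, num_devices);
    return 0;
}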


Parallelization for MIC using OpenMP and Offload

//blender/intern/cycles/kernel/kernels/mic/kernel_mic.cpp
void mic_path_trace(int numDevice, /*...*/)
{
    #pragma offload target(mic:numDevice) \
        in(buffer_bin : length(0) REUSE) \
        in(rng_state_bin : length(0) REUSE) \
        in(sample_finished_mic : length(0) REUSE) \
        in(reqFinished_mic : length(0) REUSE) \
        in(rgba_byte_bin : length(0) REUSE) \
        in(kg_bin) in(start_sample) in(end_sample) \
        in(tile_x) in(tile_y) in(offset) in(stride) \
        in(tile_h) in(tile_w) in(nprocs_cpu) \
        signal(signal_value)
    {
        #pragma omp parallel for num_threads(nprocs_cpu) schedule(dynamic, 1)
        for (int i = 0; i < size; i++)
        {
            int y = i / tile_w;
            int x = i - y * tile_w;
            for (int sample = start_sample; sample < end_sample; sample++)
            {
                kernel_path_trace((KernelGlobals *)kg_bin, /*…*/);
            }
            //...
        }
    }
}

//blender/intern/cycles/device/device_omp.cpp
omp_set_nested(1);
#pragma omp parallel num_threads(2)
{
    #pragma omp single nowait
    {
        #pragma omp task
        {
            while (reqFinished == 0) {
                #pragma omp flush
                if (omp_path_trace_req != 0) {
                    cpu_path_trace((KernelGlobals *)kg_bin, /*...*/);
                    omp_path_trace_req = 0;
                }
                usleep(100);
            }
        }

        #pragma omp task
        {
            while (true) {
                for (int dev = 0; dev < num_devices_cpu_mics; dev++) {
                    if (dev > 0)
                        mic_mem_copy_from(dev - 1, (char*) buffer, /*...*/);
                    if (sample_finished_devices[dev] == end_sample) {
                        if (dev == 0)
                            omp_path_trace_req = 1;
                        else
                            mic_path_trace(dev - 1, /*...*/);
                    }
                }
                task.update_progress(&tile);
                //...
            }
        }
    }
    #pragma omp taskwait
}
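The signal(signal_value) clauses above make the offloads asynchronous, so the host thread can keep feeding the other devices. A minimal standalone sketch of that mechanism (assumed names, not CyclesPhi code):

#include <cstdio>
#include <cstdlib>

int main()
{
    const long n = 1 << 20;
    float *a = (float *)std::malloc(n * sizeof(float));
    char sig;                       // the address is only used as a tag

    // Start the offload asynchronously; control returns to the host at once.
    #pragma offload target(mic:0) signal(&sig) out(a:length(n))
    {
        #pragma omp parallel for
        for (long i = 0; i < n; i++) a[i] = (float)i;
    }

    // ... the host could drive other devices or update progress here ...

    // Block until the offload tagged with &sig (and its out() copy) has finished.
    #pragma offload_wait target(mic:0) wait(&sig)

    std::printf("a[10] = %f\n", a[10]);
    std::free(a);
    return 0;
}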

Presentation parts

Part I. Acceleration of Blender Cycles Render Engine using OpenMP and MPI
Part II. Acceleration of Blender Cycles Render Engine using Intel® Xeon Phi™
  . Rendering using Intel® Xeon Phi™ and MPI

Rendering using OpenMP, Symmetric mode and MPI

[Diagram: Node 0 runs Blender on the CPU (OpenMP) together with Blender clients on MIC0 and MIC1; Node 1 … Node n each run Blender clients on the CPU, MIC0 and MIC1 (OpenMP). All processes communicate over MPI.]

build flags:
  . blender
    – WITH_IT4I_MPI=ON
  . client-cpu
    – WITH_IT4I_MIC_NATIVE=OFF
    – WITH_IT4I_MIC_OFFLOAD=OFF
  . client-mic
    – WITH_IT4I_MIC_NATIVE=ON
    – WITH_IT4I_MIC_OFFLOAD=OFF
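A minimal sketch of the process layout above (a toy protocol, not the CyclesPhi client_api; the real code exchanges the predefined tags and structures from client_api.h): rank 0 stands in for Blender and gathers rows, the remaining ranks stand in for blender_client processes rendering rows on their node.

#include <mpi.h>
#include <algorithm>
#include <vector>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int width = 1920, height = 1080;
    std::vector<float> row(width * 4);                  // RGBA pixels of one row

    if (rank == 0) {                                    // "Blender" (root)
        std::vector<float> image((size_t)width * height * 4);
        for (int i = 0; i < height; i++) {
            MPI_Status st;
            MPI_Recv(row.data(), (int)row.size(), MPI_FLOAT,
                     MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &st);
            const int y = st.MPI_TAG;                   // row index travels as the tag
            std::copy(row.begin(), row.end(), image.begin() + (size_t)y * width * 4);
        }
    } else {                                            // "blender_client" on a node
        for (int y = rank - 1; y < height; y += size - 1) {
            // ... render row y with OpenMP / offload to the local MICs ...
            MPI_Send(row.data(), (int)row.size(), MPI_FLOAT, 0, y, MPI_COMM_WORLD);
        }
    }

    MPI_Finalize();
    return 0;
}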

Rendering using OpenMP, Offload and MPI

[Diagram: Node 0 runs Blender on the CPU; Node 1 … Node n each run a Blender client on the CPU that drives MIC0 and MIC1 via OpenMP + Offload. The nodes communicate over MPI on the 7D enhanced hypercube InfiniBand network.]

build flags:
  . blender
    – WITH_IT4I_MPI=ON
  . client
    – WITH_IT4I_MIC_NATIVE=OFF
    – WITH_IT4I_MIC_OFFLOAD=ON

Presentation parts

Part I. Acceleration of Blender Cycles Render Engine using OpenMP and MPI
Part II. Acceleration of Blender Cycles Render Engine using Intel® Xeon Phi™
  . Benchmark (Tatra T87, House, Worm)

Benchmark (Tatra T87, House, Worm)

. The benchmark was run on one computing node of the Salomon supercomputer, equipped with two Intel Xeon E5-2680v3 CPUs and two Intel Xeon Phi 7120P coprocessors.
. The GPU test was run on two NVIDIA GeForce GTX 970 cards.

[Scenes: Tatra T87 (Tatra T87 by David Cloete), House (Pabellon Barcelona by Claudio Andres), Worm (Cosmos Laundromat – First Cycle)]

Performance comparison of MIC with other devices

[Bar chart: rendering time in seconds for the Tatra T87, House and Worm scenes on the CPU24, OMP24, GPU2 and MIC2 configurations; measured times range from roughly 266 s to 1506 s.]

Benchmark Worm: Strong Scalability MPI Test (offline)

. The benchmark was run on 64 computing nodes of the Salomon supercomputer, each equipped with two Intel Xeon E5-2680v3 CPUs and two Intel Xeon Phi 7120P coprocessors.
. The Worm scene (Cosmos Laundromat – First Cycle) has 13.2 million triangles.
. Resolution: 4096x2048, Samples: 1024

Benchmark Worm: Strong Scalability MPI Test (offline)

[Chart: rendering time in seconds vs. number of nodes (1–64) for the OMP24, Offload and Symmetric configurations, compared with linear scaling; times fall from over 16 000 s on a single node to a few hundred seconds on 64 nodes.]

Benchmark Tatra T87: Strong Scalability MPI Test (interactive)

. The benchmark was run on 64 computing nodes of the Salomon supercomputer, each equipped with two Intel Xeon E5-2680v3 CPUs.
. The Tatra T87 scene has 1.2 million triangles and uses HDRI lighting.
. Resolution: 1920x1080, Samples: 1

Tatra T87 by David Cloete

Benchmark Tatra T87: Strong Scalability MPI Test (interactive - 1 sample)

[Chart: time per frame in milliseconds vs. number of nodes (1–64) for the OMP24, Offload and Symmetric configurations against linear scaling, annotated with the corresponding frame rates (from about 1.9 fps up to 59 fps). Real-time rendering is achieved, so the number of samples per transfer can be increased.]

Benchmark Tatra T87: Strong Scalability MPI Test (interactive - 1 sample)

[Chart: the same scaling data for OMP24 and Offload together with weak-scaling variants (OMP24 – weak scaling, Offload – weak scaling) and linear / optimal-weak references. The annotations show roughly constant frame times as the samples per transfer are increased, e.g. about 80 ms with 2 and with 4 samples, and about 51-52 ms with 2 and with 4 samples. Real-time rendering is achieved, so the number of samples per transfer can be increased.]

References

. Jaros Milan, et al.: Acceleration of Blender Cycles Path-Tracing Engine Using Intel Many Integrated Core Architecture, CISIM 2015, Warsaw, Poland, pp. 86–97, September 2015
. Frederik Steinmetz, Gottfried Hofmann: The Cycles Encyclopedia
. http://mpitutorial.com/tutorials
. https://software.intel.com/en-us/articles/ixptc-2013-presentations
. https://wiki.blender.org

Appendix: Building and running CyclesPhi

Building Blender 2.77a with Intel Compiler 2016

. https://wiki.blender.org/index.php/Dev:Doc/Building_Blender
. Intel® Parallel Studio XE Cluster Edition (free for students)
. Intel® Manycore Platform Software Stack (Intel® MPSS)
. Microsoft Visual Studio 2013 (Windows) or NetBeans 8.1, CMake 3.3, GCC 5.3 (Linux)
. Libraries (GCC/ICC): boost (./bjam install toolset=intel), ilmbase, openimageio, zlib, Python, …

Building CyclesPhi with Intel Compiler 2016

git clone [email protected]:blender/cyclesphi.git

new build flags:
  . blender
    – WITH_IT4I_MIC_OFFLOAD=ON/OFF
    – WITH_IT4I_MPI=ON/OFF
    – WITH_OPENMP=ON/OFF (old flag)
  . client
    – WITH_IT4I_MIC_NATIVE=ON/OFF
    – WITH_IT4I_MIC_OFFLOAD=ON/OFF

Building CyclesPhi with Intel Compiler 2016

new folders:
  . it4i/client
    – /client_api.h – main header with predefined communication tags and structures
    – cycles_mic – shared libraries for rendering on Intel Xeon Phi
    – cycles_mpi – shared libraries for communication with Blender (root)
    – cycles_omp – shared libraries for rendering on the CPU using OpenMP
    – main – the blender_client application

  . it4i/scripts (created for the Salomon supercomputer with the PBS job management system)
    – build_lib.sh – build basic libraries (boost, …)
    – build_blender.sh – build Blender CyclesPhi with Intel Xeon Phi support
    – run_blender.sh – run Blender CyclesPhi without MPI
    – run_mpi.sh – run Blender CyclesPhi with MPI support

Run CyclesPhi

New scripts in it4i/scripts (created for the Salomon supercomputer):
  . run_blender.sh – run Blender CyclesPhi without MPI
      ./${ROOT_DIR}/install/blender/Blender
  . run_mpi.sh – run Blender CyclesPhi with MPI (CPU only / MIC offload mode)
      mpirun -n 1 ${ROOT_DIR}/install/blender/blender : \
             -n 1 ${ROOT_DIR}/install/blender_client/bin/blender_client
  . run_mpi_mic.sh – run Blender CyclesPhi with MPI in Symmetric mode
      mpirun -genv LD_LIBRARY_PATH $MIC_LD_LIBRARY_PATH -machine $NODEFILECN \
             -n 1 ${ROOT_DIR}/install/blender/blender : \
             -n $NUMOFCN ${ROOT_DIR}/install/blender_client/bin/blender_client
