PROGRAMMING Intel® Processor Graphics

Chi-Keung (CK) Luk, Principal Engineer, Intel Software & Services Group

Agenda
• Compute programming on Intel Graphics with:
  - OpenCL
  - CilkPlus
• Tools
• Workload performance

Acknowledgments for Slide Sources
• OpenCL: Khronos; Uri Levy, Yuval Eshkol, Doron Singer; Robert Ioffe, Aaron Kunze, Ben Ashbaugh, Stephen Junkins, Michal Mrozek
• CilkPlus: Knud J Kirkegaard, Anoop Madhusoodhanan Prabha, Konstantin Bobrovsky, Sergey Dmitriev
• VTune: Alexandr Kurylev, Julia Fedorova
• Workload Performance: Sharad Tripathi, Chinang Ma, Akhila Vidiyala; Edward Ching, Norbert Egi, Masood Mortazavi, Vivent Cheng, Guangyu Shi

OpenCL
1. Introduction
2. Optimizing OpenCL for Intel GPUs
3. Using Shared Virtual Memory (SVM)

OpenCL* (Open Computing Language)
• An open standard managed by the Khronos group
• A set of C-based APIs on the host that defines a run-time environment
• Programs written in a C-based language (C++ support since OpenCL 2.1) that run on the device(s)

Based on C99
• Mostly C-like
• Kernels (functions that execute a work-item) have a kernel prefix and a void return type:

    kernel void foo(global int* ptr)
    {
        ...
        for (int i = 0; i < 10; i++) {
            if (ptr == NULL) continue;
            ptr[i] = i;
        }
        ...
    }

• No support for library functions
  . No stdio.h / stdlib.h / math.h / etc.
  . But printf is supported

Supported data types
• Scalar data types
  . char / uchar / short / ushort / int / uint / long / ulong
  . float / double
  . size_t
  . Pointers
• Derived data types
  . Arrays
  . Structures
• Vector data types

Working with vectors
• Vectors exist for all scalar types
• Vector widths are 2, 3, 4, 8, 16
• All arithmetic operations work on vector types:

    uint3 vec0, vec1;
    ...
    uint3 result = vec0 + vec1;

• Component access (XYZW):

    double res1 = dvec0.x + dvec1.z;
    double2 res2 = dvec0.wy + dvec1.xx; // Swizzle

• Vectors wider than 4 use numeric (hexadecimal) indices:

    float res1 = vec16.s5 + vec16.sf;
    float2 res2 = vec8.s37 + vec16.sca; // Swizzle

Additional details
• Functions for querying the index:
  . get_global_id(int dimension);  // index of work-item in the entire execution
  . get_local_id(int dimension);   // index of work-item in its work-group
  . get_group_id(int dimension);   // index of the work-group
  . A few others
• Memory:
  . No support for dynamic memory allocation
  . When passing buffers as arguments, specify global / local / constant:

    kernel void foo(global const int* ptr, local float* scratch) { ...
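As a hedged illustration (not from the deck) tying these language features together, the sketch below combines work-item indexing, vector types, a swizzle, and explicit conversions in one kernel; the kernel name, buffer layout, and gain parameter are all invented for the example.

    // Illustrative sketch: brighten an RGBA image, one pixel per work-item.
    // The name "brighten" and the gain parameter are hypothetical.
    kernel void brighten(global const uchar4* in,
                         global uchar4* out,
                         float gain)
    {
        size_t gid = get_global_id(0);        // index of this work-item
        float4 px = convert_float4(in[gid]);  // explicit vector conversion
        px.xyz *= gain;                       // swizzle: scale RGB, leave alpha
        out[gid] = convert_uchar4_sat(px);    // saturating conversion back to uchar4
    }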
Built-in functions
• Many functions are supported, overloaded for all relevant types and vector widths. A small sample:
  . Math: sin / cos / min / max / log / pow / sqrt / ...
  . Geometric: dot / cross / distance / length / ...
  . Relational: isequal / isgreater / all / any / select / ...
• Example:

    kernel void dot_product(global const float4* a,
                            global const float4* b,
                            global float* out)
    {
        size_t tid = get_global_id(0);
        out[tid] = dot(a[tid], b[tid]);
    }

Casts and type conversions
• Scalar values use "normal" (C-style) casts
• Vector conversions must be done explicitly:

    dstValue = convert_destType(srcValue);

  . Source and destination types must have the same vector width
  . Example:

    int8 intVec;
    double8 dVec;
    float8 fVec = convert_float8(intVec);
    float8 fVec2 = convert_float8(dVec);

• Vector construction (from scalars):

    ushort3 vecUshort = (ushort3)(0, 12, 7);

Language summary
• Based on C99, with some extra features:
  . Vector data types
  . An extensive built-in functions library
  . Image handling, work-group synchronization
• Minus others:
  . Recursion
  . Function pointers
  . Pointers to pointers
  . Dynamic memory allocation

OpenCL
1. Introduction
2. Optimizing OpenCL for Intel GPUs
  . Use zero copying to transfer data between CPU and GPU
  . Maximize EU occupancy
  . Maximize compute performance
  . Avoid divergent control flow
  . Take advantage of the large register space
  . Optimize data accesses
3. Using Shared Virtual Memory (SVM)

Optimizing Host-to-Device Transfers: Zero Copying
• Host (CPU) and device (GPU) share the same physical memory
• For buffers allocated through the OpenCL™ runtime:
  - Let the OpenCL runtime allocate the system memory:
    . Create the buffer with CL_MEM_ALLOC_HOST_PTR
  - OR use pre-allocated system memory:
    . Create the buffer with the system memory pointer and CL_MEM_USE_HOST_PTR
    . Allocate the memory aligned to a page (4096 bytes), e.g., with _aligned_malloc or memalign
    . Allocate a multiple of the cache-line size (64 bytes)
  - Use clEnqueueMapBuffer() to access the data
  - No transfer needed (zero copy)! A minimal host-side sketch follows.
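A hedged host-side sketch of the zero-copy recipe above, assuming an existing cl_context and cl_command_queue. The helper names are hypothetical, posix_memalign is the Linux spelling (use _aligned_malloc on Windows), and error handling is omitted for brevity.

    #include <CL/cl.h>
    #include <stdlib.h>

    /* Hypothetical helper: wrap page-aligned, cache-line-padded host memory
       in a zero-copy buffer via CL_MEM_USE_HOST_PTR. */
    static cl_mem create_zero_copy_buffer(cl_context ctx, size_t nbytes,
                                          void **host_ptr_out)
    {
        size_t size = (nbytes + 63) & ~(size_t)63;  /* pad to 64-byte cache lines */
        void *host_ptr = NULL;
        posix_memalign(&host_ptr, 4096, size);      /* page-aligned allocation */
        *host_ptr_out = host_ptr;
        cl_int err;
        return clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_USE_HOST_PTR,
                              size, host_ptr, &err);
    }

    /* Access the contents without a copy: map, touch, unmap. */
    static void update_first_byte(cl_command_queue q, cl_mem buf, size_t size)
    {
        cl_int err;
        unsigned char *p = (unsigned char *)clEnqueueMapBuffer(
            q, buf, CL_TRUE /* blocking */, CL_MAP_READ | CL_MAP_WRITE,
            0, size, 0, NULL, NULL, &err);
        p[0] ^= 1;                                  /* work on the data in place */
        clEnqueueUnmapMemObject(q, buf, p, 0, NULL, NULL);
    }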
Maximizing EU Occupancy
• Occupancy is a measure of EU thread utilization
• Two primary things to consider:
  - Launch enough work-items to keep the EU threads busy
  - In short kernels: use short vector data types and compute multiple pixels per work-item to better amortize the thread-launch cost
• For example, color conversion:

  Before (one pixel per work-item):

    __global uchar *src, *dst;
    ...
    p = src[src_idx]     * B2Y +
        src[src_idx + 1] * G2Y +
        src[src_idx + 2] * R2Y;
    dst[dst_idx] = p;

  After (four pixels per work-item):

    __global uchar *src_ptr, *dst_ptr;
    ...
    uchar16 src = vload16(0, src_ptr);
    uchar4 c0 = src.s048c;
    uchar4 c1 = src.s159d;
    uchar4 c2 = src.s26ae;
    uchar4 Y = c0 * B2Y + c1 * G2Y + c2 * R2Y;
    vstore4(Y, 0, dst_ptr);

Maximize Compute Performance
• Use floats instead of integer data types
  - Because an EU can issue two float operations per cycle
• Floating-point throughput depends on the data width:
  - half (16-bit) throughput = 2 × float (32-bit) throughput
  - float (32-bit) throughput = 4 × double (64-bit) throughput
• Trade accuracy for speed, where appropriate:
  - Use the "native" built-ins (or compile with -cl-fast-relaxed-math)
  - Use mad() / fma() (or compile with -cl-mad-enable)

    x = cos(i);          // precise
    x = native_cos(i);   // faster, lower accuracy

Avoid Divergent Control Flow
• "SIMT" ISA with predication and branching
• "Divergent" code executes both branches:
  . Reduced SIMD efficiency, increased power and execution time
• Example:

    this();
    if (x)
      that();
    else
      another();
    finish();

  [Figure: two SIMD lane-versus-time diagrams comparing the case where "x" is sometimes true with the case where "x" is never true.]

Optimizing Data Accesses

Take Advantage of the Large Register Space
• Each work-item in an OpenCL™ kernel has access to 256-512 bytes of register space
• Bandwidth to registers is faster than to any memory
• Loading and processing blocks of pixels in registers is very efficient
• Use the available registers (up to 512 bytes) instead of memory, where possible:

    float sum[PX_PER_WI_X] = { 0.0f };
    float k[KERNEL_SIZE_X];                  // allocated in registers
    float d[PX_PER_WI_X + KERNEL_SIZE_X];    // allocated in registers
    // Load the filter kernel into k and the input data into d
    ...
    // Compute the convolution
    for (px = 0; px < PX_PER_WI_X; ++px)
      for (sx = 0; sx < KERNEL_SIZE_X; ++sx)
        sum[px] = mad(k[sx], d[px + sx], sum[px]);

Global and Constant Memory
• Global memory accesses go through the L3 cache
  . An L3 cache line is 64 bytes
  . EU thread accesses to the same cache line are collapsed
• The order of data within a cache line does not matter
• Bandwidth is determined by the number of cache lines accessed
  . Maximum bandwidth (L3 to EU): 64 bytes / clock / sub-slice
• Good: load at least 32 bits of data at a time, starting from a 32-bit-aligned address (a short sketch follows)
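As a hedged illustration of the last point (invented kernels, not from the deck): reading the same byte array through a uchar4 pointer turns four 8-bit accesses into one 32-bit access, so each EU thread touches fewer, fuller cache lines. The second kernel is launched with a quarter of the global size.

    // 8-bit loads/stores: one byte per work-item
    kernel void copy_bytes(global const uchar* in, global uchar* out)
    {
        size_t gid = get_global_id(0);
        out[gid] = in[gid];
    }

    // 32-bit loads/stores over the same layout: four bytes per work-item,
    // launched with global size N/4 instead of N
    kernel void copy_uchar4(global const uchar4* in, global uchar4* out)
    {
        size_t gid = get_global_id(0);
        out[gid] = in[gid];
    }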