
PROGRAMMING Intel® Processor Graphics


Chi-Keung (CK) Luk - Intel Principal Engineer

Intel Software & Services Group

Agenda
• Compute programming on Intel Graphics with:
• OpenCL
• CilkPlus
• Tools
• Workload performance

Copyright © 2015, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others. Optimization Notice

Acknowledgments for Slide Sources

• OpenCL:
. Khronos
. Uri Levy, Yuval Eshkol, Doron Singer
. Robert Ioffe, Aaron Kunze, Ben Ashbaugh, Stephen Junkins, Michal Mrozek
• CilkPlus:
. Knud J Kirkegaard, Anoop Madhusoodhanan Prabha, Konstantin Bobrovsky, Sergey Dmitriev
• VTune:
. Alexandr Kurylev, Julia Fedorova
• Workload Performance:
. Sharad Tripathi, Chinang Ma, Akhila Vidiyala
. Edward Ching, Norbert Egi, Masood Mortazavi, Vivent Cheng, Guangyu Shi

OpenCL

1. Introduction 2. Optimizing OpenCL for Intel GPUs 3. Using Shared Virtual Memory (SVM)

OpenCL* (Open Computing Language)
• An open standard managed by the Khronos Group
• A set of C-based APIs on the host that defines a run-time environment
• Programs written in a C-based language (C++ support since OpenCL 2.1) that run on the device(s)


Based on C99

Mostly C-like. Kernels (functions which execute a work-item) have a kernel prefix and a void return type:

kernel void foo (global int* ptr) {
    …
    for (int i = 0; i < 10; i++) {
        if (ptr == NULL) continue;
        ptr[i] = i;
    }
    …
}

No support for standard library functions

. No stdio.h / stdlib.h / math.h / etc. But printf is supported

Copyright © 2014, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others. Optimization Notice

Supported data types

Scalar data types

. char / uchar / short / ushort / int / uint / long / ulong

. float / double

. size_t

. Pointers

Derived data types

. Arrays

. Structures

Vector data types

Working with vectors

Vectors exist for all scalar types

Vector widths are 2, 3, 4, 8, 16

All arithmetic operations work on vector types uint3 vec0, vec1; … uint3 result = vec0 + vec1;

Component access (XYZW)

double res1 = dvec0.x + dvec1.z; double2 res2 = dvec0.wy + dvec1.xx; // Swizzle

Vectors > 4 use numeric (hexadecimal) indices

float res1 = vec16.s5 + vec16.sf; float2 res2 = vec8.s37 + vec16.sca; // Swizzle

Additional details

Functions for querying the work-item index

. get_global_id(int dimension); // Index of work-item in entire execution

. get_local_id(int dimension); // index of work-item in work-group

. get_group_id(int dimension); // index of work-group

. A few others…

Memory

. No support for dynamic memory allocation

. When passing buffers as arguments, specify global / local / constant

kernel void foo (global const int* ptr, local float* scratch) { …

Built-in functions

Many functions are supported, overloaded for all relevant types and vector widths. A tiny sample:
. Math: sin / cos / min / max / log / pow / sqrt / …
. Geometric: dot / cross / distance / length / …
. Relational: isequal / isgreater / all / any / select / …

kernel void dot_product(global const float4* a, global const float4* b, global float* out) {
    size_t tid = get_global_id(0);
    out[tid] = dot(a[tid], b[tid]);
}

Casts and type conversions

• Scalar values

• Use “normal” (C-style) casts • Vector values

• Vector conversions must be done explicitly

dstValue = convert_destType(srcValue)

• Source and destination types must have the same vector width • Example: int8 intVec; double8 dVec; float8 fVec = convert_float8(intVec); float8 fVec2 = convert_float8(dVec); • Vector construction (from scalars)

ushort3 vecUshort = (ushort3)(0, 12, 7);

Language summary

Based on C99
With some extra features:
. Vector data types
. Extensive built-in functions library
. Image handling, work-group synchronization
Minus others:
. Recursion
. Function pointers
. Pointers to pointers
. Dynamic memory allocation

OpenCL
1. Introduction
2. Optimizing OpenCL for Intel GPUs
. Use zero copying to transfer data between CPU and GPU
. Maximize EU occupancy
. Maximize compute performance
. Avoid divergent control flow
. Take advantage of large register space
. Optimize data accesses

3. Using Shared Virtual Memory (SVM)

Optimizing Host to Device Transfers: Zero Copying

• Host (CPU) and Device (GPU) share the same physical memory
• For buffers allocated through the OpenCL™ runtime:
- Let the OpenCL runtime allocate system memory
. Create the buffer with CL_MEM_ALLOC_HOST_PTR
- OR, use pre-allocated system memory
. Create the buffer with a system memory pointer and CL_MEM_USE_HOST_PTR
. Allocate system memory aligned to a page boundary (4096 bytes) (e.g., use _aligned_malloc or memalign)
. Allocate a multiple of the cache line size (64 bytes)
- Use clEnqueueMapBuffer() to access the data
. No transfer needed (zero copy)!

Maximizing EU Occupancy
• Occupancy is a measure of EU utilization
• Two primary things to consider:
- Launch enough work items to keep the EU threads busy
- In short kernels: use short vector data types and compute multiple pixels per work item to better amortize the thread launch cost
. For example, color conversion:

Before: one pixel per work item
    __global uchar* src, dst;
    p = src[src_idx] * B2Y +
        src[src_idx + 1] * G2Y +
        src[src_idx + 2] * R2Y;
    dst[dst_idx] = p;

After: four pixels per work item
    __global uchar* src_ptr, dst_ptr;
    uchar16 src = vload16(0, src_ptr);
    uchar4 c0 = src.s048c;
    uchar4 c1 = src.s159d;
    uchar4 c2 = src.s26ae;
    uchar4 Y = c0 * B2Y + c1 * G2Y + c2 * R2Y;
    vstore4(Y, 0, dst_ptr);

Maximize Compute Performance

• Use floats instead of integer data types - Because an EU can issue two float operations per cycle • Floating-point throughput depends on the data width - float16 throughput = 2 x float32 throughput - float32 throughput = 4 x float64 throughput • Trade accuracy for speed, where appropriate - Use “native” built-ins (or use -cl-fast-relaxed-math) - Use mad() / fma()(or use -cl-mad-enable)

x = cos(i); x = native_cos(i);

Avoid Divergent Control Flow

A “SIMT” ISA with predication and branching: “divergent” code executes both branches
. Reduced SIMD efficiency, increased power and execution time

Example code:
    this();
    if (x)
        that();
    else
        another();
    finish();

[Figure: per-SIMD-lane execution timelines for two cases, “x sometimes true” vs. “x never true” — when x is sometimes true, every lane pays the cost of both that() and another().]

Optimizing Data Accesses

Take Advantage of Large Register Space

• Each work item in an OpenCL™ kernel has access to up to 256-512 bytes of register space
• Bandwidth to registers is faster than to any memory
• Loading and processing blocks of pixels in registers is very efficient!

float sum[PX_PER_WI_X] = { 0.0f };
float k[KERNEL_SIZE_X];                    // allocated in registers
float d[PX_PER_WI_X + KERNEL_SIZE_X];
// Load the filter kernel into k, the input data into d
...
// Compute the convolution
for (px = 0; px < PX_PER_WI_X; ++px)
    for (sx = 0; sx < KERNEL_SIZE_X; ++sx)
        sum[px] = mad(k[sx], d[px + sx], sum[px]);

Use the available registers (up to 512 bytes) instead of memory, where possible!

Global and Constant Memory

Global memory accesses go through the L3 cache
An L3 cache line is 64 bytes
EU thread accesses to the same cache line are collapsed
• The order of data within a cache line does not matter
• Bandwidth is determined by the number of cache lines accessed
• Maximum bandwidth (L3 → EU): 64 bytes / clock / sub-slice
Good: load at least 32 bits of data at a time, starting from a 32-bit-aligned address
Best: load 4 x 32 bits of data at a time, starting from a cache-line-aligned address
• Loading more than 4 x 32 bits of data at a time is not beneficial

Example: Global and Constant Memory Accesses

Local Memory Accesses

• Local memory accesses go through the SLM (Shared Local Memory)
• Sits next to the L3 cache in the architecture
• Key difference: SLM is banked
• Banked at 4-byte granularity, with 16 banks in total
• Maximum bandwidth: still 64 bytes / clock / sub-slice
• Supports more access patterns at full bandwidth than global memory:
• Reading the same address from a bank => not a bank conflict
• Reading different addresses from a bank => a bank conflict
• Maximum bandwidth is achieved when there is no bank conflict

Example: Local Memory Accesses

OpenCL

1. Introduction 2. Optimizing OpenCL for Intel GPUs 3. Using Shared Virtual Memory (SVM)

Shared Virtual Memory (Pre-history)

Builds upon the “shared physical memory” (SPM) feature
 SPM established with OpenCL 1.0 => CL_MEM_USE_HOST_PTR flag
 Supported on Intel 3rd Gen processors with HD Graphics
 Eliminated buffer copy costs, aka “zero-copy” buffers*
 Buffer must have 4096-byte alignment and a size divisible by 64

SPM available since 2011, but still not used by many OpenCL apps…

* See “Getting the Most from OpenCL™ 1.2: How to Increase Performance by Minimizing Buffer Copies on Intel® Processor Graphics”

Shared Virtual Memory (SVM) - Basics

3 types of SVM

Coarse-grain buffers (Intel 5th Gen Processors w/ HD Graphics 5300)
 SVM buffers are mapped to either the CPU or the GPU at any given time
 Access is controlled by clEnqueueSVMMap/clEnqueueSVMUnmap commands

Fine-grain buffers (Intel 5th Gen Processors w/ HD Graphics 5500+)  SVM buffers can be accessed from either CPU or GPU at any time  Use atomics to control access (if CPU & GPU may try to modify the same memory location)  Check CL_DEVICE_SVM_FINE_GRAIN_BUFFER for fine-grained buffer SVM support, CL_DEVICE_SVM_ATOMICS is for atomics support

Fine-grain system memory (Future Intel Processors)
 CPU & GPU can share anything allocated from the C-runtime heap (i.e., malloc/new)

Coarse-grain buffers, un-mapped state: only the GPU can access the buffer.

Coarse-grain buffers, mapped state: only the CPU can access the buffer.

A fine-grain SVM buffer allows simultaneous access from the CPU & GPU (use atomics to avoid race conditions; check the CL_DEVICE_SVM_FINE_GRAIN_BUFFER & CL_DEVICE_SVM_ATOMICS flags).

Fine-grain system memory is the ideal end state: it requires convergence of OS, hardware, and API support.

Full CPU/GPU memory coherency for all heap allocations.

SVM --- API Basics

SVM --- Kernel Setup

Agenda

• Compute programming on Intel Graphics with: • OpenCL • CilkPlus • Tools • Workload performance

Intel CilkPlus for Parallel Programming

• Adds language extensions to C/C++ programs to exploit parallelism. Comes in two flavors:
• Cilk (originally from MIT)
• OpenMP
• Kinds of parallelism exploited:
• Task-level
• Loop-level
• SIMD-level
• Originally designed for CPUs, now also supports Intel Graphics
• Unlike OpenCL, no separation of host and device programs

Example: Serial C++ version

void vecadd(int n, float *a, float *b, float *c) {
    for (int i = 0; i < n; i++) {
        a[i] = b[i] + c[i];
    }
}

Example: Parallel CPU version (Cilk flavor)

void vecadd(int n, float *a, float *b, float *c) {
    _Cilk_for (int i = 0; i < n; i++) {
        a[i] = b[i] + c[i];
    }
}

Example: Intel Graphics version (Cilk flavor)

void vecadd(int n, float *a, float *b, float *c) {
    #pragma offload target(gfx) pin(a, b, c : length(n))
    _Cilk_for (int i = 0; i < n; i++) {
        a[i] = b[i] + c[i];
    }
}

Example: Parallel CPU version (OpenMP flavor)

void vecadd(int n, float *a, float *b, float *c) {
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        a[i] = b[i] + c[i];
    }
}

Example: Intel Graphics version (OpenMP flavor)

void vecadd(int n, float *a, float *b, float *c) {
    #pragma omp target(gfx) \
        map(tofrom: a[0:n], b[0:n], c[0:n]) map(to: n)
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        a[i] = b[i] + c[i];
    }
}

CilkPlus keywords

• Cilk Plus adds three keywords to C and C++: _Cilk_spawn, _Cilk_sync, _Cilk_for
• If you #include <cilk/cilk.h>, you can write the keywords as cilk_spawn, cilk_sync, and cilk_for.
• The Cilk Plus runtime controls thread creation and scheduling.
• For GFX offload, only _Cilk_for is supported
– No Cilk Plus runtime on the target
– Scheduling happens on the host side

Software and Services Group Optimization Notice

Copyright© 2015, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.

cilk_for loop

• Looks like a normal for loop. cilk_for (int x = 0; x < 1000000; ++x) { … } • Any or all iterations may execute in parallel with one another. • All iterations complete before program continues. • Constraints: – Limited to a single control variable. – Must be able to jump to the start of any iteration at random. – Iterations should be independent of one another.


Array Notation (1)

• Use a “:” in array subscripts to operate on multiple elements A[:] // all of array A A[lower_bound : length] A[lower_bound : length : stride]

Explicit data parallelism, based on C/C++ arrays


Array Notation (2)

• Example: A[:] = B[:]+C[:] • An extension to C/C++ • Perform operations on sections of arrays in parallel • Extension has parallel semantics • Well suited for code that: – performs per-element operations on arrays, – without an implied order between them – with an intent to execute in vector instructions


Operations on Array Sections (1)

• C/C++ operators d[:] = a[:] + (b[:] * c[:])

• Function calls b[:] = foo(a[:]); // Call foo() on each element of a[]

• Reductions combine array elements to get a single result // Add all elements of a[] sum = __sec_reduce_add(a[:]); // More reductions exist...

• If-then-else and conditional operators allow masked operations if (mask[:]) { a[:] = b[:]; // If mask[i] is true, a[i]=b[i] }


Operations on Array Sections (2)

• An implicit index fills an array with index values
– __sec_implicit_index(0) // 1st rank section index
– __sec_implicit_index(1) // 2nd rank section index

// fill A with values 0,1,2,3,4.... A[:] = __sec_implicit_index(0);

// fill B[i][j] with i+j B[:][:] = __sec_implicit_index(0) + __sec_implicit_index(1);

// fill the lower-left triangle of C with 1 C[0:n][0:__sec_implicit_index(0)] = 1;


SIMD Functions (1)

• A general construct to express data parallelism:
– Write a function that describes the operation on a single element
• Annotate it with one of: __declspec(vector) or __attribute__((vector))
– Invoke the function across a parallel data structure (arrays) or from within a vectorizable loop.


SIMD Functions (2)

• Polymorphic: a vectorizing compiler may create both array and scalar versions of the function.

• Writing the function is independent of its invocation
– The function can be invoked on scalars, within serial for or cilk_for loops, using array notation, etc.


SIMD Functions - Example

• Defining an elemental function:

__declspec (vector)
double option_price_call_black_scholes(
        double S, double K, double r,
        double sigma, double time)
{
    double time_sqrt = sqrt(time);
    double d1 = (log(S/K) + r*time)/(sigma*time_sqrt) + 0.5*sigma*time_sqrt;
    double d2 = d1 - (sigma*time_sqrt);
    return S*N(d1) - K*exp(-r*time)*N(d2);
}

The compiler breaks the data into SIMD vectors and calls the function on each vector.

• Invoking the elemental function:

// The following loop can also use cilk_for
call[0:N] = option_price_call_black_scholes(S[0:N], K[0:N], r, sigma, time[0:N]);


SIMD Annotations

• The loop annotation informs the compiler that the vectorized loop will have the same semantics as the serial loop:

void f(float *a, const float *b, const int *e, int n) {
    #pragma simd
    for (int i = 0; i < n; ++i)
        a[i] = 2 * b[e[i]];
}

Potential aliasing and loop-carried dependencies would otherwise thwart auto-vectorization.

• Currently implemented as a pragma, but other methods of annotating the loop can be considered.
• Additional clauses for reductions and other vectorization guidance (matching OpenMP*)


CilkPlus Example: N-Body simulation


N-Body simulation

- N particles
- Each has its own mass, position, and velocity in 3D space
- Particles move under the influence of mutual gravitational forces
- The classic simulation has O(N²) computational complexity


N-Body simulation: main loop

_Cilk_for (int i = start; i < end; ++i) {
    Vector3 acc = 0.0f;
    for (int j = 0; j < body_count; ++j) {
        Vector3 dist = old_pos[j] - old_pos[i];
        float len = sqrtf(dot(dist, dist) + epsilon);
        acc += dist * masses[j] / (len * len * len);
    }

    new_vel[i] = old_vel[i] + acc * time;
    new_pos[i] = old_pos[i] + old_vel[i] * time + acc * time * time / 2;
}

The straightforward approach is to add #pragma offload to the main loop:


N-Body simulation: #pragma offload

#pragma offload target(gfx) pin(old_pos, old_vel, new_pos, new_vel : length(body_count))
_Cilk_for (int i = start; i < end; ++i) {
    Vector3 acc = 0.0f;
    for (int j = 0; j < body_count; ++j) {
        Vector3 dist = old_pos[j] - old_pos[i];
        float len = sqrtf(dot(dist, dist) + epsilon);
        acc += dist * masses[j] / (len * len * len);
    }

    new_vel[i] = old_vel[i] + acc * time;
    new_pos[i] = old_pos[i] + old_vel[i] * time + acc * time * time / 2.0f;
}


N-Body simulation: Blocking

#pragma offload target(gfx) pin(old_pos, old_vel, new_pos, new_vel : length(body_count))

_Cilk_for (int i = start; i < end; i += TILE) {
    GFXVector3 pos[TILE];
    GFXVector3 acc[TILE];
    pos[0:TILE] = old_pos[i:TILE];
    acc[0:TILE] = 0.0f;

    for (int j = 0; j < body_count; j += TILE) {
        GFXVector3 tpos[TILE];
        float tmass[TILE];
        tpos[0:TILE] = old_pos[j:TILE];
        tmass[0:TILE] = masses[j:TILE];

        for (int t = 0; t < TILE; t++) {
            for (int k = 0; k < TILE; k++) {
                GFXVector3 dist = tpos[t] - pos[k];
                float inv_len = 1.0f / sqrtf(dot(dist, dist) + epsilon);
                acc[k] += dist * tmass[t] * inv_len * inv_len * inv_len;
            }
        }
    }
    new_vel[i:TILE] = old_vel[i:TILE] + acc[0:TILE] * time;
    new_pos[i:TILE] = pos[0:TILE] + old_vel[i:TILE] * time + acc[0:TILE] * time * time / 2.0f;
}

CilkPlus advanced features

• Static data declaration • Separate file compilation (linking) • Recursive functions • Shared Local Memory (SLM) • Shared Virtual Memory (SVM)


Where to get Intel OpenCL and Intel CilkPlus?

• Intel OpenCL: – https://software.intel.com/en-us/intel-opencl

• Intel CilkPlus: – https://software.intel.com/en-us/intel-cilk-plus


Agenda

• Compute programming on Intel Graphics with: • Tools • VTune • GT-Pin • Workload performance

VTune

• A platform-wide performance profiler:
• Memory accesses, storage and I/O analysis, CPU/GPU concurrency, …
• Intel GPU analysis in VTune:
. Which compute APIs are used (OpenCL, CilkPlus, Media SDK)
. When and which GPU units were utilized by those APIs
. A rich set of hardware metrics showing how the actual machine was utilized
. Hints for performance issues
. OpenCL source / Gen assembly view and some ability to map performance data back to the original code
https://software.intel.com/en-us/intel-vtune-amplifier-xe

CPU + GPU Utilization (on a Media + OpenCL app.)

Compute API Profiling (on an OpenCL application)

[Screenshots: OpenCL kernels and host-side API calls on the timeline, with performance data attributed to them; GEN GPU engine utilization; GPU hardware metrics over time; software queue and OpenCL queue utilization.]

Architecture Diagram

EU Dynamic Instruction Count

EU Instruction Latency

GT-Pin

• A Pin-like binary instrumentation tool for the EU in Intel GPUs • Command-line interface • Used to build a wide range of tools for: • Performance analysis, workload tracing, debugging • Providing instruction count and latency data to VTune • Support: • OpenCL, CilkPlus, DirectX, OpenGL • Windows, , Android, OS-X • Will provide API to allow users to write their own tools • Availability: first public release in 2016

Sample GT-Pin Tool: Opcodeprof

Workload Analysis with Opcodeprof

Sample GT-Pin Tool: Cacheprof (a cache simulator)
• Traces accesses to the data ports from the EUs
• Simulates the Intel GPU cache hierarchy:

[Diagram: Instruction Cache / Constant Cache / Sampler Cache / Render Cache → L3 Cache → LLC]

• Can also dump memory traces to files for offline analysis

Working-set Size Analysis with Cacheprof

Benchmark = gaussian_blur filter

[Chart: cache hit rate vs. simulated cache size, revealing the working-set size.]

Advanced Tool: GT-PinPoints
• A tool to find representative regions in OpenCL traces for GPU simulations and workload analysis
• Based on the proven SimPoint methodology
• Overview: [flow diagram]

• Results:
• 223x simulation speedup for 3.0% error, or
• 35x simulation speedup for 0.3% error
http://www.cs.columbia.edu/~melanie/iiswc2015.pdf

Agenda

• Compute programming on Intel Graphics with: • Tools • Workload performance • Photoshop • Database

Adobe Photoshop

• The de facto industry-standard raster graphics editing software, developed by Adobe
• Our workloads have 22 feature tests:
• Each does image processing such as applying filters, blurring images, adding effects, etc.

Photoshop Experimental Setup

Compare the products with the highest theoretical FLOPs from Intel and Nvidia:
• Intel Graphics (integrated)
• Iris Pro 6200 (Broadwell GT3e)
• Theoretical FLOPs = 883 GFLOPS
• TDP = 47 W
• Actual power measured via a public tool called GPU-Z
• Nvidia GPU (discrete)
• GTX Titan Z
• Theoretical FLOPs = 8122 GFLOPS
• TDP = 375 W
• Actual power measured via an Nvidia tool called nvidia-smi

Performance Comparison

Nvidia is 1-2.5x faster

Source: Intel. See the Photoshop Experimental Setup slide for the experiment configuration and the FTC disclaimer slide.

GPU Power Measured

Intel consumes 8-14x less power

Source: Intel. See the Photoshop Experimental Setup slide for the experiment configuration and the FTC disclaimer slide.

GPU Energy Consumed

Intel consumes 5-16x less energy

Source: Intel. See the Photoshop Experimental Setup slide for the experiment configuration and the FTC disclaimer slide.

Comparing Integrated and Discrete GPUs on Database Processing

Paper: Unleashing the Hidden Power of Integrated-GPUs for Database Co- Processing, by E. Ching et al. from FutureWei http://subs.emis.de/LNI/Proceedings/Proceedings232/1755.pdf

Experimental setup

Data transfer time

Source: FutureWei Technologies. See the Experimental setup slide for the experiment configuration and the FTC disclaimer slide.

TPC-H decision support benchmark (TPC-H Q1)

Execution time (lower is better) Throughput per Watt (higher is better)

Source: FutureWei Technologies. See the Experimental setup slide for the experiment configuration and the FTC disclaimer slide.

TPC-H decision support benchmark (TPC-H Q9)

Execution time (lower is better) Throughput per Watt (higher is better)

Source: FutureWei Technologies. See the Experimental setup slide for the experiment configuration and the FTC disclaimer slide.

Summary

• Intel processor graphics is integrated on-die with the CPU:
• No extra cost for the GPU
• High performance
• Energy efficient
• Intel processor graphics comes with a rich software ecosystem:
• Supports most standard graphics/compute programming APIs
• Modern highly-
• Helpful tools
• Call to action:
• Try programming the integrated graphics on your Intel-based machine!
• Visit this tutorial's webpage for the slides and more information

FTC disclaimer

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.

Legal Disclaimer and Optimization Notice

INFORMATION IN THIS DOCUMENT IS PROVIDED “AS IS”. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY RELATING TO THIS INFORMATION, INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.

© 2015, Intel Corporation. All rights reserved. Intel, the Intel logo, Core, Iris, VTune, and others are trademarks of Intel Corporation in the U.S. and/or other countries.

Optimization Notice
Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice. Notice revision #20110804

Legal Notices and Disclaimers

Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, software or service activation. Learn more at intel.com, or from the OEM or retailer.

No computer system can be absolutely secure.

Tests document performance of components on a particular test, in specific systems. Differences in hardware, software, or configuration will affect actual performance. Consult other sources of information to evaluate performance as you consider your purchase. For more complete information about performance and benchmark results, visit http://www.intel.com/performance.

Intel, the Intel logo and others are trademarks of Intel Corporation in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others.

© 2015 Intel Corporation.