
Intel Advisor

Kirill Rogozhin, November 2019

Roofline: Characterize and optimize on CPU (Vectorization, Memory, Threading)
▪ omp simd, omp declare simd, omp parallel for, omp task, ...

Offload Advisor: identify best offload candidates
▪ omp target, omp teams, omp map, ...

Roofline: Characterize and optimize on GPU

2 Advisor Roofline Characterize and optimize CPU code

3 What is the Roofline model?

Characterization of your application performance in the context of the hardware.

It uses two simple metrics:
▪ Flop count
▪ Bytes transferred

Roofline chart (X axis: arithmetic intensity, FLOP/byte; Y axis: FLOPS):
▪ The CPU’s maximum FLOPS can be plotted as a horizontal line.
▪ The maximum FLOPS as a product of arithmetic intensity (AI, flops/byte) and maximum supplied bytes per second is a diagonal line.
▪ Each dot is a loop or function.
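The two roofs can be combined into a single bound: attainable FLOPS is the lower of the horizontal compute roof and the diagonal memory roof. The sketch below is only an illustration of that rule; the function and parameter names are hypothetical and not part of Intel® Advisor.

```c
#include <assert.h>

/* Attainable performance (GFLOPS) under the basic roofline model.
 * All names here are illustrative, not an Advisor API. */
double roofline_bound_gflops(double ai_flop_per_byte,
                             double peak_gflops,
                             double peak_bandwidth_gbs)
{
    assert(ai_flop_per_byte >= 0.0);
    assert(peak_gflops > 0.0 && peak_bandwidth_gbs > 0.0);

    /* Diagonal line: arithmetic intensity times supplied bytes per second. */
    double memory_roof = ai_flop_per_byte * peak_bandwidth_gbs;

    /* Horizontal line: the compute peak caps the diagonal. */
    return memory_roof < peak_gflops ? memory_roof : peak_gflops;
}
```

For example, with an assumed peak of 1000 GFLOPS and 100 GB/s, a loop at 0.5 FLOP/byte is limited to 50 GFLOPS by memory, while a loop at 20 FLOP/byte is limited by the compute peak.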

4 The Roofline Chart in Intel® Advisor

A single button or CLI command runs the Survey and FLOPS analyses to generate the Roofline chart.

Focus on big dots with headroom.
▪ Intel® Advisor sizes and color-codes dots by relative time taken.
▪ Space between dots and their uppermost limits is room to improve.

Roofs above a dot indicate potential sources of bottlenecks.

5 Getting Roofline data in Intel® Advisor

Roofline:
▪ Axis X: AI = #FLOP / #Bytes
▪ Axis Y: FLOP/S = #FLOP (mask aware) / #Seconds

Step 1: Survey (-collect survey), overhead 1x
- Provides #Seconds
- Root access not needed
- User-mode sampling, non-intrusive

Step 2: FLOPS (-collect tripcounts -flops), overhead 5-10x
- Provides #FLOP, #Bytes, AVX-512 mask
- Root access not needed
- Precise, instrumentation-based, counts the number of instructions
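As a rough illustration of how the three collected quantities turn into one dot on the chart, the sketch below computes both axis values. The struct and function names are hypothetical; they only mirror the #FLOP, #Bytes and #Seconds quantities named above.

```c
#include <assert.h>

/* Hypothetical container for the quantities the two collections provide. */
struct loop_counts {
    double flop;     /* #FLOP, from the FLOPS analysis */
    double bytes;    /* #Bytes, from the FLOPS analysis */
    double seconds;  /* #Seconds, from the Survey analysis */
};

/* X axis: AI = #FLOP / #Bytes */
double arithmetic_intensity(struct loop_counts c)
{
    assert(c.bytes > 0.0);
    return c.flop / c.bytes;
}

/* Y axis: FLOP/S = #FLOP / #Seconds, reported here in GFLOPS */
double gflops(struct loop_counts c)
{
    assert(c.seconds > 0.0);
    return c.flop / c.seconds / 1e9;
}
```

A loop that executes 4e9 FLOP over 16e9 bytes in 2 seconds would land at AI = 0.25 FLOP/byte and 2 GFLOPS.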

6 Questions to answer with Roofline

1. Am I doing well? How far am I from the peak? (Do I utilize the hardware well or not?)
2. Final bottleneck? (Where will my limit be after I have done all optimizations?) Long-term ROI, optimization strategy.

▪ Big optimization gap: platform underutilization.
▪ Memory bound: invest in blocking etc.
▪ Compute bound: invest in SIMD, ...

7 Advisor Roofline Characterize and optimize CPU code: Threading

8 Roofline: iso3dfd example

Adjust roofs to #cores, ranks, NUMA in your OpenMP application:
▪ OMP_NUM_THREADS
▪ KMP_AFFINITY
▪ KMP_HW_SUBSET
▪ ...

Hotspot OpenMP loop

9 Roofline: iso3dfd example

▪ Top max GFLOPS limit increases with number of threads
▪ Compare with previous version

Visualized performance difference after adding #pragma omp parallel for
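A minimal sketch of the kind of change this slide refers to, using a simplified 1D three-point stencil as a stand-in for the iso3dfd hotspot loop (the real kernel is 3D and is not shown in the deck). The function name and coefficients are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified 1D stencil standing in for the iso3dfd hotspot loop.
 * Each iteration writes a distinct next[i] and only reads prev[],
 * so iterations are independent and can be shared across threads.
 * Build with an OpenMP flag (for example -fopenmp with GCC or Clang);
 * without it the pragma is ignored and the loop runs serially. */
void stencil_step(const float *prev, float *next, size_t n,
                  float c0, float c1)
{
    assert(n >= 2);

    #pragma omp parallel for
    for (long i = 1; i < (long)n - 1; ++i) {
        next[i] = c0 * prev[i] + c1 * (prev[i - 1] + prev[i + 1]);
    }
}
```

The thread count for such a loop is then controlled from the environment, for example through OMP_NUM_THREADS.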

10 Advisor Roofline Characterize and optimize CPU code: Memory

11 Integrated Roofline: DRAM is the bottleneck

CARM Roofline Guidance: either DRAM or LLC is the bottleneck

12 Memory access pattern analysis

How should I access data?

Unit-stride accesses are faster

for (i=0; i

Constant-stride accesses are usually worse

for (i=0; i

Non-predictable accesses are usually bad

for (i=0; i
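The loop bodies on this slide did not survive, so the three functions below are illustrative stand-ins rather than the original examples. They show one loop per access pattern, with hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* Unit stride: consecutive elements are read one after another. */
double sum_unit_stride(const double *a, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; ++i)
        s += a[i];
    return s;
}

/* Constant stride: every stride-th element is read, so part of each
 * fetched cache line goes unused. */
double sum_constant_stride(const double *a, size_t n, size_t stride)
{
    assert(stride > 0);
    double s = 0.0;
    for (size_t i = 0; i < n; i += stride)
        s += a[i];
    return s;
}

/* Non-predictable access: the address depends on data in idx[],
 * which makes prefetching and vectorization harder. */
double sum_indirect(const double *a, const size_t *idx, size_t m)
{
    double s = 0.0;
    for (size_t i = 0; i < m; ++i)
        s += a[idx[i]];
    return s;
}
```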

13 Memory Access Patterns

Strided access in Memory object OpenMP region

14 Advisor Roofline Characterize and optimize CPU code: Vectorization

15 Advisor Vectorization analysis

▪ Filter by which loops are vectorized!
▪ What prevents vectorization?
▪ Trip Counts
▪ Focus on hot loops
▪ What vectorization issues do I have?
▪ Which Vector instructions are being used?
▪ How efficient is the code?

16 Enabling Vectorization

▪ Check dependencies
▪ Use #pragma omp simd
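A minimal sketch of a loop where the pragma is safe, assuming the dependency check has shown that iterations are independent. The saxpy-style function is a hypothetical example, not code from the deck.

```c
#include <assert.h>
#include <stddef.h>

/* y[i] += a * x[i].
 * restrict states that x and y do not alias, so no iteration reads a
 * value written by another one. With that established, the pragma asks
 * the compiler to vectorize the loop. It takes effect when OpenMP SIMD
 * support is enabled (for example -fopenmp or -fopenmp-simd with GCC). */
void saxpy(size_t n, float a, const float *restrict x, float *restrict y)
{
    assert(x != NULL && y != NULL);

    #pragma omp simd
    for (size_t i = 0; i < n; ++i) {
        y[i] += a * x[i];
    }
}
```

If the analysis reports a real loop-carried dependency, the pragma should not be added as is, because it would assert independence that does not hold.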

17 Offload Advisor: Define regions to be offloaded

18 Offload Advisor Summary

19 Explore offload candidates

Offload candidates

Performance bounding factors

20 Offloaded regions details and source view

The region can be 4.27x faster on Gen11

Add “#pragma omp target” here

21 Data transfer and memory mapping

map(to:a) map(from:a) map(tofrom:a)

Total estimated data traffic
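The three map directions can be sketched on a small example. The functions below are hypothetical and only illustrate how each clause moves data around an offloaded region; without offload support in the compiler, the regions simply run on the host.

```c
#include <assert.h>

/* Inputs are only read on the device: map(to:) copies them host to device.
 * The result is only written on the device: map(from:) copies it back. */
void vector_add_offload(const float *a, const float *b, float *c, int n)
{
    assert(n >= 0);

    #pragma omp target teams distribute parallel for \
            map(to: a[0:n], b[0:n]) map(from: c[0:n])
    for (int i = 0; i < n; ++i) {
        c[i] = a[i] + b[i];
    }
}

/* The array is read and updated in place: map(tofrom:) copies it in
 * both directions, which costs the most data traffic of the three. */
void scale_offload(float *a, int n, float factor)
{
    assert(n >= 0);

    #pragma omp target teams distribute parallel for map(tofrom: a[0:n])
    for (int i = 0; i < n; ++i) {
        a[i] *= factor;
    }
}
```

Choosing the narrowest direction that is still correct keeps the transferred volume down.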

22 GPU Roofline Characterize and optimize GPU code

23 Roofline for GPU

▪ Preview feature, supports Gen9 GPU
▪ Command line data collection
▪ HTML export as UI

Memory traffic across all memory levels on GPU

24 GPU Roofline

Tiled version: higher FLOPS, higher FLOP/byte

Basic version
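The deck does not name the kernel behind these two dots. Assuming a matrix multiplication, which is a common case for this comparison, the host-side sketch below shows what "basic" and "tiled" usually mean: the same arithmetic, reordered so that each loaded block is reused before it is evicted. The function names and TILE value are illustrative, and the code is plain C rather than the GPU kernel from the slide.

```c
#include <assert.h>

#define TILE 4  /* illustrative tile size */

/* Basic version: C = A * B for n x n row-major matrices. */
void matmul_basic(int n, const float *A, const float *B, float *C)
{
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < n; ++j) {
            float acc = 0.0f;
            for (int k = 0; k < n; ++k)
                acc += A[i * n + k] * B[k * n + j];
            C[i * n + j] = acc;
        }
    }
}

/* Tiled version: identical FLOP count, but work proceeds in
 * TILE x TILE blocks so data fetched from slow memory is reused,
 * which reduces bytes moved per FLOP. Requires n divisible by TILE. */
void matmul_tiled(int n, const float *A, const float *B, float *C)
{
    assert(n > 0 && n % TILE == 0);

    for (int i = 0; i < n * n; ++i)
        C[i] = 0.0f;

    for (int ii = 0; ii < n; ii += TILE)
        for (int kk = 0; kk < n; kk += TILE)
            for (int jj = 0; jj < n; jj += TILE)
                for (int i = ii; i < ii + TILE; ++i)
                    for (int k = kk; k < kk + TILE; ++k) {
                        float a = A[i * n + k];
                        for (int j = jj; j < jj + TILE; ++j)
                            C[i * n + j] += a * B[k * n + j];
                    }
}
```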

25 Intel Advisor

Roofline: Characterize and optimize on CPU (Vectorization, Memory, Threading)
▪ #pragma omp simd, #pragma omp declare simd, ...

Offload Advisor: identify best offload candidates
▪ #pragma omp target, omp teams, ...

Roofline: Characterize and optimize on GPU
