Pragma Omp Parallel For

Kirill Rogozhin, November 2019 Intel Advisor Roofline: Characterize Offload Advisor: identify Roofline: Characterize and optimize on CPU best offload candidates and optimize on GPU omp target Vectorization Memory Threading omp teams omp map... omp simd omp parallel_for omp declare simd omp task ... ... 2 Advisor Roofline Characterize and optimize CPU code 3 What is the Roofline Model? Characterization of your application performance in the context of the hardware The maximum FLOPS as a product of It uses two simple metrics ops/byte (AI) and maximum bytes supplied FLOPS ▪ Flop count per second is a diagonal line. ▪ Bytes transferred The CPU’s maximum FLOPS can be plotted as a horizontal line. A loop or function Arithmetic Intensity FLOPS/Byte 4 The Roofline Chart in Intel® Advisor 1 2 3 A single button or CLI command runs the Survey and FLOPS analyses to generate the Roofline chart. Focus on big dots with headroom. ▪ Intel® Advisor sizes and color- codes dots by relative time taken. ▪ Space between dots and their uppermost limits is room to improve. Roofs above a dot indicate potential sources of bottlenecks. 5 Getting Roofline data in Intel® Advisor Roofline : Axis X: AI = #FLOP / #Bytes Overhead Axis Y: FLOP/S = #FLOP (mask aware) / #Seconds Step 1: Survey (-collect survey) - Provide #Seconds 1x - Root access not needed - User mode sampling, non-intrusive. Step 2: FLOPS (-collect tripcounts –flops) - Provide #FLOP, #Bytes, AVX-512 Mask - Root access not needed 5-10x - Precise, instrumentation based, count number of instructions 6 Questions to answer with Roofline Final Bottleneck? 1 Am I doing well? How far am I from 2 the pick? (where will be my limit after I done all optimizations?) (do I utilize hardware well or not?) Long-term ROI, optimization strategy Big optimization gap. Platform underutilization. Memory-bound, invest into Compute bound: invest into cache blocking etc SIMD,.. 7 Advisor Roofline Characterize and optimize CPU code: Threading 8 Roofline: iso3dfd example Adjust roofs to #cores, ranks, NUMA in your OpenMP application: OMP_NUM_THREADS KMP_AFFINITY KMP_HW_SUBSET …. Hotspot OpenMP loop 9 Roofline: iso3dfd example Top max GFLOPS limit increases Compare with previous version with number of threads Visualized performance difference after adding #pragma omp parallel for 10 Advisor Roofline Characterize and optimize CPU code: Memory 11 Integrated Roofline: DRAM is the bottleneck CARM Roofline Guidance: either DRAM or LLC is the bottleneck 12 Memory access pattern analysis How should I access data ? Unit stride access are faster for (i=0; i<N; i++) A[i] = B[i]*d Constant stride are usually worse for (i=0; i<N; i+=2) A[i] = B[i]*d Non predictable access are usually bad for (i=0; i<N; i++) A[i] = B[C[i]]*d 13 Memory Access Patterns Strided access in Memory object OpenMP region 14 Advisor Roofline Characterize and optimize CPU code: Vectorization 15 Advisor Vectorization analysis Filter by which loops What prevents Trip Counts are vectorized! vectorization? Focus on What vectorization Which Vector instructions How efficient is hot loops issues do I have? are being used? the code? 16 Enabling Vectorization 2 1 Check dependencies 3 Use #pragma omp simd 18 Offload Advisor Define regions to be offloaded 18 Offload Advisor Summary 19 Explore offload candidates Offload candidates Performance bounding factors 20 Offloaded regions details and source view The region can be 4.27x faster on Gen11 Add “#pragma omp target” here 21 Data transfer and memory mapping map(to:a) map(from:a) map(tofrom:a) Total estimated data traffic 22 GPU Roofline Characterize and optimize GPU code 23 Roofline for GPU ▪ Preview feature, supports Gen9 GPU ▪ Command line data collection ▪ HTML export as UI Memory traffic across all memory levels on GPU 24 GPU Roofline Tiled version: higher FLOPS, higher FLOP/byte Basic version 25 Intel Advisor Roofline: Characterize Offload Advisor: identify Roofline: Characterize and optimize on CPU best offload candidates and optimize on GPU #pragma omp target Vectorization Memory Threading omp teams ... #pragma omp simd #pragma omp simd #pragma omp declare simd #pragma omp declare simd ... ... 26 .

Pragma Omp Parallel For

UNIT 8B a Full Adder

With Extreme Scale Computing the Rules Have Changed

Theoretical Peak FLOPS Per Instruction Set on Modern Intel Cpus

Misleading Performance Reporting in the Supercomputing Field David H

Intel Xeon & Dgpu Update

Chapter 4 Data-Level Parallelism in Vector, SIMD, and GPU Architectures

Summarizing CPU and GPU Design Trends with Product Data

Theoretical Peak FLOPS Per Instruction Set on Less Conventional Hardware

Parallel Programming – Multicore Systems

PAKCK: Performance and Power Analysis of Key Computational Kernels on Cpus and Gpus

Comparing Performance and Energy Efficiency of Fpgas and Gpus For

Measurement, Control, and Performance Analysis for Intel Xeon Phi