Intel® Offload Advisor
Total Page:16
File Type:pdf, Size:1020Kb
NHR@ZIB - Intel oneAPI Workshop, 2-3 March 2021 Intel® Advisor Offload Modelling and Analysis Klaus-Dieter Oertel Intel® Advisor for High Performance Code Design Rich Set of Capabilities Offload Modelling Design offload strategy and model performance on GPU. One Intel Software & Architecture (OISA) 2 Agenda ▪ Offload Modelling ▪ Roofline Analysis – Recap ▪ Roofline Analysis for GPU code ▪ Flow Graph Analyzer One Intel Software & Architecture (OISA) 3 Offload Modelling 4 Intel® Advisor - Offload Advisor Find code that can be profitably offloaded Starting from an optimized binary (running on CPU): ▪ Helps define which sections of the code should run on a given accelerator ▪ Provides performance projection on accelerators One Intel Software & Architecture (OISA) 5 Intel® Advisor - Offload Advisor What can be expected? Speedup of accelerated code 8.9x One Intel Software & Architecture (OISA) 6 Modeling Performance Using Intel® Advisor – Offload Advisor Baseline HW (Programming model) Target HW 1. CPU (C,C++,Fortran, Py) CPU + GPU measured measured estimated 1.a CPU (DPC++, OCL, OMP, CPU + GPU measured “target=host”) measured estimated 2 CPU+iGPU (DPC++, OCL, OMP, CPU + GPU measured “target=offload”) measured Estimated Optimization Notice Copyright © 2019, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others. Modeling Performance Using Intel® Advisor – Offload Advisor Region X Region Y Execution time on baseline platform (CPU) • Execution time on accelerator. Estimate assuming bounded exclusively by Compute • Execution time on accelerator. Estimate assuming bounded exclusively by caches/memory • Offload Tax estimate (data transfer + invoke) Final estimated time on target device (GPU) X’ Y’ t X – profitable to Y - too much overhead, accelerate, t(X) > t(X’) not accelerable, t(Y)<t(Y’) t region = max(tcompute, tmemory subsystem) + tdata transfer tax + tinvocation tax 8 One Intel Software & Architecture (OISA) Intel Confidential 8 Will Offload Increase Performance? Good Candidates to offload What is workload bounded by Bad Candidates One Intel Software & Architecture (OISA) 9 In-Depth Analysis of Top Offload Regions ▪ Provides a detailed description of each loop interesting for offload ▪ Timings (total time, time on the accelerator, speedup) This is where you will use DPC++ or OMP offload . ▪ Offload metrics (offload tax data transfers) ▪ Memory traffic (DRAM, L3, L2, L1), trip count ▪ Highlight which part of the code should run on the accelerator One Intel Software & Architecture (OISA) 10 What Is My Workload Bounded By? Predict performance on future GPU hardware. 95% of workload bounded by L3 bandwidth but you may have several bottlenecks. One Intel Software & Architecture (OISA) 11 Will the Data Transfer Make GPU Offload Worthwhile? Memory Memory objects histogram Total data transferr ed One Intel Software & Architecture (OISA) 12 What Kernels Should Not Be Offloaded? ▪ Explains why Intel® Advisor doesn’t recommend a given loop for offload ▪ Dependency issues ▪ Not profitable ▪ Total time is too small One Intel Software & Architecture (OISA) 13 Compare Acceleration on Different GPUs Gen9 – Not profitable to offload kernel Gen11 – 1.6x speedup One Intel Software & Architecture (OISA) 14 Program Tree ▪ The program tree offers another view of the proportion of code that can be offloaded to the accelerator. • Generated if the DOT(GraphViz*) utility is installed Target = GPU, Target = CPU Accelerated One Intel Software & Architecture (OISA) 15 Before you start to use Offload Advisor ▪ The only strict requirement for compilation and linking is full debug information: -g: Requests full debug information (compiler and linker) ▪ Offload Advisor supports any optimization level, but the following settings are considered the optimal requirements: -O2: Requests moderate optimization -no-ipo: Disables inter-procedural optimizations that may inhibit Offload Advisor to collect performance data (Intel® C++ & Fortran Compiler specific) One Intel Software & Architecture (OISA) 16 Performance Estimation Flow Performance estimation steps: Output: A. Profiling 1. report.html B. Performance modelling 3 different approaches to get estimation: • run_oa.py (both A and B), most convenient • collect.py (A) + analyze.py (B) • advixe-cl (multiple times, A) + analyze.py (B), most control Performance estimation result: • List of loops to be offloaded • Estimated speed-up (relative to baseline) 2. report.csv (whole grid in CSV table) For batch processing Optimization Notice Copyright © 2019, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others. Using Python scripts to run Offload Advisor ▪ Set up the Intel® Advisor environment (implicitly done by oneAPI setvars.sh) source <advisor_install_dir>/advixe-vars.sh Analyze for a specific GPU config Environment variable APM points to <ADV_INSTALL_DIR>/perfmodels ▪ Run the data collection advixe-python $APM/collect.py advisor_project --config gen9 -- <app> [app_options] Also works with other installed python, advixe-python only provided for convenience. ▪ Run the performance modelling advixe-python $APM/analyze.py advisor_project --config gen9 --out-dir proj_results View the report.html generated (or generate a command-line report) ▪ Alternatives: run_oa.py or advixe-cl + analyze-py One Intel Software & Architecture (OISA) 18 Roofline Analysis - Recap 19 What is a Roofline Chart? ▪ A Roofline Chart plots application performance against hardware limitations • Where are the bottlenecks? • How much performance is being left on the table? • What are the next steps? ▪ Values of Rooflines in Intel® Advisor are measured CPU Roofline chart • Small benchmarks are run when starting a Roofline Analysis Roofline first proposed by University of California at Berkeley: Roofline: An Insightful Visual Performance Model for Multicore Architectures, 2009 Cache-aware variant proposed by University of Lisbon: Cache-Aware Roofline Model: Upgrading the Loft, 2013 One Intel Software & Architecture (OISA) 20 21 What is the Roofline Model? Do you know how fast you should run? ▪ Comes from Berkeley ▪ Performance is limited by equations/implementation & code generation/hardware ▪ 2 hardware limitations ▪ PEAK Flops ▪ PEAK Bandwidth ▪ The application performance is bounded by hardware specifications Arithmetic Intensity 푷풍풂풕풇풐풓풎 푷푬푨푲 Gflop/s= 풎풊풏 ቊ (Flops/Bytes) 푷풍풂풕풇풐풓풎 푩푾 ∗ 푨푰 One Intel Software & Architecture (OISA) 21 Drawing the Roofline Defining the speed of light 푷풍풂풕풇풐풓풎 푷푬푨푲 2 sockets Intel® Xeon® Processor E5-2697 v2 Gflop/s= 풎풊풏 ቊ Peak Flop = 1036 Gflop/s 푷풍풂풕풇풐풓풎 푩푾 ∗ 푨푰 Peak BW = 119 GB/s 1036 Gflops/s AI [Flop/B] One Intel Software & Architecture (OISA) 22 22 Drawing the Roofline Defining the speed of light 푷풍풂풕풇풐풓풎 푷푬푨푲 2 sockets Intel® Xeon® Processor E5-2697 v2 Gflop/s= 풎풊풏 ቊ Peak Flop = 1036 Gflop/s 푷풍풂풕풇풐풓풎 푩푾 ∗ 푨푰 Peak BW = 119 GB/s 1036 Gflops/s NB: Origin not crossed due to both axis in logarithmic scale AI [Flop/B] One Intel Software & Architecture (OISA) 23 23 Drawing the Roofline Defining the speed of light 푷풍풂풕풇풐풓풎 푷푬푨푲 2 sockets Intel® Xeon® Processor E5-2697 v2 Gflop/s= 풎풊풏 ቊ Peak Flop = 1036 Gflop/s 푷풍풂풕풇풐풓풎 푩푾 ∗ 푨푰 Peak BW = 119 GB/s 1036 Gflops/s 8.7 AI [Flop/B] One Intel Software & Architecture (OISA) 24 24 25 Ultimate Performance Limits Performance cannot exceed the machine’s capabilities, so each loop is ultimately limited by either compute FLOPS or memory capacity. Ultimately Ultimately Memory-Bound Compute-Bound Arithmetic Intensity FLOP/Byte One Intel Software & Architecture (OISA) 25 Roofline Analysis for GPU code 26 Find Effective Optimization Strategies Intel® Advisor - GPU Roofline GPU Roofline Performance Insights ▪ Highlights poor performing loops ▪ Shows performance ‘headroom’ for each loop – Which can be improved – Which are worth improving ▪ Shows likely causes of bottlenecks – Memory bound vs. compute bound ▪ Suggests next optimization steps One Intel Software & Architecture (OISA) 27 Intel® Advisor GPU Roofline See how close you are to the system maximums (rooflines) Roofline indicates room for improvement One Intel Software & Architecture (OISA) 28 Find Effective Optimization Strategies Intel® Advisor - GPU Roofline Configure levels to display Shows performance headroom for each loop Likely bottlenecks Suggests optimization next steps One Intel Software & Architecture (OISA) 29 Customize to Display Only Desired Roofs Click on the top-right corner and remove unused roofs One Intel Software & Architecture (OISA) 30 How to Run Intel® Advisor – GPU Roofline Run 2 collections with --enable-gpu-profiling option. First Survey run will do time measurements with minimized overhead: advixe-cl –collect=survey --enable-gpu-profiling --project-dir=<my_project_directory> --search-dir src:r=<my_source_directory> -- ./myapp [app_parameters] Run the Trip Counts and FLOP data collection : advixe-cl –collect=tripcounts --stacks --flop --enable-gpu-profiling --project-dir=<my_project_directory> --search-dir src:r=<my_source_directory> -- ./myapp [app_parameters] Generate a GPU Roofline report: advixe-cl --report=roofline --gpu --project-dir=<my_project_directory> --report-output=roofline.html Open the generated roofline.html in a web browser to visualize GPU performance. One Intel Software & Architecture (OISA) 31 Flow Graph Analyzer 32 Visualize Asynchronous Execution CompilerCompiler resolves resolves data data and and controlcontrol dependency dependency Data Dependence A Control flow Kernel 1 B A Kernel 3 Kernel