Intel® Offload Advisor

NHR@ZIB – Intel oneAPI Workshop, 2-3 March 2021
Intel® Advisor Offload Modelling and Analysis
Klaus-Dieter Oertel, One Intel Software & Architecture (OISA)

Intel® Advisor for High Performance Code Design
A rich set of capabilities. Offload Modelling: design an offload strategy and model performance on a GPU.

Agenda
▪ Offload Modelling
▪ Roofline Analysis – Recap
▪ Roofline Analysis for GPU code
▪ Flow Graph Analyzer

Offload Modelling

Intel® Advisor – Offload Advisor
Find code that can be profitably offloaded. Starting from an optimized binary (running on CPU), Offload Advisor:
▪ Helps define which sections of the code should run on a given accelerator
▪ Provides a performance projection on accelerators

What can be expected?
Speedup of the accelerated code: 8.9x

Modeling Performance Using Intel® Advisor – Offload Advisor

  #    Baseline HW (programming model)                            Target HW
  1    CPU (C, C++, Fortran, Py) – measured                       CPU + GPU – measured / estimated
  1a   CPU (DPC++, OCL, OMP, "target=host") – measured            CPU + GPU – measured / estimated
  2    CPU + iGPU (DPC++, OCL, OMP, "target=offload") – measured  CPU + GPU – measured / estimated

Copyright © 2019, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others.

Modeling Performance Using Intel® Advisor – Offload Advisor
For each region (X, Y), the model combines:
• Execution time on the baseline platform (CPU)
• Execution time on the accelerator, estimated assuming the region is bounded exclusively by compute
• Execution time on the accelerator, estimated assuming the region is bounded exclusively by caches/memory
• Offload tax estimate (data transfer + invoke)
The result is the final estimated time on the target device (GPU):
  t_region = max(t_compute, t_memory_subsystem) + t_data_transfer_tax + t_invocation_tax
Region X is profitable to accelerate: t(X) > t(X'). Region Y has too much overhead and is not accelerable: t(Y) < t(Y').

Will Offload Increase Performance?
The report shows good candidates to offload, what the workload is bounded by, and bad candidates.

In-Depth Analysis of Top Offload Regions
▪ Provides a detailed description of each loop interesting for offload; this is where you will use DPC++ or OMP offload
▪ Timings (total time, time on the accelerator, speedup)
▪ Offload metrics (offload tax, data transfers)
▪ Memory traffic (DRAM, L3, L2, L1), trip count
▪ Highlights which part of the code should run on the accelerator

What Is My Workload Bounded By?
Predicts performance on future GPU hardware. For example, 95% of a workload may be bounded by L3 bandwidth, but you may have several bottlenecks.

Will the Data Transfer Make GPU Offload Worthwhile?
Shows a histogram of memory objects and the total data transferred.

What Kernels Should Not Be Offloaded?
Explains why Intel® Advisor doesn't recommend a given loop for offload:
▪ Dependency issues
▪ Not profitable
▪ Total time is too small

Compare Acceleration on Different GPUs
Gen9 – not profitable to offload the kernel; Gen11 – 1.6x speedup.

Program Tree
▪ The program tree offers another view of the proportion of code that can be offloaded to the accelerator.
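The per-region estimate above, t_region = max(t_compute, t_memory_subsystem) plus the offload taxes, together with the profitability test t(X) > t(X'), can be sketched in a few lines. This is a minimal illustration; all function names and timings are made up for the example and are not Advisor output:

```python
# Sketch of the Offload Advisor per-region time model (illustrative names):
# t_region = max(t_compute, t_memory) + t_transfer_tax + t_invocation_tax

def estimated_offload_time(t_compute, t_memory, t_transfer_tax, t_invocation_tax):
    """Estimated execution time of a region on the target device."""
    return max(t_compute, t_memory) + t_transfer_tax + t_invocation_tax

def is_profitable(t_baseline, t_offloaded):
    """A region is worth offloading only if the baseline time exceeds the estimate."""
    return t_baseline > t_offloaded

# Region X: compute-heavy, small offload tax -> profitable, t(X) > t(X')
t_x = estimated_offload_time(t_compute=2.0, t_memory=1.5,
                             t_transfer_tax=0.3, t_invocation_tax=0.1)
print(is_profitable(10.0, t_x))   # True

# Region Y: cheap on the CPU, offload tax dominates -> not profitable, t(Y) < t(Y')
t_y = estimated_offload_time(t_compute=0.2, t_memory=0.1,
                             t_transfer_tax=1.0, t_invocation_tax=0.5)
print(is_profitable(0.5, t_y))    # False
```

The max() of the compute and memory estimates reflects that the slower of the two subsystems bounds the kernel, while the transfer and invocation taxes are paid on top regardless.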
• Generated if the DOT (GraphViz*) utility is installed
• Legend: Target = GPU (accelerated), Target = CPU

Before You Start to Use Offload Advisor
▪ The only strict requirement for compilation and linking is full debug information:
  -g: requests full debug information (compiler and linker)
▪ Offload Advisor supports any optimization level, but the following settings are considered optimal:
  -O2: requests moderate optimization
  -no-ipo: disables inter-procedural optimizations that may prevent Offload Advisor from collecting performance data (Intel® C++ & Fortran Compiler specific)

Performance Estimation Flow
Performance estimation steps:
A. Profiling
B. Performance modelling
Three different approaches to get an estimate:
• run_oa.py (both A and B) – most convenient
• collect.py (A) + analyze.py (B)
• advixe-cl (run multiple times, A) + analyze.py (B) – most control
Performance estimation result:
• List of loops to be offloaded
• Estimated speed-up (relative to the baseline)
Output:
1. report.html
2. report.csv (the whole grid as a CSV table, for batch processing)

Using Python Scripts to Run Offload Advisor
▪ Set up the Intel® Advisor environment (done implicitly by the oneAPI setvars.sh):
  source <advisor_install_dir>/advixe-vars.sh
  The environment variable APM points to <ADV_INSTALL_DIR>/perfmodels.
▪ Run the data collection, analyzing for a specific GPU config:
  advixe-python $APM/collect.py advisor_project --config gen9 -- <app> [app_options]
  This also works with any other installed Python; advixe-python is only provided for convenience.
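The scripted flow (collect.py for profiling, then analyze.py for modelling) can be wrapped in a small driver. This is only an illustrative sketch: offload_advisor_cmds is not an Advisor API, and it assumes advixe-python is on PATH and APM has been set by the environment script:

```python
# Build the command lines for the two scripted Offload Advisor steps.
# Illustrative helper only; run the results with subprocess if desired.
import os

def offload_advisor_cmds(project_dir, config, app, app_args=(), out_dir="proj_results"):
    # Fall back to the documented location if APM is not set in this shell.
    apm = os.environ.get("APM", "<ADV_INSTALL_DIR>/perfmodels")
    collect = ["advixe-python", f"{apm}/collect.py", project_dir,
               "--config", config, "--", app] + list(app_args)
    analyze = ["advixe-python", f"{apm}/analyze.py", project_dir,
               "--config", config, "--out-dir", out_dir]
    return collect, analyze

collect, analyze = offload_advisor_cmds("advisor_project", "gen9", "./myapp")
print(" ".join(collect))
print(" ".join(analyze))
```

Each returned list can be passed to subprocess.run() to execute the step; building the argument lists first makes it easy to log or batch the runs.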
▪ Run the performance modelling:
  advixe-python $APM/analyze.py advisor_project --config gen9 --out-dir proj_results
  View the generated report.html (or generate a command-line report).
▪ Alternatives: run_oa.py, or advixe-cl + analyze.py

Roofline Analysis – Recap

What is a Roofline Chart?
▪ A roofline chart plots application performance against hardware limitations:
  • Where are the bottlenecks?
  • How much performance is being left on the table?
  • What are the next steps?
▪ The roofline values in Intel® Advisor are measured: small benchmarks are run when starting a Roofline Analysis of the CPU.
The roofline was first proposed by the University of California at Berkeley ("Roofline: An Insightful Visual Performance Model for Multicore Architectures", 2009); the cache-aware variant was proposed by the University of Lisbon ("Cache-Aware Roofline Model: Upgrading the Loft", 2013).

What is the Roofline Model?
Do you know how fast you should run?
▪ Comes from Berkeley
▪ Performance is limited by the equations/implementation, the code generation, and the hardware
▪ Two hardware limitations:
  • peak Flops
  • peak bandwidth
▪ Application performance is bounded by the hardware specifications:
  Gflop/s = min(Platform PEAK, Platform BW × AI)
  where AI is the arithmetic intensity in Flops/Byte.

Drawing the Roofline: Defining the Speed of Light
For 2 sockets of the Intel® Xeon® Processor E5-2697 v2: Peak Flops = 1036 Gflop/s, Peak BW = 119 GB/s. The compute roof is the horizontal line at 1036 Gflop/s; the memory roof is the sloped line BW × AI. NB: the roofs do not cross the origin because both axes use a logarithmic scale. The two roofs meet at the ridge point, here at AI ≈ 8.7 Flop/B.

Ultimate Performance Limits
Performance cannot exceed the machine's capabilities, so each loop is ultimately limited by either compute (FLOPS) or memory capacity: at low arithmetic intensity (FLOP/Byte) a loop is ultimately memory-bound, at high arithmetic intensity it is ultimately compute-bound.

Roofline Analysis for GPU Code

Find Effective Optimization Strategies: Intel® Advisor – GPU Roofline
GPU Roofline performance insights:
▪ Highlights poorly performing loops
▪ Shows the performance "headroom" for each loop:
  – which can be improved
  – which are worth improving
▪ Shows likely causes of bottlenecks (memory bound vs. compute bound)
▪ Suggests next optimization steps

Intel® Advisor GPU Roofline
See how close you are to the system maximums (rooflines); the roofline indicates the room for improvement. You can configure which levels to display, and customize the chart to show only the desired roofs (click the top-right corner and remove unused roofs).

How to Run Intel® Advisor – GPU Roofline
Run two collections with the --enable-gpu-profiling option. First, a Survey run does the time measurements with minimized overhead:
  advixe-cl --collect=survey --enable-gpu-profiling --project-dir=<my_project_directory> --search-dir src:r=<my_source_directory> -- ./myapp [app_parameters]
Then run the Trip Counts and FLOP data collection:
  advixe-cl --collect=tripcounts --stacks --flop --enable-gpu-profiling --project-dir=<my_project_directory> --search-dir src:r=<my_source_directory> -- ./myapp [app_parameters]
Generate a GPU Roofline report:
  advixe-cl --report=roofline --gpu --project-dir=<my_project_directory> --report-output=roofline.html
Open the generated roofline.html in a web browser to visualize GPU performance.

Flow Graph Analyzer

Visualize Asynchronous Execution
The compiler resolves data and control dependencies. [Figure: flow graph showing data dependence and control flow between kernels (Kernel 1, Kernel 3, …)]
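The roofline bound from the recap section, Gflop/s = min(Platform PEAK, Platform BW × AI), is easy to check numerically. A minimal sketch using the 2-socket Xeon E5-2697 v2 figures quoted in the slides (1036 Gflop/s peak, 119 GB/s bandwidth):

```python
# Roofline model: attainable performance at a given arithmetic intensity.
def roofline_gflops(ai, peak_gflops=1036.0, peak_bw_gbs=119.0):
    """Attainable Gflop/s at arithmetic intensity `ai` (Flop/Byte),
    defaulting to the 2-socket Xeon E5-2697 v2 figures from the slides."""
    return min(peak_gflops, peak_bw_gbs * ai)

# Ridge point: the AI where the memory roof meets the compute roof.
ridge = 1036.0 / 119.0
print(round(ridge, 1))        # 8.7, matching the value on the slide

print(roofline_gflops(1.0))   # 119.0  -> memory-bound region
print(roofline_gflops(20.0))  # 1036.0 -> compute-bound region
```

Loops whose AI falls left of the ridge point (~8.7 Flop/B here) sit under the memory roof and are ultimately memory-bound; loops to the right are capped by the compute roof.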
