Intel® Offload Advisor


Intel® Offload Advisor
NHR@ZIB - Intel oneAPI Workshop, 2-3 March 2021
Intel® Advisor Offload Modelling and Analysis
Klaus-Dieter Oertel

Intel® Advisor for High Performance Code Design
A rich set of capabilities, including Offload Modelling: design an offload strategy and model its performance on a GPU.

Agenda
▪ Offload Modelling
▪ Roofline Analysis – Recap
▪ Roofline Analysis for GPU code
▪ Flow Graph Analyzer

Offload Modelling

Intel® Advisor - Offload Advisor
Find code that can be profitably offloaded. Starting from an optimized binary (running on CPU), Offload Advisor:
▪ Helps define which sections of the code should run on a given accelerator
▪ Provides a performance projection on accelerators
What can be expected? An estimated speedup of the accelerated code, 8.9x in the sample report.

Modeling Performance Using Intel® Advisor – Offload Advisor
In every supported scenario the baseline run is measured and the GPU portion of the target is estimated:
1.  Baseline: CPU (C, C++, Fortran, Python); target: CPU + GPU
1.a Baseline: CPU (DPC++, OpenCL, OpenMP with "target=host"); target: CPU + GPU
2.  Baseline: CPU + iGPU (DPC++, OpenCL, OpenMP with "target=offload"); target: CPU + GPU

For each region (say X and Y) of the execution on the baseline platform (CPU), Offload Advisor estimates:
• the execution time on the accelerator, assuming the region is bounded exclusively by compute
• the execution time on the accelerator, assuming the region is bounded exclusively by caches/memory
• the offload tax (data transfer + kernel invocation)
The final estimated time of a region on the target device (GPU) is
$t_{\mathrm{region}} = \max(t_{\mathrm{compute}},\ t_{\mathrm{memory\ subsystem}}) + t_{\mathrm{data\ transfer\ tax}} + t_{\mathrm{invocation\ tax}}$
With X' and Y' denoting the estimated times on the accelerator: X is profitable to accelerate since t(X) > t(X'), while Y carries too much overhead and is not worth accelerating since t(Y) < t(Y').

Will Offload Increase Performance?
The report separates good candidates for offload from bad candidates, and shows what each workload is bounded by.

In-Depth Analysis of Top Offload Regions
▪ Provides a detailed description of each loop that is interesting for offload; this is where you will use DPC++ or OpenMP offload
▪ Timings (total time, time on the accelerator, speedup)
▪ Offload metrics (offload tax, data transfers)
▪ Memory traffic (DRAM, L3, L2, L1) and trip count
▪ Highlights which part of the code should run on the accelerator

What Is My Workload Bounded By?
Offload Advisor predicts performance on future GPU hardware. In the example, 95% of the workload is bounded by L3 bandwidth, but a workload may have several bottlenecks at once.

Will the Data Transfer Make GPU Offload Worthwhile?
The report shows a histogram of memory objects and the total data transferred.

What Kernels Should Not Be Offloaded?
▪ Explains why Intel® Advisor does not recommend a given loop for offload:
• dependency issues
• not profitable
• total time is too small

Compare Acceleration on Different GPUs
In the example, offloading the kernel is not profitable on Gen9, while Gen11 gives a 1.6x speedup.

Program Tree
▪ The program tree offers another view of the proportion of code that can be offloaded to the accelerator (Target = GPU, Target = CPU, Accelerated).
▪ It is generated only if the DOT (GraphViz*) utility is installed.

Before you start to use Offload Advisor
▪ The only strict requirement for compilation and linking is full debug information:
  -g: requests full debug information (compiler and linker)
▪ Offload Advisor supports any optimization level, but the following settings are considered optimal:
  -O2: requests moderate optimization
  -no-ipo: disables inter-procedural optimizations that may prevent Offload Advisor from collecting performance data (Intel® C++ and Fortran Compiler specific)
A build-line sketch follows below.
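To make these flags concrete, here is a minimal build sketch. It assumes the Intel C++ Compiler (icc) and a hypothetical source file myapp.cpp; neither the compiler driver nor the file name is prescribed by the slides.

  # Minimal sketch, assuming the icc driver and a hypothetical myapp.cpp:
  #   -g       full debug information (compiler and linker)
  #   -O2      moderate optimization
  #   -no-ipo  keep inter-procedural optimization off so Offload Advisor
  #            can attribute performance data to individual loops
  icc -g -O2 -no-ipo -o myapp myapp.cpp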
Performance Estimation Flow
Performance estimation consists of two steps:
A. profiling
B. performance modelling
Three different approaches yield an estimate:
▪ run_oa.py (both A and B), most convenient
▪ collect.py (A) + analyze.py (B)
▪ advixe-cl (run multiple times, A) + analyze.py (B), most control
Output:
1. report.html, the performance estimation result: the list of loops to be offloaded and the estimated speedup relative to the baseline
2. report.csv, the whole grid as a CSV table, for batch processing

Using Python scripts to run Offload Advisor
▪ Set up the Intel® Advisor environment (done implicitly by the oneAPI setvars.sh):
  source <advisor_install_dir>/advixe-vars.sh
  The environment variable APM then points to <ADV_INSTALL_DIR>/perfmodels.
▪ Run the data collection, analyzing for a specific GPU config:
  advixe-python $APM/collect.py advisor_project --config gen9 -- <app> [app_options]
  This also works with any other installed Python; advixe-python is only provided for convenience.
▪ Run the performance modelling:
  advixe-python $APM/analyze.py advisor_project --config gen9 --out-dir proj_results
  Then view the generated report.html (or generate a command-line report).
▪ Alternatives: run_oa.py (sketched below) or advixe-cl + analyze.py
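The run_oa.py route performs both steps in one command. The slides only name the script, so the argument layout below is an assumption modelled on the collect.py and analyze.py invocations above; advisor_project, proj_results, and <app> remain placeholders.

  # Assumed one-shot invocation: profiling (A) and modelling (B) together.
  advixe-python $APM/run_oa.py advisor_project --config gen9 --out-dir proj_results -- <app> [app_options]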
Roofline Analysis – Recap

What is a Roofline Chart?
▪ A roofline chart plots application performance against hardware limitations:
• Where are the bottlenecks?
• How much performance is being left on the table?
• What are the next steps?
▪ The roofline values in Intel® Advisor are measured: small benchmarks are run when starting a Roofline analysis.
The roofline model was first proposed by the University of California at Berkeley ("Roofline: An Insightful Visual Performance Model for Multicore Architectures", 2009); a cache-aware variant was proposed by the University of Lisbon ("Cache-Aware Roofline Model: Upgrading the Loft", 2013).

What is the Roofline Model?
Do you know how fast your code should run?
▪ The model comes from Berkeley.
▪ Performance is limited by the equations/implementation, the code generation, and the hardware.
▪ There are two hardware limitations: peak flops and peak bandwidth.
▪ Application performance is therefore bounded by the hardware specification as a function of the arithmetic intensity AI (flops/byte):
$\mathrm{GFlop/s} = \min\bigl(\mathrm{Platform\ PEAK},\ \mathrm{Platform\ BW} \times \mathrm{AI}\bigr)$

Drawing the Roofline: defining the speed of light
Example: 2 sockets of the Intel® Xeon® Processor E5-2697 v2, with peak flops = 1036 Gflop/s and peak bandwidth = 119 GB/s. The compute roof is the horizontal line at 1036 Gflop/s, the memory roof is the line BW × AI, and the two roofs meet at the ridge point AI = 1036 / 119 ≈ 8.7 flop/byte. Note that the plotted lines do not cross the origin because both axes use a logarithmic scale.

Ultimate Performance Limits
Performance cannot exceed the machine's capabilities, so each loop is ultimately limited by either compute flops or memory capacity: loops with arithmetic intensity (flop/byte) left of the ridge point are ultimately memory-bound, loops right of it are ultimately compute-bound.

Roofline Analysis for GPU code

Find Effective Optimization Strategies: Intel® Advisor - GPU Roofline
GPU Roofline performance insights:
▪ Highlights poorly performing loops
▪ Shows the performance "headroom" for each loop:
– which can be improved
– which are worth improving
▪ Shows the likely causes of bottlenecks (memory-bound vs. compute-bound)
▪ Suggests next optimization steps
In the chart you can see how close you are to the system maximums (the rooflines); the distance to the roof indicates the room for improvement. You can configure which levels to display, and the report shows likely bottlenecks and suggested next optimization steps per loop.

Customize to Display Only Desired Roofs
Click on the top-right corner of the chart and remove unused roofs.

How to Run Intel® Advisor - GPU Roofline
Run two collections with the --enable-gpu-profiling option. First, a Survey run does the time measurements with minimized overhead:
  advixe-cl --collect=survey --enable-gpu-profiling --project-dir=<my_project_directory> --search-dir src:r=<my_source_directory> -- ./myapp [app_parameters]
Then run the Trip Counts and FLOP data collection:
  advixe-cl --collect=tripcounts --stacks --flop --enable-gpu-profiling --project-dir=<my_project_directory> --search-dir src:r=<my_source_directory> -- ./myapp [app_parameters]
Finally, generate a GPU Roofline report:
  advixe-cl --report=roofline --gpu --project-dir=<my_project_directory> --report-output=roofline.html
Open the generated roofline.html in a web browser to visualize GPU performance. The three steps are combined into a single script below.
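For convenience, here is the same workflow as one small shell script. It is only a sketch: the project directory, source directory, and application name are placeholders, and every option is taken verbatim from the commands above.

  #!/bin/bash
  # GPU Roofline workflow sketch; substitute your own paths.
  PROJ=./advisor_project   # placeholder project directory
  SRC=./src                # placeholder source search directory
  APP=./myapp              # placeholder application binary

  # 1. Survey: time measurements with minimized overhead
  advixe-cl --collect=survey --enable-gpu-profiling --project-dir=$PROJ --search-dir src:r=$SRC -- $APP
  # 2. Trip counts and FLOP collection, with call stacks
  advixe-cl --collect=tripcounts --stacks --flop --enable-gpu-profiling --project-dir=$PROJ --search-dir src:r=$SRC -- $APP
  # 3. Generate the HTML GPU Roofline report
  advixe-cl --report=roofline --gpu --project-dir=$PROJ --report-output=roofline.html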
Flow Graph Analyzer

Visualize Asynchronous Execution
The compiler resolves data and control dependencies between kernels. [Figure: a flow graph with data-dependence edges (A, B) and control-flow edges connecting Kernel 1, Kernel 3, and further kernels; the slide is truncated in this extraction.]