Intel Tuning for JURECA and JUWELS
FZJ, May 20th, 2020
Dr. Heinrich Bockhorst, Intel

Agenda

- Introduction
- Processor Architecture Overview
- Composer XE – Compiler
- Intel Python
- APS – Application Performance Snapshot
- MPI and ITAC analysis
- VTune Amplifier XE – analysis
- Advisor XE – Vectorization
- Selected Intel® Tools
- References

Refer to software.intel.com/articles/optimization-notice for more information regarding performance & optimization choices in Intel software products. Copyright ©, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others.

The "Free Lunch" is over, really

Processor clock rate growth halted around 2005 (source: © 2014, James Reinders, Intel, used with permission). Software must be parallelized to realize all the potential performance.

What platform should I use for code modernization?

The world is going parallel – stick with sequential code and you will fall behind.

  Processor family (codename)                    Cores  Threads/Core  Vector Width   Peak Memory Bandwidth
  Xeon E5-2600 v3 (Haswell)                       18     2            256-bit          68 GB/s
  Xeon Phi x100 (Knights Corner)                  61     4            512-bit         352 GB/s
  Xeon E5-2600 v4 (Broadwell)                     22     2            256-bit          77 GB/s
  Xeon Phi x200 (Knights Landing)                 72     4            512-bit (x2)   >500 GB/s
  Xeon Scalable (Skylake, used in JUWELS)         28     2            512-bit (x2)    228 GB/s

Both Xeon and KNL are suitable platforms; KNL provides higher scale & memory bandwidth.
Intel® Parallel Studio XE

Profiling and analysis tools:
- Intel® Inspector – memory and threading checking
- Intel® Advisor – vectorization optimization and thread prototyping
- Intel® Cluster Checker – cluster diagnostic expert system
- Intel® VTune™ Amplifier – performance profiler
- Intel® Trace Analyzer and Collector – MPI profiler

Performance libraries:
- Intel® Data Analytics Acceleration Library – optimized for data analytics & machine learning
- Intel® MPI Library
- Intel® Integrated Performance Primitives – image, signal, and compression routines
- Intel® Math Kernel Library – optimized routines for science, engineering, and finance
- Intel® Threading Building Blocks – task-based parallel C++ template library

Compilers and languages:
- Intel® C/C++ and Fortran Compilers
- Intel® Distribution for Python – performance scripting

Intel® oneAPI Toolkits (beta)

Toolkits tailored to your needs: domain-specific sets of tools to get your job done quickly.
Intel® oneAPI Base Toolkit – a core set of high-performance tools for building Data Parallel C++ applications and oneAPI library based applications.

Domain-specific toolkits:
- Intel® oneAPI HPC Toolkit – everything HPC developers need to deliver fast C++, Fortran, & OpenMP* applications that scale.
- Intel® oneAPI IoT Toolkit – tools for building high-performing, efficient, reliable solutions that run at the network's edge.
- Intel® oneAPI DL Framework Developer Toolkit – tools for developers & researchers who build deep learning frameworks or customize existing ones so applications run faster.
- Intel® oneAPI Rendering Toolkit – powerful rendering libraries to create high-performance, high-fidelity visualization applications.

Toolkits powered by oneAPI:
- Intel® System Bring-Up Toolkit – tools to debug & tune power & performance in pre- & post-silicon development.
- Intel® Distribution of OpenVINO™ Toolkit – tools to build high-performance deep learning inference & computer vision applications (production-level tool).
- Intel® AI Analytics Toolkit – tools to build applications that leverage machine learning & deep learning models.

Some useful links

- Intro: https://software.intel.com/en-us/oneapi
- oneAPI initiative: https://www.oneapi.com/
- Code migration: https://software.intel.com/en-us/oneapi-programming-guide-migrating-code-to-dpc
- Windows and Linux toolkits available on https://software.intel.com/en-us/oneapi
- Access to hardware in DevCloud: https://software.intel.com/content/www/us/en/develop/tools/devcloud.html
Common Optimization Options

  Option                                            Windows*      Linux*, OS X*
  Disable optimization                              /Od           -O0
  Optimize for speed (no code size increase)        /O1           -O1
  Optimize for speed (default)                      /O2           -O2
  High-level loop optimization                      /O3           -O3
  Create symbols for debugging                      /Zi           -g
  Multi-file inter-procedural optimization          /Qipo         -ipo
  Profile-guided optimization (multi-step build)    /Qprof-gen,   -prof-gen,
                                                    /Qprof-use    -prof-use
  Optimize for speed across the entire program      /fast         -fast
  ("prototype switch")
  OpenMP support                                    /Qopenmp      -qopenmp
  Automatic parallelization                         /Qparallel    -parallel

/fast and -fast expand as follows (note: the definition of -fast changes over time!):
- Windows: same as /O3 /Qipo /Qprec-div- /fp:fast=2 /QxHost
- Linux: same as -ipo -O3 -no-prec-div -static -fp-model fast=2 -xHost
- OS X: same as -ipo -mdynamic-no-pic -O3 -no-prec-div -fp-model fast=2 -xHost

Vectorization

Single Instruction Multiple Data (SIMD):
- Processes a whole vector with a single operation
- Provides data-level parallelism (DLP)
- Because of DLP, more efficient than scalar processing

Vector:
- Consists of more than one element
- Elements are of the same scalar data type (e.g. floats, integers, ...)

Vector length (VL): the number of elements in the vector.

(Diagram: scalar processing computes Ci = Ai + Bi one element per instruction; vector processing adds VL elements Ai + Bi in a single instruction.)
Many Ways to Vectorize

From greatest ease of use to greatest programmer control:
1. Compiler: auto-vectorization (no change of code)
2. Compiler: auto-vectorization hints (#pragma vector, ...)
3. Compiler: Intel® Cilk™ Plus Array Notation Extensions
4. SIMD intrinsic classes (e.g.: F32vec, F64vec, ...)
5. Vector intrinsics (e.g.: _mm_fmadd_pd(...), _mm_add_ps(...), ...)
6. Assembler code (e.g.: [v]addps, [v]addss, ...)

Basic Vectorization Switches I

Linux*, OS X*: -x<feature>
- Enables Intel processor specific optimizations
- A processor check is added to the "main" routine: the application exits with an appropriate/informative error message if the SIMD feature is missing or the processor is non-Intel
- Example: -xCORE-AVX2 (JURECA Xeon HSW)
- Example: -xMIC-AVX512 (JURECA Booster KNL)
- Example: -xCORE-AVX512 (JUWELS Xeon SKL)

Linux*, OS X*: -ax<features>
- Multiple code paths: baseline and optimized/processor-specific
- Multiple SIMD features/paths possible, e.g.: -axSSE2,CORE-AVX512
- Baseline code path defaults to -xSSE2

Basic Vectorization Switches II

Special switch for Linux*, OS X*: -xHost
- The compiler checks the SIMD features of the processor it is building on and makes use of the latest SIMD feature available
- The resulting code only executes on processors with the same SIMD feature as the build host, or later
Basic Vectorization Switches III

Special switch in addition to CORE-AVX512: -qopt-zmm-usage=[keyword]
- [keyword] = [high | low]; note: "low" is the default
- Why choose a defensive vectorization level? The CPU frequency drops in vectorized parts, and the frequency does not immediately recover after a vectorized loop. Too many small vectorized loops will therefore decrease the performance of the serial part.