Intel Tuning for JURECA and JUWELS

Intel Tuning for Jurecaand JuWELS Dr. Heinrich Bockhorst – Intel Date: 29.11.2019 Agenda • Introduction • Processor Architecture Overview • Composer XE – Compiler • Intel Python • APS – Application Performance Snapshot • MPI and ITAC analysis • VTune Amplifier XE - analysis • Advisor XE - Vectorization • Selected Intel® Tools • References © 2017 Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks of Intel Corporation or its subsidiaries in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others. For more complete information about compiler optimizations, see our Optimization Notice. Processor Architecture Overview The “Free Lunch” is over, really Processor clock rate growth halted around 2005 Source: © 2014, James Reinders, Intel, used with permission Software must be parallelized to realize all the potential performance © 2017 Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks of Intel Corporation or its subsidiaries in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others. For more complete information about compiler optimizations, see our Optimization Notice. What platform should I use for code modernization? Today (Juwels) The world is going Intel® Xeon® Processor Intel® Xeon Phi™ x100 Intel® Xeon® Processor Intel® Xeon Phi™ x200 parallel – stick E5-2600 v3 Product Product Family E5-2600 v4 Product Product Family with sequential Family formerly formerly codenamed Family codenamed codenamed … code and you will codenamed Knights Broadwell Knights Skylake fall behind. Haswell Corner Landing Cores 18 61 22 72 28 Threads/Core 2 4 2 4 2 Vector Width 256-bit 512-bit 256-bit 512-bit (x2) 512-bit (x2) Peak Memory Bandwidth 68 GB/s 352 GB/s 77 GB/s >500 GB/s 228 GB/s Both Xeon and KNL are suitable platforms; KNL provides higher scale & memory bandwidth. © 2017 Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks of Intel Corporation or its subsidiaries in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others. For more complete information about compiler optimizations, see our Optimization Notice. Intel® Parallel Studio XE Intel® Parallel Studio XE Intel® Inspector Intel® Advisor Vectorization Optimization and Thread Prototyping Memory and Threading Checking Intel® Cluster Checker Cluster Diagnostic Expert System Architecture Intel® VTune™ Amplifier Performance Profiler Intel® Trace Analyzer and Collector MPI Profiler Profiling, Analysis, and Cluster Cluster Tools Intel® Data Analytics Acceleration Library Intel® MPI Library Optimized for Data Analytics & Machine Learning Intel® Integrated Performance Primitives Image, Signal, and Compression Routines Libraries Performance Performance Intel® Math Kernel Library Intel® Threading Building Blocks Optimized Routines for Science, Engineering, and Financial Task-Based Parallel C++ Template Library Intel® C/C++ and Fortran Compilers Intel® Distribution for Python Performance Scripting © 2017 Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks of Intel Corporation or its subsidiaries in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others. For more complete information about compiler optimizations, see our Optimization Notice. Intel® CompoSERXE Common OptimizationOptions Windows* Linux*, OS* X Disable optimization /Od -O0 Optimize for speed (no code size increase) /O1 -O1 Optimize for speed (default) /O2 -O2 High-level loop optimization /O3 -O3 Create symbols for debugging /Zi -g Multi-file inter-procedural optimization /Qipo -ipo Profile guided optimization (multi-step build) /Qprof-gen -prof-gen /Qprof-use -prof-use Optimize for speed across the entire program /fast -fast same as: /O3 /Qipo same as: (“prototype switch”) /Qprec-div-, /fp:fast=2 Linux: -ipo –O3 -no-prec-div –static –fp- fast options definitions changes over time! /QxHost) model fast=2 -xHost) OS X: -ipo -mdynamic-no-pic -O3 -no-prec- div -fp-model fast=2 -xHost OpenMP support /Qopenmp -qopenmp Automatic parallelization /Qparallel -parallel © 2017 Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks of Intel Corporation or its subsidiaries in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others. For more complete information about compiler optimizations, see our Optimization Notice. Vectorization • Single Instruction Multiple Data (SIMD): • Processing vector with a single operation • Provides data level parallelism (DLP) • Because of DLP more efficient than scalar processing • Vector: • Consists of more than one element • Elements are of same scalar data types (e.g. floats, integers, …) • Vector length (VL): Elements of the vector A B A B AAi i BBi i Ai i Bi i + Scalar Vector + Processing Processing Ci C CCi Ci i VL © 2017 Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks of Intel Corporation or its subsidiaries in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others. For more complete information about compiler optimizations, see our Optimization Notice. Many Ways to Vectorize • Compiler: Ease of use Auto-vectorization (no change of code) Compiler: Auto-vectorization hints (#pragma vector, …) Compiler: Intel® Cilk™ Plus Array Notation Extensions SIMD intrinsic class (e.g.: F32vec, F64vec, …) Vector intrinsic (e.g.: _mm_fmadd_pd(…), _mm_add_ps(…), …) Assembler code (e.g.: [v]addps, [v]addss, …) Programmer control © 2017 Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks of Intel Corporation or its subsidiaries in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others. For more complete information about compiler optimizations, see our Optimization Notice. Basic Vectorization Switches I • Linux*, OS X*: -x<feature> ▪ Might enable Intel processor specific optimizations ▪ Processor-check added to “main” routine: Application errors in case SIMD feature missing or non-Intel processor with appropriate/informative message ▪ Example: -xCORE-AVX2 (Jureca Xeon HSW) ▪ Example: -xMIC-AVX512 (Jureca Booster KNL) ▪ Example: -xCORE-AVX512 (Juwels Xeon SKL) • Linux*, OS X*: -ax<features> • Multiple code paths: baseline and optimized/processor-specific • Multiple SIMD features/paths possible, e.g.: -axSSE2,CORE-AVX512 • Baseline code path defaults to –xSSE2 © 2017 Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks of Intel Corporation or its subsidiaries in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others. For more complete information about compiler optimizations, see our Optimization Notice. Basic Vectorization Switches II • Special switch for Linux*, OS X*: -xHost ▪ Compiler checks SIMD features of current host processor (where built on) and makes use of latest SIMD feature available ▪ Code only executes on processors with same SIMD feature or later as on build host ▪ As for -x<feature> if “main” routine is built with –xHost the final executable only runs on Intel processors © 2017 Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks of Intel Corporation or its subsidiaries in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others. For more complete information about compiler optimizations, see our Optimization Notice. Basic Vectorization Switches III • Special switch in addition to CORE-AVX512: -qopt-zmm-usage=[keyword] ▪ [keyword] = [high | low] ; Note: “low” is the default ▪ Why choosing a defensive vectorization level? Frequency drops in vectorized parts. Frequency does not immediately increases after the vectorized loop. Too many small vectorized loops will decrease the performance for the serial part. © 2017 Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks of Intel Corporation or its subsidiaries in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others. For more complete information about compiler optimizations, see our Optimization Notice. Control Vectorization • Verify vectorization: ▪ Globally: Linux*, OS X*: -qopt-report[n] check for additional options (man icc)! ▪ Annotated source listing: -qopt-report-annotate=[text | html] generates source listing with compiler comments • Advanced: • Ignore vector dependencies (IVDEP): C/C++: #pragma ivdep Fortran: !DIR$ IVDEP • “Enforce” vectorization: C/C++: #pragma omp simd Fortran: !$omp simd © 2017 Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks of Intel Corporation or its subsidiaries in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others. For more complete information about compiler optimizations, see our Optimization Notice. NextGENCompiler • Compiler will be based on LLVM in the future. Can be enabled in latest intel compiler by: -qnextgen • More information: https://software.intel.com/en-us/articles/early-documentation-for-intel-c-compiler-based-on-the-modern- llvm-framework © 2017 Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks of Intel Corporation or its subsidiaries in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others. For more complete information about compiler optimizations, see our Optimization Notice.

Load more