Intel® Compilers Overview Michael Steyer Technical Consulting Engineer Intel Architecture, Graphics and Software Legal Disclaimer & Optimization Notice

Intel® compilers overview Michael Steyer Technical Consulting Engineer Intel Architecture, Graphics and Software Legal Disclaimer & Optimization Notice Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information visit www.intel.com/benchmarks. INFORMATION IN THIS DOCUMENT IS PROVIDED “AS IS”. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT. Copyright © 2019, Intel Corporation. All rights reserved. Intel, Pentium, Xeon, Xeon Phi, Core, VTune, Cilk, and the Intel logo are trademarks of Intel Corporation in the U.S. and other countries. Optimization Notice Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice. Notice revision #20110804 2 Agenda • Intro • New Features • High-Level Optimizations • Optimization reports • Interprocedural Optimizations • Auto-Vectorization • Profile-Guided Optimizations • Auto-Parallelization 3 General Compiler Features Supports standards Full OpenMP* 4.5 support ▪ Full C11, Full C++14, Some C++ 17 Optimized math libraries ▪ use -std option to control, e.g. -std=c++14 ▪ libimf (scalar) and libsvml (vector): faster than ▪ Full Fortran 2008, Some F2018 GNU libm Intel® C/C++ source and binary compatible ▪ Driver links libimf automatically, ahead of libm ▪ Linux*: gcc ▪ Replace math.h by mathimf.h ▪ Windows*: Visual C++ 2013 and above Many advanced optimizations Supports all instruction sets via vectorization ▪ With detailed, structured optimization reports ▪ Intel® Streaming SIMD Extensions, (Intel® SSE) ▪ Intel® Advanced Vector Extensions (Intel® AVX) ▪ Intel® Advanced Vector Extensions 512 (Intel® AVX-512) * Other names and brands may be claimed as the property of others. 4 New options and features The Intel® Xeon® Processor Scalable Family is based on the server microarchitecture codenamed Skylake ▪ Compile with processor-specific option [-/Q]xCORE-AVX512 ▪ Many HPC apps benefit from aggressive ZMM usage ▪ Most non-HPC apps degrade from aggressive ZMM usage A new compiler option [-q/Q]opt-zmm-usage=low|high is added to enable a smooth transition from AVX2 to AVX-512 icpc -c -xCORE-AVX512 –qopenmp -qopt-report:5 foo.cpp void foo(double *a, double *b, int size) { #pragma omp simd remark #15305: vectorization support: vector length 4 for(int i=0; i<size; i++) { … b[i]=exp(a[i]); remark #15321: Compiler has chosen to target XMM/YMM } vector. Try using -qopt-zmm-usage=high to override } … remark #15478: estimated potential speedup: 5.260 5 New options and features The Intel® Xeon® Processor Scalable Family is based on the server microarchitecture codenamed Skylake ▪ Compile with processor-specific option [-/Q]xCORE-AVX512 ▪ Many HPC apps benefit from aggressive ZMM usage ▪ Most non-HPC apps degrade from aggressive ZMM usage A new compiler option [-q/Q]opt-zmm-usage=low|high is added to enable a smooth transition from AVX2 to AVX-512 icpc -c -xCORE-AVX512 -qopt-zmm-usage=high –qopenmp void foo(double *a, double *b, int size) { -qopt-report:5 foo.cpp #pragma omp simd for(int i=0; i<size; i++) { b[i]=exp(a[i]); remark #15305: vectorization support: vector length 8 } … } remark #15478: estimated potential speedup: 10.110 6 High-Level Optimizations Basic Optimizations with icc -O… -O0 no optimization; sets -g for debugging -O1 scalar optimizations excludes optimizations tending to increase code size -O2 default for icc/icpc (except with -g) includes auto-vectorization; some loop transformations, e.g. unrolling, loop interchange; inlining within source file; start with this (after initial debugging at -O0) -O3 more aggressive loop optimizations including cache blocking, loop fusion, prefetching, … suited to applications with loops that do many floating-point calculations or process large data sets 8 Intel® Compilers: Loop Optimizations ifort (or icc or icpc or icl) -O3 Loop optimizations: ▪ Automatic vectorization‡ (use of packed SIMD instructions) ▪ Loop interchange ‡ (for more efficient memory access) ▪ Loop unrolling‡ (more instruction level parallelism) ▪ Prefetching (for patterns not recognized by h/w prefetcher) ▪ Cache blocking (for more reuse of data in cache) ▪ Loop versioning ‡ (for loop count; data alignment; runtime dependency tests) ▪ Memcpy recognition ‡ (call Intel’s fast memcpy, memset) ▪ Loop splitting ‡ (facilitate vectorization) ▪ Loop fusion (more efficient vectorization) ▪ Scalar replacement‡ (reduce array accesses by scalar temps) ▪ Loop rerolling (enable vectorization) ▪ Loop peeling ‡ (allow for misalignment) ▪ Loop reversal (handle dependencies) ▪ etc. ▪ ‡ all or partly enabled at -O2 9 Common optimization options Linux* Disable optimization -O0 Optimize for speed (no code size increase) -O1 Optimize for speed (default) -O2 High-level loop optimization -O3 Create symbols for debugging -g Multi-file inter-procedural optimization -ipo Profile guided optimization (multi-step build) -prof-gen -prof-use Optimize for speed across the entire program (“prototype switch”) -fast same as: fast options definitions changes over time! -ipo –O3 -no-prec-div –static –fp-model fast=2 -xHost) OpenMP support -qopenmp Automatic parallelization -parallel 10 Compiler Reports – Optimization Report Enables the optimization report and controls the level of details ▪ -qopt-report[=n] ▪ When used without parameters, full optimization report is issued on stdout with details level 2 Control destination of optimization report ▪ -qopt-report=<filename> ▪ By default, without this option, a <filename>.optrpt file is generated. Subset of the optimization report for specific phases only ▪ -qopt-report-phase[=list] Phases can be: – all – All possible optimization reports for all phases (default) – loop – Loop nest and memory optimizations – vec – Auto-vectorization and explicit vector programming – par – Auto-parallelization – openmp – Threading using OpenMP – ipo – Interprocedural Optimization, including inlining – pgo – Profile Guided Optimization – cg – Code generation 11 InterProcedural Optimizations (IPO) Multi-pass Optimization icc -ipo Analysis and optimization across function and/or source file boundaries, e.g. ▪ Function inlining; constant propagation; dependency analysis; data & code layout; etc. 2-step process: ▪ Compile phase – objects contain intermediate representation ▪ “Link” phase – compile and optimize over all such objects ▪ Seamless: linker automatically detects objects built with -ipo and their compile options ▪ May increase build-time and binary size ▪ But build can be parallelized with -ipo=n ▪ Entire program need not be built with IPO, just hot modules Particularly effective for applications with many smaller functions Get report on inlined functions with -qopt-report-phase=ipo 12 InterProcedural Optimizations Extends optimizations across file boundaries -ip Only between modules of one source file -ipo Modules of multiple files/whole application Without IPO With IPO Compile & Optimize file1.c Compile & Optimize Compile & Optimize file2.c Compile & Optimize file3.c file1.c file3.c Compile & Optimize file4.c file4.c file2.c 13 Math Libraries icc comes with Intel’s optimized math libraries ▪ libimf (scalar) and libsvml (scalar & vector) ▪ Faster than GNU* libm ▪ Driver links libimf automatically, ahead of libm ▪ Additional functions (replace math.h by mathimf.h) Don’t link to libm explicitly! -lm ▪ May give you the slower libm functions instead ▪ Though the Intel driver may try to prevent this ▪ gcc needs -lm, so it is often found in old makefiles 14 SIMD: Single Instruction, Multiple Data for (i=0; i<n; i++) z[i] = x[i] + y[i]; ❑ Scalar mode ❑ Vector (SIMD) mode – one instruction produces – one instruction can produce one result multiple results – E.g. vaddss, (vaddsd) – E.g. vaddps, (vaddpd) 8 doubles for AVX-512 X X x7 x6 x5 x4 x3 x2 x1 x0 + + Y Y y7 y6 y5 y4 y3 y2 y1 y0 = = X + Y X + Y x7+y7 x6+y6 x5+y5 x4+y4 x3+y3 x2+y2 x1+y1 x0+y0 Vectorization Vectorization is the generation of loops using packed SIMD instructions as above Different approaches: ▪ Auto-vectorization: The compiler does it all, at -O2 or above • Compiler is responsible for correctness ▪ Assisted

Intel® Compilers Overview Michael Steyer Technical Consulting Engineer Intel Architecture, Graphics and Software Legal Disclaimer & Optimization Notice

Generalizing Loop-Invariant Code Motion in a Real-World Compiler

A Tiling Perspective for Register Optimization Fabrice Rastello, Sadayappan Ponnuswany, Duco Van Amstel

Efficient Symbolic Analysis for Optimizing Compilers*

Using Static Single Assignment SSA Form

Loop Transformations and Parallelization

Compiler-Based Code-Improvement Techniques

Autotuning for Automatic Parallelization on Heterogeneous Systems

Programming Parallel Machines

An Unfolding-Based Loop Optimization Technique

SSE Optimizations These Optimizations Leverage the Streaming SIMD Extension (SSE) Instruction Set of the CPU

Compiler Construction

Using Openmp Effectively on Theta