Introduction to Intel® Compiler Vectorization Basics Optimization Report Floating Point Model Explicit Vectorization
Kenneth Craft
Compiler Technical Consulting Engineer
Intel® Corporation
03-06-2018

Agenda
• Introduction to Intel® Compiler
• Vectorization Basics
• Optimization Report
• Floating Point Model
• Explicit Vectorization

Optimization Notice
Copyright © 2018, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others.

Basic Optimizations with icc/ifort -O…
-O0  no optimization; sets -g for debugging
-O1  scalar optimizations
     • Excludes optimizations tending to increase code size
-O2  default (except with -g)
     • Includes auto-vectorization; some loop transformations such as unrolling; inlining within source file
     • Start with this (after initial debugging at -O0)
-O3  more aggressive loop optimizations
     • Including cache blocking, loop fusion, loop interchange, …
     • May not help all applications; need to test
-qopt-report[=0-5]
     • Generates compiler optimization reports in files *.optrpt

Common Optimization Options
                                                Windows*                 Linux*, OS X*
Disable optimization                            /Od                      -O0
Optimize for speed (no code size increase)      /O1                      -O1
Optimize for speed (default)                    /O2                      -O2
High-level loop optimization                    /O3                      -O3
Create symbols for debugging                    /Zi                      -g
Multi-file inter-procedural optimization        /Qipo                    -ipo
Profile guided optimization (multi-step build)  /Qprof-gen, /Qprof-use   -prof-gen, -prof-use
Optimize for speed across the entire program    /fast                    -fast
("prototype switch")
OpenMP support                                  /Qopenmp                 -qopenmp
Automatic parallelization                       /Qparallel               -parallel

/fast is the same as: /O3 /Qipo /Qprec-div- /fp:fast=2 /QxHost
-fast is the same as:
  Linux: -ipo -O3 -no-prec-div -static -fp-model fast=2 -xHost
  OS X:  -ipo -mdynamic-no-pic -O3 -no-prec-div -fp-model fast=2 -xHost
The definitions of the fast options change over time!
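Loop interchange is one of the -O3 transformations named on the Basic Optimizations slide. As a rough illustration of what the term means, the sketch below writes the transformation by hand: the same matrix sum with the loops in two different orders. Whether the compiler performs this on a given loop nest depends on its own analysis. The names N, sum_column_order and sum_row_order are hypothetical and do not come from the slides.

```c
#include <stddef.h>

#define N 64  /* hypothetical matrix dimension, chosen only for illustration */

/* Before interchange: the inner loop walks down a column, so consecutive
 * iterations touch addresses N doubles apart (non-unit stride). */
double sum_column_order(double m[N][N])
{
    double sum = 0.0;
    for (size_t j = 0; j < N; j++)
        for (size_t i = 0; i < N; i++)
            sum += m[i][j];
    return sum;
}

/* After interchange: the inner loop walks along a row, so consecutive
 * iterations touch adjacent doubles (unit stride). */
double sum_row_order(double m[N][N])
{
    double sum = 0.0;
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++)
            sum += m[i][j];
    return sum;
}
```

Both functions visit the same elements. The two orders can round differently for general floating-point data because the additions happen in a different sequence, which is one reason such transformations interact with the floating-point model.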
Interprocedural Optimizations (IPO)
Multi-pass Optimization
• Interprocedural optimization performs a static, topological analysis of your application!
• ip: Enables inter-procedural optimizations for current source file compilation
  Windows*: /Qip    Linux*: -ip
• ipo: Enables inter-procedural optimizations across files
  Windows*: /Qipo   Linux*: -ipo
  - Can inline functions in separate files
  - Especially many small utility functions benefit from IPO
Enabled optimizations:
• Procedure inlining (reduced function call overhead)
• Interprocedural dead code elimination, constant propagation and procedure reordering
• Enhances optimization when used in combination with other compiler features
• Much of ip (including inlining) is enabled by default at option O2

Profile-Guided Optimizations (PGO)
Static analysis leaves many questions open for the optimizer, like:
• How often is x > y?
• What is the size of count?
• Which code is touched how often?

    if (x > y)
        do_this();
    else
        do_that();

    for (i = 0; i < count; ++i)
        do_work();

Use execution-time feedback to guide (final) optimization
Enhancements with PGO:
- More accurate branch prediction
- Basic block movement to improve instruction cache behavior
- Better decision of functions to inline (help IPO)
- Can optimize function ordering
- Switch-statement optimization
- Better vectorization decisions

Don’t use a single Vector lane!
Un-vectorized and un-threaded software will underperform
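The fragment on the PGO slide can be turned into a small compilable unit to make the open questions concrete. This is a minimal sketch: the call counters and the step() wrapper are hypothetical additions, and the file name in the build comment is assumed. Only the -prof-gen and -prof-use options come from the options table.

```c
#include <stddef.h>

/*
 * Sketch of the PGO slide's fragment as a compilable unit.
 * The counters and the step() wrapper are hypothetical additions.
 *
 * Multi-step build using the options from the table (file name assumed):
 *   icc -prof-gen -O2 app.c -o app    instrumented build
 *   ./app                             run on representative input
 *   icc -prof-use -O2 app.c -o app    rebuild using the collected profile
 */
static long this_calls, that_calls, work_calls;

static void do_this(void) { this_calls++; }
static void do_that(void) { that_calls++; }
static void do_work(void) { work_calls++; }

/* How often x > y holds and how large count is are only known at run time. */
void step(int x, int y, size_t count)
{
    if (x > y)
        do_this();
    else
        do_that();

    for (size_t i = 0; i < count; ++i)
        do_work();
}
```

The counters record by hand the kind of information an instrumented run gathers automatically: which branch is taken more often and how many times the loop body executes.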
Permission to Design for All Lanes
Threading and Vectorization needed to fully utilize modern hardware

Vectorize and Thread for Performance Boost
[Figure: chart comparing Serial, Vectorized, Threaded, and Vectorized & Threaded performance (vertical axis 0 to 200) across processor generations, annotated "130x". Caption: The Difference Is Growing with Each New Generation of Hardware.]
Processor generations shown:
2010  Intel® Xeon® Processor X5680, codenamed Westmere
2012  Intel Xeon Processor E5-2600, codenamed Sandy Bridge
2013  Intel Xeon Processor E5-2600 v2, codenamed Ivy Bridge
2014  Intel Xeon Processor E5-2600 v3, codenamed Haswell
2016  Intel Xeon Processor E5-2600 v4, codenamed Broadwell
2017  Intel® Xeon® Platinum Processor 81xx, codenamed Skylake Server

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more information go to http://www.intel.com/performance. Configurations for 2007-2016 Benchmarks at the end of this presentation.

Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, & SSSE3 instruction sets & other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice. Notice Revision #20110804

SIMD Types for Intel® Architecture

SSE
  Vector size: 128 bit (bits 127..0)
  Data types: 8, 16, 32, 64 bit integer; 32 and 64 bit float
  VL: 2, 4, 8, 16
  Illustration: (X4 X3 X2 X1) ◦ (Y4 Y3 Y2 Y1) = (X4◦Y4 X3◦Y3 X2◦Y2 X1◦Y1)

AVX
  Vector size: 256 bit (bits 255..0)
  Data types: 8, 16, 32, 64 bit integer; 32 and 64 bit float
  VL: 4, 8, 16, 32
  Illustration: (X8 … X1) ◦ (Y8 … Y1) = (X8◦Y8 … X1◦Y1)

Intel® AVX-512
  Vector size: 512 bit (bits 511..0)
  Data types: 8, 16, 32, 64 bit integer; 32 and 64 bit float
  VL: 8, 16, 32, 64
  Illustration: (X16 … X1) ◦ (Y16 … Y1) = (X16◦Y16 … X1◦Y1)

Illustrations: Xi, Yi & results 32 bit integer

Evolution of SIMD for Intel Processors
Processor         Instruction sets supported
Willamette        MMX, SSE, SSE2
Prescott          MMX, SSE, SSE2, SSE3
Merom             MMX, SSE, SSE2, SSE3, SSSE3
Penryn            MMX, SSE, SSE2, SSE3, SSSE3, SSE4.1
Nehalem           MMX, SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2
Sandy Bridge      MMX, SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2, AVX
Haswell           MMX, SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2, AVX, AVX2
Knights Landing   MMX, SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2, AVX, AVX2, AVX-512 F/CD, AVX-512 ER/PF
Skylake server    MMX, SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2, AVX, AVX2, AVX-512 F/CD, AVX-512 VL/BW/DQ
SIMD width: 128b from SSE through SSE4.2; 256b for AVX and AVX2; 512b for AVX-512
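The VL (vector length) values on the SIMD Types slide follow from dividing the register width by the element width. A minimal sketch of that arithmetic, with simd_lanes as a hypothetical helper name:

```c
/* Number of elements (lanes) that fit in one SIMD register:
 * register width in bits divided by element width in bits. */
unsigned simd_lanes(unsigned vector_bits, unsigned element_bits)
{
    return vector_bits / element_bits;
}
```

For example, a 128-bit SSE register holds 128 / 64 = 2 double-precision values, while a 512-bit AVX-512 register holds 512 / 32 = 16 single-precision values, matching the VL lists on the slide.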
Math Libraries
icc (ifort) comes with optimized math libraries
• libimf (scalar; faster than GNU libm) and libsvml (vector)
• Driver links libimf automatically, ahead of libm
• More functionality (replace math.h by mathimf.h for C)
• Optimized paths for Intel® AVX2 and Intel® AVX-512 (detected at run-time)
Don’t link to libm explicitly! -lm
• May give you the slower libm functions instead
• Though the Intel driver may try to prevent this
• GCC needs -lm, so it is often found in old makefiles
Options to control precision and “short cuts” for vectorized math library:
• -fimf-precision=<high|medium|low>
• -fimf-domain-exclusion=<mask>
  - Library need not check for special cases (∞, NaN, singularities)

Agenda
• Introduction to Intel® Compiler
• Vectorization Basics
• Optimization Report
• Explicit Vectorization
Auto-vectorization of Intel Compilers

C:
    void add(double *A, double *B, double *C)
    {
        for (int i = 0; i < 1000; i++)
            C[i] = A[i] + B[i];
    }

Fortran:
    subroutine add(A, B, C)
      real*8 A(1000), B(1000), C(1000)
      do i = 1, 1000
        C(i) = A(i) + B(i)
      end do
    end

Intel® SSE4.2:
    .B2.14:
      movups xmm1, XMMWORD PTR [edx+ebx*8]
      movups xmm3, XMMWORD PTR [16+edx+ebx*8]
      movups xmm5, XMMWORD PTR [32+edx+ebx*8]
      movups xmm7, XMMWORD PTR [48+edx+ebx*8]
      movups xmm0, XMMWORD PTR [ecx+ebx*8]
      movups xmm2, XMMWORD PTR [16+ecx+ebx*8]
      movups xmm4, XMMWORD PTR [32+ecx+ebx*8]
      movups xmm6, XMMWORD PTR [48+ecx+ebx*8]
      addpd  xmm1, xmm0
      addpd  xmm3, xmm2
      addpd  xmm5, xmm4
      addpd  xmm7, xmm6
      movups XMMWORD PTR [eax+ebx*8], xmm1
      movups XMMWORD PTR [16+eax+ebx*8], xmm3
      movups XMMWORD PTR [32+eax+ebx*8], xmm5
      movups XMMWORD PTR [48+eax+ebx*8], xmm7
      add    ebx, 8
      cmp    ebx, esi
      jb     .B2.14
      ..

Intel® AVX:
    .B2.15:
      vmovupd ymm0, YMMWORD PTR [ebx+eax*8]
      vmovupd ymm2, YMMWORD PTR [32+ebx+eax*8]
      vmovupd ymm4, YMMWORD PTR [64+ebx+eax*8]
      vmovupd ymm6, YMMWORD PTR [96+ebx+eax*8]
      vaddpd  ymm1, ymm0, YMMWORD PTR [edx+eax*8]
      vaddpd  ymm3, ymm2, YMMWORD PTR [32+edx+eax*8]
      vaddpd  ymm5, ymm4, YMMWORD PTR [64+edx+eax*8]
      vaddpd  ymm7, ymm6, YMMWORD PTR [96+edx+eax*8]
      vmovupd YMMWORD PTR [esi+eax*8], ymm1
      vmovupd YMMWORD PTR [32+esi+eax*8], ymm3
      vmovupd YMMWORD PTR [64+esi+eax*8], ymm5
      vmovupd YMMWORD PTR [96+esi+eax*8], ymm7
      add     eax, 16
      cmp     eax, ecx
      jb      .B2.15
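To check what the compiler did with a loop like this one, the slide's add function can be placed in its own file and compiled with the report option from the Basic Optimizations slide. The sketch below repeats the function with comments. The file name add.c and the exact command lines in the comment are assumptions, not taken from the slides, and the generated code will vary with compiler version and target.

```c
/*
 * The add() function from the slide, with build hints.
 *
 * Possible invocations (file name and option combination assumed):
 *   icc -O2 -xSSE4.2 -qopt-report=2 -c add.c    target Intel SSE4.2
 *   icc -O2 -xAVX    -qopt-report=2 -c add.c    target Intel AVX
 * Each build writes an optimization report to a *.optrpt file.
 */
void add(double *A, double *B, double *C)
{
    /* Fixed trip count of 1000, unit-stride accesses, no loop-carried
     * dependence through C as long as the arrays do not overlap. */
    for (int i = 0; i < 1000; i++)
        C[i] = A[i] + B[i];
}
```

A caller only needs three arrays of at least 1000 doubles. With 2 doubles per 128-bit register and 4 per 256-bit register, the listings above process 8 and 16 elements per loop iteration respectively, which matches their add ebx, 8 and add eax, 16 counter updates.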