Loop Independence, Vectorization and Threading of Loops (SSE & AVX)

Michael Klemm, Software & Services Group, Developer Relations Division

Legal Disclaimer

INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.

UNLESS OTHERWISE AGREED IN WRITING BY INTEL, THE INTEL PRODUCTS ARE NOT DESIGNED NOR INTENDED FOR ANY APPLICATION IN WHICH THE FAILURE OF THE INTEL PRODUCT COULD CREATE A SITUATION WHERE PERSONAL INJURY OR DEATH MAY OCCUR.

Intel may make changes to specifications and product descriptions at any time, without notice. Designers must not rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined." Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them. The information here is subject to change without notice. Do not finalize a design with this information.

The products described in this document may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.

Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your product order.

Copies of documents which have an order number and are referenced in this document, or other Intel literature, may be obtained by calling 1-800-548-4725, or go to: http://www.intel.com/design/literature.htm

Optimization Notice

Intel® compilers, associated libraries and associated development tools may include or utilize options that optimize for instruction sets that are available in both Intel® and non-Intel microprocessors (for example SIMD instruction sets), but do not optimize equally for non-Intel microprocessors. In addition, certain compiler options for Intel compilers, including some that are not specific to Intel micro-architecture, are reserved for Intel microprocessors. For a detailed description of Intel compiler options, including the instruction sets and specific microprocessors they implicate, please refer to the “Intel® Compiler User and Reference Guides” under “Compiler Options.” Many library routines that are part of Intel® compiler products are more highly optimized for Intel microprocessors than for other microprocessors. While the compilers and libraries in Intel® compiler products offer optimizations for both Intel and Intel-compatible microprocessors, depending on the options you select, your code and other factors, you likely will get extra performance on Intel microprocessors.

Intel® compilers, associated libraries and associated development tools may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include Intel® Streaming SIMD Extensions 2 (Intel® SSE2), Intel® Streaming SIMD Extensions 3 (Intel® SSE3), and Supplemental Streaming SIMD Extensions 3 (Intel® SSSE3) instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors.

While Intel believes our compilers and libraries are excellent choices to assist in obtaining the best performance on Intel® and non-Intel microprocessors, Intel recommends that you evaluate other compilers and libraries to determine which best meet your requirements. We hope to win your business by striving to offer the best performance of any compiler or library; please let us know if you find we do not.

Notice revision #20101101

Agenda

• Vector Instructions (SIMD)
• Compiler Switches for Optimization
• Controlling Auto-vectorization
• Controlling Auto-parallelization
• Manual Vectorization (SSE & AVX)
• Hands-on


What is SSE (and related instruction sets)?

• SSE: Streaming SIMD Extensions
• SIMD: Single Instruction, Multiple Data (Flynn's taxonomy)
• One SSE instruction applies the same operation to 2 doubles, 4 floats, or 4 32-bit integers at once:

    Source vector a:      a3        a2        a1        a0
            op
    Source vector b:      b3        b2        b1        b0
            =
    Destination vector:   a3 op b3  a2 op b2  a1 op b1  a0 op b0
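A minimal sketch of the picture above in code, using SSE intrinsics (the header and intrinsics are standard; the data values are made up for illustration):

    #include <xmmintrin.h>   /* SSE */

    __m128 a = _mm_set_ps(3.0f, 2.0f, 1.0f, 0.0f);   /* a3..a0 */
    __m128 b = _mm_set_ps(7.0f, 6.0f, 5.0f, 4.0f);   /* b3..b0 */
    __m128 c = _mm_add_ps(a, b);   /* {a3+b3, a2+b2, a1+b1, a0+b0}
                                      computed by one instruction */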

SSE Data Types (128 bit)

• 2x double
• 4x float
• 16x byte
• 8x short
• 4x integer32
• 2x integer64
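In intrinsics code these variants map onto a handful of C types; a short sketch using the standard headers and types:

    #include <emmintrin.h>   /* SSE2 */

    __m128  vf;   /* 4x float */
    __m128d vd;   /* 2x double */
    __m128i vi;   /* 16x byte, 8x short, 4x integer32 or 2x integer64,
                     depending on the instruction applied to it */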

AVX Data Types (256 bit)

An AVX register is 256 bits wide and is organized as two 128-bit lanes (lane 1 and lane 0); each lane holds the 128-bit SSE data types listed above.

Agenda

• Vector Instructions (SIMD)
• Compiler Switches for Optimization
• Controlling Auto-vectorization
• Controlling Auto-parallelization
• Manual Vectorization (SSE & AVX)
• Hands-on

Intel® Compiler Architecture

C++ Front End / FORTRAN Front End, feeding into:

• Profiler
• Disambiguation: types, arrays, pointers, structures, directives
• Interprocedural analysis and optimizations: inlining, constant propagation, whole-program detection, mod/ref, points-to
• Loop optimizations: data dependences, prefetching, vectorization, unroll/interchange/fusion/distribution, auto-parallelization/OpenMP
• Global scalar optimizations: partial redundancy elimination, dead store elimination, strength reduction, dead code elimination
• Code generation: vectorization, software pipelining, global scheduling, register allocation, code generation

A Few General Switches

Functionality                                                    Linux
Disable optimization                                             -O0
Optimize for speed (no code size increase), no SWP               -O1
Optimize for speed (default)                                     -O2
High-level optimizer (e.g. loop unrolling); -ftz (for Itanium)   -O3
Vectorization for x86 (-xSSE2 is the default)                    -x<extension>
Aggressive optimizations (e.g. -ipo, -O3, -no-prec-div,          -fast
  -static and -xHost for x86 on Linux*)
Create symbols for debugging                                     -g
Generate assembly files                                          -S
Optimization report generation                                   -opt-report
OpenMP support                                                   -openmp
Automatic parallelization (OpenMP* threading)                    -parallel

Architecture Specific Switches

Functionality                                                          Linux
Optimize for the current machine                                       -xHOST
Generate SSE v1 code                                                   -xSSE1
Generate SSE v2 code (may also emit SSE v1 code)                       -xSSE2
Generate SSE v3 code (may also emit SSE v1 and v2 code)                -xSSE3
Generate SSE v3 code for Atom-based processors                         -xSSE3_ATOM
Generate SSSE v3 code (may also emit SSE v1, v2, and v3 code)          -xSSSE3
Generate SSE4.1 code (may also emit (S)SSE v1, v2, and v3 code)        -xSSE4.1
Generate SSE4.2 code (may also emit (S)SSE v1, v2, v3, and 4.1 code)   -xSSE4.2
Generate AVX code                                                      -xAVX

Interprocedural Optimization
Extends optimizations across file boundaries

Without IPO, each source file (file1.c … file4.c) is compiled and optimized in isolation. With IPO, all files are compiled and optimized together as a single unit.

-ip   Only between modules of one source file
-ipo  Modules of multiple files / the whole application

IPO – A Multi-pass Optimization
A Two-Step Process

Pass 1 – Compiling (produces "virtual" .o files):
  Mac*/Linux*:  icc -c -ipo main.c func1.c func2.c
  Windows*:     icl -c /Qipo main.c func1.c func2.c

Pass 2 – Linking (produces the executable):
  Mac*/Linux*:  icc -ipo main.o func1.o func2.o
  Windows*:     icl /Qipo main.o func1.o func2.o

What you should know about IPO

• -O2 and -O3 activate "almost" file-local IPO (-ip)
  – Only a very few, time-consuming IP optimizations are omitted; for most codes an explicit -ip adds nothing
  – The switch -ip-no-inlining disables inlining
• IPO extends compilation time and memory usage
  – See the compiler manual when running into limitations
• Inlining of functions is the most important feature of IPO, but there is much more:
  – Interprocedural constant propagation
  – MOD/REF analysis (for dependence analysis)
  – Routine attribute propagation
  – Dead code elimination
  – Induction variable recognition
  – …many, many more

Profile-Guided Optimizations (PGO)

• Use execution-time feedback to guide (final) optimization
• Helps I-cache, paging, branch prediction
• Enabled optimizations:
  – Basic block ordering
  – Better register allocation
  – Better decisions on which functions to inline
  – Function ordering
  – Switch-statement optimization

PGO Usage: A Three-Step Process

Step 1 – Instrumented compilation:
  icc -prof_gen prog.c
  produces an instrumented executable prog.exe

Step 2 – Instrumented execution (on a typical dataset):
  running prog.exe produces .dyn files containing dynamic profile information

Step 3 – Feedback compilation:
  icc -prof_use prog.c
  uses the merged .dyn summary file (.dpi); delete old .dyn files unless you want their info included

Simple PGO Example: Code Re-Ordering

    for (i = 0; i < NUM_BLOCKS; i++) {
        switch (check3(i)) {
        case 3:                  /* 25% */
            x[i] = 3;
            break;
        case 10:                 /* 75% */
            x[i] = i + 10;
            break;
        default:                 /*  0% */
            x[i] = 99;
            break;
        }
    }

With PGO, "case 10" is moved to the beginning: the compiler can eliminate most tests and jumps for the common case, resulting in fewer branch mispredictions.

What you should know about PGO

• An instrumented run can take up to twice as long
  – Inlining is disabled, and the tracing calls add overhead
• Sometimes the trace files cannot be found
  – Are you looking in the right directory?
  – A clean exit() call is necessary to dump the information
  – A debugger can help: break in the PGO trace start/end calls
• The benefit depends on the control-flow structure: irregular, data-dependent branching sees significant benefit; uniform, straight-line control flow sees little (the original figure contrasted two such control-flow graphs, labeled "Significant Benefit" and "Little Benefit")

Memory Reference Disambiguation
Options/Directives related to Aliasing

• -alias_args[-]
• -ansi_alias[-]
• -fno-alias: no aliasing in the whole program
• -fno-fnalias: no aliasing within single units
• -restrict (C99): the -restrict switch and the restrict qualifier enable selective pointer disambiguation (see the sketch below)
• -safe_cray_ptr: no aliasing introduced by Cray pointers
• -assume dummy_alias
• Related: the -ipo switch and the IVDEP directive
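A minimal sketch of selective disambiguation with restrict (the function and variable names are illustrative). Compiled with icc -restrict, the qualifier tells the compiler that dst and src cannot overlap, so the loop can be vectorized:

    void scale(float *restrict dst, const float *restrict src,
               float s, int n)
    {
        /* restrict: no element of dst aliases an element of src */
        for (int i = 0; i < n; i++)
            dst[i] = s * src[i];
    }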

Optimization Report Options

• opt_report – generate an optimization report to stderr (or a file)
• opt_report_file – specify the filename for the generated report
• opt_report_phase – specify the phase that reports are generated against
• opt_report_routine – report on routines containing the given name
• opt_report_help – display the optimization phases available for reporting
• vec-report – generate a vectorization report (IA-32, EM64T)

What did the Compiler do?

Most of the phases the user can ask detailed reports from (the ecg* phases and hlo_loadpair are for IPF only):

ipo:  ipo_inl, ipo_cp, ipo_align, ipo_modref, ipo_lpt, ipo_subst, ipo_vaddr, ipo_psplit, ipo_gprel, ipo_pmerge, ipo_fps, ipo_ppi, ipo_unref, ipo_wp
ilo:  ilo_lowering, ilo_strength_reduction, ilo_reassociation, ilo_copy_propagation, ilo_convert_insertion, ilo_convert_removal
hpo:  hpo_analysis, hpo_openmp, hpo_vectorization, hpo_threadization
hlo:  hlo_scalar_expansion, hlo_unroll, hlo_distribution, hlo_fusion, hlo_prefetch, hlo_loop_collapsing, hlo_reroll, hlo_loadpair
ecg:  ecg_code_cycles, ecg_code_size, ecg_predication, ecg_profiling, ecg_inl, ecg_copy_propagation, ecg_brh, ecg_speculation, ecg_swp
pgo

Only a few phases are relevant for the typical compiler user, but these are really helpful. Make use of them – much easier than inspecting assembler code.

Sample HLO Report

icc -O3 -opt_report -opt_report_phase hlo

LOOP INTERCHANGE in loops at line: 7 8 9
Loopnest permutation ( 1 2 3 ) --> ( 2 3 1 )
LOOP INTERCHANGE in loops at line: 15 17
Loopnest permutation ( 1 2 3 ) --> ( 3 2 1 )
…

Loop at line 7 unrolled and jammed by 4
Loop at line 8 unrolled and jammed by 4
Loop at line 15 unrolled and jammed by 4
Loop at line 16 unrolled and jammed by 4
…

Some Intel-Specific Compiler Directives

#pragma vector / novector (IA-32, Intel64)
  Indicates to the compiler that the loop should (not) be vectorized, overriding the compiler's heuristics.

#pragma loop count(n) (IA-32, Intel64, IA-64)
  Place before a loop to communicate the approximate number of iterations the loop will execute. Affects software pipelining, vectorization, and other loop transformations. Variants: loop count(n1, n2, n3, …) — the count is always one of n1, n2, n3, …; loop count min=, avg=, max= — expected minimum, average, and maximal count.

#pragma distribute point (IA-32, Intel64, IA-64)
  Placed before a loop, the compiler will attempt to distribute the loop based on its internal heuristics. Placed within a loop, the compiler will attempt to distribute the loop at the point of the pragma. All loop-carried dependencies will be ignored.

#pragma unroll, unroll(n), nounroll (IA-32, Intel64, IA-64)
  Place before an inner loop (ignored on non-innermost loops). #pragma unroll without a count lets the compiler determine the unroll factor; #pragma unroll(n) tells the compiler to unroll the loop n times; #pragma nounroll is the same as #pragma unroll(0).

#pragma ivdep (IA-32, Intel64, IA-64)
  Place before a loop to control vectorization/software pipelining. The compiler is instructed to ignore "assumed" (not proven) dependencies preventing vectorization/software pipelining. For Itanium: assume no BACKWARD dependencies; FORWARD loop-carried dependencies can still exist without preventing SWP. Use with the -ivdep_parallel option to exclude loop-carried dependencies completely (e.g. for indirect addressing). A usage sketch follows below.
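A sketch combining two of these directives (the function name and counts are illustrative; ivdep here asserts that idx contains no duplicate indices, which the compiler cannot prove on its own):

    void scale_indexed(float *a, const float *b, const int *idx, int n)
    {
    #pragma ivdep              /* ignore the assumed dependence through a[idx[i]] */
    #pragma loop count (1000)  /* hint: typical trip count */
        for (int i = 0; i < n; i++)
            a[idx[i]] = 2.0f * b[i];
    }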

C/C++ Compiler for Linux*

• Main goals/values of the Intel C/C++ compiler for Linux:
  – Compatible with GCC
  – In general much faster than GCC, in particular for typical HPC applications
  – Provides critical features not offered by GCC, like support for the latest Intel® processors
• Compatibility splits into three parts:
  – C/C++ source code language acceptance: almost done; the missing features are questionable (e.g. nested functions in C as supported by GCC)
  – Switch compatibility: achieved for the most relevant options
  – Object code interoperability: 100% reached today, even for complex C++ and OpenMP*
• The ultimate compatibility test: the Linux kernel build
  – No manual changes to the source code required, but a 'wrapper' script is needed

Agenda

• Vector Instructions (SIMD)
• Compiler Switches for Optimization
• Controlling Auto-vectorization
• Controlling Auto-parallelization
• Manual Vectorization (SSE & AVX)
• Hands-on

Auto-vectorization

• Key requirements
  – A compiler must not alter the program semantics
  – If the compiler cannot determine all dependencies, it has to forego vectorization

• Compilers sometimes need to act very conservatively
  – Pointers make it hard for the compiler to deduce the memory layout
  – Code may create overlapping arrays through pointer arithmetic
  – If the compiler can't tell, it does not vectorize (see the sketch below)
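A sketch of why pointers force this conservatism (the function is illustrative):

    void shift_add(float *a, float *b, int n)
    {
        /* If the caller passes b == a + 1, iteration i writes the element
           that iteration i+1 reads. The compiler cannot rule this out from
           the signature alone, so it must forego vectorization. */
        for (int i = 0; i < n; i++)
            a[i] = b[i] + 1.0f;
    }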

Data Dependencies

• Suppose two statements S1 and S2
• S2 depends on S1 iff S1 must be executed before S2
  – Control-flow dependence
  – Data dependence
  – Dependencies can be carried across loop iterations
• Flavors of data dependencies (loop-carried variants are sketched below):

    FLOW                     ANTI
    s1: a = 40               b = 40
        b = 21               s1: a = b + 1
    s2: c = a + 2            s2: b = 21
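Carried across loop iterations, the same flavors look like this (the loops are illustrative):

    /* loop-carried FLOW dependence: iteration i reads what i-1 wrote */
    for (i = 1; i < n; i++)
        a[i] = a[i-1] + b[i];    /* not vectorizable as written */

    /* loop-carried ANTI dependence: iteration i reads what i+1 overwrites */
    for (i = 0; i < n-1; i++)
        a[i] = a[i+1] + b[i];    /* vectorizable: all reads can be done
                                    before the writes */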

Compiler Options for Vectorization

• Vectorization is automatically enabled at -O2
• It can be controlled through command-line switches:
    -vec     enable vectorization
    -no-vec  disable vectorization
• Vectorization reports help find out about (non-)vectorized code:
    -vec-report<n>  enable vectorization diagnostics
      0  report no diagnostic information
      1  report on vectorized loops (default)
      2  report on vectorized and non-vectorized loops
      3  report on vectorized and non-vectorized loops and any proven or assumed data dependences
      4  report on non-vectorized loops
      5  report on non-vectorized loops and the reason why they were not vectorized
• Compiler environment setup (for the hands-on):
    source /opt/intel/Compiler/11.1/064/bin/iccvars.sh intel64

Vectorization Hints

• Sometimes the compiler needs some help when looking at your code
• Compiler pragmas (see the sketch below):
  – #pragma ivdep
    Indicates that there is no loop-carried dependence in the loop
  – #pragma vector always | aligned | unaligned
    Instructs the compiler to always vectorize a loop, ignoring its internal heuristics
    • always: always vectorize
    • aligned: use aligned load/store instructions
    • unaligned: use unaligned load/store instructions
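A sketch of the vector pragma in use (the function is illustrative; the aligned variant is only safe if the data really is 16-byte aligned):

    void saxpy(float *x, float *y, float a, int n)
    {
    #pragma vector aligned   /* promise: x and y are 16-byte aligned */
        for (int i = 0; i < n; i++)
            y[i] = y[i] + a * x[i];
    }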

Position of SIMD Features

Ranked from greatest ease of use to most programmer control:

• Fully automatic vectorization
• Auto-vectorization hints (#pragma ivdep)
• SIMD feature (#pragma simd and SIMD function annotation)
• SIMD intrinsic classes (F32vec4 add)
• Vector intrinsics (_mm_add_ps())
• ASM code (addps)

Why Didn't My Loop Vectorize?

• Linux: -vec-report<n>    Windows: /Qvec-report<n>
• Sets the diagnostic level dumped to stdout:
    n=0: no diagnostic information
    n=1: (default) loops successfully vectorized
    n=2: loops not vectorized – and the reason why not
    n=3: adds dependency information
    n=4: reports only non-vectorized loops
    n=5: reports only non-vectorized loops and adds dependency info

SIMD Extension for Vector-Level Parallelism

User-mandated vectorization using the new SIMD pragma and SIMD function annotation:

• The SIMD pragma provides additional information to the compiler to enable vectorization of loops (at this time only inner loops)
• SIMD function annotations add interprocedural information to facilitate vectorization
• This supplements automatic vectorization, but differently from what traditional "auto-vectorization hint" pragmas like IVDEP and VECTOR ALWAYS do:
  – Traditional pragmas are hints that do not necessarily override the compiler's heuristics
  – The new SIMD extensions are more like an assertion: if vectorization still fails, it is considered a fault (an option controls whether it is really treated as an error)
  – The relationship is similar to that of OpenMP versus automatic parallelization

SIMD Directives: Sample

Without the pragma, the compiler decides on its own whether to vectorize. (The loop bodies were truncated on the original slide; the statement shown here is illustrative.)

    void foo(float *a, float *b, float *c, int n)
    {
        for (int k = 0; k < n; k++)
            c[k] = a[k] * b[k];      /* illustrative loop body */
    }

With the SIMD pragma, vectorization with a vector length of 4 is mandated:

    void foo(float *a, float *b, float *c, int n)
    {
    #pragma simd vectorlength(4)
        for (int k = 0; k < n; k++)
            c[k] = a[k] * b[k];      /* illustrative loop body */
    }

SIMD Directive and Clauses

#pragma simd [clause-list]

• No clause, i.e. plain #pragma simd
  – Enforce vectorization; ignore dependencies etc.
• vectorlength(n1, n2, …, nN)
  – Executing ni iterations as one vector instruction of vector length ni is semantically equivalent to ni scalar iterations
• private(var1, var2, …, varN)
  – Variables private to each iteration; the initial value is broadcast to all private instances, and the last value is copied out from the last iteration instance
• linear(var1:step1, var2:step2, …, varN:stepN)
  – For every iteration of the scalar loop, var_i is incremented by step_i; every iteration of the vector loop therefore increments var_i by VL*step_i
• reduction(operator:var1, var2, …, varN)
  – The loop implements a reduction (like "+") on the listed arguments, which can be vectorized (see the sketch below)
• [no]assert
  – Whether or not to assert when vectorization fails; the default for the SIMD pragma is to assert
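A sketch of the reduction clause (the function is illustrative): the pragma mandates vectorization and tells the compiler that sum is a "+" reduction, so partial sums per vector lane plus a final combine preserve the semantics:

    float dot(const float *a, const float *b, int n)
    {
        float sum = 0.0f;
    #pragma simd reduction(+:sum)
        for (int i = 0; i < n; i++)
            sum += a[i] * b[i];
        return sum;
    }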

SIMD Function Annotation: Sample

    __attribute__((simd)) float foo(float);

    void vfoo(float *restrict a, float *restrict b, int n)
    {
        int i;
        for (i = 0; i < n; i++)
            a[i] = foo(b[i]);    /* loop body truncated on the original
                                    slide; reconstructed as a call to foo */
    }

    float foo(float x) { ... }

    $ icc example.c -O3 -vec-report3 -restrict -vec:simd

    example.c(9): (col. 3) remark: LOOP WAS VECTORIZED.
    example.c(14): (col. 3) remark: FUNCTION WAS VECTORIZED.

The SIMD annotation declares 'foo' as a function that will also be available in a vectorized version. For calling functions this enables vectorization of the loop, and argument and return value handling are adapted correspondingly.

SIMD Function Annotations (Only a Subset)

__attribute__((simd [(clauses)])), where a clause can be:

• processor(cpuid)
  – Generate a vector version of the function for the given processor. The default processor is taken from the implicit or explicit processor- or architecture-specific flag on the compiler command line.
• linear(param1:step1, param2:step2, …, paramN:stepN)
  – Consecutive invocations of the function increment param_i by step_i
• scalar(param1, param2, …, paramN)
  – Indicates that the values of these parameters can be broadcast to all iterations as a performance optimization
• mask
  – Generate a masked vector version of the function
• user
  – Do not generate this vector version, since it is provided by the programmer

CEAN: C/C++ Extensions for Array Notation

Array notation is added to C/C++, similar to what Fortran 90 did for the Fortran language, to express data-parallel (e.g. SSE) code explicitly.

Sample:

    // Traditional:
    for (i = 0; i < M-K; i++) {
        s = 0;
        for (j = 0; j < K; j++)
            s += x[i+j] * c[j];
        y[i] = s;
    }

    // CEAN version – outer loop "vectorized":
    y[0:M-K] = 0;
    for (j = 0; j < K; j++)
        y[0:M-K] += x[j:M-K] * c[j];

• Nothing totally new; there have been many similar initiatives over the past 30 years, e.g. "Vector C" for the Control Data 205 more than 25 years ago
• CEAN requires the switch -farray-notation in the 12.0 compiler

CEAN Array Sections

• Array sections
  – The specification per dimension is lower bound : count [: stride]
  – This differs from Fortran's lower bound : upper bound [: stride]
  – Samples:
      A[:]       // All of vector A
      B[2:6]     // Elements 2 to 7 of vector B
      C[:][5]    // Column 5 of matrix C
      D[0:3:2]   // Elements 0, 2, 4 of vector D
• Basic, data-parallel operations on array sections
  – Support for most C/C++ arithmetic and logic operators like "+", "/", "<", "&&", "+=", …
  – The shapes of the sections must be identical; scalars are expanded implicitly
  – Sample: (a[0:s] + b[5:s]) * pi   // pi * (a[i] + b[i+5]) for i = 0 … s-1

CEAN Assignment

Assignment evaluates the RHS and assigns its values to the LHS in parallel:
• The shapes of the RHS and LHS array sections must be the same
• A scalar is expanded automatically
• Conceptually, the RHS is evaluated completely before the LHS is written
  – The compiler ensures this semantics in the generated code
  – This can be a critical performance issue (temporary array variable)
• Samples:

    a[:][:] = b[:][2][:] + c;
    e[:] = d;           // scalar expansion
    e[:] = b[:][1][:];  // error, shapes different
    a[:][:] = e[:];     // error, shapes different
    a[b[0:s]] = c[:];   // scatter operation
    c[0:s] = a[b[:]];   // gather operation

CEAN – Some Advanced Features

• Functions can take array sections as arguments and can return sections
  – Any assumption on evaluation order (side effects) is a mistake
  – The compiler may generate calls to vectorized library functions
  – Examples:
      a[:] = pow(b[:], c);  // b[:]**c
      a[:] = pow(c, b[:]);  // c**b[:]
      a[:] = foo(b[:]);     // user-defined

• Reductions combine the elements of an array section into a single value using pre-defined operators or a user function
  – Pre-defined, e.g. __sec_reduce_{add, mul, min, …}:

      sum = __sec_reduce_add(a[:]*b[:]);  // dot product
      res = __sec_reduce(fn, a[:], 0);    // apply function fn

Agenda

• Vector Instructions (SIMD)
• Compiler Switches for Optimization
• Controlling Auto-vectorization
• Controlling Auto-parallelization
• Manual Vectorization (SSE & AVX)
• Hands-on

Auto-parallelization

• Key requirements
  – A compiler must not alter the program semantics
  – If the compiler cannot determine all dependencies, it has to forego parallelization

• Compilers sometimes need to act very conservatively
  – Pointers make it hard for the compiler to deduce the memory layout
  – Code may create overlapping arrays through pointer arithmetic
  – If the compiler can't tell, it does not parallelize

• The past 30 years have shown that auto-parallelization
  – is a tough problem in general
  – is only applicable to very regular loops (see the sketch below)
  – cannot take over manual parallelization tasks
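The kind of loop -parallel can handle is exactly this regular (the function is illustrative): independent iterations, a trip count known on loop entry, and no aliasing between the arrays:

    void vadd(double *restrict a, const double *restrict b,
              const double *restrict c, int n)
    {
        for (int i = 0; i < n; i++)    /* no loop-carried dependence */
            a[i] = b[i] + c[i];
    }

Compiled with icc -parallel -par-report2, the report should state whether the loop was auto-parallelized.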

Compiler Options for Parallelization

• Parallelization is not enabled automatically
• It can be controlled through command-line switches:
    -parallel          enable parallelization
    -par-threshold<n>  parallelization threshold:
      n = 0    always parallelize
      n = 50   parallelize if the probability of a performance gain is 50%
      n = 100  parallelize only if a performance gain is certain
• Parallelization reports:
    -par-report<n>  enable parallelization diagnostics
      0  report no diagnostic information
      1  report on successfully parallelized loops (default)
      2  report on successfully and unsuccessfully parallelized loops
      3  like 2, but also give information about proven and assumed data dependencies

Compiler Options for Parallelization (cont.)

• Controlling scheduling: -par-schedule-<keyword>
    auto               lets the compiler or run-time system determine the scheduling algorithm
    static             divides iterations into contiguous pieces
    static-balanced    divides iterations into even-sized chunks
    static-steal       divides iterations into even-sized chunks, but allows threads to steal parts of chunks from neighboring threads
    dynamic            gets a set of iterations dynamically
    guided             specifies a minimum number of iterations
    guided-analytical  divides iterations by using exponential distribution or dynamic distribution
    runtime            defers the scheduling decision until run time

Agenda

• Vector Instructions (SIMD)
• Compiler Switches for Optimization
• Controlling Auto-vectorization
• Controlling Auto-parallelization
• Manual Vectorization (SSE & AVX)
• Hands-on

Element-wise Vector Multiplication

    void vec_eltwise_product(vec_t* a, vec_t* b, vec_t* c)
    {
        size_t i;
        for (i = 0; i < a->size; i++) {
            c->data[i] = a->data[i] * b->data[i];
        }
    }

Element-wise Vector Multiplication (SSE)

    void vec_eltwise_product_sse(vec_t* a, vec_t* b, vec_t* c)
    {
        size_t i;
        __m128 va, vb, vc;
        for (i = 0; i < a->size; i += 4) {   /* assumes size is a multiple of 4 */
            va = _mm_loadu_ps(&a->data[i]);
            vb = _mm_loadu_ps(&b->data[i]);
            vc = _mm_mul_ps(va, vb);
            _mm_storeu_ps(&c->data[i], vc);
        }
    }

Element-wise Vector Multiplication (AVX)

    void vec_eltwise_product_avx(vec_t* a, vec_t* b, vec_t* c)
    {
        size_t i;
        __m256 va, vb, vc;
        for (i = 0; i < a->size; i += 8) {   /* assumes size is a multiple of 8 */
            va = _mm256_loadu_ps(&a->data[i]);
            vb = _mm256_loadu_ps(&b->data[i]);
            vc = _mm256_mul_ps(va, vb);
            _mm256_storeu_ps(&c->data[i], vc);
        }
    }
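Both intrinsic versions assume that a->size is a multiple of the vector width. A hedged sketch of how the AVX loop could handle the general case with a scalar tail:

    size_t i;
    for (i = 0; i + 8 <= a->size; i += 8) {
        __m256 va = _mm256_loadu_ps(&a->data[i]);
        __m256 vb = _mm256_loadu_ps(&b->data[i]);
        _mm256_storeu_ps(&c->data[i], _mm256_mul_ps(va, vb));
    }
    for (; i < a->size; i++)            /* scalar tail for the remainder */
        c->data[i] = a->data[i] * b->data[i];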

Zen and the Art of Permutation

• New in-lane PS and PD permutes
  – The permutation is controlled via an immediate:
      vpermilps dest, src, imm
  – Each element of a destination lane selects any source element of the same lane (the original figure showed SRC1 = X7 … X0 permuted within each 128-bit lane into DEST)
• New 128-bit permutes
  – vperm2f128 dest, src1, src2, imm
  – Each 128-bit half of DEST can be any of X0, X1, Y0, or Y1, the 128-bit halves of SRC1 (X1:X0) and SRC2 (Y1:Y0)
  – Useful for lane-crossing operations
• How do you compute the control value for the permutation? See the next two examples.

Zen and the Art of Permutation

The 8-bit immediate holds one 2-bit source selector per destination element, applied identically in both lanes. Control value 00 01 10 11 = 27 (0x1B) reverses the four elements within each lane: {D, C, B, A} becomes {A, B, C, D}.

Zen and the Art of Permutation

Control value 10 10 10 10 = 170 (0xAA) uses selector 2 for every destination element, broadcasting element 2 of each lane. A sketch using the corresponding AVX intrinsic follows below.
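A sketch of both control values through the AVX intrinsic for vpermilps (the values are illustrative):

    #include <immintrin.h>   /* AVX */

    /* elements 0..7 hold the values 0..7 (_mm256_set_ps lists them high to low) */
    __m256 v = _mm256_set_ps(7, 6, 5, 4, 3, 2, 1, 0);

    __m256 r = _mm256_permute_ps(v, 0x1B);  /* reverse within each lane:
                                               elements {3,2,1,0 | 7,6,5,4} */
    __m256 s = _mm256_permute_ps(v, 0xAA);  /* broadcast element 2 of each lane:
                                               elements {2,2,2,2 | 6,6,6,6} */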

Agenda

• Vector Instructions (SIMD)
• Compiler Switches for Optimization
• Controlling Auto-vectorization
• Controlling Auto-parallelization
• Manual Vectorization (SSE & AVX)
• Hands-on

Lab 1: Vectorization

• Look into the "vectors" directory. There you will find code like this:

    void scalar_product(vec_t* a, vec_t* b, vec_t *c)
    {
        size_t i;
        for (i = 0; i < a->size; i++) {
            c->data[i] = a->data[i] * b->data[i];
        }
    }

• Try to find out:
  – Why the compiler cannot vectorize the code
  – Whether you need to change the code
  – Which compiler options you need to enable vectorization

Lab 2: Parallelization

• Look into the "vectors" directory. There you will find code like this:

    void scalar_product(vec_t* a, vec_t* b, vec_t *c)
    {
        size_t i;
        for (i = 0; i < a->size; i++) {
            c->data[i] = a->data[i] * b->data[i];
        }
    }

• Try to find out:
  – Why the compiler cannot parallelize the code
  – Whether you need to change the code
  – Which compiler options you need to enable parallelization

Lab 3: Manual Vectorization

• Look into the “vectors” directory.

• Implement the SSE and AVX functions for computing the scalar product of two vectors. – The functions are marked with “TODO”.

• Hints – you will need the following new intrinsics:
  – SSE: _mm_hadd_ps()
  – AVX: _mm256_maskstore_ps()
  (a sketch of the horizontal add follows at the end of this lab)

• The “Intrinsics Guide for Intel® AVX” is very useful for the hands-on (see http://software.intel.com/en-us/avx/)
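A hedged sketch of how _mm_hadd_ps collapses a register into a horizontal sum, which is the final step the SSE scalar product needs (the values are illustrative):

    #include <pmmintrin.h>   /* SSE3 */

    __m128 v = _mm_set_ps(4.0f, 3.0f, 2.0f, 1.0f);  /* elements {1,2,3,4} */
    v = _mm_hadd_ps(v, v);          /* {1+2, 3+4, 1+2, 3+4} */
    v = _mm_hadd_ps(v, v);          /* every element now holds 1+2+3+4 */
    float sum = _mm_cvtss_f32(v);   /* 10.0f */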

Questions?
