Using Intel® Advisor XE 2016 to Vectorize and Thread Your Code

Performance is a Proven Game Changer
It is driving disruptive change in multiple industries.
• Protecting buildings from extreme events: sophisticated mechanics simulations are performed to identify innovative ways to protect infrastructure from extreme events, such as natural disasters.
• Solving Austin, Texas’s traffic problem: running advanced traffic simulations to improve the models used to plan infrastructure and traffic control changes.
• New possible treatments for Parkinson’s: extensive calculations performed on a supercomputer helped researchers to learn more about the protein structure’s evolution.

Optimization Notice. Copyright © 2015, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others.

The “Free Lunch” is over, really
Processor clock rate growth halted around 2005.
Source: © 2014, James Reinders, Intel, used with permission
Software must be parallelized to realize all the potential performance.

Changing Hardware Impacts Software
More cores, more threads, wider vectors

Processor                                                        Core(s)  Threads  SIMD Width
Intel® Xeon® processor 64-bit                                    1        2        128
Intel® Xeon® processor 5100 series                               2        2        128
Intel® Xeon® processor 5500 series                               4        8        128
Intel® Xeon® processor 5600 series                               6        12       128
Intel® Xeon® processor code-named Sandy Bridge EP                8        16       256
Intel® Xeon® processor code-named Ivy Bridge EP                  12       24       256
Intel® Xeon® processor code-named Haswell EP                     18       36       256
Intel® Xeon Phi™ coprocessor Knights Corner                      61       244      512
Intel® Xeon Phi™ processor & coprocessor Knights Landing [1]     60+

*Product specification for launched and shipped products available on ark.intel.com.
[1] Not launched or in planning.

High performance software must be both:
• Parallel (multi-thread, multi-process)
• Vectorized

Data Driven Vectorization Design
Intel® Advisor XE – Vectorization Advisor
Have you:
• Recompiled with AVX2, but seen little benefit?
• Wondered where to start adding vectorization?
• Recoded intrinsics for each new architecture?
• Struggled with cryptic compiler vectorization messages?
Breakthrough for vectorization design (more performance, fewer machine dependencies)
• What vectorization will pay off the most?
• What is blocking vectorization and why?
• Are my loops vector friendly?
• Will reorganizing data increase performance?
• Is it safe to just use pragma simd?

The Right Data At Your Fingertips
Get all the data you need for high impact vectorization
• Filter by which loops are vectorized!
• Focus on hot loops
• What prevents vectorization?
• What vectorization issues do I have?
• Which vector instructions are being used?
• How efficient is the code?

4 Steps to Efficient Vectorization
Intel® Advisor XE – Vectorization Advisor
1. Compiler diagnostics + Performance Data + SIMD efficiency information
2. Guidance: detect problem and recommend how to fix it
3. Loop-Carried Dependency Analysis
4. Memory Access Patterns Analysis

Efficiently Vectorize your code
Intel Advisor XE – Vectorization Advisor
Background on loop vectorization
A typical vectorized loop consists of:
• Main vector body
  • Fastest among the three! This is where we want our loops to be executing!
• Optional peel part
  • Used for the unaligned references in your loop. Uses scalar or slower vector.
• Remainder part
  • Due to the number of iterations (trip count) not being divisible by vector length. Uses scalar or slower vector.
Larger vector register means more iterations in peel/remainder
• Make sure you align your data (and tell the compiler it is aligned)!
• Make the number of iterations divisible by the vector length!

Vectorizing problems solved!
Get your code to vectorize
• For unvectorized loops, discover what prevents code from being vectorized and get tips on how to vectorize it.
Increase Performance
• For vectorized loops, measure their performance efficiency and get tips on how to increase it.
• For both vectorized and unvectorized loops, explore how the memory layout and data structures can be made more vector-friendly.

Quickly categorize your loops
1. Vectorizable, but not vectorized loops
   • Require some minimal program changes such as OpenMP 4.x to enable compiler-driven SIMD parallelism
2. Vectorized loops whose performance could be improved
   • Low-hanging optimization techniques
3. Vectorized loops whose performance was limited by data layout
   • Requiring code refactoring to further speed up execution
4. Vectorized loops that performed well
5. All other cases (including non-vectorizable kernels)
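
The peel / main body / remainder split described in the background slide can be illustrated with a little arithmetic. The sketch below is a simplified model and not what any particular compiler is guaranteed to generate: the type loop_split and the function split_loop are hypothetical names introduced here, and the model assumes the peel runs one element at a time until the data reaches an aligned boundary.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper type: how a trip count splits across the three parts. */
typedef struct {
    size_t peel;       /* iterations run before the data is aligned */
    size_t main_body;  /* iterations covered by full-width vector steps */
    size_t remainder;  /* leftover iterations after the last full vector */
} loop_split;

/*
 * trip_count:       total number of loop iterations.
 * misaligned_elems: how many elements the array start sits past an aligned
 *                   boundary (0 means already aligned).
 * vector_len:       elements per vector register, e.g. 8 floats for 256 bits.
 */
loop_split split_loop(size_t trip_count, size_t misaligned_elems,
                      size_t vector_len)
{
    assert(vector_len > 0 && misaligned_elems < vector_len);

    loop_split s;
    /* Iterations needed to reach the next aligned boundary. */
    s.peel = (vector_len - misaligned_elems) % vector_len;
    if (s.peel > trip_count)
        s.peel = trip_count;

    size_t rest = trip_count - s.peel;
    s.remainder = rest % vector_len;      /* not enough left for a full vector */
    s.main_body = rest - s.remainder;     /* multiple of vector_len */
    return s;
}
```

Under this model, 100 iterations starting 3 elements past a boundary leave 12 iterations outside the main body with 8-element vectors (5 peel, 7 remainder) and 20 with 16-element vectors (13 peel, 7 remainder), which is the "larger vector register means more iterations in peel/remainder" effect. With aligned data and a trip count of 96, both the peel and the remainder are empty.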
Vectorizable, but not vectorized loops
Our hottest loop falls into this category.
• In the “Why no vectorization” column, Vector Advisor reports that “loop iteration count could not be computed”.
• Vector Advisor gives details on how to fix the issue under “Compiler Diagnostic Detail”.
• The advice is to use #pragma omp simd to vectorize the loop.

Vectorizable, but not vectorized loops
After adding #pragma omp simd to our loop:
• The loop is now vectorized at 100% efficiency.
• It moves to category #4: it is now a well-performing loop.

Vectorizable, but not vectorized loops
Assumed dependencies present
• More data is needed to confirm if the loop can be vectorized.
• Select the given loop and run dependency analysis to see if it can be vectorized using an OpenMP* 4.0 simd pragma.

Data Dependencies – Tough Problem #1
Is it safe to force the compiler to vectorize?
Data dependencies:
    for (i = 1; i < N; i++)      // Loop-carried dependencies!
        A[i] = A[i-1] * C[i];    // Need the ability to check if it
                                 // is safe to force the compiler
                                 // to vectorize!

Check if it is Safe to Vectorize
Loop-carried dependencies analysis verifies correctness
• Select loop for Correctness Analysis and press play!
• Vector Dependence prevents Vectorization!
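
To make the risk on the data dependencies slide concrete, here is a minimal sketch in plain C. All function names are hypothetical. running_product_forced4 is not real compiler output; it only imitates a 4-lane vector step by doing all four loads before any of the four stores, which is the reordering a forced simd pragma would permit. The scale function shows a loop with no loop-carried dependence, where #pragma omp simd is safe (the pragma is simply ignored if OpenMP support is not enabled).

```c
#include <assert.h>
#include <stddef.h>

/* Loop-carried dependence: iteration i reads what iteration i-1 wrote,
   so the iterations must run in order. Do not force this one with simd. */
void running_product(float *a, const float *c, size_t n)
{
    for (size_t i = 1; i < n; i++)
        a[i] = a[i - 1] * c[i];
}

/* Imitation of a forced 4-lane vector step (illustration only):
   all four a[i-1] values are loaded before any a[i] is stored. */
void running_product_forced4(float *a, const float *c, size_t n)
{
    size_t i = 1;
    for (; i + 4 <= n; i += 4) {
        float old[4];
        for (size_t k = 0; k < 4; k++)
            old[k] = a[i + k - 1];            /* "vector load" */
        for (size_t k = 0; k < 4; k++)
            a[i + k] = old[k] * c[i + k];     /* "vector store" */
    }
    for (; i < n; i++)                        /* scalar remainder */
        a[i] = a[i - 1] * c[i];
}

/* No loop-carried dependence: each out[i] depends only on a[i] and c[i]. */
void scale(float *restrict out, const float *restrict a,
           const float *restrict c, size_t n)
{
    #pragma omp simd
    for (size_t i = 0; i < n; i++)
        out[i] = a[i] * c[i];
}
```

Starting from eight ones in a and eight twos in c, the in-order loop ends with a[7] equal to 128, while the imitation of the forced vector step ends with a[7] equal to 16. The results differ because the stale values were loaded before the earlier lanes had been updated, and this is the kind of mismatch a dependency analysis is meant to catch before the pragma is added.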
Vectorized loops whose performance could be improved by low-hanging optimization techniques
Our top 2 hot loops now fall into this category. They are vectorized, but with an issue being reported:
• Ineffective peel/remainder loop
  • Top loop just has a peel loop and no vector body.
  • Next loop has vector body, peel and remainder loop. Only 27% efficient.
• Recommendation #1 – Align data to remove peel loop
• Recommendation #2 – Add data padding so that the loop trip count is divisible by vector length.

Vectorized loops whose performance could be improved by low-hanging optimization techniques
Code transformations
• Align data
• Tell the compiler the data is aligned
• Pad data structure
• Add pragma to tell the compiler what the trip count is
Peel and remainder disappear. 99% efficiency!
Performance improved
• Now 3.98x speedup as opposed to 1.09x (relative to scalar)

Vectorized loops whose performance was limited by data layout

Improve Vectorization
Memory Access pattern analysis
• Select loops of interest
• Run Memory Access Patterns analysis, just to check how memory is used in the loop and the called function

Vectorized loops whose performance was limited by data layout
Constant stride access can often
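
The align-and-pad code transformations listed on the low-hanging optimization slides can be sketched as follows. This is one possible approach using only standard C11 (aligned_alloc) and the OpenMP 4.0 aligned clause, not the code from the original demo. The 32-byte alignment assumes 256-bit vectors, and padded_count, alloc_padded and scale_aligned are hypothetical names. The compiler-specific trip-count pragma mentioned on the slide is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Round n up to the next multiple of vector_len. */
size_t padded_count(size_t n, size_t vector_len)
{
    assert(vector_len > 0);
    return (n + vector_len - 1) / vector_len * vector_len;
}

/* Allocate room for n floats, padded to whole 32-byte vectors and aligned
   to 32 bytes (assumption: 256-bit vectors). All elements, including the
   pad lanes, start at zero. Returns NULL on failure; free with free(). */
float *alloc_padded(size_t n, size_t *padded_n)
{
    size_t lanes = 32 / sizeof(float);
    *padded_n = padded_count(n, lanes);
    float *p = aligned_alloc(32, *padded_n * sizeof(float));
    if (p != NULL) {
        for (size_t i = 0; i < *padded_n; i++)
            p[i] = 0.0f;
    }
    return p;
}

/* Iterates over the padded count, so the trip count is a multiple of the
   vector length; the aligned clause tells the compiler about the alignment. */
void scale_aligned(float *a, float s, size_t padded_n)
{
    #pragma omp simd aligned(a : 32)
    for (size_t i = 0; i < padded_n; i++)
        a[i] *= s;
}
```

Running the loop over the padded count is only valid when the operation is harmless on the pad lanes, as it is here because they hold zeros. Whether the peel and remainder loops actually disappear depends on the compiler and its options, so the result is worth confirming in the tool's report.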
