Using Intel® Advisor XE 2016 to vectorize and thread your code

Performance is a Proven Game Changer
It is driving disruptive change in multiple industries

Protecting buildings from extreme events

Sophisticated mechanics simulations are performed to identify innovative ways to protect infrastructure from extreme events, such as natural disasters.

Solving Austin, Texas’s traffic problem

Running advanced traffic simulations to improve the models used to plan infrastructure and traffic control changes

New possible treatments for Parkinson’s

Extensive calculations performed on a supercomputer helped researchers learn more about the evolution of the protein structure.

Copyright © 2015, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others.

The “Free Lunch” is over, really

Processor clock rate growth halted around 2005

Source: © 2014, James Reinders, Intel, used with permission

Software must be parallelized to realize all the potential performance

Changing Hardware Impacts Software
More cores, more threads, wider vectors

Processor                                                                  Cores  Threads  SIMD width (bits)
Intel® Xeon® processor 64-bit                                              1      2        128
Intel® Xeon® processor 5100 series                                         2      2        128
Intel® Xeon® processor 5500 series                                         4      8        128
Intel® Xeon® processor 5600 series                                         6      12       128
Intel® Xeon® processor code-named Sandy Bridge EP                          8      16       256
Intel® Xeon® processor code-named Ivy Bridge EP                            12     24       256
Intel® Xeon® processor code-named Haswell EP                               18     36       256
Intel® Xeon Phi™ coprocessor code-named Knights Corner                     61     244      512
Intel® Xeon Phi™ processor & coprocessor code-named Knights Landing [1]    60+

*Product specification for launched and shipped products available on ark.intel.com.
[1] Not launched or in planning.

High performance software must be both:
. Parallel (multi-thread, multi-process)
. Vectorized

Data Driven Vectorization Design
Intel® Advisor XE – Vectorization Advisor

Have you:
. Recompiled with AVX2, but seen little benefit?
. Wondered where to start adding vectorization?
. Recoded intrinsics for each new architecture?
. Struggled with cryptic compiler vectorization messages?

Breakthrough for vectorization design
. What vectorization will pay off the most?
. What is blocking vectorization and why?
. Are my loops vector friendly?
. Will reorganizing data increase performance?
. Is it safe to just use pragma simd?

More Performance, Fewer Machine Dependencies

The Right Data At Your Fingertips
Get all the data you need for high impact vectorization

. Filter by which loops are vectorized!
. What prevents vectorization?
. Focus on hot loops
. What vectorization issues do I have?
. Which vector instructions are being used?
. How efficient is the code?

4 Steps to Efficient Vectorization
Intel® Advisor XE – Vectorization Advisor
1. Compiler diagnostics + Performance Data + SIMD efficiency information
2. Guidance: detect problem and recommend how to fix it
3. Loop-Carried Dependency Analysis
4. Memory Access Patterns Analysis

Efficiently Vectorize your code
Intel® Advisor XE – Vectorization Advisor

Background on loop vectorization

A typical vectorized loop consists of:

Main vector body
• Fastest among the three!
• This is where we want our loops to be executing!

Optional peel part
• Used for the unaligned references in your loop. Uses scalar or slower vector instructions.

Remainder part
• Due to the number of iterations (trip count) not being divisible by the vector length. Uses scalar or slower vector instructions.
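The three parts can be pictured with a hand-written scalar sketch. It assumes a vector length of 8 floats and a 32-byte alignment target; the function name `scale_in_three_parts` and both constants are illustrative only, and a real compiler generates this structure by itself rather than requiring it in the source.

```c
#include <stddef.h>
#include <stdint.h>

#define VLEN 8          /* assumed vector length in floats (256-bit registers) */
#define ALIGN_BYTES 32  /* assumed alignment wanted by the main vector body */

/* Multiplies a[0..n) by s and records how many iterations each part handled:
   counts[0] = peel, counts[1] = main body, counts[2] = remainder. */
void scale_in_three_parts(float *a, size_t n, float s, size_t counts[3])
{
    size_t i = 0;
    counts[0] = counts[1] = counts[2] = 0;

    /* Peel: scalar iterations until a[i] sits on an aligned address. */
    while (i < n && ((uintptr_t)(a + i) % ALIGN_BYTES) != 0) {
        a[i] *= s;
        ++i;
        ++counts[0];
    }

    /* Main body: whole groups of VLEN iterations. The inner loop stands in
       for a single vector instruction. */
    for (; i + VLEN <= n; i += VLEN) {
        for (size_t lane = 0; lane < VLEN; ++lane)
            a[i + lane] *= s;
        counts[1] += VLEN;
    }

    /* Remainder: iterations left over when the trip count is not a
       multiple of VLEN. */
    for (; i < n; ++i) {
        a[i] *= s;
        ++counts[2];
    }
}
```

For a 32-byte-aligned buffer `buf`, calling this on `buf + 1` with 20 elements spends 7 iterations in the peel, 8 in the main body and 5 in the remainder, so more than half of the work runs outside the fast path.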

Larger vector registers mean more iterations in peel/remainder
• Make sure you align your data (and tell the compiler it is aligned)!
• Make the number of iterations divisible by the vector length!

Vectorizing problems solved!

Get your code to vectorize
• For unvectorized loops, discover what prevents code from being vectorized and get tips on how to vectorize it.

Increase Performance
• For vectorized loops, measure their performance efficiency and get tips on how to increase it.
• For both vectorized and unvectorized loops, explore how the memory layout and data structures can be made more vector-friendly.

Quickly categorize your loops

1. Vectorizable, but not vectorized loops
• Require some minimal program changes such as OpenMP 4.x to enable compiler-driven SIMD parallelism
2. Vectorized loops whose performance could be improved
• Low-hanging optimization techniques
3. Vectorized loops whose performance was limited by data layout
• Requiring code refactoring to further speed up execution
4. Vectorized loops that performed well
5. All other cases (including non-vectorizable kernels)

Vectorizable, but not vectorized loops

Our hottest loop falls into this category.
• In the “Why no vectorization” column, Vectorization Advisor reports that “loop iteration count could not be computed”.
• Vectorization Advisor gives details on how to fix the issue under “Compiler Diagnostic Detail”.
• The advice is to use #pragma omp simd to vectorize the loop.

After adding #pragma omp simd to our loop:
• The loop is now vectorized at 100% efficiency.
• It moves to category #4: it is now a well-performing loop.
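A minimal sketch of this kind of fix, under assumptions: the `Field` type and `scale_field` function are hypothetical, and whether a particular compiler reports exactly this diagnostic for such a loop is not guaranteed. The loop bound sits behind a pointer, so the sketch reads it once into a local and then asks for vectorization with the OpenMP 4.0 pragma.

```c
#include <stddef.h>

/* Hypothetical data structure: the loop bound lives behind a pointer. */
typedef struct {
    int   *count;   /* number of valid elements, stored elsewhere */
    float *values;
} Field;

void scale_field(Field *f, float s)
{
    /* Reading the bound once into a local makes the trip count explicit. */
    const int n = *f->count;

    /* Ask the compiler to vectorize. The programmer takes responsibility
       for the loop being free of loop-carried dependencies. */
    #pragma omp simd
    for (int i = 0; i < n; ++i)
        f->values[i] *= s;
}
```

The pragma only takes effect when OpenMP SIMD support is enabled at compile time, for example with `-fopenmp-simd` on GCC and Clang or `-qopenmp-simd` on the Intel compiler; otherwise it is ignored and the loop stays as written.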

Assumed dependencies present
• More data is needed to confirm whether the loop can be vectorized.
• Select the given loop and run dependency analysis to see if it can be vectorized using an OpenMP* 4.0 simd pragma.

Data Dependencies – Tough Problem #1
Is it safe to force the compiler to vectorize?

Data dependencies

for (i = 1; i < n; i++)      // bound n is assumed; i starts at 1 so A[i-1] stays in bounds
    A[i] = A[i-1] * C[i];    // each iteration reads the result of the previous one:
                             // we need the ability to check whether it is
                             // safe to force the compiler to vectorize!
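Why forcing vectorization is unsafe for a loop like this can be shown with a small emulation. This is a sketch, not what any compiler literally emits: `VLEN` is an assumed vector length of 4, and `recurrence_forced_vector` imitates a vector unit by loading all inputs of a group before storing any result.

```c
#include <stddef.h>

#define VLEN 4  /* assumed vector length for the illustration */

/* Reference: scalar execution, iteration i sees the value written by i-1. */
void recurrence_scalar(float *A, const float *C, size_t n)
{
    for (size_t i = 1; i < n; ++i)
        A[i] = A[i - 1] * C[i];
}

/* Emulates forced vectorization: each group of up to VLEN iterations loads
   all of its A[i-1] inputs before any result is stored, so later lanes
   read stale values. */
void recurrence_forced_vector(float *A, const float *C, size_t n)
{
    for (size_t i = 1; i < n; i += VLEN) {
        float loaded[VLEN];
        size_t lanes = (n - i < VLEN) ? n - i : VLEN;

        for (size_t l = 0; l < lanes; ++l)      /* "vector load" */
            loaded[l] = A[i + l - 1];
        for (size_t l = 0; l < lanes; ++l)      /* "vector multiply and store" */
            A[i + l] = loaded[l] * C[i + l];
    }
}
```

With A initialized to five ones and C to five twos, the scalar version produces 1, 2, 4, 8, 16, while the emulated vector version produces 1, 2, 2, 2, 2. A dependency analysis is meant to catch exactly this difference before the pragma is added.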

Check if it is Safe to Vectorize
Loop-carried dependencies analysis verifies correctness

. Select loop for Correct Analysis and press play!
. Vector Dependence prevents Vectorization!

Vectorized loops whose performance could be improved by low-hanging optimization techniques

Our top 2 hot loops now fall into this category. They are vectorized, but with an issue being reported:
• Ineffective peel/remainder loop
• Top loop just has a peel loop and no vector body.
• Next loop has a vector body, peel and remainder loop. Only 27% efficient.
• Recommendation #1 – Align data to remove the peel loop.
• Recommendation #2 – Add data padding so that the loop trip count is divisible by the vector length.

Code transformations
• Align data
• Tell the compiler the data is aligned
• Pad data structure
• Add pragma to tell the compiler what the trip count is

Peel and remainder disappear. 99% efficiency!

Performance improved
• Now 3.98x speedup as opposed to 1.09x (relative to scalar)
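The transformations listed above might look roughly like the following portable C11 sketch. The constants, `padded_length`, `alloc_padded` and `add_arrays` are illustrative names, not taken from the presentation. Alignment is communicated with the OpenMP `aligned` clause; the Intel compiler also offers hints such as `__assume_aligned` and `#pragma loop_count`, which are left out here to keep the sketch compiler-neutral.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define ALIGN_BYTES 64  /* assumed alignment, suits 512-bit vectors */
#define VLEN 16         /* assumed vector length in floats */

/* Rounds n up to the next multiple of VLEN so the trip count divides evenly. */
size_t padded_length(size_t n)
{
    return (n + VLEN - 1) / VLEN * VLEN;
}

/* Allocates a zero-filled, aligned, padded float array.
   The caller releases it with free(). */
float *alloc_padded(size_t n)
{
    /* padded_length() * 4 bytes is always a multiple of ALIGN_BYTES,
       as aligned_alloc requires. */
    size_t bytes = padded_length(n) * sizeof(float);
    float *p = aligned_alloc(ALIGN_BYTES, bytes);
    if (p)
        memset(p, 0, bytes);
    return p;
}

/* Adds b to a over the padded length. The aligned clause tells the compiler
   which alignment it may assume, so no peel loop is needed; the padded trip
   count leaves no remainder. */
void add_arrays(float *restrict a, const float *restrict b, size_t padded_n)
{
    #pragma omp simd aligned(a, b : 64)
    for (size_t i = 0; i < padded_n; ++i)
        a[i] += b[i];
}
```

Padding trades a little extra memory and a few wasted iterations for a loop that runs entirely in the main vector body; the padding elements are zero-filled so that processing them is harmless.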

Vectorized loops whose performance was limited by data layout

Improve Vectorization
Memory Access pattern analysis

. Select loops of interest
. Run Memory Access Patterns analysis to check how memory is used in the loop and the called function

Constant-stride access can often be transformed to unit-stride access by converting an array of structures (AoS) to a structure of arrays (SoA).
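A small sketch of the layout change, with hypothetical types `PointAoS` and `PointsSoA` and an illustrative fixed size `N`. In the AoS layout consecutive `x` values are three floats apart; in the SoA layout they are contiguous.

```c
#include <stddef.h>

#define N 8  /* illustrative array size */

/* Array of structures: reading every x touches memory with a stride of 3 floats. */
typedef struct { float x, y, z; } PointAoS;

/* Structure of arrays: all x values are contiguous (unit stride). */
typedef struct { float x[N], y[N], z[N]; } PointsSoA;

/* Copies n points (n <= N) from the AoS layout into the SoA layout. */
void aos_to_soa(const PointAoS *in, PointsSoA *out, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        out->x[i] = in[i].x;
        out->y[i] = in[i].y;
        out->z[i] = in[i].z;
    }
}

float sum_x_aos(const PointAoS *p, size_t n)
{
    float s = 0.0f;
    for (size_t i = 0; i < n; ++i)
        s += p[i].x;        /* constant-stride loads */
    return s;
}

float sum_x_soa(const PointsSoA *p, size_t n)
{
    float s = 0.0f;
    for (size_t i = 0; i < n; ++i)
        s += p->x[i];       /* unit-stride loads */
    return s;
}
```

Both sum functions compute the same value; only the memory access pattern differs, which is what a memory access patterns analysis would report.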


Data Driven Threading Design
Intel® Advisor XE – Thread Prototyping

Have you:
. Tried threading an app, but seen little performance benefit?
. Hit a “scalability barrier”? Performance gains level off as you add cores?
. Delayed a release that adds threading because of synchronization errors?

Breakthrough for threading design:
. Quickly prototype multiple options
. Project scaling on larger systems
. Find synchronization errors before implementing threading
. Separate design and implementation - Design without disrupting development

Add Parallelism with Less Effort, Less Risk and More Impact

Part of Intel® Parallel Studio for Windows* and Linux*
From $1,599
http://intel.ly/advisor-xe

Design Then Implement

Intel® Advisor XE Thread Prototyping

Design Parallelism
. No disruption to regular development
. All test cases continue to work
. Tune and debug the design before you implement it

1) Analyze it.
2) Design it. (Compiler ignores these annotations.)
3) Tune it.
4) Check it.

Implement Parallelism
5) Do it!
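The "Design it" step relies on source annotations that mark a proposed parallel region without changing how the program runs. The sketch below uses the annotation macro names as they appear in Intel Advisor's `advisor-annotate.h` header; since that header is only available with the tool, empty stand-in definitions are supplied here so the example compiles on its own. The function `sum_of_squares` and the site and task names are hypothetical.

```c
/* Stand-in no-op definitions so the sketch compiles without Intel Advisor.
   With the tool installed, include its advisor-annotate.h header instead. */
#ifndef ANNOTATE_SITE_BEGIN
#define ANNOTATE_SITE_BEGIN(name)
#define ANNOTATE_ITERATION_TASK(name)
#define ANNOTATE_SITE_END()
#endif

/* Hypothetical hot loop marked as a candidate parallel site.
   The code still executes serially, so existing tests keep passing. */
long sum_of_squares(const int *v, int n)
{
    long total = 0;

    ANNOTATE_SITE_BEGIN(sum_site);            /* proposed parallel region */
    for (int i = 0; i < n; ++i) {
        ANNOTATE_ITERATION_TASK(square_task); /* each iteration is one task */
        total += (long)v[i] * v[i];
    }
    ANNOTATE_SITE_END();

    return total;
}
```

Note that `total` is updated by every iteration. In a real parallel version that would be a data-sharing problem, which is the kind of issue the "Check it" step is meant to surface while the code is still serial.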

Less Effort, Less Risk, More Impact

Survey
Where should I add parallelism?

Locate where parallelism will have high impact for your application
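Why the location matters can be estimated with Amdahl's law, $S = 1 / ((1 - p) + p / N)$, where $p$ is the fraction of run time spent in the parallelized region and $N$ is the number of cores. The sketch below is only this textbook formula with a hypothetical function name; the tool's own modeling takes more factors into account.

```c
/* Amdahl's law: upper bound on speedup with `cores` cores when
   `parallel_fraction` of the serial run time is parallelized perfectly. */
double amdahl_speedup(double parallel_fraction, int cores)
{
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / cores);
}
```

On 16 cores, parallelizing a region that accounts for 90% of the run time gives at most a 6.4x speedup, while a region that accounts for 10% gives only about 1.1x. Hot regions are where the effort pays off.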

Check Suitability
Is it fast enough?

Experiment with modeling by changing:
. Number of tasks
. Task duration
. Runtime modeling
. Threading model
. Target system

Instantly see impact on scalability
Quickly Evaluate Design Alternatives

Check Correctness

Any data sharing issues?

Are they easy to fix?

Will adding locks wipe out the performance gain?
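A typical data-sharing issue and a lock-free fix can be sketched as follows. The function `parallel_sum` is hypothetical. Every iteration updates the same accumulator, which would be a data race if the iterations ran on several threads; protecting the update with a lock would serialize the loop, whereas an OpenMP reduction gives each thread a private copy and combines the copies at the end.

```c
#include <stddef.h>

/* Sums v[0..n). The reduction clause resolves the sharing of `total`
   without a lock. */
long parallel_sum(const int *v, size_t n)
{
    long total = 0;

    #pragma omp parallel for reduction(+ : total)
    for (long i = 0; i < (long)n; ++i)
        total += v[i];

    return total;
}
```

Compiled without OpenMP support the pragma is ignored and the loop runs serially with the same result, so the same source serves both the serial and the threaded build.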

Quickly Evaluate Design Alternatives

Conclusion/Summary

Modernizing your Code
To get the most out of your hardware, you need to modernize your code with vectorization and threading. Taking a methodical approach such as the one outlined in this presentation, and taking advantage of the powerful tools in Intel® Parallel Studio XE, can make the modernization task dramatically easier.

Download Today!
Intel® Parallel Studio XE 2016

Vectorization is a tough problem
It is decomposable into tractable steps
Get help at each step:
. Find the best opportunities
. Improve vectorization effectiveness
. Assure correctness

Download Today
Google: “Intel Parallel Studio 2016”
Or go directly to: https://software.intel.com/en-us/articles/intel-parallel-studio-xe-2016-beta

Intel® Parallel Studio XE
Faster code faster!

. Vectorizing Compiler: Squeeze all the performance out of the latest instruction set
. Threaded Performance Libraries: Pre-vectorized, pre-threaded, pre-optimized
. High Level Parallel Models: Productive solutions for thread, process & vector parallelism
. Parallel Performance Profilers: Quickly discover bottlenecks and tune for high performance
. Thread Debugger: Find and debug non-deterministic threading errors
. Vectorization Optimization and Thread Prototyping: Data driven design tools help you vectorize & thread effectively

Additional Resources
All links start with: https://software.intel.com/

Vectorization Guide: https://software.intel.com/articles/a-guide-to-auto-vectorization-with-intel-c-compilers/

Explicit Vector Programming in Fortran: https://software.intel.com/articles/explicit-vector-programming-in-fortran

Optimization Reports: https://software.intel.com/videos/getting-the-most-out-of-the-intel-compiler-with-new-optimization-reports

Beta Registration & Download: https://software.intel.com/en-us/articles/intel-parallel-studio-xe-2016-beta

For Intel® Xeon Phi™ coprocessors, but also applicable:
https://software.intel.com/en-us/articles/vectorization-essential
https://software.intel.com/en-us/articles/fortran-array-data-and-arguments-and-vectorization

Intel® Composer XE User and Reference Guides:
https://software.intel.com/compiler_15.0_ug_c
https://software.intel.com/compiler_15.0_ug_f

Compiler User Forums: http://software.intel.com/forums


Legal Disclaimer & Optimization Notice

INFORMATION IN THIS DOCUMENT IS PROVIDED “AS IS”. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.

Copyright © 2015, Intel Corporation. All rights reserved. Intel, Pentium, Xeon, Xeon Phi, Core, VTune, and the Intel logo are trademarks of Intel Corporation in the U.S. and other countries.

Optimization Notice Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice. Notice revision #20110804
