ORNL/TM-2016/391

Effective Vectorization with OpenMP 4.5

Joseph Huber
Oscar Hernandez
Graham Lopez

Approved for public release. Distribution is unlimited.

March 2017

DOCUMENT AVAILABILITY

Reports produced after January 1, 1996, are generally available free via US Department of Energy (DOE) SciTech Connect.

Website: http://www.osti.gov/scitech/

Reports produced before January 1, 1996, may be purchased by members of the public from the following source:

National Technical Information Service
5285 Port Royal Road
Springfield, VA 22161
Telephone: 703-605-6000 (1-800-553-6847)
TDD: 703-487-4639
Fax: 703-605-6900
E-mail: [email protected]
Website: http://classic.ntis.gov/

Reports are available to DOE employees, DOE contractors, Energy Technology Data Exchange representatives, and International Nuclear Information System representatives from the following source:

Office of Scientific and Technical Information
PO Box 62
Oak Ridge, TN 37831
Telephone: 865-576-8401
Fax: 865-576-5728
E-mail: [email protected]
Website: http://www.osti.gov/contact.html

This report was prepared as an account of work sponsored by an agency of the United States Government. Neither the United States Government nor any agency thereof, nor any of their employees, makes any warranty, express or implied, or assumes any legal liability or responsibility for the accuracy, completeness, or usefulness of any information, apparatus, product, or process disclosed, or represents that its use would not infringe privately owned rights. Reference herein to any specific commercial product, process, or service by trade name, trademark, manufacturer, or otherwise, does not necessarily constitute or imply its endorsement, recommendation, or favoring by the United States Government or any agency thereof. The views and opinions of authors expressed herein do not necessarily state or reflect those of the United States Government or any agency thereof.

This study was performed as part of the SULI program sponsored by the Department of Energy at Oak Ridge National Laboratory, managed by UT-Battelle for DOE. This research used resources of the Oak Ridge Leadership Computing Facility at the Oak Ridge National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC05-00OR22725.

ORNL/TM-2016/391

Computer Science and Mathematics Division

Effective Vectorization with OpenMP 4.5 Usage of OpenMP 4.5 SIMD Directives

Joseph Huber Oscar Hernandez Graham Lopez

March 2017

Prepared by
OAK RIDGE NATIONAL LABORATORY
Oak Ridge, TN 37831-6283
managed by
UT-Battelle, LLC
for the
US DEPARTMENT OF ENERGY
under contract DE-AC05-00OR22725

CONTENTS

TABLE OF CONTENTS ...... iv
LIST OF FIGURES ...... v
LIST OF TABLES ...... vi
ACKNOWLEDGEMENTS ...... vii
ABSTRACT ...... 1
1. Introduction ...... 1
2. Vectorization ...... 2
   2.1 SIMD Instructions ...... 5
       2.1.1 Data Movement ...... 6
       2.1.2 Vector Conditional Masking ...... 8
       2.1.3 Vector Constants ...... 9
   2.2 Memory Alignment ...... 10
   2.3 AVX512 Additions ...... 11
       2.3.1 Mask Registers ...... 12
       2.3.2 Embedded Broadcasting ...... 13
       2.3.3 Compress and Expand ...... 13
       2.3.4 Conflict Detection ...... 13
       2.3.5 Reciprocals and Exponentials ...... 14
   2.4 Vector Instructions ...... 14
       2.4.1 SAXPY ...... 14
       2.4.2 Gather ...... 15
       2.4.3 Reduction ...... 16
       2.4.4 Linear ...... 17
       2.4.5 Conditional Statement ...... 18
3. OpenMP SIMD ...... 19
   3.1 SIMD loop directives ...... 19
       3.1.1 SIMD aligned ...... 19
       3.1.2 SIMD reduction ...... 20
       3.1.3 SIMD safelen ...... 20
       3.1.4 SIMD collapse ...... 20
       3.1.5 SIMD linear ...... 21
       3.1.6 SIMD private / lastprivate ...... 21
   3.2 SIMD declare directives ...... 21
       3.2.1 SIMD declare aligned ...... 22
       3.2.2 SIMD declare simdlen ...... 22
       3.2.3 SIMD declare uniform ...... 22
       3.2.4 SIMD declare linear ...... 22
       3.2.5 SIMD declare inbranch / notinbranch ...... 23
   3.3 SIMD Block-Level directives ...... 23
       3.3.1 ordered SIMD ...... 23
   3.4 Vendor Specific OpenMP SIMD directives ...... 23
       3.4.1 SIMD declare processor ...... 23
   3.5 SAXPY example ...... 24
4. OpenMP SIMD Programming Guidelines ...... 27
   4.1 Ensure SIMD execution legality ...... 27
   4.2 Vector Length and Alignment ...... 28
   4.3 OpenMP SIMD Functions ...... 28
   4.4 Memory Access Collapsing ...... 31
   4.5 Scalar Execution Inside SIMD Regions ...... 32
5. General SIMD Programming Guidelines ...... 33
   5.1 Data Size and Conversion ...... 33
   5.2 Memory Access ...... 34
   5.3 Integer Multiplication and Division by Constants ...... 36
   5.4 Conditional Statements ...... 36
6. HACCmk ...... 37
7. Conclusion ...... 38

LIST OF FIGURES

1  Layout of x86 vector registers ...... 2
2  Vertical SIMD addition between four packed elements ...... 5
3  Moving a single value into a vector register ...... 6
4  Shuffle operation on four packed elements ...... 7
5  Blend operation on four packed elements ...... 7
6  Unpacking the low elements on four packed elements ...... 7
7  Movement from the low elements to the high elements ...... 8
8  A conditional statement that returns X if true and 0 if false ...... 9
9  Aligning a memory location by 16 ...... 11
10 Masked vector addition ...... 12

LIST OF TABLES

1 Available x86 vector instruction sets ...... 3
2 Vector lengths for different vector registers and data types ...... 3
3 Memory alignment for each data type ...... 10
4 Available AVX512 vector instruction sets ...... 12
5 Supported target architectures for the processor clause ...... 24

ACKNOWLEDGEMENTS

• Special thanks to Michael Klemm and Xinmin Tian for reviewing the report and providing valuable feedback. Their comments and corrections helped us understand the rationale and possible implementations of several of the SIMD directives and clauses.
• HACCmk benchmark https://asc.llnl.gov/CORAL-benchmarks/#haccmk


ABSTRACT

This paper describes how the Single Instruction Multiple Data (SIMD) model and its extensions in OpenMP work, and how these are implemented in different compilers. Modern processors are highly parallel computational machines which often include multiple processors capable of executing several instructions in parallel. Understanding SIMD and executing instructions in parallel allows the processor to achieve higher performance without increasing the power required to run it. SIMD instructions can significantly reduce the runtime of code by executing a single operation on large groups of data. The SIMD model is so integral to the processor’s potential performance that, if SIMD is not utilized, less than half of the processor is ever actually used. Unfortunately, using SIMD instructions is a challenge in higher level languages because most programming languages do not have a way to describe them. Most compilers are capable of vectorizing code by using the SIMD instructions, but there are many code features important for SIMD vectorization that the compiler cannot determine at compile time. OpenMP attempts to solve this by extending the C/C++ and Fortran programming languages with compiler directives that express SIMD parallelism. OpenMP is used to pass hints to the compiler about the code to be executed in SIMD. This is a key resource for making optimized code, but it does not change whether or not the code can use SIMD operations. However, in many cases critical functions are limited by a poor understanding of how SIMD instructions are actually implemented, as SIMD can be implemented through vector instructions or simultaneous multi-threading (SMT).∗ We have found that it is often the case that code cannot be vectorized, or is vectorized poorly, because the programmer does not have sufficient knowledge of how SIMD instructions work.

1. Introduction

SIMD hardware units in the processor provide the functionality to process multiple data elements simultaneously. The SIMD model [13, 12, 6] is becoming increasingly essential for unleashing the power of modern processors and achieving higher performance. When a section of code makes efficient use of SIMD instructions, it is said to be vectorized. The amount of data that a processor needs to handle is continuously increasing across a wide range of applications, and in some cases that data can be processed simultaneously through the SIMD execution units.

Using SIMD instructions is challenging in higher level languages [5], and many languages do not have built-in mechanisms to allow the programmer to explicitly specify which sections of code should make use of SIMD instructions. The only alternatives are to either write assembly-level code, use compiler-specific low-level SIMD intrinsics, or rely on the compiler to vectorize the code automatically.

Auto-vectorization [1, 4, 7] can make use of the SIMD instructions if the compiler can detect loops or blocks of code that can be vectorized. However, auto-vectorization is only consistently successful on simple cases. The main challenge is that the compiler’s auto-vectorization relies on static analysis (e.g. with limited inter-procedural analysis, no dynamic information, etc.) to generate efficient code without runtime information. This causes problems for the performance consistency of automatic vectorization across compilers and platforms, since the quality of static analysis varies between compilers and between target platforms.

∗In this TR we will not cover how OpenMP SIMD is implemented using SMT on GPUs. Implementations using this approach are a work in progress.

In order to fully exploit the potential of SIMD hardware units, the OpenMP standard [8] and compiler vendors provide high-level SIMD programming extensions [10, 11, 3] to help users better utilize hardware vector units. OpenMP tries to solve these problems by providing a standardized way for programmers to specify information about the code in order to improve the compiler’s analysis. These hints can include information about vectorization-safe loop structure, as well as user-defined functions that are suitable for vectorization. To more effectively use these directives, it is important to understand how the SIMD model works and what extensions [9] are needed for future OpenMP specifications.

2. Vectorization

The term “vectorization” refers to when multiple pieces of data are packed into a single, larger data type. This can be thought of as a one-dimensional array of fixed size. The boundaries of the array are determined by the data type packed in the array. SIMD instructions work on this array of packed data by modifying the entire array with a single processor instruction.

This can be illustrated with a simple example. Consider the set of eight integer numbers {34, 19, 23, 8, 43, 23, 4, 30}. If we want the sum of these numbers, it would require seven additions, namely 34 + 19 + 23 + 8 + 43 + 23 + 4 + 30 = 184. Alternatively, consider the case where these numbers are grouped together to form two vectors: {34, 19, 23, 8} and {43, 23, 4, 30}. Each element is then vertically added to form a new vector, {77, 42, 27, 38}. Then this vector is reduced by horizontally adding its first two and last two elements, {77 + 42} and {27 + 38}, resulting in the vector {119, 65}. We then horizontally add the two vector components 119 + 65 to come up with the final sum of 184. If each of these vector operations is supported by a vector instruction set, this method will only require four SIMD instructions as opposed to seven individual additions. This example highlights the potential savings in terms of number of instructions when an operation can be aggregated using vectorization.
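As an illustration of this grouping, the same eight-element sum can be sketched with x86 compiler intrinsics (a minimal sketch assuming a compiler that provides <immintrin.h> and SSE2/SSSE3 support; it is not one of the report's own examples):

#include <immintrin.h>

/* Sum eight packed integers with one vertical add and two horizontal adds. */
int sum8(const int *v)
{
    __m128i lo = _mm_loadu_si128((const __m128i *)v);       /* {34, 19, 23,  8} */
    __m128i hi = _mm_loadu_si128((const __m128i *)(v + 4)); /* {43, 23,  4, 30} */
    __m128i s  = _mm_add_epi32(lo, hi);   /* vertical add   -> {77, 42, 27, 38} */
    s = _mm_hadd_epi32(s, s);             /* horizontal add -> {119, 65, ...}   */
    s = _mm_hadd_epi32(s, s);             /* horizontal add -> {184, ...}       */
    return _mm_cvtsi128_si32(s);          /* extract the low element: 184       */
}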

In hardware, SIMD instructions use a large register in the processor called a vector register. A register is simply a small on-chip storage location used by the computer processor to do calculations. There are currently four sizes of vector registers available on modern x86 processors: the ZMM, YMM, XMM, and MM registers.∗ Other computer architectures that support SIMD instructions, such as PowerPC or ARM, have their own sets of vector registers. The vector instruction set supported by the processor determines which of these registers can be used. Table 1† lists the available x86 vector instruction sets along with their main functionality. The operating system must also have explicit support for the vector registers; otherwise their contents could be changed during context switches. On machines that support multiple vector registers, the smaller vector registers are the lower-order bits of the larger registers, as shown in Figure 1.‡

512 bits: ZMM
256 bits: YMM (the low half of ZMM)
128 bits: XMM (the low half of YMM)

Figure 1. Layout of x86 vector registers

∗The MM registers use the same register space as the x87 floating point unit.
†AMD’s 3DNow! instruction set is obsolete and unsupported on modern processors.
‡On x86 architectures, these flags can be checked using the cpuid instruction or by reading /proc/cpuinfo on Linux machines.

Instruction Set | Name                           | Functionality                      | Year
MMX             | Multimedia Extensions          | 64 bit MMX for packed integers     | 1997
SSE             | Streaming SIMD Extensions      | 128 bit XMM for floating point     | 1999
SSE2            | Streaming SIMD Extensions 2    | XMM supports doubles and integers  | 2001
SSE3            | Streaming SIMD Extensions 3    | Horizontal operations added        | 2004
SSSE3           | Supplemental SSE3              | Horizontal and data movement       | 2006
SSE4.1          | Streaming SIMD Extensions 4.1  | Extra functionality                | 2007
SSE4.2          | Streaming SIMD Extensions 4.2  | Vector string instructions         | 2008
AVX             | Advanced Vector Extensions     | 256 bit YMM for floating point     | 2011
FMA             | Fused Multiply Add             | Fused multiply add instructions    | 2011
AVX2            | Advanced Vector Extensions 2   | YMM supports packed integers       | 2013
AVX512          | Advanced Vector Extensions 512 | 512 bit ZMM registers              | 2016

Table 1. Available x86 vector instruction sets

The amount of data that can be operated on simultaneously depends on the size of the data as well as the size of the vector register holding it. The number of elements that can fit into a vector register is called the vector length; these values are given in Table 2 in terms of C data types. The processor has a different set of instructions for working with each packed data type. This means the type of data that can be vectorized depends on the instructions the processor supports. For example, the SSE2 and AVX2 instruction sets allow the XMM and YMM registers, respectively, to work on integers. The MM registers cannot perform floating point instructions. Additionally, long doubles currently cannot be used with any vector register.§

Register     | long double | double | float | long | int | short | char
64 bit MM    | −           | −      | −     | 1    | 2   | 4     | 8
128 bit XMM  | −           | 2      | 4     | 2    | 4   | 8     | 16
256 bit YMM  | −           | 4      | 8     | 4    | 8   | 16    | 32
512 bit ZMM  | −           | 8      | 16    | 8    | 16  | 32    | 64

Table 2. Vector lengths for different vector registers and data types

SIMD instructions allow the same operation to be performed simultaneously on multiple pieces of data. This allows a large set of independent operations to be broken down into a smaller set of packed data that can be handled in parallel. This parallelism allows large loops to execute in a fraction of the time. The effect vectorization has on a loop is similar to unrolling the loop by a certain factor. Consider a loop where each element of two arrays, A and B, is added to the other.

float A[SIZE], B[SIZE];
for (int i = 0; i < SIZE; i++) {
    A[i] += B[i];
}

Each iteration of this loop performs the same independent addition operation between multiple pieces of data. These qualities allow this loop to be easily vectorized. The benefits of vectorization can be seen if the loop is unrolled by a certain factor. If the processor’s architecture supports the SSE instruction set, the processor can make use of 128-bit XMM vector registers. Referring to Table 2, an XMM register is able to operate on four floats simultaneously. So, the loop can be unrolled by a factor of four. A remainder loop is

§Some compilers alias long doubles as standard doubles

required to calculate the final iterations of the loop if the number of iterations is not divisible by the unrolling factor.

float A[SIZE], B[SIZE];
int i;
for (i = 0; i < SIZE - 4; i += 4) {
    A[i]   += B[i];
    A[i+1] += B[i+1];
    A[i+2] += B[i+2];
    A[i+3] += B[i+3];
}
for (; i < SIZE; i++) {
    A[i] += B[i];
}

By unrolling the loop by a factor of four, the number of iterations of the loop has been cut to a quarter of the original. However, the number of instructions has not really changed. The loop can be vectorized by packing the sets of values {A[i], A[i + 1], A[i + 2], A[i + 3]} and {B[i], B[i + 1], B[i + 2], B[i + 3]} each into an XMM vector register. A vertical floating point addition can then be performed between the two XMM vector registers and stored in another vector register. The destination vector register now contains the values {A[i] + B[i], A[i + 1] + B[i + 1], A[i + 2] + B[i + 2], A[i + 3] + B[i + 3]}. The entire vector register is then written back to memory with a single instruction.

float A[SIZE], B[SIZE];
int i;
for (i = 0; i < SIZE - 4; i += 4) {
    simd_add_ps(&A[i], &B[i]);
}
for (; i < SIZE; i++) {
    A[i] += B[i];
}

Now, only a single instruction is executed in each iteration of the unrolled loop. This effectively cuts the number of instructions the original loop executed to one fourth. The vectorized version of this loop will now execute approximately four times faster than the scalar (non-vector) loop. Unfortunately, many high level languages do not have built-in support for SIMD instructions. To use these instructions, the programmer must use intrinsic functions, like the one used in this example, to execute assembly code. Because of this, many programmers rely on the compiler’s auto-vectorization optimization pass. The goal of the compiler’s auto-vectorization pass is to transform the original loop into the vectorized loop automatically.

Vectorization is essential to high-performance applications. The speedup achieved in the previous example can easily be increased by using wider vector registers. If YMM vector registers are used, the vectorized loop would execute eight times faster than the scalar loop. Furthermore, if ZMM vector registers are used, the vectorized loop would then be sixteen times faster than the scalar loop. This effect is increased further when multi-threading is considered. If the example loop was executed using sixteen threads, the loop would ideally execute sixteen times faster. If each of these threads uses 512-bit ZMM registers, the vectorized and multi-threaded loop would then execute 256 times faster than the scalar, single-threaded loop. This is an impressive speedup. However, a comprehensive understanding of how the SIMD model is

implemented is crucial to fully utilize the performance capabilities of SIMD instructions. The following sections will discuss SIMD instructions on x86 processors in detail. Even though the following sections focus on the x86 architecture, most of the concepts are common across architectures.

2.1 SIMD Instructions

SIMD instructions are hardware instructions that modify the vector registers. Integer SIMD instructions can work on 8 bit, 16 bit, 32 bit, or 64 bit operands. In the C language, these correspond to the char, short, int, and long data types respectively. Most of the basic arithmetic operations can be performed on packed integer data: addition, subtraction, multiplication, and, or, xor, bit-shifting, and negation, among others. There is no instruction for packed integer division or modulus. SIMD division or modulus by a constant can, however, be implemented as a series of multiplications and bit-shifting instructions. Additionally, packed bytes have no shifting instruction or multiplication instruction. The vector instruction set supported by the processor determines if these instructions are available. Most integer instructions are available from the SSE2 instruction set for XMM registers and the AVX2 instruction set for YMM registers.

Vector registers can be operated on vertically or horizontally. A vertical operation works by performing the instruction between each of the corresponding elements of two vector registers; this can be thought of as X[i] + Y[i]. These operations are called vertical because if both vector registers are placed above each other, the two elements operated on are covered by a vertical line. This can be seen in Figure 2. Conversely, horizontal operations are performed between a vector register and itself; this can instead be thought of as X[i] + X[i + 1]. Vertical operations are most common in vectorized code because they are similar to unrolling a loop. Horizontal vector operations are typically used for vector reductions where the vector register is reduced to a single value.

X0      X1      X2      X3
 +       +       +       +
Y0      Y1      Y2      Y3
 ↓       ↓       ↓       ↓
X0+Y0   X1+Y1   X2+Y2   X3+Y3

Figure 2. Vertical SIMD addition between four packed elements

SIMD floating point instructions can work on 32 bit or 64 bit floating point values, corresponding to the float and double C data types.¶ Floating point instructions are more complicated than integer instructions. They can be up to 3–4x slower than their integer equivalents. Functionally, they behave similarly to integer instructions: operations are performed vertically between each element individually, or horizontally within the vector register itself. Additionally, there are floating point instructions for square roots and division, among others. Integer SIMD instructions and floating point SIMD instructions should not be mixed because integers and floating point numbers use different data formats. Packed floating point values can be converted to packed integers if needed.

The floating point instruction sets also provide so-called fused multiply add instructions. These allow the processor to calculate the result of a multiplication and an addition in a single

¶There is a single instruction that converts packed 16 bit floats to 32 bit floats

instruction. There are several different versions of these instructions to calculate (X ∗ Y + Z), (X ∗ Y − Z), (−X ∗ Y + Z), and (−X ∗ Y − Z). Additionally, there are other instructions for alternating additions or subtractions. These instructions are available with the FMA instruction set. Some future instruction sets may extend this functionality so it may be used on integers as well.

The floating point instruction sets also include scalar instructions that only modify the first 32 or 64 bits of the vector register. This is now the standard way floating point arithmetic is done on x86 machines, and it is handled in a much more straightforward way than previously with the old x87 floating point stack. This way of handling floating point instructions is more similar to other computer architectures like PowerPC or ARM, where floating point values are stored in separate registers but treated like integer registers. The x87 stack is now only used for handling long doubles. Although these scalar instructions are not SIMD instructions, they modify only the least significant bits of the vector register and leave the rest unchanged. Therefore, these instructions can sometimes be used within vectorized code.
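For illustration, the fused multiply add described above can be expressed with FMA compiler intrinsics (a sketch assuming <immintrin.h> and a processor supporting the FMA and AVX instruction sets; not one of the report's examples):

#include <immintrin.h>

/* d[i] = x[i]*y[i] + z[i] for eight floats, using one fused instruction per vector. */
void fma8(float *d, const float *x, const float *y, const float *z)
{
    __m256 vx = _mm256_loadu_ps(x);
    __m256 vy = _mm256_loadu_ps(y);
    __m256 vz = _mm256_loadu_ps(z);
    _mm256_storeu_ps(d, _mm256_fmadd_ps(vx, vy, vz));  /* (x * y) + z */
}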

2.1.1 Data Movement

The amount of time spent moving data between vector registers is often what determines whether using SIMD instructions results in a performance increase or not; if more time is spent moving data between the vector registers than on calculations, scalar code will usually out-perform vectorized code. Data can be moved into a vector register from memory, a general purpose register, or another vector register. When loading from a general purpose register, only the low 32 or 64 bits are used, whereas loads from memory can load the entire vector, the low 32 bits, or the low 64 bits of the vector. For example, if a packed load from memory was performed on an array of 32 bit integers into a 128 bit register starting at A[j], the vector register would contain [A[j], A[j + 1], A[j + 2], A[j + 3]]. Loading a vector register with a single 32 bit value can be seen in Figure 3.

X       −       −       −
Move    Move    Move    Move
Y0      Y1      Y2      Y3
↓       ↓       ↓       ↓
X       Y1      Y2      Y3

Figure 3. Moving a single value into a vector register

Data moved between vector registers or memory can commonly be shuffled, blended, or unpacked, among other operations. A shuffle instruction takes the source vector, rearranges its elements, and stores the result. This is controlled by an immediate operand that selects which element of the source vector register is placed into which element of the destination vector register. For a 32 bit shuffle on a 128 bit register, this is done with an 8 bit immediate value. Each set of two bits, ranging from zero to three, chooses which element in the source vector register to place in the destination vector register. A shuffle with imm = 0 copies the first element in the source to every location in the destination. This is called broadcasting and is commonly used for storing constants in vector registers. The AVX instruction set introduced an explicit broadcasting instruction that behaves identically. A shuffling instruction between four packed elements requires an 8 bit immediate value to choose between them. This can be seen in Figure 4. The AVX instruction set also introduced an instruction that is effectively a shuffle controlled by a vector mask rather than an immediate value. An issue with data movement instructions is that often when the size of the vector

6 register is increased, the immediate value becomes insufficient to describe each element. The result is several instructions that do essentially the same operation with slightly different operands.

X0             X1             X2             X3
Shuffle        Shuffle        Shuffle        Shuffle
imm8(0:1)      imm8(2:3)      imm8(4:5)      imm8(6:7)
↓              ↓              ↓              ↓
X[imm8(0:1)]   X[imm8(2:3)]   X[imm8(4:5)]   X[imm8(6:7)]

Figure 4. Shuffle operation on four packed elements
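The broadcast-by-shuffle idiom described above can be sketched with compiler intrinsics (a sketch assuming <immintrin.h>; the report itself presents such examples in assembly):

#include <immintrin.h>

/* Copy the first packed float of v into every element, i.e. a shuffle with
   an immediate of 0, which is how a constant is broadcast before AVX. */
__m128 broadcast_low(__m128 v)
{
    return _mm_shuffle_ps(v, v, 0);   /* {v0, v0, v0, v0} */
}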

Blending instructions are used to choose between the values contained in two different vector registers. This can be done either with an immediate value or with a bit-mask contained in a separate vector register. For a 32 bit blend with a 128 bit vector register, the first four bits of the immediate value determine which values are changed. An example blending operation can be seen in Figure 5.

X0            X1            X2            X3
Blend         Blend         Blend         Blend
Y0            Y1            Y2            Y3
If (imm8[0])  If (imm8[1])  If (imm8[2])  If (imm8[3])
↓             ↓             ↓             ↓
X0            X1            X2            X3
Else          Else          Else          Else
↓             ↓             ↓             ↓
Y0            Y1            Y2            Y3

Figure 5. Blend operation on four packed elements

Unpacking instructions work by taking the lower or upper halves of two vector registers and interleaving them. This operation does not require an immediate value. Unpacking instructions are typically used to combine the low bits of two vector registers into a single vector register. This allows data from multiple locations to be collected into a single vector register. This operation is shown in Figure 6.

X0          X1          X2          X3
Unpack low  Unpack low  Unpack low  Unpack low
Y0          Y1          Y2          Y3
↓           ↓           ↓           ↓
X0          Y0          X1          Y1

Figure 6. Unpacking the low elements on four packed elements

Data can be moved from the high half of the vector register to the low half or vice-versa. This operation only applies to floating point values. Because this operation is limited to floating point values it is less common than the other data movement instructions. However, since it is simply moving data it can theoretically be used on packed 32 bit or 64 bit integers. An example of this operation is shown in Figure 7. There is another set of instructions that behaves similarly to this instruction that is mostly used for exchanging data between the high 128 bits and low 128 bits of a YMM register.

Finally, there is some support for an insertion instruction. The insertion instruction places a value or vector register into an element of another vector register. This is controlled by an immediate operand that indicates which element of the destination vector register will be replaced. This instruction is commonly used to insert smaller vector registers into larger vector registers.

X0            X1            X2            X3
MoveLowHigh   MoveLowHigh   MoveLowHigh   MoveLowHigh
Y0            Y1            Y2            Y3
↓             ↓             ↓             ↓
X0            X1            Y0            Y1

Figure 7. Movement from the low elements to the high elements

The performance of SIMD instructions depends on how quickly data can be moved into the vector registers and stored back to memory. Situations where all of the data is stored in a contiguous one-dimensional array are ideal for SIMD instructions, but this is not always feasible in real-world applications. Data can be loaded or stored with a given stride in memory or completely irregularly. When data is loaded from disjoint memory locations and collected into contiguous memory, it is called gathering. Likewise, when data is stored from contiguous memory into disjoint memory locations, it is called scattering. These instructions are essential to vectorization because vector registers can be seen as a contiguous array of fixed size. If data is stored in disjoint memory, it must be gathered into a vector register and then scattered back to memory. The AVX2 instruction set introduced a gathering instruction capable of loading a vector register from a set of memory locations whose indices are stored in another vector register.

// Gathering
for i from 0 to 3
    xmm[i] = X[index[i]];

// Scattering
for i from 0 to 3
    X[index[i]] = xmm[i];

Gathering and scattering are an important part of vectorized code because not all applications access memory in a linear fashion. However, there is no way to explicitly express these instructions when relying on the compiler’s auto-vectorization. The compiler will attempt to gather and scatter the memory, but it can be inefficient in cases where the memory can be reused. The only way around this issue is to use vector intrinsics, template classes, or assembly language. Intrinsics are essentially functions that call inline assembly code. Template classes work in a similar fashion but attempt to make programming simpler by wrapping vector registers in objects and overloading operators between them to call the underlying assembly code.

2.1.2 Vector Conditional Masking

Conditional operations are very common in code. This is an issue for vectorized loops. Scalar code performs conditional statements by performing a comparison and conditionally jumping to another section of code depending on the result of the comparison. Vectorized loops execute multiple iterations of the loop

simultaneously. In this case, one iteration of the loop may require a conditional jump while another iteration of the loop will not. Because both of these iterations are happening at the same time, the processor is unable to execute two different sections of code simultaneously. This problem is solved by instead unconditionally executing the entire loop using SIMD comparison instructions. SIMD comparison instructions create a bit-mask in the vector register. A vector comparison will vertically compare each element in the vector register. If the result evaluates to true, then all the bits for that element in the vector register are set; otherwise, all the bits are cleared. This mask is then used to unconditionally calculate the result of the conditional statement for each iteration of the loop. To illustrate this, consider a loop with a simple if, else statement.

float X[SIZE], Y[SIZE], Z[SIZE];
for (int i = 0; i < SIZE; i++) {
    if (X[i] > Y[i]) {
        Z[i] = X[i];
    } else {
        Z[i] = 0.0f;
    }
}

The scalar version of this loop would perform a comparison on the values stored in X[i] and Y[i]. This comparison would set the flags to the result of the comparison. The processor would then check these flags and jump to one of the conditional sections of code. To vectorize this loop, the result of the conditional statement will need to be calculated without the use of conditional jumps. This can be done by generating a bit-mask from a SIMD comparison instruction. The bit-mask is then used to move the data into the vector register. This is done with a logical AND instruction because of the identity X&1 = X and X&0 = 0. An example of this operation can be seen in Figure 8.

X0          X1          X2          X3
<           <           <           <
Y0          Y1          Y2          Y3
↓           ↓           ↓           ↓
111...111   000...000   111...111   000...000
AND         AND         AND         AND
X0          X1          X2          X3
↓           ↓           ↓           ↓
X0          0           X2          0

Figure 8. A conditional statement that returns X if true and 0 if false
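One vector step of this loop can be sketched with SSE intrinsics to show the compare-and-AND idiom (a sketch assuming <immintrin.h>; X, Y, and Z each point to a group of four floats):

#include <immintrin.h>

/* Z[0..3] = (X[0..3] > Y[0..3]) ? X[0..3] : 0.0f, with no branches. */
void conditional_step(float *Z, const float *X, const float *Y)
{
    __m128 x    = _mm_loadu_ps(X);
    __m128 y    = _mm_loadu_ps(Y);
    __m128 mask = _mm_cmpgt_ps(x, y);        /* all ones where X > Y, else zero */
    _mm_storeu_ps(Z, _mm_and_ps(mask, x));   /* X where true, 0.0f where false  */
}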

2.1.3 Vector Constants

Vector registers cannot load constant values like general purpose registers can. Vector registers can only be loaded with values from a general purpose register or a memory location. If a constant is known at compile-time it will be placed in static memory and loaded into the vector register from that memory address. However, if a constant is only known at runtime, it will need to be broadcast into the vector register. This is done by loading the low bits of the vector register from a general purpose register and then performing a shuffle operation with the immediate set to 0, or with a broadcast instruction (cf. Sec. 2.1.1). This causes every location in the destination register to contain the element loaded from the general purpose register. Some SIMD instructions can be used to create constants without using memory and wasting cache space. Performing a logical XOR between a vector register and itself will cause the entire vector register to be set to zero. A vector comparison for equality performed between a vector register and itself will produce a vector register with all bits set. This value can then be shifted to create other constants.

Data Type     | Alignment
char          | 1
short         | 2
int           | 4
long          | 8
float         | 4
double        | 8
long double   | 16
XMM register  | 16
YMM register  | 32
ZMM register  | 64

Table 3. Memory alignment for each data type

2.2 Memory Alignment

Memory alignment is an important consideration for writing high performance code. Memory alignment refers to the divisibility of the memory address of a value stored in memory. If a variable’s memory address is divisible by n, it is said to be aligned by n or on an n byte boundary. So, for example, if a variable x was stored at the memory address 0x37FD10, it would be aligned on a 1, 2, 4, 8, and 16 byte boundary. Each data type should ideally be stored at a memory location aligned by the smallest power of two that can contain the size of the data type. These values are given in Table 3. Reading or writing unaligned memory is slower than aligned memory and on many computer architectures will result in a segmentation fault, so memory should always be aligned, if possible.

Accessing unaligned memory is slower because of how memory is read from the cache. A cache is organized as a set of cache lines that each store a block of contiguous memory. The x86 processor can read or write an unaligned memory address without any penalty if the access falls within a single cache line. However, if the memory access crosses a cache line boundary, the processor actually accesses the memory from two cache lines and performs a shifting operation to combine the two accesses. When a cache line is split by a memory access, it is usually about three times slower than a standard memory access. Memory is aligned by a power of two so it evenly divides a cache line. Additionally, if the memory address is aligned, some processors are capable of forwarding memory writes directly to a subsequent read without using the cache.

Memory alignment is especially important when accessing memory with SIMD instructions. SIMD

instructions can read or write a large amount of data in a single instruction. If the base memory address is unaligned, then a greater number of memory accesses will split a cache line. When the vector register becomes larger, the ratio of standard reads to cache line splits decreases and the penalty becomes more severe. For example, if data was read sequentially from a misaligned memory address with a 64 byte cache line size, there would be a cache line split every four reads for an XMM register, every other read for a YMM register, and every read for a ZMM register. If a cache line split is three times slower than an access contained in the cache line, the average speed penalty will be 50% for an XMM register, 100% for a YMM register, and 200% for a ZMM register.

Memory alignment is also a requirement for several SIMD instructions. Prior to the AVX instruction set, almost every SIMD instruction that used a memory location required memory alignment. If the memory location was not aligned it would cause a segmentation fault. If the data is guaranteed to be aligned, these instructions can be safely used. The AVX instruction set relaxed this requirement for several instructions, but others still require the address to be aligned. However, this does not change the fact that unaligned memory accesses are slow.

The compiler will automatically align most data known at compile time on a 32 or 16 byte boundary. Dynamically allocated memory is not guaranteed to be aligned, so it should be aligned by the memory allocator or the programmer. Some compilers offer the ability to specify memory alignment as well as dynamic memory allocation that returns aligned pointers. Aligning memory to a power of two can be done by setting the least significant bits of the address to zero. This can be accomplished with a logical AND instruction with a constant in which every bit is set except the last n bits, aligning the address to 2^n. This constant is most easily generated with a negative number. To ensure that there is enough memory after the alignment, the programmer should allocate n − 1 extra bytes when aligning by n. An example of aligning a memory location by 16 bytes is given in Figure 9. A simple function can be made to align dynamically allocated memory. If this function is used, a copy of the original pointer should be saved so the memory can be deallocated.

Binary                                     Hexadecimal
0101 1111 0011 1010 1001 0001 0100 1110    0x5F3A914E
AND
1111 1111 1111 1111 1111 1111 1111 0000    −0x10 (−16)
↓
0101 1111 0011 1010 1001 0001 0100 0000    0x5F3A9140

Figure 9. Aligning a memory location by 16
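A minimal sketch of such an alignment helper is shown below (align_ptr is a hypothetical name; it assumes the caller over-allocates by align − 1 bytes and keeps the original pointer for deallocation):

#include <stdint.h>
#include <stdlib.h>

/* Round a pointer up to the next 'align' byte boundary (align must be a
   power of two). Adding align - 1 first keeps the result inside an
   allocation that was padded with align - 1 extra bytes. */
static void *align_ptr(void *p, uintptr_t align)
{
    return (void *)(((uintptr_t)p + align - 1) & ~(align - 1));
}

/* Usage:
       void  *raw = malloc(SIZE * sizeof(float) + 15);
       float *A   = align_ptr(raw, 16);   // A is 16 byte aligned
       ...
       free(raw);                         // free the original pointer
*/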

2.3 AVX512 Additions

The AVX512 instruction set is the newest vector instruction set for the x86 architecture. The AVX512 instruction set extends the 256 bit YMM registers into 512 bit ZMM registers, allowing twice as much data to be handled in parallel when compared with YMM registers and four times as much when compared with XMM registers. Additionally, much of the functionality of the previous vector instruction sets has been changed. These changes will further extend the potential performance of vectorized code. The AVX512 instruction set consists of several smaller instruction sets, shown in Table 4. Currently, only the Intel Knights Landing (KNL) processor supports any of the AVX512 instruction sets. The Knights Corner uses a different instruction set that also uses 512 bit vector registers, but that instruction set is likely to be replaced by the AVX512 instruction set.

Instruction Set | Name                        | Functionality
AVX512F         | Foundation                  | Fundamental additions with AVX512
AVX512CD        | Conflict Detection          | Conflict detection for vector stores
AVX512ER        | Exponential and Reciprocal  | Fast reciprocal approximations
AVX512PF        | Prefetch Instructions       | Prefetching for gather and scatter
AVX512BW        | Byte and Word               | Instructions for packed 8 bit and 16 bit
AVX512DQ        | Doubleword and Quadword     | Additional instructions
AVX512VL        | Vector Length               | Support for smaller AVX512 instructions
AVX512IFMA      | Integer Fused Multiply Add  | Integer FMA instructions
AVX512VBMI      | Vector Bit Manipulation     | Additional bitwise instructions

Table 4. Available AVX512 vector instruction sets

2.3.1 Mask Registers

Previously, vector instructions were masked by performing bitwise operations between vector registers that had been loaded with bit-masks (cf. Sec. 2.1.2). The AVX512F instruction set adds a new set of separate mask registers that now perform this operation. Additionally, these mask registers can be used in many more ways than a traditional bit-mask. Mask registers allow the result of a SIMD operation to be controlled on an element by element basis. Mask registers are loaded with a set of bits that correspond to the elements of a vector register. For example, the first bit of the mask corresponds to the first element of the vector register, and so forth. This mask register can then be applied to most SIMD operations. When a mask register is applied to a destination register, it determines whether or not an element will be updated. This is shown in Figure 10.

X0     X1     X2     X3     X4     X5     X6     X7
 +      +      +      +      +      +      +      +
Y0     Y1     Y2     Y3     Y4     Y5     Y6     Y7
mask:   1      1      0      0      0      0      1      1
 ↓      ↓      ↓      ↓      ↓      ↓      ↓      ↓
X0+Y0  X1+Y1  X2     X3     X4     X5     X6+Y6  X7+Y7

Figure 10. Masked vector addition

The use of mask registers allows for several performance improvements in vectorized code. Mask registers completely block the corresponding element of the vector register from being used. This saves power and prevents exceptions from being thrown, such as segmentation faults or division by zero. Mask registers also allow remainder loops to be removed. The remaining number of iterations can now be executed simultaneously using a mask register. Additionally, the use of mask registers allows more registers to be used for calculations rather than holding bit-masks.
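A sketch of a masked operation using AVX512 intrinsics is shown below (assuming <immintrin.h> and AVX512F support; this is an illustration rather than one of the report's examples):

#include <immintrin.h>

/* z[i] = x[i] + y[i] only where x[i] > y[i]; the other elements of z are
   left unchanged because the mask blocks them from being updated. */
void masked_add16(float *z, const float *x, const float *y)
{
    __m512    vx = _mm512_loadu_ps(x);
    __m512    vy = _mm512_loadu_ps(y);
    __m512    vz = _mm512_loadu_ps(z);
    __mmask16 m  = _mm512_cmp_ps_mask(vx, vy, _CMP_GT_OQ); /* one bit per element */
    vz = _mm512_mask_add_ps(vz, m, vx, vy); /* add only where the mask bit is set */
    _mm512_storeu_ps(z, vz);
}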

2.3.2 Embedded Broadcasting

Broadcasting instructions are important for vectorized code. Often a constant value must be placed into each element of a vector register. This is traditionally done with a shuffling instruction or a broadcasting instruction. The AVX512 instruction set allows broadcasting instructions to be encoded in the instruction itself. 32 bit, 64 bit, or 128 bit values can be broadcasted from a general purpose register or memory. Embedded broadcasting removes the need for the additional broadcasting instruction as well as the need for additional vector registers to hold the broadcasted value.

2.3.3 Compress and Expand

Compression and expansion are important for reducing the memory requirements of a given program. Compression allows the programmer to select disjoint elements in an array and store them in contiguous memory. Expansion has the opposite effect: a contiguous set of memory can be stored into disjoint elements of another array. Previously, both compression and expansion have been difficult to implement with SIMD instructions because they require conditionally gathering or scattering data. The AVX512 instruction set introduced two instructions for compression and expansion. These instructions use the mask registers to select the disjoint elements that the data will be placed into or taken from. This allows compression loops such as the following to be easily performed with a few instructions using the AVX512 instruction set.

int j = 0;
for (int i = 0; i < SIZE; i++) {
    if (A[i] > 0) {
        B[j] = A[i];
        j++;
    }
}

2.3.4 Conflict Detection

The AVX512 instruction set includes a conflict detection instruction. This instruction returns a bit-mask indicating whether there are any duplicate elements inside the vector register. The returned bit-mask can then be used to handle the conflict. Conflict detection is used to check if a scattering instruction is going to write to the same memory location more than once. If multiple writes are performed to the same memory location, the result could be incorrect. Consider a simple histogram that counts the number of times each value stored in an array A occurs.

for (int i = 0; i < SIZE; i++)
    hist[A[i]]++;

If this loop was vectorized, it would require the values in A to be gathered, incremented, and scattered back to memory. However, if there are any duplicate values within the vector register, the histogram will only be incremented a single time. Conflict detection allows many previously problematic loops to be efficiently vectorized.

2.3.5 Reciprocals and Exponentials

Reciprocal calculations can be approximated much faster than they can be computed exactly. Previously, there were only a few instructions that used this approximation, and they applied only to floats. The AVX512 instruction set introduces several new approximate reciprocal instructions that can be used on both floats and doubles. Additionally, the level of approximation can be chosen. These instructions allow more code to use fast approximations of complex mathematical operations, increasing overall performance at the cost of precision.

2.4 x86 Vector Instructions

Vectorization is accomplished by using a set of SIMD processor instructions with vector registers. Source code can be vectorized if and only if the compiler can manage to use these instructions and registers. Because of this restriction, SIMD parallelism is often more difficult to achieve than other forms of parallelism. These restrictions must be kept in mind when relying on the compiler’s auto-vectorization. The compiler is only able to efficiently vectorize code if it is already written with vectorization in mind. This will require some knowledge of assembly language. This section will describe how some loops written in a high level language can be vectorized by the compiler in assembly language. These examples will be given using Intel’s x86 assembly syntax.‖ The instructions are given in the form opcode dest, src or opcode dest, src1, src2, and memory accesses are enclosed in brackets, similar to using the ’*’ operator in C. All memory accesses are byte-addressed. Registers named XMM, YMM, or ZMM are vector registers, and registers named k0-k7 are the mask registers used in AVX512. The other registers, such as ax, bx, or cx, are general purpose registers. A prefix determines the size of a general purpose register: an ’r’ indicates a 64 bit register and an ’e’ indicates a 32 bit register. All sizes given in these examples will be assumed to be divisible by the vector length, and all pointers will be assumed to be aligned, to simplify the code. Additionally, some variable names or memory locations in the C code will be preserved in the assembly code.

2.4.1 SAXPY

The first example is a simple SAXPY loop, which multiplies each element of X by a single precision constant A and adds the corresponding element of Y.

float X[SIZE], Y[SIZE], A;
for (int i = 0; i < SIZE; i++) {
    X[i] = A * X[i] + Y[i];
}

This loop can be easily vectorized using a SIMD floating point multiplication and addition. If the SSE instruction set is supported on the processor, then referring to Table 2, the loop can process four floating point values simultaneously. To vectorize this loop, the compiler will likely generate assembly code similar to this example.

‖Compilers on UNIX machines typically use AT&T syntax by default.

movss   xmm0, [A]           ; Move single float A into XMM0
shufps  xmm0, xmm0, 0       ; Broadcast A into XMM0
xor     rcx, rcx            ; i = 0
saxpy_loop:
movaps  xmm1, [X + rcx*4]   ; XMM1 = X[i], X[i+1], X[i+2], X[i+3]
mulps   xmm1, xmm0          ; XMM1 = A*X[i], A*X[i+1], A*X[i+2], A*X[i+3]
addps   xmm1, [Y + rcx*4]   ; XMM1 = A*X[i]+Y[i], A*X[i+1]+Y[i+1], ...
movaps  [X + rcx*4], xmm1   ; Store XMM1
add     rcx, 4              ; i += 4
cmp     rcx, SIZE           ; Compare i and SIZE
jl      saxpy_loop          ; Jump if i < SIZE

Today, nearly all Intel processors support the SSE instruction set. However, newer processors can also use the AVX instruction set. If the AVX instruction set is used, then eight floating point values can be calculated simultaneously. The AVX instruction set also supports an explicit broadcasting instruction, so the load and shuffle used in the SSE version of the loop can be replaced with a single instruction. Additionally, if the processor also supports the FMA instruction set, the processor can use fused multiply add instructions to further reduce the instruction count.

vbroadcastss ymm0, [A]                  ; Broadcast A into YMM0
xor          rcx, rcx                   ; i = 0
saxpy_loop:
vmovaps      ymm1, [X + rcx*4]          ; Load 8 packed floats into YMM1
vfmadd213ps  ymm1, ymm0, [Y + rcx*4]    ; FMA by source numbers: 2*1 + 3
vmovaps      [X + rcx*4], ymm1          ; Store 8 packed floats from YMM1
add          rcx, 8                     ; i += 8
cmp          rcx, SIZE                  ; Compare i and SIZE
jl           saxpy_loop                 ; Jump if i < SIZE

2.4.2 Gather

When data is gathered, it is taken from disjoint locations in memory and stored in contiguous memory. This operation is common when working with look-up tables or histograms.

int A[SIZE], tab[TABLE_SIZE];
for (int i = 0; i < SIZE; i++) {
    A[i] = tab[A[i]];
}

This operation is troublesome for vectorization because it requires a large amount of data movement. If the SSE instruction set is used then the data must be loaded individually and then combined into a vector register. This is almost certainly slower than a scalar implementation of this loop.

xor       rcx, rcx              ; i = 0
gather_loop:
movsx     rax, [A + rcx*4]      ; Sign extend 32 bit A[i] to 64 bit
movsx     rbx, [A + rcx*4 + 4]  ; rbx = A[i + 1]
movsx     rdx, [A + rcx*4 + 8]  ; rdx = A[i + 2]
movsx     rsi, [A + rcx*4 + 12] ; rsi = A[i + 3]
movd      xmm0, [tab + rax*4]   ; xmm0 = [1, 0, 0, 0]
movd      xmm1, [tab + rbx*4]   ; xmm1 = [2, 0, 0, 0]
movd      xmm2, [tab + rdx*4]   ; xmm2 = [3, 0, 0, 0]
movd      xmm3, [tab + rsi*4]   ; xmm3 = [4, 0, 0, 0]
punpckldq xmm0, xmm2            ; xmm0 = [1, 3, 0, 0]
punpckldq xmm1, xmm3            ; xmm1 = [2, 4, 0, 0]
punpckldq xmm0, xmm1            ; xmm0 = [1, 2, 3, 4]
movdqa    [A + rcx*4], xmm0     ; Store 4 packed ints from xmm0
add       rcx, 4                ; i += 4
cmp       rcx, SIZE             ; Compare i and SIZE
jl        gather_loop           ; Jump if i < SIZE

Because this operation is so slow, loops involving look-up tables were typically not vectorized. The AVX2 instruction set introduced a gather instruction to solve this problem. The gather instruction uses a pointer offset in a general purpose register and a vector register containing packed indices. These memory locations are then gathered into the destination vector register. Additionally, a mask is used to conditionally gather elements, if needed.

xor        rcx, rcx                     ; i = 0
lea        rsi, [Tab]                   ; rsi = &Tab[0]
gather_loop:
vmovdqa    ymm0, [A + rcx*4]            ; Get 8 packed indices from A
vpcmpeqd   ymm1, ymm1, ymm1             ; Get mask of all '1'
vpxor      ymm2, ymm2, ymm2             ; Zero ymm2
vpgatherdd ymm2, [rsi + ymm0*4], ymm1   ; Gather from Tab into ymm2
vmovdqa    [A + rcx*4], ymm2            ; Store 8 gathered ints from ymm2
add        rcx, 8                       ; i += 8
cmp        rcx, SIZE                    ; Compare i and SIZE
jl         gather_loop                  ; Jump if i < SIZE

2.4.3 Reduction

A reduction loop is common in many parallel applications. Reduction is performed by collecting several values into a single value. This is commonly done with addition or subtraction.

int A[SIZE], sum;
for (int i = 0; i < SIZE; i++) {
    sum += A[i];
}

Vector reduction is performed by using the vector register to perform several partial sums concurrently. Then after the loop is completed the partial sums are aggregated into a final sum. This is done either using horizontal additions or shifting and addition.

pxor    xmm0, xmm0          ; Zero vector register
xor     rcx, rcx            ; i = 0
reduction_loop:
paddd   xmm0, [A + rcx*4]   ; Collect partial sums
add     rcx, 4              ; i += 4
cmp     rcx, SIZE           ; Compare i and SIZE
jl      reduction_loop      ; Jump if i < SIZE

phaddd  xmm0, xmm0          ; xmm0 = [1+2, 3+4, 1+2, 3+4]
phaddd  xmm0, xmm0          ; xmm0 = [1+2+3+4, ...]
movd    eax, xmm0           ; eax = 1+2+3+4

If the loop is implemented using wider vector registers, some additional instructions will be required to aggregate the larger value. However, this cost is insignificant compared to the benefit of performing more iterations of the loop in parallel.

vpxor      ymm0, ymm0, ymm0         ; Zero vector register
xor        rcx, rcx                 ; i = 0
reduction_loop:
vpaddd     ymm0, ymm0, [A + rcx*4]  ; Collect partial sums
add        rcx, 8                   ; i += 8
cmp        rcx, SIZE                ; Compare i and SIZE
jl         reduction_loop           ; Jump if i < SIZE

vphaddd    ymm0, ymm0, ymm0         ; Reduce within the 128 bit lanes
vphaddd    ymm0, ymm0, ymm0         ; [1+2+3+4, ., ., ., 5+6+7+8, ., ., .]
vperm2i128 ymm1, ymm0, ymm0, 1      ; Move high 128 bits to ymm1
vpaddd     ymm0, ymm0, ymm1         ; (1+2+3+4) + (5+6+7+8)
vmovd      eax, xmm0                ; eax = 1+2+3+4+5+6+7+8

2.4.4 Linear

A variable with a linear relation is simply a variable that is increased by a certain value each iteration of the loop.

int A[SIZE];
for (int i = 0; i < SIZE; i++) {
    A[i] = i;
}

This can be vectorized by using two vector registers: one holding a set of initial values and one holding the linear step. If this loop was vectorized using the SSE2 instruction set, then according to Table 2, four iterations of the loop can be performed in parallel. In this case, the set of initial values stored in the vector register will be {0, 1, 2, 3} and the linear step will be {4, 4, 4, 4}. The linear step can either be stored in memory or broadcasted.

movdqa  xmm0, [INITIAL]     ; xmm0 = {0, 1, 2, 3}
movdqa  xmm1, [STEP]        ; xmm1 = {4, 4, 4, 4}
xor     rcx, rcx            ; i = 0
linear_loop:
movdqa  [A + rcx*4], xmm0   ; Store 4 packed ints from xmm0
paddd   xmm0, xmm1          ; Increment each value by the step
add     rcx, 4              ; i += 4
cmp     rcx, SIZE           ; Compare i and SIZE
jl      linear_loop         ; Jump if i < SIZE

2.4.5 Conditional Statement

Conditional statements are common inside loops and can be difficult to vectorize. This loop will simply conditionally assign some values.

float A[SIZE], B[SIZE], X[SIZE], Y[SIZE], Z[SIZE];
for (int i = 0; i < SIZE; i++) {
    if (A[i] < B[i]) {
        Z[i] = X[i];
    } else {
        Z[i] = Y[i];
    }
}

This conditional statement can be vectorized with the use of masking operations. A vector comparison will vertically compare each value and create a bit-mask. This mask will be used to unconditionally calculate the entire loop. If the comparison is true, then the value will be saved in the vector register; otherwise it is discarded. The results of both branches are then combined at the end.

xor     rcx, rcx            ; i = 0
conditional_loop:
movaps  xmm0, [A + rcx*4]   ; Move A[i] into xmm0
cmpltps xmm0, [B + rcx*4]   ; Get mask for (A[i] < B[i])
movaps  xmm1, [X + rcx*4]   ; Move X[i] into xmm1
andps   xmm1, xmm0          ; X[i] if true, 0 if false
andnps  xmm0, [Y + rcx*4]   ; Y[i] if false, 0 if true
orps    xmm1, xmm0          ; Combine both results
movaps  [Z + rcx*4], xmm1   ; Store 4 floats from xmm1
add     rcx, 4              ; i += 4
cmp     rcx, SIZE           ; Compare i with SIZE
jl      conditional_loop    ; Jump if i < SIZE

This same conditional statement can be used with the AVX512 instruction set’s masking registers. The result of the vector comparison will instead be placed inside a mask register. The mask register will then dictate which elements of the destination register get updated. This method massively simplifies the process.

xor      rcx, rcx                   ; i = 0
conditional_loop:
vmovaps  zmm0, [A + rcx*4]          ; Move A[i] into zmm0
vcmpltps k1, zmm0, [B + rcx*4]      ; Get mask into mask register
vmovaps  zmm0{k1}, [X + rcx*4]      ; Move true values into zmm0
knotd    k1, k1                     ; Invert mask
vmovaps  zmm0{k1}, [Y + rcx*4]      ; Move false values into zmm0
vmovaps  [Z + rcx*4], zmm0          ; Store 16 floats from zmm0
add      rcx, 16                    ; i += 16
cmp      rcx, SIZE                  ; Compare i with SIZE
jl       conditional_loop           ; Jump if i < SIZE

3. OpenMP SIMD

Expressing SIMD parallelism is difficult outside of assembly code. Most compilers have auto-vectorization optimization passes that attempt to make use of the SIMD instructions but will often fail to vectorize code, or cannot safely vectorize the code without additional information from the programmer. OpenMP is a directive-based application programming interface for shared memory and accelerator based systems. In OpenMP 4.0, SIMD directives were added to help compilers generate efficient vector code. The SIMD directives explicitly enable vectorization in the compiler and act as hints or instructions sent to the compiler’s vectorization pass to improve its analysis and the quality of the vector code being generated. These directives also help to scope which statements in the code the compiler should attempt to vectorize. OpenMP’s SIMD directives can be placed before a for loop or a function declaration.

3.1 SIMD loop directives

OpenMP SIMD directives can be placed above for loops with the syntax #pragma omp simd [clause[[,] clause] ...], which marks the loop as a SIMD enabled loop or SIMD region. The additional clauses then pass hints to the compiler to improve the quality of the code or otherwise indicate its eligibility to be vectorized. This allows the programmer to specify certain SIMD instructions and provide information to the compiler that it cannot determine through static analysis alone, such as specifying that a loop has no dependences across iterations or a specific dependence distance. OpenMP loop directives only apply to for or do loops that are in a canonical form, where the number of iterations is known when entering the loop. If these conditions are met, then the loop is in a form that can be vectorized. Loops that contain flow control such as return, break, or continue statements cannot be vectorized. OpenMP SIMD loop directives guarantee that the loop will execute multiple iterations of the loop using vector registers when possible. However, this vectorization can be hindered when the loop requires scalar instructions. In this case, the compiler will perform the scalar instruction multiple times and pack the results into a vector register.
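For example, the array-addition loop from Section 2 can be marked as a SIMD region:

float A[SIZE], B[SIZE];

#pragma omp simd
for (int i = 0; i < SIZE; i++) {
    A[i] += B[i];
}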

3.1.1 SIMD aligned

#pragma omp simd aligned([ptr] : [alignment], . . . )

Data alignment is important for SIMD instructions. Unaligned memory accesses are slower than aligned memory accesses when they cross a cache-line boundary. Additionally, some SIMD instructions can only be used with aligned memory addresses. However, the compiler often cannot determine the alignment properties of data that is linked from other files or that is dynamically allocated. The aligned clause asserts to the compiler that a variable is aligned. Each pointer in the aligned clause can have a positive integer alignment applied to it. If no alignment value is given, an implementation-defined default value is assumed. Using this clause allows the compiler to safely use SIMD instructions that have strict alignment requirements. If this clause is used, the programmer is responsible for ensuring that the data is in fact aligned. Otherwise, the attempted use of aligned memory accesses on unaligned memory may result in segmentation faults.∗

∗We noticed that GCC implements this clause differently: the pointers are given in a comma-separated list and the alignment is applied to all of them.
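As an illustrative sketch (not from the original report), the loop below assumes X and Y were allocated with 32-byte alignment using the C11 aligned_alloc function from <stdlib.h> (assuming SIZE is a multiple of eight so the requested size is a multiple of the alignment), and asserts that alignment to the compiler:

float *X = aligned_alloc(32, SIZE * sizeof(float));  /* 32-byte aligned storage */
float *Y = aligned_alloc(32, SIZE * sizeof(float));

#pragma omp simd aligned(X, Y : 32)  /* programmer guarantees 32-byte alignment */
for (int i = 0; i < SIZE; i++) {
    Y[i] = Y[i] + X[i];
}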

3.1.2 SIMD reduction

#pragma omp simd reduction([operation] : [variable],. . . )

The reduction clause instructs the compiler to perform a vector reduction on a variable. A reduction operation is performed by computing a partial value inside the parallel region. When the parallel region ends, the partial values are aggregated into the final value. The reduction clause does this by creating a private vector copy of the variable inside the SIMD loop which is used to store the partial values. When the SIMD region ends, the vector copy of the original variable is horizontally aggregated, and the final value is moved from the vector copy to the original variable. The reduction clause takes an operator representing the type of reduction performed in the loop and a variable to be reduced inside the loop. The variable specified in the reduction clause is made private to the loop. Some compilers have difficulties detecting reductions automatically, so specifying them with the OpenMP reduction clause can give the compiler the information it needs to vectorize a loop.
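A minimal sketch of the clause (using hypothetical arrays A and B) is a dot product: the partial sums accumulate in a private vector copy of sum, which is combined horizontally when the loop ends.

float sum = 0.0f;
#pragma omp simd reduction(+ : sum)  /* sum is privatized and combined at the end */
for (int i = 0; i < SIZE; i++) {
    sum += A[i] * B[i];
}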

3.1.3 SIMD safelen

#pragma omp simd safelen([value])

Loop carried data dependencies hinder parallel programming. If an iteration of the loop requires data that is only calculated in a previous iteration of the loop, it has a data dependency. Because vectorized loops perform multiple iterations of the loop concurrently, the result may change if a data dependency is present in the vectorized loop. The safelen clause guarantees that no two iterations of the SIMD loop will execute simultaneously if the distance between them in the iteration space is greater than the value specified in the safelen clause. The actual number of loop iterations that will be performed in a single iteration of the SIMD loop is implementation defined, but it will not exceed the value specified in the safelen clause. In some cases, this clause is required for correctness, because the compiler may otherwise choose a vector length that violates the data dependency. If this value is specified, the compiler may also be more aggressive when unrolling loops because it is guaranteed that doing so will not change the result.
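For example (a sketch of the case revisited in Section 4.1, with hypothetical arrays A and B), a loop that reads a value produced four iterations earlier can safely execute at most four iterations at a time:

float A[SIZE], B[SIZE];
#pragma omp simd safelen(4)  /* dependence distance is 4, so limit concurrency to 4 */
for (int i = 4; i < SIZE; i++) {
    A[i] = A[i - 4] + B[i];  /* reads a value written 4 iterations earlier */
}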

3.1.4 SIMD collapse

#pragma omp simd collapse([value])

OpenMP SIMD directives can normally only be applied to the innermost loop in a chain of nested loops. The collapse clause causes the compiler to attempt to collapse the number of loops specified by the collapse clause into a single loop with a single iteration space. This collapsed loop then has the SIMD directives applied to it. The compiler will often collapse loops by creating a single loop that iterates over the total number of iterations of the originally nested loops. If the loop indices were used to address memory, the indices used in the collapsed loop will need to be recomputed using division and remainder operations, which may not be efficient for vectorization because it does not guarantee contiguous data access. However, compiler optimizations are possible to collapse the iteration space and the data accesses to make vectorization profitable. In this case, the memory does not need to be scattered or gathered into the vector registers and can simply be accessed in a single memory access operation (see Section 4.4).

3.1.5 SIMD linear

#pragma omp simd linear([variable] : [step], . . . )

When used in a SIMD loop context, the linear clause performs a linear incrementation on a variable using SIMD instructions. The linear clause takes an integer variable and adds the linear step to the variable each iteration of the loop. This is performed by creating two private vectors inside the SIMD loop. The first vector holds the linear sequence created by adding the step value to the initial value of the variable. The second vector contains the linear step that increments the previous vector. For example, if the linear clause has a linear variable N = 1 with a linear step of 2 and a vector width of four is used, then the first private vector would contain 1, 3, 5, 7. After another iteration of the SIMD loop the vector would then be 9, 11, 13, 15. This is done by adding a vector containing 8, 8, 8, 8 after each SIMD chunk of the loop.
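A small sketch (hypothetical arrays A and B, with A assumed to hold at least 2*SIZE elements): j advances by two every iteration, and the linear clause lets the compiler build the {j, j+2, j+4, j+6} index vector directly instead of treating j as a loop-carried dependency.

int j = 0;
#pragma omp simd linear(j : 2)  /* j equals its initial value plus 2 times the iteration number */
for (int i = 0; i < SIZE; i++) {
    B[i] = A[j];                /* strided read driven by the linear variable */
    j += 2;                     /* the loop body keeps j consistent with the clause */
}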

3.1.6 SIMD private / lastprivate

#pragma omp simd private([variable], . . . )

The private and lastprivate clauses control data privatization and sharing of variables for a SIMD loop. The private clause creates an uninitialized vector inside the SIMD loop for the given variables. SIMD function arguments are private by default because they must correspond to a vector register in the hardware. The lastprivate clause provides the same semantics but also copies the value produced by the last iteration out of the loop. These clauses allow the programmer to guarantee that no iterations of the SIMD loop overlap by making the listed variables private to each SIMD lane.
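A brief sketch of both clauses (hypothetical arrays A, B, and C): x is scratch storage that must not be shared between SIMD lanes, and last is copied out from the final iteration.

float x, last;
#pragma omp simd private(x) lastprivate(last)
for (int i = 0; i < SIZE; i++) {
    x = A[i] * B[i];   /* private scratch value, one copy per SIMD lane */
    last = x + 1.0f;   /* lastprivate: the value from the final iteration survives */
    C[i] = last;
}
/* after the loop, last holds the value computed for i == SIZE - 1 */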

3.2 SIMD declare directives

SIMD enabled functions can be declared by placing #pragma omp declare simd [clause[[,] clause] ...] above the function declaration. When a function is declared as a SIMD function, the compiler will generate multiple versions of the function that can be used depending on the context of the function call. If the function is called from a scalar loop, it will use a scalar version of that function; likewise, if the function is called from within a SIMD region it will use a vectorized version of the function. Additionally, there are versions of the function that will be called if the function call is inside of a conditional statement. Finally, when the vectorized versions of a function are created, they have their own set of vectorized arguments. The argument types are uniform, vector, and linear, and they describe the vector arguments passed to the function. This distinction is important when calling a SIMD declare function within another SIMD declare function: if the arguments are incompatible, the function cannot be vectorized. When a function is called in a SIMD enabled loop, the compiler will unroll the concurrent calls to that function. The compiler will then check if there are any valid SIMD enabled functions; a SIMD function is valid if it has a matching number of concurrent arguments and the uniform, vector, and linear argument types of the function are identical. The function itself is vectorized according to the uniform, linear, and vector arguments, and the compiler will unroll the SIMD function by the number of concurrent arguments it is expected to handle. If an argument is uniform or linear, the compiler knows this property at compile time and can exploit it inside the function. Functions called within conditional statements take an additional argument: the bit-mask generated from the conditional statement.

3.2.1 SIMD declare aligned

#pragma omp declare simd aligned([argument] : [alignment],. . . )

When used in a SIMD declare context, the aligned clause instructs the compiler that the pointers passed as function arguments are always going to be aligned by the given alignment value. The compiler will use this information when it creates the vectorized versions of the function by using the aligned versions of the SIMD instructions. The alignment value is a positive integer. When the aligned clause is absent, the default alignment is implementation defined. This clause should only be used if the pointers are known to be aligned, otherwise segmentation faults may occur.

3.2.2 SIMD declare simdlen

#pragma omp declare simd simdlen([value])

The simdlen clause specifies the number of packed arguments the vectorized function will execute concurrently. If this value is not specified, the compiler will use a default value. The value given to the simdlen clause should correspond to the vector length of a hardware vector register. For x86 processors, these values are given in Table 2. If the simdlen is given as a multiple of a hardware vector register size, the compiler may fuse multiple vector registers into a single logical vector. The importance of the simdlen clause can be seen in the case where nested function calls return different data sizes. For example, if a function that returns a double calls a function that returns an integer, it cannot be vectorized because the double function cannot provide enough data to the integer function to satisfy the simdlen requirements.

3.2.3 SIMD declare uniform

#pragma omp declare simd uniform([argument],. . . )

The uniform clause indicates that the given function argument will not change between any of the concurrent function calls in a SIMD loop. This indicates that the value is shared between the SIMD lanes of the loop. When a function argument is specified as uniform, the compiler can use this information to create more efficient code for the vectorized function. If the uniform clause is used for a pointer in conjunction with the linear clause, the compiler is able to use faster unit-stride memory instructions rather than gathering or scattering the data. This is because the combination of the linear and uniform clauses allows the compiler to know how the memory will be accessed when the function is executed concurrently.

3.2.4 SIMD declare linear

#pragma omp declare simd linear([argument] : [linearstep],. . . )

When used in a SIMD declare context, the linear clause indicates what the value of the function argument will be if the function is called multiple times concurrently. The argument placed in the linear clause will be increased by the linear step value between each successive function call. If a pointer is declared uniform and is accessed with a linear function argument, then the compiler can generate efficient vector memory accesses using the linear stride rather than gathering or scattering the data.

3.2.5 SIMD declare inbranch / notinbranch

#pragma omp declare simd inbranch / notinbranch

When a function is declared as a SIMD function, the compiler will create multiple versions of that function to be called under certain circumstances. If the function is called outside of a SIMD region, it will use a scalar version of that function. Likewise, if the function is called within a SIMD region, it will use a vectorized version of that function. Additionally, there are versions of the function that are used when the function is called within a conditional branch. Specifying that the SIMD declared function is notinbranch or inbranch simply instructs the compiler whether or not to create the versions of the function that handle calls from within a conditional branch. This improves the size of the generated code rather than its performance.
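As a sketch (hypothetical function names safe_inv and scale), inbranch keeps only the masked version of the function, which is the version needed when every call site sits under a condition:

#pragma omp declare simd inbranch  /* only the masked (conditional) version is generated */
float safe_inv(float x) {
    return 1.0f / x;
}

void scale(float *A, float *B) {
    #pragma omp simd
    for (int i = 0; i < SIZE; i++) {
        if (A[i] != 0.0f)          /* the call only ever happens inside a branch */
            B[i] = safe_inv(A[i]);
    }
}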

3.3 SIMD Block-Level directives

Block-level SIMD directives can be placed inside of a SIMD loop and affect a specified section of code. Currently, the OpenMP 4.5 standard supports only a single block-level directive for SIMD regions; no other OpenMP directives can be used inside of a SIMD region. Block-level directives are used by enclosing a structured block of code following the OpenMP directive.

3.3.1 ordered SIMD

#pragma omp ordered simd
structured block

When used inside of a SIMD region, the ordered simd construct causes the structured block of code to be executed using scalar instructions while maintaining the original order of the loop iterations. This allows instructions that cannot be vectorized, or operations that provide incorrect results under certain conditions, to be used inside of a SIMD region. However, even with the use of the ordered construct, flow control instructions such as break, continue, or return cannot be placed inside of a SIMD region. The ordered construct can drastically reduce the speed of a SIMD loop by preventing it from executing in parallel. It should only be used when scalar execution is required to produce a correct result and the loop still benefits from being vectorized.

3.4 Vendor Specific OpenMP SIMD directives

Some vendors have provided extensions to the OpenMP SIMD construct that are not in the OpenMP 4.5 standard. These directives are not portable between compilers but can be important for architecture or compiler specific optimizations. Currently, the Intel compiler supports a single clause for OpenMP SIMD declare directives.

3.4.1 SIMD declare processor

#pragma omp declare simd processor(string)

By default, the compiler will often create binaries that are compatible with much older computer architectures. Compiler options such as -x for Intel and -m for GCC instruct the compiler to target a specific computer architecture, allowing the compiler to generate code that is optimal for that processor. However, this does not affect the code generated inside of SIMD-enabled functions that have not been inlined by the compiler. The processor clause controls the instructions and vector registers used inside SIMD-enabled functions. Without this clause, the compiler will often default to 128-bit wide XMM registers rather than 256-bit wide YMM registers or 512-bit wide ZMM registers. The processor clause takes its input as a string that corresponds to a computer architecture; these strings are given in Table 5. This extension is only provided by the Intel compilers for x86 processors†.

Processor ID          ISA      Registers
pentium_4             SSE2     XMM
pentium_4_sse3        SSE3     XMM
core_2_duo_ssse3      SSSE3    XMM
core_2_duo_sse4_1     SSE4.1   XMM
core_i7_sse4_2        SSE4.2   XMM
core_2nd_gen_avx      AVX      YMM
core_3rd_gen_avx      AVX      YMM
core_4th_gen_avx      AVX2     YMM
mic                   KNC      ZMM
mic_avx512            AVX512   ZMM

Table 5. Supported target architectures for the processor clause

3.5 SAXPY example

Consider an example of a simple SAXPY loop. We will start with a scalar version of this program with no OpenMP directives. The SAXPY operation is placed in a separate function and called with each iteration of the loop. For this example, the saxpy function is not allowed to be inlined by the compiler.

void saxpy(float *X, float *Y, int i, float SA) {
    Y[i] = SA * X[i] + Y[i];
}

int main() {
    float X[SIZE], Y[SIZE], SA;
    for (int i = 0; i < SIZE; i++) {
        saxpy(X, Y, i, SA);
    }
}

Now let’s try to vectorize this code. The constant SA can be broadcast to each lane of the vector register. Then the packed floats are loaded, multiplied, added, and finally stored back into Y. The problem is that the compiler cannot vectorize the SAXPY operation because it is in a function call. So we can declare the

†Support for different strings depends on the version of the compiler.

SAXPY function as a SIMD function and place the main loop inside a SIMD region to tell the compiler to use the vectorized version of the SAXPY function. These examples were compiled using the Intel C Compiler version 16.0.3 and targeted the AVX2 instruction set. The code was run and timed on a Knights Landing processor.

#pragma omp declare simd
void saxpy(float *X, float *Y, int i, float SA) {
    Y[i] = SA * X[i] + Y[i];
}

int main() {
    float X[SIZE], Y[SIZE], SA;
    #pragma omp simd
    for (int i = 0; i < SIZE; i++) {
        saxpy(X, Y, i, SA);
    }
}

The speed has not increased; in fact, the code has become significantly slower, with a 1.64x slowdown. To make matters worse, the Intel compiler gave a vectorization report indicating that both the SIMD loop and the function were vectorized. So what is making this function slower? The declare simd directive alone does not give the compiler enough information to vectorize the function effectively. Without enough information, the compiler either cannot vectorize the loop or resorts to gathering and scattering to access the memory. This problem can be fixed by using SIMD declare clauses to pass more hints about the function to the compiler. The compiler must have an idea of what the function is going to look like if it were called multiple times in series. Currently, the compiler does not know that the pointers given will not change between the function calls or that they are accessed sequentially. The uniform and linear clauses can pass this necessary information to the compiler.

#pragma omp declare simd uniform(X, Y) linear(i : 1)
void saxpy(float *X, float *Y, int i, float SA) {
    Y[i] = SA * X[i] + Y[i];
}

int main() {
    float X[SIZE], Y[SIZE], SA;
    #pragma omp simd
    for (int i = 0; i < SIZE; i++) {
        saxpy(X, Y, i, SA);
    }
}

Now the desired speed-up is achieved, with the vectorized loop performing roughly 3.36x faster using the AVX2 instruction set. So far in these examples, however, the addresses of X and Y have been misaligned. Next, the memory addresses are aligned and this information is passed to the compiler through the aligned clause. Alignment should make the memory accesses quicker and allow the code to use the aligned versions of the vector instructions.

#pragma omp declare simd uniform(X, Y) linear(i : 1) \
    aligned(X, Y : 32) notinbranch
void saxpy(float *X, float *Y, int i, float SA) {
    Y[i] = SA * X[i] + Y[i];
}

int main() {
    float X[SIZE], Y[SIZE], SA;
    #pragma omp simd aligned(X, Y : 32)
    for (int i = 0; i < SIZE; i++) {
        saxpy(X, Y, i, SA);
    }
}

Now the aligned version of this function performs 7.47x faster than the original scalar loop. This code has successfully been vectorized with the use of OpenMP’s SIMD directives. But this code could have been easily vectorized by most compilers’ auto-vectorizer if the compiler inlined the saxpy function. However, if the saxpy function was declared in some other file that was linked in by the compiler, it would be impossible for the function to be inlined. This function can still be vectorized if OpenMP is used; declaring the function as extern will cause the loop to call the vectorized version of the function once the files are linked.

#pragma omp declare simd uniform(X, Y) linear(i : 1) \
    aligned(X, Y : 32) notinbranch
extern void saxpy(float *X, float *Y, int i, float SA);

int main() {
    float X[SIZE], Y[SIZE], SA;
    #pragma omp simd aligned(X, Y : 32)
    for (int i = 0; i < SIZE; i++) {
        saxpy(X, Y, i, SA);
    }
}

Because we are using the Intel C Compiler, we can also make use of the processor clause to better target the AVX2 and FMA vector instruction sets. This allows the vectorized function to fully utilize the 256-bit YMM registers. After this change is made, the resulting code finally runs 24.12x faster than the original misaligned scalar loop without OpenMP directives. If the original code is also aligned, the result is a 9.34x speedup. This is faster than the expected 8.00x speedup because of the use of the FMA instruction set.

#pragma omp declare simd uniform(X, Y) linear(i : 1) \
    aligned(X, Y : 32) processor(core_4th_gen_avx)
extern void saxpy(float *X, float *Y, int i, float SA);

int main() {
    float X[SIZE], Y[SIZE], SA;
    #pragma omp simd aligned(X, Y : 32)
    for (int i = 0; i < SIZE; i++) {
        saxpy(X, Y, i, SA);
    }
}

4. OpenMP SIMD Programming Guidelines

This section provides several programming guidelines for OpenMP programmers to develop correct and high performance SIMD programs using the SIMD extensions in the OpenMP 4.5 specification.

4.1 Ensure SIMD execution legality

When the OpenMP SIMD construct is applied to a for loop or do loop, the programmer guarantees that the loop can be partitioned into smaller chunks of SIMD execution. This chunk is created by executing several logical iterations of the loop in parallel using SIMD instructions. The size of this chunk is determined by the computer architecture that the compiler is targeting and the data types used inside the loop. To provide this guarantee, the programmer must preserve the original data dependencies in the loop using the safelen clause and remove potential data dependencies that may hinder SIMD execution using the private, lastprivate, reduction, or linear clauses. This is important when backward-carried loop dependencies are present.

float A[SIZE], B[SIZE];
#pragma omp simd safelen(4)
for (int i = 4; i < SIZE; i++) {
    A[i] = A[i - 4] + B[i];
}

Normally, the compiler should be able to detect this backward-carried loop dependency. However, the OpenMP SIMD construct guarantees that the loop can be executed using partitioned chunks of SIMD execution. If the compiler is targeting a computer architecture that supports the AVX instruction set, then according to Table 2, this loop will execute using chunks of eight logical loop iterations. This is an issue because the loop cannot execute more than four logical loop iterations concurrently without violating the backward-carried loop dependency. To preserve this dependency, the safelen clause is required to limit the number of concurrent loop iterations. The safelen clause indicates that no more than four iterations of the loop can be executed concurrently without violating a data dependency. Data sharing clauses can be used to break false data dependencies that may hinder SIMD execution. In the following example, the private clause guarantees that x is private to each SIMD chunk of the loop, so the values in x will not cross between the partitioned chunks of the loop. Otherwise, the compiler may think that the variable x contains a write-write or write-read conflict. The private clause is used to remove this false dependency. The linear and reduction clauses both provide privatization as well.

float A[SIZE], B[SIZE], x;
#pragma omp simd private(x)
for (int i = 0; i < SIZE - 1; i++) {
    x = A[i];
    B[i] = foo(x * A[i + 1]);
}

4.2 Vector Length and Alignment

The vector length indicates how many logical iterations of the loop will be handled in parallel. This depends on the size of the vector register selected by the compiler and the size of the data type stored inside it. For x86 processors, the vector lengths supported by the hardware for each data type and vector register are given in Table 2. When using OpenMP, the vector length is determined by the compiler's target architecture for SIMD loops and by the simdlen clause for SIMD-enabled functions. The compiler can simulate larger vector lengths than the computer architecture supports by combining several vector registers into a single logical vector register. In some cases, using a larger vector length can be beneficial due to improved instruction-level parallelism and amortization of various loop overheads. In other cases, using a larger vector length can be detrimental if the limited number of available vector registers is exhausted. Memory alignment is important for SIMD applications; this is discussed in Section 2.2. The compiler is unable to determine the alignment properties of data that is linked from other files or is dynamically allocated. Additionally, the compiler by itself cannot assume that function arguments will be aligned. OpenMP provides the aligned clause so the programmer can assert memory alignment properties. The ideal alignment value for the aligned clause is determined by the size of the vector register used. The values listed in Table 3 provide the ideal alignment for the x86 vector registers.

float X[SIZE], Y[SIZE], A;
#pragma omp simd aligned(X : 32, Y : 32)
for (int i = 0; i < SIZE; i++) {
    X[i] = A * X[i] + Y[i];
}

4.3 OpenMP SIMD Functions

OpenMP allows users to create SIMD-enabled functions that can be called from SIMD regions. These functions allow simdlen partitioned chunks of the logical loop iteration space to be executed in parallel. This is a large benefit for functions that cannot be inlined by the compiler, because SIMD-enabled functions can be linked from independent vector libraries. However, effectively vectorizing functions with OpenMP can be difficult. The following examples show how to correctly create SIMD-enabled functions with OpenMP SIMD.

#include <math.h>

float rsqrtf(float A) {
    return 1.0f / sqrtf(A);
}

double rsqrt_mul(float *A, float *B, int i) {
    return B[i] * rsqrtf(A[i]);
}

int main() {
    float A[SIZE], B[SIZE];
    double C[SIZE];

    #pragma omp simd
    for (int i = 0; i < SIZE; i++) {
        C[i] = rsqrt_mul(A, B, i);
    }
}

For this loop, the use of the OpenMP SIMD construct causes the loop to be partitioned into chunks of SIMD execution. However, if the function calls to rsqrt_mul and rsqrtf cannot be inlined by the compiler, such as if they were stored in a separate library or linked from another file, the loop cannot execute the function calls in parallel. The compiler provides the SIMD execution by calling the scalar function multiple times according to the size of the partitioned chunk. Then, each of these returned values is packed into a vector register to preserve the SIMD execution. This method will typically perform slower than a scalar version of the loop without the OpenMP SIMD construct. To solve this problem, the OpenMP declare construct can be used to create SIMD-enabled versions of these functions to call inside of a SIMD region.

#pragma omp declare simd
float rsqrtf(float A) {
    return 1.0f / sqrtf(A);
}

#pragma omp declare simd uniform(A, B) linear(i : 1)
double rsqrt_mul(float *A, float *B, int i) {
    return B[i] * rsqrtf(A[i]);
}

int main() {
    float A[SIZE], B[SIZE];
    double C[SIZE];

    #pragma omp simd
    for (int i = 0; i < SIZE; i++) {
        C[i] = rsqrt_mul(A, B, i);
    }
}

The OpenMP declare construct allows SIMD-enabled versions to be created for both functions. The uniform clause indicates that the values of A and B will not change between each SIMD execution of the function. This is implemented by replacing the function argument of the vectorized function with a parameter passed in a general purpose register that is shared between each SIMD lane of the function, rather than a vector register. The linear clause indicates that the value i increases by 1 between each concurrent iteration of the loop. This also allows the compiler to replace its function argument with a general purpose register. The combination of these two clauses allows the compiler to unroll the rsqrt_mul function and perform loads from memory with linear strides. If these clauses were not present, the function would instead perform a gathering operation from a set of indices and pointers passed as arguments to the function using a vector register.

The SIMD-enabled functions created with the OpenMP declare construct each have their own arguments and qualities. The rsqrt_mul function has arguments (uniform, uniform, linear:1) and the rsqrtf function has arguments (vector). Additionally, each function has its own simdlen. For x86 processors, the default target architecture is SSE2. So, according to Table 2, the rsqrt_mul function has a default simdlen of two while the rsqrtf function has a default simdlen of four, because these functions return doubles and floats respectively. This conflicting simdlen is a problem that prevents these functions from being vectorized. When the rsqrt_mul function is called, it operates on two arguments concurrently. However, the vectorized rsqrtf function requires four arguments to be called. This conflict means that the rsqrt_mul function cannot provide enough input to the rsqrtf function for it to be vectorized. In order for a SIMD-enabled function to call another SIMD-enabled function, the calling function must have a simdlen that is a multiple of the called function's simdlen. This incompatibility can be solved by the simdlen clause. Declaring the rsqrt_mul function with a simdlen of four causes the compiler to apply double-pumping and treat two vector registers as a single logical vector register.

#pragma omp declare simd simdlen(4)
float rsqrtf(float A) {
    return 1.0f / sqrtf(A);
}

#pragma omp declare simd uniform(A, B) linear(i : 1) simdlen(4)
double rsqrt_mul(float *A, float *B, int i) {
    return B[i] * rsqrtf(A[i]);
}

int main() {
    float A[SIZE], B[SIZE];
    double C[SIZE];

    #pragma omp simd
    for (int i = 0; i < SIZE; i++) {
        C[i] = rsqrt_mul(A, B, i);
    }
}

This code allows the loop to be successfully vectorized. However, improvements can still be made to this function. Currently, the default architecture of SIMD-enabled functions on x86 processors is SSE2, and this default cannot be changed by compiler flags if the function is not inlined. If the Intel compiler is used on an x86 processor, the processor clause can be used to specify a target architecture for the code generated inside the SIMD-enabled functions. If this loop is to target the AVX2 instruction set, then according to Table 5, the string used with the processor clause will be core_4th_gen_avx. If the AVX2 instruction set is used, then the simdlen should be changed to eight to fully utilize the width of the YMM registers. Additionally, if the rsqrt_mul function is never called within a conditional branch, the notinbranch clause can safely be used, which reduces the size of the generated code. These functions can be further improved if the pointers A and B are guaranteed to be aligned; in this case the aligned clause can be used to provide the compiler with the alignment information.

#pragma omp declare simd simdlen(8) notinbranch \
    processor(core_4th_gen_avx)
float rsqrtf(float A) {
    return 1.0f / sqrtf(A);
}

#pragma omp declare simd uniform(A, B) linear(i : 1) simdlen(8) \
    aligned(A : 32, B : 32) notinbranch processor(core_4th_gen_avx)
double rsqrt_mul(float *A, float *B, int i) {
    return B[i] * rsqrtf(A[i]);
}

int main() {
    float A[SIZE], B[SIZE];
    double C[SIZE];

    #pragma omp simd
    for (int i = 0; i < SIZE; i++) {
        C[i] = rsqrt_mul(A, B, i);
    }
}

4.4 Memory Access Collapsing

The OpenMP SIMD construct works by partitioning the logical iterations of the loop into SIMD chunks. This partitioning requires a single logical iteration space to be valid. Because of this, the OpenMP SIMD construct can normally only be correctly applied to the innermost loop in a set of nested loops. The collapse clause allows the OpenMP SIMD construct to be applied to multiple loops by collapsing the set of nested loops into a single logical iteration space. It is important to understand how the collapse clause works to use it correctly. When used incorrectly, the collapse clause can cause some nested loops to execute several times slower than if the OpenMP SIMD construct was applied to the innermost loop alone. This can be seen by observing the process the compiler uses to collapse a loop. For example, consider a simple set of two nested loops.

float A[4][4];
#pragma omp simd collapse(2)
for (int i = 0; i < 4; i++) {
    for (int j = 0; j < 4; j++) {
        A[i][j] = A[i][j] + 1;
    }
}

This loop will be collapsed by creating a global iteration space from the total number of logical iterations made by the nested loops. Then the values of i and j will need to be calculated using the new global iteration space. This set of nested loops will be transformed into the following single loop by the compiler.

float A[4][4];
#pragma omp simd
for (int k = 0; k < 16; k++) {
    int i = k / 4;
    int j = k % 4;
    A[i][j] = A[i][j] + 1;
}

This collapsed loop will perform poorly if it is vectorized. This is because the division or modulus operations must be simulated with software and are fundamentally slow. Additionally, the memory access to A is not linear with respect to the collapsed loop, so the memory must be gathered and then scattered. However, this transformation may be profitable for SIMD directives that target SMT multithreading (e.g. for accelerators like GPUs). In order to make loop collapsing profitable for vectorization, the data accesses must be collapsed as well as the iteration space. Memory access collapsing can be done by observing the layout of the original loop. In the original loop, the inner loop performs four iterations. If the loop is vectorized with a vector length that is a multiple of four, then the inner loop can be performed in a single iteration of the SIMD loop. This removes the need for the modulus operation on j because the remainder will always be zero after performing four iterations of the loop simultaneously. Now the memory access is linear, so the division can be removed. After performing memory access collapsing, the original set of nested loops will be transformed into the following single loop by the compiler.

float A[4][4];
#pragma omp simd
for (int k = 0; k < 16; k++) {
    // Conceptually equivalent to A[0][k] = A[0][k] + 1;
    A[k] = A[k] + 1;
}

After collapsing the data accesses, there is no need to re-calculate the i and j iterators with expensive division and modulus operations. The resulting code also generates contiguous memory accesses for A. In this situation, the collapsed loop will perform well if it is vectorized. The compiler must support both memory access collapsing and loop iteration collapsing for the collapse clause to be profitable.

4.5 Scalar Execution Inside SIMD Regions

SIMD parallelism requires that multiple pieces of data be modified with a single instruction. However, this is not possible in some cases. In those cases, the compiler provides effective SIMD execution by individually performing each operation using scalar instructions. This situation can occur when a certain operation has no SIMD equivalent or when the result would be incorrect under SIMD execution. While the former can be handled by the compiler, the latter must under certain conditions be specified by the user. The ordered simd construct can be used to specify scalar execution for a block of code. It is beneficial if a certain condition prevents the code from correctly executing using SIMD instructions. The histogram problem is a good example of this: a gather instruction can be used to compute the result of a histogram operation if the indices are distinct, but if the indices are not distinct, another method must be used. The ordered simd construct can be used to express this.

#pragma omp simd
for (int i = 0; i < SIZE; i++) {
    if (has_conflict) {
        #pragma omp ordered simd
        {
            hist[A[i]]++;
        }
    } else {
        hist[A[i]]++;
    }
}

5. General SIMD Programming Guidelines

Efficiently exploiting SIMD parallelism in code can be more difficult than other forms of parallelism. This is because SIMD parallelism must use a restricted set of hardware registers and SIMD instructions. Because of these restrictions, understanding how the SIMD model works is essential to maximizing its potential performance. Additionally, it is common for a vectorized program to run slower than a scalar program. This drop in performance is usually caused by attempting to vectorize operations that have no SIMD instruction equivalent, or by performing operations that are relatively inexpensive when done with scalar instructions but quite slow when SIMD instructions are used. This section provides several general programming guidelines for programmers to obtain high performance SIMD programs while relying on auto-vectorization.

5.1 Data Size and Conversion

The choice of data size has a large impact on the performance of SIMD applications. Vector registers can only hold a fixed amount of data, so the size of the data packed inside the vector register determines how much can be processed in parallel. Additionally, the processor must be able to handle each different data type. Because of this, some loops may not be vectorized, or may be vectorized poorly, because of the data they operate on. For example, the x86 architecture currently does not support multiplication operations between packed bytes. For high performance SIMD applications, smaller data types should be used whenever possible. Referring to Table 2, if 16-bit values are used instead of 32-bit values, then twice as much data can be handled in parallel. So, a vectorized loop that processes shorts will perform roughly twice as fast as a loop that processes ints. The same speedup is achieved when floats are used over doubles. So, for integers, the programmer should use the smallest data type that can hold the largest expected value. For floating point values, the programmer should decide how much floating point precision is necessary to produce a correct result.∗ Data size is important when operations are performed between two vector registers. Although smaller data types allow more data to be handled in parallel, different vector lengths should not be mixed if at all possible. If two different vector lengths are used then the amount of data that can be handled in parallel is

∗Long doubles cannot be vectorized.

limited by the smallest vector length. Additionally, there will be considerable overhead involved with converting from one vector length to another. This can be shown with the following loop.

float X[SIZE];
double Y[SIZE];
#pragma omp simd aligned(X : 32, Y : 32)
for (int i = 0; i < SIZE; i++) {
    X[i] = Y[i] * X[i] + 1.0f;
}

This loop mixes two different vector lengths. The vector length when using doubles is half the vector length when using floats. The vector lengths must match before this loop can be executed using SIMD instructions. Two vector registers must be combined into a single logical vector register so that the vector length of the doubles is compatible with the vector length required by the loop. Then the packed floats must be converted to doubles by placing each half of the vector register into a separate vector register. Finally, the multiplication is performed and the packed doubles are converted back into floats and stored. This is dramatically slower than if only doubles or only floats were used inside this loop. The programmer should also be careful not to call library functions with conflicting vector lengths. If the programmer calls a library function that returns a conflicting data type, there will be a severe performance penalty.

float X[SIZE];
#pragma omp simd aligned(X : 16)
for (int i = 0; i < SIZE; i++) {
    X[i] = pow(X[i], -1.5);
}

Some compilers can vectorize this loop by using a vectorized math library. However, the pow function takes doubles as its arguments and returns a double. This requires a costly conversion operation like the one detailed in the previous loop. This can be avoided if the call to the pow function is replaced with the powf function, which takes floats as its arguments and returns a float, removing the need for a slow conversion operation. Finally, the programmer should be careful to specify whether a floating point constant is a double or a float. Constants given in the form “x.xx” are usually assumed to be doubles. If such a constant is multiplied by a float, it may also require a costly conversion operation. Floating point constants should always be specified using the form “x.xxf” to avoid this.
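For instance (a small illustrative sketch), the two statements below differ only in the constant's suffix, but the first may force the packed floats to be converted to doubles and back:

X[i] = X[i] * 1.5;    /* 1.5 is a double: may force a float -> double -> float conversion */
X[i] = X[i] * 1.5f;   /* 1.5f is a float: the computation stays in single precision */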

5.2 Memory Access

The layout of the data is often what determines whether or not vectorization is profitable. Data movement is detailed in Section 2.1.1. Memory should ideally be completely contiguous for SIMD applications. This is not always possible, but, in some cases, the programmer can alter the memory layout to be more contiguous. This can be seen in the following example that uses a struct to represent a three-dimensional vector.

struct { float x, y, z; } v[SIZE];
#pragma omp simd aligned(v : 16)
for (int i = 0; i < SIZE; i++) {
    float n = sqrtf(v[i].x * v[i].x + v[i].y * v[i].y + v[i].z * v[i].z);
    v[i].x /= n;
    v[i].y /= n;
    v[i].z /= n;
}

This loop represents the vectors as an array of structs. But in this example, the x, y, and z components of the array v are not stored contiguously. When the struct is placed in memory, the data will be arranged in the pattern {x0, y0, z0, x1, y1, z1, ...}. The computation can easily be vectorized, but moving the data into the vector registers will require several data movement instructions that will slow down this loop. It would be much faster if the data were stored contiguously in the pattern {{x0, x1, ...}, {y0, y1, ...}, {z0, z1, ...}}. Arranging data in this way is called a struct of arrays and will greatly reduce the amount of time spent moving data into the vector registers.

struct { float x[SIZE], y[SIZE], z[SIZE]; } v;
#pragma omp simd aligned(v.x : 16, v.y : 16, v.z : 16)
for (int i = 0; i < SIZE; i++) {
    float n = sqrtf(v.x[i] * v.x[i] + v.y[i] * v.y[i] + v.z[i] * v.z[i]);
    v.x[i] /= n;
    v.y[i] /= n;
    v.z[i] /= n;
}

A struct of arrays has the downside that it may fragment the cache. To keep the data more local, a combination of a struct of arrays and an array of structs can be used where the struct of arrays only contains enough elements to completely fill a vector register. Then this struct is used as an array of structs.
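One possible layout for this hybrid, assuming a 256-bit vector register that holds eight floats and that SIZE is a multiple of that width, is sketched below; each array element holds exactly one register's worth of each component:

#define VL 8                        /* assumed: 8 floats fill one 256-bit vector register */
struct vec3_chunk {
    float x[VL], y[VL], z[VL];      /* a struct of arrays, one vector register wide */
};
struct vec3_chunk v[SIZE / VL];     /* used as an array of structs */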

Multidimensional arrays can be accessed with a linear stride. An example of this would be a loop that iterates over the columns of a two dimensional array. In this case, the memory accesses can be effectively disjoint if the linear stride is greater than the vector length. However, data that has already been loaded can sometimes be used later. For example, if memory from a column is read into a vector register, data from the next column is also in the vector register. Unfortunately, this is difficult to take advantage of when relying on auto-vectorization.

Memory can also sometimes be accessed at completely disjoint locations. This is common in applications involving look-up tables or histograms. When the accesses are disjoint, the memory must be gathered or scattered. These operations are extremely costly on computer architectures that do not support gathering or scattering instructions. If gathering and scattering instructions are supported, then loops involving look-up tables can sometimes be vectorized. However, these instructions are still several times slower than a contiguous memory access. Because of this, look-up tables should not be used unless the time saved on computation is greater than the time spent gathering the memory.
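A sketch of such a loop (with hypothetical table, idx, and X arrays): each iteration reads from a data-dependent location, so the loads must be gathered or performed one at a time with scalar instructions.

float table[256], X[SIZE];
int idx[SIZE];
#pragma omp simd
for (int i = 0; i < SIZE; i++) {
    X[i] = table[idx[i]];   /* non-contiguous reads: requires a gather or scalar loads */
}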

5.3 Integer Multiplication and Division by Constants

Packed integer multiplication can be slow. For constants, the speed can be improved by performing a shift-and-add multiplication. This is done by factoring the constant multiplicand into a sum of powers of two. Thus, the multiplication 10x is equivalent to (8 + 2)x. This can be accomplished by shifting x to the left three times and adding that to x shifted left once. This approach is typically faster for constants that generate fewer than five factors. Some compilers, such as GCC, will automatically perform shift-and-add multiplication. Others can see an improvement if this is done explicitly by the programmer. The loop below multiplies each element by 10 using shift-and-add. This method provided a 1.66x speedup with the Intel compiler, while the GCC compiler used this method automatically.

short X[SIZE];
#pragma omp simd aligned(X : 16)
for (int i = 0; i < SIZE; i++) {
    X[i] = (X[i] << 3) + (X[i] << 1);
}

The x86 architecture does not support SIMD instructions for packed integer division or modulus. However, this operation can be simulated with multiplication and bit-shifting operations. This follows from the idea that, for floating point values, division is equivalent to multiplication by the reciprocal. To do this with integers, the reciprocal is scaled by 2^n and the product is then shifted to the right n times. There are several algorithms for finding a suitable n value covered by other authors [2].
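As a small illustration (not from the report), unsigned division by 5 can be replaced by a multiplication with the reciprocal scaled by 2^34, followed by a right shift by 34; the constant below is the ceiling of 2^34/5, the kind of value such algorithms produce. The helper name div5 is hypothetical.

#include <stdint.h>

/* Computes x / 5 for any 32-bit unsigned x without a division instruction.
   0xCCCCCCCD == ceil(2^34 / 5); the 64-bit product is shifted right by 34. */
static inline uint32_t div5(uint32_t x) {
    return (uint32_t)(((uint64_t)x * 0xCCCCCCCDu) >> 34);
}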

Some compilers may convert integer division by a constant with a multiplication and shift in this way if it is known at compile time. The Intel compiler has a set of functions that can calculate the value of n at runtime if the divisor is only known at runtime. However, this is still an expensive process, so integer division and modulus should be completely avoided whenever possible.

5.4 Conditional Statements

Conditional statements can be vectorized by calculating the values unconditionally and then using a bit-mask or separate mask register to choose between them. The downside of this is that the entire loop body must be calculated unconditionally. If a loop contains a relatively costly calculation that is rarely required, the result of this calculation will be computed in each iteration of the SIMD loop only to be discarded at the end. This can sometimes cause a vectorized loop to perform slower than a scalar loop. In some cases this can be avoided by jumping past a code segment if its associated mask is completely empty; some compilers are capable of doing this.

Generally, branch-free algorithms should be favored for SIMD applications. This removes the need for SIMD comparisons to create bit-masks. Branch-free algorithms will also typically remove the problem of conditional statements with disproportionate costs that can slow down vectorized loops. Additionally, branch-free algorithms typically use operations that are easily performed using SIMD instructions, such as bit-shifting or other bit-logic instructions.
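As one small sketch, clamping negative values to zero can be written without any comparison, which vectorizes trivially (fabsf is from math.h and typically compiles to a single bit-mask instruction; X is a hypothetical float array):

#pragma omp simd
for (int i = 0; i < SIZE; i++) {
    /* branchy form: if (X[i] < 0.0f) X[i] = 0.0f;
       branch-free:  max(X[i], 0) using only an absolute value, an add, and a multiply */
    X[i] = (X[i] + fabsf(X[i])) * 0.5f;
}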

6. HACCmk

HACCmk is a compiler benchmark whose performance relies on how well the compiler can vectorize the critical loop. The loop contains a function call, a reduction, and conditional assignments.

void Step10_orig(int count1, float xxi, float yyi, float zzi,
                 float fsrrmax2, float mp_rsm2, float *xx1, float *yy1,
                 float *zz1, float *mass1, float *dxi, float *dyi, float *dzi)
{
    const float ma0, ma1, ma2, ma3, ma4, ma5;
    float dxc, dyc, dzc, m, r2, f, xi, yi, zi;

    for (int j = 0; j < count1; j++) {
        dxc = xx1[j] - xxi;
        dyc = yy1[j] - yyi;
        dzc = zz1[j] - zzi;

        r2 = dxc * dxc + dyc * dyc + dzc * dzc;

        m = (r2 < fsrrmax2) ? mass1[j] : 0.0f;

        f = pow(r2 + mp_rsm2, -1.5) -
            (ma0 + r2 * (ma1 + r2 * (ma2 + r2 * (ma3 + r2 * (ma4 + r2 * ma5)))));

        f = (r2 > 0.0f) ? m * f : 0.0f;

        xi = xi + f * dxc;
        yi = yi + f * dyc;
        zi = zi + f * dzc;
    }
}

A critical mistake was that the “pow” function was used instead of “powf”. This results in the code running about three times slower using the Intel compiler. There are several reasons for this dramatic slowdown. The Intel compiler recognized that the exponent of −1.5 could easily be calculated as 1/√(x³), so it replaced the function call with this calculation. Because the pow function was used, a costly conversion was performed to convert the floats to doubles and then back to floats. Additionally, there is a packed inverse square root instruction that only applies to floats. This square root is faster than a regular square root and division because it is an approximation of the square root computed using the Newton-Raphson method. Finally, the use of doubles confused the compiler and caused it to turn the conditional assignment into scalar code that then packed the results back into the vector register at the end.

While the Intel compiler had no problems vectorizing this loop once the function call was corrected, the GCC and Clang 3.8 compilers could not vectorize it. It seems that both Clang and GCC lack a set of vectorized math functions. There are hardware instructions that calculate square roots, so the “powf” call can be replaced with a computation based on the “sqrtf” function. However, the GCC and Clang compilers still did not replace the “pow” function call with these instructions the way the Intel compiler did. We would like OpenMP 4.5 compilers to provide vectorized versions of “math.h” functions.
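One possible manual rewrite (a sketch, assuming r2 + mp_rsm2 is strictly positive, which holds here because mp_rsm2 is a positive smoothing term) removes the library call entirely so that compilers without vectorized math libraries can still vectorize the loop:

float t = r2 + mp_rsm2;
f = 1.0f / (t * sqrtf(t))    /* equivalent to powf(t, -1.5f) for t > 0 */
    - (ma0 + r2 * (ma1 + r2 * (ma2 + r2 * (ma3 + r2 * (ma4 + r2 * ma5)))));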

When the function calls were removed, the two compilers still struggled with the conditional assignments in the code. Although the GCC compiler is capable of vectorizing some conditional loops, it could not vectorize them in this loop; either the loop is too complex, or the compiler is confused because the variables being compared are not arrays. It would be very useful if OpenMP supported conditional assignments to provide this functionality across all compilers that support OpenMP.

Even after the function calls and the conditional statements were removed, this loop was still not vectorized. Most likely, the presence of multiple reductions confused the compilers. This, however, could be solved with OpenMP 4.5: using the reduction clause and specifying the variables causes the loop to finally be vectorized by GCC. Clang still had issues and only vectorized the loop once it had been reduced to nothing but the first three and last three statements.

Potentially, the function could be vectorized by writing a vectorized square root function using SIMD declare, but the SIMD declare functionality seems to be troublesome when it is not applied to arrays. A vectorized version could easily be written with a single line of assembly, but the compiler cannot vectorize around assembly code.

7. Conclusion

The performance gained from SIMD execution capabilities is becoming more integral to the overall performance of modern processors. SIMD instructions allow certain sections of code to be executed many times faster than a typical scalar implementation. This can be used to considerably increase the performance of traditional multi-threaded parallel applications without increasing the number of processors.

Effectively making use of SIMD instructions is difficult because of limited static compiler analysis, architecture-specific limitations, and the restrictions of the SIMD execution model itself. Many computer architectures implement SIMD instructions in fundamentally different ways, so any program explicitly vectorized through the use of intrinsic functions or assembly code cannot run on other computer architectures or SIMD models, such as an SMT implementation targeting a GPU. As a result, many programs rely upon the compiler's auto-vectorization to vectorize the code on many different architectures.

Auto-vectorization allows much of the burden of vectorizing a program to be left to the compiler. However, the SIMD architecture imposes several constraints that the compiler cannot resolve through static analysis alone, such as data alignment, data dependencies, aliasing, irregular memory accesses, and generally the fixed-length nature of vector registers. Also, the quality of auto-vectorization varies across compilers. Because of these restrictions, the OpenMP standard API for exploiting parallelism on shared memory processors was expanded to provide powerful and portable directives to assist with auto-vectorization.

OpenMP provides directives to improve the capabilities of the compiler's auto-vectorization pass by providing it with information that cannot be determined through compile-time static analysis. This allows the programmer to effectively vectorize previously problematic sections of code and have them run efficiently on several computer architectures and accelerators. However, because of the nature of SIMD instructions, an understanding of the SIMD model of execution is essential in order to create code that can be easily and efficiently vectorized by the compiler. OpenMP SIMD parallelism is not limited to architectures

with vectorization units, as it can be simulated using threads (e.g. GPU SMT threads) on different architectures as long as the SIMD parallel semantics are kept.

References

[1] Randy Allen and Ken Kennedy. Automatic translation of Fortran programs to vector form. ACM Trans. Program. Lang. Syst., 9(4):491–542, October 1987.

[2] Torbjörn Granlund and Peter L. Montgomery. Division by invariant integers using multiplication. In Proceedings of the SIGPLAN ’94 Conference on Programming Language Design and Implementation, pages 61–72, 1994.

[3] Michael Klemm, Alejandro Duran, Xinmin Tian, Hideki Saito, Diego Caballero, and Xavier Martorell. Extending OpenMP* with Vector Constructs for Modern Multicore SIMD Architectures, pages 59–72. Springer Berlin Heidelberg, Berlin, Heidelberg, 2012.

[4] D. J. Kuck, Y. Muraoka, and Shyh-Ching Chen. On the number of operations simultaneously executable in Fortran-like programs and their resulting speedup. IEEE Trans. Comput., 21(12):1293–1310, December 1972.

[5] Leslie Lamport. The parallel execution of do loops. Commun. ACM, 17(2):83–93, February 1974.

[6] Leslie Lamport. The coordinate method for the parallel execution of iterative loops. SRI International, 1981.

[7] Yoichi Muraoka. Parallelism Exposure and Exploitation in Programs. PhD thesis, University of Illinois at Urbana-Champaign, Champaign, IL, USA, 1971. AAI7121189.

[8] OpenMP Website. http://www.openmp.org.

[9] Hideki Saito, Serge Preis, Nikolay Panchenko, and Xinmin Tian. Reducing the Functionality Gap Between Auto-Vectorization and Explicit Vectorization, pages 173–186. Springer International Publishing, Cham, 2016.

[10] X. Tian, H. Saito, S. V. Preis, E. N. Garcia, S. S. Kozhukhov, M. Masten, A. G. Cherkasov, and N. Panchenko. Practical SIMD vectorization techniques for Intel Xeon Phi coprocessors. In Parallel and Distributed Processing Symposium Workshops & PhD Forum (IPDPSW), 2013 IEEE 27th International, pages 1149–1158, May 2013.

[11] Xinmin Tian and Bronis R. de Supinski. Explicit vector programming with OpenMP 4.0 SIMD extensions. HPC Today, 2014.

[12] Michael Joseph Wolfe. Optimizing Supercompilers for Supercomputers. PhD thesis, University of Illinois at Urbana-Champaign, Champaign, IL, USA, 1982. AAI8303027.

[13] Hans Zima and Barbara Chapman. Supercompilers for Parallel and Vector Computers. ACM, New York, NY, USA, 1991.
