Computing and Informatics, Vol. 36, 2017, 1001–1018, doi: 10.4149/cai_2017_5_1001

USING AVX2 INSTRUCTION SET TO INCREASE PERFORMANCE OF HIGH PERFORMANCE COMPUTING CODE

Pawel Gepner

Intel Corporation, Pipers Way, Swindon, Wiltshire SN3 1RJ, United Kingdom
e-mail: pawel.gepner@intel.com

Abstract. In this paper we discuss the new Intel instruction extensions – Intel Advanced Vector Extensions 2 (AVX2) – and what they bring to high performance computing (HPC). To illustrate this, new systems utilizing AVX2 are evaluated to demonstrate how to effectively exploit AVX2 for HPC types of code, and to expose situations where AVX2 might not be the most effective way to increase performance.

Keywords: FMA operations, performance of AVX2 instruction set, benchmarking, Haswell processor, 4th generation processor, Intel E5-2600v3 series processor

Mathematics Subject Classification 2010: 68M07, 68M20

1 INTRODUCTION

In the field of computational science, a synergistic approach is needed, based on separate but connected domains: knowledge about the problem, tools and environments for building a simulation model of the given problem, and a properly chosen computer architecture for the research. The selection of each element of such an approach is equally important; hence the development of simulation systems is usually a complicated task, requiring a collaborative effort from many domain specialists and building on demanding advances in computing and computer science.

Each domain, as an element of this approach, evolves separately, but within the limitations that follow from the requirements for collaboration. Development of High Performance Computing (HPC) architectures proceeds at different levels, following from their functional construction, with processors as a primary interest. Since the new processor with a novel implementation of the Advanced Vector Extensions (AVX) instruction set – AVX2 – has brought a new level of floating point performance and changed the landscape of HPC, a new era of calculation has begun. Therefore, the focus of this paper is on the efficient handling of floating point operations, as often used in scientific computing and in linear algebra operations in particular. As linear algebra operations, often used in HPC, operate mainly on matrices and scalar functions with a focus on basic operations such as multiplications and additions, a detailed study of operation throughput and latency in different scenarios will be presented.

The Haswell processor is the first Intel CPU supporting the AVX2 instruction set; at the same time its graphics unit has been completely redesigned, which doubled the graphics performance and made this product even more attractive for end users. In addition to AVX2, a new DDR4 memory interface is available within the Intel Xeon E5 family portfolio. AVX2 in particular appears to be very useful for high-performance computing, whereas the new graphics unit is aimed more at desktop and laptop users. This paper does not consider graphics quality, as it concentrates on HPC-type applications.

The rest of the paper is organized as follows. In the next section the existing approaches to FMA implementation are recapitulated and roughly compared. Section 3 outlines the problem under study together with the experimental setup, while the following sections report the problem solution, results and conclusions. An outlook on future research concludes the paper.

2 STATE OF THE ART

Intel announced the first AVX extensions in March 2008; the new products of the Sandy Bridge and Ivy Bridge families then established this instruction set, brought a significant advantage for HPC types of code and started a new era of HPC systems. Nevertheless, Intel's architects realised that an FMA extension and the expansion of vector integer SSE and AVX instructions to 256 bits were still required, and worked hard to include these additions. These capabilities were later added to the instruction set architecture: AVX2 with FMA support arrived in the Intel CPU family with the launch of the Haswell CPU.

A fused multiply-add (FMA) is a floating point multiply-add operation performed in one step, with a single rounding. An FMA computes the entire operation a + b × c to its full precision before rounding the final result down to N significant bits. The main advantages of using FMA operations are performance and accuracy. A typical implementation of the fused multiply-add operation actually performs faster than a multiply operation followed by an addition, and requires only one step instead of the two steps needed for the separate operations. Moreover, separate multiply and addition operations would compute the product b × c first, round it to N significant bits, then add the result to a and round again to N significant bits, whereas the fused multiply-add operation computes the entire sum a + b × c to its full precision before rounding the final result down to N significant bits. Such an implementation of course requires a large amount of logic and extra transistors, but the performance benefits compensate the cost of the additional resources, as is demonstrated in this paper.

Intel's AVX2 is not the first implementation of FMA for the x86 family, as AMD processors supported it before Intel. Unfortunately, AMD's implementation is not compatible with Intel's and, more importantly, it is not as powerful. There are two variants of FMA implementation: FMA4 and FMA3. Both have identical functionality but they are not compatible: FMA3 instructions have three operands while FMA4 ones have four. The 3-operand form is Intel's version of FMA; it makes the code shorter and the hardware implementation slightly simpler and faster. Other CPU vendors have implemented FMA in their products as well. The first implementation of FMA was done by IBM in the IBM POWER1 processor, followed by the HP PA-8000, Intel Itanium, Fujitsu SPARC64 VI and others. Today even GPUs and ARM processors are equipped with FMA functionality.

Since the official launch of AVX2, FMA has become a standard operation in Intel's ISA, but the evolution did not stop there. In 2013 Intel proposed 512-bit extensions to the 256-bit Advanced Vector Extensions SIMD instructions for x86, and the use of FMA has been extended further. This extension is not only about 512-bit width: it also adds new Integer Fused Multiply Add instructions (IFMA52) operating on 52-bit integer data. The first implementation of AVX-512 is still to come, but this already shows the trend and the importance of the role FMA plays now and is going to play in the future.
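The single-rounding property can be illustrated with a minimal C program (our illustration, not code from the paper; fma() is the standard C99 library function, which compilers map to the hardware FMA instruction where available):

    #include <math.h>
    #include <stdio.h>

    int main(void) {
        /* Operands chosen so that b*c is not exactly representable in
           double: the separate path rounds the product before the
           addition, the fused path rounds only once at the end. */
        double a = -1.0;
        double b = 1.0 + 0x1.0p-27;      /* 1 + 2^-27 */
        double c = 1.0 - 0x1.0p-27;      /* 1 - 2^-27; b*c = 1 - 2^-54 */
        double separate = a + b * c;     /* b*c rounds to 1.0, result 0.0 */
        double fused    = fma(b, c, a);  /* keeps the -2^-54 term */
        printf("separate: %.17g\nfused:    %.17g\n", separate, fused);
        return 0;                        /* build: cc -std=c99 fma.c -lm */
    }

The separate multiply-then-add loses the tiny residual entirely (it prints 0), while the fused path preserves it (it prints -2^-54), exactly the effect described above.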

3 PROBLEM DESCRIPTION AND EXPERIMENTAL SETUP

The main problem studied in this paper is how to effectively utilize AVX2 for HPC code and to expose situations where AVX2 might not be the most effective way to increase performance. The systems and CPUs carefully selected for this study are described below.

3.1 Architecture and System Configuration

For this study, several Intel Xeon platforms with the Haswell processor have been selected: the E3-1270v3 single socket server version, as well as dual socket servers based on the Intel Xeon E5-2697v3 and Intel Xeon E5-2680v3.

The Intel Xeon E3-1270v3 processor is a quad-core, 3.5 GHz, 80 W CPU in an LGA package. It is dedicated to single socket servers and workstations.

The Intel Rainbow Pass platform has been selected as the validation vehicle; 16 GB of system memory (2 × 8 GB DDR3-1600 MHz) and enterprise class Intel SSD X25-E hard disk drives have been installed. The system runs bullxlinux6.3 (based on the Linux 2.6.32 kernel).

The Intel Xeon E5-2697v3 processor has 14 cores and is a 2.6 GHz, 145 W CPU in an FCLGA2011-3 package. It is dedicated to dual socket servers and workstations. The Intel Xeon E5-2680v3 processor has 12 cores and is a 2.5 GHz, 120 W CPU in an FCLGA2011-3 package. The Intel Server Wildcat Pass S2600WT platform has been selected as the validation vehicle for all Intel Xeon E5-2600v3 product members. The populated system memory contains 64 GB (8 × 8 GB DDR4-2133 MHz) for both configurations. The Intel Xeon E5-2600v3 series systems are based on 64-bit Linux RHEL7 with kernel vmlinuz-3.10.0-123.el7.x86_64.

The latest versions of the Intel tools suite are installed on all systems, amongst others Intel C/C++/Fortran Compiler XE 13.1 Update 1, Intel Math Kernel Library version 11.2 update 2 and Intel VTune Amplifier XE 2013 Update 5. The new compiler and libraries offer advanced vectorization support, including support for AVX2, and include Intel Parallel Building Blocks (PBB), OpenMP, the High-Performance Parallel Optimizer (HPO), Interprocedural Optimization (IPO) and Profile-Guided Optimization (PGO). All performance tools and libraries provide optimized parallel functions and data processing routines for high-performance applications.

3.2 CPU Characteristic

The Haswell processor family is based on the same micro-architecture principles as its predecessor, but it is wider and armed with more resources. As the first members of the Haswell family, the 4th generation Intel Core processors were designated for the mobile market, so power consumption, graphics performance and the performance per Watt characteristic were the primary design goals. Even though those processors are focused on the mobile and desktop markets, some of the technologies, in particular AVX2, are expected to be very applicable to the HPC segment. Of course, the real product for technical computing comes from the Xeon line as the Intel Xeon E5-2600v3 series. In addition to more cores, bigger cache and more channels of memory, the Intel Xeon E5-2600v3 series supports new DDR4 memory, whereas the 4th generation Intel Core processor family still uses DDR3.

It is very important to emphasize that since each Haswell processor is based on the same architecture as the other members of the family, it does not matter whether the product is dedicated to high-end server, desktop or mobile use. The architecture of the single core remains identical for all members; only what is known as the "out-of-core" architecture and configuration details change. All benefits evaluated at the core level are therefore consistent across the whole family.

The 4th generation Intel Core processor family is manufactured on 22 nm Hi-k metal-gate Tri-Gate 3-D silicon technology and has 1.44 billion transistors in the quad-core version. Almost 30 % of this transistor budget is dedicated to the graphics unit. This is an increase of 45 % over the 995 million transistors of Sandy Bridge. The die size of the new 4th generation Intel Core processor is 177 mm2, only 7 mm2 bigger than the quad core version of Ivy Bridge. It operates at the same frequency as Ivy Bridge; performance in this case grows not through frequency or core count but via extensions to the instruction set architecture. The new support for AVX2 has been added, and it is the main driver of the floating-point performance increase. To make this improvement powerful, two fused multiply-add (FMA) units have been added. Comparing the latency of separate dependent multiply and add instructions versus FMA, the difference goes from eight to only five cycles; this is a 60 % instruction latency improvement compared to the previous processors. To support the AVX2 extensions in the Haswell processor, the bus width is increased to 256 bits, which effectively doubles the L1 cache bandwidth for AVX types of entries, while for integer it remains the same as in the Sandy Bridge family. Utilizing both AVX2 and FMA, the 4th generation Intel Core processor delivers a peak of 16 double-precision flops per cycle – twice as much as the Ivy Bridge and Sandy Bridge processors and four times more than Gulftown and Lynnfield. The AVX2 instruction set also supports integer operations on 256-bit vectors.
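The quoted peak rate can be checked with simple arithmetic on the figures above (our reasoning, not an equation from the paper):

\[
\underbrace{2}_{\text{FMA units}} \times \underbrace{256/64}_{\text{DP lanes per vector}} \times \underbrace{2}_{\text{flops per FMA}} = 16\ \text{DP flops/cycle},
\]

versus \(2 \times 4 \times 1 = 8\) on Sandy Bridge and Ivy Bridge, whose separate 256-bit multiply and add ports each retire one flop per lane per cycle.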
Since the primary design philosophy of the 4th generation Intel Core processor was to increase the performance of a single core, a third ALU, a second store-address generator and a second branch unit have been added. In addition, two extra ports were added to the unified reservation station, so the 4th generation Intel Core processor can issue and execute eight micro-ops per cycle, which is 33 % more than Ivy Bridge, although only four micro-ops per cycle can be decoded and retired [1]. If a code has many branches, the extra branch unit on one of the added ports will definitely help; it also reduces potential conflicts on one of the existing ports. Similarly, the new store address generation unit on the second added port offloads address generation: the 4th generation Intel Core processor can now generate two load addresses and one store address in the same cycle [2]. The 4th generation Intel Core processor also implements an extra multiplier, so two ports can perform two multiplications simultaneously. In addition, the integer unit reduces the load on two ports, liberating them for supplementary vector operations. The unified L2 TLB has been doubled in size to 1024 entries and now supports 4K and 2M pages. The 4th generation Intel Core processor also improves some technology that is less relevant for the client market but will attract the server segment, e.g. virtualization, crypto algorithms and highly threaded workloads. The fast context switching implementation improves multiple virtualization sessions and makes them more elastic. Crypto algorithms also benefit from new instructions; a good example is the PCLMULQDQ instruction, which performs carry-less multiplication of two 64-bit operands, can be used for CRC computation and block cipher encryption, and runs over three times faster on Haswell than on Ivy Bridge.

The 4th generation Intel Core processor has a number of other system enhancements, such as support for up to three monitors, dynamic (no reboot required) overclocking control and an integrated voltage regulator. For HPC the most important enhancements are FMA and AVX2. Of course, the Intel Xeon E5-2600v3 series has many more enhancements and new technologies, like DDR4 memory support, a bigger L3 cache or Intel Turbo Boost Technology 2.0, and many others that provide significant improvement for HPC types of code but are outside the scope of this paper. This study does not focus on them; it concentrates on single core performance and innovations, and on the implications they may have for HPC types of workloads.

4 DESCRIPTION OF A PROBLEM

The main concern of this paper is how to effectively utilize AVX2 for HPC types of code and to expose situations where AVX2 might not be the most effective way to increase performance. As mentioned before, linear algebra operations benefit from AVX2 instructions, so a deep analysis of the throughput and latency of multiplication and addition operations is essential. As the multiplication and addition operations are critical for scalar and matrix computations, it is very interesting to compare and analyse previous generations of Intel Instruction Set Architectures (ISA) for these operations. The presented performance review allows us to demonstrate the improvement brought by AVX2. Through an extensive analytic approach, the performance of the Intel AVX2 instruction extension in an HPC scenario is discussed, utilizing Intel software tools including Intel MKL [3, 4, 5].

The general innovation of Intel AVX2 is the promotion of the majority of the 128-bit integer SIMD instruction set to operate on the 256-bit wide YMM registers. AVX2 instructions are encoded using the VEX prefix and require the same operating system support as AVX. Most of the promoted 256-bit vector integer instructions follow the 128-bit lane operation, similar to the promoted 256-bit floating-point SIMD instructions in AVX. The novel functionalities of AVX2 fall into the following categories (an illustrative sketch of two of these capabilities follows Table 1):

• integer data types expanded to 256-bit SIMD,
• bit manipulation instructions,
• gather instructions,
• vector to vector shifts,
• Floating Point Multiply Accumulate (FMA).

The AVX2 instructions operate on scalars, 128-bit packed single and double precision data types, and 256-bit packed single and double precision data types. The doubled flops throughput compared to previous generations is complemented by doubled cache bandwidth to feed the FMA execution units. Table 1 shows the theoretical performance and bandwidth of four generations of Intel-based CPUs.

Intel Microarchitecture   Peak Flop/Clock   Peak Cache Bytes/Clock
Banias                           4                  8
Merom                            8                 16
Sandy Bridge                    16                 32
Haswell                         32                 64

Table 1. Performance of Intel Microarchitectures
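To make the categories above concrete, the following minimal C sketch (our illustration, not code from the paper; it uses the standard immintrin.h intrinsic names) exercises two of the new AVX2 capabilities: 256-bit integer SIMD and gather loads.

    #include <immintrin.h>
    #include <stdio.h>

    int main(void) {
        /* 256-bit integer SIMD: eight 32-bit additions in one instruction
           (VPADDD on YMM registers, promoted from its 128-bit SSE form). */
        __m256i x = _mm256_set1_epi32(40);
        __m256i y = _mm256_set1_epi32(2);
        __m256i s = _mm256_add_epi32(x, y);

        /* Gather: load eight floats from non-contiguous indices in one
           instruction (VGATHERDPS), new in AVX2. */
        float table[64];
        for (int i = 0; i < 64; i++) table[i] = (float)i;
        __m256i idx = _mm256_setr_epi32(0, 9, 18, 27, 36, 45, 54, 63);
        __m256  g   = _mm256_i32gather_ps(table, idx, 4); /* scale = 4 bytes */

        float out[8]; int outi[8];
        _mm256_storeu_ps(out, g);
        _mm256_storeu_si256((__m256i *)outi, s);
        printf("%d %g %g\n", outi[0], (double)out[0], (double)out[7]);
        return 0;                      /* build: cc -O2 -mavx2 avx2.c */
    }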

5 RESULTS

With performance in mind it is important to take into account that transitions between AVX/AVX2 and its predecessor, the legacy Intel Streaming SIMD Extensions (SSEx), may result in latency penalties. Because the two instruction sets operate on different portions of the YMM registers, during a transition the hardware has to first save and then restore the upper contents of the YMM registers, which penalizes each transition with several tens of additional cycles.

There are ways to avoid the penalties related to transitions between AVX/AVX2 and SSEx. One of them is to zero the upper 128 bits of the YMM registers with the VZEROUPPER instruction. The compiler inserts the VZEROUPPER instruction at the beginning and the end of any function containing AVX/AVX2 code, which prevents the transitions from happening when such functions are called from routines containing SSEx instructions. Another method to avoid the transition penalty is to have the legacy Intel SSEx code converted to 128-bit AVX/AVX2 instructions. This can be done with the Intel compiler option -xCORE-AVX2 (/QxCORE-AVX2 on Windows), which auto-generates 128-bit AVX/AVX2 code from legacy SSEx code and adds the VZEROUPPER instruction.
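For hand-written kernels, the same protection can be applied manually with the _mm256_zeroupper() intrinsic, which emits VZEROUPPER. A hedged C sketch follows, where legacy_sse_routine stands for hypothetical external SSEx-only code (not a function from the paper):

    #include <immintrin.h>

    /* Hypothetical externally compiled SSEx-only routine. */
    void legacy_sse_routine(float *dst, const float *src, int n);

    void avx_then_sse(float *dst, const float *src, float *tmp, int n) {
        /* 256-bit AVX work dirties the upper halves of the YMM registers. */
        for (int i = 0; i + 8 <= n; i += 8) {
            __m256 v = _mm256_loadu_ps(src + i);
            _mm256_storeu_ps(tmp + i, _mm256_add_ps(v, v));
        }
        /* Zero the upper 128 bits before entering SSE code, so the
           hardware does not save/restore YMM state (avoiding the
           multi-ten-cycle transition penalty described above). */
        _mm256_zeroupper();
        legacy_sse_routine(dst, tmp, n);
    }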

5.1 Importance of VZEROUPPER

It is important to handle registers and conversions related to transitions between AVX/AVX2 and SSEx. A good example is Algorithm 1, which demonstrates the performance implications of basic atomic instructions and the consequences of their appropriate usage. The example uses 5 analogous loops (only two of them are fully displayed; loop1, loop2 and loop3 are constructed by the same analogy as loop and loop4), where each of them contains 8 interchangeable instructions such as vaddpd, vmulpd or vfmadd132pd.

    loop:
        vaddpd ymm0, ymm0, ymm15
        vaddpd ymm1, ymm1, ymm15
        vaddpd ymm2, ymm2, ymm15
        vaddpd ymm3, ymm3, ymm15
        vaddpd ymm4, ymm4, ymm15
        vaddpd ymm5, ymm5, ymm15
        vaddpd ymm6, ymm6, ymm15
        vaddpd ymm7, ymm7, ymm15
        add rcx, 1
        cmp rcx, repeat
        jb loop

    loop1:
        vmulpd ymm0, ymm0, ymm15
        ...
        add rcx, 1
        cmp rcx, repeat
        jb loop1

    loop2:
        vmulpd ymm0, ymm15, ymm15
        ...
        add rcx, 1
        cmp rcx, repeat
        jb loop2

    loop3:
        vfmadd132pd ymm0, ymm15, ymm15
        ...
        add rcx, 1
        cmp rcx, repeat
        jb loop3
    ...
    loop4:
        vfmadd132pd ymm0, ymm15, ymm15
        vfmadd132pd ymm1, ymm15, ymm15
        vfmadd132pd ymm2, ymm15, ymm15
        vfmadd132pd ymm3, ymm15, ymm15
        vfmadd132pd ymm4, ymm15, ymm15
        vfmadd132pd ymm5, ymm15, ymm15
        vfmadd132pd ymm6, ymm15, ymm15
        vfmadd132pd ymm7, ymm15, ymm15
        vzeroupper
        movaps xmm15, xmm0
        add rcx, 1
        cmp rcx, repeat
        jb loop4

Algorithm 1. Five independent loops operating on vaddpd, vmulpd (running with no dependency and with dependency) and vfmadd132pd (running with and without VZEROUPPER)

Instructions                                        Cycles per Iteration
ADD                                                   8.02
MUL                                                   4.01
MUL (add dependency)                                  5.01
FMA (VZEROUPPER)                                      5.04
FMA (SSE/AVX – without proper register handling)    160.78

Table 2. Number of cycles per iteration running Algorithm 1

By running each of the loops in Algorithm 1 independently, it can be seen that ADDs are twice as slow as MULs if there is no register dependency. FMA is a bit slower than MUL but almost twice as fast as ADD. When dependencies are inserted into MUL (i.e. values are accumulated), the time goes up to almost match the FMA time. The experiment also demonstrates how important VZEROUPPER is in this code: using it appropriately makes all the difference, yielding roughly a 30× improvement versus the code without a proper register handling mechanism. Table 2 shows the average execution time per iteration for the different atomic instruction types.
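The ADD entry in Table 2 can be approximated without an assembler; a minimal C sketch (our reconstruction, not the paper's harness) of the independent-ADD loop, timed with the __rdtsc() counter, is given below. Note that __rdtsc() counts reference cycles, so the frequency should be pinned, or the result scaled, for an exact comparison.

    #include <immintrin.h>
    #include <x86intrin.h>   /* __rdtsc() */
    #include <stdint.h>
    #include <stdio.h>

    #define REPEAT 100000000ULL

    int main(void) {
        __m256d c  = _mm256_set1_pd(1e-9);   /* plays the role of ymm15 */
        __m256d y0 = _mm256_setzero_pd(), y1 = y0, y2 = y0, y3 = y0,
                y4 = y0, y5 = y0, y6 = y0, y7 = y0;

        uint64_t t0 = __rdtsc();
        for (uint64_t r = 0; r < REPEAT; r++) {
            /* Eight independent vaddpd chains, as in "loop" of Algorithm 1. */
            y0 = _mm256_add_pd(y0, c);  y1 = _mm256_add_pd(y1, c);
            y2 = _mm256_add_pd(y2, c);  y3 = _mm256_add_pd(y3, c);
            y4 = _mm256_add_pd(y4, c);  y5 = _mm256_add_pd(y5, c);
            y6 = _mm256_add_pd(y6, c);  y7 = _mm256_add_pd(y7, c);
        }
        uint64_t t1 = __rdtsc();

        /* Keep the accumulators live so the loop is not optimized away. */
        __m256d s = _mm256_add_pd(_mm256_add_pd(y0, y1), _mm256_add_pd(y2, y3));
        s = _mm256_add_pd(s, _mm256_add_pd(_mm256_add_pd(y4, y5),
                                           _mm256_add_pd(y6, y7)));
        double buf[4];
        _mm256_storeu_pd(buf, s);
        /* On Haswell, expect roughly 8 cycles per iteration (cf. Table 2). */
        printf("cycles/iteration: %.2f (checksum %g)\n",
               (double)(t1 - t0) / (double)REPEAT, buf[0]);
        return 0;                       /* build: cc -O2 -mavx add.c */
    }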

5.2 Linpack

Certainly FMA brings the most impressive performance improvement and accelerates computations significantly. To verify this further, the Linpack benchmark has been adopted, with the results summarized in Table 3. The SMP version of Linpack has been executed for a problem size of 45 000, compiled against different instruction sets: SSE 4.2, AVX and AVX2, utilizing Intel MKL version 11.2 update 2. The switching between SSE 4.2, AVX and AVX2 has been done using the Intel MKL library dispatch (MKL_DEBUG_CPU_TYPE) [6, 7]. For this experiment the Intel Server Wildcat Pass S2600WT platform, based on the Intel Xeon E5-2697v3 processor, has been selected as the validation vehicle.

Instruction Set Architecture   GFlop/s   Improvement AVXx/SSE 4.2
SSE 4.2                        297.16    1
AVX                            500.56    1.68
AVX2                           906.01    3.05

Table 3. SMP Linpack performance

You can clearly see that the AVX2 version utilizing FMA is 81 % faster than the version operating on the AVX instruction set, and 3 times more powerful than the code relying on the SSE 4.2 instruction set architecture only. The improvement of AVX over SSE 4.2 is smaller than that of AVX2 over AVX because of a frequency scaling issue: when AVX or AVX2 instructions are executed, the processor may run at less than the rated frequency of the CPU, see Section 5.5 [8, 9, 10].
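For completeness, a minimal sketch of how such a dispatch experiment can be driven from C (our illustration, not the paper's harness; MKL_DEBUG_CPU_TYPE is an undocumented MKL knob, and the code-path values in the comment are the ones commonly reported for this MKL generation, so treat them as assumptions):

    #include <stdio.h>
    #include <stdlib.h>
    #include <mkl.h>

    int main(void) {
        /* Commonly reported values (undocumented): "4" forces the AVX
           path, "5" the AVX2 path; unset leaves automatic dispatch in
           control. Must be set before the first MKL call. */
        setenv("MKL_DEBUG_CPU_TYPE", "4", 1);

        const MKL_INT n = 2048;
        double *A = (double *)mkl_malloc(n * n * sizeof(double), 64);
        double *B = (double *)mkl_malloc(n * n * sizeof(double), 64);
        double *C = (double *)mkl_malloc(n * n * sizeof(double), 64);
        for (MKL_INT i = 0; i < n * n; i++) { A[i] = 1.0; B[i] = 2.0; C[i] = 0.0; }

        double t0 = dsecnd();            /* MKL timing service function */
        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    n, n, n, 1.0, A, n, B, n, 0.0, C, n);
        double t1 = dsecnd();

        double gflops = 2.0 * n * n * n / (t1 - t0) / 1e9;
        printf("DGEMM n=%d: %.1f GFlop/s\n", (int)n, gflops);
        mkl_free(A); mkl_free(B); mkl_free(C);
        return 0;
    }

Running the same binary with the environment variable set to the different code-path values reproduces the kind of SSE/AVX/AVX2 comparison shown in Table 3.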

5.3 Inefficiency of FMA

FMA does not guarantee a perfect solution for every multiplication and addition scenario. For instance, the calculation of the formula x = a × b + c × d using FMA is presented in Algorithm 2.

    x = VMULPS(a, b);
    x = VFMADD213PS(c, d, x);

Algorithm 2. Calculating x = a × b + c × d using FMA

These two operations are performed in 10 cycles, while the same calculation could be performed with separate multiply and addition instructions, as Algorithm 3 shows.

    x = VMULPS(a, b);
    y = VMULPS(c, d);
    x = VADDPS(x, y);

Algorithm 3. Calculating x = a × b + c × d using multiply and addition instructions

These require only 8 cycles, due to the parallel execution of the two multiplications x = VMULPS(a, b) and y = VMULPS(c, d), performed simultaneously on different ports with a latency of 5 cycles, plus an additional 3 cycles for x = VADDPS(x, y). To illustrate this scenario we ran each of these small kernels 100 000 times; over all iterations the non-FMA kernel required 10 % fewer cycles than the version utilizing the FMA instructions.

The latency gap is even larger for a formula like x = a × b + c × d + e × f + g × h. The FMA implementation with a dependency chain through the accumulator requires 20 cycles. This implementation is presented in Algorithm 4.

    x = VMULPS(a, b);
    x = VFMADD213PS(c, d, x);
    x = VFMADD213PS(e, f, x);
    x = VFMADD213PS(g, h, x);

Algorithm 4. Calculating x = a × b + c × d + e × f + g × h using FMA

An alternative version using parallel accumulators, which breaks the FMA dependency chain, is illustrated in Algorithm 5. This implementation costs only 13 cycles, as x = VMULPS(a, b) and y = VMULPS(e, f) complete concurrently within 5 cycles, as do x = VFMADD213PS(c, d, x) and y = VFMADD213PS(g, h, y). So within 10 cycles 4 instructions are completed, utilizing the two FMA-ready ports. The last operation, x = VADDPS(x, y), executes in 3 cycles.

    x = VMULPS(a, b);
    y = VMULPS(e, f);
    x = VFMADD213PS(c, d, x);
    y = VFMADD213PS(g, h, y);
    x = VADDPS(x, y);

Algorithm 5. Calculating x = a × b + c × d + e × f + g × h with parallel accumulators

For this bigger scenario presented in Algorithms 4 and 5, the 100 000-iteration test confirms that the parallel-accumulator kernel uses 28 % fewer cycles, and clearly demonstrates that a pure FMA chain is not always the best solution for every ADD and MUL coding situation.

As already discussed, one benefit of the Haswell microarchitecture is the introduction of the FMA instruction in the AVX2 instruction set architecture. The Haswell microarchitecture is able to handle two FMA instructions at the same time (on two dedicated execution ports) with a latency of five cycles and a throughput of one instruction per clock cycle per execution port (two FMA instructions total per cycle). Intel also designed the Haswell microarchitecture to handle two MUL operations in the same cycle (with a latency of 5 cycles, on two ports), but with 'only' one ADD operation per cycle on a single port (with a latency of 3 cycles). So in theory, a floating point numerical code which uses more addition than multiply operations cannot take advantage of the Haswell microarchitecture compared to the prior generation (Ivy Bridge).

The way to benefit from the two FMA execution ports in code that uses many addition instructions rests on the fact that a + b = a × 1.0 + b. Compared to the standard ADD instruction, the numerical result is expected to be the same, but the latency of the two instructions differs, increasing from 3 to 5 cycles; it is the second execution port that counts. The latency of the FMA instruction is higher, but because two FMA instructions can be executed simultaneously, two a × 1.0 + b operations execute on both ports in 5 cycles, whereas two a + b operations on a single port require 6 cycles.

To demonstrate this, a very simple kernel performing a sum reduction of an array of single precision floating-point numbers will be considered. The array is correctly aligned in memory, and the number of elements has been chosen to fit the L1 data cache of a processor core (32 Kbytes).
Pseudo C-code of the kernel is presented in Algorithm 6:

    T0 = rdtsc();
    for (j = 0; j < repeat; j++) {
        for (i = 0; i < N; i++)
            sum += a[i];
    }
    T1 = rdtsc();

Algorithm 6. Simple sum reduction kernel
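A rough performance ceiling for this kernel follows from the execution port counts discussed above (our estimate, not a figure from the paper). Each 256-bit operation consumes one 32-byte load, so

\[
\text{ADD version: } 1\ \text{port} \times 32\ \text{B} = 32\ \text{B/cycle}, \qquad
\text{FMA version: } 2\ \text{ports} \times 32\ \text{B} = 64\ \text{B/cycle}.
\]

These bounds anticipate the plateaus of about 31 bytes/cycle (ADD) and 45–46 bytes/cycle (FMA) measured in the next subsection (Tables 4 and 5).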

5.4 Influence of the Unrolling Factor

To measure the influence of the unrolling factor, the loop has been re-written using C compiler intrinsics and tested on the Haswell and Ivy Bridge architectures. Algorithm 7 shows pseudocode for an unrolling factor of eight.

    T0 = rdtsc();
    for (j = 0; j < repeat; j++) {
        for (i = 0; i < N; i += 64) {
            y0 = _mm256_add_ps(y0, _mm256_load_ps(a+i));
            y1 = _mm256_add_ps(y1, _mm256_load_ps(a+i+8));
            y2 = _mm256_add_ps(y2, _mm256_load_ps(a+i+16));
            y3 = _mm256_add_ps(y3, _mm256_load_ps(a+i+24));
            y4 = _mm256_add_ps(y4, _mm256_load_ps(a+i+32));
            y5 = _mm256_add_ps(y5, _mm256_load_ps(a+i+40));
            y6 = _mm256_add_ps(y6, _mm256_load_ps(a+i+48));
            y7 = _mm256_add_ps(y7, _mm256_load_ps(a+i+56));
        }
    }
    T1 = rdtsc();

Algorithm 7. Sum reduction kernel unrolled with a factor of eight

Unrolling Factor   Ivy Bridge (Bytes/Cycle)   Haswell (Bytes/Cycle)
 2                 21.2                       21.2
 4                 30.7                       31.3
 8                 30.6                       31.5
10                 30.0                       30.5
16                 30.2                       30.6

Table 4. Comparing Haswell and Ivy Bridge bytes/cycle

Algorithm 8 contains the code of Algorithm 7 modified by replacing the ADD instruction y[·] = y[·] + a[·] with the FMA instruction sequence y[·] = 1.0 × y[·] + a[·]. With optimal unrolling, a pair of FMA instructions can be scheduled in the same cycle, theoretically doubling the performance compared to the usual ADD kernel implementation.

    __m256 one = _mm256_set_ps(1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f);
    T0 = rdtsc();
    for (j = 0; j < repeat; j++) {
        for (i = 0; i < N; i += 64) {
            y0 = _mm256_fmadd_ps(one, y0, _mm256_load_ps(a+i));
            y1 = _mm256_fmadd_ps(one, y1, _mm256_load_ps(a+i+8));
            y2 = _mm256_fmadd_ps(one, y2, _mm256_load_ps(a+i+16));
            y3 = _mm256_fmadd_ps(one, y3, _mm256_load_ps(a+i+24));
            y4 = _mm256_fmadd_ps(one, y4, _mm256_load_ps(a+i+32));
            y5 = _mm256_fmadd_ps(one, y5, _mm256_load_ps(a+i+40));
            y6 = _mm256_fmadd_ps(one, y6, _mm256_load_ps(a+i+48));
            y7 = _mm256_fmadd_ps(one, y7, _mm256_load_ps(a+i+56));
        }
    }
    T1 = rdtsc();
    // Final sum reduction
    y1 = _mm256_hadd_ps(
             _mm256_add_ps(_mm256_add_ps(y0, y1), _mm256_add_ps(y6, y7)),
             _mm256_add_ps(_mm256_add_ps(y2, y3), _mm256_add_ps(y4, y5)));
    f0 = ((float *)&y1)[0] + ((float *)&y1)[1] +
         ((float *)&y1)[2] + ((float *)&y1)[3] +
         ((float *)&y1)[4] + ((float *)&y1)[5] +
         ((float *)&y1)[6] + ((float *)&y1)[7];

Algorithm 8. Modified version of Algorithm 7, replacing the ADD instruction by the FMA instruction sequence

After modification, the code is compiled and executed in exactly the same way as the previous kernel shown in Algorithm 7. Results are presented in Table 5.

Unrolling Factor   ADD (Bytes/Cycle)   FMA (Bytes/Cycle)   Improvement by Replacing ADD by FMA
 1                 11.1                 6.7                −65 %
 2                 21.2                13.1                −62 %
 4                 31.3                25.6                −22 %
 8                 31.5                45.2                 43 %
10                 30.5                45.8                 50 %
16                 30.6                43.9                 43 %

Table 5. Comparing Haswell bytes/cycle ratio replacing ADD by FMA

The performance improved by up to 50 % compared to the usual ADD algorithm. Because the latency of the FMA instruction is 5 cycles (versus the 3-cycle latency of the ADD instruction), a higher level of unrolling is necessary. On average, 2 or 3 extra cycles are observed per iteration, explaining the difference between the theoretical expectation and the measurements. These extra cycles can be explained by higher pressure on cache access in AVX2 (higher computational or numerical intensity). The result with an unrolling factor of 16 is significantly lower than expected because of register spilling: the 16 YMM registers are not sufficient to hide all instruction and access latencies.
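The unrolling level needed to saturate the FMA ports can be estimated with the usual latency × throughput rule (our back-of-the-envelope reasoning, consistent with Table 5):

\[
\text{independent accumulators} \;\ge\; \text{latency} \times \text{throughput} \;=\; 5 \times 2 \;=\; 10,
\]

which matches the measured optimum at an unrolling factor of 10 (45.8 bytes/cycle) and the drop at 16, where the 16 architectural YMM registers are exhausted and spills appear.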

5.5 Benefits of Using AVX2 in Professional Libraries Implementations

The next example of effective use of AVX/AVX2 comes from the DSP field. The company NA Software distributes Digital Signal Processing (DSP) libraries used within industries requiring very fast processing, such as the aerospace and defence sectors. As part of this study, the very popular AVX Vector Signal Processing Library

(VSIPL) has been tested on Haswell and Ivy Bridge platforms. In order to benchmark DSP processes, various libraries including NAS, IPP (Intel Integrated Performance Primitives) and MKL (Math Kernel Library) DSP have been utilised as part of VSIPL [11].

The majority of the benchmarks, including 2D in-place Complex-to-Complex FFT, Multiple 1D Complex-to-Complex FFTs, Complex Vector Multiply and Vector Scatter, show that the tested libraries execute more successfully on the Haswell platform. In some exceptions, the benchmarks including 1D in-place Complex-to-Complex FFT and Complex Transpose proved that NAS is better optimised for complex vector operations. The same holds when the data is not reduced during the sine/cosine operations (the Vector Sine and Vector Cosine benchmarks).

The Complex Vector Multiply performed using the VSIPL library depicts very clearly the difference that the Advanced Vector Extensions make to the benchmarking process, as Figure 1 illustrates. For both the NAS AVX (Ivy Bridge) and AVX2 (Haswell) libraries, the results are better across all data lengths than the IPP and MKL implementations without AVX/AVX2 optimisation. With that in mind, the peak performance of AVX2 on Haswell was better by over 50 % than AVX on Ivy Bridge.

Figure 1. Complex Vector Multiply: NAS vsip_cvmul_f with complex split data

Similarly, in both Vector Gather and Vector Scatter the AVX NAS VSIP DSP library performs significantly better on Haswell compared to Ivy Bridge at almost all data lengths. Haswell gains an advantage during Vector Gather after a vector length of 4 K with NAS, whereas in Vector Scatter it maintains a constant and reasonably predictable lead. All three benchmarks of complex-to-complex FFT operations also show that the Haswell systems, with the advantage of AVX2, are notably better than Ivy Bridge with AVX support only. In the cases where Haswell with AVX2 did not perform as well as Ivy Bridge (AVX), the advantage is usually regained at longer data lengths [12].

SiSoftware Sandra 2014 has released a study of cryptohash acceleration through SHA extensions which shows the impact of the new instruction sets (including AVX and AVX2) available on modern CPUs. In essence, the authors tested several types of extensions and compared how quickly data can be processed, in MB/s. The Advanced Vector Extensions appear not to be available on low-end/low-power CPUs [13]. Figure 2 shows an almost doubled performance across all three benchmarks. The major improvement of AVX2 is due to doubling the number of buffers, as it operates on 256-bit integers as opposed to the 128-bit integers of AVX [14]. Very similar improvements were observed during the evaluation of various small synthetic benchmarks. For example, MANDELBROT, a small test developed by the IT4Innovations National Supercomputing Centre, VSB-Technical University of Ostrava, Czech Republic, reported the same level of benefit delivered by AVX2 versus the AVX-only version. A second test, MULTIDIS, a chemistry-oriented solver (also developed by IT4Innovations) computing trajectories of Argon gas, shows a 134 % performance improvement when running on Haswell versus an AVX-only enabled system [15, 16, 17].

Unfortunately, usage of AVX2 has some negative effect on frequency and Turbo scaling. As outlined, the addition of AVX2 instructions provides a drastic performance improvement compared to non-AVX workloads, but when executing AVX2/AVX instructions the processor may run at less than the rated frequency. Of course, global performance when using AVX2/AVX instructions is significantly greater than with non-AVX instructions, even when the processor operates at a slightly lower frequency. At the same time, Intel Turbo Boost Technology continues to provide opportunistic frequency increases based on workload, number of active cores, temperature, power and current; the amount of turbo frequency achieved depends on all of these elements. Due to this workload dependency, separate AVX base and turbo frequencies have been defined for the Haswell product family, and a new AVX disclaimer explains that when AVX/AVX2 instructions are executed, the processor may run at less than the rated frequency.

6 CONCLUSION

The presented study has provided direct evidence of the doubling of the theoretical floating-point capacity with the new AVX2 capability to increase HPC code performance, e.g. for the BLAS DGEMM function or Linpack. For these subroutines and benchmarks, performance close to double that of SSE 4.2 code is observed. Also for bigger codes, where many matrix and vector operations are used, the new AVX2 extension brings a lot of value, although frequency and Turbo behaviour may interfere with ideal scaling. Nevertheless, it is very important to note that to achieve this level of performance increase the use of MKL is essential, as it simplifies the process of optimization. However, not everything is fully automatic, and even a powerful instrument such as the FMA instructions is not always the best option to use.

Figure 2. SHA performance utilizing AVX and AVX2 (multi-buffer SHA1, SHA2-256 and SHA2-512 throughput)

The best recommended practice is to always look at the nature of the problem and seek the optimal instrumentation. As the instruction set architecture evolves and brings new extensions, e.g. Intel AVX-512, future studies will concentrate on the evolution of these new extensions and the benefits they may bring to the HPC field in particular.

Acknowledgments

Special acknowledgments for their contribution and support go to Victor Gamayunov, Wieslawa Litke and Hudson Pyke from the Intel EMEA Technical Group, Jamie Wilcox from the Intel EMEA Technical Marketing HPC Lab, and Ludovic Sauge, Cyril Mazauric, Johann Peyrard and Sylvain Cohard from the Bull Extreme Computing Applications & Performance Team.

REFERENCES

[1] Optimize for Intel® AVX Using Intel® Math Kernel Library's Basic Linear Algebra Subprograms (BLAS) with DGEMM Routine. The Intel Developer Zone web site: https://software.intel.com/en-us/articles/optimize-for-intel-avx-using-intel-math-kernel-librarys-basic-linear-algebra-subprograms-blas-with-dgemm-routine.

[2] Intel® Architecture Instruction Set Extensions Programming Reference. Intel Corp., 2015.

[3] Intel® AVX2 Optimization in Intel® MKL. Web site: http://software.intel.com/en-us/articles/intel-mkl-support-for-intel-avx2.

[4] Demmel, J.—Dongarra, J.—Eijkhout, V.—Fuentes, E.—Petitet, A.—Vuduc, R.—Whaley, R.—Yelick, K.: Self-Adapting Linear Algebra Algorithms and Software. Proceedings of the IEEE, Vol. 93, 2005, pp. 293–312, doi: 10.1109/JPROC.2004.840848.

[5] Dongarra, J.—Du Croz, J.—Hammarling, S.—Duff, I. S.: A Set of Level 3 Basic Linear Algebra Subprograms. ACM Transactions on Mathematical Software (TOMS), Vol. 16, 1990, pp. 1–17, doi: 10.1145/77626.79170.

[6] Henry, G.: Optimize for Intel AVX Using Intel Math Kernel Library's Basic Linear Algebra Subprograms (BLAS) with DGEMM Routine. Intel Software Network, July 2009.

[7] Jeffers, J.—Reinders, J.: Intel Xeon Phi Coprocessor High Performance Programming. Morgan Kaufmann, 2013.

[8] Gepner, P.—Gamayunov, V.—Fraser, D. L.: Early Performance Evaluation of AVX for HPC. Procedia Computer Science, Vol. 4, 2011, pp. 452–460, doi: 10.1016/j.procs.2011.04.047.

[9] Gepner, P.—Gamayunov, V.—Fraser, D. L.: Evaluation of Executing DGEMM Algorithms on Modern Multicore CPU. Proceedings of the Parallel and Distributed Computing and Systems 2011 Conference, Dallas, December 2011, doi: 10.2316/P.2011.757-029.

[10] Goto, K.—van de Geijn, R. A.: Anatomy of High-Performance Matrix Multiplication. ACM Transactions on Mathematical Software (TOMS), Vol. 34, 2008, No. 3, Art. No. 12.

[11] Intel AVX VSIPL Benchmarks. http://www.nasoftware.co.uk/home/attachments/NAS_vsipl_benchmarks.pdf.

[12] Kopta, P.—Kulczewski, M.—Kurowski, K.—Piontek, T.—Gepner, P.—Puchalski, M.—Komasa, J.: Parallel Application Benchmarks and Performance Evaluation of the Intel Xeon 7500 Family Processors. Procedia Computer Science, Vol. 4, 2011, pp. 372–381.

[13] Crypto Hash Acceleration Through SHA Extensions (HWA). http://www.sisoftware.co.uk/?d=qa&f=cpu_sha_hw.

[14] Strassen, V.: Gaussian Elimination Is Not Optimal. Numerische Mathematik, Vol. 13, 1969, No. 4, pp. 354–356, doi: 10.1007/BF02165411.

[15] Krotkiewski, M.—Dabrowski, M.: Parallel Symmetric Sparse Matrix-Vector Product on Scalar Multi-Core CPUs. Parallel Computing, Vol. 36, 2010, No. 4, pp. 181–198, doi: 10.1016/j.parco.2010.02.003.

[16] Bell, N.—Garland, M.: Implementing Sparse Matrix-Vector Multiplication on Throughput-Oriented Processors. Proceedings of High Performance Computing Networking, Storage and Analysis (SC '09), ACM, 2009, Art. No. 18.

[17] Gepner, P.—Gamayunov, V.—Fraser, D. L.—Houdard, E.—Sauge, L.—Declat, D.—Dubois, M.: Evaluation of DGEMM Implementation on Intel Xeon Phi Coprocessor. ICCSIT 2013, Journal of Computers, Vol. 9, 2014, No. 7, pp. 1566–1571, doi: 10.4304/jcp.9.7.1566-1571.

Pawel Gepner is an Intel Corporation Platform Architect focused on high performance computing in the DCG FAE Team EMEA. He joined Intel in 1996 as a Field Application Engineer for Central and Eastern Europe. In 2001 he became an EMEA Architect focused on the HPC area. Currently he is responsible for the development and design of future server platforms and HPC systems at one of Intel's leading EMEA partners. He is an Intel Corporation spokesperson responsible for communication with the press regarding technical and technology aspects of Intel's products. He has led several server development projects, including the first fault-tolerant systems based on IA-32 from Stratus Technology. He was responsible for driving the III server project at the IBM Development Center in Greenock. He also led the team of Intel architects that developed Bull's Itanium 2 system, and was involved in Itanium 2 projects at Siemens AG and Ericsson. He led the development team for the first teraflop computing projects in EMEA and the first Itanium 2 teraflop installations. He has driven many HPC projects, including TASK, SKODA, VW, CERN and many others. He graduated in computer science and holds Master's and Ph.D. degrees from the Warsaw University of Technology, Poland, and a habilitation from the Czestochowa University of Technology. He has written 50 technical papers on computer science and technology. He is also a board member and technology advisor for many international scientific and commercial HPC projects.