Compiler-Based Data Prefetching and Streaming Non-Temporal Store Generation for the Intel R Xeon Phitm Coprocessor

Compiler-based Data Prefetching and Streaming Non-temporal Store Generation for the Intel R Xeon PhiTM Coprocessor Rakesh Krishnaiyer†, Emre Kültürsay†‡, Pankaj Chawla†, Serguei Preis†, Anatoly Zvezdin†, and Hideki Saito† † Intel Corporation ‡ The Pennsylvania State University TM Abstract—The Intel R Xeon Phi coprocessor has software used, and generating them. In this paper, we: TM prefetching instructions to hide memory latencies and spe- • Present how the Intel R Xeon Phi coprocessor soft- cial store instructions to save bandwidth on streaming non- ware prefetch and non-temporal streaming store instructions temporal store operations. In this work, we provide details on R compiler-based generation of these instructions and evaluate are generated by the Intel Composer XE 2013, TM their impact on the performance of the Intel R Xeon Phi • Evaluate the impact of these mechanisms on the overall coprocessor using a wide range of parallel applications with performance of the coprocessor using a variety of parallel different characteristics. Our results show that the Intel R applications with different characteristics. Composer XE 2013 compiler can make effective use of these Our experimental results demonstrate that (i) a large mechanisms to achieve significant performance improvements. number of applications benefit significantly from software prefetching instructions (on top of hardware prefetching) that I. INTRODUCTION are generated automatically by the compiler for the Intel R TM TM The Intel R Xeon Phi coprocessor based on Intel R Xeon Phi coprocessor, (ii) some benchmarks can further Many Integrated Core Architecture (Intel R MIC Architec- improve when compiler options that control prefetching ture) is a many-core processor with long vector (SIMD) units behavior are used (e.g., to enable indirect prefetching), targeted for highly parallel workloads in the High Perfor- and (iii) many applications benefit from compiler generated mance Computing (HPC) segment. It is designed to provide streaming non-temporal store instructions. immense throughput: a single chip can deliver a peak per- The remainder of this paper is organized as follows. formance of well above one double precision TeraFLOPS. In Section 2, we provide the details of the Intel R Xeon TM However, achieving performance close to this peak value Phi coprocessor based on the Intel R MIC architecture. requires the cores and the vector units in the coprocessor In Sections 3 and 4, we present the software/hardware to be kept busy at all times, which in turn requires data to data prefetching mechanisms and streaming non-temporal be delivered at a very high rate without stalls. Providing store instructions in this architecture together with how they sustained low memory latency is more challenging with the are generated automatically by our compiler to improve R R R Intel MIC architecture than the Intel Xeon processors. application performance. Section 5 gives the details of our This is mainly because (i) as opposed to traditional large, experimental framework and presents an in-depth analysis of power-hungry, out-of-order cores which are known to be the impact of data prefetching and streaming non-temporal good at hiding the latency of cache/memory accesses by stores on our target benchmarks. Section 6 discusses related exploiting instruction level parallelism, the coprocessor uses work and we conclude the paper with Section 7. small, power efficient, in-order cores, (ii) the last-level cache capacity per core (and per thread) is smaller on the Intel R II. BACKGROUND TM TM Xeon Phi coprocessor than the latest Intel R Xeon R pro- R TM The Intel Xeon Phi coprocessor is designed to serve cessors, and (iii) the Intel R Xeon Phi coprocessor has a the needs of applications in the HPC segment that are higher memory access latency. highly parallel, make extensive use of vector operations, TM The Intel R Xeon Phi coprocessor relies on software and and are occasionally memory bandwidth bound. We start hardware data prefetching techniques to bring data into local this section with a presentation of the salient architectural caches ahead of need to hide the latency of accessing caches characteristics of this coprocessor, contrasting it with the and memory. It also employs special vector instructions that Intel R Xeon R processor E5-2600 series product family. We can be used for streaming non-temporal stores in order to then approach this new architecture from a programmer’s save memory bandwidth. While those prefetch and streaming point of view. Before moving on to the details of compiler- non-temporal store instructions can be inserted by expert based data prefetching and streaming store instructions in programmers using intrinsics, in practice, it is a hard and the next sections, we also give an overview of our compiler. error-prone task to identify when these instructions will be TM R useful and with which parameters they should be used to A. Architectural Overview of the Intel Xeon Phi Copro- bring benefits. It is much more practical to reduce the role of cessor the programmer and delegate the compiler to be responsible 1) Cores and Parallelism: Targeting highly parallel ap- TM for performing the bulk of the work: analyzing the source plications, the Intel R Xeon Phi coprocessor [1] employs code, identifying the cases where these instructions can be 60 (or more) cores, each core supporting four hardware thread contexts which gives a total of 240 (or more) threads connected to the ring network increases the average number that can run simultaneously. The cores are in-order and of hops each message has to travel on the network, which can issue two instructions per cycle. This dual instruction in turn increases the total access latency. TM issue is coupled with two pipelines (u and v-pipes), which While the Intel R Xeon Phi coprocessor has a total on- improves the utilization of the units inside the core. The chip cache capacity of 30MB, which is higher than the two pipes are not symmetrical and there are restrictions on sum of L2 and L3 caches on Intel R Xeon R processor E5- which instructions can be scheduled on which pipe (e.g., 2600 product family, its per-core cache capacity is actually vector operations can only be scheduled on the u-pipe). At smaller. Instead of 256KB L2 and 2.5MB L3 effective cache TM each cycle, the instruction decoder picks instructions from capacity per-core, an Intel R Xeon Phi coprocessor core the threads scheduled on the core in a round-robin order has only 512KB L2 cache. Further, this space is shared by with the restriction that at least one cycle is needed between four threads. This reduction in per core cache capacity is a decoding instructions from the same thread. Hence, if only design choice that enables such a high number of cores to be one thread is scheduled on a core, its instructions can be integrated in a single chip. As a result, the on-chip cache in issued at every other cycle. This restriction was a design this architecture is more limited and effective management of choice to balance the latency of the pipeline stages in the its capacity is more important to achieve high performance. TM processor and prevent the decode stage from reducing the The Intel R Xeon Phi coprocessor resides on a separate clock frequency. add-in card connected to a host system through PCIe. The Overall, in contrast to the Intel R Xeon R processors, the memory system on the coprocessor is completely indepen- TM Intel R Xeon Phi coprocessor trades off core complex- dent from the host memory system. The coprocessor has ity for coarse grain parallelism. While the former family 16 channels of GDDR5 memory that are controlled by 8 employs up to eight complex out-of-order, four-issue cores, on-chip memory controllers. Operating at a 32-bit interface the latter makes use of a larger number of simpler in-order and 5.5GT/s speed, the memory subsystem has a maximum cores running at a lower frequency. These cores are instances memory bandwidth of 352GB/s. On the other hand, Intel R of the Intel R P54C architecture that is extended based on Xeon R processor E5-2600 product family members have the needs of target applications (e.g., 64-bit addressing to 4 channels that can deliver up to 51.2GB/s maximum satisfy the need for using large memory, long vector units bandwidth per socket at DDR3-1600. With its bandwidth TM to improve performance of data-parallel workloads). The advantage, Intel R Xeon Phi products can be expected to reduction in the complexity of the cores enables significant achieve higher performance on applications that are memory reduction in the core silicon area and power consumption, bandwidth bound. TM which in turn enables the integration of a larger number Although the Intel R Xeon Phi coprocessor has a high of cores on chip. Despite its sacrifice from single-thread memory bandwidth, it also has a high off-chip memory performance and instruction level parallelism, Intel R Xeon access latency. This is mainly because the GDDR5 memory TM Phi products can realize significantly higher thread level it uses is optimized for bandwidth rather than latency, and parallelism. therefore has a higher access latency than the DDR3 memory R R 2) Vector (SIMD) Operations: To improve the perfor- technology [2] used with other Intel Xeon processors. mance of applications that make extensive use of vector op- Considering the smaller per-core cache capacity, an in- TM erations, the Intel R Xeon Phi coprocessor employs 512-bit crease in off-chip memory access latency, together with its vector units. Using a 512-bit vector unit, 16 single precision in-order cores that are not able to hide cache/memory access (or 8 double precision) floating point (FP) operations can be latency as out-of-order cores, optimizing the cache/memory behavior of applications becomes critical in reaching high performed as a single vector operation; and, with the help TM R of the fused multiply-add (FMA) instruction, up to 32 FP performance with the Intel Xeon Phi coprocessor.

Compiler-Based Data Prefetching and Streaming Non-Temporal Store Generation for the Intel R Xeon Phitm Coprocessor

Vxworks Architecture Supplement, 6.2

SIMD Extensions

Multi-Platform Auto-Vectorization

A Bibliography of Publications in IEEE Micro

Vybrid Controllers Technical Overview

POWER Block Course Assignment 3: SIMD with Altivec

Cell Processor and Playstation 3

Product Brief Rev

Enhancing the Matrix Transpose Operation Using Intel Avx Instruction Set Extension

The X86 Is Dead. Long Live the X86!

Release History

Compactpci and Avancedtca Systems