Optimizing Compiler for a CELL Processor
Alexandre E. Eichenberger†, Kathryn O'Brien†, Kevin O'Brien†, Peng Wu†, Tong Chen†,
Peter H. Oden†, Daniel A. Prener†, Janice C. Shepherd†, Byoungro So†, Zehra Sura†,
Amy Wang‡, Tao Zhang§, Peng Zhao‡, and Michael Gschwind†

†IBM T.J. Watson Research Center, Yorktown Heights, New York, USA
‡IBM Toronto Laboratory, Markham, Ontario, Canada
§College of Computing, Georgia Tech, USA

ABSTRACT

Developed for multimedia and game applications, as well as other numerically intensive workloads, the CELL processor provides support both for highly parallel codes, which have high computation and memory requirements, and for scalar codes, which require fast response time and a full-featured programming environment. This first-generation CELL processor implements on a single chip a Power Architecture processor with two levels of cache, and eight attached streaming processors with their own local memories and globally coherent DMA engines. In addition to processor-level parallelism, each processing element has a Single Instruction Multiple Data (SIMD) unit that can process from 2 double-precision floating-point values up to 16 bytes per instruction.

This paper describes, in the context of a research prototype, several compiler techniques that aim at automatically generating high-quality codes over the wide range of heterogeneous parallelism available on the CELL processor. Techniques include compiler-supported branch prediction, compiler-assisted instruction fetch, generation of scalar codes on SIMD units, automatic generation of SIMD codes, and data and code partitioning across the multiple processor elements in the system. Results indicate that significant speedup can be achieved with a high level of support from the compiler.

1. INTRODUCTION

The increasing importance of multimedia, game applications, and other numerically intensive workloads has generated an upsurge in novel computer architectures tailored for such applications. Such applications include highly parallel codes, such as image processing or game physics, which have high computation and memory requirements. They also include scalar codes, such as networking or game artificial intelligence, for which fast response time and a full-featured programming environment are paramount.

Developed with such applications in mind, the CELL processor provides both flexibility and high performance. This first-generation CELL processor includes a 64-bit multi-threaded Power Processor Element (PPE) with two levels of globally-coherent cache. It supports multiple operating systems including Linux. For additional performance, a CELL processor includes eight Synergistic Processor Elements (SPEs). Each SPE consists of a new processor designed for streaming workloads, a local memory, and a globally-coherent DMA engine. Computations are performed by 128-bit wide Single Instruction Multiple Data (SIMD) functional units. An integrated high-bandwidth bus glues together the nine processors and their ports to external memory and IO.

In this paper, we present our compiler approach to support the heterogeneous parallelism found in the CELL architecture, which includes multiple, heterogeneous processor elements and SIMD units on all processing elements. The proposed approach is implemented as a research prototype in IBM's XL product compiler code base and currently supports the C and Fortran languages.

Our first contribution is a set of compiler techniques that provide high levels of performance for the SPE processors. To achieve high rates of computation at moderate costs in power and area, functionality that is traditionally handled in hardware has been partially offloaded to the compiler, such as memory realignment and branch prediction. We provide techniques to address these new demands in the compiler.

Our second contribution is to automatically generate SIMD codes that fully utilize the functional units of the SPEs as well as the VMX unit found on the PPE. The proposed approach minimizes the overhead due to misaligned data streams and is tailored to handle many of the code structures found in multimedia and gaming applications.

Our third contribution is to enhance the programmability of the CELL processor by parallelizing a single source program across the PPE and 8 SPEs. Key to this is our approach of presenting the user with a single shared-memory image, effected through compiler-mediated partitioning of code and data and the automatic orchestration of any data movement implied by this partitioning.

We report an average speedup factor of 1.3 for the proposed SPE optimization techniques. When they are combined with SIMD and parallelization compiler techniques, we achieve average speedup factors of, respectively, 9.9 and 7.1, on suitable benchmarks. Although not yet integrated, the latter two techniques will be cumulative, as they address distinct sources of parallelism.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Copyright 2005 IEEE X-XXXXX-XX-X/XX/XX ...$5.00.

[Figure 1: The implementation of a first-generation CELL Processor. (a) The CELL processor: eight SPEs and the Power Processing Element (PPE) connected by the Element Interconnect Bus (96B/cycle), with ports to external memory and IO. (b) The Synergistic Processing Element (SPE): dual-issue logic routing instructions to an even pipe (floating/fixed point) and an odd pipe (branch, memory, permute), a 128 x 16B register file, 3.5 x 32-instruction fetch buffers, and a 256KB single-ported local store served by a DMA engine.]

This paper is organized as follows. We describe the CELL processor in Section 2 and our programming model in Section 3. We present our SPE optimization techniques in Section 4 and our automatic SIMD code generation techniques in Section 5. We discuss our techniques for parallelization and partitioning of single source programs among the PPE and its 8 SPEs in Section 6. We report performance results in Section 7 and conclude in Section 9.

2. CELL ARCHITECTURE

The implementation of a first-generation CELL processor [1] includes a Power Architecture processor and 8 attached processor elements connected by an internal, high-bandwidth Element Interconnect Bus (EIB). Figure 1a shows the organization of the CELL elements and the key bandwidths between them.

The Power Processor Element (PPE) consists of a 64-bit, multi-threaded Power Architecture processor with two levels of on-chip cache. The cache preserves global coherence across the system. The processor also supports IBM's Vector Multimedia eXtensions (VMX) [2] to accelerate multimedia applications using its VMX SIMD units.

A major source of compute power is provided by the eight on-chip Synergistic Processor Elements (SPEs) [3, 4]. An SPE consists of a new processor, designed to accelerate media and streaming workloads, its local non-coherent memory, and its globally-coherent DMA engine. Key units and bandwidths are shown in Figure 1b.

Nearly all instructions provided by the SPE operate in a SIMD fashion on 128 bits of data representing either 2 64-bit double floats or long integers, 4 32-bit single floats or integers, 8 16-bit shorts, or 16 8-bit bytes. Instructions may source up to three 128-bit operands and produce one 128-bit result. The unified register file has 128 registers and supports 6 reads and 2 writes per cycle.

The memory instructions also access 128 bits of data, with the additional constraint that the accessed data must reside at addresses that are multiples of 16 bytes; namely, the lower 4 bits of the load/store byte addresses are simply ignored.

  Instructions                               Pipe   Latency
  arithmetic, logical, compare, select       even   2
  shift, rotate, byte sum/diff/avg           even   4
  float                                      even   6
  16-bit integer multiply-accumulate         even   7
  128-bit shift/rotate, shuffle, estimate    odd    4
  load, store, channel                       odd    6
  branch                                     odd    1-18

Figure 2: Latencies and pipe assignment for SPE.

An SPE can dispatch up to two instructions per cycle to seven execution units that are organized into even and odd instruction pipes. Instructions are issued in order and routed to their corresponding even/odd pipe. Independent instructions are detected by the issue logic hardware and are dual-issued provided they satisfy the following code-layout conditions: the first instruction must come from an even word address and use the even pipe, and the second instruction must come from an odd word address and use the odd pipe. When this condition is not satisfied, the two instructions are simply executed sequentially. The instruction latencies and their pipe assignments are shown in Figure 2.

The SPE's 256K-byte local memory supports fully-pipelined 16-byte accesses for memory instructions and 128-byte accesses for instruction fetch and DMA transfers. Because the memory is single-ported, instruction fetches, DMA, and memory instructions compete for the same port. Instruction fetches occur during idle memory cycles, and up to 3.5 fetches may be buffered in the 32-instruction fetch buffers to better tolerate bursty peak memory usage. To further avoid instruction starvation, an explicit instruction can be used to initiate an inline instruction fetch.

Branches are assumed non-taken by the SPE hardware, but the architecture allows for a branch hint instruction to override the default branch prediction policy. In addition, the branch hint instruction prefetches up to 32 instructions starting from the branch target, so that a correctly-hinted taken branch incurs no penalty.