<<

Optimizing for a CELL Processor

Alexandre E. Eichenberger† , Kathryn O’Brien†, Kevin O’Brien†, Peng Wu†, Tong Chen†, Peter H. Oden†, Daniel A. Prener†, Janice . Shepherd† , Byoungro So†, Zehra Sura†, Amy Wang‡, Tao Zhang, Peng Zhao‡, and Michael Gschwind† †IBM .J. Watson Research Center ‡IBM Toronto Laboratory College of Yorktown Heights, New York, USA. Markham, Ontario, Canada. Georgia Tech, USA.

ABSTRACT first generation CELL processor includes a 64-bit multi- Developed for multimedia and game applications, as well as threaded Power Processor Element (PPE) with two lev- other numerically intensive workloads, the CELL processor els of globally-coherent . It supports multiple op- provides support both for highly parallel codes, which have erating systems including Linux. For additional perfor- high computation and memory requirements, and for scalar mance, a CELL processor includes eight Synergistic Pro- codes, which require fast response time and a full-featured cessor Elements (SPEs). Each SPE consists of a new pro- programming environment. This first generation CELL pro- cessor designed for streaming workloads, a local memory, cessor implements on a single chip a Power Architecture pro- and a globally-coherent DMA engine. Computations are cessor with two levels of cache, and eight attached stream- performed by 128-bit wide Single Instruction Multiple Data ing processors with their own local memories and globally (SIMD) functional units. An integrated high bandwidth bus coherent DMA engines. In addition to processor-level par- glues together the nine processors and their ports to external allelism, each processing element has a Single Instruction memory and IO. Multiple Data (SIMD) unit that can process from 2 double In this paper, we present our compiler approach to sup- precision floating points up to 16 bytes per instruction. port the heterogeneous parallelism found in the CELL ar- This paper describes, in the context of a research pro- chitecture, which includes multiple, heterogeneous processor totype, several compiler techniques that aim at automat- elements and SIMD units on all processing elements. The ically generating high quality codes over a wide range of proposed approach is implemented as a research prototype heterogeneous parallelism available on the CELL proces- in IBM’s XL product compiler code base and currently sup- sor. Techniques include compiler-supported branch predic- ports the C and languages. tion, compiler-assisted instruction fetch, generation of scalar Our first contribution is a set of compiler techniques that codes on SIMD units, automatic generation of SIMD codes, provide high levels of performance for the SPE processors. and data and code partitioning across the multiple proces- To achieve high rates of computation at moderate costs in sor elements in the system. Results indicate that significant power and area, functionality that is traditionally handled in speedup can be achieved with a high level of support from hardware has been partially offloaded to the compiler, such the compiler. as memory realignment and branch prediction. We provide techniques to address these new demands in the compiler. Our second contribution is to automatically generate 1. INTRODUCTION SIMD codes that fully utilize the functional units of the The increasing importance of multimedia, game applica- SPEs as well as the VMX unit found on the PPE. The pro- tions, and other numerically intensive workloads has gener- posed approach minimizes the due to misaligned ated an upsurge in novel computer architectures tailored for data streams and is tailored to handle many of the code such applications. Such applications include highly parallel structures found in multimedia and gaming applications. codes, such as image processing or game physics, which have Our third contribution is to enhance the programmabil- high computation and memory requirements. They also in- ity of the CELL processor by parallelizing a single source clude scalar codes, such as networking or game artificial in- program across the PPE and 8 SPEs. Key to this is our ap- telligence, for which fast response time and a full-featured proach of presenting the user with a single shared memory programming environment are paramount. image, effected through compiler mediated partitioning of Developed with such applications in mind, the CELL pro- code and data and the automatic orchestration of any data cessor provides both flexibility and high performance. This movement implied by this partitioning. We report an average speedup factor of 1.3 for the proposed SPE optimization techniques. When they are combined with SIMD and parallelization compiler tech- Permission to make digital or hard copies of all or part of this work for niques, we achieve average speedup factors of, respectively, personal or classroom use is granted without fee provided that copies are 9.9 and 7.1, on suitable benchmarks. Although not inte- not made or distributed for profit or commercial advantage and that copies grated yet, the latter two techniques will be cumulative as bear this notice and the full citation on the first page. To copy otherwise, to they address distinct sources of parallelism. republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Copyright 2005 IEEE X-XXXXX-XX-X/XX/XX ...$5.00. a) CELL Processor b) Synergistic Processing Element (SPE)

1

SPE SPE SPE SPE SPE SPE SPE SPE Even Pipe Odd Pipe Dual−Issue Floating/ Branch Instruction Fixed Memory Logic Point Permute Element Interconnect Bus, 96B/cycle

Register File Instr. Buffer 128 x 16B 3.5 x 32 instr. 8 byte/cycle 2 in each dir. 16 byte/cycle Local Store in one dir. 256KB 32 byte/cycle Single Ported in one dir. 3 branch: 1,2 128 byte/cycle branch hint: 1,2 Power Processing Element (PPE) To External Mem To External IO in one dir. DMA instr fetch: 2 dma request: 3

Figure 1: The implementation of a first-generation CELL Processor

This paper is organized as follows. We describe the CELL Instructions Pipe Lat. processor in Section 2 and our programming model in Sec- arithmetic, logical, compare, select even 2 tion 3. We present our SPE optimization techniques in Sec- shift, rotate, byte sum/diff/avg even 4 tion 4 and our automatic SIMD code generation techniques float even 6 in Section 5. We discuss our techniques for parallelization 16-bit integer multiply-accumulate even 7 and partitioning of single source programs among the PPE 128-bit shift/rotate, shuffle, estimate odd 4 and its 8 SPEs in Section 6. We report performance results load, store, channel odd 6 in Section 7 and conclude in Section 9. branch odd 1-18

2. CELL ARCHITECTURE Figure 2: Latencies and pipe assignment for SPE. The implementation of a first-generation CELL proces- sor [1] includes a Power Architecture processor and 8 at- An SPE can dispatch up to two instructions per cycle tached processor elements connected by an internal, high- to seven units that are organized into even and bandwidth Element Interconnect Bus (EIB). Figure 1a odd instruction pipes. Instructions are issued inorder and shows the organization of the CELL elements and the key routed to their corresponding even/odd pipe. Independent bandwidths between them. instructions are detected by the issue logic hardware and are The Power Processor Element (PPE) consists of a 64-bit, dual-issued provided they satisfy the following code-layout multi-threaded Power Architecture processor with two lev- conditions: the first instruction must come from an even els of on-chip cache. The cache preserves global coherence word address and use the even pipe, and the second in- across the system. The processor also supports IBM’s Vector struction must come from an odd word address and use the Multimedia eXtensions (VMX) [2] to accelerate multimedia odd pipe. When this condition is not satisfied, the two in- applications using its VMX SIMD units. structions are simply executed sequentially. The instruction A major source of compute power is provided by the eight latencies and their pipe assignments are shown in Table 2. on-chip Synergistic Processor Elements (SPEs) [3, 4]. An The SPE’s 256K-byte local memory supports fully- SPE consists of a new processor, designed to accelerate me- pipelined 16-byte accesses for memory instructions and 128- dia and streaming workloads, its local non-coherent mem- byte accesses for instruction fetch and DMA transfers. Be- ory, and its globally-coherent DMA engine. Key units and cause the memory is single ported, instruction fetches, bandwidths are shown in Figure 1b. DMA, and memory instructions compete for the same port. Nearly all instructions provided by the SPE operate in Instruction fetches occur during idle memory cycles, and up a SIMD fashion on 128 bits of data representing either 2 to 3.5 fetches may be buffered in the 32-instruction fetch 64-bit double floats or long integers, 4 32-bit single float buffers to better tolerate bursty peak memory usages. To or integers, 8 16-bit shorts, or 16 8-bit bytes. Instructions further avoid instruction starvation, an explicit instruction may source up to three 128-bit operands and produce one can be used to initiate an inline instruction fetch. 128-bit result. The unified register file has 128 registers and The branches are assumed non-taken by the SPE hard- supports 6 read and 2 write per cycle. ware but the architecture allows for a branch hint instruction The memory instructions also access 128 bits of data, with to override the default branch prediction policy. In addi- the additional constraint that the accessed data must reside tion, the branch hint instruction prefetches up to 32 instruc- at addresses that are multiples of 16 bytes. Namely, the tions starting from the branch target, so that a correctly- lower 4 bits of the load/store byte addresses are simply ig- hinted taken branch incurs no penalty. One of the instruc- nored. To facilitate the loading/storing of individual values, tion fetch buffers is reserved for the branch hint mechanism. such as a byte or word, there is additional support to ex- In addition, there is extended support for eliminating short tract/merge an individual value from/into a 128-bit register. branches using bit-wise select instructions. Data is transfered between the local memory and the constraints imposed by SPE/PPE memory subsystems. At DMA engine in chunks of 128 bytes. The DMA engine can this level, the application developer must still manually par- support up to 16 requests of up to 16K bytes. Both the tition the program in consideration of the differing charac- PPE and SPEs can initiate DMA requests from/to each- teristics of the underlying ISAs and the constraints of the other’s local-memory as well as from/to global memory. The underlying memory subsystem. Code and data transfers DMA engine is part of the globally coherent memory address must be positioned to ensure maximum overlap of commu- space; addresses of local DMA requests are translated by an nication and computation. The application must be hand MMU before being sent on the bus. Bandwidth between parallelized to run on all the CELL processing elements to the DMA and the EIB bus is 8 bytes per cycle in each direc- ensure the maximum performance benefit. tion. Programs interface with the DMA unit via a channel At the highest level of support, our compiler provides an interface and may initiate blocking as well as non-blocking abstraction of the underlying architectural complexity, pre- requests. senting a single shared memory, multi-processor image, and providing parallelization across all the processing elements. 3. PROGRAMMING MODELS In our current implementation, we use an OpenMP program- ming model, whereby the user specifies the parallel sections of the CELL architecture are presented with of the program. Our approach is equally applicable to au- a wide range of approaches to extracting maximum perfor- tomatic parallelization support and other parallel program- mance. At one end of the spectrum, a dedicated developer ming paradigms such as UPC. has full control over the machine resources. At the other end, an application writer can enjoy a single-source program targeting a shared-memory multi-processor abstraction. 4. OPTIMIZED SPE CODE GENERATION By programming the SPE in assembler, a dedicated pro- In this section, we describe the current compiler optimiza- grammer can select which registers to use, choose scalar or tion techniques that address key architectural features of the SIMD instructions, and schedule the code manually. This SPE. During initial development of the SPE instruction set, provides a high degree of flexibility and, when used judi- a preliminary scalar compiler was prototyped as proof-of- ciously, will yield the highest performance. Within this concept. This allowed us to quickly investigate the perfor- realm there are a number of different approaches to mask- mance impact of a SIMD-only instruction set with a unified ing the system complexity, including the use of libraries, the scalar/vector register file as well as with a memory interface function offload model, and task pipelining. based on loading a 128-bit vector, extracting, and appropri- Prototypes developed using this model have demonstrated ately formatting the desired scalar value. However, in this significant performance potential. Notwithstanding, this ap- section we describe our later implementation within IBM’s proach can be extremely labor intensive, and may repre- XL compiler code base. sent an approach adopted by a small fraction of CELL pro- grammers. Frequently, even dedicated application develop- 4.1 Scalar Code on SIMD Units ers want to deploy high level programming languages, for example, for less performance-critical sections of an appli- As mentioned in Section 2, nearly all SPE instructions op- cation, to reach higher productivity and programmability. erate on 128-bit wide SIMD data fields, including all mem- At the next level of support, we provide distinct ory instructions. One notable exception is the conditional branch instructions which branch on nonzero values from the highly tuned to the respective ISAs. The still 1 has to deal with separate SPE and PPE programs, their primary slot of a 128-bit register. The other notable excep- respective address spaces and explicit interprocessor com- tion is the memory address fields which are also expected to munication. However, most of the machine dependent op- reside in the primary slot by the memory instructions. timization is performed automatically by the highly tuned When generating scalar codes on an SPE, we must ensure compilers. that the SIMD nature of the processor does not get in the For critical sections of an application, a programmer can way of program correctness. Mainly, we must ensure that use a rich set of SPE and VMX intrinsics to precisely con- the alignment of the data within SIMD registers is properly trol the SIMD and data layout, while taken into consideration. leaving scheduling and decisions to the To illustrate the impact of SIMD units on scalar code, a=b+c compiler. An intrinsic provides to the programmer a func- consider the following scalar computation. Because tion that mimics the behavior of the underlying instructions the SPE memory subsystem processes only 16-byte aligned memory requests, loading b from memory yields the 32-bit b in the target architecture. An intrinsic generally translates 2 to a single instruction in the final output of the compiler, and value plus 96 bits of unrelated data. The actual location of b we provide intrinsics for nearly all SPE/VMX instructions, the value within the register is determined by the 16-byte b including a full complement to support DMA operations. alignment of in memory. However, even programming with intrinsics and addressing Once the input data is in registers, the compiler must data-layout issues manually can incur a significant invest- continue to keep track the alignment of its data since data ment of time. Moreover, the resulting will not cansafelybeoperatedoninSPESIMDunitsonlywhen be portable. 1 At the next level of support, our compiler provides auto- The primary slot corresponds to highest order (leftmost) matic simdization targeting the SPE/PPE SIMD units. As 32 bits of a register. For subword data types, only the least significant bits in the primary slot are considered. will be shown in Section 5, we have developed techniques 2This is technically true only when the data is naturally to extract SIMD parallelism from multiple language scopes aligned, e.g. when a 32-bit integer resides in memory at such as across loop iterations or within a basic . We addresses that are multiples of 4 bytes. In this paper, we also provide a systematic solution to addressing alignment assume and support only naturally aligned data elements. they reside at the same register slots. For example, the 128- For loop-closing branches, we attempt to move the hbrs bit registers containing b and c can only be added if the b outside the loop to avoid repetitive execution of hint in- and c values reside at the same byte offset within their re- structions. This optimization is possible because a hint re- spective registers. When relatively misaligned, we permute mains effective until replaced by another one. Unconditional the register content to enforce matching alignment. Since branches are also excellent targets for branch hint instruc- scalar computations in highly optimized multimedia codes tions. The indirect form of the hbr instruction is used for are mostly about address and branch-condition computa- hinting function returns, function calls via pointers, and all tions, the default policy is to move any misaligned scalar other situations that give rise to indirect branches. data into the primary slot of their registers. The storing of a scalar value also requires some care on 4.3 the SPEs since stores are also 128-bit wide instructions. For The scheduling process consists of two closely interacting scalars, the compiler must first load the original 128-bit data subtasks: scheduling and bundling. The scheduler’s main in which the result resides, then splice in the new value, and objective is to schedule operations that are on the critical store the resulting 128-bit data to the memory. path with the highest priority and schedule the other less We take several steps to avoid such overhead. First, we critical operations in the remaining slack. While typical allocate all temporary and global scalar variables in the pri- schedulers deal with resources and latencies, the SPEs also mary slot of their own, private 128-bit chunk in memory. have constraints that are expressed in numbers of instruc- The padding overhead is insignificant compared to the code tions. For example, the hbr branch hint instruction cannot size increase incurred by additional instructions that would be more than 256 instructions away from its target branch be otherwise needed to realign the data at runtime. Sec- and should be no closer than 8 instructions. Constraints ond, we perform aggressive register allocation for all local expressed in terms of instruction counts are further compli- computations, such as address and loop index variables, to cated by the fact that the precise number of instructions in make good use of the 128-entry register file. As a result, a scheduling unit is known only after the bundling subtask most local variables reside exclusively in the primary slot of has completed. their respective registers, and thus memory storage and as- The bundler’s main role is to ensure that each pair of in- sociated load/store instructions are not needed. Finally, we structions that are expected to be dual-issued satisfies the attempt to automatically simdize code to reduce the amount SPE’s instruction issue constraints. As mentioned in Sec- of scalar codes in an application. tion 2, the processor will dual-issue independent instruc- tions only when the first instruction uses the even pipe and 4.2 Branch Optimizations resides in an even word address, and the second instruction The SPE has a high branch-misprediction penalty of 18 uses the odd pipe and resides in an odd word address. Once cycles. Because branches are only detected late in the the instruction ordering is set by the scheduling subtask, pipeline at a time where there are already multiple fall- the bundler can only impact the even/odd word address of through instructions in flight, the SPE simply assumes all a given instruction by judiciously inserting instructions branches to be non-taken. into the instruction stream. Because taken branches are so much more expensive than Another important task of the bundler is to prevent non-taken branches, the compiler first attempts to eliminate instruction-fetch starvation. Recall in Section 2 that a single taken branches. One well-known approach for eliminating local memory port is shared by the instruction-fetch mecha- short if-then-else constructs is if-conversion via the use of nism and the processor’s memory instructions. As a result, select instructions provided by the SPE. Another well-known a large number of consecutive memory instructions can stall approach is to determine the likely outcome of branches in instruction fetching. With a 2.5 instruction-fetch buffers re- a program, either by means of compiler analysis or via user served for the fall-through path, the SPE can run out of directives, and perform code reorganization techniques to instructions in as few as 40 dual-issued cycles. When a move cold paths out of the fall-through path. buffer is empty, there may be as few as 9 cycles for issu- To boost the performance of the remaining taken- ing an instruction-fetch request to still hide its full 15-cycle branches, such as function calls, function returns, loop- latency. Since the refill window is so small, the bundling closing branches, and some unconditional branches, the SPE process must keep track of the status of each buffer and in- provides a branch hint instruction. This instruction, re- sert explicit ifetch instructions when a starvation situation ferred to as Hint for Branch or hbr, specifies the location of is detected. a branch and its likely target address. Instructions from the The refill window is even smaller after a correctly hinted hinted branch target are prefetched from memory in a ded- taken branch since there is only 1 valid buffer after a branch icated hint instruction buffer and the buffered instructions as opposed to 2.5 buffers for the fall-through path. In this are then inserted into the instruction stream immediately case, instruction starvation is prevented only when all in- after the hinted branch. When the hint is correct and sched- structions in the buffer are useful. Namely, the branch target uled at least 11 cycles prior to its branch, the branch latency must point to an address that is a multiple of 16 instructions, is essentially one cycle; otherwise, normal branch penalties which is the alignment constraint of the instruction-fetch apply. Presently, the SPE supports only one active hint at unit. This constraint is enforced by introducing additional atime. nop instructions before a branch target to push it to the next Likely branch outcomes can either be measured via branch multiple of 16 instructions. Our heuristics are fairly success- profiling, estimated statically, or provided by the user via ful at scavenging any idle issue slots so that nop instructions expect builtins or exec freq pragmas. We use the latter may be inserted without performance penalties. two techniques. We then insert a branch hint for branches A final concern of the bundling process is to make sure with taken probability higher than a given threshold. that there is a sufficient number of instructions between a branch hint and its branch instruction. This constraint is be recognized as certain vectorizable idioms, such as paral- due to the fact that a hint is only fully successful if its tar- lel reductions. In addition, since SIMD vectors are packed get branch address is computed prior to that branch entering vectors and are loaded from memory only as 16-byte con- the instruction decode pipeline. The bundler will add extra tiguous chunks, strided accesses are considered not simdiz- nop instructions when the scheduler did not succeed in in- able. We currently simdize the following 5 types of accesses: terleaving a sufficient number of independent instructions stride-one memory accesses, loop invariant, loop private, in- between a hint and its branch. duction, and reduction. For best performance, our approach uses a unified schedul- For loops that contain non-simdizable computation, a cost ing and bundling phase. We generally preserve the cy- model is employed to determine whether to distribute the cle scheduling approach where each nondecreasing cycle of loop. Avoiding excessive loop distribution is particularly im- the schedule is filled in turns, except that we may retroac- portant for simdization because the SIMD parallelism being tively insert nop or ifetch instructions, as required by the exploited is fairly narrow, e.g., 4-way for integer and float. bundling process. When getting ready to schedule the next Thus, the overhead of loop distribution, such as reduced in- cycle in the scheduling unit, we first investigate if an ifetch struction level parallelism, register, and cache reuse, may instruction is required. When this is the case, we forceably override the benefit of SIMD execution. For example, one insert an ifetch instruction in that cycle and update the of our heuristics is to not distribute a loop if the simdizable scheduling resource model accordingly. We then proceed portion involves only private variables, many of which are with the normal scheduling process for that cycle. When no introduced by the compiler during common subexpression additional instruction can be placed in the current cycle, we elimination. then investigate if nop instructions must be inserted in prior Once deemed simdizable, the loop is blocked, and scalar cycles to enable dual-issuing. Once this task is completed, operations are transformed into generic vector operations on we proceed to the next cycle. vector data types. The blocking factor is determined such that the byte length of each vector is a multiple of 16 bytes. 5. AUTOMATIC SPE/VMX SIMDIZATION 5.1.2 Basic-block Level Simdization In this section, we present our simdization framework Short vectors enable us to extract SIMD parallelism which targets both the SPEs and the PPE’s VMX SIMD within a basic block. Such parallelism is often found in units. While their SIMD units have a distinct instruction unrolled loops, either manually by programmers or auto- set architecture, both units share a set of common SIMD ar- matically by the compiler. It is also common in graphic chitectural characteristics. For instance, both units support applications that manipulate adjacent fields of aggregates, 128-bit packed SIMD registers; and both memory subsys- such as the subdivision computation shown in Figure 3. tems allow loads and stores from 16-byte aligned addresses only. Our framework capitalizes on these commonalities for (i=0; i

} c) memory stream c[i+3] c0 c1 c2c3 c4 c5 c6 c7 ...... offset=12 shifted LEFT by 4 bytes Figure 4: A loop with misaligned accesses. d) stream c[i+3] shifted in regs c1 c2 c3 c4 c5 c6 c7 c8 ...... offset=8

+ ...... In this example, the data involved in any given loop itera- e) result stream b−1 b0+ b1+ b2+ b3+ b4+ b5+ b6+ tion always reside at different byte offsets (i.e., slots) of their in regs +c1 c2 c3 c4 c5 c6 c7 c8 ...... offset=8 respective vector registers after being loaded from memory. For example, data involved in the i=0iteration, namely f) memory stream a[i+2] a0 a1 a2 a3 a4 a5 a6 a7 ...... offset=8 b[1], c[3],anda[2] as shown in, respectively, Figures 5a, 5c, and 5f, reside in different slots within their respective memory stream: vector register: iteration i=0: vector registers. This is a problem since each arithmetic SIMD instruction (such as the add instruction in this exam- Figure 5: Simdization of a[i+2]=b[i+1]+c[i+3] with ple) operates in parallel over the data residing in the same minimum data reorganizations. slot of its respective input vector register. To ensure the correctness of simdized computation, part More details on shift placement policies, code genera- of the simdization process deals with “re-aligning” data in tion for stream shifts such as handling runtime stream registers, so that the same data involved in the original com- shifts, SIMD prologue/epilogue code generation for mis- putation reside in the same slots in their respective registers aligned stores, and alignment handling in the presence of after simdization. Figure 5 illustrates one such scheme. In 3 data conversion can be found elsewhere [9, 10]. Figure 5b, we first shift right by one integer slot the stream of data generated by b[i+1] for i=0 to 99. In Figure 5d, we 5.3 SIMD Code Generation shift left by one integer slot the stream of data generated by c[i+3] for i=0 to 99. At this stage, both the b and c register Since extraction of SIMD parallelism and alignment han- streams start in the third integer slot. The SIMD add in- dling both operate on generic vector operations, they are struction is then applied to the shifted streams and produces common to VMX and SPE simdization. In the SIMD code the expected results, b[1]+c[3],...,b[100]+c[102]. generation phase, generic vector operations are translated In prior work [9, 10], we proposed an alignment handling into SIMD intrinsics for specific platforms, which will be scheme that automatically re-arranges data to satisfy align- further optimized by the back-end compiler. ment constraints imposed by hardware like SPE and VMX. This phase also handles arbitrary length vectors. Recall Our algorithm deals with streams of contiguous data in that during loop-level simdization, the loop is often blocked memory that are represented as stride-one accesses in a loop. to ensure all vectors be of multiple of 16 bytes. In this For example, the data touched by access a[i+2] for i=0 phase, operations on vectors that are multiple of 16 bytes to 99 are considered as a stream. If a computation involves are mapped to multiple operations on vectors of 16 bytes. misaligned accesses, our algorithm will shift in registers en- The others are reverted back into multiple scalar operations. tire streams associated one or more misaligned accesses so that data involved in the original computation reside in the 6. GENERATION OF MIMD CODE same vector register slot before SIMD arithmetic operations In this section, we describe compiler generation of parallel are performed. codes for the CELL processor. Guided by user directives, the For a given statement, there are typically several ways compiler distributes loop iterations across the multiple SPE to shift misaligned streams to satisfy alignment constraints. processors of the CELL. In some respect, we are applying We refer to an algorithm that generates one such way as and extending existing techniques in a novel environment, a stream-shift placement policy. For example, the scheme but it is our approach to managing the hybrid memory sub- illustrated in Figure 5 uses the eager-shift policy, where load system of the CELL architecture which enables us to do this. streams are eagerly shifted to the alignment of the store As such, a key component of our parallelization approach is stream. the presentation of a Single Shared Memory abstraction. Another commonly used policy is called lazy-shift pol- icy. Consider a different a[i+3]=b[i+1]+c[i+1] loop. In- 6.1 User-Directed Parallelization stead of shifting each misaligned input to the alignment of Programming the SPE processor is significantly enhanced the store, this policy will lazily shift the result stream of by the availability of an optimizing compiler which supports b[i+1]+c[i+1], since both streams participating in the ad- SIMD intrinsics and automatic simdization. However, pro- dition are relatively aligned with each others. gramming the CELL Processor, the PPE and its 8 SPEs, is a much more complex task, requiring partitioning of an 3To understand the applicability of this scheme, it is crit- ical to realize that “shifting left” and “shifting right” are application to accommodate the limited local memory con- data reorganizations that operate on a stream of consecu- straints of the SPE, parallelization across the multiple SPEs, tive registers, not the traditional logical or arithmetical shift orchestration of the data transfer through insertion of DMA operation. commands, and compiling for two distinct ISAs. We present an approach wherein the compiler manages tween its home location and a temporary location in the local this complexity while still enabling the significant perfor- store. When this movement is done to satisfy the demands mance potential of the machine. We refer to this as Single of an executing SPE program, and the resulting buffers are Source compilation to denote the fact that the user need only organized in such a way as to permit reuse, we refer to the code a single program or set of programs targetting a CELL mechanism employed as a Software Cache. processor, without regard for the underlying architectural heterogeneity. Compiler-Controlled Software Cache Our parallel implementation currently uses OpenMP [11, In our initial implementation of the Software Cache, a sep- 12] pragmas to guide parallelization decisions. Making use of arate directory is maintained in each SPE. The compiler, the existing parallelization infrastructure of the underlying 4 using interprocedural analysis, replaces a subset of load and IBM XL compiler, we outline parallel code sections, in the store instructions with instructions that explicitly look up first pass of the optimizer, and apply machine independent the effective address of the datum in a directory. If the line optimizations to the outlined functions. In the second pass containing the datum is present in the directory, the address of optimization, these outlined functions and their contained of the requested variable is computed and the load or store subgraphs are cloned and optimized codes are generated for continues using this local store location. Otherwise, a sub- both the SPE and the PPE processors. The Master routine, the miss handler, is invoked and the requested data runs on the PPE processor and partakes in all work sharing is transferred from system memory. As is usual in hardware constructs. Since there is no operating system (OS) support caches, the directory is updated, and typically another line on the SPE, this thread also handles all OS service requests. is selected for eviction in order to make room for the new Since cloning of the occurs in the second pass element. Our current implementation provides a 4-way asso- of optimization, after interprocedural analysis, the compiler ciative cache, and all 4 ways are probed inline, exploiting the can generate versions of all library codes as appropriate for SIMD parallelism of the instruction set. Our cache imple- both the SPE and the PPE. mentation can optionally be used either for a serial program, Since the limited size local memory of the SPE must ac- or a parallel program. There is some additional cost involved commodate both code and data, there is always a possibil- in supporting parallel execution, since we must keep track ity that a single SPE object will be too large to fit. With of which bytes in a cache line have been modified, in order our parallelization approach, since the unit of SPE compi- to support multiple writers of the same line. lation is a user defined parallel region, most often a loop In the current version of our compiler, the software cache nest, the extent to which we encounter very large SPE ex- has 4 sets of 128 lines, each of which is 128-byte long. The ecutables is greatly reduced. In a later section, we will de- line length was chosen based on consideration of efficiency of scribe our technique for handling very large SPE executa- DMA transfer. Clearly many other choices could be made, bles. This is fully integrated with our shared memory paral- and we intend in the future to evaluate some of the tradeoffs lelization, such that we never encounter outlined SPE exe- inherent in these choices. Given a particular set of choices, cutables which are too large for a single SPE local memory. the code required to perform a cache lookup can be deter- 6.2 Single Shared Memory Abstraction mined. In order to effect the lookup, we proceed in the following fashion. Initially, having decided that a particular Although a common mode of programming the CELL con- variable is to be allocated in system memory, the compiler sists of manually partitioning code and data into PPE and flags that variable as requiring a cache access. The inter- SPE portions as well as explicitly managing the transfers mediate code for accesses to both cached and non-cached of code and data between system memory and local store, variables initially looks the same, with load and store op- we believe that, in many instances, this imposes too great erations for each reference as appropriate. This allows the a burden on the programmer. In particular, it is usual for optimizer to operate on all accesses equally, thereby elimi- a programmer to view a computer system as possessing a nating many redundant memory references. Late in the op- single addressable memory, and for all the program data to timization phase of our compiler, the remaining loads and reside in this space. stores to the cached references are expanded by inserting In the CELL processor, the local stores, which alone are inline the instructions which lookup the cache directory. In directly addressable by their respective SPEs, are memo- keeping with the spirit of caching, the expanded code as- ries separate from the vastly larger system memory. Each sumes a hit, so the branch to the misshandler is treated by SPE possesses the capability to transfer data, by means of the rest of the compiler as an atomic operation, rather than the DMA engine, between its local store and the system a call. This means that the expected path is not memory. The compiler, under certain conditions, can un- impacted by the need to follow the normal calling conven- dertake the task of orchestrating this data transfer between tions. Because of this, each reference requires a separate the system memory and the local memory of the SPEs by ex- out-of-line miss “stub” which saves registers and explicitly plicitly inserting DMA commands. Moreover, there are, as branches back to the missing load or store upon return from we describe later, many optimizations that the compiler can the handler. Our current implementation takes no care to perform to optimize these data transfers, especially when optimize these stubs, thus, in codes with a high miss rate memory references are regular. In our approach, we attempt we suffer a higher penalty than needed. to abstract the concept of separate memories by allocating The lookup code itself is fairly straightforward, and the SPE program data in system memory and having the com- operations it performs are modeled on those of a hardware piler automatically manage the movement of this data be- cache. It is first necessary to compute the address of the da- 4Outlining refers to the process of creating a new function tum in system memory. The load or store that we are replac- for a particular section of code and replacing the original ing will usually have the address expressed as a base register code section with a call to that newly created function. plus either a displacement or an index register. These we add to produce the value that is to be looked up (this add local and global variables, and it may also be problematic if would normally be implicitly performed by the load/store large local arrays are used. It is also possible, although we unit in hardware). From the result we extract a 7 bit index have not yet implemented this, to explicitly allocate space in field, which is used directly to index into the cache direc- each local store for small shared variables, and additionally tory. The directory entry thus indexed contains a set of 4 to allocate a home location in system memory. These vari- tags, followed by 4 corresponding line addresses. The upper ables can then be prefetched, and the home location be kept 25 bits of the address are compared in SIMD with the 4 tag current by respecting the appropriate memory consistency fields, and if any of the 4 compare equal, the line address model. In OpenMP, this would entail updating the home in the same slot is combined with the lower 7 bits of the location at each explicit or implicit flush, and re-fetching address and used as the byte offset in local memory of the the data at the appropriate acquire points. required datum. If no tag value is equal, the miss handler is For large arrays, if they are accessed in loops using affine invoked and, when it returns, the address of the datum will expressions involving the loop induction variables, it is possi- be available as above. In the miss handler, a line is chosen ble to use well known program restructuring techniques, [13, for eviction and DMA operations are initiated to store back 14] such as loop blocking, to allow multiple elements to be the evicted line if required, and to bring in the requested explicitly fetched in a single operation. Further restructur- line. The appropriate slot in the directory entry is updated, ing can effectively “software pipeline” the blocking loop, so and control returns to the point of the miss. The DMA op- that data movement and computation are overlapped, essen- erations in the miss handler take several hundreds of cycles, tially prefetching the data. As is the case when compiling for but this delay is roughly commensurate with L2 miss times more traditional memory hierarchies, for multi-dimensional on the PPE side of the CELL processor. The performance loop nests the compiler can also use loop restructuring trans- impact of the Software Cache is dominated by the cost of formations such as blocking and interchange to increase lo- the cache probes, not by the miss cost. cality [15, 16]. This is referred to as tiling; it requires le- As mentioned earlier, our current implementation sup- gality checks based on the array dependence information ports two variants of the Software Cache. One variant sup- and thus may not always be possible. Both prefetching and ports only single-threaded code. In this version no attempt tiling in our environment cause the original variables to be is made to keep track of which lines have been modified, and replaced by references to much smaller temporary arrays, the miss handler perforce must write to system memory the and this implies the rewriting of the indexing expressions whole of any evicted line, whether it has changed or not. It also. Although tiling and prefetching techniques, as men- is probable that further compile-time analysis could amelio- tioned above, have been applied for cache memories pre- rate this situation, and we will be investigating this in the viously, their use in the context of the SPE is somewhat future. The second variant supports multi-threaded shared different. Since explicit DMA operations are used, more memory programs. In order to do this we must keep track care is required in the storage management. For many of in the cache directory on each SPE, which bytes have been the cases mentioned here, a further optimization is to try to written since the line was brought in from system memory. combine data transfers using special DMA list commands, Then, when a line is to be evicted, either in support of a miss or in the case of multiple contiguous requests, by simply fus- or to implement explicit flushes required for conformance ing several DMA operations into one. This has the effect of with memory consistency rules, only the modified bytes are reducing the setup cost for DMA operations. To increase written by the DMA engine. Keeping track of dirty bytes the opportunities for combining DMA operations, the com- requires that we insert additional code inline for each store piler can reorder variables with respect to each other, includ- operation, in addition to the lookup code discussed above. ing breaking up or combining structures, at link time. Our To accomplish this, the directory entry is extended by the compiler already performs such transformations for other addition of 4 quadwords, one for each set, which each con- reasons, but we have not yet explored this additional use of tain 128 “dirty bits” allowing each byte in the line to be that optimization. monitored. For all of the foregoing techniques, when they can be used, Since the compiler must deal with the potential for aliases the effect is to eliminate the need for cache lookup. Ul- between cached data and data obtained by other means (for timately, it should only be necessary to use the cache to example via explicit prefetch in the tiler), the miss handler ensure correctness for references which are irregular, or un- is required to take more care in choosing a line to be evicted known at . Clearly to the extent that this can than is normally the case with a hardware implementation. be accomplished, it requires substantial compile time analy- sis and the work is still in progress. We see very encouraging Static Data Buffering results with what we have implemented so far. Of course, the naive approach of replacing loads and stores with the longer instruction sequences required for cache Code Partitioning lookup has the effect of slowing program execution. We em- We propose a code partitioning technique to reduce the ploy several techniques to mitigate this performance degra- impact of the local storage limitations of the SPE on the dation. At the simplest level, we recognize that, in general, program code segment; the scheme we describe can be used scalar data and small structured variables do not represent a in a stand-alone manner with the SPE compiler, or by users large fraction of the space requirements of a program.These choosing to manually partition their applications. In the can often be directly allocated in the local stores, if it can Single Source compilation context our code partitioning be determined that they are not used to communicate be- approach is integrated with the data software cache to allow tween parallel tasks. In particular, the local, stack allocated for the execution of large outlined functions with large data variables of a function often fall into this class. Of course, to run seamlessly across multiple SPEs. difficulties arise when pointer usage causes between 1.0 30

26.2 25.3 0.9 25

0.8 20

15 0.7 11.4 9.9

Speedup factors 10 0.6 7.5 8.1

5 0.5 2.4 2.5 2.9 2.9 Relative reductions in execution time Original +Bundle +Branch Hint + Ifetch 0 0.4 k t IR or ng py ac F sum age A U n oduc FT E L LD o XY Autc Sax er F V y Linp Swim-l2 lendi ID t Mult axpy B Mat Mult Av uffman inpack a S verage Check H L nera M A Dot Pr onvoluti O pha C Al Figure 6: SPE optimizations. Figure 7: Simdization speedups.

tition size limit. The net effect depends on the prefetching In our code partitioning approach, the SPE program is algorithm and the accuracy of the cost model applied. divided into multiple partitions by the compiler. The home locations of code partitions, just as with data in our Software Cache approach, is system memory. When the compiler en- 7. MEASUREMENTS counters such compilations it reserves a small portion of the We first evaluate the optimized SPE code generation tech- SPE local storage for the code partition manager. The re- niques presented in Section 4 using a cycle-accurate simula- served memory is divided into two segments: one to hold tor. Figure 6 presents the reduction in program execution the continuously resident partition manager, while the other time for each optimization, relative to the performance of holds the current active code partition. The partition man- the original compiler (standard optimizations at O3 level, ager is responsible for loading partitions from their home scheduled for the SPE resource and latency model). We re- location in system memory into local storage when neces- port an average reduction of 22%, ranging from 11 to 51%. sary, normally during an inter-partition function call or an The benchmark programs used here are highly optimized, inter-partition return. The compiler modifies the original simdized kernels representative of typical workloads execut- SPE program to replace each inter-partition call with a call ing on the SPEs. Kernels include a variable length decod- to the partition manager. Thus, the partition manager is ing (VLD) from MPEG decoding, a Huffman compression able to take over control and handle the transition from the and decompression, an IDEA encryption, and a ray tracing current partition to the target partition. The partition man- (OnerayXY). Numerical kernels include an FFT, a 7x7 short ager also makes sure an inter-partition return will return to integer convolution, a 64x64 float matrix multiply, a Saxpy, the partition manager first. an LU decomposition, and a solver kernel of Linpack. Currently, the partitioning algorithm is a call graph based Bundling for dual issue results in an 11% average reduc- one, which means the basic unit of partitioning is a func- tion in execution time, ranging from 2 to 22%. Large reduc- tion. The compiler transforms the call graph into an affinity tion percentages indicate benchmarks with large amounts of graph, with edge weight representing call edge frequency, instruction-level parallelism and no lucky instruction align- and then applies a maximum spanning tree algorithm un- ment (where random instruction layout did not satisfy the der a certain resource limit, typically the (adjustable) code dual-issue constraint). buffer size. Hinting predictable branches results in a further 9% av- In the code partitioning, currently we see respectable per- erage reduction in execution time, ranging from 0 to 26%. formance when executing partitioned functions on a single Large reduction percentages indicate predictable branches SPE relative to execution on the PPE. In the automotive with a sufficient amount of work to hide the hint latency. suite of the EEMBC benchmark, across a single SPE, we Some of the small reduction percentages (such as 0% for ma- see a slowdown of between 2 - 10 %. CJPEG, with code and trix multiply) indicate such tight loops that hinting is not data sizes of 1M, slows down 2.7 times, with both software beneficial without jointly addressing the instruction starva- cache and code partitioning enabled. Given the preliminary tion issue. nature of this work, these results are encouraging. Generating explicit instruction fetches results in a further There are several opportunities which we are currently 2% average reduction in execution time, with peak impact exploring to improve the overall performance of our code for very tight loops such as the 20% reduction for matrix partitioning algorithm. Effectively the algorithm largely de- multiply. pends on the accuracy of the affinity (call edge frequency). We now evaluate the automatic simdization techniques To achieve the best results, profiling can be used instead presented in Section 5 targeting a single SPE, also using of static estimation. Also, using the actual partition size the SPE cycle-accurate simulator. Figure 7 presents the rather than the size estimated in the compiler conserva- speedup factors achieved when automatically simdizing se- tively could improve the utilization of local code buffer sig- quential code kernels. Comparisons are performed at the nificantly. Prefetching is of course the most promising opti- same optimization level, which includes high-level, interpro- mization and has the potential to hide the latency incurred cedural optimizations in addition to all of the SPE optimiza- when fetching partitions from main memory. But prefetch- tions presented in Section 4. We report an average speedup ing requires multiple buffers implying a much smaller par- factor of 9.9, ranging from 2.4 to 26.2. 4.5 12 10.9 mgrid software cache 4 swim 10 with optimization 3.5 equake

applu 8 7.7 7.8 3 apsi 7.1 2.5 art 6 wupwise 2 ammp 4.2 Speedup factor 4 Speedup factors 1.5 3.2

2.3 2.0 2.2 1 2 1.6

0.5 0 0 swim mgrid jacobi matrix-mul vector-add spu=1 spu=2 spu=4 spu=6 spu=8 Figure 9: Speedups of parallelization with optimiza- Figure 8: Speedups of parallelization. tion.

The benchmark programs include video, numerical, and of the relative speed of PPE and SPE processors. The mea- telecommunication applications. Kernels include a full Lin- surements discussed do not consider the simdization of the pack solver, a short integer finite impulse response (FIR), an parallelized benchmarks because the integration of these two auto-correlation kernel, an integer dot-product, a TCP/IP optimizations is not yet complete. For the purpose of these checksum routine, and an alpha blending kernel. Two ker- measurement discussions, without simdization enabled, one nels are from the previous benchmarks, namely Saxpy and SPE is approximately 2 times faster than the PPE processor. Matrix Multiply. In the eight OMP2001 benchmarks, Equake, Mgrid and There are two tiers of benchmarks. The four leftmost ker- Swim have notable speedup because they have high coverage nels in Figure 7get respectable speedups (2.4 to 2.9) but of the parallel part and plenty of data reuse in the innermost below average. For Linpack, approximately 27% of the time loops. The long cache line used in the software cache can is spent in a non-simdized part. The simdized part is analo- effectively capture such spatial reuse. Mgrid also has data gous to Saxpy but has decreasing trip counts (from N to 1) reuse across the outer loops that can be exploited by the that results in a higher loop overhead. For Swim, we believe software cache. Other benchmarks suffer, to some extent, that constant subexpression elimination does not currently for the following reasons: address simdized references as well as scalar ones. For FIR • Poor data locality. Data accesses are so scattered and and Autocor, we believe that there are data-type conversion discontinuous that numerous DMA operations are invoked issues that result in excessive overhead. • The rightmost 5 kernels get significant speedup (7.5 to Frequent cache flush. Some benchmarks, for instance, 26.2). Both Dot Product and Checksum are performing ammp and applu, require frequent cache flush. Cache a reduction which is not natively supported by the SPE’s flush is a very expensive operation in our implementation. instruction set. This introduces some overhead which, in Cache flush can be further optimized. these two cases, can be efficiently hidden using partial sum • Some optimizations turned off. In order to generate reductions. Alpha Blending has data type conversions that smaller code for the SPE processor, some optimizations, are efficiently handled. Both Saxpy and Matrix Multiply such as and procedure inlining, have to be exhibit peak performance due to perfect alignment and no turned off. data conversion overheads. Note that super-inear speedups • Lower coverage of parallel parts. Since the currently avail- are possible, e.g., 26.2 on a single SPE, because the simdized able hardware has limited virtual memory space. Smaller version of the code not only computes multiple useful results data sets have to be used, reducing the coverage in some per SIMD instructions, but also avoids all the overhead of benchmarks. running scalar code on SIMD units otherwise incurred on The performance based purely on software cache can be non-simdized code. enhanced by techniques discussed in Section 6. Figure 9 We now consider the parallelization techniques discussed reports the improvement compared with the pure software in Section 6. The experiments described here were run on cache approach. Eight SPE processors are used to execute actual hardware. We first show, in Figure 8, results for the parallel loops. All but the matrix multiplication are dra- parallel execution using only software cache. The experi- matically improved. The code generation for DMA opera- ment was conducted with SPEC OpenMP 2001 benchmark tions in cases of jumping data access needs further tuning. suite [17] with the three Fortran90 benchmarks excluded. The measurements are ratios of the execution times when running the sequential part on PPE and the parallel loops 8. RELATED WORK on various number of SPE processors, to those when run- We are not aware of any related work on generating scalar ning on the PPE processor alone. It may be more intuitive codes exclusively on SIMD units such as the ones found on to use the execution time on one SPE as the baseline. How- the SPEs. But underlying techniques that minimize the ever, these benchmarks are too large to be directly run on performance impact of scalar codes on SIMD units are well an SPE processor, without our code partitioning technique known, such as data padding for predictable alignment [18] which does in itself incur some performance penalty. We and aggressive register allocation to eliminate local variables reduce the data size of several kernels to get a rough idea in memory. Compiler optimizations that reduce the number of taken- cessing capabilities and restrictions on what the SPEs can branches have been intensively researched such as code execute. Also, in our case the processor cores are on the transformations to co-locate consecutively executed basic same chip, and even though the SPEs rely on DMA trans- blocks [19], if-conversion, or combination of both [20]. Build- fers, all cores use the same virtual memory space to address ing on prior work, we focus on eliminating taken-branches shared data. In effect, the cost model for data access and and disregard the locality enhancing aspect of prior work synchronization is different in the CELL processor context. since the SPEs have no cache. Similarly, our if-conversion There has been substantial work on providing caching in scheme currently focuses on eliminating small if-then-else software, including [33, 34, 35, 36]. Most of the previous branch structures since our SPE’s hardware support is lim- work focuses on caching data and providing coherence for ited to select instructions. a software implementation of distributed shared memory in Compiler assisted branch hints and instruction fetch a network of processors. In our implementation, software mechanisms have also been actively researched [21], espe- caching is done within a CELL chip and is designed to work cially the interaction of instruction fetches and cache pol- efficiently using architectural features specific to the CELL. lution [22]. Since SPEs have no cache and have only very Also, our software cache needs to be aware of compiler opti- limited number of instruction fetch buffers, we insert hints mizations that buffer data in SPE local stores, and it works very conservatively compared to prior work. Branch hints on in tandem with such optimizations. SPEs are also different than those on the Itanium 2 [23], as The concept of memory overlays is well-known in the con- our hints both prefetch instructions and impact the branch text of operating systems [37]. We use the same basic idea prediction policy. SPEs’ hint and instruction fetch timing when partitioning large SPE code sections. However, we is also tighter than the timing on a cache-based system, as perform code partitioning entirely in software, using the we are dealing here with the contention between data and compiler to optimize the choice of what codes to be placed instructions through a single memory port. in a partition. The bundling of instructions has also been studied in the In many embedded or reconfigurable systems, the hard- literature [24], e.g., for Itanium [23]. On the SPE, however, ware/software co-design process iteratively refines a series of bundling is strictly a performance issue since the hardware hardware designs, trading off a set of design parameters such is fully capable of detecting parallel instructions. Unlike as space, power, and performance. This iterative process is in Itanium, there are no stop bit; thus dual-issue bundling driven mostly by synthesis tools [38, 39, 40] and/or opti- rules are enforced by adding nop instructions. As a result, mizing compilers [41, 42]. Like other embedded designs, the the main issue faced by the SPE compiler is the interac- SPEs have reduced hardware supports, e.g.,strictmemory tion between code size increases (due to additional nop)and alignment, no branch prediction, simple instruction fetch, instruction fetch issues. for higher clock rate, lower power consumption, and smaller Prior work on automatic SIMD code generation falls gen- floor plan. The SPE compiler is optimized to provide higher erally into two categories, a loop-based approach [25, 26, 27] level functionality and performance using the simpler primi- and an unroll-and-pack approach [8, 28]. Our approach [9, tives (compared to general-purpose processors) provided by 10, 5] uses a combination of both. the SPE. In terms of alignment handling, one technique is to peel the loop until all accesses are fully aligned or fall into a 9. CONCLUSIONS known, compile-time pattern [18, 26]. Our approach is to Developed for multimedia, game applications, and other peel a full simdized loop iteration and conditionally execute numerically intensive workloads, the first generation CELL it, on a per statement-basis. This technique is general, as processor implements on a single chip a Power Processor El- it effectively support distinct compile-time/ runtime align- ement (PPE) and eight attached Synergistic Processor El- ments for each statement. It is also faster on SPEs because ements (SPEs). In addition to processor-level parallelism, even the peeled iterations are executed in SIMD mode. each processing element has multiple SIMD units that can The direct handling of misaligned references in loops has each process from 2 double-precision floating points up to 16 also been discussed in prior work [25, 26, 27]. To our knowl- bytes per instruction. In this paper, we have presented our edge, prior work shifts all relatively misaligned data to slot compiler approach to support the heterogeneous parallelism zero (i.e., the zero-shift policy). Our approach attempts found in the CELL architecture. to systematically minimize the number of shifts by lazily Our CELL compiler implements SPE-specific optimiza- shifting relatively misaligned input streams directly to the tions including support for compiler assisted memory re- alignment of the output stream (i.e., the lazy-shift policy). alignment, branch prediction, and instruction fetch. It ad- We do so with compile-time and runtime alignment, with dresses fine-grained SIMD parallelization as well as more multiple misaligned statements, unknown loop bounds, and general OpenMP task-level parallelization, presenting the in presence of data conversion, reductions, induction, and user with a single shared memory image through compiler private variables. mediated partitioning of code and data and the automatic Prior work on supporting parallel shared-memory applica- orchestration of the data movement implied by this parti- tions on multiple processors includes work related to UPC, tioning. OpenMP, HPF, and distributed shared memory systems [29, Using benchmarks suitable to this platform, we demon- 30, 31, 32]. They are similar to ours in that they dis- strate average speedup factors of 1.3 for the SPE-specific tribute work among multiple processors, determine memory optimizations, 9.9 for the simdization, and 7.1 for the task- accesses in parallel regions that are shared by multiple pro- level parallelization. cessors, and explicitly transfer data for these accesses and We are working on integrating and refining current tech- attempt to optimize them. However, our work is different niques and further exploiting opportunities available on the in that it considers heterogeneous cores with different pro- CELL architecture for our target workloads. 10. REFERENCES tion Prefetching in Modern Processors. In Proceedings of the International Symposium on Microarchitecture, 1998. [1] Dac Pham et al. The Design and Implementation of a First- Generation CELL Processor. In Proceedings of the IEEE In- [23] Itanium 2 Processor Reference Manual for Software De- ternational Solid-State Circuits Conference, 2005. velopment and Optimization. In Intel, 2003. [2] IBM Corporation. PowerPC Family: AltiVec [24] Daniel Kaestner and Sebastian Winkel. ILP-based Instruc- Technology Programming Environments Manual, 2004. tion Scheduling for IA-64. In Proceedings of the Workshop on Languages, compilers and tools for embedded systems, 2001. [3] Brian Flachs et al. The Microarchitecture of the Streaming Processor for a CELL Processor. In Proceedings of the IEEE [25] Gerald Cheong and Monica S. Lam. An Optimizer for Multi- International Solid-State Circuits Conference, 2005. media Instruction Sets. In Second SUIF Compiler Workshop, 1997. [4] Michael Gschwind, Peter Hofstee, Brian Flachs, Martin Hop- kins, Yukio Watanabe, and Takeshi Yamazaki. A Novel [26] Aart Bik, Milind Girkar, Paul M. Grey, and Xinmin Tian. Automatic Intra-Register Vectorization for the Intel Ar- SIMD Architecture for the CELL Heterogeneous Chip- Multiprocessor. In Hot Chips 17, 2005. chitecture. International Journal of Parallel Programming, (2):65–98, April 2002. [5] Peng Wu, Alexandre E. Eichenberger, Amy Wang, and Peng Zhao. An Integrated Simdization Framework Using Virtual [27] Crescent Bay Software. VAST-F/AltiVec: Auto- Vectors. In Proceedings of the International Conference on matic Fortran Vectorizer for PowerPC Vector Unit. http://www.psrv.com/vast altivec.html, 2004. Supercomputing, 2005. [28] Jaewook Shin, Mary W. Hall, and Jacqueline Chame. [6] John Randal Allen and Ken Kennedy. Automatic Transla- tion of Fortran Programs to Vector Form. ACM Transac- Superword-Level Parallelism in the Presence of Control tions on Programming Languages and Systems, (4):491–542, Flow. In Proceedings of the International Symposium on October 1987. Code Generation and Optimization, 2005. [7] Hans Zima and Barbara Chapman. Supercompilers for Par- [29] Wei-Yu Chen, Costin Iancu, and Katherine Yelick. Commu- nication Optimizations for Fine-grained UPC Applications. allel and Vector Computers. ACM Press, 1990. In Proceedings of the International Conference on Parallel [8] Samuel Larsen and Saman Amarasinghe. Exploiting Super- Architecture and Compilation Techniques, 2005. word Level Parallelism with Multimedia Instruction Sets. In Proceedings of the Conference on [30] Seung-Jai Min, Ayon Basumallik, and Rudolf Eigenmann. Optimizing OpenMP Programs on Software Distributed Design and Implementation, 2000. Shared Memory Systems. International Journal of Parallel [9] Alexandre E. Eichenberger, Peng Wu, and Kevin O’Brien. Programming, 31(3):225–249, 2003. Vectorization for SIMD Architectures with Alignment Con- straints. In Proceedings of the Conference on Programming [31] Vikram Adve, Guohua Jin, John Mellor-Crummey, and Qing Language Design and Implementation, 2004. Yi. High Performance Fortran Compilation Techniques for Parallelizing Scientific Codes. In Proceedings of the Interna- [10] Peng Wu, Alexandre E. Eichenberger, and Amy Wang. Ef- tional Conference on Supercomputing, 1998. ficient SIMD Code Generation for Runtime Alignment and Length Conversion. In Proceedings of the International Sym- [32] Sandhya Dwarkadas, Alan L. Cox, and Willy Zwaenepoel. posium on Code Generation and Optimization, 2005. An Integrated Compile-Time/Run-Time Software Dis- tributed Shared Memory System. In Proceedings of the [11] Official OpenMP specifications. http://www.openmp.org Symposium on Architectural Support for Programming Lan- /specs/mp-documents/cspec20.pdf. guages and Operating Systems, 1996. [12] Official OpenMP specifications. http://www.openmp.org [33] Zoran Radovic and Erik Hagersten. Removing the Overhead /specs/mp-documents/fspec20.pdf. from Software-Based Shared Memory. In Proceedings of In- [13] Todd C. Mowry. Tolerating Latency through Software Con- ternational Conference on Supercomputing, 2001. trolled Data Prefetching. PhD Thesis Stanford University, [34] Sandhya Dwarkadas, Kourosh Gharachorloo, Leonidas Kon- March 1994. tothanassis, Daniel J. Scales, Michael L. Scott, and Robert [14] Michael E. Wolf and Monica S. Lam. A Data Locality Opti- Stets. Comparative Evaluation of Fine- and Coarse-Grain mization. In Proceedings of the Conference on Programming Approaches for Software Distributed Shared Memory. In Language Design and Implementation, 1991. Proceedings of the International Symposium on High Per- [15] Gabriel Rivera and Chau-Wen Tseng. Tiling Optimizations formance Computer Architecture, 1999. for 3D Scientific Computation. In Proceedings of Interna- [35] Kirk L. Johnson, M. Frans Kaashoek, and Deborah A. tional Conference on Supercomputing, 2000. Wallach. CRL: High-Performance All-Software Distributed [16] Abdel-Hameed A. Badawy, Aneesh Aggarwal, Donald Ye- Shared Memory. In ACM Operating Systems Review, 1995. ung, and Chau-Wen Tseng. Evaluating the Impact of Mem- [36] Ioannis Schoinas, Babak Falsafi, Alvin R. Lebeck, Steven K. ory System Performance on Software Prefetching and Local- Reinhardt, James R. Larus, and David A. Wood. Fine-Grain ity Optimizations. In Proceedings of the International Con- Access Control for Distributed Shared Memory. ACM SIG- ference on Supercomputing, 2001. PLAN Notices, 29(11):297–306, 1994. [17] SPEC OMP2001 Description. http://www.spec.org/omp/. [37] R. J. Pankhurst. Operating Systems: Program Overlay Tech- [18] Samuel Larsen, Emmett Witchel, and Saman Amarasinghe. niques. volume 11, pages 119–125, 1968. Increasing and Detecting Memory Address Congruence. In [38] Mentor Graphics Inc. Monet R44 User’s Manual R44, 2002. Proceedings of the International Conference on Parallel Ar- [39] Synopsys Inc. Behavioral compiler user’s guide, chitectures and Compilation Techniques, 2002. 1999. http://www.synopsys.com/products/beh syn/ [19] Karl Pettis and Robert C. Hansen. Profile guided code po- beh syn br.html. sitioning. In Proceedings of the Conference on Programming [40] Synopsys Inc. Cocentric data sheet, 2002. language design and implementation, 1990. http://www.synopsys.com/products/cocentric studio/ [20] Scott A. Mahlke, David C. Lin, William Y. Chen, Richard E. cocentric studio.html. Hank, and Roger A. Bringmann. Effective Compiler Support [41] Vinod Kathail, Shail Aditya, Robert Schreiber, B. Ra- for Predicated Execution using the Hyperblock. In Proceed- makrishna Rau, Darren C. Cronquist, and Mukund Sivara- ings of the International Symposium on Microarchitecture, man. PICO: Automatically Designing Custom Computers. 1992. In IEEE Computer, pages 39–47, September 2002. [21] Arthur Bright, Jason Fritts, and Michael Gschwind. A De- [42] Byoungro So, Mary W. Hall, and Pedro C. Diniz. A Compiler coupled Fetch-Execute Engine with Static Branch Prediction Approach to Fast Hardware Design Space Exploration in Support. IBM Research Report RC23261, 1999. FPGA-based Systems. In Proceedings of the Conference on [22] Chi-Keung Luk and Todd C. Mowry. Cooperative Prefetch- Programming Language Design and Implementation, 2002. ing: Compiler and Hardware Support for Effective Instruc-