Exploiting ILP, TLP, and DLP with the Polymorphous TRIPS Architecture

Karthikeyan Sankaralingam Ramadass Nagarajan Haiming Liu Changkyu Kim Jaehyuk Huh Doug Burger Stephen W. Keckler Charles R. Moore

Computer Architecture and Technology Laboratory Department of Computer Sciences The University of Texas at Austin [email protected] - www.cs.utexas.edu/users/cart

Abstract

This paper describes the polymorphous TRIPS architecture, which can be configured for different granularities and types of parallelism. TRIPS contains mechanisms that enable the processing cores and the on-chip memory system to be configured and combined in different modes for instruction, data, or thread-level parallelism. To adapt to small and large-grain concurrency, the TRIPS architecture contains four out-of-order, 16-wide-issue Grid Processor cores, which can be partitioned when easily extractable fine-grained parallelism exists. This approach to polymorphism provides better performance across a wide range of application types than an approach in which many small processors are aggregated to run workloads with irregular parallelism. Our results show that high performance can be obtained in each of the three modes–ILP, TLP, and DLP–demonstrating the viability of the polymorphous coarse-grained approach for future microprocessors.

1 Introduction

General-purpose microprocessors owe their success to their ability to run many diverse workloads well. Today, many application-specific processors, such as desktop, network, server, scientific, graphics, and digital signal processors, have been constructed to match the particular parallelism characteristics of their application domains. Building processors that are not only general purpose for single-threaded programs but for many types of concurrency as well would provide substantive benefits in terms of system flexibility as well as reduced design and mask costs.

Unfortunately, design trends are applying pressure in the opposite direction: toward designs that are more specialized, not less. This performance fragility, in which applications incur large swings in performance based on how well they map to a given design, is the result of the combination of two trends: the diversification of workloads (media, streaming, network, desktop) and the emergence of chip multiprocessors (CMPs), for which the number and granularity of processors is fixed at design time.

One strategy for combating processor fragility is to build a heterogeneous chip, which contains multiple processing cores, each designed to run a distinct class of workloads effectively. The proposed Tarantula processor is one such example of integrated heterogeneity [8]. The two major downsides to this approach are (1) increased hardware complexity, since there is little design reuse between the two types of processors, and (2) poor resource utilization when the application mix contains a balance different than that ideally suited to the underlying heterogeneous hardware.

An alternative approach to designing an integrated solution using multiple heterogeneous processors is to build one or more homogeneous processors on a die, which mitigates the aforementioned complexity problem. When an application maps well onto the homogeneous substrate, the utilization problem is solved, as the application is not limited to one of several heterogeneous processors. To solve the fragility problem, however, the homogeneous hardware must be able to run a wide range of application classes effectively. We define this architectural polymorphism as the capability to configure hardware for efficient execution across broad classes of applications.

A key question is what granularity of processors and memories on a CMP is best for polymorphous capabilities. Should future billion-transistor chips contain thousands of fine-grain processing elements (PEs) or far fewer extremely coarse-grain processors? The success or failure of polymorphous capabilities will have a strong effect on the answer to these questions. Figure 1 shows a range of points in the spectrum of PE granularities that are possible for a 400mm2 chip in 100nm technology. Although other possible topologies certainly exist, the five shown in the diagram represent a good cross-section of the overall space:

Proceedings of the 30th Annual International Symposium on Computer Architecture (ISCA'03) 1063-6897/03 $17.00 © 2003 IEEE

Figure 1. Granularity of parallel processing elements on a chip. The spectrum runs from (a) FPGA (millions of gates), through (b) PIM (256 processing elements), (c) fine-grain CMP (64 in-order cores), and (d) coarse-grain CMP (16 out-of-order cores), to (e) TRIPS (4 ultra-large cores). Designs to the left exploit fine-grain parallelism more effectively; designs to the right run more applications effectively.

a) Ultra-fine-grained FPGAs.

b) Hundreds of primitive processors connected to memory banks, such as a processor-in-memory (PIM) architecture, or reconfigurable ALU arrays such as RaPiD [7], Piperench [9], or PACT [3].

c) Tens of simple in-order processors, such as in the RAW [25] or Piranha [2] architectures.

d) Coarse-grained architectures consisting of 10-20 4-issue cores, such as the Power4 [22], Cyclops [4], Multiscalar processors [19], other proposed speculatively-threaded CMPs [6, 20], and the polymorphous Smart Memories [15] architecture.

e) Wide-issue processors with many ALUs each, such as Grid Processors [16].

The finer-grained architectures on the left of this spectrum can offer high performance on applications with fine-grained (data) parallelism, but will have difficulty achieving good performance on general-purpose and serial applications. For example, a PIM topology has high peak performance, but its performance on control-bound codes with irregular memory accesses, such as compression or compilation, would be dismal at best. At the other extreme, coarser-grained architectures traditionally have not had the capability to use internal hardware to show high performance on fine-grained, highly parallel applications.

Polymorphism can bridge this dichotomy with either of two competing approaches. A synthesis approach uses a fine-grained CMP to exploit applications with fine-grained, regular parallelism, and tackles irregular, coarser-grain parallelism by synthesizing multiple processing elements into larger "logical" processors. This approach builds hardware more to the left on the spectrum in Figure 1 and emulates hardware farther to the right. A partitioning approach implements a coarse-grained CMP in hardware, and logically partitions the large processors to exploit finer-grain parallelism when it exists. Regardless of the approach, a polymorphous architecture will not outperform custom hardware meant for a given application, such as graphics processing. However, a successful polymorphous system should run well across many application classes, ideally running with only small performance degradations compared to the performance of customized solutions for each application.

This paper proposes and describes the polymorphous TRIPS architecture, which uses the partitioning approach, combining coarse-grained polymorphous Grid Processor cores with an adaptive, polymorphous on-chip memory system. Our goal is to design cores that are both as large and as few as possible, providing maximal single-thread performance, while remaining partitionable to exploit fine-grained parallelism. Our results demonstrate that this partitioning approach solves the fragility problem by using polymorphous mechanisms to yield high performance for both coarse and fine-grained concurrent applications. To be successful, the competing approach of synthesizing coarser-grain processors from fine-grained components must overcome the challenges of distributed control, long interaction latencies, and synchronization overheads.

The rest of this paper describes the polymorphous hardware and configurations used to exploit different types of parallelism across a broad spectrum of application types. Section 2 describes both the planned TRIPS silicon prototype and its polymorphous hardware resources, which permit flexible execution over highly variable application domains. These resources support three modes of execution that we call major morphs, each of which is well suited for a different type of parallelism: instruction-level parallelism with the desktop or D-morph (Section 3), thread-level parallelism with the threaded or T-morph (Section 4), and data-level parallelism with the streaming or S-morph (Section 5). Section 6 shows how performance increases in the three morphs as each TRIPS core is scaled from a 16-wide up to an even coarser-grain, 64-wide issue processor. We conclude in Section 7 that by building large, partitionable, polymorphous cores, a single homogeneous design can exploit many classes of concurrency, making this approach promising for solving the emerging challenge of processor fragility.

2 The TRIPS Architecture

The TRIPS architecture uses large, coarse-grained processing cores to achieve high performance on single-threaded applications with high ILP, and augments them with polymorphous features that enable the core to be subdivided for explicitly concurrent applications at different granularities. Contrary to conventional large-core designs with centralized components that are difficult to scale, the TRIPS architecture is heavily partitioned to avoid large centralized structures and long wire runs. These partitioned computation and memory elements are connected by point-to-point communication channels that are exposed to software schedulers for optimization.

The key challenge in defining the polymorphous features is balancing their appropriate granularity so that workloads involving different levels of ILP, TLP, and DLP can maximize their use of the available resources, and at the same time avoid escalating complexity and non-scalable structures. The TRIPS system employs coarse-grained polymorphous features, at the level of memory banks and instruction storage, to minimize both software complexity and hardware complexity and configuration overheads. The remainder of this section describes the high-level TRIPS architecture, and highlights the polymorphous resources used to construct the D, T, and S-morphs described in Sections 3–5.

2.1 Core Execution Model

The TRIPS architecture is fundamentally block oriented. In all modes of operation, programs compiled for TRIPS are partitioned into large blocks of instructions with a single entry point, no internal loops, and possibly multiple exit points, as found in hyperblocks [14]. For instruction and thread level parallel programs, blocks commit atomically and interrupts are block precise, meaning that they are handled only at block boundaries. For all modes of execution, the compiler is responsible for statically scheduling each block of instructions onto the computational engine such that inter-instruction dependences are explicit. Each block has a static set of state inputs, and a potentially variable set of state outputs that depends upon the exit point from the block. At runtime, the basic operational flow of the processor includes fetching a block from memory, loading it into the computational engine, executing it to completion, committing its results to the persistent architectural state if necessary, and then proceeding to the next block.

2.2 Architectural Overview

Figure 2a shows a diagram of the TRIPS architecture that will be implemented in a prototype chip. While the architecture is scalable to both larger dimensions and higher clock rates, due to both the partitioned structures and the short point-to-point wiring connections, the TRIPS prototype chip will consist of four polymorphous 16-wide cores, an array of 32KB memory tiles connected by a routed network, and a set of distributed memory controllers with channels to external memory. The prototype chip will be built in a 100nm process and is targeted for completion in 2005.

Figure 2. TRIPS architecture overview: (a) the TRIPS chip, with four cores, an array of memory tiles (M), and DRAM interfaces; (b) a TRIPS core, with a banked instruction cache (ICache-0 through ICache-M), next-block predictor, stitch table, block control logic, banked L1 data caches (DCache-0 through DCache-3) with per-bank load/store queues (LSQ0 through LSQ3), and the L2; (c) an execution node, with instruction and operand storage for frames 0 through 127, control logic, and a router.

The TRIPS core is an example of the Grid Processor family of designs [16], typically composed of an array of homogeneous execution nodes, each containing an integer ALU, a floating point unit, a set of reservation stations, and router connections at the input and output. Each reservation station has storage for an instruction and two source operands. When a reservation station contains a valid instruction and a pair of valid operands, the node can select the instruction for execution. After execution, the node can forward the result to any of the operand slots in local or remote reservation stations within the ALU array. The nodes are directly connected to their nearest neighbors, but the routing network can deliver results to any node in the array.

Figure 2b shows an expanded view of a TRIPS core and the primary memory system. The banked instruction cache on the left couples one bank per row, with an additional instruction cache bank to issue fetches of values from registers for injection into the ALU array. The banked register file above the ALU array holds a portion of the architectural state. To the right of the execution nodes are a set of banked level-1 data caches, which can be accessed by any ALU through the local grid routing network. Below the ALU array is the block control logic that is responsible for sequencing block execution and selecting the next block. The backside of the L1 caches is connected to secondary memory tiles through the chip-wide two-dimensional interconnection network. The switched network provides a robust and scalable connection to a large number of tiles, using less wiring than conventional dedicated channels between these components.

The TRIPS architecture contains three main types of resources. First, the hardcoded, non-polymorphous resources operate in the same manner, and present the same view of internal state, in all modes of operation. Some examples include the execution units within the nodes, the interconnect fabric between the nodes, and the L1 instruction cache banks. In the second type, polymorphous resources are used in all modes of operation, but can be configured to operate differently depending on the mode. The third type are resources that are not required for all modes and can be disabled when not in use for a given mode.

2.3 Polymorphous Resources

Frame Space: As shown in Figure 2c, each execution node contains a set of reservation stations. Reservation stations with the same index across all of the nodes combine to form a physical frame. For example, combining the first slot for all nodes in the grid forms frame 0. The frame space, or collection of frames, is a polymorphous resource in TRIPS, as it is managed differently by different modes to support efficient execution of alternate forms of parallelism.

Register File Banks: Although the programming model of each execution mode sees essentially the same number of architecturally visible registers, the hardware substrate provides many more. The extra copies can be used in different ways, such as for speculation or multithreading, depending on the mode of operation.

Block Sequencing Controls: The block sequencing controls determine when a block has completed execution, when a block should be deallocated from the frame space, and which block should be loaded next into the free frame space. To implement different modes of operation, a range of policies can govern these actions. The deallocation logic may be configured to allow a block to execute more than once, as is useful in streaming applications in which the same inner loop is applied to multiple data elements. The next block selector can be configured to limit the speculation, and to prioritize between multiple concurrently executing threads, which is useful for multithreaded parallel programs.

Memory Tiles: The TRIPS memory tiles can be configured to behave as NUCA-style L2 cache banks [12], scratchpad memories, or synchronization buffers for producer/consumer communication. In addition, the memory tiles closest to each processor present a special high-bandwidth interface that further optimizes their use as stream register files.

3 D-morph: Instruction-Level Parallelism

The desktop morph, or D-morph, of the TRIPS processor uses the polymorphous capabilities of the processor to run single-threaded codes efficiently by exploiting instruction-level parallelism. The TRIPS processor core is an instantiation of the Grid Processor family of architectures, and as such has similarities to previous work [16], but with some important differences as described in this section.

To achieve high ILP, the D-morph configuration treats the instruction buffers in the processor core as a large, distributed, instruction issue window, which uses the TRIPS ISA to enable out-of-order execution while avoiding the associative issue window lookups of conventional machines. To use the instruction buffers effectively as a large window, the D-morph must provide high-bandwidth instruction fetching, aggressive control and data speculation, and a high-bandwidth, low-latency memory system that preserves sequential memory semantics across a window of thousands of instructions.

3.1 Frame Space Management

By treating the instruction buffers at each ALU as a distributed issue window, orders-of-magnitude increases in window sizes are possible. This window is fundamentally a three-dimensional scheduling region, where the x- and y-dimensions correspond to the physical dimensions of the ALU array and the z-dimension corresponds to multiple instruction slots at each ALU node, as shown in Figure 2c. This three-dimensional region can be viewed as a series of frames, as shown in Figure 3b, in which each frame consists of one instruction buffer entry per ALU node, resulting in a 2-D slice of the 3-D scheduling region.

To fill one of these scheduling regions, the compiler schedules hyperblocks into a 3-D region, assigning each instruction to one node in the 3-D space. Hyperblocks are predicated, single-entry, multiple-exit regions formed by the compiler [14]. A 3-D region (the array and the set of frames) into which one hyperblock is mapped is called an architectural frame, or A-frame.

Figure 3. D-morph frame management: (a) a hyperblock dataflow graph; (b) hyperblocks H0 and H1 mapped into A-frames of the frame space, with register value R1 passed from H0 to H1 through the register file.

Figure 3a shows a four-instruction hyperblock (H0) mapped into A-frame 0 as shown in Figure 3b, where N0 and N2 are mapped to different buffer slots (frames) on the same physical ALU node. All communication within the block is determined by the compiler, which schedules operand routing directly from ALU to ALU. Consumers are encoded in the producer instructions as X, Y, and Z-relative offsets, as described in prior work [16]. Instructions can direct a produced value to any element within the same A-frame, using the lightweight routed network in the ALU array. The maximum number of frames that can be occupied by one program block (the maximum A-frame size) is architecturally limited by the number of instruction bits used to specify destinations, and physically limited by the total number of frames available in a given implementation. The current TRIPS ISA limits the number of instructions in a hyperblock to 128, and the current implementation limits the maximum number of frames per A-frame to 16, the maximum number of A-frames to 32, and provides 128 frames total.

3.2 Multiblock Speculation

The TRIPS instruction window size is much larger than the average hyperblock size that can be constructed. The hardware fills empty A-frames with speculatively mapped hyperblocks, predicting which hyperblock will be executed next, mapping it to an empty A-frame, and so on. The A-frames are treated as a circular buffer in which the oldest A-frame is non-speculative and all other A-frames are speculative (analogous to tasks in a Multiscalar processor [19]). When the A-frame holding the oldest hyperblock completes, the block is committed and removed. The next oldest hyperblock becomes non-speculative, and the released frames can be filled with a new speculative hyperblock. On a misprediction, all blocks past the offending prediction are squashed and restarted.

Since A-frame IDs are assigned dynamically and all intra-hyperblock communication occurs within a single A-frame, each producer instruction prepends its A-frame ID to the Z-coordinate of its consumer to form the correct instruction buffer address of the consumer. Values passed between hyperblocks are transmitted through the register file, as shown by the communication of R1 from H0 to H1 in Figure 3b. Such values are aggressively forwarded when they are produced, using the register stitch table that dynamically matches the register outputs of earlier hyperblocks to the register inputs of later hyperblocks.

3.3 High-Bandwidth Instruction Fetching

To fill the large distributed window, the D-morph requires high-bandwidth instruction fetch. The control model uses a program counter that points to hyperblock headers. When there is sufficient frame space to map a hyperblock, the control logic accesses a partitioned instruction cache by broadcasting the index of the hyperblock to all banks. Each bank then fetches a row's worth of instructions with a single access and streams it to the bank's respective row. Hyperblocks are encoded as VLIW-like blocks, along with a prepended header that contains the number of frames consumed by the block.

The next-hyperblock prediction is made using a highly tuned tournament exit predictor [10], which predicts a binary value that indicates the branch predicted to be the first to exit the hyperblock. The per-block accuracy of the exit predictor is shown in row 3 of Table 1; the predictor itself is described in more detail elsewhere [17]. The value generated by the exit predictor is used both to index into a BTB to obtain the next predicted hyperblock address, and also to avoid forwarding register outputs produced past the predicted branch to subsequent blocks.

3.4 Memory Interface

To support high ILP, the D-morph memory system must provide a high-bandwidth, low-latency data cache, and must maintain sequential memory semantics. As shown in Figure 2b, the right side of each TRIPS core contains distributed primary memory system banks that are tightly coupled to the processing logic for low latency. The banks are interleaved using the low-order bits of the cache index, and can process multiple non-conflicting accesses simultaneously. Each bank is coupled with MSHRs for the cache bank and a partition of the address-interleaved load/store queues that enforce ordering of loads and stores. The MSHRs, the load/store queues, and the cache banks all use the same interleaving scheme. Stores are written back to the cache from the LSQs upon block commit.

The secondary memory system in the D-morph configures the networked banks as a non-uniform cache access (NUCA) array [12], in which elements of a set are spread across multiple secondary banks, and are capable of migrating data on the two-dimensional switched network that connects the secondary banks. This network also provides a high-bandwidth link to each L1 bank for parallel L1 miss processing and fills. To summarize, with accurate exit prediction, high-bandwidth I-fetching, partitioned data caches, and concurrent execution of hyperblocks with inter-block value forwarding, the D-morph is able to use the instruction buffers as a polymorphous out-of-order issue window effectively, as shown in the next subsection.

3.5 D-morph Results

In this subsection, we measure the ILP achieved using the mechanisms described above. The results shown in this section assume a 4x4 (16-wide issue) core, with 128 physical frames, a 64KB L1 data cache that requires three cycles to access, a 64KB L1 instruction cache (both partitioned into 4 banks), 0.5 cycles per hop in the ALU array, a 10-cycle branch misprediction penalty, a 250Kb exit predictor, a 12-cycle access penalty to a 2MB L2 cache, and a 132-cycle main memory access penalty. Optimistic assumptions in the simulator currently include no modeling of TLBs or page faults, oracular load/store ordering, simulation of a centralized register file, and no issue of wrong-path instructions to the memory system. All of the binaries were compiled with the Trimaran tool set [24] (based on the Illinois Impact compiler [5]), and scheduled for the TRIPS processor with our custom scheduler/rewriter.

Benchmark               adpcm   ammp    art  bzip2  compress   dct  equake   gzip  hydro2d  m88ksim
Good insts/block         30.7    119   80.4   55.8      21.6   163    33.5   36.2      200     40.2
Avg. frames               2.4    5.2    3.2    2.8       1.3   6.0     2.1    3.1      7.4      2.3
Exit/target pred. acc.   0.72   0.94   0.99   0.74      0.84  0.99    0.97   0.84     0.97     0.95
# in window               116   1126   1706    364       129  1738     622    671     1573      796

Benchmark                 mcf  mgrid  mpeg2  parser     swim  tomcatv  turb3d  twolf  vortex   mean
Good insts/block         29.8    179   81.3    14.6      361      210     160   48.9    29.4   99.8
Avg. frames               2.2    6.9    3.8     1.3     11.8      7.4     6.4    2.6     2.0    4.2
Exit/target pred. acc.   0.91   0.99   0.88    0.93     0.99     0.98    0.94   0.76    0.99   0.91
# in window               462   1590    958     255     1928     1629    1399    361     918    965

Table 1. Execution characteristics of D-morph codes.

Figure 4. D-morph performance (IPC) as a function of A-frame count (1, 2, 4, 8, 16, and 32), with additional bars for perfect memory and perfect memory plus perfect branch prediction at 32 A-frames; integer benchmarks on the left, floating-point and Mediabench benchmarks on the right.

The first row of Table 1 shows the average number of useful dynamically executed instructions per block, discounting overhead instructions, instructions with false predicates, and instructions past a block exit. The second row shows the average dynamic number of frames allocated per block by our scheduler for a 4x4 grid. Using the steady-state block (exit) prediction accuracies shown in the third row, the benchmarks hold an average of 965 useful instructions in the distributed window, as shown in row 4 of Table 1.

Figure 4 shows how IPC scales as the number of A-frames is increased from 1 to 32, permitting deeper speculative execution. The integer benchmarks are shown on the left; the floating point and Mediabench [13] benchmarks are shown on the right. Each 32 A-frame bar also has two additional IPC values, showing the performance with perfect memory in the hashed fraction of each bar, and then adding perfect branch prediction, shown in white.
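The frame-space addressing described in Sections 3.1 and 3.2 can be made concrete with a small sketch: a producer instruction's X-, Y-, and Z-relative offsets, combined with the dynamically assigned A-frame ID prepended to the Z-coordinate, yield the consumer's instruction-buffer location. The function below is an illustrative assumption, not the actual TRIPS encoding; only the grid dimensions and the 16-frames-per-A-frame limit come from the text.

```python
# Hypothetical sketch of D-morph consumer addressing. The paper specifies
# X/Y/Z-relative consumer offsets and dynamic A-frame IDs; the concrete
# arithmetic below is an illustrative assumption.

GRID_DIM = 4             # 4x4 ALU array (16-wide issue core)
FRAMES_PER_AFRAME = 16   # maximum frames per A-frame (Section 3.1)

def consumer_buffer_address(prod_x, prod_y, prod_z, dx, dy, dz, aframe_id):
    """Resolve a producer's (dx, dy, dz) offset to an absolute (x, y, slot)
    buffer entry, prepending the A-frame ID to the block-local z-coordinate."""
    x, y, z = prod_x + dx, prod_y + dy, prod_z + dz
    assert 0 <= x < GRID_DIM and 0 <= y < GRID_DIM
    assert 0 <= z < FRAMES_PER_AFRAME
    slot = aframe_id * FRAMES_PER_AFRAME + z   # A-frame ID || z
    return (x, y, slot)

# A producer at node (1, 2), frame 3 of A-frame 5, targeting a consumer
# one column to the right and two frames deeper:
print(consumer_buffer_address(1, 2, 3, 1, 0, 2, 5))  # -> (2, 2, 85)
```

Because the A-frame ID occupies the high-order slot bits, intra-block offsets never escape the producer's own A-frame, which is what allows A-frames to be assigned dynamically.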

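The shared interleaving scheme of Section 3.4, in which the cache banks, MSHRs, and load/store queue partitions all select a bank from the low-order bits of the cache index, can be sketched as follows. The four banks match the prototype's four D-cache banks; the 64-byte line size is an assumption for illustration.

```python
# Sketch of address-interleaved bank selection (Section 3.4). Accesses that
# map to different banks can be processed simultaneously.
# The 64-byte line size is assumed, not taken from the paper.

NUM_BANKS = 4       # one D-cache bank (and LSQ partition) per row
LINE_BYTES = 64     # assumed cache line size

def bank_of(addr):
    """Low-order bits of the cache index pick the bank."""
    return (addr // LINE_BYTES) % NUM_BANKS

for addr in (0x1000, 0x1040, 0x1080, 0x10C0, 0x2000):
    print(hex(addr), "-> bank", bank_of(addr))
# 0x1000-0x10C0 fall in banks 0-3 and can proceed in parallel, while
# 0x2000 contends with 0x1000 for bank 0.
```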
Proceedings of the 30th Annual International Symposium on Computer Architecture (ISCA’03) 1063-6897/03 $17.00 © 2003 IEEE white. Increasing the number of A-frames provides a con- can instead store state from multiple non-speculative and sistent performance boost across many of the benchmarks, speculative blocks. The only additional frame support since it permits greater exploitation of ILP by providing a needed is thread-ID bits in the register stitching logic and larger window of instructions. Some benchmarks show no augmentations to the A-frame allocation logic. performance improvements beyond 16 A-frames (bzip2, Instruction control: The T-morph maintains n pro- m88ksim, and tomcatv), and a few reach their peak at 8 A- gram counters (where n is the number of concurrent frames (adpcm, gzip, twolf, and hydro2d). In such cases, threads allowed) and n global history shift registers in the large frame space is underutilized when running a sin- the exit predictor to reduce thread-induced mispredictions. gle thread, due to either low hyperblock predictability in The T-morph fetches the next block for a given thread us- some cases or a lack of program ILP in others. ing a prediction made by the shared exit predictor, and The graphs demonstrate that while control mispredic- maps it onto the array. In addition to the extra prediction tions cause large performance losses for the integer codes registers, n copies of the commit buffers and block control (close to 50% on average), the large window is able to tol- state must be provided for n hardware threads. erate memory latencies extremely well, resulting in negli- Memory: The memory system operates much the same gible slowdowns due to an imperfect memory system for as the D-morph, except that per-thread IDs on cache tags all benchmarks but mgrid. and LSQ CAMs are necessary to prevent illegal cross- thread interference, provided that shared address spaces 4 T-morph: Thread-Level Parallelism are implemented.

The T-morph is intended to provide higher processor 4.2 T-morph Results utilization by mapping multiple threads of control onto a To evaluate the performance of multi-programmed single TRIPS core. While similar to simultaneous multi- workloads running on the T-morph, we classified the ap- threading [23] in that the execution resources (ALUs) and plications as “high memory intensive” and “low memory memory banks are shared, the T-morph statically partitions intensive”, based on L2 cache miss rates. We picked eight the reservation station (issue window) and eliminates some different benchmarks and ran different combinations of replicated SMT structures, such as the reorder buffer. 2, 4 and 8 benchmarks executing concurrently. The high memory intensive benchmarks are arth, mcfh, equakeh, 4.1 T-Morph Implementation and tomcatvh. The low memory intensive benchmarks are compressl, bzip2l, parserl, and m88ksiml.Weex- There are multiple strategies for partitioning a TRIPS amine the performance obtained while executing multiple core to support multiple threads, two of which are row pro- threads concurrently and quantify the sources of perfor- cessors and frame processors. Row processors space-share mance degradation. Compared to a single thread executing the ALU array, allocating one or more rows per thread. in the D-morph, running threads concurrently introduces The advantage to this approach is that each thread has I- the following sources of performance loss: a) inter-thread cache and D-cache bandwidth and capacity proportional to contention for ALUs and routers in the grid, b) cache pol- the number of rows assigned to it. The disadvantage is that lution, c) pollution and interaction in the the distance to the register file is non-uniform, penalizing tables, and d) reduced speculation depth for each thread, the threads mapped to the bottom rows. 
Frame proces- since the number of available frames for each thread is re- sors, evaluated in this section, time-share the processor by duced. allocating threads to unique sets of physical frames. We Table 2 shows T-morph performance on a 4x4 TRIPS describe the polymorphous capabilities required for each core with parameters similar to those of the baseline D- of the classes of mechanisms below. morph. The second column lists the combined instruc- Frame space management: Instead of holding non- tion throughput of the running threads. The third column speculative and speculative hyperblocks for a single thread shows the sum of the IPCs of the benchmarks when each as in the D-morph, the physical frames are partitioned a is run on a separate core but with same number of frames priori and assigned to threads. For example, a TRIPS core as available to each thread in the T-morph. Comparing the can dedicate all 128 frames to a single thread in the D- throughput of column 3 with the throughput in column 2, morph, or 64 frames to each of two threads in the T-morph indicates the performance drop due to inter-thread interac- (uneven frame sharing is also possible). Within each tion in the T-morph. Column 4 shows the cumulative IPCs thread, the frames are further divided into some number of of the threads when each is run by itself on a TRIPS core A-frames and is allowed within each with all frames available to it. Comparison of this column thread. No additional register file space is required, since with column 4, indicates the performance drop incurred the same storage used to hold state for speculative blocks from both inter-thread interaction and reduced speculation

Proceedings of the 30th Annual International Symposium on Computer Architecture (ISCA'03) 1063-6897/03 $17.00 © 2003 IEEE

Benchmarks | T-morph IPC | Constant A-frames IPC | Scaled A-frames IPC | Overall Eff. (%) | Per-Thread Eff. (%) | Speedup

2 Threads:
bzipl, m88ksiml | 4.9 | 5.5 | 5.5 | 90 | 93, 86 | 1.8
parserl, m88ksiml | 3.7 | 3.8 | 4.1 | 90 | 88, 91 | 1.8
arth, compressl | 5.1 | 5.7 | 6.0 | 86 | 93, 62 | 1.6
mcfh, bzipl | 3.2 | 3.9 | 3.9 | 81 | 98, 75 | 1.7
arth, mcfh | 5.1 | 5.3 | 5.6 | 90 | 91, 87 | 1.8
equakeh, mcfh | 3.3 | 3.4 | 3.5 | 95 | 101, 83 | 1.8
MEAN | 4.7 | - | - | 87 | 84 | 1.7

4 Threads:
bzipl, m88ksiml, parserl, compressl | 6.1 | 6.7 | 8.4 | 72 | 79, 70, 59, 78 | 2.9
equakeh, arth, parserl, compressl | 6.1 | 7.0 | 10.0 | 61 | 68, 69, 38, 47 | 2.2
tomcatvh, mcfh, m88ksiml, bzipl | 8.3 | 10.7 | 15.0 | 55 | 54, 65, 55, 58 | 2.3
equakeh, arth, tomcatvh, mcfh | 9.0 | 10.5 | 16.6 | 54 | 60, 58, 51, 53 | 2.2
MEAN | 7.4 | - | - | 61 | 60 | 2.4

8 Threads:
art, tomcatv, bzip, m88ksim, equake, parser, compress, mcf | 9.8 | 17.7 | 25.0 | 39 | 40, 44, 34, 33, 50, 23, 26, 43 | 2.9

(The first three numeric columns report throughput in aggregate IPC for the T-morph, the constant A-frames configuration, and the scaled A-frames configuration, respectively.)

Table 2. T-morph thread efficiency and throughput.
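The overall-efficiency column of Table 2 follows directly from the column definitions in the text (column 2 divided by column 4); a minimal sketch, using the bzipl/m88ksiml row as input:

```python
def overall_efficiency(tmorph_ipc, standalone_ipc_sum):
    """Ratio of combined T-morph throughput (Table 2, column 2) to the
    summed IPCs of the same threads run alone with all frames (column 4)."""
    return tmorph_ipc / standalone_ipc_sum

# bzipl + m88ksiml with 2 threads: 4.9 combined IPC vs. 5.5 standalone.
print(round(overall_efficiency(4.9, 5.5) * 100))  # 89 (~90 in Table 2)
```

The speedup column is then roughly the thread count scaled by this efficiency, under the equal-running-time assumption stated in the text (2 x 0.9 gives the reported 1.8 for this pair).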

Our experiments showed that T-morph performance is largely insensitive to cache and branch predictor pollution, but is highly sensitive to instruction fetch bandwidth stalls.

Column 5 shows the overall T-morph efficiency, defined as the ratio of multithreading performance to the throughput of the threads running on independent cores (column 2 / column 4). Column 6 breaks this down further, showing the fraction of peak D-morph performance achieved by each thread when sharing a TRIPS core with other threads. The last column shows an estimate of the speedup provided by the T-morph versus running each of the applications one at a time on a single TRIPS core (with the assumption that each application has approximately the same running time). The overall efficiency varies from 80–100% with 2 threads down to 39% with 8 threads. Having the low memory benchmarks resident simultaneously provided the highest efficiency, while mixes of high memory benchmarks provided the lowest efficiency, due to increased T-morph cache contention. This effect is less pronounced in the 2-thread configurations, with the pairing of high memory benchmarks being equally efficient as the others. The overall speedup provided by multithreading ranges from a factor of 1.4 to 2.9, depending on the number of threads. In summary, most benchmarks do not completely exploit the deep speculation provided by all of the A-frames available in the D-morph, due to branch mispredictions. The T-morph converts these less useful A-frames to non-speculative computations when multiple threads or jobs are available. Future work will evaluate the T-morph on multithreaded parallel programs.

5 S-morph: Data-Level Parallelism

The S-morph is a configuration of the TRIPS processor that leverages the technology-scalable array of ALUs and the fast inter-ALU communication network for streaming media and scientific applications. These applications are typically characterized by data-level parallelism (DLP), including predictable loop-based control flow with large iteration counts [21], large data sets, regular access patterns, poor locality but tolerance to memory latency, and high computation intensity with tens to hundreds of arithmetic operations performed per element loaded from memory [18]. The S-morph was heavily influenced by the Imagine architecture [11] and uses the Imagine execution model, in which a set of stream kernels are sequenced by a control thread. Figure 5 highlights the features of the S-morph, which are further described below.

5.1 S-morph Mechanisms

Frame Space Management: Since the control flow of the programs is highly predictable, the S-morph fuses multiple A-frames to make a super A-frame, instead of using separate A-frames for speculation or multithreading. Inner loops of a streaming application are unrolled to fill the reservation stations within these super A-frames. Code required to set up the execution of the inner loops and to connect multiple loops can run in one of three ways: (1) embedded into the program that uses the frames for S-morph execution, (2) executed on a different core within the TRIPS chip, similar in function to the Imagine host processor, or (3) run within its own set of frames on the same core running the DLP kernels. In this third mode, a subset of the frames are dedicated to a data-parallel thread, while a different subset are dedicated to a sequential control thread.

Instruction Fetch: To reduce the power and instruction fetch bandwidth overhead of repeatedly fetching the same code block across inner-loop iterations, the S-morph employs mapping reuse, in which a block is kept in the reservation stations and used multiple times.

Benchmark | Kernel size (insts) | Inputs/Outputs | Constants | Unrolling factor | Compute insts per block | Block size | Total constants | # of Revitalizations
convert | 15 | 3/3 | 9 | 16 | 240 | 303 | 144 | 171
dct | 70 | 8/8 | 10 | 8 | 560 | 580 | 80 | 128
fir16 | 34 | 1/1 | 16 | 16 | 544 | 620 | 256 | 512
fft8 | 104 | 16/16 | 16 | 4 | 416 | 570 | 64 | 128
idea | 112 | 2/2 | 52 | 8 | 896 | 1020 | 416 | 512
transform | 37 | 8/8 | 21 | 16 | 592 | 740 | 336 | 64

(Columns 2–4 characterize one original iteration of each kernel; columns 5–8 characterize the fused, unrolled iterations.)

Table 3. Characteristics of S-morph codes.
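The repeat/revitalization mechanism of Section 5.1 can be modeled as a small execution loop. The classes below are an illustrative software sketch of the behavior described in the text (the block stays mapped, constants persist across revitalizations, operands are reset each iteration), not the hardware design.

```python
class ReservationStation:
    def __init__(self, constant=None):
        self.constant = constant   # survives revitalization
        self.operand = None        # cleared on each revitalization

def run_super_aframe(stations, n_iterations, body):
    """Model of repeat<N>: the block stays mapped in the reservation
    stations; a revitalization signal resets operands (keeping constants
    resident) until the iteration counter reaches zero."""
    counter = n_iterations
    results = []
    while counter > 0:
        results.append(body(stations))   # one unrolled iteration fires
        counter -= 1
        for rs in stations:              # revitalize: reset operands,
            rs.operand = None            # constants stay in place
    return results                       # super A-frame is then cleared

stations = [ReservationStation(constant=3), ReservationStation()]
out = run_super_aframe(stations, 4, lambda st: st[0].constant * 2)
print(out)  # [6, 6, 6, 6]
```

The "# of Revitalizations" column in Table 3 corresponds to `n_iterations` here: the number of times the unrolled block fires without being refetched.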

The S-morph implements mapping reuse with a repeat<N> instruction (similar to RPTB in the TMS320C54x [1]), which indicates that the next block of instructions constitutes a loop that is to execute a finite number of times N, where N can be determined at runtime and is used to set an iteration counter. When all of the instructions from an iteration complete, the hardware decrements the iteration counter and triggers a revitalization signal, which resets the reservation stations while maintaining the constant values residing in them, so that the instructions may fire again when new operands arrive for the next iteration. When the iteration counter reaches zero, the super A-frame is cleared and the hardware maps the next block onto the ALUs for execution.

Memory System: Similar to Smart Memories [15], the TRIPS S-morph implements the Imagine stream register file (SRF) using a subset of the on-chip memory tiles. S-morph memory tile configuration includes turning off tag checks to allow direct data array access and augmenting the cache line replacement state machine to include DMA-like capabilities. Enhanced transfer mechanisms include block transfer between the tile and remote storage (main memory or other tiles), strided access to remote storage (gather/scatter), and indirect gather/scatter in which the remote addresses to access are contained within a subset of the tile's storage. Like the Imagine programming model, we expect that transfers between the tile and remote memory will be orchestrated by a separate thread.

As shown in Figure 5b, memory tiles adjacent to the processor core are used for the SRF and are augmented with dedicated wide channels (256 bits per row, assuming four 64-bit channels for the 4x4 array) into the ALU array for increased SRF bandwidth. The S-morph DLP loops can execute an SRF read that acts as a load-multiple-word instruction by transferring an entire SRF line into the grid, spreading it across the ALUs in a fixed pattern within a row. Once within the grid, data can be easily moved to any ALU using the high-bandwidth in-grid routing network, rather than requiring a data switch between the SRF banks and the ALU array. Streams are striped across the multiple banks of the SRF. Stores to the SRF are aggregated in a store buffer and then transmitted to the SRF bank over narrow channels to the memory tile. Memory tiles not adjacent to the processing core can be configured as a conventional level-2 cache, still accessible to the unchanged level-1 caches. The conventional cache hierarchy can be used to store irregularly accessed data structures, such as texture maps.

[Figure 5: (a) multiple frames (frames 0–7) fused into a super A-frame; (b) on-chip memory tiles adjacent to the processor configured as SRF banks with wide channels to the core, and the remaining tiles serving as the L2 NUCA cache above the D-morph L1/L2 memory system.]

Figure 5. Polymorphism for S-morph.

5.2 Results

We evaluate the performance of the TRIPS S-morph on a set of streaming kernels, shown in Table 3, extracted from the Mediabench benchmark suite [13]. These kernels were selected to represent different computation-to-memory ratios, varying from less than 1 to more than 14. The kernels are hand-coded in a TRIPS meta-assembly language, then mapped to the ALU array using a custom scheduler akin to the D-morph scheduler, and simulated using an event-driven simulator that models the TRIPS S-morph.

Program characteristics: Columns 2–4 of Table 3 show the intrinsic characteristics of one iteration of the kernel code, including the number of arithmetic operations, the number of bytes read from/written to memory, and the number of unique run-time constants required. The unrolling factor for each inner loop is determined by the size of the kernel and the capacity of the super A-frame (a 4x4 grid with 128 frames, or 2K instructions). The useful instructions per block include only computation instructions, while the block size numbers include overhead instructions for memory access and data movement within the grid. The total constant count indicates the number of reservation stations that must be filled with constant values from the register file for each iteration of the unrolled loop. Most of these register moves can be eliminated by allowing the constants to remain in reservation stations across revitalizations.

The number of revitalizations corresponds to the number of iterations of the unrolled loop. The unrolling of the kernels is based on 64-Kbyte input and output streams, both being striped and stored in the SRF.

[Figure 6: S-morph performance, in compute instructions per cycle, for convert, dct, fft8, fir16, idea, transform, and their mean, comparing the D-morph, S-morph, S-morph ideal, 1/4 LD B/W, 4X ST B/W, and NoRevitalize configurations.]

Figure 6. S-morph performance.

Performance analysis: Figure 6 compares the performance of the D-morph to the S-morph on a 4x4 TRIPS core with 128 frames, a 32-entry store buffer, an 8-cycle revitalization delay, and a pipelined 7-cycle SRF access delay. The D-morph configuration in this experiment assumes perfect L1 caches with 3-cycle hit latencies. Figure 6 shows that the S-morph sustains an average of 7.4 compute instructions per cycle (not counting overhead instructions or address compute instructions), a factor of 2.4 higher than the D-morph. A more idealized S-morph configuration that employs 256 frames and no revitalization latency improves performance to 9 compute ops/cycle, 26% higher than the realistic S-morph. An alternative approach to S-morph polymorphism is the Tarantula architecture [8], which exploits data-level parallelism by augmenting the processor core of an Alpha 21464 with a dedicated vector data path of 32 ALUs, an approach that sustains between 10 and 20 FLOPS per cycle. Our results indicate that the TRIPS S-morph can provide competitive performance on data-parallel workloads; an 8x4 grid consisting of 32 ALUs sustains, on average, 15 compute ops per cycle. Furthermore, the polymorphous approach provides superior area efficiency compared to Tarantula, which contains two large heterogeneous cores.

SRF bandwidth: To investigate the sensitivity of the S-morph to SRF bandwidth, we examined two alternative design points: load bandwidth decreased to 64 bits per row (1/4 LD B/W) and store bandwidth increased to 256 bits per row (4X ST B/W). Decreasing the load bandwidth drops performance by 5% to 31%, with a mean drop of 27%. Augmenting the store bandwidth increases average IPC to 7.65, corresponding to a 5% performance improvement on average. However, on an 8x8 TRIPS core, experiments show that increased store bandwidth can improve performance by 22%. As expected, compute-intensive kernels, such as fir and idea, show little sensitivity to SRF bandwidth.

Revitalization: As shown by the NoRevitalize bar in Figure 6, eliminating revitalization causes S-morph performance to drop by a factor of 5 on average. This effect is due to the additional latency of mapping instructions into the grid, as well as redistributing the constants from the register file, on every unrolled iteration. For example, the unrolled inner loop of the dct kernel requires 37 cycles to fetch the 580 instructions (assuming 16 instructions fetched per cycle) plus another 10 cycles to fetch the 80 constants from the banked register file. Much of this overhead is exposed because, unlike the D-morph with its speculative instruction fetch, the S-morph has hard synchronization boundaries between iterations. One solution that we are examining to further reduce the impact of instruction fetch is to overlap revitalization and execution. Further extensions to this configuration can allow the individual ALUs at each node to act as separate MIMD processors. This technique would benefit applications with frequent data-dependent control flow, such as real-time graphics and network processing workloads.

6 Scalability to Larger Cores

While the experiments in Sections 3–5 reflect the performance achievable on three application classes with 16 ALUs, the question of granularity still remains. Given a fixed single-chip silicon budget, how many processors should be on the chip, and how powerful should each processor be? To address this question, we first examined the performance of each application class as a function of the architecture granularity by varying the issue width of a TRIPS core. We use this information to determine the sweet spot for each application class and then describe how this sweet spot can be achieved using the configurability of the TRIPS system.

Figures 7a and 7b show the aggregate performance of ILP and DLP workloads on TRIPS cores of different dimensions, including 2x2, 4x4, 8x4, and 8x8. The selected benchmarks represent the general behavior of the benchmark suite as a whole. Unsurprisingly, the benchmarks with low instruction-level concurrency see little benefit from TRIPS cores larger than 4x4, and a class of them (represented by adpcm) sees little benefit beyond 2x2. Benchmarks with higher concurrency, such as swim and idea, see diminishing returns beyond 8x4, while others, such as mgrid and fft, continue to benefit from increasing ALU density.

[Figure 7: (a) ILP on a single thread and (b) DLP show per-core performance (IPC and compute instructions per cycle, respectively) for adpcm, vortex, MEAN-Dlow, mgrid, swim, MEAN-Dhigh, fft8, idea, and MEAN-S on 2x2, 4x4, 8x4, and 8x8 cores; (c) ILP on multiple threads shows aggregate IPC for 1-way through 8-way threading per core, with the total thread count atop each bar.]

Figure 7. TRIPS single-core scalability and CMP throughput.
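The design-point comparison behind Figure 7c and Table 4 amounts to weighing cores-per-chip against per-core throughput. The sketch below uses Table 4's core counts but illustrative, made-up per-core IPC values, so only the structure of the comparison, not the numbers, reflects the measured results.

```python
# Cores per 400mm^2 chip follow Table 4 (8x 2x2, 4x 4x4, 4x 8x4, 2x 8x8);
# the per-core IPC values at a fixed thread count are hypothetical.
designs = {
    "2x2": {"cores": 8, "ipc_per_core": 1.5},
    "4x4": {"cores": 4, "ipc_per_core": 4.0},
    "8x4": {"cores": 4, "ipc_per_core": 6.0},
    "8x8": {"cores": 2, "ipc_per_core": 9.0},
}

def chip_throughput(design):
    """Aggregate chip IPC: number of cores times per-core throughput."""
    return design["cores"] * design["ipc_per_core"]

best = max(designs, key=lambda k: chip_throughput(designs[k]))
print(best, chip_throughput(designs[best]))  # 8x4 24.0
```

With the paper's measured data, this same calculation is what identifies the 8x4 topology as the best design point for thread-rich workloads.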

Table 4 shows the best-suited configurations for the different applications in column 2.

Grid Dimensions | Preferred Applications | # TRIPS cores | Total L2 (MB)
2x2 | adpcm | 8 | 3.90
4x4 | vortex | 4 | 3.97
8x4 | swim, idea | 4 | 1.25
8x8 | mgrid, fft | 2 | 1.25
21264 | - | 10 | 3.97

Table 4. TRIPS CMP Designs.

The variations across applications and application domains demand both large coarse-grain processors (8x4 and 8x8) and small fine-grain processors (2x2). Nonetheless, for single-threaded ILP and DLP applications, the larger processors provide better aggregate performance at the expense of low utilization for some applications. For multithreaded and multiprogrammed workloads, the decision is more complex. Table 4 shows several alternative TRIPS chip designs, ranging from 8 2x2 TRIPS cores to 2 8x8 cores, assuming a 400mm² die in a 100nm technology. The equivalent real estate could be used to construct 10 Alpha 21264 processors and 4MB of on-chip L2 cache.

Figure 7c shows the instruction throughput (in aggregate IPC), with each bar representing the core dimensions, each cluster of bars showing the number of threads per core, and the number atop each bar showing the total number of threads (# cores times threads per core). The 2x2 array is the worst performing when a large number of threads are available. The 4x4 and 8x4 configurations have the same number of cores due to changing on-chip cache capacity, but the 8x4 and 8x8 have the same total number of ALUs and instruction buffers across the full chip. With ample threads and at most 8 threads per core, the best design point is the 8x4 topology, no matter how many total threads are available (e.g., of all the bars labeled 16 threads, the 8x4 configuration is the highest-performing). These results validate the large-core approach; one 8x4 core has higher performance for both single-threaded ILP and DLP workloads than a smaller core, and shows higher throughput than many smaller cores using the same area when many threads are available. We are currently exploring and evaluating space-based subdivision for both TLP and DLP applications beyond the time-based multithreading approach described in this paper.

7 Conclusions and Future Directions

The polymorphous TRIPS system enables a single set of processing and storage elements to be configured for multiple application domains. Unlike prior configurable systems that aggregate small primitive components into larger processors, TRIPS starts with a large, technology-scalable core that can be logically subdivided to support ILP, TLP, and DLP. The goal of this system is to achieve performance and efficiency approaching that of special-purpose systems. In this paper, we proposed a small set of mechanisms (managing reservation stations and memory tiles) for a large-core processor that enables adaptation into three modes for these diverse application domains. We have shown that all three modes achieve the goal of high performance on their respective application domains. The D-morph sustains 1–12 IPC (an average of 4.4) on serial codes, the T-morph achieves average thread efficiencies of 87%, 60%, and 39% for two, four, and eight threads, respectively, and the S-morph executes as many as 12 arithmetic instructions per clock on a 16-ALU core, and an average of 23 on an 8x8 core.

While we have described the TRIPS system as having three distinct personalities (the D, T, and S-morphs), in reality each of these configurations is composed of basic mechanisms that can be mixed and matched across execution models. In addition, there are also minor reconfigurations, such as adjusting the level-2 cache capacity, that do not require a change in the programming model. A major challenge for polymorphous systems is designing the interfaces between the software and the configurable hardware, as well as determining when and how to initiate reconfiguration. At one extreme, application programmers and compiler writers can be given a fixed number of static morphs; programs are written and compiled to these static machine models. At the other extreme, a polymorphous system could expose all of the configurable mechanisms to the application layers, enabling them to select the configurations and the time of reconfiguration.

We are exploring both the hardware and software design issues in the course of our development of the TRIPS prototype system.

Acknowledgments

We thank the anonymous reviewers for their suggestions that helped improve the quality of this paper. This research is supported by the Defense Advanced Research Projects Agency under contract F33615-01-C-1892, NSF instrumentation grant EIA-9985991, NSF CAREER grants CCR-9985109 and CCR-9984336, two IBM University Partnership awards, and grants from the Alfred P. Sloan Foundation, the Peter O'Donnell Foundation, and the Research Council.

References

[1] TMS320C54x DSP Reference Set, Volume 2: Mnemonic Instruction Set. Literature Number SPRU172C, March 2001.
[2] L. A. Barroso, K. Gharachorloo, R. McNamara, A. Nowatzyk, S. Qadeer, B. Sano, S. Smith, R. Stets, and B. Verghese. Piranha: A scalable architecture based on single-chip multiprocessing. In Proceedings of the 27th Annual International Symposium on Computer Architecture, pages 282–293, June 2000.
[3] V. Baumgarte, F. May, A. Nückel, M. Vorbach, and M. Weinhardt. PACT XPP – A self-reconfigurable data processing architecture. In 1st International Conference on Engineering of Reconfigurable Systems and Algorithms, June 2001.
[4] C. Caşcaval, J. Castanos, L. Ceze, M. Denneau, M. Gupta, D. Lieber, J. E. Moreira, K. Strauss, and H. S. Warren, Jr. Evaluation of a multithreaded architecture for cellular computing. In Proceedings of the 8th International Symposium on High Performance Computer Architecture, pages 311–322, January 2002.
[5] P. P. Chang, S. A. Mahlke, W. Y. Chen, N. J. Warter, and W.-m. W. Hwu. IMPACT: An architectural framework for multiple-instruction-issue processors. In Proceedings of the 18th Annual International Symposium on Computer Architecture, pages 266–275, May 1991.
[6] M. Cintra, J. F. Martínez, and J. Torrellas. Architectural support for scalable speculative parallelization in shared-memory multiprocessors. In Proceedings of the 27th Annual International Symposium on Computer Architecture, pages 13–24, June 2000.
[7] C. Ebeling, D. C. Cronquist, and P. Franklin. Configurable computing: The catalyst for high-performance architectures. In International Conference on Application-Specific Systems, Architectures, and Processors, pages 364–372, 1997.
[8] R. Espasa, F. Ardanaz, J. Emer, S. Felix, J. Gago, R. Gramunt, I. Hernandez, T. Juan, G. Lowney, M. Mattina, and A. Seznec. Tarantula: A vector extension to the Alpha architecture. In Proceedings of the 29th International Symposium on Computer Architecture, pages 281–292, May 2002.
[9] S. C. Goldstein, H. Schmit, M. Budiu, S. Cadambi, M. Moe, and R. Taylor. PipeRench: A reconfigurable architecture and compiler. IEEE Computer, 33(4):70–77, April 2000.
[10] Q. Jacobson, S. Bennett, N. Sharma, and J. E. Smith. Control flow speculation in multiscalar processors. In Proceedings of the 3rd International Symposium on High Performance Computer Architecture, February 1997.
[11] B. Khailany, W. J. Dally, S. Rixner, U. J. Kapasi, P. Mattson, J. Namkoong, J. D. Owens, B. Towles, and A. Chang. Imagine: Media processing with streams. IEEE Micro, 21(2):35–46, March/April 2001.
[12] C. Kim, D. Burger, and S. W. Keckler. An adaptive, non-uniform cache structure for wire-delay dominated on-chip caches. In 10th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), pages 211–222, October 2002.
[13] C. Lee, M. Potkonjak, and W. H. Mangione-Smith. MediaBench: A tool for evaluating and synthesizing multimedia and communications systems. In International Symposium on Microarchitecture, pages 330–335, 1997.
[14] S. A. Mahlke, D. C. Lin, W. Y. Chen, R. E. Hank, and R. A. Bringmann. Effective compiler support for predicated execution using the hyperblock. In Proceedings of the 25th International Symposium on Microarchitecture, pages 45–54, 1992.
[15] K. Mai, T. Paaske, N. Jayasena, R. Ho, W. J. Dally, and M. Horowitz. Smart Memories: A modular reconfigurable architecture. In Proceedings of the 27th Annual International Symposium on Computer Architecture, pages 161–171, June 2000.
[16] R. Nagarajan, K. Sankaralingam, D. Burger, and S. W. Keckler. A design space evaluation of grid processor architectures. In Proceedings of the 34th Annual International Symposium on Microarchitecture, pages 40–51, December 2001.
[17] N. Ranganathan, R. Nagarajan, D. Burger, and S. W. Keckler. Combining hyperblocks and exit prediction to increase front-end bandwidth and performance. Technical Report TR-02-41, Department of Computer Sciences, The University of Texas at Austin, September 2002.
[18] S. Rixner, W. J. Dally, U. J. Kapasi, B. Khailany, A. Lopez-Lagunas, P. R. Mattson, and J. D. Owens. A bandwidth-efficient architecture for media processing. In Proceedings of the 31st International Symposium on Microarchitecture, pages 3–13, December 1998.
[19] G. S. Sohi, S. E. Breach, and T. N. Vijaykumar. Multiscalar processors. In Proceedings of the 22nd International Symposium on Computer Architecture, pages 414–425, June 1995.
[20] J. G. Steffan, C. B. Colohan, A. Zhai, and T. C. Mowry. A scalable approach to thread-level speculation. In Proceedings of the 27th Annual International Symposium on Computer Architecture, pages 1–12, June 2000.
[21] D. Talla, L. John, and D. Burger. Bottlenecks in multimedia processing with SIMD style extensions and architectural enhancements. IEEE Transactions on Computers, to appear, pages 35–46, 2003.
[22] J. M. Tendler, J. S. Dodson, J. J. S. Fields, H. Le, and B. Sinharoy. POWER4 system microarchitecture. IBM Journal of Research and Development, 26(1):5–26, January 2001.
[23] D. M. Tullsen, S. J. Eggers, and H. M. Levy. Simultaneous multithreading: Maximizing on-chip parallelism. In Proceedings of the 22nd International Symposium on Computer Architecture, pages 392–403, June 1995.
[24] V. Kathail, M. Schlansker, and B. R. Rau. HPL-PD architecture specification: Version 1.1. Technical Report HPL-93-80(R.1), Hewlett-Packard Laboratories, February 2000.
[25] E. Waingold, M. Taylor, D. Srikrishna, V. Sarkar, W. Lee, V. Lee, J. Kim, M. Frank, P. Finch, R. Barua, J. Babb, S. Amarsinghe, and A. Agarwal. Baring it all to software: RAW machines. IEEE Computer, 30(9):86–93, September 1997.
