Exploiting ILP, TLP, and DLP with the Polymorphous TRIPS Architecture
Karthikeyan Sankaralingam Ramadass Nagarajan Haiming Liu Changkyu Kim Jaehyuk Huh Doug Burger Stephen W. Keckler Charles R. Moore
Computer Architecture and Technology Laboratory Department of Computer Sciences The University of Texas at Austin [email protected] - www.cs.utexas.edu/users/cart
Abstract

This paper describes the polymorphous TRIPS architecture, which can be configured for different granularities and types of parallelism. TRIPS contains mechanisms that enable the processing cores and the on-chip memory system to be configured and combined in different modes for instruction, data, or thread-level parallelism. To adapt to small and large-grain concurrency, the TRIPS architecture contains four out-of-order, 16-wide-issue Grid Processor cores, which can be partitioned when easily extractable fine-grained parallelism exists. This approach to polymorphism provides better performance across a wide range of application types than an approach in which many small processors are aggregated to run workloads with irregular parallelism. Our results show that high performance can be obtained in each of the three modes–ILP, TLP, and DLP–demonstrating the viability of the polymorphous coarse-grained approach for future microprocessors.

1 Introduction

General-purpose microprocessors owe their success to their ability to run many diverse workloads well. Today, many application-specific processors, such as desktop, network, server, scientific, graphics, and digital signal processors, have been constructed to match the particular parallelism characteristics of their application domains. Building processors that are not only general purpose for single-threaded programs but for many types of concurrency as well would provide substantive benefits in terms of system flexibility as well as reduced design and mask costs. Unfortunately, design trends are applying pressure in the opposite direction: toward designs that are more specialized, not less. This performance fragility, in which applications incur large swings in performance based on how well they map to a given design, is the result of the combination of two trends: the diversification of workloads (media, streaming, network, desktop) and the emergence of chip multiprocessors (CMPs), for which the number and granularity of processors is fixed at processor design time.

One strategy for combating processor fragility is to build a heterogeneous chip, which contains multiple processing cores, each designed to run a distinct class of workloads effectively. The proposed Tarantula processor is one such example of integrated heterogeneity [8]. The two major downsides to this approach are (1) increased hardware complexity, since there is little design reuse between the two types of processors, and (2) poor resource utilization when the application mix contains a balance different than that ideally suited to the underlying heterogeneous hardware.

An alternative approach to designing an integrated solution using multiple heterogeneous processors is to build one or more homogeneous processors on a die, which mitigates the aforementioned complexity problem. When an application maps well onto the homogeneous substrate, the utilization problem is solved, as the application is not limited to one of several heterogeneous processors. To solve the fragility problem, however, the homogeneous hardware must be able to run a wide range of application classes effectively. We define this architectural polymorphism as the capability to configure hardware for efficient execution across broad classes of applications.

A key question is what granularity of processors and memories on a CMP is best for polymorphous capabilities. Should future billion-transistor chips contain thousands of fine-grain processing elements (PEs) or far fewer extremely coarse-grain processors? The success or failure of polymorphous capabilities will have a strong effect on the answer to these questions. Figure 1 shows a range of points in the spectrum of PE granularities that are possible for a 400mm2 chip in 100nm technology. Although other possible topologies certainly exist, the five shown in the diagram represent a good cross-section of the overall space:
Proceedings of the 30th Annual International Symposium on Computer Architecture (ISCA'03) 1063-6897/03 $17.00 © 2003 IEEE

Figure 1. Granularity of parallel processing elements on a chip: (a) FPGA (millions of gates); (b) PIM (256 processing elements); (c) fine-grain CMP (64 in-order cores); (d) coarse-grain CMP (16 out-of-order cores); (e) TRIPS (4 ultra-large cores). Designs toward (a) exploit fine-grain parallelism more effectively; designs toward (e) run more applications effectively.
a) Ultra-fine-grained FPGAs.

b) Hundreds of primitive processors connected to memory banks, such as a processor-in-memory (PIM) architecture, or reconfigurable ALU arrays such as RaPiD [7], Piperench [9], or PACT [3].

c) Tens of simple in-order processors, such as in RAW [25] or Piranha [2] architectures.

d) Coarse-grained architectures consisting of 10-20 4-issue cores, such as the Power4 [22], Cyclops [4], MultiScalar processors [19], other proposed speculatively-threaded CMPs [6, 20], and the polymorphous Smart Memories [15] architecture.

e) Wide-issue processors with many ALUs each, such as Grid Processors [16].

The finer-grained architectures on the left of this spectrum can offer high performance on applications with fine-grained (data) parallelism, but will have difficulty achieving good performance on general-purpose and serial applications. For example, a PIM topology has high peak performance, but its performance on control-bound codes with irregular memory accesses, such as compression or compilation, would be dismal at best. At the other extreme, coarser-grained architectures traditionally have not had the capability to use internal hardware to show high performance on fine-grained, highly parallel applications.

Polymorphism can bridge this dichotomy with either of two competing approaches. A synthesis approach uses a fine-grained CMP to exploit applications with fine-grained, regular parallelism, and tackles irregular, coarser-grain parallelism by synthesizing multiple processing elements into larger "logical" processors. This approach builds hardware more to the left on the spectrum in Figure 1 and emulates hardware farther to the right. A partitioning approach implements a coarse-grained CMP in hardware, and logically partitions the large processors to exploit finer-grain parallelism when it exists.

Regardless of the approach, a polymorphous architecture will not outperform custom hardware meant for a given application, such as graphics processing. However, a successful polymorphous system should run well across many application classes, ideally running with only small performance degradations compared to the performance of customized solutions for each application.

This paper proposes and describes the polymorphous TRIPS architecture, which uses the partitioning approach, combining coarse-grained polymorphous Grid Processor cores with an adaptive, polymorphous on-chip memory system. Our goal is to design cores that are both as large and as few as possible, providing maximal single-thread performance, while remaining partitionable to exploit fine-grained parallelism. Our results demonstrate that this partitioning approach solves the fragility problem by using polymorphous mechanisms to yield high performance for both coarse and fine-grained concurrent applications. To be successful, the competing approach of synthesizing coarser-grain processors from fine-grained components must overcome the challenges of distributed control, long interaction latencies, and synchronization overheads.

The rest of this paper describes the polymorphous hardware and configurations used to exploit different types of parallelism across a broad spectrum of application types. Section 2 describes both the planned TRIPS silicon prototype and its polymorphous hardware resources, which permit flexible execution over highly variable application domains. These resources support three modes of execution that we call major morphs, each of which is well suited for a different type of parallelism: instruction-level parallelism with the desktop or D-morph (Section 3), thread-level parallelism with the threaded or T-morph (Section 4), and data-level parallelism with the streaming or S-morph (Section 5). Section 6 shows how performance increases in the three morphs as each TRIPS core is scaled from a 16-wide up to an even coarser-grain, 64-wide issue processor. We conclude in Section 7 that by building large, partitionable, polymorphous cores, a single homogeneous design can exploit many classes of concurrency, making this approach promising for solving the emerging challenge of processor fragility.

2 The TRIPS Architecture

The TRIPS architecture uses large, coarse-grained processing cores to achieve high performance on single-threaded applications with high ILP, and augments them with polymorphous features that enable the core to be subdivided for explicitly concurrent applications at different granularities. Contrary to conventional large-core designs with centralized components that are difficult to scale, the TRIPS architecture is heavily partitioned to avoid large centralized structures and long wire runs. These partitioned computation and memory elements are connected by point-to-point communication channels that are exposed to software schedulers for optimization.

The key challenge in defining the polymorphous features is balancing their granularity so that workloads involving different levels of ILP, TLP, and DLP can maximize their use of the available resources, while at the same time avoiding escalating complexity and non-scalable structures. The TRIPS system employs coarse-grained polymorphous features, at the level of memory banks and instruction storage, to minimize both hardware and software complexity and configuration overheads. The remainder of this section describes the high-level TRIPS architecture and highlights the polymorphous resources used to construct the D, T, and S-morphs described in Sections 3–5.

2.1 Core Execution Model

The TRIPS architecture is fundamentally block oriented. In all modes of operation, programs compiled for TRIPS are partitioned into large blocks of instructions with a single entry point, no internal loops, and possibly multiple exit points, as found in hyperblocks [14]. For instruction and thread level parallel programs, blocks commit atomically and interrupts are block precise, meaning that they are handled only at block boundaries. For all modes of execution, the compiler is responsible for statically scheduling each block of instructions onto the computational engine such that inter-instruction dependences are explicit. Each block has a static set of state inputs, and a potentially variable set of state outputs that depends upon the exit point of the block. At runtime, the basic operational flow of the processor includes fetching a block from memory, loading it into the computational engine, executing it to completion, committing its results to the persistent architectural state if necessary, and then proceeding to the next block.

2.2 Architectural Overview

Figure 2a shows a diagram of the TRIPS architecture that will be implemented in a prototype chip. While the architecture is scalable to both larger dimensions and high clock rates, due to both the partitioned structures and short point-to-point wiring connections, the TRIPS prototype chip will consist of four polymorphous 16-wide cores, an array of 32KB memory tiles connected by a routed network, and a set of distributed memory controllers with channels to external memory. The prototype chip will be built using a 100nm process and is targeted for completion in 2005.

Figure 2. TRIPS architecture overview: (a) TRIPS chip, with four TRIPS cores, an array of memory tiles (M), and DRAM interfaces; (b) TRIPS core, with banked instruction cache (ICache-0 through ICache-M), next block predictor, stitch table, block control, register file, banked data caches (DCache-0 through DCache-3) with load/store queues (LSQ0 through LSQ3), and the L2 cache; (c) execution node, with router, control, and instruction and operand storage organized as frames 0 through 127.

The TRIPS core is an example of the Grid Processor family of designs [16], which are typically composed of an array of homogeneous execution nodes, each containing an integer ALU, a floating point unit, a set of reservation stations, and router connections at the input and output. Each reservation station has storage for an instruction and two source operands. When a reservation station contains a valid instruction and a pair of valid operands, the node can select the instruction for execution. After execution, the node can forward the result to any of the operand slots in local or remote reservation stations within the ALU array. The nodes are directly connected to their nearest neighbors, but the routing network can deliver results to any node in the array.

The banked instruction cache on the left couples one bank per row, with an additional instruction cache bank to issue fetches of values from registers for injection into the ALU array. The banked register file above the ALU array holds a portion of the architectural state. To the right of the execution nodes are a set of banked level-1 data caches, which can be accessed by any ALU through the local grid routing network. Below the ALU array is the block control logic that is responsible for sequencing block execution and selecting the next block. The backside of the L1 caches are connected to secondary memory tiles through the chip-wide two-dimensional interconnection network. The switched network provides a robust and scalable connection to a large number of tiles, using less wiring than conventional dedicated channels between these components.

The TRIPS architecture contains three main types of resources. First, the hardcoded, non-polymorphous resources operate in the same manner, and present the same view of internal state, in all modes of operation. Some examples include the execution units within the nodes, the interconnect fabric between the nodes, and the L1 instruction cache banks. In the second type, polymorphous resources are used in all modes of operation, but can be configured to operate differently depending on the mode. The third type are resources that are not required for all modes and can be disabled when not in use for a given mode.

2.3 Polymorphous Resources

Frame Space: As shown in Figure 2c, each execution node contains a set of reservation stations. Reservation stations with the same index across all of the nodes combine to form a physical frame. For example, combining the first slot for all nodes in the grid forms frame 0. The frame space, or collection of frames, is a polymorphous resource in TRIPS, as it is managed differently by different modes to support efficient execution of alternate forms of parallelism.

Register File Banks: Although the programming model of each execution mode sees essentially the same number of architecturally visible registers, the hardware substrate provides many more. The extra copies can be used in different ways, such as for speculation or multithreading, depending on the mode of operation.

Block Sequencing Controls: The block sequencing controls determine when a block has completed execution, when a block should be deallocated from the frame space, and which block should be loaded next into the free frame space. To implement different modes of operation, a range of policies can govern these actions. The deallocation logic may be configured to allow a block to execute more than once, as is useful in streaming applications in which the same inner loop is applied to multiple data elements. The next block selector can be configured to limit the speculation, and to prioritize between multiple concurrently executing threads, which is useful for multithreaded parallel programs.

Memory Tiles: The TRIPS memory tiles can be configured to behave as NUCA-style L2 cache banks [12], scratchpad memory, or synchronization buffers for producer/consumer communication. In addition, the memory tiles closest to each processor present a special high bandwidth interface that further optimizes their use as stream register files.

3 D-morph: Instruction-Level Parallelism

The desktop morph, or D-morph, of the TRIPS processor uses the polymorphous capabilities of the processor to run single-threaded codes efficiently by exploiting instruction-level parallelism. The TRIPS processor core is an instantiation of the Grid Processor family of architectures, and as such has similarities to previous work [16], but with some important differences as described in this section.

To achieve high ILP, the D-morph configuration treats the instruction buffers in the processor core as a large, distributed, instruction issue window, which uses the TRIPS ISA to enable out-of-order execution while avoiding the associative issue window lookups of conventional machines. To use the instruction buffers effectively as a large window, the D-morph must provide high-bandwidth instruction fetching, aggressive control and data speculation, and a high-bandwidth, low-latency memory system that preserves sequential memory semantics across a window of thousands of instructions.

3.1 Frame Space Management

By treating the instruction buffers at each ALU as a distributed issue window, orders-of-magnitude increases in window sizes are possible. This window is fundamentally a three-dimensional scheduling region, where the x- and y-dimensions correspond to the physical dimensions of the ALU array and the z-dimension corresponds to multiple instruction slots at each ALU node, as shown in Figure 2c. This three-dimensional region can be viewed as a series of frames, as shown in Figure 3b, in which each frame con-
sists of one instruction buffer entry per ALU node, resulting in a 2-D slice of the 3-D scheduling region.

To fill one of these scheduling regions, the compiler schedules hyperblocks into a 3-D region, assigning each instruction to one node in the 3-D space. Hyperblocks are predicated, single entry, multiple exit regions formed by the compiler [14]. A 3-D region (the array and the set of frames) into which one hyperblock is mapped is called an architectural frame, or A-frame.

Figure 3a shows a four-instruction hyperblock (H0) mapped into A-frame 0 as shown in Figure 3b, where N0 and N2 are mapped to different buffer slots (frames) on the same physical ALU node. All communication within the block is determined by the compiler, which schedules operand routing directly from ALU to ALU. Consumers are encoded in the producer instructions as X-, Y-, and Z-relative offsets, as described in prior work [16]. Instructions can direct a produced value to any element within the same A-frame, using the lightweight routed network in the ALU array. The maximum number of frames that can be occupied by one program block (the maximum A-frame size) is architecturally limited by the number of instruction bits to specify destinations, and physically limited by the total number of frames available in a given implementation. The current TRIPS ISA limits the number of instructions in a hyperblock to 128, and the current implementation limits the maximum number of frames per A-frame to 16, the maximum number of A-frames to 32, and provides 128 frames total.

Figure 3. D-morph frame management: (a) the dataflow graph of a hyperblock H0 (nodes N0 through N6, register value R1); (b) frames grouped into A-frames, with H0 mapped into A-frame 0 and a successor hyperblock H1 into A-frame 1, communicating R1 through the register file.

3.2 Multiblock Speculation

The TRIPS instruction window size is much larger than the average hyperblock size that can be constructed. The hardware fills empty A-frames with speculatively mapped hyperblocks, predicting which hyperblock will be executed next, mapping it to an empty A-frame, and so on. The A-frames are treated as a circular buffer in which the oldest A-frame is non-speculative and all other A-frames are speculative (analogous to tasks in a Multiscalar processor [19]). When the A-frame holding the oldest hyperblock completes, the block is committed and removed. The next oldest hyperblock becomes non-speculative, and the released frames can be filled with a new speculative hyperblock. On a misprediction, all blocks past the offending prediction are squashed and restarted.

Since A-frame IDs are assigned dynamically and all intra-hyperblock communication occurs within a single A-frame, each producer instruction prepends its A-frame ID to the Z-coordinate of its consumer to form the correct instruction buffer address of the consumer. Values passed between hyperblocks are transmitted through the register file, as shown by the communication of R1 from H0 to H1 in Figure 3b. Such values are aggressively forwarded when they are produced, using the register stitch table that dynamically matches the register outputs of earlier hyperblocks to the register inputs of later hyperblocks.

3.3 High-Bandwidth Instruction Fetching

To fill the large distributed window, the D-morph requires high-bandwidth instruction fetch. The control model uses a program counter that points to hyperblock headers. When there is sufficient frame space to map a hyperblock, the control logic accesses a partitioned instruction cache by broadcasting the index of the hyperblock to all banks. Each bank then fetches a row's worth of instructions with a single access and streams it to the bank's respective row. Hyperblocks are encoded as VLIW-like blocks, along with a prepended header that contains the number of frames consumed by the block.

The next-hyperblock prediction is made using a highly tuned tournament exit predictor [10], which predicts a binary value that indicates the branch predicted to be the first to exit the hyperblock. The per-block accuracy of the exit predictor is shown in row 3 of Table 1; the predictor itself is described in more detail elsewhere [17]. The value generated by the exit predictor is used both to index into a BTB to obtain the next predicted hyperblock address, and also to avoid forwarding register outputs produced past the predicted branch to subsequent blocks.

3.4 Memory Interface

To support high ILP, the D-morph memory system must provide a high-bandwidth, low-latency data cache, and must maintain sequential memory semantics. As shown in Figure 2b, the right side of each TRIPS core contains distributed primary memory system banks, which are tightly coupled to the processing logic for low latency. The banks are interleaved using the low-order bits of the cache index, and can process multiple non-conflicting accesses simulta-
Benchmark   Good insts/block   Exit/target pred. acc.   Avg. frames   # in window
adpcm       30.7               0.72                     2.4           116
ammp        119                0.94                     5.2           1126
art         80.4               0.99                     3.2           1706
bzip2       55.8               0.74                     2.8           364
compress    21.6               0.84                     1.3           129
dct         163                0.99                     6.0           1738
equake      33.5               0.97                     2.1           622
gzip        36.2               0.84                     3.1           671
hydro2d     200                0.97                     7.4           1573
m88k        40.2               0.95                     2.3           796
mcf         29.8               0.91                     2.2           462
mgrid       179                0.99                     6.9           1590
mpeg2       81.3               0.88                     3.8           958
parser      14.6               0.93                     1.3           255
swim        361                0.99                     11.8          1928
tomcatv     210                0.98                     7.4           1629
turb3d      160                0.94                     6.4           1399
twolf       48.9               0.76                     2.6           361
vortex      29.4               0.99                     2.0           918
mean        99.8               0.91                     4.2           965

Table 1. Execution characteristics of D-morph codes.

Figure 4. D-morph performance as a function of A-frame count. Each group of bars plots IPC for 1, 2, 4, 8, 16, and 32 A-frames, with the 32 A-frame bar extended to show perfect memory (Perf Mem) and perfect memory plus branch prediction (Perf Mem+BP). Integer benchmarks (adpcm, bzip2, compress, gzip, m88ksim, mcf, parser, twolf, vortex, and their mean) appear on the left; floating-point and Mediabench benchmarks (ammp, art, dct, equake, hydro2d, mgrid, mpeg2, swim, tomcatv, turb3d, and their mean) on the right.
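The A-frame management policy of Section 3.2 — a circular buffer whose oldest entry is non-speculative, with commit at the head and a squash of everything past a mispredicted exit — can be sketched as follows. This is an illustrative model only; the class and method names are not part of the TRIPS design.

```python
from collections import deque

class AFrameWindow:
    """Sketch of D-morph multiblock speculation (Section 3.2).

    A-frames form a circular buffer: the oldest holds the lone
    non-speculative hyperblock; all younger A-frames hold
    speculatively mapped hyperblocks.
    """

    def __init__(self, max_a_frames=32):
        self.max_a_frames = max_a_frames
        self.a_frames = deque()      # oldest (non-speculative) block at the left

    def map_next_block(self, predicted_block):
        """Fill a free A-frame with the predicted next hyperblock."""
        if len(self.a_frames) < self.max_a_frames:
            self.a_frames.append(predicted_block)
            return True
        return False                 # window full: no free A-frame

    def commit_oldest(self):
        """Oldest hyperblock completed: commit it and free its A-frame.

        The next oldest block implicitly becomes non-speculative, and the
        released frames can be refilled with a new speculative hyperblock.
        """
        return self.a_frames.popleft()

    def squash_after(self, index):
        """Misprediction in A-frame `index`: squash all younger blocks."""
        squashed = []
        while len(self.a_frames) > index + 1:
            squashed.append(self.a_frames.pop())
        return squashed              # to be restarted down the correct path

window = AFrameWindow(max_a_frames=4)
for block in ["H0", "H1", "H2", "H3"]:
    window.map_next_block(block)
window.commit_oldest()               # H0 commits; H1 becomes non-speculative
window.squash_after(0)               # mispredicted exit in H1: H2, H3 squashed
```

In this sketch, as in the text, deallocation only ever happens at the head (commit) or past a mispredicted exit (squash); a full window simply stalls further block mapping.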
neously. Each bank is coupled with MSHRs for the cache bank and a partition of the address-interleaved load/store queues that enforce ordering of loads and stores. The MSHRs, the load/store queues, and the cache banks all use the same interleaving scheme. Stores are written back to the cache from the LSQs upon block commit.

The secondary memory system in the D-morph configures the networked banks as a non-uniform cache access (NUCA) array [12], in which elements of a set are spread across multiple secondary banks, and are capable of migrating data on the two-dimensional switched network that connects the secondary banks. This network also provides a high-bandwidth link to each L1 bank for parallel L1 miss processing and fills. To summarize, with accurate exit prediction, high-bandwidth I-fetching, partitioned data caches, and concurrent execution of hyperblocks with inter-block value forwarding, the D-morph is able to use the instruction buffers as a polymorphous out-of-order issue window effectively, as shown in the next subsection.

3.5 D-morph Results

In this subsection, we measure the ILP achieved using the mechanisms described above. The results shown in this section assume a 4x4 (16-wide issue) core, with 128 physical frames, a 64KB L1 data cache that requires three cycles to access, a 64KB L1 instruction cache (both partitioned into 4 banks), 0.5 cycles per hop in the ALU array, a 10-cycle branch misprediction penalty, a 250Kb exit predictor, a 12-cycle access penalty to a 2MB L2 cache, and a 132-cycle main memory access penalty. Optimistic assumptions in the simulator currently include no modeling of TLBs or page faults, oracular load/store ordering, simulation of a centralized register file, and no issue of wrong-path instructions to the memory system. All of the binaries were compiled with the Trimaran tool set [24] (based on the Illinois Impact compiler [5]), and scheduled for the TRIPS processor with our custom scheduler/rewriter.

The first row of Table 1 shows the average number of useful dynamically executed instructions per block, discounting overhead instructions, instructions with false predicates, and instructions past a block exit. The second row shows the average dynamic number of frames allocated per block by our scheduler for a 4x4 grid. Using the steady-state block (exit) prediction accuracies shown in the third row, the benchmarks hold, on average, 965 useful instructions in the distributed window, as shown in row 4 of Table 1.

Figure 4 shows how IPC scales as the number of A-frames is increased from 1 to 32, permitting deeper speculative execution. The integer benchmarks are shown on the left; the floating point and Mediabench [13] benchmarks are shown on the right. Each 32 A-frame bar also has two additional IPC values, showing the performance with perfect memory in the hashed fraction of each bar, and then adding perfect branch prediction, shown in white. Increasing the number of A-frames provides a consistent performance boost across many of the benchmarks, since it permits greater exploitation of ILP by providing a larger window of instructions. Some benchmarks show no performance improvements beyond 16 A-frames (bzip2, m88ksim, and tomcatv), and a few reach their peak at 8 A-frames (adpcm, gzip, twolf, and hydro2d). In such cases, the large frame space is underutilized when running a single thread, due to either low hyperblock predictability in some cases or a lack of program ILP in others.

The graphs demonstrate that while control mispredictions cause large performance losses for the integer codes (close to 50% on average), the large window is able to tolerate memory latencies extremely well, resulting in negligible slowdowns due to an imperfect memory system for all benchmarks but mgrid.

4 T-morph: Thread-Level Parallelism

The T-morph is intended to provide higher processor utilization by mapping multiple threads of control onto a single TRIPS core. While similar to simultaneous multithreading [23] in that the execution resources (ALUs) and memory banks are shared, the T-morph statically partitions the reservation stations (issue window) and eliminates some replicated SMT structures, such as the reorder buffer.

4.1 T-Morph Implementation

There are multiple strategies for partitioning a TRIPS core to support multiple threads, two of which are row processors and frame processors. Row processors space-share the ALU array, allocating one or more rows per thread. The advantage of this approach is that each thread has I-cache and D-cache bandwidth and capacity proportional to the number of rows assigned to it. The disadvantage is that the distance to the register file is non-uniform, penalizing the threads mapped to the bottom rows. Frame processors, evaluated in this section, time-share the processor by allocating threads to unique sets of physical frames. We describe the polymorphous capabilities required for each of the classes of mechanisms below.

Frame space management: Instead of holding non-speculative and speculative hyperblocks for a single thread as in the D-morph, the physical frames are partitioned a priori and assigned to threads. For example, a TRIPS core can dedicate all 128 frames to a single thread in the D-morph, or 64 frames to each of two threads in the T-morph (uneven frame sharing is also possible). Within each thread, the frames are further divided into some number of A-frames, and speculative execution is allowed within each thread. No additional register file space is required, since the same storage used to hold state for speculative blocks can instead store state from multiple non-speculative and speculative blocks. The only additional frame support needed is thread-ID bits in the register stitching logic and augmentations to the A-frame allocation logic.

Instruction control: The T-morph maintains n program counters (where n is the number of concurrent threads allowed) and n global history shift registers in the exit predictor to reduce thread-induced mispredictions. The T-morph fetches the next block for a given thread using a prediction made by the shared exit predictor, and maps it onto the array. In addition to the extra prediction registers, n copies of the commit buffers and block control state must be provided for n hardware threads.

Memory: The memory system operates much the same as in the D-morph, except that per-thread IDs on cache tags and LSQ CAMs are necessary to prevent illegal cross-thread interference, provided that shared address spaces are implemented.

4.2 T-morph Results

To evaluate the performance of multi-programmed workloads running on the T-morph, we classified the applications as "high memory intensive" and "low memory intensive", based on L2 cache miss rates. We picked eight different benchmarks and ran different combinations of 2, 4, and 8 benchmarks executing concurrently. The high memory intensive benchmarks are arth, mcfh, equakeh, and tomcatvh; the low memory intensive benchmarks are compressl, bzip2l, parserl, and m88ksiml. We examine the performance obtained while executing multiple threads concurrently and quantify the sources of performance degradation. Compared to a single thread executing in the D-morph, running threads concurrently introduces the following sources of performance loss: a) inter-thread contention for ALUs and routers in the grid, b) cache pollution, c) pollution and interaction in the branch predictor tables, and d) reduced speculation depth for each thread, since the number of available frames for each thread is reduced.

Table 2 shows T-morph performance on a 4x4 TRIPS core with parameters similar to those of the baseline D-morph. The second column lists the combined instruction throughput of the running threads. The third column shows the sum of the IPCs of the benchmarks when each is run on a separate core but with the same number of frames as available to each thread in the T-morph; comparing this column with column 2 indicates the performance drop due to inter-thread interaction in the T-morph. Column 4 shows the cumulative IPCs of the threads when each is run by itself on a TRIPS core with all frames available to it; comparing column 2 with this column indicates the performance drop incurred from both inter-thread interaction and reduced speculation
Proceedings of the 30th Annual International Symposium on Computer Architecture (ISCA'03) 1063-6897/03 $17.00 © 2003 IEEE

                                          Throughput (aggregate IPC)
Benchmarks                                T-morph  Constant   Scaled     Overall          Per-Thread         Speedup
                                                   A-frames   A-frames   Efficiency (%)   Efficiency (%)

2 Threads
bzip_l, m88ksim_l                         4.9      5.5        5.5        90               93, 86             1.8
parser_l, m88ksim_l                       3.7      3.8        4.1        90               88, 91             1.8
art_h, compress_l                         5.1      5.7        6.0        86               93, 62             1.6
mcf_h, bzip_l                             3.2      3.9        3.9        81               98, 75             1.7
art_h, mcf_h                              5.1      5.3        5.6        90               91, 87             1.8
equake_h, mcf_h                           3.3      3.4        3.5        95               101, 83            1.8
MEAN                                      4.7      -          -          87               84                 1.7

4 Threads
bzip_l, m88ksim_l, parser_l, compress_l   6.1      6.7        8.4        72               79, 70, 59, 78     2.9
equake_h, art_h, parser_l, compress_l     6.1      7.0        10.0       61               68, 69, 38, 47     2.2
tomcatv_h, mcf_h, m88ksim_l, bzip_l       8.3      10.7       15.0       55               54, 65, 55, 58     2.3
equake_h, art_h, tomcatv_h, mcf_h         9.0      10.5       16.6       54               60, 58, 51, 53     2.2
MEAN                                      7.4      -          -          61               60                 2.4

8 Threads
art, tomcatv, bzip, m88ksim,              9.8      17.7       25.0       39               40, 44, 34, 33,    2.9
equake, parser, compress, mcf                                                             50, 23, 26, 43
Table 2. T-morph thread efficiency and throughput.
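The efficiency and speedup figures in Table 2 follow directly from the throughput columns as defined in the text (overall efficiency is column 2 divided by column 4; speedup assumes all applications have similar running times). A minimal sketch of the arithmetic, with the helper name ours and the input values taken from the bzip_l/m88ksim_l row:

```python
# Reproducing the derived Table 2 metrics for one workload mix.
def tmorph_metrics(tmorph_ipc, scaled_ipc_sum, n_threads):
    """Overall efficiency = multithreaded throughput / sum of
    single-thread IPCs with all frames (column 2 / column 4).
    Speedup estimate assumes equal per-application running times."""
    efficiency = tmorph_ipc / scaled_ipc_sum
    speedup = n_threads * efficiency
    return efficiency, speedup

# bzip_l + m88ksim_l: T-morph IPC 4.9, scaled A-frames sum 5.5
eff, sp = tmorph_metrics(tmorph_ipc=4.9, scaled_ipc_sum=5.5, n_threads=2)
print(f"efficiency = {eff:.0%}, speedup = {sp:.1f}")  # ~89%, 1.8
```

The table's reported values (90% and 1.8) match this arithmetic to within rounding.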
in the T-morph. Our experiments showed that T-morph performance is largely insensitive to cache and branch predictor pollution, but is highly sensitive to instruction fetch bandwidth stalls.

Column 5 shows the overall T-morph efficiency, defined as the ratio of multithreading performance to the throughput of threads running on independent cores (column 2 / column 4). Column 6 breaks this down further, showing the fraction of peak D-morph performance achieved by each thread when sharing a TRIPS core with other threads. The last column shows an estimate of the speedup provided by the T-morph versus running each of the applications one at a time on a single TRIPS core (with the assumption that each application has approximately the same running time). The overall efficiency varies from 80-100% with 2 threads down to 39% with 8 threads. Having the low memory benchmarks resident simultaneously provided the highest efficiency, while mixes of high memory benchmarks provided the lowest efficiency, due to increased T-morph cache contention. This effect is less pronounced in the 2-thread configurations, with the pairing of high memory benchmarks being as efficient as the other pairings. The overall speedup provided by multithreading ranges from a factor of 1.4 to 2.9, depending on the number of threads. In summary, most benchmarks do not completely exploit the deep speculation provided by all of the A-frames available in the D-morph, due to branch mispredictions. The T-morph converts these less useful A-frames to non-speculative computation when multiple threads or jobs are available. Future work will evaluate the T-morph on multithreaded parallel programs.

5 S-morph: Data-Level Parallelism

The S-morph is a configuration of the TRIPS processor that leverages the technology-scalable array of ALUs and the fast inter-ALU communication network for streaming media and scientific applications. These applications are typically characterized by data-level parallelism (DLP), including predictable loop-based control flow with large iteration counts [21], large data sets, regular access patterns, poor locality but tolerance to memory latency, and high computation intensity with tens to hundreds of arithmetic operations performed per element loaded from memory [18]. The S-morph was heavily influenced by the Imagine architecture [11] and uses the Imagine execution model, in which a set of stream kernels is sequenced by a control thread. Figure 5 highlights the features of the S-morph, which are further described below.

5.1 S-morph Mechanisms

Frame space management: Since the control flow of the programs is highly predictable, the S-morph fuses multiple A-frames to make a super A-frame, instead of using separate A-frames for speculation or multithreading. Inner loops of a streaming application are unrolled to fill the reservation stations within these super A-frames. Code required to set up the execution of the inner loops and to connect multiple loops can run in one of three ways: (1) embedded into the program that uses the frames for S-morph execution, (2) executed on a different core within the TRIPS chip, similar in function to the Imagine host processor, or (3) run within its own set of frames on the same core running the DLP kernels. In this third mode, a subset of the frames is dedicated to a data-parallel thread, while a different subset is dedicated to a sequential control thread.

Instruction fetch: To reduce the power and instruction fetch bandwidth overhead of repeatedly fetching the same code block across inner-loop iterations, the S-morph employs mapping reuse, in which a block is kept in the reservation stations and used multiple times. The S-morph im-
            Original Iteration               Fused Iterations                                   # of
Benchmark   Kernel size  Inputs/  Constants  Unrolling  Compute insts  Block  Total            Revitalizations
            (insts)      Outputs             factor     per block      size   Constants
convert     15           3/3      9          16         240            303    144              171
dct         70           8/8      10         8          560            580    80               128
fir16       34           1/1      16         16         544            620    256              512
fft8        104          16/16    16         4          416            570    64               128
idea        112          2/2      52         8          896            1020   416              512
transform   37           8/8      21         16         592            740    336              64
Table 3. Characteristics of S-morph codes.
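The fused-iteration columns of Table 3 follow from the per-iteration numbers and the unrolling factor: compute instructions per block are the kernel size times the unrolling factor, and total constants are the per-iteration constants times the unrolling factor. A small consistency check over the table's values (data transcribed from Table 3; variable names ours):

```python
# Each entry: (kernel insts, constants, unroll factor,
#              compute insts per block, total constants)
kernels = {
    "convert":   (15,   9, 16, 240, 144),
    "dct":       (70,  10,  8, 560,  80),
    "fir16":     (34,  16, 16, 544, 256),
    "fft8":      (104, 16,  4, 416,  64),
    "idea":      (112, 52,  8, 896, 416),
    "transform": (37,  21, 16, 592, 336),
}
for name, (insts, consts, unroll, compute, total) in kernels.items():
    # Unrolling replicates both the kernel body and its constants.
    assert insts * unroll == compute, name
    assert consts * unroll == total, name
print("all rows consistent")
```

The remaining block size beyond the compute instructions is occupied by the setup and communication code described above.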
increased SRF bandwidth. The S-morph DLP loops can execute an SRF read that acts as a load-multiple-word instruction by transferring an entire SRF line into the grid,