Exploiting ILP, TLP, and DLP with the Polymorphous TRIPS Architecture

Karthikeyan Sankaralingam Ramadass Nagarajan Haiming Liu Changkyu Kim Jaehyuk Huh Doug Burger Stephen W. Keckler Charles R. Moore

Computer Architecture and Technology Laboratory Department of Computer Sciences The University of Texas at Austin [email protected] - www.cs.utexas.edu/users/cart

Abstract

This paper describes the polymorphous TRIPS architecture, which can be configured for different granularities and types of parallelism. TRIPS contains mechanisms that enable the processing cores and the on-chip memory system to be configured and combined in different modes for instruction, data, or thread-level parallelism. To adapt to small and large-grain concurrency, the TRIPS architecture contains four out-of-order, 16-wide-issue Grid Processor cores, which can be partitioned when easily extractable fine-grained parallelism exists. This approach to polymorphism provides better performance across a wide range of application types than an approach in which many small processors are aggregated to run workloads with irregular parallelism. Our results show that high performance can be obtained in each of the three modes–ILP, TLP, and DLP–demonstrating the viability of the polymorphous coarse-grained approach for future microprocessors.

1 Introduction

General-purpose microprocessors owe their success to their ability to run many diverse workloads well. Today, many application-specific processors, such as desktop, network, server, scientific, graphics, and digital signal processors, have been constructed to match the particular parallelism characteristics of their application domains. Building processors that are not only general purpose for single-threaded programs but for many types of concurrency as well would provide substantive benefits in terms of system flexibility as well as reduced design and mask costs.

Unfortunately, design trends are applying pressure in the opposite direction: toward designs that are more specialized, not less. This performance fragility, in which applications incur large swings in performance based on how well they map to a given design, is the result of the combination of two trends: the diversification of workloads (media, streaming, network, desktop) and the emergence of chip multiprocessors (CMPs), for which the number and granularity of processors is fixed at design time.

One strategy for combating processor fragility is to build a heterogeneous chip, which contains multiple processing cores, each designed to run a distinct class of workloads effectively. The proposed Tarantula processor is one such example of integrated heterogeneity [8]. The two major downsides to this approach are (1) increased hardware complexity, since there is little design reuse between the two types of processors, and (2) poor resource utilization when the application mix contains a balance different than that ideally suited to the underlying heterogeneous hardware.

An alternative approach to designing an integrated solution using multiple heterogeneous processors is to build one or more homogeneous processors on a die, which mitigates the aforementioned complexity problem. When an application maps well onto the homogeneous substrate, the utilization problem is solved, as the application is not limited to one of several heterogeneous processors. To solve the fragility problem, however, the homogeneous hardware must be able to run a wide range of application classes effectively. We define this architectural polymorphism as the capability to configure hardware for efficient execution across broad classes of applications.

A key question is what granularity of processors and memories on a CMP is best for polymorphous capabilities. Should future billion-transistor chips contain thousands of fine-grain processing elements (PEs) or far fewer extremely coarse-grain processors? The success or failure of polymorphous capabilities will have a strong effect on the answer to these questions. Figure 1 shows a range of points in the spectrum of PE granularities that are possible for a 400mm2 chip in 100nm technology. Although other possible topologies certainly exist, the five shown in the diagram represent a good cross-section of the overall space:

Proceedings of the 30th Annual International Symposium on Computer Architecture (ISCA'03) 1063-6897/03 $17.00 © 2003 IEEE

Figure 1. Granularity of parallel processing elements on a chip. The spectrum runs from (a) FPGA (millions of gates), through (b) PIM (256 processing elements), (c) fine-grain CMP (64 in-order cores), and (d) coarse-grain CMP (16 out-of-order cores), to (e) TRIPS (4 ultra-large cores). Designs to the left exploit fine-grain parallelism more effectively; designs to the right run more applications effectively.

a) Ultra-fine-grained FPGAs.

b) Hundreds of primitive processors connected to memory banks, such as a processor-in-memory (PIM) architecture, or reconfigurable ALU arrays such as RaPiD [7], Piperench [9], or PACT [3].

c) Tens of simple in-order processors, such as in the RAW [25] or Piranha [2] architectures.

d) Coarse-grained architectures consisting of 10-20 4-issue cores, such as the Power4 [22], Cyclops [4], Multiscalar processors [19], other proposed speculatively-threaded CMPs [6, 20], and the polymorphous Smart Memories [15] architecture.

e) Wide-issue processors with many ALUs each, such as Grid Processors [16].

The finer-grained architectures on the left of this spectrum can offer high performance on applications with fine-grained (data) parallelism, but will have difficulty achieving good performance on general-purpose and serial applications. For example, a PIM topology has high peak performance, but its performance on control-bound codes with irregular memory accesses, such as compression or compilation, would be dismal at best. At the other extreme, coarser-grained architectures traditionally have not had the capability to use internal hardware to show high performance on fine-grained, highly parallel applications.

Polymorphism can bridge this dichotomy with either of two competing approaches. A synthesis approach uses a fine-grained CMP to exploit applications with fine-grained, regular parallelism, and tackles irregular, coarser-grain parallelism by synthesizing multiple processing elements into larger "logical" processors. This approach builds hardware more to the left on the spectrum in Figure 1 and emulates hardware farther to the right. A partitioning approach implements a coarse-grained CMP in hardware, and logically partitions the large processors to exploit finer-grain parallelism when it exists. Regardless of the approach, a polymorphous architecture will not outperform custom hardware meant for a given application, such as graphics processing. However, a successful polymorphous system should run well across many application classes, ideally running with only small performance degradations compared to the performance of customized solutions for each application.

This paper proposes and describes the polymorphous TRIPS architecture, which uses the partitioning approach, combining coarse-grained polymorphous Grid Processor cores with an adaptive, polymorphous on-chip memory system. Our goal is to design cores that are both as large and as few as possible, providing maximal single-thread performance, while remaining partitionable to exploit fine-grained parallelism. Our results demonstrate that this partitioning approach solves the fragility problem by using polymorphous mechanisms to yield high performance for both coarse and fine-grained concurrent applications. To be successful, the competing approach of synthesizing coarser-grain processors from fine-grained components must overcome the challenges of distributed control, long interaction latencies, and synchronization overheads.

The rest of this paper describes the polymorphous hardware and configurations used to exploit different types of parallelism across a broad spectrum of application types. Section 2 describes both the planned TRIPS silicon prototype and its polymorphous hardware resources, which permit flexible execution over highly variable application domains. These resources support three modes of execution that we call major morphs, each of which is well suited for a different type of parallelism: instruction-level parallelism with the desktop or D-morph (Section 3), thread-level parallelism with the threaded or T-morph (Section 4), and data-level parallelism with the streaming or S-morph (Section 5). Section 6 shows how performance increases in the three morphs as each TRIPS core is scaled from a 16-wide up to an even coarser-grain, 64-wide issue processor. We conclude in Section 7 that by building large, partitionable, polymorphous cores, a single homogeneous design can exploit many classes of concurrency, making this approach promising for solving the emerging challenge of processor fragility.

2 The TRIPS Architecture

The TRIPS architecture uses large, coarse-grained processing cores to achieve high performance on single-threaded applications with high ILP, and augments them with polymorphous features that enable the core to be subdivided for explicitly concurrent applications at different granularities. Contrary to conventional large-core designs with centralized components that are difficult to scale, the TRIPS architecture is heavily partitioned to avoid large centralized structures and long wire runs. These partitioned computation and memory elements are connected by point-to-point communication channels that are exposed to software schedulers for optimization.

The key challenge in defining the polymorphous features is balancing their appropriate granularity so that workloads involving different levels of ILP, TLP, and DLP can maximize their use of the available resources, and at the same time avoid escalating complexity and non-scalable structures. The TRIPS system employs coarse-grained polymorphous features, at the level of memory banks and instruction storage, to minimize both software complexity and hardware complexity and configuration overheads. The remainder of this section describes the high-level TRIPS architecture, and highlights the polymorphous resources used to construct the D, T, and S-morphs described in Sections 3–5.

2.1 Core Execution Model

The TRIPS architecture is fundamentally block oriented. In all modes of operation, programs compiled for TRIPS are partitioned into large blocks of instructions with a single entry point, no internal loops, and possibly multiple exit points, as found in hyperblocks [14]. For instruction and thread level parallel programs, blocks commit atomically and interrupts are block precise, meaning that they are handled only at block boundaries. For all modes of execution, the compiler is responsible for statically scheduling each block of instructions onto the computational engine such that inter-instruction dependences are explicit. Each block has a static set of state inputs, and a potentially variable set of state outputs that depends upon the exit point from the block. At runtime, the basic operational flow of the processor includes fetching a block from memory, loading it into the computational engine, executing it to completion, committing its results to the persistent architectural state if necessary, and then proceeding to the next block.

2.2 Architectural Overview

Figure 2a shows a diagram of the TRIPS architecture that will be implemented in a prototype chip. While the architecture is scalable to both larger dimensions and higher clock rates, due to both the partitioned structures and the short point-to-point wiring connections, the TRIPS prototype chip will consist of four polymorphous 16-wide cores, an array of 32KB memory tiles connected by a routed network, and a set of distributed memory controllers with channels to external memory. The prototype chip will be built in a 100nm process and is targeted for completion in 2005.

Figure 2. TRIPS architecture overview: (a) the TRIPS chip, with four cores, an array of memory tiles (M), and DRAM interfaces; (b) a TRIPS core, with a banked instruction cache (ICache-0 through ICache-M), next-block predictor, stitch table, block control logic, banked L1 data caches (DCache-0 through DCache-3) with per-bank load/store queues (LSQ0 through LSQ3), and the L2; (c) an execution node, with instruction and operand storage for frames 0 through 127, control logic, and a router.

The TRIPS core is an example of the Grid Processor family of designs [16], typically composed of an array of homogeneous execution nodes, each containing an integer ALU, a floating point unit, a set of reservation stations, and router connections at the input and output. Each reservation station has storage for an instruction and two source operands. When a reservation station contains a valid instruction and a pair of valid operands, the node can select the instruction for execution. After execution, the node can forward the result to any of the operand slots in local or remote reservation stations within the ALU array. The nodes are directly connected to their nearest neighbors, but the routing network can deliver results to any node in the array.

Figure 2b shows an expanded view of a TRIPS core and the primary memory system. The banked instruction cache on the left couples one bank per row, with an additional instruction cache bank to issue fetches of values from registers for injection into the ALU array. The banked register file above the ALU array holds a portion of the architectural state. To the right of the execution nodes are a set of banked level-1 data caches, which can be accessed by any ALU through the local grid routing network. Below the ALU array is the block control logic that is responsible for sequencing block execution and selecting the next block. The backside of the L1 caches is connected to secondary memory tiles through the chip-wide two-dimensional interconnection network. The switched network provides a robust and scalable connection to a large number of tiles, using less wiring than conventional dedicated channels between these components.

The TRIPS architecture contains three main types of resources. First, the hardcoded, non-polymorphous resources operate in the same manner, and present the same view of internal state, in all modes of operation. Some examples include the execution units within the nodes, the interconnect fabric between the nodes, and the L1 instruction cache banks. In the second type, polymorphous resources are used in all modes of operation, but can be configured to operate differently depending on the mode. The third type are resources that are not required for all modes and can be disabled when not in use for a given mode.

2.3 Polymorphous Resources

Frame Space: As shown in Figure 2c, each execution node contains a set of reservation stations. Reservation stations with the same index across all of the nodes combine to form a physical frame. For example, combining the first slot for all nodes in the grid forms frame 0. The frame space, or collection of frames, is a polymorphous resource in TRIPS, as it is managed differently by different modes to support efficient execution of alternate forms of parallelism.

Register File Banks: Although the programming model of each execution mode sees essentially the same number of architecturally visible registers, the hardware substrate provides many more. The extra copies can be used in different ways, such as for speculation or multithreading, depending on the mode of operation.

Block Sequencing Controls: The block sequencing controls determine when a block has completed execution, when a block should be deallocated from the frame space, and which block should be loaded next into the free frame space. To implement different modes of operation, a range of policies can govern these actions. The deallocation logic may be configured to allow a block to execute more than once, as is useful in streaming applications in which the same inner loop is applied to multiple data elements. The next block selector can be configured to limit the speculation, and to prioritize between multiple concurrently executing threads, which is useful for multithreaded parallel programs.

Memory Tiles: The TRIPS memory tiles can be configured to behave as NUCA-style L2 cache banks [12], scratchpad memories, or synchronization buffers for producer/consumer communication. In addition, the memory tiles closest to each processor present a special high-bandwidth interface that further optimizes their use as stream register files.

3 D-morph: Instruction-Level Parallelism

The desktop morph, or D-morph, of the TRIPS processor uses the polymorphous capabilities of the processor to run single-threaded codes efficiently by exploiting instruction-level parallelism. The TRIPS processor core is an instantiation of the Grid Processor family of architectures, and as such has similarities to previous work [16], but with some important differences as described in this section.

To achieve high ILP, the D-morph configuration treats the instruction buffers in the processor core as a large, distributed, instruction issue window, which uses the TRIPS ISA to enable out-of-order execution while avoiding the associative issue window lookups of conventional machines. To use the instruction buffers effectively as a large window, the D-morph must provide high-bandwidth instruction fetching, aggressive control and data speculation, and a high-bandwidth, low-latency memory system that preserves sequential memory semantics across a window of thousands of instructions.

3.1 Frame Space Management

By treating the instruction buffers at each ALU as a distributed issue window, orders-of-magnitude increases in window sizes are possible. This window is fundamentally a three-dimensional scheduling region, where the x- and y-dimensions correspond to the physical dimensions of the ALU array and the z-dimension corresponds to multiple instruction slots at each ALU node, as shown in Figure 2c. This three-dimensional region can be viewed as a series of frames, as shown in Figure 3b, in which each frame consists of one instruction buffer entry per ALU node, resulting in a 2-D slice of the 3-D scheduling region.

To fill one of these scheduling regions, the compiler schedules hyperblocks into a 3-D region, assigning each instruction to one node in the 3-D space. Hyperblocks are predicated, single-entry, multiple-exit regions formed by the compiler [14]. A 3-D region (the array and the set of frames) into which one hyperblock is mapped is called an architectural frame, or A-frame.

Figure 3. D-morph frame management: (a) a hyperblock dataflow graph; (b) hyperblocks H0 and H1 mapped into A-frames of the frame space, with register value R1 passed from H0 to H1 through the register file.

Figure 3a shows a four-instruction hyperblock (H0) mapped into A-frame 0 as shown in Figure 3b, where N0 and N2 are mapped to different buffer slots (frames) on the same physical ALU node. All communication within the block is determined by the compiler, which schedules operand routing directly from ALU to ALU. Consumers are encoded in the producer instructions as X, Y, and Z-relative offsets, as described in prior work [16]. Instructions can direct a produced value to any element within the same A-frame, using the lightweight routed network in the ALU array. The maximum number of frames that can be occupied by one program block (the maximum A-frame size) is architecturally limited by the number of instruction bits used to specify destinations, and physically limited by the total number of frames available in a given implementation. The current TRIPS ISA limits the number of instructions in a hyperblock to 128, and the current implementation limits the maximum number of frames per A-frame to 16, the maximum number of A-frames to 32, and provides 128 frames total.

3.2 Multiblock Speculation

The TRIPS instruction window size is much larger than the average hyperblock size that can be constructed. The hardware fills empty A-frames with speculatively mapped hyperblocks, predicting which hyperblock will be executed next, mapping it to an empty A-frame, and so on. The A-frames are treated as a circular buffer in which the oldest A-frame is non-speculative and all other A-frames are speculative (analogous to tasks in a Multiscalar processor [19]). When the A-frame holding the oldest hyperblock completes, the block is committed and removed. The next oldest hyperblock becomes non-speculative, and the released frames can be filled with a new speculative hyperblock. On a misprediction, all blocks past the offending prediction are squashed and restarted.

Since A-frame IDs are assigned dynamically and all intra-hyperblock communication occurs within a single A-frame, each producer instruction prepends its A-frame ID to the Z-coordinate of its consumer to form the correct instruction buffer address of the consumer. Values passed between hyperblocks are transmitted through the register file, as shown by the communication of R1 from H0 to H1 in Figure 3b. Such values are aggressively forwarded when they are produced, using the register stitch table that dynamically matches the register outputs of earlier hyperblocks to the register inputs of later hyperblocks.

3.3 High-Bandwidth Instruction Fetching

To fill the large distributed window, the D-morph requires high-bandwidth instruction fetch. The control model uses a program counter that points to hyperblock headers. When there is sufficient frame space to map a hyperblock, the control logic accesses a partitioned instruction cache by broadcasting the index of the hyperblock to all banks. Each bank then fetches a row's worth of instructions with a single access and streams it to the bank's respective row. Hyperblocks are encoded as VLIW-like blocks, along with a prepended header that contains the number of frames consumed by the block.

The next-hyperblock prediction is made using a highly tuned tournament exit predictor [10], which predicts a binary value that indicates the branch predicted to be the first to exit the hyperblock. The per-block accuracy of the exit predictor is shown in row 3 of Table 1; the predictor itself is described in more detail elsewhere [17]. The value generated by the exit predictor is used both to index into a BTB to obtain the next predicted hyperblock address, and also to avoid forwarding register outputs produced past the predicted branch to subsequent blocks.

3.4 Memory Interface

To support high ILP, the D-morph memory system must provide a high-bandwidth, low-latency data cache, and must maintain sequential memory semantics. As shown in Figure 2b, the right side of each TRIPS core contains distributed primary memory system banks that are tightly coupled to the processing logic for low latency. The banks are interleaved using the low-order bits of the cache index, and can process multiple non-conflicting accesses simultaneously. Each bank is coupled with MSHRs for the cache bank and a partition of the address-interleaved load/store queues that enforce ordering of loads and stores. The MSHRs, the load/store queues, and the cache banks all use the same interleaving scheme. Stores are written back to the cache from the LSQs upon block commit.

The secondary memory system in the D-morph configures the networked banks as a non-uniform cache access (NUCA) array [12], in which elements of a set are spread across multiple secondary banks, and are capable of migrating data on the two-dimensional switched network that connects the secondary banks. This network also provides a high-bandwidth link to each L1 bank for parallel L1 miss processing and fills. To summarize, with accurate exit prediction, high-bandwidth I-fetching, partitioned data caches, and concurrent execution of hyperblocks with inter-block value forwarding, the D-morph is able to use the instruction buffers as a polymorphous out-of-order issue window effectively, as shown in the next subsection.

3.5 D-morph Results

In this subsection, we measure the ILP achieved using the mechanisms described above. The results shown in this section assume a 4x4 (16-wide issue) core, with 128 physical frames, a 64KB L1 data cache that requires three cycles to access, a 64KB L1 instruction cache (both partitioned into 4 banks), 0.5 cycles per hop in the ALU array, a 10-cycle branch misprediction penalty, a 250Kb exit predictor, a 12-cycle access penalty to a 2MB L2 cache, and a 132-cycle main memory access penalty. Optimistic assumptions in the simulator currently include no modeling of TLBs or page faults, oracular load/store ordering, simulation of a centralized register file, and no issue of wrong-path instructions to the memory system. All of the binaries were compiled with the Trimaran tool set [24] (based on the Illinois Impact compiler [5]), and scheduled for the TRIPS processor with our custom scheduler/rewriter.

Benchmark               adpcm   ammp    art  bzip2  compress   dct  equake   gzip  hydro2d  m88ksim
Good insts/block         30.7    119   80.4   55.8      21.6   163    33.5   36.2      200     40.2
Avg. frames               2.4    5.2    3.2    2.8       1.3   6.0     2.1    3.1      7.4      2.3
Exit/target pred. acc.   0.72   0.94   0.99   0.74      0.84  0.99    0.97   0.84     0.97     0.95
# in window               116   1126   1706    364       129  1738     622    671     1573      796

Benchmark                 mcf  mgrid  mpeg2  parser     swim  tomcatv  turb3d  twolf  vortex   mean
Good insts/block         29.8    179   81.3    14.6      361      210     160   48.9    29.4   99.8
Avg. frames               2.2    6.9    3.8     1.3     11.8      7.4     6.4    2.6     2.0    4.2
Exit/target pred. acc.   0.91   0.99   0.88    0.93     0.99     0.98    0.94   0.76    0.99   0.91
# in window               462   1590    958     255     1928     1629    1399    361     918    965

Table 1. Execution characteristics of D-morph codes.

Figure 4. D-morph performance (IPC) as a function of A-frame count (1, 2, 4, 8, 16, and 32), with additional bars for perfect memory and perfect memory plus perfect branch prediction at 32 A-frames; integer benchmarks on the left, floating-point and Mediabench benchmarks on the right.

The first row of Table 1 shows the average number of useful dynamically executed instructions per block, discounting overhead instructions, instructions with false predicates, and instructions past a block exit. The second row shows the average dynamic number of frames allocated per block by our scheduler for a 4x4 grid. Using the steady-state block (exit) prediction accuracies shown in the third row, the benchmarks hold an average of 965 useful instructions in the distributed window, as shown in row 4 of Table 1.

Figure 4 shows how IPC scales as the number of A-frames is increased from 1 to 32, permitting deeper speculative execution. The integer benchmarks are shown on the left; the floating point and Mediabench [13] benchmarks are shown on the right. Each 32 A-frame bar also has two additional IPC values, showing the performance with perfect memory in the hashed fraction of each bar, and then adding perfect branch prediction, shown in white.
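The frame-space addressing described in Sections 3.1 and 3.2 can be made concrete with a small sketch: a producer instruction's X-, Y-, and Z-relative offsets, combined with the dynamically assigned A-frame ID prepended to the Z-coordinate, yield the consumer's instruction-buffer location. The function below is an illustrative assumption, not the actual TRIPS encoding; only the grid dimensions and the 16-frames-per-A-frame limit come from the text.

```python
# Hypothetical sketch of D-morph consumer addressing. The paper specifies
# X/Y/Z-relative consumer offsets and dynamic A-frame IDs; the concrete
# arithmetic below is an illustrative assumption.

GRID_DIM = 4             # 4x4 ALU array (16-wide issue core)
FRAMES_PER_AFRAME = 16   # maximum frames per A-frame (Section 3.1)

def consumer_buffer_address(prod_x, prod_y, prod_z, dx, dy, dz, aframe_id):
    """Resolve a producer's (dx, dy, dz) offset to an absolute (x, y, slot)
    buffer entry, prepending the A-frame ID to the block-local z-coordinate."""
    x, y, z = prod_x + dx, prod_y + dy, prod_z + dz
    assert 0 <= x < GRID_DIM and 0 <= y < GRID_DIM
    assert 0 <= z < FRAMES_PER_AFRAME
    slot = aframe_id * FRAMES_PER_AFRAME + z   # A-frame ID || z
    return (x, y, slot)

# A producer at node (1, 2), frame 3 of A-frame 5, targeting a consumer
# one column to the right and two frames deeper:
print(consumer_buffer_address(1, 2, 3, 1, 0, 2, 5))  # -> (2, 2, 85)
```

Because the A-frame ID occupies the high-order slot bits, intra-block offsets never escape the producer's own A-frame, which is what allows A-frames to be assigned dynamically.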

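The shared interleaving scheme of Section 3.4, in which the cache banks, MSHRs, and load/store queue partitions all select a bank from the low-order bits of the cache index, can be sketched as follows. The four banks match the prototype's four D-cache banks; the 64-byte line size is an assumption for illustration.

```python
# Sketch of address-interleaved bank selection (Section 3.4). Accesses that
# map to different banks can be processed simultaneously.
# The 64-byte line size is assumed, not taken from the paper.

NUM_BANKS = 4       # one D-cache bank (and LSQ partition) per row
LINE_BYTES = 64     # assumed cache line size

def bank_of(addr):
    """Low-order bits of the cache index pick the bank."""
    return (addr // LINE_BYTES) % NUM_BANKS

for addr in (0x1000, 0x1040, 0x1080, 0x10C0, 0x2000):
    print(hex(addr), "-> bank", bank_of(addr))
# 0x1000-0x10C0 fall in banks 0-3 and can proceed in parallel, while
# 0x2000 contends with 0x1000 for bank 0.
```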
Proceedings of the 30th Annual International Symposium on Computer Architecture (ISCA’03) 1063-6897/03 $17.00 © 2003 IEEE white. Increasing the number of A-frames provides a con- can instead store state from multiple non-speculative and sistent performance boost across many of the benchmarks, speculative blocks. The only additional frame support since it permits greater exploitation of ILP by providing a needed is thread-ID bits in the register stitching logic and larger window of instructions. Some benchmarks show no augmentations to the A-frame allocation logic. performance improvements beyond 16 A-frames (bzip2, Instruction control: The T-morph maintains n pro- m88ksim, and tomcatv), and a few reach their peak at 8 A- gram counters (where n is the number of concurrent frames (adpcm, gzip, twolf, and hydro2d). In such cases, threads allowed) and n global history shift registers in the large frame space is underutilized when running a sin- the exit predictor to reduce thread-induced mispredictions. gle thread, due to either low hyperblock predictability in The T-morph fetches the next block for a given thread us- some cases or a lack of program ILP in others. ing a prediction made by the shared exit predictor, and The graphs demonstrate that while control mispredic- maps it onto the array. In addition to the extra prediction tions cause large performance losses for the integer codes registers, n copies of the commit buffers and block control (close to 50% on average), the large window is able to tol- state must be provided for n hardware threads. erate memory latencies extremely well, resulting in negli- Memory: The memory system operates much the same gible slowdowns due to an imperfect memory system for as the D-morph, except that per-thread IDs on cache tags all benchmarks but mgrid. and LSQ CAMs are necessary to prevent illegal cross- thread interference, provided that shared address spaces 4 T-morph: Thread-Level Parallelism are implemented.

The T-morph is intended to provide higher processor 4.2 T-morph Results utilization by mapping multiple threads of control onto a To evaluate the performance of multi-programmed single TRIPS core. While similar to simultaneous multi- workloads running on the T-morph, we classified the ap- threading [23] in that the execution resources (ALUs) and plications as “high memory intensive” and “low memory memory banks are shared, the T-morph statically partitions intensive”, based on L2 cache miss rates. We picked eight the reservation station (issue window) and eliminates some different benchmarks and ran different combinations of replicated SMT structures, such as the reorder buffer. 2, 4 and 8 benchmarks executing concurrently. The high memory intensive benchmarks are arth, mcfh, equakeh, 4.1 T-Morph Implementation and tomcatvh. The low memory intensive benchmarks are compressl, bzip2l, parserl, and m88ksiml.Weex- There are multiple strategies for partitioning a TRIPS amine the performance obtained while executing multiple core to support multiple threads, two of which are row pro- threads concurrently and quantify the sources of perfor- cessors and frame processors. Row processors space-share mance degradation. Compared to a single thread executing the ALU array, allocating one or more rows per thread. in the D-morph, running threads concurrently introduces The advantage to this approach is that each thread has I- the following sources of performance loss: a) inter-thread cache and D-cache bandwidth and capacity proportional to contention for ALUs and routers in the grid, b) cache pol- the number of rows assigned to it. The disadvantage is that lution, c) pollution and interaction in the the distance to the register file is non-uniform, penalizing tables, and d) reduced speculation depth for each thread, the threads mapped to the bottom rows. 
Frame proces- since the number of available frames for each thread is re- sors, evaluated in this section, time-share the processor by duced. allocating threads to unique sets of physical frames. We Table 2 shows T-morph performance on a 4x4 TRIPS describe the polymorphous capabilities required for each core with parameters similar to those of the baseline D- of the classes of mechanisms below. morph. The second column lists the combined instruc- Frame space management: Instead of holding non- tion throughput of the running threads. The third column speculative and speculative hyperblocks for a single thread shows the sum of the IPCs of the benchmarks when each as in the D-morph, the physical frames are partitioned a is run on a separate core but with same number of frames priori and assigned to threads. For example, a TRIPS core as available to each thread in the T-morph. Comparing the can dedicate all 128 frames to a single thread in the D- throughput of column 3 with the throughput in column 2, morph, or 64 frames to each of two threads in the T-morph indicates the performance drop due to inter-thread interac- (uneven frame sharing is also possible). Within each tion in the T-morph. Column 4 shows the cumulative IPCs thread, the frames are further divided into some number of of the threads when each is run by itself on a TRIPS core A-frames and is allowed within each with all frames available to it. Comparison of this column thread. No additional register file space is required, since with column 4, indicates the performance drop incurred the same storage used to hold state for speculative blocks from both inter-thread interaction and reduced speculation

Proceedings of the 30th Annual International Symposium on Computer Architecture (ISCA'03) 1063-6897/03 $17.00 © 2003 IEEE

Benchmarks | T-morph IPC | Constant A-frames IPC | Scaled A-frames IPC | Overall Eff. (%) | Per-Thread Eff. (%) | Speedup

2 Threads:
bzipl, m88ksiml | 4.9 | 5.5 | 5.5 | 90 | 93, 86 | 1.8
parserl, m88ksiml | 3.7 | 3.8 | 4.1 | 90 | 88, 91 | 1.8
arth, compressl | 5.1 | 5.7 | 6.0 | 86 | 93, 62 | 1.6
mcfh, bzipl | 3.2 | 3.9 | 3.9 | 81 | 98, 75 | 1.7
arth, mcfh | 5.1 | 5.3 | 5.6 | 90 | 91, 87 | 1.8
equakeh, mcfh | 3.3 | 3.4 | 3.5 | 95 | 101, 83 | 1.8
MEAN | 4.7 | - | - | 87 | 84 | 1.7

4 Threads:
bzipl, m88ksiml, parserl, compressl | 6.1 | 6.7 | 8.4 | 72 | 79, 70, 59, 78 | 2.9
equakeh, arth, parserl, compressl | 6.1 | 7.0 | 10.0 | 61 | 68, 69, 38, 47 | 2.2
tomcatvh, mcfh, m88ksiml, bzipl | 8.3 | 10.7 | 15.0 | 55 | 54, 65, 55, 58 | 2.3
equakeh, arth, tomcatvh, mcfh | 9.0 | 10.5 | 16.6 | 54 | 60, 58, 51, 53 | 2.2
MEAN | 7.4 | - | - | 61 | 60 | 2.4

8 Threads:
art, tomcatv, bzip, m88ksim, equake, parser, compress, mcf | 9.8 | 17.7 | 25.0 | 39 | 40, 44, 34, 33, 50, 23, 26, 43 | 2.9

(The first three numeric columns report throughput in aggregate IPC for the T-morph, the constant A-frames configuration, and the scaled A-frames configuration, respectively.)

Table 2. T-morph thread efficiency and throughput.
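The overall-efficiency column of Table 2 follows directly from the column definitions in the text (column 2 divided by column 4); a minimal sketch, using the bzipl/m88ksiml row as input:

```python
def overall_efficiency(tmorph_ipc, standalone_ipc_sum):
    """Ratio of combined T-morph throughput (Table 2, column 2) to the
    summed IPCs of the same threads run alone with all frames (column 4)."""
    return tmorph_ipc / standalone_ipc_sum

# bzipl + m88ksiml with 2 threads: 4.9 combined IPC vs. 5.5 standalone.
print(round(overall_efficiency(4.9, 5.5) * 100))  # 89 (~90 in Table 2)
```

The speedup column is then roughly the thread count scaled by this efficiency, under the equal-running-time assumption stated in the text (2 x 0.9 gives the reported 1.8 for this pair).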

Our experiments showed that T-morph performance is largely insensitive to cache and branch predictor pollution, but is highly sensitive to instruction fetch bandwidth stalls.

Column 5 shows the overall T-morph efficiency, defined as the ratio of multithreading performance to the throughput of the threads running on independent cores (column 2 / column 4). Column 6 breaks this down further, showing the fraction of peak D-morph performance achieved by each thread when sharing a TRIPS core with other threads. The last column shows an estimate of the speedup provided by the T-morph versus running each of the applications one at a time on a single TRIPS core (with the assumption that each application has approximately the same running time). The overall efficiency varies from 80–100% with 2 threads down to 39% with 8 threads. Having the low memory benchmarks resident simultaneously provided the highest efficiency, while mixes of high memory benchmarks provided the lowest efficiency, due to increased T-morph cache contention. This effect is less pronounced in the 2-thread configurations, with the pairing of high memory benchmarks being equally efficient as the others. The overall speedup provided by multithreading ranges from a factor of 1.4 to 2.9, depending on the number of threads. In summary, most benchmarks do not completely exploit the deep speculation provided by all of the A-frames available in the D-morph, due to branch mispredictions. The T-morph converts these less useful A-frames to non-speculative computations when multiple threads or jobs are available. Future work will evaluate the T-morph on multithreaded parallel programs.

5 S-morph: Data-Level Parallelism

The S-morph is a configuration of the TRIPS processor that leverages the technology-scalable array of ALUs and the fast inter-ALU communication network for streaming media and scientific applications. These applications are typically characterized by data-level parallelism (DLP), including predictable loop-based control flow with large iteration counts [21], large data sets, regular access patterns, poor locality but tolerance to memory latency, and high computation intensity with tens to hundreds of arithmetic operations performed per element loaded from memory [18]. The S-morph was heavily influenced by the Imagine architecture [11] and uses the Imagine execution model, in which a set of stream kernels are sequenced by a control thread. Figure 5 highlights the features of the S-morph, which are further described below.

5.1 S-morph Mechanisms

Frame Space Management: Since the control flow of the programs is highly predictable, the S-morph fuses multiple A-frames to make a super A-frame, instead of using separate A-frames for speculation or multithreading. Inner loops of a streaming application are unrolled to fill the reservation stations within these super A-frames. Code required to set up the execution of the inner loops and to connect multiple loops can run in one of three ways: (1) embedded into the program that uses the frames for S-morph execution, (2) executed on a different core within the TRIPS chip, similar in function to the Imagine host processor, or (3) run within its own set of frames on the same core running the DLP kernels. In this third mode, a subset of the frames are dedicated to a data-parallel thread, while a different subset are dedicated to a sequential control thread.

Instruction Fetch: To reduce the power and instruction fetch bandwidth overhead of repeatedly fetching the same code block across inner-loop iterations, the S-morph employs mapping reuse, in which a block is kept in the reservation stations and used multiple times.

Benchmark | Kernel size (insts) | Inputs/Outputs | Constants | Unrolling factor | Compute insts per block | Block size | Total constants | # of Revitalizations
convert | 15 | 3/3 | 9 | 16 | 240 | 303 | 144 | 171
dct | 70 | 8/8 | 10 | 8 | 560 | 580 | 80 | 128
fir16 | 34 | 1/1 | 16 | 16 | 544 | 620 | 256 | 512
fft8 | 104 | 16/16 | 16 | 4 | 416 | 570 | 64 | 128
idea | 112 | 2/2 | 52 | 8 | 896 | 1020 | 416 | 512
transform | 37 | 8/8 | 21 | 16 | 592 | 740 | 336 | 64

(Columns 2–4 characterize one original iteration of each kernel; columns 5–8 characterize the fused, unrolled iterations.)

Table 3. Characteristics of S-morph codes.
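The repeat/revitalization mechanism of Section 5.1 can be modeled as a small execution loop. The classes below are an illustrative software sketch of the behavior described in the text (the block stays mapped, constants persist across revitalizations, operands are reset each iteration), not the hardware design.

```python
class ReservationStation:
    def __init__(self, constant=None):
        self.constant = constant   # survives revitalization
        self.operand = None        # cleared on each revitalization

def run_super_aframe(stations, n_iterations, body):
    """Model of repeat<N>: the block stays mapped in the reservation
    stations; a revitalization signal resets operands (keeping constants
    resident) until the iteration counter reaches zero."""
    counter = n_iterations
    results = []
    while counter > 0:
        results.append(body(stations))   # one unrolled iteration fires
        counter -= 1
        for rs in stations:              # revitalize: reset operands,
            rs.operand = None            # constants stay in place
    return results                       # super A-frame is then cleared

stations = [ReservationStation(constant=3), ReservationStation()]
out = run_super_aframe(stations, 4, lambda st: st[0].constant * 2)
print(out)  # [6, 6, 6, 6]
```

The "# of Revitalizations" column in Table 3 corresponds to `n_iterations` here: the number of times the unrolled block fires without being refetched.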

The S-morph implements mapping reuse with a repeat<N> instruction (similar to RPTB in the TMS320C54x [1]), which indicates that the next block of instructions constitutes a loop that is to execute a finite number of times N, where N can be determined at runtime and is used to set an iteration counter. When all of the instructions from an iteration complete, the hardware decrements the iteration counter and triggers a revitalization signal, which resets the reservation stations while maintaining the constant values residing in them, so that the instructions may fire again when new operands arrive for the next iteration. When the iteration counter reaches zero, the super A-frame is cleared and the hardware maps the next block onto the ALUs for execution.

Memory System: Similar to Smart Memories [15], the TRIPS S-morph implements the Imagine stream register file (SRF) using a subset of the on-chip memory tiles. S-morph memory tile configuration includes turning off tag checks to allow direct data array access and augmenting the cache line replacement state machine to include DMA-like capabilities. Enhanced transfer mechanisms include block transfer between the tile and remote storage (main memory or other tiles), strided access to remote storage (gather/scatter), and indirect gather/scatter in which the remote addresses to access are contained within a subset of the tile's storage. Like the Imagine programming model, we expect that transfers between the tile and remote memory will be orchestrated by a separate thread.

As shown in Figure 5b, memory tiles adjacent to the processor core are used for the SRF and are augmented with dedicated wide channels (256 bits per row, assuming four 64-bit channels for the 4x4 array) into the ALU array for increased SRF bandwidth. The S-morph DLP loops can execute an SRF read that acts as a load-multiple-word instruction by transferring an entire SRF line into the grid, spreading it across the ALUs in a fixed pattern within a row. Once within the grid, data can be easily moved to any ALU using the high-bandwidth in-grid routing network, rather than requiring a data switch between the SRF banks and the ALU array. Streams are striped across the multiple banks of the SRF. Stores to the SRF are aggregated in a store buffer and then transmitted to the SRF bank over narrow channels to the memory tile. Memory tiles not adjacent to the processing core can be configured as a conventional level-2 cache, still accessible to the unchanged level-1 caches. The conventional cache hierarchy can be used to store irregularly accessed data structures, such as texture maps.

[Figure 5: (a) multiple frames (frames 0–7) fused into a super A-frame; (b) on-chip memory tiles adjacent to the processor configured as SRF banks with wide channels to the core, and the remaining tiles serving as the L2 NUCA cache above the D-morph L1/L2 memory system.]

Figure 5. Polymorphism for S-morph.

5.2 Results

We evaluate the performance of the TRIPS S-morph on a set of streaming kernels, shown in Table 3, extracted from the Mediabench benchmark suite [13]. These kernels were selected to represent different computation-to-memory ratios, varying from less than 1 to more than 14. The kernels are hand-coded in a TRIPS meta-assembly language, then mapped to the ALU array using a custom scheduler akin to the D-morph scheduler, and simulated using an event-driven simulator that models the TRIPS S-morph.

Program characteristics: Columns 2–4 of Table 3 show the intrinsic characteristics of one iteration of the kernel code, including the number of arithmetic operations, the number of bytes read from/written to memory, and the number of unique run-time constants required. The unrolling factor for each inner loop is determined by the size of the kernel and the capacity of the super A-frame (a 4x4 grid with 128 frames, or 2K instructions). The useful instructions per block include only computation instructions, while the block size numbers include overhead instructions for memory access and data movement within the grid. The total constant count indicates the number of reservation stations that must be filled with constant values from the register file for each iteration of the unrolled loop. Most of these register moves can be eliminated by allowing the constants to remain in reservation stations across revitalizations.

The number of revitalizations corresponds to the number of iterations of the unrolled loop. The unrolling of the kernels is based on 64-Kbyte input and output streams, both being striped and stored in the SRF.

[Figure 6: S-morph performance, in compute instructions per cycle, for convert, dct, fft8, fir16, idea, transform, and their mean, comparing the D-morph, S-morph, S-morph ideal, 1/4 LD B/W, 4X ST B/W, and NoRevitalize configurations.]

Figure 6. S-morph performance.

Performance analysis: Figure 6 compares the performance of the D-morph to the S-morph on a 4x4 TRIPS core with 128 frames, a 32-entry store buffer, an 8-cycle revitalization delay, and a pipelined 7-cycle SRF access delay. The D-morph configuration in this experiment assumes perfect L1 caches with 3-cycle hit latencies. Figure 6 shows that the S-morph sustains an average of 7.4 compute instructions per cycle (not counting overhead instructions or address compute instructions), a factor of 2.4 higher than the D-morph. A more idealized S-morph configuration that employs 256 frames and no revitalization latency improves performance to 9 compute ops/cycle, 26% higher than the realistic S-morph. An alternative approach to S-morph polymorphism is the Tarantula architecture [8], which exploits data-level parallelism by augmenting the processor core of an Alpha 21464 with a dedicated vector data path of 32 ALUs, an approach that sustains between 10 and 20 FLOPS per cycle. Our results indicate that the TRIPS S-morph can provide competitive performance on data-parallel workloads; an 8x4 grid consisting of 32 ALUs sustains, on average, 15 compute ops per cycle. Furthermore, the polymorphous approach provides superior area efficiency compared to Tarantula, which contains two large heterogeneous cores.

SRF bandwidth: To investigate the sensitivity of the S-morph to SRF bandwidth, we examined two alternative design points: load bandwidth decreased to 64 bits per row (1/4 LD B/W) and store bandwidth increased to 256 bits per row (4X ST B/W). Decreasing the load bandwidth drops performance by 5% to 31%, with a mean drop of 27%. Augmenting the store bandwidth increases average IPC to 7.65, corresponding to a 5% performance improvement on average. However, on an 8x8 TRIPS core, experiments show that increased store bandwidth can improve performance by 22%. As expected, compute-intensive kernels, such as fir and idea, show little sensitivity to SRF bandwidth.

Revitalization: As shown by the NoRevitalize bar in Figure 6, eliminating revitalization causes S-morph performance to drop by a factor of 5 on average. This effect is due to the additional latency of mapping instructions into the grid, as well as redistributing the constants from the register file, on every unrolled iteration. For example, the unrolled inner loop of the dct kernel requires 37 cycles to fetch the 580 instructions (assuming 16 instructions fetched per cycle) plus another 10 cycles to fetch the 80 constants from the banked register file. Much of this overhead is exposed because, unlike the D-morph with its speculative instruction fetch, the S-morph has hard synchronization boundaries between iterations. One solution that we are examining to further reduce the impact of instruction fetch is to overlap revitalization and execution. Further extensions to this configuration can allow the individual ALUs at each node to act as separate MIMD processors. This technique would benefit applications with frequent data-dependent control flow, such as real-time graphics and network processing workloads.

6 Scalability to Larger Cores

While the experiments in Sections 3–5 reflect the performance achievable on three application classes with 16 ALUs, the question of granularity still remains. Given a fixed single-chip silicon budget, how many processors should be on the chip, and how powerful should each processor be? To address this question, we first examined the performance of each application class as a function of the architecture granularity by varying the issue width of a TRIPS core. We use this information to determine the sweet spot for each application class and then describe how this sweet spot can be achieved using the configurability of the TRIPS system.

Figures 7a and 7b show the aggregate performance of ILP and DLP workloads on TRIPS cores of different dimensions, including 2x2, 4x4, 8x4, and 8x8. The selected benchmarks represent the general behavior of the benchmark suite as a whole. Unsurprisingly, the benchmarks with low instruction-level concurrency see little benefit from TRIPS cores larger than 4x4, and a class of them (represented by adpcm) sees little benefit beyond 2x2. Benchmarks with higher concurrency, such as swim and idea, see diminishing returns beyond 8x4, while others, such as mgrid and fft, continue to benefit from increasing ALU density.

[Figure 7: (a) ILP on a single thread and (b) DLP show per-core performance (IPC and compute instructions per cycle, respectively) for adpcm, vortex, MEAN-Dlow, mgrid, swim, MEAN-Dhigh, fft8, idea, and MEAN-S on 2x2, 4x4, 8x4, and 8x8 cores; (c) ILP on multiple threads shows aggregate IPC for 1-way through 8-way threading per core, with the total thread count atop each bar.]

Figure 7. TRIPS single-core scalability and CMP throughput.
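The design-point comparison behind Figure 7c and Table 4 amounts to weighing cores-per-chip against per-core throughput. The sketch below uses Table 4's core counts but illustrative, made-up per-core IPC values, so only the structure of the comparison, not the numbers, reflects the measured results.

```python
# Cores per 400mm^2 chip follow Table 4 (8x 2x2, 4x 4x4, 4x 8x4, 2x 8x8);
# the per-core IPC values at a fixed thread count are hypothetical.
designs = {
    "2x2": {"cores": 8, "ipc_per_core": 1.5},
    "4x4": {"cores": 4, "ipc_per_core": 4.0},
    "8x4": {"cores": 4, "ipc_per_core": 6.0},
    "8x8": {"cores": 2, "ipc_per_core": 9.0},
}

def chip_throughput(design):
    """Aggregate chip IPC: number of cores times per-core throughput."""
    return design["cores"] * design["ipc_per_core"]

best = max(designs, key=lambda k: chip_throughput(designs[k]))
print(best, chip_throughput(designs[best]))  # 8x4 24.0
```

With the paper's measured data, this same calculation is what identifies the 8x4 topology as the best design point for thread-rich workloads.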

Table 4 shows the best-suited configurations for the different applications in column 2.

Grid Dimensions | Preferred Applications | # TRIPS cores | Total L2 (MB)
2x2 | adpcm | 8 | 3.90
4x4 | vortex | 4 | 3.97
8x4 | swim, idea | 4 | 1.25
8x8 | mgrid, fft | 2 | 1.25
21264 | - | 10 | 3.97

Table 4. TRIPS CMP Designs.

The variations across applications and application domains demand both large coarse-grain processors (8x4 and 8x8) and small fine-grain processors (2x2). Nonetheless, for single-threaded ILP and DLP applications, the larger processors provide better aggregate performance at the expense of low utilization for some applications. For multithreaded and multiprogrammed workloads, the decision is more complex. Table 4 shows several alternative TRIPS chip designs, ranging from 8 2x2 TRIPS cores to 2 8x8 cores, assuming a 400mm² die in a 100nm technology. The equivalent real estate could be used to construct 10 Alpha 21264 processors and 4MB of on-chip L2 cache.

Figure 7c shows the instruction throughput (in aggregate IPC), with each bar representing the core dimensions, each cluster of bars showing the number of threads per core, and the number atop each bar showing the total number of threads (# cores times threads per core). The 2x2 array is the worst performing when a large number of threads are available. The 4x4 and 8x4 configurations have the same number of cores due to changing on-chip cache capacity, but the 8x4 and 8x8 have the same total number of ALUs and instruction buffers across the full chip. With ample threads and at most 8 threads per core, the best design point is the 8x4 topology, no matter how many total threads are available (e.g., of all the bars labeled 16 threads, the 8x4 configuration is the highest-performing). These results validate the large-core approach; one 8x4 core has higher performance for both single-threaded ILP and DLP workloads than a smaller core, and shows higher throughput than many smaller cores using the same area when many threads are available. We are currently exploring and evaluating space-based subdivision for both TLP and DLP applications beyond the time-based multithreading approach described in this paper.

7 Conclusions and Future Directions

The polymorphous TRIPS system enables a single set of processing and storage elements to be configured for multiple application domains. Unlike prior configurable systems that aggregate small primitive components into larger processors, TRIPS starts with a large, technology-scalable core that can be logically subdivided to support ILP, TLP, and DLP. The goal of this system is to achieve performance and efficiency approaching that of special-purpose systems. In this paper, we proposed a small set of mechanisms (managing reservation stations and memory tiles) for a large-core processor that enables adaptation into three modes for these diverse application domains. We have shown that all three modes achieve the goal of high performance on their respective application domains. The D-morph sustains 1–12 IPC (an average of 4.4) on serial codes, the T-morph achieves average thread efficiencies of 87%, 60%, and 39% for two, four, and eight threads, respectively, and the S-morph executes as many as 12 arithmetic instructions per clock on a 16-ALU core, and an average of 23 on an 8x8 core.

While we have described the TRIPS system as having three distinct personalities (the D, T, and S-morphs), in reality each of these configurations is composed of basic mechanisms that can be mixed and matched across execution models. In addition, there are also minor reconfigurations, such as adjusting the level-2 cache capacity, that do not require a change in the programming model. A major challenge for polymorphous systems is designing the interfaces between the software and the configurable hardware, as well as determining when and how to initiate reconfiguration. At one extreme, application programmers and compiler writers can be given a fixed number of static morphs; programs are written and compiled to these static machine models. At the other extreme, a polymorphous system could expose all of the configurable mechanisms to the application layers, enabling them to select the configurations and the time of reconfiguration.

We are exploring both the hardware and software design issues in the course of our development of the TRIPS prototype system.

Acknowledgments

We thank the anonymous reviewers for their suggestions that helped improve the quality of this paper. This research is supported by the Defense Advanced Research Projects Agency under contract F33615-01-C-1892, NSF instrumentation grant EIA-9985991, NSF CAREER grants CCR-9985109 and CCR-9984336, two IBM University Partnership awards, and grants from the Alfred P. Sloan Foundation, the Peter O'Donnell Foundation, and the Research Council.

References

[1] TMS320C54x DSP Reference Set, Volume 2: Mnemonic Instruction Set. Literature Number SPRU172C, March 2001.
[2] L. A. Barroso, K. Gharachorloo, R. McNamara, A. Nowatzyk, S. Qadeer, B. Sano, S. Smith, R. Stets, and B. Verghese. Piranha: A scalable architecture based on single-chip multiprocessing. In Proceedings of the 27th Annual International Symposium on Computer Architecture, pages 282–293, June 2000.
[3] V. Baumgarte, F. May, A. Nückel, M. Vorbach, and M. Weinhardt. PACT XPP – A self-reconfigurable data processing architecture. In 1st International Conference on Engineering of Reconfigurable Systems and Algorithms, June 2001.
[4] C. Caşcaval, J. Castanos, L. Ceze, M. Denneau, M. Gupta, D. Lieber, J. E. Moreira, K. Strauss, and H. S. Warren, Jr. Evaluation of a multithreaded architecture for cellular computing. In Proceedings of the 8th International Symposium on High Performance Computer Architecture, pages 311–322, January 2002.
[5] P. P. Chang, S. A. Mahlke, W. Y. Chen, N. J. Warter, and W.-m. W. Hwu. IMPACT: An architectural framework for multiple-instruction-issue processors. In Proceedings of the 18th Annual International Symposium on Computer Architecture, pages 266–275, May 1991.
[6] M. Cintra, J. F. Martínez, and J. Torrellas. Architectural support for scalable speculative parallelization in shared-memory multiprocessors. In Proceedings of the 27th Annual International Symposium on Computer Architecture, pages 13–24, June 2000.
[7] C. Ebeling, D. C. Cronquist, and P. Franklin. Configurable computing: The catalyst for high-performance architectures. In International Conference on Application-Specific Systems, Architectures, and Processors, pages 364–372, 1997.
[8] R. Espasa, F. Ardanaz, J. Emer, S. Felix, J. Gago, R. Gramunt, I. Hernandez, T. Juan, G. Lowney, M. Mattina, and A. Seznec. Tarantula: A vector extension to the Alpha architecture. In Proceedings of the 29th International Symposium on Computer Architecture, pages 281–292, May 2002.
[9] S. C. Goldstein, H. Schmit, M. Budiu, S. Cadambi, M. Moe, and R. Taylor. PipeRench: A reconfigurable architecture and compiler. IEEE Computer, 33(4):70–77, April 2000.
[10] Q. Jacobson, S. Bennett, N. Sharma, and J. E. Smith. Control flow speculation in multiscalar processors. In Proceedings of the 3rd International Symposium on High Performance Computer Architecture, February 1997.
[11] B. Khailany, W. J. Dally, S. Rixner, U. J. Kapasi, P. Mattson, J. Namkoong, J. D. Owens, B. Towles, and A. Chang. Imagine: Media processing with streams. IEEE Micro, 21(2):35–46, March/April 2001.
[12] C. Kim, D. Burger, and S. W. Keckler. An adaptive, non-uniform cache structure for wire-delay dominated on-chip caches. In 10th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), pages 211–222, October 2002.
[13] C. Lee, M. Potkonjak, and W. H. Mangione-Smith. MediaBench: A tool for evaluating and synthesizing multimedia and communications systems. In International Symposium on Microarchitecture, pages 330–335, 1997.
[14] S. A. Mahlke, D. C. Lin, W. Y. Chen, R. E. Hank, and R. A. Bringmann. Effective compiler support for predicated execution using the hyperblock. In Proceedings of the 25th International Symposium on Microarchitecture, pages 45–54, 1992.
[15] K. Mai, T. Paaske, N. Jayasena, R. Ho, W. J. Dally, and M. Horowitz. Smart Memories: A modular reconfigurable architecture. In Proceedings of the 27th Annual International Symposium on Computer Architecture, pages 161–171, June 2000.
[16] R. Nagarajan, K. Sankaralingam, D. Burger, and S. W. Keckler. A design space evaluation of grid processor architectures. In Proceedings of the 34th Annual International Symposium on Microarchitecture, pages 40–51, December 2001.
[17] N. Ranganathan, R. Nagarajan, D. Burger, and S. W. Keckler. Combining hyperblocks and exit prediction to increase front-end bandwidth and performance. Technical Report TR-02-41, Department of Computer Sciences, The University of Texas at Austin, September 2002.
[18] S. Rixner, W. J. Dally, U. J. Kapasi, B. Khailany, A. Lopez-Lagunas, P. R. Mattson, and J. D. Owens. A bandwidth-efficient architecture for media processing. In Proceedings of the 31st International Symposium on Microarchitecture, pages 3–13, December 1998.
[19] G. S. Sohi, S. E. Breach, and T. N. Vijaykumar. Multiscalar processors. In Proceedings of the 22nd International Symposium on Computer Architecture, pages 414–425, June 1995.
[20] J. G. Steffan, C. B. Colohan, A. Zhai, and T. C. Mowry. A scalable approach to thread-level speculation. In Proceedings of the 27th Annual International Symposium on Computer Architecture, pages 1–12, June 2000.
[21] D. Talla, L. John, and D. Burger. Bottlenecks in multimedia processing with SIMD style extensions and architectural enhancements. IEEE Transactions on Computers, to appear, pages 35–46, 2003.
[22] J. M. Tendler, J. S. Dodson, J. J. S. Fields, H. Le, and B. Sinharoy. POWER4 system microarchitecture. IBM Journal of Research and Development, 26(1):5–26, January 2001.
[23] D. M. Tullsen, S. J. Eggers, and H. M. Levy. Simultaneous multithreading: Maximizing on-chip parallelism. In Proceedings of the 22nd International Symposium on Computer Architecture, pages 392–403, June 1995.
[24] V. Kathail, M. Schlansker, and B. R. Rau. HPL-PD architecture specification: Version 1.1. Technical Report HPL-93-80(R.1), Hewlett-Packard Laboratories, February 2000.
[25] E. Waingold, M. Taylor, D. Srikrishna, V. Sarkar, W. Lee, V. Lee, J. Kim, M. Frank, P. Finch, R. Barua, J. Babb, S. Amarsinghe, and A. Agarwal. Baring it all to software: RAW machines. IEEE Computer, 30(9):86–93, September 1997.
