Exploiting ILP, TLP, and DLP with the Polymorphous TRIPS Architecture

Karthikeyan Sankaralingam, Ramadass Nagarajan, Haiming Liu, Changkyu Kim, Jaehyuk Huh, Doug Burger, Stephen W. Keckler, Charles R. Moore

Computer Architecture and Technology Laboratory
Department of Computer Sciences
The University of Texas at Austin
[email protected] - www.cs.utexas.edu/users/cart

Proceedings of the 30th Annual International Symposium on Computer Architecture (ISCA’03) 1063-6897/03 $17.00 © 2003 IEEE

Abstract

This paper describes the polymorphous TRIPS architecture which can be configured for different granularities and types of parallelism. TRIPS contains mechanisms that enable the processing cores and the on-chip memory system to be configured and combined in different modes for instruction, data, or thread-level parallelism. To adapt to small and large-grain concurrency, the TRIPS architecture contains four out-of-order, 16-wide-issue Grid Processor cores, which can be partitioned when easily extractable fine-grained parallelism exists. This approach to polymorphism provides better performance across a wide range of application types than an approach in which many small processors are aggregated to run workloads with irregular parallelism. Our results show that high performance can be obtained in each of the three modes (ILP, TLP, and DLP), demonstrating the viability of the polymorphous coarse-grained approach for future microprocessors.

1 Introduction

General-purpose microprocessors owe their success to their ability to run many diverse workloads well. Today, many application-specific processors, such as desktop, network, server, scientific, graphics, and digital signal processors have been constructed to match the particular parallelism characteristics of their application domains. Building processors that are not only general purpose for single-threaded programs but for many types of concurrency as well would provide substantive benefits in terms of system flexibility as well as reduced design and mask costs.

Unfortunately, design trends are applying pressure in the opposite direction: toward designs that are more specialized, not less. This performance fragility, in which applications incur large swings in performance based on how well they map to a given design, is the result of the combination of two trends: the diversification of workloads (media, streaming, network, desktop) and the emergence of chip multiprocessors (CMPs), for which the number and granularity of processors is fixed at processor design time.

One strategy for combating processor fragility is to build a heterogeneous chip, which contains multiple processing cores, each designed to run a distinct class of workloads effectively. The proposed Tarantula processor is one such example of integrated heterogeneity [8]. The two major downsides to this approach are (1) increased hardware complexity since there is little design reuse between the two types of processors and (2) poor resource utilization when the application mix contains a balance different than that ideally suited to the underlying heterogeneous hardware.

An alternative approach to designing an integrated solution using multiple heterogeneous processors is to build one or more homogeneous processors on a die, which mitigates the aforementioned complexity problem. When an application maps well onto the homogeneous substrate, the utilization problem is solved, as the application is not limited to one of several heterogeneous processors. To solve the fragility problem, however, the homogeneous hardware must be able to run a wide range of application classes effectively. We define this architectural polymorphism as the capability to configure hardware for efficient execution across broad classes of applications.

A key question is what granularity of processors and memories on a CMP is best for polymorphous capabilities. Should future billion-transistor chips contain thousands of fine-grain processing elements (PEs) or far fewer extremely coarse-grain processors? The success or failure of polymorphous capabilities will have a strong effect on the answer to these questions. Figure 1 shows a range of points in the spectrum of PE granularities that are possible for a 400 mm² chip in 100nm technology. Although other possible topologies certainly exist, the five shown in the diagram represent a good cross-section of the overall space:

a) Ultra-fine-grained FPGAs.

b) Hundreds of primitive processors connected to memory banks such as a processor-in-memory (PIM) architecture or reconfigurable ALU arrays such as RaPiD [7], Piperench [9], or PACT [3].

c) Tens of simple in-order processors, such as in RAW [25] or Piranha [2] architectures.

d) Coarse grained architectures consisting of 10-20 4-issue cores, such as the Power4 [22], Cyclops [4], MultiScalar processors [19], other proposed speculatively-threaded CMPs [6, 20], and the polymorphous Smart Memories [15] architecture.

e) Wide-issue processors with many ALUs each, such as Grid Processors [16].

[Figure 1. Granularity of parallel processing elements on a chip. Panels, from finest to coarsest grain: (a) FPGA, millions of gates; (b) PIM, 256 processing elements; (c) fine-grain CMP, 64 in-order cores; (d) coarse-grain CMP, 16 out-of-order cores; (e) TRIPS, 4 ultra-large cores. Designs toward the fine-grain end exploit fine-grain parallelism more effectively; designs toward the coarse-grain end run more applications effectively.]

The finer-grained architectures on the left of this spectrum can offer high performance on applications with fine-grained (data) parallelism, but will have difficulty achieving good performance on general-purpose and serial applications. For example, a PIM topology has high peak performance, but its performance on control-bound codes with irregular memory accesses, such as compression or compilation, would be dismal at best. At the other extreme, coarser-grained architectures traditionally have not had the capability to use internal hardware to show high performance on fine-grained, highly parallel applications.

Polymorphism can bridge this dichotomy with either of two competing approaches. A synthesis approach uses a fine-grained CMP to exploit applications with fine-grained, regular parallelism, and tackles irregular, coarser-grain parallelism by synthesizing multiple processing elements into larger “logical” processors. This approach builds hardware more to the left on the spectrum in Figure 1 and emulates hardware farther to the right. A partitioning approach implements a coarse-grained CMP in hardware, and logically partitions the large processors to exploit finer-grain parallelism when it exists.

Regardless of the approach, a polymorphous architecture will not outperform custom hardware meant for a given application, such as graphics processing. However, a successful polymorphous system should run well across many application classes, ideally running with only small performance degradations compared to the performance of customized solutions for each application.

This paper proposes and describes the polymorphous TRIPS architecture, which uses the partitioning approach, combining coarse-grained polymorphous Grid Processor cores with an adaptive, polymorphous on-chip memory system. Our goal is to design cores that are both as large and as few as possible, providing maximal single-thread performance, while remaining partitionable to exploit fine-grained parallelism. Our results demonstrate that this partitioning approach solves the fragility problem by using polymorphous mechanisms to yield high performance for both coarse and fine-grained concurrent applications. To be successful, the competing approach of synthesizing coarser-grain processors from fine-grained components must overcome the challenges of distributed control, long interaction latencies, and synchronization overheads.

The rest of this paper describes the polymorphous hardware and configurations used to exploit different types of parallelism across a broad spectrum of application types. Section 2 describes both the planned TRIPS silicon prototype and its polymorphous hardware resources, which permit flexible execution over highly variable application domains. These resources support three modes of execution that we call major morphs, each of which is well suited for a different type of parallelism: instruction-level parallelism with the desktop or D-morph (Section 3), thread-level parallelism with the threaded or T-morph (Section 4), and data-level parallelism with the streaming or S-morph (Section 5). Section 6 shows how performance increases in the three morphs as each TRIPS core is scaled from a 16-wide up to an even coarser-grain, 64-wide issue processor. We conclude in Section 7 that by building large, partitionable, polymorphous cores, a single homogeneous design can exploit many classes of concurrency, making this approach promising for solving the emerging challenge of processor fragility.

[Figure 2. TRIPS architecture overview: (a) TRIPS chip, (b) TRIPS core, (c) execution node.]

2 The TRIPS Architecture

The TRIPS architecture uses large, coarse-grained processing cores to achieve high performance on single- [...]

[...] TRIPS are partitioned into large blocks of instructions with a single entry point, no internal loops, and possibly multiple possible exit points as found in hyperblocks [14]. For instruction and thread level parallel programs, blocks commit atomically and interrupts are block precise, meaning that they are handled only at block boundaries. For all modes of execution, the compiler is responsible for statically scheduling each block of instructions onto the computational engine such that inter-instruction dependences are explicit.
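
The block execution rules described above (a single entry point, no internal loops, one of several exits, atomic commit, and interrupts handled only at block boundaries) can be illustrated with a small software model. This is a toy sketch, not the TRIPS hardware or its compiler: the names `Block`, `BlockAtomicMachine`, `raise_interrupt`, and `pick_exit` are hypothetical, registers are modeled as a plain dictionary, and static placement of instructions onto the execution grid is not modeled at all.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

Registers = Dict[str, int]


@dataclass
class Block:
    """Hypothetical hyperblock-like unit: one entry, straight-line body, several exits."""
    name: str
    # Straight-line body (no internal loops). Each step reads the block's
    # private view of the registers and returns the updates it produces.
    body: List[Callable[[Registers], Registers]]
    # Picks the next block name from the final view, or None to stop.
    pick_exit: Callable[[Registers], Optional[str]]


class BlockAtomicMachine:
    """Toy model of block-atomic commit with block-precise interrupts (an assumption-laden sketch)."""

    def __init__(self, blocks: List[Block], registers: Registers) -> None:
        self.blocks = {b.name: b for b in blocks}
        self.registers = dict(registers)            # architectural state
        self.pending_interrupts: List[str] = []
        self.handled: List[Tuple[str, str]] = []    # (interrupt, block it followed)

    def raise_interrupt(self, tag: str) -> None:
        # Interrupts are only queued here; they are never handled mid-block.
        self.pending_interrupts.append(tag)

    def run_block(self, name: str) -> Optional[str]:
        block = self.blocks[name]
        view = dict(self.registers)   # private copy; architectural state is untouched
        for step in block.body:
            view.update(step(view))   # an exception here discards the whole block
        self.registers = view         # atomic commit: all updates become visible together
        while self.pending_interrupts:  # block boundary: now interrupts may be handled
            self.handled.append((self.pending_interrupts.pop(0), name))
        return block.pick_exit(view)

    def run(self, entry: str) -> None:
        current: Optional[str] = entry
        while current is not None:
            current = self.run_block(current)


# Usage: a loop is expressed across blocks (the block exits back to itself),
# never inside one block.
loop = Block(
    name="loop",
    body=[
        lambda r: {"acc": r["acc"] + r["i"]},
        lambda r: {"i": r["i"] - 1},
    ],
    pick_exit=lambda r: "loop" if r["i"] > 0 else None,
)
machine = BlockAtomicMachine([loop], {"acc": 0, "i": 3})
machine.raise_interrupt("timer")
machine.run("loop")
print(machine.registers)  # {'acc': 6, 'i': 0}
print(machine.handled)    # [('timer', 'loop')]
```

Copying the register state into a private view and swapping it in at the end is only one simple way to model atomicity in software; it was chosen here because a failure partway through a block then leaves the architectural state exactly as it was before the block started.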