Design and Implementation of the TRIPS EDGE Architecture
TRIPS Tutorial
Originally presented at ISCA-32, June 4, 2005
Department of Computer Sciences, The University of Texas at Austin

Tutorial Outline

• Part I: Overview
  – Introduction (Doug Burger/Steve Keckler)
  – EDGE architectures and the TRIPS ISA (Doug Burger)
  – TRIPS and prototype overview (Steve Keckler)
  – TRIPS code examples (Robert McDonald)
• Part II: TRIPS Front End
  – Global control (Ramadass Nagarajan)
  – Control-flow prediction (Nitya Ranganathan)
  – Instruction fetch (Haiming Liu)
  – Register accesses and operand routing (Karu Sankaralingam)
• Part III: TRIPS Execution Core & Memory System
  – Issue logic and execution (Premkishore Shivakumar)
  – Primary memory system (Simha Sethumadhavan)
  – Secondary memory system (Changkyu Kim)
  – Chip- and system-level networks and I/O (Paul Gratz)
• Part IV: TRIPS Compiler
  – Compiler overview (Kathryn McKinley)
  – Forming TRIPS blocks (Aaron Smith)
  – Code optimization (Kathryn McKinley)
• Part V: Conclusions

Names in parentheses refer to both the presenters at ISCA-32 and the main developers of each section's slide deck.


Copyright Information

• The material in this tutorial is Copyright © 2005-2006 by The University of Texas at Austin. All rights reserved.
• This material was developed as a part of the TRIPS project in the Computer Architecture and Technology Laboratory, Department of Computer Sciences, at The University of Texas at Austin. Please contact Doug Burger ([email protected]) or Steve Keckler ([email protected]) for information about redistribution or usage of these materials in other presentations.
• Portions of the TRIPS technology described in this tutorial are patent pending. Commercial parties interested in licensing or using TRIPS technology for profit should contact the project principal investigators listed above.

Acknowledgment of TRIPS Sponsors

• DARPA
  – Polymorphous Computing Architectures (2001-2006)
    • Contracts F33615-01-C-1892 and F44615-03-C-4106
  – Special thanks to Bob Graybill
• Air Force Research Laboratory
• Intel Research Council
  – 2 Research Grants (2000-2004) and equipment donations (2000-2005)
• IBM
  – Faculty Partnership Awards (1999-2004) and SUR grants (1999, 2003)
  – Research Grant (2003)
• National Science Foundation
  – CAREER and Research Infrastructure Grants (1999-2008)
• Peter O'Donnell Foundation
• UT-Austin College of Natural Sciences
  – Matching funds for industrial fellows, infrastructure grants (1999-2008)


Other Members of the TRIPS Team

Research Scientists
• Bill Yoder – Chief developer of the TRIPS toolchain
• Jim Burrill – Chief engineer of the Scale compiler
• Nick Nethercote – TRIPS compiler performance leader

Graduate Students
• Xia Chen – PowerPC to TRIPS translation and emulation
• Raj Desikan – Initial data tile design
• Saurabh Drolia – Prototype DMA controller, system simulator, parallel software
• Madhu Sibi Govindan – Prototype external bus and clock controllers
• Divya Gulati – Execution tile design, core verification
• Heather Hanson – Operand router design
• Sundeep Kushwaha – Back-end instruction placement
• Bert Maher – Hyperblock construction and formation
• Suriya Narayanan – Compiler/memory system optimization
• Sadia Sharif – TRIPS system software

Relevant TRIPS/EDGE Publications

• "The Design and Implementation of the TRIPS Prototype Chip" – Hot Chips 17, 2005.
  – Contains details about the prototype ASIC.
• "Static Placement, Dynamic Issue (SPDI) Scheduling for EDGE Architectures" – Intl. Conference on Parallel Architectures and Compilation Techniques (PACT), 2004.
  – Specifies the compiler algorithm for scheduling instructions on the TRIPS architecture.
• "Scaling to the End of Silicon with EDGE Architectures" – IEEE Computer, July 2004.
  – Provides an overview of EDGE architectures using the TRIPS architecture as a case study.
• "Exploiting ILP, TLP, and DLP with the Polymorphous TRIPS Architecture" – Intl. Symposium on Computer Architecture (ISCA-30), 2003.
  – Details the flexibility of the TRIPS architecture for exploiting different types of parallelism.
• "An Adaptive, Non-Uniform Cache Structure for Wire-Delay Dominated On-Chip Caches" – Intl. Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS-X), 2002.
  – The first paper describing NUCA caches and their design tradeoffs.
• "A Design Space Evaluation of Grid Processor Architectures" – Intl. Symposium on Microarchitecture (MICRO-34), 2001.
  – An early study that formed the basis for the TRIPS architecture.


Principal Related Work

• Transport-Triggered Architectures [Supercomputing 1991]
  – Hans Mulder and Henk Corporaal
  – MOVE architecture had direct instruction-to-instruction communication bypassing registers
• WaveScalar [MICRO 2003]
  – Mark Oskin, Steve Swanson, Ken Michelson, and Andrew Schwerin
  – Imperative/dataflow EDGE ISA
  – 3 most significant differences with TRIPS:
    • Dynamic instruction placement
    • All-path execution
    • Two-level microarchitectural execution hierarchy
• ASH/CASH/Pegasus IR [CGO-03, ASPLOS-04, ISPASS-05]
  – Mihai Budiu and Seth Goldstein
  – Similar goals and related compiler techniques
  – More of a focus on compiling to ASICs or reconfigurable substrates
• RAW [IEEE Computer-97, ISCA-04]
  – Pioneering tiled architecture tackling concurrency and wire delays
  – Leading work on Scalar Operand Networks
  – Direct instruction-to-instruction communication through a network

Key Trends

• Power
  – Now a budget, just like area
  – What performance can you get for a given power and area budget?
  – Transferring work from the HW to the SW (compiler, programmer, etc.) is power efficient
  – Leakage trends must be addressed (not an underlying goal of TRIPS)
• Slowing (stopping?) frequency increases
  – Pipelining has reached depth limits
  – Device speed scaling may stop tracking feature size reduction
  – May result in decreasing main memory latencies
  – Must exploit more concurrency
  – Key issue: how to expose this concurrency? Force the programmer?
• Wire delays
  – Will force growing partitioning of future chips
  – Work against reduced power and improved concurrency
• Reliability
  – Do everything twice (or thrice) at some level of abstraction
  – Works against power limits and exploitation of concurrency

[Figure: fraction of a 20 mm chip reachable by the I-stream in one cycle, shrinking from 130 nm through 90 nm and 65 nm to 35 nm]


Performance Scaling and Technology Challenges

[Figure: single-processor performance scaling, log2 speedup vs. year (1984-2020), against the historical 55%/year improvement. Annotations: RISC ILP wall as RISC/CISC becomes CPI dominated and industry shifts to a frequency strategy; architectural frequency wall (90 nm, 65 nm, 45 nm, 35 nm, 25 nm) as pipelining reaches depth limits and device speed scaling slows, after which conventional architectures cannot improve performance and concurrency is the only solution for higher performance]

EDGE Architectures: Can New ISAs Help?

CISC ('60s, '70s):
– Complex, few instructions
– Discrete components
– Minimize storage
– Small numbers of instructions in flight
– Pipelining difficult

RISC ('80s, '90s, early '00s):
– More numerous, simple instructions
– RISC concepts implemented in both the ISA and the H/W
– Reliance on compiler ordering
– Optimized for pipelining
– Tens of instructions in flight
– Wide issue inefficient

EDGE? (late '00s, '10s):
– Blocks amortize per-instruction overhead
– Hundreds to thousands of instructions in flight
– Inter-instruction communication explicit
– Exploits more concurrency
– Leakage and reliability not yet addressed
– New programming models needed?


The TRIPS EDGE ISA

ISCA-32 Tutorial
Doug Burger
Department of Computer Sciences, The University of Texas at Austin

What is an EDGE Architecture?

• Explicit Data Graph Execution – defined by two key features:
  1. The program graph is broken into sequences of blocks
     – Basic blocks, hyperblocks, or something else
     – Blocks commit atomically or not at all; a block never partially executes
  2. Within a block, the ISA supports direct producer-to-consumer communication
     – No shared named registers within a block (point-to-point dataflow edges only)
     – Caveat: memory is still a shared namespace
     – The block's dataflow graph (DFG) is explicit in the architecture
• What are the design alternatives for different architectures?
  – Single path through the program block graph (TRIPS)
  – All paths through the program block graph (WaveScalar)
  – Mechanism for specifying instruction-to-instruction links in the ISA
  – Block constraints (composition, fixed size vs. variable size, etc.)


Architectural Structure of a TRIPS Block

[Figure: a block's dataflow graph (DFG) of 1-128 instructions, connected to the register banks (32 read instructions in, 32 write instructions out), to memory (32 loads and 32 stores; addresses+targets sent to memory, data returned to target instructions; store mask sent to a write buffer), and to the PC (PC read in, one emitted branch producing the next PC)]

TRIPS Block Contents

Blocks encoded in the object file include:
• Read/write instructions
• Computation instructions
• 32-bit store mask for tracking store completion
• Load/store IDs
  – Maximum of 32 loads+stores may be emitted, but blocks can hold more

Block characteristics:
• Fixed size
  – 128 instructions max
  – L1 and core expand empty 32-inst chunks to NOPs
• Registers
  – 8 read insts. max per reg. bank (4 banks = max of 32)
  – 8 write insts. max per reg. bank (4 banks = max of 32)
• Control flow
  – Exactly one branch emitted (blocks may hold more)

Two major considerations:
• Every possible execution of a given block must emit a consistent number of outputs (register writes, stores, PC)
  – Outputs may be actual or null
  – Different blocks may have different #s of outputs
• No block state may be written until the block is committed
  – Bulk commit is logically atomic
  – Similar to an architectural checkpoint in some respects


TRIPS Block Example

RISC code:
    ld   R3, 4(R2)
    add  R4, R1, R3
    st   R4, 4(R2)
    addi R5, R4, #2
    beqz R4, Block3

TIL (operand format):
    .bbegin block1
    read $t1, $g1
    read $t2, $g2
    ld   $t3, 4($t2)
    add  $t4, $t1, $t3
    st   $t4, 4($t2)
    addi $t5, $t4, 2
    teqz $t6, $t4
    b_t<$t6> block3
    b_f<$t6> block2
    write $g5, $t5
    .bend block1
    .bbegin block2
    ...

TASL (target format):
    [R1] $g1 [2]
    [R2] $g2 [1] [4]
    [1] ld L[1] 4 [2]
    [2] add [3] [4]
    [3] mov [5] [6]
    [4] st S[2] 4
    [5] addi 2 [W1]
    [6] teqz [7] [8]
    [7] b_t block3
    [8] b_f block2
    [W1] $g5

Notes on the example:
• Read target format
• Predicated instructions
• LD/ST sequence numbers
• Target fanout with movs
• Block outputs fixed (3 in this example)

Key Features of the TRIPS ISA

• Fixed instruction lengths – all instructions are 32 bits
• Read and write instructions
  – Contained in the block header for managing DFG/register communication
• Target format (T0, T1)
  – Output destinations specified with 9-bit targets
• Predication (PR)
  – Nearly every instruction has a predicate field, encoded for efficient dataflow predication
• Load/store IDs (LSID)
  – Used to maintain sequential memory semantics despite the EDGE ISA
• Exit bits (EXIT)
  – Used to improve block exit control prediction


TRIPS Instruction Formats

General instruction formats (bits 31-25, 24-23, 22-18, 17-9, 8-0):
    G: OPCODE | PR | XOP  | T1  | T0
    I: OPCODE | PR | XOP  | IMM | T0

Load and store instruction formats (bits 31-25, 24-23, 22-18, 17-9, 8-0):
    L: OPCODE | PR | LSID | IMM | T0
    S: OPCODE | PR | LSID | IMM | 0

Branch instruction format (bits 31-25, 24-23, 22-20, 19-0):
    B: OPCODE | PR | EXIT | OFFSET

Constant instruction format (bits 31-25, 24-9, 8-0):
    C: OPCODE | CONST | T0

Read instruction format (bits 21, 20-16, 15-8, 7-0):
    R: V | GR | RT0 | RT1

Write instruction format (bits 5, 4-0):
    W: V | GR

Not shown: M3, M4 formats

Instruction fields:
• OPCODE = primary opcode
• XOP = extended opcode
• PR = predicate field (00: unpredicated; 01: reserved; 10: predicated on false; 11: predicated on true)
• IMM = signed immediate
• T0 = target 0 specifier; T1 = target 1 specifier
• LSID = load/store ID
• EXIT = exit number
• OFFSET = branch offset
• CONST = 16-bit constant
• V = valid bit
• GR = general register index
• RT0 = read target 0 specifier; RT1 = read target 1 specifier

TRIPS G and L Formats

• G format: opcode (7 bits), pr (2 bits), xop (5 bits), T1 (9 bits), T0 (9 bits)
• L format: opcode (7 bits), pr (2 bits), LSID (5 bits), IMM (9 bits), T0 (9 bits)
  – LSID: 5-bit sequence number for ordering loads and stores
  – IMM: 9-bit address index for calculating the effective address
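To make the encoding concrete, here is a minimal C sketch (not from the tutorial; the helper name is hypothetical) that packs a G-format instruction from the fields above:

    #include <stdint.h>

    /* Pack a TRIPS G-format instruction:
       OPCODE[31:25] PR[24:23] XOP[22:18] T1[17:9] T0[8:0] */
    static uint32_t pack_g_format(uint32_t opcode, uint32_t pr,
                                  uint32_t xop, uint32_t t1, uint32_t t0)
    {
        return ((opcode & 0x7Fu)  << 25) |   /* 7-bit primary opcode  */
               ((pr     & 0x3u)   << 23) |   /* 2-bit predicate field */
               ((xop    & 0x1Fu)  << 18) |   /* 5-bit extended opcode */
               ((t1     & 0x1FFu) <<  9) |   /* 9-bit target 1        */
                (t0     & 0x1FFu);           /* 9-bit target 0        */
    }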


Target (Operand) Formats

A target field is 9 bits: a 2-bit type followed by a 7-bit instruction name.

2-bit type:
• 00: no target (or a write instruction)
• 01: targets the predicate field
• 10: targets the left operand
• 11: targets the right operand

7 bits:
• Names one of 128 target instructions
• The physical location of the instruction is microarchitecture dependent

Write targets: a 5-bit field names one of 32 write instructions.

Object Code Example

TASL (target format)      Field encoding
    [R1] $g1 [2]          v reg T1 T0
    [R2] $g2 [1] [4]      v reg T1 T0
    [1] ld L[1] 4 [2]     opcode pr LSID imm T0
    [2] add [3] [4]       opcode pr xop T1 T0
    [3] mov [5] [6]       opcode pr xop T1 T0
    [4] st S[2] 4         opcode pr LSID imm 0
    [5] addi 2 [W1]       opcode pr xop imm T0
    [6] teqz [7] [8]      opcode pr xop T1 T0
    [7] b_t block3        opcode pr exit offset
    [8] b_f block2        opcode pr exit offset
    [W1] $g5              v reg


Object Code Example

TASL (target format)      Object code
    [R1] $g1 [2]          1 00001 [2] ----
    [R2] $g2 [1] [4]      1 00010 [1] [4]
    [1] ld L[1] 4 [2]     ld 00 00001 4 [2]
    [2] add [3] [4]       add 00 --- [3] [4]
    [3] mov [5] [6]       mov 00 --- [5] [6]
    [4] st S[2] 4         st 00 00010 4 ---
    [5] addi 2 [W1]       addi 00 --- 2 [W1]
    [6] teqz [7] [8]      teqz 00 --- [7] [8]
    [7] b_t block3        b 11 (t) E0 block3 - PC
    [8] b_f block2        b 10 (f) E1 block2 - PC
    [W1] $g5              1 00101

Store mask: 00000000000000000000000000000100

TRIPS Block Format

• Each block is formed from two to five 128-byte program "chunks"
• Blocks with fewer than five chunks are expanded to five chunks in the L1 I-cache
• The header chunk includes a block header (execution flags plus a store mask) and the register read/write instructions
• Each instruction chunk holds 32 4-byte instructions (including NOPs)
• A maximally sized block contains 128 regular instructions, 32 read instructions, and 32 write instructions

[Figure: PC pointing to a header chunk (128 bytes) followed by instruction chunks 0-3 (128 bytes each)]
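The store mask falls out directly from the store LSIDs. A minimal C sketch (hypothetical helper; for the example block the only store is S[2], setting bit 2):

    #include <stdint.h>

    /* Build the block's 32-bit store mask: one bit per store LSID. */
    static uint32_t build_store_mask(const uint8_t *store_lsids, int nstores)
    {
        uint32_t mask = 0;
        for (int i = 0; i < nstores; i++)
            mask |= UINT32_C(1) << store_lsids[i];   /* LSIDs are 0-31 */
        return mask;   /* example block: 0x00000004 */
    }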


Example of Predicate Operation

    if (p==0)
        z = a * 2 + 3;
    else
        z = b * 3 + 4;

• Test instructions output the low-order bit of 1 or 0 (T or F)
• Predicated instructions either match the arriving predicate or not
  – An ADD with PR=10 is predicated on false; an arriving 0 predicate will match and enable the ADD
  – Predicated instructions must receive all operands and a matching predicate before firing
  – Instructions may receive multiple predicate operands but at most one matching predicate
  – Despite predicates, blocks must produce a fixed # of outputs

[Figure: dataflow graph in which teqz p steers two predicated muli/addi/st(1) chains, one per arm of the if-else]

Predicate Optimization and Fanout Reduction

• Predicate dependence heads
  – Reduce the number of instructions executed
  – Reduce dynamic energy by using less speculation
• Predicate dependence tails
  – Tolerate predicate latencies with speculatively executed insts.
  – Cannot speculate on block outputs

[Figure: predicating at the head (teqz gates the muli chains, with mov3 fanout of the predicate) vs. at the tail (muli/addi chains execute speculatively and only the st(1) is predicated)]
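A minimal C sketch of the firing rule described above (hypothetical struct; the real mechanism is a hardware reservation station, not software):

    #include <stdbool.h>

    enum { PR_NONE = 0, PR_RESERVED = 1, PR_ON_FALSE = 2, PR_ON_TRUE = 3 };

    typedef struct {
        unsigned pr;          /* 2-bit predicate field (PR)       */
        bool ops_ready;       /* all data operands have arrived   */
        bool pred_arrived;    /* a predicate operand has arrived  */
        unsigned pred_value;  /* low-order bit of the test result */
    } Inst;

    /* Fire when operands are ready and, if predicated, a matching
       predicate has arrived (PR=10 matches 0, PR=11 matches 1). */
    static bool can_fire(const Inst *i)
    {
        if (!i->ops_ready)
            return false;
        if (i->pr == PR_NONE)
            return true;
        return i->pred_arrived &&
               (i->pred_value == (i->pr == PR_ON_TRUE ? 1u : 0u));
    }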


Instruction (Store) Merging

• Old: two copies of stores down distinct predicate paths
  – Speculative copies unneeded
• New: merge redundant insts.
  – Only one data source will fire

[Figure: the two predicated muli/addi/st(1) chains from the previous slide merged so that a single st(1) is fed by whichever predicated chain fires]

Nullified Outputs

• Each transmitted operand is tagged with "null" and "exception" tokens
  – Nullified operands arriving at insts. produce nullified outputs
  – Nulls are used to indicate that a block won't produce a certain output (writes and stores)

Block with a store down one conditional path:
    ...
    if (p==0)
        z = a * 2 + 3;
    ...

[Figure: teqz feeds a null instruction on one path and the muli/addi/st(1) chain on the other, so the store output is produced either way]


Exception Handling

• TRIPS exception types:
  – fetch, breakpoint, timeout, execute, store, system call, reset, interrupt
• The TRIPS exception model demands that exceptions not be raised for speculative instructions
• Key concept: propagate exception tokens within the dataflow graph
  – Instructions with exception-set operands do not execute, but propagate the exception token to their targets
  – Exceptions are detected at commit time if one or more block outputs (write, store, branch) have the exception bit set
  – Exceptions are serviced between blocks
• "Block-precise" exception model

Other Architecturally Supported Features

• Execution mode via Control Register
  – Types of speculation (load, branch)
  – Number of blocks in flight
  – Breakpoints before/after blocks execute
• T-Morph via Processor Control Register
  – Can place processors into single-threaded or 4-threaded SMT modes
• Per-bank cache/scratchpad configurations
• TLBs managed by software
• On-chip network routing tables programmed by software
• 40-bit global system memory address space
  – Divided into SRF, cached memory, and configuration quadrants


TRIPS Microarchitecture and Implementation

ISCA-32 Tutorial
Steve Keckler
Department of Computer Sciences, The University of Texas at Austin

TRIPS Microarchitecture Principles

• Limit wire lengths
  – Architecture is partitioned and distributed
  – No centralized resources
  – Local wires are short
  – No global wires
• Design for scalability
  – Microarchitecture composed of different types of tiles
    • Tiles are small (2-5 mm2 per tile is typical)
  – Networks connect only nearest neighbors
  – Design productivity by replicating tiles (design reuse)
  – Tiles interact through well-defined control and data networks
    • Networks are extensible, even late in the design cycle
    • Opportunity for asynchronous interfaces
  – Prototype employs communication latencies of 35nm technology, even though the prototype is implemented in 130nm


TRIPS Chip Block Diagram

[Figure: two TRIPS processors and two DMA controllers attached to the on-chip network, which connects 16 M-tiles (1 MB L2 cache) and 16 N-tiles, two SDRAM controllers (DDR-266 modules), an external bus controller (interrupts/IRQ, external bus), and a chip-to-chip controller (C2C links at 266 MHz)]

TRIPS Tile-level Microarchitecture

TRIPS tiles:
• G: Processor control – TLB w/ variable-size pages, dispatch, next-block predict, commit
• R: Register file – 32 registers x 4 threads, register forwarding
• I: Instruction cache – 16KB storage per tile
• D: Data cache – 8KB per tile, 256-entry load/store queue, TLB
• E: Execution – Int/FP ALUs, 64 reservation stations
• M: Memory – 64KB, configurable as L2 cache or scratchpad (1MB total)
• N: OCN network interface – router, translation tables
• DMA: Direct memory access controller
• SDC: DDR SDRAM controller
• EBC: External bus controller – interface to external PowerPC
• C2C: Chip-to-chip network controller – 4 links to XY neighbors


Grid Processor Tiles and Interfaces

[Figure: 5x5 grid of G, R, I, D, and E tiles]

• GDN: global dispatch network
• OPN: operand network
• GSN: global status network
• GCN: global control network

Mapping TRIPS Blocks to the Microarchitecture

[Figure: R-tile[3] holds read/write queues (8 frames x 8 entries per thread, threads 0-3) in front of the architecture register files; E-tile[3,3] holds instruction reservation stations indexed by frame (0-7), slot (0-7), and operand (I, OP1, OP2); D-tile[3] holds load/store queue entries indexed by LSID (0-31)]


Mapping TRIPS Blocks to the Microarchitecture

• Block i is mapped into frame 0: its header chunk fills one slot of each R-tile's read/write queues, and instruction chunks 0-3 fill frame 0 of the reservation stations across the E-tiles
• Block i+1 is mapped into frame 1 in the same way

[Figure: two copies of the R-tile/E-tile/D-tile structures from the previous slide, showing block i occupying frame 0 and block i+1 occupying frame 1]


Mapping Target Identifiers to Reservation Stations

ISA target identifier fields: Type (2 bits), Y (2 bits), Slot (3 bits), X (2 bits). The Frame (3 bits) is assigned by the G-tile at runtime.

Example: Target = 87, OP1
    Frame: 100 (frame 4, runtime-assigned)
    Type:  10 = OP1
    Y:     10, X: 11 (E-tile [10,11])
    Slot:  101

Processor Core Tiles and Interfaces

• Fetch
  – G-tile examines I$ tags
  – Sends dispatch command to I-tiles
  – I-tiles fetch instructions and distribute them to R-tiles and E-tiles

[Figure: tile grid showing the dispatch flow from the G-tile through the I-tiles to the R-tiles and E-tiles]
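A minimal C sketch of this decode (hypothetical helper; the field order follows the slide):

    #include <stdint.h>

    /* Split a 9-bit target identifier into Type[8:7], Y[6:5],
       Slot[4:2], X[1:0]; the 3-bit frame is supplied by the G-tile
       at dispatch, not by the instruction bits. */
    typedef struct { unsigned type, y, slot, x; } RsAddr;

    static RsAddr decode_target_id(uint16_t t9)
    {
        RsAddr a;
        a.type = (t9 >> 7) & 0x3;   /* 10 = OP1 */
        a.y    = (t9 >> 5) & 0x3;
        a.slot = (t9 >> 2) & 0x7;
        a.x    =  t9       & 0x3;
        return a;  /* OP1 target 87 -> Y=10, Slot=101, X=11: E-tile [10,11] */
    }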


Processor Core Tiles and Interfaces (continued)

• Execute
  – R-tiles inject register values
  – E-tiles execute block instructions
    • compute, load/store
  – E-tiles deliver outputs
    • Results to R-tiles/D-tiles
    • Branch to G-tile
• Commit
  – R-tiles/D-tiles accumulate outputs
    • Send completion messages to the G-tile
  – G-tile sends a commit message
  – R-tiles/D-tiles respond with a commit acknowledge
    • R-tiles commit state to the register files
    • D-tiles commit state to the data cache


Block Execution Timeline

[Figure: timeline (cycles 0-40+) showing blocks Bi through Bi+8 mapped to frames 2-7 and 0-1, each passing through FETCH, EXECUTE (variable execution time), and COMMIT; Bi+8 stalls until Bi's frame frees]

• Execute/commit overlapped across multiple blocks
• G-tile manages frames as a circular buffer
  – D-morph: 1 thread, 8 frames
  – T-morph: up to 4 threads, 2 frames each

On-Chip Network

• Flexible/fast on-chip network fabric
  – Programmable translation tables
  – Configurable 1 MB memory system
  – 128-bit wide, 533 MHz
  – 6.8 GB/sec per link
• M-tile (on-chip memory)
  – 64 KB SRAM capacity
  – Configurable tag array (on/off)
• N-tile with translation tables
  – System address ⇒ OCN address
• EBC – external bus controller
  – Connection to the master control processor
• SDC – DDR1 SDRAM controller
  – 1.1 GB/sec per controller
• DMA controller
  – Linear, scatter-gather, chaining

[Figure: OCN mesh of N-tiles and M-tiles connecting GP0, GP1, the DMA controllers, SDCs, EBC, and C2C]

Chip-to-Chip (C2C) Network

• Glueless interchip network
  – Extension of the OCN
  – 2D mesh router (N/S/E/W links)
  – 64 bits wide
  – 266 MHz, 1.7 GB/sec per link

Chip Implementation

• Chip specs
  – 130 nm 7LM IBM ASIC
  – 18.3mm x 18.3mm die
  – 853 signal I/Os
    • 568 for C2C
    • 216 for SDCs
    • ~70 misc I/Os
• Schedule
  – 5/99: Seed of TRIPS ideas
  – 12/01: Publish TRIPS concept architecture
  – 10/02: Publish NUCA concept architecture
  – 6/03: Start high-level prototype design
  – 10/03: Start RTL design
  – 2/05: RTL 90% complete
  – 7/05: Design 100% complete
  – 9/05: Tapeout
  – 12/05: Start system evaluation in lab

[Figure: TRIPS working floorplan (to scale)]


Prototype Simplifications

• Architecture simplifications
  – No hardware mechanisms
  – No floating-point divider
  – No native exception handling
  – Simplified exception reporting
  – Limited support for variable-sized blocks
• Microarchitecture simplifications
  – No I-cache prefetching or block prediction for I-cache refill
  – No variable resource allocation (for smaller blocks)
  – One cycle per hop on every network
  – No clock-gating
  – Direct-mapped and replicated LSQ (non-associative, oversized)
  – No selective re-execution
  – No fast block refetch

TRIPS Prototype System Board

• TRIPS chip network
  – 4 TRIPS chips (8 processors) per board
  – Connected using a 2x2 chip-to-chip network
  – Each chip provides local and remote access to 2GB SDRAM
• PPC440GP
  – Configures and controls the TRIPS chips
  – Connects to a host PC for user interface (via serial port or ethernet)
  – Runs an embedded Linux OS
  – Includes a variety of integrated peripherals
• SDRAM DIMMs
  – 8 slots for up to 8 GB TRIPS memory
  – 2 slots for up to 1 GB PPC memory
• FPGA – usage TBD but may include
  – Protocol translator for high-speed I/O devices
  – Programmable acceleration engine

[Figure: board diagram with 4 TRIPS chips, C2C connectors, SDRAM, a Virtex-II Pro FPGA, flash, a clock generator, the IBM PPC440GP with EBI, and a host PC attached via ethernet/serial]


Maximum Size TRIPS System

[Figure: boards 0-7, each with 4 TRIPS chips, FPGA, flash, PPC440GP, and SDRAM, connected through C2C links and an ethernet switch to a host PC]

TRIPS Prototype System Capabilities

• Clock speeds
  – 533MHz internal clock
  – 266MHz C2C clock
  – 133/266 SDRAM clock
• Single chip
  – 2 processors per chip, each with
    • 64 KB L1-I$, 32KB L1-D$
    • 8.5 Gops or Gflops/processor
  – 17 Gops or Gflops/chip peak
  – Up to 8 threads
  – 1 MB on-chip memory
    • L2 cache or scratchpad
  – 2 GB SDRAM
  – 2.2 GB/sec DRAM bandwidth
• Full system (8 boards)
  – 32 chips, 64 processors
  – 1024 64-bit FPUs
  – 545 Gops/Gflops
  – Up to 256 threads
  – 64GB SDRAM
  – 70 GB/s aggregate DRAM bandwidth
  – 6.8 GB/sec C2C bisection bandwidth


Code Examples

ISCA-32 Tutorial
Robert McDonald
Department of Computer Sciences, The University of Texas at Austin

Two Examples

• Vector Add Example
  – A simple for loop
  – Basic unrolling
  – TRIPS Intermediate Language (TIL)
  – Block dataflow graph
  – Placed instructions
  – TRIPS Assembly Language (TASL)
  – Sample execution
• Linked List Example
  – A while loop with pointers
  – TRIPS predication
  – Predicated loop unrolling


Vector Add – C Code

• Consider this simple C routine
• The routine does an add and accumulate for fixed-size vectors
• An initial control flow graph is shown

    #define VSIZE 1024
    void vadd(double* A, double* B, double* C)
    {
        int i;
        for (i = 0; i < VSIZE; i++) {
            C[i] += (A[i] + B[i]);
        }
    }

[CFG: enter → i = 0 → { C[i] += A[i] + B[i]; i++; i < 1024 } → loop back on T, return on F]

Vector Add – Unrolled

• This loop wants to be unrolled
• Unrolling reduces the overhead per loop iteration
• It reduces the number of conditional branches that must be executed (see the C sketch below)

[CFG: enter → i = 0 → { C[i] += A[i] + B[i]; ... ; C[i+7] += A[i+7] + B[i+7]; i += 8; i < 1024 } → loop back on T, return on F]
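A C rendering of the unrolled loop body sketched in the CFG above (the tutorial itself shows only the diagram):

    /* vadd unrolled by 8, matching the CFG above */
    void vadd_unrolled(double* A, double* B, double* C)
    {
        int i = 0;
        do {
            C[i]   += A[i]   + B[i];
            C[i+1] += A[i+1] + B[i+1];
            C[i+2] += A[i+2] + B[i+2];
            C[i+3] += A[i+3] + B[i+3];
            C[i+4] += A[i+4] + B[i+4];
            C[i+5] += A[i+5] + B[i+5];
            C[i+6] += A[i+6] + B[i+6];
            C[i+7] += A[i+7] + B[i+7];
            i += 8;
        } while (i < 1024);
    }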


Vector Add – TIL Code

• The compiler produces TRIPS Intermediate Language (TIL) files
• Each file defines one or more program blocks
• Each block includes instructions, normally listed in program order
• Instructions are defined using a familiar syntax (name, target, sources)
• Read and write instructions access registers
• All other instructions deal with temporaries
• Notice the load/store IDs
• Notice the predicated branches

    .bbegin vadd$1
    read $t0, $g70
    read $t1, $g71
    read $t2, $g72
    read $t3, $g73
    ld $t4, ($t1) L[0]
    ld $t5, ($t2) L[1]
    ld $t6, ($t3) L[2]
    fadd $t7, $t5, $t6
    fadd $t8, $t4, $t7
    sd ($t1), $t8 S[3]
    ld $t9, 8($t1) L[4]
    ld $t10, 8($t2) L[5]
    ld $t11, 8($t3) L[6]
    fadd $t12, $t10, $t11
    fadd $t13, $t9, $t12
    sd 8($t1), $t13 S[7]
    ld $t14, 16($t1) L[8]
    ld $t15, 16($t2) L[9]
    ld $t16, 16($t3) L[10]
    fadd $t17, $t15, $t16
    fadd $t18, $t14, $t17
    sd 16($t1), $t18 S[11]
    ld $t19, 24($t1) L[12]
    ld $t20, 24($t2) L[13]
    ld $t21, 24($t3) L[14]
    fadd $t22, $t20, $t21
    fadd $t23, $t19, $t22
    sd 24($t1), $t23 S[15]
    ld $t24, 32($t1) L[16]
    ld $t25, 32($t2) L[17]
    ld $t26, 32($t3) L[18]
    fadd $t27, $t25, $t26
    fadd $t28, $t24, $t27
    sd 32($t1), $t28 S[19]
    ld $t29, 40($t1) L[20]
    ld $t30, 40($t2) L[21]
    ld $t31, 40($t3) L[22]
    fadd $t32, $t30, $t31
    fadd $t33, $t29, $t32
    sd 40($t1), $t33 S[23]
    ld $t34, 48($t1) L[24]
    ld $t35, 48($t2) L[25]
    ld $t36, 48($t3) L[26]
    fadd $t37, $t35, $t36
    fadd $t38, $t34, $t37
    sd 48($t1), $t38 S[27]
    ld $t39, 56($t1) L[28]
    ld $t40, 56($t2) L[29]
    ld $t41, 56($t3) L[30]
    fadd $t42, $t40, $t41
    fadd $t43, $t39, $t42
    sd 56($t1), $t43 S[31]
    addi $t45, $t0, 8
    addi $t47, $t1, 64
    addi $t49, $t2, 64
    addi $t51, $t3, 64
    genu $t52, 1024
    tlt $t54, $t45, $t52
    bro_t<$t54> vadd$1
    bro_f<$t54> vadd$2
    write $g70, $t45
    write $g71, $t47
    write $g72, $t49
    write $g73, $t51
    .bend

Vector Add – Block Dataflow Graph

• Read and load instructions gather block inputs
• Write, store, and branch instructions produce block outputs
• Address fanout can present an interesting challenge

[Abbreviated dataflow graph: the reads of C, A, B, and i fan out (2:17, 2:9, 2:9) to the eight ld/fadd/sd chains and to the addi/genu/tlt/write instructions and predicated bro_t/bro_f branches]


Vector Add – Placed Instructions

• The scheduler analyzes each block's dataflow graph
• It inserts fanout instructions, as needed
• It places the instructions within the block (they don't have to be in program order)
• It produces TRIPS Assembly Language (TASL) files that are fed to the assembler
• Instructions are described using a dataflow notation
• An instruction fires when its operands become available (regardless of position)

[Figure: placement of the block's ld/mov/fadd/sd/addi/genu/tlt/bro instructions across E-tiles ET0-ET15]

Vector Add – TASL Code

Excerpt (register reads, one ld/fadd/sd chain, and the block outputs; the full placed listing is abbreviated here):

    .bbegin vadd$1
    R[2] read G[70] N[14,0]
    R[3] read G[71] N[43,0] N[3,0]
    R[0] read G[72] N[48,0] N[12,0]
    R[1] read G[73] N[18,0] N[13,0]
    N[32] ld L[0] N[74,0]
    N[66] ld L[1] N[70,0]
    N[69] ld L[2] N[70,1]
    N[70] fadd N[74,1]
    N[74] fadd N[78,1]
    N[78] sd S[3]
    ...
    N[14] addi 8 N[22,0]
    N[22] mov N[79,0] W[2]
    N[43] addi 64 W[3]
    N[48] addi 64 W[0]
    N[18] addi 64 W[1]
    N[39] genu 1024 N[79,1]
    N[79] tlt N[83,p] N[111,p]
    N[83] bro_t vadd$1
    N[111] bro_f vadd$2
    W[2] write G[70]
    W[3] write G[71]
    W[0] write G[72]
    W[1] write G[73]
    .bend


Vector Add – Block-Level Execution

[Timeline (cycles 0-170): blocks i through i+8 are fetched in staggered fashion; each passes through fetch, execute, and commit; block i+8 stalls until a frame frees]

• The TRIPS processor can execute up to 8 blocks concurrently
• This diagram shows block latency and throughput for an optimal vector add loop
• It achieves ~7.4 IPC, ~3.2 loads & stores per cycle

Linked List – C Code

• For a second example, let's examine something more complex
• We'll be focusing upon the code in red (the while-loop search, highlighted in the original slides)
• A linked list operation is performed using a while loop and a search pointer
• The number of while loop iterations can vary

    // type info
    struct foo
    {
        long tag;
        long value;
        struct foo * next;
    };
    struct foo * head;

    void list_add( long key )
    {
        struct foo * p = head;
        // search list and increment node
        while (p)
        {
            if (p->tag == key)
            {
                p->value++;
                break;
            }
            p = p->next;
        }
        // ...
    }


Linked List – CFG

• The while loop's control flow graph is shown here
• It consists of four rather small basic blocks
• It's hard to get anywhere fast with all of those branches
• We'd like to use predication to form a larger program block (or hyperblock)

[CFG: BB0 tests p == 0 (T exits the loop); BB1 tests p->tag == key; on T, BB2 does p->value++; on F, BB3 does p = p->next with a back edge to BB0]

Linked List – Hyperblock #1

• Here is a TIL representation of the hyperblock
• Predicate p0 represents the test to determine whether p is null
• Predicate p1 represents the test to determine whether the tag matches the key
• The test instruction that produces p1 is itself predicated upon p0 being false
• That means that there are only three possible predicate outcomes:
  – p0 true, p1 unresolved
  – p0 false, p1 false
  – p0 false, p1 true
• This example demonstrates "predicate-oring", which allows multiple exclusive predicates to be OR'd at the target instruction
• This example demonstrates write and store nullification, which allows block outputs to be cancelled
• Loads are allowed to execute speculatively because exceptions will be predicated downstream

    .bbegin block1
    read $t0, $g70            ; key
    read $t1, $g71            ; p
    ; BB0
    teqi $p0, $t1, 0
    ; BB1
    ld $t3, ($t1)             ; p->tag
    teq_f <$p0> $p1, $t3, $t0
    ld $t4, 16($t1)           ; p->next
    ; BB2
    ld $t10, 8($t1)           ; p->data
    null_t <$p0> $t11
    addi_t <$p1> $t11, $t10, 1
    null_f <$p1> $t11
    sd 8($t1), $t11           ; p->data
    ; BB3
    null_t <$p0,$p1> $t99
    mov_f <$p1> $t99, $t4
    write $g71, $t99          ; p
    ; branches
    bro_t <$p0,$p1> block2
    bro_f <$p1> block1
    .bend


Linked List – Predicated Loop Unrolling

• With predication, it's also possible (and helpful) to unroll this loop
• Many variations are possible
• This diagram shows just two iterations within the hyperblock
• The number of block exits and block outputs remains the same as before
• The compiler can use static heuristics or profile data to choose an unrolling factor
• Can sometimes use instruction merging to combine multiple statements into one

[CFG: two unrolled iterations inside one hyperblock: p == 0 test, p->tag == key test, p = p->next, then a second p == 0 test, second tag test, p->value++, and p = p->next]

Linked List – Hyperblock #2

• Here is some TIL for the unrolled while loop
• The changes from the first version are highlighted in the original slides
• Notice the use of predicate-ORing, which allows the block to finish executing early in some cases
• There are now four predicates and five possible outcomes:

    p0 p1 p2 p3
    T  -  -  -
    F  T  -  -
    F  F  T  -
    F  F  F  T
    F  F  F  F

    .bbegin block1
    read $t0, $g70              ; key
    read $t1, $g71              ; p
    teqi $p0, $t1, 0
    ld $t3, ($t1)               ; p->tag
    teq_f <$p0> $p1, $t3, $t0
    ld $t4, 16($t1)             ; p->next
    teqi_f <$p1> $p2, $t4, 0
    ld $t5, ($t4)               ; p->next->tag
    teq_f <$p2> $p3, $t5, $t0
    ld $t6, 16($t4)             ; p->next->next
    ; conditional p update
    null_t <$p0,$p1> $t99
    mov_t <$p2,$p3> $t99, $t4
    mov_f <$p3> $t99, $t6
    write $g71, $t99            ; p
    ; conditional p->data update
    ld_t <$p1> $t10, 8($t1)
    ld_f <$p1> $t10, 8($t4)
    null_t <$p0,$p2> $t11
    addi_t <$p1,$p3> $t11, $t10, 1
    null_f <$p3> $t11
    sd 8($t1), $t11             ; p->data
    ; branches
    bro_t <$p0,$p1,$p2,$p3> block2
    bro_f <$p3> block1
    .bend


Conclusion

• The EDGE ISA allows instructions to be both fetched and executed out-of-order with minimal book-keeping
• It's possible to use loop unrolling and predication to form large blocks
• The TRIPS predication model allows control flow to be converted into an efficient dataflow representation
• Operand fanout is more explicit in an EDGE ISA and costs extra instructions

Global Control

ISCA-32 Tutorial
Ramadass Nagarajan
Department of Computer Sciences, The University of Texas at Austin


G-Tile Location and Connections

[Figure: the G-tile connects to the EBC (HALT_GP out, GP_HALTED in), to the adjacent I-tile (GDN, GRN, OCN), to the R-tiles (GDN, GCN, GSN), and to the D-tiles (GCN, GSN, ESN, OPN)]

Major G-Tile Responsibilities

• Orchestrates global control for the core
  – Maps blocks to execution resources
• Instruction fetch control
  – Determines I-cache hits/misses, ITLB hits/misses
  – Initiates I-cache refills
  – Initiates block fetches
• Next-block prediction
  – Generates speculative block addresses
• Block completion detection and commit initiation
• Flush detection and initiation
  – Branch mispredictions, ld/st dependence violations, exceptions
• Exception reporting


Networks Used by G-Tile

• OPN – Operand Network
  – Receive and send branch addresses from/to E-tiles
• OCN – On-Chip Network
  – Receive block headers from the adjacent I-tile
• GDN – Global Dispatch Network
  – Send inst. fetch commands to I-tiles and read/write instructions to R-tiles
• GCN – Global Control Network
  – Send commit and flush commands to D-tiles, R-tiles, and E-tiles
• GRN – Global Refill Network
  – Send I-cache refill commands to I-tile slave cache banks
• GSN – Global Status Network
  – Receive refill, store, and register-write completion status and commit acknowledgements from I-tiles, D-tiles, and R-tiles
• ESN – External Store Network
  – Receive external store pending and error status from D-tiles

Global Dispatch Network (GDN)

[Figure: GDN paths from the G-tile through the I-tiles to the R-tiles and E-tiles]

Interface ports:
• CMD: None/Fetch/Dispatch
• ADDR: Address of block
• SMASK: Store mask of block
• FLAGS: Block execution flags
• ID: Frame identifier
• INST: Instruction bits

• Delivers instruction fetch commands to I-tiles
• Dispatches instructions from I-tiles to R-tiles/E-tiles


Global Refill Network (GRN)

[Figure: GRN path from the G-tile down the I-tile column]

Interface ports:
• CMD: None/Refill
• RID: Refill buffer identifier
• ADDR: Address of block to refill

• Delivers refill commands to I-tiles

Global Control Network (GCN)

[Figure: GCN paths from the G-tile to the R-tiles, D-tiles, and E-tiles]

Interface ports:
• CMD: None/Commit/Flush
• MASK: Mask of frames for commit/flush

• Delivers commit and flush commands to R-tiles, D-tiles, and E-tiles


Global Status Network (GSN)

[Figure: GSN paths from the I-tiles, R-tiles, and D-tiles back to the G-tile]

Interface ports:
• CMD: None/Completed/Committed
• ID: Frame/refill buffer identifier
• STATUS: Exception status

• I-tiles: completion status for pending refills
• R-tiles: completion of writes, acknowledgement of register commits
• D-tiles: completion status for stores, acknowledgement of store commits

External Store Network (ESN)

[Figure: ESN path from the D-tiles to the G-tile]

Interface ports:
• PENDING: Pending status of external store requests and spills in each thread
• ERROR: Errors encountered for external store requests and spills in each thread

• Delivers external store pending and error status from D-tiles


Overview of Block Execution

[State diagram: the different states in the execution of a block: Ready in I-cache → Ready for fetch → Start fetch → Completed fetch (dispatch) → Execute → Commit → Committing → Deallocated, with a Flush → Flushed path for mispredicted blocks; transitions are driven by fetch completions, register/store completions, and commit acknowledgements]

Major G-Tile Structures

[Figure: refill unit (I-cache MSHRs; GDN/GRN), fetch unit (ITLB, I-cache directory, control registers; GDN, OCN), exit predictor (OPN), and retire unit (commit/flush control; GCN, GSN, ESN)]


Frame Management

• Frames are allocated from a circular queue
• The non-speculative block is the oldest; younger blocks are speculative (control speculation)
• After a new fetch, the newly allocated frame becomes the youngest

[Figure: 8-entry frame queue (frames 0-7) with oldest and youngest pointers; the youngest pointer advances after a new fetch]
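A minimal C sketch of the circular frame queue (hypothetical; the real G-tile tracks much more per-frame state):

    #define NUM_FRAMES 8

    typedef struct {
        unsigned oldest;   /* frame of the non-speculative block  */
        unsigned count;    /* blocks in flight (rest speculative) */
    } FrameQueue;

    /* A fetch allocates the next frame; it becomes the youngest. */
    static int alloc_frame(FrameQueue *q)
    {
        if (q->count == NUM_FRAMES)
            return -1;   /* no free frame: fetch stalls */
        return (int)((q->oldest + q->count++) % NUM_FRAMES);
    }

    /* A commit deallocates the oldest frame. */
    static void commit_frame(FrameQueue *q)
    {
        q->oldest = (q->oldest + 1) % NUM_FRAMES;
        q->count--;
    }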


Frame Management (continued)

• After a commit, the oldest frame is deallocated and the oldest pointer advances

[Figure: the same frame queue after a commit]

Fetch Unit

• Instruction TLB
  – 16 entries
  – 64KB - 1TB supported page sizes
• I-cache directory
  – 128 entries; each entry corresponds to one cached block
  – Entry fields:
    • Physical tag for the block
    • Store mask for the block
    • Execution flags for the block
  – 2-way set associative
  – LRU replacement
  – Virtually indexed, physically tagged


Fetch Pipeline

[Pipeline: block A: predict stages 0-2 (cycles 0-2); address select, TLB lookup, and I-cache tag lookup (cycle 3); hit/miss detection and frame allocation (cycle 4); fetch slots 0-7 (cycles 5-12). Block B: predict stages 0-2 (cycles 3-5), then stalls at address select until block A's fetch slots drain before proceeding through the same control and fetch cycles. Legend: predict cycle / control cycle / fetch cycle]

Refill Pipeline

[Pipeline: predict stages 0-2, address select, TLB and I-cache lookup, cache-miss detection, then refill cycles until the refill completes at some cycle X, after which the stalled fetch proceeds through its slots]

• Refills stall fetches for the same thread
• Support for up to four outstanding refills – one per thread


Per-Frame State

Frame status:
• V: Valid
• O: Oldest
• Y: Youngest
• M: Branch misprediction detected

Block execution status:
• RC: Registers completed
• SC: Stores completed
• BC: Branch completed
• E: Exception reported
• L: Load violation reported

Block commit status:
• CS: Commit sent
• RCT: Registers committed
• SCT: Stores committed

Branch prediction status:
• BADDR: Block address
• NADDR: Next block address
• NSTATE: Next block address state (predicted or resolved)
• PC: Prediction completed
• RC: Repair completed
• UC: Update completed

Commit ⇐ V & O & RC & SC & BC & ~E & ~L & ~CS
Dealloc ⇐ V & O & CS & RCT & SCT & UC

Block Completion Detection

• A block has completed execution if:
  – All register writes have been received
  – All stores have been received
  – The branch target has been received
• Register completion is tracked locally at each R-tile
  – Expected writes at each R-tile are known statically
  – Completion status is forwarded from the farthest R-tile to the G-tile on the GSN
• Store completion is reported by D-tile 0
  – All D-tiles communicate with each other to determine completion
  – Completion status is delivered on the GSN by D-tile 0
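The commit and deallocation conditions render directly into C (a sketch over a hypothetical per-frame status struct):

    #include <stdbool.h>

    typedef struct {
        bool v, o;              /* valid, oldest                      */
        bool rc, sc, bc;        /* registers/stores/branch completed  */
        bool e, l;              /* exception, load violation reported */
        bool cs, rct, sct, uc;  /* commit sent, registers/stores
                                   committed, predictor update done   */
    } FrameState;

    static bool can_commit(const FrameState *f)
    {
        return f->v && f->o && f->rc && f->sc && f->bc &&
               !f->e && !f->l && !f->cs;
    }

    static bool can_dealloc(const FrameState *f)
    {
        return f->v && f->o && f->cs && f->rct && f->sct && f->uc;
    }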


Commit Pipeline

[Pipeline: completion messages received (GSN/OPN) → commit detect (cycle 0) → commit send on the GCN and predictor update (cycles 1-2) → ... → commit acks received on the GSN (cycle X) → frame deallocation (cycle X+1)]

• Block is ready for commit if:
  – It is the oldest in its thread
  – It has completed with no exceptions
  – No load/store dependence violations have been detected
• Commit operation
  – G-tile sends a commit message for the block on the GCN
  – R-tiles and D-tiles perform local commit operations
  – R-tiles and D-tiles acknowledge the commit using the GSN
  – G-tile updates predictor history for the block

T-Morph Support

• Up to four threads supported in T-morph mode
  – Threads mapped to fixed sets of frames
• Per-thread architected registers in the G-tile
  – Thread Control and Status Registers (TCR, TSR)
  – Program counter (PC)
• Other T-morph support state/logic
  – Per-thread frame management
  – Round-robin thread prioritization for block fetch/commit/flush

[Figure: frame allocation partitioned among threads 0-3, each with its own TCR, TSR, PC, and registers GR0-GR127]


Exception Reporting

• Exceptions reported to the G-tile
  – GSN: refill exceptions; execute exceptions delivered to stores and register outputs
  – OPN: execute exceptions delivered to branch targets
  – ESN: errors in external stores
• Exceptions detected at the G-tile: breakpoint, timeout, system call, reset, fetch, interrupt
• Exceptions detected at other tiles: execute, store, fetch
• G-tile halts the processor whenever an exception occurs
  – Speculative blocks in all threads are flushed
  – Oldest blocks in all threads are completed and committed
  – All outstanding refills, stores, and spills are completed
  – Signals the EBC to cause an external interrupt
• Exception type reported in the PSR and TSRs
  – Multiple exceptions may be reported

Control Flow Prediction

ISCA-32 Tutorial
Nitya Ranganathan
Department of Computer Sciences, The University of Texas at Austin


Predictor Responsibilities

• Predict the next block address for fetch
  – Predict the block exit and exit branch type (branches are not seen by the predictor)
  – Predict the target of the exit based on the predicted branch type
  – Predictions are amortized over each large block
    • Maximum necessary prediction rate is one every 8 cycles
• Speculative update for most-recent histories
  – Keep safe copies around to recover from mispredictions
• Repair predictor state upon a misprediction
• Update the predictor upon block commit
  – Update exits, targets, and branch types with retired block state

Example Hyperblock from Linked List Code

• Each hyperblock may have several predicated exit branches
  – Exit e0 ⇒ block H2
  – Exit e1 ⇒ block H1
  – Only one branch will fire
• Predictor uses exit IDs to make predictions
  – 8 exit IDs supported
  – e0 ⇒ "000"
  – e1 ⇒ "001"
• Predictor predicts the ID of the exit that will fire
  – Implicitly predicts several predicates in a path
• Exit history formed by concatenating the sequence of exit IDs

[Figure: hyperblock H1 (p == 0 test, p->tag == key test, p = p->next) with bro H2 at exit e0 and bro H1 at exit e1, followed by hyperblock H2]


High-level Predictor Design

• Index the exit predictor with the current block address
• Use exit history to predict 1 of 8 exits
• Predict the branch type (call/return/branch) for the chosen exit
• Index multiple target predictors with the block address and predicted exit
• Choose the target address based on the branch type

[Block diagram: block address → exit predictor → predicted exit → target predictor → predicted next-block address]

Predictor Components

Exit predictor (37 Kbits):
• 2-level local predictor (9 K)
• 2-level global predictor (16 K)
• 2-level choice predictor (6 K)

Target predictor (45 Kbits):
• Sequential predictor
• Branch target buffer (20 K)
• Call target buffer (12 K)
• Branch type predictor (12 K)
• Return address stack (7 K)


Prediction using Exit History (linked list example)

• Dynamic block sequence: H1 H1 H1 H2 H1 H1 H1 H2
• Local exit history for block H1: e1 e1 e1 e0 e1 e1 e1 e0
• Local exit history encoding (2 bits per exit): 01 01 01 00 01 01 01 00
  – Use only 2 bits from each 3-bit exit ID to permit longer histories
• A history table entry holds 5 exit IDs
• The prediction table contains exit IDs and hysteresis bits

[Table: the exit history register indexes the prediction table; the indexed entry supplies the predicted exit ID ("001" or "000"), the history register is updated with each predicted exit, and the entry's hysteresis is updated with the actual exit]

Prediction Timing

Cycle 1:
• Local: read local history
• Global: read global exit predictor
• Choice: read choice predictor
• Btype: read branch types for all exits
• BTB: read branch targets for all exits
• CTB: read call and return targets for all exits
• RAS: read return address from the stack

Cycle 2:
• Local: read local exit predictor
• Global/Choice/Btype/BTB/CTB/RAS: reads continue

Cycle 3:
• Choose the final exit
• Get the chosen exit's final type and target
• Push the return address (for calls)
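A minimal C sketch of the 2-bit exit-history encoding described above (hypothetical helper; five 2-bit exit IDs fit in a 10-bit history register):

    #include <stdint.h>

    /* Shift a new exit ID into the history register, keeping only the
       low 2 bits of each 3-bit exit ID. */
    static uint16_t update_exit_history(uint16_t hist, uint8_t exit_id)
    {
        return (uint16_t)(((hist << 2) | (exit_id & 0x3u)) & 0x3FFu);
    }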


More Predictor Features

• Speculative state buffering
  – Latest global histories kept in the global and choice history registers
  – Old snapshots of histories stored in a history file for repair
  – Latest local history maintained in a future file
    • The local exit history table holds committed block history
  – Return address stack state buffering
    • Save the top-of-stack pointer and value before every prediction
    • Repair the stack on address and branch-type mispredictions
• Call-return
  – Call/return instructions enforce proper control flow
  – The predictor learns to predict return addresses using a call-return link stack
  – The update stores the return address in the CTB entry corresponding to the call

Learning Call-Return Relationships

The Call Target Buffer (CTB) holds a call target address and a return address per entry; the Return Address Stack (RAS) consists of an address stack and a link stack.

When a call commits:
• Update the call target address in the CTB
• Push the CTB index onto the link stack

When a return commits:
• Pop the CTB index from the link stack
• Update the return address in that CTB entry

During prediction:
• Call ⇒ push the return address from the CTB onto the address stack
• Return ⇒ pop an address from the address stack
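A minimal C sketch of the CTB/link-stack protocol above (hypothetical sizes and names; overflow handling is simplified):

    #include <stdint.h>

    #define CTB_SIZE  64
    #define RAS_DEPTH 16

    typedef struct { uint64_t call_target, return_addr; } CtbEntry;

    static CtbEntry ctb[CTB_SIZE];
    static uint64_t addr_stack[RAS_DEPTH];  /* predicted return addrs */
    static unsigned link_stack[RAS_DEPTH];  /* CTB indices            */
    static unsigned addr_tos, link_tos;

    /* Commit path: learn the call-return relationship. */
    static void commit_call(unsigned ctb_idx, uint64_t target)
    {
        ctb[ctb_idx].call_target = target;
        if (link_tos < RAS_DEPTH)
            link_stack[link_tos++] = ctb_idx;
    }

    static void commit_return(uint64_t ret_addr)
    {
        if (link_tos > 0)
            ctb[link_stack[--link_tos]].return_addr = ret_addr;
    }

    /* Predict path: use the learned return address. */
    static void predict_call(unsigned ctb_idx)
    {
        if (addr_tos < RAS_DEPTH)
            addr_stack[addr_tos++] = ctb[ctb_idx].return_addr;
    }

    static uint64_t predict_return(void)
    {
        return addr_tos ? addr_stack[--addr_tos] : 0;
    }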


Prediction and Speculative Update Timing

Same three-cycle flow as before, with speculative updates added in cycle 3:
• Local: update the local future file history
• Global: back up the global history, then update the speculative global history
• Choice: back up the choice history, then update the speculative choice history
• RAS: back up the top of stack, then push the return address (for calls)

Optimizations for Better Prediction

• Exit prediction is more difficult than branch prediction
  – 1 of 8 vs. 1 of 2
• Fewer exits in a block ⇒ better predictability
  – Predicating control flow merges
  – Exit merging

[Figure: a CFG (B1-B8) whose control flow merges (at B8) are predicated so that hyperblocks H1/H2/H3 contain fewer exits, and duplicate exits (two bro h2 branches) are merged into one]


Instruction Fetch

ISCA-32 Tutorial
Haiming Liu
Department of Computer Sciences, The University of Texas at Austin

I-Tile Location and Connections

[Figure: the I-tile connects to the G-tile (GDN, GRN, GSN), to the G-tile/N-tile and D-tile via the OCN, and to the next I-tile in the column (GDN, GRN, GSN)]


I-Tile Major Responsibilities

• Instruction fetch
  – Cache instructions for one row of R-tiles/E-tiles
  – Dispatch instructions to R-tiles/E-tiles through the GDN
• Instruction refill
  – Receive refill commands from the G-tile on the GRN
  – Submit memory read requests to the N-tile on the OCN
  – Keep track of the completion status of each ongoing refill
  – Report completion status to the G-tile on the GSN when a refill completes
• OCN access control
  – Share the N-tile connection between the I-tile and D-tile/G-tile
  – Forward OCN transactions between GT/NT or DT/NT

Major I-Tile Structures

[Figure: refill buffer (128 bits x 32; 1 read port, 1 write port), I-$ array (128 bits x 1024 SRAM; 1 R/W port), control logic, an output mux, and OCN control, connected to the GRN, GDN, GSN, and OCN]


Instruction Layout in a Maximal Block

• Header chunk (128 bytes): word i holds header bits Hi plus read instruction i and write instruction i, for i = 0-31
• Instruction chunks 0-3 (128 bytes each): each holds 32 instructions
• Within a chunk, each 128-bit SRAM row holds four 4-byte instructions spread across E-tile columns 0-3 (e.g., one row of chunk 3 holds insts. 96-99, its last row insts. 124-127)

Instruction Fetch Pipeline

[Pipeline: cycle 0: address select from the G-tile; cycle 1: ITLB lookup and I-$ tag lookup; cycle 2: hit/miss detection and frame allocation; cycles 3-10: the header chunk (R/W slots 0-7) and instruction chunks stream out one fetch slot per cycle, overlapped with GDN dispatch from IT0, IT1, and so on down the column]


Instruction Dispatch Timing Diagram

[Timing: a fetch issued in cycle x arrives at IT0, which forwards the fetch command down the I-tile column one cycle per hop (cycles x+1 through x+4). Each I-tile dispatches instruction 0 across its row one tile per cycle: GT and RT0-RT3 receive instruction 0 in cycles x+2 through x+6, DT0 and ET0-ET3 in cycles x+3 through x+7, and DT3 and ET12-ET15 in cycles x+6 through x+10. Instruction 1 follows two cycles behind instruction 0 at every tile]


Instruction Dispatch Timing Diagram Instruction Refill

• Receive refill id and block physical address from GRN input IT0 GT RT0 RT1 RT2 RT3 Receives Receives Receives Receives • Request 128 bytes of inst. from inst 7 inst 7 inst 7 inst 7 – Send out 2 OCN read requests cycle x+10 cycle x+11 cycle x+12 cycle x+13 • Buffer received inst. in refill buffer • Report refill completion on GSN output when IT0 DT0 ET0 ET1 ET2 ET3 Receives Receives Receives Receives – 128 bytes of inst. have been received inst 7 inst 7 inst 7 inst 7 – Completion has been received on GSN input cycle x+11 cycle x+12 cycle x+13 cycle x+14 • Write inst. into SRAM when dispatched from refill buffer

(Figure: I-tile refill datapath. GRN, GDN, and GSN interfaces feed control logic and a refill buffer (128 bits x 32, one read port, one write port) in front of the I-$ array (128 bits x 1024 SRAM, one R/W port), with OCN control handling the outgoing read requests.)
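As a rough sketch of the completion rule above (illustrative C, not the RTL; struct and names are ours):

    #include <stdbool.h>
    #include <stdint.h>

    /* Hypothetical model of one I-tile refill in flight. */
    typedef struct {
        uint32_t bytes_received;   /* instruction bytes buffered so far  */
        bool     gsn_complete_in;  /* completion seen on the GSN input   */
        bool     completion_sent;  /* already reported on the GSN output */
    } refill_state_t;

    /* Report completion on the GSN output only when both conditions
     * from the slide hold: all 128 bytes are buffered and the upstream
     * completion has been received on the GSN input. */
    static bool refill_may_report(const refill_state_t *r)
    {
        return !r->completion_sent
            && r->bytes_received == 128
            && r->gsn_complete_in;
    }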


R-Tile and OPN Location and Connections Register Accesses and Operand Routing

(Figure: R-tile location and connections. Each R-tile links to the G-tile and its neighbor R-tiles over the GDN, GCN, and GSN, and its OPN router connects it to adjacent R-tile and E-tile OPN routers.)

ISCA-32 Tutorial

Karu Sankaralingam

Department of Computer Sciences
The University of Texas at Austin


Major R-Tile Responsibilities Major Structures in the R-tile

• Maintain architecture state, commit register writes
 – 512 registers total
 – 128 registers per thread, interleaved across 4 tiles
 – 32 registers/thread * 4 threads = 128 registers per tile
• Process read/write instructions for each block
 – 64-entry Read Queue (RQ): 8 blocks, 8 reads in each
 – 64-entry Write Queue (WQ): 8 blocks, 8 writes in each
• Forward inter-block register values
• Detect per-block and per-tile register write completion
• Detect exceptions that flow to register writes

(Figure: Major structures in the R-tile. An OPN router (W/E/S ports), decode logic, the Read Queue, the Write Queue, commit logic, and the committed register file, connected via the GSN, GCN, and GDN.)
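A minimal sketch of the interleaving implied above (32 registers per thread per tile, and, per the ISA appendix, bank i holding GRs i, i+4, i+8, ...); function names are ours:

    /* Illustrative mapping of an architectural register to its R-tile
     * (bank) and the entry within that bank; not the hardware's names. */
    static inline unsigned reg_bank(unsigned reg)
    {
        return reg % 4;    /* R-tile / bank index, 0..3 */
    }

    static inline unsigned reg_slot(unsigned reg)
    {
        return reg / 4;    /* entry within the bank, 0..31 */
    }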


Pipeline diagrams: Register Read and Write Register Forwarding: RQ and WQ contents

(Figure: Register read and write pipelines. Read: dispatch from the GDN with a WQ inject search (cycle 0), read wakeup, read issue, and delivery on the OPN. Write: write forward/issue with an OPN update and RQ inject search, Write Queue wakeup on GCN select at block commit, then write commit into the architectural register file.)

(Figure: RQ and WQ contents for register forwarding, 8 entries per block. Read Queue entry fields, with widths in bits: Valid (1), Status (3), Reg # (5), Target 1 (8), Target 2 (8), WQ wait (1), WQ pointer (6). Write Queue entry fields: Valid (1), Status (8), Reg # (5), Value (64).)


(Figure: Register forwarding example with two blocks. B0 is: .bbegin B0; R[0] read G[4] -> N[0]; N[0] addi 45 -> W[0]; W[0] write G[16]; .bend. B1 is: .bbegin B1; R[0] read G[16] -> N[0]; R[4] read G[4] -> N[1], N[3]; N[0] addi 45 -> N[2,0]; N[1] addi 90 -> N[2,1]; N[2] add -> N[3,1]; N[3] add -> W[0]; W[0] write G[8]; .bend. At dispatch, B0's R[0] searches the WQ, misses, and reads G[4] from the committed register file. B1's R[4] likewise reads G[4] from the committed register file, but B1's R[0] hits B0's pending W[0] in the WQ and waits. When B0's W[0] receives the value 45, it wakes R[0] in B1 and forwards 45 to B1's N[0]; B1's N[0]-N[3] then execute and its W[0] receives its value at the WQ.)

(Figure: Block completion at the commit unit. Each of RT0-RT3 tracks a per-block write-completion bit vector; bits fill in cycle by cycle as writes arrive until all writes are received, the right-neighbor completion arrives, and the completion signal is sent on the GSN.)


Major OPN Responsibilities and Attributes OPN Packet Format

• Wormhole-routed network connecting 25 tiles
• Fixed packet length, 2 flits: one control packet followed by one data packet in the next cycle
• Dedicated wires for control/data
• Control packet
 – Contains routing information
 – Enables early wakeup and fast bypass in the ET
• 4-entry FIFO flit buffers in each direction
• On-off flow control: 1-bit hold signal
• Dimension-order routing (Y-X)

OPN packet format (field widths in bits):
 – Control packet: valid (1), type (4), frame id (3), dst. node (6), dst. index (5), src. node (6), src. index (5); node fields are XXX.YYY tile coordinates
 – Control packet types: 0 generic transfer or reply, 1 load, 2 store, 4 PC read, 5 PC write, 14 special register read, 15 special register write
 – Data packet: valid (1), type (2: normal, null, exception), value (64), address (40), operation (3: lb/lh/lw/ld, sb/sh/sw/sd)
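The field widths above can be summarized in an illustrative C layout (a real flit is a raw bit vector; the structs are only a reading aid):

    #include <stdint.h>

    /* Control flit: 30 bits of payload, per the slide's field widths. */
    typedef struct {
        unsigned valid     : 1;
        unsigned type      : 4;   /* 0 generic, 1 load, 2 store, 4 PC read, ... */
        unsigned frame_id  : 3;
        unsigned dst_node  : 6;   /* XXX.YYY tile coordinate */
        unsigned dst_index : 5;
        unsigned src_node  : 6;
        unsigned src_index : 5;
    } opn_ctrl_pkt_t;

    /* Data flit; wide fields kept as plain integers for portability. */
    typedef struct {
        uint8_t  valid;        /* 1 bit                              */
        uint8_t  type;         /* 2 bits: normal, null, exception    */
        uint64_t value;        /* 64-bit operand                     */
        uint64_t address;      /* low 40 bits used                   */
        uint8_t  operation;    /* 3 bits: lb/lh/lw/ld, sb/sh/sw/sd   */
    } opn_data_pkt_t;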


OPN Router Schematic Operand Network Properties

• Directly integrated into the processor pipeline
 – Hold signal stalls tile pipelines (back-pressure flow control)
 – FIFOs flushed with the processor flush wave
• Deadlock avoidance
 – On-off flow control
 – Dimension-order routing
 – Guaranteed buffer location for every packet
• Tailored for operands
 – [Taylor et al. 2003, Sankaralingam et al. 2003]

(Figure: OPN router schematic. Five 140-bit input ports (North, South, East, West, Local), each output driven by an arbiter and a 4-1 mux.)
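A minimal sketch of the Y-X dimension-order decision (port names and the Y-axis orientation are assumptions; only the route-Y-then-X order comes from the slide):

    /* One routing step: travel in Y until the row matches, then in X. */
    typedef enum { PORT_LOCAL, PORT_NORTH, PORT_SOUTH, PORT_EAST, PORT_WEST } port_t;

    static port_t opn_route(int cur_x, int cur_y, int dst_x, int dst_y)
    {
        if (dst_y < cur_y) return PORT_NORTH;   /* assumes y grows southward */
        if (dst_y > cur_y) return PORT_SOUTH;
        if (dst_x < cur_x) return PORT_WEST;
        if (dst_x > cur_x) return PORT_EAST;
        return PORT_LOCAL;                      /* arrived: eject to the tile */
    }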



E-Tile Location and Connections Issue Logic and Execution

(Figure: E-tile connections. Each E-tile links to neighboring E/R-tiles and E/D-tiles over the OPN, and receives GDN and GCN traffic through a D-tile or a neighboring E-tile.)

ISCA-32 Tutorial

Premkishore Shivakumar

Department of Computer Sciences
The University of Texas at Austin


Networks Used by E-Tile Major E-Tile Responsibilities

Networks used by the E-tile:
• OPN - Operand Network
 – Transmit register values from/to R-tiles
 – Transmit operands from/to other E-tiles
 – Transmit load/store packets from/to D-tiles
• GDN - Global Dispatch Network
 – Receive instructions from the I-tile, and dispatch commands from the G-tile, through a D-tile or a neighboring E-tile
• GCN - Global Control Network
 – Receive commit and flush commands originating at the G-tile, from a D-tile or a neighboring E-tile

Major E-tile responsibilities:
• Buffer instructions and operands
 – Local reservation stations for instruction and operand management
• Instruction wakeup and select
• Operand bypass
 – Local operand bypass to enable back-to-back execution of local dependent instructions
 – Remote operand bypass with one extra cycle of OPN forwarding latency per hop
• Instruction target delivery
 – Instructions with multiple local and/or remote targets
 – Successive local or remote targets in consecutive cycles
 – Concurrently deliver a remote and a local target
• Predication support
 – Predicate-based conditional execution
 – Hardware predicate OR-ing
• Exception and NULL
 – Exception generation: divide-by-zero exception, NULL instruction
 – Exception and null tokens propagated in dataflow manner to register writes and stores
• Flush and commit
 – Clear block state (pipeline registers, status buffer, etc.)


E-Tile High Level Block Diagram Functional Unit Latencies in the E-Tile

(Figure: E-tile high-level block diagram. The GDN feeds an instruction buffer of 64 entries (8 frames, 8 slots per frame; fields V, INST, OPA, OPB, PR, FUTYPE); an operand buffer holds OP1/OP2; a status buffer tracks instruction state; and the execution units (INT ALU, INT MULT, INT DIV, FP ADD/SUB, FP MULT, FP CMP, FP CVT) sit between the OPN routers.)

Functional unit latencies in the E-tile:

 Functional Unit  | Latency          | Implementation
 Integer ALU      | 1                | Synopsys IP
 Integer Multiplier | 3 (pipelined)  | Synopsys IP
 Integer Divider  | 24 (iterative)   | Synopsys IP
 FP Add / Sub     | 4 (pipelined)    | Synopsys IP
 FP Multiplier    | 4 (pipelined)    | Synopsys IP
 FP Comparator    | 2 (pipelined)    | Synopsys IP
 FP DtoI Convert  | 2 (pipelined)    | Synopsys IP
 FP ItoD Convert  | 3 (pipelined)    | Synopsys IP
 FP StoD Convert  | 2 (pipelined)    | UT TRIPS IP
 FP DtoS Convert  | 3 (pipelined)    | UT TRIPS IP


Issue Prioritization Among Ready Instructions Issue Prioritization Among Ready Instructions

(Figure: Reservation stations at one E-tile. D-morph: frames hold blocks B0 (non-speculative) through B7 (speculative), with older-block-first priority, e.g. B1 before B2. T-morph: frames are partitioned among threads T0-T3, with smaller-instruction-slot-first within a block, e.g. B3.I0 before B3.I2.)

• Instructions can execute out of order
• Issue prioritization among ready instructions
• The scheduler assigns slots for block instructions
• Slot-based instruction priority within a block

Three-level select priority:
 – Round-robin priority across threads
 – Older-block-first priority within a thread
 – Smaller-instruction-slot-first within a block
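The two lower priority levels can be written as a comparator over ready instructions of one thread; this is an illustrative C sketch (types and the block-age encoding are ours, not the hardware's):

    /* Returns nonzero if candidate a should issue before candidate b.
     * block_age counts blocks since dispatch, so larger means older. */
    typedef struct {
        unsigned block_age;  /* larger = dispatched earlier */
        unsigned slot;       /* 0..127 within the block     */
    } ready_inst_t;

    static int issues_first(const ready_inst_t *a, const ready_inst_t *b)
    {
        if (a->block_age != b->block_age)
            return a->block_age > b->block_age;  /* older block first  */
        return a->slot < b->slot;                /* then smaller slot  */
    }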


Two Phase Instruction Selection Basic Operand Local & Remote Bypass

Two-phase instruction selection:
• Cycle 1, definite selection: a priority encoder picks a definitely ready instruction from the status buffer
• Cycle 2, late selection: an instruction woken by a local or remote (OPN) bypass control-packet index competes in final selection
• The priority encoder uses the issue prioritization policy
• Wakeup supports back-to-back operand delivery, and two bypassed operands in the same cycle

(Figure: Sample dependence graph mov -> sub -> add, physically placed with mov and sub on ET0 (local bypass) and add on ET1 (OPN remote bypass).)
• Cycle 1: ET0 executes mov and sends the mov local-bypass control packet
• Cycle 2: ET0 receives the mov local-bypass data packet, executes sub (wakeup, select), and sends the sub OPN control packet from ET0 to ET1
• Cycle 3: ET0 sends the sub OPN data packet; ET1 wakes and selects add
• Cycle 4: ET1 receives the sub OPN data packet and executes add
• Local bypass is back-to-back; remote bypass costs one extra OPN cycle


Bypass: Sources of Pipeline Bubbles Target Delivery: Multiple Local Targets

Sources of pipeline bubbles in bypassing:
• OPN stall due to network congestion
 – (Figure: in the mov -> sub -> add example, congestion delays the sub OPN data packet until cycle X, and add executes at cycle X+1.)
• Bypassed operand data is a non-enabling predicate
 – Detected in the execute stage after receiving the bypassed data, so the issue slot is wasted
 – A similar pipeline bubble arises from an invalid bypassed OPN data packet
 – (Figure: teq -> muli on ET0; teq executes in cycle 1 and locally bypasses its result, and in cycle 2 muli finds the data is a non-enabling predicate, wasting its issue slot.)

Target delivery, multiple local targets:
(Figure: mov targeting both sub and add locally on ET0.)
• Cycle 1: execute mov; send the first mov local-bypass control packet
• Cycle 2: receive the first mov local-bypass data packet; execute sub; send the second mov local-bypass control packet
• Cycle 3: receive the second mov local-bypass data packet; execute add
• Successive local targets are delivered in consecutive cycles
• Successive remote targets are also delivered in consecutive cycles (plus the extra OPN forward cycle)


Concurrent Local & Remote Target Delivery Primary Memory System

(Figure: mov targeting sub locally on ET0 and add remotely on ET1.)
• Cycle 1: execute mov; send the mov local-bypass control packet and the mov OPN control packet from ET0 to ET1
• Cycle 2: receive the mov local-bypass data packet; execute sub; send the mov OPN data packet from ET0 to ET1; ET1 wakes and selects add
• Cycle 3: ET1 receives the mov OPN data packet and executes add
• An instruction's local and remote targets can be sent concurrently
• mov3 and mov4 instructions have 3 and 4 targets respectively, used to fan out operands to multiple children

Cycles to emit targets (L = local, R = remote): LL 2, LR 1, RR 2, LLL 3, LLR 2, LRR 2, RRR 3, LLLL 4, LLLR 3, LLRR 2, LRRR 3, RRRR 4. In other words, since one local and one remote target can be delivered each cycle, the cost is the maximum of the local count and the remote count.

ISCA-32 Tutorial

Simha Sethumadhavan

Department of Computer Sciences
The University of Texas at Austin


D-Tile Location and Connections Major D-Tile Responsibilities

(Figure: D-tile connections. Each D-tile links to the G-tile over the GCN, GSN, and ESN, to the I-tile over the OCN, to its E-tile row over the GDN, GCN, and OPN, and to neighboring D-tiles over the DSN and ESN.)

• Provide D-cache access for arriving loads and stores
• Perform address translation with the DTLB
• Handle cache misses with MSHRs
• Perform dynamic memory disambiguation with load/store queues
• Perform dependence prediction for aggressive load/store issue
• Detect per-block store completion
• Write stores to caches/memory upon commit
• Store merging on L1 cache misses


Primary Memory Latencies Major Structures and Sizes in the D-Tile

Primary memory latencies:
• The data cache is interleaved on a cache-line (64-byte) basis
• Instruction placement affects load latency
 – Best-case load-to-use latency: 5 cycles
 – Worst-case load-to-use latency: 17 cycles
 – Loads are favored in the E-tile column adjacent to the D-tiles
 – Deciding row placement is difficult
• Memory access latency has three components
 – (a) From the E-tile (load) to the D-tile
 – (b) The D-tile access (this talk)
 – (c) From the D-tile back to the E-tile (load target)

Major structures and sizes in the D-tile:
• Cache subunit: 8KB, 2-way, 1R and 1W port; 64-byte lines; LRU; virtually indexed, physically tagged; write-back, write-no-allocate
• DTLB subunit: 16 entries, 64KB to 1TB pages
• Dependence predictor subunit: 1024 bits, PC-based hash function
• LSQ subunit: 256 entries, 32 loads/stores per block, single ported
• Miss handling subunit: 16 outstanding load misses to 4 cache lines; support for merging stores up to one full cache line


Load Execution Scenarios Load Pipeline

Load execution scenarios:

 TLB  | Dep. Pr. | LSQ  | Cache | Response
 Miss | -        | -    | -     | Report TLB exception
 Hit  | Hit      | -    | -     | Defer the load until all prior stores are received (non-deterministic latency)
 Hit  | Miss     | Miss | Hit   | Forward data from the L1 cache
 Hit  | Miss     | Miss | Miss  | Forward data from the L2 cache (or memory); issue a cache fill request (if cacheable)
 Hit  | Miss     | Hit  | Hit   | Forward data from the LSQ
 Hit  | Miss     | Hit  | Miss  | Forward data from the LSQ; issue a cache fill request (if cacheable)

Load pipeline:
• Common case: 2-cycle load hit latency
 – Cycle 1: access the cache, TLB, DP, and LSQ; cycle 2: load reply
• For LSQ hits, load latency varies between 4 and n+3 cycles, where n is the size of the load in bytes
 – Cycle 1: access the cache, TLB, DP, and LSQ; cycle 2: identify the store in the LSQ (stall pipe); cycle 3: read the store data (stall pipe); cycle 4: load reply
 – Stalls all pipelines except the forwarding stages (cycles 2 and 3)
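The scenario table reduces to a small decision function; this C sketch uses our own enum names and folds the two LSQ-hit rows together:

    #include <stdbool.h>

    typedef enum {
        LOAD_TLB_EXCEPTION,   /* report TLB exception                 */
        LOAD_DEFER,           /* DP hit: wait for prior stores        */
        LOAD_FROM_LSQ,        /* forward from the LSQ                 */
        LOAD_FROM_CACHE,      /* forward from the L1                  */
        LOAD_FILL_REQUEST,    /* miss everywhere: fill over the OCN   */
    } load_action_t;

    static load_action_t classify_load(bool tlb_hit, bool dp_hit,
                                       bool lsq_hit, bool cache_hit)
    {
        if (!tlb_hit) return LOAD_TLB_EXCEPTION;
        if (dp_hit)   return LOAD_DEFER;
        if (lsq_hit)  return LOAD_FROM_LSQ;  /* may also trigger a fill
                                                on a cache miss */
        return cache_hit ? LOAD_FROM_CACHE : LOAD_FILL_REQUEST;
    }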


Dependence Prediction Deferred Load Pipeline

Dependence prediction:
• Prediction is done at the memory side (not at issue)
 – A new invention, necessitated by distributed execution
• Properties
 – PC-based, updated on violation
 – Only loads access the predictor; loads predicted dependent are deferred
 – Deferred loads are woken up when all previous stores have arrived
 – Reset periodically (flash clear) based on the number of blocks committed
 – Dependence speculation is configurable: Override, Serial, Regular
(Figure: Predictor structure. The load address/PC is hashed, combined with the frame ID, into a 1024 x 1-bit prediction table; violations recorded per block (an 8-entry array) set bits, and Override/Serialize modes and a periodic Reset qualify the prediction.)

Deferred load pipeline:
• Cycle 1: access the cache, TLB, DP, and LSQ; cycle 2: mark the load as waiting (stall pipe); once all prior stores have arrived (cycle X), prepare the load for replay (cycle X+1) and re-execute it as if it were a new load (cycle X+2); this is the deferred load processing, a.k.a. the replay pipe
• Deferred loads are woken up when all prior stores have arrived at the D-tiles
 – Loads between two consecutive stores can be woken up out of order
• Deferred loads are re-injected into the main pipeline as if they were new loads
 – During this phase, loads get their value from dependent stores, if any
 – A deferred load cannot be re-injected in cycle X+2 when it could not be prepared in cycle X+1 because of resource conflicts in the LSQ
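A minimal model of the predictor as described; the hash function, table-access details, and flash-clear interval below are placeholders, not the prototype's values:

    #include <stdbool.h>
    #include <stdint.h>

    #define DP_BITS 1024

    static uint8_t  dp_table[DP_BITS];   /* one bit per entry, modeled as bytes */
    static unsigned dp_blocks_committed;

    static unsigned dp_index(uint64_t pc) { return (unsigned)((pc >> 2) % DP_BITS); }

    /* Only loads consult the table; a set bit means "defer this load". */
    static bool dp_defer_load(uint64_t pc)
    {
        return dp_table[dp_index(pc)] != 0;
    }

    /* Updated on a detected load/store ordering violation. */
    static void dp_report_violation(uint64_t pc)
    {
        dp_table[dp_index(pc)] = 1;
    }

    /* Periodic flash clear, driven by committed-block count. */
    static void dp_on_block_commit(void)
    {
        if (++dp_blocks_committed >= 65536) {   /* interval is a placeholder */
            for (unsigned i = 0; i < DP_BITS; i++) dp_table[i] = 0;
            dp_blocks_committed = 0;
        }
    }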



Store Counting Block Store Completion Detection

• Store arrival information must be shared between D-tiles
 – For block completion, and for waking up deferred loads
• The Data Status Network (DSN) communicates store arrival information
 – A multi-hop broadcast network; each link carries the store's frame ID and LSID
• At block dispatch, each D-tile receives (on the GDN) and records the block's store information in a store count table: the frame ID, the number of stores, and the 32-bit store mask identifying which instructions in the block are stores (recorded in the LSQ)
• DSN operation
 – A store arrival initiates a transfer on the DSN; receiving tiles update their local store counts and pass the arrival on, so all tiles know about the arrival after 3 cycles
• GSN operation
 – Completion can be detected at any D-tile, but special processing is done in D-tile 0 before the completion message is sent on the GSN to the G-tile
(Figure: Example. Frame ID 5 has 3 stores with store mask 0x00011001; a store arrival (frame 5, LSID 15) at DT1 in cycle 0 propagates over the DSN in cycles 1-2 while each tile's store counter is updated.)
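A sketch of the per-frame bookkeeping described above (struct and function names are ours; the DSN relay itself is not modeled):

    #include <stdbool.h>
    #include <stdint.h>

    typedef struct {
        uint32_t store_mask;  /* one bit per LSID that is a store in this block */
        unsigned remaining;   /* stores not yet seen to arrive anywhere         */
    } frame_stores_t;

    static unsigned popcount32(uint32_t x)
    {
        unsigned n = 0;
        while (x) { x &= x - 1; n++; }
        return n;
    }

    /* Called at block dispatch with the mask delivered over the GDN. */
    static void frame_dispatch(frame_stores_t *f, uint32_t store_mask)
    {
        f->store_mask = store_mask;
        f->remaining  = popcount32(store_mask);
    }

    /* Called once per store arrival, local or relayed over the DSN.
     * Returns true when the block's stores are complete at this tile. */
    static bool frame_store_arrived(frame_stores_t *f)
    {
        return --f->remaining == 0;
    }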

Store Commit Pipeline Load and Store Miss Handling

Store commit pipeline:
• Cycle 1: pick a store to commit; cycle 2: read the store data from the LSQ (stall pipe); cycle 3: access the cache tags (stall pipe); cycle 4: check for a cache hit or miss; cycle 5: write to the cache or to the store merge buffer
• Stores are committed in LSID and block order at each D-tile
 – Written into the L1 cache, or bypassed to the merge buffer on a cache miss
 – One store committed per D-tile per cycle
• Up to four stores total can be committed every cycle, possibly out of order
 – T-morph: store commits from different threads are not interleaved
• The commit pipe stalls when cache ports are busy
 – Fills from missed loads take up cache write ports

Load and store miss handling:
• Load miss operation
 – Cycle 1: miss determination; cycle 2: MSHR allocation and merging; cycle 3: requests sent out on the OCN, if available
• Load miss return
 – Cycle 1: wake up all loads waiting on the incoming data; cycle 2: select one load as its miss data starts arriving; cycle 3: prepare the load and re-inject it into the main pipeline, if unblocked
• Store merging
 – Attempts to merge consecutive writes to the same line; on a cache miss, stores are inserted into the merge buffer
 – Snoops on all incoming load misses for coherency
• Resources are shared across threads


D-Tile Pipelines Block Diagram Memory System Design Challenges

(Figure: D-tile pipelines block diagram. Load input from the E-tile feeds the load pipeline through a shared load port; the replay (deferred) load pipeline, missed-load pipeline, line-fill pipeline, store-hit/commit pipeline, and store-miss pipeline arbitrate for the shared cache and LSQ ports and the L2 request channel, with coherence checks between load and store misses to the same cache line. Unfinished stores bypass to loads in a previous stage; bypassing means fewer stalls and higher throughput. Costly resources, such as cache and LSQ ports and L2 bandwidth, are shared.)

Memory system design challenges:
• Optimizing and reducing the size of the LSQ
 – The prototype LSQ size exceeds the data cache array size
• Further optimizing memory-side dependence prediction
 – The prior state of the art predicts at instruction dispatch/issue time
• Scaling and decentralizing the Data Status Network (DSN)


D-Tile Roadmap Secondary Memory System

ISCA-32 Tutorial

Changkyu Kim

Department of Computer Sciences The University of Texas at Austin


M-Tile Location and Connections M-Tile Network Attributes

(Figure: M-tile location. A 4x4 array of tiles, each position an M-tile or an N-tile, connected by the OCN.)

• Total 1MB on-chip memory capacity
 – Array of 16 64KB M-tiles
 – Connected by a specialized on-chip network (OCN)
• High-bandwidth OCN
 – A peak of 16 cache accesses per cycle
 – 68 GB/sec to the two processor cores
• Low latency for future technologies
 – Non-uniform (NUCA) level-2 cache
• Configuration
 – Scratchpad / level-2 cache modes on a per-bank basis
 – Configurable address mapping: shared L2 cache / private L2 cache


Non-Uniform Level-2 Cache (NUCA) Multiple Modes in TRIPS Secondary Memory

• Non-uniform access latency
 – Closer banks can be accessed faster (9 cycles to the nearest bank vs. 25 cycles to the farthest, per the figure)
• Designed for wire-dominated future technologies
 – No global interconnect
• Connected by a switched network
 – Can fit a large number of banks with small area overhead
• Allows thousands of in-flight transactions
 – Network capacity = 2100 flits
 – Maximize throughput

Multiple modes (L2 = M-tile configured as a level-2 cache bank; S = M-tile configured as a scratchpad):
(Figure: four configurations of the 16-bank array between PROC0/PROC1 and the DMA, SDC, EBC, and C2C clients: all shared level-2 caches; per-processor level-2 caches; all scratchpad memories; and half level-2 caches, half scratchpad memories.)


Major Structures in the M-Tile Back-to-Back Read Pipeline

(Figure: Major structures in the M-tile. An OCN router with North/South/East/West ports, the incoming OCN interface and decoder, an MSHR, the tag array, the data array, and the outgoing OCN interface and encoder.)

• 8-way set-associative cache
• 64B line size
• 1 MSHR entry (16 entries total in the whole cache)
• Configurability (tag on/off)
• 1B-64B read / write / swap transaction support
• Error checking and handling

Back-to-back read pipeline:
• 4-cycle latency to send a reply header; 7-cycle latency to transfer all 64B of data
 – First read: the request arrives at the input FIFO (cycle 0); tag access (cycle 1); data accesses 1 and 2 (cycles 2-3); send the reply header (cycle 4); send the four 16B chunks of reply data (cycles 5-8)
 – A second read can arrive immediately behind the first and stream its reply out back-to-back: seamless data transfer at 6.8 GB/sec


Read Miss Pipeline Chip- and System-Level Networks and I/O

• 4-cycle latency to send a fill request to the SDC
 – Cycle 0: the incoming read miss request arrives at the input FIFO; cycle 1: tag access; cycle 2: MSHR allocation; cycle 3: send the outgoing read miss request to main memory
• 9-cycle latency to fill 64B of data and send a reply header
 – Cycle 0: the reply arrives at the incoming read FIFO; cycle 1: tag access; cycles 2-5: fill the four 16B chunks of reply data; then MSHR access and reply-header generation, with the reply header sent in cycle 8

ISCA-32 Tutorial

Paul Gratz

Department of Computer Sciences
The University of Texas at Austin


OCN Location and Connections Goals and Requirements of the OCN

(Figure: OCN location. Columns of N-tiles and M-tiles/OCN clients adjacent to the I-tiles, linked by OCN connections, with the C2C controller as a corner client.)

• Fabric for interconnection of processors, memory, and I/O
• Handles multiple types of traffic:
 – Routing of L1 and L2 cache misses
 – Memory requests
 – Configuration and data register reads/writes by an external host
 – Inter-chip communication
• OCN network attributes
 – Flexible mapping of resources
 – Deadlock-free operation
 – Performance (16 client links, 6.8 GB/sec per link)
 – Guaranteed delivery of requests and replies


OCN Protocol N-Tile Design

OCN protocol:
• Flexible mapping of resources
 – System address mapping tables are runtime-reconfigurable, allowing runtime re-mapping between L2 cache banks and scratchpad memory banks
 – Configurable L2 network addresses
• Deadlock avoidance
 – Dimension-order routed network (Y-X)
 – Four virtual channels (request and reply) used to avoid deadlock
• Performance
 – Links are 128 bits wide in each direction
 – Credit-based flow control
 – One cycle per hop, plus a cycle at network insertion
 – 68 GB/sec at the processor interfaces

N-tile design:
• Goals
 – Router for OCN packets
 – System-address to network-address translation
 – One cycle per hop latency
• Attributes
 – Composed of an OCN router with translation tables on the local input
 – Portions of the system address index into the translation tables
 – Translation tables are configured through memory-mapped registers
 – Errors back on invalid addresses
(Figure: N-tile internals. Local, north, south, east, and west inputs; the translation table on the local input; a crossbar to the five outputs.)
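A sketch of the local-input translation step (the table size and which address bits index it are assumptions; only the valid/error-back behavior follows the slide):

    #include <stdint.h>

    #define XLATE_ENTRIES 16      /* placeholder table size */

    typedef struct {
        uint8_t valid;
        uint8_t net_addr;         /* destination tile on the OCN */
    } xlate_entry_t;

    /* Configured at runtime through memory-mapped registers. */
    static xlate_entry_t xlate_table[XLATE_ENTRIES];

    /* Returns the destination tile, or -1 to signal "error back"
     * on an invalid system address. */
    static int ntile_translate(uint64_t sys_addr)
    {
        unsigned idx = (unsigned)((sys_addr >> 16) % XLATE_ENTRIES);  /* index bits are a guess */
        if (!xlate_table[idx].valid)
            return -1;
        return xlate_table[idx].net_addr;
    }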


I/O Tiles C2C Tile Design Considerations

I/O tiles:
• EBC tile
 – Provides the interface between the board-level PowerPC 440 GP chip and the TRIPS chip
 – Forwards interrupts to the 440 GP's service routines
• SDC tile
 – Contains the IBM SDRAM memory controller
 – Error detection, swap support, and address remapping added
• DMA tile
 – Supports multi-block and scatter/gather transfers
 – Buffered to optimize traffic

C2C tile design considerations:
• C2C goals
 – Extend the OCN fabric across multiple chips
 – Provide flexible extensibility
• C2C attributes
 – The C2C network protocol is similar to the OCN, except: 64 bits wide, and 2 virtual channels
 – Operates at up to 266 MHz; clock speed configurable at boot
 – Source-synchronous off-chip links
 – Multiple clock domains with resynchronization
(Figure: C2C tile. OCN-to-C2C and C2C-to-OCN conversion, five input and output ports (local, north, south, east, west), and a crossbar.)


C2C Network

• Latencies:
 – L2 hit latency: 9-27 cycles
 – L2 miss latency: 26-44 cycles (+ SDRAM access time)
 – L2 hit latency (adjacent chip): 33-63 cycles
 – L2 miss latency (adjacent chip): 50-78 cycles (+ SDRAM)
 – Each additional C2C hop adds 8 cycles in each direction
• Bandwidth:
 – OCN link: 6.8 GB/sec
 – One processor to OCN: 34 GB/sec
 – Both processors to OCN: 68 GB/sec
 – Both SDRAMs: 2.2 GB/sec
 – C2C link: 1.7 GB/sec
(Figure: a 2x2 grid of TRIPS chips, each with two CPUs, two SDRAMs, and an OCN, linked by C2C connections.)

Compiler Overview

ISCA-32 Tutorial

Kathryn McKinley

Department of Computer Sciences
The University of Texas at Austin


Compiler Challenges Generating Correct, High Performance Code

“Don’t put off to runtime what you can do at compile time.” — Norm Jouppi, May 2005.

• TRIPS-specific compiler challenges
 – EDGE execution model
 – What’s a TRIPS block?
 – Block constraints & compiler scheduling

• The compiler’s job
 – Correct code: satisfying TRIPS block constraints
   • Makes the hardware simpler; shifts work to compile time
 – Optimized code: filling up TRIPS blocks with useful instructions


Block Execution TRIPS Block Constraints

(Figure: Block execution. Register banks feed up to 32 read instructions into a dataflow graph (DFG) of 1-128 instructions; up to 32 loads and 32 stores exchange addresses+targets and data with memory; up to 32 write instructions return values to the register banks, and one branch produces the next PC.)

• Fixed size: 128 instructions
 – Padded with no-ops if needed
 – Simpler hardware: instruction size, I-cache design, global control
• Registers: 8 reads and writes to each of 4 banks (in addition to the 128 instructions)
 – Reduced buffering
 – Read/write instruction overhead, register file size
• Block termination: all stores and writes execute, one terminating branch
 – Simplifies the hardware logic for detecting block completion
• Fixed output: 32 load or store queue IDs, one branch
 – LSQ size, instruction size


Outline Compiler Phases

• Overview of TRIPS compiler components
• How hard is it to meet TRIPS block constraints?
 – Compiler algorithms to enforce them
 – Evaluation
 – Question: How often does the compiler under-fill a block to meet a constraint?
 – Answer: Almost never
• Optimization
 – High-quality TRIPS blocks
 – Scheduling for TRIPS
 – Cache bank optimization
• Preliminary evaluation

(Figure: compiler phases.)


Compiler Phases (2) Forming TRIPS Blocks

ISCA-32 Tutorial

Aaron Smith

Department of Computer Sciences The University of Texas at Austin


High Quality TRIPS Blocks TRIPS Block Constraints

High-quality TRIPS blocks:
• Full
• High ratio of committed to total instructions
 – Dependence height is not an issue
• Minimize branch misprediction
• Meet correctness constraints

TRIPS block constraints:
• Fixed size: at most 128 instructions
 – Easy for the compiler to satisfy, but good performance requires blocks full of useful instructions
• Registers: 8 reads and writes to each of 4 banks (plus the 128 instructions)
• Block termination
• Fixed output: at most 32 load or store queue IDs


O3: Basic Blocks as TRIPS Blocks O4: Hammock Hyperblocks as TRIPS Blocks

• O3: every basic block is a TRIPS block
• Simple, but not high performance

(Figure: two possible hammock hyperblocks for this control-flow graph.)


Static TRIPS Block Instruction Counts SPEC Total TRIPS Blocks

• 15% average reduction in the number of TRIPS blocks between O3 and O4
• Results missing sixtrack, perl, and gcc

Static average TRIPS block instruction counts:

 Suite    | O3 min | O3 max | O3 mean | O4 min | O4 max | O4 mean
 SPEC2000 | 7      | 42     | 16      | 8      | 42     | 18
 EEMBC    | 7      | 31     | 16      | 9      | 67     | 22


Enforcing Block Size SPEC > 128 Instructions

• At O4, an average of 0.9% of TRIPS blocks exceed 128 instructions (max 6.6%)
• Too big?
 1. Unpredicated: cut
 2. Predicated: reverse if-conversion

(Figure: the initial hammock hyperblock, and the result after reverse if-conversion.)


In Progress: Tail Duplication In Progress: More Aggressive Hyperblocks

• Tail duplicate A
• Increases block size
• Hot long paths
(Figure: the control-flow graph after reverse if-conversion and after tail duplication.)

• Add support for complex control flow (i.e., breaks/continues), e.g.:

    while (*pp != '\0') {
        if (*pp == '.') {
            break;
        }
        pp++;
    }

• An example: the number of TRIPS blocks in the ammp_1 microbenchmark
 – 38 / 27 / 12 with O3 / O4 (current) / new


While Loop Unrolling TRIPS Block Constraints

• The same while loop, unrolled under predication:

    while (*pp != '\0') {
        if (*pp == '.') {
            break;
        }
        pp++;
    }

(Figure: the unrolled dataflow; the != '\0' and == '.' tests repeat per iteration, each guarding a break or a pp++.)
• Supported on architectures with predication
• Implemented, but needs improved hyperblock generation

TRIPS block constraints:
• Fixed size: at most 128 instructions
• Registers: 8 reads and writes to each of 4 banks (plus the 128 instructions)
 – Simple linear scan priority-order allocator
• Block termination
• Fixed output: at most 32 load or store queue IDs


Linear Scan Register Allocation SPEC > 32 (8x4 banks) Registers

Linear scan register allocation:
• 128 registers, 32 in each of 4 banks
• Hyperblocks treated as large instructions
 – Computes liveness over hyperblocks
 – Fewer hyperblocks than instructions, so memory savings vs. instruction-level allocation
• Allocator ignores local variables
• Spills in only 5 SPEC2000 benchmarks, vs. 18 on Alpha
 – Static spill counts (Alpha vs. TRIPS): gcc 4290 vs. 211; sixtrack 3510 vs. 165; mesa 2048 vs. 207; apsi 920 vs. 7; applu 14 vs. 19
 – Average spill: 1 store, 2-7 loads
 – Spills are less than 0.5% of all dynamic loads/stores

SPEC blocks over the register budget:
• At O4, an average of 0.3% of TRIPS blocks have > 32 register reads
• At O4, an average of 0.2% of TRIPS blocks have > 32 register writes


SPEC > 8 Registers per Bank TRIPS Block Constraints

• At O4, an average of 0.12% of TRIPS blocks exceed 8 reads to some bank
• At O4, an average of 0.07% of TRIPS blocks exceed 8 writes to some bank

TRIPS block constraints:
• Fixed size: at most 128 instructions
• Control flow: 8 branches (soft)
• Registers: 8 reads and writes to each of 4 banks (plus the 128 instructions)
• Block termination
• Fixed output: at most 32 load or store queue IDs


Block Termination Register Writes

Block termination:
• One branch fires
• All writes complete
 – Write nullification
• All stores complete
 – Store nullification
• Could nullify writes
 – Increases block size

Register writes:
• How do we guarantee all paths produce a value?
 – Insert corresponding read instructions for all write instructions
 – Transform into SSA form
 – Out of SSA to add moves
(Figure: before SSA, read 4 reaches def 4 and def 6, both writing register 4; after SSA, a phi(4,6) feeds the single write of register 4.)


Store Nullification SPEC Store Nullification

How do you insert nulls for this? Two possibilities — which is better?
(Figure: a hammock in which only one path stores. Either a NULL/dummy SD is placed on the storeless path, or the SD is moved below the join.)
• Dummy stores
 – Wastes space
 – Easier to schedule?
• Code motion
 – Better resource utilization

SPEC store nullification:
• Nullification is 1.2% of instructions for 177.mesa (0.8% dummy stores)
• Is it significant?


TRIPS Block Constraints Load/Store ID Assignment

TRIPS block constraints:
• Fixed size: at most 128 instructions
• Control flow: 8 branches (soft)
• Registers: 8 reads and writes to each of 4 banks (plus the 128 instructions)
• Block termination: each store issues once
• Fixed output: at most 32 load or store queue IDs

Load/store ID assignment:
• Ensures memory consistency; simplifies hardware
• Overlapping LSQ IDs
 – Increase the number of load/store instructions that fit in a TRIPS block
 – But a load and a store cannot share the same ID
(Figure: before assignment, LD, LD, SD; after assignment, LD [0], LD [0], SD [1].)
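An illustrative encoding of the sharing rule (the policy shown is deliberately trivial; the compiler's real assignment also respects program-order constraints it tracks):

    #include <stdbool.h>

    typedef struct {
        int  next_id;        /* next fresh LSID, 0..31      */
        int  last_id;        /* most recently assigned LSID */
        bool last_was_load;
    } lsid_alloc_t;

    /* Two loads may share an LSID (e.g. on disjoint predicate paths),
     * but a load and a store may not. */
    static int lsid_assign(lsid_alloc_t *a, bool is_load, bool may_overlap_prev)
    {
        if (is_load && may_overlap_prev && a->last_was_load)
            return a->last_id;       /* overlap two loads on one LSID */
        a->last_id = a->next_id++;   /* otherwise take a fresh LSID (max 32/block) */
        a->last_was_load = is_load;
        return a->last_id;
    }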


SPEC > 32 LSQ IDs Correct Compilation of TRIPS Blocks

• At O4, an average of 0.3% of TRIPS blocks have > 32 LSQ IDs

Correct compilation of TRIPS blocks:
• Some new compiler challenges
• New, but straightforward algorithms
• Does not limit compiler effectiveness (so far!)


Outline Code Optimization

• Overview of TRIPS compiler components
• Are TRIPS block constraints onerous?
 – Compiler algorithms that enforce them
 – Evaluation
 – Question: How often does the compiler under-fill a block to meet a constraint?
 – Answer: almost never
• Towards optimized code
 – TRIPS-specific optimization policies
 – Scheduling for TRIPS
 – Cache bank optimization
• Preliminary evaluation

ISCA-32 Tutorial

Kathryn McKinley

Department of Computer Sciences
The University of Texas at Austin


Towards Optimized Code TRIPS Scheduling Problem

• Filling TRIPS blocks with useful instructions
 – Better hyperblock formation
 – Loop unrolling for TRIPS block constraints
 – Aggressive inlining
 – Driving block formation with path and/or edge profiles
• Shortening the critical path in a block
 – Tree-height reduction
 – Redundant computation
• 64-bit machine optimizations
• etc.

TRIPS scheduling problem:
(Figure: a hyperblock flowgraph of adds, loads, shifts, compares, stores, and branches mapped onto the ALU grid between the register file and the data caches.)
• Place instructions on the 4x4x8 grid
• Encode placement in target form


Contrast with Conventional Approaches Dissecting the Problem

• VLIW
 – Relies completely on the compiler to schedule code
 + Eliminates the need for dynamic dependence-check hardware
 + Good match for partitioning; can minimize communication latencies on critical paths
 – Poor tolerance of unpredictable dynamic latencies, and these latencies continue to grow
• Superscalar approach
 – Hardware dynamically schedules code
 + Can tolerate dynamic latencies
 – Quadratic complexity of dependence-check hardware
 – Not a good match for partitioning; difficult to make good placement decisions
 – The ISA does not allow software to help with instruction placement

Dissecting the problem:
• Scheduling is a two-part problem
 – Placement: where an instruction executes
 – Issue: when an instruction executes
• VLIW represents one extreme: Static Placement, Static Issue (SPSI)
 + Static placement works well for partitioned architectures
 – Static issue causes problems with unknown latencies
• Superscalars represent another extreme: Dynamic Placement, Dynamic Issue (DPDI)
 + Dynamic issue tolerates unknown latencies
 – Dynamic placement is difficult in the face of partitioning


EDGE Architecture Solution Scheduling Algorithms (1 & 2)

• EDGE: Explicit Dataflow Graph Execution
 – Static Placement, Dynamic Issue (SPDI)
 – Renegotiates the compiler/hardware binary interface
• An EDGE ISA explicitly encodes the dataflow graph by specifying targets:

    RISC                      EDGE
    i1: movi r1, #10          ALU-1: movi #10, ALU-3
    i2: movi r2, #20          ALU-2: movi #20, ALU-3
    i3: add  r3, r2, r1       ALU-3: add  ALU-4

• Static placement
 – An explicit DFG simplifies hardware: no HW dependency analysis, no associative issue queues
 – Results are forwarded directly through a point-to-point network: no global bypass network
• Dynamic instruction issue
 – Instructions execute in original dataflow order

Scheduling algorithms (1 & 2):
• CRBLO: list scheduler [PACT 2004]
 – Greedy, top down
 – C: prioritize the critical path
 – R: reprioritize
 – B: load balance for issue contention
 – L: data cache locality
 – O: register bank locality
• BSP: Block-Specific Policy
 – Characterizes a block's ILP and loads/stores
 – If loads/stores > 50%, pre-place them and schedule bottom up
 – Else if floating point > 50%, or integer + floating point > 90%, use CRBLO
 – Else pre-place loads/stores and schedule top down, greedy


Scheduling Algorithms (3) Improvements on Hand Coded Microbenchmarks

• Physical path scheduling
 – Top down, critical-path focused
 – Models network contention (N), in addition to issue contention based on instruction latencies (L)
 – The cost function selects the instruction placement that minimizes the maximum contention and latency: min of max(N, L)
 – PP: load/store pre-placement
 – Tried average, difference, etc.

(Figure: normalized speedups of CRBLO, BSP, min max(L,N), and min max(L,N) + PP on hand-coded microbenchmarks: art_1_hand, art_2_hand, art_3_hand, sieve_hand, vadd_hand, gzip_2_hand, bzip2_3_hand, matrix_1_hand, parser_1_hand, equake_1_hand, fft2_GMTI_hand, fft4_GMTI_hand, doppler_GMTI_hand, transpose_GMTI_hand, plus the average.)


Spatial L1 Bank D-Cache Load/Store Alignment Scheduling for L1 Bank D-cache Communication

• Problem
 – The load/store mapping to a data L1 cache bank is unknown to the compiler
• Example (every iteration touches all banks):

    for (i = 0; i < 1024; i++) {
        // A[i], L[i], B[i] go to all banks
        A[i] += L[i] + B[i];
    }

(Figure: the processor grid of I/G/R/D/E tiles; loads from one E-tile fan out to all four D-tile banks and on to the L2.)
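Cache-line interleaving implies a simple address-to-bank map; assuming sequential 64-byte lines rotate across the four D-tiles, bits [7:6] of the address select the bank (the function name is ours):

    #include <stdint.h>

    /* 64B line => drop the low 6 bits, then take the line number mod 4. */
    static inline unsigned dcache_bank(uint64_t addr)
    {
        return (unsigned)((addr >> 6) & 0x3);
    }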


Spatial L1 D-cache Communication Spatial Load/Store Alignment

• Solution:
 – Align arrays on the bank alignment boundary
   • A function of cache line size and number of banks
   • Alignment declaration and/or specialized malloc
 – Jump-and-fill, a new loop transformation
   • Similar to unroll-and-jam

    for (j = 0; j < 1024; j += 32) {
        for (i = j; i<   [inner loop truncated in source]

(Figure: with alignment, each load/store in the scheduled loop maps to a known D-tile bank.)


Spatial L1 D-cache Communication Unrolling for Larger Hyperblocks

• Vector add unrolled continuously (tcc O3):

    for (i = 0; i < 1024; i += 32) {
        C[i]    += (A[i]    + B[i]);     // Bank 0
        C[i+ 1] += (A[i+ 1] + B[i+ 1]);  // Bank 0
        C[i+ 2] += (A[i+ 2] + B[i+ 2]);  // Bank 0
        . . .
        C[i+31] += (A[i+31] + B[i+31]);  // Bank 3
    }

• Vector add unrolled with jump-and-fill (JF unroll):

    for (i = 0; i < 1024; i += 32) {
        C[i]    += (A[i]    + B[i]);     // Bank 0
        C[i+ 8] += (A[i+ 8] + B[i+ 8]);  // Bank 1
        C[i+16] += (A[i+16] + B[i+16]);  // Bank 2
        C[i+24] += (A[i+24] + B[i+24]);  // Bank 3
        C[i+ 1] += (A[i+ 1] + B[i+ 1]);  // Bank 0
        . . .
        C[i+31] += (A[i+31] + B[i+31]);  // Bank 3
    }


Spatial Load/Store Alignment Results Performance Goals

(Figure: speedups of Unroll 32C, Jump&Fill, and JF Unroll over tcc -O3, with and without bank hints.)

Performance goals:
• Technology comparison points
 – Alpha: the fairest comparison we can do; a high-ILP, high-frequency, low-cycle-count machine, with gcc and the Alpha compiler
 – Scale on Alpha
 – Scale on TRIPS
• Performance goal: 4x speedup over Alpha
• Benchmarks
 – Hand-coded microbenchmarks
• Metric for measuring compiler success
 – Industry standards: EEMBC, SPEC


Microbenchmark Speedup Progress Towards Correct & Optimized Code

• 1.2 avg. speedup, O3 vs. Alpha
• 3.1 avg. speedup, hand-coded vs. Alpha
• 1.5 avg. speedup, O4 vs. Alpha
(Figure: per-benchmark speedups of Alpha gcc -O3, Scale O3, Scale O4, and hand TIL over art_1, art_2, art_3, sieve, vadd, gzip_1, gzip_2, bzip2_1, bzip2_2, bzip2_3, twolf_1, twolf_3, ammp_1, ammp_2, parser_1, equake_1, fft2_GMTI, fft4_GMTI, matrix_1, doppler_GMTI, and transpose_GMTI.)

Progress towards correct & optimized code:
• SPEC 2000 C/Fortran-77, O3 (basic blocks)
 – March 2004: 0/21; November 2004: 6/21; March 2005: 15/21; May 2005: 19/21
• SPEC 2000 C/Fortran-77, O4 (hammock hyperblocks)
 – May 2005: 8/21
• EEMBC, O4
 – May 2005: 30/30
• GCC torture tests and NIST, O4
 – May 2005: 30/30
• Plenty of work left to do!


Project Plans Wrap-up and Discussion

• Tools release
 – Hope to build academic/industrial collaborations
 – Plan to release the compiler, toolchain, and simulators sometime in the (hopefully early) fall
• Prototype system
 – Plan to have it working December 2005/January 2006
 – Software development effort increasing after tape-out
• Commercialization
 – Actively looking for interested commercial partners
 – Discussing the best business models for transfer to industry
 – Happy to listen to ideas
• Will EDGE architectures become widespread?
 – Many questions remain unanswered (power, performance)
 – Promising initial results; hope to have more concrete answers by next spring
 – A key question: how successful will universal parallel programming become?
   • Perhaps room for parallel programming & EDGE architectures

ISCA-32 Tutorial

TRIPS Team and Guests

Department of Computer Sciences
The University of Texas at Austin


And finally ... Appendix A - Supplemental Materials

THANK YOU for your interest, questions, and taking all of this time!

A subset of the complete manuals and specifications is available at:
http://www.cs.utexas.edu/users/cart/trips

Appendix contents:
• ISA Specification
• Bandwidth summary
• Additional Tile Details


ISA: Header Chunk Format ISA: Read and Write Formats

ISA: Header chunk format
• The 128-byte header chunk holds 32 General Register "read" instructions and 32 "write" instructions in words H0-H31 (byte offsets 0-124); Read i.j / Write i.j denotes bank i, slot j
• Up to 128 bits of header information is encoded in the upper nibbles
 – Block type
 – Block size
 – Block flags
 – Store mask

ISA: Read and write formats
(Figure: register read instruction format: valid bit V at bit 21, the 5-bit GR field in bits [20:16], and two 8-bit read target specifiers, RT1 in [15:8] and RT2 in [7:0]. Register write instruction format: valid bit V, 5-bit GR field, and the write queue (WQ) entry.)
• General Register access is controlled by special read and write instructions
 – Read instructions are executed from the Read Queue
 – Write instructions are executed from the Write Queue
 – Read instructions may target the 1st or 2nd operand slot of any instruction
• General registers are divided into 4 architectural banks
 – 32 registers per bank; bank i holds GRs i, i+4, i+8, etc.
 – Up to 8 reads and 8 writes may be performed at each bank
 – Bank position is implicit, based on the read or write instruction's position in the block header
 – Two implicit bits added to the 5-bit general-register fields address all 128 registers
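A sketch of decoding one read instruction per the bit offsets above; reconstructing the architectural register as GR*4 + bank is our reading of the implicit-bank rule, not a documented formula:

    #include <stdint.h>

    typedef struct {
        unsigned valid;
        unsigned arch_reg;   /* 0..127 */
        unsigned rt1, rt2;   /* 8-bit read target specifiers */
    } read_inst_t;

    static read_inst_t decode_read(uint32_t word, unsigned bank /* 0..3 */)
    {
        read_inst_t r;
        r.valid = (word >> 21) & 0x1;
        unsigned gr = (word >> 16) & 0x1f;     /* 5-bit GR field          */
        r.arch_reg = gr * 4 + bank;            /* bank i holds i, i+4, ... */
        r.rt1 = (word >> 8) & 0xff;
        r.rt2 = word & 0xff;
        return r;
    }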


ISA: Microarchitecture-Specific Target Formats ISA: More Details on Predication

Target formats:
• A 9-bit general target specifier is used by most instructions; the top bits select the slot class:
 – 00 0000000: no target (the all-zero encoding)
 – 00 01 WID: write slot (WQ)
 – 01 Inst. ID: predicate slot (IQ)
 – 10 Inst. ID: OP0 slot (IQ)
 – 11 Inst. ID: OP1 slot (IQ)
• Read target specifiers are truncated to 8 bits, with the implicit 9th bit always set to 1 (they cannot target a write slot or a predicate slot)
• Write Queue entry 0.0 is implicitly targeted by branch instructions and cannot be explicitly targeted
• Most instructions are allowed to target register outputs, predicates, 1st operands, and 2nd operands

More details on predication:
• Any instruction using the G, I, L, S, or B format may be predicated
• A 2-bit predicate field (PR) specifies whether an instruction is predicated and, if so, upon what condition:
 – 00: not predicated
 – 01: (reserved)
 – 10: predicated upon false
 – 11: predicated upon true
• Predicated instructions must receive all of their normal operands plus a matching predicate before they are allowed to execute
 – Instructions that meet these criteria are said to be enabled
 – Instructions that don't meet these criteria are said to be inhibited
 – Inhibited instructions may cause other dependent instructions to be indirectly inhibited
• A block completes executing when all of its register and memory outputs are produced (it is not necessary for all of its instructions to execute)
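A sketch of decoding the 9-bit target specifier from the table above (enum names are ours; the exact subfield split of the write-slot encoding is our guess):

    typedef enum { TGT_NONE, TGT_WRITE_SLOT, TGT_PREDICATE, TGT_OP0, TGT_OP1 } tgt_kind_t;

    typedef struct { tgt_kind_t kind; unsigned id; } target_t;

    static target_t decode_target(unsigned spec9)   /* spec9 holds 9 bits */
    {
        target_t t;
        unsigned hi = (spec9 >> 7) & 0x3;   /* slot class            */
        unsigned lo = spec9 & 0x7f;         /* WQ ID or inst. ID     */
        if (hi == 0) {
            if (spec9 == 0) { t.kind = TGT_NONE;       t.id = 0;  }
            else            { t.kind = TGT_WRITE_SLOT; t.id = lo; }
        } else {
            t.kind = (hi == 1) ? TGT_PREDICATE : (hi == 2) ? TGT_OP0 : TGT_OP1;
            t.id   = lo;
        }
        return t;
    }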


ISA: Loads and Stores ISA: Integer Arithmetic

Loads and stores:

 Mnemonic | Name                 | Format | Sources       | Targets
 LB       | Load Byte            | L      | OP0, IMM      | T0
 LBS      | Load Byte Signed     | L      | OP0, IMM      | T0
 LH       | Load Halfword        | L      | OP0, IMM      | T0
 LHS      | Load Halfword Signed | L      | OP0, IMM      | T0
 LW       | Load Word            | L      | OP0, IMM      | T0
 LWS      | Load Word Signed     | L      | OP0, IMM      | T0
 LD       | Load Doubleword      | L      | OP0, IMM      | T0
 SB       | Store Byte           | S      | OP0, OP1, IMM | None
 SH       | Store Halfword       | S      | OP0, OP1, IMM | None
 SW       | Store Word           | S      | OP0, OP1, IMM | None
 SD       | Store Doubleword     | S      | OP0, OP1, IMM | None

• LB:  T0 = ZEXT( MEM_B[OP0 + SEXT(IMM)] )
• LBS: T0 = SEXT( MEM_B[OP0 + SEXT(IMM)] )
• LH:  T0 = ZEXT( MEM_H[OP0 + SEXT(IMM)] )
• LHS: T0 = SEXT( MEM_H[OP0 + SEXT(IMM)] )
• LW:  T0 = ZEXT( MEM_W[OP0 + SEXT(IMM)] )
• LWS: T0 = SEXT( MEM_W[OP0 + SEXT(IMM)] )
• LD:  T0 = ZEXT( MEM_D[OP0 + SEXT(IMM)] )
• SB:  MEM_B[OP0 + SEXT(IMM)] = OP1[7:0]
• SH:  MEM_H[OP0 + SEXT(IMM)] = OP1[15:0]
• SW:  MEM_W[OP0 + SEXT(IMM)] = OP1[31:0]
• SD:  MEM_D[OP0 + SEXT(IMM)] = OP1[63:0]

Integer arithmetic:

 Mnemonic | Name                      | Format | Sources  | Targets
 ADD      | Add                       | G      | OP0, OP1 | T0, T1
 ADDI     | Add Immediate             | I      | OP0, IMM | T0
 SUB      | Subtract                  | G      | OP0, OP1 | T0, T1
 SUBI     | Subtract Immediate        | I      | OP0, IMM | T0
 MUL      | Multiply                  | G      | OP0, OP1 | T0, T1
 MULI     | Multiply Immediate        | I      | OP0, IMM | T0
 DIVS     | Divide Signed             | G      | OP0, OP1 | T0, T1
 DIVSI    | Divide Signed Immediate   | I      | OP0, IMM | T0
 DIVU     | Divide Unsigned           | G      | OP0, OP1 | T0, T1
 DIVUI    | Divide Unsigned Immediate | I      | OP0, IMM | T0

• There is no support for overflow detection or extended arithmetic
• The multiply instruction may be split into signed and unsigned versions


ISA: Integer Logical ISA: Integer Shift

Integer logical:

 Mnemonic | Name                  | Format | Sources  | Targets
 AND      | Bitwise AND           | G      | OP0, OP1 | T0, T1
 ANDI     | Bitwise AND Immediate | I      | OP0, IMM | T0
 OR       | Bitwise OR            | G      | OP0, OP1 | T0, T1
 ORI      | Bitwise OR Immediate  | I      | OP0, IMM | T0
 XOR      | Bitwise XOR           | G      | OP0, OP1 | T0, T1
 XORI     | Bitwise XOR Immediate | I      | OP0, IMM | T0

• XORI may be used to perform a bitwise complement: XORI( A, -1 )

Integer shift:

 Mnemonic | Name                             | Format | Sources  | Targets
 SLL      | Shift Left Logical               | G      | OP0, OP1 | T0, T1
 SLLI     | Shift Left Logical Immediate     | I      | OP0, IMM | T0
 SRL      | Shift Right Logical              | G      | OP0, OP1 | T0, T1
 SRLI     | Shift Right Logical Immediate    | I      | OP0, IMM | T0
 SRA      | Shift Right Arithmetic           | G      | OP0, OP1 | T0, T1
 SRAI     | Shift Right Arithmetic Immediate | I      | OP0, IMM | T0

• Operand 1 provides the data to be shifted; operand 2 or IMM provides the shift count
• The lower 6 bits of the shift count are interpreted as an unsigned quantity; the higher bits are ignored


ISA: Integer Extend ISA: Integer Test

Integer extend:

 Mnemonic | Name                     | Format | Sources | Targets
 EXTSB    | Extend Signed Byte       | G      | OP0     | T0, T1
 EXTSH    | Extend Signed Halfword   | G      | OP0     | T0, T1
 EXTSW    | Extend Signed Word       | G      | OP0     | T0, T1
 EXTUB    | Extend Unsigned Byte     | G      | OP0     | T0, T1
 EXTUH    | Extend Unsigned Halfword | G      | OP0     | T0, T1
 EXTUW    | Extend Unsigned Word     | G      | OP0     | T0, T1

• These are better than doing a left shift followed by a right shift

Integer test:

 Mnemonic | Name             | Format | Sources  | Targets
 TEQ      | Test EQ          | G      | OP0, OP1 | T0, T1
 TNE      | Test NE          | G      | OP0, OP1 | T0, T1
 TLE      | Test LE          | G      | OP0, OP1 | T0, T1
 TLEU     | Test LE Unsigned | G      | OP0, OP1 | T0, T1
 TLT      | Test LT          | G      | OP0, OP1 | T0, T1
 TLTU     | Test LT Unsigned | G      | OP0, OP1 | T0, T1
 TGE      | Test GE          | G      | OP0, OP1 | T0, T1
 TGEU     | Test GE Unsigned | G      | OP0, OP1 | T0, T1
 TGT      | Test GT          | G      | OP0, OP1 | T0, T1
 TGTU     | Test GT Unsigned | G      | OP0, OP1 | T0, T1

• Test instructions produce true (1) or false (0)


ISA: Integer Test #2 ISA: Floating Point Arithmetic

Integer test #2:

 Mnemonic | Name                       | Format | Sources  | Targets
 TEQI     | Test EQ Immediate          | I      | OP0, IMM | T1
 TNEI     | Test NE Immediate          | I      | OP0, IMM | T1
 TLEI     | Test LE Immediate          | I      | OP0, IMM | T1
 TLEUI    | Test LE Unsigned Immediate | I      | OP0, IMM | T1
 TLTI     | Test LT Immediate          | I      | OP0, IMM | T1
 TLTUI    | Test LT Unsigned Immediate | I      | OP0, IMM | T1
 TGEI     | Test GE Immediate          | I      | OP0, IMM | T1
 TGEUI    | Test GE Unsigned Immediate | I      | OP0, IMM | T1
 TGTI     | Test GT Immediate          | I      | OP0, IMM | T1
 TGTUI    | Test GT Unsigned Immediate | I      | OP0, IMM | T1

• Test instructions produce true (1) or false (0)

Floating-point arithmetic:

 Mnemonic | Name                        | Format | Sources  | Targets
 FADD     | FP Add                      | G      | OP0, OP1 | T0, T1
 FSUB     | FP Subtract                 | G      | OP0, OP1 | T0, T1
 FMUL     | FP Multiply                 | G      | OP0, OP1 | T0, T1
 FDIV     | FP Divide (simulator only!) | G      | OP0, OP1 | T0, T1

• These operations are performed assuming double-precision values
• All results are rounded as necessary using the default IEEE rounding mode (round-to-nearest)
• No exceptions are reported
• There is no support for gradual underflow or denormal representations


ISA: Floating-Point Test ISA: Floating Point Conversion

Floating-point test:

 Mnemonic | Name       | Format | Sources  | Targets
 FEQ      | FP Test EQ | G      | OP0, OP1 | T0, T1
 FNE      | FP Test NE | G      | OP0, OP1 | T0, T1
 FLE      | FP Test LE | G      | OP0, OP1 | T0, T1
 FLT      | FP Test LT | G      | OP0, OP1 | T0, T1
 FGE      | FP Test GE | G      | OP0, OP1 | T0, T1
 FGT      | FP Test GT | G      | OP0, OP1 | T0, T1

• These operations are performed assuming double-precision values
• Test instructions produce true (1) or false (0)

Floating-point conversion:

 Mnemonic | Name                           | Format | Sources | Targets
 FITOD    | Convert Integer to Double FP   | G      | OP0     | T0, T1
 FDTOI    | Convert Double FP to Integer   | G      | OP0     | T0, T1
 FSTOD    | Convert Single FP to Double FP | G      | OP0     | T0, T1
 FDTOS    | Convert Double FP to Single FP | G      | OP0     | T0, T1

• FITOD converts a 64-bit signed integer to a double-precision float; FDTOI converts a double-precision float to a 64-bit signed integer
• FSTOD converts a single-precision float to a double-precision float; FDTOS converts a double-precision float to a single-precision float
• FSTOD is needed to load a single-precision float value; FDTOS is needed to store one


ISA: Control Flow ISA: Miscellaneous

Control flow:

 Mnemonic | Name               | Format | Sources | Targets
 BR       | Branch             | B      | OP0     | PC
 BRO      | Branch with Offset | B      | OFFSET  | PC +=
 CALL     | Call               | B      | OP0     | PC
 CALLO    | Call with Offset   | B      | OFFSET  | PC +=
 RET      | Return             | B      | OP0     | PC
 SCALL    | System Call        | B      | None    | PC

• Predication should be used for conditional branches
• The offset forms of branch and call produce an address offset (in chunks) rather than a full address
• Call and return behave like normal branches (but may be treated in a special way by the hardware)
• The System Call instruction triggers a System Call exception

Miscellaneous:

 Mnemonic | Name                       | Format | Sources    | Targets
 NULL     | Nullify Output             | G      | None       | T0, T1
 MOV      | Move                       | G      | OP0        | T0, T1
 MOVI     | Move Immediate             | I      | IMM        | T0
 MOV3     | Move to three targets      | M3     | OP0        | T0, T1, T2
 MOV4     | Move to four targets       | M4     | OP0        | T0, T1, T2, T3
 MFPC     | Move from PC               | I      | None       | T0
 GENS     | Generate Signed Constant   | C      | CONST      | T0
 GENU     | Generate Unsigned Constant | C      | CONST      | T0
 APP      | Append Constant            | C      | OP0, CONST | T0
 NOP      | No Operation               | C      | None       | None
 LOCK     | Load and Lock              | L      | OP0        | T0

• NULL is a special instruction for cancelling one or more block outputs
• MOV is the same as RPT (but with a conventional name)
• The generate and append instructions may be used to form large constants
 – Three instructions are required to form a 48-bit constant
 – Four instructions are required to form a 64-bit constant


Bandwidth Summary G-tile: Access to Architected Registers

Bandwidth summary (maximum bandwidth calculations):

 Interface                            | Width    | Freq      | Multiplier  | Bytes/Sec
 Processor Data Cache (Fill + Spill)  | 128 bits | 533 MHz   | 4 * 2       | 68 GB/s
 Processor Data Cache (Load + Store)  | 64 bits  | 533 MHz   | 4 * 2       | 34 GB/s
 Processor Inst Cache (Fetch or Fill) | 128 bits | 533 MHz   | 4           | 34 GB/s
 Processor to OCN Interface           | 128 bits | 533 MHz   | 5 * 2 * 0.8 | 68 GB/s
 On-Chip Network Link (64B Read)      | 128 bits | 533 MHz   | 0.8         | 6.8 GB/s
 On-Chip Memory Bank (64B Read)       | 128 bits | 533 MHz   | 0.8         | 6.8 GB/s
 Chip-to-Chip Network Link (64B Read) | 64 bits  | (266 MHz) | 0.8         | (1.7 GB/s)
 DDR 266 SDRAM (64B Read)             | 64 bits  | 266 MHz   | 0.5         | 1.1 GB/s
 External Bus Interface               | 32 bits  | 66 MHz    | 0.33        | 88 MB/s

G-tile: access to architected registers
• Access to the architected registers is allowed when the processor is halted
• A read/write request arrives over the OCN at the G-tile, which issues it on the OPN; the read reply returns over the OPN and the reply goes back out on the OCN
• Only one read/write request is handled at a time
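As a worked check of the table's arithmetic: the fill + spill interface is 128 bits (16 bytes) wide at 533 MHz, so one port moves 16 B x 533 MHz, roughly 8.5 GB/s, and the 4 * 2 multiplier (four banks, fill plus spill) gives about 68 GB/s, matching the first row; the OCN-link rows instead apply the 0.8 payload factor for a 64B read, 8.5 x 0.8, about 6.8 GB/s.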


OPN: Detailed Router Schematic E-Tile: List of Pipelines

(Figure: detailed OPN router schematic. Four-entry FIFOs per input (N, S, E, W, Local) with per-direction hold signals, head-packet destination decode, an arbiter per output, and the output control and hold signals.)

E-tile: list of pipelines
• Instruction dispatch pipeline (GDN)
 – Update the instruction register file and status buffer
• Operand delivery pipeline (OPN)
 – Update the operand register file and status buffer
• Instruction wakeup and select pipeline
 – Two-stage instruction selection among three potential candidates
   • A definite ready instruction
   • Late selection: an instruction with a remote/local bypassed operand
 – Guarantee the absence of retire conflicts
• Instruction execute and writeback pipeline
 – Pipelined execution
 – Retire local targets through the local bypass
 – Retire remote targets through the OPN
• Instruction flush / commit pipeline (GCN)
 – Clear block state (pipeline registers, status buffer, etc.)


E-tile: Two Stage Select-Execute Pipeline M-Tile: Internal Structure

(Figure: E-tile two-stage select-execute pipeline. In the instruction selection cycle, definite-ready wakeup, issue prioritization, scoreboard access, and instruction/operand register read pick a candidate; in the late selection cycle, instructions woken by previous local or remote bypass control packets compete in final selection; in the execution cycle, immediate data, local bypass data, and OPN remote bypass data are muxed into the execution units.)

(Figure: M-tile internal structure. The incoming OCN interface (dataIn[127:0], typeIn[1:0], validIn[3:0]) and MT decoder feed the MSHR, tag array, data array, and control logic; results leave through the MT encoder and outgoing OCN interface (dataOut[127:0], typeOut[1:0], validOut[3:0]).)


EBC: External Bus Controller SDC: SDRAM Controller Diagram

EBC:
• Provides the interface between the 440 GP and TRIPS chips
• GPIOs for general-purpose I/O
• Synchronizes between the 66 MHz 440 GP bus and the system clock
• Interrupt process:
 – The CPUs and DMAs notify the EBC of an exception through the "intr" interface
 – The interrupt controller calls service routines in the 440 GP via the irq line
 – The 440 GP writes to the halt controller's registers to stop the processors
 – The 440 GP services the interrupt

SDC:
• The SDRAM controller core is provided by IBM
• The SDC interface provides:
 – Protocol conversion between the OCN and the controller's memory interface
 – Synchronization between the SDRAM clock and the OCN clock
 – Address remapping from the system address to the 2GB SDRAMs
 – Error detection and reply for misaligned/out-of-range requests
 – Support for the OCN swap operation


DMA Controller

• Facilitates transfer of data from one set of system addresses to another
• Transfers from and to any addresses except CFG space
• Three types of transfers:
 – Single block, contiguous
 – Single block, scatter/gather
 – Multi-block, linked list
• Up to a 64KB transfer per block
• A 512-byte buffer to optimize OCN traffic
(Figure: DMA controller. The DMA engines and DMA FIFO, configuration registers, and the OCN interface unit behind the OCN client interface.)

