Design and Implementation of the TRIPS EDGE Architecture
TRIPS Tutorial
Originally presented at ISCA-32, June 4, 2005
Department of Computer Sciences, The University of Texas at Austin

Tutorial Outline

• Part I: Overview
  – Introduction (Doug Burger/Steve Keckler)
  – EDGE architectures and the TRIPS ISA (Doug Burger)
  – TRIPS and prototype overview (Steve Keckler)
  – TRIPS code examples (Robert McDonald)
• Part II: TRIPS Front End
  – Global control (Ramadass Nagarajan)
  – Control-flow prediction (Nitya Ranganathan)
  – Instruction fetch (Haiming Liu)
  – Register accesses and operand routing (Karu Sankaralingam)
• Part III: TRIPS Execution Core & Memory System
  – Issue logic and execution (Premkishore Shivakumar)
  – Primary memory system (Simha Sethumadhavan)
  – Secondary memory system (Changkyu Kim)
  – Chip- and system-level networks and I/O (Paul Gratz)
• Part IV: TRIPS Compiler
  – Compiler overview (Kathryn McKinley)
  – Forming TRIPS blocks (Aaron Smith)
  – Code optimization (Kathryn McKinley)
• Part V: Conclusions

Names in parentheses refer to both the presenters at ISCA-32 and the main developers of each section's slide deck.


Copyright Information

• The material in this tutorial is Copyright © 2005-2006 by The University of Texas at Austin. All rights reserved.
• This material was developed as a part of the TRIPS project in the Computer Architecture and Technology Laboratory, Department of Computer Sciences, at The University of Texas at Austin. Please contact Doug Burger ([email protected]) or Steve Keckler ([email protected]) for information about redistribution or usage of these materials in other presentations.
• Portions of the TRIPS technology described in this tutorial are patent pending. Commercial parties interested in licensing or using TRIPS technology for profit should contact the project principal investigators listed above.

Acknowledgment of TRIPS Sponsors

• DARPA
  – Polymorphous Computing Architectures (2001-2006)
    • Contracts F33615-01-C-1892 and F44615-03-C-4106
  – Special thanks to Bob Graybill
• Air Force Research Laboratory
• Intel Research Council
  – 2 Research Grants (2000-2004) and equipment donations (2000-2005)
• IBM
  – Faculty Partnership Awards (1999-2004) and SUR grants (1999, 2003)
  – Research Grant (2003)
• National Science Foundation
  – CAREER and Research Infrastructure Grants (1999-2008)
• Peter O'Donnell Foundation
• UT-Austin College of Natural Sciences
  – Matching funds for industrial fellows, infrastructure grants (1999-2008)


Other Members of the TRIPS Team

Research Scientists
• Bill Yoder – Chief developer of the TRIPS toolchain
• Jim Burrill – Chief engineer of the Scale compiler
• Nick Nethercote – TRIPS compiler performance leader

Graduate Students
• Xia Chen – PowerPC to TRIPS translation and emulation
• Raj Desikan – Initial data tile design
• Saurabh Drolia – Prototype DMA controller, system simulator, parallel software
• Madhu Sibi Govindan – Prototype external bus and clock controllers
• Divya Gulati – Execution tile design, core verification
• Heather Hanson – Operand router design
• Sundeep Kushwaha – Back-end instruction placement
• Bert Maher – Hyperblock construction and formation
• Suriya Narayanan – Compiler/memory system optimization
• Sadia Sharif – TRIPS system software

Relevant TRIPS/EDGE Publications

• "The Design and Implementation of the TRIPS Prototype Chip" – Hot Chips 17, 2005.
  – Contains details about the prototype ASIC.
• "Static Placement, Dynamic Issue (SPDI) Scheduling for EDGE Architectures" – Intl. Conference on Parallel Architectures and Compilation Techniques (PACT), 2004.
  – Specifies the compiler algorithm for scheduling instructions on the TRIPS architecture.
• "Scaling to the End of Silicon with EDGE Architectures" – IEEE Computer, July 2004.
  – Provides an overview of EDGE architectures using the TRIPS architecture as a case study.
• "Exploiting ILP, TLP, and DLP with the Polymorphous TRIPS Architecture" – Intl. Symposium on Computer Architecture (ISCA-30), 2003.
  – Details the flexibility of the TRIPS architecture for exploiting different types of parallelism.
• "An Adaptive, Non-Uniform Cache Structure for Wire-Delay Dominated On-Chip Caches" – Intl. Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS-X), 2002.
  – The first paper describing NUCA caches and their design tradeoffs.
• "A Design Space Evaluation of Grid Processor Architectures" – Intl. Symposium on Microarchitecture (MICRO-34), 2001.
  – An early study that formed the basis for the TRIPS architecture.


Principal Related Work

• Transport-Triggered Architectures [Supercomputing 1991]
  – Hans Mulder and Henk Corporaal
  – MOVE architecture had direct instruction-to-instruction communication bypassing registers
• WaveScalar [MICRO 2003]
  – Mark Oskin, Steve Swanson, Ken Michelson, and Andrew Schwerin
  – Imperative/dataflow EDGE ISA
  – 3 most significant differences with TRIPS:
    • Dynamic instruction placement
    • All-path execution
    • Two-level microarchitectural execution hierarchy
• ASH/CASH/Pegasus IR [CGO-03, ASPLOS-04, ISPASS-05]
  – Mihai Budiu and Seth Goldstein
  – Similar goals and related compiler techniques
  – More of a focus on compiling to ASICs or reconfigurable substrates
• RAW [IEEE Computer-97, ISCA-04]
  – Pioneering tiled architecture tackling concurrency and wire delays
  – Leading work on Scalar Operand Networks
  – Direct instruction-to-instruction communication through a network

Key Trends

• Power
  – Now a budget, just like area
  – What performance can you get for a given power and area budget?
  – Transferring work from the HW to the SW (compiler, programmer, etc.) is power efficient
  – Leakage trends must be addressed (not an underlying goal of TRIPS)
• Slowing (stopping?) frequency increases
  – Pipelining has reached depth limits
  – Device speed scaling may stop tracking feature size reduction
  – May result in decreasing main memory latencies
  – Must exploit more concurrency
  – Key issue: how to expose this concurrency? Force the programmer?
• Wire delays
  – Will force growing partitioning of future chips
  – Work against reduced power and improved concurrency
• Reliability
  – Do everything twice (or thrice) at some level of abstraction
  – Works against power limits and exploitation of concurrency

[Figure: fraction of a 20 mm chip reachable by the I-stream in one cycle, shrinking from 130 nm through 90 nm and 65 nm to 35 nm]


Performance Scaling and Technology Challenges

[Figure: single-processor performance scaling, log2 speedup vs. year (1984-2020), against the historical 55%/year improvement. Annotations: RISC ILP wall as RISC/CISC becomes CPI dominated and industry shifts to a frequency strategy; architectural frequency wall (90 nm, 65 nm, 45 nm, 35 nm, 25 nm) as pipelining reaches depth limits and device speed scaling slows, after which conventional architectures cannot improve performance and concurrency is the only solution for higher performance]

EDGE Architectures: Can New ISAs Help?

CISC ('60s, '70s):
– Complex, few instructions
– Discrete components
– Minimize storage
– Small numbers of instructions in flight
– Pipelining difficult

RISC ('80s, '90s, early '00s):
– More numerous, simple instructions
– RISC concepts implemented in both the ISA and the H/W
– Reliance on compiler ordering
– Optimized for pipelining
– Tens of instructions in flight
– Wide issue inefficient

EDGE? (late '00s, '10s):
– Blocks amortize per-instruction overhead
– Hundreds to thousands of instructions in flight
– Inter-instruction communication explicit
– Exploits more concurrency
– Leakage and reliability not yet addressed
– New programming models needed?


The TRIPS EDGE ISA

ISCA-32 Tutorial
Doug Burger
Department of Computer Sciences, The University of Texas at Austin

What is an EDGE Architecture?

• Explicit Data Graph Execution – defined by two key features:
  1. The program graph is broken into sequences of blocks
     – Basic blocks, hyperblocks, or something else
     – Blocks commit atomically or not at all; a block never partially executes
  2. Within a block, the ISA supports direct producer-to-consumer communication
     – No shared named registers within a block (point-to-point dataflow edges only)
     – Caveat: memory is still a shared namespace
     – The block's dataflow graph (DFG) is explicit in the architecture
• What are the design alternatives for different architectures?
  – Single path through the program block graph (TRIPS)
  – All paths through the program block graph (WaveScalar)
  – Mechanism for specifying instruction-to-instruction links in the ISA
  – Block constraints (composition, fixed size vs. variable size, etc.)


Architectural Structure of a TRIPS Block

[Figure: a block's dataflow graph (DFG) of 1-128 instructions, connected to the register banks (32 read instructions in, 32 write instructions out), to memory (32 loads and 32 stores; addresses+targets sent to memory, data returned to target instructions; store mask sent to a write buffer), and to the PC (PC read in, one emitted branch producing the next PC)]

TRIPS Block Contents

Blocks encoded in the object file include:
• Read/write instructions
• Computation instructions
• 32-bit store mask for tracking store completion
• Load/store IDs
  – Maximum of 32 loads+stores may be emitted, but blocks can hold more

Block characteristics:
• Fixed size
  – 128 instructions max
  – L1 and core expand empty 32-inst chunks to NOPs
• Registers
  – 8 read insts. max per reg. bank (4 banks = max of 32)
  – 8 write insts. max per reg. bank (4 banks = max of 32)
• Control flow
  – Exactly one branch emitted (blocks may hold more)

Two major considerations:
• Every possible execution of a given block must emit a consistent number of outputs (register writes, stores, PC)
  – Outputs may be actual or null
  – Different blocks may have different #s of outputs
• No block state may be written until the block is committed
  – Bulk commit is logically atomic
  – Similar to an architectural checkpoint in some respects


TRIPS Block Example

RISC code:
    ld   R3, 4(R2)
    add  R4, R1, R3
    st   R4, 4(R2)
    addi R5, R4, #2
    beqz R4, Block3

TIL (operand format):
    .bbegin block1
    read $t1, $g1
    read $t2, $g2
    ld   $t3, 4($t2)
    add  $t4, $t1, $t3
    st   $t4, 4($t2)
    addi $t5, $t4, 2
    teqz $t6, $t4
    b_t<$t6> block3
    b_f<$t6> block2
    write $g5, $t5
    .bend block1
    .bbegin block2
    ...

TASL (target format):
    [R1] $g1 [2]
    [R2] $g2 [1] [4]
    [1] ld L[1] 4 [2]
    [2] add [3] [4]
    [3] mov [5] [6]
    [4] st S[2] 4
    [5] addi 2 [W1]
    [6] teqz [7] [8]
    [7] b_t block3
    [8] b_f block2
    [W1] $g5

Notes on the example:
• Read target format
• Predicated instructions
• LD/ST sequence numbers
• Target fanout with movs
• Block outputs fixed (3 in this example)

Key Features of the TRIPS ISA

• Fixed instruction lengths – all instructions are 32 bits
• Read and write instructions
  – Contained in the block header for managing DFG/register communication
• Target format (T0, T1)
  – Output destinations specified with 9-bit targets
• Predication (PR)
  – Nearly every instruction has a predicate field, encoded for efficient dataflow predication
• Load/store IDs (LSID)
  – Used to maintain sequential memory semantics despite the EDGE ISA
• Exit bits (EXIT)
  – Used to improve block exit control prediction


TRIPS Instruction Formats

General instruction formats (bits 31-25, 24-23, 22-18, 17-9, 8-0):
    G: OPCODE | PR | XOP  | T1  | T0
    I: OPCODE | PR | XOP  | IMM | T0

Load and store instruction formats (bits 31-25, 24-23, 22-18, 17-9, 8-0):
    L: OPCODE | PR | LSID | IMM | T0
    S: OPCODE | PR | LSID | IMM | 0

Branch instruction format (bits 31-25, 24-23, 22-20, 19-0):
    B: OPCODE | PR | EXIT | OFFSET

Constant instruction format (bits 31-25, 24-9, 8-0):
    C: OPCODE | CONST | T0

Read instruction format (bits 21, 20-16, 15-8, 7-0):
    R: V | GR | RT0 | RT1

Write instruction format (bits 5, 4-0):
    W: V | GR

Not shown: M3, M4 formats

Instruction fields:
• OPCODE = primary opcode
• XOP = extended opcode
• PR = predicate field (00: unpredicated; 01: reserved; 10: predicated on false; 11: predicated on true)
• IMM = signed immediate
• T0 = target 0 specifier; T1 = target 1 specifier
• LSID = load/store ID
• EXIT = exit number
• OFFSET = branch offset
• CONST = 16-bit constant
• V = valid bit
• GR = general register index
• RT0 = read target 0 specifier; RT1 = read target 1 specifier

TRIPS G and L Formats

• G format: opcode (7 bits), pr (2 bits), xop (5 bits), T1 (9 bits), T0 (9 bits)
• L format: opcode (7 bits), pr (2 bits), LSID (5 bits), IMM (9 bits), T0 (9 bits)
  – LSID: 5-bit sequence number for ordering loads and stores
  – IMM: 9-bit address index for calculating the effective address
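To make the encoding concrete, here is a minimal C sketch (not from the tutorial; the helper name is hypothetical) that packs a G-format instruction from the fields above:

    #include <stdint.h>

    /* Pack a TRIPS G-format instruction:
       OPCODE[31:25] PR[24:23] XOP[22:18] T1[17:9] T0[8:0] */
    static uint32_t pack_g_format(uint32_t opcode, uint32_t pr,
                                  uint32_t xop, uint32_t t1, uint32_t t0)
    {
        return ((opcode & 0x7Fu)  << 25) |   /* 7-bit primary opcode  */
               ((pr     & 0x3u)   << 23) |   /* 2-bit predicate field */
               ((xop    & 0x1Fu)  << 18) |   /* 5-bit extended opcode */
               ((t1     & 0x1FFu) <<  9) |   /* 9-bit target 1        */
                (t0     & 0x1FFu);           /* 9-bit target 0        */
    }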


Target (Operand) Formats

A target field is 9 bits: a 2-bit type followed by a 7-bit instruction name.

2-bit type:
• 00: no target (or a write instruction)
• 01: targets the predicate field
• 10: targets the left operand
• 11: targets the right operand

7 bits:
• Names one of 128 target instructions
• The physical location of the instruction is microarchitecture dependent

Write targets: a 5-bit field names one of 32 write instructions.

Object Code Example

TASL (target format)      Field encoding
    [R1] $g1 [2]          v reg T1 T0
    [R2] $g2 [1] [4]      v reg T1 T0
    [1] ld L[1] 4 [2]     opcode pr LSID imm T0
    [2] add [3] [4]       opcode pr xop T1 T0
    [3] mov [5] [6]       opcode pr xop T1 T0
    [4] st S[2] 4         opcode pr LSID imm 0
    [5] addi 2 [W1]       opcode pr xop imm T0
    [6] teqz [7] [8]      opcode pr xop T1 T0
    [7] b_t block3        opcode pr exit offset
    [8] b_f block2        opcode pr exit offset
    [W1] $g5              v reg


Object Code Example

TASL (target format)      Object code
    [R1] $g1 [2]          1 00001 [2] ----
    [R2] $g2 [1] [4]      1 00010 [1] [4]
    [1] ld L[1] 4 [2]     ld 00 00001 4 [2]
    [2] add [3] [4]       add 00 --- [3] [4]
    [3] mov [5] [6]       mov 00 --- [5] [6]
    [4] st S[2] 4         st 00 00010 4 ---
    [5] addi 2 [W1]       addi 00 --- 2 [W1]
    [6] teqz [7] [8]      teqz 00 --- [7] [8]
    [7] b_t block3        b 11 (t) E0 block3 - PC
    [8] b_f block2        b 10 (f) E1 block2 - PC
    [W1] $g5              1 00101

Store mask: 00000000000000000000000000000100

TRIPS Block Format

• Each block is formed from two to five 128-byte program "chunks"
• Blocks with fewer than five chunks are expanded to five chunks in the L1 I-cache
• The header chunk includes a block header (execution flags plus a store mask) and the register read/write instructions
• Each instruction chunk holds 32 4-byte instructions (including NOPs)
• A maximally sized block contains 128 regular instructions, 32 read instructions, and 32 write instructions

[Figure: PC pointing to a header chunk (128 bytes) followed by instruction chunks 0-3 (128 bytes each)]
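The store mask falls out directly from the store LSIDs. A minimal C sketch (hypothetical helper; for the example block the only store is S[2], setting bit 2):

    #include <stdint.h>

    /* Build the block's 32-bit store mask: one bit per store LSID. */
    static uint32_t build_store_mask(const uint8_t *store_lsids, int nstores)
    {
        uint32_t mask = 0;
        for (int i = 0; i < nstores; i++)
            mask |= UINT32_C(1) << store_lsids[i];   /* LSIDs are 0-31 */
        return mask;   /* example block: 0x00000004 */
    }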


Example of Predicate Operation

    if (p==0)
        z = a * 2 + 3;
    else
        z = b * 3 + 4;

• Test instructions output the low-order bit of 1 or 0 (T or F)
• Predicated instructions either match the arriving predicate or not
  – An ADD with PR=10 is predicated on false; an arriving 0 predicate will match and enable the ADD
  – Predicated instructions must receive all operands and a matching predicate before firing
  – Instructions may receive multiple predicate operands but at most one matching predicate
  – Despite predicates, blocks must produce a fixed # of outputs

[Figure: dataflow graph in which teqz p steers two predicated muli/addi/st(1) chains, one per arm of the if-else]

Predicate Optimization and Fanout Reduction

• Predicate dependence heads
  – Reduce the number of instructions executed
  – Reduce dynamic energy by using less speculation
• Predicate dependence tails
  – Tolerate predicate latencies with speculatively executed insts.
  – Cannot speculate on block outputs

[Figure: predicating at the head (teqz gates the muli chains, with mov3 fanout of the predicate) vs. at the tail (muli/addi chains execute speculatively and only the st(1) is predicated)]
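A minimal C sketch of the firing rule described above (hypothetical struct; the real mechanism is a hardware reservation station, not software):

    #include <stdbool.h>

    enum { PR_NONE = 0, PR_RESERVED = 1, PR_ON_FALSE = 2, PR_ON_TRUE = 3 };

    typedef struct {
        unsigned pr;          /* 2-bit predicate field (PR)       */
        bool ops_ready;       /* all data operands have arrived   */
        bool pred_arrived;    /* a predicate operand has arrived  */
        unsigned pred_value;  /* low-order bit of the test result */
    } Inst;

    /* Fire when operands are ready and, if predicated, a matching
       predicate has arrived (PR=10 matches 0, PR=11 matches 1). */
    static bool can_fire(const Inst *i)
    {
        if (!i->ops_ready)
            return false;
        if (i->pr == PR_NONE)
            return true;
        return i->pred_arrived &&
               (i->pred_value == (i->pr == PR_ON_TRUE ? 1u : 0u));
    }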


Instruction (Store) Merging

• Old: two copies of stores down distinct predicate paths
  – Speculative copies unneeded
• New: merge redundant insts.
  – Only one data source will fire

[Figure: the two predicated muli/addi/st(1) chains from the previous slide merged so that a single st(1) is fed by whichever predicated chain fires]

Nullified Outputs

• Each transmitted operand is tagged with "null" and "exception" tokens
  – Nullified operands arriving at insts. produce nullified outputs
  – Nulls are used to indicate that a block won't produce a certain output (writes and stores)

Block with a store down one conditional path:
    ...
    if (p==0)
        z = a * 2 + 3;
    ...

[Figure: teqz feeds a null instruction on one path and the muli/addi/st(1) chain on the other, so the store output is produced either way]


Exception Handling

• TRIPS exception types:
  – fetch, breakpoint, timeout, execute, store, system call, reset, interrupt
• The TRIPS exception model demands that exceptions not be raised for speculative instructions
• Key concept: propagate exception tokens within the dataflow graph
  – Instructions with exception-set operands do not execute, but propagate the exception token to their targets
  – Exceptions are detected at commit time if one or more block outputs (write, store, branch) have the exception bit set
  – Exceptions are serviced between blocks
• "Block-precise" exception model

Other Architecturally Supported Features

• Execution mode via Control Register
  – Types of speculation (load, branch)
  – Number of blocks in flight
  – Breakpoints before/after blocks execute
• T-Morph via Processor Control Register
  – Can place processors into single-threaded or 4-threaded SMT modes
• Per-bank cache/scratchpad configurations
• TLBs managed by software
• On-chip network routing tables programmed by software
• 40-bit global system memory address space
  – Divided into SRF, cached memory, and configuration quadrants


TRIPS Microarchitecture and Implementation

ISCA-32 Tutorial
Steve Keckler
Department of Computer Sciences, The University of Texas at Austin

TRIPS Microarchitecture Principles

• Limit wire lengths
  – Architecture is partitioned and distributed
  – No centralized resources
  – Local wires are short
  – No global wires
• Design for scalability
  – Microarchitecture composed of different types of tiles
    • Tiles are small (2-5 mm2 per tile is typical)
  – Networks connect only nearest neighbors
  – Design productivity by replicating tiles (design reuse)
  – Tiles interact through well-defined control and data networks
    • Networks are extensible, even late in the design cycle
    • Opportunity for asynchronous interfaces
  – Prototype employs communication latencies of 35nm technology, even though the prototype is implemented in 130nm


TRIPS Chip Block Diagram

[Figure: two TRIPS processors and two DMA controllers attached to the on-chip network, which connects 16 M-tiles (1 MB L2 cache) and 16 N-tiles, two SDRAM controllers (DDR-266 modules), an external bus controller (interrupts/IRQ, external bus), and a chip-to-chip controller (C2C links at 266 MHz)]

TRIPS Tile-level Microarchitecture

TRIPS tiles:
• G: Processor control – TLB w/ variable-size pages, dispatch, next-block predict, commit
• R: Register file – 32 registers x 4 threads, register forwarding
• I: Instruction cache – 16KB storage per tile
• D: Data cache – 8KB per tile, 256-entry load/store queue, TLB
• E: Execution – Int/FP ALUs, 64 reservation stations
• M: Memory – 64KB, configurable as L2 cache or scratchpad (1MB total)
• N: OCN network interface – router, translation tables
• DMA: Direct memory access controller
• SDC: DDR SDRAM controller
• EBC: External bus controller – interface to external PowerPC
• C2C: Chip-to-chip network controller – 4 links to XY neighbors


Grid Processor Tiles and Interfaces

[Figure: 5x5 grid of G, R, I, D, and E tiles]

• GDN: global dispatch network
• OPN: operand network
• GSN: global status network
• GCN: global control network

Mapping TRIPS Blocks to the Microarchitecture

[Figure: R-tile[3] holds read/write queues (8 frames x 8 entries per thread, threads 0-3) in front of the architecture register files; E-tile[3,3] holds instruction reservation stations indexed by frame (0-7), slot (0-7), and operand (I, OP1, OP2); D-tile[3] holds load/store queue entries indexed by LSID (0-31)]


Mapping TRIPS Blocks to the Microarchitecture

• Block i is mapped into frame 0: its header chunk fills one slot of each R-tile's read/write queues, and instruction chunks 0-3 fill frame 0 of the reservation stations across the E-tiles
• Block i+1 is mapped into frame 1 in the same way

[Figure: two copies of the R-tile/E-tile/D-tile structures from the previous slide, showing block i occupying frame 0 and block i+1 occupying frame 1]


Mapping Target Identifiers to Reservation Stations

ISA target identifier fields: Type (2 bits), Y (2 bits), Slot (3 bits), X (2 bits). The Frame (3 bits) is assigned by the G-tile at runtime.

Example: Target = 87, OP1
    Frame: 100 (frame 4, runtime-assigned)
    Type:  10 = OP1
    Y:     10, X: 11 (E-tile [10,11])
    Slot:  101

Processor Core Tiles and Interfaces

• Fetch
  – G-tile examines I$ tags
  – Sends dispatch command to I-tiles
  – I-tiles fetch instructions and distribute them to R-tiles and E-tiles

[Figure: tile grid showing the dispatch flow from the G-tile through the I-tiles to the R-tiles and E-tiles]
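A minimal C sketch of this decode (hypothetical helper; the field order follows the slide):

    #include <stdint.h>

    /* Split a 9-bit target identifier into Type[8:7], Y[6:5],
       Slot[4:2], X[1:0]; the 3-bit frame is supplied by the G-tile
       at dispatch, not by the instruction bits. */
    typedef struct { unsigned type, y, slot, x; } RsAddr;

    static RsAddr decode_target_id(uint16_t t9)
    {
        RsAddr a;
        a.type = (t9 >> 7) & 0x3;   /* 10 = OP1 */
        a.y    = (t9 >> 5) & 0x3;
        a.slot = (t9 >> 2) & 0x7;
        a.x    =  t9       & 0x3;
        return a;  /* OP1 target 87 -> Y=10, Slot=101, X=11: E-tile [10,11] */
    }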


Processor Core Tiles and Interfaces (continued)

• Execute
  – R-tiles inject register values
  – E-tiles execute block instructions
    • compute, load/store
  – E-tiles deliver outputs
    • Results to R-tiles/D-tiles
    • Branch to G-tile
• Commit
  – R-tiles/D-tiles accumulate outputs
    • Send completion messages to the G-tile
  – G-tile sends a commit message
  – R-tiles/D-tiles respond with a commit acknowledge
    • R-tiles commit state to the register files
    • D-tiles commit state to the data cache


Block Execution Timeline

[Figure: timeline (cycles 0-40+) showing blocks Bi through Bi+8 mapped to frames 2-7 and 0-1, each passing through FETCH, EXECUTE (variable execution time), and COMMIT; Bi+8 stalls until Bi's frame frees]

• Execute/commit overlapped across multiple blocks
• G-tile manages frames as a circular buffer
  – D-morph: 1 thread, 8 frames
  – T-morph: up to 4 threads, 2 frames each

On-Chip Network

• Flexible/fast on-chip network fabric
  – Programmable translation tables
  – Configurable 1 MB memory system
  – 128-bit wide, 533 MHz
  – 6.8 GB/sec per link
• M-tile (on-chip memory)
  – 64 KB SRAM capacity
  – Configurable tag array (on/off)
• N-tile with translation tables
  – System address ⇒ OCN address
• EBC – external bus controller
  – Connection to the master control processor
• SDC – DDR1 SDRAM controller
  – 1.1 GB/sec per controller
• DMA controller
  – Linear, scatter-gather, chaining

[Figure: OCN mesh of N-tiles and M-tiles connecting GP0, GP1, the DMA controllers, SDCs, EBC, and C2C]

Chip-to-Chip (C2C) Network

• Glueless interchip network
  – Extension of the OCN
  – 2D mesh router (N/S/E/W links)
  – 64 bits wide
  – 266 MHz, 1.7 GB/sec per link

Chip Implementation

• Chip specs
  – 130 nm 7LM IBM ASIC
  – 18.3mm x 18.3mm die
  – 853 signal I/Os
    • 568 for C2C
    • 216 for SDCs
    • ~70 misc I/Os
• Schedule
  – 5/99: Seed of TRIPS ideas
  – 12/01: Publish TRIPS concept architecture
  – 10/02: Publish NUCA concept architecture
  – 6/03: Start high-level prototype design
  – 10/03: Start RTL design
  – 2/05: RTL 90% complete
  – 7/05: Design 100% complete
  – 9/05: Tapeout
  – 12/05: Start system evaluation in lab

[Figure: TRIPS working floorplan (to scale)]


Prototype Simplifications

• Architecture simplifications
  – No hardware mechanisms
  – No floating-point divider
  – No native exception handling
  – Simplified exception reporting
  – Limited support for variable-sized blocks
• Microarchitecture simplifications
  – No I-cache prefetching or block prediction for I-cache refill
  – No variable resource allocation (for smaller blocks)
  – One cycle per hop on every network
  – No clock-gating
  – Direct-mapped and replicated LSQ (non-associative, oversized)
  – No selective re-execution
  – No fast block refetch

TRIPS Prototype System Board

• TRIPS chip network
  – 4 TRIPS chips (8 processors) per board
  – Connected using a 2x2 chip-to-chip network
  – Each chip provides local and remote access to 2GB SDRAM
• PPC440GP
  – Configures and controls the TRIPS chips
  – Connects to a host PC for user interface (via serial port or ethernet)
  – Runs an embedded Linux OS
  – Includes a variety of integrated peripherals
• SDRAM DIMMs
  – 8 slots for up to 8 GB TRIPS memory
  – 2 slots for up to 1 GB PPC memory
• FPGA – usage TBD but may include
  – Protocol translator for high-speed I/O devices
  – Programmable acceleration engine

[Figure: board diagram with 4 TRIPS chips, C2C connectors, SDRAM, a Virtex-II Pro FPGA, flash, a clock generator, the IBM PPC440GP with EBI, and a host PC attached via ethernet/serial]


Maximum Size TRIPS System

[Figure: boards 0-7, each with 4 TRIPS chips, FPGA, flash, PPC440GP, and SDRAM, connected through C2C links and an ethernet switch to a host PC]

TRIPS Prototype System Capabilities

• Clock speeds
  – 533MHz internal clock
  – 266MHz C2C clock
  – 133/266 SDRAM clock
• Single chip
  – 2 processors per chip, each with
    • 64 KB L1-I$, 32KB L1-D$
    • 8.5 Gops or Gflops/processor
  – 17 Gops or Gflops/chip peak
  – Up to 8 threads
  – 1 MB on-chip memory
    • L2 cache or scratchpad
  – 2 GB SDRAM
  – 2.2 GB/sec DRAM bandwidth
• Full system (8 boards)
  – 32 chips, 64 processors
  – 1024 64-bit FPUs
  – 545 Gops/Gflops
  – Up to 256 threads
  – 64GB SDRAM
  – 70 GB/s aggregate DRAM bandwidth
  – 6.8 GB/sec C2C bisection bandwidth


Code Examples

ISCA-32 Tutorial
Robert McDonald
Department of Computer Sciences, The University of Texas at Austin

Two Examples

• Vector Add Example
  – A simple for loop
  – Basic unrolling
  – TRIPS Intermediate Language (TIL)
  – Block dataflow graph
  – Placed instructions
  – TRIPS Assembly Language (TASL)
  – Sample execution
• Linked List Example
  – A while loop with pointers
  – TRIPS predication
  – Predicated loop unrolling


Vector Add – C Code

• Consider this simple C routine
• The routine does an add and accumulate for fixed-size vectors
• An initial control flow graph is shown

    #define VSIZE 1024
    void vadd(double* A, double* B, double* C)
    {
        int i;
        for (i = 0; i < VSIZE; i++) {
            C[i] += (A[i] + B[i]);
        }
    }

[CFG: enter → i = 0 → { C[i] += A[i] + B[i]; i++; i < 1024 } → loop back on T, return on F]

Vector Add – Unrolled

• This loop wants to be unrolled
• Unrolling reduces the overhead per loop iteration
• It reduces the number of conditional branches that must be executed (see the C sketch below)

[CFG: enter → i = 0 → { C[i] += A[i] + B[i]; ... ; C[i+7] += A[i+7] + B[i+7]; i += 8; i < 1024 } → loop back on T, return on F]
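A C rendering of the unrolled loop body sketched in the CFG above (the tutorial itself shows only the diagram):

    /* vadd unrolled by 8, matching the CFG above */
    void vadd_unrolled(double* A, double* B, double* C)
    {
        int i = 0;
        do {
            C[i]   += A[i]   + B[i];
            C[i+1] += A[i+1] + B[i+1];
            C[i+2] += A[i+2] + B[i+2];
            C[i+3] += A[i+3] + B[i+3];
            C[i+4] += A[i+4] + B[i+4];
            C[i+5] += A[i+5] + B[i+5];
            C[i+6] += A[i+6] + B[i+6];
            C[i+7] += A[i+7] + B[i+7];
            i += 8;
        } while (i < 1024);
    }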


Vector Add – TIL Code

• The compiler produces TRIPS Intermediate Language (TIL) files
• Each file defines one or more program blocks
• Each block includes instructions, normally listed in program order
• Instructions are defined using a familiar syntax (name, target, sources)
• Read and write instructions access registers
• All other instructions deal with temporaries
• Notice the load/store IDs
• Notice the predicated branches

    .bbegin vadd$1
    read $t0, $g70
    read $t1, $g71
    read $t2, $g72
    read $t3, $g73
    ld $t4, ($t1) L[0]
    ld $t5, ($t2) L[1]
    ld $t6, ($t3) L[2]
    fadd $t7, $t5, $t6
    fadd $t8, $t4, $t7
    sd ($t1), $t8 S[3]
    ld $t9, 8($t1) L[4]
    ld $t10, 8($t2) L[5]
    ld $t11, 8($t3) L[6]
    fadd $t12, $t10, $t11
    fadd $t13, $t9, $t12
    sd 8($t1), $t13 S[7]
    ld $t14, 16($t1) L[8]
    ld $t15, 16($t2) L[9]
    ld $t16, 16($t3) L[10]
    fadd $t17, $t15, $t16
    fadd $t18, $t14, $t17
    sd 16($t1), $t18 S[11]
    ld $t19, 24($t1) L[12]
    ld $t20, 24($t2) L[13]
    ld $t21, 24($t3) L[14]
    fadd $t22, $t20, $t21
    fadd $t23, $t19, $t22
    sd 24($t1), $t23 S[15]
    ld $t24, 32($t1) L[16]
    ld $t25, 32($t2) L[17]
    ld $t26, 32($t3) L[18]
    fadd $t27, $t25, $t26
    fadd $t28, $t24, $t27
    sd 32($t1), $t28 S[19]
    ld $t29, 40($t1) L[20]
    ld $t30, 40($t2) L[21]
    ld $t31, 40($t3) L[22]
    fadd $t32, $t30, $t31
    fadd $t33, $t29, $t32
    sd 40($t1), $t33 S[23]
    ld $t34, 48($t1) L[24]
    ld $t35, 48($t2) L[25]
    ld $t36, 48($t3) L[26]
    fadd $t37, $t35, $t36
    fadd $t38, $t34, $t37
    sd 48($t1), $t38 S[27]
    ld $t39, 56($t1) L[28]
    ld $t40, 56($t2) L[29]
    ld $t41, 56($t3) L[30]
    fadd $t42, $t40, $t41
    fadd $t43, $t39, $t42
    sd 56($t1), $t43 S[31]
    addi $t45, $t0, 8
    addi $t47, $t1, 64
    addi $t49, $t2, 64
    addi $t51, $t3, 64
    genu $t52, 1024
    tlt $t54, $t45, $t52
    bro_t<$t54> vadd$1
    bro_f<$t54> vadd$2
    write $g70, $t45
    write $g71, $t47
    write $g72, $t49
    write $g73, $t51
    .bend

Vector Add – Block Dataflow Graph

• Read and load instructions gather block inputs
• Write, store, and branch instructions produce block outputs
• Address fanout can present an interesting challenge

[Abbreviated dataflow graph: the reads of C, A, B, and i fan out (2:17, 2:9, 2:9) to the eight ld/fadd/sd chains and to the addi/genu/tlt/write instructions and predicated bro_t/bro_f branches]


Vector Add – Placed Instructions

• The scheduler analyzes each block's dataflow graph
• It inserts fanout instructions, as needed
• It places the instructions within the block (they don't have to be in program order)
• It produces TRIPS Assembly Language (TASL) files that are fed to the assembler
• Instructions are described using a dataflow notation
• An instruction fires when its operands become available (regardless of position)

[Figure: placement of the block's ld/mov/fadd/sd/addi/genu/tlt/bro instructions across E-tiles ET0-ET15]

Vector Add – TASL Code

Excerpt (register reads, one ld/fadd/sd chain, and the block outputs; the full placed listing is abbreviated here):

    .bbegin vadd$1
    R[2] read G[70] N[14,0]
    R[3] read G[71] N[43,0] N[3,0]
    R[0] read G[72] N[48,0] N[12,0]
    R[1] read G[73] N[18,0] N[13,0]
    N[32] ld L[0] N[74,0]
    N[66] ld L[1] N[70,0]
    N[69] ld L[2] N[70,1]
    N[70] fadd N[74,1]
    N[74] fadd N[78,1]
    N[78] sd S[3]
    ...
    N[14] addi 8 N[22,0]
    N[22] mov N[79,0] W[2]
    N[43] addi 64 W[3]
    N[48] addi 64 W[0]
    N[18] addi 64 W[1]
    N[39] genu 1024 N[79,1]
    N[79] tlt N[83,p] N[111,p]
    N[83] bro_t vadd$1
    N[111] bro_f vadd$2
    W[2] write G[70]
    W[3] write G[71]
    W[0] write G[72]
    W[1] write G[73]
    .bend


Vector Add – Block-Level Execution

[Timeline (cycles 0-170): blocks i through i+8 are fetched in staggered fashion; each passes through fetch, execute, and commit; block i+8 stalls until a frame frees]

• The TRIPS processor can execute up to 8 blocks concurrently
• This diagram shows block latency and throughput for an optimal vector add loop
• It achieves ~7.4 IPC, ~3.2 loads & stores per cycle

Linked List – C Code

• For a second example, let's examine something more complex
• We'll be focusing upon the code in red (the while-loop search, highlighted in the original slides)
• A linked list operation is performed using a while loop and a search pointer
• The number of while loop iterations can vary

    // type info
    struct foo
    {
        long tag;
        long value;
        struct foo * next;
    };
    struct foo * head;

    void list_add( long key )
    {
        struct foo * p = head;
        // search list and increment node
        while (p)
        {
            if (p->tag == key)
            {
                p->value++;
                break;
            }
            p = p->next;
        }
        // ...
    }


Linked List – CFG

• The while loop's control flow graph is shown here
• It consists of four rather small basic blocks
• It's hard to get anywhere fast with all of those branches
• We'd like to use predication to form a larger program block (or hyperblock)

[CFG: BB0 tests p == 0 (T exits the loop); BB1 tests p->tag == key; on T, BB2 does p->value++; on F, BB3 does p = p->next with a back edge to BB0]

Linked List – Hyperblock #1

• Here is a TIL representation of the hyperblock
• Predicate p0 represents the test to determine whether p is null
• Predicate p1 represents the test to determine whether the tag matches the key
• The test instruction that produces p1 is itself predicated upon p0 being false
• That means that there are only three possible predicate outcomes:
  – p0 true, p1 unresolved
  – p0 false, p1 false
  – p0 false, p1 true
• This example demonstrates "predicate-oring", which allows multiple exclusive predicates to be OR'd at the target instruction
• This example demonstrates write and store nullification, which allows block outputs to be cancelled
• Loads are allowed to execute speculatively because exceptions will be predicated downstream

    .bbegin block1
    read $t0, $g70            ; key
    read $t1, $g71            ; p
    ; BB0
    teqi $p0, $t1, 0
    ; BB1
    ld $t3, ($t1)             ; p->tag
    teq_f <$p0> $p1, $t3, $t0
    ld $t4, 16($t1)           ; p->next
    ; BB2
    ld $t10, 8($t1)           ; p->data
    null_t <$p0> $t11
    addi_t <$p1> $t11, $t10, 1
    null_f <$p1> $t11
    sd 8($t1), $t11           ; p->data
    ; BB3
    null_t <$p0,$p1> $t99
    mov_f <$p1> $t99, $t4
    write $g71, $t99          ; p
    ; branches
    bro_t <$p0,$p1> block2
    bro_f <$p1> block1
    .bend


Linked List – Predicated Loop Unrolling

• With predication, it's also possible (and helpful) to unroll this loop
• Many variations are possible
• This diagram shows just two iterations within the hyperblock
• The number of block exits and block outputs remains the same as before
• The compiler can use static heuristics or profile data to choose an unrolling factor
• Can sometimes use instruction merging to combine multiple statements into one

[CFG: two unrolled iterations inside one hyperblock: p == 0 test, p->tag == key test, p = p->next, then a second p == 0 test, second tag test, p->value++, and p = p->next]

Linked List – Hyperblock #2

• Here is some TIL for the unrolled while loop
• The changes from the first version are highlighted in the original slides
• Notice the use of predicate-ORing, which allows the block to finish executing early in some cases
• There are now four predicates and five possible outcomes:

    p0 p1 p2 p3
    T  -  -  -
    F  T  -  -
    F  F  T  -
    F  F  F  T
    F  F  F  F

    .bbegin block1
    read $t0, $g70              ; key
    read $t1, $g71              ; p
    teqi $p0, $t1, 0
    ld $t3, ($t1)               ; p->tag
    teq_f <$p0> $p1, $t3, $t0
    ld $t4, 16($t1)             ; p->next
    teqi_f <$p1> $p2, $t4, 0
    ld $t5, ($t4)               ; p->next->tag
    teq_f <$p2> $p3, $t5, $t0
    ld $t6, 16($t4)             ; p->next->next
    ; conditional p update
    null_t <$p0,$p1> $t99
    mov_t <$p2,$p3> $t99, $t4
    mov_f <$p3> $t99, $t6
    write $g71, $t99            ; p
    ; conditional p->data update
    ld_t <$p1> $t10, 8($t1)
    ld_f <$p1> $t10, 8($t4)
    null_t <$p0,$p2> $t11
    addi_t <$p1,$p3> $t11, $t10, 1
    null_f <$p3> $t11
    sd 8($t1), $t11             ; p->data
    ; branches
    bro_t <$p0,$p1,$p2,$p3> block2
    bro_f <$p3> block1
    .bend


Conclusion

• The EDGE ISA allows instructions to be both fetched and executed out-of-order with minimal book-keeping
• It's possible to use loop unrolling and predication to form large blocks
• The TRIPS predication model allows control flow to be converted into an efficient dataflow representation
• Operand fanout is more explicit in an EDGE ISA and costs extra instructions

Global Control

ISCA-32 Tutorial
Ramadass Nagarajan
Department of Computer Sciences, The University of Texas at Austin


G-Tile Location and Connections

[Figure: the G-tile connects to the EBC (HALT_GP out, GP_HALTED in), to the adjacent I-tile (GDN, GRN, OCN), to the R-tiles (GDN, GCN, GSN), and to the D-tiles (GCN, GSN, ESN, OPN)]

Major G-Tile Responsibilities

• Orchestrates global control for the core
  – Maps blocks to execution resources
• Instruction fetch control
  – Determines I-cache hits/misses, ITLB hits/misses
  – Initiates I-cache refills
  – Initiates block fetches
• Next-block prediction
  – Generates speculative block addresses
• Block completion detection and commit initiation
• Flush detection and initiation
  – Branch mispredictions, ld/st dependence violations, exceptions
• Exception reporting


Networks Used by G-Tile

• OPN – Operand Network
  – Receive and send branch addresses from/to E-tiles
• OCN – On-Chip Network
  – Receive block headers from the adjacent I-tile
• GDN – Global Dispatch Network
  – Send inst. fetch commands to I-tiles and read/write instructions to R-tiles
• GCN – Global Control Network
  – Send commit and flush commands to D-tiles, R-tiles, and E-tiles
• GRN – Global Refill Network
  – Send I-cache refill commands to I-tile slave cache banks
• GSN – Global Status Network
  – Receive refill, store, and register-write completion status and commit acknowledgements from I-tiles, D-tiles, and R-tiles
• ESN – External Store Network
  – Receive external store pending and error status from D-tiles

Global Dispatch Network (GDN)

[Figure: GDN paths from the G-tile through the I-tiles to the R-tiles and E-tiles]

Interface ports:
• CMD: None/Fetch/Dispatch
• ADDR: Address of block
• SMASK: Store mask of block
• FLAGS: Block execution flags
• ID: Frame identifier
• INST: Instruction bits

• Delivers instruction fetch commands to I-tiles
• Dispatches instructions from I-tiles to R-tiles/E-tiles


Global Refill Network (GRN)

[Figure: GRN path from the G-tile down the I-tile column]

Interface ports:
• CMD: None/Refill
• RID: Refill buffer identifier
• ADDR: Address of block to refill

• Delivers refill commands to I-tiles

Global Control Network (GCN)

[Figure: GCN paths from the G-tile to the R-tiles, D-tiles, and E-tiles]

Interface ports:
• CMD: None/Commit/Flush
• MASK: Mask of frames for commit/flush

• Delivers commit and flush commands to R-tiles, D-tiles, and E-tiles


Global Status Network (GSN)

[Figure: GSN paths from the I-tiles, R-tiles, and D-tiles back to the G-tile]

Interface ports:
• CMD: None/Completed/Committed
• ID: Frame/refill buffer identifier
• STATUS: Exception status

• I-tiles: completion status for pending refills
• R-tiles: completion of writes, acknowledgement of register commits
• D-tiles: completion status for stores, acknowledgement of store commits

External Store Network (ESN)

[Figure: ESN path from the D-tiles to the G-tile]

Interface ports:
• PENDING: Pending status of external store requests and spills in each thread
• ERROR: Errors encountered for external store requests and spills in each thread

• Delivers external store pending and error status from D-tiles


Overview of Block Execution

[State diagram: the different states in the execution of a block: Ready in I-cache → Ready for fetch → Start fetch → Completed fetch (dispatch) → Execute → Commit → Committing → Deallocated, with a Flush → Flushed path for mispredicted blocks; transitions are driven by fetch completions, register/store completions, and commit acknowledgements]

Major G-Tile Structures

[Figure: refill unit (I-cache MSHRs; GDN/GRN), fetch unit (ITLB, I-cache directory, control registers; GDN, OCN), exit predictor (OPN), and retire unit (commit/flush control; GCN, GSN, ESN)]


Frame Management

• Frames are allocated from a circular queue
• The non-speculative block is the oldest; younger blocks are speculative (control speculation)
• After a new fetch, the newly allocated frame becomes the youngest

[Figure: 8-entry frame queue (frames 0-7) with oldest and youngest pointers; the youngest pointer advances after a new fetch]
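A minimal C sketch of the circular frame queue (hypothetical; the real G-tile tracks much more per-frame state):

    #define NUM_FRAMES 8

    typedef struct {
        unsigned oldest;   /* frame of the non-speculative block  */
        unsigned count;    /* blocks in flight (rest speculative) */
    } FrameQueue;

    /* A fetch allocates the next frame; it becomes the youngest. */
    static int alloc_frame(FrameQueue *q)
    {
        if (q->count == NUM_FRAMES)
            return -1;   /* no free frame: fetch stalls */
        return (int)((q->oldest + q->count++) % NUM_FRAMES);
    }

    /* A commit deallocates the oldest frame. */
    static void commit_frame(FrameQueue *q)
    {
        q->oldest = (q->oldest + 1) % NUM_FRAMES;
        q->count--;
    }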


Frame Management (continued)

• After a commit, the oldest frame is deallocated and the oldest pointer advances

[Figure: the same frame queue after a commit]

Fetch Unit

• Instruction TLB
  – 16 entries
  – 64KB - 1TB supported page sizes
• I-cache directory
  – 128 entries; each entry corresponds to one cached block
  – Entry fields:
    • Physical tag for the block
    • Store mask for the block
    • Execution flags for the block
  – 2-way set associative
  – LRU replacement
  – Virtually indexed, physically tagged


Fetch Pipeline

[Pipeline: block A: predict stages 0-2 (cycles 0-2); address select, TLB lookup, and I-cache tag lookup (cycle 3); hit/miss detection and frame allocation (cycle 4); fetch slots 0-7 (cycles 5-12). Block B: predict stages 0-2 (cycles 3-5), then stalls at address select until block A's fetch slots drain before proceeding through the same control and fetch cycles. Legend: predict cycle / control cycle / fetch cycle]

Refill Pipeline

[Pipeline: predict stages 0-2, address select, TLB and I-cache lookup, cache-miss detection, then refill cycles until the refill completes at some cycle X, after which the stalled fetch proceeds through its slots]

• Refills stall fetches for the same thread
• Support for up to four outstanding refills – one per thread


Per-Frame State

Frame status:
• V: Valid
• O: Oldest
• Y: Youngest
• M: Branch misprediction detected

Block execution status:
• RC: Registers completed
• SC: Stores completed
• BC: Branch completed
• E: Exception reported
• L: Load violation reported

Block commit status:
• CS: Commit sent
• RCT: Registers committed
• SCT: Stores committed

Branch prediction status:
• BADDR: Block address
• NADDR: Next block address
• NSTATE: Next block address state (predicted or resolved)
• PC: Prediction completed
• RC: Repair completed
• UC: Update completed

Commit ⇐ V & O & RC & SC & BC & ~E & ~L & ~CS
Dealloc ⇐ V & O & CS & RCT & SCT & UC

Block Completion Detection

• A block has completed execution if:
  – All register writes have been received
  – All stores have been received
  – The branch target has been received
• Register completion is tracked locally at each R-tile
  – Expected writes at each R-tile are known statically
  – Completion status is forwarded from the farthest R-tile to the G-tile on the GSN
• Store completion is reported by D-tile 0
  – All D-tiles communicate with each other to determine completion
  – Completion status is delivered on the GSN by D-tile 0
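The commit and deallocation conditions render directly into C (a sketch over a hypothetical per-frame status struct):

    #include <stdbool.h>

    typedef struct {
        bool v, o;              /* valid, oldest                      */
        bool rc, sc, bc;        /* registers/stores/branch completed  */
        bool e, l;              /* exception, load violation reported */
        bool cs, rct, sct, uc;  /* commit sent, registers/stores
                                   committed, predictor update done   */
    } FrameState;

    static bool can_commit(const FrameState *f)
    {
        return f->v && f->o && f->rc && f->sc && f->bc &&
               !f->e && !f->l && !f->cs;
    }

    static bool can_dealloc(const FrameState *f)
    {
        return f->v && f->o && f->cs && f->rct && f->sct && f->uc;
    }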


Commit Pipeline

[Pipeline: completion messages received (GSN/OPN) → commit detect (cycle 0) → commit send on the GCN and predictor update (cycles 1-2) → ... → commit acks received on the GSN (cycle X) → frame deallocation (cycle X+1)]

• Block is ready for commit if:
  – It is the oldest in its thread
  – It has completed with no exceptions
  – No load/store dependence violations have been detected
• Commit operation
  – G-tile sends a commit message for the block on the GCN
  – R-tiles and D-tiles perform local commit operations
  – R-tiles and D-tiles acknowledge the commit using the GSN
  – G-tile updates predictor history for the block

T-Morph Support

• Up to four threads supported in T-morph mode
  – Threads mapped to fixed sets of frames
• Per-thread architected registers in the G-tile
  – Thread Control and Status Registers (TCR, TSR)
  – Program counter (PC)
• Other T-morph support state/logic
  – Per-thread frame management
  – Round-robin thread prioritization for block fetch/commit/flush

[Figure: frame allocation partitioned among threads 0-3, each with its own TCR, TSR, PC, and registers GR0-GR127]


Exception Reporting

• Exceptions reported to the G-tile
  – GSN: refill exceptions; execute exceptions delivered to stores and register outputs
  – OPN: execute exceptions delivered to branch targets
  – ESN: errors in external stores
• Exceptions detected at the G-tile: breakpoint, timeout, system call, reset, fetch, interrupt
• Exceptions detected at other tiles: execute, store, fetch
• G-tile halts the processor whenever an exception occurs
  – Speculative blocks in all threads are flushed
  – Oldest blocks in all threads are completed and committed
  – All outstanding refills, stores, and spills are completed
  – Signals the EBC to cause an external interrupt
• Exception type reported in the PSR and TSRs
  – Multiple exceptions may be reported

Control Flow Prediction

ISCA-32 Tutorial
Nitya Ranganathan
Department of Computer Sciences, The University of Texas at Austin


Predictor Responsibilities

• Predict the next block address for fetch
  – Predict the block exit and exit branch type (branches are not seen by the predictor)
  – Predict the target of the exit based on the predicted branch type
  – Predictions are amortized over each large block
    • Maximum necessary prediction rate is one every 8 cycles
• Speculative update for most-recent histories
  – Keep safe copies around to recover from mispredictions
• Repair predictor state upon a misprediction
• Update the predictor upon block commit
  – Update exits, targets, and branch types with retired block state

Example Hyperblock from Linked List Code

• Each hyperblock may have several predicated exit branches
  – Exit e0 ⇒ block H2
  – Exit e1 ⇒ block H1
  – Only one branch will fire
• Predictor uses exit IDs to make predictions
  – 8 exit IDs supported
  – e0 ⇒ "000"
  – e1 ⇒ "001"
• Predictor predicts the ID of the exit that will fire
  – Implicitly predicts several predicates in a path
• Exit history formed by concatenating the sequence of exit IDs

[Figure: hyperblock H1 (p == 0 test, p->tag == key test, p = p->next) with bro H2 at exit e0 and bro H1 at exit e1, followed by hyperblock H2]


High-level Predictor Design

• Index the exit predictor with the current block address
• Use exit history to predict 1 of 8 exits
• Predict the branch type (call/return/branch) for the chosen exit
• Index multiple target predictors with the block address and predicted exit
• Choose the target address based on the branch type

[Block diagram: block address → exit predictor → predicted exit → target predictor → predicted next-block address]

Predictor Components

Exit predictor (37 Kbits):
• 2-level local predictor (9 K)
• 2-level global predictor (16 K)
• 2-level choice predictor (6 K)

Target predictor (45 Kbits):
• Sequential predictor
• Branch target buffer (20 K)
• Call target buffer (12 K)
• Branch type predictor (12 K)
• Return address stack (7 K)


Prediction using Exit History (linked list example)

• Dynamic block sequence: H1 H1 H1 H2 H1 H1 H1 H2
• Local exit history for block H1: e1 e1 e1 e0 e1 e1 e1 e0
• Local exit history encoding (2 bits per exit): 01 01 01 00 01 01 01 00
  – Use only 2 bits from each 3-bit exit ID to permit longer histories
• A history table entry holds 5 exit IDs
• The prediction table contains exit IDs and hysteresis bits

[Table: the exit history register indexes the prediction table; the indexed entry supplies the predicted exit ID ("001" or "000"), the history register is updated with each predicted exit, and the entry's hysteresis is updated with the actual exit]

Prediction Timing

Cycle 1:
• Local: read local history
• Global: read global exit predictor
• Choice: read choice predictor
• Btype: read branch types for all exits
• BTB: read branch targets for all exits
• CTB: read call and return targets for all exits
• RAS: read return address from the stack

Cycle 2:
• Local: read local exit predictor
• Global/Choice/Btype/BTB/CTB/RAS: reads continue

Cycle 3:
• Choose the final exit
• Get the chosen exit's final type and target
• Push the return address (for calls)
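A minimal C sketch of the 2-bit exit-history encoding described above (hypothetical helper; five 2-bit exit IDs fit in a 10-bit history register):

    #include <stdint.h>

    /* Shift a new exit ID into the history register, keeping only the
       low 2 bits of each 3-bit exit ID. */
    static uint16_t update_exit_history(uint16_t hist, uint8_t exit_id)
    {
        return (uint16_t)(((hist << 2) | (exit_id & 0x3u)) & 0x3FFu);
    }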


More Predictor Features

• Speculative state buffering
  – Latest global histories kept in the global and choice history registers
  – Old snapshots of histories stored in a history file for repair
  – Latest local history maintained in a future file
    • The local exit history table holds committed block history
  – Return address stack state buffering
    • Save the top-of-stack pointer and value before every prediction
    • Repair the stack on address and branch-type mispredictions
• Call-return
  – Call/return instructions enforce proper control flow
  – The predictor learns to predict return addresses using a call-return link stack
  – The update stores the return address in the CTB entry corresponding to the call

Learning Call-Return Relationships

The Call Target Buffer (CTB) holds a call target address and a return address per entry; the Return Address Stack (RAS) consists of an address stack and a link stack.

When a call commits:
• Update the call target address in the CTB
• Push the CTB index onto the link stack

When a return commits:
• Pop the CTB index from the link stack
• Update the return address in that CTB entry

During prediction:
• Call ⇒ push the return address from the CTB onto the address stack
• Return ⇒ pop an address from the address stack
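A minimal C sketch of the CTB/link-stack protocol above (hypothetical sizes and names; overflow handling is simplified):

    #include <stdint.h>

    #define CTB_SIZE  64
    #define RAS_DEPTH 16

    typedef struct { uint64_t call_target, return_addr; } CtbEntry;

    static CtbEntry ctb[CTB_SIZE];
    static uint64_t addr_stack[RAS_DEPTH];  /* predicted return addrs */
    static unsigned link_stack[RAS_DEPTH];  /* CTB indices            */
    static unsigned addr_tos, link_tos;

    /* Commit path: learn the call-return relationship. */
    static void commit_call(unsigned ctb_idx, uint64_t target)
    {
        ctb[ctb_idx].call_target = target;
        if (link_tos < RAS_DEPTH)
            link_stack[link_tos++] = ctb_idx;
    }

    static void commit_return(uint64_t ret_addr)
    {
        if (link_tos > 0)
            ctb[link_stack[--link_tos]].return_addr = ret_addr;
    }

    /* Predict path: use the learned return address. */
    static void predict_call(unsigned ctb_idx)
    {
        if (addr_tos < RAS_DEPTH)
            addr_stack[addr_tos++] = ctb[ctb_idx].return_addr;
    }

    static uint64_t predict_return(void)
    {
        return addr_tos ? addr_stack[--addr_tos] : 0;
    }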


Prediction and Speculative Update Timing

Same three-cycle flow as before, with speculative updates added in cycle 3:
• Local: update the local future file history
• Global: back up the global history, then update the speculative global history
• Choice: back up the choice history, then update the speculative choice history
• RAS: back up the top of stack, then push the return address (for calls)

Optimizations for Better Prediction

• Exit prediction is more difficult than branch prediction
  – 1 of 8 vs. 1 of 2
• Fewer exits in a block ⇒ better predictability
  – Predicating control flow merges
  – Exit merging

[Figure: a CFG (B1-B8) whose control flow merges (at B8) are predicated so that hyperblocks H1/H2/H3 contain fewer exits, and duplicate exits (two bro h2 branches) are merged into one]


Instruction Fetch

ISCA-32 Tutorial
Haiming Liu
Department of Computer Sciences, The University of Texas at Austin

I-Tile Location and Connections

[Figure: the I-tile connects to the G-tile (GDN, GRN, GSN), to the G-tile/N-tile and D-tile via the OCN, and to the next I-tile in the column (GDN, GRN, GSN)]


I-Tile Major Responsibilities

• Instruction fetch
  – Cache instructions for one row of R-tiles/E-tiles
  – Dispatch instructions to R-tiles/E-tiles through the GDN
• Instruction refill
  – Receive refill commands from the G-tile on the GRN
  – Submit memory read requests to the N-tile on the OCN
  – Keep track of the completion status of each ongoing refill
  – Report completion status to the G-tile on the GSN when a refill completes
• OCN access control
  – Share the N-tile connection between the I-tile and D-tile/G-tile
  – Forward OCN transactions between GT/NT or DT/NT

Major I-Tile Structures

[Figure: refill buffer (128 bits x 32; 1 read port, 1 write port), I-$ array (128 bits x 1024 SRAM; 1 R/W port), control logic, an output mux, and OCN control, connected to the GRN, GDN, GSN, and OCN]


Instruction Layout in a Maximal Block

• Header chunk (128 bytes): word i holds header bits Hi plus read instruction i and write instruction i, for i = 0-31
• Instruction chunks 0-3 (128 bytes each): each holds 32 instructions
• Within a chunk, each 128-bit SRAM row holds four 4-byte instructions spread across E-tile columns 0-3 (e.g., one row of chunk 3 holds insts. 96-99, its last row insts. 124-127)

Instruction Fetch Pipeline

[Pipeline: cycle 0: address select from the G-tile; cycle 1: ITLB lookup and I-$ tag lookup; cycle 2: hit/miss detection and frame allocation; cycles 3-10: the header chunk (R/W slots 0-7) and instruction chunks stream out one fetch slot per cycle, overlapped with GDN dispatch from IT0, IT1, and so on down the column]


Instruction Dispatch Timing Diagram

[Timing: a fetch issued in cycle x arrives at IT0, which forwards the fetch command down the I-tile column one cycle per hop (cycles x+1 through x+4). Each I-tile dispatches instruction 0 across its row one tile per cycle: GT and RT0-RT3 receive instruction 0 in cycles x+2 through x+6, DT0 and ET0-ET3 in cycles x+3 through x+7, and DT3 and ET12-ET15 in cycles x+6 through x+10. Instruction 1 follows two cycles behind instruction 0 at every tile]


Instruction Dispatch Timing Diagram Instruction Refill

• Receive refill id and block physical address from GRN input IT0 GT RT0 RT1 RT2 RT3 Receives Receives Receives Receives • Request 128 bytes of inst. from inst 7 inst 7 inst 7 inst 7 – Send out 2 OCN read requests cycle x+10 cycle x+11 cycle x+12 cycle x+13 • Buffer received inst. in refill buffer • Report refill completion on GSN output when IT0 DT0 ET0 ET1 ET2 ET3 Receives Receives Receives Receives – 128 bytes of inst. have been received inst 7 inst 7 inst 7 inst 7 – Completion has been received on GSN input cycle x+11 cycle x+12 cycle x+13 cycle x+14 • Write inst. into SRAM when dispatched from refill buffer

(Figure: I-tile refill datapath. GRN, GDN, and GSN interfaces feed control logic and a refill buffer (128 bits x 32, one read port, one write port) in front of the I-$ array (128 bits x 1024 SRAM, one R/W port), with OCN control handling the outgoing read requests.)
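As a rough sketch of the completion rule above (illustrative C, not the RTL; struct and names are ours):

    #include <stdbool.h>
    #include <stdint.h>

    /* Hypothetical model of one I-tile refill in flight. */
    typedef struct {
        uint32_t bytes_received;   /* instruction bytes buffered so far  */
        bool     gsn_complete_in;  /* completion seen on the GSN input   */
        bool     completion_sent;  /* already reported on the GSN output */
    } refill_state_t;

    /* Report completion on the GSN output only when both conditions
     * from the slide hold: all 128 bytes are buffered and the upstream
     * completion has been received on the GSN input. */
    static bool refill_may_report(const refill_state_t *r)
    {
        return !r->completion_sent
            && r->bytes_received == 128
            && r->gsn_complete_in;
    }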


R-Tile and OPN Location and Connections Register Accesses and Operand Routing

(Figure: R-tile location and connections. Each R-tile links to the G-tile and its neighbor R-tiles over the GDN, GCN, and GSN, and its OPN router connects it to adjacent R-tile and E-tile OPN routers.)

ISCA-32 Tutorial

Karu Sankaralingam

Department of Computer Sciences
The University of Texas at Austin


Major R-Tile Responsibilities Major Structures in the R-tile

• Maintain architecture state, commit register writes
 – 512 registers total
 – 128 registers per thread, interleaved across 4 tiles
 – 32 registers/thread * 4 threads = 128 registers per tile
• Process read/write instructions for each block
 – 64-entry Read Queue (RQ): 8 blocks, 8 reads in each
 – 64-entry Write Queue (WQ): 8 blocks, 8 writes in each
• Forward inter-block register values
• Detect per-block and per-tile register write completion
• Detect exceptions that flow to register writes

(Figure: Major structures in the R-tile. An OPN router (W/E/S ports), decode logic, the Read Queue, the Write Queue, commit logic, and the committed register file, connected via the GSN, GCN, and GDN.)
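A minimal sketch of the interleaving implied above (32 registers per thread per tile, and, per the ISA appendix, bank i holding GRs i, i+4, i+8, ...); function names are ours:

    /* Illustrative mapping of an architectural register to its R-tile
     * (bank) and the entry within that bank; not the hardware's names. */
    static inline unsigned reg_bank(unsigned reg)
    {
        return reg % 4;    /* R-tile / bank index, 0..3 */
    }

    static inline unsigned reg_slot(unsigned reg)
    {
        return reg / 4;    /* entry within the bank, 0..31 */
    }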


Pipeline diagrams: Register Read and Write Register Forwarding: RQ and WQ contents

(Figure: Register read and write pipelines. Read: dispatch from the GDN with a WQ inject search (cycle 0), read wakeup, read issue, and delivery on the OPN. Write: write forward/issue with an OPN update and RQ inject search, Write Queue wakeup on GCN select at block commit, then write commit into the architectural register file.)

(Figure: RQ and WQ contents for register forwarding, 8 entries per block. Read Queue entry fields, with widths in bits: Valid (1), Status (3), Reg # (5), Target 1 (8), Target 2 (8), WQ wait (1), WQ pointer (6). Write Queue entry fields: Valid (1), Status (8), Reg # (5), Value (64).)


(Figure: Register forwarding example with two blocks. B0 is: .bbegin B0; R[0] read G[4] -> N[0]; N[0] addi 45 -> W[0]; W[0] write G[16]; .bend. B1 is: .bbegin B1; R[0] read G[16] -> N[0]; R[4] read G[4] -> N[1], N[3]; N[0] addi 45 -> N[2,0]; N[1] addi 90 -> N[2,1]; N[2] add -> N[3,1]; N[3] add -> W[0]; W[0] write G[8]; .bend. At dispatch, B0's R[0] searches the WQ, misses, and reads G[4] from the committed register file. B1's R[4] likewise reads G[4] from the committed register file, but B1's R[0] hits B0's pending W[0] in the WQ and waits. When B0's W[0] receives the value 45, it wakes R[0] in B1 and forwards 45 to B1's N[0]; B1's N[0]-N[3] then execute and its W[0] receives its value at the WQ.)

(Figure: Block completion at the commit unit. Each of RT0-RT3 tracks a per-block write-completion bit vector; bits fill in cycle by cycle as writes arrive until all writes are received, the right-neighbor completion arrives, and the completion signal is sent on the GSN.)


Major OPN Responsibilities and Attributes OPN Packet Format

• Wormhole-routed network connecting 25 tiles
• Fixed packet length, 2 flits: one control packet followed by one data packet in the next cycle
• Dedicated wires for control/data
• Control packet
 – Contains routing information
 – Enables early wakeup and fast bypass in the ET
• 4-entry FIFO flit buffers in each direction
• On-off flow control: 1-bit hold signal
• Dimension-order routing (Y-X)

OPN packet format (field widths in bits):
 – Control packet: valid (1), type (4), frame id (3), dst. node (6), dst. index (5), src. node (6), src. index (5); node fields are XXX.YYY tile coordinates
 – Control packet types: 0 generic transfer or reply, 1 load, 2 store, 4 PC read, 5 PC write, 14 special register read, 15 special register write
 – Data packet: valid (1), type (2: normal, null, exception), value (64), address (40), operation (3: lb/lh/lw/ld, sb/sh/sw/sd)
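The field widths above can be summarized in an illustrative C layout (a real flit is a raw bit vector; the structs are only a reading aid):

    #include <stdint.h>

    /* Control flit: 30 bits of payload, per the slide's field widths. */
    typedef struct {
        unsigned valid     : 1;
        unsigned type      : 4;   /* 0 generic, 1 load, 2 store, 4 PC read, ... */
        unsigned frame_id  : 3;
        unsigned dst_node  : 6;   /* XXX.YYY tile coordinate */
        unsigned dst_index : 5;
        unsigned src_node  : 6;
        unsigned src_index : 5;
    } opn_ctrl_pkt_t;

    /* Data flit; wide fields kept as plain integers for portability. */
    typedef struct {
        uint8_t  valid;        /* 1 bit                              */
        uint8_t  type;         /* 2 bits: normal, null, exception    */
        uint64_t value;        /* 64-bit operand                     */
        uint64_t address;      /* low 40 bits used                   */
        uint8_t  operation;    /* 3 bits: lb/lh/lw/ld, sb/sh/sw/sd   */
    } opn_data_pkt_t;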


OPN Router Schematic Operand Network Properties

• Directly integrated into the processor pipeline
 – Hold signal stalls tile pipelines (back-pressure flow control)
 – FIFOs flushed with the processor flush wave
• Deadlock avoidance
 – On-off flow control
 – Dimension-order routing
 – Guaranteed buffer location for every packet
• Tailored for operands
 – [Taylor et al. 2003, Sankaralingam et al. 2003]

(Figure: OPN router schematic. Five 140-bit input ports (North, South, East, West, Local), each output driven by an arbiter and a 4-1 mux.)
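A minimal sketch of the Y-X dimension-order decision (port names and the Y-axis orientation are assumptions; only the route-Y-then-X order comes from the slide):

    /* One routing step: travel in Y until the row matches, then in X. */
    typedef enum { PORT_LOCAL, PORT_NORTH, PORT_SOUTH, PORT_EAST, PORT_WEST } port_t;

    static port_t opn_route(int cur_x, int cur_y, int dst_x, int dst_y)
    {
        if (dst_y < cur_y) return PORT_NORTH;   /* assumes y grows southward */
        if (dst_y > cur_y) return PORT_SOUTH;
        if (dst_x < cur_x) return PORT_WEST;
        if (dst_x > cur_x) return PORT_EAST;
        return PORT_LOCAL;                      /* arrived: eject to the tile */
    }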



E-Tile Location and Connections Issue Logic and Execution

(Figure: E-tile connections. Each E-tile links to neighboring E/R-tiles and E/D-tiles over the OPN, and receives GDN and GCN traffic through a D-tile or a neighboring E-tile.)

ISCA-32 Tutorial

Premkishore Shivakumar

Department of Computer Sciences
The University of Texas at Austin


Networks Used by E-Tile Major E-Tile Responsibilities

Networks used by the E-tile:
• OPN - Operand Network
 – Transmit register values from/to R-tiles
 – Transmit operands from/to other E-tiles
 – Transmit load/store packets from/to D-tiles
• GDN - Global Dispatch Network
 – Receive instructions from the I-tile, and dispatch commands from the G-tile, through a D-tile or a neighboring E-tile
• GCN - Global Control Network
 – Receive commit and flush commands originating at the G-tile, from a D-tile or a neighboring E-tile

Major E-tile responsibilities:
• Buffer instructions and operands
 – Local reservation stations for instruction and operand management
• Instruction wakeup and select
• Operand bypass
 – Local operand bypass to enable back-to-back execution of local dependent instructions
 – Remote operand bypass with one extra cycle of OPN forwarding latency per hop
• Instruction target delivery
 – Instructions with multiple local and/or remote targets
 – Successive local or remote targets in consecutive cycles
 – Concurrently deliver a remote and a local target
• Predication support
 – Predicate-based conditional execution
 – Hardware predicate OR-ing
• Exception and NULL
 – Exception generation: divide-by-zero exception, NULL instruction
 – Exception and null tokens propagated in dataflow manner to register writes and stores
• Flush and commit
 – Clear block state (pipeline registers, status buffer, etc.)


E-Tile High Level Block Diagram Functional Unit Latencies in the E-Tile

(Figure: E-tile high-level block diagram. The GDN feeds an instruction buffer of 64 entries (8 frames, 8 slots per frame; fields V, INST, OPA, OPB, PR, FUTYPE); an operand buffer holds OP1/OP2; a status buffer tracks instruction state; and the execution units (INT ALU, INT MULT, INT DIV, FP ADD/SUB, FP MULT, FP CMP, FP CVT) sit between the OPN routers.)

Functional unit latencies in the E-tile:

 Functional Unit  | Latency          | Implementation
 Integer ALU      | 1                | Synopsys IP
 Integer Multiplier | 3 (pipelined)  | Synopsys IP
 Integer Divider  | 24 (iterative)   | Synopsys IP
 FP Add / Sub     | 4 (pipelined)    | Synopsys IP
 FP Multiplier    | 4 (pipelined)    | Synopsys IP
 FP Comparator    | 2 (pipelined)    | Synopsys IP
 FP DtoI Convert  | 2 (pipelined)    | Synopsys IP
 FP ItoD Convert  | 3 (pipelined)    | Synopsys IP
 FP StoD Convert  | 2 (pipelined)    | UT TRIPS IP
 FP DtoS Convert  | 3 (pipelined)    | UT TRIPS IP


Issue Prioritization Among Ready Instructions Issue Prioritization Among Ready Instructions

(Figure: Reservation stations at one E-tile. D-morph: frames hold blocks B0 (non-speculative) through B7 (speculative), with older-block-first priority, e.g. B1 before B2. T-morph: frames are partitioned among threads T0-T3, with smaller-instruction-slot-first within a block, e.g. B3.I0 before B3.I2.)

• Instructions can execute out of order
• Issue prioritization among ready instructions
• The scheduler assigns slots for block instructions
• Slot-based instruction priority within a block

Three-level select priority:
 – Round-robin priority across threads
 – Older-block-first priority within a thread
 – Smaller-instruction-slot-first within a block
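The two lower priority levels can be written as a comparator over ready instructions of one thread; this is an illustrative C sketch (types and the block-age encoding are ours, not the hardware's):

    /* Returns nonzero if candidate a should issue before candidate b.
     * block_age counts blocks since dispatch, so larger means older. */
    typedef struct {
        unsigned block_age;  /* larger = dispatched earlier */
        unsigned slot;       /* 0..127 within the block     */
    } ready_inst_t;

    static int issues_first(const ready_inst_t *a, const ready_inst_t *b)
    {
        if (a->block_age != b->block_age)
            return a->block_age > b->block_age;  /* older block first  */
        return a->slot < b->slot;                /* then smaller slot  */
    }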


Two Phase Instruction Selection Basic Operand Local & Remote Bypass

Two-phase instruction selection:
• Cycle 1, definite selection: a priority encoder picks a definitely ready instruction from the status buffer
• Cycle 2, late selection: an instruction woken by a local or remote (OPN) bypass control-packet index competes in final selection
• The priority encoder uses the issue prioritization policy
• Wakeup supports back-to-back operand delivery, and two bypassed operands in the same cycle

(Figure: Sample dependence graph mov -> sub -> add, physically placed with mov and sub on ET0 (local bypass) and add on ET1 (OPN remote bypass).)
• Cycle 1: ET0 executes mov and sends the mov local-bypass control packet
• Cycle 2: ET0 receives the mov local-bypass data packet, executes sub (wakeup, select), and sends the sub OPN control packet from ET0 to ET1
• Cycle 3: ET0 sends the sub OPN data packet; ET1 wakes and selects add
• Cycle 4: ET1 receives the sub OPN data packet and executes add
• Local bypass is back-to-back; remote bypass costs one extra OPN cycle


Bypass: Sources of Pipeline Bubbles Target Delivery: Multiple Local Targets

Sources of pipeline bubbles in bypassing:
• OPN stall due to network congestion
 – (Figure: in the mov -> sub -> add example, congestion delays the sub OPN data packet until cycle X, and add executes at cycle X+1.)
• Bypassed operand data is a non-enabling predicate
 – Detected in the execute stage after receiving the bypassed data, so the issue slot is wasted
 – A similar pipeline bubble arises from an invalid bypassed OPN data packet
 – (Figure: teq -> muli on ET0; teq executes in cycle 1 and locally bypasses its result, and in cycle 2 muli finds the data is a non-enabling predicate, wasting its issue slot.)

Target delivery, multiple local targets:
(Figure: mov targeting both sub and add locally on ET0.)
• Cycle 1: execute mov; send the first mov local-bypass control packet
• Cycle 2: receive the first mov local-bypass data packet; execute sub; send the second mov local-bypass control packet
• Cycle 3: receive the second mov local-bypass data packet; execute add
• Successive local targets are delivered in consecutive cycles
• Successive remote targets are also delivered in consecutive cycles (plus the extra OPN forward cycle)


Concurrent Local & Remote Target Delivery Primary Memory System

(Figure: mov targeting sub locally on ET0 and add remotely on ET1.)
• Cycle 1: execute mov; send the mov local-bypass control packet and the mov OPN control packet from ET0 to ET1
• Cycle 2: receive the mov local-bypass data packet; execute sub; send the mov OPN data packet from ET0 to ET1; ET1 wakes and selects add
• Cycle 3: ET1 receives the mov OPN data packet and executes add
• An instruction's local and remote targets can be sent concurrently
• mov3 and mov4 instructions have 3 and 4 targets respectively, used to fan out operands to multiple children

Cycles to emit targets (L = local, R = remote): LL 2, LR 1, RR 2, LLL 3, LLR 2, LRR 2, RRR 3, LLLL 4, LLLR 3, LLRR 2, LRRR 3, RRRR 4. In other words, since one local and one remote target can be delivered each cycle, the cost is the maximum of the local count and the remote count.

ISCA-32 Tutorial

Simha Sethumadhavan

Department of Computer Sciences
The University of Texas at Austin


D-Tile Location and Connections Major D-Tile Responsibilities

(Figure: D-tile connections. Each D-tile links to the G-tile over the GCN, GSN, and ESN, to the I-tile over the OCN, to its E-tile row over the GDN, GCN, and OPN, and to neighboring D-tiles over the DSN and ESN.)

• Provide D-cache access for arriving loads and stores
• Perform address translation with the DTLB
• Handle cache misses with MSHRs
• Perform dynamic memory disambiguation with load/store queues
• Perform dependence prediction for aggressive load/store issue
• Detect per-block store completion
• Write stores to caches/memory upon commit
• Store merging on L1 cache misses


Primary Memory Latencies Major Structures and Sizes in the D-Tile

Primary memory latencies:
• The data cache is interleaved on a cache-line (64-byte) basis
• Instruction placement affects load latency
 – Best-case load-to-use latency: 5 cycles
 – Worst-case load-to-use latency: 17 cycles
 – Loads are favored in the E-tile column adjacent to the D-tiles
 – Deciding row placement is difficult
• Memory access latency has three components
 – (a) From the E-tile (load) to the D-tile
 – (b) The D-tile access (this talk)
 – (c) From the D-tile back to the E-tile (load target)

Major structures and sizes in the D-tile:
• Cache subunit: 8KB, 2-way, 1R and 1W port; 64-byte lines; LRU; virtually indexed, physically tagged; write-back, write-no-allocate
• DTLB subunit: 16 entries, 64KB to 1TB pages
• Dependence predictor subunit: 1024 bits, PC-based hash function
• LSQ subunit: 256 entries, 32 loads/stores per block, single ported
• Miss handling subunit: 16 outstanding load misses to 4 cache lines; support for merging stores up to one full cache line


Load Execution Scenarios Load Pipeline

Load execution scenarios:

 TLB  | Dep. Pr. | LSQ  | Cache | Response
 Miss | -        | -    | -     | Report TLB exception
 Hit  | Hit      | -    | -     | Defer the load until all prior stores are received (non-deterministic latency)
 Hit  | Miss     | Miss | Hit   | Forward data from the L1 cache
 Hit  | Miss     | Miss | Miss  | Forward data from the L2 cache (or memory); issue a cache fill request (if cacheable)
 Hit  | Miss     | Hit  | Hit   | Forward data from the LSQ
 Hit  | Miss     | Hit  | Miss  | Forward data from the LSQ; issue a cache fill request (if cacheable)

Load pipeline:
• Common case: 2-cycle load hit latency
 – Cycle 1: access the cache, TLB, DP, and LSQ; cycle 2: load reply
• For LSQ hits, load latency varies between 4 and n+3 cycles, where n is the size of the load in bytes
 – Cycle 1: access the cache, TLB, DP, and LSQ; cycle 2: identify the store in the LSQ (stall pipe); cycle 3: read the store data (stall pipe); cycle 4: load reply
 – Stalls all pipelines except the forwarding stages (cycles 2 and 3)
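The scenario table reduces to a small decision function; this C sketch uses our own enum names and folds the two LSQ-hit rows together:

    #include <stdbool.h>

    typedef enum {
        LOAD_TLB_EXCEPTION,   /* report TLB exception                 */
        LOAD_DEFER,           /* DP hit: wait for prior stores        */
        LOAD_FROM_LSQ,        /* forward from the LSQ                 */
        LOAD_FROM_CACHE,      /* forward from the L1                  */
        LOAD_FILL_REQUEST,    /* miss everywhere: fill over the OCN   */
    } load_action_t;

    static load_action_t classify_load(bool tlb_hit, bool dp_hit,
                                       bool lsq_hit, bool cache_hit)
    {
        if (!tlb_hit) return LOAD_TLB_EXCEPTION;
        if (dp_hit)   return LOAD_DEFER;
        if (lsq_hit)  return LOAD_FROM_LSQ;  /* may also trigger a fill
                                                on a cache miss */
        return cache_hit ? LOAD_FROM_CACHE : LOAD_FILL_REQUEST;
    }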


Dependence Prediction Deferred Load Pipeline

Dependence prediction:
• Prediction is done at the memory side (not at issue)
 – A new invention, necessitated by distributed execution
• Properties
 – PC-based, updated on violation
 – Only loads access the predictor; loads predicted dependent are deferred
 – Deferred loads are woken up when all previous stores have arrived
 – Reset periodically (flash clear) based on the number of blocks committed
 – Dependence speculation is configurable: Override, Serial, Regular
(Figure: Predictor structure. The load address/PC is hashed, combined with the frame ID, into a 1024 x 1-bit prediction table; violations recorded per block (an 8-entry array) set bits, and Override/Serialize modes and a periodic Reset qualify the prediction.)

Deferred load pipeline:
• Cycle 1: access the cache, TLB, DP, and LSQ; cycle 2: mark the load as waiting (stall pipe); once all prior stores have arrived (cycle X), prepare the load for replay (cycle X+1) and re-execute it as if it were a new load (cycle X+2); this is the deferred load processing, a.k.a. the replay pipe
• Deferred loads are woken up when all prior stores have arrived at the D-tiles
 – Loads between two consecutive stores can be woken up out of order
• Deferred loads are re-injected into the main pipeline as if they were new loads
 – During this phase, loads get their value from dependent stores, if any
 – A deferred load cannot be re-injected in cycle X+2 when it could not be prepared in cycle X+1 because of resource conflicts in the LSQ
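A minimal model of the predictor as described; the hash function, table-access details, and flash-clear interval below are placeholders, not the prototype's values:

    #include <stdbool.h>
    #include <stdint.h>

    #define DP_BITS 1024

    static uint8_t  dp_table[DP_BITS];   /* one bit per entry, modeled as bytes */
    static unsigned dp_blocks_committed;

    static unsigned dp_index(uint64_t pc) { return (unsigned)((pc >> 2) % DP_BITS); }

    /* Only loads consult the table; a set bit means "defer this load". */
    static bool dp_defer_load(uint64_t pc)
    {
        return dp_table[dp_index(pc)] != 0;
    }

    /* Updated on a detected load/store ordering violation. */
    static void dp_report_violation(uint64_t pc)
    {
        dp_table[dp_index(pc)] = 1;
    }

    /* Periodic flash clear, driven by committed-block count. */
    static void dp_on_block_commit(void)
    {
        if (++dp_blocks_committed >= 65536) {   /* interval is a placeholder */
            for (unsigned i = 0; i < DP_BITS; i++) dp_table[i] = 0;
            dp_blocks_committed = 0;
        }
    }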



Store Counting Block Store Completion Detection

• Store arrival information must be shared between D-tiles
 – For block completion, and for waking up deferred loads
• The Data Status Network (DSN) communicates store arrival information
 – A multi-hop broadcast network; each link carries the store's frame ID and LSID
• At block dispatch, each D-tile receives (on the GDN) and records the block's store information in a store count table: the frame ID, the number of stores, and the 32-bit store mask identifying which instructions in the block are stores (recorded in the LSQ)
• DSN operation
 – A store arrival initiates a transfer on the DSN; receiving tiles update their local store counts and pass the arrival on, so all tiles know about the arrival after 3 cycles
• GSN operation
 – Completion can be detected at any D-tile, but special processing is done in D-tile 0 before the completion message is sent on the GSN to the G-tile
(Figure: Example. Frame ID 5 has 3 stores with store mask 0x00011001; a store arrival (frame 5, LSID 15) at DT1 in cycle 0 propagates over the DSN in cycles 1-2 while each tile's store counter is updated.)
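A sketch of the per-frame bookkeeping described above (struct and function names are ours; the DSN relay itself is not modeled):

    #include <stdbool.h>
    #include <stdint.h>

    typedef struct {
        uint32_t store_mask;  /* one bit per LSID that is a store in this block */
        unsigned remaining;   /* stores not yet seen to arrive anywhere         */
    } frame_stores_t;

    static unsigned popcount32(uint32_t x)
    {
        unsigned n = 0;
        while (x) { x &= x - 1; n++; }
        return n;
    }

    /* Called at block dispatch with the mask delivered over the GDN. */
    static void frame_dispatch(frame_stores_t *f, uint32_t store_mask)
    {
        f->store_mask = store_mask;
        f->remaining  = popcount32(store_mask);
    }

    /* Called once per store arrival, local or relayed over the DSN.
     * Returns true when the block's stores are complete at this tile. */
    static bool frame_store_arrived(frame_stores_t *f)
    {
        return --f->remaining == 0;
    }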

Store Commit Pipeline Load and Store Miss Handling

Store commit pipeline:
• Cycle 1: pick a store to commit; cycle 2: read the store data from the LSQ (stall pipe); cycle 3: access the cache tags (stall pipe); cycle 4: check for a cache hit or miss; cycle 5: write to the cache or to the store merge buffer
• Stores are committed in LSID and block order at each D-tile
 – Written into the L1 cache, or bypassed to the merge buffer on a cache miss
 – One store committed per D-tile per cycle
• Up to four stores total can be committed every cycle, possibly out of order
 – T-morph: store commits from different threads are not interleaved
• The commit pipe stalls when cache ports are busy
 – Fills from missed loads take up cache write ports

Load and store miss handling:
• Load miss operation
 – Cycle 1: miss determination; cycle 2: MSHR allocation and merging; cycle 3: requests sent out on the OCN, if available
• Load miss return
 – Cycle 1: wake up all loads waiting on the incoming data; cycle 2: select one load as its miss data starts arriving; cycle 3: prepare the load and re-inject it into the main pipeline, if unblocked
• Store merging
 – Attempts to merge consecutive writes to the same line; on a cache miss, stores are inserted into the merge buffer
 – Snoops on all incoming load misses for coherency
• Resources are shared across threads


D-Tile Pipelines Block Diagram Memory System Design Challenges

(Figure: D-tile pipelines block diagram. Load input from the E-tile feeds the load pipeline through a shared load port; the replay (deferred) load pipeline, missed-load pipeline, line-fill pipeline, store-hit/commit pipeline, and store-miss pipeline arbitrate for the shared cache and LSQ ports and the L2 request channel, with coherence checks between load and store misses to the same cache line. Unfinished stores bypass to loads in a previous stage; bypassing means fewer stalls and higher throughput. Costly resources, such as cache and LSQ ports and L2 bandwidth, are shared.)

Memory system design challenges:
• Optimizing and reducing the size of the LSQ
 – The prototype LSQ size exceeds the data cache array size
• Further optimizing memory-side dependence prediction
 – The prior state of the art predicts at instruction dispatch/issue time
• Scaling and decentralizing the Data Status Network (DSN)


D-Tile Roadmap Secondary Memory System

ISCA-32 Tutorial

Changkyu Kim

Department of Computer Sciences The University of Texas at Austin


M-Tile Location and Connections M-Tile Network Attributes

(Figure: M-tile location. A 4x4 array of tiles, each position an M-tile or an N-tile, connected by the OCN.)

• Total 1MB on-chip memory capacity
 – Array of 16 64KB M-tiles
 – Connected by a specialized on-chip network (OCN)
• High-bandwidth OCN
 – A peak of 16 cache accesses per cycle
 – 68 GB/sec to the two processor cores
• Low latency for future technologies
 – Non-uniform (NUCA) level-2 cache
• Configuration
 – Scratchpad / level-2 cache modes on a per-bank basis
 – Configurable address mapping: shared L2 cache / private L2 cache


Non-Uniform Level-2 Cache (NUCA) Multiple Modes in TRIPS Secondary Memory

• Non-uniform access latency
 – Closer banks can be accessed faster (9 cycles to the nearest bank vs. 25 cycles to the farthest, per the figure)
• Designed for wire-dominated future technologies
 – No global interconnect
• Connected by a switched network
 – Can fit a large number of banks with small area overhead
• Allows thousands of in-flight transactions
 – Network capacity = 2100 flits
 – Maximize throughput

Multiple modes (L2 = M-tile configured as a level-2 cache bank; S = M-tile configured as a scratchpad):
(Figure: four configurations of the 16-bank array between PROC0/PROC1 and the DMA, SDC, EBC, and C2C clients: all shared level-2 caches; per-processor level-2 caches; all scratchpad memories; and half level-2 caches, half scratchpad memories.)


Major Structures in the M-Tile Back-to-Back Read Pipeline

(Figure: Major structures in the M-tile. An OCN router with North/South/East/West ports, the incoming OCN interface and decoder, an MSHR, the tag array, the data array, and the outgoing OCN interface and encoder.)

• 8-way set-associative cache
• 64B line size
• 1 MSHR entry (16 entries total in the whole cache)
• Configurability (tag on/off)
• 1B-64B read / write / swap transaction support
• Error checking and handling

Back-to-back read pipeline:
• 4-cycle latency to send a reply header; 7-cycle latency to transfer all 64B of data
 – First read: the request arrives at the input FIFO (cycle 0); tag access (cycle 1); data accesses 1 and 2 (cycles 2-3); send the reply header (cycle 4); send the four 16B chunks of reply data (cycles 5-8)
 – A second read can arrive immediately behind the first and stream its reply out back-to-back: seamless data transfer at 6.8 GB/sec


Read Miss Pipeline Chip- and System-Level Networks and I/O

• 4-cycle latency to send a fill request to the SDC
 – Cycle 0: the incoming read miss request arrives at the input FIFO; cycle 1: tag access; cycle 2: MSHR allocation; cycle 3: send the outgoing read miss request to main memory
• 9-cycle latency to fill 64B of data and send a reply header
 – Cycle 0: the reply arrives at the incoming read FIFO; cycle 1: tag access; cycles 2-5: fill the four 16B chunks of reply data; then MSHR access and reply-header generation, with the reply header sent in cycle 8

ISCA-32 Tutorial

Paul Gratz

Department of Computer Sciences
The University of Texas at Austin


OCN Location and Connections Goals and Requirements of the OCN

(Figure: OCN location. Columns of N-tiles and M-tiles/OCN clients adjacent to the I-tiles, linked by OCN connections, with the C2C controller as a corner client.)

• Fabric for interconnection of processors, memory, and I/O
• Handles multiple types of traffic:
 – Routing of L1 and L2 cache misses
 – Memory requests
 – Configuration and data register reads/writes by an external host
 – Inter-chip communication
• OCN network attributes
 – Flexible mapping of resources
 – Deadlock-free operation
 – Performance (16 client links, 6.8 GB/sec per link)
 – Guaranteed delivery of requests and replies


OCN Protocol N-Tile Design

OCN protocol:
• Flexible mapping of resources
 – System address mapping tables are runtime-reconfigurable, allowing runtime re-mapping between L2 cache banks and scratchpad memory banks
 – Configurable L2 network addresses
• Deadlock avoidance
 – Dimension-order routed network (Y-X)
 – Four virtual channels (request and reply) used to avoid deadlock
• Performance
 – Links are 128 bits wide in each direction
 – Credit-based flow control
 – One cycle per hop, plus a cycle at network insertion
 – 68 GB/sec at the processor interfaces

N-tile design:
• Goals
 – Router for OCN packets
 – System-address to network-address translation
 – One cycle per hop latency
• Attributes
 – Composed of an OCN router with translation tables on the local input
 – Portions of the system address index into the translation tables
 – Translation tables are configured through memory-mapped registers
 – Errors back on invalid addresses
(Figure: N-tile internals. Local, north, south, east, and west inputs; the translation table on the local input; a crossbar to the five outputs.)
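A sketch of the local-input translation step (the table size and which address bits index it are assumptions; only the valid/error-back behavior follows the slide):

    #include <stdint.h>

    #define XLATE_ENTRIES 16      /* placeholder table size */

    typedef struct {
        uint8_t valid;
        uint8_t net_addr;         /* destination tile on the OCN */
    } xlate_entry_t;

    /* Configured at runtime through memory-mapped registers. */
    static xlate_entry_t xlate_table[XLATE_ENTRIES];

    /* Returns the destination tile, or -1 to signal "error back"
     * on an invalid system address. */
    static int ntile_translate(uint64_t sys_addr)
    {
        unsigned idx = (unsigned)((sys_addr >> 16) % XLATE_ENTRIES);  /* index bits are a guess */
        if (!xlate_table[idx].valid)
            return -1;
        return xlate_table[idx].net_addr;
    }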


I/O Tiles C2C Tile Design Considerations

I/O tiles:
• EBC tile
 – Provides the interface between the board-level PowerPC 440 GP chip and the TRIPS chip
 – Forwards interrupts to the 440 GP's service routines
• SDC tile
 – Contains the IBM SDRAM memory controller
 – Error detection, swap support, and address remapping added
• DMA tile
 – Supports multi-block and scatter/gather transfers
 – Buffered to optimize traffic

C2C tile design considerations:
• C2C goals
 – Extend the OCN fabric across multiple chips
 – Provide flexible extensibility
• C2C attributes
 – The C2C network protocol is similar to the OCN, except: 64 bits wide, and 2 virtual channels
 – Operates at up to 266 MHz; clock speed configurable at boot
 – Source-synchronous off-chip links
 – Multiple clock domains with resynchronization
(Figure: C2C tile. OCN-to-C2C and C2C-to-OCN conversion, five input and output ports (local, north, south, east, west), and a crossbar.)


C2C Network

• Latencies:
 – L2 hit latency: 9-27 cycles
 – L2 miss latency: 26-44 cycles (+ SDRAM access time)
 – L2 hit latency (adjacent chip): 33-63 cycles
 – L2 miss latency (adjacent chip): 50-78 cycles (+ SDRAM)
 – Each additional C2C hop adds 8 cycles in each direction
• Bandwidth:
 – OCN link: 6.8 GB/sec
 – One processor to OCN: 34 GB/sec
 – Both processors to OCN: 68 GB/sec
 – Both SDRAMs: 2.2 GB/sec
 – C2C link: 1.7 GB/sec
(Figure: a 2x2 grid of TRIPS chips, each with two CPUs, two SDRAMs, and an OCN, linked by C2C connections.)

Compiler Overview

ISCA-32 Tutorial

Kathryn McKinley

Department of Computer Sciences
The University of Texas at Austin


Compiler Challenges Generating Correct, High Performance Code

“Don’t put off to runtime what you can do at compile time.” — Norm Jouppi, May 2005.

• TRIPS-specific compiler challenges
 – EDGE execution model
 – What’s a TRIPS block?
 – Block constraints & compiler scheduling

• The compiler’s job
 – Correct code: satisfying TRIPS block constraints
   • Makes the hardware simpler; shifts work to compile time
 – Optimized code: filling up TRIPS blocks with useful instructions


Block Execution TRIPS Block Constraints

(Figure: Block execution. Register banks feed up to 32 read instructions into a dataflow graph (DFG) of 1-128 instructions; up to 32 loads and 32 stores exchange addresses+targets and data with memory; up to 32 write instructions return values to the register banks, and one branch produces the next PC.)

• Fixed size: 128 instructions
 – Padded with no-ops if needed
 – Simpler hardware: instruction size, I-cache design, global control
• Registers: 8 reads and writes to each of 4 banks (in addition to the 128 instructions)
 – Reduced buffering
 – Read/write instruction overhead, register file size
• Block termination: all stores and writes execute, one terminating branch
 – Simplifies the hardware logic for detecting block completion
• Fixed output: 32 load or store queue IDs, one branch
 – LSQ size, instruction size


Outline Compiler Phases

• Overview of TRIPS compiler components
• How hard is it to meet TRIPS block constraints?
 – Compiler algorithms to enforce them
 – Evaluation
 – Question: How often does the compiler under-fill a block to meet a constraint?
 – Answer: Almost never
• Optimization
 – High-quality TRIPS blocks
 – Scheduling for TRIPS
 – Cache bank optimization
• Preliminary evaluation

(Figure: compiler phases.)


Compiler Phases (2) Forming TRIPS Blocks

ISCA-32 Tutorial

Aaron Smith

Department of Computer Sciences The University of Texas at Austin


High Quality TRIPS Blocks TRIPS Block Constraints

High-quality TRIPS blocks:
• Full
• High ratio of committed to total instructions
 – Dependence height is not an issue
• Minimize branch misprediction
• Meet correctness constraints

TRIPS block constraints:
• Fixed size: at most 128 instructions
 – Easy for the compiler to satisfy, but good performance requires blocks full of useful instructions
• Registers: 8 reads and writes to each of 4 banks (plus the 128 instructions)
• Block termination
• Fixed output: at most 32 load or store queue IDs


O3: Basic Blocks as TRIPS Blocks O4: Hammock Hyperblocks as TRIPS Blocks

• O3: every basic block is a TRIPS block
• Simple, but not high performance

(Figure: two possible hammock hyperblocks for this control-flow graph.)


Static TRIPS Block Instruction Counts SPEC Total TRIPS Blocks

• 15% average reduction in the number of TRIPS blocks between O3 and O4
• Results missing sixtrack, perl, and gcc

Static average TRIPS block instruction counts:

 Suite    | O3 min | O3 max | O3 mean | O4 min | O4 max | O4 mean
 SPEC2000 | 7      | 42     | 16      | 8      | 42     | 18
 EEMBC    | 7      | 31     | 16      | 9      | 67     | 22


Enforcing Block Size SPEC > 128 Instructions

• At O4, an average of 0.9% of TRIPS blocks exceed 128 instructions (max 6.6%)
• Too big?
 1. Unpredicated: cut
 2. Predicated: reverse if-conversion

(Figure: the initial hammock hyperblock, and the result after reverse if-conversion.)


In Progress: Tail Duplication In Progress: More Aggressive Hyperblocks

• Tail duplicate A
• Increases block size
• Hot long paths
(Figure: the control-flow graph after reverse if-conversion and after tail duplication.)

• Add support for complex control flow (i.e., breaks/continues), e.g.:

    while (*pp != '\0') {
        if (*pp == '.') {
            break;
        }
        pp++;
    }

• An example: the number of TRIPS blocks in the ammp_1 microbenchmark
 – 38 / 27 / 12 with O3 / O4 (current) / new


While Loop Unrolling TRIPS Block Constraints

• The same while loop, unrolled under predication:

    while (*pp != '\0') {
        if (*pp == '.') {
            break;
        }
        pp++;
    }

(Figure: the unrolled dataflow; the != '\0' and == '.' tests repeat per iteration, each guarding a break or a pp++.)
• Supported on architectures with predication
• Implemented, but needs improved hyperblock generation

TRIPS block constraints:
• Fixed size: at most 128 instructions
• Registers: 8 reads and writes to each of 4 banks (plus the 128 instructions)
 – Simple linear scan priority-order allocator
• Block termination
• Fixed output: at most 32 load or store queue IDs


Linear Scan Register Allocation SPEC > 32 (8x4 banks) Registers

Linear scan register allocation:
• 128 registers, 32 in each of 4 banks
• Hyperblocks treated as large instructions
 – Computes liveness over hyperblocks
 – Fewer hyperblocks than instructions, so memory savings vs. instruction-level allocation
• Allocator ignores local variables
• Spills in only 5 SPEC2000 benchmarks, vs. 18 on Alpha
 – Static spill counts (Alpha vs. TRIPS): gcc 4290 vs. 211; sixtrack 3510 vs. 165; mesa 2048 vs. 207; apsi 920 vs. 7; applu 14 vs. 19
 – Average spill: 1 store, 2-7 loads
 – Spills are less than 0.5% of all dynamic loads/stores

SPEC blocks over the register budget:
• At O4, an average of 0.3% of TRIPS blocks have > 32 register reads
• At O4, an average of 0.2% of TRIPS blocks have > 32 register writes


SPEC > 8 Registers per Bank TRIPS Block Constraints

• At O4, an average of 0.12% of TRIPS blocks exceed 8 reads to some bank
• At O4, an average of 0.07% of TRIPS blocks exceed 8 writes to some bank

TRIPS block constraints:
• Fixed size: at most 128 instructions
• Control flow: 8 branches (soft)
• Registers: 8 reads and writes to each of 4 banks (plus the 128 instructions)
• Block termination
• Fixed output: at most 32 load or store queue IDs


Block Termination Register Writes

Block termination:
• One branch fires
• All writes complete
 – Write nullification
• All stores complete
 – Store nullification
• Could nullify writes
 – Increases block size

Register writes:
• How do we guarantee all paths produce a value?
 – Insert corresponding read instructions for all write instructions
 – Transform into SSA form
 – Out of SSA to add moves
(Figure: before SSA, read 4 reaches def 4 and def 6, both writing register 4; after SSA, a phi(4,6) feeds the single write of register 4.)


Store Nullification SPEC Store Nullification

How do you insert nulls for this? Two possibilities — which is better?
(Figure: a hammock in which only one path stores. Either a NULL/dummy SD is placed on the storeless path, or the SD is moved below the join.)
• Dummy stores
 – Wastes space
 – Easier to schedule?
• Code motion
 – Better resource utilization

SPEC store nullification:
• Nullification is 1.2% of instructions for 177.mesa (0.8% dummy stores)
• Is it significant?


TRIPS Block Constraints Load/Store ID Assignment

TRIPS block constraints:
• Fixed size: at most 128 instructions
• Control flow: 8 branches (soft)
• Registers: 8 reads and writes to each of 4 banks (plus the 128 instructions)
• Block termination: each store issues once
• Fixed output: at most 32 load or store queue IDs

Load/store ID assignment:
• Ensures memory consistency; simplifies hardware
• Overlapping LSQ IDs
 – Increase the number of load/store instructions that fit in a TRIPS block
 – But a load and a store cannot share the same ID
(Figure: before assignment, LD, LD, SD; after assignment, LD [0], LD [0], SD [1].)
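An illustrative encoding of the sharing rule (the policy shown is deliberately trivial; the compiler's real assignment also respects program-order constraints it tracks):

    #include <stdbool.h>

    typedef struct {
        int  next_id;        /* next fresh LSID, 0..31      */
        int  last_id;        /* most recently assigned LSID */
        bool last_was_load;
    } lsid_alloc_t;

    /* Two loads may share an LSID (e.g. on disjoint predicate paths),
     * but a load and a store may not. */
    static int lsid_assign(lsid_alloc_t *a, bool is_load, bool may_overlap_prev)
    {
        if (is_load && may_overlap_prev && a->last_was_load)
            return a->last_id;       /* overlap two loads on one LSID */
        a->last_id = a->next_id++;   /* otherwise take a fresh LSID (max 32/block) */
        a->last_was_load = is_load;
        return a->last_id;
    }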


SPEC > 32 LSQ IDs Correct Compilation of TRIPS Blocks

• At O4, an average of 0.3% of TRIPS blocks have > 32 LSQ IDs

Correct compilation of TRIPS blocks:
• Some new compiler challenges
• New, but straightforward algorithms
• Does not limit compiler effectiveness (so far!)


Outline Code Optimization

• Overview of TRIPS compiler components
• Are TRIPS block constraints onerous?
 – Compiler algorithms that enforce them
 – Evaluation
 – Question: How often does the compiler under-fill a block to meet a constraint?
 – Answer: almost never
• Towards optimized code
 – TRIPS-specific optimization policies
 – Scheduling for TRIPS
 – Cache bank optimization
• Preliminary evaluation

ISCA-32 Tutorial

Kathryn McKinley

Department of Computer Sciences
The University of Texas at Austin


Towards Optimized Code TRIPS Scheduling Problem

• Filling TRIPS blocks with useful instructions
 – Better hyperblock formation
 – Loop unrolling for TRIPS block constraints
 – Aggressive inlining
 – Driving block formation with path and/or edge profiles
• Shortening the critical path in a block
 – Tree-height reduction
 – Redundant computation
• 64-bit machine optimizations
• etc.

TRIPS scheduling problem:
(Figure: a hyperblock flowgraph of adds, loads, shifts, compares, stores, and branches mapped onto the ALU grid between the register file and the data caches.)
• Place instructions on the 4x4x8 grid
• Encode placement in target form


Contrast with Conventional Approaches Dissecting the Problem

• VLIW
 – Relies completely on the compiler to schedule code
 + Eliminates the need for dynamic dependence-check hardware
 + Good match for partitioning; can minimize communication latencies on critical paths
 – Poor tolerance of unpredictable dynamic latencies, and these latencies continue to grow
• Superscalar approach
 – Hardware dynamically schedules code
 + Can tolerate dynamic latencies
 – Quadratic complexity of dependence-check hardware
 – Not a good match for partitioning; difficult to make good placement decisions
 – The ISA does not allow software to help with instruction placement

Dissecting the problem:
• Scheduling is a two-part problem
 – Placement: where an instruction executes
 – Issue: when an instruction executes
• VLIW represents one extreme: Static Placement, Static Issue (SPSI)
 + Static placement works well for partitioned architectures
 – Static issue causes problems with unknown latencies
• Superscalars represent another extreme: Dynamic Placement, Dynamic Issue (DPDI)
 + Dynamic issue tolerates unknown latencies
 – Dynamic placement is difficult in the face of partitioning


EDGE Architecture Solution Scheduling Algorithms (1 & 2)

• EDGE: Explicit Dataflow Graph Execution
 – Static Placement, Dynamic Issue (SPDI)
 – Renegotiates the compiler/hardware binary interface
• An EDGE ISA explicitly encodes the dataflow graph by specifying targets:

    RISC                      EDGE
    i1: movi r1, #10          ALU-1: movi #10, ALU-3
    i2: movi r2, #20          ALU-2: movi #20, ALU-3
    i3: add  r3, r2, r1       ALU-3: add  ALU-4

• Static placement
 – An explicit DFG simplifies hardware: no HW dependency analysis, no associative issue queues
 – Results are forwarded directly through a point-to-point network: no global bypass network
• Dynamic instruction issue
 – Instructions execute in original dataflow order

Scheduling algorithms (1 & 2):
• CRBLO: list scheduler [PACT 2004]
 – Greedy, top down
 – C: prioritize the critical path
 – R: reprioritize
 – B: load balance for issue contention
 – L: data cache locality
 – O: register bank locality
• BSP: Block-Specific Policy
 – Characterizes a block's ILP and loads/stores
 – If loads/stores > 50%, pre-place them and schedule bottom up
 – Else if floating point > 50%, or integer + floating point > 90%, use CRBLO
 – Else pre-place loads/stores and schedule top down, greedy


Scheduling Algorithms (3) Improvements on Hand Coded Microbenchmarks

• Physical path scheduling
 – Top down, critical-path focused
 – Models network contention (N), in addition to issue contention based on instruction latencies (L)
 – The cost function selects the instruction placement that minimizes the maximum contention and latency: min of max(N, L)
 – PP: load/store pre-placement
 – Tried average, difference, etc.

(Figure: normalized speedups of CRBLO, BSP, min max(L,N), and min max(L,N) + PP on hand-coded microbenchmarks: art_1_hand, art_2_hand, art_3_hand, sieve_hand, vadd_hand, gzip_2_hand, bzip2_3_hand, matrix_1_hand, parser_1_hand, equake_1_hand, fft2_GMTI_hand, fft4_GMTI_hand, doppler_GMTI_hand, transpose_GMTI_hand, plus the average.)


Spatial L1 Bank D-Cache Load/Store Alignment Scheduling for L1 Bank D-cache Communication

• Problem
 – The load/store mapping to a data L1 cache bank is unknown to the compiler
• Example (every iteration touches all banks):

    for (i = 0; i < 1024; i++) {
        // A[i], L[i], B[i] go to all banks
        A[i] += L[i] + B[i];
    }

(Figure: the processor grid of I/G/R/D/E tiles; loads from one E-tile fan out to all four D-tile banks and on to the L2.)
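Cache-line interleaving implies a simple address-to-bank map; assuming sequential 64-byte lines rotate across the four D-tiles, bits [7:6] of the address select the bank (the function name is ours):

    #include <stdint.h>

    /* 64B line => drop the low 6 bits, then take the line number mod 4. */
    static inline unsigned dcache_bank(uint64_t addr)
    {
        return (unsigned)((addr >> 6) & 0x3);
    }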


Spatial L1 D-cache Communication Spatial Load/Store Alignment

• Solution:
 – Align arrays on the bank alignment boundary
   • A function of cache line size and number of banks
   • Alignment declaration and/or specialized malloc
 – Jump-and-fill, a new loop transformation
   • Similar to unroll-and-jam

    for (j = 0; j < 1024; j += 32) {
        for (i = j; i<   [inner loop truncated in source]

(Figure: with alignment, each load/store in the scheduled loop maps to a known D-tile bank.)


Spatial L1 D-cache Communication Unrolling for Larger Hyperblocks

• Vector add unrolled continuously (tcc O3):

    for (i = 0; i < 1024; i += 32) {
        C[i]    += (A[i]    + B[i]);     // Bank 0
        C[i+ 1] += (A[i+ 1] + B[i+ 1]);  // Bank 0
        C[i+ 2] += (A[i+ 2] + B[i+ 2]);  // Bank 0
        . . .
        C[i+31] += (A[i+31] + B[i+31]);  // Bank 3
    }

• Vector add unrolled with jump-and-fill (JF unroll):

    for (i = 0; i < 1024; i += 32) {
        C[i]    += (A[i]    + B[i]);     // Bank 0
        C[i+ 8] += (A[i+ 8] + B[i+ 8]);  // Bank 1
        C[i+16] += (A[i+16] + B[i+16]);  // Bank 2
        C[i+24] += (A[i+24] + B[i+24]);  // Bank 3
        C[i+ 1] += (A[i+ 1] + B[i+ 1]);  // Bank 0
        . . .
        C[i+31] += (A[i+31] + B[i+31]);  // Bank 3
    }


Spatial Load/Store Alignment Results Performance Goals

(Figure: speedups of Unroll 32C, Jump&Fill, and JF Unroll over tcc -O3, with and without bank hints.)

Performance goals:
• Technology comparison points
 – Alpha: the fairest comparison we can do; a high-ILP, high-frequency, low-cycle-count machine, with gcc and the Alpha compiler
 – Scale on Alpha
 – Scale on TRIPS
• Performance goal: 4x speedup over Alpha
• Benchmarks
 – Hand-coded microbenchmarks
• Metric for measuring compiler success
 – Industry standards: EEMBC, SPEC


Microbenchmark Speedup Progress Towards Correct & Optimized Code

• 1.2 avg. speedup, O3 vs. Alpha
• 3.1 avg. speedup, hand-coded vs. Alpha
• 1.5 avg. speedup, O4 vs. Alpha
(Figure: per-benchmark speedups of Alpha gcc -O3, Scale O3, Scale O4, and hand TIL over art_1, art_2, art_3, sieve, vadd, gzip_1, gzip_2, bzip2_1, bzip2_2, bzip2_3, twolf_1, twolf_3, ammp_1, ammp_2, parser_1, equake_1, fft2_GMTI, fft4_GMTI, matrix_1, doppler_GMTI, and transpose_GMTI.)

Progress towards correct & optimized code:
• SPEC 2000 C/Fortran-77, O3 (basic blocks)
 – March 2004: 0/21; November 2004: 6/21; March 2005: 15/21; May 2005: 19/21
• SPEC 2000 C/Fortran-77, O4 (hammock hyperblocks)
 – May 2005: 8/21
• EEMBC, O4
 – May 2005: 30/30
• GCC torture tests and NIST, O4
 – May 2005: 30/30
• Plenty of work left to do!


Project Plans Wrap-up and Discussion

• Tools release
 – Hope to build academic/industrial collaborations
 – Plan to release the compiler, toolchain, and simulators sometime in the (hopefully early) fall
• Prototype system
 – Plan to have it working December 2005/January 2006
 – Software development effort increasing after tape-out
• Commercialization
 – Actively looking for interested commercial partners
 – Discussing the best business models for transfer to industry
 – Happy to listen to ideas
• Will EDGE architectures become widespread?
 – Many questions remain unanswered (power, performance)
 – Promising initial results; hope to have more concrete answers by next spring
 – A key question: how successful will universal parallel programming become?
   • Perhaps room for parallel programming & EDGE architectures

ISCA-32 Tutorial

TRIPS Team and Guests

Department of Computer Sciences
The University of Texas at Austin


And finally ... Appendix A - Supplemental Materials

THANK YOU for your interest, questions, and taking all of this time!

A subset of the complete manuals and specifications is available at:
http://www.cs.utexas.edu/users/cart/trips

Appendix contents:
• ISA Specification
• Bandwidth summary
• Additional Tile Details


ISA: Header Chunk Format ISA: Read and Write Formats

ISA: Header chunk format
• The 128-byte header chunk holds 32 General Register "read" instructions and 32 "write" instructions in words H0-H31 (byte offsets 0-124); Read i.j / Write i.j denotes bank i, slot j
• Up to 128 bits of header information is encoded in the upper nibbles
 – Block type
 – Block size
 – Block flags
 – Store mask

ISA: Read and write formats
(Figure: register read instruction format: valid bit V at bit 21, the 5-bit GR field in bits [20:16], and two 8-bit read target specifiers, RT1 in [15:8] and RT2 in [7:0]. Register write instruction format: valid bit V, 5-bit GR field, and the write queue (WQ) entry.)
• General Register access is controlled by special read and write instructions
 – Read instructions are executed from the Read Queue
 – Write instructions are executed from the Write Queue
 – Read instructions may target the 1st or 2nd operand slot of any instruction
• General registers are divided into 4 architectural banks
 – 32 registers per bank; bank i holds GRs i, i+4, i+8, etc.
 – Up to 8 reads and 8 writes may be performed at each bank
 – Bank position is implicit, based on the read or write instruction's position in the block header
 – Two implicit bits added to the 5-bit general-register fields address all 128 registers
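A sketch of decoding one read instruction per the bit offsets above; reconstructing the architectural register as GR*4 + bank is our reading of the implicit-bank rule, not a documented formula:

    #include <stdint.h>

    typedef struct {
        unsigned valid;
        unsigned arch_reg;   /* 0..127 */
        unsigned rt1, rt2;   /* 8-bit read target specifiers */
    } read_inst_t;

    static read_inst_t decode_read(uint32_t word, unsigned bank /* 0..3 */)
    {
        read_inst_t r;
        r.valid = (word >> 21) & 0x1;
        unsigned gr = (word >> 16) & 0x1f;     /* 5-bit GR field          */
        r.arch_reg = gr * 4 + bank;            /* bank i holds i, i+4, ... */
        r.rt1 = (word >> 8) & 0xff;
        r.rt2 = word & 0xff;
        return r;
    }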


ISA: Microarchitecture-Specific Target Formats ISA: More Details on Predication

Target formats:
• A 9-bit general target specifier is used by most instructions; the top bits select the slot class:
 – 00 0000000: no target (the all-zero encoding)
 – 00 01 WID: write slot (WQ)
 – 01 Inst. ID: predicate slot (IQ)
 – 10 Inst. ID: OP0 slot (IQ)
 – 11 Inst. ID: OP1 slot (IQ)
• Read target specifiers are truncated to 8 bits, with the implicit 9th bit always set to 1 (they cannot target a write slot or a predicate slot)
• Write Queue entry 0.0 is implicitly targeted by branch instructions and cannot be explicitly targeted
• Most instructions are allowed to target register outputs, predicates, 1st operands, and 2nd operands

More details on predication:
• Any instruction using the G, I, L, S, or B format may be predicated
• A 2-bit predicate field (PR) specifies whether an instruction is predicated and, if so, upon what condition:
 – 00: not predicated
 – 01: (reserved)
 – 10: predicated upon false
 – 11: predicated upon true
• Predicated instructions must receive all of their normal operands plus a matching predicate before they are allowed to execute
 – Instructions that meet these criteria are said to be enabled
 – Instructions that don't meet these criteria are said to be inhibited
 – Inhibited instructions may cause other dependent instructions to be indirectly inhibited
• A block completes executing when all of its register and memory outputs are produced (it is not necessary for all of its instructions to execute)
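A sketch of decoding the 9-bit target specifier from the table above (enum names are ours; the exact subfield split of the write-slot encoding is our guess):

    typedef enum { TGT_NONE, TGT_WRITE_SLOT, TGT_PREDICATE, TGT_OP0, TGT_OP1 } tgt_kind_t;

    typedef struct { tgt_kind_t kind; unsigned id; } target_t;

    static target_t decode_target(unsigned spec9)   /* spec9 holds 9 bits */
    {
        target_t t;
        unsigned hi = (spec9 >> 7) & 0x3;   /* slot class            */
        unsigned lo = spec9 & 0x7f;         /* WQ ID or inst. ID     */
        if (hi == 0) {
            if (spec9 == 0) { t.kind = TGT_NONE;       t.id = 0;  }
            else            { t.kind = TGT_WRITE_SLOT; t.id = lo; }
        } else {
            t.kind = (hi == 1) ? TGT_PREDICATE : (hi == 2) ? TGT_OP0 : TGT_OP1;
            t.id   = lo;
        }
        return t;
    }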


ISA: Loads and Stores ISA: Integer Arithmetic

Loads and stores:

 Mnemonic | Name                 | Format | Sources       | Targets
 LB       | Load Byte            | L      | OP0, IMM      | T0
 LBS      | Load Byte Signed     | L      | OP0, IMM      | T0
 LH       | Load Halfword        | L      | OP0, IMM      | T0
 LHS      | Load Halfword Signed | L      | OP0, IMM      | T0
 LW       | Load Word            | L      | OP0, IMM      | T0
 LWS      | Load Word Signed     | L      | OP0, IMM      | T0
 LD       | Load Doubleword      | L      | OP0, IMM      | T0
 SB       | Store Byte           | S      | OP0, OP1, IMM | None
 SH       | Store Halfword       | S      | OP0, OP1, IMM | None
 SW       | Store Word           | S      | OP0, OP1, IMM | None
 SD       | Store Doubleword     | S      | OP0, OP1, IMM | None

• LB:  T0 = ZEXT( MEM_B[OP0 + SEXT(IMM)] )
• LBS: T0 = SEXT( MEM_B[OP0 + SEXT(IMM)] )
• LH:  T0 = ZEXT( MEM_H[OP0 + SEXT(IMM)] )
• LHS: T0 = SEXT( MEM_H[OP0 + SEXT(IMM)] )
• LW:  T0 = ZEXT( MEM_W[OP0 + SEXT(IMM)] )
• LWS: T0 = SEXT( MEM_W[OP0 + SEXT(IMM)] )
• LD:  T0 = ZEXT( MEM_D[OP0 + SEXT(IMM)] )
• SB:  MEM_B[OP0 + SEXT(IMM)] = OP1[7:0]
• SH:  MEM_H[OP0 + SEXT(IMM)] = OP1[15:0]
• SW:  MEM_W[OP0 + SEXT(IMM)] = OP1[31:0]
• SD:  MEM_D[OP0 + SEXT(IMM)] = OP1[63:0]

Integer arithmetic:

 Mnemonic | Name                      | Format | Sources  | Targets
 ADD      | Add                       | G      | OP0, OP1 | T0, T1
 ADDI     | Add Immediate             | I      | OP0, IMM | T0
 SUB      | Subtract                  | G      | OP0, OP1 | T0, T1
 SUBI     | Subtract Immediate        | I      | OP0, IMM | T0
 MUL      | Multiply                  | G      | OP0, OP1 | T0, T1
 MULI     | Multiply Immediate        | I      | OP0, IMM | T0
 DIVS     | Divide Signed             | G      | OP0, OP1 | T0, T1
 DIVSI    | Divide Signed Immediate   | I      | OP0, IMM | T0
 DIVU     | Divide Unsigned           | G      | OP0, OP1 | T0, T1
 DIVUI    | Divide Unsigned Immediate | I      | OP0, IMM | T0

• There is no support for overflow detection or extended arithmetic
• The multiply instruction may be split into signed and unsigned versions


ISA: Integer Logical ISA: Integer Shift

Integer logical:

 Mnemonic | Name                  | Format | Sources  | Targets
 AND      | Bitwise AND           | G      | OP0, OP1 | T0, T1
 ANDI     | Bitwise AND Immediate | I      | OP0, IMM | T0
 OR       | Bitwise OR            | G      | OP0, OP1 | T0, T1
 ORI      | Bitwise OR Immediate  | I      | OP0, IMM | T0
 XOR      | Bitwise XOR           | G      | OP0, OP1 | T0, T1
 XORI     | Bitwise XOR Immediate | I      | OP0, IMM | T0

• XORI may be used to perform a bitwise complement: XORI( A, -1 )

Integer shift:

 Mnemonic | Name                             | Format | Sources  | Targets
 SLL      | Shift Left Logical               | G      | OP0, OP1 | T0, T1
 SLLI     | Shift Left Logical Immediate     | I      | OP0, IMM | T0
 SRL      | Shift Right Logical              | G      | OP0, OP1 | T0, T1
 SRLI     | Shift Right Logical Immediate    | I      | OP0, IMM | T0
 SRA      | Shift Right Arithmetic           | G      | OP0, OP1 | T0, T1
 SRAI     | Shift Right Arithmetic Immediate | I      | OP0, IMM | T0

• Operand 1 provides the data to be shifted; operand 2 or IMM provides the shift count
• The lower 6 bits of the shift count are interpreted as an unsigned quantity; the higher bits are ignored


ISA: Integer Extend ISA: Integer Test

Integer extend:

 Mnemonic | Name                     | Format | Sources | Targets
 EXTSB    | Extend Signed Byte       | G      | OP0     | T0, T1
 EXTSH    | Extend Signed Halfword   | G      | OP0     | T0, T1
 EXTSW    | Extend Signed Word       | G      | OP0     | T0, T1
 EXTUB    | Extend Unsigned Byte     | G      | OP0     | T0, T1
 EXTUH    | Extend Unsigned Halfword | G      | OP0     | T0, T1
 EXTUW    | Extend Unsigned Word     | G      | OP0     | T0, T1

• These are better than doing a left shift followed by a right shift

Integer test:

 Mnemonic | Name             | Format | Sources  | Targets
 TEQ      | Test EQ          | G      | OP0, OP1 | T0, T1
 TNE      | Test NE          | G      | OP0, OP1 | T0, T1
 TLE      | Test LE          | G      | OP0, OP1 | T0, T1
 TLEU     | Test LE Unsigned | G      | OP0, OP1 | T0, T1
 TLT      | Test LT          | G      | OP0, OP1 | T0, T1
 TLTU     | Test LT Unsigned | G      | OP0, OP1 | T0, T1
 TGE      | Test GE          | G      | OP0, OP1 | T0, T1
 TGEU     | Test GE Unsigned | G      | OP0, OP1 | T0, T1
 TGT      | Test GT          | G      | OP0, OP1 | T0, T1
 TGTU     | Test GT Unsigned | G      | OP0, OP1 | T0, T1

• Test instructions produce true (1) or false (0)


ISA: Integer Test #2 ISA: Floating Point Arithmetic

Integer test #2:

 Mnemonic | Name                       | Format | Sources  | Targets
 TEQI     | Test EQ Immediate          | I      | OP0, IMM | T1
 TNEI     | Test NE Immediate          | I      | OP0, IMM | T1
 TLEI     | Test LE Immediate          | I      | OP0, IMM | T1
 TLEUI    | Test LE Unsigned Immediate | I      | OP0, IMM | T1
 TLTI     | Test LT Immediate          | I      | OP0, IMM | T1
 TLTUI    | Test LT Unsigned Immediate | I      | OP0, IMM | T1
 TGEI     | Test GE Immediate          | I      | OP0, IMM | T1
 TGEUI    | Test GE Unsigned Immediate | I      | OP0, IMM | T1
 TGTI     | Test GT Immediate          | I      | OP0, IMM | T1
 TGTUI    | Test GT Unsigned Immediate | I      | OP0, IMM | T1

• Test instructions produce true (1) or false (0)

Floating-point arithmetic:

 Mnemonic | Name                        | Format | Sources  | Targets
 FADD     | FP Add                      | G      | OP0, OP1 | T0, T1
 FSUB     | FP Subtract                 | G      | OP0, OP1 | T0, T1
 FMUL     | FP Multiply                 | G      | OP0, OP1 | T0, T1
 FDIV     | FP Divide (simulator only!) | G      | OP0, OP1 | T0, T1

• These operations are performed assuming double-precision values
• All results are rounded as necessary using the default IEEE rounding mode (round-to-nearest)
• No exceptions are reported
• There is no support for gradual underflow or denormal representations


ISA: Floating-Point Test ISA: Floating Point Conversion

Floating-point test:

 Mnemonic | Name       | Format | Sources  | Targets
 FEQ      | FP Test EQ | G      | OP0, OP1 | T0, T1
 FNE      | FP Test NE | G      | OP0, OP1 | T0, T1
 FLE      | FP Test LE | G      | OP0, OP1 | T0, T1
 FLT      | FP Test LT | G      | OP0, OP1 | T0, T1
 FGE      | FP Test GE | G      | OP0, OP1 | T0, T1
 FGT      | FP Test GT | G      | OP0, OP1 | T0, T1

• These operations are performed assuming double-precision values
• Test instructions produce true (1) or false (0)

Floating-point conversion:

 Mnemonic | Name                           | Format | Sources | Targets
 FITOD    | Convert Integer to Double FP   | G      | OP0     | T0, T1
 FDTOI    | Convert Double FP to Integer   | G      | OP0     | T0, T1
 FSTOD    | Convert Single FP to Double FP | G      | OP0     | T0, T1
 FDTOS    | Convert Double FP to Single FP | G      | OP0     | T0, T1

• FITOD converts a 64-bit signed integer to a double-precision float; FDTOI converts a double-precision float to a 64-bit signed integer
• FSTOD converts a single-precision float to a double-precision float; FDTOS converts a double-precision float to a single-precision float
• FSTOD is needed to load a single-precision float value; FDTOS is needed to store one


ISA: Control Flow ISA: Miscellaneous

Control flow:

 Mnemonic | Name               | Format | Sources | Targets
 BR       | Branch             | B      | OP0     | PC
 BRO      | Branch with Offset | B      | OFFSET  | PC +=
 CALL     | Call               | B      | OP0     | PC
 CALLO    | Call with Offset   | B      | OFFSET  | PC +=
 RET      | Return             | B      | OP0     | PC
 SCALL    | System Call        | B      | None    | PC

• Predication should be used for conditional branches
• The offset forms of branch and call produce an address offset (in chunks) rather than a full address
• Call and return behave like normal branches (but may be treated in a special way by the hardware)
• The System Call instruction triggers a System Call exception

Miscellaneous:

 Mnemonic | Name                       | Format | Sources    | Targets
 NULL     | Nullify Output             | G      | None       | T0, T1
 MOV      | Move                       | G      | OP0        | T0, T1
 MOVI     | Move Immediate             | I      | IMM        | T0
 MOV3     | Move to three targets      | M3     | OP0        | T0, T1, T2
 MOV4     | Move to four targets       | M4     | OP0        | T0, T1, T2, T3
 MFPC     | Move from PC               | I      | None       | T0
 GENS     | Generate Signed Constant   | C      | CONST      | T0
 GENU     | Generate Unsigned Constant | C      | CONST      | T0
 APP      | Append Constant            | C      | OP0, CONST | T0
 NOP      | No Operation               | C      | None       | None
 LOCK     | Load and Lock              | L      | OP0        | T0

• NULL is a special instruction for cancelling one or more block outputs
• MOV is the same as RPT (but with a conventional name)
• The generate and append instructions may be used to form large constants
 – Three instructions are required to form a 48-bit constant
 – Four instructions are required to form a 64-bit constant


Bandwidth Summary G-tile: Access to Architected Registers

Bandwidth summary (maximum bandwidth calculations):

 Interface                            | Width    | Freq      | Multiplier  | Bytes/Sec
 Processor Data Cache (Fill + Spill)  | 128 bits | 533 MHz   | 4 * 2       | 68 GB/s
 Processor Data Cache (Load + Store)  | 64 bits  | 533 MHz   | 4 * 2       | 34 GB/s
 Processor Inst Cache (Fetch or Fill) | 128 bits | 533 MHz   | 4           | 34 GB/s
 Processor to OCN Interface           | 128 bits | 533 MHz   | 5 * 2 * 0.8 | 68 GB/s
 On-Chip Network Link (64B Read)      | 128 bits | 533 MHz   | 0.8         | 6.8 GB/s
 On-Chip Memory Bank (64B Read)       | 128 bits | 533 MHz   | 0.8         | 6.8 GB/s
 Chip-to-Chip Network Link (64B Read) | 64 bits  | (266 MHz) | 0.8         | (1.7 GB/s)
 DDR 266 SDRAM (64B Read)             | 64 bits  | 266 MHz   | 0.5         | 1.1 GB/s
 External Bus Interface               | 32 bits  | 66 MHz    | 0.33        | 88 MB/s

G-tile: access to architected registers
• Access to the architected registers is allowed when the processor is halted
• A read/write request arrives over the OCN at the G-tile, which issues it on the OPN; the read reply returns over the OPN and the reply goes back out on the OCN
• Only one read/write request is handled at a time
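As a worked check of the table's arithmetic: the fill + spill interface is 128 bits (16 bytes) wide at 533 MHz, so one port moves 16 B x 533 MHz, roughly 8.5 GB/s, and the 4 * 2 multiplier (four banks, fill plus spill) gives about 68 GB/s, matching the first row; the OCN-link rows instead apply the 0.8 payload factor for a 64B read, 8.5 x 0.8, about 6.8 GB/s.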


OPN: Detailed Router Schematic E-Tile: List of Pipelines

(Figure: detailed OPN router schematic. Four-entry FIFOs per input (N, S, E, W, Local) with per-direction hold signals, head-packet destination decode, an arbiter per output, and the output control and hold signals.)

E-tile: list of pipelines
• Instruction dispatch pipeline (GDN)
 – Update the instruction register file and status buffer
• Operand delivery pipeline (OPN)
 – Update the operand register file and status buffer
• Instruction wakeup and select pipeline
 – Two-stage instruction selection among three potential candidates
   • A definite ready instruction
   • Late selection: an instruction with a remote/local bypassed operand
 – Guarantee the absence of retire conflicts
• Instruction execute and writeback pipeline
 – Pipelined execution
 – Retire local targets through the local bypass
 – Retire remote targets through the OPN
• Instruction flush / commit pipeline (GCN)
 – Clear block state (pipeline registers, status buffer, etc.)


E-tile: Two Stage Select-Execute Pipeline M-Tile: Internal Structure

(Figure: E-tile two-stage select-execute pipeline. In the instruction selection cycle, definite-ready wakeup, issue prioritization, scoreboard access, and instruction/operand register read pick a candidate; in the late selection cycle, instructions woken by previous local or remote bypass control packets compete in final selection; in the execution cycle, immediate data, local bypass data, and OPN remote bypass data are muxed into the execution units.)

(Figure: M-tile internal structure. The incoming OCN interface (dataIn[127:0], typeIn[1:0], validIn[3:0]) and MT decoder feed the MSHR, tag array, data array, and control logic; results leave through the MT encoder and outgoing OCN interface (dataOut[127:0], typeOut[1:0], validOut[3:0]).)


EBC: External Bus Controller SDC: SDRAM Controller Diagram

EBC:
• Provides the interface between the 440 GP and TRIPS chips
• GPIOs for general-purpose I/O
• Synchronizes between the 66 MHz 440 GP bus and the system clock
• Interrupt process:
 – The CPUs and DMAs notify the EBC of an exception through the "intr" interface
 – The interrupt controller calls service routines in the 440 GP via the irq line
 – The 440 GP writes to the halt controller's registers to stop the processors
 – The 440 GP services the interrupt

SDC:
• The SDRAM controller core is provided by IBM
• The SDC interface provides:
 – Protocol conversion between the OCN and the controller's memory interface
 – Synchronization between the SDRAM clock and the OCN clock
 – Address remapping from the system address to the 2GB SDRAMs
 – Error detection and reply for misaligned/out-of-range requests
 – Support for the OCN swap operation


DMA Controller

• Facilitates transfer of data from one set of system addresses to another
• Transfers from and to any addresses except CFG space
• Three types of transfers:
 – Single block, contiguous
 – Single block, scatter/gather
 – Multi-block, linked list
• Up to a 64KB transfer per block
• A 512-byte buffer to optimize OCN traffic
(Figure: DMA controller. The DMA engines and DMA FIFO, configuration registers, and the OCN interface unit behind the OCN client interface.)

