OpenSPARC Slide-Cast in 12 Chapters
Presented by OpenSPARC designers, developers, and programmers
• to guide users as they develop their own OpenSPARC designs, and
• to assist professors as they teach the next generation
This material is made available under the Creative Commons Attribution-Share 3.0 United States License.

Chapter Four: OpenSPARC T2 Overview

Denis Sheahan, Distinguished Engineer, Niagara Architecture Group

Agenda
• Chip overview
• SPARC core
> Execution units
> Power
> RAS
• Crossbar
• L2
• Summary

OpenSPARC T2 Chip Goals

• Double throughput versus OpenSPARC T1
> Doubling cores versus increasing threads per core
> Utilization of execution units
• Improve throughput per watt
• Improve single-thread performance
• Improve floating-point performance
• Maintain SPARC binary compatibility

UltraSPARC T2 Overview
• 8 SPARC cores, 8 threads each, 64 threads total
• Shared 4 MB L2, 8 banks, 16-way associative
• Four dual-channel FBDIMM memory controllers
• Full 8x9 crossbar connects cores to L2 banks / SIU and vice versa
• SIU connects I/O to memory
[UltraSPARC T2 die photo: 8 SPARC cores, 8 L2 data banks with tag arrays, four memory controllers (MCU0–MCU3), crossbar (CCX), SIU, NCU, DMU, PEU, and I/O interface blocks]

UltraSPARC® T2: True System On a Chip

• Up to 8 cores @ 1.2/1.4 GHz
• Up to 64 threads per CPU
• Huge memory capacity
> Up to 512 GB of memory
> Up to 64 Fully Buffered DIMMs
• High memory bandwidth
> 2.5x memory BW = 60+ GB/s
• 8 FPUs: 1 fully pipelined floating-point unit per core
• 4 MB L2$ (8 banks), 16-way
• Security co-processor per core
> DES, 3DES, AES, RC4, SHA-1, SHA-256, MD5, RSA up to 2048-bit keys, ECC, CRC32
• 2x 10 GbE Ethernet
• x8 PCI Express @ 2.5 GHz
• Power: 60–123 W
[Block diagram: four MCUs driving FB-DIMM channels, 8 L2$ banks, full crossbar, cores C0–C7 each with an FPU, NIU (E-net+), Sys I/F buffer switch core, PCI-Ex]

UltraSPARC T2 Architecture: A true system on a chip
• Up to 8 SPARC cores @ 1.0–1.4 GHz
> Up to 64 total threads

> 4 MB, 16-way, 8-bank L2$
• 1 floating-point unit per core
• 1 SPU (crypto unit) per core
• FB-DIMM 1.0 support
• 8-lane PCI Express 1.0 bus interface
• 2 x 1/10 Gb on-chip Ethernet
• Power: < 95 W (nominal)
[Block diagram: dual-channel FB-DIMM channels, memory controllers, 8 L2$ banks, crossbar, cores C1–C8 each with 16 KB I$, 8 KB D$, FPU, and SPU, plus the NIU, Sys I/F buffer switch core, and PCIe]

UltraSPARC T2 “Zero Cost” Security

• One crypto unit integrated per core (eight total)
• Supports the ten most common ciphers and secure hashing functions
• Composed of two independent sub-units that operate in parallel
> Modular Arithmetic Unit (see the sketch below)
> Cipher/Hash Unit
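The Modular Arithmetic Unit accelerates the big-number arithmetic behind public-key operations such as RSA. As a minimal software illustration of that kind of work (not the hardware algorithm, and using 64-bit toy numbers instead of 1024/2048-bit operands), here is a square-and-multiply modular exponentiation in C; the textbook key pair (n = 3233, e = 17, d = 2753) is only there to make the round trip checkable by hand:

```c
#include <stdint.h>
#include <stdio.h>

/* Hedged illustration: the kind of modular exponentiation that dominates RSA,
 * done in plain software with square-and-multiply on 64-bit words.  Real RSA
 * operates on multi-precision integers; the 128-bit intermediate product uses
 * the compiler's __uint128_t extension (GCC/Clang). */
static uint64_t modexp(uint64_t base, uint64_t exp, uint64_t mod)
{
    uint64_t result = 1 % mod;
    base %= mod;
    while (exp != 0) {
        if (exp & 1)
            result = (uint64_t)(((__uint128_t)result * base) % mod);
        base = (uint64_t)(((__uint128_t)base * base) % mod);
        exp >>= 1;
    }
    return result;
}

int main(void)
{
    /* Toy RSA-style round trip: c = m^e mod n, then m = c^d mod n. */
    uint64_t n = 3233, e = 17, d = 2753, m = 65;   /* n = 61 * 53 */
    uint64_t c = modexp(m, e, n);
    printf("cipher = %llu, decrypted = %llu\n",
           (unsigned long long)c, (unsigned long long)modexp(c, d, n));
    return 0;
}
```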

Integrated Multithreaded 10 GbE
• Dual, multithreaded 10 GbE (XAUI)
> Up to 4x the performance of current network interface cards
> 16 Rx and Tx DMA channels for virtualization
• Limited classification
> Classified at layers 2, 3, and 4 into an Rx DMA buffer to match the flow
• Benefits
> Eliminates network I/O bottlenecks
> Enables faster network access
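The slide does not spell out the classifier's rules; as a rough sketch of the idea of steering all packets of one flow to one of the 16 Rx DMA channels, the C fragment below hashes hypothetical layer 2/3/4 header fields. The field names, the FNV-1a hash, and the simple fold into 16 channels are all assumptions for illustration, not the NIU's actual algorithm:

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical flow key: the layer 2/3/4 fields a classifier might consider. */
struct flow_key {
    uint8_t  dst_mac[6];            /* layer 2 */
    uint32_t src_ip, dst_ip;        /* layer 3 */
    uint16_t src_port, dst_port;    /* layer 4 */
    uint8_t  proto;
};

/* FNV-1a accumulation over one field. */
static void mix(uint32_t *h, const void *data, size_t len)
{
    const uint8_t *p = data;
    for (size_t i = 0; i < len; i++) { *h ^= p[i]; *h *= 16777619u; }
}

/* Hash the flow key and fold into one of 16 Rx DMA channels, so packets of
 * the same flow always land in the same channel. */
static unsigned classify_to_rx_channel(const struct flow_key *k)
{
    uint32_t h = 2166136261u;       /* FNV-1a offset basis */
    mix(&h, k->dst_mac, 6);
    mix(&h, &k->src_ip, 4);  mix(&h, &k->dst_ip, 4);
    mix(&h, &k->src_port, 2); mix(&h, &k->dst_port, 2);
    mix(&h, &k->proto, 1);
    return h & 0xF;                 /* 16 channels */
}

int main(void)
{
    struct flow_key k = { {0,1,2,3,4,5}, 0x0A000001, 0x0A000002, 12345, 80, 6 };
    printf("flow -> Rx DMA channel %u\n", classify_to_rx_channel(&k));
    return 0;
}
```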

Integrated Floating-Point Unit
• Each UltraSPARC T2 core has its own floating-point unit
• Fully pipelined (except divide/sqrt)
> Divide/sqrt proceeds in parallel with add or multiply operations of other threads
• Full VIS 2.0 implementation
• FPU also performs integer multiply, divide, and population count

UltraSPARC T2: 7 World Records
Built on a heritage of network throughput
• Standard performance benchmarks
> SPECint_Rate2006 (single chip)
> SPECfp_Rate2006 (single chip)
> Web performance: SPECweb2005
> Unix VM (single socket): SPECjbb2005
> Java app server: SPECjAppServer2004 (dual node)
> Unix ERP platform: single-socket SAP SD 2-tier
> OLTP platform: database tier
[Chart: SPECjAppServer2004 dual-node result]

See disclosures.

OpenSPARC T2 Block Diagram

[Block diagram: SPARC cores 0–7 connect through an 8x9 cache crossbar to L2 banks 0–7; each pair of L2 banks shares one of four memory controllers, each driving FBDIMM channels; the System Interface Unit connects I/O into the crossbar and L2.]

OpenSPARC T1 to T2 Core Changes

• Increase threads from 4 to 8 in each core
• Increase execution units from 1 to 2 in each core
• Floating-Point and Graphics Unit in each core
• New pipe stage: pick
> Choose 2 threads out of 8 to execute each cycle
• Instruction buffers after the L1 instruction cache for each thread
• Increase set associativity of the L1 instruction cache to 8
• Increase size of the fully associative DTLB from 64 to 128 entries
• Hardware tablewalk for ITLB and DTLB misses
• Speculate that branches are not taken

OpenSPARC T1 to T2 Chip Changes
• Increase L2 banks from 4 to 8
> 15 percent performance loss with only 4 banks and 64 threads
• FBDIMM memory interface replaces DDR2
> Saves pins
> Improved bandwidth: 42 GB/s read, 21 GB/s write
> Improved capacity (512 GB)
• RAS changes (to match the T1 FIT rate)

SPARC Core Block Diagram
• IFU – Instruction Fetch Unit
> 16 KB I$, 32 B lines, 8-way set-associative
> 64-entry fully associative ITLB
• EXU0/1 – Integer Execution Units
> 4 threads share each unit
> Executes one instruction/cycle
• LSU – Load/Store Unit
> 8 KB D$, 16 B lines, 4-way set-associative
> 128-entry fully associative DTLB
• FGU – Floating-Point and Graphics Unit
• TLU – Trap Logic Unit
> Updates machine state, handles exceptions and interrupts
• MMU – Memory Management Unit
> Hardware tablewalk (HWTW)
> 8 KB, 64 KB, 4 MB, 256 MB pages
• Gasket arbitrates between the core units for the crossbar interface
[Core block diagram: IFU, TLU, EXU0/EXU1, LSU, FGU, MMU/HWTW, and gasket to crossbar/L2]
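As a worked example of the cache parameters quoted above, this small C program derives the set counts and address bit fields they imply (standard cache arithmetic, not OpenSPARC RTL):

```c
#include <stdio.h>

/* Derive sets and address bit-field widths from a cache's size, line size,
 * and associativity.  The sizes below come from the slide. */
static void describe(const char *name, unsigned size, unsigned line, unsigned ways)
{
    unsigned sets = size / (line * ways);
    unsigned offset_bits = 0, index_bits = 0;
    while ((1u << offset_bits) < line) offset_bits++;
    while ((1u << index_bits) < sets)  index_bits++;
    printf("%s: %u sets, %u offset bits, %u index bits, tag = remaining address bits\n",
           name, sets, offset_bits, index_bits);
}

int main(void)
{
    describe("I$ (16 KB, 32 B lines, 8-way)", 16 * 1024, 32, 8);  /* 64 sets  */
    describe("D$ ( 8 KB, 16 B lines, 4-way)",  8 * 1024, 16, 4);  /* 128 sets */
    return 0;
}
```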

SPARC Core Pipeline
• 8-stage integer pipeline

Fetch → Cache → Pick → Decode → Execute → Mem → Bypass → W

> 3-cycle load-use penalty
> Memory stage: data address translation, tag/data array access
> Bypass stage: late way select, data formatting, data forwarding
• 12-stage floating-point pipeline

Fetch → Cache → Pick → Decode → Execute → Fx1 → Fx2 → Fx3 → Fx4 → Fx5 → FB → FW

> 6-cycle latency for dependent FP instructions
> Longer pipeline for divide/sqrt
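To see how these numbers combine, the sketch below charges a toy single-thread instruction sequence 1 cycle per simple op, 1 + 3 cycles before a dependent instruction can consume a load, and 6 cycles between dependent FP ops. It is a back-of-the-envelope timing model under those assumptions, not a model of the real multithreaded pipeline:

```c
#include <stdio.h>
#include <string.h>

/* Tiny in-order timing sketch using the latencies on this slide:
 * 1 cycle for simple integer ops, 1 + 3-cycle load-use penalty for loads,
 * 6 cycles between dependent FP ops. */

struct insn { const char *op; int dest, src; };   /* src = -1 means no dependency */

int main(void)
{
    struct insn prog[] = {
        { "load", 1, -1 },   /* ldd  [addr], %f1                    */
        { "fmul", 2,  1 },   /* fmuld %f1, ..., %f2  (waits on load) */
        { "fadd", 3,  2 },   /* faddd %f2, ..., %f3  (dependent FP)  */
        { "add",  4, -1 },   /* independent integer op               */
    };
    int ready[8] = {0};      /* cycle when each destination value is available */
    int cycle = 0;

    for (unsigned i = 0; i < sizeof prog / sizeof prog[0]; i++) {
        int lat = 1;                                  /* simple integer op    */
        if (!strcmp(prog[i].op, "load")) lat = 1 + 3; /* load-use penalty     */
        if (prog[i].op[0] == 'f')        lat = 6;     /* dependent FP latency */

        int issue = cycle + 1;                        /* at most 1 insn/cycle */
        if (prog[i].src >= 0 && ready[prog[i].src] > issue)
            issue = ready[prog[i].src];               /* stall for the operand */

        ready[prog[i].dest] = issue + lat;
        printf("%-4s issues at cycle %2d, result ready at cycle %2d\n",
               prog[i].op, issue, ready[prog[i].dest]);
        cycle = issue;
    }
    return 0;
}
```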

Integer and Load/Store Pipeline
[Pipeline diagram: the IFU's Fetch and Cache stages feed per-thread instruction buffers (IB0–IB7) organized into two thread groups (TG0, TG1); each group flows through its own Pick, Decode, and Execute stages, with loads and stores passing through the LSU's Mem, Bypass, and Writeback stages.]

Threaded Execution and Thread Groups
[Pipeline diagram: the same pipeline shown with a different thread number occupying each stage (for example F2, C6, P0/P5, D2/D7, E0/E6), illustrating how instructions from different threads in the two thread groups interleave cycle by cycle.]

Instruction Fetch
• Instruction cache and fetch shared between the eight threads
• Fetch up to four instructions per cycle
> Each thread is in a ready or wait state

> Wait state caused by: TLB miss, cache miss, or instruction buffer full
> Least-recently fetched among ready threads
> One instruction buffer per thread
• Branches assumed to be not-taken; 5-cycle penalty if taken
> T1 switched threads if a branch or load was fetched
• Limited I$ miss prefetching
• Pick and Decode decoupled from Fetch by the instruction buffers
[Diagram: fetch unit with address generation, 16 KB 8-way I$, ITLB and miss logic, two groups of per-thread instruction buffers (4 x 8), Pick 0/1, Decode 0/1, EXU0/EXU1, and the gasket]

Instruction Pick and Decode

• Threads divided into two groups of four threads each
• One instruction from each thread group picked each cycle
> Least-recently picked among ready threads within a thread group (see the sketch below)
> Wait states: dependency, D$ miss, DTLB miss, divide/sqrt, ...
> Priority given to nonspeculative threads (e.g., no outstanding load)
• Decode resolves conflicts
> Each thread group picks independently of the other
> Both thread groups may pick load/store or FGU instructions in the same cycle
• Independent instructions can be picked after loads (loads assumed to hit)
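A minimal sketch of the "least recently picked among ready threads" policy for one four-thread group follows; the timestamp-based bookkeeping is an illustrative software stand-in for the hardware pick logic:

```c
#include <stdbool.h>
#include <stdio.h>

/* Each cycle, pick the ready thread in the group that was picked longest ago. */

#define THREADS_PER_GROUP 4

static int pick(const bool ready[THREADS_PER_GROUP],
                int last_picked[THREADS_PER_GROUP], int cycle)
{
    int choice = -1;
    for (int t = 0; t < THREADS_PER_GROUP; t++)
        if (ready[t] && (choice < 0 || last_picked[t] < last_picked[choice]))
            choice = t;
    if (choice >= 0)
        last_picked[choice] = cycle;
    return choice;              /* -1 means no thread was ready this cycle */
}

int main(void)
{
    int  last_picked[THREADS_PER_GROUP] = { -1, -1, -1, -1 };
    bool ready[THREADS_PER_GROUP]       = { true, true, false, true };

    for (int cycle = 0; cycle < 6; cycle++) {
        if (cycle == 3) ready[2] = true;    /* thread 2 leaves its wait state */
        printf("cycle %d: picked thread %d\n", cycle, pick(ready, last_picked, cycle));
    }
    return 0;
}
```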

Execution Unit
• Executes integer operations and some graphics operations
• Generates addresses for loads and stores
• Adder, logic unit, and shifter

• Each EXU contains state for four threads
> Integer register file (IRF)
> 8 register windows per thread
> 4 global levels per thread
> Window or global-level change requires multiple cycles (but is pipelined)
> Register window management logic (RML)
[Datapath diagram: IRF, bypass network, shifter, ALU, and RML, connected to the LSU and FGU]
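For reference, the per-thread register count implied by "8 windows, 4 global levels" follows from standard SPARC V9 window arithmetic: each window adds 8 locals plus 8 shared in/out registers, and each global level adds 8 globals. A one-line check:

```c
#include <stdio.h>

/* Standard SPARC V9 register-window arithmetic applied to the per-thread
 * state quoted on this slide (8 windows, 4 global levels).  A window's outs
 * are the next window's ins, so each window only adds 16 unique registers. */
int main(void)
{
    int windows = 8, global_levels = 4;
    int windowed = 16 * windows;        /* 8 locals + 8 shared in/out per window */
    int globals  = 8 * global_levels;
    printf("unique integer registers per thread: %d\n", windowed + globals); /* 160 */
    return 0;
}
```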

Load/Store Unit
• One load or store per cycle
• Store-through (write-through) D$
• D$ allocates on load misses, updates on store hits
• Load Miss Queue (LMQ) supports one pending load miss per thread
• Store buffer (STB) contains 8 stores per thread (see the sketch below)
> Stores to the same L2 cache line are pipelined to L2
• Arbiter for the crossbar chooses between load misses and stores
> Fairness between threads, loads, and stores
[Datapath diagram: the VA is translated by the DTLB; the 8 KB 4-way D$ tags and data are accessed; load addresses are compared against the STB for RAW bypass; the LMQ and STB feed the gasket to the crossbar/L2, and fill/ACK returns update the D$ and IRF]
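The sketch below illustrates the read-after-write bypass role of a per-thread store buffer: a load checks buffered stores, newest first, before going to the D$. The 8-entry depth comes from the slide; the whole-word matching and the API are simplifications for illustration:

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define STB_ENTRIES 8

struct stb {
    uint64_t addr[STB_ENTRIES];
    uint64_t data[STB_ENTRIES];
    int count;                       /* stores buffered, oldest at index 0 */
};

static void stb_push(struct stb *s, uint64_t addr, uint64_t data)
{
    if (s->count < STB_ENTRIES) {    /* a full buffer would stall the thread */
        s->addr[s->count] = addr;
        s->data[s->count] = data;
        s->count++;
    }
}

/* RAW bypass: newest matching store wins; otherwise the load reads the D$. */
static bool load_bypass(const struct stb *s, uint64_t addr, uint64_t *data)
{
    for (int i = s->count - 1; i >= 0; i--)
        if (s->addr[i] == addr) { *data = s->data[i]; return true; }
    return false;
}

int main(void)
{
    struct stb s = { .count = 0 };
    stb_push(&s, 0x1000, 0xAA);
    stb_push(&s, 0x1000, 0xBB);      /* later store to the same address */

    uint64_t v;
    if (load_bypass(&s, 0x1000, &v))
        printf("load 0x1000 bypassed from store buffer: 0x%llx\n",
               (unsigned long long)v);   /* prints 0xBB */
    return 0;
}
```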

Floating-Point and Graphics Unit
• Fully pipelined (except divide/sqrt)
> Divide/sqrt proceeds in parallel with add or multiply operations of other threads
• Full VIS 2.0 implementation
• FGU performs integer multiply, divide, and population count
• FGU predicts exceptions in the Fx1 stage
[Datapath diagram: FGU register file (8 threads x 32 x 64 bits, 2 write / 2 read ports) feeds rs1/rs2 operands into the Fx1–Fx5 pipeline with Add, Mul, and Div/Sqrt units; load data writes the register file, and results return as store data or integer results at FB]

Memory Management Unit

• Hardware tablewalk of up to 4 Translation Storage Buffers (TSBs), a.k.a. page tables
> Each TSB supports one page size
• Three search modes:
> Sequential – search TSBs in order
> Burst – search TSBs in parallel
> Prediction – use the VA to predict which TSB to search (see the sketch below)
> A two-bit predictor orders the first two TSB searches
• Up to 8 pending misses
> One ITLB or DTLB miss per thread
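A much-simplified model of the prediction mode is sketched below: a table of two-bit saturating counters, indexed by a few VA bits, picks which of two TSBs to probe first and is trained by where the translation was actually found. The table size, index bits, and update rule are assumptions, not the T2's documented predictor:

```c
#include <stdint.h>
#include <stdio.h>

#define PRED_ENTRIES 64

/* 2-bit saturating counters: 0..1 -> try TSB0 first, 2..3 -> try TSB1 first. */
static uint8_t pred[PRED_ENTRIES];

static int predict_first_tsb(uint64_t va)
{
    return pred[(va >> 13) % PRED_ENTRIES] >= 2;   /* 13 = 8 KB page offset bits */
}

/* Train toward whichever TSB actually held the translation. */
static void train(uint64_t va, int hit_tsb)
{
    uint8_t *c = &pred[(va >> 13) % PRED_ENTRIES];
    if (hit_tsb == 1) { if (*c < 3) (*c)++; }
    else              { if (*c > 0) (*c)--; }
}

int main(void)
{
    uint64_t va = 0x7f1234568000ULL;

    printf("first TSB to search: %d\n", predict_first_tsb(va));
    train(va, 1);                        /* translation was found in TSB1 */
    train(va, 1);
    printf("after training, first TSB to search: %d\n", predict_first_tsb(va));
    return 0;
}
```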

Core Power Management
• Minimal speculation
> Next sequential I$ line prefetch
> Predict branches not-taken
> Predict loads hit in D$
> Pick independent instructions after loads
> Hardware tablewalk search control
• Extensive clock gating
> Datapath
> Control blocks
> Arrays
• External power throttling
> Add stall cycles at the decode stage (see the sketch below)
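As a toy illustration of decode-stage throttling, the sketch below inserts a fixed number of stall cycles per decoded instruction for a hypothetical set of throttle levels and reports the resulting fraction of the full instruction rate; the level-to-stall mapping is invented, not the T2's actual throttle encoding:

```c
#include <stdio.h>

/* Toy model: each throttle level inserts stall cycles after every decoded
 * instruction, trading throughput for power. */
int main(void)
{
    int stalls_per_decode[] = { 0, 1, 3, 7 };   /* hypothetical throttle levels */
    int instructions = 1000;

    for (int level = 0; level < 4; level++) {
        int cycles = instructions * (1 + stalls_per_decode[level]);
        printf("level %d: %4d cycles for %d instructions (%.0f%% of full rate)\n",
               level, cycles, instructions, 100.0 * instructions / cycles);
    }
    return 0;
}
```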

Core Reliability and Serviceability
• Extensive RAS features
> Parity protection on I$ and D$ tags and data, ITLB and DTLB CAM and data, and store buffer addresses
> ECC on the integer RF, floating-point RF, store buffer data, trap stack, and other internal arrays
• Combination of hardware and software correction flows
> Hardware re-fetch for I$, D$ errors
> ECC errors inside the core are corrected by software
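A minimal example of the parity protection mentioned above: compute even parity over a 64-bit word when it is written and re-check it on read, detecting any single-bit flip. (The ECC-protected arrays would use a SECDED code instead, which is not shown; the array layout here is generic, not the core's.)

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Even parity over a 64-bit word: the stored parity bit makes the total
 * number of 1 bits even, so any single-bit flip is detected on read. */
static unsigned parity_bit(uint64_t w)
{
    w ^= w >> 32; w ^= w >> 16; w ^= w >> 8;
    w ^= w >> 4;  w ^= w >> 2;  w ^= w >> 1;
    return (unsigned)(w & 1);
}

int main(void)
{
    uint64_t data   = 0x123456789ABCDEF0ULL;
    unsigned stored = parity_bit(data);          /* written alongside the data */

    data ^= 1ULL << 17;                          /* single-bit upset */
    bool error = (parity_bit(data) != stored);   /* check on read */
    printf("parity error detected: %s\n", error ? "yes" : "no");
    return 0;
}
```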

Crossbar
• Two complementary, non-blocking, pipelined switches

> PCX – processor-to-cache crossbar (~90 GB/s)
> CPX – cache-to-processor crossbar (~180 GB/s)
• 8 load/store requests and 8 data returns can be handled at the same time
• Arbitration for a target is required
• Priority given to the oldest requestor to maintain fairness and order (see the sketch below)
• Three-cycle arbitration protocol
> Request, arbitrate, and grant
• Supports 8-byte writes from a core to a bank
• Supports 16-byte reads from a bank to a core
[Diagram: SPARC cores 0–7 connect through per-bank PCX muxes to L2 banks 0–7 and back through the CPX]
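The sketch below models only the "oldest requestor wins" policy for a single crossbar target: each core's outstanding request carries its arrival cycle, and the earliest one is granted each round. The three-cycle request/arbitrate/grant pipelining and the real queueing are not modeled:

```c
#include <stdio.h>

#define CORES 8

int main(void)
{
    /* arrival[i] = cycle when core i requested this L2 bank; -1 = no request */
    int arrival[CORES] = { 5, -1, 2, 2, -1, 9, -1, 7 };

    for (int round = 0; round < 4; round++) {
        int winner = -1;
        for (int c = 0; c < CORES; c++)
            if (arrival[c] >= 0 && (winner < 0 || arrival[c] < arrival[winner]))
                winner = c;                 /* oldest request; ties go to the lower core */
        if (winner < 0) break;
        printf("grant to core %d (requested at cycle %d)\n", winner, arrival[winner]);
        arrival[winner] = -1;               /* request serviced */
    }
    return 0;
}
```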

L2 Cache
• 4 MB L2 cache
> 16-way set associative
> 8 L2 banks
> 64-byte line size
> T1: 3 MB, 12 ways, 4 banks
• L2 cache is write-back, write-allocate
> L1 data cache is write-through
• Support for partial stores
• L2 cache manages coherency
> Maintains directories for all 16 L1 caches
• 16-byte data transfers to the cores
[Per-bank diagram: PCX requests enter an input queue and arbiter; tag, valid, and directory arrays are looked up; hits read the data array and return 16 B on the CPX through the output queue; misses allocate in the miss buffer, evictions go through the write-back buffer, and 64 B line fills come from memory through the fill buffer; an I/O write buffer handles SIU traffic]
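As a worked example of the L2 geometry above (4 MB, 16-way, 8 banks, 64-byte lines), this program breaks a physical address into line offset, bank select, set index, and tag. The field positions (bank bits just above the line offset) are a common convention assumed for illustration rather than the documented T2 mapping:

```c
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    const unsigned line = 64, ways = 16, banks = 8;
    const unsigned total = 4u << 20;                                 /* 4 MB */
    const unsigned sets_per_bank = total / banks / (line * ways);    /* 512  */

    uint64_t pa = 0x12345678ULL;
    unsigned offset = pa & (line - 1);                 /* bits [5:0]  */
    unsigned bank   = (pa >> 6) & (banks - 1);         /* bits [8:6]  */
    unsigned set    = (pa >> 9) & (sets_per_bank - 1); /* bits [17:9] */
    uint64_t tag    = pa >> 18;                        /* remaining bits */

    printf("sets per bank = %u\n", sets_per_bank);
    printf("PA 0x%llx -> bank %u, set %u, offset %u, tag 0x%llx\n",
           (unsigned long long)pa, bank, set, offset, (unsigned long long)tag);
    return 0;
}
```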

Summary
• >2x throughput and throughput/watt versus OpenSPARC T1
• Greatly improved floating-point performance
• Significantly improved integer performance
