ULTRASPARC T1
FORMERLY KNOWN AS “NIAGARA”

This work supported by UNSW and HP through the Gelato Federation

SPARC HISTORY
➜ Scalable Processor ARCHitecture
➜ 1985 –
➜ Berkeley RISC – 1980-1984
➜ MIPS – 1981-1984
Architecture v Implementation:
➜ SPARC Architecture
➜ SPARC V7 – 1986
➜ SPARC International, Ltd – 1989
➜ SPARC V8 – 1990
➜ SPARC V9 – 1994

SUN + SPARC = ULTRASPARC

Processor          Cores   Threads/Core  Clock           L1D    L1I        L2 Cache
UltraSPARC IIi     1       1             550MHz, 650MHz  16KiB  16KiB      512KiB
UltraSPARC IIIi    1       1             1.593GHz        I      D          1MB (a)
UltraSPARC III     1       1             1.05-1.2GHz     64KiB  32KiB      8MiB (b)
UltraSPARC IV      2 (c)   1             1.05-1.35GHz    64KiB  32KiB      16MiB (d)
UltraSPARC IV+     1       2             1.5GHz          I      D          2MiB (e)
UltraSPARC T1      8       4             1.2GHz          32KiB  16KiB (f)  3MiB (g)
UltraSPARC T2 (h)  16 (?)  8             2GHz+ (?)       ?      ?          ?

(a) On-chip
(b) External, on chip tags
(c) UltraSPARC III cores
(d) 8MiB per core
(e) 32MiB off chip L3
(f) I/D Cache per core
(g) 4 way banked
(h) Second-half 2007

INSTRUCTION SET
➜ RISC!
➜ Load–store only through registers
➜ Fixed size instructions (32 bits)
➜ register + register
➜ register + 13 bit immediate
➜ Branch delay slot
✗ Condition Codes
✓ (V9) CC and non-CC instructions
✓ (V9) Compare on integer registers
➜ Synthesised instructions
➜ Privileged v Non-Privileged
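With fixed 32 bit instructions and only a 13 bit immediate, a full 32 bit constant cannot be loaded by one instruction. A minimal C sketch of the usual two-instruction split follows: sethi supplies the upper 22 bits and or supplies the lower 10. The helper names (hi22, lo10, encode_sethi, sethi_or) are illustrative and do not come from the slides.

```c
#include <stdint.h>

/* Upper 22 bits of a 32-bit constant: the field sethi places in bits 31..10. */
uint32_t hi22(uint32_t value)
{
    return value >> 10;
}

/* Lower 10 bits: small enough to fit the 13-bit immediate of `or`. */
uint32_t lo10(uint32_t value)
{
    return value & 0x3ffu;
}

/* Encode `sethi %hi(value), rd`: op = 00, rd in bits 29..25, op2 = 100, imm22. */
uint32_t encode_sethi(uint32_t value, unsigned rd)
{
    return ((uint32_t)(rd & 0x1fu) << 25) | (0x4u << 22) | hi22(value);
}

/* Value left in the destination register after the sethi/or pair. */
uint32_t sethi_or(uint32_t value)
{
    uint32_t reg = hi22(value) << 10;   /* sethi: low 10 bits are zeroed */
    return reg | lo10(value);           /* or:    fills in the low 10 bits */
}
```

For 0xdeadbeef the split is 0xdeadbc00 plus 0x2ef, and encode_sethi(0xdeadbeef, 1) gives 0x0337ab6f (%g1 is register 1), the same bytes 03 37 ab 6f that appear in the CODE EXAMPLE disassembly.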

CODE EXAMPLE

    void addr(void)
    {
        int i = 0xdeadbeef;
    }

    00000054 <addr>:
      54: 9d e3 bf 90   save  %sp, -112, %sp
      58: 03 37 ab 6f   sethi %hi(0xdeadbc00), %g1
      5c: 82 10 62 ef   or    %g1, 0x2ef, %g1
      60: c2 27 bf f4   st    %g1, [ %fp + -12 ]
      64: 81 e8 00 00   restore
      68: 81 c3 e0 08   retl
      6c: 01 00 00 00   nop

REGISTERS
➜ Large
➜ Fixed register windows
➜ Registers renamed
➜ save and restore

[Figure: caller and callee register windows within the register file. The caller's %o0 − %o7 are the callee's %i0 − %i7; each window has its own %l0 − %l7. Figure labels: "%r8 <− %r32", "%r32 −> %r8".]

General        Windowed     Description
%r0 - %r7      %g0 - %g7    Global (all)
%r8 - %r15     %o0 - %o7    Window output
%r16 - %r23    %l0 - %l7    Window local
%r24 - %r31    %i0 - %i7    Window input

V9 REGISTER WINDOWS

REGISTER WINDOWS TODAY
✗ (V8) Window buffer needs to be flushed
✗ Kernel code has deep call chains
    ➜ Walks up and down a lot
✗ Question over studies showing advantages
    ➜ State required for C compared to higher-level languages
✗ Less is not more
    ➜ Superscalar needs a lot of registers
✓ Variable sized windows
✓ RSE (Register Stack Engine) deals with fill/spill
✓ Allows for growth of underlying register file
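The renaming in the REGISTERS table can be sketched as index arithmetic over a circular register file. This is a simplified model, not the hardware layout: the window count (NWINDOWS = 8), the slot numbering and the function names are all assumptions made for illustration. It shows only the key property that a caller's %o registers and its callee's %i registers are the same physical slots.

```c
#include <assert.h>

enum { NWINDOWS = 8, REGS_PER_WINDOW = 16 };   /* NWINDOWS = 8 is an assumed size */

/*
 * Map (current window, %r number 8..31) to a slot in a circular windowed
 * register file. The globals %r0..%r7 are not windowed and are left out.
 */
int window_slot(int cwp, int reg)
{
    assert(cwp >= 0 && cwp < NWINDOWS && reg >= 8 && reg <= 31);
    int total = NWINDOWS * REGS_PER_WINDOW;
    int base  = cwp * REGS_PER_WINDOW;

    if (reg >= 24)                                          /* %i0..%i7 */
        return (base + (reg - 24)) % total;
    if (reg >= 16)                                          /* %l0..%l7 */
        return (base + 8 + (reg - 16)) % total;
    return (base + REGS_PER_WINDOW + (reg - 8)) % total;    /* %o0..%o7 */
}

/* save moves to the next window and restore moves back (V9 direction). */
int cwp_after_save(int cwp)
{
    return (cwp + 1) % NWINDOWS;
}

int cwp_after_restore(int cwp)
{
    return (cwp + NWINDOWS - 1) % NWINDOWS;
}
```

In this model window_slot(w, 8) equals window_slot(cwp_after_save(w), 24), so an argument written to %o0 before a call is read as %i0 after the callee's save, with no copying. Because the file is circular, the last window wraps onto the first, which is why a real implementation has to spill and fill windows.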

CONTEXTS
➜ Multiple Address spaces
➜ Primary, Alternate and explicit load instructions

[Figure: Context Registers (Primary ASID, Secondary ASID, Nucleus ASID); Address Spaces; Processes; Operating System]

TSB
➜ Translation Store Buffer is a direct mapped cache of ...
➜ Translation Table Entries
➜ sun4u, sun4v
➜ Hardware pre-computes index into TSB (for 2 specified page sizes)
➜ Software in fast fault handler can check if TTE valid

[Figure: Virtual Address Space; Small Pages map to TTEs in the Small Page TSB, Large Pages map to TTEs in the Large Page TSB]

THROUGHPUT COMPUTING

[Figure: stacked bar chart. Y axis: Percent of Total Issue Cycles (0 to 100). X axis: Applications (li, ora, swm, doduc, nasa7, fpppp, alvinn, su2cor, eqntott, hydro2d, mdljdp2, mdljsp2, tomcatv, espresso, composite). Legend: memory conflict, long fp, short fp, long integer, short integer, load delays, control hazards, branch misprediction, dcache miss, icache miss, dtlb miss, itlb miss, processor busy]

➜ An 8 issue machine requires 1/8th (12.5%) of issue slots filled for CPI 1

ULTRASPARC T1
➜ 8 cores
➜ 4 threads / core – group
➜ 32 way multi-threaded
➜ 5.76 IPC (CPI 0.17, efficiency 71%)
➜ Good luck finding the clock speed (1.2GHz)
➜ 70 Watts – “Green Processor”
➜ UltraSPARC Architecture 2005
➜ Same underlying principles as SPARC V9
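Returning to the TSB slide: a direct mapped cache means one probe per lookup, with the index taken straight from the virtual page number. The C sketch below is a software model of that idea only. The entry layout, TSB_ENTRIES, PAGE_SHIFT and every function name are assumptions for illustration; the real TTE format and the hardware-computed TSB pointer are defined by the sun4u/sun4v architecture documents, not by this code.

```c
#include <stddef.h>
#include <stdint.h>

#define TSB_ENTRIES 512u   /* assumed size; must be a power of two */
#define PAGE_SHIFT  13u    /* assumed 8KiB small pages */

/* Hypothetical TTE: a tag (context + virtual page number) and the data. */
struct tte {
    uint64_t vpn;     /* tag: virtual page number */
    uint16_t ctx;     /* tag: context (ASID) */
    int      valid;
    uint64_t data;    /* translated page address */
};

/* Stand-in for the index that the hardware pre-computes on a TLB miss. */
size_t tsb_index(uint64_t va)
{
    return (size_t)((va >> PAGE_SHIFT) & (TSB_ENTRIES - 1u));
}

/* Fast-path check: one probe, no search. NULL means take the slow path. */
const struct tte *tsb_lookup(const struct tte *tsb, uint64_t va, uint16_t ctx)
{
    const struct tte *e = &tsb[tsb_index(va)];
    if (e->valid && e->ctx == ctx && e->vpn == (va >> PAGE_SHIFT))
        return e;
    return NULL;
}

/* Direct mapped: a new entry simply overwrites whatever shared its slot. */
void tsb_insert(struct tte *tsb, uint64_t va, uint16_t ctx, uint64_t data)
{
    struct tte *e = &tsb[tsb_index(va)];
    e->vpn   = va >> PAGE_SHIFT;
    e->ctx   = ctx;
    e->data  = data;
    e->valid = 1;
}
```

The trade-off of direct mapping shows up in tsb_insert: two pages whose page numbers differ by a multiple of TSB_ENTRIES evict each other, but the fast fault handler never needs more than one memory access to decide whether the TTE is valid.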

ULTRASPARC T1 PIPELINE

[Figure: pipeline stages Fetch, Thread select, Decode, Execute, Memory, Writeback. Fetch: ICache, ITLB, PC logic × 4, Instruction buffer × 4. Thread select: two thread select muxes driven by the thread select logic, whose inputs are Instruction type, Misses, Traps and interrupts, Resource conflicts. Decode: Decode, Register file × 4. Execute: ALU, MUL, Shifter, DIV. Memory: DCache, DTLB, store buffers × 4. Writeback: Crossbar interface]

CHIP RESOURCES
➜ Per Thread
    ➜ Registers
    ➜ Working Set and Architectural Set
    ➜ Instruction Buffer
➜ Per Core
    ➜ L1I 16KiB, 4-way set associative, 32 byte lines
    ➜ L1D 8KiB, 4-way set associative, 16 byte lines
    ➜ I/DTLB 64-entry, fully associative
    ➜ Execution Units
➜ Shared
    ➜ L2 3MiB, 12-way set associative, 4-way banked
    ➜ I/O

CHIP RESOURCES (2)

[Figure: eight Sparc pipes (each 4-way MT) connected through a Crossbar to four L2 banks (L2 B0 to L2 B3), each with a Dram control channel (Channel 0 to Channel 3) to DDR; I/O and shared functions; I/O interface]

THREAD SWITCHING
➜ Default – switch per cycle, LRU
➜ Other heuristics go into thread swapping logic
    ➜ Predecoded Information – long latency instructions
    ➜ Traps – system calls, exceptions
    ➜ Resource Conflicts – execution resources
    ➜ Cache Miss
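The default policy above (switch every cycle, least recently used first, skipping threads that cannot issue) can be sketched in a few lines of C. This is a behavioural model under stated assumptions, not the T1 select logic: the struct, its fields and select_thread are invented names, and the ready flag stands in for all of the listed heuristics (long latency instructions, traps, resource conflicts, cache misses).

```c
#include <stdbool.h>

enum { THREADS_PER_CORE = 4 };

struct thread_select {
    unsigned long last_issue[THREADS_PER_CORE];  /* cycle each thread last issued */
    bool          ready[THREADS_PER_CORE];       /* false while the thread is stalled */
    unsigned long cycle;                         /* cycles elapsed so far */
};

/* Pick the ready thread that issued least recently; -1 if none is ready. */
int select_thread(struct thread_select *ts)
{
    int pick = -1;

    for (int t = 0; t < THREADS_PER_CORE; t++) {
        if (!ts->ready[t])
            continue;
        if (pick < 0 || ts->last_issue[t] < ts->last_issue[pick])
            pick = t;
    }

    ts->cycle++;
    if (pick >= 0)
        ts->last_issue[pick] = ts->cycle;   /* picked thread becomes most recent */
    return pick;
}
```

With all four threads ready this degenerates to round robin (0, 1, 2, 3, 0, ...). Marking a thread not ready, as a cache miss would, removes it from the rotation without costing the others a cycle, which is the point of switching per cycle.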

HYPERVISOR
➜ Unprivileged, Privileged, Hyperprivileged
➜ The hypervisor is the “hardware”
➜ API and source code published
➜ Hyperprivileged resources
    ➜ MMU
    ➜ Interrupts
    ➜ PCI
➜ Machine Description mechanism
➜ Fast and Slow trap mechanisms to privileged mode
➜ VA extended with partition ID

HYPERVISOR IN ACTION

[Figure: translation path through the hypervisor.
 User requests VA (Virtual Address = VPN + Offset)
 ASID taken from primary or secondary context register (Context Registers: Primary ASID, Secondary ASID, Nucleus ASID)
 Hypervisor intercepts: calculate TSB offset, raise OS fault
 Operating System: TSB entry with TAG (ASID, VPN) and DATA (Real Address); OS requests TLB insert
 Hypervisor adds partition ID; hypervisor TLB translation maps partition real addresses using per OS hypervisor state
 TLB entry: Partition ID, ASID, Virtual, Physical
 Physical Hardware]

OPEN SOURCE
✓ First open source processor
➜ http://opensparc.sunsource.net/
➜ Mailing Lists, Forums
➜ Bug reports and feedback

THANK YOU
QUESTIONS?