Ultrasparc Iii

UltraSparc IIi

Kevin Normoyle and the Sabre cats

One size doesn’t ﬁt all

SPARC chip technology available in three broad product categories:

• S-Series (servers, highest performance) • I-Series (price/performance, ease-of-use) • E-Series (lowest cost)

Tired: Marginal micro-optimizations Wired: Interfaces and integration.

2 of 26

Desktop/High end embedded system issues

• Ease of design-in • Low-cost • Simple motherboard designs • Power • Higher performance graphics interfaces • Low latency to memory and I/O • Upgrades • The J word

3 of 26

UltraSPARC-IIi System Example

Module
72-bit DIMM
72-bit DIMM
SRAM

SRAM SRAM
UltraSPARC
IIi

75MHz/72-bit
XCVRs

clocks

100MHz/64-bit

66 or 33 MHz 32-bit 3.3v
UPA64S device example: Creator3D graphics

APB
Advanced PCI Bridge
(optional)

Can use PC-style SUPERIO chips (boot addressing / INT_ACK features)
33MHz/32-bit 5v/3.3v
33MHz/32-bit 5v/3.3v

4 of 26

Highlights

• TI Epic4 GS2, 5-layer CMOS, 2.6v core, 3.3v I/O.

(TM)

• Sun’s Visual Instruction Set - VIS

-

2-D, 3-D graphics, image processing, real-time compression/decompression, video effects

- block loads and stores (64-byte blocks) sustain 300 MByte/s to/from main memory, with arbitrary src/dest alignment

• Four-way superscalar instruction issue

- 6 pipes: 2 integer, 2 fp/graphics, 1 load/store, 1 branch

• High bandwidth / Low latency interfaces

- Synchronous, external L2 cache, 0.25MB - 2MB - 8-byte (+parity) data bus to L2 cache (1.2 GByte/s sustained) - 8-byte (+ECC) data bus to DRAM XCVRs (400 MByte/s sustained) or UPA64S (800 Mbyte/s sustained)

- 4-byte PCI/66 interface. (190 Mbyte/s sustained)

5 of 26

Highlights (continued)

• Instruction prefetching

- in-cache, dynamic branch prediction

- 1 branch per cycle

- 2-way, 16kB Instruction Cache

• Non-blocking Data Cache

- 8-entry load buffer: pipe stalls only if load data needed - 8-entry (8 bytes per entry) store buffer for write-thru to L2 cache - single-cycle load-use penalty for D-cache hit - throughput of one load/store per 2 cycles to L2 cache

• Non-blocking software prefetch to L2 cache

- Prefetch completion isn’t constrained by the memory model. - Compiler uses to further hide the latency of L2 miss.

• Simultaneous support of little/big endian data

6 of 26

Instruction Pipelines

Integer

E

D

G

C

N1 N2

N3

W

F

Cache Access

Write

Fetch Decode

Group

Execute

Integer instructions are executed and virtual addresses calculated
Instructions Instructions are fetched are decoded instructions from I-cache and placed are grouped

Up to 4

Dcache/TLB Dcachehitor Integer pipe

Traps are

resolved
All results arewrittento the register ﬁles. Instructions are accessed. Branch resolved miss determined. Deferred waits for ﬂoating-

point/

in the

instruction buffer and dispatched. RF accessed load enters load buffer graphics pipe committed

X2

R

X1

X3

Register

Start execution
Finish execution
Floating- point/
Execution continued graphics instructions are further decoded. RF accessed

Floating-point/Graphics

7 of 26

Block Diagram

Prefetch and Dispatch Unit
ICache IMMU
Branch Prediction

Integer Execution Unit
Floating Point/ Graphics Unit
Branch Unit
Load/ Store Unit

DCache DMMU
Load Buffer
Store Buffer

Separate 64-byte PIO and DMA buffers
64-byte
DRAM write buffer
L2 Cache, DRAM, UPA64S, PCI
Interfaces
IOMMU
DRAM
Second- Level Cache
TI XCVRs
Data
PCI 66MHz
DRAM

128+16 (ECC)
64+8 (ECC)
64+8 (parity)
36+1 (parity)

8 of 26

Prefetch and Dispatch Unit

64

••
16kB ICache, 2-way set associative, 32B line size w/pre-decoded bits

Second Level Cache
Next Field
Branch
ICache

Prediction

2kB Next Field RAM which contains 1k branch target addresses and 2k dynamic branch predictions

12
128
12
4

Pre- Decoded Unit
Prefetch Unit

I0

I1

BP

I2

I3 BP

NFA

4 x 76

PA

VA

44

••••••
4-entry Return Address Stack for fetch prediction

41

Instruction Buffer
12 Entry

Branch prediction (software/hardware)

IMMU ITLB
64 Entry

64-entry, fully associative ITLB backs up 1-entry µTLB

Dispatch Unit

12-entry Instruction Buffer fed by ICache or second-level cache

4
Instructions

Single-cycle dispatch logic considers “top” 4 instructions

Floating Point/ Graphics
Integer Execution
Load/ Store
Branch

Controls Integer operand bypassing

9 of 26

Integer Execution Unit

Dispatch Unit

••
Integer Register File has 7 read ports/3 write ports

7 read addresses

8 windows plus 4 sets of 8 global registers

Integer

3x64

Register File

Store Data

8 windows

64

•••
Dual Integer Pipes

4 global sets

2x64

ALU0 executes shift instructions

2x64
2x64

ALU1 executes register-based CTIs, Integer multiply/divide, and condition code-setting instructions

ALU0

VA

Adder
ALU1

Register- based CTIs Condition Codes

•

Integer multiply uses 2-bit Booth

encoding w/“early out” -- 5-cycle latency for typical operands

44

Load/Store
Unit

Integer Multiply/ Divide
Shifter

••
Integer divide uses 1-bit non- restoring subtraction algorithm

64
64
64

Completion Unit buffers ALU/load results before instructions are committed, providing centralized operand bypassing

Completion Unit

•

Precise exceptions and 5 levels of

nested traps

10 of 26

Floating-Point/Graphics Unit

Dispatch Unit

•••
Floating-point/Graphics Register File has 5 read ports/3 write ports

5 read addresses

32 single-precision/32 double- precision registers

Floating-

3x64

Point/ Graphics

Store Data
64

5 functional units, all fully pipelined except for Divide/Square Root unit

Register File 32, 64b regs

••
High bandwidth: 2 FGops per cycle

4x64

Completion Unit buffers results before instructions are committed, supporting rich operand bypassing

GR

^F÷^P/√

+
X

FP

•

Short latencies

X
+

Load/ Store
•••••

•Floating-point compares:

1 cycle

Unit

Floating-point add/subtract/convert: Floating-point multiplies:

3 cycles 3 cycles 12 cycles 22 cycles

Load Data
64

2x64

Floating-point divide/square root(sp): Floating-point divide/square root(dp): Partitioned add/subtract, align, merge, expand, logical:

Completion Unit

1 cycle

•

Partitioned multiply, pixel compare,

pack, pixel distance:

3 cycles

11 of 26

Load/Store Unit

••••
16 kB DCache (D$), direct-mapped, 32B line size w/16B sub-blocks

Register File
2x64

64-entry, fully associative DTLB supports multiple page sizes

VA Adder

D$ is non-blocking, supported by 9- entry Load Buffer

44
VA

D$ tags are dual-ported, allowing a tag check from the pipe and a line ﬁll/snoop to occur in parallel

DCache
DCache

Tags
DTLB

=

hit/miss?

41

•••
Sustained throughput of 1 load per 2 cycles from second-level cache (E$)

PA
128
64

Store Buffer
Load Buffer

Pipelined stores to D$ and E$ via decoupled data and tag RAMs

64

address data address

Integer/FP

Store compression reduces E$ bus utilization

Completion Units

64

Second-Level Cache

External Cache Control

•

E$ sizes from 0.25 MB to 2MB

rq

ad (ac)

tc

Check

dt

transfer
Request Address
(Access SRAM)

Data

Tag

•••••
E$ is direct-mapped, physically indexed and tagged, w/64B line size

E$

transfer

Uses synchronous SRAMs with delayed write

External Cache Pipeline

8B (+ parity) interface to E$ supports 1.2 GB/s sustained bandwidth

15
128

64
External
Prefetch

Unit
Cache Tags

Supports two sram modes, compatible with US-I or US-II srams

Second Level
16+2(parity)

Cache Control Unit

ad and dt are 2 cycles each. ac stage doesn’t exist for 22 mode srams

18
External Cache
Load/ Store Unit

••
PCI activity is fully cache coherent 16-byte ﬁll to L1 data cache, desired 8 bytes ﬁrst

64 + 8 (parity)
64

13 of 26

Pins (balls)

• 90 - L2 cache data and tag • 37 - L2 cache addr/control • 8 - L2 cache byte write enables • 53 - PCI plus arbiter • 9 - Interrupt interface • 35 - UPA64S address/control interface • 34 - dram and transceiver control • 72 - dram/UPA64S data • 18 - clock input and reset/modes • 10 - jtag/test

total with power: 587

14 of 26

Performance (300mhz)

SPEC

0.5mbyte L2 2.0 Mbyte L2
SPECint95 SPECfp95
11.0 12.8
11.6 15.0

Memory

L2-Cache read L2-Cache write DRAM read (random) DRAM write DRAM read (same page) DRAM, memcopy
1.2 GBytes/S 1.2 GBytes/S 350 Mbytes/S 350 Mbytes/S 400 Mbytes/S 303 Mbytes/S
DRAM, memcopy to UPA 550 Mbytes/S

STREAM (Compiled, and with VIS block store)

Copy Scale Add
199 Mbytes/S 199 Mbytes/S 224 Mbytes/S 210 Mbytes/S
303 Mbytes/S 296 Mbytes/S 277 Mbytes/S

277 Mbytes/S

Triad

15 of 26

Performance

PCI (66 MHz/32 bits, Single Device)

PCI to DRAM (DMA) PCI from DRAM (DMA) PCI to L2 Cache (DMA) PCI from L2 Cache (DMA) 64-byte Reads CPU to PCI (PIO)
64-byte Writes 64-byte Reads 64-byte Writes
151 Mbytes/S 132 Mbytes/S 186 Mbytes/S 163 Mbytes/S

200 Mbytes/S

64-byte Writes

UPA64S (100 MHz/64 bits)