Ultrasparc Iii

UltraSparc IIi Kevin Normoyle and the Sabre cats One size doesn’t fit all SPARC chip technology available in three broad product categories: • S-Series (servers, highest performance) • I-Series (price/performance, ease-of-use) • E-Series (lowest cost) Tired: Marginal micro-optimizations Wired: Interfaces and integration. 2 of 26 Desktop/High end embedded system issues • Ease of design-in • Low-cost • Simple motherboard designs • Power • Higher performance graphics interfaces • Low latency to memory and I/O • Upgrades • The J word 3 of 26 UltraSPARC-IIi System Example Module 72-bit DIMM 72-bit DIMM SRAM SRAM SRAM UltraSPARC IIi 75MHz/72-bit XCVRs clocks 100MHz/64-bit 66 or 33 MHz 32-bit 3.3v UPA64S device example: Creator3D graphics APB Advanced PCI Bridge (optional) Can use PC-style SUPERIO chips (boot addressing / INT_ACK features) 33MHz/32-bit 5v/3.3v 33MHz/32-bit 5v/3.3v 4 of 26 Highlights • TI Epic4 GS2, 5-layer CMOS, 2.6v core, 3.3v I/O. (TM) • Sun’s Visual Instruction Set - VIS -2-D, 3-D graphics, image processing, real-time compression/decompression, video effects - block loads and stores (64-byte blocks) sustain 300 MByte/s to/from main memory, with arbitrary src/dest alignment • Four-way superscalar instruction issue - 6 pipes: 2 integer, 2 fp/graphics, 1 load/store, 1 branch • High bandwidth / Low latency interfaces - Synchronous, external L2 cache, 0.25MB - 2MB - 8-byte (+parity) data bus to L2 cache (1.2 GByte/s sustained) - 8-byte (+ECC) data bus to DRAM XCVRs (400 MByte/s sustained) or UPA64S (800 Mbyte/s sustained) - 4-byte PCI/66 interface. (190 Mbyte/s sustained) 5 of 26 Highlights (continued) • Instruction prefetching - in-cache, dynamic branch prediction - 1 branch per cycle - 2-way, 16kB Instruction Cache • Non-blocking Data Cache - 8-entry load buffer: pipe stalls only if load data needed - 8-entry (8 bytes per entry) store buffer for write-thru to L2 cache - single-cycle load-use penalty for D-cache hit - throughput of one load/store per 2 cycles to L2 cache • Non-blocking software prefetch to L2 cache - Prefetch completion isn’t constrained by the memory model. - Compiler uses to further hide the latency of L2 miss. • Simultaneous support of little/big endian data 6 of 26 Instruction Pipelines Integer <ul style="display: flex;"><li style="flex:1">E</li><li style="flex:1">D</li><li style="flex:1">G</li><li style="flex:1">C</li><li style="flex:1">N1 N2 </li><li style="flex:1">N3 </li><li style="flex:1">W</li><li style="flex:1">F</li></ul>Cache Access <ul style="display: flex;"><li style="flex:1">Write </li><li style="flex:1">Fetch Decode </li></ul><ul style="display: flex;"><li style="flex:1">Group </li><li style="flex:1">Execute </li></ul>Integer instructions are executed and virtual addresses calculated Instructions Instructions are fetched are decoded instructions from I-cache and placed are grouped <ul style="display: flex;"><li style="flex:1">Up to 4 </li><li style="flex:1">Dcache/TLB Dcachehitor Integer pipe </li><li style="flex:1">Traps are </li></ul>resolved All results arewrittento the register files. Instructions are accessed. Branch resolved miss determined. Deferred waits for floating- <ul style="display: flex;"><li style="flex:1">point/ </li><li style="flex:1">in the </li></ul>instruction buffer and dispatched. RF accessed load enters load buffer graphics pipe committed X2 <ul style="display: flex;"><li style="flex:1">R</li><li style="flex:1">X1 </li><li style="flex:1">X3 </li></ul>Register Start execution Finish execution Floating- point/ Execution continued graphics instructions are further decoded. RF accessed Floating-point/Graphics 7 of 26 Block Diagram Prefetch and Dispatch Unit ICache IMMU Branch Prediction Integer Execution Unit Floating Point/ Graphics Unit Branch Unit Load/ Store Unit DCache DMMU Load Buffer Store Buffer Separate 64-byte PIO and DMA buffers 64-byte DRAM write buffer L2 Cache, DRAM, UPA64S, PCI Interfaces IOMMU DRAM Second- Level Cache TI XCVRs Data PCI 66MHz DRAM 128+16 (ECC) 64+8 (ECC) 64+8 (parity) 36+1 (parity) 8 of 26 Prefetch and Dispatch Unit 64 •• 16kB ICache, 2-way set associative, 32B line size w/pre-decoded bits Second Level Cache Next Field Branch ICache Prediction 2kB Next Field RAM which contains 1k branch target addresses and 2k dynamic branch predictions 12 128 12 4Pre- Decoded Unit Prefetch Unit <ul style="display: flex;"><li style="flex:1">I0 </li><li style="flex:1">I1 </li></ul><ul style="display: flex;"><li style="flex:1">BP </li><li style="flex:1">I2 </li><li style="flex:1">I3 BP </li><li style="flex:1">NFA </li></ul>4 x 76 <ul style="display: flex;"><li style="flex:1">PA </li><li style="flex:1">VA </li></ul>44 •••••• 4-entry Return Address Stack for fetch prediction 41 Instruction Buffer 12 Entry Branch prediction (software/hardware) IMMU ITLB 64 Entry 64-entry, fully associative ITLB backs up 1-entry µTLB Dispatch Unit 12-entry Instruction Buffer fed by ICache or second-level cache 4 Instructions Single-cycle dispatch logic considers “top” 4 instructions Floating Point/ Graphics Integer Execution Load/ Store Branch Controls Integer operand bypassing 9 of 26 Integer Execution Unit Dispatch Unit •• Integer Register File has 7 read ports/3 write ports 7 read addresses 8 windows plus 4 sets of 8 global registers Integer 3x64 Register File Store Data 8 windows 64 ••• Dual Integer Pipes 4 global sets 2x64 ALU0 executes shift instructions 2x64 2x64 ALU1 executes register-based CTIs, Integer multiply/divide, and condition code-setting instructions <ul style="display: flex;"><li style="flex:1">ALU0 </li><li style="flex:1">VA </li></ul>Adder ALU1 Register- based CTIs Condition Codes <ul style="display: flex;"><li style="flex:1">•</li><li style="flex:1">Integer multiply uses 2-bit Booth </li></ul>encoding w/“early out” -- 5-cycle latency for typical operands 44 Load/Store Unit Integer Multiply/ Divide Shifter •• Integer divide uses 1-bit non- restoring subtraction algorithm 64 64 64 Completion Unit buffers ALU/load results before instructions are committed, providing centralized operand bypassing Completion Unit <ul style="display: flex;"><li style="flex:1">•</li><li style="flex:1">Precise exceptions and 5 levels of </li></ul>nested traps 10 of 26 Floating-Point/Graphics Unit Dispatch Unit ••• Floating-point/Graphics Register File has 5 read ports/3 write ports 5 read addresses 32 single-precision/32 double- precision registers Floating- 3x64 Point/ Graphics Store Data 64 5 functional units, all fully pipelined except for Divide/Square Root unit Register File 32, 64b regs •• High bandwidth: 2 FGops per cycle 4x64 Completion Unit buffers results before instructions are committed, supporting rich operand bypassing <ul style="display: flex;"><li style="flex:1">GR </li><li style="flex:1">GR </li></ul>F÷P /√ + X<ul style="display: flex;"><li style="flex:1">FP </li><li style="flex:1">FP </li></ul><ul style="display: flex;"><li style="flex:1">•</li><li style="flex:1">Short latencies </li></ul>X +Load/ Store ••••••Floating-point compares: 1 cycle Unit Floating-point add/subtract/convert: Floating-point multiplies: 3 cycles 3 cycles 12 cycles 22 cycles Load Data 64 2x64 Floating-point divide/square root(sp): Floating-point divide/square root(dp): Partitioned add/subtract, align, merge, expand, logical: Completion Unit 1 cycle <ul style="display: flex;"><li style="flex:1">•</li><li style="flex:1">Partitioned multiply, pixel compare, </li></ul>pack, pixel distance: 3 cycles 11 of 26 Load/Store Unit •••• 16 kB DCache (D$), direct-mapped, 32B line size w/16B sub-blocks Register File 2x64 64-entry, fully associative DTLB supports multiple page sizes VA Adder D$ is non-blocking, supported by 9- entry Load Buffer 44 VA D$ tags are dual-ported, allowing a tag check from the pipe and a line fill/snoop to occur in parallel DCache DCache Tags DTLB =hit/miss? 41 ••• Sustained throughput of 1 load per 2 cycles from second-level cache (E$) PA 128 64 Store Buffer Load Buffer Pipelined stores to D$ and E$ via decoupled data and tag RAMs 64 address data address Integer/FP Store compression reduces E$ bus utilization Completion Units <ul style="display: flex;"><li style="flex:1">64 </li><li style="flex:1">64 </li></ul>Second-Level Cache External Cache Control <ul style="display: flex;"><li style="flex:1">•</li><li style="flex:1">E$ sizes from 0.25 MB to 2MB </li></ul>rq <ul style="display: flex;"><li style="flex:1">ad (ac) </li><li style="flex:1">tc </li></ul>Check dt transfer Request Address (Access SRAM) <ul style="display: flex;"><li style="flex:1">Data </li><li style="flex:1">Tag </li></ul>••••• E$ is direct-mapped, physically indexed and tagged, w/64B line size <ul style="display: flex;"><li style="flex:1">E$ </li><li style="flex:1">transfer </li></ul>Uses synchronous SRAMs with delayed write External Cache Pipeline 8B (+ parity) interface to E$ supports 1.2 GB/s sustained bandwidth 15 128 64 External Prefetch Unit Cache Tags Supports two sram modes, compatible with US-I or US-II srams Second Level 16+2(parity) Cache Control Unit ad and dt are 2 cycles each. ac stage doesn’t exist for 22 mode srams 18 External Cache Load/ Store Unit •• PCI activity is fully cache coherent 16-byte fill to L1 data cache, desired 8 bytes first 64 + 8 (parity) 64 13 of 26 Pins (balls) • 90 - L2 cache data and tag • 37 - L2 cache addr/control • 8 - L2 cache byte write enables • 53 - PCI plus arbiter • 9 - Interrupt interface • 35 - UPA64S address/control interface • 34 - dram and transceiver control • 72 - dram/UPA64S data • 18 - clock input and reset/modes • 10 - jtag/test total with power: 587 14 of 26 Performance (300mhz) SPEC 0.5mbyte L2 2.0 Mbyte L2 SPECint95 SPECfp95 11.0 12.8 11.6 15.0 Memory L2-Cache read L2-Cache write DRAM read (random) DRAM write DRAM read (same page) DRAM, memcopy 1.2 GBytes/S 1.2 GBytes/S 350 Mbytes/S 350 Mbytes/S 400 Mbytes/S 303 Mbytes/S DRAM, memcopy to UPA 550 Mbytes/S STREAM (Compiled, and with VIS block store) Copy Scale Add 199 Mbytes/S 199 Mbytes/S 224 Mbytes/S 210 Mbytes/S 303 Mbytes/S 296 Mbytes/S 277 Mbytes/S <ul style="display: flex;"><li style="flex:1">277 Mbytes/S </li><li style="flex:1">Triad </li></ul>15 of 26 Performance PCI (66 MHz/32 bits, Single Device) PCI to DRAM (DMA) PCI from DRAM (DMA) PCI to L2 Cache (DMA) PCI from L2 Cache (DMA) 64-byte Reads CPU to PCI (PIO) 64-byte Writes 64-byte Reads 64-byte Writes 151 Mbytes/S 132 Mbytes/S 186 Mbytes/S 163 Mbytes/S <ul style="display: flex;"><li style="flex:1">200 Mbytes/S </li><li style="flex:1">64-byte Writes </li></ul>UPA64S (100 MHz/64 bits)

Ultrasparc Iii

Details

Download

Copyright

We respect the copyrights and intellectual property rights of all users. All uploaded documents are either original works of the uploader or authorized works of the rightful owners.

Support