
<p><strong>UltraSparc IIi </strong></p><p><strong>Kevin Normoyle and the Sabre cats </strong></p><p><strong>One size doesn’t fit all </strong></p><p><strong>SPARC chip technology available in three broad product categories: </strong></p><p><strong>• S-Series (servers, highest performance) • I-Series (price/performance, ease-of-use) • E-Series (lowest cost) </strong></p><p><strong>Tired: Marginal micro-optimizations Wired: Interfaces and integration. </strong></p><p>2 of 26 </p><p><strong>Desktop/High end embedded system issues </strong></p><p><strong>• Ease of design-in • Low-cost • Simple motherboard designs • Power • Higher performance graphics interfaces • Low latency to memory and I/O • Upgrades • The J word </strong></p><p>3 of 26 </p><p><strong>UltraSPARC-IIi System Example </strong></p><p><strong>Module </strong><br><strong>72-bit DIMM </strong><br><strong>72-bit DIMM </strong><br><strong>SRAM </strong></p><p><strong>SRAM SRAM </strong><br><strong>UltraSPARC </strong><br><strong>IIi </strong></p><p><strong>75MHz/72-bit </strong><br><strong>XCVRs </strong></p><p><strong>clocks </strong></p><p><strong>100MHz/64-bit </strong></p><p><strong>66 or 33 MHz 32-bit 3.3v </strong><br><strong>UPA64S device example: Creator3D graphics </strong></p><p><strong>APB </strong><br><strong>Advanced PCI Bridge </strong><br><strong>(optional) </strong></p><p><strong>Can use PC-style SUPERIO chips (boot addressing / INT_ACK features) </strong><br><strong>33MHz/32-bit 5v/3.3v </strong><br><strong>33MHz/32-bit 5v/3.3v </strong></p><p>4 of 26 </p><p><strong>Highlights </strong></p><p><strong>• TI Epic4 GS2, 5-layer CMOS, 2.6v core, 3.3v I/O. </strong></p><p><strong>(TM) </strong></p><p><strong>• Sun’s Visual Instruction Set - VIS </strong></p><p><strong>-</strong></p><p>2-D, 3-D graphics, image processing, real-time compression/decompression, video effects </p><p><strong>- </strong>block loads and stores (64-byte blocks) sustain 300 MByte/s to/from main memory, with arbitrary src/dest alignment </p><p><strong>• Four-way superscalar instruction issue </strong></p><p><strong>- </strong>6 pipes: 2 integer, 2 fp/graphics, 1 load/store, 1 branch </p><p><strong>• High bandwidth / Low latency interfaces </strong></p><p><strong>- </strong>Synchronous, external L2 cache, 0.25MB - 2MB <strong>- </strong>8-byte (+parity) data bus to L2 cache (1.2 GByte/s sustained) <strong>- </strong>8-byte (+ECC) data bus to DRAM XCVRs (400 MByte/s sustained) or UPA64S (800 Mbyte/s sustained) </p><p><strong>- </strong>4-byte PCI/66 interface. (190 Mbyte/s sustained) </p><p>5 of 26 </p><p><strong>Highlights (continued) </strong></p><p><strong>• Instruction prefetching </strong></p><p><strong>- </strong>in-cache, dynamic branch prediction </p><p><strong>- </strong>1 branch per cycle </p><p><strong>- </strong>2-way, 16kB Instruction Cache </p><p><strong>• Non-blocking Data Cache </strong></p><p><strong>- </strong>8-entry load buffer: pipe stalls only if load data needed <strong>- </strong>8-entry (8 bytes per entry) store buffer for write-thru to L2 cache <strong>- </strong>single-cycle load-use penalty for D-cache hit <strong>- </strong>throughput of one load/store per 2 cycles to L2 cache </p><p><strong>• Non-blocking software prefetch to L2 cache </strong></p><p><strong>- </strong>Prefetch completion isn’t constrained by the memory model. <strong>- </strong>Compiler uses to further hide the latency of L2 miss. </p><p><strong>• Simultaneous support of little/big endian data </strong></p><p>6 of 26 </p><p><strong>Instruction Pipelines </strong></p><p><strong>Integer </strong></p><ul style="display: flex;"><li style="flex:1"><strong>E</strong></li><li style="flex:1"><strong>D</strong></li><li style="flex:1"><strong>G</strong></li><li style="flex:1"><strong>C</strong></li><li style="flex:1"><strong>N1 N2 </strong></li><li style="flex:1"><strong>N3 </strong></li><li style="flex:1"><strong>W</strong></li><li style="flex:1"><strong>F</strong></li></ul><p></p><p><strong>Cache Access </strong></p><ul style="display: flex;"><li style="flex:1"><strong>Write </strong></li><li style="flex:1"><strong>Fetch Decode </strong></li></ul><p></p><ul style="display: flex;"><li style="flex:1"><strong>Group </strong></li><li style="flex:1"><strong>Execute </strong></li></ul><p></p><p><strong>Integer instructions are executed and virtual addresses calculated </strong><br><strong>Instructions Instructions are fetched are decoded instructions from I-cache and placed are grouped </strong></p><ul style="display: flex;"><li style="flex:1"><strong>Up to 4 </strong></li><li style="flex:1"><strong>Dcache/TLB Dcachehitor Integer pipe </strong></li><li style="flex:1"><strong>Traps are </strong></li></ul><p><strong>resolved </strong><br><strong>All results arewrittento the register files. Instructions are accessed. Branch resolved miss determined. Deferred waits for floating- </strong></p><ul style="display: flex;"><li style="flex:1"><strong>point/ </strong></li><li style="flex:1"><strong>in the </strong></li></ul><p><strong>instruction buffer and dispatched. RF accessed load enters load buffer graphics pipe committed </strong></p><p><strong>X2 </strong></p><ul style="display: flex;"><li style="flex:1"><strong>R</strong></li><li style="flex:1"><strong>X1 </strong></li><li style="flex:1"><strong>X3 </strong></li></ul><p></p><p><strong>Register </strong></p><p><strong>Start execution </strong><br><strong>Finish execution </strong><br><strong>Floating- point/ </strong><br><strong>Execution continued graphics instructions are further decoded. RF accessed </strong></p><p><strong>Floating-point/Graphics </strong></p><p>7 of 26 </p><p><strong>Block Diagram </strong></p><p><strong>Prefetch and Dispatch Unit </strong><br><strong>ICache IMMU </strong><br><strong>Branch Prediction </strong></p><p><strong>Integer Execution Unit </strong><br><strong>Floating Point/ Graphics Unit </strong><br><strong>Branch Unit </strong><br><strong>Load/ Store Unit </strong></p><p><strong>DCache DMMU </strong><br><strong>Load Buffer </strong><br><strong>Store Buffer </strong></p><p><strong>Separate 64-byte PIO and DMA buffers </strong><br><strong>64-byte </strong><br><strong>DRAM write buffer </strong><br><strong>L2 Cache, DRAM, UPA64S, PCI </strong><br><strong>Interfaces </strong><br><strong>IOMMU </strong><br><strong>DRAM </strong><br><strong>Second- Level Cache </strong><br><strong>TI XCVRs </strong><br><strong>Data </strong><br><strong>PCI 66MHz </strong><br><strong>DRAM </strong></p><p><strong>128+16 (ECC) </strong><br><strong>64+8 (ECC) </strong><br><strong>64+8 (parity) </strong><br><strong>36+1 (parity) </strong></p><p>8 of 26 </p><p><strong>Prefetch and Dispatch Unit </strong></p><p><strong>64 </strong></p><p><strong>••</strong><br><strong>16kB ICache, 2-way set associative, 32B line size w/pre-decoded bits </strong></p><p><strong>Second Level Cache </strong><br><strong>Next Field </strong><br><strong>Branch </strong><br><strong>ICache </strong></p><p><strong>Prediction </strong></p><p><strong>2kB Next Field RAM which contains 1k branch target addresses and 2k dynamic branch predictions </strong></p><p><strong>12 </strong><br><strong>128 </strong><br><strong>12 </strong><br><strong>4</strong></p><p><strong>Pre- Decoded Unit </strong><br><strong>Prefetch Unit </strong></p><p></p><ul style="display: flex;"><li style="flex:1"><strong>I0 </strong></li><li style="flex:1"><strong>I1 </strong></li></ul><p></p><ul style="display: flex;"><li style="flex:1"><strong>BP </strong></li><li style="flex:1"><strong>I2 </strong></li><li style="flex:1"><strong>I3 BP </strong></li><li style="flex:1"><strong>NFA </strong></li></ul><p></p><p><strong>4 x 76 </strong></p><p></p><ul style="display: flex;"><li style="flex:1"><strong>PA </strong></li><li style="flex:1"><strong>VA </strong></li></ul><p></p><p><strong>44 </strong></p><p><strong>••••••</strong><br><strong>4-entry Return Address Stack for fetch prediction </strong></p><p><strong>41 </strong></p><p><strong>Instruction Buffer </strong><br><strong>12 Entry </strong></p><p><strong>Branch prediction (software/hardware) </strong></p><p><strong>IMMU ITLB </strong><br><strong>64 Entry </strong></p><p><strong>64-entry, fully associative ITLB backs up 1-entry </strong>µ<strong>TLB </strong></p><p><strong>Dispatch Unit </strong></p><p><strong>12-entry Instruction Buffer fed by ICache or second-level cache </strong></p><p><strong>4</strong><br><strong>Instructions </strong></p><p><strong>Single-cycle dispatch logic considers “top” 4 instructions </strong></p><p><strong>Floating Point</strong>/ <strong>Graphics </strong><br><strong>Integer Execution </strong><br><strong>Load/ Store </strong><br><strong>Branch </strong></p><p><strong>Controls Integer operand bypassing </strong></p><p>9 of 26 </p><p><strong>Integer Execution Unit </strong></p><p><strong>Dispatch Unit </strong></p><p><strong>••</strong><br><strong>Integer Register File has 7 read ports/3 write ports </strong></p><p><strong>7 read addresses </strong></p><p><strong>8 windows plus 4 sets of 8 global registers </strong></p><p><strong>Integer </strong></p><p><strong>3x64 </strong></p><p><strong>Register File </strong></p><p><strong>Store Data </strong></p><p><strong>8 windows </strong></p><p><strong>64 </strong></p><p><strong>•••</strong><br><strong>Dual Integer Pipes </strong></p><p><strong>4 global sets </strong></p><p><strong>2x64 </strong></p><p><strong>ALU0 executes shift instructions </strong></p><p><strong>2x64 </strong><br><strong>2x64 </strong></p><p><strong>ALU1 executes register-based CTIs, Integer multiply/divide, and condition code-setting instructions </strong></p><p></p><ul style="display: flex;"><li style="flex:1"><strong>ALU0 </strong></li><li style="flex:1"><strong>VA </strong></li></ul><p><strong>Adder </strong><br><strong>ALU1 </strong></p><p><strong>Register- based CTIs Condition Codes </strong></p><p></p><ul style="display: flex;"><li style="flex:1"><strong>•</strong></li><li style="flex:1"><strong>Integer multiply uses 2-bit Booth </strong></li></ul><p><strong>encoding w/“early out” -- 5-cycle latency for typical operands </strong></p><p><strong>44 </strong></p><p><strong>Load/Store </strong><br><strong>Unit </strong></p><p><strong>Integer Multiply/ Divide </strong><br><strong>Shifter </strong></p><p><strong>••</strong><br><strong>Integer divide uses 1-bit non- restoring subtraction algorithm </strong></p><p><strong>64 </strong><br><strong>64 </strong><br><strong>64 </strong></p><p><strong>Completion Unit buffers ALU/load results before instructions are committed, providing centralized operand bypassing </strong></p><p><strong>Completion Unit </strong></p><p></p><ul style="display: flex;"><li style="flex:1"><strong>•</strong></li><li style="flex:1"><strong>Precise exceptions and 5 levels of </strong></li></ul><p><strong>nested traps </strong></p><p>10 of 26 </p><p><strong>Floating-Point/Graphics Unit </strong></p><p><strong>Dispatch Unit </strong></p><p><strong>•••</strong><br><strong>Floating-point/Graphics Register File has 5 read ports/3 write ports </strong></p><p><strong>5 read addresses </strong></p><p><strong>32 single-precision/32 double- precision registers </strong></p><p><strong>Floating- </strong></p><p><strong>3x64 </strong></p><p><strong>Point/ Graphics </strong></p><p><strong>Store Data </strong><br><strong>64 </strong></p><p><strong>5 functional units, all fully pipelined except for Divide/Square Root unit </strong></p><p><strong>Register File 32, 64b regs </strong></p><p><strong>••</strong><br><strong>High bandwidth: 2 FGops per cycle </strong></p><p><strong>4x64 </strong></p><p><strong>Completion Unit buffers results before instructions are committed, supporting rich operand bypassing </strong></p><p></p><ul style="display: flex;"><li style="flex:1"><strong>GR </strong></li><li style="flex:1"><strong>GR </strong></li></ul><p></p><p><sup style="top: -0.8275em;"><strong>F</strong></sup>÷<sup style="top: -0.8275em;"><strong>P </strong></sup>/√ </p><p>+<br>X</p><p></p><ul style="display: flex;"><li style="flex:1"><strong>FP </strong></li><li style="flex:1"><strong>FP </strong></li></ul><p></p><p></p><ul style="display: flex;"><li style="flex:1"><strong>•</strong></li><li style="flex:1"><strong>Short latencies </strong></li></ul><p></p><p>X<br>+</p><p><strong>Load/ Store </strong><br><strong>•••••</strong></p><p>•Floating-point compares: </p><p><strong>1 cycle </strong></p><p><strong>Unit </strong></p><p>Floating-point add/subtract/convert: Floating-point multiplies: </p><p><strong>3 cycles 3 cycles 12 cycles 22 cycles </strong></p><p><strong>Load Data </strong><br><strong>64 </strong></p><p><strong>2x64 </strong></p><p>Floating-point divide/square root(sp): Floating-point divide/square root(dp): Partitioned add/subtract, align, merge, expand, logical: </p><p><strong>Completion Unit </strong></p><p><strong>1 cycle </strong></p><p></p><ul style="display: flex;"><li style="flex:1">•</li><li style="flex:1">Partitioned multiply, pixel compare, </li></ul><p>pack, pixel distance: </p><p><strong>3 cycles </strong></p><p>11 of 26 </p><p><strong>Load/Store Unit </strong></p><p><strong>••••</strong><br><strong>16 kB DCache (D$), direct-mapped, 32B line size w/16B sub-blocks </strong></p><p><strong>Register File </strong><br><strong>2x64 </strong></p><p><strong>64-entry, fully associative DTLB supports multiple page sizes </strong></p><p><strong>VA Adder </strong></p><p><strong>D$ is non-blocking, supported by 9- entry Load Buffer </strong></p><p><strong>44 </strong><br><strong>VA </strong></p><p><strong>D$ tags are dual-ported, allowing a tag check from the pipe and a line fill/snoop to occur in parallel </strong></p><p><strong>DCache </strong><br><strong>DCache </strong></p><p><strong>Tags </strong><br><strong>DTLB </strong></p><p><strong>=</strong></p><p>hit/miss? </p><p><strong>41 </strong></p><p><strong>•••</strong><br><strong>Sustained throughput of 1 load per 2 cycles from second-level cache (E$) </strong></p><p><strong>PA </strong><br><strong>128 </strong><br><strong>64 </strong></p><p><strong>Store Buffer </strong><br><strong>Load Buffer </strong></p><p><strong>Pipelined stores to D$ and E$ via decoupled data and tag RAMs </strong></p><p><strong>64 </strong></p><p>address data address </p><p><strong>Integer/FP </strong></p><p><strong>Store compression reduces E$ bus utilization </strong></p><p><strong>Completion Units </strong></p><p></p><ul style="display: flex;"><li style="flex:1"><strong>64 </strong></li><li style="flex:1"><strong>64 </strong></li></ul><p><strong>Second-Level Cache </strong></p><p><strong>External Cache Control </strong></p><p></p><ul style="display: flex;"><li style="flex:1"><strong>•</strong></li><li style="flex:1"><strong>E$ sizes from 0.25 MB to 2MB </strong></li></ul><p></p><p>rq </p><ul style="display: flex;"><li style="flex:1">ad (ac) </li><li style="flex:1">tc </li></ul><p></p><p>Check </p><p>dt </p><p>transfer <br>Request Address <br>(Access SRAM) </p><ul style="display: flex;"><li style="flex:1">Data </li><li style="flex:1">Tag </li></ul><p></p><p><strong>•••••</strong><br><strong>E$ is direct-mapped, physically indexed and tagged, w/64B line size </strong></p><p></p><ul style="display: flex;"><li style="flex:1">E$ </li><li style="flex:1">transfer </li></ul><p></p><p><strong>Uses synchronous SRAMs with delayed write </strong></p><p><strong>External Cache Pipeline </strong></p><p><strong>8B (+ parity) interface to E$ supports 1.2 GB/s sustained bandwidth </strong></p><p><strong>15 </strong><br><strong>128 </strong></p><p><strong>64 </strong><br><strong>External </strong><br><strong>Prefetch </strong></p><p><strong>Unit </strong><br><strong>Cache Tags </strong></p><p><strong>Supports two sram modes, compatible with US-I or US-II srams </strong></p><p><strong>Second Level </strong><br><strong>16+2(parity) </strong></p><p><strong>Cache Control Unit </strong></p><p><strong>ad and dt are 2 cycles each. ac stage doesn’t exist for 22 mode srams </strong></p><p><strong>18 </strong><br><strong>External Cache </strong><br><strong>Load/ Store Unit </strong></p><p><strong>••</strong><br><strong>PCI activity is fully cache coherent 16-byte fill to L1 data cache, desired 8 bytes first </strong></p><p><strong>64 + 8 (parity) </strong><br><strong>64 </strong></p><p>13 of 26 </p><p><strong>Pins (balls) </strong></p><p><strong>• 90 - L2 cache data and tag • 37 - L2 cache addr/control • 8 - L2 cache byte write enables • 53 - PCI plus arbiter • 9 - Interrupt interface • 35 - UPA64S address/control interface • 34 - dram and transceiver control • 72 - dram/UPA64S data • 18 - clock input and reset/modes • 10 - jtag/test </strong></p><p><strong>total with power: 587 </strong></p><p>14 of 26 </p><p><strong>Performance (300mhz) </strong></p><p><strong>SPEC </strong></p><p><strong>0.5mbyte L2 2.0 Mbyte L2 </strong><br><strong>SPECint95 SPECfp95 </strong><br><strong>11.0 12.8 </strong><br><strong>11.6 15.0 </strong></p><p><strong>Memory </strong></p><p><strong>L2-Cache read L2-Cache write DRAM read (random) DRAM write DRAM read (same page) DRAM, memcopy </strong><br><strong>1.2 GBytes/S 1.2 GBytes/S 350 Mbytes/S 350 Mbytes/S 400 Mbytes/S 303 Mbytes/S </strong><br><strong>DRAM, memcopy to UPA 550 Mbytes/S </strong></p><p><strong>STREAM (Compiled, and with VIS block store) </strong></p><p><strong>Copy Scale Add </strong><br><strong>199 Mbytes/S 199 Mbytes/S 224 Mbytes/S 210 Mbytes/S </strong><br><strong>303 Mbytes/S 296 Mbytes/S 277 Mbytes/S </strong></p><ul style="display: flex;"><li style="flex:1"><strong>277 Mbytes/S </strong></li><li style="flex:1"><strong>Triad </strong></li></ul><p></p><p>15 of 26 </p><p><strong>Performance </strong></p><p><strong>PCI (66 MHz/32 bits, Single Device) </strong></p><p><strong>PCI to DRAM (DMA) PCI from DRAM (DMA) PCI to L2 Cache (DMA) PCI from L2 Cache (DMA) 64-byte Reads CPU to PCI (PIO) </strong><br><strong>64-byte Writes 64-byte Reads 64-byte Writes </strong><br><strong>151 Mbytes/S 132 Mbytes/S 186 Mbytes/S 163 Mbytes/S </strong></p><ul style="display: flex;"><li style="flex:1"><strong>200 Mbytes/S </strong></li><li style="flex:1"><strong>64-byte Writes </strong></li></ul><p></p><p><strong>UPA64S (100 MHz/64 bits) </strong></p>
Details
-
File Typepdf
-
Upload Time-
-
Content LanguagesEnglish
-
Upload UserAnonymous/Not logged-in
-
File Pages13 Page
-
File Size-