Ultrasparc Iii

Ultrasparc Iii

<p><strong>UltraSparc IIi </strong></p><p><strong>Kevin Normoyle and the Sabre cats </strong></p><p><strong>One size doesn’t fit all </strong></p><p><strong>SPARC chip technology available in three broad product categories: </strong></p><p><strong>• S-Series&nbsp;(servers, highest performance) • I-Series (price/performance,&nbsp;ease-of-use) • E-Series&nbsp;(lowest cost) </strong></p><p><strong>Tired: Marginal micro-optimizations Wired: Interfaces and integration. </strong></p><p>2 of 26 </p><p><strong>Desktop/High end embedded system issues </strong></p><p><strong>• Ease&nbsp;of design-in • Low-cost • Simple&nbsp;motherboard designs • Power • Higher&nbsp;performance graphics interfaces • Low&nbsp;latency to memory and I/O • Upgrades • The&nbsp;J word </strong></p><p>3 of 26 </p><p><strong>UltraSPARC-IIi System Example </strong></p><p><strong>Module </strong><br><strong>72-bit DIMM </strong><br><strong>72-bit DIMM </strong><br><strong>SRAM </strong></p><p><strong>SRAM SRAM </strong><br><strong>UltraSPARC </strong><br><strong>IIi </strong></p><p><strong>75MHz/72-bit </strong><br><strong>XCVRs </strong></p><p><strong>clocks </strong></p><p><strong>100MHz/64-bit </strong></p><p><strong>66 or 33 MHz 32-bit 3.3v </strong><br><strong>UPA64S device example: Creator3D graphics </strong></p><p><strong>APB </strong><br><strong>Advanced PCI Bridge </strong><br><strong>(optional) </strong></p><p><strong>Can use PC-style SUPERIO chips (boot addressing / INT_ACK features) </strong><br><strong>33MHz/32-bit 5v/3.3v </strong><br><strong>33MHz/32-bit 5v/3.3v </strong></p><p>4 of 26 </p><p><strong>Highlights </strong></p><p><strong>• TI&nbsp;Epic4 GS2, 5-layer CMOS, 2.6v core, 3.3v I/O. </strong></p><p><strong>(TM) </strong></p><p><strong>• Sun’s&nbsp;Visual Instruction Set - VIS </strong></p><p><strong>-</strong></p><p>2-D, 3-D graphics, image processing, real-time compression/decompression, video effects </p><p><strong>- </strong>block loads and stores (64-byte blocks) sustain 300 MByte/s to/from main memory, with arbitrary src/dest alignment </p><p><strong>• Four-way&nbsp;superscalar instruction issue </strong></p><p><strong>- </strong>6 pipes: 2 integer, 2 fp/graphics, 1 load/store, 1 branch </p><p><strong>• High&nbsp;bandwidth / Low latency interfaces </strong></p><p><strong>- </strong>Synchronous, external L2 cache, 0.25MB - 2MB <strong>- </strong>8-byte (+parity) data bus to L2 cache (1.2 GByte/s sustained) <strong>- </strong>8-byte (+ECC) data bus to DRAM XCVRs (400 MByte/s sustained) or UPA64S (800 Mbyte/s sustained) </p><p><strong>- </strong>4-byte PCI/66 interface. (190 Mbyte/s sustained) </p><p>5 of 26 </p><p><strong>Highlights (continued) </strong></p><p><strong>• Instruction&nbsp;prefetching </strong></p><p><strong>- </strong>in-cache, dynamic branch prediction </p><p><strong>- </strong>1 branch per cycle </p><p><strong>- </strong>2-way, 16kB Instruction Cache </p><p><strong>• Non-blocking&nbsp;Data Cache </strong></p><p><strong>- </strong>8-entry load buffer: pipe stalls only if load data needed <strong>- </strong>8-entry (8 bytes per entry) store buffer for write-thru to L2 cache <strong>- </strong>single-cycle load-use penalty for D-cache hit <strong>- </strong>throughput of one load/store per 2 cycles to L2 cache </p><p><strong>• Non-blocking&nbsp;software prefetch to L2 cache </strong></p><p><strong>- </strong>Prefetch completion isn’t constrained by the memory model. <strong>- </strong>Compiler uses to further hide the latency of L2 miss. </p><p><strong>• Simultaneous&nbsp;support of little/big endian data </strong></p><p>6 of 26 </p><p><strong>Instruction Pipelines </strong></p><p><strong>Integer </strong></p><ul style="display: flex;"><li style="flex:1"><strong>E</strong></li><li style="flex:1"><strong>D</strong></li><li style="flex:1"><strong>G</strong></li><li style="flex:1"><strong>C</strong></li><li style="flex:1"><strong>N1 N2 </strong></li><li style="flex:1"><strong>N3 </strong></li><li style="flex:1"><strong>W</strong></li><li style="flex:1"><strong>F</strong></li></ul><p></p><p><strong>Cache Access </strong></p><ul style="display: flex;"><li style="flex:1"><strong>Write </strong></li><li style="flex:1"><strong>Fetch Decode </strong></li></ul><p></p><ul style="display: flex;"><li style="flex:1"><strong>Group </strong></li><li style="flex:1"><strong>Execute </strong></li></ul><p></p><p><strong>Integer instructions are executed and virtual addresses calculated </strong><br><strong>Instructions Instructions are fetched&nbsp;are decoded&nbsp;instructions from I-cache&nbsp;and placed&nbsp;are grouped </strong></p><ul style="display: flex;"><li style="flex:1"><strong>Up to 4 </strong></li><li style="flex:1"><strong>Dcache/TLB Dcachehitor Integer&nbsp;pipe </strong></li><li style="flex:1"><strong>Traps are </strong></li></ul><p><strong>resolved </strong><br><strong>All results arewrittento the register files. Instructions are accessed. Branch resolved miss determined. Deferred waits for floating- </strong></p><ul style="display: flex;"><li style="flex:1"><strong>point/ </strong></li><li style="flex:1"><strong>in the </strong></li></ul><p><strong>instruction buffer and dispatched. RF accessed load enters load buffer graphics pipe committed </strong></p><p><strong>X2 </strong></p><ul style="display: flex;"><li style="flex:1"><strong>R</strong></li><li style="flex:1"><strong>X1 </strong></li><li style="flex:1"><strong>X3 </strong></li></ul><p></p><p><strong>Register </strong></p><p><strong>Start execution </strong><br><strong>Finish execution </strong><br><strong>Floating- point/ </strong><br><strong>Execution continued graphics instructions are further decoded. RF accessed </strong></p><p><strong>Floating-point/Graphics </strong></p><p>7 of 26 </p><p><strong>Block Diagram </strong></p><p><strong>Prefetch and Dispatch Unit </strong><br><strong>ICache IMMU </strong><br><strong>Branch Prediction </strong></p><p><strong>Integer Execution Unit </strong><br><strong>Floating Point/ Graphics Unit </strong><br><strong>Branch Unit </strong><br><strong>Load/ Store Unit </strong></p><p><strong>DCache DMMU </strong><br><strong>Load Buffer </strong><br><strong>Store Buffer </strong></p><p><strong>Separate 64-byte PIO and DMA buffers </strong><br><strong>64-byte </strong><br><strong>DRAM write buffer </strong><br><strong>L2 Cache, DRAM, UPA64S, PCI </strong><br><strong>Interfaces </strong><br><strong>IOMMU </strong><br><strong>DRAM </strong><br><strong>Second- Level Cache </strong><br><strong>TI XCVRs </strong><br><strong>Data </strong><br><strong>PCI 66MHz </strong><br><strong>DRAM </strong></p><p><strong>128+16 (ECC) </strong><br><strong>64+8 (ECC) </strong><br><strong>64+8 (parity) </strong><br><strong>36+1 (parity) </strong></p><p>8 of 26 </p><p><strong>Prefetch and Dispatch Unit </strong></p><p><strong>64 </strong></p><p><strong>••</strong><br><strong>16kB ICache, 2-way set associative, 32B line size w/pre-decoded bits </strong></p><p><strong>Second Level Cache </strong><br><strong>Next Field </strong><br><strong>Branch </strong><br><strong>ICache </strong></p><p><strong>Prediction </strong></p><p><strong>2kB Next Field RAM which contains 1k branch target addresses and 2k dynamic branch predictions </strong></p><p><strong>12 </strong><br><strong>128 </strong><br><strong>12 </strong><br><strong>4</strong></p><p><strong>Pre- Decoded Unit </strong><br><strong>Prefetch Unit </strong></p><p></p><ul style="display: flex;"><li style="flex:1"><strong>I0 </strong></li><li style="flex:1"><strong>I1 </strong></li></ul><p></p><ul style="display: flex;"><li style="flex:1"><strong>BP </strong></li><li style="flex:1"><strong>I2 </strong></li><li style="flex:1"><strong>I3 BP </strong></li><li style="flex:1"><strong>NFA </strong></li></ul><p></p><p><strong>4 x 76 </strong></p><p></p><ul style="display: flex;"><li style="flex:1"><strong>PA </strong></li><li style="flex:1"><strong>VA </strong></li></ul><p></p><p><strong>44 </strong></p><p><strong>••••••</strong><br><strong>4-entry Return Address Stack for fetch prediction </strong></p><p><strong>41 </strong></p><p><strong>Instruction Buffer </strong><br><strong>12 Entry </strong></p><p><strong>Branch prediction (software/hardware) </strong></p><p><strong>IMMU ITLB </strong><br><strong>64 Entry </strong></p><p><strong>64-entry, fully associative ITLB backs up 1-entry </strong>µ<strong>TLB </strong></p><p><strong>Dispatch Unit </strong></p><p><strong>12-entry Instruction Buffer fed by ICache or second-level cache </strong></p><p><strong>4</strong><br><strong>Instructions </strong></p><p><strong>Single-cycle dispatch logic considers “top” 4 instructions </strong></p><p><strong>Floating Point</strong>/ <strong>Graphics </strong><br><strong>Integer Execution </strong><br><strong>Load/ Store </strong><br><strong>Branch </strong></p><p><strong>Controls Integer operand bypassing </strong></p><p>9 of 26 </p><p><strong>Integer Execution Unit </strong></p><p><strong>Dispatch Unit </strong></p><p><strong>••</strong><br><strong>Integer Register File has 7 read ports/3 write ports </strong></p><p><strong>7 read addresses </strong></p><p><strong>8 windows plus 4 sets of 8 global registers </strong></p><p><strong>Integer </strong></p><p><strong>3x64 </strong></p><p><strong>Register File </strong></p><p><strong>Store Data </strong></p><p><strong>8 windows </strong></p><p><strong>64 </strong></p><p><strong>•••</strong><br><strong>Dual Integer Pipes </strong></p><p><strong>4 global sets </strong></p><p><strong>2x64 </strong></p><p><strong>ALU0 executes shift instructions </strong></p><p><strong>2x64 </strong><br><strong>2x64 </strong></p><p><strong>ALU1 executes register-based CTIs, Integer multiply/divide, and condition code-setting instructions </strong></p><p></p><ul style="display: flex;"><li style="flex:1"><strong>ALU0 </strong></li><li style="flex:1"><strong>VA </strong></li></ul><p><strong>Adder </strong><br><strong>ALU1 </strong></p><p><strong>Register- based CTIs Condition Codes </strong></p><p></p><ul style="display: flex;"><li style="flex:1"><strong>•</strong></li><li style="flex:1"><strong>Integer multiply uses 2-bit Booth </strong></li></ul><p><strong>encoding w/“early out” -- 5-cycle latency for typical operands </strong></p><p><strong>44 </strong></p><p><strong>Load/Store </strong><br><strong>Unit </strong></p><p><strong>Integer Multiply/ Divide </strong><br><strong>Shifter </strong></p><p><strong>••</strong><br><strong>Integer divide uses 1-bit non- restoring subtraction algorithm </strong></p><p><strong>64 </strong><br><strong>64 </strong><br><strong>64 </strong></p><p><strong>Completion Unit buffers ALU/load results before instructions are committed, providing centralized operand bypassing </strong></p><p><strong>Completion Unit </strong></p><p></p><ul style="display: flex;"><li style="flex:1"><strong>•</strong></li><li style="flex:1"><strong>Precise exceptions and 5 levels of </strong></li></ul><p><strong>nested traps </strong></p><p>10 of 26 </p><p><strong>Floating-Point/Graphics Unit </strong></p><p><strong>Dispatch Unit </strong></p><p><strong>•••</strong><br><strong>Floating-point/Graphics Register File has 5 read ports/3 write ports </strong></p><p><strong>5 read addresses </strong></p><p><strong>32 single-precision/32 double- precision registers </strong></p><p><strong>Floating- </strong></p><p><strong>3x64 </strong></p><p><strong>Point/ Graphics </strong></p><p><strong>Store Data </strong><br><strong>64 </strong></p><p><strong>5 functional units, all fully pipelined except for Divide/Square Root unit </strong></p><p><strong>Register File 32, 64b regs </strong></p><p><strong>••</strong><br><strong>High bandwidth: 2 FGops per cycle </strong></p><p><strong>4x64 </strong></p><p><strong>Completion Unit buffers results before instructions are committed, supporting rich operand bypassing </strong></p><p></p><ul style="display: flex;"><li style="flex:1"><strong>GR </strong></li><li style="flex:1"><strong>GR </strong></li></ul><p></p><p><sup style="top: -0.8275em;"><strong>F</strong></sup>÷<sup style="top: -0.8275em;"><strong>P </strong></sup>/√ </p><p>+<br>X</p><p></p><ul style="display: flex;"><li style="flex:1"><strong>FP </strong></li><li style="flex:1"><strong>FP </strong></li></ul><p></p><p></p><ul style="display: flex;"><li style="flex:1"><strong>•</strong></li><li style="flex:1"><strong>Short latencies </strong></li></ul><p></p><p>X<br>+</p><p><strong>Load/ Store </strong><br><strong>•••••</strong></p><p>•Floating-point compares: </p><p><strong>1 cycle </strong></p><p><strong>Unit </strong></p><p>Floating-point add/subtract/convert: Floating-point multiplies: </p><p><strong>3 cycles 3 cycles 12 cycles 22 cycles </strong></p><p><strong>Load Data </strong><br><strong>64 </strong></p><p><strong>2x64 </strong></p><p>Floating-point divide/square root(sp): Floating-point divide/square root(dp): Partitioned add/subtract, align, merge, expand, logical: </p><p><strong>Completion Unit </strong></p><p><strong>1 cycle </strong></p><p></p><ul style="display: flex;"><li style="flex:1">•</li><li style="flex:1">Partitioned multiply, pixel compare, </li></ul><p>pack, pixel distance: </p><p><strong>3 cycles </strong></p><p>11 of 26 </p><p><strong>Load/Store Unit </strong></p><p><strong>••••</strong><br><strong>16 kB DCache (D$), direct-mapped, 32B line size w/16B sub-blocks </strong></p><p><strong>Register File </strong><br><strong>2x64 </strong></p><p><strong>64-entry, fully associative DTLB supports multiple page sizes </strong></p><p><strong>VA Adder </strong></p><p><strong>D$ is non-blocking, supported by 9- entry Load Buffer </strong></p><p><strong>44 </strong><br><strong>VA </strong></p><p><strong>D$ tags are dual-ported, allowing a tag check from the pipe and a line fill/snoop to occur in parallel </strong></p><p><strong>DCache </strong><br><strong>DCache </strong></p><p><strong>Tags </strong><br><strong>DTLB </strong></p><p><strong>=</strong></p><p>hit/miss? </p><p><strong>41 </strong></p><p><strong>•••</strong><br><strong>Sustained throughput of 1 load per 2 cycles from second-level cache (E$) </strong></p><p><strong>PA </strong><br><strong>128 </strong><br><strong>64 </strong></p><p><strong>Store Buffer </strong><br><strong>Load Buffer </strong></p><p><strong>Pipelined stores to D$ and E$ via decoupled data and tag RAMs </strong></p><p><strong>64 </strong></p><p>address data address </p><p><strong>Integer/FP </strong></p><p><strong>Store compression reduces E$ bus utilization </strong></p><p><strong>Completion Units </strong></p><p></p><ul style="display: flex;"><li style="flex:1"><strong>64 </strong></li><li style="flex:1"><strong>64 </strong></li></ul><p><strong>Second-Level Cache </strong></p><p><strong>External Cache Control </strong></p><p></p><ul style="display: flex;"><li style="flex:1"><strong>•</strong></li><li style="flex:1"><strong>E$ sizes from 0.25 MB to 2MB </strong></li></ul><p></p><p>rq </p><ul style="display: flex;"><li style="flex:1">ad (ac) </li><li style="flex:1">tc </li></ul><p></p><p>Check </p><p>dt </p><p>transfer <br>Request Address <br>(Access SRAM) </p><ul style="display: flex;"><li style="flex:1">Data </li><li style="flex:1">Tag </li></ul><p></p><p><strong>•••••</strong><br><strong>E$ is direct-mapped, physically indexed and tagged, w/64B line size </strong></p><p></p><ul style="display: flex;"><li style="flex:1">E$ </li><li style="flex:1">transfer </li></ul><p></p><p><strong>Uses synchronous SRAMs with delayed write </strong></p><p><strong>External Cache Pipeline </strong></p><p><strong>8B (+ parity) interface to E$ supports 1.2 GB/s sustained bandwidth </strong></p><p><strong>15 </strong><br><strong>128 </strong></p><p><strong>64 </strong><br><strong>External </strong><br><strong>Prefetch </strong></p><p><strong>Unit </strong><br><strong>Cache Tags </strong></p><p><strong>Supports two sram modes, compatible with US-I or US-II srams </strong></p><p><strong>Second Level </strong><br><strong>16+2(parity) </strong></p><p><strong>Cache Control Unit </strong></p><p><strong>ad and dt are 2 cycles each. ac stage doesn’t exist for 22 mode srams </strong></p><p><strong>18 </strong><br><strong>External Cache </strong><br><strong>Load/ Store Unit </strong></p><p><strong>••</strong><br><strong>PCI activity is fully cache coherent 16-byte fill to L1 data cache, desired 8 bytes first </strong></p><p><strong>64 + 8 (parity) </strong><br><strong>64 </strong></p><p>13 of 26 </p><p><strong>Pins (balls) </strong></p><p><strong>• 90&nbsp;- L2 cache data and tag • 37&nbsp;- L2 cache addr/control • 8&nbsp;- L2 cache byte write enables • 53&nbsp;- PCI plus arbiter • 9&nbsp;- Interrupt interface • 35&nbsp;- UPA64S address/control interface • 34&nbsp;- dram and transceiver control • 72&nbsp;- dram/UPA64S data • 18&nbsp;- clock input and reset/modes • 10&nbsp;- jtag/test </strong></p><p><strong>total with power: 587 </strong></p><p>14 of 26 </p><p><strong>Performance (300mhz) </strong></p><p><strong>SPEC </strong></p><p><strong>0.5mbyte L2&nbsp;2.0 Mbyte L2 </strong><br><strong>SPECint95 SPECfp95 </strong><br><strong>11.0 12.8 </strong><br><strong>11.6 15.0 </strong></p><p><strong>Memory </strong></p><p><strong>L2-Cache read L2-Cache write DRAM read (random) DRAM write DRAM read (same page) DRAM, memcopy </strong><br><strong>1.2 GBytes/S 1.2 GBytes/S 350 Mbytes/S 350 Mbytes/S 400 Mbytes/S 303 Mbytes/S </strong><br><strong>DRAM, memcopy to UPA&nbsp;550 Mbytes/S </strong></p><p><strong>STREAM (Compiled, and with VIS block store) </strong></p><p><strong>Copy Scale Add </strong><br><strong>199 Mbytes/S 199 Mbytes/S 224 Mbytes/S 210 Mbytes/S </strong><br><strong>303 Mbytes/S 296 Mbytes/S 277 Mbytes/S </strong></p><ul style="display: flex;"><li style="flex:1"><strong>277 Mbytes/S </strong></li><li style="flex:1"><strong>Triad </strong></li></ul><p></p><p>15 of 26 </p><p><strong>Performance </strong></p><p><strong>PCI (66 MHz/32 bits, Single Device) </strong></p><p><strong>PCI to DRAM (DMA) PCI from&nbsp;DRAM (DMA) PCI to L2 Cache (DMA) PCI from&nbsp;L2 Cache (DMA)&nbsp;64-byte Reads CPU to PCI (PIO) </strong><br><strong>64-byte Writes 64-byte Reads 64-byte Writes </strong><br><strong>151 Mbytes/S 132 Mbytes/S 186 Mbytes/S 163 Mbytes/S </strong></p><ul style="display: flex;"><li style="flex:1"><strong>200 Mbytes/S </strong></li><li style="flex:1"><strong>64-byte Writes </strong></li></ul><p></p><p><strong>UPA64S (100 MHz/64 bits) </strong></p>

View Full Text

Details

  • File Type
    pdf
  • Upload Time
    -
  • Content Languages
    English
  • Upload User
    Anonymous/Not logged-in
  • File Pages
    13 Page
  • File Size
    -

Download

Channel Download Status
Express Download Enable

Copyright

We respect the copyrights and intellectual property rights of all users. All uploaded documents are either original works of the uploader or authorized works of the rightful owners.

  • Not to be reproduced or distributed without explicit permission.
  • Not used for commercial purposes outside of approved use cases.
  • Not used to infringe on the rights of the original creators.
  • If you believe any content infringes your copyright, please contact us immediately.

Support

For help with questions, suggestions, or problems, please contact us