UltraSPARC™-II

High-Performance 64-bit RISC Application Note March 1996


1.0 UltraSPARC-II Overview:

UltraSPARC-II is the second-generation product based on the UltraSPARC pipeline. In addition to using a new process technology, UltraSPARC-II provides a higher clock frequency, multiple SRAM modes, and multiple System-to-Processor clock ratios. These accommodate multiple price points for system developers while preserving software compatibility with existing UltraSPARC-I based systems. UltraSPARC-II also implements the SPARC V9 PREFETCH instruction.

2.0 UltraSPARC-II Advantage

UltraSPARC-II offers scaled Compute, Multimedia and Networking performance over UltraSPARC-I, with the following features and benefits:

• 64-bit SPARC V9 Architecture processor
• 4-Way SuperScalar, In-order dispatch, out-of-order completion
• Higher Clock Frequencies of 250+ MHz
• Multiple Clocking Modes
• External Cache Flexibility
• Support for Larger Second Level External Cache
• Software Prefetch Instruction Support
• Improved memory subsystem capabilities - Multiple Outstanding requests
• Increased data bandwidth
• UDB-II with additional buffers to support prefetch from E$
• Plug-Compatible with UltraSPARC-I modules
• Next Generation Technology


Figure 1 UltraSPARC-II Block Diagram (changed blocks in grey)

[Block diagram. Blocks shown: Prefetch and Dispatch Unit (PDU) with Instruction Cache and Buffer and Grouping Logic; Memory Management Unit (MMU); Integer Reg and Annex; Integer Execution Unit (IEU); Load Store Unit (LSU) with Data Cache, Load Queue and Store Queue; Floating Point Unit (FPU) with FP Reg, FP multiply, FP add and FP divide; Graphics Unit (GRU); External Cache Unit (ECU); Ext. Cache RAM; Memory Interface Unit (MIU); System Interconnect.]


3.0 External Cache SRAM Modes:

UltraSPARC-II supports two Ecache SRAM modes, architected to accommodate multiple system price points.

Table 1 SRAM Modes

Mode     SRAM “class”                 Pin-to-pin latency  Clock frequency  Type of SRAM
“1-1-1”  4ns custom SRAM              3 cycles            250 MHz          4ns cycle time, pipelined. Functionally identical to UltraSPARC-I Ecache SRAM.
“2-2”    6ns UltraSPARC-I class SRAM  4 cycles            125 MHz          7ns access time in “register-latch” mode. Will be clocked at half the processor frequency.

3.1 “1-1-1” SRAM Mode

This mode uses pipelined SRAMs with cycle time equal to the processor cycle time (4ns). The “1-1-1” nomenclature refers to one processor clock to send the address, one to access the SRAM array, and one to return data. Functionally, this is identical to the SRAMs used by UltraSPARC-I. This mode provides the best possible Ecache latency and throughput.


3 cycles: 1 cycle each to send Address, Access SRAM and return Data.

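[Timing diagram: L5_CLK cycles 1 through 10, with ECAD, DSYN_WR_L, DOE_L and EDATA waveforms. ECAD carries addresses A through G; EDATA shows read data for A and B, write data for C, then read data for D and E.]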


Figure 2 System: 3:1 Clocking 1-1-1 SRAM Mode

[Clock distribution on US-II module. Labels in the figure: CPU Clock 250 MHz (4ns); System Clock 83.3 MHz (12ns); clock buffer chip with divide-by-2 and divide-by-1 outputs; 125 MHz (8ns); SRAMs (4+1) at 250 MHz (4ns); US-II; 2 Data Buffers (UDB-II); 250 MHz E$ Data Bus (4.0ns); UPA Data.]


3.2 “2-2” SRAM Mode

“2-2” uses SRAMs similar to UltraSPARC-I’s, but in a register-latch output mode. This provides improved access latency by eliminating the output register. However, these SRAMs are clocked at half the processor frequency (twice the processor cycle time). The “2-2” nomenclature refers to two processor clocks to send the address to the SRAM, and two clocks to access and return data on a read.

4 cycles: 2 Clocks to send address, 2 clocks to access SRAM and return data.

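[Timing diagram: L5_CLK cycles 1 through 10 and SRAM_CLK (2ns early), with ECAD, DSYN_WR_L, DOE_L and EDATA waveforms. ECAD carries addresses A through E; EDATA shows read data for A and B, write data for C, then read data for D.]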

These SRAMs also have the late-write architecture. The “dead” cycles in Ecache data bus-turnaround have been eliminated, allowing consecutive read-write-read accesses with no bubbles.

This provides a lower-cost system implementation with minimal performance impact.
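The latency and clocking figures quoted for the two SRAM modes can be checked with a few lines of arithmetic. The following is a minimal sketch, not part of the original note; the function names are illustrative, and the cycle counts are the pin-to-pin latencies given in Table 1.

```python
def ecache_latency_ns(mode, cpu_period_ns):
    """Pin-to-pin Ecache read latency in ns for the given SRAM mode."""
    cycles = {"1-1-1": 3, "2-2": 4}[mode]   # processor clocks, from Table 1
    return cycles * cpu_period_ns


def sram_clock_period_ns(mode, cpu_period_ns):
    """SRAM clock period: full CPU rate in 1-1-1 mode, half rate in 2-2 mode."""
    return cpu_period_ns * (1 if mode == "1-1-1" else 2)


if __name__ == "__main__":
    for mode in ("1-1-1", "2-2"):
        print(mode, ecache_latency_ns(mode, 4), "ns latency,",
              sram_clock_period_ns(mode, 4), "ns SRAM clock")
```

With a 4ns processor clock, the extra cycle of “2-2” mode costs one processor clock of read latency, while the SRAM itself only has to run at an 8ns cycle.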


Figure 3 System 3:1 Clocking 2-2 SRAM Mode

[Clock distribution on US-II module. Labels in the figure: UltraSPARC-II Module; US-II at 250MHz (4ns); SRAM (6ns, 4+1); UDB-II; internal clocks clka and sdbclka; CPU_CLK 125MHz (8ns); SYSTEM_CLK 83.3MHz (12ns).]


Figure 4 System 4:1 Clocking 2-2 SRAM Mode

[Clock distribution on US-II module. Labels in the figure: CPU Clock 165 MHz (6ns); System Clock 83.3MHz (12ns); clock buffer chip with two 165 MHz (6ns) outputs; SRAMs (4+1) at 250 MHz (4ns access); US-II; 2 Data Buffers (UDB-II); 330 MHz E$ Data Bus (3.0ns); UPA Data.]

This mode provides three essential advantages:

1. Cost reduction: these are essentially UltraSPARC-I SRAMs, with continued high volume, and with additional vendors in the market.

2. Risk reduction: availability of 4ns parts is questionable early in UltraSPARC-II life. The “2-2” mode (UltraSPARC-I-class) SRAMs are assured.

3. Density: 4x density improvement (4Mbit parts) in the UltraSPARC-I-class SRAMs. This should be available for use in the UltraSPARC-II modules.


4.0 Ecache Configuration

The SRAM configurations can support larger External Cache configurations as shown in the following table.

Table 2 Ecache Configuration

Ecache Mode  Size        Data SRAMs (number/type)  Tag SRAMs (number/type)
“1-1-1”      512K bytes  4 32Kx36 (1Mbit)          1 32Kx36 (1Mbit)
“2-2”        512K bytes  4 32Kx36 (1Mbit)          1 32Kx36 (1Mbit)
             1M bytes    8 64Kx18 (1Mbit)          1 32Kx36 (1Mbit)
             2M bytes    4 128Kx36 (4Mbit)         1 32Kx36 (1Mbit)
             4M bytes    8 256Kx18 (4Mbit)         1 128Kx36 (4Mbit)

4ns SRAMs are not expected to have densities greater than 1Mbit (through 1997), and we currently believe we are limited to four data SRAMs in order to run at the 250MHz rate. Therefore, the “1-1-1” mode Ecache will initially be limited to 512K bytes.
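The Ecache sizes in Table 2 follow from the SRAM organizations if a x36 part is taken to carry 32 data bits plus 4 check bits per word, and a x18 part 16 data bits plus 2 check bits. That split is an assumption (it is the usual convention for such parts, but this note does not state it), and the function name below is illustrative only.

```python
# Assumption: x36 SRAMs hold 32 data bits per word and x18 SRAMs hold 16;
# the remaining bits are taken to be parity/check bits.
DATA_BITS_PER_WORD = {36: 32, 18: 16}


def ecache_data_bytes(num_srams, depth_k, width):
    """Data capacity in bytes of num_srams parts organized depth_k K x width."""
    words = depth_k * 1024
    return num_srams * words * DATA_BITS_PER_WORD[width] // 8


# Data SRAM configurations taken from Table 2: (count, depth in K, width).
TABLE_2 = {
    "512K bytes": (4, 32, 36),
    "1M bytes": (8, 64, 18),
    "2M bytes": (4, 128, 36),
    "4M bytes": (8, 256, 18),
}

if __name__ == "__main__":
    for label, (count, depth_k, width) in TABLE_2.items():
        print(label, "->", ecache_data_bytes(count, depth_k, width) // 1024, "KB")
```

Under the same assumption, every row supplies 128 data bits per access (4 x 32 or 8 x 16), so the larger configurations differ only in SRAM depth.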

5.0 Software Prefetch and multiple-outstanding misses

UltraSPARC-II supports the SPARC v9 Prefetch instruction. Prefetches primarily address floating-point vector code, in which the software (compiler) can accurately schedule the prefetch of data sufficiently ahead of its usage, and in which execution is bounded by Ecache miss throughput. UltraSPARC-I treats PREFETCH instructions as NOPs. UltraSPARC-II has the following enhancements:

1. Allowing loads and stores (Ecache-hits) to continue while a prefetch (Ecache-miss) is outstanding. An outstanding Prefetch should not block subsequent load or store hits. This extension from UltraSPARC-I allows greater miss throughput. The UltraSPARC-I Load Buffer is designed such that a load with an Ecache-miss will block subsequent load hits; these load-hits in turn block subsequent load misses. This tends to serialize load-misses. However, Prefetch misses will not block subsequent load hits. Hence prefetches can be scheduled sufficiently far in advance of the associated Load (or Store) instruction, without interfering with subsequent loads and stores.

2. Allowing multiple outstanding read-misses from the Ecache, to increase miss-throughput relative to UltraSPARC-I: UltraSPARC-I supports at most one outstanding read request to the system; UltraSPARC-II will support up to three. If prefetch instructions are scheduled effectively, we can “pipeline” the memory latency with multiple outstanding read requests. This can increase miss throughput by 2.5x to 3x relative to UltraSPARC-I.
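The effect of overlapping read misses can be illustrated with a toy model. The sketch below is an illustration only and does not describe the actual UltraSPARC-II or UPA timing: it assumes every miss has the same fixed latency, that a new request can be issued no sooner than `issue_gap` cycles after the previous one, and that at most `max_outstanding` requests are in flight. All names and the latency value in the example are hypothetical.

```python
def cycles_for_misses(num_misses, latency, max_outstanding, issue_gap=1):
    """Cycles until the last of num_misses returns data, under the toy model."""
    completion = []      # completion time of each request, in issue order
    next_issue = 0       # earliest cycle at which the next request may issue
    for i in range(num_misses):
        issue = next_issue
        if i >= max_outstanding:
            # A slot frees up only when the request issued
            # max_outstanding places earlier has completed.
            issue = max(issue, completion[i - max_outstanding])
        completion.append(issue + latency)
        next_issue = issue + issue_gap
    return completion[-1] if completion else 0


if __name__ == "__main__":
    one = cycles_for_misses(30, latency=40, max_outstanding=1)
    three = cycles_for_misses(30, latency=40, max_outstanding=3)
    print("speedup with three outstanding reads:", round(one / three, 2))
```

In this model the speedup approaches the number of outstanding requests when requests can be issued back to back, and falls below it as `issue_gap` grows, which is in line with the 2.5x to 3x range quoted above.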


Prefetches will appear like Loads which do not return data to a register. A prefetch request which is sent to the ECU will check the Ecache for the block. If the prefetch hits in the Ecache, the operation is complete; if it does not hit, the ECU will request that block from the system. When the system returns the requested data, it is written into the Ecache only, not to the Dcache.

5.1 Prefetch variants

The SPARC v9 Prefetch instruction defines a “function” field to indicate read-once, read-many, write-once, write-many, or prefetch-page access. UltraSPARC-II supports two variants: read-many and write-many. That is, a prefetch which misses in the Ecache will issue a P_RDS_REQ or P_RDO_REQ request to the system, depending on whether it is a read or write variant. The following table describes the actions for each Prefetch variant:

fcn    Prefetch function            Action
0      Prefetch for several reads   Generate P_RDS_REQ if the desired line is not Ecache-resident
1      Prefetch for one read        (same as fcn 0)
4      Prefetch page                (same as fcn 0)
2      Prefetch for several writes  Generate P_RDO_REQ if the desired line is not Ecache-resident in the M or E state
3      Prefetch for one write       (same as fcn 2)
5-15   reserved                     illegal-instruction trap
16-31  Implementation-dependent     NOPs
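The table above can be read as a small decision function. The sketch below is one way to express it, assuming the Ecache line state is passed in as a string (or None when the line is not resident); the function name, the return strings and that representation of the line state are illustrative, not part of the processor interface.

```python
def prefetch_action(fcn, ecache_state=None):
    """Return the action for a PREFETCH with the given 5-bit fcn value.

    ecache_state is None if the line is not Ecache-resident, otherwise a
    string naming its state. Only "M" and "E" are distinguished here,
    because those are the states the table above refers to.
    """
    if not 0 <= fcn <= 31:
        raise ValueError("fcn is a 5-bit field")
    if fcn in (0, 1, 4):                 # read variants and prefetch-page
        return "none" if ecache_state is not None else "P_RDS_REQ"
    if fcn in (2, 3):                    # write variants
        return "none" if ecache_state in ("M", "E") else "P_RDO_REQ"
    if 5 <= fcn <= 15:                   # reserved encodings
        return "illegal-instruction trap"
    return "nop"                         # 16-31, implementation-dependent


if __name__ == "__main__":
    for fcn, state in [(0, None), (1, "S"), (2, "S"), (3, "M"), (7, None), (20, None)]:
        print(fcn, state, "->", prefetch_action(fcn, state))
```

Note that a write-variant prefetch to a line that is resident, but not in the M or E state, still generates a P_RDO_REQ.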

5.1.1 Multiple outstanding read-miss requests

Independent of prefetch, UltraSPARC-II can also support multiple outstanding miss requests from different internal sources. At any given time there may be any combination of one Instruction-fetch miss, one Store miss and up to three Load/Prefetch misses outstanding to the system. However, the total number of outstanding read requests is limited to three. Also note that this is for cacheable requests and Block-Reads only; non-cacheable (single) read misses are still limited to one outstanding, and this one non-cacheable read is mutually exclusive with any cacheable read-misses.

More precisely, UltraSPARC-II can have at most three outstanding class-0 requests, consisting of:

1. Up to three P_RDS or P_RDO requests

2. At most one P_RDD or P_NCBRD request. (This limit of one is due to internal load-buffer restrictions, and is not fundamental to the ECU or the UPA interface.)

3. At most one P_RDSA. (internal restriction)

As noted above, these outstanding class-0 requests are mutually exclusive with a P_NCRD request, which is class-1. Once a P_NCRD request is issued, UltraSPARC-II will not issue any other requests until the P_NCRD receives its S_Reply.

Note: For backward-compatibility with existing systems, UltraSPARC-II can be configured to allow either one or three outstanding miss requests.
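The limits on outstanding read requests described in this section can be summarized as an admission check. The sketch below models read requests only (writebacks are not tracked), treats "at most one P_RDD or P_NCBRD" as a combined limit of one, and uses a `class0_limit` parameter to stand for the one-or-three configuration; the function and parameter names and those interpretations are assumptions, not a description of the ECU logic.

```python
from collections import Counter


def can_issue(outstanding, request, class0_limit=3):
    """Return True if `request` may be issued now.

    `outstanding` lists the read requests still awaiting their S_Reply.
    """
    counts = Counter(outstanding)
    if counts["P_NCRD"]:
        return False                     # nothing is issued behind a P_NCRD
    if request == "P_NCRD":
        return not outstanding           # exclusive with class-0 requests
    if request not in ("P_RDS", "P_RDO", "P_RDD", "P_NCBRD", "P_RDSA"):
        raise ValueError(f"unknown request {request!r}")
    if len(outstanding) >= class0_limit:
        return False                     # class-0 total is capped
    if request in ("P_RDD", "P_NCBRD"):
        return counts["P_RDD"] + counts["P_NCBRD"] == 0
    if request == "P_RDSA":
        return counts["P_RDSA"] == 0
    return True                          # P_RDS / P_RDO up to the class-0 cap


if __name__ == "__main__":
    print(can_issue(["P_RDS", "P_RDO"], "P_RDS"))        # third class-0 read
    print(can_issue(["P_RDS", "P_RDO", "P_RDS"], "P_RDO"))
    print(can_issue(["P_RDS"], "P_NCRD"))
```

Passing `class0_limit=1` reproduces the single-outstanding-read behaviour of the backward-compatible configuration.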

5.1.2 Multiple outstanding writebacks

UltraSPARC-II supports up to two outstanding dirty victim writebacks. This is needed to balance the three outstanding reads. Two writebacks are sufficient since not all misses will require writebacks. Also, three misses plus three writebacks would more than saturate the expected memory bandwidth.

Actually, UltraSPARC-II will not send more than two outstanding “dirty misses” -- reads with DVP (see note) set. A third read with DVP set will be stalled until the first P_WRB gets S-replied.

Note: Dirty Victim Pending (DVP): If a cache transaction displaces a dirty victim block in the cache, the Dirty_Victim_Pending bit is set in the request packet.

For backward-compatibility, UltraSPARC-II can be configured to allow either one or two writebacks. This is tied to the number of outstanding miss requests, to provide only two supported modes. The mode can be identified by bits in the UPA Configuration Register (see section 9.0).

Table 3 Multiple outstanding Writeback modes

Read/Writeback Mode    Max outstanding Read-miss requests to system  Max outstanding dirty writebacks
“UltraSPARC-I mode”    1                                             1
“UltraSPARC-II mode”   3                                             2

5.1.3 P_Request ordering for multiple dirty misses

UltraSPARC-II will always send a dirty read miss (P_RDO/RDS/RDSA) followed by its P_WRB without any intervening P_requests. Thus, multiple read and writeback requests will always be sent in P_RDx/P_WRB/P_RDx/P_WRB order.
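The ordering rule can be shown by expanding a list of read misses into the sequence of P_requests sent to the system. This is a minimal sketch: the function name, the tuple representation of a miss and the "+DVP" suffix used to mark a read with Dirty_Victim_Pending set are all illustrative notation.

```python
def p_request_sequence(misses):
    """Expand read misses into the order of P_requests sent to the system.

    `misses` is a list of (request, dirty_victim) pairs such as
    ("P_RDO", True). A dirty miss is followed immediately by its P_WRB.
    """
    sequence = []
    for request, dirty_victim in misses:
        sequence.append(request + ("+DVP" if dirty_victim else ""))
        if dirty_victim:
            sequence.append("P_WRB")     # no other request may intervene
    return sequence


if __name__ == "__main__":
    print(p_request_sequence([("P_RDS", True), ("P_RDO", True), ("P_RDS", False)]))
```

The sketch does not model the stall of a third dirty miss described in section 5.1.2; it only shows the pairing of each dirty read with its writeback.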

6.0 UltraSPARC DataBuffer - II (UDB-II)

The UDB chip for UltraSPARC-II is called “UDB-II”. UDB-IIs are EPIC-3 process gate arrays, the same as UDB-I. UDB-IIs have the same interfaces and function as UDB-Is, with the following changes:

6.1 Split clock domains

The UDB-II will have two clock domains, one using the UPA clock, the other using a fixed 2:1 ratio clock. The two domains are synchronous, with fixed ratios (e.g., 2:2, 3:2, and 4:2).

The domains are coupled through the read-buffers and write-buffers, as can be seen in Figure 5.

This is primarily a design simplification, to accommodate more combinations of Processor:System clock ratios as well as the new SRAM modes. The ECU and SRAMs can communicate with the UDB-II on a fixed 2:1 bus, independent of the UPA clock ratio. The ECU must track the domain-boundary crossing, to ensure that data is not unloaded from the read-buffers before it is available, and also that data loaded into write-buffers does not cause the buffer data to be overwritten.

6.2 Additional Read-buffers

The UDB-II provides three 64-byte read buffers to support the three possible outstanding miss requests. UDB-Is required only one. The UPA interconnect architecture doesn’t allow a Master Port (i.e., the processor/UDB-II) to flow-control data read from memory. Therefore, UDB-II needs one read buffer for each outstanding block read request.

6.3 Additional Pipelining in Ecache-Fill path

The 2:1 domain requires the CPU-side interface to run at 8ns. The equivalent UDB-I logic, built in the same technology, was designed for 12ns. Therefore one additional pipeline stage is inserted to convert the prior 12ns path into two 8ns cycles.


Figure 5 UDB-II block diagram

[Block diagram. Labels in the figure: EDATA side (2:1 Clock) and SYSDATA side (System Clock); rdbuf; ECC corr.; parity gen; parity chk; ECC gen; snoop; writeback; ncstore; Interrupt Receive; Interrupt Send; new pipelining stage relative to UDB-I.]

6.4 Maximum ecache size

The UltraSPARC-II ECU supports up to 16 MByte of Ecache. This presumes availability of 16Mbit SRAMs (with 7ns access time in Register-Latch mode).

6.5 Ecache de-configurability

UltraSPARC-II supports de-configurable Ecache size, to allow UltraSPARC-II modules with different amounts of Ecache to plug into a system for compatibility with existing modules.

UltraSPARC-II also permits a single module (PC-board) design to support various Ecache sizes. The base Ecache Tag, a 32Kx36 SRAM, has sufficient depth to support Ecaches of 512K to 2M bytes. By masking off the Tag-index bits which correspond to unimplemented Data SRAM, one module design could support different Ecache sizes just by changing the Data SRAMs.


Ecache configuration would be determined at boot time, and bits set in the UPA_CONFIG register. These registered bits then mask off the appropriate high-order Data and Tag SRAM index bits. See section 9.0, “UPA_CONFIG register”.

7.0 Processor:System clock ratios

UltraSPARC-II will support the following Processor:System clock ratios.


Table 4 UltraSPARC-II - Processor:System clock ratio

Clock ratio  CPU clock range  UPA clock range        Notes
4:1          < 4 ns           > 12 ns and < 16 ns    This is for a “fast” UltraSPARC-II. Note this requires a “fast” UDB-II as well, since the 2:1 domain and Ecache I/O timing must scale with the CPU clock.
3:1          >= 4 ns          >= 12 ns               Base design. Compatible with existing 12ns UPA interface, using 3:1 instead of UltraSPARC-I’s 2:1.
2:1          >= 6 ns          >= 12 ns               This mode is used only for UltraSPARC-I “replacement”, in which the processor clock runs at UltraSPARC-I (6ns) cycle time.

Note: The current UPA implementation limits the system interface to a maximum of 83.3 MHz, or 12ns. Future implementations are expected to go down to 10ns; this would allow for a 3.3 ns processor clock in 3:1 mode.

Table 5 Clock Modes

Mode  UltraSPARC-II Mode  Frequency (MHz)
2:1   UltraSPARC-I Mode   167:83
3:1   Primary Mode        250:83
4:1   High Speed          250+:83

Note: The UPA architecture defines a UPA_Ratio signal which indicates either 2:1 or 3:1 mode. The system will not provide UPA_Sys_clk and UPA_CPU_Clk with a 4:1 ratio; in order to get 4:1, the module will receive effectively 4:2 clocks (which the system sees as 2:1) and UltraSPARC-II internally doubles the 2:1 clock.
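The relationship between the clock ratio, the UPA clock period and the processor cycle time in Tables 4 and 5 is simple division. The sketch below works through it; the function names are illustrative, and `upa_ratio_signal` encodes only the statement in the note above that a 4:1 module is seen by the system as 2:1.

```python
def cpu_period_ns(upa_period_ns, ratio):
    """Processor cycle time for a Processor:System ratio of `ratio`:1."""
    if ratio not in (2, 3, 4):
        raise ValueError("supported ratios are 2:1, 3:1 and 4:1")
    return upa_period_ns / ratio


def mhz(period_ns):
    """Convert a clock period in ns to a frequency in MHz."""
    return 1000.0 / period_ns


def upa_ratio_signal(ratio):
    """Ratio seen by the system; 4:1 is delivered as 4:2, which is 2:1."""
    return 2 if ratio == 4 else ratio


if __name__ == "__main__":
    for ratio in (2, 3, 4):
        period = cpu_period_ns(12, ratio)
        print(f"{ratio}:1 -> {period:.1f} ns, {mhz(period):.0f} MHz")
```

With the 12ns UPA clock, the 2:1 and 3:1 ratios give the 6ns and 4ns processor cycle times of Table 5, and 4:1 gives 3ns. With a future 10ns UPA clock, 3:1 gives the 3.3ns processor clock mentioned in the note under Table 4.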


8.0 UltraSPARC-II module clock domains

The following table lists the required clock ratios for each combination of SRAM mode and Processor:System clock mode. This presumes that systems will run no faster than a 12ns cycle time. For simplicity, the table shows only one clock-frequency combination for each mode; this is the fastest the clocks can run for each mode shown. In each case, the CPU clock at the chip pin will be one-half the frequency of the desired core clock, and hence must be doubled by the PLL.


Table 6 UltraSPARC-II Module Clock Domains

UltraSPARC-II     Pipeline:UPA  Ecache SRAM  UPA_Sys_clk      UPA_cpu_clk      CPU clock     UDB-II clocks              SRAM
“pipeline” clock  clock mode    mode         (@ module pin)   (@ module pin)   (@ chip pin)  (ecache-side & sys-side)   clock
3ns (sub-4ns)     4:1 (1)       “22”         12ns             6ns              6ns           6ns & 12ns                 6ns
4ns               3:1 (2)       “111”        12ns             4ns              8ns           8ns & 12ns                 4ns
                                “22”                          8ns                                                       8ns
6ns               2:1 (3, 4)    “111”        12ns             6ns              12ns          12ns & 12ns                6ns

1. No support for 111-mode SRAM with 4:1 system clock! The system cannot generate the sub-4ns clock to be used by the SRAM.
2. This is the production mode.
3. Only 111-mode SRAM is supported with 2:1 system clock.
4. This is the limited-volume UltraSPARC-I-replacement mode.

Table 7 Processor:System Clock Ratios

SRAM Mode  2:1            3:1        4:1
“111”      Supported      Supported  Not supported
“22”       Not supported  Supported  Supported
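The clock periods in Table 6 appear to follow a simple pattern: the CPU pin clock and the UDB-II ecache-side clock run at twice the pipeline period, and the SRAM clock runs at the pipeline period in 111 mode and at twice that in 22 mode. The sketch below encodes that reading together with the support matrix of Table 7. It is an interpretation of the tables rather than a statement from the note, and the function and key names are illustrative.

```python
# Supported (SRAM mode, Processor:System ratio) combinations, from Table 7.
SUPPORTED = {("111", 2), ("111", 3), ("22", 3), ("22", 4)}


def module_clocks(pipeline_ns, ratio, sram_mode, upa_ns=12):
    """Clock periods (ns) on the module for one supported combination."""
    if (sram_mode, ratio) not in SUPPORTED:
        raise ValueError(f"{sram_mode}-mode SRAM is not supported at {ratio}:1")
    half_rate = 2 * pipeline_ns          # pin clock, doubled on chip by the PLL
    return {
        "cpu_pin": half_rate,
        "udb_ecache_side": half_rate,    # fixed 2:1 domain
        "udb_sys_side": upa_ns,
        "sram": pipeline_ns if sram_mode == "111" else half_rate,
    }


if __name__ == "__main__":
    for pipeline_ns, ratio, mode in [(3, 4, "22"), (4, 3, "111"), (4, 3, "22"), (6, 2, "111")]:
        print(f"{pipeline_ns}ns {ratio}:1 {mode}:", module_clocks(pipeline_ns, ratio, mode))
```

For the four combinations listed, the computed CPU pin, UDB-II and SRAM clock periods agree with the corresponding entries in Table 6.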


Figure 6 Clocks Generation

[Clock generation diagram. Labels in the figure: SYNTH; Divsel A; Divsel B; f/2, f/4 and f/6 outputs; LVE111 and LVE39 buffers; CPU Clock; UPA Clock.]

9.0 UPA_CONFIG register

The UPA configuration register is changed inside UltraSPARC-II to support the new features.

Table 8 UPA Register Configuration

UPA_CONFIG register format

Field     Bits
-         63:43
MCAP      42:39
CLK_MODE  38:37
E$        36
ELIM      35:33
PCON      32:22
MID       21:17
PCAP      16:0

MCAP[3:0] Implementation-dependent Module Capability bits. These bits may be used by software to determine the processor module speed capability. The bits are hard-wired or jumpered on the module, and brought on chip. This is a read-only field.


CLK_MODE[1:0] Processor:System clock ratio. Specifies the ratio of the processor’s internal clock to the system clock. This is a read-only field; writes to these bits have no effect. The encoding is as follows.

Table 9 Clock Modes and Ratios

CLK_MODE[1:0]  Ratio
00             2:1
01             3:1
10             4:1

E$ External Cache SRAM mode. Indicates whether "111" mode or "22" mode SRAM is used on the module. This is a read-only field; writes to this bit have no effect.

Table 10 Ecache Access

E$  SRAM mode
0   111-mode
1   22-mode

ELIM[2:0] Ecache Limit -- Sets an upper limit on the Ecache size to be configured. At reset, this would be set to indicate 16M bytes, and need only be modified during boot-up to force a smaller Ecache than is actually present. The effective Ecache size would be the smaller of the ELIM value and the amount of RAM actually present:


Table 11 Ecache Limits

ELIM[2:0]  Ecache Limit
000        16M bytes
001        8M bytes
010        4M bytes
011        2M bytes
100        1M byte
101        512K bytes
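The ELIM encoding and the "smaller of the two" rule can be written out directly. In this sketch the dictionary reproduces Table 11; the function name is illustrative, and encodings 110 and 111 are rejected because the table does not list them.

```python
# ELIM[2:0] encodings and their Ecache limits in bytes, from Table 11.
ELIM_LIMIT_BYTES = {
    0b000: 16 << 20,
    0b001: 8 << 20,
    0b010: 4 << 20,
    0b011: 2 << 20,
    0b100: 1 << 20,
    0b101: 512 << 10,
}


def effective_ecache_bytes(elim, installed_bytes):
    """Smaller of the ELIM limit and the amount of SRAM actually present."""
    if elim not in ELIM_LIMIT_BYTES:
        raise ValueError("ELIM encodings 110 and 111 are not listed in Table 11")
    return min(ELIM_LIMIT_BYTES[elim], installed_bytes)


if __name__ == "__main__":
    # Reset value (16M-byte limit) on a module with 2M bytes installed.
    print(effective_ecache_bytes(0b000, 2 << 20) >> 10, "KB")
    # Forcing a 512K-byte Ecache on a module with 4M bytes installed.
    print(effective_ecache_bytes(0b101, 4 << 20) >> 10, "KB")
```

Each step in the encoding halves the limit, so the listed values equal 16M bytes shifted right by the ELIM value.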

PCON[10:0] Consists of five fields, [WB, SCIQ0, BST, NCST, SCIQ1] that determine the depth of the system queues for transactions issued by UltraSPARC-II. The PCON field is initialized with the minimum values at reset and may be modified by an ASI store.

- WB: Maximum number of outstanding (N-1) Writebacks

- SCIQ0[1:0]: Maximum number of outstanding (N-1) class 0 transactions.

- BST: (same as UltraSPARC-I)

- NCST[2:0] (same as UltraSPARC-I)

- SCIQ1[3:0] (same as UltraSPARC-I)

[Definitions of MID[4:0] and PCAP[16:0] are also unchanged from UltraSPARC-I PRM]

Note: The only combinations of WB and SCIQ0[1:0] supported by UltraSPARC-II are:

WB==0, SCIQ0==0 and

WB==1, SCIQ0==2
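Software that reads UPA_CONFIG would typically split the value into its fields using the bit positions given in the register format. The sketch below does that for the top-level fields only. How the 64-bit value is obtained is outside its scope, the function names are illustrative, and the sub-fields of PCON are not decoded because this note does not give their bit positions.

```python
# (high bit, low bit) of each field, from the UPA_CONFIG register format.
FIELDS = {
    "MCAP": (42, 39),
    "CLK_MODE": (38, 37),
    "E$": (36, 36),
    "ELIM": (35, 33),
    "PCON": (32, 22),
    "MID": (21, 17),
    "PCAP": (16, 0),
}


def decode_upa_config(value):
    """Split a 64-bit UPA_CONFIG value into its named fields."""
    fields = {}
    for name, (hi, lo) in FIELDS.items():
        width = hi - lo + 1
        fields[name] = (value >> lo) & ((1 << width) - 1)
    return fields


def wb_sciq0_supported(wb, sciq0):
    """True for the two WB/SCIQ0 combinations listed as supported."""
    return (wb, sciq0) in {(0, 0), (1, 2)}


if __name__ == "__main__":
    # Hypothetical value: 3:1 clock mode, 22-mode SRAM, 2M-byte ELIM, MID 5.
    value = (0b01 << 37) | (1 << 36) | (0b011 << 33) | (5 << 17)
    print(decode_upa_config(value))
```

Since WB and SCIQ0 hold N-1 values, the two supported combinations correspond to one writeback with one class-0 transaction, and two writebacks with three class-0 transactions, matching Table 3.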


10.0 Performance - Estimated

Assumptions:

UltraSPARC-II 250 MHz with UDB-II, “111” Mode, 4ns SRAMs, 2Meg Ecache, SW Prefetch support, Clock Ratio of 3:1

350-420 SPECint92

550-660 SPECfp92

Software Prefetch and multiple outstanding reads are expected to contribute approximately 10% additional SPECfp92 performance.

Up to 100% speedup in critical loops.

11.0 Technology Overview

• TI’s EPIC-4 Process - 5 Layer Metal
• n-well CMOS; p+ Substrate; p-epi
• 0.29 micron L poly drawn (0.22 micron Leff)
• 5.7 nm Tox
• 2.5V Supply (CORE)
• 3.3V Interface to UPA
• Flip Chip
• Land Grid Array Ceramic Package
• Die Size: 12.5mm x 12.5mm
• Approximate Transistor Count: 5.2 Million
• Estimated Power Consumption: 26 Watts @ 250 MHz
• Estimated Power Consumption Module: 49 Watts @ 250 MHz

12.0 Pin Description:

New Pins:

ECACHE_22_MODE, PHASE_DET_CLK, MCAP[3:0], SPARE_OUT

Pins Deleted:

LOOP_CAP, SCLK_MODE

13.0 OBP Issues


Added additional E$Size support.

Additional Speed Mode Programmability.

14.0 System Controller

The USC-Plus Uni-Processor (System Controller Plus) or SPEC (Single Processor Enhanced Controller) is an enhancement over the USC to support the UltraSPARC-II processor, and adds some features for a high-end uniprocessor system. Some of the new features are:

• Support for multiple outstanding loads (up to 3 Class 0 loads).
• Support for multiple Writebacks and Writeback Cancels.
• Support for an extra FFB graphics port.
• Support for up to 16 SIMMs, maximum memory of 2GB.
• 30 additional pins over USC (372 pin BGA package)

Note: This chip has not been productized yet, so no additional data is available at this time.

2550 Garcia Avenue, Mountain View, CA, USA 94043
408 774-8545
408 774-8537 fax

© 1996 Sun Microsystems, Inc. All rights reserved. THE INFORMATION CONTAINED IN THIS DOCUMENT IS PROVIDED “AS IS” WITHOUT ANY EXPRESS REPRESENTATIONS OR WARRANTIES. IN ADDITION, SUN MICROSYSTEMS, INC. DISCLAIMS ALL IMPLIED REPRESENTATIONS AND WARRANTIES, INCLUDING ANY WARRANTY OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, OR NON-INFRINGEMENT OF THIRD PARTY INTELLECTUAL PROPERTY RIGHTS.

This document contains proprietary information of Sun Microsystems, Inc. or under license from third parties. No part of this document may be reproduced in any form or by any means or transferred to any third party without the prior written consent of Sun Microsystems, Inc. Sun, Sun Microsystems and the Sun Logo are trademarks or registered trademarks of Sun Microsystems, Inc. in the United States and other countries. All SPARC trademarks are used under license and are trademarks or registered trademarks of SPARC International, Inc. in the United States and other countries. Products bearing SPARC trademarks are based upon an architecture developed by Sun Microsystems, Inc.

The information contained in this document is not designed or intended for use in on-line control of aircraft, air traffic, aircraft navigation or aircraft communications; or in the design, construction, operation or maintenance of any nuclear facility. Sun disclaims any express or implied warranty of fitness for such uses.

Part Number: 802-7254-01