
UltraSPARC™-II
High-Performance 64-bit RISC Processor
Application Note
March 1996

1.0 UltraSPARC-II Overview

UltraSPARC-II is the second-generation product based on the UltraSPARC pipeline. In addition to using a new process technology, UltraSPARC-II provides a higher clock frequency, multiple SRAM modes, and multiple system-to-processor clock ratios. These options accommodate multiple price points for system developers while preserving software compatibility with existing UltraSPARC-I based systems. UltraSPARC-II also implements the SPARC v9 PREFETCH instruction.

2.0 UltraSPARC-II Advantage

UltraSPARC-II offers scaled compute, multimedia, and networking performance over UltraSPARC-I, with the following features and benefits:

• 64-bit SPARC v9 architecture processor
• 4-way superscalar, in-order dispatch, out-of-order completion
• Higher clock frequencies of 250+ MHz
• Multiple clocking modes
• External cache flexibility
• Support for larger second-level external cache
• Software prefetch instruction support
• Improved memory subsystem capabilities: multiple outstanding requests
• Increased data bandwidth
• UDB-II with additional buffers to support prefetch from E$
• Plug-compatible with UltraSPARC-I modules
• Next-generation technology

[Figure 1 UltraSPARC-II Block Diagram (changed blocks in grey). Blocks shown: Prefetch and Dispatch Unit (PDU), Instruction Cache and Buffer, Grouping Logic, Memory Management Unit (MMU), Integer Reg and Annex, Integer Execution Unit (IEU), Load Store Unit (LSU), Data Cache, Load Queue, Store Queue, Floating Point Unit (FPU), FP Reg, FP multiply, FP add, FP divide, Graphics Unit (GRU), External Cache Unit (ECU), Ext. Cache RAM, Memory Interface Unit (MIU), System Interconnect.]

3.0 External Cache SRAM Modes

UltraSPARC-II supports two Ecache SRAM modes, architected to accommodate multiple system price points.
Table 1 SRAM Modes

Mode      SRAM “class”        Pin-to-pin   Clock       Type of SRAM
                              latency      frequency
“1-1-1”   4 ns custom SRAM    3 cycles     250 MHz     4 ns cycle time, pipelined. Functionally
                                                       identical to UltraSPARC-I Ecache SRAM.
“2-2”     6 ns UltraSPARC-I   4 cycles     125 MHz     7 ns access time in “register-latch” mode.
          class SRAM                                   Clocked at half the processor frequency.

3.1 “1-1-1” SRAM Mode

This mode uses pipelined SRAMs with a cycle time equal to the processor cycle time (4 ns). The “1-1-1” nomenclature refers to one processor clock to send the address, one to access the SRAM array, and one to return data. Functionally, these SRAMs are identical to those used by UltraSPARC-I. This mode provides the best possible Ecache latency and throughput.

[Timing diagram: 3 cycles, 1 cycle each to send the address, access the SRAM, and return data. Signals shown: L5_CLK, ECAD, DSYN_WR_L, DOE_L, EDATA.]

[Figure 2 System: 3:1 Clocking 1-1-1 SRAM Mode. Clock distribution on the US-II module. Labels: SRAMs (4+1), 250 MHz (4 ns); 2 data buffers (UDB-II) on UPA Data; US-II; E$ data bus, 250 MHz (4.0 ns); 125 MHz (8 ns); system clock, 83.3 MHz (12 ns); clock buffer chip with divide-by-2 and divide-by-1; CPU clock, 250 MHz (4 ns).]

3.2 “2-2” SRAM Mode

The “2-2” mode uses SRAMs similar to those of UltraSPARC-I, but in a register-latch output mode. This improves access latency by eliminating the output register. However, these SRAMs are clocked at half the processor frequency. The “2-2” nomenclature refers to two processor clocks to send the address to the SRAM, and two clocks to access and return data on a read.

[Timing diagram: 4 cycles, 2 clocks to send the address, 2 clocks to access the SRAM and return data. Signals shown: L5_CLK, SRAM_CLK (2 ns early), ECAD, DSYN_WR_L, DOE_L, EDATA.]

These SRAMs also have the late-write architecture. The “dead” cycles in Ecache data bus turnaround have been eliminated, allowing consecutive read-write-read accesses with no bubbles.
This provides a lower-cost system implementation with minimal performance impact.

[Figure 3 System 3:1 Clocking 2-2 SRAM Mode. Clock distribution on the US-II module. Labels: US-II, 250 MHz (4 ns); SRAM (6 ns, 4+1); UDB-II; sdbclka; clka; 125 MHz (8 ns); 83.3 MHz (12 ns); UltraSPARC-II Module; CPU_CLK; SYSTEM_CLK.]

[Figure 4 System 4:1 Clocking 2-2 SRAM Mode. Labels: SRAMs (4+1), 250 MHz (4 ns access); 2 data buffers (UDB-II) on UPA Data; US-II, 330 MHz (3.0 ns); E$ data bus; 165 MHz (6 ns); system clock, 83.3 MHz (12 ns); clock buffer chip; CPU clock, 165 MHz (6 ns).]

This mode provides three essential advantages:

1. Cost reduction: these are essentially UltraSPARC-I SRAMs, with continued high volume and with additional vendors in the market.
2. Risk reduction: availability of 4 ns parts is questionable early in the UltraSPARC-II life cycle, whereas the “2-2” mode (UltraSPARC-I-class) SRAMs are assured.
3. Density: a 4x density improvement (4 Mbit parts) in the UltraSPARC-I-class SRAMs, which should be available for use in UltraSPARC-II modules.

4.0 Ecache Configuration

The SRAM configurations can support larger External Cache configurations, as shown in the following table.

Table 2 Ecache Configuration

Ecache                   Number/type of SRAMs
Mode      Size           Data                   Tag
“1-1-1”   512K bytes     4 32Kx36 (1Mbit)       1 32Kx36 (1Mbit)
“2-2”     512K bytes     4 32Kx36 (1Mbit)       1 32Kx36 (1Mbit)
          1M bytes       8 64Kx18 (1Mbit)       1 32Kx36 (1Mbit)
          2M bytes       4 128Kx36 (4Mbit)      1 32Kx36 (1Mbit)
          4M bytes       8 256Kx18 (4Mbit)      1 128Kx36 (4Mbit)

4 ns SRAMs are not expected to have densities greater than 1 Mbit (through 1997), and we currently believe we are limited to 4 data SRAMs in order to run at the 250 MHz rate. Therefore, the “1-1-1” mode Ecache will initially be limited to 512K bytes.

5.0 Software Prefetch and multiple-outstanding misses

UltraSPARC-II supports the SPARC v9 Prefetch instruction.
Prefetches primarily address floating-point vector code, in which the software (compiler) can accurately schedule the prefetch of data sufficiently ahead of its use, and in which execution is bounded by Ecache miss throughput. UltraSPARC-I treats PREFETCH instructions as NOPs. UltraSPARC-II has the following enhancements:

1. Allowing loads and stores (Ecache hits) to continue while a prefetch (Ecache miss) is outstanding. An outstanding prefetch should not block subsequent load or store hits. This extension from UltraSPARC-I allows greater miss throughput. The UltraSPARC-I Load Buffer is designed such that a load with an Ecache miss will block subsequent load hits; these load hits in turn block subsequent load misses. This tends to serialize load misses. However, prefetch misses will not block subsequent load hits. Hence prefetches can be scheduled sufficiently far in advance of the associated load (or store) instruction, without interfering with subsequent loads and stores.

2. Allowing multiple outstanding read misses from the Ecache, to increase miss throughput relative to UltraSPARC-I. UltraSPARC-I supports at most one outstanding read request to the system; UltraSPARC-II will support up to three. If prefetch instructions are scheduled effectively, we can “pipeline” the memory latency with multiple outstanding read requests. This can increase miss throughput by 2.5x to 3x relative to UltraSPARC-I.

Prefetches will appear like loads which do not return data to a register. A prefetch request which is sent to the ECU will check the Ecache for the block. If the prefetch hits in the Ecache, the operation is complete; if it does not hit, the ECU will request that block from the system. When the system returns the requested data, it is written into the Ecache only, not to the Dcache.
5.1 Prefetch variants

The SPARC v9 Prefetch instruction defines a “function” field to indicate read-once, read-many, write-once, write-many, or prefetch-page access. UltraSPARC-II supports two variants: read-many and write-many. That is, a prefetch which misses in the Ecache will issue a P_RDS_REQ or P_RDO_REQ request to the system, depending on whether it is a read or write variant. The following table describes the actions for each Prefetch variant:

fcn     Prefetch function              Action
0       Prefetch for several reads     Generate P_RDS_REQ if the desired line is not
                                       Ecache-resident
1       Prefetch for one read          (same action as fcn 0)
4       Prefetch page                  (same action as fcn 0)
2       Prefetch for several writes    Generate P_RDO_REQ if the desired line is not
                                       Ecache-resident in the M or E state
3       Prefetch for one write         (same action as fcn 2)
5-15    Reserved                       illegal-instruction trap
16-31   Implementation-dependent       NOPs

5.1.1 Multiple outstanding read-miss requests

Independent of prefetch, UltraSPARC-II can also support multiple outstanding miss requests from different internal sources. At any given time there may be any combination of one instruction-fetch miss, one store miss, and up to three load/prefetch misses outstanding to the system. However, the total number of outstanding read requests is limited to three. Also note that this applies to cacheable requests and Block-Reads only; non-cacheable (single) read misses are still limited to one outstanding, and this one non-cacheable request is mutually exclusive with any cacheable read misses.

More precisely, UltraSPARC-II can have at most three outstanding class-0 requests, composed of:

1. Up to three P_RDS or P_RDO requests.
2. At most one P_RDD or P_NCBRD request. (This limit of one is due to internal load-buffer restrictions, and is not fundamental to the ECU or the UPA interface.)
3. At most one P_RDSA request. (Internal restriction.)

As noted above, these outstanding class-0 requests are mutually exclusive with a P_NCRD request, which is class-1.
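The prefetch-variant table and the outstanding-request limits can be restated as two small checks. This is a sketch for illustration, not a description of the hardware implementation. The request names and limits are taken from the text above; the function names, the use of None for a non-resident line, and the reading that P_RDD and P_NCBRD share a single slot are assumptions.

```python
# Request names come from the text above; the function and constant names
# are hypothetical and exist only for this illustration.
CLASS0_LIMIT = 3                       # at most three class-0 requests in flight
ONE_AT_A_TIME = {"P_RDD", "P_NCBRD"}   # assumed to share one load-buffer slot
CLASS0 = {"P_RDS", "P_RDO", "P_RDSA"} | ONE_AT_A_TIME


def prefetch_action(fcn, line_state):
    """Map a PREFETCH fcn value to the action listed in the table.

    line_state is None when the line is not Ecache-resident, otherwise a
    state name. Only "M" and "E" are named in the text; any other string
    is treated as resident but not writable (an assumption).
    """
    if not 0 <= fcn <= 31:
        raise ValueError("fcn must be in the range 0-31")
    if fcn in (0, 1, 4):               # read variants and prefetch page
        return "P_RDS_REQ" if line_state is None else None
    if fcn in (2, 3):                  # write variants
        return None if line_state in ("M", "E") else "P_RDO_REQ"
    if fcn <= 15:                      # 5-15 are reserved
        return "illegal_instruction trap"
    return None                        # 16-31 behave as NOPs


def can_issue(outstanding, request):
    """Return True if `request` may be added to the `outstanding` list."""
    if request == "P_NCRD":
        # Class-1: limited to one, and exclusive with any class-0 request.
        return not outstanding
    if request not in CLASS0:
        raise ValueError(f"unknown request: {request}")
    if "P_NCRD" in outstanding or len(outstanding) >= CLASS0_LIMIT:
        return False
    if request in ONE_AT_A_TIME:
        return not any(r in ONE_AT_A_TIME for r in outstanding)
    if request == "P_RDSA":
        return "P_RDSA" not in outstanding
    return True
```

For example, `can_issue(["P_RDS", "P_RDO"], "P_RDS")` is allowed because only two class-0 requests are in flight, while `can_issue(["P_RDS"], "P_NCRD")` is rejected because the non-cacheable read is exclusive with cacheable read misses.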