ULTRASPARC-III: Designing Third-Generation 64-Bit Performance

Every decision has at least one associated trade-off. System architects ultimately arrived at this 64-bit design after a challenging series of decisions and trade-offs.

Tim Horel and Gary Lauterbach
Sun Microsystems

The UltraSPARC-III is the third generation of Sun's most powerful microprocessors, which are at the heart of Sun's computer systems. These systems, ranging from desktops to large, mission-critical servers, require the highest performance that the UltraSPARC line has to offer. The newest design gives vendors the scalability to build systems consisting of 1,000 or more UltraSPARC processors. Furthermore, the design ensures compatibility with all existing SPARC applications and the Solaris operating system.

The UltraSPARC-III design extends Sun's SPARC Version 9 architecture, a 64-bit extension to the original 32-bit SPARC architecture that traces its roots to the Berkeley RISC-I processor.1 Table 1 lists salient pipeline and physical attributes. The UltraSPARC-III design target is a 600-MHz, 70-watt, 19-mm die to be built in 0.25-micron CMOS with six metal layers for signals, clocks, and power.

Architecture design goals
In defining the newest microprocessor's architecture, we began with a set of four high-level goals for the systems that would use the UltraSPARC-III processor. These goals were shaped by team members from marketing, engineering, management, and operations in Sun's processor and system groups.

Compatibility
With more than 10,000 third-party applications available for SPARC processors, compatibility—with both the application and the operating system—is an essential goal and primary feature of any new SPARC processor. Previous SPARC generations required a corresponding new operating system release to accommodate changes in the privileged interface registers, which are visible to the operating system. This in turn required all third-party applications to be qualified on the new operating system before they could run on the new processor. Maintaining the same privileged register interface in all generations eliminated the delay inherent in releasing a new operating system.

Part of our compatibility goal included increasing application program performance—without having to recompile the application—by more than 90%. Furthermore, this benefit had to apply to all applications, not just those that might be a good match for the new architecture. This goal demanded a sizable microarchitecture performance increase while maintaining the programmer-visible characteristics (such as the number of functional units and latencies) of previous generations of pipelines.


Table 1. UltraSPARC-III pipeline and physical data.

Pipeline feature        Parameter
Instruction issue       4 integer, 2 floating-point, 2 graphics
Level-one (L1) caches   Data: 64-Kbyte, 4-way; instruction: 32-Kbyte, 4-way; prefetch: 2-Kbyte, 4-way; write: 2-Kbyte, 4-way
Level-two (L2) cache    Unified (data and instruction): 4- and 8-Mbyte, 1-way; on-chip tags, off-chip data

Physical feature        Parameter
Process                 0.25-micron CMOS, 6 metal layers
Clock                   600+ MHz
Die size                360 mm²
Power                   70 watts @ 1.8 volts
Transistor count        RAM: 12 million; logic: 4 million
Package                 1,200-pin LGA

Performance
To design high performance into the UltraSPARC-III, we believed we needed—and have—a unique approach. Recent research shows that the trend for system architects is to design ways of extracting more instruction-level parallelism (ILP) from programs. In considering many aggressive ILP extraction techniques for the UltraSPARC-III, we discovered that they share a common undesirable characteristic: the speedup varies greatly across a set of programs. Relying on ILP techniques for most of the processor's performance increase would not deliver the desired performance boost. ILP techniques vary greatly from program to program because many programs or program sections use algorithms that are serially data dependent. Figure 1 shows an example of a serially data-dependent algorithm—a simple search for the end of a linked-list data structure:

    Loop:  ld  [r0], r0
           tst r0
           bne Loop

[Figure 1. A serially data-dependent algorithm example; three iterations of the loop are shown overlapped in time.]

In a high-performance processor such as the UltraSPARC-III, several iterations of the loop can execute concurrently; Figure 1 shows three iterations overlapped in time. The time these three iterations take to execute depends on the latency of the load instruction. If the load executes with a single-cycle latency, the maximum overlap occurs, and the processor can execute three instructions each cycle. As the load latency increases, the amount of ILP overlap decreases, as Table 2 shows.

Table 2. Load latency increases as ILP decreases.

ILP (instructions per cycle)    Load latency (cycles)
0.75                            4
1.00                            3
1.50                            2
3.00                            1

Many ILP studies have assumed latencies of one cycle for all instructions, which can cause misleading results. In a nonacademic machine, the load instruction latency is not a constant but depends on the memory system's cache hit rates, resulting in a fractional average latency. The connection between ILP (or achieved ILP, commonly reported as instructions per cycle—IPC, or 1/CPI) and operation latency makes these units cumbersome to analyze for determining processor performance.

One design consideration was an average latency measurement of a data dependency chain (ending at a branch instruction) for the SPEC95 integer suite. The measurement was revealing: the average dependent chain in SPEC95 consisted of a serially data-dependent chain with one and a half arithmetic or logical operations and, on average, half of a load instruction, ending with a branch instruction. A simplified view is that SPEC95 integer code is dominated by load-test-branch data dependency chains.

We realized that keeping the execution latency of these short dependency chains low would significantly affect the UltraSPARC-III's performance. Execution latency is another way to view the clock rate's profound design influence on performance: as the clock rate scales, all the bandwidths (in operations per unit time) and latencies (in time per operation) of a processor scale proportionately.
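The same pointer chase, rendered in C (our illustration; the compiled inner loop reduces to roughly the ld/tst/bne sequence of Figure 1):

    #include <stddef.h>

    struct node { struct node *next; };

    /* Walk to the end of a linked list. Each iteration's load depends
       on the previous iteration's result, so wider issue cannot overlap
       iterations; throughput is bounded by the load-use latency. */
    static size_t list_length(const struct node *p)
    {
        size_t n = 0;
        while (p != NULL) {
            p = p->next;   /* serially data-dependent load */
            n++;
        }
        return n;
    }

With a one-cycle load, the three-instruction loop body sustains three instructions per cycle; with a four-cycle load it falls to 0.75—exactly the relation in Table 2.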

Bandwidth (ILP) alone cannot provide a speedup for all programs; only by scaling both bandwidth and latency can performance be boosted for all programs.

Our focus thus became to scale up the bandwidths while simultaneously reducing latencies. This goal should not be viewed simply as raising the clock rate. It's possible to raise the clock rate simply by pipelining the stages more deeply, but at the expense of increased latencies. Each time we insert a pipeline stage, we incur an additional increment of clocking overhead (flop delay, clock skew, clock jitter). This forces less actual work to be done per cycle, leading to increased latency in absolute nanoseconds. Our goal was to push up the clock rate while at the same time scaling down the execution latencies (in absolute nanoseconds).

Scalability
The UltraSPARC-III is the newest generation of processors that will be based on the design we describe in this article. We designed this processor so that, as process technology evolves, it can realize the full potential of future semiconductor processes. Scalability was therefore of major importance. As an example, research at Sun Labs indicates that propagation delay in wiring will pose increasing problems as process geometries decrease.2 We thus focused on eliminating as many long wires as possible in the architecture. Any remaining long wires are on paths that allowed cycles to be added with minimum performance impact.

Scalability also required designing the on-chip memory system and the bus interface to handle multiprocessor systems built with from 2 to 1,000 UltraSPARC-III processors.

Reliability
A large number of UltraSPARC-III processors will be used in systems such as transaction servers, file servers, and compute servers. These mission-critical systems require a high level of reliability, availability, and serviceability (RAS) to maximize system uptime and minimize the time to repair when a failure does occur.

One of our goal requirements was to detect as many error conditions as possible. In addition, we adopted three more guidelines to improve system RAS:

• Don't allow bad data to propagate silently. For example, when the processor sourcing data on a copy-back operation detects an uncorrectable cache ECC error, it poisons the outgoing data with a unique, uncorrectable ECC syndrome. Any other processor in a multiprocessor system will thus get an error if it touches the data. The sourcing processor also takes a trap when the copy-back error is detected, to fulfill the next guideline.
• Identify the source of the error. To minimize downtime of large multiprocessor systems, the failing replaceable unit must be correctly identified so that a field technician can quickly swap it out. This requires correctly identifying the error's source.
• Detect errors as soon as possible. If errors are not quickly detected, identifying their true source can become difficult at best.

Major architectural units
The processor's microarchitecture has six major functional units that operate relatively independently. The units communicate requests and results among themselves through well-defined interface protocols, as Figure 2 shows.

Instruction issue unit
This unit feeds the execution pipelines with instructions. It independently predicts the control flow through a program and fetches the predicted path from the memory system. Fetched instructions are staged in a queue before forwarding to the two execution units: integer and floating point. The IIU includes a 32-Kbyte, four-way associative instruction cache, the instruction address translation buffer, and a 16K-entry branch predictor.

Integer execute unit
This unit executes all integer data type instructions: loads, stores, arithmetics, logicals, shifts, and branches. Four independent data paths enable up to four integer instructions to be executed per cycle. The allowable per-cycle integer instruction mix is as follows:

• two from (arithmetic, logical, shift)—the A0 and A1 pipelines;
• one from (load, store)—the MS pipeline; and
• one from (branch)—the BR pipeline.
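A minimal sketch of these grouping rules (our illustration; the enum and function are hypothetical, not Sun's steering logic):

    #include <stdbool.h>

    enum iclass { ALU, SHIFT, LOAD, STORE, BRANCH };

    /* Check whether a candidate group obeys the per-cycle mix above:
       at most four instructions, at most two A0/A1 ops, one memory
       op, and one branch. */
    static bool issues_in_one_cycle(const enum iclass g[], int n)
    {
        int alu = 0, mem = 0, br = 0;
        if (n > 4)
            return false;
        for (int i = 0; i < n; i++) {
            if (g[i] == ALU || g[i] == SHIFT)       alu++;  /* A0/A1 */
            else if (g[i] == LOAD || g[i] == STORE) mem++;  /* MS    */
            else                                    br++;   /* BR    */
        }
        return alu <= 2 && mem <= 1 && br <= 1;
    }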


[Figure 2. Communication paths between the UltraSPARC-III's six major functional units. The instruction issue unit (IIU)—instruction cache, instruction queue, and steering logic—delivers four instructions per cycle to the integer execution unit (IEU) and the floating-point unit (FPU). The IEU contains the working and architectural register file (WARF), dependency/trap logic, ALU pipes 0 and 1, and the load/store/special pipe; the FPU contains the FP multiply, FP add/subtract, and FP divide data paths, the graphics unit, and the floating-point register file (FPRF). The data cache unit (DCU) holds the data, prefetch, and write caches and the store queue. The external memory unit (EMU)—external cache SRAM controller and SDRAM controller—connects to the off-chip L2 cache SRAMs and main memory; the system interface unit (SIU)—snoop pipe and data switch—connects to the system interconnect.]

The load/store pipeline also executes floating-point data type memory instructions. A second floating-point load can be issued to either the A0 or the A1 pipeline; we describe this instruction in more detail in the prefetch cache discussion, later in the on-chip memory section.

Data cache unit (on-chip memory system)
The data cache unit comprises the level-one (L1) on-chip cache memories and the data address translation buffer, as Figure 3 shows. There are three first-level, on-chip data caches: data (64-Kbyte, four-way associative, 32-byte line), prefetch (2-Kbyte, four-way associative, 64-byte line), and write (2-Kbyte, four-way associative, 64-byte line).

[Figure 3. Data cache unit block diagram. The memory store pipe and the A0/A1 integer execution pipes access the L1 data cache, prefetch cache, write cache, and store queue; the store queue and write cache connect to the off-chip L2 cache. Cache latencies and bandwidths:

Cache          Latency      Bandwidth
L1 data        2 cycles     9.6 Gbytes/s
L1 prefetch    3 cycles     18.4 Gbytes/s
L1 write       1 cycle      13.6 Gbytes/s
L2 external    12 cycles    6.4 Gbytes/s]
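For concreteness, the 64-Kbyte, four-way, 32-byte-line data cache implies 512 sets. A sketch of the index/tag split (ours; it ignores the sum-addressed-memory fusion described later):

    #include <stdint.h>

    enum { LINE = 32, WAYS = 4, SIZE = 64 * 1024,
           SETS = SIZE / (WAYS * LINE) };          /* 512 sets */

    static unsigned dc_set(uint64_t va) { return (va / LINE) % SETS; }
    static uint64_t dc_tag(uint64_t va) { return va / (LINE * SETS); }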

76 IEEE MICRO t e g r n n n n n n a o o o o o o t i r i i i i i s

y e t t g h t t t t d r e t e r r e s s i s h t a t a c n c c c c a e d

l i r t r e o d t s e e s c s u u u u u e r n i e k r u e e g s r r r o r a h r a h e n c w r f e i m p t t t t e e c g

t n

d n c e c l c c p t a e o e p g s s s t e s t a e e e a o e d r i e r e c y x x a l a l t r e n n n i n i n e A g I p I B f c I d R f I e c a D M b P e W T D r f I s A P F B I J R E C M W X T D

Instruction Branch pipeline Branch pipe issue logic A0 arith log A0 integer shift execution pipe Instruction Instruction g W n i predecode decode r Decode e A1 e Instruction t A1 integer

S arith log A cache execution pipe (32-Kbyte, W shift x Branch u 4-way, A M target add R 32-byte line) R Queue F Data cache F Instruction (64-Kbyte, 4-way) translation buffer Store queue Write cache (2-Kbyte, Memory Branch 4-way) store pipe predictor Data translation (16K buffer entry) Prefetch cache

y (2-Kbyte, 4-way) d c r n a e o

d integer multiply b n e r

e ASU divide o p c e S d

Floating-point FP divide/square root register file Floating-point/ graphics multiplier WARF Working and architectural register file Floating-point/ FP Floating point Bypass graphics ASU Arithmetic special unit networkk

Figure 4. The UltraSPARC-III microprocessor instruction pipeline. buffer, as Figure 3 shows. There are three first- External memory unit level, on-chip data caches: data—64-Kbyte, This unit controls the two off-chip memo- four-way associative, 32-byte line; prefetch— ry structures: the level-two (L2) data cache built 2-Kbyte, four-way associative, 64-byte line; with off-chip synchronous RAMs (SRAMs), and write—2-Kbyte, four-way associative, 64- and the main memory system built with off- byte line. chip synchronous DRAMs (SDRAMs). The L2 cache controller includes a 90-Kbyte Floating-point unit on-chip tag RAM to support L2 cache sizes up This unit contains the data paths and con- to 8 Mbytes. The main can trol logic to execute all floating-point and par- support up to four banks of SDRAM memo- titioned fixed-point data type instructions. ry totaling 4Gbytes of storage. Three data paths concurrently execute float- ing-point or graphic (partitioned fixed-point) System interface unit instructions, one each per cycle from the fol- This unit handles external communication lowing classes: to other processors, memory systems, and I/O devices. The unit can handle up to 15 out- • Divide/multiply (single or double preci- standing transactions to external devices, with sion or partitioned), support for full out-of-order data delivery on • Add/subtract/compare (single or double each transaction. precision or partitioned), and • An independent division data path, Instruction pipeline which lets a nonpipelined divide proceed To meet our clock rate and performance concurrently with the fully pipelined goals, we concluded that we needed a deep multiply and add data paths. pipeline. The UltraSPARC-III 14-stage pipeline, as Figure 4 shows, has more stages


The UltraSPARC-III's 14-stage pipeline, as Figure 4 shows, has more stages than any previous UltraSPARC pipeline. The extra pipeline stages must be added in pipeline paths that are infrequently used—for example, trap generation.

Each pipeline stage performs part of the work necessary to execute an instruction, as the box "How pipeline stages work in the UltraSPARC-III" explains. The instruction issue unit occupies the A through J stages of the pipeline, and the integer execution unit accounts for the R through D stages. The data cache unit occupies the E, C, M, and W stages in parallel with the integer execution unit stages. The floating-point unit is a side pipeline that parallels the E through D stages of the integer pipeline. The machine's other units (the system interface unit and external memory unit) have internal pipelines but are not considered part of the core processor pipe.

How pipeline stages work in the UltraSPARC-III

Stage  Function
A      Generate instruction fetch addresses; generate predecoded instruction bits on cache fill
P      Fetch first cycle of instructions from cache; access first cycle of branch prediction
F      Fetch second cycle of instructions from cache; access second cycle of branch prediction; translate virtual to physical address
B      Calculate branch target addresses; decode first cycle of instructions
I      Decode second cycle of instructions; enqueue instructions into the queue
J      Steer instructions to execution units
R      Read integer register file operands; check operand dependencies
E      Execute integer arithmetic, logical, and shift instructions; first cycle of data cache access; read, and check dependency of, floating-point register file
C      Access second cycle of data cache, and forward load data for word and doubleword loads; execute first cycle of floating-point instructions
M      Align load data for half-word and byte loads; execute second cycle of floating-point instructions
W      Write speculative integer register file; execute third cycle of floating-point instructions
X      Extend integer pipeline for precise floating-point traps; execute fourth cycle of floating-point instructions
T      Report traps
D      Write architectural register file

We determined the processor's pipeline depth early in the design process by analyzing several basic paths. We selected the integer execution path to set the machine cycle time so that we would have minimum latency for the basic integer operation. Using an aggressive dynamic adder for this stage resulted in our setting the number of logic gate levels per stage to approximately eight—the exact number depends on circuit style.

Early analysis also showed that with eight gate delays per stage (counting a three-input NAND with a fan-out of three as one gate delay), the overhead due to synchronous clocking (flip-flop delay, clock skew, jitter, and so on) would consume about 30% of the cycle. If we tried to pipeline the integer execution over two cycles (commonly called superpipelining), the second 30% clocking overhead would significantly increase latency; as a result, performance would decline in some applications. The on-chip cache memories are pipelined across two stages, but they don't suffer the additional clock overhead because we used a wave-pipeline circuit design.

Another known critical path from previous SPARC designs is the global pipe stall signal, which freezes the flip-flops when an unexpected event, such as a data cache miss, occurs. This freeze signal is dominated by wire delay that we knew would have technology scaling problems, so we decided to eliminate it completely by using a nonstalling pipeline. Since the pipeline state couldn't be frozen, we had to use a repair mechanism to restore the state when an unexpected event occurs. It's handled like a trap: the pipeline is allowed to drain, and its state is restored by refetching the instructions that were in the pipeline, starting at the A stage.

One concern with a deep pipeline is the cost of branch misprediction. When a branch is mispredicted, instructions must be refetched starting at the A stage, incurring a penalty of eight cycles (A through E stages).
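A back-of-the-envelope model of that cost (our own illustration, not from the design team):

    /* Contribution of refetch penalties to cycles per instruction.
       A full A-to-E refetch costs 8 cycles; when the I-stage miss
       queue covers a predicted-taken branch that was actually not
       taken, the cost is roughly halved (see below). */
    static double refetch_cpi(double mispredicts_per_instr,
                              double miss_queue_coverage)
    {
        const double full = 8.0, halved = 4.0;
        return mispredicts_per_instr *
               (miss_queue_coverage * halved +
                (1.0 - miss_queue_coverage) * full);
    }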
With recent improvements in branch prediction, a processor incurs this penalty much less frequently, allowing the pipeline to be longer at only a small performance cost. In addition, we designed a small amount of alternate-path buffering into the I stage (the miss queue). If a predicted-taken branch mispredicts (is actually not taken), a few instructions are thus immediately available to start in the I stage. This effectively halves the branch misprediction penalty. Pipeline stages after the M stage impact performance whenever the pipeline must be drained.

We overlapped the new fetch (for a trap target or a refetch) with the draining of the back of the pipeline. By doing so, we could add stages to the back of the pipe as long as we could guarantee that the pipe results were drained before new instructions reached the R stage. To ease the implementation of precise exceptions, we added back-end pipe stages up to this limit.

We pushed the floating-point execution pipeline back by one cycle relative to the integer execution pipe. This allows the floating-point unit extra time for wire delays. We had to keep the machine's latency-sensitive integer part physically small to minimize wire delays, and moving the floating-point unit away from the integer core was a major step toward achieving this goal.

[Figure 5. Instruction issue unit (IIU) block diagram: the instruction translation look-aside buffer, 32-Kbyte instruction cache, branch predictor, and return address stack feed the instruction queue and miss queue, which issue four instructions per cycle.]

Instruction issue unit
Experience with previous UltraSPARC pipelines showed our design teams that many critical-timing paths occurred in the instruction issue unit. Consequently, we knew we had to pay particular attention to designing this part of the processor. Our decision to keep the UltraSPARC-III a static speculation machine, compatible with the previous pipelines, paid off in the IIU design. Dynamic speculation machines require very high fetch bandwidths to fill an instruction window and find instruction-level parallelism. In a static speculation machine, the compiler can make the speculated path sequential, placing fewer requirements on the instruction fetch unit. We used this static speculation advantage to simplify the fetch unit and minimize the number of critical timing paths. Figure 5 illustrates the IIU's different blocks.

The pipeline's A stage corresponds to the address lines entering the instruction cache. All fetch address generation and selection occurs in this pipe stage. Also at the A stage, a small, 32-byte buffer supports sequential prefetching into the instruction cache. When the instruction cache misses, the cache line requires 32 bytes. But instead of requesting only the 32 bytes needed for the cache, the processor issues a 64-byte request: the first 32 bytes fill the cache line, and the second 32 bytes are stored in the buffer. The buffer can then fill the cache if the next sequential cache line also misses.

We distributed the instruction cache access over two cycles (the P and F pipeline stages) by using a wave-pipelined SRAM design. In this design, the cache is pipelined without latches or flip-flops; careful circuit design ensures that the data waves in successive cycles do not overtake each other.3 In parallel with the cache access, this design also allows branch predictor and instruction address translation buffer access. By the time the instructions are available from the cache in the B stage, we also have the physical address from the translator and a prediction for any branch that was fetched. The processor uses all this information in the B stage to determine whether to follow the sequential or the taken-branch path. The processor also determines whether the instruction cache access was a hit or a miss. If the processor predicts a taken branch in the B stage, it sends the branch's target address back to the A stage to redirect the fetch stream.

Waiting until the B stage to redirect the fetch stream lets us use a large, accurate branch predictor. We minimized the wait penalty for branch targets through compiler static speculation and the instruction buffering queues. The branch predictor uses a Gshare algorithm4 with 16K 2-bit saturating up/down counters. Since the predictor is large, it needed to be pipelined across two stages. In the original Gshare scheme, this would require indexing the predictor with an old, inaccurate copy of the program counter (PC). We modified the scheme by offsetting the history bits such that the three low-order index bits into the predictor use PC information only. Each time the predictor is accessed, eight counters are read out.
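A sketch of this modified indexing (ours; the hashing of the upper bits and the exact widths are assumptions beyond the stated 16K two-bit counters):

    #include <stdint.h>

    static uint8_t pht[16 * 1024];   /* 2-bit counters, one per entry */

    /* 14 index bits total: the low 3 come from the PC alone, so each
       access reads a row of 8 adjacent counters; only the upper 11
       bits mix in the global history. */
    static unsigned predictor_index(uint32_t pc, uint16_t ghist)
    {
        uint32_t word  = pc >> 2;                        /* instruction index */
        unsigned upper = ((word >> 3) ^ ghist) & 0x7ff;  /* history-hashed    */
        return (upper << 3) | (word & 0x7);              /* PC-only low bits  */
    }

    static int  predict_taken(unsigned i) { return pht[i] >= 2; }
    static void train(unsigned i, int taken)
    {
        if (taken) { if (pht[i] < 3) pht[i]++; }
        else       { if (pht[i] > 0) pht[i]--; }
    }

The row of eight counters read in the P and F stages is pht[i & ~7u] through pht[i | 7u]; the final selection happens later, as described next.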


Later, one of the eight counters is selected (using the three low-order PC bits) in the pipeline's B stage, after the exact position of the first branch in the fetch group is known. Simulations showed that not XORing the global-history bits with the low-order PC bits does not affect the branch predictor's performance. The three-cycle loop through the A, P, and F stages for taken branches lets us keep the global-history register at the predictor's input in an exact state. The register is exact because there can be only one taken branch every three cycles.

We designed two instruction buffering queues into the UltraSPARC-III: the instruction queue and the miss queue. The 20-entry instruction queue decouples the fetch unit from the execution units, allowing each to proceed at its own rate. The fetch unit is allowed to work ahead, predicting the execution path and stuffing instructions into the instruction queue until it's full. When the fetch unit encounters a taken-branch delay, we lose two fetch cycles to fill the instruction queue. Our simulations show, however, that there are usually enough instructions already buffered in the instruction queue to occupy the execution units. This two-cycle delay also gives us the opportunity to buffer the sequential instructions that have already been accessed into the four-entry miss queue. If we then find that we mispredicted the taken branch, the instructions from the miss queue are immediately available to send to the execution units.

The last two stages of the instruction issue unit decode the instruction type and steer each instruction to the appropriate execution unit. These two functions must be done in separate pipeline stages to achieve the cycle time goal.

Integer execute unit
We guarantee the minimum logical latency for the most frequent instructions by setting the processor cycle time with the integer execute stage. However, the amount of work we try to fit in one execute-stage cycle varies over a wide range, so we used several techniques to minimize the cycle time of the E stage. We applied the most aggressive circuit techniques available to its design: the entire integer data path uses dynamic precharge circuits. We had to carefully design the physical data path to minimize wiring lengths; wire delay causes more than 25% of the delay in this stage.

This level of design cannot be applied to the entire processor. It was vital that the microarchitecture clearly showed where this sort of design investment would pay off in performance.

We extended the future file method5 to help achieve a short cycle time. The working and architectural register file (WARF) let us remove the result bypass buses from most of the integer execution pipeline stages. Without bypass buses, we could shorten the integer data path and narrow the bypass multiplexing; both contribute to a short cycle time.

The WARF can be regarded as two separate register files. The processor accesses the working register file in the pipeline's R stage to supply integer operands to the execution unit. The file is also written with integer results as soon as they become available from the execution pipeline. Most integer operations complete in one cycle, with results immediately written into the working register file in the pipeline's C stage. If an exceptional event occurs, the immediately written results must be undone. Undoing results is accomplished with a broadside copy of all integer registers from the architectural register file back into the working register file. By placing the architectural register file at the end of the pipe, we can ensure that we do not commit results into it until we have resolved all exceptional conditions. Copying the architectural register file back into the working register file gives us a simple, fast way to repair the pipeline state when exceptions do occur.
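In outline (a minimal sketch of the idea, not Sun's logic), the WARF behaves like this:

    enum { NREGS = 32 };   /* illustrative size; ignores register windows */

    struct warf {
        unsigned long long work[NREGS];  /* read in R, written speculatively in C */
        unsigned long long arch[NREGS];  /* committed in D, after traps resolve   */
    };

    /* Commit one result at the end of the pipe. */
    static void commit(struct warf *w, int r) { w->arch[r] = w->work[r]; }

    /* On an exceptional event: drain the pipe, then broadside-copy the
       architectural state back over the speculative state. */
    static void repair(struct warf *w)
    {
        for (int r = 0; r < NREGS; r++)
            w->work[r] = w->arch[r];
    }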
The state copying of the WARF also offers a simple mechanism for implementing the SPARC architecture's register windows. The architectural register file contains a full eight windows' worth of integer registers; when a window must be changed, a broadside copy of the new window is made into the working register file.

We moved the data path for the least frequently executed integer instructions to a separate location to further unburden the core integer execution pipeline from extra wiring. Nonpipelined instructions such as integer divide execute in this data path, called the arithmetic/special unit (ASU). We decoupled the ASU from the machine cycle time by dedicating a full cycle each way to getting operands to and from the unit.

On-chip memory system
The memory system's influence on performance becomes increasingly dominant as processor performance and clock rates increase. For this reason, the on-chip memory system was crucial to delivering the UltraSPARC-III's performance and scalability goals. In designing it, we followed this principle: achieve uniform performance scaling by scaling both bandwidth and latency. A popular architectural trend is to try to hide the memory latency scaling problem by using program ILP. Not surprisingly, the hiding is not free: programs with low ILP suffer a performance hit, and ILP that could have sped up program execution is wasted on "hiding" the lagging memory system. Table 3 summarizes the UltraSPARC-III on-chip memory system.

Table 3. UltraSPARC-III's on-chip memory system parameters.

Cache        Size (Kbytes)  Associativity                  Line length (bytes)  Protocol
Instruction  32             4-way, microtag pseudorandom   32                   Store coherent
Data         64             4-way, microtag pseudorandom   32                   Write-through
Write        2              4-way, LRU                     64                   Write-validate
Prefetch     2              4-way, LRU                     64                   Store coherent

The key to scaling memory latency in the UltraSPARC-III is the first-level, sum-addressed memory data cache.6 Fusing the memory address adder with the data cache's word-line decoder largely eliminates the address adder's latency. This let us increase the data cache size to completely occupy the time available in two processor cycles. The combination of an increased cache size and a scaled clock rate, while maintaining a two-cycle access, gives us a linear memory latency improvement. We can demonstrate this linear improvement with the following calculation of overall memory latency:

average memory latency = L1 hit time + (L1 miss rate × L1 miss cost) + (L2 miss rate × L2 miss cost)

Table 4 shows latency trade-offs, computed with some representative values from simulations of the SPEC integer benchmarks, comparing the UltraSPARC-II and the UltraSPARC-III.

Table 4. Memory latency trade-offs. US-II is the UltraSPARC-II; US-III is the UltraSPARC-III. SAM is sum-addressed memory. All L2 caches are 4-Mbyte, direct-mapped.

Processor  Clock    L1 data cache          L1 latency  Load-use penalty  L1 miss rate        L1 miss    L2 miss rate        L2 miss    Average memory
                                           (cycles)    (cycles)          (per load instr.)   cost (ns)  (per load instr.)   cost (ns)  latency (ns)
US-II      300 MHz  16-Kbyte, 1-way        2           1                 0.10                30         0.01                150        11.16
—          600 MHz  16-Kbyte, 1-way        2           1                 0.10                20         0.01                100        6.33
—          600 MHz  64-Kbyte, 4-way        3           2                 0.04                20         0.01                100        6.80
US-III     600 MHz  64-Kbyte, 4-way, SAM   2           1                 0.04                20         0.01                100        5.13 (estimated for design purposes)
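The Table 4 entries follow directly from the formula above; a quick check (our own, in C):

    #include <stdio.h>

    static double avg_latency_ns(double clk_mhz, double hit_cycles,
                                 double l1_miss_rate, double l1_miss_ns,
                                 double l2_miss_rate, double l2_miss_ns)
    {
        double cycle_ns = 1000.0 / clk_mhz;
        return hit_cycles * cycle_ns
             + l1_miss_rate * l1_miss_ns
             + l2_miss_rate * l2_miss_ns;
    }

    int main(void)
    {
        /* US-II, 300 MHz: 2 x 3.33 + 0.10 x 30 + 0.01 x 150 ~= 11.2 ns */
        printf("US-II  %.2f ns\n", avg_latency_ns(300, 2, 0.10, 30, 0.01, 150));
        /* US-III, 600 MHz SAM: 2 x 1.67 + 0.04 x 20 + 0.01 x 100 = 5.13 ns */
        printf("US-III %.2f ns\n", avg_latency_ns(600, 2, 0.04, 20, 0.01, 100));
        return 0;
    }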


Comparing the 300-MHz UltraSPARC-II with the 600-MHz UltraSPARC-III shows that we were able to scale the average memory latency by more than the clock ratio. We achieved this result through the sum-addressed memory (SAM) cache and through improvements in the L2 cache and memory latencies.6 The alternative approaches could not keep up with the clock rate scaling, even with the L2 cache and memory latency reductions.

The SAM cache works well for programs having reasonable cache hit rates, but we wanted the performance of all programs to scale. For programs dominated by main memory latency, we use two techniques: a prefetch cache accessed in parallel with the L1 data cache, and a low-latency, on-chip memory controller (described later). Analysis showed that many programs dominated by main memory latency share a common characteristic: the memory data can be prefetched well before the execution units need it.

By issuing up to eight in-flight prefetches to main memory, the prefetch cache enables a program to utilize 100% of the available main memory bandwidth without incurring a slowdown due to the main memory latency. The prefetch cache is a 2-Kbyte SRAM organized as 32 entries of 64 bytes, four-way associative with an LRU replacement policy. A multiport SRAM design let us achieve very high throughput: data can be streamed through the prefetch cache in a manner similar to stream buffers.7,8 On every cycle, each of two independent read ports supplies 8 bytes of data to the pipeline while a third write port fills the cache with 16 bytes.

Other microprocessors, such as the UltraSPARC-II, implement prefetch instructions. Our simulations, however, show that prefetching's full benefit is not realized without the high-bandwidth streaming afforded by the prefetch cache's three ports. We also included an autonomous stride prefetch engine that tracks the program counters of load instructions and detects when a load instruction is striding through memory. When the prefetch engine detects a striding load, it issues a hardware prefetch independent of any software prefetch. This allows the prefetch cache to be effective even on code that does not include prefetch instructions.

Our next challenge was to scale the on-chip memory bandwidths. We solved this largely with two techniques: wave-pipelined SRAM designs for the on-chip caches, and a write cache for store traffic. Wave-pipelining the caches let us decouple the on-chip memory bandwidth from the latency and independently optimize each characteristic.

Write-caching is an excellent way to reduce the bandwidth consumed by store traffic.9 In the UltraSPARC-III, we use a write cache to reduce the store traffic bandwidth to the off-chip L2 data cache. The write cache provides other benefits: by being the sole source of on-chip dirty data, it easily handles both multiprocessor and on-chip cache consistency. Error recovery also becomes easier, since the write cache keeps all other on-chip caches clean; they can simply be invalidated when an error is detected.

Sharing the 2-Kbyte SRAM design with the prefetch cache conserved our design effort. It was also practical: simulations showed that the write cache's write-back bandwidth was relatively insensitive to its size once it was larger than 512 bytes. The bandwidth reduction at 2 Kbytes was equivalent to the store traffic from a write-back, 64-Kbyte, four-way associative data cache. Over 90% of the time, the write cache can merge a store into an existing dirty write-cache line.

We use a byte-validate policy on the write cache. Rather than reading the data from the L2 cache for the bytes within the line that are not being overwritten, we just keep an individual valid bit for each byte. Not performing the read-on-allocate saves considerable L2 cache bandwidth by postponing a read-modify-write until the write cache evicts a line. Frequently, the entire line has been written by eviction time, so the write cache can eliminate the read altogether. We include the write cache in the L2 data cache, and write-cache data can supersede read data from the L2 data cache. We handle this with a byte-merging multiplexer on the incoming L2 cache data bus that can choose either write-cache data or L2 cache data for each byte.

The last benefit of the write cache is in implementing the V9 memory ordering rules. The V9 architecture specifies a memory total store ordering that simplifies the writing of high-performance multiprocessor programs. This model requires that store operations be made visible to all other processors in a multiprocessor system in the original program order. The write cache provides the point of global store visibility in UltraSPARC-III systems. Generally, keeping the requirements of stores (bandwidth, error correction, ordering, consistency) in a separate cache lets us independently optimize both parts of the on-chip memory system.

Floating-point unit
To meet the UltraSPARC-III's cycle time goals, we made a concession to latency scaling in the floating-point execution units. Early circuit analysis showed that by using advanced dynamic circuit design, we needed to add only one latency cycle to the floating-point add and multiply units. For numerical floating-point programs, the impact of the additional execution latency concerned us less, because previous UltraSPARC generations encouraged unrolled loops to be scheduled for the L2 cache latency, which was eight cycles. Since the previous pipelines had modulo-scheduled loops at a multiple of our new latencies, the code schedules would be compatible.

We scaled the floating-point divide latency (in absolute nanoseconds) by using a multiplicative iteration algorithm. Table 5 summarizes the characteristics of the UltraSPARC-III floating-point execution units and compares them with the UltraSPARC-II latencies.

Table 5. UltraSPARC-III (US-III) floating-point units compared with UltraSPARC-II (US-II) latencies.

Operation     600-MHz US-III latency (cycles, ns)  US-III issue rate  300-MHz US-II latency (cycles, ns)
Add/subtract  4, 6.66                              1 per cycle        3, 9.99
Multiply      4, 6.66                              1 per cycle        3, 9.99
Divide        20, 33.4                             1 per 17 cycles    22, 72.6
Square root   24, 40.0                             1 per 21 cycles    23, 76.0

External memory and system bus interface
The UltraSPARC-III external memory system includes a large L2 data cache and the main memory system. Integrating the control of both these external memory systems on chip was essential to achieving our performance, scalability, and reliability goals.

We built the L2 data cache with eight industry-standard, register-to-register, pipelined static memory chips cycling at one-third of the processor clock rate. The cache controller allows programmable support of 4 or 8 Mbytes of L2 cache. The L2 cache controller accesses the off-chip L2 cache SRAMs with a 12-cycle latency to supply a 32-byte cache line to the L1 caches. A 256-bit-wide data bus between the off-chip SRAMs and the microprocessor delivers the full 32 bytes of data needed for an L1 miss in a single SRAM cycle.

By placing the tags for the L2 cache on chip, we reduced the latency to main memory through early detection of L2 misses. On-chip tags also enable future derivative designs to build associative L2 caches without a latency penalty. The L2 cache controller accesses the on-chip tags in parallel with the start of the off-chip SRAM access and can provide a way-select signal to a late-select address pin on the off-chip data SRAMs. Dedicating every other cycle of the on-chip L2 cache tags to coherency snoops from other processors provides excellent coherency bandwidth, since the tag SRAM is wave-pipelined at the full 600-MHz target clock rate.

Moving the main memory DRAM controller on chip reduces memory latency, compared with the previous generation, and scales memory bandwidth with the number of processors. The memory controller supports up to 4 Gbytes of SDRAM organized as four independent banks. In a multiprocessor system, the SDRAM banks can be interleaved across the per-processor memory controllers. By sizing the SDRAM data bus to be the same as the coherency unit (512 bits), we minimize the latency to complete a data transfer from memory. This can have a significant performance effect, since misses from large caches tend to cluster, and contention for the memory bus from adjacent misses can impact performance. The memory controller has a peak 3.2-Gbyte/s transfer rate.

We maximized the bandwidth of the processor's interface to the system bus by allowing up to 15 outstanding bus transactions from each processor. The outstanding transactions can complete out of order with respect to the original request order.
This enables the memory banks in a multiprocessor system to service requests as soon as a bank is available. The processor's bus interface takes care of reordering the data delivery to the pipeline to meet the requirements of the SPARC V9 memory programming model.
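A simplified scoreboard for this behavior (hypothetical; it delivers strictly in request order, a conservative reading of the V9 requirements):

    #include <stdbool.h>
    #include <stdint.h>

    enum { MAX_OUT = 15 };   /* outstanding transactions per processor */

    struct txn { uint64_t addr; bool done; };

    struct bus_if {
        struct txn q[MAX_OUT];   /* circular FIFO in request order */
        int head, count;
    };

    /* Allocate the next slot at the tail when a request issues
       (assumes count < MAX_OUT). */
    static int request(struct bus_if *b, uint64_t addr)
    {
        int slot = (b->head + b->count) % MAX_OUT;
        b->q[slot] = (struct txn){ addr, false };
        b->count++;
        return slot;
    }

    /* Data may return for any slot, in any order. */
    static void data_return(struct bus_if *b, int slot) { b->q[slot].done = true; }

    /* But delivery to the pipeline drains from the head, in order. */
    static bool deliver_next(struct bus_if *b, uint64_t *addr)
    {
        if (b->count == 0 || !b->q[b->head].done)
            return false;
        *addr = b->q[b->head].addr;
        b->head = (b->head + 1) % MAX_OUT;
        b->count--;
        return true;
    }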


The system bus interface architecture was key to meeting the reliability goals we have stated. All processor interfaces use error detection and/or correction codes to detect errors as soon as possible. The processor performs error detection on every external chip-to-chip hop to correctly isolate any fault to its source. We also designed an 8-bit-wide "back-door" bus that runs independently of the main system bus. If the system bus has an error, each processor can boot up and run diagnostic programs over the back-door bus to diagnose the problem.

To enable multiprocessor systems based on the UltraSPARC-III to scale up to 1,000 processors, we included on-chip support for in-memory coherency directories. Checking ECC over a 144-bit memory word instead of 72 bits freed up 7 bits of each 144-bit memory word for use by the in-memory directory. The processor need only examine the directory state on each memory access to see whether an external agent must intervene to complete the access. Placing the directory in main memory allows it to expand automatically as memory is added to the system.

Physical design
Physical design grows more challenging with each new processor generation. As logic complexity grows, circuit counts and the related interconnection increase. With interconnections becoming increasingly dominant in design, the block designs must buffer their inputs and outputs, much like I/O cells for ASIC designs but without the level translation. As clock rates rise faster than the available speed increases in base CMOS technology, increasing individual gate complexity without increasing gate delay becomes more urgent. With clock rates, gate counts, and total wiring increasing, chip power increases, forcing additional changes in thermal management and electrical distribution schemes. Figure 6 shows a chip plot outlining the major functional units.

[Figure 6. Chip plot of the UltraSPARC-III's major functional units: IIU (instruction issue unit), IEU (instruction execute unit), FGU (floating-point unit), DCU (data cache unit), and SIU (system interface unit).]

The UltraSPARC-III is flip-chip (solder-bump) attached to a multilayered ceramic land grid array package having 750 I/O signals and 450 power bumps. The package has a new cap to mate with an air-cooled heat sink containing a heat-pipe structure to control the die temperature.

A continuous double-grid system on metal layers 5 and 6 distributes all this power. This paired grid reduces the power supply loop inductance on the die and provides return-current paths for long signal wiring. The grid elements in the sixth metal layer are strapped with what amount to bus bars on metal layer 7. This evens out the power supply resistance drops that would otherwise be seen by the blocks.

The heavy-duty grid concept extends to clock distribution as well. A single distributed clock tree contains a completely shielded grid to reduce both jitter-inducing injected noise and global skew. Each block also has its own shorted, shielded, and buffered clock, further reducing the blocks' local skews.

The circuit methodology we employed is primarily fully static CMOS, to simplify the verification requirements for much of the design. Only where speed requirements dictated higher performance did we use dynamic, or domino, designs. Also for ease of verification, we placed dynamic signals only within fully shielded custom cells. To improve the speed gained from the dynamic circuits without further increasing their power, we used an overlapping, multiphased, nonblocking clock scheme similar to that described by Klass.10

With clock rates continuing to increase, the part of the cycle time allocated to flip-flops comes under great pressure.
To address this, we designed a new edge-triggered flip-flop.11 This partially-static-output, dynamic-input flip-flop requires no setup time and has one of the lowest D-to-Q delays for its power and area in use today.

84 IEEE MICRO delay. The noise immunity increases by an 7. N. Jouppi, “Improving Direct-Mapped Cache input shutoff mechanism that reduces the Performance by the Addition of a Small Fully- effective sample time, allowing the design to Associative Cache and Prefetch Buffers,” be used as though it were fully static. Proc. 17th Ann. Int’l Symp. Computer To improve our ability to wire the proces- Architecture, IEEE Computer Soc. Press, sor globally, we used an area-based router. This Los Alamitos, Calif., 1990, pp. 364–373. enabled reuse of any block area not needed in 8. J.-L. Baer and T.-F. Chen, “An Effective On- the design’s lower level for additional top-level Chip Preloading Scheme to Reduce Data wiring. Cost functions for global routing based Access Penalty,” Proc. Supercomputing 91, on noise and timing analysis let us define a spe- IEEE Computer Soc. Press, 1991, pp. cific wire group’s width and spacing. Similar- 176–186. ly, signal routing within the blocks lets us 9. N. Jouppi, “Cache Write Policies and improve timing and noise margins. MICRO Performance,” Proc. 20th Ann. Int’l Symp. Computer Architecture, ACM Press, 1992, pp. 191–201. he UltraSPARC-III is the newest gener- 10. F. Klass, “A Non-Blocking Multiple-Phase Tation of 64-bit SPARC processors to be Clocking Scheme for Dynamic Logic,” IEEE used in a wide range of Sun systems. The chip Int’l Workshop on Clock Distribution will go into production in the fourth quarter Networks Design, Synthesis, and Analysis, of 1999. The architecture enables multi- IEEE Press, Piscataway, N.J., 1997. processor systems with more than 1,000 11. F. Klass, “Semi-dynamic and Dynamic Flip- processors to be easily built and still achieve flops with Embedded Logic,” Digest of excellent performance. Starting at 600-MHz Tech. Papers, 1998 Symp. VLSI Circuits, target clock rates, we plan to extend the Ultra- IEEE Press, 1998, pp. 108–109. SPARC-III architecture to achieve clock rates in excess of 1 GHz with UltraSPARC-IV. Tim Horel is Megacell group manager for the UltraSPARC-III development team at Sun References Microsystems Inc. He previously held product 1. D. Weaver and T. Germond, The SPARC development and engineering positions at Architecture Manual, Version 9, Prentice AMCC and IBM. Horel has a BSEE from the Hall, Englewood Cliffs, N.J., 1994. State University of New York at Buffalo. 2. N. Wilhelm, “Why Wire Delays Will No Longer Scale for VLSI Chips,” Tech. Report Gary Lauterbach is a distinguished engineer TR-95-44, Sun Laboratories, Mountain View, at Sun Microsystems Inc. and chief architect Calif., 1995. of the UltraSPARC-III. In addition to micro- 3. K.J. Nowka and M.J. Flynn, Wave Pipelining processor design, he has worked on operating of High-Performance CMOS Static RAM, system design, CAE tools, process control sys- Tech. Report CSL-TR-94-615, Stanford tems and microwave communication systems. University, Stanford, Calif., 1994. He has a BSEE from the New Jersey Institute 4. S.-T. Pan, K. So, and J. Rameh, “Improving of Technology. Lauterbach is a past member Branch Prediction Accuracy using Branch of the ACM and the IEEE. Correlation,” Proc. Fifth Conf. Architectural Support for Programming Languages and Contact the authors about this article at Operating Systems, ACM Press, New York, Sun Microsystems Inc., {tim.horel, gary. 1992, pp. 76–84. lauterbach}@eng.sun.com. 5. J.E. Smith and J.E. Pleszkun, “Implementa- tion of Precise Interrupts in Pipelined Proces- sors,” Proc. 12th Ann. Int’l Symp. Computer Architecture, ACM Press, p. 36–44. 
You may find information concerning Ultra- 6. R. Heald et al., “64-KByte Sum-Addressed- SPARC-III’s comparison to its major competi- Memory Cache with 1.6-ns Cycle and 2.6ns tors in an interview with author Gary Latency,” IEEE J. Solid-State Circuits, Nov. Lauterbach in the June 1999 issue of Comput- 1998, pp. 1,682–1,689. er magazine in the Computing Practices section.
