A Chip Multithreaded Processor for Network-Facing Workloads
The article describes an early implementation of a chip multithreaded (CMT) processor, Sun Microsystems' technology to support its throughput computing strategy. CMT processors are designed to handle workloads consisting of multiple threads with light computational demands, typical of Web servers and other network-facing systems.

Sanjiv Kapil
Harlan McGhan
Jesse Lawrendra
Sun Microsystems

With its throughput computing strategy, Sun Microsystems seeks to reverse long-standing trends towards increasingly elaborate processor designs by focusing instead on simple, replicated designs that effectively exploit threaded workloads and thus enable radically higher levels of performance scaling.1

Throughput computing is based on chip multithreading processor design technology. In CMT technology, performance is defined by maximizing the amount of work accomplished per unit of time or other relevant resource, rather than by minimizing the time needed to complete a given task or set of tasks. Similarly, a CMT processor might seek to use power more efficiently by clocking at a lower frequency—if this tradeoff lets it generate more work per watt of expended power.

The processor described in this article is a member of Sun's first generation of CMT processors designed to efficiently execute network-facing workloads. Network-facing systems primarily service network clients and are often grouped together under the label "Web servers." Systems of this type tend to favor compute-dense form factors, such as racks and blades.2 The processor's dual-thread execution capability, compact die size, and minimal power consumption combine to produce high throughput performance per watt, per transistor, and per square millimeter of die area. Given the short design cycle Sun needed to create the processor, the result is a compelling early proof of the value of throughput computing.
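The work-per-watt definition above can be sketched numerically. Both configurations below, and all their numbers, are hypothetical values chosen only to show how a slower, multithreaded part can come out ahead on throughput per watt; they are not measurements of any Sun processor.

```python
# Illustrative comparison of two hypothetical processors under the
# throughput-computing metric (work accomplished per watt) described above.
# All figures are invented for illustration.

def throughput_per_watt(threads, ipc_per_thread, freq_mhz, watts):
    """Aggregate instructions per second, divided by power dissipated."""
    total_ips = threads * ipc_per_thread * freq_mhz * 1e6
    return total_ips / watts

# A fast single-thread design: high clock, high power.
single = throughput_per_watt(threads=1, ipc_per_thread=1.2,
                             freq_mhz=2000, watts=80)

# A CMT-style design: lower clock and power, but two threads.
cmt = throughput_per_watt(threads=2, ipc_per_thread=1.0,
                          freq_mhz=1200, watts=32)

print(f"single-thread design: {single:.2e} instructions/s per watt")
print(f"CMT-style design:     {cmt:.2e} instructions/s per watt")
```

Under these assumed numbers the CMT-style design delivers 2.5 times the work per watt despite its lower clock, which is exactly the tradeoff the throughput computing metric rewards.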
Design goals
By CMT standards, the best processor accomplishes the most work per second of time, per watt of expended power, per square millimeter of die area, and so on (that is, it operates most efficiently). To satisfy target market requirements, the design team set the following design goals:

• Focus on throughput (number of threads), not single-thread performance.
• Maximize performance-per-watt ratio.
• Minimize die area.
• Minimize design cost for one- to four-way servers.

Published by the IEEE Computer Society, 0272-1732/04/$20.00 © 2004 IEEE

Meeting these goals lets system designers pack processors tightly together to build inexpensive boxes or racks offering very high performance per cubic foot. Alternatively, they can use the higher packing densities and lower costs possible with the design to provide more memory or greater functional integration.

Three additional requirements constrained the four basic design goals: minimizing time to market and development cost while maintaining full Scalable Processor Architecture (Sparc) V9 (64-bit) binary compatibility.3 These constraints dictated our decision to leverage existing UltraSparc designs to the greatest extent possible, while the design goals motivated our unusual decision to combine elements from two different UltraSparc generations.

Overview
To maximize thread density while minimizing both die area and power consumption, we rejected the third-generation, 29-million-transistor UltraSparc III processor design4 in favor of the execution engine from the far smaller and less power-hungry first-generation, 5.2-million-transistor UltraSparc I design.5,6 This choice let us create a processor incorporating two complete four-way superscalar execution engines on a single die, conferring simple dual-thread capability (one thread per core), while keeping both die area and power consumption to a minimum. Achieving even higher thread counts, although suitable for the target market, was inconsistent with the processor's stringent schedule and die size goals.

The decision to use the UltraSparc I execution engine did introduce a complication, however: porting the older design, originally implemented in a mid-1990s 0.5-micron, 3.3-V processor, to contemporary deep-submicron technology. To avoid needless work, the design effort started from the most advanced version of the basic UltraSparc I design available: a 250-nanometer aluminum metal database created in 1998 for the last revision of the UltraSparc II processor (targeted at an upper frequency of 450 MHz). Even so, the problem of translating this design to current 130-nm copper metal technology, targeting a frequency of 1,200 MHz, proved formidable.

Minimizing power consumption was a concern throughout the design, affecting not only the choice of thread execution engine, but also the design of every functional block on the chip. The processor also provides low-power operating modes, allowing software to selectively reduce power consumption during periods of little or no activity by reducing the chip's operating frequency to 1/2 (idle mode) or 1/32 (sleep mode) of normal clock speed.

To further reduce costs (and reduce system-level power consumption by eliminating the need to drive external buses), we replaced the large (4- to 8-Mbyte) external L2 cache typical of UltraSparc II systems with 1 Mbyte of on-chip L2 cache. Moving the L2 cache on chip also substantially improved performance by reducing access latencies; moreover, cache frequency automatically scales up when the processor's clock rate increases.

To achieve compatibility with current-generation systems, we took the design's memory controller and JBus system interface from current UltraSparc IIIi processors, reworking them to interface with the dual UltraSparc II pipelines. Indeed, to maximize leverage and minimize costs, we designed the processor to be pin compatible with UltraSparc IIIi, fitting into the same 959-pin ceramic micro-PGA package used for the existing product. Finally, we implemented a robust set of reliability, availability, and serviceability (RAS) features to ensure high levels of data integrity, and we added new CMT features to control the processor's dual-thread operation. Figure 1 is a block diagram of the design, showing the major functional units and their interconnections.

MARCH–APRIL 2004, HOT CHIPS 15

Figure 1. Processor block diagram. We reworked the memory controller and JBus system interface, originally from UltraSparc IIIi processors, to interface with UltraSparc II pipelines. The diagram shows cores 0 and 1, L2 caches 0 and 1, the memory control unit (MCU) with its double data rate 1 (DDR1) SDRAM interface, and the JBus unit with control, address, and data pins.

Thread execution engine
We lifted the thread execution engine essentially intact from the UltraSparc II processor. The execution engine's design comprises a nine-stage execution pipeline, incorporating four-way superscalar instruction fetch and dispatch to a set of nine functional units: four integer, including load/store and branch; three floating-point (FP), including divide/square root; and two single-instruction, multiple-data (SIMD) units.
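The clock-divided low-power modes described earlier can be sketched to first order: at a fixed supply voltage, dynamic CMOS power is roughly proportional to clock frequency, so the 1/2 and 1/32 dividers cut dynamic power proportionally. The divider values and the 1,200-MHz target frequency come from the article; the nominal dynamic power figure below is an assumption made only for illustration.

```python
# First-order sketch of the idle/sleep clock-divider modes. At fixed
# voltage, dynamic CMOS power scales roughly linearly with frequency
# (P_dyn ~ C * V^2 * f), so dividing the clock divides dynamic power.
# Dividers (1/2 idle, 1/32 sleep) and 1,200 MHz are from the article;
# the 25-W nominal dynamic power is an assumed illustrative figure.

NOMINAL_MHZ = 1200.0
NOMINAL_DYNAMIC_W = 25.0  # assumed, not from the article
DIVIDERS = {"run": 1, "idle": 2, "sleep": 32}

def mode_stats(mode):
    """Return (frequency in MHz, approximate dynamic power in watts)."""
    div = DIVIDERS[mode]
    return NOMINAL_MHZ / div, NOMINAL_DYNAMIC_W / div

for mode in ("run", "idle", "sleep"):
    mhz, watts = mode_stats(mode)
    print(f"{mode:>5}: {mhz:7.1f} MHz, ~{watts:6.3f} W dynamic")
```

The sketch ignores static leakage and any voltage scaling, which is why software-controlled dividers like these reduce, but do not eliminate, power draw during quiet periods.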
The pipeline also includes a set of eight overlapping general-purpose register windows, a set of 32 floating-point registers, and two separate 16-Kbyte L1 caches for instructions and data, together with associated memory management units, each with a 64-entry translation look-aside buffer (TLB) for caching recently mapped virtual-to-physical address pairs.

Between its two four-issue pipelines, the processor can fetch and issue eight instructions per clock cycle. A 12-entry instruction buffer decouples instruction fetch from instruction issue to buffer any discrepancies between these two rates. Figure 2 details the UltraSparc II pipeline.

When implemented in 130-nm technology, the 64-bit, superscalar UltraSparc II pipeline design proved remarkably efficient in both area and power. The basic integer unit, including two full arithmetic logic units, required only about 8 mm2 of area, with the FP and SIMD units, which share execution resources, adding less than 5 mm2 more. The total core, including the execution units, both L1 caches, the L2 cache controller, the load/store unit, and associated miscellaneous logic, aggregated to less than 29 mm2. Thus, together the two cores take up only about a quarter of the 206-mm2 die.

The integer unit dissipates only about 3 watts, with the floating-point unit adding less than another 1.5 watts. Maximum power dissipation by the entire execution engine is less than 7 watts, so between them, the two engines are responsible for less than half the 32 watts maximum dissipated by the processor chip.

On-chip L2 cache
Supporting each of the two UltraSparc II execution engines in the processor is a high-performance, 512-Kbyte, four-way set-associative, on-chip L2 cache. A 128-bit data bus connects each core to the L2. The total load-to-use latency for the L2 is 9 to 10 cycles, with four cycles needed to transit in and out of the
cache itself, including error-correcting code

Figure 2. The UltraSparc II pipeline. Integer operations flow through Fetch (1), Decode (2), Group (3), Execute (4), Cache (5), N1 (6), and N2 (7) stages; FP/SIMD operations share the front end and then flow through Register (4), X1 (5), X2 (6), and X3 (7) stages. (1) Four instructions are read from the 16-Kbyte, two-way set-associative (2WSA) I-cache, with branch prediction performed and the target obtained. (2) Instructions are predecoded and put into the 12-entry instruction buffer. (3) Up to four instructions are grouped for issue; integer ops read the general-purpose (GP) register windows. (4) Integer ops execute; FP/SIMD ops read the floating-point register file. (5) Loads and stores access the 16-Kbyte, one-way set-associative (1WSA) D-cache, with misses going to the L2 via the memory-management unit; FP/SIMD ops begin execution.
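The figures quoted in this section can be cross-checked with simple arithmetic: the per-core area and power bounds, the die area, the chip power ceiling, and the L2 capacity and associativity are all from the article. The 64-byte L2 line size below is an assumption (the article does not state it here), used only to illustrate the set count such a cache would imply.

```python
# Cross-check of figures quoted in this section. Area/power numbers and
# the L2 capacity/associativity come from the article; the 64-byte line
# size is an assumed value used only to illustrate the cache geometry.

CORE_AREA_MM2 = 29.0      # per-core upper bound quoted in the text
DIE_AREA_MM2 = 206.0
ENGINE_POWER_W = 7.0      # per-engine maximum quoted in the text
CHIP_POWER_W = 32.0       # chip maximum quoted in the text

# "Together the two cores take up only about a quarter of the die."
core_fraction = 2 * CORE_AREA_MM2 / DIE_AREA_MM2      # ~0.28

# "The two engines are responsible for less than half the 32 watts."
power_fraction = 2 * ENGINE_POWER_W / CHIP_POWER_W    # 0.4375

# Per-core L2: 512 Kbytes, four-way set associative.
L2_BYTES = 512 * 1024
L2_WAYS = 4
LINE_BYTES = 64                                       # assumed
l2_sets = L2_BYTES // (L2_WAYS * LINE_BYTES)          # 2048 sets

print(f"two cores:   {core_fraction:.0%} of die area")
print(f"two engines: {power_fraction:.0%} of chip power budget")
print(f"L2 geometry: {l2_sets} sets (assuming {LINE_BYTES}-byte lines)")
```

Both claims check out under the quoted bounds: two cores cover roughly 28 percent of the die, and two execution engines account for at most about 44 percent of the chip's maximum power.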