MICROPROCESSOR REPORT
THE INSIDER'S GUIDE TO MICROPROCESSOR HARDWARE
www.MPRonline.com

NIAGARA 2 OPENS THE FLOODGATES
Niagara 2 Design Is Closest Thing Yet to a True Server on a Chip
By Harlan McGhan {11/6/06-01}

At the recent Fall Microprocessor Forum, Sun presented its new Niagara 2 microprocessor design, the successor to Niagara 1. Because Niagara 1 processors ship under the product name UltraSPARC T1, Niagara 2 will presumably go to market next year as the UltraSPARC T2.

Sun's presentation at MPF was delivered by Robert Golla, Niagara 2 principal architect.

Sun's Niagara-series processors are, unquestionably, the odd duck among today's server processors. Rival server-processor designs differ from each other in degree. How much advantage does the true simultaneous multithreading (SMT) capability of dual-core, dual-threaded POWER6 processors provide over the coarse-grained or vertical multithreading (VMT) capability of dual-core, dual-threaded SPARC64 VI processors? Is the compiler-scheduled, static in-order wide-superscalar issue of VLIW-style Itanium processors better than the hardware-driven, dynamic out-of-order wide-superscalar issue of RISC processors? What benefits does HyperTransport (HT) technology confer on Opteron processors over the traditional front-side bus used by Xeon processors?

These are not the sort of questions to ask about Niagara processors. Niagara differs from other high-end server processors not merely in degree but also in kind. In a way, Niagara processors are most readily understood by those who have spent the past two decades oblivious to the forces that have caused processor cores to evolve from simple, low-frequency, in-order scalar designs to complex, high-frequency, out-of-order superscalar designs—or by those who have taken the lessons of this progression most deeply to heart. Niagara processors are predicated on the conviction that just about everything in contemporary server-processor design is wrong, and that the best way forward is to revert to a far simpler design era. (See MPR 11/17/03-03, "Will Microprocessors Become Simpler?")

The Origin of Chip Multithreading
Niagara is by no means the first radical processor design ever to reach market. To the contrary, the progression of this latest proposed design revolution traces a familiar arc. It began as an academic research project. Once the feasibility of the notion was established, it was implemented by a startup specifically created by its academic founder to give commercial expression to the new idea. The startup was later acquired by an established system company that adopted the new idea for its own and played a key role in developing and promoting the technology in the marketplace.

During the 1980s, this path was trod by Professor John Hennessy's 1981–83 MIPS reduced instruction set computing (RISC) project at Stanford University, the 1984 startup Hennessy founded (MIPS Technologies), and Silicon Graphics (later SGI)—once, the principal supporter and defender of high-end MIPS RISC architecture processors. It also was traced by Josh Fisher's early 1980s Trace Scheduling project at Yale University, which led to the notion of very long instruction word (VLIW) hardware; the 1984 startup that Fisher founded (Multiflow); and Hewlett-Packard—which later teamed with Intel to bring the descendant VLIW-style Itanium processor family to market.


In the case of Niagara, the progression begins with Professor Kunle Olukotun's late-1990s Hydra chip multithreading (CMT) project at Stanford, the 2000 startup Olukotun founded (Afara Websystems), and Sun Microsystems—which acquired Afara in July 2002 and publicized the commercial derivative of the Hydra design under the Niagara code-name. For an account of multithreading strategies and their associated terminology, see the sidebar "Multithreading Strategies in Server Processors."

In a sense, "Niagara 2" is a generational misnomer. In fact, Niagara 2 closely resembles the original vision of a commercial implementation of Hydra at Afara Websystems. Niagara 1 was a cut-back version of this design, initiated at Sun shortly after it acquired Afara, in order to fit the design onto a manufacturable die using TI's 90nm process technology and to get the product to market as rapidly as possible. The full realization of the initial Afara plan was postponed to the Niagara 2 generation. The additional development time afforded to a second-generation design, and the more ample transistor budgets of 65nm process technology, make the Niagara 2 implementation commercially feasible.

Because of this relationship between the first two generations of the Niagara design—where Niagara 1 is, in effect, Niagara Jr.—Niagara 1 furnishes a convenient starting point for understanding Niagara 2.

Niagara 1 Core Overview
At the core level, Niagara 1 is a design not seen in commercial high-end desktop or server processors since the late 1980s. It is an integer-only core with an attached (shared) FPU. It revives the basic five-stage pipeline used by early RISC implementations: instruction fetch, decode, execution, memory access (for load/store instructions), and write-back. However, it does insert one extra stage—thread select—between instruction fetch and decode. Figure 1 shows the Niagara 1 core pipeline.

[Figure 1 diagram: a six-stage pipeline (Fetch, Thread Select, Decode, Execute, Memory, Write-Back) with per-thread program counters, instruction buffers, and register files (x4); a shared I-cache, D-cache, and ALU/multiply/shift/divide unit; a cryptographic coprocessor on the crossbar interface; and thread-select logic driven by instruction type, cache misses, traps and interrupts, and resource conflicts.]

Figure 1. Niagara 1 core pipeline. This unit handles basic integer ALU operations plus shifts and integer multiply and divide. The only features that distinguish it from a basic 1980s scalar RISC pipeline are the thread-select stage and the cryptographic coprocessor. The cryptographic coprocessor shares the crossbar interface with the L1 D-cache but does not modify its contents, streaming data directly out of and back into the L2 cache.

The Niagara 1 core fetches one or two instructions from a thread out of the instruction cache during a clock cycle (depending on whether the fetch address is odd or even). The first instruction fetched proceeds down the pipeline, flowing through to the decode stage in the next clock cycle. When two instructions are fetched from an even boundary, the second "odd" instruction is held in a one-line instruction buffer (next-instruction flop) at the beginning of the select stage for later issue. Thus, although the Niagara 1 core sometimes fetches multiple instructions, it always issues just one instruction down the pipeline, like any simple scalar design.

Although the Niagara 1 core executes just one instruction per clock cycle, it manages four separate execution threads simultaneously. As Figure 1 indicates, each thread has its own program counter (PC), one-line instruction buffer (for holding "odd" instructions), and register file. The four threads share the instruction cache, the data cache, and the various execution resources in the core.

Niagara 1 processors implement a fine-grained multithreading scheme. The thread-select logic implements a least recently used (LRU) algorithm for picking from among the available threads. When all four threads are available, it simply plays round-robin on a clock-by-clock basis, fetching an instruction for thread 0 from the I-cache on clock 0 and issuing it to the decoder; fetching and issuing an instruction for thread 1 on clock 1; for thread 2 on clock 2; and for thread 3 on clock 3; before starting over again with thread 0 on clock 4. If a thread stalls for any reason, the thread-select logic simply drops it from the round-robin until it is ready to resume execution, fetching and issuing instructions from the next executable thread in its place.

Effectively, the individual threads dynamically speed up and slow down as they run. At a processor frequency of 1.2GHz, the four executable threads each run at a speed of 300MHz. If one thread drops out, the remaining three threads run at 400MHz; or, if two threads drop out, at 600MHz. If all but one of the threads stall, the last remaining executable thread runs at the full 1.2GHz processor speed—until one or more of the stalled threads joins back in, slowing the individually executing threads down again in proportion to the number running at the same time.

For the most part, Niagara simply ignores the problems that other contemporary processors go to great lengths to minimize. Multithreading allows the core to stay busy without the complications of out-of-order execution. Rather than attempting to predict branches (with imperfect success), or predicate them (at great expense), Niagara simply drops a thread issuing a branch out of the rotation until its branch condition is resolved.
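A toy software model makes the selection policy concrete. This is only an illustration of the behavior described above, not Sun's selection logic; the thread count is real, but the stall conditions and cycle budget are invented for the demo.

```python
# Toy model of Niagara 1's fine-grained thread selection: one instruction
# issues per clock from the least recently used ready thread, and stalled
# threads simply drop out of the rotation.
from collections import deque

def run(cycles, is_stalled):
    lru = deque([0, 1, 2, 3])            # thread IDs, least recently used on the left
    issued = {t: 0 for t in lru}
    for clock in range(cycles):
        ready = [t for t in lru if not is_stalled(t, clock)]
        if not ready:
            continue                     # all four threads waiting: the pipe bubbles
        t = ready[0]                     # the LRU ready thread wins the select stage
        issued[t] += 1
        lru.remove(t)
        lru.append(t)                    # t becomes the most recently used
    return issued

# With no stalls, 1,200 cycles split four ways: 300 issues per thread (an
# effective 300MHz each at 1.2GHz). Park thread 3 on a long miss and the
# three survivors speed up to 400 apiece, as the article describes.
print(run(1_200, is_stalled=lambda t, clock: False))    # {0: 300, 1: 300, 2: 300, 3: 300}
print(run(1_200, is_stalled=lambda t, clock: t == 3))   # {0: 400, 1: 400, 2: 400, 3: 0}
```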

And although Niagara is willing to speculate that loads hit the L1 cache, it also lowers the priority of speculatively issued instructions. The result is that speculations issue only as a last resort, when the core has run out of more-certain instructions to decode. Ideally, before the core gets around to issuing a load-dependent instruction speculatively, the load will either have returned its data or missed the cache—either way, entirely removing the element of speculation from any dependent instructions.

The other feature (besides thread select) that distinguishes the Niagara 1 core from scalar 1980s RISC designs is an asynchronously operating cryptographic coprocessor. This supports public-key RSA and (almost identical) DSA encryption/decryption for up to 2,048-bit keys. The unit shares the integer multiplier for modular arithmetic operations. Only one thread can use the crypto unit at a time. The thread sets up the operation by storing a value to a control register, after which it returns to normal processing. Triggered by the control register, the crypto coprocessor then initiates a streaming load/store directly into/out of the L2 data cache (bypassing L1). The streaming crypto operation completes either through polling or because of an interrupt.
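In software terms, this offload sequence looks roughly like the sketch below. The class, method names, and polling interface are all invented for illustration; the real unit is driven by stores to hardware control registers, not a software API.

```python
# Sketch of the asynchronous crypto-offload pattern: a thread kicks off the
# operation with a single control-register write, resumes normal work, and
# later polls (or takes an interrupt) for completion.
import threading

class CryptoUnit:
    def __init__(self):
        self.done = threading.Event()

    def write_control(self, op, key_bits, src, dst):
        # The store to the control register triggers the streaming operation.
        self.done.clear()
        threading.Thread(target=self._stream, args=(op, key_bits, src, dst)).start()

    def _stream(self, op, key_bits, src, dst):
        # ...stream operands directly out of and back into L2, bypassing L1...
        self.done.set()          # completion observed by polling or an interrupt

unit = CryptoUnit()
unit.write_control("rsa-decrypt", key_bits=2048, src=0x10000, dst=0x20000)
# The issuing thread returns to normal processing here, and only later waits:
unit.done.wait()
```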

Multithreading Improves Throughput
The advantage of Niagara 1's multithreading over a simple RISC pipeline may not be immediately apparent. Although Niagara's fine-grained multithreading handles four threads at one-quarter the processor clock speed, exactly the same amount of work would be accomplished by processing one thread at the processor's full clock speed. It's natural to wonder why Sun makes the substantial investment needed to implement fine-grained multithreading on Niagara processors.

The answer is that, even on a simple scalar core, few threads can sustain an execution rate of four instructions every four clock cycles for very long. Threads are constantly slowed by long-latency operations, including branches, loads, and inherently multicycle operations like multiply or divide. They stall altogether when loads miss the L1 cache or (far worse) the L2 cache, or when necessary resources are unavailable, or when a trap interrupts normal processing, e.g., due to an error or the need to handle a real-time event.

In Niagara, as threads slow or stall for any reason, they simply drop out of the rotation until ready to resume execution. The remaining threads speed up to fill the vacated decode slots. The LRU-selection algorithm automatically gives a stalled thread higher priority when it is ready to restart.

Even though Niagara 1 executes only one instruction from one thread per clock cycle, when the inevitable stalls are considered, it runs four threads concurrently just as fast as they could run in a simple, scalar pipeline. As a rule, Niagara threads do not starve for execution resources. To the contrary, by oversubscribing its core execution resources by a factor of four, Niagara comes much closer to (but still does not quite reach) the ideal RISC goal of executing one instruction every clock cycle. Figure 2 shows an analysis of core usage based on the SPECjbb benchmark program.

[Figure 2 bar chart: a single-threaded core spends 3.79 idle cycles for every compute cycle, for 1/(1 + 3.79) = 21% efficiency; the four-threaded core spends 1.56 idle cycles for every 4 compute cycles, for 4/(4 + 1.56) = 72% efficiency. Idle time is broken down into pipeline conflict, pipeline latency, and memory latency.]

Figure 2. Here is a measured example of Niagara 1's execution efficiency, drawn from an analysis of the SPECjbb benchmark program that Sun provided at the 2006 International Solid-State Circuits Conference (ISSCC). Although the four-way multithreaded Niagara 1 core is not perfectly efficient, it is far more efficient than any single-threaded core. The idleness of nearly 80% shown for a single-threaded core on this benchmark is shocking in its own right, but it is only the tip of the iceberg of inefficiency where superscalar designs are concerned. For a superscalar processor, most busy cycles leave some execution slots unfilled, while every idle cycle leaves all its multiple execution slots unfilled. Source: A. S. Leon et al., "A Power-Efficient High Throughput 32-Thread SPARC Processor," ISSCC 2006, Paper 5.1.
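The efficiency figures follow directly from the idle-cycle counts in Figure 2:

```python
# Reproducing the Figure 2 arithmetic (the idle-cycle counts are Sun's
# measured SPECjbb numbers from ISSCC 2006).
def efficiency(busy, idle):
    return busy / (busy + idle)

print(f"single-threaded: {efficiency(1, 3.79):.0%}")   # 21%
print(f"four-threaded:   {efficiency(4, 1.56):.0%}")   # 72%
```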


When running this program, a single-threaded version of the Niagara core would sit idle for 3.79 cycles for every one cycle spent executing code. That's an efficiency rating of just under 21%. The core spends most of its idle time (shown in black) waiting for memory. The remainder (shown in purple) is caused by pipeline bubbles—by branch penalties, load-use penalties, and so on. With a program of this type, a single-threaded processor core actually does nothing for nearly 80% of the time it is (supposedly) running.

Niagara's four-way multithreading introduces a small amount of additional overhead into the execution process (shown in gray)—and clearly does nothing to reduce either memory or pipeline latency. However, by switching to a different thread every clock cycle, Niagara manages to recover not only all the added overhead but most of the idle cycles, too. Efficiency rises dramatically in this four-thread design, with the core gainfully employed more than 70% of the time. In other words, rather than wasting four out of every five clock cycles, the Niagara four-way, fine-grained multithreading strategy effects a dramatic turnaround, putting nearly three out of every four clock cycles to productive use.

It is no exaggeration to say that Niagara processors seek to restore, first and foremost, execution efficiency to processor designs. This necessarily involves turning away from thread speed as a metric of excellence, since thread speed is invariably improved by extravagance. Adding extra issue slots to a superscalar design will improve a processor's speed—even if most software uses the extra slots only rarely. Adding a few hundred million extra transistors to a cache will improve a processor's speed—even if the hit rate of most programs increases only slightly. Adding another gigahertz to the clock frequency will improve a processor's speed—even if the extra cycles are largely left idle.

As long as CPU architects evaluate a design change only by the incremental improvement in performance resulting from it—and as long as they hide or ignore the costs of that improvement (in design time, transistors, power consumption, and heat)—the speed advantage will always go to the designs that consume the most. The winning designs will risk the widest instruction issue, the highest clock frequencies, the largest transistor counts, and the greatest power consumption.

Niagara processors are not about raw speed. Sun's goal is to make all the investments in the Niagara pipeline—from designing it on the drawing board to cooling it in server farms—pay off, by the simple expedient of keeping it productively employed for more clock cycles. This is not a goal that benefits from extravagance of any kind, including superscalar instruction issue, excessive cache sizes, and high clock frequencies.

This kind of execution efficiency depends on high throughput performance. Niagara cores churn out (close to) one completed instruction for every clock cycle by executing multiple threads in tight sequence, one instruction at a time. We must judge Niagara not by the amount of performance it delivers per thread, but rather by the amount of throughput it delivers per watt of expended energy or per unit of rack space.

Niagara 1 Processor Design
Zooming back from the core level, Niagara processors emphasize high integration. Figure 3 is a block diagram of the Niagara 1 processor.

Just as Niagara's performance goal challenges traditional speed metrics, the overall chip design also rethinks some standard assumptions. Each core supports four threads with 16KB of I-cache and 8KB of D-cache. The whole chip supports 32 threads of execution with just 3MB of L2 cache. In applications where most or all threads share data (e.g., transactional processing workloads), a 3MB cache is sufficiently large that enlarging it does not substantially increase the hit rate.

[Figure 3 diagram: eight SPARC cores connected by a crossbar (CCX) to four L2 banks (B0–B3), each with its own DRAM control channel (0–3); a shared FPU; a 200MHz JBus system interface (jbi); a 50MHz SSI ROM interface; and eFuse, CTU, IOB, JTAG, and CSR blocks.]

Figure 3. Niagara 1 block diagram. The small (11mm2) four-way threaded core is replicated eight times and integrated with a four-banked 3MB L2 cache, one FPU, four DDR2 memory controllers, a Sun JBus system interface (originally developed for the UltraSPARC IIIi), an SSI interface, and some built-in test, debug, and bring-up logic. A high-speed crossbar connects the cores to the FPU and L2 cache.

In other types of applications, even assuming some threads are sharing data, with each thread getting by on just 4KB of L1 I-cache, 2KB of L1 D-cache, and less than 100KB of L2 cache, hit rates will be relatively poor.

But the function of on-chip caches in Niagara is not to maximize hit rates and thereby minimize memory latencies. Memory latencies are less critical to Niagara performance than for other processors, because Niagara is not striving for high thread speed. As long as the core can keep busy executing some threads, it does not greatly matter if other threads are suspended waiting on a memory access more or less often. In a sense, Niagara depends on memory latencies, exploiting them almost as a precious natural resource that opens the core for use by other threads.

The heart of Niagara processing performance really lies in its top-to-bottom memory bandwidth, starting with the four integrated DDR2 memory controllers. Operating at 400MHz (2 × 200MHz), these four 144-bit channels (16 bytes plus 16 parity bits) have the collective ability to transfer 25.6GB/s to and from the processor. To support its 32 executing threads, Niagara wants to exercise these links as heavily as possible. The on-chip caches are merely oversize buffers whose chief function is to keep the memory links from saturating and thereby capping the chip's effectiveness.

The on-chip bandwidth supplied by the crossbar that connects the eight cores to the four L2 banks (and the attached FPU) is even more phenomenal, supporting 134GB/s of traffic. The L2 cache is fully shared: any core can access any bank with about the same latency. It is this whole memory subsystem, with its associated bandwidths, that drives the throughput performance of Niagara.

The Sun-specific 3.2GB/s JBus interface is usually connected to an external PCI-X/PCIe bridge chip to provide high-speed I/O access to the processor. Cramming all this functionality on a die in TI's nine-layer copper-metal 90nm technology requires 279 million transistors and occupies 378mm2 of silicon. At its maximum frequency of 1.2GHz, Niagara 1 dissipates around 70W, or just a bit more than 2W per thread.
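The key numbers above reduce to simple arithmetic; the per-thread cache division below assumes, pessimistically, that the 32 threads share nothing:

```python
# Back-of-the-envelope numbers behind the Niagara 1 memory subsystem.
channels, data_bytes_per_beat, transfers_per_s = 4, 16, 400e6   # DDR2 at 2 x 200MHz
print(channels * data_bytes_per_beat * transfers_per_s / 1e9)   # 25.6 GB/s peak

# Worst-case per-thread slice of the caches if nothing is shared:
print(16 / 4)            # 4KB of the 16KB I-cache per thread
print(8 / 4)             # 2KB of the 8KB D-cache per thread
print(3 * 1024 / 32)     # 96KB of the 3MB L2 per thread (under 100KB)
```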


Limitations of the Niagara 1 Design
The single shared FPU, attached to the cores through the crossbar switch, is an obvious weakness and potential bottleneck in the initial Niagara design. First, simply reaching the FPU with an instruction is a long-latency operation (about 30 clock cycles). Second, the FPU's capabilities are limited. For example, it supports just a subset of the Visual Instruction Set (VIS) extensions built into every UltraSPARC processor since the UltraSPARC I generation. (See MPR 12/5/94-04, "UltraSPARC Adds Multimedia Instructions.") Third, if 32 threads all start banging on the lone FPU at once, it very quickly becomes a bottleneck, choking the performance of the whole chip.

Niagara 1 can handle an occasional floating-point instruction, but if the software contains more than a percentage point or two of floating-point operations, then Niagara 1 is a poor choice. In fact, it should be entirely struck off the list of candidate processors.

Two other limitations of Niagara 1 are that software should be heavily threaded and none of the threads should be speed-critical. Niagara 1 gambles on finding four threads per core and 32 threads per processor. Moreover, the software must be able to tolerate the scalar thread-execution speed of one-quarter Niagara's relatively low clock frequency. And the order in which threads complete must be irrelevant, because the processor can't guarantee the order of their completion.

Niagara 1 is extremely efficient when running heavily threaded code at relatively low thread speeds, but its efficiency drops sharply as threads become scarce. Other processors generally ignore more than two or four threads at one time, even when more threads are available. Limited to a single thread, they happily cope with their abundant speed features, including superscalar instruction issue, out-of-order execution, branch prediction, hardware prefetching, speculative loads, and so on.

Niagara has none of these mechanisms. Its only defense against enforced idleness is multithreading. Take that away and it is defenseless. Largely for this reason, Niagara processors are optimized to run unmodified SPARC V9 code under Solaris—perhaps the largest, extensively threaded code base in existence.

The Niagara 2 Core Design
Almost everything said in general about Niagara processors applies to the Niagara 2 design. Niagara 2 differs from Niagara 1 primarily by hewing more closely to Afara's original vision of a true server on a chip. Niagara 2 also remedies the most glaring (addressable) deficiency of Niagara 1—its poor floating-point performance.

Niagara 2 still has eight cores, but the basic core design is upgraded in three major, and several minor, respects. The most significant change is the addition of a second four-threaded execution pipeline. Each Niagara 2 core executes eight threads at a time, not just four. Since the number of cores is the same, this means each 8-core × 8-thread Niagara 2 chip executes 64 threads, not 32 threads like the 8-core × 4-thread Niagara 1.

Because the second set of four threads per core executes in its own new pipeline, there is no additional contention for execution resources among the eight threads within a core. Admittedly, the additional four threads double the stress on the shared load-store unit and the crypto coprocessor. They also double the burden placed on the L1 caches, since these remain the same size. However, since Niagara processors depend on high memory bandwidth to achieve high throughput performance, this added pressure largely translates into pressure on the whole memory subsystem—which has been upgraded to support the doubling of threads.

The second important change is the addition of a floating-point/graphics unit (FGU) to each core, making eight FGUs per chip (each shared by eight threads). Even serially dependent floating-point operations from all eight threads can run interleaved on the FGU, so Niagara 2 offers far more floating-point throughput than Niagara 1.

Moreover, each FGU is more functional than the single shared FPU of Niagara 1. As its name suggests, it provides full support for the current UltraSPARC 2.0 VIS extensions. While the eight new FGUs don't turn Niagara 2 into a floating-point monster, they do remove the severe limitations of Niagara 1 processors when running floating-point code of any consequence. Because Niagara 2 doesn't emphasize thread speed much more than Niagara 1 does, in suitable applications, Niagara 2 systems should run floating-point code well enough for most practical purposes.

The third important change is a significant upgrade of the in-core, asynchronous cryptographic coprocessor. In Niagara 1, the crypto unit handled basic RSA/DSA public-key ciphers. The upgraded crypto unit in Niagara 2 handles just about any cipher that might be of interest, including RC4, AES, DES, and 3DES. It also handles MD-5 and SHA-1/256 hashes. Moreover, running at full core speed, it is designed to keep up with the two new, integrated 10Gb Ethernet ports, allowing encryption/decryption of packets to keep pace with the wire-speed flow of traffic off the network into memory, and out of memory into the network. The DMA engine in the crypto unit shares the core's crossbar port. Figure 4 shows the new Niagara 2 core design.

[Figure 4 diagram: TLU and IFU above two execution units (EXU0 and EXU1); SPU, FGU, and LSU below them; an MMU with hardware table walk; and a gasket connecting the core to the crossbar/L2.]

Figure 4. In Niagara 2 cores, the most visible changes are the dual execution pipelines (EXU0 and EXU1), each supporting four threads, and the new FGU. Other units were present in the Niagara 1 core. TLU is the trap logic unit (for interrupt and trap handling); IFU is the instruction-fetch unit (which includes the 16KB I-cache); SPU is the stream processing or crypto unit (which now shares resources with the FGU); and LSU is the load/store unit (which includes the 8KB D-cache). The memory management unit (MMU) includes hardware table walk (HWTW) for page sizes ranging from 8KB to 256MB. The gasket is the interface to the crossbar, which handles all traffic in and out of the core.

Note that in contrast to high-performance, speed-oriented core designs that typically fan out into multiple parallel integer, floating-point, and load/store execution pipelines, Niagara 2 cores funnel all instructions through their two integer pipes. The two integer execution units pass load/store instructions to the shared LSU and pass floating-point and VIS instructions to the shared FGU. The asynchronous integrated SPU (crypto unit) operates in essentially the same way as before (set up by load/store instructions), except that it now shares resources with the integrated FGU rather than with the integer multiplier.

Among the other changes, the most significant is the expansion of the next-instruction flops (used to hold the "odd" instruction on double fetches from even addresses) into full eight-entry instruction buffers, one for each thread. In Niagara 1, the (first) fetched instruction always flows down the pipeline. In Niagara 2, the new eight-entry instruction buffers decouple the fetch and pick stages of the pipeline.

More minor changes include a deeper integer pipeline. The pipeline needs an extra stage (labeled "pick") to divide the execution stream into two four-thread groups of instructions. The data TLB (translation lookaside buffer) is doubled from 64 fully associative entries to 128 fully associative entries. The I-cache is bumped up from four-way set associative to eight-way. Figure 5 shows the lengthened 8-stage integer pipeline and the new 12-stage floating-point pipeline.

8-Stage Integer Pipeline: Fetch → Cache → Pick → Decode → Execute → Mem → Bypass → W

12-Stage Floating-Point Pipeline: Fetch → Cache → Pick → Decode → Execute → Fx1 → Fx2 → Fx3 → Fx4 → Fx5 → FB → FW

Figure 5. Niagara 2 execution pipelines. The new Pick stage supports the eight threads each core concurrently executes. The bypass stage was in Niagara 1 but was folded into the write-back stage and not explicitly labeled in earlier pipeline diagrams. The integer pipe has a three-cycle load- use penalty. The latency for dependent FP operations is six cycles. The 12-stage floating-point pipeline is for add, subtract, and multiply operations; divide and square-root operations need additional execution stages.
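The caption's six-cycle dependent-FP latency explains why eight threads per FGU matter: a single thread could issue a dependent floating-point operation only once every six cycles, but eight rotating threads keep the unit full. A toy model (the issue policy here is an assumption) bears this out:

```python
# With a 6-cycle dependent-FP latency, 8 rotating threads keep the FGU full.
DEP_LATENCY, THREADS, CYCLES = 6, 8, 48

ready_at = [0] * THREADS       # earliest cycle each thread's next FP op may issue
busy = 0
for clock in range(CYCLES):
    for t in sorted(range(THREADS), key=lambda t: ready_at[t]):
        if ready_at[t] <= clock:          # issue the longest-waiting ready thread
            ready_at[t] = clock + DEP_LATENCY
            busy += 1
            break
print(f"FGU utilization: {busy / CYCLES:.0%}")   # 100%; with 4 threads it would be 67%
```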


Because Niagara 2 doubles the number of threads per chip but increases the L2 cache by only 33%, it puts even more strain on memory bandwidth in workloads without high data sharing. Consequently, Niagara 2 must achieve higher average transfer rates than Niagara 1. To help cope with this challenge, the four DDR2 channels in the previous generation have been replaced with four dual FBDIMM DRAM channels. These burn a few watts more than DDR2 channels but require fewer pins and offer significantly better memory performance.

Of course, with an FGU integrated into every core, the FPU attached to the crossbar is gone in Niagara 2. Instead, the I/O port in the crossbar connects to a new system interface unit (SIU). This, in turn, supports an integrated x8 PCI Express port (formerly connected externally to the integrated JBus controller). There are also two new 10/1Gb Ethernet ports, enabling network packets to flow directly through the processor without needing external Ethernet controllers. The integrated crypto units enable secure network processing.

With a few exceptions in terms of interesting detail—including transistor count, power dissipation, confirmed memory bandwidth, and reported benchmark performance—Sun has now well disclosed the Niagara 2 design. Table 1 compares the two generations of Niagara, and Figure 6 compares the two dies.

UltraSPARC Generation            Niagara 2 (US T2)          Niagara 1 (US T1)
Process Technology               TI 65nm                    TI 90nm, 9LM
Date Introduced                  3Q07*                      4Q05
Frequency (MHz)                  1,400*                     1,200
Transistors (millions)           N/A                        279
Die Area (mm2)                   342                        378
Power Dissipation (watts)        N/A                        ~70
Cores/Die                        8                          8
Threads/Core                     8                          4
Integer Execution Units          16                         8
Floating-Point Execution Units   8                          1
Load/Store Units                 8                          8
Cryptography Units               8 large                    8 small
Instructions/Thread              1                          1
Instructions/Clock               16                         8
Peak mips                        22,400                     9,600
L1 I-Cache                       16KB, 8-way, 32B line      16KB, 4-way, 32B line
I-TLB                            64 entry, full assoc       64 entry, full assoc
L1 D-Cache                       8KB, 4-way, 16B line       8KB, 4-way, 16B line
D-TLB                            128 entry, full assoc      64 entry, full assoc
Instruction Buffers              8 8-entry buffers          4 "odd"-instruction flops
L2 Cache                         4MB, 16-way, 64B line      3MB, 12-way, 64B line
L2 Cache Banks                   8                          4
Crossbar Bandwidth (GB/s)        180 read + 90 write        134.4
Memory Controllers               4 x FB-DRAM                4 x DDR2
Memory Speed                     FBD-667*                   2 x 200MHz
Memory Bus Width (bits)          4 x 144**                  4 x 144**
Memory Bandwidth (GB/s)          42.7*                      25.6
Memory Size (GB)                 128                        128
Pipeline Stages                  8 int, 12 fp               6 int
Integrated I/O                   x8 PCI Express             JBus (3.2GB/s)
Integrated Networking            2 x 10/1 Gb/s Ethernet     none
Total Pins                       1,831                      1,933

Table 1. Comparison of the first two generations of Niagara processor design. The most important changes are doubling the threads per core and per chip; the small frequency bump; the great improvement in floating-point performance; the upgraded crypto units; the new higher-bandwidth FB DRAM channels; and the integrated PCI Express and Ethernet interfaces. *MPR estimate. **16 bytes + 16 parity bits.
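A few Table 1 entries can be cross-checked against the others. The peak-mips formula (cores × pipes × frequency) is an inference from the table, and the FB-DIMM channel math assumes 16 data bytes per transfer, as on Niagara 1:

```python
# Cross-checking Table 1.
print(8 * 2 * 1400)                      # Niagara 2 peak mips: 22,400
print(8 * 1 * 1200)                      # Niagara 1 peak mips: 9,600
print(round(4 * 16 * 667e6 / 1e9, 1))    # FBD-667 channels: 42.7 GB/s (MPR estimate)
print(4 * 16 * 400e6 / 1e9)              # DDR2-400 channels: 25.6 GB/s
```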
Conclusion: A Significant Upgrade
Niagara 2 is the fully embodied vision of a throughput-oriented, highly efficient server-on-a-chip that the Afara team brought to Sun in July 2002. By leveraging Sun's design and financial resources, as well as Sun's long-standing semiconductor partnership with TI and its 65nm technology, this vision now exists in working silicon. It is expected to reach systems during 3Q07.

Although Sun has not yet released full information about Niagara 2, we have enough detail to evaluate the design. The throughput performance of Niagara 2 at its (presumed) target frequency of 1.4GHz should be more than twice the performance of the former design. Expect much more improvement with selected tasks that take advantage of the new floating-point capabilities, the new wire-speed cryptography, and the new integrated networking interfaces. The integrated PCI Express interface will improve I/O performance. Sun estimates that both raw throughput and throughput per watt will more than double; that single-thread integer performance will increase by 40%; and that floating-point throughput will improve by more than an order of magnitude.

Figure 6. The Niagara 1 and Niagara 2 dies. These images are roughly to scale—the 65nm Niagara 2 chip is about 10% smaller than the 90nm Niagara 1 chip.


Multithreading Strategies in Server Processors

Server processors use a variety of multithreading strategies. Perhaps the simplest approach is the chip multicore strategy in Sun Microsystems' UltraSPARC IV processors. An UltraSPARC IV is essentially two single-threaded UltraSPARC III processor cores integrated on a single die. Apart from the usual cache-coherency issues in all multiprocessor systems, the only coordination needed between the two separate threads of execution is arbitration for shared resources.

Another strategy is core multithreading. The ideas behind chip multicore and core multithread designs are very different. Chip multicores are essentially multiprocessors on a chip. UltraSPARC IV processors provide the processing power of two equivalent UltraSPARC III processors—with no more than the frictional losses typical of all multiprocessor systems and an added small overhead cost for resource arbitration—because they effectively are two complete UltraSPARC III processors on a single die.

By contrast, core multithread designs are not full multiprocessors on a chip. For a variety of reasons—including limited exploitable instruction-level parallelism in many threads, large numbers of long-latency operations, and frequent thread stalls—wide superscalar designs tend to badly under-use available core execution resources. The goal of core multithreading is to take advantage of these otherwise idle resources to execute additional threads concurrently. This not only improves the overall usage of core resources but also gains incremental throughput performance at little incremental cost.

Of course, processors have long practiced timesharing to improve usage, switching among various tasks either on demand or at intervals. The drawback to timesharing is a relatively expensive context switch every time the executing thread changes. Hence, the more often a processor shares its time, the more cycles it spends handling overhead, and the fewer cycles it has left to advance the various time-sliced threads.

Core multithreading differs from traditional timesharing by holding the context of multiple threads "hot" in hardware. This involves duplicating the program counter, status and control registers, and integer and floating-point register sets. Other core facilities also may be augmented to reduce conflicts between the different "hot" threads contending for shared resources.

While the theory behind core multithreading is unimpeachable, implementation requires treading a fine line. In the worst case, core multithreading will deliver little or no benefit. In some programs, it may actually be detrimental: thread speed will slow, and there will be little or no compensating improvement in throughput. In the best case, there will be no noticeable degradation in thread speed, with up to a 40% (or better) improvement in throughput.

Contemporary server processors practice three types of core multithreading. Coarse-grained multithreading involves switching away from an executing thread only when it stalls for a significant number of clock cycles (typically a cache miss that requires an access to main memory). In these schemes, a few cycles may be required to switch threads. The SPARC64 VI core implements coarse-grained multithreading (which Fujitsu calls vertical multithreading or VMT).

Fine-grained multithreading involves running multiple "hot" threads in round-robin fashion, switching between runnable threads every clock cycle. To be effective, fine-grained multithreading requires no-overhead or "zero-cycle" thread switching. Niagara processors implement fine-grained multithreading.

Both coarse-grained and fine-grained multithreading require concurrent threads to execute alternately, with either few or no cycles required to switch between threads. Simultaneous multithreading (SMT), by contrast, enables a wide superscalar core to issue instructions from two (or more) threads in the same clock cycle. The POWER5 and POWER6 cores implement SMT. (See MPR 10/30/06-02, "POWER: The Sixth Generation.")

Where the first number represents the number of cores on die, and the second number represents the number of threads handled concurrently by each core, the chip multicore UltraSPARC IV processor can be characterized as a 2 × 1 design (with a total of two concurrent threads per processor chip). Current dual-core Opteron processors are another example of simple 2 × 1 multicore design. Next-generation quad-core Opterons will be 4 × 1 designs, or four-way multiprocessors on a chip.

Many of today's server processors combine chip multicore and core multithread strategies in the same design. Thus, SPARC64 VI and POWER6 both implement two dual-threaded cores per processor chip, or a 2 × 2 (four-thread) design. Evaluating these nominally similar designs, though, requires an understanding of their different core multithreading strategies. A 2 × 2-VMT design is not the same as a 2 × 2-SMT design. And both these 2 × 2 four-thread designs differ significantly in potential from a 4 × 1 four-thread design.

Implementing multiple multithreaded cores on a single die allows the number of threads handled concurrently by a processor chip to escalate rapidly. The current leader among server processors, in terms of sheer thread count, is Sun's 8 × 8 (64-thread) Niagara 2. Because the Niagara core itself is a simple scalar design (double-wide, in the case of Niagara 2), Niagara processors are also the run-away leaders among contemporary server processors in core efficiency. They are perhaps the first processors ever to achieve an average instructions-per-cycle (IPC) efficiency remotely close to their peak IPC. To distinguish the sort of radical multiplication of threads possible with many (eight or more) heavily threaded cores integrated into a single processor chip, Sun uses the term chip multithreading or CMT.
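In the sidebar's cores × threads notation, the designs mentioned here line up as follows (a simple illustration of the notation, nothing more):

```python
# The sidebar's m x n characterization: cores per die x threads per core.
designs = {
    "UltraSPARC IV": (2, 1, "chip multicore"),
    "dual-core Opteron": (2, 1, "chip multicore"),
    "SPARC64 VI": (2, 2, "coarse-grained / VMT"),
    "POWER6": (2, 2, "SMT"),
    "Niagara 2": (8, 8, "fine-grained CMT"),
}
for name, (cores, threads, style) in designs.items():
    print(f"{name}: {cores} x {threads} = {cores * threads} threads ({style})")
```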


Considering all the enhancements in Niagara 2, Sun's performance estimates are not surprising. What's interesting is the expected 40% improvement in single-thread performance—despite doubling the threads per core and the associated increase in pressure on shared memory resources. Assuming the frequency stays below 1.5GHz (an increase of less than 25%) and no great improvements in compiler efficiency, a 40% gain would indicate a significant boost in thread performance from the various microarchitectural and memory-hierarchy enhancements.

Price & Availability
Niagara 2 processors are expected to ship in systems in 2H07 (probably Q3) under the UltraSPARC T2 name. Sun has not yet indicated whether this design, like the UltraSPARC T1, will be placed in the OpenSPARC library available at www.opensparc.net.

In the year since Sun introduced the first Niagara 1 systems in December 2005, nobody else has introduced anything even remotely like Niagara. Sun reports strong sales of that product line—and not just to customers already in the installed SPARC/Solaris base. Assuming these early reports are indicative, Niagara 2 should not merely keep the momentum going, but accelerate it.

If Niagara processors are truly expanding the SPARC/Solaris universe, that's very good news for Sun and SPARC/Solaris customers in particular, and positive news in general for all vendors seeking to provide meaningful alternatives to ever more powerful 64-bit x86 processors from Intel and AMD. Not even Sun has been able to resist the price/performance benefits offered by 64-bit x86 systems, augmented by their advantages in software availability and vendor support.

Niagara 2 processors, like SPARC64 VI and POWER6 and Itanium 2 9000 processors, aim to show that even current dual-core and coming quad-core 64-bit x86 processors do not adequately address all computing needs. But other alternative high-end server processors are trying to attain levels of targeted performance, scalability, and reliability that commodity processors cannot economically reach. What sets Niagara apart is Sun's goal of expanding the computing spectrum with a new approach to performance efficiency.

SUBSCRIPTION INFORMATION

To subscribe to Microprocessor Report, contact our customer service department in Scottsdale, Arizona by phone, 480.483.4441; fax, 480.483.0400; email, [email protected]; or Web, www.MDRonline.com.

                                One year                         Two years
                                U.S. & Canada*    Elsewhere      U.S. & Canada*    Elsewhere
Web access only                 $895              $895           $1,495            $1,495
Hardcopy only                   $995              $1,095         $1,695            $1,895
Both hardcopy and Web access    $1,095            $1,195         $1,895            $1,995

*Sales tax applies in the following states: AL, AZ, CO, DC, GA, HI, ID, IN, IA, KS, KY, LA, MD, MO, NV, NM, RI, SC, SD, TN, UT, VT, WA, and WV. GST or HST tax applies in Canada.

© IN-STAT NOVEMBER 6, 2006 MICROPROCESSOR REPORT