Rock: A High-Performance Sparc CMT Processor

Rock, Sun's third-generation chip-multithreading processor, contains 16 high-performance cores, each of which can support two software threads. Rock uses a novel checkpoint-based architecture to support automatic hardware scouting under a load miss, speculative out-of-order retirement of instructions, and aggressive dynamic hardware parallelization of a sequential instruction stream. It is also the first processor to support transactional memory in hardware.

Shailender Chaudhry, Robert Cypher, Magnus Ekman, Martin Karlsson, Anders Landin, Sherman Yip, Håkan Zeffer, Marc Tremblay

Designing an aggressive chip-multithreaded (CMT) processor1 involves many tradeoffs. To maximize throughput performance, each processor core must be highly area and power efficient, so that many cores can coexist on a single die. Similarly, if the processor is to perform well on a wide spectrum of applications, per-thread performance must be high so that serial code also executes efficiently, minimizing performance problems related to Amdahl's law.2,3 Rock, Sun's third-generation chip-multithreading processor, uses a novel resource-sharing hierarchical structure tuned to maximize performance per area and power.4

Rock uses an innovative checkpoint-based architecture to implement highly aggressive speculation techniques. Each core has support for two threads, each with one checkpoint. The hardware support for threads can be used two different ways: a single core can run two application threads, giving each thread hardware support for execute ahead (EA) under cache misses; or all of a core's resources can adaptively be combined to run one application thread with even more aggressive simultaneous speculative threading (SST), which uses two checkpoints. EA is an area-efficient way of creating a large virtual issue window without the large associative structures. SST dynamically extracts parallelism, letting execution proceed in parallel at two different points in the program.

These schemes let Rock maximize throughput when the parallelism is available, and focus the hardware resources to boost per-thread performance when it is at a premium. Because these speculation techniques make Rock less sensitive to cache misses, we have been able to reduce on-chip cache sizes and instead add more cores to further increase performance. The high core-to-cache ratio requires high off-chip bandwidth. Rock-based systems use large caches in the off-chip memory controllers to maintain a relatively low access rate to DRAM; this maximizes performance and optimizes total system power.

To help the programmability of multiprocessor systems with hundreds of threads and to improve synchronization-related performance, Rock supports transactional

Published by the IEEE Computer Society. 0272-1732/09/$25.00 © 2009 IEEE

Authorized licensed use limited to: Politecnico di Milano. Downloaded on December 23, 2009 at 03:56 from IEEE Xplore. Restrictions apply.

Figure 1. Die photo of the Rock chip (a) and logical overview of a two-processor-chip, 64-thread Rock-based system (b).
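The execute-ahead idea introduced above hides cache misses by continuing past them, so that later misses are serviced under the shadow of the first. A toy cycle model can illustrate the effect; the numbers here (single-issue, a flat 100-cycle miss latency, unlimited speculation depth) are illustrative assumptions, not Rock's actual parameters:

```python
# Toy cycle model contrasting a stalling in-order core with execute ahead (EA).
# Assumed parameters (not Rock's): 1 cycle per instruction, 100-cycle misses,
# and no DQ or store-buffer limits on how far EA can run.
MISS_LATENCY = 100

def stalling_core(trace):
    """In-order core: each miss stalls the pipeline for the full latency."""
    cycles = 0
    for is_miss in trace:
        cycles += 1 + (MISS_LATENCY if is_miss else 0)
    return cycles

def ea_core(trace):
    """EA core: keeps issuing under a miss, so later misses overlap the
    first one; finish time is bounded by the last returning miss."""
    cycles = 0
    last_return = 0
    for is_miss in trace:
        cycles += 1                      # issue the instruction
        if is_miss:
            last_return = max(last_return, cycles + MISS_LATENCY)
    return max(cycles, last_return)

# Ten instructions with misses at positions 3 and 6:
trace = [False] * 3 + [True] + [False] * 2 + [True] + [False] * 3
# stalling_core(trace) -> 210 cycles; ea_core(trace) -> 107 cycles,
# because the second miss overlaps the first (memory-level parallelism).
```

The same overlap argument is why, as the introduction notes, the designers could trade on-chip cache for extra cores.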

memory5 in hardware; it is the first commercial processor to do so. TM lets a Rock thread efficiently execute all instructions in a transaction as an atomic unit, or not at all. Rock leverages the checkpointing mechanism with which it supports EA to implement the TM semantics with low additional hardware overhead.

System overview

Figure 1 shows a die photo of the Rock processor and a logical overview of a Rock-based system using two processor chips. Each Rock processor has 16 cores, each configurable to run one or two threads; thus, each chip can run up to 32 threads. The 16 cores are divided into core clusters, with four cores per cluster. This design enables resource sharing among cores within a cluster, thus reducing the area requirements. All cores in a cluster share an instruction fetch unit (IFU) that includes the level-one (L1) instruction cache. We decided that all four cores in a cluster should share the IFU because it is relatively simple to fetch a large number of instructions per cycle. Thus, four cores can share one IFU in a round-robin fashion while maintaining full fetch bandwidth. Furthermore, the shared instruction cache enables constructive sharing of common code, as is encountered in shared libraries and operating system routines.

Each core cluster contains two L1 data caches (D cache) and two floating-point units (FGU), each of which is shared by a pair of cores. As Figure 1a shows, these structures are relatively large, so sharing them provides significant area savings. A crossbar connects the four core clusters with the four level-two banks (L2 cache), the second-level memory-management unit (MMU), and the system interface unit (SIU). Rock's L2 cache is 2 Mbytes, consisting of four 512-Kbyte 8-way set-associative banks. A relatively large fraction of the die area is devoted to core clusters as opposed to L2 cache. We selected this design point after simulating designs with different ratios of core clusters to L2 cache. It provides the highest

MARCH/APRIL 2009 7

HOT CHIPS 20

throughput of the different options and yields good single-thread performance.

Each Rock chip is directly connected to an I/O bridge ASIC and to multiple memory controller ASICs via high-bandwidth serializer/deserializer (SerDes) links. This system topology provides truly uniform access to all memory locations. The memory controllers implement a directory-based MESI coherence protocol (which uses the states modified, exclusive, shared, and invalid) supporting multiprocessor configurations. We decided not to support the owned cache state (O) because there are no direct links between processors, and so accesses to a cache line in the O state in another processor would require a four-hop transaction, which would increase the latency and the link bandwidth requirements.

To further reduce link bandwidth requirements, Rock systems use hardware to compress packets sent over the links between the processors and the memory controllers. Each memory controller has an 8-Mbyte level-three cache, with all the L3 caches in the system forming a single large, globally shared cache. This L3 cache reduces both the latency of memory requests and the DRAM bandwidth requirements. Each memory controller has 4 + 1 fully buffered dual in-line memory module (FB-DIMM) memory channels, with the fifth of these used for redundancy.

Pipeline overview

A strand consists of the hardware resources needed to execute a software thread. Thus, in a Rock system, eight strands share a fetch unit, which uses a round-robin scheme to determine which strand gets access to the instruction cache each cycle. The 32-Kbyte shared instruction cache is four-way set associative, and virtual-to-physical translation is provided by a 64-entry, four-way-associative instruction translation look-aside buffer (ITLB). Because fetch logic in the IFU keeps track of when a fetch request can cross a page boundary, most fetch requests can bypass the ITLB access and thereby save ITLB power and fetch latency. Each strand has a sequential next-cache-line prefetch mechanism that automatically requests the next cache line, thereby exploiting the great spatial locality typical in instruction data. The fetch unit fetches a full cache line of 16 instructions and forwards it to a strand-specific fetch buffer. Rock's branch predictor is a 60-Kbit (30K two-bit saturating counters) gshare predictor, enhanced to take advantage of the software prediction bit provided in Sparc V9 branches. In each cycle, two program-counter-relative (PC-relative) branches are predicted. Indirect branches are predicted by either a 128-entry branch-target buffer or a 128-entry return-address predictor.

As Figure 2 shows, the fetch stages are connected to the decode pipeline through an instruction buffer. The decode unit, which serves one strand per cycle, has a helper ROM that it uses to decompose complex instructions into sequences of simpler instructions. Once decoded, instructions proceed to the appropriate strand-specific instruction queue. A round-robin arbiter decides which strand will have access to the issue pipeline the next cycle. On every cycle, up to four instructions can be steered from the instruction queue to the issue first-in, first-out buffers (FIFOs). There are five issue FIFOs (labeled A0, A1, BR, FP, and MEM in Figure 2); each is three entries deep and directly connected to its respective execution pipe. An issue FIFO will stall an instruction if there is a read-after-write (RAW) hazard. However, such a stall does not necessarily affect instructions in the other issue FIFOs. Independent younger instructions in the other issue FIFOs can proceed to the execution unit out of order. Instructions that can't complete their execution go into the deferred instruction queue (DQ) for later reexecution. The DQ is an SRAM-based structure with only a single read/write port, so it is area and power efficient.

A two-level register file provides register operands. The first-level register file is a highly ported structure to support the source operand requirements of the maximum issue width.6 It contains one register window, whereas the second-level register file contains all of the eight register windows that Rock supports. The second-level register file has a single read/write port and is made out of SRAM. For each architectural register in


Figure 2. Pipeline overview showing five execution units (ALU, ALU, BR, FP pipeline, MEM pipeline), the integer architectural register file (IARF), the integer working register file (IWRF), the floating-point architectural register file (FARF), and the floating-point working register file (FWRF).
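The issue discipline described above (strictly in order within each FIFO, out of order across the five FIFOs, with a RAW hazard stalling only its own FIFO) can be sketched behaviorally. The instruction encoding, the initial ready set, and the single-cycle result latency are simplifications for illustration, not Rock's implementation:

```python
# Sketch of per-FIFO in-order, cross-FIFO out-of-order issue across the
# five issue FIFOs (A0, A1, BR, FP, MEM). Simplified model: one result
# per instruction, results usable the cycle after issue, no deferral.
from collections import deque

def simulate(instrs, ready):
    """instrs: (name, fifo, source_regs, dest_reg) in program order.
    ready: registers whose values are initially available.
    Returns instruction names in the order they issue."""
    fifos = ("A0", "A1", "BR", "FP", "MEM")
    queues = {f: deque() for f in fifos}
    for ins in instrs:
        queues[ins[1]].append(ins)
    ready = set(ready)
    order = []
    while any(queues.values()):
        issued = []
        for f in fifos:
            q = queues[f]
            # Only each FIFO's head may issue; a RAW hazard stalls that
            # FIFO alone, and younger instructions elsewhere proceed.
            if q and all(s in ready for s in q[0][2]):
                issued.append(q.popleft())
        if not issued:
            raise RuntimeError("unsatisfiable source operands")
        order.extend(ins[0] for ins in issued)
        ready.update(ins[3] for ins in issued)  # results ready next cycle
    return order

prog = [
    ("ld1",  "MEM", ("r1",), "r2"),   # load into r2
    ("add1", "A0",  ("r2",), "r3"),   # RAW on r2: stalls the A0 FIFO
    ("sub1", "A1",  ("r4",), "r5"),   # independent: issues immediately
]
# simulate(prog, {"r1", "r4"}) -> ["sub1", "ld1", "add1"]:
# add1 waits a cycle for the load, but the younger sub1 is not held up.
```

The key property the sketch captures is that a stalled FIFO head blocks only the instructions queued behind it, exactly the behavior the text attributes to the five issue FIFOs.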

the second-level register file, there is both a main and a speculative copy to support checkpointing and speculation. Although the second-level register file holds about 16 times as much data as the first-level register file, because it is constructed of SRAM, the two structures are almost the same size. To keep track of which register copy to write to (main or speculative), each architectural register has a one-bit pointer associated with it.

The execution unit, which can execute up to five instructions per cycle, consists of five pipelines: simple ALU (A0), complex ALU (A1), branch (BR), floating-point (FP), and memory (MEM). The complex ALU provides all the functionality of the simple ALU and can also handle multicycle instructions such as leading-zero detect. The memory pipeline accepts one load or store instruction per cycle.

Two cores share each 32-Kbyte four-way set-associative data cache. The data cache is dual ported, so both cores sharing it can access it in each cycle. The two cores accessing the same data cache each have a private tag array, which enables a three-cycle load-use latency. The data cache is accompanied by a 32-entry, two-way skewed-associative way-predicted micro data translation look-aside buffer (micro-DTLB) with full least recently used (LRU) replacement. The way-predictor has a conflict avoidance mechanism that removes conflicts by duplicating entries. Stores are handled by a 32-entry store buffer in each core. The store buffer, which is two-way banked for timing reasons, supports store merging. Stores that hit in the data cache update both the data cache and the L2 cache. Stores that miss in the data cache update the L2 cache and do not allocate in the data cache. To increase memory-level parallelism, stores generate prefetch requests to the L2 and can update the L2 out of program order while maintaining the total store order (TSO)7 memory model.

To conserve area, two cores share each floating-point unit in Rock. A floating-point unit contains a floating-point multiply-accumulate unit and a graphics and logical instructions unit that executes the Sparc VIS floating-point instructions. The unit



computes divides and square roots through the iterative Newton-Raphson method. Rock implements new instructions to allow register values to be moved between integer and floating-point registers. The floating-point unit further implements a new Sparc V9 instruction that produces random values with high statistical entropy.

In the pipeline's back end, the commit unit maintains architectural state and controls some of the checkpointed state required for speculation. The commit unit can retire up to four instructions per cycle while the pipeline is not speculating. An unlimited number of instructions can be committed upon successful completion of speculation.

Checkpoint-based architecture

Rock implements a checkpoint-based architecture,8-10 in which a checkpoint of the architectural state can be taken on an arbitrary instruction. Once a strand takes a checkpoint, it operates in a speculative mode, in which each write to a register updates a "shadow" speculative copy of the register. If the speculation succeeds, the strand discards the checkpoint, and the speculative copies become the new architectural copies; we call this operation a join. If the speculation fails, the strand discards speculative copies of registers and resumes execution from the checkpoint. In addition to enabling EA, SST, and TM, the checkpoint mechanism simplifies trap processing and increases the issue rate by relaxing the issue rules.

Execute ahead

When a Rock system strand encounters a long-latency instruction, it takes a checkpoint and initiates an EA phase. A data cache miss, a micro-DTLB miss, and a divide are examples of long-latency events triggering an EA phase. The destination register of the long-latency instruction is marked as not available (NA) in a dedicated NA scoreboard, and the long-latency instruction goes into the DQ for later reexecution. For every subsequent instruction that has at least one source operand that is NA, its destination register is marked as NA and the instruction goes into the DQ. Instructions for which none of the operands is NA are executed, their results are written to the speculative copy of the destination register, and the NA bit of the destination register is cleared. When the long-latency operation completes, the strand transitions from the EA phase to a replay phase, in which the deferred instructions are read out of the DQ and reexecuted.

In Figure 3a, the strand executes in a nonspeculative phase until instruction 1 experiences a data cache miss, at which point the strand takes a checkpoint, defers instruction 1 to the DQ, and marks destination register r8 as NA. When a subsequent instruction that uses r8 as a source register (instruction 2) enters the pipeline, it is also deferred because of the NA bit for r8. As a result, its destination register, r10, is also marked as NA. An independent instruction (3) executes and writes its result into the speculative copy of the register. Because source register r10 for instruction 4 has been marked NA, that instruction also goes to the DQ.

When instruction 1's data is returned, the strand transitions from an EA to a replay phase. The instruction queue stops accepting instructions from the instruction buffer and starts accepting instructions from the DQ. Because instruction 1 is now a cache hit, r8 is no longer marked NA. The chain of dependent instructions (2 and 4) can also execute and write the results to the speculative copies of the registers. With all deferred instructions successfully reexecuted, the strand initiates a join operation to promote the speculative values as the new architectural state. The join operation merely involves manipulating the one-bit pointers that indicate which of the two register copies holds the architectural state values. It is thus a fast and power-efficient operation. The store buffer is notified that the EA phase has joined, and the buffered stores (if any) are allowed to start draining to the L2 cache. The join also notifies the instruction queue to resume accepting instructions from the instruction buffer, and the strand begins a nonspeculative phase.

In the EA process, only the long-latency instruction and its dependents are stored in the DQ; independent instructions update the speculative copies of their destination registers, and these instructions are speculatively retired. As a result, independent

[Figure 3 content. Both panels trace the same instruction sequence:
#1 ld r7 -> r8 (miss; checkpoint, deferred)
#2 add r8,r9 -> r10 (deferred)
#3 sub r4,r5 -> r6
#4 or r10,r11 -> r12 (deferred)
-- data return for #1; replay of #1, #2, #4; join --
#5 and r1,r2 -> r3
#6 sub r11,1 -> r11
#7 ld r3,r4 -> r5
#8 add r5,r6 -> r7
#9 mov r5 -> r6
In (a), instructions 5 through 9 run in a nonspeculative phase after the join; in (b), a second checkpoint is taken and instructions 5 through 9 run as a new EA phase in parallel with the replay phase.]

Figure 3. Examples of execute ahead (a) and simultaneous speculative threading (b). The nonspeculative phase in the EA example (a) is replaced in the SST example (b) by an EA phase that executes in parallel with the replay phase.
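The NA-scoreboard bookkeeping that produces Figure 3a's behavior can be sketched in a few lines. This is a behavioral illustration only, with an invented encoding (register-name strings, and long-latency instructions identified by their destination registers rather than by real pipeline state):

```python
# Sketch of EA-phase deferral: the missing load and everything transitively
# dependent on it go to the deferred queue (DQ); independent instructions
# retire speculatively. Not RTL; encoding and interface are invented.

def ea_phase(instrs, miss_dsts):
    """instrs: list of (number, source_regs, dest_reg) in program order.
    miss_dsts: destination registers of long-latency instructions
    (e.g., cache-missing loads). Returns (executed, deferred) numbers."""
    na = set()                      # NA scoreboard: not-available registers
    executed, deferred = [], []
    for num, srcs, dst in instrs:
        if dst in miss_dsts or any(s in na for s in srcs):
            deferred.append(num)    # goes to the DQ for later replay
            na.add(dst)             # its result is not available either
        else:
            executed.append(num)    # speculatively retired
            na.discard(dst)         # overwriting a register clears its NA bit
    return executed, deferred

# Figure 3a's sequence: #1 ld r7 -> r8 (miss), #2 add r8,r9 -> r10,
# #3 sub r4,r5 -> r6, #4 or r10,r11 -> r12.
seq = [(1, ("r7",), "r8"),
       (2, ("r8", "r9"), "r10"),
       (3, ("r4", "r5"), "r6"),
       (4, ("r10", "r11"), "r12")]
done, dq = ea_phase(seq, miss_dsts={"r8"})
# done == [3], dq == [1, 2, 4]: only the independent sub executes;
# the load and its dependents wait in the DQ for the replay phase.
```

Replaying then amounts to running `ea_phase` again over the DQ contents once the miss data has returned, which is why unresolved misses during replay simply send instructions back to the DQ for a further replay phase.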

instructions require no additional architectural resources, so there is no strict architectural limit on the number of independent instructions executed. Therefore, with a relatively small DQ, an EA phase can have more than a thousand instructions in flight.11

In the example in Figure 3a, all instructions that went into the DQ can execute successfully in a single replay phase. However, if additional cache misses or other long-latency instructions arise in the EA phase, some instructions might have one or more NA source registers when they reexecute in the replay phase. In this case, these instructions will return to the DQ, and additional replay phases will run.

If an overflow of the store buffer or DQ occurs during an EA phase, the EA phase continues execution until the long-latency instruction completes. At this point, the EA phase fails and execution resumes from the checkpoint. In this case, the EA phase execution acts as a hardware scout thread that can encounter additional cache misses, thus providing memory-level parallelism.12,13 In addition, the branch predictors are updated so as to reduce branch mispredictions when execution returns to the checkpoint. (Chaudhry et al. have described the performance gains from the hardware scout feature.14)

Because the EA and replay phases execute instructions out of program order, they must address several register hazards. In particular, write-after-read hazards could occur in which an older instruction reexecuting in a replay phase reads a register that was written by a younger instruction in the preceding EA phase. To prevent such hazards (and similar hazards between successive replay phases), whenever an instruction goes into the DQ, all of the instruction's operands that are



available (not NA) go into the DQ with the instruction.

In addition, write-after-write register hazards could exist in which an older instruction that is reexecuted in a replay phase overwrites a register written by a younger instruction in the preceding EA phase. To prevent this type of hazard (and similar hazards between successive replay phases), older instructions are prevented from writing the speculative copy of a register that has already been written by a younger instruction (although all instructions are allowed to write the copy of their destination register in the first-level register file to provide data forwarding).

The EA and replay phases can speculatively execute loads out of program order. However, to support the TSO memory model, it is essential that no other thread can detect that the loads executed out of order. Thus, a bit per cache line per strand, called an s-bit, is added to the data cache. Every load in an EA or replay phase sets the appropriate s-bit. If the cache line containing a set s-bit is invalidated or evicted, the speculation fails and execution resumes from the checkpoint. Once an EA phase joins or fails, the strand's s-bits are cleared.

Simultaneous speculative threading

Rock's ability to dedicate the resources of two strands to a single software thread enables SST, a more aggressive form of speculation. In Figure 3a, when the long-latency instruction (1) completes, the strand stops fetching new instructions and instead replays the instructions from the DQ. In contrast, with SST, two instruction queues are dedicated to the single software thread, so one instruction queue continues to receive new instructions from the fetch unit while the other replays instructions from the DQ. As a result, with SST it is possible to start a new EA phase in parallel with execution of the replay phase.

Figure 3b shows an example of SST: When the data returns for instruction 1, one strand begins a replay phase to complete the deferred instructions 1, 2, and 4. In parallel, the other strand takes an additional checkpoint and begins a new EA phase in which it executes new instructions 5 through 9.

In Figure 3b, the second EA phase completed successfully without deferring any of the instructions (5 through 9) to the DQ. However, if one or more of these instructions were deferred (for example, because of a cache miss within this EA phase or a dependence on a cache miss from the earlier EA phase), the strand that performed the first replay phase would then start a new replay phase to complete these deferred instructions. In this manner, it is possible to repeatedly initiate new EA phases with one strand, while performing replay phases for deferred instructions with the other strand.

In addition to enabling the parallel execution of an EA phase and a replay phase, SST reduces the EA phase failures due to resource limitations. Specifically, if the store queue or DQ is nearly full during a first EA phase, the core takes a second checkpoint and a new EA phase begins. In this manner, the first EA phase can complete successfully even if the store queue or DQ overflows during the second EA phase.

In terms of on-chip real estate, no extra state is added to support SST. The bookkeeping structures needed for two-thread EA are simply reused differently to allow for SST. Hence, SST requires virtually no extra area.

Transactional memory

To support TM, Rock uses two new instructions that have been added to the Sparc instruction set: checkpoint and commit. The checkpoint instruction denotes the beginning of a transaction, whereas commit denotes the end of the transaction. The checkpoint instruction has one parameter, called the fail-pc. If the transaction fails for any reason, the instructions within the transaction have no effect on the architectural state (with the exception of nontransactional stores, described later), and the program continues execution from the fail-pc. If the transaction succeeds, the architectural state includes the effects of all instructions within the transaction, the loads and stores within the transaction are atomic within the logical memory order, and execution continues after the commit instruction.

Rock's implementation of TM relies heavily on the same mechanisms that enable

EA operation. In particular, a strand views the checkpoint instruction as a long-latency instruction (analogous to a load miss) that checkpoints the register state and the fail-pc. If the transaction fails, the strand discards the speculative register updates performed within the transaction, restores the checkpointed state, and continues execution from the fail-pc. If the transaction succeeds, the speculative register updates are committed to architectural state, and the checkpoint is discarded.

Loads within the transaction are executed speculatively and thus set s-bits on the cache lines that are read. The invalidation or replacement of a line with the thread's s-bit set prior to committing the transaction will fail the transaction. As a result, the same mechanism that allows reordering of speculative loads during EA lets the strand make transactional loads appear atomic. Stores within the transaction are placed in the store queue in program order. The addresses of stores are sent to the L2 cache, which then tracks conflicts with loads or stores from other threads. If the L2 cache detects such a conflict, it reports the conflict to the core, which then fails the transaction. When the commit instruction executes, the L2 locks the lines being written by the transaction. Locked lines cannot be read or written by any other threads. This is the point at which other threads view the transaction's loads and stores as being performed, thus guaranteeing atomicity. The stores then drain from the store queue and update the lines, with the last store to each line releasing the lock on that line. The support for locking lines stored to by a committed transaction is the primary hardware mechanism that we added to Rock to implement TM.

Rock's TM support primarily targets efficient execution of moderately sized transactions that fit within the hardware resources. As an example, all stores in a transaction are buffered in the store queue, so the store queue size places a limit on the size of a successful transaction.

Along with the checkpoint and commit instructions, Rock provides additional registers and instructions to aid in debugging and to provide greater control over a transaction's execution. Specifically, Rock provides a checkpoint status register that can be read at the fail-pc to determine the cause of a failed transaction. It also provides nontransactional loads and stores.

Nontransactional loads are logically not part of the transaction; they do not cause failure when another thread stores to the location that was read nontransactionally. As a result, nontransactional loads can be used to synchronize the committing of a transaction. For example, it is possible to use nontransactional loads to repeatedly read a flag variable until it has been set by another thread.

Nontransactional stores, similarly, are logically not part of the transaction; they update memory even if the transaction fails. Therefore, it is possible to use nontransactional stores to determine what fraction of a transaction executed prior to a failure, thus aiding debugging.

Figure 4. Single-thread performance improvements over a stalling in-order processor achieved by the speculative techniques execute ahead (EA) and simultaneous speculative threading (SST).

Performance results

Figure 4 shows simulation results for how the speculation techniques we've described improve single-thread performance. We validated the simulator against hardware. The results, for the benchmarks TPC-E, SPECjbb, SPECint2006, and SPECfp2006, are normalized to a stalling


in-order processor. All of the benchmarks gain significant performance from EA, ranging from 18 percent for SPECint to 35 percent for SPECfp. The two commercial workloads fall in between, with improvements of 20 and 30 percent.

The performance gains due to EA come from the combination of two effects. One is the increased memory-level parallelism that comes from uncovering more cache misses under the original miss. This effect is most pronounced in the two commercial benchmarks (TPC-E and SPECjbb) because of their large working sets and resulting high miss rate in the on-chip caches. The other effect is the overlapping of cache misses with the execution of independent instructions, which is most important in SPECint and SPECfp because of their relatively low miss rate in the shared L2 cache.

In Figure 4, the rightmost bar for each benchmark indicates the performance gain from SST. The main advantage of this technique is that a new EA phase can execute in parallel with the replay phase. As with EA, the gains from SST are most pronounced in the low-miss-rate benchmarks.

Per-thread performance also depends on how heavily the chip resources are shared with other threads. The miss rate increases as more threads share caches, and as a result Rock's speculation techniques become more important. To obtain the results shown in Figure 5, we placed the threads on the chip so as to minimize interference between threads. Four or fewer active threads run in different core clusters, and the only shared resource is the L2 cache. When eight threads are active, threads also share an instruction cache and an instruction fetch unit. When 16 threads are active, all cores are used, and hence threads also share a data cache. When we enabled all 32 threads, two threads share each core.

Figure 5 shows the throughput at various thread counts, with performance normalized to that of a single EA thread (with no other threads running on the chip). Despite the sharing of multiple resources, the chip throughput increases up to 22 as the thread count increases from 1 to 32. Thus, the per-thread performance remains high.

Figure 5. Rock chip throughput (for TPC-C, TPC-E, SAP 5.0, and SPECjbb2005) at thread counts from 1 to 32, with performance normalized to that of a single EA thread.

Rock is the first commercial processor to use a checkpoint-based architecture to enable speculative, out-of-order retirement of instructions, and the first commercial processor to implement hardware TM. These novel features raised many unique challenges during the processor's design and verification.

To accommodate Rock's checkpointing and speculation, we made several changes to the Sparc verification infrastructure. We verified the processor's operation using a Sparc functional simulator as a reference model. Unlike other processors, Rock speculatively retires instructions out of order, and then either performs a join or fails speculation. As a result, the intermediate results produced by the speculatively retired instructions could not be directly verified by single-stepping the reference model. Instead, we

had to step the reference model tens or hundreds of times to capture the state after a join, and only at this point could we compare the architectural state to the reference model.

The aggressive speculative design complicated not only correctness debugging, but also performance debugging, because slight changes to the implementation could cause speculation to succeed or fail, thus significantly impacting performance. As a result, accurately modeling the speculation failure conditions in the performance model was essential. Furthermore, failed speculation also complicated correctness debugging by hiding the effects of rare, corner-case bugs that appeared in EA phases, which often failed.

Finally, to verify the TM implementation, we implemented a logical reference memory model that tracked the values of loads and stores from multiple threads. We used this logical reference model to verify the atomicity of the loads and stores within a transaction by applying all of them to the reference model when the transaction logically commits. The greatest challenge was identifying precisely when the transaction logically commits with respect to loads and stores from other threads.

With its novel organization of on-chip resources and key speculation mechanisms, Rock delivers leading-edge throughput performance in combination with high per-thread performance, with a good performance/power ratio. Its hardware implementation of TM will help programmability of multiprocessor systems and improve synchronization-related performance. MICRO

Acknowledgments
It is an honor to present this work on behalf of the talented Rock team. Rock would not have been possible without their devotion and hard work.

References
1. P. Kongetira, K. Aingaran, and K. Olukotun, "Niagara: A 32-Way Multithreaded Sparc Processor," IEEE Micro, vol. 25, no. 2, 2005, pp. 21-29.
2. G.M. Amdahl, "Validity of the Single Processor Approach to Achieving Large Scale Computing Capabilities," Proc. AFIPS Conf., AFIPS Press, 1967, pp. 483-485.
3. M.D. Hill and M.R. Marty, "Amdahl's Law in the Multicore Era," Computer, vol. 41, no. 7, 2008, pp. 33-38.
4. M. Tremblay and S. Chaudhry, "A Third-Generation 65nm 16-Core 32-Thread Plus 32-Scout-Thread SPARC Processor," Proc. Int'l Solid-State Circuits Conf. Digest of Technical Papers (ISSCC 08), IEEE Press, 2008, pp. 82-83.
5. M. Herlihy and J.E.B. Moss, "Transactional Memory: Architectural Support for Lock-Free Data Structures," Proc. IEEE/ACM Int'l Symp. Computer Architecture (ISCA 93), ACM Press, 1993, pp. 289-300.
6. M. Tremblay, B. Joy, and K. Shin, "A Three Dimensional Register File for Superscalar Processors," Proc. Hawaii Int'l Conf. System Sciences (HICSS 95), IEEE CS Press, 1995, pp. 191-201.
7. SPARC Int'l, The SPARC Architecture Manual (version 9), Prentice-Hall, 1994.
8. H. Akkary, R. Rajwar, and S.T. Srinivasan, "Checkpoint Processing and Recovery: Towards Scalable Large Instruction Window Processors," Proc. IEEE/ACM Int'l Symp. Microarchitecture (MICRO 03), IEEE CS Press, 2003, pp. 423-434.
9. W.W. Hwu and Y.N. Patt, "Checkpoint Repair for Out-of-Order Execution Machines," Proc. IEEE/ACM Int'l Symp. Computer Architecture (ISCA 87), ACM Press, 1987, pp. 18-26.
10. S.T. Srinivasan et al., "Continual Flow Pipelines: Achieving Resource-Efficient Latency Tolerance," IEEE Micro, vol. 24, no. 6, 2004, pp. 62-73.
11. A. Cristal et al., "Out-of-Order Commit Processors," Proc. IEEE Int'l Symp. High-Performance Computer Architecture (HPCA 04), IEEE CS Press, 2004, pp. 48-59.
12. J. Dundas and T. Mudge, "Improving Data Cache Performance by Pre-Executing Instructions Under a Cache Miss," Proc. Int'l Conf. Supercomputing (ICS 97), ACM Press, 1997, pp. 68-75.
13. O. Mutlu et al., "Runahead Execution: An Alternative to Very Large Instruction Windows for Out-of-Order Processors," Proc. IEEE Int'l Symp. High-Performance Computer Architecture (HPCA 03), IEEE CS Press, 2003, pp. 129-140.
14. S. Chaudhry et al., "High-Performance Throughput Computing," IEEE Micro, vol. 25, no. 3, 2005, pp. 32-45.
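The logical reference memory model used for TM verification can be sketched in a few lines. The following Python is an illustrative reconstruction only, not Sun's actual verification code; the class and method names (LogicalReferenceMemory, Transaction, commit) are invented for this example. The key idea from the text is that nothing touches the golden memory image until a transaction logically commits; at that point, every load the hardware observed is checked against the reference, and all stores are applied as one atomic step.

```python
class LogicalReferenceMemory:
    """Golden memory image, updated only at transaction commit points."""

    def __init__(self):
        self.mem = {}  # address -> value; unwritten locations read as 0

    def load(self, addr):
        return self.mem.get(addr, 0)

    def store(self, addr, value):
        self.mem[addr] = value


class Transaction:
    """Records the loads and stores one thread performed speculatively.

    The reference memory is untouched until commit(), mirroring the idea
    that a transaction's effects become visible atomically at its
    logical commit point.
    """

    def __init__(self, tid):
        self.tid = tid
        self.loads = []   # (addr, value the device under test observed)
        self.stores = []  # (addr, value)

    def record_load(self, addr, observed):
        self.loads.append((addr, observed))

    def record_store(self, addr, value):
        self.stores.append((addr, value))

    def commit(self, ref):
        # Check every recorded load against the reference memory...
        for addr, observed in self.loads:
            expected = ref.load(addr)
            if observed != expected:
                raise AssertionError(
                    f"T{self.tid}: load @{addr:#x} saw {observed}, "
                    f"reference holds {expected}")
        # ...then apply all stores as a single atomic step.
        for addr, value in self.stores:
            ref.store(addr, value)


if __name__ == "__main__":
    ref = LogicalReferenceMemory()
    t1 = Transaction(1)
    t1.record_store(0x40, 7)
    t1.commit(ref)
    t2 = Transaction(2)
    t2.record_load(0x40, 7)   # consistent with T1 having committed first
    t2.record_store(0x40, 8)
    t2.commit(ref)
    assert ref.load(0x40) == 8
```

As the text notes, the hard part this sketch glosses over is choosing the order in which commit() is invoked: deciding where each transaction logically commits relative to loads and stores from other threads is exactly what made the real verification difficult.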

IEEE MICRO, MARCH/APRIL 2009

HOT CHIPS 20

Shailender Chaudhry is a distinguished engineer at Sun Microsystems, where he is chief architect of the Rock processor. His research interests include processor architecture, transactional memory, algorithms, and protocols. Chaudhry has an MS in computer engineering from Syracuse University.

Robert Cypher is a distinguished engineer and a chief system architect at Sun Microsystems. His research interests include parallel computer architecture, processor architecture, transactional memory, and parallel programming. He has a PhD in computer science from the University of Washington.

Magnus Ekman is a staff engineer at Sun Microsystems, where he is a performance architect. His research interests include processor architecture and memory system design. Ekman has a PhD in computer engineering from Chalmers University of Technology. He also has an MS in financial economics from Gothenburg University.

Martin Karlsson is a staff engineer at Sun Microsystems, where he is a core architect working closely with the design team on Rock's microarchitecture. His research interests include transactional memory, instruction scheduling, and checkpoint-based architectures. He has a PhD in computer science from Uppsala University.

Anders Landin is a distinguished engineer and a chief system architect at Sun Microsystems. His research interests include multiprocessor system architecture, processor architecture, and application and system performance modeling. He has an MS in computer science and engineering from Lund University, Sweden.

Sherman Yip is a staff engineer at Sun Microsystems, where he is a core architect working closely with the design team on Rock's microarchitecture. His research interests include evaluating new processor features and efficient implementation of architectures that use deep speculation. Yip has a BS in computer science from the University of California at Davis.

Håkan Zeffer is a staff engineer at Sun Microsystems, where he is a system and performance architect. His research interests include parallel computer architecture, coherence, memory consistency, and system software. He has a PhD in computer science from Uppsala University.

Marc Tremblay is a Sun Fellow, senior vice president, and chief technology officer for Sun's Microelectronics Group. In this role, Tremblay sets future directions for Sun's processor and system roadmap. His mission has been to move the entire Sparc processor product line to the throughput computing paradigm, incorporating techniques he has helped develop over the years, including multicores, multithreading, speculative multithreading, scout threading, and transactional memory.

Direct questions and comments about this article to Martin Karlsson, Sun Microsystems, 4180 Network Circle, Mailstop SCA18-211, Santa Clara, CA 95054; [email protected].

For more information on this or any other computing topic, please visit our Digital Library at http://computer.org/csdl.
