
THE IBM BLUE GENE/Q COMPUTE CHIP

Ruud A. Haring, Martin Ohmacht, Thomas W. Fox, Michael K. Gschwind, David L. Satterfield, Krishnan Sugavanam, Paul W. Coteus, Philip Heidelberger, Matthias A. Blumrich, Robert W. Wisniewski, Alan Gara, and George Liang-Tai Chiu, IBM T.J. Watson Research Center; Peter A. Boyle, University of Edinburgh; Norman H. Christ, Columbia University; Changhoan Kim, RIKEN BNL Research Center

BLUE GENE/Q AIMS TO BUILD A MASSIVELY PARALLEL HIGH-PERFORMANCE COMPUTING SYSTEM OUT OF POWER-EFFICIENT PROCESSOR CHIPS, RESULTING IN POWER-EFFICIENT, COST-EFFICIENT, AND FLOOR-SPACE-EFFICIENT SYSTEMS. FOCUSING ON RELIABILITY DURING DESIGN HELPS WITH SCALING TO LARGE SYSTEMS AND LOWERS THE TOTAL COST OF OWNERSHIP. THIS ARTICLE EXAMINES THE ARCHITECTURE AND DESIGN OF THE COMPUTE CHIP, WHICH COMBINES PROCESSORS, MEMORY, AND COMMUNICATION FUNCTIONS ON A SINGLE CHIP.

Like its previous generations, Blue Gene/L (BG/L)1 and Blue Gene/P (BG/P),2 the Blue Gene/Q (BG/Q) system is optimized for price performance, energy efficiency, and reliability in large-scale scientific applications. The system also aims to expand into applications with commercial and industrial impact, such as analytics. For the heart of the system, the Blue Gene/Q Compute (BQC) chip, this translates into design requirements to optimize power efficiency in terms of floating-point operations per second per Watt (flops/W); to optimize the chip's networking and connectivity features; and to optimize reliability features such as redundancy, the usage of error correction and detection, and the minimization of sensitivity to soft errors.

The BQC chip is an 18.96 × 18.96 mm chip in IBM's Cu-45HP (45-nm silicon on insulator [SOI]) application-specific integrated circuit (ASIC) technology, and it contains about 1.47 × 10^9 transistors. The design integrates processors, a memory subsystem, and a networking subsystem on a single chip. A die photograph of the chip shows 18 processor units (PU00 to PU17) surrounding a large Level-2 (L2) cache that occupies the center of the chip (see Figure 1). The processor units are based on the PowerPC A2 processor core, with BG/Q-specific additions: a single-instruction, multiple-data (SIMD) quad floating-point unit; a Level-1 (L1) prefetching unit; and a wake-up unit.

The processor units function as 16 user processors plus one processor for operating system services. A redundant processor increases manufacturing yield. The processor units connect to the L2 cache via a crossbar switch in the middle of the chip.

The 32-Mbyte L2 cache is built from 16 2-Mbyte slices (L200 to L215), using embedded DRAM (eDRAM) for dense, power-efficient, and high-bandwidth storage. The L2 cache is multiversioned to support speculative execution and transactional memory. In addition, the L2 cache supports atomic operations. L2 cache misses are handled by dual on-chip memory controllers (MC0 and MC1) that drive the double data rate type three (DDR3) integrated I/O blocks (PHYs) located on the left and right edges of the chip. The network and messaging unit occupies the chip's south end.



Figure 1. A die photograph of the Blue Gene/Q Compute (BQC) chip. The photograph shows a large Level-2 (L2) cache in the center of the chip, surrounded by 18 processor units based on the PowerPC A2 processor core. Integrated DDR3 memory controllers are on the left and right sides, and the chip-to-chip communication (network) logic is at the bottom side of the chip.

BQC chip design
The BQC chip is optimized for floating-point performance, power efficiency, chip-to-chip communication bandwidth, and reliability. The multithreaded, multicore chip design contains several innovative hardware features that support efficient software execution.

Processor cores
The BQC chip's processor core is an augmented version of the A2 processor core used on the IBM PowerEN chip.3 The A2 processor core implements the 64-bit Power instruction set architecture (Power ISA) and is optimized for aggregate throughput. The A2 is four-way simultaneously multithreaded and supports two-way concurrent instruction issue: one integer, branch, or load/store instruction and one floating-point instruction. Within a thread, dispatch, execution, and completion are in order.

The A2 core's L1 data cache is 16 Kbytes, eight-way set associative, with 64-byte lines. The 32-byte-wide load/store interface has sufficient bandwidth to support the quad floating-point unit. The L1 instruction cache is 16 Kbytes, four-way set associative.



We resynthesized the processor core to run at 1.6 GHz at a reduced voltage (0.8 V nominal) versus the original A2 design point. The lower-voltage operation reduces both active and leakage power.

Quad floating-point processing unit
The Quad Processing eXtension (QPX) to the Power ISA is an architectural definition and set of instructions for floating-point processing in a microprocessor core, developed specifically for BG/Q. The BQC chip's Quad floating-point Processing Unit (QPU) implements the QPX instruction set, as well as the scalar floating-point instructions from the Power ISA. The QPU replaces the A2 core's scalar floating-point unit as used on the PowerEN chip.

QPX instruction set architecture. The QPX architecture's computational model is a vector SIMD model, with four execution slots and a register file containing 32 256-bit quad processing registers (QPRs). We can envision each of the 32 registers as containing four 64-bit elements, whereby each execution slot operates on one vector element.

The QPX architecture contains a full set of arithmetic instructions, including fused multiply-add (FMA) instructions; a new set of permute instructions for data shuffling across execution slots; and comparison, conversion, and move instructions. In addition, the QPX architecture contains a novel set of complex arithmetic instructions, which go beyond the traditional repertoire of SIMD lane-oriented computation by allowing execution slots to operate on data from adjacent slots. A collection of load/store instructions for complex data complements the complex arithmetic instructions.

Certain QPX store instructions provide a novel mechanism for the detection and indication of numerically exceptional conditions at the store interface. A Store Indicate NaN or Store Indicate Infinity exception occurs when the source operand of a Store with Indicate instruction in any of the four execution slots contains a NaN or Infinity value.

Scalar floating-point load instructions, as defined in the Power ISA, cause a replication of the source data across all elements of the target register. Scalar floating-point move, arithmetic, rounding and conversion, comparison, and select instructions, defined in the Power ISA, are executed in execution slot 0 and operate on the element 0 QPRs, which serve both as the scalar floating-point registers (FPRs) for scalar instructions and as the element 0 QPRs for vector instructions.
QPU design. Each execution slot within the QPU embodies a fused multiply-add data flow (MAD) pipeline, the machinery that calculates (A × C) + B, where A, B, and C are the operands defined by the Power and QPX ISAs. The QPU instantiates four copies of this data flow, creating a four-way SIMD floating-point microarchitecture, as Figure 2 shows.

Each execution slot has a register file holding one element of each vector register. The register file contains the state of four threads, to support the processor core's four-way multithreading capabilities. Physically, the 2W4R register file is built out of two identical copies of a 2W2R register file.

Arithmetic operations in the MAD pipeline have a latency of six cycles for back-to-back dependent operations. Data shuffling, select, and move instructions have a latency of two cycles. Some dependent operations involving complex instructions can incur an extra cycle of latency. All data paths are fully pipelined, allowing a new instruction to be issued every cycle.

To protect the QPU's large architected register state from soft errors, we use parity protection on interleaved groups of bits in the register file. This detects soft-error upsets, including multibit hits. The correction mechanism exploits the redundancy provided by the two identical copies of the register file: when a corrupted data value is detected in one of the operands, a recovery operation is performed in which data from the good copy overwrites the corrupted register-file entry.

The QPU's peak performance is four FMA operations (that is, eight double-precision floating-point operations) per cycle. At 1.6 GHz, an A2 processor core and its QPU therefore have a peak performance of 12.8 Gflops. The aggregate peak performance of the 16 user processors on the BQC chip is 204.8 Gflops.
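To illustrate how code targets the four execution slots, here is a minimal sketch of a multiply-add loop in the style of the QPX vector intrinsics that the IBM XL compilers expose (the vector4double type with vec_ld, vec_madd, and vec_st). The intrinsic names, the 32-byte alignment assumption, and the lack of loop-remainder handling are simplifying assumptions for illustration, not code from the article.

    /* y[i] = a[i]*c[i] + b[i], four elements per iteration, one QPR per operand.
       Assumes 32-byte-aligned arrays and n divisible by 4. */
    void qpx_fma(double *a, double *b, double *c, double *y, int n)
    {
        for (int i = 0; i < n; i += 4) {
            vector4double va = vec_ld(0, &a[i]);     /* load 4 doubles into a QPR */
            vector4double vb = vec_ld(0, &b[i]);
            vector4double vc = vec_ld(0, &c[i]);
            vector4double vy = vec_madd(va, vc, vb); /* (A * C) + B in each slot  */
            vec_st(vy, 0, &y[i]);                    /* store 4 results           */
        }
    }

Each vec_madd issues one operation to each of the four MAD pipelines, so a loop of this shape can approach the core's peak of eight double-precision flops per cycle when its operands stream from the L1 cache.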

Figure 2. The BG/Q Quad floating-point Processing Unit (QPU) consists of a Quad Processing Register file, four parallel double-precision floating-point multiply-add (MAD) pipelines, and a Permute unit.

L1 prefetch unit
Another BG/Q-specific augmentation to the A2 processor core is the L1 prefetch unit (L1P). The L1P implements two types of automatic data prefetching to try to hide the latency of accesses to the L2 cache and DDR memory. Data is prefetched as 128-byte L2 cache lines (twice the L1 cache line length) and stored in a buffer capable of holding 32 prefetch lines. Full coherency is maintained for all prefetched data lines.

The first type of prefetching is an enhanced version of the stream prefetching implemented in BG/L and BG/P. Stream-prefetching hardware identifies a sequence of loads from increasing, contiguous memory locations, and then prefetches additional contiguous data in this sequence. The second type of prefetching, called list prefetching, is designed to anticipate a sequence of loads made from an arbitrary series of addresses, when that sequence of load addresses is accessed more than once.

Stream prefetching. For L1 misses, the two previous Blue Gene machines incorporate hardware that automatically detects and prefetches data accessed in a sequence of increasing contiguous memory addresses, a classic form of prefetching.4 The BG/L chip prefetches one 128-byte line and a subsequent 128-byte line when a second L1 miss address lies within either line. Thus, BG/L prefetches streams of depth two and can maintain seven active streams within the 15 lines that its prefetch buffer can store. The BG/P chip implements a more flexible system in which L1 misses to increasing prefetched addresses can trigger the prefetch of two consecutive 128-byte data lines, thus supporting streams of depth three in the case of continued, uninterrupted access to that stream.

The increased memory latency, relative to the clock period, of the BG/Q memory system, and the demands of four threads per core, each potentially accessing multiple streams, result in increased pressure on prefetch-buffer storage in fixed-depth prefetch schemes. To minimize the prefetch storage requirements, we designed a powerful adaptive stream-prefetching engine. The engine assigns a variable depth to each established stream, up to a maximum of 16 streams. The largest supported depth is eight.
Each stream begins with a default preloaded depth that is typically small (for example, two). When a stream is first established, data is prefetched up to this initial depth. Subsequent prefetches are triggered when an L1 miss address lies within the stream. Such hits on data that have not yet been retrieved from memory trigger an adaptation event that increases that stream's depth. That depth is stolen from the least recently hit stream, keeping the total depth for all active streams at or below the prefetch buffer's 32-line capacity. With this demand-driven adaptation, buffer thrashing is reduced when many streams are active. The device will adapt to operate efficiently over a range of workloads, from a single stream from one thread to multiple streams and multiple threads.

List prefetching. The second prefetch scheme targets repetitive, deterministic code, such as that in iterative solvers, in which the same sequence of addresses is accessed again and again. This method requires the beginning and end of a repeating code segment to be identified in the application program. Once activated, each list-prefetching engine (one per thread) will capture the subsequent series of L1 miss addresses. The recording of addresses is controllable with memory-page granularity. The captured miss addresses are packed in a list written to memory.

When this repetitive portion of code is later executed again, the recorded sequence is read from memory, and the list of addresses is used to prefetch the needed data. The loading of this recorded address sequence and the subsequent data prefetching must be synchronized with code execution so that the data is prefetched in time to be used, but not so far in advance as to waste space in the prefetch buffer. We accomplish this by stalling the processor at the beginning of the repetitive code until the first of the recorded addresses has been loaded into the prefetch engine. The list-prefetch engine then matches the L1 misses with these recorded addresses and starts prefetching the data at list addresses that follow the matched address, up to a preloaded depth. This address matching and data prefetching continues until a terminating end-of-list marker is reached in the list of recorded addresses or until the programmed maximum length of a list is reached.

Further refinements increase the tolerance to variations that are expected in a multiprocessor environment. Perhaps most important is the address-matching algorithm's ability to compare each L1 miss against not only the next unmatched recorded address, but also up to seven subsequent addresses. Thus, if one or more recorded L1 misses do not reoccur, they can be skipped over without losing synchronization. Furthermore, support is provided to continue the recording of L1 miss addresses on subsequent iterations through the code. This new list can then replace the list recorded earlier, allowing a self-healing evolution as the pattern of L1 misses changes.

Finally, added hardware control lets us implement important software features. When the application encounters a segment of code with nonrepetitive data access (such as a table look-up), it can pause and later resume the list prefetching. Similarly, a function call can stop list prefetching and save the state of the prefetching engine when a subroutine is entered, then restore that state and continue prefetching when the function returns, allowing standard and modular code to exploit this list prefetching.
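As an illustration of these controls, the following sketch brackets an iterative solver with list-prefetch calls. The article does not define the software interface, so the list_prefetch_* functions (and the solver phases) are hypothetical placeholders for whatever the kernel or runtime actually provides.

    /* Hypothetical per-thread list-prefetch controls (placeholders). */
    extern void list_prefetch_start(int record_new_list); /* enter repeating code */
    extern void list_prefetch_pause(void);   /* suspend matching, keep the list  */
    extern void list_prefetch_resume(void);  /* resume matching where paused     */
    extern void list_prefetch_stop(void);    /* leave the repeating code segment */

    extern void stencil_sweep(void);         /* repetitive, deterministic misses */
    extern void boundary_table_lookup(void); /* nonrepetitive data access        */
    extern void residual_update(void);

    void solve(int iterations)
    {
        for (int iter = 0; iter < iterations; iter++) {
            /* Record L1 misses on the first pass; on later passes, replay the
               recorded list while re-recording a self-healing replacement. */
            list_prefetch_start(/* record_new_list = */ iter == 0);

            stencil_sweep();

            list_prefetch_pause();    /* for example, around a table look-up */
            boundary_table_lookup();
            list_prefetch_resume();

            residual_update();

            list_prefetch_stop();
        }
    }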
WakeUp unit
Each processor core has its own external WakeUp unit, whose main purpose is to increase overall application performance by reducing thread-to-thread interactions arising from software blocked in a spin loop or similar blocking polling loop. Because the processor core has four threads but performs at most one integer instruction and one floating-point instruction per processor cycle, a thread in a polling loop takes cycles from the other threads. This cost is highest if the polled variable is L1-cached, because the loop's frequency is highest. In general, degradation will occur whenever the polling thread competes for hardware resources with other threads. Implementing a wait procedure using the WakeUp unit will incur less degradation than a polling loop implemented in software.

Upon entering a wait, software writes a base address and enable mask for the polled address range to a wait address compare (WAC) register in the WakeUp unit. Subsequently, an instruction is executed to pause the corresponding hardware thread. When any of the other 67 hardware threads, the message unit, or the PCI Express unit writes a matching address, the WakeUp unit drives the thread's pause/enable signal to resume execution. The resumed program then reads the data from the matching address range. If the desired condition has indeed been reached, the polling loop is exited. Otherwise, the program again configures the WakeUp unit and pauses itself.

The WakeUp unit has 12 WACs; each WAC snoops all writes on the chip by snooping the local store operations and incoming L1 invalidates. The WakeUp unit can also resume a thread using interrupts (such as network packet arrivals). Each of the core's four hardware threads has a status and enable register in the WakeUp unit, so that software can select the WACs and interrupts used to resume.

The WakeUp unit also outputs a logical-OR of a programmable subset of the 12 WACs. System software can use this function to generate an external interrupt to the A2. For example, it can use this interrupt to recognize a stack overflow.
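A wait procedure of the kind just described might be structured as in the following sketch. The accessors and the pause primitive (wakeup_arm_wac, wakeup_select_wacs, thread_pause) are hypothetical names standing in for the actual register interface, which the article does not specify.

    #include <stdint.h>

    extern void wakeup_arm_wac(int wac, uint64_t base_addr, uint64_t enable_mask);
    extern void wakeup_select_wacs(int hw_thread, uint64_t wac_mask);
    extern void thread_pause(void); /* pause this hardware thread until woken */

    /* Block until *flag == value, without stealing issue slots from the other
       three threads of the core the way a software spin loop would. */
    void wakeup_wait(volatile uint64_t *flag, uint64_t value, int hw_thread)
    {
        for (;;) {
            /* Arm WAC 0 on the flag's address range and select it as a wake
               source for this thread before checking, so a store cannot be
               missed between the check and the pause. */
            wakeup_arm_wac(0, (uint64_t)flag, ~(uint64_t)0x7F);
            wakeup_select_wacs(hw_thread, 1ull << 0);
            if (*flag == value)
                return;
            thread_pause(); /* resumed by a matching write or enabled interrupt */
            /* A wake-up only means some write hit the range; loop and recheck. */
        }
    }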
Processor unit redundancy
The A2 core, QPU, L1P, and WakeUp units together comprise a processor unit (PU), which is replicated 18 times on the chip (see Figure 1). Because user and system software use only 17 PUs, one of these PUs is redundant. The redundant PU increases manufacturing yield. If we find a defect in a PU during manufacturing test (at wafer, module, card, or system level), we can quiesce the defective PU and activate the redundant PU. PU-sparing information is carried with the physical chip in eFuses or in an on-card EEPROM, and is activated on the chip at boot time. PU sparing is transparent to the end user: system and user software will see 17 contiguous operational processors, logically numbered 0 through 16, irrespective of which physical processor (if any) has been spared out.

Crossbar
The crossbar switch is the central connection structure between all the PUs (specifically, the L1P units), the L2 cache, the networking logic, and various low-bandwidth units on the chip. The crossbar and L2 cache are clocked at half the processor frequency, with a separate 800-MHz clock grid. The crossbar has 22 master ports (18 PUs, three network logic ports, and a DevBus port used by PCI Express) and 18 slave ports (16 L2 slices, the network logic, and the DevBus slave). The crossbar actually consists of three switches, one each for request traffic, response traffic, and invalidation traffic. The response switch has a 32-byte-wide read data path from all slave devices, such as the L2 slices, to all master devices, such as the L1P units. The aggregate maximum read bandwidth from all L2 slices is 409.6 Gbytes/s. The write data path is 16 bytes wide, delivering an aggregate write bandwidth of 153.6 Gbytes/s (under simultaneous reads).
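As a cross-check, the read figure follows directly from the slice count, port width, and crossbar clock: 16 slices × 32 bytes × 800 MHz = 409.6 Gbytes/s.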



L2 cache
All processor cores share the L2 cache. To provide sufficient bandwidth, the cache is split into 16 slices. Each slice is 16-way set associative, operates in write-back mode, and has 2-Mbyte capacity. Physical addresses are scattered across the slices via a programmable hash function to achieve uniform slice utilization.

The L2 cache serves as the point of coherence as well as the control point for atomic accesses using the Power ISA's Load and Reserve/Store Conditional (larx/stcx) instructions.

Given the large number of concurrent threads and the target of high aggregate performance, we designed the L2 cache to provide hardware-assist capabilities that accelerate sequential code as well as thread interactions. Specifically, the L2 cache provides memory-speculation support and atomic memory-update operations.

Memory speculation: transactional memory (TM) and speculative execution (SE). The BQC chip implements hardware support for memory speculation in the form of a multiversioning L2 cache. During memory speculation, the L2 cache tracks state changes caused by speculative threads and keeps them separate from the main memory state. At the end of a speculative code section, the modifications can either be reverted (invalidate) or made permanent (commit). This allows programmers to write code sections without having to reason in advance about the correctness of concurrent execution. The L2 cache will track access conflicts between threads, letting a speculation runtime system reason about the correctness of execution. This hardware support enables transactional memory (TM) and speculative execution (SE).

For TM, sections of a parallel program are annotated to be executed atomically and in isolation. Data that is speculatively written is visible only to the thread that has written it. The L2 cache tracks Read-after-Write (RAW), Write-after-Write (WAW), and Write-after-Read (WAR) conflicts. The program can configure the L2 cache to react with an invalidation, with a notification to software, or both. In the latter case, software can decide which thread to invalidate to resolve the conflict. In the common case, the invalidated thread will retry executing the section.

For SE, a sequential program is partitioned into tasks, which execute speculatively in parallel. For SE, there is a definite order of precedence, namely the expected sequential execution behavior, which hardware uses to resolve access conflicts. Data written by threads that are earlier in sequential semantics (older) is forwarded to threads that are later in sequential semantics (younger). Writes of older threads to locations already read by younger threads (WAR conflicts) are signaled to the runtime system. The partitioning into speculative tasks can be controlled by #pragma statements similar to those of OpenMP. For example, the following code causes the loop's iterations to execute speculatively in parallel:

    #pragma speculative for
    for (i = 0; i < N; i++) {
        /* loop body executes as speculative tasks */
    }

The L2 cache tags speculative data with dynamically allocated short speculation IDs instead of thread IDs. The system provides more speculation IDs than there are hardware threads. Commit and invalidate operations are mere state transitions of the speculation ID; consequently, they're very fast. A background scrub alters the directory tags on commit and invalidate while the hardware thread simply allocates another ID and proceeds with executing the next speculative section in parallel.

There are 128 speculation IDs. Programs can divide the set of IDs into 1, 2, 4, 8, or 16 domains. Each domain can operate in either TM mode or SE mode, allowing different hardware threads to use all modes concurrently. Each SE domain maintains its order locally for its IDs, letting ID allocations proceed independently per domain.

L2 atomic operations. The BQC chip's L2 cache logic also provides an extensive set of atomic operations, intended to accelerate software operations, such as locking and barriers, and to enable efficient work-queue management with multiple producers and consumers.

The atomic operations are 8-byte load or store operations that can modify the value at the given memory address. Each L2 slice can process an atomic operation every four processor clock cycles. This high throughput enables simple synchronization constructs and concurrent data structures that scale well to the many BG/Q threads. As an example, to perform a ticket lock wherein each thread obtains a unique number, 64 threads can use the L2 atomic LoadIncrement operation to obtain their unique values in 78 (initial latency) + 64 (threads) × 4 (cycles) = 334 cycles in an ideal case. An equivalent atomic increment implemented with general-purpose load-reserved/store-conditional (larx/stcx) sequences would take considerably longer.
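Worked out in code, the ticket-lock example looks roughly like the sketch below. The helper l2_atomic_load_increment() is a hypothetical wrapper around a load issued to the LoadIncrement alias address of the lock word (the address-encoding scheme is described next); it is illustrative, not a published interface.

    #include <stdint.h>

    /* Hypothetical wrapper: one 8-byte load from the LoadIncrement alias of
       addr; the L2 slice returns the old value and increments memory. */
    extern uint64_t l2_atomic_load_increment(volatile uint64_t *addr);

    typedef struct {
        volatile uint64_t next_ticket;  /* bumped once per arriving thread  */
        volatile uint64_t now_serving;  /* advanced by the unlocking thread */
    } ticket_lock_t;

    void ticket_lock_acquire(ticket_lock_t *l)
    {
        /* All 64 threads can obtain distinct tickets back to back, since
           each L2 slice accepts one atomic operation every four cycles. */
        uint64_t my_ticket = l2_atomic_load_increment(&l->next_ticket);
        while (l->now_serving != my_ticket) {
            /* Ideally, wait here via the WakeUp unit instead of spinning. */
        }
    }

    void ticket_lock_release(ticket_lock_t *l)
    {
        l->now_serving = l->now_serving + 1; /* plain store; admits next ticket */
    }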

A memory location targeted by atomic operations continues to support the usual load, store, and coherence operations to that memory location. An atomic operation is distinguished from a regular load or store by software setting a high-order address bit. Three other bits distinguish up to eight different atomic operations. Software maps the atomic addresses as uncached. For the processor hardware, an atomic operation is a normal load or store. Similarly, the network message unit can remote-put or remote-get an atomic operation.

The L2 atomic load operations follow. LoadClear returns the current memory value and then stores 0 there. LoadIncrement returns the current value and then increments the memory value; LoadDecrement is similar. Load returns the current memory value. LoadIncrementBounded assumes an 8-byte counter at the given address and an 8-byte boundary in the subsequent address. If the counter and boundary values differ, LoadIncrementBounded behaves like LoadIncrement. Otherwise, 0x8000000000000000 is returned, indicating to software that the boundary has been reached. LoadDecrementBounded is similar, with the boundary located in the previous address.

The latter two operations support high-throughput lock-free concurrent data structures. For example, for an array used as a circular buffer for a queue, LoadIncrementBounded prevents the consumer pointer from passing the producer pointer. In another example, for a concurrent stack, LoadDecrementBounded prevents the stack pointer from passing through the bottom.
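The circular-buffer case might be sketched as follows, with l2_load_increment_bounded() as a hypothetical wrapper and with the counter/boundary placement taken from the layout described above (the boundary in the 8-byte word after the counter).

    #include <stdint.h>

    #define L2_BOUNDARY_REACHED 0x8000000000000000ULL

    /* Hypothetical wrapper for LoadIncrementBounded on cursor[0], with its
       boundary value in cursor[1] (the subsequent 8-byte address). */
    extern uint64_t l2_load_increment_bounded(volatile uint64_t cursor[2]);

    typedef struct {
        volatile uint64_t cursor[2]; /* [0] consumer index, [1] producer index */
    } queue_tail_t;

    /* Claim the next unconsumed slot, or return -1 if the queue is empty;
       the hardware guarantees the consumer never passes the producer. */
    int64_t try_consume(queue_tail_t *q)
    {
        uint64_t idx = l2_load_increment_bounded(q->cursor);
        if (idx == L2_BOUNDARY_REACHED)
            return -1;
        return (int64_t)idx;
    }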
The L2 atomic store operations follow. StoreAdd adds the given and memory values. StoreAddCoherenceOnZero minimizes coherence traffic while software waits for the value 0 to be reached. StoreTwin stores the given value to the address and the subsequent address, if their values were previously equal. For example, this can move the top and bottom pointers of a double-ended queue (deque) into the middle when the deque is empty. Store simply stores the given value. StoreOr does a logical-OR of the given and memory values; StoreXor is similar. StoreMaxUnsigned and StoreMaxSigned store the maximum of the given and memory values for unsigned integers and floating-point numbers, respectively.

L2 replacement policy. Similar to the last-level cache in previous Blue Gene architecture generations, the BG/Q L2 cache provides prefetch capabilities (from the DDR3 memory controllers) driven by hint bits generated by the L1 prefetch units. The L2 cache uses a true LRU replacement policy with a configurable LRU stack insertion point to reduce cache pollution. For example, a prefetch from main memory into the L2 cache could be inserted at position 14 of the 16-entry LRU stack, making it a probable candidate for eviction if not accessed soon. Once accessed by a real demand fetch, it could be promoted to position 8, causing unused prefetches to be evicted before it would become a victim. When accessed for a second time, it could be promoted to position 1, now least prone to eviction, as such a policy would assume likely further reuse.

DDR3 memory controllers
L2 cache misses are serviced by two memory controllers (MC0 and MC1 in Figure 1). Each controller drives a 16 + 2-byte-wide DDR3 channel at 1.333 Gbits/s per pin. It uses an 8-byte error-correcting code (ECC) for a 64-byte data block (that is, ECC is calculated across four consecutive beats), with double-symbol error correction and triple-symbol error detection, where a symbol is 2 bits wide × 4 beats. The ECC supports retry and partial or full chip-kill for 8-bit-wide DDR3 memory modules. The ECC protects data in the face of a full DRAM device fail, synchronous with a second bit fail.

The peak DDR bandwidth is 42.67 Gbytes/s, excluding ECC. The memory controllers support multiple DDR3 density/rank/speed configurations. For example, Blue Gene/Q will initially be configured with 16 Gbytes of DDR3 per BQC chip. The 1.35-V DDR3 memory chips are directly soldered onto the same compute card as the BQC chip for high reliability. Because of the resulting short interconnects, we can leave the compute card's data nets unterminated, which helps with power efficiency.
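As a cross-check of that peak figure: each channel carries 16 data bytes per transfer at 1.333 Gtransfers/s, so 2 channels × 16 bytes × 1.333 Gtransfers/s ≈ 42.67 Gbytes/s, with the additional 2 bytes per channel devoted to the ECC symbols.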



Network and messaging unit
The BG/Q networking papers give a detailed description of the network and messaging unit,5,6 so we describe them only briefly here.

The BG/Q network consists of a set of compute nodes (BQC chip plus memory) arranged in a 5D torus configuration. On compute nodes, 10 chip-to-chip communication links are used to build the 5D torus. A subset of the compute nodes, called bridge nodes, has an eleventh communication link active to connect with an I/O node. We build the compute node and the I/O node from the same BQC chip, with different configuration settings and packaging.

Each of the BQC chip's 11 communication links is implemented using high-speed serial I/Os, which drive and receive data at (effectively) 4 Gbits/s per differential wire pair. A set of 4 + 4 (send + receive) pairs comprises a chip-to-chip link. Thus, each of the 11 chip-to-chip links can transmit raw data at 2 Gbytes/s and simultaneously receive data at 2 Gbytes/s, for a peak total bandwidth of 44 Gbytes/s.

The logic for this BG/Q interconnection network is integrated onto the BQC chip. The network logic supports point-to-point, collective, and barrier messages between nodes over a single physical network.

The message unit provides a high-throughput, low-latency interface between the network routing logic and the memory subsystem, with enough bandwidth to keep all the links busy. The message unit is connected to the crossbar switch via a single slave port and three master ports. Through this memory master capability, the message unit provides remote DMA facilities, supporting direct puts, remote gets, and memory first-in, first-out (FIFO) messages. The remote DMA extends to the atomic operations described above, such as LoadIncrement.

In addition, the on-chip networking logic provides hardware-assist facilities for broadcast and reduction operations, such as integrated floating-point, fixed-point, and bitwise arithmetic. These hardware assists will handle most aspects of messaging autonomously, with minimal disturbance of the PUs.
PCI Express device
On BQC chips configured as I/O nodes, two of the 11 chip communication ports are repurposed and combined to form a PCI Express Generation 2 x8 (4-Gbytes/s) interface, supporting communication cards to the outside world. Two other 2-Gbytes/s links connect back to two compute bridge nodes, thus balancing the 4-Gbytes/s PCI Express port. On chip, the PCI Express bridge logic is connected to the memory subsystem via a single crossbar-switch master port.

Other slave devices
A set of low-bandwidth on-chip devices shares a crossbar slave port via an arbiter called the DevBus interface. Such devices include the boot eDRAM, a local static 256-Kbyte memory that can be preloaded with boot code via a serial IEEE 1149.1 (JTAG) port. During functional operation, the JTAG port serves as a side-band communication channel with the external control system, for node monitoring and control.

The DevBus also connects to the Universal Performance Counter unit, the heart of a distributed performance-monitoring system that can simultaneously monitor 512 events, each with a 64-bit-wide counter. Software can configure these counters to monitor and count events or to capture a 1,024-cycle-long sequence of events.

Data integrity
Because data integrity is crucial when scaling out to tens of thousands of chips and millions of threads, we have protected all on-chip data paths with a single-error-correct, double-error-detect (SEC-DED) ECC. ECC also protects most memory arrays. We implemented the register files in the processor and QPU with parity and redundancy, allowing restoration of corrupted entries from redundant copies. The L1 caches are parity protected and are automatically invalidated and reloaded from L2 on a parity error. Configuration-carrying registers and state machines are typically implemented with soft-error-resistant latches (stacked or DICE [Dual Interlocked Storage Cell] latches) and are parity protected to reduce the probability of a silent error.

Software and performance
In BG/Q, we will maintain the BG/L1 and BG/P2 system software philosophy of a simple, streamlined stack designed for scalability. The BG/Q system is optimized for applications that are based on standard programming models such as MPI and OpenMP. Beyond these standard programming models, the BG/Q system also lets users explore new ways to utilize the multicore, multithreaded BG/Q compute node.

The BG/Q I/O nodes will run Linux. On the BG/Q compute nodes, however, the compute node kernel (CNK) will continue to be the lightweight kernel.7 CNK offloads file I/O and maps the communication hardware directly into user space for efficient messaging. CNK, the IBM XL compiler, and associated runtime support have all been enhanced and tuned to fully exploit the advanced hardware features provided on the BQC chip.

As of November 2011, a 4,096-node BG/Q installation had achieved a performance of 677.10 Tflops on Linpack (http://www.top500.org). This system also achieved the top rating on the Graph 500 (http://www.graph500.org) at 254 Gteps (giga traversed edges per second). Finally, BG/Q systems achieved top ratings on the Green 500 list of the most energy-efficient supercomputers at 2.0 Gflops/W (http://www.green500.org).

Future work will present a more in-depth analysis of the machine, its software, and its performance. For more information on the Blue Gene team, see the "Blue Gene Project Members" sidebar.

Acknowledgments
The BG/Q project has been supported and partially funded by the Argonne National Laboratory and the Lawrence Livermore National Laboratory on behalf of the US Department of Energy, under Lawrence Livermore National Laboratory subcontract number B554331.

References
1. A. Gara et al., "Overview of the Blue Gene/L System Architecture," IBM J. Research and Development, vol. 49, nos. 2/3, 2005, pp. 195-212.
2. The Blue Gene/P Team, "Overview of the IBM Blue Gene/P Project," IBM J. Research and Development, vol. 52, nos. 1/2, 2008, pp. 199-220.
3. C. Johnson et al., "A Wire-Speed Power Processor: 2.3 GHz 45-nm SOI with 16 Cores and 64 Threads," 2010 IEEE ISSCC Digest, 2010, pp. 104-105.
4. S.P. Vanderwiel and D.J. Lilja, "Data Prefetch Mechanisms," ACM Computing Surveys, vol. 32, no. 2, 2000, pp. 174-199.
5. D. Chen et al., "The IBM Blue Gene/Q Interconnection Network and Message Unit," Proc. Int'l Conf. High-Performance Computing, Networking, Storage, and Analysis (SC11), IEEE CS Press, 2011, doi:10.1145/2063384.2063419.
6. D. Chen et al., "The IBM Blue Gene/Q Interconnection Fabric," IEEE Micro, vol. 32, no. 1, 2012, pp. 32-43.
7. M. Giampapa et al., "Experiences with a Lightweight Kernel: Lessons Learned From Blue Gene's CNK," Proc. ACM/IEEE Int'l Conf. High Performance Computing, Networking, Storage, and Analysis (SC10), IEEE CS Press, 2010, doi:10.1109/SC.2010.22.
Ruud A. Haring is a research staff member and manager of Blue Gene Chip Design at the IBM T.J. Watson Research Center, where he is responsible for the synthesis, physical design, and design for test of the Blue Gene chip designs, as well as for functional diagnostics of Blue Gene cards and systems. His research interests include electronic circuit design on microprocessors and application-specific integrated circuits (ASICs). Haring has a PhD in physics from Leyden University (the Netherlands). He is a senior member of IEEE.



Blue Gene Project Members
The Blue Gene project is a team effort, benefiting from the cooperation of many individuals at IBM Research, IBM Systems and Technology Group, and IBM Software Group. The Blue Gene Team contributors are as follows.

From IBM: Mike Aho, David W. Alderman, Eric Alter, Eberhard Amann, Anatoli Andreev, Sameh Asaad, John Attinella, Vernon Austel, Marcy L. Averill, Michael J. Azevedo, Joern Babinsky, Ralph Bellofatto, James R. Bentlage, Jeremy Berg, Darcy Berger, Randy Bickford, SuEllen Birkholz, Michael Blocksome, Matthias A. Blumrich, Lynn Boger, Alan Boulter, Thomas C. Brennan, Jeremy J. Brewer, Bernard Brezzo, Arthur A. Bright, Jose Brunheroto, Jay S. Bryant, Nathan C. Buck, Tom Budnik, Daniel Buerkle, Raymond J. Bulaga, Mark Campana, Kenneth M. Caron, Bob Cernohous, Douglas Chartrand, Jeffery D. Chauvin, Dong Chen, Wang Chen, Chen-Yong Cher, George L.T. Chiu, I-Hsin Chung, Paul K. Coffman, Miguel Comparan, Paul W. Coteus, Ron Daede, Bruce D'Amora, Kris Davis, Michael Deindl, Brian Deskin, Jun Doi, Marc B. Dombrowa, Roger Dong, Michael L. Eaton, Alexandre Eichenberger, Noel A. Eisley, Matthew R. Ellavsky, Kahn C. Evans, Sean T. Evans, Steve Faas, Daniel Faraj, Mitchell D. Felton, Blake G. Fitch, Ryan A. Fitch, Bruce M. Fleischer, Bill Flynn, Jeffrey Fosmo, Thomas W. Fox, John F. Fraley, Ross L. Franke, Scott Frei, Robert Germain, Philip Germann, Mark E. Giampapa, Frank P. Giordano, Mike P. Good, Thomas Gooding, Nicholas Goracke, Jason Greenwood, Michael K. Gschwind, John A. Gunnels, Shawn Hall, Michael Hamilton, Ruud A. Haring, James S. Harveland, Troy L. Haugen, Philip Heidelberger, Olaf Hendrickson, Brent Hilgart, Misky H. Hillestad, Judith W. Hjortness, Dennis Y. Huang, Todd Inglett, David J. Iverson, Randy T. Jacobson, Geert Janssen, Mark Jeanson, Mark C. Johnson, Steven P. Jones, Kirk Jordan, Jeffrey N. Judd, Kerry T. Kaliszewski, Robert Kammerer, Michael Kaufmann, Amanda R. Kaufer, Kyu-hyoun Kim, David Klepacki, Gerard V. Kopcsay, Anatoly Koyfman, Brant L. Knudson, Jon Kriegel, Sameer Kumar, Lih-Chung Kuo, Mark Kupferschmidt, David E. Lackey, Alphonso P. Lanzetta, Cory Lappi, Jay A. Lawrence, David Lawson, Tak O. Leung, Tom Liebsch, Meryl Lo, Ray Lucas, Bob Lytle, Scott H. Mack, David Malone, Amith R. Mamidala, Jim Marcella, Christopher M. Marroquin, John K. Masi, Patrick J. McCarthy, Moyra K. McManus, Mark Megerian, Douglas Miller, Sam Miller, Jaime H. Moreno, Adam Muff, Patrick Mulligan, Roy Musselman, Tom Musta, Indira Nair, Ben Nathanson, Mike Nelson, Hoang N. Nguyen, Carl Nilsen, Kinya Noguchi, Carl Obert, Kathryn O'Brien, Kevin K O'Brien, Alda S. Ohmacht, Martin Ohmacht, Bitwoded Okbay, Jim Van Oosten, Michael R. Ouellette, Bruce Owens, Mike J. Palmer, Jeff Parker, David P. Paulsen, Huyen Phan, Kerry Pfarr, Swetha Pullela, Don Reed, Michael T. Repede, Dennis Rickert, Thomas Roewer, Jeff Ruedinger, Bryan S. Rosenburg, Yogish Sabharwal, Valentina Salapura, David L. Satterfield, Jun Sawada, Vaibhav Saxena, Paul Schardt, Matthew Scheckel, Brandon Schenck, Heiko Schick, Dietmar Schmunkamp, Bob Schoen, Andrew A. Schram, Brian Schuelke, Alan D. Secor, Faith W. Sell, Woody Sellers, Robert M. Senger, James Sexton, Vinay V. Shah, Robert Shearer, Brian Smith, Karl Solie, Carlos P. Sosa, David L. Sparks, Burkhard Steinmacher-Burow, Will Stockdell, Scott Strissel, Craig Stunkel, Krishnan Sugavanam, Yutaka Sugawara, Nobu Suginaka, Takehito Sakuragi, Corey Swenson, Keith A. Tally, Yoshihisa Takatsu, Todd Takken, Andrew Tauferner, Kiswanto Thayib, John Thomas, Shurong Tian, Jose A. Tierno, Ailoan T. Tran, Matthew R. Tubbs, Priya Unnikrishnan, Jason L. VanVreede, Pascal Vezolle, Ivan Vo, Martha Voytovich, Charles Wait, Bob Walkup, Amy Wang, Alfred T. Watson, Bryan J. Weatherford, Robert W. Wisniewski, Bruce Winter, Bryon Wirtz, Kelvin Wong, Peng Wu, Ching Zhou, Jianwei Zhuang, Fred Ziegler, Matthew Ziegler, and Christian G. Zoellin.

From Columbia University: Norman H. Christ.

From Columbia University and RIKEN BNL Research Center: Changhoan Kim.

From University of Edinburgh: Peter A. Boyle.

Formerly at IBM: Jeremy Balster, James Clemens, Paul Curtis, Gabor Dozsa, Don Eisenmenger, Matthias Friitsch, Alan Gara, Russell Hoover, Fariba Kasemkhani, Stefan Koch, Brian Koehler, Matt Logelin, Curt Mathiowetz, Eric O. Mejdrich, Virginia Metayer, Timothy Moe, Mike Mundy, Eldon Nelson, Dennis Olson, Rick A. Rand, Joseph Ratterman, Daniele P. Scarpazza, Marcel Schaal, Ryan J. Schlichting, Catherine Trammell, Simeon Wahl, Tim Wensky, and Xiaotong Zhuang.

Martin Ohmacht is a research staff member and manager at the IBM T.J. Watson Research Center, where he has worked on memory subsystem architecture and implementation for all Blue Gene supercomputer systems. His research interests include computer architecture, memory subsystems, speculative execution, design and verification of multiprocessor systems, and compiler optimizations. Ohmacht has a Dr-Ing in electrical engineering from the University of Hanover, Germany. He is a member of IEEE and the ACM.

Thomas W. Fox is a senior engineer at the IBM T.J. Watson Research Center, where he has worked on digital signal processors, domain-specific graphics processors, PowerPC general-purpose processors, and the Blue Gene/Q Quad floating-point unit. His professional interests include processor architectures, floating-point arithmetic, and 3D graphics. Fox has an MEE in electrical engineering from Rensselaer Polytechnic Institute.

Michael K. Gschwind is a senior technical staff member and senior manager of system architecture in the IBM Systems and Technology Group in Poughkeepsie, where his team defines the architecture of IBM's mainframe System z and Power systems. He previously served as Blue Gene/Q floating-point chief architect and lead at the IBM T.J. Watson Research Center. Gschwind has a PhD in computer engineering from Technische Universitaet Wien. He is an IEEE fellow, an IBM Master Inventor, and a member of the IBM Academy of Technology.

David L. Satterfield is a senior engineer at the IBM T.J. Watson Research Center, where he works on the design, verification, and bring-up of the Blue Gene/Q supercomputer. His professional interests have included the design, verification, and bring-up of the Blue Gene/P network interconnect ASIC, and the design of the microprocessors used in the XBOX 360 and of a low-power floating-point unit for IBM's embedded PowerPC processor line. Satterfield has a BS in electrical engineering from Tufts University.

Krishnan Sugavanam is a senior engineer at the IBM T.J. Watson Research Center. His research interests are in computer architecture, with emphasis on cache design. Sugavanam has an MS in computer engineering from the University of Louisiana at Lafayette and an MS in applied electronics from Anna University, Chennai, India.

Paul W. Coteus is a research staff member and the chief engineer of Blue Gene systems at the IBM T.J. Watson Research Center, where he leads the system design and packaging of these supercomputers. He manages the Systems Packaging Group, where he directs and designs advanced packaging for high-speed electronics, including I/O circuits, memory system design and standardization of high-speed DRAM, and high-performance system packaging. Coteus has a PhD in physics from Columbia University. He is a senior member of IEEE, a member of IBM's Academy of Technology, and an IBM Master Inventor.

Philip Heidelberger is a research staff member at the IBM T.J. Watson Research Center. He is a member of IBM Research's Blue Gene supercomputer hardware team, where he has been involved in network architecture and design, logic verification, hardware bring-up, network software interfaces, and efficient communications algorithms. Heidelberger has a PhD in operations research from Stanford University. He is a member of the IBM Academy of Technology and a fellow of IEEE and the ACM.

Matthias A. Blumrich is a research staff member and member of the Blue Gene hardware team at the IBM T.J. Watson Research Center, where he has worked on the architecture and design of the Blue Gene/L, Blue Gene/P, and Blue Gene/Q systems. Blumrich has a PhD in computer science from Princeton University.

Robert W. Wisniewski is the chief software architect for Blue Gene and manager of the Blue Gene and Exascale Research Software Team at the IBM T.J. Watson Research Center. His research interests include experimental scalable systems with the goal of achieving high performance by cooperation between system and application, as well as studying how to structure and design systems to perform well on parallel machines, and how to design those systems to allow user customization. Wisniewski has a PhD in computer science from the University of Rochester. He is an IBM Master Inventor and ACM Distinguished Member.
Alan Gara is an Intel fellow. He was previously the chief architect of the IBM Blue Gene systems at the IBM T.J. Watson Research Center. His research interests include the design and realization of power-efficient high-performance supercomputers. Gara has a PhD in physics from the University of Wisconsin, Madison.



George Liang-Tai Chiu is the senior manager of Advanced High Performance Systems in the Systems Department at the IBM T.J. Watson Research Center, where he is responsible for the overall hardware and software of the Blue Gene Platform. Chiu is one of the three cofounders of the Blue Gene project, and he has been in charge of the Blue Gene supercomputer designs. Chiu has a PhD in astrophysics from the University of California, Berkeley. He is a member of the International Astronomical Union and the IBM Academy of Technology and a fellow of IEEE.

Peter A. Boyle is a reader in theoretical particle physics at the University of Edinburgh. His research interests are in quantum field theory, lattice quantum chromodynamics (QCD), and supercomputer codesign for QCD simulation. Boyle has a PhD in theoretical particle physics from the University of Edinburgh.

Norman H. Christ is the Gildor Professor of Physics at Columbia University. His research interests include particle physics, lattice gauge theory, and quantum field theory. Christ has a PhD in physics from Columbia University. He is a fellow of the American Physical Society.

Changhoan Kim is a research staff member at the IBM T.J. Watson Research Center. He designed the prefetch unit for Blue Gene/Q while at Columbia University and RIKEN BNL Research Center. His research interests include supercomputer chip design, algorithm development, application performance optimization on massively parallel supercomputers, and extending the utility of high-performance computing beyond academia. Kim has a PhD in physics from Columbia University.

Direct questions and comments about this article to Ruud A. Haring, IBM T.J. Watson Research Center, PO Box 218, Yorktown Heights, NY 10598; ruud@us.ibm.com.
