THE IBM BLUE GENE/Q COMPUTE CHIP

Ruud A. Haring, Martin Ohmacht, Thomas W. Fox, Michael K. Gschwind, David L. Satterfield, Krishnan Sugavanam, Paul W. Coteus, Philip Heidelberger, Matthias A. Blumrich, Robert W. Wisniewski, Alan Gara, and George Liang-Tai Chiu, IBM T.J. Watson Research Center; Peter A. Boyle, University of Edinburgh; Norman H. Christ, Columbia University; Changhoan Kim, RIKEN BNL Research Center

Blue Gene/Q aims to build a massively parallel high-performance computing system out of power-efficient processor chips, resulting in power-efficient, cost-efficient, and floor-space-efficient systems. Focusing on reliability during design helps with scaling to large systems and lowers the total cost of ownership. This article examines the architecture and design of the compute chip, which combines processors, memory, and communication functions on a single chip.

Like its previous generations, Blue Gene/L (BG/L)1 and Blue Gene/P (BG/P),2 the Blue Gene/Q (BG/Q) system is optimized for price performance, energy efficiency, and reliability in large-scale scientific applications. The system also aims to expand into applications with commercial and industrial impact, such as analytics. For the heart of the system, the Blue Gene/Q Compute (BQC) chip, this translates into design requirements to optimize power efficiency in terms of floating-point operations per second, per Watt (flops/W); to optimize the chip's networking and connectivity features; and to optimize reliability features such as redundancy, the usage of error correction and detection, and the minimization of sensitivity to soft errors.

The BQC chip is an 18.96 × 18.96 mm chip in IBM's Cu-45HP (45-nm silicon-on-insulator [SOI]) application-specific integrated circuit (ASIC) technology, and it contains about 1.47 × 10^9 transistors. The design integrates processors, a memory subsystem, and a networking subsystem on a single chip. A die photograph of the chip shows 18 processor units (PU00 to PU17) surrounding a large Level-2 (L2) cache that occupies the center of the chip (see Figure 1). The processor units are based on the PowerPC A2 processor core, with BG/Q-specific additions: a single-instruction, multiple-data (SIMD) quad floating-point unit; a Level-1 (L1) prefetching unit; and a wake-up unit. The processor units function as 16 user processors plus one processor for operating system services. A redundant processor increases manufacturing yield. The processor units connect to the L2 cache via a crossbar switch in the middle of the chip.

The 32-Mbyte L2 cache is built from 16 2-Mbyte slices (L200 to L215), using embedded DRAM (eDRAM) for dense, power-efficient, and high-bandwidth storage. The L2 cache is multiversioned to support speculative execution and transactional memory. In addition, the L2 cache supports atomic operations. L2 cache misses are handled by dual on-chip memory controllers (MC0 and MC1) that drive the double data rate type three (DDR3) integrated I/O blocks (PHYs) located on the left and right
IEEE Micro, March/April 2012 (Hot Chips). Published by the IEEE Computer Society. 0272-1732/12/$31.00 © 2012 IEEE.
Figure 1. A die photograph of the Blue Gene/Q Compute (BQC) chip. The photograph shows a large Level-2 (L2) cache in the center of the chip, surrounded by 18 processor units based on the PowerPC A2 processor core. Integrated DDR3 memory controllers are on the left and right sides, and the chip-to-chip communication (network) logic is at the bottom side of the chip.
edges of the chip. The network and messaging unit occupies the chip's south end.

BQC chip design

The BQC chip is optimized for floating-point performance, power efficiency, chip-to-chip communication bandwidth, and reliability. The multithreaded multicore chip design contains several innovative hardware features that support efficient software execution.

Processor cores

The BQC chip's processor core is an augmented version of the A2 processor core used on the IBM PowerEN chip.3 The A2 processor core implements the 64-bit Power instruction set architecture (Power ISA) and is optimized for aggregate throughput. The A2 is four-way simultaneously multithreaded and supports two-way concurrent instruction issue: one integer, branch, or load/store instruction and one floating-point instruction. Within a thread, dispatch, execution, and completion are in order.

The A2 core's L1 data cache is 16 Kbyte, eight-way set associative, with 64-byte lines. The 32-byte-wide load/store interface has sufficient bandwidth to support the quad floating-point unit. The L1 instruction cache is 16 Kbyte, four-way set associative. We resynthesized the processor core to run at 1.6 GHz at a reduced voltage
(0.8 V nominal) versus the original A2 design point. The lower-voltage operation reduces both active and leakage power.

Quad floating-point processing unit

The Quad Processing eXtension (QPX) to the Power ISA is an architectural definition and set of instructions for floating-point processing in a microprocessor core, developed specifically for BG/Q. The BQC chip's Quad floating-point Processing Unit (QPU) implements the QPX instruction set, as well as the scalar floating-point instructions from the Power ISA. The QPU replaces the A2 core's scalar floating-point unit as used on the PowerEN chip.

QPX instruction set architecture. The QPX architecture's computational model is a vector SIMD model, with four execution slots and a register file containing 32 256-bit quad processing registers (QPRs). We can envision each of the 32 registers as containing four 64-bit elements, whereby each execution slot operates on one vector element.

The QPX architecture contains a full set of arithmetic instructions, including fused multiply-add (FMA) instructions; a new set of permute instructions for data shuffling across execution slots; and comparison, conversion, and move instructions. In addition, the QPX architecture contains a novel set of complex arithmetic instructions, which go beyond the traditional repertoire of SIMD lane-oriented computation by allowing execution slots to operate on data from adjacent slots. A collection of load/store instructions for complex data complements the complex arithmetic instructions.

Certain QPX store instructions provide a novel mechanism for the detection and indication of numerically exceptional conditions at the store interface. A Store Indicate NaN or Store Indicate Infinity exception occurs when the source operand of a Store with Indicate instruction in any of the four execution slots contains a NaN or Infinity value.

Scalar floating-point load instructions, as defined in the Power ISA, cause a replication of the source data across all elements of the target register. Scalar floating-point move, arithmetic, rounding and conversion, comparison, and select instructions, defined in the Power ISA, are executed in execution slot 0 and operate on element 0 QPRs, which serve as both the scalar floating-point registers (FPRs) for scalar instructions and the element 0 QPRs for vector instructions.

QPU design. Each execution slot within the QPU embodies a fused multiply-add (MAD) data flow pipeline, which is the machinery that calculates the mathematical expression [(A × C) + B], where A, B, and C are the operands defined by the Power and QPX ISAs. The QPU instantiates four copies of this data flow, creating a four-way SIMD floating-point microarchitecture, as Figure 2 shows.

Each execution slot has a register file holding one element of each vector register. The register file contains the state of four threads, to support the processor core's four-way multithreading capabilities. Physically, the 2W4R register file is built out of two identical copies of a 2W2R register file.

Arithmetic operations in the MAD pipeline have a latency of six cycles for back-to-back dependent operations. Data shuffling, select, and move instructions have a latency of two cycles. Some dependent operations involving complex instructions can incur an extra cycle of latency. All data paths are fully pipelined, allowing a new instruction to be issued every cycle.

To protect the QPU's large architected register state from soft errors, we use parity protection on interleaved groups of bits in the register file. This will detect soft-error upsets, including multibit hits. When a bit-parity error is detected, the correction mechanism exploits the redundancy provided by the two identical copies of the register file: when a corrupted data value is detected in one of the operands, a recovery operation is performed, in which data from the good copy overwrites the corrupted register file entry.

The QPU's peak performance is four FMA operations (that is, eight double-precision floating-point operations) per cycle. At 1.6 GHz, an A2 processor core and its QPU therefore have a peak performance of 12.8 Gflops. The aggregate peak performance of the 16 user processors on the BQC chip is 204.8 Gflops.
Figure 2. The BG/Q Quad floating-point Processing Unit (QPU) consists of a Quad Processing Register file, four parallel double-precision floating-point multiply-add (MAD) pipelines, and a Permute unit.
L1 prefetch unit

Another BG/Q-specific augmentation to the A2 processor core is the L1 prefetch unit (L1P). The L1P implements two types of automatic data prefetching to try to hide the latency of accesses to the L2 cache and DDR memory. Data is prefetched as 128-byte L2 cache lines (twice the L1 cache line length) and stored in a buffer capable of holding 32 prefetch lines. Full coherency is maintained for all prefetched data lines.

The first type of prefetching is an enhanced version of the stream prefetching implemented in BG/L and BG/P. Stream-prefetching hardware identifies a sequence of loads from increasing, contiguous memory locations, and then prefetches additional contiguous data in this sequence.

The second type of prefetching, called list prefetching, is designed to anticipate a sequence of loads made from an arbitrary series of addresses, when that sequence of load addresses is accessed more than once.

Stream prefetching. For L1 misses, the two previous Blue Gene machines incorporate hardware that automatically detects and prefetches data accessed in a sequence of increasing contiguous memory addresses, a classic form of prefetching.4 The BG/L chip prefetches one 128-byte line and a subsequent 128-byte line when a second L1 miss address lies within either line. Thus, BG/L prefetches streams of depth two and can maintain seven active streams within the 15 lines that the prefetch buffer can store. The BG/P chip implements a more flexible system in which L1 misses to increasing prefetched addresses can trigger the prefetch of two consecutive 128-byte data lines, thus supporting streams of depth three in the case of continued, uninterrupted access to that stream.

The increased memory latency, relative to clock period, of the BG/Q memory system and the demands of four threads per core, each potentially accessing multiple streams, result in increased pressure on prefetch buffer storage in fixed-depth prefetch schemes. To minimize the prefetch storage requirements, we designed a powerful adaptive stream-prefetching engine. The engine assigns a variable depth to each established stream, up to a maximum of 16 streams. The largest supported depth is eight. Each stream begins with a default preloaded
depth that is typically small (for example, two). When a stream is first established, data is prefetched up to this initial depth. Subsequent prefetches are triggered when an L1 miss address lies within the stream. Such hits on data that have not yet been retrieved from memory trigger an adaptation event that increases that stream's depth. That depth is stolen from the least recently hit stream, keeping the total depth for all active streams at or below the prefetch buffer's 32-line capacity. With this demand-driven adaptation, buffer thrashing is reduced when many streams are active. The device will adapt to operate efficiently over a range of workloads, from a single stream from one thread to multiple streams and multiple threads.

List prefetching. The second prefetch scheme targets repetitive, deterministic code, such as that in iterative solvers, in which the same sequence of addresses is accessed again and again. This new method requires the beginning and end of a repeating code segment to be identified in the application program. Once activated, each list-prefetching engine (one per thread) will capture the subsequent series of L1 miss addresses. The recording of addresses is controllable with memory-page granularity. The captured miss addresses are packed in a list written to memory.

When this repetitive portion of code is later executed again, the recorded sequence is read from memory and the list of addresses is used to prefetch the needed data. The loading of this recorded address sequence and the subsequent data prefetching must be synchronized with code execution so that the data is prefetched in time to be used, but not so far in advance as to waste space in the prefetch buffer. We accomplish this by stalling the processor at the beginning of the repetitive code until the first of the recorded addresses has been loaded into the prefetch engine. The list-prefetch engine then matches the L1 misses with these recorded addresses and will start prefetching the data at list addresses that follow the matched address, up to a preloaded depth. This address matching and data prefetching continue until a terminating end-of-list marker is reached in the list of recorded addresses or until the programmed maximum length of a list is reached.

Further refinements increase the tolerance to variations that are expected in a multiprocessor environment. Perhaps most important is the address-matching algorithm's ability to compare each L1 miss against not only the next unmatched recorded address, but also up to seven subsequent addresses. Thus, if one or more recorded L1 misses do not reoccur, they can be skipped over without losing synchronization. Furthermore, support is provided to continue the recording of L1 miss addresses on subsequent iterations through the code. This new list can then replace the list recorded earlier, allowing a self-healing evolution as the pattern of L1 misses changes.

Finally, added hardware control lets us implement important software features. When the application encounters a segment of code with nonrepetitive data access (such as a table look-up), it can pause and later resume the list prefetching. Similarly, a function call can stop list prefetching and save the state of the prefetching engine when a subroutine is entered, and restore that state and continue prefetching when the function returns, allowing standard and modular code to exploit list prefetching.

WakeUp unit

Each processor core has its own external WakeUp unit, whose main purpose is to increase overall application performance by reducing thread-to-thread interactions arising from software blocked in a spin loop or similar blocking polling loop. Because the processor core has four threads but performs at most one integer instruction and one floating-point instruction per processor cycle, a thread in a polling loop takes cycles from the other threads. This cost is highest if the polled variable is L1-cached, because the loop's frequency is highest. In general, degradation will occur whenever the polling thread competes for hardware resources with other threads.

Implementing a wait procedure using the WakeUp unit will incur less degradation than a polling loop implemented in software. Upon entering a wait, software writes a base address and enable mask for the polled address range to a wait address compare (WAC)
register in the WakeUp unit. Subsequently, an instruction is executed to pause the corresponding hardware thread. When any of the other 67 hardware threads, the message unit, or the PCI Express unit writes a matching address, the WakeUp unit will drive the thread's pause/enable signal to resume execution. The resumed program then reads the data from the matching address range. If the desired condition has indeed been reached, the polling loop is exited. Otherwise, the program again configures the WakeUp unit and pauses itself.

The WakeUp unit has 12 WACs; each WAC snoops all writes on the chip by snooping the local store operations and incoming L1 invalidates. The WakeUp unit can also resume a thread using interrupts (such as network packet arrivals). Each of the core's four hardware threads has a status and enable register in the WakeUp unit, so that software can select the WACs and interrupts used to resume.

The WakeUp unit also outputs a logical OR of a programmable subset of the 12 WACs. System software can use this function to generate an external interrupt to the A2. For example, it can use this interrupt to recognize a stack overflow.

Processor unit redundancy

The A2 core, QPU, L1P, and WakeUp units comprise a processor unit (PU), which is replicated 18 times on the chip (see Figure 1). Because user and system software use only 17 PUs, one of these PUs is redundant. The redundant PU increases manufacturing yield. If we find a defect in a PU during manufacturing test (at the wafer, module, card, or system level), we can quiesce the defective PU and activate the redundant PU. PU-sparing information is carried with the physical chip in eFuses or in an on-card EEPROM, and is activated on the chip at boot time. PU sparing is transparent to the end user—that is, system and user software will see 17 contiguous operational processors, logically numbered 0 through 16, irrespective of which physical processor (if any) has been spared out.

Crossbar

The crossbar switch is the central connection structure between all the PUs (specifically, the L1P units), the L2 cache, the networking logic, and various low-bandwidth units on the chip. The crossbar and L2 cache are clocked at half the processor frequency, with a separate 800-MHz clock grid. The crossbar has 22 master ports (18 PUs, three network logic ports, and a DevBus port used by PCI Express) and 18 slave ports (16 L2 slices, the network logic, and the DevBus slave). The crossbar actually consists of three switches, one each for request traffic, response traffic, and invalidation traffic. The response switch has a 32-byte-wide read data path from all slave devices, such as L2 caches, to all master devices, such as L1P units. The aggregate maximum read bandwidth from all L2 slices is 409.6 Gbytes/s. The write data path is 16 bytes wide, delivering an aggregate write bandwidth of 153.6 Gbytes/s (under simultaneous reads).

L2 cache

All processor cores share the L2 cache. To provide sufficient bandwidth, the cache is split into 16 slices. Each slice is 16-way set associative, operates in write-back mode, and has a 2-Mbyte capacity. Physical addresses are scattered across the slices via a programmable hash function to achieve uniform slice utilization.

The L2 cache serves as the point of coherence, as well as the control point for atomic accesses using the Power ISA's Load and Reserve/Store Conditional (larx/stcx) instructions.

Given the large number of concurrent threads and the target of high aggregate performance, we designed the L2 cache to provide hardware-assist capabilities to accelerate sequential code as well as thread interactions. Specifically, the L2 cache provides memory-speculation support and atomic memory-update operations.

Memory speculation: transactional memory (TM) and speculative execution (SE). The BQC chip implements hardware support for memory speculation in the form of a multiversioning L2 cache. During memory speculation, the L2 cache tracks state changes caused by speculative threads and keeps them separate from the main memory state. At the end of a speculative code section, the modifications can either be reverted (invalidate) or
made permanent (commit). This allows programmers to write code sections without having to reason in advance about the correctness of concurrent execution. The L2 cache will track access conflicts between threads, letting a speculation runtime system reason about the correctness of execution. This hardware support enables transactional memory (TM) and speculative execution (SE).

For TM, sections of a parallel program are annotated to be executed atomically and in isolation. Data that is speculatively written is visible only to the thread that has written it. The L2 cache tracks Read-after-Write (RAW), Write-after-Write (WAW), and Write-after-Read (WAR) conflicts. The program can configure the L2 cache to react with an invalidation or with a notification to software, or both. In the second case, software can decide which thread to invalidate to resolve the conflict. In the common case, the invalidated thread will retry to execute the section.

For SE, a sequential program is partitioned into tasks, which execute speculatively in parallel. For SE, there is a definite order of precedence, namely the expected sequential execution behavior, which hardware uses to resolve access conflicts. Data written by threads earlier in sequential semantics (older) is forwarded to threads later in sequential semantics (younger). Writes of older threads to locations already read by younger threads (WAR conflict) are signaled to the runtime system. The partitioning into speculative tasks can be controlled by #pragma statements similar to those of OpenMP. For example, the following code causes the loop's iterations to execute speculatively in parallel:

#pragma speculative for
for (i = 0; i < N; i++) {
    ...
}

Speculative versions in the L2 cache are tagged with dynamically allocated short speculation IDs instead of thread IDs. The system provides more speculation IDs than there are hardware threads. Commit and invalidate operations are mere state transitions of the speculation ID—consequently, they're very fast. A background scrub executes the altering of the directory tags on commit and invalidate, while the hardware thread simply allocates another ID and proceeds with executing the next speculative section in parallel.

There are 128 speculation IDs. Programs can divide the set of IDs into 1, 2, 4, 8, or 16 domains. Each domain can operate in either TM mode or SE mode, allowing different hardware threads to use all modes concurrently. Each SE domain maintains its order locally for its IDs, letting ID allocations proceed independently per domain.

L2 atomic operations. The BQC chip's L2 cache logic also provides an extensive set of atomic operations, intended to accelerate software operations, such as locking and barriers, and to enable efficient work-queue management with multiple producers and consumers. The atomic operations are 8-byte load or store operations that can modify the value at the given memory address. Each L2 slice can process an atomic operation every four processor clock cycles. This high throughput enables some simple synchronization constructs and concurrent data structures that scale well to the many BG/Q threads. As an example, to perform a ticket lock wherein each thread obtains a unique number, 64 threads can use the L2 atomic Load-Increment operation to obtain their unique values in 78 (initial latency) + 64 (threads) × 4 (cycles) = 334 cycles in an ideal case. An equivalent atomic increment implemented using general-purpose load-