Architecture of Parallel Computers CSC / ECE 506

BlueGene Architecture

Lecture 24

7/31/2006

Dr. Steve Hunter

BlueGene/L Program

• December 1999: IBM Research announced a five-year, $100M US effort to build a petaflop/s-scale supercomputer to attack science problems such as protein folding. Goals:
– Advance the state of the art of scientific simulation.
– Advance the state of the art in design and software for capability and capacity markets.

• November 2001: Announced a research partnership with Lawrence Livermore National Laboratory (LLNL).
• November 2002: Announced the planned acquisition of a BG/L machine by LLNL as part of the ASCI Purple contract.

• May 11, 2004: Four racks of DD1 hardware (4,096 nodes at 500 MHz) ran Linpack at 11.68 TFlop/s, ranked #4 on the 23rd Top500 list.

• June 2, 2004: Two racks of DD2 hardware (1,024 nodes at 700 MHz) ran Linpack at 8.655 TFlop/s, ranked #8 on the 23rd Top500 list.
• September 16, 2004: 8 racks ran Linpack at 36.01 TFlop/s.
• November 8, 2004: 16 racks ran Linpack at 70.72 TFlop/s, ranked #1 on the 24th Top500 list.
• December 21, 2004: First 16 racks of BG/L accepted by LLNL.

BlueGene/L Program

• A massive collection of low-power CPUs instead of a moderate-sized collection of high-power CPUs.
– A joint development of IBM and DOE's National Nuclear Security Administration (NNSA), installed at DOE's Lawrence Livermore National Laboratory.
• BlueGene/L has occupied the No. 1 position on the last three TOP500 lists (http://www.top500.org/).
– It has reached a Linpack benchmark performance of 280.6 TFlop/s ("teraflops," or trillions of calculations per second) and remains the only system ever to exceed the 100 TFlop/s level.
– BlueGene systems hold the #1, #2, and #8 positions in the top 10.
• "Objective was to retain exceptional cost/performance levels achieved by application-specific machines, while generalizing the massively parallel architecture enough to enable a relatively broad class of applications" - Overview of BG/L system architecture, IBM JRD.
– The design approach was to use a very high level of integration, which made simplicity in packaging, design, and bring-up possible.

– The JRD issue is available at: http://www.research.ibm.com/journal/rd49-23.html

BlueGene/L Program

• BlueGene is a family of supercomputers.
– BlueGene/L is the first step, aimed at being a multipurpose, massively parallel, and cost-effective supercomputer (12/04).
– BlueGene/P is the petaflop generation (12/06).
– BlueGene/Q is the third generation (~2010).
• Requirements for future generations:
– Processors will be more powerful.
– Networks will have higher bandwidth.
– Applications developed on BlueGene/L will run well on BlueGene/P.

BlueGene/L Fundamentals

• Low-complexity nodes give more performance per transistor and per watt.
• A 3D interconnect suits many scientific simulations, since nature as we see it is three-dimensional.

BlueGene/L Fundamentals

• Cellular architecture
– Large numbers of low-power, more efficient processors interconnected
• Rmax of 280.6 TFlop/s
– Maximal LINPACK performance achieved
• Rpeak of 360 TFlop/s
– Theoretical peak performance (see the arithmetic sketch below)
• 65,536 dual-processor compute nodes
– 700 MHz IBM PowerPC 440 processors
– 512 MB memory per compute node, 16 TB in the entire system
– 800 TB of disk space
• 2,500 square feet of floor space
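The peak figure can be cross-checked from the node count and clock rate quoted above; a rough derivation, assuming (as is standard for counting BG/L peak) that each core's double FPU sustains two fused multiply-adds, i.e. four floating-point operations, per cycle:

$$ 65{,}536\ \text{nodes} \times 2\ \tfrac{\text{cores}}{\text{node}} \times 4\ \tfrac{\text{flops}}{\text{cycle}} \times 700\ \text{MHz} \approx 367\ \text{TFlop/s}, $$

consistent with the ~360-367 TFlop/s peak figures quoted elsewhere in this deck.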

Comparing Systems (Peak)

[Chart: "Supercomputer Peak Performance" - peak speed (flops) versus year introduced, 1940-2010, spanning ENIAC (vacuum tubes), transistor and IC machines, vector and parallel-vector systems, and MPPs up through ASCI Red, ASCI White, ASCI Q, the Earth Simulator, Red Storm, Thunder, and Blue Gene/L, with petaflop and multi-petaflop systems projected. Doubling time is approximately 1.5 years.]

Comparing Systems (Byte/Flop)

System                      Bytes/Flop                          Network    Year
Red Storm                   2.0                                            2003
                            2.0                                            2002
Intel Paragon               1.8                                            1992
nCUBE/2                     1.0                                            1990
ASCI Red                    1.0 (0.6)                                      1997
T3E                         0.8                                            1996
BG/L                        1.5 = 0.75 (torus) + 0.75 (tree)               2004
Cplant                      0.1                                            1997
ASCI White                  0.1                                            2000
ASCI Q                      0.05                                Quadrics   2003
ASCI Purple                 0.1                                            2004
Intel Cluster               0.1                                 IB         2004
Intel Cluster               0.008                               GbE        2003
Virginia Tech               0.16                                IB         2003
Chinese Acad. of Sciences   0.04                                QsNet      2003
NCSA - Dell                 0.04                                Myrinet    2003

Comparing Systems (GFlops/Watt)

• Power efficiencies of recent supercomputers (chart from the IBM Journal of Research and Development):
– Blue: IBM machines
– Black: other US machines
– Red: Japanese machines

Comparing Systems

                      ASCI White   ASCI Q   Earth Simulator   Blue Gene/L
Machine Peak (TF/s)         12.3       30        40.96               367
Total Memory (TBytes)          8       33        10                   32
Footprint (sq ft)         10,000   20,000        34,000            2,500
Power (MW)*                    1      3.8        6-8.5               1.5
Cost ($M)                    100      200        400                 100
# Nodes                      512    4,096        640              65,536
Clock (MHz)                  375    1,000        500                 700

* 10 megawatts is the approximate usage of 11,000 households

BG/L Summary of Performance Results

• DGEMM (Double-precision GEneral Matrix-Multiply):
– 92.3% of dual-core peak on 1 node
– Observed performance at 500 MHz: 3.7 GFlop/s
– Projected performance at 700 MHz: 5.2 GFlop/s (tested in the lab up to 650 MHz)
• LINPACK:
– 77% of peak on 1 node
– 70% of peak on 512 nodes (1,435 GFlop/s at 500 MHz)
• sPPM (simplified Piecewise Parabolic Method), UMT2000:
– Single-processor performance roughly on par with POWER3 at 375 MHz
– Tested on up to 128 nodes (also NAS Parallel Benchmarks)
• FFT (Fast Fourier Transform):
– Up to 508 MFlop/s on a single processor at 444 MHz (TU Vienna)
– Pseudo-op performance (5N log N) at 700 MHz of 1,300 MFlop/s (65% of peak)
• STREAM (impressive results even at 444 MHz; a sketch of the Triad kernel follows below):
– Tuned: Copy: 2.4 GB/s, Scale: 2.1 GB/s, Add: 1.8 GB/s, Triad: 1.9 GB/s
– Standard: Copy: 1.2 GB/s, Scale: 1.1 GB/s, Add: 1.2 GB/s, Triad: 1.2 GB/s
– At 700 MHz: would beat the STREAM numbers of most high-end microprocessors
• MPI:
– Latency: < 4,000 cycles (5.5 µs at 700 MHz)
– Bandwidth: full link bandwidth demonstrated on up to 6 links
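For context on the STREAM figures above, the Triad result measures sustainable memory bandwidth on a loop of the following shape. This is a minimal stand-alone sketch of that kernel, not the official STREAM benchmark code and not BG/L-specific; the array size and the clock()-based timing are simplifications.

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (8 * 1024 * 1024)   /* large enough that the arrays do not fit in cache */

int main(void)
{
    double *a = malloc(N * sizeof *a);
    double *b = malloc(N * sizeof *b);
    double *c = malloc(N * sizeof *c);
    const double scalar = 3.0;

    for (long i = 0; i < N; i++) { a[i] = 0.0; b[i] = 1.0; c[i] = 2.0; }

    clock_t t0 = clock();
    for (long i = 0; i < N; i++)       /* Triad: a = b + scalar * c          */
        a[i] = b[i] + scalar * c[i];   /* two loads + one store per element  */
    double secs = (double)(clock() - t0) / CLOCKS_PER_SEC;

    /* Three 8-byte words of memory traffic are counted per iteration. */
    printf("Triad: %.2f GB/s (check: a[1] = %g)\n",
           3.0 * N * sizeof(double) / secs / 1e9, a[1]);

    free(a); free(b); free(c);
    return 0;
}
```

The Copy, Scale, and Add figures in the bullets are measured the same way with correspondingly simpler loop bodies.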

BlueGene/L Architecture

• To achieve this level of integration, the machine was developed around a processor with moderate frequency, available in system-on-a-chip (SoC) technology.
– This approach was chosen because of its performance/power advantage.
– In terms of performance per watt, the low-frequency, low-power, embedded IBM PowerPC core consistently outperforms high-frequency, high-power microprocessors by a factor of 2 to 10.
– Industry focus is on performance per rack (a worked example follows below):
» Performance / rack = Performance / watt × Watts / rack
» Watts / rack ≈ 20 kW, for power-delivery and cooling reasons
• Power and cooling
– Using conventional techniques, a 360 TFlop/s machine would require 10-20 megawatts.
– BlueGene/L uses only 1.76 megawatts.
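A worked instance of the identity above, using only numbers quoted in this deck (360 TFlop/s peak, 1.76 MW, and the roughly 20 kW rack budget):

$$ \frac{\text{perf}}{\text{watt}} \approx \frac{360\ \text{TFlop/s}}{1.76\ \text{MW}} \approx 0.2\ \tfrac{\text{GFlop/s}}{\text{W}}, \qquad \frac{\text{perf}}{\text{rack}} \approx 0.2\ \tfrac{\text{GFlop/s}}{\text{W}} \times 20\ \text{kW} \approx 4\ \text{TFlop/s per rack}. $$

This is a back-of-the-envelope figure intended to show the scale, not an exact per-rack specification.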

Power Density Growth

System Power Comparison

BG/L (2,048 processors): 20.1 kW          450 ThinkPads: 20.3 kW
(Comparison by L. S. Mok, 4/2002)
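The per-unit arithmetic implied by the two totals above:

$$ \frac{20.1\ \text{kW}}{2{,}048\ \text{processors}} \approx 9.8\ \text{W per BG/L processor}, \qquad \frac{20.3\ \text{kW}}{450} \approx 45\ \text{W per ThinkPad}. $$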

BlueGene/L Architecture

• Networks were chosen with extreme scaling in mind:
– Scale efficiently in terms of both performance and packaging
– Support very small messages
» As small as 32 bytes
– Include hardware support for collective operations
» Broadcast, reduction, scan, etc.
• Reliability, Availability and Serviceability (RAS) is another critical issue for scaling:
– BG/L needs to be reliable and usable even at extreme scaling limits
– 20 failures per 1,000,000,000 hours per node ≈ 1 node failure every 4.5 weeks across the full machine (the arithmetic is sketched below)
• System software and monitoring are also important to scaling:
– BG/L is designed to efficiently utilize a distributed-memory, message-passing programming model
– MPI is the dominant message-passing model, with hardware features added and parameters tuned to support it
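Reading the quoted failure rate as a per-node rate of 20 failures per billion hours (20 FITs), the 4.5-week figure follows for a 65,536-node machine:

$$ \frac{10^{9}\ \text{hours}}{20 \times 65{,}536} \approx 763\ \text{hours} \approx 4.5\ \text{weeks between node failures}. $$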

RAS (Reliability, Availability, Serviceability)

• System designed for RAS from top to bottom
– System issues
» Redundant bulk supplies, power converters, fans, DRAM bits, cable bits
» Extensive data logging (voltage, temperature, recoverable errors, ...) for failure forecasting
» Nearly no single points of failure
– Chip design
» ECC on all SRAMs
» All dataflow outside the processors is protected by error-detection mechanisms
» Access to all state via a noninvasive back door
– Low-power, simple design leads to higher reliability
– All interconnects have multiple error-detection and correction coverage
» Virtually zero escape probability for link errors

BlueGene/L System

136.8 TFlop/s on LINPACK (64K processors); 1 TFlop/s = 1,000,000,000,000 floating-point operations per second. (Rochester Lab, 2005)

BlueGene/L System

BlueGene/L System

BlueGene/L System

Physical Layout of BG/L

Midplanes and Racks

The Compute Chip

• System-on-a-chip (SoC)
• 1 ASIC containing:
– 2 PowerPC processors
– L1 and L2 caches
– 4 MB embedded DRAM
– DDR DRAM interface and DMA controller
– Network connectivity hardware
– Control / monitoring equipment (JTAG)

Compute Card

Node Card

BlueGene/L Compute ASIC

[Block diagram of the compute ASIC. Recoverable details: two PowerPC 440 cores (one usable as an I/O processor), each with a "double FPU" and 32 KB / 32 KB L1 caches, attached through a 4:1 PLB to L2 caches; a shared L3 directory over 4 MB of embedded DRAM usable as L3 or shared memory, with ECC; a multiported shared SRAM buffer with snoop support; a 144-bit-wide DDR interface with ECC to 256/512 MB of external DDR; Gbit Ethernet; JTAG control access; the torus network (6 links out and 6 in, each at 1.4 Gbit/s); the tree network (3 out and 3 in, each at 2.8 Gbit/s); and 4 global barriers/interrupts. Technology: IBM CU-11, 0.13 µm; 11 x 11 mm die; 25 x 32 mm CBGA package; 474 pins (328 signal); 1.5/2.5 V.]

BlueGene/L Interconnect Networks

3-Dimensional Torus
– Main network, for point-to-point communication
– High-speed, high-bandwidth
– Interconnects all compute nodes (65,536)
– Virtual cut-through hardware routing
– 1.4 Gb/s on all 12 node links (2.1 GB/s per node)
– 1 µs latency between nearest neighbors, 5 µs to the farthest node
– 4 µs latency for one hop with MPI, 10 µs to the farthest node
– Communications backbone for computations
– 0.7/1.4 TB/s bisection bandwidth, 68 TB/s total bandwidth

Global Tree
– One-to-all broadcast functionality
– Reduction operations functionality
– MPI collective operations in hardware (see the sketch below)
– Fixed-size 256-byte packets
– 2.8 Gb/s of bandwidth per link
– Latency of a one-way tree traversal: 2.5 µs
– ~23 TB/s total binary-tree bandwidth (64K machine)
– Interconnects all compute and I/O nodes (1,024 I/O nodes)
– Also guarantees reliable delivery

Ethernet
– Incorporated into every node ASIC
– Active in the I/O nodes (1 I/O node per 64 compute nodes)
– Carries all external communication (file I/O, control, user interaction, etc.)

Low-Latency Global Barrier and Interrupt
– Latency of a round trip: 1.3 µs

Control Network
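Because the global tree implements broadcast and reduction in hardware and BG/L's MPI maps collective operations onto it (per the bullets above), ordinary MPI collectives are the natural way to exploit that network. The sketch below is generic MPI with nothing BG/L-specific in the source; on BG/L it is the library, not the application, that may route these calls over the tree.

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* One-to-all broadcast: a pattern the global tree can service directly. */
    double params[4] = {0};
    if (rank == 0) { params[0] = 1.0; params[1] = 2.0; }
    MPI_Bcast(params, 4, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    /* Global reduction: another tree-friendly pattern. */
    double local = (double)rank, global_sum = 0.0;
    MPI_Allreduce(&local, &global_sum, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum of ranks across %d processes = %g\n", nprocs, global_sum);

    MPI_Finalize();
    return 0;
}
```

The point is that MPI_Bcast and MPI_Allreduce are not merely convenience wrappers here: on BG/L they can be serviced by dedicated collective hardware rather than by trees of point-to-point messages over the torus.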

The Torus Network

• 3-dimensional: 64 x 32 x 32
– Each compute node is connected to its six neighbors: x+, x-, y+, y-, z+, z-
– A compute card is 1 x 2 x 1
– A node card is 4 x 4 x 2 (16 compute cards in a 4 x 2 x 2 arrangement)
– A midplane is 8 x 8 x 8 (16 node cards in a 2 x 2 x 4 arrangement)

• Communication path
– Each unidirectional link is 1.4 Gb/s, or 175 MB/s
– Each node can send and receive at 1.05 GB/s
– Supports cut-through routing, along with both deterministic and adaptive routing
– Variable-sized packets of 32, 64, 96, ..., 256 bytes
– Guarantees reliable delivery
(An MPI sketch of nearest-neighbor communication on this kind of topology follows below.)
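A common way to program against a 3D torus from MPI is a periodic Cartesian communicator, so that each rank addresses its six x±, y±, z± neighbors directly. This is a minimal, generic MPI sketch rather than BG/L-specific code; with reorder enabled, an implementation such as BG/L's may place logical neighbors on physically adjacent torus nodes, but that mapping is up to the library.

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int nprocs, rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Factor the job size into a 3D grid and make every dimension periodic (a torus). */
    int dims[3] = {0, 0, 0}, periods[3] = {1, 1, 1};
    MPI_Dims_create(nprocs, 3, dims);

    MPI_Comm torus;
    MPI_Cart_create(MPI_COMM_WORLD, 3, dims, periods, /*reorder=*/1, &torus);
    MPI_Comm_rank(torus, &rank);

    int coords[3];
    MPI_Cart_coords(torus, rank, 3, coords);

    /* Neighbors along x; dimensions 1 and 2 (y, z) work the same way. */
    int xminus, xplus;
    MPI_Cart_shift(torus, 0, 1, &xminus, &xplus);

    /* Canonical nearest-neighbor exchange: send "up" in x, receive from "down". */
    double send = (double)rank, recv = -1.0;
    MPI_Sendrecv(&send, 1, MPI_DOUBLE, xplus, 0,
                 &recv, 1, MPI_DOUBLE, xminus, 0,
                 torus, MPI_STATUS_IGNORE);

    printf("rank %d at (%d,%d,%d): got %g from x- neighbor %d\n",
           rank, coords[0], coords[1], coords[2], recv, xminus);

    MPI_Comm_free(&torus);
    MPI_Finalize();
    return 0;
}
```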

Complete BlueGene/L System at LLNL

[System diagram: the 65,536 BG/L compute nodes connect through 1,024 BG/L I/O nodes to a federated Gigabit Ethernet (1,024 ports toward the I/O nodes; 2,048 ports total). The Ethernet fabric also attaches 8 front-end nodes, the service node, and the control network (8 links each), plus 48 links to the WAN, 64 to visualization, 128 to the archive, and 512 to the cluster-wide file system (CWFS).]

System Software Overview

• Operating system: Linux
• Compilers: IBM XL C, C++, Fortran95
• Communication: MPI, TCP/IP
• Parallel file system: GPFS, NFS support
• System management: extensions to CSM
• Job scheduling: based on LoadLeveler
• Math libraries: ESSL

BG/L Software Hierarchical Organization

• Compute nodes are dedicated to running the user application and almost nothing else: a simple compute node kernel (CNK)
• I/O nodes run Linux and provide a more complete range of OS services: files, sockets, launch, signaling, debugging, and termination
• The service node performs system management services (e.g., heartbeating, error monitoring), transparently to application software

BG/L System Software

• Simplicity
– Space-sharing
– Single-threaded
– No demand paging
• Familiarity
– MPI (MPICH2)
– IBM XL compilers for PowerPC

Operating Systems

• Front-end nodes are commodity systems running Linux
• I/O nodes run a customized Linux kernel
• Compute nodes use an extremely lightweight custom kernel
• The service node is a single multiprocessor machine running a custom OS

Compute Node Kernel (CNK)

• Single-user, dual-threaded
• Flat address space, no paging
• Physical resources are memory-mapped
• Provides standard POSIX functionality (mostly)
• Two execution modes:
– Virtual node mode (both cores run application processes)
– Coprocessor mode (one core computes, the other handles communication)

Service Node OS

• Core Management and Control System (CMCS)
• BG/L's "global" operating system
• MMCS: Midplane Monitoring and Control System
• CIOMAN: Control and I/O Manager
• DB2 relational database

Running a User Job

• Jobs are compiled on, and submitted from, a front-end node
• An external scheduler is used
• The service node sets up the partition and transfers the user's code to the compute nodes
• All file I/O is done using standard Unix calls (via the I/O nodes)
• Post-facto debugging is done on the front-end nodes

Performance Issues

• User code is easily ported to BG/L.
• However, an efficient MPI implementation requires effort and skill:
– Torus topology instead of a crossbar
– Special hardware, such as the collective (tree) network

BG/L MPI Software Architecture

Legend: GI = Global Interrupt; CIO = Control and I/O Protocol; CH3 = the primary communication device distributed with MPICH2; MPD = Multipurpose Daemon

MPI_Bcast

MPI_Alltoall

References

• IBM Journal of Research and Development, Vol. 49, No. 2/3 (2005).
– http://www.research.ibm.com/journal/rd49-23.html
» "Overview of the Blue Gene/L system architecture"
» "Packaging the Blue Gene/L supercomputer"
» "Blue Gene/L compute chip: Memory and Ethernet subsystems"
» "Blue Gene/L torus interconnection network"
» "Blue Gene/L programming and operating environment"
» "Design and implementation of message-passing services for the Blue Gene/L supercomputer"

References (cont.)

• BG/L homepage @ LLNL:
• BlueGene homepage @ IBM:

The End
