IBM Blue Gene/Q Overview PRACE Winter School February 6-10, 2012 Pascal Vezolle vezolle@fr..com

Blue Gene Roadmap
Goals:
– Three orders of magnitude performance in 10 years
– Push the state of the art in power efficiency, scalability, and reliability
– Enable unprecedented application capability
– Exploit new technologies: PCM, photonics, 3DP

[Roadmap chart: Blue Gene/L (PPC 440 @700MHz, 596+ TF, ~2004), Blue Gene/P (PPC 450 @850MHz, goal 1+ PF, ~2008), Blue Gene/Q (in progress, 20+ PF, ~2012); goals: lay the groundwork for ExaFlop and usability, address the power efficiency, reliability, and technology challenges.]

Understanding Blue Gene

 BG/P – BG/Q architecture and hardware overview
 IO Node
 Processor
 Network – 5D Torus
 BG/Q software and programming model
 – BG/Q innovative features: atomics, wake-up unit, TM, TLS
 – Partitioning
 – Programming model
 – BG/Q kernel overview
 – User environment

Blue Gene Evolution

 BG/L (5.7 TF/rack)
 – 130nm ASIC (1999-2004 GA)
 – Embedded 440 core, dual-core system-on-chip
 – Memory: 0.5/1 GB/node
 – Biggest installed system (LLNL): 104 racks, 212,992 cores, 596 TF/s, 210 MF/W

 BG/P (13.9 TF/rack)
 – 90nm ASIC (2004-2007 GA)
 – Embedded 450 core, quad-core SOC, DMA
 – Memory: 2/4 GB/node
 – Biggest installed system (Jülich): 72 racks, 294,912 cores, 1 PF/s, 357 MF/W
 – SMP support, OpenMP, MPI

 BG/Q (209 TF/rack)
 – 45nm ASIC (2007-2012 GA)
 – A2 core, 16-core/64-thread SOC
 – 16 GB/node
 – Biggest installed system (LLNL): 96 racks, 1,572,864 cores, 20 PF/s, 2 GF/W
 – Sophisticated L1 prefetch, transactional memory, fast thread handoff, compute + IO systems

System Blue Gene/P

[Blue Gene/P packaging hierarchy: Chip (4 processors, 13.6 GF/s, 8 MB EDRAM) → Compute Card (1 chip, 20 DRAMs, 13.6 GF/s, 2.0 GB DDR2, 4.0 GB from 6/30/08) → Node Card (32 chips 4x4x2, 32 compute and 0-1 IO cards, 435 GF/s, 64 (128) GB) → Rack (32 node cards, 13.9 TF/s, 2 (4) TB) → System (72 racks cabled 8x8x16, 72x32x32 torus, 1 PF/s, 144 (288) TB).]

Blue Gene/P ASIC

– 4 cores, SMP, 8 MB shared L3
– 0.85 GHz
– 1 thread/core
– 4 GBytes memory

Blue Gene/P System in a Complete Configuration
 Compute nodes dedicated to running user applications, and almost nothing else - simple Compute Node Kernel (CNK)
 I/O nodes run Linux and provide a more complete range of OS services: files, sockets, process launch, signaling, debugging, and termination
 Service node performs system management services (e.g., heartbeating, monitoring errors) - transparent to application software

[System diagram: up to 72 racks of BG/P compute nodes (1024 per rack) and I/O nodes (8 to 64 compute nodes per I/O node) connect over 10GigE to a federated 10-Gigabit switch, which links the front-end nodes, the service node, GPFS file-system nodes, archive, visualization, and the WAN; a separate 1GigE control network connects the service node to the racks.]

Blue Gene/Q packaging hierarchy:
1. Chip: 16 cores
2. Single Chip Module
3. Compute card: one chip module, 16 GB DDR3 memory
4. Node Card: 32 compute cards, optical modules, link chips, torus
5a. Midplane: 16 node cards
5b. IO drawer: 8 IO cards w/16 GB, 8 PCIe Gen2 x8 slots
6. Rack: 2 midplanes
7. System: 20 PF/s

Sustained single-node performance: 10x BG/P, 20x BG/L. MF/Watt: ~6x BG/P, ~10x BG/L (~2 GF/W target).

Blue Gene/Q node board: 32 compute nodes (Node-Board Assembly)

[Node-board assembly photo: compute cards with one node each (32X), fiber-optic ribbons (36X, 12 fibers each), 48-fiber connectors, water hoses, and redundant, hot-pluggable power-supply assemblies.]

BG/Q External I/O

 I/O Network to/from Compute rack
 – 2 links (4 GB/s in, 4 GB/s out) feed an I/O PCIe port
 – Every node card has up to 4 ports (8 links)
 – Typical configurations: 8 ports (32 GB/s/rack), 16 ports (64 GB/s/rack), 32 ports (128 GB/s/rack)
 – Extreme configuration: 128 ports (512 GB/s/rack)

 I/O Drawers
 – 8 I/O nodes/drawer with 8 ports (16 links) to the compute rack
 – 8 PCIe Gen2 x8 slots (32 GB/s aggregate)
 – 4 I/O drawers per compute rack
 – Optional installation of I/O drawers in external racks for extreme bandwidth configurations

I/O Blocks

 I/O Nodes are also combined into blocks
 – All I/O drawers can be grouped into a single block for administrative convenience
 – In normal operation the I/O block(s) remain booted while compute blocks are reconfigured and rebooted
 – I/O blocks do not need to be rebooted to resolve fatal errors from I/O nodes
 – The main rationale for having multiple I/O node partitions is experimentation with different Linux ION kernels

 Can be created via genIOblock

 Locations of IO enclosures can be: – Qxx-Iy (in an IO rack, y is 0 - B) – Rxx-Iy (in a compute rack, y is - F)

BG/Q I/O architecture

[I/O architecture diagram: BG/Q compute racks connect via PCIe to BG/Q I/O nodes, which connect over InfiniBand through a switch to RAID storage and file servers.]

 BlueGene Classic I/O with GPFS clients on the logical I/O nodes
 – Similar to BG/L and BG/P
 – Uses an InfiniBand switch
 – Uses DDN RAID controllers and file servers
 BG/Q I/O nodes are not shared between compute partitions
 – I/O nodes bridge data from function-shipped I/O calls to the parallel file system client
 Components balanced to allow a specified minimum compute partition size to saturate the entire storage array I/O bandwidth

Partitions

 Compute partitions are booted and "attach" to an already booted IO partition that contains the attached IO nodes

 IO block cannot be freed while there are attached compute blocks that are booted

 A compute partition must have at least 1 directly attached link to an IO node that is not marked in error

 When the compute partition is booted, at least one IO node in the IO partition must have an outbound link, and not be marked in error – For large blocks, there must be at least one attached IO node per midplane

IO Node: BG/P vs. BG/Q

BG/P                                         | BG/Q
Minimal MCP-based Linux distro               | Fully featured RHEL 6.x-based Linux distro
Only supported IONs                          | Supports IONs and Log-In Nodes (LNs)
Installed via a static tar file              | RPM-based installation, customizable before and after installation
Ramdisk-based root filesystem                | Hybrid read-only NFS root with in-memory (tmpfs) read/write file spaces
No persistent storage space                  | Per-node persistent file spaces
Rebooted frequently                          | Designed to be booted infrequently
Part of the compute block                    | Independent I/O or LN block associated with one or more compute blocks
Only supported Ethernet                      | Supports PCIe-based 10Gb Ethernet, InfiniBand, and combo Eth/IB cards
Image was a few hundred megabytes in size    | Each image is 5 GB in size
No health monitoring                         | Integrated health monitoring system

Blue Gene/Q processor

BQC processor core (A2 core)

 Simple core, designed for excellent power efficiency and small footprint
 Embedded 64-bit PowerPC compliant
 4 SMT threads typically get a high level of utilization on shared resources
 Design point is 1.6 GHz @ 0.74V
 AXU port allows for unique BG/Q-style floating point
 One AXU (FPU) and one other instruction issue per cycle
 In-order execution

Multithreading

 Four threads issuing to two pipelines – Impact of memory access latency reduced

 Issue
 – Up to two instructions issued per cycle: one integer/load/store/control instruction and one FPU instruction
 – At most one instruction issued per thread

 Flush
 – The pipeline is not stalled on a conflict; instead, instructions of the conflicting thread are invalidated and the thread is restarted at the conflicting instruction
 – Guarantees progress of the other threads

Blue Gene/Q chip architecture
 16+1 core SMP; each core 4-way hardware threaded
 Transactional memory and thread-level speculation
 Quad floating point unit on each core; 204.8 GF peak per node
 Frequency target of 1.6 GHz
 563 GB/s bisection bandwidth to the shared L2 via a full crossbar switch (Blue Gene/L at LLNL has 700 GB/s for the entire system)
 32 MB shared L2 cache (16 x 2 MB slices)
 42.6 GB/s DDR3 bandwidth (1.333 GHz DDR3); two external DDR3 controllers, each channel with chipkill protection
 10 intra-rack interprocessor links, each at 2.0 GB/s
 One I/O link at 2.0 GB/s (to the I/O subsystem; chip I/O shares function with PCI Express)
 8-16 GB memory/node
 ~60 watts max DD1 chip power

[Chip diagram: 16 PPC A2 cores, each with L1 cache, prefetcher, and quad FPU, connect through the full crossbar switch to the 16 x 2 MB L2 slices, the network/DMA unit (10 x 2 GB/s intra-/inter-rack 5-D torus links plus a 2 GB/s I/O link), PCI Express, and the two external DDR3 controllers.]

Scalability Enhancements: the 17th Core

 RAS Event handling and off-load – Reduce O/S noise and jitter – Core-to-Core when necessary

 CIO Client Interface – Asynchronous I/O completion hand-off – Responsive CIO application control client

 Application Agents: privileged application processing – Messaging assist, e.g., MPI pacing thread – Performance and trace helpers

Blue Gene/Q Network: 5D Torus

BG/Q Networks

 Networks
 – 5D torus in compute nodes
 – 2 GB/s bidirectional bandwidth on all (10+1) links; 5D nearest-neighbor exchange measured at ~1.75 GB/s per link
 – Both collective and barrier networks are embedded in this 5D torus network
 – Virtual Cut Through (VCT) routing
 – Floating point addition support in the collective network

 Compute rack to compute rack bisection BW (46x BG/L, 19x BG/P)
 – 20.1 PF: bisection is 2 x 16x16x12 x 2 (bidirectional) x 2 (torus, not mesh) x 2 GB/s link bandwidth = 49.152 TB/s
 – 26.8 PF: bisection is 2 x 16x16x16 x 4 x 2 GB/s = 65.536 TB/s
 – BG/L at LLNL is 0.7 TB/s

 I/O Network to/from Compute rack
 – 2 links (4 GB/s in, 4 GB/s out) feed an I/O PCIe port
 – Every 32-node card has up to 8 I/O links, or 4 ports
 – Every rack has up to 32x8 = 256 links, or 128 ports

 I/O rack
 – 8 I/O nodes/drawer; each node has 2 links from the compute rack and 1 PCIe port to the outside world
 – 12 drawers/rack
 – 96 I/O nodes, or 96 x 4 GB/s (PCIe) = 384 GB/s ≈ 3 Tb/s

Network Performance

 Performance – All-to-all: 97% of peak – Bisection: > 93% of peak – Nearest-neighbor: 98% of peak – Collective: FP reductions at 94.6% of peak – No performance problems identified in network logic

BGQ changes from BGL/BGP

 New Node

– New voltage-scaled processing core (A2) with 4-way SMT
– New SIMD floating point unit (8 flops/clock) with alignment support
– New "intelligent" prefetcher
– 17th processor core for system functions
– Speculative multithreading and transactional memory support with 32 MB of speculative state
– Hardware mechanisms to help with multithreading ("fetch & op", "wake on pin")
– Dual SDRAM-DDR3 memory controllers with up to 16 GB/node

 New Network:

– 5D torus in compute nodes
 • 2 GB/s bidirectional bandwidth on all (10+1) links
 • Bisection bandwidth of 65 TB/s (26 PF/s) / 49 TB/s (20 PF/s); BG/L at LLNL is 0.7 TB/s
– Collective and barrier networks embedded in the 5D torus network
– Floating point addition support in the collective network
– 11th port for auto-routing to the IO fabric

 I/O system

– I/O nodes in separate drawers/rack with private 3D (or 4D) torus – PCI-Express Gen 2 on every node with full sized PCI slot – Two I/O configurations (one traditional, one conceptual)

BG/Q Software & Programming model

BG/Q Software

Property → BG/Q

Overall Philosophy
– Scalability: scale infinitely, added more functionality
– Openness: almost all open
– Shared memory: yes
– Hybrid: 1-64 processes, 64-1 threads

Programming Model
– Low-level general messaging: DCMF, generic parallel program runtimes, wake-up unit
– Programming models: MPI, OpenMP, UPC, ARMCI, Global Arrays, Charm++
– System call interface: Linux/POSIX system calls
– Library/threading: glibc/pthreads

Kernel
– Linking: static or dynamic
– Compute Node OS: CNK, Linux, Red Hat
– I/O Node OS: SMP Linux with SMT, Red Hat

Control
– Scheduling: generic and real-time API
– Run mode: HPC, generalized sub-partitioning, HA with job cont

Generalizing and Research Initiatives
– OS: Linux, ZeptoOS, Plan 9
– Tools: HPC/S Toolkit, dyninst, valgrind
– Financial: Kittyhawk, InfoSphere Streams
– Commercial: Kittyhawk, Cloud, SLAcc, ASF

Blue Gene/Q System Architecture

[System architecture diagram: the Service Node (MMCS, DB2, LoadLeveler, console) and front-end nodes connect to the file-system servers and, over the 10Gb/QDR optical functional network, to the I/O nodes (Linux, ciod, fs client); the I/O nodes reach the compute nodes (CNK plus application) over the collective/torus networks; a separate 1Gb control Ethernet with FPGA/JTAG provides low-level control.]

BG/Q innovations will help programmers cope with an exploding number of hardware threads

 Exploiting a large number of threads is a challenge for all future architectures. This is a key component of the BG/Q research.
 Novel hardware and software are utilized in BG/Q to:

a) Reduce the overhead to hand off work to high numbers of threads used in OpenMP and messaging through hardware support for atomic operations and fast wake up of cores.

b) Multiversioning cache to help in a number of dimensions such as performance, ease of use and RAS.

c) Aggressive FPU to allow for higher single-thread performance for some applications. Most will get a modest bump (10-25%); some a big bump (approaching 300%).

d) "Perfect" prefetching for repeated memory reference patterns in arbitrarily long code segments. Also helps achieve higher single-thread performance for some applications.

BG/Q Node Features

• Atomic operations
 – Pipelined at the L2 rather than retried as in commodity processors
 – Low latency even under high contention
 – Faster OpenMP work hand-off; lowers messaging latency
 [Chart: barrier speed (cycles) vs. number of threads for L2 atomics (with and without invalidates) and lwarx/stwcx, showing the improvement in atomics under contention.]

• Wake-up unit
 – Allows SMT threads to "sleep" while waiting for an event
 – Faster OpenMP work hand-off; lowers messaging latency

• Multiversioning cache
 – Transactional Memory eliminates the need for locks
 – Speculative Execution allows OpenMP threading for sections with data dependencies
 [Diagram: thread flow using transactions - a single MPI task, user-defined parallelism, user-defined transaction start and end, hardware-detected dependency conflict with rollback, and a parallelization completion synchronization point.]

NewTech: L2 Atomics

 L2 implements atomic operations on every 64-bit word in memory
 – Allows specific memory operations to be performed atomically:
  • Load: Increment, Decrement, Clear (plus variations)
  • Store: Twin, Add, OR, XOR, Max (plus variations)
 – The L2 recognizes the operation when specific shadow physical addresses are read/written

 CNK exposes these L2 atomic physical addresses – Applications must pre-register the location of the L2 atomic before access • Otherwise segfault – Fast barrier implementation using L2 atomics available

 CNK also uses L2 atomics internally – Fast performance counters (store w/ add 1) – Locking
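As an illustration of the kind of barrier these operations accelerate, the sketch below implements a sense-reversing fetch-and-add barrier with portable C11 atomics; on BG/Q the increment would instead target a pre-registered L2 atomic address (the SPI calls are not shown here), so the add is pipelined in the L2 rather than retried with lwarx/stwcx loops.

```c
#include <stdatomic.h>

/* Portable sketch of the fetch-and-add barrier that L2 atomics speed up.
 * On BG/Q the counter would live at a pre-registered L2 atomic address. */
typedef struct {
    atomic_uint count;   /* threads that have arrived */
    atomic_uint sense;   /* global sense, flipped by the last arriver */
    unsigned nthreads;
} fad_barrier_t;

void fad_barrier_init(fad_barrier_t *b, unsigned nthreads) {
    atomic_init(&b->count, 0);
    atomic_init(&b->sense, 0);
    b->nthreads = nthreads;
}

void fad_barrier_wait(fad_barrier_t *b, unsigned *local_sense) {
    unsigned my_sense = *local_sense ^ 1u;
    /* Single fetch-and-add per arrival: this is the operation the L2
     * pipelines instead of retrying under contention. */
    if (atomic_fetch_add(&b->count, 1u) + 1u == b->nthreads) {
        atomic_store(&b->count, 0);          /* last thread resets ... */
        atomic_store(&b->sense, my_sense);   /* ... and releases the rest */
    } else {
        while (atomic_load(&b->sense) != my_sense)
            ;  /* spin; on BG/Q a waiting thread could use the wakeup unit */
    }
    *local_sense = my_sense;
}
```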

Atomic Performance results

[Chart: barrier speed using different synchronizing hardware - number of cycles vs. number of threads (1 to 100) for L2 atomics without invalidates, L2 atomics with invalidates, and lwarx/stwcx.]

NewTech: Wakeup Unit

 Used in conjunction with the PowerPC wait instruction

 When a hardware thread is in a wait state, the hardware thread stops executing and the other hardware threads will benefit from the additional available cycles.

 Sends a wakeup signal to a hardware thread

 Configurable wakeup conditions: – WakeUp Address Compare – Messaging Unit activity – Interrupt sources

 Allows active threads to better utilize core resources – Reduce wasted core cycles in polling and spin loops

 Kernel provides application interfaces to utilize the wakeup unit

 Thread guard pages – CNK uses the Wakeup Unit to provide memory protection between stack/heap – Detects violation, but cannot prevent it

NewTech: Transactional Memory

 Performance optimization for critical regions

1. Software threads enter "transactional memory" mode
  • Memory accesses are tracked; writes are not visible outside the thread until committed
2. Threads perform the calculation without locking
3. Hardware automatically detects memory contention conflicts between threads
  • If there is a conflict:
   – TM hardware detects the conflict
   – The kernel decides whether to roll back the transaction or let the thread continue
   – If rolled back, the compiler runtime decides whether to serialize or retry
  • If there are no conflicts, threads commit their memory

 Threads can commit out-of-order
 XL compiler only
 This is a DD2-only feature

Transactional Memory

LOCKS and Deadlock

// WITH LOCKS
void move(T s, T d, Obj key) {
    LOCK(s);
    LOCK(d);
    tmp = s.remove(key);
    d.insert(key, tmp);
    UNLOCK(d);
    UNLOCK(s);
}

Thread 0: move(a, b, key1);
Thread 1: move(b, a, key2);   // DEADLOCK!

Moreover: coarse-grain locking limits concurrency, and fine-grain locking is difficult.

Transactional Memory

Transactional Memory (TM)

void move(T s, T d, Obj key) {
    atomic {
        tmp = s.remove(key);
        d.insert(key, tmp);
    }
}

 The programmer says: "I want this atomic"
 The TM system: "makes it so"

 Avoids Deadlock  Fine Grain Locking

 Correct  Fast

(TM overview slides: Wisconsin Multifacet Project)

BlueGene/Q transactional memory mode

 User program model:
 – User defines the parallel work to be done
 – User explicitly defines the start and end of transactions within the parallel work that are to be treated as atomic
 Compiler implications:
 – Interpret user program annotations to spawn multiple threads
 – Interpret the user annotation for the start of a transaction and save state to memory on entry to the transaction to enable rollback
 – At the end-of-transaction annotation, test for successful completion and optionally branch back to the rollback pointer
 Hardware implications:
 – Transactional memory support required to detect transaction failure and roll back
 – L1 cache visibility for L1 hits as well as misses, allowing ultra-low overhead to enter a transaction

[Diagram: single MPI task, user-defined parallelism, user-defined transaction start/end, hardware-detected dependency conflict with rollback, parallelization completion synchronization point.]
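A minimal sketch of what this programming model looks like in source, assuming the IBM XL transactional-memory directive (#pragma tm_atomic, enabled with a TM compile option such as -qtm); the exact directive and option names should be checked against the BG/Q XL compiler documentation:

```c
#include <omp.h>

/* Histogram update: many threads may hit the same bin.
 * Each transaction replaces a lock; the TM hardware tracks the memory
 * touched inside the region and rolls back if a conflict is detected. */
void histogram(const int *bin_of, const double *weight, int n, double *hist)
{
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        #pragma tm_atomic            /* assumed XL TM directive */
        {
            hist[bin_of[i]] += weight[i];
        }
    }
}
```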

NewTech: Speculative Execution

 Similar to Transactional Memory – Except Ordered thread commit and different usage model  Leverages existing OpenMP parallelization – However compiler does not need to guarantee that there is no array overlap – Should allow the compiler to do a much better job of auto-parallelizing  Total work is subdivided into workunits without locking  If work units collide in memory: – SE hardware detects – Kernel rolls back transaction – Runtime decides whether to retry or serialize

 XL Compiler only  This is a DD2 only feature

TM and TLS

Standard OpenMP | Transactional Memory | Thread-Level Speculation

User control | System control (IBM beta software available) | System control (research)
User must manage memory contention | Transactions allow operations to go through, but check for memory violations and unroll them if necessary | Same memory ordering as a single thread: if the non-speculative thread modifies an address already touched by a speculative thread, the system detects a conflict and reschedules the corresponding threads
 | TM ensures atomicity and isolation; an abstract construct that hides its details from the programmer | Language extension handled at the compilation stage
#pragma omp parallel for ... | #pragma ... for ... | #pragma ... for (int i=0; i ... (code fragment truncated in the original)
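To make the contrast concrete, here is a loop with a potential cross-iteration conflict written with standard OpenMP only; without TM or TLS the programmer must protect the conflicting update (here with omp atomic), whereas the speculative modes let the hardware check at run time whether a conflict actually occurred:

```c
/* Indirect update: iterations may or may not touch the same element of y.
 * Standard OpenMP must assume the worst and protect every update. */
void scatter_add(const int *idx, const double *x, int n, double *y)
{
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        #pragma omp atomic
        y[idx[i]] += x[i];
    }
}
```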

BG/Q Node Features (Continued)

• Quad FPU (QPX)
 – 4-wide double precision SIMD FPU; 2-wide complex SIMD
 – Supports a multitude of alignments
 – Allows for higher single-thread performance for some applications
 – 4R/2W register file, 32 x 256 bits per thread
 – 32 B (256-bit) datapath to/from the L1 cache

• "Perfect" prefetching
 – Augments traditional stream prefetching
 – Used for repeated memory reference patterns in arbitrarily long code segments
 – Records the pattern on the first iteration of a loop and plays it back for subsequent iterations
 – List-based prefetching has tolerance for missing or extra cache misses
 – Achieves higher single-thread performance for some applications

Auto-SIMDization Support

 Improving SIMD from BG/P/L to BG/Q
 – Moved from the low-level optimizer to the high-level optimizer for better loop optimization structure and alignment analysis
 To enable SIMDization (assuming the appropriate -qarch=qp and -qtune=qp options):
 – -O3 (compile time, limited hot and SIMD optimizations)
 – Increase the optimization level:
  • -O4 (compile time, limited-scope analysis, SIMDization)
  • -O5 (link time, pointer analysis, whole-program analysis, and SIMD instructions)
 – -qhot=level=0 (enables SIMD by default)
 To disable SIMDization
 – Turn off SIMDization for the program: add -qhot=nosimd to the previous command-line options
 – Turn off SIMDization for a particular loop: #pragma nosimd | !IBM* NOSIMD
 Tune your programs
 – Help the compiler with extra information (directives/pragmas)
 – Check the SIMD instruction generation in the object code listing (-qsource -qlist)
 – Use compiler feedback (-qreport -qhot) to guide you
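A small illustration of the per-loop control described above; the pragma and options come from this slide, while the compiler driver name in the comment is an assumption about the local toolchain:

```c
/* Build sketch (driver name assumed):
 *   bgxlc_r -O3 -qarch=qp -qtune=qp -qhot=level=0 -qreport -qlist simd_demo.c
 */
#define N 1024

void scale(double * restrict a, const double * restrict b, double s)
{
    /* Regular stride-1 loop: a candidate for auto-SIMDization into QPX. */
    for (int i = 0; i < N; i++)
        a[i] = s * b[i];
}

void gather(double * restrict a, const double * restrict b, const int *idx)
{
    /* Irregular access pattern: exclude this loop from SIMDization. */
    #pragma nosimd
    for (int i = 0; i < N; i++)
        a[i] = b[idx[i]];
}
```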

NewTech: L1 Perfect Prefetcher

 The L1p has 2 different prefetch algorithms: – Stream prefetcher • Similar to the prefetcher algorithms used on BGL/BGP – “Perfect” prefetcher

 Perfect Prefetcher: – First iteration through code: • BQC records sequence of memory load accesses • Sequence is stored in DDR memory – Subsequent iteration through code: • BQC loads the sequence and tracks where the code is in the sequence • Prefetcher attempts to prefetch memory before it is needed in the sequence

 Kernel provides access routines to setup and configure the stream and perfect prefetchers

Blue Gene/Q Partitioning

Blue Gene Compute blocks

 BG/P – partitions are denoted by the letters XYZT, where XYZ are the 3 dimensions and T is the core (0-3)

 BG/Q – the 5 dimensions are denoted by the letters A, B, C, D, and E, and T denotes the core (0-15)

– the last dimension, E, is always 2 and is contained entirely within a midplane

– for any compute block, compute nodes (as well as midplanes for large blocks) are combined in 4 dimensions - only 4 dimensions need to be considered

Partitions

 Compute partitions are generally grouped into two categories: – Occupying one or more complete midplanes (large blocks) – Occupying less than a midplane (small blocks)

 Large blocks have these characteristics: – Always a multiple of 512 nodes – Can be a torus in all 5 dimensions

 Small blocks have these characteristics:
 – Occupy one or more node cards on a midplane
 – Always a multiple of 32 nodes (32, 64, 128, 256)
 – Cannot be a torus in all 5 dimensions

Compute Partitions - Large Blocks

This shows a 4x4x4x3 (A,B,C,D) machine, which is what is currently planned for: 192 midplanes. Each blue box is a midplane. Note that this is an abstract view of the midplanes; they are not physically organized this way.

A midplane is 4x4x4x4x2 (512) nodes. The Sequoia full system is 16x16x16x12x2 nodes. Using A, B, C, D, E to denote the 5 dimensions, the last dimension (E) is fixed at 2 and does not factor into the partitioning details.

44 © 2009 IBM Corporation Node board 32 nodes: 2x2x2x2x2 20 09 21 24 08 05 25 10 23 04 E 11 22 (twin direction, 06 27 always 2) 19 07 26 14 18 (0,0,0,0,1)) 15 31 02 13 16 30 03 12 17 (0,0,0,1,0) D 1 28 B C 0 29 (0,0,0,0,0)

A

Partition: Mesh versus Torus

Previously on BG/P, torus vs. mesh applied to an entire block; now torus vs. mesh is determined per dimension, with partial torus for sub-midplane blocks and partial torus for multi-midplane blocks.

# Node Cards        # Nodes   Dimensions   Torus (ABCDE)
1                   32        2x2x2x2x2    00001
2 (adjacent pairs)  64        2x2x4x2x2    00101
4 (quadrants)       128       2x2x4x4x2    00111
8 (halves)          256       4x2x4x4x2    10111

5D Torus Wiring in a Midplane: (A,B,C,D,E)

The 5 dimensions are denoted by the letters A, B, C, D, and E. The last dimension, E, is always 2 and is contained entirely within a midplane.

 Each nodeboard is 2x2x2x2x2
 Dimensions A, B, C, D span across nodeboards (arrows in the figure)
 Dimension E does not extend across nodeboards
 The nodeboards combine to form a 4x4x4x4x2 torus
 Nodeboards are paired in dimensions A, B, C, and D, as indicated by the arrows

[Figure: side view of a midplane, nodeboards N00-N15.]

Partitioning a Midplane into Blocks

I/O Requirements

Sub-block jobs

Sub-block jobs 1/2
 BG/L and BG/P mpirun
 – Minimum job size: 1 pset (an I/O node and its associated compute nodes)
 – Example: mpirun -np 4 would waste the remaining 28 compute nodes on a node board
 Single-node HTC jobs: single-node jobs without MPI
 Sub-block jobs - new in BG/Q
 – Allows multiple multi-node jobs to share one or more I/O nodes
 – Rectangular shapes smaller than the block
 – Restricted to a midplane (512 nodes) and smaller shapes
 – MPI fully supported; some collectives may not be optimized

1 single BG/Q partition

Sub-block jobs 2/2

 Can subdivide a block into a smaller torus – Allows small jobs to run on a block while maintaining the I/O requirement

 Specify a corner and shape on runjob
 – Example:
  runjob --exe hello --block R00-M0 --corner R00-M0-N04-J00 --shape 1x2x2x1x2 --ranks-per-node 16

 MPI is limited by hardware from sending messages outside this sub-block – System software (kernel) can send outside -- allowing I/O to function – Note that I/O will be routed through sub-blocks (i.e. some noise introduced)

 High Throughput Computing – For the extreme scale single task HTC case, a job can be targeted to a single core on a node – i.e. 16 independent jobs can run on a single BG/Q compute node

Programmability

 Standards-based programming environment
 – Linux development environment
 – Familiar GNU toolchain with GLIBC, pthreads, gdb
 – XL compilers providing C, C++ with OpenMP
 – TotalView debugger
 Message passing
 – Optimized MPICH2 providing MPI 2.2
 – Intermediate and low-level message libraries available, documented, and open source
 – GA/ARMCI, Berkeley UPC, etc., ported to this optimized layer
 Compute Node Kernel (CNK) eliminates OS noise
 – File I/O offloaded to I/O nodes running full Linux
 – GLIBC environment with few restrictions for scaling
 Flexible and fast job control
 – MPMD and sub-block jobs now supported
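As a quick illustration of the environment described above (MPI 2.2 via MPICH2 plus OpenMP threads per rank), a minimal hybrid program might look like the following; the number of ranks and threads per node is chosen at job launch, e.g. with runjob --ranks-per-node:

```c
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided, rank, nranks;

    /* Hybrid MPI + OpenMP: request thread support for the OpenMP regions. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    #pragma omp parallel
    {
        #pragma omp critical
        printf("rank %d of %d, thread %d of %d\n",
               rank, nranks, omp_get_thread_num(), omp_get_num_threads());
    }

    MPI_Finalize();
    return 0;
}
```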


Toolchain and Tools

 BGQ GNU toolchain – gcc is currently at 4.4.4. Will update again before we ship. – glibc is 2.12.2 (optimized QPX memset/memcopy) – binutils is at 2.21.1 – gdb is 7.1 with QPX registers – gmon/gprof thread support • Can turn profiling on/off on a per thread basis

 Python – Running both Python 2.6 and 3.1.1. – NUMPY, pynamic, UMT all working – Python is now an RPM

 Toronto compiler test harness is running on BGQ LNs

CNK Overview

Compute Node Kernel (CNK)
 – Binary compatible with Linux system calls; leverages Linux runtime environments and tools
 – Up to 64 processes (MPI tasks) per node; SPMD and MIMD support
 – Multi-threading with optimized runtimes: Native POSIX Threading Library (NPTL), OpenMP via XL and GNU compilers, Thread-Level Speculation (TLS)
 – System Programming Interfaces (SPI): networks and DMA, global interrupts, synchronization, locking, sleep/wake, performance counters (UPC)
 – MPI and OpenMP (XL, GNU), Transactional Memory (TM), Speculative Multi-Threading (TLS)
 – Shared and persistent memory; scripting environments (Python)
 – Dynamic linking, demand loading

Firmware
 – Boot, configuration, kernel load
 – Control system interface
 – Common RAS event handling for CNK & Linux

Parallel Active Message Interface

[PAMI software stack: applications and high-level paradigms (MPICH, Converse/Charm++, Global Arrays, ARMCI, UPC, other APIs) sit on the low-level PAMI API (C), which provides point-to-point and collective protocols over the Message Layer Core (C++) devices (DMA, Collective, GI, Shmem) and the SPI, down to the network hardware (DMA, collective network, global interrupt network).]

 Message Layer Core has C++ message classes and other utilities to program the different network devices  Support many programming paradigms

Blue Gene/Q Job Submission

 BG/L & BG/P job submission interfaces:
 – mpirun
 – submit
 – submit_job
 – mpiexec

 BG/Q: a single interface for job submission: runjob

Job Submission: runjob command

 runjob is somewhat backwards compatible with mpirun
 Supported args: exe, arg, env, exp_env, env_all, np, partition, strace, label, mode
 Unsupported args: shape, psets_per_bp, connect, reboot, nofree, noallocate, free, boot_options
 Primary difference
 – runjob does not boot or free partitions; it only runs the job
 – booting/rebooting/freeing is done by a scheduler, console, or script
 – the partition arg is always required
 Sub-partition jobs require a new location arg
 – ranges: 0-7
 – lists: 1,2,4,9
 – combinations: 0-7,16,17,18,19
 – specific processor: 0.1 (for single-node jobs only)

Ranks per node

 BG/L mpirun – supported SMP and co-processor mode – either 1 or 2 ranks per node

 BG/P mpirun – supported SMP, dual, and virtual node mode – either 1, 2, or 4 ranks per node

 BG/Q runjob – supports 1, 2, 4, 8, 16, 32, or 64 ranks per node – parameter is --ranks-per-node rather than --mode

Execution Modes in BG/Q per Node

Job Launch

 Single interface for job submission: runjob (replaces mpirun, submit, submit_job, mpiexec)
 'Scheduler' behavior removed from job submission - runjob cannot create blocks
 Gateway for debuggers

 runjob on the LNs controls via the SN

 Paths shown between LNs, SN and IONs are Infiniband or 10Gbps Ethernet

 Note the new daemons on the IONs

BG/Q MPI Implementation

 MPI-2.1 standard ( http://www.mpi-forum.org/docs/docs.html )

 BG/Q mpi execution command: runjob

 To support the Blue Gene/Q hardware, the following additions and modifications have been made to the MPICH2 software architecture:

– A Blue Gene/Q driver has been added that implements the MPICH2 abstract device interface (ADI). – Optimized versions of the Cartesian functions exist (MPI_Dims_create(), MPI_Cart_create(), MPI_Cart_map()). – MPIX functions create hardware-specific MPI extensions.
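For example, the optimized Cartesian functions are used exactly as in standard MPI; a typical pattern that benefits from the BG/Q-specific MPI_Dims_create()/MPI_Cart_create() implementations looks like this:

```c
#include <mpi.h>

/* Build a 5D Cartesian communicator; on BG/Q the optimized
 * MPI_Dims_create/MPI_Cart_create try to match the torus hardware. */
MPI_Comm make_cart(MPI_Comm comm)
{
    int nranks, dims[5] = {0, 0, 0, 0, 0}, periods[5] = {1, 1, 1, 1, 1};
    MPI_Comm cart;

    MPI_Comm_size(comm, &nranks);
    MPI_Dims_create(nranks, 5, dims);                  /* factor ranks into 5 dimensions */
    MPI_Cart_create(comm, 5, dims, periods, 1, &cart); /* reorder=1: let MPI map to the torus */
    return cart;
}
```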

5-Dimensional Torus Network

 The 5-dimensional Torus network provides point-to-point and collective communication facilities.

 For point-to-point messaging, the route from a sender to a receiver on the torus network has two possible paths:
 – Deterministic routing
  • Packets from a sender to a receiver go along the same path
  • Advantage: low latency, maintained without additional logic
  • Disadvantage: can create network hot spots when several point-to-point communications overlap
 – Adaptive routing
  • Generates a more balanced network load, but introduces a latency penalty

 Selecting deterministic or adaptive routing depends on the protocol used for communication - there are 4 in BG/Q: immediate message, MPI short, MPI eager, and MPI rendezvous
 Environment variables can be used to customize MPI communications (cf. the IBM BG/Q redbook)

Blue Gene/Q extra MPI communicators

 int MPIX_Cart_comm_create (MPI_Comm *cart_comm) – This function creates a six-dimensional (6D) Cartesian communicator that mimics the exact hardware on which it is run. The A, B, C, D, and E dimensions match those of the block hardware, while the T dimension is equivalent to the ranks per node argument to runjob .

 Changing class-route usage at runtime – int MPIX_Comm_update(MPI_Comm comm, int optimize)

 Determining hardware properties
 – int MPIX_Init_hw(MPIX_Hardware_t *hw)
 – int MPIX_Torus_ndims(int *numdimensions)
 – int MPIX_Rank2torus(int rank, int *coords)
 – int MPIX_Torus2rank(int *coords, int *rank)
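A minimal sketch of how these extensions are typically combined, using the prototypes listed above; the mpix.h header name is an assumption about the BG/Q installation:

```c
#include <mpi.h>
#include <mpix.h>   /* BG/Q MPI extensions; header name assumed */
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    MPI_Comm cart6d;
    MPIX_Cart_comm_create(&cart6d);   /* 6D (A,B,C,D,E,T) communicator mirroring the block */

    int ndims = 0, rank;
    MPIX_Torus_ndims(&ndims);         /* number of torus dimensions (5 on BG/Q) */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int coords[6] = {0};              /* sized generously; valid entries given by ndims */
    MPIX_Rank2torus(rank, coords);    /* torus coordinates of this rank */
    printf("rank %d: %d torus dims, coords start at (%d,%d,%d,%d,%d)\n",
           rank, ndims, coords[0], coords[1], coords[2], coords[3], coords[4]);

    MPI_Comm_free(&cart6d);
    MPI_Finalize();
    return 0;
}
```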

Blue Gene/Q Kernel Overview

Processes

 Similarities to BG/P
 – Number of tasks per node fixed at job start
 – No fork/exec support
 – Supports statically and dynamically linked processes
 – -np supported
 Plus:
 – 64-bit processes
 – Support for 1, 2, 4, 8, 16, 32, or 64 processes per node
  • Numeric "names" for process config (i.e., not smp, dual, quad, octo, vnm, etc.)
 – Processes use 16 cores
 – The 17th core on BQC is reserved for:
  • Application agents
  • Kernel networking
  • RAS offloading
 – Sub-block jobs
 Minus
 – No support for 32-bit processes

Threads

 Similarities to BG/P
 – POSIX NPTL threading support (e.g., libpthread)
 Plus
 – Thread affinity and thread migration
 – Thread priorities: supports both pthread priority and A2 hardware thread priority
 – Full scheduling support for A2 hardware threads
 – Multiple software threads per hardware thread is now the default
 – CommThreads have an extended priority range compared to normal threads
 – Performance features
  • Hardware threads in the scheduler without pending work are put into a hardware wait state; no cycles are consumed from other hardware threads on the core
  • Snoop scheduler provides user-state fast access to: the number of software threads that exist on the current hardware thread, and the number of runnable software threads on the current hardware thread

Memory

 Similarities to BGP – Application text segment is shared – Shared memory – Memory Protection and guard pages  Plus – 64-bit virtual addresses • Supports up to 64GB of physical memory* • No TLB misses • Up to 4 processes per core – Fixed 16MB memory footprint for CNK. Remainder of physical memory to applications – Memory protection for primordial dynamically-linked text segment – Memory aliases for long-running TM/SE – Globally readable memory – L2 atomics

System Calls

 Similarities to BG/P
 – Many common Linux syscalls work on BG/Q
 – Linux syscalls that worked on BG/P should work on BG/Q
 Plus
 – Supports glibc 2.12.2
 – Real-time signals support
 – Low-overhead syscalls: only essential registers are saved and restored
 – Pluggable file systems
  • Allows CNK to support multiple file system behaviors and types; different simulator platforms have different capabilities
  • File systems: function-shipped, standard I/O, ramdisk, shared memory

I/O Services

 Similarities to BG/P
 – Function-shipping system calls to the I/O node
 – Supports NFS, GPFS, and PVFS2 filesystems
 Plus
 – PowerPC64 Linux running on 17 cores
 – Supports an 8192:1 compute task to I/O node ratio
  • Only 1 ioproxy per compute node
  • Significant internal changes from BG/P
 – Standard communications protocol: OFED verbs, using the torus DMA hardware for performance
 – Network over torus (e.g., tools can now communicate between I/O nodes via the torus)
 – Uses an "off-the-shelf" InfiniBand driver from Mellanox
 – ELF images are now pulled from I/O nodes, vs. pushed

Debugging

 Similarities to BGP – GDB – Totalview – Coreprocessor – Dump_all, dump_memory – Lightweight and binary corefiles  Plus – Tools interface (CDTI) • Allows customers to write custom debug and monitoring tools – Support for versioned memory (TM/SE) – Fast breakpoint and watchpoint support – Asynchronous thread control • Allow selected threads to run while others are being debugged

BG/Q Application Environment

Compiling MPI programs on Blue Gene/Q

There are six versions of the libraries and the scripts:

– gcc: GNU compiler with fine-grained locking in MPICH; error checking
– gcc.legacy: GNU with a coarse-grained lock; slightly better latency for single-threaded code
– xl: PAMI compiled with GNU; fine-grained lock
– xl.legacy: PAMI compiled with GNU; coarse-grained lock
– xl.ndebug: xl with error checking and asserts off, giving lower latency but not as much debug info
– xl.legacy.ndebug: xl.legacy with error checking and asserts off

Control/monitoring

 Provides the Blue Gene Web Services
 – Getting data (blocks, jobs, hardware, envs, etc.)
 – Create blocks, delete blocks, run diags, etc.
 A web server
 Runs under BGMaster
 – Should run as a special bgws user for security
 New for BG/Q
 – Had the Navigator server in BG/P (Tomcat)
 – Tomcat in BG/L

Blue Gene Navigator

Debugging – Batch scheduler

 Debugging
 – Integrated tools
  • GDB
  • Core files + addr2line
  • Coreprocessor
  • Compiler options
  • Traceback functions, memory size kernel, signal or exit trap, ...
 – Supported commercial software
  • TotalView
  • DDT (Allinea)?

 Batch scheduler
 – IBM LoadLeveler
 – SLURM
 – LSF?

Coreprocessor GUI

Performance Analysis

 Profiling
 – GNU profiling, vprof with command line or GUI
 IBM HPC Toolkit, IBM mpitrace library
 Major open-source tools
 – Scalasca
 – TAU
 – mpiP (http://mpip.sourceforge.net)
 – ...

IBM MPI trace library – HPC Toolkit

 MPI timing summary
 Communication and elapsed times
 Heap memory used
 MPI basic information: number of calls, message sizes, number of hops
 Call stack for every MPI function call
 Source and destination torus coordinates for point-to-point messages
 Unix-based profiling
 BG/Q hardware counters
 Event tracing

Thanks, Questions?
