IBM Systems Group

IBM Deep Computing Strategy

© 2004 IBM Corporation IBM Systems Group

IBM’s Deep Computing Strategy Solving Problems More Quickly at Lower Cost

• Aggressively evolve the POWER-based Deep Computing product line

• Develop advanced systems based on loosely coupled clusters

• Deliver supercomputing capability with new access models and financial flexibility

• Research and overcome obstacles to parallelism and other revolutionary approaches to supercomputing

Creating the Future of Deep Computing

• ASCI Purple and Blue Gene/L
  ► The two systems will provide over 460 TF
• Track record for delivering the world's largest production-quality supercomputers
  ► ASCI Blue (3.9 TF) & ASCI White (12.3 TF)
  ► ASCI Pathforward (Federation 4GB Switch)
• DARPA's HPCS initiative
  ► Awarded $53M in funding for phase 2 of DARPA's High Productivity Computing Systems initiative
  ► IBM PERCS program aimed at bringing sustained multi-petaflop performance and autonomic capabilities to commercial supercomputers

IBM Systems – Industry Leadership & Choice

Large SMP (Scale Up / SMP Computing):
• p690/p690+ – high-bandwidth High Performance Switch, SSI, LPAR, RAS
• p655 – high density, POWER4-based, static LPAR
• x445 – scalable nodes, VMware
• x382 – high density, 2-way Itanium 2, rack mount

Clusters / Virtualization (Scale Out / Distributed Computing):
• e325 – 1U, Opteron-based; Linux clusters (e1350), AIX clusters (e1600)
• x335 – 1U/2p density, Intel-based, Linux
• BladeCenter™ – denser form factors, rapid deployment, flexible architectures, switch integration

POWER / PowerPC: The Most Scalable Architecture

• Servers: POWER2 → POWER3 → POWER4 → POWER4+ → POWER5
• Desktops / Game: PPC 603e → PPC 750 → PPC 750FX → PPC 970
• Embedded: PPC 401 → PPC 405GP → PPC 440GP → PPC 440GX

IBM eServer pSeries | IBM Confidential

POWER4 / POWER5 Differences

POWER4 design → POWER5 design, with the resulting benefit:

• L1 cache: 2-way associative, FIFO → 4-way associative, LRU (improved L1 cache performance)
• L2 cache: 1.44MB, 8-way associative → 1.9MB, 10-way associative (fewer L2 cache misses, better performance)
• L3 cache: 32MB shared, 8-way associative, 118 clock cycles → 36MB private, 12-way associative, ~80 clock cycles (better cache performance, 40% improvement)
• Memory bandwidth: 4GB/sec/chip → ~16GB/sec/chip (4X improvement, faster memory access)
• Simultaneous multi-threading: no → yes (better processor utilization, 40% system improvement)
• Processor partitioning: 1 processor → 1/10 of processor (better usage of processor resources)
• Floating-point registers: 72 → 120 (better performance)
• Chip interconnect type: Distributed Switch → Enhanced Distributed Switch (better system throughput)
• Intra-MCM data bus: ½ processor speed → processor speed (better performance)
• Inter-MCM data bus: ½ processor speed → ½ processor speed
• Die size: 412mm² → 389mm² (50% more transistors in the same space)
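The L1 replacement-policy change (FIFO → LRU, alongside doubled associativity) is easy to see in a toy cache-set simulator. A minimal sketch in Python; the access pattern and function are invented for illustration and are not POWER hardware behavior:

```python
from collections import OrderedDict

def misses(accesses, ways, policy):
    """Count misses for a single cache set with the given
    associativity (`ways`) and replacement policy ('fifo' or 'lru')."""
    lines = OrderedDict()            # iteration order = eviction order
    miss_count = 0
    for tag in accesses:
        if tag in lines:
            if policy == "lru":
                lines.move_to_end(tag)   # refresh recency on a hit
            continue
        miss_count += 1
        if len(lines) == ways:
            lines.popitem(last=False)    # evict oldest insert (FIFO) / least recent (LRU)
        lines[tag] = True
    return miss_count

# A looping pattern with a hot line (tag 0): recency tracking and
# extra ways both cut misses sharply.
pattern = [0, 1, 0, 2, 0, 3, 0, 4, 0, 1, 0, 2] * 50
print("2-way FIFO misses:", misses(pattern, 2, "fifo"))
print("4-way LRU  misses:", misses(pattern, 4, "lru"))
```

On this pattern the 4-way LRU set misses far less often than the 2-way FIFO set, which is the qualitative effect the table's "improved L1 cache performance" row claims.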

POWER5 Design (DCM)

POWER5 enhancements:
• SMT cores
• Integrated memory controller
• Integrated L3 controller
• Larger caches (L2 1.9MB shared, L3 36MB)
• Mega bandwidth: Enhanced Distributed Switch
• Scalable buses
• Virtualization support

POWER5 design: 276M transistors, 389mm²; MCM–MCM, chip–chip, and GX buses

DCM (Dual-Chip Module): used in entry and midrange systems

POWER4 and POWER5 Storage Hierarchy

(POWER4 value; POWER5 value)

L2 cache:
• Capacity, line size: 1.44 MB, 128 B line; 1.92 MB, 128 B line
• Associativity, replacement: 8-way, LRU; 10-way, LRU

Off-chip L3 cache:
• Capacity, line size: 32 MB, 512 B line; 36 MB, 256 B line
• Associativity, replacement: 8-way, LRU; 12-way, LRU

Chip interconnect:
• Type: distributed switch; enhanced distributed switch
• Intra-MCM data buses: ½ processor speed; processor speed
• Inter-MCM data buses: ½ processor speed; ½ processor speed

Memory: 512 GB maximum; 1024 GB (1 TB) maximum

Latency and Bandwidth Comparison: POWER4 and POWER5

Latency (POWER4; POWER5) and bandwidth increase POWER4 → POWER5 at the same frequency:

• Register files (GPR, FPR): 1 cycle; 1 cycle (same bandwidth)
• Instruction cache: 1 cycle; 1 cycle (same)
• Data cache: 2 cycles; 2 cycles (same)
• L2 cache: 12 cycles; 13 cycles (same)
• L3 cache: 123 cycles*; 87 cycles* (1.5x)
• Memory: 351 cycles*; 220 cycles* (up to 2.7x read, 1.3x write)
• Chip-chip: n/a (4x)

* Assumes data accessed from local L3 and locally attached memory
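The latency figures above can be folded into a back-of-the-envelope average memory access time. A sketch in Python: the cycle counts come from the table, but the miss rates are hypothetical, purely illustrative assumptions:

```python
def amat(latencies, miss_rates):
    """Average memory access time (cycles) for a cache hierarchy.

    latencies:  per-level access latency [L1, L2, L3, memory]
    miss_rates: fraction of accesses at each level that fall
                through to the next level [L1, L2, L3]
    """
    t = latencies[-1]                  # memory is the backstop
    for lat, mr in zip(reversed(latencies[:-1]), reversed(miss_rates)):
        t = lat + mr * t               # AMAT = hit_time + miss_rate * next level
    return t

# Latencies from the table; miss rates are invented for illustration.
hypothetical_miss = [0.05, 0.20, 0.20]
p4 = amat([2, 12, 123, 351], hypothetical_miss)
p5 = amat([2, 13, 87, 220], hypothetical_miss)
print(f"POWER4 AMAT ~{p4:.2f} cycles, POWER5 AMAT ~{p5:.2f} cycles")
```

Under these assumed miss rates the POWER5 average comes out lower, which is the point of the slide: the one-cycle-slower L2 is swamped by the reduced L3 and memory latencies.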

Simultaneous Multi-Threading in POWER5

• Each chip appears as a 4-way SMP to software
• Processor resources optimized for enhanced SMT performance (execution units FX0, FX1, FP0, FP1, LS0, LS1, BRX, CRL shared between Thread 0 and Thread 1)
• Software-controlled priority
• Dynamic feedback of runtime behavior to adjust priority
• Dynamic switching between single-threaded and multithreaded mode

POWER5 SF4: 1-, 2-, 4-way Entry System
• Deskside or 4U rack chassis (19" x 26" rack)

Architecture: POWER5
L3 cache: 36MB
Memory: 1GB – 64GB
Packaging: deskside / 4U (19" rack)
DASD / bays: 8 DASD (hot plug)
I/O expansion: 5 / 4 slots (hot plug)
Integrated SCSI: dual Ultra 320
Internal RAID: optional
Integrated Ethernet: dual ports, 10/100/1000
Media bays: 3
RIO2 drawers: yes / 8
Dynamic LPAR: 40
Redundant power: feature
Redundant cooling: yes

POWER5 ML4: 2/4-way base system, 4-way secondary systems
• Deskside or 4U rack chassis (19" x 26" rack)

Architecture: POWER5
L3 cache: 36MB
Memory*: 1GB – 64GB (base system)
Packaging: 4U (19" rack)
DASD / bays: 6 DASD (base system)
I/O expansion: 6 / 5 slots (hot plug)
Integrated SCSI: dual Ultra 320 (base system)
Internal RAID: optional
Integrated Ethernet: dual ports, 10/100/1000 (base system)
Media bays: 3 (base system)
RIO2 drawers: yes / 8 (base system)
Dynamic LPAR: 40 (base system)
Redundant power: yes
Redundant cooling: yes

Planned POWER5 IH: 8-way
• 2U full drawer / 2.5U rack chassis (24" rack, 43" deep); 10 systems and 80 processors per rack

Architecture: POWER5
L3 cache: 288MB (total)
Memory: 2GB – 256GB
DASD / bays: 2 DASD (hot plug)
I/O expansion: 6 / 4 / 2 PCI-X; 2 GX bus & 2 RIO
Integrated SCSI: Ultra 320
Integrated Ethernet: 4 ports, 10/100/1000
Redundant power: yes (1/2 or 1 per drawer)
RIO2 drawers: 0 – 5 / rack
Dynamic LPAR: yes
HPS: 0 – 2 / rack
Cluster interconnect: Gigabit Ethernet
OS: AIX 5.2 & Linux

POWER5 Improves HPC Performance

• Higher sustained-to-peak FLOPS ratio compared to POWER4
• Dedicated memory bus
• Reduction in L3 and memory latency (integrated memory and L3 controllers)
• Increased rename resources allow higher instruction-level parallelism in compute-intensive applications
• Fast barrier synchronization operation
• Enhanced data prefetch mechanism

IBM Research

BlueGene/L system architecture

BlueGene/L © 2003 IBM Corporation IBM Research

BlueGene/L fundamentals

• A large number of nodes (65,536)
  ► Low-power (20W) nodes for density
  ► High floating-point performance
  ► System-on-a-chip technology
• Nodes interconnected as a 64x32x32 three-dimensional torus
  ► Easy to build large systems, as each node connects only to its six nearest neighbors; full routing in hardware
  ► Bisection bandwidth per node is proportional to n²/n³
  ► Auxiliary networks for I/O and global operations
• Applications consist of multiple processes with message passing
  ► Strictly one process per node
  ► Minimum OS involvement and overhead
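The torus wiring described above is simple to state concretely: each node has exactly six links, with wraparound in every dimension. A minimal sketch in Python; the function name and coordinate convention are ours, not BlueGene/L system software:

```python
def torus_neighbors(x, y, z, dims=(64, 32, 32)):
    """Six nearest neighbors of node (x, y, z) on a 3-D torus.

    Wraparound in every dimension means every node, including
    "corner" nodes, has exactly six links (dims 64x32x32 for BG/L).
    """
    X, Y, Z = dims
    return [
        ((x - 1) % X, y, z), ((x + 1) % X, y, z),
        (x, (y - 1) % Y, z), (x, (y + 1) % Y, z),
        (x, y, (z - 1) % Z), (x, y, (z + 1) % Z),
    ]

# The wraparound links make corner node (0, 0, 0) look like any interior
# node: its neighbors include (63, 0, 0), (0, 31, 0), and (0, 0, 31).
print(torus_neighbors(0, 0, 0))
```

This fixed degree of six, independent of machine size, is what makes the topology "easy to build large systems" with: adding nodes never changes the per-node wiring.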

BlueGene/L | System Software Overview © 2003 IBM Corporation IBM Research

BlueGene/L interconnection networks

• 3-dimensional torus
  ► Interconnects all compute nodes (65,536)
  ► Virtual cut-through hardware routing
  ► 1.4 Gb/s on all 12 node links (2.1 GB/s per node)
  ► Communications backbone for computations
  ► 350/700 GB/s bisection bandwidth

• Global tree
  ► One-to-all broadcast functionality
  ► Reduction operations functionality
  ► 2.8 Gb/s of bandwidth per link
  ► Latency of tree traversal on the order of 2 µs
  ► Interconnects all compute and I/O nodes (1024)

• Gigabit Ethernet
  ► Incorporated into every node ASIC
  ► Active in the I/O nodes (1:64)
  ► All external communication (file I/O, control, user interaction, etc.)
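The per-node and bisection figures in the torus bullet hang together arithmetically. A quick check in Python; the factor-of-2 accounting for wraparound links is our reading of the standard torus bisection argument, not something stated on the slide:

```python
# Each torus link runs at 1.4 Gb/s = 0.175 GB/s per direction.
link_GBps = 1.4 / 8

# 6 neighbors x 2 directions = 12 links per node.
per_node = 12 * link_GBps
print(f"per-node bandwidth: {per_node:.1f} GB/s")   # 2.1 GB/s

# Bisect the 64x32x32 torus across its 64-long axis: the cut crosses
# 32 x 32 node columns, and torus wraparound doubles the links cut.
links_cut = 2 * 32 * 32
one_way = links_cut * link_GBps
print(f"bisection: {one_way:.0f} GB/s one-way, {2 * one_way:.0f} GB/s bidirectional")
```

The computed 358 / 717 GB/s match the slide's rounded 350/700 GB/s bisection figures, and 12 × 0.175 GB/s reproduces the 2.1 GB/s per node exactly.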

BG/L Applications Classes

[Diagram: application classes arranged around shared methods (Fourier methods, graph theoretic, transport, partial diff. eqs., ordinary diff. eqs., N-body, Monte Carlo, discrete events, numerical methods, basic algorithms, raster graphics, pattern matching, symbolic processing, fields, mechanics). Applications shown include petroleum reservoirs, biomolecular dynamics / protein folding, flows in porous media, rational drug design, reaction-diffusion, molecular modelling, fracture mechanics, fluid dynamics, multiphase flow, VLSI design, weather and climate, nanotechnology, structural mechanics, seismic processing, genome processing, aerodynamics, large-scale data mining, and cryptography.]

• Applications which map well to ultra-parallel (>10,000 CPU) systems
• Important applications can leverage increased compute/memory ratios
• The best applications have high ratios of MIPS & FLOPs vs. memory size & cluster bandwidth
• Intense, mostly local interactions between 100,000s to 1,000,000,000s of simple units: e.g., atoms (protein folding / drug design), air pockets (weather and climate), logic gates (VLSI simulation)
• Purely parallel searches over huge data or parameter spaces: genome searching, cryptography, data mining, etc.
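The "purely parallel search" class in the last bullet is the easiest to sketch: every candidate is scored independently, so the only communication is collecting results. A toy illustration in Python; the objective function is invented for the example:

```python
from concurrent.futures import ProcessPoolExecutor

def score(candidate):
    """Invented per-candidate objective: it depends on nothing but its
    own input, so the search is embarrassingly parallel."""
    x = candidate / 1000.0
    return -(x - 0.3) ** 2          # single peak at candidate 300

if __name__ == "__main__":
    candidates = range(1000)
    # Workers each evaluate a slice of the parameter space; results
    # are merged with a single reduction (max).
    with ProcessPoolExecutor() as pool:
        best_score, best = max(zip(pool.map(score, candidates), candidates))
    print("best candidate:", best)  # 300
```

Genome searching, key search, and data-mining scans all have this shape, which is why they scale to tens of thousands of CPUs with essentially no inter-node bandwidth.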

Target TOP500 ranking for SC2004

(System – make, processors, description, type: Linpack TFlops, power, frames, cost)

#1 Blue Gene/L (1/2) – IBM, 64k procs, 750 MHz POWER, special purpose: 100 TFlops, 0.8 MW, 32 frames, ? M$
#2 Earth Simulator – NEC, 5120 procs, 500 MHz NEC, special purpose: 35.9 TFlops, 5.1 MW, 640+ frames, 350 M$
#3 Red Storm – CRAY, 10k procs, 2 GHz Opteron, special purpose: 20-35 TFlops (40 peak), <2 MW, 108 frames, 90 M$
#4 Three Rivers – IBM, 4564 procs, 2.2 GHz POWER, commercially available: 20+ TFlops, <<1 MW, 45 frames, 30 M$
#5 ASCI Q – HP, 8192 procs, 1.25 GHz Alpha, special purpose: 13.9 TFlops, 2 MW, 600 frames, 100 M$
#6 Big Mac – Apple, 2200 procs, 2 GHz POWER, homebuilt: 9.8 TFlops, <1.5 MW, 130+ frames, 5.2 M$

Project Three Rivers | IBM Confidential

Major Three Rivers Innovative Technologies

• IBM 970 POWER processor → industry-leading 64-bit commodity processor

• IBM BladeCenter integration → record cluster density; improved cluster operating efficiency (power, space, cooling)

• Advanced semiconductor technology (CMOS10S) → record price/performance in HPC workloads; record system throughput

• New dense Myrinet interconnect and new MPI software (MX) → significant reduction in switching hardware; faster parallel processing

• Enterprise scale-out FAStT IBM storage → improved subsystem cost and reliability

• Network root file system → improved node reliability; reduced installation and maintenance costs

DARPA Award (Defense Advanced Research Projects Agency)

• IBM to receive a $53.3M grant to develop a new generation of supercomputing
  ► Multi-petaflop sustained performance by 2010 (petaflop = 1 quadrillion calculations / sec)
• IBM Research with a consortium of 12 leading universities (Univ. of Illinois, Univ. of New Mexico, etc.)
• Continuation of IBM technology leadership (processors, systems design, etc.)
• IBM proposal: PERCS (Productive, Easy-to-use, Reliable Computing Systems)
  ► Highly adaptable systems that configure hardware & software to match application needs
  ► Revolutionary chip technology
  ► New computer architecture
