IBM Systems Group
IBM Deep Computing Strategy
© 2004 IBM Corporation IBM Systems Group
IBM’s Deep Computing Strategy: Solving Problems More Quickly at Lower Cost
Aggressively evolve the POWER-based Deep Computing product line
Develop advanced systems based on loosely coupled clusters
Deliver supercomputing capability with new access models and financial flexibility
Research and overcome obstacles to parallelism and other revolutionary approaches to supercomputing
Creating the Future of Deep Computing

ASCI Purple and Blue Gene/L
► The two systems will provide over 460 TF

Track record for delivering the world's largest production-quality supercomputers
► ASCI Blue (3.9 TF) & ASCI White (12.3 TF)
► ASCI Pathforward (Federation 4 GB Switch)

DARPA’s HPCS initiative
► Awarded $53M in funding for phase 2 of DARPA’s High Productivity Computing Systems initiative
► IBM PERCS program aimed at bringing sustained multi-petaflop performance and autonomic capabilities to commercial supercomputers
IBM Systems – Industry Leadership & Choice

Large SMP – Scale Up / SMP Computing:
• p690/p690+ – high-bandwidth High Performance Switch, SSI, LPAR, RAS
• p655 – high density, POWER4-based
• x445 – scalable nodes, static LPAR, VMware
• x382 – high density, 2-way Itanium 2 rack mount

Clusters / Virtualization – Scale Out / Distributed Computing:
• e325 – 1U Opteron-based Linux cluster (e1350)
• AIX clusters (e1600)
• BladeCenter™ – denser form factors, rapid deployment, switch integration
• x335 – 1U/2p density, Intel-based, Linux
• Flexible architectures across the line
POWER / PowerPC: The Most Scalable Architecture
• Servers: POWER2 → POWER3 → POWER4 → POWER4+ → POWER5
• Desktops / Game: PPC 603e → PPC 750 → PPC 750FX → PPC 970
• Embedded: PPC 401 → PPC 405GP → PPC 440GP → PPC 440GX
IBM eServer pSeries | IBM Confidential

POWER4 / POWER5 Differences
| Feature | POWER4 Design | POWER5 Design | Benefit |
| L1 cache | 2-way associative, FIFO | 4-way associative, LRU | Improved L1 cache performance |
| L2 cache | 1.44 MB, 8-way associative | 1.9 MB, 10-way associative | Fewer L2 cache misses; better performance |
| L3 cache | 32 MB shared, 8-way associative, 118 clock cycles | 36 MB private, 12-way associative, ~80 clock cycles | Better cache performance; ~40% latency improvement |
| Memory bandwidth | 4 GB/sec/chip | ~16 GB/sec/chip | 4X improvement; faster memory access |
| Simultaneous multi-threading | No | Yes | Better processor utilization; ~40% system improvement |
| Processor partitioning | 1 processor | 1/10 of processor | Better usage of processor resources |
| Floating-point registers | 72 | 120 | Better performance |
| Chip interconnect type | Distributed switch | Enhanced distributed switch | Better system throughput |
| Intra-MCM data bus | ½ processor speed | Processor speed | Better performance |
| Inter-MCM data bus | ½ processor speed | ½ processor speed | – |
| Die size | 412 mm² | 389 mm² | 50% more transistors in the same space |
POWER5 Design (DCM)

POWER5 enhancements:
• Two SMT cores per chip
• Integrated memory controller
• Integrated L3 controller
• Larger caches: shared L2 (1.9 MB) with L3 directory on chip; 36 MB L3
• Mega bandwidth: enhanced distributed switch
• Scalable buses (MCM–MCM, chip–chip, GX bus)
• Virtualization support

276M transistors, 389 mm²
DCM: Dual-Chip Module, used in entry and midrange systems
POWER4 and POWER5 Storage Hierarchy
| | POWER4 | POWER5 |
| L2 cache capacity, line size | 1.44 MB, 128 B line | 1.92 MB, 128 B line |
| L2 associativity, replacement | 8-way, LRU | 10-way, LRU |
| Off-chip L3 cache capacity, line size | 32 MB, 512 B line | 36 MB, 256 B line |
| L3 associativity, replacement | 8-way, LRU | 12-way, LRU |
| Chip interconnect type | Distributed switch | Enhanced distributed switch |
| Intra-MCM data buses | ½ processor speed | Processor speed |
| Inter-MCM data buses | ½ processor speed | ½ processor speed |
| Maximum memory | 512 GB | 1024 GB (1 TB) |
Latency and Bandwidth Comparison: POWER4 and POWER5
| | POWER4 latency | POWER5 latency | Bandwidth increase, POWER4 → POWER5 (same frequency) |
| Register files (GPR, FPR) | 1 cycle access | 1 cycle access | Same |
| Instruction cache | 1 cycle access | 1 cycle access | Same |
| Data cache | 2 cycles | 2 cycles | Same |
| L2 cache | 12 cycles | 13 cycles | Same |
| L3 cache | 123 cycles* | 87 cycles* | 1.5x |
| Memory | 351 cycles* | 220 cycles* | Up to 2.7x read, 1.3x write |
| Chip-chip | – | – | 4x |

* Assumes data accessed from local L3 and locally attached memory.
Simultaneous Multi-Threading in POWER5
• Each chip appears as a 4-way SMP to software
• Processor resources optimized for enhanced SMT performance
• Software-controlled thread priority
• Dynamic feedback of runtime behavior to adjust priority
• Dynamic switching between single-threaded and multithreaded mode
(Figure: threads 0 and 1 dispatching simultaneously across the execution units FX0, FX1, FP0, FP1, LS0, LS1, BRX, and CRL.)
POWER5 SF4 Entry System

Architecture: POWER5, 1-, 2-, or 4-way
L3 cache: 36 MB
Memory: 1 GB – 64 GB
Packaging: Deskside or 4U chassis (19" rack); rack 19" x 26"
DASD / bays: 8 DASD (hot plug)
I/O expansion: 5 / 4 slots (hot plug)
Integrated SCSI: Dual Ultra320
Internal RAID: Optional
Integrated Ethernet: Dual ports, 10/100/1000
Media bays: 3
RIO2 drawers: Yes / 8
Dynamic LPAR: 40
Redundant power: Feature
Redundant cooling: Yes
POWER5 ML4 Entry System

Architecture: POWER5; 2- or 4-way base system, 4-way secondary systems
L3 cache: 36 MB
Memory: 1 GB – 64 GB (base system)
Packaging: Deskside or 4U chassis (19" rack); rack 19" x 26"
DASD / bays: 6 DASD (base system)
I/O expansion: 6 / 5 slots (hot plug)
Integrated SCSI: Dual Ultra320 (base system)
Internal RAID: Optional
Integrated Ethernet: Dual ports, 10/100/1000 (base system)
Media bays: 3 (base system)
RIO2 drawers: Yes / 8 (base system)
Dynamic LPAR: 40 (base system)
Redundant power: Yes
Redundant cooling: Yes
Planned POWER5 IH System

Architecture: POWER5, 8-way
L3 cache: 288 MB (total)
Memory: 2 GB – 256 GB
Packaging: Full drawer, 2U chassis (24" rack); rack 24" x 43" deep; 10 systems / rack (80 processors / rack)
DASD / bays: 2 DASD (hot plug)
I/O expansion: 6 / 4 / 2 PCI-X; 2 GX bus & 2 RIO
Integrated SCSI: Ultra320
Integrated Ethernet: 4 ports, 10/100/1000
Redundant power: Yes (1/2 or 1 per drawer)
RIO2 drawers: 0 – 5 / rack
Dynamic LPAR: Yes
Cluster interconnect: HPS (0 – 2 / rack), Gigabit Ethernet
OS: AIX 5.2 & Linux
POWER5 Improves HPC Performance
• Higher sustained-to-peak FLOPS ratio compared to POWER4
• Dedicated memory bus
• Reduction in L3 and memory latency
• Integrated memory controller
• Increased rename resources allow higher instruction-level parallelism in compute-intensive applications
• Fast barrier synchronization operation
• Enhanced data prefetch mechanism
BlueGene/L system architecture
BlueGene/L © 2003 IBM Corporation IBM Research
BlueGene/L fundamentals
• A large number of nodes (65,536)
• Low-power (20 W) nodes for density
• High floating-point performance
• System-on-a-chip technology
• Nodes interconnected as a 64x32x32 three-dimensional torus
  – Easy to build large systems, as each node connects only to its six nearest neighbors – full routing in hardware
  – Bisection bandwidth per node is proportional to n²/n³
• Auxiliary networks for I/O and global operations
• Applications consist of multiple processes with message passing
  – Strictly one process per node
• Minimum OS involvement and overhead
BlueGene/L | System Software Overview © 2003 IBM Corporation IBM Research
BlueGene/L interconnection networks
3-Dimensional Torus
• Interconnects all compute nodes (65,536)
• Virtual cut-through hardware routing
• 1.4 Gb/s on all 12 node links (2.1 GB/s per node)
• Communications backbone for computations
• 350/700 GB/s bisection bandwidth

Global Tree
• One-to-all broadcast functionality
• Reduction operations functionality
• 2.8 Gb/s of bandwidth per link
• Latency of tree traversal on the order of 2 µs
• Interconnects all compute and I/O nodes (1024)

Gigabit Ethernet
• Incorporated into every node ASIC
• Active in the I/O nodes (1:64)
• All external communication (file I/O, control, user interaction, etc.)
BG/L Applications Classes

(Figure: a map of application domains over basic methods. Applications include petroleum reservoirs, biomolecular dynamics / protein folding, flows in porous media, rational drug design, reaction-diffusion modelling, transport, fracture mechanics, multiphase flow, VLSI design, weather and climate, nanotechnology, structural mechanics, seismic processing, aerodynamics, genome processing, cryptography, and large-scale data mining. Underlying methods include Fourier methods, graph-theoretic methods, molecular dynamics, N-body, discrete events, partial and ordinary differential equations, fluid dynamics, Monte Carlo, numerical methods, raster graphics, pattern matching, and symbolic processing.)
• Applications which map well to ultra-parallel (>10,000 CPU) systems
• Important applications can leverage increased compute/memory ratios
• The best applications have high ratios of MIPS & FLOPS versus memory size & cluster bandwidth
• Intense, mostly local interactions between 100,000s to 1,000,000,000s of simple units – e.g., atoms (protein folding / drug design), air pockets (weather and climate), logic gates (VLSI simulation), etc.
• Purely parallel searches over huge data or parameter spaces – genome searching, cryptography, data mining, etc.
Target TOP500 ranking for SC2004
Linpack Cost System Make Procs Description Type MW Frames TFlops M$ 750 MHz special #1 Blue Gene L (1/2) IBM 64k 100 .8 32 ? POWER purpose
500 MHz special #2 Earth Simulator NEC 5120 35.9 5.1 640+ 350 NEC purpose
2 GHz special 20-35 #3 Red Storm CRAY 10k < 2 108 90 Opteron purpose (40 Peak) 2.2 GHz commercially #4 Three Rivers IBM 4564 20+ << 1 45 30 POWER available
1.25 GHz special #5 ASCI Q HP 8192 13.9 2 600 100 Alpha purpose
2 GHz #6 Big Mac Apple 2200 homebuilt 9.8 < 1.5 130+ 5.2 POWER
Project Three Rivers | IBM Confidential

Major Three Rivers Innovative Technologies
• IBM PowerPC 970 processor → industry-leading 64-bit commodity processor
• IBM BladeCenter integration → record cluster density; improved cluster operating efficiency (power, space, cooling)
• Advanced semiconductor technology (CMOS10S) → record price/performance in HPC workloads; record system throughput
• New dense Myrinet interconnect and new MPI software (MX) → significant reduction in switching hardware; faster parallel processing
• Enterprise scale-out IBM FAStT storage → improved subsystem cost and reliability
• Network root file system → improved node reliability; reduced installation and maintenance costs
DARPA Award (Defense Advanced Research Projects Agency)
• IBM to receive a $53.3M grant to develop a new generation of supercomputing
  – Multi-petaflop sustained performance by 2010 (petaflop = 1 quadrillion calculations / sec)
• IBM Research with a consortium of 12 leading universities (Univ. of Illinois, Univ. of New Mexico, etc.)
• Continuation of IBM technology leadership (processors, systems design, etc.)
• IBM proposal: PERCS – Productive, Easy-to-use, Reliable Computing Systems
  – Highly adaptable systems that configure hardware & software to match application needs
  – Revolutionary chip technology
  – New computer architecture
© 2004 IBM Corporation