IBM Deep Computing Strategy
IBM Systems Group
IBM Deep Computing Strategy
© 2004 IBM Corporation

IBM's Deep Computing Strategy: Solving Problems More Quickly at Lower Cost
• Aggressively evolve the POWER-based Deep Computing product line
• Develop advanced systems based on loosely coupled clusters
• Deliver supercomputing capability with new access models and financial flexibility
• Research and overcome obstacles to parallelism and other revolutionary approaches to supercomputing

Creating the Future of Deep Computing
• ASCI Purple and Blue Gene/L
  ► Together, the two systems will provide over 460 TF
• Track record for delivering the world's largest production-quality supercomputers
  ► ASCI Blue (3.9 TF) & ASCI White (12.3 TF)
  ► ASCI Pathforward (Federation 4GB Switch)
• DARPA's HPCS initiative
  ► Awarded $53M in funding for phase 2 of DARPA's High Productivity Computing Systems initiative
  ► The IBM PERCS program aims to bring sustained multi-petaflop performance and autonomic capabilities to commercial supercomputers

IBM Systems – Industry Leadership & Choice
Spanning scale-up / SMP computing through Linux scale-out / distributed computing:
• p690/p690+ (large SMP): high-bandwidth High Performance Switch, SSI, LPAR, RAS
• p655 (high density): POWER4-based; AIX clusters (e1600)
• x445: scalable nodes, static LPAR, VMware
• x382: 2-way Itanium 2 rack mount
• e325: 1U Opteron-based Linux cluster (e1350)
• x335: 1U / 2-processor density, Intel-based
• BladeCenter™: denser form factors, rapid deployment, flexible architectures, switch integration

POWER / PowerPC: The Most Scalable Architecture
• Servers: POWER2, POWER3, POWER4, POWER4+, POWER5
• Desktops: PPC 603e, PPC 750, PPC 750FX, PPC 970
• Game and embedded: PPC 401, PPC 405GP, PPC 440GP, PPC 440GX

IBM eServer pSeries | IBM Confidential

POWER4 / POWER5 Differences

| Feature | POWER4 | POWER5 | Benefit |
|---|---|---|---|
| L1 cache | 2-way associative, FIFO | 4-way associative, LRU | Improved L1 cache performance |
| L2 cache | 8-way associative, 1.44 MB | 10-way associative, 1.9 MB | Fewer L2 cache misses; better performance |
| L3 cache | 8-way associative, 32 MB shared, 118 clock cycles | 12-way associative, 36 MB private, ~80 clock cycles | Better cache performance; ~40% improvement |
| Memory bandwidth | 4 GB/s per chip | ~16 GB/s per chip | 4x improvement; faster memory access |
| Simultaneous multi-threading | No | Yes | Better processor utilization; ~40% system improvement |
| Processor partitioning | 1 processor | 1/10 of a processor | Better usage of processor resources |
| Floating-point registers | 72 | 120 | Better performance |
| Chip interconnect type | Distributed switch | Enhanced distributed switch | Better system throughput |
| Intra-MCM data bus | ½ processor speed | Processor speed | Better performance |
| Inter-MCM data bus | ½ processor speed | ½ processor speed | — |
| Die size | 412 mm² | 389 mm² | 50% more transistors in the same space |

POWER5 Design
• Two SMT cores per chip, sharing a 1.9 MB L2
• Integrated memory controller
• Integrated L3 controller with 36 MB L3
• Larger caches (L2 & L3)
• Mega bandwidth: enhanced distributed switch; scalable buses (MCM–MCM, chip–chip, GX bus)
• Virtualization support
• 276M transistors, 389 mm²
• DCM (Dual-Chip Module) packaging used in entry and midrange systems

POWER4 and POWER5 Storage Hierarchy

| | POWER4 | POWER5 |
|---|---|---|
| L2 cache: capacity, line size | 1.44 MB, 128 B line | 1.92 MB, 128 B line |
| L2 cache: associativity, replacement | 8-way, LRU | 10-way, LRU |
| Off-chip L3 cache: capacity, line size | 32 MB, 512 B line | 36 MB, 256 B line |
| L3 cache: associativity, replacement | 8-way, LRU | 12-way, LRU |
| Chip interconnect type | Distributed switch | Enhanced distributed switch |
| Intra-MCM data buses | ½ processor speed | Processor speed |
| Inter-MCM data buses | ½ processor speed | ½ processor speed |
| Maximum memory | 512 GB | 1024 GB (1 TB) |

Latency and Bandwidth Comparison: POWER4 and POWER5

| | POWER4 latency | POWER5 latency | POWER4→POWER5 bandwidth increase (same frequency) |
|---|---|---|---|
| Register files (GPR, FPR) | 1 cycle | 1 cycle | Same |
| Instruction cache | 1 cycle | 1 cycle | Same |
| Data cache | 2 cycles | 2 cycles | Same |
| L2 cache | 12 cycles | 13 cycles | Same |
| L3 cache | 123 cycles* | 87 cycles* | 1.5x |
| Memory | 351 cycles* | 220 cycles* | Up to 2.7x read, 1.3x write |
| Chip–chip | — | — | 4x |

* Assumes data accessed from local L3 and locally attached memory.

Simultaneous Multi-Threading in POWER5
• Each chip appears to software as a 4-way SMP (two cores, two hardware threads per core)
• Processor resources (FX0/FX1, FP0/FP1, LS0/LS1, BRX, CRL execution units) optimized for enhanced SMT performance
• Software-controlled thread priority
• Dynamic feedback of runtime behavior to adjust priority
• Dynamic switching between single-threaded and multithreaded mode

POWER5 SF4 Entry System (deskside or 4U chassis, 19" x 26" rack)

| Architecture | POWER5, 1-, 2-, or 4-way |
| L3 cache | 36 MB |
| Memory | 1 GB – 64 GB |
| Packaging | Deskside / 4U (19" rack) |
| DASD / bays | 8 DASD (hot plug) |
| I/O expansion | 5/4 slots (hot plug) |
| Integrated SCSI | Dual Ultra320 |
| Internal RAID | Optional |
| Integrated Ethernet | Dual 10/100/1000 ports |
| Media bays | 3 |
| RIO2 drawers | Yes / 8 |
| Dynamic LPAR | 40 |
| Redundant power | Feature |
| Redundant cooling | Yes |

POWER5 ML4 Entry System (deskside or 4U chassis, 19" x 26" rack)

| Architecture | POWER5, 2/4-way base system; 4-way secondary systems |
| L3 cache | 36 MB |
| Memory | 1 GB – 64 GB (base system) |
| Packaging | 4U (19" rack) |
| DASD / bays | 6 DASD (base system) |
| I/O expansion | 6/5 slots (hot plug) |
| Integrated SCSI | Dual Ultra320 (base system) |
| Internal RAID | Optional |
| Integrated Ethernet | Dual 10/100/1000 ports (base system) |
| Media bays | 3 (base system) |
| RIO2 drawers | Yes / 8 (base system) |
| Dynamic LPAR | 40 (base system) |
| Redundant power | Yes |
| Redundant cooling | Yes |

Planned POWER5 IH (2.5U drawer, 24" x 43" deep rack; 10 systems / 80 processors per rack)

| Architecture | POWER5, 8-way |
| L3 cache | 288 MB (total) |
| Memory | 2 GB – 256 GB |
| Packaging | 2U (24" rack); 10 systems per rack |
| DASD / bays | 2 DASD (hot plug) |
| I/O expansion | 6/4/2 PCI-X; 2 GX bus & 2 RIO |
| Integrated SCSI | Ultra320 |
| Integrated Ethernet | 4 ports, 10/100/1000 |
| RIO2 drawers | Yes (½ or 1 per drawer); 0–5 per rack |
| Dynamic LPAR | Yes |
| Cluster interconnect | HPS (0–2 per rack); Gigabit Ethernet |
| OS | AIX 5.2 & Linux |

POWER5 Improves HPC Performance
• Higher sustained-to-peak FLOPS ratio compared to POWER4:
  ► Dedicated memory bus
  ► Reduced L3 and memory latency
  ► Integrated memory controller
• Increased rename resources allow higher instruction-level parallelism in compute-intensive applications
• Fast barrier synchronization operation
• Enhanced data prefetch mechanism

IBM Research: BlueGene/L System Architecture
© 2003 IBM Corporation

BlueGene/L fundamentals
• A large number of nodes (65,536)
• Low-power (20 W) nodes for density
• High floating-point performance via system-on-a-chip technology
• Nodes interconnected as a 64x32x32 three-dimensional torus
  ► Easy to build large systems: each node connects only to its six nearest neighbors, with full routing in hardware
  ► Bisection bandwidth per node is proportional to n²/n³, i.e. it falls off as 1/n as the machine grows
• Auxiliary networks for I/O and global operations
• Applications consist of multiple processes communicating by message passing, with strictly one process per node
• Minimal OS involvement and overhead

BlueGene/L interconnection networks
• 3-dimensional torus
  ► Interconnects all compute nodes (65,536)
  ► Virtual cut-through hardware routing
  ► 1.4 Gb/s on all 12 node links (2.1 GB/s per node)
  ► Communications backbone for computations
  ► 350/700 GB/s bisection bandwidth
• Global tree
  ► One-to-all broadcast and reduction-operation functionality
  ► 2.8 Gb/s of bandwidth per link; tree-traversal latency on the order of 2 µs
  ► Interconnects all compute and I/O nodes (1,024 I/O nodes)
• Gigabit Ethernet
  ► Incorporated into every node ASIC; active in the I/O nodes (1:64 I/O-to-compute ratio)
  ► Carries all external communication (file I/O, control, user interaction, etc.)
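The torus arithmetic above can be made concrete with a short sketch. This is illustrative Python only (not BlueGene/L system software; the `neighbors` helper is hypothetical): it computes the six nearest neighbors of a node in the 64x32x32 torus with wrap-around, and checks that the node count and bisection link count are consistent with the figures on the slides.

```python
# Sketch of 3D-torus addressing, assuming the 64x32x32 geometry above.
# Illustrative only -- not actual BlueGene/L system software.

DIMS = (64, 32, 32)  # torus dimensions (X, Y, Z)

def neighbors(x, y, z, dims=DIMS):
    """Return the six nearest neighbors of node (x, y, z), with wrap-around."""
    nx, ny, nz = dims
    return [
        ((x + 1) % nx, y, z), ((x - 1) % nx, y, z),
        (x, (y + 1) % ny, z), (x, (y - 1) % ny, z),
        (x, y, (z + 1) % nz), (x, y, (z - 1) % nz),
    ]

# Every node has exactly 6 links regardless of machine size,
# so wiring grows only linearly with node count.
assert len(neighbors(0, 0, 0)) == 6
assert (63, 0, 0) in neighbors(0, 0, 0)  # X dimension wraps around

nodes = DIMS[0] * DIMS[1] * DIMS[2]        # 65,536 nodes
# Bisecting across the longest (X) dimension cuts 2 * Y * Z links
# (the factor of 2 comes from the wrap-around links of the torus).
cut_links = 2 * DIMS[1] * DIMS[2]          # 2,048 links
bisection_gbs = cut_links * 1.4 / 8        # 1.4 Gb/s per link -> GB/s
print(nodes, cut_links, bisection_gbs)     # 65536 2048 358.4
```

The computed ~358 GB/s matches the ~350 GB/s unidirectional bisection-bandwidth figure quoted above (the 700 GB/s figure counts both link directions).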
BG/L Application Classes
• Application domains: petroleum reservoirs; biomolecular dynamics / protein folding; flows in porous media; rational drug design; molecular modelling; reaction-diffusion modelling; transport; fracture mechanics; multiphase flow; VLSI design; weather and climate; nanotechnology; structural mechanics; seismic processing
• Underlying computational methods: Fourier methods; graph-theoretic methods; N-body; discrete events; basic algorithms; partial differential equations; ordinary differential equations; fluid dynamics; Monte Carlo; numerical methods; raster graphics; fields; pattern matching; symbolic processing
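Many of these application classes (PDE solvers, fluid dynamics, transport) map onto the machine by domain decomposition: each process owns a piece of the grid and exchanges boundary ("halo") values with its nearest neighbors, which is exactly the traffic pattern the 3D torus serves. A minimal sketch of the idea, simulated with plain Python lists rather than MPI (the `jacobi_step` function is a hypothetical illustration, with one list standing in for one process's slab):

```python
# Domain-decomposition sketch: a 1D Jacobi relaxation where each
# "process" owns a slab of the grid. Plain Python stands in for the
# halo exchange a real code would do with MPI, one process per node.

def jacobi_step(slabs):
    """One Jacobi sweep over per-process slabs of a 1D grid."""
    n = len(slabs)
    new_slabs = []
    for i, slab in enumerate(slabs):
        # 1. Halo exchange: fetch edge values from neighbor slabs
        #    (at the domain boundary, just reuse the slab's own edge).
        left = slabs[i - 1][-1] if i > 0 else slab[0]
        right = slabs[i + 1][0] if i < n - 1 else slab[-1]
        padded = [left] + slab + [right]
        # 2. Local update: each point becomes the mean of its neighbors.
        new_slabs.append([(padded[j - 1] + padded[j + 1]) / 2
                          for j in range(1, len(padded) - 1)])
    return new_slabs

# Four "processes", each owning 4 points of a 16-point grid.
slabs = [[0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 1.0]]
for _ in range(100):
    slabs = jacobi_step(slabs)
```

Only the slab edges cross process boundaries each iteration, so per-step communication volume stays constant as the local grid grows, which is what makes nearest-neighbor networks like the torus a good fit.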