Architecture of Parallel Computers CSC / ECE 506 BlueGene
Architecture of Parallel Computers CSC / ECE 506
BlueGene Architecture
Lecture 24, 7/31/2006
Dr. Steve Hunter

BlueGene/L Program
• December 1999: IBM Research announced a five-year, $100M (US) effort to build a petaflop/s-scale supercomputer to attack science problems such as protein folding. Goals:
  – Advance the state of the art of scientific simulation.
  – Advance the state of the art in computer design and software for capability and capacity markets.
• November 2001: Announced a research partnership with Lawrence Livermore National Laboratory (LLNL).
• November 2002: Announced the planned acquisition of a BG/L machine by LLNL as part of the ASCI Purple contract.
• May 11, 2004: Four DD1 racks (4,096 nodes at 500 MHz) ran Linpack at 11.68 TFlop/s, ranked #4 on the 23rd Top500 list.
• June 2, 2004: Two DD2 racks (1,024 nodes per rack, at 700 MHz) ran Linpack at 8.655 TFlop/s, ranked #8 on the 23rd Top500 list.
• September 16, 2004: 8 racks ran Linpack at 36.01 TFlop/s.
• November 8, 2004: 16 racks ran Linpack at 70.72 TFlop/s, ranked #1 on the 24th Top500 list.
• December 21, 2004: The first 16 racks of BG/L were accepted by LLNL.

BlueGene/L Program
• A massive collection of low-power CPUs instead of a moderate-sized collection of high-power CPUs.
  – A joint development of IBM and DOE's National Nuclear Security Administration (NNSA), installed at DOE's Lawrence Livermore National Laboratory.
• BlueGene/L has occupied the No. 1 position on the last three TOP500 lists (http://www.top500.org/).
  – It has reached a Linpack benchmark performance of 280.6 TFlop/s ("teraflops," or trillions of calculations per second) and remains the only system ever to exceed the 100 TFlop/s level.
  – BlueGene systems hold the #1, #2, and #8 positions in the top 10.
• "Objective was to retain exceptional cost/performance levels achieved by application-specific machines, while generalizing the massively parallel architecture enough to enable a relatively broad class of applications" (Overview of BG/L system architecture, IBM JRD).
  – The design approach was to use a very high level of integration, which made simplicity in packaging, design, and bring-up possible.
  – The JRD issue is available at: http://www.research.ibm.com/journal/rd49-23.html

BlueGene/L Program
• BlueGene is a family of supercomputers.
  – BlueGene/L is the first step, aimed at being a multipurpose, massively parallel, and cost-effective supercomputer (12/04).
  – BlueGene/P is the petaflop generation (12/06).
  – BlueGene/Q is the third generation (~2010).
• Requirements for future generations:
  – Processors will be more powerful.
  – Networks will have higher bandwidth.
  – Applications developed on BlueGene/L will run well on BlueGene/P.

BlueGene/L Fundamentals
• Low-complexity nodes give more flops per transistor and per watt.
• A 3D interconnect suits many scientific simulations, since the physical systems being modeled are themselves three-dimensional.

BlueGene/L Fundamentals
• Cellular architecture
  – Large numbers of low-power, more efficient processors interconnected.
• Rmax of 280.6 Teraflops
  – Maximum LINPACK performance achieved.
• Rpeak of 360 Teraflops
  – Theoretical peak performance.
• 65,536 dual-processor compute nodes
  – 700 MHz IBM PowerPC 440 processors.
  – 512 MB memory per compute node, 32 TB in the entire system.
  – 800 TB of disk space.
• 2,500 square feet
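As a quick cross-check on these figures (a worked calculation added here, not part of the original slides): per the BG/L design papers in the JRD issue cited above, each PowerPC 440 core carries a double FPU that can retire two fused multiply-adds, i.e., 4 flops, per cycle, so

\[ 2~\text{cores/node} \times 4~\text{flops/cycle} \times 0.7~\text{GHz} = 5.6~\text{GFlop/s per node}, \]
\[ 65{,}536~\text{nodes} \times 5.6~\text{GFlop/s} \approx 367~\text{TFlop/s peak}, \]

which matches the 367 TF/s peak quoted in the comparison table below (the 360 Teraflops above is a rounded figure). The measured Rmax of 280.6 TFlop/s is then about 76% of peak.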
Comparing Systems (Peak)
[Figure: supercomputer peak speed (flops, log scale) versus year introduced, 1940-2010, tracing vacuum-tube, transistor, IC, vector, parallel-vector, and MPP machines from ENIAC through ASCI Red/White/Q, the Earth Simulator, Red Storm, Thunder, and Blue Gene/L; doubling time of about 1.5 years.]

Comparing Systems (Byte/Flop)
• Red Storm: 2.0 (2003)
• Earth Simulator: 2.0 (2002)
• Intel Paragon: 1.8 (1992)
• nCUBE/2: 1.0 (1990)
• ASCI Red: 1.0 (0.6) (1997)
• T3E: 0.8 (1996)
• BG/L: 1.5 = 0.75 (torus) + 0.75 (tree) (2004)
• Cplant: 0.1 (1997)
• ASCI White: 0.1 (2000)
• ASCI Q: 0.05, Quadrics (2003)
• ASCI Purple: 0.1 (2004)
• Intel Cluster: 0.1, InfiniBand (2004)
• Intel Cluster: 0.008, Gigabit Ethernet (2003)
• Virginia Tech: 0.16, InfiniBand (2003)
• Chinese Academy of Sciences: 0.04, QsNet (2003)
• NCSA (Dell): 0.04, Myrinet (2003)

Comparing Systems (GFlops/Watt)
• Power efficiencies of recent supercomputers [figure, from the IBM Journal of Research and Development]:
  – Blue: IBM machines
  – Black: other US machines
  – Red: Japanese machines

Comparing Systems

                       ASCI White   ASCI Q   Earth Simulator   Blue Gene/L
Machine peak (TF/s)        12.3       30          40.96            367
Total memory (TBytes)       8         33          10               32
Footprint (sq ft)        10,000     20,000       34,000           2,500
Power (MW)*                 1         3.8         6-8.5            1.5
Cost ($M)                 100        200          400             100
# Nodes                   512       4,096         640            65,536
Clock (MHz)               375       1,000         500             700

* 10 megawatts is approximately the usage of 11,000 households.

BG/L Summary of Performance Results
• DGEMM (double-precision general matrix multiply):
  – 92.3% of dual-core peak on 1 node.
  – Observed performance at 500 MHz: 3.7 GFlops.
  – Projected performance at 700 MHz: 5.2 GFlops (tested in the lab up to 650 MHz).
• LINPACK:
  – 77% of peak on 1 node.
  – 70% of peak on 512 nodes (1,435 GFlops at 500 MHz).
• sPPM (simplified Piecewise Parabolic Method) and UMT2000:
  – Single-processor performance roughly on par with POWER3 at 375 MHz.
  – Tested on up to 128 nodes (also the NAS Parallel Benchmarks).
• FFT (Fast Fourier Transform):
  – Up to 508 MFlops on a single processor at 444 MHz (TU Vienna).
  – Pseudo-op rate (counting 5N log N operations) at 700 MHz: 1,300 MFlops (65% of peak).
• STREAM (impressive results even at 444 MHz):
  – Tuned: Copy 2.4 GB/s, Scale 2.1 GB/s, Add 1.8 GB/s, Triad 1.9 GB/s.
  – Standard: Copy 1.2 GB/s, Scale 1.1 GB/s, Add 1.2 GB/s, Triad 1.2 GB/s.
  – At 700 MHz, these would beat the STREAM numbers for most high-end microprocessors.
• MPI:
  – Latency: < 4,000 cycles (5.5 µs at 700 MHz).
  – Bandwidth: full link bandwidth demonstrated on up to 6 links.
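Several of these figures can be cross-checked with simple arithmetic (a worked verification added here, not part of the original slides, using the 4 flops/cycle/core peak derived in the earlier check):

\[ \text{DGEMM at 500 MHz: } 0.923 \times (2 \times 4 \times 0.5~\text{GHz}) = 0.923 \times 4.0~\text{GFlop/s} \approx 3.7~\text{GFlop/s}, \]
\[ \text{DGEMM at 700 MHz: } 0.923 \times 5.6~\text{GFlop/s} \approx 5.2~\text{GFlop/s}, \]
\[ \text{LINPACK on 512 nodes at 500 MHz: } 0.70 \times 512 \times 4.0~\text{GFlop/s} \approx 1{,}434~\text{GFlop/s}, \]
\[ \text{MPI latency: } 4{,}000~\text{cycles} \div 700~\text{MHz} \approx 5.7~\mu\text{s} \quad (\text{the quoted } 5.5~\mu\text{s} \text{ is about } 3{,}850~\text{cycles}). \]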
BlueGene/L Architecture
• To achieve this level of integration, the machine was developed around a moderate-frequency processor available in system-on-a-chip (SoC) technology.
  – This approach was chosen because of its performance/power advantage.
  – In terms of performance/watt, the low-frequency, low-power embedded IBM PowerPC core consistently outperforms high-frequency, high-power microprocessors by a factor of 2 to 10.
  – The industry focus is on performance per rack:
    » Performance / rack = Performance / watt × Watt / rack
    » Watt / rack ≈ 20 kW, for power-delivery and thermal-cooling reasons.
• Power and cooling
  – Using conventional techniques, a 360 TFlop/s machine would require 10-20 megawatts.
  – BlueGene/L uses only 1.76 megawatts.

Microprocessor Power Density Growth
[Figure: growth of microprocessor power density over time.]

System Power Comparison (L. S. Mok, 4/2002)
• BG/L, 2,048 processors: 20.1 kW
• 450 ThinkPads: 20.3 kW

BlueGene/L Architecture
• The networks were chosen with extreme scaling in mind:
  – They scale efficiently in terms of both performance and packaging.
  – They support very small messages, as small as 32 bytes.
  – They include hardware support for collective operations: broadcast, reduction, scan, etc.
• Reliability, Availability, and Serviceability (RAS) is another critical issue for scaling:
  – BG/L needs to be reliable and usable even at extreme scaling limits.
  – 20 failures per 1,000,000,000 hours per node = 1 node failure every 4.5 weeks (with 65,536 nodes: 65,536 × 20 × 10^-9 ≈ 1.3 × 10^-3 failures/hour, i.e., roughly one failure every 760 hours).
• System software and monitoring are also important to scaling:
  – BG/L is designed to efficiently utilize a distributed-memory, message-passing programming model.
  – MPI is the dominant message-passing model, with hardware features added and parameters tuned for it (see the MPI sketch at the end of these notes).

RAS (Reliability, Availability, Serviceability)
• The system is designed for RAS from top to bottom.
  – System issues:
    » Redundant bulk supplies, power converters, fans, DRAM bits, and cable bits.
    » Extensive data logging (voltage, temperature, recoverable errors, ...) for failure forecasting.
    » Nearly no single points of failure.
  – Chip design:
    » ECC on all SRAMs.
    » All dataflow outside the processors is protected by error-detection mechanisms.
    » Access to all state via a noninvasive back door.
  – The low-power, simple design leads to higher reliability.
  – All interconnects have multiple error-detection and correction coverage:
    » Virtually zero escape probability for link errors.

BlueGene/L System
• 136.8 TFlop/s on LINPACK (64K processors; 1 TFlop/s = 10^12 flops)
• Rochester Lab, 2005
[Several figure-only slides follow: system photos and packaging diagrams.]

Physical Layout of BG/L
[Figure: physical layout of the BG/L installation.]

Midplanes and Racks
[Figure: midplane and rack packaging.]

The Compute Chip
• System-on-a-chip (SoC)
• 1 ASIC:
  – 2 PowerPC processors
  – L1 and L2 caches
  – 4 MB embedded DRAM
  – DDR DRAM interface and DMA controller
  – Network connectivity hardware
  – Control / monitoring equip.
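The slides note that MPI is BG/L's dominant programming model and that collectives such as broadcast and reduction are supported in hardware. As a minimal illustration (standard MPI in C, not BG/L-specific code from the slides), the sketch below shows the kind of global reduction that maps onto that collective hardware:

/* Minimal MPI sketch: a global sum across all ranks.
 * Compile with an MPI C compiler, e.g. mpicc, and run with mpirun. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Each rank contributes one value; MPI_Allreduce combines them and
     * returns the global sum to every rank. */
    double local = (double)rank;
    double global_sum = 0.0;
    MPI_Allreduce(&local, &global_sum, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum over %d ranks = %g\n", nprocs, global_sum);

    MPI_Finalize();
    return 0;
}

On BlueGene/L a reduction like this can be carried out by the dedicated collective (tree) network rather than by software trees of point-to-point messages over the torus, which is why the slides call out hardware support for broadcast, reduction, and scan.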