Architecture of Parallel Computers CSC / ECE 506: BlueGene

BlueGene Architecture
Lecture 24, 7/31/2006
Dr. Steve Hunter

BlueGene/L Program

• December 1999: IBM Research announced a five-year, $100M (US) effort to build a petaflop/s-scale supercomputer to attack science problems such as protein folding. Goals:
  – Advance the state of the art of scientific simulation.
  – Advance the state of the art in computer design and software for capability and capacity markets.
• November 2001: Announced a research partnership with Lawrence Livermore National Laboratory (LLNL).
• November 2002: Announced the planned acquisition of a BG/L machine by LLNL as part of the ASCI Purple contract.
• May 11, 2004: Four racks of DD1 (4,096 nodes at 500 MHz) ran Linpack at 11.68 TFlop/s, ranked #4 on the 23rd Top500 list.
• June 2, 2004: Two racks of DD2 (1,024 nodes at 700 MHz) ran Linpack at 8.655 TFlop/s, ranked #8 on the 23rd Top500 list.
• September 16, 2004: 8 racks ran Linpack at 36.01 TFlop/s.
• November 8, 2004: 16 racks ran Linpack at 70.72 TFlop/s, ranked #1 on the 24th Top500 list.
• December 21, 2004: First 16 racks of BG/L accepted by LLNL.

BlueGene/L Program

• Massive collection of low-power CPUs instead of a moderate-sized collection of high-power CPUs.
  – A joint development of IBM and DOE's National Nuclear Security Administration (NNSA), installed at DOE's Lawrence Livermore National Laboratory.
• BlueGene/L has occupied the No. 1 position on the last three TOP500 lists (http://www.top500.org/).
  – It has reached a Linpack benchmark performance of 280.6 TFlop/s ("teraflops," or trillions of calculations per second) and remains the only system ever to exceed the 100 TFlop/s level.
  – BlueGene machines hold the #1, #2, and #8 positions in the top 10.
• "Objective was to retain exceptional cost/performance levels achieved by application-specific machines, while generalizing the massively parallel architecture enough to enable a relatively broad class of applications" (Overview of BG/L system architecture, IBM JRD).
  – The design approach was to use a very high level of integration, which made simplicity in packaging, design, and bring-up possible.
  – The JRD issue is available at http://www.research.ibm.com/journal/rd49-23.html

BlueGene/L Program

• BlueGene is a family of supercomputers.
  – BlueGene/L is the first step, aimed as a multipurpose, massively parallel, and cost-effective supercomputer (12/04).
  – BlueGene/P is the petaflop generation (12/06).
  – BlueGene/Q is the third generation (~2010).
• Requirements for future generations:
  – Processors will be more powerful.
  – Networks will have higher bandwidth.
  – Applications developed on BlueGene/L will run well on BlueGene/P.

BlueGene/L Fundamentals

• Low-complexity nodes give more flops per transistor and per watt.
• The 3D interconnect supports many scientific simulations, since nature as we see it is three-dimensional.

BlueGene/L Fundamentals

• Cellular architecture
  – Large numbers of low-power, more efficient processors interconnected.
• Rmax of 280.6 TFlop/s
  – Maximal LINPACK performance achieved.
• Rpeak of 360 TFlop/s
  – Theoretical peak performance.
• 65,536 dual-processor compute nodes
  – 700 MHz IBM PowerPC 440 processors.
  – 512 MB memory per compute node, 16 TB in the entire system.
  – 800 TB of disk space.
• 2,500 square feet.
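As a quick sanity check on the Rpeak figure, the theoretical peak follows directly from the node count and clock rate. The arithmetic below assumes each PowerPC 440 core's double floating-point unit retires 4 floating-point operations per cycle (two fused multiply-adds), a detail not stated on these slides:

$$
R_{\text{peak}} = 65{,}536\ \text{nodes} \times 2\ \tfrac{\text{cores}}{\text{node}} \times 4\ \tfrac{\text{flops}}{\text{cycle}} \times 700\ \text{MHz} \approx 3.67\times10^{14}\ \tfrac{\text{flop}}{\text{s}} \approx 367\ \text{TFlop/s}
$$

This matches the 367 TF/s peak in the comparison table later in these notes (the 360 TFlop/s above is the same figure rounded), and the 4 flops/cycle assumption is consistent with the DGEMM result quoted below: 3.7 GFlops observed on one node at 500 MHz is about 92% of a 4 GFlop/s dual-core peak.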
Comparing Systems (Peak)

[Chart: peak speed (flops) versus year introduced, from ENIAC (vacuum tubes) through transistor, IC, vector, parallel-vector, and MPP machines up to Blue Gene/L, approaching the petaflop range; the doubling time is roughly 1.5 years.]

Comparing Systems (Byte/Flop)

Byte-per-flop ratio (interconnect bandwidth relative to peak compute), with interconnect and year:

– Red Storm: 2.0 (2003)
– Earth Simulator: 2.0 (2002)
– Intel Paragon: 1.8 (1992)
– nCUBE/2: 1.0 (1990)
– ASCI Red: 1.0 (0.6) (1997)
– T3E: 0.8 (1996)
– BG/L: 1.5 = 0.75 (torus) + 0.75 (tree) (2004)
– Cplant: 0.1 (1997)
– ASCI White: 0.1 (2000)
– ASCI Q: 0.05, Quadrics (2003)
– ASCI Purple: 0.1 (2004)
– Intel cluster: 0.1, InfiniBand (2004)
– Intel cluster: 0.008, GbE (2003)
– Virginia Tech: 0.16, InfiniBand (2003)
– Chinese Academy of Sciences: 0.04, QsNet (2003)
– NCSA (Dell): 0.04, Myrinet (2003)

Comparing Systems (GFlops/Watt)

• Power efficiencies of recent supercomputers (chart from the IBM Journal of Research and Development):
  – Blue: IBM machines
  – Black: other US machines
  – Red: Japanese machines

Comparing Systems

|                      | ASCI White | ASCI Q | Earth Simulator | Blue Gene/L |
|----------------------|------------|--------|-----------------|-------------|
| Machine peak (TF/s)  | 12.3       | 30     | 40.96           | 367         |
| Total memory (TB)    | 8          | 33     | 10              | 32          |
| Footprint (sq ft)    | 10,000     | 20,000 | 34,000          | 2,500       |
| Power (MW)*          | 1          | 3.8    | 6-8.5           | 1.5         |
| Cost ($M)            | 100        | 200    | 400             | 100         |
| # Nodes              | 512        | 4,096  | 640             | 65,536      |
| Clock (MHz)          | 375        | 1,000  | 500             | 700         |

* 10 megawatts is the approximate usage of 11,000 households.

BG/L Summary of Performance Results

• DGEMM (double-precision general matrix multiply):
  – 92.3% of dual-core peak on 1 node.
  – Observed performance at 500 MHz: 3.7 GFlops.
  – Projected performance at 700 MHz: 5.2 GFlops (tested in the lab up to 650 MHz).
• LINPACK:
  – 77% of peak on 1 node.
  – 70% of peak on 512 nodes (1,435 GFlops at 500 MHz).
• sPPM (simplified Piecewise Parabolic Method) and UMT2000:
  – Single-processor performance roughly on par with POWER3 at 375 MHz.
  – Tested on up to 128 nodes (also the NAS Parallel Benchmarks).
• FFT (Fast Fourier Transform):
  – Up to 508 MFlops on a single processor at 444 MHz (TU Vienna).
  – Pseudo-ops performance (5N log N) at 700 MHz of 1,300 MFlops (65% of peak).
• STREAM (impressive results even at 444 MHz):
  – Tuned: Copy 2.4 GB/s, Scale 2.1 GB/s, Add 1.8 GB/s, Triad 1.9 GB/s.
  – Standard: Copy 1.2 GB/s, Scale 1.1 GB/s, Add 1.2 GB/s, Triad 1.2 GB/s.
  – At 700 MHz these would beat the STREAM numbers for most high-end microprocessors.
• MPI:
  – Latency: under 4,000 cycles (5.5 µs at 700 MHz).
  – Bandwidth: full link bandwidth demonstrated on up to 6 links.
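MPI latency and bandwidth figures like the ones above are typically measured with a simple ping-pong microbenchmark between two ranks. The sketch below is a generic illustration of that technique, not the actual code used for the BG/L measurements; the message size and iteration count (MSG_BYTES, ITERS) are arbitrary choices for the example.

```c
/* Minimal MPI ping-pong sketch (illustrative only, not the BG/L benchmark code).
 * Rank 0 sends a small message to rank 1 and waits for the echo; half the
 * average round-trip time approximates the one-way latency quoted above. */
#include <mpi.h>
#include <stdio.h>

#define ITERS     1000
#define MSG_BYTES 32      /* BG/L networks support messages this small */

int main(int argc, char **argv)
{
    int rank;
    char buf[MSG_BYTES] = {0};

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < ITERS; i++) {
        if (rank == 0) {
            MPI_Send(buf, MSG_BYTES, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, MSG_BYTES, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, MSG_BYTES, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, MSG_BYTES, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double t1 = MPI_Wtime();

    if (rank == 0)
        printf("one-way latency ~ %.2f us\n", (t1 - t0) / (2.0 * ITERS) * 1e6);

    MPI_Finalize();
    return 0;
}
```

For reference, 4,000 cycles at 700 MHz is about 5.7 µs, so the 5.5 µs figure on the slide corresponds to slightly under 4,000 cycles.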
BlueGene/L Architecture

• To achieve this level of integration, the machine was developed around a moderate-frequency processor available in system-on-a-chip (SoC) technology.
  – This approach was chosen because of its performance/power advantage.
  – In terms of performance/watt, the low-frequency, low-power, embedded IBM PowerPC core consistently outperforms high-frequency, high-power microprocessors by a factor of 2 to 10.
  – Industry focus is on performance per rack:
    » Performance/rack = Performance/watt × Watt/rack
    » Watt/rack is about 20 kW, for power-delivery and thermal-cooling reasons.
• Power and cooling
  – Using conventional techniques, a 360 TFlop/s machine would require 10-20 megawatts.
  – BlueGene/L uses only 1.76 megawatts.

Microprocessor Power Density Growth

[Chart: growth of microprocessor power density over time.]

System Power Comparison

– BG/L (2,048 processors): 20.1 kW
– 450 ThinkPads: 20.3 kW
(LS Mok, 4/2002)

BlueGene/L Architecture

• The networks were chosen with extreme scaling in mind.
  – They scale efficiently in terms of both performance and packaging.
  – They support very small messages.
    » As small as 32 bytes.
  – They include hardware support for collective operations.
    » Broadcast, reduction, scan, etc.
• Reliability, Availability and Serviceability (RAS) is another critical issue for scaling.
  – BG/L needs to be reliable and usable even at extreme scaling limits.
  – 20 fails per 1,000,000,000 hours = 1 node failure every 4.5 weeks (see the worked calculation following the Compute Chip slide below).
• System software and monitoring are also important to scaling.
  – BG/L is designed to efficiently utilize a distributed-memory, message-passing programming model.
  – MPI is the dominant message-passing model, with hardware features added and parameters tuned for it.

RAS (Reliability, Availability, Serviceability)

• System designed for RAS from top to bottom.
  – System issues
    » Redundant bulk supplies, power converters, fans, DRAM bits, and cable bits.
    » Extensive data logging (voltage, temperature, recoverable errors, ...) for failure forecasting.
    » Nearly no single points of failure.
  – Chip design
    » ECC on all SRAMs.
    » All dataflow outside the processors is protected by error-detection mechanisms.
    » Access to all state via a noninvasive back door.
  – A low-power, simple design leads to higher reliability.
  – All interconnects have multiple error detection and correction coverage.
    » Virtually zero escape probability for link errors.

BlueGene/L System

[Photographs of the BlueGene/L system (Rochester Lab, 2005): 136.8 TFlop/s on LINPACK with 64K processors; 1 TF = 1,000,000,000,000 flops.]

Physical Layout of BG/L

[Figure: physical layout of the BG/L installation.]

Midplanes and Racks

[Figure: midplane and rack packaging.]

The Compute Chip

• System-on-a-chip (SoC)
• 1 ASIC
  – 2 PowerPC processors
  – L1 and L2 caches
  – 4 MB embedded DRAM
  – DDR DRAM interface and DMA controller
  – Network connectivity hardware
  – Control / monitoring equipment
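The "1 node failure every 4.5 weeks" figure quoted under RAS above can be checked directly, reading the 20 failures per 10^9 hours as a per-node rate and multiplying across all 65,536 nodes:

$$
\lambda_{\text{system}} = 65{,}536 \times \frac{20}{10^{9}\ \text{h}} \approx 1.31\times10^{-3}\ \text{h}^{-1},
\qquad
\text{MTBF} = \frac{1}{\lambda_{\text{system}}} \approx 763\ \text{h} \approx 4.5\ \text{weeks}
$$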
