Acceleration Technology for High Performance Computing in China

Real Performance. Real Science. Real Tools.

John Gustafson, Ph.D.
CTO, High Performance Computing
ClearSpeed Technology

Copyright © 2007 ClearSpeed Technology plc. All rights reserved. www.clearspeed.com

Thesis

• China faces the challenge of enormous energy demand for its continued growth
• Computing in the US now consumes over 10% of the national power grid (and growing)!
• China will soon follow this pattern
• For HPC applications, ClearSpeed has sophisticated technologies for reducing power use per operation by tenfold
• ClearSpeed is partnering with one of the top 3 Chinese computer companies to create a new high-performance computer

ClearSpeed company background

• Fabless semiconductor company based in Bristol and San Jose
  – CSX600 coprocessors manufactured by IBM
  – Accelerator boards assembled and tested by Flextronics
• Core Products
  – World’s highest performance, lowest power consumption processors for Double Precision Floating Point (IEEE 754 compliant)
  – Accelerators for PCI expansion slots in servers and workstations
  – Work alongside 32 bit or 64 bit x86 industry-standard processors to accelerate compute intensive functions
• Market Focus
  – Acceleration of High Performance Computing (HPC) applications
  – Universities and National Laboratories, Life Sciences & Financial Services
  – Embedded applications in consumer and military applications
• Competitive Position
  – Only supplier of custom-designed, HPC-focused acceleration products
  – Uniquely positioned to exploit growing HPC acceleration need
  – Substantial Intellectual Property base with over 100 patents granted/pending
Constraints For Processor Development

John Shalf and David Bailey (Lawrence Berkeley NL) 2007:

• New constraints
  – Power limits clock rates
  – Cannot squeeze more performance from ILP (complex cores with Instruction Level Parallelism) either
• But Moore’s Law continues!
  – What to do with all of those transistors if everything else is flat-lining?
  – Now, #cores per chip doubles every 18 months instead of clock frequency
• Power consumption is the chief concern for system architects
• Power efficiency is the primary concern of consumers of computer systems!!

Figure courtesy of Kunle Olukotun, Lance Hammond, Herb Sutter, and Burton Smith

Folding@home with 700,000 PlayStation 3s?

• Each PS3 averages 220 watts on this application.
• Total power use: 266 megawatts!
• Power cost: about $600,000 per day
• 2000 barrels of oil per day for one petaflops
• For that much electric power, our accelerators can get 500 petaflops!

ClearSpeed product overview

• CSX600 Processors
  – World’s highest performance, most energy efficient processor for double precision floating point applications
  – 96 processing cores
  – 40 DP GFLOPS peak, >33 GFLOPS DGEMM
  – 10 watts (typical)
• Advance™ PCI-X and PCIe Accelerators
  – Exploit standard expansion slots for servers, workstations and blade expansion units
  – >66 GFLOPS DGEMM per accelerator
  – 25 – 33 watts (typical)
• Software
  – Linux and Microsoft® drivers
  – ClearSpeed CSXL plug and play acceleration
    • Accelerates compute intensive calls from Intel MKL and AMD ACML standard libraries
  – Software Development Kit
    • Familiar X86 development environment
    • C compiler with parallel extensions
    • Complete Visual Profiling and Debugging Tools
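The energy-efficiency figures in the product overview can be checked with a few lines of arithmetic. The sketch below is only illustrative: the helper name `gflops_per_watt` is hypothetical, and the ">66 GFLOPS DGEMM" figure is treated as exactly 66 GFLOPS, so the board results are lower bounds.

```python
def gflops_per_watt(gflops: float, watts: float) -> float:
    """Return energy efficiency in GFLOPS per watt (hypothetical helper)."""
    if watts <= 0:
        raise ValueError("watts must be positive")
    return gflops / watts


# CSX600 processor: 40 DP GFLOPS peak at 10 watts (typical).
csx600_peak = gflops_per_watt(40.0, 10.0)

# Advance board: at least 66 GFLOPS DGEMM at 25 to 33 watts (typical).
# Assumption: 66 GFLOPS is used as a floor, so these are lower bounds.
board_at_33_w = gflops_per_watt(66.0, 33.0)
board_at_25_w = gflops_per_watt(66.0, 25.0)

if __name__ == "__main__":
    print(f"CSX600 peak:       {csx600_peak:.2f} GFLOPS/W")
    print(f"Advance at 33 W:   {board_at_33_w:.2f} GFLOPS/W")
    print(f"Advance at 25 W:   {board_at_25_w:.2f} GFLOPS/W")
```

Under these assumptions the board sustains at least 2 GFLOPS per watt on DGEMM across its typical power range.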
The ClearSpeed Advance accelerator family

[Board photos: Advance X620, Advance e620]

• The only accelerator family specifically designed for HPC
  – 80.64 GFLOPS peak, 66 GFLOPS sustained double precision
  – Industry leading energy efficiency at > 2 GFLOPS per watt
  – Advance e620
    • PCIe x8, standard height: 98 mm (3.9 in), half length: 167 mm (6.5 in)
  – Advance X620
    • PCI-X, standard height: 98 mm (3.9 in), two-thirds length: 203 mm (8.0 in)
  – Plug & Play acceleration with standard math libraries including Level 3 BLAS and LAPACK
  – Fully programmable in Cn extended parallel language

Heat leads to bulk

• Air cooling hits limits at about 70 watts/liter
  – PCI standard of 25 watts, size is 0.3 liters ✔
  – A 1U server might use 1000 watts, volume is 14 liters ✔
  – A 42U standard rack might use 40 kilowatts, 3000 liters ✔
• Exceed 70 watts/liter, and temperatures rise above operational limits

[Photo: latest e620 ClearSpeed accelerator. 4 inches by 6 inches, 0.5 liter in system, 35 watts, 9 ounces]

Dissipation volume can exceed actual volume

• To find the real volume occupied by a component in liters, divide its wattage by 70
• What may seem like a dense, powerful solution might actually dilute the GFLOPS per liter because of heat generation.
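The divide-by-70 rule above can be sketched in a few lines. This is a minimal illustration, not part of the original slides: the function names are hypothetical, and taking the larger of the physical volume and the dissipation volume as the "effective" volume is an assumed reading of the slide title.

```python
AIR_COOLING_LIMIT_W_PER_LITER = 70.0  # approximate air-cooling limit from the slide


def dissipation_volume_liters(watts: float) -> float:
    """Volume (liters) a component needs for air cooling: wattage / 70."""
    return watts / AIR_COOLING_LIMIT_W_PER_LITER


def effective_volume_liters(watts: float, physical_liters: float) -> float:
    """Assumption: a component occupies whichever is larger, its physical
    volume or the volume needed to dissipate its heat."""
    return max(physical_liters, dissipation_volume_liters(watts))


def gflops_per_liter(gflops: float, watts: float, physical_liters: float) -> float:
    """Compute density after accounting for heat dissipation."""
    return gflops / effective_volume_liters(watts, physical_liters)


if __name__ == "__main__":
    # e620 figures from the slides: 35 watts, 0.5 liter in system,
    # 66 GFLOPS sustained double precision.
    print(dissipation_volume_liters(35.0))
    print(gflops_per_liter(66.0, 35.0, 0.5))
    # 1U server from the slides: 1000 watts in 14 liters.
    print(dissipation_volume_liters(1000.0))
```

With the e620 figures, the dissipation volume matches the 0.5 liter the board occupies in a system, so heat does not dilute its density under this model.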
Performance & Power Efficiency: 250-watt budget

                                          Average    32-bit peak     64-bit peak
                                          wattage    multiply-add    multiply-add
                                                     GFLOPS          GFLOPS
Intel Clovertown (3.6 GHz)                250        86              57
Nvidia Tesla                              170 (1)    345 (1)         not supported
Future Nvidia 64-bit                      unknown    unknown         1/8th of 32-bit performance (1)
10 FPGA PCI cards, Virtex LX160 based     250        430             4.2
Cell BE                                   210        230             15
Future Cell HPC                           220        200             104
7 ClearSpeed e620 Advance™ boards         231        564             564

Notes:
1) Table uses information given by vendors at International Supercomputing Conference, Dresden, June 2007
2) 25 to 50 watts is the current expansion slot power budget; 250 watts is proposed

New Design Approach Delivers 1 TFLOPS in 1U

• 1U standard server
• Intel 5365 3.0 GHz
  – 2-socket, quad core
  – 96 DP GFLOPS peak
  – Approx. 650 watts
  – Approx. 3.5 TFLOPS peak in a 25 kW rack
• 1U ClearSpeed Accelerated TeraScale Server (CATS)
  – 24 CSX600 96 core processors
  – ~1 DP TFLOPS peak
  – Approx. 500 watts
  – Approx. 19 TFLOPS peak in a 25 kW rack
  – 18 standard servers & 18 acceleration servers

Top 500 Supercomputer In A Single Cabinet

                40 servers with 2.66 GHz    Add 80 ClearSpeed
                x86 quad-core               Advance cards
LINPACK         2.8 TFLOPS                  7 TFLOPS
Power           24 kW                       26 kW
Floor space     10 sq. ft.                  10 sq. ft.
Weight          800 pounds                  850 pounds
Cost            ~$400,000                   < $1,000,000

ClearSpeed increases…
• Power draw by 8%
• Floor space by 0%
• Weight by 6%
• Speed by 150%

Double the Earth Simulator speed with only 1 MW

November 2007: Tokyo Tech added more ClearSpeed accelerators to TSUBAME.
Accelerators raise cluster performance from 38 TFLOPS to 56.4 TFLOPS with 648 ClearSpeed Advance cards
  – Performance increase of 48% for just a 2% increase in power consumption, 10% increase in cost
  – Hybrid approach: 10,368 AMD Opteron cores with just 648 ClearSpeed cards
  – Far smaller volume than the Earth Simulator
  – ClearSpeed accelerates AMBER, which is about 70% of submitted jobs

[Photo: Professor Matsuoka standing beside TSUBAME at Tokyo Tech]

Acceleration for finance and science applications

• Finance
  – Up to 20x speedup per accelerator for Monte Carlo based analytic option pricing
• Universities and National Laboratories
  – 3x to 9x speedup for AMBER molecular modeling
  – Test data from major pharmaceutical company
• Scalable performance
  – Low energy consumption supports multiple accelerators per system
• Maximize performance density and energy efficiency

Math functions exploit 160 GB/s Bandwidth

[Bar chart: 64-bit function operations per second (billions, axis 0.0 to 2.5) by function name (Sqrt, InvSqrt, Exp, Ln, Cos, Sin, SinCos, Inv) for a 2.6 GHz dual-core Opteron, a 3 GHz dual-core Woodcrest, and a ClearSpeed Advance card]

Typical speedup of ~8X over the fastest x86 processors, because math functions stay in the local memory on the card.

NAB and AMBER 10 acceleration

• Newton-Raphson refinement now possible; analytically-computed second derivatives
• 2.6x speedup obtained for this operation in three hours of effort (no source code changes)
• Enables accurate computation of entropy and Gibbs free energy for the first time.
• Available now in NAB (Nucleic Acid Builder) code. Slated for addition to AMBER 10.
Quantum chemistry acceleration results

• DGEMM content is 18% to 65% in quantum codes like Gaussian, GAMESS, NWChem, Molpro.
• Initial work with Molpro shows 9x speedup on CATS, versus a 3 GHz Intel Woodcrest server.
• More modern approaches (Qbox, Car-Parrinello) are 50% DGEMM with all dimensions large; host does non-DGEMM work, with net doubling of speed.

Summary

• ClearSpeed’s very high ratio of flops-per-watt means much more compact HPC systems are possible, which then helps the communication issues of large clusters.
• In China as in other countries, the future of HPC belongs to the technologies with the highest 64-bit power/size effectiveness.
• Now seeing value for real 64-bit applications in chemistry, financial modelling, and life sciences. Mechanical engineering applications may be next.
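The "net doubling of speed" quoted for codes that are 50% DGEMM is what a simple Amdahl's law model predicts when only the DGEMM portion is accelerated. The sketch below is a rough illustration under that assumption (accelerated and non-accelerated work do not overlap); the function names are hypothetical and the kernel speedup values are placeholders, not measured figures.

```python
import math


def amdahl_speedup(accelerated_fraction: float, kernel_speedup: float) -> float:
    """Overall speedup when a fraction of run time is sped up by kernel_speedup.

    Assumes the accelerated and non-accelerated parts run one after the other.
    """
    if not 0.0 <= accelerated_fraction <= 1.0:
        raise ValueError("accelerated_fraction must be between 0 and 1")
    if kernel_speedup <= 0.0:
        raise ValueError("kernel_speedup must be positive")
    remaining = 1.0 - accelerated_fraction
    return 1.0 / (remaining + accelerated_fraction / kernel_speedup)


def amdahl_limit(accelerated_fraction: float) -> float:
    """Upper bound on overall speedup as kernel_speedup grows without limit."""
    remaining = 1.0 - accelerated_fraction
    return math.inf if remaining == 0.0 else 1.0 / remaining


if __name__ == "__main__":
    # A code that spends 50% of its time in DGEMM, as on the slide.
    for s in (2.0, 10.0, 100.0):  # placeholder kernel speedups
        print(f"kernel {s:>5.0f}x -> overall {amdahl_speedup(0.5, s):.2f}x")
    print(f"limit for 50% DGEMM: {amdahl_limit(0.5):.1f}x")
```

In this model the overall gain approaches, but never exceeds, 2x for a 50% DGEMM code, however fast the accelerator runs the DGEMM portion.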
