Acceleration Technology for High Performance Computing in China
Real Performance. Real Science. Real Tools.

Acceleration Technology for High Performance Computing in China
John Gustafson, Ph.D.
CTO, High Performance Computing
ClearSpeed Technology

Copyright © 2007 ClearSpeed Technology plc. All rights reserved. www.clearspeed.com

Thesis
• China faces the challenge of enormous energy demand for its continued growth
• Computing in the US now consumes over 10% of the national grid's power (and growing)!
• China will soon follow this pattern
• For HPC applications, ClearSpeed has sophisticated technologies for reducing power use per operation tenfold
• ClearSpeed is partnering with one of the top 3 Chinese computer companies to create a new high-performance computer

ClearSpeed company background
• Fabless semiconductor company based in Bristol and San Jose
  – CSX600 coprocessors manufactured by IBM
  – Accelerator boards assembled and tested by Flextronics
• Core Products
  – World’s highest performance, lowest power consumption processors for Double Precision Floating Point (IEEE 754 compliant)
  – Accelerators for PCI expansion slots in servers and workstations
  – Work alongside 32-bit or 64-bit x86 industry-standard processors to accelerate compute-intensive functions
• Market Focus
  – Acceleration of High Performance Computing (HPC) applications
  – Universities and National Laboratories, Life Sciences & Financial Services
  – Embedded uses in consumer and military applications
• Competitive Position
  – Only supplier of custom-designed, HPC-focused acceleration products
  – Uniquely positioned to exploit growing HPC acceleration need
  – Substantial Intellectual Property base with over 100 patents granted/pending

Constraints For Processor Development
John Shalf and David Bailey (Lawrence Berkeley NL) 2007:
• New constraints
  – Power limits clock rates
  – Cannot squeeze more performance from ILP (complex cores with Instruction Level Parallelism) either
• But Moore’s Law continues!
  – What to do with all of those transistors if everything else is flat-lining?
  – Now, #cores per chip doubles every 18 months instead of clock frequency
• Power consumption is the chief concern for system architects
• Power efficiency is the primary concern of consumers of computer systems!!
Figure courtesy of Kunle Olukotun, Lance Hammond, Herb Sutter, and Burton Smith

Folding@home with 700,000 PlayStation 3s?
• Each PS3 averages 380 watts on this application.
• Total power use: 266 megawatts!
• Power cost: about $600,000 per day
• 2000 barrels of oil per day for a petaflops/s
• For that much electric power, our accelerators can get 500 petaflops/s!

ClearSpeed product overview
• CSX600 Processors
  – World’s highest performance, most energy efficient processor for double precision floating point applications
  – 96 processing cores
  – 40 DP GFLOPS peak, >33 GFLOPS DGEMM
  – 10 watts (typical)
• Advance™ PCI-X and PCIe Accelerators
  – Exploit standard expansion slots for servers, workstations and blade expansion units
  – >66 GFLOPS DGEMM per accelerator
  – 25 – 33 watts (typical)
• Software
  – Linux and Microsoft® drivers
  – ClearSpeed CSXL plug and play acceleration
    • Accelerates compute-intensive calls from Intel MKL and AMD ACML standard libraries
  – Software Development Kit
    • Familiar x86 development environment
    • C compiler with parallel extensions
    • Complete Visual Profiling and Debugging Tools
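The Folding@home power figures and the accelerator numbers above can be checked with a few lines of arithmetic. The sketch below takes the 266 MW total, the $600,000 daily cost, and the 66 GFLOPS at 33 watts per accelerator from the slides; the function names are illustrative, the electricity price is derived rather than quoted, and the throughput estimate assumes accelerator power only (host servers are ignored).

```python
# Figures quoted on the slides.
TOTAL_POWER_W = 266e6        # 266 megawatts for the PS3 fleet
DAILY_COST_USD = 600_000.0   # quoted power cost per day
BOARD_GFLOPS = 66.0          # sustained DGEMM per Advance accelerator
BOARD_WATTS = 33.0           # upper end of the 25-33 W typical range


def daily_energy_kwh(power_w: float) -> float:
    """Energy used in 24 hours at a constant power draw, in kWh."""
    return power_w / 1000.0 * 24.0


def implied_price_per_kwh(power_w: float, daily_cost_usd: float) -> float:
    """Electricity price implied by a daily cost (derived, not quoted)."""
    return daily_cost_usd / daily_energy_kwh(power_w)


def gflops_per_watt(gflops: float, watts: float) -> float:
    """Energy efficiency of one accelerator."""
    return gflops / watts


def sustained_petaflops(power_w: float, efficiency_gflops_per_w: float) -> float:
    """Throughput available from a power budget (assumption: accelerators only)."""
    return power_w * efficiency_gflops_per_w / 1e6  # GFLOPS -> PFLOPS


if __name__ == "__main__":
    eff = gflops_per_watt(BOARD_GFLOPS, BOARD_WATTS)
    print(f"Daily energy: {daily_energy_kwh(TOTAL_POWER_W):,.0f} kWh")
    print(f"Implied price: ${implied_price_per_kwh(TOTAL_POWER_W, DAILY_COST_USD):.3f}/kWh")
    print(f"Accelerator efficiency: {eff:.1f} GFLOPS/W")
    print(f"Throughput at 266 MW: {sustained_petaflops(TOTAL_POWER_W, eff):.0f} PFLOPS")
```

Under these assumptions the 266 MW budget corresponds to roughly 530 petaflops of sustained DGEMM, which is in line with the "500 petaflops/s" claim.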

The ClearSpeed Advance accelerator family
(Pictured: Advance X620 and Advance e620)
• The only accelerator family specifically designed for HPC
  – 80.64 GFLOPS peak, 66 GFLOPS sustained double precision
  – Industry-leading energy efficiency at > 2 GFLOPS per watt
  – Advance e620
    • PCIe x8, standard height: 98 mm (3.9 in), half length: 167 mm (6.5 in)
  – Advance X620
    • PCI-X, standard height: 98 mm (3.9 in), two-thirds length: 203 mm (8.0 in)
  – Plug & Play acceleration with standard math libraries including Level 3 BLAS and LAPACK
  – Fully programmable in Cn extended parallel language

Heat leads to bulk
• Air cooling hits limits at about 70 watts/liter
  – PCI standard of 25 watts, size is 0.3 liters ✔
  – A 1U server might use 1000 watts, volume is 14 liters ✔
  – A 42U standard rack might use 40 kilowatts, 3000 liters ✔
• Exceed 70 watts/liter, and temperatures rise above operational limits
(Pictured: latest e620 ClearSpeed accelerator; 4 inches by 6 inches, 0.5 liter in system, 35 watts, 9 ounces)

Dissipation volume can exceed actual volume
• To find the real volume occupied by a component in liters, divide its wattage by 70
• What may seem like a dense, powerful solution might actually dilute the GFLOPS per liter because of heat generation.
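The divide-by-70 rule above is easy to express in code. The following is a minimal sketch, not a thermal model: it assumes the 70 watts/liter air-cooling limit from the slides, and it treats a component's effective volume as the larger of its physical volume and its dissipation volume, which is one reading of the rule rather than something the slides state. All function names are illustrative.

```python
AIR_COOLING_LIMIT_W_PER_L = 70.0  # approximate air-cooling limit from the slides


def dissipation_volume_liters(watts: float) -> float:
    """Volume of air-cooled space needed to dissipate the given power."""
    return watts / AIR_COOLING_LIMIT_W_PER_L


def effective_volume_liters(physical_liters: float, watts: float) -> float:
    """Assumption: a part occupies whichever is larger, its case or its heat."""
    return max(physical_liters, dissipation_volume_liters(watts))


def gflops_per_liter(gflops: float, physical_liters: float, watts: float) -> float:
    """Compute density once heat dissipation is accounted for."""
    return gflops / effective_volume_liters(physical_liters, watts)


if __name__ == "__main__":
    # (name, physical volume in liters, power in watts), all from the slides
    examples = [
        ("PCI card", 0.3, 25.0),
        ("1U server", 14.0, 1000.0),
        ("42U rack", 3000.0, 40_000.0),
        ("e620 accelerator", 0.5, 35.0),
    ]
    for name, liters, watts in examples:
        print(f"{name}: physical {liters} L, "
              f"dissipation {dissipation_volume_liters(watts):.2f} L, "
              f"effective {effective_volume_liters(liters, watts):.2f} L")
    # 66 GFLOPS sustained on the e620 at 35 W in 0.5 L
    print(f"e620 density: {gflops_per_liter(66.0, 0.5, 35.0):.0f} GFLOPS/L")
```

For the e620 the two volumes coincide: 35 watts divided by 70 is exactly the 0.5 liter the card occupies in a system.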

Performance & Power Efficiency: 250-watt budget

                                        Average   32-bit peak    64-bit peak
                                        wattage   multiply-add   multiply-add
                                                  GFLOPS         GFLOPS
Intel Clovertown (3.6 GHz)              250       86             57
Nvidia Tesla                            170 (1)   345 (1)        not supported
Future Nvidia 64-bit                    unknown   unknown        1/8th of 32-bit performance (1)
10 FPGA PCI cards, Virtex LX160 based   250       430            4.2
Cell BE                                 210       230            15
Future Cell HPC                         220       200            104
7 ClearSpeed e620 Advance™ boards       231       564            564

Notes
1) Table uses information given by vendors at International Supercomputing Conference, Dresden, June 2007
2) 25 to 50 watts is the current expansion slot power budget; 250 watts is proposed

New Design Approach Delivers 1 TFLOPS in 1U
• 1U standard server
  – Intel 5365 3.0 GHz
  – 2-socket, quad core
  – 96 DP GFLOPS peak
  – Approx. 650 watts
  – Approx. 3.5 TFLOPS peak in a 25 kW rack
• 1U ClearSpeed Accelerated TeraScale Server (CATS)
  – 24 CSX600 96-core processors
  – ~1 DP TFLOPS peak
  – Approx. 500 watts
  – Approx. 19 TFLOPS peak in a 25 kW rack
  – 18 standard servers & 18 acceleration servers

Top 500 Supercomputer In A Single Cabinet
• 40 servers with 2.66 GHz x86 quad-core
  – 2.8 TFLOPS LINPACK
  – 24 kW
  – 10 sq. ft.
  – 800 pounds
  – ~$400,000
• Add 80 ClearSpeed Advance cards
  – 7 TFLOPS LINPACK
  – 26 kW
  – 10 sq. ft.
  – 850 pounds
  – < $1,000,000
ClearSpeed increases…
• Power draw by 8%
• Floor space by 0%
• Weight by 6%
• Speed by 150%

Double the Earth Simulator speed with only 1 MW
November 2007: Tokyo Tech added more ClearSpeed accelerators to TSUBAME.
Accelerators raise cluster performance from 38 TFLOPS to 56.4 TFLOPS with 648 ClearSpeed Advance cards
  – Performance increase of 48% for just a 2% increase in power consumption, 10% increase in cost
  – Hybrid approach: 10,368 AMD Opteron cores with just 648 ClearSpeed cards
  – Far smaller volume than the Earth Simulator
  – ClearSpeed accelerates AMBER, which is about 70% of submitted jobs
(Photo: Professor Matsuoka standing beside TSUBAME at Tokyo Tech)

Acceleration for finance and science applications
• Finance
  – Up to 20x speedup per accelerator for Monte Carlo based analytic option pricing
• Universities and National Laboratories
  – 3x to 9x speedup for AMBER molecular modeling
  – Test data from major pharmaceutical company
• Scalable performance
  – Low energy consumption supports multiple accelerators per system
• Maximize performance density and energy efficiency

Math functions exploit 160 GB/s Bandwidth
[Figure: bar chart of 64-bit function operations per second (billions, axis from 0.0 to 2.5) by function name (Sqrt, InvSqrt, Exp, Ln, Cos, Sin, SinCos, Inv) for a 2.6 GHz dual-core Opteron, a 3 GHz dual-core Woodcrest, and a ClearSpeed Advance card]
Typical speedup of ~8X over the fastest x86 processors, because math functions stay in the local memory on the card.

NAB and AMBER 10 acceleration
• Newton-Raphson refinement now possible; analytically computed second derivatives
• 2.6x speedup obtained for this operation in three hours of effort (no source code changes)
• Enables accurate computation of entropy and Gibbs free energy for the first time.
• Available now in NAB (Nucleic Acid Builder) code. Slated for addition to AMBER 10.
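To illustrate what Newton-Raphson refinement with analytic second derivatives means, here is a minimal one-dimensional sketch. It is not NAB's implementation: the Lennard-Jones pair potential in reduced units, the starting point, and every function name are assumptions chosen for illustration. In a real molecular code the second derivative becomes a full Hessian matrix, whose construction and factorization is where the dense linear algebra comes from.

```python
def lj_energy(r: float) -> float:
    """Lennard-Jones energy in reduced units (illustrative toy potential)."""
    inv6 = r ** -6
    return 4.0 * (inv6 * inv6 - inv6)


def lj_gradient(r: float) -> float:
    """Analytic first derivative dV/dr."""
    return 4.0 * (-12.0 * r ** -13 + 6.0 * r ** -7)


def lj_second_derivative(r: float) -> float:
    """Analytic second derivative d2V/dr2."""
    return 4.0 * (156.0 * r ** -14 - 42.0 * r ** -8)


def newton_raphson_minimize(r0: float, tol: float = 1e-10, max_iter: int = 50) -> float:
    """Refine r toward a minimum using the step r <- r - V'(r) / V''(r)."""
    r = r0
    for _ in range(max_iter):
        g = lj_gradient(r)
        if abs(g) < tol:
            return r
        h = lj_second_derivative(r)
        if h <= 0.0:
            # Outside the convex region the Newton step may head uphill.
            raise ValueError("non-positive curvature at r = %g" % r)
        r -= g / h
    raise RuntimeError("Newton-Raphson did not converge")


if __name__ == "__main__":
    r_min = newton_raphson_minimize(1.1)
    print(f"minimum at r = {r_min:.6f}, energy = {lj_energy(r_min):.6f}")
```

Starting from r = 1.1 the iteration settles on the analytic minimum at 2^(1/6) in a handful of steps, which is the quadratic convergence that makes second derivatives worth computing.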

Quantum chemistry acceleration results
• DGEMM content is 18% to 65% in quantum codes like Gaussian, GAMESS, NWChem, Molpro.
• Initial work with Molpro shows 9x speedup on CATS, versus a 3 GHz Intel Woodcrest server.
• More modern approaches (Qbox, Car-Parrinello) are 50% DGEMM with all dimensions large; host does non-DGEMM work, with net doubling of speed.

Summary
• ClearSpeed’s very high ratio of flops-per-watt means much more compact HPC systems are possible, which in turn eases the communication issues of large clusters.
• In China as in other countries, the future of HPC belongs to the technologies with the highest 64-bit power/size effectiveness.
• Now seeing value for real 64-bit applications in chemistry, financial modelling, and life sciences. Mechanical engineering applications may be next.
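The link between DGEMM content and whole-application speedup on the quantum chemistry slide can be sketched with Amdahl's law. This is a simplified model brought in for illustration, not something the slides cite: it assumes the DGEMM and non-DGEMM parts run one after the other on the same host, so it does not capture host and accelerator working concurrently or a change of host hardware. Function names are illustrative.

```python
def amdahl_speedup(fraction: float, kernel_speedup: float) -> float:
    """Overall speedup when `fraction` of the runtime is sped up by `kernel_speedup`."""
    return 1.0 / ((1.0 - fraction) + fraction / kernel_speedup)


def amdahl_limit(fraction: float) -> float:
    """Upper bound on overall speedup as the accelerated part's time goes to zero."""
    return 1.0 / (1.0 - fraction)


if __name__ == "__main__":
    # DGEMM fractions quoted on the slide: 18% and 65% for the range, 50% for Qbox-style codes
    for fraction in (0.18, 0.50, 0.65):
        print(f"DGEMM fraction {fraction:.0%}: "
              f"10x kernel -> {amdahl_speedup(fraction, 10.0):.2f}x overall, "
              f"limit {amdahl_limit(fraction):.2f}x")
```

With half the runtime in DGEMM the bound is exactly 2x, which matches the "net doubling of speed" quoted for the 50% DGEMM codes.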