Real Performance. Real Science. Real Tools.
Acceleration Technology for High Performance Computing in China
John Gustafson, Ph.D., CTO, High Performance Computing, ClearSpeed Technology
Copyright © 2007 ClearSpeed Technology plc. All rights reserved. www.clearspeed.com

Thesis
• China faces the challenge of enormous energy demand for its continued growth
• Computing in the US now consumes over 10% of the national power grid (and growing)!
• China will soon follow this pattern
• For HPC applications, ClearSpeed has sophisticated technologies for reducing power use per operation by tenfold
• ClearSpeed is partnering with one of the top 3 Chinese computer companies to create a new high-performance computer
ClearSpeed company background
• Fabless semiconductor company based in Bristol and San Jose
  – CSX600 coprocessors manufactured by IBM
  – Accelerator boards assembled and tested by Flextronics
• Core Products
  – World’s highest performance, lowest power consumption processors for Double Precision Floating Point (IEEE 754 compliant)
  – Accelerators for PCI expansion slots in servers and workstations
  – Work alongside 32-bit or 64-bit x86 industry-standard processors to accelerate compute-intensive functions
• Market Focus
  – Acceleration of High Performance Computing (HPC) applications
  – Universities and National Laboratories, Life Sciences & Financial Services
  – Embedded applications in consumer and military applications
• Competitive Position
  – Only supplier of custom-designed, HPC-focused acceleration products
  – Uniquely positioned to exploit growing HPC acceleration need
  – Substantial Intellectual Property base with over 100 patents granted/pending

Constraints For Processor Development
John Shalf and David Bailey (Lawrence Berkeley NL) 2007:
• New constraints
  – Power limits clock rates
  – Cannot squeeze more performance from ILP (complex cores with Instruction Level Parallelism) either
• But Moore’s Law continues!
  – What to do with all of those transistors if everything else is flat-lining?
  – Now, #cores per chip doubles every 18 months instead of clock frequency
• Power consumption is chief concern for system architects
• Power efficiency is the primary concern of consumers of computer systems!!

Figure courtesy of Kunle Olukotun, Lance Hammond, Herb Sutter, and Burton Smith

Folding@home with 700,000 PlayStation 3s?
• Each PS3 averages 220 watts on this application.
• Total power use: 266 megawatts!
• Power cost: about $600,000 per day
• 2000 barrels of oil per day for a petaflops/s
• For that much electric power, our accelerators can get 500 petaflops/s!

ClearSpeed product overview
• CSX600 Processors
  – World’s highest performance, most energy efficient processor for double precision floating point applications
  – 96 processing cores
  – 40 DP GFLOPS peak, >33 GFLOPS DGEMM
  – 10 watts (typical)
• Advance™ PCI-X and PCIe Accelerators
  – Exploit standard expansion slots for servers, workstations and blade expansion units
  – >66 GFLOPS DGEMM per accelerator
  – 25 – 33 watts (typical)
• Software
  – Linux and Microsoft® drivers
  – ClearSpeed CSXL plug and play acceleration
    • Accelerates compute intensive calls from Intel MKL and AMD ACML standard libraries
  – Software Development Kit
    • Familiar x86 development environment
    • C compiler with parallel extensions
    • Complete Visual Profiling and Debugging Tools
The ClearSpeed Advance accelerator family
Pictured: Advance X620 and Advance e620

• The only accelerator family specifically designed for HPC
  – 80.64 GFLOPS peak, 66 GFLOPS sustained double precision
  – Industry leading energy efficiency at > 2 GFLOPS per watt
  – Advance e620
    • PCIe x8, standard height: 98 mm (3.9 in), half length: 167 mm (6.5 in)
  – Advance X620
    • PCI-X, standard height: 98 mm (3.9 in), two-thirds length: 203 mm (8.0 in)
  – Plug & Play acceleration with standard math libraries including Level 3 BLAS and LAPACK
  – Fully programmable in Cn extended parallel language
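
The efficiency figures quoted for the Advance boards can be cross-checked with a few lines of arithmetic. The sketch below assumes the per-board numbers given in this deck (80.64 GFLOPS peak, 66 GFLOPS sustained, 25 to 33 watts typical); the function names are illustrative only and are not part of any ClearSpeed software.

```python
# Figures taken from the slides; function names are hypothetical.
PEAK_GFLOPS = 80.64           # peak double precision per board
SUSTAINED_GFLOPS = 66.0       # sustained DGEMM per board
TYPICAL_WATTS = (25.0, 33.0)  # typical board power range


def fraction_of_peak(sustained_gflops: float, peak_gflops: float) -> float:
    """Share of peak performance delivered on the sustained workload."""
    return sustained_gflops / peak_gflops


def gflops_per_watt(gflops: float, watts: float) -> float:
    """Energy efficiency in GFLOPS per watt."""
    return gflops / watts


if __name__ == "__main__":
    share = fraction_of_peak(SUSTAINED_GFLOPS, PEAK_GFLOPS)
    print(f"Sustained DGEMM is {share:.1%} of peak")
    for watts in TYPICAL_WATTS:
        eff = gflops_per_watt(SUSTAINED_GFLOPS, watts)
        print(f"At {watts:.0f} W: {eff:.2f} GFLOPS per watt")
```

Even at the upper end of the typical power range, the sustained figure works out to 2 GFLOPS per watt, which is consistent with the efficiency claim above.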
Heat leads to bulk
• Air cooling hits limits at about 70 watts/liter
  – PCI standard of 25 watts, size is 0.3 liters ✔
  – A 1U server might use 1000 watts, volume is 14 liters ✔
  – A 42U standard rack might use 40 kilowatts, 3000 liters ✔
• Exceed 70 watts/liter, and temperatures rise above operational limits
Latest e620 ClearSpeed accelerator: 4 inches by 6 inches, 0.5 liter in system, 35 watts, 9 ounces

Dissipation volume can exceed actual volume
• To find the real volume occupied by a component in liters, divide its wattage by 70
• What may seem like a dense, powerful solution might actually dilute the GFLOPS per liter because of heat generation.
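
The divide-by-70 rule can be written down as a short sketch. It assumes the air-cooling limit of about 70 watts per liter stated earlier and treats the effective volume as the larger of the physical volume and the dissipation volume; the function names and the 150-watt example component are hypothetical.

```python
AIR_COOLING_LIMIT_W_PER_L = 70.0  # approximate limit quoted in the slides


def dissipation_volume_liters(watts: float) -> float:
    """Volume needed to air-cool a component drawing `watts`."""
    return watts / AIR_COOLING_LIMIT_W_PER_L


def effective_volume_liters(actual_liters: float, watts: float) -> float:
    """A component occupies at least its physical volume, and more if heat demands it."""
    return max(actual_liters, dissipation_volume_liters(watts))


def gflops_per_liter(gflops: float, actual_liters: float, watts: float) -> float:
    """Performance density once the cooling volume is taken into account."""
    return gflops / effective_volume_liters(actual_liters, watts)


if __name__ == "__main__":
    # e620 board from the slides: 35 W in 0.5 liter, 66 GFLOPS sustained.
    print(effective_volume_liters(0.5, 35.0))   # 0.5 liter, right at the limit
    print(gflops_per_liter(66.0, 0.5, 35.0))    # 132 GFLOPS per liter
    # Hypothetical dense part: 150 W packed into 1 liter needs over 2 liters to cool.
    print(round(effective_volume_liters(1.0, 150.0), 2))
```

By this measure the e620 sits exactly at the cooling limit, while the hypothetical 150-watt part takes up more than twice its physical volume.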
Performance & Power Efficiency: 250-watt budget
System                                  Average   32-bit peak    64-bit peak
                                        wattage   multiply-add   multiply-add
                                                  GFLOPS         GFLOPS
Intel Clovertown (3.6 GHz)              250       86             57
Nvidia Tesla                            170 (1)   345 (1)        not supported
Future Nvidia 64-bit                    unknown   unknown        1/8th of 32-bit performance (1)
10 FPGA PCI cards (Virtex LX160 based)  250       430            4.2
Cell BE                                 210       230            15
Future Cell HPC                         220       200            104
7 ClearSpeed e620 Advance™ Boards       231       564            564

Notes:
1) Table uses information given by vendors at International Supercomputing Conference, Dresden, June 2007
2) 25 to 50 Watts is current expansion slot power budget, 250 Watts proposed

New Design Approach Delivers 1 TFLOP in 1U
• 1U standard server
• Intel 5365 3.0 GHz
  – 2-socket, quad core
  – 96 DP GFLOPS peak
  – Approx. 650 watts
  – Approx. 3.5 TFLOPS peak in a 25 kW rack
• 1U ClearSpeed Accelerated TeraScale Server (CATS)
  – 24 CSX600 96 core processors
  – ~1 DP TFLOPS peak
  – Approx. 500 watts
  – Approx. 19 TFLOPS peak in a 25 kW rack
  – 18 standard servers & 18 acceleration servers
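
The rack-level figures can be reproduced approximately from the per-server numbers. The sketch below is illustrative only: the 36-server count for the all-standard rack is an assumption inferred from the 18 + 18 hybrid configuration, not a number stated on the slide, and the CATS peak is taken as 24 processors at 40 GFLOPS each. The class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Server:
    name: str
    peak_gflops: float
    watts: float


# Per-server figures from the slides.
STANDARD = Server("1U standard server", peak_gflops=96.0, watts=650.0)
CATS = Server("1U CATS", peak_gflops=24 * 40.0, watts=500.0)  # 24 CSX600 chips


def rack_totals(mix):
    """Return (peak TFLOPS, kW) for a list of (Server, count) pairs."""
    tflops = sum(server.peak_gflops * count for server, count in mix) / 1000.0
    kilowatts = sum(server.watts * count for server, count in mix) / 1000.0
    return tflops, kilowatts


if __name__ == "__main__":
    # Assumption: 36 standard servers fill the rack (inferred, not from the slide).
    print(rack_totals([(STANDARD, 36)]))
    # Hybrid rack from the slide: 18 standard servers and 18 acceleration servers.
    print(rack_totals([(STANDARD, 18), (CATS, 18)]))
```

Under these assumptions the all-standard rack lands near 3.5 TFLOPS and the hybrid rack near 19 TFLOPS, with both staying under the 25 kW budget.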
Top 500 Supercomputer In A Single Cabinet
• 40 servers with 2.66 GHz x86 quad-core
• 2.8 TFLOPS LINPACK
• 24 kW
• 10 sq. ft.
• 800 pounds
• ~$400,000

• Add 80 ClearSpeed Advance cards
• 7 TFLOPS LINPACK
• 26 kW
• 10 sq. ft.
• 850 pounds
• < $1,000,000
ClearSpeed increases…
• Power draw by 8%
• Floor space by 0%
• Weight by 6%
• Speed by 150%
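
These percentages follow from the before-and-after cabinet figures. The sketch below assumes the baseline cabinet draws 24 kW and the accelerated cabinet 26 kW, which is the reading consistent with the 8% power increase; the dictionary keys and function name are illustrative.

```python
def percent_increase(before: float, after: float) -> float:
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


# Cabinet figures from the slide (baseline vs. with 80 Advance cards).
BASELINE = {"linpack_tflops": 2.8, "power_kw": 24.0, "floor_sqft": 10.0, "weight_lb": 800.0}
ACCELERATED = {"linpack_tflops": 7.0, "power_kw": 26.0, "floor_sqft": 10.0, "weight_lb": 850.0}

if __name__ == "__main__":
    for key in BASELINE:
        change = percent_increase(BASELINE[key], ACCELERATED[key])
        print(f"{key}: {change:+.0f}%")
```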
Double the Earth Simulator speed with only 1 MW
November 2007: Tokyo Tech added more ClearSpeed accelerators to TSUBAME.
Accelerators raise cluster performance from 38 TFLOPS to 56.4 TFLOPS with 648 ClearSpeed Advance cards
  – Performance increase of 48% for just a 2% increase in power consumption, 10% increase in cost
  – Hybrid approach: 10,368 AMD Opteron cores with just 648 ClearSpeed cards
  – Far smaller volume than the Earth Simulator
  – ClearSpeed accelerates AMBER, which is about 70% of submitted jobs
Professor Matsuoka standing beside TSUBAME at Tokyo Tech
Acceleration for finance and science applications
• Finance
  – Up to 20x speedup per accelerator for Monte Carlo based analytic option pricing
• Universities and National Laboratories
  – 3x to 9x speedup for AMBER molecular modeling
  – Test data from major pharmaceutical company
• Scalable performance
  – Low energy consumption supports multiple accelerators per system
• Maximize performance density and energy efficiency
Math functions exploit 160 GB/s Bandwidth
[Figure: bar chart of 64-bit function operations per second (billions) for Sqrt, InvSqrt, Exp, Ln, Cos, Sin, SinCos, and Inv, comparing a 2.6 GHz dual-core Opteron, a 3 GHz dual-core Woodcrest, and a ClearSpeed Advance card]

Typical speedup of ~8X over the fastest x86 processors, because math functions stay in the local memory on the card.
NAB and AMBER 10 acceleration
• Newton-Raphson refinement now possible; analytically computed second derivatives
• 2.6x speedup obtained for this operation in three hours of effort (no source code changes)
• Enables accurate computation of entropy and Gibbs free energy for the first time.
• Available now in NAB (Nucleic Acid Builder) code. Slated for addition to AMBER 10.
Quantum chemistry acceleration results
• DGEMM content is 18% to 65% in quantum codes like Gaussian, GAMESS, NWChem, Molpro.
• Initial work with Molpro shows 9x speedup on CATS, versus a 3 GHz Intel Woodcrest server.
• More modern approaches (Qbox, Car-Parrinello) are 50% DGEMM with all dimensions large; host does non-DGEMM work, with net doubling of speed.
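
One way to see how DGEMM content limits whole-application speedup is Amdahl's law. This is a standard textbook model brought in here for illustration, not a method stated in the slides; it assumes the accelerated and non-accelerated parts run one after the other, and the function name and parameters are hypothetical.

```python
import math


def amdahl_speedup(accelerated_fraction: float, kernel_speedup: float = math.inf) -> float:
    """Overall speedup when `accelerated_fraction` of runtime is sped up by `kernel_speedup`.

    Assumes the accelerated and remaining parts execute serially (Amdahl's law).
    """
    if not 0.0 <= accelerated_fraction <= 1.0:
        raise ValueError("accelerated_fraction must be between 0 and 1")
    if kernel_speedup <= 0.0:
        raise ValueError("kernel_speedup must be positive")
    remaining = (1.0 - accelerated_fraction) + accelerated_fraction / kernel_speedup
    return math.inf if remaining == 0.0 else 1.0 / remaining


if __name__ == "__main__":
    # Upper bounds if DGEMM time vanished entirely, for the fractions on the slide.
    for fraction in (0.18, 0.50, 0.65):
        print(f"{fraction:.0%} DGEMM: at most {amdahl_speedup(fraction):.2f}x")
    # A finite kernel speedup (8x here is an arbitrary illustrative value).
    print(f"50% DGEMM, 8x kernel: {amdahl_speedup(0.50, 8.0):.2f}x")
```

For a code that is 50% DGEMM, the bound under this model is exactly 2x, which matches the net doubling reported for the Qbox and Car-Parrinello style approaches.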
Summary
• ClearSpeed’s very high ratio of flops-per-watt means much more compact HPC systems are possible, which then helps the communication issues of large clusters.
• In China as in other countries, the future of HPC belongs to the technologies with the highest 64-bit power/size effectiveness.
• Now seeing value for real 64-bit applications in chemistry, financial modelling, and life sciences. Mechanical engineering applications may be next.