Real Performance. Real Science. Real Tools.

Acceleration Technology for High Performance Computing in China

John Gustafson, Ph.D.
CTO, High Performance Computing
ClearSpeed Technology

Copyright © 2007 ClearSpeed Technology plc. All rights reserved. www.clearspeed.com

Thesis

• China faces the challenge of enormous energy demand for its continued growth
• Computing in the US now consumes over 10% of the power on the national grid (and growing)!
• China will soon follow this pattern
• For HPC applications, ClearSpeed has sophisticated technologies for reducing power use per operation tenfold
• ClearSpeed is partnering with one of the top three Chinese computer companies to create a new high-performance computer

ClearSpeed company background

• Fabless semiconductor company based in Bristol and San Jose
  – CSX600 manufactured by IBM
  – Accelerator boards assembled and tested by Flextronics

• Core Products
  – World’s highest performance, lowest power consumption processors for Double Precision Floating Point (IEEE 754 compliant)
  – Accelerators for PCI expansion slots in servers and workstations
  – Work alongside 32-bit or 64-bit x86 industry-standard processors to accelerate compute-intensive functions

• Market Focus
  – Acceleration of High Performance Computing (HPC) applications
  – Universities and National Laboratories, Life Sciences & Financial Services
  – Embedded use in consumer and military applications

• Competitive Position
  – Only supplier of custom-designed, HPC-focused acceleration products
  – Uniquely positioned to exploit the growing need for HPC acceleration
  – Substantial intellectual property base with over 100 patents granted or pending

Constraints For Processor Development

John Shalf and David Bailey (Lawrence Berkeley NL) 2007:

• New constraints
  – Power limits clock rates
  – Cannot squeeze more performance from ILP (complex cores with Instruction Level Parallelism) either
• But Moore’s Law continues!
  – What to do with all of those transistors if everything else is flat-lining?
  – Now, the number of cores per chip doubles every 18 months instead of clock frequency
• Power consumption is the chief concern for system architects
• Power efficiency is the primary concern of consumers of computer systems!

Figure courtesy of Kunle Olukotun, Lance Hammond, Herb Sutter, and Burton Smith

Folding@home with 700,000 PlayStation 3s?

• Each PS3 averages 220 watts on this application.
• Total power use: 266 megawatts!
• Power cost: about $600,000 per day
• 2000 barrels of oil per day for one petaflops
• For that much electric power, our accelerators can get 500 petaflops!

ClearSpeed product overview

• CSX600 Processors
  – World’s highest performance, most energy efficient processor for double precision floating point applications
  – 96 processing cores
  – 40 DP GFLOPS peak, >33 GFLOPS DGEMM
  – 10 watts (typical)

• Advance™ PCI-X and PCIe Accelerators
  – Exploit standard expansion slots for servers, workstations and blade expansion units
  – >66 GFLOPS DGEMM per accelerator
  – 25 to 33 watts (typical)

• Software
  – Linux and Microsoft® drivers
  – ClearSpeed CSXL plug and play acceleration
    • Accelerates compute-intensive calls from Intel MKL and AMD ACML standard libraries
  – Software Development Kit
    • Familiar x86 development environment
    • C compiler with parallel extensions
    • Complete Visual Profiling and Debugging Tools

The ClearSpeed Advance accelerator family

[Figure: Advance X620 and Advance e620 boards]

• The only accelerator family specifically designed for HPC
  – 80.64 GFLOPS peak, 66 GFLOPS sustained double precision
  – Industry-leading energy efficiency at > 2 GFLOPS per watt
  – Advance e620
    • PCIe x8, standard height: 98 mm (3.9 in), half length: 167 mm (6.5 in)
  – Advance X620
    • PCI-X, standard height: 98 mm (3.9 in), two-thirds length: 203 mm (8.0 in)
  – Plug & Play acceleration with standard math libraries including Level 3 BLAS and LAPACK
  – Fully programmable in Cn extended parallel language
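The GFLOPS-per-watt claim can be checked by simple division. The sketch below is illustrative only: the function name is hypothetical, and the inputs are the peak and sustained figures quoted for one Advance board together with the typical 25 to 33 watt range quoted in the product overview.

```python
def gflops_per_watt(gflops: float, watts: float) -> float:
    """Return energy efficiency as GFLOPS delivered per watt consumed."""
    if watts <= 0:
        raise ValueError("watts must be positive")
    return gflops / watts


# Figures quoted in this deck for one Advance board.
SUSTAINED_GFLOPS = 66.0        # sustained double precision (DGEMM)
PEAK_GFLOPS = 80.64            # peak double precision
TYPICAL_WATTS = (25.0, 33.0)   # typical board power range

for watts in TYPICAL_WATTS:
    sustained = gflops_per_watt(SUSTAINED_GFLOPS, watts)
    peak = gflops_per_watt(PEAK_GFLOPS, watts)
    print(f"{watts:.0f} W: {sustained:.2f} sustained, {peak:.2f} peak GFLOPS per watt")
```

The sustained figure is the more conservative basis for comparison, since it reflects a measured DGEMM rate rather than a theoretical maximum.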

Heat leads to bulk

• Air cooling hits limits at about 70 watts/liter
  – PCI standard of 25 watts, size is 0.3 liters ✔
  – A 1U server might use 1000 watts, volume is 14 liters ✔
  – A 42U standard rack might use 40 kilowatts, 3000 liters ✔
• Exceed 70 watts/liter, and temperatures rise above operational limits

[Figure: Latest e620 ClearSpeed accelerator. 4 inches by 6 inches, 0.5 liter in system, 35 watts, 9 ounces]

Dissipation volume can exceed actual volume

• To find the real volume occupied by a component in liters, divide its wattage by 70
• What may seem like a dense, powerful solution might actually dilute the GFLOPS per liter because of heat generation.
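The divide-by-70 rule is easy to express as code. The following is a minimal sketch, not a thermal model: the function names are hypothetical, and taking the effective volume as the larger of the physical and dissipation volumes is an assumption about how the rule is meant to be applied.

```python
AIR_COOLING_LIMIT_W_PER_LITER = 70.0  # approximate air-cooling limit quoted above


def dissipation_volume_liters(watts: float) -> float:
    """Volume needed to air-cool a component drawing `watts`."""
    return watts / AIR_COOLING_LIMIT_W_PER_LITER


def effective_volume_liters(watts: float, physical_liters: float) -> float:
    """Assumed rule: a component occupies whichever is larger,
    its physical volume or its dissipation volume."""
    return max(physical_liters, dissipation_volume_liters(watts))


def gflops_per_liter(gflops: float, watts: float, physical_liters: float) -> float:
    """Compute density after accounting for the space that heat removal needs."""
    return gflops / effective_volume_liters(watts, physical_liters)


# The e620 figures quoted in this deck: 35 watts, 0.5 liter in system.
print(dissipation_volume_liters(35.0))          # volume implied by heat alone
print(effective_volume_liters(40_000, 3_000))   # 42U rack at 40 kilowatts
print(gflops_per_liter(66.0, 35.0, 0.5))        # sustained GFLOPS per liter
```

A component whose dissipation volume exceeds its physical volume forces empty space around it, which is the dilution of GFLOPS per liter that the slide describes.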

Performance & Power Efficiency: 250-watt budget

Platform                                 Average    32-bit peak     64-bit peak
                                         wattage    multiply-add    multiply-add
                                                    GFLOPS          GFLOPS
Intel Clovertown (3.6 GHz)               250        86              57
Nvidia Tesla                             170 (1)    345 (1)         not supported
Future 64-bit                            unknown    unknown         1/8th of 32-bit performance (1)
10 FPGA PCI cards, Virtex LX160 based    250        430             4.2
Cell BE                                  210        230             15
Future Cell HPC                          220        200             104
7 ClearSpeed e620 Advance™ boards        231        564             564

Notes:
1) Table uses information given by vendors at the International Supercomputing Conference, Dresden, June 2007
2) 25 to 50 watts is the current expansion slot power budget; 250 watts is proposed

New Design Approach Delivers 1 TFLOPS in 1U

• 1U standard server
• Intel 5365 3.0 GHz
  – 2-socket, quad core
  – 96 DP GFLOPS peak
  – Approx. 650 watts
  – Approx. 3.5 TFLOPS peak in a 25 kW rack

• 1U ClearSpeed Accelerated TeraScale Server (CATS)
  – 24 CSX600 96-core processors
  – ~1 DP TFLOPS peak
  – Approx. 500 watts
  – Approx. 19 TFLOPS peak in a 25 kW rack
  – 18 standard servers & 18 acceleration servers
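The rack-level figures follow from the per-server numbers by straightforward arithmetic. The sketch below reproduces that reasoning under stated assumptions: the function names are hypothetical, the CATS peak is taken as 24 processors at the 40 GFLOPS peak quoted for the CSX600, and power for networking and cooling is ignored. The results come out close to, not exactly at, the approximate figures on the slide.

```python
RACK_BUDGET_W = 25_000      # rack power budget from the slide

SERVER_W = 650              # 1U standard server, approximate
SERVER_GFLOPS = 96          # DP peak per standard server
CATS_W = 500                # 1U CATS unit, approximate
CATS_GFLOPS = 24 * 40       # assumption: 24 CSX600 chips at 40 DP GFLOPS peak each


def servers_only(budget_w: int, server_w: int, server_gflops: int) -> tuple[int, int]:
    """Fill the power budget with standard servers; return (count, peak GFLOPS)."""
    count = budget_w // server_w
    return count, count * server_gflops


def mixed_rack(pairs: int, server_w: int, server_gflops: int,
               accel_w: int, accel_gflops: int) -> tuple[int, int]:
    """Pair each standard server with one accelerator unit;
    return (total watts, total peak GFLOPS)."""
    power = pairs * (server_w + accel_w)
    peak = pairs * (server_gflops + accel_gflops)
    return power, peak


count, peak = servers_only(RACK_BUDGET_W, SERVER_W, SERVER_GFLOPS)
print(f"Servers only: {count} servers, {peak / 1000:.2f} TFLOPS peak")

power, peak = mixed_rack(18, SERVER_W, SERVER_GFLOPS, CATS_W, CATS_GFLOPS)
print(f"18 + 18 mix:  {power / 1000:.1f} kW, {peak / 1000:.2f} TFLOPS peak")
```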

Top 500 In A Single Cabinet

               40 servers with 2.66 GHz    Add 80 ClearSpeed
               x86 quad-core               Advance cards
LINPACK        2.8 TFLOPS                  7 TFLOPS
Power          24 kW                       26 kW
Floor space    10 sq. ft.                  10 sq. ft.
Weight         800 pounds                  850 pounds
Cost           ~$400,000                   < $1,000,000

ClearSpeed increases…
• Power draw by 8%
• Floor space by 0%
• Weight by 6%
• Speed by 150%
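These percentages can be recomputed from the cabinet figures. The sketch below uses a hypothetical helper and assumes that 24 kW is the unaccelerated power and 26 kW the accelerated power, which is the pairing consistent with the stated 8% increase.

```python
def percent_increase(before: float, after: float) -> float:
    """Relative change from `before` to `after`, in percent."""
    if before == 0:
        raise ValueError("baseline must be nonzero")
    return (after - before) / before * 100.0


# (baseline, with 80 Advance cards), taken from the cabinet comparison.
cabinet = {
    "Power draw (kW)": (24.0, 26.0),      # assumed pairing, see lead-in
    "Floor space (sq. ft.)": (10.0, 10.0),
    "Weight (pounds)": (800.0, 850.0),
    "LINPACK (TFLOPS)": (2.8, 7.0),
}

for name, (before, after) in cabinet.items():
    print(f"{name}: {percent_increase(before, after):.1f}% increase")
```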

Double the Earth Simulator speed with only 1 MW

November 2007: Tokyo Tech added more ClearSpeed accelerators to TSUBAME. Accelerators raise cluster performance from 38 TFLOPS to 56.4 TFLOPS with 648 ClearSpeed Advance cards
  – Performance increase of 48% for just a 2% increase in power consumption and a 10% increase in cost
  – Hybrid approach: 10,368 AMD Opteron cores with just 648 ClearSpeed cards
  – Far smaller volume than the Earth Simulator
  – ClearSpeed accelerates AMBER, which is about 70% of submitted jobs

Professor Matsuoka standing beside TSUBAME at Tokyo Tech

Acceleration for finance and science applications

• Finance
  – Up to 20x speedup per accelerator for Monte Carlo based analytic option pricing
• Universities and National Laboratories
  – 3x to 9x speedup for AMBER molecular modeling
  – Test data from major pharmaceutical company
• Scalable performance
  – Low energy consumption supports multiple accelerators per system
    • Maximize performance density and energy efficiency

Math functions exploit 160 GB/s Bandwidth

[Bar chart: 64-bit function operations per second (billions), by function name (Sqrt, InvSqrt, Exp, Ln, Cos, Sin, SinCos, Inv), for a 2.6 GHz dual-core Opteron, a 3 GHz dual-core Woodcrest, and a ClearSpeed Advance card]

Typical speedup of ~8X over the fastest x86 processors, because math functions stay in the local memory on the card.

NAB and AMBER 10 acceleration

• Newton-Raphson refinement now possible; analytically computed second derivatives
• 2.6x speedup obtained for this operation in three hours of effort (no source code changes)
• Enables accurate computation of entropy and Gibbs free energy for the first time.
• Available now in NAB (Nucleic Acid Builder) code. Slated for addition to AMBER 10.

Quantum chemistry acceleration results

• DGEMM content is 18% to 65% in quantum codes like Gaussian, GAMESS, NWChem, Molpro.
• Initial work with Molpro shows 9x speedup on CATS, versus a 3 GHz Intel Woodcrest server.
• More modern approaches (Qbox, Car-Parrinello) are 50% DGEMM with all dimensions large; host does non-DGEMM work, with net doubling of speed.
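The link between DGEMM content and overall speedup can be illustrated with Amdahl's law, which the slide does not name but which is the standard way to relate an accelerated fraction to a net gain. In this sketch the function names are hypothetical and the accelerator factor of 10 is an arbitrary illustrative value, not a measured one.

```python
def amdahl_speedup(fraction: float, factor: float) -> float:
    """Overall speedup when `fraction` of the runtime is sped up by `factor`."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    if factor <= 0.0:
        raise ValueError("factor must be positive")
    return 1.0 / ((1.0 - fraction) + fraction / factor)


def max_speedup(fraction: float) -> float:
    """Upper bound on speedup as the accelerated part's time goes to zero."""
    if fraction >= 1.0:
        return float("inf")
    return 1.0 / (1.0 - fraction)


# DGEMM fractions mentioned above: 18% to 65% for older codes, 50% for newer ones.
for dgemm_fraction in (0.18, 0.50, 0.65):
    print(f"DGEMM {dgemm_fraction:.0%}: "
          f"{amdahl_speedup(dgemm_fraction, 10.0):.2f}x at a hypothetical 10x DGEMM speedup, "
          f"{max_speedup(dgemm_fraction):.2f}x upper bound")
```

With half the runtime in DGEMM, the upper bound is a factor of two, which matches the net doubling described for the more modern codes.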

Summary

• ClearSpeed’s very high ratio of performance-per-watt means much more compact HPC systems are possible, which in turn eases the communication issues of large clusters.
• In China as in other countries, the future of HPC belongs to the technologies with the highest 64-bit power/size effectiveness.
• Now seeing value for real 64-bit applications in chemistry, financial modelling, and life sciences. Mechanical engineering applications may be next.
