High-Performance Computing at CSCS: Co-Design and New Separations of Concerns for HPC
Thomas C. Schulthess

Computer performance and application performance increase ~10^3 every decade

1988: Cray YMP, 8 processors, 1 Gigaflop/s. First sustained GFlop/s, Gordon Bell Prize 1988
1998: Cray T3E, 1’500 processors, 1.02 Teraflop/s. First sustained TFlop/s, Gordon Bell Prize 1998
2008: Cray XT5, 150’000 processors, 1.35 Petaflop/s. First sustained PFlop/s, Gordon Bell Prize 2008
2018: 100 million or billion processing cores (!), ~1 Exaflop/s. Another 1,000x increase in sustained performance

Power labels on the chart: ~100 Kilowatts, ~5 Megawatts, 20-30 MW
Chart annotation: "This system was built with commodity processors"
Wednesday, March 13, 2013, HPC Advisory Council Switzerland Conference, Lugano

HOW WELL CAN APPLICATIONS DEAL WITH CONCURRENCY?
HOW EFFICIENT ARE THEY?
Applications running at scale on Jaguar @ ORNL (Spring 2011)

Domain area   Code name   Institution      # of cores   Performance   Notes
Materials     DCA++       ORNL             213,120      1.9 PF        2008 Gordon Bell Prize Winner
Materials     WL-LSMS     ORNL/ETH         223,232      1.8 PF        2009 Gordon Bell Prize Winner
Chemistry     NWChem      PNNL/ORNL        224,196      1.4 PF        2008 Gordon Bell Prize Finalist
Materials     DRC         ETH/UTK          186,624      1.3 PF        2010 Gordon Bell Prize Hon. Mention
Nanoscience   OMEN        Duke             222,720      > 1 PF        2010 Gordon Bell Prize Finalist
Biomedical    MoBo        GaTech           196,608      780 TF        2010 Gordon Bell Prize Winner
Chemistry     MADNESS     UT/ORNL          140,000      550 TF
Materials     LS3DF       LBL              147,456      442 TF        2008 Gordon Bell Prize Winner
Seismology    SPECFEM3D   USA (multiple)   149,784      165 TF        2008 Gordon Bell Prize Finalist
Combustion    S3D         SNL              147,456      83 TF
Weather       WRF         USA (multiple)   150,000      50 TF
Hirsch-Fye quantum Monte Carlo with Delayed updates (or Ed updates)
Ed D’Azevedo, ORNL

$$G_c(\{s_{i,l}\}_{k+1}) = G_c(\{s_{i,l}\}_k) + a_k \times b_k^t$$

$$G_c(\{s_{i,l}\}_{k+1}) = G_c(\{s_{i,l}\}_0) + [a_0|a_1|\ldots|a_k] \times [b_0|b_1|\ldots|b_k]^t$$
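The equivalence of the two forms above can be checked numerically. The following is a minimal sketch in pure Python, not the DCA++ implementation: the function names and the random test data are hypothetical, and the vectors a_k, b_k are assumed to be fixed in advance. In the actual algorithm they depend on the current Green's function, which is where the additional bookkeeping comes in.

```python
import random

def rank1_update(G, a, b):
    """Apply one rank-1 update G <- G + a b^t in place."""
    for i in range(len(a)):
        for j in range(len(b)):
            G[i][j] += a[i] * b[j]

def delayed_update(G0, a_list, b_list):
    """Return G0 + [a_0|...|a_k] [b_0|...|b_k]^t as one accumulated product."""
    n = len(G0)
    G = [row[:] for row in G0]
    for i in range(n):
        for j in range(n):
            # Entry (i, j) of the matrix-matrix product of the stacked vectors.
            G[i][j] += sum(a[i] * b[j] for a, b in zip(a_list, b_list))
    return G

def max_abs_diff(X, Y):
    """Largest elementwise difference between two matrices."""
    return max(abs(x - y) for rx, ry in zip(X, Y) for x, y in zip(rx, ry))

# Hypothetical test data: a small random matrix and k random vector pairs.
random.seed(0)
n, k = 6, 4
G0 = [[random.random() for _ in range(n)] for _ in range(n)]
a_list = [[random.random() for _ in range(n)] for _ in range(k)]
b_list = [[random.random() for _ in range(n)] for _ in range(k)]

# k separate rank-1 updates, applied one after another.
G_seq = [row[:] for row in G0]
for a, b in zip(a_list, b_list):
    rank1_update(G_seq, a, b)

# The same k updates applied in one step.
G_delayed = delayed_update(G0, a_list, b_list)

difference = max_abs_diff(G_seq, G_delayed)
print("max difference:", difference)
```

Both paths perform the same number of floating point operations. The point of the delayed form is that the matrix G is read and written once instead of k times.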
Complexity for k updates remains $O(k N_t^2)$.
But we can replace k rank-1 updates with one matrix-matrix multiply plus some additional bookkeeping.

[Figure: time to solution [sec] versus delay (0 to 100), for mixed precision and double precision]

G. Alvarez, M. S. Summers, D. E. Maxwell, M. Eisenbach, J. S. Meredith, J. M. Larkin, J. Levesque, T. A. Maier, P. R. C. Kent, E. F. D'Azevedo, T. C. Schulthess; New algorithm to enable 400+ TFlop/s sustained performance in simulations of disorder effects in high-Tc superconductors, Proceedings of the 2008 ACM/IEEE conference on Supercomputing 61, 2008

Arithmetic intensity of a computation

$$\alpha = \frac{\text{floating point operations}}{\text{data transferred}}$$
$$G_c(\{s_{i,l}\}_{k+1}) = G_c(\{s_{i,l}\}_k) + a_k \times b_k^t$$
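To make the arithmetic intensity of this update concrete, here is a rough estimate in Python. It rests on an idealized counting model that is an assumption, not something stated on the slides: every matrix and vector element moves between memory and processor exactly once per operation, one multiply-add counts as two floating point operations, and data is counted in words. The function names are hypothetical.

```python
def intensity_rank1(n):
    """Flops per word for one rank-1 update of an n x n matrix."""
    flops = 2 * n * n            # one multiply and one add per matrix element
    words = 2 * n * n + 2 * n    # read and write G, read the vectors a and b
    return flops / words

def intensity_delayed(n, k):
    """Flops per word when k rank-1 updates are applied as one matrix multiply."""
    flops = 2 * k * n * n            # k multiply-adds per matrix element
    words = 2 * n * n + 2 * k * n    # G moves once, plus the two n x k blocks
    return flops / words

n = 1000
for k in (1, 8, 32):
    print(k, intensity_rank1(n), intensity_delayed(n, k))
```

Under this model the intensity of a single rank-1 update is n / (n + 1), just below one flop per word, while the delayed form reaches k n / (n + k), which is close to k as long as k is much smaller than n.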