Algorithms/Architecture Co-design for Exascale Computer
Stanislav Sedukhin
The University of Aizu

Outline
• Scalar Fused Multiply-Add operation as a workhorse of current scientific computing
• Current State-of-the-Art and Historical Observation (Review of 50+ single-chip μProcessors with FMA units)
• Tree- and Torus-structured Machines
• Arithmetic (Scalar Multiply-Add) → Algebra (Matrix Multiply-Add): Algebraic Path Problem
• Design Space of Extremely-scalable GEneral Matrix Multiply-add (GEMM) Algorithms/Architecture (Best algorithm/architecture selection)
• GEMM-based Orbital Algorithms and Unified Architecture for Multidimensional Data Processing
• Conclusion
9/24/2014 © S. Sedukhin, University of Aizu

SCALAR FUSED MULTIPLY-ADD OPERATION AS A WORKHORSE OF CURRENT SCIENTIFIC COMPUTING
It is Time to Rethink Computing
Before
• Reduction of expensive arithmetic operations (mostly multiplication): fast algorithms:
  – FFT (Cooley-Tukey),
  – MMA (Strassen, Winograd, …)
• Algorithm serialization
• Avoiding storage and multiplication by 0
  – Sparse algorithms

Now and Later On
• Minimization of expensive data movement is more important than reduction of cheap operations.
• Deeply parallelize the original (w/regular data access), not fast, algorithm
• “Go ahead, multiply by zero!” (Dr. J. Gustafson)
• Time to Compute Sparse as Dense

Cost of Computing

| Date | US$ (2013) per GFLOP/S | Computer |
|------|------------------------|----------|
| 1961 | 0.8 × 10^12 | IBM 1620 × 10^6 @ $64,000 each |
| 1984 | 42.8 × 10^6 | Cray X-MP/48 |
| 1997 | 42 × 10^3 | Beowulf Pentium |
| 2000 | 836 | KLAT2 Athlon |
| 2003 | 100 | KASY0 Athlon |
| 2007 | 52 | Microwulf |
| 2011 | 1.80 | HPU4Science |
| 2013 | 0.22 | Sony PlayStation 4 |

Source: http://en.wikipedia.org/wiki/FLOPS
Scalar Fused Multiply-Add (FMA)
Fused Multiply-Add (FMA):
• two (SP/DP) floating-point scalar operations (× and +) in a single cycle (2 FLOPs/cycle);
• improved accuracy due to only one final rounding;
• a latency of a few (3 to 6) cycles;
• standard “addition” and “multiplication” by using hardwired constants 0.0 and 1.0;
• allows an efficient software implementation of division and square root;
• use of the FMA operation results in the practical “disappearance” of the four basic arithmetic operations: add, subtract, multiply, divide.

[Diagram: FMA unit attached to a (3-read & 1-write) Floating Point Unit Register File with registers r0, r1, r2, r3, …, rn, where r0 = 0 and r1 = 1]

c ← FMA(a,b,c);  c ← a × b + c :
  a × b ,    c = 0.0 : multiplication;
  a(b) + c , b(a) = 1.0 : addition;
  a(b) ,     b(a) = 1.0, c = 0.0 : copy.
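The FMA properties listed above can be tried out with a small sketch. Python is used here only for illustration; the function names `fma`, `mul`, `add`, `copy`, and `reciprocal` are hypothetical, and the FMA is emulated in software with exact rational arithmetic rather than a hardware instruction. The reciprocal routine is a textbook Newton-Raphson iteration with a linear first guess, shown as one possible way division can be built from FMAs, not as the method of any particular processor.

```python
from fractions import Fraction
import math


def fma(a: float, b: float, c: float) -> float:
    """Emulated fused multiply-add for finite doubles: a*b + c, rounded once."""
    # Fraction arithmetic is exact, so the only rounding is the final float().
    return float(Fraction(a) * Fraction(b) + Fraction(c))


def mul(a: float, b: float) -> float:
    return fma(a, b, 0.0)        # c = 0.0 : multiplication


def add(a: float, c: float) -> float:
    return fma(a, 1.0, c)        # b = 1.0 : addition


def copy(a: float) -> float:
    return fma(a, 1.0, 0.0)      # b = 1.0, c = 0.0 : copy


def reciprocal(d: float, steps: int = 4) -> float:
    """Approximate 1/d for finite d > 0 by Newton-Raphson, using only FMAs."""
    m, e = math.frexp(d)                     # d = m * 2**e with 0.5 <= m < 1
    r = fma(-32.0 / 17.0, m, 48.0 / 17.0)    # linear first guess for 1/m
    for _ in range(steps):
        err = fma(-m, r, 1.0)                # err = 1 - m*r
        r = fma(r, err, r)                   # r = r + r*err
    return math.ldexp(r, -e)                 # undo scaling: 1/d = (1/m) * 2**-e


if __name__ == "__main__":
    a, b = 1.0 + 2.0**-30, 1.0 - 2.0**-30
    print(a * b - 1.0)        # product rounded first, then the subtraction
    print(fma(a, b, -1.0))    # single rounding at the end
    print(reciprocal(3.0))
```

With these inputs the exact product is 1 − 2⁻⁶⁰, which rounds to 1.0 in double precision, so the unfused expression yields 0.0 while the fused one returns −2⁻⁶⁰. A quotient a/d can then be formed as `mul(a, reciprocal(d))`.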
50+ Single-chip μ-Processors (1990 – 2014)
| Processor | Architecture | Vendor | Year | # DP FMA/clock | Clock Speed (GHz) | Cores | DP peak (GFLOPS) | TDP (Watt) | # Transistors (× Million) | Fab. process (nm) | Die Size (mm^2) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| POWER1 | RIOS-1 | IBM | 1990 | 1 | 0.041 | 1 | 0.082 | 4 | 6.9 | 1000 | |
| PA-8000 | PA-8000 | HP | 1996 | 1 | 0.2 | 1 | 0.4 | | 3.8 | 500 | 337.69 |
| SuperH SH-4 | SuperH | Hitachi | 1998 | 1 | 0.2 | 1 | 1.4 | 1.5 | 3.2 | 250 | 42.25 |
| Pentium III Xeon 550 | P6 | Intel | 1999 | 2 | 0.55 | 1 | 2.2 | 39.5 | 9.5 | 250 | 128 |
| Itanium | Itanium (Merced) | Intel | 2001 | 2 | 0.8 | 1 | 3.2 | 130 | 25 | 180 | 544 |
| Opteron 850 | K8 (SledgeHammer) | AMD | 2004 | 2 | 2.4 | 1 | 9.6 | 89 | 105.9 | 130 | 193 |
| Itanium 2 | Itanium 2 (Madison) | Intel | 2005 | 2 | 1.67 | 1 | 6.67 | 130 | 221 | 130 | 544 |
| Xeon 3.80E | NetBurst | Intel | 2005 | 2 | 3.8 | 1 | 15.2 | 110 | 189 | 90 | 135 |
| Opteron 880 | K8 (Egypt) | AMD | 2005 | 4 | 2.4 | 2 | 19.2 | 95 | 233 | 90 | |
| Opteron 8220 SE | K8 (Santa Rosa) | AMD | 2006 | 4 | 2.8 | 2 | 22.4 | 119.2 | 243 | 90 | |
| Itanium 2 9010 | Itanium 2 9000 (Montecito) | Intel | 2006 | 4 | 1.6 | 2 | 12.8 | 104 | 1720 | 90 | 544 |
| Xeon 7140M | NetBurst | Intel | 2006 | 4 | 3.4 | 2 | 27.2 | 150 | 1328 | 65 | 435 |
| SPARC64 VI | SPARC64 VI | Fujitsu | 2007 | 4 | 2.4 | 2 | 19.2 | 120 | 543 | 90 | 421.25 |
| Opteron 8360 SE | K10 (Barcelona) | AMD | 2007 | 8 | 2.5 | 4 | 40 | 119 | 463 | 65 | |
| Blue Gene/P | PowerPC 450 | IBM | 2007 | 8 | 0.85 | 4 | 13.6 | 36.7 | 208 | 90 | 121 |
| Xeon X7350 | Core | Intel | 2007 | 8 | 2.93 | 4 | 46.88 | 130 | 582 | 65 | 503 |
| SPARC64 VII | SPARC64 VII | Fujitsu | 2008 | 8 | 2.88 | 4 | 46.08 | | 600 | 65 | 445 |
| Xeon X7460 | Penryn | Intel | 2008 | 12 | 2.66 | 6 | 63.84 | 130 | 1900 | 45 | 503 |
| PowerXCell 8i | Cell | IBM | 2008 | 16 | 3.2 | 8 | 109 | 92 | 250 | 65 | 212 |
| Tesla C1060 | GT200 | NVIDIA | 2008 | 30 | 1.3 | 240 | 78 | 187.8 | 1400 | 65 | 470 |
| Opteron 8439 SE | K10 (Istanbul) | AMD | 2009 | 12 | 2.8 | 6 | 67.2 | 137 | 904 | 45 | |
| Xeon X7560 | Nehalem | Intel | 2009 | 16 | 2.266 | 8 | 72.512 | 130 | 2300 | 45 | 684 |
| SPARC64 VII+ | SPARC64 VII+ | Fujitsu | 2010 | 8 | 3 | 4 | 48 | | | 45 | |
| Itanium 9350 | Itanium 9300 (Tukwila) | Intel | 2010 | 16 | 1.73 | 4 | 55.36 | 185 | 2046 | 65 | 698.75 |
| Xeon E7-8870 | Westmere (Nehalem-C) | Intel | 2010 | 20 | 2.4 | 10 | 96 | 130 | 2600 | 32 | |
| Opteron 6180 SE | K10 (Magny-Cours) | AMD | 2010 | 24 | 2.5 | 12 | 120 | 140 | 904 | 45 | |
| POWER7 | POWER7 | IBM | 2010 | 32 | 4.14 | 8 | 264.96 | 150 | 1200 | 45 | 567 |
| Tesla C2070 | Fermi | NVIDIA | 2010 | 224 | 1.15 | 448 | 515 | 247 | 3100 | 40 | 529 |
| Radeon HD 5870 | Evergreen (Cypress XT) | AMD | 2010 | 320 | 0.85 | 1600 | 544 | 188 | 2154 | 40 | 334 |
| Opteron 6282 SE | Bulldozer (Interlagos) | AMD | 2011 | 32 | 2.6 | 16 | 166.4 | 140 | 1200 | 32 | |
| HPC-ACE SPARC64 VIIIfx | SPARC64 (Venus) | Fujitsu | 2011 | 32 | 2 | 8 | 128 | 58 | 760 | 45 | 513 |
| Xeon E5-4650 | Sandy Bridge | Intel | 2011 | 32 | 2.7 | 8 | 172.8 | 130 | 2270 | 32 | |
| Godson-3 L3B | Loongson 3B | ICT | 2011 | 64 | 1.05 | 8 | 128 | 40 | 582.6 | 65 | 300 |
| Tesla M2090 | Fermi | NVIDIA | 2011 | 256 | 1.301 | 512 | 666 | 225 | 3000 | 40 | |
| Radeon HD 6970 | Northern Islands (Cayman XT) | AMD | 2011 | 384 | 0.88 | 1536 | 676 | 250 | 2640 | 40 | 389 |
| Opteron 6386 SE | Piledriver (Abu Dhabi) | AMD | 2012 | 32 | 2.8 | 16 | 179.2 | 140 | 1308 | 32 | |
| Itanium 9560 | Itanium 9500 (Poulson) | Intel | 2012 | 48 | 2.53 | 8 | 242.88 | 170 | 3100 | 32 | 544 |
| HPC-ACE SPARC64 IXfx | SPARC64 | Fujitsu | 2012 | 64 | 1.848 | 16 | 236.5 | 110 | | 45 | |
| Blue Gene/Q | PowerPC A2 | IBM | 2012 | 64 | 1.6 | 18 | 204.8 | 55 | 1470 | 45 | 428 |
| Xeon Phi 5110P | MIC | Intel | 2012 | 480 | 1.053 | 60 | 1011 | 225 | 5000 | 22 | 350 |
| Radeon HD 7970 | Southern Islands (Tahiti XT) | AMD | 2012 | 512 | 0.925 | 2048 | 947 | 230 | 4313 | 28 | 352 |
| Radeon HD 7970 GHz Edition | Southern Islands (Tahiti XT2) | AMD | 2012 | 512 | 1.05 | 2048 | 1075 | 230 | 4313 | 28 | 352 |
| Tesla K20X | Kepler GK110 | NVIDIA | 2012 | 896 | 0.732 | 2688 | 1312 | 235 | 7080 | 28 | 561 |
| Quadro K6000 | GK110 | NVIDIA | 2013 | 960 | 0.901 | 2880 | 1732 | 225 | 7080 | 28 | 561 |
| Core i7-4770K | Haswell | Intel | 2013 | 32 | 3.8 | 4 | 243.2 | 84 | 1400 | 22 | 177 |
| SPARC64 XIfx | SPARC64 XIfx | Fujitsu | 2014 | 272 | 2.2 | 34 | 1196.8 | 160 | 3750 | 20 | 600 |
| FirePro W9100 | Hawaii XT | AMD | 2014 | 1408 | 0.93 | 2816 | 2618.9 | 275 | 6200 | 28 | 438 |
| FirePro S9150 | Hawaii XT GL | AMD | 2014 | 1408 | 0.9 | 2816 | 2534.4 | 235 | 6200 | 28 | 438 |
| GeForce GTX Titan Black | GK110-430 | NVIDIA | 2014 | 960 | 0.889 | 2880 | 1707 | 250 | 7080 | 28 | 561 |
| POWER8 | POWER8 | IBM | 2014 | 24 | 4.2 | 12 | 201.6 | 250 | 4200 | 22 | 675 |
[Figure: Number of Transistors (Millions) by year, 1990 to 2020; logarithmic vertical axis from 1 to 32768]
[Figure: Number of DP FMA/clock (FMA units) by year, 1990 to 2020; logarithmic vertical axis from 1 to 8192]
[Figure: Clock Speed (GHz) by year, 1990 to 2020; logarithmic vertical axis from 0.031 to 8]
[Figure: panel with no surviving axis title, by year, 1990 to 2020; logarithmic vertical axis from 1 to 512]
[Figure: Die Size (mm^2) by year, 1990 to 2020; logarithmic vertical axis from 1 to 1024]

DP Peak Performance (GFLOPS)
[Figure: DP Peak Performance (GFLOPS) by year, 1990 to 2020; logarithmic vertical axis from 0.01 to 10000]
Labeled points (GFLOPS): AMD W9100 (2619), NVIDIA GK110 (1732), AMD S9150 (2534), NVIDIA K20X (1312), NVIDIA GTX 780 Ti (1707), AMD 5870 (544), Fujitsu SPARC64 XI (1196.8), IBM PowerXCell 8i (109), Intel Haswell-EP 2686v3 (1008), Intel Xeon 7140M (27.2), IBM POWER8 (201.6), AMD Opteron 880 (19.2), Intel Haswell (243.2), AMD Opteron 850 (9.6), AMD Opteron 6386 (179.2), IBM PowerPC 450 (13.6), Intel Pentium III (2.2), Intel Sandy Bridge (172.8), Hitachi SuperH-4 (1.4), Intel Itanium (3.2), Intel Itanium2 (6.67), Fujitsu SPARC64 VII+ (48), HP PA-8000 (0.4), IBM POWER1 (0.08).

2014: Peak Performance (GFLOPS)
1. AMD W9100: 2619
2. AMD S9150: 2534
3. NVIDIA GTX 780 Ti: 1707
4. NVIDIA GTX 980: 1537
5. Fujitsu SPARC64 XIfx: 1197
6. Intel Haswell-EP: 1008
7. IBM POWER8: 202
[Figure: Power/Throughput (W/GFLOPS, logarithmic axis from 0.10 to 100.00) versus Area/Throughput (mm²/GFLOPS, logarithmic axis from 0.01 to 100), with guide lines at 1 W/mm² and 0.1 W/mm²]
Labeled points: Merced, Pentium III, Madison, Opteron, Montecito, Xeon 3.80E, SPARC64 VI, Xeon 7140, PowerPC 460, Itanium 9300, GT200, Xeon 7350, Penryn, Xeon 7650, POWER8, PowerXCell8i, Tesla C2070, Itanium 9560, Haswell, POWER7, SPARC64 VIII, MIC, Loongson, Radeon 7970, Radeon 5870, PowerPC A2, GK110-410, Radeon HD 6970, GTX 780, SPARC64 XIfx, Haswell-EP, Tesla K20X, GTX 980, W9100, S9150.
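The W/mm² guide lines in this plot follow from dividing the vertical quantity by the horizontal one: (W/GFLOPS) / (mm²/GFLOPS) = W/mm², the power density. The sketch below illustrates this with the peak GFLOPS, TDP, and die size listed for three 2014 parts in the processor table; the function name `power_density` and the dictionary layout are assumptions made for this example only.

```python
def power_density(w_per_gflops: float, mm2_per_gflops: float) -> float:
    """Power density in W/mm^2 from the two plotted ratios."""
    return w_per_gflops / mm2_per_gflops


# (DP peak in GFLOPS, TDP in W, die size in mm^2), taken from the processor table
chips = {
    "POWER8": (201.6, 250.0, 675.0),
    "SPARC64 XIfx": (1196.8, 160.0, 600.0),
    "FirePro S9150": (2534.4, 235.0, 438.0),
}

densities = {}
for name, (gflops, tdp_w, die_mm2) in chips.items():
    y = tdp_w / gflops        # vertical axis: W/GFLOPS
    x = die_mm2 / gflops      # horizontal axis: mm^2/GFLOPS
    densities[name] = power_density(y, x)
    print(f"{name}: {y:.3f} W/GFLOPS, {x:.3f} mm2/GFLOPS, {densities[name]:.3f} W/mm2")
```

Because the throughput cancels, the result equals TDP divided by die size, so every point on one guide line shares the same power density regardless of its performance. For these three parts the values lie between the 0.1 W/mm² and 1 W/mm² lines.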
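$$\text{Performance}_{\text{peak}}\ (\text{GFLOPS}) = \#\text{FMA}_{\text{units}} \times 2\ \text{FLOPs} \times \text{CLK}_{\text{frequency}}\ (\text{GHz})$$

[Figure: lines of constant peak performance labeled 10, 20, 50, 100, 200, 500, 1,000, 2,000, 3,000, 4,000, and 10,000 GFLOPS; the remaining tick labels (2014; 4, 2, 1) survive without axis titles]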
| Processor | Vendor | Clock Frequency (GHz) | Die Area (mm²) | TDP (Watts) | # DP FMA units | Peak Performance (GFLOPS) | GFLOPS/W | Area/GFLOPS (mm²/GFLOPS) |
|---|---|---|---|---|---|---|---|---|
| POWER8 | IBM | 4.2 | 675 | 250 | 24 | 201.6 | 0.81 | 3.35 |
| SPARC64 XIfx | Fujitsu | 2.2 | 600 | 160 | 272 | 1197 | 7.5 | 0.50 |
| S9150 | AMD | 0.90 | 438 | 235 | 1408 | 2534 | 10.8 | 0.17 |
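The derived columns for these three processors can be recomputed from the peak-performance formula (number of FMA units × 2 FLOPs × clock in GHz). The following is a minimal sketch; the names `peak_gflops` and `derived_metrics` are hypothetical helpers introduced only for this example, and the inputs are the clock, die area, TDP, and FMA-unit counts listed for the three parts.

```python
def peak_gflops(fma_units: int, clock_ghz: float) -> float:
    """Peak DP performance: each FMA unit delivers 2 FLOPs per clock."""
    return fma_units * 2 * clock_ghz


def derived_metrics(fma_units, clock_ghz, die_mm2, tdp_w):
    """Return (peak GFLOPS, GFLOPS per watt, mm^2 per GFLOPS)."""
    peak = peak_gflops(fma_units, clock_ghz)
    return peak, peak / tdp_w, die_mm2 / peak


# (processor, FMA units, clock in GHz, die area in mm^2, TDP in W)
rows = [
    ("POWER8", 24, 4.2, 675, 250),
    ("SPARC64 XIfx", 272, 2.2, 600, 160),
    ("S9150", 1408, 0.90, 438, 235),
]

for name, units, clock, die, tdp in rows:
    peak, per_watt, area = derived_metrics(units, clock, die, tdp)
    print(f"{name}: {peak:.1f} GFLOPS, {per_watt:.2f} GFLOPS/W, {area:.2f} mm2/GFLOPS")
```

The comparison shows the trade-off the three designs make: POWER8 relies on a high clock with few FMA units, while the S9150 runs many units at a low clock and ends up with more than ten times the performance per watt.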
Computing Efficiency: Performance per Watt

Target: 50 GFLOPS/Watt for an exaflops supercomputer @ 20 MW

[Figure: performance per Watt (GFLOPS/W) on a logarithmic vertical axis from 0.1 to 100]
Labeled points (GFLOPS/W): AMD S9150 (10.8), NVIDIA GK110 (7.7), AMD W9100 (9.5), NVIDIA K20X (5.6), NVIDIA GTX 980 (9.3), AMD 5870 (2.9), Intel Haswell (8.4), Fujitsu SPARC64 XI (7.5), PowerXCell8i (1.2), Hitachi SuperH (0.93), NVIDIA GK110-430 (6.8), IBM PowerPC 450 (0.37), IBM POWER8 (0.8), AMD Opteron 880 (0.2), AMD Opteron 850 (0.1), Intel Pentium III (0.06), IBM POWER1 (0.02).
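The 50 GFLOPS/W figure is simply one exaflops divided by a 20 MW power budget. A short sketch of that arithmetic follows, along with the gap to the best value on this slide (10.8 GFLOPS/W for the AMD S9150); the constant and function names are illustrative assumptions.

```python
EXAFLOPS_IN_GFLOPS = 1e9      # 1 EFLOPS = 10**18 FLOPS = 10**9 GFLOPS
POWER_BUDGET_W = 20e6         # 20 MW


def required_gflops_per_watt(target_gflops: float, budget_w: float) -> float:
    """Efficiency needed to reach a performance target within a power budget."""
    return target_gflops / budget_w


target = required_gflops_per_watt(EXAFLOPS_IN_GFLOPS, POWER_BUDGET_W)
best_2014 = 10.8              # AMD S9150, GFLOPS/W, as listed on this slide
gap = target / best_2014      # factor still missing at the chip level

print(f"required: {target:.0f} GFLOPS/W, gap to S9150: {gap:.1f}x")
```

This counts only peak chip efficiency at TDP; memory, interconnect, and cooling power are outside the scope of the sketch.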