Algorithms/Architecture Co-design for Exascale Computer

Stanislav Sedukhin, The University of Aizu
Outline:
• Scalar Fused Multiply-Add operation as a workhorse of current scientific computing
• Current State-of-the-Art and Historical Observation (review of 50+ single-chip µProcessors with FMA units)
• Tree- and Torus-structured Machines
• Arithmetic (Scalar Multiply-Add) → Algebra (Matrix Multiply-Add): the Algebraic Path Problem
• Design Space of Extremely-scalable GEneral Matrix Multiply-add (GEMM) Algorithms/Architecture (best algorithm/architecture selection)
• GEMM-based Orbital Algorithms and Unified Architecture for Multidimensional Data Processing
• Conclusion

9/24/2014 © S. Sedukhin, University of Aizu 2 SCALAR FUSED MULTIPLY-ADD OPERATION AS A WORKHORSE OF CURRENT SCIENTIFIC COMPUTING

9/24/2014 © S. Sedukhin, University of Aizu 3 It is Time to Rethink Computing

Cost of Computing
Before:
• Reduction of expensive arithmetic operations (mostly multiplication): fast algorithms – FFT (Cooley-Tukey), MMA (Strassen, Winograd, …)
• Algorithm serialization
• Avoiding storage of, and multiplication by, 0 – sparse algorithms
Now and later on:
• Minimization of expensive data movement is more important than reduction of cheap operations
• Deeply parallelize the original (with regular data access), not the fast, algorithm
• "Go ahead, multiply by zero!" (Dr. J. Gustafson)
• Time to compute sparse as dense
Cost per GFLOP/S over time (Date, approximate US$ (2013), Computer):
1961: 0.8 × 10^12 (IBM 1620, × 10^6 units @ $64,000 each); 1984: 42.8 × 10^6 (Cray X-MP/48); 1997: 42 × 10^3 (Beowulf Pentium); 2000: 836 (KLAT2 Athlon); 2003: 100 (KASY0 Athlon); 2007: 52 (Microwulf); 2011: 1.80 (HPU4Science); 2013: 0.22 (Sony PlayStation 4). Source: http://en.wikipedia.org/wiki/FLOPS

9/24/2014 © S. Sedukhin, University of Aizu 4 Scalar Fused Multiply-Add (FMA)

Fused Multiply-Add (FMA):
• two (SP/DP) floating-point scalar operations (× and +) in a single cycle (2 FLOPs/cycle), with 3-read & 1-write access to the register file of the floating-point unit;
• improved accuracy due to only one final rounding;
• a few (3÷6) cycles of latency;
• standard "addition" and "multiplication" by using hardwired constants 0.0 and 1.0;
• allows an efficient software implementation of division and square root;
• use of the FMA operation results in the practical "disappearance" of the four basic arithmetic operations: add, subtract, multiply, divide.

c ← FMA(a, b, c) = a × b + c:
  a × b, if c = 0.0 : multiplication;
  a (or b) + c, if b (or a) = 1.0 : addition;
  a (or b), if b (or a) = 1.0 and c = 0.0 : copy.
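The register-file figure of this slide is lost in extraction, but the operational point survives. Below is a minimal C sketch (not from the slides) showing how a single fused multiply-add primitive subsumes multiplication, addition, and copy via the hardwired constants 0.0 and 1.0; it uses the standard C99 fma() from <math.h> (link with -lm).

```c
/* Sketch: how one scalar FMA operation subsumes the basic operations,
   using the C99 fma() intrinsic (a*b + c with a single final rounding). */
#include <math.h>
#include <stdio.h>

int main(void) {
    double a = 3.0, b = 4.0, c = 5.0;
    double mul  = fma(a, b, 0.0);   /* c = 0.0        -> multiplication: a*b */
    double add  = fma(a, 1.0, c);   /* b = 1.0        -> addition:      a+c */
    double copy = fma(a, 1.0, 0.0); /* b = 1.0, c = 0 -> copy:          a   */
    double fmad = fma(a, b, c);     /* general fused multiply-add:  a*b + c */
    printf("mul=%g add=%g copy=%g fma=%g\n", mul, add, copy, fmad);
    return 0;
}
```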

9/24/2014 © S. Sedukhin, University of Aizu 5 50+ Single-chip μ-Processors (1990 – 2014)

Columns: Processor, Architecture, Vendor, Year, # DP FMA/clock, Clock Speed (GHz), Cores, DP peak (GFLOPS), TDP (Watt), #Transistors (x Million), Fab. process (nm), Die Size (mm^2); one processor per line (some values are missing for some processors).
POWER1 RIOS-1 IBM 1990 1 0.041 1 0.082 4 6.9 1000
PA-8000 PA-8000 HP 1996 1 0.2 1 0.4 3.8 500 337.69
SuperH SH-4 SuperH Hitachi 1998 1 0.2 1 1.4 1.5 3.2 250 42.25
Pentium III 550 1999 2 0.55 1 2.2 39.5 9.5 250 128
Itanium (Merced) Intel 2001 2 0.8 1 3.2 130 25 180 544
Opteron 850 K8 (SledgeHammer) AMD 2004 2 2.4 1 9.6 89 105.9 130 193
Itanium 2 Itanium 2 (Madison) Intel 2005 2 1.67 1 6.67 130 221 130 544
Xeon 3.80E NetBurst Intel 2005 2 3.8 1 15.2 110 189 90 135
Opteron 880 K8 (Egypt) AMD 2005 4 2.4 2 19.2 95 233 90
Opteron 8220 SE K8 (Santa Rosa) AMD 2006 4 2.8 2 22.4 119.2 243 90
Itanium 2 9010 Itanium 2 9000 () Intel 2006 4 1.6 2 12.8 104 1720 90 544
Xeon 7140M NetBurst Intel 2006 4 3.4 2 27.2 150 1328 65 435
SPARC64 VI SPARC64 VI Fujitsu 2007 4 2.4 2 19.2 120 543 90 421.25
Opteron 8360 SE K10 (Barcelona) AMD 2007 8 2.5 4 40 119 463 65
Blue Gene/P PowerPC 450 IBM 2007 8 0.85 4 13.6 36.7 208 90 121
Xeon X7350 Core Intel 2007 8 2.93 4 46.88 130 582 65 503
SPARC64 VII SPARC64 VII Fujitsu 2008 8 2.88 4 46.08 600 65 445
Xeon X7460 Penryn Intel 2008 12 2.66 6 63.84 130 1900 45 503
PowerXCell 8i Cell IBM 2008 16 3.2 8 109 92 250 65 212
Tesla C1060 GT200 NVIDIA 2008 30 1.3 240 78 187.8 1400 65 470
Opteron 8439 SE K10 (Istanbul) AMD 2009 12 2.8 6 67.2 137 904 45
Xeon X7560 Nehalem Intel 2009 16 2.266 8 72.512 130 2300 45 684
SPARC64 VII+ SPARC64 VII+ Fujitsu 2010 8 3 4 48 45
Itanium 9350 Itanium 9300 (Tukwila) Intel 2010 16 1.73 4 55.36 185 2046 65 698.75
Xeon E7-8870 Westmere (Nehalem-C) Intel 2010 20 2.4 10 96 130 2600 32
Opteron 6180 SE K10 (Magny-Cours) AMD 2010 24 2.5 12 120 140 904 45
POWER7 POWER7 IBM 2010 32 4.14 8 264.96 150 1200 45 567
Tesla C2070 Fermi NVIDIA 2010 224 1.15 448 515 247 3100 40 529
Radeon HD 5870 Evergreen (Cypress XT) AMD 2010 320 0.85 1600 544 188 2154 40 334
Opteron 6282 SE Bulldozer (Interlagos) AMD 2011 32 2.6 16 166.4 140 1200 32
SPARC64 VIIIfx SPARC64 (Venus) / HPC-ACE Fujitsu 2011 32 2 8 128 58 760 45 513
Xeon E5-4650 Intel 2011 32 2.7 8 172.8 130 2270 32
Godson-3 L3B Loongson 3B ICT 2011 64 1.05 8 128 40 582.6 65 300
Tesla M2090 Fermi NVIDIA 2011 256 1.301 512 666 225 3000 40
Radeon HD 6970 Northern Islands (Cayman XT) AMD 2011 384 0.88 1536 676 250 2640 40 389
Opteron 6386 SE Piledriver (Abu Dhabi) AMD 2012 32 2.8 16 179.2 140 1308 32
Itanium 9560 Itanium 9500 (Poulson) Intel 2012 48 2.53 8 242.88 170 3100 32 544
SPARC64 IXfx SPARC64 / HPC-ACE Fujitsu 2012 64 1.848 16 236.5 110 45
Blue Gene/Q PowerPC A2 IBM 2012 64 1.6 18 204.8 55 1470 45 428
Xeon Phi 5110P MIC Intel 2012 480 1.053 60 1011 225 5000 22 350
Radeon HD 7970 Southern Islands (Tahiti XT) AMD 2012 512 0.925 2048 947 230 4313 28 352
Radeon HD 7970 GHz Edition Southern Islands (Tahiti XT2) AMD 2012 512 1.05 2048 1075 230 4313 28 352
Tesla K20X Kepler GK110 NVIDIA 2012 896 0.732 2688 1312 235 7080 28 561
Quadro K6000 GK110 NVIDIA 2013 960 0.901 2880 1732 225 7080 28 561
Core i7-4770K Haswell Intel 2013 32 3.8 4 243.2 84 1400 22 177
SPARC64 XIfx SPARC64 XIfx Fujitsu 2014 272 2.2 34 1196.8 160 3750 20 600
FirePro W9100 Hawaii XT AMD 2014 1408 0.93 2816 2618.9 275 6200 28 438
FirePro S9150 Hawaii XT GL AMD 2014 1408 0.9 2816 2534.4 235 6200 28 438
GeForce GTX Titan Black GK110-430 NVIDIA 2014 960 0.889 2880 1707 250 7080 28 561
POWER8 POWER8 IBM 2014 24 4.2 12 201.6 250 4200 22 675

9/24/2014 © S. Sedukhin, University of Aizu 6
[Chart: Number of Transistors (Millions) vs. year, 1990-2020, log scale 1-32768.]
[Chart: Number of DP FMA/clock (FMA units) vs. year, 1990-2020, log scale 1-8192.]

9/24/2014 © S. Sedukhin, University of Aizu 7
[Chart: Clock Speed (GHz) vs. year, 1990-2020, log scale 0.031-8.]
[Chart: Die Size (mm^2) vs. year, 1990-2020, log scale 1-1024.]
9/24/2014 © S. Sedukhin, University of Aizu 8 DP Peak Performance (GFLOPS)

[Chart: DP Peak Performance (GFLOPS) vs. year, 1990-2020, log scale 0.01-10000. Labeled points: IBM POWER1 (0.08), HP PA-8000 (0.4), Hitachi SuperH-4 (1.4), Intel Pentium III (2.2), Intel Itanium (3.2), Intel Itanium 2 (6.67), AMD Opteron 850 (9.6), IBM PowerPC 450 (13.6), AMD Opteron 880 (19.2), Intel Xeon 7140M (27.2), Fujitsu SPARC64 VII+ (48), IBM PowerXCell 8i (109), Intel Sandy Bridge (172.8), AMD Opteron 6386 (179.2), IBM POWER8 (201.6), Intel Haswell (243.2), AMD Radeon HD 5870 (544), Intel Haswell-EP 2686v3 (1008), Fujitsu SPARC64 XIfx (1196.8), NVIDIA K20X (1312), NVIDIA GK110 (1732), NVIDIA GTX 780 Ti (1707), AMD S9150 (2534), AMD W9100 (2619).]

2014: Peak Performance (GFLOPS)
1. AMD W9100: 2619
2. AMD S9150: 2534
3. NVIDIA GTX 780 Ti: 1707
4. NVIDIA GTX 980: 1537
5. Fujitsu SPARC64 XIfx: 1197
6. Intel Haswell-EP: 1008
7. IBM POWER8: 202

9/24/2014 © S. Sedukhin, University of Aizu 9
[Scatter plot: Power/Throughput (W/GFLOPS) vs. Area/Throughput (mm²/GFLOPS), both on log scales, with iso-power-density lines at 1 W/mm² and 0.1 W/mm². Labeled processors: Merced, Pentium III, Madison, Opteron, Montecito, Xeon 3.80E, SPARC64 VI, Xeon 7140, PowerPC 450, Itanium 9300, GT200, Xeon 7350, Penryn, Xeon 7650, POWER8, PowerXCell8i, Tesla C2070, Itanium 9560, Haswell, POWER7, SPARC64 VIII, MIC, Loongson, Radeon 7970, Radeon 5870, PowerPC A2, GK110-410, Radeon HD 6970, GTX 780, SPARC64 XIfx, Haswell-EP, Tesla K20X, GTX 980, W9100, S9150.]

9/24/2014 © S. Sedukhin, University of Aizu 10 Performance_peak (GFLOPS) = #FMA_units × 2 FLOPs × Clock_frequency (GHz)

[Chart: peak performance of 2014 processors on a logarithmic GFLOPS scale (10 to 10,000 GFLOPS).]

Processor | Vendor | Clock Frequency (GHz) | Die Area (mm²) | TDP (Watts) | # DP FMA units | Peak Performance (GFLOPS) | GFLOPS/W | Area/GFLOPS (mm²/GFLOPS)
POWER8 | IBM | 4.2 | 675 | 250 | 24 | 201.6 | 0.81 | 3.35
SPARC64 XIfx | Fujitsu | 2.2 | 600 | 160 | 272 | 1197 | 7.5 | 0.50
S9150 | AMD | 0.90 | 438 | 235 | 1408 | 2534 | 10.8 | 0.17
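As a rough check of the formula above, here is a small C sketch (not part of the slides) that recomputes the peak GFLOPS and GFLOPS/W figures of the three processors in the table from their FMA counts and clock frequencies.

```c
/* Sketch: peak DP performance = #FMA units x 2 FLOPs x clock frequency (GHz),
   checked against the three processors in the table above. */
#include <stdio.h>

int main(void) {
    struct { const char *name; int fma_units; double ghz; double tdp_w; } p[] = {
        { "POWER8",       24,   4.2,  250.0 },
        { "SPARC64 XIfx", 272,  2.2,  160.0 },
        { "S9150",        1408, 0.90, 235.0 },
    };
    for (int i = 0; i < 3; i++) {
        double gflops = p[i].fma_units * 2.0 * p[i].ghz;   /* peak GFLOPS */
        printf("%-14s %8.1f GFLOPS  %5.2f GFLOPS/W\n",
               p[i].name, gflops, gflops / p[i].tdp_w);
    }
    return 0;
}
```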

9/24/2014 © S. Sedukhin, University of Aizu 11 Computing Efficiency: Performance per Watt

Target: 50 GFLOPS/Watt for an exaflops supercomputer @ 20 MW

[Chart: efficiency (GFLOPS/W, log scale 0.01-100) vs. year, 1990-2020. Labeled points: IBM POWER1 (0.02), Intel Itanium (0.03), Intel Pentium III (0.06), AMD Opteron 850 (0.1), AMD Opteron 880 (0.2), IBM PowerPC 450 (0.37), IBM POWER8 (0.8), Hitachi SuperH (0.93), PowerXCell8i (1.2), AMD 5870 (2.9), NVIDIA K20X (5.6), NVIDIA GK110-430 (6.8), Fujitsu SPARC64 XIfx (7.5), NVIDIA GK110 (7.7), Intel Haswell (8.4), NVIDIA GTX 980 (9.3), AMD W9100 (9.5), AMD S9150 (10.8).]

2014 rank | Peak Performance (GFLOPS) | Efficiency (GFLOPS/W)
1 | AMD W9100: 2619 | AMD S9150: 10.8
2 | AMD S9150: 2534 | AMD W9100: 9.5
3 | NVIDIA GTX 780 Ti: 1707 | NVIDIA GTX 980: 9.3
4 | NVIDIA GTX 980: 1537 | Intel Haswell-EP: 8.4
5 | Fujitsu SPARC64 XIfx: 1197 | Fujitsu SPARC64 XIfx: 7.5
6 | Intel Haswell-EP: 1008 | NVIDIA GTX 780 Ti: 6.8
7 | IBM POWER8: 202 | IBM POWER8: 0.8

9/24/2014 © S. Sedukhin, University of Aizu 12 Even more FMA Units in Supercomputers

IBM Roadrunner’s Peak Performance: 12,960 PowerXCell 8i × 8 SPEs × 2-way FMAs/clock = 207.360 FMAs/clock (@3.2 GHz) ≈ 1.3 PFLOPS K Supercomputer: 80,000 SPARC64 VIIIfx chips, each chip is 8-core CPU 128GFLOPS, i.e. 640,000 cores × 4-way FMAs/clock = 2.56M FMAs /clock (@2 GHz) = 10.24 PFLOPS IBM BlueGene/Q (Sequoia): 1.3M cores × 4-way FMAs/clock = 5.2M FMAs/clock ≈ 10M FLOPs/clock (@2 GHz) ≈ 20 PFLOPS IBM Mira: 16-core Power A2 (@1.6GHz); 750K cores × 4-way = 3M FMAs/clock ≈ 6.14MFLOPs/clock ≈ 9.83 PFLOPS GPU-based Supers: – Tianhe-1A: Tesla M2050 (Fermi): 256 FMA/clock (@1.6 GHz) × 7,168 GPUs ≈ 1.8M FMAs/clock ≈ 5.9 PFLOPS – TSUBAME 2: Tesla M2050 (Fermi): 256 FMA/clock (@1.6 GHz) × 4,224 GPUs ≈ 1.1M FMAs/clock ≈ 3.5 PFLOPS Tianhe-2 : 16,000 nodes × (2 Intel Xeon Ivy Bridge + 3 Intel Xeon Phi): – Phi: 480 × 3 = 1440 FMA/clock/node × 16,000 ≈ 23M FMAs/clock × 2FLOPs @1.053GHz ≈ 48.5PFLOPS – Ivy Bridge: 12 cores × 4 FMAs/cycle × 2 = 96 FMA/clock/node × 16,000 ≈ 1.54M FMAs/clock × 2FLOPs @2.2GHz ≈ 6.8PFLOPS – Total: 23M + 1.54M ≈ 25M FMAs/clock; 48.555PFLOPS + 6.855PFLOPS ≈ 55PFLOPS – Footprint: 720 m2 ≈ 27 m × 27 m Synchronizaon of Processes in such mul-node GALS-Supers is provided by distributed Global “Barrier Synchronizaon” in hardware or/and in soware! – “Barrier Synchronizaon” on 32-processor Shared Memory SGI Origin 300 System: 232,000 cycles ≈ 22MFLOPs – Time of “Barrier Synchronizaon” depends on the size of system – Scalability of MPI_Barrier in “Earth Simulator”: ~3 μsec – 333KHz while operaonal frequency is 1GHz, i.e. difference is 3000 mes…

9/24/2014 © S. Sedukhin, University of Aizu 13 Trends for “Clock Region” or “Span of Control”

Source: Matzke, D., "Will physical scalability sabotage performance gains?," Computer, vol. 30, no. 9, pp. 37-39, Sep. 1997. Source: Agarwal, V., Hrishikesh, M. S., Keckler, S. W., Burger, D., "Clock rate versus IPC: the end of the road for conventional microarchitectures," SIGARCH Comput. Archit. News 28, 2 (May 2000), 248-259.

[Chart: Clock Speed (GHz) vs. year, 1990-2020, log scale 0.031-8 (repeated from an earlier slide).]

9/24/2014 © S. Sedukhin, University of Aizu 14 More Performance and Less Power Consumption by Frequency Reduction
• Performance P (GFLOPS) = #FPUs × 2 FLOPs × F (GHz),
where F is the clock speed, or operating frequency.

• The number of FPUs (#FPUs) defines the area (A) of a system. The clock period T = 1/F should be long enough to "cover" A in a single clock.
• The maximum area "reachable" in a single clock period would be
A = π·R², for planar technology (area of a circle); A = (4/3)·π·R³, for 3D technology (volume of a ball/sphere),
where the "reachability" radius R = S·T = S/F and S is the speed of clock-signal propagation in the medium (wires, optics, etc.): S = k·Ċ, with the speed of light Ċ and 0.5 < k < 1.

9/24/2014 © S. Sedukhin, University of Aizu 15 More Performance and Less Power Consumption by Frequency Reduction

• Reducing the frequency F m times, i.e. increasing the "reachability" radius R m times, increases the area A and, therefore, the number of FPUs, by m² for planar technology and by m³ for cubical technology.
• Because each FPU becomes m times "slower", the performance P increases m times for planar technology and m² times for cubical technology.
• The power consumption E of a processor is a function of the frequency F: E = C·V²·F, where C is the capacitance and V the voltage. Hence, reducing the frequency m times proportionally reduces the power.
Reality:
• VLSI technology uses not Euclidean but Manhattan geometry (metric), i.e. the area is not a circle/sphere but a square/cube!
• Historically, VLSI technology has increased the chip area (die size) very slowly (only ×2, from 350 to 700 mm², over the last 25 years), while decreasing the feature size exponentially (×50, from 1000 to 20 nm over the same period)!
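The scaling argument above can be made concrete with a small numerical sketch; the FPU density and base frequency used below are arbitrary illustrative values, not figures from the slides.

```c
/* Sketch of the argument above: reduce the frequency F by m, the reachability
   radius R = k*c/F grows by m, the reachable planar area grows by m^2, so the
   number of FPUs grows by m^2 and peak performance by m (by m^2 in 3D). */
#include <stdio.h>

int main(void) {
    const double c_mm_s = 3.0e11;  /* speed of light, mm/s                  */
    const double k = 0.5;          /* signal-speed fraction (0.5 < k < 1)   */
    const double f0_ghz = 4.0;     /* base clock frequency (illustrative)   */
    const double fpu_density = 1.0;/* FPUs per mm^2 (arbitrary unit)        */

    for (int m = 1; m <= 8; m *= 2) {
        double f = f0_ghz / m;                       /* reduced frequency (GHz) */
        double radius = k * c_mm_s / (f * 1e9);      /* reachable radius R (mm) */
        double area = 3.14159265 * radius * radius;  /* planar reachable area   */
        double fpus = fpu_density * area;            /* grows as m^2            */
        double gflops = fpus * 2.0 * f;              /* peak = #FPUs*2FLOPs*F   */
        printf("m=%d  F=%.2f GHz  R=%.1f mm  #FPUs~%.2e  peak~%.2e GFLOPS\n",
               m, f, radius, fpus, gflops);
    }
    return 0;
}
```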

9/24/2014 © S. Sedukhin, University of Aizu 16 The Light Speed Barrier limits the size of a synchronous computer by the ratio D ≤ S / F, where D is the diameter of the system (mm), S is the speed of clock-signal propagation (mm/s), S = k·Ċ, with the speed of light Ċ ≈ 3·10^11 mm/s and 0.5 < k < 1, and F is the operating frequency in Hz (1/s).

A relativistic theory of computing pays most attention to transferring data to the processor. This is a theory of fast computing, as in relativistic (speed-of-light motion) mechanics.

[Figure: system diameter D against the light-speed and heat barriers, from supercomputer scale down to chip scale. Tianhe-2 footprint: 720 m² ≈ 27 m × 27 m; chip area: 700 mm² ≈ 27 mm × 27 mm; axis: (Die Size)^(1/2).]
The classical theory of computing is the one in which the time of transferring data to the processor is neglected. It is a theory of slow computing, as in classical mechanics, the theory of slow motion.

9/24/2014 © S. Sedukhin, University of Aizu 17 Typical DGEMM Performance on CPUs & GPUs

512 FMAs @0.925 GHz (947 GFLOPS) AMD "Tahiti" HD 7970 384 FMAs @0.880 GHz (676 GFLOPS) AMD "Cayman" HD 6970

512 FMAs @0.650 GHz (666 GFLOPS) NVIDIA “Fermi” Tesla M2090

96 FMAs @1.085 GHz (122 GFLOPS) NVIDIA “Kepler” GTX 670 OC

32 FMAs @2.7 GHz (173 GFLOPS) Intel “Sandy Bridge” Core i7 3960X 32 FMAs @2.6 GHz (166 GFLOPS) AMD “Bulldozer” FX-8150

For today's supercomputers with more than 10^6 FMA units, N_max ≈ 10^7 and N_1/2 ≈ 10^6. Such performance scalability is not acceptable for mobile/embedded applications.

Source: Matsumoto, K.; Nakasato, N.; Sedukhin, S.G., "Performance Tuning of Matrix Multiplication in OpenCL on Different GPUs and CPUs," High Performance Computing, Networking, Storage and Analysis (SCC), 2012 SC Companion, pp. 396-405, 10-16 Nov. 2012

9/24/2014 © S. Sedukhin, University of Aizu 18 Tree-structured Flat Machines: Low Performance Scalability

• Today's scientific-oriented computers are very inertial: initial data are very far from the FPUs
• A pipelined FPU with a few cycles of latency requires a few concurrent threads (no fine-grained implementation)
• The storage hierarchy adds more overhead with each additional level (coarse-grained memory accesses)
• Data in the storage hierarchy are copied at each level: no computing-in-place is possible
• Chips are power limited, and most power is spent on (global) data movement and replication
• Processor/memory/interconnect are scaled differently (progressive computer scaling is achieved by drastically increasing the data or problem size, N_max, for FLOPS_max)

Source: Bill Dally, Chief Scientist & Sr. VP of Research, NVIDIA

9/24/2014 © S. Sedukhin, University of Aizu 19 Target: Mesh/Torus-structured Machines

• Data reuse by local data movement between FPUs (not by using a hierarchical data storage and global multiple data replication)
• Fine-grained data processing (computing and data access/exchange)
• Combination with parallel read-out sensors and stacked memory is possible (computing-in-place)
• Processor/Memory/Interconnect as a unified element of structural scalability which keeps a single image of the system, like a biological cell
• 2D/3D machines for computing on 2D/3D tensor data
• More specialized, but more reactive, computer organization
• Global synchronization by locally coordinated (asynchronous) massive data circulation
[Figure: 2D torus of processing nodes along the i-, j-, k-directions, each node combining a processor (P), local memory (LM), and network links (N).]
9/24/2014 © S. Sedukhin, University of Aizu 20 Relations between ARITHMETIC & ALGEBRA

9/24/2014 © S. Sedukhin, University of Aizu 21 FMA and Algebraic Semiring

FMA: { R, (×, +), 0.0, 1.0 } ⇔ Semiring: { S, ⊗, ⊕, 0, 1 }
• Set of real numbers R ⇔ set of elements S
• Fused arithmetic multiply-and-add (×, +) ⇔ two algebraic operations: multiply ⊗ and add ⊕
• Two constants from R: 0.0 and 1.0 ⇔ two constants from S: 0 and 1
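A minimal C sketch (illustrative, not from the slides) of this FMA/semiring correspondence: the same fused "multiply-add" kernel c ← (a ⊗ b) ⊕ c instantiated once with the ordinary (×, +) algebra and once with the (min, +) shortest-path algebra. The struct and function names are assumptions for illustration.

```c
/* Sketch: a scalar semiring { S, ⊕, ⊗, 0, 1 } as a C structure, so that one
   fused kernel c <- (a ⊗ b) ⊕ c can be instantiated with different algebras. */
#include <stdio.h>
#include <math.h>

typedef struct {
    double (*mul)(double, double);  /* ⊗ */
    double (*add)(double, double);  /* ⊕ */
    double zero;                    /* identity of ⊕ */
    double one;                     /* identity of ⊗ */
} semiring;

static double f_mul(double a, double b) { return a * b; }
static double f_add(double a, double b) { return a + b; }
static double f_min(double a, double b) { return a < b ? a : b; }

/* the generic scalar FMA: c <- (a ⊗ b) ⊕ c */
static double sr_fma(const semiring *s, double a, double b, double c) {
    return s->add(s->mul(a, b), c);
}

int main(void) {
    semiring plus_times = { f_mul, f_add, 0.0, 1.0 };       /* ordinary FMA  */
    semiring min_plus   = { f_add, f_min, INFINITY, 0.0 };  /* shortest path */
    printf("%g\n", sr_fma(&plus_times, 3.0, 4.0, 5.0));  /* 3*4+5 = 17       */
    printf("%g\n", sr_fma(&min_plus,   3.0, 4.0, 5.0));  /* min(3+4, 5) = 5  */
    return 0;
}
```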

9/24/2014 © S. Sedukhin, University of Aizu 22 Algebraic Path Problem

• Problems from different disciplines represented as a single algorithmic scheme (rich in FMA operations):
– Linear Algebra: computing the inverse of a matrix
– Graph and Network Problems: transitive & reflexive closure and transitive reduction; shortest-distance problems (distance functions); capacity problems (max flow, network capacity, tunnel problem); connectivity measures for reliability networks; stochastic communication network problems
– Regular Language Problems: correspondence between regular expressions and finite state automata
• Unification based on the theory of algebraic semirings – different semirings for different applications
• Solution as a case of a matrix closure problem

9/24/2014 © S. Sedukhin, University of Aizu 23 The Algebraic Path Problem Definition

Let G = (V, E, w) be a weighted graph, where V = {0, 1, …, n−1} is a set of n vertices, E = V × V is the set of edges, and w: E → S is an edge-weighting function whose values are taken from the set S.
The weighting function belongs to the so-called path algebra, or algebraic semiring, (S, ⊕, ⊗, ∗, 0, 1).

9/24/2014 © S. Sedukhin, University of Aizu 24 Scalar Semiring

• A closed semiring (S, ⊕, ⊗, ∗, 0, 1) is an algebraic structure defined by:
– a set of scalar elements S;
– two binary operations: addition ⊕ : S × S → S and multiplication ⊗ : S × S → S;
– a unary operation called closure ∗ : S → S;
– two constants 0 and 1 in S, where 0 and 1 are the neutral elements for ⊕ and ⊗, respectively.

9/24/2014 © S. Sedukhin, University of Aizu 25 Different Semirings and Associated Problems

S | ⊕ | ⊗ | a* | 0 | 1 | Basic Problem | Core Algorithm
{0,1} | ∨ | ∧ | a* = 1 | 0 | 1 | Transitive Closure | Warshall
ℜ | + | × | a* = (1−a)⁻¹ | 0 | 1 | Matrix Inversion | Gauss-Jordan
ℜ+ ∪ {+∞} | min | + | a* = 0 | +∞ | 0 | All-Pairs Shortest Paths | Floyd
ℜ+ ∪ {−∞} | max | + | a* = 0 | −∞ | 0 | Maximum Cost (Critical Path) | ?
ℜ+ ∪ {+∞} | max | min | a* = ∞ | 0 | +∞ | Maximum Capacity Paths | ?
ℜ[0,1] | max | × | a* = 1 | 0 | 1 | Maximum Reliability Paths | New
ℜ[0,1] | min | × | a* = 1 | 0 | 1 | Minimum Reliability Paths | ?
ℜ+ ∪ {+∞} | min | max | a* = 0 | +∞ | 0 | Minimum Spanning Tree | Maggs-Plotkin

Given an n×n matrix A, the distance/cost matrix of a weighted n-vertex graph, the APP is to find the closure A* of the matrix A in different algebraic semirings:
A* = Ī_n ⊕ A ⊕ A² ⊕ A³ ⊕ … = Ī ⊕ (A ⊗ A*)
Solution by the unified Gauss-Jordan/Warshall/Floyd (GJWF) algorithm.
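For the (min, +) semiring this closure is exactly the Floyd-Warshall all-pairs shortest-paths algorithm; a minimal C sketch (with a small hard-coded example graph, not from the slides) follows.

```c
/* Sketch: the unified GJWF closure A* computed in the (min, +) semiring,
   i.e. the all-pairs shortest paths problem (Floyd-Warshall). */
#include <stdio.h>

#define N 4
#define INF 1e9

int main(void) {
    /* distance matrix of a 4-vertex graph (INF = no edge, 0 on the diagonal) */
    double a[N][N] = {
        { 0,   3,   INF, 7   },
        { 8,   0,   2,   INF },
        { 5,   INF, 0,   1   },
        { 2,   INF, INF, 0   },
    };
    /* closure: a(i,j) <- a(i,j) ⊕ (a(i,k) ⊗ a(k,j)) = min(a(i,j), a(i,k)+a(k,j)) */
    for (int k = 0; k < N; k++)
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                if (a[i][k] + a[k][j] < a[i][j])
                    a[i][j] = a[i][k] + a[k][j];

    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) printf("%6.0f", a[i][j]);
        printf("\n");
    }
    return 0;
}
```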

9/24/2014 © S. Sedukhin, University of Aizu 26 Matrix Semiring

• (S^{b×b}, ⊕, ⊗, ∗, 0, I) is an algebraic structure defined by:
– the set of b×b matrices S^{b×b} over a closed scalar semiring (S, ⊕, ⊗, ∗, 0, 1);
– two binary operations: matrix addition ⊕ : S^{b×b} × S^{b×b} → S^{b×b} and matrix multiplication ⊗ : S^{b×b} × S^{b×b} → S^{b×b};
– a unary operation, the closure ∗ of a matrix: S^{b×b} → S^{b×b};
– two b×b matrices of constants in S^{b×b}: 0, where all elements are equal to 0 (zero matrix), and I, where all diagonal elements are equal to 1 (identity matrix).

9/24/2014 © S. Sedukhin, University of Aizu 27 Scalar FMA Operaon ⇔ Matrix FMA Operaon Arithmec: Matrix Algebra: Scalar Fused Mulply-Add Matrix Fused Mulply-Add

• • (×, +)-algebra: (×, +)-algebra: n−1 c ← c + c ⋅c MMA c ← c + a ⋅b ij ij ∑k=0 ik kj

• (+, min)-algebra: • (+, min)-algebra: n−1 cij ← min{cij ,min(cik + ckj )} SPP c ← min(c,a + b) k=0

• (+, max)-algebra: • (+,max)-algebra: n−1 c ← max(c,a + b) cij ← max{cij ,max(cik + ckj )} CRP k =0

• (min, max)-algebra: • (min, max)-algebra: n−1 cij ← max[cij ,max{min(cik ,ckj )}] MCP c ← min{c,max(a,b)} k =0

• (×, max)-algebra: • (×, max)-algebra: n−1 c ← max(c,a×b) cij ← max{cij ,max(cik ×ckj )} MRP k=0

• (max, min)-algebra: • (max, min)-algebra: n−1 c ← min{c,max(a,b)} cij ← min[cij ,min{max(cik ,ckj )}] MST k=0 MMA– Matrix Mulply-Add CRP– Crical Path Problem MRP– Max. Reliable Paths SPP– Shortest Paths Problem MCP– Max. Capacity Paths MST– Min. Spanning Tree 9/24/2014 © S. Sedukhin, University of Aizu 28 Accelerators: Scalar FMA Unit ⇔ Matrix FMA Unit

Arithmetic: Scalar Fused Multiply-Update Unit ⇔ Matrix Algebra: Matrix FMA Array Processor
No need to understand how the FMA unit is internally constructed! No need to understand how the "Big Multiplier" is internally constructed!
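To connect the two sides of this slide, here is a hedged C sketch (the functional behavior only, not the array-processor hardware) of a matrix fused multiply-add C ← C ⊕ (A ⊗ B) parameterized by the semiring operations listed on the previous slide. The function names are assumptions for illustration.

```c
/* Sketch: one matrix fused multiply-add kernel C <- C ⊕ (A ⊗ B), parameterized
   by the semiring's ⊗ and ⊕, covering the MMA/SPP/... rows of the slide above. */
#include <stdio.h>
#include <math.h>

#define N 3
typedef double (*op)(double, double);

static double f_add(double x, double y) { return x + y; }
static double f_mul(double x, double y) { return x * y; }
static double f_min(double x, double y) { return x < y ? x : y; }

/* c(i,j) <- c(i,j) ⊕ ( ⊕_k a(i,k) ⊗ b(k,j) ) */
static void matrix_fma(op mul, op add, double a[N][N], double b[N][N], double c[N][N]) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            for (int k = 0; k < N; k++)
                c[i][j] = add(c[i][j], mul(a[i][k], b[k][j]));
}

int main(void) {
    double a[N][N] = {{1,2,3},{4,5,6},{7,8,9}};
    double b[N][N] = {{9,8,7},{6,5,4},{3,2,1}};
    double c[N][N] = {{0}}, d[N][N];
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) d[i][j] = INFINITY;  /* identity of min */

    matrix_fma(f_mul, f_add, a, b, c);  /* ordinary GEMM:  (×, +)-algebra   */
    matrix_fma(f_add, f_min, a, b, d);  /* shortest paths: (+, min)-algebra */
    printf("GEMM c[0][0] = %g, SPP d[0][0] = %g\n", c[0][0], d[0][0]);
    return 0;
}
```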

9/24/2014 © S. Sedukhin, University of Aizu 29 XGEMM-based Algorithms/Architecture

• GEMM in Different Algebras (XGEMM), streaming 2D data from a sensor array or stacked memory:
C ← A ⊗ B ⊕ C
C ← Aᵀ ⊗ B ⊕ C
C ← A ⊗ Bᵀ ⊕ C
• Chaining Matrix Products:
D ← Aᵀ ⊗ B ⊗ C
D ← A ⊗ B ⊗ Cᵀ
• Focal-Plane I/O for Streaming Data
• Computing-near-Data

9/24/2014 © S. Sedukhin, University of Aizu 30 Algorithmically and Technologically EXTREMELY-SCALABLE GENERAL MATRIX MULTIPLY-ADD ALGORITHM

9/24/2014 © S. Sedukhin, University of Aizu 31 Matrix-by-Matrix Multiply-Add (MMA)

C ← A × B + C, where A, B, and C are dense (n × n) matrices:
c^(k)(i, j) ← a(i, k)·b(k, j) + c^(k−1)(i, j), accumulated over k = 0, 1, …, n−1, for 0 ≤ i, j < n.
• The computational index space is a bounded grid of n × n × n index points: ℑ = {(i, j, k)ᵀ : 0 ≤ i, j, k < n} ⊆ Z³, where in each index point p = (i, j, k)ᵀ ∈ ℑ we have to update c(i, j) by implementing the scalar multiply-add operation c(i, j) ← a(i, k)·b(k, j) + c(i, j), i.e. all three scalars a(i, k), b(k, j), and c(i, j) should be available in this index point before computing is started.
• Because there are only 3n² input matrix data, but n³ index points, no more than n² data-independent index points can be activated (i.e., no more than n² multiply-add operations per time-step) if no data replication is considered. Example: for p = (2, 1, 2)ᵀ ∈ ℑ, c^(2)(2, 1) ← a(2, 2)·b(2, 1) + c^(1)(2, 1), i.e. C^(2)[2,1] ← A[2,2] × B[2,1] + C^(1)[2,1].
• Hence, n is the minimal number of "read - compute - write" steps to implement a matrix-by-matrix multiply-add.

9/24/2014 © S. Sedukhin, University of Aizu 32 Time-Space Scheduling of MMA

A time-scheduling function: step(p): Z³ → Z, for all p = (i, j, k)ᵀ ∈ ℑ
– linear or modular form: step(p) = αᵀ·p or step(p) = (αᵀ·p) mod n, where α = (α_i, α_j, α_k)ᵀ is a time-scheduling vector.
A space-scheduling function: allocation(p): Z³ → Z², for all p = (i, j, k)ᵀ ∈ ℑ
– linear projection method: allocation(p) = S × p, where S is a 2×3 space transformation matrix corresponding to the projection vector η, such that S × η = Ō and αᵀ·η ≠ 0.

9/24/2014 © S. Sedukhin, University of Aizu 33 ① Broadcast-Compute-Shift (BCS) Scheduling

Grid-based scheduling. The time-scheduling function: step(p) = αᵀ·p, where α = (α_i, α_j, α_k)ᵀ = (0, 0, 1)ᵀ, i.e. step(p) = k, for p = (i, j, k)ᵀ ∈ ℑ = {(i, j, k)ᵀ : 0 ≤ i, j, k < n} ⊆ Z³.
On each step(p) = s ∈ [0, n), broadcasting of the n-element column a_s ∈ A and the n-element row b_sᵀ ∈ B is required to update the n²-element matrix C^(s). Hence, BCS scheduling corresponds to implementing, on each time-step s = 0, 1, …, n−1, the rank-1 update C^(s+1) ← C^(s) + a_s ⊗ b_sᵀ.
All initial/intermediate/final data reside inside the index space ℑ. The total number of "broadcast-compute-shift" steps is n.
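A minimal serial C sketch of BCS scheduling (illustrative only): n time-steps, each performing one rank-1 update of C from a broadcast column of A and row of B.

```c
/* Sketch of BCS scheduling: n "broadcast-compute" steps, each a rank-1 update
   C <- C + a_s * b_s^T with the s-th column of A and s-th row of B. */
#include <stdio.h>
#define N 4

int main(void) {
    double a[N][N], b[N][N], c[N][N] = {{0}};
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) { a[i][j] = i + j; b[i][j] = i - j; }

    for (int s = 0; s < N; s++)            /* time-step s corresponds to k  */
        for (int i = 0; i < N; i++)        /* column a_s = A(:,s) broadcast */
            for (int j = 0; j < N; j++)    /* row    b_s = B(s,:) broadcast */
                c[i][j] += a[i][s] * b[s][j];  /* rank-1 update of C        */

    printf("c[1][2] = %g\n", c[1][2]);  /* same result as ordinary GEMM */
    return 0;
}
```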

9/24/2014 © S. Sedukhin, University of Aizu 34 ② All-Shift-Compute (ASC) Scheduling

Systolic or mesh scheduling. The time-scheduling function: step(p) = αᵀ·p, where α = (α_i, α_j, α_k)ᵀ = (1, 1, 1)ᵀ, i.e. step(p) = i + j + k, for p = (i, j, k)ᵀ ∈ ℑ = {(i, j, k)ᵀ : 0 ≤ i, j, k < n} ⊆ Z³.
It is a "broadcast-to-pipeline" or "time-multiplexing" version of BCS scheduling. All initial data are aligned and located on a hyper-plane outside of the index space ℑ; the same holds for the final data.
The total number of "all-shift-compute" steps is 3n−2.

9/24/2014 © S. Sedukhin, University of Aizu 35 ③ Broadcast-Compute-Roll (BCR) Scheduling

Cylindrical scheduling. The time-scheduling function: step(p) = (αᵀ·p) mod n, where α = (α_i, α_j, α_k)ᵀ = (−1, 0, 1)ᵀ, i.e. step(p) = (k − i) mod n, for p = (i, j, k)ᵀ ∈ ℑ ⊆ Z³.
This scheduling requires, on each step(p) = s ∈ [0, n), the s-th diagonal n-vector a_s ∈ A to be broadcast (α_j = 0) along the j-axis to compute (update) the matrix C; then matrices B and C are rolled opposite the i-axis or orbit (α_i = −1) and along the k-orbit (α_k = 1).
All initial/intermediate/final data reside inside the index space ℑ.
The total number of "broadcast-compute-roll" steps is n.

9/24/2014 © S. Sedukhin, University of Aizu 36 ④ Compute-Roll-All (CRA) Scheduling

Orbital or toroidal scheduling. The modular time-scheduling function: step(p) = (αᵀ·p) mod n, where α = (α_i, α_j, α_k)ᵀ = (±1, ±1, ±1)ᵀ, i.e. step(p) = (±i ± j ± k) mod n, for p = (i, j, k)ᵀ ∈ ℑ = {(i, j, k)ᵀ : 0 ≤ i, j, k < n} ⊆ Z³_n. The computational index space ℑ is a 3D torus.
At each step(p) = s ∈ [0, n):
computing: c(i, j) ← c(i, j) + a(i, k)·b(k, j);
roll-all: a(i, k), b(k, j), and c(i, j) are rolled along the ±j-, ±i-, and ±k-orbits, respectively.
All initial/intermediate/final data reside inside the index space ℑ. The total number of "compute-roll-all" steps is n.
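The C-stationary 2D projection of this compute-roll-all schedule is Cannon's algorithm (see the following slides); the C sketch below simulates it serially, with an explicit initial skew and per-step rolls of A and B. It is an illustration of the data-movement pattern, not an implementation of the 3D torus machine.

```c
/* Sketch of the C-stationary 2D projection of CRA scheduling (Cannon's
   algorithm): after an initial skew, each of the n steps does a local
   multiply-add and then rolls A along its rows and B along its columns. */
#include <stdio.h>
#define N 4

static void roll_rows_left(double m[N][N]) {      /* A: roll along the j-orbit */
    for (int i = 0; i < N; i++) {
        double t = m[i][0];
        for (int j = 0; j < N - 1; j++) m[i][j] = m[i][j + 1];
        m[i][N - 1] = t;
    }
}
static void roll_cols_up(double m[N][N]) {        /* B: roll along the i-orbit */
    for (int j = 0; j < N; j++) {
        double t = m[0][j];
        for (int i = 0; i < N - 1; i++) m[i][j] = m[i + 1][j];
        m[N - 1][j] = t;
    }
}

int main(void) {
    double a[N][N], b[N][N], c[N][N] = {{0}}, as[N][N], bs[N][N];
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) { a[i][j] = i * N + j; b[i][j] = (i == j); }

    /* initial skew: row i of A rolled left by i, column j of B rolled up by j */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            as[i][j] = a[i][(j + i) % N];
            bs[i][j] = b[(i + j) % N][j];
        }
    for (int s = 0; s < N; s++) {                 /* n compute-roll-all steps */
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                c[i][j] += as[i][j] * bs[i][j];   /* local multiply-add       */
        roll_rows_left(as);                       /* roll A along the j-orbit */
        roll_cols_up(bs);                         /* roll B along the i-orbit */
    }
    printf("c[1][2] = %g (A*I, expect %g)\n", c[1][2], a[1][2]);
    return 0;
}
```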

9/24/2014 © S. Sedukhin, University of Aizu 37 3D Index-Space → 2D Processor-Space Allocation

[Figure: the 3D index space (i, j, k) with the modular schedule step(p) = (−i − j + k) mod n, shown for step(p) = 0 and n = 4, projected onto a 2D processor space along the k-, j-, and i-directions, giving C-stationary, A-stationary, and B-stationary canonical data layouts, respectively; the C-stationary variant is Cannon's algorithm (1969).]
9/24/2014 © S. Sedukhin, University of Aizu 38 Comparison of Different MMA Implementations

Scheduling: ① BCS | ② ASC | ③ BCR | ④ CRA
# time-steps: n | 3n−2 | n | n
Index Space: 3D Grid | 3D Mesh | 3D Cylinder | 3D Torus
I/O Data Location: Inside | Outside | Inside | Inside
Data reuse/step: 2n | 1…3n(n+1)/2…1 | n+n² | 2n²
Data Movement (Global/Local): Bcast/Shift | All Shift | Bcast/Roll | All Roll
Computing-in-Place: Possible | Impossible | Possible | Possible
3D → 2D Projection: 2D Grid | 2D Mesh | 2D Torus | 2D Torus
Representative algorithms: Agarwal et al. (1994), van de Geijn (1995) | Kung, Leiserson (1979) | Fox, Otto, Hey (1987) | Cannon (1969)
Scalability: Bad | Good | Bad | Good
Final Selection: No | No | No | Yes

9/24/2014 © S. Sedukhin, University of Aizu 39 Three Forms of MMA

• Any time-step scheduling function defines only the matrix data distribution among the active index points; it does not specify how the computing is actually performed.
• By using the same time-scheduling function we can implement any of the three accumulations:
(k): C ← A·B + C ⇔ c(i, j) ← c(i, j) + Σ_{k=0}^{n−1} a(i, k)·b(k, j);
(i): B ← Aᵀ·C + B ⇔ b(k, j) ← b(k, j) + Σ_{i=0}^{n−1} a(i, k)·c(i, j);
(j): A ← C·Bᵀ + A ⇔ a(i, k) ← a(i, k) + Σ_{j=0}^{n−1} c(i, j)·b(k, j).
Accumulation along k, i, or j gives the NN-, TN-, and NT-forms of GEMM.
9/24/2014 © S. Sedukhin, University of Aizu 40 Chaining of Matrix Products

• Because in the orbital scheduling the resulting data properly resides inside the index space, by using the three forms of accumulation (k), (i), and (j) we can efficiently implement MMA chaining (see the sketch below).
• Chaining of two MMAs:
(j, k): E ← (C·Bᵀ + A)·D + E;
(i, k): E ← D·(Aᵀ·C + B) + E;
(k, j): E ← (A·B + C)·Dᵀ + E;
(i, j): E ← Dᵀ·(Aᵀ·C + B) + E;
(j, i): E ← (C·Bᵀ + A)ᵀ·D + E;
(k, i): E ← Dᵀ·(A·B + C) + E.
• 2n time-steps are needed to complete such a chaining.
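A small C sketch (illustrative, with hypothetical helper names nn and nt) of chaining two of these accumulation forms to evaluate the (j, k) case E ← (C·Bᵀ + A)·D + E as two back-to-back matrix multiply-adds.

```c
/* Sketch: the (k)- and (j)-accumulation forms of GEMM from the previous slide,
   chained to compute the (j,k) case  E <- (C*B^T + A)*D + E. */
#include <stdio.h>
#define N 3
typedef double mat[N][N];

/* (k)-form: C <- A*B + C */
static void nn(mat a, mat b, mat c) {
    for (int i = 0; i < N; i++) for (int j = 0; j < N; j++)
        for (int k = 0; k < N; k++) c[i][j] += a[i][k] * b[k][j];
}
/* (j)-form: A <- C*B^T + A (accumulation along j) */
static void nt(mat c, mat b, mat a) {
    for (int i = 0; i < N; i++) for (int k = 0; k < N; k++)
        for (int j = 0; j < N; j++) a[i][k] += c[i][j] * b[k][j];
}

int main(void) {
    mat A = {{1,0,0},{0,1,0},{0,0,1}}, B = {{1,2,3},{4,5,6},{7,8,9}};
    mat C = {{1,1,1},{1,1,1},{1,1,1}}, D = {{2,0,0},{0,2,0},{0,0,2}};
    mat E = {{0}};
    nt(C, B, A);   /* A <- C*B^T + A        (first accumulation, along j)  */
    nn(A, D, E);   /* E <- A*D + E = (C*B^T + A)*D + E (second, along k)   */
    printf("E[0][0] = %g\n", E[0][0]);
    return 0;
}
```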

9/24/2014 © S. Sedukhin, University of Aizu 41 GEMM-BASED ORBITAL ALGORITHMS

9/24/2014 © S. Sedukhin, University of Aizu 42 Systolic → Orbital Rescheduling

• Linear Algebra: matrix-by-matrix multiplication; solution of linear systems (matrix inversion); LU, QR, SVD decompositions; ...
• Digital Signal Processing: FIR, IIR, 1D/2D convolution; 2D DFT, DCT, DHT, ...; dynamic scene analysis; image resampling; interpolation; 1D/2D median filtering; geometric warping; …
• Non-numerical applications: data structures (stack/queue, searching, priority queue, sorting); graph algorithms (transitive closure, minimum spanning tree, connected components, ...); language recognition (string matching, regular expressions); dynamic programming; relational database operations; encoders (polynomial division); ...

9/24/2014 © S. Sedukhin, University of Aizu 43 9/24/2014 © S. Sedukhin, University of Aizu 44 2-Dimensional Separable Transforms

• Let X = [x(i, k)] be an n×n signal or image. A 2D forward transform of X is defined as
X̂(u, v) = Σ_{i=0}^{n−1} Σ_{k=0}^{n−1} c(i, u)·x(i, k)·c(k, v), or, in matrix form, X̂ = Cᵀ × X × C.
• A 2D inverse transform: X = C × X̂ × Cᵀ.
• The transform coefficient matrix C can be:
– symmetric (C = Cᵀ) and unitary (C⁻¹ = C*ᵀ), as for the DFT or DHT;
– unitary and real, as in the DCT;
– containing only ±1 and both symmetric and orthogonal, as in the DWHT.
• Both matrix products can be evaluated with the chaining of matrix products listed on the previous slide (the (j,k), (i,k), (k,j), (i,j), (j,i), and (k,i) forms).
• In total, 2n time-steps are needed to implement any 2D transform.
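A minimal C sketch of a 2D separable transform as two chained matrix products; it uses the 4×4 normalized Walsh-Hadamard matrix (symmetric and orthogonal, one of the cases listed above) as the coefficient matrix C.

```c
/* Sketch: a 2D separable transform computed as two chained matrix products,
   X_hat = C^T * X * C, with the 4x4 normalized Walsh-Hadamard matrix; since
   it is symmetric and orthogonal, X = C * X_hat * C^T recovers X. */
#include <stdio.h>
#define N 4

static void mm(double a[N][N], double b[N][N], double c[N][N]) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            c[i][j] = 0.0;
            for (int k = 0; k < N; k++) c[i][j] += a[i][k] * b[k][j];
        }
}

int main(void) {
    double h = 0.5;  /* 1/sqrt(4): makes the Hadamard matrix orthogonal */
    double C[N][N] = { { h, h, h, h}, { h,-h, h,-h},
                       { h, h,-h,-h}, { h,-h,-h, h} };
    double X[N][N], T[N][N], Xhat[N][N];
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) X[i][j] = i * N + j;

    /* forward: X_hat = C^T * X * C  (C is symmetric, so C^T = C) */
    mm(C, X, T);
    mm(T, C, Xhat);
    printf("Xhat[0][0] = %g (sum of X / 4 = %g)\n", Xhat[0][0], 120.0 / 4.0);
    return 0;
}
```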

9/24/2014 © S. Sedukhin, University of Aizu 45-48 [content of slides 46-49 not extracted]
9/24/2014 © S. Sedukhin, University of Aizu 49 Data Manipulation by Matrix-Matrix Multiply-Add

A generic form of the MMA operation: D ← MMA[⊗, ⊕](A, B, C) : D ← A ⊗ B ⊕ C
Row/column interchange: D(n,n) ← MMA[×, +](P(n,n), A(n,n), zero(n,n)), where P(n,n) is an (i, j)-permutation matrix
Rows/columns rotation
Scalar data replication (broadcast)
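A tiny C sketch (illustrative only) of the row-interchange case: D ← MMA[×, +](P, A, zero) with a permutation matrix P.

```c
/* Sketch: row interchange as a matrix multiply-add,
   D <- MMA[x,+](P, A, zero) = P*A, with P an (i,j)-permutation matrix. */
#include <stdio.h>
#define N 3

int main(void) {
    /* P swaps rows 0 and 2 */
    double P[N][N] = { {0,0,1}, {0,1,0}, {1,0,0} };
    double A[N][N] = { {1,2,3}, {4,5,6}, {7,8,9} };
    double D[N][N] = {{0}};                      /* zero(n,n): the additive C */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            for (int k = 0; k < N; k++)
                D[i][j] += P[i][k] * A[k][j];    /* D = P*A + 0 */
    printf("D row 0: %g %g %g\n", D[0][0], D[0][1], D[0][2]); /* 7 8 9 */
    return 0;
}
```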

9/24/2014 © S. Sedukhin, University of Aizu 50 Global Reduction and Broadcast

Generic form: B(n,n) ← ones(n,n) ⊗ A(n,n) ⊗ ones(n,n)
For example, summation and broadcast:
• C(n,n) ← MMA[×, +](ones(n,n), A(n,n), zero(n,n))
• D(n,n) ← MMA[×, +](C(n,n), ones(n,n), zero(n,n))
Maximum and broadcast:
• C(n,n) ← MMA[×, max](ones(n,n), A(n,n), -inf(n,n))
• C(n,n) ← MMA[×, max](C(n,n), ones(n,n), -inf(n,n))
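A minimal C sketch of the summation-and-broadcast pattern above: two ordinary multiply-add calls with the ones(n,n) matrix leave the global sum of A in every element of D.

```c
/* Sketch: global summation-and-broadcast with two MMA calls on ones(n,n):
   C <- ones*A (column sums in every row), D <- C*ones (global sum everywhere). */
#include <stdio.h>
#define N 3

static void mma(double a[N][N], double b[N][N], double c[N][N]) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            for (int k = 0; k < N; k++)
                c[i][j] += a[i][k] * b[k][j];  /* c <- a*b + c */
}

int main(void) {
    double ones[N][N], A[N][N], C[N][N] = {{0}}, D[N][N] = {{0}};
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) { ones[i][j] = 1.0; A[i][j] = i * N + j + 1; }

    mma(ones, A, C);   /* C(n,n) <- MMA[x,+](ones, A, zero) */
    mma(C, ones, D);   /* D(n,n) <- MMA[x,+](C, ones, zero) */
    printf("D[2][1] = %g (sum of A = 45)\n", D[2][1]);
    return 0;
}
```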

9/24/2014 © S. Sedukhin, University of Aizu 51 CONCLUSION IN ONE SLIDE

9/24/2014 © S. Sedukhin, University of Aizu 52 Matrix Array Processor
Computing: 2D/3D Toroidal Array Processor for Computing-near-Data
• Basic operation: matrix multiply-add
• 3-D toroidally connected banks of memory with attached scalar FMA units
• Using slower and simpler cores
• Multidimensional I/O
• Keeping integrity of data
• Highly scalable computing/memory/interconnect fabric
Accelerated Functions
• Matrix Mathematics
– GEMM (BLAS Level 3): C = C + A×B; C = C + Aᵀ×B; C = C + A×Bᵀ
– Linear Algebra
• Graph (Path) Algorithms: GEMM in different algebras
– Transitive Closure
– All-Pairs Shortest/Longest Paths
– Critical Path
– Maximum Capacity Path
– Most Reliable Path
– Minimum Spanning Tree
• Multidimensional Linear Transforms: 2D/3D forward/inverse separable transforms
– DFT
– DCT: Y = (Cᵀ×(X×C))×C
– DHT: X = (C×(Y×Cᵀ))×Cᵀ
– DWH
– DST
• Data Manipulation
– Rotation, Permutation, Transposition, Copy, Replication, Reduction, Broadcast
Target Applications
• Medical Imaging / Visualization
• Radar Systems
• Sonar Systems
• Defense and Security IT
• Surveillance
• Wireless Communications
• Network Processing
• Voice and Pattern Recognition
• Computational Chemistry
• Climate Modeling
• Data Mining and Analysis
• Game Physics / Physics Simulation
• Life Sciences & Biotechnology: computational chemistry and biology, in silico drug discovery, gene sequencing, pharmacogenomics, protein folding, molecular dynamics, personalized medicine, genomics, proteomics, metabolomics, simulation of biological systems
• Geophysical Science: seismic data processing, petroleum reservoir modeling
• Financial Analysis and Modeling
• …

9/24/2014 © S. Sedukhin, University of Aizu 53 The Human Brain

• Number of neurons: 10^11
• Number of synapses (adult): 10^14 (2,000-5,000 per neuron)
• Power consumption (adult): 20-40 Watts (0.5-4 nW/neuron)
• Maximum firing frequency of a neuron: 250-2,000 Hz (0.5-4 ms intervals)
• Signal propagation speed inside an axon: 90 m/s sheathed, <0.1 m/s unsheathed
• Processing of complex stimuli: 0.5 s, or 100-1,000 firings
• Normal operating temperature: 37±2°C
• Sleep requirement (adult): average 7.5 hours/day, or 31%
• Atrophy/death of neurons: 50,000 per day (between ages 20 and 75)
• Weight: 1.5 kg (Einstein's brain: 1,230 g)
• Volume: 1130/1260 cm³ (W/M)

9/24/2014 © S. Sedukhin, University of Aizu 54 THANK YOU !

9/24/2014 © S. Sedukhin, University of Aizu 55