Communication-Optimal Parallel 2.5D Matrix Multiplication and LU Factorization Algorithms
Edgar Solomonik and James Demmel
UC Berkeley
September 1st, 2011

Outline

- Introduction
  - Strong scaling
- 2.5D matrix multiplication
  - Strong scaling matrix multiplication
  - Performing faster at scale
- 2.5D LU factorization
  - Communication-optimal LU without pivoting
  - Communication-optimal LU with pivoting
- Conclusion

Solving science problems faster

Parallel computers can solve bigger problems
- weak scaling
Parallel computers can also solve a fixed problem faster
- strong scaling
Obstacles to strong scaling
- may increase relative cost of communication
- may hurt load balance

Achieving strong scaling

How to reduce communication and maintain load balance?
- reduce communication along the critical path
Communicate less
- avoid unnecessary communication
Communicate smarter
- know your network topology

Strong scaling matrix multiplication

[Figure: Matrix multiplication on BG/P (n=65,536). Percentage of machine peak vs. #nodes (256, 512, 1024, 2048) for 2.5D MM and 2D MM.]

Blocking matrix multiplication

[Figure: matrices A and B partitioned into blocks.]

2D matrix multiplication

[Cannon 69], [Van De Geijn and Watts 97]

[Figure: blocks of A and B distributed over 16 CPUs (4x4).]

3D matrix multiplication

[Agarwal et al 95], [Aggarwal, Chandra, and Snir 90], [Berntsen 89]

[Figure: blocks of A and B distributed over 64 CPUs (4x4x4); 4 copies of matrices.]

2.5D matrix multiplication

[Figure: blocks of A and B distributed over 32 CPUs (4x4x2).]
[Figure annotation: 2 copies of matrices.]

2.5D strong scaling

n = dimension, p = #processors, c = #copies of data
- must satisfy $1 \le c \le p^{1/3}$
- special case: $c = 1$ yields 2D algorithm
- special case: $c = p^{1/3}$ yields 3D algorithm

$$\mathrm{cost}(\text{2.5D MM}(p, c)) = O(n^3/p)\ \text{flops} + O\!\left(n^2/\sqrt{c \cdot p}\right)\ \text{words moved} + O\!\left(\sqrt{p/c^3}\right)\ \text{messages}^*$$

$$\mathrm{cost}(\text{2D MM}(p)) = O(n^3/p)\ \text{flops} + O\!\left(n^2/\sqrt{p}\right)\ \text{words moved} + O\!\left(\sqrt{p}\right)\ \text{messages}^* = \mathrm{cost}(\text{2.5D MM}(p, 1))$$

$$\mathrm{cost}(\text{2.5D MM}(c \cdot p, c)) = O(n^3/(c \cdot p))\ \text{flops} + O\!\left(n^2/(c \cdot \sqrt{p})\right)\ \text{words moved} + O\!\left(\sqrt{p}/c\right)\ \text{messages} = \mathrm{cost}(\text{2D MM}(p))/c$$

perfect strong scaling

*ignoring log(p) factors

2.5D MM on 65,536 cores

[Figure: Matrix multiplication on 16,384 nodes of BG/P. Percentage of machine peak vs. n (8192, 131072) for 2.5D MM and 2D MM, using c=16 matrix copies. Plot annotations: 2.7X faster; 12X faster.]
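The three cost expressions on the strong-scaling slides can be checked numerically. The following is a minimal sketch, assuming all constants and log(p) factors are dropped as on the slides; the function names `cost_25d_mm` and `cost_2d_mm` are hypothetical and are not taken from the talk.

```python
import math


def cost_25d_mm(n, p, c):
    """Leading-order cost terms of 2.5D matrix multiplication.

    n = matrix dimension, p = #processors, c = #copies of data.
    Constants and log(p) factors are ignored, as on the slides.
    """
    # The slides require 1 <= c <= p^(1/3); compare c^3 with p to avoid
    # floating-point error in the cube root.
    if c < 1 or c ** 3 > p:
        raise ValueError("need 1 <= c <= p^(1/3)")
    return {
        "flops": n ** 3 / p,
        "words": n ** 2 / math.sqrt(c * p),
        "messages": math.sqrt(p / c ** 3),
    }


def cost_2d_mm(n, p):
    """2D matrix multiplication is the special case c = 1."""
    return cost_25d_mm(n, p, 1)


if __name__ == "__main__":
    n, p, c = 65536, 256, 4
    base = cost_2d_mm(n, p)
    scaled = cost_25d_mm(n, c * p, c)
    for term in base:
        # Each ratio should equal c: perfect strong scaling in this model.
        print(term, base[term] / scaled[term])
```

In this model, using c times as many processors together with c copies of the data divides every term of the 2D cost by c, which is the "perfect strong scaling" statement above. The model says nothing about memory: holding c copies needs c times the storage of the 2D layout in aggregate.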
Cost breakdown of MM on 65,536 cores

[Figure: Matrix multiplication on 16,384 nodes of BG/P. Execution time normalized by 2D, split into communication, idle, and computation, for n=8192 (2D and 2.5D) and n=131072 (2D and 2.5D). Plot annotation: 95% reduction in comm.]

2.5D LU strong scaling (without pivoting)

[Figure: LU without pivoting on BG/P (n=65,536). Percentage of machine peak vs. #nodes (256, 512, 1024, 2048) for ideal scaling, 2.5D LU, and 2D LU.]

2D blocked LU factorization

[Figure sequence: the matrix A; then the blocks L₀₀ and U₀₀; then the panels L and U; then the update S = A - LU.]

2D block-cyclic decomposition

[Figure: matrix blocks assigned to processors in a 2D block-cyclic layout.]

2D block-cyclic LU factorization

[Figure sequence: the block-cyclic matrix; then the panels L and U; then the update S = A - LU.]

2.5D LU factorization

[Figure sequence: steps labeled (A), (B), (C), (D), showing blocks L₀₀, U₀₀, U₀₁, U₀₃, L₁₀, L₂₀, L₃₀ and then the full panels L and U.]

2.5D LU factorization

Look at how this update is distributed. What does it remind you of?
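The blocked LU steps shown on the 2D slides (the L₀₀/U₀₀ blocks, the L and U panels, and the update S = A - LU) can be sketched sequentially. This is only an illustration under stated assumptions: it runs on one processor, uses no pivoting (so every pivot must be nonzero), and the names `blocked_lu`, `split_lu`, and `matmul` are hypothetical. It is not the parallel 2D or 2.5D implementation from the talk.

```python
def blocked_lu(a, b):
    """Right-looking blocked LU without pivoting, block size b.

    Returns a copy of `a` holding U on and above the diagonal and the
    strictly lower part of unit-lower-triangular L below it.
    """
    n = len(a)
    a = [row[:] for row in a]  # work on a copy, leave the input intact
    for k0 in range(0, n, b):
        k1 = min(k0 + b, n)
        # Step 1: factor the diagonal block into L00 and U00.
        for k in range(k0, k1):
            for i in range(k + 1, k1):
                a[i][k] /= a[k][k]
                for j in range(k + 1, k1):
                    a[i][j] -= a[i][k] * a[k][j]
        # Step 2: U panel, solve L00 * U01 = A01 by forward substitution.
        for j in range(k1, n):
            for i in range(k0, k1):
                for t in range(k0, i):
                    a[i][j] -= a[i][t] * a[t][j]
        # Step 3: L panel, solve L10 * U00 = A10.
        for i in range(k1, n):
            for j in range(k0, k1):
                for t in range(k0, j):
                    a[i][j] -= a[i][t] * a[t][j]
                a[i][j] /= a[j][j]
        # Step 4: Schur complement update S = A11 - L10 * U01.
        for i in range(k1, n):
            for j in range(k1, n):
                for t in range(k0, k1):
                    a[i][j] -= a[i][t] * a[t][j]
    return a


def split_lu(packed):
    """Unpack the combined storage into explicit L (unit lower) and U."""
    n = len(packed)
    lower = [[packed[i][j] if j < i else float(i == j) for j in range(n)]
             for i in range(n)]
    upper = [[packed[i][j] if j >= i else 0.0 for j in range(n)]
             for i in range(n)]
    return lower, upper


def matmul(x, y):
    """Plain triple-loop product, used only to check L * U against A."""
    n, m, k = len(x), len(y[0]), len(y)
    return [[sum(x[i][t] * y[t][j] for t in range(k)) for j in range(m)]
            for i in range(n)]


if __name__ == "__main__":
    # Diagonally dominant example, so no pivoting is needed.
    A = [[4.0, 1.0, 2.0, 0.5],
         [1.0, 5.0, 1.0, 2.0],
         [2.0, 1.0, 6.0, 1.0],
         [0.5, 2.0, 1.0, 7.0]]
    L, U = split_lu(blocked_lu(A, 2))
    residual = max(abs(p - q) for rp, rq in zip(matmul(L, U), A)
                   for p, q in zip(rp, rq))
    print("max |LU - A| =", residual)
```

Step 4 is a matrix multiplication on the trailing submatrix, and it is the Schur complement update S = A - LU shown on the 2D slides.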

2.5D LU factorization

Look at how this update is distributed.