Communication-Optimal Parallel 2.5D Matrix Multiplication and LU Factorization Algorithms

Communication-Optimal Parallel 2.5D Matrix Multiplication and LU Factorization Algorithms

Introduction 2.5D matrix multiplication 2.5D LU factorization Conclusion Communication-optimal parallel 2.5D matrix multiplication and LU factorization algorithms Edgar Solomonik and James Demmel UC Berkeley September 1st, 2011 Edgar Solomonik and James Demmel 2.5D algorithms 1/ 36 Introduction 2.5D matrix multiplication 2.5D LU factorization Conclusion Outline Introduction Strong scaling 2.5D matrix multiplication Strong scaling matrix multiplication Performing faster at scale 2.5D LU factorization Communication-optimal LU without pivoting Communication-optimal LU with pivoting Conclusion Edgar Solomonik and James Demmel 2.5D algorithms 2/ 36 Introduction 2.5D matrix multiplication Strong scaling 2.5D LU factorization Conclusion Solving science problems faster Parallel computers can solve bigger problems I weak scaling Parallel computers can also solve a xed problem faster I strong scaling Obstacles to strong scaling I may increase relative cost of communication I may hurt load balance Edgar Solomonik and James Demmel 2.5D algorithms 3/ 36 Introduction 2.5D matrix multiplication Strong scaling 2.5D LU factorization Conclusion Achieving strong scaling How to reduce communication and maintain load balance? I reduce communication along the critical path Communicate less I avoid unnecessary communication Communicate smarter I know your network topology Edgar Solomonik and James Demmel 2.5D algorithms 4/ 36 Introduction 2.5D matrix multiplication Strong scaling matrix multiplication 2.5D LU factorization Performing faster at scale Conclusion Strong scaling matrix multiplication Matrix multiplication on BG/P (n=65,536) 100 2.5D MM 2D MM 80 60 40 20 Percentage of machine peak 0 256 512 1024 2048 #nodes Edgar Solomonik and James Demmel 2.5D algorithms 5/ 36 Introduction 2.5D matrix multiplication Strong scaling matrix multiplication 2.5D LU factorization Performing faster at scale Conclusion Blocking matrix multiplication B A B A B A B A Edgar Solomonik and James Demmel 2.5D algorithms 6/ 36 Introduction 2.5D matrix multiplication Strong scaling matrix multiplication 2.5D LU factorization Performing faster at scale Conclusion 2D matrix multiplication [Cannon 69], [Van De Geijn and Watts 97] B A 16 CPUs (4x4) B CPU CPU CPU CPU CPU CPU CPU CPU A CPU CPU CPU CPU CPU CPU CPU CPU B A B A Edgar Solomonik and James Demmel 2.5D algorithms 7/ 36 Introduction 2.5D matrix multiplication Strong scaling matrix multiplication 2.5D LU factorization Performing faster at scale Conclusion 3D matrix multiplication [Agarwal et al 95], [Aggarwal, Chandra, and Snir 90], [Bernsten 89] 64 CPUs (4x4x4) CPU CPU CPU CPU CPU CPU CPU CPU CPU CPU CPU CPU B CPU CPU CPU CPU A CPU CPU CPU CPU CPU CPU CPU CPU B CPU CPU CPU CPU CPU CPU CPU CPU A CPU CPU CPU CPU B CPU CPU CPU CPU CPU CPU CPU CPU CPU CPU CPU CPU A B CPU CPU CPU CPU CPU CPU CPU CPU CPU CPU CPU CPU CPU CPU CPU CPU A 4 copies of matrices Edgar Solomonik and James Demmel 2.5D algorithms 8/ 36 Introduction 2.5D matrix multiplication Strong scaling matrix multiplication 2.5D LU factorization Performing faster at scale Conclusion 2.5D matrix multiplication B 32 CPUs (4x4x2) A CPU CPU CPU CPU CPU CPU CPU CPU B CPU CPU CPU CPU CPU CPU CPU CPU A CPU CPU CPU CPU B CPU CPU CPU CPU CPU CPU CPU CPU CPU CPU CPU CPU A 2 copies of matrices B A Edgar Solomonik and James Demmel 2.5D algorithms 9/ 36 Introduction 2.5D matrix multiplication Strong scaling matrix multiplication 2.5D LU factorization Performing faster at scale Conclusion 2.5D strong scaling n = dimension, p = #processors, c = #copies of data I must satisfy 1 c p1=3 I special case: c = 1 yields 2D algorithm I special case: c = p1=3 yields 3D algorithm cost(2.5D MM(p; c)) = O(n3=p) ops p + O(n2= c ¡ p) words moved q + O( p=c3) messages£ *ignoring log(p) factors Edgar Solomonik and James Demmel 2.5D algorithms 10/ 36 Introduction 2.5D matrix multiplication Strong scaling matrix multiplication 2.5D LU factorization Performing faster at scale Conclusion 2.5D strong scaling n = dimension, p = #processors, c = #copies of data I must satisfy 1 c p1=3 I special case: c = 1 yields 2D algorithm I special case: c = p1=3 yields 3D algorithm cost(2D MM(p)) = O(n3=p) ops p + O(n2= p) words moved p + O( p) messages£ = cost(2.5D MM(p; 1)) *ignoring log(p) factors Edgar Solomonik and James Demmel 2.5D algorithms 11/ 36 Introduction 2.5D matrix multiplication Strong scaling matrix multiplication 2.5D LU factorization Performing faster at scale Conclusion 2.5D strong scaling n = dimension, p = #processors, c = #copies of data I must satisfy 1 c p1=3 I special case: c = 1 yields 2D algorithm I special case: c = p1=3 yields 3D algorithm cost(2.5D MM(c ¡ p; c)) = O(n3=(c ¡ p)) ops p + O(n2=(c ¡ p)) words moved p + O( p=c) messages = cost(2D MM(p))=c perfect strong scaling Edgar Solomonik and James Demmel 2.5D algorithms 12/ 36 Introduction 2.5D matrix multiplication Strong scaling matrix multiplication 2.5D LU factorization Performing faster at scale Conclusion 2.5D MM on 65,536 cores Matrix multiplication on 16,384 nodes of BG/P 100 2.5D MM 2D MM 80 2.7X faster 60 Using c=16 matrix copies 40 12X faster 20 Percentage of machine peak 0 8192 131072 n Edgar Solomonik and James Demmel 2.5D algorithms 13/ 36 Introduction 2.5D matrix multiplication Strong scaling matrix multiplication 2.5D LU factorization Performing faster at scale Conclusion Cost breakdown of MM on 65,536 cores Matrix multiplication on 16,384 nodes of BG/P 1.4 communication 1.2 idle 95% reduction in comm computation 1 0.8 0.6 0.4 0.2 Execution time normalized by 2D 0 n=8192, 2Dn=8192, 2.5D n=131072, 2Dn=131072, 2.5D Edgar Solomonik and James Demmel 2.5D algorithms 14/ 36 Introduction 2.5D matrix multiplication Communication-optimal LU without pivoting 2.5D LU factorization Communication-optimal LU with pivoting Conclusion 2.5D LU strong scaling (without pivoting) LU without pivoting on BG/P (n=65,536) 100 ideal scaling 2.5D LU 2D LU 80 60 40 20 Percentage of machine peak 0 256 512 1024 2048 #nodes Edgar Solomonik and James Demmel 2.5D algorithms 15/ 36 Introduction 2.5D matrix multiplication Communication-optimal LU without pivoting 2.5D LU factorization Communication-optimal LU with pivoting Conclusion 2D blocked LU factorization A Edgar Solomonik and James Demmel 2.5D algorithms 16/ 36 Introduction 2.5D matrix multiplication Communication-optimal LU without pivoting 2.5D LU factorization Communication-optimal LU with pivoting Conclusion 2D blocked LU factorization U₀₀ L₀₀ Edgar Solomonik and James Demmel 2.5D algorithms 17/ 36 Introduction 2.5D matrix multiplication Communication-optimal LU without pivoting 2.5D LU factorization Communication-optimal LU with pivoting Conclusion 2D blocked LU factorization U L Edgar Solomonik and James Demmel 2.5D algorithms 18/ 36 Introduction 2.5D matrix multiplication Communication-optimal LU without pivoting 2.5D LU factorization Communication-optimal LU with pivoting Conclusion 2D blocked LU factorization U L S=A-LU Edgar Solomonik and James Demmel 2.5D algorithms 19/ 36 Introduction 2.5D matrix multiplication Communication-optimal LU without pivoting 2.5D LU factorization Communication-optimal LU with pivoting Conclusion 2D block-cyclic decomposition 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 Edgar Solomonik and James Demmel 2.5D algorithms 20/ 36 Introduction 2.5D matrix multiplication Communication-optimal LU without pivoting 2.5D LU factorization Communication-optimal LU with pivoting Conclusion 2D block-cyclic LU factorization Edgar Solomonik and James Demmel 2.5D algorithms 21/ 36 Introduction 2.5D matrix multiplication Communication-optimal LU without pivoting 2.5D LU factorization Communication-optimal LU with pivoting Conclusion 2D block-cyclic LU factorization U L Edgar Solomonik and James Demmel 2.5D algorithms 22/ 36 Introduction 2.5D matrix multiplication Communication-optimal LU without pivoting 2.5D LU factorization Communication-optimal LU with pivoting Conclusion 2D block-cyclic LU factorization U L S=A-LU Edgar Solomonik and James Demmel 2.5D algorithms 23/ 36 Introduction 2.5D matrix multiplication Communication-optimal LU without pivoting 2.5D LU factorization Communication-optimal LU with pivoting Conclusion 2.5D LU factorization U₀₀ L₀₀ U₀₀ U₀₃ L₀₀ U₀₀ U₀₃ L₀₀ U₀₀ U₀₁ L₀₀ (A) L₃₀ L₂₀ L₁₀ Edgar Solomonik and James Demmel 2.5D algorithms 24/ 36 Introduction 2.5D matrix multiplication Communication-optimal LU without pivoting 2.5D LU factorization Communication-optimal LU with pivoting Conclusion 2.5D LU factorization U₀₀ L₀₀ (B) U₀₀ U₀₃ L₀₀ U₀₀ U₀₃ L₀₀ U₀₀ U₀₁ L₀₀ (A) L₃₀ L₂₀ L₁₀ Edgar Solomonik and James Demmel 2.5D algorithms 25/ 36 Introduction 2.5D matrix multiplication Communication-optimal LU without pivoting 2.5D LU factorization Communication-optimal LU with pivoting Conclusion 2.5D LU factorization U₀₀ L₀₀ (B) U₀₀ U₀₃ L₀₀ U₀₀ U₀₃ L₀₀ U₀₀ U₀₁ L₀₀ (A) L₃₀ L₂₀ L₁₀ U (D) (C) L Edgar Solomonik and James Demmel 2.5D algorithms 26/ 36 Introduction 2.5D matrix multiplication Communication-optimal LU without pivoting 2.5D LU factorization Communication-optimal LU with pivoting Conclusion 2.5D LU factorization Look at how this update is distributed. What does it remind you of? Edgar Solomonik and James Demmel 2.5D algorithms 27/ 36 Introduction 2.5D matrix multiplication Communication-optimal LU without pivoting 2.5D LU factorization Communication-optimal LU with pivoting Conclusion 2.5D LU factorization Look at how this update is distributed.

View Full Text

Details

  • File Type
    pdf
  • Upload Time
    -
  • Content Languages
    English
  • Upload User
    Anonymous/Not logged-in
  • File Pages
    60 Page
  • File Size
    -

Download

Channel Download Status
Express Download Enable

Copyright

We respect the copyrights and intellectual property rights of all users. All uploaded documents are either original works of the uploader or authorized works of the rightful owners.

  • Not to be reproduced or distributed without explicit permission.
  • Not used for commercial purposes outside of approved use cases.
  • Not used to infringe on the rights of the original creators.
  • If you believe any content infringes your copyright, please contact us immediately.

Support

For help with questions, suggestions, or problems, please contact us