
QR Factorization in OpenMP and Cilk Plus: Development Experience and Performance Analysis

TACC-Intel Highly Parallel Computing Symposium, Austin, TX

John Eisenlohr, David E. Hudak, Karen Tomko (Ohio Supercomputer Center); Timothy Prince (Intel Corporation)

Overview
• Heterogeneous architectures and their challenges
• Research questions and project goals
• Communication-Avoiding QR (CAQR) algorithm and its implementation
• Experimental results
• Conclusions

2 Heterogeneous Architecture Highlights
• Products from major manufacturers:
  – Intel: MIC architecture – Knights Corner
  – NVIDIA: Tesla GPGPU, Denver architecture
  – AMD: Fusion architecture
• New technology with many tradeoffs remaining
  – Special-purpose (GPU) vs. general-purpose (MIC)
  – Integration at the I/O device level (PCI), socket level (e.g., QPI), or chip level (e.g., Fusion)
• Notable impact in high-performance computing
  – Three of the top five systems in the Top500 use GPUs (November 2011)
  – Upcoming systems: ORNL Titan (NVIDIA GPUs) and TACC Stampede (Intel Knights Corner)

3 Heterogeneous Programming Challenges
• Short-term challenges
  – Device-driver style programming
    • Allocate buffer on device
    • DMA transfer into buffer
    • Call function on device
    • Poll for function completion
    • DMA transfer from buffer
  – Limited accelerator memory
  – Multiple toolchains
• Long-term challenges
  – Nested parallelism
  – Combine task and data parallelism
  – Data-dependent task parallelism (dataflow)
  – Manage data movement across multiple memories
  – Overlap computation with communication

4 Project Overview
• Research questions
  – There are many parallel programming methods. Which one should you choose?
  – How does application-level restructuring impact performance?
  – What impact does the run-time system have?
• Project goals
  – Find an algorithm with algorithmic-level task parallelism
  – Understand KNF performance
  – Evaluate programming alternatives

5 QR Factorization

• Factor a matrix A into the product A = QR of an orthogonal matrix Q and an upper triangular matrix R
• QR factorization has been studied extensively on every HPC architecture
  – Vector, MPP, SMP, cluster, multicore cluster, GPU cluster
• Like all matrix factorizations, the portion of the matrix remaining to be factored shrinks as the factorization progresses
• We examined the communication-avoiding QR factorization (CAQR) for tall-skinny (TS) matrices
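For reference, dense QR is a single call in LAPACK. A minimal sketch (our illustration; the slides do not show this code) using LAPACKE_dgeqrf, which leaves R in the upper triangle of A and the Householder reflectors that define Q below it:

    #include <stdio.h>
    #include <lapacke.h>

    int main(void) {
        /* A 4x2 tall-skinny matrix, row-major. */
        double A[4 * 2] = { 1, 2,
                            3, 4,
                            5, 6,
                            7, 8 };
        double tau[2];  /* scalar factors of the Householder reflectors */
        lapack_int info = LAPACKE_dgeqrf(LAPACK_ROW_MAJOR, 4, 2, A, 2, tau);
        if (info != 0) {
            fprintf(stderr, "dgeqrf failed: %d\n", (int)info);
            return 1;
        }
        /* R now occupies the upper triangle of A. */
        printf("R = [%g %g; 0 %g]\n", A[0], A[1], A[3]);
        return 0;
    }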

6 CAQR Algorithm Overview
• The NxM matrix is stored in a 1-D block row layout
  – Each block is a square of size b x b
• The algorithm works on panels – rectangular sets of blocks
  – For example, we show panels that are 2 blocks high and 1 block wide
  – In general, panels have size h x b
• Each iteration of the main loop completely factors the upper b rows of the matrix
  – So we stride along the diagonal of the matrix one block at a time
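A minimal sketch of addressing blocks in such a layout (the function name and storage order are our assumptions; the slides do not specify them):

    #include <stddef.h>

    /* Return a pointer to block (bi, bj) of an N x M matrix stored as
       contiguous b x b blocks, one block row after another (hypothetical
       layout for illustration). */
    static inline double *block_at(double *A, int M, int b, int bi, int bj) {
        size_t blocks_per_row = (size_t)(M / b);
        return A + (bi * blocks_per_row + (size_t)bj) * (size_t)(b * b);
    }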

7 CAQR Algorithm Overview
• Each iteration of the main loop has four steps (see the sketch after this list)
  – Step 1: Factor the panels in the leftmost column
  – Step 2: Update the other panels in each row using the leftmost panel from that row
  – Step 3: Combine each panel in the leftmost column using the panels below it
  – Step 4: Update the panels in each row using the new leftmost-column panel and the panels beneath it
• For increased parallelism, steps 3 and 4 are implemented with a binary tree reduction
  – Combined into a single step
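The loop structure these steps imply can be sketched as follows (kernel names follow the operations named on later slides; the signatures and bounds are our assumptions, not the authors' code):

    /* Hypothetical CAQR driver over an N x M matrix with h x b panels. */
    void factor_leftmost_column(int k);  /* Step 1 */
    void update_panel_rows(int k);       /* Step 2 */
    void tree_reduce(int k);             /* Steps 3 and 4, fused */

    void caqr(int M, int b) {
        /* k strides along the diagonal one block column at a time. */
        for (int k = 0; k < M / b; k++) {
            factor_leftmost_column(k);
            update_panel_rows(k);
            tree_reduce(k);
        }
    }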

8 Step 1: Factor Panels on LHS
• N/h tasks, each factoring one panel (hh_factor_panel_block); a sketch follows below
• Topmost block in the panel is left in upper triangular form
  – All lower blocks in the panel are zeroed
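A sketch of Step 1 as a parallel loop (hh_factor_panel_block is the kernel named on the slide; its signature and the OpenMP expression are our assumptions):

    void hh_factor_panel_block(int k, int p);  /* named on the slide; signature assumed */

    /* Step 1: factor the N/h panels of the leftmost block column
       independently; each iteration is one task's worth of work. */
    void step1_factor_column(int k, int N, int h) {
        #pragma omp parallel for schedule(dynamic)
        for (int p = k; p < N / h; p++)
            hh_factor_panel_block(k, p);
    }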

9 Step 2: Apply Factors to Row
• Once the leftmost panel is factored, all panels in its row can be updated concurrently (apply_qt_h)
• N/h * M/b tasks
  – Majority of the work in the algorithm
• We tested two implementations (both sketched below)
  – Simple: N/h tasks
  – Nested: N/h * M/b tasks
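A sketch of the two variants (apply_qt_h is the slide's kernel; all other details are our illustration). The only difference is how much parallelism is exposed to the runtime:

    void apply_qt_h(int k, int p, int j);  /* named on the slide; signature assumed */

    /* "Simple": one task per panel row (N/h tasks); the blocks within
       a row are updated sequentially. */
    void step2_simple(int k, int N, int M, int h, int b) {
        #pragma omp parallel for schedule(dynamic)
        for (int p = k; p < N / h; p++)
            for (int j = k + 1; j < M / b; j++)
                apply_qt_h(k, p, j);
    }

    /* "Nested": expose all N/h * M/b block updates as independent work. */
    void step2_nested(int k, int N, int M, int h, int b) {
        #pragma omp parallel for collapse(2) schedule(dynamic)
        for (int p = k; p < N / h; p++)
            for (int j = k + 1; j < M / b; j++)
                apply_qt_h(k, p, j);
    }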

10 Step 3.1: Binary Tree Reduction to Combine Factors

• log2(N/h) tree levels (factor_tree); a sketch follows below
  – First level: N/2h tasks
  – Second level: N/4h tasks
• Two levels are shown here (2 tasks, 1 task)
• factor_panel_triangle
  – Lower panel's top block is zeroed
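A sketch of the combine tree (factor_panel_triangle is the slide's kernel; the pairing scheme, panel p absorbing panel p + stride, is our assumption):

    void factor_panel_triangle(int p_top, int p_bot);  /* zeroes p_bot's top block */

    /* Step 3.1: log2(N/h) levels; each level pairs the surviving panels
       and combines them in parallel (N/2h tasks, then N/4h, ...). */
    void tree_combine(int k, int N, int h) {
        int npanels = N / h;
        for (int stride = 1; k + stride < npanels; stride *= 2) {
            #pragma omp parallel for schedule(dynamic)
            for (int p = k; p + stride < npanels; p += 2 * stride)
                factor_panel_triangle(p, p + stride);
        }
    }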

11 Step 3.2: Binary Tree Reduction to Update Rows
• Two levels are shown here (2 tasks, 1 task): apply_qt_tree
• Each task updates two rows
  – But only the uppermost block in each panel
  – So, not as much work as Step 2
• Panels in each column could be updated in parallel – we have not done this yet (see the sketch below)
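A sketch of the matching update tree (apply_qt_tree is the slide's kernel; structure assumed as in Step 3.1). The inner j loop is where the unexploited column parallelism noted above lives:

    void apply_qt_tree(int p_top, int p_bot, int j);  /* named on the slide; signature assumed */

    /* Step 3.2: mirror the combine tree, updating the remaining block
       columns of each merged panel pair. Parallelizing the j loop as
       well is the future work the slide mentions. */
    void tree_update(int k, int N, int M, int h, int b) {
        int npanels = N / h;
        for (int stride = 1; k + stride < npanels; stride *= 2) {
            #pragma omp parallel for schedule(dynamic)
            for (int p = k; p + stride < npanels; p += 2 * stride)
                for (int j = k + 1; j < M / b; j++)
                    apply_qt_tree(p, p + stride, j);
        }
    }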

12 OpenMP Implementation
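The code shown on this slide did not survive extraction. As a stand-in, one plausible OpenMP tasking expression of Step 1 (entirely our reconstruction, not the original slide's code):

    #include <omp.h>

    void hh_factor_panel_block(int k, int p);  /* signature assumed */

    void step1_omp_tasks(int k, int npanels) {
        #pragma omp parallel
        #pragma omp single
        {
            /* One thread spawns a task per panel; the team executes them. */
            for (int p = k; p < npanels; p++) {
                #pragma omp task firstprivate(p)
                hh_factor_panel_block(k, p);
            }
            #pragma omp taskwait
        }
    }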

13 Cilk Implementation
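The Cilk Plus code on this slide was likewise lost in extraction. A minimal sketch of Steps 1 and 2 with cilk_for (our reconstruction, under the same assumed kernel signatures):

    #include <cilk/cilk.h>

    void hh_factor_panel_block(int k, int p);
    void apply_qt_h(int k, int p, int j);

    void step1_cilk(int k, int npanels) {
        /* cilk_for lets the runtime steal panel factorizations across workers. */
        cilk_for (int p = k; p < npanels; p++)
            hh_factor_panel_block(k, p);
    }

    void step2_cilk_nested(int k, int npanels, int ncols) {
        /* Nested cilk_for exposes the full N/h * M/b task pool of Step 2. */
        cilk_for (int p = k; p < npanels; p++) {
            cilk_for (int j = k + 1; j < ncols; j++)
                apply_qt_h(k, p, j);
        }
    }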

14 Performance Evaluation
• Algorithmic factors
  – Nontrivial algorithm (which is why we picked it)
  – Not fully parallelized, so we cannot expect linear speedup
  – Load balancing will also be a factor due to the decreasing amount of work
• Hardware factors
  – Number of cores
  – Per-core factors: vector length, HW threading
  – Cache factors: cache organization, sizes, interconnection and coherency
• A performance model for KNF is needed

15 Experimental Results

• Experiments run on Knights Ferry
• 29 threads
• Nesting parallelism in Step 2 provided improvement

[Figure: execution time relative to Cilk-Single at 58 threads, for the Cilk Single, Cilk Nested, OMP Single, OMP Nested, and OMP MPI-Like variants]

16 Impact of Increasing Threads

• Execution time decreases from 1 to 2 threads per core (9Kx4K case)
• Increasing matrix size (16Kx4K) increases the effect

  Threads/core   OMP    Cilk
       1         100%   100%
       2          83%    83%
       4          77%    99%
  Execution time normalized to the 1 thread/core case (OMP Nested and Cilk Nested)

[Figure: execution time relative to the 15-thread case vs. number of threads (15-60), for Cilk Single, Cilk Nested, OMP Single, OMP Nested, and OMP MPI-Like; smaller is better]

17 Impact of Panel Shape on Execution Time

• OpenMP nested and Cilk nested code examples
• Best panel shape is 64x16 in each case
• Scaling panel size is a bad idea

  OpenMP, 4 threads/core (execution times normalized to the 64x16 tile shape; matrix size 16Kx4K)

  Panel Height \ Panel Width    16     32     64    128
  512                          141%   187%   176%   213%
  256                          127%   123%   132%
  128                          107%   120%   128%
   64                          100%   121%
   32                          126%

  Cilk, 4 threads/core (same normalization; matrix size 16Kx4K)

  Panel Height \ Panel Width    16     32     64    128
  512                          177%   284%   251%   280%
  256                          165%   161%   173%
  128                          135%   158%
   64                          100%   133%
   32                          132%

18 Impact of Matrix Size on GF Rate

• GF rate increased with matrix size

  OpenMP, 4 threads/core (GF rate normalized to the 16Kx4K rate; 64x16 panel size)

  Number of Rows \ Number of Columns    1K    2K     4K
  16K                                   44%   73%   100%
   8K                                   30%   54%    69%
   4K                                   19%   34%
   2K                                    9%

  Cilk, 4 threads/core (same normalization and panel size)

  Number of Rows \ Number of Columns    1K    2K     4K
  16K                                   50%   74%   100%
   8K                                   33%   47%    56%
   4K                                   21%   30%
   2K                                   10%

19 Future Work
• Detailed performance model for the implementation
  – Timing for individual steps of the algorithm
• Performance model for KNF
  – Load balancing and caching impact
• Increase parallelism in Step 3.2
• Exploit Cilk Plus to express parallelism among portions of the algorithm
  – Step 2 can be overlapped with Step 3.1
• Cluster solution (multiple KNF cards)
• Replicate the tile-shape and matrix-size tests on Xeon

20 Conclusions
• Intel's advice could be summarized as "vectorize, lots of tasks, lots of workers" – that worked well
• We did find performance sensitivity to panel shape
• Heterogeneous architectures are new and in flux
  – Parallelism and locality are issues that will remain
• In the long term, parallelism and locality will become logical/algorithmic properties that are mapped onto complex architectures by the compiler and run-time system
