On the High Performance Implementation of Quaternionic Operations

wavefunction91.github.io

SIAM-CSE 2019, February 28, 2019
David Williams-Young
Scalable Solvers Group, Computational Research Division, Lawrence Berkeley National Lab

Problem Motivation

• Moving towards the end of Moore's law: simply applying existing algorithms and data structures is not sufficient.
• Symmetry is very common in many scientific and engineering disciplines, especially those whose target involves physical space.
• Much research has been devoted to real / complex linear algebra algorithms that exploit this symmetry (Dongarra, et al., 1984; Shiozaki, 2018).

The quaternion algebra is [...] somewhat complicated, and its computation cannot be easily mapped to highly optimized linear algebra libraries such as BLAS and LAPACK.

Problem Statement

How can we leverage techniques such as auto-tuning and microarchitecture optimization to provide optimized implementations of quaternion linear algebra software?

This talk will attempt to answer (discuss) three questions:
• What are quaternions and why do we care?
• What possible use could I have for matrices of quaternions?
• What does all of this have to do with auto-tuning?

Quaternions: Formally

Quaternions are defined as the set $\mathbb{H}$ of all $q$ such that

$$q = q^0 e_0 + q^1 e_1 + q^2 e_2 + q^3 e_3, \qquad q^0, q^1, q^2, q^3 \in \mathbb{R},$$

with

$$e_0 e_j = e_j e_0 = e_j, \quad j \in \{0, 1, 2, 3\},$$
$$e_i e_j = -\delta_{ij}\, e_0 + \sum_{k=1}^{3} \varepsilon_{ij}^{k}\, e_k, \quad i, j \in \{1, 2, 3\}.$$

For $p, q, r \in \mathbb{H}$,

$$r = pq = \left( p^0 q^0 - \sum_{i=1}^{3} p^i q^i \right) e_0 + \sum_{k=1}^{3} \left( p^0 q^k + p^k q^0 + \sum_{i,j=1}^{3} \varepsilon_{ij}^{k}\, p^i q^j \right) e_k,$$

$$[p, q] = \sum_{i,j,k=1}^{3} \varepsilon_{ij}^{k} \left( p^i q^j - p^j q^i \right) e_k \neq 0.$$
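As a concrete illustration of the product formula above, here is a minimal C++ sketch of the Hamilton product acting on components (the `Quaternion` type and `multiply` function are illustrative stand-ins, not the HAXX API):

```cpp
#include <array>

// q = q[0] e0 + q[1] e1 + q[2] e2 + q[3] e3
using Quaternion = std::array<double, 4>;

// Hamilton product r = p q: 16 real multiplications and 12 real additions,
// one term per delta / epsilon contraction in the formula above.
Quaternion multiply(const Quaternion& p, const Quaternion& q) {
  return {p[0]*q[0] - p[1]*q[1] - p[2]*q[2] - p[3]*q[3],   // e0 (scalar) part
          p[0]*q[1] + p[1]*q[0] + p[2]*q[3] - p[3]*q[2],   // e1 part
          p[0]*q[2] + p[2]*q[0] + p[3]*q[1] - p[1]*q[3],   // e2 part
          p[0]*q[3] + p[3]*q[0] + p[1]*q[2] - p[2]*q[1]};  // e3 part
}
```

Swapping p and q flips the sign of the epsilon terms, which is exactly the nonvanishing commutator $[p, q]$ above.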

Quaternions: Formally

Multiplication in $\mathbb{R}$ and $\mathbb{C}$ commutes:

$$a, b, c \in \mathbb{R}: \quad c = ab, \quad [a, b] = 0,$$
$$w, v, z \in \mathbb{C}: \quad z = wv = (w^0 v^0 - w^1 v^1) + (w^0 v^1 + w^1 v^0)\, i, \quad [w, v] = 0.$$

The quaternion product does not: for $p, q, r \in \mathbb{H}$,

$$r = pq = \left( p^0 q^0 - \sum_{i=1}^{3} p^i q^i \right) e_0 + \sum_{k=1}^{3} \left( p^0 q^k + p^k q^0 + \sum_{i,j=1}^{3} \varepsilon_{ij}^{k}\, p^i q^j \right) e_k,$$

$$[p, q] = \sum_{i,j,k=1}^{3} \varepsilon_{ij}^{k} \left( p^i q^j - p^j q^i \right) e_k \neq 0.$$

Quaternion Applications: Spatial Rotations

Topologically, the set of unit quaternions (versors) $V = \{ v \in \mathbb{H} \;\text{s.t.}\; \|v\| = 1 \}$ is $S^3$, and thus isomorphic to $SU(2)$, which provides a double cover of $SO(3)$ (rotations in $\mathbb{R}^3$).

We may describe spatial rotations in $\mathbb{R}^3$ via

$$r \in \mathbb{R}^3 \mapsto r^{\mathbb{H}} = r^1 e_1 + r^2 e_2 + r^3 e_3,$$
$$R(\hat{e}, \theta) \in SO(3) \mapsto \pm v = \pm \exp\!\left( \frac{\theta}{2} \left( \hat{e}_1 e_1 + \hat{e}_2 e_2 + \hat{e}_3 e_3 \right) \right),$$

such that

$$r' = R(\hat{e}, \theta)\, r \mapsto r'^{\mathbb{H}} = v\, r^{\mathbb{H}}\, v^{-1}.$$
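To make the rotation correspondence concrete, a minimal sketch reusing the illustrative `Quaternion` type and `multiply` from above (hypothetical helpers, not the HAXX API):

```cpp
#include <cmath>

// Versor v = exp((theta/2)(ex e1 + ey e2 + ez e3)) for a unit axis (ex, ey, ez).
Quaternion versor(double ex, double ey, double ez, double theta) {
  const double c = std::cos(theta / 2), s = std::sin(theta / 2);
  return {c, s * ex, s * ey, s * ez};
}

// r' = R(e, theta) r realized as r'^H = v r^H v^{-1}; for a versor,
// the inverse is simply the quaternion conjugate.
Quaternion rotate(const Quaternion& v, double x, double y, double z) {
  const Quaternion rH   = {0.0, x, y, z};  // embed r as a pure quaternion
  const Quaternion vInv = {v[0], -v[1], -v[2], -v[3]};
  return multiply(multiply(v, rH), vInv);  // scalar part comes back ~0
}
```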

Quaternion Applications: Spatial Rotations

The SO(3) cover has found extensive exploitation in computer graphics / vision:
• $(v^0, v^1, v^2, v^3)$ (4 real numbers) vs. $\begin{pmatrix} a & b & c \\ d & e & f \\ g & h & i \end{pmatrix}$ (9 real numbers)
• $v_1, v_2 \in \mathbb{H}$: $v_1 v_2$ (16 FLOPs) vs. $R_1, R_2 \in SO(3)$: $R_1 R_2$ (27 FLOPs)
• SLERP (Spherical Linear Interpolation); see the sketch after this list
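A hedged sketch of SLERP between two versors, using the textbook sine-weighted form (again with the illustrative `Quaternion` type, not the HAXX API):

```cpp
#include <cmath>

// slerp(p, q, t) = sin((1-t)W)/sin(W) p + sin(tW)/sin(W) q,  cos(W) = <p, q>.
Quaternion slerp(const Quaternion& p, const Quaternion& q, double t) {
  const double dot = p[0]*q[0] + p[1]*q[1] + p[2]*q[2] + p[3]*q[3];
  Quaternion r;
  if (std::abs(dot) > 0.9995) {  // nearly parallel: fall back to linear interpolation
    for (int i = 0; i < 4; ++i) r[i] = (1 - t) * p[i] + t * q[i];
    return r;                    // (a production version would renormalize here)
  }
  const double W = std::acos(dot), sW = std::sin(W);
  const double a = std::sin((1 - t) * W) / sW, b = std::sin(t * W) / sW;
  for (int i = 0; i < 4; ++i) r[i] = a * p[i] + b * q[i];
  return r;
}
```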

Matrices of Quaternions

The algebra generated by $\{e_0, e_1, e_2, e_3\}$ is identical to the algebra generated by the Pauli matrices, thus $\mathbb{H} \cong \langle SU(2) \rangle \subset M_2(\mathbb{C})$:

$$e_0 \leftrightarrow \sigma_0, \quad e_1 \leftrightarrow i\sigma_3, \quad e_2 \leftrightarrow i\sigma_2, \quad e_3 \leftrightarrow i\sigma_1,$$

with

$$\sigma_0 = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}, \quad \sigma_1 = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}, \quad \sigma_2 = \begin{pmatrix} 0 & -i \\ i & 0 \end{pmatrix}, \quad \sigma_3 = \begin{pmatrix} 1 & 0 \\ 0 & -1 \end{pmatrix},$$

such that

$$q \leftrightarrow q_{\mathbb{C}} = \begin{pmatrix} q^0 + q^1 i & q^2 + q^3 i \\ -q^2 + q^3 i & q^0 - q^1 i \end{pmatrix} = \begin{pmatrix} q_0 & q_1 \\ -\bar{q}_1 & \bar{q}_0 \end{pmatrix} \in M_2(\mathbb{C}).$$

Matrices of Quaternions

The set of quaternion matrices, $M_N(\mathbb{H})$, is defined by

$$Q = Q^0 e_0 + Q^1 e_1 + Q^2 e_2 + Q^3 e_3, \qquad Q^0, Q^1, Q^2, Q^3 \in M_N(\mathbb{R}).$$

Examining the $M_2(\mathbb{C})$ representation of a particular element,

$$(Q_{\mathbb{C}})_{\mu\nu} = Q^0_{\mu\nu} \sigma_0 + i Q^1_{\mu\nu} \sigma_3 + i Q^2_{\mu\nu} \sigma_2 + i Q^3_{\mu\nu} \sigma_1,$$

yields the Kronecker structure

$$Q_{\mathbb{C}} = Q^0 \otimes \sigma_0 + Q^1 \otimes i\sigma_3 + Q^2 \otimes i\sigma_2 + Q^3 \otimes i\sigma_1 = \begin{pmatrix} Q_0 & Q_1 \\ -\bar{Q}_1 & \bar{Q}_0 \end{pmatrix} \in M_{2N}(\mathbb{C}),$$

where $Q_0 = Q^0 + iQ^1$ and $Q_1 = Q^2 + iQ^3$.

Matrices of Quaternions

" # Q0 Q1 Q ∈ MN (H) ↔ 1 0 ∈ M2N (C), −Q Q

• Ubiquitous in quantum chemistry / nuclear physics (time-reversal symmetry)
• Applications in image processing and machine learning (quaternion PCA, etc.)
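The block expansion $Q \mapsto Q_{\mathbb{C}}$ used above is mechanical to implement; a minimal sketch, assuming row-major N x N real component arrays (the function name is hypothetical, not part of HAXX):

```cpp
#include <complex>
#include <vector>

// Expand Q = Q0 e0 + Q1 e1 + Q2 e2 + Q3 e3 (four N x N real arrays, row-major)
// into QC = [ Q0 + i Q1   Q2 + i Q3 ; -Q2 + i Q3   Q0 - i Q1 ] in M_{2N}(C).
std::vector<std::complex<double>> expandToComplex(
    const std::vector<double>& Q0, const std::vector<double>& Q1,
    const std::vector<double>& Q2, const std::vector<double>& Q3, int N) {
  std::vector<std::complex<double>> QC(4 * N * N);  // 2N x 2N, row-major
  for (int i = 0; i < N; ++i)
    for (int j = 0; j < N; ++j) {
      const std::complex<double> q0(Q0[i*N+j], Q1[i*N+j]);  // (Q0 + i Q1)_{ij}
      const std::complex<double> q1(Q2[i*N+j], Q3[i*N+j]);  // (Q2 + i Q3)_{ij}
      QC[i*2*N + j]         = q0;               // top-left block
      QC[i*2*N + (N+j)]     = q1;               // top-right block
      QC[(N+i)*2*N + j]     = -std::conj(q1);   // bottom-left block: -conj(Q_1)
      QC[(N+i)*2*N + (N+j)] =  std::conj(q0);   // bottom-right block: conj(Q_0)
    }
  return QC;
}
```

This doubling of storage is exactly where the 2x FLOP and memory overheads in the tables that follow come from.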

Formal theory for quaternion linear algebra has been developed:
• QR algorithm
• Diagonalization, SVD
• LU, Cholesky, LDL^H factorizations

Performance Considerations

Table: Real floating point operations (FLOPs) comparison for elementary arithmetic operations using H and M2(C) data structures.

Operation        FLOPs in H    FLOPs in M_2(C)
Addition              4               8
Multiplication       16              32

$$p + q \longleftrightarrow p_{\mathbb{C}} + q_{\mathbb{C}}, \qquad pq \longleftrightarrow p_{\mathbb{C}}\, q_{\mathbb{C}}$$

Performance Considerations

Table: Real floating point operations (FLOPs) comparison for common linear algebra operations using MN (H) and M2N (C) data structures.

Operation        FLOPs in M_N(H)    FLOPs in M_{2N}(C)
Addition             4N^2                 8N^2
Multiplication      16N^3                32N^3

$$P + Q \longleftrightarrow P_{\mathbb{C}} + Q_{\mathbb{C}}, \qquad PQ \longleftrightarrow P_{\mathbb{C}}\, Q_{\mathbb{C}}$$

Performance Considerations

Quaternion arithmetic offers:
• 0.5x required FLOPs
• 0.5x memory footprint (4 vs. 8 floats per scalar)
• 2x arithmetic intensity (FLOPs / byte)

We should be using quaternion arithmetic!

HAXX

• gh/wavefunction91/HAXX
• Optimized C++14 library for quaternion arithmetic
• HBLAS: Optimized quaternionic BLAS functionality
• HLAPACK: Optimized quaternionic LAPACK functionality (in progress)
• Intrinsics + assembly kernels

Quaternion Matrix Multiplication

The most fundamental linear algebra operation is the general matrix multiply (GEMM).

Great! Quaternion GEMM ⇒ HP Quaternion Linear Algebra, Right?

[Figure: wall time (linear and log-log scales) and GFLOP/s vs. quaternion problem dimension (N = M = K, 1000 to 6000) for the reference quaternion GEMM (HGEMM-Ref) and MKL's complex ZGEMM on the equivalent M_{2N}(C) problem (ZGEMM-MKL).]

Reference GEMM

Algorithm 0: Reference GEMM
  Input:  A ∈ M_{m,k}(F), B ∈ M_{k,n}(F), C ∈ M_{m,n}(F), scalars α, β ∈ F
  Output: C = αAB + βC
  for j = 1 : n do
    1  Load C_j = C(:, j)
    2  C_j = β C_j
    for l = 1 : k do
      3  Load A_l = A(:, l)
      4  C_j = C_j + α A_l B_{lj}
    end
    5  Store C_j
  end

Pros:
✓ Able to implement in an afternoon
✓ Architecture agnostic

Cons:
✗ No caching of B
✗ Reloads all of A for each C_j
✗ For large m, k, the load of A boots C_j from cache
✗ Relies on the optimizing compiler for SIMD, FMA, etc.
✗ (Scalable) parallelism is non-trivial
✗ Not tunable
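A direct C++ transcription of Algorithm 0 for quaternion matrices, as a hedged sketch reusing the illustrative `Quaternion` type and `multiply` from above (not the HBLAS reference code itself):

```cpp
#include <vector>

// Reference GEMM over column-major quaternion matrices: C = alpha*A*B + beta*C.
// Mirrors Algorithm 0: for each column C_j, scale by beta, then accumulate
// rank-1 updates alpha * A(:,l) * B(l,j).
void refHGEMM(int m, int n, int k, const Quaternion& alpha,
              const std::vector<Quaternion>& A, int lda,
              const std::vector<Quaternion>& B, int ldb,
              const Quaternion& beta, std::vector<Quaternion>& C, int ldc) {
  for (int j = 0; j < n; ++j) {
    for (int i = 0; i < m; ++i)                        // C_j = beta * C_j
      C[i + j*ldc] = multiply(beta, C[i + j*ldc]);
    for (int l = 0; l < k; ++l) {                      // C_j += alpha * A_l * B_lj
      const Quaternion blj = B[l + j*ldb];
      for (int i = 0; i < m; ++i) {
        // order matters: quaternion products do not commute
        const Quaternion t = multiply(alpha, multiply(A[i + l*lda], blj));
        for (int d = 0; d < 4; ++d) C[i + j*ldc][d] += t[d];
      }
    }
  }
}
```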

High-Performance Matrix-Matrix Multiplication

A layered (Goto-style) algorithm significantly improves performance.

Pros:
✓ Caches parts of A, B for maximum reusability
✓ Factors architecture-specific micro-operations into a single micro-kernel
✓ Obvious avenue for SMP
✓ Tunable!

Cons:
✗ Significantly more complicated than the naive algorithm
✗ Requires allocation of auxiliary memory
✗ Micro-kernel must be written for each architecture

Van Zee, F. G., et al., 2017, ACM TOMS, 7:1-7:36.

High-Performance Quaternionic GEMM (HGEMM)

The optimized implementation of GEMM in HBLAS utilizes the Goto algorithm. In essence, Goto's original algorithm may be extended to ℍ by specialization of two sets of routines:
• Micro-kernels which perform the ℍ rank-1 update (assembly / intrinsics)
• Efficient matrix packing routines (intrinsics)
and by optimization of 3 caching parameters, m_c, n_c, and k_c, for the architecture of interest. The blocked loop structure is sketched below.
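As a rough illustration of the layered structure (not the HAXX kernels), a hedged C++ sketch of the Goto loop nest around a scalar stand-in for the micro-kernel; block sizes use the tuned Sandy Bridge values reported later, and edge cases (dimensions not divisible by the block sizes) are omitted:

```cpp
#include <vector>

// Cache / register block sizes (placeholders; tuned per architecture).
constexpr int MC = 64, NC = 1024, KC = 64;  // tuned Sandy Bridge values
constexpr int MR = 2, NR = 2;               // register block (micro-kernel tile)

// Scalar stand-in for the assembly micro-kernel: accumulates an MR x NR tile
// of C from an MR x kb panel of packed A and a kb x NR panel of packed B.
void microKernel(int kb, const Quaternion* Ap, const Quaternion* Bp,
                 Quaternion* C, int ldc) {
  for (int l = 0; l < kb; ++l)                  // sequence of H rank-1 updates
    for (int jr = 0; jr < NR; ++jr)
      for (int ir = 0; ir < MR; ++ir) {
        const Quaternion t = multiply(Ap[l*MR + ir], Bp[l*NR + jr]);
        for (int d = 0; d < 4; ++d) C[ir + jr*ldc][d] += t[d];
      }
}

// Goto-style HGEMM, C += A*B, column-major, dimensions divisible by block sizes.
void gotoHGEMM(int m, int n, int k, const Quaternion* A, int lda,
               const Quaternion* B, int ldb, Quaternion* C, int ldc) {
  std::vector<Quaternion> Ap(MC * KC), Bp(KC * NC);   // packing buffers
  for (int jc = 0; jc < n; jc += NC)                  // 5th loop: N blocks
    for (int pc = 0; pc < k; pc += KC) {              // 4th loop: K blocks
      // pack B(pc:pc+KC, jc:jc+NC) into NR-wide row-major panels
      for (int jr = 0; jr < NC; jr += NR)
        for (int l = 0; l < KC; ++l)
          for (int j = 0; j < NR; ++j)
            Bp[jr*KC + l*NR + j] = B[(pc+l) + (jc+jr+j)*ldb];
      for (int ic = 0; ic < m; ic += MC) {            // 3rd loop: M blocks
        // pack A(ic:ic+MC, pc:pc+KC) into MR-tall column-major panels
        for (int ir = 0; ir < MC; ir += MR)
          for (int l = 0; l < KC; ++l)
            for (int i = 0; i < MR; ++i)
              Ap[ir*KC + l*MR + i] = A[(ic+ir+i) + (pc+l)*lda];
        for (int jr = 0; jr < NC; jr += NR)           // 2nd loop: B micro-panels
          for (int ir = 0; ir < MC; ir += MR)         // 1st loop: A micro-panels
            microKernel(KC, &Ap[ir*KC], &Bp[jr*KC],
                        &C[(ic+ir) + (jc+jr)*ldc], ldc);
      }
    }
}
```

In HBLAS, the microKernel body corresponds to hand-written assembly / intrinsics performing the ℍ rank-1 update, and the packing loops to intrinsics-based packing routines.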

Optimization of Cache Parameters

• OpenTuner: an open-source Python framework for auto-tuning
• Register blocks fixed (on AVX / AVX2): $n_r = m_r = 2$
• Integer discretization: $m_c, k_c \in \{2^n\}_{n=3}^{12}$, $n_c \in \{2^n\}_{n=5}^{16}$
• Find $\{m_c, n_c, k_c\}$ which minimizes run time (maximizes GFLOP/s)
• Average over 5 cold (cache-invalidated) runs on select matrix sizes (500, 1k, 2k, 4k)

Possible to brute-force optimize, but not convenient! A sketch of such a search follows.
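For contrast with OpenTuner's guided search, a hedged sketch of the brute-force alternative over the same discretized space (`timeHGEMM` is a hypothetical timing harness, assumed to average over 5 cold runs on the selected matrix sizes):

```cpp
#include <cstdio>
#include <functional>

// Exhaustive search over the discretized parameter space:
// mc, kc in {2^3, ..., 2^12}, nc in {2^5, ..., 2^16} -- 1200 configurations.
void bruteForceTune(const std::function<double(int, int, int)>& timeHGEMM) {
  double best = 1e300;
  int bestMc = 0, bestNc = 0, bestKc = 0;
  for (int mc = 8; mc <= 4096; mc *= 2)
    for (int kc = 8; kc <= 4096; kc *= 2)
      for (int nc = 32; nc <= 65536; nc *= 2) {
        const double t = timeHGEMM(mc, nc, kc);  // seconds for this config
        if (t < best) { best = t; bestMc = mc; bestNc = nc; bestKc = kc; }
      }
  std::printf("best: mc=%d nc=%d kc=%d (%.3f s)\n", bestMc, bestNc, bestKc, best);
}
```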

Optimization of Cache Parameters

[Figure: auto-tuning heat maps of HGEMM performance, one panel per n_c ∈ {128, 256, 512, 1024, 2048, 4096}; axes log2 m_c vs. log2 k_c (6.0 to 10.0), color scale 20.7 to 21.5 GFLOPs.]

AVX Optimized HGEMM Implementation
Intel Sandy Bridge (L1d: 32k, L1i: 32k, L2: 256k, L3: 20480k)

OpenTuner results:
• 10 tests x 10 runs (~1 hour vs. 10 hours brute force)
• $m_c = k_c = 64$
• $n_c = 1024$

[Figure: wall time and GFLOP/s vs. quaternion problem dimension (N = M = K, 1000 to 6000) for HGEMM-Ref, HGEMM-OptAVX, and ZGEMM-MKL.]

Conclusions

• With instruction sets newer than AVX, high-performance quaternionic linear algebra is possible and a viable alternative to complex linear algebra for appropriate problems.
• Goto's algorithm + auto-tuning drastically improves performance, which is impractical with reference implementations.

Future Work

• Fill out HBLAS and HLAPACK coverage of the BLAS and LAPACK standards.

• Package the auto-tuner and tuning methodology to automate optimization of caching parameters (+...) on new architectures.

• Address parallelism (SMP + MPI)

Acknowledgments

• Xiaosong Li (UW)
• $$$ ACI-1547580 (NSF), OAC-1663636 (NSF to XL)
• $$$ LAB 17-1775 (DOE-BES)
• Benjamin Pritchard (MolSSI)
• Edward Valeev (VT)
• Wissam Sid-Lakhdar (LBNL)
• Organizers
• Audience