UNIVERSITAT JAUME I DE CASTELLÓ
ESCOLA DE DOCTORAT DE LA UNIVERSITAT JAUME I

Multithreaded Dense Linear Algebra on Asymmetric Multi-core Processors

Castelló de la Plana, November 2017

Ph.D. Thesis

Presented by: Sandra Catalán Pallarés
Supervised by: Enrique S. Quintana Ortí
Rafael Rodríguez Sánchez

Programa de Doctorat en Informàtica

Escola de Doctorat de la Universitat Jaume I

Multithreaded Dense Linear Algebra on Asymmetric Multi-core Processors

Memòria presentada per Sandra Catalán Pallarés per optar al grau de doctor/a per la Universitat Jaume I.

Sandra Catalán Pallarés
Enrique S. Quintana Ortí, Rafael Rodríguez Sánchez

Castelló de la Plana, desembre de 2017

Agraïments Institucionals

Esta tesi ha estat finançada pel projecte FP7 318793 “EXA2GREEN” de la Comissió Europea, per l’ajuda predoctoral de la Universitat Jaume I i per l’ajuda predoctoral FPU del Ministeri de Ciència i Educació.

Contents

1 Introduction
  1.1 Motivation
  1.2 Contributions
  1.3 Organization

2 Background
  2.1 Basic Linear Algebra Subprograms (BLAS)
    2.1.1 Levels of BLAS
    2.1.2 The BLAS library election
  2.2 Linear Algebra PACKage (LAPACK)
  2.3 Target architectures
    2.3.1 Power consumption
    2.3.2 Asymmetric multi-core processors
      2.3.2.1 ARM big.LITTLE
      2.3.2.2 Odroid
      2.3.2.3 Juno
    2.3.3 Intel
  2.4 Measuring power consumption
    2.4.1 PMLib and big.LITTLE platforms
  2.5 Summary

3 Basic Linear Algebra Subprograms (BLAS)
  3.1 BLAS-like Library Instantiation Software (BLIS)
  3.2 BLIS-3
    3.2.1 General matrix-matrix multiplication (gemm)
    3.2.2 Generalization to BLAS-3
    3.2.3 Triangular system solve (trsm)
  3.3 BLIS-3 on ARM big.LITTLE
  3.4 Optimizing symmetric BLIS gemm on big.LITTLE
    3.4.1 Cache optimization for the big and LITTLE cores
    3.4.2 Multi-threaded Evaluation of BLIS gemm on the big and LITTLE clusters
    3.4.3 Asymmetric BLIS GEMM on big.LITTLE

      3.4.3.1 Static-asymmetric scheduling (sas)
      3.4.3.2 Cache-aware static-asymmetric scheduling (ca-sas)
      3.4.3.3 Cache-aware dynamic-asymmetric scheduling (ca-das)
  3.5 Asymmetric BLIS-3 Evaluation
    3.5.1 Square operands in gemm
    3.5.2 gemm with rectangular operands
    3.5.3 Other BLIS-3 kernels with rectangular operands
    3.5.4 BLIS-3 in 64-bit AMPs
  3.6 BLIS-2
    3.6.1 General matrix-vector multiplication (gemv)
    3.6.2 Symmetric matrix-vector multiplication (symv)
  3.7 BLIS-2 on big.LITTLE
    3.7.1 Level-2 BLAS micro-kernels for ARMv7 architectures
    3.7.2 Asymmetry-aware symv and gemv for ARM big.LITTLE AMPs
    3.7.3 Cache optimization for the big and LITTLE cores
  3.8 Evaluation of Asymmetric BLIS-2
    3.8.1 symv and gemv with square operands
    3.8.2 gemv with rectangular operands
  3.9 Summary

4 Factorizations
  4.1 Cholesky factorization
  4.2 LU
  4.3 Look-ahead in the LU factorization
    4.3.1 Basic algorithms and Block-Data Parallelism (BDP)
    4.3.2 Static look-ahead and nested Task-Parallelism+Block-Data Parallelism (TP+BDP)
    4.3.3 Advocating for Malleable Thread-Level Linear Algebra Libraries
      4.3.3.1 Worker sharing (WS): Panel factorization less expensive than update
      4.3.3.2 Early termination: Panel factorization more expensive than update
      4.3.3.3 Relation to adaptive look-ahead via a runtime
    4.3.4 Experimental Evaluation
      4.3.4.1 Optimal block size
      4.3.4.2 Performance comparison of the variants with static look-ahead
      4.3.4.3 Performance comparison with OmpSs
      4.3.4.4 Multi-socket performance comparison with OmpSs
  4.4 Malleable LU on big.LITTLE
    4.4.1 Mapping of threads for the variant with static look-ahead
      4.4.1.1 Thread mapping of GEMM
      4.4.1.2 Configuration of TRSM
      4.4.1.3 The relevance of the partial pivoting
    4.4.2 Dynamic look-ahead with OmpSs
    4.4.3 Global comparison
  4.5 Summary

5 Reduction to Condensed Forms
  5.1 General Structure of the Reduction to Condensed Forms
  5.2 General Optimization of the TSOR Routines
    5.2.1 Selection of the core configuration
  5.3 Reduction to Tridiagonal Form (sytrd)
    5.3.1 The impact of the algorithmic block size in sytrd
    5.3.2 Performance of the optimized sytrd
  5.4 Reduction to Bidiagonal Form (gebrd)
    5.4.1 The impact of the algorithmic block size in the Reduction to Bidiagonal Form (gebrd)
    5.4.2 Performance of the optimized gebrd
  5.5 Reduction to Upper Hessenberg Form (gehrd)
    5.5.1 The impact of the algorithmic block size in the Reduction to Upper Hessenberg Form (gehrd)
    5.5.2 Performance of the gehrd routine
  5.6 Modeling the Performance of the TSOR Routines
  5.7 Summary

6 Conclusions
  6.1 Conclusions and Main Contributions
    6.1.1 BLAS-3 kernels
    6.1.2 BLAS-2 kernels
    6.1.3 TSOR routines on asymmetric multicore processors (AMPs) and models
    6.1.4 LU and Cholesky factorization on big.LITTLE
    6.1.5 Thread-level malleability
  6.2 Related Publications
    6.2.1 Directly related publications
      6.2.1.1 Chapter 3. Basic Linear Algebra Subprograms (BLAS)
      6.2.1.2 Chapter 4. Factorizations
      6.2.1.3 Chapter 5. Reductions
    6.2.2 Indirectly related publications
    6.2.3 Other publications
  6.3 Open Research Lines

7 Conclusiones
  7.1 Conclusiones y Contribuciones Principales
    7.1.1 Kernels BLAS-3
    7.1.2 Kernels BLAS-2
    7.1.3 Rutinas TSOR en AMPs y modelos
    7.1.4 Factorizaciones LU y Cholesky en big.LITTLE
    7.1.5 Maleabilidad a nivel de thread
  7.2 Publicaciones relacionadas
    7.2.1 Publicaciones directamente relacionadas
      7.2.1.1 Chapter 3. Basic Linear Algebra Subprograms (BLAS)
      7.2.1.2 Chapter 4. Factorizaciones
      7.2.1.3 Chapter 5. Reducciones
    7.2.2 Publicaciones indirectamente relacionadas
    7.2.3 Otras publicaciones

  7.3 Líneas abiertas de investigación

List of Figures

2.1 Exynos 5422 block diagram
2.2 ARM Juno SoC block diagram
2.3 Intel Xeon block diagram
2.4 PMLib diagram

3.1 High performance implementation of gemm in BLIS
3.2 Data movement involved in the BLIS implementation of gemm
3.3 Standard partitioning of the iteration space and assignment to threads/cores for a multi-threaded BLIS implementation
3.4 Performance and energy efficiency of the reference BLIS gemm
3.5 BLIS optimal cache configuration parameters mc and kc for the ARM Cortex-A15 and Cortex-A7 in the Samsung Exynos 5422 SoC
3.6 Performance and energy efficiency of the BLIS gemm using exclusively one type of core, for a varying number of threads
3.7 Power consumption of the BLIS gemm for a matrix size of 4,096 using exclusively one type of core, for a varying number of threads
3.8 Static-asymmetric partitioning of the iteration space and assignment to threads/cores for a multi-threaded BLIS implementation
3.9 Performance and energy-efficiency of the sas version of BLIS gemm
3.10 Performance and energy-efficiency of the sas and ca-sas versions of BLIS gemm
3.11 Performance and energy-efficiency of the ca-sas version of BLIS gemm
3.12 Performance and energy-efficiency of the ca-das and das versions of BLIS gemm
3.13 Performance of (general) matrix multiplication for different dimensions: gemm, gepp, gemp, gepm
3.14 Execution traces of gepm using the parallelization strategy D3S4 for a problem of dimension n = k = 2000 and m = 256
3.15 Performance of gepm for different cache configuration parameters mc
3.16 Performance of two rectangular cases of symm, trmm, and trsm
3.17 Performance of a rectangular case of syrk and syr2k
3.18 Performance of BLAS-3 on the Juno SoC
3.19 High performance implementation of the gemv kernel in BLIS
3.20 Implementation of symv in BLIS

3.21 Impact of the use of architecture-aware micro-kernels on the performance of symv and gemv on the Exynos 5422 SoC
3.22 Dynamic workload distribution of Loop 1 in symv between one big core and one LITTLE core
3.23 Parallel performance of symv and gemv on the Exynos 5422 SoC
3.24 Parallel performance of gemv and symv with square operands on the Exynos 5422 SoC
3.25 Parallel performance of gemv with rectangular operands on the Exynos 5422 SoC

4.1 Performance and energy efficiency of potrf for the solution of s.p.d. linear systems
4.2 Performance and energy efficiency of getrf for the solution of general linear systems
4.3 Unblocked and blocked RL algorithms for the LU factorization
4.4 Exploitation of BDP in the blocked RL LU parallelization
4.5 Execution trace of the first four iterations of the blocked RL LU factorization with partial pivoting (n = 10,000)
4.6 Blocked RL algorithm enhanced with look-ahead for the LU factorization
4.7 Exploitation of TP+BDP in the blocked RL LU parallelization with look-ahead
4.8 Execution trace of the first four iterations of the blocked RL LU factorization with partial pivoting, enhanced with look-ahead (n = 10,000)
4.9 Execution trace of the first four iterations of the blocked RL LU factorization with partial pivoting, enhanced with look-ahead (n = 2,000)
4.10 Distribution of the workload among t = 3 threads when Loop 4 of BLIS gemm is parallelized
4.11 Exploitation of TP+BDP in the blocked RL LU parallelization with look-ahead and WS
4.12 Execution trace of the first four iterations of the blocked RL LU factorization with partial pivoting, enhanced with look-ahead and malleable BLIS (n = 10,000)
4.13 Outer vs inner LU and use of algorithmic block sizes
4.14 Exploitation of TP+BDP in the blocked RL LU parallelization with look-ahead and ET
4.15 Execution trace of the first four iterations of the blocked RL LU factorization with partial pivoting, enhanced with look-ahead and early termination (n = 2,000)
4.16 GFLOPS attained with gemm (left) and ratio of flops performed in the panel factorizations normalized to the total cost (right)
4.17 Optimal block size of the blocked RL algorithms for the LU factorization
4.18 Performance comparison of the blocked RL algorithms for the LU factorization (except LU OS)
4.19 Performance comparison between the OmpSs implementation and the blocked RL algorithm for the LU factorization with look-ahead, malleable BLIS and ET
4.20 Optimal block size of the blocked ET and OmpSs algorithms for the LU factorization
4.21 Performance comparison between the OmpSs implementation and the blocked RL algorithm for the LU factorization with look-ahead, malleable BLIS and ET
4.22 Performance of the conventional parallelization of the blocked RL algorithm with static look-ahead enhanced with MTL and WS/ET for different gemm configurations
4.23 Performance of the conventional parallelization of the blocked RL algorithm with static look-ahead enhanced with MTL and WS/ET for different trsm configurations
4.24 Performance of the conventional parallelization of the blocked RL algorithm with different laswp parallelizations and static look-ahead enhanced with MTL and WS/ET

4.25 Performance of the OmpSs parallelization of the blocked RL algorithm for LU (dynamic look-ahead)
4.26 Performance of the best parallel configuration for each variant: conventional blocked RL algorithm, static look-ahead version enhanced with MTL and WS/ET, and runtime-based implementation

5.1 Performance of Level-2 and Level-3 BLAS on the Exynos 5422 SoC
5.2 Performance of Level-2 and Level-3 BLAS on the Exynos 5422 SoC
5.3 Performance of sytrd on a single Cortex-A15 core within the Exynos 5422 SoC using different algorithmic block sizes
5.4 Profile of execution time spent by sytrd on a single ARM Cortex-A15 core embedded in the Exynos 5422 SoC
5.5 Performance of sytrd on the Exynos 5422 SoC
5.6 Performance of gebrd on a single Cortex-A15 core within the Exynos 5422 SoC using different algorithmic block sizes
5.7 Profile of execution time spent by gebrd on a single ARM Cortex-A15 core embedded in the Exynos 5422 SoC
5.8 Performance of gebrd on the Exynos 5422 SoC
5.9 Performance of gehrd on a single Cortex-A15 core within the Exynos 5422 SoC using different algorithmic block sizes
5.10 Profile of execution time spent by gehrd on a single ARM Cortex-A15 core embedded in the Exynos 5422 SoC
5.11 Performance of gehrd on the Exynos 5422 SoC
5.12 Model-driven estimation of the relative execution time and actual performance

List of Tables

2.1 Size of NEON operations on ARM architectures

3.1 Kernels of BLIS-3
3.2 Kernels of BLIS-2
3.3 Parameters for optimal performance of the Level-3 kernels in BLIS on the Exynos 5422 SoC using real single-precision IEEE arithmetic

4.1 Performance of matrix factorizations for the solution of s.p.d. and general linear systems (potrf)
4.2 Performance of matrix factorizations for the solution of general linear systems (getrf)

5.1 Parameters for optimal performance of the Level-3 kernels in BLIS on the ARMv7 big.LITTLE embedded in the Exynos 5422 SoC using real single-precision IEEE arithmetic
5.2 Parameters for optimal performance of the Level-2 kernels in BLIS on the Exynos 5422 SoC using real single-precision IEEE arithmetic
5.3 Differences of time for sytrd between executions using the optimal block size determined by the model and the real optimal value detected

I’m not telling you it is going to be easy, I’m telling you it’s going to be worth it.

Nicholas Negroponte

Summary

Nowadays, there exists a large variety of scientific, industrial and engineering applications that require high computational power and storage, and their demands continue to grow; in order to obtain more precise solutions in these applications, scientists need to elaborate and work with ever more sophisticated and complex physical and mathematical models. Consequently, the capacity of new data processing systems and High Performance Computing (HPC) centers is saturated shortly after their set-up [2, 17, 44, 50]. Nonetheless, these resources do not go to waste: scientific computation (or computational sciences, that is, the elaboration of mathematical models and the use of computers to analyze and solve scientific problems) is an effective tool in scientific discovery, complementary to the more traditional methods based on theory and experimentation [44, 50].

Large-scale HPC systems are large energy consumers, as both their computing resources and the auxiliary systems required to operate them draw power [44, 49, 51, 74]. This consumption has a direct impact on the operational and maintenance costs of computing centers. However, the cost of electricity is not the only problem; in general, energy consumption turns into carbon emissions that are dangerous to the environment and public health, and the heat reduces the reliability of the hardware [51]. The situation requires additional measures: studying the Green500 list of June 2017 [1] we can see that, nowadays, the most efficient HPC systems in terms of power consumption attain 14,110 MFLOPS per Watt (MFLOPS/W). A simple calculation reveals that reaching the EXAFLOPS rate with the current technology would require approximately 70.9 MW, with an approximate cost of 70.9 million dollars per year in electricity alone. Although the EXAFLOPS challenge will unleash innovative scientific discoveries, it is also true that more efficient hardware and software technologies are required from the energy point of view [6, 13, 18].

The pressure from HPC centers has forced hardware manufacturers to improve their designs to increase energy efficiency: the Central Processing Unit (CPU), memory and disks (three of the largest energy consumers in computing systems, the remaining ones being the interconnection network and the power supply) integrate energy-saving strategies, based on the transition of the system to low-power states or on the dynamic reduction of frequency and voltage (DVFS, or Dynamic Voltage and Frequency Scaling). On the other hand, software systems, communication libraries and, especially, computational libraries and application codes running in HPC centers have been, in general, oblivious to energy consumption. The Top500 [4] list is a good example. Computers in this list are classified according to the sustained performance (in floating-point arithmetic operations per second (FLOPS)) that the Linpack benchmark attains (basically, the solution of a dense linear system of large dimension). However, the numerical method behind this test, the LU factorization, is far from representative of the real performance attained by most scientific codes [18].
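The 70.9 MW figure follows directly from the quoted Green500 efficiency; as a sketch of the arithmetic, and assuming (as later in Section 2.3.1) a cost of roughly one million dollars per MW and year:

```latex
% Power required to sustain 1 EXAFLOPS at the June 2017 Green500 efficiency:
\[
P \;\approx\; \frac{10^{18}\ \mathrm{FLOPS}}{14{,}110 \times 10^{6}\ \mathrm{FLOPS/W}}
  \;\approx\; 70.9 \times 10^{6}\ \mathrm{W} \;=\; 70.9\ \mathrm{MW},
\]
% which, at about one million dollars per MW-year, amounts to roughly
% 70.9 million dollars per year in electricity.
```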

Even though this is a mature topic in other segments, the development of energy-aware solutions for HPC applications, which optimize both the execution time and the energy consumption, is still only in its early stages, despite the huge benefits that it may produce [6, 51]. The HPC community is now aware of the energy costs, as demonstrated by the creation of the Green500 [1] list.

As a response to this situation, the general objective of this thesis is the study, design, development and experimental analysis of energy-aware solutions for the execution of scientific and engineering numerical applications on low-power architectures, more specifically asymmetric platforms. With the aim of demonstrating the benefits of these contributions, we selected diverse dense linear algebra operations that arise in very different areas, such as image processing, molecular dynamics simulation, and big data analytics, among others.

Resumen

En la actualidad existe una gran variedad de aplicaciones científicas, industriales y de ingeniería que requieren un alto poder computacional y de almacenamiento, y su demanda continúa creciendo; para obtener soluciones más precisas en estas aplicaciones, los científicos necesitan elaborar y trabajar con modelos físicos y matemáticos mucho más sofisticados. Como consecuencia, los nuevos equipos de procesamiento de datos y los centros de Computación de Altas Prestaciones (CAP) se saturan a las pocas semanas de haber sido puestos en funcionamiento [2, 17, 44, 50]. Todos estos recursos no se desaprovechan: la computación científica (o ciencias de la computación, es decir, la elaboración de modelos matemáticos y el uso de computadoras para analizar y resolver problemas científicos) es una herramienta eficaz para el descubrimiento científico, complementaria a los métodos más tradicionales basados en la teoría y la experimentación [44, 50].

Los sistemas de CAP de gran escala son grandes consumidores de energía, que emplean los recursos de computación y los sistemas auxiliares para funcionar [44, 49, 51, 74]. Este consumo tiene un impacto directo en los costes de funcionamiento y mantenimiento de los centros de computación, poniendo en peligro su existencia y perjudicando la puesta en marcha de nuevas instalaciones. Sin embargo, el coste de la electricidad no es el único problema; en general, el consumo de energía resulta en emisiones de dióxido de carbono que son un peligro para el medio ambiente y la salud pública, y el calor reduce la fiabilidad de los componentes de hardware [51]. La situación requiere medidas adicionales: estudiando la lista Green500 de junio de 2017 [1] se puede observar que, a día de hoy, los sistemas de CAP más eficientes en cuanto a potencia consumida ofrecen 14.110 MFLOPS por vatio (MFLOPS/W). Así, un simple cálculo revela que alcanzar la tasa de EXAFLOPS con la tecnología actual requeriría, aproximadamente, 70,9 MW, con un coste también aproximado de 70,9 millones de dólares por año solo en electricidad. Aunque el desafío del EXAFLOPS hará, sin duda, que se produzcan descubrimientos científicos innovadores, también es cierto que se necesita que la tecnología hardware y el software del sistema sean más eficientes desde el punto de vista energético [6, 13, 18].

La presión de los centros de CAP ha obligado a los fabricantes de hardware a mejorar sus diseños para conseguir una mejora en la eficiencia energética: Unidad Central de Procesamiento (CPU), memoria y discos (tres de los grandes consumidores de energía en los sistemas informáticos, siendo los otros dos la red de interconexión y la fuente de alimentación) cuentan con algunas estrategias de ahorro, basadas en la transición del sistema a algún estado de bajo consumo o en la reducción de la frecuencia y el voltaje de forma dinámica (DVFS o Dynamic Voltage Frequency Scaling). Por otro lado, los sistemas software, las bibliotecas de comunicación y, en especial, las bibliotecas computacionales y los códigos de aplicaciones usados en los centros de CAP han sido, en general,

insensibles al consumo de energía. La lista Top500 [4] es un claro ejemplo. Los computadores de esta lista quedan clasificados en función del rendimiento sostenido (en operaciones aritméticas en coma flotante por segundo (FLOPS)) que alcanzan en el test Linpack (básicamente, la solución de un sistema lineal denso de dimensión escalable). Sin embargo, el método numérico que hay detrás de este test, la factorización LU, es poco representativo del rendimiento real conseguido por la mayoría de códigos científicos [18].

Si bien es un tema maduro en otros segmentos, el desarrollo de soluciones conscientes de la energía para las aplicaciones de CAP, que optimizan tanto el tiempo de ejecución como la conservación de la energía, se encuentra todavía en sus inicios, a pesar de los enormes beneficios que puede producir [6, 51]. La comunidad que trabaja en CAP ha empezado a tomar consciencia de los costes energéticos, y así se ha creado una clasificación como la lista Green500 [1], que utiliza este tipo de métricas para comparar y clasificar los supercomputadores en el mundo de acuerdo a su eficiencia energética.

En respuesta a esta situación, el objetivo general de esta tesis es el estudio, diseño, desarrollo y análisis experimental de soluciones conscientes de la proporcionalidad energética para aplicaciones científicas y de ingeniería numéricas sobre arquitecturas de bajo consumo, concretamente en plataformas asimétricas. Con el propósito de demostrar los beneficios de estas aportaciones, se han escogido distintas operaciones de álgebra lineal presentes en aplicaciones pertenecientes a campos tan distintos como el procesado de imágenes, la simulación de dinámica molecular o el análisis de grandes volúmenes de datos.

Agradecimientos

El tiempo pasa volando y prueba de ello han sido los últimos años que han dado lugar a esta tesis doctoral. Ha sido un periodo intenso, en el que el esfuerzo, la dedicación y el sacrificio han sido claves; pero todo ha sido posible gracias a la motivación y satisfacción de trabajar en lo que más me gusta. Ha sido una etapa llena de retos y experiencias, tanto a nivel personal como profesional, lo que me ha permitido disfrutar de la novedad de las circunstancias. Sin embargo, nada de esto hubiese sido posible sin la ayuda de las personas que me han brindado la oportunidad de trabajar con ellas, y las que, desinteresadamente, me han acompañado en cada momento de esta aventura. A todas ellas, me gustaría expresar mi más sincero agradecimiento.

A mis directores, Enrique S. Quintana Ortí y Rafael Rodríguez Sánchez. A Enrique, por su apoyo, por ser una fuente infinita de conocimientos y trabajador incansable. A Rafa, por su dedicación y su ánimo en mis malos momentos.

A las personas del grupo HPC&A de la Universitat Jaume I, a las que en algún momento han estado en él y a sus visitantes, José, José Manuel, Sergio B., Asun, Maribel, Juan Carlos, Germán L., Germán F., Merche, Rafa M., Alfredo, Manel, Toni, Fran, José Antonio, Maria, Rocío, Sergio I., Héctor, Adrián, Sisco, Sonia, Andrés, Goran, José Ramón y Pedro. Gracias por el buen ambiente y por vuestra colaboración. También quiero agradecer su ayuda a los técnicos del Departamento de Ingeniería y Ciencia de los Computadores, Gustavo y Vicente.

A Costas Bekas, por darme la oportunidad de conocer de primera mano la investigación en el entorno empresarial. A Cristiano, por su ayuda, guía y atención. También a los internos de verano de IBM, que hicieron de la estancia una experiencia inolvidable. A Robert van de Geijn y su grupo, por ofrecerme la oportunidad de integrarme en su entorno de trabajo y por su contribución en la tesis. Asimismo, me gustaría dar las gracias por su hospitalidad a los compañeros del grupo Programming Models del BSC, especialmente a Vicenç.

A mis amigos, por las conversaciones infinitas dándome apoyo y sabios consejos.

A mi familia y muy especialmente a mis padres, Manolo y María José, por su apoyo, comprensión y paciencia, por estar ahí siempre.

A Rubén, por hacer de mi proyecto parte del suyo. Por sus ánimos en los momentos de frustración y por compartir los logros.

– ¡Gracias! · Gr`acies! · Thanks! · Danke! –

CHAPTER 1

Introduction

For many decades, the evolution of computer technologies and architectures has enlarged the performance gap between the throughput of processors and the memory access rate [63]. While the difference is no longer growing at a significant rate, the scenario that we face nowadays is that, in order to deliver very high performance, programmers have to explicitly take into account the cost of data movements. This is particularly the case for dense linear algebra (DLA) operations, for which there exist highly tuned implementations for almost any architecture, from the vector processors of the past to the current multicore designs, and the fancier graphics processing units and co-processors such as the Intel Xeon Phi.

This dissertation targets two important problems. The first one is the design of low-level DLA kernels for architectures comprising two (or more) classes of cores. The main question we have to address here is how to attain a balanced distribution of the computational workload among the heterogeneous cores while taking into account that some of the resources, in particular cache levels, are either shared or private. The second problem is partially related to the first one. Concretely, this dissertation explores an alternative to runtime-based systems in order to extract “sufficient” parallelism from complex DLA operations while making an efficient use of the cache hierarchy of the architecture.

1.1 Motivation

Asymmetry-aware BLAS. DLA is at the bottom of the “food chain” for many scientific and engineering applications, which require kernels to tackle linear systems, linear least squares problems or eigenvalue computations, among others [38]. To address this, the scientific community reacted by creating the Basic Linear Algebra Subprograms (BLAS) [47] in order to define standard domain-specific interfaces for a few dozen fundamental DLA operations (such as the vector norm, the matrix-vector product, and the matrix-matrix multiplication) and improve performance portability across a wide range of computer architectures.

Highly-tuned implementations of BLAS (for example, Intel MKL [66], IBM ESSL [64], NVIDIA cuBLAS [81], GotoBLAS [57, 58], OpenBLAS [97], ATLAS [96] or BLIS [95]) aim to deliver high performance by carefully orchestrating the movement of data across the memory hierarchy to hide this overhead. To attain this goal, these implementations interleave data packing operations with computations to maximize the access to data that is closer to the arithmetic units and favour the use of vector instructions. When the target architecture comprises multiple cores, these instances of BLAS also exploit loop-parallelism to distribute the workload in a balanced manner, ensuring that the cores make an efficient use of their private cache memories, and collaborate in the access to the shared cache levels instead of competing for them.

Asymmetric processors are heterogeneous architectures in principle conceived as low-power designs for embedded systems. However, as we progress along the final steps of Moore's Law, the constraints imposed by the end of Dennard scaling [40] and the high energy proportionality [17] of asymmetric processors are promoting them into a building block to assemble large-scale facilities for high performance computing [3]. Asymmetry can also appear in a conventional multicore processor when some of the cores are forced to operate at a lower frequency, for example due to constraints imposed by the processor's power budget.

From the point of view of a parallel BLAS, asymmetric processors present the difficulty of distributing the work among a heterogeneous collection of resources (in practice, two classes). In order to attain high performance, the scheduling then has to take into consideration the distinct computational capabilities of the cores (e.g., due to different width/dimension of the floating-point units or different core frequency). In addition, for processors that are physically asymmetric, the parallelization also has to consider the distinct cache configurations. As we will expose in this dissertation, even for simple and regular operations such as those composing the BLAS, the challenge is far from trivial when the goal is to squeeze the last few drops of floating-point performance.
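As a minimal illustration of this scheduling problem (and not of the concrete schemes developed later in this dissertation), the following C sketch splits the iteration space of a parallel loop between a big and a LITTLE cluster in proportion to an assumed per-core throughput ratio; the 3:1 ratio used in the example is a placeholder, not a measured value.

```c
#include <stdio.h>

/* Illustrative only: split n loop iterations between a "big" and a "LITTLE"
 * cluster proportionally to an assumed throughput ratio. The relative speeds
 * are placeholders, not values measured in this thesis. */
void split_iterations(int n, int big_cores, int little_cores,
                      double big_speed, double little_speed,
                      int *n_big, int *n_little)
{
    double total = big_cores * big_speed + little_cores * little_speed;
    *n_big    = (int)(n * (big_cores * big_speed) / total + 0.5);
    *n_little = n - *n_big;   /* remainder goes to the LITTLE cluster */
}

int main(void)
{
    int nb, nl;
    /* 4 big + 4 LITTLE cores, big cores assumed ~3x faster. */
    split_iterations(4096, 4, 4, 3.0, 1.0, &nb, &nl);
    printf("big cluster: %d iterations, LITTLE cluster: %d iterations\n", nb, nl);
    return 0;
}
```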

Cache-aware matrix factorizations. The Linear Algebra Package (LAPACK) [11] specifies a collection of DLA operations, built on top of the BLAS, with a functionality that is closer to the application level. For example, LAPACK includes driver routines for the solution of linear systems, the calculation of the singular value decomposition, the computation of the eigenvalues of a symmetric matrix, etc. For multicore processors, the conventional approach to exploit parallelism from LAPACK has relied, for many years, on the use of a multi-threaded instance of the BLAS. This solution exerts a strict control over the data movements and can be expected to make an extremely efficient use of the cache memories. Unfortunately, for complex DLA operations, the exploitation of parallelism only within the BLAS constrains the concurrency that can be leveraged by imposing an artificial fork-join model of execution. Specifically, with this solution, parallelism does not expand across multiple invocations of BLAS kernels even if they are independent and, therefore, could be executed in parallel.

The increase in hardware concurrency of multicore processors in recent years, at the pace dictated by Moore's Law, has led to the development of parallel versions of some DLA operations that exploit task-parallelism via a runtime. (See, for example, the efforts with OmpSs [48], PLASMA-Quark [5], StarPU [91] and libflame-SuperMatrix [52].) In some detail, these task-parallel approaches decompose a DLA operation into a collection of fine-grained tasks with dependencies, and issue the execution of each task to a single core, simultaneously executing independent tasks on different cores. The runtime-based solutions are better equipped to tackle the increasing number of cores of current and future architectures, because they leverage the natural concurrency that is present in the application. However, with this type of solution, the cores ferociously compete for the shared memory resources and may not completely amortize the overheads of BLAS (e.g., for packing) due to the use of fine-grain tasks.

In the mixed scenario described in the above paragraphs, this dissertation investigates whether it is possible to extract nested task- and loop-parallelism to feed all the cores of a current multi-socket architecture. The complementary objective is to make an efficient use of the cache hierarchy. In the framework of our study, the target is obviously a DLA operation.
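To fix ideas about the task-parallel style referred to above, the sketch below expresses a tiled Cholesky factorization with OpenMP task dependences as a stand-in for runtimes such as OmpSs or StarPU; it is an illustrative example under simplifying assumptions (square column-major tiles accessed through an array of pointers, LAPACKE/CBLAS available, no error handling), not code from this thesis.

```c
#include <cblas.h>
#include <lapacke.h>

/* Tiled Cholesky (lower triangular) driven by OpenMP task dependences.
 * A[i][j] points to a b x b column-major tile; nt is the number of tile
 * rows/columns. Each kernel call becomes a task; the runtime schedules
 * independent tasks on different cores. */
void chol_tasks(int nt, int b, double *A[nt][nt])
{
    #pragma omp parallel
    #pragma omp single
    for (int k = 0; k < nt; k++) {
        #pragma omp task depend(inout: A[k][k])
        LAPACKE_dpotrf(LAPACK_COL_MAJOR, 'L', b, A[k][k], b);

        for (int i = k + 1; i < nt; i++) {              /* panel tiles */
            #pragma omp task depend(in: A[k][k]) depend(inout: A[i][k])
            cblas_dtrsm(CblasColMajor, CblasRight, CblasLower, CblasTrans,
                        CblasNonUnit, b, b, 1.0, A[k][k], b, A[i][k], b);
        }
        for (int i = k + 1; i < nt; i++) {              /* trailing update */
            #pragma omp task depend(in: A[i][k]) depend(inout: A[i][i])
            cblas_dsyrk(CblasColMajor, CblasLower, CblasNoTrans,
                        b, b, -1.0, A[i][k], b, 1.0, A[i][i], b);
            for (int j = k + 1; j < i; j++) {
                #pragma omp task depend(in: A[i][k], A[j][k]) depend(inout: A[i][j])
                cblas_dgemm(CblasColMajor, CblasNoTrans, CblasTrans, b, b, b,
                            -1.0, A[i][k], b, A[j][k], b, 1.0, A[i][j], b);
            }
        }
    }
}
```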

1.2 Contributions

This dissertation makes the following contributions to advance the state-of-the-art on the high performance implementation of DLA operations:

• We introduce efficient multi-threaded implementations of some kernels of the Level-2 BLAS and the complete Level-3 BLAS for asymmetric architectures composed of two types of cores with the following properties:

– Our solution leverages the multi-threaded implementation of the matrix-matrix multiplication and the sequential implementation of the matrix-vector product kernels in the BLIS library [95], which decompose the operation into a collection of nested loops around a micro-kernel. Starting from these reference codes, we propose a modification of the loop stride configuration and scheduling to distribute the workload comprised by certain loops among the two core types while taking into account their distinct computational performance and cache organization.
– We demonstrate the generality of the approach by applying the same parallelization principles to develop tuned versions of the BLAS for 32-bit and 64-bit ARM big.LITTLE processors consisting, respectively, of 4+4 and 4+2 (slow+fast) cores. The kernels not only distinguish between different operations, paying special attention to the parallelization of the triangular system solve, but also take into consideration the operands' dimensions (shapes).
– The correctness of the new asymmetry-aware BLAS is validated by integrating them with the reference implementation of LAPACK from the netlib public repository as well as with some new routines for the reduction to compact band forms involved in the solution of symmetric eigenvalue problems and the computation of the singular value decomposition (SVD).

• We propose a novel nested parallelization strategy for dense matrix factorizations that applies a static look-ahead to expose a certain degree of task concurrency in combination with the use of a multi-threaded implementation of the BLAS in order to attain an efficient use of the cache system. In more detail, our solution makes the following specific contributions:

– We demonstrate the benefits of this nested approach by targeting the LU factorization with partial pivoting, an operation that is representative of many other matrix decompositions which are crucial for the solution of key DLA problems. Furthermore, our solution can be expected to carry over to any matrix factorization that is composed of a loop that iterates over the columns/rows of a matrix, with a loop body that comprises a “sequential” panel factorization and a highly parallel trailing update. This is the case of the blocked right-looking variants for the LU, QR, Cholesky and LDLT factorizations and, to some extent, also of some decompositions for the reduction to compact forms that are employed in the solution of symmetric eigenvalue problems and the computation of the singular value decomposition (SVD).
– We introduce a malleable thread-level implementation of BLIS that makes it possible to change the number of threads that participate in the execution of a BLAS kernel at execution time. This technique departs from the current, inflexible solution adopted in all implementations of the BLAS, which presuppose that, once issued, the number of threads participating in the execution of a BLAS kernel will not vary.
– In case the panel factorization is less expensive than the update of the trailing submatrix, we leverage the malleable instance of the BLAS to improve workload balancing and performance, by allowing the thread team in charge of the panel factorization to be re-allocated to the execution of the trailing update.
– To tackle the opposite case, where the panel factorization is more expensive than the update of the trailing submatrix, we design an early termination (ET) mechanism that allows the thread team in charge of the trailing update to notify the alternative team of this event. This alert forces an ET of the panel factorization and the advance of the factorization into the next iteration.

1.3 Organization

This chapter presents the motivation and main contributions of this thesis. In addition, the structure of the whole document is detailed next.

Chapter 2 reviews the current state of the Dense Linear Algebra (DLA) libraries BLAS and LAPACK. The evolution, structure and main features of those libraries are presented in this chapter. The systems used in this thesis, the multi-core Intel Xeon E5-2603 and the asymmetric multi-core platforms Odroid-XU3 and Juno, are described next. That chapter also includes the description of the PMLib framework (used for the power consumption measurements) and its interaction with tracing and visualization tools.

In Chapter 3, the design, implementation and evaluation of BLAS-2 and BLAS-3 routines for asymmetric multi-core processors are carried out. Routines gemm and gemv are studied as the reference kernels for the corresponding levels, BLAS-3 and BLAS-2 respectively. Moreover, we identify the flaws of the initial asymmetry-aware implementation and, after a thorough performance analysis, present the selected solution in each case.

Chapter 4 analyzes several advanced operations for DLA. The results of the direct migration of the Cholesky and LU factorizations relying on the asymmetry-aware version of BLAS-3 are presented. That chapter includes the proposal of a new technique, thread-level malleability, which allows a more efficient use of the resources. The implementation of the technique and the evaluation of its benefits are presented there.

In Chapter 5, the study of LAPACK routines in combination with the asymmetry-aware implementation of BLIS is extended in order to expose the benefits of the BLAS-2 asymmetry-aware version. To this end, we evaluate three routines that perform two-sided orthogonal reductions (TSOR) of a dense matrix to a condensed form. In addition, we present a model in order to select specific parameters that may increase the performance of the TSOR routines. The study presented in this chapter completes the validation of the new asymmetry-aware BLAS, started in Chapter 3 for BLAS-3, by integrating the new version of the library with the reference implementation of LAPACK.

Finally, in Chapter 6, we present the general conclusions of the thesis, the main results obtained as a list of publications, and a collection of open research lines.

CHAPTER 2

Background

In this chapter we offer a short review of the DLA libraries, tools and platforms used in this thesis. DLA operations are classified into two categories: basic and advanced. Basic DLA operations are implemented in the Basic Linear Algebra Subprograms (BLAS) library, while LAPACK supports more sophisticated functionality. The first part of this chapter elaborates on the history of both libraries, their structure and main features. Then, the platforms used in the thesis are described. Finally, the PMLib framework, used to collect power samples in our tests, is presented.

2.1 Basic Linear Algebra Subprograms (BLAS)

Fundamental problems of linear algebra, such as the solution of linear (equation) systems or the computation of eigenvalues or singular values, underlie a large number of scientific and engineering applications. These problems are found in very different fields such as structural computations, automatic control, integrated circuit design, and chemistry simulations, to name only a few. In the same vein, while solving these problems, a small set of basic operations is often used, e.g., a scalar-vector multiplication, a triangular system solve or a matrix-matrix multiplication. In this scenario, BLAS specifies a set of routines that perform these basic DLA operations, defining the functionality and interface of each routine of the library. The specification of BLAS was carried out by experts from different fields [47] which, as the result of an interdisciplinary collaboration, makes it especially valuable. BLAS is a compromise between functionality and simplicity. On the one hand, it defines a reasonable number of routines with a limited number of parameters; on the other hand, it provides a wide functionality.

There exists a reference implementation of BLAS published on the Internet [78], but the main strength of BLAS lies in its different implementations for specific architectures. Since the very beginning, the development of optimized implementations of BLAS routines has been expected of processor manufacturers. As a result, there are proprietary implementations from AMD (AMD Core Math Library (ACML)), Intel (MKL), IBM (ESSL) and Nvidia (cuBLAS). In addition, there are also open-source implementations such as GotoBLAS, OpenBLAS, ATLAS, and BLIS. In general, these libraries implement each routine taking into account the organization of the target architecture in order to exploit its resources and optimize performance.


The first implementation of BLAS [72] was developed in the 1970s and it has been relevant to the solution of dense linear algebra problems ever since. Due to its reliability, robustness and efficiency, several libraries were designed to use BLAS routines internally. In addition to reliability and efficiency, BLAS presents the following advantages:

• Code readability: the name of the routines express their functionality.

• Portability and efficiency: when migrating the code to a new platform, the efficiency will remain reasonable as long as a specific version of BLAS for the new architecture is used.

• Documentation: all BLAS routines are thoroughly documented.

The most common implementations of BLAS are written in C and Fortran. However, a limited amount of assembly code is sometimes present in order to generate more efficient code.

2.1.1 Levels of BLAS

At the beginning of the development of BLAS, in the early 1970s, the most powerful computers integrated vector processors. The original BLAS were designed with this type of processor as the target, creating a small set of operations that worked on vectors (Level-1 BLAS or BLAS-1). The main objective of that specification was to foster the development of optimized implementations depending on the target architecture. In this way, from the very beginning, the BLAS creators defended the importance of “outsourcing” the implementation of efficient basic operations to the architecture designers, who were able to generate more robust and efficient code [71]. The initial set of routines comprised by the BLAS-1 provided a reduced functionality, since it only supported a limited number of vector operations.

In 1987, the BLAS were extended with a new set of routines that constitute the second level of BLAS (BLAS-2). Those new routines implement matrix-vector operations, involving a quadratic amount of arithmetic operations and data. A public reference implementation of the BLAS-2 was released [46]. It is worth highlighting that, in that implementation, matrices are stored by columns, following the Fortran storage convention. As done for BLAS-1, manufacturers were encouraged to create specific implementations and, to this end, some advice was given, such as guidelines to adapt the inner loop of each routine to the specific architecture, the use of assembly code, or compiler directives for the target architecture.

Eventually, the growing disparity between the processor frequency and the memory bandwidth resulted in the design of architectures with multiple levels of cache memory (memory hierarchy). With the new memory structure, the BLAS-1 and BLAS-2 were not able to attain a reasonable level of performance, since in both cases the ratio between the number of operations and the data movement is O(1), while the gap between the processor and memory throughput is much higher. Consequently, the performance of BLAS-1 and BLAS-2 is constrained by the speed of the data transfers from memory.

To tackle this problem, a third set of routines was defined in 1989, the Level-3 BLAS (BLAS-3) [47], which implemented operations with computations of cubic order versus data movements of quadratic order. The difference between the number of computations and memory data transfers, when using a properly designed algorithm, makes it possible to exploit the principle of locality in architectures with a hierarchical organization of the memory, hiding the memory access latency and attaining performance closer to the processor's peak. From the algorithmic point of view, this is achieved via so-called blocked algorithms. These algorithms divide the matrix into submatrices (or blocks) and carefully orchestrate the movement of these blocks across the memory hierarchy in order to increase the probability of finding the data in those memory levels that are closer to the processor. Hence, blocked algorithms attain higher performance by making a better use of the memory hierarchy.

As was the case for the previous levels of BLAS, the third level of BLAS presents a compromise between complexity and functionality. An example of simplicity is that there are no specific routines for trapezoidal matrices, since this would increase the number of parameters of the routines. However, in order to increase functionality, implicitly transposed operands are considered because, otherwise, the user would have to transpose the matrix beforehand, an operation that may be highly expensive due to the amount of memory accesses.

In summary, the BLAS routines are organized into three levels, named after the number of computations performed in each one:

• Level 1: Operations with vectors (linear order of computations).

• Level 2: Matrix-vector operations (quadratic order of computations).

• Level 3: Matrix-matrix operations (cubic order of computations).

Regarding performance, the differences among the three levels are due to the ratio between the number of computations and the amount of data. This ratio is crucial in architectures with a hierarchical organization of memory since, when the number of operations is higher than the amount of data accessed, it is possible to perform several operations per memory access and, consequently, increase the productivity. Therefore, the three levels are also defined according to the ratio between the number of computations and the data needed by the operation (a representative routine of each level is shown after the following list):

• Level 1: The number of computations and data grows linearly with the problem size.

• Level 2: The number of computations and data grows quadratically with the problem size.

• Level 3: The number of computations grows cubically with the problem size, while the amount of data grows only quadratically.
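As a concrete reference, the fragment below shows one representative routine of each level through the standard CBLAS interface; the particular choices (axpy, gemv, gemm) and the square dimensions are illustrative.

```c
#include <cblas.h>

/* One representative routine per BLAS level, using the standard CBLAS
 * interface (double precision, column-major). */
void blas_levels_example(int n, double *x, double *y,
                         double *A, double *B, double *C)
{
    /* Level 1: y := 2.0*x + y   (O(n) flops on O(n) data) */
    cblas_daxpy(n, 2.0, x, 1, y, 1);

    /* Level 2: y := A*x + y     (O(n^2) flops on O(n^2) data) */
    cblas_dgemv(CblasColMajor, CblasNoTrans, n, n,
                1.0, A, n, x, 1, 1.0, y, 1);

    /* Level 3: C := A*B + C     (O(n^3) flops on O(n^2) data) */
    cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans, n, n, n,
                1.0, A, n, B, n, 1.0, C, n);
}
```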

Multi-threaded versions of BLAS for multiprocessors with shared memory deserve special attention. Those versions implement multi-threaded code that distributes the computations among the processor cores in order to make an efficient use of the resources. Such implementations are especially useful for BLAS-3 routines. The blocked BLAS-3 routines have two additional advantages [45]. On the one hand, they allow different processors to work on distinct blocks simultaneously and, on the other hand, within the same block, operations on different scalars or vectors may be performed simultaneously. From the performance point of view, the following conclusions can be extracted:

• The performance of BLAS-1 and BLAS-2 is constrained by the memory bandwidth.

• BLAS-3 is the most efficient level, due to the fact that a higher number of computations can be performed per memory access. In addition, the BLAS-3 are more suitable for parallelization, providing significant improvements on multiprocessor architectures with shared memory.
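To make the blocking idea behind the Level-3 advantage concrete, the toy C kernel below computes C += A*B by sweeping over cache-sized blocks; it is only a sketch of the principle, since production libraries add data packing, vectorized micro-kernels and architecture-specific block sizes on top of it (the row-major layout and the generic block size bs are choices made for the example).

```c
/* Minimal illustration of a blocked (tiled) matrix-matrix product,
 * C += A*B, for square n x n row-major matrices and block size bs.
 * Each (bs x bs) block is reused while it is likely to reside in cache. */
void gemm_blocked(int n, int bs, const double *A, const double *B, double *C)
{
    for (int ii = 0; ii < n; ii += bs)
        for (int kk = 0; kk < n; kk += bs)
            for (int jj = 0; jj < n; jj += bs)
                /* multiply the (ii,kk) block of A by the (kk,jj) block of B */
                for (int i = ii; i < ii + bs && i < n; i++)
                    for (int k = kk; k < kk + bs && k < n; k++) {
                        double a = A[i*n + k];
                        for (int j = jj; j < jj + bs && j < n; j++)
                            C[i*n + j] += a * B[k*n + j];
                    }
}
```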

2.1.2 The BLAS library election

Among the great variety of BLAS implementations upon which to build, in this dissertation we selected the BLAS-like Library Instantiation Software Framework (BLIS) [95]. This instance of the BLAS, which will be described in detail in Chapter 3, is an open-source framework developed at The University of Texas at Austin that provides the infrastructure to implement BLAS-like operations.


The flexibility of this library is one of the main reasons behind its selection, along with the fact that its code is fully exposed, which allows the library to be modified at different points. Moreover, in terms of performance, BLIS is competitive with other commercial and open-source libraries. Due to these features, BLIS is the most suitable library, among those considered, to carry out the work towards the objectives of this dissertation.
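For reference, the fragment below sketches how a gemm could be invoked through the BLIS typed (BLAS-like) API; the parameter ordering follows the typed-API convention, but the exact prototypes should be checked against the documentation of the BLIS release in use. BLIS also offers a standard BLAS compatibility layer (e.g., dgemm_), so legacy codes can link against it unchanged.

```c
#include <blis.h>

/* Sketch: C := C + A*B through the BLIS typed API, with column-major
 * storage expressed via row/column strides (row stride 1, column stride
 * equal to the leading dimension). Illustrative, not production code. */
void gemm_via_blis(dim_t m, dim_t n, dim_t k,
                   double *A, double *B, double *C)
{
    double alpha = 1.0, beta = 1.0;
    bli_dgemm(BLIS_NO_TRANSPOSE, BLIS_NO_TRANSPOSE, m, n, k,
              &alpha, A, 1, m,
                      B, 1, k,
              &beta,  C, 1, m);
}
```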

2.2 LAPACK

LAPACK [79] is a library that provides routines to solve fundamental linear algebra problems using state-of-the-art numerical methods. In the same vein as BLAS, LAPACK provides support for dense matrices. However, it tackles more complex problems, for example, systems of linear equations as well as linear least squares (LLS), eigenvalue and singular value problems.

LAPACK emerged as the result of a project started in the late 1980s. The objective was to obtain a library that gathered the functionality and improved the performance of EISPACK [54] and LINPACK [39]. Those libraries, designed for vector processors, do not provide acceptable performance on current high performance processors, with pipelined functional units and hierarchical memories. The main reason for their inefficiency is the fact that they are based on BLAS-1 operations, which do not make an optimal use of the hierarchical memory for the reasons exposed earlier in this chapter. The performance improvements of LAPACK arise from a reorganization of the algorithms to make an intensive use of the efficient Level-3 BLAS. Regarding the integration of new algorithms, improvements were introduced in almost all the linear algebra problems supported by the library, with those applied to the solution of eigenvalue problems being especially relevant.

For multiprocessor systems, LAPACK extracts parallelism from within a parallel version of BLAS. In other words, LAPACK routines do not include any type of explicit parallelism in the code, but rely on a multithreaded implementation of BLAS for this purpose.
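As an illustration of this division of labour, the snippet below calls the LAPACK LU factorization through the LAPACKE C interface; any parallelism comes entirely from the BLAS implementation the program is linked against, with no change to the calling code. The wrapper name lu_factorize is introduced here only for the example.

```c
#include <lapacke.h>

/* LU factorization of a column-major n x n matrix A (leading dimension n).
 * dgetrf itself contains no explicit threading: when linked against a
 * multi-threaded BLAS, the parallelism comes from the Level-3 kernels
 * (mainly gemm and trsm) invoked inside the factorization. */
lapack_int lu_factorize(lapack_int n, double *A, lapack_int *ipiv)
{
    return LAPACKE_dgetrf(LAPACK_COL_MAJOR, n, n, A, n, ipiv);
}
```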

2.3 Target architectures

2.3.1 Power consumption

Supercomputers are key nowadays in solving challenging problems from very different areas (from engineering design and financial analysis to disaster prediction); however, they also incur a huge power consumption, not only due to the computation, but also for cooling. For decades, performance and price/performance have been the natural targets of the supercomputing community, looking for the fastest computer (in terms of FLOPS) as advocated in the TOP500 [4] list. In this sense, vast efforts have been made and supercomputers' performance has grown from 59.7 GFLOPS (billions of floating-point arithmetic operations per second) in 1993 to 93,014.6 TFLOPS in 2017.

The performance gains in supercomputers have been largely attained via the combined increase of the number of processors in the system, the number of transistors per processor, and the processor's frequency. Nevertheless, power consumption was not considered a main factor to take into account until the middle of the last decade. The reliable operation of supercomputers depends on the continuous cooling of the machine, which results in enormous costs due to electrical power consumption. For example, the Chinese Sunway TaihuLight, ranked as the top supercomputer in the June 2017 TOP500 list, consumes 15 MW. Considering that one MW costs approximately one million dollars per year, the cost of operating and cooling down that system is around 15 million dollars per year.


Energy consumption came officially into the picture a decade ago, when the Green500 [1] list was created. Since that moment, different approaches have been taken in order to improve the energy efficiency of supercomputers. The improvement in energy efficiency has been impressive over the last ten years, moving from 0.4 GFLOPS/W in 2008 to 11.1 GFLOPS/W in 2017. Part of this improvement is due to heterogeneity via hardware accelerators, as the introduction of Graphics Processing Units (GPUs) and manycore processors (e.g., the Intel Xeon Phi) added more energy-efficient resources to the traditional multi-core systems through the combination of cores of different nature. However, the supercomputing community is still looking for new alternatives in order to reduce energy consumption. For that reason, a new approach based on low-power architectures has been taken. Fujitsu recently announced the use of ARMv8-based cores for its first exascale supercomputer, the Post-K system [53]. Along the same lines, the MareNostrum 4 supercomputer [32], from the Barcelona Supercomputing Center, will include the same technology as an update of the current machine.

The Mont-Blanc project [3] is the reference example of the low-power architecture trend in supercomputing. This project, started in 2011, advocates for the assembly of supercomputers from low-power processors, more specifically mobile systems-on-chip (SoCs). Several prototypes have been created as an output of the project, starting with Tibidabo in 2011, which features Nvidia Tegra2 SoCs with two ARM Cortex-A9 cores at 1 GHz. The next prototype was Pedraforca (2013), a moderate-scale prototype with 78 Tegra3 nodes, each with 4 Cortex-A9 cores running at 1.3 GHz. The last prototype is Mont-Blanc (2015), which contains 1080 Exynos 5250 SoCs organized as 72 compute blades over two racks.

2.3.2 Asymmetric multi-core processors

Nowadays, diversity is the new trend in computing systems. Workloads of different nature run on a wide variety of systems (from mobile devices to datacenters) and require resources with distinct capabilities. Against this background, traditional multi-core systems are not enough; their sophisticated and highly optimized cores are usually underutilized by most workloads, which wastes energy. On the other hand, many-core systems that feature simpler cores with low power consumption may not be sufficient for workloads that require high performance. Heterogeneity emerged as a response to this situation, combining cores of different nature on the same platform in order to provide a variety of resources that meet the needs of distinct workloads.

A specific type of heterogeneity is that provided by AMPs, which combine two (or more) types of cores in the same processor. These platforms provide more flexibility, thanks to their processor specialization, which allows them to target either performance or energy efficiency. However, complexity in AMPs is increasing for different reasons, such as the variety of cores or the use of different instruction set architectures (ISAs), and, consequently, they are becoming harder to program. Moreover, diversity has made the terminology about AMPs confusing, due to the fact that many authors use different names to refer to the same type of platform. According to [77], our definition of an AMP matches Physical asymmetry in [90] and Performance asymmetry in [69]; that is, two sets of cores with the same ISA but different microarchitectures.

2.3.2.1 ARM big.LITTLE

All the experimentation carried out throughout this thesis has been performed on ARM big.LITTLE platforms. This technology, designed by ARM, integrates LITTLE (slower and energy-saving) processor cores with big (more powerful and power-hungry) cores. Both types of processors are memory coherent and share the same ISA, but feature different microarchitectures in order to deliver the performance and power characteristics required by the big.LITTLE concept. The big cores are more complex, apply out-of-order execution, and have a multi-issue pipeline. The LITTLE cores are simpler, execute in order, and have a simple multistage pipeline.

The big.LITTLE platforms used in our tests include ARM NEON [12] technology, a Single Instruction Multiple Data (SIMD) architecture extension. With this technology, registers are treated as vectors of elements (of the same data type) that are processed simultaneously, increasing the performance of the processor. The data types and the architectures that support them are presented in Table 2.1 (values marked with * refer to the ARMv8.2-A architecture).

Table 2.1: Size of NEON operations on ARM architectures.

                    Armv7-A/R              Armv8-A/R AArch32             Armv8-A AArch64
  Floating-point    32-bit                 16-bit*/32-bit                16-bit*/32-bit/64-bit
  Integer           8-bit/16-bit/32-bit    8-bit/16-bit/32-bit/64-bit    8-bit/16-bit/32-bit/64-bit
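As a minimal illustration of how NEON vector registers are used, the kernel below implements a single-precision axpy with 128-bit NEON intrinsics (four floats per operation); for brevity it assumes the vector length is a multiple of four.

```c
#include <arm_neon.h>

/* y := alpha*x + y on 32-bit floats, four elements per iteration.
 * On ARMv7 the same intrinsics map to the NEON units of both the
 * Cortex-A15 and the Cortex-A7 cores. */
void saxpy_neon(int n, float alpha, const float *x, float *y)
{
    float32x4_t va = vdupq_n_f32(alpha);           /* broadcast alpha  */
    for (int i = 0; i < n; i += 4) {
        float32x4_t vx = vld1q_f32(x + i);         /* load 4 x values  */
        float32x4_t vy = vld1q_f32(y + i);         /* load 4 y values  */
        vy = vmlaq_f32(vy, va, vx);                /* vy += alpha * vx */
        vst1q_f32(y + i, vy);                      /* store result     */
    }
}
```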

2.3.2.2 Odroid

ODROID refers to a series of single-board computers manufactured by Hardkernel Co. Ltd. that include some implementations of the big.LITTLE design announced by ARM in 2011. The first generation of ODROID that implemented the big.LITTLE architecture (ODROID-XU) was launched in 2013 and combined a single Cortex-A15 quad-core and a single Cortex-A7 quad-core (Samsung Exynos 5410). This platform did not allow the simultaneous operation of both clusters, an issue that was solved later with the ODROID-XU3.

For the tests presented in this thesis using an ODROID platform, we used either an ODROID-XU3 or an ODROID-XU4, both furnished with a Samsung Exynos 5422 SoC. The ODROID-XU4 is the evolution of the ODROID-XU3, with similar performance and energy efficiency but a smaller board. Moreover, the ODROID-XU3 includes power sensors that are not present on the ODROID-XU4 board. Hereafter, ODROID will be used to refer indistinctly to either of these platforms.

The ODROID SoC comprises an ARM Cortex-A15 quad-core processing cluster (big) plus an ARM Cortex-A7 quad-core processing cluster (LITTLE), both implementing the ARMv7 microarchitecture; that is, a 32-bit architecture. Each Cortex core has its own private 32-Kbyte L1 (data) cache. The four ARM Cortex-A15 cores share a 2-Mbyte L2 cache, and the four ARM Cortex-A7 cores share a smaller 512-Kbyte L2 cache; see Figure 2.1. Moreover, the two clusters access a common 2-Gbyte DDR3 RAM. Additionally, the Cortex-A7 cores may operate from 200 MHz to 1.4 GHz, while the Cortex-A15 cores may work from 200 MHz to 2 GHz.

2.3.2.3 Juno

Juno is the first ARMv8-A 64-bit development board, featuring a Cortex-A57 dual-core processing cluster and a Cortex-A53 quad-core processing cluster. Each Cortex-A57 core has a private 48+32-Kbyte L1 (instruction+data) cache, while each Cortex-A53 core has a private 32+32-Kbyte L1 (instruction+data) cache. The two Cortex-A57 cores share a 2-Mbyte L2 cache and the four Cortex-A53 cores share a smaller 1-Mbyte L2 cache; both clusters access a common 8-Gbyte DDR3 RAM (Figure 2.2). Furthermore, the Cortex-A53 cluster may operate from 450 MHz to 850 MHz and the Cortex-A57 cluster from 450 MHz to 1.1 GHz.


Figure 2.1: Exynos 5422 block diagram.

Figure 2.2: ARM Juno SoC block diagram.

2.3.3 Intel

For the experiments with the thread-level malleability technique, we have used an Intel Xeon E5-2603 v3. This platform features two homogeneous sockets with 6 cores each. Each core may run in the frequency range 1.2–2.4 GHz, and has a private 32-Kbyte L1 cache and a private 256-Kbyte L2 cache. All the cores in the same socket share a 20-Mbyte L3 cache. Moreover, the platform is furnished with 64 Gbytes of DDR3 RAM (Figure 2.3).

Although this platform is homogeneous (symmetric), it has been used to illustrate the advantages of thread-level malleability in a simpler setting. The initial approach tested on this platform was later adapted and ported to the ODROID platforms as part of this dissertation.

2.4 Measuring power consumption

Obtaining power dissipation measurements has been key to analyzing the behavior of the target DLA library on the different platforms. To do so, we have used PMLib, a portable framework developed at Universitat Jaume I and designed to interact with power measurement units.


Figure 2.3: Intel Xeon block diagram.

The current implementation of PMLib provides an interface to access internal/external Wattmeters and a number of tracing tools. Power measurements are retrieved by the application using a collection of routines that allow the user to obtain information from the power measurement units, create counters associated with a device where power data is stored, start/continue/terminate power sampling, etc. All this information is managed by the PMLib server, which is in charge of acquiring data from the devices and sending back the appropriate answers to the invoking client application via the proper PMLib routines.

The PMLib server is implemented as two different types of daemons in the framework: the external-PMLib server and the internal-PMLib server. The former runs in a separate system and collects power samples from one or more Wattmeters attached to the target platform where the application is running. In this case, interferences from the running daemon are avoided and noise is removed from the power measurements. However, some information is only accessible locally (e.g., power measurements from local sensors). In those situations, the internal-PMLib daemon, running in each node of the target platform, is used.

Figure 2.4 illustrates the PMLib framework and its interaction with the target application and different tools, such as a performance tracing suite and a visualization tool. The starting point is a scientific application, instrumented with PMLib, that runs on the target platform. Attached to the application node(s) there are different internal/external power measurement devices (managed by the external-PMLib daemon). The users’ code running on the target platform makes calls to the PMLib Application Programming Interface (API) to instruct the external-PMLib server [14] to perform distinct operations (e.g., start/stop collecting the data captured by the power measurement devices). Once the users’ application has been executed, and thanks to the smooth integration of the PMLib traces with different tracing suites (e.g., Extrae [87], TAU [88], VampirTrace [61], HDTrace [70]), the power trace can be inspected along with the performance traces obtained from the tracing suites using an appropriate visualization tool (e.g., Paraver [82]).
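To make the workflow above concrete, the following C sketch shows how an application would typically be instrumented. All routine, type, host and device names below are hypothetical placeholders that only mirror the steps described in the text (connect to the PMLib server, create a counter, start/stop sampling, dump the data); they are not the real PMLib API signatures.

/*
 * Hypothetical instrumentation sketch of the PMLib workflow described above.
 * Every pmlib_* routine is an illustrative placeholder, NOT the real PMLib API.
 */
typedef struct pm_server  pm_server;
typedef struct pm_counter pm_counter;

/* Placeholder prototypes standing in for the actual PMLib client routines. */
pm_server  *pmlib_connect(const char *host, int port);
pm_counter *pmlib_create_counter(pm_server *srv, const char *device);
void        pmlib_start(pm_counter *cnt);
void        pmlib_stop(pm_counter *cnt);
void        pmlib_dump(pm_counter *cnt, const char *file);
void        pmlib_disconnect(pm_server *srv);

void run_dla_kernel(void);   /* user code: the region whose power draw is measured */

int main(void)
{
    pm_server  *srv = pmlib_connect("pm-server-host", 6526);       /* placeholder host/port  */
    pm_counter *cnt = pmlib_create_counter(srv, "odroid-sensors"); /* placeholder device id  */

    pmlib_start(cnt);                /* start collecting power samples                       */
    run_dla_kernel();                /* e.g., a BLIS gemm invocation                         */
    pmlib_stop(cnt);                 /* stop sampling                                        */

    pmlib_dump(cnt, "power.trace");  /* samples can later be merged with an Extrae/Paraver   */
                                     /* performance trace for joint inspection               */
    pmlib_disconnect(srv);
    return 0;
}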

2.4.1 PMLib and big.LITTLE platforms

As explained in the previous section, PMLib is composed of two daemons. The external-PMLib daemon is in charge of all the hardware measuring devices, which can be either internal Direct Current (DC) or external Alternating Current (AC) Wattmeters. On the other hand, the internal-PMLib server manages information that can only be retrieved locally; this information includes the C-states of the processor and power measurements from energy sensors (e.g., RAPL [65], NVML counters [80], ARM current sensors [62]).

Power samples on the big.LITTLE ODROID-XU3 platform are collected in this thesis using the internal-PMLib daemon and the current/voltage sensors present in this SoC. There are four sensors on the ODROID-XU3 that measure the power consumption of the Cortex-A15 cluster, the Cortex-A7 cluster, the GPU and the DRAM, respectively, every 250 ms. The PMLib framework retrieves power consumption samples from this platform at 4 Hz, and dumps them to the standard output or to a text/Paraver file.

Figure 2.4: PMLib diagram.

2.5 Summary

In this chapter, we presented the BLAS and LAPACK DLA libraries, which are used throughout this thesis. We also described the platforms used in our experiments: two ARM big.LITTLE platforms with distinct architectures (32-bit ARMv7 and 64-bit ARMv8) and an Intel Xeon platform. Finally, we reviewed PMLib, a framework designed to collect power samples from a wide variety of Wattmeters.


CHAPTER 3

Basic Linear Algebra Subprograms (BLAS)

In this chapter, two (BLAS-2 and BLAS-3) of the three BLAS levels are described in detail for the BLIS library. This specific library is used for the study because its structure allows us to adopt a low-level approach and extract detailed information about the source of the costs of the operations, something that is not possible with other solutions that follow a black-box approach, e.g., MKL. An adaptation of the library for AMPs is presented, along with a detailed study of the performance and energy consumption gains derived from this new version. That implementation includes a full asymmetry-aware BLAS-3 version and two asymmetry-aware routines (symv and gemv) from BLAS-2. The asymmetry-aware version of the library is tested on two recent ARM big.LITTLE SoCs, both implementing a particular class of heterogeneous architecture that combines a few high-performance (but power-hungry) cores with a collection of energy-efficient (though slower) cores. In order to carry out a complete study, different operand shapes and precisions are used in the tests.

3.1 Blas-like Library Instantiation Software (BLIS)

BLIS is a framework for instantiating the functionality of BLAS that increases the productivity of developers when implementing BLAS and BLAS-like operations. In the same vein as BLAS, BLIS is organized in three levels (hereinafter named BLIS-x), plus four sublevels in BLIS-1, that provide the same functionality as BLAS; thanks to its open-source philosophy, it also offers additional options that allow the user to implement tuned operations:

• Level-1 is composed of 4 sublevels that tackle operations with operands of different nature:

– Level-1v targets operations on vectors.
– Level-1d tackles element-wise operations only on matrix diagonals.
– Level-1m performs element-wise operations on matrices.
– Level-1f focuses on fused operations on multiple vectors, meaning that one call to the routine performs the specific operation on f vectors simultaneously.

• Level-2 is in charge of operations with one matrix and (at least) one vector operand.


• Level-3 deals with operations that take one or more matrices as operands.

BLIS follows the GotoBLAS approach [57] but further modularizes operations or kernels in order to isolate a micro-kernel. This micro-kernel performs the targeted operation and is written in assembly or using vector intrinsics, extracting the most from the underlying architecture. Thus, to guarantee portability, the developer only needs to implement the micro-kernel in order to obtain an implementation for a different architecture. In addition, if high performance is desired, a few configuration parameters need to be adjusted to optimize the micro-kernel and the loops around it.

The configuration parameters of BLIS are mr and nr, referred to as register configuration parameters, and mc, kc and nc, identified as cache configuration parameters. The register configuration parameters are essential in the implementation of the micro-kernel and their values depend on the throughput and latency of the SIMD units and the number of available registers. Moreover, the cache configuration parameter kc is also crucial for the micro-kernel, since it defines the number of updates that the micro-kernel performs. kc, along with mc and nc, dictates the strides of the loops around the micro-kernel, and their values depend on the size and associativity degree of the cache levels. These cache configuration parameters are also used by the packing routines, which are in charge of arranging data in memory in order to minimize cache misses. Given an input matrix and the cache configuration parameters, the packing routine produces as output a submatrix (or micro-panel) of the dimension dictated by the cache parameters, with its values arranged properly. The goal, thus, is that the micro-panels obtained as the output of the packing routines fit in the appropriate cache levels, in order to fully amortize these transfers with enough computation from within the micro-kernel, as explained in [73].

Moreover, BLIS offers multiple levels of multithreading for nearly all level-3 operations via OpenMP or POSIX threads. This multi-level parallelism allows the partitioning of the matrices in several dimensions to attain high performance on multi-core architectures. In order to encode the information about the logical thread topology, the execution of all routines is encoded as a control-tree, a recursive data structure that holds all the information necessary to combine the basic building blocks offered by BLIS. The control-tree for a given BLAS-3 operation governs, among others, which combination of loops has to be executed to complete the operation (that is, the exact algorithmic variant to execute at each level of the general algorithm), the stride for each loop (specific to each target architecture), and the exact points at which packing must occur. In BLIS, there is a single control-tree per operation, composed of as many recursive levels as loops comprise the operation. Thus, when executing a BLIS-3 kernel, at a given level of the algorithm, the control-tree indicates the loop stride of the current loop; whether any of the input operands has to be packed at that point; the parallelism degree of the current loop; whether any of the input operands needs to be transposed; and the next step of the algorithm (next loop or the micro-kernel instantiation).
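As a rough illustration of the information a control-tree node carries, consider the following C sketch. The type and field names are hypothetical and only mirror the list given above (loop stride, packing points, degree of parallelism, transposition, and the next step of the algorithm); they do not correspond to the actual data structures defined inside BLIS.

/*
 * Illustrative control-tree node; names are hypothetical, not BLIS internals.
 * A chain of these nodes, one per loop, drives the execution of a BLIS-3 kernel.
 */
typedef struct loop_node {
    int               stride;      /* blocking stride of this loop: nc, kc, mc, nr or mr  */
    int               n_threads;   /* degree of parallelism extracted at this loop        */
    int               pack_a;      /* pack a panel of A before entering this loop?        */
    int               pack_b;      /* pack a panel of B before entering this loop?        */
    int               trans_a;     /* operate on A transposed?                            */
    int               trans_b;     /* operate on B transposed?                            */
    struct loop_node *next;        /* next (inner) loop; NULL => invoke the micro-kernel  */
} loop_node_t;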

3.2 BLIS-3

In this section, we present the level-3 BLAS in BLIS (hereafter BLIS-3) and a detailed description of its multi-threaded implementation. Given that all BLIS-3 kernels rely on the general matrix multiplication (gemm), this operation is explained first. Next, we describe how the same idea presented for gemm is extended to the rest of the BLIS-3 operations. Finally, we describe the triangular system solve (trsm), due to its specific features.


3.2.1 General matrix-matrix multiplication (gemm)

gemm is a crucial operation for the optimization of the Level-3 BLAS [47], as portable and highly tuned versions of the remaining Level-3 kernels are in general built on top of gemm. As shown in [68], all BLAS-3 kernels can be implemented by means of gemm if a suitable partitioning of the matrices is performed, since gemm is applied to some of these partitions, and the “non-gemm-wise” partitions can be handled by Level-1 and/or Level-2 kernels.

Modern high-performance implementations of gemm for general-purpose architectures, including BLIS and OpenBLAS, follow the design pioneered by Goto [59]. BLIS in particular implements the gemm C = α · A · B + β · C, where the sizes of A, B, C are respectively m × k, k × n, m × n, as three nested loops around a macro-kernel plus two packing routines (see Loops 1–3 in Figure 3.1). The macro-kernel is then implemented in terms of two additional loops around a micro-kernel (Loops 4 and 5 in Figure 3.1). In BLIS, the micro-kernel is typically encoded as a loop around a rank-1 (i.e., outer product) update using assembly or vector intrinsics, while the remaining five loops and packing routines are implemented in C.

Figure 3.2 illustrates how the loop ordering, together with the packing routines and an appropriate choice of the BLIS cache configuration parameters, orchestrates a regular pattern of data transfers through the levels of the memory hierarchy. In practice, the cache parameters nc, kc and mc, and the register parameters nr and mr, dictate the strides of the five outermost loops, with the goal of streaming a kc × nr micro-panel of Bc, say Br, and the mc × kc macro-panel Ac into the FPUs from the L1 and L2 caches, respectively, while the kc × nc macro-panel Bc resides in the L3 cache (if present).

Loop 1  for jc = 0, ..., n-1 in steps of nc
Loop 2    for pc = 0, ..., k-1 in steps of kc
            B(pc : pc+kc-1, jc : jc+nc-1) -> Bc                    // Pack into Bc
Loop 3      for ic = 0, ..., m-1 in steps of mc
              A(ic : ic+mc-1, pc : pc+kc-1) -> Ac                  // Pack into Ac
Loop 4        for jr = 0, ..., nc-1 in steps of nr                 // Macro-kernel
Loop 5          for ir = 0, ..., mc-1 in steps of mr
                  Cc(ir : ir+mr-1, jr : jr+nr-1)                   // Micro-kernel
                    += Ac(ir : ir+mr-1, 0 : kc-1) · Bc(0 : kc-1, jr : jr+nr-1)
                endfor
              endfor
            endfor
          endfor
        endfor

Figure 3.1: High performance implementation of gemm in BLIS. In the code, Cc ≡ C(ic : ic+mc-1, jc : jc+nc-1) is just a notation artifact, introduced to ease the presentation of the algorithm, while Ac, Bc correspond to actual buffers that are involved in data copies.
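To make the rank-1 structure of the micro-kernel explicit, the following plain-C reference sketch accumulates kc outer products into an mr × nr block of C (alpha/beta scaling is omitted). In practice BLIS provides this kernel in assembly or with NEON intrinsics; the MR/NR values and the function name here are only illustrative, and the packed layouts of Ac and Bc are assumed to be the usual mr-wide and nr-wide panels.

/* Reference gemm micro-kernel: a loop of kc rank-1 (outer product) updates
 * on an MR x NR block of C. Ac is packed in MR-wide column panels and Bc in
 * NR-wide row panels, as produced by the packing routines. */
#define MR 4
#define NR 4

void gemm_ukernel_ref(int kc, const double *Ac, const double *Bc,
                      double *C, int ldc)
{
    double ab[MR * NR] = { 0.0 };

    for (int p = 0; p < kc; ++p)              /* one rank-1 update per iteration */
        for (int j = 0; j < NR; ++j)
            for (int i = 0; i < MR; ++i)
                ab[i + j * MR] += Ac[p * MR + i] * Bc[p * NR + j];

    for (int j = 0; j < NR; ++j)              /* accumulate the result into C    */
        for (int i = 0; i < MR; ++i)
            C[i + j * ldc] += ab[i + j * MR];
}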

The multi-threaded version of gemm integrated in BLIS exploits the concurrency available in the nested five-loop organization at one or more levels (i.e., loops). Furthermore, the approach takes into account the cache organization of the target platform (e.g., the presence of multiple sockets, which cache levels are shared/private, etc.), while discarding the parallelization of loops that would incur race conditions as well as those that exhibit too-fine granularity. The insights gained from the analyses carried out in [94, 89] about the loop(s) to be parallelized in a multi-threaded implementation of gemm can be summarized as follows:

• Loop 5 (indexed by ir). With this option, different threads execute independent instances of the micro-kernel, while accessing the same micro-panel Br in the L1 cache. The amount of parallelism in this case, ⌈mc/mr⌉, is scarce since, for many architectures, the optimal value of mc is in the order of a few hundreds, while mr ∈ {4, 8} in general. Thus, if parallelized, the cost of moving Br into the L1 cache may not be amortized over a sufficiently large number of flops.

Figure 3.2: Data movement involved in the BLIS implementation of gemm.

• Loop 4 (indexed by jr). Different threads operate on independent instances of the micro-kernel, but access the same macro-panel Ac in the L2 cache. The time spent in this loop amortizes the cost of packing (and, therefore, moving) Ac from main memory into the L2 cache. The amount of parallelism, ⌈nc/nr⌉, is in general larger than in the previous case, as nc is in the order of several hundreds up to a few thousands for many architectures, and often nr ∈ {4, 8}.

• Loop 3 (indexed by ic). Each thread packs a different macro-panel Ac into the L2 cache and executes a different instance of the macro-kernel. The number of iterations of this loop is not constrained by the cache parameters, but instead depends on the problem dimension m. When m is less than the product of mc and the degree of parallelization of the loop, Ac will be smaller than the optimal dimension and performance may suffer. When there is a shared L2 cache, the size of Ac will have to be reduced by a factor equal to the degree of parallelization of this loop in order to guarantee that Ac still fits in the appropriate cache level. However, reducing mc is equivalent to parallelizing the first loop around the micro-kernel.


• Loop 2 (indexed by pc). This is not a good choice because multiple threads simultaneously update the same parts of C, requiring a mechanism to prevent race conditions.

• Loop 1 (indexed by jc). From a data-sharing perspective, this option is equivalent to extracting the parallelism outside of BLIS. This parallelization is reasonable in a multi-socket system where each CPU (socket) has a separate L3 cache.

To summarize, these are general guidelines to decide which loops are good candidates to be parallelized in order to fully exploit the cache hierarchy of a target architecture. At a glance, the appropriate combination of loops to parallelize strongly depends on which caches are private or shared. Usually, Loop 1 is a good candidate in a multi-socket platform with on-chip L3 caches; Loop 3 should be parallelized when each core has its own L2 cache; and Loops 4 and 5 are convenient choices if the cores share the L2 cache.

3.2.2 Generalization to BLAS-3

The specification of the BLAS-3 [47] basically comprises 9 kernels offering the following functionality:

1. Compute the (general) matrix multiplication (gemm), as well as specialized versions of this operation where one of the input operands is symmetric/Hermitian (symm/hemm) or triangular (trmm).

2. Solve a triangular linear system (trsm).

3. Compute a symmetric/Hermitian rank-k or rank-2k update (syrk/herk or syr2k/her2k, respectively).

The specification accommodates two data types (real or complex) and two precisions (single or double), as well as operands with different “properties” (e.g., upper/lower triangular, transpose or not, etc.). Note that hemm, herk, and her2k are only defined for Hermitian operators, providing the same functionality as that of symm, syrk, and syr2k for symmetric inputs.

Kernel   Operation                        Operand A                   Operand B   Operand C
gemm     C := C + AB                      m × k                       k × n       m × n
symm     C := C + AB or C := C + BA       Symmetric m × m or n × n    m × n       m × n
trmm     B := AB or B := BA               Triangular m × m or n × n   m × n       –
hemm     C := C + AB or C := C + BA       Hermitian m × m or n × n    m × n       m × n
trsm     B := A^{-1}B or B := BA^{-1}     Triangular m × m or n × n   m × n       –
syrk     C := C + A^T A                   k × n                       –           n × n
syr2k    C := C + A^T B + B^T A           k × n                       k × n       n × n
herk     C := C + A^H A                   k × n                       –           n × n
her2k    C := C + A^H B + B^H A           k × n                       k × n       n × n

Table 3.1: Kernels of BLIS-3

In BLIS, all BLAS-3 kernels (Table 3.1) are implemented following the same structure as that presented for gemm, featuring three nested loops that enclose two packing routines and a macro-kernel. The macro-kernel is in turn implemented in terms of two additional loops around a micro-kernel. This gemm micro-kernel is used by all BLIS-3 operations, since it provides an optimal implementation for the given architecture. Specific cases, such as symmetric, Hermitian and/or triangular matrices, are tackled in the macro-kernel (Loops 4 and 5) by adapting the loop strides according to the operation parameters. For example, if the input matrix is symmetric, the macro-kernel will be the same as in the gemm case, but it will be applied only to half of the matrix; consequently, the loop strides will need to be modified accordingly.

Regarding the multi-threaded implementation, the same conclusions extracted for gemm can be applied to the rest of BLIS-3. The only exception is trsm, which will be discussed in the next section. This means that, thanks to the common loop structure of all BLAS-3 operations and the mechanisms provided by BLIS to this end, the user can parallelize one or more loops simultaneously in any BLAS-3 kernel. More precisely, for multi-threaded BLIS implementations, the control-tree will define which loops need to be parallelized and the level of concurrency to extract at each point of the algorithm.

3.2.3 Triangular system solve (trsm)

As explained in previous sections, all BLIS-3 operations leverage the gemm micro-kernel. The triangular system solve (trsm) needs, in addition, a specific micro-kernel to deal with the diagonal blocks, where the triangular system is actually solved. Furthermore, this operation has data dependencies that shrink the options when parallelizing multiple loops.

In BLIS, the execution of a triangular system solve A · X = αB, where A is an m × m upper (or lower) triangular matrix, B is the m × n right-hand side matrix, and the m × n matrix X is the sought-after solution, requires both the invocation of the gemm micro-kernel and the instantiation of a tailored micro-kernel (hereinafter the trsm micro-kernel) that performs the operation when the blocks that comprise the diagonal of A are involved. This specific trsm micro-kernel performs the triangular solve with multiple right-hand sides and, in combination with the optimized gemm micro-kernel, provides an accelerated implementation of trsm. Alternatively, a specific micro-kernel that combines a gemm and a trsm subproblem in order to perform the triangular system solve (known as the fused gemm-trsm micro-kernel) can be implemented for the selected architecture. Among these micro-kernels, the best option is to use the fused gemm-trsm micro-kernel (if provided for the architecture) because, in combination with the optimized gemm micro-kernel needed for the blocks that do not comprise the diagonal, it provides a fully optimized version of trsm thanks to the elimination of the redundant memory operations that occur if two independent micro-kernels are invoked. In this sense, in trsm the operand B is completely packed, while the amount of packed bytes of A is only half of that featured by gemm.

Regarding the parallelization of trsm, due to the data dependencies present in the kernel, the previous analysis presented for gemm is not valid. For those systems where the triangular factor appears on the left-hand side of the operation, only Loops 1 and/or 4 can be parallelized. On the contrary, if the triangular operand lies on the right-hand side of the operation, only Loops 3 and/or 5 can be targeted. If a different parallelization were performed, distinct threads would update the same positions of matrix B, incurring race conditions.
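The following block-level C sketch illustrates how the diagonal blocks require a triangular solve while the remaining blocks are handled with a gemm-like update, for a lower-triangular A and the left-hand-side case B := A^{-1}B in column-major storage. The helpers trsm_block and gemm_block are hypothetical stand-ins for the specialized trsm (or fused gemm-trsm) micro-kernel and the regular gemm macro/micro-kernel, respectively; this is not the actual BLIS code.

/* Solve the kb x kb lower-triangular system T * X = B in place (B: kb x n). */
void trsm_block(int kb, int n, const double *T, int ldt, double *B, int ldb);

/* C := C - A * B, with A: m x k, B: k x n, C: m x n. */
void gemm_block(int m, int n, int k, const double *A, int lda,
                const double *B, int ldb, double *C, int ldc);

/* Block-level trsm for lower-triangular A (B := A^{-1} B), block size nb. */
void trsm_lower_left(int m, int n, int nb,
                     const double *A, int lda, double *B, int ldb)
{
    for (int k = 0; k < m; k += nb) {
        int kb = (k + nb <= m) ? nb : m - k;

        /* Diagonal block: triangular solve with multiple right-hand sides. */
        trsm_block(kb, n, &A[k + k * lda], lda, &B[k], ldb);

        /* Off-diagonal blocks: plain gemm update of the trailing rows of B. */
        if (k + kb < m)
            gemm_block(m - k - kb, n, kb,
                       &A[(k + kb) + k * lda], lda,
                       &B[k], ldb, &B[k + kb], ldb);
    }
}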

3.3 BLIS-3 on ARM big.LITTLE

Given that BLIS does not provide a specific implementation for AMPs, the following experiment was designed to analyze, in terms of performance and energy, the behavior of the existing implementation of BLIS when run on an AMP (the Exynos 5422 SoC in our case), using gemm as an example. For the evaluation, given the guidelines in Section 3.2 and the lack of an L3 cache in this chip, the following two-level parallelization strategy is adopted:

• Coarse-grain (or inter-cluster): Loop 1 is tackled using 2-way parallelism to statically distribute its iteration space between the two clusters. This loop (and also Loop 3) is a good candidate for parallelization across cores with their own, isolated L2 cache, as is the case of each cluster in the Exynos 5422 SoC.

• Fine-grain (or intra-cluster): Loop 4 is parallelized using 4-way parallelism to statically distribute its iteration space among the four cores of the same cluster. This loop (as well as Loop 5) is a good candidate for parallelization across cores sharing a common L2 cache, as is the case of the cores in the same cluster of the Exynos 5422 SoC.

In addition, the cache configuration parameters are set to those that are optimal for the Cortex-A15. Note that similar qualitative observations were obtained when parallelizing the alternative three combinations of Loops 1/3 and 4/5, and/or when the cache parameters were configured using the optimal values for the Cortex-A7.

Figure 3.3 illustrates the implications of the default BLIS scheduling strategy in terms of loop partitioning and assignment to threads. In total, eight threads (one per core) are created and bound to the cores so that, overall, 8-way parallelism is extracted within BLIS. As shown in the figure, the iterations of Loop 1 are distributed between two teams of threads. Each team is mapped to a cluster (big in green and LITTLE in red) and all its members are in charge of executing the assigned iterations, n/2 in this case. Then, in Loop 3, all threads execute m iterations. In Loop 4, each single thread constitutes a different team, which is in charge of nc/4 iterations (note that, in this case, two blocks of nc iterations are being processed by the two clusters). Finally, in Loop 5, all threads are part of the same team again and perform mc iterations each. It is important to emphasize that the iteration space of all loops is homogeneously distributed across the cores (i.e., without taking into account the core type).

Figure 3.4 reports the performance in GFLOPS and the energy efficiency in GFLOPS per Watt (GFLOPS/W) using the (two-level) symmetric-static scheduling that parallelizes Loops 1 and 4. For reference, we also include the results from a parallelization of Loop 4 that separately exploits either the four cores in the Cortex-A15 cluster or the four cores in the Cortex-A7 cluster. The “Ideal” line in the performance graph corresponds to the aggregated performance of the configurations that use four cores of each of the two types in isolation (i.e., the performance of the four Cortex-A15 cores plus the performance of the four Cortex-A7 cores). The “Ideal–combined” line in the energy graph is an upper bound on the energy efficiency when both clusters are in operation. This is obtained using PMLib (Section 2.4) and calculated as the aggregated performance of both clusters (Ideal) divided by the power dissipated by the Cortex-A15 cores and Cortex-A7 cores when working in isolation, plus the power of the memory and the GPU. As expected, this curve for the ideal energy efficiency is above the results obtained when using only Cortex-A15 cores, but below the counterpart that only employs the Cortex-A7 cores. These ideal curves represent theoretical upper bounds for the performance and energy efficiency that can be attained when using an optimal scheduling strategy that exploits the asymmetry of the architecture.

This experiment reveals that a naive symmetric-static workload distribution, which does not consider the differences in the cache hierarchy or performance between the Cortex-A15 and the Cortex-A7, exploits the full system (8 cores) to deliver only about 40% of the highest performance that is observed when employing only the four Cortex-A15 cores. The reason is that, with this approach, BLIS performs a static partitioning and mapping of the iteration space to the processing cores in a homogeneous manner.



Figure 3.3: Partitioning of the iteration space and assignment to threads/cores for a multi-threaded BLIS implementation with 8-way parallelism that combines 2-way parallelism from Loop 1 and 4-way parallelism from Loop 4. Threads in green and red are respectively mapped to big and LITTLE cores.


Figure 3.4: Performance (left) and energy efficiency (right) of the reference BLIS gemm using exclusively one type of core in isolation, and the symmetric-static scheduling version with a coarse-grain parallelization of Loop 1 and a fine-grain parallelization of Loop 4 using 4 threads per cluster.

This homogeneous mapping causes a severe workload imbalance, as the threads running on the Cortex-A15 rapidly process their chunks, but then have to wait for the threads running on the slow Cortex-A7 cores to complete their work. The energy efficiency of the naive solution is also dramatically affected, and this configuration delivers the worst energy results.


In view of the results, we can state that the default approach adopted by BLIS to map BLAS-3 kernels on a multi-threaded CPU presents two main drawbacks when applied to leverage the asymmetric cores of an AMP:

• BLIS relies on a static partitioning and mapping of the loop iteration space among the threads, oblivious of the computational power of the cores these iteration chunks are assigned to. Therefore, independently of the chunk size and the specific loops that are parallelized, this strategy can only yield an unbalanced distribution of the workload (basically, the micro-kernels) among the asymmetric cores.

• In addition, BLIS employs constant values for the loop strides that, in order to attain high performance, need to match the optimal configuration parameters determined by the core cache organization. However, given that the system presents two different architectures (Cortex-A15 and Cortex-A7), and thus two distinct sets of optimal cache parameters, ideally one should select different loop strides/configuration parameters for each type of core.

In conclusion, this experiment naturally motivates the need for an efficient alternative to the homogeneous symmetric-static partitioning of the iteration space integrated in the original multi-threaded implementation of BLIS gemm.

3.4 Optimizing symmetric BLIS gemm on big.LITTLE

In order to implement an asymmetry-aware version of BLIS-3, it is important to first carry out a comparative study that makes it possible to measure how well BLIS can perform when exploiting all the cores in the SoC. To this end, it is essential to analyze the behavior of both architectures present in the platform, the Cortex-A7 and the Cortex-A15, in isolation.

3.4.1 Cache optimization for the big and LITTLE cores

An initial step to attain high performance with the implementation of BLIS gemm requires the implementation of a specific micro-kernel for the target core, in order to bring out the best of the SIMD units and the available registers. This optimization was unnecessary in this work because this type of micro-kernel was already available in BLIS. After that, given a target precision (single or double), it is essential to determine the configuration parameters nc, kc, mc, nr, and mr for a single ARM core of each type, Cortex-A15 and Cortex-A7, that match the cache organization. The experimental effort towards this goal using IEEE 754 double-precision arithmetic is described next.

The study in [73] shows that, in principle, this optimization is also possible via analytic derivation. According to that work, the optimal values for the register parameters mr and nr can be calculated taking into account the number of elements that can be held in the vector registers, and the throughput and latency of the SIMD fused multiply-add instructions. Moreover, mc, kc and nc can be calculated from the size of the caches, the cache replacement policy and the cache organization. As the authors show in that work, their model provides very accurate results for double precision, hitting or providing a value very close to that determined by expert developers.

The first aspect to note is that, in this architecture and for double precision, nc plays a minor role and, therefore, can be simply set to nc = 4,096. This is explained because, in BLIS, nc is connected to the dimension of the L3 cache, which is not present in the Exynos 5422 SoC.


Furthermore, the micro-kernels for these core architectures and precision are thoroughly tuned with mr = 4 and nr = 4. In consequence, the optimization of gemm in a single-core scenario boils down to determining the optimal values of mc and kc for each type of core. For this purpose, independent empirical searches using a single Cortex-A15 core and a single Cortex-A7 core were performed. In both cases, a coarse-grain search to detect potential optimal regions was applied first, and the selected regions were then explored with a finer granularity to detect the optimal configuration parameters. The result of this process is illustrated in Figure 3.5, where the top and bottom plots correspond to the coarse search and the fine-grain refinement, respectively.
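A minimal C sketch of this two-stage empirical search is given below. The helper benchmark_gemm() is hypothetical; in practice it would rebuild or configure BLIS with the given (mc, kc) pair and time a single-threaded gemm on one core, returning the attained GFLOPS (here it is a stub so the sketch compiles). The search ranges are illustrative only.

#include <stdio.h>

static double benchmark_gemm(int mc, int kc)
{
    /* Stub standing in for the real timing run of a single-threaded gemm. */
    (void)mc; (void)kc;
    return 0.0;
}

/* Exhaustive sweep over a (mc, kc) grid, keeping the best-performing pair. */
static void search(int mc_lo, int mc_hi, int mc_step,
                   int kc_lo, int kc_hi, int kc_step,
                   int *best_mc, int *best_kc)
{
    double best = -1.0;
    for (int mc = mc_lo; mc <= mc_hi; mc += mc_step)
        for (int kc = kc_lo; kc <= kc_hi; kc += kc_step) {
            double gflops = benchmark_gemm(mc, kc);
            if (gflops > best) { best = gflops; *best_mc = mc; *best_kc = kc; }
        }
}

int main(void)
{
    int mc = 0, kc = 0;
    /* Coarse-grain sweep to locate a promising (mc, kc) region ...          */
    search(32, 512, 32, 64, 1920, 64, &mc, &kc);
    /* ... followed by a fine-grain refinement around the best point found.  */
    search(mc - 16, mc + 16, 4, kc - 64, kc + 64, 8, &mc, &kc);
    printf("selected mc = %d, kc = %d\n", mc, kc);
    return 0;
}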


Figure 3.5: BLIS optimal cache configuration parameters mc and kc for the ARM Cortex-A15 (left) and Cortex-A7 (right) in the Samsung Exynos 5422 SoC. The performance ranges from red (lowest GFLOPS) to green (highest GFLOPS); the optimal (mc, kc) pair is marked as a blue dot.

The optimal configurations were detected at mc = 152, kc = 952 for the Cortex-A15 core and mc = 80, kc = 352 for the Cortex-A7 core. As could be expected, the optimal values for the Cortex-A15 core are larger than those of the Cortex-A7 core, since the L2 cache of the former is four times bigger. For both types of cores, the corresponding dimensions and the associativity degree of the caches ensure that the micro-panel Br (kc × nr) fits into the L1 cache while the macro-panel Ac (mc × kc) resides in the L2 cache. Note that these values were obtained for a single-threaded execution; consequently, when using several threads, they may have to change in order to ensure that several blocks fit simultaneously in a given cache level.
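As a quick sanity check of these values, the following C snippet computes the double-precision memory footprint of the packed panels against the cache sizes reported in Section 2.3.2.2: Br (kc × nr) stays below the 32-Kbyte L1 data caches, and Ac (mc × kc) fits in the respective per-cluster L2 caches (2 Mbytes for the Cortex-A15, 512 Kbytes for the Cortex-A7).

#include <stdio.h>

/* Report the footprints of Br (kc x nr) and Ac (mc x kc) for one core type. */
static void check(const char *core, int mc, int kc, int nr, long l1, long l2)
{
    long br = (long)kc * nr * sizeof(double);   /* micro-panel Br footprint */
    long ac = (long)mc * kc * sizeof(double);   /* macro-panel Ac footprint */
    printf("%s: Br = %ld bytes (L1 = %ld), Ac = %ld bytes (L2 = %ld)\n",
           core, br, l1, ac, l2);
}

int main(void)
{
    check("Cortex-A15", 152, 952, 4, 32 * 1024, 2048 * 1024);  /* ~30 KB, ~1.1 MB  */
    check("Cortex-A7",   80, 352, 4, 32 * 1024,  512 * 1024);  /* ~11 KB, ~220 KB  */
    return 0;
}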

3.4.2 Multi-threaded Evaluation of BLIS gemm on the big and LITTLE clusters

After determining the optimal configuration parameters for each core cache organization, we analyze the performance and energy efficiency of a multi-threaded implementation of BLIS gemm that operates in a homogeneous (symmetric) system consisting of up to four cores from either the Cortex-A15 cluster or the Cortex-A7 cluster. In particular, given the guidelines in Section 3.2, and the fact that the L2 cache is shared among the cores of the same cluster, we adopt a static parallelization of Loop 4 using 1–4 threads/cores. Similar qualitative conclusions were obtained from a static parallelization of Loop 5.

In this section the two types of clusters are evaluated in isolation, so that the scalability of the multi-threaded gemm can be assessed. In addition, the results of this experiment reveal the differences in performance and energy efficiency between the two types of cores.


Figure 3.6: Performance (left) and energy efficiency (right) of the BLIS gemm using exclusively one type of core, for a varying number of threads.

The plots in Figure 3.6 show the performance and energy efficiency of the multi-threaded gemm implementation in BLIS when using the Cortex-A15 and the Cortex-A7 clusters in isolation. Note that, when calculating the energy efficiency of one type of cluster, the complementary (idle) cluster is deactivated so that it only dissipates a minor power rate (less than 0.1 W). In consequence, only the energy consumed by the active cluster and the memory is accounted for.

Focusing on performance first, the results show that the Cortex-A15 cores deliver considerably higher performance than their Cortex-A7 counterparts. Specifically, the former type of core renders an increase of 2.8 GFLOPS per added core when up to three cores are used, but the utilization of the fourth core yields a smaller increase, of only 1.4 GFLOPS. Together, the four cores of the Cortex-A15 cluster attain a peak performance of 9.6 GFLOPS. For the Cortex-A7 cluster, the peak performance is close to 2.4 GFLOPS, also attained with four cores.

Regarding energy efficiency, the Cortex-A7 offers the best results in terms of GFLOPS/W, and the benefits of increasing the number of threads are more significant than those obtained with the Cortex-A15 cores. Concretely, the energy efficiency attained with the complete Cortex-A7 cluster is 70% higher than that observed with a single core of the same type. In contrast, the best energy efficiency for the Cortex-A15 is only 14% higher than that measured with a single Cortex-A15 core.

This behavior can be explained with a detailed analysis of the power consumption of each type of core. Figure 3.7 reports the power consumption of the Cortex-A15 and Cortex-A7 cores as their number is increased. These results show that the increase in power consumption for the Cortex-A15 cores is higher (3×) than for the Cortex-A7 cores (2×). This trend, along with the non-linear increase in performance for the big cores, which attain 3× higher performance when using the whole cluster (against the 4× obtained for the Cortex-A7 cores), makes their energy efficiency lower. For the same reason, the most energy-efficient solution for the big cores is obtained with three cores instead of the complete cluster.


Figure 3.7: Power consumption of the BLIS gemm for a matrix size of 4,096 using exclusively one type of core, for a varying number of threads.

3.4.3 Asymmetric BLIS GEMM on big.LITTLE

This section proposes two asymmetry-aware strategies (static and dynamic) for workload scheduling of the BLIS gemm micro-kernels, as well as a refinement to exploit the SoC cache hierarchy. Moreover, the impact of these techniques on performance and energy efficiency is evaluated. The optimized implementations can be described, at a high level, as follows:

• Static-asymmetric scheduling (sas). This option statically partitions and assigns loop itera- tions to different thread types taking into account the performance differences between fast and slow cores.

• Cache-aware static-asymmetric scheduling (ca-sas). This strategy enhances sas by adapting the loop strides to the distinct cache configurations of the two computing clusters.

• Cache-aware dynamic-asymmetric scheduling (ca-das). This option improves the previous ones by replacing the static partitioning of the iteration space with a dynamic workload distribution across clusters. For this reason, no previous knowledge about the computational differences between the clusters is required.

In our modifications of the BLIS framework, control-trees (presented in Section 3.1) have been key in order to encode the differences between the original framework and our versions adapted for AMPs without affecting the rest of the BLAS implementation. In particular, we next focus on the necessary modifications and requirements to implement an asymmetric scheduling of the loop iteration space to big and LITTLE cores, and the modification of the loop strides in order to develop a cache-aware configuration for BLIS gemm.

3.4.3.1 Static-asymmetric scheduling (sas)

Taking into account the experiment presented in Section 3.3, the original multi-threaded implementation of BLIS gemm was revamped to distinguish between the distinct computational power of the two types of cores included in the ARM big.LITTLE architecture. In particular, the sas version of BLIS gemm integrates the following two new features, which modify the behavior of the default asymmetry-oblivious multi-threaded implementation at execution time: i) a mechanism to create “fast” and “slow” threads, which are bound upon initialization of the library to the big and LITTLE cores, respectively; and ii) a mechanism to decide which loop, among those that are parallelized, needs to be partitioned and assigned to fast/slow cores asymmetrically. The number of iteration chunks assigned to the threads will thus no longer be the same. Instead, these numbers are set according to the capabilities of each type of core.

The reimplementation also comprises, as a configuration knob, an interface to specify the ratio of performance between the big and LITTLE cores, which should be known before run time. For the specific loop that is selected as the candidate to partition the computational workload between the two clusters, this configuration parameter controls the number of iteration chunks that are assigned to each cluster. The number of threads/cores of each type, the performance ratio, and the specific loop to be asymmetrically partitioned can thus be modified via ad-hoc environment variables; and they can all be fixed at execution time in order to tune the behavior of the library to other big.LITTLE setups (for example, to accommodate a change of the core frequency, which affects the performance ratio between core types). This new functionality is fully configurable and has been embedded into the internal control-tree structures that govern the parallelization of each loop in the general BLIS gemm algorithm.
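The following sketch illustrates the kind of static-asymmetric partitioning just described; the function name, the SAS_RATIO environment variable, and the overall structure are illustrative only and do not correspond to the actual interface added to BLIS. The big cluster receives "ratio" times as many iteration blocks as the LITTLE cluster, and the range assigned to a cluster is then split evenly among its threads.

#include <stdio.h>
#include <stdlib.h>

typedef struct { int start, end; } range_t;

/* Sub-range of the iteration space [0, n) assigned to one thread. */
static range_t sas_partition(int n, int step, int ratio,
                             int is_big, int threads_per_cluster, int local_id)
{
    /* Inter-cluster split, rounded to multiples of the loop stride. */
    int blocks   = (n + step - 1) / step;
    int big_blks = (blocks * ratio) / (ratio + 1);
    int split    = big_blks * step;               /* first iteration of the LITTLE part */
    int cl_start = is_big ? 0     : split;
    int cl_end   = is_big ? split : n;

    /* Intra-cluster split: even chunks among the threads of the same cluster. */
    int cl_n  = cl_end - cl_start;
    int chunk = (cl_n + threads_per_cluster - 1) / threads_per_cluster;
    range_t r = { cl_start + local_id * chunk, cl_start + (local_id + 1) * chunk };
    if (r.end   > cl_end) r.end   = cl_end;
    if (r.start > cl_end) r.start = cl_end;
    return r;
}

int main(void)
{
    const char *s = getenv("SAS_RATIO");          /* illustrative configuration knob */
    int ratio = s ? atoi(s) : 3;

    for (int t = 0; t < 8; ++t) {                 /* threads 0-3: big, 4-7: LITTLE   */
        range_t r = sas_partition(4096, 4, ratio, t < 4, 4, t % 4);
        printf("thread %d -> [%d, %d)\n", t, r.start, r.end);
    }
    return 0;
}

For instance, with a ratio of 3 and n = 4096, the big threads each receive 768 iterations and the LITTLE threads 256, mirroring the 3:1 distribution shown in Figure 3.8.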

Evaluation of sas

Given the memory organization of the Exynos 5422 SoC, and the guidelines for the parallelization of BLIS gemm from Section 3.3, a two-level parallelization is defined as the combination of a coarse-grain parallelization that distributes the workload between the two clusters (at Loop 1 or Loop 3) and a fine-grain parallelization that partitions the workload among the cores in the same cluster (at Loop 4 or Loop 5). To illustrate this, Figure 3.8 depicts the distribution of the iteration space across big and LITTLE cores (or threads bound to them) for a scenario in which the iteration space of Loop 1 is asymmetrically distributed between two teams (one composed of the big cores and the other of the LITTLE ones), using a ratio of 3, so that the fast threads are assigned three times more computations than the slow threads. Internally, Loop 4 is parallelized to distribute the work statically among the cores of the same cluster, so that each thread is in charge of executing nc/4 iterations. Note that two iteration blocks are processed in parallel by the two clusters. As in Figure 3.3, all threads perform all iterations of Loops 3 and 5.

The combination of the coarse-grain and fine-grain parallelization strategies for sas yields four direct parallelization schemes. Additionally, two more configurations are possible, combining the coarse-grain parallelization of either Loop 1 or Loop 3 with the fine-grain parallelization of both Loops 4 and 5. Because the qualitative conclusions that can be extracted from these parallelization strategies are very similar, we only report the results when the iteration space is distributed between the clusters in Loop 1 and the macro-kernel is partitioned among homogeneous cores in Loop 4. In addition, the (distribution) ratios for the coarse-grain parallelization range from 1 to 7.

Figure 3.9 displays the results for this experiment. The evaluation shows that, when the appropriate workload distribution is applied, the asymmetry-aware sas delivers a performance rate that is close to that of the ideal case. The “Ideal” line in the performance graph corresponds to the aggregated performance of the configurations that use four cores of each of the two types in isolation (i.e., the performance of the four Cortex-A15 cores plus the performance of the four Cortex-A7 cores).



Figure 3.8: Partitioning of the iteration space and assignment to threads/cores for a multi-threaded BLIS implementation with 8-way parallelism that asymmetrically combines 2-way parallelism from Loop 1 (using a ratio between fast and slow cores of 3) and 4-way parallelism from Loop 4.


Figure 3.9: Performance (left) and energy-efficiency (right) of the sas version of BLIS gemm with a coarse-grain parallelization of Loop 1 and a fine-grain parallelization of Loop 4 using 4 threads per cluster.

In particular, the left-hand side graph reveals that the worst performance is achieved when the ratio is 1 (i.e., a homogeneous inter-cluster parallelization). Also, the performance grows until a ratio of 5–6 is used; above 6, the performance in general declines, with a lower limit at the line delivered by the Cortex-A15 cluster in isolation (not included in the figure for clarity).


These results indicate that ratios below 5 map too much work to the Cortex-A7 cluster, while ratios above 6 assign an excessive part of the work to the Cortex-A15 cluster. For the largest tested problems, the performance increase of sas with respect to the configuration that employs only the four Cortex-A15 cores is close to 20%. However, sas offers lower performance for the small problems, as the chunks assigned to the big and LITTLE cores are, in those cases, too small to exploit the asymmetric architecture. In terms of energy efficiency, when the appropriate workload distribution is in place, sas delivers similar GFLOPS/W as the setup that exclusively employs the Cortex-A15 cluster (see Figure 3.23). Moreover, the energy efficiency in this case attains 90% of the ideal estimation. On the other hand, when the workload is unbalanced, the energy performance is greatly affected, as the fast threads remain idle but active, polling and consuming energy, until the slow threads complete their work.

3.4.3.2 Cache-aware static-asymmetric scheduling (ca-sas)

The original implementation of BLIS contains a single control-tree per operation, which implies that the gemm routine can only employ the optimal cache configuration parameters for either the Cortex-A15 or the Cortex-A7. Our solution to this problem duplicates the control structure to set different configuration values for mc and kc, depending on the type of core. Specifically, two different control-trees are created upon initialization, for “fast” and “slow” threads, each setting the optimal loop strides/cache parameters for a different core architecture. In addition, this mechanism opens the door to the use of specific highly-tuned micro-kernels adapted to each microarchitecture in the AMP (and, therefore, optimal values for mr and nr), depending on the type of core that executes them. We note that, as argued earlier in Section 3.2, the performance of gemm is quite independent of nc, since there is no L3 cache in the Exynos 5422 SoC. Furthermore, we leverage the same micro-kernel for both the Cortex-A7 and Cortex-A15 clusters since, in this particular SoC, it is optimal for both.

An important caveat of this approach is that there may be some dependencies between the optimal configurations used for the clusters. Concretely, if the micro-kernels are distributed among the Cortex-A15 and Cortex-A7 clusters by parallelizing Loop 1, independent buffers are used for Ac and Bc, and no dependencies arise. However, if they are partitioned between the clusters by parallelizing Loop 3, then the buffer for Bc is shared, and it is necessary to employ a common value of kc for the Cortex-A15 and the Cortex-A7. In this scenario the parameter is set to kc = 952 in both control-trees, and a new (sub)optimal value for mc has to be obtained for the Cortex-A7 threads. To do so, we carried out a search similar to that described in Section 3.4.1, finding the new optimal value at mc = 32 for the Cortex-A7 (taking into account that the kc parameter is fixed by the Cortex-A15). With these new parameters, the performance peak attained with the Cortex-A7 cluster is slightly worse than that observed with the actual Cortex-A7-specific optimal parameters. However, it is still higher than that obtained with the cache parameters of the Cortex-A15 since, for those much larger values, the memory buffer Ac does not fit into the small L2 cache of the Cortex-A7.
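The essence of this duplication can be sketched as follows. The struct and the selection function are illustrative only (the real mechanism lives in the BLIS control-trees), while the numerical values are the ones determined in Section 3.4.1 and in this section for double precision on the Exynos 5422.

/* Illustrative sketch of keeping one set of cache/register configuration
 * parameters per core type, as done by the cache-aware variants; not the
 * actual BLIS control-tree machinery. */
typedef struct { int mc, kc, nc, mr, nr; } gemm_params_t;

static const gemm_params_t params_a15 = { 152, 952, 4096, 4, 4 };  /* big cores    */
static const gemm_params_t params_a7  = {  80, 352, 4096, 4, 4 };  /* LITTLE cores */

/* When the clusters split the work at Loop 3, the packed buffer Bc is shared,
 * so kc must be common to both core types; the LITTLE cores then use the
 * (sub)optimal mc = 32 found for kc = 952. */
static const gemm_params_t params_a7_shared_kc = { 32, 952, 4096, 4, 4 };

static const gemm_params_t *select_params(int on_big_core, int shared_Bc)
{
    if (on_big_core) return &params_a15;
    return shared_Bc ? &params_a7_shared_kc : &params_a7;
}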

Comparison of sas and ca-sas

The combination of the coarse-grain and fine-grain parallelization strategies described in Section 3.4.3.1 yields the same parallelization options for ca-sas. For the same reasons, we only report next the results corresponding to a scenario where the iteration space is distributed between the clusters across Loop 1, while the macro-kernel is partitioned within the clusters in Loop 4, using (distribution) ratios for the inter-cluster parallelization of 1, 3 and 5. For each distribution ratio, we include two lines, corresponding to the use of two control-trees (ca-sas) or only one (sas).



Figure 3.10: Performance (left) and energy-efficiency (right) of the sas and ca-sas versions of BLIS gemm with a coarse-grain parallelization of Loop 1 and a fine-grain parallelization of Loop 4 using 4 threads per cluster.

The plots in Figure 3.10 illustrate that, for both metrics of interest, better results are obtained with the option that integrates two control-trees. The increases in performance and energy efficiency are a direct consequence of the accelerated execution of the workload assigned to the Cortex-A7 cluster, derived from the use of more convenient cache configuration parameters for that architecture. We note that the improvements at this point are only visible when too much work is assigned to the Cortex-A7 cluster (i.e., for distribution ratios below 5). However, as we will show later, this strategy has a more visible impact when a dynamic workload distribution is adopted.

To conclude the evaluation of the ca-sas implementation of BLIS, we compare the four direct combinations (parallelization options) of the coarse-grain (Loop 1 or Loop 3) and fine-grain (Loop 4 or Loop 5) options, for a concrete distribution ratio of 5, using two control-trees. Figure 3.11 reports the outcome of this evaluation. The plots show that the fine-grain parallelization of Loop 4 yields performance curves closer to that of the ideal case than the alternatives that parallelize Loop 5. The reason is that nc (linked to Loop 4) is usually much larger than mc (linked to Loop 5) and, therefore, it is easier to attain a more balanced workload distribution with this option. Although it is not possible to leverage the actual optimal cache parameters specific to the Cortex-A7 cluster when Loop 3 is parallelized, the plots also reveal that, when the fine-grain parallelization is performed in Loop 4, there is no noticeable difference between distributing the computational workload in either Loop 1 or Loop 3; the difference is present, however, when the fine-grain parallelization is set in Loop 5.

3.4.3.3 Cache-aware dynamic-asymmetric scheduling (ca-das)

The final step towards attaining a high performance implementation of BLIS gemm for an AMP SoC integrates a mechanism that dynamically distributes the workload between the two SoC clusters. The main advantage of this option is that a predefined distribution ratio becomes unnecessary, although the target loop this feature is applied to still needs to be chosen with care.



Figure 3.11: Performance (left) and energy-efficiency (right) of the ca-sas version of BLIS gemm with a coarse-grain parallelization of either Loop 1 or Loop 3, combined with a fine-grain parallelization of either Loop 4 or Loop 5, using a ratio of 5 in both cases and 4 threads per cluster.

The candidates to apply a dynamic distribution are, obviously, Loop 1 and Loop 3, since these have been previously identified as the best options to distribute the computational workload between the two clusters. However, the cache parameter nc (linked to the stride of Loop 1) is often in the order of several hundreds up to a few thousands and, therefore, in practice it is too large to dynamically distribute the iteration space. In contrast, the cache configuration parameter mc (linked to the stride of Loop 3) is usually in the order of a few hundreds, and thus it is a good candidate to dynamically distribute the iterations. Diving into details, nc = 4,096 for both types of cores, while mc = 32 and 152 for the Cortex-A7 and Cortex-A15 cores, respectively. In consequence, the coarse-grain dynamic distribution of the workload is carried out across Loop 3, with two independent control-trees in place, bound to “fast” and “slow” threads. Note that, as in the ca-sas scheduling strategy, the buffer Bc is shared by both clusters and, in consequence, kc is set to 952 for both types of cores (cache-aware optimization).

The application of a dynamic scheduling strategy removes the static partitioning carried out before Loop 3. This is replaced by a mechanism where, at each iteration of Loop 3, a single thread bound to a “fast” core and a single thread bound to a “slow” core select the current chunk size, which depends on the configuration parameter mc of each type of core. The selected workload is broadcast to the remaining threads of the same type. The fine-grain parallelization remains unmodified and targets Loop 4, Loop 5, or both. The chunk size selection is performed inside a critical section that controls the execution of Loop 3. The overhead of this synchronization point is fully amortized by the utilization of a more flexible workload distribution, except for small matrix sizes, where a static partition is more convenient.
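A minimal sketch of this dynamic chunk selection is given below (illustrative names; the actual mechanism is embedded in the BLIS framework). One representative thread per cluster grabs the next block of Loop 3, sized by the mc of its core type, inside a critical section, and then shares the chosen range with the other threads of its cluster, which parallelize Loop 4 within it. The code assumes an OpenMP-capable compiler (e.g., built with -fopenmp).

static int next_ic = 0;   /* next unassigned iteration of Loop 3 (the m dimension) */

/* Returns 1 and the chunk [*start, *end) if work remains, 0 otherwise. */
static int grab_chunk(int m, int mc_cluster, int *start, int *end)
{
    int got = 0;
    #pragma omp critical(loop3_chunk)
    {
        if (next_ic < m) {
            *start  = next_ic;
            *end    = (next_ic + mc_cluster < m) ? next_ic + mc_cluster : m;
            next_ic = *end;
            got = 1;
        }
    }
    return got;
}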

Evaluation of ca-das

This set of experiments considers a smaller number of options, since Loop 1 was identified as a poor choice to dynamically distribute the computational workload. We report results when the iteration space is dynamically distributed between clusters across Loop 3, and the macro-kernel is partitioned within clusters in either Loop 4 or Loop 5, using two control-trees (one for “fast” and one for “slow” threads, ca-das) or a single control-tree for both types of threads (das) in order

to check which option delivers greater benefits. Additionally, for comparison purposes, we include the performance lines of the best ca-sas strategy with a distribution ratio of 5. The plots in Figure 3.12 reveal that, for both metrics of interest, the best results are attained when the coarse-grain parallelization is dynamically applied to Loop 3 and the fine-grain parallelization is done at Loop 4. However, for small matrices (r < 512), the coarse-grain static parallelization at Loop 3 outperforms the dynamic approach, due to the overhead introduced by the critical section that controls its execution and to the fact that the matrix dimension is too small to perform a suitable distribution of the workload. Regarding the fine-grain parallelization, if it is set across Loop 5, the results are inferior to those reported for the static approach, since the amount of concurrency that can be extracted from Loop 5 is lower than from Loop 4 (see Figure 3.11 and the corresponding analysis for details). On the other hand, the plots show that the use of two control-trees has a great impact on both metrics. The use of a common control-tree implies that the chunk size assigned to both types of threads is the same. Therefore, due to the difference in performance of the Cortex-A7 and Cortex-A15 clusters, this produces a severe load imbalance for certain problem sizes.


Figure 3.12: Performance (left) and energy-efficiency (right) of the ca-das and das versions of BLIS gemm with a coarse-grain parallelization of Loop 3 and a fine-grain parallelization of either Loop 4 or Loop 5, using 4 threads per cluster in both cases.

3.5 Asymmetric BLIS-3 Evaluation

The insights gained from the previous study of the asymmetry-aware gemm implementation, together with the special structure of BLIS, which makes all BLAS-3 kernels rely on the gemm implementation, allowed us to adapt all routines in BLIS-3 in order to make them asymmetry-aware as well. In this section, a detailed analysis of the performance of all BLIS-3 kernels is carried out to verify the gains due to the asymmetry-aware implementation. In order to perform an exhaustive study, different shapes for the operands are considered and alternative parallelizations are explored in each case. As explained in the previous section, two different approaches can be used for scheduling: static or dynamic. Consequently, in the remainder of this evaluation the letters “S” and “D” are used to indicate, respectively, that a Static or a Dynamic schedule is applied. The number following each letter identifies the loop to which this scheduling is applied. Thus, for example, D3S4 denotes

a strategy that extracts inter-cluster dynamic parallelism from Loop 3 and intra-cluster static parallelism from Loop 4.

3.5.1 Square operands in gemm

Following the solution presented in the previous section, “D3S4” and “D3S5” will be used to refer to strategies based on a dynamic coarse-grain parallelization of Loop 3, combined with a static fine-grain parallelization of either Loop 4 or Loop 5, respectively. To assess the efficiency of these two options, we measure the GFLOPS rates they attain and compare those against an “ideal” execution where the eight cores incur no access conflicts and the workload is perfectly balanced.


Figure 3.13: Performance of (general) matrix multiplication with square matrices: gemm; and three rectangular cases with two equal dimensions: gepp, gemp, and gepm.

The top-left plot in Figure 3.13 reports the performance attained with the dynamic-static parallelization strategies for a matrix multiplication involving square operands only. The results show that the two options, D3S4 and D3S5, obtain a large fraction of the GFLOPS rate estimated for the ideal scenario, although the combination that parallelizes Loops 3+4 is consistently better. Concretely, from m = n = k ≥ 2000, this option delivers between 12.4 and 12.7 GFLOPS, which roughly represents 93% of the ideal peak performance. In this plot, we also include the results for a strategy that parallelizes Loop 4 only, distributing its workload among the ARM Cortex-A15 and Cortex-A7 cores, but oblivious of their different computational power (line labelled as “ObS4”).


With this asymmetry-agnostic option, the synchronization at the end of the parallel regions slows down the ARM Cortex-A15 cores, yielding the poor GFLOPS rate observed in the plot.

3.5.2 gemm with rectangular operands

The remaining three plots in Figure 3.13 report the performance of the asymmetry-aware parallelization strategies when the matrix-matrix multiplication kernel is invoked (e.g., from LAPACK) to compute a product for the following “rectangular” cases (see Table 3.1):

1. gepp (general panel-panel multiplication) for m = n ≠ k;
2. gemp (general matrix-panel multiplication) for m = k ≠ n; and
3. gepm (general panel-matrix multiplication) for n = k ≠ m.

In these three specialized cases, the two equal dimensions are varied in the range R = {100, 300, 500, 1000, 1500, ..., 6000} and the remaining one is fixed to 256. (This specific value was selected because it is often used as the algorithmic block size for many LAPACK routines/target architectures.) The plots for gepp and gemp (top-right and bottom-left in Figure 3.13) show GFLOPS rates that are similar to those attained when the same strategies are applied to the “square case” (top-left plot in the same figure), with D3S4 outperforming D3S5 again. Furthermore, the performance rates attained with this particular strategy, when the variable problem dimension is equal to or larger than 2000 (11.8–12.4 GFLOPS for gepp and 11.2–11.8 GFLOPS for gemp), are around 90% of those expected in an ideal scenario. We can thus conclude that, for these particular matrix shapes, this specific parallelization option is reasonable. The application of the same strategies to gepm delivers mediocre results, though. The reason is that, when m = 256, a coarse-grain distribution of the workload that assigns chunks of mc = (152, 32) iterations of Loop 3 to the ARM (Cortex-A15, Cortex-A7) cores may be appropriate from the point of view of cache utilization, but yields a highly unbalanced execution. This behavior is exposed with an execution trace, obtained with the Extrae framework [87], in the top part of Figure 3.14. To tackle the unbalanced workload distribution, we can reduce the values of mc, at the cost of a less efficient usage of the cache memories. Figure 3.15 reports the effect of this compromise, revealing that the pair mc = (116, 24) for the ARM (Cortex-A15, Cortex-A7) cores presents a better trade-off between balanced workload distribution and cache optimization. For this operation, this concrete pair delivers 11.8–12.4 GFLOPS, which is slightly above 80% of the ideal peak performance. A direct comparison of the top and bottom traces in Figure 3.14 exposes the difference in workload distribution between the executions with mc = (152, 32) and mc = (116, 24), respectively.

3.5.3 Other BLIS-3 kernels with rectangular operands

Figure 3.16 reports the performance of the BLIS kernels for the symmetric matrix multiplication, the triangular matrix multiplication, and the triangular system solve when applied to two “rectangular” cases involving a symmetric/triangular matrix (see Table 3.1):

• symp, trmp, trsp when the symmetric/triangular matrix appears to the left-hand side of the operation (e.g., C := C + AB in symp);
• sypm, trpm, trps when the symmetric/triangular matrix appears to the right-hand side of the operation (e.g., C := C + BA in sypm).


Figure 3.14: Execution traces of gepm using the parallelization strategy D3S4 for a problem of dimension n = k = 2000 and m = 256. The top plot corresponds to the use of cache configuration parameters mc = (152, 32) for the ARM (Cortex-A15, Cortex-A7) cores, respectively. The bottom plot reduces these values to mc = (116, 24). The blue periods correspond to actual work while the pink ones represent synchronization (idle time).


Figure 3.15: Performance of gepm for different cache configuration parameters mc.

The row and column dimensions of the symmetric/triangular matrix vary in the range R = {100, 300, 500, 1000, 1500, ..., 6000} and the remaining problem size is fixed to 256. Therefore, when the matrix with special structure is to the right-hand side of the operator, m = 256. On the other hand, when this matrix is to the left-hand side, n = 256. Also, in the left-hand side case, and for the same reasons argued for gepm, we set mc = (116, 24) for the ARM (Cortex-A15, Cortex-A7) cores. Let us analyze the performance of the symmetric and triangular matrix multiplication kernels first. From the plots in the top two rows of the figure, we can observe that D3S4 is still the best option for both operations, independently of the side. When the problem dimension of the symmetric/triangular matrix equals or exceeds 2000, symp delivers 11.0–11.9 GFLOPS, sypm 10.8–11.0 GFLOPS, trmp 11.0–11.6 GFLOPS, and trpm 7.8–8.9 GFLOPS. Compared with the

corresponding ideal peak performances, these values approximately represent fractions of 91%, 95%, 90%, and 80%, respectively.


Figure 3.16: Performance of two rectangular cases of symm (symp for C := C + AB and sypm for C := C + BA), trmm (trmp for B := AB and trpm for B := BA), and trsm (trsp for B := A^{-1}B and trps for B := BA^{-1}).

The triangular system solve is a special case due to the dependencies intrinsic to this operation. For this particular kernel, because of these dependencies, the BLIS implementation can parallelize neither Loop 1 nor Loop 4 when the triangular matrix is on the left-hand side. For the same reasons, BLIS cannot

parallelize either Loop 3 or Loop 5 when this operand is on the right-hand side. Given these constraints, and the shapes of interest for the operands, we therefore select and evaluate the following three simple static parallelization strategies. The first variant, S1S4, is appropriate for trsp and extracts coarse-grain parallelism from Loop 1 by statically dividing the complete iteration space of this loop (n) between the two clusters, assigning rc = 6× more iterations to the ARM Cortex-A15 cluster than to the slower ARM Cortex-A7 cluster. (This ratio rc was experimentally identified in [28] as a fair representation of the performance difference between the two types of cores available in these clusters.) In general, this strategy results in values of nc that are smaller than the theoretical optimum; however, given that the Exynos 5422 SoC is not equipped with an L3 cache, the effect of this particular parameter is very small. At a finer grain, this variant S1S4 statically distributes the iteration space of Loop 4 among the cores within the same cluster. The two other variants are designed for trps, and they parallelize either Loop 3 only, or both Loops 3 and 5 (denoted as S3 and S3S5, respectively). In the first variant, the same ratio rc is applied to statically distribute the iterations of Loop 3 between the two types of cores. In the second variant, the ratio statically partitions (coarse-grain parallelization) the iteration space of the same loop between the two clusters and, internally (fine-grain parallelization), the workload comprised by Loop 5 is distributed among the cores of the same cluster. The plots in the bottom row of Figure 3.16 show that, for trsp, the parallelization of Loops 1+4 yields between 9.6 and 9.8 GFLOPS, which corresponds to about 74% of the ideal peak performance; for trps, on the other hand, the parallelization of Loop 3 only is clearly superior to the combined parallelization of Loops 3 and 5, offering 7.2–8.0 GFLOPS, which is within 65–75% of the ideal peak performance.
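The proportional split used by these static strategies can be expressed with a tiny helper (a sketch with a hypothetical function name, not the BLIS code): given the iteration space of the partitioned loop and the ratio rc, the big cluster receives rc/(rc+1) of the iterations and the LITTLE cluster the rest; in practice the split point would also be aligned to the blocking stride of that loop.

static void split_by_ratio(int n_iters, int rc, int *n_big, int *n_little)
{
  /* rc = 6 in the text: the Cortex-A15 cluster receives 6x the share of the
   * Cortex-A7 cluster, i.e., 6/7 of the iterations of Loop 1 (trsp) or Loop 3 (trps). */
  *n_big    = (n_iters * rc) / (rc + 1);
  *n_little = n_iters - *n_big;
}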


Figure 3.17: Performance of a rectangular case of syrk and syr2k.

To conclude the optimization and evaluation of the asymmetry-aware parallelization of BLIS, Figure 3.17 illustrates the performance of the symmetric rank-k and rank-2k kernels when operating with rectangular operand(s) of dimension n × k. For these two kernels, we vary n in the range R = {100, 300, 500, 1000, 1500, ..., 6000} and set k = 256 (see Table 3.1). The results reveal high GFLOPS rates, similar to those observed for gemm, and again slightly better for D3S4 compared with D3S5. In particular, the parallelization of Loops 3+4 renders GFLOPS figures of 12.0–12.4 and 11.8–12.3 for syrk and syr2k, respectively, when n is equal to or larger than 2000. These performance rates are thus about 93% of those estimated for the ideal scenario.


From a practical point of view, the previous experimentation reveals D3S4 as the best choice for all BLIS-3 kernels except the triangular system solve; for the latter kernel we select S1S4 when the operation/operands present a trsp-shape, or S3 for operations/operands with a trps-shape. Additionally, when m is relatively small, the BLIS-3 kernels optimized for the Exynos 5422 SoC set mc = (116, 24), but rely on the default mc = (152, 32) otherwise.

3.5.4 BLIS-3 in 64-bit AMPs

In order to validate the generality of our approach for AMPs, we applied the same strategies to a 64-bit processor in a Juno ARM development platform. This experimentation is relevant not only because we employ a platform with a different register size, but also because of the higher degree of asymmetry that it presents, since the Juno ARM comprises 4 slow cores and only 2 big cores. In order to implement the asymmetry-aware version of BLIS for the Juno platform, specific micro-kernels were implemented for the Cortex-A57 and Cortex-A53 cores. In addition, cache and register configuration parameters were calculated as the result of an independent experiment. The values found for the register parameters were mr = 6 and nr = 8, while for the cache parameters nc = 3,072 and kc = 240; mc was set to 48 for the Cortex-A53 and to 120 for the Cortex-A57.

Figure 3.18 reports the performance of the BLIS kernels on the Juno board, using matrices with square operands (m = n = k). In the previous experiments, D3S4 was identified as the best strategy to distribute the computational workload for all BLIS-3 routines except trsm. Therefore, the same strategy was leveraged to produce the results in this graph, which compares the performance of all BLIS-3 routines. Moreover, for comparison purposes, the graph also includes a curve for the ideal peak performance rate of gemm, a routine which usually delivers the highest performance rate among all BLAS.

The results show that, for all routines, the D3S4 strategy delivers a large fraction of the GFLOPS estimated for the gemm ideal scenario. Concretely, for m = n = k = 2000, this option renders between 10.4 and 11.1 GFLOPS, which roughly represents 82–87% of gemm's ideal peak performance. For larger problem sizes, the graph reveals that an even larger fraction of the ideal is achieved, yielding between 12 and 12.5 GFLOPS, which corresponds to 92–96% of gemm's ideal. An additional observation from this experiment is that, for small problem dimensions, gemm and symm consistently outperform the GFLOPS reported for all other routines, while for large problem dimensions only trmm delivers slightly lower performance.

3.6 BLIS-2

The routines in the Level-2 BLAS target matrix-vector operations and, to this end, BLIS provides the kernels described in Table 3.2, including those that work on Hermitian operands (hemv, her, her2). Past development efforts for BLIS have been devoted to supporting highly tuned micro-kernels and multi-threaded implementations for the Level-3 BLAS kernels [94, 89], while the Level-1 and Level-2 BLAS micro-kernels were left for future work. Furthermore, the parallelization of these operations was also a pending task. Unfortunately, the experimental analyses clearly exposed that the Level-2 BLAS are essential to improve the efficiency of some LAPACK routines, for example the kernels that perform orthogonal reductions to condensed forms for eigenvalue computations.


[Plot: BLAS-3 performance on the Juno SoC; GFLOPS against the problem dimension m = n = k, with curves for Ideal (GEMM), GEMM, SYMM-LU, SYRK-TU, SYR2K-UT, and TRMM-LLT.]

Figure 3.18: Performance of BLAS-3 on Juno SoC.

Kernel   Operation                         A                   x   y
gemv     y := y + Ax                       m × n               n   m
         y := y + A^T x                    n × m               m   n
symv     y := y + Ax                       Symmetric n × n     n   n
trmv     x := Ax  or  x := A^T x           Triangular n × n    n   –
hemv     y := y + Ax                       Hermitian n × n     n   n
trsv     y := A^{-1}x  or  y := A^{-T}x    Triangular n × n    n   n
ger      A := A + xy^T                     m × n               m   n
syr      A := A + xx^T                     Symmetric n × n     n   –
syr2     A := A + xy^T + yx^T              Symmetric n × n     n   n
her      A := A + xx^H                     Hermitian n × n     n   –
her2     A := A + xy^H + yx^H              Hermitian n × n     n   n

Table 3.2: Kernels of BLIS-2

3.6.1 General matrix-vector multiplication (gemv)

The implementation of the Level-2 BLIS kernels follows a general structure that is illustrated in this section through the general matrix-vector product gemv, y := αMx + βy, with the matrix M of dimension m × n, and y, x vectors with m and n entries, respectively. For simplicity, we assume hereafter that α = β = 1. This kernel is implemented in BLIS as two loops (see Loops 1 and 2 in Figure 3.19) around two packing routines and a macro-kernel. The gemv macro-kernel contains an additional loop (Loop 3 in Figure 3.19) around a micro-kernel that casts each update as a fused vector-vector multiply-add [95]. The fusion factor depends on the width of the SIMD instructions and allows the execution in parallel of as many updates as the fusion factor indicates. The packing routines in the gemv kernel copy the contents of y, x into contiguous buffers yc, xc, and unpack yc into the result vector y (if these vectors were stored with a non-unit stride). No packing is performed on M since there is no reuse in the BLIS implementation of the general matrix-vector product. This approach differs from that introduced in [67], which fuses the two gemv into a single kernel in order to reduce the volume of memory transfers. By adhering to the BLIS style though,

the implementation maintains the loops around the macro-kernel, which control parallelism and pack data if necessary.

Loop 1:  for ic = 0, ..., m−1 in steps of mc
           y(ic : ic+mc−1) → yc                                 // Pack y into yc
Loop 2:    for jc = 0, ..., n−1 in steps of nc
             x(jc : jc+nc−1) → xc                               // Pack x into xc
Loop 3:      for jr = jc, ..., jc+nc−1 in steps of nr           // Macro-kernel
               yc += M(ic : ic+mc−1, jr : jr+nr−1)
                     · xc(jr−jc : jr−jc+nr−1)                   // Micro-kernel
             endfor
           endfor
           yc → y(ic : ic+mc−1)                                 // Unpack yc
         endfor

Figure 3.19: High performance implementation of the gemv kernel in BLIS. In the code, xc, yc are buffers involved in data copies in case x, y are stored with a non-unit stride. Otherwise, they simply refer to the corresponding entries of the original vectors.
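For readers who prefer plain code to the loop notation of Figure 3.19, the following self-contained sketch reproduces the same three-loop structure in C for the simple case of unit strides (so no packing is required). The function name and the scalar “micro-kernel” are illustrative only, not the BLIS implementation:

static void gemv_blocked(int m, int n, const double *M, int ldM,
                         const double *x, double *y,
                         int mc, int nc, int nr)
{
  for (int ic = 0; ic < m; ic += mc) {                  /* Loop 1 */
    int mb = (m - ic < mc) ? (m - ic) : mc;
    for (int jc = 0; jc < n; jc += nc) {                /* Loop 2 */
      int nb = (n - jc < nc) ? (n - jc) : nc;
      for (int jr = jc; jr < jc + nb; jr += nr) {       /* Loop 3 (macro-kernel) */
        int nrb = (jc + nb - jr < nr) ? (jc + nb - jr) : nr;
        /* "Micro-kernel": y(ic:ic+mb-1) += M(ic:ic+mb-1, jr:jr+nrb-1) * x(jr:jr+nrb-1) */
        for (int j = jr; j < jr + nrb; j++)
          for (int i = ic; i < ic + mb; i++)
            y[i] += M[(size_t)j * ldM + i] * x[j];
      }
    }
  }
}

In BLIS, the two innermost loops are replaced by the fused vector-vector multiply-add micro-kernel, and Loops 1 and 2 additionally pack y and x into the buffers yc and xc when the vectors are stored with a non-unit stride.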

3.6.2 Symmetric matrix-vector multiplication (symv)

As pointed out earlier, the structure presented previously is the general approach followed by the Level-2 operations. However, BLIS presents a slightly different implementation in the specific case of the symmetric matrix-vector multiplication (or symv) kernel y := αMx + βy, where M is a symmetric m × m matrix; y, x are vectors of m components; and α, β are scalars. For simplicity, hereafter we will assume that α = β = 1, and that the strictly upper triangular part of M is not accessed/referenced. This kernel is implemented in BLIS as a loop that traverses the matrix from the top/left corner to the bottom/right one. The loop body comprises three packing routines (note that in the general case x is packed in Loop 2), two calls to the gemv kernel, and one invocation of a specific macro-kernel for symv (see Figure 3.20). Note that, when invoking the gemv kernel from the symv kernel, the entry point is Loop 2 from Figure 3.19, since Loop 1 from gemv is replaced by Loop 1 of symv in order to perform the appropriate packs.

Loop 1:  for ic = 0, ..., m−1 in steps of mc
           x(ic : ic+mc−1) → xc                        // Pack x into xc
           y(ic : ic+mc−1) → yc                        // Pack y into yc
           M(ic : ic+mc−1, ic : ic+mc−1) → Mc          // Pack only lower triangle of M into Mc
           yc += Mc · xc                               // symv macro-kernel
           yc += M10 · x0                              // gemv kernel
           yc += M21^T · x2                            // gemv kernel
           yc → y(ic : ic+mc−1)                        // Unpack y
         endfor

Figure 3.20: Implementation of symv in BLIS.

The symv macro-kernel is implemented as a separate loop around a micro-kernel (not included in the figure). The macro-kernel performs the operations that involve the mc × mc blocks on the diagonal of M, and casts the corresponding update of yc in terms of a fused vector-vector multiply-add [95]. All input/output operands of the symv macro-kernel (Mc, xc, yc) are packed since there

is a mild reuse in the symmetric case. Furthermore, the symmetric structure of M is taken into account to access and pack only the lower triangular part of this matrix into Mc.
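The block notation of Figure 3.20 (M10, M21, x0, x2) can be made explicit with a small reference sketch in plain C (column-major storage, only the lower triangle of M referenced, no packing or micro-kernels; this is an illustration added for clarity, not the BLIS code):

static void symv_blocked(int m, const double *M, int ldM,
                         const double *x, double *y, int mc)
{
  for (int ic = 0; ic < m; ic += mc) {
    int mb = (m - ic < mc) ? (m - ic) : mc;
    /* y1 += M11 * x1: symmetric diagonal block (symv macro-kernel) */
    for (int j = ic; j < ic + mb; j++) {
      y[j] += M[(size_t)j * ldM + j] * x[j];
      for (int i = j + 1; i < ic + mb; i++) {
        y[i] += M[(size_t)j * ldM + i] * x[j];   /* strictly lower part */
        y[j] += M[(size_t)j * ldM + i] * x[i];   /* mirrored upper part */
      }
    }
    /* y1 += M10 * x0: block to the left of the diagonal block (gemv) */
    for (int j = 0; j < ic; j++)
      for (int i = ic; i < ic + mb; i++)
        y[i] += M[(size_t)j * ldM + i] * x[j];
    /* y1 += M21^T * x2: block below the diagonal block, accessed transposed (gemv) */
    for (int j = ic; j < ic + mb; j++)
      for (int i = ic + mb; i < m; i++)
        y[j] += M[(size_t)j * ldM + i] * x[i];
  }
}

Each block of y thus receives one contribution from the diagonal block, one from the block to its left, and one from the block below the diagonal block in transposed form, exactly as in Figure 3.20.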

3.7 BLIS-2 on big.LITTLE

As pointed out in Section 3.6, optimization efforts have been focused on the BLIS-3 kernels, implementing hand-tuned micro-kernels targeting different architectures. In addition, that is the only BLIS level that allows the user to parallelize its operations. Since none of these features are available in BLIS-2 for the Cortex-A cores, the insights gained after the adaptation of BLIS-3 to these architectures will guide the development of an optimized version of BLIS-2. Therefore, the following steps are required:

• Develop hand-tuned micro-kernels for ARMv7 architectures.

• Implement parallel BLIS-2 kernels that exploit the underlying architecture asymmetry.

• Find cache optimization parameters for multi-threaded BLIS-2.

In this Section, the development of specific micro-kernels for ARMv7 architectures is presented along with the parallelization of the kernels for the general matrix-vector product and the symmetric matrix-vector product. Moreover, the selection of the appropriate cache optimization parameters is carried out.

3.7.1 Level-2 BLAS micro-kernels for ARMv7 architectures

To address the lack of BLIS-2 micro-kernels, we have developed hand-tuned micro-kernels for the ARM Cortex-A7 and Cortex-A15 cores that exploit the NEON units in these architectures, employing software prefetching and SIMD instructions. More precisely, two micro-kernels are needed for the gemv implementation, depending on whether the input matrix is transposed or not, and one micro-kernel is needed for symv.

Figure 3.21 reports the performance of symv (left) and gemv (right) for square matrices with and without hand-coded micro-kernels for the non-transposed case. This plot shows remarkable accelerations for the versions that integrate tailored (i.e., architecture-aware) micro-kernels. Concretely, both kernels multiply the performance of the original codes by a factor close to 3× on the Cortex-A7 architecture, and double it on the Cortex-A15 core. In consequence, all experiments in the following employ our hand-tuned micro-kernels for symv and gemv.
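As a flavor of the kind of code involved (an illustrative single-precision sketch only, with an assumed function name; it is not one of the micro-kernels developed for this work), the basic building block of a non-transposed gemv micro-kernel is an axpy-like column update that can be vectorized with NEON intrinsics and software prefetching:

#include <arm_neon.h>

/* y(0:m-1) += alpha * a(0:m-1), with m assumed to be a multiple of 4. */
static void axpy_neon(int m, float alpha, const float *a, float *y)
{
  float32x4_t valpha = vdupq_n_f32(alpha);       /* broadcast the scalar          */
  for (int i = 0; i < m; i += 4) {
    __builtin_prefetch(a + i + 16);              /* prefetch ahead in the column  */
    float32x4_t va = vld1q_f32(a + i);           /* load 4 elements of the column */
    float32x4_t vy = vld1q_f32(y + i);           /* load 4 elements of y          */
    vy = vmlaq_f32(vy, va, valpha);              /* vy += va * alpha              */
    vst1q_f32(y + i, vy);                        /* store the updated elements    */
  }
}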

3.7.2 Asymmetry-aware symv and gemv for ARM big.LITTLE AMPs

The current instance of BLIS offers multi-threaded implementations of the Level-3 BLAS, but sequential versions only of the Level-1 and Level-2 BLAS, including symv and gemv. A straightforward solution to parallelize both kernels, on the AMP targeted in this work, is to extract the concurrency from Loop 1 via, e.g., an OpenMP construct that dynamically distributes its iteration space among the threads/cores; see Figure 3.20.

One important theoretical advantage of leveraging a dynamic schedule is that the workload is automatically adjusted to the performance capabilities of the two different types of cores in the ARM big.LITTLE AMP. Furthermore, by using distinct cache-aware values of mc for the Cortex-A15 and the Cortex-A7, the kernels can take advantage of the cache memory hierarchies specific to each type of core. For clarity, Figure 3.22 shows a simple example of this strategy applied to



symv (note that a similar approach is followed for gemv), using two threads, one bound to a big core (Cortex-A15) and one to a LITTLE core (Cortex-A7). These threads are identified in the figure by means of the superscripts “b” and “L”, respectively, and compute different parts of the output vector y. The workload distribution is actually implemented via an OpenMP critical section, incurring a synchronization overhead that is easily compensated by the improvements in the workload balance. Additionally, the parallelization of Loop 1 does not require intermediate buffers for the output vector y, as would be the case if the parallelization targeted any of the two innermost loops of symv/gemv (see Loops 2 and 3 in Figure 3.19).

Figure 3.21: Impact of the use of architecture-aware micro-kernels on the performance of symv (left) and gemv (right) for non-transposed matrices in each type of ARMv7 core embedded in the Exynos 5422 SoC.

Figure 3.22: Dynamic workload distribution of Loop 1 in symv between one big core and one LITTLE core.

A key issue that explains the poor scalability of symv is the low ratio between the number of flops (2m²) and memory accesses (m²/2 + 2m at best), which turns symv into a memory-bound kernel that, on current architectures, proceeds at the speed dictated by the bandwidth of the memory layer where M is stored. (The same observation, with a slightly different memory-access count, applies to gemv.) As a consequence, the performance that can be achieved by symv in particular,



and by any other Level-2 BLAS in general, greatly depends on the memory bandwidth of the target platform.

Figure 3.23: Parallel performance of symv (left) and gemv (right) on the ARM big.LITTLE AMP embedded in the Exynos 5422 SoC.

Figure 3.23 offers graphical analyses of the scalability of symv and gemv on the Exynos 5422 SoC. An important observation from this experiment is that symv and gemv hardly scale when the number of big cores is increased beyond 1, but they do scale with the LITTLE cores. The reason is that a single big core almost saturates the memory bandwidth of the Cortex-A15 cluster, so that only minor performance increments can be expected by adding more cores of this type. On the other hand, the memory bandwidth available to the LITTLE cluster is able to feed all four LITTLE cores. The plots in the figure also report the GFLOPS rates for asymmetry-aware configurations of symv and gemv that employ either 1 or 2 Cortex-A15 cores plus the full Cortex-A7 cluster (four cores), showing little benefit from adding the second Cortex-A15 core.
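The memory-bound character discussed above can be quantified with a quick estimate of the arithmetic intensity of symv, based on the flop and memory-access counts given earlier (a back-of-the-envelope figure added for clarity, not a measurement):

\[
I_{\mathrm{symv}} \;\approx\; \frac{2m^{2}\ \text{flops}}{\left(m^{2}/2 + 2m\right)\ \text{elements}}
\;\longrightarrow\; 4\ \text{flops per matrix element as } m \to \infty,
\]

that is, roughly one flop per byte for single-precision data, which explains why a single Cortex-A15 core already comes close to saturating the bandwidth available to its cluster.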

3.7.3 Cache optimization for the big and LITTLE cores

Once specific micro-kernels have been implemented for the architecture and multi-threaded versions of the operations are available, the next step to attain high performance is, given a target precision (single in this case), to determine the configuration parameters nc, kc, mc, nr, and mr for a single ARM core of each type, Cortex-A15 and Cortex-A7, that fit the cache organization. However, given that the BLIS-2 kernels are memory-bound operations and, consequently, the reuse of data is much lower than it is in BLIS-3, the relevance of these values is now much lower, leaving us the option to work with the optimal cache configuration parameters found for BLIS-3. In order to optimize cache and register configuration parameters for BLIS-3 using single precision, we have to take into account the same considerations as those made for double precision. As pointed out previously, when optimizing cache configuration parameters for BLIS-3, nc plays a minor role in this architecture, so it is set to 4,096, a value provided by experts for single precision on ARMv7 architectures. Moreover, the micro-kernels for these core architectures and precision are tuned with mr = 4 and nr = 4. Therefore, the optimization of gemv and symv depends on determining the optimal values of mc and kc. However, given that only mc affects the parallelization of BLIS-2 kernels in the proposed approach, since its value will be used to distribute the workload between the different types of cores, the original value kc = 368 is maintained. Table 3.3 provides the summary of the cache optimization parameter values that were calculated as the outcome of an independent experiment using a single Cortex-A7 and a single Cortex-A15 core.


                 mr   nr   mc    kc    nc
ARM Cortex-A15    4    4   400   368   4,096
ARM Cortex-A7     4    4    88   368   4,096

Table 3.3: Parameters for optimal performance of the Level-3 kernels in BLIS on the ARMv7 big.LITTLE embedded in the Exynos 5422 SoC using real single-precision ieee arithmetic.

3.8 Evaluation of Asymmetric BLIS-2

In this section, a detailed analysis of the performance of gemv and symv is carried out to verify the gains due to the asymmetry-aware implementation. In order to extend the insights gained from the previous tests, different shapes for the operands are considered and performance results are compared against an “Ideal” rate, calculated as the sum of the performance rates of one Cortex-A15 core and four Cortex-A7 cores. We also include, as a reference, the results for one Cortex-A15 core and for four Cortex-A7 cores. In all the tests the frequency is set to 1.4 GHz for the Cortex-A7 cores and to 1.5 GHz for the Cortex-A15 cores. In addition, the cache configuration parameters are set to their optimal values (Table 3.3).

3.8.1 symv and gemv with square operands

As concluded in the previous study, the best option in order to increase the performance of the BLIS-2 kernels employs the whole Cortex-A7 cluster in combination with one Cortex-A15 core. Thus, this is the configuration used in the following. Figure 3.24 reports performance results for symv and gemv when using square matrices (m = n). The results show that symv attains slightly higher performance than gemv, reaching up to 90% of the “Ideal” performance (against 88% in the case of gemv). Moreover, in both cases, the asymmetry-aware version of the operations outperforms the results obtained when using a single Cortex-A15 core and when employing the Cortex-A7 cluster in isolation.


Figure 3.24: Parallel performance of gemv (left) and symv (right) with square operands on the ARM big.LITTLE AMP embedded in the Exynos 5422 SoC.


3.8.2 gemv with rectangular operands

Figure 3.25 presents performance results, in terms of GFLOPS, when gemv is applied to a “rectangular” matrix. The left-hand side plot shows the results when the number of rows of the input matrix, dimension m, is fixed, while in the right-hand side plot the fixed dimension is n. When comparing both plots, we see that the performance when dimension m is fixed is lower than when the fixed dimension corresponds to n. This behavior is due to the parallelization approach: since the workload distribution is done at Loop 1, a higher degree of parallelism can be exploited when the m-dimension is larger, which is the situation shown in the right-hand side plot. In the same vein, when fixing the m-dimension to a small value (left plot) the performance decreases because the operation cannot take advantage of the multi-threaded implementation. The same reason explains why using a small m keeps the performance at 67% of the “Ideal” rate (and even below the results obtained with one Cortex-A15 core, especially for small matrix sizes), while increasing m (and fixing n = 256) delivers 75% of the “Ideal” performance rate.


Figure 3.25: Parallel performance of gemv with rectangular operands on the ARM big.LITTLE AMP embedded in the Exynos 5422 SoC.

3.9 Summary

After an overview of BLIS-3, in this chapter we have presented an optimized asymmetry-aware gemm. We described the changes applied to the control-tree structure that governs the multi-threaded parallelization of BLIS gemm in order to accommodate cache-aware configurations of the loop strides for each type of core architecture that match the organization of its cache hierarchy. In addition, two alternative scheduling strategies were introduced, asymmetric–static and dynamic, to produce a 1-D partitioning of (the iteration space for) one of the outer loops of BLIS gemm between the two clusters that yields a balanced distribution of the micro-kernels. Furthermore, an orthogonal symmetric–static schedule was applied to map the workload of one of the inner loops across the cores of the same cluster. The practical benefits of the cache-aware configurations and asymmetry-aware scheduling strategies were demonstrated on the Exynos 5422, showing that the performance attained by the optimized gemm on this platform is well beyond that of an architecture-oblivious multi-threaded implementation and close to that of an ideal scenario. Moreover, an analysis of the energy efficiency of the asymmetric architecture when running our optimized gemm was included,

using the GFLOPS/W metric (billions of floating-point operations per second and per Watt), which assesses the energy cost of the flops.

We leveraged the flexibility of the BLIS framework in order to extend the asymmetry-aware high performance implementation of gemm to the whole BLAS-3 for AMPs, taking into consideration the operands' dimensions and shape. The key to our development was the integration of a coarse-grain scheduling policy, which dynamically distributes the workload between the two core types present in this architecture, combined with a complementary static schedule that repartitions this work among the cores of the same type. Our experimental results on the target platform in general showed considerable performance acceleration for the BLAS-3 kernels, and more moderate gains for the triangular system solve.

Finally, the insights gained from the optimization of BLIS-3 were applied to two key BLIS-2 kernels, namely gemv and symv. We developed architecture-aware micro-kernels for both operations that exploit data locality in the access to the data in the registers. Furthermore, we proposed a strategy to parallelize symv and gemv using only one type of cluster that carries over to the full asymmetric architecture via the introduction of dynamic scheduling. The performance benefits of this asymmetry-aware implementation were demonstrated through an experimental analysis that reported the gains due to the implementation of specific micro-kernels as well as the parallelization of the kernels.

CHAPTER 4

Factorizations

The main objective of this chapter is to examine how well LAPACK performs on an AMP when linked to our asymmetry-aware parallel version of BLIS, since the tests performed with the BLIS kernels in isolation provided good results in the previous chapter.

In this chapter, we first analyze the computational performance and energy efficiency of LAPACK on the Exynos 5422 SoC when linked to our asymmetry-aware implementation of BLIS. In this study, we are interested in assessing the (computational) performance of a “plain” migration; that is, one which does not carry out significant optimizations above the BLIS-3 layer. We point out that this is the usual approach when there exists no native implementation of LAPACK for the target architecture, as is the case for the ARM big.LITTLE-based system. The impact of limiting the optimizations to this layer will be exposed via two crucial dense linear algebra operations [56], illustrative of quite different outcomes:

• The Cholesky factorization for the solution of symmetric positive definite (s.p.d.) linear systems (routine potrf from LAPACK).
• The LU factorization (with partial row pivoting) for the solution of general linear systems (routine getrf from LAPACK).

The performance of these two LAPACK routines on top of the asymmetry-aware BLIS (for the double-precision case) is analyzed in terms of floating-point arithmetic throughput (GFLOPS) and energy efficiency (GFLOPS/W, which determines the energy cost per flop). These results are later used in this chapter as a baseline against which to propose several techniques to improve performance. More specifically, we apply the look-ahead technique in order to enhance the LU factorization, and we also propose malleability at the thread level as a new approach that, in combination with look-ahead, increases the performance.

4.1 Cholesky factorization

Given a dense s.p.d. matrix A ∈ R^{n×n}, its Cholesky factorization is given by A = U^T U, where the Cholesky factor U ∈ R^{n×n} is upper triangular [56]. Listing 4.1 displays a simplified C code that computes the Cholesky factorization of an n×n matrix stored starting at address A with column leading dimension Alda. For simplicity, we assume hereafter that the matrix size is an

integer multiple of the algorithmic block size b. The code overwrites the corresponding entries of the original matrix with the Cholesky factor, leveraging the numerical kernels (or building blocks) trsm and syrk from the Level-3 BLAS, and a sort of recursive call to the Cholesky factorization (potrf) to handle the small diagonal block.

#define A_ref(i,j) A[(j)*Alda+(i)]

for (k=0; k<n; k+=b) {
  b_i = b;                       // current block size (n is assumed to be a multiple of b)
  // Factor current diagonal block
  POTRF( ..., b_i, &A_ref(k,k), Alda, &info );
  if ( k+b < n ) {
    // Triangular solve for the panel of U to the right of the diagonal block
    TRSM( ..., b_i, n-k-b, &A_ref(k,k), Alda, &A_ref(k,k+b), Alda );
    // Symmetric rank-b update of the trailing submatrix
    SYRK( ..., n-k-b, b_i, &A_ref(k,k+b), Alda, &A_ref(k+b,k+b), Alda );
  }
}

Listing 4.1: Simplified C code for the blocked (right-looking) Cholesky factorization; the argument lists of the building blocks are abbreviated with "...".

We next illustrate the impact of leveraging our platform-specific BLIS-3 from LAPACK using the Cholesky factorization. Figure 4.1 reports the GFLOPS and GFLOPS/W rates obtained with our right-looking variant of the routine for the Cholesky factorization (potrf), executed on top of the asymmetry-aware BLIS-3 (AA BLIS), when applied to compute the upper triangular Cholesky factor. Following the comparison strategy already applied to BLIS-3, in the performance plot we include the GFLOPS rate for the ideal configuration, calculated as the sum of the performance rates of the Cortex-A7 and Cortex-A15 clusters (scale on the left-hand side y-axis). Additionally, in both plots we also include the execution of the factorization on top of the unmodified BLIS library using either four Cortex-A15 cores (4A15) or four Cortex-A7 cores (4A7). Furthermore, we offer the ratio that the actual GFLOPS rate represents compared with that estimated under ideal conditions (line labeled as Normalized, with scale on the right-hand side y-axis). All tests in this section were performed with the processor frequency set to 1.6 GHz for the Cortex-A15 cores and 1.4 GHz for the Cortex-A7 cores.


Figure 4.1: Performance and energy efficiency of potrf for the solution of s.p.d. linear systems.

For this particular factorization, as the problem dimension grows, the gap between the ideal peak performance and the sustained GFLOPS rate rapidly shrinks. This is quantified in the column

labeled as Normalized GFLOPS in Table 4.1, which reflects the numerical values represented by the normalized curve (in red) in Figure 4.1. Here, for example, the implementation obtains over 70% and 83% of the ideal peak performance for n = 2,000 and n = 3,000, respectively.

            potrf
   n     Normalized      Normalized
         GFLOPS (%)      flops of syrk (%)
  500       26.73           36.67
 1000       45.12           64.97
 1500       59.49           75.90
 2000       70.85           81.65
 2500       77.46           85.18
 3000       83.06           87.58
 3500       86.05           89.31
 4000       88.06           90.61
 4500       89.39           91.63
 5000       91.29           92.46
 5500       91.42           93.15
 6000       93.16           93.69

Table 4.1: Performance of the matrix factorization for the solution of s.p.d. linear systems (potrf) normalized with respect to the ideal peak performance (in %); and corresponding theoretical cost of the underlying basic building block syrk normalized with respect to the total factorization cost (in %).

This appealing behavior is well explained by considering how this algorithm, rich in BLAS-3 kernels, proceeds. Concretely, at each iteration, the right-looking version decomposes the calculation into three kernels, where one of them is a symmetric rank-k update (syrk) that involves a row panel of k = b rows [56]. Furthermore, as n grows, the cost of this update rapidly dominates the total cost of the decomposition, as shown in the column “Normalized flops” in Table 4.1, which represents the percentage of flops performed by syrk in the Cholesky factorization depending on the size of the input matrix. As a result, the performance of this variant of the Cholesky factorization approaches that of syrk, reported in Figure 3.17. Indeed, it is quite remarkable that, for n = 6,000, the implementation of the Cholesky factorization attains slightly more than 93% of the ideal peak performance, which is basically the same fraction of the ideal peak observed for syrk and a problem of dimension n = 6,000, k = 256. As exposed in Chapter 3, our asymmetry-aware implementation of BLIS shows poorer performance for small matrix sizes due to the dynamic workload distribution and, for the same reason, the performance attained by the Cholesky factorization decays for matrix sizes below n = 2,000.
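The second column of Table 4.1 can be roughly verified analytically; the following back-of-the-envelope estimate (added for clarity, not taken from the original text) sums the rank-b updates over the iterations and compares the result against the total cost of the factorization:

\[
\frac{\mathrm{flops}(\mathrm{syrk})}{\mathrm{flops}(\mathrm{potrf})}
\;\approx\;
\frac{\sum_{k\ge 0}\bigl(n-(k+1)\,b\bigr)^{2}\,b}{n^{3}/3}
\;\approx\;
\frac{(500-256)^{2}\cdot 256}{500^{3}/3}\;\approx\;0.37
\qquad (n = 500,\; b = 256),
\]

in good agreement with the 36.67% reported for n = 500. As n grows, the sum approaches n³/3 and the fraction tends to one, which is why the factorization inherits the performance of syrk.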

Focusing on energy efficiency, the first aspect to point out is that, as expected, the most energy-efficient solution corresponds to the use of the Cortex-A7 cores only, though we note that this results in significantly lower performance. Second, for small problem dimensions, the performance of the asymmetry-aware BLIS-3 is similar to that obtained by using the Cortex-A15 cores only, yielding lower energy efficiency for the former as that option keeps all cores in operation. Third, for large problem dimensions, the energy efficiency of the asymmetry-aware BLIS-3 improves on that of the alternative which relies on the Cortex-A15 cores only, since the rise in power dissipation is compensated by the increase in performance.


4.2 LU

Given a matrix A ∈ R^{m×n}, its LU factorization produces lower and upper triangular factors, L ∈ R^{m×n} and U ∈ R^{n×n} respectively, such that PA = LU, where P ∈ R^{m×m} defines a permutation that is introduced for numerical stability [56]. Listing 4.2 displays a simplified C code that computes the LU factorization of an m×n matrix stored starting at address A with column leading dimension Alda. For simplicity, we assume hereafter that the matrix is square (m = n) and that this dimension is also an integer multiple of the algorithmic block size b. The code overwrites the corresponding entries of the original matrix with the LU factors, leveraging the numerical kernels (or building blocks) trsm and gemm from the Level-3 BLAS, and getrf for the panel factorization.

#define A_ref(i,j) A[(j)*Alda+(i)]

for (k=0; k<n; k+=b) {
  b_i = b;                       // current block size (n is assumed to be a multiple of b)
  // Factor current panel
  GETRF( ..., b_i, &A_ref(k,k), Alda, piv, &info );
  // Apply the row interchanges
  SWAP( ..., &A_ref(k,k), Alda, piv );
  if ( k+b < n ) {
    // Triangular solve for the block row of U to the right of the diagonal block
    TRSM( ..., b_i, n-k-b, &A_ref(k,k), Alda, &A_ref(k,k+b), Alda );
    // Update of the trailing submatrix
    GEMM( ..., n-k-b, n-k-b, b_i, &A_ref(k+b,k), Alda, &A_ref(k,k+b), Alda, &A_ref(k+b,k+b), Alda );
  }
}

Listing 4.2: Simplified C code for the blocked (right-looking) LU factorization with partial pivoting; the argument lists of the building blocks are abbreviated with "...".

Figure 4.2 displays the GFLOPS and GFLOPS/W attained by the routine for the LU factorization with partial row pivoting (getrf), linked with the asymmetry-aware BLIS-3, when applied to decompose square matrices of dimension m = n. All the experiments in this section were performed employing ieee double-precision arithmetic, with the chip frequency set to 1.3 GHz for both types of cores.


Figure 4.2: Performance and energy efficiency of getrf for the solution of general linear systems.


The actual performance and energy efficiency of the LU factorization follow the same general trends observed for the Cholesky factorization, although there are some differences that are worth discussing. First, while the migration of the Cholesky factorization to the Exynos 5422 SoC was a success story, the LU factorization reflects a less favorable case. For example, the routine for the LU factorization attains 55.98% and 65.51% of the ideal peak performance for n = 2,000 and n = 3,000, respectively. Compared with this, the Cholesky factorization attained more than 70% and 83% at the same points. A case-by-case comparison can be quickly performed by inspecting the columns reporting the normalized GFLOPS for each factorization in Tables 4.1 and 4.2.

            getrf
   n     Normalized      Normalized
         GFLOPS (%)      flops of gemm (%)
  500       18.36           36.67
 1000       35.47           64.97
 1500       45.64           75.90
 2000       55.98           81.65
 2500       61.52           85.18
 3000       65.51           87.58
 3500       68.63           89.31
 4000       68.45           90.61
 4500       72.01           91.63
 5000       73.63           92.46
 5500       75.14           93.15
 6000       76.61           93.69

Table 4.2: Performance of the matrix factorization for the solution of general linear systems (getrf) normalized with respect to the ideal peak performance (in %); and corresponding theoretical cost of the underlying basic building block gemm with panel input operands normalized with respect to the total factorization cost (in %).

Let us discuss this further. Like potrf, routine getrf casts most flops in terms of efficient BLAS-3 kernels, in this case the matrix-matrix multiplication gemm with panel input operands (previously analyzed as gepp in Chapter 3). Nonetheless, the explanation for the moderate performance of getrf lies in the high practical cost (i.e., execution time) of the panel factorization that is present at each iteration of the LU procedure. In particular, this panel factorization stands in the critical path of the algorithm and exhibits a limited amount of concurrency, easily becoming a serious bottleneck when the number of cores is large relative to the problem dimension. To illustrate this point, the LU factorization of the panel takes 27.79% of the total time during a parallel factorization of a matrix of order n = 3,000. Compared with this, the decomposition of the diagonal block present in the Cholesky factorization, which plays an analogous role, represents only 10.42% of the execution time for the same problem dimension.

This is a known problem for which there exist look-ahead variants of the factorization procedure that overlap the update of the trailing submatrix with the factorization of the next panel, thus eliminating the latter from the critical path [92]. As an alternative, one could rely on a runtime to produce the same effect, by (semi-)automatically introducing a sort of dynamic look-ahead into the execution of the factorization. However, the application of a runtime to a legacy code is not as simple as it may sound and the development of asymmetry-aware runtimes is still immature.


4.3 Look-ahead in the LU factorization

In order to explore the impact of look-ahead on the LU factorization, a new implementation of the legacy code is needed to accommodate the new approach. To this end, we start from a basic algorithm that is successively modified to accommodate different look-ahead strategies. In addition, given that the implementation of the operation with look-ahead is not straightforward, and in order to isolate the tests from the asymmetry of the targeted AMP, the initial experiments with the look-ahead algorithms were carried out on an Intel Xeon E5-2603 v3 processor (6 cores at a nominal frequency of 1.6 GHz). Once the impact of look-ahead has been tested on a symmetric platform, the AMP will be our next target.

As shown earlier in Chapter 3, the multi-threaded instances of the BLAS for current multi-core processor architectures take advantage of the simple data dependencies featured by the operations to exploit loop/data-parallelism at the block level (hereafter referred to as block-data parallelism or BDP). However, for more complex DLA operations, like those supported by LAPACK [11] and libflame [93], exploiting task-parallelism with dependencies (TP) is especially efficient when performed by a runtime that semi-automatically decomposes the computation into tasks and orchestrates their dependency-aware scheduling [48, 91, 5, 52]. For the BLAS kernels though, exploiting BDP is still the preferred choice, because it allows tighter control of the data movements across the memory hierarchy and avoids the overhead of a runtime, which is unnecessary due to the (mostly) nonexistent data dependencies in the BLAS kernels. Exploiting both BDP and TP, in a sort of nested parallelism, can yield more efficient solutions as the number of cores in processor architectures continues to grow, and this combination will be our objective.

In this section we first review the conventional unblocked and blocked algorithms for the LU factorization, and then describe how BDP is exploited from within them. Next, we introduce a re-organized version of the algorithm that integrates look-ahead in order to enhance performance in a nested TP+BDP execution. The implementations presented were linked with BLIS version 0.1.8 or a tailored version of this library especially developed for this work and, unless otherwise stated, block-data parallelism (BDP) is extracted only from Loop 4 of the BLIS kernels. In addition, since the BLIS kernels are applied to different shapes of input matrices depending on the implemented algorithm, we always refer to them by their generic names (i.e., gemm, trsm, etc.) independently of the real shape of the operands.

4.3.1 Basic algorithms and Block-Data Parallelism (BDP)

There exist a number of algorithmic variants of the LU factorization that can accommodate partial pivoting [56]. Among these, Figure 4.3 (left) shows an unblocked algorithm for the so-called right-looking (RL) variant, expressed using the FLAME notation [60]. The RL variant performs computations on the current panel, traversing the matrix from left to right. Its main characteristic is that, at each iteration, after factorizing the current panel, the columns to the right of that panel are updated immediately. For simplicity, we do not include pivoting in the following description of the algorithms, although all our actual implementations (and in particular those employed in our experimental evaluation) integrate standard partial pivoting.

The cost of computing the LU factorization of an m × n matrix, via any of the algorithms presented in this chapter, is mn² − n³/3 flops. Hereafter, we will consider square matrices of order n, for which the cost simplifies to 2n³/3 flops. For the RL variants, the major part of these operations is concentrated in the initial iterations of the algorithm(s). For example, the first 25% of the iterations account for almost 58% of the flops; the first half for 87.5%; and the first 75% for more than 98%. Thus, the key to high performance mostly lies in the initial stages of the factorization.
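The concentration of work in the early iterations follows directly from the cost of the trailing update; the following back-of-the-envelope derivation (added here for clarity, not part of the original text) reproduces the percentages quoted above. Since the flops of the iteration at relative position t ∈ [0, 1] are dominated by the update of the trailing submatrix and are therefore proportional to (1 − t)², the fraction of the total flops performed by the first fraction f of the iterations is approximately

\[
\int_{0}^{f} 3\,(1-t)^{2}\,dt \;=\; 1-(1-f)^{3},
\]

which evaluates to 0.578, 0.875 and 0.984 for f = 0.25, 0.5 and 0.75, respectively.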


[Figure 4.3 contains the two algorithms, typeset side by side in FLAME notation. The loop body of the unblocked RL variant (left) computes
  rl1. a21 := a21/α11
  rl2. A22 := A22 − a21 a12^T
while the loop body of the blocked RL variant (right), which processes panels of b columns, computes
  RL1. [A11; A21] := LU_unb([A11; A21])
  RL2. A12 := trilu(A11)^{-1} A12
  RL3. A22 := A22 − A21 A12 ]

Figure 4.3: Unblocked and blocked RL algorithms for the LU factorization (left and right, respec- tively). In the notation, n(·) returns the number of columns of its argument, and trilu(·) returns the strictly lower triangular part of its matrix argument, setting the diagonal entries of the result to ones.

For performance reasons, DLA libraries compute the LU factorization via a blocked algorithm that casts most computations in terms of gemm, in contrast with the unblocked algorithm, which relies on BLAS-2 operations. Figure 4.3 (right) presents the blocked RL algorithm, which was previously used in Section 4.2. At each iteration, the algorithm processes panels of b columns, where b is the algorithmic block size. The three operations in the loop body factorize the “current” panel Ap (formed by stacking A11 on top of A21) via the unblocked algorithm (LU_unb, RL1), and next update the trailing submatrix, consisting of A12 and A22, via a triangular system solve (trsm, RL2) and a matrix multiplication with panel operands (gemm, RL3), respectively.

In practice, the block size b is chosen so that the successive invocations of the gemm kernel deliver high FLOPS rates. If b is too small, the performance of gemm will suffer, and so will that of the LU factorization. On the other hand, reducing b is appealing as this choice decreases the number of flops that are performed as part of the panel factorization, an operation that can be expected to offer significantly lower throughput (FLOPS) than gemm. (Concretely, provided n ≫ b, the cost required for all panel factorizations is about n²b/2 flops.) Thus, there is a tension between these two requirements, which will be analyzed in Section 4.3.4.1.

When the target platform is a multi-core processor, the conventional parallelization of the LU factorization simply relies on multi-threaded instances of trsm and gemm to exploit BDP only. Compared with this, the panel factorization of Ap, which lies in the critical path of the blocked RL factorization algorithm, exhibits a reduced degree of concurrency. Thus, depending on the selected block size b and certain hardware features of the target architecture (number of cores, floating-point performance, memory bandwidth, etc.), this operation may easily become a performance bottleneck, as shown in Figure 4.4 for a given iteration k.

To illustrate the performance relevance of the panel factorization, Figure 4.5 displays a fragment of a trace corresponding to the LU factorization of a 10,000 × 10,000 matrix, using the blocked RL algorithm in Figure 4.3, with partial pivoting and “outer” block size b = bo = 256. (All traces were obtained using Extrae version 3.3.0 [87].) The code is linked with multi-threaded versions of the


Figure 4.4: Exploitation of BDP in the blocked RL LU parallelization. A single thread team executes all the operations, with fewer active threads for RL1 due to the reduced concurrency of this kernel. In this algorithm, RL1 stands in the critical path.

The code is linked with multi-threaded versions of the BLIS kernels for gemm and trsm, using 6 threads in both cases. The panel factorization (panel) is performed via a call to the same blocked algorithm, with "inner" block size bi = 32, and also extracts BDP from the same two kernels. With this configuration, the panel factorization represents less than 2% of the flops performed by the algorithm. However, the trace of the first four iterations reveals that its practical cost is much higher than could be expected. (The cost of factorizing a panel, relative to the cost of an iteration, becomes even larger as the factorization progresses.) Here we also note the significant cost of the row permutations, which are performed via the sequential legacy code for this routine in LAPACK (laswp). However, this second operation is embarrassingly parallel and its execution time can be expected to decrease linearly with the number of cores, although scalability may be limited because it is a memory-bound operation.
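The "less than 2%" figure follows from the flop counts given earlier in this section: the panel factorizations amount to roughly n²bo/2 flops out of a total of 2n³/3, i.e., a fraction of 3bo/(4n), which for bo = 256 and n = 10,000 is about 0.019, or 1.9% of the total work.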


Figure 4.5: Execution trace of the first four iterations of the blocked RL LU factorization with partial pivoting, using 6 threads, applied to a square matrix of order 10,000, with bo = 256, bi = 32.

At this point, we note that the operations inside the loop body of the blocked algorithm in Figure 4.3 (right) present strict dependencies that enforce their computation in the order RL1 ⇒ RL2 ⇒ RL3. Therefore, there seems to be no efficient manner to formulate a task-parallel (TP) version of the blocked algorithm in that figure.


4.3.2 Static look-ahead and nested Task-Parallelism+Block-Data Parallelism (TP+BDP)

A strategy to tackle the hurdle represented by the panel factorization in a parallel execution consists in the introduction of look-ahead [92] into the algorithm. Concretely, during each iteration of the decomposition, this technique aims to overlap the factorization of the "next" panel with the update of the "current" trailing submatrix, in practice enabling a TP version of the algorithm with two separate branches of execution, as discussed next.

Algorithm: [A] := LU la blk(A)
  Determine block size b
  Partition A → [ ATL  ATR ; ABL  ABR ],  ABR → [ ABR^P  ABR^R ],
    where ATL is 0 × 0 and ABR^P has b columns
  ABR^P := LU unb( ABR^P )
  while n(ATL) < n(A) do
    Repartition [ ATL  ATR ; ABL  ABR ] → [ A00  A01  A02 ; A10  A11  A12 ; A20  A21  A22 ],
      where A11 is b × b
    Determine block size b
    % Partition into panel factorization and remainder
    [ A12 ; A22 ] → [ A12^P  A12^R ; A22^P  A22^R ], where both A12^P and A22^P have b columns

    % Panel factorization, TPF                   % Remainder update, TRU
    PF1.  A12^P := trilu(A11)^{-1} A12^P         RU1.  A12^R := trilu(A11)^{-1} A12^R
    PF2.  A22^P := A22^P − A21 A12^P             RU2.  A22^R := A22^R − A21 A12^R
    PF3.  A22^P := LU unb( A22^P )

    Continue with the repartitioning
  endwhile

Figure 4.6: Blocked RL algorithm enhanced with look-ahead for the LU factorization.

Figure 4.6 illustrates a version of the blocked RL algorithm for the LU factorization re-organized to expose look-ahead. The key is to partition the trailing submatrix into two block column panels:

    [ A12 ; A22 ] → [ A12^P  A12^R ; A22^P  A22^R ],                         (4.1)

where A22^P corresponds to the block that, in the conventional version of the algorithm (i.e., without look-ahead), would be factorized during the next iteration. This effectively separates the blocks that are modified as part of the next panel factorization from the remainder updates, left and right of the 2 × 2 partitioning in (4.1), respectively. Proceeding in this manner creates two coarse-grain independent tasks (groups of operations in separate branches of execution): TPF, consisting of PF1, PF2, PF3; and TRU, composed of RU1 and RU2 (Figure 4.6). The "decoupling" of these block panels thus makes it possible that, in a TP execution of an iteration of the loop body of the look-ahead version, the updates on A12^P, A22^P and the factorization of the latter (operations on the next panel, in TPF) proceed concurrently with the updates of A12^R, A22^R (remainder operations, in TRU), as there are no dependencies between TPF and TRU.
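As an illustration of this decoupling, the following schematic C fragment (not the thesis code; the two task bodies are placeholders) uses OpenMP to run the two branches of one loop iteration concurrently; inside each branch, BDP would be exploited by a multi-threaded (and, later in this chapter, malleable) BLAS:

    #include <omp.h>

    /* Placeholders for the two branches of Figure 4.6. */
    static void task_TPF(void) { /* PF1, PF2, PF3: operations on A12^P, A22^P */ }
    static void task_TRU(void) { /* RU1, RU2: operations on A12^R, A22^R      */ }

    /* One iteration of the loop body of the algorithm with static look-ahead:
       the two tasks are independent, so they can be mapped to two thread
       teams; nested parallelism (or a multi-threaded BLAS) provides BDP
       inside each task.                                                       */
    static void lu_la_loop_body(void)
    {
        #pragma omp parallel sections num_threads(2)
        {
            #pragma omp section
            task_TPF();          /* executed by team PF                        */
            #pragma omp section
            task_TRU();          /* executed by team RU                        */
        }                        /* implicit barrier: both branches join here  */
    }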


By carefully tuning the block size b and adjusting the amount of computational resources (threads) dedicated to each of the two independent tasks, TPF and TRU, a nested TP+BDP execution of the algorithm enhanced with this static look-ahead can partially or totally overcome the bottleneck represented by the panel factorization, as shown in Figure 4.7 for a given iteration k.

Figure 4.7: Exploitation of TP+BDP in the blocked RL LU parallelization with look-ahead. The execution is performed by teams PF and RU, consisting of tpf = 3 and tru = 8 threads, respectively. In this algorithm, the operations on the (k + 1)-th panel, including its factorization (PF3), are overlapped with the updates on the remainder of the trailing submatrix (RU1 and RU2).

Figure 4.8 illustrates a complete overlap of the panel factorization (TPF) with the remainder update (TRU) attained by the look-ahead technique. The results in that figure correspond to a fragment of a trace obtained for the LU factorization of a 10,000 × 10,000 matrix, using the blocked RL algorithm in Figure 4.6, with partial pivoting, and outer block size b = bo = 256. For this experiment, the t = 6 threads are partitioned into two teams: PF, with tpf = 1 thread in charge of TPF, and RU, with tru = 5 threads responsible for TRU. The panel factorization (panel) is performed via a call to the same algorithm, with bi = 32, and this operation proceeds sequentially (as PF consists of a single thread). The application of the row permutations is distributed among all 6 cores. As argued earlier, the net effect of the look-ahead is that the cost of the panel factorization no longer has a practical impact on the execution time of the (first four iterations of the) factorization algorithm, which is now basically determined by the cost of the remaining operations.

Given a static mapping of threads to tasks, b should balance the time spent in the two tasks: if the operations in TPF take longer than those in TRU, or vice versa, the threads in charge of the less expensive part will become idle, causing a performance degradation. This was already visible in Figure 4.8, which shows that, during the first four iterations, the operations in TPF are considerably less expensive than the updates performed as part of the remainder TRU. The complementary case, where TPF takes longer than TRU, is illustrated using the same configuration, for a matrix of dimension 2,000 × 2,000, in Figure 4.9. Unfortunately, as the factorization proceeds, the theoretical costs and execution times of TPF and TRU vary, making it difficult to determine the optimal value of b, which will need to be adapted during the factorization process.


Figure 4.8: Execution trace of the first four iterations of the blocked RL LU factorization with partial pivoting, enhanced with look-ahead, using 6 threads, applied to a square matrix of order 10,000, with bo = 256, bi = 32.

To close this section, note that there exist strict dependencies that serialize the operations within each task: PF1 ⇒ PF2 ⇒ PF3 and RU1 ⇒ RU2. Therefore, there is no further TP in the loop-body of this re-organized version. However, the basic look-ahead mechanism of level/depth 1 described in this subsection can be refined to accommodate further levels of TP, by “advancing” to the current iteration the panel factorization of the following d iterations, in a look-ahead of level/depth d. This considerably complicates the code of the algorithm, but can be seamlessly achieved by a runtime system enhanced with priorities.

4.3.3 Advocating for Malleable Thread-Level Linear Algebra Libraries

Let us assume, for simplicity, that TPF and TRU consist only of the panel factorization involving A22^P (PF3) and the update of A22^R (RU2), respectively. Furthermore, let us consider a nested TP+BDP execution using t = tpf + tru threads, initially with a team PF of tpf threads mapped to the execution of PF3 and a team RU of tru threads computing RU2. Ideally, for the LU factorization with look-ahead, we would like to perform a flexible sharing of the computational resources so that, as soon as the threads in team PF complete PF3, they join team RU to help in the execution of RU2, or vice versa. However, in practice, if team RU joins team PF, performance will not be increased due to the low degree of parallelism of TPF. In this case we propose a slightly different approach, referred to as early termination. We next discuss these two cases in detail.


Figure 4.9: Execution trace of the first four iterations of the blocked RL LU factorization with partial pivoting, enhanced with look-ahead, using 6 threads, applied to a square matrix of order 2,000, with bo = 256, bi = 32.

4.3.3.1 Worker sharing (WS): Panel factorization less expensive than update

Our goal is to enable that, at each iteration of the algorithm for the LU factorization with look-ahead, the threads in team PF that complete the panel factorization join the thread team RU working on the update. In this scenario, the problem is that, if the multiplication to update A22^R was initiated via an invocation to a traditional gemm, this is not possible, as none of the existing high-performance implementations of the BLAS allows the number of threads working on a kernel that is already in execution to be modified. The migration of threads from one task to another is referred to as worker sharing (WS).

Suboptimal solution: Static re-partitioning

A simple workaround for this problem is to split A22^R into multiple column blocks, for example, A22^R → [ A1  A2  ...  Aq ], and to perform a separate call to BLAS gemm in order to exploit BDP during the update of each block. Then, just before each invocation, the code queries whether the execution of the panel factorization is completed and, if that is the case, executes the suboperation with the threads from both teams (or only those of RU otherwise); a code sketch of this workaround is given after the list of drawbacks below. Unfortunately, this approach presents several drawbacks:

• Replacing a single invocation to a coarse gemm by multiple calls to smaller gemm may offer lower throughput because the operands passed to gemm are smaller and/or suboptimally "shaped". The consequence is that calling gemm multiple times will internally incur re-packing and data movement overheads, which are more difficult to amortize because of the smaller problem dimensions.

• The decision of which loops to partition for parallelism (note that A22^R could have alternatively been split by rows, or into blocks), and the granularity of this partitioning, is then placed upon the programmer's shoulders, who may lack the information that is necessary to make a good choice. For example, if the granularity is too coarse, this will have a negative effect because the integration of the single thread into the update will likely be delayed. A granularity that is too fine, on the other hand, may reduce the parallelism within the BLAS operation or result in the use of cache blocking parameters that are too small.
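For reference, the following C sketch outlines the static re-partitioning workaround just described. It is only an illustration: the flag panel_done would be set by the team that factorizes the panel, and set_blas_threads() is a hypothetical placeholder for whatever mechanism the underlying BLAS offers to fix its thread count (shown here forwarding to the standard OpenMP call, which only some BLAS instances honor).

    #include <stddef.h>
    #include <stdbool.h>
    #include <omp.h>
    #include <cblas.h>

    volatile bool panel_done = false;        /* set to true by team PF after PF3 */

    /* Placeholder: forward to whatever mechanism the BLAS provides to set the
       number of threads (library-specific in practice).                        */
    static void set_blas_threads(int nt) { omp_set_num_threads(nt); }

    /* Update A22^R (m x n) -= A21 (m x k) * A12^R (k x n), split into q column
       blocks; before each call, freed PF threads may be added to the team.     */
    void update_remainder_static(int m, int n, int k, int q,
                                 const double *A21, int lda,
                                 const double *A12r, int ldb,
                                 double *A22r, int ldc,
                                 int t_ru, int t_all)
    {
        int nb = (n + q - 1) / q;                  /* width of each column block */
        for (int j = 0; j < n; j += nb) {
            int w = (n - j < nb) ? n - j : nb;
            set_blas_threads(panel_done ? t_all : t_ru);    /* WS decision point */
            cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                        m, w, k, -1.0, A21, lda,
                        &A12r[(size_t)j * ldb], ldb, 1.0,
                        &A22r[(size_t)j * ldc], ldc);
        }
    }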

Malleable thread-level BLAS (MTL)

The alternative that we propose exploits BDP inside RU2, but allows the number of threads that participate in this computation to be changed even if the task is already in execution. In other words, threads are viewed as a pool of workers that can be shared between different tasks and reassigned to the execution of a (BLAS) kernel that is already in progress. The BLAS libraries that include this capability are hereafter referred to as malleable thread-level (MTL) BLAS.

The key to our approach lies in the explicit exposure of the internals of gemm (and other BLAS-3 kernels) in BLIS. Concretely, assume that RU2 is computed via a single invocation to BLIS gemm, and consider that this operation is parallelized by distributing the iteration space of Loop 4 among the threads in team RU (Figures 3.1 and 4.10). Then, just before Loop 4, we force the system to check whether the execution of the panel factorization is completed and, based on this information, to decide whether this loop is executed using either the union of the threads from both teams or only those in RU (Figure 4.11).
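The following self-contained (and heavily simplified) C routine illustrates the mechanism: the multiplication advances in slabs of mc rows and, before each slab, re-reads how many workers are available, so that threads released by the panel factorization can join an operation that is already in execution. The real malleable BLIS additionally packs Ac/Bc into buffers and invokes the micro-kernel, as described in Chapter 3; the variable available_threads and the loop structure shown here are illustrative only.

    #include <stddef.h>

    volatile int available_threads = 1;      /* updated by team PF when PF3 ends */

    /* Simplified "malleable" C += A*B (column-major): the thread count is
       re-evaluated at the entry point preceding each pass of the inner
       parallel loop (the analogue of Loop 4 in Figure 3.1).                    */
    void malleable_gemm(int m, int n, int k,
                        const double *A, int lda,
                        const double *B, int ldb,
                        double *C, int ldc, int mc)
    {
        for (int ic = 0; ic < m; ic += mc) {            /* analogue of Loop 3    */
            int mb = (m - ic < mc) ? m - ic : mc;
            int nt = available_threads;                 /* malleability check    */
            #pragma omp parallel for num_threads(nt) schedule(static)
            for (int j = 0; j < n; j++)                 /* columns split among   */
                for (int p = 0; p < k; p++)             /* the nt active threads */
                    for (int i = ic; i < ic + mb; i++)
                        C[i + (size_t)j * ldc] +=
                            A[i + (size_t)p * lda] * B[p + (size_t)j * ldb];
        }
    }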

Figure 4.10: Distribution of the workload among t = 3 threads when Loop 4 of BLIS gemm is parallelized, for the operation C(ic : ic + mc − 1, jc : jc + nc − 1) += Ac · Bc. Different colors in the output C distinguish the panels of this matrix that are computed by each thread as the product of Ac and the corresponding panels of the input Bc.

Let us re-analyze the problems listed for the workaround solution that statically partitioned the update of A22^R, and compare them with our solution that implicitly embeds this partitioning inside BLIS:

• The partitioning of gemm into multiple calls to smaller matrix multiplications does not occur. Our solution performs a single call to gemm only, so that there is no additional re-packing or data movement. For example, in the case just discussed, Bc is already packed and re-used independently of whether t or tru threads participate in the gemm. The buffer Ac is packed only once per iteration of Loop 3 (in parallel by both teams or only by RU).



Figure 4.11: Exploitation of TP+BDP in the blocked RL LU parallelization with look-ahead and WS. The execution is performed by teams PF and RU, consisting of tpf = 3 and tru = 8 threads, respectively. In this example, team PF completes the factorization PF3 while team RU is executing the first iteration of Loop 3 that corresponds to RU2/gemm (ic = 0). Both teams then merge and jointly continue the update of the remaining iterations of that loop (ic = mc, 2mc,...). With the parallelization of gemm Loop 4, one such “entry point” enables the merge at the beginning of each iteration of loop ic.

• The decision of the best partitioning/granularity is left in the hands of BLIS, which likely has more information to do a better job than the programmer.

Importantly, the partitioning occurs dynamically and is transparent to the programmer.

Figure 4.12 validates the effect of integrating a malleable version of BLIS into the same configuration that produced the results of Figure 4.8. A comparison of both figures shows that, with a malleable version of BLIS, the thread executing the operations in TPF, after completing this task, rapidly joins the team that computes the remainder updates, thus avoiding the idle wait.

Compared with BLIS, the same approach cannot be integrated into GotoBLAS because the implementation of gemm in this library only exposes the three outermost loops of Figure 3.1, while the remaining loops are encoded in assembly. The BLAS available as part of commercial libraries is not an option either, because hardware vendors offer black-box implementations which do not permit the migration of threads.


Figure 4.12: Execution trace of the first four iterations of the blocked RL LU factorization with partial pivoting, enhanced with look-ahead and malleable BLIS, using 6 threads, applied to a square matrix of order 10,000, with bo = 256, bi = 32.

4.3.3.2 Early termination: Panel factorization more expensive than update

The analysis of this case will reveal some important insights. In order to discuss them, let us consider that, in the LU factorization with look-ahead, the panel factorization (PF3) is performed via a call to the blocked routine in Figure 4.3 (right). We assume two blocking parameters: b = bo for the outer routine that computes the LU factorization of the complete matrix using look-ahead, and bi for the inner routine that factorizes each panel. (Note that, if bi = bo or bi = 1, the panel factorization is then simply done via the unblocked algorithm.) Furthermore, we will distinguish these two levels by referring to them as the outer LU (factorization with look-ahead) and the inner LU (factorization of the panel via the blocked algorithm without look-ahead). Thus, at each iteration of the outer LU, a panel of bo columns is factorized via a call to LU blk (inner LU), and this second decomposition proceeds to factorize the panel using a blocked algorithm with block size bi (Figure 4.13). From Figure 4.3 (right), the loop body of the inner LU consists of a call to the unblocked version of the algorithm (RL1), followed by the invocations to trsm and gemm that update A12 and A22, respectively (RL2 and RL3).

Now, let us assume that the update RU2 by the thread team RU is completed while the threads of team PF are in the middle of the computations corresponding to an iteration of the loop body of the inner LU. Then, provided the BDP versions of the trsm and gemm kernels that are invoked from the inner LU are malleable, inside them the system will perform the actions that are necessary to integrate the thread team RU, which is now idle, into the corresponding (and subsequent) computation(s). Unfortunately, the updates in the loop body of the inner LU involve small-grained computations (A12 and A22 have at most bo − bi columns,


Figure 4.13: Outer vs. inner LU and use of the algorithmic block sizes: the outer LU (LU la blk, block size bo) performs n/bo iterations, each comprising PF1, PF2, RU1, RU2 and the panel factorization PF3; PF3 invokes the inner LU (LU blk, block size bi), which performs bo/bi iterations of RL1, RL2, RL3 and, in turn, calls the unblocked LU unb (rl1, rl2) on panels of bi columns.

decreasing by bi columns at each iteration), and little parallel performance can be expected from it, especially because of partial pivoting. In order to deal with this scenario, a different option is to force the inner LU to stop at the end of the current iteration, to then rapidly proceed to the next iteration of the outer LU. We refer to this strategy as early termination (ET). In order to do this though, the transformations computed to factorize the current inner-panel must be propagated first to the remaining columns outside this panel, introducing a certain delay in this version of the ET strategy due to the implementation of the RL version of the LU factorization. An alternative is to rely on a left-looking (LL) version of the LU factorization for the inner LU, as discussed next. The blocked LL algorithm for the LU factorization differs from the blocked RL variant (algorithm in the right-hand side of Figure 4.3) in the operations performed inside the loop-body, which are replaced by

LL1.  A01 := trilu(A00)^{-1} A01,
LL2.  [ A11 ; A21 ] := [ A11 ; A21 ] − [ A10 ; A20 ] A01,
LL3.  [ A11 ; A21 ] := LU unb( [ A11 ; A21 ] ).

Thus, at the end of a certain iteration, this variant has updated only the current column of the inner-panel and those to its left. In other words, no transformations are propagated beyond that point (i.e., to the right of the current column/inner-panel), and ET can be implemented in a straightforward manner, with no delay compared with an inner LU factorization via the RL variant.

A definitive advantage of the LL variant compared with its RL counterpart is that the former implements a lazy algorithm, which delays the operations towards the end of the panel factorization, while the second corresponds to an eager algorithm that advances as much computation as possible to the initial iterations. Therefore, in case the panel factorization has to be stopped early, it is more likely that the LL variant has progressed further in the factorization. For example, consider the factorization of an m × n matrix that is stopped at iteration k < n. The LL algorithm will have performed about mk² − k³/3 flops at that point while, for the RL algorithm, the flop count rises to that of the LL algorithm plus 2(n − k)(mk − k²/2). The appealing consequence is that the LL algorithm enables the use of larger block sizes for the following updates.

From an implementation point of view, the synchronization between the two teams of threads is easy to handle. For example, at the beginning of each iteration of the outer LU, a boolean flag is reset to indicate that the remainder update is incomplete. The thread team RU then sets this flag as soon as its task is complete. In the meantime, the flag is queried by the thread team PF, at the end of every iteration of the inner LU, aborting its execution when a change is detected. With this operation mode, there is no need to protect the flag from race conditions. This solution also provides an adaptive (automatic) configuration of the block size as, if chosen too large, it will be adjusted for the current (and, possibly, subsequent) iterations by the early termination of the inner LU. The process is illustrated in Figure 4.14.

Figure 4.15 shows the effect of integrating the ET mechanism into the same configuration that produced the results in Figure 4.9. A comparison between both figures shows that, with ET, as soon as the RU team is done with the update, this situation is notified to the PF team and the execution of the panel factorization is aborted, so that all the threads move on to the next iteration of the factorization. Note that the time scale is different in both figures, and that the first gemm performed by the TPF team looks much larger now; in fact, the duration of this gemm is exactly the same as before (bo is still 256 in the first iteration), while the three remaining gemm are much shorter due to the automatic adaptation of the algorithmic block size performed by the ET mechanism.
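A minimal sketch of this handshake is given below (C with OpenMP, not the thesis code): the flag is written by one team and only read by the other, so, as argued above, no lock is needed; inner_ll_iteration() and remainder_update() are placeholders for the actual kernels, and BO/BI stand for the outer and inner block sizes.

    #include <stdbool.h>

    enum { BO = 256, BI = 32 };                 /* outer and inner block sizes   */
    static volatile bool remainder_done;        /* written by RU, read by PF     */

    static void inner_ll_iteration(int i) { (void)i; /* one LL step on the panel */ }
    static void remainder_update(void)    {          /* RU1 + RU2                */ }

    /* One iteration of the outer LU with look-ahead, malleable BLIS and ET. */
    static void outer_lu_iteration(void)
    {
        remainder_done = false;                 /* reset before both tasks start */
        #pragma omp parallel sections num_threads(2)
        {
            #pragma omp section                 /* team PF                       */
            for (int i = 0; i < BO / BI; i++) {
                inner_ll_iteration(i);          /* one iteration of the inner LU */
                if (remainder_done)             /* ET: the update finished first */
                    break;                      /* skip the remaining iterations */
            }
            #pragma omp section                 /* team RU                       */
            {
                remainder_update();
                remainder_done = true;          /* notify team PF                */
            }
        }
    }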

4.3.3.3 Relation to adaptive look-ahead via a runtime

Compared with our approach, which only applies look-ahead at one level, a TP execution that relies on a runtime for adaptive-depth look-ahead exposes a higher degree of parallelism from "future iterations", which can amortize the cost of the panel factorization over a considerably larger number of flops. This can be beneficial for architectures with a large number of cores, but it can be partially compensated, in our approach, by increasing the number of threads dedicated to the panel factorization, combined with a careful fine-grain exploitation of the concurrency [20]. On the other hand, adaptive-depth look-ahead via a runtime suffers from re-packing and data movement overheads due to multiple calls to gemm. Moreover, it couples the algorithmic block size that fixes the granularity of the tasks to that of the suboperands in gemm. Finally, runtime-based solutions rarely exploit nested TP+BDP parallelism and, even if they do so, taking advantage of an MTL BLAS from within them may be difficult.

4.3.4 Experimental Evaluation

In this section we analyze in detail the performance behavior of several multi-threaded implementations of the algorithms for the LU factorization:



Figure 4.14: Exploitation of TP+BDP in the blocked RL LU parallelization with look-ahead and ET. The execution is performed by teams TPF and TRU, consisting of tpf = 3 and tru = 8 threads, respectively. In this example, team TRU completes the update RU2 while team TPF is executing an iteration of the panel factorization PF3. TRU then notifies TPF of this event, and TPF skips the remaining iterations of the loop that processes the panel.

• LU: Blocked RL (Figure 4.3). This code only exploits BDP, via calls to the (non-malleable) multi-threaded BLIS (version 0.1.8).

• Variants enhanced with look-ahead (Figure 4.6). The following three implementations take advantage of nested TP+BDP, assigning part of the threads to the operations on the panel (team PF) and the rest to the remainder updates (team RU).

– LU LA (subsection 4.3.2): Blocked RL with look-ahead.
– LU MB (subsection 4.3.3.1): Blocked RL with look-ahead and malleable BLIS.
– LU ET (subsection 4.3.3.2): Blocked RL with look-ahead, malleable BLIS, and early termination of the panel factorization.

• LU OS: Blocked RL with adaptive look-ahead extracted via the OmpSs runtime (version 16.06). LU OS decomposes the factorization into a large collection of tasks connected via data dependencies, and then exploits TP only, via calls to a sequential instance of BLIS. In more detail, the OmpSs parallel version divides the matrix into a collection of panels of fixed width bo. All operations performed during an iteration of the algorithm on the same panel (row permutation, triangular system solve, matrix multiplication and, possibly, panel


Figure 4.15: Execution trace of the first four iterations of the blocked RL LU factorization with partial pivoting, enhanced with look-ahead and early termination, using 6 threads, applied to a square matrix of order 2,000, with initial bo = 256 and bi = 32.

factorization) are then part of the same task. This implementation includes priorities to advance the schedule of tasks involving panel factorizations.

All codes include standard partial pivoting and compute the same factorization. Also, all solutions perform the panel factorization via the blocked RL algorithm, except for LU ET and LU OS, which employ the blocked LL variant. The performance differences between the LL and RL variants, when applied solely to the panel factorization, were small. Nonetheless, for LU ET, employing the LL variant improves the ET mechanism and unleashes a faster execution of the global factorization. For LU OS we integrated the LL variant as well, to favor a fair comparison between this implementation and our LU ET. The block size is fixed to bo during the complete factorization in all cases, except for LU ET, which initially employs bo but then adjusts this value during the factorization as a consequence of the ET mechanism.

In the experiments, we considered the factorization of square matrices, with random entries uniformly distributed in (0,1), and dimension n = 500 to 12,000 in steps of 500. The block size for the outer LU was tested for values bo = 32 to 512 in steps of 32. The block size for the inner LU was evaluated for bi = 16 and 32. We employed one thread per core (i.e., t = 6) in all executions.


4.3.4.1 Optimal block size

The performance of the blocked LU algorithms is strongly influenced by the outer block size bo. As discussed in subsection 4.3.1, this parameter should balance two criteria:

• Deliver high performance for the gemm kernel. Concretely, in the algorithms in Figures 4.3 and 4.6, a value of bo that is too small turns A21 and A12/A12^R into narrow column and row panels respectively, transforming the matrix multiplication involving these blocks (RL3 in Figure 4.3 and RU2 in Figure 4.6) into a memory-bound kernel that will generally deliver low performance. Note that, for this gemm, m ≈ n ≫ k and k = bo.

• Reduce the amount of operations performed in the panel factorization (about n²bo/2 flops, provided n ≫ bo), in order to avoid the negative impact of this sequential stage.

Figure 4.16 sheds further light on the roles played by these two factors. The plot on the left-hand side reports the performance of gemm, in terms of GFLOPS, showing that the implementation of this kernel in BLIS achieves an asymptotic performance peak for k (= bo) around 144. (The performance drop observed for k slightly above 256 is due to the optimal value of kc being equal to that number in this architecture, as discussed in Chapter 3.) The right-hand side plot reports the ratio of flops performed in the panel factorizations with respect to those of the LU factorization.

Figure 4.16: GFLOPS attained with gemm (left) and ratio of flops performed in the panel factorizations normalized to the total cost (right).

The combined effect of these criteria seems to point in the direction of choosing the smallest bo that attains the asymptotic GFLOPS rate for gemm. However, Figure 4.17 reports the experimentally optimal block size bo for the distinct LU factorization algorithms, exposing that this is not the case. We next discuss the behavior of LU, LU LA and LU MB, which show different trends. (LU ET and LU OS will be analyzed later.) In particular, LU benefits from the use of larger values of bo than the other two codes for all problem dimensions. The reason is that a large block size operates on wide panels, which turns their factorization into a BLAS-3 operation with a mild degree of parallelism, and reduces the impact of this computation on the critical path of the factorization. LU LA exhibits a similar behavior for large problems, but favors smaller block sizes for small to moderate problems. The reason is that, for LU LA, it is important to balance the panel factorization (TPF) and the remainder update (TRU) so that their execution requires approximately the same time.

Compared with the previous two implementations, LU MB promotes the use of small block sizes, up to bo = 192, for the largest problems. (Interestingly, this corresponds to the optimal value of k for gemm.) One reason for this behavior is that, when the malleable version of BLIS is integrated into LU MB, the practical costs of the two branches/tasks do not need to be balanced. Let us elaborate this case further by considering the effect of reducing the block size, for example, from bo to b'o = bo/2. For simplicity, in the following discussion we will use approximations for the block dimensions and their costs; furthermore, we will assume that n ≫ bo. The first and most straightforward consequence of halving the block size is that the number of iterations is doubled. Inside each iteration with the original block size bo, the loop body invokes, among other kernels, a gemm of dimensions m × (m − bo) × bo (with m the number of rows in the trailing submatrix A22^R), for a cost of 2m²bo flops; in parallel, the factorization involves a panel of dimension m × bo, for a cost of mbo² − bo³/3 ≈ mbo² flops. When the block size is halved to b'o, the same work is basically computed in two consecutive iterations. However, this reduces the amount of flops performed in terms of panel factorizations to about 2m(b'o)² = mbo²/2, while it has a minor impact on the number of flops that are cast as gemm (two of these products, at a cost of 2m²b'o = m²bo flops each). The conclusion is that, by reducing the block size, we decrease the time that the single thread spends in the panel factorization TPF, favoring its rapid merge with the thread team that performs the remainder update TRU. Thus, in case the execution time of the LU is dominated by TRU, adding one more thread to this task (in this scenario, on the critical path) as soon as possible will reduce the global execution time of the algorithm.
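In summary, under the approximations above, halving the block size leaves the gemm flops essentially unchanged while halving the panel-factorization flops:

    block size bo   (one iteration):     gemm ≈ 2m²bo,                 panel ≈ mbo²
    block size bo/2 (two iterations):    gemm ≈ 2 · m²bo = 2m²bo,      panel ≈ 2 · m(bo/2)² = mbo²/2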


Figure 4.17: Optimal block size of the blocked RL algorithms for the LU factorization.

4.3.4.2 Performance comparison of the variants with static look-ahead

The previous analysis of the effect of the block size exposes that choosing the optimal block size is a difficult task. Either we need a model that can accurately predict the performance of each building block appearing in the LU factorization, or we must perform an extensive experimental analysis to select the best value. The problem is even more complex if we consider that, in practice, an optimal selection would have to vary the block size as the factorization progresses. Concretely, for the factorization of a square matrix of order n via a blocked algorithm, note that the problem is decomposed into multiple subproblems that involve the factorization of matrices of orders n − bo, n − 2bo, n − 3bo, etc. From Figure 4.17, it is clear that the optimal value of bo will be different for several of these subproblems. In the end, the value that we show in Figure 4.17 for each problem has to be considered as a compromise that attains fair performance for a wide range of the subproblems appearing in that case.


Figure 4.18: Performance comparison of the blocked RL algorithms for the LU factorization (except LU OS) with a fixed block size bo = 256.

Figure 4.18 reports the GFLOPS rates attained by the distinct implementations to compute the plain LU factorization and the variants equipped with static look-ahead (i.e., all except LU OS), using bo = 256 as a compromise value for all of them. Although this value is optimal for only a few cases, the purpose of this experiment is to show the improvements attained by gradually introducing the techniques enhancing look-ahead. The figure reveals some relevant trends:

• Except for the smallest problems, integrating the look-ahead techniques clearly improves the performance of the plain LU factorization implemented in LU.

• The version with malleable BLAS (LU MB) improves the performance of the basic version of look-ahead (LU LA) for the larger problems. This is a consequence of the cost of the panel factorization relative to that of the global factorization. Concretely, for fixed bo, as the problem size grows, the global flop-cost varies cubically in n, as 2n³/3, while the flop-cost of the panel factorizations grows quadratically, with n²bo/2. Thus, we can expect that, for large n, the remainder update TRU becomes more expensive than the panel factorization TPF. This represents the actual scenario that was targeted by the variant with malleable BLIS.

• The version that combines the malleable BLAS with ET (LU ET) delivers the same performance as LU MB for large problems, but outperforms all other variants with static look-ahead for the smaller problems. Again, this could be expected by considering the relative cost of the panel factorization for small n.


4.3.4.3 Performance comparison with OmpSs

We next extend the experimental analysis by providing a comparison of the best variant with static look-ahead, LU ET, with the implementation that extracts parallelism via the OmpSs runtime, LU OS. In this experiment we depart from the previous case, performing an extensive evaluation in order to report the performance for the optimal block size for each problem dimension and algorithm. The actual optimal values employed in the experiment are those extracted from Figure 4.17. For LU OS, we select a value for bo that is then fixed for the complete factorization. As this variant overlaps the execution of tasks from different iterations in time, it is difficult to vary the block size as the factorization progresses. For LU ET, the selected value of bo only applies to the first iteration. After that, the ET mechanism automatically adjusts this value during the factorization process.

Figure 4.19 shows the results of this comparison in the lines labelled as "(b opt)". LU ET is very competitive, clearly outperforming the runtime-based solution for most problems and offering competitive performance for the largest four.

Manually tuning the block size to each problem dimension is in general impractical. For this reason, the figure also shows the performance curves when the block size is fixed to bo = 192 for LU ET and bo = 256 for LU OS. These values were selected because they offered high performance for a wide range of problem dimensions, especially the largest ones, as reported in Figure 4.17. Interestingly, the performance lines corresponding to this configuration, labelled with "(b=192)"/"(b=256)", show that choosing a suboptimal value for bo has a minor impact on the performance of our solution LU ET, because the ET mechanism adjusts this value on-the-fly (for the smaller problem sizes). Compared with this, the negative effect of a suboptimal selection on LU OS is clearly more visible. This effect is observed for small problem dimensions, where the gap between the lines for the optimal block size and the fixed block size is larger.


Figure 4.19: Performance comparison between the OmpSs implementation and the blocked RL algorithm for the LU factorization with look-ahead, malleable BLIS and ET. Two configurations are chosen for each algorithm: optimal block size for each problem size; and fixed block sizes bo = 192 for LU ET and bo = 256 for LU OS.


A comparison with other parallel versions of the LU factorization with partial pivoting is possible, but we do not expect substantial changes in our results. In particular, Intel MKL includes a highly-tuned routine for this factorization that relies on their own implementation of the BLAS and some type of look-ahead. Therefore, whether the advantages of one implementation over the other come simply from the use of a different version of the BLAS, or from the positive effects of our WS and ET mechanisms, would be really difficult to infer. The PLASMA library [5] also provides a routine for the LU factorization with partial pivoting supported by a runtime that implements dynamic look-ahead. The techniques integrated in PLASMA's routine are not different from those in the OmpSs implementation evaluated in this work. Therefore, when linked with BLIS, we do not expect a different behavior between PLASMA's routine and LU OS.

4.3.4.4 Multi-socket performance comparison with OmpSs

We conclude the experimental analysis by including a multi-socket experiment that compares different configurations of the best variant with static look-ahead, LU ET, with the runtime approach via OmpSs, LU OS. In this last experiment we consider the two sockets present in the platform, using 12 threads in our tests, and report, as in the previous section, the performance for the optimal block size for each problem dimension and algorithm. The optimal values employed in this case are displayed in Figure 4.20.


Figure 4.20: Optimal block size of the blocked ET and OmpSs algorithms for the LU factorization.

Figure 4.21 shows the results of this comparison in the lines labelled as "(b opt)". LU ET is very competitive, clearly outperforming the runtime-based solution for most problems and offering competitive performance for the largest five, except for the case that maps one thread to TPF and the rest of the resources to TRU.

As in the case where only one socket is employed, the performance curves are obtained for a fixed block size and for the optimal block size for each problem dimension. The block size is fixed to bo = 256 for all cases except for LU ET when mapping one thread to TPF and eleven to TRU; for that configuration of LU ET, according to Figure 4.20, the value that offers high performance for a wide range of problem dimensions is bo = 224. As in the previous study, where only one socket was considered, the performance lines corresponding to the fixed block size configuration, labelled with "(b=256)"/"(b=224)", show that the ET mechanism is less affected by the use of a suboptimal block size value. Note that, for the case where only one thread is in charge of TPF, the difference between the optimal block size and the fixed block size is larger than in the other cases. This behavior is due to the reduced number of threads in charge of TPF, which makes the first iteration of the factorization costly. Consequently, adjusting the block size for the next iterations is not enough to overcome the effects of the suboptimal initial block size selection.

In addition, the impact of different thread mappings is shown in Figure 4.21. While in the previous section using one thread in TPF and the rest in TRU was enough, here we observe the benefits of adding more threads to TPF. Since more resources are available (2 sockets vs. 1 socket), increasing the number of threads in TPF makes this task finish earlier and, consequently, all the threads can join the execution of TRU sooner. Interestingly, when all cores are in use, employing only one thread in TPF harms performance, since the time spent in the execution of TPF is increased (compared to the other cases) and we are missing resources in TRU for a longer time.


Figure 4.21: Performance comparison between the OmpSs implementation and the blocked RL algorithm for the LU factorization with look-ahead, malleable BLIS and ET. Two configurations are chosen for each algorithm: optimal block size for each problem size; and fixed block size bo = 256 for all cases, except for LU ET when mapping one thread to TPF and eleven to TRU.

Again, a comparison with other parallel versions of the LU factorization is possible, but substantial changes in our results are not expected, for the same reason stated in the previous section.

4.4 Malleable LU on big.LITTLE

Following the static parallelization schemes described in Section 4.3.3 for the MTL BLAS on the Intel Xeon E5-2603 v3, in this section we adapt and evaluate them when the target system is an AMP. Given that there exists a dynamic alternative to apply look-ahead (via a runtime), we review different implementations for both cases and study their performance results. We start this section with the analysis of the static look-ahead implementations enhanced with malleability and early termination, where the operations that constitute the main building blocks are thoroughly examined; then the performance results are analyzed and compared with dynamic look-ahead versions that rely on a runtime.

All the experiments in this section were performed employing IEEE double-precision arithmetic, and the chip frequency was set to 1.3 GHz for both the Cortex-A15 and Cortex-A7 cores. Furthermore, all variants of the LU include standard partial pivoting and compute the same factorization. They also compute the factorization of the panel (kernels RL1 and PF3) via the blocked RL algorithm, except for the variants with static look-ahead, which integrate the blocked LL algorithm due to the application of ET. In the experiments we considered square matrices of dimension n = 500 to 8,000 in steps of 500, with random entries uniformly distributed in the interval (0, 1). The algorithmic block size was tested for values b = 32 to 512 in steps of 32; the inner block size for the factorization of the panel was bi = 32. The value bi = 16 was discarded because, as part of an independent experiment, we checked that using bi = 16 resulted in no performance differences when compared with bi = 32.

4.4.1 Mapping of threads for the variant with static look-ahead

A first step to migrate the parallel version of LU with static look-ahead presented in Section 4.3.3 to an AMP can simply replace the calls to trsm and gemm with their asymmetry-aware counterparts. However, the strategy to map the cores of the Exynos 5422 to the tasks now offers a richer collection of possibilities as, in addition to the number of threads that are assigned to each task, we also need to decide the type of cores. We note that all the configurations explored in this section are enhanced with WS, ET and MTL, so that the team in charge of TPF becomes involved in the execution of TRU once the former task is completed, and the TPF thread team stops its execution if the TRU team happens to finish first.

4.4.1.1 Thread mapping of GEMM

In order to analyze the effect of thread mapping in the LU factorization, four different strategies are tested in this section. In all cases, the number of threads computing the trsm (in both TPF and TRU) is the number of big cores that are available in that task; if no big cores are assigned to a task, all available LITTLE cores will perform the trsm. Regarding laswp, the whole set of (slow + fast) threads will perform it, except for the small pivoting interchanges included in TPF, which are single-threaded. Each configuration is identified using a naming scheme of the form "GEMM(w+x|y+z)", where "w+x" are two numbers, in the range 0–4, used to specify the number of Cortex-A7+Cortex-A15 cores (in that order) mapped to the execution of the gemm kernels appearing in TPF; and "y+z", in the same range, play the same role for those mapped to the execution of the gemm kernels appearing in TRU. (A sketch of how such a mapping can be pinned to the cores is given after the list below.) Specifically, the following configurations were tested in our experiments:

• GEMM(2+0|0+4): two Cortex-A7 cores execute TPF and four Cortex-A15 cores are mapped to TRU. With this configuration, two Cortex-A7 cores remain idle.

• GEMM(4+0|0+4): the whole Cortex-A7 cluster executes TPF and the Cortex-A15 cluster is in charge of TRU.

• GEMM(1+1|3+3): one Cortex-A7 core and one Cortex-A15 core are mapped to TPF while the remaining cores (three Cortex-A7 cores plus three Cortex-A15 cores) are mapped to TRU.


• GEMM(0+1|4+3): only one Cortex-A15 core is mapped to TPF; the remaining cores (four Cortex-A7 cores plus three Cortex-A15 cores) execute TRU.
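The following C fragment sketches how one of these mappings could be realized by pinning each thread to a cluster via Linux CPU affinities. It is only an illustration: the core numbering (here assumed to be 0-3 for the Cortex-A7 cluster and 4-7 for the Cortex-A15 cluster) depends on the kernel/device tree, and the thesis implementation relies on the asymmetry-aware BLIS kernels of Chapter 3 rather than on this snippet.

    #define _GNU_SOURCE
    #include <sched.h>

    /* Restrict the calling thread to the given set of core IDs. */
    static void pin_to_cores(const int *cores, int ncores)
    {
        cpu_set_t set;
        CPU_ZERO(&set);
        for (int i = 0; i < ncores; i++)
            CPU_SET(cores[i], &set);
        sched_setaffinity(0, sizeof(set), &set);   /* 0 = calling thread */
    }

    /* Example: GEMM(0+1|4+3) - one Cortex-A15 core for TPF, the four Cortex-A7
       plus three Cortex-A15 cores for TRU.  Assumed IDs: 0-3 = A7, 4-7 = A15. */
    static const int tpf_cores[] = { 7 };
    static const int tru_cores[] = { 0, 1, 2, 3, 4, 5, 6 };

    static void pin_team(int in_tpf)
    {
        if (in_tpf) pin_to_cores(tpf_cores, 1);
        else        pin_to_cores(tru_cores, 7);
    }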

For the first two parallel configurations, GEMM(2+0|0+4) and GEMM(4+0|0+4), at the beginning of each iteration the execution employs symmetric kernels, since TPF is mapped to (either two or four) Cortex-A7 cores only, while TRU is executed by the four Cortex-A15 cores only. Then, when TPF is completed, the active Cortex-A7 cores become part of the team in charge of TRU, and the execution relies on asymmetric kernels from that point. With these tests we aim to verify that the optimal number of threads mapped to TPF depends on the problem size, due to the low parallelism of this task. Figure 4.22 shows that, for the smaller problems (n < 3,500), higher performance is attained by the configuration that uses two Cortex-A7 cores only, but this situation is reversed for the larger problem dimensions, for which the exploitation of the four Cortex-A7 cores delivers higher GFLOPS rates. The reason is that the kernels appearing in TPF involve operands of small dimension, so that increasing the number of threads devoted to this task does not necessarily improve the performance of the global algorithm. Furthermore, when the Cortex-A7 cores are moved to TRU, the asymmetric kernels require large problem dimensions to reach their peak performance (see [28] for details).


Figure 4.22: Performance of the conventional parallelization of the blocked RL algorithm with static look-ahead enhanced with MTL and WS/ET for different gemm configurations.

The last two "task-hybrid" configurations differ from the previous ones in the combination of a few Cortex-A7 and Cortex-A15 cores in at least one of the gemm operations. In the first case, GEMM(1+1|3+3), the same number of big and LITTLE cores is used in each task, with fewer resources in TPF. As shown in Figure 4.22, combining the two different kinds of cores in both tasks increases the performance of the LU factorization by up to 40%. On the other hand, GEMM(0+1|4+3) features an uneven number of cores in the gemm kernel. While in the previous strategy the number of Cortex-A7 and Cortex-A15 cores is the same when computing a gemm, here only one Cortex-A15 core is employed in TPF, and the whole Cortex-A7 cluster plus three Cortex-A15 cores are used in TRU. This mapping favors the execution of TRU because more cores execute it from the beginning and, as pointed out in Section 3.3, the number of threads in execution has a great impact on the performance of gemm.


4.4.1.2 Configuration of TRSM

Starting from the best thread mapping for gemm (GEMM(0+1|4+3)) found in the previous section, we now explore three different configurations for trsm. The main difference among these three cases is the strategy applied to parallelize the trsm kernel:

• TRSM(L1L4;0+1|4+3): the asymmetric distribution of the workload between clusters takes place in Loop 1; then, in Loop 4, each core is assigned an even part of the workload according to the cluster it belongs to.

• TRSM(L4;0+1|4+3): the asymmetric distribution between the clusters is performed only at one level in Loop 4.

• TRSM(L4;0+1|0+3): leverages an asymmetry-oblivious implementation of trsm that takes advantage of the Cortex-A15 cores only, yielding slightly higher performance for large matrices (n > 6,000).

Figure 4.23 reports the results for the different configurations of trsm. Clearly, the worst option is to leverage an asymmetric distribution only in Loop 4. Curiously, although this reduces the number of buffers for Ac and Bc and diminishes the volume of cache misses, it offers lower performance. This behavior can be explained by the implementation of this configuration: given that an asymmetric distribution of the workload is required at this level, a dynamic mechanism is needed in order to implement it. However, the overhead introduced by the dynamic scheduling precludes an increase of performance from exploiting the asymmetry of the platform, mainly due to the small workload that trsm represents in the factorization (note that, for large matrix sizes, this trend changes). On the other hand, if trsm is configured to apply the dynamic distribution at a higher level (Loop 1), the overhead of this mechanism is significantly reduced and the performance improves considerably. Finally, we check the impact of employing only the Cortex-A15 cores to compute trsm. In view of the results, this is the best choice in order to optimize performance, since adding Cortex-A7 cores decreases the performance rate due to the low workload of trsm; in the best situation, trsm operates on a triangular matrix of dimension 256 × 256, making the addition of extra resources worthless.


Figure 4.23: Performance of the conventional parallelization of the blocked RL algorithm with static look-ahead enhanced with MTL and WS/ET for different trsm configurations.


4.4.1.3 The relevance of the partial pivoting

So far, the laswp routine has not been considered in the analysis because trsm and gemm are the most costly building blocks of the LU factorization with partial pivoting. However, the reorganization of the code in order to apply look-ahead precludes the overlap of all the row permutations in TPF with the TRU tasks due to data dependencies. For this reason, a deeper analysis of the impact of the parallelization of this operation is performed. To this end, we adopt the best configurations found in the previous sections, GEMM(0+1|4+3) and TRSM(L4;0+1|0+3), that is, the one that employs the Cortex-A15 cluster when performing trsm and a combination of Cortex-A7 and Cortex-A15 cores for gemm (note that in the TPF task only one thread is used, but seven are available for TRU).


Figure 4.24: Performance of the conventional parallelization of the blocked RL algorithm with different laswp parallelizations and static look-ahead enhanced with MTL and WS/ET.

In Figure 4.24 (left) we report the performance when changing the degree of parallelism and the mapping of the threads that execute the pivoting. These results are presented for different combinations of Cortex-A15 and Cortex-A7 cores. In view of them, we can conclude that, for small matrices, the parallelizations that only use Cortex-A15 cores in general provide better results. This is not surprising, since the workload is too small to take advantage of the Cortex-A7 cores. On the other hand, for large matrices it is hard to determine whether there is any difference in performance. When zooming in on the results (right-hand side plot in Figure 4.24), we can appreciate that adding one or more Cortex-A7 cores reduces performance. Therefore, we conclude that, in order to attain high performance, the best option when parallelizing laswp is to employ two Cortex-A15 cores, yielding an improvement that is on average around 7.7%.

4.4.2 Dynamic look-ahead with OmpSs

There are different approaches that rely on a runtime in order to introduce a dynamic look-ahead in the implementation of the LU factorization. Concretely, in a conventional runtime-based approach with task priorities (i.e., OmpSs), a dynamic look-ahead is orchestrated by the runtime, which is in charge of parallelizing the algorithm. A second option is the use of an asymmetry-aware BLAS library in combination with a traditional runtime [36]. This choice exposes the AMP as a conventional symmetric multi-core processor to the OmpSs runtime, hiding the asymmetry inside 4 symmetric virtual cores (VCs), each composed of a single Cortex-A15 core plus a single Cortex-A7 core. An alternative that we also explored relies on an asymmetry-aware version of OmpSs called Botlev [35].

Figure 4.25 illustrates the performance of the runtime-based parallel configurations of LU described in the previous paragraph:

• LU OS VC: OmpSs parallelization of the basic algorithm with the asymmetry-oblivious version of the runtime, linked with the asymmetry-aware implementation of trsm and gemm, and executed using all cores of the Exynos 5422 SoC grouped into 4 VCs.

• LU OS BO: OmpSs parallelization of the basic algorithm, parallelized using the Botlev asymmetry-aware version of the runtime, linked with the sequential implementation of trsm and gemm, and executed using all cores of the Exynos 5422 SoC.

• LU OS: OmpSs parallelization of the basic algorithm with an asymmetry-oblivious runtime, linked with the sequential implementation of trsm and gemm, and executed using all cores of the Exynos 5422 SoC.

As expected, this experiment reveals that considerably lower performance rates are obtained when no attempt is made to take into account the asymmetry of the architecture. This corresponds to the configuration LU OS, which combines an asymmetry-oblivious OmpSs scheduler with implementations of the gemm and trsm kernels that are not aware of the asymmetry. In contrast, the highest performance rates are observed for the configuration that views the Exynos 5422 SoC as 4 VCs, using asymmetry-aware versions of the gemm and trsm kernels, but an asymmetry-oblivious scheduler. The opposite solution, combining an asymmetry-aware scheduler with asymmetry-oblivious kernels, delivers a GFLOPS rate that is in between the other two. These results agree with the observations made for the Cholesky factorization in [36].


Figure 4.25: Performance of the OmpSs parallelization of the blocked RL algorithm for LU (dynamic look-ahead).

4.4.3 Global comparison

Figure 4.26 compares the best parallel configuration for each of the versions of the LU factorization evaluated in this section:

• LU AS: blocked RL algorithm for LU (without look-ahead) linked with the asymmetry-aware implementation of the basic building blocks gemm and trsm, which employs all 8 cores of the ARM SoC to extract parallelism from within each BLAS kernel.


• LU LA: LU with look-ahead with the optimal configurations of the basic kernels determined in subsection 4.4.1.1: GEMM(0+1|4+3), TRSM(L4;0+1|0+3), and two Cortex-A15 cores for the execution of laswp.

• LU OS VC: OmpSs parallelization of the basic algorithm with an asymmetry-oblivious version of the runtime, linked with the asymmetry-aware implementation of trsm and gemm, and executed using all cores of the Exynos 5422 SoC grouped into 4 VCs.

In this plot it can be observed that the conventional parallelization delivers a reduced performance rate due to the bottleneck imposed by the panel factorization. This hurdle can be greatly palliated through the introduction of static look-ahead, enhanced with MTL and WS/ET, yielding an implementation that consistently delivers up to 2 GFLOPS more than the conventional algorithm for LU. Finally, the runtime-based implementation, which applies a dynamic look-ahead strategy, proves that the deeper look-ahead is especially beneficial for small problems (n < 2,000). For mid to large problem dimensions, though, our implementation matches or slightly outperforms the runtime-based approach.

[Figure 4.26 plot: GFLOPS vs. problem dimension n on the Exynos 5422 SoC for LU_OS_VC, LU_LA and LU_AS.]

Figure 4.26: Performance of the best parallel configuration for each variant: conventional blocked RL algorithm, static look-ahead version enhanced with MTL and WS/ET, and runtime-based implementation.

4.5 Summary

In this chapter, we have migrated a legacy implementation of LAPACK that leverages the asymmetry-aware BLIS-3 to run on the target AMP. In doing so, we have explored the benefits and drawbacks of conducting a simple (plain) migration which does not perform any major optimizations in LAPACK. Our experimentation with two major factorizations from LAPACK illustrates two distinct scenarios, ranging from a compute-bound operation (Cholesky factorization), where high performance and energy efficiency are easily attained from this plain migration, to an operation (LU factorization) where the same level of success requires a significant reorganization of the code that introduces an advanced scheduling mechanism. To tackle the low performance attained by the LU factorization, we have introduced WS and ET as two novel techniques to avoid workload imbalance during the execution of matrix factorizations, enhanced with look-ahead, for the solution of linear systems. The WS mechanism especially benefits from the design of a MTL instance of BLIS, which allows the thread team in charge of the panel

factorization, upon completion of this task, to be reallocated to the execution of the trailing update. The ET mechanism tackles the opposite situation, with a panel factorization that is costlier than the trailing update. In such a scenario, the team that performed the update communicates to the second team that it should terminate the panel factorization, advancing the factorization process into the next iteration.

Our results on an Intel Xeon E5-2603 v3 showed the performance benefits of our version enhanced with WS+ET and malleable BLIS compared with a plain LU factorization as well as a version with look-ahead. The experiments also reported competitive performance compared with an LU factorization that is parallelized by means of a sophisticated runtime, such as OmpSs, that introduces look-ahead of dynamic (variable) depth. Compared with the OmpSs solution, our approach offers higher performance for most problem dimensions, seamlessly tunes the algorithmic block size, and features a considerably smaller memory footprint as it does not require sophisticated runtime support.

Finally, we applied our WS+ET solution to the LU factorization on an Exynos 5422 SoC equipped with an ARM Cortex-A7/Cortex-A15 multi-core processor in order to analyze the potential benefits, already observed for the Intel Xeon platform, that can be obtained on an AMP. In this case, we compared our strategy that applies a static look-ahead with the runtime-based approach (via OmpSs) that integrates a dynamic look-ahead. The experimental results show that the use of a thread-level malleable library, combined with the WS+ET strategies, outperforms the runtime-based implementation for small and medium matrices and matches its performance for large problem dimensions. The reason for this behavior lies in that, compared with the classical blocked algorithm, the static look-ahead partially removes the scalability bottleneck imposed by the panel factorization. In addition, compared with the runtime solution, the static look-ahead produces a more cache-friendly execution.

To conclude, this work does not intend to propose an alternative to runtime-based solutions. Instead, the message implicitly carried in these experiments aims to emphasize the benefits of malleable thread-level libraries, which we expect to be crucial in order to exploit the massive thread parallelism of future architectures. This work opens a plethora of interesting questions for future research. In particular, how to generalize the ideas to a multi-task scenario, what kind of interfaces may ease thread-level malleability, and which kind of support is necessary in the runtime for this purpose.

CHAPTER 5

Reduction to Condensed Forms

In this chapter we extend the study of LAPACK routines in combination with the asymmetry-aware implementation of BLIS in order to complete the study about the performance benefits when targeting an AMP. To this end, we analyze three routines that perform two-sided orthogonal reductions (TSOR) of a dense matrix to a condensed form. Efficient and numerically reliable algorithms for the computation of the eigenvalues/singular values of a dense matrix consist of two stages [56]: first, the matrix is transformed into a condensed matrix through a TSOR, and then a specific solver is applied to the condensed matrix in order to accurately compute the eigenvalues/singular values. We study the three main routines provided by LAPACK [11] for TSOR to distinct condensed forms:

• sytrd [75] reduces a symmetric matrix to tridiagonal form via similarity (i.e., eigenvalue-preserving) transformations;
• gebrd [55] transforms a general matrix to bidiagonal form; and
• gehrd [76] reduces a general matrix to Hessenberg form via similarity transformations.

The first and third routines are applied as an initial stage to compute the eigenvalues of a square matrix, while the second routine is the first step for the computation of the singular values of a not necessarily square matrix. In the following sections, we first describe the operations for the TSOR. Next, we present several optimizations to these procedures specifically designed for an ARM big.LITTLE AMP. Then, we present a detailed experimental analysis to illustrate the performance benefits of our architecture- and asymmetry-aware variants of sytrd, gebrd and gehrd on the big.LITTLE architecture. Finally, we propose a performance model that guides the selection of the optimal algorithmic block size and core configuration for the TSOR stage.

5.1 General Structure of the Reduction to Condensed Forms

Given a square matrix A, of order n, the associated eigenvalue problem is formally defined by

AX = XΛ, (5.1)


where the n × n diagonal matrix Λ = diag(λ1, λ2, ..., λn) contains the eigenvalues of A, and the columns of the n × n matrix X contain the corresponding eigenvectors [56]. On the other hand, the singular value decomposition (SVD) of an m × n matrix A is defined as

A = UΣV^T, (5.2)

where Σ = diag(σ1, σ2, ..., σr) is a square matrix of order r = min(m,n) that contains the singular values of A in decreasing order of magnitude (i.e., σi ≥ σi+1); and U, V^T, of respective dimensions m × r and r × n, are orthogonal and their columns comprise the left and right singular vectors of the matrix.

The routines in LAPACK for the solution of (symmetric and general) eigenproblems as well as the computation of the singular values tackle dense instances of these problems by first reducing A to a condensed matrix C, of dimension m × n (with m = n for eigenproblems), via a collection of Householder (orthogonal) reflectors [56]. For performance reasons, at each iteration of these TSOR procedures (usually known as blocked algorithms), several orthogonal reflectors are aggregated into a single block reflector, which is then applied via calls to the efficient Level-3 BLAS. It is important to note that there exists a multi-stage approach that performs the TSOR in two or more steps, by first reducing A to a band matrix and then successively refining this to the sought-after condensed form [19]. Nevertheless, this alternative approach will not be considered as it often requires a higher number of flops. We next describe the processes in some detail.

Let us denote the algorithmic block size as b and, for simplicity, assume hereafter that m, n are both integer multiples of b. Consider that we have progressed up to an iteration j ∈ {1, 2, ..., min(m,n)/b}, applying the necessary transformations (from the left and right) to the matrix in order to obtain:

        ⎡ C00  C01  C02 ⎤
  A  →  ⎢ C10  A11  A12 ⎥ ,
        ⎣  0   A21  A22 ⎦

where C00 is (j − 1)b × (j − 1)b; A11 is b × b; and the blocks C00, C10, C01 (and C02) contain the corresponding entries of the sought-after condensed form C. The following operations are then computed during the current iteration of the TSOR routines sytrd, gehrd and gebrd:

1. Panel Factorization (PF): The "current" column-panel, formed by A11 stacked on top of A21, and row-panel (A11 | A12) are reduced to the target condensed form using a sequence of orthogonal transformations. Simultaneously, these transformations are aggregated in the form of matrices V, X, both of dimension (m − jb) × b, and U, Y, both of size (n − jb) × b, such that the application of these transformations yields

        ⎡ C00  C01          C02          ⎤
        ⎢ C10  C11          C12          ⎥ ,
        ⎣  0   C21  A22 − V Y^T − X U^T  ⎦

implying that, upon completion of this operation, the computation of the condensed form has progressed by b columns/rows.

2. Trailing Update (TU): The submatrix A22 is updated as A22 := A22 − V Y^T − X U^T.

This generic TSOR procedure implements a blocked algorithm that processes the m × n matrix A, from top-left to bottom-right, in blocks of b-column/row panels starting at columns/rows ĵ =


(j − 1)b = 0, b, 2b, .... The bulk of the computation in PF corresponds to the formation of matrices X, Y (U, V are obtained as part of the panel factorization). In particular, for each reduced column in the panel, this may require several matrix-vector multiplications. This generic TSOR procedure requires some specialization depending on the type of condensed form to be computed:

• For sytrd, m = n, A is symmetric, X = Y, U = V and, in order to exploit the symmetry, only the lower (or upper) half of A22 is updated in TU. Note that only in this reduction C02 = 0, given that A is symmetric.
• The reduction to bidiagonal form via gebrd can be re-organized to reduce the computational cost in case m ≫ n (or vice versa), but the previous procedure is the preferred choice if m ≈ n.
• The reduction to Hessenberg form via gehrd slightly differs from the generic TSOR procedure in the specific blocks that are updated (and annihilated) at each iteration [86].

5.2 General Optimization of the TSOR Routines

There are four key optimization factors that have to be addressed to ensure high performance in the execution of the TSOR routines on the target AMP:

• Development of tuned micro-kernels for the Level-2 and Level-3 BLAS and each type of core.
• Asymmetry-aware parallelization of the Level-2 and Level-3 BLAS.
• Selection of the algorithmic block size for the TSOR procedure.
• Configuration of the number and type of cores to utilize for each type of Level-2 and Level-3 BLAS kernel invoked from the TSOR procedures.

The first two factors are tackled through the use of our asymmetry-aware implementation of BLIS, presented in Chapter 3. Therefore, this chapter focuses on performing a general evaluation of the impact of the last two factors on performance. The experiments in the following sections were performed with the Cortex-A7 cores operating at 1.4 GHz and the Cortex-A15 cores at 1.5 GHz, using real single-precision ieee arithmetic. In addition, all our experiments employ the sequential Level-1 kernels from BLIS (version 0.1.8), in combination with the multi-threaded asymmetry-aware instances of the Level-3 and Level-2 kernels. Cache and register configuration parameters are set to the values reported in Tables 5.1 and 5.2, respectively.

                  mr   nr    mc    kc     nc
ARM Cortex-A15     4    4   400   368   4,096
ARM Cortex-A7      4    4    88   368   4,096

Table 5.1: Parameters for optimal performance of the Level-3 kernels in BLIS on the ARMv7 big.LITTLE embedded in the Exynos 5422 SoC using real single-precision ieee arithmetic.

In general, the algorithmic block size b selected for the TSOR routines has an important performance effect. This impact will be analyzed in detail for each TSOR routine included in this chapter. In addition, a general analysis of the relevance of selecting the most appropriate core configuration is presented in this section.


                  nr    mc     nc
ARM Cortex-A15     4   832   2,560
ARM Cortex-A7      4   144   2,560

Table 5.2: Parameters for optimal performance of the Level-2 kernels in BLIS on the ARMv7 big.LITTLE embedded in the Exynos 5422 SoC using real single-precision ieee arithmetic.

5.2.1 Selection of the core configuration

A complementary factor that dictates the performance of the TSOR procedures, when executed on an AMP, is the number/type of cores (configuration) that are employed for the execution of each building block. Here we consider the BLAS-2 kernels (gemv and symv) and BLAS-3 kernels (gemm and syr2k) that appear in the TSOR procedures as building blocks. Although the trmm and trmv kernels are also found in some of the TSOR procedures tackled in this chapter, we do not consider them due to the short time that is spent on them during the execution of the reductions. In this section, we present the performance for each building block when using different core configurations, namely 4 Cortex-A7 cores, 1 Cortex-A15 core, 1 Cortex-A15 core + 4 Cortex-A7 cores (only for BLAS-2 routines), and 4 Cortex-A15 cores + 4 Cortex-A7 cores (only for BLAS-3 routines). In addition, we perform our tests for two different block sizes in order to analyze the effect of this parameter.

The two plots in the top row of Figure 5.1 report the performance rate attained by gemv, using matrix operands of two practical shapes encountered in the TSOR routines. These two graphs reveal that the threshold dimension from which it is more convenient to use a Cortex-A15 core plus the full Cortex-A7 cluster depends on the iteration step (m-dimension) and the algorithmic block size (n-dimension). For small block sizes, large values in the m-dimension favor the use of the Cortex-A15 core plus the full Cortex-A7 cluster. In contrast, for large block sizes, small values in the m-dimension are to be preferred. In addition, the block size dictates the highest sustainable performance observed for gemv. For small block sizes, this kernel attains 2.5 GFLOPS due to data re-use in the caches, but this value decreases to only 2 GFLOPS for large block sizes.

The four plots in the bottom two rows of Figure 5.1 show the results for an analogous experiment using the Level-3 BLAS routines and two matrix shapes that appear during the TSOR routines. The conclusions inferred from this analysis of the Level-3 BLAS are similar to those presented for gemv. In summary, the point from which it is more beneficial to use the entire SoC instead of the Cortex-A15 cluster alone depends on the iteration step and the block size. However, for the Level-3 BLAS, large block sizes tend to render higher performance, as they allow the selection of closer-to-optimal loop strides while extracting a higher degree of concurrency within the kernels.

To complete the analysis of the main building blocks present in the TSOR routines, Figure 5.2 shows the performance of the symv routine. In contrast with the previous kernels, an optimal configuration of this building block always exploits a Cortex-A15 core plus the full Cortex-A7 cluster.

In summary, the algorithmic block size directly affects the shapes of the matrix operands passed to the BLAS kernels invoked from the TSOR procedures. Furthermore, the experiments in this section illustrate that the block size changes the threshold from which it is more beneficial to use a certain configuration for the execution of a certain building block (kernel). Therefore, the effect of the block size has to be analyzed simultaneously with the core configuration.


[Figure 5.1 plots: GFLOPS vs. problem dimension on the Exynos 5422 SoC. Top row: gemv with n = 64 and n = 352, for the configurations 4 x A7, 1 x A15 and 1 x A15 + 4 x A7. Middle row: gemm with k = 64 and k = 352; bottom row: syr2k with k = 64 and k = 352, both for the configurations 4 x A7, 4 x A15 and 4 x A7 + 4 x A15.]

Figure 5.1: Performance of Level-2 and Level-3 BLAS on the ARMv7 big.LITTLE embedded in the Exynos 5422 SoC.

To close this section, we point out that employing a parallel configuration for the execution of small-size matrix-vector products is counterproductive. To avoid this negative effect, we designed an adaptive strategy that, according to problem dimension and block size, selects the best core configuration at runtime. This strategy ensures that the performance of the asymmetry-aware configuration matches that attained with the Cortex-A15 cluster for small- to medium-size problems. In contrast, for larger problems, our asymmetry-aware algorithms add the Cortex-A7 cluster to the computation, raising the GFLOPS rate by a factor that is close to 30%.
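As an illustration of this strategy, the following sketch shows the kind of threshold-based decision that can be taken before each kernel invocation. The encoding of the configurations and the threshold values are hypothetical placeholders; in practice they would be derived from calibration data such as that in Figure 5.1.

```c
/* Core configurations considered for the BLAS building blocks on the
 * Exynos 5422 SoC (illustrative encoding). */
typedef enum {
    CFG_1A15          = 0,   /* 1 Cortex-A15                        */
    CFG_1A15_PLUS_4A7 = 1,   /* 1 Cortex-A15 + 4 Cortex-A7          */
    CFG_4A15          = 2,   /* 4 Cortex-A15                        */
    CFG_4A15_PLUS_4A7 = 3    /* full SoC: 4 Cortex-A15 + 4 Cortex-A7 */
} config_t;

/* Select the configuration for a gemv of dimension m x b. The threshold
 * values are assumed to come from an off-line calibration and depend on
 * the block size, as discussed in the text. */
static config_t gemv_config(int m, int b)
{
    int threshold = (b <= 96) ? 3000 : 6000;   /* assumed calibration */
    return (m >= threshold) ? CFG_1A15_PLUS_4A7 : CFG_1A15;
}

/* Select the configuration for a Level-3 kernel (gemm/syr2k) operating on
 * an m x m trailing block with inner dimension b. */
static config_t blas3_config(int m, int b)
{
    int threshold = (b <= 96) ? 2000 : 1000;   /* assumed calibration */
    return (m >= threshold) ? CFG_4A15_PLUS_4A7 : CFG_4A15;
}
```

The selected configuration would then be translated into the number of threads to spawn on each cluster before invoking the corresponding asymmetry-aware BLIS kernel.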


[Figure 5.2 plot: GFLOPS of symv vs. problem dimension n on the Exynos 5422 SoC for the configurations 4 x A7, 1 x A15 and 1 x A15 + 4 x A7.]

Figure 5.2: Performance of the Level-2 BLAS symv on the ARMv7 big.LITTLE embedded in the Exynos 5422 SoC.

5.3 Reduction to Tridiagonal Form (sytrd)

Given a dense matrix A ∈ R^{n×n}, its reduction to tridiagonal form T by an orthogonal similarity transformation is given by T = Q^T A Q. Listing 5.1 displays a simplified C code that computes this type of reduction for an n × n matrix stored starting at address A with leading dimension Alda. For simplicity, hereafter we assume that the matrix size is an integer multiple of the block size b. The code overwrites the corresponding entries of the original matrix with the corresponding elements of the tridiagonal matrix T, leveraging the numerical kernels (or building blocks) latrd, syr2k and sytd2 for this purpose. The latrd subroutine from LAPACK reduces the first b rows and columns of a symmetric/Hermitian matrix A to real tridiagonal form by an orthogonal similarity transformation; the main BLAS-2 building blocks of this routine are gemv and symv. Routine sytd2 implements an unblocked algorithm that reduces a symmetric matrix to real symmetric tridiagonal form by orthogonal similarity transformations; it relies on the BLAS-2 symv kernel.

#define A_ref(i,j) A[(j)*Alda+(i)]

for (k=0; k<n-b; k+=b) {

  // Reduce current diagonal block
  LATRD( ..., &A_ref( k, k ), Alda, work, ldwork );

  // Update trailing submatrix
  SYR2K( ..., &A_ref( k+b, k ), Alda, work( b+1 ), ldwork,
         &A_ref( k+b, k+b ), Alda );
}

// Reduce the rest of the matrix
SYTD2( ..., &A_ref( k, k ), Alda, &info );

Listing 5.1: Blocked routine for the reduction to tridiagonal form.

The overall cost of routine sytrd is 4n³/3 flops, with 2n³/3 flops performed via calls to syr2k, and the rest corresponding to the Level-2 BLAS gemv and symv.


5.3.1 The impact of the algorithmic block size in sytrd

In order to analyze the effect of the algorithmic block size on sytrd, we perform an initial test using different block sizes b for distinct problem dimensions. Figure 5.3 reports the results in GFLOPS when the experiment is run using a single Cortex-A15 core. These results show a performance gap between the lowest and highest GFLOPS rates of about 0.5 GFLOPS for the smallest problem size. However, for the largest problem the difference is around 0.25 GFLOPS. This experiment exposes that the block size exerts a relevant impact on performance along the problem size range.

[Figure 5.3 plot: GFLOPS of sytrd vs. problem dimension n on a single Cortex-A15 core within the Exynos 5422 SoC, for block sizes b = 32 to 352.]

Figure 5.3: Performance of sytrd on a single Cortex-A15 core within the ARM big.LITTLE AMP embedded in the Exynos 5422 SoC using different algorithmic block sizes.

To better understand the role of the block size b, Figure 5.4 profiles the influence of this parameter on the distinct building blocks appearing in the sytrd routine. As reported in the figure, the symmetric matrix-vector product (symv, green lines) accounts for a major part of the global execution time, with this fraction of the practical cost growing with the problem size. This implies that any optimization of this particular kernel, via either an architecture-aware implementation or an asymmetry-aware parallelization, can be expected to yield important gains in the performance of the reduction routine. In addition, the execution time of symv is basically independent of the block size. Therefore, the optimization of this parameter for sytrd can be pursued by taking into account only the other two components of the reduction, namely gemv and syr2k. The execution time of the general matrix-vector products (carried out via gemv, dark blue lines) grows with the block size, while that of the symmetric rank-2k update (syr2k, red lines) shows the opposite behavior. The reason for these opposite trends lies in that an increase of the block size shifts part of the computational cost of the reduction (in the order of n²b flops) from the symmetric rank-2k update to the general matrix-vector product. This has a minor impact on the theoretical cost/execution time of the syr2k kernel, as the volume of computations performed in terms of this type of operation is 2n³/3 flops; indeed, the reduction in the execution time of this component is basically due to the use of a larger block size, which delivers a higher GFLOPS rate. However, increasing the amount of flops that are cast in terms of gemv has a major effect on the practical cost of gemv, as this is a memory-bound operation that proceeds at a much lower GFLOPS rate. In other words, although increasing b produces a small rise in the amount of flops that are cast in

terms of gemv (when compared to the total flops of the reduction routine), the practical cost (i.e., execution time) becomes much larger due to the low performance of this kernel.

[Figure 5.4 plot: execution time (s) of symv, gemv and syr2k within sytrd vs. problem dimension n, for b = 64 and b = 352, on a single ARM Cortex-A15 core.]

Figure 5.4: Profile of execution time spent by sytrd on a single ARM Cortex-A15 core embedded in the Exynos 5422 SoC. This experiment sets the block size to either 64 or 352 for comparison.

5.3.2 Performance of the optimized sytrd

This section shows the performance benefits of a tuned selection of the block size and core configuration, together with the integration of architecture-aware micro-kernels and the asymmetry-aware parallel versions of the building blocks for sytrd.

During the execution of the routine, we dynamically adjust the number/type of cores independently for the main building blocks in order to tune the performance depending on the dimensions of the operands that are involved in each call to a building block. This dynamic optimization for sytrd was applied to gemv and syr2k. Note that, for symv, no dynamic optimization is necessary, since using the Cortex-A7 cluster and one Cortex-A15 core is always the best option (see Section 5.2.1).

Figure 5.5 illustrates the performance of the sytrd routine, using the model-driven optimal algorithmic block size (b = 64) that will be presented in Section 5.6. The architecture-aware micro-kernels for the Level-2 BLAS kernels and the asymmetry-aware parallelizations of the Level-2/3 kernels correspond to the implementations presented in Chapter 3. The plots include a configuration with no optimizations applied to the Level-2 BLAS (labeled as "Initial") as well as one where all optimizations are present (labeled as "Asymmetry-aware"). Additionally, for comparison purposes, the plot includes four additional reference configurations:

• 4 x A15: execution on the Cortex-A15 cluster only (4 threads).

• 4 x A7: execution on the Cortex-A7 cluster only (4 threads).

• Ideal - 4: theoretical performance rate obtained by adding the GFLOPS rates of the isolated Cortex-A15 cluster and the isolated Cortex-A7 cluster.


• Ideal - 1: theoretical performance rate obtained by adding the GFLOPS rate of a single Cortex-A15 core multiplied by 4 (number of cores in the cluster) plus that of a single Cortex-A7 multiplied by 4.

[Figure 5.5 plot: GFLOPS of sytrd vs. problem dimension n on the Exynos 5422 SoC for the configurations Ideal-4, Ideal-1, Asymmetry-aware, Initial, 4 x A7 and 4 x A15.]

Figure 5.5: Performance of sytrd on the ARMv7 big.LITTLE embedded in the Exynos 5422 SoC. This experiment sets the block size to b = 64.

Let us discuss the results in Figure 5.5 in some detail. The use of the Cortex-A15 cluster only shows a performance rate that is almost flat, close to 3 GFLOPS. The reason is that the Level-2 BLAS dominates the execution time of the routine, and adding more than one Cortex-A15 core does not contribute any performance benefit. In contrast, the line that employs the Cortex-A7 cluster only shows an asymptotic performance that is close to that of the Cortex-A15 cluster, as the Level-2 BLAS do scale with the problem dimension on this type of core. These results can be related back to the analysis of the building blocks symv, gemv and syr2k in Section 5.2.

The asymmetry-aware configuration shows a consistent performance advantage over its homogeneous (i.e., symmetric or single-cluster) counterparts, as the former takes advantage of the computational power of the Cortex-A15 cores for the execution of the Level-3 BLAS syr2k and the scalability of the Level-2 BLAS symv/gemv on the Cortex-A7 cluster. Concretely, for the largest problem size, the speed-up of this solution grows to be above 2× with respect to the execution using any of the two clusters in isolation. In addition, the asymmetry-aware configuration benefits from the multi-threaded Level-2 BLAS and the optimized micro-kernels to deliver a performance rate that is up to 6× higher than that of the initial configuration. Focusing on the ideal (theoretical) configurations, our solution attains a performance rate that lies close to that of the Ideal-4 case, showing a fair distribution of the workload between the two clusters and no significant performance leaks. A less pleasant scenario appears in the comparison against the Ideal-1 case. This is explained by the actual lack of scalability of the Level-2 BLAS when executed on the Cortex-A15 (Chapter 3), in contrast with the unrealistic assumption of perfect scalability for the Ideal-1 line.

The major cause for the acceleration of sytrd, which allows narrowing the gap between the performance of the asymmetry-aware configuration and Ideal-4, is the large amount of symmetric matrix-vector products (symv), which are, in turn, large and independent of the block size. Large matrix-vector products offer appealing opportunities to exploit a parallel configuration.


5.4 Reduction to Bidiagonal Form (gebrd)

Given a dense matrix A ∈ R^{m×n}, its reduction to bidiagonal form B via orthogonal transformations is given by B = Q^T A P. Listing 5.2 displays a simplified C code that computes this type of reduction for an m × n matrix stored starting at address A with leading dimension Alda. For simplicity, we assume hereafter that m = n and that this dimension is an integer multiple of the block size b. Furthermore, we assume that the code overwrites the corresponding entries of the original matrix with those of the upper bidiagonal matrix B. For this purpose, it leverages the numerical kernels (or building blocks) labrd, gemm and gebd2. Routine labrd from LAPACK reduces the first b rows and columns of a general matrix to bidiagonal form using the BLAS-2 gemv kernel. The gebd2 kernel reduces a general matrix to bidiagonal form using an unblocked algorithm.

#define A_ref(i,j) A[(j)*Alda+(i)]

for (k=0; k<n-b; k+=b) {

  // Reduce current diagonal block
  LABRD( ..., b, &A_ref( k, k ), Alda, work(n*b), n );

  // Update trailing submatrix
  GEMM( ..., &A_ref( k+b, k ), Alda, work(n*b+b), n,
        &A_ref( k+b, k+b ), Alda );
  GEMM( ..., work(b), n, &A_ref( k, k+b ), Alda,
        &A_ref( k+b, k+b ), Alda );
}

// Reduce the rest of the matrix
GEBD2( ..., &A_ref( k, k ), Alda, work, &info );

Listing 5.2: Blocked routine for the reduction to bidiagonal form.

For gebrd, the effect of the Level-2 BLAS is more prominent than it was in sytrd. The total cost of this reduction, 8n³/3 flops, is split into 2n³/3 flops performed in terms of the Level-3 BLAS for the general matrix-matrix multiplication (gemm) and 2n³ flops as calls to gemv (invoked from the labrd and gebd2 routines).

5.4.1 The impact of the algorithmic block size in the Reduction to Bidiagonal Form (gebrd)

Following the same idea as for sytrd, in this section we study the configuration of the type and number of cores that execute the main building blocks of gebrd in order to test the potential performance increase. In order to analyze the effect of varying the block size, we perform an initial experiment with different algorithmic block sizes and problem dimensions. The results of this test, in GFLOPS, are reported in Figure 5.6. As for the sytrd case, the difference between the highest and lowest GFLOPS rates for the smallest problem size is around 0.5 GFLOPS. However, for the largest problem, the difference is smaller than for sytrd, being around 0.17 GFLOPS in this case. To gain a deeper insight into the behavior of gebrd depending on the block size b, we obtain a profile of the main building blocks that compose gebrd. The profile is shown in Figure 5.7 and proves that the execution time of this reduction is clearly split into two components: the general matrix-vector products, basically found in PF (gemv, dark blue lines), and the (two) general matrix-matrix multiplications for TU (gemm, light blue lines). Increasing the block size here shifts part of the flops from TU to PF, with an effect on performance similar to that already discussed for sytrd.


[Figure 5.6 plot: GFLOPS of gebrd vs. problem dimension m = n on a single Cortex-A15 core within the Exynos 5422 SoC, for block sizes b = 32 to 352.]

Figure 5.6: Performance of gebrd on a single Cortex-A15 core within the ARM big.LITTLE AMP embedded in the Exynos 5422 SoC using different algorithmic block sizes.

[Figure 5.7 plot: execution time (s) of gemv and gemm within gebrd vs. problem dimension m = n, for b = 64 and b = 352, on a single ARM Cortex-A15 core.]

Figure 5.7: Profile of execution time spent by gebrd on a single ARM Cortex-A15 core embedded in the Exynos 5422 SoC. This experiment sets the block size to either 64 or 352 for comparison.

5.4.2 Performance of the optimized gebrd

We next build upon the architecture-aware micro-kernels and asymmetry-aware parallel versions of the building blocks of gebrd presented in Chapter 3, in order to study the effect of selecting an optimal block size and core configuration. As already done for sytrd, we dynamically adjust the number and type of cores for the main kernels of this reduction, namely gemv and gemm.

Figure 5.8 reports the performance of gebrd (b = 96). The "Initial" line refers to a configuration with no optimizations applied to the BLAS-2, while the "Asymmetry-aware" line integrates all the proposed optimizations. Again, for comparison purposes, we include the performance for the


Cortex-A15 cluster (4 x A15) and Cortex-A7 cluster (4 x A7) in isolation, the theoretical GFLOPS rate as the addition of the performance of both clusters in isolation (Ideal - 4) and the theoretical performance rate as the addition of the performance of a single core of each type multiplied by 4 (Ideal - 1).

[Figure 5.8 plot: GFLOPS of gebrd vs. problem dimension m = n on the Exynos 5422 SoC for the configurations Ideal-4, Ideal-1, Asymmetry-aware, Initial, 4 x A7 and 4 x A15.]

Figure 5.8: Performance of gebrd on the ARMv7 big.LITTLE embedded in the Exynos 5422 SoC. This experiment sets the block size to b = 96.

In contrast to the results obtained for sytrd, Figure 5.8 shows that, for gebrd, the asymmetry-aware configuration steadily approaches but does not reach the performance of the Ideal-4 curve, providing a 2× higher performance rate than the initial configuration. Although there exists a large amount of calls to gemv in gebrd, and the aggregated time spent on this kernel is considerably large compared with the total execution time, only a small fraction of the gemv kernels involve a matrix operand that is large enough to benefit from a parallel configuration. This is also the case even for large problem sizes, as the matrix operand passed to the matrix-vector multiplications decreases in size at each iteration step. Thus, using Ideal-4 as a theoretical reference function here is, to a certain extent, unrealistic.

5.5 Reduction to Upper Hessenberg Form (gehrd)

Given a dense matrix A ∈ R^{n×n}, its reduction to upper Hessenberg form H by an orthogonal similarity transformation is given by H = Q^T A Q. Listing 5.3 displays a simplified C code that computes this decomposition for an n × n matrix stored starting at address A with leading dimension Alda. For simplicity, we assume hereafter that the matrix size is an integer multiple of the block size b. The code overwrites the corresponding entries of the original matrix with the upper Hessenberg matrix H, leveraging the numerical kernels (or building blocks) lahr2, gemm, trmm, larfb and gehd2 for this purpose. Routine lahr2 mainly employs gemv, trmv, gemm and trmm to reduce the specified number of leading columns of a general rectangular matrix A so that the elements below the specified subdiagonal are zero. Furthermore, it returns the auxiliary matrices which are needed to apply the transformation to the unreduced part of A. Routine larfb applies a block reflector or its transpose to a general rectangular matrix using gemm and trmm as main

building blocks. Finally, routine gehd2 implements an unblocked algorithm that reduces a general square matrix to upper Hessenberg form.

#define A_ref(i,j) A[(j)*Alda+(i)]

for (k=0; k<n-b; k+=b) {

  // Reduce current diagonal block
  LAHR2( ..., &A_ref( 1, k ), Alda, work, ldwork );

  // Apply block reflector
  GEMM( ..., work, ldwork, &A_ref( k+b, k ), Alda,
        &A_ref( 1, k+b ), Alda );
  TRMM( ..., &A_ref( k+1, k ), Alda, work, ldwork );

  for (j=0; j<b-1; j++)
    AXPY( ... );  // apply the reflector to the remaining panel columns

  LARFB( ..., &A_ref( k, k+1 ), Alda, &A_ref( k+1, k ), Alda,
         &A_ref( k+1, k+b ), Alda, work, ldwork );
}

// Reduce the rest of the matrix
GEHD2( ..., &A_ref( k, k ), Alda, work, &info );

Listing 5.3: Blocked routine for the reduction to upper Hessenberg form.

The cost of this reduction is 10n³/3 flops, with 20% cast in terms of distinct types of BLAS-2 matrix-vector products (called from lahr2 and gehd2) and the remaining 80% in efficient Level-3 BLAS [86].

5.5.1 The impact of the algorithmic block size in the Reduction to Upper Hessenberg Form (gehrd)

As in the previous reductions, we first analyze how the type and number of cores that execute the main building blocks of gehrd affect performance. As an initial step, we run gehrd while varying the block size b for different problem dimensions with the aim of exposing the impact of this parameter. Figure 5.9 reports the performance obtained in GFLOPS for this test. According to the results, a difference of 0.5 GFLOPS exists between the highest and lowest performance rates for the smallest problem size. In contrast with the previous reductions, for gehrd the performance gap increases with the problem dimension for small block sizes. This is due to the higher percentage of flops performed in BLAS-3 kernels, which favors the use of larger block sizes. The main conclusion from this preliminary experiment is that the block size exerts a relevant and consistent impact on performance across the problem size range.

As shown in Figure 5.10, the execution time is mainly due to two components, namely gemv and the BLAS-3 (gemm and trmm). The execution time of gemv (dark blue lines) grows with the block size, while that of the BLAS-3 (light blue lines) shows the opposite behavior. However, since about 80% of the flops are executed in BLAS-3 calls, the execution time corresponding to gemv, for this TSOR, amounts to about 50% of the total. This favors the use of larger block sizes, as the performance decrease suffered by gemv when adopting a larger block size (dark blue lines) is outweighed by the gains obtained in the BLAS-3 (light blue lines).

5.5.2 Performance of the gehrd routine

In this section we analyze the performance benefits of a tuned selection of the block size and core configuration when gehrd is run on top of the asymmetry-aware version of BLIS. In the following test, the number and type of cores for the main building blocks of gehrd (gemv and gemm) are


[Figure 5.9 plot: GFLOPS of gehrd vs. problem dimension n on a single Cortex-A15 core within the Exynos 5422 SoC, for block sizes b = 32 to 352.]

Figure 5.9: Performance of gehrd on a single Cortex-A15 core within the ARM big.LITTLE AMP embedded in the Exynos 5422 SoC using different algorithmic block sizes.

[Figure 5.10 plot: execution time (s) of gemv and BLAS-3 (gemm/trmm) within gehrd vs. problem dimension n, for b = 64 and b = 352, on a single ARM Cortex-A15 core.]

Figure 5.10: Profile of execution time spent by gehrd on a single ARM Cortex-A15 core embedded in the Exynos 5422 SoC. This experiment sets the block size to either 64 or 352 for comparison.

dynamically adjusted in order to improve the performance depending on the dimensions of the operands of each call to a building block.

Figure 5.11 illustrates the performance of routine gehrd using the optimal algorithmic block size b = 128 (chosen experimentally). As for the previous reductions, the plots include the configuration with no optimizations applied to the BLAS-2 (labeled as "Initial") as well as an alternative where all optimizations are present (labeled as "Asymmetry-aware"). We also include, for comparison purposes, the performance for both clusters in isolation (4 x A15 and 4 x A7), and two ideals, one

adding the performance of the clusters in isolation (Ideal - 4) and the other adding the performance of one core of each type multiplied by 4 (Ideal - 1).

[Figure 5.11 plot: GFLOPS of gehrd vs. problem dimension n on the Exynos 5422 SoC for the configurations Ideal-4, Ideal-1, Asymmetry-aware, Initial, 4 x A7 and 4 x A15.]

Figure 5.11: Performance of gehrd on the ARMv7 big.LITTLE embedded in the Exynos 5422 SoC. This experiment sets the block size to b = 128.

The plot in Figure 5.11 shows that, for this TSOR routine, the performance of the Ideal-4 curve is never reached, but it is steadily approached by the asymmetry-aware configuration. In addition, the performance rate of gehrd is 3× higher than that of the initial configuration, and slightly higher than the increment reached for gebrd due to the larger amount of time spent on BLAS-3 operations in this case. As for gebrd, only in a very small fraction of the invocations is the matrix operand involved in gemv large enough to benefit from a parallel configuration. For this reason, even though there exists a large amount of calls to gemv in gehrd and a long time is spent on them, using Ideal-4 as a theoretical reference is overly optimistic.

5.6 Modeling the Performance of the TSOR Routines

In the previous sections we illustrated the relevance of selecting the optimal block size and core configuration for the three building blocks appearing in sytrd (namely, symv, gemv and syr2k). In this section, we propose a methodical approach to model performance; see also [83, 10, 84, 85]. Here we select the optimal block size for the TSOR procedures, based on the experimental performance observed for their building blocks and the theoretical flop count for each type of building block. In order to illustrate this, we employ the specific case of the reduction to tridiagonal form of a symmetric n × n matrix A via routine sytrd. The same method carries over to the remaining two TSOR procedures. At this point, we recall some observations, connecting them to the experiments in Sections 5.2 and 5.3:

• Consider, for simplicity, that n is an integer multiple of the block size: n = r · b, for a given integer r. The blocked single-step reduction to tridiagonal form processes the n × n matrix A, from top-left to bottom-right, in a set of iterations j ∈ {1, 2, ..., n/b}, in blocks of b-column panels starting at rows/columns ĵ = (j − 1)b = 0, b, 2b, ..., (r − 1)b. In general n may not


be an integer multiple of the block size and, in such a case, the last panel receives a special treatment using an unblocked code.

• At iteration j, the assembly of V requires b symmetric matrix-vector multiplications, of decreasing dimensions n − (ĵ + 1), n − (ĵ + 2), n − (ĵ + 3), ..., n − (ĵ + b). Overall, the reduction performs n − 2 calls to symv, involving matrices of dimensions n − 1, n − 2, ..., 2. Thus, the block size b has no effect on either the number of flops or the shapes of the operands passed to the sequence of calls to symv. We can conclude, hence, that b should exert no impact on the performance of the symv component of sytrd. This observation is confirmed by the results in Figure 5.4.

• The experimental analysis in Chapter 3 revealed that a close-to-optimal configuration for the parallel execution of symv, on the Exynos 5422 SoC, employs a single Cortex-A15 core plus the full quad-core Cortex-A7 cluster. Slightly higher performance can be attained by activating a second Cortex-A15 core, but this will come at a non-negligible energy cost, which may be relevant for a low-power architecture. In consequence, we prefer the configuration with a single Cortex-A15 core. We use hand-coded micro-kernels for both types of core architectures, and a dynamic distribution of the iteration space of Loop 1 among the system cores, with cache-aware granularity mc that depends on the core type (see Tables 5.1 and 5.2).

• The assembly of V at iteration j requires 6 · b general matrix-vector multiplications of dimensions that depend on the algorithmic block size. Concretely, the dimensions of gemv vary (linearly in both dimensions) from (n − (ĵ + 1)) × 1 to (n − (ĵ + b)) × b. As a consequence, the overall number of flops performed by gemv directly depends on the algorithmic block size of sytrd, and we can conclude that b plays some role in performance. This is confirmed by the results in Figure 5.4.

• The experimental analysis in Chapter 3 hinted at similar conclusions for gemv to those exposed for symv, in the sense that close-to-optimal performance is obtained by using a single Cortex-A15 core plus the full quad-core Cortex-A7 cluster. However, for small problem dimensions, it is more beneficial to use a single Cortex-A15 core.

• At the end of each iteration j, routine sytrd invokes the syr2k kernel to update the trailing submatrix in A of order n − (ĵ + b) + 1, using two panels of dimension (n − (ĵ + b) + 1) × b each (A22 := A22 − U V^T − V U^T). Therefore, increasing the block size b accelerates the decay of the trailing submatrix dimensions as the iteration progresses, but augments the number of columns in the panels. In conclusion, we can expect that b has a certain effect on the performance of the sequence of calls to syr2k, because it affects the operands' dimensions and shapes. This is reflected by the results in Figure 5.4.

• The algorithmic block size directly affects the matrix shapes involved in syr2k and changes the threshold value for which it is more beneficial to use only the Cortex-A15 cluster or the full SoC. In conclusion, we should employ either the Cortex-A15 cluster or the full SoC when the dimension of A22 is, respectively, smaller or larger than the threshold for a given algorithmic block size.

To sum up, at iteration j ∈ {1, 2, ..., n/b}, routine sytrd invokes the following Level-2 and Level-3 BLAS routines:

1. b − 1 calls to symv, each involving a square matrix of order r − k, with r = n − (j − 1)b and k = 1, 2, ..., b − 1.

2. 6b calls to gemv, each involving a matrix of dimension (r − k) × k, with k = 1, 2, ..., b.


3. A single call to syr2k to perform two updates of the form Ĉ += Â · Â^T, on a triangular part of a square result matrix Ĉ of order s = n − jb, and with the input operand Â of dimension s × b.

Therefore, the total cost of the routine, 4n³/3 flops, can be distributed among the three building blocks as follows:

1. symv: Σ_{j=1}^{n/b} Σ_{k=1}^{b−1} 2(r − k)² = 2n³/3 flops.

2. gemv: Σ_{j=1}^{n/b} 12(r b²/2 − b³/3) = 3n²b flops.

3. syr2k: Σ_{j=1}^{n/b} 2s²b = 2n³/3 flops.

Note that the number of calls to symv and the dimension of the matrix operand for this kernel are independent of the algorithmic block size. Therefore, this type of kernel does not play a role in the optimization of b, and our target can be simplified to the minimization of the execution time for gemv and syr2k only. This can be formulated as:

min_b { T_gemv + T_syr2k },

where the execution times due to the flops performed via gemv and syr2k are given by

T_gemv = Σ_{j=1}^{n/b} Σ_{k=1}^{b} 12(r − k)k / G_gemv(r − k, k, C)   and   T_syr2k = Σ_{j=1}^{n/b} 2s²b / G_syr2k(s, b, C),

respectively. In the last expressions, G_gemv(p, q, C) and G_syr2k(p, q, C) stand for the GFLOPS rates delivered by the corresponding routines when operating on a problem of dimension (p, q) using a core configuration C. Note that in our model we distinguish the GFLOPS rates of gemv for the transposed and non-transposed cases. At this point, we recall that, for gemv, the optimal configuration employs either a single Cortex-A15 core or 1 Cortex-A15 + 4 Cortex-A7 cores. In contrast, for syr2k the optimization procedure has to select between 4 Cortex-A15 cores or the full Exynos 5422 SoC; see Figure 5.1.

This optimization model guides the search for the optimal block size and core configuration for sytrd using the experimental GFLOPS rates observed for syr2k and gemv. As we are only interested in a qualitative comparison of the execution time for different values of b and core configurations, we do not need to perform an exhaustive evaluation of the building blocks. Instead, we can select some representative values and interpolate the GFLOPS for the missing performance rates. Moreover, we note that the building blocks gemm and gemv also appear in the remaining two TSOR procedures, gehrd and gebrd. Therefore, we can reuse most of the experimental evaluation of the building blocks to tune the block size and core configuration for all three TSOR routines.

Figure 5.12 shows the evaluation of the performance determined via the model in comparison with the practical results obtained from an exhaustive execution of sytrd using different algorithmic block sizes. For each problem dimension, the top plot in that figure reports model-driven estimates of the time increment with respect to the execution time obtained when using the optimal block size for that problem size. Concretely, for the problem of dimension n = 1,000, the variation of time is normalized with respect to the execution time using an algorithmic block size b = 32 (which corresponds to the optimal value of b for that problem size); for n ranging from 1,250 to 2,500 the results are normalized with respect to the execution time using b = 64; and for n > 2,500, they are normalized with respect to the execution time using b = 96. In order to offer quantitative

variations of the execution time, the model should also have taken into account the execution time of symv. However, as we are only interested in a qualitative detection of the optimal algorithmic block size, we can simplify the search by neglecting the impact of symv in the model. Overall, the model estimates that the optimal block size is either 64 or 96, with the differences between these two algorithmic block sizes being below 1%. In addition, the model exposes that the execution time grows with the algorithmic block size.

The model-driven search of the optimal algorithmic block size is validated with the exhaustive evaluation of the performance of sytrd in the bottom plot of Figure 5.12. The practical results confirm that the actual algorithmic block sizes yielding the highest performance are also 64 and 96, with the performance declining when the algorithmic block size exceeds the largest of these values.

[Figure 5.12 plots: (top) model-predicted time increment (%) of sytrd vs. problem dimension n on the Exynos 5422 SoC for block sizes b = 32 to 256; (bottom) measured GFLOPS of sytrd for the same block sizes.]

Figure 5.12: Model-driven estimation of the relative execution time (top) and actual performance (bottom).

The previous experiment shows that the model can be used to perform a search of the optimal algorithmic block size without testing the factorization itself. However, as the model predicts performance differences below 1% between the two close-to-optimal algorithmic block sizes (similar to those observed in practice), this search methodology may introduce small deviations in

the value selected for b. Table 5.3 quantifies the impact of a suboptimal choice of b, comparing the execution time for executions that employ the algorithmic block size predicted by the model against those with the optimal algorithmic block size obtained from the experimentation. The results in the table reveal that the relative error is consistently below 2% (except in one case), being smaller than 1% for most problem dimensions.

Problem     Optimal (b)    Difference    Problem     Optimal (b)    Difference
dimension   Model   Real   (%)           dimension   Model   Real   (%)
1,000        32      32     –            4,750        96      64    1.66
1,250        64      64     –            5,000        96      64    1.82
1,500        64      64     –            5,250        96      96     –
1,750        64      64     –            5,500        96      64    0.36
2,000        64      64     –            5,750        96      64    1.05
2,250        64      64     –            6,000        96      64    0.69
2,500        64      64     –            6,250        96      64    0.17
2,750        96      64    1.89          6,500        96      96     –
3,000        96      64    1.78          6,750        96      96     –
3,250        96      64    2.19          7,000        96      64    0.16
3,500        96      64    0.86          7,250        96      96     –
3,750        96      64    1.85          7,500        96      96     –
4,000        96      64    1.81          7,750        96      64    1.32
4,250        96      64    1.59          8,000        96      64    0.33
4,500        96      64    1.16          Average                    0.71

Table 5.3: Relative differences of time for sytrd between executions using the optimal block size determined by the model and the real optimal value detected via exhaustive experimental tests.
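A compact sketch of how this model-driven search could be implemented is shown below. The GFLOPS lookup functions G_gemv and G_syr2k are assumed to interpolate tables of measured rates (folding the choice of the best core configuration for each operand shape into the returned value), and the candidate block sizes are supplied by the caller; none of these names correspond to actual thesis code.

```c
#include <float.h>

/* GFLOPS rates of gemv and syr2k for operands of dimension (p,q); assumed
 * to interpolate a table of measured rates (e.g., the data behind
 * Figure 5.1), already selecting the best core configuration. */
double G_gemv (int p, int q);
double G_syr2k(int p, int q);

/* Evaluate T_gemv + T_syr2k (in seconds) for matrix order n and block size
 * b, following the model of Section 5.6; symv is neglected since its cost
 * does not depend on b. */
static double predicted_time(int n, int b)
{
    double t = 0.0;
    for (int j = 1; j <= n / b; j++) {
        int r = n - (j - 1) * b;          /* order of the active block    */
        int s = n - j * b;                /* order of the trailing matrix */
        for (int k = 1; k <= b; k++)      /* gemv calls of this iteration */
            if (r - k > 0)
                t += 12.0 * (r - k) * k / (G_gemv(r - k, k) * 1e9);
        if (s > 0)                        /* one syr2k per iteration      */
            t += 2.0 * (double)s * s * b / (G_syr2k(s, b) * 1e9);
    }
    return t;
}

/* Pick the candidate block size with the smallest predicted time. */
static int optimal_block_size(int n, const int *candidates, int ncand)
{
    int    best_b = candidates[0];
    double best_t = DBL_MAX;
    for (int i = 0; i < ncand; i++) {
        double t = predicted_time(n, candidates[i]);
        if (t < best_t) { best_t = t; best_b = candidates[i]; }
    }
    return best_b;
}
```

For instance, calling optimal_block_size(4000, (int[]){32, 64, 96, 128}, 4) would return the candidate with the smallest value of T_gemv + T_syr2k according to the model.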

5.7 Summary

In this chapter, we have presented architecture- and asymmetry-aware realizations of the TSOR procedures for the solution of general and symmetric dense eigenvalue problems as well as singular value problems on ARM big.LITTLE multi-core architectures. Our experiments with tuned versions of these routines, specifically optimized for the ARM Cortex-A15 and Cortex-A7 cores present in the big.LITTLE target, show a significant acceleration of the execution time compared with a simple execution of LAPACK's legacy codes for this purpose.

Our theoretical and practical analyses reveal the large impact of the Level-2 BLAS kernels on the performance of the TSOR procedures and the critical roles of the algorithmic block size and the core configuration. Concretely, the block size has to be finely adjusted to distribute the workload between the Level-2 and Level-3 kernels, taking into account that the memory-bound nature of the former often places this type of operation on the critical path of the algorithm. In addition, an optimal execution also depends on the number and type of cores employed for each type and dimension of the building blocks, with these two parameters determining when it becomes convenient to add the LITTLE cores to the execution.

As a spin-off of this study, we propose a model to predict the optimal algorithmic block size for a TSOR routine based on the input dimensions, the main building blocks of the reduction, and the flops performed by them.


Our experiments reveal that the model is highly accurate, providing an algorithmic block size that deviates by less than 1% from the best GFLOPS rate.

Overall, we believe that the approach applied to optimize the routines for the TSOR to condensed forms on the target platform presented in this chapter carries over to other asymmetric and heterogeneous architectures, including hybrid CPU-GPU systems, as well as multi-socket/multi-core servers where distinct CPUs/cores operate at different frequencies.

CHAPTER 6

Conclusions

6.1 Conclusions and Main Contributions

The main goal of this dissertation was to study, design, develop and analyze experimental energy-aware solutions (models, programs, tools and techniques) for scientific and engineering applications on low-power architectures. At the conclusion of this work, the main contributions of this dissertation are the following:

• The adaptation of the symv and gemv BLAS-2 kernels in BLIS to an AMP.

• The adaptation of the BLAS-3 kernels in BLIS and an experimental study reflecting the performance gains attained by the asymmetry-aware version of the library.

• The performance and energy efficiency study of the Cholesky and LU factorizations when relying on the BLAS-3 asymmetry-aware version.

• The proposal of a thread-level malleability technique to share computational resources in the task-parallel execution of matrix factorizations.

• The performance and energy efficiency study of the Two-Sided Orthogonal Reductions (TSOR) routines when relying on the BLAS-3 and BLAS-2 asymmetry-aware versions. This contribution includes the creation of a model in order to determine the optimal block size and number of cores for the symv and gemv kernels at each step of the TSOR routines.

The main contribution of this dissertation is the adaptation of the BLAS-2 and BLAS-3 kernels from BLIS to create an asymmetry-aware version of the library that exploits the resources of an AMP, which has been the basis for the remaining parts of the dissertation. Using this library as the starting point, we have developed solutions for DLA operations that belong to the LAPACK level, completing the creation of asymmetry-aware DLA operations at all levels.

As a part of the thesis, and as a consequence of the complete study carried out for the LAPACK kernels, we have proposed a new technique to share execution resources among running tasks that we have named thread-level malleability. This technique can be applied on either symmetric or asymmetric platforms, reporting performance gains thanks to the better utilization of the existing resources. In our case, the technique has been applied to DLA operations; in principle, the same idea applies to libraries of a different nature, since it is a general strategy to share or reuse threads in a code.

An additional contribution of this thesis is the development of models to estimate performance and energy efficiency for different DLA operations on both symmetric and asymmetric platforms. These models gave us a better understanding of the operations and their behavior, especially on AMPs. In addition, they may be used in different scenarios; for example, they can be applied to any system running DLA operations in order to make scheduling decisions.

The following sections discuss the contributions and summarize the corresponding conclusions in more detail.

6.1.1 BLAS-3 kernels

An asymmetry-aware version of the BLAS-3 kernels included in BLIS was developed in order to exploit the available resources of an AMP. Given that all BLAS-3 kernels (except the trsm operation) can be implemented following the same structure as the matrix multiplication, this specific kernel was used to explore different strategies to distribute the workload between the two types of cores present in the ARMv7 big.LITTLE (quad-core Cortex-A15 + quad-core Cortex-A7) SoC. The key to our development was the integration of a coarse-grain scheduling policy, which dynamically distributes the workload between the two core types present in these architectures, combined with a complementary static schedule that repartitions this work among the cores of the same type.

With the promising performance results obtained for the matrix multiplication, the same strategy was applied to the remaining BLAS-3 operations for distinct operand shapes. Our results revealed excellent improvements in performance compared with the homogeneous implementations that operate exclusively on one type of core (either A15 or A7), and also with respect to multi-threaded implementations that simply apply a symmetric workload distribution and ignore the different cache organization and performance capabilities of the cores. In general, the results show considerable performance acceleration for the BLAS-3 kernels, and a more moderate one for the triangular system solve.

The complete asymmetry-aware BLAS-3 was also tested on a 64-bit ARMv8 architecture with a different number of big/LITTLE cores (four LITTLE cores + two big cores), delivering results similar to those observed for the 32-bit ARM architecture. This experimental study was conclusive in demonstrating the flexibility of our solution, as the second platform features a different big/LITTLE core count, clock frequency, and precision.
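The essence of this combined policy can be sketched in a few lines of C with OpenMP: a shared counter hands out chunks of macro-loop iterations dynamically to the big and LITTLE clusters, and each cluster repartitions its chunk statically among its own cores. The core counts, the chunk sizes, and the dummy per-block work below are illustrative assumptions, not the actual BLIS implementation. Compile with -fopenmp.

#include <omp.h>
#include <stdio.h>

#define N_BLOCKS 64             /* iterations of the macro-loop (e.g., blocks of C)   */

int main(void) {
    int owner[N_BLOCKS];         /* which cluster processed each block                 */
    int next = 0;                /* shared counter implementing the dynamic schedule   */
    int cores[2] = { 4, 4 };     /* cores per cluster: big, LITTLE (assumed)           */
    int chunk[2] = { 6, 2 };     /* blocks taken per request, ~proportional to speed   */

    omp_set_max_active_levels(2);                 /* allow nested parallelism          */
    #pragma omp parallel num_threads(2)           /* one "leader" per cluster          */
    {
        int cluster = omp_get_thread_num();       /* 0 = big, 1 = LITTLE               */
        for (;;) {
            int first;
            #pragma omp atomic capture            /* coarse-grain dynamic distribution */
            { first = next; next += chunk[cluster]; }
            if (first >= N_BLOCKS) break;
            int last = first + chunk[cluster];
            if (last > N_BLOCKS) last = N_BLOCKS;

            /* fine-grain static repartition among the cores of this cluster */
            #pragma omp parallel for schedule(static) num_threads(cores[cluster])
            for (int b = first; b < last; b++)
                owner[b] = cluster;               /* real code: run the micro-kernel   */
        }
    }

    int big = 0;
    for (int b = 0; b < N_BLOCKS; b++) big += (owner[b] == 0);
    printf("blocks processed by the big cluster: %d / %d\n", big, N_BLOCKS);
    return 0;
}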

6.1.2 BLAS-2 kernels

Following the same idea applied to BLAS-3, two BLAS-2 operations were made asymmetry-aware. The changes in this case included using the cache optimization parameters, integrating the appropriate scheduling mechanism, developing hand-tuned micro-kernels, parallelizing the codes, and finding the cache optimization parameters for the multi-threaded kernels.

The hand-tuned micro-kernels exploit data locality when accessing data in the registers, providing performance gains thanks to the better use of the resources. Additionally, the symv and gemv routines were parallelized in order to distribute the workload between the distinct types of cores. This modification led to a study of the appropriate number and type of cores to use in each case.


All the changes made to render symv and gemv asymmetry-aware led to important performance gains; they complete the work already performed for BLAS-3 and show the importance of employing fully asymmetry-aware DLA libraries.
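As an illustration of the kind of workload distribution adopted for these kernels, the sketch below splits the rows of A in gemv (y := A*x + y) between the big and LITTLE clusters according to an assumed throughput ratio, and then evenly among the cores of each cluster. The ratio, the core counts, and the function name gemv_asym are placeholders for this illustration; the library derives the actual split from the measured performance of each core type. Compile with -fopenmp.

#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

#define BIG_CORES     4
#define LITTLE_CORES  4
#define BIG_RATIO     0.75   /* assumed fraction of rows handled by the big cluster */

static void gemv_asym(int m, int n, const double *A, const double *x, double *y) {
    int m_big = (int)(BIG_RATIO * m);              /* rows for the big cluster     */
    #pragma omp parallel num_threads(BIG_CORES + LITTLE_CORES)
    {
        int t = omp_get_thread_num();
        int first, last;
        if (t < BIG_CORES) {                        /* static split inside big      */
            int per = (m_big + BIG_CORES - 1) / BIG_CORES;
            first = t * per;
            last  = (first + per < m_big) ? first + per : m_big;
        } else {                                    /* static split inside LITTLE   */
            int tl  = t - BIG_CORES;
            int m_l = m - m_big;
            int per = (m_l + LITTLE_CORES - 1) / LITTLE_CORES;
            first = m_big + tl * per;
            last  = (first + per < m) ? first + per : m;
        }
        for (int i = first; i < last; i++) {        /* each core: its block of rows */
            double acc = 0.0;
            for (int j = 0; j < n; j++) acc += A[(size_t)i * n + j] * x[j];
            y[i] += acc;
        }
    }
}

int main(void) {
    int m = 1000, n = 1000;
    double *A = malloc((size_t)m * n * sizeof *A), *x = malloc(n * sizeof *x),
           *y = calloc(m, sizeof *y);
    for (int i = 0; i < m * n; i++) A[i] = 1.0;
    for (int j = 0; j < n; j++)     x[j] = 1.0;
    gemv_asym(m, n, A, x, y);
    printf("y[0] = %.1f (expected %d)\n", y[0], n);
    free(A); free(x); free(y);
    return 0;
}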

6.1.3 TSOR routines on AMPs and models

A contribution of this dissertation consisted in the realization of architecture- and asymmetry-aware versions of the TSOR procedures for the solution of general and symmetric dense eigenvalue problems, as well as the computation of the SVD, on ARM big.LITTLE multi-core architectures. The experiments with tuned versions of these routines, specifically optimized for the ARM Cortex-A15 and Cortex-A7 cores present in the ODROID-XU3, showed a significant acceleration of the execution time compared with a simple execution of LAPACK's legacy codes for this purpose.

The theoretical and practical analyses revealed the large impact of the BLAS-2 kernels on the performance of the TSOR procedures and the critical roles of the algorithmic block size and the core configuration. Concretely, the block size has to be finely adjusted to distribute the workload between the BLAS-2 and BLAS-3 kernels, taking into account that the memory-bound nature of the former often places this type of operation on the critical path of the algorithm. In addition, an optimal execution also depends on the number and type of cores employed for each type and dimension of the building blocks, with these two parameters determining when it becomes convenient to add the LITTLE cores to the execution. As a spin-off of this insight, a simple model was proposed to predict the optimal algorithmic block size for a TSOR routine based on the input dimensions, the main building blocks of the reduction, and the flops performed by them.

6.1.4 LU and Cholesky factorization on big.LITTLE

Based on the asymmetry-aware BLIS-3, a legacy implementation of LAPACK was migrated to run on the target AMP. In doing so, the benefits and drawbacks of conducting a simple (plain) migration, which does not perform any major optimizations in LAPACK, were explored. The experimentation with two major routines from LAPACK, the LU and the Cholesky factorizations, illustrates two distinct scenarios: a compute-bound routine (the Cholesky factorization) where high performance is easily attained from this plain migration, and a routine (the LU factorization) where the same level of success requires a significant reorganization of the code that introduces an advanced scheduling mechanism.

More specifically, the plain migration of the LU factorization provides good results for large matrix sizes. However, for small and medium sizes the panel factorization is a bottleneck that reduces performance significantly. As a means to palliate the effect of the panel factorization, we adopted a look-ahead strategy. The analysis of its impact exposed a significant performance improvement for medium and large matrix sizes.

The study of these two LAPACK routines exposed two important insights: taking advantage of an asymmetry-aware DLA library of basic operations (BLAS) is essential in order to exploit asymmetric platforms and, given that many scientific applications use more sophisticated DLA operations, new strategies should be considered when adapting the whole DLA stack (BLAS+LAPACK) to AMPs.
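The structure of the look-ahead strategy of depth one can be summarized with the following sketch, where dummy kernels stand in for the actual panel factorization and trailing update: at each iteration the panel of step k+1 is factorized concurrently with the update of the remaining part of the trailing submatrix. This is a simplified illustration of the idea, not the thesis implementation. Compile with -fopenmp.

#include <omp.h>
#include <stdio.h>

#define NB 4        /* number of panels (n / b), illustrative */

static void factorize_panel(int k)   { printf("  panel %d factorized\n", k); }
static void update_next_panel(int k) { printf("  panel %d updated with panel %d\n", k + 1, k); }
static void update_remainder(int k)  { printf("  columns right of panel %d updated with panel %d\n", k + 1, k); }

int main(void) {
    factorize_panel(0);                          /* first panel, no overlap possible  */
    for (int k = 0; k < NB - 1; k++) {
        printf("iteration %d:\n", k);
        update_next_panel(k);                    /* panel k+1 must be updated first   */
        #pragma omp parallel sections num_threads(2)
        {
            #pragma omp section
            factorize_panel(k + 1);              /* look-ahead branch: small team     */
            #pragma omp section
            update_remainder(k);                 /* rest of trailing update: big team */
        }
    }
    return 0;
}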

6.1.5 Thread-level malleability

A contribution of this dissertation was the introduction of Worker Sharing (WS) and Early Termination (ET) as two novel techniques to avoid workload imbalance during the execution of matrix factorizations, enhanced with look-ahead, for the solution of linear systems. These techniques may be applied when more than one task is present in the code, and they are beneficial whenever workload imbalance appears. Using the LU as an example, we could show that the WS mechanism especially benefits from the adoption of a malleable thread-level instance of BLIS, which allows the thread team in charge of the panel factorization, upon completion of this task, to be reallocated to the execution of the trailing update. The ET mechanism tackles the opposite situation, with a panel factorization that is costlier than the trailing update. In such a scenario, the team that performed the update communicates to the second team that it should terminate the panel factorization, advancing the factorization process into the next iteration.

The results on an Intel Xeon E5-2603 v3 showed the performance benefits of the LU version enhanced with malleable BLIS and ET compared with a plain LU factorization as well as with a version with look-ahead. The experiments also reported competitive performance compared with an LU factorization parallelized by means of a sophisticated runtime, such as OmpSs, that introduces a look-ahead of dynamic (variable) depth. Compared with the OmpSs solution, our approach offered higher performance for most problem dimensions, seamlessly tuned the algorithmic block size, and featured a considerably smaller memory footprint, as it does not require a sophisticated runtime support.

To conclude, and thanks to the promising results obtained in our tests, we can expect thread-level malleability to be crucial in order to exploit the massive thread parallelism of future architectures.
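The ET handshake can be sketched as follows, with dummy kernels and a shared flag standing in for the real panel factorization and trailing update; the WS side, which relies on the malleable BLIS to reallocate the panel team to the trailing update, is not shown. All names and sizes below are illustrative assumptions. Compile with -fopenmp.

#include <omp.h>
#include <stdio.h>

#define PANEL_COLS 64

static void factorize_column(int j) { (void)j; /* dummy column factorization          */ }
static void trailing_update(void)   { /* dummy: pretend the update finishes quickly   */ }

int main(void) {
    int update_done = 0;      /* ET flag shared by the two teams */
    int cols_done   = 0;

    #pragma omp parallel sections shared(update_done, cols_done) num_threads(2)
    {
        #pragma omp section                    /* team 1: trailing update             */
        {
            trailing_update();
            #pragma omp atomic write
            update_done = 1;                   /* signal ET to the panel team         */
        }
        #pragma omp section                    /* team 2: panel factorization         */
        {
            for (int j = 0; j < PANEL_COLS; j++) {
                int stop;
                #pragma omp atomic read
                stop = update_done;
                if (stop) break;               /* ET: defer the remaining columns     */
                factorize_column(j);
                cols_done++;
            }
        }
    }
    printf("panel columns factorized before early termination: %d of %d\n",
           cols_done, PANEL_COLS);
    return 0;
}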

6.2 Related Publications

The contributions of the dissertation are supported by the publication of their content in different peer-reviewed national and international conferences and journals. In this section, the publications related to each contribution are listed and classified as directly related to the content of the dissertation, indirectly related, or other publications.

6.2.1 Directly related publications

6.2.1.1 Chapter 3. Basic Linear Algebra Subprograms (BLAS)

The first step in making DLA libraries asymmetry-aware was the analysis of gemm [28], the BLAS-3 model operation, in order to identify the best mechanisms to adapt it to an AMP. The work presented in this paper studies different scheduling schemes that improve gemm performance: static-asymmetric scheduling, cache-aware static-asymmetric scheduling, and cache-aware dynamic-asymmetric scheduling. That study is extended in [24], where the best scheduling option is applied not only to gemm but also to all the remaining BLAS-3 operations (except for trsm), for different loop parallelizations and operand shapes. The second paper completes the work started with the gemm kernel, proving the feasibility of an asymmetry-aware BLAS-3 and the performance gains of that adaptation. Furthermore, the asymmetry-aware BLAS-3 version was ported to a different big.LITTLE architecture with a distinct ratio of big/LITTLE cores, consolidating the initial results. The adaptation of the symv and gemv BLAS-2 operations, following the same idea as for gemm, is included as part of the work developed for the construction of the asymmetry-aware versions of the TSOR routines in [8], which is analyzed in Section 6.2.1.3.

Journal [28] Catalán, S., Igual, F. D., Mayo, R., Quintana-Ortí, E. S., Rodríguez-Sánchez, R. Architecture-aware configuration and scheduling of matrix multiplication on asymmetric multi-core processors. Journal of Cluster Computing (2016), Vol. 19(3), pp. 1037–1051.

102 6.2. RELATED PUBLICATIONS

In this paper, we design and embed several architecture-aware optimizations into a multi-threaded general matrix multiplication (gemm), a key operation of the BLAS, in order to obtain a high performance implementation for ARM big.LITTLE AMPs. Our solution is based on the reference implementation of gemm in the BLIS library, and integrates a cache-aware configuration as well as asymmetric static and dynamic scheduling strategies that carefully tune and distribute the operation's micro-kernels among the big and LITTLE cores of the target processor. The experimental results on a Samsung Exynos 5422, a system-on-chip with ARM Cortex-A15 and Cortex-A7 clusters that implements the big.LITTLE model, expose that our cache-aware versions of gemm with asymmetric scheduling attain important gains in performance with respect to their architecture-oblivious counterparts while exploiting all the resources of the AMP to deliver considerable energy efficiency.

Journal [24] Catalán, S., Herrero, J. R., Igual, F. D., Rodríguez-Sánchez, R., Quintana-Ortí, E. S., Adeniyi-Jones, C. Multi-threaded dense linear algebra libraries for low-power asymmetric multi-core processors. Journal of Computational Science (2016), to appear.

Dense linear algebra libraries, such as BLAS and LAPACK, provide a relevant collection of numerical tools for many scientific and engineering applications. While there exist high performance implementations of the BLAS (and LAPACK) functionality for many current multi-threaded architectures, the adaptation of these libraries for asymmetric multi-core processors (AMPs) is still pending. In this paper we address this challenge by developing an asymmetry-aware implementation of the BLAS, based on the BLIS framework, and tailored for AMPs equipped with two types of cores: fast/power-hungry versus slow/energy-efficient. For this purpose, we integrate coarse-grain and fine-grain parallelization strategies into the library routines which, respectively, dynamically distribute the workload between the two core types and statically repartition this work among the cores of the same type.

6.2.1.2 Chapter 4. Factorizations

Upon the adaptation of BLAS-3, the natural step was to create an asymmetry-aware version of the LAPACK operations. To do so, we focused on matrix factorizations for linear systems, selecting the Cholesky and LU factorizations as examples of the different natures of the routines included in the library. The migration of LAPACK routines to an AMP started in [24], where a plain migration of the legacy version of LAPACK was performed to experimentally assess the benefits, limitations, and potential of this approach from the perspectives of both throughput and energy efficiency. That work demonstrated the need for new approaches in order to improve the performance of the factorizations, especially for small/medium matrix sizes. In [36], an exhaustive analysis of different approaches (combining runtimes, asymmetry-aware DLA libraries, and asymmetry-oblivious DLA libraries) for the Cholesky factorization was performed, showing the benefits of combining runtimes and asymmetry-aware libraries; this work was extended in [37] with a similar study for the LU factorization, reaching comparable conclusions.

A different strategy to increase performance is proposed in [26], advocating for thread-level malleable DLA libraries in order to share the execution resources among the tasks in a specific code. This idea comes as an alternative that does not require a runtime to maximize the use of the available resources. The application of that idea to AMPs is analyzed in [31], where the benefits and drawbacks of using static (through the proposed malleability approach) and dynamic (relying on a runtime) scheduling are thoroughly discussed.

The following is a detailed list of the main publications related to this topic:

Journal [37] Costero, L., Igual, F. D., Olcoz, K., Catalán, S., Rodríguez-Sánchez, R., Quintana-Ortí, E. S. Revisiting conventional task schedulers to exploit asymmetry in multi-core architectures for dense linear algebra operations. Parallel Computing (2017), to appear.

Dealing with asymmetry in the architecture opens a plethora of questions related with the performance- and energy-efficient scheduling of task-parallel applications. While there exist early attempts to tackle this problem, for example via ad-hoc strategies embedded in a runtime framework, in this paper we take a different path, which consists in addressing the asymmetry at the library level by developing a few asymmetry-aware fundamental kernels. The appealing consequence is that the architecture heterogeneity remains then hidden from the task scheduler. In order to illustrate the advantage of our approach, we employ two well-known matrix factorizations, key to the solution of dense linear systems of equations. From the perspective of the architecture, we consider two low-power processors, one of them equipped with ARM big.LITTLE technology; furthermore, we include in the study a different scenario, in which the asymmetry arises when the cores of an Intel Xeon server operate at two distinct frequencies. For the specific domain of dense linear algebra, we show that dealing with asymmetry at the library level is not only possible but delivers higher performance than a naive approach based on an asymmetry-oblivious scheduler. Furthermore, this solution is also competitive in terms of performance compared with an ad-hoc asymmetry-aware scheduler furnished with sophisticated scheduling techniques.

Conference Proceedings [36] Costero, L., Igual, F. D., Olcoz, K., Catalán, S., Rodríguez-Sánchez, R., and Quintana-Ortí, E. S. Refactoring conventional task schedulers to exploit asymmetric ARM big.LITTLE architectures in dense linear algebra. In IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), (2016), pp. 692–701.

Dealing with asymmetry in the architecture opens a plethora of questions from the perspective of scheduling task-parallel applications, for which there exist early ad-hoc strategies embedded into asymmetry-conscious runtimes. In this paper we take a different path that addresses the complexity of the problem at the library level, via a few asymmetry-aware fundamental kernels, hiding the architecture heterogeneity from the task scheduler. For the specific domain of dense linear algebra, we show that this elegant solution delivers much higher performance than a naive approach based on an asymmetry-oblivious scheduler. Furthermore, this solution also outperforms an ad-hoc asymmetry-aware scheduler furnished with sophisticated scheduling techniques.

Arxiv [26] Catalán, S., Herrero, J. R., Quintana-Ortí, E. S., Rodríguez-Sánchez, R., and van de Geijn, R. A. A case for malleable thread-level linear algebra libraries: The LU factorization with partial pivoting. In Applied Mathematics and Computation, in review.

We propose two novel techniques for overcoming the load imbalance encountered when implementing so-called look-ahead mechanisms in relevant dense matrix factorizations for the solution of linear systems. Both techniques target the scenario where two thread teams are created/activated during the factorization, with each team in charge of performing an independent task/branch of execution. The first technique promotes worker sharing (WS) between the two tasks, allowing the threads of the task that completes first to be reallocated for use by the costlier task. The second technique allows a fast task to alert the slower task of its completion, enforcing the early termination (ET) of the second task and a smooth transition of the factorization procedure into the next iteration. The two mechanisms are instantiated via a new malleable thread-level implementation of the Basic Linear Algebra Subprograms (BLAS), and their benefits are illustrated via an implementation of the LU factorization with partial pivoting enhanced with look-ahead. Concretely, our experimental results on a six-core Intel Xeon processor show the benefits of combining WS+ET, reporting competitive performance in comparison with a task-parallel runtime-based solution.

Conference Proceedings [31] Catalán, S., Herrero, J. R., Quintana-Ortí, E. S., and Rodríguez-Sánchez, R. Static versus dynamic task scheduling of the LU factorization on ARM big.LITTLE architectures. In International Workshop on Accelerators and Hybrid Exascale Systems (AsHES), (2017), pp. 733–742.

We investigate several parallel algorithmic variants of the LU factorization with partial pivoting (LUpp) that trade off the exploitation of increasing levels of task-parallelism in exchange for a more cache-oblivious execution. In particular, our first variant corresponds to the classical implementation of LUpp in the legacy version of LAPACK, which constrains the concurrency exploited to that intrinsic to the basic linear algebra kernels that appear during the factorization, but exerts a strict control of the cache memory and a static mapping of kernels to cores. A second variant relaxes this task-constrained scenario by introducing a look-ahead of depth one to increase task-parallelism, increasing the pressure on the cache system in terms of cache misses. Finally, the third variant orchestrates an execution where the degree of concurrency is only limited by the actual data dependencies in LUpp, potentially yielding a higher volume of conflicts due to competition for the cache memory resources. The target platform for our implementations and experiments is a specific asymmetric multi-core processor (AMP) from ARM, which introduces the additional scheduling complexity of having to deal with two distinct types of cores, and an L2 cache shared per cluster of the AMP, which results in more contention in the access to this key cache level.

6.2.1.3 Chapter 5. Reductions

This chapter tackles the TSOR routines used in the solution of dense eigenvalue and singular-value problems, using the asymmetry-aware BLAS-2 and BLAS-3 versions as a starting point. This work is presented in [8], showing that a plain migration of these routines (using the sytrd kernel as an example) relying only on the asymmetry-aware version of BLAS-3 is not enough when the target architecture is an AMP. The analysis of the impact of parallelizing BLAS-2 for AMPs and of how the block size of these reductions affects the performance are included in this paper, as well as a model that helps to predict the best combination of cores and algorithmic block size depending on the matrix size.

Conference Proceedings [8]

Alonso, P., Catalán, S., Herrero, J. R., and Quintana-Ortí, E. S. Reduction to tridiagonal form for symmetric eigenproblems on asymmetric multi-core processors. In International Workshop on Programming Models and Applications for Multi-cores and Manycores (PMAM), (2017), pp. 39–47.

We investigate how to leverage the heterogeneous resources of an Asymmetric Multi-core Processor (AMP) in order to deliver high performance in the reduction to condensed forms for the solution of dense eigenvalue and singular-value problems. The routines that realize this type of two-sided orthogonal reductions (TSOR) in LAPACK are especially challenging, since a significant fraction of their floating-point operations are cast in terms of memory-bound kernels while the remaining part corresponds to efficient compute-bound kernels. To deal with this scenario: 1) we leverage implementations of memory-bound and compute-bound kernels specifically tuned for AMPs; 2) we select the algorithmic block size for the TSOR procedures via a practical model; and 3) we adjust the type and number of cores to use at each step of the reduction. Our experiments validate the model and assess the performance of our asymmetry-aware TSOR routines, using an ARMv7 big.LITTLE AMP, for three key operations: the reduction to tridiagonal form for symmetric eigenvalue problems, the reduction to Hessenberg form for general eigenvalue problems, and the reduction to bidiagonal form for singular-value problems.

6.2.2 Indirectly related publications

A parallel line of research was pursued into time and energy modeling of DLA operations and DLA fault tolerance on AMPs. Models were important to characterize the operations and to gain insight into the potential bottlenecks of each one. Those models were built for BLAS kernels [27, 9] and LAPACK operations [29, 8] on both symmetric and asymmetric platforms. Moreover, analyses of fault tolerance and its cost were performed, as this is a critical issue on very low-power platforms [33, 7, 25].

Journal [9] Alonso, P., Catalán, S., Igual, F. D., Mayo, R., Rodríguez-Sánchez, R., Quintana-Ortí, E. S. Time and energy modeling of high-performance level-3 BLAS on x86 architectures. Simulation Modelling Practice and Theory (2015), Vol. 55, pp. 77–94.

Journal [33] Chalios, C., Nikolopoulos, D., Catalán, S., Quintana-Ortí, E. S. Evaluating asymmetric multi-core systems-on-chip and the cost of fault tolerance using iso-metrics. IET Computers & Digital Techniques (2016), Vol. 10(2), pp. 85–92.

Journal [29] Catalán, S., Igual, F. D., Mayo, R., Rodríguez-Sánchez, R., Quintana-Ortí, E. S. Time and energy modeling of a high-performance multi-threaded Cholesky factorization. Journal of Supercomputing (2017), Vol. 73(1), pp. 139–151.

Journal [25] Catalán, S., Herrero, J. R., Quintana-Ortí, E. S., Rodríguez-Sánchez, R. Energy balance between voltage-frequency scaling and resilience for linear algebra routines on low-power multi-core architectures. Parallel Computing (2017), to appear.

Conference Proceedings [22] Catalán, S., González-Domínguez, J., Mayo, R., and Quintana-Ortí, E. S. Analyzing the energy efficiency of the memory subsystem in multi-core processors. In IEEE International Symposium on Parallel and Distributed Processing with Applications (ISPA), (2014), pp. 10–17.

Conference Proceedings [27] Catalán, S., Igual, F. D., Rodríguez-Sánchez, R., and Quintana-Ortí, E. S. Time and energy modeling of high performance multi-threaded matrix multiplication. In 15th International Conference on Computational and Mathematical Methods in Science and Engineering (CMMSE), (2015), Vol. 1, pp. 311–316.

Conference Proceedings [7] Aliaga, J., Catalán, S., Chalios, C., Nikolopoulos, D., and Quintana-Ortí, E. S. Performance and fault tolerance of preconditioned iterative solvers on low-power ARM architectures. In Parallel Computing (ParCo), (2016), Vol. 27, pp. 711–720.

6.2.3 Other publications

The publications listed in this section refer mainly to the collaboration in the development of some modules of the PMLib library and the analysis of that framework. This section also includes publications about distinct approaches to performing power measurements, and about alternatives that may optimize the collection of power samples. The publications related to that parallel work are listed below:

Journal [16] Barreda, M., Catalán, S., Dolz, M. F., Mayo, R., and Quintana-Ortí, E. S. Automatic Detection of Power Bottlenecks in Parallel Scientific Applications. Computer Science - Research and Development (2013), Vol. 29(3-4), pp. 221–229.

Journal [42] Diouri, M. E. M., Dolz, M. F., Glück, O., Lefèvre, L., Alonso, P., Catalán, S., Mayo, R., and Quintana-Ortí, E. S. Assessing power monitoring approaches for energy and power analysis of computers. Journal of Sustainable Computing, Informatics and Systems (2014), Vol. 4(2), pp. 68–82.

Journal [21] Castaño, M. A., Catalán, S., Mayo, R., and Quintana-Ortí, E. S. Reducing the cost of power monitoring with DC wattmeters. Computer Science - Research and Development (2014), Vol. 30(2), pp. 107–114.

Journal [34] Charles, J., Sawyer, W., Dolz, M. F., Catalán, S. Evaluating the performance and energy efficiency of the COSMO-ART model system. Computer Science - Research and Development (2014), Vol. 30(2), pp. 177–186.

Journal [43] Dolz, M. F., Kunkel, J., Chasapis, K., Catalán, S. An analytical methodology to derive power models based on hardware and software metrics. Computer Science - Research and Development (2016), Vol. 31(4), pp. 165–174.

Conference Proceedings [15] Barreda, M., Catalán, S., Dolz, M. F., Mayo, R., and Quintana-Ortí, E. S. Tracing the power and energy consumption of the QR factorization on multi-core processors. In 12th International Conference on Computational and Mathematical Methods in Science and Engineering (CMMSE), (2012), pp. 134–142.

Conference Proceedings [14] Barrachina, S., Barreda, M., Catalán, S., Dolz, M. F., Fabregat, G., Mayo, R., and Quintana-Ortí, E. S. An integrated framework for power-performance analysis of parallel scientific workloads. In 3rd International Conference on Smart Grids, Green Communications and IT Energy-aware Technologies (ENERGY), (2013), pp. 114–119.

Conference Proceedings [41] Diouri, M. E. M., Dolz, M. F., Glück, O., Lefèvre, L., Alonso, P., Catalán, S., Mayo, R., and Quintana-Ortí, E. S. Solving some mysteries in power monitoring of servers: Take care of your wattmeters! In Energy Efficiency in Large Scale Distributed Systems (EE-LSDS), Lecture Notes in Computer Science, Vol. 8046, Springer-Verlag, 2013, pp. 3–18.

Conference Proceedings [30] Catalán, S., Malossi, A. C. I., Bekas, C., and Quintana-Ortí, E. S. The impact of voltage-frequency scaling for the matrix-vector product on the IBM Power8. In European Conference on Parallel Processing (Euro-Par), (2016), pp. 103–116.

Conference Proceedings [23] Catalán, S., Ezzatti, P., Quintana-Ortí, E. S., and Remón, A. The impact of panel factorization on the Gauss-Huard algorithm for the solution of linear systems on modern architectures. In International Conference on Algorithms and Architectures for Parallel Processing (ICA3PP), (2016), pp. 405–416.

6.3 Open Research Lines

The use of low-power systems in HPC is a relatively novel approach in computer science, and thus several research questions remain open after the conclusion of this thesis. Some of the open research lines are detailed next:

• Full adaptation of the BLAS-2 operations to make them asymmetry-aware.

• Development of fully-functional implementations of the malleable BLAS kernels, based on BLIS, and further tailoring for AMP architectures.

• Creation of interfaces that may ease the use of thread-level malleability functionality.

• Extension of the thread-level malleability approach to a multi-task scenario through the integration of this technique into runtimes that exploit task-level parallelism, such as OmpSs [48].

CAPÍTULO 7

Conclusiones

7.1 Conclusiones y Contribuciones Principales

El principal objetivo de la tesis era estudiar, diseñar, desarrollar y analizar soluciones experimentales (modelos, programas, herramientas y técnicas) conscientes de la energía para aplicaciones científicas y de ingeniería en arquitecturas de bajo consumo. Tras la finalización de este trabajo, las contribuciones principales de esta tesis son las siguientes:

• La adaptación de los kernels BLAS-2 symv y gemv para un multiprocesador asimétrico (AMP, del inglés Asymmetric Multicore Processor).

• La adaptación de los kernels BLAS-3 de BLIS y un estudio experimental sobre las ganancias de rendimiento de la versión de la biblioteca consciente de la asimetría.

• El estudio de rendimiento y eficiencia energética de las factorizaciones Cholesky y LU al utilizar la versión BLAS-3 consciente de la asimetría.

• La propuesta de la técnica de maleabilidad a nivel de thread para la compartición de recursos computacionales en una ejecución paralela de factorizaciones de matrices basada en tareas.

• El estudio de rendimiento y eficiencia energética de las rutinas TSOR al utilizar las versiones BLAS-2 y BLAS-3 conscientes de la asimetría. Esta contribución incluye la creación de un modelo para determinar el tamaño óptimo de bloque y el número de cores de los kernels symv y gemv en cada paso de las rutinas TSOR.

La principal contribución de esta tesis es la adaptación de los kernels BLAS-2 y BLAS-3 de BLIS para crear una versión de la biblioteca consciente de la asimetría que explote los recursos de un AMP, lo que ha constituido la base para el trabajo restante de la tesis. Utilizando esta biblioteca como punto de partida, se han desarrollado soluciones para operaciones de álgebra lineal densa (o DLA, del inglés Dense Linear Algebra) que pertenecen al nivel de LAPACK, completando la creación de operaciones DLA conscientes de la asimetría a todos los niveles.


Como parte de la tesis, y como consecuencia del completo estudio llevado a cabo para los kernels de LAPACK, se ha propuesto una técnica de compartición de recursos de ejecución entre tareas en ejecución denominada maleabilidad a nivel de thread. Esta técnica puede ser aplicada tanto en sistemas simétricos como asimétricos, proporcionando mejoras en el rendimiento gracias a la mejor utilización de los recursos existentes. En este caso, la técnica se ha aplicado a operaciones DLA. En principio, la misma idea se puede aplicar a bibliotecas de distinta naturaleza, puesto que es una estrategia general para compartir o reutilizar threads en un código.

Una contribución adicional de esta tesis es el desarrollo de modelos para estimar el rendimiento y la eficiencia energética de diferentes operaciones DLA tanto en sistemas simétricos como asimétricos. Estos modelos han permitido un mejor conocimiento de las operaciones y de su comportamiento, especialmente en los AMPs. Además, pueden ser usados en cualquier sistema que ejecute operaciones DLA para tomar decisiones de planificación.

Las siguientes secciones explican las contribuciones y resumen las correspondientes conclusiones en más detalle.

7.1.1 Kernels BLAS-3

Una versión asimétrica de los kernels BLAS-3 incluidos en BLIS se ha desarrollado para explotar los recursos disponibles en un AMP. Puesto que todos los kernels BLAS-3 (excepto la operación trsm) se pueden implementar siguiendo la misma estructura que la del producto de matrices, este kernel en concreto se ha utilizado para explorar distintas estrategias de distribución de trabajo entre los dos tipos de cores presentes en el SoC ARMv7 big.LITTLE (quad-core Cortex-A15 + quad-core Cortex-A7). La clave de nuestra implementación es la integración de una política de planificación de grano grueso, que distribuye dinámicamente la carga de trabajo entre los dos tipos de cores presentes en estas arquitecturas, combinada con una planificación estática complementaria que reparte este trabajo entre los cores del mismo tipo.

Con los prometedores resultados de rendimiento obtenidos para el producto de matrices, la misma estrategia se ha aplicado al resto de operaciones BLAS-3 para distintas formas de operandos. Los resultados han demostrado mejoras importantes en rendimiento en comparación con las implementaciones homogéneas que trabajan exclusivamente en un tipo de core (bien A15 o A7), y también respecto a las implementaciones multi-hilo que simplemente aplican una distribución de trabajo simétrica y no tienen en cuenta la distinta organización de caches de los cores ni su capacidad de cómputo. En general, los resultados muestran una aceleración considerable del rendimiento para los kernels BLAS-3, siendo más moderada para la resolución de sistemas triangulares.

La versión completa de BLAS-3 se ha probado también en una arquitectura ARMv8 de 64 bits, con distinto número de cores big/LITTLE (cuatro cores LITTLE + dos cores big), proporcionando resultados similares a los observados para la arquitectura ARM de 32 bits. Este estudio experimental ha sido concluyente a la hora de demostrar la flexibilidad de nuestra solución, ya que la segunda plataforma dispone de una cantidad diferente de cores, distinta frecuencia de reloj y precisión.

7.1.2 Kernels BLAS-2

Siguiendo la misma idea aplicada en BLAS-3, dos operaciones BLAS-2 se han modificado para hacerlas conscientes de la asimetría. Los cambios en este caso incluyen el uso de parámetros de optimización de cache, la integración del mecanismo de planificación adecuado, el desarrollo de micro-kernels específicos, la paralelización de los códigos y la búsqueda de los parámetros óptimos de cache para los kernels multi-hilo.


Los micro-kernels específicos explotan la localidad de los datos al acceder a los registros, proporcionando ganancias en el rendimiento gracias a un mejor uso de los recursos. Además, las rutinas symv y gemv se han paralelizado para distribuir la carga de trabajo entre los distintos tipos de core. Esta modificación dio lugar a un estudio sobre el número y tipo de cores apropiado que debe usarse en cada caso.

Todos los cambios hechos para que symv y gemv sean conscientes de la asimetría han dado lugar a mejoras importantes en el rendimiento que completan el trabajo ya realizado para BLAS-3 y muestran la importancia de utilizar bibliotecas de álgebra lineal densa conscientes de la asimetría a todos los niveles.

7.1.3 Rutinas TSOR en AMPs y modelos

Una contribución de esta tesis ha consistido en la realización de versiones conscientes de la asimetría y de la arquitectura de los procedimientos TSOR para la resolución de problemas de valores propios densos generales y simétricos, así como la computación de la descomposición en valores singulares (SVD, del inglés Singular Value Decomposition), en arquitecturas ARM big.LITTLE multi-core. Los experimentos con las versiones apropiadas de estas rutinas, específicamente optimizadas para los cores ARM Cortex-A15 y Cortex-A7 presentes en el ODROID-XU3, han mostrado una aceleración significativa en el tiempo de ejecución comparado con una simple ejecución de los códigos LAPACK para este propósito.

Los análisis teóricos y prácticos han revelado el gran impacto de los kernels BLAS-2 en el rendimiento de los procedimientos TSOR y los roles críticos del tamaño del bloque algorítmico y de la configuración de los cores. Concretamente, el tamaño de bloque tiene que ser ajustado con precisión para distribuir la carga de trabajo entre los kernels BLAS-2 y BLAS-3, teniendo en cuenta que los primeros son operaciones limitadas por memoria que, debido a su naturaleza, a menudo se encuentran en el camino crítico del algoritmo. Además, una ejecución óptima también depende del número y tipo de cores empleados para cada tipo y dimensión de los bloques constituyentes, con estos dos parámetros determinando cuándo conviene añadir cores LITTLE a la ejecución. Como contribución derivada de esta circunstancia, se ha propuesto un modelo sencillo para predecir el tamaño óptimo del bloque algorítmico para una rutina TSOR basado en las dimensiones de los operandos de entrada, los principales bloques constituyentes de la reducción y los flops realizados por cada uno.

7.1.4 Factorizaciones LU y Cholesky en big.LITTLE

Basándonos en la versión BLIS-3 consciente de la asimetría, la implementación LAPACK de referencia se ha migrado para ejecutarla sobre un AMP. De esta manera, se han explorado los beneficios e inconvenientes de llevar a cabo una migración simple (plana) en la que no se realiza ninguna optimización relevante en LAPACK. Los experimentos con dos de las rutinas principales de LAPACK, las factorizaciones LU y Cholesky, ilustran dos escenarios (casos) distintos: una operación/rutina limitada por cómputo (la factorización de Cholesky), en la que un alto rendimiento se obtiene fácilmente solo con la migración plana, y una operación (la factorización LU) en la que, para conseguir la misma mejora en los resultados, es necesaria una reorganización significativa del código que introduce mecanismos avanzados de planificación.

Más concretamente, la migración plana de la factorización LU proporciona buenos resultados para matrices de tamaño grande. Sin embargo, para tamaños medianos y pequeños la factorización del panel es un cuello de botella que reduce el rendimiento significativamente. Como medio para paliar el efecto de la factorización del panel, se ha aplicado la técnica de look-ahead. El análisis del impacto al aplicar esta estrategia ha demostrado mejoras importantes en el rendimiento para tamaños de matrices medianos y grandes.

El estudio de estas dos rutinas LAPACK ha mostrado dos aspectos importantes: es esencial utilizar bibliotecas de álgebra lineal densa conscientes de la asimetría para las operaciones básicas (BLAS) para así poder explotar las plataformas asimétricas y, dado que muchas aplicaciones científicas utilizan operaciones de álgebra lineal densa más sofisticadas, se deben considerar nuevas estrategias al adaptar la pila completa de álgebra lineal densa (BLAS+LAPACK) a los AMPs.

7.1.5 Maleabilidad a nivel de thread

Una contribución de esta tesis ha sido la introducción de dos nuevas técnicas, Worker Sharing (WS) y Early Termination (ET), para evitar el desbalanceo de carga durante la ejecución de factorizaciones de matrices, mejoradas con look-ahead, para la resolución de sistemas lineales. Estas técnicas se pueden aplicar cuando existe más de una tarea en el código y serán beneficiosas en aquellos casos en los que hay desbalanceo de carga. Utilizando la LU como ejemplo, se ha podido comprobar que el mecanismo de WS se beneficia especialmente de una versión de BLIS con maleabilidad a nivel de thread, lo que permite que el equipo de threads que se encarga de la factorización del panel, una vez terminada esta tarea, se pueda reasignar a la ejecución de la actualización de la matriz. El mecanismo ET se encarga de la situación contraria, con una factorización del panel mucho más costosa que la actualización de la matriz. En este escenario, el equipo de threads que ejecuta la actualización le comunica al segundo equipo que debe terminar la factorización del panel, avanzando el proceso de factorización a la siguiente iteración.

Los resultados en un Intel Xeon E5-2603 v3 han mostrado los beneficios en el rendimiento de la versión de la LU mejorada con BLIS maleable y ET en comparación con la factorización LU plana, así como con la versión con look-ahead. Los experimentos también muestran un rendimiento competitivo comparado con una factorización LU paralelizada mediante un sofisticado runtime, como OmpSs, que introduce un look-ahead de profundidad dinámica (variable). Comparado con la solución de OmpSs, nuestra aproximación ofrece mayor rendimiento para la mayoría de problemas, un ajuste óptimo del tamaño de bloque algorítmico y un uso de memoria considerablemente menor, ya que no requiere el soporte de un sofisticado runtime.

Para concluir, y gracias a los prometedores resultados de los tests realizados, se puede esperar que la maleabilidad a nivel de thread sea crucial para explotar el paralelismo masivo de threads en futuras arquitecturas.

7.2 Publicaciones relacionadas

Las contribuciones de esta tesis están respaldadas por la publicación de su contenido en distintas conferencias y revistas revisadas por pares, tanto de carácter nacional como internacional. En esta sección se listan las publicaciones relacionadas con cada contribución y se clasifican como directamente relacionadas con el contenido de la tesis, indirectamente relacionadas u otras publicaciones.

7.2.1 Publicaciones directamente relacionadas

7.2.1.1 Chapter 3. Basic Linear Algebra Subprograms (BLAS)

El primer paso para hacer bibliotecas de álgebra lineal densa conscientes de la asimetría ha sido el análisis de gemm [28], la operación modelo de BLAS-3, para identificar el mejor mecanismo para adaptarla a un AMP. El trabajo presentado en este artículo estudia distintos esquemas de planificación que mejoran el rendimiento de gemm: planificación asimétrica estática, planificación asimétrica estática consciente de la cache y planificación asimétrica dinámica consciente de la cache. Este estudio se extiende en [24], donde la mejor opción de planificación se aplica no solo a la gemm, sino a todas las operaciones restantes de BLAS-3 (excepto para trsm), para distintas opciones de paralelización de bucles y formas de operandos. El segundo artículo completa el trabajo empezado con el kernel gemm, probando la viabilidad de una versión BLAS-3 completa consciente de la asimetría y analizando las mejoras en el rendimiento gracias a esta adaptación. Además, la versión BLAS-3 consciente de la asimetría se porta a una arquitectura big.LITTLE diferente con una proporción de cores big/LITTLE distinta, consolidando los resultados iniciales. La adaptación de las operaciones BLAS-2 symv y gemv, siguiendo la misma idea que para gemm, se ha incluido como parte del trabajo desarrollado en la creación de versiones conscientes de la asimetría de TSOR en [8], lo que será analizado en la Sección 7.2.1.3.

Journal [28] Catalán, S., Igual, F. D., Mayo, R., Quintana-Ortí, E. S., Rodríguez-Sánchez, R. Architecture-aware configuration and scheduling of matrix multiplication on asymmetric multi-core processors. Journal of Cluster Computing (2016), Vol. 19(3), pp. 1037–1051.

Journal [24] Catalán, S., Herrero, J. R., Igual, F. D., Rodríguez-Sánchez, R., Quintana-Ortí, E. S., Adeniyi-Jones, C. Multi-threaded dense linear algebra libraries for low-power asymmetric multi-core processors. Journal of Computational Science (2016), to appear.

7.2.1.2 Chapter 4. Factorizaciones

Una vez finalizada la adaptación de BLAS-3, el siguiente paso natural ha sido la creación de una versión consciente de la asimetría de las operaciones LAPACK. Para ello, nos hemos centrado en la factorización de matrices para sistemas lineales, seleccionando las factorizaciones Cholesky y LU como ejemplos de distinta naturaleza incluidos en la biblioteca. La migración de las rutinas LAPACK a un AMP se empezó en [24], en el que una migración plana de LAPACK se realizó para comprobar experimentalmente los beneficios, las limitaciones y el potencial de esta aproximación desde la perspectiva del rendimiento y la eficiencia energética. Este trabajo demostró la necesidad de nuevas aproximaciones para mejorar el rendimiento de las factorizaciones, especialmente para tamaños de matriz pequeños/medianos. En [36] se llevó a cabo un análisis exhaustivo de diferentes enfoques (combinando runtimes, bibliotecas de álgebra lineal densa conscientes de la asimetría y bibliotecas de álgebra lineal densa no conscientes de la misma) para la factorización de Cholesky, mostrando los beneficios de la combinación de runtimes y bibliotecas conscientes de la asimetría; este trabajo se extendió en [37], llevando a cabo un estudio similar para la factorización LU y obteniendo conclusiones comparables.

Una estrategia diferente para incrementar el rendimiento se propuso en [26], abogando por la maleabilidad a nivel de thread en bibliotecas de álgebra lineal densa para la compartición de recursos de ejecución entre las tareas de un código dado. Esta idea surge como una alternativa que no requiere de un runtime para maximizar el uso de los recursos disponibles. La aplicación de esta idea a AMPs se analiza en [31], donde los beneficios e inconvenientes de usar una planificación estática (a través de la técnica de maleabilidad propuesta) y dinámica (basada en un runtime) se discuten detalladamente.

A continuación se presenta una lista detallada de las principales publicaciones relacionadas con este tema:

Journal [37] Costero, L., Igual, F. D., Olcoz, K., Catalán, S., Rodríguez-Sánchez, R., Quintana-Ortí, E. S. Revisiting conventional task schedulers to exploit asymmetry in multi-core architectures for dense linear algebra operations. Parallel Computing (2017), to appear.

Conference Proceedings [36] Costero, L., Igual, F. D., Olcoz, K., Catalán, S., Rodríguez-Sánchez, R., and Quintana-Ortí, E. S. Refactoring conventional task schedulers to exploit asymmetric ARM big.LITTLE architectures in dense linear algebra. In IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), (2016), pp. 692–701.

Arxiv [26] Catalán, S., Herrero, J. R., Quintana-Ortí, E. S., Rodríguez-Sánchez, R., and van de Geijn, R. A. A case for malleable thread-level linear algebra libraries: The LU factorization with partial pivoting. In Applied Mathematics and Computation, in review.

Conference Proceedings [31] Catalán, S., Herrero, J. R., Quintana-Ortí, E. S., and Rodríguez-Sánchez, R. Static versus dynamic task scheduling of the LU factorization on ARM big.LITTLE architectures. In International Workshop on Accelerators and Hybrid Exascale Systems (AsHES), (2017), pp. 733–742.

7.2.1.3 Chapter 5. Reducciones

Este capítulo aborda las rutinas TSOR usadas en la resolución de problemas densos de valores propios y valores singulares, utilizando como base las versiones BLAS-2 y BLAS-3 conscientes de la asimetría. Este trabajo se presenta en [8], mostrando que la migración directa de estas rutinas (utilizando el kernel sytrd como ejemplo) basada únicamente en la versión BLAS-3 consciente de la asimetría no es suficiente cuando la arquitectura objetivo es un AMP. El análisis del impacto al paralelizar BLAS-2 para los AMPs y el efecto que tiene el tamaño de bloque sobre el rendimiento se incluyen en este artículo, así como un modelo que ayuda a predecir la mejor combinación de cores y de tamaño de bloque algorítmico dependiendo del tamaño de matriz.

Conference Proceedings [8] Alonso, P., Catalán, S., Herrero, J. R., and Quintana-Ortí, E. S. Reduction to tridiagonal form for symmetric eigenproblems on asymmetric multi-core processors. In International Workshop on Programming Models and Applications for Multi-cores and Manycores (PMAM), (2017), pp. 39–47.

7.2.2 Publicaciones indirectamente relacionadas

Una investigación paralela se ha llevado a cabo sobre modelado de tiempo y energía para operaciones de álgebra lineal densa y tolerancia a fallos sobre AMPs. Los modelos han sido relevantes en la caracterización de las operaciones para obtener un mejor conocimiento de los potenciales cuellos de botella en cada una. Estos modelos se han construido para kernels BLAS [27, 9] y operaciones LAPACK [29, 8] en plataformas simétricas y asimétricas. Además, se han realizado análisis sobre tolerancia a fallos y su coste como un tema crítico en plataformas de muy bajo consumo [33, 7, 25].

Journal [9] Alonso, P., Catalán, S., Igual, F. D., Mayo, R., Rodríguez-Sánchez, R., Quintana-Ortí, E. S. Time and energy modeling of high-performance level-3 BLAS on x86 architectures. Simulation Modelling Practice and Theory (2015), Vol. 55, pp. 77–94.

Journal [33] Chalios, C., Nikolopoulos, D., Catalán, S., Quintana-Ortí, E. S. Evaluating asymmetric multi-core systems-on-chip and the cost of fault tolerance using iso-metrics. IET Computers & Digital Techniques (2016), Vol. 10(2), pp. 85–92.

Journal [29] Catalán, S., Igual, F. D., Mayo, R., Rodríguez-Sánchez, R., Quintana-Ortí, E. S. Time and energy modeling of a high-performance multi-threaded Cholesky factorization. Journal of Supercomputing (2017), Vol. 73(1), pp. 139–151.

Journal [25] Catalán, S., Herrero, J. R., Quintana-Ortí, E. S., Rodríguez-Sánchez, R. Energy balance between voltage-frequency scaling and resilience for linear algebra routines on low-power multi-core architectures. Parallel Computing (2017), to appear.

Conference Proceedings [22] Catalán, S., González-Domínguez, J., Mayo, R., and Quintana-Ortí, E. S. Analyzing the energy efficiency of the memory subsystem in multi-core processors. In IEEE International Symposium on Parallel and Distributed Processing with Applications (ISPA), (2014), pp. 10–17.

Conference Proceedings [27] Catalán, S., Igual, F. D., Rodríguez-Sánchez, R., and Quintana-Ortí, E. S. Time and energy modeling of high performance multi-threaded matrix multiplication. In 15th International Conference on Computational and Mathematical Methods in Science and Engineering (CMMSE), (2015), Vol. 1, pp. 311–316.

Conference Proceedings [7] Aliaga, J., Catalán, S., Chalios, C., Nikolopoulos, D., and Quintana-Ortí, E. S. Performance and fault tolerance of preconditioned iterative solvers on low-power ARM architectures. In Parallel Computing (ParCo), (2016), Vol. 27, pp. 711–720.

7.2.3 Otras publicaciones

Las publicaciones listadas en esta sección se refieren en su mayoría a la colaboración en el desarrollo de algunos módulos de la biblioteca PMLib y el análisis de este framework. Esta sección también incluye publicaciones sobre distintas aproximaciones para realizar mediciones de consumo y diversas alternativas para optimizar la recolección de muestras de potencia. Las publicaciones relacionadas con este trabajo paralelo se listan a continuación:

Journal [16] Barreda, M., Catalán, S., Dolz, M. F., Mayo, R., and Quintana-Ortí, E. S. Automatic Detection of Power Bottlenecks in Parallel Scientific Applications. Computer Science - Research and Development (2013), Vol. 29(3-4), pp. 221–229.

Journal [42] Diouri, M. E. M., Dolz, M. F., Glück, O., Lefèvre, L., Alonso, P., Catalán, S., Mayo, R., and Quintana-Ortí, E. S. Assessing power monitoring approaches for energy and power analysis of computers. Journal of Sustainable Computing, Informatics and Systems (2014), Vol. 4(2), pp. 68–82.

Journal [21] Castaño, M. A., Catalán, S., Mayo, R., and Quintana-Ortí, E. S. Reducing the cost of power monitoring with DC wattmeters. Computer Science - Research and Development (2014), Vol. 30(2), pp. 107–114.

Journal [34] Charles, J., Sawyer, W., Dolz, M. F., Catalán, S. Evaluating the performance and energy efficiency of the COSMO-ART model system. Computer Science - Research and Development (2014), Vol. 30(2), pp. 177–186.

Journal [43] Dolz, M. F., Kunkel, J., Chasapis, K., Catalán, S. An analytical methodology to derive power models based on hardware and software metrics. Computer Science - Research and Development (2016), Vol. 31(4), pp. 165–174.

Conference Proceedings [15] Barreda, M., Catalán, S., Dolz, M. F., Mayo, R., and Quintana-Ortí, E. S. Tracing the power and energy consumption of the QR factorization on multi-core processors. In 12th International Conference on Computational and Mathematical Methods in Science and Engineering (CMMSE), (2012), pp. 134–142.

Conference Proceedings [14] Barrachina, S., Barreda, M., Catalán, S., Dolz, M. F., Fabregat, G., Mayo, R., and Quintana-Ortí, E. S. An integrated framework for power-performance analysis of parallel scientific workloads. In 3rd International Conference on Smart Grids, Green Communications and IT Energy-aware Technologies (ENERGY), (2013), pp. 114–119.

Conference Proceedings [41] Diouri, M. E. M., Dolz, M. F., Glück, O., Lefèvre, L., Alonso, P., Catalán, S., Mayo, R., and Quintana-Ortí, E. S. Solving some mysteries in power monitoring of servers: Take care of your wattmeters! In Energy Efficiency in Large Scale Distributed Systems (EE-LSDS), Lecture Notes in Computer Science, Vol. 8046, Springer-Verlag, 2013, pp. 3–18.

Conference Proceedings [30] Catalán, S., Malossi, A. C. I., Bekas, C., and Quintana-Ortí, E. S. The impact of voltage-frequency scaling for the matrix-vector product on the IBM Power8. In European Conference on Parallel Processing (Euro-Par), (2016), pp. 103–116.

Conference Proceedings [23] Catalán, S., Ezzatti, P., Quintana-Ortí, E. S., and Remón, A. The impact of panel factorization on the Gauss-Huard algorithm for the solution of linear systems on modern architectures. In International Conference on Algorithms and Architectures for Parallel Processing (ICA3PP), (2016), pp. 405–416.

7.3 Líneas abiertas de investigación

El uso de sistemas de bajo consumo en CAP es relativamente nuevo en computación y, por tanto, varias cuestiones permanecen abiertas tras la conclusión de esta tesis. Algunas de las líneas abiertas de investigación se detallan a continuación:

• Adaptación completa de las operaciones BLAS-2 para hacerlas conscientes de la asimetría.

• Desarrollo de una implementación maleable completamente funcional de los kernels BLAS, basada en BLIS, y su correspondiente adaptación para arquitecturas AMP.

• Creación de interfaces que faciliten la utilización de la maleabilidad a nivel de thread.

• Extensión de la aproximación de maleabilidad a nivel de thread a un escenario multi-tarea a través de la integración de esta técnica en runtimes que permiten el paralelismo a nivel de tarea, como por ejemplo OmpSs [48].



Acronyms

ACML AMD Core Math Library. 5

AMP asymmetric multicore processor. ix, 9, 15, 20, 47, 52, 71, 75, 99–103, 105, 106, 109–114, 116

API Application Programming Interface. 12

BDP block-data parallelism. 52

BLAS Basic Linear Algebra Subprograms. 5, 6, 15, 17, 38, 71

BLIS BLAS-like Library Instantiation Software Framework. 7, 8, 15–17, 19–21, 23–27, 29, 30, 32, 34, 36–38, 40, 46, 47

CAP Computación de Altas Prestaciones. 116

CPU Central Processing Unit. xvii

DLA Dense Linear Algebra. 4, 5, 11, 13, 52, 53, 99–103, 106, 109, 110

ET Early Termination. 101, 112

FLOPS floating-point arithmetic operations per second. xvii

GFLOPS billions of floating-point arithmetic operations per second. 8, 21, 33, 45, 47, 85, 88, 91

GFLOPS/W GFLOPS per Watt. 21, 29, 47

GPU Graphics Processing Unit. 9

HPC High Performance Computing. xvii, xviii, 108

ISA instruction set architecture. 9

LAPACK Linear Algebra PACKage. vii, 4, 5, 8, 47, 48, 52, 79, 101, 111

LLS linear least squares. 8

MFLOPS/W MFLOPS per Watt. xvii

MTL malleable thread level. 59, 63, 71

s.p.d. symmetric positive definite. 47

SIMD Single Instruction Multiple Data. 10, 16, 23

SoC system-on-chip. 9, 15, 26, 47

TP task-parallel. 54, 55

TSOR Two-Sided Orthogonal Reductions. 99, 102, 105, 109, 113, 114

WS worker sharing. 58, 101, 112

Bibliography

[1] Green500. https://www.top500.org/green500/.

[2] The international technology roadmap for semiconductors. http://www.itrs.net/reports.html.

[3] Mont-blanc project. http://www.montblanc-project.eu/.

[4] Top500. https://www.top500.org/.

[5] Agullo, E., Demmel, J., Dongarra, J., Hadri, B., Kurzak, J., Langou, J., Ltaief, H., Luszczek, P., and Tomov, S. Numerical linear algebra on emerging architectures: The PLASMA and MAGMA projects. In Journal of Physics: Conference Series (2009), vol. 180, IOP Publishing, p. 012037.

[6] Albers, S. Energy-efficient algorithms. Communications of the ACM 53, 5 (2010), 86–96.

[7] Aliaga, J. I., Catalán, S., Chalios, C., Nikolopoulos, D. S., and Quintana-Ortí, E. S. Performance and fault tolerance of preconditioned iterative solvers on low-power ARM architectures. In PARCO (2015), pp. 711–720.

[8] Alonso, P., Catalán, S., Herrero, J. R., Quintana-Ortí, E. S., and Rodríguez-Sánchez, R. Reduction to tridiagonal form for symmetric eigenproblems on asymmetric multicore processors. In Proceedings of the 8th International Workshop on Programming Models and Applications for Multicores and Manycores (2017), ACM, pp. 39–47.

[9] Alonso, P., Catalán, S., Igual, F. D., Mayo, R., Rodríguez-Sánchez, R., and Quintana-Ortí, E. S. Time and energy modeling of high-performance level-3 BLAS on x86 architectures. Simulation Modelling Practice and Theory 55 (2015), 77–94.

[10] Alonso, P., Catalán, S., Igual, F. D., Mayo, R., Rodríguez-Sánchez, R., and Quintana-Ortí, E. S. Time and energy modeling of high-performance level-3 BLAS on x86 architectures. Simulation Modelling Practice and Theory 55 (2015), 77–94.

[11] Anderson, E., Bai, Z., Bischof, C., Blackford, L., Demmel, J., Dongarra, J., Du Croz, J., Greenbaum, A., Hammarling, S., McKenney, A., and Sorensen, D. LAPACK Users’ Guide, third ed. Society for Industrial and Applied Mathematics, 1999.

[12] ARM. Neon. https://developer.arm.com/technologies/neon.

[13] Asanovic, K., Bodik, R., Catanzaro, B. C., Gebis, J. J., Husbands, P., Keutzer, K., Patterson, D. A., Plishker, W. L., Shalf, J., Williams, S. W., et al. The landscape of parallel computing research: A view from Berkeley. Tech. Rep. UCB/EECS-2006-183, EECS Department, University of California, Berkeley, 2006.

[14] Barrachina, S., Barreda, M., Catalán, S., Dolz, M. F., Fabregat, G., Mayo, R., and Quintana-Ortí, E. An integrated framework for power-performance analysis of parallel scientific workloads. In 3rd International Conference on Smart Grids, Green Communications and IT Energy-aware Technologies (ENERGY) (2013), pp. 114–119.

[15] Barreda, M., Catalán, S., Dolz, M. F., Mayo, R., and Quintana-Ortí, E. S. Tracing the power and energy consumption of the QR factorization on multicore processors. In 12th International Conference on Computational and Mathematical Methods in Science and Engineering (CMMSE) (2012), pp. 134–142.

[16] Barreda, M., Catalán, S., Dolz, M. F., Mayo, R., and Quintana-Ortí, E. S. Automatic detection of power bottlenecks in parallel scientific applications. Computer Science-Research and Development 29, 3-4 (2014), 221–229.

[17] Barroso, L. A. The price of performance. Queue 3, 7 (Sept. 2005), 48–53.

[18] Bergman, K., Borkar, S., Campbell, D., Carlson, W., Dally, W., Denneau, M., Franzon, P., Harrod, W., Hill, K., Hiller, J., et al. Exascale computing study: Technology challenges in achieving exascale systems. Defense Advanced Research Projects Agency Information Processing Techniques Office (DARPA IPTO), Tech. Rep 15 (2008).

[19] Bischof, C. H., Lang, B., and Sun, X. A framework for symmetric band reduction. ACM Trans. Math. Soft. 26, 4 (2000), 581–601.

[20] Castaldo, A. M., Whaley, R. C., and Samuel, S. Scaling LAPACK panel operations using parallel cache assignment. ACM Trans. Math. Soft. 39, 4 (July 2013), 22:1–22:30.

[21] Castaño, M. A., Catalán, S., Mayo, R., and Quintana-Ortí, E. S. Reducing the cost of power monitoring with DC wattmeters. Computer Science-Research and Development 30, 2 (2015), 107–114.

[22] Catalán, S., Domínguez, J. G., Mayo, R., and Quintana-Ortí, E. S. Analyzing the energy efficiency of the memory subsystem in multicore processors. In Parallel and Distributed Processing with Applications (ISPA), 2014 IEEE International Symposium on (2014), IEEE, pp. 10–17.

[23] Catalán, S., Ezzatti, P., Quintana-Ortí, E. S., and Remón, A. The impact of panel factorization on the Gauss-Huard algorithm for the solution of linear systems on modern architectures. In Algorithms and Architectures for Parallel Processing. Springer International Publishing, 2016, pp. 405–416.

[24] Catalán, S., Herrero, J. R., Igual, F. D., Rodríguez-Sánchez, R., Quintana-Ortí, E. S., and Adeniyi-Jones, C. Multi-threaded dense linear algebra libraries for low-power asymmetric multicore processors. Journal of Computational Science (2016).

[25] Catalán, S., Herrero, J. R., Quintana-Ortí, E. S., and Rodríguez-Sánchez, R. Energy balance between voltage-frequency scaling and resilience for linear algebra routines on low-power multicore architectures. Parallel Computing (2017).

[26] Catalán, S., Herrero, J. R., Quintana-Ortí, E. S., Rodríguez-Sánchez, R., and van de Geijn, R. A. A case for malleable thread-level linear algebra libraries: The LU factorization with partial pivoting. CoRR abs/1611.06365 (2016).

[27] Catalán, S., Igual, F., Rodríguez-Sánchez, R., and Quintana-Ortí, E. Time and energy modeling of high performance multi-threaded matrix multiplication. In 15th International Conference on Computational and Mathematical Methods in Science and Engineering (CMMSE) (2015), pp. 311–316.

[28] Catalán, S., Igual, F. D., Mayo, R., Rodríguez-Sánchez, R., and Quintana-Ortí, E. S. Architecture-aware configuration and scheduling of matrix multiplication on asymmetric multicore processors. Cluster Computing 19, 3 (2016), 1037–1051.

[29] Catalán, S., Igual, F. D., Mayo, R., Rodríguez-Sánchez, R., and Quintana-Ortí, E. S. Time and energy modeling of a high-performance multi-threaded Cholesky factorization. The Journal of Supercomputing 73, 1 (2017), 139–151.

[30] Catalán, S., Malossi, A. C. I., Bekas, C., and Quintana-Ortí, E. S. The impact of voltage-frequency scaling for the matrix-vector product on the IBM Power8. In European Conference on Parallel Processing (2016), Springer International Publishing, pp. 103–116.

[31] Catalán, S., Rodríguez-Sánchez, R., Quintana-Ortí, E. S., and Herrero, J. R. Static versus dynamic task scheduling of the LU factorization on ARM big.LITTLE architectures. In Parallel and Distributed Processing Symposium Workshops (IPDPSW), 2017 IEEE International (2017), IEEE, pp. 733–742.

[32] Barcelona Supercomputing Center. MareNostrum 4 begins operation. https://www.bsc.es/news/bsc-news/marenostrum-4-begins-operation.

[33] Chalios, C., Catalán, S., Quintana-Ortí, E., and Nikolopoulos, D. Evaluating asymmetric multicore systems-on-chip and the cost of fault tolerance using iso-metrics. IET Computers & Digital Techniques 10, 2.

[34] Charles, J., Sawyer, W., Dolz, M. F., and Catalán, S. Evaluating the performance and energy efficiency of the COSMO-ART model system. Computer Science-Research and Development 30, 2 (2015), 177–186.

[35] Chronaki, K., Rico, A., Badia, R. M., Ayguadé, E., Labarta, J., and Valero, M. Criticality-aware dynamic task scheduling for heterogeneous architectures. In Proc. 29th ACM Int. Conf. on Supercomputing (2015), ICS'15, pp. 329–338.

[36] Costero, L., Igual, F. D., Olcoz, K., Catalán, S., Rodríguez-Sánchez, R., and Quintana-Ortí, E. S. Refactoring conventional task schedulers to exploit asymmetric ARM big.LITTLE architectures in dense linear algebra. In IEEE Int. Parallel & Distributed Processing Symp. Workshops (2016), pp. 692–701.

[37] Costero, L., Igual, F. D., Olcoz, K., Catalán, S., Rodríguez-Sánchez, R., and Quintana-Ortí, E. S. Revisiting conventional task schedulers to exploit asymmetry in multi-core architectures for dense linear algebra operations. Parallel Computing (2017).

[38] Demmel, J. Applied Numerical Linear Algebra. Society for Industrial and Applied Mathematics, 1997.

[39] Demmel, J., Dongarra, J. J., Croz, J. D., Greenbaum, A., Hammarling, S., and Sorensen, D. Prospectus for the development of a linear algebra library for high-performance computers. Mathematics and Computer Science Division Report ANL/MCS-TM-97, Argonne National Laboratory, Argonne, IL (1987).

[40] Dennard, R., Gaensslen, F., Rideout, V., Bassous, E., and LeBlanc, A. Design of ion-implanted MOSFET’s with very small physical dimensions. Solid-State Circuits, IEEE Journal of 9, 5 (1974), 256–268.

[41] Diouri, M. E. M., Dolz, M. F., Glück, O., Lefèvre, L., Alonso, P., Catalán, S., Mayo, R., and Quintana-Ortí, E. S. Solving some mysteries in power monitoring of servers: Take care of your wattmeters! In European Conference on Energy Efficiency in Large Scale Distributed Systems (2013), Springer, Berlin, Heidelberg, pp. 3–18.

[42] Diouri, M. E. M., Dolz, M. F., Glück, O., Lefèvre, L., Alonso, P., Catalán, S., Mayo, R., and Quintana-Ortí, E. S. Assessing power monitoring approaches for energy and power analysis of computers. Sustainable Computing: Informatics and Systems 4, 2 (2014), 68–82.

[43] Dolz, M. F., Kunkel, J., Chasapis, K., and Catalán, S. An analytical methodology to derive power models based on hardware and software metrics. Computer Science-Research and Development 31, 4 (2016), 165–174.

[44] Dongarra, J., Beckman, P., Moore, T., Aerts, P., Aloisio, G., Andre, J.-C., Barkai, D., Berthou, J.-Y., Boku, T., Braunschweig, B., et al. The international exascale software project roadmap. The International Journal of High Performance Computing Applications 25, 1 (2011), 3–60.

[45] Dongarra, J., Moler, C., Bunch, J., and Stewart, G. LINPACK Users’ Guide. Society for Industrial and Applied Mathematics, 1979.

[46] Dongarra, J. J., Cruz, J. D., Hammarling, S., and Duff, I. S. Algorithm 679: A set of level 3 basic linear algebra subprograms: Model implementation and test programs. ACM Trans. Math. Softw. 16, 1 (March 1990), 18–28.

[47] Dongarra, J. J., Du Croz, J., Hammarling, S., and Duff, I. A set of level 3 basic linear algebra subprograms. ACM Trans. Math. Soft. 16, 1 (March 1990), 1–17.

[48] Duran, A., Ayguadé, E., Badia, R. M., Labarta, J., Martinell, L., Martorell, X., and Planas, J. OmpSs: A proposal for programming heterogeneous multi-core architectures. Parallel Processing Letters 21, 02 (2011), 173–193.

[49] Duranton, M., Yehia, S., De Sutter, B., De Bosschere, K., Cohen, A., Falsafi, B., Gaydadjiev, G., Katevenis, M., Maebe, J., Munk, H., et al. The HiPEAC vision. Report, European Network of Excellence on High Performance and Embedded Architecture and Compilation 12 (2010).

[50] Earl C. Joseph, Steve Conway, C. I. G. C. C. M. N. M. A strategic agenda for European leadership in supercomputing: HPC2020 - IDC final report of the HPC study for the DG Information Society of the European Commission. http://www.hpcuserforum.com/EU/downloads/SR03S10.15.2010.pdf.

[51] Feng, W.-c., Feng, X., and Ge, R. Green supercomputing comes of age. IT professional 10, 1 (2008).

[52] FLAME project home page. http://www.cs.utexas.edu/users/flame/.

[53] Fujitsu. Fujitsu HPC and the development of the Post-K supercomputer.

[54] Garbow, B. S. EISPACK – A package of matrix eigensystem routines. Computer Physics Communications 7, 4 (1974), 179–184.

[55] Golub, G., and Kahan, W. Calculating the singular values and pseudo-inverse of a matrix. Journal of the Society for Industrial and Applied Mathematics Series B Numerical Analysis 2, 2 (1965), 205–224.

[56] Golub, G. H., and Loan, C. F. V. Matrix Computations, 3rd ed. The Johns Hopkins University Press, Baltimore, 1996.

[57] Goto, K., and Geijn, R. A. v. d. Anatomy of high-performance matrix multiplication. ACM Trans. Math. Softw. 34, 3 (May 2008), 12:1–12:25.

[58] Goto, K., and van de Geijn, R. High performance implementation of the level-3 BLAS. ACM Trans. Math. Soft. 35, 1 (July 2008), 4:1–4:14.

[59] Goto, K., and van de Geijn, R. A. Anatomy of a high-performance matrix multiplication. ACM Trans. Math. Soft. 34, 3 (May 2008), 12:1–12:25.

[60] Gunnels, J. A., Gustavson, F. G., Henry, G. M., and van de Geijn, R. A. FLAME: Formal linear algebra methods environment. ACM Trans. Math. Soft. 27, 4 (2001), 422–455.

[61] GWT-TUD GmbH. Vampir performance optimization. http://www.vampir.eu/.

[62] Hardkernel. Odroid-XU3. http://www.hardkernel.com/main/products/prdt_info.php?g_code=g140448267127.

[63] Hennessy, J. L., and Patterson, D. A. Computer Architecture: A Quantitative Approach, 5th ed. Morgan Kaufmann Pub., San Francisco, 2012.

[64] IBM. Engineering and Scientific Subroutine Library. http://www.ibm.com/systems/software/essl/, 2012.

[65] Intel Corp. Intel 64 and IA-32 architectures software developer manual. Volume 3B: System programming guide, part 2, 2015.

[66] Intel Corp. Intel Math Kernel Library (MKL) 11.0. http://software.intel.com/en-us/intel-mkl.

[67] Kabir, K., Haidar, A., Tomov, S., and Dongarra, J. On the design, development, and analysis of optimized matrix-vector multiplication routines for coprocessors. In 30th ISC High Performance (2015), Springer, pp. 58–73.

[68] Kågström, B., Ling, P., and van Loan, C. GEMM-based level 3 BLAS: High-performance model implementations and performance evaluation benchmark. ACM Trans. Math. Softw. 24, 3 (1998), 268–302.

[69] Koufaty, D., Reddy, D., and Hahn, S. Bias scheduling in heterogeneous multi-core architectures. In Proceedings of the 5th European Conference on Computer Systems (New York, NY, USA, 2010), EuroSys ’10, ACM, pp. 125–138.

[70] Kunkel, J. HDTrace - A tracing and simulation environment of application and system interaction. Tech. Rep. 2, Department of Informatics, Scientific Computing, Universität Hamburg, 2011.

[71] Laub, A. Numerical linear algebra aspects of control design computations. IEEE Transactions on Automatic Control 30, 2 (1985), 97–108.

[72] Lawson, C. L., Hanson, R. J., Kincaid, D. R., and Krogh, F. T. Basic linear algebra subprograms for Fortran usage. ACM Trans. Math. Soft. 5, 3 (Sept. 1979), 308–323.

[73] Low, T. M., Igual, F. D., Smith, T. M., and Quintana-Ortí, E. S. Analytical modeling is enough for high performance BLIS. Tech. Rep. FLAWN #74, Department of Computer Sciences, The University of Texas at Austin, 2014. Available at http://www.cs.utexas.edu/users/flame/. Submitted to ACM Trans. Math. Softw.

[74] Ludwig, T. Editorial for the first international conference on energy-aware high performance computing. Computer Science-Research and Development 25, 3 (2010), 123–124.

[75] Martin, R. S., Reinsch, C., and Wilkinson, J. H. Householder’s tridiagonalization of a symmetric matrix. Numer. Math. 11, 3 (March 1968), 181–195.

[76] Martin, R. S., and Wilkinson, J. H. Similarity reduction of a general matrix to Hessenberg form. Numer. Math. 12, 5 (Dec. 1968), 349–368.

[77] Mittal, S. A survey of techniques for architecting and managing asymmetric multicore processors. ACM Comput. Surv. 48, 3 (Feb. 2016), 45:1–45:38.

[78] Moore, B. Principal component analysis in linear systems: Controllability, observability, and model reduction. IEEE Transactions on Automatic Control 26, 1 (1981), 17–32.

[79] Netlib.org. http://www.netlib.org/lapack.

[80] NVIDIA. NVIDIA reference manual, 2015. http://docs.nvidia.com/deploy/nvml-api/index.html.

[81] NVIDIA. CUDA basic linear algebra subprograms. https://developer.nvidia.com/cuBLAS, 2014.

[82] Paraver: the flexible analysis tool. http://www.cepba.upc.es/paraver.

[83] Peise, E., and Bientinesi, P. Performance modeling for dense linear algebra. In 2012 SC Companion: High Performance Computing, Networking Storage and Analysis (Nov 2012), pp. 406–416.

[84] Peise, E., and Bientinesi, P. A Study on the Influence of Caching: Sequences of Dense Linear Algebra Kernels. Springer International Publishing, Cham, 2015, pp. 245–258.

[85] Peise, E., Fabregat-Traver, D., and Bientinesi, P. On the Performance Prediction of BLAS-based Tensor Contractions. Springer International Publishing, Cham, 2015, pp. 193–212.

[86] Quintana-Ort´ı, G., and van de Geijn, R. Improving the performance of reduction to Hessenberg form. ACM Trans. Math. Softw. 32, 2 (June 2006), 180–194.

[87] Servat, H., and Llort, G. Extrae user guide manual for version 3.2.1, 2015. http://www.bsc.es/computer-sciences/extrae.

[88] Shende, S. S., and Malony, A. D. The TAU parallel performance system. The International Journal of High Performance Computing Applications 20, 2 (2006), 287–311.

[89] Smith, T. M., van de Geijn, R., Smelyanskiy, M., Hammond, J. R., and Van Zee, F. G. Anatomy of high-performance many-threaded matrix multiplication. In Proc. IEEE 28th Int. Parallel and Distributed Processing Symp. (2014), IPDPS’14, pp. 1049–1059.

[90] Srinivasan, S., Zhao, L., Illikkal, R., and Iyer, R. Efficient interaction between OS and architecture in heterogeneous platforms. SIGOPS Oper. Syst. Rev. 45, 1 (Feb. 2011), 62–72.

[91] StarPU project. http://runtime.bordeaux.inria.fr/StarPU.

[92] Strazdins, P. A comparison of lookahead and algorithmic blocking techniques for parallel matrix factorization. Tech. Rep. TR-CS-98-07, Department of Computer Science, The Australian National University, Canberra 0200 ACT, Australia, 1998.

[93] Van Zee, F. G. libflame: The Complete Reference. www.lulu.com, 2009.

[94] Van Zee, F. G., Smith, T. M., Marker, B., Low, T. M., van de Geijn, R. A., Igual, F. D., Smelyanskiy, M., Zhang, X., Kistler, M., Austel, V., Gunnels, J., and Killough, L. The BLIS framework: Experiments in portability. ACM Trans. Math. Soft. (2014). In print. Available at http://www.cs.utexas.edu/users/flame.

[95] Van Zee, F. G., and van de Geijn, R. A. BLIS: A framework for rapidly instantiating BLAS functionality. ACM Trans. Math. Softw. 41, 3 (2015), 14:1–14:33.

[96] Whaley, R. C., and Dongarra, J. J. Automatically tuned linear algebra software. In International Conference on Supercomputing (ICS) (1998).

[97] Xianyi, Z., Qian, W., and Yunquan, Z. Model-driven level 3 BLAS performance optimization on Loongson 3A processor. In 2012 IEEE 18th International Conference on Parallel and Distributed Systems (Dec 2012), pp. 684–691.
