Exploiting Data Sparsity in Covariance Matrix Computations on Heterogeneous Systems
Dissertation by
Ali M. Charara

In Partial Fulfillment of the Requirements
For the Degree of
Doctor of Philosophy

King Abdullah University of Science and Technology
Thuwal, Kingdom of Saudi Arabia

May, 2018

EXAMINATION COMMITTEE PAGE

The dissertation of Ali M. Charara is approved by the examination committee.

Committee Chairperson: Prof. David E. Keyes
Committee Members: Prof. Marc Genton, Prof. Markus Hadwiger, Prof. Anne C. Elster, and Dr. Hatem Ltaief

© May, 2018
Ali M. Charara
All Rights Reserved

ABSTRACT

Exploiting Data Sparsity in Covariance Matrix Computations on Heterogeneous Systems

Ali M. Charara

Covariance matrices are ubiquitous in computational sciences, typically describing the correlation of elements of large multivariate spatial data sets. For example, covariance matrices are employed in climate/weather modeling for maximum likelihood estimation to improve prediction, as well as in computational ground-based astronomy to enhance the observed image quality by filtering out noise introduced by the adaptive optics instruments and atmospheric turbulence. These covariance matrices are dense, symmetric, positive-definite, and often data-sparse, i.e., hierarchically of low rank. This thesis investigates the performance limits of dense matrix computations (e.g., Cholesky factorization) on covariance matrix problems as the number of unknowns grows, in the context of the aforementioned applications. We employ recursive formulations of some of the Basic Linear Algebra Subroutines (BLAS) to further accelerate the covariance matrix computations while reducing data traffic across the memory subsystem layers. However, dealing with large data sets (i.e., covariance matrices with billions of entries) can rapidly become prohibitive in memory footprint and algorithmic complexity. Most importantly, this thesis investigates the tile low-rank (TLR) data format, a new compressed data structure and layout, which exploits data sparsity by approximating the operator. The TLR compressed data structure approximates the original problem up to a user-defined numerical accuracy. This comes at the expense of dealing with tasks of much lower arithmetic intensity than traditional dense computations. In fact, this thesis consolidates the two trends of dense and data-sparse linear algebra for HPC. Not only does the thesis leverage recursive formulations for dense Cholesky-based matrix algorithms, but it also implements a novel TLR-Cholesky factorization using batched linear algebra operations to increase hardware occupancy and reduce API overhead. The reported performance of the dense and TLR Cholesky factorizations shows many-fold speedups against state-of-the-art implementations on various GPU-equipped systems. Additionally, the TLR implementation gives the user the flexibility to select the desired accuracy. This trade-off between performance and accuracy is currently a well-established, leading trend in the convergence of the third and fourth paradigms, i.e., HPC and Big Data, as the community moves forward with the exascale software roadmap.
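To make the TLR format concrete before the technical chapters, the following minimal sketch (in C) shows one way a compressed tile could be represented; the names tlr_tile_t and tlr_tile_alloc are illustrative assumptions only, not the actual HiCMA/KBLAS interfaces developed in this thesis.

    /* Illustrative sketch only: an nb-by-nb off-diagonal tile A_ij is
     * approximated as U * V^T with numerical rank k chosen at a
     * user-defined tolerance, shrinking storage from nb*nb to 2*nb*k
     * entries when k << nb. */
    #include <stdlib.h>

    typedef struct {
        int     nb;  /* tile dimension                        */
        int     k;   /* numerical rank at the chosen accuracy */
        double *U;   /* nb-by-k left factor, column-major     */
        double *V;   /* nb-by-k right factor, column-major    */
    } tlr_tile_t;

    /* Allocate one compressed tile; the caller fills U and V, e.g.,
     * from a truncated SVD of the dense tile. */
    tlr_tile_t tlr_tile_alloc(int nb, int k)
    {
        tlr_tile_t t = { nb, k, NULL, NULL };
        t.U = malloc((size_t)nb * (size_t)k * sizeof(double));
        t.V = malloc((size_t)nb * (size_t)k * sizeof(double));
        return t;
    }

For nb = 1000 and k = 20, for instance, a tile shrinks from 8 MB of dense storage to 0.32 MB; the price, as noted above, is that kernels operating on such compressed tiles have much lower arithmetic intensity.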
ACKNOWLEDGEMENTS

I would like to express my sincere gratitude to my advisor, Prof. David Keyes, for his continuous support of my Ph.D. study, and for his encouragement, immense knowledge, guidance, and inspiration. Equal gratitude is due to my main mentor, Dr. Hatem Ltaief, for his dedication, patience, and endurance. Most of the ideas in this thesis were initiated, supported, and improved by him. I cannot say enough to thank him for the enormous effort he put into bringing this thesis to its success. I would like to thank my committee members for their insightful comments and encouragement. I am also grateful to my previous advisor, Prof. Alyn Rockwood, who sponsored and encouraged me throughout my Master's study and the first project of my Ph.D. study at KAUST. I thank my colleagues and labmates at both the ECRC and VCC labs for insightful discussions and encouragement. Special thanks go to the members of the HiCMA project for the collaboration and teamwork we shared, as well as to the members of the KSL lab for their great support and readiness to help. I would also like to thank Philippe Vandermersch for hosting me as an intern on his team at NVIDIA.

Highest appreciation goes to my parents and brothers, who believed in me and instilled in me the continuous desire to improve my knowledge and achieve more. Utmost gratitude and recognition go to my wife, who, along with my lovely sons (Hassan, Mohsen, Hussein, and Mohamad), has been the most patient, encouraging, and supportive. They endured long vacations, weekends, and sleepless nights while I was away, absorbed in my work. They have been my source of joy, and it is for them that I present this work, hoping it will inspire them to pursue excellence in their studies.

TABLE OF CONTENTS

Examination Committee Page
Copyright
Abstract
Acknowledgements
List of Abbreviations
List of BLAS/LAPACK Routines
List of Figures
List of Tables
List of Algorithms

1 Introduction
  1.1 Motivation
  1.2 Objectives and Contributions
  1.3 Thesis Outline

2 Covariance Matrices in Scientific Applications
  2.1 Covariance Matrices
  2.2 Multi-Object Adaptive Optics Application
  2.3 Spatial Statistics Application

3 Hardware and Software Trends
  3.1 Emerging Hardware Architectures
    3.1.1 GPU
    3.1.2 Xeon Phi
  3.2 Dense Linear Algebra Algorithms
    3.2.1 Block Algorithm
    3.2.2 Tile Algorithm
  3.3 Runtime Systems
    3.3.1 The StarPU Runtime System
    3.3.2 The OmpSs Runtime System
    3.3.3 Customized Static Runtime System

4 Leveraging Vertical Performance using Recursive Formulations
  4.1 Performance of Matrix Inversion on GPUs
  4.2 Triangular BLAS 3 Acceleration on NVIDIA GPUs
    4.2.1 Introduction
    4.2.2 Related Work
    4.2.3 The Triangular Matrix Operations: TRMM and TRSM
    4.2.4 Recursive Definition and Implementation Details
    4.2.5 Experimental Results
    4.2.6 Performance Impact on High-Level Dense Linear Algebra Algorithms
    4.2.7 Conclusions
  4.3 A Framework for Dense Triangular Matrix Kernels on Various Manycore Architectures
    4.3.1 Prologue
    4.3.2 Related Work
    4.3.3 Background on General Recursive TRMM and TRSM Kernels
    4.3.4 Motivation for Further Enhancements
    4.3.5 Implementation Details for KBLAS CUDA TRSM and TRMM for Few Right-Hand Sides
    4.3.6 KBLAS Multi-GPU RecTRSM and RecTRMM
    4.3.7 On Intel x86 CPU Architectures
    4.3.8 Experimental Results
    4.3.9 Conclusion
  4.4 GPU-Resident Cholesky Factorization
    4.4.1 Prologue and Related Work
    4.4.2 GPU-Resident Implementation
    4.4.3 Performance Results

5 Leveraging Horizontal Performance using Fine-Grained Computations
  5.1 Background on Adaptive Optics Simulations
  5.2 Pipelining Computational Stages of the TR for MOAO on a Multi-GPU System
    5.2.1 Contributions
    5.2.2 A New Mathematical Model
    5.2.3 Implementation Details
    5.2.4 Experimental Results
  5.3 Adaptive Optics Simulation for the World's Largest Telescope on Multicore Architectures with Multiple GPUs
    5.3.1 Contributions
    5.3.2 Prologue
    5.3.3 High Performance MOAO Framework Implementations
    5.3.4 Performance Results and Analysis
  5.4 Real-Time Massively Distributed Multi-Object Adaptive Optics Simulations for the European Extremely Large Telescope
    5.4.1 Introduction
    5.4.2 Implementation Details
    5.4.3 System and Environment Settings
    5.4.4 Science Performance Analysis
    5.4.5 HPC Performance Results
    5.4.6 Impacts for MOAO Moving Forward
    5.4.7 Conclusion

6 Towards Off-Diagonal Low-Rank Matrix Computations
  6.1 Accelerating Low-Rank Computations
  6.2 Contributions Plan
  6.3 Batched Operations for Very Small Triangular Matrices Over GPUs
    6.3.1 Prologue
    6.3.2 Related Work
    6.3.3 The KBLAS Library
    6.3.4 Limitations of Triangular DLA Operations
    6.3.5 Fundamental Algorithmic Techniques
    6.3.6 High Performance Implementation Details
    6.3.7 Performance Results and Analysis
    6.3.8 Discussions on Non-uniform Matrix Sizes
  6.4 Batched Tile Low-Rank BLAS Operations
    6.4.1 Introduction
    6.4.2 Related Work
    6.4.3 Background
    6.4.4 Design of Tile Low-Rank GEMM Kernels
    6.4.5 Implementation Details
    6.4.6 Experimental Results
  6.5 Tile Low-Rank Cholesky Factorization
    6.5.1 TLR-Cholesky Implementation Variants
    6.5.2 Performance Results and Analysis

7 Concluding Remarks

References

Appendices

LIST OF ABBREVIATIONS

AO      Adaptive Optics
BLAS    Basic Linear Algebra Subroutines
DAG     directed acyclic graph
DLA     Dense Linear Algebra
DM      deformable mirrors
DP      double precision
E-ELT   European Extremely Large Telescope
FoV     Field of View
HiCMA   Hierarchical Computations on Manycore Architectures
IP      in-place
LAPACK  Linear Algebra Package
MAGMA   Matrix Algebra on GPU and Multicore Architectures
MOAO    Multi-Object Adaptive Optics
MORSE   Matrices Over Runtime System at Exascale
MOS     Multi-object spectroscopy
OOP     out-of-place
PLASMA  Parallel Linear Algebra for Scalable Multi-core Architectures
PSF     point spread function
RAW     Read After Write
SM      Streaming Multiprocessor
SP      single precision
SPD     symmetric positive-definite
SVD     Singular Value Decomposition
TB      Thread-Blocks
TLR     Tile Low-Rank
TR      tomographic reconstructor
TS      truth sensor
WAR     Write After Read
WFS     wavefront sensors

LIST OF BLAS/LAPACK ROUTINES

GEMM    GEneral Matrix Multiply (a TLR variant is sketched after this list).
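As a concrete instance of how GEMM is reformulated in the TLR setting (cf. Section 6.4), here is a minimal sketch assuming CBLAS and hypothetical names; it multiplies two compressed tiles A = Ua*Va^T and B = Ub*Vb^T into a dense accumulator, which is where the O(nb^3) dense cost drops to O(nb^2 * k):

    /* Illustrative sketch, not the thesis's kernels: exploit
     * A*B = Ua * ((Va^T * Ub) * Vb^T) so every product involves a
     * rank-sized dimension. Ua,Va: nb x ka; Ub,Vb: nb x kb;
     * W: ka x kb scratch; T: ka x nb scratch; C: nb x nb dense.
     * All matrices are column-major. */
    #include <cblas.h>

    void tlr_gemm_sketch(int nb, int ka, int kb,
                         const double *Ua, const double *Va,
                         const double *Ub, const double *Vb,
                         double *W, double *T, double *C)
    {
        /* W = Va^T * Ub : small (ka x kb) result */
        cblas_dgemm(CblasColMajor, CblasTrans, CblasNoTrans,
                    ka, kb, nb, 1.0, Va, nb, Ub, nb, 0.0, W, ka);
        /* T = W * Vb^T : ka x nb */
        cblas_dgemm(CblasColMajor, CblasNoTrans, CblasTrans,
                    ka, nb, kb, 1.0, W, ka, Vb, nb, 0.0, T, ka);
        /* C += Ua * T : nb x nb, the only O(nb^2 * ka) step */
        cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                    nb, nb, ka, 1.0, Ua, nb, T, ka, 1.0, C, nb);
    }

In the actual TLR kernels of Chapter 6, the output tile itself stays in compressed form and is recompressed after accumulation; the dense-accumulator variant above only isolates the rank-driven cost reduction.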