Sparse Linear System Solvers on Gpus: Parallel Preconditioning, Workload Balancing, and Communication Reduction

UNIVERSIDAD JAUME I DE CASTELLÓN E. S. DE TECNOLOGÍA Y CIENCIAS EXPERIMENTALES SPARSE LINEAR SYSTEM SOLVERS ON GPUS: PARALLEL PRECONDITIONING, WORKLOAD BALANCING, AND COMMUNICATION REDUCTION CASTELLÓN DE LA PLANA,MARCH 2019 TESIS DOCTORAL PRESENTADA POR:GORAN FLEGAR DIRIGIDA POR:ENRIQUE S. QUINTANA-ORTÍ HARTWIG ANZT UNIVERSIDAD JAUME I DE CASTELLÓN E. S. DE TECNOLOGÍA Y CIENCIAS EXPERIMENTALES SPARSE LINEAR SYSTEM SOLVERS ON GPUS: PARALLEL PRECONDITIONING, WORKLOAD BALANCING, AND COMMUNICATION REDUCTION GORAN FLEGAR Abstract With the breakdown of Dennard scaling during the mid-2000s, and the end of Moore’s law on the horizon, hardware vendors, datacenters, and the high performance computing community are turning their attention towards unconventional hardware in hope of continuing the exponential performance growth of computational capacity. Among the available hardware options, a new generation of graphics processing units (GPUs), designed to support a wide variety of workloads in addition to graphics processing, is achieving the widest adoption. These processors are employed by the majority of today’s most powerful supercomputers to solve the world’s most complex problems in physics simulations, weather forecasting, data analytics, social network analysis, and ma- chine learning, among others. The potential of GPUs for these problems can only be unleashed by developing appropriate software, specifically tuned for the GPU architectures. Fortunately, many algorithms that appear in these applications are constructed out of the same basic building blocks. One example of a heavily-used building block is the solution of large, sparse linear systems, a challenge that is addressed in this thesis. After a quick overview of the current state-of-the-art methods for the solution of linear systems, this dissertation pays detailed attention to the class of Krylov iterative methods. Instead of deriving new methods, improvements are introduced to components that are already widely used in existing methods, and therein account for a significant fraction of the overall runtime cost. The components are designed for a single GPU, while scaling to multiple GPUs can be achieved by either generalizing the same ideas, or by decomposing the larger problem into multiple independent parts which can leverage the implementations described in this thesis. The most time-consuming part of a Krylov method is often the matrix-vector product. Two improvements are suggested in this dissertation: one for the widely-used compressed sparse row (CSR) matrix format, and an alternative one for the coordinate (COO) format, which has not yet achieved such ample adoption in numerical linear algebra. The new GPU implementation for the CSR format is specifically tuned for matrices with irregular sparsity patterns and, while experiencing slowdowns of up to 3x compared with the vendor library implementation for regular patterns, it achieves up to 100x speedup for irregular ones. However, the slowdown can be eliminated by using a simple heuristic that selects the superior implementation based on the sparsity pattern of the matrix. The new COO algorithm is suggested as the default matrix-vector product implementation for cases when a specific matrix sparsity pattern is not known in advance. This algorithm achieves 80% higher minimal and 22% higher average performance than the newly developed CSR algorithm on a variety of large matrices arising from real-world applications, making it an ideal default choice for general-purpose libraries. The second component addressed in this dissertation is preconditioning. It explores the relatively simple class of block-Jacobi preconditioners, and shows that these can significantly increase the robustness and de- crease the total runtime of Krylov solvers for a certain class of matrices. Several algorithmic realizations of the preconditioner are evaluated, and the one based on Gauss-Jordan elimination is identified as performance winner in most problem settings. The variant based on the LU factorization can be attractive for problems that converge in few iterations. In this dissertation, block-Jacobi preconditioning is analyzed further via an initial study of the effects that single and half precision floating-point storage have on this type of preconditioners. The resulting adaptive precision block-Jacobi preconditioner dynamically assigns storage precisions to individual blocks at runtime, taking into account the numerical properties of the blocks. A sequential implementation in a high-level lan- v guage, backed by a theoretical error analysis, shows that this preconditioner reduces the total memory transfer volume, while maintaining the preconditioner quality of a full precision block-Jacobi. A theoretical energy model predicts that the adaptive variant can offer energy savings of around 25% in comparison to the full precision block-Jacobi. Acknowledging that new algorithms or optimized implementations are only useful for the scientific computing community if they are available as production-ready open source code, the final part of this dissertation presents a possible design of a sparse linear algebra library, which effectively solves the problem of excessive manifoldness of components for the iterative solution of linear systems. These ideas represent the backbone of the open source Ginkgo library, which also includes successful implementations of matrix-vector product algorithms and preconditioners described in this thesis. Resumen Con el final de la ley de escalado de Dennard a mitad de la pasada década, y el fin de la ley de Moore en el horizonte, los vendedores de sistemas hardware, los grandes centros de datos y la comunidad que trabaja en computación de altas prestaciones están fijando su atención en nuevas tecnologías no convencionales, con la esperanza de mantener el crecimiento exponencial de la capacidad computacional. Entre las diferentes opciones hardware disponibles, la nueva generación de procesadores gráficos (o GPUs, del término en inglés Graphics Processing Units), diseñados para ejecutar de manera eficiente una gran variedad de aplicaciones además del procesamiento gráfico, está consiguiendo una amplia aceptación. Hoy en día, estos procesadores se emplean en la mayor parte de los supercomputadores más potentes, para resolver problemas enormemente complejos relacionados con simulaciones de fenómenos físicos, predicción climática, análisis de datos, análisis de redes sociales y aprendizaje máquina, entre otros. El potencial de las GPUs para tratar estos problemas solo puede aprovecharse mediante el desarrollo de programas eficientes, específicamente optimizados para este tipo de arquitecturas. Por fortuna, muchos de los algoritmos que aparecen en estas aplicaciones se construyen a partir de un conjunto reducido de bloques básicos. Un ejemplo de bloque básico, comúnmente usado, es la solución de sistemas lineales dispersos de gran dimensión, un reto que se afronta en esta tesis. Tras una breve revisión del estado del arte en métodos para la resolución de sistemas lineales, esta tesis doctoral presta especial atención a la familia de métodos iterativos de Krylov. Sin embargo, en lugar de intentar derivar nuevos métodos, en este trabajo se introducen mejoras en los componentes que se usan ampliamente en los métodos ya existentes, y que suponen una parte importante de su coste de ejecución total. Los componentes están diseñados para una única GPU, pero escalarlos a un sistema con múltiples aceleradores gráficos puede conseguirse generalizando las mismas ideas, o descomponiendo el problema en múltiples partes independientes que puedan aprovechar las implementaciones descritas en esta tesis. A menudo, la parte computacionalmente más costosa de los métodos de Krylov es el producto matriz-vector. En esta tesis se sugieren dos mejoras para esta operación: una para el formato matrix-vector CSR (compressed sparse row), y otra para el formato alternativo COO (coordinado), que no ha logrado una aceptación tan amplia como el CSR en álgebra lineal numérica. La nueva implementación del formato CSR para GPUs está diseñada para ser especialmente eficiente con matrices con un patrón de dispersidad iregular y, si bien sufre una reducción de rendimiento en un factor 3x comparada con la implementación de las bibliotecas estándar para patrones regulares, también es cierto que ofrece una aceleración de 100x para los patrones irregulares. Además, la merma en las prestaciones puede eliminarse mediante una heurística simple que selecciona la mejor implementación en función del patrón de dispersidad de la matriz. Este algoritmo consigue, como mínimo un 80% y como media un 22% mejor rendimiento medio que el nuevo algoritmo basado en CSR en una evaluación con una variedad de matrices de gran tamaño, que surgen en aplicaciones reales, ofreciendo una muy buena opción por defecto para bibliotecas de propósito general. El segundo componente que se aborda en esta tesis doctoral es el precondicionado. Nuestro trabajo explora la clase relativamente simple de precondicionadores de Jacobi por bloques, y muestra que estos pueden mejorar la robustez y reducir el tiempo de ejecución de los métodos de Krylov para un determinado tipo de matrices. En este trabajo se evalúan algunas realizaciones del precondicionador, y se identifica una, basada en la eliminación de Gauss-Jordan, como aquella que ofrece mejores prestaciones en la mayor parte de escenarios. La variante vii basada en la factorización LU, en cambio, puede ser una buena opción para problemas donde

Sparse Linear System Solvers on Gpus: Parallel Preconditioning, Workload Balancing, and Communication Reduction

A GMRES Solver with ILU(K) Preconditioner for Large-Scale Sparse Linear Systems on Multiple Gpus

Product Irregularity Strength of Graphs with Small Clique Cover Number

Phd Thesis Parallel and Scalable Sparse Basic Linear Algebra

Multi-Color Low-Rank Preconditioner for General Sparse Linear Systems ∗

Fast Large-Integer Matrix Multiplication

A Study of Linear Error Correcting Codes

Optimizing the Sparse Matrix-Vector Multiplication Kernel for Modern Multicore Computer Architectures

Why Jacket Matrices?

Optimizing and Auto-Tuning Scale-Free Sparse Matrix-Vector Multiplication on Intel Xeon Phi

Dynamic Scheduling for Efficient Hierarchical Sparse Matrix

Investigation on Digital Fountain Codes Over Erasure Channels and Additive White

261 Triangularizing Matrices by Congruence This Paper Is