Efficient Parallel Algorithm for Dense Matrix LU Decomposition with Pivoting on Hypercubes

Computers Math. Applic. Vol. 33, No. 8, pp. 39-50, 1997. Pergamon. Copyright © 1997 Elsevier Science Ltd. Printed in Great Britain. All rights reserved. 0898-1221/97 $17.00 + 0.00. PII: S0898-1221(97)000$2-7

ZHIYONG LIU, Institute of Computing Technology, Academia Sinica, Beijing 100080, P.R. China
D. W. CHEUNG, Department of Computer Science, The University of Hong Kong, Pokfulam, Hong Kong

(Received November 1996; accepted December 1996)

Abstract: LU decomposition is used intensively in scientific and engineering computations. A parallel algorithm for dense matrix LU decomposition with pivoting on hypercubes is presented. Using $n$ processors, the algorithm completes the LU decomposition of an $n \times n$ matrix in $n^2/3 + O(n^{3/2} \log_2 n)$ steps, counting both computation and communication, and its efficiency approaches 1 asymptotically as $n$ becomes large. The algorithm employs row, column, and block parallelism interchangeably, so that the $n$ processors are used efficiently throughout the computation. Exploiting the rich connectivity of the hypercube, all data alignment requirements can be realized in $O(\log_2 n)$ steps. The proposed algorithm is suitable both for systems with small numbers of processors and for systems with large numbers of processors.

Keywords: Linear systems of equations, LU decomposition, Partial pivoting, Parallel processing, Efficiency of parallel algorithms, Hypercubes.

1. INTRODUCTION

Linear systems of equations are used intensively in scientific and engineering computations. As parallel processing systems have advanced, various parallel processing strategies for LU decomposition have been developed. Partitioned algorithms, which are efficient for VLSI implementation, are given in [1,2]. Optimal scheduling strategies for LU decomposition on MIMD systems are given in [3-5].
The effect of the order of parallelization of the indices "i," "j," "k," in connection with different storage schemes (row or column interleaved storage of the matrix), is analyzed in [6,7] for both vector and parallel computer systems. The effects that data storage schemes and pivoting (row or column) strategies have on the efficiency of LU decomposition on distributed memory systems are analyzed in [8]. Gallivan et al. review parallel algorithms for various linear algebra computations on both shared and distributed memory systems, and give an overall perspective, in [9]. Techniques for reducing the communication overhead of LU decomposition on message passing multicomputers are given in [10].

This research is supported partly by a grant from The University of Hong Kong and The China National "863" Project of High Technology.

Sequential LU decomposition of a dense $n \times n$ matrix needs $n^3/3 + O(n^2)$ time steps, where a time step is the time needed to execute one addition and one multiplication [3,11]. The algorithm for multiprocessors given in [3], using $n/2$ processors, completes in $n^2 - 1$ steps with an efficiency of 2/3. When $n$ processors are used, the algorithms given in [8], known as RSRP and CSRP, complete in $n^2/2$ steps, which gives the same efficiency as the algorithm in [3].

The above algorithms for distributed memory systems use as many processors as possible to process each row (or column) in the updating process. However, as the decomposition proceeds, there are no longer enough matrix elements to be updated in a row (or column) to keep every processor busy, so some computational resources are lost. Note that the BLAS-3 LU decomposition algorithms described in [9] for shared memory systems have the potential to reach the optimal linear speed-up on a system with $n$ processors; whether they do depends on how the $n$ processors are scheduled.
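The 2/3 efficiency quoted above for the row- and column-oriented schemes can be reproduced with a small counting exercise. The sketch below is not taken from the paper: it assumes a simplified cost model in which iteration k updates an m x m trailing block (m = n - 1 - k), the n processors update one row per parallel step, and communication is ignored. The function name `row_parallel_efficiency` is hypothetical.

```python
from fractions import Fraction


def row_parallel_efficiency(n):
    """Efficiency of a row-at-a-time update on n processors (n >= 2).

    Assumed model (not from the paper): in iteration k the trailing
    block has m = n - 1 - k rows and columns. Sequentially it costs
    m * m multiply-add steps; in parallel each row takes one step,
    but only m of the n processors have an element to update.
    """
    work = 0   # sequential multiply-add steps, about n^3 / 3
    steps = 0  # parallel steps, about n^2 / 2
    for k in range(n - 1):
        m = n - 1 - k
        work += m * m
        steps += m
    # efficiency = sequential steps / (processors * parallel steps)
    return Fraction(work, n * steps)


for n in (4, 64, 1024):
    print(n, float(row_parallel_efficiency(n)))
```

Under this model the ratio works out to (2n - 1) / (3n), which tends to 2/3: the processors left idle by the shrinking rows account for the missing third.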
It is pointed out in [10] that, with the row or column wrap scheme, the sequential BLAS-3 computations in some stages contribute $O(n^2)$ time, which can become a bottleneck for the whole computation. Computation and communication time complexities on a multiprocessor ring connected by local links and a broadcast bus are analyzed in [12] for various task mapping strategies. Blocked LU decomposition algorithms for mesh connected multiprocessors are given in [13,14]; these algorithms balance the workload across the available processors. In the algorithm given in [14], not only is the data distributed in a blocked fashion so that the workload is balanced, but the tasks are also scheduled in a blocked fashion and a systolic algorithm is used for the submatrix multiplications, so the data reuse ratio is high.

For hypercube connected multiprocessors, different algorithms are given in [15-17]. Various strategies for load balancing and for overlapping communication with computation are discussed there thoroughly, including row or column wrapped and row or column blocked data distributions with their corresponding pivoting strategies, and various parallelization forms (jik, jki, and kji). A task partition scheme different from that of [3] is presented in [5]. It shortens the critical task path of the scheme in [3], but it can no longer use the task scheduling strategy of [3] to complete the computation with $n/2$ processors; its efficiency is 2/3 when $n$ processors are used. That algorithm is implemented on a hypercube.

When systems with large numbers of processors are used, especially when the number of processors is comparable to the number of equations in the linear system (e.g., is $O(n)$), row- or column-oriented data distribution and task scheduling are not efficient, because the workload becomes imbalanced as the computation progresses.
This is because, once the size of the active submatrix becomes much smaller than $n$, the number of processors involved in each step of the computation is also much smaller than $n$. Therefore, algorithms for hypercubes are needed that are efficient not only for systems with a small number of processors, but also for systems with a large number of processors.

Our goal here is to develop an algorithm whose efficiency approaches 1 asymptotically even when the number of processors is comparable to the number of equations. It is easy to maintain high efficiency when the number of processors is much smaller than the matrix size; it is much harder when the number of processors is as large as the size of the matrix. The proposed algorithm, LU_RCB, strives to achieve this goal.

LU_RCB is a parallel LU decomposition algorithm with pivoting for hypercube systems. In this paper, we describe it mainly for an MIMD distributed memory hypercube system; the modification for SIMD hypercubes is straightforward, and we omit its description. The algorithm exploits the parallelism inherent in rows, columns, and blocks interchangeably. Techniques from parallel storage schemes, e.g., the one we developed in [18], are applied here to both data distribution and task scheduling, so that the matrix elements and the computational tasks are distributed evenly among all the processors. An optimal algorithm for matrix multiplication on hypercubes from [19,20] is adopted as the basic building block for submatrix modification. When a hypercube with $n$ nodes is used to decompose an $n \times n$ matrix, the efficiency approaches 1 asymptotically as $n$ becomes large.

In short, the special features of LU_RCB are as follows.
(1) The data and workload are distributed evenly among all the processors throughout the decomposition process, so the algorithm does not suffer from the efficiency loss resulting from load imbalance.
(2) It adopts an optimal matrix multiplication algorithm proposed for hypercubes in [19], and its data reuse ratio is high ($O(\sqrt{n})$ for one memory access).
(3) Data movements can be realized efficiently using existing routing algorithms on both SIMD and MIMD hypercube systems.

This paper is organized as follows. Section 2 gives some preliminaries. A parallel algorithm for LU decomposition with partial pivoting is described in Section 3. In Section 4, we briefly analyze the time complexity of the algorithm and discuss some implementation issues, especially how the data alignment requirements are implemented. Section 5 gives some conclusions.

2. PRELIMINARIES

A linear system of equations $Ax = b$, where $A$ is an $n \times n$ matrix and $x$ and $b$ are vectors of size $n$ with $x$ being the unknown, can be solved in two stages. First, $A$ is decomposed into a lower triangular matrix $L$ and an upper triangular matrix $U$ such that $A = LU$. Second, the two triangular systems are solved by substitution: $Ly = b$ is solved for the vector $y$, and then $Ux = y$ is solved for $x$. In the whole computation, the decomposition stage is the dominant part; it needs $n^3/3 + O(n^2)$ time steps when executed sequentially.

Sequential LU decomposition algorithms based on Gaussian elimination can be found in the literature, e.g., in [10]. Various parallel algorithms have been developed to speed up the decomposition process [1-5,7,9,11,13-16,21-23]. The reader is referred to [9] for a review.

Let the $n$ rows (and $n$ columns) of $A$ be numbered from 0 to $n - 1$, from top to bottom (from left to right). Let $A^{(k)}$ denote the submatrix that includes only the elements $a_{i,j}$ with $k \le i, j \le n - 1$.
In iteration $k$ of the LU decomposition algorithm, only the submatrix $A^{(k)}$ is still active (it still needs to be updated); the rest of the matrix has reached its final form.
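The two-stage solution described in this section can be illustrated with a short sequential sketch. This is plain Gaussian elimination with partial (row) pivoting, not the parallel LU_RCB algorithm; the function names `lu_decompose` and `lu_solve` and the packed storage of L and U in one array are assumptions made for illustration. In iteration `k`, only rows and columns `k` to `n - 1` are touched, which corresponds to the active submatrix.

```python
def lu_decompose(a):
    """Factor a square matrix with partial pivoting.

    Returns (perm, lu): perm[i] is the original index of the row now
    in position i; lu holds U on and above the diagonal and the
    multipliers of the unit lower triangular L below it.
    """
    n = len(a)
    lu = [[float(v) for v in row] for row in a]
    perm = list(range(n))
    for k in range(n - 1):
        # Pivot search: largest magnitude in column k, rows k..n-1.
        p = max(range(k, n), key=lambda i: abs(lu[i][k]))
        if lu[p][k] == 0.0:
            raise ValueError("matrix is singular")
        if p != k:
            lu[k], lu[p] = lu[p], lu[k]
            perm[k], perm[p] = perm[p], perm[k]
        # Update the trailing (still active) part of the matrix.
        for i in range(k + 1, n):
            lu[i][k] /= lu[k][k]
            for j in range(k + 1, n):
                lu[i][j] -= lu[i][k] * lu[k][j]
    if lu[n - 1][n - 1] == 0.0:
        raise ValueError("matrix is singular")
    return perm, lu


def lu_solve(perm, lu, b):
    """Solve Ax = b from the factors: L y = P b, then U x = y."""
    n = len(lu)
    y = [0.0] * n
    for i in range(n):  # forward substitution (L has a unit diagonal)
        y[i] = b[perm[i]] - sum(lu[i][j] * y[j] for j in range(i))
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        s = sum(lu[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (y[i] - s) / lu[i][i]
    return x


A = [[2, 1, 1], [4, 3, 3], [8, 7, 9]]
b = [4, 10, 24]
perm, lu = lu_decompose(A)
x = lu_solve(perm, lu, b)
print(perm, x)
```

The triple loop in `lu_decompose` is the part that costs about $n^3/3$ steps; the two substitutions cost only $O(n^2)$, which is why the decomposition stage dominates.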
