Journal of Computer Science 3 (12): 956-964, 2007 ISSN 1549-3636 © 2007 Science Publications

Some P-RAM Algorithms for Sparse Linear Systems

1Rakesh Kumar Katare and 2N.S. Chaudhari
1Department of Computer Science, A.P.S. University, Rewa (M.P.) 486003, India
2Nanyang Technological University, Singapore

Abstract: P-RAM algorithms for symmetric Gaussian elimination are presented. We show the actual testing operations that are performed during symmetric Gaussian elimination; because these tests dominate the arithmetic, they motivate the use of symbolic factorization for sparse linear systems. The array pattern of processing elements (PEs) in row-major order for the specialized sparse matrices is formulated. We show that the access function in^2 + jn + k has topological properties. We also prove that the cost of storage and the cost of retrieval of a matrix are proportional to each other in polylogarithmic parallel time using a P-RAM with a polynomial number of processors. We use symbolic factorization to produce a data structure that exploits the sparsity of the triangular factors. With these parallel algorithms, O(log^3 n) multiplications/divisions, O(log^3 n) additions/subtractions and O(log^2 n) storage may be achieved.

Key words: Parallel algorithms, sparse linear systems, cholesky method, hypercubes, symbolic factorization

INTRODUCTION

In this research we explain a method of representing a sparse matrix in parallel by using the P-RAM model. The P-RAM model is a shared-memory model. According to Gibbons and Rytter[6], the PRAM model will survive as a theoretically convenient model of parallel computation and as a starting point for a methodology. A number of processors work synchronously and communicate through a common random access memory. The processors are indexed by the natural numbers and they synchronously execute the same program (through the central main control). Although performing the same instructions, the processors can be working on different data (located in different storage locations). Such a model is also called a single-instruction, multiple-data stream (SIMD) model. The symmetric factorization is obtained in logarithmic time on the hypercube SIMD model[2,10].

To develop these algorithms we use the method of symbolic factorization for sparse symmetric factorization. Symbolic factorization produces a data structure that exploits the sparsity of the triangular factors. We represent an array pattern of processing elements (PEs) for a sparse matrix on the hypercube. We have already developed P-RAM algorithms for linear systems (dense case)[7,8,9]. These algorithms are implemented on the hypercube architecture.

BACKGROUND

We consider symmetric Gaussian elimination for the solution of the system of linear equations

A x = b

where A is an n×n symmetric, positive definite matrix. Symmetric Gaussian elimination is equivalent to the square-root-free Cholesky method, i.e., first factoring A into the product U^T D U and then forward and back solving to obtain x, and we often talk interchangeably about these two methods of solving A x = b. Furthermore, since almost the entire work lies in the factorization of A, we restrict most of our attention to that portion of the solution process.

Sherman[13] developed a row-oriented U^T D U factorization for dense matrices A and considered two modifications of it which take advantage of sparseness in A and U. We also discuss the type of information about A and U which is required to allow sparse symmetric factorization to be implemented efficiently.

Here we present a parallel algorithm for sparse linear systems, to be implemented on the hypercube. Previously we developed a matrix multiplication algorithm on the P-RAM model[7] and an implementation of back-substitution in Gauss elimination on the P-RAM model[8]. Now we further extend this idea to sparse linear systems of equations.

Corresponding Author: Rakesh Kumar Katare, Department of Computer Science, A.P.S. University, Rewa (M.P.) 486003, India
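To make the factorization concrete, the following short Python sketch (not part of the original paper; the matrix and function names are illustrative) computes the square-root-free Cholesky factors U and D of a small symmetric positive definite matrix and checks that U^T D U reproduces A.

# A minimal sketch (not from the paper) of the square-root-free Cholesky
# factorization A = U^T D U for a symmetric positive definite matrix A,
# with U unit upper triangular and D diagonal.
def utdu_factor(A):
    n = len(A)
    U = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    d = [0.0] * n
    for k in range(n):
        # d_k = a_kk - sum_{i<k} d_i * u_ik^2
        d[k] = A[k][k] - sum(d[i] * U[i][k] ** 2 for i in range(k))
        for j in range(k + 1, n):
            # u_kj = (a_kj - sum_{i<k} d_i * u_ik * u_ij) / d_k
            U[k][j] = (A[k][j] - sum(d[i] * U[i][k] * U[i][j] for i in range(k))) / d[k]
    return U, d

if __name__ == "__main__":
    A = [[4.0, 2.0, 2.0],
         [2.0, 5.0, 3.0],
         [2.0, 3.0, 6.0]]
    U, d = utdu_factor(A)
    n = len(A)
    # Reassemble U^T D U and check that it reproduces A.
    B = [[sum(U[k][i] * d[k] * U[k][j] for k in range(n)) for j in range(n)] for i in range(n)]
    print(d)
    print(B)

For this example all d_k are positive, which is exactly the positive-definiteness condition assumed throughout the paper.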

ANALYSIS OF SYMMETRIC GAUSSIAN ELIMINATION

In this section we describe the actual testing operations that are performed during symmetric Gaussian elimination. Gaussian elimination is applied to transform the matrix of a linear system to triangular form. In general this is done by creating zeros in the first column, then in the second, and so forth. For k = 1, 2, ..., n-1 we use the formulae

a_ij^(k+1) = a_ij^(k) - (a_ik^(k) / a_kk^(k)) a_kj^(k),   i, j > k        (1)

and

b_i^(k+1) = b_i^(k) - (a_ik^(k) / a_kk^(k)) b_k^(k),   i > k             (2)

where a_ij^(1) = a_ij, i, j = 1, 2, ..., n. The only assumption required is that the inequalities a_kk^(k) ≠ 0, k = 1, 2, ..., n hold. These entries are called the pivots in Gaussian elimination. It is convenient to use the notation

A^(k) x = b^(k)

for the system obtained after (k-1) steps, with A^(1) = A and b^(1) = b. The final matrix A^(n) is an upper triangular matrix[3].

As we know, in a_ij^(k+1) = a_ij^(k) - (a_ik^(k) / a_kk^(k)) a_jk^(k) we are eliminating position a_ik^(k).

Case 1: If i ≠ j then the above expression can be written as follows:

(a_ij^(k) - a_ij^(k+1)) = (a_ik^(k) / a_kk^(k)) a_jk^(k)                 (3)

Lemma 1: If (a_ij^(k) - a_ij^(k+1)) < 0 then a_ik^(k) or a_jk^(k) is negative.

Proof: (a_ij^(k) - a_ij^(k+1)) < 0
⟹ (a_ik^(k) / a_kk^(k)) a_jk^(k) < 0
⟹ a_ik^(k) or a_jk^(k) is negative.

Lemma 2: If (a_ij^(k) - a_ij^(k+1)) > 0 then a_ik^(k) and a_jk^(k) are both negative or both positive, and the pivot element a_kk^(k) is always positive.

Proof: (a_ij^(k) - a_ij^(k+1)) > 0
⟹ (a_ik^(k) / a_kk^(k)) a_jk^(k) > 0
⟹ since the pivot a_kk^(k) is always positive, a_ik^(k) and a_jk^(k) are both negative or both positive.

Lemma 3: If (a_ij^(k) - a_ij^(k+1)) = 0 then no elimination will take place and one of the following conditions will be satisfied: (i) a_ik^(k) = 0, or (ii) a_jk^(k) = 0, or both (i) and (ii) are zero.

Proof: If (a_ij^(k) - a_ij^(k+1)) = 0 then a_ij^(k) = a_ij^(k+1) and (a_ik^(k) / a_kk^(k)) a_jk^(k) = 0
⟹ (a_ik^(k) · a_jk^(k)) = 0
⟹ a_ik^(k) = 0 or a_jk^(k) = 0 or both are zero
⟹ a_ij^(k) remains the same after elimination, i.e., no elimination is required.

Case 2: If i = j then

a_ii^(k+1) = a_ii^(k) - (a_ik^(k) / a_kk^(k)) a_ik^(k)
⟹ a_ii^(k+1) = a_ii^(k) - (a_ik^(k))^2 / a_kk^(k)

Lemma 4: If a_ii^(k+1) = 0 then (a_ik^(k))^2 = a_ii^(k) · a_kk^(k).

Proof: If a_ii^(k+1) = 0 then a_ii^(k) = (a_ik^(k))^2 / a_kk^(k)
⟹ (a_ik^(k))^2 = a_ii^(k) · a_kk^(k).
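The sign tests stated in Lemmas 1-3 can be checked numerically. The sketch below is an illustration added here (not the paper's code): it performs one elimination step according to Eq. (1) and verifies that the sign of (a_ij^(k) - a_ij^(k+1)) agrees with the sign of a_ik^(k) · a_jk^(k), as the lemmas require when the pivot is positive.

# One elimination step on a symmetric positive definite matrix (0-based k),
# followed by the sign tests of Lemmas 1-3 for the remaining entries.
def elimination_step(A, k):
    n = len(A)
    B = [row[:] for row in A]
    for i in range(k + 1, n):
        for j in range(k + 1, n):
            B[i][j] = A[i][j] - (A[i][k] / A[k][k]) * A[k][j]
    return B

if __name__ == "__main__":
    A = [[4.0, -2.0, 2.0],
         [-2.0, 5.0, 3.0],
         [2.0, 3.0, 6.0]]
    k = 0
    B = elimination_step(A, k)
    for i in range(k + 1, len(A)):
        for j in range(k + 1, len(A)):
            diff = A[i][j] - B[i][j]      # left-hand side of Eq. (3)
            prod = A[i][k] * A[j][k]      # sign of a_ik^(k) * a_jk^(k)
            # Lemmas 1-3: diff and prod have the same sign because a_kk^(k) > 0.
            assert (diff > 0) == (prod > 0) and (diff < 0) == (prod < 0)
    print("sign tests of Lemmas 1-3 hold for this example")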

Lemma 5: If a_ik^(k) = 0 then a_ii^(k+1) = a_ii^(k); no iteration will take place.

Proof: The result is obvious.

Lemma 6: If (a_ii^(k) - a_ii^(k+1)) > 0 then (a_ik^(k))^2 > 0, i.e., a_ik^(k) ≠ 0.

Proof: (a_ii^(k) - a_ii^(k+1)) > 0
⟹ (a_ik^(k))^2 / a_kk^(k) > 0
⟹ (a_ik^(k))^2 > 0.

Lemma 7: If (a_ii^(k) - a_ii^(k+1)) = 0 then a_ik^(k) = 0.

Proof: (a_ii^(k) - a_ii^(k+1)) = 0
⟹ a_ii^(k) = a_ii^(k+1)
⟹ (a_ik^(k))^2 / a_kk^(k) = 0
⟹ a_ik^(k) = 0.

Lemma 8: If (a_ii^(k) - a_ii^(k+1)) < 0 then the pivot a_kk^(k) will be eliminated.

Proof: (a_ii^(k) - a_ii^(k+1)) < 0, and from Case 2, a_ii^(k) - a_ii^(k+1) = (a_ik^(k))^2 / a_kk^(k); since (a_ik^(k))^2 ≥ 0, this difference can be negative only when a_kk^(k) is negative, so the pivot a_kk^(k) must be eliminated.

In Gaussian elimination, a nonzero in position (j,k) implies that there is also a nonzero in position (k,j), which causes fill to occur in position (i,j) when row k is used to eliminate the element in position (i,k)[4].

Here we present the following analysis when the matrix is symmetric and positive definite. When i ≠ j in Eq. 3 and (a_ij^(k) - a_ij^(k+1)) is less than 0, then a_ik^(k) or a_jk^(k) is negative before elimination. When (a_ij^(k) - a_ij^(k+1)) in Eq. 3 is greater than 0, then a_ik^(k) and a_jk^(k) are both negative or both positive and the pivot element a_kk^(k) is always positive. When (a_ij^(k) - a_ij^(k+1)) in Eq. 3 is equal to zero, then no elimination will take place and one of the following conditions will be satisfied: (a) a_ik^(k) = 0, or (b) a_jk^(k) = 0, or both (a) and (b) are zero. When i = j in the above explanation of Eq. 3, then for the first case the pivot element will be eliminated, for the second case the element which is to be eliminated will be greater than zero and for the third case elimination will not take place.

We have suggested testing operations for the variation occurring in the elements during symmetric Gaussian elimination. There are more testing operations than arithmetic operations, so the running time of the algorithms would be proportional to the amount of testing rather than the amount of arithmetic; this is what makes symbolic factorization necessary. To avoid this problem, Sherman pre-computed the sets ra_k, ru_k and cu_k and showed that in an implementation of this scheme the total storage required is proportional to the number of nonzeroes in A and U and that the total running time is proportional to the number of arithmetic operations on nonzeroes. In addition, the preprocessing required to compute the sets {ra_k}, {ru_k} and {cu_k} can be performed in time proportional to the total storage. To see how to avoid these problems, let us assume that for each k, 1 ≤ k ≤ N, we have pre-computed:

• The set ra_k of columns j ≥ k for which a_kj ≠ 0;
• The set ru_k of columns j > k for which u_kj ≠ 0;
• The set cu_k of rows i < k for which u_ik ≠ 0.

1.  For k ← 1 to N do
2.    [m_kk ← 1;
3.     d_kk ← a_kk;
4.     For j ∈ ru_k do
5.       [m_kj ← 0];
6.     For j ∈ {n ∈ ra_k : n > k} do
7.       [m_kj ← a_kj];
8.     For i ∈ cu_k do
9.       [t ← m_ik;
10.      m_ik ← m_ik / d_ii;
11.      d_kk ← d_kk - t · m_ik;
12.      For j ∈ {n ∈ ru_i : n > k} do
13.        [m_kj ← m_kj - m_ik · m_ij]]];
Comment: Now M = U

Row-oriented sparse U^T D U factorization (with pre-processing) [Source 13]

Algorithm 1

In Algorithm 1 the only entries of M which are used are those corresponding to nonzeroes in A or U. For an implementation of Algorithm 1 it has already been shown that the total storage required is proportional to the number of nonzeroes in A and U and that the total running time is proportional to the number of arithmetic operations on nonzeroes. In addition, A.H. Sherman[13] showed that the pre-processing required to compute the sets {ra_k}, {ru_k} and {cu_k} can be performed in time proportional to the total storage.
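A possible realization of Algorithm 1 in Python is sketched below; it is an illustration under stated assumptions (0-based indices, the sets ra_k, ru_k and cu_k supplied explicitly for a small example), not Sherman's original code.

# Row-oriented sparse U^T D U factorization using pre-computed sets ra, ru, cu.
# m holds only the entries that can become nonzero (the structure of U).
def sparse_utdu(A, ra, ru, cu):
    n = len(A)
    d = [0.0] * n
    m = {}                                  # m[(k, j)] for j > k; m_kk = 1 implicitly
    for k in range(n):
        d[k] = A[k][k]
        for j in ru[k]:                     # reserve the fill positions of row k
            m[(k, j)] = 0.0
        for j in ra[k]:
            if j > k:                       # copy the nonzeroes of A into row k
                m[(k, j)] = A[k][j]
        for i in cu[k]:                     # update row k using earlier rows
            t = m[(i, k)]
            m[(i, k)] = m[(i, k)] / d[i]    # m_ik now holds u_ik
            d[k] -= t * m[(i, k)]
            for j in ru[i]:
                if j > k:
                    m[(k, j)] -= m[(i, k)] * m[(i, j)]
    return d, m                             # m now holds the strict upper triangle of U

if __name__ == "__main__":
    A = [[4.0, 1.0, 0.0, 1.0],
         [1.0, 4.0, 1.0, 0.0],
         [0.0, 1.0, 4.0, 0.0],
         [1.0, 0.0, 0.0, 4.0]]
    ra = [{0, 1, 3}, {1, 2}, {2}, {3}]      # structure of the upper triangle of A
    ru = [{1, 3}, {2, 3}, {3}, set()]       # structure of U from symbolic factorization
    cu = [set(), {0}, {1}, {0, 1, 2}]       # column structure of U
    d, m = sparse_utdu(A, ra, ru, cu)
    n = len(A)
    U = [[1.0 if i == j else m.get((i, j), 0.0) for j in range(n)] for i in range(n)]
    B = [[sum(U[k][i] * d[k] * U[k][j] for k in range(n)) for j in range(n)] for i in range(n)]
    print([[round(x, 10) for x in row] for row in B])   # reproduces A

For this 4×4 example the factorization reproduces A exactly, and only the positions listed in the sets ru_k are ever touched, which is the point of the static storage scheme.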

SYMBOLIC FACTORIZATION

The sets {ra_k}, which describe the structure of A, are input parameters, and the symbolic factorization algorithm computes the sets {ru_k} from them. The sets {cu_k} could in turn be computed from the sets {ru_k}. At the k-th step of the symbolic factorization algorithm, ru_k is computed from ra_k and the sets ru_i for i < k. An examination of Algorithm 1 shows that for j > k, u_kj ≠ 0 if and only if either

i.  a_kj ≠ 0, or
ii. u_ik ≠ 0 for some i ∈ cu_k.

Thus, letting ru_i^k = {j ∈ ru_i : j > k}, we have j ∈ ru_k if and only if either

iii. j ∈ ra_k, or
iv.  j ∈ ru_i^k for some i ∈ cu_k.

Algorithms 2 and 3 are symbolic factorization algorithms based directly on Algorithm 1. At the k-th step, ru_k is formed by combining ra_k with the sets {ru_i^k} for i ∈ cu_k. However, it is not necessary to examine ru_i^k for all rows i ∈ cu_k. Let l_k be the set of rows i ∈ cu_k for which k is the minimum column index in ru_i. Then we have the following result, which expresses a type of transitivity condition for the fill-in in symmetric Gaussian elimination.

There are some important reasons why it is desirable to perform such a symbolic factorization:

• Since a symbolic factorization produces a data structure that exploits the sparsity of the triangular factors, the numerical decomposition can be performed using a static storage scheme. There is no need to perform storage allocation for the fill-in during the numerical computations. This reduces both the storage and the execution-time overheads of the numerical phase[4,5].
• We obtain from the symbolic factorization a bound on the amount of space we need in order to solve the linear system. This immediately tells us whether the numerical computation is feasible. (Of course, this is important only if the symbolic factorization itself can be performed cheaply, both in terms of storage and execution time, and if the bound on space is reasonably tight.)[4,5]

1. For k ← 1 to N do
2.   [ru_k ← ∅];
3. For k ← 1 to N-1 do
     Comment: Form ru_k by set unions
4.   [For j ∈ {n ∈ ra_k : n > k} do
5.     [ru_k ← ru_k ∪ {j}];
6.    For i ∈ cu_k do
7.      [For j ∈ ru_i^k do
8.        [if j ∉ ru_k then
9.           ru_k ← ru_k ∪ {j}]]]];

O(θ_A) symbolic factorization [Source 13]

Algorithm 2

1. For k ← 1 to N do
2.   [ru_k ← ∅;
3.    l_k ← ∅];
4. For k ← 1 to N-1 do
     Comment: Form ru_k by set unions
5.   [For j ∈ {n ∈ ra_k : n > k} do
6.     [ru_k ← ru_k ∪ {j}];
7.    For i ∈ l_k do
8.      [For j ∈ ru_i^k do
9.        [if j ∉ ru_k then
10.          [ru_k ← ru_k ∪ {j}]]];
11.   m ← min {j : j ∈ ru_k ∪ {N+1}};
12.   If m < N+1 then
13.     [l_m ← l_m ∪ {k}]];

O(θ_S) symbolic factorization [Source 13]

Algorithm 3
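The set-union construction of ru_k used by Algorithms 2 and 3 can be sketched as follows (again an added illustration, not the source's code); it also accumulates the column sets cu_k that Algorithm 1 needs.

# Symbolic factorization by set unions: ru_k is formed from ra_k (condition iii)
# and the rows ru_i of earlier rows i in cu_k (condition iv).  0-based indices.
def symbolic_factorization(ra):
    n = len(ra)
    ru = [set() for _ in range(n)]
    cu = [set() for _ in range(n)]
    for k in range(n):
        ru[k] = {j for j in ra[k] if j > k}          # condition (iii): a_kj != 0
        for i in cu[k]:                              # condition (iv): fill from earlier rows
            ru[k] |= {j for j in ru[i] if j > k}
        for j in ru[k]:                              # record k as a nonzero row of column j
            cu[j].add(k)
    return ru, cu

if __name__ == "__main__":
    # Upper-triangle structure of the 4x4 example used above.
    ra = [{0, 1, 3}, {1, 2}, {2}, {3}]
    ru, cu = symbolic_factorization(ra)
    print(ru)   # [{1, 3}, {2, 3}, {3}, set()]  (note the fill entries (1,3) and (2,3))
    print(cu)   # [set(), {0}, {1}, {0, 1, 2}]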


PARALLEL MATRIX ALGORITHM

By an efficient parallel algorithm we mean one that takes polylogarithmic time using a polynomial number of processors. In practical terms, at most a polynomial number of processors is reckoned to be feasible[6]. A polylogarithmic-time algorithm takes O(log^k n) parallel time for some constant integer k, where n is the problem size. Problems which can be solved within these constraints are universally regarded as having efficient parallel solutions and are said to belong to the class NC (Nick Pippenger's class).

Representation of Array Pattern of Processing Elements (PEs): Consider the case of a three-dimensional array pattern with n^3 = 2^(3q) processing elements (PEs). Conceptually these PEs may be regarded as arranged in an n×n×n array pattern. If we assume that the PEs are indexed in row-major order, the PE in position (i,j,k) of this array has index in^2 + jn + k (note that array indices are in the range [0, n-1]). Hence, if r_(3q-1), ..., r_0 is the binary representation of the PE position (i,j,k), then i = r_(3q-1) ... r_(2q), j = r_(2q-1) ... r_q and k = r_(q-1) ... r_0. Using A(i,j,k), B(i,j,k) and C(i,j,k) to represent memory locations in PE(i,j,k), we can describe the initial condition for matrix multiplication as

A(0,j,k) = A_jk,   B(0,j,k) = B_jk,   0 ≤ j < n, 0 ≤ k < n

where A_jk and B_jk are the elements of the two matrices to be multiplied. The desired final configuration is

C(0,j,k) = C_jk,   0 ≤ j < n, 0 ≤ k < n

where

C_jk = Σ_(l=0)^(n-1) A_jl · B_lk                                         (4)
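As a small illustration (not from the paper) of the access function just described, the following sketch computes the row-major index i·n^2 + j·n + k and recovers (i, j, k) from its binary representation when n = 2^q.

def pe_index(i, j, k, n):
    # Row-major index of PE (i, j, k) in the n x n x n arrangement.
    return i * n * n + j * n + k

def pe_position(index, q):
    # Recover (i, j, k) from the 3q-bit index when n = 2**q: the index is simply
    # the q-bit fields of i, j and k concatenated.
    mask = (1 << q) - 1
    return (index >> (2 * q)) & mask, (index >> q) & mask, index & mask

if __name__ == "__main__":
    q = 2
    n = 1 << q                      # n = 4, so n^3 = 2^(3q) = 64 PEs
    idx = pe_index(1, 3, 2, n)      # 1*16 + 3*4 + 2 = 30
    print(idx, format(idx, "06b"))  # 30 -> '011110': i = 01, j = 11, k = 10
    print(pe_position(idx, q))      # (1, 3, 2)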

This algorithm computes the product matrix C by directly making use of (4). The algorithm has three distinct phases. In the first, the elements of A and B are distributed over the n^3 PEs so that we have A(l,j,k) = A_jl and B(l,j,k) = B_lk. In the second phase the products C(l,j,k) = A(l,j,k) · B(l,j,k) = A_jl B_lk are computed. Finally, in the third phase the sums Σ_(l=0)^(n-1) C(l,j,k) are computed. The details are spelled out in Dekel, Nassimi and Sahni[2].

When the lower triangular matrix is presented in three dimensions, the PEs are indexed in the following manner.

• Representation of an upper-triangular matrix:
  Index of a_ij = number of elements up to the element a_ij = (i-1)×(n - i/2) + j   (1 ≤ i ≤ j ≤ n)

• Representation of a diagonal matrix:
  For sparse matrices having elements only on the diagonal the following points are evident: the number of elements in an n×n square diagonal matrix is n, and any element a_ij can be referred to a processing element using the formula Address(a_ij) = i (or j).

• Representation of a tri-diagonal matrix:
  Index of a_ij = total number of elements in the first (i-1) rows + number of elements up to the j-th column in the i-th row = 2 + 2×(i-2) + j   (1 ≤ i, j ≤ n)

• Representation of an αβ-band matrix:

  Case 1: 1 ≤ i ≤ β
  Index of a_ij = number of elements in the first (i-1) rows + number of elements in the i-th row up to the j-th column = α×(i-1) + ((i-1)(i-2))/2 + j

  Case 2: β < i ≤ (n-α+1)
  Index of a_ij = number of elements in the first β rows + number of elements between the (β+1)-th row and the (i-1)-th row + number of elements in the i-th row up to the j-th column = αβ + (β(β-1))/2 + (α+β-1)(i-β-1) + j - i + β

  Case 3: n-α+1 < i ≤ n
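The index formulas above can be checked mechanically. The following sketch is an added illustration (with 1-based indices as in the text); it compares the upper-triangular and tri-diagonal formulas with a direct row-by-row enumeration.

def upper_triangular_index(i, j, n):
    # Valid for 1 <= i <= j <= n: rows of the upper triangle stored one after another.
    return (i - 1) * (n - i / 2) + j

def tridiagonal_index(i, j):
    # Valid for |i - j| <= 1: two elements in row 1, then up to three per row.
    return 2 + 2 * (i - 2) + j

if __name__ == "__main__":
    n = 5
    # Enumerate the upper triangle row by row and compare with the formula.
    pos = 0
    for i in range(1, n + 1):
        for j in range(i, n + 1):
            pos += 1
            assert upper_triangular_index(i, j, n) == pos
    # Enumerate the tri-diagonal pattern and compare with the formula.
    pos = 0
    for i in range(1, n + 1):
        for j in range(max(1, i - 1), min(n, i + 1) + 1):
            pos += 1
            assert tridiagonal_index(i, j) == pos
    print("index formulas agree with a row-by-row enumeration for n =", n)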

Lemma 10: The cost of storage and the cost of retrieval of a matrix are proportional to each other in polylogarithmic parallel time using a P-RAM with a polynomial number of processors.

Proof: For the storage and retrieval of a matrix we use parallel iteration. Parallel iteration has the property of converging in log n parallel steps; it converges geometrically, i.e., at the rate of a geometric progression, and therefore the two costs are proportional to each other for a single value. From the above fact we can write that the cost of retrieval is proportional to the cost of storage:

cost of retrieval = k × cost of storage   (where k is a constant)

If k ≥ 1 then the matrix is a dense matrix, and if k < 1 then it is a sparse matrix.

Here we present the O(θ_A) symbolic factorization PRAM-CREW parallel algorithm:

Begin
  Repeat log n times do
    For all (ordered) pairs (i, j, k), 0 < k ≤ n, (j > k : u(k, j) ≠ 0), q = log n, in parallel do
      u(k, j) ← 0
    End for
    For all (ordered) pairs (i, j, k), 0 < k < n, q = log n, in parallel do
      For all (ordered) pairs (i, j, k), 0 < k < n, j ∈ {n ∈ a(k, j) : n > k} do
        u(k, j) ← u(k, j) ∪ {i}
      End for
    For all (ordered) pairs (i, j, k), 0 < k < n, j ∈ u(k, j), q = log n, in parallel do
      m(2^(2q)·k + 2^q·j + k) ← m(k, j)
      m(k, j) ← 0
    End for
    For all (ordered) pairs (i, j, k), 0 < k < n (n > k), q = log n, in parallel do
      a(2^(2q)·i + 2^q·j + k) ← a(k, j)
      m(k, j) ← a(k, j)
    End for
    For all (ordered) pairs (i, j, k), 0 < k < n (n > k), q = log n, in parallel do
      m(k, j) ← m(k, j) - m(i, k) · m(i, j)
    End for
End

The storage requirements have been discussed only briefly in this research. The storage scheme used for sparse matrices consists of two parts, primary storage, used to hold the numerical values, and overhead storage, used for pointers, subscripts and other information needed to record the structure of the matrix, so that the two may be compared. The elements of the upper triangle (excluding the diagonal) are stored row by row in a single array, with a parallel array holding their column subscripts; a third array indicates the position of each row. The above discussion establishes the values of the elements of the matrix.
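The storage scheme just described (a single array of off-diagonal upper-triangle values, a parallel array of column subscripts and a third array of row positions) can be illustrated with the following sketch, which is not part of the original paper.

def pack_upper_triangle(A):
    # values: primary storage; columns and row_start: overhead storage.
    n = len(A)
    values, columns, row_start = [], [], [0] * (n + 1)
    for i in range(n):
        for j in range(i + 1, n):
            if A[i][j] != 0.0:
                values.append(A[i][j])
                columns.append(j)
        row_start[i + 1] = len(values)
    return values, columns, row_start

def get_entry(values, columns, row_start, i, j):
    # Retrieve a_ij (i < j) from the packed form, or 0.0 if it is not stored.
    for p in range(row_start[i], row_start[i + 1]):
        if columns[p] == j:
            return values[p]
    return 0.0

if __name__ == "__main__":
    A = [[4.0, 1.0, 0.0, 1.0],
         [1.0, 4.0, 1.0, 0.0],
         [0.0, 1.0, 4.0, 0.0],
         [1.0, 0.0, 0.0, 4.0]]
    vals, cols, starts = pack_upper_triangle(A)
    print(vals, cols, starts)   # [1.0, 1.0, 1.0] [1, 3, 2] [0, 2, 3, 3, 3]
    print(get_entry(vals, cols, starts, 0, 3), get_entry(vals, cols, starts, 2, 3))  # 1.0 0.0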
O(log n) and the number of PE are as shown in Fig. 2 These structures are restricted to having exactly 2k for the case of lower triangular matrix (i.e. only nodes. Because structure sizes must be a power of 2, (i(i+1)/2) PE are required). Same way for upper there are large gaps in the sizes of the system that can triangular matrix only (i-1) (n-i)/2) + j, processors for be built with the hypercubes. This severely restricts the diagonal matrix on i (or j) number of processors, for tri- number of possible nodes. A hypercube architecture has diagonal matrix only 2 + 2 x (i-2) + j no. of processors a delay time approximately equal to 2 log n and has a and for αβ-band matrix only no of processors for Case skew, i.e. different delay times for different inter 1 is α × (i-1) + ((i-1) (i-2)) / 2 + j for Case 2 is connecting nodes[11]. αβ+(β(β-1))/2+(α+β-1)(i-β-1)+j-i+β and for Case REFERENCES 3isαβ+(β(β-1))/2+(α+β-1) (n-α-β+1)+(α+β) (i-n+α-1)- 1. Ammon, J., Hypercube Connectivity within cc ((i-n+α-1)×(i-n+α-2))/2+1 is required to calculate T NUMA architecture Silicon Graphics, 201LN. U DU factors. Based on the above concept we Shoreline Blvd. Ms 565, Mountain View, CA developed the following Algorithm. Row-oriented 94043 dense UTDU factorization PRAM-CREW Algorithm, 2. Dekel E.D., Nassimi and Sahni, S. 1981. Parallel Row oriented sparse UTDU factorization (with zero matrix and graph algorithms. SIAM. J. Comput., testing). PRAM-CREW Algorithm, Row oriented 10 (4): 657-675. sparse UTDU factorization (with pre-processing 3. Duff, I., A. Erisman and Reid , J.K. 1987. Direct Methods for Sparse Matrices. Oxford University PRAM-CREW Algorithm and O(θ ) symbolic A Press. Oxford, UK. factorization PRAM-CREW Algorithm respectively. It 4. George, A. and Esmond, N.G. 1987. Symbolic has been shown that the access function or routing factorization for sparse guassion elimination with function to map data on hypercube contains topological partial pivoting. SIAM. J. Sci. Stat., Comput., properties. This function is convergent in the finite 8 (6) 877-898. 963 J. Computer Sci., 3 (12): 956-964, 2007

REFERENCES

1.  Ammon, J. Hypercube connectivity within ccNUMA architecture. Silicon Graphics, 2011 N. Shoreline Blvd., MS 565, Mountain View, CA 94043.
2.  Dekel, E., D. Nassimi and S. Sahni, 1981. Parallel matrix and graph algorithms. SIAM J. Comput., 10 (4): 657-675.
3.  Duff, I., A. Erisman and J.K. Reid, 1987. Direct Methods for Sparse Matrices. Oxford University Press, Oxford, UK.
4.  George, A. and N.G. Esmond, 1987. Symbolic factorization for sparse Gaussian elimination with partial pivoting. SIAM J. Sci. Stat. Comput., 8 (6): 877-898.
5.  George, A., J. Liu and E. Ng, 2003. Computer Solution of Sparse Positive Definite Systems. Unpublished book, Department of Computer Science, University of Waterloo.
6.  Gibbons, A. and W. Rytter, 1988. Efficient Parallel Algorithms. Cambridge University Press, Cambridge.
7.  Katare, R.K. and N.S. Chaudhari, 1999. Formulation of matrix multiplication on P-RAM model. Proc. of the International Conference on Cognitive Systems, New Delhi.
8.  Katare, R.K. and N.S. Chaudhari, 1999. Implementation of back-substitution in Gauss-elimination on P-RAM model. Proc. of the 65th Annual Conference of the Indian Mathematical Society, Rewa.
9.  Katare, R.K., 2000. Some P-RAM algorithms for linear systems. M.Tech. Dissertation, Devi Ahilya University, Indore.
10. Quinn, M.J., 1994. Parallel Computing. McGraw-Hill Inc.
11. Saad, Y. and M.H. Schultz, 1988. Topological properties of hypercubes. IEEE Trans. Comput., 37 (7): 867-872.
12. Samanta, D., 2001. Classic Data Structures. PHI, New Delhi.
13. Sherman, A., 1975. On the efficient solution of sparse systems of linear and nonlinear equations. Ph.D. Thesis, Yale University, New Haven, CT.
14. Thiel, T., 1994. The design of the Connection Machine. Design Issues, 10 (1).
15. Robert, Y. The Impact of Vector and Parallel Architectures on the Gaussian Elimination Algorithm. Halsted Press, a division of John Wiley and Sons, New York.
