Efficient Parallelizations of Hermite and Smith Normal Form Algorithms

Gerold Jäger (a), Clemens Wagner (b)

(a) Computer Science Institute, University of Halle-Wittenberg, D-06120 Halle (Saale), Germany
(b) denkwerk, Vogelsanger Straße 66, D-50823 Köln, Germany

Email addresses: [email protected] (Gerold Jäger), [email protected] (Clemens Wagner)

Abstract

Hermite and Smith normal form are important forms of matrices used in linear algebra, with many applications in group theory and number theory. As the entries of the matrix and of its corresponding transformation matrices can explode during the computation, it is a very difficult problem to compute the Hermite and Smith normal form of large dense matrices. The main problems of the computation are the large execution times and the memory requirements, which might exceed the memory of one processor. To avoid these problems, we develop parallelizations of Hermite and Smith normal form algorithms. These are the first parallelizations of algorithms for computing the normal forms with corresponding transformation matrices, both over the rings Z and F[x]. We show that our parallel versions have good efficiency, i.e., by doubling the number of processes, the execution time is nearly halved. Furthermore, they succeed in computing normal forms of large dense example matrices over the rings Q[x], F3[x], and F5[x].

Key words: Hermite normal form, Smith normal form, parallelization

1. Introduction

A matrix in R^{m,n}, where R is a commutative, integral and Euclidean ring with 1, with rank n is in Hermite normal form (HNF), if it is a lower triangular matrix in which all elements are smaller than the diagonal element of the same row. The definition can easily be generalized to rank r < n. It follows from Hermite [19] that one can obtain from an arbitrary matrix in R^{m,n} the uniquely determined HNF by unimodular column operations. A matrix in R^{m,n} with rank r is in Smith normal form (SNF), if it is a diagonal matrix in which each of the first r diagonal elements divides the next one and the remaining diagonal elements are zero. It follows from Smith [31] that one can

obtain from an arbitrary matrix in R^{m,n} the uniquely determined SNF by unimodular row and column operations. Thus the SNF is a generalization of the HNF with respect to both row and column operations. As Smith and Hermite normal forms are the basic building blocks for solving linear equations over the integers, they are at one more level of complexity than Gaussian elimination (where the elimination is done over the reals). Furthermore, Hermite and Smith normal form play an important role in the theory of finite Abelian groups, the theory of finitely generated modules over principal ideal rings, system theory, number theory, and integer programming. For many applications, for example linear equations over the integers, the transformation matrices describing the unimodular operations are important as well.

There are many algorithms for computing the HNF [1, 2, 6, 14, 24] and the SNF [2, 13, 15], most of them only for the ring Z. Some of these algorithms are probabilistic ([10] for the SNF and R = Z, [34] for the SNF and R = Q[x]). Deterministic algorithms for R = Z often use modular techniques ([23] for the HNF, [16, 32] for the SNF, [11], [30, Chapter 8.4], [33] for the HNF and SNF). Most of these modular algorithms are unable to compute the corresponding transformation matrices. Unfortunately, most algorithms lead to coefficient explosion, i.e., during the computation entries of the matrix and of the corresponding transformation matrices occur which are very large, even of exponential size [8, 17]. For high-dimensional matrices with large entries this leads to large execution times and memory problems, i.e., the memory of one process is not large enough for the normal form computation of large matrices. These problems can be remedied by parallelization, so that it is possible to handle considerably larger matrices.

Much effort has been devoted to parallel matrix and linear algebra computations [3, 7, 9, 18, 26, 27, 29, 36]. In [21] a parallel HNF algorithm and in [21, 22, 37] parallel probabilistic SNF algorithms are introduced for the ring F[x], but without experimental results, and in [28] a parallel SNF algorithm is described which only works for characteristic matrices.

The purpose of this paper is to show efficient parallelizations of Hermite and Smith normal form computations with empirical evidence. In particular, we parallelize the well-known HNF and SNF algorithms of Kannan, Bachem [25], generalized to rectangular matrices with arbitrary rank, and the SNF algorithm of Hartley, Hawkes [12]. These are three of the most important algorithms which work for both the ring Z and the ring F[x] and which are able to compute the corresponding transformation matrices. An important problem in the parallelization of normal form computations is how to uniformly distribute a large matrix to many processes. Our main idea for this problem comes from the following observation, which holds for all HNF and SNF algorithms considered: when an elimination step is done by a series of column (row) operations, the operations depend only on one particular row (column). Thus it is reasonable to use the well-known row (column) distribution of matrices [9, 29]. In particular, we use a row distribution for column operations and a column distribution for row operations, where row (column) distribution means distributing the matrix among the processes so that each whole row (column) goes to a single process. This is done by a broadcast operation.
When an elimination step is done involving entries in a particular row (column), that row (column) is broadcast to all the processes,

so that all of them can determine in parallel which column (row) operations are to be done on the matrix. Then they update their rows (columns) by performing these column (row) operations on all their rows (columns). As for the SNF we use both row and column operations, an auxiliary algorithm is used which transforms a row distributed matrix into a column distributed one and vice versa. This procedure is an implementation of parallel matrix transposition [4, 5, 35].

We estimate the parallel operations of the three algorithms and observe that the complexity of the parallel HNF algorithm is much better than that of both parallel SNF algorithms, and that the parallel Hartley-Hawkes SNF algorithm leads to a better complexity than the parallel Kannan-Bachem SNF algorithm. We implement the algorithms and test them on large dense matrices over the rings Q[x], F3[x], and F5[x]. The experiments show that the parallel HNF algorithm and the parallel Kannan-Bachem SNF algorithm give very similar results. Comparing the SNF algorithms, the parallel Kannan-Bachem SNF algorithm leads to better results for the ring Q[x], and the parallel Hartley-Hawkes SNF algorithm to better results for the rings F3[x] and F5[x]. Considering medium-sized matrices, we observe that the algorithms have a good efficiency, even for 64 processes. The algorithms are also able to compute the HNF and SNF with the corresponding transformation matrices of large matrices in reasonable time. Because of the memory requirements, the program packages MAGMA and MAPLE are not able to do most of these computations.

2. Preliminaries

Let R be a commutative, integral ring with 1 and R* ⊆ R the set of units of R. Let R be Euclidean, i.e., there is a mapping φ: R \ {0} → N_0, so that for a ∈ R, b ∈ R \ {0} there exist q, r ∈ R with a = qb + r and r = 0 ∨ φ(r) < φ(b), where we define ψ(a, b) := r. Further let 𝓡 ⊆ R be a system of representatives of R, i.e., for each a ∈ R \ {0} there exist a unique e ∈ R* and b ∈ 𝓡 with a = e · b, where we define β(a) := 1/e. In this paper we only consider two examples:

a) The set Z of integers. We choose φ := |·| and 𝓡 := N_0. For a ∈ R let ⌊a⌋ be the largest integer ≤ a. With the above notations we define for a ∈ Z, b ∈ Z \ {0}: ψ(a, b) := r = a − ⌊a/b⌋ · b, and for a ∈ Z \ {0} we have β(a) := sgn(a). For A ∈ Z^{m,n} let ‖A‖_∞ := max_{1≤i≤m,1≤j≤n} {|A_{i,j}|}.

b) The polynomial ring F[x] with a field F. We choose φ := deg and 𝓡 := {monic polynomials over F[x]}. With the above notations it holds for a ∈ F[x], b ∈ F[x] \ {0}: ψ(a, b) := r, where r is uniquely determined by polynomial division of a by b. Further for a ∈ F[x] \ {0} we have β(a) := 1/a_k for a = Σ_{i=0}^{k} a_i · x^i, a_k ≠ 0. For A ∈ F[x]^{m,n} let ⌈A⌉_deg := max_{1≤i≤m,1≤j≤n} {deg(A_{i,j})}.

Definition 2.1. a) The matrix E_n = (E_{i,j})_{1≤i,j≤n} ∈ R^{n,n} is defined by E_{i,j} = 1 if i = j and E_{i,j} = 0 otherwise.
b) GL_n(R) is the group of matrices in R^{n,n} whose determinant is a unit in the ring R. These matrices are called unimodular matrices.
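For the ring R = Z these definitions are easy to make concrete. The following Python sketch is our own illustration (the names psi and beta are ours, and the paper's implementation is in C++); it mirrors ψ and β for R = Z.

    def psi(a, b):
        # Euclidean remainder for R = Z: psi(a, b) = a - floor(a/b) * b.
        # Python's floor division uses exactly this convention, so either
        # psi(a, b) == 0 or |psi(a, b)| < |b|.
        return a - (a // b) * b

    def beta(a):
        # beta(a) = sgn(a) for a != 0; multiplying a by beta(a) yields the
        # representative |a| in N_0.
        return 1 if a > 0 else -1

    assert psi(17, 5) == 2 and psi(-17, 5) == 3   # -17 = (-4)*5 + 3
    assert beta(-7) * (-7) == 7                    # representative of -7 is 7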

Definition 2.2. A matrix A = (A_{i,j})_{1≤i≤m,1≤j≤n} ∈ R^{m,n} with rank r is in Hermite normal form (HNF), if the following conditions hold:
a) There exist i_1, ..., i_r with 1 ≤ i_1 < ··· < i_r ≤ m and A_{i_j,j} ∈ 𝓡 \ {0} for 1 ≤ j ≤ r (the A_{i_j,j} are called pseudo diagonal elements).
b) A_{i,j} = 0 for 1 ≤ i ≤ i_j − 1, 1 ≤ j ≤ r.
c) The columns r+1, ..., n are zero.
d) A_{i_j,l} = ψ(A_{i_j,l}, A_{i_j,j}) for 1 ≤ l < j ≤ r.
The matrix A is in left Hermite normal form (LHNF), if its transpose A^T is in HNF.

Theorem 2.3. [19] Let A ∈ R^{m,n}. Then a matrix V ∈ GL_n(R) exists, so that H = AV is in HNF. The matrix H is uniquely determined. The matrix V is called the corresponding transformation matrix for the HNF.

Definition 2.4. A matrix A = (A_{i,j})_{1≤i≤m,1≤j≤n} ∈ R^{m,n} with rank r is in Smith normal form (SNF), if the following conditions hold:
a) A is a diagonal matrix.
b) A_{i,i} ∈ 𝓡 \ {0} for 1 ≤ i ≤ r.
c) A_{i,i} | A_{i+1,i+1} for 1 ≤ i ≤ r − 1.
d) A_{i,i} = 0 for r + 1 ≤ i ≤ min{m, n}.

Theorem 2.5. [31] Let A ∈ R^{m,n}. Then matrices U ∈ GL_m(R) and V ∈ GL_n(R) exist, so that C = UAV is in SNF. The matrix C is uniquely determined. The matrices U, V are called the corresponding left hand and right hand transformation matrices for the SNF.

3. HNF and SNF algorithms

The following algorithms compute the HNF and SNF of an arbitrary matrix in Rm,n with corresponding transformation matrices. All algorithms are also formulated for the transformation matrices U and V . Normally, each algorithm starts with U = Em and V = En, but if an algorithm is a subroutine of another algorithm, the settings of U ∈ GLm(R) and V ∈ GLn(R) come from the main algorithm.

3.1. HNF Algorithm column by column

The following algorithm ROW-ONE-GCD(A, V, i, j, l) works on the i-th row of a matrix A. It performs a unimodular transformation on two arbitrary entries A_{i,j} and A_{i,l} of this row, so that after the transformation the second entry is zero. More precisely it holds: A_{i,j}^{new} = gcd(A_{i,j}^{old}, A_{i,l}^{old}) ∈ 𝓡 and A_{i,l}^{new} = 0. In parallel, the corresponding transformation matrix V can be computed.

Gcd computation of two elements of the same row

INPUT: A = [a_1, ..., a_n] ∈ R^{m,n}, V = [v_1, ..., v_n] ∈ GL_n(R) and i, j, l with 1 ≤ i ≤ m, 1 ≤ j < l ≤ n
1  IF A_{i,j} ≠ 0 ∨ A_{i,l} ≠ 0
2    THEN Compute d := gcd(A_{i,j}, A_{i,l}) and u, v with d = u·A_{i,j} + v·A_{i,l}
3         [a_j, a_l] = [a_j, a_l] · ( u, −A_{i,l}/d ; v, A_{i,j}/d )
4         [v_j, v_l] = [v_j, v_l] · ( u, −A_{i,l}/d ; v, A_{i,j}/d )
OUTPUT: (A, V) = ROW-ONE-GCD(A, V, i, j, l)
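For R = Z this transformation can be sketched in a few lines of Python (our own illustration, independent of the paper's C++ implementation; the names egcd and row_one_gcd are ours). The individual steps of the pseudocode are explained below.

    def egcd(a, b):
        # extended Euclidean algorithm: d = gcd(a, b) = u*a + v*b, with d >= 0
        old_r, r, old_u, u, old_v, v = a, b, 1, 0, 0, 1
        while r != 0:
            q = old_r // r
            old_r, r = r, old_r - q * r
            old_u, u = u, old_u - q * u
            old_v, v = v, old_v - q * v
        if old_r < 0:
            old_r, old_u, old_v = -old_r, -old_u, -old_v
        return old_r, old_u, old_v

    def row_one_gcd(A, V, i, j, l):
        # Unimodular column operation on columns j and l of A and V (0-based indices)
        # so that afterwards A[i][j] = gcd(old A[i][j], old A[i][l]) and A[i][l] = 0.
        if A[i][j] == 0 and A[i][l] == 0:
            return
        d, u, v = egcd(A[i][j], A[i][l])
        p, q = A[i][j] // d, A[i][l] // d          # the entries A_{i,j}/d and A_{i,l}/d
        for M in (A, V):
            for row in M:
                row[j], row[l] = u * row[j] + v * row[l], -q * row[j] + p * row[l]

Since A and V receive exactly the same column operations, the product of the input matrix with the accumulated V always equals the current A, which is an easy way to test such a sketch.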

In step 2, for two integers x, y we compute gcd(x, y) and u, v with gcd(x, y) = ux + vy, using the extended Euclidean algorithm. In steps 3 and 4, the unimodular 2 × 2 matrix ( u, −A_{i,l}/d ; v, A_{i,j}/d ), written here with ";" separating its two rows, is multiplied from the right to the original matrix and to the transformation matrix V, so that the j-th and l-th columns are changed in such a way that the conditions for A_{i,j} and A_{i,l} are fulfilled. We denote the analogous algorithm for the gcd computation of two elements of the same column by COL-ONE-GCD(A, U, i, j, l).

With these procedures we can formulate the HNF algorithm which we will parallelize. The HNF algorithm is based on the HNF algorithm of Kannan, Bachem [25], who proposed for a square matrix A ∈ R^{n,n} of full rank to compute the HNF of the (t × t) submatrix for t = 1, ..., n recursively. This algorithm is also able to compute the corresponding transformation matrix, and – because of its simple structure – it is ideal for parallelization. A natural generalization of this algorithm to rectangular matrices with arbitrary rank is to recursively compute the HNF of the first t columns for t = 1, ..., n.

HNF computation column by column

INPUT: A = [a_1, ..., a_n] ∈ R^{m,n}, V = [v_1, ..., v_n] ∈ GL_n(R)
 1  FOR t = 1, ..., n   (Compute HNF of first t columns)
 2    r = 0
 3    FOR s = 1, ..., m
 4      IF A_{s,r+1} ≠ 0 ∨ A_{s,t} ≠ 0
 5        THEN r = r + 1
 6             i_r = s
 7             IF t = r
 8               THEN IF A_{s,t} ∉ 𝓡
 9                 THEN a_t = β(A_{s,t}) · a_t
10                      v_t = β(A_{s,t}) · v_t
11               ELSE ROW-ONE-GCD(A, V, s, r, t)
12             FOR l = 1, ..., r − 1
13               a_l = a_l − ψ(A_{s,l}, A_{s,r}) · a_r
14               v_l = v_l − ψ(A_{s,l}, A_{s,r}) · v_r
15             IF t = r
16               THEN GOTO 1 with next t
OUTPUT: (A, V) = HNF(A, V) with rank r

In steps 4 to 11, the current pseudo diagonal element A_{s,r} is computed. After steps 12 to 14, the elements of the s-th row fulfill condition d) of the HNF definition. If t = r holds after step 14, the current pseudo diagonal element A_{s,r} has been found and we can go to the next t of the FOR loop of step 1.
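To make the structure concrete, the following Python sketch computes an HNF with its transformation matrix by column operations over R = Z. It is our own illustration with our own names (hnf_column_ops); it proceeds row by row and eliminates to the right of the current pivot column with extended-gcd column operations, which is a simplified variant of the recursion over t = 1, ..., n above, not the exact Kannan-Bachem scheme.

    def hnf_column_ops(A):
        # Returns (H, V) with H = A * V, V unimodular, H in column-style HNF:
        # pseudo diagonal elements positive, zeros above them, entries to their
        # left reduced modulo them, and the columns beyond the rank equal to zero.
        m, n = len(A), len(A[0])
        H = [row[:] for row in A]
        V = [[int(i == j) for j in range(n)] for i in range(n)]

        def egcd(a, b):
            # extended Euclid: d = gcd(a, b) = u*a + v*b with d >= 0
            if b == 0:
                return (a, 1, 0) if a >= 0 else (-a, -1, 0)
            d, u, v = egcd(b, a % b)
            return d, v, u - (a // b) * v

        def colop(j, l, u, v, p, q):
            # unimodular update (col_j, col_l) <- (u*col_j + v*col_l, -q*col_j + p*col_l),
            # applied to H and V simultaneously (determinant u*p + v*q = 1)
            for M in (H, V):
                for row in M:
                    row[j], row[l] = u * row[j] + v * row[l], -q * row[j] + p * row[l]

        k = 0                                    # index of the next pivot column
        for s in range(m):                       # scan the rows from top to bottom
            if k == n:
                break
            for l in range(k + 1, n):            # zero out H[s][k+1:], gcd moves to column k
                if H[s][l] != 0:
                    d, u, v = egcd(H[s][k], H[s][l])
                    colop(k, l, u, v, H[s][k] // d, H[s][l] // d)
            if H[s][k] != 0:                     # a new pivot was found in row s
                if H[s][k] < 0:                  # sign normalization (the beta step)
                    for M in (H, V):
                        for row in M:
                            row[k] = -row[k]
                for l in range(k):               # reduce entries left of the pivot
                    q = H[s][l] // H[s][k]
                    for M in (H, V):
                        for row in M:
                            row[l] -= q * row[k]
                k += 1
        return H, V

Again the invariant H = A · V holds throughout, because identical column operations are applied to H and V.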

5 * * 1 * 4 * * * * * 2 3 * 5 2 * 4 5 6 * 6 3 1 *

(a) (b)

Figure 1: Order of reducing the elements left from the diagonal elements: (a) Standard, (b) Chou-Collins

Obviously, there is an analogous algorithm for the computation of the LHNF row by row. Chou, Collins [6] found an essential theoretical and practical improvement for the HNF computation by reducing the elements left of the diagonal elements (steps 12 and 13 of Algorithm 3.1) in a different order (see Fig. 1). As the columns which are added to reduce an element have only reduced non-diagonal elements at this stage, the algorithm leads to less coefficient explosion. Unfortunately, the Chou-Collins idea cannot be combined with HNF Algorithm 3.1 for efficient parallelization, as the communication operations between the processes would become too large (see section 4.2).

3.2. Algorithm DIAGTOSMITH

In the SNF algorithms we use an elementary algorithm DIAGTOSMITH [30] which computes the SNF of a matrix in diagonal form.

INPUT: A ∈ R^{m,n} in diagonal form, U = [u_1, ..., u_m]^T ∈ GL_m(R), V = [v_1, ..., v_n] ∈ GL_n(R)
 1  FOR k = 1, ..., min{m, n} − 1
 2    FOR l = min{m, n} − 1, ..., k
 3      IF A_{l,l} ∤ A_{l+1,l+1}
 4        THEN g = A_{l,l} · A_{l+1,l+1}
 5             A_{l,l} = gcd(A_{l,l}, A_{l+1,l+1})
 6             A_{l+1,l+1} = g / A_{l,l}
 7             Compute d := gcd(A_{l,l}, A_{l+1,l+1}) and u, v with d = u·A_{l,l} + v·A_{l+1,l+1}
 8             [u_l, u_{l+1}]^T = ( u, v ; −A_{l+1,l+1}/d, A_{l,l}/d ) · [u_l, u_{l+1}]^T
 9             [v_l, v_{l+1}] = [v_l, v_{l+1}] · ( 1, −v·A_{l+1,l+1}/d ; 1, u·A_{l,l}/d )
10  FOR l = 1, ..., min{m, n}
11    IF A_{l,l} ≠ 0
12      THEN A_{l,l} = β(A_{l,l}) · A_{l,l}
13           v_l = β(A_{l,l}) · v_l
OUTPUT: (A, U, V) = DIAGTOSMITH(A, U, V)

In steps 4 to 6, two neighboring diagonal elements A_{l,l} and A_{l+1,l+1} are substituted by their gcd and their lcm, so that after these steps it holds: A_{l,l} | A_{l+1,l+1}.

The steps 7 to 9 hold because of the following equation [30]:

( u, v ; −A_{l+1,l+1}/d, A_{l,l}/d ) · ( A_{l,l}, 0 ; 0, A_{l+1,l+1} ) · ( 1, −v·A_{l+1,l+1}/d ; 1, u·A_{l,l}/d ) = ( d, 0 ; 0, lcm(A_{l,l}, A_{l+1,l+1}) )    (1)

Here ( a, b ; c, d ) denotes the 2 × 2 matrix with first row (a, b) and second row (c, d); in (1), d = gcd(A_{l,l}, A_{l+1,l+1}) = u·A_{l,l} + v·A_{l+1,l+1}.

These steps are repeated until the conditions c) and d) of the SNF definition are fulfilled. After the steps 10 to 13, also condition b) of the SNF definition is fulfilled. The algorithm needs not more than min{m, n}^2 gcd computations.
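Restricted to the diagonal entries over Z, and ignoring the transformation matrices, the gcd/lcm replacement can be sketched in Python as follows (our own illustration; math.gcd is used for the gcd computation, and the sign normalization of steps 10 to 13 is omitted).

    from math import gcd

    def diag_to_smith(d):
        # d: diagonal entries of a diagonal matrix over Z.
        # Replaces neighbouring pairs by (gcd, lcm), as in steps 4 to 6 of DIAGTOSMITH,
        # until d[0] | d[1] | ... holds and all zero entries are at the end.
        d = list(d)
        n = len(d)
        for k in range(n - 1):
            for l in range(n - 2, k - 1, -1):
                a, b = d[l], d[l + 1]
                if a == 0 or b % a != 0:          # a does not divide b
                    g = gcd(a, b)
                    d[l], d[l + 1] = g, (a * b) // g if g != 0 else 0
        return d

    print(diag_to_smith([4, 6, 0, 10]))           # [2, 2, 60, 0]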

3.3. Kannan-Bachem SNF algorithm

INPUT: A ∈ R^{m,n}, U ∈ GL_m(R), V ∈ GL_n(R)
1  WHILE (A is not in diagonal form)
2    (A, V) = HNF(A, V)
3    (A, U) = LHNF(A, U)
4  (A, U, V) = DIAGTOSMITH(A, U, V)
OUTPUT: (A, U, V) = KB-SNF(A, U, V)

The algorithm of Kannan and Bachem [25] alternately computes the HNF and the LHNF in the steps 2 and 3, until the matrix is in diagonal form. In step 4, the algorithm DIAGTOSMITH is applied.

3.4. Hartley-Hawkes SNF algorithm

For the algorithm of Hartley and Hawkes [12, p. 112] many variants are known, see [2, 13, 15]. They differ in the implementation of the following procedures ROWGCD and COLGCD and in some additional row and column swaps. For i, j with 1 ≤ i ≤ m, 1 ≤ j ≤ n, ROWGCD(A, V, i, j) transforms A so that A_{i,j}^{new} = gcd(A_{i,j}^{old}, A_{i,j+1}^{old}, ..., A_{i,n}^{old}) ∈ 𝓡 and A_{i,j+1}^{new} = ... = A_{i,n}^{new} = 0. This is done by repeatedly subtracting a multiple of one column from another column. The corresponding transformation matrix V ∈ GL_n(R) is changed by applying all column operations applied to A also to V. The procedure COLGCD(A, U, i, j) is defined analogously with the roles of rows and columns exchanged.

INPUT: A ∈ R^{m,n}, U ∈ GL_m(R), V ∈ GL_n(R)
1  l = 1
2  WHILE l ≤ min{m, n}
3    IF NOT (A_{l,l+1:n}) = 0
4      THEN A = ROWGCD(A, V, l, l)
5    IF (A_{l+1:m,l}) = 0
6      THEN l = l + 1
7      ELSE A = COLGCD(A, U, l, l)
8  (A, U, V) = DIAGTOSMITH(A, U, V)
OUTPUT: (A, U, V) = HH-SNF(A, U, V)

For l = 1, ..., min{m, n} the algorithm of Hartley, Hawkes alternately uses the procedures ROWGCD(A, V, l, l) and COLGCD(A, U, l, l) in steps 3 to 7, until the first l rows and columns have diagonal form. In step 8, again the algorithm DIAGTOSMITH is used.

4. Parallelization of the HNF and SNF algorithms

4.1. Idea of Parallelization

A parallel program is a set of independent processes with data being interchanged between the processes. We write BROADCAST x, if a process sends a variable x to all other processes, BROADCAST-RECEIVE x FROM z, if a process receives a variable x from the process with the number z which has sent it with BROADCAST, SEND x TO z, if a process sends a variable x to the process with the number z, and SEND-RECEIVE x FROM z, if a process receives a variable x from the process with the number z which has sent it with SEND.

The matrix whose HNF and SNF shall be computed has to be distributed to the different processes as uniformly as possible. It is straightforward to assign different rows or different columns of a matrix to one process [9, 29]. For algorithms in which mainly column operations are used, a column distribution is not reasonable, as for column additions with multiplicity the columns involved in a computation mostly belong to different processes, so that for each such computation at least one column element would have to be sent. Thus we distribute rows if column operations are used, and columns if row operations are used. As the SNF algorithms use both kinds of operations, we have to switch between both distributions (see the procedures PAR-ROWTOCOL and PAR-COLTOROW in Section 4.4). Quinn [29] considers two approaches for row distribution based on block data decomposition. As experiments in section 5.3 show that both approaches lead to rather similar execution times of our algorithms, we use the simpler one of both.

Let the matrix Ā ∈ R^{m,n} be distributed on q processes and let the z-th process consist of k_row(z) rows. Every process z has as input a matrix A ∈ R^{k_row(z),n} with Σ_{z=0}^{q−1} k_row(z) = m. For every process let the order of the rows be equal to the original order. At any time we can obtain the complete matrix by putting these matrices together. We choose the following uniform distribution: the process with the number z receives the rows z+1, z+1+q, ..., z+1+⌊(m−z−1)/q⌋·q (see Fig. 2(a) for an example). The most important point of this distribution is that each process receives rows from different parts of the whole matrix (compare section 5.3). For 1 ≤ l ≤ m, let ROW-TASK(l) return the number of the process where the l-th row lies (for example in Fig. 2(a): ROW-TASK(7) = 2). For 1 ≤ l ≤ m, let ROW-NUM(l) return the position the original l-th row has (or would have) in the local list of rows of a process, even if row l is not present in that list (for example in Fig. 2(a): ROW-NUM(7) = 3 on processes 0, 1 and ROW-NUM(7) = 2 on processes 2, 3).
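The distribution and the two index functions can be written down directly. The following Python sketch (our own illustration with 1-based row numbers; the function names mirror the text) reproduces the example of Fig. 2(a) with m = 11 and q = 4.

    def local_rows(z, m, q):
        # rows owned by process z under our distribution: z+1, z+1+q, z+1+2q, ...
        return list(range(z + 1, m + 1, q))

    def row_task(l, q):
        # ROW-TASK(l): number of the process that owns row l
        return (l - 1) % q

    def row_num(l, z, q):
        # ROW-NUM(l) on process z: position row l has (or would have, if inserted)
        # in the local row list of process z
        return max(0, (l - z - 2 + q) // q) + 1

    m, q = 11, 4
    print([local_rows(z, m, q) for z in range(q)])   # [[1, 5, 9], [2, 6, 10], [3, 7, 11], [4, 8]]
    print(row_task(7, q))                            # 2
    print([row_num(7, z, q) for z in range(q)])      # [3, 3, 2, 2]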

Figure 2: Example of a matrix with 11 rows and 9 columns for 4 processes: (a) row distribution, (b) column distribution, (c) broadcast of row 6

Analogously, we define the column distribution, where the z-th process consists of k_col(z) columns and every process z has as input a matrix A ∈ R^{m,k_col(z)} with Σ_{z=0}^{q−1} k_col(z) = n, and the functions COL-TASK and COL-NUM (see Fig. 2(b) for an example). Since only column operations are performed on V, we choose for V a row distribution. Analogously, a column distribution is chosen for U.

4.2. Parallel HNF algorithm column by column

In the following we only consider the HNF computation of a row distributed matrix. Obviously, the LHNF computation of a column distributed matrix works analogously. Considering the FOR loop of step 3 of the HNF Algorithm 3.1 over s, we observe that the column operations only depend on the s-th row. Thus it is a good idea to send the s-th row to all processes, so that each process can execute its column operations.

INPUT: Cardinality of processes q, number of its own process z,
A = [a'_1, ..., a'_{k_row(z)}]^T = [a_1, ..., a_n] ∈ R^{k_row(z),n} with Σ_{z=0}^{q−1} k_row(z) = m,
V = [v_1, ..., v_n] ∈ R^{k_col(z),n} with Σ_{z=0}^{q−1} k_col(z) = n
 1  FOR t = 1, ..., n   (Compute HNF of first t columns)
 2    r = 0
 3    FOR s = 1, ..., m
 4      y = ROW-TASK(s)
 5      h = ROW-NUM(s)
 6      IF y = z
 7        THEN BROADCAST vector a'_h
 8        ELSE BROADCAST-RECEIVE vector g FROM y
 9             Insert g as h-th row vector of A
10      IF A_{h,r+1} ≠ 0 ∨ A_{h,t} ≠ 0
11        THEN r = r + 1
12             i_r = s
13             IF t = r
14               THEN IF A_{h,t} ∉ 𝓡
15                 THEN a_t = β(A_{h,t}) · a_t
16                      v_t = β(A_{h,t}) · v_t
17               ELSE ROW-ONE-GCD(A, V, h, r, t)
18             FOR l = 1, ..., r − 1
19               a_l = a_l − ψ(A_{h,l}, A_{h,r}) · a_r
20               v_l = v_l − ψ(A_{h,l}, A_{h,r}) · v_r
21             IF t = r
22               THEN IF NOT y = z
23                 THEN Remove the h-th row vector of A
24                      GOTO 1 with next t
25      IF NOT y = z
26        THEN Remove the h-th row vector of A
OUTPUT: (A, V) = PAR-HNF(A, V) with rank r

Correctness: In this algorithm (and in all following algorithms) the same steps are executed as in the corresponding original algorithm, but on the processes on which the current elements lie. For fixed t and for s = 1, ..., m the s-th row is sent from its process to all other processes. There it is inserted at position h, below all rows which in the whole matrix lie above row s (see Fig. 2(a) and Fig. 2(c) for an example). On all processes the row h and the rows below row h are transformed according to the original algorithm. After that, on all processes except the sending process, the received row is removed again.
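The paper's implementation uses C++ with MPI. Purely as an illustration of the communication pattern of steps 4 to 9, here is a Python/mpi4py sketch of one broadcast step (all names are ours; A_local denotes the list of local rows of the calling process, and the column operations themselves are omitted).

    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    q, z = comm.Get_size(), comm.Get_rank()

    def broadcast_row_step(A_local, s):
        # One elimination step for the global row s: the owner broadcasts its local
        # copy of row s, every other process inserts the received row at the local
        # position h, all processes then apply the same column operations to their
        # local rows, and finally the non-owners remove the inserted row again.
        y = (s - 1) % q                            # ROW-TASK(s)
        h = max(0, (s - z - 2 + q) // q)           # ROW-NUM(s) - 1 (0-based)
        row = A_local[h] if y == z else None
        row = comm.bcast(row, root=y)              # BROADCAST / BROADCAST-RECEIVE
        if y != z:
            A_local.insert(h, row)
        # ... column operations on all rows of A_local (and on V) would go here ...
        if y != z:
            del A_local[h]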

Theorem 4.1. Let p = max{m, n}. Algorithm 4.2 needs O(p^2) BROADCAST operations, and O(p^3) ring elements are sent.

Proof. Each BROADCAST operation lies inside 2 nested FOR loops, and with each BROADCAST at most p ring elements are sent.

It is easy to see that a parallel Chou-Collins version of HNF Algorithm 4.2 would need O(p^3) BROADCAST operations, which is too much for an efficient parallelization of this algorithm. For this reason, we have only parallelized the original HNF Algorithm 3.1.

4.3. Algorithm PAR-DIAGTOSMITH

As we did not find an efficient parallelization of DIAGTOSMITH and as the original procedure is very fast in practice, we decided to parallelize it trivially. More precisely, each diagonal element is broadcast from the process it lies on to all other processes, so that the operations of the original algorithm can be performed successively. Although this parallelization does not save time, it is in general necessary, because for large matrices the memory of one process might not be large enough for the complete transformation matrices.

INPUT: Cardinality of processes q, number of its own process z,
A ∈ R^{k_row(z),n} with Σ_{z=0}^{q−1} k_row(z) = m, whole matrix Ā ∈ R^{m,n} in diagonal form,
U = [u_1, ..., u_m]^T ∈ R^{m,k_row(z)}, V = [v_1, ..., v_n] ∈ R^{k_col(z),n} with Σ_{z=0}^{q−1} k_col(z) = n
 1  FOR k = 1, ..., min{m, n} − 1
 2    FOR l = min{m, n} − 1, ..., k
 3      y_1 = ROW-TASK(l)
 4      y_2 = ROW-TASK(l + 1)
 5      IF y_1 = z
 6        THEN h_1 = ROW-NUM(l)
 7             g_1 = A_{h_1,l}
 8             BROADCAST number g_1
 9        ELSE BROADCAST-RECEIVE number g_1
10      IF y_2 = z
11        THEN h_2 = ROW-NUM(l + 1)
12             g_2 = A_{h_2,l+1}
13             BROADCAST number g_2
14        ELSE BROADCAST-RECEIVE number g_2
15      Compute d := gcd(g_1, g_2) and u, v with d = u·g_1 + v·g_2
16      IF y_1 = z
17        THEN IF g_1 ∤ g_2
18          THEN A_{h_1,l} = gcd(g_1, g_2)
19      IF y_2 = z
20        THEN IF g_1 ∤ g_2
21          THEN A_{h_2,l+1} = (g_1·g_2) / gcd(g_1, g_2)
22      IF g_1 ∤ g_2
23        THEN [u_l, u_{l+1}]^T = ( u, v ; −g_2/d, g_1/d ) · [u_l, u_{l+1}]^T
24             [v_l, v_{l+1}] = [v_l, v_{l+1}] · ( 1, −v·g_2/d ; 1, u·g_1/d )
25  FOR l = 1, ..., min{m, n}
26    y = ROW-TASK(l)
27    IF y = z
28      THEN h = ROW-NUM(l)
29           IF A_{h,l} ≠ 0
30             THEN A_{h,l} = β(A_{h,l}) · A_{h,l}
31                  v_l = β(A_{h,l}) · v_l
OUTPUT: (A, U, V) = PAR-DIAGTOSMITH(A, U, V)

Correctness: The elements Ā_{l,l} and Ā_{l+1,l+1} of the whole matrix lie on the processes y_1 and y_2, where they are the elements A_{h_1,l} and A_{h_2,l+1}, respectively. A_{h_1,l} is broadcast from y_1 and A_{h_2,l+1} from y_2, so that the elements Ā_{l,l} and Ā_{l+1,l+1} are known on all processes. The new A_{h_1,l} is computed on process y_1 and the new A_{h_2,l+1} on process y_2. The computation of the transformation matrices comes from (1).

In this algorithm O(p^2) BROADCAST operations are performed and O(p^2) ring elements are sent with BROADCAST. We use two versions of this algorithm, one for row distribution and one for column distribution.

4.4. Auxiliary algorithms PAR-ROWTOCOL and PAR-COLTOROW

For the SNF algorithms, we have to change between a row distributed matrix and a column distributed matrix and vice versa. We call these algorithms PAR-ROWTOCOL and PAR-COLTOROW, respectively. In fact, both procedures are parallel matrix transpositions. As in the following SNF algorithms matrix transpositions are only called a few times (for the Kannan-Bachem SNF Algorithm 4.5 on average not more than 4 to 5 times, for the Hartley-Hawkes SNF Algorithm 4.6 on average not much more than min{m, n} times), we use the following simple implementation (see Fig. 2(a) and Fig. 2(b)).

Let A ∈ R^{k_row(z),n} with Σ_{z=0}^{q−1} k_row(z) = m be row distributed. A is transformed into a matrix B ∈ R^{m,k_col(z)} with Σ_{z=0}^{q−1} k_col(z) = n. Every process communicates with every other process. For a process x the SEND operation has the form SEND {A_{s,t} | 1 ≤ s ≤ k_row(x), 1 ≤ t ≤ n} TO COL-TASK(t). For a process x the SEND-RECEIVE operation has the form SEND-RECEIVE {B_{s,t} | 1 ≤ s ≤ m, 1 ≤ t ≤ k_col(x)} FROM ROW-TASK(s). This algorithm does not need to be applied to the corresponding transformation matrices, as the left hand transformation matrix U is always transformed by row operations, i.e., it is column distributed, and the right hand transformation matrix V is always transformed by column operations, i.e., it is row distributed. We obtain the algorithm PAR-COLTOROW from the algorithm PAR-ROWTOCOL by exchanging the roles of rows and columns.
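Which entries have to be sent where follows directly from the two distributions. The following Python sketch (our own helper with our own names, without the actual MPI calls) computes the SEND sets of one process for PAR-ROWTOCOL.

    def rowtocol_send_sets(A_local, z, q, n):
        # A_local: the local rows of process z under the cyclic row distribution
        # (global row indices z+1, z+1+q, ...); n: number of columns of the whole matrix.
        # Returns a dict mapping each destination process x to the entries
        # (global_row, global_col, value) that z has to send to x, because
        # column t belongs to process COL-TASK(t) = (t - 1) mod q.
        send = {x: [] for x in range(q)}
        for i, row in enumerate(A_local):
            s = z + 1 + i * q                      # global index of the i-th local row
            for t in range(1, n + 1):
                send[(t - 1) % q].append((s, t, row[t - 1]))
        return send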

4.5. Parallel Kannan-Bachem SNF algorithm

INPUT: Cardinality of processes q, number of its own process z,
A ∈ R^{k_row(z),n} with Σ_{z=0}^{q−1} k_row(z) = m, whole matrix Ā ∈ R^{m,n},
U = [u_1, ..., u_m]^T ∈ R^{m,k_row(z)}, V = [v_1, ..., v_n] ∈ R^{k_col(z),n} with Σ_{z=0}^{q−1} k_col(z) = n
1  WHILE (Ā is not in diagonal form)
2    (A, V) = PAR-HNF(A, V)
3    B = PAR-ROWTOCOL(A)   (B ∈ R^{m,k_col(z)})
4    (B, U) = PAR-LHNF(B, U)
5    A = PAR-COLTOROW(B)
6  (A, U, V) = PAR-DIAGTOSMITH(A, U, V)
OUTPUT: (A, U, V) = PAR-KB-SNF(A, U, V)

Theorem 4.2. [20] Let Ā ∈ R^{m,n} with p = max{m, n}, and R = Z and R = F[x], respectively. In Algorithm 4.5, O(p^4 log_2(p‖Ā‖_∞)) and O(p^4 ⌈Ā⌉_deg) BROADCAST operations are performed, respectively, and O(p^5 log_2(p‖Ā‖_∞)) and O(p^5 ⌈Ā⌉_deg) ring elements are sent with BROADCAST, respectively. Further, O(q^2 p^2 log_2(p^2 ‖Ā‖_∞)) and O(q^2 p^2 ⌈Ā⌉_deg) SEND operations are performed, respectively, and O(p^4 log_2(p‖Ā‖_∞)) and O(p^4 ⌈Ā⌉_deg) ring elements are sent with SEND, respectively.

4.6. Parallel Hartley-Hawkes SNF algorithm

INPUT: Cardinality of processes q, number of its own process z,
A = [a_1, ..., a_{k_row(z)}]^T ∈ R^{k_row(z),n} with Σ_{z=0}^{q−1} k_row(z) = m,
U = [u_1, ..., u_m]^T ∈ R^{m,k_row(z)}, V = [v_1, ..., v_n] ∈ R^{k_col(z),n} with Σ_{z=0}^{q−1} k_col(z) = n
 1  l = 1
 2  WHILE l ≤ min{m, n}
 3    y = ROW-TASK(l)
 4    h = ROW-NUM(l)
 5    IF y = z
 6      THEN BROADCAST vector a_h
 7      ELSE BROADCAST-RECEIVE vector v FROM y
 8           Insert v as h-th row vector of A
 9    IF NOT (A_{h,l+1:n}) = 0
10      THEN A = ROWGCD(A, V, h, l)
11    IF NOT y = z
12      THEN Remove the h-th row vector of A
13    B = PAR-ROWTOCOL(A)   (B ∈ R^{m,k_col(z)})
14    y = COL-TASK(l)
15    h = COL-NUM(l)
16    IF y = z
17      THEN BROADCAST vector b_h (the h-th column of B)
18      ELSE BROADCAST-RECEIVE vector g FROM y
19           Insert g as h-th column vector of B
20    IF (B_{l+1:m,h}) = 0
21      THEN l = l + 1
22      ELSE B = COLGCD(B, U, l, h)
23    IF NOT y = z
24      THEN Remove the h-th column vector of B
25    A = PAR-COLTOROW(B)
26  (A, U, V) = PAR-DIAGTOSMITH(A, U, V)
OUTPUT: (A, U, V) = PAR-HH-SNF(A, U, V)

For the proof of the correctness we refer to [20].

Theorem 4.3. [20] Let Ā ∈ R^{m,n} with p = max{m, n}, and R = Z and R = F[x], respectively. In Algorithm 4.6, O(p^2 log_2(p‖Ā‖_∞)) and O(p^2 ⌈Ā⌉_deg) BROADCAST operations are performed, respectively, and O(p^3 log_2(p‖Ā‖_∞)) and O(p^3 ⌈Ā⌉_deg) ring elements are sent with BROADCAST, respectively. Further, O(q^2 p^2 log_2(p^2 ‖Ā‖_∞)) and O(q^2 p^2 ⌈Ā⌉_deg) SEND operations are performed, respectively, and O(p^4 log_2(p‖Ā‖_∞)) and O(p^4 ⌈Ā⌉_deg) ring elements are sent with SEND, respectively.

In comparison to the PAR-KB-SNF Algorithm, the complexity of the BROADCAST operations is improved by a factor of p^2, whereas the complexity of the SEND operations remains unchanged.

5. Experiments with the parallel versions of the normal form algorithms

The original algorithms of this paper were implemented in the language C++ with the compiler g++, version 3.4.4, and the parallel programs with mpicc, version 3.4.4. The sequential and parallel experiments were made on 32 nodes with 2 Intel Xeon 2.4 GHz processors each, with 1 GB of main memory for the first 32 processors and 0.5 GB for the last 32 processors. For the parallel experiments we use up to 64 processors, where 2 processors belong to a node. Every process of our parallel program runs on one of these processors. Additionally, we compare our results with the program package MAGMA (abbreviated by "MM"), V2.13-6, under Linux on a GenuineIntel Intel(R) Pentium(R) 4 CPU 3.00 GHz processor with main memory 1 GB, and with MAPLE (abbreviated by "MP") 6 under SunOS on 4 sparcv9 floating point processors with 1281 MHz and main memory 16 GB.

We do experiments with matrices over the rings Q[x], F3[x], and F5[x] (for the results for the rings Z and F2[x] we refer to [20]). Note that MAPLE only works for the ring Q[x]. The execution times are given in the form "hh:mm:ss" (hours:minutes:seconds). The tests with MAGMA and MAPLE were stopped after 24 hours. If a time is not listed, it means that the corresponding algorithm needed more than 24 hours.

As first test class we used the matrices B_n = (b_{s,t})_{1≤s,t≤n} − x · E_n, where the b_{s,t} are randomly chosen from [−99, 100] for Q[x], from {0, 1, 2} for F3[x], and from {0, 1, 2, 3, 4} for F5[x]. These are characteristic matrices with full rank. For the rings F3[x] and F5[x] we also used a second test class, namely the matrices C_n^q = (c_{s,t} = p_{s−1}^{t−1} mod q) for 1 ≤ s, t ≤ n, where (p_s(x))_{s≥0} = (0, 1, 2, x, x+1, x+2, x^2, x^2+1, ...) for the ring F3[x] and (p_s(x))_{s≥0} = (0, 1, 2, 3, 4, x, x+1, x+2, x+3, x+4, x^2, x^2+1, ...) for the ring F5[x], and where q is one of the irreducible polynomials q1(x) = x^5 + x^4 + x + 2, q2(x) = 2x^4 + 3x + 3.
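As an illustration of the first test class, characteristic matrices of the type B_n over F3[x] can be generated as follows (a Python sketch with polynomials stored as coefficient lists; this is our own illustration with our own names, not the generator used for the experiments).

    import random

    def random_char_matrix(n, p=3, seed=0):
        # B_n = (b_{s,t}) - x * E_n over F_p[x]: random constants b_{s,t} in {0, ..., p-1},
        # with b_{s,s} - x on the diagonal; each entry is a coefficient list [c0, c1, ...]
        # (so [b, p - 1] represents b - x modulo p).
        rng = random.Random(seed)
        B = []
        for s in range(n):
            row = []
            for t in range(n):
                b = rng.randrange(p)
                row.append([b, p - 1] if s == t else [b])
            B.append(row)
        return B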

5.1. Efficiency

An important criterion for the quality of a parallel algorithm is the efficiency E, depending on the cardinality of processes q, which we define as E(q) = Ts / (q · Tq), where Tq is the execution time of the parallel algorithm with q processes and Ts the execution time of the corresponding sequential (original) algorithm. The efficiency is the percentage of the total parallel execution time devoted to computation in comparison to that devoted to communication. In general, if the efficiency is large, the quality of the parallel program is quite good, but it is also rather important that the efficiency remains constant (or nearly constant) for a larger cardinality of processes. The efficiency is at most 1.

We computed the normal forms with corresponding transformation matrices of a matrix of medium size on 1, 2, 4, 8, 16, 32 and 64 processes with the PAR-HNF Algorithm, the PAR-KB-SNF Algorithm and the PAR-HH-SNF Algorithm. We also used the corresponding original algorithms. For every parallel normal form computation and for each cardinality of processes 1, 2, 4, 8, 16, 32 and 64, we give the efficiency in % and also show the behaviour graphically. The results of the PAR-HNF Algorithm, the PAR-KB-SNF Algorithm, the PAR-HH-SNF Algorithm and of MAGMA and MAPLE for the full-rank matrix B16 of the ring Q[x], the full-rank matrix C240^q1 of the ring F3[x], and the full-rank matrix C200^q2 of the ring F5[x] can be found in Table/Fig. 3. Note that MAGMA and MAPLE supply only one SNF version (we do not know whether KB or HH or another version is implemented).

Results for the ring Q[x]: Q[x] is a very difficult ring, as the entries can explode in two ways: in the degrees of the polynomials and in the numerators/denominators of the coefficients. Therefore it is only possible to test such a small matrix as B16. Obviously it makes no sense to use more processes than the row or column number of the matrix (in this case 16). The PAR-KB-SNF Algorithm and the PAR-HNF Algorithm are a little bit faster than the PAR-HH-SNF Algorithm, but the efficiency is nearly the same. The efficiency is smaller for a large cardinality of processes. MAGMA and MAPLE are faster for the HNF, but not competitive for the SNF.

Results for the ring F3[x]: The PAR-HH-SNF Algorithm is faster than the PAR-HNF Algorithm and the PAR-KB-SNF Algorithm. The efficiency of the PAR-HNF Algorithm is the best one, and the efficiency of the PAR-KB-SNF Algorithm is larger than that of the PAR-HH-SNF Algorithm for a small cardinality of processes and smaller for a large cardinality of processes. MAGMA is faster than 4 processes of the PAR-HNF Algorithm and not competitive for the SNF.

Results for the ring F5[x]: The results are similar to those for the ring F3[x]. The PAR-HH-SNF Algorithm is much faster than the PAR-HNF Algorithm and the PAR-KB-SNF Algorithm, but considering the efficiency the PAR-HNF Algorithm is the best one, and the PAR-KB-SNF Algorithm is better than the PAR-HH-SNF Algorithm. Again, MAGMA is faster than 4 processes of the PAR-HNF Algorithm and not competitive for the SNF.

5.2. Large example matrices

For all rings we want to find out the maximum number of rows and columns of an input matrix for which we are able to compute the HNF/SNF. The results of the PAR-HNF Algorithm, the algorithms PAR-KB-SNF and PAR-HH-SNF, and of MAGMA and MAPLE for the largest possible example matrices can be found in Table 1, where one process per row/column is used for Q[x] and 64 processes are used for F3[x] and F5[x]. Note that MAGMA and MAPLE supply only one SNF version.

Results for the ring Q[x]: We succeeded in computing the normal forms of B26, B28, B30, and B32, where the maximum memory (80 MB) was used by the PAR-HH-SNF Algorithm for B32. As for B16 from section 5.1, MAGMA and MAPLE are faster for the HNF, but not competitive for the SNF.

Results for the ring F3[x]: We succeeded in computing the normal forms of the full-rank matrices B700, B800 and the matrices C340^q1, C360^q1 of rank 243, where the maximum memory (397 MB) was used by the PAR-HH-SNF Algorithm for C360^q1, i.e., the available memory is nearly exhausted, and thus matrices with much larger row/column numbers cannot be computed.

Q[x]: B16        PAR-HNF             PAR-KB-SNF          PAR-HH-SNF
Processes        Time       Eff.     Time       Eff.     Time       Eff.
1 (orig.)        00:02:31   –        00:02:31   –        00:03:10   –
1 (par.)         00:02:33   99 %     00:02:34   98 %     00:03:14   98 %
2                00:01:24   90 %     00:01:24   90 %     00:01:46   90 %
4                00:00:49   77 %     00:00:49   77 %     00:01:02   77 %
8                00:00:32   59 %     00:00:33   57 %     00:00:41   58 %
16               00:00:24   39 %     00:00:24   39 %     00:00:30   40 %
MAGMA            00:00:06   –        –          –        –          –
MAPLE            00:00:21   –        01:25:54   –        01:25:54   –

F3[x]: C240^q1   PAR-HNF             PAR-KB-SNF          PAR-HH-SNF
Processes        Time       Eff.     Time       Eff.     Time       Eff.
1 (orig.)        03:03:49   –        03:03:56   –        02:06:21   –
1 (par.)         03:07:49   98 %     03:08:54   97 %     02:11:26   96 %
2                01:36:28   95 %     01:37:44   94 %     01:07:55   93 %
4                00:48:44   94 %     00:50:00   92 %     00:34:46   91 %
8                00:25:33   90 %     00:26:50   86 %     00:18:16   86 %
16               00:13:44   84 %     00:15:03   76 %     00:09:54   80 %
32               00:07:56   72 %     00:09:20   62 %     00:05:52   67 %
64               00:04:57   58 %     00:06:32   44 %     00:04:01   49 %
MAGMA            00:32:45   –        –          –        –          –

F5[x]: C200^q2   PAR-HNF             PAR-KB-SNF          PAR-HH-SNF
Processes        Time       Eff.     Time       Eff.     Time       Eff.
1 (orig.)        02:57:48   –        02:58:13   –        00:31:48   –
1 (par.)         03:06:26   95 %     03:07:20   95 %     00:35:09   90 %
2                01:36:32   92 %     01:37:39   91 %     00:18:30   86 %
4                00:50:55   87 %     00:51:55   86 %     00:09:32   83 %
8                00:26:32   84 %     00:27:31   81 %     00:05:03   79 %
16               00:14:20   78 %     00:15:19   73 %     00:02:48   71 %
32               00:08:26   66 %     00:09:27   59 %     00:01:46   56 %
64               00:05:26   51 %     00:06:39   42 %     00:01:15   40 %
MAGMA            00:32:56   –        08:26:18   –        08:26:18   –
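As a consistency check of the efficiency definition E(q) = Ts / (q · Tq) from section 5.1 against the tables above: for the PAR-HNF Algorithm on C240^q1 with 64 processes we have Ts = 03:03:49 = 11029 s and T64 = 00:04:57 = 297 s, so E(64) = 11029 / (64 · 297) ≈ 0.58, i.e., the listed 58 %.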

Figure 3: Execution time and efficiency for a special instance for the rings Q[x], F3[x], F5[x]: the three panels plot the efficiency in % against the cardinality of processes for (a) Q[x]: B16, (b) F3[x]: C240^q1, (c) F5[x]: C200^q2

Q[x]       PAR-HNF    MM-HNF     MP-HNF     PAR-KB-SNF   PAR-HH-SNF
B26        00:23:58   00:03:20   00:05:44   00:23:59     00:46:30
B28        00:45:26   00:05:23   00:09:15   00:45:28     01:34:28
B30        01:22:59   00:08:46   00:15:35   01:23:05     03:03:48
B32        02:28:14   00:14:08   00:24:41   02:28:17     05:51:45

F3[x]      PAR-HNF    MM-HNF     PAR-KB-SNF   PAR-HH-SNF
B700       03:21:21   –          03:39:27     00:23:43
B800       06:30:10   –          06:47:33     00:43:25
C340^q1    00:06:45   00:33:45   00:12:09     00:08:55
C360^q1    00:06:56   00:37:53   00:13:37     00:10:18

F5[x]      PAR-HNF    MM-HNF     PAR-KB-SNF   PAR-HH-SNF
B650       03:15:37   –          03:24:04     00:16:17
B700       04:29:01   –          04:40:12     00:21:54
C320^q2    00:41:13   05:03:19   00:45:52     00:08:38
C340^q2    00:57:01   05:55:02   01:02:50     00:11:15

Table 1: Execution time for large example matrices of the rings Q[x], F3[x], F5[x]

Results for the ring F5[x]: We succeeded in computing the normal forms of the full-rank matrices B650, B700 and the full-rank matrices C320^q2, C340^q2, i.e., in comparison to the ring F3[x], matrices with a little bit smaller dimensions can be computed. Here the maximum memory (240 MB) was used by the PAR-KB-SNF Algorithm for C240^q1. In most cases, matrices with larger row/column numbers could not be computed by at least one algorithm. For example, the SNF of the matrix B800 could not be computed by the PAR-HH-SNF Algorithm.

5.3. Data distribution

Another important criterion for the quality of a parallel program is the data distribution, i.e., whether the data are nearly equally distributed to all processes. Especially in our case, where the data might be too large for one process, this criterion is essential for the algorithms. One hint that the data distribution does not lead to load imbalances is the quite good efficiency shown in the experiments of section 5.1, as with a bad data distribution one process would probably receive more work than another one, leading to a bad efficiency.

To show the data distribution, we consider the combination of algorithm and matrix from section 5.1 with the largest memory requirement for one process, which is the PAR-HH-SNF Algorithm for the ring F3[x], applied to the matrix C240^q1. In Fig. 4(a) we graphically show the maximum used memory in MB for 1, 2, 4, 8, 16, 32, and 64 processes. We observe that the used memory is significantly reduced by each step from one cardinality of processes to the double of this cardinality. Overall the used memory is reduced from 919 MB for 1 process to 118 MB for 64 processes, i.e., we obtain nearly a reduction factor of 8.

In section 4.1 we have suggested a rather natural data distribution. As shown in section 5.1, this distribution leads to a good efficiency, and as shown by the previous example, the memory reduction for increasing cardinality of

17 1000 130 HH-SNF Distribution 1 900 Distribution 2 125 Distribution 3 800 120 700

600 115

500 110

400 105 300 Used memory in MB 100 Maximum used memory in MB 200

100 95 1 2 4 8 16 32 64 1 2 4 8 16 32 64 Cardinality of processes Process number (a) Maximum used memory of the PAR-HH- (b) Used memory of each process of the PAR- SNF Algorithm for the ring F3[x], applied to HNF Algorithm with 64 processes for the ring q1 the matrix C240 Z, applied to the matrix A991

Figure 4: Memory requirements

processes is also good. But it is still not clear whether our data distribution can essentially be improved or not. As already mentioned in section 4.1, the data distribution suggested in [29] is a possible alternative choice. To analyse this effect, we additionally apply three versions of the PAR-HNF Algorithm for the ring Z to the full-rank matrix A991. The three versions only differ in the row distribution used. The first distribution is our original distribution, the second one is that of [29]. As the third one we test the following even more natural distribution: if the cardinality of processes q is a divisor of m, the process with the number z receives the rows z·(m/q)+1, z·(m/q)+2, ..., z·(m/q)+m/q for z = 0, ..., q−1. If q is not a divisor of m, with m = q·s+t and 0 < t < q, the first t processes receive an additional row (which also holds for our original distribution).

Fig. 4(b) shows the used memory for each of the three distributions and for each of the 64 processes. For distribution 1 we observe a nearly constant memory with only one strong decrease at process 30. This decrease comes from the equality 31 = 991 mod 64, as up to process 30 the processes receive an additional row. Distribution 2 also has a nearly constant memory, but the memory values are permuted in comparison to distribution 1. Distribution 3 is the worst one, as it has small memory for the first processes and very large memory for the last processes. The reason for this bad behaviour is that the coefficient explosion during the PAR-HNF Algorithm is stronger for rows with larger numbers. For example, the last process receives all rows with the largest numbers and with the strongest coefficient explosion. Then this process has to do the most work and needs the largest memory.

The effects of the three distributions can also be seen in the execution times measured on the Intel Xeon 2.4 GHz machines. The PAR-HNF Algorithm based on distribution 1 leads to an execution time of 01:56:17 and distribution 2 to 01:55:37, where (as shown by further experiments) the difference comes only from execution time inaccuracies. In comparison, distribution 3 leads to a much worse execution time of 05:14:48, i.e., this distribution would lead to a much worse efficiency.
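For illustration, the owner functions of distribution 1 (ours) and distribution 3 (contiguous blocks) can be written down directly; the following Python sketch uses our own names, and distribution 2 from [29] is omitted.

    def owner_cyclic(l, m, q):
        # Distribution 1: row l (1-based) goes to process (l - 1) mod q.
        return (l - 1) % q

    def owner_contiguous(l, m, q):
        # Distribution 3: consecutive blocks of rows; if q does not divide m
        # (m = q*s + t, 0 < t < q), the first t processes get one extra row.
        s, t = divmod(m, q)
        cut = t * (s + 1)                      # rows 1..cut live on processes 0..t-1
        return (l - 1) // (s + 1) if l <= cut else t + (l - cut - 1) // s

    m, q = 991, 64                             # as in the experiment: 991 = 64*15 + 31
    print(owner_contiguous(991, m, q))         # 63: the last rows all go to the last process
    print(sorted({owner_cyclic(l, m, q) for l in range(976, 992)}))
    # [15, 16, ..., 30]: the last 16 rows are spread over 16 different processes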

Acknowledgement

We would like to thank Volker Gebhardt for some helpful discussions, the University of Halle-Wittenberg for the platform for the parallel experiments and the anonymous referees for their comments, which helped us to improve the paper. The research of both authors was supported by DFG (Germany).

References

[1] W.A. Blankinship, Algorithm 287: Matrix Triangulation with Integer Arithmetic [F1], Comm. ACM 9(7) (1966) 513.

[2] G.H. Bradley, Algorithms for Hermite and Smith Normal Matrices and Linear Diophantine Equations, Math. Comp. 25(116) (1971) 897-907.

[3] R.P. Brent, Parallel Algorithms in Linear Algebra, Algorithms and Architectures: Proc. Second NEC Research Symposium, 1993, pp. 54-72.

[4] S. Chatterjee, S. Sen, Cache-Efficient Matrix Transposition, in: Proc. 6th International Sym- posium on High-Performance Computer Architecture (IEEE Computer Society), 2000, pp. 195-205.

[5] J. Choi, J. Dongarra, D.W. Walker, Parallel Matrix Transpose Algorithms on Distributed Memory Concurrent Computers, Parallel Comput. 21(9) (1995) 1387-1405.

[6] T.W.J. Chou, G.E. Collins, Algorithms for the Solution of Systems of Linear Diophantine Equations, SIAM J. Comput. 11(4) (1982) 687-708.

[7] J. Dongarra et al., Sourcebook of Parallel Computing, Morgan Kaufmann Publishers, Inc., San Francisco, 2003

[8] X.G. Fang, G. Havas, On the Worst-Case Complexity of Integer Gaussian Elimination, in: Proc. International Symposium on Symbolic and Algebraic Computation, ACM Press, 1997, pp. 28-31.

[9] A. Fujii, R. Suda, A. Nishida, Parallel Matrix Distribution Library for Sparse Matrix Solvers, in: Proc. 8th International Conference on High-Performance Computing in Asia-Pacific Region, IEEE Computer Society, 2005, pp. 213-219.

[10] M. Giesbrecht, Fast Computation of the Smith Normal Form of an Integer Matrix, in: Proc. International Symposium on Symbolic and Algebraic Computation, ACM Press, 1995, pp. 110-118.

[11] J.L. Hafner, K.S. McCurley, Asymptotically Fast Triangularization of Matrices over Rings, SIAM J. Comput. 20(6) (1991) 1068-1083.

[12] B. Hartley, T.O. Hawkes, Rings, Modules and Linear Algebra, Chapman and Hall, London, 1970.

[13] G. Havas, D.F. Holt, S. Rees, Recognizing Badly Presented Z-Modules, Linear Algebra Appl. 192 (1993) 137-163.

[14] G. Havas, B.S. Majewski, Hermite Normal Form Computation for Integer Matrices, Congr. Numer. 105 (1994) 87-96.

[15] G. Havas, B.S. Majewski, Integer Matrix Diagonalization, J. Symbolic Comput. 24(3/4) (1997) 399-408.

[16] G. Havas, L.S. Sterling, Integer Matrices and Abelian Groups, in: Proc. International Sympo- sium on Symbolic and Algebraic Manipulation, Lecture Notes in Comput. Sci., 72, Springer, New York, 1979, pp. 431-451.

[17] G. Havas, C. Wagner, Matrix Reduction Algorithms for Euclidean Rings, in: Proc. Asian Symposium on Computer Mathematics, Lanzhou University Press, 1998, pp. 65-70.

[18] D. Heller, A Survey of Parallel Algorithms in Numerical Linear Algebra, SIAM Review 20(4) (1978) 740-777.

[19] C. Hermite, Sur l'introduction des variables continues dans la théorie des nombres, J. Reine Angew. Math. 41 (1851) 191-216.

[20] G. Jäger, Parallel Algorithms for Computing the Smith Normal Form of Large Matrices, in: Proc. 10th European PVM/MPI, Lecture Notes in Comput. Sci. 2840, Springer, Berlin-Heidelberg, 2003, pp. 170-179.

[21] E. Kaltofen, M.S. Krishnamoorthy, B.D. Saunders, Fast Parallel Computation of Hermite and Smith Forms of Polynomial Matrices, SIAM J. Algebraic and Discrete Methods 8(4) (1987) 683-690.

[22] E. Kaltofen, M.S. Krishnamoorthy, B.D. Saunders, Parallel Algorithms for Matrix Normal Forms, Linear Algebra Appl. 136 (1990) 189-208.

[23] M. Kaminski, A. Paz, Computing the Hermite Normal Form of an Integral Matrix, Tech. Rep. Department of Computer Science, Technion-Israel Institute of Technology, Haifa, Israel, June 1986.

[24] R. Kannan, Polynomial-Time Algorithms for Solving Systems of Linear Equations over Poly- nomials, Theoret. Comput. Sci. 39 (1985) 69-88.

[25] R. Kannan, A. Bachem, Polynomial Algorithms for Computing the Smith and Hermite Normal Forms of an Integer Matrix, SIAM J. Comput. 8(4) (1979) 499-507.

[26] H.-J. Lee, J.A.B. Fortes, Toward Data Distribution Independent Parallel Matrix Multiplication, in: Proc. 9th International Parallel Processing Symposium, IEEE Computer Society, 1995, pp. 436-440.

[27] F.T. Leighton, Introduction to Parallel Algorithms and Architectures, Morgan Kaufmann Publishers, Inc., San Francisco, 1992.

[28] G.O. Michler, R. Staszewski, Diagonalizing Characteristic Matrices on Parallel Machines, Preprint 27, Institut für Experimentelle Mathematik, Universität/GH Essen, 1995.

[29] M. Quinn, Parallel Programming in C with MPI and OpenMP, McGraw Hill, Columbus, 2004.

[30] C.C. Sims, Computation with Finitely Presented Groups, Cambridge University Press, 1994.

[31] H.J.S. Smith, On Systems of Linear Indeterminate Equations and Congruences, Philos. Trans. R. Soc. Lond. 151 (1861) 293-326.

[32] A. Storjohann, Near Optimal Algorithms for Computing Smith Normal Forms of Integer Matrices, in: Proc. International Symposium on Symbolic and Algebraic Computation, ACM Press, 1996, pp. 267-274.

[33] A. Storjohann, Computing Hermite and Smith Normal Forms of Triangular Integer Matrices, Linear Algebra Appl. 282(1-3) (1998) 25-45.

[34] A. Storjohann, G. Labahn, A Fast Las Vegas Algorithm for Computing the Smith Normal Form of a Polynomial Matrix, Linear Algebra Appl. 253(1) (1997) 155-173.

[35] J. Suh, V.K. Prasanna, An Efficient Algorithm for Out-of-Core Matrix Transposition, IEEE Trans. Computers 51(4) (2002) 420-438.

[36] H.A. van der Vorst, P. van Dooren (Eds.), Parallel Algorithms for Numerical Linear Algebra, Advances in Parallel Computing 1, North-Holland, 1990.

[37] G. Villard, Fast Parallel Computation of the Smith Normal Form of Polynomial Matrices, in: Proc. International Symposium on Symbolic and Algebraic Computation, ACM Press, 1994, pp. 312-317.
