FFPACK: Finite Field Linear Algebra Package
Jean-Guillaume Dumas, Pascal Giorgi, Clément Pernet

Jean-Guillaume Dumas: Université Joseph Fourier, Laboratoire de Modélisation et Calcul, 50 av. des Mathématiques, B.P. 53 X, 38041 Grenoble, France. [email protected]
Pascal Giorgi: Laboratoire de l'Informatique du Parallélisme, LIP–ENS Lyon, 46 allée d'Italie, F-69364 Lyon Cédex 07, France. [email protected]
Clément Pernet: Université Joseph Fourier, Laboratoire de Modélisation et Calcul, 50 av. des Mathématiques, B.P. 53 X, 38041 Grenoble, France. [email protected]
ABSTRACT
The FFLAS project has established that exact matrix multiplication over finite fields can be performed at the speed of the highly optimized numerical BLAS routines. Since many algorithms have been reduced to matrix multiplication in order to prove an optimal theoretical complexity, this paper shows that those optimal complexity algorithms, such as LSP factorization, rank, determinant and inverse computation, can also be the most efficient.

Categories and Subject Descriptors
G.4 [Mathematical Software]: Algorithm design and analysis; F.2.1 [Analysis of Algorithms and Problem Complexity]: Numerical Algorithms and Problems—computations in finite fields.

General Terms
Algorithms, Experimentation, Performance.

Keywords
Word-size finite fields; BLAS level 1-2-3; Linear Algebra Package; Matrix Multiplication; LSP Factorization

1. INTRODUCTION
Exact matrix multiplication over finite fields can now be performed at the speed of the highly optimized numerical BLAS routines. This has been established by the FFLAS project [8]. Moreover, since finite field computations do not suffer from numerical instability, this project also demonstrated the easy effectiveness of algorithms with even better arithmetic complexity, such as Winograd's variant of Strassen's fast matrix multiplication [18].

Now for the applications. Many algorithms have been designed to use matrix multiplication in order to prove an optimal theoretical complexity. In practice those algorithms were only seldom used. This is the case, e.g., in many linear algebra problems such as determinant, rank, inverse, system solution, or minimal and characteristic polynomial. Over finite fields or over the integers, those finite field linear algebra routines are used to solve many different problems. Among them are integer polynomial factorization, Gröbner basis computation, integer system solving, large integer factorization, discrete logarithms, error correcting codes, etc. Even sparse or polynomial linear algebra needs some very efficient dense subroutines [12, 10]. We believe that with our kernel, each one of those optimal complexity algorithms can also be the most efficient. The goal of this paper is to show the actual effectiveness of this belief for the factorization of matrices of any shape and any rank. The application of this factorization to determinant, rank and inverse computation is presented as well.

Some of the ideas of FFLAS, in particular the fast matrix multiplication algorithm for small prime fields, have been incorporated into the Maple computer algebra system since its version 8. Accordingly, an effort towards effective reduction to matrix multiplication has been made within Maple by A. Storjohann [4]. Effective reductions for the minimal and characteristic polynomial were sketched in [20], and A. Steel has reported on similar efforts within his implementation of some Magma routines. We provide a full C++ package, available directly¹ or through LinBox² [7]. Extending the work undertaken by the authors et al. [18, 8, 3, 11], this paper focuses on matrix factorization, namely the exact equivalent of the LU factorization. Indeed, unlike numerical matrices, exact matrices are very often singular, even more so when the matrix is not square. Consequently, Ibarra, Moran and Hui have developed generalizations of the LU factorization, namely the LSP and LQUP factorizations [16]. In section 4 we deal with the implementation of those two routines, as well as memory-optimized (an in-place and a cyclic block) versions, using the fast matrix multiplication kernel. Those implementations require the resolution of triangular systems with a matrix right- or left-hand side. In section 3 the triangular system solver (the Trsm routine, using BLAS terminology) is therefore studied. Then, in section 5, we propose different uses of the factorization routine to solve other classical linear algebra problems. In particular, speed ratios are presented and reflect the optimal behavior of our routines.

¹ www-lmc.imag.fr/lmc-mosaic/Jean-Guillaume.Dumas/FFLAS
² www.linalg.org

ISSAC'04, July 4–7, 2004, Santander, Spain. Copyright 2004 ACM 1-58113-827-X/04/0007.

2. BASE FIELDS
The algorithms we present in this paper are written generically with regard to the field over which they operate, as long as it provides conversion functions from a field element to a floating point representation and back. As demonstrated in [8] this is easily done for prime fields, and also for other finite fields via a q-adic transformation. The chosen interface is that of the LinBox fields [22, §5.3]. In the following, the prime field with characteristic p will be denoted by Zp.

For our experiments we use some classical representations, e.g. modular prime fields, primitive root Galois fields, Montgomery reduction, etc., implemented in different libraries, as in [8, 6]. Still and all, when no special implementation is required and when the prime field is small enough, one could rather use what we call a Modular<double> representation.

3.1 Scheme of the block recursive algorithm
The classical idea is to use the divide and conquer approach. Here, we consider the upper left triangular case without loss of generality, since any combination of the upper/lower and left/right triangular cases is similar: if U is upper triangular, L is lower triangular and B is rectangular, we call ULeft-Trsm the resolution of UX = B, LLeft-Trsm that of LX = B, URight-Trsm that of XU = B and LRight-Trsm that of XL = B.

Algorithm ULeft-Trsm(A, B)
Input: A in Zp^(m×m), B in Zp^(m×n).
Output: X in Zp^(m×n) such that AX = B.
Scheme:
  if m = 1 then
    X := A_{1,1}^(-1) × B
  else (splitting the matrices into floor(m/2) and ceil(m/2) blocks)

      [ A1  A2 ] [ X1 ]   [ B1 ]
      [  0  A3 ] [ X2 ] = [ B2 ]

    X2 := ULeft-Trsm(A3, B2)
    B1 := B1 − A2 X2
    X1 := ULeft-Trsm(A1, B1)
Indeed, as soon as no overflow occurs, one can replace the recursive call to ULeft-Trsm by the numerical BLAS dtrsm routine. But one can remark that approximate divisions can then occur. So we need to ensure both that only exact divisions are performed and that no overflow appears. Not only does one have to be careful for the result to remain within acceptable bounds but, unlike matrix multiplication where the data grows linearly, the data involved in a linear system grows exponentially, as shown in the following. The next two subsections first show how to deal with divisions, and then give an optimal theoretical bound on the coefficient growth, and therefore an optimal threshold for the switch to the numerical call.

3.2.1 Dealing with divisions
In algorithms like ULeft-Trsm, all divisions appear only within the last recursion level. In the general case it cannot be predicted whether these divisions will be exact or not. However, when the system is unitary (only 1's on the main diagonal) the divisions are of course exact and will even never be performed. Our idea is then to transform the initial system so that all the recursive calls to ULeft-Trsm are unitary. For a triangular system AX = B, it suffices to first factor the matrix A into A = UD, where U and D are respectively an upper unit triangular matrix and a diagonal matrix. Next the unitary system UY = B is solved by any ULeft-Trsm (even a numerical one), without any division. The initial solution is then recovered over the finite field via X = D^(-1) Y. This normalization leads to an additional cost of:
• m inversions over Zp for the computation of D^(-1);
• (m−1) m/2 + mn multiplications over Zp for the normalizations of U and X.

Nonetheless, in our case, we need to avoid divisions only during the numerical phase. Therefore, the normalization can take place only just before the numerical routine calls. Let β be the size of the system when we switch to a numerical computation. To compute the cost, we assume that m = 2^i β, where i is the number of recursive levels of the algorithm ULeft-Trsm. The implementation can however handle any matrix size. Now, there are 2^i normalizations with systems of size β. This leads to an additional cost of:
• m inversions over Zp;
• (β−1) m/2 + mn multiplications over Zp.

This allows us to save (1/2 − 1/2^(i+1)) m² multiplications over Zp compared to a whole normalization of the initial system. One iteration suffices to save m²/4 multiplications, and we can save up to (m² − m)/2 multiplications with log m iterations.

3.2.2 A theoretical threshold
The use of the BLAS routine trsm is the resolution of the triangular system over the integers (stored as double for dtrsm, or float for strsm). The restriction is the coefficient growth in the solution. Indeed, the k-th value in the solution vector is a linear combination of the (n − k) already computed next values. This implies a linear growth in the coefficient size of the solution, with respect to the system dimension. Now this resolution can only be performed if every element of the solution can be stored in the mantissa of the floating point representation (e.g. 53 bits for double). Therefore overflow control consists in finding the largest block dimension b such that the result of the call to dtrsm will remain exact.

We now propose a bound for the values of the solutions of such a system; this bound is optimal (in the sense that there exists a worst case matching the bound when n = 2^i b). This enables the implementation of a cascading algorithm, starting recursively and taking advantage of the BLAS performances as soon as possible.

Theorem 3.2. Let T ∈ Z^(n×n) be a unit diagonal upper triangular matrix and b ∈ Z^n, with |T| ≤ p−1 and |b| ≤ p−1. Let X = (x_i)_{i∈[1..n]} ∈ Z^n be the solution of T.X = b over the integers. Then, for all k ∈ [0..n−1]:

  (p−2)^k − p^k ≤ 2 x_{n−k} / (p−1) ≤ p^k + (p−2)^k    if k is even,
  −p^k − (p−2)^k ≤ 2 x_{n−k} / (p−1) ≤ p^k − (p−2)^k   if k is odd.

The proof is presented in appendix A. The idea is to use an induction on k with the relation x_k = b_k − Σ_{i=k+1..n} T_{k,i} x_i. A lower and an upper bound for x_{n−k} are computed, depending on whether k is even or odd.

Corollary 3.3. |X| ≤ (p−1)/2 [p^(n−1) + (p−2)^(n−1)]. Moreover, this bound is optimal.

Proof. We denote by u_n = (p−1)/2 [p^n − (p−2)^n] and v_n = (p−1)/2 [p^n + (p−2)^n] the bounds of theorem 3.2. Now, for all k ∈ [0..n−1], u_k ≤ v_k ≤ v_{n−1}. Therefore theorem 3.2 gives, for all k ∈ [1..n], |x_k| ≤ v_{n−1} ≤ (p−1)/2 [p^(n−1) + (p−2)^(n−1)]. For the optimality, let T be the unit upper triangular matrix whose strictly upper entries alternate between p−1 and 0 along each row, beginning with p−1 on the superdiagonal, and let b have alternating entries p−1 and 0. Then the solution X = (x_i)_{i∈[1..n]} ∈ Z^n of the system T.X = b satisfies |x_{n−k}| = v_k for all k ∈ [0..n−1].

Thus, for a given p, the dimension n of the system must satisfy

  (p−1)/2 [p^(n−1) + (p−2)^(n−1)] < 2^m    (1)

where m is the size of the mantissa, so that the resolution over the integers using the BLAS trsm routine is exact. For instance, with a 53-bit mantissa, this gives quite small matrices, namely at most 55 × 55 for p = 2, at most 4 × 4 for p ≤ 9739, and at most 2 × 2 matrices for p up to 94906249. Nevertheless, this technique is speed-worthy in most cases, as shown in section 3.4.

3.3 Recursive with delayed modulus
In the previous section we noticed that the BLAS routines within Trsm are used only for small systems. An alternative is to change the cascade: instead of calling the BLAS, one could switch to the classical iterative algorithm. Let A ∈ Zp^(m×m) and B, X ∈ Zp^(m×n) such that AX = B; then

  for all i,  X_{i,*} = (1 / A_{i,i}) ( B_{i,*} − A_{i,[i+1..m]} X_{[i+1..m],*} )    (2)

The idea is that the iterative algorithm computes only one row of the whole solution at a time. Therefore its threshold t is greater than the one of the BLAS routine, namely it requires only

  t (p−1)² < 2^m    (3)

Consequently, an implementation of this iterative algorithm depends mainly on the matrix-vector product. The arithmetical cost of such an algorithm is now cubic in the size of the system, whereas blocking improved the theoretical complexity. Anyway, in practice, fast matrix multiplication algorithms are not better than the classical one for such small matrices [8, §3.3.2]. In section 3.4 we compare both hybrid implementations, with different thresholds, to the pure recursive one.

Now we focus on the dot product operation, the base of the matrix-vector product. We use the results of [6], extending those of [8, §3.1], where several implementations of a dot product are proposed and compared on different architectures. According to [6], where many different implementations are compared (Zech logarithms, Montgomery, float, ...), the best implementation is a combination of a conversion to a floating point representation with delayed modulus (for big primes and vector sizes) and an overflow detection trick (for smaller primes and vector sizes).

[Figure 1: Speed improvement of dot product by delayed division, on a P3, 993 MHz. The plot compares the classical double representation with the overflow trick, in Mfops, for a dot product of a vector with 512 elements, as a function of the prime.]

³ www-id.imag.fr/Logiciels/givaro
⁴ www.netlib.org/blas
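Bound (1) can be made concrete with a few lines of code. The following snippet (our own illustration, not an FFPACK routine) computes, for a given p, the largest system dimension n such that the worst-case solution magnitude v_{n−1} = (p−1)/2 (p^(n−1) + (p−2)^(n−1)) still fits among the integers exactly representable in the 53-bit mantissa of a double (treating magnitudes up to 2^53 as exact), reproducing the sizes quoted above.

```cpp
#include <cassert>
#include <cstdint>

// Largest n such that the worst-case solution magnitude of corollary 3.3,
// v_{n-1} = (p-1)/2 * (p^(n-1) + (p-2)^(n-1)),
// does not exceed 2^53, i.e. every entry of the integer solution of the
// triangular system is exactly representable in a double mantissa.
static int max_exact_trsm_dim(uint64_t p) {
    const unsigned __int128 limit = (unsigned __int128)1 << 53;
    unsigned __int128 pk = 1, qk = 1;    // p^k and (p-2)^k for the current k
    int n = 0;
    for (int k = 0;; ++k) {
        unsigned __int128 v = (p - 1) * (pk + qk) / 2;  // v_k
        if (v > limit) break;
        n = k + 1;                 // an n x n system with n = k+1 stays exact
        if (pk > limit) break;     // growth guard against 128-bit overflow
        pk *= p;
        qk *= (p - 2);
    }
    return n;
}
```

The same style of computation applied to condition (3) yields the row-by-row threshold t of the delayed-modulus variant, t = (2^53 − 1) / (p−1)².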
The first idea is to specialize the dot product in order to make several multiplications and additions before performing the division (which is then delayed). Indeed, one needs to perform a division only when the intermediate result might overflow. Now, if the available mantissa is of m bits and the modulo is p, divisions happen at worst every n multiplications, where n satisfies condition (3). There, the best compromise has to be chosen between speed of computation and available mantissa. A double floating point representation actually gives the best performances for most vector and prime sizes [6]. Moreover, one can then perform the division "à la NTL", using a floating point precomputation of the inverse: a*b mod p = a*b − floor(a*b*p^(-1))*p.

For small primes, however, there is a faster method: the second idea is to use an integer representation and to let the overflow occur. Then one should detect this overflow and correct the result if needed. Indeed, suppose that we have added a product ab to the accumulated result t and that an overflow has occurred. The variable t then actually contains t − 2^m. Well, the idea is just to precompute a correction CORR = 2^m mod p and add this correction whenever an unsigned overflow has occurred. Now, for a portable unsigned overflow detection, we use a trick of B. Hovinen [14]: since 0 < ab < 2^m, an unsigned overflow has occurred if and only if t + ab < t. Of course, better performances are attained when several products are grouped (whenever possible).

This is an exact wrapping of the ATLAS library⁵, used as a kernel to implement the Trsm variants. In the following we denote by "pure rec" the implementation of the recursive Trsm described in section 3.1; "BLAS" denotes the variant of section 3.2 with the optimal threshold; "delayed_t" denotes the variant of section 3.3, where t satisfies equation (3). In our comparisons we use the fields presented in section 2 as base fields and version 3.4.2 of ATLAS. Performances are expressed in millions of finite field operations (Mfops) per second for n × n dense systems.

Table 1: Comparing speed (Mfops) of Trsm using Modular<double>, on a P4, 2.4GHz (upper table is over Z5, lower table is over Z32749)

  n           400   700  1000  2000  3000  5000
  pure rec.   853  1216  1470  1891  2059  2184
  BLAS       1306  1715  1851  2312  2549  2660
  delayed100 1163  1417  1538  1869  2042  2137
  delayed50  1163  1491  1639  1955  2067  2171
  delayed23  1015  1465  1612  2010  2158  2186
  delayed3    901  1261  1470  1937  2134  2166

  n           400   700  1000  2000  3000  5000
  pure rec.   810  1225  1449  1886  2037  2184
  BLAS       1066  1504  1639  2099  2321  2378
  delayed100 1142  1383  1538  1860  2019  2143
  delayed50  1163  1517  1639  1955  2080  2172
  delayed23  1015  1478  1612  2020  2146  2184
  delayed3    914  1279  1449  1941  2139  2159
Now, table 2 reflects experiments using the integer-based modular fields of Givaro, Givaro-ZpZ. First, using these fields, Trsm is slower than with Modular<double>; for small matrices, the delayed variant even obtains better performances than the BLAS variant. This is because the conversion from machine integers to floating point numbers becomes too big a price to pay. As a comparison, we also provide performances for several thresholds, in particular the same ones as in the BLAS variant (3 and 23). Then, for larger matrices, the conversions ( O(n²) ) are dominated by the computations ( O(n^ω) ), and the "BLAS" variant is again the fastest one, provided that the field is small enough.

Table 2: Comparing speed (Mfops) of Trsm using Givaro-ZpZ, on a P4, 2.4GHz (upper table is over Z5, lower table is over Z32749)

[Figure 2: Principle of the LSP factorization]

The algorithm splits the matrix in halves and performs a recursive call on the top block. It thus gives the T, Y and L1 blocks of figure 2. Then, after some permutations ([X Z] = [A21 A22] P), it computes G such that GT = X via Trsm, replaces X by zeroes and eventually updates Z.
Furthermore, the matrix product Z = Z − GY also requires a temporary memory allocation, since rows of Y have to be shifted. This motivates the introduction of the LQUP decomposition.

[Figure 3: Principle of the LUdivine factorization]

4.3 LQUP
To solve the data locality issues due to zero rows inside S, one can prefer to compute the LQUP factorization, also introduced in [16]. It consists in a slight modification of the LSP factorization: S is replaced by U, the corresponding upper triangular matrix, after the permutation of the zero rows. The transpose of this row permutation is stored in Q.

[Figure 4: Principle of the LQUP factorization]

This prevents the use of temporaries for Y and T, since the triangles in U are now contiguous. Moreover, the number of instructions needed to perform the row permutations is lower than the number needed to perform the block copies of LUdivine or LSP. Furthermore, our implementation of LQUP also uses the trick of LUdivine, namely storing L in its compressed form L̃. Thanks to all these improvements, this triangularization appears to be fully in place. As will be shown in section 4.4, it is also more efficient. Here again, the LSP and LQUP factorizations are simply connected via S = QU, so the recovery of the LSP is still straightforward.

4.4 Comparisons
As shown in the previous sections, the three variants of triangularization mainly differ by their memory management. Indeed, the main operations remain matrix multiplications and triangular system solving. Therefore, the implementations of all these variants use the fast matrix multiplication routine of the FFLAS package [8] and the triangular system solver of subsection 3.2 as kernels. The results are impressive: for example, table 3 shows that it is possible to triangularize a 5000 × 5000 matrix over a finite field in 29.9 seconds. We now compare the speed and memory usage of the three routines with the same kernels, over a Modular<double> field, using matrices with only 3n non-zero entries so as to obtain rank-deficient matrices.

The timings given in table 3 are close, since the dominating operations of the three routines are similar. LSP is slower, since it performs some useless zero matrix multiplications when computing Z = Z − GY (section 4.2). LQUP is slightly faster than LUdivine, since row permutations involve fewer operations than the whole block copy of LUdivine (section 4.3). However, these operations do not dominate the cost of the factorization, and they therefore have little influence on the total timings. This is true until the matrix size induces some swapping, around 8000 × 8000.

Table 3: Comparing real time (seconds) of LSP, LUdivine, LQUP over Z101, on a P4, 2.4GHz

  n          400  1000  3000   5000   8000  10000
  LSP       0.05  0.48  8.01  32.54  404.8   1804
  LUdivine  0.05  0.47  7.79  30.27  403.9   1691
  LQUP      0.05  0.45  7.59  29.90  201.7   1090

Now for the memory usage: the fully in-place implementation of LQUP saves 20% of memory (table 4) when compared to LUdivine, and 55% when compared to LSP. Actually, the memory usage of the original LSP is approximately that of LUdivine augmented by the extra matrix storage (which corresponds exactly to that of LQUP: e.g. 5000 × 5000 × 8 bytes = 200Mb). This memory reduction is of high interest when dealing with large matrices (further improvements on the memory management are presented in section 4.5).

Table 4: Comparing memory usage (Mega bytes) of LSP, LUdivine, LQUP over Z101, on a P4, 2.4GHz with 512 Mb RAM

  n          400  1000   3000   5000   8000  10000
  LSP       2.83  17.85  160.4  444.2   1136   1779
  LUdivine  1.60  10.00  89.98  249.9  639.6  999.5
  LQUP      1.28   8.01  72.02  200.0  512.1  800.0

4.5 Data locality
To solve even bigger problems, say when the matrices do not fit in RAM, one has mainly two solutions: either perform out-of-core computations or parallelize the resolution. In both cases, the memory requirements of the algorithms become the main concern, because the memory accesses (either on hard disk or remotely via a network) dominate the computational cost. A classical solution is then to improve data locality so as to reduce the volume of these remote accesses. In such critical situations, one may have to prefer a slower algorithm with a good memory management to the fastest one suffering from high memory requirements. We here propose to deal with this concern in the case of rank or determinant computations of large dense matrices; the generalization to the full factorization case is direct but not yet fully implemented.
[Figure 5: Principle of the TURBO decomposition]

Here, only one recursive step of TURBO is used, the five recursive calls being performed by the LQUP algorithm. For the actual sizes of matrices, the quite complex implementation of more recursive levels of TURBO is not yet mandatory.

Now, for the comparisons of figure 6, we use the full LQUP factorization algorithm as a reference. Factorizations of matrices of size below 8000 fit in the 512Mb of RAM; there, LQUP is slightly faster than TURBO, the implementation of the latter producing slightly more scattered groups. Now, the first field representation chosen (curves 1 and 2) is a modular prime field representation using machine integers. As presented in [8], any matrix multiplication occurring in the decomposition over such a representation is performed by converting the three operands to three extra floating point matrices. This memory overhead is critical in our comparison. TURBO, having a better data locality and using square blocks whenever possible, requires smaller temporary matrices than the large and very rectangular blocks used in LQUP. Therefore, for matrices of order over 8000, LQUP has to swap a lot while TURBO remains more in RAM. This is strikingly true for matrices between 8000 and 8500, where TURBO manages to keep its top speed.

Moreover, one can also reduce the memory overhead due to the conversions to floating point numbers by using the Modular<double> representation, where no conversion is needed any more. Paradoxically, this prevents the virtual blocks from fitting in the RAM, since they are just a view of the large initial matrix. For this reason, both performance losses appear for matrices of order around 8000. However, the drop is lower for TURBO, thanks to the recursive blocked data formats producing better data locality.

[Figure 6: TURBO versus LQUP for out of core rank]

This behavior of course confirms that as soon as the RAM is full, data locality becomes more important than memory savings: TURBO over Givaro-ZpZ is the fastest for matrices of size bigger than 8000, despite its bigger memory demand. This advocates further uses of recursive blocked data formats and of more recursive levels of TURBO.

5. RANK, DETERMINANT, INVERSE
The LQUP factorization and the Trsm routines reduce to matrix multiplication, as we have seen in the previous sections. Theoretically, as matrix multiplication requires 2n³ − n² arithmetic operations and the factorization requires at most (2/3) n³ arithmetic operations, the factorization could be computed in about 1/3 of the time. Now, the matrix multiplication routine Fgemm of the FFLAS package can compute a 5000 × 5000 matrix multiplication in only 56.43 seconds on a 2.4GHz Pentium 4. This is achieved with pipelining within the P4 processor and with the very good performances of the BLAS. This corresponds to 4430 millions of finite field arithmetic operations per second! Well, table 5 shows that with n × n matrices we are not very far from these quasi-optimal performances for the factorization as well:

Table 5: Comparing cpu time of LQUP and Matrix Multiplication, on a P4, 2.4GHz

  n       400    700   1000   2000    3000    5000
  LQUP   0.05s  0.19s  0.45s  2.58s   7.59s   29.9s
  Fgemm  0.05s  0.24s  0.66s  4.49s  12.66s  56.43s
  Ratio      1   0.78   0.68   0.57    0.6    0.53
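The operation count behind the 4430 Mfops figure can be checked directly: a classical n × n product costs 2n³ − n² field operations, and dividing by the measured time recovers the quoted rate (a back-of-the-envelope check, not a benchmark).

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Field operation count of a classical n x n matrix product: n^3
// multiplications and n^3 - n^2 additions, i.e. 2n^3 - n^2 operations,
// converted to millions of field operations per second for a measured time.
static double matmul_mfops(std::int64_t n, double seconds) {
    double ops = 2.0 * n * n * n - static_cast<double>(n) * n;
    return ops / seconds / 1e6;
}
```

With n = 5000 and the 56.43 s timing of table 5, this evaluates to about 4430 Mfops, matching the rate quoted in the text.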
Now, the inverse can then be computed with at most (2/3) n³ + 2n³ arithmetic operations, which gives a theoretical ratio of 4/3. Once again, table 6 proves that our implementation has pretty good performances. Indeed, the operations performed in LQUP or Trsm are not grouped as well as in Fgemm. Therefore, the excellent performances of Fgemm make the ratio somewhat unreachable, although the inverse routine is very fast. Note that, as the first LLeft-Trsm call is made on the identity, it could be accelerated in a specific routine. Indeed, during the course of LUP, L^(-1) can actually be computed with only an n³/3 overhead, thus reducing the theoretical ratio from 4/3 to 1.

Table 6: Comparing cpu time of Matrix Multiplication and Inverse over Z101, on a P4, 2.4GHz

  n       400    700   1000    2000    3000     5000
  Inv    0.2s  0.76s   1.9s  11.27s  32.08s  132.72s
  Fgemm 0.05s  0.24s  0.66s   4.49s  12.66s   56.43s
  Ratio      4  3.17   2.92    2.51    2.53     2.35

6. CONCLUSIONS
We have achieved the goal of approaching the speed of the numerical factorization of matrices of any shape and any rank, but over finite fields. For example, the LQUP factorization of a 3000 × 3000 matrix over a finite field takes 7.59 seconds, where 6 seconds are needed for the numerical LUP factorization of LAPACK⁶. To reach these performances one could use blocks that fit the cache dimensions of a specific machine. In [8] we proved that this was not mandatory for matrix multiplication; we think we prove here that this is not mandatory for any dense linear algebra routine. By the use of recursive block algorithms and efficient numerical BLAS, one can approach the numerical performances. Moreover, long range efficiency and portability are warranted, as opposed to everyday tuning, with at most a 10% loss for large matrices (see table 2, where delayed can beat BLAS only for big primes and with a specific empirical threshold).

Besides, the exact equivalent of the stability constraints of numerical computations is coefficient growth. Therefore, whenever possible, we computed and improved theoretical bounds on this growth (see the bounds of corollary 3.3 and [8, Theorem 3.1]). Those optimal bounds enable further uses of the BLAS routines.

Further developments include:

⁶ www.netlib.org/lapack

7. REFERENCES
[1] A. V. Aho, J. E. Hopcroft, and J. D. Ullman. The Design and Analysis of Computer Algorithms. Addison-Wesley, 1974.
[2] D. Bini and V. Pan. Polynomial and Matrix Computations, Volume 1: Fundamental Algorithms. Birkhäuser, Boston, 1994.
[3] M. Brassel, P. Giorgi, and C. Pernet. LUdivine: A symbolic block LU factorisation for matrices over finite fields using BLAS. In East Coast Computer Algebra Day, Clemson, South Carolina, USA, Apr. 2003. Poster.
[4] Z. Chen and A. Storjohann. Effective reductions to matrix multiplication, July 2003. ACA'2003, 9th International Conference on Applications of Computer Algebra, Raleigh, North Carolina State University, USA.
[5] J. Dongarra and V. Eijkhout. Self-adapting numerical software and automatic tuning of heuristics. Lecture Notes in Computer Science, 2660:759–770, Jan. 2003.
[6] J.-G. Dumas. Efficient dot product over word-size finite fields. Rapport de recherche, IMAG-RR1064, Mar. 2004.
[7] J.-G. Dumas, T. Gautier, M. Giesbrecht, P. Giorgi, B. Hovinen, E. Kaltofen, B. D. Saunders, W. J. Turner, and G. Villard. LinBox: A generic library for exact linear algebra. In A. M. Cohen, X.-S. Gao, and N. Takayama, editors, Proceedings of the 2002 International Congress of Mathematical Software, Beijing, China, pages 40–50. World Scientific, Aug. 2002.
[8] J.-G. Dumas, T. Gautier, and C. Pernet. Finite field linear algebra subroutines. In T. Mora, editor, Proceedings of the 2002 International Symposium on Symbolic and Algebraic Computation, Lille, France, pages 63–74. ACM Press, New York, July 2002.
[9] J.-G. Dumas and J.-L. Roch. On parallel block algorithms for exact triangularizations. Parallel Computing, 28(11):1531–1548, Nov. 2002.
[10] J.-G. Dumas and G. Villard. Computing the rank of sparse matrices over finite fields. In V. G. Ganzha, E. W. Mayr, and E. V. Vorozhtsov, editors, Proceedings of the fifth International Workshop on Computer Algebra in Scientific Computing, Yalta, Ukraine, pages 47–62. Technische Universität München, Germany, Sept. 2002.
[11] P. Giorgi. From BLAS routines to finite field exact linear algebra solutions, July 2003. ACA'2003, 9th International Conference on Applications of Computer Algebra, Raleigh, North Carolina State University, USA.
[12] P. Giorgi, C.-P. Jeannerod, and G. Villard. On the complexity of polynomial matrix computations. In Sendra [21], pages 135–142.
[13] F. Gustavson, A. Henriksson, I. Jonsson, and B. Kågström. Recursive blocked data formats and BLAS's for dense linear algebra algorithms. Lecture Notes in Computer Science, 1541:195–206, 1998.
[14] B. Hovinen, 2002. Personal communication.
[15] X. Huang and V. Y. Pan. Fast rectangular matrix multiplications and improving parallel matrix computations. In PASCO '97: Proceedings of the second international symposium on parallel symbolic computation, July 20–22, 1997, Maui, HI, pages 11–23. ACM Press, New York, NY, USA, 1997.
[16] O. H. Ibarra, S. Moran, and R. Hui. A generalization of the fast LUP matrix decomposition algorithm and applications. Journal of Algorithms, 3(1):45–56, Mar. 1982.
[17] E. Kaltofen, M. S. Krishnamoorthy, and B. D. Saunders. Parallel algorithms for matrix normal forms. Linear Algebra and its Applications, 136:189–208, 1990.
[18] C. Pernet. Implementation of Winograd's matrix multiplication over finite fields using ATLAS level 3 BLAS. Technical report, Laboratoire Informatique et Distribution, July 2001. www-id.imag.fr/Apache/RR/RR011122FFLAS.ps.gz.
[19] C. Pernet. Calcul du polynôme caractéristique sur des corps finis. Master's thesis, University of Delaware, June 2003.
[20] C. Pernet and Z. Wan. LU based algorithms for the characteristic polynomial over a finite field. In Sendra [21]. Poster.
[21] R. Sendra, editor. ISSAC'2003. Proceedings of the 2003 International Symposium on Symbolic and Algebraic Computation, Philadelphia, Pennsylvania, USA. ACM Press, New York, Aug. 2003.
[22] W. J. Turner. Blackbox linear algebra with the LinBox library. PhD thesis, North Carolina State University, May 2002.
[23] R. C. Whaley, A. Petitet, and J. J. Dongarra. Automated empirical optimizations of software and the ATLAS project. Parallel Computing, 27(1–2):3–35, Jan. 2001.

APPENDIX
A. PROOF OF THEOREM 3.2
Theorem 3.2. Let T ∈ Z^(n×n) be a unit diagonal upper triangular matrix and b ∈ Z^n, with |T| ≤ p−1 and |b| ≤ p−1. Let X = (x_i)_{i∈[1..n]} ∈ Z^n be the solution of T.X = b over the integers. Then

  ∀ k ∈ [0..n−1]:  −u_k ≤ x_{n−k} ≤ v_k   if k is even,
                   −v_k ≤ x_{n−k} ≤ u_k   if k is odd,    (4)

where

  u_n = (p−1)/2 [p^n − (p−2)^n],
  v_n = (p−1)/2 [p^n + (p−2)^n].

Proof. Let us define the induction hypothesis IH_l to be that equations (4) are true for k ∈ [0..l−1].

When l = 0, x_n = b_n, which implies that −u_0 = 0 ≤ x_n ≤ p−1 = v_0. Thus IH_0 is proven.

Let us now suppose that IH_j is true for all j ∈ [0..l], and prove IH_{l+1}. There are two cases: either l is odd or not. If l is odd, l+1 is even. Now, by induction, an upper bound for x_{n−l−1} is

  (p−1) [ 1 + Σ_{i=0..(l−1)/2} ( u_{2i} + v_{2i+1} ) ]
    ≤ p−1 + (p−1)²/2 Σ_{i=0..(l−1)/2} [ p^(2i) − (p−2)^(2i) + p^(2i+1) + (p−2)^(2i+1) ]
    ≤ p−1 + (p−1)²/2 Σ_{i=0..(l−1)/2} [ p^(2i) (p+1) + (p−2)^(2i) (p−3) ]
    ≤ p−1 + (p−1)²/2 [ (p^(l+1) − 1)/(p² − 1) (p+1) + ((p−2)^(l+1) − 1)/((p−2)² − 1) (p−3) ]
    ≤ (p−1)/2 [ p^(l+1) + (p−2)^(l+1) ]
    ≤ v_{l+1}.

Similarly, a lower bound for x_{n−l−1} is

  −(p−1) Σ_{i=0..(l−1)/2} ( v_{2i} + u_{2i+1} )
    ≥ −(p−1)²/2 Σ_{i=0..(l−1)/2} [ p^(2i) + (p−2)^(2i) + p^(2i+1) − (p−2)^(2i+1) ]
    ≥ −(p−1)²/2 Σ_{i=0..(l−1)/2} [ p^(2i) (p+1) − (p−2)^(2i) (p−3) ]
    ≥ −(p−1)/2 [ p^(l+1) − (p−2)^(l+1) ]
    ≥ −u_{l+1}.

Finally, if l is even, a similar proof leads to

  −v_{l+1} ≤ x_{n−l−1} ≤ u_{l+1}.