
Faster Integer Multiplication

Martin Fürer*
Department of Computer Science and Engineering
Pennsylvania State University
[email protected]

April 22, 2007

Abstract

For more than 35 years, the fastest known method for integer multiplication has been the Schönhage-Strassen algorithm running in time O(n log n log log n). Under certain restrictive conditions there is a corresponding Ω(n log n) lower bound. The prevailing conjecture has always been that the complexity of an optimal algorithm is Θ(n log n). We present a major step towards closing the gap from above by presenting an algorithm running in time n log n 2^{O(log* n)}. The main result is for boolean circuits as well as for multitape Turing machines, but it has consequences to other models of computation as well.

1 Introduction

All known methods for integer multiplication (except the trivial school method) are based on some version of the Chinese remainder theorem. Schönhage [Sch66] computes modulo numbers of the form 2^k + 1. Most methods can be interpreted as schemes for the evaluation of polynomials, multiplication of the values, followed by interpolation. The classical method of Karatsuba and Ofman [KO62] can be viewed as selecting the values of linear forms at (0, 1), (1, 0), and (1, 1) to achieve time T(n) = O(n^{lg 3}). Toom [Too63] evaluates at small consecutive integer values to improve the time to T(n) = O(n^{1+ε}). Finally Schönhage and Strassen [SS71] use the usual fast Fourier transform (FFT) (i.e., evaluation and interpolation at 2^m-th roots of unity) to compute integer products in time O(n log n log log n). They conjecture the optimal upper bound (for a yet unknown algorithm) to be O(n log n), but their result has remained unchallenged.

Schönhage and Strassen [SS71] really propose two distinct methods. The first uses numerical approximation to complex arithmetic, and reduces multiplication of length n to that of length O(log n). The complexity of this method is slightly higher. It is only proposed as a one level approach. Even with the next level of multiplications done by a trivial algorithm, it is already very fast. The second method employs arithmetic in rings of integers modulo numbers of the form F_m = 2^{2^m} + 1 (Fermat numbers), and reduces the length of the factors from n to O(√n).

*This work is supported in part by the Penn State Grove Award.
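The evaluation-interpolation view of the Karatsuba-Ofman method can be made concrete with a small sketch (the function `karatsuba` and its splitting details are our illustration, not taken from the paper):

```python
def karatsuba(a, b):
    # multiply via the values of linear forms at (0,1), (1,0), (1,1):
    # with a = a1*2^h + a0 and b = b1*2^h + b0, the three products
    # a0*b0, a1*b1 and (a0+a1)*(b0+b1) determine a*b
    if a < 2 or b < 2:
        return a * b
    h = max(a.bit_length(), b.bit_length()) // 2
    a1, a0 = a >> h, a & ((1 << h) - 1)
    b1, b0 = b >> h, b & ((1 << h) - 1)
    low = karatsuba(a0, b0)                 # value at (0, 1)
    high = karatsuba(a1, b1)                # value at (1, 0)
    mid = karatsuba(a0 + a1, b0 + b1)       # value at (1, 1)
    return (high << (2 * h)) + ((mid - low - high) << h) + low
```

Three recursive multiplications of half-length numbers instead of four yield the O(n^{lg 3}) bound.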

This second method is used recursively with O(log log n) nested calls. In the ring Z_{F_m} of integers modulo F_m, the integer 2 is a particularly convenient root of unity for the FFT computation, because all multiplications with powers of this root of unity are just modified cyclic shifts. On the other hand, the first method has the advantage of the significant length reduction from n to O(log n). If this method is applied recursively, it results in a running time of order n log n log log n ⋯ 2^{O(log* n)}, because during the kth of the O(log* n) recursion levels, the amount of work increases by a factor of O(log log … log n) (with the log iterated k times). Note that, for their second method, Schönhage and Strassen have succeeded with the difficult task of keeping the work of each level basically constant, avoiding a factor of log^{O(1)} n = 2^{O(log log n)} instead of O(log log n).

Our novel use of the FFT allows us to combine the main advantages of both methods. The reduction is from length n to length O(log n), and still most multiplications with roots of unity are just cyclic shifts. Unfortunately, we are not able to avoid the geometric increase over the log* n levels.

Relative to the conjectured optimal time of Θ(n log n), the first Schönhage and Strassen method had an overhead factor of log log n ⋯ 2^{O(log* n)}, representing a doubly exponential decrease compared to previous methods. Their second method with an overhead of O(log log n) constitutes another improvement. Our new method reduces the overhead to 2^{O(log* n)}, and thus represents a more than multiple exponential improvement of the overhead factor.

We use a new divide-and-conquer approach to the N-point FFT, where N is a power of 2. It is well known and obvious that the JK-point FFT graph (butterfly graph) can be composed of two levels, one containing K copies of a J-point FFT graph, and the other containing J copies of a K-point FFT graph. Clearly N = JK could be factored differently into N = J′K′, and the same N-point FFT graph could be viewed as being composed of J′-point and K′-point FFT graphs. The astonishing fact is that this is just true for the FFT graph and not for the FFT computation. Every way of (recursively) partitioning N produces another FFT algorithm. Multiplications with other powers of ω appear when another recursive decomposition is used.

It seems that this fact has been mainly unnoticed except for its use some time ago [Für89] in an earlier attempt to obtain a faster integer multiplication algorithm. In that paper, the following result has been shown. If there is an integer k > 0 such that for every m, there is a Fermat prime in the sequence F_{m+1}, F_{m+2}, …, F_{2m+k}, then multiplication of binary integers of length n can be done in time n log n 2^{O(log* n)}. Hence, the Fermat primes could be extremely sparse and would still be sufficient for a fast integer multiplication algorithm. Nevertheless, this paper is not so exciting, because it is well known that the number of Fermat primes is conjectured to be finite.

It has long become standard to view the FFT as an iterative process (see e.g., [SS71, AHU74]). Even though the description of the algorithm gets more complicated, it results in less computational overhead. A vector of coefficients at level 0 is transformed level by level, until we reach the Fourier transformed vector at level lg N. The operations at each level are additions, subtractions, and multiplications with powers of ω. They are done as if the N-point FFT were recursively decomposed into N/2-point FFT's followed by 2-point FFT's. The current author has seen Schönhage present the other natural recursive decomposition into 2-point FFT's followed by N/2-point FFT's.

It results in another distribution of the powers of ω, even though each power of ω appears as a coefficient in both iterative methods with the same frequency. But other decompositions produce completely different frequencies. The standard fast algorithm design principle, divide-and-conquer, calls for a balanced partition, but in this case it is not at all obvious that this will provide any benefit.

A balanced approach uses two stages of roughly √N-point FFT's. This allows an improvement, because it turns out that "odd" powers of ω are then very seldom. This key observation alone is not sufficiently powerful to obtain a better asymptotic running time, because usually 1, −1, i, −i and to a lesser extent ±(1 ± i)/√2 are the only powers of ω that are easier to handle. We will achieve the desired speed-up by working over a ring with many "easy" powers of ω. Hence, the new faster integer multiplication algorithm is based on two key ideas.

• An unconventional FFT algorithm is used with the property that most occurring roots of unity are of low order.

• The computation is done over a ring with very simple multiplications with many low order roots of unity.

The question remains whether the optimal running time for integer multiplication is indeed of the form n log n 2^{O(log* n)}. Already, Schönhage and Strassen [SS71] have conjectured that the more elegant expression O(n log n) is optimal, as we mentioned before. It would indeed be strange if such a natural operation as integer multiplication had such a complicated expression for its running time. But even for O(n log n) there is no unconditional corresponding lower bound. Still, long ago there have been some remarkable attempts. In the algebraic model, Morgenstern [Mor73] has shown that every N-point Fourier transform done by just using linear combinations αa + βb with |α| + |β| ≤ c for inputs or previously computed values a and b requires at least (n lg n)/(2 lg c) operations. Under different assumptions on the computation graph, Papadimitriou [Pap79] and Pan [Pan86] have shown lower bounds of Ω(n log n) for the FFT. Both are for the interesting case of n being a power of 2.

Cook and Aanderaa [CA69] have developed a method for proving non-linear lower bounds for on-line computations of integer products and related functions. Based on this method, Paterson, Fischer and Meyer [PFM74] have improved the lower bound for on-line integer multiplication to Ω(n log n). Naturally, one would like to see unconditional lower bounds, as the on-line requirement is a very severe restriction. On-line means that starting with the least significant bit, the kth bit of the product is written before the k + 1st bit of the factors are read.

Besides showing that our algorithm is more efficient in terms of circuit complexity and Turing machine time, one could reasonably ask how well it performs in terms of more practical complexity measures. Well, first of all, it is worthwhile pointing out that all currently competitive algorithms are nicely structured, and for such algorithms, a Turing machine model with an alphabet size of 2^w (where w is the computer word length) is actually a very realistic model, as can be seen from the first implementation of fast integer multiplication (see [SGV94]).

Usually, the random access machine (RAM) is considered to be a better model for actual computing machines. But in the case of long arithmetic, such superiority is much in doubt. Indeed, for random access machines, addition and multiplication have the same linear time complexity [Sch80] (see also Knuth [Knu98]).

This complexity result is very impressive, as it even holds for Schönhage's [Sch80] pointer machines (initially called storage modification machines), which are equivalent to the rudimentary successor RAM. Nevertheless, one could interpret this result as proving a deficiency of the random access machine model, as obviously in any practical sense long addition can be done much easier and faster than long multiplication. The linear time multiplication algorithm is based on the observation that during the multiplication of large numbers many products of small numbers are repeatedly needed and could be computed in an organized way and distributed to the places where they are needed. No practical implementation would do something like that, because multiplications of small numbers can be done fast on most computers.

A more reasonable, but also more complicated RAM-like complexity model might use an additional parameter, the word length w, with all reasonable operations (like input and multiplication) of length w numbers doable in one step. Then the complexity of addition would be O(n/w), but we would expect multiplication to require more time. The complexities in terms of n and w could be reduced to single parameter complexities by focussing on a reasonable relation between n and w, which holds for a specific range of n and w of interest. For example, to have a good estimate for very large but still practical values of n, we could select w = Θ(log n) or w = Θ(log log n), as the hidden constants would be small in this range. In any case, it should not be too hard to see that an implementation of our new algorithm with currently used methods could be quite competitive, even though this does not show up in the asymptotic complexity of the unit cost random access machine model.

In Section 2, we review the FFT in a way that shows which powers of ω are used for any recursive decomposition of the algorithm. In Section 3, we present a ring with many nice roots of unity allowing our faster FFT computation. In Section 4, we describe the new method of using the FFT for integer multiplication. It is helpful if the reader is familiar with the Schönhage-Strassen integer multiplication algorithm as described in the original, or e.g., in Aho, Hopcroft and Ullman [AHU74], but this is not a prerequisite. In Section 5, we study the precision requirements for the numerical approximations used in the Fourier transforms. In Section 6, we state the complexity results, followed by open problems in Section 7.

2 The Fast Fourier Transform

The N-point discrete Fourier transform (DFT) is the linear function mapping the vector a = (a_0, …, a_{N−1})^T to b = (b_0, …, b_{N−1})^T by

  b = Ωa, where Ω = (ω^{jk})_{0 ≤ j,k ≤ N−1}

with a principal Nth root of unity ω. Hence, the discrete Fourier transform maps the vector of coefficients (a_0, …, a_{N−1})^T of a polynomial

  p(x) = Σ_{j=0}^{N−1} a_j x^j

of degree N − 1 to the vector of values

  (p(1), p(ω), p(ω²), …, p(ω^{N−1}))^T

at the Nth roots of unity. The inverse Ω^{−1} of the discrete Fourier transform is 1/N times the discrete Fourier transform with the principal Nth root of unity ω^{−1}.

This is because

  Σ_{j=0}^{N−1} ω^{−ij} ω^{jk} = Σ_{j=0}^{N−1} ω^{(k−i)j} = N δ_{ik}

Here, we use the Kronecker δ defined by

  δ_{ik} = 1 if i = k, and δ_{ik} = 0 if i ≠ k.

If the Fourier transform is done over the field C, then ω^{−1} = ω̄, the complex conjugate of ω. Therefore, the discrete Fourier transform scaled by 1/√N is a unitary function, and (1/√N) Ω is a unitary matrix.

Now let N = JK. Then ω^{JK} = 1. We want to represent the N-point DFT as a set of K parallel J-point DFT's (inner DFT's), followed by scalar multiplications and a set of J parallel K-point DFT's (outer DFT's). The inner DFT's employ the principal Jth root of unity ω^K, while the outer DFT's work with the principal Kth root of unity ω^J. Hence, most powers of ω used during the transformation are powers of ω^J or ω^K. Only the scalar multiplications in the middle are by "odd" powers of ω. This general recursive decomposition of the DFT has in fact been presented in the original paper of Cooley and Tukey [CT65], but might have been widely forgotten since. Any such recursive decomposition (even for J = 2 or K = 2) results in a fast algorithm for the DFT and is called "fast Fourier transform" (FFT). At one time, the FFT has been fully credited to Cooley and Tukey, but the FFT has appeared earlier. For the older history of the FFT back to Gauss, we refer the reader to [HJB84].

Here, we are only interested in the usual case of N being a power of 2. Instead of j and k with 0 ≤ j, k ≤ N − 1, we use j′J + j and k′K + k with 0 ≤ j, k′ ≤ J − 1 and 0 ≤ j′, k ≤ K − 1. Any textbook presenting the Fourier transformation recursively would use either K = 2 or J = 2. For 0 ≤ j ≤ J − 1 and 0 ≤ j′ ≤ K − 1, we define

  b_{j′J+j} = Σ_{k=0}^{K−1} Σ_{k′=0}^{J−1} ω^{(j′J+j)(k′K+k)} a_{k′K+k}
            = Σ_{k=0}^{K−1} ω^{Jj′k} ω^{jk} [ Σ_{k′=0}^{J−1} ω^{Kjk′} a_{k′K+k} ]

The bracketed inner sums are the inner (first) DFT's, the factors ω^{jk} turn their results into the coefficients of the outer DFT's, and the outer sums are the outer (second) DFT's.

For N being a power of 2, the fast Fourier transforms (FFT's) are obtained by recursive application of this method until J = K = 2. We could apply a balanced FFT with J = K = √N or J = 2K = √(2N) depending on N being an even or odd power of 2. But actually, we just require that the partition is not extremely unbalanced.

3 The Ring R = C[x]/(x^P + 1)

An element ζ in a ring R is a principal mth root of unity if it has the following properties.

1. ζ ≠ 1

2. ζ^m = 1

3. Σ_{j=0}^{m−1} ζ^{jk} = 0, for 1 ≤ k < m

We consider the set of polynomials R[y] over the ring R = C[x]/(x^P + 1). In all applications, we will assume P to be a power of 2.

For a primitive 2P-th root of unity ζ in C, e.g., ζ = e^{iπ/P}, we have

  R = C[x]/(x^P + 1) = C[x]/ Π_{j=0}^{P−1} (x − ζ^{2j+1}) ≅ ⊕_{j=0}^{P−1} C[x]/(x − ζ^{2j+1})

R contains an interesting root of unity, namely x. It is a very desirable root of unity, because multiplication by x can be done very efficiently.

Lemma 1 For P a power of 2, the variable x is a principal 2P-th root of unity in the ring R = C[x]/(x^P + 1).

Proof Property 1 is trivial. Property 2 follows, because in R the equation x^P = −1 holds. Property 3 is shown as follows. Let 0 ≤ k = (2u + 1)2^v < 2P = m. This implies 2^v ≤ P, and k/2^v is an odd integer. Now

  Σ_{j=0}^{2P−1} x^{jk} = Σ_{i=0}^{2^{v+1}−1} Σ_{j=0}^{P/2^v−1} x^{(iP/2^v + j)k}
                        = Σ_{j=0}^{P/2^v−1} x^{jk} Σ_{i=0}^{2^{v+1}−1} x^{iPk/2^v} = 0,

because x^{iPk/2^v} = (−1)^i (as k/2^v is odd), so the inner sum vanishes.

Alternatively, because R is isomorphic to C^P, one can argue that x is a principal root of unity in R, if and only if it is a principal root of unity in every factor C[x]/(x − ζ^{2j+1}). But x mod (x − ζ^{2j+1}) is just ζ^{2j+1}, which is a principal root of unity in C.

There are many principal Nth roots of unity in R = C[x]/(x^P + 1). One can choose an arbitrary primitive Nth root of unity in every factor C[x]/(x − ζ^{2j+1}) independently. We want to pick one such Nth root of unity

  ρ(x) ∈ R = C[x]/(x^P + 1)

with the convenient property

  ρ(x)^{N/2P} = x

Let σ be a primitive Nth root of unity in C, with σ^{N/2P} = ζ, e.g.,

  σ = e^{2πi/N}

Now we select the polynomial

  ρ(x) = Σ_{j=0}^{P−1} r_j x^j

such that

  ρ(x) ≡ σ^{2k+1} (mod x − ζ^{2k+1})

for k = 0, 1, …, P − 1, i.e., σ^{2k+1} is the value of the polynomial ρ(x) at ζ^{2k+1}. Then

  ρ(x)^{N/2P} ≡ σ^{(2k+1)N/2P} = ζ^{2k+1} ≡ x (mod x − ζ^{2k+1})

for k = 0, 1, …, P − 1, implying

  ρ(x)^{N/2P} ≡ x (mod x^P + 1)
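The two key facts used here — multiplication by x in R is a cyclic shift of the coefficient vector with a sign change on wrap around, and x satisfies the principal-root properties of Lemma 1 — can be checked by a small sketch (our illustration; `mul_by_x` is a hypothetical helper):

```python
P = 8  # a power of 2; R = C[x]/(x^P + 1), elements as coefficient lists

def mul_by_x(a):
    # multiplication by x shifts the coefficients up by one position;
    # since x^P = -1 in R, the wrapped coefficient re-enters negated
    return [-a[-1]] + a[:-1]

one = [1] + [0] * (P - 1)    # the constant polynomial 1

# property 2: x^(2P) = 1 (apply the shift 2P times)
v = one
for _ in range(2 * P):
    v = mul_by_x(v)
assert v == one

# property 3: sum_{j=0}^{2P-1} x^(jk) = 0 in R, for 1 <= k < 2P
for k in range(1, 2 * P):
    total = [0] * P
    power = one              # holds x^(jk)
    for j in range(2 * P):
        total = [u + w for u, w in zip(total, power)]
        for _ in range(k):   # power *= x^k
            power = mul_by_x(power)
    assert all(c == 0 for c in total)
```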

For the FFT algorithm, the coefficients of ρ(x) could be computed from Lagrange's interpolation formula without affecting the asymptotic running time, because we will select P = O(log N).

  ρ(x) = Σ_{k=0}^{P−1} σ^{2k+1} [ Π_{j≠k} (x − ζ^{2j+1}) ] / [ Π_{j≠k} (ζ^{2k+1} − ζ^{2j+1}) ]

In both products, j ranges over {0, …, P − 1} − {k}. The numerator in the previous expression is

  (x^P + 1)/(x − ζ^{2k+1}) = − Σ_{j=0}^{P−1} ζ^{−(j+1)(2k+1)} x^j

for P being even. This implies that in our case, all coefficients of each of the additive terms in Lagrange's formula have the same absolute value. We show a little more, namely that all coefficients of ρ(x) have an absolute value of at most 1.

Definition: The l2-norm of a polynomial p(x) = Σ_k a_k x^k is ||p(x)|| = √(Σ_k |a_k|²).

Our FFT will be done with the principal root of unity ρ(x) defined above. In order to control the required numerical accuracy of our computations, we need a bound on the absolute value of the coefficients of ρ(x). Such a bound is provided by the l2-norm ||ρ(x)|| of ρ(x).

Lemma 2 The l2-norm of ρ(x) is ||ρ(x)|| = 1.

Proof Note that the values of the polynomial ρ(x) at all the P primitive 2P-th roots of unity are also roots of unity, in particular complex numbers with absolute value 1. Thus the vector v of these values has l2-norm √P. The coefficients of ρ(x) are obtained by the inverse discrete Fourier transform Ω^{−1}v. As √P Ω^{−1} is a unitary matrix, the vector of coefficients has norm 1.

Corollary 1 The absolute value of every coefficient of ρ(x) is at most 1.

4 The Algorithm

In order to multiply two non-negative integers of length n/2 each, we encode them as polynomials of R[y], where R = C[x]/(x^P + 1). We multiply these polynomials with the help of the Fourier transform as follows. Let P = Θ(log n) be rounded to a power of 2. The binary integers to be multiplied are decomposed into (large) pieces of length P²/2. Again, each such piece is decomposed into small pieces of length P. If a_{i P/2−1}, …, a_{i0} are the small pieces belonging to a common big piece a_i, then they are encoded as

  Σ_{j=0}^{P−1} a_{ij} x^j ∈ R = C[x]/(x^P + 1)

with a_{i P−1} = a_{i P−2} = ⋯ = a_{i P/2} = 0. Thus each large piece is encoded as an element of R, which is a coefficient of a polynomial in y.

These elements of R are themselves polynomials in x. Their coefficients are integers at the beginning and at the end of the algorithm. The intermediate results, as well as the roots of unity, are polynomials with complex coefficients, which themselves are represented by pairs of reals that have to be approximated numerically. In Section 5, we will show that it is sufficient to use fixed-point arithmetic with O(P) = O(log n) bits in the integer and fraction part.

Now every factor is represented by a polynomial of R[y]. An FFT computes the values of such a polynomial at those roots of unity which are powers of the Nth root of unity ρ(x) ∈ R. The values are multiplied and an inverse FFT produces another polynomial of R[y]. From this polynomial the resulting integer product can be recovered by just doing some additions.
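For small parameters, the construction of ρ(x) and the statements of Lemma 2 and the congruence ρ(x)^{N/2P} ≡ x can be illustrated numerically. This is our sketch (hypothetical names throughout); it interpolates through the prescribed values σ^{2k+1} by exploiting that the points ζ^{2k+1} = ζ·(ζ²)^k reduce the interpolation to an ordinary inverse P-point DFT followed by a diagonal scaling:

```python
import cmath

P, N = 4, 32                          # small powers of 2 with 2P dividing N
zeta = cmath.exp(1j * cmath.pi / P)   # primitive 2P-th root of unity
sigma = cmath.exp(2j * cmath.pi / N)  # primitive N-th root; sigma^(N/2P) = zeta

# interpolate rho with rho(zeta^(2k+1)) = sigma^(2k+1)
w = zeta * zeta
vals = [sigma ** (2 * k + 1) for k in range(P)]
rho = [zeta ** (-j) / P * sum(w ** (-k * j) * vals[k] for k in range(P))
       for j in range(P)]

# Lemma 2: the l2-norm of rho is 1
assert abs(sum(abs(c) ** 2 for c in rho) - 1) < 1e-9

def negacyclic_mul(a, b):
    # multiplication in R = C[x]/(x^P + 1)
    c = [0j] * P
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            if i + j < P:
                c[i + j] += ai * bj
            else:
                c[i + j - P] -= ai * bj   # x^P = -1
    return c

# the convenient property: rho(x)^(N/2P) = x in R
acc = [1] + [0j] * (P - 1)
for _ in range(N // (2 * P)):
    acc = negacyclic_mul(acc, rho)
x_poly = [0, 1] + [0] * (P - 2)
assert max(abs(u - v) for u, v in zip(acc, x_poly)) < 1e-9
```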

The relevant parts of the coefficients have now grown to a length of O(P) from the initial length of P. (The constant factor growth could actually be decreased to a factor 2 + o(1) by increasing the length P of the small pieces from O(log n) to O(log² n), but this would only affect the constant factor in front of log* in the exponent of the running time.)

Thus the algorithm runs pretty much like that of Schönhage and Strassen [SS71], except that the field C has been replaced by the ring R = C[x]/(x^P + 1), and the FFT is decomposed more evenly. The standard decomposition of the N-point FFT into two N/2-point FFT's and many 2-point FFT's would not allow an improvement. Nevertheless, there is no need for balancing completely. Instead of recursively decomposing the N = Θ(n/log² n)-point FFT in the middle (in a divide-and-conquer fashion), we decompose into 2P-point FFT's and N/(2P)-point FFT's. This choice is not important, as in either of the two cases, only about every log P-th level of the overall FFT requires complicated multiplications with difficult roots of unity. At all the other levels, multiplications are with roots of unity which are powers of x. Multiplications with these roots of unity are just cyclic rotations of coefficients of elements of R (with sign change on wrap around).

We use the auxiliary functions Decompose (Figure 1) and Compose (Figure 2). "Decompose" takes a binary number a of length at most NP²/4 and decomposes it into NP/2 pieces a_{ij} (0 ≤ i < N and 0 ≤ j < P/2) of length P each, to serve as coefficients of polynomials from R. More precisely, first a is decomposed into N pieces a_{N−1}, a_{N−2}, …, a_0 of length P²/2. Then each a_i is decomposed into P/2 pieces a_{i P/2−1}, a_{i P/2−2}, …, a_{i0} of length P each. The remaining a_{ij} (for 0 ≤ i < N and P/2 ≤ j < P) are defined to be 0. This padding allows to properly recover the integer product from the product of the polynomials. In other words, we have

  a_i = Σ_{j=0}^{P−1} a_{ij} 2^{jP}   (1)

and

  a = Σ_{i=0}^{N−1} a_i 2^{iP²/2} = Σ_{i=0}^{N−1} Σ_{j=0}^{P−1} a_{ij} 2^{i(P²/2)+jP}   (2)

with

  0 ≤ a_{ij} < 2^P for all i, j   (3)

and

  a_{ij} = 0 for N/2 ≤ i < N or P/2 ≤ j < P   (4)

Finally, a_i defines the polynomial α_i ∈ R by

  α_i = Σ_{j=0}^{P−1} a_{ij} x^j = a_{i0} + a_{i1}x + a_{i2}x² + ⋯ + a_{i P−1}x^{P−1}   (5)

Thus "Decompose" produces a normal form where the padding defined by Equation (4) is designed to avoid any wrap around modulo x^P + 1 when doing multiplication in R. "Compose" not only reverses the effect of "Decompose", but it works just as well for representations not in normal form, as they occur at the end of the computation.

The procedure Select(N) (Figure 3) determines how the FFT is recursively broken down, corresponding to a factorization N = JK. Schönhage has used Select(N) = 2, Aho, Hopcroft and Ullman use Select(N) = N/2, a balanced approach is obtained by Select(N) = 2^{⌊(lg N)/2⌋}. We choose Select(N) = 2P.

Procedure Decompose:
Input: Integer a of length at most NP²/4 in binary, N, P (powers of 2)
Output: a ∈ R^N encoding the integer a
Comment: a_{i0}, a_{i1}, …, a_{i P−1} are the coefficients of a_i ∈ R. a_0, a_1, …, a_{N−1} are the components of a ∈ R^N as defined by Equation (5). The integer a is the concatenation of the a_{ij} for 0 ≤ i < N and 0 ≤ j < P/2 as binary integers, as defined by Equations (1), (2) and (4), and Inequalities (3).

for i = 0 to N/2 − 1 do
  for j = 0 to P/2 − 1 do
    a_{ij} = a mod 2^P
    a = ⌊a/2^P⌋
  for j = P/2 to P − 1 do
    a_{ij} = 0
for i = N/2 to N − 1 do
  for j = 0 to P − 1 do
    a_{ij} = 0
a_i = a_{i0} + a_{i1}x + a_{i2}x² + ⋯ + a_{i P−1}x^{P−1} (for all i)
Return a

Figure 1: The procedure Decompose

Procedure Compose:
Input: a ∈ R^N, N, P (powers of 2)
Output: Integer a encoded by a
Comment: a_0, a_1, …, a_{N−1} are the components of a vector a ∈ R^N. For all i, j, a_{ij} is the coefficient of x^j in a_i. The integer a is obtained from the rounded a_{ij} as defined in Equation (2).

round all a_{ij} to the nearest integer
a = 0
for j = P − 1 downto P/2 do
  a = a · 2^P + a_{N−1 j}
for i = N − 1 downto 1 do
  for j = P/2 − 1 downto 0 do
    a = a · 2^P + a_{ij} + a_{i−1 j+P/2}
for j = P/2 − 1 downto 0 do
  a = a · 2^P + a_{0j}
Return a

Figure 2: The procedure Compose

This choice is slightly better than a balanced solution (by a constant factor only), because only every 2P-th operation (instead of every i-th for some P < i ≤ 2P) requires expensive multiplications with powers of ω.

With the help of these auxiliary procedures, we can now describe the main algorithm in Figure 4. In order to do Integer-Multiplication we employ various simple functions (Figure 5), as well as the previously presented procedures Decompose and Compose. Integer-Multiplication (Figure 6) is based on the three major parts: FFT (Figure 7) for both factors, Componentwise-Multiplication (Figure 8), and Inverse-FFT (Figure 9). The crucial part, FFT, is presented as a recursive algorithm for simplicity and clarity. It uses the auxiliary procedure Select (Figure 3). FFT, Componentwise-Multiplication, and Inverse-FFT all use the operation ∗ (Figure 10), which is the multiplication in the ring R. This operation is implemented by integer multiplications, which are executed by recursive calls to the algorithm Integer-Multiplication.

We compute the product of two complex polynomials by first writing each as a sum of a real and an imaginary polynomial, and then computing four products of real polynomials. Alternatively, we could achieve the same result with three real polynomial multiplications based on the basic idea of [KO62]. Real polynomials are multiplied by multiplying their values at a good power of 2 as proposed by Schönhage [Sch82].

Procedure Select:
Input: N ≥ 4 (a power of 2), P (a power of 2)
Output: J ≥ 2 (a power of 2 dividing N/2)
Comment: The procedure selects J such that the N-point FFT is decomposed into J-point FFT's followed by K = N/J-point FFT's.

if N ≤ 2P then Return 2 else Return 2P

Figure 3: The procedure Select determining the recursive decomposition

The Structure of the Complete Algorithm:
Comment: Lower level (indented) algorithms are called from the higher level algorithms. In addition, the Operation ∗ recursively calls Integer-Multiplication.

Integer-Multiplication
  Decompose
  FFT
    Select
  Componentwise-Multiplication
  Inverse-FFT
  Operation ∗ (= Multiplication in R)
  Compose

Figure 4: Overall structure of the multiplication algorithm

Various functions:
lg: the log to the base 2
length: the length in binary
round: rounded up to the next power of 2

Figure 5: Various functions

A good power of 2 makes the values not too big, but still pads the space between the coefficients nicely such that the coefficients of the product polynomial can easily be recovered from the binary representation of the integer product. Schönhage [Sch82] has shown that this can easily be achieved with a constant factor blow-up. Actually, he proposes an even better method for handling complex polynomials. He does integer multiplication modulo 2^N + 1 and notices that 2^{N/2} can serve as the imaginary unit i. We don't further elaborate on this method, as it only affects the constant factor in front of log* in the exponent of the running time.

5 Precision for the FFT over R

We will compute Fourier transforms over the ring R. Elements of R are polynomials over C modulo x^P + 1. The coefficients are represented by pairs of reals with fixed precision for the real and imaginary part. We want to know the numerical precision needed for the coefficients of these polynomials. We start with integer coefficients. After doing two FFT's in parallel, and multiplying corresponding values followed by an inverse FFT, we know that the result has again integer coefficients. Therefore, the precision has to be such that at the end the absolute errors are less than 1/2. Hence, a set of final rounding operations provably produces the correct result.

The N-point FFT's and the subsequent N-point inverse FFT consist of 2 lg N + 2 levels (Figure 11). The bottom level − lg N − 1 is the input level. The FFT's proceed from level − lg N − 1 to level −1. They are followed by the inverse FFT from level 0 to level lg N.

Let R_ℓ be the set of elements of R occurring at level ℓ. At every level ℓ from − lg N to lg N, except for level 0, each entry γ ∈ R_ℓ is obtained as a linear combination γ = τ(α ± β) with α, β ∈ R_{ℓ−1} being two entries of the level below, and the multiplier τ ∈ R being a root of unity in R.
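The multiplication of real polynomials via their values at a good power of 2 can be sketched as follows, here for non-negative integer coefficients (our illustration; the function name and the assumed coefficient bound `coeff_bits` are not from the paper):

```python
def poly_mul_via_integers(f, g, coeff_bits=16):
    # evaluate both polynomials at 2^s, with s large enough that no
    # coefficient of the product can overflow into its neighbor, multiply
    # the two resulting integers, then read the coefficients back out
    n = max(len(f), len(g))
    s = 2 * coeff_bits + n.bit_length() + 1   # room for sums of products
    F = sum(c << (s * i) for i, c in enumerate(f))
    G = sum(c << (s * i) for i, c in enumerate(g))
    H = F * G
    out = []
    for _ in range(len(f) + len(g) - 1):
        out.append(H & ((1 << s) - 1))
        H >>= s
    return out

# (1 + 2x + 3x^2)(4 + 5x) = 4 + 13x + 22x^2 + 15x^3
assert poly_mul_via_integers([1, 2, 3], [4, 5]) == [4, 13, 22, 15]
```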

Algorithm Integer-Multiplication:
Input: Integers a and b in binary
Output: Product d = ab
Comment: n is defined as twice the maximal length of a and b, rounded up to the next power of 2. The product d = ab is computed with Fourier transforms over R. Let ω be the Nth root of unity in R with value σ^{2k+1} = e^{2πi(2k+1)/N} at ζ^{2k+1} = e^{iπ(2k+1)/P} (for k = 0, …, P − 1). Let n_0 be some constant. For n ≤ n_0, a trivial algorithm is used.

n = round(2 max(length(a), length(b)))
if n ≤ n_0 then Return ab
P = round(lg n)
N = round(2n/P²)
f = FFT(Decompose(a), ω, N, P)
g = FFT(Decompose(b), ω, N, P)
h = Componentwise-Multiplication(f, g, N, P)
d = Inverse-FFT(h, ω, N, P)
Return Compose(d)

Figure 6: The Algorithm Integer-Multiplication

Algorithm FFT:
Input: a ∈ R^N, ω ∈ R (an Nth root of unity), N, P (powers of 2)
Output: b ∈ R^N, the N-point DFT of the input
Comment: The N-point FFT is the composition of J-point inner FFT's and K-point outer FFT's. We use the vectors a, b ∈ R^N, c^k ∈ R^J (k = 0, …, K − 1), and d^j ∈ R^K (j = 0, …, J − 1).

if N = 1 then Return a
if N = 2 then {b_0 = a_0 + a_1; b_1 = a_0 − a_1; Return b}
J = Select(N, P); K = N/J
for k = 0 to K − 1 do
  for k′ = 0 to J − 1 do
    c^k_{k′} = a_{k′K+k}
  c^k = FFT(c^k, ω^K, J) //inner FFT's
for j = 0 to J − 1 do
  for k = 0 to K − 1 do
    d^j_k = c^k_j ∗ ω^{jk}
  d^j = FFT(d^j, ω^J, K) //outer FFT's
  for j′ = 0 to K − 1 do
    b_{j′J+j} = d^j_{j′}
Return b

Figure 7: The algorithm FFT

Algorithm Componentwise-Multiplication:
Input: f, g ∈ R^N, N, P (powers of 2)
Output: h ∈ R^N (the componentwise product of f and g)

for j = 0 to N − 1 do
  h_j = f_j ∗ g_j

Figure 8: The algorithm Componentwise-Multiplication

Level 0 is special. Corresponding output entries of the two FFT's from R_{−1} are multiplied to form the inputs to the inverse FFT.

The relatively weak Lemma 3 is sufficient for our theorems. We could obtain slightly tighter results based on the fact that Fourier transforms over C (and their inverses) are unitary transformations up to simple scaling factors. This would produce a better constant factor in front of log* n in the exponent.

We distinguish two fast Fourier transforms. Besides the main N-point FFT over R, we also consider a Fourier transform operating on the coefficients of a single element of R.

Definition: HDFT (half DFT) is the linear function mapping the vector of coefficients a to the vector of values of the polynomial Σ_{i=0}^{P−1} a_i x^i at the P primitive 2P-th roots of unity.

the vector of values of the polynomial Σ_{i=0}^{P−1} a_i x^i at the P primitive 2Pth roots of unity.

Algorithm Inverse-FFT:
  Input: a ∈ R^N, ω ∈ R (an Nth root of unity), N (a power of 2)
  Output: b ∈ R^N (the inverse N-point DFT of the input)
  Return (1/N) FFT(a, ω^{−1}, N)

Figure 9: The algorithm Inverse-FFT

Procedure Operation ∗ (= Multiplication in R):
  Input: α, β ∈ R, P (a power of 2)
  Output: γ (the product α · β ∈ R)
  Comment: Compute the product of the two polynomials first by writing each as a sum of a real and an imaginary polynomial. Then compute the 4 products of polynomials by multiplying their values at a good power of 2, which pads the space between the coefficients nicely such that the coefficients of the product polynomial can easily be recovered from the binary representation of the integer product.
  details omitted

Figure 10: The multiplication in R (= operation ∗)

[Diagram omitted: the levels run from − lg N − 1 (Input) through −1 (the FFT), then level 0, up to lg N (Output, via the inverse FFT).]

Figure 11: The computation proceeds from the bottom level − lg N − 1 to the top level lg N

HDFT could be computed by extending a with a_P = ··· = a_{2P−1} = 0 and doing a DFT, but actually only about half the work is needed.

During the N-point FFT over R, the elements of R are represented by their vectors of coefficients a. Nevertheless, to bound the size of these coefficients, we argue about the size of the components of HDFT(a). Initially, for every element of R involved in the input to the N-point FFT, the coefficients form a vector a consisting of P non-negative integers less than 2^P. As the P-point Fourier transform is √P times a unitary mapping, the l2-norm of HDFT(a) is at most P(2^P − 1).

We proceed with a bound on the precision without aiming at the best constant factor.

Lemma 3  O(log^2 n) precision is sufficient for the recursive call to Integer-Multiplication.

Proof  We start by estimating the absolute values of the coefficients of all the ring elements involved in the FFT. Our goal is a simple estimate of the absolute values of any coefficient. Still, the most obvious estimates would be quite imprecise. They would only be sufficient for our purpose if we changed the algorithms slightly by selecting a larger P = Ω(log N log log N) instead of P = O(log N). This would not be a problem, as we would still get the same type of bound on the complexity.
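The integer-packing step sketched in Figure 10 — evaluating polynomials at a suitable power of 2 so that the product's coefficients can be read off from blocks of one big integer product — is an instance of Kronecker substitution. The sketch below is our own illustration, not the paper's procedure: it omits the real/imaginary splitting and assumes non-negative integer coefficients.

```python
def poly_mult_kronecker(f, g, coeff_bits):
    """Multiply polynomials (coefficient lists, low degree first) with
    non-negative integer coefficients below 2**coeff_bits, using a
    single big-integer multiplication (Kronecker substitution)."""
    # Slot width s: each product coefficient is a sum of at most
    # max(len(f), len(g)) terms, each below 2**(2*coeff_bits), so it
    # fits into an s-bit slot and neighbouring slots never overlap.
    s = 2 * coeff_bits + max(len(f), len(g)).bit_length()
    fv = sum(c << (i * s) for i, c in enumerate(f))  # f evaluated at 2**s
    gv = sum(c << (i * s) for i, c in enumerate(g))  # g evaluated at 2**s
    hv = fv * gv                                     # one integer product
    mask = (1 << s) - 1
    # Read the product coefficients back out of the s-bit slots.
    return [(hv >> (i * s)) & mask for i in range(len(f) + len(g) - 1)]
```

For example, `poly_mult_kronecker([1, 2], [3, 4], 3)` recovers (1 + 2x)(3 + 4x) = 3 + 10x + 8x^2 as `[3, 10, 8]`.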

Level                                   Value bound                   Absolute error bound
− lg N − 1                              P(2^P − 1)                    0
− lg N − 1 + ℓ  (ℓ = 1, ..., lg N − 1)  P(2^P − 1) 2^ℓ                P^2 2^{2P} 2^{2ℓ} 2^{−S}
−1                                      P(2^P − 1) N                  P^2 2^{2P} N^2 2^{−S}
0                                       P^2 (2^P − 1)^2 N^2           2 P^4 2^{2P} N^3 2^{−S}
ℓ  (ℓ = 0, ..., lg N − 1)               P^2 (2^P − 1)^2 N^2 2^ℓ       2 P^4 2^{2ℓ} 2^{2P} N^3 2^{−S}
lg N                                    P^2 (2^P − 1)^2 N^2           2 P^4 2^{2P} N^4 2^{−S}

Table 1: Bounds on absolute values and errors
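The doubling pattern in the Value bound column and the matching growth of the error bounds can be watched numerically. The following toy experiment (our own illustration, not the paper's algorithm) runs lg N levels of butterfly additions and subtractions — the cheap operations between the expensive multiplications — in S-bit fixed-point arithmetic and checks the accumulated error against the geometric bound.

```python
import random

S = 30                       # fractional bits kept after every operation
SCALE = 1 << S

def fix(x):
    """Round x to S bits after the binary point (fixed-point storage)."""
    return round(x * SCALE) / SCALE

random.seed(0)
N = 1 << 8                   # toy 2**8-point butterfly network
lgN = 8
exact  = [random.uniform(-1.0, 1.0) for _ in range(N)]
approx = [fix(x) for x in exact]       # initial rounding error <= 2**-S

for _ in range(lgN):
    half = N // 2
    # One butterfly level: pairwise sums and differences (exact copy).
    exact  = [exact[i] + exact[i + half] for i in range(half)] + \
             [exact[i] - exact[i + half] for i in range(half)]
    # Same level in fixed point: every result is rounded again, so the
    # inherited error at most doubles and gains one fresh rounding.
    approx = [fix(approx[i] + approx[i + half]) for i in range(half)] + \
             [fix(approx[i] - approx[i + half]) for i in range(half)]

err = max(abs(e - a) for e, a in zip(exact, approx))
# Geometric sum: error stays below 2**(lgN + 1) * 2**-S, far under 1/2.
assert err <= 2 ** (lgN + 1) * 2 ** -S
```

With S = O(lg N) extra bits the final error stays well below 1/2, which is the shape of the argument behind Column 3 of Table 1.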

Instead, we use a simple trick. We first estimate not the absolute values of the coefficients of a ∈ R, but the absolute values of the components of HDFT(a) for all elements a occurring in the FFT. For all levels, these bounds are given in Column 2 of Table 1. The bounds are easily computed row by row. At level − lg N − 1, we have the obvious bound. At the negative levels, we notice that during an addition or subtraction not only the coefficients, but also the half Fourier transformed values, i.e., the values at primitive 2Pth roots of unity, are just added or subtracted. Thus all these entries can at most double in absolute value. The advantage of focussing on the Fourier transformed values is that during the subsequent multiplication with a root of unity from R, the absolute values remain fixed.

As the coefficients of elements a ∈ R are obtained by an inverse HDFT (half Fourier transform), they actually satisfy the same upper bounds as the components of HDFT(a). So the entries in Column 2 of Table 1 can be viewed as bounds on the coefficients at level ℓ. Just the simple argument producing the bound for level ℓ + 1 from the bound of level ℓ no longer works for the coefficients. At level − lg N − 1, the trivial upper bound on the coefficients is actually better, but we don't use that fact.

In order to obtain error bounds, we have a closer look at the operations performed at each level. On most levels, every entry is obtained by an addition or subtraction, a multiplication with a root of unity from R, and in the last level also a division by N. But most of the multiplications with roots of unity from R are nothing else but cyclic permutations of the coefficients. Thus, regarding coefficients of R, we usually do just additions and subtractions, which produce no new errors.

The error bounds for the absolute values of the coefficients from each level are presented in Column 3 of Table 1. The factor 2^{−S} comes from assuming we store the coefficients of elements of R with a precision of S bits after the binary point. It is easy to choose a precise value for S = O(log N) such that the last entry in Column 3 is less than 1/2. Then all entries are less than 1/2. Each entry in Column 3 is obtained from the error bound of the previous level and the absolute value bound of the current level. Additions and subtractions are straightforward. The multiplication is somewhat tricky if one wants to be exact. We handle it according to the following scheme.

Let u and v be the values whose product w = uv we want to approximate. Instead of u and

v, we know u + e_u and v + e_v, where e_u and e_v are small errors. In addition we have real upper bounds U, V, E_U, E_V with |u| ≤ U − 2, |v| ≤ V, |e_u| ≤ E_U ≤ 1, and |e_v| ≤ E_V. Assume the multiplication is done with a rounding error e_mult with |e_mult| ≤ E_V due to the use of fixed-point arithmetic. Then the error e_w is defined by the equation

  w + e_w = e_mult + (u + e_u)(v + e_v)

implying

  e_w = e_mult + (u + e_u) e_v + v e_u

Hence, we obtain the following bound on |e_w|:

  |e_w| ≤ U E_V + V E_U

We apply this method as follows. Usually, U − 2 is the bound on the absolute value of the previous level and E_U is its error bound, while V = 1 is the bound on the coefficients of roots of unity of R. We select E_V = E_U/U. The additions and subtractions cause a doubling of the error bound per level in the FFT and Inverse-FFT. Multiplications with roots of unity from R occur every lg P + 1 levels. Each new coefficient in R is the sum of P products of previous such coefficients. This results in an error bound of

  (U E_V + V E_U) P ≤ (U E_U/U + E_U) P = 2 P E_U

Instead of increasing the error bound E_U by a factor of 2P at once, we just increase it by a factor of 2 per level. Combined with the factor of 2 increase due to additions and subtractions, we obtain a factor 4 increase per level. Level 0 is special as the multiplication is with two elements of R, neither of them being a root of unity. Here U = V is the value bound, and E_U = E_V is the error bound at level −1. The new error bound is then

  (U E_V + V E_U) P ≤ 2 U E_U P

Finally, at level lg N the bounds on the absolute value and the error decrease by a factor of N due to the 1/N factor in front of the Inverse-FFT. The somewhat strange bound of U − 2 on |u| explains why we insist on writing (2^P − 1) in the Value bound column. A nice bound U is then obtained easily from the generous inequality (2^P − 1)^2 ≤ 2^{2P} − 2.

6 Complexity

Independently of how an N-point Fourier transform is recursively decomposed, the computation can always be visualized by the well-known butterfly graph with lg N + 1 rows. Every row represents N elements of the ring R. Row 0 represents the input, row lg N represents the output, and every entry of row j + 1 is obtained from row j (0 ≤ j < lg N) by an addition or subtraction and possibly a multiplication with a power of ω. When investigating the complexity of performing the multiplications in R recursively, it is best to still think in terms of the same lg N + 1 rows. At the next level of recursion, N multiplications are done per row. It is important to observe that the sum of the lengths of the representations of all entries in one row grows just by a constant factor from each level of recursion to the next. The blow-up by a constant factor is due to the padding with 0's, and due to the precision needed to represent numerical approximations of complex roots of unity. Padding with 0's occurs when reducing multiplication in R to integer multiplication and during the procedure Decompose.
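Most multiplications with powers of ω are cheap because, with elements of R realized as polynomials reduced modulo x^P + 1 (which is what the sign change on wrap-around reflects), multiplying by a power of x is just a rotation of the coefficient vector with negated wrapped entries. A small exact-arithmetic sketch of this identity (our own illustration):

```python
def negacyclic_shift(coeffs, k):
    """Multiply a polynomial (mod x**P + 1) by x**k: rotate the
    coefficient vector, negating every entry that wraps around."""
    P = len(coeffs)
    out = [0] * P
    for i, c in enumerate(coeffs):
        j = i + k
        sign = -1 if (j // P) % 2 else 1   # x**P = -1: each wrap flips sign
        out[j % P] = sign * c
    return out

def polymul_mod(a, b):
    """Reference check: schoolbook product reduced modulo x**P + 1."""
    P = len(a)
    out = [0] * P
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            if i + j < P:
                out[i + j] += x * y
            else:
                out[i + j - P] -= x * y    # reduce using x**P = -1
    return out
```

For instance, `negacyclic_shift([1, 2, 3, 4], 1)` gives `[-4, 1, 2, 3]`, matching `polymul_mod([1, 2, 3, 4], [0, 1, 0, 0])` — a linear-time shift instead of a full ring multiplication.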

We do O(log∗ n) levels of recursive calls to Integer-Multiplication. As the total length of a row grows by a constant factor from level to level, we obtain the factor 2^{O(log∗ n)} in the running time. For all practical purposes, log∗ n in the exponent of the running time actually represents something like log∗ n − 3, which could be thought of as being 2 or 1, because at a low level one would switch to school multiplication.

The crucial advantage of our new FFT algorithm is the fact that most multiplications with powers of ω can be done in linear time, as each of them only involves a cyclic shift (with sign change on wrap around) of a vector of coefficients representing an element of R. Indeed, only every O(log log N)th row of the FFT requires recursive calls for non-trivial multiplications with roots of unity. We recall that our Fourier transform is over the ring R, whose elements are represented by polynomials of degree P − 1 with coefficients of length O(P) = O(log N). Therefore, we get the main result on the running time of our algorithms Integer-Multiplication and FFT.

Lemma 4  The boolean circuit complexity T(n) of the algorithm Integer-Multiplication and the running time T′(N) of the algorithm FFT fulfill the following recurrence equations for some constants c1, c2, c3, c4.

  T(n) ≤ c1 if n < c2, and
  T(n) ≤ O(T′(n/log^2 n)) otherwise.

  T′(N) ≤ c3 if N < c4, and
  T′(N) ≤ O(N log^3 N + (N log N / log log N) T(O(log^2 N))) otherwise.

Proof  The factor log^2 n in the first recurrence equation is due to the fact that the FFT is done over R, and elements of R encode P^2/2 = Θ(log^2 n) bits. In the first term of the second recurrence equation, one factor log N is coming from the log N + 1 levels of the FFT, while the remaining factor log^2 N represents the lengths of the entries, being elements of the ring R. There are O((log N)/log log N) levels with non-trivial multiplications implemented by recursive calls to integer multiplications of length O(log^2 N). This explains the second term of the second recurrence equation.

Lemma 5  The previous recurrence equations have solutions of the form

  T(n) = n log n 2^{O(log∗ n)}

and

  T′(N) = N log^3 N 2^{O(log∗ N)}.

Proof  Combining both recurrence equations, we obtain

  T(n) = O(T′(n/log^2 n)) = O(n log n + (n/(log n log log n)) T(O(log^2 n)))

The iteration method produces a geometrically increasing sum of log∗ n − O(1) terms, the first being O(n log n).

The lemma implies our main results for circuit complexity and (except for the organizational details) for multitape Turing machines.

Theorem 1  Multiplication of binary integers of length n can be done by a boolean circuit of size n log n 2^{O(log∗ n)}.

Theorem 2  Multiplication of binary integers of length n can be done in time n log n 2^{O(log∗ n)} on a 2-tape Turing machine.
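The factor 2^{O(log∗ n)} in the theorems grows extraordinarily slowly: log∗ n counts how often log must be iterated before the value drops to 1. A small helper (our own, for illustration) makes this concrete:

```python
from math import log2

def log_star(n):
    """Iterated logarithm: the number of times log2 must be applied
    to n before the result is at most 1."""
    count = 0
    while n > 1:
        n = log2(n)
        count += 1
    return count
```

For instance, `log_star(65536) == 4`, and log∗ stays at 5 for every n between 2^16 and 2^65536 — so for all physically representable inputs, the exponent is effectively a small constant.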

Detailed proofs of Theorem 2 would be quite tedious. Nevertheless, it should be obvious that, due to the relatively simple structure of the algorithms, there is no problem in principle to implement them on Turing machines.

As an important application of integer multiplication, we obtain corresponding bounds for polynomial multiplication by boolean circuits or Turing machines. We are looking at bit complexity, not assuming that products of coefficients can be obtained in one step.

Corollary 2  Products of polynomials of degree less than n, with an O(m) upper bound on the absolute values of their real or complex coefficients, can be approximated in time mn log mn 2^{O(log∗ mn)} with an absolute error bound of 2^{−m}, for a given m = Ω(log n).

Proof  Schönhage [Sch82] has shown how to reduce the multiplication of polynomials with complex coefficients to integer multiplication with only a constant factor in time increase.

Indeed, multiplying polynomials with real or complex coefficients is a major area where long integer multiplication is very useful. Long integer multiplication is extensively used in some areas of mathematics, like testing the Riemann hypothesis for millions of roots of the zeta function, and in particular for finding large prime numbers.

7 Open Problem

Besides the obvious question whether integer multiplication is in O(n log n), a multiplication algorithm running in time O(n log n log∗ n) would also be very desirable. Furthermore, it would be nice to have an implementation that compares favorably with current implementations of the algorithm of Schönhage and Strassen. The asymptotic improvement from O(n log n log log n) to n log n 2^{O(log∗ n)} might suggest that an actual speed-up only shows up for astronomically large numbers. Indeed, the expressions are not very helpful to judge the performance, because the constants hidden in the O-notation might well be comparable with log log n as well as with 2^{log∗ n} for reasonable values of n. Looking at the algorithms themselves, we conjecture that a noticeable practical improvement is possible for numbers n in the tens of millions. Such multiplications are routinely done in primality testing algorithms.

References

[AHU74] A. V. Aho, J. E. Hopcroft, and J. D. Ullman, The design and analysis of computer algorithms, Addison-Wesley, Reading, Massachusetts, 1974.

[CA69] S. A. Cook and S. O. Aanderaa, On the minimum computation time of functions, Transactions of the AMS 142 (1969), 291–314.

[CT65] J. W. Cooley and J. W. Tukey, An algorithm for the machine calculation of complex Fourier series, Math. of Comput. 19 (1965), 297–301.

[Für89] Martin Fürer, On the complexity of integer multiplication (extended abstract), Tech. Report CS-89-17, Department of Computer Science, The Pennsylvania State University, 1989.

[HJB84] M. T. Heideman, D. H. Johnson, and C. S. Burrus, Gauss and the history of the FFT, IEEE Acoustics, Speech, and Signal Processing 1 (1984), 14–21.

[Knu98] Donald E. Knuth, The art of computer programming, Volume 2, Seminumerical algorithms, third ed., Addison-Wesley, Reading, MA, USA, 1998.

[KO62] Anatolii Karatsuba and Yu. Ofman, Multiplication of multidigit numbers on automata, Doklady Akademii Nauk SSSR 145 (1962), no. 2, 293–294 (in Russian). English translation in Soviet Physics-Doklady 7, 595–596, 1963.

[Mor73] Jacques Morgenstern, Note on a lower bound on the linear complexity of the fast Fourier transform, Journal of the ACM 20 (1973), no. 2, 305–306.

[Pan86] Victor Ya. Pan, The trade-off between the additive complexity and the asynchronicity of linear and bilinear algorithms, Information Processing Letters 22 (1986), no. 1, 11–14.

[Pap79] Christos H. Papadimitriou, Optimality of the fast Fourier transform, Journal of the ACM 26 (1979), no. 1, 95–102.

[PFM74] Michael S. Paterson, Michael J. Fischer, and Albert R. Meyer, An improved overlap argument for on-line multiplication, Tech. Report 40, Project MAC, MIT, January 1974.

[Sch80] Arnold Schönhage, Storage modification machines, SIAM J. Comput. 9 (1980), no. 3, 490–508.

[Sch82] Arnold Schönhage, Asymptotically fast algorithms for the numerical multiplication and division of polynomials with complex coefficients, Computer Algebra, EUROCAM '82, European Computer Algebra Conference, Marseille, France, 5–7 April, 1982, Proceedings (Jacques Calmet, ed.), Lecture Notes in Computer Science, vol. 144, Springer, 1982, pp. 3–15.

[SGV94] Arnold Schönhage, Andreas F. W. Grotefeld, and Ekkehart Vetter, Fast algorithms: A Turing machine implementation, B.I. Wissenschaftsverlag, Mannheim-Leipzig-Wien-Zürich, 1994.

[SS71] Arnold Schönhage and Volker Strassen, Schnelle Multiplikation grosser Zahlen, Computing 7 (1971), 281–292.

[Too63] Andre L. Toom, The complexity of a scheme of functional elements simulating the multiplication of integers, Dokl. Akad. Nauk SSSR 150 (1963), 496–498 (in Russian). English translation in Soviet Mathematics 3, 714–716, 1963.

[Sch66] Arnold Schönhage, Multiplikation großer Zahlen, Computing 1 (1966), no. 3, 182–196 (German).
