
Schönhage-Strassen with MapReduce for Multiplying Terabit Integers

(April 30, 2011)

Tsz-Wo Sze, Yahoo! Cloud Platform, 701 First Avenue, Sunnyvale, CA 94089, USA. [email protected]

ABSTRACT

We present MapReduce-SSA, an integer multiplication algorithm using the ideas from the Schönhage-Strassen algorithm (SSA) on MapReduce. SSA is one of the most commonly used large integer multiplication algorithms. MapReduce is a programming model invented for distributed data processing on large clusters. MapReduce-SSA is designed for multiplying integers at terabit scale on clusters of commodity machines. As parts of MapReduce-SSA, two algorithms, MapReduce-FFT and MapReduce-Sum, are created for computing discrete Fourier transforms and summations. These mathematical algorithms match the model of MapReduce seamlessly.

Categories and Subject Descriptors: I.1.2 [Computing Methodologies]: Algorithms—symbolic and algebraic manipulation; F.2.1 [Theory of Computation]: Numerical Algorithms—computation of fast Fourier transforms; G.4 [Mathematics of Computing]: Mathematical Software

General Terms: Algorithms, distributed computing

Keywords: Integer multiplication, multiprecision arithmetic, fast Fourier transform, summation, MapReduce

1. INTRODUCTION

Integer multiplication is one of the most critical operations in arbitrary precision arithmetic. Beyond its direct applications, the existence of fast multiplication algorithms running in essentially linear time, and the reducibility of other common operations such as division, square root, etc. to integer multiplication, motivate improvements in techniques for its efficient computation at scale [4, 8]. For terabit-scale integer multiplication, the major application is the computation of classical constants, in particular the computation of π [21, 1, 18, 7, 5].

The known asymptotically fastest multiplication algorithms are based on fast Fourier transforms (FFT). In practice, multiplication libraries often employ a mix of algorithms, as the performance and suitability of an algorithm vary depending on the bit range computed and on the particular machine where the multiplication is performed. While benchmarks inform the tuning of an implementation on a given computation platform, algorithms best suited to particular bit ranges are roughly ranked by both empirical and theoretical results. Generally, the naïve algorithm performs best in the lowest bit range, then the Karatsuba algorithm, then the Toom-Cook algorithm, and multiplication in the largest bit ranges typically employs FFT-based algorithms. For details, see [3, 13].

The Schönhage-Strassen algorithm (SSA) is one such FFT-based integer multiplication algorithm, requiring O(N log N log log N) bit operations to multiply two N-bit integers [15]. There are other FFT-based algorithms asymptotically faster than SSA. In 2007, Fürer published an algorithm with running time O(N log N · 2^{O(log* N)}) bit operations [11]. Using similar ideas from Fürer's algorithm, De et al. discovered another O(N log N · 2^{O(log* N)}) algorithm using modular arithmetic, while the arithmetic of Fürer's algorithm is carried out over complex numbers [9]. As these two algorithms are relatively new, they are not found in common software packages, and the bit ranges where they begin to outperform SSA are, as yet, unclear. SSA remains a dominant implementation in practice. See [12, 6] for details on SSA implementations.

Algorithms for multiplying large integers, including SSA, require a large amount of memory to store the input operands, intermediate results, and the output product. For in-place SSA, the total space required to multiply two N-bit integers ranges from 8N to 10N bits; see §3.1.
Traditionally, multiplication of terabit-scale integers is performed on supercomputers with software customized to exploit their particular memory and interconnect architectures [5, 7, 18]. Takahashi recently computed the square of a 2 TB integer with 5.15 trillion decimal digits in 31 minutes and 41 seconds using the 17.5 TB main memory of the T2K Open Supercomputer [17].

Most arbitrary precision arithmetic software packages assume a uniform address space on a single machine. The precision supported by libraries such as GMP and Magma is limited by the available physical memory. When memory is exhausted during a computation, these packages may have very poor performance or may not work at all. Such packages are not suited to terabit-scale multiplication, as the memory available in most machines is insufficient by orders of magnitude. Other implementations such as y-cruncher [20] and TachusPI [2] use clever paging techniques to scale a single machine beyond these limits. By using the hard disk sparingly, its significantly higher capacity enables efficient computation despite its high latencies. According to Yee, y-cruncher is able to multiply integers with 5.2 trillion decimal digits using a desktop computer with only 96 GB of memory in 41.6 hours [19].

In this paper, we present a distributed variant of SSA for multiplying terabit integers on commodity hardware using a shared compute layer. Our prototype implementation runs on Hadoop clusters. Hadoop is an open source, distributed computing framework developed by the Apache Software Foundation. It includes a fault-tolerant, distributed file system, namely HDFS, designed for high-throughput access to very large data sets at petabyte scale [16]. It also includes an implementation of MapReduce, a programming model designed for processing big data sets on large clusters [10].

Our environment is distinguished both from the supercomputer and the single-machine variants described in the preceding paragraphs. Unlike these settings, our execution environment assumes that hardware failures are common. The fixed computation plan offered by our host environment is designed for multi-tenant, concurrent batch processing of large data volumes on a shared storage and compute grid. As a sample constraint, communication patterns between processes are fixed and communicated only at their conclusion, to ensure execution plans remain feasible should a machine fail. While the problem of memory remains a central and familiar consideration of the implementation, some of the challenges posed and the advantages revealed by our programming paradigm and execution environment are unique. We learned that not only SSA, but FFT and other arbitrary precision routines like summation match the MapReduce programming model seamlessly. More specifically, the FFT matrix transposition, which is traditionally difficult in preserving locality, becomes trivial in MapReduce.
The rest of the sections are organized as follows. More details on the computation environment are given in §2. SSA is briefly described in §3. A new algorithm, MapReduce-SSA, is presented in §4. The conclusion is in §5.

2. COMPUTATION ENVIRONMENT

In this section, we first describe our cluster configurations, and then give a short introduction to MapReduce.

2.1 Cluster Configurations

At Yahoo!, Hadoop clusters are composed of hundreds or thousands of commodity machines. All the machines in the same cluster are collocated in the same data center. Each machine typically has

• 2 quad-core CPUs,
• 6 GB or more memory, and
• 4 hard drives.

The machines are connected with a two-level network topology shown in Figure 1. There are 40 machines connected to a rack switch by 1-gigabit links. Each rack switch is connected to a set of core switches with 8-gigabit or higher aggregate bandwidth.

Figure 1: Network topology

For terabit-scale integer multiplication, the data involved in the matrix transposition step in parallel FFT is at terabyte scale. Small clusters with tens of machines are inadequate to perform such an operation, even if individual machines are equipped with high performance CPUs and storage, since the network bandwidth is a bottleneck of the system. The benefits of adding machines to a cluster are twofold: it increases the number of CPUs and the size of storage, and it also increases the aggregate network bandwidth. From our experience, clusters with five hundred machines are sufficient to perform terabit-scale multiplication.

2.2 MapReduce

In the MapReduce model, the input and the output of a computation are lists of key-value pairs. Let k_t be key types and v_t be value types for t = 1, 2, 3. The user specifies two functions,

    map : (k_1, v_1) −→ list⟨k_2, v_2⟩, and
    reduce : (k_2, list⟨v_2⟩) −→ list⟨k_3, v_3⟩.

When a job starts, the MapReduce framework launches map tasks. A map task transforms one or more inputs in (k_1, v_1) to intermediate outputs in (k_2, v_2) according to map(·). Once all map tasks have completed, the framework begins a process called shuffle. It puts all intermediate values with the same key into a list, launches reduce tasks and sends the keys with the corresponding lists to the reduce tasks. A reduce task uses reduce(·) to process one or more of the key-list pairs in (k_2, list⟨v_2⟩) and outputs key-value pairs in (k_3, v_3). The job is finished when all reduce tasks are completed. Figure 2 shows a diagram of the MapReduce model.

Figure 2: MapReduce model

The number of map tasks is at least one, but the number of reduce tasks could possibly be zero. For a map-only job, shuffle is unnecessary and is not performed. The output from the map tasks becomes the output of the job.

The MapReduce framework included in Hadoop is highly parallel, locality-aware, elastic, flexible and fault-tolerant. Operators routinely add new machines and decommission failing ones without affecting running jobs. Hadoop provides a flexible API and permits users to define various components, including input/output formats that define how data is read and written from HDFS. Hardware failures are common, so the framework must also detect and gracefully tolerate such events. As a MapReduce application, recovery is transparent as long as it conforms to the assumptions and conventions of the framework.
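To make the data flow concrete, the following sketch simulates the map-shuffle-reduce sequence described above in plain Java. It is only an illustration of the model under the definitions given here, not the Hadoop API; the Pair class, the runJob driver and the word-count job are names introduced solely for this example.

import java.util.*;
import java.util.function.BiFunction;

// A minimal in-memory simulation of the MapReduce model described above:
//   map    : (k1, v1)       -> list<(k2, v2)>
//   reduce : (k2, list<v2>) -> list<(k3, v3)>
// Class and method names here are illustrative only, not the Hadoop API.
public class MapReduceModel {
  static class Pair<K, V> {
    final K key; final V value;
    Pair(K key, V value) { this.key = key; this.value = value; }
    public String toString() { return "(" + key + ", " + value + ")"; }
  }

  static <K1, V1, K2, V2, K3, V3> List<Pair<K3, V3>> runJob(
      List<Pair<K1, V1>> input,
      BiFunction<K1, V1, List<Pair<K2, V2>>> map,
      BiFunction<K2, List<V2>, List<Pair<K3, V3>>> reduce) {
    // Map phase: every input record is transformed independently.
    List<Pair<K2, V2>> intermediate = new ArrayList<>();
    for (Pair<K1, V1> p : input) intermediate.addAll(map.apply(p.key, p.value));
    // Shuffle: group all intermediate values that share a key.
    Map<K2, List<V2>> groups = new LinkedHashMap<>();
    for (Pair<K2, V2> p : intermediate)
      groups.computeIfAbsent(p.key, k -> new ArrayList<>()).add(p.value);
    // Reduce phase: each key-list pair is processed independently.
    List<Pair<K3, V3>> output = new ArrayList<>();
    for (Map.Entry<K2, List<V2>> e : groups.entrySet())
      output.addAll(reduce.apply(e.getKey(), e.getValue()));
    return output;
  }

  public static void main(String[] args) {
    // Toy job: word count, with the line offset as the input key.
    List<Pair<Integer, String>> input = Arrays.asList(
        new Pair<>(0, "a b a"), new Pair<>(1, "b c"));
    List<Pair<String, Integer>> counts =
        MapReduceModel.<Integer, String, String, Integer, String, Integer>runJob(
            input,
            (offset, line) -> {
              List<Pair<String, Integer>> out = new ArrayList<>();
              for (String w : line.split(" ")) out.add(new Pair<>(w, 1));
              return out;
            },
            (word, ones) -> Collections.singletonList(new Pair<>(word, ones.size())));
    System.out.println(counts);  // word counts: a = 2, b = 2, c = 1
  }
}

Running it prints the word counts, mirroring how each reduce call receives a key together with the full list of its intermediate values.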

3. SCHÖNHAGE-STRASSEN

We describe SSA briefly in this section. For more detailed discussions, see [15, 8, 3]. The goal is to compute

    p = ab,    (1)

where a, b ∈ Z are the input integers and p is the product of a and b. Without loss of generality, we may assume a > 0 and b > 0. Suppose a and b are N-bit integers for N a power of two. Write

    N = KM = 2^k M,

where K and M are also powers of two. Let

    A(x) := Σ_{i=0}^{K−1} a_i x^i  for 0 ≤ a_i < 2^M, and
    B(x) := Σ_{i=0}^{K−1} b_i x^i  for 0 ≤ b_i < 2^M

be such that A(x), B(x) ∈ Z[x], a = A(2^M) and b = B(2^M). There are M bits in each coefficient a_i and b_i. The number of coefficients in each of A(x) and B(x) is K. Let

    D := 2K = 2^{k+1}.

The strategy is to compute the product polynomial

    P(x) := A(x)B(x) = Σ_{i=0}^{D−1} p_i x^i  for 0 ≤ p_i < 2^{2M+k}.

Then, we have p = P(2^M). Consider the polynomials A, B and P as D-dimensional tuples,

    a := (0, ..., 0, a_{K−1}, ..., a_0),
    b := (0, ..., 0, b_{K−1}, ..., b_0), and
    p := (p_{D−1}, ..., p_0).

Then p is the cyclic convolution of a and b,

    p = a × b.

By the convolution theorem,

    a × b = dft^{−1}(dft(a) ∗ dft(b)),    (2)

where ∗ denotes componentwise multiplication, and dft(·) and dft^{−1}(·) respectively denote the discrete Fourier transform and its inverse. All the operations on the right hand side of equation (2) are performed over the ring Z/(2^n + 1)Z. The integer 2^n + 1 is called the Schönhage-Strassen modulus. The exponent n must satisfy the following conditions,

    (C1) D | 2n, and
    (C2) n ≥ 2M + k.

Thus, n is chosen to be the smallest integer satisfying both conditions (C1) and (C2). At last, the componentwise multiplications are computed by recursive calls to the integer multiplication routine, and dft (or dft^{−1}) is calculated by a forward (or backward) FFT.

There are some well-known improvements on the original SSA. For example, condition (C1) can be relaxed to D | 4n using the √2 trick; see [12].
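As a small worked example of the parameter constraints, the sketch below (a minimal illustration with assumed names and a sample N, not the prototype's code) derives K, M and D from N and picks the smallest n satisfying (C1) and (C2); for this particular split it lands on n = 4.5D, so the intermediate dft(a) occupies 4.5N bits.

// Illustrative SSA parameter selection (names and the sample N are assumptions,
// not the paper's code).  Given N = K*M = 2^k * M, it takes D = 2K and picks the
// smallest n with (C1) D | 2n and (C2) n >= 2M + k.
public class SsaParams {
  public static void main(String[] args) {
    long N = 1L << 30;       // sample input size in bits
    int k = 14;              // K = 2^k coefficients of M bits each
    long K = 1L << k;
    long M = N / K;
    long D = 2 * K;
    // (C1) D | 2n is equivalent to (D/2) | n, so round 2M + k up to a multiple of D/2.
    long step = D / 2;
    long n = ((2 * M + k + step - 1) / step) * step;
    System.out.println("M = " + M + ", D = " + D + ", n = " + n + " (= " + ((double) n / D) + "D)");
    // The intermediate dft(a) occupies nD bits; Section 3.1 shows nD > 4N.
    System.out.println("nD / N = " + ((double) (n * D) / N));   // 4.5 for this choice
  }
}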

3.1 Memory Requirement

For multiplying two N-bit integers, the total input size is 2N bits. The output size is 2N bits as well. The values of dft(a) and dft(b) in equation (2) are the intermediate results. We have

    dft(a) ∈ (Z/(2^n + 1)Z)^D,

a D-dimensional tuple. By condition (C2), its size is

    |dft(a)| ≥ nD ≥ (2M + k)D > 4MK = 4N.

It is possible to choose the SSA parameters such that the DFT size is within the (4N, 5N] interval. Therefore, the total space required for the multiplication is in (12N, 14N] if the buffers for the input, the intermediate results and the output are separate. For an in-place implementation, i.e. the buffers are reused, the total required space decreases to (8N, 10N]. As an example, the required memory is from 1 TB to 1.25 TB for multiplying two 1-terabit integers with an in-place implementation; see also §4.4 and Table 2.

4. MapReduce-SSA

In our design, we target input integers at terabit scale. SSA is a recursive algorithm. The first level call has to process multi-terabit data. The parameters are selected such that the data in the recursive calls is only at megabit or gigabit scale. Therefore, the recursive calls can be executed on a single machine locally. The MapReduce framework is used to handle the first level call so that the computation is distributed across the machines and executed in parallel.

4.1 Data Model

In order to allow accessing terabit integers efficiently, an integer is divided into sequences of records. An integer x is represented as a D-dimensional tuple

    x = (x_{D−1}, x_{D−2}, ..., x_0) ∈ Z^D

as illustrated in §3. It is customary to call the components of x the digits of x. Let

    D = IJ,    (3)

where I > 1 and J > 1 are powers of two. For 0 ≤ i < I, define a J-dimensional tuple

    x^{(i)} := (x_{(J−1)I+i}, x_{(J−2)I+i}, ..., x_i) ∈ Z^J    (4)

so that

    ( x^{(0)}   )   ( x_{(J−1)I}         x_{(J−2)I}         ...  x_0     )
    ( x^{(1)}   ) = ( x_{(J−1)I+1}       x_{(J−2)I+1}       ...  x_1     )
    (   ...     )   (   ...               ...                    ...     )
    ( x^{(I−1)} )   ( x_{(J−1)I+(I−1)}   x_{(J−2)I+(I−1)}   ...  x_{I−1} )

We call (x^{(0)}, ..., x^{(I−1)})^t the (I, J)-format of x. Each x^{(i)} is represented as a sequence of J records, and each record is a key-value pair, where the key is the index and the value is the value of the digit, as shown in Table 1. There are I sequences for the entire tuple x.

    Record #    key-value
    0           ⟨i, x_i⟩
    1           ⟨I+i, x_{I+i}⟩
    ...         ...
    J−1         ⟨(J−1)I+i, x_{(J−1)I+i}⟩

    Table 1: The representation of x^{(i)}
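The record layout in Table 1 amounts to a simple index calculation: digit x_t is stored in sequence i = t mod I, as record number j = ⌊t/I⌋, with key t = jI + i. A minimal sketch (illustrative names only) checks that the mapping is a bijection.

// Illustrative round trip between a flat digit index t and its (I,J)-format
// location (sequence i, record number j), following equation (4).
public class IjFormat {
  public static void main(String[] args) {
    int I = 4, J = 8, D = I * J;   // small sample dimensions (assumptions)
    for (int t = 0; t < D; t++) {
      int i = t % I;               // which of the I sequences holds digit x_t
      int j = t / I;               // record number within sequence x^(i)
      int key = j * I + i;         // the record key stored with the digit
      if (key != t) throw new AssertionError("mapping is not a bijection");
    }
    System.out.println("every digit x_t maps to sequence t mod I, record t / I, key t");
  }
}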

4.2 Algorithms

We give a few definitions in the following. A tuple

    x = (x_{D−1}, ..., x_0)

is normalized if 0 ≤ x_t < 2^M for all 0 ≤ t < D. The input tuples of the algorithm are assumed to be normalized, and the output tuple returned is also normalized. Define the left component-shift operator ≪, for m ≥ 0,

    x ≪ m := (x_{D−m−1}, x_{D−m−2}, ..., x_0, 0, ..., 0),    (5)

where the number of trailing zeros is m. By definition, x ≪ 0 is a no-op. Let y = (y_{D−1}, ..., y_0). Define the addition with carrying operator ⊕ as below,

    x ⊕ y := (s_{D−1} mod 2^M, ..., s_0 mod 2^M),    (6)

where s_{−1} = 0 and s_t = x_t + y_t + ⌊s_{t−1}/2^M⌋ for 0 ≤ t < D.

Using the notations in §3, SSA consists of four steps,

    S1: two forward FFTs, â := dft(a) and b̂ := dft(b);
    S2: componentwise multiplication, p̂ := â ∗ b̂;
    S3: a backward FFT, p = dft^{−1}(p̂); and
    S4: carrying, normalizing p.

S1 first reads the two input integers, and then runs two forward FFTs in parallel. Forward FFTs are performed by the MapReduce-FFT algorithm described in §4.2.2. It is clear that the componentwise multiplication in S2 can be executed in parallel. Notice that the digits in â and b̂ are relatively small in size. For terabit integers, the size of the digits is only at the scale of megabits. Thus, multiplications with these digits can be computed by a single machine. We discuss componentwise multiplication in §4.2.3. In S3, the backward FFT, which is very similar to the forward FFT, is again calculated by MapReduce-FFT. The carrying in S4 can be done by MapReduce-Sum as shown in §4.2.4.
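Definitions (5) and (6) translate directly into array operations on M-bit digits. The following sketch is a literal, single-machine reading of the two operators, with assumed names and a tiny digit width; it is not the distributed implementation.

import java.util.Arrays;

// Illustrative implementations of the left component-shift (5) and the addition
// with carrying (6) on D-digit tuples with M-bit digits (array index t holds x_t).
public class TupleOps {
  static final int M = 4;                       // small digit width for the demo
  static final long BASE = 1L << M;

  // x << m (component shift): drop the top m digits and append m zero digits.
  static long[] shift(long[] x, int m) {
    long[] y = new long[x.length];
    for (int t = x.length - 1; t >= m; t--) y[t] = x[t - m];
    return y;                                   // y[0..m-1] stay zero
  }

  // x (+) y : digit-wise sum with carry propagation, truncated mod 2^(DM).
  static long[] add(long[] x, long[] y) {
    long[] s = new long[x.length];
    long carry = 0;
    for (int t = 0; t < x.length; t++) {
      long st = x[t] + y[t] + carry;            // s_t = x_t + y_t + floor(s_{t-1}/2^M)
      s[t] = st % BASE;
      carry = st / BASE;
    }
    return s;
  }

  public static void main(String[] args) {
    long[] x = {15, 15, 0, 1};                  // the integer 1*2^12 + 15*2^4 + 15
    long[] y = {1, 0, 0, 0};
    System.out.println(Arrays.toString(add(x, y)));      // [0, 0, 1, 1]: carry ripples up
    System.out.println(Arrays.toString(shift(x, 2)));    // [0, 0, 15, 15]
  }
}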

4.2.1 Parallel-FFT

There is a well-known algorithm, called parallel-FFT, for evaluating a DFT in parallel. As before, let

    â := (â_{D−1}, â_{D−2}, ..., â_0) := dft(a).

By the definition of DFT,

    â_j = Σ_{i=0}^{D−1} a_i ζ^{ij}  for 0 ≤ j < D,    (7)

where ζ is a principal Dth root of unity in Z/(2^n + 1)Z. We may take

    ζ = 2^{2n/D} (mod 2^n + 1).

Rewrite the summation, for all 0 ≤ j_0 < J and 0 ≤ j_1 < I,

    â_{j_1 J + j_0} = Σ_{i_0=0}^{I−1} Σ_{i_1=0}^{J−1} ζ^{(i_1 I + i_0)(j_1 J + j_0)} a_{i_1 I + i_0}
                    = Σ_{i_0=0}^{I−1} ζ_I^{i_0 j_1} ( ζ^{i_0 j_0} Σ_{i_1=0}^{J−1} ζ_J^{i_1 j_0} a_{i_1 I + i_0} ),

where ζ_I := ζ^J and ζ_J := ζ^I are principal Ith and Jth roots of unity in Z/(2^n + 1)Z, respectively. For 0 ≤ i < I, let

    a^{(i)} := (a_{(J−1)I+i}, a_{(J−2)I+i}, ..., a_i)

and write its J-point DFT as

    dft(a^{(i)}) = (dft(a^{(i)})_{J−1}, dft(a^{(i)})_{J−2}, ..., dft(a^{(i)})_0).    (8)

These I DFTs with J points can be calculated in parallel. For 0 ≤ i < I and 0 ≤ j < J, define

    z_{jI+i} := ζ^{ij} dft(a^{(i)})_j,    (9)
    z^{[j]} := (z_{jI+(I−1)}, z_{jI+(I−2)}, ..., z_{jI}),    (10)
    ẑ^{[j]} := dft(z^{[j]}).    (11)

Notice that we use [·] in equation (10) in order to emphasize the difference between z^{[j]} and z^{(j)}. The J DFTs with I points in equation (11) can be calculated in parallel. Finally,

    â_{j_1 J + j_0} = Σ_{i_0=0}^{I−1} ζ_I^{i_0 j_1} z_{j_0 I + i_0} = ẑ^{[j_0]}_{j_1}.
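The decomposition can be verified numerically on a toy instance. The sketch below (illustrative names; the tiny parameters are chosen only so that ζ = 2^{2n/D} is a principal Dth root of unity) evaluates the nested sums above over Z/(2^n + 1)Z by brute force and compares them with definition (7).

import java.math.BigInteger;

// Verifies the parallel-FFT decomposition of Section 4.2.1 on a toy example over
// Z/(2^n+1)Z.  The parameters are tiny and only need to satisfy D | 2n; nothing
// here is the paper's actual code.
public class ParallelDftCheck {
  public static void main(String[] args) {
    int n = 4, D = 8, I = 2, J = 4;                    // D = I*J and D | 2n
    BigInteger mod = BigInteger.ONE.shiftLeft(n).add(BigInteger.ONE);   // 2^n + 1
    BigInteger zeta = BigInteger.TWO.pow(2 * n / D).mod(mod);           // D-th root of unity
    BigInteger[] a = new BigInteger[D];
    for (int t = 0; t < D; t++) a[t] = BigInteger.valueOf((t * 3 + 1) % 13).mod(mod);

    // Direct DFT, equation (7): ahat_j = sum_i a_i * zeta^(i*j).
    BigInteger[] direct = new BigInteger[D];
    for (int j = 0; j < D; j++) {
      BigInteger s = BigInteger.ZERO;
      for (int i = 0; i < D; i++)
        s = s.add(a[i].multiply(zeta.modPow(BigInteger.valueOf((long) i * j), mod))).mod(mod);
      direct[j] = s;
    }

    // Decomposed DFT: inner J-point DFTs over zeta_J = zeta^I, twiddle zeta^(i0*j0),
    // then outer I-point DFTs over zeta_I = zeta^J, following equations (8)-(11).
    BigInteger[] decomposed = new BigInteger[D];
    for (int j1 = 0; j1 < I; j1++) {
      for (int j0 = 0; j0 < J; j0++) {
        BigInteger s = BigInteger.ZERO;
        for (int i0 = 0; i0 < I; i0++) {
          BigInteger inner = BigInteger.ZERO;             // J-point DFT of a^(i0) at j0
          for (int i1 = 0; i1 < J; i1++)
            inner = inner.add(a[i1 * I + i0].multiply(
                zeta.modPow(BigInteger.valueOf((long) I * i1 * j0), mod))).mod(mod);
          BigInteger z = inner.multiply(zeta.modPow(BigInteger.valueOf((long) i0 * j0), mod)).mod(mod);
          s = s.add(z.multiply(zeta.modPow(BigInteger.valueOf((long) J * i0 * j1), mod))).mod(mod);
        }
        decomposed[j1 * J + j0] = s;
      }
    }

    for (int j = 0; j < D; j++)
      if (!direct[j].equals(decomposed[j])) throw new AssertionError("mismatch at " + j);
    System.out.println("parallel decomposition agrees with the direct DFT");
  }
}

In practice the two inner loops would be FFTs rather than quadratic sums; the brute-force form is used here only to check the index bookkeeping.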

4.2.2 MapReduce-FFT

We may carry out parallel-FFT on MapReduce. For the forward FFT, the input a is in (I, J)-format: it is divided into a^{(i)} for 0 ≤ i < I as in §4.1. There are I map and J reduce tasks in a job. The map tasks execute I parallel J-point FFTs as shown in equation (8), while the reduce tasks run J parallel I-point FFTs as shown in equation (11). For 0 ≤ j < J, the output from reduce task j is

    â^{(j)} = (â_{(I−1)J+j}, â_{(I−2)J+j}, ..., â_j) = ẑ^{[j]}.

The output â is written in (J, I)-format. The mapper and reducer algorithms are shown in Algorithms 1 and 2. Note that equation (9) computed in (f.m.3) is indeed a bit-shifting since ζ is a power of two. Note also that the transposition of the elements from ζ^{ij} dft(a^{(i)})_j in equation (9) to z^{[j]} in equation (10) is accomplished by the shuffle process, i.e. the transition from map tasks to reduce tasks.

Algorithm 1 (Forward FFT, Mapper).
    (f.m.1) read key i, value a^{(i)};
    (f.m.2) calculate a J-point DFT by equation (8);
    (f.m.3) componentwise bit-shift by equation (9);
    (f.m.4) for 0 ≤ j < J, emit key j, value (i, z_{jI+i}).

Algorithm 2 (Forward FFT, Reducer).
    (f.r.1) receive key j, list [(i, z_{jI+i})]_{0≤i<I};
    (f.r.2) calculate an I-point DFT by equation (11);
    (f.r.3) write key j, value â^{(j)}.

The backward FFT is organized in the same manner with the roles of the two phases exchanged: it has J map and I reduce tasks (Algorithm 3 for the mapper and Algorithm 4 for the reducer; see also §4.4).

4.2.3 Componentwise Multiplication

Componentwise multiplication can be easily done by a map-only job with J map tasks and zero reduce tasks. The mapper j reads â^{(j)} and b̂^{(j)}, computes

    p̂^{(j)} := â^{(j)} ∗ b̂^{(j)}    (12)

over Z/(2^n + 1)Z and then writes p̂^{(j)}. Componentwise multiplication can indeed be incorporated in Algorithm 3 to eliminate the extra map-only job. We replace Step (b.m.1) with the following.

Algorithm 5 (Componentwise-Mult).
    (b.m.1.1) read key j, value (â^{(j)}, b̂^{(j)});
    (b.m.1.2) compute p̂^{(j)} by equation (12).
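Equation (12) is a purely digit-parallel operation. The sketch below (assumed names; BigInteger.multiply stands in for the recursive SSA call mentioned in §3) shows the per-task work of a single mapper j.

import java.math.BigInteger;

// Illustrative componentwise multiplication, equation (12): each pair of digits is
// multiplied in Z/(2^n+1)Z.  A real task would recurse into SSA for megabit digits.
public class ComponentwiseMult {
  static BigInteger[] multiply(BigInteger[] aHat, BigInteger[] bHat, BigInteger mod) {
    BigInteger[] pHat = new BigInteger[aHat.length];
    for (int t = 0; t < aHat.length; t++)
      pHat[t] = aHat[t].multiply(bHat[t]).mod(mod);   // phat_t = ahat_t * bhat_t (mod 2^n+1)
    return pHat;
  }

  public static void main(String[] args) {
    BigInteger mod = BigInteger.ONE.shiftLeft(16).add(BigInteger.ONE);  // sample 2^n+1
    BigInteger[] aHat = {BigInteger.valueOf(123), BigInteger.valueOf(65530)};
    BigInteger[] bHat = {BigInteger.valueOf(456), BigInteger.valueOf(7)};
    for (BigInteger d : multiply(aHat, bHat, mod)) System.out.println(d);
  }
}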

4.2.4 MapReduce-Sum

The carrying in S4 is reduced to a summation of normalized tuples, which is computed by MapReduce-Sum. Let m > 1 be an integer constant which is much smaller than N. For 0 ≤ s < m, let

    y_[s] := (y_{D−1,s}, ..., y_{0,s})

be m normalized tuples. Extending the definition (6) of addition with carrying, define the sum with carrying

    y_[0] ⊕ ··· ⊕ y_[m−1] := (s_{D−1} mod 2^M, ..., s_0 mod 2^M),

where s_{−1} = 0 and, for 0 ≤ t < D,

    s_t = y_{t,0} + ··· + y_{t,m−1} + ⌊s_{t−1}/2^M⌋.    (16)

If the y_[s]'s are considered as integers y_[s] as in §3, then y_[0] ⊕ ··· ⊕ y_[m−1] is the normalized representation of the integer sum

    Σ_{s=0}^{m−1} y_[s] (mod 2^{DM}).

The classical addition algorithm is a sequential algorithm because the carries in the higher digits depend on the carries in the lower digits. It is not suitable for our environment since the entire D-dimensional tuple is at terabyte scale. Processing it with a single machine takes too much time. An observation is that it is unnecessary to have the entire digit sequence for determining the carries ⌊s_{t−1}/2^M⌋ in equation (16). The carries can be obtained using a much smaller data set. For 0 ≤ t < D, let

    z_t := y_{t,0} + ··· + y_{t,m−1},    (17)
    r_t := z_t mod 2^M,    (18)
    c_t := ⌊z_t/2^M⌋    (19)

be the componentwise sums, the remainders and the intermediate carries, respectively. Note that c_t may not equal ⌊s_t/2^M⌋ since, when the c_t's are added to the digits, overflow may occur; in such a case carrying is required again. Define the difference

    d_t = { 2^M − r_t,  if 2^M − r_t < m;
          { m,          otherwise.    (20)

Notice that 0 < d_t ≤ m. We also have

    0 ≤ c_t ≤ ⌊s_t/2^M⌋ < m.    (21)

The last inequality is due to the fact that the input tuples y_[s] are normalized. The sizes of the tuples (c_{D−1}, ..., c_0) and (d_{D−1}, ..., d_0) are in O(D), given m a constant by assumption. Similar to [12, equation (3)], we have

    D ≤ √(8N).    (22)

The data size is reduced from O(N), for the classical addition algorithm, to O(D) = O(√N), i.e. from terabits to megabits. So these two tuples can be sequentially processed on a single machine efficiently. For 0 ≤ t < D, let

    c'_t = { c_t + 1,  if t > 0 and c'_{t−1} ≥ d_t;
           { c_t,      otherwise.    (23)

Then, c'_0, ..., c'_{D−1} are the carries, as shown below.

Theorem. For 0 ≤ t < D, let

    y_t = { r_0,                       if t = 0;
          { (r_t + c'_{t−1}) mod 2^M,  if 1 ≤ t < D.    (24)

Then,

    (y_{D−1}, ..., y_0) = y_[0] ⊕ ··· ⊕ y_[m−1].

Proof. We prove by induction that c'_t = ⌊s_t/2^M⌋ for 0 ≤ t < D. Clearly,

    c'_0 = c_0 = ⌊z_0/2^M⌋ = ⌊s_0/2^M⌋.

Assume c'_t = ⌊s_t/2^M⌋ for 0 ≤ t < D − 1. By equation (21), c'_t < m. Then,

    s_{t+1} = z_{t+1} + ⌊s_t/2^M⌋ = c_{t+1} 2^M + r_{t+1} + c'_t    (25)

by equations (16), (18) and (19). Since m is much smaller than 2^M by assumption, we have

    0 ≤ r_{t+1} + c'_t < 2^M + m < 2^{M+1}.

Then,

    ⌊(r_{t+1} + c'_t)/2^M⌋ = { 1, if r_{t+1} + c'_t ≥ 2^M;
                             { 0, otherwise.    (26)

Suppose d_{t+1} = m. By equation (20), 2^M − r_{t+1} ≥ m. We have

    r_{t+1} + c'_t = r_{t+1} + ⌊s_t/2^M⌋ < r_{t+1} + m ≤ 2^M.

Divide equation (25) by 2^M,

    ⌊s_{t+1}/2^M⌋ = c_{t+1} + ⌊(r_{t+1} + c'_t)/2^M⌋ = c_{t+1} = c'_{t+1}

by equations (23) and (26). Suppose d_{t+1} < m. Then, equation (23) becomes

    c'_{t+1} = { c_{t+1} + 1,  if r_{t+1} + c'_t ≥ 2^M;
               { c_{t+1},      otherwise;

by substituting d_{t+1} = 2^M − r_{t+1}. By equations (26) and (25),

    c'_{t+1} = c_{t+1} + ⌊(r_{t+1} + c'_t)/2^M⌋ = ⌊s_{t+1}/2^M⌋.

Finally, by equation (24),

    y_0 = r_0 ≡ z_0 ≡ s_0 (mod 2^M)

and, for 1 ≤ t < D,

    y_t ≡ r_t + c'_{t−1} ≡ z_t + ⌊s_{t−1}/2^M⌋ ≡ s_t (mod 2^M).

The theorem follows.
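The theorem can be exercised numerically. The following sketch (illustrative names and tiny parameters) sums m normalized tuples digit by digit as in (16), then recomputes the result from only the small tuples r, c and d via (17)-(20), (23) and (24), and checks that the two agree.

import java.util.Arrays;
import java.util.Random;

// Numerical check of equations (17)-(20), (23) and (24): the carries c'_t computed
// from the small tuples (c, d) reproduce the sequential sum of m normalized tuples.
public class CarryCheck {
  public static void main(String[] args) {
    int D = 16, M = 8, m = 5;
    long base = 1L << M;
    Random rnd = new Random(7);
    long[][] y = new long[m][D];
    for (int s = 0; s < m; s++)
      for (int t = 0; t < D; t++) y[s][t] = rnd.nextInt(1 << M);   // normalized digits

    // Reference: sequential addition with carrying, equation (16).
    long[] expected = new long[D];
    long carry = 0;
    for (int t = 0; t < D; t++) {
      long st = carry;
      for (int s = 0; s < m; s++) st += y[s][t];
      expected[t] = st % base;
      carry = st / base;
    }

    // Componentwise sums z, remainders r, intermediate carries c, differences d.
    long[] r = new long[D], c = new long[D], d = new long[D];
    for (int t = 0; t < D; t++) {
      long zt = 0;
      for (int s = 0; s < m; s++) zt += y[s][t];
      r[t] = zt % base;                       // (18)
      c[t] = zt / base;                       // (19)
      d[t] = Math.min(base - r[t], m);        // (20)
    }

    // Actual carries c' by (23), and the normalized output by (24).
    long[] cp = new long[D], out = new long[D];
    for (int t = 0; t < D; t++) {
      cp[t] = (t > 0 && cp[t - 1] >= d[t]) ? c[t] + 1 : c[t];
      out[t] = (t == 0) ? r[0] : (r[t] + cp[t - 1]) % base;
    }
    System.out.println(Arrays.equals(expected, out));   // true
  }
}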

We present MapReduce-Sum below. All tuples are represented in the (I, J)-format described in §4.1. Two jobs are involved in MapReduce-Sum. The first job, for componentwise summation (Algorithm 8), is a map-only job, which has I map and zero reduce tasks. The mapper i reads the input tuples y_[s]^{(i)} for 0 ≤ s < m, and writes the remainders r^{(i)}, the intermediate carries c^{(i)} and the differences d^{(i)}.

Algorithm 8 (Componentwise Sum, Mapper).
    (s.m.1) read key i, value (y_[0]^{(i)}, ..., y_[m−1]^{(i)});
    (s.m.2) compute z^{(i)} by equation (17);
    (s.m.3) compute r^{(i)} by equation (18);
    (s.m.4) compute c^{(i)} by equation (19);
    (s.m.5) compute d^{(i)} by equation (20);
    (s.m.6) write key i, value (r^{(i)}, c^{(i)}, d^{(i)}).

The second job, for carrying (Algorithms 9 and 10), has a single map task and I reduce tasks. The mapper reads all the intermediate carries and all the differences, and then evaluates the actual carries. The reducers receive the actual carries from the mapper, read the corresponding remainders and then compute the final output. For single-map jobs, we may use the null key, denoted by ∅, as the input key.

Algorithm 9 (Carrying, Mapper).
    (c.m.1) read key ∅, value {(i, c^{(i)}, d^{(i)}) : 0 ≤ i < I};
    (c.m.2) compute c' by equation (23);
    (c.m.3) for 0 ≤ i < I, emit key i, value c'^{((i−1) mod I)}.

Algorithm 10 (Carrying, Reducer).
    (c.r.1) receive key i, singleton list [c'^{((i−1) mod I)}];
    (c.r.2) read r^{(i)};
    (c.r.3) compute y^{(i)} by equation (24);
    (c.r.4) write key i, value y^{(i)}.

4.3 Summary

The input of MapReduce-SSA is two integers a and b, which are represented as normalized tuples a and b in (I, J)-format. The output is also in (I, J)-format, a normalized tuple p̃, which represents an integer p such that p = ab. MapReduce-SSA has the following sequence of jobs:

    J1: two concurrent forward FFT jobs (Alg. 1 & 2);
    J2: a backward FFT job with componentwise multiplication and splitting (Alg. 3 with 5, & 4 with 6);
    J3: a componentwise summation job (Alg. 8);
    J4: a carrying job (Alg. 9 & 10).

4.4 Choosing Parameters

In SSA, once D, the dimension of the FFTs, is fixed, all other variables, including the exponent n in the Schönhage-Strassen modulus 2^n + 1, are consequently fixed. The standard choice is D ≈ √N.

In MapReduce-SSA, there is an additional parameter I, defined in equation (3). It determines the number of sequences in the integer representation, the numbers of tasks in the jobs and the memory required in each task. Therefore, the choice of I depends on the cluster capacity, i.e. the numbers of map and reduce tasks supported, and the available memory in each machine.

Denote the number of map and reduce tasks in the forward FFTs by N_{f,m} and N_{f,r}, and the number of map and reduce tasks in the backward FFT by N_{b,m} and N_{b,r}. Then,

    N_{f,m} = N_{b,r} = I and N_{f,r} = N_{b,m} = J.

Excluding the memory overhead from the system software and the temporary variables, let M_{f,m} and M_{f,r} be the number of bits of memory required by the map and reduce tasks in the forward FFT, and M_{b,m} and M_{b,r} be the number of bits of memory required by the map and reduce tasks in the backward FFT. Then,

    M_{f,m} = M_{b,r} = Jn and M_{f,r} = M_{b,m} = In.

We always choose I ≤ J so that M_{b,m} = In ≤ Jn, in order to have more memory available for running Algorithm 5 in the map tasks of the backward FFT. In Tables 2, 3 and 4, we show some possible choices of the parameters D and I, and the corresponding values of some other variables, for N = 2^40, 2^43 and 2^46. The memory requirement is just 0.56 GB per task for N = 2^40. For N = 2^46, it requires 12 GB per task.

    D      I      J      n       In       Jn
    2^19   2^9    2^10   16.5D   0.52 GB  1.03 GB
    2^20   2^10   2^10   4.5D    0.56 GB  0.56 GB
    2^21   2^10   2^11   1.5D    0.38 GB  0.75 GB
    2^22   2^11   2^11   0.5D    0.50 GB  0.50 GB

    Table 2: Possible parameters for N = 2^40

    D      I      J      n       In       Jn
    2^20   2^10   2^10   32.5D   4.06 GB  4.06 GB
    2^21   2^10   2^11   8.5D    2.13 GB  4.25 GB
    2^22   2^11   2^11   2.5D    2.50 GB  2.50 GB
    2^23   2^11   2^12   D       2.00 GB  4.00 GB

    Table 3: Possible parameters for N = 2^43

    D      I      J      n       In       Jn
    2^22   2^11   2^11   16.5D   16.5 GB  16.5 GB
    2^23   2^11   2^12   4.5D    9.0 GB   18.0 GB
    2^24   2^12   2^12   1.5D    12.0 GB  12.0 GB
    2^25   2^12   2^13   0.5D    8.0 GB   16.0 GB

    Table 4: Possible parameters for N = 2^46
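A row of the tables above can be recomputed in a few lines. The sketch below (assumed names) reproduces the second row of Table 2 from N, D and I, using the same choice of n as in §3.

// Recomputes a row of Tables 2-4: given N, D and I, derive J = D/I, the smallest n
// with (D/2) | n and n >= 2M + k, and the per-task memory In and Jn.
public class ParamTable {
  public static void main(String[] args) {
    long N = 1L << 40;
    long D = 1L << 20, I = 1L << 10;          // the second row of Table 2
    long J = D / I;
    long K = D / 2, M = N / K;
    long k = Long.numberOfTrailingZeros(K);
    long step = D / 2;
    long n = ((2 * M + k + step - 1) / step) * step;
    System.out.printf("J = 2^%d, n = %.1fD, In = %.2f GB, Jn = %.2f GB%n",
        Long.numberOfTrailingZeros(J), (double) n / D,
        I * n / 8.0 / (1L << 30), J * n / 8.0 / (1L << 30));
  }
}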

4.5 A Prototype Implementation

The program DistMpMult is a prototype implementation, which includes DistFft, DistCompSum and DistCarrying as subroutines. DistFft implements Algorithms 1, 2, 3 and 4 with Algorithms 5 and 6, while DistCompSum implements Algorithm 8 and DistCarrying implements Algorithms 9 and 10. We have verified the correctness of the implementation by comparing the outputs from DistMpMult to the outputs of the GMP library.

Since DistFft is notably more complicated than DistCompSum and DistCarrying, we have measured the performance of DistFft on a 1350-node cluster. Each node has 6 GB of memory, and supports 2 map tasks and 1 reduce task. The cluster is installed with Apache Hadoop version 0.20. It is a shared cluster, which is used by various research projects in Yahoo!. Figure 3 shows a graph of the elapsed time t against the input integer size N in base-2 logarithmic scales. The elapsed time includes the computation of two parallel forward FFTs and then a backward FFT. In the range 2^32 ≤ N ≤ 2^36, the elapsed time fluctuates between 2 and 4 minutes because of the system overheads. The utilization is extremely low in these cases. In the range 2^36 ≤ N ≤ 2^40, the graph is almost linear with slope ≈ 0.8 < 1, which indicates that the cluster is still underutilized, since FFT is an essentially linear time algorithm. Unfortunately, the 1350-node cluster is unable to run DistFft for N ≥ 2^41 due to the memory limitation configured in the cluster, which limits the aggregate memory usage of individual jobs.

Figure 3: Actual running time of DistFft on a 1350-node cluster

For N = 2^40, we choose D = 2^20 and I = 2^10. Then, J = I and n = 4.5D = 4,718,592, as shown in Table 2. In a forward FFT job, the average durations of a map, a shuffle and a reduce task are 157, 223 and 90 seconds, respectively. Although the total is only 7.83 minutes, the job duration is 15.38 minutes due to scheduling and other overheads. In a backward FFT job, the average durations of a map, a shuffle and a reduce task are 251, 483 and 77 seconds, and the job duration is 18.17 minutes. The itemized details are shown in Table 5. It shows that the computation is I/O bound. The CPU-intensive computation only contributes 10% to 15% of the elapsed time, excluding the system overheads.

    Forward FFT
      Mapper (Alg. 1):   (f.m.1) 12s, (f.m.2) 24s, (f.m.3) 1s, (f.m.4) 120s; total 157s
      Shuffle:           223s
      Reducer (Alg. 2):  (f.r.1) 19s, (f.r.2) 22s, (f.r.3) 49s; total 90s
    Backward FFT
      Mapper (Alg. 3):   (b.m.1) 89s, (b.m.2) 70s, (b.m.3) 2s, (b.m.4) 90s; total 251s
      Shuffle:           483s
      Reducer (Alg. 4):  (b.r.1) 20s, (b.r.2) 20s, (b.r.3) 24s, (b.r.4) 13s; total 77s

    Table 5: Average duration for each step in MapReduce-FFT with N = 2^40

Since there are already benchmarks showing that Hadoop is able to scale adequately on manipulating 100 TB of data on a large cluster [14], we estimate that DistFft will be able to process integers with N = 2^46 on suitable clusters. Such integers have 21 trillion decimal digits and require 8 TB of space. A suitable cluster may consist of 4200 nodes with 16 GB of memory in each machine, or 2100 nodes with 32 GB of memory in each machine, according to the parameters suggested in Table 4.

5. CONCLUSION

An integer multiplication algorithm, MapReduce-SSA, is designed for multiplying integers at terabit scale on clusters with many commodity machines. It employs the ideas of the Schönhage-Strassen algorithm in the MapReduce model. We first introduce a data model which allows efficiently accessing terabit integers in parallel. We show forward/backward MapReduce-FFT for calculating the DFT and the inverse DFT, and describe how to perform componentwise multiplications and carrying operations. Componentwise multiplication is simply incorporated in the backward FFT step. The carrying operation is reduced to a summation, which is computed by MapReduce-Sum. A MapReduce-SSA computation involves five jobs: two concurrent jobs for two forward FFTs, a job for a backward FFT, and two jobs for carrying. In addition, we discuss the choices of the parameters and the impacts on cluster capacity and memory requirements.

We have developed a program, DistMpMult, which is a prototype implementation of the entire MapReduce-SSA algorithm. It includes DistFft, DistCompSum and DistCarrying as subroutines for calculating the DFT/inverse DFT, componentwise summation and carrying, respectively. All routines run on Apache Hadoop clusters. The correctness of the implementation has been verified by comparing the outputs from DistMpMult to the outputs of the GMP library. Our experiments demonstrate that DistFft is able to process terabit integers efficiently using a 1350-node cluster with 6 GB of memory in each machine. With the supporting data from benchmarks on Hadoop, we estimate that DistFft is able to process integers of 8 TB in size, or, equivalently, 21 trillion decimal digits, on some large clusters.

6. REFERENCES

[1] F. Bellard. Pi computation record, 2009. http://bellard.org/pi/pi2700e9/announce.html.
[2] F. Bellard. TachusPi, 2009. http://bellard.org/pi/pi2700e9/tpi.html.
[3] D. J. Bernstein. Multidigit multiplication for mathematicians, 2001. Available at http://cr.yp.to/papers/m3.pdf.
[4] D. J. Bernstein. Fast multiplication and its applications. Number 44 in Mathematical Sciences Research Institute Publications, pages 325–384. Cambridge University Press, 2008.
[5] J. M. Borwein, P. B. Borwein, and D. H. Bailey. Ramanujan, modular equations, and approximations to pi or how to compute one billion digits of pi. Amer. Math. Monthly, 96(3):201–219, 1989.
[6] R. P. Brent and P. Zimmermann. Modern Computer Arithmetic. Number 18 in Cambridge Monographs on Computational and Applied Mathematics. Cambridge University Press, Cambridge, United Kingdom, 2010.
[7] D. Chudnovsky and G. Chudnovsky. The computation of classical constants. In Proc. Nat. Acad. Sci. USA, volume 86, pages 8178–8182. Academic Press, 1989.
[8] R. Crandall and C. Pomerance. Prime Numbers: A Computational Perspective. Springer-Verlag, New York, NY, 2001.
[9] A. De, P. P. Kurur, C. Saha, and R. Saptharishi. Fast integer multiplication using modular arithmetic. In Proceedings of the 40th Annual ACM Symposium on Theory of Computing, pages 499–506. ACM, 2008.
[10] J. Dean and S. Ghemawat. MapReduce: simplified data processing on large clusters. In Proceedings of the 6th Conference on Symposium on Operating Systems Design & Implementation, pages 137–150. USENIX Association, 2004.
[11] M. Fürer. Faster integer multiplication. In Proceedings of the 39th Annual ACM Symposium on Theory of Computing, pages 57–66. ACM, 2007.
[12] P. Gaudry, A. Kruppa, and P. Zimmermann. A GMP-based implementation of Schönhage-Strassen's large integer multiplication algorithm. In Proceedings of the 2007 International Symposium on Symbolic and Algebraic Computation, pages 167–174. ACM, 2007.
[13] D. E. Knuth. The Art of Computer Programming, Volume 2: Seminumerical Algorithms. Addison-Wesley, Reading, MA, 1997.
[14] O. O'Malley and A. C. Murthy. Winning a 60 second dash with a yellow elephant, 2009. Available at http://sortbenchmark.org/Yahoo2009.pdf.
[15] A. Schönhage and V. Strassen. Schnelle Multiplikation großer Zahlen. Computing, 7:281–292, 1971.
[16] K. Shvachko, H. Kuang, S. Radia, and R. Chansler. The Hadoop Distributed File System. In 26th IEEE Symposium on Massive Storage Systems and Technologies, 2010.
[17] D. Takahashi, 2010. Personal communication.
[18] D. Takahashi. Parallel implementation of multiple-precision arithmetic and 2,576,980,370,000 decimal digits of π calculation. Parallel Comput., 36:439–448, August 2010.
[19] A. J. Yee, 2010. Personal communication.
[20] A. J. Yee. y-cruncher – a multi-threaded pi-program, 2010. http://www.numberworld.org/y-cruncher/.
[21] A. J. Yee and S. Kondo. 5 trillion digits of pi - new world record, 2010. http://www.numberworld.org/misc_runs/pi-5t/announce_en.html.