EDIC RESEARCH PROPOSAL

Elliptic Curve Method for Integer Factorization on Parallel Architectures

Andrea Miele
I&C, EPFL

Abstract—The elliptic curve method (ECM) for integer factorization is an algorithm that uses the algebraic structure of the set of points of an elliptic curve for factoring integers. The running time of ECM depends on the size of the smallest prime divisor of the number to be factored. One of its main applications is the co-factorization step in the number field sieve algorithm that is used for assessing the security of the RSA cryptosystem. The principal goal emphasized in this proposal is the efficient implementation of ECM on highly parallel low-cost devices, like graphics cards. This requires theoretical and practical study of parallel algorithms for elliptic curve and finite field arithmetic.

Index Terms—ECM, finite field arithmetic, elliptic curves, Edwards curves, integer factorization.

I. INTRODUCTION

Implementation and study of algorithms for integer factorization is crucial for the security assessment of several public-key cryptosystems. The Number Field Sieve (NFS) [1] is the best known method for factoring integers with large prime factors (such as RSA moduli), which directly impacts the security of RSA. The Elliptic Curve Method (ECM) [2] for integer factorization is expected to yield better performance than NFS only if the composite integer $n$ to be factored has some prime divisors that are small compared to the size of $n$. However, ECM plays a relevant role in the NFS co-factorization step, in which many small composite integers (100 to 200 bits) need to be factored. This task can be offloaded to low-cost highly parallel devices like graphics cards. ECM also has two applications for large integers which can be accelerated on such devices. One is the factorization of numbers whose size is out of reach for NFS. This application is of interest only in the context of recreational mathematics. The second one is the factorization of RSA multiprime moduli. In this variant of RSA, the modulus is built up from $r > 2$ primes of about the same size, which makes it possible to speed up the decryption step when using the Chinese Remainder Theorem.

The problem of implementing ECM efficiently on low-cost highly parallel devices is relevant not only in the context of integer factorization. Several cryptological applications other than ECM are based on the implementation of finite field arithmetic and elliptic curve arithmetic, e.g., Elliptic Curve Cryptography (ECC) based protocols.

The latest graphics processing units (GPUs) are an interesting platform for the implementation of ECM and the underlying arithmetic. In recent years they have evolved from simple parallel graphics pipelines to many-core architectures with full hardware/software support for general purpose computations. This has led to the popular general-purpose computing on graphics processing units (GPGPU) concept. GPUs are suitable for applications which involve many independent parallel computations on different chunks of data, with little or no synchronization needed between such computations.

The papers described in this proposal cover the essential background related to ECM and its implementation. The classic “Factoring integers with elliptic curves” [2] by Hendrik Lenstra from 1987 introduced ECM. All the facts necessary to explain why and when it works are described along with two variants of the factoring algorithm and a conjecture on its expected running time. The second paper, “Speeding the Pollard and Elliptic Curve Methods of Factorization” [3], describes several improvements applicable to ECM and other factoring methods that must be taken into consideration in view of implementing these algorithms efficiently. The last one, “Twisted Edwards Curves Revisited” [4], presents the fastest known algorithms for performing group operations on elliptic curves that can speed up several cryptological applications including ECM [5]. In section II detailed descriptions of the papers will be given, followed by the research proposal in section III.

Proposal submitted to committee: December 8th, 2011; Candidacy exam date: December 15th, 2011; Candidacy exam committee: Emre Telatar, Arjen Lenstra, Amin Shokrollahi.

This research plan has been approved:

Date: ____________
Doctoral candidate: ____________ (name and signature)
Thesis director: ____________ (name and signature)
Thesis co-director: ____________ (if applicable) (name and signature)
Doct. prog. director: ____________ (R. Urbanke) (signature)

EDIC-ru/05.05.2009

II. SURVEY OF THE SELECTED PAPERS

Notation

The symbol $\log$ without explicit subscript for the base will denote the natural logarithm throughout the paper.

A. Factoring integers with elliptic curves

In this paper, Hendrik Lenstra proposes the elliptic curve method (ECM) for factoring positive integers, which is obtained from Pollard's $(p-1)$-method by replacing the multiplicative group of residues modulo $p$, $(\mathbb{Z}/p\mathbb{Z})^*$, with the group of points on a random elliptic curve modulo $p$.

1) Elliptic curves over finite fields: Let $K$ be a field; the author focuses on the case $K = \mathbb{F}_p$ for some prime number $p > 3$.

A pair $(a, b) \in K^2$ for which $4a^3 + 27b^2 \neq 0$ defines an elliptic curve over $K$ corresponding to the short Weierstrass equation

$$y^2 = x^3 + ax + b. \qquad (1)$$

The elliptic curve defined by $(a, b)$ is denoted by $E_{a,b}$, or by $E$. The set of points $E(K)$ of $E_{a,b}$ over $K$ is defined by

$$E(K) = \{(x : y : z) \in \mathbb{P}^2(K) : y^2 z = x^3 + a x z^2 + b z^3\}.$$

$\mathbb{P}^2(K)$ denotes the projective plane over $K$, i.e., the set of equivalence classes of triples $(x, y, z) \in K^3$, $(x, y, z) \neq (0, 0, 0)$; two triples $(x, y, z)$ and $(x', y', z')$ are equivalent if there exists $c \in K^*$ such that $cx = x'$, $cy = y'$ and $cz = z'$. The equivalence class containing $(x, y, z)$ is denoted by $(x : y : z)$. Given an elliptic curve $E$ over $K$, the point $(0 : 1 : 0) \in E(K)$ is the zero point of the curve; it is denoted by $O$ and it is the only point with $z = 0$. All the other points of $E$ are of the form $(x : y : 1)$, where $x, y \in K$ satisfy Eq. (1). The set $E(K)$ has the structure of an abelian group with the group law defined as follows (additive notation):

• Identity element: $O + P = P + O = P$ for all $P \in E(K)$.
• Given $P = (x_1 : y_1 : 1) \neq O$ and $Q = (x_2 : y_2 : 1) \neq O$, $P + Q = O$ if and only if $x_1 = x_2$ and $y_1 = -y_2$; thus $-(x : y : z) = (x : -y : z)$.
• Otherwise, let $\lambda \in K$ be given by $\lambda = (y_1 - y_2)/(x_1 - x_2)$ if $P \neq Q$ and $\lambda = (3x_1^2 + a)/(2y_1)$ if $P = Q$. Then $P + Q = R$, where $R = (x_3 : y_3 : 1)$ with $x_3 = \lambda^2 - x_1 - x_2$ and $y_3 = -\lambda x_3 - y_1 + \lambda x_1$.

2) Elliptic curves modulo a composite n: Consider the set of all triples $(x, y, z) \in (\mathbb{Z}/n\mathbb{Z})^3$ for which $\gcd(x, y, z, n) = 1$. The group of units $(\mathbb{Z}/n\mathbb{Z})^*$ acts on this set by $u(x, y, z) = (ux, uy, uz)$. The orbits under this action (the set of elements that a given triple can be transformed to) are the points of the projective plane over $\mathbb{Z}/n\mathbb{Z}$. The orbit of $(x, y, z)$ is denoted by $(x : y : z)$, and the set of all orbits by $\mathbb{P}^2(\mathbb{Z}/n\mathbb{Z})$. Given $a, b \in \mathbb{Z}/n\mathbb{Z}$ let $E = E_{a,b}$ be the curve defined over $\mathbb{Z}/n\mathbb{Z}$ by the equation $y^2 = x^3 + ax + b$. The set of points $E(\mathbb{Z}/n\mathbb{Z})$ of $E$ over $\mathbb{Z}/n\mathbb{Z}$ is defined by

$$E(\mathbb{Z}/n\mathbb{Z}) = \{(x : y : z) \in \mathbb{P}^2(\mathbb{Z}/n\mathbb{Z}) : y^2 z = x^3 + a x z^2 + b z^3\}.$$

If $6(4a^3 + 27b^2) \in (\mathbb{Z}/n\mathbb{Z})^*$ then $E$ is defined as an elliptic curve over $\mathbb{Z}/n\mathbb{Z}$ and the set $E(\mathbb{Z}/n\mathbb{Z})$ has a natural abelian group law. The author avoids using the group structure mentioned above and defines “pseudo-addition” on a subset of $E(\mathbb{Z}/n\mathbb{Z})$. This operation can fail in some cases (which occur when one attempts to compute the multiplicative inverse of an element $u \in \mathbb{Z}/n\mathbb{Z}$ that is not a unit, so that $\gcd(u, n) > 1$) and such a failure can lead to finding a non-trivial divisor of $n$. Let $O$ denote the point $(0 : 1 : 0)$ of $\mathbb{P}^2(\mathbb{Z}/n\mathbb{Z})$, and let the subset $V_n$ of $\mathbb{P}^2(\mathbb{Z}/n\mathbb{Z})$ consist of the “finite” points together with $O$:

$$V_n = \{(x : y : 1) : x, y \in \mathbb{Z}/n\mathbb{Z}\} \cup \{O\}.$$

For $P \in V_n$ and a prime $p$ dividing $n$, $P_p$ denotes the point in $\mathbb{P}^2(\mathbb{F}_p)$ that is obtained by reducing the coordinates of $P$ modulo $p$. Notice that $P_p = O_p \Leftrightarrow P = O$.

Given $n \in \mathbb{Z}_{>1}$, $a \in \mathbb{Z}/n\mathbb{Z}$ and $P, Q \in V_n$, the author designs an algorithm that either computes a non-trivial divisor $d$ of $n$, or determines a point $R \in V_n$ with the following property: if $p$ is any prime divisor of $n$ for which there exists $b \in \mathbb{F}_p$ such that

$$6(4\bar{a}^3 + 27b^2) \neq 0 \ \text{for}\ \bar{a} = a \pmod p, \qquad P_p \in E_{\bar{a},b}(\mathbb{F}_p), \quad Q_p \in E_{\bar{a},b}(\mathbb{F}_p),$$

then $R_p = P_p + Q_p$ in the group $E_{\bar{a},b}(\mathbb{F}_p)$.

The algorithm first attempts to compute $(x_1 - x_2)^{-1} \pmod n$ (see the group law formulae in paragraph II-A1) using the Euclidean algorithm, which outputs $d = \gcd(x_1 - x_2, n)$. If $1 < d < n$ the addition fails and a non-trivial factor of $n$ is found. If $d = 1$ the algorithm determines a point $R$ with the above property. If $d = n$ it attempts to compute $(y_1 + y_2)^{-1} \pmod n$ (notice that in this case $y_1 = y_2$ and $P = Q$) and the value $e = \gcd(y_1 + y_2, n)$ is used exactly as the value $d$, except that if $e = n$ the output is $R = O$ (i.e., $P = -Q$ in $V_n$). If the algorithm determines a point $R$, it will be denoted by $P + Q$ and the partial binary operation on $V_n$ will be called addition. If the ordinary Euclidean algorithm is used, $O((\log n)^2)$ bit operations are performed.

Using a sequence of pseudo-additions, an algorithm that computes the following can be devised. Given $k \in \mathbb{Z}_{>0}$, $n \in \mathbb{Z}_{>1}$, $a \in \mathbb{Z}/n\mathbb{Z}$ and $P \in V_n$, it either calculates a non-trivial divisor $d$ of $n$, or determines a point $R \in V_n$ with $R_p = k \cdot P_p$ in the group $E_{\bar{a},b}(\mathbb{F}_p)$, for suitable $b$ and $p$ as for the pseudo-addition. If the algorithm determines such a point $R$, it will be denoted by $kP$ and the partial operation defined in this way will be called multiplication. The number of additions performed by the algorithm depends on which addition chain is used for computing $kP$ and whether $kP$ is defined or not. An addition chain for $n \in \mathbb{Z}_{>0}$ is a sequence of positive integer values $v_0 = 1, v_1, \ldots, v_m = n$ where for each $0 < j \le m$, $v_j = v_h + v_l$ for some $0 \le h, l < j$. If $k = k_1 k_2$ for some $k_1, k_2 \in \mathbb{Z}_{>0}$, $kP$ can be computed as $kP = k_1(k_2 P)$. So if $k$ is such that $k = \prod_r r^{e(r)}$, where $r$ ranges over a finite set of positive integers and each $e(r)$ is a positive integer, $kP$ can be computed by performing $e(r)$ multiplications by $r$ for each $r$.

3) Introduction: Pollard's $(p-1)$-method aims to find a non-trivial divisor of a given positive integer $n$ using Fermat's little theorem. The idea of the algorithm is to pick a random residue modulo $n$, say $c$, and to compute its $k$-th power modulo $n$.
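The pseudo-addition of paragraph II-A2, and the multiplication built on it, can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not Lenstra's formulation verbatim: the names `FactorFound`, `pseudo_add`, `pseudo_mul` and `try_factor` are hypothetical, points of $V_n$ are stored as pairs `(x, y)` with `None` standing for $O$, and a left-to-right binary addition chain is assumed.

```python
from math import gcd

class FactorFound(Exception):
    """Hypothetical signal: a pseudo-addition exposed a non-trivial divisor of n."""
    def __init__(self, divisor):
        super().__init__(divisor)
        self.divisor = divisor

O = None  # stands for the zero point (0 : 1 : 0)

def pseudo_add(P, Q, a, n):
    """Partial addition on V_n for y^2 = x^3 + a*x + b (b is implicit)."""
    if P is O:
        return Q
    if Q is O:
        return P
    (x1, y1), (x2, y2) = P, Q
    d = gcd((x1 - x2) % n, n)
    if 1 < d < n:
        raise FactorFound(d)
    if d == 1:
        # pow(v, -1, n) is the modular inverse (Python 3.8+)
        lam = (y1 - y2) * pow((x1 - x2) % n, -1, n) % n
    else:  # d == n, i.e. x1 == x2 modulo n
        e = gcd((y1 + y2) % n, n)
        if 1 < e < n:
            raise FactorFound(e)
        if e == n:
            return O  # P = -Q in V_n
        lam = (3 * x1 * x1 + a) * pow((y1 + y2) % n, -1, n) % n
    x3 = (lam * lam - x1 - x2) % n
    y3 = (lam * (x1 - x3) - y1) % n
    return (x3, y3)

def pseudo_mul(k, P, a, n):
    """Compute kP with a left-to-right binary addition chain (k > 0)."""
    R = O
    for bit in bin(k)[2:]:
        R = pseudo_add(R, R, a, n)
        if bit == "1":
            R = pseudo_add(R, P, a, n)
    return R

def try_factor(k, P, a, n):
    """Return a non-trivial divisor of n if computing kP fails, else None."""
    try:
        pseudo_mul(k, P, a, n)
    except FactorFound as found:
        return found.divisor
    return None
```

For example, take $n = 35$, $a = 1$ and $P = (0 : 1 : 1)$, so that $b = 1$. The point $P$ has order 5 modulo 7 but not modulo 5, so `try_factor(5, (0, 1), 1, 35)` yields the divisor 7, whereas `pseudo_mul(2, (0, 1), 1, 35)` returns the point `(9, 12)` without failing.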
The value of $k$ is chosen as the product of small prime powers less than a bound $B$ (e.g., $k = \mathrm{lcm}(1, 2, \ldots, B)$). One hopes that for some prime factor $p$ of $n$, the number $p - 1$ will divide $k$. The algorithm computes $c^k \bmod n$ and $d = \gcd((c^k \bmod n) - 1, n)$. If for some prime factor $p$ of $n$, $k$ is divisible by $p - 1$, $d$ will be a non-trivial factor of $n$ by Fermat's little theorem unless all prime factors of $n$ are found simultaneously, i.e., $d = n$. If for some prime factor $p$ of $n$, $p - 1$ is the product of primes less than $B$, i.e., it is $B$-smooth, the algorithm is likely to succeed. If instead for each prime $p$ dividing $n$ the number $p - 1$ has a large prime factor, then Pollard's $(p-1)$-method would need a large bound $B$ (i.e., a large running time) to have a reasonable chance of success.

ECM uses the group of points on a random elliptic curve modulo $p$ instead of $(\mathbb{Z}/p\mathbb{Z})^*$. First fix $k = \mathrm{lcm}(1, 2, \ldots, B)$ as for Pollard's $(p-1)$-method and select a random elliptic curve $E$ defined over $\mathbb{Z}/n\mathbb{Z}$ (as in paragraph II-A4 or using a suitable parametrization) and a point $P$ on $E$ with coordinates in $\mathbb{Z}/n\mathbb{Z}$, where $n$ is the number to factor. Next, compute the multiple $k \cdot P$ of $P$ using the group law of the curve. In practice one can use the pseudo-addition algorithm described in paragraph II-A2. If for some prime divisor $p$ of $n$, $k \cdot P$ and the zero point $O$ of the curve become the same modulo $p$ (but not modulo $n$) the algorithm succeeds. This corresponds to the failure of an inversion while computing the pseudo-addition. One can modify the pseudo-addition to work with projective coordinates with $O = (0 : 1 : 0)$ and avoid inversions. In this case one must explicitly check for the above condition, which is now equivalent to $p$ dividing the $z$ (or $x$) coordinate of the result, by calculating the greatest common divisor of such $z$ (or $x$) with $n$. ECM has the same properties as Pollard's $(p-1)$-method with the order $p - 1$ of $(\mathbb{Z}/p\mathbb{Z})^*$ replaced by the order of the group $E(\mathbb{Z}/p\mathbb{Z})$ of points of $E$ with coordinates in $\mathbb{Z}/p\mathbb{Z}$.

Hasse's theorem (1934) [2] states that the order of $E(\mathbb{Z}/p\mathbb{Z})$ is of the form $p + 1 - t_p$, where $t_p$ is an integer that depends on $E$ and $p$ for which $|t_p| \le 2\sqrt{p}$. If there exists a prime factor $p$ of $n$ such that the number $p + 1 - t_p$ is $B$-smooth (and so $k$ is a multiple thereof), then ECM is likely to find a non-trivial divisor of $n$.

The author proves that if an elliptic curve over $\mathbb{F}_p$, where $p > 3$ is prime, is chosen at random, then its order is approximately¹ uniformly distributed in the interval $(p + 1 - 2\sqrt{p},\, p + 1 + 2\sqrt{p})$. It follows that, if the algorithm fails, it can be run again selecting a different elliptic curve. This will likely yield a new $t_p$ value, and so the number $p + 1 - t_p$ will have a new chance to be $B$-smooth.

¹ This is in fact proved for the interval $(p + 1 - \sqrt{p},\, p + 1 + \sqrt{p})$ only.

It will be shown that, under certain assumptions and with a suitable choice of parameters (see paragraph II-A7 for the details), given a positive integer $g$, ECM finds a non-trivial divisor of the number $n$ within time $g K(p) M(n)$ with probability at least $1 - e^{-g}$, where the function $K : \mathbb{R}_{>0} \to \mathbb{R}_{>0}$ is such that $K(x) = e^{\sqrt{(2+o(1)) \log x \log\log x}}$ for $x \to \infty$, $p$ is the least prime factor of $n$ and $M(n)$ is an upper bound for the time required by a single addition on an elliptic curve modulo $n$. The worst case occurs if $n = pq$ with $p, q$ primes $\approx \sqrt{n}$, and the time becomes $g M(n) e^{\sqrt{(1+o(1)) \log n \log\log n}}$. Several other algorithms have expected running time given by the latter expression but independent of the size of the prime factors of $n$. For example, the expected running time of the Quadratic Sieve (QS) [6] is the same as ECM in the worst case. However, ECM is expected to be faster in the presence of small prime factors.

4) ECM with one curve: Let $n, v, w \in \mathbb{Z}_{>1}$ and $a, x, y \in \mathbb{Z}/n\mathbb{Z}$ be given. For each integer $r \ge 2$, denote by $e(r)$ the largest integer $m$ such that $r^m \le v + 2\sqrt{v} + 1$, and put

$$k = \prod_{r=2}^{w} r^{e(r)}.$$

Given $P = (x : y : 1) \in V_n$, attempt to compute $kP$ using the pseudo-addition method described in paragraph II-A2. If it fails then a non-trivial divisor $d$ of $n$ is found. If it succeeds in computing $kP$ the algorithm terminates with no factors found.

5) ECM trying several curves: Given $n, v, w, h \in \mathbb{Z}_{>1}$, generate $a, x, y \in \mathbb{Z}/n\mathbb{Z}$ at random, and apply algorithm (II-A4) to $n, v, w, a, x, y$. If a non-trivial divisor $d$ of $n$ is found, halt. Otherwise repeat the above procedure unless it has already been applied $h$ times.

The choice of $a, x, y$ determines the elliptic curve used.

• Algorithm (II-A4): the value $v$ may be thought of as an upper bound for the divisor $d$ that is hoped to be found, though the algorithm can determine a divisor $d$ larger than $v$. The parameter $w$ determines the execution time and the probability of success. The larger $w$, the larger the execution time and the probability of success.
• Algorithm (II-A5): $w$ determines the execution time of the algorithm on a single curve and $h$ is the number of curves that will be tried. In this case the probability of success is a function of $w$ and $h$.

6) When does the algorithm succeed?: The author proves a sufficient condition for the success of the algorithm:

Proposition 1. Let $n, v, w \in \mathbb{Z}_{>1}$ and $a, x, y \in \mathbb{Z}/n\mathbb{Z}$ be as in algorithm (II-A4), put $b = y^2 - x^3 - ax \in \mathbb{Z}/n\mathbb{Z}$ and $P = (x : y : 1) \in V_n$ (see paragraph II-A2). Let $p$ and $q$ be prime divisors of $n$ satisfying the following conditions.
1) $p \le v$;
2) $6(4\bar{a}^3 + 27\bar{b}^2) \neq 0$ for $\bar{a} = a \pmod p$, $\bar{b} = b \pmod p$;
3) each prime divisor $r$ of $\#E_{\bar{a},\bar{b}}$ satisfies $r \le w$;
4) $6(4\hat{a}^3 + 27\hat{b}^2) \neq 0$ for $\hat{a} = a \pmod q$, $\hat{b} = b \pmod q$;
5) $\#E_{\hat{a},\hat{b}}$ is not divisible by the largest prime number dividing the order of $P_p$ (see paragraph II-A2).
Then algorithm II-A4 finds a non-trivial divisor of $n$.

7) Efficiency: Assume that the addition chain used for computing $k \cdot P$ uses the binary representation of $k$. Then $O(\log k)$ pseudo-additions are performed. Let $M(n)$ be an upper bound for the time, measured in bit operations, required to perform one pseudo-addition (see paragraph II-A2). Then algorithm (II-A4) requires time $O(w (\log v) M(n))$, since $k$ is such that $\log k = O(w \log v)$. Algorithm (II-A5) requires time at most $h$ times as large, i.e., $O(h w (\log v) M(n))$ (neglecting the time required by the random number generator used). Using Proposition 1 and an estimate of the number of elliptic curves over $\mathbb{F}_p$ whose order is not divisible by a given prime $l$, the author proves the following.

1) Let $n, v, w \in \mathbb{Z}_{>1}$ be such that $n$ has at least two distinct prime divisors $> 3$, and such that the smallest prime factor $p$ of $n$ for which $p > 3$ satisfies $p \le v$. Put
$$u = \#\{s \in \mathbb{Z} : |s - (p + 1)| < \sqrt{p}, \text{ and each prime dividing } s \text{ is } \le w\};$$
then the triple $(a, x, y)$ results in the success of the algorithm with probability that is not much less than the probability $u/(2[\sqrt{p}] + 1)$ that a random integer in the interval $(p + 1 - \sqrt{p},\, p + 1 + \sqrt{p})$ has all its prime factors $\le w$.
2) (Corollary) Let $w \in \mathbb{Z}_{>1}$ be such that the number $u \ge 3$ and let $f(w) = u/(2[\sqrt{p}] + 1)$ be the above probability. Assume that in algorithm (II-A5) each triple $(a, x, y)$ is generated uniformly at random and successive triples are generated independently. There exists an effectively computable constant $c > 1$ such that for any $h \in \mathbb{Z}_{>1}$ the success probability of algorithm (II-A5) on input $n, v, w, h$ is at least $1 - c^{-h f(w)/\log v}$.

The author observes that choosing $h \approx (\log v)/f(w)$ provides a reasonable chance of success. If $h \approx (\log v)/f(w)$, algorithm (II-A5) requires time $O((\log v)^2 (w/f(w)) M(n))$. To minimize the running time it then suffices to minimize $w/f(w)$. The optimal value of $w$ is determined as follows.

Define
$$L(x) = e^{\sqrt{\log x \log\log x}}$$
for a real number $x > e$.

Given $\alpha \in \mathbb{R}_{>0}$, the probability that a random positive integer $s \le x$ has all its prime factors $\le L(x)^{\alpha}$ is
$$L(x)^{-\frac{1}{2\alpha} + o(1)} \quad \text{for } x \to \infty.$$
This is stated in a theorem of Canfield, Erdős and Pomerance.

The author conjectures that this result is valid if $s$ is a random integer in the interval $(x + 1 - \sqrt{x},\, x + 1 + \sqrt{x})$. Putting $x = p$ this implies that
$$f(L(p)^{\alpha}) = L(p)^{-\frac{1}{2\alpha} + o(1)} \quad \text{for } p \to \infty,$$
for any fixed positive $\alpha$ and $f(w)$ as in the corollary above. If $w = L(p)^{\alpha}$ then
$$w/f(w) = L(p)^{\frac{1}{2\alpha} + \alpha + o(1)} \quad \text{for } p \to \infty,$$
the optimal choice of $w$ being:
$$w = L(p)^{\frac{1}{\sqrt{2}} + o(1)}, \qquad w/f(w) = L(p)^{\sqrt{2} + o(1)}, \quad \text{for } p \to \infty.$$

The choice of $w$ depends on $p$, the least prime factor $> 3$ of $n$, which is not known beforehand. In practice $p$ is replaced by $v$ in the above formula for $w$ and algorithm (II-A5) is performed for a reasonable increasing sequence of values for $v$.

Using these facts (notice that the factor $(\log v)^2$ in the execution time above is $L(p)^{o(1)}$) the author provides the following conjectural running time estimate for ECM.

Conjecture 1. There is a function $K : \mathbb{R}_{>0} \to \mathbb{R}_{>0}$ with
$$K(x) = e^{\sqrt{(2+o(1)) \log x \log\log x}} \quad \text{for } x \to \infty$$
such that the following assertion holds. Let $n \in \mathbb{Z}_{>1}$ be an integer that is not a prime power and that is not divisible by 2 or 3, and let $g$ be any positive integer. Then algorithm (II-A5), when performed with suitable values for $v, w, h$, can be used to find a non-trivial divisor of $n$ with probability at least $1 - e^{-g}$, within time
$$g K(p) M(n),$$
where $p$ denotes the least prime factor of $n$ and where $M(n)$ denotes an upper bound for the time required by the pseudo-addition algorithm defined in paragraph II-A2, measured in bit operations.

ECM can be repeated until it leads to the complete factorization of $n$ with expected time at most
$$L(n)^{1+o(1)} = e^{(1+o(1))\sqrt{\log n \log\log n}} \quad \text{for } n \to \infty.$$
The worst case occurs if the second largest prime factor of $n$ is not much smaller than $\sqrt{n}$, so that $n$ is built up from some small primes and two large primes of the same size.

8) Conclusions: If the second largest prime factor of $n$ is much smaller than $\sqrt{n}$, ECM is asymptotically faster than several other algorithms whose conjectured expected execution time is $L(n)^{1+o(1)}$ but independent of the size of the prime factors of $n$. However, in practice, these algorithms may turn out to be faster in the worst case, due to the different constants hidden in the asymptotics. ECM can be used to recognize numbers that are built up from prime factors smaller than a given bound. This problem must be solved in several factoring algorithms.

B. Speeding the Pollard and Elliptic Curve Methods of Factorization

In this paper the author presents some techniques to speed up several algorithms for integer factorization.

1) Introduction: Four factoring algorithms are considered in this paper: ECM, Pollard's $(p-1)$-method, Pollard's Rho method and Williams' $(p+1)$-method. However, in the context of this research proposal, the techniques to speed up ECM are the most relevant and the following description will be focused on them. In some cases, such techniques can be adapted to the other algorithms. All the aforementioned algorithms involve some computations modulo the composite number $n$ to be factored, which is assumed to have a prime factor $p$. At the end of each step of these algorithms one must compute the gcd of a partial result with $n$, hoping that this will be a non-trivial divisor thereof. It is possible to avoid taking a gcd at each step by replacing it with a multiplication modulo $n$ and computing a gcd only at the end of the last step. This is accomplished by applying the following observation: $p \mid \gcd(xy \bmod n, n) \Leftrightarrow p \mid \gcd(x, n)$ or $p \mid \gcd(y, n)$. It follows that if $k$ steps are performed and at the end of the $i$-th step a gcd of the result $x_i \bmod n$ and $n$ must be computed, it is possible to “accumulate” the results by multiplying them together. Then, after the last step, the gcd of the final product and $n$ is computed, i.e., $d = \gcd(x_1 x_2 \cdots x_k \bmod n, n)$. In this way $k$ gcd's are replaced by $k - 1$ multiplications modulo $n$ and one gcd with $n$. It can happen that $d = n$ (i.e., all the prime factors of $n$ have been found), in which case one
must “backtrack” to check whether all the factors were found at once in a single step, or different divisors were found at different steps. In the latter case the algorithm is successful. The main technique that will be studied in the following is the “stage two” or “continuation” of ECM.

2) ECM stage two: The version of ECM presented in II-A4 will be referred to as “stage one” of the algorithm. It can be summarized as follows. To factor a composite $n$, select a random elliptic curve $E$ modulo $n$, a point $P = (x, y)$ on it and then compute $Q = kP$ where $k > 0$ is an integer divisible by all prime powers less than a positive integer bound $B_1$. If $p$ is a prime factor of $n$, stage one succeeds when $k$ is divisible by the order of $P$ on the curve $E$ modulo $p$ (but not by the order of $P$ on the curve $E$ modulo all the other prime factors of $n$), in which case $Q = kP = O$ on $E$ modulo $p$ and a non-trivial divisor is found through a gcd computation. If stage one fails, the point $Q$ on $E$ modulo $n$ is output. The number of curve operations required to compute $Q$ is $O(\log k) = O(B_1)$. In case of failure, one can increase the bound $B_1$ and run ECM again or simply abandon it.

Assume now that $sQ = O$ on $E$ modulo $p$ for some prime factor $p$ of $n$ (but not for all of them), where $s$ is a prime between $B_1$ and a larger value $B_2$. In other words, one assumes that the order of $Q$ modulo $p$ is $s$ (i.e., the order of $P$ modulo $p$ is $B_1$-smooth except for the prime $s$). In this case, one can run stage one again, increasing the bound $B_1$ to the value of $B_2$, to have a good chance of success. The number of elliptic curve operations will be $O(B_2)$.

A better alternative is to run the stage two or continuation of the algorithm, which is tailored for cases in which the order of $Q$ ($P$) is of the above form. The idea is to attempt to find the prime $s$ such that $sQ = O$ on $E$ modulo $p$ in a smart way. One wants to increase the chance of success of each run of the algorithm on a given curve at a small additional cost (e.g., comparable with the cost of the stage one just executed). This will result in the reduction of the overall expected running time.

The standard continuation entails testing each prime $s$ between $B_1$ and $B_2$ one after the other. This can be done in a naive way, by simply computing $sQ$ for each $s$, but this would have a cost comparable to running stage one again with $B_1 = B_2$.

A smarter approach arises from the observation that if $s_j$ denotes the $j$-th prime then the difference $s_{j+1} - s_j$ is known to be small. The idea is to pre-compute the points $(s_{j+1} - s_j)Q$ for all the differences of consecutive primes belonging to the interval $(B_1, B_2)$ and store them in a table. Then one can use the table to compute $s_{j+1}Q$ as $(s_{j+1} - s_j)Q + s_jQ$ for $j > (\pi(B_1) + 1)$. This will require $\pi(B_2) - \pi(B_1)$ elliptic curve operations. If the largest difference between two consecutive primes in the interval $(B_1, B_2)$ is $D$ then the table will have at most $D/2$ entries, which can be computed in $O(D)$ elliptic curve operations. The number of elliptic curve operations needed to compute the first point $s_{\pi(B_1)+1}Q$ is $O(\log s_{\pi(B_1)+1})$. Finally the number of elliptic curve operations required to compute each multiple of $Q$ for each prime in $(B_1, B_2)$ using the pre-computed differences is $\pi(B_2) - \pi(B_1)$. The overall cost of this continuation is roughly $\pi(B_2) - \pi(B_1)$ elliptic curve operations plus $\pi(B_2) - \pi(B_1)$ modular gcd's/multiplications. This is not a significant improvement over running stage one again with $B_1$ increased to $B_2$.

3) “Baby-step giant-step” approach: The performance of stage two can be improved by using a memory-time trade-off technique to look for the prime $s$.

The idea is to represent each prime in $(B_1, B_2)$ in a sort of radix $w$ representation, where $w$ is an integer such that $w \approx \sqrt{B_2}$. Let $v_1 = \lceil B_1/w \rceil$ and $v_2 = \lceil B_2/w \rceil$. Assume that affine coordinates are used. For each $v$ such that $v_1 \le v \le v_2$ and $u$ such that $0 \le u < w$ compute $vwQ = (x_{vwQ}, y_{vwQ})$ and $uQ = (x_{uQ}, y_{uQ})$. Then compute

$$h = \prod_{v} \prod_{u} (x_{vwQ} - x_{uQ}) \bmod n \qquad (2)$$

for each $u$ and $v$ such that $s = vw + u$ for some prime $s$ in $(B_1, B_2)$, in $\pi(B_2) - \pi(B_1)$ modular multiplications. Finally check whether $\gcd(h, n)$ gives a non-trivial divisor of $n$.

The number of elliptic curve operations is now reduced from $\pi(B_2) - \pi(B_1)$ to $O(\sqrt{B_2})$. Memory requirements have changed from $D/2$ to $O(\sqrt{B_2})$.

The cost is further reduced by storing points $uQ$ only for $u$ such that $\gcd(u, w) = 1$, thus dropping some points for which $u$ does not correspond to any prime. Moreover, points $vwQ$ need not be stored and can be computed as needed if the primes are processed in ascending order. More memory space can be saved by reducing the value of $w$.

Performance can be further improved if two primes are tested at once. In order to do so, one must look for pairs $(v, u)$ such that every prime in the interval $(B_1, B_2)$ is represented as $vw \pm u$ for some pair $(v, u)$. Now consider the polynomial $g(m) = m^2$ and observe that given two primes represented by the pair $(v, u)$, $s_1 = vw + u$ and $s_2 = vw - u$, one has $vw \pm u \mid g(vw) - g(u) = (vw)^2 - u^2$. The idea is to store points $g(vw)Q$ and $g(u)Q$ corresponding to the found pairs in tables and then recover them through table look-ups to compute $\gcd(x_{g(vw)Q} - x_{g(u)Q}, n)$. To keep the tables small, values of $v$ and $u$ should be restricted. A possible choice is

$$|u| \le u_{\max}, \qquad v_1 = \lfloor B_1/w \rfloor \le v \le \lceil B_2/w \rceil,$$

where $u_{\max} \ge w/2$ is selected in advance. Building the tables will require $O(v_2 - v_1) + O(u_{\max})$ elliptic curve operations. The number of gcd's/modular multiplications performed to look for a non-trivial divisor is then proportional to the number of pairs $(v, u)$ required to represent all primes in $(B_1, B_2)$, and so their number should be reduced as much as possible.

One idea for devising an algorithm that finds such pairs is based on the observation that given two primes $s_1 = vw + u$ and $s_2 = vw - u$ their sum $s_1 + s_2$ is a multiple of $2w$ and, vice versa, if $s_1 + s_2$ is a multiple of $2w$ then $s_1 = vw + u$ and $s_2 = vw - u$ for some $u$ and $v$. The idea is to maintain a queue $Q_q$ where $q$ ranges over the residues modulo $2w$ with $|q| \le w$. For each prime $s$ to be paired, compute $q = s \bmod 2w$ and $a$ such that $2aw + q = s$. Then store $a$ into the queue $Q_q$ unless there is an $a'$ (corresponding to the prime $2a'w - q$) in $Q_{-q}$ such that $u = w(a - a') + q$ is less than $u_{\max}$. If this is the case then two primes have been paired.
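The queue-based pairing just described can be sketched as follows. This is a simplified illustration under stated assumptions rather than Montgomery's exact procedure: the function names `primes_between` and `pair_primes` are hypothetical, $w$ is assumed even so that no odd prime has residue $\pm w$, the residue $q$ is taken in the symmetric range $|q| \le w$, and the restriction on $u$ is applied as $|u| \le u_{\max}$.

```python
def primes_between(lo, hi):
    """Primes s with lo < s < hi, by trial division (fine for small bounds)."""
    return [s for s in range(max(lo + 1, 2), hi)
            if all(s % d for d in range(2, int(s ** 0.5) + 1))]

def pair_primes(primes, w, u_max):
    """Pair primes as v*w + u and v*w - u with |u| <= u_max.

    Returns (pairs, unpaired): pairs is a list of (v, u), unpaired a sorted
    list of the primes still waiting in some queue at the end.
    """
    queues = {}  # residue q -> list of pending values a, with s = 2*a*w + q
    pairs = []
    for s in primes:
        q = s % (2 * w)
        if q > w:              # symmetric residue, so that |q| <= w
            q -= 2 * w
        a = (s - q) // (2 * w)
        waiting = queues.get(-q, [])
        # look for a' in Q_{-q} whose u = w*(a - a') + q is small enough
        match = next((a2 for a2 in waiting
                      if abs(w * (a - a2) + q) <= u_max), None)
        if match is None:
            queues.setdefault(q, []).append(a)
        else:
            waiting.remove(match)
            # s + s' = 2*w*(a + a'), hence v = a + a' and u = s - v*w
            pairs.append((a + match, w * (a - match) + q))
    unpaired = sorted(2 * w * a + q
                      for q, pending in queues.items() for a in pending)
    return pairs, unpaired
```

With $w = 6$ and $u_{\max} = 3$, the primes in $(10, 50)$ give the pairs $(v, u) = (2, 1), (3, 1), (5, 1), (7, 1)$, i.e., $12 \pm 1$, $18 \pm 1$, $30 \pm 1$ and $42 \pm 1$, while 23, 37 and 47 are left in the queues.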
After all the primes are processed as described, some elements corresponding to unpaired primes can be present in some queues, in which case they are paired with a composite.

4) FFT continuation: Another possible approach is the Fast Fourier Transform (FFT) continuation, which splits the interval $(B_1, B_2)$ into smaller intervals of length $w$ and pre-computes several multiples of the point $Q$ as above. The double product in (2) is now viewed as a polynomial $h(x)$, whose roots are the $x$ coordinates of the points $uQ$, evaluated at a sequence of values (the $x$ coordinates of the points $vwQ$). For each $0 \le u < w$ with $\gcd(u, w) = 1$ and $v_1 \le v \le v_2$, where $v_1 = \lceil B_1/w \rceil$ and $v_2 = \lceil B_2/w \rceil$, compute the points $uQ = (x_{uQ}, y_{uQ})$ and $vwQ = (x_{vwQ}, y_{vwQ})$. Then compute the coefficients of the polynomial

$$h(x) = \prod_{u} (x - x_{uQ}) \bmod n$$

as follows.

1) Write $h(x)$ recursively as the product of two monic polynomials of degree as close as possible and store each polynomial in a “binary tree”. If $\varphi(w)$ is a power of 2, the tree has $\log_2 \varphi(w)$ levels. The $i$-th level (the root corresponds to $i = \log_2 \varphi(w)$ and the leaves to $i = 0$) has at most $\varphi(w)/2^i$ polynomials of degree $\varphi(w)/2^{\log_2 \varphi(w) - i}$.
2) These polynomials are pairwise multiplied together from the leaves up to the root (that is, $h(x)$), using fast algorithms for polynomial multiplication that require $O(d \log d)$ operations for two degree $d$ polynomials.

The cost is then $O(\varphi(w)(\log \varphi(w))^2)$ operations modulo $n$, where $\varphi(w)$ is the number of positive integers less than $w$ and co-prime with $w$. The value $\varphi(w)$ is the degree of $h(x)$, since it has as many roots as the number of different $u$ values.

Next evaluate $h_v = h(x_{vwQ})$ for each $v_1 \le v \le v_2$ and compute $h = \prod_v h_v$.

Finally check whether $\gcd(h, n)$ gives a non-trivial divisor of $n$. A polynomial of degree $d$ can be evaluated at $d$ successive terms of a geometric progression in $d \log d$ steps, and so the above evaluation can be accomplished in $O(\varphi(w) \log \varphi(w))$ steps (if $\varphi(w) \approx (B_2/w - B_1/w)$). Montgomery suggests choosing $w \approx \sqrt{B_2}$ for good asymptotic performance.

5) Montgomery curves: The equation

$$By^2 = x^3 + Ax^2 + x \qquad (3)$$

defines a “Montgomery curve”. Montgomery curves provide faster arithmetic than Weierstrass curves in contexts in which the $y$ coordinate of points can be dropped. This is equivalent to identifying points up to their sign; despite that, it is still possible to compute scalar multiplication. In ECM this can be exploited because, as seen so far, the only computation involved on elliptic curves is the scalar multiplication. There is no need to determine the sign of a point at any time, and the value of the $x$ coordinate is the only thing one is interested in.

Given two points on a Montgomery curve, $P_1 = (x_1, y_1)$ and $P_2 = (x_2, y_2)$, and their difference $P_4 = P_1 - P_2 = (x_4, y_4)$, it is possible to derive efficient formulae for computing the $x$ coordinate of their sum $P_3 = P_1 + P_2 = (x_3, y_3)$ that do not involve $y$ coordinates. This is done by manipulating the product $x_3 x_4$ using (3) and introducing projective coordinates, i.e., $x = X/Z$. Given the ratios $X_1/Z_1$ and $X_2/Z_2$ for distinct points $P_1$ and $P_2$, the ratio $X_3/Z_3$ of their sum is given by:

$$X_3 = 4 Z_4 \cdot (X_1 X_2 - Z_1 Z_2)^2,$$
$$Z_3 = 4 X_4 \cdot (X_1 Z_2 - Z_1 X_2)^2.$$

These formulae can be computed using 2 squarings and 4 multiplications by caching some intermediate values.

Given $X_1/Z_1$ for $P_1$, the ratio $X_3/Z_3$ of $P_3 = 2P_1$ is given by:

$$X_3 = (X_1^2 - Z_1^2)^2,$$
$$Z_3 = (4 X_1 Z_1)\left[(X_1 - Z_1)^2 + ((A + 2)/4)(4 X_1 Z_1)\right].$$

These formulae can be computed using 2 squarings and 3 multiplications by caching some intermediate values.

Since the above addition formulae require the difference of two points, the scalar multiplication ($Q = kP$ for a positive integer $k$) is performed using a special case of addition chains (see the end of paragraph II-A2 for the definition of addition chain) called Lucas chains [7].

Suyama's parametrization for Montgomery curves makes it possible to select a curve (and fix a point on it) whose order is divisible by 12. This is desirable when looking for curves whose order is divisible by small prime powers as in ECM (because it is already divisible by 3 and $2^2$).

6) Conclusions: This paper describes several techniques to improve ECM. Chief among them is the continuation or stage two of ECM (executed upon the failure of stage one, i.e., the original algorithm), which reduces the expected running time of the algorithm.

Montgomery curves provide fast arithmetic for ECM, but Twisted Edwards curves (presented in the next section) are asymptotically faster.

C. Twisted Edwards Curves Revisited

In this paper [4], the authors present fast algorithms for computing group operations on Twisted Edwards Curves which lead to the fastest elliptic curve scalar multiplication, which can speed up both ECC and cryptanalytic applications (e.g., ECM). The following notation is used to analyze the algorithms: M: field multiplication, S: field squaring, I: field inversion, D: multiplication by a curve constant.

1) Introduction: Recently Edwards curves have gained attention in the context of cryptology because of their fast arithmetic. Edwards introduced a normal form for elliptic curves along with the addition law; such curves are defined by $x^2 + y^2 = c^2 + c^2 x^2 y^2$ [8]. Bernstein and Lange introduced a more general version of these curves defined by $x^2 + y^2 = c^2(1 + d x^2 y^2)$ or $x^2 + y^2 = 1 + d x^2 y^2$ along with the first algorithm for computing group operations in projective coordinates (e.g., the point addition requires 10M+1S+1D) [9]. These curves are today known as Edwards curves. Bernstein and Lange also introduced inverted Edwards coordinates, resulting in point addition with cost 9M+1S+1D [10]. Finally Bernstein and other authors introduced a generalization of Edwards curves, i.e., twisted Edwards curves [11].

The authors of [4] present the fastest group arithmetic for twisted Edwards curves obtained by using an additional coordinate, i.e., the extended twisted Edwards coordinates system. They design a fast algorithm for scalar multiplication by mixing this system with the standard one.

2) Twisted Edwards curves: The following terms characterize the group law (additive notation) on elliptic curves:

• unified: point addition formulae that remain valid when the two input points are identical.
• complete: point addition formulae defined for all inputs.
• mixed: point addition formulae that add an affine point to a point in a given projective representation.

Let $K$ be a field of odd characteristic. Edwards curves are defined by $x^2 + y^2 = c^2(1 + d x^2 y^2)$ where $c, d \in K$ with $cd(1 - dc^4) \neq 0$. This form is a special case of the more general twisted Edwards curves form defined by

$$E_{\mathcal{E},a,d} : a x^2 + y^2 = 1 + d x^2 y^2$$

where $a, d \in K$ with $ad(a - d) \neq 0$ (Edwards curves represent the special case where $a$ can be rescaled to 1).

[...] convenient form. The operations can be reduced to 7M+1D by setting $Z_2 = 1$ and using a mixed addition algorithm.

Dedicated Addition in $\mathcal{E}^e$. In this case the formulae are similar to (5) but they are independent of the curve constant $d$. The operations can be performed with a 9M+1D algorithm and a mixed addition algorithm can be derived by setting $Z_2 = 1$. The case $a = -1$ makes it possible to derive an 8M algorithm, which can be reduced further to 7M by setting $Z_2 = 1$.

Dedicated Doubling in $\mathcal{E}^e$. The authors provide doubling formulae which are independent of the curve constant $d$. The operations can be performed with a 4M+4S+1D algorithm which can be improved by mixing $\mathcal{E}^e$ with $\mathcal{E}$. These formulae do not require the $T_1$ coordinate of the point to be doubled. Notice that these formulae are slower than the 3M+4S+1D ones in $\mathcal{E}$ [11].

4) Applications: The authors focus on the implementation of scalar multiplication on parallel architectures. In particular
Group they present a detailed comparison between scalar multipli- operations formulae for this curves can be found in [11]. cation in extended twisted Edwards coordinates using unified The inversion I is usually more expensive than M. It is then addition only and the Montgomery ladder using Montgomery convenient to use projective coordinates to avoid it. curves. Both of them provide theoretical Simple Power Anal- ysis (SPA) protection, since addition and doubling are per- 2 2 2 4 2 2 (aX + Y )Z = Z + dX Y . (4) formed using the same sequence of field operations. Extended Eq. (4) defines the projective closure of the curve ax2 + y2 = twisted Edwards curves result faster in parallel environments, 1 + dx2y2. The identity element is (0 : 1 : 1) and the negative i.e., when 2 or 4 processors are used (up to 66.7% on 4 of (X : Y : Z) is (−X : Y : Z). For all λ 6= 0 ∈ K, processors). In the context of ECM this comparison is not rel- (X : Y : Z) = (λX : λY : λZ). This system is denoted by E. evant since SPA is not needed. However, the authors propose a 3) Extended Twisted Edwards Coordinates: A new coor- fast algorithm for scalar multiplication dedicated formulae for dinate t = xy is introduced to represent a point (x, y) on addition and doubling which are faster than the ones for unified ax2 + y2 = 1 + dx2y2 in extended affine coordinates (x, y, t). addition. This algorithm mixes twisted Edwards coordinates e The map (x, y, t) → (x : y : t : 1) allows to pass to projective E with extended twisted Edwards coordinates E and uses a coordinates. For all nonzero λ ∈ K, (X : Y : T : Z) = “windowing” technique. It turns out to be the fastest scalar (λX : λY : λT : λZ) that satisfies Eq. (4) and corresponds multiplication algorithm for elliptic curves. to the extended affine point (X/Z, Y/Z, T/Z) with Z 6= 0. Fast Scalar Multiplication. Scalar multiplication on twisted The auxiliary coordinate T has the property T = XY/Z. 
This Edwards curves involves point doublings and can be sped up e e system is called extended twisted Edwards coordinates and is by mixing E and E replacing slower doublings in E with denoted by Ee. The identity element is (0 : 1 : 0 : 1). The faster doublings in E and using the fact that no consecutive negative of (X : Y : T : Z) is (−X : Y : −T : Z). Given additions are performed: (X,Y,Z) in E, passing to Ee can be performed in 3M+1S by 1) If a point doubling is followed by another point dou- computing (XZ,YZ,XY,Z2) whereas given (X : Y : T : Z) bling, use E ← 2E. in Ee passing to E is cost-free by dropping T . 2) If a point doubling is followed by a point addition, use e Unified Addition in E . Given (X1 : Y1 : T1 : Z1) and a) Ee ← 2E for the doubling step and then, (X2 : Y2 : T2 : Z2) with Z1 6= 0 and Z2 6= 0, then (X1 : Y1 : b) E ← Ee + Ee for the point addition step. T : Z ) + (X : Y : T : Z ) = (X : Y : T : Z ) where 1 1 2 2 2 2 3 3 3 3 E ← 2E is performed using 3M+4S+1D formulae in [11]. X3 = (X1Y2 + Y1X2)(Z1Z2 − dT1T2), The operation Ee ← 2E can be performed as follows: e Y3 = (Y1Y2 − aX1X2)(Z1Z2 + dT1T2), Instead of passing from E to E in 3M+1S as described (5) in paragraph (II-C3), dedicated doubling formulae in Ee are T3 = (Y1Y2 − aX1X2)(X1Y2 + Y1X2), used since they do not require the input T and so they Z = (Z Z − dT T )(Z Z + dT T ). 1 3 1 2 1 2 1 2 1 2 can be used for Ee ← 2E. E ← Ee + Ee is based on e These unified formulae are complete if d is not a square in dedicated addition formulae in E . The computation of T3 K and a is a square in K, and they can be computed with a can be avoided. This compensates the extra field multiplication e 9M+2D algorithm by caching some intermediate results. An necessary to compute T3 in E ← 2E. 
The authors show the 8M+2D mixed addition algorithm can be derived by setting cost estimates, in terms of M performed, for 256-bit fast scalar Z2 = 1, i.e., adding (X1 : Y1 : T1 : Z1) and an extended multiplication under different S/M and D/M scenarios. Twisted affine point (x2, y2, x2y2) which can be written as (x2 : y2 : Edwards curves with a = −1 and mixed coordinates result e x2y2 : 1). If E is used, an 8M+1D point addition algorithm always faster than Edwards curves, inverted Edwards curves can be devised if a = −1 transforming the curve in a more and Montgomery curves. EDIC RESEARCH PROPOSAL 8
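To make the unified addition formulae (5) concrete, the following Python sketch applies them over a toy prime field and compares the result with the affine twisted Edwards addition law. It is only an illustration, not the implementation of [4]: the parameters p = 13, a = 1, d = 2 (with a a square and d a non-square modulo 13, so the formulae are complete) and all function names are assumptions chosen for the example, and no attempt is made to reach the 9M+2D operation count.

```python
# Toy illustration of unified addition in extended twisted Edwards
# coordinates, Eq. (5). Parameters and names are illustrative assumptions.
P = 13  # small prime field (assumption, far too small for real use)
A = 1   # curve constant a, a square mod 13
D = 2   # curve constant d, a non-square mod 13, so addition is complete


def on_curve(x, y):
    """Check the affine equation a*x^2 + y^2 = 1 + d*x^2*y^2 mod P."""
    return (A * x * x + y * y - 1 - D * x * x * y * y) % P == 0


def affine_points():
    """Enumerate all affine points of the toy curve by brute force."""
    return [(x, y) for x in range(P) for y in range(P) if on_curve(x, y)]


def to_extended(x, y):
    """Map an affine point (x, y) to (X : Y : T : Z) = (x : y : xy : 1)."""
    return (x, y, x * y % P, 1)


def to_affine(pt):
    """Map (X : Y : T : Z) back to (X/Z, Y/Z); needs Python 3.8+ for pow(., -1, P)."""
    X, Y, _, Z = pt
    z_inv = pow(Z, -1, P)
    return (X * z_inv % P, Y * z_inv % P)


def add_unified(p1, p2):
    """Unified addition following Eq. (5); also valid for doubling."""
    X1, Y1, T1, Z1 = p1
    X2, Y2, T2, Z2 = p2
    e = (X1 * Y2 + Y1 * X2) % P      # X1*Y2 + Y1*X2
    h = (Y1 * Y2 - A * X1 * X2) % P  # Y1*Y2 - a*X1*X2
    zz = Z1 * Z2 % P                 # Z1*Z2
    dt = D * T1 * T2 % P             # d*T1*T2
    f = (zz - dt) % P
    g = (zz + dt) % P
    return (e * f % P, h * g % P, e * h % P, f * g % P)


def add_affine(p1, p2):
    """Reference affine twisted Edwards addition law (uses inversions)."""
    x1, y1 = p1
    x2, y2 = p2
    t = D * x1 * x2 * y1 * y2 % P
    x3 = (x1 * y2 + y1 * x2) * pow(1 + t, -1, P) % P
    y3 = (y1 * y2 - A * x1 * x2) * pow(1 - t, -1, P) % P
    return (x3, y3)


def scalar_mul(k, pt):
    """Left-to-right double-and-add using unified addition only."""
    acc = (0, 1, 0, 1)  # identity element (0 : 1 : 0 : 1)
    for bit in bin(k)[2:]:
        acc = add_unified(acc, acc)
        if bit == "1":
            acc = add_unified(acc, pt)
    return acc
```

Because the same routine serves both doubling and addition, `scalar_mul` corresponds to the "unified addition only" strategy mentioned above; the mixed $E$/$E^e$ algorithm of [4] would instead substitute dedicated doubling and addition formulae, which this sketch does not attempt.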

Fast Scalar Multiplication in parallel. Mixing $E$ with $E^e$ in the scalar multiplication algorithm does not seem to provide sources of parallelism that can be exploited. However, the authors show that using 4 processors, the doubling operation $E^e \leftarrow 2E^e$ can be performed with a 1M+1S algorithm and that the addition operation $E^e \leftarrow E^e + E^e$ can be performed with a 2M algorithm (a 2M+2S algorithm and a 4M algorithm, respectively, using 2 processors). This suggests using $E^e$ only when working in parallel settings.

5) Conclusions: This paper introduces a new representation $E^e$ for twisted Edwards curves and describes group operations. A fast scalar multiplication algorithm using dedicated formulae is presented, which is designed by mixing $E^e$ and $E$. It is 4% to 18% faster than the algorithms in the literature and can be further sped up by a factor of 3.54 using 4 processors in parallel. This algorithm can be used to accelerate ECM.

III. RESEARCH PROPOSAL

This research proposal addresses the problem of implementing ECM efficiently on parallel architectures, which requires the study of parallel algorithms for elliptic curve arithmetic and finite field arithmetic.

The efficiency of finite field arithmetic depends mainly on the modular multiplication operation. The first research goal is then the study of the implementation of algorithms for modular multiplication on GPUs (e.g., comparison between schoolbook and Karatsuba multiplication and implementation of FFT multiplication using floating point arithmetic).

Edwards curves provide the fastest elliptic curve arithmetic. Studying their efficiency on parallel architectures is relevant for ECM and all the applications using elliptic curves (see [5] for an example). The second goal is then the comparison between Edwards curve arithmetic and Montgomery curve arithmetic on GPUs. This implies the comparison between the sliding window algorithm and Montgomery's PRAC algorithm [7] for scalar multiplication. Building on these insights, the third goal is the efficient implementation of ECM for factoring numbers up to roughly 200 bits, which can be used effectively as a sub-routine for co-factorization in the NFS. This fits within the RSA moduli factorization project at LACAL.

The fourth goal is the optimization of ongoing work at LACAL on a high-throughput implementation of ECM for factoring larger numbers on GPUs. This is applicable in the context of the ECM record pursuit. Another goal of practical relevance is the optimization of the high-throughput GPU implementation of RSA developed at LACAL.

From a more theoretical perspective there are several challenges that this proposal aims to take on. One is the optimization of Edwards curve arithmetic, with the focus on reducing the memory requirements and the number of additions performed. The second one is the search for efficient curves for ECM (see [12]). The third one is the implementation of stage two of ECM on parallel architectures, which is hindered by the memory requirements of the variants known in the literature.

Another interesting problem that can be explored is the study and the implementation of higher genus curve arithmetic. One interesting application would be implementing the hyperelliptic curves method for factorization (HECM) introduced in [13]. In this work the author presents an implementation of HECM on central processing units (CPUs) derived from GMP-ECM. This implementation is faster than GMP-ECM for large numbers and can be improved by optimizing the squaring operation. Tackling the implementation of higher genus curve arithmetic on parallel architectures is relevant also for other problems, e.g., the elliptic curve discrete logarithm problem.

Although the choice of GPUs as the main implementation platform seems to be reasonable, following the evolution of different architectures like multi-core CPUs and field programmable gate arrays (FPGAs) must not be neglected. The debate on which one is the most convenient for parallel applications is quite lively and so far sees no clear winner (see [14] for a recent ECM implementation on FPGAs). The integration of CPUs with graphics processors (e.g., AMD Fusion family), the availability of high level synthesis tools for FPGAs and the constant improvement of GPGPU architectures stir things up even more.

REFERENCES

[1] A. K. Lenstra and H. W. Lenstra, Jr., Eds., The development of the number field sieve, ser. Lecture Notes in Mathematics. Berlin: Springer-Verlag, 1993, vol. 1554.
[2] H. W. Lenstra, "Factoring integers with elliptic curves," The Annals of Mathematics, vol. 126, no. 3, pp. 649–673, Nov. 1987.
[3] P. L. Montgomery, "Speeding the Pollard and Elliptic Curve Methods of Factorization," Mathematics of Computation, vol. 48, no. 177, pp. 243–264, 1987.
[4] H. Hisil, K. K.-H. Wong, G. Carter, and E. Dawson, "Twisted Edwards Curves Revisited," in Proceedings of the 14th International Conference on the Theory and Application of Cryptology and Information Security: Advances in Cryptology, ser. ASIACRYPT '08. Berlin, Heidelberg: Springer-Verlag, 2008, pp. 326–343.
[5] D. J. Bernstein, T.-R. Chen, C.-M. Cheng, T. Lange, and B.-Y. Yang, "ECM on Graphics Cards," in Proceedings of the 28th Annual International Conference on Advances in Cryptology: the Theory and Applications of Cryptographic Techniques, ser. EUROCRYPT '09. Berlin, Heidelberg: Springer-Verlag, 2009, pp. 483–501.
[6] C. Pomerance, "The Quadratic Sieve Factoring Algorithm," in EUROCRYPT '84, 1984, pp. 169–182.
[7] P. L. Montgomery, "Evaluating recurrences of form $X_{m+n} = f(X_m, X_n, X_{m-n})$ via Lucas chains," 1992, URL: ftp://ftp.cwi.nl/pub/pmontgom/Lucas.ps.gz.
[8] H. M. Edwards, "A Normal Form for Elliptic Curves," Bulletin of the American Mathematical Society, vol. 44, no. 3, pp. 393–422, July 2007.
[9] D. J. Bernstein and T. Lange, "Faster addition and doubling on elliptic curves," in Proceedings of the Advances in Cryptology 13th International Conference on Theory and Application of Cryptology and Information Security, ser. ASIACRYPT '07. Berlin, Heidelberg: Springer-Verlag, 2007, pp. 29–50.
[10] D. J. Bernstein and T. Lange, "Inverted Edwards coordinates," in Proceedings of the 17th International Conference on Applied Algebra, Algebraic Algorithms and Error-Correcting Codes, ser. AAECC '07. Berlin, Heidelberg: Springer-Verlag, 2007, pp. 20–27.
[11] D. J. Bernstein, P. Birkner, M. Joye, T. Lange, and C. Peters, "Twisted Edwards curves," in Proceedings of the Cryptology in Africa 1st International Conference on Progress in Cryptology, ser. AFRICACRYPT '08. Berlin, Heidelberg: Springer-Verlag, 2008, pp. 389–405.
[12] D. J. Bernstein, P. Birkner, and T. Lange, "Starfish on strike," in Proceedings of the 1st International Conference on Progress in Cryptology: Cryptology and Information Security in Latin America, ser. LATINCRYPT '10. Berlin, Heidelberg: Springer-Verlag, 2010, pp. 61–80.
[13] R. Cosset, "Factorization with genus 2 curves," Mathematics of Computation, vol. 79, pp. 1191–1208, 2010.
[14] K. Gaj, S. Kwon, P. Baier, P. Kohlbrenner, H. Le, M. Khaleeluddin, R. Bachimanchi, and M. Rogawski, "Area-time efficient implementation of the elliptic curve method of factoring in reconfigurable hardware for application in the number field sieve," IEEE Trans. Comput., vol. 59, pp. 1264–1280, September 2010.