arXiv:1801.07561v1 [cs.CR] 15 Jan 2018 uotluzn paul.gorissen ludo.tolhuizen, r ihPiisRsac,Enhvn h Netherlands the L Eindhoven, and Research, Hoogh, Philips de with Sebastiaan are Rietman, Ronald Netherlands; the nitrsigmto o otoeymlilcto nan in multiplication Montgomery arithme for RNS method in interesting [6] multi- multiplication An Barrett Montgomery and as [5] such plication methods multiplication modular [4]. [3], fo [1], [2]; example is processing, for RNS see speech for background, and reference additional image classical c The Filtering, cryptography. RNS Digital and for in found applications ot to be Typical conversion for a where representations. need addition/subtraction applications a (digital) are rarely in only operations interesting with required multiplication, mainly the therefore is of representation, RNS most RNS of in use numbers the for divisi conversion difficult as residue-to-digital more such and c operations comparison, fault-tolerant other magnitude to Unfortunately, applications devices for [1]. mobile useful puting in also i is found use RNS those for [1]. attractive as it such makes carries processors, of embedded absense the o conventional to power-consumption due in low RNS and numbers speed on high The ga representation. to large compared promising thus speed operands, in the of t residues on resenting multiplication and subtraction addition, carry-free) RNS full of cann carries use exploit the that to carr applied. attacks (2048- any Due involve cryptographic large not modulus. does therefore a system 4096-bit) with the algorithms, exponentiation even Montgomery modular or realize bit, to RSA as parallelization. fo implementati massive operations for hardware allows A modular method moduli. only this 8-bits) modular of using example implement moduli, (for to small bits) f used (2000+ multiply-and-accumulate a be large and Imbert, with can multiplication, and Res (remainders (RRNS) addition, Bajard Recursive resulting by System pseudo-residues The modulus). algorithm Number with the algorit an than range multiplication work on larger The based to layer. is extended bottom use arithm we the actual the that result, to a that As deferred below. algorithm layer is multiplication the on Montgomery modular RNS RNS the required uses an implemen the is of where layer means non-bottom RNS, any by on original modulus the a RNS for virtual of arithmetic adding top by (RNS) on System layers Number Residue a of range hoe;RA oua exponentiation. Modular RSA; Theorem; ekDL olan oadRemn eatand og,Lu Hoogh, de Sebastiaan Rietman, Ronald Hollmann, D.L. Henk Email: ekDL olanadPu oisnaewt hlp PS E IP&S, Philips with are Gorissen Paul and Hollmann D.L. Henk uhrsac a endn nteipeetto of implementation the in done been has research Much (hence parallel allows (RNS) System Number Residue A Terms Index such algorithms cryptographic in applied be can method Our Abstract ut-ae eusv eiu ubrSystem Number Residue Recursive Multi-layer A W rsn ehdt nraetedynamical the increase to method a present —We { ekdlhlmn,rnl.ita,sebastiaan.de.hoog ronald.rietman, henk.d.l.hollmann, RsdeNme ytm hns Remainder Chinese System; Number —Residue .I I. } @philips.com NTRODUCTION d Tolhuizen udo n alGorissen Paul and indhoven, rvery or erep- he tbe ot idue om- etic her hm on, tic. ies, are ted ins nd an on h, n r f r o ml) ar auswl fe eeul if equal; be often will values carry small), (or number ifrn ausof values different ai dai hti admintegers random The if respectively. HMAC, that and is RSA idea of basic side-chan context through and the carries [10] in of in leakage analysis shown the been exploit suc has to how It for [11] carry-free. arithmetic be modular not the would moduli of implementation direct the h euti h omo a of form the in result the n ny8btmdl,telretatial yaia ran dynamical attainable largest the employ- is moduli, example, 8-bit For available. only moduli ing simpl small there enough since moduli, large not large 2048-bits are sufficiently rather for need is algorithm would we that RNS moduli, Bajard-Imbert represent) the of can range implement RNS needed. (the to are range the dynamical RSA that a in with values bits systems 4096 RNS even realize or To 2048 of moduli ments, carry-free. inherently thus method, RNS full a iharnelre hntemnmmrequired minimum the than larger range a with rl ar-re h aadIbr loih implements modulus algorithm a modulo Bajard-Imbert multiplication Montgomery a The methods arithmetic carry-free. the modular consis all truly the moreover, which only; layer, all moduli RNS consequence, small bottom an of the a to using As deferred below. is layer the arithmetic layer employs (non-bottom) the that on each RNS algorithm in Bajard-Imbert-type moduli RNS-based, RNS arithmeti modular the the implement to for T is while moduli. method small the range behind for idea arithmetic dynamical modular unlimited only employing virtual still with systems encodings. RNS input the attack on the way, information this much In obtain attacker input. an can fixed values, the all carr on through encoded information fixing runs obtain the input by of other of then the distribution if the value, context value observing carry the and input encoded input encoded in an one two produces mounted has table and be a values if le can [12]: much attack cryptography is whitebox similar distribution value A carry skew. the value, intermediate ouini oepo ouio pca ye o example for type, special a of form moduli the employ of to is solution N sthe is RNS 1] 1] u ehdapisnntiilmdlrarithme layer layer modular lower layers. higher non-trivial the all applies a on in moduli method on the our moduli of [15], some use [14], of that products the or are [13], that o arithmetic layer modular bottom do the only ca that im- (also systems slightly multi-layer RNS known a hierarchical) to as contrast s seen with In operate be pseudo-residues. also can can that algorithm present Bajard-Imbert we proved that algorithm The ome h rsn-a rna-uuescrt require- security near-future or present-day the meet To hspprpeet ehd oipeetmulti-layer implement to methods a presents paper This lcm(2 S , oMGM Tolhuizen, M.G.M. do 3 h rbblt itiuino ar ausover values carry of distribution probability the , , . . . , aadIbr algorithm Bajard-Imbert 2 n − 256) c R o small for 6-isnme.Teconventional The number. 363-bits a , i eed on depends pseudo-residue c e,eg,[] 9.However, [9]. [8], e.g., see, , eirMme,IEEE Member, Senior S ecie n[] hsis This [7]. in described R ned if Indeed, . i modulo r de oafixed a to added are N n delivers and , N S number a , N S slarge is a an has values. lled uch can nel tic , he ge re er ss ts h n y y a c 1 Our multi-layer RNS method allows for massive paralleliza- Chinese Remainder Theorem (CRT) gives a method to recover tion, with one, or even a multitude, of processors per bottom the integer x modulo M from a H-representation (α1,...,αk) modulus, so the method may also be suitable for high-speed as applications. x ξ1(M/M1)+ + ξk(M/Mk) mod M, The contents of this paper are as follows. In the next section, ≡ ··· where ξ α H (M/M )−1 mod M . we present some background concerning Residue Number i ≡ i i i i Systems and Montgomery multiplication techniques, and we outline the Bajard-Imbert algorithm. We introduce pseudo- B. The Montgomery technique residues in the form as used in this paper, and we indicate Montgomery reduction is a technique to replace the (dif- how to adapt the base-extension method used in the Bajard- ficult) division by the modulus N as required in modular Imbert algorithm so that it can operate with pseudo-residues. reduction by an easier division by a suitably chosen integer M, In Section III, we first discuss the requirements on the RNS the Montgomery constant, where gcd(N,M)=1. (Typically, moduli in the current top-lay that are needed when adding the M is chosen to be a power of 2.) Then to reduce a given integer next RNS layer. Then we describe in detail our new Bajard- h modulo N, with 0 h

2 In this paper, we only consider residue intervals of the As a consequence of Proposition II.2 and Corollary II.3, form = [0, 1), referred to as standard residues, or of the by combining (1) and (2) we can use the pseudo-residues I form = [ 1/2, 1/2), referred to as symmetric residues. α1,...,αk of an integer x M together with the residue Note thatI ordinary− residues correspond to the case of standard x = x for a “redundant”∈ modulus I M , with M kφ 0 | |M0 0 0 ≥⌈ ⌉ residues with expansion bound 1. The idea to integers other (standard residues) or M0 kφ +1 (symmetric residues), than 0, 1,...,n 1 to represent residues modulo n has been to find a (pseudo-)residue ≥ρ for ⌈ a⌉ new modulus n, by using used before, see− for example [4], [19], [20], [21]. that Typically, modular multiplication algorithms such as, for k ρ X αi(M/Mi) qM mod n. example, Montgomery [5], Barrett [6], or Quisquater [22] ≡ − deliver the result in the form of a pseudo-residue. As seen i=1 in Section II-B, the output of a Montgomery multiplication E. The Bajard-Imbert Montgomery RNS algorithm is a pseudo-residue with (standard) expansion bound 2 when the inputs themselves are pseudo-residues with expansion Our method is based on an algorithm similar to the RNS- bound 2. As mentioned in the introduction, for a multi- based Montgomery described in [7], layer RNS method based on such a modular multiplication referred to here as the Bajard-Imbert RNS algorithm. This algorithm, we need an RNS implementation of the algorithm algorithm employs an RNS consisting of a left RNS = B capable of handling inputs and outputs represented by pseudo- (M1,...,Mk) with dynamical range M = M1M2 Mk, ′ ··· residues with given, fixed expansion. Our example method a right RNS = (Mk+1,...,M2k) with dynamical range ′ B is based on the full RNS implementation of a Montgomery M = Mk+1Mk+2 M2k, and a redundant modulus M0 ··· multiplication from [7], with some modifications. That algo- used for base extension as in [23], and computes a Mont- rithm uses the base extension method from [23] that employs gomery multiplication with Montgomery constant M, for a a redundant modulus, and this method also has to be adapted modulus N that satisfies the conditions 0 < (k + 2)2N < to work with pseudo-residues. We describe the resulting algo- min(M,M ′) and gcd(N,M)=1. Given inputs x, y repre- ′ rithm in Section III. sented by their residues in M0 , a representation of { }∪B∪B−1 ′ a Montgomery product z xyM mod N in M0 is computed using the following≡ steps. { }∪B∪B D. Generalized base extension with a redundant modulus ′ 1) Compute h = xy in M0 . Base extension refers to the operation of computing a −{1 }∪B∪B−1 2) Compute µi = N (M/Mi) h Mi for i =1,...,k; residue of a number modulo a new modulus from a given k |− | set u = µ (M/M ). Then h + uN 0 mod M. RNS representation. We need a slight generalization of the Pi=1 i i 3) Compute the representation of u in M ≡ ′. base extension method from [23]. Our generalization is based 0 4) Compute z = (h + uN)/M in M { }∪B′. on the following. 0 5) Find the representation of z in { (by}∪B base extension). B Proposition II.2 Let = (M1,...,Mk) be an RNS, with The algorithm from [7] has the property that if the inputs B dynamical range M = M1 Mk, let = [ 1+ e,e) be a x, y satisfy 0 x,y < (k + 2)N, then the result z of the ··· I − residue interval, and let φ be an expansion constant. Let x Montgomery multiplication≤ again satisfies 0 z < (k + 2)N. ≤ be an integer with x M , and let α1,...,αk be pseudo- Note that in order to apply this algorithm for large (2048- ∈ I residues with x αi(M/Mi) and αi φ for all i. Then x bits) moduli , we require that 2 is large, which is ≡ ∈ I N M/(k + 2) can be written as not possible by employing small moduli Mi only. To enable k efficient implementation of the required modular arithmetic, x = X αi(M/Mi) qM (1) one solution would be to choose moduli of a simple form such − i=1 as 2n c with small c or as 2n 2m 1 [8]. In this paper, − ± ± for an integer q with k(1 e)φ e

3 permits to use residues modulo one modulus as entries to a such assumptions for some new Montgomery constant M table for another modulus, which our algorithm requires. and some B B1, and some new expansion constants We now state exactly what we require when constructing a ϕ, φ, by adding≫ a new RNS layer on top of the exist- new RNS layer. Let = [ e, 1 e) be an interval of length ing layers. To this end, we first choose a left RNS = I − − B 1, let m be a positive integer, and let B1, ϕ1, φ1 be positive (M1,...,Mk) with dynamical range M = M1 Mk, a ′ ··· rationals, with ϕ φ1 1. We will write (B1, m, , ϕ1, φ1) right RNS = (Mk+1,...,Mk+l) with dynamical range ≥ ≥ A I ′ B to denote that for all moduli n with 1 n B1 and M = Mk+1 Mk+l, and a redundant modulus M0. The gcd(n,m)=1, the following statements hold.≤ ≤ moduli have to··· be chosen such that the following is satisfied.

1. (Montgomery product) For all integers x, y ϕ1n , we can Ms B1 and gcd(Ms,m)=1 for s =1,...,k + l, so ∈ I −1 • ≤ compute an integer z = x (n,m)y for which z xym mod the modular arithmetic modulo every Ms can be realized; ⊗ ≡ ∗ ′ n and z ϕ1n ; moreover, if y n , then even z φ1n . The full base = M is an RNS, that is, ∈ I ∈ I ∈ I • B { 0}∪B∪B 2. (Modular multiply-and-accumulate) Given integer constants gcd(Ms,Mt)=1 for all 0 s

4 operation. This choice helps to avoid certain scaling operations satisfies u N −1h mod M, that is, h+uN 0 mod in the algorithms. We will see later that there is another, M. ≡− ≡ −1 −1 less obvious choice for the representation constants Hs that 2) Next, let z = (h + uN)/M hM + uNM mod ′ ≡ can significantly lower the computational complexity of the M . We want to compute a (H, ϕ1)-representation algorithm. (ξ0, ξ1,...,ξk+l) for z. To this end, compute D. The recursive RNS method - the method k ξ = χ M −1 + µ S NM −1 (6) We will now describe our method to implement Mont- 0 | 0| |M0 X i i| i |M0 |M0 i=1 gomery reduction and Montgomery multiplication mod- ulo N for certain moduli N, given that the assumption and determine ξj for j = k +1,...,k +l such that ξj −1 −1 −1 k −1 −1 ≡ (B , m, , ϕ , φ ) holds. A high-level description of the zH χj KjM H + µiSiNM H mod A 1 I 1 1 j ≡ j Pi=1 i j algorithm to compute z = x (N,M)y consists of the following Mj , by computing steps. ⊗ −1 −1 1) Compute h = xy by computing the residue of h modulo ξj χj Kj M Hj Mj ′ ≡ | | M0 and suitable pseudo-residues of h in ; k B∪B −1 −1 2) compute z = (N,M)(h) as follows: + X µi SiNM H Mj mod Mj (7) R | i j | a) compute u such that h+uN 0 mod M by com- i=1 puting suitable pseudo-residues≡ in the left RNS ; B with ξj ϕ1Mj , using the multiply-and-add operation b) compute z = (h + uN)/M by computing the from∈ SectionI III-A. SMj residue modulo M0 and suitable pseudo-residues 3) We have now determined the part of the H- in the right RNS ′; B representation of z for the right RNS. To find the part of c) use base extension to compute corresponding the H-representation of z for the left RNS, we use (gen- pseudo-residues of z in the left RNS . B eralized) base-extension as discussed in Section II-D. Below we work out these steps in detail. In this section, we • First, we write z in the form concentrate on explaining and verifying that the computed RNS representations indeed satisfy the required congruences. k+l ′ ′ z = X ηj (M /Mj) qM . (8) As a consequence, the algorithm works provided that the − numbers thus represented coincide with the numbers that they j=k+1 are supposed to represent, that is, provided that the numbers By the constructive CRT, we should take ηj that we want to compute are known beforehand to be contained ′ −1 ′ −1 ≡ z(M /Mj) ξj Hj (M /Mj) mod Mj for in the “correct” interval. The required bounds that guarantee j = k +1,...,k≡+ l. Hence we should take this are analyzed in Section IV. ′ ′ −1 Let D = M0MM denote the dynamical range of the full ηj = ξj (Mj ,m) Hj (M /Mj) m Mj (9) ∗ ′ ⊗ | | RNS = M0 . Fix representation constants B { }∪B∪B for j = k +1,...,k + l. H0 = 1,H1,...,Hk+l and K0 = 1,K1,...,Kk+l with • Then, we use the redundant residue ξ0 = z0 to gcd(Hs,Ms) = gcd(Ks,Ms)=1 for all s. In addition, we determine q from choose (small) integers S1,...,Sk (the reason for this will be discussed later; for the time being, we may assume that Si =1 k+l for all i). Suppose we are given inputs x, y ϕN with q ξ ( M ′)−1 + η M −1 mod M . ∈ I ≡ 0| − |M0 X j | j |M0 0 (H, ϕ1)-representations (α0,...,αk+l) and (β0,...,βk+l), j=k+1 respectively. Then we proceed as follows. (10) 1. Compute h = χ = α β and • And finally, we use (8) to compute 0 0 | 0 0|M0 −1 ′ −1 χs = αs (Ms,m) βs (3) ξ zH M H ⊗ i ≡ i ≡|− i |Mi 2 2 k+l for s =1,...,k+l. Then h xy αsβsHs = χsmHs mod ≡ ≡ ′ −1 M , that is, (χ ,χ ,...,χ ) is a (K, ϕ )-representation, + X ηj (M /Mj)H Mi mod Mi, (11) s 0 1 k+l 1 | i | 2 j=k+1 where K0 =1 and Ks = mHs for s =1,...,k + l. 2. Given a proper K-representation (χ0,χ1,...,χk+l) for h, with ξ ϕ M , again by using the multiply-and- i ∈ 1 iI for certain representation constants Ks, we compute z = add operation Mi from Section III-A. (h) as follows. S R(N,M) Preferably, on a non-bottom level we compute (7) by comput- 1) Compute ing µ = χ N −1K (M/M )−1S−1m (4) i i ⊗(Mi,m) |− i i i |Mi k −1 −1 −1 −1 −1 −1 for i =1,...,k. Then µ S N h(M/M ) mod sj = χj KjM H m Mj + X µi SiNM H m Mj i i ≡− i | j | | i j | M for i =1,...,k, so by the CRT, the integer i=1 i (12) k followed by u = µ S M/M (5) X i i i ξ = (s ) (13) i=1 j R(Mj ,m) j

5 and (3) by computing Algorithm 2 Computation of z = (N,M)(h) R k+l 1: for i =1 to k do ′ −1 ′ −1 2: µi = χi (Mi,m) Ci ti = q M H m Mi + X ηj (M /Mj )H m Mi (14) |− i | | i | ⊗ j=k+1 3: end for 4: ξ = χ D + µ D + + µ D 0 | 0 0,0 1 0,1 ··· k 0,k|M0 followed by 5: for j = k +1 to k + l do ξi = (Mi,m)(ti). (15) 6: ξ = (D ,D ,...,D ; χ ,µ ,...,µ ) R j S j,0 j,1 j,k j 1 k 7: χj Dj,0 + µ1Dj,1 + + µkDj,k mod Mj We will refer to this method to compute the ξj ’s and the ξi’s ≡ ··· 8: end for as postponed reduction. This is similar to the method called 9: for j = k +1 to k + l do accumulate-then-reduce [24], also called lazy reduction (see, 10: η = ξ E e.g., [25]), for computing a sum-of-products where modular j j ⊗(Mj ,m) j 11: end for reduction is done only once at the end, instead of after 12: η = ξ F + µ F + + µ F each multiplicaton and addition. We will discuss postponed 0 | 0 0,0 k+1 0,k+1 ··· k+l 0,k+l|M0 13: for i =1 to k do reduction in more detail in Section V-A. 14: ξ = (G , G ,...,G ; η , η ,...,η ) i S i,0 i,k+1 i,k+l 0 k+1 k+l 15: η G +η G + +η G mod M ≡ 0 i,0 k+1 i,k+1 ··· k+l i,k+l i E. The recursive RNS method - the algorithm 16: end for The description above can be summarized as follows. Assume that we are given inputs x, y ϕN by means ∈ I It is not difficult to verify that by this choice of constants, of (H, ϕ1)-representations (α0,...,αk+l) and (β0,...,βk+l), respectively. In order to compute h = xy, we run Algorithm 1 the output (ξ0,...,ξk+l) of Algorithm 2 has the following below. properties. Theorem III.3 Define u = k µ S (M/M ) and z = (h+ Algorithm 1 Computation of h = xy Pi=1 i i i uN)/M. Then u N −1h mod M and hence z is integer. 1: χ0 = α0β0 M0 ≡ − We have ξ0 = z M0 and z ξj Hj mod Mj for j = k + | | | | ≡k+l 2: for s =1 to k + l do χs = αs (Ms,m) βs ′ ′ ⊗ 1,...,k + l, and hence z = j=k+1 ηj (M /Mj) qM for P − ′ some integer q. We have q η0 mod M0, and, setting z = k+l ′ ≡′ ′ This has the following result. j=k+1 ηj (M /Mj) η0M = z + (q η0)M , we have that P′ − − z ξiHi mod Mi for i =1,...,k. Theorem III.2 The tuple (χ0,χ1,...,χk+l) constitutes a ≡ (K, ϕ )-representation for h = xy in ∗ provided that Proof. Straightforward from the definitions of the 1 B h M MM ′ , where K = 1 and K = mH2 for µi and the ξs, and from the definitions of the ∈ 0 I 0 s s s =1,...,k + l. various constants. We have that µi χi (Mi,m) −1 −1 −1 ≡−1 −⊗1 Ci hKi ( N )Ki(M/Mi) Si mm ≡ −−1 ≡ Proof. We have χ0 α0β0 xy = h mod M0 and χs = hN −1(M/M )−1S mod M , so that u −≡ − ≡− − i i i 1 1 1 1 for − −1 ≡ αs (Ms,m) βs Hs xHs ym Ks h mod Ms s = µ S (M/M ) N h mod M . Hence u hN mod M, ⊗ ≡ ≡  i i i i 1,...,k + l. so that z is integer.≡− ≡− −1 −1 −1 Next, assume that h has (K, ϕ )-representation We have z = hM + uNM = hM + 1 k −1 k (χ ,χ ,...,χ ). We desire to compute a (H, ϕ )- i=1 µiSiMi N, hence z χ0D0,0 + i=1 µiD0,i 0 1 k+l 1 P ≡ kP ≡ ξ mod M and z χ D H + µ D H representation (ξ0,...ξk+l) for z = (N,M)(h) as above, 0 0 j j,0 j Pi=1 i j,i j R ≡ ≡ with z ϕN again. To achieve that, proceed as follows. ξj Hj mod Mj for j = k +1,...,k + l. ∈ I Next, since First, choose S1,...,Sk with Si small for all i (the reason ηj = ξj (Mj ,m) Ej −1 −1 ′ −1 ⊗′ −1 ≡ will be discussed later). Next, pre-compute the constants m zHj Hj (M /Mj ) m = z(M /Mj) , we have ′ ′ ′ C = N −1K (M/M )−1S−1m (i =1,...,k); z ηj (M /Mj ) z mod Mj . Also, η0M • i |− i i i |Mi ≡ k+l −≡1 ′ ′ ≡ D = M −1 and ξ0 + j=k+1 ηj MMj z + (z + η0M ) mod M0, 0,0 M0 − P ′ ≡ − • D = |S M −| 1N (i =1,...,k), hence z z mod M0 and q η0 mod M0. 0,i | i i |M0 ≡ ≡ ′ −1 for j = k +1,...,k + l, Finally, for i = 1,...,k, we have z Hi m − k+l − ≡ • −1 −1 ′ 1 ′ 1 D = K M H and µ0( M Hi m) + j=k+1 µj (M /Mj )Hi m ti mod j,0 j j Mj − P ≡ | −1 −|1 Mi.  Dj,i = SiMi NHj Mj (i =1,...,k) | ′ −1 | Ej = Hj (M /Mj ) m Mj (j = k +1,...,k + l); Note that the Algorithm 2 can be implemented with one • | ′ −1 | F0,0 = ( M ) M0 and register of length 1 + k + l to store the values η0, µi for • | −−1 | F0,j = M M0 (j = k +1,...,k + l); , and for , and another | j | i = 1,...,k ηj j = k +1,...,k + l for i =1,...,k, register of length 1+ k + l to store the values of χs for s = • ′ −1 Gi,0 = M Hi Mi and 0, 1,...,k + l, which can be overwritten to also store the ξs |− ′ |−1 Gi,j = (M /Mj)Hi Mi (j = k +1,...,k + l), for . | | s =0, 1,...,k + l and then, run Algorithm 2 below, using the modular add-and- Note also that in order to change the modulus N, we can accumulate operator discussed in Section III-A. simply replace some of the constants in algorithm 2. S

6 F. The redundant modulus residues). Then given (H, ϕ )-representations for x, y ϕN , 1 ∈ I Note that to be able to execute Algorithm 2, the modulus M Algorithm 1 produces a (K, ϕ1)-representation for h = xy 0 2 2 ′ ∈ has to have some additional properties. ϕ N M0MM and given a (K, ϕ1)-representation for h ϕ2IN ⊆2 2, AlgorithmI 2 produces a (H, ϕ )-representation 1) In steps 4 and 12, we need to be able to extract the 1 (ξ∈,...,ξ I ) for (h)= z with z ϕN ; moreover, residue modulo M from the numbers µ ,...,µ ; so 0 k+l (N,M) 0 1 k even z φN if hR ϕN 2 2. ∈ I either these numbers are small, or this residue must be ∈ I ∈ I obtainable from one or more of the residues in an RNS Proof. Suppose that h ϕ2N 2 2. We have z = (h+uN)/M k ∈ I representation for these numbers. with u = Pi=1 µiSi(M/Mi) UM by our above as- 2) In step 15, we have to be able to multiply a constant sumptions. Hence z (ϕ2N 2e +∈ UMNI)/M . So we have ∈ I modulo M with a computed residue η modulo M . that z ϕN provided that ϕ2N/M + U ϕ. From this i 0 0 ∈ I ≤ These requirements are indeed satisfied when adding the inequality, we see that ϕ>U. So we can write ϕ = U/ǫ with first (bottom) RNS layer by our assumptions at the start of 0 <ǫ< 1, and the condition becomes Section III-A. Suppose that on the bottom level, we have a N Mǫ(1 ǫ)e−1/U = M(1 ǫ)e−1/ϕ. (16) redundant modulus m0, a left RNS with moduli m1,...,mk1 , ≤ − − ′ 2 2 and a right RNS with moduli mk1+l1 ,...,mk1+l1 . Let m = If (16) holds, then ϕN < 2M and ϕN M , hence ϕ N < ′ ′ 2 2≤ 2 ′ m m denote the dynamical range of the left RNS (this 2MM M0MM , hence xy ϕ N M0MM . So 1 ··· k1 ≤ ∈ I ⊆ I will be the Montgomery constant for the moduli on the next according to Theorem III.2, (χ0,χ1,...,χk+l) is a (K, ϕ1)- ∗ level). For the redundant modulus M0 on the second level, we representation for h = xy in . Using (16), it is easily B 2 2 can take for example M0 = m0mj for some j > 0. Since checked that even z φN if h ϕN . ∈ I ′ ∈ I every pseudo-residue µ on the second level is represented Finally, since z ϕN M , the bound on M0 follows i ∈ I ⊆ I by its residues in the full RNS on the bottom level, we from Proposition II.2 and Corollary II.3.  can immediately obtain the residues of µi modulo m0 and As a consequence of Theorem IV.1, we can can again satisfy modulo mj , and by the CRT, these two residues represent the assumption (B, m, , ϕ, φ) (see Section III-A), now with residue modulo M0. Similarly, a constant C modulo some Mi A −1 I B = Mǫ(1 ǫ)e /U B1, Montgomery constant M, and (that is, an integer C Mi ) is represented by its residues expansion constants− ϕ =≫U/ǫ and φ = U +1 ǫ = ϕǫ+1 ǫ C = C modulo the∈ mI. To multiply by a residue η , − − ≤ s | |ms s o ϕ since ϕ 1. use Mixed Radix Conversion (see, e.g., [2]) to write η0 in ≥ the Mixed Radix form a + m b. Then η C has residues j 0 V. IMPROVEMENTS (a+mj b)Cs Ms , where each residue can be obtained by four modular| operations.| In this section, we discuss several ways in which the On higher levels, Step 2 above can always be executed algorithm can be optimized or improved. in a similar way by obtaining some kind of Mixed Radix representation for η0, as long as there is no overflow on any A. Postponed reduction level. And to be able to execute Step 1 above, we should Under certain conditions, steps 6 and 14 in Algorithm 2 can probably require that on higher levels the redundant modulus is be done by Montgomery reduction. For example, on a non- the product of some of the moduli on the level below. Further bottom layer, step 6 may be replaced by implementation considerations are left to the reader. t = χ D′ + µ D′ + + µ D′ ; j j j,0 1 j,1 ··· k j,k ξj = (Mj ,m)(tj ), R ′ IV. BOUNDS where Dj,i = mDj,i Mj for all i. This will work provided that t can be| computed| and satisfies t ϕ2M 2 2. This For the algorithms to work as desired, several bounds have j j 1 j is similar to the method called accumulate-then-reduce∈ I [24], to hold. We need some preparation. Let = [ 1+ e,e) be a also called lazy reduction (see, e.g., [25]), for computing a residue interval, with e = 1 (standard residues)I − or e = 1/2 sum-of-products where modular reduction is done only once (symmetric residues). Note that 2 = e if e =1/2 or e =1. I I k at the end, instead of after each multiplicaton and addition. We also need a bound on the integer u = µ S (M/M ) Pi=1 i i i This is a special case (in fact the simplest case) of the possible as defined in Theorem III.3. We let denote the smallest U implementation of the multiply-and-accumulate operation as positive integer for which u UM . We have to ensure that discussed in Section III-A. Since χ ϕ M , D′ MS U exists. In the case of standard∈ residues,I this is achieved by j 1 j j,i j and µ φ M , we have that t ∈ϕ2M 2 2I if and∈ onlyI if requiring that the numbers S are all positive, with 0

7 φ = k +1 ǫ = k +1/2; moreover, all moduli will have residues, one attractive choice is to take every M prime 1 1 − 1 1 i approximately the same (very large) size, so δ 1. Now the with Mi 3 mod 4, so that 1 is a non-square modulo necessary condition (17) reduces to 2k +k((k +1≈/2)δ 4k2, M (such≡ a restriction on the top-layer− moduli is almost for 1 1 ≤ 1 i or free); Then we can choose Si =1 or Si = 1 to ensure that −1 − k 4k1/δ. (18) N(M/Mi)Sim a square. Remark that the upper bound ≤ −U on u will not be influenced by this choice of the S . On Similarly, we may replace step 14 in Algorithm 2 by i ′ ′ ′ the other hand, in the case of symmetric residues, we should si = Gi,0η0 + G ηk+1 + G ηk+l; i,k+1 ··· i,k+l take Si positive but small; in that case we only take moduli ξi = (Mi,m)(si), R ′ Mi for which there exists a small positive non-square when where Gi,j = Gi,j m Mi for all j. Again, this will work −1 | | 2 2 2 N(M/Mi)Sim is a non-square. provided that ti can be computed and satisfies ti ϕ1Mi . − ′ ∈ I Again, even if µi χi mod Mi for i = 1,...,k, the µi In a similar way, writing δ = max Mj/Mi 1 i ≡ { | ≤ ≤ have expansion φ1 while the χi have larger expansion ϕ1. k, ; k +1 j k + l and ω = max ≤ ≤ M /M , we have ≤ ≤ } 1 i k 0 i Similarly, the bound (17) required for postponed reduction that postponed reduction in step 14 in Algorithm 2 works if should be replaced by ′ 2 ω + lφ1δ ϕ1. (19) 1+ lδ ϕ . (21) ≤ ≤ 1 Normally, ω 1, hence with the same assumptions on the bottom level,≪ we now find that the necessary condition (19) VI. COMPLEXITY ESTIMATES AND OPTIMIZATION ′ will certainly be satisfied if l 4k1/δ . Consider a three-layer RNS system to implement expo- ≤ nentiation modulo N by repeated Montgomery multiplica- B. Some improvements tions, consisting of a bottom, layer-1 RNS with moduli m0; m1,...,mk1 ; mk1+1,...,mk1+l1 , a middle, layer-2 RNS The algorithm in Section III-E can be slightly improved. with moduli M0; M1,...,Mk; Mk+1,...,Mk+l, and a top, Indeed a careful choice of the representing constants Hs and layer-3 RNS consisting of a single modulus N. We now give of the signs Si may allow to skip steps 2 and 10 in Algorithm an estimate for the number of simple operations required for a 2. Montgomery multiplication modulo N, where a simple oper- First, if we choose ation is a modular operation for the “small” moduli mi or for −1 ′ the redundant modulus M0 (we assume that all such operations Hj = Jj Mj = M /Mj Mj | | | | are approximately equally costly). Here we assume that on for j = k+1,...,k+l, then Ej = m, hence ηj ξkj mod Mj level 2 and 3, a Montgomery multiplication is implemented ≡ for j = k +1,...,k + l; as a consequence, we may be able to by Algorithm2 1 and 2, where steps 6 and 14 are done by skip step (9) of the algorithm, that is, step 10 in Algorithm 2.A postponed reduction. Let A ,M , , denote the number i i Ri Mi slight complication is that the range of the ηj was smaller than of simple operations required for an addition, a multiplication, that of the ξj : we have ηj φ1Mj , but ξ ϕ1Mj . As a a Montgomery reduction, and a Montgomery operation on ∈ I ∈ I consequence, the bound (19) required for postponed reduction layer i, respectively. Then A1 = M1 = 1 =1 and 1 =0 should be replaced by the bound since all arithmetic on the bottom layerM is exact. Moreover,R ′ we have that t+1 = Mt+1 + t+1, At+1 = (kt + lt)At +1, ω/ϕ1 + lδ ϕ1. (20) M R ≤ and Mt+1 = (kt + lt) t +1 for t 1. Finally, carefully For example, if on the level above the bottom level we have counting the contributionsM of the various≥ steps in Algorithm2 that ϕ1 = k1/ǫ1, then we can satisfy this bound by taking gives that ǫ1 small enough. However, note that as a consequence of R = k − + (2k +1)+ l ((k + 1)M − + k A − ) choosing a smaller ǫ, the upper bound on moduli N on the t tMt 1 t t t t 1 t t 1 +l − + l − + (2l +1)+ k ((l + 1)M − next level will get smaller. tRt 1 tMt 1 t t t t 1 Similarly, if we choose +l A − )+ k − t t 1 tRt 1 K = NS (M/M ) = (kt + lt) t−1 + 2(kt + lt + 1) i |− i i |Mi M +(2ktlt + kt + lt)Mt−1 then Ci = m, and hence µi χi mod Mi for i = 1,...,k. ≡ +2k l A − + (k + l ) − In that case, we may be able to skip step 2 of Algorithm 2. t t t 1 t t Rt 1 In the full Montgomery multiplication algorithm, we would for t 2. It can be shown that in an optimal system, kt lt 2 ≥ ≈ have Ki = Hi m after algorithm 1; so for the improvement, for t = 1, 2; assuming this, we find that A2 M2 = 2k1, we would require that 2 2 ≈ 2 4k1, 2 4k1 +2k1, and on the top layer, we have R ≈ M ≈ 2 2 2 2 −1 M3 = 2k 1 8kk1, 3 16kk1 +8k k1. So the total Hi N(M/Mi)Sim mod Mi. M ≈ R ≈ ≡− number 3 of simple operations required for a Montgomery This choice is only available if the right-hand side is a square multiplicationM modulo N satisfies modulo Mi, but this could be achieved by choosing Si = 1 2 2 3 24kk1 +8k k1. (22) if it is a square and Si a quadratic non-residue if it is a M ≈ non-square. In addition, we need the Si to be small in order Now suppose that we employ table lookup to implement the to get a good upper bound on u. In the case of symmetric simple operations, using tables of size t t bits (so m 2t × i ≤

8 for all i), and we want to be able to handle RSA moduli N up the dynamical range of the left RNS 1 and the right RNS ′ B to b bits. So for the upper bound B0 for the layer-1 moduli, we 1, respectively. Let ǫ1 = 1/2 (the optimal choice), the best have B =2t. Then m = m m 2k1t, so for the upper partitionB of these 18 bottom moduli such that m′ (1 ǫ )m 0 1 ··· k1 ≈ ≥ − 1 bound B1 = mǫ0(1 ǫ0)/U0 for the layer-2 moduli Ms, we with m maximal turns out to result in a left RNS 1 = have B 2tk1 . Similarly,− M = M M 2tk1k, so for (256, 251, 249, 247, 241, 239, 235, 199, 197), of size k B= 9 1 ≈ 1 ··· k ≈ 1 the upper bound B2 = M(ǫ1(1 ǫ1)/U1 on the modulus N, and with m = 2097065983013254306560, and a right RNS tk1k − ′ we have B2 2 . So we conclude that b tk1k, that is, 1 = (191, 193, 211, 217, 223, 227, 229, 233, 253), of size k b/(k t).≈ Using this value for k in the expression≈ for Bl =9 with m′ = 1153388216560035715721. Note that m = ≈ 1 M2 1 0 as given in (22) and minimizing for k1 results in minimizing 17 > k1 =9= l1 as required. Let U1 = k1φ0 = k1 = 9, value k1 pb/(3t), and minimum and put B1 = ǫ1(1 ǫ1)m/U1 = 57669314532864493430, ≈ ϕ = U /ǫ =⌊ 18, and− φ = U⌋ +1 ǫ = 9.5. Then the √ 3/2 1 1 0 1 1 0 2 16 3 (b/t) . (23) given RNS m ′ can be used− to realize assumption M ≈ { 0}∪B1 ∪B1 For example, if we want t = 8 (byte-based tables) and (B1, m, , ϕ1, φ1). M I b = 2048 (modular arithmetic for 2048-bits RSA moduli), Now the choice of the large moduli Ms on the middle layer then for the optimal k1, we find is more or less automatic: if we need k moduli for the left RNS , we simply take the k largest primes below B1; then with k 2048/24 = √84.333 9. B ′ 1 ≈ p ≈ ǫ =1/2, in order to realize M (1 ǫ)M we can take l = k ≥ − ′ In practice, it turns out that the number of table operations is and and take the next l largest primes for the right RNS . B minimized when k1 =9 and k = 32. We want this layer to realize assumption (B,M, , ϕ, φ) with B 22048 so that we can handle RSAM moduliI N on ≥ VII. ANEXAMPLE the top layer with up to 2048 bits. It turns out that in order to have B = Mǫ(1 ǫ)/U large enough, we need to take To test and verify the algorithms in this paper, we have − implemented in C++ a 2-layer RNS algorithm for modular k = 32 lower primes below B1. For the redundant modulus, we can take for some (in our program, exponentiation with a 2048-bits RSA modulus N. Effectively, M0 = m0mj j > k this is just a 3-layer RNS system with a top-layer consisting we took M0 = 17 253). It turns out that the parameters allow postponed reduction× (see Section V-A), which greatly of a single 2048-bits modulus. Note that changing the RSA increases the efficiency of the program. modulus N in the top-layer amounts to adapting some of the constants in Algorithm 2 for computing modulo N on the Remark VII.1 It is possible to build a multi-layer RNS with top-layer with the RNS on the middle layer; the 2-layer RNS unbounded dynamical range starting with a bottom layer of below remains unchanged. moduli of at most 4 bits. Indeed, take = (16, 15), ′ = For simplicity, we used standard residues (so = [ 1+ (13, 11), and M =7. Assuming exact arithmeticB for allB 4-bit I − 0 e,e) = [0, 1) with e =1), and we employed tables of size 8 8 moduli, we have (B = 8,m = 1, = [ 1/2, 1/2), ϕ = × A 1 I − 1 bits for the required modular arithmetic. So we use moduli of 1, φ1 = 1), hence according to Theorem IV.1 with k = l =2, ′ size at most 256 on the bottom layer, and assumption (B0 = M = 16.15 = 240, M = 13.11 = 143, and taking ǫ = 1/2, 8 A 2 , 1, = [0, 1), ϕ0 =1, φ0 = 1) holds. For the bottom layer, we have U = 1 and B = M.(1/4)/1 = 80, and hence we I it turns out that it is optimal to have one redundant modulus can use this RNS to realize assumption (B = 80,M = and 18 further moduli. For these small moduli, we take the 240, = [ 1/2, 1/2), ϕ = 2, φ = 3/2). FromA here on it is primes easyI to further− increase the dynamical range by adding further (virtual) layers. 191, 193, 197, 199, 211, 223, 227, 229, 233, 239, 241, 251, which are the 12 largest primes less than 256, and the VIII. CONCLUSIONS composite numbers We have presented an improved Bajard-Imbert-type full 256 = 28, 253 = 11 23, 249 = 3 83, 247 = 13 19, RNS algorithm that can also operate on inputs represented by · · · pseudo-residues. Using this algorithm, we have developed a 235 = 5 47, 217 = 7 31, · · multi-layer Residue Number System (RNS) that is capable of which are the largest numbers of the from pia with a > 13 implementing modular addition, subtraction and multiplication prime, and which produces the largest attainable product for for very large moduli by only using actual arithmetic for any list of 18 mutually co-prime numbers of size at most a fixed set of moduli. If the moduli of this fixed set are 256. Note that 255 = 3 5 17 is a worse choice for both sufficiently small, the method allows for a fully table-based 3 and 5, similarly 245 =· 5 ·72 is a worse choice for both 5 implementation. In contrast to digit-based implementations of and 7; the choices for 2, 11,· and 13 are evidently optimal. modular operations for large moduli, our method allows for Note that, as a consequence, the small moduli involve as a massively parallel implementation and is completely carry- prime factors all primes from 191 to 251 together with the free, thus thwarting potential attacks exploiting such carries, primes 2, 3, 5, 7, 11, 13, 19, 23, 31, 47, 83. So for the bottom- e.g., with side-channel analysis or in a white-box cryptography layer redundant modulus, we can take m0 = 17. context. Note that the bottom layer moduli have expansion coeffi- Our system may be considered as a method to provide ′ cients ϕ0 and φ0 for which ϕ0 = φ0 =1. Let m and m denote a given, fixed RNS with a very large dynamical range. To

9 illustrate the method, we have described a 2-layer RNS system [17] C¸. K. Koc¸, T. Acar, and B. S. Kaliski Jr., “Analyzing and comparing that can be used to implement an RSA exponentiation by Montgomery multiplication algorithms,” IEEE Micro, vol. 16, no. 3, pp. 26–33, 1996. adding the desired RSA modulus on top in a third layer. The [18] A. Menezes, P. van Oorschot, and S. Vanstone, Handbook of Applied system employs 19 moduli of 8-bits each in the bottom layer Cryptography. CRC Press, 1996. and can be used to implement an RSA exponentiation for [19] M. K. Ibrahim, “Novel Digital Filter Implementations Using Hybrid RNS-binary Arithmetic,” Signal Process., vol. 40, no. 2-3, pp. 287–294, 2048-bits RSA moduli with all the required arithmetic done Nov. 1994. by table look-up, using 19 modular addition tables and 19 [20] B. Parhami, “A note on digital filter implementation using hybrid RNS- modular multiplication tables, each of these 38 tables having binary arithmetic,” Signal Processing, vol. 51, no. 1, pp. 65 – 67, 1996. [21] ——, “Application of symmetric redundant residues for fast and reliable size 28 28 8 bits, with one modular multiplication taking × × arithmetic,” Proc. SPIE, vol. 4791, pp. 393–402, 2002. approximately 160,000 table look-ups. We further observed [22] J.-J. Quisquater, “Encoding system according to the so-called RSA that in order to change the RSA modulus, only some constants method, by means of a microcontroller and arrangement implementing this system,” US Patent 5,166,978, 1992. for computing on the top layer with moduli on the middle layer [23] A. Shenoy and R. Kumaresan, “Fast base extension using a redundant need to be updated. This update need not be computed in a modulus in RNS,” IEEE Trans. Comput., vol. 38, no. 2, pp. 292–297, secure manner and hence can be done quickly. 1989. [24] C. H. Lim and H. S. Hwang, “Fast implementation of elliptic curve Our straightforward (non-parallelized) C++ program im- arithmetic in GF(pn),” in International Workshop on Public Key plementation of this RSA exponentiation method with table- Cryptography. Springer, 2000, pp. 405–421. lookup takes approximately 0.3 second on a HP Elitebook [25] D. F. Aranha, K. Karabina, P. Longa, C. H. Gebotys, and J. L´opez, “Faster explicit formulas for computing pairings over ordinary curves,” Folio 9470m laptop to realize 500-bit modular exponentiation in Proc. EUROCRYPT 2011. Springer, 2011, pp. 48–68. for a 2048-bits RSA modulus. So the security wish to remove all carries in the arithmetic can be achieved with a implemen- tation operating at an acceptable speed.

REFERENCES [1] A. Omondi and B. Premkumar, Residue number systems: theory and implementation. Imperial College Press, 2007. [2] N. S. Szabo and R. I. Tanaka, Residue arithmetic and its applications to computer technology. McGraw-Hill, 1967. [3] P. A. Mohan, Residue number systems: algorithms and architectures. Springer Science & Business Media, 2012, vol. 677. [4] B. Parhami, Computer Arithmetic. Oxford University Press, New York, 2000. [5] P. L. Montgomery, “Modular multiplication without trial division,” Mathematics of computation, vol. 44, no. 170, pp. 519–521, 1985. [6] P. Barrett, Implementing the Rivest Shamir and Adleman Public Key Encryption Algorithm on a Standard Digital Signal Processor. Berlin, Heidelberg: Springer, 1987, pp. 311–323. [7] J.-C. Bajard and L. Imbert, “A full RNS implementation of RSA,” IEEE Trans. on Computers, vol. 53, no. 6, pp. 769–774, June 2004. [8] J.-C. Bajard, S. Duquesne, M. Ercegovac, and N. Meloni, “ Residue systems efficiency for modular products summation: Application to Elliptic Curves Cryptography,” in Proceedings of SPIE: Advanced Signal Processing Algorithms, Architectures, and Implementations XVI, Augustus 2006, pp. 631 304, 1–11. [9] M. Ciet, M. Neve, E. Peeters, and J. J. Quisquater, “Parallel FPGA implementation of RSA with residue number systems - can side-channel threats be avoided?” in 2003 46th Midwest Symposium on Circuits and Systems, vol. 2, Dec 2003, pp. 806–810. [10] P.-A. Fouque, D. R´eal, F. Valette, and M. Drissi, The Carry Leakage on the Randomized Exponent Countermeasure. Berlin, Heidelberg: Springer, 2008, pp. 198–213. [11] C. H. Gebotys, B. A. White, and E. Mateos, “Preaveraging and Carry Propagate Approaches to Side-Channel Analysis of HMAC-SHA256,” ACM Trans. Embed. Comput. Syst., vol. 15, no. 1, pp. 4:1–4:19, Feb. 2016. [12] S. Chow, P. Eisen, H. Johnson, and P. C. van Oorschot, A White-Box DES Implementation for DRM Applications. Berlin, Heidelberg: Springer, 2003, pp. 1–15. [13] H. M. Yassine, “Hierarchical residue numbering system suitable for VLSI arithmetic architectures,” in 1992 IEEE International Symposium on Circuits and Systems, vol. 2, May 1992, pp. 811–814. [14] A. Skavantzos and M. Abdallah, “Implementation issues of the two-level residue number system with pairs of conjugate moduli,” IEEE Trans. Signal Process., vol. 47, no. 3, pp. 826–838, March 1999. [15] T. Tomczak, “Hierarchical residue number systems with small moduli and simple converters,” Int. J. Appl. Math. Comput. Sci., pp. 173–192, 2011. [16] J.-F. Dhem, “Design of an efficient public-key cryptographic library for RISC-based smart cards,” Ph.D. dissertation, Universit´e catholique de Louvain, May 1998.

10