<<

arXiv:1709.02631v1 [math.PR] 8 Sep 2017 unn time running ieee ihu hsclacs otedvc.Attacks device. the the to called measure access are this physical often exploiting without can even one time since dangerous, particularly ahmtclojcs euiydfiiin involve definitions security objects, mathematical l uhr eespotdb oihNtoa cec Centr Science National DEC-2013/10/E/ST1/00359. Polish by supported were authors All • • Require: 1 Algorithm information, implementation, In side-channel particular additional scheme. of a cryptographic lots or expose to given may device, access a a only however, has of reality adversary outputs an and that inputs assumed is it model: COMP SECURE AND DEPENDABLE ON TRANSACTIONS IEEE TO ACCEPTED I 1 3: 2: 1: 4: 5: 8: 7: 6: .Lrki ihteMteaia nttt,Fclyof Faculty Institute, Mathematical Pol the University, Wrocław with Science, Computer [email protected] is Email and Mathematics Lorek P. S Uni Computer Wrocław of Poland. Technology, Technology, Department and of the Science Problems with Fundamental are of Zagórski Faculty F. and Kulis M. ntetaiinlcytgah rmtvsaetetdas treated are primitives traditional the In for x return for end ne Terms Index si he areas: three in is Abstract NTRODUCTION 1 = x if n if end s = i = 2) 3) 1) d x i ,d c, = mod =1 == x to Ti ae rsnsapiaiiyo togSainr Ti Stationary Strong of applicability presents paper —This cee hc simn otetmn-takb esn impl con reusing on by focus timing-attack mainly researchers the cryptography to symmetric immune is algorit SST-based which of properties scheme) leverage can one how show We properties: provable obtain ieto n xlrsieso nu masking. input of ideas explores and direction a stop algorithms these Instead, steps. of number predefined nlsso daie mteaia)mdl feitn cr existing of models (mathematical) idealized of Analysis rpstoso e ls fcytgahcagrtm (ps algorithms cryptographic of class new a of Propositions mod ipemdlrexponentiation modular Simple = togsainr ie n t s in use its and times stationary Strong en n ftems motn.I a be can It important. most the of one being 0 ( b) a) Ped-admpruaingnrtr akvcan,mix chains, Markov generator, permutation —Pseudo-random x then d do ( s · x d ,n x, muiyt iigatcs(oifrainaotteresult performed). the algorithm about SST information steps of (no number attacks timing to distribution, uniform immunity from samples perfect are results s · − ,n c, 1 ) · · · ) iigattacks timing d 1 d 0 ; d i { ∈ ae oe,FlpZgrk,adMca Kulis Michał and Zagórski, Filip Lorek, Paweł 0 . , 1 } cryptography c otatnumber contract e d mod black-box est of versity n cience and. the ✦ porpi cee – schemes yptographic TN,2017 UTING, e ST ehiusi h rao rporpy h applic The cryptography. of area the in techniques (SST) mes 5 number able line is times result of the compute recover to completely takes to it time the measuring by navray ycrflslcino messages of selection careful by adversary, An Require: 2 Algorithm 1 Algorithm in 2. Algorithm described with modu- as together the algorithm used exponentiation RSA lar of implementations Straightforward aua approaches: attacks. natural timing discuss to one first the was [16] Kocher oevr h oplri oecsscnotmz the optimize again, can time cases running ones. data-dependent some in time resulting in al- constant algorithm compiler all to the not transformable However, Moreover, easily self-explaining. 
are is gorithms approach A1 The 1 osattm implementation Constant-time A1) 2 akn (blinding) Masking A2) uorno emtto eeaos hc ontrnfrth for run not do which generators) permutation eudo-random tn ie(eipeettos u prahge nadiff a in goes approach Our (re)implementations. time stant 1: 2: 4: 3: fteAgrtm1aecle eed ohon both depends called are 1 Algorithm the of crigt tpigrl enda S,frwihoecan one which for SST, as defined rule stopping a to ccording m ocntuta mlmnain(fasmercencrypti symmetric a (of implementation an construct to hms o a n rvn iigatcs hr r two are There attacks? timing prevent one can How mnain hc r o eueaanttmn-tak.I timing-attacks. against secure not are which ementations if return if end n ie temcpe,tmn attacks timing cipher, stream time, ing x x n emtto ek hog h nomto bu the about information the through leaks permutation ing sdt needn i usi h aetm nall on time same time the running in the inputs). runs possible that (it way independent a data is such in algorithm the nsc a httm-eedn prtosare operations time-dependent inputs. that random on way performed a such in ≥ = ,n x, n x x % then oua division Modular n i.e., eipoearsl yMrnv[21]. Mironov by result a improve we 2 d fteAgrtm2adln number line and 2 Algorithm the of h rvt e snetenumber the (since key private the – s oeetarandomness extra some Use . mod ( ,n x, Re-implementing . = ) x mod c ability n just and , erent c n and on n e d ). 1 ACCEPTED TO IEEE TRANSACTIONS ON DEPENDABLE AND SECURE COMPUTING, 2017 2

or a constant-time implementation on one platform may The KSAs of mentioned ciphers perform shuffling for become again time attack susceptible on a different platform some predefined number of steps. The security of such – see [23], [26]. a scheme is mainly based on analyzing idealized version In this paper we focus our attention on masking (i.e., A2) of the algorithm and corresponds to the “quality of a by performing some “random” permutation first, which is shuffling”. Roughly speaking, shuffling is considered as a key-dependent. Markov chain on permutations, all of them converge to Authors of [7] presented fast constant-time algorithms uniform distribution (perfectly shuffled cards). Then we for code-based public-key cryptography called McBits. should perform as many steps as needed to be close to this McBits uses a permutation network to achieve immunity uniform distribution, what is directly related to the so-called against cache-timing attacks. Permutation network is a net- mixing time. This is one of the main drawbacks of RC4: work built up of wires and comparators. Wires carry values it performs Cyclic-to-Random Transpositions for n steps, while comparators sort values between two wires (smaller whereas the mixing time is of the order n log n. value goes to the upper wires, bigger value goes to the lower There is a long list of papers which point out weaknesses wires). Permutation network is a fixed construction: number of the RC4 algorithm. Attacks exploited both weaknesses of wires and comparators is fixed. When such a network has of PRGA and KSA or the way RC4 was used in specific the property that it sorts all inputs, then it is called a sorting systems [5], [13], [14], [15], [19]. As a result, in 2015 RC4 network. was prohibited in TLS by IETF, Microsoft and Mozilla. In Section 5 of [7] on “Secret permutations” authors In the paper we use so-called Strong Stationary Times write (where a permutation is considered as the output of a (SST) for Markov chains. The main area of application of permutation network) SSTs is studying the rate of convergence of a chain to its “By definition a permutation network can reach stationary distribution. However, they may also be used for every permutation, but perhaps it is much more perfect sampling from stationary distribution of a Markov likely to reach some permutations than others. Per- chain, consult [27] (on Coupling From The Past algorithm) haps this hurts security. Perhaps not [...]” and [12] (on algorithms involving Strong Stationary Times Above motivation leads to the following question: and Strong Stationary Duality). How to obtain a key-based permutation which is This paper is an extension (and applications to timing indistinguishable from a truly random one? attacks) of [17] presented at Mycrypt 2016 conference. The doubt about the resulting permutation in [7] can come from the following: one can have a recipe for constructing 1.1 Related work - timing attacks a permutation involving some number of single steps, de- Kocher [16] was first to observe that information about termining however the necessary number of steps is not the time it takes a given cryptographic implementation to trivial. In this paper we resolve the issue, our considera- compute the result may leak information about the secret tions are Markov chain based. In particular, we consider (or private) key. Timing attack may be treated as the most some card shuffling schemes. 
Our methods can have two dangerous kind of side-channel attacks since it may be applications: we can determine number of steps so that exploited even by a remote adversary while most of other we are arbitrary close to uniform permutation; in so-called types of side-channel attacks are distance bounded. Kocher SST version algorithm does not run pre-defined number of discussed presented timing attacks on RSA, DSS and on steps, but instead it stops once randomness is reached. As symmetric key schemes. a side-effect we have a detailed analysis of some existing Similar timing attack was implemented for CASCADE stream ciphers. Many stream ciphers (e.g., RC4, RC4A [29], smart cards [10]. Authors report recovering 512-bit key after RC4+ [30], VMPC [32], Spritz [28]) are composed of two 350.000 timing measurements (few minutes in late 90s). algorithms: Later, Brumley and Boneh [8] showed that timing attacks 1) KSA (Key Scheduling Algorithm) uses a secret key may be performed remotely and on algorithms that were to transform identity permutation of n elements into previously not considered vulnerable (RSA using Chinese some other permutation. Remainder Theorem). 2) PRGA (Pseudo Random Generation Algorithm) The usual way of dealing with timing attacks is to attempt to re-implement a given scheme in such a way that starts with a permutation generated by KSA and it runs the same number of steps (i.e., approach A1), no outputs random bits from it (some function of the matter what the input is. This approach has the following permutation and the current step number) while drawbacks: updating the permutation at the same time. KSAs of all aforementioned algorithms (RC4, RC4A, 1) it requires to replace cryptographic hardware, Spritz, RC4+, VMPC) can be seen as performing some 2) it is hard to do it right, card shuffling, where a secret key corresponds to/replaces 3) it badly influences the performance. randomness. If we consider a version of the algorithm with (1) The first limitation is quite obvious. So it may be purely random secret key of infinite length, then we indeed worth considering the idea of reusing a timing attack sus- consider a card shuffling procedure. Following [21], we will ceptible implementation by masking (see Section 1.1.1) its refer to such a version of the algorithm as an idealized inputs. This was the original timing attack countermeasure version. In the case of KSA used by RC4 the idealized technique proposed by Kocher for both RSA and DSS. version (mathematical model) of the card shuffle is called (2) An example of s2n – Amazon’s TLS implementation Cyclic-to-Random Transpositions shuffle. and its rough road to become immune to timing attacks [2] ACCEPTED TO IEEE TRANSACTIONS ON DEPENDABLE AND SECURE COMPUTING, 2017 3

shows that achieving timing resistance is not easy. It is not for some pre-defined number of steps, we make it random- an isolated case, it may be hard to achieve constant time in ized (Las Vegas algorithm). As far as we are aware the the case of of implementation [24] but what may be more idea of key-dependent running time in context of symmetric worrying, even a change at the architecture level may not ciphers was not considered earlier. To be more specific we lead to success [31]. suggest utilization of so-called Strong Stationary Times (3) BearSSL is an implementation of TLS protocol which (SST) for Markov chains. We use SST to obtain samples consists of many constant time implementations (along with from uniform distribution on all permutations (we actually a “traditional” ones) of the main cryptographic primitives. perform perfect sampling). Few things needs pointing out: Constant-time implementation of CBC encryption for AES is 5.89 times slower than the “standard” implementation [25]. 1) In McBits [7] construction, authors perform similar On a reference platform it achieves the following perfor- masking by performing some permutation. This mance: AES-CBC (encryption, big): 161.40 MB/s while the is achieved via a permutation network. However, constant time (optimized for 64-bit architecture) has 27.39 they: a) run it for a pre-defined number of steps MB/s. The slow-down is not always at the same rate, it (they are unsure if the resulting permutation is depends on both the function and on the mode of operation random); b) they work hard to make the implemen- used (for AES in CTR mode the slowdown is 1.86; for 3DES tation constant-time. it is 3.10). Utilization of SST allows to obtain a perfect sample from uniform distribution (once SST rule is found, 1.1.1 Masking we have it “for free”). This is a general concept which may be applied to many algorithms. Let F ( ) be the RSA decryption (described as Algorithm 1) 2) Coupling methods are most commonly used tool implementation· which uses a private key [d, N] to decrypt e for studying the rate of convergence to stationarity an encrypted message x mod N. The following shows for Markov chains. They allow to bound the so- how to design which is immune to timing attacks but F called total variation distance between the distri- reuses implementation of F . The approach proposed by bution of given chain at time instant k and its Kocher [16] is to run a bad implementation F on a random stationary distribution. However, the (traditional) input (same idea as Chaum’s blind-signatures [9]), the code security definitions require something “stronger”. for is presented in Algorithm 3. F It turns out that bounding separation distance is what one actually needs. It fits perfectly into the Algorithm 3 – timing attack immune implementation of F notion of Strong Stationary Times we are using (see RSA decryption using time-attack susceptible implementa- Section 2.3). tion F 3) By construction, the running time of our model is Require: x,F,d,e,N key dependent. In extreme cases (very unlikely) it 1: select r uniformly at random from ZN∗ may leak some information about the key, but it 2: compute y = F (d, xre mod N) 1 does not leak any information about the resulting 3: return s = yr− mod N permutation to an adversary. We elaborate more on this in Section 6.5. The security of this approach comes from the fact that while the running time of G is not constant, an attacker (2) Preventing timing attacks. 
We a technique cannot deduce information about the key, because it does which allows to obtain a timing attack immune implemen- not know the input to the F . This is because xre mod N is tation of a symmetric key encryption scheme by re-using a timing attack vulnerable scheme. The technique follows a random number uniformly distributed in ZN∗ . The beauty of this approach lies in the fact that F is ideas of Kocher’s masking of the RSA algorithm and are actually run on a random input. After applying F one similar in nature to the utilization of a permutation network in McBits. can easily get rid of it. To be more precise: let g1(x, r) = e 1 xr mod N and let g2(x, r) = xr− mod N then G(d, x) = The idea is that an input to a timing-attack vulnerable g2(F (d, g1(x, r)), r) = F (d, x) (both g1 and g2 depend on function is first a subject to a pseudo-random permutation. the public key [N,e]). Taking advantage of properties of The permutation is generated (from a secret key) via SST- the underlying group lets us to remove the randomness r based pseudo-random permutation generation algorithm. (which remains secret to the attacker during the computa- The main properties of such an approach are following: tions). the resulting permutation is uniformly random provided the Similar approach can be applied to other public key long enough key is applied; the running time of an SST- schemes e.g., RSA signatures, DSA signatures. It is unknown based algorithm does not reveal any information about the how this approach – how to design G – may be applied to resulting permutation. symmetric-key cryptographic schemes without introducing Based on our straightforward implementation, we show any changes to the original function F . that the performance of our approach is similar to constant- time implementations available in BearSSL [25], see Sec- tion 7, Figure 7 (but does not require that much effort and 1.2 Our contribution caution to implement it right). (1) Strong stationary time based PRNGs (Pseudo Random (3) Analysis of idealized models of existing crypto- Number Generators). Instead of running a PRNG algorithm graphic schemes. ACCEPTED TO IEEE TRANSACTIONS ON DEPENDABLE AND SECURE COMPUTING, 2017 4

The approach, in principle, allows us to analyze the It is relatively easy to check that d ( (X ), ψ) T V L k ≤ quality of existing PRNGs, closing the gap between theo- sep( (X ), ψ). L k retical models and practice. The “only” thing which must Strong Stationary Times. The definition of separation be done for a given scheme is to find an SST for the distance fits perfectly into the notion of Strong Stationary corresponding shuffle and to analyze its properties. We Times (SST) for Markov chains. This is a probabilistic tool for present in-depth analysis of RC4’s KSA algorithm. As a studying the rate of convergence of Markov chains allowing result of Mironov’s [21] work, one knows that the idealized also perfect sampling. We can think of stopping time as of version of RC4’s KSA would need keys of length 23 037 ≈ a running time of an algorithm which observes a Markov in order to meet the mathematical model. Similarly as in chain X and which stops according to some stopping rule [21], we propose SST which is valid for Cyclic-to-Random (depending only on the past). Transpositions and Random-to-Random Transpositions. For the latter one we calculate the mixing time which is “faster”. Definition 2.1. Random variable T is a randomized stop- It is known that Random-to-Random Transpositions card ping time if it is a running time of the RANDOMIZED 1 shuffling needs 2 n log n steps to mix. It is worth men- STOPPING TIME algorithm. tioning that this shuffling scheme and Cyclic-to-Random Transpositions are similar in the spirit, it does not transfer 1 Algorithm 4 RANDOMIZED STOPPING TIME automatically that the latter one also needs 2 n log n steps to mix. Mironov [21] states that the expected time till reaching 1: k := 0 uniform distribution is upper bounded by 2nH n, we 2: coin :=Tail n − show that the expected running time for this SST applied to 3: while coin ==Tail do Random-to-Random Transpositions is equal to 4: At time k, (X0,...,Xk) was observed 5: Calculate f (X), where f : (X ,...,X ) [0, 1] k k 0 k → E[T ]= nHn + n + O(Hn) 6: Let p = fk(X0,...,Xk). Flip the coin resulting in Head with probability p and in Tail with probability H n ( n is the -th harmonic number) and empirically check 1 p. Save result as coin. that the result is similar for Cyclic-to-Random Transposi- − 7: k := k +1 tions. This directly translates into the required steps that 8: end while should be performed by RC4. In fact one may use a better shuffling than the one that is used in RC4, e.g., time-reversed riffle shuffle which requires (on average) 4096 bits – not much more than 2048 bits which Definition 2.2. Random variable T is a Strong Stationary are allowed for RC4 (see Section 5). Time (SST) if it is a randomized stopping time for chain X such that:

2 PRELIMINARY (i E) P r(X = i T = k)= ψ(i). ∀ ∈ k | Throughout the paper, let n denote a set of all permuta- tions of a set 1,...,n =: [Sn]. { } Having SST T lets us bound the following (see [4]) 2.1 Markov chains and rate of convergence X sep( (X ), ψ) P r(T > k). (1) Consider an ergodic Markov chain = Xk, k 0 on L k ≤ a finite state space E = 0,...,M 1 with{ a stationary≥ } { − } distribution ψ. Let (Xk) denote the distribution of the chain at time instantL k. By the rate of convergence we 2.2 Distinguishers and security definition understand the knowledge on how fast the distribution of the chain converges to its stationary distribution. We have In this section we define a Permutation distinguisher – to measure it according to some distance dist. Define we consider its advantage in a traditional cryptographic dist security definition. That is, a Permutation distinguisher is τ (ε) = inf k : dist( (Xk), ψ) ε , mix { L ≤ } given a permutation π and needs to decide whether π is which is called mixing time (w.r.t. given distance dist). In a result of a given (pseudo-random) algorithm or if π is our case the state space is the set of all permutations of a permutation selected uniformly at random from the set [n], i.e., E := n. The stationary distribution is the uniform of all permutations of a given size. The upper bound on S E 1 E Permutation distinguisher distribution over , i.e., ψ(σ) = n! for all σ . In most ’s advantage is an upper bound applications the mixing time is defined w.r.t. total∈ variation on any possible distinguisher – even those which are not distance: bounded computationally. Additionally, in the Section 4.1 1 1 we consider Sign distinguisher. d ( (X ), ψ)= P r(X = σ) . T V L k 2 k − n! Permutation distinguisher The distinguishability game σ n X∈S for the adversary is as follows (Algorithm 5):

The separation distance is defined by: Definition 2.3. The permutation indistinguishability sep( (Xk), ψ) := max (1 n! P r(Xk = σ)) . Shuffle σ E S, (n, r) experiment: L ∈ − · A ACCEPTED TO IEEE TRANSACTIONS ON DEPENDABLE AND SECURE COMPUTING, 2017 5

Algorithm 5 ShuffleS, (n, r) guarantees (i.e., Definition 2.5). In our case (E = n and A E S Let S be a shuffling algorithm which in each round requires ψ is a uniform distribution on ) having dT V small, i.e., 1 1 m bits, and let be an adversary. dT V ( (Xk), ψ) = 2 σ n P r(Xk = σ) n! ε does A L ∈S 1 − ≤ not imply that P r(Xk = σ) is uniformly small (i.e., 1) S is initialized with: | P − n! | of order 1 ). Knowing however that sep( (X ), ψ) ε n! L k ≤ a) a key generated uniformly at random implies rm K ∼ 1 ε ( 0, 1 ), (σ E) P r(X = σ) . (2) U { } k n! n! b) S0 = π0 (identity permutation) ∀ ∈ − ≤

2) S is run for r rounds: S := S( ) and produces a Above inequality is what we need in our security definitions r K permutation πr. and shows that the notion of separation distance is an 3) Weset: adequate measure of mixing time for our applications. It is worth noting that dT V ( (Xk), (E)) ε implies c0 := πrand a random permutation from L U E ≤ • (see Theorem 7 in [3]) that sep( (X2k), ( )) ε. This uniform distribution is chosen, means that proof of security whichL boundsU directly≤ total c := π . • 1 r variation distance by ε would require twice as many bits 4) A challenge bit b 0, 1 is chosen at random, of randomness compared to the result which guarantees ε ∈ { } permutation cb is sent to the Adversary. bound on the separation distance. 5) Adversary replies with b′. 6) The output of the experiment is defined to be 1 if 3 SHUFFLING CARDS b′ = b, and 0 otherwise. In this Section we consider four chains X(T 2R), X(RTRT ), X(CTRT ), X(RS) corresponding to known shuffling schemes: Top-To-Random, Random-To- In the case when the adversary wins the game (if b = b′) we say that succeeded. The adversary wins the game if it can Random Transpositions, Cyclic-To-Random Transpositions distinguishA the random permutation from the permutation and Riffle Shuffle (we actually consider its time reversal). E being a result of the PRPG algorithm. The state space of all chains is = n. The generic state is denoted by S E, where S[j] denotesS the position of Definition 2.4. A shuffling algorithm S generates indistin- element j. Rough∈ description of one step of each shuffling guishable permutations if for all adversaries there (in each shuffling “random” corresponds to the uniform exists a negligible function negl such that A distribution) is following: 1 X(T 2R) P r [ShuffleS, (n, r) = 1] + negl(n). : choose random position j and move the A ≤ 2 • card S[1] (i.e., which is currently on top) to this The above translates into: position. X(RTRT ): choose two positions i and j at random Definition 2.5. A shuffling algorithm S generates indistin- • guishable permutations if for any adversary there and swap cards S[i] and S[j]. X(CTRT ) exists a negligible function negl such that: A : at step t choose random position j and • swap cards S[(t mod n) + 1] and S[j]. X(RS): assign each card random bit. Put all cards • P r [ (S(K)) = 1] P r [ (R) = 1] with bit assigned 0 to the top keeping relative posi- keyLen R←U(S ) K←{0,1} A − n A tions of these cards.

negl(n) . ≤ The stationary distribution of all the chains is uniform, i.e. ψ(σ) = 1 for all σ E. Aforementioned Strong Stationary n! ∈ 2.3 Strong stationary times and security guarantees Time is one of the methods to study rate of convergence. Probably the simplest (in description) is an SST for X(T 2R): Coupling method is a commonly used tool for bounding the rate of convergence of Markov chains. Roughly speaking, a Wait until the card which is initially at the bottom coupling of a Markov chain X with a transition matrix P is reaches the top, then make one step more. (T 2R) (T 2R) a bivariate chain (X′, X′′) such that marginally X′ and X′′ Denote this time by T . In this classical example T are Markov chains with the transition matrix P and once corresponds to the classical coupon collector problem. We (T 2R) c know that P r(T >n log n+cn) e− , thus via (1) we the chains meet they stay together (in some definitions this ≤ condition can be relaxed). Let then T = inf X′ = X′′ have c k{ k k } i.e., the first time chains meet, called coupling time. The cou- (T 2R) (T 2R) c dT V ( (Xk ), ψ) sep( (Xk ), ψ) e− pling inequality states that dT V ( (Xk), ψ) P r(Tc > k). L ≤ L ≤ On the other hand, separationL distance≤ is an upper for k = n log n + cn. This bound is actually tight, in a sense bound on total variation distance, i.e., d ( (X ), ψ) that there is a so-called cutoff at n log n, see Aldous and T V L k ≤ sep( (Xk), ψ). At first glance it seems that it is better to Diaconis [3]. L X(RTRT ) 1 directly bound dT V , since we can have dT V very small, The chain mixes at 2 n log n, the result was first whereas sep is (still) large. However, knowing that sep is proven by Diaconis and Shahshahani [11] with analytical small gives us much more than just knowing that dT V is methods, Matthews [20] provided construction of SST for small, what turns out to be crucial for proving security this chain. ACCEPTED TO IEEE TRANSACTIONS ON DEPENDABLE AND SECURE COMPUTING, 2017 6

There is a simple construction of SST for X(RS): initially Lemma 1. The expected running time of Random-to- consider all pairs of cards as unmarked. Then, if at some Random Transpositions shuffling with Mironov stop- step cards in some pair were assigned different bits mark ping rule is: this pair. Stop at the first time when all pairs are marked. ET =2nHn n + O(Hn). This SST yields in 2n log2 n mixing time (the actual answer − is that (3/2)n log n steps are enough). 2 Proof. We start with one card checked. When k cards are The case X(CTRT ) is a little bit more complicated. Hav- checked, then probability of checking another one is equal ing SST T does not automatically imply that we are able (n k)(k+1) to bound P r(T > k), or even calculate ET . This is due to pk = − n2 . Thus, the time to check all the cards is to the cyclic structure of the shuffle. Note that “in spirit” distributed as a sum of geometric random variables and its X(CTRT ) is similar to X(RTRT ), which - as mentioned - expectation is equal to: 1 mixes at n log n (and even more - there is a so-called cutoff n 1 2 − 1 n2 at this value, meaning that “most of the action happens” =2 Hn n =2nHn n + O(Hn). 1 p n +1 − − around n log n). Mironov [21] (in context of analysis of k=1 k 2 X RC4) showed that it takes O(n log n) for this shuffle to mix.  Unfortunately Mironov’s “estimate of the rate of growth of the strong stationary time T is quite loose” and results “are a far cry both from the provable upper and lower bounds 3.2 Better SST for Cyclic-to-Random Transpositions on the convergence rate”. He: We suggest another “faster” SST which is valid for both Cyclic-to-Random Transpositions and Random-to-Random proved an upper bound O(n log n). More precisely Transpositions. We will calculate its expectation and vari- • Mironov showed that there exists some positive ance for Random-to-Random Transpositions and check ex- constant c such that P [T > cn log n] 0 when perimentally that it is similar if the stopping rule is applied n . Author experimentally checked→ that P [T > to Cyclic-to-Random Transpositions. The SST is given in 2n→lg ∞n] < 1/n for n = 256 which corresponds to StoppingRuleKLZ algorithm. P [T > 4096] < 1/256. experimentally showed that E[T ] 11.16n Algorithm 6 StoppingRuleKLZ • 1.4n lg n 2857 (for n = 256) – which≈ translates≈ ≈ Input set of already marked cards 1,...,n , into: on average one needs to drop 2601 initial M ⊆{ } bytes. ≈ round r, Bits Output {YES,NO} Later Mosel, Peres and Sinclair [22] proved a matching lower bound establishing mixing time to be of order j =n-value(Bits) Θ(n log n). However, the constant was not determined. if there are less than (n 1)/2 marked cards then ⌈ − ⌉ One can apply the following approach: find T which is if both π[r] and π[j] are unmarked then a proper SST for both X(CTRT ) and X(RTRT ), calculate its mark card π[r] expectation for X(RTRT ) and show experimentally that the end if expectation is similar for X(CTRT ). This is the main part of else this Section resulting in a conjecture that the mixing time of if (π[r] is unmarked and π[j] is marked) OR (π[r] is X(CTRT ) is n log n. unmarked and r = j) then mark card π[r] 3.1 Mironov’s SST for Cyclic-to-Random Transposi- end if tions end if

Let us recall Mironov’s [21] construction of SST: if all cards are marked then “At the beginning all cards numbered STOP 0,...,n 2 are unchecked, the (n 1)th card − − else is checked. Whenever the shuffling algorithm CONTINUE exchanges two cards, S[i] and S[j], one of the two end if rules may apply before the swap takes place: a. If S[i] is unchecked and i = j, check S[i]. Lemma 2. The stopping rule StoppingRuleKLZ is SST for b. If S[i] is unchecked and S[j] is checked, X(CTRT ). check S[i]. Proof. First phase of the procedure (i.e., the case when there The event T happens when all cards become are less than (n 1)/2 cards marked) is a construction a checked. ” random permutation⌈ − of marked⌉ cards by placing unmarked Then the author proves that this is a SST for Cyclic-to- cards on randomly chosen unoccupied positions – this is Random Transpositions and shows that there exists constant actually the first part of Matthews’s [20] marking scheme. c (can be chosen less than 30) such that P r[T >cn log n] Second phase is simply a Broder’s construction. Theorem 9. 0 when n . Note that this marking scheme is also valid→ in [21] shows that this is a valid SST for Cyclic-to-Random for Random-to-Random→ ∞ Transpositions shuffling, for which Transpositions. Both phases combined produce a random we have: permutation of all cards.  ACCEPTED TO IEEE TRANSACTIONS ON DEPENDABLE AND SECURE COMPUTING, 2017 7

Remark 1. One important remark should be pointed. Full V ar[Td]= Matthews’s marking [20] scheme is an SST which is d 1 d 1 (n k)2 n (n x)2 − 1 p (k) − 1 −2 2 1 −2 4 “faster” than ours. However, although it is valid for − a = − n − n dx = n. 2 2 2 2 2 Random-to-Random Transpositions, it is not valid for (pa(k)) (n k) ≈ 0 (n x) 3 k=0 k=0 −2 Z −2 Cyclic-to-Random Transpositions. X X n n V ar[T T ]=     n − d Calculating ET or V arT seems to be a challenging task. But n 1 n 1 n 1 (n k)(k+1) note that marking scheme StoppingRuleKLZ also yields a − − 1 pb(k) − 1 − 2 V ar[Y ]= − = − n valid SST for Random-to-Random Transpositions. In next k 2 (n k)2(k+1)2 k=d k=d (pb(k)) k=d − n4 Lemma we calculate ET and V arT for this shuffle, later Xn 1 X X − n(n 1) + k(1 n)+ k2 we experimentally show that ET is very similar for both = n2 − − (n k)2(k + 1)2 marking schemes. k=d Xn 1 − 1 − n(n 1)+ k(1 n)+ k2 Lemma 3. Let T be the SST StoppingRuleKLZ applied to n2 − − ≈ · 2 (n k)2(k + 1)2 Random-to-Random Transpositions. Then we have k=0 2 X − 2 n 2 4 (2) π 2 E[T ] = nH + n + O(H ), Hn +3Hn n . n n ≈ 2 n − n2 ∼ 4    2 (3) V ar[T ] π n2, ∼ 4 Finally, f(k) 2 where f(k) g(k) means that limk =1. π ∼ →∞ g(k) V ar[T ]= V ar[T ]+ V ar[T T ] n2. n d n − d ∼ 4 Proof. Define Tk to be the first time when k cards are  marked (thus T Tn). Let d = (n 1)/2 . Then Td is the running time of the≡ first phase and⌈ (T− T⌉) is the running n − d time of the second phase. Denote Y := T T . k k+1 − k Assume that there are k < d marked cards at a certain 3.3 Experimental results step. Then the new card will be marked in next step if we The expectation of the following SSTs applied to Random- choose two unmarked cards what happens with probability: to-Random Transpositions shuffling scheme is known: 2 (n k) Mironov’s SST (used in [21]) it is 2nHn n pa(k)= − . • − n2 StoppingRuleKLZ SST it is n(H + 1) • n Thus Y is a geometric random variable with parameter k As shown, both marking schemes are valid for both, Cyclic- p (k) and a to-Random Transpositions and Random-to-Random Trans- E[Td]= positions shufflings. For both stopping rules applied to Cyclic-to-Random Transpositions no precise results on ex- d 1 d 1 d 1 n − − 1 − n2 1 = E[Y ]= = = n2 pected running times are known. Instead we estimated them k p (k) (n k)2 k2 via simulations, simply running 10.000 of them. The results k=0 k=0 a k=0 k=n d+1 X X X − X− are given in Fig. 1. 2 (2) (2) 2 1 1 = n Hn Hn d = n + O 2 = n + O(1). − − n n      StoppingRuleKLZ ST Now assume that there are k d cards marked at a certain Mironov’s step. Then, the new card will≥ be marked in next step with n = 256 1811 2854 probability: ET n = 512 3994 6442 n = 1024 (n k)(k + 1) 8705 14324 p (k)= − n = 256 111341 156814 b n2 V arT n = 512 438576 597783 and Yk is a geometric random variable with parameter n = 1024 1759162 2442503 pa(k). Thus: n(Hn + 1) 2nHn n E[T T ] − n − d n = 256 1823.83 2879.66 d 1 n 1 1 n 1 n2 n = 512 4002.05 6468.11 = − E[Y ]= − = − k=0 k k=d pb(k) k=d (n k)(k+1) − n = 1024 8713.39 14354.79 Pn2 n 1 1 P 1 nP2 n d 1 n 1 = n+1 k=−d n k + k+1 = n+1 k=1− k + k=d+1 k Fig. 1. Simulations’ results for Mironov’s and StoppingRuleKLZ stop- − ping rules. n2 P   n2 P n2 P  = (Hn d + Hn Hd)= Hn + (Hn d Hd) n+1 − − n+1 n+1 − − n n2 The conclusion is that they do not differ much. This = nHn Hn + (Hn d Hd) − n+1 n+1 − − suggest following: = nHn + O(Hn)+ O(1) = nHn + O(Hn). sep Conjecture 1. The mixing time τmix for Cyclic-to-Random Transpositions converges to n log n as n . 
For variance we have: → ∞ ACCEPTED TO IEEE TRANSACTIONS ON DEPENDABLE AND SECURE COMPUTING, 2017 8

3.4 Optimal shuffling 4 (NOT SO) RANDOM SHUFFLES OF RC4 – REVIS- Cyclic-to-Random Transpositions is the shuffle used in RC4 ITED and in Spritz. To reach stationarity (i.e., produce random permutation), as we have shown, one needs to perform RC4 algorithm. RC4 is a , its so-called internal O(n log n) steps. In each step we use a random number state is (S,i,j), where S is a permutation of [n] and i, j from interval [0,...,n 1], thus this shuffling requires are some two indices. As input it takes L byte message 2 − − O(n log n) random bits. m1,...,mL and a secret key K and returns One can ask the following question: Is this shuffling optimal c1,...,cL. The initial state is the output of KSA. Based on in terms of required bits? The answer is no. The entropy of this state PRGA is used to output bits which are XORed the uniform distribution on [n] is O(n log n) (since there are with the message. The actual KSA algorithm used in RC4 n! permutations of [n]), thus one could expect that optimal is presented in Figure 3 together with its idealized version shuffling would require this number of bits. Applying de- KSA∗ (where a secret key-based randomness is replaced scribed Riffle Shuffle yields SST with ET = 2 lg n, at each with pure randomness) step n random bits are used, thus this shuffling requires ∗ 2n lg n random bits, matching the requirement of optimal KSA KSA shuffle (up to a constant). for i := 0 to n − 1 do for i := 0 to n − 1 do S[i] := i In Figure 2 we summarized required number of bits for S[i] := i end for various algorithms and stopping rules. end for j := 0 # bits RC4 Mironov KSAKLZ RS for i := 0 to n − 1 do used 40 − 2048 for i := 0 to n − 1 do − j := j + S[i]+ K[i mod l] asympt. (2nHn 1) lg n n(Hn + 1) lg n 2n lg n j :=random(n) required 23 037 14 590 4 096 swap(S[i],S[j]) swap(S[i],S[j]) end for end for Fig. 2. Comparison between number of bits used by RC4 (40 to i, j := 0 2048) and required by mathematical models (Mironov [21] and ours – i, j := 0 KSAKLZ ) versus length of the key for the time-reversed riffle shuffle. ∗ Bits asymptotics approximates the number of fresh bits required by the Fig. 3. KSA of RC4 algorithm and its idealized version KSA . mathematical model (number of bits required by the underlying Markov chain to converge to stationary distribution). Bits required is (rounded) value of # bits asymptotics when n = 256. The goal of KSA of original RC4 is to produce a pseu- dorandom permutation of n = 256 cards. The original algorithm performs 256 steps of Cyclic-to-Random Trans- 3.5 Perfect shuffle positions. Even if we replace j := j + S[i]+ K[i mod l] Let X be an ergodic chain with a stationary distribution ψ. in KSA with j = random(n) (i.e., we actually consider Ergodicity means that nevertheless of initial distribution, the idealized version), then Θ(n log n) steps are needed. Con- distribution (Xk) converges, in some sense, to ψ as k sidering idealized version was a main idea of Mironov L → ∞ (it is enough to think of dT V ( (Xk), ψ) converging to 0 as [21]. Moreover, he showed that 2n log n steps are enough. L k ). Beside some trivial situations, the distribution of However, our result stated in Lemma 3 together with → ∞ (Xk) is never equal to ψ for any finite k. That is why experimental results from Section 3.3 imply that around L we need to study mixing time to know how close (Xk) n log n steps are enough. Moreover, if we consider KSA L to ψ is. 
However, there are methods for so-called perfect with Random-to-Random Transpositions (instead of Cyclic- simulations. This term refers to the art of converting a to-Random Transpositions) then we can state more formal Markov chain into the algorithm which (if stops) returns theorem in terms of introduced security definitions. exact (unbiased) sample from stationary distribution of the Theorem 1. Let be an adversary. Let K 0, 1 rn be a chain. The most famous one is the Coupling From The Past A ∈ { } secret key. Let (K) be KSA with Random-to-Random (CFTP) algorithm introduced in [27]. The ingenious idea is Transpositions shufflingS which runs for to realize the chain as evolving from the past. However, to apply algorithm effectively the chain must be so-called πn 1 r = n(Hn +1)+ realizable monotone, which is defined w.r.t some underlying 2 √n!ε partial ordering. Since we consider random walk on permu- 1 tations, applying CFTP seems to be a challenging task. steps with 0 <ε< n! . Then However, having Strong Stationary Time rule we can actually perform perfect simulation(!). We simply run the P r [ (S(K)) = 1] P r [ (R) = 1] ε. K←{0,1}rm R←U(S ) chain until event SST occurs. At this time, the Definition 2.2 A − n A ≤

implies, that the distribution of the chain is stationary. Proof . The stopping rule StoppingRuleKLZ given in Sec- It is worth mentioning that all SSTs mentioned at the tion 3.2 is an SST for Random-to-Random Transpositions, beginning of this Section are deterministic (i.e., they are denote it by T . From Lemma 3 we know that ET = deterministic functions of the path of the chain, they do not 2 n(H +1)+ O(H ) V arT π n2 use any extra randomness - it is not the case for general SST). n n and 4 . Inequality (1) and ∼ r = n(H +1)+ πn c For a method for perfect simulation based on SST (and other Chebyshev’s inequality imply that for n 2 we have methods) see [18], where mainly monotonicity requirements 1 for three perfect sampling algorithms are considered. sep( (Xr), ψ) P (T > r) , L ≤ ≤ c2 ACCEPTED TO IEEE TRANSACTIONS ON DEPENDABLE AND SECURE COMPUTING, 2017 9

sep 2 1 i.e., r = τ (c− ). Taking c = we perform Random- k +1 1 ǫ mix √n!ε sep − 3.89672 to-Random Transpositions for r = τmix(n!ε) steps, i.e., 0 .5671382998250798 .4328617001749202 2− 6.79344 sep( (Xr), ψ) n!ε Inequality (2) implies that P r(Xr = 256 .509015 .490985 2− L1 ≤ | 9.69016 σ) ε for any permutation σ and thus completes the 512 .5012105173235390 .4987894826764610 2− − n! |≤ 12.5869 proof.  768 .500163 .499837 2− 15.4836 1024 .5000218258757580 .4999781741242420 2− 27.0705 2048 .5000000070953368 .4999999929046632 2− 2 50.2442 4.1 Sign distinguisher for or Cyclic-to-Random Trans- 4096 .5000000000000007 .4999999999999993 − 2 96.5918 positions. 8192 .5 .5 −

Sign distinguisher. For a permutation π which has Fig. 5. The advantage (ǫ) of the Sign distinguisher of RC4 after n + k a representation of non-trivial transpositions π = steps or equivalently, after discarding initial k bytes. (a1b1)(a2b2) ... (ambm) the sign is defined as: sign(π) = ( 1)m. So the value of the sign is +1 whenever m is even For n = 256 (which corresponds to the value of n − used in RC4) and initial distribution being identity per- and is equal to 1 whenever m is odd. The distinguisher is 256 presented in Figure− 4. mutation after t = n = 256 steps one gets: v0 M256 = (0.567138, 0.432862). · In [19] it was suggested that the first 512 bytes of the output should be dropped. Fig. 4. Sign distinguisher for Cyclic-to-Random Transpositions Input: S,t 5 STRONG STATIONARY TIME BASED KSA ALGO- Output: b RITHMS if sign(S)=( 1)t − then return false In addition to KSA and KSA∗ (given in Fig. 3) we introduce else return true another version called KSAShuffle,ST∗∗ (n). The main differ- ence is that it does not run a predefined number of steps, but it takes some procedure ST as a parameter (as well It was observed in [21] that the sign of the permutation as Shuffle procedure). The idea is that it must be a rule at the end of RC4’s KSA algorithm is not uniform. And as corresponding to Strong Stationary Time for given shuffling a conclusion it was noticed that the number of discarded scheme. The KSAShuffle,ST∗∗ (n) is given in the following shuffles (by PRGA) must grow at least linearly in n. Below algorithm: we present this result obtained in a different way than

in [21], giving the exact formula for advantage at any step t. Algorithm 7 KSAShuffle,ST∗∗ (n) One can look at the sign-change process for the Cyclic-to- Require: Card shuffling Shuffle procedure, stopping rule Random Transpositions as follows: after the table is initial- ST which is a Strong Stationary Time for Shuffle. ized, sign of the permutation is +1 since it is identity so the initial distribution is concentrated in v0 = (P r(sign(Z0) = for i := 0 to n 1 do +1), P r(sign(Z )= 1)) = (1, 0). − 0 − S[i] := i Then in each step the sign is unchanged if and only if end for i = j which happens with probability 1/n. So the transition matrix Mn of a sign-change process induced by the shuffling while ( ST) do process is equal to: Shuffle¬ (S) end while 1 1 1 M := n n . n 1 1 −1  − n n  The algorithm works as follows. It starts with identity permutation. Then at each step it performs some card shuf- This conclusion corresponds to looking at the distribution of fling procedure Shuffle. Instead of running it for a pre- the sign-change process after t steps: v M t , where v is the 0 · n 0 defined number of steps, it runs until an event defined by a initial distribution. The eigenvalues and eigenvectors of Mn procedure ST occurs. The procedure ST is designed in such 2 n T T are (1, − ) and (1, 1) , ( 1, 1) respectively. The spectral n − a way that it guarantees that the event is a Strong Stationary decomposition yields Time. At each step the algorithm uses new randomness – one can think about that as of an idealized version but

t 1 1 when the length of a key is greater than the number of 1 1 1 0 2 2 t − − random bits required by the algorithm then we end up with v0 Mn = (1, 0)       · 1 1 0 2−n 1 1 a permutation which cannot be distinguished from a ran-    n   − 2 2  1 1 2 t 1 1 2 t dom (uniform) one (even by a computationally unbounded = + 1 , 1 .  2 2  n −  2 − 2  n −   adversary). Notational convention: in KSAShuffle,ST∗∗ (n) we omit param- In Fig. 5 the advantage ǫ of a sign-adversary after dropping eter n. Moreover, if Shuffle and ST are omitted it means k bytes of the output (so after n + k steps of the shuffle, for that we use Cyclic-to-Random Transpositions as shuffling the mathematical model) is presented. procedure and stopping rule is clear from the context. Note ACCEPTED TO IEEE TRANSACTIONS ON DEPENDABLE AND SECURE COMPUTING, 2017 10

that if we use for stopping rule ST “stop after n steps” various forms e.g., λ(k) = HW (k) – the Hamming weight (which of course is not SST), it is equivalent to RC4’s KSA∗. of k, or in the best case λ(k)= k. Given a shuffling procedure Shuffle one wants to The reason why a timing attack on implementation of a have a “fast” stopping rule ST (perfectly one wants an function F (y = F (k, x)) is possible is because the running optimal SST which is stochastically the smallest). The stop- time of F is not the same on each input. Usually timing ping rule ST is a parameter, since for a given shuffling attacks focus on a particular part of F – fweak which scheme one can come up with a better stopping rule(s). operates on a function hx of the original input x and on This is exactly the case with Cyclic-to-Random Transposi- a function hk of the original key (e.g., hk may be a substring tions and Random-to-Random Transpositions. In Section 3 of a round key). When F (k, x) = frest(k,fweak(k, x)) the we recalled Mironov’s [21] stopping rule as well as new statistical model is built and T (F (k, x)) = T (fweak(k, x)) + introduced “faster” rule called (called StoppingRuleKLZ). T (frest(k,fweak(k, x))). The distribution of T (fweak(k, x)) is conditioned on k and estimating this value (conditioned on subset of bits of k) is the goal of the adversary. Let us point out few important properties of KSAShuffle,ST∗∗ (n) algorithm: 6.1 Countermeasures – approach I P1 The resulting permutation is purely random. There is one common approach to eliminate possibility of P2 The information about the number of rounds that an timing attacks, namely to re-implement F (ct(F ) = Fct SST-algorithm performs does not reveal any infor- – constant-time implementation of F ) in such a way that for all pairs (k, x) = (k′, x′) the running times are the mation about the resulting permutation. 6 same T (F (k, x)) = T (F (k′, x′)) = t. Then, an adversary’s The property P1 is due to the fact that we actually perform observation: perfect simulation (see Section 3.5). The property P2 follows = m ,c ,t = m , F (k,m , r ),t from the fact that the Definition 5.1 is equivalent to the OT iming,ct(F ) h i i i h i ct i i i following one (see [4]): becomes the same as in the black-box model. Definition 5.1. Random variable T is Strong Stationary Time (SST) if it is a randomized stopping time for chain 6.2 Countermeasures – approach II X such that: The other approach is to still use a weak implementa- XT has distribution ψ and is independent of T. tion, but “blind” (mask) the adversary’s view. Let us start with a public-key example, described in the Section 1.1.1. The property P1 is the desired property of the output of The RSA’s fragile decryption F (d, c) = f (d, c) (Algo- the KSA algorithm, whereas P2 turns out to be crucial for weak rithm 1) is implemented with masking (Algorithm 8) as: our timing attacks applications (details in Section 6). Fmasked(d, c) = g2(fweak(d, g1(c, r)), r), where g1(c, r) = e 1 cr mod N and g2(x, r) = xr− mod N. In this case, the 6 TIMING ATTACKS AND SST adversary’s view is: In the black box model (for CPA-security experiment), an T iming,masked(F ) = ci, Fmasked(d, ci),ti . adversary has access to an oracle computing O h i F (k, ) for a secret key k. 
So the adversary is allowed to While ti are again different, they depend on both d and · obtain a vector of cryptograms (c1,...,cq) for plaintexts g1(ci, ri), but an adversary does not obtain information (m ,...,m ) of its choice (where each c = r ,y = about random ri’s and thous information about ti is useless. 1 i h i ii F (k,mi, ri)). Each ri is the randomness used by the oracle to compute encryption (e.g., its initial counter ctr for CTR- 6.3 Non-constant time implementation immune to tim- mode, IV for CBC-mode etc.). So the adversary has a vector ing attacks for symmetric encryption of tuples Let us use the similar approach for the symmetric encryp- = m ,c = m , F (k,m , r ) . tion case. For a time-attack susceptible function F we would OBlackBox,F h i ii h i i i i like to build a function G by designing functions g1( ) and In the real world, as a result of interacting with an · g2( ) and then to define implementation of F , a timing information may also be · accessible to an adversary. In such a case her knowledge G(k, x)= g2(F (k,g1(x, )), ). is a vector of tuples · · We restrict ourselves only to functions G of that form. We do = m ,c ,t = m , F (k,m , r ),t , not require that G(k, x)= F (k, x) as it was in the public key OT iming,F h i i ii h i i i ii case. It would be enough for us that there exists an efficient where t is the time to reply with c on query m and ran- i i i algorithm for decryption – this is satisfied if both g ,g have domness used r . (A very similar information is accessible 1 2 i efficiently computable inversions. to an adversary in chosen-ciphertext attacks.) Now let us consider the following approaches for select- An encryption scheme is susceptible to a timing attack ing g (and g ): when there exists an efficient function H which on input 1 2 H ( m1,c1,t1 ,..., mw,cw,tw ) outputs λ(k) (some func- 1) g1(x) – then such an approach is trivially broken tionh of a secreti keyhk). Dependingi on the scheme and on because an attacker knows exactly what is the input a particular implementation the leakage function can be of to F . ACCEPTED TO IEEE TRANSACTIONS ON DEPENDABLE AND SECURE COMPUTING, 2017 11

1 2) g1(x, r) for a randomly selected r generated during Algorithm 9 G− – timing attack immune decryption im- run of G. Assuming that r cannot be efficiently plementation based on timing attack susceptible symmetric 1 removed from F (k,g1(x, r)) via application of g2, decryption F − the ciphertext of G would need to include r but then 1 1 1 Require: y, F − ,g1− ,g2− , kf , kg this is exactly the same case as (1). 1 1: compute x1 = g2− (kg,y) 3) g1(k, x) – in that case an attacker might want to 1 2: compute x2 = F − (kf , x1) perform a timing attack on g1 – one would need to 1 3: return x = g1− (kg, x2) guarantee that g1 itself is immune to timing attacks. This approach is used in McBits [7] system, g1 is a constant-time implementation of Beneš permutation Resulting G has the following properties: network [6]. G1 Resulting permutation g1 is uniformly random, 4) g1(k, x, r) – again value r would need to be in- provided a long enough kg is applied (see prop- cluded in ciphertext generated by G but now for erty P1 from Section 5). an attacker this case would be no easier than when G2 For a given kg and for any x, x′ one obtains g1 depends solely on k (just as in the case 3). T (g1(kg, x)) = T (g1(kg , x′)) (and the same holds for g2) – while g1 is a result of SST, While g2 cannot remove randomness (unless F is triv- running it for the same input kg leads to the ially broken) and the randomness used in g1 needs to same running time. be included in the ciphertext, one needs g2 anyway. It is required for the protection in the case when attacker has G3 The running time of g1 (and of g2) does not reveal any information about the resulting per- access to the decryption oracle – then g2 plays the same mutation (see property P2). As a result, the ad- role as g1 plays for protecting against timing attacks during encryption. versary does not have influence on inputs to F . We suggest the following approach, let k = (kf , kg) The property G3 guarantees security of the outputs of where kf is a portion of the secret key used by F : g1 and g2 and that they cannot be derived from the running time but it does no guarantee that k will remain secure. 1) Let g (k , x),g (k , x) be permutations of n = x g 1 g 2 g It turns out that some bits of k may leak through the bits depending on a key. I.e., g (k , ) and g (k |, )| g 1 g · 2 f · information about the running time but: are permutations constructed based on kg and kf respectively, which are later on applied to x, a there are at least n! (where n is the number of bits of • sequence of n bits (similarly as in McBits scheme) x) different keys which have the same running time, 2) Both g ,g are the permutations obtained as a result the information which leaks from the running time 1 2 • of KSAShuffle,ST∗∗ algorithm – one has a freedom in does not let to obtain ANY information about the the choice of a shuffle algorithm. resulting permutation. For more details see Section 6.5. x 6.4 Implementation To implement the above construction, one needs

kg g1 (pseudo)random bits to generate a permutation g1. These bits can be obtained in the similar way one may produce a key for authenticated encryption: let k be a 128-bit AES key, calculate kg AES(k, 0), kg AES(k, 1), kf F and k AES(k, 2). Then AES←(k , IV ) is used← as a stream m ← g cipher to feed Riffle Shuffle until SST occurs, producing g1. The following bits in the same way produce g2. F is kg g2 e.g.,AES-128 in CBC or CTR mode, all having kf as a secret key. Then, km may be used to compute the MAC (this disallows an adversary to play with e.g., hamming weight g y of inputs of 1). A proof of concept implementation was prepared (as a student project: https://bitbucket.org/dmirecki/bearssl/). Fig. 6. Timing attack immune G obtained from timing attack susceptible This implementation first uses the AES key and riffle-shuffle F via compounding with SST-based shuffling algorithms g1,g2. to generate an array (permutation) which stores the permu- tation g1 (of size 128). Then it permutes bits of the buffer by performing 128-step loop (line 34 of the listing 1). Algorithm 8 G – timing attack immune encryption based on Fix n = 128 = 27. Any permutation of [n] can be timing attack susceptible symmetric encryption F represented as 7 steps of Riffle Shuffle. This can be exploited Require: x,F,g1,g2, kf , kg to speed up the application of the permutation: 1: compute y1 = g1(kg, x) 1) generate a permutation using Riffle Shuffle with 2: compute y2 = F (kf ,y1) SST (14 steps on average), say permutation σ was 3: return y = g2(kg ,y2) obtained. ACCEPTED TO IEEE TRANSACTIONS ON DEPENDABLE AND SECURE COMPUTING, 2017 12

Listing 1 A fragment of the (src/riffle.c) responsible for was achieved after n steps (the lower bound for the running applying a shuffle. time) – please keep in mind that the expected running time /src/riffle.c of this SST is O(n log n). Based on this knowledge it is easy 34 for (uint32_t i = 0; i < 128; i++){ 35 if (CHECK_BIT(buf[i >> 3], i & 7)) { to observe that there is only one possible value of kg[1] pro- 36 SET_BIT(temp_buf[permutation[i] >> 3], vided that SST lasted n steps – kg[1] = 111 = [8] [7]; kg[2] is 37 permutation[i] & 7); \ either 110 or 111 so kg[2] [8] [6] and kg[i] [n] [n i] for 38 } ∈ \ ∈ \ − 39 } each i =1,...,n. While some information about the key kg leaks from the fact that the SST stopped just after n steps, still no adversary can learn anything about the resulting permutation because every permutation is generated with 2) represent σ as 7 steps of some other Riffle Shuffle exactly the same probability. scheme, i.e., compute the corresponding 7 128 bits. We showed an extreme case when SST algorithm for 1 · Do the same for σ− . Top-To-Random runs the minimal number of steps n. Since 3) apply this 7-step Riffle Shuffle on input bits. kg[i] [n] [n i] for i = 1,...,n so the number of bits ∈ \ − k Then the last step may be implemented using efficient in- of entropy of the secret key g adversary learns each round i lg i n! structions: _pext_u64, _pdep_u64, _mm_popcnt_u64 is equal to . There are possible keys which result in running time n of SST. An adversary learns a fraction of n! or its 32-bit analogs. Consult [1] p. 4-270 – 4-273 Vol. 2B for n lg n n above instructions. out of 2 = n possible keys. Moreover it can predict a prefix of the kg with high probability. big ct student sst AES-128 CBC encrypt 170.78 29.06 11.46 37.63 6.5.2 Riffle Shuffle AES-128 CBC decrypt 185.23 44.28 11.48 37.65 Consider the algorithm KSARS,ST∗∗ with the shuffling proce- AES-128 CTR 180.94 55.60 16.83 37.52 dure corresponding to (time reversal of) Riffle Shuffle. Recall Fig. 7. Comparison of the efficiency of selected algorithms: big – the shuffling: at each step assign each card a random bit, classic, table-based AES implementation, ct – constant time implemen- then put all cards with bit assigned 0 to the top keeping the tation using bitslicing strategy, student – implementation of the sst- relative ordering. Sample execution is presented in Fig. 9. masking described in Section 6 with slow permutation application, sst Recall the construction of SST for this example: – estimated implementation which of the sst-masking which uses BMI2 n and SSE4.2 instructions. All results are in MB/s and were obtained by Consider, initially unmarked, all 2 pairs of cards; running speed tests of the BearSSL [25] (via the command: testspeed at each step mark pair (i, j) if cards i and j were  all on Intel i7-4712HQ with 16GB RAM). assigned different bits. Assume for simplicity that n is a power of 2. How many pairs can be marked in first step? For example, if there is 6.5 Timing attacks and KSA** only one bit equal to 0 and all other are equal to 1, then This section explains why bits of the secret key kg used to n 1 pairs are marked. It is easy to see that the maximal generate g ,g need to be distinct to the bits k which are − n 1 2 f number of pairs will be marked if exactly half (i.e., 2 ) cards used by F . This is a simple conclusion of properties P1 and were assign bit 1 and other half were assign bit 0. Then P2 (defined in Section 5). 
                      big      ct    student     sst
AES-128 CBC encrypt   170.78   29.06   11.46    37.63
AES-128 CBC decrypt   185.23   44.28   11.48    37.65
AES-128 CTR           180.94   55.60   16.83    37.52

Fig. 7. Comparison of the efficiency of selected algorithms (in MB/s): big – classic, table-based AES implementation; ct – constant-time implementation using a bitslicing strategy; student – implementation of the SST-masking described in Section 6 with slow permutation application; sst – estimated implementation of the SST-masking which uses BMI2 and SSE4.2 instructions. All results were obtained by running the speed tests of BearSSL [25] (via the command testspeed) on an Intel i7-4712HQ with 16 GB RAM.

6.5 Timing attacks and KSA**

This section explains why the bits of the secret key kg used to generate g1, g2 need to be distinct from the bits of kf which are used by F. This is a simple conclusion of properties P1 and P2 (defined in Section 5). The examples below are intended to give the reader better insight into the properties achieved by SST shuffling algorithms. As already mentioned, regardless of the duration of the SST algorithm an adversary can learn something about the key, but has no knowledge whatsoever about the resulting permutation. The examples also show how the choice of the shuffling algorithm influences the information about the key that leaks. In both cases we consider the extreme (and very unlikely) situation in which the algorithm runs for the smallest possible number of steps.

6.5.1 Top-to-random card shuffling

Consider the algorithm KSA**_{T2R,ST} with the shuffling procedure corresponding to the Top-To-Random card shuffle: put the card which is currently on top at the position j defined by the bits of the key kg in the current round. Recall the stopping rule ST, which is an SST for this shuffle: stop one step after the card which was initially at the bottom reaches the top of the deck. Before the start of the algorithm, mark the last card, i.e., card n (it is known [3] that the optimal SST for Top-To-Random initially marks the card which is second from the bottom).

In Figure 8 a sample execution of the algorithm is presented (for n = 8). It is an extreme situation in which the SST was achieved after n steps (the lower bound for the running time) – keep in mind that the expected running time of this SST is O(n log n). Based on this, it is easy to observe that there is only one possible value of kg[1] provided that the SST lasted exactly n steps: kg[1] = 111, i.e., kg[1] ∈ [8]\[7]; kg[2] is either 110 or 111, so kg[2] ∈ [8]\[6], and in general kg[i] ∈ [n]\[n−i] for each i = 1,...,n. While some information about the key kg leaks from the fact that the SST stopped just after n steps, still no adversary can learn anything about the resulting permutation, because every permutation is generated with exactly the same probability.

We showed an extreme case in which the SST algorithm for Top-To-Random runs the minimal number of steps n. Since kg[i] ∈ [n]\[n−i] for i = 1,...,n, the adversary learns lg n − lg i bits of the key in round i (equivalently, lg i bits of entropy of kg[i] remain). There are n! possible keys which result in a running time of n steps of the SST, so the adversary learns that the key is one of n! out of 2^{n lg n} = n^n possible keys. Moreover, it can predict a prefix of kg with high probability.

6.5.2 Riffle Shuffle

Consider the algorithm KSA**_{RS,ST} with the shuffling procedure corresponding to (the time reversal of) Riffle Shuffle. Recall the shuffle: at each step assign each card a random bit, then put all cards with assigned bit 0 to the top, keeping the relative ordering. A sample execution is presented in Fig. 9. Recall the construction of the SST for this example: consider, initially unmarked, all $\binom{n}{2}$ pairs of cards; at each step mark the pair (i, j) if cards i and j were assigned different bits.

Assume for simplicity that n is a power of 2. How many pairs can be marked in the first step? For example, if there is only one bit equal to 0 and all others are equal to 1, then n − 1 pairs are marked. It is easy to see that the maximal number of pairs is marked if exactly half of the cards (i.e., n/2) were assigned bit 1 and the other half bit 0; then n^2/4 pairs are marked (all pairs (i, j) with i < j such that cards i and j were assigned different bits). After such a step there are two groups of cards, the ones assigned 0 and the ones assigned 1, and no pair of cards within the same group is marked. Then again, to mark as many pairs as possible, exactly half of the cards in each group must be assigned bit 0 and the other half bit 1 – then n^2/8 new pairs are marked. Iterating this procedure, we see that the shortest execution takes lg n steps, after which all

n^2/4 + n^2/8 + n^2/16 + ... + n^2/2^{1+lg n} = $\binom{n}{2}$

pairs are marked. In Figure 9 such a sample execution is presented. There are exactly n! keys resulting in lg n steps of the SST (each resulting in a different permutation). Thus the adversary learns only that the key belongs to a tiny fraction of the possible keys: n! out of 2^{n lg n} = n^n.
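As a quick check of the sum above for n = 8 (the case illustrated in Fig. 9): the shortest execution takes lg 8 = 3 steps, and the numbers of newly marked pairs add up to all pairs,
$$\frac{8^2}{4} + \frac{8^2}{8} + \frac{8^2}{16} = 16 + 8 + 4 = 28 = \binom{8}{2}.$$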


Fig. 8. Example run of the Top-To-Random shuffle for n = 8, driven by a key kg consisting of 8 values from {1,...,8} (3 bits each, 24 bits in total). Conditioning on the number of steps of the SST (in this case 8) one can find out that, out of the 2^24 possible keys, only (exactly) 8! keys are possible (due to the fact that the SST stopped exactly after 8 steps), and every one of the 8! permutations is possible – moreover, each permutation occurs with exactly the same probability. The SST leaks bits of the key (e.g., kg[1] = 111) but does not leak any information about the produced permutation.
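For concreteness, here is a minimal C sketch of the Top-To-Random procedure with its SST stopping rule from Section 6.5.1 (the kind of run shown in Fig. 8). The function name, the position_source_t interface delivering lg n key bits per call, and the 0-based card labels are assumptions of this sketch, not the paper's reference code.

#include <stdint.h>
#include <stddef.h>

/* Source of key bits kg: each call consumes lg n bits of the key and returns
   an insertion position in {0,...,n-1} (interface is an assumption of this sketch). */
typedef uint32_t (*position_source_t)(void *ctx, uint32_t n);

/* Top-To-Random shuffle with the SST stopping rule of Section 6.5.1:
   repeatedly move the top card to a key-dependent position and stop one step
   after the card that started at the bottom has reached the top.
   deck[0] is the top; deck holds a permutation of {0,...,n-1}.  Returns the
   number of steps performed; the resulting permutation is exactly uniform. */
size_t t2r_sst_shuffle(uint32_t *deck, uint32_t n, position_source_t next_pos, void *ctx)
{
    uint32_t marked = deck[n - 1];          /* the card initially at the bottom */
    size_t steps = 0;
    int stop_after_this_step = 0;

    while (!stop_after_this_step) {
        stop_after_this_step = (deck[0] == marked);   /* "one step after" rule   */
        uint32_t top = deck[0];
        uint32_t j = next_pos(ctx, n);                /* position defined by kg  */
        for (uint32_t i = 0; i < j; i++)              /* cards above j move up   */
            deck[i] = deck[i + 1];
        deck[j] = top;
        steps++;
    }
    return steps;
}

Conditioned on the minimal running time steps == n, the positions returned in round i are forced into the top-most i slots counted from the bottom, which is exactly the key leakage discussed above, while the returned deck is still uniformly distributed.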


Fig. 9. Example run of (the time reversal of) Riffle Shuffle on n = 8 cards with key kg = [00001111, 00110011, 01010101, ...]. At each step the left column shows the current permutation, and the right column the currently assigned bits. See the description in the text.
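Analogously, a minimal C sketch of (the time reversal of) Riffle Shuffle with the pair-marking SST of Section 6.5.2, the procedure whose run is shown in Fig. 9. The bit_source_t interface (one key bit of kg per card per step), the function name and the quadratic bookkeeping of marked pairs are assumptions of this sketch; cards are labelled 0,...,n−1.

#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Source of key bits kg: one bit per call (interface is an assumption of this sketch). */
typedef int (*bit_source_t)(void *ctx);

/* Time reversal of Riffle Shuffle with the pair-marking SST:
   at each step every card gets a bit; cards assigned 0 move to the top, keeping
   relative order.  A pair of cards is marked once the two cards receive different
   bits in some step; stop when all C(n,2) pairs are marked.
   deck[0] is the top; deck holds a permutation of {0,...,n-1}.
   Returns the number of steps, or 0 on allocation failure. */
size_t riffle_sst_shuffle(uint32_t *deck, uint32_t n, bit_source_t next_bit, void *ctx)
{
    uint8_t *marked = calloc((size_t)n * n, 1);   /* marked[i*n + j], i < j */
    uint32_t *tmp = malloc(n * sizeof *tmp);
    int *bits = malloc(n * sizeof *bits);
    if (!marked || !tmp || !bits) { free(marked); free(tmp); free(bits); return 0; }

    size_t steps = 0, unmarked = (size_t)n * (n - 1) / 2;
    while (unmarked > 0) {
        for (uint32_t p = 0; p < n; p++)
            bits[deck[p]] = next_bit(ctx);         /* bit assigned to card deck[p] */
        /* mark pairs of cards that received different bits in this step */
        for (uint32_t i = 0; i < n; i++)
            for (uint32_t j = i + 1; j < n; j++)
                if (!marked[i * n + j] && bits[i] != bits[j]) {
                    marked[i * n + j] = 1;
                    unmarked--;
                }
        /* cards with bit 0 go to the top, then cards with bit 1, order preserved */
        uint32_t k = 0;
        for (uint32_t p = 0; p < n; p++) if (bits[deck[p]] == 0) tmp[k++] = deck[p];
        for (uint32_t p = 0; p < n; p++) if (bits[deck[p]] == 1) tmp[k++] = deck[p];
        memcpy(deck, tmp, n * sizeof *deck);
        steps++;
    }
    free(marked); free(tmp); free(bits);
    return steps;   /* the resulting permutation is exactly uniform */
}

Stopping exactly when the last pair gets marked is what makes the output distribution exactly uniform, regardless of how many steps were needed.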

6.5.3 Summary

As we have seen, the choice of the shuffle algorithm matters for KSA**. In either case the resulting permutation is not revealed at all, but some knowledge about the key can be obtained. In both presented examples an adversary who observed that the time to generate a permutation was minimal (for the given SST) learns that one out of n! keys was used (out of n^n possible keys). However, there is a huge difference between these two shuffling schemes:

• In Riffle Shuffle the adversary learns that in the first step one key out of $\binom{n}{n/2}$ possible keys (leading to the shortest running time) was used. For the second step there are exactly $\binom{n/2}{n/4}^2$ possibilities for each choice of the key made in the first step, and similarly for the next steps. This makes recovering any information about the key used infeasible.
• In Top-To-Random the situation of the adversary is quite different. It gets all lg n bits of the key used in the first step of the algorithm (provided the algorithm ran for n steps); in the i-th round it learns lg n − lg i bits, since kg[i] is narrowed down to i values.

Recall that these were the extreme and very unlikely cases: in both of them the probability of the shortest run is equal to n!/n^n (≈ 7.29 · 10^{−55} for n = 128).

Another advantage of Riffle Shuffle is that on average the number of required bits is equal to 2n lg n, which is only 2 times more than the minimal number of bits (as presented in the example). For Top-To-Random the average number of required bits is equal to n log n · lg n, which is log n times larger than the minimal number of bits. This follows from the fact that the variance of the SST for Riffle Shuffle is much smaller than the variance for Top-To-Random; a smaller variance means that less information can be gained from the running time. While the expected number of bits required by KSA** may seem high, for post-quantum algorithms like McBits [7] it is not an issue – McBits requires keys of size 2^16–2^20 bytes, whereas Top-To-Random needs around 12,000 bits and Riffle Shuffle needs on average 4096 bits to generate a uniform permutation for n = 256.

It is also worth mentioning that in real-life systems one could additionally mask the duration of the algorithm artificially: if we know the value of ET for a given SST, we can always run it for at least ET steps (similarly, to avoid extremely long executions we can allow at most ET + c · Var T steps, provided Var T is known).
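A minimal sketch of such duration masking, assuming the SST-based generator exposes a single-step function and that E[T] (and, if the cap is used, Var T) are known constants; the sst_step_t interface and the function name are assumptions of this sketch.

#include <stddef.h>

/* One step of an SST-based shuffling algorithm; returns nonzero once the
   strong stationary time has occurred (interface is an assumption). */
typedef int (*sst_step_t)(void *state);

/* Run the generator for at least min_steps ~ E[T] steps and at most
   max_steps ~ E[T] + c*VarT steps.  Extra iterations after the SST has fired do
   not change the (uniform) output distribution, they only flatten the observable
   running time; the cap, however, may stop the run before the SST occurs. */
size_t run_sst_masked(sst_step_t step, void *state, size_t min_steps, size_t max_steps)
{
    size_t t = 0;
    int done = 0;
    while ((!done && t < max_steps) || t < min_steps) {
        done |= step(state);   /* keep stepping even after the SST fired */
        t++;
    }
    return t;
}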
7 CONCLUSIONS

We presented the benefits of using Strong Stationary Times in cryptographic schemes (pseudo-random permutation generators). These algorithms have a "health check" built in and guarantee the best possible properties when it comes to the quality of randomness of the resulting permutation.

We showed how SST algorithms may be used to construct timing attack immune implementations out of timing attack susceptible ones. The SST-based masking leads to a slow-down comparable to that of constant-time implementations (but it is platform independent); see Figure 7 for an efficiency comparison.

We showed that algorithms using SST achieve better security guarantees than any algorithm which runs a predefined number of steps.

We also proved a better bound for the mixing time of the Cyclic-to-Random Transpositions shuffling process, which is used in RC4, and showed that different, more efficient shuffling methods (i.e., the time reversal of Riffle Shuffle) may be used as a KSA. This last observation shows that the gap between the mathematical model (4096 bits required) and reality (2048 bits allowed as the maximum key length of RC4) is not as big as previously thought (bound of 23,037 by Mironov [21]).

REFERENCES

[1] Intel 64 and IA-32 Architectures Software Developer's Manual, 2016.
[2] Martin R. Albrecht and Kenneth G. Paterson. Lucky Microseconds: A Timing Attack on Amazon's s2n Implementation of TLS. Advances in Cryptology – EUROCRYPT, 9665:622–643, 2016.
[3] David Aldous and Persi Diaconis. Shuffling cards and stopping times. American Mathematical Monthly, 93(5):333–348, 1986.
[4] David Aldous and Persi Diaconis. Strong Uniform Times and Finite Random Walks. Advances in Applied Mathematics, 97:69–97, 1987.
[5] Nadhem AlFardan, Daniel J. Bernstein, Kenneth G. Paterson, Bertram Poettering, and Jacob C. N. Schuldt. On the Security of RC4 in TLS. In Presented as part of the 22nd USENIX Security Symposium (USENIX Security 13), pages 305–320, Washington, D.C., 2013. USENIX.
[6] V. E. Beneš. Mathematical theory of connecting networks and telephone traffic. Academic Press, New York, 1965.
[7] Daniel J. Bernstein, Tung Chou, and Peter Schwabe. McBits: fast constant-time code-based cryptography. In Cryptographic Hardware and Embedded Systems – CHES 2013, 2013.
[8] David Brumley and Dan Boneh. Remote timing attacks are practical. Computer Networks, 48(5):701–716, 2005.
[9] David Chaum. Blind Signatures for Untraceable Payments. In Advances in Cryptology, pages 199–203. Springer US, Boston, MA, 1983.
[10] Jean-François Dhem, François Koeune, Philippe-Alexandre Leroux, Patrick Mestré, Jean-Jacques Quisquater, and Jean-Louis Willems. A Practical Implementation of the Timing Attack. Smart Card Research and Applications, Third International Conference, CARDIS '98, 1820:167–182, 1998.
[11] Persi Diaconis and Mehrdad Shahshahani. Generating a random permutation with random transpositions. Zeitschrift für Wahrscheinlichkeitstheorie und Verwandte Gebiete, 57(2):159–179, 1981.
[12] James Allen Fill. An interruptible algorithm for perfect sampling via Markov chains. The Annals of Applied Probability, 8(1):131–162, February 1998.
[13] S. Fluhrer, I. Mantin, and A. Shamir. Weaknesses in the key scheduling algorithm of RC4. Selected Areas in Cryptography, 2001.
[14] Scott R. Fluhrer and David A. McGrew. Statistical Analysis of the Alleged RC4 Keystream Generator. Fast Software Encryption, 7th International Workshop, pages 19–30, 2000.
[15] J. Golic. Linear Statistical Weakness of Alleged RC4 Keystream Generator. In Walter Fumy, editor, Advances in Cryptology – EUROCRYPT '97, volume 1233 of Lecture Notes in Computer Science. Springer Berlin Heidelberg, Berlin, Heidelberg, July 1997.
[16] Paul C. Kocher. Timing Attacks on Implementations of Diffie-Hellman, RSA, DSS, and Other Systems. pages 104–113. Springer, Berlin, Heidelberg, 1996.
[17] Michał Kulis, Paweł Lorek, and Filip Zagórski. Randomized stopping times and provably secure pseudorandom permutation generators. In Phan R.W., Yung M. (eds) Paradigms in Cryptology – Mycrypt 2016. Malicious and Exploratory Cryptology. Mycrypt 2016. Lecture Notes in Computer Science, vol. 10311, Springer, pages 145–167, 2017.
[18] Paweł Lorek and Piotr Markowski. Monotonicity requirements for efficient exact sampling with Markov chains. To appear in Markov Processes and Related Fields, 2017.
[19] Itsik Mantin and Adi Shamir. A Practical Attack on Broadcast RC4. Fast Software Encryption, 8th International Workshop, Yokohama, Japan, pages 152–164, 2001.
[20] Peter Matthews. A strong uniform time for random transpositions. J. Theoret. Probab., 1(4):411–423, 1988.
[21] Ilya Mironov. (Not So) Random Shuffles of RC4. Advances in Cryptology – CRYPTO 2002, 2002.
[22] Elchanan Mossel, Yuval Peres, and Alistair Sinclair. Shuffling by semi-random transpositions. Foundations of Computer Science, pages 572–581, 2004.
[23] Cesar Pereida García and Billy Bob Brumley. Constant-Time Callees with Variable-Time Callers. Technical report.
[24] Cesar Pereida García, Billy Bob Brumley, and Yuval Yarom. Make Sure DSA Signing Exponentiations Really are Constant-Time. pages 1639–1650, 2016.
[25] Thomas Pornin. BearSSL – Constant-Time Crypto.
[26] Thomas Pornin. BearSSL – Constant-Time Mul.
[27] James Gary Propp and David Bruce Wilson. Exact sampling with coupled Markov chains and applications to statistical mechanics. Random Structures and Algorithms, 9:223–252, 1996.
[28] Ronald L. Rivest and Jacob C. N. Schuldt. Spritz—a spongy RC4-like stream cipher and hash function. Technical report, 2014.
[29] Souradyuti Paul and Bart Preneel. A New Weakness in the RC4 Keystream Generator and an Approach to Improve the Security of the Cipher. In Bimal Roy and Willi Meier, editors, Fast Software Encryption, volume 3017 of Lecture Notes in Computer Science. Springer Berlin Heidelberg, Berlin, Heidelberg, 2004.
[30] Souradyuti Paul and Bart Preneel. A New Weakness in the RC4 Keystream Generator and an Approach to Improve the Security of the Cipher. In Bimal Roy and Willi Meier, editors, Fast Software Encryption, volume 3017 of Lecture Notes in Computer Science. Springer Berlin Heidelberg, Berlin, Heidelberg, 2004.
[31] Yuval Yarom, Daniel Genkin, and Nadia Heninger. CacheBleed: A Timing Attack on OpenSSL Constant-Time RSA. CHES, 2016.
[32] Bartosz Zoltak. VMPC One-Way Function and Stream Cipher. In Fast Software Encryption, pages 210–225, 2004.