Generating random permutations by coin-tossing: classical algorithms, new analysis and modern implementation

Axel Bacher†, Olivier Bodini†, Hsien-Kuei Hwang‡, and Tsung-Hsi Tsai‡

† LIPN, Université Paris 13, Villetaneuse, France    ‡ Institute of Statistical Science, Academia Sinica, Taipei, Taiwan

March 10, 2016

Abstract. Several simple, classical, little-known algorithms in the statistical literature for generating random permutations by coin-tossing are examined, analyzed and implemented. These algorithms are either asymptotically optimal or close to being so in terms of the expected number of times the random bits are generated. In addition to asymptotic approximations to the expected complexity, we also clarify the corresponding variances, as well as the asymptotic distributions. A brief comparative discussion with numerical computations in a multicore system is also given.

1 Introduction

Random permutations are indispensable in widespread applications ranging from cryptology to statistical testing, from data structures to experimental design, from data to Monte Carlo simulations, etc. Natural examples include the block cipher (Rivest, 1994), permutation tests (Berry et al., 2014) and the generalized association plots (Chen, 2002). Random permutations are also central in the framework of Boltzmann sampling for labeled combinatorial classes (Flajolet et al., 2007), where they intervene in the labeling process of samplers. Finding simple, efficient and cryptographically secure algorithms for generating large random permutations is then of vital importance in the modern perspective. We are concerned in this paper with several simple classical algorithms for generating random permutations (each with the same probability of being

This research was partially supported by the ANR-MOST Joint Project MetAConC under the Grants ANR 2015- BLAN-0204 and MOST 105-2923-E-001-001-MY4.

generated), some having remained little known in the statistical and computer science literature, and focus mostly on their stochastic behaviors for large samples; implementation issues are also discussed.

Algorithm Laisant-Lehmer: when Unif[1, n!] is available. (Here and throughout this paper Unif[a, b] represents the discrete uniform distribution over all integers in the interval [a, b].) The earliest method for generating a random permutation dates back to Laisant's work near the end of the 19th century (Laisant, 1888); it was later re-discovered in (Lehmer, 1960). It is based on the factorial representation of an integer

  k = c_1 (n−1)! + c_2 (n−2)! + ··· + c_{n−1} 1!    (0 ≤ k < n!),

where 0 ≤ c_j ≤ n − j for 1 ≤ j ≤ n − 1. A simple algorithm implementing this representation proceeds as follows; see (Devroye, 1986, p. 648) or (Robson, 1969). Let k = Unif[0, n! − 1]. The first element of the random permutation is c_1 + 1, which is then removed from {1, ..., n}. The next element of the random permutation will then be the (c_2 + 1)st of the n − 1 remaining elements. Continue this procedure until the last element is removed. A direct implementation of this algorithm results in a two-loop procedure; a simple one-loop procedure was devised in (Robson, 1969) and is shown below; see also (Plackett, 1968) and Devroye's book (Devroye, 1986, §XIII.1) for variants, implementation details and references.

Algorithm 1: LL(n, c)
  Input: c: an array with n elements
  Output: A random permutation on c
  begin
    u := Unif[1, n!]
    for i := n downto 2 do
      t := ⌊u/i⌋;  j := u − i·t + 1
      swap(c_i, c_j);  u := t

This algorithm is mostly useful when n is small, say less than 20, because n! grows very fast and the large-number arithmetics involved reduce its efficiency for large n. Also, the generation of the uniform distribution is better realized by the coin-tossing algorithms described in Section 2.
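The one-loop procedure can be rendered in Python as follows; this is our own illustrative sketch (the function names `decode` and `laisant_lehmer` are ours, not from the paper), relying on Python's built-in big integers for n!:

```python
import math
import random

def decode(u, c):
    """Robson's one-loop decoding: turn u, uniform on [1, n!], into a
    uniform random permutation of c via the factorial representation."""
    n = len(c)
    for i in range(n, 1, -1):
        t = u // i
        j = u - i * t + 1            # j is uniform on [1, i] (1-based)
        c[i - 1], c[j - 1] = c[j - 1], c[i - 1]
        u = t                        # remaining randomness for later steps
    return c

def laisant_lehmer(c):
    """Draw one big uniform integer and decode it into a permutation."""
    return decode(random.randint(1, math.factorial(len(c))), c)
```

For n = 3, the six values u = 1, ..., 6 decode to the six distinct permutations, which is exactly the uniformity property described above.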

Algorithm Fisher-Yates (FY): when Unif[1, n] is available. One of the simplest and most widely used algorithms (based on a given sequence of distinct numbers {c_1, ..., c_n}) for generating a random permutation is the Fisher-Yates or Knuth shuffle (in its modern form by Durstenfeld (Durstenfeld, 1964); see Wikipedia's page on the Fisher-Yates shuffle and (Devroye, 1986; Durstenfeld, 1964; Fisher and Yates, 1948; Knuth, 1998a) for more information). The algorithm starts by swapping c_n with a randomly chosen element in {c_1, ..., c_n} (each with the same probability of being selected), and then repeats the same procedure for c_{n−1}, ..., c_2. See also the recent book (Berry et al., 2014) or the survey paper (Ritter, 1991) for a more detailed account.

Algorithm 2: FY(n, c)
  Input: c: an array with n ≥ 2 elements
  Output: A random permutation on c
  begin
    for i := n downto 2 by −1 do
      j := Unif[1, i];  swap(c_i, c_j)

Such an algorithm seems to have it all: single loop, one-line description, constant extra storage, efficient

and easy to code. Yet it is not optimal in situations such as (i) when only a partial subset of the permuted elements is needed (see (Black and Rogaway, 2002; Brassard and Kannan, 1988)), (ii) when implemented on a non-array data structure such as a list (see (Ressler, 1992)), (iii) when numerical truncation errors are inherent (see (Kimble, 1989)), and (iv) when a parallel or distributed computing environment is available (see (Anderson, 1990; Langr et al., 2014)). On the other hand, at the memory-access level, a direct generation of the uniform results in a higher rate of cache misses (see (Andrés and Pérez, 2011)), making it less efficient than it seems, notably when n is very large; see also Section 6 for some implementation and simulation aspects. Finally, this algorithm is sequential in nature and the memory-conflict problem is subtle in parallel implementations; see (Waechter et al., 2011). Note that the implementation of this algorithm strongly relies on the availability of a uniform random variate generator, and its bit-complexity (number of random bits needed) is of linearithmic order (not linear); see (Lumbroso, 2013; Sandelius, 1962) and below for a detailed analysis.
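In Python the shuffle reads as follows (a standard textbook rendering of ours, not code from the paper; 0-based indices replace the 1-based pseudocode):

```python
import random

def fisher_yates(c):
    """Durstenfeld's in-place Fisher-Yates shuffle: for i = n, ..., 2,
    swap c_i with a uniformly chosen c_j, 1 <= j <= i (0-based below)."""
    for i in range(len(c) - 1, 0, -1):
        j = random.randint(0, i)     # uniform over positions 0..i
        c[i], c[j] = c[j], c[i]
    return c
```

Each of the n! permutations is produced with probability 1/n!, since the loop makes n − 1 independent uniform choices of sizes n, n − 1, ..., 2.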

From unbounded uniforms to bounded uniforms? Instead of relying on a uniform distribution with a very large range, our starting question was: can we generate random permutations by bounded uniform distributions (for example, by flipping unbiased coins)? There are at least two different ways to achieve this:

• Fisher-Yates type: simulate the uniform distribution used in the Fisher-Yates shuffle by coin-tossing, which can be realized either by von Neumann's rejection method (von Neumann, 1951) or by the Knuth-Yao algorithm (for generating a discrete distribution by unbiased coins; see (Devroye, 1986; Knuth and Yao, 1976)) and Section 2, and

• divide-and-conquer type: each element flips an unbiased coin and, depending on the outcome being head or tail, the elements are divided into two groups. Continue recursively the same procedure for each of the two groups. Then a random resampling is achieved by an inorder traversal of the corresponding digital tree; see the next section for details. This realization induces naturally a binary trie (Knuth, 1998b), which is closely related to a few other binomial splitting processes that will be briefly described below; see (Fuchs et al., 2014).

It turns out that exactly the same binomial splitting idea was already developed in the early 1960's in the statistical literature in (Rao, 1961) and independently in (Sandelius, 1962), and analyzed later in (Plackett, 1968). The papers by Rao and by Sandelius also propose other variants, which have their modern algorithmic interest per se. However, all these algorithms have remained little known not only in computer science but also in statistics (see (Berry et al., 2014; Devroye, 1986)), partly because they rely on tables of random digits instead of more modern computer-generated random bits, although the underlying principle remains the same. Since a complete and rigorous analysis of the bit-complexity of these algorithms remains open, for historical reasons and for completeness, we will provide a detailed analysis of the algorithms proposed in (Rao, 1961) and (Sandelius, 1962) (and partially analyzed in (Plackett, 1968)) and of two versions of Fisher-Yates

with different implementations of the underlying uniform Unif[0, n−1] by coin-tossing: one relying on von Neumann's rejection method (Devroye and Gravel, 2015; von Neumann, 1951) and the other on Knuth-Yao's tree method (Devroye, 1986; Knuth and Yao, 1976). As the ideas of these algorithms are very simple, it is no wonder that similar ideas also appeared in the computer science literature in different guises; see (Barker and Kelsey, 2007; Flajolet et al., 2011; Koo et al., 2014; Ressler, 1992) and the references therein. We will comment more on this in the next section. We describe in the next section the algorithms we will analyze in this paper. Then we give a complete probabilistic analysis of the number of random bits used by each of them. Implementation aspects and benchmarks are briefly discussed in the final section. Note that the Fisher-Yates shuffle and its variants for generating cyclic permutations have been analyzed in (Louchard et al., 2008; Mahmoud, 2003; Prodinger, 2002; Wilson, 2009), but their focus is on data movements rather than on bit-complexity.

2 Generating random permutations by coin-tossing

We describe in this section three algorithms for generating random permutations, assuming that a bounded uniform Unif[0, r−1] is available for some fixed integer r ≥ 2. The first algorithm relies on the divide-and-conquer strategy and was first proposed in (Rao, 1961) and independently in (Sandelius, 1962), so we will refer to it as Algorithm RS (Rao-Sandelius). The other two we study are of Fisher-Yates type but differ in the way they simulate Unif[0, n−1] by a bounded uniform Unif[0, r−1]: the first of these two simulates Unif[0, n−1] by a rejection procedure in the spirit of von Neumann (von Neumann, 1951) and was proposed and implemented in (Sandelius, 1962), named ORP (One-stage-Randomization Procedure) there, but for convenience we will refer to it as Algorithm FYvN (Fisher-Yates-von-Neumann); see also (Moses and Oakford, 1963). The other one relies on an optimized version of Lumbroso's implementation (Lumbroso, 2013) of Knuth-Yao's DDG-tree (discrete distribution generating tree) algorithm (Knuth and Yao, 1976), which will be referred to as Algorithm FYKY (Fisher-Yates-Knuth-Yao). See also (Devroye, 1986, Ch. XV) on the "bit model" and the more recent updates (Devroye, 2010; Devroye and Gravel, 2015). For simplicity of presentation and practical usefulness, we focus in what follows on the binary case r = 2. For convenience, let rand-bit denote the Bernoulli(1/2) random variable, which returns zero or one with equal probability.

2.1 Algorithm RS: divide-and-conquer

We describe Algorithm RS only in the binary case, assuming an unbiased coin is available. Since we will carry out a detailed analysis of this algorithm, we give its procedure in recursive form as follows. (For practical implementation, it is more efficient to remove the recursion by standard techniques; see Section 6.)

A sequence of distinct numbers {c_1, ..., c_n} is given.

1. Each c_i generates a rand-bit, independently of the others;
2. Group them according to the outcomes being 0 or 1, and arrange the groups in increasing order of the group labels.
3. For each group of cardinality κ:
   (a) if κ = 1, then stop;
   (b) if κ = 2, then generate a rand-bit b and reverse their relative order if b = 1;
   (c) if κ > 2, then repeat Steps 1–3 for each group.

Algorithm 3: RS(n, c)
  Input: c: a sequence with n elements
  Output: A random permutation on c
  begin
    if n ≤ 1 then
      return c
    if n = 2 then
      if rand-bit = 1 then
        return (c_2, c_1)
      else
        return (c_1, c_2)
    Let A_0 and A_1 be two empty arrays
    for i := 1 to n do
      add c_i into A_{rand-bit}
    return RS(|A_0|, A_0), RS(|A_1|, A_1)

As an illustrative example, we begin with the sequence {c_1, ..., c_6}. Assume that the flipped binary sequence is

  c_1 c_2 c_3 c_4 c_5 c_6
   1   0   1   1   0   0

Then we split the c_i's into the 0-group (c_2, c_5, c_6) and the 1-group (c_1, c_3, c_4), which can be written in the form (c_2 c_5 c_6)(c_1 c_3 c_4). As both groups have cardinality larger than two, we run the same coin-flipping process for both groups. Assume that further coin-flippings yield

  c_2 c_5 c_6         c_1 c_3 c_4
   0   0   1    and    0   1   0

respectively. Then we obtain (c_2 c_5) c_6 (c_1 c_4) c_3. If the two extra coin-flippings needed to permute the two subgroups of size two are 0 and 1, respectively, then we get the random permutation c_2 c_5 c_6 c_4 c_1 c_3. The splitting process of this algorithm is, up to the boundary conditions, essentially the same as constructing a random trie under the Bernoulli model or sorting by radixsort (see (Fuchs et al., 2014; Knuth, 1998b)), and was also briefly mentioned in (Flajolet et al., 2011). On the other hand, Ressler (Ressler, 1992) proposed an algorithm for randomly permuting a list structure using a similar divide-and-conquer idea, but performed in a rather different way.
To the best of our knowledge, except for these references, this simple algorithm seems to have remained unknown in the literature, and we believe that more attention should be paid to its practical usefulness and theoretical relevance.
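For concreteness, here is a compact Python sketch of the recursive procedure (our own rendering; the `bits` counter, which tallies coin flips, is an addition for experimentation and not part of the original pseudocode):

```python
import random

def rs(c, bits):
    """Rao-Sandelius shuffle, binary case.  Each element flips a fair coin;
    the 0-group and the 1-group are then shuffled recursively and
    concatenated.  bits is a one-element list accumulating coin flips."""
    n = len(c)
    if n <= 1:
        return list(c)
    if n == 2:                       # one flip decides the order
        bits[0] += 1
        return [c[1], c[0]] if random.getrandbits(1) else [c[0], c[1]]
    groups = ([], [])
    for x in c:                      # one flip per element
        bits[0] += 1
        groups[random.getrandbits(1)].append(x)
    return rs(groups[0], bits) + rs(groups[1], bits)
```

Averaging `bits[0]` over many runs gives an empirical estimate of the expected bit-complexity analyzed later in the paper.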

Essentially identical binomial splitting processes. In addition to the above connection to tries and radixsort, the splitting process of Algorithm RS is also reminiscent of the so-called initialization problem in distributed computing (or processor identity problem), where a unique identifier is to be assigned to each processor in some distributed computing environment; see (Nakano and

Olariu, 2000; Ravelomanana, 2007). Yet another context where exactly the same coin-tossing process is used to resolve conflicts is the tree algorithm (or CTM algorithm, named after Capetanakis, Tsybakov and Mikhailov) in multi-access channels; see (Massey, 1981; Wagner, 2009). For more references on binomial splitting processes, see (Fuchs et al., 2014). Nowadays it is well known that the stochastic behaviors of these structures can be understood through the study of the binomial recurrence

  f_n = g_n + 2^{-n} \sum_{0 \le k \le n} \binom{n}{k} (f_k + f_{n-k}),    (1)

with suitably given initial conditions. In almost all cases of interest, such a recurrence often gives rise to asymptotic approximations (for large n) that involve periodic oscillations with minute amplitudes (say, of order 10^{-5}), which may lead to inexact conjectures (see for example (Massey, 1981)) but can be well described by standard complex-analytic tools such as the Mellin transform (Flajolet et al., 1995) and the saddle-point method (Flajolet and Sedgewick, 2009) (or analytic de-Poissonization (Jacquet and Szpankowski, 1998)); see (Fuchs et al., 2014) and the references compiled there. From a historical perspective, such a clarification through analytic means was first worked out by Flajolet and his co-authors in the early 1980's; see again (Fuchs et al., 2014) for a brief account. However, the periodic oscillations had already been observed in the 1960's by Plackett in (Plackett, 1968), based on heuristic arguments and figures, which is all the more remarkable in view of the limited computer power at that time and of the proper normalization needed to visualize the fluctuations; see Figures 1 and 2 for the subtleties involved. Unlike Algorithm FY, Algorithm RS is more easily adapted to a distributed or parallel computing environment because the random bits needed can be generated simultaneously.
Furthermore, we will prove that the total number of random bits used is asymptotically optimal, namely, the expected complexity is asymptotic to n log₂ n + n F_RS(log₂ n) + O(1), where F_RS(t) is a periodic function of period 1 with very small amplitude (its oscillations do not exceed about 1.1 × 10^{-5}); see Figure 1. Another distinctive feature is that F_RS is very smooth (infinitely differentiable), differing from most other periodic functions arising in the analysis below. Note that the information-theoretic lower bound satisfies log₂ n! = n log₂ n − n/log 2 + O(log n). While the asymptotic optimality of such a simple algorithm was already discussed in detail in (Sandelius, 1962) and such an asymptotic pattern anticipated in (Plackett, 1968), the rigorous proof and the explicit characterization of the periodic function F_RS are new. Also, we show that the variance is relatively small (being of linear order with periodic fluctuations) and that the distribution is asymptotically normal.
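The lower bound log₂ n! = n log₂ n − n/log 2 + O(log n) is easy to check numerically; the short script below is our own illustration, using the log-Gamma function to evaluate log₂ n! without big integers:

```python
import math

def log2_factorial(n):
    """log2(n!) computed via lgamma, avoiding huge integers."""
    return math.lgamma(n + 1) / math.log(2)

# The gap between log2(n!) and n*log2(n) - n/log(2) grows only
# logarithmically in n (it is ~ (1/2)*log2(2*pi*n) by Stirling's formula).
for n in (10**3, 10**6):
    gap = log2_factorial(n) - (n * math.log2(n) - n / math.log(2))
    print(n, gap)
```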

2.2 Algorithms FYvN and FYKY

We describe in this subsection the two versions of Algorithm FY: FYvN and FYKY. Both algorithms follow the same loop as the Fisher-Yates shuffle and simulate successively the discrete uniform distributions Unif[1, n], ..., Unif[1, 2] by flipping unbiased coins. To simulate Unif[1, k], both

algorithms first generate ⌈log₂ k⌉ random bits. If these bits, when read as a binary representation, have a value less than k, then this value plus 1 is returned as the required random element; otherwise,

Algorithm FYvN rejects these bits and restarts the same procedure until finding a value < k. Algorithm FYKY, on the other hand, does not reject the flipped bits but uses the difference between this value and k as the "seed" of the next round and repeats the same procedure. We modified and optimized these two procedures so as to reduce the number of arithmetic operations; they differ only in the last line of the sampling procedure, as indicated below; see Algorithms 4 and 5.

Algorithm 4: FYvN(n, c)   [Algorithm 5: FYKY(n, c) is identical, except that it calls Knuth-Yao instead of von-Neumann]
  Input: c: an array with n elements
  Output: A random permutation on c
  begin
    for i := n downto 2 by −1 do
      j := von-Neumann(i) + 1    [resp. j := Knuth-Yao(i) + 1]
      swap(c_i, c_j)

Procedure von-Neumann(n)   [resp. Knuth-Yao(n)]
  Input: a positive integer n
  Output: Unif[0, n−1]
  begin
    u := 1; x := 0
    loop
      while u < n do
        u := 2u; x := 2x + rand-bit
      d := u − n
      if x ≥ d then
        return x − d
      else
        u := 1; x := 0    [resp. u := d, keeping x]

The two procedures differ only in the last line: von-Neumann discards all the bits and restarts from (u, x) = (1, 0), while Knuth-Yao recycles the rejected value x (which is uniform on [0, d − 1]) by continuing with u := d.
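Both procedures are easy to express in Python; the sketch below is ours (the names `von_neumann`, `knuth_yao` and `fyky` are ours), following the pseudocode of Algorithms 4 and 5 with `random.getrandbits(1)` playing the role of rand-bit:

```python
import random

def von_neumann(n):
    """Rejection ("simple discard"): flip ceil(log2 n) bits; if the value x
    falls among the top n of the u possible outcomes, return it shifted to
    [0, n-1]; otherwise discard all the bits and start over."""
    while True:
        u, x = 1, 0
        while u < n:
            u, x = 2 * u, 2 * x + random.getrandbits(1)
        d = u - n
        if x >= d:
            return x - d             # Unif[0, n-1]

def knuth_yao(n):
    """Same acceptance test, but a rejected x (uniform on [0, d-1]) is kept
    as the seed of the next round, so no random bits are wasted."""
    u, x = 1, 0
    while True:
        while u < n:
            u, x = 2 * u, 2 * x + random.getrandbits(1)
        d = u - n
        if x >= d:
            return x - d             # Unif[0, n-1]
        u = d                        # recycle: x stays uniform on [0, u-1]

def fyky(c):
    """Fisher-Yates loop driven by knuth_yao (Algorithm FYKY)."""
    for i in range(len(c), 1, -1):
        j = knuth_yao(i)             # Unif[0, i-1]
        c[i - 1], c[j] = c[j], c[i - 1]
    return c
```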

Note that both algorithms are identical when n = 2^k and when n = 3, for which the algorithm parameters evolve as in the following diagram.

n = 3:  (u, x) = (1, 0) → (2, 0) or (2, 1) → (4, 0), (4, 1), (4, 2), (4, 3), with d = 1;
        (4, 3): return x − d = 2;  (4, 2): return x − d = 1;  (4, 1): return x − d = 0;
        (4, 0): rejected, restart from (1, 0) (recursive).

While the difference between the two algorithms at such a pseudo-code level is minor, we show that the asymptotic behaviors of their bit-complexities for generating a random permutation of n elements differ significantly:

Algorithm | Mean                                   | Variance            | Method
----------|----------------------------------------|---------------------|-----------
RS        | n log₂ n + n F_RS(·)                   | n G_RS(·)           | Analytic
FYvN      | n (log n) F_vN^[1](·) + n F_vN^[2](·)  | n (log n)² G_vN(·)  | Elementary
FYKY      | n log₂ n + n F_KY(·)                   | n G_KY(·)           | Analytic

Here the F.(·) and G.(·) are all bounded, continuous periodic functions of the parameter log₂ n. We see that the minor difference in Algorithm FYvN results not only in a higher mean but also in a larger variance, making FYvN less competitive in modern practical applications, although it was used, for example, by Moses and Oakford to produce tables of random permutations (Moses and Oakford, 1963). Also, the procedure von-Neumann in Algorithm 4, being one of the simplest and most natural ideas for simulating a uniform by coin-tossing, was independently proposed under different names in the literature; see, for example, (Granboulan and Pornin, 2007; Koo et al., 2014); in particular, it is called the "Simple Discard Method" in NIST's (Barker and Kelsey, 2007) "Recommendation for random number generation using deterministic random bit generators." Thus we also include the analysis of FYvN in this paper although it is less efficient in bit-complexity. The mean and the variance of Algorithm FYvN were already derived in (Plackett, 1968), but only when n = 2^k. In addition to this approximation, we will also show that the variance is of the less common higher order n (log n)², and that the distribution remains asymptotically normal.

2.3 Outline of this paper

We focus in this paper on a detailed probabilistic analysis of the bit-complexity of the three algorithms RS, FYvN and FYKY. Indeed, in all three cases we will establish a very strong local limit theorem for the bit-complexity of the form (although the variances are not of the same order)

  P\bigl(W_n = E(W_n) + \lfloor x \sqrt{V(W_n)} \rfloor\bigr) = \frac{e^{-x^2/2}}{\sqrt{2\pi V(W_n)}} \Bigl(1 + O\Bigl(\frac{1 + |x|^3}{\sqrt{n}}\Bigr)\Bigr),

uniformly for x = o(n^{1/6}), where W_n represents the bit-complexity of any of the three algorithms {RS, FYvN, FYKY}. Our method of proof is mostly analytic, relying on a proper use of generating functions (including characteristic functions) and standard complex-analytic techniques (see (Flajolet and Sedgewick, 2009)). The diverse uniform estimates needed for the characteristic functions constitute the hard part of our proofs. The same method can be extended to clarify finer probabilities of moderate deviations, but for simplicity we content ourselves with the above result in the central range. We also implemented these algorithms and tested their efficiency in terms of running time. The simulation results are given in the last section. Briefly, Algorithm FYKY is recommended when n is not very large, say n ≤ 10⁷, and Algorithm RS performs better for larger n or when a multicore system is available. Finally, our analysis and simulations also suggest that the "Simple Discard Method" recommended in NIST's (Barker and Kelsey, 2007) "Recommendation for random number generation" is better replaced by the procedure Knuth-Yao in Algorithm 5, whose expected optimality (in bit-complexity) was established in (Horibe, 1981).

3 The bit-complexity of Algorithm RS

We consider the total number X_n of times the random variable rand-bit is used in Algorithm RS for generating a random permutation of n elements. We will derive precise asymptotic approximations to the mean, the variance and the distribution by applying the approaches developed in our previous papers (Fuchs et al., 2014; Hwang, 2003; Hwang et al., 2010).

Recurrences and generating functions. By construction, X_n satisfies the distributional recurrence

  X_n \stackrel{d}{=} X_{I_n} + X^{*}_{n - I_n} + n    (n ≥ 3),

the first term on the right-hand side accounting for the 1-group and the second for the 0-group,

with the initial conditions X_0 = X_1 = 0 and X_2 = 1, where I_n denotes the binomial distribution with parameters n and 1/2. Here the (X^{*}_n)'s are independent copies of the (X_n)'s and are independent of I_n. This random variable is, up to initial conditions, identical to the external path length of random tries constructed from n random binary strings. It may also be interpreted in many different ways; see (Fuchs et al., 2014; Knuth, 1998b) and the references therein. In terms of the moment generating function P_n(t) := E(e^{X_n t}), we have the recurrence

  P_n(t) = e^{nt}\, 2^{-n} \sum_{0 \le k \le n} \binom{n}{k} P_k(t) P_{n-k}(t)    (n ≥ 3),    (2)

with P_0(t) = P_1(t) = 1 and P_2(t) = e^t. From these relations, we see that the bivariate Poisson generating function P̃(z, t) := e^{-z} \sum_{n \ge 0} P_n(t) z^n/n! satisfies the functional equation

  P̃(z, t) = e^{(e^t - 1) z}\, P̃\bigl(\tfrac12 e^t z, t\bigr)^2 + Q̃(z, t),    (3)

where

  Q̃(z, t) := (1 - e^t)\, z e^{-z} \bigl(1 + \tfrac14 e^t (e^t + 2) z\bigr).

Let now f̃_m(z) := m! [t^m] P̃(z, t) = e^{-z} \sum_{n \ge 0} E(X_n^m) z^n/n! denote the Poisson generating function of the mth moment of X_n. From (3), we see that

  f̃_1(z) = 2 f̃_1(z/2) + g̃_1(z),
  f̃_2(z) = 2 f̃_2(z/2) + g̃_2(z),

with g̃_1(0) = g̃_2(0) = 0, where

  g̃_1(z) = z - z e^{-z} \bigl(1 + \tfrac34 z\bigr),
  g̃_2(z) = 2 f̃_1(z/2)^2 + 4 z f̃_1(z/2) + 2 z f̃_1'(z/2) + z^2 + z - z e^{-z} \bigl(1 + \tfrac{11}{4} z\bigr).    (4)

Mean value. From the recurrence (2), we see that the mean μ_n := E(X_n) can be computed recursively by

  μ_n = n + 2^{1-n} \sum_{0 \le k \le n} \binom{n}{k} μ_k    (n ≥ 3),

with μ_0 = μ_1 = 0 and μ_2 = 1. Let H_n := \sum_{1 \le j \le n} j^{-1} denote the harmonic numbers and γ denote Euler's constant.

Theorem 1. The expected number μ_n of random bits used by Algorithm RS for generating a random permutation of n elements satisfies the identity

  \frac{μ_n}{n} = \frac{H_{n-1}}{\log 2} + \frac12 - \frac{3}{4 \log 2} - \frac{1}{\log 2} \sum_{k \in \mathbb{Z} \setminus \{0\}} \Gamma(\chi_k) \Bigl(1 + \frac34 \chi_k\Bigr) \frac{\Gamma(n)}{\Gamma(n + \chi_k)},    (5)

for n ≥ 3, where Γ is the Gamma function and χ_k := 2kπi/log 2. Asymptotically, μ_n satisfies

  μ_n = n \log_2 n + n F_{RS}(\log_2 n) + O(1),    (6)

where F_RS(t) is a periodic function of period 1 whose Fourier series expansion is given by

  F_{RS}(t) = \frac{\gamma}{\log 2} + \frac12 - \frac{3}{4 \log 2} - \frac{1}{\log 2} \sum_{k \in \mathbb{Z} \setminus \{0\}} \Gamma(\chi_k) \Bigl(1 + \frac34 \chi_k\Bigr) e^{-2k\pi i t},

the Fourier series being absolutely convergent.

Proof. To derive a more effective asymptotic approximation to μ_n, we begin with the expansion

  g̃_1(z) = \sum_{j \ge 2} \frac{(-1)^{j-1} (3j - 7)}{4 (j-1)!}\, z^j.

We then see that the sequence μ̃_n := n! [z^n] f̃_1(z), where [z^n] f(z) denotes the coefficient of z^n in the Taylor expansion of f, satisfies

n Œz g1.z/ n Q .n 2/: Q D 1 21 n > n z It follows, by Cauchy convolution, that the coefficient n n!Œz e fQ1.z/ has the closed-form expression WD  à X n k k 3k 7 n . 1/ 1 k .n > 1/; D k 1 2  4 26k6n which, by standard integral representation for finite differences (see (Flajolet and Sedgewick, 1995)), can be expressed as

  \frac{μ_n}{n} = -\frac{1}{2\pi i} \int_{-\frac12 - i\infty}^{-\frac12 + i\infty} \frac{\Gamma(n)\, \Gamma(s)}{\Gamma(n + s)\, (1 - 2^{s})} \Bigl(1 + \frac34 s\Bigr)\, ds,

where the integral path is the vertical line ℜ(s) = −1/2. By moving the line of integration to the right and by collecting the residues at the poles s = χ_k (k ∈ ℤ), we obtain

  \frac{μ_n}{n} = \frac{H_{n-1}}{\log 2} + \frac12 - \frac{3}{4 \log 2} - \frac{1}{\log 2} \sum_{k \in \mathbb{Z} \setminus \{0\}} \Gamma(\chi_k) \Bigl(1 + \frac34 \chi_k\Bigr) \frac{\Gamma(n)}{\Gamma(n + \chi_k)} + R_n,

where

  R_n := -\frac{1}{2\pi i} \int_{\frac12 - i\infty}^{\frac12 + i\infty} \frac{\Gamma(n)\, \Gamma(s)}{\Gamma(n + s)\, (1 - 2^{s})} \Bigl(1 + \frac34 s\Bigr)\, ds.

Since there is no other singularity lying to the right of the imaginary axis, we first deform the integration path into a large half-circle to the right, and then prove that the integral tends to zero as the radius of the circle tends to infinity. In this way, we deduce that R_n ≡ 0 for n ≥ 3, proving the identity (5). The asymptotic approximation (6) then follows from the asymptotic expansion for the ratio of Gamma functions (see (Erdélyi et al., 1953, §1.18))

  \frac{\Gamma(n)}{\Gamma(n + \chi_k)} = n^{-\chi_k} \Bigl(1 - \frac{\chi_k (\chi_k - 1)}{2n} + O\Bigl(\frac{|\chi_k|^4}{n^2}\Bigr)\Bigr)    (k ≠ 0).

Indeed, the O(1)-term in (6) can be further refined by this expansion, and be replaced by

1  3 à X k 1 €.1 k /. k 1/ 1 k n O n ; 2 log 2 C C 4 C k Z 2 the series on the right-hand side defining another bounded periodic function. Finally, the Fourier series is absolutely convergent because of the estimate (see (Erdelyi´ et al., 1953, ~1.18))

  |\Gamma(c + it)| = O\bigl(|t|^{c - \frac12}\, e^{-\frac{\pi}{2} |t|}\bigr),    (7)

for large |t| and bounded c. Indeed, F_RS(t) is infinitely differentiable for ℜ(t) ≥ 0.

Periodic fluctuations of μ_n. Due to the small amplitude of variation of F_RS(t), the periodic oscillations are invisible if one naively plots μ_n/n − log₂ n for increasing values of n as approximations to F_RS(t) (see Figure 1). Also note that the mean value of F_RS equals numerically

  \frac{\gamma}{\log 2} + \frac12 - \frac{3}{4 \log 2} = 0.25072\,48966\,10144\ldots,    (8)

which is larger than the corresponding coefficient −1/log 2 ≈ −1.44 in the linear term of the information-theoretic lower bound.
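The numerical value (8) can be checked directly from the recurrence for μ_n (a short verification script of ours; since the k = n term of the sum involves μ_n itself, the recurrence is first solved for μ_n):

```python
import math

def rs_means(N):
    """Exact means mu_n = E(X_n) from the recurrence
    mu_n = n + 2^(1-n) * sum_k binom(n,k) * mu_k  (n >= 3),
    with mu_0 = mu_1 = 0 and mu_2 = 1."""
    mu = [0.0, 0.0, 1.0]
    for n in range(3, N + 1):
        s = sum(math.comb(n, k) * mu[k] for k in range(1, n)) * 2.0 ** (1 - n)
        mu.append((n + s) / (1.0 - 2.0 ** (1 - n)))  # solve for mu_n
    return mu

# mu_n / n - log2(n) should hover around 0.2507..., the mean value of F_RS,
# with oscillations far too small to see at this scale.
mu = rs_means(256)
print(mu[256] / 256 - 8)
```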

Figure 1: Periodic fluctuations of μ_n for n = 16 to 1024 in log-scale: μ_n/n − log₂ n (first from left), μ_n/n − log₂ n − 1/2 (second), μ_n/n − H_{n−1}/log 2 (third), and F_RS(t) for t ∈ [0, 1] (fourth).

Variance. We prove that the variance is asymptotically linear with periodic oscillations. The expressions involved are very complicated, reflecting the complexity of the underlying asymptotic problem.

Theorem 2. The variance of Xn satisfies

  V(X_n) = n\, G_{RS}(\log_2 n) + O(1),

where G_RS(t) is a periodic function of period 1 whose Fourier series is given by

  G_{RS}(t) = \frac{1}{\log 2} \sum_{k \in \mathbb{Z}} \tilde g^{*}(-1 - \chi_k)\, e^{2k\pi i t}.

The Fourier series is absolutely convergent (and infinitely differentiable). An explicit expression for the function g̃*(s) is given as follows.

  \tilde g^{*}(s) = \Gamma(s+1) \Bigl( (3s+5) \sum_{k \ge 1} \bigl(1 - (1 + 2^{-k})^{-s-1}\bigr) + \tilde h(s) + \frac{s+5}{4} + \frac{(s+2)(9s^3 + 66s^2 + 163s + 362)}{2^{s+9}} \Bigr)    (ℜ(s) > −2),    (9)

where

  \tilde h(s) := \sum_{k \ge 1} \frac{2^{-k}}{(1 + 2^{-k})^{s+5}} \Bigl( (3s^3 + 34s^2 + 41s + 6) + (9s^4 + 87s^3 + 317s^2 + 333s + 30)\, 2^{-k} + (3s^3 + 22s^2 + 141s + 170)\, 2^{-2k} + (3s^2 + 37s + 50)\, 2^{-3k} + (3s+5)^2\, 2^{-4k} \Bigr).

Proof. For the variance, we consider, as in (Fuchs et al., 2014), the corrected Poissonized variance

  Ṽ(z) := f̃_2(z) - f̃_1(z)^2 - z f̃_1'(z)^2.

Then, by (4),

  Ṽ(z) = 2 Ṽ(z/2) + g̃(z),

where

  g̃(z) = e^{-z} \Bigl( z (3z + 4)\, f̃_1(z/2) + \tfrac12 z (3z^2 + 2z - 4)\, f̃_1'(z/2) + z^2 + \tfrac14 z - \tfrac{1}{16} z (z+1)(9z^3 + 12z^2 - 16z - 16)\, e^{-z} \Bigr),    (10)

which is exponentially small for large |z| in the half-plane ℜ(z) > 0. Indeed,

  g̃(z) = O\bigl(e^{-ℜ(z)}\, |z|^3 \log |z|\bigr)    (|z| → ∞; ℜ(z) > 0).    (11)

We follow the same method of proof developed in (Fuchs et al., 2014) and need to compute the Mellin transform of g̃(z), which exists in the half-plane ℜ(s) > −2 because g̃(z) = O(|z|^2) as |z| → 0. Now

  f̃_1(z) = \sum_{k \ge 0} 2^k g̃_1(z/2^k).

Thus

  \tilde g^{*}(s) := \int_0^\infty g̃(z)\, z^{s-1}\, dz = Λ_1(s) + Λ_2(s) + Λ_3(s),

where

  Λ_1(s) := \int_0^\infty z^s e^{-z} (3z + 4)\, f̃_1(z/2)\, dz,
  Λ_2(s) := \tfrac12 \int_0^\infty z^s e^{-z} (3z^2 + 2z - 4)\, f̃_1'(z/2)\, dz,
  Λ_3(s) := \int_0^\infty z^s e^{-z} \Bigl( z + \tfrac14 - \tfrac{1}{16} (z+1)(9z^3 + 12z^2 - 16z - 16)\, e^{-z} \Bigr) dz.

First, for ℜ(s) > −2,

  Λ_3(s) = \Gamma(s+1) \Bigl( \frac{s+5}{4} + 2^{-s-9} (s+2)(9s^3 + 66s^2 + 163s + 362) \Bigr).

Note that Λ_3(s) has no singularity at s = −1 (indeed, Λ_3(−1) = 125/128 + log 2). On the other hand, by an integration by parts,

  Λ_1(s) + Λ_2(s) = \int_0^\infty z^{s-1} e^{-z} f̃_1(z/2) \bigl(3z^3 + (3s+11) z^2 + 2(s+3) z - 4s\bigr)\, dz
                  = \sum_{k \ge 1} 2^{k-1} \int_0^\infty z^{s-1} e^{-z} g̃_1(z/2^k) \bigl(3z^3 + (3s+11) z^2 + 2(s+3) z - 4s\bigr)\, dz
                  = \Gamma(s+1) \Bigl( (3s+5) \sum_{k \ge 1} \bigl(1 - (1 + 2^{-k})^{-s-1}\bigr) + \tilde h(s) \Bigr),

which can be continued into the half-plane ℜ(s) > −2 and leads then to (9). Note that Λ_1(−1) + Λ_2(−1) = 37/32 + log 2. Also, by (7), |\tilde g^{*}(c + it)| = O(|t|^{c + \frac72} e^{-\frac{\pi}{2}|t|}) for large |t| and c > −2. Thus the Fourier series expansion for G_RS(t) is absolutely convergent. By the same Poisson-Charlier approach used in (Fuchs et al., 2014), we see that

  V(X_n) = Ṽ(n) - \frac{n}{2} Ṽ''(n) - \frac{n}{2} f̃_1''(n)^2 + O\bigl(n^{-1}\bigr),

where Ṽ(n) = O(n) and the two correction terms are O(1); the O-terms can be made more precise by Mellin transform techniques (see (Flajolet et al., 1995)) as follows. First, by moving the line of integration to the right and collecting all residues encountered, we deduce that (χ_k := 2kπi/log 2)

  Ṽ(n) = \frac{1}{2\pi i} \int_{-\frac32 - i\infty}^{-\frac32 + i\infty} n^{-s}\, \frac{\tilde g^{*}(s)}{1 - 2^{s+1}}\, ds
        = \frac{1}{\log 2} \sum_{k \in \mathbb{Z}} \tilde g^{*}(-1 - \chi_k)\, n^{1 + \chi_k} + \frac{1}{2\pi i} \int_{\frac12 - i\infty}^{\frac12 + i\infty} n^{-s}\, \frac{\tilde g^{*}(s)}{1 - 2^{s+1}}\, ds
        = n\, G_{RS}(\log_2 n) - \sum_{k \ge 1} 2^{-k}\, g̃\bigl(2^k n\bigr),

which is not only an asymptotic expansion but also an identity for n ≥ 1. Here G_RS(t) is a 1-periodic function with small amplitude, and the series over k represents exponentially small terms; see (11). Similarly,

  n Ṽ''(n) = \frac{1}{\log 2} \sum_{k \in \mathbb{Z}} \chi_k (\chi_k + 1)\, \tilde g^{*}(-1 - \chi_k)\, n^{\chi_k} - n \sum_{k \ge 1} 2^{k}\, g̃''\bigl(2^k n\bigr),

the first series being bounded and the second exponentially small for large n. In particular, the mean value of the periodic function G_RS is given by

  \frac{\tilde g^{*}(-1)}{\log 2} = 1.82994\,99550\,89434\,82695\,96208\,44\ldots,    (12)

in accordance with the numerical calculations; see Figure 2.

Asymptotic normality. By applying either the contraction method (see (Neininger and Rüschendorf, 2004)) or the refined method of moments (see (Hwang, 2003)), we can establish the convergence in distribution of the centered and normalized random variables (X_n − μ_n)/σ_n to the standard normal distribution, where μ_n := E(X_n) and σ_n² := V(X_n). The latter approach is also useful in providing stronger results such as the following.

Figure 2: A plot (right) of (V(X_n) − c_0)/n for n from 12 to 256 in logarithmic scale, where c_0 = 1/(2 (log 2)^2) is the mean value of the second-order term (another periodic function). Without this correction term c_0, the fluctuations are invisible (left).

Theorem 3. The sequence of random variables {X_n} satisfies a local limit theorem of the form

  P\bigl(X_n = \lfloor μ_n + x σ_n \rfloor\bigr) = \frac{e^{-x^2/2}}{\sqrt{2\pi}\, σ_n} \Bigl(1 + O\Bigl(\frac{1 + |x|^3}{\sqrt{n}}\Bigr)\Bigr)    (13)

uniformly for x = o(n^{1/6}).

Proof. (Sketch) The refined method of moments proposed in (Hwang, 2003) begins with introducing the normalized function

1 2 2 1 2 2 n  .Xn n/  n n  'n./ e 2 E e e 2 Pn./: WD D Then '0./ '1./ '2./ 1 and D D D  à n 2 X n n;k  ın;k  'n./ 2 'k ./'n k ./e .n > 3/; D k 06k6n 1 2 2 2 where n;k n k n k n and ın;k 2 k n k n . From this, we see that all Taylor WD.m/ C C WD C coefficients 'n .0/ satisfy the same recurrence of the form (1) with different non-homogeneous part. Then a good estimate for 'n./ for  small is obtained by establishing the uniform bounds j j m ˇ .m/ ˇ m 3 ˇ'n .0/ˇ 6 m!C n .m > 3/; for a sufficiently large number C > 0. Such bounds are proved by induction using Gaussian tails of the binomial distribution and the estimates 2  pn ; ı pn O 1 x ; n; n x n; n x 2 C 2 2 C 2 D C uniformly for x o.pn/ (the remaining range completed by using the smallness of the binomial distribution). ThenD it follows that ˇ  à ˇ ˇ .m/ ˇ i ˇ'n .0/ˇ  1 Á ˇ ˇ X m 2 3 ˇ'n 1ˇ 6 m  O n  ; ˇ n ˇ m!n j j D j j m>3

15 1 uniformly for  o.n 6 /, or, equivalently, j j D

 Xn n à 1 2 1 1 2  i  3   E e n e 2 O n 2  e 2 ; (14) D C j j for  in the same range. Then another induction leads to the uniform estimate (see (Hwang, 2003) for a similar setting)

2 ˇ Xni ˇ ".n 1/ ˇE e ˇ e C .   n 3/; (15) 6 j j 6 I > where " > 0 is a sufficiently small constant. (We use " > 0 as a generic symbol representing a sufficiently small number whose occurrence may change from one occurrence to another.) These two uniform bounds are sufficient to prove the local limit theorem by standard Fourier analysis (see (Petrov, 1975)) starting from the inversion formula 1 Z  ik Xni  P.Xn k/ e E e d; D D 2  and then splitting the integration range into two parts: 1 ÂZ Z Ã ik Xni  P.Xn k/ 1 1 e E e d: D D 2  6"n 3 C "n 3 <  6 j j j j By (15), the second integral is asymptotically negligible

ˇZ ˇ ÂZ Ã Â 1 Ã 1 1 2 1 3 ˇ ik Xni  ˇ "n 6 "n ˇ 1 e E e dˇ O 1 e d O n e : 2 ˇ "n 3 <  6 ˇ D "n 3 D j j 1 The integral over the central range  "n 3 is then evaluated by (14) using j j 6

k n xn n xn n; n O.1/; D b C c DW C C D giving Z Z 1 1  Xn n i Á  1 Á ik Xni  ix n 2  1 e E e d 1 e E e 1 O n d 2  6"n 3 D 2n  6"n 6 C j j j j Z 2 1 1 ix    1 3ÁÁ e 2 1 O n 2 1  d D 2n C j C j 2 1 x 3 e 2  Â1 x Ãà 1 O C j j ; D p2 n C pn which completes the proof of (13).

Note that our estimates for the characteristic function of Xn also lead to an optimal Berry- Esseen bound ˇ Â Ã Z x 2 ˇ ˇ Xn n 1 t ˇ 1  sup ˇP 6 x e 2 dtˇ O n 2 : x R ˇ n p2 ˇ D 2 1

16 Figure 3: Normalized (multiplied by standard deviation) histograms of the random variables Xn for n 15;:::; 50; the tendency to normality becomes apparent for larger n. D

A simple improved version The first few terms of n and those of the expected bit complexity of Algorithm FYKY are given in the following table. Algorithm 2 3 4 5 6 7 8 9 10 E.RS/. n/ 1 5 8:29 12:1 16:3 20:7 25:3 30:1 35 D E.FYKY/ 1 3:67 5:67 9:27 12:9 16:4 19:4 24 28:6 and we see that for small n Algorithm RS is better replaced by Algorithm FYKY. The analysis of the bit-complexity of these mixed algorithms (using FYKY for small n and RS for larger n) can be done by the same methods used above but the expressions become more involved.

4 The bit-complexity of Algorithm FYvN

In this section, we analyze the bit-complexity of Algorithm FYvN ( Sandelius’s ORP in (Sandelius, D 1962)), which is described in Introduction. Briefly, for each 2 6 k 6 n, select  log2 k ran- D d e  dom bits (independently and uniformly at random), which gives rise to a number 0 6 u < 2 . If u < k, use u as the required random number, otherwise repeat the same procedure until success. Let Yn represent the total number of random digits used for generating a random permutation of n element. Plackett showed (see (Plackett, 1968)), in the special case when n 2, that D E.Yn/ 2n .log n log 2/ ;  and 2  V.Yn/ 2 .1 log 2/ n log2 n 2 log2 n 1 : (16)  C In the section, we complete the analysis of Planckett of the mean and the variance for all n, and establish a stronger local limit theorem for the bit-complexity.

Lemma 1. Let k log k and Geok be a geometric random variable with probability of WD d 2 e success k=2k (with support on the positive integers). Then d X Yn k Geok .n 1/: (17) D > 16k6n

17 Proof. Observe that the number of random bits used for selecting each ck is a geometric random variable Geok .

Expected value The mean of Yn satisfies  X k k E.Yn/ 2 : D k 16k6n j j 1 By splitting the range Œ1; n into blocks of the form .2 ; 2 C , we obtain the following asymptotic approximation to E.Yn/. Theorem 4. The expected number of random digits used by Algorithm FYvN to generate a random permutation of n elements satisfies

Œ1 Œ2 2 E.Yn/ FvN .log2 n/n log n nFvN .log2 n/ O..log n/ /; (18) D C C Œ1 Œ2 where FvN .t/ and FvN .t/ are continuous, 1-periodic functions defined by

Œ1 1 t F .t/ 2 f g.1 t / vN WD C f g Œ2 1 t 2 F .t/ 2 f g.log 2/ 1 t : vN WD C f g Proof. We start with the decomposition

X ` 1 X 1 n X 1 E.Yn/ .` 1/2 C n2 : D C j C j 06`6n 2 2`

n n 1 n When n 2 , write n 2 , where n log n . Then ¤ D C WD f 2 g 1 n 1 n 2 2 E.Yn/ 2 .1 n/ n log n 2 .log 2/ 1 n n O .log n/ ; D C C C which is also valid when n 2n . This completes the proof of (18) and Theorem4. D Œ1 Œ1 Note that the dominant term can be written as FvN . /n log n .log 2/FvN . /n log2 n, and the Œ1  D  minimum value of .log 2/FvN . / equals 2 log 2 1:38 > 1, which means that Algorithm FYvN requires more random bits than Algorithm RS for large n; see (8).

18 Œ1 Œ2 Œ1 Œ2 Œ3 Figure 4: The periodic functions (from left to right) FvN ; FvN ; GvN ; GvN ; GvN in the unit interval.

Variance Analogously, by (17), the variance of Yn is given by

2k k X 2 k V.Yn/ k 2 : D k2 16k6n From this expression and a similar analysis as above, we can derive the following asymptotic approximation to the variance whose proof is omitted here.

Theorem 5. The variance of Yn satisfies

Œ1 2 Œ2 Œ3 3 V.Yn/ GvN .log2 n/n.log n/ GvN .log2 n/n log n nGvN .log2 n/ O .log n/ ; (19) D C C C Œ1 Œ2 Œ3 where GvN .t/; GvN .t/ and GvN .t/ are continuous, 1-periodic functions defined by (see Figure4)

1 t Œ1 2 f g 1 t  G .t/ 3 .log 2/.1 t / 2 f g vN WD .log 2/2 C f g

Œ2 Œ1 1 log 2 3 t G .t/ 2.log 2/.1 t /G .t/ 2 f g vN WD f g vN log 2 Œ3 2 2 Œ1 2 t G .t/ .log 2/ .1 t / G .t/ .1 log 2/.1 2 t /2 f g: vN WD f g vN C C f g In particular, if n 2n , then D 2  3 V.Yn/ 2 .1 log 2/ n .log2 n/ 2 log2 n 3 O .log n/ ; D C C where the last term inside the parentheses differs slightly from Plackett’s expression in (Plackett, 1968).

Asymptotic normality Since Yn is the sum of independent geometric random variables, we can derive very precise limit theorems by following the classical approach; see (Petrov, 1975). Theorem 6. The bit-complexity of Algorithm FYvN satisfies the local limit theorem

2 x   3 Ãà  j p kÁ e 2 1 x P Yn E.Yn/ x V.Yn/ p 1 O C j j D C D 2V.Yn/ C pn 1  uniformly for x o n 6 . D 19 k Proof. By (17), the moment generating function of Yn satisfies (pk k=2 ) D p ek  Yn  Y k E e   : (20) D 1 .1 pk /e k 16k6n By induction, we see that the cumulant of order m satisfies

X m m m  p polynomial .pk / O .n.log n/ / .m 1; 2; : : : /: k k m D D 16k6n From this we deduce that  à  2  3 Ãà Yni ni   E exp p exp p O j j ; (21) V.Yn/ D V.Yn/ 2 C pn uniformly for  6 "pn. This estimate, coupling with the usual Berry-Esseen inequality, is sufficient to provej j an optimal convergence rate to normality. For the stronger local limit theorem, it suffices to prove the bound

2 ˇ Yni ˇ "1n ˇE e ˇ 6 e ; (22) uniformly for  6 , where "1 > 0 is a sufficiently small constant. Then the local limit theorem follows from thej j same argument used in the proof of (13). To prove (22), a direct calculation from (20) yields 1 ˇ Yni ˇ Y ˇE e ˇ q D 2 1 k n 1 2.1 pk /p .1 cos k / 6 6 C k Y 1 6 p1 2.1 p /.1 cos  / 1 k 2n 1 k k 6 6 C Y Y 1 : D q k 1 `<n 1 k<2` 1 1 2 .1 cos `/ 6 6 C 2` 1 x For 0 x 4, we have the elementary inequality e 5 , so that 6 6 p1 x 6 C  2 k à ˇ Yni ˇ X X ˇE e ˇ exp .1 cos `/ 6 5 2` 1 `< ` 1 6 n 16k<2  1 X à exp .2` 2/.1 cos `/ : 6 20 16`<n By the inequality 2` 2 2` 1 for ` 2, we then obtain > >  à 1 1 n ˇ Yni ˇ X ` 2 n./ ˇE e ˇ exp 2 .1 cos `/ e 40 ; 6 40 6 26`<n

20 where 5 4 cos  cos n 2 cos.n 1/ n./ C : WD 2.5 4 cos / 1 By monotonicity and induction, we deduce that n./ .1 cos / for n 2; consequently, > 6 > 1 n 1 ˇ Yni ˇ 2 .1 cos / n.1 cos / ˇE e ˇ 6 e 240 6 e 480 ; uniformly for  . But 1 cos  2  2 for  , so that (22) follows. j j 6 > 2 j j 6 5 The bit-complexity of Algorithm FYKY

We analyze the total number of bits used by Algorithm FYKY for generating a random permutation of n elements. Let Bn denote the total number of random bits flipped in the procedure Knuth-Yao of Algorithm FYKY for generating UnifŒ0; n 1. Then B k k. 2 D Bn  Lemma 2. The probability generating function E t of Bn satisfies n 2k  Bn  X k E t 1 .1 t/ t .n 2; 3; : : : /: (23) D 2k n D k>0 Proof. The probability that the algorithm does not stop after k random flips is given by n 2k  P.Bn > k/ .k 0; 1; : : : /; D 2k n D because after the first k random coin-tossings (2k different configurations) there are exactly 2k mod ˚ 2k « n n n cases that the algorithm does not return a random integer in the specified interval Œ0;Dn 1. n ˚ 2k « Throughout this section, write Lx log2 x for x > 0 and L0 0. Since 2k n 1 for Ln WD b c WD D 0 k Ln when n 2 , we obtain 6 6 ¤ n 2k  Bn  Ln 1 X k Ln E t t C .t 1/ t .n 2 /: D C 2k n ¤ k>Ln

We see that Bn is close to Ln 1, plus some geometric-type perturbations. For computational purposes,C the infinite series in (23) is less useful and it is preferable to use the following finite representation. Let .n/ denote Euler’s totient function (the number of positive integers less than n and relatively prime to n).

Corollary 1. For n > 2 8 B n  t E t 2 ; if n is even <ˆ I t Bn  1 t X 2k mod n (24) E 1 t k ; if n is odd: D ˆ .n/ k : 1 t  2 2 06k<.n/

21 Proof. This follows from (23) by grouping terms containing the same fractional parts.

Let Zn B1 Bn represent the total number of bits required by Algorithm FYKY for generating aD randomC  permutation   C of n elements.

5.1 Expected value of Zn By (23), we have X n E.Zn/ am; WD D 16m6n where X2k  n an E.Bn/ : WD D n 2k k>0 This sequence has been studied in the literature; see (Knuth and Yao, 1976; Pokhodze˘ı, 1985; Lumbroso, 2013). Obviously,

X X 2k  n an 1 ; D C n 2k 06k6Ln k>Ln

when n 2Ln , so we obtain the easy bounds ¤ n Ln an Ln 1 ; 6 6 Ln 1 C C 2 C and an log n O.1/. D 2 C Lemma 3. For n > 1 8 a n 1; if n is even; ˆ 2 < C.n/ j an 2 X 2 mod n D ; if n is odd: :ˆ2.n/ 1 2j 06j<.n/ Proof. These relations follow from (24).

Corollary 2. For n > 1

an log n F0.log n/.n 1/; (25) D 2 C 2 >

for some 1-periodic function F0; see Figure5.

A formal expansion for F0.t/ was derived in (Lumbroso, 2013). We prove the following esti- mate for n.

22 Figure 5: Periodic fluctuations of an log2 n in log-scale (left) and normalized in the unit interval k k 1 (right). The largest value achieved by the periodic function in the interval n Œ2 ; 2 C  is at k k 2 k n 2 1 for which a2 1 k 2 2k 1 , which approaches 2 for large k. D C C D C

1 1 n log2 n C 2 3 Figure 6: Periodic fluctuations of n log2 n in log-scale (left) and FKY in the unit interval (right).

Theorem 7. The expected number n of random bits required by Algorithm FYKY satisfies 2 n n log n nFKY.log n/ O .log n/ ; (26) D 2 C 2 C 2ki where FKY.t/ is a continuous 1-periodic function whose Fourier expansion is given by ( k ) WD log 2

1 1 X . k 1/ 2kit FKY.t/ C e .t R/; (27) D 2 log 2 Clog 2 2 1 2 k 0 k „ ƒ‚ … ¤ 0:33274  the series being absolutely convergent. Here .s/ denotes Riemann’s zeta function. 2  Note that .1 i t/ O .log t / 3 for large t ; see (Titchmarsh, 1986). Our method of j C j D j j j j proof is based on approximating the partial sum n by an integral Z x X2k  x M.x/ a.t/ dt; where a.x/ k .x > 0/; WD 0 WD x 2 k>0

and estimating their difference. Obviously, an a.n/ for integer n > 0. The asymptotics of M.x/ is comparatively simpler and can be derived byD standard Mellin transform techniques; see (Flajolet et al., 1995). Indeed, we derive an asymptotic expansion that is itself an identity for x > 1.

23 Proposition 1. The integral M.x/ satisfies the identity 2 M.x/ x log x xFKY.log x/ ; (28) D 2 C 2 C 12 for x > 1, where FKY is given in (27). Proof. We start with the relation ( x  1; if x > 1 a.x/ a 2 ˚ 1 « I D C x x ; if 0 < x 6 1: Then for x > 1 x Z x Z 2 x  M.x/ 2M 2 a.t/ dt 2 a.t/ dt D 0 0 Z x t  a.t/ a 2 dt D 0 Z 1 ˚ 1 « x 1 t t dt: D C 0 The last integral is equal to Z 1 Z Z 1 ˚ 1 « 1 t X t 2 t t dt f 3g dt 3 dt 1 12 : 0 D 1 t D 0 .j t/ D j>1 C Thus, M.x/ satisfies the functional equation

2 M.x/ 2M x  x  ;.x > 1/; (29) D 2 C 12 2 M.x/  which implies that M .x/ 12 log x is a periodic function, namely, M .2x/ M .x/ N x 2 N N for x > 1, or, equivalentlyWD (28), and it remains to derive finer properties of the periodicD function FKY. For that purpose, we apply Mellin transform. First, the integral M is decomposed as Z x  k  Z X t 2 X 1 t M.x/ dt 2k f g dt: (30) k k 3 D 0 2 t D 2 t k>0 k>0 x Then the Mellin transform of M.x/ can be derived as follows (assuming 2 < .s/ < 1): < Z Z Z Z X k 1 s 1 1 t X k.s 1/ 1 s 1 1 t 2 x f g dt dx 2 C x f g dt dx k 3 3 0 2 t D 0 x t k>0 x k>0 Z Z t X k.s 1/ 1 t s 1 2 C f 3g x dx dt D 0 t 0 k>0 Z 1 X k.s 1/ 1 t 2 C fs g3 dt D s 0 t C k>0 .s 2/ C ; D s.s 2/.1 2s 1/ C C 24 where we used the integral representation for .s 1/ (see (Titchmarsh, 1986, p. 14)) C Z 1 t .s 1/ .s 1/ fs g2 dt . 1 < .s/ < 0/: C D C 0 t C < All steps here are justified by absolute convergence if 2 < .s/ < 1. We then have the inverse Mellin integral representation <

3 Z 2 i 1 C 1 .s 2/ s M.x/ C s 1 x ds D 2i 3 i s.s 2/.1 2 C / 2 1 1 C Z 2 i 1 C 1 .s 1/ 1 s 2 C s x ds .x > 0/: D 2i 1 i .s 1/.1 2 / 2 1 Move now the line of integration to the right using known asymptotic estimates for .s/ (see (Titchmarsh, 1986, Ch. V)) j j

( 1 .1 c/ " O t 2 C ; if 0 6 c 6 1 .c i t/ j j 2  I (31) j C j D O .log t / 3 ; if c 1; j j D as t . A direct calculation of the residues at the poles (a double pole at s 0 and simple j j ! 1 D poles at s k , s 1) then gives D D 1 Z 2 i 1 C 1 .s 1/ 1 s 2 2 C s x ds x log2 x xFKY.log2 x/ 12 .x/; 2i 1 i .s 1/.1 2 / D C C C 2 1

for x > 0, where FKY is given in (27) and .x/ is give by

3 Z 2 i 1 C 1 .s 1/ 1 s .x/ 2 C s x ds: (32) WD 2i 3 i .s 1/.1 2 / 2 1 To evaluate this integral, we use the relations

Z c i s  1 C 1 x 0; if x > 1 2 ds x 1 I .c > 1/; 2i c i 1 s D 2 2x ; if 0 < x 6 1; 1 by standard residue calculus (integrating along a large half-circle to the right of the line .s/ c if x > 1, and to the left otherwise). With this relation, we then have < D 8 0; if x > 1 ˆ  à I < 1 X k 1 2 1 .x/ 2 x ; if 0 < x 1: 2 2k `2 6 D ˆ k :ˆ 2 `x61 k;`>1

s 1 2 by expanding the zeta function and 1 2s 1 2 s in Dirichlet series and then integrating term by term. Note that the double sum expression D for .x/ can be simplified but we do not need it. Also .x/ 0 for 1 < x 1. D 2 6 25 Observe that the Fourier series expansion (27) converges only polynomially. We derive a dif- ferent expansion for FKY, with an exponential convergence rate.

Lemma 4. The periodic function FKY has the series expansion  k t à 2 t X 2 f g k 1 t  k t ˘  FKY.t/ 1 t 6 2f g 1 bk 1 tc 2 f g 0 2 f g 1 ; (33) D f g C 2 C f g C k>1 P 1 for t R, where denotes the digamma function (derivative of log €) and 0.k 1/ j>k 2 . 2 C D j For large k, we have

k t 2 2 f g k 1 t  k t ˘  1 6k 6k 1 3k  1 b c 2 f g 0 2 f g 1 C O 2 ; 2k 1 t C D 2k 2 t 3 22.k 1 t / C C f g C f g  C f g 1 1 1 1  ˚ k t « since .k 1/ 2 3 O 4 , where k 2 . 0 C D k 2k C 6k C k WD f g Proof. By (30), we have, for x > 0,

0  2k ˘ 1 Z x 1 Z X C 1 t M.x/ 2k f g dt @ k k A 3 D 2 C  2 ˘ 1 t k>0 x x C 0 1 X Z 1 t X Z 1 t 2k B dt dtC @ n k o 3 3 A D 2  2k ˘ Á C 0 .j t/ k>0 x t  2k ˘ x j> 1 C C x C  2  k  Â k  Ãà X x 2 k 1 2 x k 1 2 0 1 : D 2 C x x C k>0 m Now if x 2 , then (Lx log x ) ¤ WD b 2 c  2  k  Â k  Ãà X 2 k  X x 2 k 1 2 M.x/ x 12 2 x k 1 2 0 1 D C 2 C x x C 06k6Lx k>Lx 1 C 2 2  Lx  xLx x 2 D C 6 C 12   k Lx  k Lx 1 Â k Lx  Ãà X x 2 C 2 C 2 C x 1 0 1 ; k Lx 1 C 2 C C x x x C k>1 which also holds for x 2m, and in that case we have D 2 2 M.2m/ m2m $ 1  2m  ; D C 2 C 6 C 12 2 by using .2/ 1  , where (see Section 5.3) 0 D C 6 X k k  $ 1 2 0.2 1/ 0:44637 64113 48039 93349 :::: WD C  k>1

This proves (33) by writing Lx log x log x . D 2 f 2 g 26 Note that if we use the expression (33) for FKY.t/, then the identity (28) holds for x > 0. We turn now to estimating the difference between n and M.n/.

Proposition 2. The difference n M.n/ satisfies 2 n M.n/ O .log n/ : (34) D Proof. We have (defining a.0/ 0) D X Z 1 n M.n/ .a.m/ a.m t// dt O.log n/: D 0 C C 06m6n Now X Â2k  m  2k m t à a.m/ a.m t/ C C D m 2k m t 2k k>Lm C X m Â2k   2k à  X 1 X m à O D 2k m m t C 2k C 2k Lm6k<2Lm C k>Lm k>2Lm X m Â2k   2k à  1 à O : D 2k m m t C m Lm6k<2Lm C Thus X X m Z 1Â2k   2k à n M.n/ k dt O.log n/: D 2 0 m m t C 26m6n Lm6k<2Lm C By writing x x x , we then obtain f g D b c m Z 1Â2k   2k à Z 1 t m Z 1Â2k   2k à dt dt dt: 2k m m t D m t 2k m m t 0 C 0 C 0 C The first integral on the right-hand side is bounded above by X X Z 1 t  X log mà dt O O.log n/2: 0 m t D m D 26m6n Lm6k<2Lm C 26m6n It remains to estimate the double-sum X X m Z 1Â2k   2k à M1 k dt WD 2 0 m m t 26m6n Lm

27  2k ˘  2k ˘ For a fixed k, the difference m m 1 assumes the value 1 if there exists an integer q lying in the interval C 2k 2k < q ; (35) m 1 6 m C  2k ˘  2k ˘ and m m 1 assumes the value 0 otherwise. For those m satisfying (35), we have the m C1 inequality 2k 6 q . It follows that

X m Â2k   2k à X 1 O.k/; 2k m m 1 6 q D k k 1 k C 16q62 2 2 C 6m6min 2 ;n b c f g and, consequently,  à X 2 M1 O k O .log n/ : D D 36k62Ln This proves the proposition.

5.2 Variance of Zn 2 Let bn E.Bn / Bn00.1/ Bn0 .1/. D D C Lemma 5. For n > 1 8 b n 2a n 1; if n is even ˆ 2 2 < C kC Â 1 .n/ Ã I b X 2 mod n 2k 1 2 .n/ n C ; if n is odd: D ˆ k .n/ .n/ 2 : 2 1 2 C .1 2 / 06k<.n/ Proof. By (23) and (24).

2 Note that the variance vn bn a of Bn satisfies the recurrence WD n

v2n vn .n 1/: D > On the other hand, we also have

X n 2k  bn .2k 1/ ; D C 2k n k>0

and we will derive an asymptotic approximation to the variance of Zn

2 X X 2 &n V.Zn/ vm bn an : WD D D 26m6n 26m6n

28 2 10 Figure 7: Periodic fluctuations of the variance of Bn ( bn an) in log-scale for n 2;:::; 2 (left) and for n 29;:::; 210 (right). D D D Theorem 8. The variance of the total number of random bits flipped to generate a random permu- tation by Algorithm FYKY satisfies 2 3 & nGKY.log n/ O..log n/ /; (36) n D 2 C where GKY.u/ is a continuous, bounded, periodic function of period 1 defined by u j Z 2f g 1 u X j u GKY.u/ v02 f g 2 f g g.t/ dt: (37) D C 0 j>1 R 1 Here v0 g.t/ dt and WD 0 Â  1 Ã Â x Á  1 Ã g.x/ 1 x 2a x : (38) WD x 2 C x

Numerically, v0 0:47021 47736 99741 30560 ::: ; see Section 5.3 for different approaches of numerical evaluation. This theorem will follow from Propositions3 and5 given below. 2 Similar to the case of n, a good approximation to &n is given by the integral Z x Z x V .x/ v.t/ dt b.t/ a.t/2 dt; WD 0 D 0 2 where v.x/ b.x/ a.x/ represents a continuous version of vn and WD X x 2k  b.x/ .2k 1/ : WD C 2k x k>0 Now consider !2 X x 2k  X x 2k  v.x/ .2k 1/ D C 2k x 2k x k>0 k>0 2   x  k    x  k ! 1 X 2 2 1 X 2 2 x .2k 3/ k x x k x ; D x C C 2 2 x C 2 2 k>0 k>0 From this relation, we derive the following functional equation.

29 Lemma 6. For x > 0 x Á Z min 1;x V .x/ 2V f g g.t/ dt: (39) 2 D 0 Proof. If x > 1, then x Á v.x/ v D 2 I if 0 < x 1, then 6 x Á v.x/ v g.x/; D 2 C where g is defined in (38). We now show that this functional equation leads to an asymptotic approximation that is itself an identity, as in the case of M.x/. Proposition 3. The integral V .x/ satisfies

V .x/ xGKY.log x/ v0; (40) D 2 for x > 1, where GKY is defined in (37). Proof. By a direct iteration of (39), we obtain

x Z 2Lx j Lx 1  X Lx j C V .x/ v0 2 C 1 2 C g.t/ dt; D C 0 j>1 for x 1, where the sum is absolutely convergent because (a.x/ O.x/ and x˚ 1 « O.x/) > D x D Z x Z x  1à   t à 1à g.t/ dt 1 t 2a t dt Ox2; (41) 0 D 0 t 2 C t D

Lx x as x 0. Now writing x 2 , where x log x , we obtain (40). Note that G .0/ ! D C WD f 2 g KY D limu 1 GKY.u/, and GKY is continuous on Œ0; 1. On the other hand, since g.x/ O.x/, its integral! is of order x2 as x 0, which implies that the series in (37) is absolutelyD convergent. ! Accordingly, GKY is a bounded periodic function. Proposition 4. The Fourier coefficients of G .u/ P g e2kiu can be computed by KY k Z k D 2 1 Z 1 k 1 gk g.t/t dt .k Z/; (42) D .log 2/. k 1/ 2 C 0 the series being absolutely convergent. In particular, the mean value g0 is given by

 2 à 2 1 1  2 2 X k. k 1/. k 1/ g0 2 1 C 2 C D 24 C 2.log 2/2 6 .log 2/3 sinh 2k (43) k>1 log 2 1:55834 75820 73324 42639 35697 76811 51355 37715 91606 58602    

30 where 1 is a Stieltjes constant: ! X log j .log m/2 1 lim 0:72815 84548 36767 24860 :::: WD m j 2  !1 26j6m Note that the terms in the series in (43) are convergent extremely fast with the rate

 2 à 4 2k 4 12 k k.log k/ 3 exp k.log k/ 3 2:33 10 ; (44) log 2   by (31). Furthermore, the mean value (43) is smaller than that (12) of Algorithm RS. Proof. By definition,

u j Z 1 Z 1 Z 2 2kiu 1 u X 2kiu j u gk v0 e 2 du e 2 g.t/ dt du: D 0 C 0 0 j>1

The first term equals 1 . The second term g can be simplified as follows. .log 2/. k 1/ k0 C j 1 j ! Z 2 Z 1 Z 2 Z 1 X 2kiu j u gk0 g.t/e 2 du dt j D 0 0 C 2 j log2 t j>1 C j 1 j ! 1 Z 2 Z 2 X j 1 k 1 j 1 2 g.t/ dt g.t/ t 2 dt : D .log 2/. k 1/ 0 C 2 j C j>1 By summation by parts, we see that

j j Z 2 Z 2 X j 1 X j 2 g.t/ dt .2 1/ g.t/ dt 0 D 2 j 1 j>1 j>0 1 j Z 2 Z 1 X j 1 2 g.t/ dt g.t/ dt: D 2 j 0 j>1 Thus, we obtain (42). The proof of (43), together with different numerical procedures, will be given in the next section. We now show that & 2 V .n/ is small. n 2 Proposition 5. The difference between the variance &n and its continuous approximation V .n/ is bounded above by O.log n/3.

31 2 &n 2 log2 n 3 7 11 Figure 8: Periodic fluctuations of C n C in log-scale for n 2 ;:::; 2 (left) and GKY.u/ (right). D

Proof. The proof is similar to that of Proposition2. By definition,

Z 1 2 X &n V .n/ .v.m/ v.m t// dt O.1/ D 0 C C 06m6n X Z 1 b.m/ b.m t/ a.m/2 a.m t/2 dt O.1/: D 0 C C C 06m6n Now divide the sum of terms into three parts:

2 & V .n/ 2W1.n/ W2.n/ W3.n/ O.1/; n D C C C where

X Z 1 X  m 2k  m t  2k à W1.n/ k k k Ck dt D 0 2 m 2 m t 06m6n k>1 C Z 1 X  W2.n/ a.m/ a.m t/ dt D 0 C 06m6n Z 1 X 2 2 W3.n/ a.m/ a.m t/ dt: D 0 C 06m6n

2 We already proved in Proposition2 that W2.n/ O .log n/ . On the other hand, D Z 1 X   W3.n/ a.m/ a.m t/ a.m/ a.m t/ dt D 0 C C C 06m6n  Z 1 à X ˇ ˇ O .log n/ ˇa.m/ a.m t/ˇ dt D 0 C 06m6n O.log n/3; D

32 by Proposition2. For W1.n/, we again follow exactly the same argument used in proving Proposi- tion2 and deduce that X X m Z 1Â2k   2k à W1.n/ k k dt O.log n/ D 2 0 m m t C 06m6n Lm6k<2Lm C ! X X k X X k O O.log n/ m 1 q D C k C 06m6n Lm6k<2Lm C 16k62Ln 16q62 O.log n/3: D

Theorem8 now follows from Propositions3,4 and5. It remains to prove the more precise expression (43) for the mean value g0 and other Fourier coefficients gk .

5.3 Evaluation of gk

We show in this part how the coefficients g0 and gk with k 0 can be numerically evaluated to high precision. For that purpose, we will derive a few different¤ expressions for them, which are of interest per se. We focus mainly on g0, and most of the approaches used also apply to other constants or coefficients appeared in this paper.

The mean value of GKY The mean value of GKY is split, by (38), into two parts Z 1 1 g.t/ g0 g000 g0 dt C ; D log 2 0 t DW log 2 where Z 1   à   Z  2 à 2 1 1 1 t t  1 g0 1 t dt f 2g f 3g dt ; WD 0 t t D 1 t t D 12 2 and Z 1 1  1à  t à g000 2 1 t a dt: WD 0 t t 2 Lemma 7. Z  à X k 1 1 t g00 2 1 dt: (45) 0 2k t t D 0 e 1 e 1 k>1 Proof. By definition and direct expansions X Z 1 1 2k   1à g000 2 k 1 t dt D 0 2 t t k>1 X X Z 1 2k j t 2 dt: k 3 D k 0 2 j ` t k;j>1 06`<2 C C

33 Now, by the integral representation Z s 1 1 xu s 1 x e u du .x; .s/ > 0/; D €.s/ 0 < we see that Z 1 Z Z 1 X X j t 1 2 X 2k ju X `u tu 2 dt u j e e te dt du k 3 k 0 2 j ` t D 0 k 0 j>1 06`<2 C C j>1 06`<2 Z 1  u Á 1 1 du: 2k t u D 0 e 1 e 1 This proves (45). Lemma 8.

X h` g00 2 ; (46) 0 D `2.` 1/ `>3 ˙ `  where h` 2h ` 2 1 for ` > 2 with h0 h1 0. D d 2 e C D D The first few terms of h` are

h2` `>1 h2` 1 `>1 0; 1; 4; 5; 12; 13; 16; 17; 32; 33; 36; 37; 44; 45; 48; 49; 80; ; f g D f g D f    g which correspond to sequence A080277 in Sloane’s OEIS (Online Encyclopedia of Integer Se- quences), and is connected to partial sums of dyadic valuation. Proof. Inverting (45) using Binet’s formula (see (Erdelyi´ et al., 1953, ~1.9)) Z  à 1 t zt 1 z 0.z 1/ z 1 e dt; (47) C D et 1 0 we get  à X X 1 k k g00 2 0.2 j 1/ : 0 D j C k>1 j>1 Since 1 X 1 0.m 1/ ; m C D `2.` 1/ `>m 1 C by grouping terms with the same number, we get  à X 1 X k g00 2 0.m 1/ 2 ; 0 D m C m>2 2k m j k>1 which then implies (46).

34 1 First approach: k convergence rate The most naive approach to compute g0 consists of evaluating exactly the first k > 1 terms of the series (46) and adding the error by an asymptotic estimate of the remainders. More precisely, choose k sufficiently large and then split the series 1 into two parts depending on ` < k and ` > k. Since h` 2 ` log2 ` O.`/ for large `, we see that the remainder is asymptotic to D C

X h` X log ` log k 2 2 2 ; `2.` 1/  `2  k `>k `>k

1 with an additional error of order k . But such an approach is poor in terms of convergence rate.

k Second approach: 3 convergence rate A better approach to compute g000 from (46) consists in expanding the series

X h` X X h` D1.k/; where D1.s/ ; `2.` 1/ D WD `s `>3 k>3 `>3

and then evaluate D1 by the recurrence relation of h`, namely,

X 2h` ` 1 X 2h` ` 1 D1.s/ C C D .2`/s C .2` 1/s `>1 `>1 Â Â Ã Ã 1 s X s j 1 D1.s j / .s 2/ .1 2 /.s 1/ .s/ 2 C s Cj : D 1 2 C j 2 C j>1

k .s/ 1 j Since D1.k/ O.3 / for large k, the terms in such a series converge at the rate O.j 6 /. D <

k 1 Third approach: k5 convergence rate We can do better by applying the 2 -balancing tech- nique introduced in (Grabner and Hwang, 2005), which begins with the relation

k  k ˘  X h` X . 1/ 2 1 X h` C D2.k 3/; where D2.s/ s : `2.` 1/ D 2k C WD ` 1  `>3 k>0 `>3 2

k Here the convergence rate is of order k5 . So it suffices to compute D2.j / for j > 3. Now

X 2h` ` 1 X 2h` ` 1 D2.s/ C s C s D 2` 1  C 2` 3  `>1 2 `>1 2  à s  à sà 1 s X h` 1 1 s 2 s 1 1 2 Z.s/; D ` 1  4` 1  C C 4` 1  C `>1 2 2 2 where Z.s/ s 1; 1  s 1; 3  1 s; 1  3 s; 3 : WD 4 C 4 4 4 4 4 35 Thus, we obtain the recurrence  à Z.s/ 1 X s 2j 1 D2.s 2j / D2.s/ s 2 s 2 C Cj ; D 4.2 1/ C 2 1 2j 16 j>1

.s/ 1 j where the convergence rate is now improved to O.j < 100 /. In this way, we obtain the g0 g000 numerical value in (43) since g0 C . D log 2 Such an approach is generally satisfactory. But for our g0 it turns out that a very special symmetric property makes the identity (43) possible, which is not the case for other constants appearing in this paper (e.g., v0 and $).

2 4 2k Fourth approach: k.log k/ 3 e log 2 convergence rate Instead of the elementary approach used above, we now apply Mellin transform to compute the Fourier series of GKY. We start with defining V .x/ V .x/ v0. Then, by (39), N WD C ( x Á 0; if x > 1 V .x/ 2V I N N 2 D R 1 g.t/ dt; if 0 < x 1: x 6 From this it follows that the Mellin transform V .s/ of V .x/ satisfies  N s 1 V .s/ 1 2 C g.s/; D where Z 1 Z 1 Z 1 s 1 1 s g.s/ x g.t/ dt dx g.t/t dt: WD 0 x D s 0

By (41), we see that g.s/ is well-defined in the half-plane .s/ > 2. Thus, we anticipate the same expansion (40) with the Fourier coefficients (42). What< is missing here is the growth order of g.c i t/ for c > 2 as t , which can be obtained by the integral representation (48) below.j C j j j ! 1 By (38), we first decompose g into two parts:

Z 1 Â  Ã Â Â Ã  Ã 1 1 t 1 s 1  g.s/ 1 t 2a t t dx g1.s/ g2.s/ ; D s 0 t 2 C t DW s C where Z 1 Â  Ã Â Ã 1 t s g1.s/ 2 1 t a t dt D 0 t 2 Z 1 Â  Ã   1 1 s 1 g2.s/ 1 t t C dt: D 0 t t

36 The second integral is easier and we have

Z  2 à 1 t t .s 3/ .s 1/.s 2/ g.s/ f g f g dt C C C ; 2 D t s 3 t s 4 D s 3 .s 2/.s 3/ 1 C C C C C for .s/ > 2 (when s 1, the last term is taken as the limit 1 ). < D 2 Consider now g1.s/. The following integral representation is crucial in proving (43). Lemma 9. For .s/ > 2, < Z c i 2 1 C 1 €.w 1/.w 1/€.s w 2/.s w 2/ g1.s/ C C w C C dw; (48) D €.s 4/  2i c i 1 2 C 1 where 1 < c < .s/ 1. < C Proof. By straightforward expansions as above

Z s 1 2 X k.s 2/ 1 u C  u Á g.s/ 2 C 1 du: (49) 1 2k u u D €.s 4/ 0 e 1 e 1 C k>1 Since Z 1 w 1  u Á u 1 du €.w 1/.w 1/. 1 < .w/ < 0/; eu 1 D C C < 0 we obtain the Mellin inversion representation

Z c i u 1 C 1 w u 1 €.w 1/.w 1/u dw .c . 1; 0//: e 1 D 2i c i C C 2 1 Substituting this into (49), we obtain (48).

Proof of (43) Taking s 1 in (48), we get D 1 Z 2 i 1 C 1 €.w 1/.w 1/€. w 1/. w 1/ g1. 1/ C C w C C dw D 2i 1 i 1 2 2 1 R1 J2; D C

where R1 sums over all residues of the poles on the imaginary axis and

1 Z 2 i 1 C 1 €.w 1/.w 1/€. w 1/. w 1/ J2 C C w C C dw WD 2i 1 i 1 2 2 1 1 Z 2 i 1 C 1 €.w 1/.w 1/€. w 1/. w 1/ C C w C C dw: D 2i 1 i 1 2 2 1

37 The last integral is almost identical to g . 1/ except the denominator for which we write 1 1 1 1 : 1 2w D C 1 2 w Thus J2 g . 1/ J3, where D 1 C 1 Z 2 i 1 C 1 J3 €.w 1/.w 1/€. w 1/. w 1/ dw 2i 1 WD 2 i C C C C Z 1 1 1  u Á u u 1 du D 0 e 1 e 1 2 1  : D 6 Collecting these relations, we see that

R1 J3 g. 1/ ; 1 D 2 C 2 and R1 g. 1/ g. 1/ g. 1/ ; D 1 C 2 D 2 2  1 J3 because g2. 1/ 12 2 2 . It remains to compute the residues of the poles on the imaginary axis: D D  à R1 X €.w 1/.w 1/€. w 1/. w 1/ g. 1/ Res C C w C C D 2 D 1 2 w k Z k 2 D  2 à 2 log 2 1  2 X 2k . k 1/. k 1/ 2 1 C C ; D 24 C 2 log 2 6 .log 2/2 sinh 2k2 k>1 log 2

where 1 is defined in Proposition4. The terms in the series are convergent at the rate (44), and is much faster than the previous three approaches:

g. 1/ g0 1:55834 75820 73324 42639 35697 76811 51355 377159 16065 86021 D log 2  33003 19983 06704 40332 28575 51733 41447 78391 56441 48117 :::

108 (using only 18 terms of the series, one gets an error less than 1:8 10 ). Also the dominant term alone, namely, 

 2 à 1 1  2 2 1 1:55834 75821 66122 :::; 24 C 2.log 2/2 6 

11 gives an approximation to g0 to within an error less than 9:3 10 . 

38 Calculation of gk for k 0 Consider now g . 1 k / when k 0. Similarly, by (48) with ¤ 1 C ¤ s 1 k , we have D C g. 1 k / R2 J4; 1 C D C where R2 denotes the sum of all residues of the poles on the imaginary axis and

1 Z 2 i 2 1 C 1 €.w 1/.w 1/€.1 k w/.1 k w/ J4 C C C w C dw: WD €.3 k /  2i 1 i 1 2 C 2 1

By the change of variables w k w, we get 7! 1 Z 2 i 2 1 C 1 €.w 1/.w 1/€.1 k w/.1 k w/ J4 C C C w C dw D €.3 k /  2i 1 i 1 2 C 2 1 g. 1 k / J5; D 1 C C where

1 Z 2 i 2 1 C 1 J5 €.w 1/.w 1/€.1 k w/.1 k w/ dw WD €.3 k /  2i 1 i C C C C C 2 1 Z k 2 1 u  u Á u u 1 du D €.3 k / 0 e 1 e 1  C à k . k 1/ . k 2/ 2 C C ; D . k 2/. k 1/ k 2 C C C which equals 2g . 1 k /. Then 2 C

g. 1 k / R2 g k C D .log 2/. k 1/ D 2.log 2/. k 1/ C C 2 . k 1/ . k 1/. k 1/ 0 C C 0 C C 2 2 D .log 2/ . 1/. k 2/ k C 2 X €. k j 1/. k j 1/€. j 1/. j 1/ (50) C C 2 C C C C C .log 2/ . k 1/€. k 3/ j>1 C 1 X €. j 1/. j 1/€. k j 1/. k j 1/ 2 C C C C : C .log 2/ . k 1/€. k 3/ 16j6k 1 C

By the order estimate (7) for Gamma function and (31) for -function (which implies that 0.1 5 j C i t/ O.log t  3 /, we deduce that j D j j 2 5  gk O k .log k/ 3 ; (51) D for large k , so that the Fourier series of G is absolutely convergent. j j KY

39 5.4 Asymptotic normality of Zn

We prove in this section the asymptotic normality of the bit-complexity Zn of Algorithm FYKY. Such a result is well anticipated because Zn B1 Bn and each Bk is close to Lk 1 with a geometric perturbation having bounded meanD andC variance.    C Indeed, we can establish aC stronger local limit theorem for Zn.

Theorem 9. The bit-complexity $Z_n$ of Algorithm FYKY satisfies a local limit theorem of the form
\[
P\bigl(Z_n=\lfloor\mu_n+x\sigma_n\rfloor\bigr)=\frac{e^{-\frac{x^2}{2}}}{\sqrt{2\pi}\,\sigma_n}\Bigl(1+O\Bigl(\frac{1+|x|^3}{\sqrt n}\Bigr)\Bigr),\qquad(52)
\]

uniformly for $x=o\bigl(n^{1/6}\bigr)$, where $\mu_n:=E(Z_n)$ and $\sigma_n^2:=V(Z_n)$; see (26) and (36).

Proof. Since $Z_n$ is the sum of $n$ independent random variables, the $r$-th cumulant of $Z_n$, denoted by $K_r(n)$, satisfies
\[
K_r(n)=\sum_{2\le m\le n}\kappa_r(m)\qquad(r\ge1),
\]

where $\kappa_r(m)$ stands for the $r$-th cumulant of $B_m$. To show that the $\kappa_r(m)$ are bounded for all $m$ and $r\ge2$, we observe that $E\bigl(t^{B_n}\bigr)$ can be extended to any $x>0$ by defining

\[
B(x,t):=1-(1-t)\sum_{k\ge0}\Bigl(1-\frac{x}{2^k}\Bigl\lfloor\frac{2^k}{x}\Bigr\rfloor\Bigr)t^k\qquad(x>0),
\]

so that $E\bigl(t^{B_n}\bigr)=B(n,t)$. Also $B(x,t)=tB\bigl(\frac x2,t\bigr)$ for $x>1$, and the cumulants $\kappa_r(x):=r!\,[s^r]\log B(x,e^s)$ are well-defined. It follows that, for $x>1$,

\[
\kappa_r(x)=r!\,[s^r]\Bigl(s+\log B\Bigl(\frac x2,e^s\Bigr)\Bigr)=\kappa_r\Bigl(\frac x2\Bigr)
\]
for $r\ge2$, which then implies that $\kappa_r(x)=\kappa_r\bigl(x/2^{L_x+1}\bigr)$ for $x>1$. It remains to prove that $\kappa_r(x)=O(1)$ for $x\in(0,1)$. Note that $\kappa_r(x)$ is a (finite) linear combination of sums of the form
\[
\sum_{k\ge0}\Bigl(1-\frac{x}{2^k}\Bigl\lfloor\frac{2^k}{x}\Bigr\rfloor\Bigr)^{j}=O\Bigl(\sum_{k\ge0}x^{j}2^{-kj}\Bigr)=O(x)=O(1),
\]

for each $j=1,2,\ldots$. This proves that each $\kappa_r(x)$ is bounded for $x>0$, and, accordingly,
\[
K_r(n)=\sum_{2\le m\le n}\kappa_r(m)=O(n)\qquad(r=2,3,\ldots).
\]
These estimates, together with those in (26) and (36), yield

\[
E\Bigl(\exp\Bigl(\frac{(Z_n-\mu_n)i\theta}{\sigma_n}\Bigr)\Bigr)=\exp\Bigl(-\frac{\theta^2}{2}+O\Bigl(\frac{|\theta|^3}{\sqrt n}\Bigr)\Bigr),\qquad(53)
\]

uniformly for $|\theta|\le\varepsilon\sqrt n$. We now derive a uniform bound of the form

\[
\bigl|E\bigl(e^{Z_ni\theta}\bigr)\bigr|\le e^{-\varepsilon n\theta^2}\qquad(|\theta|\le\pi;\ n\ge5,\ n\ne2^{L_n}),\qquad(54)
\]
for some $\varepsilon>0$. This bound, together with (53), will then be sufficient to prove the local limit theorem (52). For $n\ne2^{L_n}$, let $E\bigl(e^{B_ni\theta}\bigr)=e^{(L_n+1)i\theta}\sum_{k\ge0}p_{n,k}e^{ik\theta}$, where

\[
p_{n,k}:=\frac{n}{2^{L_n+k+1}}\Bigl\lfloor\frac{2^{L_n+k+1}}{n}\Bigr\rfloor-\frac{n}{2^{L_n+k}}\Bigl\lfloor\frac{2^{L_n+k}}{n}\Bigr\rfloor.
\]

When both $p_{n,0}$ and $p_{n,1}$ are nonzero, we have

\begin{align*}
\bigl|E\bigl(e^{B_ni\theta}\bigr)\bigr|
&\le 1-p_{n,0}-p_{n,1}+\bigl|p_{n,0}+p_{n,1}e^{i\theta}\bigr|\\
&=1-p_{n,0}-p_{n,1}+\sqrt{(p_{n,0}+p_{n,1})^2-2p_{n,0}p_{n,1}(1-\cos\theta)}\\
&\le 1-p_{n,0}-p_{n,1}+(p_{n,0}+p_{n,1})\Bigl(1-\frac{p_{n,0}p_{n,1}}{(p_{n,0}+p_{n,1})^2}(1-\cos\theta)\Bigr),
\end{align*}
by using the inequality $\sqrt{1-x}\le1-\frac x2$ for $x\in[0,1]$. Then, by the inequalities $1-x\le e^{-x}$ and $1-\cos\theta\ge\frac{2\theta^2}{\pi^2}$ for $|\theta|\le\pi$, we obtain, for $|\theta|\le\pi$,
\[
\bigl|E\bigl(e^{B_ni\theta}\bigr)\bigr|\le\exp\Bigl(-\frac{p_{n,0}p_{n,1}}{p_{n,0}+p_{n,1}}(1-\cos\theta)\Bigr)\le\exp\Bigl(-\frac{2}{\pi^2}\cdot\frac{p_{n,0}p_{n,1}}{p_{n,0}+p_{n,1}}\,\theta^2\Bigr),
\]
which holds for all $n\ge1$ provided we interpret $\frac00$ as zero. In this way, we see that

\[
\bigl|E\bigl(e^{Z_ni\theta}\bigr)\bigr|\le e^{-\frac{2}{\pi^2}\Lambda_n\theta^2}\le e^{-\frac15\Lambda_n\theta^2},
\]
for $|\theta|\le\pi$, where
\[
\Lambda_n:=\sum_{1\le k\le n}\frac{p_{k,0}p_{k,1}}{p_{k,0}+p_{k,1}}.
\]
We now prove that $\Lambda_n\ge\varepsilon n$ for some $\varepsilon>0$. Observe that $p_{n,0}=\frac{n}{2^{L_n+1}}$ when $n\ne2^{L_n}$, and

\[
p_{k,1}=\begin{cases}\dfrac{k}{2^{L_k+2}},&\text{if }2^{L_k}<k<\bigl\lceil\frac{2^{L_k+2}}{3}\bigr\rceil;\\[1ex]0,&\text{if }\bigl\lceil\frac{2^{L_k+2}}{3}\bigr\rceil\le k\le2^{L_k+1}.\end{cases}
\]

It follows that
\[
\Lambda_n\ \ge\ \sum_{2\le\ell<L_n}\ \sum_{2^{\ell}<k<\lceil2^{\ell+2}/3\rceil}\frac{\frac{k}{2^{\ell+1}}\cdot\frac{k}{2^{\ell+2}}}{\frac{k}{2^{\ell+1}}+\frac{k}{2^{\ell+2}}}\ \ge\ \sum_{2\le\ell<L_n}\frac{2^{\ell}}{18}\ \ge\ \varepsilon_0 2^{L_n}\ \ge\ \varepsilon n,
\]
for a sufficiently small $\varepsilon>0$. This completes the proof of (54) and the local limit theorem (52).

6 Implementation and testing

We discuss in this section the implementation and testing of the two algorithms FYKY and RS. We implemented the algorithms in the C language, taking as input an array of 32-bit integers (which is enough to represent permutations of size up to over four billion). To generate the needed random bits, we used the rdrand instruction, present on Intel processors since 2012 (Intel, 2012) and on AMD processors since 2015. This instruction provides access to a physical source of entropy and therefore does not have the biases of a pseudorandom generator. This choice also makes it easy to compare the performance of the algorithms without relying on third-party software. Alternatively, one could use a pseudorandom generator such as the Mersenne Twister, which is the default choice in most software (R, Python, Matlab, Maple) and runs faster than rdrand when properly implemented. But such a generator is known to be cryptographically insecure because all future iterations can be predicted once a sufficient number of outputs (624 in the case of MT19937) is available. The hardware instruction rdrand, in contrast, is designed to be cryptographically secure. Our implementation takes care not to waste any random bits and provides the option to track the number of random bits consumed. The implementation of Algorithm FYKY is rather straightforward, but that of Algorithm RS is more involved. First of all, the recursive calls in RS are handled in the following fashion, depending on the size of the input:

- for large inputs, we run the recursive calls in parallel using the Posix thread library pthread;
- for intermediate inputs, we run the recursive calls sequentially to limit the number of threads;
- for small inputs, we use the Fisher-Yates algorithm instead to reduce the number of recursive calls.
The cutoffs between small, intermediate and large inputs were determined experimentally; in our tests, thresholds of 2^16 and 2^20 proved efficient, but this may depend on the machine and other implementation details.

The second optimization for Algorithm RS concerns the splitting routine. Written naively, this routine contains a loop with an if statement depending on random data. This is a problem because branches are considerably more efficient if they can be correctly predicted by the processor during execution. We are able to avoid using branches altogether by vectorizing the code, i.e., using SIMD (Single Instruction, Multiple Data) processor instructions. Such instructions take as input 128-bit vector registers capable of storing four 32-bit integers and operate on all four elements at the same time. The C language provides extensions capable of accessing such instructions. Specifically, we used in our implementation two instructions, present in the AVX (Advanced Vector Extensions) instruction set supported by newer processors. They are vpermilps, which arbitrarily permutes the four 32-bit elements of a vector; and vmaskmovps, which writes an arbitrary subset of the four elements of a vector to memory. Both instructions take as additional input a control vector specifying the permutation or subset, of which only two bits out of every 32-bit element are read. We use these instructions to separate four elements of the permutation at a time into two groups. This can be done in 16 possible ways, which means that we have to supply each instruction with one of 16 possible control registers. We do this by building a master register containing all 16 of them in a packed fashion. We then draw randomly an integer r between 0 and 15 and shift every component of the master register by 2r bits to select the appropriate control register. This lets us handle four elements at a time without using branches.

Benchmarks. Below are our benchmarks for Algorithm FYKY, Algorithm RS and one of its parallel versions. The tests were performed on a machine with 32 processors.

    n      FYKY     RS       Parallel RS
    10^5   4.84ms   4.59ms   4.18ms
    10^6   51.1ms   51.6ms   18.5ms
    10^7   712ms    623ms    121ms
    10^8   12.5s    7.26s    1.04s
    10^9   145s     81.7s    10.3s

    Algorithm   Mean                          Variance
    RS          n log_2 n + (0.25 ± ε)n       (1.83 ± ε)n
    FYKY        n log_2 n + (0.33 ± ε)n       (1.56 ± ε)n

Table 1: Top: the execution times to sample permutations of sizes from 10^5 to 10^9 (each averaged over 100 runs for sizes up to 10 million and 10 runs otherwise). Bottom: the analytic results we obtained in this paper. Here c ± ε indicates fluctuations around the mean value c (coming from the periodic functions); see (8), (12), (27) and (43).

As expected, parallelism speeds up the execution by as much as a factor of 8. What is more surprising is that, even in its sequential form, Algorithm RS is nearly twice as fast as Fisher-Yates for the larger sizes, despite making a linearithmic number of memory accesses instead of a linear one. The reason has to do with the memory cache, which makes it more efficient to access memory sequentially rather than at haphazard places. The Fisher-Yates shuffle accesses memory at a random location at each iteration of its loop, causing a large number of cache misses. Algorithm RS, in comparison, does not have this drawback, which accounts for the observed gap in performance.

References

Anderson, R. J. (1990). Parallel algorithms for generating random permutations on a shared mem- ory machine. In Proceedings of the Second Annual ACM Symposium on Parallel Algorithms and Architectures, pages 95–102.

Andrés, D. M. and Pérez, L. P. (2011). Efficient parallel random rearrange. In International Symposium on Distributed Computing and Artificial Intelligence, pages 183–190. Springer.

Barker, E. B. and Kelsey, J. M. (2007). Recommendation for random number generation us- ing deterministic random bit generators (revised). US Department of Commerce, Technology Administration, National Institute of Standards and Technology, Computer Security Division, Information Technology Laboratory.

Berry, K. J., Johnston, J. E., and Mielke, Jr., P. W. (2014). A Chronicle of Permutation Statistical Methods (1920–2000, and Beyond). Springer, Cham.

Black, J. and Rogaway, P. (2002). Ciphers with arbitrary finite domains. In Topics in Cryptology - CT-RSA 2002, The Cryptographer’s Track at the RSA Conference, 2002, San Jose, CA, USA, February 18-22, 2002, Proceedings, pages 114–130.

Brassard, G. and Kannan, S. (1988). The generation of random permutations on the fly. Inf. Process. Lett., 28(4):207–212.

Chen, C.-H. (2002). Generalized association plots: Information visualization via iteratively gener- ated correlation matrices. Statistica Sinica, 12(1):7–30.

Devroye, L. (1986). Nonuniform Random Variate Generation. Springer-Verlag, New York.

Devroye, L. (2010). Complexity questions in non-uniform random variate generation. In Proceed- ings of COMPSTAT’2010, pages 3–18. Springer.

Devroye, L. and Gravel, C. (2015). The expected bit complexity of the von Neumann rejection algorithm. arXiv preprint arXiv:1511.02273.

Durstenfeld, R. (1964). Algorithm 235: Random permutation. Commun. ACM, 7(7):420.

Erdélyi, A., Magnus, W., Oberhettinger, F., and Tricomi, F. (1953). Higher Transcendental Functions. Vol. I. McGraw-Hill, New York.

Fisher, R. A. and Yates, F. (1948). Statistical Tables for Biological, Agricultural and Medical Research. 3rd ed. Oliver and Boyd, Edinburgh and London.

Flajolet, P., Fusy, É., and Pivoteau, C. (2007). Boltzmann sampling of unlabelled structures. In Proceedings of the Ninth Workshop on Algorithm Engineering and Experiments and the Fourth Workshop on Analytic Algorithmics and Combinatorics, pages 201–211. SIAM, Philadelphia, PA.

Flajolet, P., Gourdon, X., and Dumas, P. (1995). Mellin transforms and asymptotics: harmonic sums. Theoret. Comput. Sci., 144(1-2):3–58.

Flajolet, P., Pelletier, M., and Soria, M. (2011). On Buffon machines and numbers. In Proceedings of the Twenty-Second Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2011, San Francisco, California, USA, January 23-25, 2011, pages 172–183.

Flajolet, P. and Sedgewick, R. (1995). Mellin transforms and asymptotics: finite differences and Rice’s integrals. Theoret. Comput. Sci., 144(1-2):101–124.

Flajolet, P. and Sedgewick, R. (2009). Analytic Combinatorics. Cambridge University Press, New York, NY, USA.

Fuchs, M., Hwang, H.-K., and Zacharovas, V. (2014). An analytic approach to the asymptotic variance of trie statistics and related structures. Theoret. Comput. Sci., 527:1–36.

Grabner, P. J. and Hwang, H.-K. (2005). Digital sums and divide-and-conquer recurrences: Fourier expansions and absolute convergence. Constr. Approx., 21(2):149–179.

Granboulan, L. and Pornin, T. (2007). Perfect block ciphers with small blocks. In Fast Software Encryption, 14th International Workshop, FSE 2007, Luxembourg, Luxembourg, March 26-28, 2007, Revised Selected Papers, pages 452–465.

Horibe, Y. (1981). Entropy and an optimal random number transformation. IEEE Transactions on Information Theory, 27(4):527–529.

Hwang, H.-K. (2003). Second phase changes in random m-ary search trees and generalized quick- sort: convergence rates. Ann. Probab., 31(2):609–629.

Hwang, H.-K., Fuchs, M., and Zacharovas, V. (2010). Asymptotic variance of random symmetric digital search trees. Discrete Math. Theor. Comput. Sci., 12(2):103–165.

Intel (2012). Intel Digital Random Number Generator (DRNG): Software Implementation Guide. Intel Corporation.

Jacquet, P. and Szpankowski, W. (1998). Analytical de-Poissonization and its applications. Theo- ret. Comput. Sci., 201(1-2):1–62.

Kimble, G. W. (1989). Observations on the generation of permutations from random sequences. International Journal of Computer Mathematics, 29(1):11–19.

Knuth, D. E. (1998a). The Art of Computer Programming. Vol. 2, Seminumerical Algorithms. Addison-Wesley, Reading, MA.

Knuth, D. E. (1998b). The Art of Computer Programming. Vol. 3. Sorting and Searching. Addison- Wesley, Reading, MA. Second edition.

Knuth, D. E. and Yao, A. C. (1976). The complexity of nonuniform random number generation. Algorithms and Complexity: New Directions and Recent Results, pages 357–428.

Koo, B., Roh, D., and Kwon, D. (2014). Converting random bits into random numbers. The Journal of Supercomputing, 70(1):236–246.

Laisant, C.-A. (1888). Sur la numération factorielle, application aux permutations. Bulletin de la Société Mathématique de France, 16:176–183.

Langr, D., Tvrdík, P., Dytrych, T., and Draayer, J. P. (2014). Algorithm 947: Paraperm—parallel generation of random permutations with MPI. ACM Trans. Math. Softw., 41(1):5:1–5:26.

Lehmer, D. H. (1960). Teaching combinatorial tricks to a computer. In Proc. Sympos. Appl. Math. Combinatorial Analysis, volume 10, pages 179–193.

Louchard, G., Prodinger, H., and Wagner, S. (2008). Joint distributions for movements of elements in Sattolo’s and the Fisher-Yates algorithm. Quaest. Math., 31(4):307–344.

Lumbroso, J. (2013). Optimal discrete uniform generation from coin flips, and applications. CoRR, abs/1304.1916.

Mahmoud, H. M. (2003). Mixed distributions in Sattolo’s algorithm for cyclic permutations via randomization and derandomization. J. Appl. Probab., 40(3):790–796.

Massey, J. L. (1981). Collision-Resolution Algorithms and Random-Access Communications. Springer.

Moses, L. E. and Oakford, R. V. (1963). Tables of Random Permutations. Stanford University Press.

Nakano, K. and Olariu, S. (2000). Randomized initialization protocols for ad hoc networks. IEEE Transactions on Parallel and Distributed Systems, 11(7):749–759.

Neininger, R. and Rüschendorf, L. (2004). A general limit theorem for recursive algorithms and combinatorial structures. Ann. Appl. Probab., 14(1):378–418.

Petrov, V. V. (1975). Sums of Independent Random Variables. Springer-Verlag, New York- Heidelberg. Translated from the Russian by A. A. Brown, Ergebnisse der Mathematik und ihrer Grenzgebiete, Band 82.

Plackett, R. L. (1968). Random permutations. Journal of the Royal Statistical Society. Series B (Methodological), 30(3):517–534.

Pokhodzeĭ, B. B. (1985). Complexity of tabular methods of simulating finite discrete distributions. Izv. Vyssh. Uchebn. Zaved. Mat., (7):45–50 & 85.

Prodinger, H. (2002). On the analysis of an algorithm to generate a random cyclic permutation. Ars Combin., 65:75–78.

Rao, C. R. (1961). Generation of random permutation of given number of elements using random sampling numbers. Sankhya A, 23:305–307.

Ravelomanana, V. (2007). Optimal initialization and gossiping algorithms for random radio net- works. IEEE Transactions on Parallel and Distributed Systems, 18(1):17–28.

Ressler, E. K. (1992). Random list permutations in place. Inf. Process. Lett., 43(5):271–275.

Ritter, T. (1991). The efficient generation of cryptographic confusion sequences. Cryptologia, 15(2):81–139.

Rivest, R. L. (1994). The RC5 encryption algorithm. In Fast Software Encryption: Second Inter- national Workshop. Leuven, Belgium, 14-16 December 1994, Proceedings, pages 86–96.

Robson, J. M. (1969). Algorithm 362: Generation of random permutations [G6]. Commun. ACM, 12(11):634–635.

Sandelius, M. (1962). A simple randomization procedure. Journal of the Royal Statistical Society. Series B (Methodological), 24(2):472–481.

Titchmarsh, E. C. (1986). The Theory of the Riemann Zeta-Function. The Clarendon Press, Oxford University Press, New York, second edition. Edited and with a preface by D. R. Heath-Brown.

von Neumann, J. (1951). Various techniques used in connection with random digits. Journal of Research of the National Bureau of Standards. Applied Mathematics Series, 12:36–38.

Waechter, M., Hamacher, K., Hoffgaard, F., Widmer, S., and Goesele, M. (2011). Is your permutation algorithm unbiased for n ≠ 2^m? In Parallel Processing and Applied Mathematics, pages 297–306. Springer.

Wagner, S. (2009). On tries, contention trees and their analysis. Annals of Combinatorics, 12(4):493–507.

Wilson, M. C. (2009). Random and exhaustive generation of permutations and cycles. Ann. Comb., 12(4):509–520.
