CRYPTOGRAPHIC MERSENNE TWISTER AND FUBUKI STREAM/

MAKOTO MATSUMOTO, TAKUJI NISHIMURA, MARIKO HAGITA, AND MUTSUO SAITO

Abstract. We propose two stream ciphers based on a non-secure pseudoran- dom number generator (called the mother generator). The mother generator is here chosen to be the Mersenne Twister (MT), a widely used 32-bit integer generator having 19937 bits of internal state and period 219937 − 1. One proposal is CryptMT, which computes the accumulative product of the output of MT, and use the most significant 8 bits as a secure random numbers.Itsperiodisprovedtobe219937 − 1, and it is 1.5-2.0 times faster than the most optimized AES in counter-mode. The other proposal, named Fubuki, is designed to be usable also as a block cipher. It prepares nine different kinds of functions (bijections from blocks to blocks), each of which takes a parameter. Fubuki encrypts a sequence of blocks (= a plain message) by applying these encryption functions iteratedly to each of the blocks. Both the combination of the functions and their parameters are pseudorandomly chosen by using its mother generator MT. The and the initial value are passed to the initialization scheme of MT.

1. Introduction In this paper, we consider cryptographic systems implemented in software. We assume a 32-bit CPU machine with fast multiplication of words, and a moderate size of working area (about 4K bytes). In a narrow sense, a system is to generate cryptographically secure pseudorandom numbers (PN) from a shared key, and take exclusive-or with the plain message to obtain ciphered message. One way to generate such PN is to use a non-secure generator like LFSR (which we call the mother generator), initialize it by using the key, and then filter its outputs, i.e., apply some complicated functions to obtain a secure sequence. Along this line, we propose to use a GF(2)-linear generator whose internal state consists of 19937 bits, Mersenne Twister (MT) (see §3 for the detail). MT is invented by two of the authors [4]. It has period 219937 − 1 and uniform equidis- tribution property upto 623 dimension. Its initialization scheme is improved to accept an array of any length as an initial seed in 2002. MT is widely accepted in

Date: June 1, 2005. Key words and phrases. Mersenne Twister, non-secure random number generator, stream ci- pher, CryptMT, Fubuki, AES. The cipher systems CryptMT and Fubuki in this paper are proposed to ECRYPT Stream Cipher Proposal http://www.ecrypt.eu.org/stream/. The reference codes and a prototype of this paper are available there. The first author was supported in part by JSPS Grant-In-Aid #14654021 and #16204002, and Hiroshima University President’s Discretion Fund ’05. 1 2 MAKOTO MATSUMOTO, TAKUJI NISHIMURA, MARIKO HAGITA, AND MUTSUO SAITO

Figure 1. CryptMT: period 219937 − 1, and twice faster than the optimized AES (Pentium-M, gcc -O3) the society of MonteCarlo simulations, and implementations in C and many other languages are available from the homepage [5]. As described in this homepage, a way to generate a cryptographically secure PN sequence is to use MT, and compress its outputs by using, say, MD5 or SHA1.

2. CryptMT The first proposal in this paper is even simpler. MT generates a sequence of unsigned 32 bit integers (which from now on we shall call words). The given key and initial value are concatenated and passed to the initialization scheme (§3) of MT. We prepare a variable accum of word size, which is set to 1 at the beginning (this may be any odd integer). Then, we iterate the following process to obtain (probably) a cryptographically secure PN sequence of 8-bit integers (= byte): (1) Generate one pseudorandom word gen rand by MT. (2) Multiply it to accum: accum ← accum × (gen rand | 1). (3) Output the most significant 8 bits of accum.GotoStep1. To raise the security, the first 64 bytes of the outputs are discarded. Here the C-language-like notation “|” denotes bitwise-OR operation. This op- eration is to make the multiplier odd (otherwise, after several iterations, accum would be zero). Multiplication is considered modulo 232.

This method generates a PN sequence of bytes, which fits to the usual require- ments for a stream cipher. We call this stream cipher CryptMT, meaning Crypto- graphic Mersenne Twister. Our experiment shows that CryptMT is faster by a factor of 1.5–2.0 than the well- optimized counter-mode AES (see §6.1), widely known as rijndael-alg-fst.c. The size of the internal state of MT seems to be enough to make any kind of time-memory-trade-off attacks infeasible. If all bits of accum were used (differently from the 8 bits as in CryptMT) then the sequence would not be cryptographically secure, since from the change of the CRYPTOGRAPHIC MERSENNE TWISTER AND FUBUKI STREAM/BLOCK CIPHER 3 accum we could recover the output of MT (except for the least significant bit), then by linear algebra one can decide the internal state after observing 19937 bits of the output. However, if only the most significant 8 bits after multiplication are observed, then we can not imagine how to obtain the internal state of MT. It is important to use the most significant bits: the least significant bit is always 1, and the second bit of accum coincides with the summation (modulo 2) of the second bit of the outputs of MT so far, from which one could compute MT’s internal state. The most significant bits seem to be safe, since the bit-diffusion pattern of the multiplication is from right to left, and most significant bits gather information of all the less significant bits of the two operands: accum and the output of MT. The above gave a complete description of CryptMT, except for the description of the mother generator MT (§3). Security of CryptMT is largely depending on the mother generator MT and its initialization. The facts that (1) the size of internal state of MT is huge, (2) 3/4 of the output bits of MT are discarded, (3) MSBs after multiplication gather information of all bits, and (4) initialization is highly nonlinear, seem to imply high security, but we need more detailed study. The period of each of 8 bits of the output of CryptMT is 219937 − 1 (see Appendix A). Design rationales of CryptMT are (1) use a fast linear generator, which has a huge state (e.g. thousands of bits), and (2) filter its output by a finite state non-linear automaton which has a relatively small state (e.g. one word), (3) Only a small fraction of the information of the state is output (e.g. 8 bits among 32 bits). as seen in Figure 1. The former ensures the long period, and the latter ensures complicated bit-diffusion, under a compromise with the speed in software imple- mentation.

3. Mersenne Twister MT generates a PN word sequence by the GF(2)-linear recursion

w624+i = w397+i ⊕ ((wi&0x80000000)|(w1+i&0x7fffffff))A (i =0, 1, 2,...).

Here wi (i =0, 1, 2,...) are 32-bit integers, each of which is considered as a 32- dimensional row vector over the two element field GF(2). The binary operator ⊕ denotes bitwise exclusive-or, i.e., addition as a vector. The C-like hexadecimal notation 0x80000000 denotes the vector whose compo- nents are all zero except for the left most 1. Thus, ((wi&0x80000000)|(w1+i&0x7fffffff)) is the row vector obtained by concatenating the MSB of wi and all bits but the MSB of w1+i. To this vector a constant 32 × 32 matrix A is multiplied from the right. This matrix A is of the form ⎛ ⎞ 1 ⎜ ⎟ ⎜ 1 ⎟ ⎜ ⎟ ⎜ .. ⎟ , ⎜ . ⎟ ⎝ 1 ⎠ a31 a30 ··· ··· a0 4 MAKOTO MATSUMOTO, TAKUJI NISHIMURA, MARIKO HAGITA, AND MUTSUO SAITO and so the multiplication is computed by shiftright(x) (if the LSB of x is 0) xA = shiftright(x) ⊕ a (if the LSB of x is 1), where a is a constant vector

a =(a31,a30,...,a0)=0x9908B0DF in the hexadecimal notation. These constants are chosen so that the period of the sequence is 219937 − 1. The number of nonzero terms in the characteristic polynomial of the state transition function is 153. Figure 2 illustrates the state transition of MT. The state consists of 623 words + 1 bit. The next state is obtained by shifting one words to the above, and insert a new word linearly computed from the discarded part and a middle term. In a software implementation, instead of shifting, we use round robin technique (i.e. to use a pointer) for the efficiency of the generation. Advantages of this configuration over usual LFSR with coefficients, say, in GF(232), are (1) No need of costive operations like “multiplication modulo polynomials,” (2) There is a fast algorithm to check the maximality of the period by paring and Galois theory [4], (3) The generation speed is independent of a, which makes the parameter-search easier. This is different from LFSR, where the generation speed depends on a particular choice of the coefficients. Remark 3.1. In the original MT, the output is transformed by a linear trans- formation to attain nearly optimally high-dimensional equidistribution at MSBs. This is called Tempering. We removed this transformation, since we think it non- necessary because of the application of complicated functions to the outputs of MT in the encryption process.

For initialization, we need to specify w0,w1,...,w623 as the initial state (to be precise, all the 31 bits of w0 but the most significant bit are neglected in generating the next word, so the state space has 624 × 32 − 31 = 19937 bits). The output sequence of MT is w624,w625,..., i.e., MT skips the contents of the initial state. The initialization is particularly important for MT. Because of the linearity and the sparseness of the recursion of MT, if the initial state has too many zeroes, then the output sequence has same tendency for more than 10000 outputs. The 2002 version of MT [5] has an initialization function init by array(u32 init key[], int key length), whose first argument is an array of 32-bit words with length given by the second argument, which is described as follows. First, w0,...,w623 are set to a fixed nontrivial value by a recursion

w0 ← 19650218, (1) wi ← ((wi−1 ⊕ (wi−1 >> 30)) + i) × 1812433253

(1 ≤ i ≤ 623), where wi are considered as 32-bit unsigned integer variables, and every arithmetic operation is modulo 232. The notation >> 30 means the shift to the right by 30 bits. (The constant 19650218 is the birthday of one of the authors, chosen without reason. The other constant 1812433253 is a multiplier for a linear congruential generator [3, P.106], here chosen without reason.) This recursion is chosen to have a good bit-diffusion property. The multiplication with constant has a good bit-mixing property, except for that the diffusion of the CRYPTOGRAPHIC MERSENNE TWISTER AND FUBUKI STREAM/BLOCK CIPHER 5

Figure 2. State transition of Mersenne Twister bit information is always from right to left. The most significant two bits (which gather the information of all bits after multiplication) are sent to the least significant two bits of wi by exclusive-or, to complement the multiplication. The assignment wi−1 → wi is bijective. Addition with i is to avoid the following phenomenon. Suppose that i is not added in the recursion (1). Suppose that an initial value w0 is chosen (although it is fixed to 19650218 in the above implementation) and let w0,w1,...,w623 be the generated sequence. Suppose that in another initialization another initial value w0 is chosen, which generates a sequence w0,w1,...,w623. What we worry is that it may happen that w0 = w1 by accident (or w0 = w2, or alike), and then, wi = wi+1 for i =0, 1,...,622. Such similarity of the initial states yields correlated outputs for 10000 outputs or so according to the experiments. The addition with i avoids these phenomena. The initial seed is given as an array init key[length] of an arbitrary length length upto 64. The initialization scheme init by array rewrites the above w1,...,w623 by the following recursive substitution:

wi ← (wi ⊕ (((wi−1 ⊕ (wi−1 >> 30)) × 1664525)) (2) +init key[i mod length]+(i mod length) for i =1, 2,...,623. Note that every multiplication or addition is done modulo 232. This recursion is chosen in the same spirit as (1), with adding init key[] meanwhile. The reason why taking “i modulo length” at the last of the recur- sion is as follows. Suppose that we don’t take modulo length. Suppose that one initialization is given by an array init key[], and another initialization is given another array of the twice length with the content being the two repetitions of the 6 MAKOTO MATSUMOTO, TAKUJI NISHIMURA, MARIKO HAGITA, AND MUTSUO SAITO original array. Then the two initializations yield the same state. Such phenomenon is avoided by taking modulo length. Then, we substitute the first word

w0 ← w623, and again apply a similar recursive substitution

(3) wi ← (wi ⊕ (((wi−1 ⊕ (wi−1 >> 30)) × 1566083941)) − i for i =1, 2, 3,...,623. Finally, the most significant bit of w0 is set to one, to avoid the zero initial state. According to the experiments, this initialization has a good bit-distribution prop- erty. Any one bit of the change in the initial seed array init key[] dramatically changes the initial state. The worst related keys seem to be those having difference only at the last word of init key[]. However, the output of MT starts from w624, which depends on w397, which seem to be difficult to control because of at least 397 − (64 + 64) times application of (2) at the last word of init key[]. (This 64 + 64 is because the size of the key and the size of the initial value are upto 64 words). In addition, each word of the internal state is transformed by the nonlinear bit-mixing recurrence (3) on the key. It seems very difficult to utilize the technique of differential with respect to the key. Remark 3.2. The above initialization scheme was incorporated in 2002, and it actually admits an array of arbitrary length as the initial seeding vector [5]. This feature is to answer to the requests of financial engineers, who want to use the ascii code of the name of each companies in the stock markets, as the initial-seed array. This initialization is not designed for cryptographic purpose, but it seems to have enough resistance, so we keep it as is. However, from the viewpoint of efficiency, this is redundant. A quicker initialization is possible. For example, we do not need the first round (setting constants to the state array) in the initialization.

4. Fubuki The other proposal in this paper is Fubuki cipher system. The basic idea is: “software is soft, so we can make the choice of encryption functions as flexible as possible.” This may be contrasted to Rijndael block cipher, where in each round the encryption function is fixed: just the key (to be exor-ed) is changed. The proposal of Fubuki is to change the parameters of each functions. To fix the situation, we assume that a block consists of 4 words (i.e. 4 of 32-bit integers), but the reference implementation of Fubuki allows 4, 8 or 16 words as one block. We use the notations W := the set of 32 bit words, B := W 4 = the set of blocks. A plain message is a finite sequence of blocks, i.e. an element of BL, where L ∈ N is the length of the message. This paper considers a stream cipher in a more general sense than taking exclusive- or with PN. Let K be the key space. In a typical case, K = W 4 = the set of 128 bits. Definition 4.1. A stream cipher is a sequence of functions called encoding func- tions Ei : B→B,i=1, 2,..., CRYPTOGRAPHIC MERSENNE TWISTER AND FUBUKI STREAM/BLOCK CIPHER 7 and another sequence of functions called decoding functions

Di : B→B,i=1, 2,... such that Di ◦ Ei =id (the identity function). A plain message B1,B2,...is encoded into E1(B1),E2(B2),..., and then decoded by D1(E1(B1)),D2(E2(B2)),....EachofEi,Di depends on both the key K and i.

Remark 4.2. If Ei and Di are identical for any i, then the system is called a block cipher. Remark 4.3. A stream cipher in Definition 4.1 can be used to generate a (possibly cryptographically secure) PN, by merely encoding a message consisting of, say, all zero. The basic strategy of Fubuki is to compose several simple encryption functions, i.e., Shannon’s “product” in his 1949 paper. We use the following definition, which is nothing but a usual block cipher system (if we consider the parameter set as the set of keys). Definition 4.4. A primitive encryption family PF with a parameter set P is a mapping PF : P×B→B with its inverse family PF PF : P×B→B such that for all P ∈P and B ∈B PF(P, (PF(P, B))) = B holds. The size of P may vary among different PEFs. Let us denote by PF(P, −):B→B,B→ PF(P, B) the bijection associated to the PF with parameter P . Fubuki prepares nine different primitive encryption families. Four of them are designed to diffuse the bit-information mainly inside each word in the block (word- wise PEF), four of them are designed to mix the information of words (inter-word PEF), and the last one is designed to cut off the incidence relation of bits in each word (vertical rotate, denoted by PFv-rot). The given key and the given initial value are passed to the initialization of the mother generator MT. Using the non-secure PN sequence generated by MT, word Fubuki pseudorandomly selects one of the four wordwise PEFs, say PF1 ,and word word word its parameter P1 , and apply PF1 (P1 , −)toB1. Then similarly select inter inter one of the four inter-word PEFs, say PF1 , and its parameter P1 , and apply it to the result. Then apply PFv-rot with pseudorandomly selected parameter v-rot P1 . This is one round of Fubuki encryption, and it is repeated several times. The choice in the reference code is four times iteration for each block. Thus, the block Bi is encoded by applying

Ei := Round4,i ◦ Round3,i ◦ Round2,i ◦ Round1,i 8 MAKOTO MATSUMOTO, TAKUJI NISHIMURA, MARIKO HAGITA, AND MUTSUO SAITO where each round is given by

Roundj,i = v-rot v-rot inter inter word word PF (Pj,i , −) ◦ PFj,i (Pj,i , −) ◦ PFj,i (Pj,i , −)(j =1, 2, 3, 4), inter where PEFs and their parameters are pseudorandomly chosen by MT. (The PFj,i is uniformly pseudorandomly chosen from four inter-word PEFs and its parameter word is uniformly chosen from its parameter space. Similarly for PFj,i .) The decoding function Di is its inverse, given by Ei := Round1,i ◦ Round2,i ◦ Round3,i ◦ Round4,i where each round is given by Roundj,i = word word inter inter v-rot v-rot PFj,i (Pj,i , −) ◦ PFj,i (Pj,i , −) ◦ PF (Pj,i , −)(j =1, 2, 3, 4), where denotes inverse PEFs. The design rationale of Fubuki is as follows. (1) An idea is to choose simplest operations (i.e. those in the instruction set of typical CPU, such as exor or multiplication) as the building blocks of PEF. Any complicated operation is made from simple operations, so freely composing simple ones seems better than fixing one way. (2) However, if it is too free, then (with very small probability) it may happen that one same PEF is selected all the time. So, there should be a trade-off between freedom to choose a combination of PEFs and restriction to assure good bit-information diffusion. Fubuki did this by making each PEF a combination of a few simple operations. (3) Fubuki has no S-boxes. In some sense, the integer multiplication replaces the S-boxes. The integer multiplication has a good and fast bit-diffusion property. It has two weakness: (1) the bit-diffusion is only from the right bits to the left ones, (2) it is (bi)-linear, so differential cryptanalysis is valid. However, Fubuki compensates these by (1) suitable bit-operations with left-to-right diffusion property and (2) combining exclusive-or to make it non-linear. A recent study warns that any cryptographic system in a fast implemen- tation using a large lookup table for S-box is vulnerable by cache-timing attack [6] [2]. This method breaks AES. (Fubuki has two tables of 32 words, which may leak some information. We need further study). (4) Fubuki consumes far (e.g. 13 times) more PNs (from its mother generator) than the size of the plain message: since the parameter space of each PEF is large (actually we arranged the size to be nearly the same with one block), every round consumes three times block-size of PNs. This redundancy makes it difficult to guess the internal state of the mother generator, even by chosen-plain text attack with chosen initial val- ues. Actually, Fubuki has an aspect of block cipher. Suppose that the key and the initial value are fixed, and these are repeatedly used in a stream cipher for different texts (which is prohibited usually for stream cipher; it is the context for block ciphers). Then, a stream cipher in a narrower sense (i.e. exor with PN) is easily CRYPTOGRAPHIC MERSENNE TWISTER AND FUBUKI STREAM/BLOCK CIPHER 9 broken by known-plain text attack, since the PN sequence is recovered by taking exor. In a block cipher, it is required that Ei (and Di) are difficult to guess from an (even huge) number of pairs (B,Ei(B)) where B can be chosen (i.e. chosen plain text attack). Fubuki is designed to have this type of resistance.

5. Description of Fubuki 5.1. Overview. Fubuki prepares four wordwize PEFs empr, emer, emps, emes and four inter-word PEFs ma, mem, ome, eme and one PEF vert rotate. One round of Fubuki consists of three stages: choose one of the four wordwise PEFs and apply it to the plain block, then choose one of the four inter-word PEFs, then apply vert rotate. In C-like notation, it is described as c = pseudorandom_two_bits(); switch (c) { case 0: crypt_empr(msgbuf); break; case 1: crypt_emer(msgbuf); break; case 2: crypt_emps(msgbuf); break; case 3: crypt_emes(msgbuf); break; }

c = pseudorandom_two_bits(); switch (c) { case 0: crypt_ma(msgbuf); break; case 1: crypt_mem(msgbuf); break; case 2: crypt_ome(msgbuf); break; case 3: crypt_eme(msgbuf); break; } crypt_vert_rotate(msgbuf); Here, msgbuf is an array of block size, and each PEF rewrites this array. The parameters are generated in each of PEF by calling MT, so not visible in this description. This round is iterated for Iteration times (which is 4 in default case, for 128 bit blocks). The two-bits pseudorandom integers are generated as follows. We use C-language- like notation. genrand_tuple_int32(func_choice, 4); func_choice[2] *= (func_choice[0] | 1); func_choice[3] *= (func_choice[1] | 1); func_choice[0] ^= (func_choice[3] >> 5); func_choice[1] ^= (func_choice[2] >> 5); Here, the first function fills four PNs into the array func choice. The next four transformations mix these four words. The notation *= is to multiply the right hand side to the left and write to the left, ^= is similar operation with exclusive-or, | is bitwise OR operation, >> 5 is bit-shift-to-right by 5 bits. Fubuki uses the 128 bits in this array for function choice: first use the most significant two bits of func choice[0], then next two bits of func choice[0],and 10 MAKOTO MATSUMOTO, TAKUJI NISHIMURA, MARIKO HAGITA, AND MUTSUO SAITO so on. Thus, if Iteration=4, then most significant 16 bits of func choice[0] are used to choose PEFs 2 × 4 times. 5.2. Primitive encryption families. To make the explanation and the imple- mentation simpler, we choose one uniform parameter space P for all PEFs. Let t be the number of words in the block (typically four, and assumed to be a power of 2). Then, B := W t, and we set P := W t ×{1, 2,...,t− 1}. We denote an element of P ∈P and B ∈Bby

P =(p0,p1,...,pt−1, jump),B=(b0,b1,...,bt−1).

Each of bi is considered as a variable of wordsize. 5.3. Wordwise PEFs. Each of wordwise PEFs is described as follows. Again, t is the number of the words in a block. The block B is transformed as follows. Let j = 0. The first operation is

bj ← bj ⊕ pj. Note that its inverse is itself. Then, multiply a constant 32 bj ← bj × cj mod 2 . 32 The constant cj should be (multiplicatively) invertible modulo 2 , i.e., should be odd. Their inverses are necessary in decoding, and would be time-consuming if they were computed during the decoding process. So, before starting encryption, Fubuki prepares 32 pseudo-randomly chosen 32-bit integers m0,m1,...,m31,using MT, and store them in an array mult table. Before decoding, Fubuki computes their inverses m0,m1,...,m31, and store them in an array inv table. These multipliers mk (k =0,...,31) are the first 32 outputs of the initialized MT, but by bit-operations we set the least significant four bits of mk to 1011 for k even, and to 0111 for k odd. Moreover, the ((k mod 8) + 1)-st bit and ((k mod 8) + 2)-nd bit of mk are set to 1 and 0, respectively. This is to avoid too trivial multipliers like 1 or 232 − 1. The constant cj is chosen from these multipliers by putting c m ,k p >> − . j := kj j =( j+ mod t (32 5)) Here, is a constant and >> (32 − 5) means right-shift by 32 − 5 bits. For empr, emer, emps, emes, the constant is 1, 2, 2, 3, respectively. Such a multiplication diffuses information of bits in bj from right to left. To force the diffusion in the inverse direction, Fubuki prepares 32 pseudorandomly chosen 32-bit constants a0,a1,...,a31 using MT, stored in an array add table of 32 words. This array is filled with the next 32 outputs of MT after setting mk’s. To avoid trivial constants, apply the following operation: for (i=0; i< 32; i++) { u32 s; s = (i * 1103515245 + 12345) & 31; s^=(s>>2); add_table[i] <<= 5; add_table[i] |= s; } CRYPTOGRAPHIC MERSENNE TWISTER AND FUBUKI STREAM/BLOCK CIPHER 11

That is, the content of add table[i] is shifted to left by five bits (the notation <<= 5), and the least five bits are set to be si which is given by

si ← (1103515245i + 12345) mod 32, si ← si ⊕ (si >> 2), for i =0,...,31. (The notation & denotes bitwise-AND, so & 31 is taking modulo 32.) This is just to make the correspondence i → si a complicated permutation on {0, 1,...,31}. Then, apply the operation

(4) b(j+jump)modt ← b(j+jump)modt  (add table[bj >> (32 − 5)]), where  is plus (modulo 232)forempr and emps, and exclusive-or for emer and emes. Here, jump is a component of the parameter. Each PEF is designed to rewrite bj depending on the information in bj−jump mod t. The parameter jump is not randomly chosen: jump is set to 1 before encoding each block, and after executing each PEF, it is rewritten by

jump ← jump × 2; if jump ≥ t then jump ← 1.

Thus, jump moves 1, 2, 4, 8,. . . , and if it becomes more than or equal to t,itis set to 1. This is to scatter the information of one word to the other words in the block as quick as possible. Partial sum of 1, 2, 4,. . . represents any integer, so after log2(t) times iteration of PEFs, the information of one word tends to be passed to all the other words in the block. A shortcoming of a feedback (4) is that there are only 32 different patterns may occur. However, the most significant bits of bj gather the information of all the lower bits by multiplying cj, so a change of one bit in bj tends to be reflected in the choice of an element in the array add table, so having good bit-diffusion property. The final operation of wordwise PEF is “rotation” or “shift”. Using pj, obtain pseudorandom integer between 16 and 23 by the following bit-operation

(5) sj ← ((pj >> (32 − 4))|0x10)&0x17, where 0x denotes that the following number is hexadecimal. Then, the “rotation” operation

bj =((∼ bj) << (32 − sj))|(bj >> sj); is computed. This is rotate to the right by s bits, but the rotated s bits are reversed. The unary operator ∼ means the bit reverse. The “shift” operation is

(6) bj ← bj ⊕ ((∼ bj) >> sj).

Since sj is not less than 16 and a word is 32-bit, the inverse of this operation is itself. The “rotate” is chosen for empr and emer, and the “shift” is chosen for emps and emes. The above operations (i.e. the operations stated in this subsection) are applied for j =0, 1,...,t− 1 in this order. This is the description of four wordwise PEFs. 12 MAKOTO MATSUMOTO, TAKUJI NISHIMURA, MARIKO HAGITA, AND MUTSUO SAITO

5.4. Inter-word PEFs. Wordwize PEFs are mainly to mix the information of each word, except for the operation involving jump, which passes (only) 5-bit of information to another word. Inter-word PEFs ma, mem, ome, eme are designed to pass more information of each word to the other words. First we describe ma.Letj =0.Putk := (j − jump)modt. Then, we apply the Feistel-network-type operation

bj ← bj +(bk × pj).

Compute a pseudorandom integer sj by (5), and apply the “shift” operation (6). Iterate this for j =0, 1,...,t− 1, in this order. This describes the PEF ma.The other three PEFs are as follows. Let j =0.Putk := (j − jump)modt. Generate a pseudorandom integer s between 0 and t − 1 such that s = j,by

s ← pj >> (32 − log (t)); (7) 2 if s = j then s ← ((s − 1) mod t). Then, in the case of mem, apply the operation

bj ← (bj ⊕ (bk × bs)) − pj bj ← bj ⊕ (bj >> 16). In the case of ome or eme, apply

bj ← (bj ⊕ (bk × (bspj)) bj ← bj ⊕ (bj >> c), where  =bitwise-OR and c =16forome and  = ⊕ and c =17foreme. Iterate this for j =0, 1,...,t−1, in this order. This completes the description of inter-word PEFs.

5.5. crypt vert rotate. This PEF is to cut off the incidence relation among bits 32 in each word. Put k := 2 ∗ (p0 + pt−1) + 1 (modulo 2 ). Put jump odd := (jump − 1)|1, which is the largest odd integer not exceeding jump. We consider B to be t × 32 matrix with components 0-1, and permute each row vector of B selected by the bit mask k (i.e., if n-th bit of k is 1, then the n-th row is selected). The permutation of a row vector is rotation with lag jump odd. Namely, a t-dimensional row vector t t (x0,x1,...,xt−1) is mapped to (x−o,x1−o,...xt−1−o), where o = jump odd and the subscripts are considered modulo t.Thejump odd is forced to be odd, since it is faster to compute a cyclic permutation.

5.6. Security of Fubuki. Heuristically, one round of Fubuki has much better bit- diffusion property than one round of AES. Each of steps in one round in AES has some corresponding steps in Fubuki: ByteSub and ShiftRow in AES are replaced by wordwise PEFs (multiplication plus feedback from right-to-left (4)), MixColumn is included in inter-word PEFs and AddRoundKey appears in every PEFs where the parameters pi are added, exor-ed, or subtracted. The 19937 bits of internal state of MT seems to make time-memory-trade-off attacks infeasible. Although, we have to say that we need to study more on the security, such as difference propagation property. Experimental results are shown in §6.2 and Appendix B. CRYPTOGRAPHIC MERSENNE TWISTER AND FUBUKI STREAM/BLOCK CIPHER 13

algorithm initial encrypt decrypt PowerPC crypt-mt 256601 29 30 1.33GHz fubuki 414029 204 407 rijndael-alg-fst 2793 41 38 rijndael-alg-ref 33117 1068 1063 PentiumM crypt-mt 279123 20 19 1.4GHz fubuki 489662 133 256 rijndael-alg-fst 3192 40 41 rijndael-alg-ref 85253 916 916 Table 1. execution time (number of cycles per byte)

∆fubuki,1 ∆rijndael,1 00100010111011011111001100110011 00111110000000000000000000000000 11101111001111111100110100011100 00000000000000000000000001010011 00101101111110010000000100100010 00000000000000001111000100000000 10100001000101110110011111100101 00000000010100000000000000000000 Table 2. Differential of one-round Fubuki and two-round AES

6. Comparison with AES 6.1. Speed and memory. Table 1 lists the approximate number of CPU cycles consumed to (1) setup keys, (2) encrypt one byte, (3) decrypt one byte, for four stream ciphers CryptMT, Fubuki, optimized AES (rijndael-alg-fst.c) and reference AES (rijndael-alg-ref.c) [1]. Rough estimate of the size of the working area is 2.5K bytes, 3K bytes, 10.5K bytes, and 1.4K bytes, respectively.

6.2. Bit-diffusion. Let : B→Bbe one round of Fubuki encryption, with both the key and the initial value are 128 bits of zeroes. To grasp the bit-diffusion property of E0, we compute

∆fubuki,i := E0(i) ⊕ E0(0), where i is a 128-bit vector with all zero components except for the i-th bit being 1. For comparison, we compute

∆rijndael,i := A0(i) ⊕ A0(0), where A0 is the 2-round AES with randomly chosen key. The result for i = 1 is described in Table 2. The bit patterns suggest that one round of Fubuki seems to have quicker bit-diffusion property than two rounds of rijndael. Similar results are observed for all 1 ≤ i ≤ 128.

Appendix A. Period of CryptMT In this appendix, we shall prove the following. Theorem A.1. The period of the output sequence of CryptMT is P := 219937 − 1. Moreover, every bit of the sequence has period P . This may seem to be obvious, but we think not so. 14 MAKOTO MATSUMOTO, TAKUJI NISHIMURA, MARIKO HAGITA, AND MUTSUO SAITO

Let x0, x1, x2,... be the output of MT. MT has the following 623-dimensional equidistribution property. Lemma A.2. Any bit pattern of (32 × 623) bits occurs the same (actually two) times in the form of the tuple

(xi, xi+1,...,xi+622) when we move i =0, 1,...,P−1. There is an exception: the all-zero pattern occurs only once. In the following proof, we use only this property and the fact that P is a prime. Let us define xi to be xi with the least significant bit set to 1. We denote by Z/232 the residue ring modulo 232 identified with 32-bit integers, and by (Z/232)× its multiplicative group, identified with 32-bit odd integers. In the following, mul- tiplication of 32-bit integers are taken in this group, i.e. modulo 232. Let us define 32-bit word sequence ai by ai+1 = aixi, (a0 =some odd integer). The output sequence of CryptMT is the sequence of the most significant 8 bits of ai. Corollary A.3. Any element of ((Z/232)×)623 occurs the same times in the form of the tuple (xi, xi+1,...,xi+622) when we move i =0, 1,...,P − 1, except for (1, 1,...,1) which occurs once less often. 32 × Any element of (Z/2 ) occurs same times in the form of xi for i =0, 1,...,P− 1. Corollary A.4. P−1 xi =1. i=0 Proof. By cancellation of x and x−1, for any abelian group G we have x = x. x∈G x2=1 In the case of G =(Z/232)×, the elements of order two are exactly x =1, −1, 232−1− 1, 232−1 + 1, and their product is 1. The conclusion is deduced from the above corollary. 

Lemma A.5. The period of ai divides P . Proof. This is because P−1 ai+P =( xi)ai, i=0 which is ai by the above corollary. 

Since P is a (Mersenne) prime, in particular, every bit of ai has period 1 or P. Thus, to show the theorem, we only need to show that its period is not 1. Lemma A.6. Define a function ϕ :((Z/232)×)624 → ((Z/232)×)623 CRYPTOGRAPHIC MERSENNE TWISTER AND FUBUKI STREAM/BLOCK CIPHER 15 by (b0,...,b623) → (b1/b0,b2/b1,...,b623/b622). Define sets A and X by

A := {(ai,ai+1,...,ai+623)|i =0, 1, 2,...},

X := {(xi,xi+1,...,xi+622)|i =0, 1, 2,...}. Then, the restriction ϕ|A : A → X is surjective.

Proof. Because (ai,ai+1,...,ai+623) is mapped to (xi,xi+1,...,xi+622). 

Now we prove the theorem. Suppose that -th bit of ai has period 1. Then, each 32 × ai is contained in the subset of (Z/2 ) whose -th bit coincides with that of ai. There are 230 such odd integers, so we have #(A) ≤ 230×624. On the other hand, by Corollary A.3, we have #(X)=231×623, which is larger than #(A), contradicting to the existence of a surjection A → X. This shows that -th bit has period > 1, but it divides the prime P , hence = P . Remark A.7. The lemma A.6 claims that #(A) ≥ 231×623, where A ⊂ ((Z/231)×)624. This suggests that ai would be fairy well equi-distributed in a high-dimensional cube, and so would the output of CryptMT. Remark A.8. The above remark is valid for (1) any mother generator with high-dimensional equidistribution property, and (2) any binary operation ◦ a := a ◦ x which is invertible, i.e. x → a ◦ x is bijective. In our case, ◦ is the multipli- cation.

Appendix B. Experimental comparison of Fubuki and AES We tested the bit-diffusion property of Fubuki and AES with small number of rounds, as follows. This is a typical differential attack to block ciphers. Let E : B→Bbe an encryption function. Let B ⊂B. We uniformly randomly generate B ∈B, and compute the difference ∆E(B):=E(B ⊕ 1) ⊕ E(B)(1 is a block whose bits are all zero except for the first bit). The resulting 128 bits = 4 words are considered to be 4×4 matrix of 8-bit integers. Then the (i, j)-component E ∆ij (B)of∆E(B) is a random variable with values in 0, 1,...,255, which should ideally be uniform. We generate B randomly N times. Then we count the number of occurrence of E E an integer k (0 ≤ k ≤ 255) as ∆ij (B), denoted by Fij (N)[k](F for frequency). E E If Fij (N)[k] = 0 then k did not appear in the (i, j) component of ∆ (B)forN E times. Ideally the expectation of Fij (N)[k] should be N/256. We take the following normalization: E E fij (N)[k]:=Fij (N)[k] × 256/N. E We compute these statistics, and lists the minimum and the maximum of fij (N)[k] for k =0,...,255. 16 MAKOTO MATSUMOTO, TAKUJI NISHIMURA, MARIKO HAGITA, AND MUTSUO SAITO

3-round AES 0.73700, 1.20090 0.68740, 1.21830 0.78480, 1.26710 0.73810, 1.24930 0.66180, 1.23490 0.78750, 1.28410 0.75200, 1.24890 0.73110, 1.20510 0.77420, 1.28260 0.74730, 1.23530 0.73150, 1.21340 0.67320, 1.21810 0.75130, 1.23510 0.73240, 1.19540 0.68640, 1.21500 0.77710, 1.29130 4-round AES 0.97610, 1.02900 0.97460, 1.02790 0.97670, 1.02280 0.96700, 1.02820 0.96750, 1.03410 0.96760, 1.02740 0.97730, 1.03680 0.97410, 1.02370 0.97660, 1.02330 0.97060, 1.03260 0.97070, 1.02670 0.96830, 1.03220 0.97290, 1.03590 0.96650, 1.03000 0.97220, 1.02920 0.97600, 1.02280 1-round Fubuki 0.51770, 1.66800 0.69160, 1.54620 0.24140, 3.50450 0.38440, 1.59600 0.01450, 4.30050 0.88450, 1.10840 0.43290, 1.44460 0.43300, 2.35560 0.04200, 1.96300 0.00000, 3.98790 0.00000, 3.14730 0.12590, 2.16670 0.97650, 1.02430 0.97670, 1.02490 0.97740, 1.02910 0.97120, 1.03330 2-round Fubuki 0.97480, 1.02720 0.97790, 1.02440 0.97250, 1.02920 0.97340, 1.02710 0.97000, 1.03010 0.96790, 1.02700 0.97450, 1.02540 0.97270, 1.02880 0.97600, 1.02530 0.97650, 1.03050 0.97360, 1.02970 0.96860, 1.03170 0.97480, 1.02900 0.96770, 1.02820 0.96690, 1.02770 0.97260, 1.03390 Table 3. Differentials of 3-round AES, 4-round AES, 1-round Fubuki and 2-round Fubuki. Lists of the normalized frequency (minimum, maximum) for N =2, 560, 000 times sampling

.

Table 3 shows the result of the tests. We selected 3-round AES, 4-round AES, 1- round Fubuki and 2-round Fubuki as ciphers. We choose B to be the blocks whose first word is arbitrary and the rest three words are zero. We take N = 2560000 such sample blocks, and compute the difference caused by adding 1 as above. The table lists the minimum and and the maximum normalized frequencies. For example, the first pair (0.73700, 1.20090) in the table of 3-round AES shows that there is some k, k =0, 1, 2,...,255, for which (1, 1)-coordinate byte of the block-difference takes value k for 0.73700 × 2560000/256 times. This is the least frequency among 256 possible values as for the (1,1)-coordinate. The table shows that 3-round AES is still weak, since for 4-round AES, the frequencies are much different. Although not listed, 2-round AES has many (0, 256). For 1-round Fubuki, there are several zeroes as minimum-frequencies. This implies that there are some impossible 8-bit patterns for that coordinate. Such bias seems to be eliminated in 2-round Fubuki. Although the result is omitted, experimental results for larger rounds are similar to that for 4-round AES (and 2-round Fubuki).

References

[1] AES lounge: http://www.iaik.tu-graz.ac.at/research/krypto/AES/ [2] Bernstein, D. J. Cache-timing attack on AES http://cr.yp.to/antiforgery/cachetiming-20050414.pdf CRYPTOGRAPHIC MERSENNE TWISTER AND FUBUKI STREAM/BLOCK CIPHER 17

[3] Knuth, D. E. The Art of Computer Programming. Vol. 2. Seminumerical Algorithms 3rd Ed. Addison-Wesley, Reading, Mass., (1997). [4] Matsumoto, M. and Nishimura, T. Mersenne Twister: A 623-dimensionally equidistributed uniform pseudo-random number generator, ACM Transactions on Modeling and Computer Simulation, 8 (1998) 3–30. [5] Matsumoto, M. and Nishimura, T. Mersenne Twister Homepage. http://www.math.sci.hiroshima-u.ac.jp/˜m-mat/emt.html [6] Tusnoo, Y., Saito, T., Suzaki, T., Shigeri, M., and Miyauchi, H. Cryptanalysis of DES imple- mented on computers with cache, in Cryptographic hardware and embedded systems–CHES 2003, Springer-Verlag, Berlin (2003), 62–76.

Department of Mathematics, Hiroshima University, Hiroshima 739-8526, JAPAN E-mail address: [email protected] Department of Mathematics, Yamagata University, Yamagata Japan E-mail address: [email protected] Department of Information Science, Ochanomizu University, Tokyo Japan E-mail address: [email protected] Department of Mathematics, Hiroshima University, Hiroshima 739-8526, JAPAN E-mail address: [email protected]