Throughput Optimized Implementations of QUAD

Throughput Optimized Implementations of QUAD Jason R. Hamlet · Robert W. Brocato Abstract We present several software and hardware im- tivariate quadratic equations used in QUAD is written plementations of QUAD, a recently introduced stream ci- as pher designed to be provably secure and practical to implement. The software implementations target both a per- Q(x)= α x x + β x + γ (1) X i,j i j X i i sonal computer and an ARM microprocessor. The hard- 1≤i≤j≤n 1≤i≤n ware implementations target field programmable gate arrays. The purpose of our work was to first find the baseline In QUAD, a system of m = kn equations, S(x) = performance of QUAD implementations, then to optimize (Q1(x),...,Qkn(x)) are iterated. On each iteration, n bits our implementations for throughput. Our software imple- are used to update the internal state, and the remain- mentations perform comparably to prior work. Our hard- ing m − n are output as keystream values. To do this, ware implementations are the first known implementa- we let Sout(x) = (Qn+1(x),...,Qkn(x)) and Sit(x) = tions to use random coefficients, in agreement with QUAD’s (Q1(x),...,Qn(x)). The n polynomials in Sit are used to security argument, and achieve much higher throughput update the internal state, while Sout produces the keystream. than prior implementations. As such, one round of QUAD entails calculating S(x) = (Sit(x),Sout(x)) with current state x, outputting the n Keywords QUAD · stream cipher · throughput bits generated with Sout(x), and then updating x with optimization · hardware acceleration the n bits generated by Sit(x). Inspection of Equation 1 reveals that there are no conditional branches required in implementing QUAD. 1 Introduction Consequently, QUAD permits constant time implementations, and so side-channel timing attacks [17] on QUAD The QUAD algorithm is a stream cipher proposed by are unlikely. Unfortunately, the key initialization proce- Berbain, Gilbert, and Patarin and is intended to be prov- dure described in [1] does include conditional branching ably secure and practical to implement [1]. QUADs se- and so may be susceptible to such attacks. However, this curity is derived from the difficulty of solving the multi- initialization procedure is non-standard and was removed variate quadratic (MQ) problem. That is, the security of from QUAD in [4]. Depending on the implementation, the QUAD cipher is provably reducible to the NP-hard there is a possibility of hardware leaking the key or ini- problem of finding a solution to a multivariate quadratic tialization vector (IV) bits during key and IV setup, and system of m quadratic equations in n variables over a fi- there are likely simple power analysis (SPA) or differential nite field, GF (q). Each equation in the system of kn mul- power analysis (DPA) attacks [18]. QUAD’s resistance to Sandia National Laboratories is a multi-program laboratory timing attacks is beneficial, and its potential suscepti- managed and operated by Sandia Corporation, a wholly owned bility to hardware leakage or power analysis attacks is subsidiary of Lockheed Martin Corporation, for the U.S. De- consistent with other ciphers [14]. partment of Energy’s National Nuclear Security Administration under contract DE-AC04-94AL85000. To satisfy the security proof in [1] the coefficients defining the polynomials Sx must be randomly generated, J. R. Hamlet Sandia National Laboratories, Albuquerque, NM 87123 USA but they are not secret values. Prior implementations re- Tel.: +505-845-0903 place these random coefficients with values output from E-mail: [email protected] a pseudo random number generator (PRNG) [5,6]. While R. W. Brocato this leads to more compact implementations, QUAD’s se- Sandia National Laboratories, Albuquerque, NM 87123 USA curity proof has not been extended to the pseudo random E-mail: [email protected] case. For the first time, we present hardware results for 2 Jason R. Hamlet, Robert W. Brocato QUAD implementations using random coefficients, in ac- of an entropy source with a personalization string and cordance with QUAD’s security argument. then apply a cryptographic hash function to the result to In this paper, we report on our efforts to measure com- seed the internal state of the algorithm. For the purpose putational performance of the QUAD algorithm on a per- of this work initialization approaches are inconsequential, sonal computer (PC), an ARM Cortex A8 embedded mi- since we have removed the effects of that delay from our croprocessor, and in Altera Cyclone V and Xilinx Virtex- throughput time measurements. Consequently, the initial- 4 field programmable gate arrays (FPGAs). We imple- ization can be viewed as an initial delay that is identical mented the QUAD algorithm with a number of different between implementations and whose impact decreases as variations on each platform in an effort to optimize per- the length of the generated keystream increases. formance on each. The standard measure of performance Each of our software implementations uses a 128-bit that we seek to optimize throughout these tests is the rate internal state with a 128-bit keystream output for each of keystream production, measured in bits/second. update cycle. We tested these programs on both a PC In this work we consider only n = 128, k = 2, and and an ARM microprocessor. The PC used for testing coefficients in GF (2). The solution to Equation (1) is has a 3.0GHz Intel Core 2 Duo processor with 6 Mbytes the m = 256 bit value S(x) = Q1(x),...,Q256(x). Val- of cache memory and 8 Gbytes of random access memory ues Sout(x) = Q129(x),...,Q256(x) are output as the (RAM) running Red Hat Enterprise Linux version 5. We keystream, and values Sit(x) = Q1(x),...,Q128(x) are compiled our code with the GNU compiler version 4.1.2. used to update the internal state. The α, β, and γ coeffi- We also ran each program on a Cortex A8 ARM core cients are random but public values. There are 256 128 = that is part of a DaVinci DM3730 microprocessor. The 2 2, 080, 768 bits of α, which we term nonlinear coefficients, DM3730 microprocessor has an additional C64x digital 256 × 128 = 32, 768 bits of linear coefficients β, and 256 signal processor (DSP) core. We attempted to compile our bits of γ, for a total of 2, 113, 792 bits. QUAD’s secu- programs to run on the DSP core, but we had insufficient rity argument requires these coefficients to be chosen ran- development time to replace the key C language functions domly, and its designers state that bad choices of coeffi- used to measure algorithm execution times. Consequently, cients are unlikely, though this has not been proven [1]. our reported speeds are limited to the PC and the ARM The coefficients used in our implementations were gener- microprocessor. ated using the random number generator in OpenSSL [3]. 2.2 Software Implementation: Results 2 Software Implementations 2.2.1 QUAD1 Using the C programming language, we implemented four different software versions of QUAD. Each version used Our first software implementation, QUAD1, computes the the same random coefficients. For each version we tar- internal state value and keystream output by means of the geted both a PC and an ARM microprocessor and mea- most computationally simplistic approach. It was used to sured the resulting throughput, which varies significantly derive test vectors for the other software implementations. between implementations. In this section we describe each In this version, separate computations are performed to of the implementations and our results. update the 128-bit internal state register and the 128- bit keystream output register. That is, the internal state and keystream are treated as two separate registers in 2.1 Software Implementation: Overview QUAD1. Computations for the nonlinear, linear, and constant terms are performed separately. No effort was made The initialization used in all of our software and hardware to streamline computations in QUAD1, and the bit-wise implementations of the QUAD algorithm differs from that computations required for the QUAD algorithm are not presented in [1], which describes an initialization proce- well suited to the register-based computations of a PC dure that uses two different multivariate quadratic sys- running a C program. Due to these factors, this first soft- tems, S0 and S1, of n equations in n unknowns. The ware version of QUAD achieves an average speed of only internal state x, which has been set to an initial value K, 4.7 kbits/sec on the PC and 970 bits/sec on the ARM is used to select either the output of S0(x) or of S1(x), microprocessor. depending on the sequentially selected value of the internal state. However, this approach is non-standard and has 2.2.2 QUAD2 been removed from cryptographic standards that include QUAD [4]. For our software implementations we simply Our second version, QUAD2, was created in an effort to make a call to the Unix function /dev/random to pro- speed up the implementation of the algorithm by per- vide an entropy source to seed the internal state of the forming block matrix-vector multiplication. Most of the algorithm. In practice, one might concatenate the output computations required in the algorithm take place in the Throughput Optimized Implementations of QUAD 3 quadratic (αij xixj) terms. To speed up these computa- Table 1 Throughput results for our software implementations tions, the arithmetic in QUAD2 is performed on words of QUAD sized to fit the register size in the microprocessor, which PC PC ARM µP ARM µP (Mb/s) (cycles/byte) (Mb/s) (cycles/byte) is 32 bits for the ARM and 64 bits for the Intel Duo.

Throughput Optimized Implementations of QUAD

Hardness Evaluation of Solving MQ Problems

Side Channel Analysis Attacks on Stream Ciphers

Petawatt and Exawatt Class Lasers Worldwide

Hardware Implementation of Four Byte Per Clock RC4 Algorithm Rourab Paul, Amlan Chakrabarti, Senior Member, IEEE, and Ranjan Ghosh

UPDATED Canadian Values 07-04 201 7/26/2016 4:42:21 PM UPDATED Canadian Values 07-04 202 COIN VALUES: CANADA 02 .0 .0 12

9/11 Report”), July 2, 2004, Pp

How to Find Short RC4 Colliding Key Pairs

Stream Cipher Designs: a Review

Copyright Infringement

MQ Challenge: Hardness Evaluation of Solving MQ Problems

LNCS 4593, Pp

Voice Encryption Using Twin Stream Cipher Algorithm تشفير الصوت باستخدام خوارزمية التوأم