Fast Multiplication of Binary Polynomials with the Forthcoming Vectorized VPCLMULQDQ Instruction

Fast multiplication of binary polynomials with the forthcoming vectorized VPCLMULQDQ instruction Nir Drucker Shay Gueron Vlad Krasnov University of Haifa, Israel, University of Haifa, Israel, CloudFlare, Inc. and and San Francisco, USA Amazon Web Services Inc.1 Amazon Web Services Inc.1 Abstract—Polynomial multiplication over binary fields F2n is Intel has recently announced [7] that its future architecture, a common primitive, used for example by current cryptosystems codename ”Ice Lake”, will introduce a new instruction called n = 128 such as AES-GCM (with ). It also turns out to be a prim- VPCLMULQDQ. This instruction, together with the new vector- itive for other cryptosystems, that are being designed for the Post VAESENC VAESDEC Quantum era, with values n 128. Examples from the recent ized AES instructions ( and ), are useful submissions to the NIST Post-Quantum Cryptography project, for accelerating AES-GCM. However, we argue that even as a are BIKE, LEDAKem, and GeMSS, where the performance of standalone instruction, VPCLMULQDQ is useful for some new the polynomial multiplications, is significant. Therefore, efficient emerging algorithms, which paper demonstrates how it can be F n n polynomial multiplication over 2 , with large , is a significant used for accelerating ”big” polynomial multiplications (with emerging optimization target. > 511 Anticipating future applications, Intel has recently announced degree ). that its future architecture (codename ”Ice Lake”) will introduce The correctness of the algorithms (and the code) can be a new vectorized way to use the current VPCLMULQDQ instruc- checked now, but actual performance measurements require tion. In this paper, we demonstrate how to use this instruction a real CPU, which is currently unavailable. To address this for accelerating polynomial multiplication. Our analysis shows difficulty, we use other techniques to estimate the future 2 a prediction for at least x speedup for multiplications with performance, such as counting instructions and running with polynomials of degree 512 or more. some ”stand in” replacements. The results of both techniques are similar, and this allows us to predict that polynomial I. INTRODUCTION multiplication is going to be at least 2x faster on these future Several modern encryption schemes use polynomial multi- CPUs, compared to the current software implementations on plication over F2n as one of their main primitives. One promi- the current architectures. nent example is AES-GCM whose (almost) XOR-universal The paper is organized as follows: Section II describes the hash function (GHASH) is an evaluation of polynomials with new VPCLMULQDQ instruction. Section III presents some con- coefficients in F2128 . The nonce misuse resistant Authenticated cepts that we use for polynomial multiplications. Section IV Encryption AES-GCM-SIV is using a similar (almost) XOR- presents our new implementation. We show our experimental universal hash function (POLYVAL). These evaluations can results in Section V, and conclude in Section VI. be significantly sped up with dedicated instructions. Indeed, II. THE VPCLMULQDQ INSTRUCTION modern general-purpose processors are equipped with the PCLMULQDQ ”carry-less multiplication” instruction [1], [2]. Algorithm 1 DST = VPCLMULQDQ(SRC1, SRC2, Imm8) ≥ 128 Multiplication of polynomials with higher degrees ( ) Inputs: SRC1, SRC2 (wide registers) Imm8 (8 bits) is becoming a useful computational task for newly designed Outputs: DST (a wide register) cryptosystems that are based on coding or multivariate poly- 1: procedure VPCLMULQDQ(SRC1, SRC2, Imm8) i 0 KL − 1 nomials. These are motivated by the NIST Post-Quantum 2: for := to do 3: j1 =2i + Imm8[0] Project [3] that targets the definition of the next generation 4: j2 =2i + Imm8[4] of quantum-resistant public key and signature cryptosystems. 5: T1[ 63 : 0 ] = SRC1[ 64(j1 +1)− 1 : 64j1 ] 69 6: T2[ 63 : 0 ] = SRC2[ 64(j2 +1)− 1 : 64j2 ] We give three examples out of the recent submissions 7: DST[ 128(i +1)− 1 : 128i ]=PCLMULQDQ( T1, T2 ) to this project, that use polynomial multiplication in F2n : return DST BIKE [4] (MDPC codes) with n =32, 749, LEDAKem [5] (LDPC codes) with n =99, 053, and GeMSS [6] (multivariate Alg. 1 [7] presents the extended instruction VPCLMULQDQ. polynomials) using n = 354. We note that the performance of It vectorizes polynomial (carry-less) multiplications, and is the polynomial multiplication is significant in these algorithms, able to perform KL =1/2/4 multiplications of two qwords for example, the key generation and the encapsulation steps (64 bits). The multiplicands are selected from two source of BIKE are dominated by such multiplications. operands, which are 128/256/512-bit registers (named xmm, ymm, zmm, respectively), and the selection is determined by 1This work was done prior to joining Amazon. the value of the immediate byte. Note that the case KL =1 XXX-X-XXXXXXX-X-X/ARITH18/c 2018 IEEE 111 a7 a6 a5 a4 a3 a2 a1 a0 b7 b6 b5 b4 b3 b2 b1 b0 a0b7 a0b6 a0b5 a0b4 a0b3 a0b2 a0b1 a0b0 a1b7 a1b6 a1b5 a1b4 a1b3 a1b2 a1b1 a1b0 a2b7 a2b6 a2b5 a2b4 a2b3 a2b2 a2b1 a2b0 a3b7 a3b6 a3b5 a3b4 a3b3 a3b2 a3b1 a3b0 a4b7 a4b6 a4b5 a4b4 a4b3 a4b2 a4b1 a4b0 a5b7 a5b6 a5b5 a5b4 a5b3 a5b2 a5b1 a5b0 a6b7 a6b6 a6b5 a6b4 a6b3 a6b2 a6b1 a6b0 a7b7 a7b6 a7b5 a7b4 a7b3 a7b2 a7b1 a7b0 Fig. 1. Schoolbook multiplication of 8 × 8 qwords (or 4 × 4 qwords in red dashed border). The size of each aibj is 128 bits. using xmm registers, is exactly the VPCLMULQDQ instruction zmm registers. This is why splitting each row into ”even” that is currently available in the modern CPU’s. and ”odd” columns, where each part is 8 qwords long, seems natural. We use 16 zmm registers, namely Even0,...,Even7 III. POLYNOMIAL MULTIPLICATION and Odd0,...,Odd7, to accommodate these values. Table I Obviously, polynomial multiplication can be executed by presents the layout of the qwords in the registers. the standard schoolbook algorithms. However, for sufficiently high degrees, a Carry-less Karatsuba algorithm [1], [2], [8]is TABLE I faster. For even higher degrees, other algorithms such as Toom- A SPLIT PRESENTATION OF THE EVEN AND ODD COLUMNS PRESENTED IN FIGURE 1, THE RED DASHED LINE MARKS THE 4 × 4 QWORDS Cook [9], [10] may become useful. These algorithms (e.g., MULTIPLICATION BOUNDARY Karatsuba and Toom-Cook) work in a recursive way, where at Register Values each step multiplies smaller degree operands. After a sufficient alias (4 packets of 128-bits dashed for alignment) number of iterations, the involved polynomial has a small Even0 ----a0b6 a0b4 a0b2 a0b0 a b a b a b a b enough degree, and then it pays to revert to the schoolbook Even1 ---1 7 1 5 1 3 1 1 - a b a b a b a b multiplication, for the final step. Even2 ---2 6 2 4 2 2 2 0 - Even3 --a3b7 a3b5 a3b3 a3b1 -- In [11], we showed that for polynomials of degree slightly Even4 --a4b6 a4b4 a4b2 a4b0 -- a b a b a b a b smaller or equal to a power of two, the best performance Even5 - 5 7 5 5 5 3 5 1 --- Even6 - a6b6 a6b4 a6b2 a6b0 --- is attained with a recursive Karatsuba algorithm, until the Even a7b7 a7b5 a7b3 a7b1 ---- 255 4 7 polynomials area the degree of ( qwords), and then revert Odd0 ----a0b7 a0b5 a0b3 a0b1 a b a b a b a b to the schoolbook multiplication. Odd1 ----1 6 1 4 1 2 1 0 a b a b a b a b Here, we demonstrate how the new VPCLMULQDQ instruc- Odd2 ---2 7 2 5 2 3 2 1 - Odd3 ---a3b6 a3b4 a3b2 a3b0 - tion can be used for two methods that multiply polynomials of Odd4 --a4b7 a4b5 a4b3 a4b1 -- a b a b a b a b degree 511 (8 qwords): a) A schoolbook multiplication of 8×8 Odd5 --5 6 5 4 5 2 5 0 -- a b a b a b a b qwords using zmm registers; b) A Karatsuba multiplication Odd6 - 6 7 6 5 6 3 6 1 --- Odd7 - a7b6 a7b4 a7b2 a7b0 --- that invokes a 4 × 4 qwords schoolbook multiplication, as its ”core”, and uses ymm registers. We compare them to a reference implementation [11] that uses the existing version TABLE II of PCLMULQDQ, for a Karatsuba multiplication that wraps a PERMUTATION OF VALUES FROM TABLE I FOR 8 × 8 MULTIPLICATION 4 × 4 qwords schoolbook flow. Register Permutation mask Output Values alias (Original qword idx) (4 packets of 128-bits) IV. VPCLMULQDQ AND SCHOOLBOOK Even0 - a0b6 a0b4 a0b2 a0b0 a b a b a b a b We use the same algorithm for both the 8×8 qwords and 4× Even1 54321076 1 5 1 3 1 1 1 7 Even 54321076a2b4 a2b2 a2b0 a2b6 4 2 qwords multiplications. The only difference is in the sizes of Even3 32107654a3b3 a3b1 a3b7 a3b5 a b a b a b a b the wide registers that are used (zmm and ymm, respectively). Even4 32107654 4 2 4 0 4 6 4 4 a b a b a b a b For brevity, we present only the 8 × 8 qwords version, but in Even5 10765432 5 1 5 7 5 5 5 3 Even 10765432a6b0 a6b6 a6b4 a6b2 4 × 4 6 each table, also mark the qwords parts with red dashed Even7 - a7b7 a7b5 a7b3 a7b1 lines. Fig. 2 shows a real (37-lines) code snippet of the 4 × 4 Odd0 - a0b7 a0b5 a0b3 a0b1 a b a b a b a b qwords multiplication, as an example. Odd1 - 1 6 1 4 1 2 1 0 Odd2 54321076a2b5 a2b3 a2b1 a2b7 Fig. 1 illustrates the basic schoolbook multiplication with Odd3 54321076a3b4 a3b2 a3b0 a3b6 a b a b a b a b polynomials of degree 511, namely a, b.

Fast Multiplication of Binary Polynomials with the Forthcoming Vectorized VPCLMULQDQ Instruction

Key Agreement from Weak Bit Agreement

Identifying Open Research Problems in Cryptography by Surveying Cryptographic Functions and Operations 1

The Design and Evolution Of

Investigation of Ice Particle Habits to Be Used for Ice Cloud Remote Sensing for the GCOM-C Satellite Mission

Operator's Manual

CICE: the Los Alamos Sea Ice Model Documentation and Software User's

High-Speed, Hardware-Oriented Authenticated Encryption

Qualcomm® Inline Crypto Engine (UFS) Version 3.1.0 and Version 3.2.0

Crypto 101 Lvh

Authenticated Encryption by Enciphering

Cryptographic (In)Security in Android Apps an Empirical Analysis

Anything-But Eazy in Hardware