Fast Fourier Orthogonalization

Léo Ducas∗
Cryptology Group, CWI, Amsterdam, The Netherlands

Thomas Prest†
Thales Communications & Security, Gennevilliers, France

ABSTRACT

The classical fast Fourier transform (FFT) computes in quasi-linear time the product of two polynomials in the circular convolution ring $\mathbb{R}[x]/(x^d - 1)$, a task that naively requires quadratic time. Equivalently, it accelerates matrix-vector products when the matrix is circulant.

In this work, we discover that the ideas of the FFT can be applied to speed up the orthogonalization process of matrices with circulant blocks of size $d \times d$. We show that, when $d$ is composite, it is possible to proceed to the orthogonalization in an inductive way, up to an appropriate re-indexation of rows and columns. This leads to a structured Gram-Schmidt decomposition. In turn, this structured Gram-Schmidt decomposition accelerates a cornerstone lattice algorithm: the nearest plane algorithm. The complexity of both algorithms may be brought down to $\Theta(d \log d)$.

Our results easily extend to cyclotomic rings, and can be adapted to Gaussian samplers. This finds applications in lattice-based cryptography, improving the performance of trapdoor functions.

Keywords. Fast Fourier transform, Gram-Schmidt orthogonalization, nearest plane algorithm, lattice algorithms, lattice trapdoor functions.

∗ This work has been supported by a grant from CWI from budget for public-private-partnerships and in part by a grant from NXP Semiconductors. Part of this work was also supported by an NWO Free Competition Grant.

† Part of this work was done when the author was at the École Normale Supérieure, Paris.

1. INTRODUCTION

When operations involving linear algebra are to be performed over structured matrices, a natural problem is to accelerate them by exploiting the structure. The most classical example is the fast Fourier transform [2, 8], which performs matrix-vector multiplication in quasilinear time when the matrix is circulant. This is achieved by exploiting the isomorphism between the ring of $d \times d$ circulant matrices and the circular convolution ring $\mathcal{R}_d = \mathbb{R}[x]/(x^d - 1)$.

A widely studied and more involved question is matrix decomposition [20] for structured matrices, in particular Gram-Schmidt orthogonalization (GSO). In this work, we are interested in the GSO of circulant matrices, and more generally of matrices with circulant blocks. Our main motivation is to accelerate lattice algorithms for lattices that admit a basis with circulant blocks. This use case allows a helpful extra degree of freedom: one may permute rows and columns of the lattice basis, since this leaves the generated lattice unchanged up to an isometry.

As we will show, a proper re-indexation of these matrices highlights an inductive structure, with a fast Fourier flavor. This accelerates the orthogonalization process, and the related nearest plane algorithm, down to quasilinear time and space.

The Nearest Plane Algorithm, Lattices and Cryptography.
The nearest plane algorithm [1] is a central algorithm over lattices. After precomputation of the GSO, and using a quadratic number of real operations, it finds a relatively close point in a lattice to an arbitrary target. It is a core subroutine of LLL [13], and can be used for error correction over analog noisy channels. It has also found applications in lattice-based cryptography as a decryption algorithm, and a randomized variant (called discrete Gaussian sampling) [11, 5] provides secure trapdoor functions based on lattice problems. This leads to cryptosystems with fine-grained access control (attribute-based encryption), such as [18, 6], to name a few.

Given a basis $\mathbf{B}$ of a lattice $\mathcal{L} \subset \mathbb{R}^d$ and a target vector $\mathbf{c}$, the nearest plane algorithm finds a lattice point somewhat close to $\mathbf{c}$. The result belongs to a fundamental domain centered in $\mathbf{c}$, whose shape is the cuboid defined by $\tilde{\mathbf{B}}$, the GSO of $\mathbf{B}$ (see Figure 1). This algorithm performs $\Theta(d^2)$ real operations. The GSO itself is required as a precomputation.
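To make the description above concrete, here is a minimal sketch of the classical (quadratic-time) nearest plane algorithm over the reals, assuming a square invertible basis whose rows are the basis vectors. The function names `gram_schmidt` and `nearest_plane` and the toy basis are illustrative choices, not taken from the paper.

```python
import numpy as np

def gram_schmidt(B):
    """Return (L, Bt) with B = L @ Bt, L unit lower triangular,
    and the rows of Bt pairwise orthogonal (the GSO of B)."""
    B = np.asarray(B, dtype=float)
    n = B.shape[0]
    L = np.eye(n)
    Bt = B.copy()
    for i in range(n):
        for j in range(i):
            L[i, j] = (B[i] @ Bt[j]) / (Bt[j] @ Bt[j])
            Bt[i] -= L[i, j] * Bt[j]
    return L, Bt

def nearest_plane(B, c):
    """Return a lattice vector v = z @ B (z integral) such that c - v
    lies in the centered cuboid spanned by the Gram-Schmidt vectors."""
    B = np.asarray(B, dtype=float)
    n = B.shape[0]
    L, _ = gram_schmidt(B)
    # Coordinates t of the target in the basis B, i.e. t @ B = c.
    t = np.linalg.solve(B.T, np.asarray(c, dtype=float))
    z = np.zeros(n)
    for j in reversed(range(n)):
        # Fold in the rounding errors already made on later coordinates.
        t_bar = t[j] + sum((t[i] - z[i]) * L[i, j] for i in range(j + 1, n))
        z[j] = np.rint(t_bar)
    return z @ B

# Toy example (hypothetical values).
B = np.array([[3.0, 1.0], [1.0, 2.0]])
c = np.array([2.3, 4.1])
v = nearest_plane(B, c)
```

The double loop in `gram_schmidt` and the inner sum in `nearest_plane` are where the quadratic cost in the dimension comes from.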
Structured lattices in cryptography.
When it comes to practical lattice-based cryptography, a quadratic cost in the dimension is rather prohibitive, considering that the lattices at hand have dimensions ranging in the hundreds, or even thousands. For efficiency purposes, many cryptosystems (such as [10, 15, 16], to name a few) chose to rely on lattices with some algebraic structure, improving time and memory requirements to quasilinear in the dimension. This is sometimes referred to as lattice-based cryptography in the ring setting. Technically, the chosen rings typically are cyclotomic rings, but those are closely related to the convolution rings discussed so far. The core of this optimization is the fast Fourier transform (FFT) [2, 4, 8, 19], allowing fast multiplication of polynomials. But this improvement did not apply in the case of the nearest plane algorithm or its randomized variant [11, 5]: the naïve GSO does not take the algebraic structure into account.

One possible work-around [9, 22] consists of using the round-off algorithm [1] instead of the nearest plane algorithm. However, this simpler algorithm outputs vectors that are further from the target, both in the average and worst cases, weakening those cryptosystems.

[Figure 1: Round-off and nearest plane algorithms, and their associated fundamental domains.]

Our contribution
In this work, we discover new algorithms, obtained by crossing Cooley-Tukey's [2] fast Fourier transform algorithm together with the orthogonalization and nearest plane algorithms (not exactly the GSO, but the closely related $LDL^\star$ decomposition). Precisely, we show that, up to a re-indexation of rows and columns, the orthogonalization of matrices composed of $d \times d$ circulant blocks can be done in time $\Theta(d \log d)$ when the prime factors of $d$ are bounded. Our algorithm produces the $LDL^\star$ decomposition in a special compact format, requiring $\Theta(d \log d)$ complex numbers to represent.

From this compact representation, the nearest plane algorithm can also be performed using $\Theta(d \log d)$ real operations.¹ As a demonstration of the simplicity of our algorithms, we propose an implementation in python for $d \times d$ circulant matrices, when $d$ is a power of 2.

Computational model and Parallelism.
Our algorithms and their complexity are described for Random Access Machines with unitary arithmetic operations on real numbers, i.e. we do not study the issues related to floating-point approximations and numerical stability.

Regarding parallelism, we note that our Fast-Fourier $LDL^\star$ algorithm is almost fully parallelizable: viewed as an arithmetic circuit on real numbers, it has depth $O(\log(n))$ and width $O(n)$. This is unfortunately not the case of the Nearest Plane algorithm nor our Fast-Fourier variant, for which the $n$ rounding steps are inherently sequential.

Techniques.
At the core of our techniques is the realization that representing elements of the convolution ring $\mathcal{R}_d = \mathbb{R}[x]/(x^d - 1)$ as circulant matrices is not the appropriate choice. To allow an induction similar to Cooley-Tukey's FFT, our representation must follow the tower of rings $\mathbb{R} \subset \mathcal{R}_{d_1} \subset \cdots \subset \mathcal{R}_{d_{i-1}} \subset \mathcal{R}_d$, for some chain of divisors $1 | d_1 | \dots | d_{i-1} | d$.

Such a representation is obtained by applying the (mixed-radix) digit-reversal order to the indexation of the rows and columns of the circulant blocks, as pictured in Figure 2.

[Figure 2: Re-indexing the transformation matrix of $f \in \mathcal{R}_8 \mapsto f a$ (with $a = 0 + x + 2x^2 + \cdots + 7x^7 \in \mathcal{R}_8$). Panels: $C(a) \Rightarrow C(M_{8/4}(a)) \Rightarrow C(M_{8/2}(a)) = M_{8/1}(a)$.]

We show that this alternative indexation allows us to represent the matrix $\mathbf{L}$ of the GSO in a factorized form: a product of $\Theta(\log d)$ (sparse) structured matrices, each of which can be stored in space $O(d)$. An example is given in Figure 3.

Once this hidden structure is unveiled (Theorem 1), the algorithmic implications follow quite naturally: one first easily derives an algorithm in time $O(n \log^2 n)$, matching previous and more general results [21], but noting that the splitting step may be performed directly in the Fourier domain allows us to save another $\log n$ factor. For easier algorithmic manipulations, the factorization of $\mathbf{L}$ is represented using a tree.

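The re-indexation pictured in Figure 2 can be reproduced numerically. The sketch below is illustrative only: the helper names `circulant` and `bit_reversal` are assumptions, and it covers just the power-of-two case, where the digit-reversal order is the bit-reversal order. Rows and columns of the circulant matrix of $a = 0 + x + \cdots + 7x^7$ are permuted simultaneously.

```python
import numpy as np

def circulant(a):
    """Circulant matrix whose i-th row holds the coefficients of x^i * a."""
    d = len(a)
    return np.array([[a[(j - i) % d] for j in range(d)] for i in range(d)])

def bit_reversal(d):
    """Bit-reversal permutation of range(d), for d a power of two."""
    bits = d.bit_length() - 1
    return [int(format(i, f"0{bits}b")[::-1], 2) for i in range(d)]

a = np.arange(8)
C = circulant(a)

# First step: even indices first, then odd ones (rows and columns alike).
even_odd = [0, 2, 4, 6, 1, 3, 5, 7]
M84 = C[np.ix_(even_odd, even_odd)]

# Full re-indexation: bit-reversal order on rows and columns.
p = bit_reversal(8)
M81 = C[np.ix_(p, p)]

print(M84)
print(M81)
```

After the first step, the matrix is a 2 x 2 array of 4 x 4 circulant blocks; after the full re-indexation, the same nesting appears at every scale, which is the structure the induction relies on.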
¹ If the number $n \times m$ of $d \times d$ circulant blocks is not constant, they contribute a factor $O(n^3)$ to the runtime of the fast-Fourier $LDL^\star$ algorithm, $O(n^2)$ to its output size, and $O(n^2)$ to the run-time of the fast-Fourier nearest plane algorithm.

[Figure 3: Factorization of $\mathbf{L}$ in the $LDL^\star$ decomposition of $M_{8/1}(a)$, and its tree representation.]

Related Works.
There exist many works related to the orthogonalization of structured bases. For Toeplitz matrices, Sweet [23] introduced an algorithm faster than the naive orthogonalization by a linear factor. Gragg [7] has shown that for Krylov bases (which are bases of the form $\{b, r(b), \dots, r^{d-1}(b)\}$), the Levinson recursion [14, 3] allows, when $r$ is an isometric operator, to perform orthogonalization in time $\Theta(d^2)$ instead of $\Theta(d^3)$.

There also exist superfast (running time $O(n \log^2 n)$) algorithms for the orthogonalization of Toeplitz-like matrices, for example by Olshevsky and Pan [21], and those are already based on a structure-preserving induction. In this light, one may interpret our result as a Fourier-compatible version of these superfast algorithms.

The question of improving the nearest plane algorithm for structured matrices seems less studied. As far as we know, the state of the art consists of a single result [17], applying the Levinson recursion [14] to reduce its space complexity by a linear factor. Alternatively, a work-around of lesser quality was proposed for the NTRU signature scheme [9, 22].

Outline.
Section 2 introduces the mathematical tools that we will use throughout this paper. Section 3 presents our main result, […] convolution rings to cyclotomic rings, by reducing the latter to the former. Appendix B demonstrates the practical feasibility of our algorithms by presenting a rather succinct python implementation of them in the case where $d$ is a power of two.

2. PRELIMINARIES

For any ring $R$, $R[x]$ will denote the ring of univariate polynomials over $R$. Scalars (which includes elements of $R$) will usually be noted in plain letters (such as $a, b$), vectors will be noted in bold letters (such as $\mathbf{a}, \mathbf{b}$) and matrices will be noted in capital bold letters (such as $\mathbf{A}, \mathbf{B}$). Vectors are mostly in row notation, and as a consequence vector-matrix products are done in this order unless stated otherwise. $(a_1, \dots, a_n)$ denotes the row vector formed of the $a_i$'s, whereas $[\mathbf{a}_1, \dots, \mathbf{a}_n]$ denotes the matrix whose rows are the $\mathbf{a}_i$'s. $\mathbb{N}$ denotes the set of non-negative integers, and $\mathbb{N}^*$ the set $\mathbb{N} \setminus \{0\}$. For $i, j \in \mathbb{Z}$, $[\![i, j]\!]$ will denote the set $\{i, i+1, \dots, j-1, j\}$.

2.1 The Convolution Ring $\mathcal{R}_d$

Definition 1. For any $d \in \mathbb{N}^*$, let $\mathcal{R}_d$ denote the ring $\mathbb{R}[x]/(x^d - 1)$, also known as circular convolution ring, or simply convolution ring.

When $d$ is highly composite, elementary operations in $\mathcal{R}_d$ can be performed in time $\Theta(d \log d)$ using the fast Fourier transform [2].

We equip the ring $\mathcal{R}_d$ with a conjugation operation as well as an inner product, making it a Hermitian inner product space. The definitions that we give also encompass other types of rings that will be used in Appendix A.

Definition 2. Let $h \in \mathbb{R}[x]$ be a monic polynomial with distinct roots over $\mathbb{C}$, $\mathcal{R} \triangleq \mathbb{R}[x]/(h(x))$ and $a, b$ be arbitrary elements of $\mathcal{R}$.

• We note $a^\star$ and call (Hermitian) adjoint of $a$ the unique element of $\mathcal{R}$ such that for any root $\zeta$ of $h$, $a^\star(\zeta) = \overline{a(\zeta)}$, where $\overline{\,\cdot\,}$ is the usual complex conjugation over $\mathbb{C}$.

• The inner product over $\mathcal{R}$ is $\langle a, b \rangle \triangleq \sum_{h(\zeta)=0} a(\zeta) \cdot \overline{b(\zeta)}$, and the associated norm is $\|a\| \triangleq \sqrt{\langle a, a \rangle}$.

In the particular case of convolution rings, one can check that if $a(x) = \sum_{i=0}^{d-1} a_i x^i \in \mathcal{R}_d$, then

$$a^\star(x) = a(1/x) \bmod (x^d - 1) = a_0 + \sum_{i=1}^{d-1} a_i x^{d-i}.$$

The (Hermitian) adjoint $\mathbf{B}^\star$ of a matrix $\mathbf{B} \in \mathcal{R}^{n \times m}$ is the transpose of the coefficient-wise adjoint of $\mathbf{B}$.

While the inner product $\langle \cdot, \cdot \rangle$ (resp. the associated norm $\|\cdot\|$) is not to be mistaken with the canonical coefficient-wise dot product $\langle \cdot, \cdot \rangle_2$ (resp. the associated norm $\|\cdot\|_2$), they are closely related. One can easily check that for any $f = \sum_{i \in \mathbb{Z}_d} f_i x^i \in \mathcal{R}_d$, the vector $(f(\zeta))_{\{\zeta^d = 1\}}$ can be obtained from the coefficient vector $(f_i)_{0 \le i < d}$ […]

[…] $m \ge n$ and $\mathbf{B} = \{\mathbf{b}_1, \dots, \mathbf{b}_n\} \in \mathcal{R}^{n \times m}$. We say that $\mathbf{B}$ is full-rank (or is a basis) if for any linear combination $\sum_{1 \le i \le n} a_i \mathbf{b}_i$ with $a_i \in \mathcal{R}$, we have the equivalence $\left(\sum_i a_i \mathbf{b}_i = 0\right) \iff (\forall i,\ a_i = 0)$.

We note that since $\mathcal{R}$ is generally not an integral domain, a set formed of a single nonzero vector is not necessarily full-rank. In the rest of the paper, a basis will either denote a set of independent vectors $\{\mathbf{b}_1, \dots, \mathbf{b}_n\} \in (\mathcal{R}^m)^n$, or the full-rank matrix $\mathbf{B} \in \mathcal{R}^{n \times m}$ whose rows are the $\mathbf{b}_i$'s.

2.2 The GSO and $LDL^\star$ Decomposition

In this section, $\mathcal{R} \triangleq \mathbb{R}[x]/(h(x))$ as in Definition 2. We first recall a few standard definitions. A matrix $\mathbf{L} \in \mathcal{R}^{n \times n}$ is unit lower triangular if it is lower triangular and has only 1's on its diagonal.

We say that a self-adjoint matrix $\mathbf{G} \in \mathcal{R}^{n \times n}$ is full-rank Gram (or FRG) if there exist $m \ge n$ and a full-rank matrix $\mathbf{B} \in \mathcal{R}^{n \times m}$ such that $\mathbf{G} = \mathbf{B}\mathbf{B}^\star$. This generalizes the notion of positive definiteness for symmetric real matrices.

We now recall the GSO and $LDL^\star$ decomposition. The GSO decomposes any full-rank matrix as the product of a unit lower triangular matrix and an orthogonal matrix.

Proposition 1. Let $\mathbf{B} \in \mathcal{R}^{n \times m}$ be a full-rank matrix. $\mathbf{B}$ can be uniquely decomposed as

$$\mathbf{B} = \mathbf{L} \cdot \tilde{\mathbf{B}}, \qquad (1)$$

where $\mathbf{L}$ is unit lower triangular, and the rows of $\tilde{\mathbf{B}}$ are pairwise orthogonal.

When $\mathcal{R}$ is replaced by $\mathbb{R}$ or a number field, Proposition 1 is standard. In our case, a proof can be found in Appendix C.

The $LDL^\star$ decomposition writes any positive definite matrix as a product $\mathbf{L}\mathbf{D}\mathbf{L}^\star$, where $\mathbf{L} \in \mathcal{R}^{n \times n}$ is unit lower triangular, and $\mathbf{D} \in \mathcal{R}^{n \times n}$ is diagonal. It is related to the GSO: for a basis $\mathbf{B}$, there exists a unique GSO $\mathbf{B} = \mathbf{L} \cdot \tilde{\mathbf{B}}$, and for an FRG matrix $\mathbf{G}$, there exists a unique $LDL^\star$ decomposition $\mathbf{G} = \mathbf{L}\mathbf{D}\mathbf{L}^\star$. If $\mathbf{G} = \mathbf{B}\mathbf{B}^\star$, then $\mathbf{G} = \mathbf{L} \cdot (\tilde{\mathbf{B}}\tilde{\mathbf{B}}^\star) \cdot \mathbf{L}^\star$ is a valid $LDL^\star$ decomposition of $\mathbf{G}$. As both decompositions are unique, the matrices $\mathbf{L}$ in both cases are actually the same. In a nutshell:

$$\mathbf{L} \cdot \tilde{\mathbf{B}} \text{ is the GSO of } \mathbf{B} \iff \mathbf{L} \cdot (\tilde{\mathbf{B}}\tilde{\mathbf{B}}^\star) \cdot \mathbf{L}^\star \text{ is the } LDL^\star \text{ decomposition of } (\mathbf{B}\mathbf{B}^\star).$$

Algorithm 1 $LDL^\star_{\mathcal{R}}(\mathbf{G})$
Require: A full-rank Gram matrix $\mathbf{G} = (G_{ij}) \in \mathcal{R}^{n \times n}$.
Ensure: The decomposition $\mathbf{G} = \mathbf{L}\mathbf{D}\mathbf{L}^\star$ over $\mathcal{R}$, where $\mathbf{L}$ is unit lower triangular and $\mathbf{D}$ is diagonal.
1: $\mathbf{L}, \mathbf{D} \leftarrow \mathbf{0}_{n \times n}$
2: for $i$ from 1 to $n$ do
3: $L_{ii} \leftarrow 1$
4: $D_i \leftarrow G_{ii} - \sum_{j < i}$ […]

2.4 Coefficient Vectors and Circulant Matrices

Definition 5. We define linear maps $c : \mathcal{R}_d^m \to \mathbb{R}^{dm}$ and $C : \mathcal{R}_d^{n \times m} \to \mathbb{R}^{dn \times dm}$ as follows. For any $a = \sum_{i \in \mathbb{Z}_d} a_i x^i \in \mathcal{R}_d$ where each $a_i \in \mathbb{R}$:

1. The coefficient vector of $a$ is $c(a) = (a_0, \dots, a_{d-1})$.

2. The circulant matrix of $a$ is

$$C(a) \triangleq \begin{pmatrix} a_0 & a_1 & \cdots & a_{d-1} \\ a_{d-1} & a_0 & \cdots & a_{d-2} \\ \vdots & \ddots & \ddots & \vdots \\ a_1 & a_2 & \cdots & a_0 \end{pmatrix} = \begin{pmatrix} c(a) \\ c(xa) \\ \vdots \\ c(x^{d-1}a) \end{pmatrix} \in \mathbb{R}^{d \times d}.$$

3. $c$ and $C$ generalize to vectors and matrices in a coefficient-wise manner.

Proposition 3. The maps $c$ and $C$ satisfy the following properties:

1. $C$ is an injective algebra morphism. In particular, $C(a)C(b) = C(ab)$.

2. $c(a)C(b) = c(ab)$.

Algorithm 2 $\mathrm{NP}_{\mathcal{R}}(\mathbf{t}, \mathbf{L})$
Require: The decomposition $\mathbf{B} = \mathbf{L} \cdot \tilde{\mathbf{B}}$ of $\mathbf{B} \in \mathcal{R}^{n \times m}$, a vector $\mathbf{t} \in \mathcal{R}^n$
Ensure: A vector $\mathbf{z} \in \mathcal{Z}^n$ such that $(\mathbf{t} - \mathbf{z})\mathbf{B} \in \mathcal{P}(\tilde{\mathbf{B}})$
1: $\mathbf{z} \leftarrow \mathbf{0}$
2: for $j = n, \dots, 1$ do
3: $\bar{t}_j \leftarrow t_j + \sum_{i > j} (t_i - z_i) L_{ij}$
4: $z_j \leftarrow \lfloor \bar{t}_j \rceil$
5: end for
6: return $\mathbf{z}$

Proposition 2 (From [1, 13]). Algorithm 2 outputs an integer vector $\mathbf{z}$ ($\mathbf{z}\mathbf{B} \in \mathcal{L}(\mathbf{B})$) such that $(\mathbf{t} - \mathbf{z})\mathbf{B} \in \mathcal{P}(\tilde{\mathbf{B}})$.

[…]

2. When writing $\mathcal{R}_d$ as a $d$-dimensional $\mathbb{R}$-vector space, we will not use $\{1, x, \dots, x^{d-1}\}$ as our canonical basis, but a permutation of this basis.

Definition 6. Let $d \in \mathbb{N}^*$ be a product of $h$ (not necessarily distinct) primes. Let $\mathrm{gpd}(d)$ denote the greatest proper divisor of $d$. When clear from context, we also note $h$ the number of prime divisors of $d$ (counted with multiplicity), $d_h \triangleq d$ and for $i \in [\![1, h]\!]$, $d_{i-1} \triangleq \mathrm{gpd}(d_i)$ and $k_i \triangleq d_i/d_{i-1}$, so that $1 = d_0 | d_1 | \dots | d_h = d$ and $\prod_{j \le i} k_j = d_i$.

The $d_i$'s defined in Definition 6 form a tower of proper divisors of $d$. For any composite $d$, there exist multiple towers of proper divisors: for example, $1|6$, $1|2|6$ and $1|3|6$ for $d = 6$. In this paper, each time we mention a tower of proper divisors of $d$ it will refer to the unique one induced by Definition 6 (e.g. for $d = 6$, that tower is $1|3|6$).

Definition 7. Let $d, d' \in \mathbb{N}^*$ such that $d'$ is in the tower of proper divisors of $d$, and let $k = d/d'$. We denote by $x$ the indeterminate of the polynomial ring $\mathcal{R}_d = \mathbb{R}[x]/(x^d - 1)$ and by $y = x^k$ the indeterminate of $\mathcal{R}_{d'} = \mathbb{R}[y]/(y^{d'} - 1)$. We define the partial linearization $V_{d/d'} : \mathcal{R}_d^m \to \mathcal{R}_{d'}^{km}$ recursively as follows:

1. For $d = d' = 1$, $V_{d/d'}$ is the identity.

2. For $d' = \mathrm{gpd}(d)$ and a single element $a \in \mathcal{R}_d$, let $a = \sum_{i \in \mathbb{Z}_k} x^i a_i(y)$ where $a_i \in \mathcal{R}_{d'}$ for each $i$. Then

$$V_{d/d'}(a) \triangleq (a_0, \dots, a_{k-1}).$$

In other words, $V_{d/d'}(a) \in \mathcal{R}_{d'}^k$ is the row vector whose coefficients are the $(a_i)_{i \in \mathbb{Z}_k}$.

3. For a vector $\mathbf{v} \in \mathcal{R}_d^m$, $V_{d/d'}(\mathbf{v}) \in \mathcal{R}_{d'}^{km}$ is the component-wise application of $V_{d/d'}$.

4. For $d''|d'|d$ and any vector $\mathbf{v} \in \mathcal{R}_d^m$,

$$V_{d/d''}(\mathbf{v}) \triangleq V_{d'/d''} \circ V_{d/d'}(\mathbf{v}) \in \mathcal{R}_{d''}^{(d/d'')m}.$$

When $d$ is clear from context, we simply note $V_{d/d'} = V_{/d'}$.

Interpretation.
In practice, an element $a \in \mathcal{R}_d$ is represented by a vector of $d$ real elements corresponding to the $d$ coefficients of $a$. In this context, the operator $V$ simply permutes coefficients. As highlighted by Figure 4, when $d = 2^h$ is a power of two, $V_{d/1}$ permutes the coefficients according to the bit-reversal order², which appears in the radix-2 fast Fourier transform (FFT). More generally, one can show that for an arbitrary $d$, $V_{d/1}$ permutes the coefficients according to the general mixed-radix digit reversal order, which appears in the mixed-radix Cooley-Tukey FFT [2].

² https://oeis.org/A030109

[Figure 4: Partial vectorial linearizations. $c(a) = (0\ 1\ 2\ 3\ 4\ 5\ 6\ 7) \Rightarrow c(V_{/4}(a)) = (0\ 2\ 4\ 6\ 1\ 3\ 5\ 7) \Rightarrow c(V_{/2}(a)) = V_{/1}(a) = (0\ 4\ 2\ 6\ 1\ 5\ 3\ 7)$.]

We now move to the matrix representation $M$ compatible with $V$.

Definition 8. Following the notations of Definition 7, we define the operator $M_{d/d'} : \mathcal{R}_d^{n \times m} \to \mathcal{R}_{d'}^{kn \times km}$ as follows:

1. For $d = d' = 1$, $M_{d/d'}$ is the identity.

2. For $d' = \mathrm{gpd}(d)$, $k = d/d'$ and a single element $a = \sum_{i \in \mathbb{Z}_k} x^i a_i(y)$ where each $a_i \in \mathcal{R}_{d'}$, $M_{d/d'}(a)$ is the following matrix of $\mathcal{R}_{d'}^{k \times k}$:

$$\begin{pmatrix} a_0 & a_1 & \cdots & a_{k-1} \\ y a_{k-1} & a_0 & \cdots & a_{k-2} \\ \vdots & \ddots & \ddots & \vdots \\ y a_1 & y a_2 & \cdots & a_0 \end{pmatrix} = \begin{pmatrix} V_{d/d'}(a) \\ V_{d/d'}(x a) \\ \vdots \\ V_{d/d'}(x^{k-1} a) \end{pmatrix}.$$

In particular, if $d$ is prime, then $M_{d/1}(a) \in \mathbb{R}^{d \times d}$ is exactly the circulant matrix $C(a)$.

3. For a vector $\mathbf{v} \in \mathcal{R}_d^m$ or a matrix $\mathbf{B} \in \mathcal{R}_d^{n \times m}$, $M_{d/d'}(\mathbf{v}) \in \mathcal{R}_{d'}^{k \times km}$ and $M_{d/d'}(\mathbf{B}) \in \mathcal{R}_{d'}^{kn \times km}$ are the component-wise applications of $M_{d/d'}$.

4. For $d''|d'|d$ and any element $a \in \mathcal{R}_d$,

$$M_{d/d''}(a) \triangleq M_{d'/d''} \circ M_{d/d'}(a) \in \mathcal{R}_{d''}^{(d/d'') \times (d/d'')}.$$

When $d$ is clear from context, we simply note $M_{d/d'} = M_{/d'}$.

Interpretation.
As for the operator $V$, the steps of applying $M$ are depicted in Section 1, Figure 2. The linearization $M_{d/d'}(a)$ writes the transformation matrix of the map $f \in \mathcal{R}_d \mapsto fa$ using the same basis as the partial linearization $V_{d/d'}$. As a result, both operators are compatible: $\forall a, b \in \mathcal{R}_d$, $V_{d/d'}(ab) = V_{d/d'}(a) \cdot M_{d/d'}(b)$. More properties follow.

Proposition 4. Let $d \in \mathbb{N}^*$, $a, b$ (resp. $\mathbf{a}, \mathbf{b}$, resp. $\mathbf{A}, \mathbf{B}$) be arbitrary scalars (resp. vectors, resp. matrices) over $\mathcal{R}_d$, and $d'|d$. To be concise, we note $V \triangleq V_{d/d'}$ and $M \triangleq M_{d/d'}$. The maps $V$ and $M$ satisfy the following properties:

1. $M$ is an injective algebra morphism, and in particular: $M(\mathbf{A} \cdot \mathbf{B}) = M(\mathbf{A}) \cdot M(\mathbf{B})$.

2. $V$ is an injective linear map.

3. $V(ab) = V(a) \cdot M(b)$.

4. $V$ is an isometry: $\langle V(\mathbf{a}), V(\mathbf{b}) \rangle_2 = \langle \mathbf{a}, \mathbf{b} \rangle_2$.

5. $\mathbf{B}$ is full-rank if and only if $M(\mathbf{B})$ is full-rank.

Since the proofs are rather straightforward to check from the definitions, we defer them to Appendix D.

Computing $V$ and $M$ in the Fourier Domain.
The operators $V$ and $M$ that we defined can be computed very efficiently when an element $a \in \mathcal{R}_d$ is represented by its coefficients, but also when it is represented in the Fourier domain. In the first case, it is obvious that they can both be performed in time $\Theta(d)$ as they (symbolically) permute coefficients of $a$.

If $a$ is represented in FFT form (that is, by the vector $(a(\zeta_d^i))_{i \in \mathbb{Z}_d}$ in $\mathbb{C}^d$), then computing $V_{d/d'}(a)$ and $M_{d/d'}(a)$ in FFT form can naively be done in time $\Theta(d \log d)$ by computing its inverse FFT, permuting its coefficients, and computing $d/d'$ FFTs over $\mathcal{R}_{d'}$. However, it can be done in time $\Theta(d)$, as it is a single step (also known as butterfly) of the original fast Fourier transform. This is formalized in Lemma 1, a reformulation of a simple lemma that is at the heart of Cooley and Tukey's FFT.

Lemma 1 ([2], adapted). Let $d \ge 2$, $d' = \mathrm{gpd}(d)$ and $k = d/d'$. Let $V \triangleq V_{d/d'}$, $M \triangleq M_{d/d'}$. For any $a \in \mathcal{R}_d$ (resp. $\mathbf{a} \in V(\mathcal{R}_d)$, resp. $\mathbf{A} \in M(\mathcal{R}_d)$):

• $V(a)$, $V^{-1}(\mathbf{a})$, $M^{-1}(\mathbf{A})$ can be computed in FFT form in time $\Theta(kd)$.

• $M(a)$ can be computed in FFT form in time $\Theta(k^2 d)$.

Proof. Let $a \in \mathcal{R}_d$ be uniquely written $a = \sum_{i \in \mathbb{Z}_k} x^i a_i(x^k)$, where each $a_i \in \mathcal{R}_{d'}$. Cooley and Tukey show in [2] (equations 7, 8) that we can switch from the FFT of $a$ to the FFT of all the $a_i$'s (and conversely) in time $\Theta(kd)$. As the $a_i$'s are the coefficients of $V(a)$ and $M(a)$, the result follows.

In Sections 3 and 4, we will use the operators $V$ and $M$ to speed up the orthogonalization and nearest plane algorithms. The core idea is that these operators allow us to "batch" orthogonalization operations, resulting in a $\Theta(d/\log(d))$ speedup.

3. FAST FOURIER LDL DECOMPOSITION

This section presents our main result. We present the existence of a compact representation in Section 3.1, and then derive a fast algorithm to compute it in Section 3.2.

3.1 A Compact $LDL^\star$ Decomposition

Theorem 1. Let $d \in \mathbb{N}$ and $1 = d_0 | d_1 | \dots | d_h = d$ be a tower of proper divisors of $d$. Let $\mathbf{b} \in \mathcal{R}_d^m$ such that $M_{d/1}(\mathbf{b})$ is full-rank. There exists a GSO of $M_{d/1}(\mathbf{b})$ as follows:

$$M_{d/1}(\mathbf{b}) = \left( \prod_{i=0}^{h-1} M_{d_i/1}(\mathbf{L}_i) \right) \cdot \tilde{\mathbf{B}}_0$$

where $\tilde{\mathbf{B}}_0 \in \mathbb{R}^{d \times dm}$ is orthogonal, and each $\mathbf{L}_i \in \mathcal{R}_{d_i}^{(d/d_i) \times (d/d_i)}$ is a block-diagonal matrix with unit lower triangular matrices of $\mathcal{R}_{d_i}^{(d_{i+1}/d_i) \times (d_{i+1}/d_i)}$ as its $d/d_{i+1}$ diagonal blocks.

As an example, the matrix $\mathbf{L}$ of the GSO of $M_{8/1}(a)$ for some $a \in \mathcal{R}_8$ is depicted in Section 1, Figure 3.

Proof. If $d$ is prime, the theorem is trivial as it is exactly the GSO. We suppose that $d$ is composite and that the theorem is true for any $\mathcal{R}_i$ with $i < d$. By Proposition 4, item 5, the matrix $\mathbf{B}_{h-1} \triangleq M_{d/d_{h-1}}(\mathbf{b})$ is full-rank too. Using the classical GSO, we can therefore decompose it as $\mathbf{B}_{h-1} = \mathbf{L}_{h-1}\tilde{\mathbf{B}}$, where $\mathbf{L}_{h-1} \in \mathcal{R}_{d_{h-1}}^{k_h \times k_h}$, $\tilde{\mathbf{B}} \in \mathcal{R}_{d_{h-1}}^{k_h \times m k_h}$ and $k_h \triangleq d/\mathrm{gpd}(d)$. $\mathbf{L}_{h-1}$ is unit lower triangular and $\tilde{\mathbf{B}}$ is orthogonal. Noting $\tilde{\mathbf{B}} = [\mathbf{b}_1, \dots, \mathbf{b}_{k_h}]$, all the $\mathbf{b}_j$'s are pairwise orthogonal and each $M_{/1}(\mathbf{b}_j)$ is full-rank. By inductive hypothesis, they can be decomposed as follows:

$$\forall j \in [\![1, k_h]\!], \quad M_{d_{h-1}/1}(\mathbf{b}_j) = \left( \prod_{i=0}^{h-2} M_{d_i/1}(\mathbf{L}_{i,j}) \right) \cdot \tilde{\mathbf{B}}_j, \qquad (2)$$

where each $\tilde{\mathbf{B}}_j \in \mathbb{R}^{d_{h-1} \times m d_{h-1}}$ is full-rank orthogonal and for $i < h-1$, each $\mathbf{L}_{i,j} \in \mathcal{R}_{d_i}^{(d_{h-1}/d_i) \times (d_{h-1}/d_i)}$ is a block-diagonal matrix with unit lower triangular matrices of $\mathcal{R}_{d_i}^{(d_{i+1}/d_i) \times (d_{i+1}/d_i)}$ as its $d_{h-1}/d_{i+1}$ diagonal blocks. To be concise, we now note $M \triangleq M_{d_{h-1}/1}$ and $V \triangleq V_{d_{h-1}/1}$. We have:

$$\begin{aligned} M_{d/1}(\mathbf{b}) &= M(\mathbf{L}_{h-1}) \cdot M(\tilde{\mathbf{B}}) \\ &= M(\mathbf{L}_{h-1}) \cdot M[\mathbf{b}_1, \dots, \mathbf{b}_{k_h}] \\ &= M(\mathbf{L}_{h-1}) \cdot [M(\mathbf{b}_1), \dots, M(\mathbf{b}_{k_h})] \\ &= M(\mathbf{L}_{h-1}) \cdot \mathrm{Diag}_j\!\left( \prod_{i=0}^{h-2} M_{d_i/1}(\mathbf{L}_{i,j}) \right) \cdot \tilde{\mathbf{B}}_0 \\ &= M(\mathbf{L}_{h-1}) \cdot \left( \prod_{i=0}^{h-2} M_{d_i/1}(\mathbf{L}_i) \right) \cdot \tilde{\mathbf{B}}_0 \end{aligned}$$

where $\mathbf{L}_i$ denotes the block-diagonal matrix with diagonal blocks $\mathbf{L}_{i,1}, \dots, \mathbf{L}_{i,k_h}$, and $\tilde{\mathbf{B}}_0$ the matrix obtained by stacking the $\tilde{\mathbf{B}}_j$'s.

• Since each $\mathbf{L}_{i,j}$ is block diagonal with $d_{h-1}/d_{i+1}$ unit lower triangular diagonal blocks, $\mathbf{L}_i$ is block diagonal with $k_h(d_{h-1}/d_{i+1}) = d/d_{i+1}$ unit lower triangular diagonal blocks.

• We also need to show that $\tilde{\mathbf{B}}_0$ is orthogonal. Each submatrix $\tilde{\mathbf{B}}_j$ of $\tilde{\mathbf{B}}_0$ is the orthogonalization of $M(\mathbf{b}_j)$ by induction hypothesis. Therefore, for two distinct rows $\mathbf{u}, \mathbf{v}$ of $\tilde{\mathbf{B}}_0$:

– If they belong to the same submatrix $\tilde{\mathbf{B}}_j$, they are orthogonal by induction hypothesis.

– Suppose they belong to different submatrices: $\mathbf{u} \in \tilde{\mathbf{B}}_j$, $\mathbf{v} \in \tilde{\mathbf{B}}_\ell$ and $j \neq \ell$. Then $\mathbf{u}$ (resp. $\mathbf{v}$) is a linear combination of rows of $M(\mathbf{b}_j)$ (resp. $M(\mathbf{b}_\ell)$): $\mathbf{u} = \mathbf{a}_j \cdot M(\mathbf{b}_j)$ and $\mathbf{v} = \mathbf{a}_\ell \cdot M(\mathbf{b}_\ell)$ for some $\mathbf{a}_j, \mathbf{a}_\ell$ in $\mathbb{R}^{d_{h-1}}$. Noting $a_j = V^{-1}(\mathbf{a}_j)$ and $a_\ell = V^{-1}(\mathbf{a}_\ell)$:

$$\begin{aligned} \langle \mathbf{u}, \mathbf{v} \rangle_2 &= \langle V(a_j) M(\mathbf{b}_j), V(a_\ell) M(\mathbf{b}_\ell) \rangle_2 \\ &= \langle V(a_j \mathbf{b}_j), V(a_\ell \mathbf{b}_\ell) \rangle_2 = \langle a_j \mathbf{b}_j, a_\ell \mathbf{b}_\ell \rangle_2 = 0 \end{aligned}$$

where the second equality comes from Proposition 4, item 3, the third one from the fact that $V$ is a scaled isometry (Proposition 4, item 4) and the fourth one from $\mathbf{b}_j, \mathbf{b}_\ell$ being orthogonal.

Therefore $\tilde{\mathbf{B}}_0$ is orthogonal.

The theorem we stated gives the GSO of $M_{d/1}(\mathbf{b})$ for a vector $\mathbf{b} \in \mathcal{R}_d^m$, but can be easily generalized from a vector $\mathbf{b}$ to a matrix $\mathbf{B}$, and also yields a compact $LDL^\star$ decomposition.

Corollary 1. Let $d \in \mathbb{N}$ and $1 = d_0 | d_1 | \dots | d_h = d$ be a tower of proper divisors of $d$. Let $\mathbf{B} \in \mathcal{R}_d^{n \times m}$ be a full-rank matrix. There exist $h + 1$ matrices $(\mathbf{L}_i)_{0 \le i \le h}$ such that:

• $\mathbf{L}_h \in \mathcal{R}_d^{n \times n}$ is unit lower triangular.

• For each $i < h$, $\mathbf{L}_i \in \mathcal{R}_{d_i}^{n(d/d_i) \times n(d/d_i)}$ is a block-diagonal matrix whose $n(d/d_{i+1})$ diagonal blocks are unit lower triangular matrices of $\mathcal{R}_{d_i}^{(d_{i+1}/d_i) \times (d_{i+1}/d_i)}$.

Furthermore, if we note $\mathbf{L} = \prod_{i=0}^{h} M_{d_i/1}(\mathbf{L}_i)$ and $\tilde{\mathbf{B}}_0 = \mathbf{L}^{-1} \cdot M_{d/1}(\mathbf{B})$, then:

1. The GSO of $M_{d/1}(\mathbf{B})$ is $M_{d/1}(\mathbf{B}) = \mathbf{L} \cdot \tilde{\mathbf{B}}_0$.

2. The $LDL^\star$ decomposition of $M_{d/1}(\mathbf{B}\mathbf{B}^\star)$ is $M_{d/1}(\mathbf{B}\mathbf{B}^\star) = \mathbf{L} \cdot (\tilde{\mathbf{B}}_0 \tilde{\mathbf{B}}_0^t) \cdot \mathbf{L}^t$.

Proof. We have $\mathbf{B} = \mathbf{L}_h \mathbf{B}'$, where $\mathbf{L}_h$ is given by either the GSO or $LDL^\star$ decomposition algorithm. $\mathbf{B}' = \{\mathbf{b}'_1, \dots, \mathbf{b}'_n\}$ is orthogonal. Applying Theorem 1 to each row vector $\mathbf{b}'_j$ of $\mathbf{B}'$ yields $n$ decompositions $(\mathbf{L}_{i,j})_{0 \le i < h}$ […]

Algorithm 3 $\mathrm{ffLDL}^\star_{\mathcal{R}_d}(\mathbf{G})$
Require: […]
Ensure: The compact $LDL^\star$ decomposition of $\mathbf{G}$ in FFT form
1: $(\mathbf{L}, \mathbf{D}) \leftarrow LDL^\star_{\mathcal{R}_d}(\mathbf{G})$
2: if $d = 1$ then
3: return $(\mathbf{L}, \mathbf{D})$
4: end if
5: $d' \leftarrow \mathrm{gpd}(d)$
6: for $i = 1, \dots, n$ do
7: $\mathfrak{L}_i \leftarrow \mathrm{ffLDL}^\star_{\mathcal{R}_{d'}}(M_{d/d'}(D_{ii}))$
8: end for
9: return $(\mathbf{L}, (\mathfrak{L}_i)_{1 \le i \le n})$

Algorithm 3 computes a "fast Fourier $LDL^\star$", instead of the "fast Fourier GSO" hinted at in the proof of Corollary 1. The reason why we favor this approach is that it allows a complexity gain. This gain can already be observed in the […] with the Gram-Schmidt process. The same phenomenon happens with their recursive variants.

Lemma 2. Let $d \in \mathbb{N}$ and $1 = d_0 | d_1 | \dots | d_h = d$ be the tower of proper divisors of $d$ given by the successive gpd's, and for $i \in [\![1, h]\!]$, let $k_i \triangleq d_i/d_{i-1}$. Let $\mathbf{G} \in \mathcal{R}_d^{n \times n}$ be a full-rank Gram matrix. Then Algorithm 3 computes the $LDL^\star$ decomposition tree of $\mathbf{G}$ in FFT form in time

$$\Theta(n^2 d \log d) + \Theta(n^3 d) + \Theta(nd) \sum_{1 \le i \le h} k_i^2.$$

In particular, if all the $k_i$ are bounded by a small constant $k$, then the complexity of Algorithm 3 is upper bounded by $\Theta(n^3 d + n^2 d \log d)$.

The proof of Lemma 2 is left in Appendix E. We note that Algorithm 3 is parallelizable to up to $d$ processes: step 1 relies on operations on polynomials which are parallelizable, and all the iterations of step 7 are independent.

4. FAST FOURIER NEAREST PLANE

In this section, we show how to exploit further the compact form of the $LDL^\star$ decomposition to have a fast Fourier variant of the nearest plane algorithm. It outputs vectors of the same quality (i.e. as close to the target vector) as its classical iterative counterpart Algorithm 2, but runs $\tilde{\Theta}(d)$ times faster.

Definition 9. Let $\mathcal{Z}_d$ denote the ring $\mathbb{Z}[x]/(x^d - 1)$ of elements of $\mathcal{R}_d$ with integer coefficients.

Algorithm 4 $\mathrm{ffNP}_{\mathcal{R}_d}$
[…]
9: $z_j \leftarrow V^{-1}_{d/d'}\left( \mathrm{ffNP}_{\mathcal{R}_{d'}}(V_{d/d'}(\bar{t}_j), \mathfrak{L}_j) \right)$
10: end for
11: return $\mathbf{z} = (z_1, \dots, z_n)$

Lemma 3. Let $\mathbf{B} = \{\mathbf{b}_1, \dots, \mathbf{b}_n\} \in \mathcal{R}_d^{n \times m}$ and $\tilde{\mathbf{B}} = \{\tilde{\mathbf{b}}_1, \dots, \tilde{\mathbf{b}}_n\}$ be its Gram-Schmidt orthogonalization in $\mathcal{R}_d$. The vectors $\mathbf{z}$ and $\bar{\mathbf{t}} = (\bar{t}_1, \dots, \bar{t}_n)$ in Algorithm 4 satisfy $(\mathbf{z} - \mathbf{t}) \cdot \mathbf{B} = (\mathbf{z} - \bar{\mathbf{t}}) \cdot \tilde{\mathbf{B}}$.

Proof. We recall that for each $i \in [\![1, n]\!]$, $\tilde{\mathbf{b}}_i = \mathbf{b}_i - \sum_{j < i} L_{ij} \tilde{\mathbf{b}}_j$. We have:

$$\begin{aligned} […] &= \sum_{1 \le j \le i \le n} (z_i - t_i) L_{ij} \tilde{\mathbf{b}}_j \\ &= \sum_{1 \le i \le n} (z_i - t_i) \mathbf{b}_i \\ &= (\mathbf{z} - \mathbf{t}) \cdot \mathbf{B}. \end{aligned} \qquad (3)$$

The first and last equalities are trivial, the second one replaces the $\bar{t}_j$'s by their definitions, the third one just simplifies the sum and the fourth one is another way of saying that $\mathbf{L} \cdot \tilde{\mathbf{B}} = \mathbf{B}$.

Theorem 2. Let $M \triangleq M_{d/1}$ and $V \triangleq V_{d/1}$. Algorithm 4 outputs $\mathbf{z} \in \mathcal{Z}_d^n$ such that $V((\mathbf{z} - \mathbf{t})\mathbf{B}) \in \mathcal{P}(\tilde{\mathbf{B}}_0)$, where $\tilde{\mathbf{B}}_0$ is the orthogonalization of $M(\mathbf{B})$ over $\mathbb{R}$.

Proof. The result is trivially true if $d = 1$. We prove it in the general case. By definition, each subtree $\mathfrak{L}_j$ is the $LDL^\star$ decomposition tree of $M_{d/\mathrm{gpd}(d)}(\tilde{\mathbf{b}}_j)$. By induction hypothesis, we therefore know that $V((z_j - \bar{t}_j)\tilde{\mathbf{b}}_j) \in \mathcal{P}(\tilde{\mathbf{B}}_j)$, where $\tilde{\mathbf{B}}_j$ is the orthogonalization of $\mathbf{B}_j \triangleq M(\tilde{\mathbf{b}}_j)$. From Lemma 3, we have

$$(\mathbf{z} - \mathbf{t})\mathbf{B} = \sum_{j=1,\dots,n} (z_j - \bar{t}_j) \cdot \tilde{\mathbf{b}}_j,$$

so $V((\mathbf{z} - \mathbf{t})\mathbf{B}) \in \mathcal{P}([\tilde{\mathbf{B}}_1, \dots, \tilde{\mathbf{B}}_n])$. Now, from the proof of Corollary 1, we know that $\tilde{\mathbf{B}}_0 = [\tilde{\mathbf{B}}_1, \dots, \tilde{\mathbf{B}}_n]$ is actually the orthogonalization of $M(\mathbf{B})$, which concludes the proof.

[Figure 5: High-level execution of the fast Fourier nearest plane algorithm. An ongoing execution of Algorithm 4 for a single $t \in \mathcal{R}_8$. (1) Arrows are labeled in their order of execution. Downward (↓) (resp. upward (↑)) arrows correspond to computing $V_{d/d'}(\bar{t}_j)$ (resp. $V^{-1}_{d/d'}(\dots)$) at step 9. Transverse arrows correspond to step 8. (2) Cells labeled with R (as in Rounded) correspond to already completed subcalls of Algorithm 4, as opposed to those labeled with NR (as in Not Rounded).]

Unlike the fast Fourier transform, Algorithm 4 is not fully parallelizable, due to step 8 (transverse arrows in Figure 5). However, its complexity in $d$ is $\Theta(d \log d)$: informally, this is because each arrow (downward, upward or transverse) has a linear complexity in the size of the cells it connects. A more precise statement follows.

Lemma 4. Let $d \in \mathbb{N}$ and $1 = d_0 | d_1 | \dots | d_h = d$ be the tower of proper divisors of $d$ given by the successive gpd's, and for $i \in [\![1, h]\!]$, let $k_i \triangleq d_i/d_{i-1}$. Let $\mathbf{B} \in \mathcal{R}_d^{n \times m}$ and $\mathfrak{L}$ be its $LDL^\star$ decomposition tree. The complexity of Algorithm 4 is given by:

$$\Theta(nd \log d) + \Theta(n^2 d) + \Theta(nd) \sum_{1 \le i \le h} k_i^2.$$

In particular, if all the $k_i$ are bounded by a constant, then the complexity of Algorithm 4 is $\Theta(n^2 d + nd \log d)$.

The proof of Lemma 4 is deferred to Appendix F.

Acknowledgements

The authors would like to thank the anonymous reviewers of ISSAC and EUROCRYPT for their diligent comments, which significantly contributed to improving the presentation of this article.

5. REFERENCES

[1] Babai, L. On Lovász' lattice reduction and the nearest lattice point problem. Combinatorica 6, 1 (1986), 1–13. Preliminary version in STACS 1985.
[2] Cooley, J. W., and Tukey, J. W. An algorithm for the machine calculation of complex Fourier series. Mathematics of Computation 19, 90 (1965), 297–301.
[3] Durbin, J. The fitting of time-series models. Review of the International Statistical Institute 28, 3 (1960), pp. 233–244.
[4] Gentleman, W. M., and Sande, G. Fast Fourier transforms: for fun and profit. In Proceedings of the November 7-10, 1966, fall joint computer conference (1966), ACM, pp. 563–578.
[5] Gentry, C., Peikert, C., and Vaikuntanathan, V. Trapdoors for hard lattices and new cryptographic constructions. In STOC (2008).
[6] Gorbunov, S., Vaikuntanathan, V., and Wee, H. Attribute-based encryption for circuits. In 45th ACM STOC (June 2013), D. Boneh, T. Roughgarden, and J. Feigenbaum, Eds., ACM Press, pp. 545–554.
[7] Gragg, W. B. Positive definite Toeplitz matrices, the Arnoldi process for isometric operators, and Gaussian quadrature on the unit circle. Journal of Computational and Applied Mathematics 46, 1-2 (1993), 183–198.
[8] Heideman, M. T., Johnson, D. H., and Burrus, C. S. Gauss and the history of the fast Fourier transform. ASSP Magazine, IEEE 1, 4 (1984), 14–21.
[9] Hoffstein, J., Howgrave-Graham, N., Pipher, J., Silverman, J. H., and Whyte, W. NTRUSIGN: Digital signatures using the NTRU lattice. In CT-RSA (2003).
[10] Hoffstein, J., Pipher, J., and Silverman, J. H. NTRU: A ring-based public key cryptosystem. In ANTS (1998), pp. 267–288.
[11] Klein, P. N. Finding the closest lattice vector when it's unusually close. In SODA (2000), pp. 937–941.
[12] Lang, S. Algebraic number theory, 3 ed. 1995.
[13] Lenstra, A. K., Lenstra, Jr., H. W., and Lovász, L. Factoring polynomials with rational coefficients. Mathematische Annalen 261, 4 (December 1982), 515–534.
[14] Levinson, N. The Wiener RMS (root mean square) error criterion in filter design and prediction. J. Math. Phys. Mass. Inst. Tech. 25 (1947), 261–278.
[15] Lyubashevsky, V., Micciancio, D., Peikert, C., and Rosen, A. SWIFFT: A modest proposal for FFT hashing. In FSE (2008).
[16] Lyubashevsky, V., Peikert, C., and Regev, O. On ideal lattices and learning with errors over rings. In EUROCRYPT (2010), pp. 1–23.
[17] Lyubashevsky, V., and Prest, T. Quadratic time, linear space algorithms for Gram-Schmidt orthogonalization and Gaussian sampling in structured lattices. In EUROCRYPT 2015.
[18] Micciancio, D., and Peikert, C. Trapdoors for lattices: Simpler, tighter, faster, smaller. In EUROCRYPT (2012).
[19] Nussbaumer, H. J. Fast Fourier transform and convolution algorithms, vol. 2. Springer Science & Business Media, 2012.
[20] Olshevsky, V. Fast Algorithms for Structured Matrices: Theory and Applications, vol. 323. American Mathematical Soc., 2003.
[21] Olshevsky, V., and Pan, V. Y. A unified superfast algorithm for boundary rational tangential interpolation problems and for inversion and factorization of dense structured matrices. In FOCS (1998).
[22] Peikert, C. An efficient and parallel Gaussian sampler for lattices. In CRYPTO (2010), pp. 80–97.
[23] Sweet, D. R. Fast Toeplitz orthogonalization. Numerische Mathematik 43, 1 (1984), 1–21.

APPENDIX

A. EXTENDING THE RESULTS TO CYCLOTOMIC RINGS

In this section we argue that our results hold in the cyclotomic case as well. It turns out that all the previous arguments can be made more general. The required ingredients are the following:

1. A tower of unitary rings endowed with inner products onto $\mathbb{R}$.

2. For any rings $S, T$ of the tower, injective maps $M' : S \to T^{k \times k}$ and $V' : S \to T^k$, with $S$ of rank $d$ over $\mathbb{R}$, and $T$ of rank $d/k$ over $\mathbb{R}$.

3. $M'$ is a ring morphism.

4. $V'$ is a scaled linear isometry.

5. $V'(ab) = M'(a)V'(b)$.

[…]

Proposition 5. Let $d \in \mathbb{N}^*$ and $\iota = \iota_d$. The embedding $\iota$:

1. is an injective ring morphism.

2. is an isometry: for any $f, g \in \mathcal{F}_d$, $\langle \iota(f), \iota(g) \rangle = \langle f, g \rangle$.

Proof. Item 1 follows from the fact that $e_d$ is idempotent: $e_d^2 = e_d$. Indeed this implies that $e_d(a + bc) = e_d a + e_d bc = e_d a + e_d^2 bc = e_d a + (e_d b)(e_d c)$. In addition, for any element $g \in \iota(\mathcal{F}_d)$, $g \bmod \phi_d$ is the unique antecedent of $g$ with respect to $\iota$, so $\iota$ is bijective onto $\iota(\mathcal{F}_d)$ and $\iota^{-1}(g) = g \bmod \phi_d$, which proves item 1. Item 2 follows from equation (4).

Lemma 5. Let $d \ge 2$, $d'|d$, $k = d/d'$ and $a \in \mathcal{R}_d$. Then

$$(a \in \iota(\mathcal{F}_d)) \iff V_{d/d'}(a) \in \iota(\mathcal{F}_{d'})^k.$$

Proof. We prove the lemma for $d' = \mathrm{gpd}(d)$; the extension to the general case is straightforward. $a$ can be uniquely written as $a = \sum_{0 \le i < k} x^i a_i(x^k)$ where each $a_i \in \mathcal{R}_{d'}$. Let $\zeta_d$ […]

Equivalently, ι(f) is the only element in Rd satisfying: This gives the condition 2. Lemma 5 allows to argue that −1 0  the image of Vd/d0 ◦ιd is in the definition domain of ιd0 : V is f(ζ) if φ (ζ) = 0 0 ι(f)(ζ) = d (4) well defined, and similarly for M . Conditions 3 and 5 follow 0 if ψd(ζ) = 0 from the fact that ιd, ιd0 are ring morphisms and that similar properties hold for Md/d0 and Vd/d0 . Condition 4 is true # ffNearestPlane, i/o in base B, fft format (sec. 4) because ιd, Vd/d0 and ιd0 are isometries. Finally, condition 6 def ffBabai_aux(T,t): holds in the FFT representation, from Lemma 1 and from the if len(t)==1: fact that ι in the Fourier domain simply consist of inserting return array([round(t.real)]) some zeros at appropriate positions. (t1,t2) = ffsplit(t) (L,[T1,T2]) = T z2 = ffBabai_aux(T2,t2) B. IMPLEMENTATION IN PYTHON tb1 = t1 + (t2-z2) * conjugate(L) In this section, we give the core of the Python implemen- z1 = ffBabai_aux(T1,tb1) tation of our algorithms when d is a power of 2. The full return ffmerge(z1,z2) implementation, including correctness tests, is in the Public Domain and is available online at the folliwng link: # ffNearestPlane, i/o in canonical base, coef. format def ffBabai(f,T,c): https://github.com/lducas/ffo.py F = fft(f) t = fft(c)/ F z = ffBabai_aux(T,t) Conventions. return ifft(z * F) In python.numpy, the arithmetic operations +,-,*,/ on ar- rays denote coefficient-wise operations. The functions fft and its inverse ifft are built in. The symbol j denotes C. PROOF OF PROPOSITION 1 the imaginary unit. The primitive creates the zeros(d) m ∆ ? d-dimensional zero vector. Proof. For any x, y ∈ R , let hx, yiR = x · y . One can check that h·, ·iR is a Hermitian inner product. In particular,

(hx, xiR = 0) ⇔ (hx, xi = 0) ⇔ (x = 0). # Simplified extract of ffo.py ˜ from numpy import * The decomposition B = L · B can be computed using the Gram-Schmidt process (Algorithm 5). # Linearize operation V, i/o in fft format def ffsplit(F): Algorithm 5 GramSchmidtR(B) d = len(F) 1: for i = 1, . . . , n do winv = exp(2j*pi / d) 2: b˜ ← b Winv = array([winv i for i in range(d/2)]) i i ** 3: for j = 1, . . . , i − 1 do F1 = .5* (F[:d/2] + F[d/2:]) ˜ hbi,bj iR F2 = .5* (F[:d/2]- F[d/2:]) * Winv 4: Li,j = ˜ ˜ hbj ,bj iR return( F1,F2) 5: b˜i ← b˜i − Li,j b˜j 6: end for # Inverse linearize V^-1, i/o in fft format 7: end for def ffmerge(F1,F2): 8: return (B˜ = {b˜ ,..., b˜ }, L = (L ) ) d = 2*len(F1) 1 n ij 16i,j6n F = 0.j*zeros(d) # Force F to complex float w = exp(-2j pi / d) * If we replace R with R or a number field, it is well-known W = array([w**i for i in range(d/2)]) that Algorithm 5 terminates whenever B is full-rank, and F[:d/2] = F1 + W * F2 outputs (B˜ , L) satisfying equation 1. However, it is less F[d/2:] = F1 - W F2 * obvious in our case, since R is no longer a field and the return F ˜ ˜ division by hbj , bj iR in step 4 might be problematic. To show that the output of Algorithm 5 satisfies equation 1 # ffLDL alg., i/o in fft format when R = R[x]/(h(x)), it suffices to show that for any # Outputs an L-Tree (sec 3.2) ˜ ˜ def ffLDL(G): j ∈ 1, n , hbj , bj iR is invertible. Suppose that it is not the J K d = len(G) case, then there exists j ∈ 1, n , and a ∈ R\{0} such that ˜ ˜ J K˜ ˜ if d==1: ahbj , bj iR = 0. By linearity, habj , abj iR = 0 and therefore return( G,[]) ˜ ˜ P ˜ abj = 0. Since bj = bj − i

# ffLQ, i/o in fft format Our arguments seamlessly transfer to the termination of # outputs an L-Tree (sec 3.2) Algorithm 1, as the elements Dj in Algorithm 1 are exactly def ffLQ(f): ˜ ˜ F = fft(f) the elements hbj , bj iR in Algorithm 5. G = F*conjugate(F) T = ffLDL(G) D. PROOF OF PROPOSITION 4 return T Proof. We show the properties separately: 0 t 1. We first prove this statement for d = gpd(d) and for (L, D) ← LDLRd (G) in FFT form. For each i ∈ 1, n , we elements a, b ∈ Rd. All the requirements for show- know from Lemma 1 that Md/ gpd(d)(Dii) can be computedJ K 2 ing that M is a homomorphism are trivial, except for in time Θ(dkh), hence the third term. The last one is for the the fact that it is multiplicative. First, one can check n recursive calls to itself. We then have from Definition 8 that M(ab) = M(a) · M(b). Let P d 3 d n×p p×m C(kh, dh−1) = d Θ(di−1ki ) + d C(k1, d0) A = (aij ) ∈ R and B = (bij ) ∈ R . Since i 1 d d 16i6h ∆ P P 2 (9) AB = ( aikbkj )1 i n,1 j m, we have = Θ(d) ki , 16k6p 6 6 6 6 16i6h   P  where the first equality is shown by induction using equa- M(AB) = M 1 k p aikbkj 6 6 i,j tion 8, except the first term Θ(n2d log d) which is no longer P  . = 1 k p M(aik)M(bkj ) relevant since we are already in the Fourier domain. Com- 6 6 i,j = M(A)M(B) bining equations 8 and 9, we conclude that the complexity of the whole algorithm is 00 Multiplicity then seamlessly transfers to any d |d: 2 3 C(n, d) = Θ(n d log d) + Θ(n d) + nC(kh, dh−1) = Θ(n2d log d) + Θ(n3d) + Θ(nd) P k2. Md/d00 (A · B) = Md0/d00 ◦ Md/d0 (A · B) i 16i6h = Md0/d00 (Md/d0 (A) · Md/d0 (B)) = Md0/d00 ◦ Md/d0 (A) · Md0/d00 ◦ Md/d0 (B) = Md/d00 (A) · Md0/d00 (B). To show injectivity, it suffices to see that if d0 = gpd(d), F. PROOF OF LEMMA 4 then (Md/d0 (a) = 0) ⇔ (a = 0). From the definition, 0 Proof. Let C(k, d) denote the complexity of Algorithm 4 this property seamlessly transfers to any d |d and any over input t ∈ Rk. We have this recursion formula: matrix A. 
d C(n, d) = Θ(nd log d) + Θ(n2d) + Θ(ndk2 ) + nC(k , d ), 2. This item is immediate from the definition. h h h−1 3. It suffices to notice that for any a, V(a) is the first line where the first term corresponds to computing the FFT of the of M(a). As M is a multiplicative homomorphism, the n coefficients of t, the second term to performing computing result follows. the tj ’s (step 8) in FFT form, the third one to the n calls to V−1 , M (see Lemma 1) and the fourth one to 4. It suffices to prove it for elements a, b ∈ R (instead d/ gpd(d) d/ gpd(d) d the n recursive calls to itself. We have of vectors) and for d0 = gpd(d), the generalization to 0 P d 3 d vectors and to arbitrary values of d is then immediate. C(kh, dh−1) = Θ(di−1ki ) + C(k1, d0) di d1 P i gpd(d) P i gpd(d) 16i6h Let a = i x ai(x ), b = i x bi(x ), where P 2 (10) P j P j = Θ(d) ki , ∀i, ai = ai,j x and bi = bi,j x . j∈Zgpd(d) j∈Zgpd(d) 16i6h Then where the equalities are obtained using the same reasoning ∆ X X ∆ as in the proof of Lemma 2. Similarly, we can then conclude ha, bi2 = hai,j , bi,j i2 = hai, bii2 = hV(a), V(b)i2. i,j i that: 2 P 2 5. We have: C(n, d) = Θ(nd log d) + Θ(n d) + Θ(nd) ki . 16i6h B full-rank ⇔ ∀a, aB = 0 iff a = 0 ⇔ ∀a, V(aB) = 0 iff V(a) = 0 ⇔ ∀a0, V(a0)M(B) = 0 iff V(a0) = 0 ⇔ ∀a0, a0M(B) = 0 iff a0 = 0 ⇔ M(B) full-rank

The first and last equivalences are simply the definition, the second and fourth uses the fact that V is a vector space isomorphism and the third one uses Proposition 4, item 3.

E. PROOF OF LEMMA 2

Proof. Let C(k, d) denote the complexity of Algorithm 3 k×k over a matrix G ∈ Rd . We have the following recursion formula: 2 3 2 C(n, d) = Θ(n d log d) + Θ(n d) + Θ(dkh) + nC(kh, dh−1), (8) where the first term corresponds to computing the FFT of the n2 coefficients of G, and the second term to performing
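To make the quantities in the complexity bound of Lemma 4 concrete, the following sketch computes the tower of divisors $1 = d_0 | d_1 | \dots | d_h = d$, the quotients $k_i = d_i / d_{i-1}$ and the sum $\sum k_i^2$. It assumes that gpd(d) denotes the greatest proper divisor of $d$, that is, $d$ divided by its smallest prime factor; the helper names `gpd`, `divisor_tower`, `quotients` and `sum_k_squared` are illustrative and are not part of ffo.py.

```python
def gpd(d):
    """Greatest proper divisor of d >= 2 (assumed meaning of gpd):
    d divided by its smallest prime factor, or 1 when d is prime."""
    p = 2
    while p * p <= d:
        if d % p == 0:
            return d // p
        p += 1
    return 1  # d is prime


def divisor_tower(d):
    """Return [d_0, ..., d_h] with d_h = d, d_{i-1} = gpd(d_i) and d_0 = 1."""
    tower = [d]
    while tower[-1] > 1:
        tower.append(gpd(tower[-1]))
    return tower[::-1]


def quotients(d):
    """Return [k_1, ..., k_h] with k_i = d_i / d_{i-1}."""
    t = divisor_tower(d)
    return [t[i] // t[i - 1] for i in range(1, len(t))]


def sum_k_squared(d):
    """The factor sum_i k_i^2 appearing in the Theta(nd) term of Lemma 4."""
    return sum(k * k for k in quotients(d))


if __name__ == "__main__":
    for d in (12, 16, 7):
        print(d, divisor_tower(d), quotients(d), sum_k_squared(d))
```

For a power of two every $k_i$ equals 2, so the sum grows like $\log d$, whereas a prime $d$ gives a single quotient $k_1 = d$ and a quadratic term.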
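The ffsplit and ffmerge routines of Appendix B can be checked on a small input. The sketch below is a standard-library re-implementation on plain lists (an assumption made so that it runs without numpy; it is not the code of ffo.py), with a naive quadratic DFT that follows numpy's sign convention. It illustrates that, in the Fourier domain, ffsplit returns the transforms of the even-indexed and odd-indexed coefficients, and that ffmerge inverts it.

```python
import cmath


def dft(f):
    """Naive O(d^2) DFT, numpy convention: F[k] = sum_n f[n] exp(-2 i pi k n / d)."""
    d = len(f)
    return [sum(f[n] * cmath.exp(-2j * cmath.pi * k * n / d) for n in range(d))
            for k in range(d)]


def ffsplit(F):
    """List-based analogue of ffsplit: split an FFT-format vector into two halves."""
    d = len(F)
    h = d // 2
    winv = [cmath.exp(2j * cmath.pi * i / d) for i in range(h)]
    F1 = [0.5 * (F[i] + F[i + h]) for i in range(h)]
    F2 = [0.5 * (F[i] - F[i + h]) * winv[i] for i in range(h)]
    return F1, F2


def ffmerge(F1, F2):
    """List-based analogue of ffmerge: inverse of ffsplit."""
    h = len(F1)
    d = 2 * h
    w = [cmath.exp(-2j * cmath.pi * i / d) for i in range(h)]
    return ([F1[i] + w[i] * F2[i] for i in range(h)] +
            [F1[i] - w[i] * F2[i] for i in range(h)])


def close(u, v, eps=1e-9):
    """Coefficient-wise comparison up to floating-point error."""
    return len(u) == len(v) and all(abs(a - b) < eps for a, b in zip(u, v))


f = [3.0, -1.0, 4.0, 1.0, -5.0, 9.0, 2.0, -6.0]
F = dft(f)
F1, F2 = ffsplit(F)
even_ok = close(F1, dft(f[0::2]))          # F1 is the DFT of the even coefficients
odd_ok = close(F2, dft(f[1::2]))           # F2 is the DFT of the odd coefficients
roundtrip_ok = close(ffmerge(F1, F2), F)   # ffmerge undoes ffsplit
```

Each call touches every entry a constant number of times, which matches the linear cost per arrow used in the informal complexity argument.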
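As an illustration of Algorithm 5 in the field case (where the division in step 4 is unproblematic), here is a minimal sketch over the rationals using exact arithmetic. It assumes the ordinary dot product as inner product and sets the diagonal of $L$ to 1 so that $B = L \cdot \tilde{B}$; the names `inner` and `gram_schmidt` are hypothetical, and the structured ring case treated in the paper is not covered.

```python
from fractions import Fraction


def inner(x, y):
    """Ordinary dot product, standing in for <., .>_R in the field case."""
    return sum(a * b for a, b in zip(x, y))


def gram_schmidt(B):
    """Follow the steps of Algorithm 5 for a full-rank list of rational row vectors.

    Returns (Bt, L) with B = L * Bt, L unit lower triangular, rows of Bt orthogonal.
    """
    n = len(B)
    Bt = []
    L = [[Fraction(int(i == j)) for j in range(n)] for i in range(n)]
    for i in range(n):
        bt = [Fraction(c) for c in B[i]]                           # step 2
        for j in range(i):                                         # step 3
            L[i][j] = inner(B[i], Bt[j]) / inner(Bt[j], Bt[j])     # step 4
            bt = [c - L[i][j] * e for c, e in zip(bt, Bt[j])]      # step 5
        Bt.append(bt)
    return Bt, L


B = [[1, 1, 0], [1, 0, 1], [0, 1, 1]]
Bt, L = gram_schmidt(B)
n, m = len(B), len(B[0])
# Recompute L * Bt to check the decomposition.
recon = [[sum(L[i][j] * Bt[j][c] for j in range(n)) for c in range(m)]
         for i in range(n)]
```

If $B$ is not full-rank, some inner(Bt[j], Bt[j]) vanishes and the division fails, which is the field-case counterpart of the invertibility question discussed in Appendix C.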