Fast Fourier Orthogonalization

∗ † Léo Ducas Thomas Prest CWI, Cryptology Group Thales Communications & Security Amsterdam, The Netherlands Gennevilliers, France [email protected] [email protected]

ABSTRACT when the matrix is circulant. This is achieved by exploiting the isomorphism between the ring of d × d circulant matrices The classical fast (FFT) allows to compute d in quasi-linear time the product of two , in the and the circular convolution ring Rd = R[x]/(x − 1). d A widely studied and more involved question is matrix circular convolution ring R[x]/(x − 1) — a task that naively requires quadratic time. Equivalently, it allows to accelerate decomposition [20] for structured matrices, in particular matrix-vector products when the matrix is circulant. Gram-Schmidt orthgonalization (GSO). In this work, we are In this work, we discover that the ideas of the FFT can be interested in the GSO of circulant matrices, and more gener- applied to speed up the orthogonalization process of matrices ally of matrices with circulant blocks. Our main motivation with circulant blocks of size d × d. We show that, when d is is to accelerate lattice algorithms for lattices that admit a composite, it is possible to proceed to the orthogonalization basis with circulant blocks. This use case allows a helpful in an inductive way —up to an appropriate re-indexation of extra degree of freedom: one may permute rows and columns rows and columns. This leads to a structured Gram-Schmidt of the lattice basis since this leaves the generated lattice decomposition. In turn, this structured Gram-Schmidt de- unchanged —up to an isometry. composition accelerates a cornerstone lattice algorithm: the As we will show, a proper re-indexation of these matri- nearest plane algorithm. The complexity of both algorithms ces highlights an inductive structure, with a fast Fourier may be brought down to Θ(d log d). flavor. This leads to accelerations of the orthogonalization Our results easily extend to cyclotomic rings, and can be process —and of the related nearest plane algorithm— down adapted to Gaussian samplers. This finds applications in to quasilinear time and space. lattice-based , improving the performances of trapdoor functions. The Nearest Plane Algorithm, Lattices and Cryptography Keywords. , Gram-Schmidt or- thogonalization, nearest plane algorithm, lattice algorithms, The nearest plane algorithm [1] is a central algorithm over lattice trapdoor functions. lattices. It allows, after precomputation of the GSO and using a quadratic number of real operations, to find a rel- atively close point in a lattice to an arbitrary target. It is 1. INTRODUCTION a core subroutine of LLL [13], and can be used for error When operations involving linear algebra are to be per- correction over analogical noisy channels. It has also found formed over structured matrices, a natural problem is to applications in lattice-based cryptography as a decryption al- accelerate them by exploiting the structure. The most classi- gorithm, and a randomized variant (called discrete Gaussian cal example is the fast Fourier transform [2, 8], which allows sampling) [11, 5] provides secure trapdoor functions based to perform matrix-vector multiplication in quasilinear time on lattice problems. This leads to cryptosystems (attribute- based encryption) with fine-grained access control, as [18, 6] ∗ This work has been supported by a grant from CWI from budget to name a few. d for public-private-partnerships and in part by a grant from NXP Given a basis B of a lattice L ⊂ R and a target vector c, Semiconductors. Part of this work was also supported by an NWO the nearest plane algorithm finds a lattice point somewhat Free Competition Grant. close to c. The result belongs to a fundamental domain † ´ Part of this work was done when the author was at the Ecole centered in c, whose shape is the cuboid defined by B˜ , the Normale Sup´erieure, Paris. This work was partially funded by the 2 European Union H2020 SAFEcrypto project (grant no. 644729). GSO of B (see Figure 1). This algorithm performs Θ(d ) real operations. The GSO itself is required as a precomputation.

Structured lattices in cryptography. When it comes to practical lattice-based cryptography, a quadratic cost in the dimension is rather prohibitive consid- ering the lattices at hand have dimensions ranging in the hundreds, or even thousands. For efficiency purposes, many cryptosystems (such as [10, 15, 16] to name a few) chose to Full version of the article to appear at the 41st International Symposium rely on lattices with some algebraic structure, improving time on Symbolic and Algebraic Computation (ISSAC 2016). and memory requirements to quasilinear in the dimension. This is sometimes referred as lattice-based cryptography in must follow the tower of rings R ⊂ Rd1 ⊂ ··· ⊂ Rdi−1 ⊂ Rd, the ring setting. Technically, the chosen rings typically are for some chain of divisors 1|d1| ... |di−1|d. cyclotomic rings, but those are closely related to the convo- Such a representation is obtained by applying the (mixed- lution rings discussed so far. The core of this optimization radix) digit-reversal order to the indexation of the rows and is the fast Fourier transform (FFT) [2, 4, 8, 19] allowing column of the circulant blocks, as pictured in Figure 2. fast multiplication of polynomials. But this improvement We show that this alternative indexation allows to repre- did not apply in the case of the nearest plane algorithm or sent the matrix L of the GSO in a factorized form: a product its randomized variant [11, 5]: na¨ıve GSO do not take the of Θ(log d) (sparse) structured matrices, each of which can algebraic structure into account. be stored in space O(d). An example is given in Figure 3. One possible work-around [9, 22] consist of using the round- Once this hidden structure is unveiled (Theorem 1), the al- off algorithm [1] instead of the nearest-plane algorithm. How- gorithmic implications follow quite naturally: one first easily ever, this simpler algorithm outputs further vectors, both in derive an algorithm in time O(n log2 n) —matching previous the average and worst cases, weakening those cryptosystems. and more general results [21]— but noting that the splitting step may be performed directly in the Fourier domain al- lows us to save another log n factor. For easier algorithmic manipulations, the factorization of L is represented using a

c c tree.    0 1 2 3 4 5 6 7  0 2 4 6 1 3 5 7 7 0 1 2 3 4 5 6  6 0 2 4 7 1 3 5  Round-off Nearest plane      6 7 0 1 2 3 4 5   4 6 0 2 5 7 1 3     2 4 6 0 3 5 7 1  Figure 1: Round-off and nearest plane algorithms,  5 6 7 0 1 2 3 4      ⇒   and their associated fundamental domains.  4 5 6 7 0 1 2 3   7 1 3 5 0 2 4 6       3 4 5 6 7 0 1 2   5 7 1 3 6 0 2 4      Our contribution  2 3 4 5 6 7 0 1   3 5 7 1 4 6 0 2  1 2 3 4 5 6 7 0 1 3 5 7 2 4 6 0 In this work, we discover new algorithms, obtained by cross- C(a) C(M (a)) ing Cooley-Tukey’s [2] fast Fourier transform algorithm to- 8/4 gether with the orthogonalization and nearest plane algo-  0 4 2 6 1 5 3 7  ? rithms (not exactly the GSO, but the closely related LDL de-  4 0 6 2 5 1 7 3    composition). Precisely, we show that, up to a re-indexation  6 2 0 4 7 3 1 5    of rows and columns, the orthogonalization of matrices com-  2 6 4 0 3 7 5 1  posed of d × d-circulant blocks can be done in time Θ(d log d) ⇒    7 3 1 5 0 4 2 6    when the prime factors of d are bounded. Our algorithm pro-  3 7 5 1 4 0 6 2  ?   duces the LDL decomposition in a special compact format,   requiring Θ(d log d) complex numbers to represent.  5 1 7 3 6 2 0 4  From this compact representation, the nearest plane algo- 1 5 3 7 2 6 4 0 1 rithm can also be performed using Θ(d log d) real operations. C(M8/2(a)) = M8/1(a) As a demonstration of the simplicity of our algorithms, we propose an implementation in python for d × d-circulant Figure 2: Re-indexing the transformation matrix of 2 7 matrices, when d is a power of 2. f ∈ R8 7→ fx (a = 0 + x + 2x + ··· + 7x ∈ R8).

Computational model and Parallelism. Our algorithms and their complexity are described for  1   1  Random Access Machines with unitary arithmetic operations  1   1  on real numbers, i.e. we do not study the issue related to      1   e f 1  floating-point approximations and numerical stability.  1   f e 1  ? L =   ·   Regarding parallelism, we note that our Fast-Fourier LDL  a c 1   1   b d    algorithm is almost fully parallelizable: viewed as an arith-  b a d c 1   1      metic circuit on real numbers, it has depth O(log(n)) and  d c a b 1   g h 1  width O(n). This is unfortunately not the case of the Nearest c d b a 1 h g 1 Plane algorithm nor our Fast-Fourier variant, for which the  1  n rounding steps are inherently sequential. 1  i  e i  1    a f  j 1  j Techniques. ·   L : b  1  At the core of our techniques is the realization that repre-   c d  k 1  g k senting elements of the convolution ring Rd = R[x]/(x −1) as   d  1  h circulant matrices is not the appropriate choice. To allow an l induction similar to Cooley-Tukey’s FFT, our representation l 1 1 If the number n×m of d×d circulant blocks is not constant, Figure 3: Factorization of L in the LDL? decompo- they contribute a factor O(n3) to the runtime of the fast- sition of M8/1(a), and its tree representation L. Fourier LDL? algorithm, O(n2) to its output size, and O(n2) to the run-time of the fast-Fourier nearest plane algorithm. Related Works. We equip the ring Rd with a conjugation operation as well There exist many works related to the orthogonalization of as an inner product, making it an Hermitian inner product structured bases. For Toeplitz matrices, Sweet [23] intro- space. The definitions that we give also encompass other duced an algorithm faster than the naive orthogonalization types of rings that will be used in Appendix A. by a linear factor. Gragg [7] has shown that for Krylov bases Definition 2. Let h ∈ [x] be a monic with d−1 R – which are bases of the form {b, r(b), ..., r (b)} –, the ∆ distinct roots over , R = [x]/(h(x)) and a, b be arbitrary Levinson recursion [14, 3] allows, when r is an isometric C R elements of R. operator, to perform orthogonalization in time Θ(d2) instead of Θ(d3). • We note a? and call (Hermitian) adjoint of a the unique There also exists superfast (running time O(n log2 n)) algo- element of R such that for any root ζ of h, a?(ζ) = a(ζ), rithms for the orthogonalization of Toeplitz-like matrices, for where · is the usual complex conjugation over C. example by Olshevsky and Pan [21], and those are already ∆ P based on a structure-preserving induction. In this light, one • The inner product over R is ha, bi = h(ζ)=0 a(ζ)·b(ζ), may interpret our result as a Fourier-compatible version of and the associated norm is kak =∆ pha, ai. these superfast algorithms. The question of improving the nearest-plane algorithm for In the particular case of convolution rings, one can check Pd−1 i structured matrices seems less studied. As far as we know, that if a(x) = i=0 aix ∈ Rd, then the state of the art consist of a single result [17], applying the d−1 Levinson recursion [14] to reduce by a linear factor its space ? d X d−i a (x) = a(1/x) mod (x − 1) = a + a x . complexity. Alternatively, a work-around of lesser quality 0 i i=1 was proposed for the NTRU signature scheme [9, 22]. The (Hermitian) adjoint B? of a matrix B ∈ Rn×m is the Outline. transpose of the coefficient-wise adjoint of B. Section 2 introduces the mathematical tools that we will While the inner product h·, ·i (resp. the associated norm use through this paper. Section 3 presents our main result, k · k) is not to be mistaken with the canonical coefficient-wise namely the existence of a compact, factorized representation dot product h·, ·i2 (resp. the associated norm k · k2), they for the GSO and LDL? decomposition, and gives a fast are closely related. One can easily check that for any f = P i fix ∈ Rd, the vector (f(ζ)) d can be obtained Fourier flavored algorithm for computing it in this form. i∈Zd {ζ =1} ? This compact LDL decomposition is further exploited in from the coefficient vector (fi)06i n and B = {b1, ... , bn} ∈ R . python implementations of them in the case where d is a We say that B is full-rank (or is a basis) if for any linear com- power of two. The full prototype implementation in python P 2 bination 1 i n aibi with ai ∈ R, we have the equivalence —for d a power of 2— is also available online. P 6 6 ( i aibi = 0) ⇐⇒ (∀i, ai = 0). 2. PRELIMINARIES We note that since R is generally not an integral domain, a set formed of a single nonzero vector is not necessarily For any ring R, R[x] will denote the ring of univariate poly- full-rank. In the rest of the paper, a basis will either denote nomials over R. Scalars (which includes elements of R) will m n a set of independent vectors {b1, ... bn} ∈ (R ) , or the usually be noted in plain letters (such as a, b), vectors will be full-rank matrix B ∈ Rn×m whose rows are the b ’s. noted in bold letters (such as a, b) and matrices will be noted i in capital bold letters (such as A, B). Vectors are mostly in 2.2 The GSO and LDL? Decomposition row notation, and as a consequence vector-matrix products In this section, R = R[x]/(h(x)) as in Definition 2. We are done in this order unless stated otherwise. (a1, ... , an) de- first recall a few standard definitions. A matrix L ∈ Rn×n notes the row vector formed of the ai’s, whereas [a1, ... , an] is unit lower triangular if it is lower triangular and has only denotes the matrix whose rows are the ai’s. N denotes the ∗ 1’s on its diagonal. set of non-negative integers, and N the set N\{0}. For We say that a self-adjoint matrix G ∈ Rn×n is full-rank i, j ∈ Z, i, j will denote the set {i, i + 1, . . . , j − 1, j}. Gram (or FRG) if there exist m n and a full-rank matrix J K > B ∈ Rn×m such that G = BB?. This generalizes the notion 2.1 The Convolution Ring Rd of positive definiteness for symmetric real matrices. We now recall the GSO and LDL? decomposition. The ∗ Definition 1. For any d ∈ N , let Rd denote the ring GSO decomposes any full-rank matrix as the product of a d R[x]/(x − 1), also known as circular convolution ring, or unit lower triangular matrix and an orthogonal matrix. simply convolution ring. Proposition 1. Let B ∈ Rn×m be a full-rank matrix. B can be uniquely decomposed as When d is highly composite, elementary operations in Rd can be performed in time Θ(d log d) using the fast Fourier B = L · B˜ , (1) transform [2]. where L is unit lower triangular, and the rows of B˜ are 2https://github.com/lducas/ffo.py pairwise orthogonal. When R is replaced by R or a number field, Proposition 1 2.4 Coefficient Vectors and Circulant Matri- is standard. In our case, a proof can be found in Appendix C. ces The LDL? decomposition writes any positive definite ma- ? n×n m dm trix as a product LDL , where L ∈ R is unit lower Definition 5. We define linear maps c : Rd → R and n×m dn×dm P i n×n C : R → as follows. For any a = aix ∈ triangular with 1’s on the diagonal, and D ∈ R is diago- d R i∈Zd nal. It is related to the GSO as for a basis B, there exists a Rd where each ai ∈ R: unique GSO B = L · B˜ and for an FRG matrix G, there ex- 1. The coefficient vector of a is c(a) = (a0, . . . , ad−1). ists a unique LDL? decomposition G = LDL?. If G = BB?, 2. The circulant matrix of a is then G = L · (B˜ B˜ ?) · L? is a valid LDL? decomposition of G. As both decompositions are unique, the matrices L in  c(a)   a0 a1 ··· ad−1 both cases are actually the same. In a nutshell: ad−1 a0 ··· ad−2 c(xa) ∆   d×d C(a) =  ......  =  .  ∈ R . ˜ . . . .  .  L · B is the GSO of B a a . . . a ⇔ L · (B˜ B˜ ?) · L? is the LDL? decomposition of (BB?). 1 2 0 c(xd−1a) 3. c and C generalize to vectors and matrices in a coefficient- ? wise manner. Algorithm 1 LDLR(G) n×n 3. The maps c and C satisfy the following Require: A full-rank Gram matrix G = (Gij ) ∈ R . Proposition Ensure: The decomposition G = LDL? over R, where L properties: is unit lower triangular and D is diagonal. 1. C is an injective algebra morphism. In particular, 1: L, D ← 0 n×n C(a)C(b) = C(ab). 2: for i from 1 to n do 2. c(a)C(b) = c(ab). 3. C(a)? = C(a?). 3: Lii ← 1 P ? 4: Di ← Gii − j

Definition 4. Let B = {b1, ... , bn} be a real basis. We 1. Rd may not only be seen as a d-dimensional R-vector 0 call fundamental parallelepiped generated by B and note P(B) 0 n space, but also as a d/d -dimensional Rd -module for P  1 1   1 1  0 the set − 2 , 2 bj = − 2 , 2 · B. any d |d. 16j6n 2. When writing Rd as a d-dimensional R-vector space, we will not use {1, x, ... , xd−1} as our canonical basis,

Algorithm 2 NPR(t, L) but a permutation of this basis ˜ n×m ∗ Require: The decomposition B = L · B of B ∈ R , a Definition 6. Let d ∈ N be a product of h (not neces- n vector t ∈ R sarily distinct) primes. Let gpd(d) denote the greatest proper n Ensure: A vector z ∈ Z such that (t − z)B ∈ P(B˜ ) divisor of d. When clear from context, we also note h the 1: z ← 0 number of prime divisors of d (counted with multiplicity), 2: for j = n, . . . , 1 do ∆ ∆ P dh = d and for i ∈ 1, h , di−1 gpd(di) and ki = di/di−1, so 3: t¯j ← tj + (ti − zi)Lij Q i>j that 1 = d0|d1| ... |dJh =Kd and kj = di. j6i 4: zj ← bt¯j e 5: end for The di’s defined in Definition 6 form a tower of proper 6: return z divisors of d. For any composite d, there exist multiple towers of proper divisors: per example , 1|6, 1|2|6 and 1|3|6 for d = 6. In this paper, each time we mention a tower of Proposition 2 (From [1, 13]). Algorithm 2 outputs an proper divisors of d it will refer to the unique one induced ˜ integer vector z (zB ∈ L(B)) such that (t − z)B ∈ P(B). by Definition 6 (e.g. for d = 6, that tower is 1|3|6). 0 ∗ 0 00 0 Definition 7. Let d, d ∈ N such that d is in the tower 4. For d |d |d and any element a ∈ Rd, 0 of proper divisors of d, and let k = d/d . We denote by x the ∆ (d/d00)×(d/d00) d M 00 (a) = M 0 00 ◦ M 0 (a) ∈ R . indeterminate of the polynomial ring Rd = R[x]/(x − 1) and d/d d /d d/d d00 k d0 by y = x the indeterminate of Rd0 = R[y]/(y − 1). We When d is clear from context, we simply note Md/d0 = M/d0 . m km define the partial linearization Vd/d0 : Rd → Rd0 recursively as follows: Interpretation. 0 1. For d = d = 1, Vd/d0 is the identity. As for the operator V, the steps of applying M are depicted 0 2. For d = gpd(d) and a single element a ∈ Rd, let in Section 1 Figure 2. The linearization Md/d0 (a) writes the P i a = x ai(y) where ai ∈ Rd0 for each i. Then i∈Zk transformation matrix of the map f ∈ Rd 7→ fa using the same basis as the partial linearization V 0 . As a result, ∆ d/d Vd/d0 (a) = (a0, . . . , ak−1). both operators are compatible: ∀a, b ∈ Rd, Vd/d0 (ab) = k Vd/d0 (a) · Md/d0 (b). More properties follows. In other words, Vd/d0 (a) ∈ Rd0 is the row vector whose Proposition 4. Let d ∈ ∗, a, b (resp. a, b, resp. A, B) coefficients are the (ai)i∈Zk . N m km be arbitrary scalars (resp. vectors, resp. matrices) over Rd, 3. For a vector v ∈ Rd , Vd/d0 (v) ∈ Rd0 is the component- 0 ∆ ∆ wise applications of Vd/d0 . and d |d. To be concise, we note V = Vd/d0 and M = Md/d0 . 00 0 m 4. For d |d |d and any vector v ∈ Rd , The maps V and M satisfy the following properties:

∆ (d/d00)m 1. M is an injective algebra morphism, and in particular: Vd/d00 (v) = Vd0/d00 ◦ Vd/d0 (v) ∈ Rd00 . M(A · B) = M(A) · M(B). When d is clear from context, we simply note Vd/d0 = V/d0 . 2. V is an injective linear map. Interpretation. 3. V(ab) = V(a) · M(b). In practice, an element a ∈ Rd is represented by a vector of d real elements corresponding to the d coefficients of a. 4. V is an isometry: In this context, the operator V simply permutes coefficients. hV(a), V(b)i = ha, bi . As highlighted by Figure 4, when d = 2h is a power of two, 2 2 Vd/1 permutes the coefficients according to the bit-reversal 5. B is full-rank if and only if M(B) is full-rank. order3, which appears in the radix-2 fast Fourier transform Since the proofs are rather straightforward to check from (FFT). More generally, one can show that for an arbitrary d, the definitions, we defer them to Appendix D. Vd/1 permutes the coefficient according to the general mixed- radix digit reversal order, which appears in the mixed-radix Computing V and M in the Fourier Domain. Cooley-Tukey FFT [2]. The operators V and M that we defined can be computed 0 1 2 3 4 5 6 7 ⇒ 0 2 4 6 1 3 5 7 very efficiently when an element a ∈ Rd is represented by its coefficients but also when represented in the Fourier domain. c(a) c(V/4(a)) In the first case, it is obvious that they can both be performed ⇒ 0 4 2 6 1 5 3 7 in time Θ(d) as they (symbolically) permute coefficients of a. c(V/2(a)) = V/1(a) If a is represented in FFT form —that is, by the vector i d 0 0 (a(ζd))i∈Zd in C — then computing Vd/d (a) and Md/d (a) in FFT form can naively be done in time Θ(d log d) by comput- Figure 4: Partial vectorial linearizations. ing its inverse FFT, permuting its coefficients, and computing 0 d/d FFT’s over Rd0 . However, it can be done in time Θ(d), We now move to the matrix representation M compatible as it is a single step – also known as butterfly – of the origi- with V. nal fast Fourier transform. This is formalized in Lemma 1, Definition 8. Following the notations of Definition 7, a reformulation of a simple lemma that is at the heart of n×m kn×km Cooley and Tukey’s FFT. we define the operator Md/d0 : Rd → Rd0 as follows: 0 0 1. For d = d = 1, Md/d0 is the identity. Lemma 1 ([2], adapted). Let d 2, d = gpd(d) and 0 0 > 2. For d = gpd(d), k = d/d and a single element a = 0 ∆ ∆ P i k = d/d . Let V = Vd/d0 , M = Md/d0 . For any a ∈ Rd (resp. x ai(y) where each ai ∈ Rd0 , Md/d0 (a) is the i∈Zk a ∈ V(Rd), resp. A ∈ M(Rd)): following matrix of ∈ Rk×k: d0 • V(a), V−1(a), M−1(A) can be computed in FFT form   Vd/d0 (a) in time Θ(kd).  a0 a1 ··· ak−1 k yak−1 a0 ··· ak−2  Vd/d0 (x a)  2   • M(a) can be computed in FFT form in time Θ(k d).  ......  = . . P i k . . . .  .  Proof. Let a ∈ Rd be uniquely written a = x ai(x ),  .  i∈Zk ya1 ya2 . . . a0 (d0−1)k Vd/d0 (x a) where each ai ∈ Rd0 . Cooley and Tukey show in [2] (equa- tions 7, 8) that we can switch from the FFT of a to the FFT d×d In particular, if d is prime, then Md/1(a) ∈ R is of all the ai’s (and conversely) in time Θ(kd). As the ai’s are exactly the circulant matrix C(a). the coefficients of V(a) and M(a), the result follows. m n×m 3. For a vector v ∈ Rd or a matrix B ∈ Rd , Md/d0 (v) ∈ k×km kn×km In Sections 3 and 4, we will use the operators V and M to R and M 0 (B) ∈ R are the component- d0 d/d d0 speed up the orthogonalization and nearest plane algorithms. wise applications of M 0 . d/d The core idea is that these operators allow to “batch” orthog- 3https://oeis.org/A030109 onalization operations, resulting in a Θ(d/ log(d)) speedup. 3. FAST FOURIER LDL DECOMPOSITION • Since each Li,j is block diagonal with dh−1/di+1 unit This section presents our main result. We present the lower triangular diagonal blocks, Li is block diagonal existence of a compact representation in Section 3.1, and with kh(dh−1/di+1) = d/di+1 unit lower triangular then derive a fast algorithm to compute it in Section 3.2. diagonal blocks. • We also need to show that B˜ is orthogonal. Each 3.1 A Compact LDLF Decomposition 0 submatrix B˜ j of B˜ 0 is the orthogonalization of M(bj ) by induction hypothesis. Therefore, for two distinct Theorem 1. Let d ∈ N and 1 = d0|d1| ... |dh = d be a ˜ m rows u, v of B0: tower of proper divisors of d. Let b ∈ Rd such that Md/1(b) ˜ is full-rank. There exists a GSO of Md/1(b) as follows: – If they belong to the same submatrix Bj , they are orthogonal by induction hypothesis. h−1 ! – Suppose they belong to different submatrices: u ∈ Y ˜ B˜ j , v ∈ B˜ ` and j 6= `. Then u (resp. v) is a Md/1(b) = Mdi/1(Li) · B0 i=0 linear combination of rows of M(bj ) (resp. M(b`)): v = aj · M(bj ) and v = a` · M(bj ) for some aj , a` ˜ d×dm (d/di)×(d/di) where B0 ∈ is orthogonal, and each Li ∈ R dh−1 −1 −1 R di in R . Noting aj = V (aj ) and a` = V (a`): is a block-diagonal matrix with unit lower triangular matrices (di+1/di)×(di+1/di) 4 hu, vi = hV(aj )M(bj ), V(a`)M(b`)i of R as its d/di+1 diagonal blocks. 2 2 di = hV(aj bj ), V(a`b`)i2 = haj bj , a`b`i = 0 As an example, the matrix L of the GSO of M8/1(a) for 2 some a ∈ R8 is depicted in Section 1 Figure 3. Where the second equality comes from Proposi- tion 4, item 3, the third one from the fact that V Proof. If d is prime, the theorem is trivial as it is exactly the GSO. We suppose that d is composite and that the is a scaled isometry (Proposition 4, item 4) and the fourth one from bj , b` being orthogonal. theorem is true for any Ri with i < d. By Proposition 4, ∆ Therefore B˜ 0 is orthogonal. item 5, the matrix Bh−1 = Md/dh−1 (b) is full-rank too. Using the classical GSO, we can therefore decompose it The theorem we stated gives the GSO of Md/1(b) for kh×kh kh×mkh as Bh−1 = Lh−1B˜ , where Lh−1 ∈ R , B˜ ∈ R m dh−1 dh−1 a vector b ∈ Rd , but can be easily generalized from a ∆ ? and kh = d/ gpd(d). Lh−1 is unit lower triangular and vector b to a matrix B, and also yields a compact LDL ˜ ˜ decomposition. B is orthogonal. Noting B = [b1, ... , bkh ], all the bj ’s are pairwise orthogonal and each M/1(bj ) is full-rank. By Corollary 1. Let d ∈ N and 1 = d0|d1| ... |dh = d be a inductive hypothesis, they can be decomposed as follows: n×m tower of proper divisors of d. Let B ∈ Rd be a full-rank matrix. There exist h + 1 matrices (Li)06i6h such that: h−2 ! • L ∈ Rn×n is unit lower triangular. Y ˜ h d ∀j ∈ 1, kh , Mdh−1/1(bj ) = Mdi/1(Li,j ) · Bj , (2) n(d/di)×n(d/di) J K i=0 • For each i < h, Li ∈ R is a block- di

dh− ×mdh− diagonal matrix whose n(d/di+1) diagonal blocks are where each B˜ j ∈ 1 1 is full-rank orthogonal and for R (di+1/di)×(di+1/di) (dh−1/di)×(dh−1/di) unit lower triangular matrices of Rd . i < h−1, each Li,j ∈ R is a block-diagonal i di   (di+1/di)×(di+1/di) Qh matrix with unit lower triangular matrices of R Furthermore, if we note L = Md /1(Li) and di i=0 i as its dh−1/di+1 diagonal blocks. To be concise, we now note ˜ ∆ −1 B0 = L · Md/1(B), then: M =∆ M and V =∆ V . We have: dh−1/1 dh−1/1 ˜ 1. The GSO of Md/1(B) is Md/1(B) = L · B0. ˜ Md/1(b) = M(Lh−1) · M(Bh−1) ? ? 2. The LDL decomposition of Md/1(BB ) is = M(Lh−1) · M[b1,..., bkh ] ˜ ˜ t t = M(Lh−1) · [M(b1),..., M(bkh )] Md/1(B) = L · (B0B0) · L . h−2  0 Q ˜ Proof. We have B = LhB , where Lh is given by ei- = M(Lh−1) · Diag Mdi/1(Li,j ) · B0 ? 0 i=0 ther the GSO or LDL decomposition algorithm. B = h−2  0 0 Q ˜ {b1, ... , bn} is orthogonal. Applying Theorem 1 to each = M(Lh−1) · Mdi/1(Li) · B0 0 0 i=0 row vector bj of B yields n decompositions (Li,j )06ij Ensure: The compact LDL decomposition of G in FFT form −1   9: z ← V ffNP (V 0 (t ), L ) 1:( L, D) ← LDL? (G) j d/d0 Rd0 d/d j j Rd 10: end for 2: if d = 1 then 11: return z = (z ,..., z ) 3: return (L, D) i n 4: end if 5: d0 ← gpd(d) n×m 6: for i = 1, . . . , n do Lemma 3. Let B = {b1, ... , bn} ∈ R and B˜ = ? d 7: Li ← ffLDL (Md/d0 (Dii)) ˜ ˜ Rd0 {b1, ... , bn} be its Gram-Schmidt orthogonalization in Rd. 8: end for The vectors z and t = (t1,..., tn) in Algorithm 4 satisfy 9: return (L, (Li)16i6n) (z − t) · B = (z − t) · B˜ . Algorithm 3 computes a “fast Fourier LDL?”, instead of the “fast Fourier GSO” hinted at in the proof of Corollary 1. Proof. We recall that for each i ∈ 1, n , b˜i = bi − The reason why we favor this approach is because it allows P L b˜ . We have: J K a complexity gain. This gain can already be observed in the jj P ˜ with the Gram-Schmidt process. The same phenomenon = 1 i j n(zi − ti)Lij bj P 6 6 6 happens with their recursive variants. = (zi − ti)bi 16i6n = (z − t) · B. Lemma 2. Let d ∈ N and 1 = d0|d1| ... |dh = d be the (3) tower of proper divisors of d given by the successive gpd’s, The first and last equalities are trivial, the second one replaces ∆ n×n and for i ∈ 1, h , let ki = di/di−1. Let G ∈ Rd be a full- the tj ’s by their definitions, the third one just simplifies rank GramJ matrix.K Then Algorithm 3 computes the LDL? the sum and the fourth one is another way of saying that decomposition tree of G in FFT form in time L · B˜ = B.

2 3 X 2 Θ(n d log d) + Θ(n d) + Θ(nd) ki . 16i6h ∆ ∆ Theorem 2. Let M = Md/1 and V = Vd/1. Algorithm 4 In particular, if all the ki are bounded by a small constant n outputs z ∈ Z such that V((z − t)B) ∈ P(B˜ 0), where B˜ 0 is k, then the complexity of Algorithm 3 is upper bounded by the orthogonalization of M(B) over . Θ(n3d + n2d log d). R Proof. The result is trivially true if d = 1. We prove The proof of Lemma 2 is left in Appendix E. We note that it in the general case. By definition, each subtree L is the Algorithm 3 is parallelizable to up to d processes: step 1 j ? ˜ relies on operations on polynomials which are parallelizable LDL decomposition tree of Md/ gpd(d)(bj ). By induction ˜ ˜ and all the iterations of step 7 are independent. hypothesis, we therefore know that V((zj − tj )bj ) ∈ P(Bj ), ∆ where B˜ j is the orthogonalization of Bj = M(b˜j ). From 4. FAST FOURIER NEAREST PLANE Lemma 3, we have In this section, we show how to exploit further the compact form of the LDL? decomposition to have a fast Fourier variant X (z − t)B = (zj − tj ) · b˜j , of the nearest plane algorithm. It outputs vectors of the j=1,...,n same quality (ie as close to the target vector) as its classical iterative counterpart Algorithm 2, but runs Θ˜ (d) times faster. so V((z − t)B) ∈ P([B˜ 1, ... , B˜ n]). Now, from the proof of d ˜ ˜ ˜ Definition 9. Let Zd denote the ring Z[x]/(x − 1) of Corollary 1, we know that B0 = [B1, ... , Bn] is actually the elements of Rd with integer coefficients. orthogonalization of M(B), which concludes the proof. Coordinates of t to round [5] Gentry, C., Peikert, C., and Vaikuntanathan, V. 11 1 1 Trapdoors for hard lattices and new cryptographic constructions. In STOC (2008). NRR NR [6] Gorbunov, S., Vaikuntanathan, V., and Wee, H. 2 10 6 10 2 12 12 Attribute-based encryption for circuits. In 45th ACM NRR NRR NR NR STOC (June 2013), D. Boneh, T. Roughgarden, and

3 5 4 5 3 7 9 8 9 7 13 14 13 J. Feigenbaum, Eds., ACM Press, pp. 545–554. NRR NRR NRR NRR NRR NR [7] Gragg, W. B. Positive definite toeplitz matrices, the arnoldi process for isometric operators, and gaussian An ongoing execution of Algorithm 4 for a single t ∈ R . 8 quadrature on the unit circle. Journal of Computational 1. Arrows are labeled in their order of execution. Down- and Applied Mathematics 46, 1-2 (1993), 183 – 198. ward (↓ ↓) (resp. upward (↑ ↑)) arrows correspond [8] Heideman, M. T., Johnson, D. H., and Burrus, −1 Gauss and the history of the fast Fourier to computing Vd/d0 (tj ) (resp. Vd/d0 (... )) at step 9. C. S. transform. ASSP Magazine, IEEE 1, 4 (1984), 14–21. Transverse (y) arrows correspond to step 8. [9] Hoffstein, J., Howgrave-Graham, N., Pipher, J., 2. Cells labeled with R (as in Rounded) correspond to Silverman, J. H., and Whyte, W. NTRUSIGN: already completed subcalls of Algorithm 4, as opposed Digital signatures using the NTRU lattice. In CT-RSA to those labeled with NR (as in Not Rounded). (2003). [10] Hoffstein, J., Pipher, J., and Silverman, J. H. Figure 5: High-level execution of the fast Fourier NTRU: A ring-based public key cryptosystem. In nearest plane algorithm ANTS (1998), pp. 267–288. [11] Klein, P. N. Finding the closest lattice vector when it’s unusually close. In SODA (2000), pp. 937–941. Unlike the fast Fourier transform, Algorithm 4 is not fully [12] Lang, S. Algebraic number theory, 3 ed. 1995. parallelizable, due to step 8 (y arrows in Figure 5). However, its complexity in d is Θ(d log d): informally, this is because [13] Lenstra, A. K., Lenstra, Jr., H. W., and Lovasz,´ L. Factoring polynomials with rational coefficients. each arrow ↓, ↑ or y has a linear complexity in the size of the cells it connects. A more precise statement follows. Mathematische Annalen 261, 4 (December 1982), 515–534. Lemma 4. Let d ∈ N and 1 = d0|d1| ... |dh = d be be the [14] Levinson, N. The Wiener RMS (root mean square) tower of proper divisors of d given by the successive gpds, error criterion in filter design and prediction. J. Math. ∆ n×m Phys. Mass. Inst. Tech. 25 (1947), 261–278. and for i ∈ 1, h , let ki = di/di−1. Let B ∈ Rd and L be its LDL? decompositionJ K tree. The complexity of Algorithm 4 [15] Lyubashevsky, V., Micciancio, D., Peikert, C., is given by: and Rosen, A. SWIFFT: A modest proposal for FFT hashing. In FSE (2008). 2 X 2 Θ(nd log d) + Θ(n d) + Θ(nd) ki . [16] Lyubashevsky, V., Peikert, C., and Regev, O. On 16i6h ideal lattices and learning with errors over rings. In

In particular, if all the ki are bounded by a constant, then EUROCRYPT (2010), pp. 1–23. the complexity of Algorithm 4 is Θ(n2d + nd log d). [17] Lyubashevsky, V., and Prest, T. Quadratic time, linear space algorithms for Gram-Schmidt The proof of Lemma 4 is deferred to Appendix F. orthogonalization and Gaussian sampling in structured lattices. In EUROCRYPT 2015. Acknowledgements [18] Micciancio, D., and Peikert, C. Trapdoors for lattices: Simpler, tighter, faster, smaller. In The authors would like to thank the anonymous reviewers EUROCRYPT (2012). of ISSAC and EUROCRYPT for their diligent comments, significantly contributing to improve the presentation of this [19] Nussbaumer, H. J. Fast Fourier transform and article. convolution algorithms, vol. 2. Springer Science & Business Media, 2012. [20] Olshevsky, V. Fast Algorithms for Structured 5. REFERENCES Matrices: Theory and Applications, vol. 323. American [1] Babai, L. On Lov´asz’ lattice reduction and the nearest Mathematical Soc., 2003. lattice point problem. Combinatorica 6, 1 (1986), 1–13. [21] Olshevsky, V., and Pan, V. Y. A unified superfast Preliminary version in STACS 1985. algorithm for boundary rational tangential [2] Cooley, J. W., and Tukey, J. W. An algorithm for interpolation problems and for inversion and the machine calculation of complex . factorization of dense structured matrices. In FOCS Mathematics of Computation 19, 90 (1965), 297–301. (1998). [3] Durbin, J. The fitting of time-series models. Review of [22] Peikert, C. An efficient and parallel Gaussian sampler the International Statistical Institute 28, 3 (1960), pp. for lattices. In CRYPTO (2010), pp. 80–97. 233–244. [23] Sweet, D. R. Fast toeplitz orthogonalization. [4] Gentleman, W. M., and Sande, G. Fast Fourier Numerische Mathematik 43, 1 (1984), 1–21. transforms: for fun and profit. In Proceedings of the November 7-10, 1966, fall joint computer conference (1966), ACM, pp. 563–578. ∗ APPENDIX Proposition 5. Let d ∈ N and ι = ιd. The embedding ι: A. EXTENDING THE RESULTS TO CYCLO- 1. is an injective ring morphism. TOMIC RINGS 2. is an isometry : for any f, g ∈ F , hι(f), ι(g)i = hf, gi. In this section we argue that our results hold in the cy- d clotomic case as well. It turns out that all the previous Proof. Item 1 follows from the fact that ed is idempotent 2 arguments can be made more general. The required ingredi- ed = ed. Indeed this implies that ed(a + bc) = eda + edbc = 2 ents are the following: eda + edbc = eda + (edb)(edc). In addition, for any element g ∈ ι(F ), g mod φ is the unique antecedent of g with 1. A tower of unitary rings endowed with inner products d d respect to ι, so ι is bijective and ι−1(g) = g mod φ , which onto . d R proves the point 1. Item 2 follows from equation (4). 2. For any rings S, T of the tower, injective maps M0 : k×k 0 k 0 0 S → T and V : S → T , with S of rank d over R, Lemma 5. Let d > 2, d |d, k = d/d and a ∈ Rd. Then and T of rank d/k over R. k (a ∈ ι(Fd)) ⇔ Vd/d0 (a) ∈ ι(Fd0 ) 3. M0 is a ring morphism. 0 0 Proof. We prove the lemma for d = gpd(d), extension 4. V is a scaled linear isometry. to the general case is straightforward. a can be uniquely 0 0 0 P i k 5. V (ab) = M (a)V (b). written as a = x ai(x ) where each ai ∈ Rd0 . Let ζd 06i

Equivalently, ι(f) is the only element in Rd satisfying: This gives the condition 2. Lemma 5 allows to argue that −1 0  the image of Vd/d0 ◦ιd is in the definition domain of ιd0 : V is f(ζ) if φ (ζ) = 0 0 ι(f)(ζ) = d (4) well defined, and similarly for M . Conditions 3 and 5 follow 0 if ψd(ζ) = 0 from the fact that ιd, ιd0 are ring morphisms and that similar properties hold for Md/d0 and Vd/d0 . Condition 4 is true # ffNearestPlane, i/o in base B, fft format (sec. 4) because ιd, Vd/d0 and ιd0 are isometries. Finally, condition 6 def ffBabai_aux(T,t): holds in the FFT representation, from Lemma 1 and from the if len(t)==1: fact that ι in the Fourier domain simply consist of inserting return array([round(t.real)]) some zeros at appropriate positions. (t1,t2) = ffsplit(t) (L,[T1,T2]) = T z2 = ffBabai_aux(T2,t2) B. IMPLEMENTATION IN PYTHON tb1 = t1 + (t2-z2) * conjugate(L) In this section, we give the core of the Python imple- z1 = ffBabai_aux(T1,tb1) mentation of our algorithms when d is a power of 2. The return ffmerge(z1,z2) full implementation, including correctness tests, is available online and placed in the public domain: # ffNearestPlane, i/o in canonical base, coef. format def ffBabai(f,T,c): https://github.com/lducas/ffo.py. F = fft(f) t = fft(c)/ F z = ffBabai_aux(T,t) Conventions. return ifft(z * F) In python.numpy, the arithmetic operations +, -, ? and / applied on arrays denote coefficient-wise operations. The functions fft and its inverse ifft are built in. The symbol j C. PROOF OF PROPOSITION 1 denotes the imaginary unit. The primitive creates zeros(d) m ∆ ? the d-dimensional zero vector. Proof. For any x, y ∈ R , let hx, yiR = x · y . One can check that h·, ·iR is a Hermitian inner product. In particular,

(hx, xiR = 0) ⇔ (hx, xi = 0) ⇔ (x = 0). # Simplified extract of ffo.py ˜ from numpy import * The decomposition B = L · B can be computed using the Gram-Schmidt process (Algorithm 5). # Linearize operation V, i/o in fft format def ffsplit(F): Algorithm 5 GramSchmidtR(B) d = len(F) 1: for i = 1, . . . , n do winv = exp(2j*pi / d) 2: b˜ ← b Winv = array([winv i for i in range(d/2)]) i i ** 3: for j = 1, . . . , i − 1 do F1 = .5* (F[:d/2] + F[d/2:]) ˜ hbi,bj iR F2 = .5* (F[:d/2]- F[d/2:]) * Winv 4: Li,j = ˜ ˜ hbj ,bj iR return( F1,F2) 5: b˜i ← b˜i − Li,j b˜j 6: end for # Inverse linearize V^-1, i/o in fft format 7: end for def ffmerge(F1,F2): 8: return (B˜ = {b˜ ,..., b˜ }, L = (L ) ) d = 2*len(F1) 1 n ij 16i,j6n F = 0.j*zeros(d) # Force F to complex float w = exp(-2j pi / d) * If we replace R with R or a number field, it is well-known W = array([w**i for i in range(d/2)]) that Algorithm 5 terminates whenever B is full-rank, and F[:d/2] = F1 + W * F2 outputs (B˜ , L) satisfying equation 1. However, it is less F[d/2:] = F1 - W F2 * obvious in our case, since R is no longer a field and the return F ˜ ˜ division by hbj , bj iR in step 4 might be problematic. To show that the output of Algorithm 5 satisfies equation 1 # ffLDL alg., i/o in fft format when R = R[x]/(h(x)), it suffices to show that for any # Outputs an L-Tree (sec 3.2) ˜ ˜ def ffLDL(G): j ∈ 1, n , hbj , bj iR is invertible. Suppose that it is not the J K d = len(G) case, then there exists j ∈ 1, n , and a ∈ R\{0} such that ˜ ˜ J K˜ ˜ if d==1: ahbj , bj iR = 0. By linearity, habj , abj iR = 0 and therefore return( G,[]) ˜ ˜ P ˜ abj = 0. Since bj = bj − i

# ffLQ, i/o in fft format Our arguments seamlessly transfer to the termination of # outputs an L-Tree (sec 3.2) Algorithm 1, as the elements Dj in Algorithm 1 are exactly def ffLQ(f): ˜ ˜ F = fft(f) the elements hbj , bj iR in Algorithm 5. G = F*conjugate(F) T = ffLDL(G) D. PROOF OF PROPOSITION 4 return T Proof. We show the properties separately: 0 t 1. We first prove this statement for d = gpd(d) and for (L, D) ← LDLRd (G) in FFT form. For each i ∈ 1, n , we elements a, b ∈ Rd. All the requirements for show- know from Lemma 1 that Md/ gpd(d)(Dii) can be computedJ K 2 ing that M is a homomorphism are trivial, except for in time Θ(dkh), hence the third term. The last one is for the the fact that it is multiplicative. First, one can check n recursive calls to itself. We then have from Definition 8 that M(ab) = M(a) · M(b). Let P d 3 d n×p p×m C(kh, dh−1) = d Θ(di−1ki ) + d C(k1, d0) A = (aij ) ∈ R and B = (bij ) ∈ R . Since i 1 d d 16i6h ∆ P P 2 (9) AB = ( aikbkj )1 i n,1 j m, we have = Θ(d) ki , 16k6p 6 6 6 6 16i6h   P  where the first equality is shown by induction using equa- M(AB) = M 1 k p aikbkj 6 6 i,j tion 8, except the first term Θ(n2d log d) which is no longer P  . = 1 k p M(aik)M(bkj ) relevant since we are already in the Fourier domain. Com- 6 6 i,j = M(A)M(B) bining equations 8 and 9, we conclude that the complexity of the whole algorithm is 00 Multiplicity then seamlessly transfers to any d |d: 2 3 C(n, d) = Θ(n d log d) + Θ(n d) + nC(kh, dh−1) = Θ(n2d log d) + Θ(n3d) + Θ(nd) P k2. Md/d00 (A · B) = Md0/d00 ◦ Md/d0 (A · B) i 16i6h = Md0/d00 (Md/d0 (A) · Md/d0 (B)) = Md0/d00 ◦ Md/d0 (A) · Md0/d00 ◦ Md/d0 (B) = Md/d00 (A) · Md0/d00 (B). To show injectivity, it suffices to see that if d0 = gpd(d), F. PROOF OF LEMMA 4 then (Md/d0 (a) = 0) ⇔ (a = 0). From the definition, 0 Proof. Let C(k, d) denote the complexity of Algorithm 4 this property seamlessly transfers to any d |d and any over input t ∈ Rk. We have this recursion formula: matrix A. d C(n, d) = Θ(nd log d) + Θ(n2d) + Θ(ndk2 ) + nC(k , d ), 2. This item is immediate from the definition. h h h−1 3. It suffices to notice that for any a, V(a) is the first line where the first term corresponds to computing the FFT of the of M(a). As M is a multiplicative homomorphism, the n coefficients of t, the second term to performing computing result follows. the tj ’s (step 8) in FFT form, the third one to the n calls to V−1 , M (see Lemma 1) and the fourth one to 4. It suffices to prove it for elements a, b ∈ R (instead d/ gpd(d) d/ gpd(d) d the n recursive calls to itself. We have of vectors) and for d0 = gpd(d), the generalization to 0 P d 3 d vectors and to arbitrary values of d is then immediate. C(kh, dh−1) = Θ(di−1ki ) + C(k1, d0) di d1 P i gpd(d) P i gpd(d) 16i6h Let a = i x ai(x ), b = i x bi(x ), where P 2 (10) P j P j = Θ(d) ki , ∀i, ai = ai,j x and bi = bi,j x . j∈Zgpd(d) j∈Zgpd(d) 16i6h Then where the equalities are obtained using the same reasoning ∆ X X ∆ as in the proof of Lemma 2. Similarly, we can then conclude ha, bi2 = hai,j , bi,j i2 = hai, bii2 = hV(a), V(b)i2. i,j i that: 2 P 2 5. We have: C(n, d) = Θ(nd log d) + Θ(n d) + Θ(nd) ki . 16i6h B full-rank ⇔ ∀a, aB = 0 iff a = 0 ⇔ ∀a, V(aB) = 0 iff V(a) = 0 ⇔ ∀a0, V(a0)M(B) = 0 iff V(a0) = 0 ⇔ ∀a0, a0M(B) = 0 iff a0 = 0 ⇔ M(B) full-rank

The first and last equivalences are simply the definition, the second and fourth uses the fact that V is a vector space isomorphism and the third one uses Proposition 4, item 3.

E. PROOF OF LEMMA 2

Proof. Let C(k, d) denote the complexity of Algorithm 3 k×k over a matrix G ∈ Rd . We have the following recursion formula: 2 3 2 C(n, d) = Θ(n d log d) + Θ(n d) + Θ(dkh) + nC(kh, dh−1), (8) where the first term corresponds to computing the FFT of the n2 coefficients of G, and the second term to performing