Part I: Introduction to Post Quantum Tutorial@CHES 2017 - Taipei

Tim Güneysu Ruhr-Universität Bochum & DFKI 04.10.2017 Overview

• Goals – Provide a high-level introduction to Post-Quantum Cryptography (PQC) – Introduce selected implementation details (HW/SW) for some PQC classes (Focus: Encryption) – Highlight open challenges for PQC schemes • Topics/Parts 1. Introduction to PQC 2. Hardware Implementation of PQC 3. (Embedded) Software Implementation of PQC Tutorial Outline – Part I

• Introduction • Classes of Post-Quantum Cryptography (PQC) – Code-Based Cryptography – Lattice-Based Cryptography – Hash-Based Cryptography • Lessons Learned Long-Term Security in Embedded Devices

. For many today‘s applications and systems long-term security is an essential requirement

. Many processing platforms have tight constraints with their 5-25 years computational ressources

> 15 years 10-30 years 10 years Security of Practical Cryptographic Primitives • Cryptosystems must combine security and efficiency

• Embedded devices mostly deploy standardized cryptography – Symmetric encryption: Advanced Encryption Standard – Asymmetric encryption: RSA (Factorization Problem), ElGamal or Elliptic Curve Cryptography (DLOG Problem)

• No „hard“ security guarantees are available for these real-world cryptosystems

• Common practise : Parameters chosen to resist best known (cryptanalytic) attack Best Attacks on Cryptosystems

• Attacks on symmetric cryptosystems – Modern symmetric ciphers follow well-understood principles – For „solid“ ciphers best attack is exhaustive key search – Scaling key sizes to achieve long-term security

• Attacks on asymmetric cryptosystems – Virtually all asymmetric cryptosystems are based on factorization or DLOG problem – Best attacks with subexponential complexity • General Number Field Sieve (on RSA) • Index Calculus (on DLOG) Key Size Recommendations

• Security parameters assuming today‘s algorithmic knowledge and computing capabilities of a powerful attacker (e.g. NSA)

Source: ECRYPT II (symmetric) Yearly Key Size Report

Short-term security (days to months)

Mid-term security (years to decades)

Long-term security (many years) Public-Key Cryptography and Long-Term Security

• General problem: RSA & DLOG-based cryptosystems are closely related

• A breakthrough in classical cryptanalysis is likely to affect both PKC classes

• Further problem: powerful quantum computers Alternatives for Public-Key Cryptography • Research on alternative public-key cryptosystems is required  NIST Call for PQC (Nov 30)

• Foundation on NP-hard problems?

• No polynomial-time attacks (such as Grover‘s/Shor‘s alg.) with quantum computers

• Efficiency in implementations comparable to currently employed cryptosystems Post-Quantum Cryptography

• Definition – Class of cryptographic schemes based on the classical computing paradigm – Designed to provide security in the era of powerful quantum computers • Important: – PQC ≠ quantum cryptography! Post-Quantum Cryptography - Categories

• Five main branches of post-quantum crypto: – Code-based

– Lattice-based

– Hash-based

– Multivariate-quadratic

– Supersingular isogenies • Should support public-key encryption and/or digital signatures CHES History in PQC

• CHES has a long tradition on the implementation of PQC cryptosystems: – CHES 2001: Bailey et al.: NTRU in Small Devices – CHES 2004: Yang et al. : TTS on SmartCards – CHES 2008: Bogdanov et al.: MQ-Cryptosystems in HW – CHES 2009: Eisenbarth et al.: MicroEliece – CHES 2011: Session on Lattice-based attacks (3 papers) – CHES 2012: High-Performance McEliece+MQ+Lattices; GLS-Cryptosystem – CHES 2013: McBits + QC-MDPC McEliece Implementations – CHES 2014: RingLWE + Lattice-based Signature Implementations – CHES 2015: Session on Lattice crypto (2 papers), Homomorphic Encryption – CHES 2016: QcBits, Fault-Attack on BLISS signature scheme – CHES 2017: Tomorrow‘s session on PQC (3 papers) Research Directions in PQC

. Propose novel robust and failure-proof cryptographic constructions (2015-2019) . Efficient constant-time implementation techniques and algorithmic tweaks . Physical resistance against side-channel analysis and fault-injection attacks Implementierungsaspekte alternativer . Improve cryptanalysis to foster asymmetrischer Kryptosysteme confidence considering potential attacks (2012-2017) . Identify secure parameters against attacks from quantum-computers . Compatible implementations for IoT devices, Internet infrastructures and Cloud services ICT-644729 (2015-2019) + more Outline

• Introduction • Classes of Post-Quantum Cryptography (PQC) – Code-Based Cryptography – Lattice-Based Cryptography – Hash-Based Cryptography • Lessons Learned Introduction to Code-based Cryptography • Error-Correcting Codes are well-known in a large variety of applications • Detection/Correction of errors in noisy channels by adding redundancy

• Observation: Some problems in code-based theory are NP-complete  Possible foundation for Code-Based Cryptosystems (CBC) Linear Codes and Cryptography

• Linear codes: Error correcting codes for which redundancy depends linearly on the information • Generator and parity check matrices for encoding and decoding • Rows of G form a basis for the code C[n, k, d] of length n with dimension k and minimum distance d • Matrices can be in systematic form minimizing time/storage

Matrix size of G: k x n Linear Codes and Cryptography

• Parity check matrix H is a (n-k) ∙ k matrix orthogonal to G • Defines the dual C of the code C via scalar product

• A codeword c ∈ C if and only if Hc = 0 • The term s = Hc’ = Hc + He is the syndrome of the error Syndrome Decoding Problem

• Input given – H : parity check matrix of size (n - k) · n – s : vector of GF(2n-k) – t : positive integer (defined by error correction capability)

• Problem: Is there a vector e in GF(2n) of weight w(e)≤ t s.t. H · eT = s

• Syndrome decoding problem is NP-complete – E.R. BERLEKAMP, R.J. MCELIECE and H.C. VAN TILBORG On the inherent intractability of certain coding problems. IEEE Transactions on Information Theory, 24(3), May 1978. Niederreiter Encryption Scheme [1986]

Key Generation Given a code C[n, k, d] with parity check matrix H and error correcting capability t Private Key: (푆, 퐻, 푃), where S is a scrambling and P a permutation matrix Public Key: 퐻 = 푆 · 퐻 · 푃

Encryption 푛 Encode the message 푚 into an error vector 푒 ∈푅 퐹2 , 푤푡 푒 ≤ 푡 x ← 퐻 · 푒푇

Decryption Let Ψ퐻 be a 푡-error-correcting decoding algorithm. 푇 −1 P푚 ← Ψ퐻 푆 · 푥 Extract 푚 by transposing the computation P−1 · P푚푇. McEliece Encryption Scheme [1978]

Key Generation Given a code C[n, k, d] with generator matrix G and error correcting capability t Private Key: (푆, 퐺, 푃), where S is a scrambling and P a permutation matrix Public Key: 퐺 = 푆 · 퐺 · 푃

Encryption 푛−푟 푛 Message 푚 ∈ 퐹2 , error vector 푒 ∈푅 퐹2 , 푤푡 푒 ≤ 푡 x ← 푚퐺 + 푒

Decryption Let Ψ퐻 be a 푡-error-correcting decoding algorithm. −1 S푚 ← Ψ퐻 푥 · P removing the error e Extract 푚 by computing S−1 · S푚 Taxonomy of Code-Based Encryption Schemes

Code-based Encryption Schemes*

McEliece [M78] Niederreiter [N86]

Goppa Generalized Elliptic Concatenated Reed-Solomon Rank-Metric Reed Muller LRPC/LDCP/MDPC Srivastava

* This is a selection based on presenter‘s choice. Taxonomy of Code-Based Encryption Schemes

Code-based Encryption Schemes*

McEliece [M78] Niederreiter [N86]

Goppa Generalized Elliptic Concatenated Reed-Solomon Rank-Metric Reed Muller LRPC/LDCP/MDPC Srivastava

* This is a selection based on presenter‘s choice. Key Aspects of Code-Based Cryptography

Kpub=M y=Mx+e y=Ψ(y, Kpriv) (Matrix) Kpriv

x Decrypt x Encrypt y y

• Focus on encryption, signature schemes are inefficient • Selection of the employed code is a highly critical issue – Properties of code determine key size, matrices are often large – Structures in codes reduce key size, but might enable attacks – Encoding is fast on most platforms (matrix multiplication) – Decoding requires efficient techniques in terms of time and memory • Basic McEliece is only CPA-secure; conversion required • Protection against side-channel and fault-injection attacks Outline

• Introduction • Classes of Post-Quantum Cryptography (PQC) – Code-Based Cryptography – Lattice-Based Cryptography – Hash-Based Cryptography • Conclusions Lattice-based Cryptography – Basics • Hard problem: Shortest/Closest Vector Problem (SVP/CVP) in the worst case

• Typically thought to be – Unpractical but provably secure – Practical but without proof (GGH/NTRU) – Lately: Ideal lattices can potentially combine both

• More constructions feasible beyond classical PKC: hash functions, PRFs, identity-based encryption, homomorphic encryption Learning with Errors Solving of a system of linear equations secret 7×4 4×1 7×1 ℤ13 ℤ13 ℤ13 4 1 11 10 6 4 × 9 = 8 5 5 9 53 11 1 11 10 3 9 0 10 4 1 3 3 2 12 12 7 3 4 9 6 5 11 4 3 3 5 0 Blue is given; Find (learn) red  Solve linear system Learning with Errors Solving of a system of linear equations random secret small noise looks random 7×4 4×1 7×1 7×1 ℤ13 ℤ13 ℤ13 ℤ13 4 1 11 10 6 0 4 × 9 + -1 = 8 5 5 9 53 11 1 1 11 1 10 3 9 0 10 1 4 1 3 3 2 0 12 12 7 3 4 -1 9 6 5 11 4 3 3 5 0 Blue is given; Find red  Learning with Errors (LWE) Problem Key Aspects of Lattice-based Systems • Encryption and signature systems are both feasible (and secure) – Significant ciphertext expansion for (R-)LWE encryption – Decryption error probability with (R-)LWE encryption

• Random Sampling not only from uniform but also from Discrete Gaussian distributions (not a trivial task!)

• Most operations are efficient and parallizable – (Ideal lattices) Make use of FFT for polynomial multiplication – (Standard lattices) Matrix-vector arithmetic

• Reasonably large public and private keys – Given for encryption/signatures constructions – Unclear for advanced services such as functional encryption (e.g., FHE) Outline

• Introduction • Classes of Post-Quantum Cryptography (PQC) – Code-Based Cryptography – Lattice-Based Cryptography – Hash-Based Cryptography • Lessons Learned Hash-based Cryptography: Lamport-Diffie One-Time Signatures (LD-OTS, 1979)

. Definition: Given a security parameter 푛, the set of 푛-bit vectors 푛 푈푛 = {0,1} and a one-way function ℎ: 푈푛 → 푈푛 . Secret key: Generate 2푛 × 푛-bit vector 푋 = (푥 0,0 , 푥 0,1 , 푥 1,0 , 푥 1,1 , . . , 푥 푛−1,1 )

. Public Key : Compute 푌 = 푦 0,0 , . . , 푦 푛−1,1 ∀푦푖,푗 = 푓(푥푖,푗)

x0 x1 x0 x1 x0 x1 … x 0 x 1 x 0 x 1 = X

h h h h h h h h h h

y0 y1 y0 y1 y0 y1 … y 0 y 1 y 0 y 1 = Y

. Publish public key Y Hash-based Cryptography: Lamport-Diffie One-Time Signatures (LD-OTS, 1979)

. Definition: Given a published public key 푌 and an 푛-bit message 푀 = (푚0, … , 푚푛−1) to sign

. Sign: Generate signature 𝜎 = (푥 0,푚0 , . . , 푥 푛−1,푚푛−1 ) by

revealing corresponding 푥 푖,푚푖 secret bits.

. Verify: Check that for f(𝜎푖) = 푦(푖,푚푖) ∀ 푖 = [0, 푛 − 1]

m m m 0 1 2 mn-2 mn-1 r r r r r

x0 x1 x0 x1 x0 x1 … x 0 x 1 x 0 x 1 = 𝜎 ! = h h h h h

y0 y1 y0 y1 y0 y1 … y 0 y 1 y 0 y 1 = Y Extension for Multiple Use: Merkle‘s Signature Scheme Public MSS key

• P Idea by R. Merkle [1979]: reduces K = V 3 [ 0 ]

V 2 [ 0 V 2 ] [ 1 the validity of many OTS verification ]

V V 1 V 1 V 1 [ 1 [ [ 3 [ 2 1 ] 0 ] keys to a single verification key ] ]

V V 0 0 V V V V V 0 V 0 0 0 0 [ [ 0 [ [ [ [ 5 [ [ 4 6 0 1 2 3 ] ] 7 ] ] ] ] = ] ] ======푔(푌 ) 푔(푌 ) = using a binary tree 4 5 푔(푌 ) 푔(푌 ) 6 0 푔(푌1) 푔(푌2) 푔(푌0) 푔(푌7)

Public OTS keys • Properties and Requirements – Max. signature count determined by height H of tree (fixed at setup) – Needs to keep track of already used signatures in the tree  stateful signature scheme – Can be used with any one-time signature scheme and (collision- resistant) cryptographic hash function Principle • Let 푔: {0,1}∗ → {0,1}푛 be a hash function with security parameter 푛 퐻 퐻 • Fix height 퐻 and generate 2 LD-OTS key pairs (푋푖, 푌푖) with 0 ≤ 푖 < 2 퐻−푖 • Notation: 푉푖 푗 with 0 ≤ 푖 ≤ 퐻 and 0 ≤ 푗 < 2

Example: 퐻 = 3 PK = V3[0]

V2[0] V2[1]

V [3] V1[0] V1[1] V1[2] 1

V0[0] V0[1] V0[2] V0[3] V0[4] V0[5] V0[6] V0[7] ======푔(푌0) 푔(푌1) 푔(푌2) 푔(푌0) 푔(푌4) 푔(푌5) 푔(푌6) 푔(푌7)

(푋0, 푌0) (푋1, 푌1) (푋2, 푌2) (푋3, 푌3) (푋4, 푌4) (푋5, 푌5) (푋6, 푌6) (푋7, 푌7)

• Computation rule for inner nodes: 푉푖 푗 = g(푉푖−1[2j] || 푉푖−1[2j+1]) with 0 < 푖 ≤ H and 0 ≤ 푗 < 2푖 Key Aspects of Hash-based Cryptographic Systems • Only signature schemes available, no encryption • Moderate requirements for implementations – Second preimage (older schemes: collision) resistant hash function – Pseudorandom functions for OTS (XMSS) • Hard limitation on the number of signatures per tree – Height of the tree determines max. # of signatures (issue with DoS attacks for real-world systems) – Requires track record of signatures already used (critical in untrusted environments!) – Increasing tree height increases memory requirements and computational complexity Outline

• Introduction • Classes of Post-Quantum Cryptography (PQC) – Code-Based Cryptography – Lattice-Based Cryptography – Hash-Based Cryptography • Lessons Learned Lessons Learned

• Post-Quantum Cryptography essential for long-term security – Code-based encryption schemes are the most mature candidates – Digital signatures from hash-based cryptography with high confidence respect to security and under standardization – Lattice-based cryptography has high potential and extremely high versatility

• Next topics in this tutorial (selection due to time constraints) – Efficient implementation strategies for Code-Based Cryptosystems – Efficient implementation of Lattice-Based Cryptosystems

ICT-644729 Part I: Introduction to Post Quantum Cryptography Tutorial@CHES 2017 - Taipei

Tim Güneysu Ruhr-Universität Bochum & DFKI 04.10.2017

Thank you! Questions? Part II: Hardware Architectures for Post Quantum Cryptography Tutorial@CHES 2017 - Taipei

Tim Güneysu Ruhr-Universität Bochum & DFKI 04.10.2017

including slides by Ingo von Maurich and Thomas Pöppelmann Tutorial@CHES 2017 - Tim Güneysu Tutorial Outline – Part II

Code-based Cryptography Efficient Code-based Implementations Lattice-based Cryptography Efficient Lattice-based Implementations Lessons Learned Recall: McEliece Encryption Scheme [1978]

Key Generation Given a [푛, 푘]-code 퐶 with generator matrix 퐺 and error correcting capability 푡 Private Key: (푆, 퐺, 푃), where 푆 is a scrambling and 푃 is a permutation matrix Public Key: 퐺′ = 푆 · 퐺 · 푃

Encryption 푘 푛 Message 푚 ∈ 픽2, error vector e ∈푅 픽2 , wt e ≤ 푡 x ← 푚퐺′ + e

Decryption Let Ψ퐻 be a 푡-error-correcting decoding algorithm. −1 −1 푚 · 푆 ← Ψ퐻 푥 · 푃 , removes the error e · 푃 Extract 푚 by computing 푚 · 푆 · 푆−1 Security Parameters (Binary Goppa Codes)

• Original proposal: McEliece with binary Goppa codes . Code properties determine key size, matrices are often large • Code parameters revisited by Bernstein, Lange and Peters • Public key is a 푘 ∗ (푛 − 푘) bit matrix (redundant part only) Code-based Cryptography for Embedded Devices

Kpub=M y=Mx+e y=Ψ(y, Kpriv) (Matrix) Kpriv x Decrypt x Encrypt y y

• Selection of the employed code is a highly critical issue – Properties of code determine key size, short keys essential – Structures in codes reduce key size, but can enable attacks – Encoding is a fast operation on all platforms (matrix multiplication) – Decoding requires efficient techniques in terms of time and memory • Basic McEliece is only CPA-secure; conversion required • Protection against side-channel and fault-injection attacks Quasi-Cyclic Moderate Density Check Codes (QC-MDPC)

• 푡-error correcting (푛, 푟, 푤)-QC-MDPC code of length 푛 = 푛0푟 • Parity-check matrix 퐻 consists of 푛0 blocks with fixed row weight 푤 Code/Key Generation

1. Generate 푛0 first rows of parity-check matrix blocks 퐻푖 푟 푛0−1 ℎ푖 ∈푅 퐹2 of weight 푤푖, w = 푖=0 푤푖

2. Obtain remaining rows by 푟 − 1 quasi-cyclic shifts of ℎ푖

3. 퐻 = [퐻0|퐻1| … |퐻푛0−1] 4. Generator matrix of systematic form 퐺 = 퐼푘 푄 −1 푇 (퐻푛0−1∗ 퐻0) (퐻−1 ∗ 퐻 )푇 Q = 푛0−1 1 … −1 푇 (퐻푛0−1∗ 퐻푛0−2) Background on QC-MDPC Codes

Parity check matrix 퐻 푛0 = 2

퐻0 퐻1

Generator matrix 퐺 I (QC-)MDPC McEliece

Encryption 푘 푛 Message 푚 ∈ 퐹2 , error vector 푒 ∈푅 퐹2 , 푤푡(푒) ≤ 푡 x ← 푚퐺 + 푒

Decryption Let Ψ퐻 be a 푡-error-correcting (QC-)MDPC decoding algorithm. 푚퐺 ← Ψ퐻 푚퐺 + 푒 Extract 푚 from the first k positions.

Parameters for 80-bit equivalent symmetric security [MTSB13] 푛0 = 2, 푛 = 9602, 푟 = 4801, 푤 = 90, 푡 = 84 Tutorial Outline – Part II

Code-based Cryptography Efficient Code-based Implementations Lattice-based Cryptography Efficient Lattice-based Implementations Lessons Learned Hardware Implementation of Building Blocks for McEliece/Niederreiter

• Two Operations – Encryption/Encoding: G • Matrix-vector multiplication (with large matricies, either to be stored or to be generated on-the-fly); codeword • TRNG for error generation – Decryption/Decoding: ciphertext • Code- specific syndrome decoding; hard-decision decoding with simple (bitwise) operations preferred

• Inverse-matrix-vector multiplication message Efficient Decoding of MDPC Codes

Decoders for LDPC/MDPC codes: bit flipping and belief propagation

“Bit-Flipping” Decoder 1. Compute syndrome 푠 of the ciphertext

2. Count unsatisfied parity-check-equations #푢푝푐 for each ciphertext bit 3. Flip ciphertext bits that violate ≥ 푏 equations 4. Recompute syndrome 5. Repeat until 푠 = 0 or reaching max. iterations (decoding failure)

. How to determine threshold 푏 ?

• Precompute 푏푖 for each iteration [Gal62] • 푏 = 푚푎푥푢푝푐 [HP03]

• 푏 = 푚푎푥푢푝푐 − δ [MTSB13] FPGA Low-Resource Encryption

Target: Xilinx Spartan-6 FPGA 32 flip flops Scheme: QC-MDPC Encryption m . Given first 4801-bit row 푔 of 퐺 and message 푚, Control + XOR compute 푥 = 푚퐺 + 푒 . Storage requirements BRAM • One 18 kBit BRAM is sufficient to store message m, m

row 푔 and the redundant part (3x4801-bit vectors) G

• But only two data ports are available redundan • Read out 32-bit of the message and store them t part in a separate register . Error addition • Instead of starting with an all-zero redundant part we preload it with the second half of the error vector FPGA Low-Resource Decryption

QC-MDPC Decryption . Secret key and ciphertext consist of two blocks . Iterative vs. parallel design . Decoding is complex task → parallel processing

. BRAM-based implementation: storage requirements . Secret key (2x4801 bit) . Ciphertext (2x4801 bit) . Syndrome (4801 bit) . In total 3 BRAMs due to memory and port access requirements FPGA Low-Resource Decryption

QC-MDPC Decryption . Syndrome computation 푠 = 퐻푥푇 • Similar technique as for encoding . Compare 푠 = ퟎ? • Compute binary OR of all 32-bit blocks of the syndrome

. Count #푢푝푐

• Hamming weight of syndrome AND ℎ0/ℎ1 (32-bit at a time) • Accumulate Hamming weight . Bit-flipping

• If #푢푝푐 ≥ 푏푖 invert ciphertext bit(s) and XOR ℎ0/ℎ1 to the syndrome while rotating both Lightweight FPGA Results

. Post-PAR for Xilinx Spartan-6 XC6SLX4 & Virtex-6 XC6VLX240T . Encryption takes 735,000 cycles . Decryption takes 4,274,000 cycles on average Lightweight FPGA Comparison

. Realistic public key size (0.6 kByte vs. 50-100 kByte) . Smallest McEliece FPGA implementation . Sufficient performance for many applications Tutorial Outline – Part II

Code-based Cryptography Efficient Code-based Implementations Lattice-based Cryptography Efficient Lattice-based Implementations Lessons Learned Lattice-Based Cryptography

• Recall: Benefits of Lattice-Based Cryptography – We can get signatures and public key encryption from lattices and also more advanced services (IBE, FHE) – A lot of development on theory side; schemes are improving – Implementation of lattice-based cryptography is a young field; only done for a few years (except maybe for NTRU) To be Ideal or not Ideal?

Two important lines of research: random lattices and ideal lattices • Major impact on implementation (theory not that much) • Security for random lattices is better understood (ideal lattices are more structured)

 Random Lattices  Ideal Lattices

• Operations on large matrices • Operations on polynomials with 256 or (e.g., 532x840) 512 coefficients • Mostly matrix-vector multiplication modulo 푞 < 232 • Mostly polynomial multiplication modulo • Large public keys (e.g., 532x840 matrix) 푞 < 232 • Public keys are one (or two) polynomials with 256 or 512 coefficients Learning with Errors Solving of a system of linear equations secret 7×4 4×1 7×1 ℤ13 ℤ13 ℤ13 4 1 11 10 6 4 × 9 = 8 5 5 9 53 11 1 11 10 3 9 0 10 4 1 3 3 2 12 12 7 3 4 9 6 5 11 4 3 3 5 0 Blue is given; Find (learn) red  Solve linear system Learning with Errors Solving of a system of linear equations random secret small noise looks random 7×4 4×1 7×1 7×1 ℤ13 ℤ13 ℤ13 ℤ13 4 1 11 10 6 0 4 × 9 + -1 = 8 5 5 9 53 11 1 1 11 1 10 3 9 0 10 1 4 1 3 3 2 0 12 12 7 3 4 -1 9 6 5 11 4 3 3 5 0 Blue is given; Find red  Learning with errors (Ring) Learning with Errors

From learning with errors to ring-learning with errors 7×4 ℤ13 44 1 11 10 Only one line • Shift first line on every line has to be stored • Use rule that we negate x in case of wrap around (e.g., 3 4 1 11 10 ⇒ −10 ≡ 3 mod 13) 2 3 4 1 12 2 3 4 9 12 2 3 10 9 12 2 11 10 9 12 Ring Learning with Errors: Principle

풂 34 23 … 23 • Ideal lattices correspond to ideals in random 푍 푥 the ring R = 푞 × 푥푛+1 small secret 풔 1 -2 … 0 (Gaussian) • Ring Learning With Errors (RLWE) + sample is: 퐭 = 풂풔 + 풆 ∈ 푅 for small error uniform 풂 ∈ R and small discrete 풆 0 1 … 0 (Gaussian) Gaussian distributed 풔, 풆 ← 퐷휎 = – Search-RLWE: Find s when given 퐭 and 퐚 32 43 … 12 random – Decision-RLWE: Distinguish 퐭 from uniform when given 퐭 and 퐚 Example: 푍 푥 Polynomial Addition in R = 풒 푥풏+1 푍 푥 • Assume ring R = 풒 푥풏+1 • Assume parameters 푞 = 5 and 푛 = 4 • 풗 = 4푥3 + 2푥2 + 0푥1 + 1 = (4,2,0,1) • 퐤 = 2푥3 + 1푥2 + 4푥1 + 0 = 2,1,4,0 • 풔 = 풗 + 풌 = 4 + 2 mod 5,2 + 1,4,1 = (1,3,4,1) 풗 풌

풔 Example: 푍 푥 Polynomial Multiplication in R = 풒 푥풏+1

• 풌 = 2, 1, 4, 0 • Task: 풛 = 풔 ∗ 풌 = (3, 0, 2, 0) • 풔 = 1, 3, 4, 1 Discrete Gaussian Distribution

• 퐷휎 is defined by assigning weight proportional to −푥2 𝜌 푥 = exp( ) 휎 2휎2

푍 푥 -1501 1020 502 … -1900 572 풂 R = ퟒퟎퟗퟑ Uniform 푥ퟐퟓퟔ + 1 Gaussian -1 4 -8 … 0 1 e

Remark on Arithmetic of x-distributed values: Uniform * Gaussian = Uniform Gaussian * Gaussian = larger Gaussian Gaussian Sampling: Options

Rejection Sampling Cumulative Distribution Table (CDT) Sampling

Bernoulli Sampling [DG14] Efficient sampling from discrete Gaussians for lattice-based cryptography on a constrained device, Dwarakanath and Galbraith, Applicable Algebra in Engineering, Communication and Computing, 2014 [DDLL14] Lattice Signatures and Bimodal Gaussians, Léo Ducas and Alain Durmus and Knuth-Yao Sampling Tancrède Lepoint and Vadim Lyubashevsky, CRYPTO '13 Ring-LWE Encryption Scheme [LP11/LPR10]

Gen: Choose 풂 ← 푅 and 풓1, 풓2 ← 퐷휎; pk: 풑 = 풓1 − 풂 ⋅ 풓2∈ R; sk: 풓2

푎 x + 푐1 푛 Enc(풂, 풑, 푚 ∈ 0,1 ): 풆1, 풆2, 풆3 ← 퐷휎. 퐷 퐷 퐷 풎 = 푒푛푐표푑푒 푚 . Ciphertext: 휎 휎 휎 푝 푐 [풄1 = 풂 ⋅ 풆1 +풆2, 풄2 = 풑 ⋅ 풆1 +풆3 + 풎 ] x + + 2 푚 푒푛푐표푑푒

Dec(푐 = [풄 , 풄 ], 풓 ): Output 1 2 ퟐ 푐1 x + 푑푒푐표푑푒 푚 푑푒푐표푑푒(풄1 ⋅ 풓2 +풄2) 푟1 푐2

Correctness: 풄1풓2 + 풄2 = (풂풆1 + 풆2)풓2 +풑풆1 + 풆3 + 풎 = 풓2풂풆1 + 풓2풆2 + 풓1풆1 − 풓2풂풆1 + 풆3 + 풎 = 풎 + 풓2풆2+풓1풆1 + 퐞3

large small Ring-LWE Encryption: 푍 푥 R = ퟒퟎퟗퟑ Parameters 푥ퟐퟓퟔ + 1

푛 −bit message/coefficients Error correction • Encode(m) m 0 1 … 1 0 – Return 푚 ⋅ 푞/2 푒푛푐표푑푒 푚 풎 0 2046 … 2046 0

• Decode(x) 풎 + 풓 풆 + 2 2 402 1907 … 2631 4024 – If (1/4푞 < 푥 < 3/4푞) 풓1풆1 + 퐞3 Return 1 de푐표푑푒 푚 – Else return 0 풎 0 1 … 1 0 Ring-LWE Encryption: Parameters

Parameter sets 푛 푝 𝜎 |풄1, 풄2| |sk| |pk| security (256, 4093, 8.35 [LP11] 256 4093 ~4.5 6,144 1,792 6,144 ~106 bits (256, 7681,11.32) [GFSBH12] 256 7681 ~4.8 6,656 1,792 6,656 ~106 bits (512, 12289, 12.18) [GFSBH12] 512 12289 ~4.9 14,336 3,584 14,336 ~256 bits

• Message and ciphertext: – Message space: 푛 bits

– Expansion 2 ⋅ log2 푞 – Two large polynomials (풄1, 풄2) • Public key: one or two large polynomials (풂, 풑)

• Secret key: small polynomial (풓ퟐ) Tutorial Outline – Part II

Code-based Cryptography Efficient Code-based Implementations Lattice-based Cryptography Efficient Lattice-based Implementations Lessons Learned Hardware Implementation Building Blocks for R-LWE

• Two main components – Polynomial multiplier for 푛 = {256,512,1024} over specific rings with coefficients with less than log2(푞) < 24 bits – Discrete Gaussian sampler with precisely defined precision 𝜎 Hardware Implementation: Low-Cost Design for Xilinx Spartan-6

• Row-wise polynomial multiplication (풂풆1/풑풆1) – Simple address generation – Sample coefficient of 풆1, add row of 풄1 then add row of 풄2, add coefficient of 풆2 and 풆3

• Key and ciphertext are stored in block memory Modular • DSP block for arithmetic reduction (power (푞 × 푞-bit multipler) Multiplication ot two possible) (DSP) Hardware Implementation: Low Area Post-place-and-route performance on a Spartan-6 LX9 FPGA.

Area savings by power of two modulus

• Usage of 푞 = 4096 leads to area improvement and higher clock frequency • Performance is still very good • Area consumption is low, especially for decryption Ring-LWE: Can we do better?

• Schoolbook polynomial multiplication is simple and independent of parameters • Performance is reasonable but can still be improved • Remember: according to schoolbook multiplication, we need 푛2 multiplications modulo q for one polynomial multiplication – 1282 = 16384 – 2562 = 65536 – 5122 = 262144 – 10242 = 1048576

Can we do better? Optimization: Polynomial Multiplication based on NTT

• Include algorithmic tweaks for fast polynomial multiplication • The Number Theoretic Transform (NTT) is a discrete Fourier transform (DFT) defined over a finite field or ring. For a given primitive 푛-th root of unity 휔 the NTT is defined as:

– Forward transformation: NTT 푛−1 푖푗 • 푨[푖] = 푗=0 풂 푗 휔 , 푖 = 0,1, … , 푛 – Inverse transformation: INTT −1 푛−1 −푖푗 • 풂[푖] = 푛 푗=0 푨 푗 휔 , 푖 = 0,1, … , 푛

• NTT exists if 푞 is a prime, 푛 a power of two and if q ≡ 1 mod 2푛 • Example: Ring-LWE encryption: 7681 mod 2 ∙ 256 = 1 NTT for Lattice Cryptography: Convolution Theorem

• With the convolution theorem we can basically multiply two vectors/polynomials with the help of the NTT – 퐜 = INTT NTT 풂 ∘ NTT 풃 – Efficient algorithms are known for bi-direction conversion

풂 NTT ∘ INTT 풄 풃 NTT

• Negative Wrapped Convolution: 푛 – Polynomial multiplication in 푍푞 푥 / 푥 + 1 – Runtime 푂(푛 log 푛) – No appending of zeros required (as for regular convolution) – Implicit polynomial reduction by 푥푛 + 1 Efficient Computation of the NTT (Cooley-Tukey)

twiddle factors

Multiplication by 휔0 = 1

• Bitreversal required (NTT푛표→푏표) • Precomputation of powers of 휔 possible • Arithmetic is basically multiplication and reduction 푛 modulo 푞 ( log (푛) times) 2 2 • Further optimizations still possible Ring-LWE Encryption on FPGA

NTT is very fast but still quite small

Lots of improvement since [GFS+12] Tutorial Outline – Part II

Code-based Cryptography Efficient Code-based Implementations Lattice-based Cryptography Efficient Lattice-based Implementations Lessons Learned Lessons Learned

. Efficient McEliece implementations with practical key sizes • QC-MDPC codes are an efficient alternative to binary Goppa codes • Note: consider attacks on decryption failure rate (ASIACRYPT 2016) • Low-cost FPGA implementation practical for key agreement scheme (in prep) . Efficient R-LWE encryption are extremely efficient • R-LWE (and variants) also allow signature + advanced schemes • FPGA implementations more efficient than RSA, en par with ECC . Papers and source code available at http://www.seceng.rub.de/research/projects/pqc/

. For more papers and codes, see project websites of

ICT-644729 Part II: Hardware Architectures for Post Quantum Cryptography Tutorial@CHES 2017 - Taipei

Tim Güneysu Ruhr-Universität Bochum & DFKI 04.10.2017

Thank Tutorial@CHESyou! 2017 - TimQuestions? Güneysu Part III: Post Quantum Cryptography in Embedded Software Tutorial@CHES 2017 - Taipei

Tim Güneysu Ruhr-Universität Bochum & DFKI 04.10.2017

including slides by Ingo von Maurich and Thomas Pöppelmann Tutorial Outline – Part III

Code-based Cryptography Efficient Code-based Implementations Lattice-based Cryptography Efficient Lattice-based Implementations Lessons Learned Recall: McEliece Encryption Scheme [1978]

Key Generation Given a [푛, 푘]-code 퐶 with generator matrix 퐺 and error correcting capability 푡 Private Key: (푆, 퐺, 푃), where 푆 is a scrambling and 푃 is a permutation matrix Public Key: 퐺′ = 푆 · 퐺 · 푃

Encryption 푘 푛 Message 푚 ∈ 픽2, error vector e ∈푅 픽2 , wt e ≤ 푡 x ← 푚퐺′ + e

Decryption Let Ψ퐻 be a 푡-error-correcting decoding algorithm. −1 −1 푚 · 푆 ← Ψ퐻 푥 · 푃 , removes the error e · 푃 Extract 푚 by computing 푚 · 푆 · 푆−1 (QC-)MDPC McEliece

Encryption 푘 푛 Message 푚 ∈ 퐹2 , error vector 푒 ∈푅 퐹2 , 푤푡(푒) ≤ 푡 x ← 푚퐺 + 푒

Decryption Let Ψ퐻 be a 푡-error-correcting (QC-)MDPC decoding algorithm. 푚퐺 ← Ψ퐻 푚퐺 + 푒 Extract 푚 from the first k positions.

Parameters for 80-bit equivalent symmetric security [MTSB13] 푛0 = 2, 푛 = 9602, 푟 = 4801, 푤 = 90, 푡 = 84 Tutorial Outline – Part III

Code-based Cryptography Efficient Code-based Implementations Lattice-based Cryptography Efficient Lattice-based Implementations Lessons Learned 32-bit ARM Microcontroller

ARM-based 32-bit Microcontroller . STM32F407@168MHz . 32-bit ARM Cortex-M4 . 1 Mbyte flash, 192 kbyte SRAM . Crypto functions: TRNG, 3DES, AES, SHA-1/-256, HMAC co-processor . Costs: roughly US$ 10

AVR-based 8-bit Microcontroller . ATXMega128A1@32MHz . 8-bit AVR Xmega Family . 256 Kbyte flash, 8 Kbyte SRAM . Crypto functions: DES, AES . Costs: roughly US$ 10 Implementing Key Generation

. Memory is a scarce resource on microcontrollers . Generate and store random sparse vectors of length 4801 with 45 bits set  store set bit locations only

Generating secret key 푯 = [푯ퟎ|푯ퟏ] . Generate first row of 퐻1, repeat if not invertible . Generate first row of 퐻0 . Convert to sparse representation → 90 counters

Computing public key 푮 = [푰|푸] −1 . Compute 푄 from first row of 퐻1 and 퐻0 Implementing (Plain) Encryption

. Recall operation principle as for low-cost hardware • All processes are based on 32-bit based operations • Set bits in message 푚 select rows of the public key 퐺 • Parse 푚 bit-by-bit, XOR current row of 퐺 if bit is set

. Error addition for encryption • Use TRNG to provide random bits to add 푡 errors • Obtain individual error indices by rejection sampling from log2 푛 = 14 bit Implementing (Plain) Decryption

Recall syndrome computation; parity check matrix in sparse . Parse ciphertext bit-by-bit . XOR row of the secret key if corresponding ciphertext bit is set Decoding iteration . Count #bits that are set in the syndrome and current row of the parity-check matrix blocks  use 90 counters . Compare #bits to decoding threshold . Invert current ciphertext bit if #bits above threshold . Add current row to syndrome . Generate next row → increment counters (check overflows) Implementation Results

Scheme Platform Cycles/Op Time

McE MDPC (keygen) STM32F407 148,576,008 884 ms McE MDPC (enc) STM32F407 16,771,239 100 ms McE MDPC (dec) STM32F407 37,171,833 221 ms

McE MDPC (enc) ATxmega256 26,767,463 836 ms

McE MDPC (dec) ATxmega256 86,874,388 2,71 s

• 8-Bit AVR platform too slow for real-world deployment • Key generation excessive, decryption roughly 3 seconds • 32-bit ARM is a suitable platform and provides built-in TRNG • Improved QcBits software for Cortex-M4 by Chou (CHES 2016) Further Implementation Remarks and Requirements • CCA2-Security for McEliece Encryption: – Additional conversion (e.g., via Fujisaki-Okamoto, includes the necessity for hash-function and re-encryption) • Side-Channel Attacks: – Masking schemes (SCA) for McEliece by Eisenbarth et al. [SAC15], does not include CCA2 security • Decryption Failure Rate Attacks: – Guo et al [ASIACRYPT16] identifies correlation between decoding failures in iterative decoders (bit flipping decoding) Tutorial Outline – Part III

Code-based Cryptography Efficient Code-based Implementations Lattice-based Cryptography Efficient Lattice-based Implementations Lessons Learned Ring-LWE Encryption Scheme [LP11/LPR10]

Gen: Choose 풂 ← 푅 and 풓1, 풓2 ← 퐷휎; pk: 풑 = 풓1 − 풂 ⋅ 풓2∈ R; sk: 풓2

푎 x + 푐1 푛 Enc(풂, 풑, 푚 ∈ 0,1 ): 풆1, 풆2, 풆3 ← 퐷휎. 퐷 퐷 퐷 풎 = 푒푛푐표푑푒 푚 . Ciphertext: 휎 휎 휎 푝 푐 [풄1 = 풂 ⋅ 풆1 +풆2, 풄2 = 풑 ⋅ 풆1 +풆3 + 풎 ] x + + 2 푚 푒푛푐표푑푒

Dec(푐 = [풄 , 풄 ], 풓 ): Output 1 2 ퟐ 푐1 x + 푑푒푐표푑푒 푚 푑푒푐표푑푒(풄1 ⋅ 풓2 +풄2) 푟1 푐2

Correctness: 풄1풓2 + 풄2 = (풂풆1 + 풆2)풓2 +풑풆1 + 풆3 + 풎 = 풓2풂풆1 + 풓2풆2 + 풓1풆1 − 풓2풂풆1 + 풆3 + 풎 = 풎 + 풓2풆2+풓1풆1 + 퐞3

large small Ring-LWE Encryption: Parameters

Parameter sets 푛 푝 𝜎 |풄1, 풄2| |sk| |pk| security (256, 4093, 8.35 [LP11] 256 4093 ~4.5 6,144 1,792 6,144 ~106 bits (256, 7681,11.32) [GFSBH12] 256 7681 ~4.8 6,656 1,792 6,656 ~106 bits (512, 12289, 12.18) [GFSBH12] 512 12289 ~4.9 14,336 3,584 14,336 ~256 bits

• Message and ciphertext: – Message space: 푛 bits

– Expansion 2 ⋅ log2 푞 – Two large polynomials (풄1, 풄2) • Public key: one or two large polynomials (풂, 풑)

• Secret key: small polynomial (풓ퟐ) Tutorial Outline – Part III

Code-based Cryptography Efficient Code-based Implementations Lattice-based Cryptography Efficient Lattice-based Implementations Lessons Learned Simple Implementation of RLWE-Encryption void encrypt(poly a, poly p, unsigned char * plaintext, poly c1, poly c2) { int i,j; poly e1,e2,e3; gauss_poly(e1); gauss_poly(e2); gauss_poly(e3);

poly_init(c1, 0, n); // init with 0 poly_init(c2, 0, n); // init with 0 This has to be fast

for(i = 0;i < n; i++){ // multiplication loops for(j = 0; j=n ? -1 : 1))); c2[(i + j) % n] = modq(c2[(i + j) % n] + (p[i] * e1[j] * (i+j>=n ? -1 : 1))); } c1[i] = modq(c1[i] + e2[i]); c2[i] = (plaintext[i>>3] & (1<<(i%8))) ? modq(c2[i] + e3[i] + q/2) : modq(c2[i] + e3[i]); } } Software Implementation Main Functions for R-LWE • Two main components – Polynomial multiplier for 푛 = {256,512,1024} over specific rings with coefficients with less than log2(푞) < 24 bits – Discrete Gaussian sampler with precisely defined precision 𝜎 and tail cut 휏 Intermediate Results

• Implementation of RLWE-Encryption on the AVR 8-bit ATxmega processor running at 32 MHz

• Schoolbook multiplication (SchoolMul) • Encryption is two multiplications and decryption one Recall Improvement: Polynomial Multiplication with NTT • Number Theoretic Transform (NTT) is a discrete Fourier transform (DFT) defined over a finite field or ring. For a given primitive 푛-th root of unity 휔 the NTT is defined as:

– Forward transformation: NTT 푛−1 푖푗 • 푨[푖] = 푗=0 풂 푗 휔 , 푖 = 0,1, … , 푛 – Inverse transformation: INTT −1 푛−1 −푖푗 • 풂[푖] = 푛 푗=0 푨 푗 휔 , 푖 = 0,1, … , 푛

• NTT exists if 푞 is a prime, 푛 a power of two and if q ≡ 1 mod 2푛 Efficient Computation of the NTT (Textbook)

twiddle factors

Multiplication by 휔0 = 1

• Bitreversal required (NTT푛표→푏표) • Precomputation of powers of 휔 possible • Arithmetic is basically multiplication and 푛 reduction modulo 푞 ( log (푛) times) 2 2 09.10.2012 Optimization of NTT Computation

Removal of expensive “helper” functions • Problem: Permutation (Bitrev) of polynomial is expensive – “Standard” NTT푏표→푛표 requires bitreversed input and produces naturally ordered output – Bitreversal before each forward or inverse NTT

• Solution: NTT algorithm can be written as – Natural to bitreversed for forward: NTT푛표→푏표 – Bitreversed to natural for inverse: INTT푏표→푛표 – No bitreversal necessary anymore:

• INTT푏표→푛표(NTT푛표→푏표 풂 ∘ NTT푛표→푏표(풃)) Optimization of NTT Computation

Removal of expensive “helper” functions • Problem: Multiplication by scalar 푛−1 in inverse transformation is expensive

• Solution: In lattice-based crypto we usually multiply by pretransformed constants (e.g., 풂 , 풑 , or 풓2) – Put 푛−1 into these constants – Multiplication by scalar does not change much as • x ∙ NTT(풂) ⇔ NTT(푥 ∙ 풂) – Store 풂 ′ = 푛−1풂 Optimization of NTT Computation

Removal of expensive “helper” functions • Problem: Multiplication by powers of 휓 and 휓−1 (PowMul) is expensive

• Solution: Merge powers of 휓 into twiddle factors – Only possible with forward transformation and current butterfly (see next picture) Optimization of NTT Computation

• Combines all tricks for forward transformation • We cannot merge powers of 휓−1; We have to multiply after transformation is finished Optimization of NTT Computation

• Usage of Gentlemen-Sande (GS) butterfly instead of Cooley-Tukey (CT) allows merging of inverse multiplication by powers of 휓−1 – CT: 푎 + 휔푏 and 푎 − 휔푏 – GS: 푎 + 푏 and (푎 − 푏)휔 Optimization of NTT Computation

Textbook

Optimized (*)

• We save several steps compared to straightforward approach • Almost no additional costs (if we store twiddle factors) – No multiplication by one in first stage anymore – Can be mitigated by using lookup tables if coefficients for e are small

(*) FFT people probably know most of these tricks Optimization of NTT Computation

How to accelerate the multiplication core operation • Address generation for NTT is cheap and well researched (see FFT) • The only expensive computation is the butterfly, which boils down to – a log2 푞 × log2 푞 multiplication – a mod 푞 modulo reduction – two additions or subtractions modulo 푞 • Implementation of the butterfly depends on target architecture – General methods like Montgomery or Barret reduction – Reductions that depend on special primes like Solinas primes Ring-LWE Encryption on ATXmega (ATXMega128A1)

• Moderate performance impact of larger parameter set • Very fast decryption • Some pitfalls in practice (only CPA and decryption errors) Ring-LWE Encryption on ATXmega Family

Code size is not Sampler is the Schoolbook was significantly bottleneck 12 million increased

[POG15] High-Performance Ideal Lattice-Based Cryptography on 8-bit ATxmega Microcontrollers, Thomas Pöppelmann, Tobias Oder, and Tim Güneysu, Latincrypt’15 Ring-LWE Encryption on Other Platforms [CRV+15]

Table from [CRV+15]: Ruan de Clercq, Sujoy Sinha Roy, Frederik Vercauteren, Ingrid Verbauwhede: Efficient software implementation of ring-LWE encryption. DATE 2015: 339-344 Further Implementation Remarks and Requirements • CCA2-Security: – Additional conversion (e.g., via Fujisaki-Okamoto, includes the necessity for hash-function and re-encryption) • Side-Channel Attacks: – Masking schemes (SCA) by Reparaz et al [CHES15, PQCRYPTO16], does not include CCA2 security • Fault-Injection Attacks: – Loop-Abort attacks by Espitau et al. [ePrint 16] – Fault Sensitivity by Bindel et al. [FDTC16] Tutorial Outline – Part III

Code-based Cryptography Efficient Code-based Implementations Lattice-based Cryptography Efficient Lattice-based Implementations Lessons Learned Lessons Learned

. Efficient McEliece implementations with practical key sizes • QC-MDPC codes are an efficient alternative also in software • Note: consider reported issues with decryption error (ASIACRYPT 2016) • Physical attacks are more challenging to counter with probabilistic decoding . Efficient R-LWE encryption are extremely efficient • R-LWE (and variants) also allow signature + advanced schemes • Software implementations very efficient compared to ECC and RSA . Papers and source code available at http://www.seceng.rub.de/research/projects/pqc/

. For more papers and codes, see project websites of

ICT-644729 Part III: Post Quantum Cryptography in Embedded Software Tutorial@CHES 2017 - Taipei

Tim Güneysu Ruhr-Universität Bochum & DFKI 04.10.2017

Thank you! Questions?