Implementing and Benchmarking Seven Round 2 Lattice-Based Key Encapsulation Mechanisms Using a Software/Hardware Codesign Approach
Total Page:16
File Type:pdf, Size:1020Kb
Implementing and Benchmarking Seven Round 2 Lattice-Based Key Encapsulation Mechanisms Using a Software/Hardware Codesign Approach Farnoud Farahmand, Viet B. Dang, Michał Andrzejczak, Kris Gaj George Mason University Co-Authors GMU PhD Students Visiting Scholar Farnoud Viet Michał Farahmand Ba Dang Andrzejczak Military University of Technology in Warsaw, Poland 2 Hardware Benchmarking 3 Round 2 Candidates in Hardware #Round 2 Implemented Percentage candidates in hardware AES 5 5 100% SHA-3 14 14 100% CAESAR 29 28 97% PQC 26 ? ? 4 Software/Hardware Codesign 5 Software/Hardware Codesign Software Hardware Most time-critical operation 6 SW/HW Codesign: Motivational Example 1 Software Software/Hardware Other Other Major 9% 9% ~1% Major Time saved 91% 90% speed-up ≥ 100 91% major operation(s) ~1% major operation(s) in HW 9% other operations 9% other operations in SW Total Speed-Up ≥ 10 7 SW/HW Codesign: Motivational Example 2 Software Software/Hardware Other Major Other 1% ~1% 1% Major Time saved 99% 98% speed-up ≥ 100 99% major operation(s) ~1% major operation(s) in HW 1% other operations 1% other operations in SW Total Speed-Up ≥ 50 8 SW/HW Codesign: Advantages ❖ Focus on a few major operations, known to be easily parallelizable ▪ much shorter development time (at least by a factor of 10) ▪ guaranteed substantial speed-up ▪ high-flexibility to changes in other operations (such as candidate tweaks) ❖ Insight regarding performance of future instruction set extensions of modern microprocessors ❖ Possibility of implementing multiple candidates by the same research group, eliminating the influence of different ▪ design skills ▪ operation subset (e.g., including or excluding key generation) ▪ interface & protocol ▪ optimization target ▪ platform 9 SW/HW Codesign: Potential Pitfalls ❖ Performance & ranking may strongly depend on A. features of a particular platform o Software/hardware interface o Support for cache coherency o Differences in max. clock frequency B. selected hardware/software partitioning C. optimization of an underlying software implementation ❖ Limited insight on ranking of purely hardware implementations First step, not the ultimate solution! 10 Two Major Types of Platforms FPGA Fabric FPGA Fabric, including & Hard-core Processors Soft-core Processors Processor Soft-core w/ Memory Processor & I/O FPGA FPGA Fabric Fabric Examples: Examples: • Xilinx Zynq 7000 System on Chip (SoC) Xilinx Virtex UltraScale+ FPGAs • Xilinx Zynq UltraScale+ MPSoC Intel Stratix 10 FPGAs, including • Intel Arria 10 SoC FPGAs • Xilinx MicroBlaze • Intel Stratix 10 SoC FPGAs • Intel Nios II • RISC-V, originally UC Berkeley 11 Two Major Types of Platform Feature FPGA Fabric and FPGA Fabric with Hard-core Processor Soft-core Processor Processor ARM MicroBlaze, NIOS II, RISC-V, etc. Clock frequency >1 GHz max. 200-450 MHz Portability similar FPGA SoCs various FPGAs, FPGA SoCs, and ASICs Hardware accelerators Yes Yes Instruction set extensions No Yes Ease of design Easy Dependent on a particular (methodology, tools, OS soft-core processor and support) tool chain Xilinx Zynq UltraScale+ MPSoC 1.2 GHz ARM Cortex-A53 + UltraScale+ FPGA logic 12 Choice of a Platform for Benchmarking Embedded Processor: FPGA Architecture: In NIST presentations to date: ARM Cortex-M4 Artix-7 Our recommendation: ARM Cortex-A53 UltraScale+ • No FPGA SoC with ARM Cortex-M4 and Artix-7 on a single chip • Cortex-M4 and Artix-7 more suitable for lightweight designs, Cortex-A53 and UltraScale+ for high performance • Zynq UltraScale+: • capability to compare SW/HW implementations with fully-SW and fully-HW implementations realized using the same chip • likely in use in the first years of the new standard deployments 13 Experimental Setup AXI Lite Main Clock Zynq Processing System Interface AXI Timer e e l e l c c t i u a a f L f F Q r r I I R e e I X X t t A n A n I I AXI Stream AXI Stream Interface AXI DMA Interface e e c t i e a e f L c t r I i a e f L X t r I A n e I X t A n I Hardware Input FIFO FIFO FIFO Output FIFO Interface Accelerator Interface wr_clk rd_clk clk wr_clk rd_clk UUT_clk Clocking wizard All elements located on a single chip 14 Code Release • Full Code & Configuration of the Experimental Setup • Software/Hardware Codesign of Round 1 NTRUEncrypt to be made available at https://cryptography.gmu.edu/athena under PQC by August 31, 2019 15 Our Case Study 16 SW/HW Codesign: Case Study 7 IND-CCA*-secure Lattice-Based Key Encapsulation Mechanisms (KEMs) representing 5 NIST PQC Round 2 Submissions LWE (Learning with Error)-based: NTRU-based: NTRU FrodoKEM • NTRU-HPS RLWR (Ring Learning with Rounding)-based: • NTRU-HRSS Round5 NTRU Prime Module-LWR-based: • Streamlined NTRU Prime • NTRU LPRime Saber * IND-CCA = with Indistinguishability under Chosen Ciphertext Attack 17 SW/HW Partitioning Top candidates for offloading to hardware From profiling: ❖ Large percentage of the execution time ❖ Small number of function calls From manual analysis of the code: ❖ Small size of inputs and outputs ❖ Potential for combining with neighboring functions From knowledge of operations and concurrent computing: ❖ High potential for parallelization 18 Operations Offloaded to Hardware • Major arithmetic operations • Polynomial multiplications • Matrix-by-vector multiplications • Vector-by-vector multiplications • All hash-based operations • (c)SHAKE128, (c)SHAKE256 • SHA3-256, SHA3-512 19 Example: LightSaber Decapsulation GenSecret Other Hash 2.30% 2.40% 3.30% GenMatrix 5.03% InnerProduct 43.52% MatrixVectorMul 43.44% 20 LightSaber Decapsulation Hash GenSecret Other Hardware Other 3.30% 2.30% 2.40% Accelerator 2.40% GenMatrix 8.77% Execution time 5.03% remaining 11.17% InnerProduct 43.52% MatrixVectorMul Execution time saved 43.44% 88.83% Execution time of functions to be moved to hardware 97.60% Accelerator Speed-Up = 97.60/8.77=11.1 Execution time of functions remaining in software Total Speed-Up = 100/11.17=9.0 2.40% 21 Tentative Results 22 Software Implementations Used FrodoKEM, NTRU-HPS, NTRU-HRSS, Saber: Round 2 submission packages – Optimized_Implementation Round5: https://github.com/r5embed/r5embed (2019-07-28) Streamlined NTRU Prime, NTRU LPRime: supercop-20190811 : factored Changes made after the submission of the paper! Results substantially different! New version of the paper available on ePrint soon! 23 Total Execution Time in Software [휇s] Encapsulation 34,609 62,076 16,192 4,500 4,000 3,500 3,000 2,500 2,000 1,500 1,000 500 0 Round5 Saber Str NTRU NTRU NTRU-HRSS NTRU-HPS FrodoKEM Prime LPRime Level 1 Level 2 Level 3 Level 4 Level 5 24 Total Execution Time in Software/Hardware [휇s] Encapsulation 7 1,642 2,186 4⇒6 1,223 600 500 3⇒4 6⇒5 400 300 5⇒3 200 1 2 100 0 Round5 Saber NTRU-HRSS Str NTRU NTRU-HPS NTRU FrodoKEM Prime LPRime Level 1 Level 2 Level 3 Level 4 Level 5 25 Total Speed-ups: Encapsulation 30.0 28.4 25.0 21.1 20.0 17.8 18.1 15.0 13.2 12.7 10.6 10.2 9.3 10.0 8.3 7.9 7.1 5.0 3.4 3.6 2.6 3.2 2.4 2.5 0.0 NTRU-HRSS FrodoKEM Round5 NTRU-HPS Saber NTRU LPRime Str NTRU Prime Level 1 Level 2 Level 3 Level 4 Level 5 26 Accelerator Speed-ups: Encapsulation 200.0 192.2 180.0 160.0 146.4 140.4 140.0 120.0 100.0 80.0 60.0 44.3 46.1 43.5 32.5 40.0 27.5 20.4 15.317.7 21.3 11.1 14.5 20.0 12.3 8.7 10.6 8.5 0.0 NTRU-HRSS NTRU-HPS FrodoKEM NTRU Str NTRU Round5 Saber LPRime Prime Level 1 Level 2 Level 3 Level 4 Level 5 27 SW Part Sped up by HW[%]: Encapsulation 99.59 99.36 98.49 99.55 98.93 99.14 100.00 97.45 98.62 95.03 94.62 89.71 90.00 88.03 80.00 71.44 73.26 70.35 70.00 65.00 63.43 62.53 60.00 50.00 Round5 Saber NTRU-HRSS FrodoKEM NTRU-HPS NTRU Str NTRU LPRime Prime Level 1 Level 2 Level 3 Level 4 Level 5 28 Total Execution Time in Software [휇s] Decapsulation 34,649 62,377 5 6 16,192 12,000 Order reversed 11,000 compared to 10,000 encapsulation 9,000 8,000 3 4 7,000 6,000 5,000 4,000 3,000 2,000 1,000 0 Round5 Saber NTRU Str NTRU NTRU-HPS NTRU-HRSS FrodoKEM LPRime Prime Level 1 Level 2 Level 3 Level 4 Level 5 29 Total Execution Time in Software/Hardware [휇s]: Decapsulation 7 1,866 3,120 1,319 700 3⇒6 600 500 400 300 5⇒3 4 6⇒5 200 1 2 100 0 Round5 Saber NTRU-HPS Str NTRU NTRU-HRSS NTRU FrodoKEM Prime LPRime Level 1 Level 2 Level 3 Level 4 Level 5 30 Total Speed-ups: Decapsulation 130.0 119.3 120.0 110.0 100.0 90.0 80.0 77.4 74.1 70.0 60.0 54.8 45.5 50.0 40.0 38.3 30.0 18.6 20.0 17.9 20.0 12.3 13.4 4.1 9.0 8.1 9.4 9.8 4.5 10.0 3.9 0.0 NTRU-HPS NTRU-HRSS Str NTRU FrodoKEM Saber Round5 NTRU Prime LPRime Level 1 Level 2 Level 3 Level 4 Level 5 31 Accelerator Speed-ups: Decapsulation 250.0 235.7 225.0 200.0 188.1 182.8 175.0 150.0 132.7 112.0 125.0 100.0 86.2 75.0 46.0 46.0 50.0 44.4 32.8 37.7 27.5 24.6 17.7 25.0 11.1 8.1 9.4 9.8 0.0 NTRU-HRSS NTRU-HPS Str NTRU FrodoKEM NTRU Saber Round5 Prime LPRime Level 1 Level 2 Level 3 Level 4 Level 5 32 SW Part Sped up by HW[%]: Decapsulation 100.00 99.25 98.69 98.92 100.00 100.00 99.58 99.18 98.10 98.41 97.11 100.00 98.53 97.60 96.78 93.96 90.00 79.22 77.46 80.00 76.47 70.00 60.00 50.00 Round5 NTRU-HPS NTRU-HRSS Str NTRU Saber FrodoKEM NTRU Prime LPRime Level 1 Level 2 Level 3 Level 4 Level 5 33 Conclusions ❖ Total speed-ups ▪ for encapsulation from 2.4 (Str NTRU Prime) to 28.4 (FrodoKEM) ▪ for decapsulation from 3.9 (NTRU LPRime) to 119.3 (NTRU-HPS) ❖ Total speed-up dependent on the percentage of the software execution time taken by functions offloaded to hardware and the amount of acceleration itself ❖ Hardware accelerators thoroughly optimized using Register-Transfer Level design methodology ❖ Determining optimal software/hardware partitioning