Efficient Algorithms in Software

Julio López
[email protected]
Institute of Computing, University of Campinas (IC-UNICAMP)
September 2017, Habana, Cuba. ASCrypto 2017

Agenda

1 Efficient Software Implementations
    Software Efficiency
    Parallel Computation - SIMD
2 Symmetric-Key Cryptography
    Data Encryption
    Hash Functions
    SHA2 Implementation
    SHA3 Implementation
3 Elliptic Curve Cryptography
    Elliptic Curves
    Elliptic Curve Diffie-Hellman
    Digital Signatures
    EdDSA Scheme

Section 1 Efficient Software Implementations

1.1 Software Efficiency

Software Efficiency

The optimization of a software implementation of a cryptographic algorithm is a task with several goals:
• Ensure security.
• Running time.
• Code size.
• Memory consumption.
• Computer platform characteristics.
• Energy consumption.

Sometimes these goals conflict with each other. For example, accelerating an operation with look-up tables increases code size and, if not implemented carefully, can leave the implementation vulnerable to cache attacks.

How Is Performance Measured?

• Elapsed time cannot be compared meaningfully across different computers; clock cycles are measured instead.
• Use the RDTSC instruction to read the processor's Time-Stamp Counter.

    #include <stdint.h>

    /* Read the 64-bit Time-Stamp Counter (x86, GCC/Clang inline assembly). */
    uint64_t get_cycles(void) {
        uint32_t lo, hi;
        /* RDTSC returns the low 32 bits in EAX and the high 32 bits in EDX. */
        asm volatile("rdtsc" : "=a"(lo), "=d"(hi));
        return ((uint64_t)hi << 32) | lo;
    }

• To reduce some sources of noise during measurements, it is recommended to turn off technologies such as Turbo Boost and Hyper-Threading.
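The cycle counter above is usually wrapped in a small measurement loop. The following is a minimal sketch, not taken from the slides, assuming an x86 or x86-64 machine and a GCC- or Clang-compatible compiler. The names median_cycles, busy_work, cmp_u64 and BENCH_RUNS are hypothetical helpers introduced only for illustration.

```c
#include <stdint.h>
#include <stdlib.h>

/* Read the Time-Stamp Counter (x86/x86-64 only, GCC/Clang inline asm). */
static uint64_t get_cycles(void) {
    uint32_t lo, hi;
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
}

/* Three-way comparison of two uint64_t values, for qsort. */
static int cmp_u64(const void *a, const void *b) {
    uint64_t x = *(const uint64_t *)a, y = *(const uint64_t *)b;
    return (x > y) - (x < y);
}

/* Hypothetical choice: an odd number of runs so the median is one sample. */
#define BENCH_RUNS 1001

/* Hypothetical helper: median cycle count of fn(arg) over BENCH_RUNS runs. */
static uint64_t median_cycles(void (*fn)(void *), void *arg) {
    static uint64_t samples[BENCH_RUNS];

    fn(arg); /* one warm-up call so caches are populated */
    for (size_t i = 0; i < BENCH_RUNS; i++) {
        uint64_t start = get_cycles();
        fn(arg);
        samples[i] = get_cycles() - start;
    }
    qsort(samples, BENCH_RUNS, sizeof samples[0], cmp_u64);
    return samples[BENCH_RUNS / 2];
}

/* Hypothetical workload: add 0..9999 into a volatile accumulator so the
   compiler cannot remove the loop. */
static void busy_work(void *arg) {
    volatile uint64_t *sink = arg;
    for (uint64_t i = 0; i < 10000; i++) {
        *sink += i;
    }
}
```

A caller would declare a uint64_t accumulator and invoke median_cycles(busy_work, &accumulator). The median is used here rather than the mean because occasional interrupts or context switches produce large outliers; this is a design choice of the sketch, not a recommendation from the slides.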
1.2 Parallel Computation - SIMD

Single Instruction Multiple Data

• Single Instruction Multiple Data (SIMD) is a class of computers in which a single instruction is applied simultaneously to a set of data.
• Recent processors support SIMD through a bank of wide registers, also known as vector registers.

Vector instructions

Instructions that operate on vector registers are known as vector instructions. They operate on words packed into vector registers.

Releases of Vector Instructions

[Figure: timeline of vector instruction releases on a 1994 to 2020 axis. Instruction sets, in the order they appear on the slide: MMX, SSE, SSE2, SSE3, SSE4, AES-NI + CLMUL, AVX, AVX2 and BMI, SHA1-SHA2, AVX-512. Category labels: Integer Arithmetic (introduced with MMX), Floating-point Arithmetic (with SSE), String Manipulation (with SSE4), Cryptography (with AES-NI + CLMUL), Bit Manipulation (with BMI). Register widths in bits: MMX (64), XMM (128), YMM (256), ZMM (512).]

Relevant AVX2 Instructions

Integer arithmetic for 64-bit words:
• 1 cycle for add/sub.
• 5 cycles for multiplications.
Variable logic shifts:
• 1 cycle for fixed shifts.
• 2 cycles for variable shifts.
Permutation of words:
• 3 cycles for permutations.
Combination/selection of registers:
• Up to 3 instructions per cycle without dependencies.

C = ADD(A, B)
    A:  a3  a2  a1  a0
    B:  b3  b2  b1  b0
    C:  c3  c2  c1  c0      (c_i = a_i + b_i)

C = VSHL(A, B)
    A:  a3  a2  a1  a0
    B:  b3  b2  b1  b0
    C:  c3  c2  c1  c0      (c_i = a_i shifted left by b_i bits)
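The four operation classes listed above (lane-wise addition, variable shift, word permutation, and register combination) can be tried out with compiler intrinsics. The following is a minimal sketch, assuming a GCC- or Clang-compatible compiler on x86-64. The slides name the operations generically as ADD, VSHL, PERM and BLEND; mapping them to _mm256_add_epi64, _mm256_sllv_epi64, _mm256_permute4x64_epi64 and _mm256_blend_epi32 is an assumption of this sketch. The helper names (vec_add, vec_shl, vec_reverse, vec_blend_even, avx2_available) are hypothetical.

```c
#include <immintrin.h>
#include <stdint.h>

/* Compile each helper for AVX2 through a GCC/Clang target attribute, so the
   file builds without -mavx2. The CPU must still support AVX2 at run time. */
#define AVX2_FN __attribute__((target("avx2")))

/* Run-time check (GCC/Clang builtin): nonzero if the CPU supports AVX2. */
static int avx2_available(void) {
    return __builtin_cpu_supports("avx2") != 0;
}

/* ADD: c[i] = a[i] + b[i] on four 64-bit lanes. */
AVX2_FN static void vec_add(uint64_t c[4], const uint64_t a[4],
                            const uint64_t b[4]) {
    __m256i va = _mm256_loadu_si256((const __m256i *)a);
    __m256i vb = _mm256_loadu_si256((const __m256i *)b);
    _mm256_storeu_si256((__m256i *)c, _mm256_add_epi64(va, vb));
}

/* VSHL: c[i] = a[i] << s[i], a separate shift count for each lane. */
AVX2_FN static void vec_shl(uint64_t c[4], const uint64_t a[4],
                            const uint64_t s[4]) {
    __m256i va = _mm256_loadu_si256((const __m256i *)a);
    __m256i vs = _mm256_loadu_si256((const __m256i *)s);
    _mm256_storeu_si256((__m256i *)c, _mm256_sllv_epi64(va, vs));
}

/* PERM with the fixed index vector M = (0, 1, 2, 3) for lanes 3..0,
   which reverses the lane order: c[i] = a[3 - i]. The intrinsic takes the
   indices as a compile-time immediate, two bits per output lane. */
AVX2_FN static void vec_reverse(uint64_t c[4], const uint64_t a[4]) {
    __m256i va = _mm256_loadu_si256((const __m256i *)a);
    __m256i vc = _mm256_permute4x64_epi64(va, _MM_SHUFFLE(0, 1, 2, 3));
    _mm256_storeu_si256((__m256i *)c, vc);
}

/* BLEND: take even 64-bit lanes from b and odd lanes from a. The mask has
   one bit per 32-bit lane, so each 64-bit lane uses two adjacent bits. */
AVX2_FN static void vec_blend_even(uint64_t c[4], const uint64_t a[4],
                                   const uint64_t b[4]) {
    __m256i va = _mm256_loadu_si256((const __m256i *)a);
    __m256i vb = _mm256_loadu_si256((const __m256i *)b);
    _mm256_storeu_si256((__m256i *)c, _mm256_blend_epi32(va, vb, 0x33));
}
```

Callers should check avx2_available() before using the helpers. Arrays are stored with lane 0 first, so with a = {1, 2, 3, 4} and b = {10, 20, 30, 40}, vec_add yields {11, 22, 33, 44}, vec_reverse(a) yields {4, 3, 2, 1}, and vec_blend_even yields {10, 2, 30, 4}.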
Relevant AVX2 Instructions (continued)

C = PERM(A, M)
    A:  a3    a2    a1    a0
    M:  m3    m2    m1    m0      (each m_i in {0, 1, 2, 3})
    C:  a_m3  a_m2  a_m1  a_m0

C = BLEND(A, B, M)
    A:  a3   a2   a1   a0
    B:  b3   b2   b1   b0
    M:  0/1  0/1  0/1  0/1
    C:  c3   c2   c1   c0         (each mask bit selects c_i from a_i or b_i)

Vector Instruction Guide

Full documentation available at: http://software.intel.com/sites/landingpage/IntrinsicsGuide

Skylake Execution Engine

The Skylake processor has eight execution ports for instructions, which improves instruction-level parallelism (ILP).

Section 2 Symmetric-Key Cryptography

2.1 Data Encryption

Secure Communication

• Alice and Bob would like to communicate through an insecure channel.
• Charles is a malicious third party who also has access to the channel.
• Charles should not be able to read the messages exchanged by Alice and Bob.
