Fast Implementation of LSH with SIMD
Received July 13, 2019, accepted July 25, 2019, date of publication July 30, 2019, date of current version August 16, 2019.
Digital Object Identifier 10.1109/ACCESS.2019.2932129

DONGYEONG KIM¹, YOUNGHOON JUNG², YOUNGJIN JU¹, AND JUNGHWAN SONG¹
¹Department of Mathematics, Research Institute for Natural Sciences, Hanyang University, Seoul 04763, South Korea
²The Affiliated Institute of ETRI, Daejeon 341293, South Korea
Corresponding author: Junghwan Song ([email protected])
This work was supported by the Institute for Information and Communications Technology Planning and Evaluation (IITP) funded by the Korean Government (MSIT) under Grant 2017-0-00267.
The associate editor coordinating the review of this manuscript and approving it for publication was Xiangxue Li.
2169-3536 © 2019 IEEE. Translations and content mining are permitted for academic research only. Personal use is also permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information. VOLUME 7, 2019

ABSTRACT In this paper, we propose a method of efficient software implementation for the cryptographic hash function LSH with single instruction multiple data (SIMD). The method is based on word-wise permutations of LSH. Using the modified functions $\mathrm{Step}'_j = P \circ \mathrm{Step}_j \circ P^{-1}$ and $\mathrm{MsgExp}'$ instead of the original step function $\mathrm{Step}_j$ and message expansion function $\mathrm{MsgExp}$, where $P$ is a permutation and $P^{-1}$ is the inverse permutation of $P$, we show that the number of SIMD instructions for implementing LSH is reduced. For efficient implementation of LSH in other environments (e.g., MIMD), various types of word permutations are listed.

INDEX TERMS Software implementation, word-wise permutation, SIMD, hash function, LSH, ARX.

I. INTRODUCTION
Cryptographic hash functions are necessary for constructing a system of information security. Generally, they are used for authentication, providing both data integrity and entity integrity [1]–[4]. A cryptographic hash function is a function that maps an input to a fixed output satisfying the following cryptographic resistance properties [5].
1. Preimage resistance: for essentially all pre-specified outputs, it is computationally infeasible to find an input that hashes to that output.
2. 2nd preimage resistance: it is computationally infeasible to find another input that hashes to the same output as a specified input.
3. Collision resistance: it is computationally infeasible to find any two distinct inputs that hash to the same output.

From the output of a cryptographic hash function, it should be computationally difficult to find the corresponding input. Additionally, for a given input, it should be difficult to find another input that hashes to the same output. Because of these aspects, cryptographic hash functions are used in various fields, such as message authentication codes (MAC), key derivation functions (KDF), and pseudo-random number generators. LSH [6] is a cryptographic hash function that was designed by NSRI [7] (National Security Research Institute).

SIMD [8] is a class of parallel computing in which a single instruction performs the same operation on multiple data simultaneously. A core element of SIMD is a register. SIMD registers come in various lengths, such as 128, 256, and 512 bits. A 128-bit register holds four 32-bit or two 64-bit sections of data. Thus, one operation on a 128-bit register is equivalent to four 32-bit operations or two 64-bit operations. When implementing cryptographic algorithms, SIMD is used for various purposes such as resistance to side-channel attacks [9], [10] and efficient implementations [11]–[16]. There has not been any research on overcoming weakness using SIMD with a cipher. For these reasons, BLAKE2 [17] and LSH are cryptographic algorithms that have an advantage when implemented with SIMD. BLAKE2, SIMON/SPECK [18], and LEA [19] were implemented using SIMD in [11]–[13]. However, to the best of our knowledge, our research is the first to obtain an efficient implementation via SIMD by changing the representation of a cryptographic algorithm. In this paper, we show how to implement the cryptographic hash function LSH efficiently with SIMD by representing LSH using $P$ and $P^{-1}$, where $P$ is a permutation and $P^{-1}$ is the inverse of $P$. Note that complexity is measured as the number of SIMD instructions and their latency, not the number of XORs and modular additions. This metric is necessary for finding conditions that reduce the number of SIMD instructions and that allow SIMD instructions with low latency to be used when an algorithm is implemented.

For example, consider implementing a permutation in a register composed of four 32-bit words. If a word-wise permutation is applied within a register, then only a single SIMD instruction is needed, "_mm_shuffle_epi32". However, if the word-wise permutation is the identity, then no SIMD instruction is needed. In another example, assuming that two 64-bit words compose a register, if a word-wise permutation mixes words across two registers, then three SIMD instructions are needed: "_mm_unpacklo_epi64", "_mm_unpackhi_epi64", and "_mm_shuffle_epi32". The instruction "_mm_unpacklo_epi64" extracts the left 64-bit word of each of the two registers, and "_mm_unpackhi_epi64" extracts the right 64-bit word of each of the two registers. Additionally, the instruction "_mm_shuffle_epi32" is used to permute in 32-bit units. If each of the two 64-bit words remains in its original register, then "_mm_unpacklo_epi64" and "_mm_unpackhi_epi64" are not needed for the word-wise permutation. In this paper, we find conditions that reduce the number of SIMD instructions and that allow SIMD instructions with lower latency to be used.

This paper is organized as follows. Section II shows the specifications of LSH. Section III provides an efficient implementation method via SIMD for LSH. We demonstrate the method and the permutation conditions needed for efficient implementation. All permutations are categorized according to those conditions. In Section IV, we show an optimal permutation for the best performance with LSH. Concluding remarks on the implementation performance with an optimal permutation are given in Section V.

II. SPECIFICATION OF LSH
In 2014, the hash function LSH was published by D. Kim et al. at the International Conference on Information Security and Cryptology; it was designed specifically to enhance software efficiency [6]. LSH was designed using the wide-pipe Merkle-Damgård construction (wide-pipe MD construction) [20]. The design of the compression function for LSH is based on ARX (Addition ($\boxplus$), Rotation ($\lll$), and XOR ($\oplus$)) [21]. The following describes the wide-pipe MD construction and the compression function of LSH.

FIGURE 1. Wide-pipe Merkle-Damgård construction.

As shown in Fig. 1, the length of an internal state in a wide-pipe MD construction is $2n$ bits, which is twice the length of an output with $n$ bits. Let $w$ be the number of bits in a word. LSH-8w-n can represent any of the following LSHs: LSH-256-224, LSH-256-256, LSH-512-224, LSH-512-256, LSH-512-384, and LSH-512-512. Each has a different initializing value $IV$. The generating method of $IV$ is given in [6].

The structure of the compression function $f$ in LSH is ARX-based. The number of bits for the input of $f$ is $48w$, and that of the output is $16w$. The compression function $f$ transforms a 16-word array and a 32-word message block into a 16-word array. Each $f$ contains the MsgExp function, $\mathrm{Step}_j$ (for $j = 1, \cdots, N_s$), as well as the MsgAdd function. The number of steps $N_s$ is selected as follows:
$N_s = 26$ if the number of bits in $w$ is 32,
$N_s = 28$ if the number of bits in $w$ is 64.

A. NOTATIONS
$W^t$: Set of $t$-word arrays.
$X \oplus Y$: Bit-wise exclusive-or of $X$ and $Y$.
$X \boxplus Y$: $X + Y \bmod 2^w$.
$X \lll r$: $r$-bit left rotation of word $X$.
$M^{(i)} := (M^{(i)}[0], \cdots, M^{(i)}[31])$: The $i$-th 32-word array message block.
$M_j^{(i)} := (M_j^{(i)}[0], \cdots, M_j^{(i)}[15])$: The $j$-th 16-word array sub-message generated from the $i$-th message $M^{(i)}$.
$SC_j := (SC_j[0], \cdots, SC_j[7])$: The $j$-th 8-word array step constant.
$T := (T[0], \cdots, T[15])$: The 16-word array temporary variable used in a step function.
$P$: A word-wise permutation on 16 words.
$P(T) := (P(T[0]), \cdots, P(T[15])) := (T[P(0)], \cdots, T[P(15)])$
$P(T[i]) := T[P(i)]$
$P_i$: A word-wise permutation on 4 words.

Notice that we define a permutation $P$ that has the same format for the input and output. If the input is an index $i$, then $P(i)$ is also an index. Similarly, if the input is a word $T[i]$, then the output is a word $P(T[i]) = T[P(i)]$.

B. MsgExp FUNCTION
The first two sub-messages $M_0^{(i)}$ and $M_1^{(i)}$ are defined as the first 16 words and the next 16 words of $M^{(i)}$, respectively. The subsequent sub-messages $\{M_j^{(i)}\}_{j=2}^{N_s}$ are then calculated as follows. For $j = 2, 3, \cdots, N_s$,

$$M_j^{(i)}[l] \leftarrow M_{j-1}^{(i)}[l] \boxplus M_{j-2}^{(i)}[\tau(l)], \quad 0 \le l < 16 \tag{1}$$

Here, the permutation $\tau$ is defined by Table 1.

TABLE 1. The permutation $\tau$ in MsgExp.

C. Stepj FUNCTION
$\mathrm{Step}_j$ is used $N_s$ times repeatedly in the compression function $f$. $\mathrm{Step}_j$ is composed of three functions, MsgAdd, $\mathrm{Mix}_j$, and $\sigma$, as

$$\mathrm{Step}_j = \sigma \circ \mathrm{Mix}_j \circ \mathrm{MsgAdd}. \tag{2}$$

MsgAdd:
$$\mathrm{MsgAdd} : W^{16} \times W^{16} \to W^{16}$$
$$\mathrm{MsgAdd}(T, M_j^{(i)}) = (T[0] \oplus M_j^{(i)}[0], \cdots, T[15] \oplus M_j^{(i)}[15]) \tag{3}$$

Mixj:
$$\mathrm{Mix}_j : W^{16} \to W^{16}$$
$$\mathrm{Mix}_j(T) = (T'[0], \cdots, T'[15]),$$
where
$$(T'[l], T'[l+8]) \leftarrow \mathrm{Mix}_{j,l}(T[l], T[l+8])$$
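
As a rough illustration of the notation of Section II-A and of the message expansion in Eq. (1), the following scalar C sketch applies a word-wise permutation $P(T) := (T[P(0)], \cdots, T[P(15)])$, builds $P^{-1}$ from $P$, and computes one sub-message. It is not the SIMD implementation proposed in this paper. The function names (permute_words, invert_perm, msg_exp_step) and the macro LSH_WORDS are hypothetical, the word size is assumed to be $w = 32$, and the permutation $\tau$ is passed in by the caller rather than hard-coded.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LSH_WORDS 16 /* hypothetical name: words in T and in each sub-message */

/* P(T) := (T[P(0)], ..., T[P(15)]): out[i] receives t[p[i]]. */
void permute_words(uint32_t out[LSH_WORDS], const uint32_t t[LSH_WORDS],
                   const uint8_t p[LSH_WORDS])
{
    for (size_t i = 0; i < LSH_WORDS; ++i) {
        assert(p[i] < LSH_WORDS); /* p must hold valid word indices */
        out[i] = t[p[i]];
    }
}

/* Build P^{-1} from P, so that p[p_inv[i]] == i for every index i. */
void invert_perm(uint8_t p_inv[LSH_WORDS], const uint8_t p[LSH_WORDS])
{
    for (size_t i = 0; i < LSH_WORDS; ++i) {
        assert(p[i] < LSH_WORDS);
        p_inv[p[i]] = (uint8_t)i;
    }
}

/* One application of Eq. (1), assuming w = 32:
 * m_j[l] = m_jm1[l] + m_jm2[tau[l]] mod 2^32, for 0 <= l < 16.
 * tau is supplied by the caller (assumption: the values of Table 1). */
void msg_exp_step(uint32_t m_j[LSH_WORDS], const uint32_t m_jm1[LSH_WORDS],
                  const uint32_t m_jm2[LSH_WORDS], const uint8_t tau[LSH_WORDS])
{
    for (size_t l = 0; l < LSH_WORDS; ++l) {
        assert(tau[l] < LSH_WORDS);
        /* unsigned 32-bit addition wraps around, i.e. it is addition mod 2^32 */
        m_j[l] = (uint32_t)(m_jm1[l] + m_jm2[tau[l]]);
    }
}
```

Applying permute_words with $P$ and then with the output of invert_perm returns the original array, which is the property that a representation of the form $P \circ \mathrm{Step}_j \circ P^{-1}$ relies on. A 64-bit variant would presumably replace uint32_t with uint64_t.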