DPR Based AES/SM4 Encryption Highly Efficient Implementation
Total Page:16
File Type:pdf, Size:1020Kb
Proceedings of the World Congress on Engineering and Computer Science 2018 Vol I WCECS 2018, October 23-25, 2018, San Francisco, USA DPR Based AES/SM4 Encryption Highly Efficient Implementation Qiangjia Bi, Ning Wu, Fang Zhou and Yasir consumes more area and reduces the performance of the Abstract—This paper presents an efficient hardware circuit because of the memory. It uses the additional circuit to implementation of Advance Encryption Standard (AES) and implement the two encryption schemes but only one SM4 algorithms on Xilinx Virtex 7 FPGA by exploiting the algorithm can be used in one instance of time in [7]. It costs feature of dynamic partial reconfiguration (DPR) to optimize additional area and increased power consumption. S-Box on composite field arithmetic (CFA). We utilize delay aware common sub-expression elimination (DACSE) scheme to Since both of the cipher algorithm AES and SM4 are using reduce overall area consumption by sharing the multiplicative MI as a computing unit which is the most complex function inverse (MI) over GF(28) and eliminating the redundant for reducing the chip area of S-Box. Our design shares the circuitry. Our results reveal that hardware resources i.e. slices multiplicative inverse (MI) over GF(28) based on dynamic are minimized by 21.6%, and frequency is improved by 27.9%. partial reconfiguration (DPR) technology. In order to further In addition to that efficiency of our implementation is improved improve implementation efficiency, composite field by 62.70 Mbps/slices which is higher than the previously proposed design. arithmetic (CFA) and delay aware common sub-expression elimination (DACSE) have been employed in our S-Box Index Terms—AES, SM4, Substitution Box(S-Box), DPR design. The rest of the paper is organized as follows: Section II introduces the S-Box algorithm of AES and SM4. Section III I. INTRODUCTION describes the optimized implementation of S-Box circuit. HE Advanced Encryption Standard (AES) is widely Meanwhile, the detailed CFA, DACSE algorithm and DPR Tused in various applications of information security as an technology is discussed. Section IV shows the results and encryption algorithm [1], [2]. SM4 is the first block cipher evaluation of our design along with comparison to other algorithm released by the Chinese National Cipher design schemes. Finally, Section V concludes this paper. Management Committee Office and it is usually used in wireless security of local area network products [3]. Large II. REVIEW OF S-BOX ALGORITHM number of products use these two algorithms to implement The algebraic expression of S-Box of AES and SM4 is encryption of data. Most applications are using them shown as (1) respectively [8], [9]. independently; and without any optimization that leads to Zx() I x M V more on chip area consumption. Therefore, it is significant to (1) Sx() IxACAC optimize these algorithms used in the wireless sensor networks and other resource limited applications. Where I is the multiplicative inverse (MI) over GF(28) S-Box is a nonlinear function and the fundamental which can be reused by AES and SM4, x is the input of computing unit of AES and SM4, that occupies most of area S-Box. M is the affine matrix and V is row vector of AES and power consumption in circuits. For AES it occupies S-Box. A is the affine matrix and C is row vector of SM4 about 75% area of the round transformation [4]. Hence it is S-Box. Z(x) and S(x) are the output of S-Box of AES and important to analyze the S-Box and optimize the logic. SM4 respectively. M and V are shown in (2). A and C are The design and optimization of S-Box have been studied in shown in (3). detail. Shared memory scheme is used to design a 11111000 0 reconfigurable S-Box for different cipher algorithms that 01111100 1 provide more flexibility in [5], [6]. However, their design 00111110 1 00011111 0 Manuscript received July 02, 2018; revised July 25, 2018. This work was M , V supported by the National Science Foundation of China (No.61774086, 10001111 0 No.61376025), the Fundamental Research Funds for the Central Universities 11000111 0 (NS2017023) and the Natural Science Foundation of Jiangsu Province 11100011 1 (BK20160806). Q. B. is with the College of Electronic and Information Engineering, 11110001 1 Nanjing University of Aeronautics and Astronautics, Nanjing, 211100, China (e-mail: [email protected]). (2) N. W. is with the College of Electronic and Information Engineering, Nanjing University of Aeronautics and Astronautics, Nanjing, 211100, China (e-mail: [email protected]). F. Z and Yasir is with the College of Electronic and Information Engineering Nanjing University of Aeronautics and Astronautics, Nanjing, 211100, China. ISBN: 978-988-14048-1-7 WCECS 2018 ISSN: 2078-0958 (Print); ISSN: 2078-0966 (Online) Proceedings of the World Congress on Engineering and Computer Science 2018 Vol I WCECS 2018, October 23-25, 2018, San Francisco, USA 11100101 1 shown in Fig.2. 4 2 11110010 1 The composite filed MI over GF((2 ) ) needs three 01111001 0 multiplication, two addition, one square multiply coefficient and one multiplicative inverse. All sub-calculation are 4 bit 10111100 1 A , C (3) input and 4 bit output in finite field GF(24). 01011110 0 00100111 0 10010111 1 4 2 4 Ah () ×υI Bh 11001011 1 ()-1 III. DESIGN AND OPTIMIZATION OF MI 4 4 Al Bl A. Using CFA to optimize MI over GF(28) 8 Since the calculation of the MI over GF(2 ) is very Fig. 2. MI over GF((24)2) complex, we introduce CFA to reduce the hardware 8 Following section shows the calculation of multiplication, complexity by mapping the MI over GF(2 ) into composite 4 4 2 addition, square multiply coefficient and MI in GF(2 ). field GF((2 ) ). In order to achieve this transformation, we a) Multiplication over GF(24) need to optimize the equation (1), that is expressed as: Computing the multiplication, we assume that a and b are ZM(x) 1 XV1 elements over GF(24). They can be described as: 1 (4) a=a3ω3+ a2ω2+ a1ω+ a0, b=b3ω3+ b2ω2+ b1ω+ b0 (8) S(x) A TTAXC1 C where {a3, a2, a1, a0, b3, b2, b1, b0} ∈ GF(2). So the Where δ, T is the mapping matrix and the δ-1, T-1 is the expression of multiplication can be shown in (9) and inverse of mapping matrix. We can calculate them using cabmod(f2 (x)) . MATLAB based on the irreducible polynomials. cababababababab331133003211222 The structure of S-Box computing process is shown as Fig.1. The structure on the upper half of Fig.1 is for AES cabababababab2311320022211 c (9) S-Box while bottom half shows SM4 S-Box. The MI over cabababababab1311310013322 GF((24)2) in the middle is the MI which need to be optimized. c0322331132200 abababababab As the set of equation (9) shows, the multiplication needs 25AND gates and 21 XOR gates. b) Addition over GF(24) In finite field, the addition is just XOR of the corresponding bit so the addition of GF(24) is relatively simple. It just needs 4 XOR gates, the expression is shown in (10). Fig. 1. S-Box computing process dab333 8 In finite field, the irreducible polynomials of GF(2 ) of dab222 4 2 dab (10) AES is (5). In the composite field GF((2 ) ) using (6) as dab111 irreducible polynomials internally. dab000 843 (5) 4 Px28 x x x x1 c) Constant Multiplied by Square over GF(2 ) 2 Based on the multiplication we calculated the square GF2:42 f y y y 1 (6) operation shown in (11). GFfxxxxx2:4 432 1 ha 2 32 2 Where v={0010} . v is the coefficient of the GF(2 ) 2 haa221 2 ha (11) irreducible polynomial. Modification of this polynomial haa132 affects the design the circuit. For SM4, it has a little different haa020 irreducible polynomial in finite field GF(28). When it Because v is a constant {0010} . We calculate v with the converts to composite field it still needs the irreducible 2 square operation. And it cost nothing. polynomials (6). 8 ea31 Based on the first equation in (6), the MI over GF(2 ) is calculated in the following expression: 2 ea23 ea * v (12) 1 12 ea10 A AAAAAAAhhllhhl (7) ea = BB B 02 lh 4 ∈ 4 d) Multiplicative Inverse over GF(2 ) Where A can be expressed as A=Ah+γAl, Ah, Al GF(2 ), p In finite field, it has 21 p . So in B is the output of MI that can be expressed as B=Bh+γBl, Bh, 1,GF 2 4 4 2 4 Bl ∈GF(2 ). The main architecture of MI over GF((2 ) )as GF(2 ), the calculation of multiplicative inverse is ISBN: 978-988-14048-1-7 WCECS 2018 ISSN: 2078-0958 (Print); ISSN: 2078-0966 (Online) Proceedings of the World Congress on Engineering and Computer Science 2018 Vol I WCECS 2018, October 23-25, 2018, San Francisco, USA 114 . We transfer the calculation from multiplicative Raa132 inverse to multiplication. The expression is shown in (13). Raa RRa 231 711SaR 11 5 f3320310312021aaa aaa aa aa a a Raa330 RRa810 ,, SRS271 (18) f aaa aaa aa aa a a Raa RaR 3320210201031 421936 SRS 1 381 f afaaaaaaaaaaaaa 3 321 310 210 30 21 (13) Rab RaR 520 10 2 6 aa a Raa 20 1 610 f aaa aaa aa aa a a 3321320312010 fSaRR32218 It needs 27 AND gates and 21 XOR gates. 1 fSaRR32359 fa (19) TABLE I f312389 SRRRR GATES CONSUMPTION AFTER CFA fSaRR31007 Computing Unit AND gate XOR gate The overall hardware consumption of MI over GF(24) is 10 Multiplication 25 21 AND gates and 16 XOR gates and the critical path is Addition 0 4 3TXOR+2TAND.