<<

IEEE TRANSACTIONS ON RELIABILITY, VOL. 68, NO. 4, DECEMBER 2019 1347 Reliable Architecture-Oblivious Error Detection Schemes for Secure Cryptographic GCM Structures Mehran Mozaffari Kermani , Senior Member, IEEE, and Reza Azarderakhsh , Member, IEEE

Abstract—To augment the confidentiality property provided I. INTRODUCTION by block ciphers with authentication, the Galois Counter Mode (GCM) has been standardized by the National Institute of Stan- TANDARDIZED by the National Institute of Standards and dards and Technology. The GCM is used as an add-on to 128-bit S Technology, the GCM [1] is used in conjunction with block block ciphers, such as the Advanced Standard (AES), ciphers, such as the Advanced Encryption Standard (AES), SMS4, or Camellia, to verify the integrity of data. Prior works on Camellia, SMS4, and CLEFIA, through a hash function over the error detection of the GCM either use linear codes to protect the GCM architectures or are based on AES–GCM architectures, a binary finite field, to provide authentication, augmented to confining the mechanisms to the AES . Although such confidential data. structures are efficient, they are not only confined to specific ar- There have been previous proposed architectures for the GCM chitectures of the GCM but might also not fully take advantage in the literature. The FPGA bitstream encryption/authentication of the parallel architectures of the GCM. Moreover, linear codes through the GCM, bit-sliced implementations for 64-bit In- have been shown to be potentially ineffective with respect to biased faults. In this paper, we propose algorithm-oblivious constructions tel processors [2], and software-based implementations [3] are through recomputing with swapped and additional au- among the early works adopting the GCM. Moreover, GCM thenticated blocks, which can be applied to the GCM architectures implementations, such as the ones in [4]Ð[9], provide the con- using different finite field multipliers in GF (2128). Such obliv- fidence to utilize the architectures for different secure applica- iousness for the proposed constructions used in the GCM gives tions. In addition, the algorithmic security of the GCM has been freedom to the designers. We the results of error simula- tions and application-specific integrated circuit implementations scrutinized in recent studies [10], [11]. to demonstrate the utility of the presented schemes. Based on the The GCM provides authentication for encrypted/raw data; overhead/degradation tolerance for implementation/performance nevertheless, natural faults, e.g., through exposure to laser and metrics, one can fine-tune the proposed method to achieve more cosmic ray particles, such as alpha and gamma rays, and ma- reliable architectures for the GCM. licious faults undermine its reliability. Previous works on the Index Terms—Application-specific integrated circuit (ASIC), error detection of the GCM have either utilized linear codes, Galois Counter Mode (GCM), GF (2128) multiplier, reliability. e.g., parity and cyclic redundancy check (CRC) [7], or presented different combined AESÐGCM error detection architectures by NOMENCLATURE carefully scrutinizing the implementation cycles [9]. In this paper, we present error detection schemes that are not ASIC Application-specific integrated circuit. confined to specific GCM architectures. We also verify the error FPGA Field-programmable gate array. coverage of the proposed approaches through error simulations GCM Galois Counter Mode. for both transient and permanent multiple and biased faults. Our GE Gate equivalent. schemes constitute algorithm-oblivious constructions through RESCAB Recomputing with swapped ciphertext and addi- RESCAB, which can be applied to the GCM architectures using tional authenticated blocks. different finite field multipliers in GF (2128); for example, in our experiments, we have used a quadratic and a subquadratic mul- tiplier to show the obliviousness of the proposed method. The obliviousness of our schemes comes from the fact they can be ap- Manuscript received November 20, 2017; revised March 12, 2018 and June 3, 2018; accepted September 1, 2018. Date of publication December 12, 2018; plied to different constructions (and are not confined to just one, date of current version November 26, 2019. The work of M. Mozaffari Kermani such as schemes that derive predicted signatures). For instance, and R. Azarderakhsh was supported by the National Institute of Standards and one can use different multipliers, different numbers of branches, Technology under the U.S. Federal Agency Award 60NANB16D245. Associate Editor: F. Belli. (Corresponding author: Mehran Mozaffari Kermani.) different finite fields, such as polynomial basis or normal ba- M. Mozaffari Kermani is with the Department of Computer Science and sis, and different modulo-2 addition trees. We benchmark the Engineering, University of South Florida, Tampa, FL 33620 USA (e-mail:, proposed architectures to assess their ability to detect transient [email protected]). R. Azarderakhsh is with the Department of Computer and Electrical Engi- and permanent faults by performing fault injection simulations. neering and Computer Science, Florida Atlantic University, Boca Raton, FL Moreover, we implement the proposed error detection architec- 33431 USA (e-mail:,[email protected]). tures on an ASIC platform using 65-nm standard-cell library. Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. This paper is organized as follows. In Section II, we present Digital Object Identifier 10.1109/TR.2018.2882484 the proposed error detection constructions for high-throughput

0018-9529 © 2018 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications standards/publications/rights/index.html for more information.

Authorized licensed use limited to: University of South Florida. Downloaded on February 16,2020 at 23:18:26 UTC from IEEE Xplore. Restrictions apply. 1348 IEEE TRANSACTIONS ON RELIABILITY, VOL. 68, NO. 4, DECEMBER 2019

Fig. 1. GCM architecture including GHASH.

GCM. In this section, through case studies as well as through a the added output with the encrypted given input J0 ) is derived general formula, we propose our efficient error detection con- as shown in Fig. 1. The following example shows a simple struction. We, then, assess the proposed scheme on an ASIC, multiplication for GHASH in finite field. preceded by error injection simulations in Section III in order Example: Suppose we get the polynomial for the 128-bit hash to show the efficacy of the approach proposed in this paper. subkey (H) as H(x)=x27 + x25 + x20 + x4 + x +1. To find Finally, conclusions are presented in Section IV. the polynomial representing β(x)=(Xn−1 × H + Xn ) × H mod f(x), assuming the two 128-bit input blocks to GHASH as X x89 x23 x10 X x93 x24 x10 x II. PROPOSED ERROR DETECTION CONSTRUCTIONS n−1 = + + and n = + + + , one would need to perform a finite field multiplication followed by In the GCM, the encrypted blocks are asserted in conjunction reduction using the irreducible polynomial. We note that through with additional potential raw data, as shown in Fig. 1, to the the irreducible polynomial, we have x128 = x7 + x2 + x +1; Galois Hash (GHASH) function, which is constructed by finite thus, x129 = x8 + x3 + x2 + x, x130 = x9 + x4 + x3 + x2 , field multiplications with a fixed parameter [hash subkey (H)]. x131 = x10 + x5 + x4 + x3 , and the like. The final sim- A A In Fig. 1, 1 Ð m are additional authenticated data (ADD) (not plified answer for β(x)=(Xn− × H + Xn ) × H mod C C 1 encrypted) to be verified for integrity. Moreover, 1 Ð n−m −1 f(x) will have these powers of x: 120, 118, 113, 94, are the 128-bit ciphertext blocks derived through a given block 93, 91, 89, 77, 73, 64, 63, 60, 51, 50, 49, 44, 37, 35, cipher. Together with the ADD and an added 128-bit block for 31, 30, 26, 24, 23, 22, 21, 17, 16, 15, 14, 13, 8, 5, L n X specifying the length A,C , we get blocks of data, i.e., 1 Ð 3, i.e., β(x)=x120 + x118 + x113 + x94 + x93 + x91 + X n . The GCM can provide authentication assurance for addi- x89 + x77 + x73 + x64 + x63 + x60 + x51 + x50 + x49 + tional data (of practically unlimited length per invocation) that x44 + x37 + x35 + x31 + x30 + x26 + x24 + x23 + x22 + x21 are not encrypted. Such data are used to derive the authentica- + x17 + x16 + x15 + x14 + x13 + x8 + x5 + x3 . tion tag as part of the GCM. The plaintext and the additional authenticated data are the two categories of data that the GCM A. Motivation for the Proposed Approach protects. We note that the GCM protects the authenticity of the plaintext and the AAD. The GCM also protects the confiden- Error detection constructions of the GCM are important be- tiality of the plaintext, while the additional authenticated data cause if there are errors in the output of the GCM, the respective are protected in terms of authenticity. Such data might include tag would be calculated incorrectly and that would cause errors addresses, ports, sequence numbers, protocol version num- in providing authenticity. In other words, we note the following. bers, and other fields that indicate how the plaintext should be 1) If the block cipher output is faulty, the resulting tag would treated.  be calculated based on that, causing an incorrect tag based n X The GHASH function calculates the following: i=1 i on the erroneous output. n−i+1 n n−1 2 H = X1 × H ⊕ X2 × H ⊕···⊕Xn−1 × H ⊕ 2) If the GHASH function is faulty, one would have a correct Xn × H. Note that we use the symbol ⊕ for XOR operation ciphertext with a wrong tag, which would cause failure in or modulo-2 addition and the symbol × for multiplication authenticity. in finite field. The hash subkey is generated by applying 3) If both these outputs are erroneous, not only do we get an the 128-bit block cipher to the “zero” input block, i.e., incorrect ciphertext but also an incorrect tag. 0 =(0, 0,...,0) ∈ GF (2128), as shown in Fig. 1. The arith- Linear codes used in [7] are effective, especially for special metic operations are done using the irreducible polynomial classes of fault models and also random faults. However, there f(x)=x128 + x7 + x2 + x +1. Eventually, the authentica- are two main problems with using linear codes for the GCM. tion tag T (with a length of t bits for the most significant part of First, the formulations and error detection mechanisms would

Authorized licensed use limited to: University of South Florida. Downloaded on February 16,2020 at 23:18:26 UTC from IEEE Xplore. Restrictions apply. KERMANI AND AZARDERAKHSH: RELIABLE ARCHITECTURE-OBLIVIOUS ERROR DETECTION SCHEMES FOR SECURE CRYPTOGRAPHIC 1349 be confined to one architecture of the GCM, e.g., one type on added error detection units. Let us explain our scheme for GF 128 j ≤ j ≤ n of multiplier in (2 ) (and, thus, would need to be revis- =2 , 1 log2 . The output of the GCM can be de- ited/implemented accordingly). Moreover, the designers need rived by grouping the input blocks, i.e., we derive the following to have freedom in choosing the block ciphers paired with the example for q = 8,n= 24: GCM. In addition, we note that the parallel GCM architec- 8 8 8 8 8 tures, e.g., presented in [5] and [8] (to alleviate the perfor- X1 × H × H × H ⊕ X9 × H × H ⊕ −i mance/throughput and, thus, the efficiency of the architectures),  8 +1  8 8 ··· can be leveraged for error detection to achieve more efficient ···⊕Xi × H × H × H1+2+4+ ⊕ constructions.  8 −i +1  To resolve the above-mentioned complications, in this paper, 8 1+2+4+··· Xi+8 × H × H ⊕ we present error detection schemes that are not confined to 8 8 ···⊕X8 × H × H × H⊕ specific GCM architectures, whose details are presented in the 8 X16 × H × H ⊕···⊕X24 × H. following.

We also achieve the following for the general case: X1 × Hq ×···×Hq ×Hq ⊕ X × Hq ×···×Hq ⊕···⊕ B. Error Detection for Parallel GHASH    q+1    n − n − The error detection construction proposed in this paper is q 1 times q 1 times based on the parallel high-throughput realization of the GCM  q −i +1  in [8]. In other words, by implementing the GCM in par- X × Hq ×···×Hq ×H1+2+4+···⊕ X × i    i+q allel branches, one would have q parallel addersÐmultipliers n − q j q 1 times for deriving the output of the GCM. Noting that = 2 , q −i ≤ j ≤ n  +1  1 log2 , which leads to a lower number of clock cy- Hq ×···×Hq ×H1+2+4+···⊕···⊕X × cles, let us present the following example for q = 8,n= 24:    q n − cX × H8 × H8 × H8 ⊕ X × H8 × H8 × H7 q 2 times 1 2 Hq ×···×Hq ×H ⊕ X × Hq ×···×Hq ×H ⊕···⊕    2q    8 8 8 8 ⊕···⊕X × H × H × H ⊕ X × H × H n n 8 9 q −1 times q −2 times 8 7 Xn × H. ⊕ X10 × H × H ⊕···⊕X24 × H. Phase 1: It is noted that because the number of input We also achieve the following for the general case: blocks n is large (combined additional authenticated data blocks X × Hq ×···×Hq ⊕X × Hq ×···×Hq ×Hq−1 ⊕ 1    2    and encrypted blocks), in very high percentage of the itera- n n q n n − − q H q times q −1 times tions, i.e., q 1 out of q 1 + log2 times, we have ···⊕X × Hq ×···×Hq ×Hq−i+1 ⊕···⊕X × as one of the inputs of the finite field multipliers used to de- i    q n rive GHASH, e.g., for n =220 and q =8,wehave − 1= n − q q 1 times n q q q q (0.999997)[ − 1 + log q]. We need log q times multiplica- H ×···×H ×H ⊕ Xq × H ×···×H ⊕Xq × q 2 2    +1    +2 tion cycles, in general, so that using the exponentiations to the n − n − q− q 1 times q 1 times powers of two, we can implement H 1 for low-complexity Hq ×···×Hq ×Hq−1 ⊕···⊕X × H    n . Here, only the realizations. If one stores the 128-bit output result of the GCM n n − − q 2 times after q 1 cycles and compares that with the result obtained for exponentiations of the hash subkey to the powers of 2 are the swapped variant of the 128-bit input blocks (for instance, the used for the exponentiations Hq−i+1,1≤ i ≤ q,tohave input for the first multiplier now becomes the input of the kth n low-complexity structures. multiplier for 2 ≤ k ≤ q), then, after q − 1 cycles, the result Our schemes constitute algorithm-oblivious constructions must be the same, without the need for decoding. This low- through RESCAB, which can be applied to the GCM archi- complexity construction is another advantage of the proposed tectures implemented using different finite field multipliers in architecture. GF (2128). The RESCAB is based on asserting the input data Fig. 2 shows the construction of the proposed scheme for q = blocks and swapping the blocks to get to the output. There are 4 parallel branches. The swapped inputs through the RESCAB two advantages in the proposed scheme. First, the derivation of are also shown by the arrows. We note that swapping the input decoded output (after asserting the swapped inputs) to compare blocks means asserting specific operands for each branch and, with the normal output is trivial. Second, the scheme is ap- then, changing the asserted operands by swapping the inputs q j ≤ j ≤ n R plicable to any number of branches =2 ,1 log2 , to the branches. Moreover, we use an intermediate register int n − any type of finite field multiplication, and any combination of to store the result after 4 1 cycles. We have also shown the encrypted and raw data. architecture for q =32in Fig. 3. To explain in detail, we have Process of error detection: The scheme proposed here is based used the simple example of n = 32. In this case and as shown in on recomputation. Therefore, no additional circuitry, other than Fig. 3, swapping has been done in the order shown; nevertheless, added registers for storing the results of cycles and also com- such a construction is not confined to just this order (see Fig. 3). parators, is used. This allows different implementations of the It is possible that we get transient faults (some natural faults GCM to be utilized if needed, as opposed to the schemes based and, especially, malicious faults are of such nature) that cannot

Authorized licensed use limited to: University of South Florida. Downloaded on February 16,2020 at 23:18:26 UTC from IEEE Xplore. Restrictions apply. 1350 IEEE TRANSACTIONS ON RELIABILITY, VOL. 68, NO. 4, DECEMBER 2019

TLS/SSL [13]. Poly1305 uses 128-bit blocks, which are used as coefficients of a polynomial, evaluated modulo the prime number 2130 − 5. Similar to the GCM, Poly1305 derives an au- thentication tag that is appended to the output of a block cipher, say the AES with k, and gets verified at the receiver side. Briefly, the input messages construct the integers c1 − cq , and the tag is derived by multiplying these with a type of “key” r q (similar to the hash subkey in the GCM) as follows: (((c1 r + q-1 1 130 128 c2 r ++cq r ) mod 2 − 5) + AESk (n)) mod 2 . Simi- lar to the RESCAB method for the GCM, here, exponentiations of r can be realized in hardware (and the input entries can be swapped) in parallel. We note that Poly1305 is efficiently imple- mented in software, and the adoption of our proposed scheme does not mean that we advocate the hardware implementations of Poly1305.

III. ERROR SIMULATIONS,ASICIMPLEMENTATIONS, AND COMPARISONS In simulations and in theory, we note that the fundamental difference between security attacks and random faults is the intelligent-attacker assumption. However, due to limitations in practice, in real implementations, one needs to note that it is very difficult to control these variables due to the nature of the Fig. 2. Proposed RESCAB construction for GCM (q =4). implementations. In this section, we present the results of our error injection simulations for both random and biased faults. Error injection simulations: We have used linear-feedback be detected in some cases as we cover most but not all the shift register based injections for stuck-at faults for the GCM iterations. The following simple example elaborates on the fact architecture with q =8parallel branches for two types of finite that as we increase the number of branches, the performance field multipliers, assessed after applying the RESCAB scheme. is increased but that depends on the area tolerance in specific We have considered both unbiased and biased faults. We have applications. injected 10 000Ð50 000 faults, and for the error injection of Example: Comparing Figs. 2 and 3 for two parallel construc- biased faults, we have considered having two adjacent branches tions, i.e., for q =4parallel branches in Fig. 2 and for q =32 (out of eight in our example) faulty. in Fig. 3, one notes that the area footprints in these two figures Figs. 4 and 5 show the results of our simulations for permanent vary significantly. Let us examine in more detail what the im- faults using bit-parallel finite field multiplier (quadratic); see plications of having 4 and 32 branches are with respect to area the results depicted in Fig. 4 for random multiple faults (solid complexity. Fig. 2 needs four parallel multipliers as opposed line) and biased faults (dotted line), and for the KaratsubaÐ to Fig. 3 in which 32 are needed. However, in the first phase, Ofman subquadratic multiplier, see the results in Fig. 5. We as discussed previously, for n = 128, and based on the num- have considered the case study of 210 cycles for throughput n ber of cycles q − 1, 3 and 31 cycles are needed, respectively. degradation alleviation. Similarly, Figs. 6 and 7 show the results The choice for these architectures depends on the area tolerance of our simulations through transient fault injections. of the applications. For example, in implantable and wearable As shown in Figs. 4Ð7, the error coverage for the experi- medical devices, one might use low-area footprints, whereas for mented injections is high. We note that the slight fluctuations game console security, performance is the bottleneck. seen in these figures do not follow a trend. This has been ver- q Phase 2: The remaining log2 iterations constitute a very ified by the injection of up to 80 000 faults for the case study small fraction of cycles needed for the entire output derivation, of permanent biased faults, and the results show a similar error e.g., for n =220 and q =8, 0.0003%. Thus, we propose a simple coverage as Fig. 4. We have also increased the number of our recomputation that adds very slightly to performance overhead. simulation-based experiments by having different inputs and We note that in this case, the left-hand side of Fig. 3 is executed fault numbers/locations, leading to 100 000 instances. Similar twice without swapping the operands. We detect the permanent to previous experiments, this confirms the accuracy of our er- faults in the previous phase through recomputation. ror coverage results. Our proposed schemes can detect transient The proposed scheme can also be applied to Poly1305, which faults with high error coverage. The proposed schemes also pro- is a cryptographic message authentication code (can be used vide high error coverage for permanent and long transient faults. to verify the data integrity and the authenticity of a message) We would also like to discuss two unlikely cases that may [12]. Google has selected Poly1305 (along with Bernstein’s appear during permanent or long transient faults. One event can ChaCha20 symmetric cipher) as a replacement for RC4 in be “masking,” in which the output is not erroneous, even if a

Authorized licensed use limited to: University of South Florida. Downloaded on February 16,2020 at 23:18:26 UTC from IEEE Xplore. Restrictions apply. KERMANI AND AZARDERAKHSH: RELIABLE ARCHITECTURE-OBLIVIOUS ERROR DETECTION SCHEMES FOR SECURE CRYPTOGRAPHIC 1351

Fig. 3. Proposed RESCAB construction for GCM q =32and n =32cycles: Original operands (top subfigure) and swapped operands (bottom shaded subfigure). fault exists in the intermediate logic. Such cases are excluded detection schemes of the block ciphers in this paper as it has because the circuit masks the faults and these are not translated been studied in previous works [9]Ð[26]. to errors. The second instance is a rare case where all the entries In Fig. 8, we see a clear connection between the simulated of operands are zero in each column. As swapping any zero value architecture and the proposed scheme. As seen in this figure, will keep it unaltered, this benchmark case cannot be detected, for eight parallel branches and through adding the proposed ap- confirming the results of our simulations. However, applying all proach, the transient and permanent faults (see Figs. 4Ð7) are the input bits to a logic OR gate can be a secondary measure to injected through the approach presented in this section. More- detect the errors in this unlikely case. over, oblivious of the multiplier type, the simulations for Fig. 8 KaratsubaÐOfman subquadratic multipliers have sub- (see the results depicted in Figs. 4Ð7) show high error coverage. quadratic complexity and are realized by converting the costly Throughput alleviation: We propose two methods for alleviat- multiplications to additions at the cost of added delay. The bit- ing the throughput of the proposed constructions. First, through parallel construction, on the other hand, is much larger but con- deep subpipelining of the finite field multipliers, one can care- siderably fast when implemented. We do not present the error fully assert/reorder the normal and swapped raw and encrypted

Authorized licensed use limited to: University of South Florida. Downloaded on February 16,2020 at 23:18:26 UTC from IEEE Xplore. Restrictions apply. 1352 IEEE TRANSACTIONS ON RELIABILITY, VOL. 68, NO. 4, DECEMBER 2019

Fig. 4. Error injection simulation results using bit-parallel finite field multiplier (quadratic) in the GCM for q =8: Permanent faults (solid line: multiple faults, dotted line: biased faults).

Fig. 5. Error injection simulation results using KaratsubaÐOfman subquadratic multiplier in the GCM for q =8: Permanent faults (solid line: multiple faults, dotted line: biased faults).

Fig. 6. Error injection simulation results using bit-parallel finite field multiplier (quadratic) in the GCM for q =8: Transient faults (solid line: multiple faults, dotted line: biased faults). data. This would increase the throughput of the RESCAB and encoded results. Based on the reliability requirements and method at the expense of added hardware complexity. throughput degradation tolerance, however, this can be low- The second remedy is to recompute with the swapped en- ered. For instance, the intermediate register can be populated tries for just a fraction of the total cycles needed for deriving with the result of normal and swapped operands after 216 cy- the GCM. Let us have an example to clarify this approach. In cles to roughly halve the cycles needed before reaching the the proposed parallel approach, let us have n =220 blocks and intermediate value to compare for error detection (to halve the n n q =8parallel constructions (we get q − 1=(0.999997)[ q − throughput degradation as well). We note that such flexibility q 1 + log2 ]). In the original RESCAB, to achieve high error is an advantage of the proposed approach, noting the complica- coverage, after the first 217 − 1 cycles, we compare the normal tions presented in the previous section.

Authorized licensed use limited to: University of South Florida. Downloaded on February 16,2020 at 23:18:26 UTC from IEEE Xplore. Restrictions apply. KERMANI AND AZARDERAKHSH: RELIABLE ARCHITECTURE-OBLIVIOUS ERROR DETECTION SCHEMES FOR SECURE CRYPTOGRAPHIC 1353

Fig. 7. Error injection simulation results using KaratsubaÐOfman subquadratic multiplier in the GCM for q =8: Transient faults (solid line: multiple faults, dotted line: biased faults).

Fig. 8. Pictorial view of the simulated circuit for error detection of the GCM (q =8).

TABLE I proposed approach in the last three rows based on 1) one-stage ASIC 65-nm TSMC BENCHMARK OF THE PROPOSED SCHEMES deep subpipelining (highest error coverage and area overhead); FOR AESÐGCM AND n = 230 BLOCKS 2) case study of 225 cycles; and 3) case study of 210 cycles. The former approach increases the area overhead (which is nor- mally negligible for time redundancy) to decrease throughput degradation, whereas the latter cases have lower overhead but at the expense of covering permanent and a subset of transient faults. We have also presented in the first row the plain approach without subpipelining. Comparisons: Let us consider linear codes, such as CRC, ASIC implementations: Using 65-nm ASIC synthesis, we also parities, and interleaved parities, for error detection of the GCM present the overhead of the presented constructions for the case constructions. Although such structures are efficient, they have study of AESÐGCM. The benchmarking is performed for the major drawbacks. They are not only confined to specific archi- error detection architectures using TSMC 65-nm library and tectures for the ciphers within blockcipher-GCM (for example, Synopsys Design Compiler (presented in Table I for area [kilo if any of such codes are derived for the AES, obviously, they gate equivalent (kGE), which is the normalized area for 2-input cannot be used for CLEFIA block cipher) but also to the very NAND gate] and respective throughputs). architecture in GHASH within the GCM for which such codes In Table I, we have presented four schemes based on bit- are derived (for example, the parities derived for error detection parallel multipliers for the GCM with q =8, i.e., variants of the in polynomial basis quadratic multipliers cannot be utilized for

Authorized licensed use limited to: University of South Florida. Downloaded on February 16,2020 at 23:18:26 UTC from IEEE Xplore. Restrictions apply. 1354 IEEE TRANSACTIONS ON RELIABILITY, VOL. 68, NO. 4, DECEMBER 2019

composite field normal basis subquadratic multipliers). In con- REFERENCES trast with the previous work, which either utilizes linear codes, [1] M. Dworkin, “Recommendation for block cipher modes of operation: e.g., parity and CRC [7] (confined to specific architectures of Galois/Counter Mode (GCM) and GMAC,” National Inst. of Standards the GCM and considers random and not biased faults), or fo- Technol., Gaithersburg, MD, USA, Rep. NIST SP 800-38D, 2007. [2] E. Kasper¬ and P.Schwabe, “Faster and timing-attack resistant AES-GCM,” cuses on the AESÐGCM, not taking advantage of the parallel in Proc. 11th Int. Workshop Cryptographic Hardware Embedded Syst., constructions presented in [5] and [8] to fine-tune the schemes 2009, pp. 1Ð17. [9]. This paper is not limited to the types of finite field multipli- [3] K. Jankowski and P. Laurent, “Packed AES-GCM algorithm suitable for AES/PCLMULQDQ instructions,” IEEE Trans. Comput., vol. 60, no. 1, ers, can be tailored towards higher reliability or lower overhead, pp. 135Ð138, Jan. 2011. and also can be applied to different block ciphers, and considers [4] S. Lemsitzer, J. Wolkerstorfer, N. Felbert, and M. Braendli, “Multi-gigabit biased fault models. Specifically, the two previous works on the GCM-AES architecture optimized for FPGAs,” in Proc. 9th Int. Workshop Cryptographic Hardware Embedded Syst., 2007, pp. 227Ð238. error detection of the GCM have either utilized linear codes [7] [5] A. Satoh, T. Sugawara, and T. Aoki, “High-performance hardware archi- or presented different combined AESÐGCM error detection ar- tectures for Galois Counter Mode,” IEEE Trans. Comput., vol. 58, no. 7, chitectures by carefully scrutinizing the implementation cycles pp. 917Ð930, Jul. 2009. [6] B. Buhrow, K. Fritz, B. Gilbert, and E. Daniel, “A highly parallel AES- [9]. Let us start with the former. Linear codes used in [7] are GCM core for authenticated encryption of 400 Gb/s network protocols,” effective especially for special classes of fault models and also in Proc. Int. Conf. ReConFigurable Comput. FPGAs, 2015, pp. 1Ð7. random faults. However, there are two main problems with us- [7] A. A. Kouzeh Geran and A. Reyhani-Masoleh, “A CRC-based concurrent fault detection architecture for Galois/Counter Mode (GCM),” in Proc. ing linear codes for the GCM. First, the formulations and error IEEE 23rd Symp. Comput. Arithmetic, 2016, pp. 24Ð31. detection mechanisms would be confined to one architecture of [8] M. Mozaffari Kermani and A. Reyhani-Masoleh, “Efficient and high- the GCM, e.g., one type of multiplier in GF (2128) (and, thus, performance parallel hardware architectures for the AES-GCM,” IEEE Trans. Comput., vol. 61, no. 8, pp. 1165Ð1178, Aug. 2012. would need to be revisited/implemented accordingly). Using [9] X. Guo and R. Karri, “Low-cost concurrent error detection for GCM and any other type of multiplier would result in modifications in CCM,” J. Electron. Testing, vol. 30, no. 6, pp. 725Ð737, 2014. the architectures proposed in such schemes. This is rather im- [10] S. Gueron, A. Langley, and Y. Lindell, “AES-GCM-SIV: Specification and analysis,” eprint 2017/168, 2017, pp. 1Ð16. portant as we do not want to get confined to specific types or [11] T. Iwata and K. Minematsu, “Stronger security variants of GCM-SIV,” architectures (and not being able to tailor the GCM architec- eprint 2016/853, 2016, pp. 1Ð24. tures utilized in different usage models). The error coverage of [12] D. Bernstein, “The Poly1305-AES message-authentication code,” in Proc. 12th Int. Conf. Fast Softw. Encryption, 2005, pp. 32Ð49. the proposed scheme is very high (more than 99.9%), which is [13] “Google swaps out crypto ciphers in OpenSSL,” InfoSecurity, Apr. 24, higher than different variants of the one in [7]: 48%, 74%, 87%, 2014. 92.5%, 96%, and 98%. Moreover, although much effective, for [14] G. Xiaofei and R. Karri, “Recomputing with permuted operands: A con- current error detection approach,” IEEE Trans. Comput.-Aided Design the latter, the designers need to have freedom in choosing the Integr. Circuits Syst., vol. 32, no. 10, pp. 1595Ð1608, Oct. 2013. block ciphers paired with the GCM. In addition, we note that [15] X. Guo, D. Mukhopadhyay, C. Jin, and R. Karri, “Security analysis of the parallel GCM architectures, e.g., presented in [5] and [8] (to concurrent error detection against differential fault analysis,” J. Crypto- graphic Eng., vol. 5, no. 3, pp. 153Ð169, 2015. alleviate the performance/throughput and, thus, the efficiency of [16] M. Yasin, B. Mazumdar, S. Subidh Ali, and O. Sinanoglu, “Security the architectures), can be leveraged for error detection to achieve analysis of logic encryption against the most effective side-channel attack: more efficient constructions. The proposed scheme in this paper DPA,” in Proc. IEEE Int. Symp. Defect Fault Tolerance VLSI Nanotechnol. Syst., 2015, pp. 97Ð102. alleviates the above-mentioned shortcomings. [17] M. Mozaffari Kermani, R. Azarderakhsh, and A. Aghaie, “Fault detection architectures for post-quantum cryptographic stateless hash-based secure signatures benchmarked on ASIC,” ACM Trans. Embedded Comput. Syst. IV. CONCLUSION (Special Issue on Embedded Device Forensics and Security: State of the Art Advances), vol. 16, no. 2, Apr. 2017, Art. no. 59. In this paper, we proposed, simulated through error injec- [18] M. Mozaffari Kermani and A. Reyhani-Masoleh, “A low-power high- tions, and implemented on ASIC our proposed schemes for the performance concurrent fault detection approach for the composite field S-box and inverse S-box,” IEEE Trans. Comput., vol. 60, no. 9, pp. 1327Ð GCM. Our schemes constituted algorithm-oblivious construc- 1340, Sep. 2011. tions through RESCAB, which can be applied to the GCM ar- [19] M. Mozaffari Kermani, R. Azarderakhsh, C. Lee, and S. Bayat-Sarmadi, chitectures using different finite field multipliers in GF (2128); “Reliable concurrent error detection architectures for extended Euclidean- based division over GF (2m ),” IEEE Trans. Very Large Scale Integr. for example, in our experiments, we used a quadratic and a (VLSI) Syst., vol. 22, no. 5, pp. 995Ð1003, May 2014. subquadratic multiplier to show the obliviousness of the pro- [20] M. Mozaffari Kermani and R. Azarderakhsh, “Reliable hash trees for post- posed approach. The proposed scheme is not limited to the quantum stateless cryptographic hash-based signatures,” in Proc. IEEE Int. Symp. Defect Fault Tolerance VLSI Nanotechnol. Syst., Oct. 2015, types of finite field multipliers, can be tailored towards higher pp. 103Ð108. reliability or lower overhead, can be applied to different block [21] M. Mozaffari Kermani, R. Azarderakhsh, and A. Aghaie, “Reliable and ciphers, and considers biased fault models. Such obliviousness error detection architectures of Pomaranch for false-alarm-sensitive cryp- tographic applications,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., for the proposed constructions used in the GCM gives freedom vol. 23, no. 12, pp. 2804Ð2812, Dec. 2015. to the designers. Through deep subpipelining, we reached an [22] M. Mozaffari Kermani and A. Reyhani-Masoleh, “Fault detection struc- area overhead of 6.7% and a throughput degradation of 7.7%. tures of the S-boxes and the inverse S-boxes for the Advanced Encryption 25 10 Standard,” J. Electron. Testing: Theory Appl., vol. 25, no. 4, pp. 225Ð245, Moreover, for RESCAB 2 cycles (2 cycles), we reached Aug. 2009. an area overhead of 4.9%, while throughput degradations are [23] M. Mozaffari Kermani, R. Azarderakhsh, A. Sarker, and A. Jalali, “Ef- 11.9% and 8.5%, respectively. Based on the available resources, ficient and reliable error detection architectures of Hash-Counter-Hash tweakable enciphering schemes,” ACM Trans. Embedded Comput. Syst., one may utilize the proposed error detection schemes for making vol. 17, no. 2, Apr. 2018, Art. no. 54. the hardware implementations of the GCM more reliable.

Authorized licensed use limited to: University of South Florida. Downloaded on February 16,2020 at 23:18:26 UTC from IEEE Xplore. Restrictions apply. KERMANI AND AZARDERAKHSH: RELIABLE ARCHITECTURE-OBLIVIOUS ERROR DETECTION SCHEMES FOR SECURE CRYPTOGRAPHIC 1355

[24] M. Mozaffari Kermani and A. Reyhani-Masoleh, “A high-performance Reza Azarderakhsh (M’11) received the Ph.D. de- fault diagnosis approach for the AES SubBytes utilizing mixed bases,” gree in electrical and computer engineering from in Proc. Workshop Fault Diagnosis Tolerance Cryptogr., Nara, Japan, Western University, London, ON, Canada, in 2011. Sep. 2011, pp. 80Ð87. He is currently an Assistant Professor with the [25] M. Mozaffari Kermani and A. Reyhani-Masoleh, “Reliable hardware ar- Department of Electrical and Computer Engineering, chitectures for the third-round SHA-3 finalist Grostl benchmarked on Florida Atlantic University, Boca Raton, FL, USA. FPGA platform,” in Proc. IEEE Int. Symp. Defect Fault Tolerance VLSI His research interests include finite field and its ap- Nanotechnol. Syst., Vancouver, BC, Canada, Oct. 2011, pp. 325Ð331. plications, elliptic curve , pairing-based [26] S. Patranabis, A. Chakraborty, D. Mukhopadhyay, and P. P. Chakrabarti, cryptography, and postquantum cryptography. “Fault space transformation: A generic approach to counter differential Dr. Azarderakhsh was the recipient of the NSERC fault analysis and differential fault intensity analysis on AES-like block Postdoctoral Research Fellowship while working ciphers,”IEEE Trans. Inf. Forensics Secur., vol. 12, no. 5, pp. 1092Ð1102, with the Center for Applied Cryptographic Research and the Department May 2017. of Combinatorics and Optimization, University of Waterloo, Waterloo, ON, Canada. He was the Guest Editor for the IEEE TRANSACTIONS ON DEPENDABLE AND SECURE COMPUTING for the special issue on “Emerging Embedded and Cyber Physical System Security Challenges and Innovations” (2016 and 2017). He was also the Guest Editor for the IEEE/ACM TRANSACTIONS ON COMPUTA- TIONAL BIOLOGY AND BIOINFORMATICS for the special issue on “Security.” He is an Associate Editor for the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS Mehran Mozaffari Kermani (S’00–M’11–SM’16) (TCAS-I). He is an I-SENSE Fellow. received the B.Sc. degree in electrical and computer engineering from the University of Tehran, Tehran, Iran, in 2005, and the M.E.Sc. and Ph.D. degrees in electrical and computer engineering from the Depart- ment of Electrical and Computer Engineering, Uni- versity of Western Ontario, London, ON, Canada, in 2007 and 2011, respectively. In 2011, he joined the Advanced Micro Devices as a Senior ASIC/Layout Designer, integrating sophis- ticated security/cryptographic capabilities into accel- erated processing. In 2012, he joined the Electrical Engineering Department, Princeton University, Princeton, NJ, USA, as an NSERC Postdoctoral Research Fellow. From 2013 to 2017, he was an Assistant Professor with the Rochester Institute of Technology, and since 2017, he has been with the Computer Science and Engineering Department, University of South Florida, Tampa, FL, USA. Dr. Kermani was the recipient of the prestigious Natural Sciences and En- gineering Research Council of Canada Postdoctoral Research Fellowship in 2011 and the Texas Instruments Faculty Award (Douglas Harvey) in 2014. He is an Associate Editor for the IEEE TRANSACTIONS ON VLSI SYSTEMS,theACM Transactions on Embedded Computing Systems, the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS I, and a Guest Editor for the IEEE TRANSACTIONS ON DEPENDABLE AND SECURE COMPUTING for the special issue on Emerging Em- bedded and Cyber Physical System Security Challenges and Innovations (2016 and 2017). He was the lead Guest Editor for the IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS and the IEEE TRANSAC- TIONS ON EMERGING TOPICS IN COMPUTING for special issues on “Security”. He has been the Technical Program Committee (TPC) member for a number of conferences including IEEE International Symposium on Hardware Oriented Security and Trust (HOST) (Publications Chair), Design Automation Confer- ence (DAC), Design, Automation and Test in Europe (DATE), RFID Security Conference (RFIDSec), Lightweight Security Conference (LightSec), Interna- tional Workshop on the Arithmetic of Finite Fields (WAIFI), Fault Diagnosis and Tolerance in Cryptography (FDTC), and IEEE International Symposium on Defect and Fault Tolerance (DFT) in VLSI and Nanotechnology Systems (DFT).

Authorized licensed use limited to: University of South Florida. Downloaded on February 16,2020 at 23:18:26 UTC from IEEE Xplore. Restrictions apply.