<<

sensors

Article FPGA Modeling and Optimization of a Lightweight

Sa’ed Abed 1,* , Reem Jaffal 1, Bassam Jamil Mohd 2 and Mohammad Alshayeji 1

1 Department of Computer Engineering, Kuwait University, Safat 13060, Kuwait; [email protected] (R.J.); [email protected] (M.A.) 2 Department of Computer Engineering, Hashemite University, Zarqa 13115, Jordan; [email protected] * Correspondence: [email protected]; Tel.: +965-249-837-90

 Received: 18 December 2018; Accepted: 18 February 2019; Published: 21 February 2019 

Abstract: Security of sensitive data exchanged between devices is essential. Low-resource devices (LRDs), designed for constrained environments, are increasingly becoming ubiquitous. Lightweight block ciphers provide confidentiality for LRDs by balancing the required security with minimal resource overhead. SIMON is a lightweight block cipher targeted for hardware implementations. The objective of this research is to implement, optimize, and model SIMON cipher design for LRDs, with an emphasis on energy and power, which are critical metrics for LRDs. Various implementations use field-programmable gate array (FPGA) technology. Two types of design implementations are examined: scalar and pipelined. Results show that scalar implementations require 39% less resources and 45% less power consumption. The pipelined implementations demonstrate 12 the throughput and consume 31% less energy. Moreover, the most energy-efficient and optimum design is a two-round pipelined implementation, which consumes 31% of the best scalar’s implementation energy. The scalar design that consumes the least energy is a four-round implementation. The scalar design that uses the least area and power is the one-round implementation. Balancing energy and area, the two-round pipelined implementation is optimal for a continuous stream of data. One-round and two-round scalar implementations are recommended for intermittent data applications.

Keywords: security; cipher; block cipher; ; lightweight block cipher; FPGA; power; energy; low-resource devices; SIMON

1. Introduction Recently, there was a rapid growth in applications based on low-resource devices (LRDs), which include radio-frequency identification (RFID), wireless sensor networks (WSNs), smart cards, wireless body area networks (WBANs), and the Internet of things (IoT) [1]. LRDs are designed for constrained environments where cost, power consumption, energy, and available resources are limited. As LRDs become ubiquitous in daily life, it is very important to protect the confidentiality of exchanged data. The challenge is to balance an adequate security level with the limited resources in LRDs. Special cipher implementation is required to optimize energy, power, and area while considering the constraints of the devices, such as ciphers or lightweight ciphers. In general, ciphers in are algorithms responsible for encryption and decryption operations for an input message or by applying certain steps to generate the . The algorithm of ciphers consists of three main sub-algorithms: encryption algorithm, decryption algorithm, and -scheduling algorithm (also known as key-expansion) [2]. The key-scheduling or expansion algorithm is responsible for generating sub-keys used in various encryption/decryption steps. In most ciphers, key expansion is executed once for both decryption and encryption, while in others is executed separately for encryption and decryption such as the case in Advanced

Sensors 2019, 19, 913; doi:10.3390/s19040913 www.mdpi.com/journal/sensors Sensors 2019, 19, 913 2 of 28

Encryption Standard (AES) and RIJNDAEL ciphers [1,2]. Based on the number of keys used on encryption/decryption algorithms, ciphers are categorized as asymmetric (public key) and symmetric (private key) ciphers [1]. Asymmetric ciphers are more expensive and complex in terms of computation [1,2]. The asymmetric algorithm uses two keys: public key and private key. The public key is public and can be accessible to everyone, while the private one is only known by one person and kept as a secret key. The public key and the private key are related to each other internally; however, they cannot be obtained from each other. Examples of popular asymmetric ciphers are RSA, DIFFIE-HELLMAN, ALGAMMAL, elliptic curve cryptography, and DSA [3]. Lightweight ciphers can be designed as symmetric ciphers [4], which use one shared private key employed in both encryption and decryption. Based on the number of processed bits and bytes, symmetric ciphers are categorized as block ciphers and stream ciphers. Block cipher is an encryption scheme for a fixed-size plaintext block [5]. Typically, blocks are 32 bits, 48 bits, 64 bits, or 128 bits. Block ciphers have three main parameters: , block size, and number of rounds [1]. Some researchers add the amount of logic in each round of the encryption algorithm as an additional parameter. Most of the commonly used symmetric ciphers are block ciphers, which include (DES), 3DES, AES, BLOEFISH, and cipher [6]. On the other hand, stream ciphers perform encryption one bit or byte of plaintext input at a to reduce design complexity [5]. Stream ciphers can be easily created by block ciphers. An example of the most popular is RC4 [6]. Other examples include , GRAIN, and TRIVIUM [7–9]. Based on the structure of the algorithms, block ciphers can be categorized into three groups: substitution permutation (SPN), Feistel stream, and Lai–Massey [10]. Compared with conventional blocks, lightweight block ciphers balance the following challenges: (1) minimal overhead (e.g., number of gates, memory footprint), because of the low-cost smart devices, (2) minimal power and energy consumption, and (3) sufficient security performance [10]. Cazorla et al. [11] stated that the lightweight cipher had smaller block sizes, such as 32 bits, 48 bits, or 64 bits. In addition, lightweight block ciphers simplified key scheduling, employing elementary operations per round, and increasing the number of rounds. Cipher implementations targeted for low-resource applications are classified into software implementation and hardware implementation [1]. Hardware implementation is faster and lower in power and energy consumption, as compared with software implementation [4]. Hardware implementation is realized using application-specific integrated circuit (ASIC) technology or a field-programmable gate array (FPGA) platform [1]. Traditionally, an ASIC had better performance than an FPGA, but recent advances in FPGA technology reduced the speed gap [1,12]. FPGA provides more flexibility, re-configurability features, low-cost designs [13], and efficient resource utilization and hardware architecture [12]. Additionally, FPGA is well suited for implementing ciphers, as it offers many advantages, such as algorithm agility, upload, and modification in addition to the security features. Hardware implementation optimizes metrics, including throughput, area, power, and energy performance. For LRDs, power and energy are the most important metrics. SIMON and , published by the (NSA), are highly optimized lightweight ciphers, which provide security and flexibility over a variety of configurations (different block and key sizes). Each of these ciphers demonstrates excellent performance in both hardware (in terms of area) and software (in terms of code size and memory usage), because of the flexible implementations [14]. A SIMON cipher is targeted for hardware and a SPECK cipher is targeted for software implementation [15]. While there are numerous published articles to optimize SIMON implementation for area and throughput, little attention is paid to optimizing power/energy, considered a fundamental issue in LRDs. The novelty of this work is exercising various design options on SIMON FPGA implementations and modeling the implementation performance metrics with emphasis on energy in order to achieve the optimum design implementation. Reasonably accurate performance models are derived to better Sensors 2019, 19, 913 3 of 28 implement current and cipher designs. The following steps are considered to achieve the aforementioned goal:

• Implement a basic SIMON design (scalar/non-pipelined) with one and multiple rounds. Scalar implementations are more appropriate for intermittent (non-continuous) data. • Implement and examine pipelined design (pipelined) with one and multiple rounds. Pipelined designs are better suited to encrypt a continuous stream of data. • Derive accurate performance models for throughput, area, power, and energy metrics based on the implementation results. • Determine the best implementation for each performance metric with a particular area and energy (with slightly higher emphasis on energy), i.e., the most critical metrics in LRDs.

The rest of this paper is organized as follows: Section2 reviews the related work in this area, including software and hardware implementation for lightweight ciphers. Section3 discusses the proposed research methodology, the SIMON lightweight cipher algorithm, and the scalar and pipelined implementations. Section4 summarizes the implementation results. Section5 presents the discussions and guidelines. Finally, Section6 details the conclusions and the recommendations for future work.

2. Related Work Numerous studies examined SIMON and SPECK implementations on different software and hardware platforms, as compared with other block ciphers. This section presents an overview of these studies. With regard to software implementation, Beaulieu et al. [16] implemented SIMON and SPECK ciphers on an 8-bit AVR microcontroller’s platform to achieve optimal performance. SIMON and SPECK were also compared with other block ciphers in the same AVR platform. SPECK demonstrated the best performance. Hosseinzadeh et al. [17] implemented lightweight block ciphers on an Atmega128 microprocessor, which were evaluated in terms of energy and memory performance. The block ciphers included KLEIN-80, TWINE-80, PICCOLO-80, SPECK64/96, and SIMON64/96. The results showed that SPECK64/96 was the best in terms of energy, followed by TWINE-80. In terms of memory consumption, the TWINE-80 block cipher was best and SPECK64/ 96 was ranked third. This study concluded that SPECK64/96 and TWINE-80 were the most suitable block ciphers for WSNs. Several studies focused on hardware implementation using an ASIC flow. Beaulieu et al. [18] presented different ASIC implementations for SIMON and SPECK, including bit-serial, iterated, and partially and fully pipelined. The results concluded that SIMON had the highest efficiency when compared to other ciphers. Beaulieu et al. [14] discussed SIMON and SPECK implementations on ASIC hardware and 8-bit microcontroller software platforms. Comparisons with different lightweight block ciphers were provided, such as KATAN, KLEIN, MCRYPTON, PICCOLO, PRESENT, and AES. The ASIC results showed that PRESENT-80 achieved a throughput of 12.4 Kbps at 100 kHz within 1030 GE, while SIMON64/96 and SPECK 64/96 achieved higher throughput with less area (838 and 984 GE, respectively). Additionally, SIMON64/96 and SPECK64/96 provided 16 added bits of security. SIMON128/128 and SPECK128/128 required half of the AES area, which was better for hardware implementation. Beaulieu et al. [19] provided ASIC implementation results of different versions of SIMON and SPECK block ciphers, in terms of area and throughput. A comparison of different block ciphers (PRESENT, KATAN, KLEIN, PICCOLO, and AES) was also presented. The results showed that SPECK had a small ASIC implementation; however, SIMON had the smallest area of all investigated ciphers. Other studies examined hardware implementations using an FPGA design flow. Beaulieu et al. [19] discussed FPGA performance comparisons of SIMON, SPECK, and PRESENT on low-cost Xilinx Spartan FPGAs. SIMON and SPECK demonstrated better area reduction in comparison to AES and PRESENT. Wetzels et al. [20] implemented several hardware architectural designs for a SIMON64/128 block cipher on a Xilinx Spartan-6 FPGA series platform. These designs included the Sensors 2019, 19, 913 4 of 28 round function, as well as iterative, loop unrolling, inner-round pipelining, outer-round pipelining, and mixed pipelining architectures. The issues and trade-offs between these designs were discussed. Mixed pipelining architecture had the greatest throughput, while the round function was optimal in terms of area. Performance results were demonstrated for throughput, area, and throughput-to-area. Feizi et al. [21] implemented a SIMON32/64 block cipher in an FPGA model, Virtex-5 XC5VFX200T, and presented the results. SIMON was found to be a very flexible algorithm, due to the range of block and key sizes it offers. It is suitable for RFID systems and WSNs. SIMON32/64 has a small block and key size, and, as a result, it is suitable for lightweight devices, where few resources are needed. The larger key and block size offer more levels of security. Aysu et al. [15] suggested that SIMON is a strong alternative to AES, because it provides an equivalent level of security with better area results. The smallest area was achieved for low-cost FPGA by SIMON, which required 36 slices on a Spartan-3 FPGA and 13 slices on a Spartan-6 FPGA. Gulcan et al. [22] proposed a flexible FPGA hardware architecture capable of using all SIMON configurations. The implementation results showed that the proposed architecture required 90 and 32 slices on Spartan-3 and Spartan-6 FPGAs, respectively. Wan et al. [23] proposed an ultra-low-power implementation of a SIMON cipher. A bit-serialized SIMON core with 32-bit plaintext and 64-bit key was implemented. The design was based on adiabatic circuits. The proposed architecture achieved a 27.5× higher energy efficiency (kilobit per per Watt) at the expense of 18% less throughput, as compared to conventional implementations. Yang et al. proposed [24] the SIMECK cipher, which is similar to SIMON, and has the same rounds as SIMON. SIMECK has a slightly smaller area and power. However, SIMECK is more vulnerable to attacks, as several studies demonstrated including References [25,26]. In summary, SIMON exhibits superior hardware metrics, which makes it a good candidate for hardware implementation, especially on an FPGA platform, because of its advantages over an ASIC. While most of the implementations of SIMON targeted area optimization [15,19,22], it is clear that little research was invested in power and energy, which are significant to LRD design. The focus of the work in Reference [23] was circuit design, typically destined for custom design ICs or ASICs. Our work is based on exploring design options in FPGA implementations. Therefore, these are two different fields. Additionally, adiabatic circuits have major issues of strong dependency on parameter variations, voltage threshold, and logic family. To our knowledge, this is the first work we know of which presents analysis and modeling of SIMON metrics to achieve optimum design implementation, while focusing on energy and power performance metrics.

3. Methods The research goal was to implement and optimize a SIMON block cipher in terms of throughput, area, power, and energy by considering different designs to determine the optimum implementation for a SIMON cipher. The optimum implementation balances area and power/energy consumption. The research steps to achieve this goal were as follows (see Figure1):

• Implement the basic one-round SIMON scalar design to optimize the basic design. • Implement scalar designs with multiple hardware rounds, e.g., 2, 4, 8, 16, or 32. • Implement pipelined designs with one and multiple rounds. • Measure and model the performance metrics (area, speed, power, and energy) for each step. • Discuss recommendations for the optimum design. Sensors 2019, 19, 913 5 of 28 Sensors 2018, 18, x FOR PEER REVIEW 5 of 28

Figure 1. Research steps. Figure 1. Research steps. These steps can be achieved by executing the FPGA design flow shown in Figure2. Similar FPGA These steps can be achieved by executing the FPGA design flow shown in Figure 2. Similar design flows were used in other studies, including References [4,13,27–31]. The steps of the FPGA flow FPGA design flows were used in other studies, including References [4,13,27–31]. The steps of the were as follows: FPGA flow were as follows: • The cipher was designed and implemented at the register transfer level (RTL) using VerilogTM, • The cipher was designed and implemented at the register transfer level (RTL) using VerilogTM, a hardware description language. The Verilog implementation was verified by dynamic a hardware description language. The Verilog implementation was verified by dynamic simulations using ModelSimTM. ModelSim provides wave-files, which capture the node activity simulations using ModelSimTM. ModelSim provides wave-files, which capture the node activity used to compute the power dissipation of the design. used to compute the power dissipation of the design. • • The design wa wass synthesized and compiled using the the Altera Altera FPGA FPGA software software package package Quartus Quartus-II.-II. The choice of FPGA family should not impact the results of the research. MohdMohd et al. examined several implementations implementations of of steganography algorithms algorithms in Altera in Altera and and Xilinx Xilinx FPGAs FPGAs [32]. The [32]. study The studyconcluded concluded that Altera that Altera and Xilinx and Xilinx provide provide similar similar trending trending results. results. • For Quartus-IIQuartus-II synthesis, timing constraints werewere used duringduring thethe compilingcompiling process.process. • The design underwentunderwent the following analyses using Quartus-II:Quartus-II:

- -TimingTiming analysis analysis reported reported the maximum the maximum frequency frequency of the design. of the The design. designs The we designsre compiled were with clockcompiled constraints with clockof 50 constraintsMHz. of 50 MHz. - -ResourceResource utilization utilization analysis analysis showed showed the number the number of logic of ele logicments elements (LEs) (LEs) and the and type the type(i.e., combinational,(i.e., combinational, register logic, register or both) logic, used or both) in the used FPGA in the design FPGA [13 design]. LE [is13 the]. LE smallest is the smallest unit of logic inunit the ofAltera logic architecture; in the Altera architecture;it is compact it and is compact facilitates and efficient facilitates logic efficient utilization. logic utilization. Each LE includesEach a four LE- includesinput look a four-input-up table (LUT) look-up and table a programmable (LUT) and a programmable register. The LUT register. is a Thefunction LUT generatoris a that function can implement generator thatany canfunction implement of four any variables function [33 of]. four variables [33]. - -Power analysisPower analysis compute computedd the average the averagepower of power the design. of the The design. computed The computed power is powerthe dynamic is the core powdynamicer that core consist powers of that comb consistsinational, of combinational, register, and register, . The and clock.power The required power byrequired node activitiesby wa nodes extracted activities fromwas extracted the value from change the value dump change (VCD) dump files (VCD)generated files by generated ModelSim by simulations.ModelSim This approach simulations. to computing This approach power to was computing used in powerother works, was used such in as other Referenc works,es [4,13]. such as References [4,13]. • Performance models for area, power, power, and energy energy we werere derived derived based based on on implementation implementation results. results.

Sensors 2019, 19, 913 6 of 28 Sensors 2018, 18, x FOR PEER REVIEW 6 of 28

Figure 2. Figure 2. Design flow. flow. 3.1. SIMON Algorithm 3.1. SIMON Algorithm SIMON is one of the recently published lightweight block ciphers from the National Security AgencySIMON (NSA). is Theone structureof the recently of the published SIMON cipher lightweight is based block on a Feistelciphers network, from the which National is an Security iterated Agency (NSA). The structure of the SIMON cipher is based on a Feistel network, which is an iterated cipher with an internal round function [34]. SIMON offers 10 different configurations depending on cipher with an internal round function [34]. SIMON offers 10 different configurations depending on block and key sizes, which can provide numerous levels of security [22]; see the list in Table1. block and key sizes, which can provide numerous levels of security [22]; see the list in Table 1. Table 1. The SIMON configurations and parameters. Table 1. The SIMON configurations and parameters. Security Block Size Key Size Word Size Key Words Constant Sequence Rounds Configuration (2n) (mn) (n) Key Word(m ) Key Constant(z ) (T) Block Size j Rounds 1Security 32 64 16Size Size 4 Words sequence z0 32 2Configuration 48 72(2n) 24 3 z0(T) 36 (mn) (n) (m) (zj) 3 48 96 24 4 z1 36 41 64 9632 3264 16 34 z0 z2 32 42 5 64 128 32 4 z3 44 62 96 9648 4872 24 23 z0 z2 36 52 7 96 144 48 3 z3 54 3 48 96 24 4 z1 36 8 128 128 64 2 z2 68 94128 19264 6496 32 33 z2 z3 42 69 10 128 256 64 4 z4 72 5 64 128 32 4 z3 44

The SIMON block6 cipher is represented96 as SIMON296 48n/ mn, where2 2n isz2 the block size,52 and n is the number of bits7 that define the word96 size, which144 could48 be 16, 24,3 32, 48, orz3 64 bits. The54 key size is identified by multiplying the number of words in a key, indicated by the parameter m, by the word 8 128 128 64 2 z2 68 size, n, resulting in an mn-bit key length. For example, SIMON32/64 refers to 32-bit plaintext blocks, a 64-bit key size, and9 a 16-bit word. 128 192 64 3 z3 69 10 128 256 64 4 z4 72

Sensors 2019, 19, 913 7 of 28

SIMON32/64 was chosen as the cipher configuration in this study, because it supports block and key sizes appropriate for LRDs. Table2 lists the SIMON notations used in this section. The parameters for SIMON implementation were as follows:

• Block size (2n): 32; • Key size (mn): 64; • Word size (n): 16; • Key words (m): 4;

• Constant sequence (zj): z0; • Cipher rounds (T): 32.

Table 2. SIMON cipher notations.

Notation Description ⊕ Bitwise XOR & Bitwise AND Sj Left circular shift, Sj, by j bits S−j Right circular shift, S−j, by j bits n Word size 2n Block size m Key words mn Key size i Round counter, where 0 ≤ i ≤ T − 1 zj Constant sequence, where j = 0, 1, 2, 3, 4 T Number of cipher rounds F Round function k Round key (sub-key) c Constant

3.1.1. Round Function Encryption and decryption operations in SIMON2n apply the following on n-bit words [14]:

• Bitwise XOR, ⊕; • Bitwise AND, &; • Left circular shift, Sj, by j bits.

Figure3 shows the round function of the SIMON cipher, where the input plaintext block (2 n) is split into two equal words (each one is n-bit). In each round function, three circular shifts to the left (shift left one, shift left eight, and shift left two) and bitwise AND logic operations are performed on the left half block. The result is XOR-ed with the right half block and the round key. At the end of each round, the left half value is transferred to the right block and the generated value is written back to the left block. This round process is continuously repeated depending on the number of rounds within the implemented configuration. The SIMON round encryption function, F, is represented in Equation (1).

F(x,y,k) = (y ⊕ ((S1 x & S8 x) ⊕ S2 x) ⊕ k), (1) where k is the round key, x is the leftmost word of the cipher block, and y is the rightmost word. Sensors 2018, 18, x FOR PEER REVIEW 8 of 28 Sensors 2019, 19, 913 8 of 28 where k is the round key, x is the leftmost word of the cipher block, and y is the rightmost word.

Figure 3. SIMON round function.

3.1.2. 3.1.2. Key Schedule The key schedule process applies the following basic operations on m words from the master key The key schedule process applies the following basic operations on m words from the master with n-bits for each: key with n-bits for each: • BitwiseBitwise XOR,XOR, ⊕⊕;; −j • RightRight circularcircular shift,shift, SS-j, ,by by j jbitsbits.. The SIMON key schedule function takestakes thethe master key and generatesgenerates aa sequence ofof T key words ((kk00, k1,, kk22,, …,... k,Tk −T 1−),1 where), where T Trepresentsrepresents the the number number of of rounds. rounds. There There are are three three different different versions versions of the key schedule function, depending on the block size and master key size, which can include include two, three, three, or four words (i.e., m == 2, 2, 3, 3, or or 4). 4). The The key key schedule schedule function function perform performss two two circular circular shift shift operations operations to tothe the right right (shift (shift right right one, one, and and shift shift right right three). three). The The resul resultt is is XORed XORed with with a a fixed fixed constant, constant, c,c, and a constant sequence, zjj. There There are five five sequences for the constant zj,, which are version version-dependent-dependent (i.e., z0,, zz11, ,zz22, ,zz3 3andand z4z),4 ),as as shown shown in in Table Table 1.1 . Figure4 4 illustrates illustrates thethe keykey scheduleschedule functionfunction ofof SIMONSIMON forfor fourfour mastermaster keykey wordswords (i.e.,(i.e., m = 4) of n--bits.bits. These These four four sub sub-keys-keys are are generated generated on on the the first first iteration iteration of ofthe the key key schedule schedule function function and and are areused used as the as thefirst first four four round round keys. keys. A new A new round round key keyis generated is generated in each in eachkey schedule key schedule iteration. iteration. The Theblock block,, ki, containki, containss the round the round key required key required for the for i-th the round,i-th round, where where0  i < (T 0 ≤ − m)i < for (T −them key) for schedule the key −3 −3 schedulefunction. function.The most The significant most significant word ki + word3 is circularki+3 is circularshifted right shifted by right three by (i.e. three, S (i.e.,), andS then), and XORed then −1 −1 XORedwith the with word the ki word+ 1. Theki +result 1. The is resultcircular is circularshifted right shifted by right one by(i.e. one, S (i.e.,). Finally,S ). the Finally, result the is resultXORed is i i XORedwith the with least the significant least significant word (ki) word and the (ki) round and the constant round ( constantc ⊕ z j). The (c ⊕ valuez j). Theof c valueis equal of toc is (2 equaln − 1) to⊕ (23,n which− 1) ⊕is a3, string which of is (n a − string 2) ones of and (n − two2) zeroes ones and on twothe least zeroes significant on the least two bits significant (i.e., c = two 2n – bits 4 = i i (i.e.,0xff c…= fc 2n). −The4= constant 0xff . . . fc).sequenceThe constant, z j, is computed sequence, asz j,theis computedi-th bit of asthe the zj isequence-th bit of the(fromzj sequencethe most (fromsignific theant most to least significant significant) to least, where significant), i is computed where byi is (i computed− m) mod 62 by for (i −m m≤ i) ≤ mod T − 1 62 and for j mis associated≤ i ≤ T − 1with and eachj is associated configuration with, as each shown configuration, in Table 1. as shown in Table1. For SIMON32/64 with with four four key key words words ( (mm = 4), 4), the the cc constantconstant,, and and the the round round constant sequence sequence,, zjj, round keys are generated by Equation (2).(2).

ki + m = c ⊕ (zj)i ⊕ ki ⊕ (I ⊕ −S1−1) (S−−33 ki + 3 ⊕ ki + 1), (2) ki+m = c ⊕ (zj)i ⊕ ki ⊕ (I ⊕ S )(S ki+3 ⊕ ki+1), (2) where 0 ≤ i ≤ T − m, T = 32, m = 4, c = 216 – 4, and z0 = 16 where01100111000011010100100010111110110011100001101010010001011111 0 ≤ i ≤ T − m, T = 32, m = 4, c = 2 − 4, and z0 = 011001110000110101001000101111101100111 (represented in little-endian). 00001101010010001011111 (represented in little-endian).

Sensors 2019, 19, 913 9 of 28 Sensors 2018, 18, x FOR PEER REVIEW 9 of 28

Figure 4. SIMON four four-word-word k keyey schedule schedule.. 3.2. Scalar Design 3.2. Scalar Design This section details the scalar design implementation of the SIMON cipher. Firstly, it discusses the This section details the scalar design implementation of the SIMON cipher. Firstly, it discusses FPGA implementation of a basic scalar SIMON algorithm: the one-round implementation (iterative). the FPGA implementation of a basic scalar SIMON algorithm: the one-round implementation Secondly, multiple hardware rounds are instantiated in the implementation to optimize area, power, (iterative). Secondly, multiple hardware rounds are instantiated in the implementation to optimize and energy. A list of notations used in this section and the following sections is shown in Table3. area, power, and energy. A list of notations used in this section and the following sections is shown in Table 3. Table 3. Equation notations.

NotationTable 3. Equation Description notations.

Notation r Number of roundsDescription implemented in hardware R Number of iterations/rounds to encrypt one block r TcycleNumberCycle of timerounds (i.e., implemented clock period) in hardware R TblockNumberTime of toiterations/rounds encrypt one block to encrypt one block Setup cycles to load input plaintext and output Cidle Tcycle Cycle ciphertexttime (i.e., clock period) CB Number of cycles to encrypt one block Tblock Time to encrypt one block

Cidle Setup cycles to load input plaintext and output ciphertext 3.2.1. Basic Scalar Design for One Round (Iterative) CB Number of cycles to encrypt one block In the basic scalar design, one hardware round of the encryption unit is implemented as 3.2.1.combinational Basic Scalar logic Design connected for One to aRound single (Iterative) register and supplied with the proper round key. In the first clock cycle, the plaintext block is loaded into the register to perform the first cipher round. The result is thenIn fedthe back basic to scalar the circuit design, through one hardwarethe register. round This process of the encryption is repeated for unitT isclock implemented cycles, where as Tcombinationalis the number logic of cipher connected rounds, to a as single stated register in Table and2. Thesupplied ciphertext with blockthe proper is stored round in the key. register. In the Thisfirst clock design cycle, has twothe plainte main featuresxt block as is follows: loaded into the register to perform the first cipher round. The result is then fed back to the circuit through the register. This process is repeated for T clock cycles, 1.where Only T is onethe blocknumber cipher of cipher is encrypted rounds at, as a time.stated in Table 2. The ciphertext block is stored in the register.2. The This number design of ha two cycles main required features to as encrypt follows a: single block cipher is equal to the number of cipher rounds (i.e., T) plus the number of cycles to load plaintext and output ciphertext (i.e., 1. Only one block cipher is encrypted at a time. C ). 2. Theidle number of clock cycles required to encrypt a single block cipher is equal to the number of Incipher the proposed rounds (i.e. design,, T) plus the basicthe number FPGA scalarof cycles design to load implementation plaintext and of output SIMON32/64 ciphertext is described (i.e., in FigureCidle).5 . Three main blocks are considered: the control logic, the round logic, and the key generationIn the block. proposed design, the basic FPGA scalar design implementation of SIMON32/64 is described in Figure 5. Three main blocks are considered: the control logic, the round logic, and the key generation block. • The control logic block is responsible for managing the external and internal activities of the system. It controls three main registers: key, round counter, and X register. Additionally, it organizes the sequence order of these activities’ functionalities through a finite-state machine (FSM). The encryption process begins with the assertion of a start signal. The plaintext is then loaded into the X register and the round counter is initialized to zero. The value of Key (master key) is also stored in specific sub-key registers in order to perform the key generation process. In the following cycles, the control block assigns sub-key and round counter values to the key

Sensors 2019, 19, 913 10 of 28

• The control logic block is responsible for managing the external and internal activities of the system. It controls three main registers: key, round counter, and X register. Additionally, it organizes the sequence order of these activities’ functionalities through a finite-state machine (FSM). The encryption process begins with the assertion of a start signal. The plaintext is then loaded into the X register and the round counter is initialized to zero. The value of Key (master key) is also stored in specific sub-key registers in order to perform the key generation process. In the following cycles, the control block assigns sub-key and round counter values to the key Sensorsgeneration 2018, 18, x FOR block PEER and REVIEW Xin to the round logic block. Once the counter reaches its maximum10 value of 28 (the number of corresponding rounds has finished), the done signal is asserted by the control generation block and Xin to the round logic block. Once the counter reaches its maximum value block to state that the encryption process is complete. (the number of corresponding rounds has finished), the done signal is asserted by the control • Theblock key to generationstate that the block encryption generates process the sub-key is complete. required for the current round. • TheThe roundkey generation block performs block generate one hardwares the sub round-key operationrequired for and the updates current the round. X register. In the last • clockThe round cycle, block the ciphertext performs value one hardware is saved inround the X operation register. and updates the X register. In the last clock cycle, the ciphertext value is saved in the X register.

Figure 5. Basic SIMON field-programmable gate array (FPGA) design. Figure 5. Basic SIMON field-programmable gate array (FPGA) design. 3.2.2. Scalar Design with Multiple Rounds (Loop Unrolling) 3.2.2. Scalar Design with Multiple Rounds (Loop Unrolling) In the scalar design with multiple rounds, combinational logic is used to implement multiple hardwareIn the rounds scalar insteaddesign ofwith one multiple round, as rounds, in the basiccombinational design. The logic loop is of used the basicto implement design is unrolledmultiple tohar implementdware roundsr rounds. instead If ofr is one equal round the, numberas in the ofbasic rounds design. (i.e., TheT), loop a full of loop the basic unrolling design design is unrolled is the result.to implement However, r rounds. if r is less If r thanis equal the maximumthe number number of rounds of rounds,(i.e., T), partiala full loop loop unrolling unrolling design is the result.is the Theresult. key However, schedules if function r is less than is unrolled the maximum in the same numb manner.er of rounds, Hence, partial the number loop ofunrolling iterations/rounds is the result.R requiredThe key schedules to encrypt function one block is ofunrolled data decreases in the same by factor manner. of r .Hence,R is expressed the number by Equation of iterations/rounds (3). R required to encrypt one block of data decreases by factor of r. R is expressed by Equation (3). RR= = TT/r/r, (3 (3)) wherewhere rr == 1, 2, 4, 8, 16,16, or 32.32. The number of clock cycles required toto encryptencrypt aa singlesingle blockblock ((CCBB)) isis obtained by Equation (4).(4).

CB = R + Cidle. (4) CB = R + Cidle. (4) Figure 6 illustrates the design with two hardware rounds implemented into the SIMON cipher. The mainFigure differences6 illustrates between the design the two with-round two hardware design and rounds the basic implemented (iterative) intodesign the are SIMON as follows: cipher. The main differences between the two-round design and the basic (iterative) design are as follows: • There are two hardware rounds in the two-round design, Roundi + 0 and Roundi + 1, which are • Thereexecut areed s twoimultaneously. hardwarerounds in the two-round design, Roundi+0 and Roundi+1, which are • executedThere is a simultaneously. smaller counter in the two-round design: a 4-bit counter is required; in general, 2j- round design requires a (5-j)-bit counter. • Dataflow for each round starts from the X-register to Roundi + 0, and then to Roundi + 1, returning to the X-register. • Two sub-keys are generated each iteration instead of one sub-key, as in the basic design: sub-Ki + 0 and sub-Ki + 1 are required to feed Roundi + 0 and Roundi + 1.

Sensors 2019, 19, 913 11 of 28

• There is a smaller counter in the two-round design: a 4-bit counter is required; in general, 2j-round design requires a (5-j)-bit counter.

• Dataflow for each round starts from the X-register to Roundi+0, and then to Roundi+1, returning to the X-register.

• Two sub-keys are generated each iteration instead of one sub-key, as in the basic design: sub-Ki+0 Sensors 2018and, 18, sub-Kx FORi+1 PEERare REVIEW required to feed Roundi+0 and Roundi+1. 11 of 28

Figure 6. SIMON FPGA design with two rounds. Figure 6. SIMON FPGA design with two rounds. 3.3. Pipelined Design 3.3. Pipelined Design The 32-round (full loop unrolling) implementation of SIMON32/64 is extended to pipelined designThe 32 by-round inserting (full registers loop unrolling) between the implementation unrolled rounds. of The SIMON32/64 pipelined design, is extended performing to pipelinedmore designthan by one inserting task at registers a time, improves between throughput. the unrolled The rounds. pipeline The design pipelined is better design suited, performing to process amore than continuous one task streamat a time of data., improve The differences throughput. between The the basic pipeline design design (iterative) is andbetter the suited pipelined to design process a is that new blocks of plaintext can be fed into the pipeline of each clock cycle, while, in the basic continuous stream of data. The difference between the basic design (iterative) and the pipelined design, blocks are fed after (T + 2) clock cycles. T is the number of cipher rounds and two cycles are design is that new blocks of plaintext can be fed into the pipeline of each clock cycle, while, in the required to load in plaintext and output ciphertext. The pipelined design of SIMON32/64 encryption basicis design shown, inblocks Figure are7. Itfed consists after (T of 32+ 2) stages clock with cycles one. roundT is the implemented number of incipher each stage,rounds and and registers two cycles are requiredare instantiated to load between in plaintext stages. In and each output clock cycle, ciphertext a new. The plaintext pipelined can be design fed tothe of first SIMON32/64 stage. encryptionKey expansion is show functionalityn in Figure is7. implementedIt consists of in 32 the stages same with pipelined one methodround implemented by instantiating in registers each stage, and betweenregisters sub-key are instantiated generation between functions stages. to feed theIn each appropriate clock cycle, sub-key a tonew the plaintext round function. can be Flavors fed to the first stage.of pipelined Key expansion designs are functionality implemented is by implemented varying the number in the same of rounds pipel (i.e.,ined 1, method 2, 4, 8, 16, by and instantiating 32) in registerseach between pipelined sub stage.-key When generation the number functi ofons rounds to feed per stagethe appropriate doubles, the sub number-key to of the stages round is halved. function. FlavorsFigure of pipelined8 shows the designs implementation are implemented of two rounds by varying per stage. the number of rounds (i.e., 1, 2, 4, 8, 16, and 32) in each pipelined stage. When the number of rounds per stage doubles, the number of stages is halved. Figure 8 shows the implementation of two rounds per stage.

4. Results In this section, results for scalar and pipelined design implementations are presented and summarized. Implementation results include the following performance metrics: number of LEs, maximum frequency, and power and energy consumption.

Sensors 2019, 19, 913 12 of 28 Sensors 2018, 18, x FOR PEER REVIEW 12 of 28

Sensors 2018, 18, x FOR PEER REVIEW 13 of 28 FigureFigure 7. Implementation 7. Implementation of of oneone roundround per per stage stage (pipelined (pipelined design) design)..

FigureFigure 8. Implementation 8. Implementation of of twotwo rounds per per stage stage (pipelined (pipelined design) design)..

Results are illustrated in tables and graphs. Tables include the exact values, while graphs only provide general trends of metrics. Metrics are normalized with respect to the one-round scalar design. The following notations are used to illustrate the results clearly:

• Sr represents FPGA scalar implementation with r hardware rounds, where r = {1, 2, 4, 8, 16, 32}. • SPr represents FPGA pipelined implementation with r hardware rounds per stage. The number of pipeline stages is equal to (32/r). As an example, SP4 has four implemented rounds per stage and has eight stages. Table 4 lists the other notations used in this section.

Table 4. Results notations.

Notation Description f Frequency of the design

LE Resource utilization of the design

P Power consumption of the design

E Energy of the design

4.1. Scalar Implementation Results This section presents the performance metric results for FPGA scalar design implementations illustrated in Subsection 3.2. Results are summarized in Table 5 and Figures 9–12.

Sensors 2019, 19, 913 13 of 28

4. Results In this section, results for scalar and pipelined design implementations are presented and summarized. Implementation results include the following performance metrics: number of LEs, maximum frequency, and power and energy consumption. Results are illustrated in tables and graphs. Tables include the exact values, while graphs only provide general trends of metrics. Metrics are normalized with respect to the one-round scalar design. The following notations are used to illustrate the results clearly:

• Sr represents FPGA scalar implementation with r hardware rounds, where r = {1, 2, 4, 8, 16, 32}. • SPr represents FPGA pipelined implementation with r hardware rounds per stage. The number of pipeline stages is equal to (32/r). As an example, SP4 has four implemented rounds per stage and has eight stages.

Table4 lists the other notations used in this section.

Table 4. Results notations.

Notation Description f Frequency of the design LE Resource utilization of the design P Power consumption of the design E Energy of the design

4.1. Scalar Implementation Results This section presents the performance metric results for FPGA scalar design implementations illustrated in Section 3.2. Results are summarized in Table5 and Figures9–12.

Table 5. Summary of the scalar implementation results. LE—logic element.

Scalar Implementation Fmax (MHz) LEs Power (mW) Energy (pJ/block)

S1 66. 2 210 1.68 2947 S2 66.19 294 2.9 2755 S4 56.42 479 4.02 2211 S8 53.11 858 9.16 3204.6 S16 46.41 2056 19.6 4900 S32 24.02 3469 22.64 4528

4.1.1. Frequency The maximum frequency versus number of rounds implemented for scalar designs is shown in Figure9. Doubling the number of rounds leads to a decrease in the maximum frequency value. Frequency decreases by an average of 10 MHz, which is 16% of the S1 frequency. This reduction in frequency is due to the increase in the length of timing paths as more rounds are implemented in one cycle. From the values in Table5 and Figure9, the frequency trend ( f ) is modeled using Equation (5), with an average model error of 10%.

f (S2k) = f (Sk) − 0.16 f (S1), (5) where f (S1) = 66 MHz, Sr = S2k, and k = [1, 2, 4, 8, 16]. Sensors 2018, 18, x FOR PEER REVIEW 14 of 28

Table 5. Summary of the scalar implementation results. LE—logic element.

Scalar implementation Fmax (MHz) LEs Power (mW) Energy (pJ/block)

S1 66. 2 210 1.68 2947

S2 66.19 294 2.9 2755

S4 56.42 479 4.02 2211

S8 53.11 858 9.16 3204.6

S16 46.41 2056 19.6 4900

S32 24.02 3469 22.64 4528

4.1.1. Frequency The maximum frequency versus number of rounds implemented for scalar designs is shown in Figure 9. Doubling the number of rounds leads to a decrease in the maximum frequency value. Frequency decreases by an average of 10 MHz, which is 16% of the S1 frequency. This reduction in frequency is due to the increase in the length of timing paths as more rounds are implemented in one cycle. From the values in Table 5 and Figure 9, the frequency trend (f) is modeled using Equation (5), with an average model error of 10%.

Sensors 2019, 19, 913 f (S2k) = f (Sk) − 0.16 f (S1), 14 of(5) 28 where f (S1) = 66 MHz, Sr = S2k, and k = [1, 2, 4, 8, 16]. Fmax (MHz) 80

60

40

20

0 S_1 S_2 S_4 S_8 S_16 S_32

Figure 9. Frequency trend versus number of rounds for scalar implementations.implementations. 4.1.2. Resource utilization 4.1.2. Resource utilization Table6 illustrates the resource utilization based on the type of LUTs, which can be 4-input Table 6 illustrates the resource utilization based on the type of LUTs, which can be 4-input functions, 3-input functions, 2-or-less input functions, and register only. Also, in this section, we functions, 3-input functions, 2-or-less input functions, and register only. Also, in this section, we analysis of resources in terms of combinational LEs and register LEs. Register LEs include present analysis of resources in terms of combinational LEs and register LEs. Register LEs include “register only” LEs and “combinational with a register” LEs. The latter analysis facilitates better “register only” LEs and “combinational with a register” LEs. The latter analysis facilitates better understanding of power results. understanding of power results. Table 6. Summary of the scalar implementation look-up table (LUT) results. Table 6. Summary of the scalar implementation look-up table (LUT) results. Scalar Implementation LEs LUT-4 LUT-3 LUT-2 or less LUT RegOnly Scalar implementation LEs LUT-4 LUT-3 LUT-2 or less LUT RegOnly S1 210 50 108 52 0 S1 S2 210294 50 121 108 103 52 68 2 0 S4 479 228 95 121 35 S2 294 121 103 68 2 S8 858 554 191 98 15 S4 S16 4792056 228 1466 95 434 121 140 16 35 S32 3469 2410 718 324 17 S8 858 554 191 98 15

S16 2056 1466 434 140 16 Figure 10 shows the number of LEs versus the number of rounds implemented in the hardware for scalar designs.S32 The number of3469 LEs (which2410 indicates 718 the utilized resources)324 increases by17 an average of 78% when the number of rounds is doubled. To understand the source of this increase, the different types of LEs are identified, including combinational LEs and register LEs. Figure 10 also plots the LE components. Clearly, combinational LEs exhibit growth, while register LEs are constant, with respect to an increase in the number of implemented rounds. The growth component of LEs increases by 102% when the number of rounds is doubled. This is because the synthesis tool minimizes and shares the logic cones efficiently [13] and reuses the combinational logic. The number register LEs, however, is constant for each hardware implementation, except for the counter register, as it decreases by one bit when doubling the number of implemented rounds. Hence, the trend for the number of LEs (LE) is modeled using Equation (6) with an average model error of 9.4%.

LE (S ) = LE (S (Comb)) + LE (S (Register)) 2k 2k 1 (6) = 2.02 LE (Sk (Comb)) + LE (S1 (Register)), where LE (S1 (Comb)) = 105, LE (S1 (Register)) = 105, Sr = S2k, and k = [1, 2, 4, 8, 16].

4.1.3. Power Figure 11 describes the total power dissipation versus the number of implemented hardware rounds. Power increases by an average of 73% when the number of implemented rounds is doubled. Based on the type of circuit, there are two main components of power: sequential and combinational. Figure 11 also illustrates the contributions of combinational and sequential power. The following conclusions are drawn: Sensors 2019, 19, 913 15 of 28

• Combinational power (which estimates the combinational logic power) increases by 105% when the number of rounds is doubled. The increase in the combinational power is due to an increase in implemented logic, as well as an increase in glitch power [35]. • SensorsSequential 2018, 18, x FOR power PEER (which REVIEW estimates the sequential logic, i.e., register, control block) increases15 of by28 an average of 30%. • TheFigure main 10 reasonshows the for thenumber total of power LEs versus increasing the number is the combinational of rounds implemented power. in the hardware •for scalarThe power designs. trend The showsnumber a slightof LEs increase (which indicates for S32 power the utilized when compared resources) to increases the overall by an increasing average of 78%average when forthe other number scalar of rounds designs. is This doubled. is due To to theunderstand reduction the in source combinational of this increase, power growth the different at this typespoint, of LEs as are S32 identified,is optimized including by the synthesis combinational tool to LEs reduce and wiring,register thereby LEs. Figure reducing 10 also routing plots power.the LE components. Clearly, combinational LEs exhibit growth, while register LEs are constant, with respect The power trend (p) is modeled according to Equation (7) with an average model error of 13.2%. to an increase in the number of implemented rounds. The growth component of LEs increases by 102% when the number of rounds is doubled. This is because the synthesis tool minimizes and shares P (S2k) = P (S2k (Comb)) + P (S2k (Seq)) the logic cones efficiently [13] and reuses the combinational logic. The number register LEs, however,(7) = 2.05 P (Sk (Comb)) + 1.30 P (Sk (Seq)), is constant for each hardware implementation, except for the counter register, as it decreases by one wherebit whenP (S doubling1 (Comb)) the = 0.6number mW, Pof(S implemented1 (Seq)) = 1.1 rounds. mW, Sr =Hence, S2k, and the k trend = [1, 2,for 4, the 8, 16].number of LEs (LE) is modeled using Equation (6) with an average model error of 9.4%.

Comb. Reg. Total LEs

4000 3500 3000 2500 2000 1500 1000

Sensors 2018, 18, x FOR PEER REVIEW500 16 of 28 0 S_1 S_2 S_4 S_8 S_16 S_32 P (S2k) = P (S2k (Comb)) + P (S2k (Seq)) (7) Figure 10. ResourceResource utilization utilization= 2.05 trend P trend (S andk (Comb)) and its components its + components1.30 P ( S versuk (Seq)) versuss number, number of rounds of rounds for scalar for scalar implementations. whereimplementations P (S1 (Comb)) .= 0.6 mW, P (S1 (Seq)) = 1.1 mW, Sr = S2k, and k = [1, 2, 4, 8, 16].

mW Total Power Comb. Seq. LE (S2k) = LE (S2k (Comb)) + LE (S1 (Register)) 25 (6) = 2.02 LE (Sk (Comb)) + LE (S1 (Register)), 20 where LE (S1 (Comb)) = 105, LE (S1 (Register)) = 105, Sr = S2k, and k = [1, 2, 4, 8, 16]. 15 4.1.3. Power 10 Figure 11 describes the total power dissipation versus the number of implemented hardware rounds. Power increases by5 an average of 73% when the number of implemented rounds is doubled. Based on the type of circuit, there are two main components of power: sequential and combinational. Figure 11 also illustrates the0 contributions of combinational and sequential power. The following conclusions are drawn: S_1 S_2 S_4 S_8 S_16 S_32

• Combinational power (which estimates the combinational logic power) increases by 105% when Figure 11. The power trend and its components for scalar implementations.implementations. the number of rounds is doubled. The increase in the combinational power is due to an increase in implemented logic, as well as an increase in glitch power [35]. 4.1.4. Energy • Sequential power (which estimates the sequential logic, i.e., register, control block) increases by “Energyan average per of block” 30%. is computed by multiplying “average power” by the “time to process one block”• The, which main is reason expressed for the in totalEquation power (8) increasing. is the combinational power. • The power trend shows a slight increase for S32 power when compared to the overall increasing Tblock = CB × Tcycle average for other scalar designs. This is due to the reduction in combinational power growth at = (R + Cidle) × Tcycle (8) this point, as S32 is optimized by the synthesis tool to reduce wiring, thereby reducing routing

power. = (T/r + 2) × Tcycle.

The twopower cycles trend in (Equationp) is modeled (8) are according required to to Equation load in plaintext (7) with anand average output model ciphertext error (i.e., of 13.2% Cidle).. Energy per block versus number of implemented hardware rounds is illustrated in Figure 12. Doubling the number of rounds decreases energy from S1 to S4. S4 has the least energy dissipation (optimum design). Energy increases from S4 until it reaches its maximum value at S16 and slightly decreases again at S32. To understand the behavior of the energy curve, the following facts are required: • Increasing the number of implemented rounds increases combinational power and (to a lesser extent) sequential power, as shown in Figure 11. • Increasing the number of implemented rounds, r, decreases the time to process one block (Tblock). The synthesis tool performs better routing optimization in some implementations, resulting in noticeably less routing power. Figure 12 shows the contribution of energy components (combinational and sequential) to the total energy. Based on the aforementioned facts, the following is found:

Sensors 2019, 19, 913 16 of 28

4.1.4. Energy “Energy per block” is computed by multiplying “average power” by the “time to process one block”, which is expressed in Equation (8).

Tblock = CB × Tcycle = (R + Cidle) × Tcycle (8) = (T/r + 2) × Tcycle.

The two cycles in Equation (8) are required to load in plaintext and output ciphertext (i.e., Cidle). Energy per block versus number of implemented hardware rounds is illustrated in Figure 12. Doubling the number of rounds decreases energy from S1 to S4.S4 has the least energy dissipation (optimum design). Energy increases from S4 until it reaches its maximum value at S16 and slightly decreases again at S32. To understand the behavior of the energy curve, the following facts are required: • Increasing the number of implemented rounds increases combinational power and (to a lesser extent) sequential power, as shown in Figure 11.

• Increasing the number of implemented rounds, r, decreases the time to process one block (Tblock). The synthesis tool performs better routing optimization in some implementations, resulting in noticeably less routing power. Figure 12 shows the contribution of energy components (combinational and sequential) to the total energy. Based on the aforementioned facts, the following is found:

• Combinational energy in general increases 42% when the number of implemented rounds doubles. Routing power in S16 is not optimized, as compared to S32 and S8. • Sequential energy slightly decays as the number of implemented rounds is doubled. • Since energy is estimated by multiplying power with time to encrypt the block, and power and time exhibit different behavior with respect to r, the energy trend has a V-like curve, as seen in Figure 12.

• The highest energy consumption value at S16 is due to the high combinational energy from the high routing power/energy with additional glitch power/energy [36]. The drop of energy at S32 is due to the drop in combinational energy, as the tools optimize better for larger logic.

The energy trend (E) is modeled using Equation (9), with an average model error of 11.7%.

E (S ) = E (S (Comb)) + E (S (Seq)) 2k 2k 2k (9) = 1.42 E (Sk (Comb)) + 0.73 E (Sk (Seq)), where E (S1 (Comb)) = 850 pJ, E (S1 (Seq)) = 1990 pJ, Sr = S2k, and k = [1, 2, 4, 8, 16].

4.2. Pipelined Implementation Results This section discusses the performance metric results for the FPGA pipelined design implementations discussed in Section 3.3. Results are summarized in Table7 and Figures 13–16.

Table 7. Summary of the pipelined implementation results.

Pipelined Implementation Fmax (MHz) LEs Power (mW) Energy (pJ/block)

SP1 66.2 2057 13.88 694.2 SP2 66.2 2008 13.69 684.4 SP4 57.32 2891 17.58 879 SP8 36.39 2918 27.60 1379.8 SP16 32.68 3338 23.32 1166 SP32 23.79 3361 23.76 1187.8 Sensors 2019, 19, 913 17 of 28 Sensors 2018, 18, x FOR PEER REVIEW 17 of 28

pJ/block Total Energy Comb. Seq. 6000

5000

4000

3000

2000

1000 Sensors 2018, 18, x FOR PEER REVIEW 18 of 28 0 S_1 S_2 S_4 S_8 S_16 S_32 the number of rounds implemented in one pipeline stage increases the critical timing paths and decreases frequency. The frequency trend (f) is modeled according to Equation (10) with an average Figure 12. Energy trend and its components for scalar implementations. model error of 9.8%Figure. 12. Energy trend and its components for scalar implementations.

• Combinational energyFmax in (MHz) general increases 42% when the number of implemented rounds doubles. Routing power70 in S16 is not optimized, as compared to S32 and S8. • Sequential energy slightly60 decays as the number of implemented rounds is doubled. • Since energy is estimated by multiplying power with time to encrypt the block, and power and 50 time exhibit different behavior with respect to r, the energy trend has a V-like curve, as seen in Figure 12. 40 • The highest energy consumption30 value at S16 is due to the high combinational energy from the high routing power/energy with additional glitch power/energy [36]. The drop of energy at S32 20 is due to the drop in combinational energy, as the tools optimize better for larger logic. 10 The energy trend (E) is modeled using Equation (9), with an average model error of 11.7%. 0 E (S2kSP_1) = E (SSP_22k (Comb))SP_4 +SP_8 E (S2kSP_16 (Seq))SP_32 (9) = 1.42 E (Sk (Comb)) + 0.73 E (Sk (Seq)), Figure 13. Frequency trend versus number of rounds for pipelined implementations.implementations. where E (S1 (Comb)) = 850 pJ, E (S1 (Seq)) = 1990 pJ, Sr = S2k, and k = [1, 2, 4, 8, 16]. 4.2.1. Frequency f (SP2k) = f (SPk) − 0.16 f (SP1), (10) 4.2. PipelinedMaximum Implementation frequency versus Results number of rounds implemented in a pipelined stage is shown in where f (SP1) = 66 MHz, SPr = SP2k, and k = [1, 2, 4, 8, 16]. FigureThis 13. The section trend discusses is similar theto the performance case of the scalar metric design, results where for doublingthe FPGA the number pipelined of rounds design results4.2.2.implementations Resource in a drop U in tilizationdiscussed frequency in of Sub 10 MHz,section which 3.3. Resu is equivalentlts are summarized to 16% of the in SPTable1 frequency. 7 and Figures Doubling 13–16. the number of rounds implemented in one pipeline stage increases the critical timing paths and decreases The resource utilization based on the type of LUTs is illustrated in Table 8. In this section, we frequency. The frequencyTable trend 7. (Summaryf ) is modeled of the according pipelined implementation to Equation (10) results with. an average model error ofpresent 9.8%. analysis of resources in terms of combinational LEs and register LEs. The latter analysis facilitatesPipelined better understanding of power results as stated before. f (SP2k) = f (SPk) − 0.16 f (SP1), (10) implementation Fmax (MHz) LEs Power (mW) Energy (pJ/block) where f (SP1) = 66 MHz,Table SP 8.r = Summary SP2k, and of kthe = pipelined [1, 2, 4, 8, implementation 16]. LUT results. SP1 66.2 2057 13.88 694.2

4.2.2.Pipelined ResourceSP2 implementation Utilization 66.2 LEs LUT2008-4 LUT-3 LUT13.69-2 or less LUT RegOnly684.4

TheSP resource4 SP1 utilization 57.32 based 2057 on the type14002891 of LUTs 83 is illustrated17.58477 in Table 8. In this section,97879 we presentSP analysis8 SP of2 resources36.39 in terms2008 of combinational13792918 263 LEs and register27.60349 LEs. The latter1379.817 analysis facilitates better understanding of power results as stated before. SP16 SP4 32.68 2891 20013338 461 23.32347 116682

SP32 SP8 23.79 2918 20313361 562 23.76277 1187.848

SP16 3338 2264 766 282 26 4.2.1. Frequency SP32 3361 2422 591 325 23 Maximum frequency versus number of rounds implemented in a pipelined stage is shown in FigureFigure 13. The 14 trendshows is the similar number to the of caseLEs versusof the scalar the number design ,of where rounds doubling per pipeli thene number stage. Thereof rounds is a slightresult sincrease in a drop in in LEs frequency (an average of 10 of MHz 15%), which when isdoubling equivalent the to number 16% of of the rounds SP1 frequency., except for Doubling SP2. To better understand this trend, Figure 14 also clarifies the source of the number of LEs as combinational- only LEs and register LEs. The number of combinational LEs increases when doubling the number of rounds per stage (similar to the scalar design), while the total number of register LEs decreases. This can be justified by the fact that total number of register LEs increases with the number of stages, as more registers are inserted between stages. SP1 has the largest number of registers, as it has the largest number of stages and pipeline partial results every round [13]. Generally, combinational LEs grow by an average of 40%. The total number of register LEs decays by an average of 42%. Clearly, from Table 7 and Figure 14, the number of LEs (LE) for SP2k is modeled using Equation (11) with an average error of 15.4%.

LE (SP2k) = LE (SP2k (Comb)) + LE (SP2k (Register)) (11) = 1.4 LE (SPk (Comb)) + 0.58 LE (SPk (Register)),

Sensors 2019, 19, 913 18 of 28

Table 8. Summary of the pipelined implementation LUT results.

Pipelined Implementation LEs LUT-4 LUT-3 LUT-2 or less LUT RegOnly

SP1 2057 1400 83 477 97 SP2 2008 1379 263 349 17 SP4 2891 2001 461 347 82 SP8 2918 2031 562 277 48 SP16 3338 2264 766 282 26 SP32 3361 2422 591 325 23

Figure 14 shows the number of LEs versus the number of rounds per pipeline stage. There is a slight increase in LEs (an average of 15%) when doubling the number of rounds, except for SP2. To better understand this trend, Figure 14 also clarifies the source of the number of LEs as combinational-only LEs and register LEs. The number of combinational LEs increases when doubling the number of rounds per stage (similar to the scalar design), while the total number of register LEs decreases. This can be justified by the fact that total number of register LEs increases with the number of stages, as more registers are inserted between stages. SP1 has the largest number of registers, as it has the largest number of stages and pipeline partial results every round [13]. Generally, combinational LEs grow by an average of 40%. The total number of register LEs decays by an average of 42%. Clearly, from Table7 and Figure 14, the number of LEs ( LE) for SP2k is modeled using Equation (11) with an average error of 15.4%.

LE (SP ) = LE (SP (Comb)) + LE (SP (Register)) 2k 2k 2k (11) = 1.4 LE (SPk (Comb)) + 0.58 LE (SPk (Register)), Sensors 2018, 18, x FOR PEER REVIEW 19 of 28 where LE (SP1 (Comb)) = 800, LE (SP1 (Register)) = 1594, SPr = SP2k, and k = [1, 2, 4, 8, 16]. where LE (SP1 (Comb)) = 800, LE (SP1 (Register)) = 1594, SPr = SP2k, and k = [1, 2, 4, 8, 16].

Comb. Reg. Total LEs

4000 3500 3000 2500 2000 1500 1000 500 0 SP_1 SP_2 SP_4 SP_8 SP_16 SP_32

Figure 14. 14. ResourceResource utilization utilization trend trend and its and components its components versus numberversus number of rounds of for rounds pipelined for pipelinedimplementations implementations.. 4.2.3. Power 4.2.3. Power The power trend across pipelined implementations is shown in Figure 15. Initially, power slightly The power trend across pipelined implementations is shown in Figure 15. Initially, power decreases to its minimum value at SP2, and then increases until reaching its maximum value at slightly decreases to its minimum value at SP2, and then increases until reaching its maximum value SP8 before decreasing to SP16 and increasing to SP32. Overall, total power increases by an average at SP8 before decreasing to SP16 and increasing to SP32. Overall, total power increases by an average of 14% when doubling number of implemented rounds. Figure 15 also plots power components of 14% when doubling number of implemented rounds. Figure 15 also plots power components (combinational and sequential power). The following is observed: (combinational and sequential power). The following is observed: • As the number of rounds is doubled, combinational power grows by an average of 35%. • As the number of rounds is doubled, combinational power grows by an average of 35%. • As the number of rounds is doubled, sequential power decays by an average of 20% and the number of register LEs decreases.

The reason for the drop in the power at SP16 is due to combinational power, which is higher for SP8 because of the routing power. Combinational power reduction at SP16 relates to how the synthesis tool works in order to reduce connections and optimize the algorithm. From Table 7 and Figure 15, the power trend (푃) for SP2k is modeled using Equation (12) with an average model error of 14%.

P (SP2k) = P (SP2k (Comb)) + P (SP2k (Seq)) (12) = 1.35 P (SPk (Comb)) + 0.8 P (SPk (Seq)),

where P (SP1 (Comb)) = 5 mW, P (SP1 (Seq)) = 10 mW, SPr = SP2k, and k = [1, 2, 4, 8, 16].

mW Total Power Comb. seq.

30

25

20

15

10

5

0 SP_1 SP_2 SP_4 SP_8 SP_16 SP_32 .

Figure 15. The power trend and its components for pipelined implementations

Sensors 2018, 18, x FOR PEER REVIEW 19 of 28

where LE (SP1 (Comb)) = 800, LE (SP1 (Register)) = 1594, SPr = SP2k, and k = [1, 2, 4, 8, 16].

Comb. Reg. Total LEs

4000 3500 3000 2500 2000 1500 1000 500 0 SP_1 SP_2 SP_4 SP_8 SP_16 SP_32

Figure 14. Resource utilization trend and its components versus number of rounds for pipelined implementations.

4.2.3. Power The power trend across pipelined implementations is shown in Figure 15. Initially, power slightly decreases to its minimum value at SP2, and then increases until reaching its maximum value at SP8 before decreasing to SP16 and increasing to SP32. Overall, total power increases by an average Sensorsof 14%2019 when, 19, 913 doubling number of implemented rounds. Figure 15 also plots power component19 of 28s (combinational and sequential power). The following is observed:

•• AsAs thethe numbernumber of rounds is double doubled,d, combinational sequential power power decays grow bys by an an average average of of 20% 35%. and the • numberAs the number of register of LEsrounds decreases. is doubled, sequential power decays by an average of 20% and the number of register LEs decreases. The reason for the drop in the power at SP16 is due to combinational power, which is higher for SP8 becauseThe reason of the for routing the drop power. in the Combinational power at SP16 power is due reduction to combinational at SP16 relates power to, which how the is higher synthesis for toolSP8 because works in of order the routing to reduce power. connections Combinational and optimize power the reduction algorithm. at SP16 relates to how the synthesis

tool Fromworks Table in order7 and to Figure reduce 15 connections, the power and trend optimiz ( P) fore SPthe2k algorithm.is modeled using Equation (12) with an averageFrom model Table error 7 and of 14%.Figure 15, the power trend (푃) for SP2k is modeled using Equation (12) with an average model error of 14%. P (SP2k) = P (SP2k (Comb)) + P (SP2k (Seq)) P (SP2k) = P (SP2k (Comb)) + P (SP2k (Seq)) (12) = 1.35 P (SPk (Comb)) + 0.8 P (SPk (Seq)), (12) = 1.35 P (SPk (Comb)) + 0.8 P (SPk (Seq)), where P (SP1 (Comb)) = 5 mW, P (SP1 (Seq)) = 10 mW, SPr = SP2k, and k = [1, 2, 4, 8, 16]. where P (SP1 (Comb)) = 5 mW, P (SP1 (Seq)) = 10 mW, SPr = SP2k, and k = [1, 2, 4, 8, 16].

mW Total Power Comb. seq.

30

25

20

15

10

5

0 SP_1 SP_2 SP_4 SP_8 SP_16 SP_32 . Figure 15. The power trend and its components for pipelined implementations.

4.2.4. Energy Figure 15. The power trend and its components for pipelined implementations

Energy to encrypt one block is computed by multiplying total power with the time required to encrypt the block. In the case of a full pipelined implementation, time to encrypt is one cycle, because every cycle of the pipeline completes one block. It is assumed that data are continuous (i.e., not intermittent). Energy per block versus number of rounds per pipeline stage is illustrated in Figure 16. The following is observed:

• The energy curve looks the same as the power curve with minimum energy at SP2. Energy increases gradually until reaching SP8, decays at SP16, and then increases to SP32. • To better understand this trend, Figure 16 plots the energy components throughout pipelined implementations, which are combinational and sequential. Combinational energy increases by an average of 50%, while sequential decreases by an average of 15%. The growth in combinational energy is due to the glitch and interconnect power, while the decay in sequential energy is because of the decreasing number of flip-flops as the number of rounds per stage is doubled. The decay of combinational energy at SP16 is due to how the synthesis tool routes the connections and optimizes the design, as stated in Section 4.2.3.

From Figure 16, the energy trend (E) for SP2k is modeled according to Equation (13) with an average model error of 14%.

E (SP ) = E (SP (Comb)) + E (SP (Seq)) 2k 2k 2k (13) = 1.5 E (SPk (Comb)) + 0.85 E (SPk (Seq)), where E (SP1 (Comb)) = 170 pJ, E (SP1 (Seq)) = 520 pJ, SPr = SP2k, and k = [1, 2, 4, 8, 16]. Sensors 2018, 18, x FOR PEER REVIEW 20 of 28

4.2.4. Energy Energy to encrypt one block is computed by multiplying total power with the time required to encrypt the block. In the case of a full pipelined implementation, time to encrypt is one cycle, because every cycle of the pipeline completes one block. It is assumed that data are continuous (i.e., not intermittent). Energy per block versus number of rounds per pipeline stage is illustrated in Figure 16. The following is observed:

• The energy curve looks the same as the power curve with minimum energy at SP2. Energy increases gradually until reaching SP8, decays at SP16, and then increases to SP32. • To better understand this trend, Figure 16 plots the energy components throughout pipelined implementations, which are combinational and sequential. Combinational energy increases by an average of 50%, while sequential decreases by an average of 15%. The growth in combinational energy is due to the glitch and interconnect power, while the decay in sequential energy is because of the decreasing number of flip-flops as the number of rounds per stage is doubled. The decay of combinational energy at SP16 is due to how the synthesis tool routes the connections and optimizes the design, as stated in Section 4.2.3.

From Figure 16, the energy trend (퐸) for SP2k is modeled according to Equation (13) with an average model error of 14%.

E (SP2k) = E (SP2k (Comb)) + E (SP2k (Seq)) (13) Sensors 2019, 19, 913 = 1.5 E (SPk (Comb)) + 0.85 E (SPk (Seq)), 20 of 28

where E (SP1 (Comb)) = 170 pJ, E (SP1 (Seq)) = 520 pJ, SPr = SP2k, and k = [1, 2, 4, 8, 16].

pJ/block Total Energy Comb. Seq.

1600 1400 1200 1000 800 600 400 200 0 SP_1 SP_2 SP_4 SP_8 SP_16 SP_32

Figure 16. Energy trend and its components for pipelined implementations.implementations. 5. Discussion 5. Discussion In this section, the implementation results for scalar and pipelined designs presented in previous In this section, the implementation results for scalar and pipelined designs presented in previous sections are analyzed and discussed in order to draw guidelines and conclusions. Moreover, the best sections are analyzed and discussed in order to draw guidelines and conclusions. Moreover, the best design option or implementation for each performance metric is shown. Firstly, speed/timing and design option or implementation for each performance metric is shown. Firstly, speed/timing and throughput metrics are discussed, followed by power and utilized resources (LEs) and the energy throughput metrics are discussed, followed by power and utilized resources (LEs) and the energy metric. Then, the optimum design is described. Finally, the implementations are discussed from a metric. Then, the optimum design is described. Finally, the implementations are discussed from a security perspective. security perspective. 5.1. Speed and Throughput 5.1. Speed and Throughput The fastest scalar and pipelined implementations are S1,S2, SP1, and SP2. These designs have the lowestThe fastest logic scalar in the and clock pipelined cycle, implementations the smallest period, are S and1, S2, theSP1, fastestand SP2 frequency.. These designs Throughput have the (encrypted-bitslowest logic in the per clock second) cycle, is the measured smallest forperiod, each and scalar the andfaste pipelinedst frequency. implementation, Throughput (encrypted as shown- bits per second) is measured for each scalar and pipelined implementation, as shown in Figure 17. Sensorsin Figure 2018 17, 18., x FOR PEER REVIEW 21 of 28

Throughput 2500

2000

1500

1000

500

0

S_1 S_2 S_4 S_8

S_16 S_32

SP_1 SP_2 SP_4 SP_8 SP_32 SP_16

Figure 17. Throughput (encrypted (encrypted-bits-bits per second).second).

Clearly, the pipelined implementations demonstrate better throughput (12 times higher than Clearly, the pipelined implementations demonstrate better throughput (12 times higher than scalar). The best pipelined implementations are SP1 and SP2, while the best scalar implementations scalar). The best pipelined implementations are SP1 and SP2, while the best scalar implementations are S16 and S8. It is important to highlight that the pipelined design does not realize its full potential are S16 and S8. It is important to highlight that the pipelined design does not realize its full potential unless pipeline stages are full of encrypted blocks and, hence, are appropriate for applications with unless pipeline stages are full of encrypted blocks and, hence, are appropriate for applications with continuous streams of data blocks. continuous streams of data blocks.

5.2. Power and LEs It is obvious that power and resource utilization (number of LEs) trends are related, as average growth with doubling hardware rounds is approximately the same. Figure 18 illustrates the trends for combinational power and combinational LEs of scalar implementations versus number of rounds. Figure 18 also illustrates the trend for sequential power and sequential LEs. Combinational power and number of combinational LEs exhibit the same increasing trend. Moreover, the growth in power (i.e., 2.05) is slightly higher than the growth of combinational logic (i.e., 2.02). Thus, power is not only affected by the number of LEs, but also by interconnects and glitch power [36]. As the number of rounds is doubled, sequential power increases, with no change in sequential logic. This is attributed to an increase in the activity of sequential circuits when more rounds are implemented in the hardware. The following results are observed for scalar implementations as the number of rounds is doubled: • 102% growth in combinational LEs and 105% growth in combinational power. • No change in sequential LEs and a 30% increase in sequential power.

Comb. Power Comb. LEs Seq. Power Seq. LEs 35 30 25 20 15 10 5 0 S_1 S_2 S_4 S_8 S_16 S_32

Figure 18. Combinational power vs. combinational logic elements (Les) and sequential power vs. sequential LEs for scalar implementations.

Sensors 2018, 18, x FOR PEER REVIEW 21 of 28

Throughput 2500

2000

1500

1000

500

0

S_1 S_2 S_4 S_8

S_16 S_32

SP_1 SP_2 SP_4 SP_8 SP_32 SP_16

Figure 17. Throughput (encrypted-bits per second).

Clearly, the pipelined implementations demonstrate better throughput (12 times higher than scalar). The best pipelined implementations are SP1 and SP2, while the best scalar implementations Sensorsare S162019 and, 19 S,8 913. It is important to highlight that the pipelined design does not realize its full potenti21 of 28al unless pipeline stages are full of encrypted blocks and, hence, are appropriate for applications with continuous streams of data blocks. 5.2. Power and LEs 5.2. PowerIt is obvious and LEs that power and resource utilization (number of LEs) trends are related, as average growthIt is with obvious doubling that power hardware and roundsresource is utilization approximately (number the same.of LEs) Figure trends 18 are illustrates related, as the average trends forgrowth combinational with doubling power hardware and combinational rounds is approximately LEs of scalar implementations the same. Figure versus 18 illustrates number the of rounds.trends Figurefor combinational 18 also illustrates power and the trendcombinational for sequential LEs of power scalar andimplementations sequential LEs. versus Combinational number of rounds. power andFigure number 18 also of illustrates combinational the trend LEs exhibit for sequential the same power increasing and sequential trend. Moreover, LEs. Combinational the growth in power (i.e.,and number 2.05) is slightly of combinational higher than LEs the exhibit growth the of same combinational increasing logic trend. (i.e., Moreover, 2.02). Thus, the powergrowth is in not power only affected(i.e., 2.05) by isthe slightly number higher of LEs, than but the also growth by interconnects of combinational and glitchlogic (i.e., power 2.02). [36 Thus,]. power is not only affectedAs theby th numbere number ofrounds of LEs, isbut doubled, also by sequentialinterconnects power and increases,glitch power with [36 no]. change in sequential logic.As This the isnumber attributed of rounds to an increaseis doubled, in thesequential activity power of sequential increase circuitss, with no when change more in rounds sequential are implementedlogic. This is attributed in the hardware. to an increase The following in the resultsactivity are of observedsequential for circuits scalar implementationswhen more rounds as theare numberimplemented of rounds in the is hardware. doubled: The following results are observed for scalar implementations as the •number102% of growthrounds inis combinationaldoubled: LEs and 105% growth in combinational power. • No102% change growth in sequentialin combinational LEs and LEs a 30%and increase105% growth in sequential in combinational power. power. • No change in sequential LEs and a 30% increase in sequential power.

Comb. Power Comb. LEs Seq. Power Seq. LEs 35 30 25 20 15 10 5 0 S_1 S_2 S_4 S_8 S_16 S_32

Figure 18.18. Combinational power vs. combinational combinational logic logic elements elements ( (Les)Les) and and sequential power power vs. vs. sequential LEs for scalarscalar implementations.implementations. The combinational power and combinational LEs of pipelined implementations versus the number of rounds is shown in Figure 19, in addition to the sequential power and sequential LEs. The combinational power and combinational LEs trends increase in a similar manner; however, power growth is lower except for the case of SP8. Thus, power is not only affected by resource utilization, but also by interconnects and glitch power, as stated above. As the number of rounds is doubled, sequential power and sequential logic decrease, but the power decays less. The following results are observed for pipelined implementations as the number of rounds is doubled:

• 40% growth in combinational LEs and 35% growth in combinational power. • 42% decay in sequential LEs and 20% decay in sequential power.

Clearly, there is a correlation between resources (represented by LEs) and power, since dynamic power is proportional to design area [13]. Although there is a relationship between power and the number of LEs, there are other factors contributing related to the following:

• Interconnect power and glitch power [36]. • The way the synthesis tool routes the connections and optimizes the designs.

When comparing scalar to pipelined implementations with respect to the number of LEs and power consumption, it is found that scalar designs are best in terms of power and resource utilization. Sensors 2018, 18, x FOR PEER REVIEW 22 of 28

The combinational power and combinational LEs of pipelined implementations versus the number of rounds is shown in Figure 19, in addition to the sequential power and sequential LEs. The combinational power and combinational LEs trends increase in a similar manner; however, power growth is lower except for the case of SP8. Thus, power is not only affected by resource utilization, but also by interconnects and glitch power, as stated above. As the number of rounds is doubled, Sensorssequential 2018, 18, x FORpow PEERer and REVIEW sequential logic decrease, but the power decays less. The following22 results of 28 are observed for pipelined implementations as the number of rounds is doubled: The combinational power and combinational LEs of pipelined implementations versus the number• of40% rounds growth is show in combinationaln in Figure 19, LEs in additionand 35% togrowth the sequential in combinational power and power. sequential LEs. The combinational• 42% decay power in and sequential combinational LEs and LEs20% trends decay increas sequentiale in a similarpower. manner; however, power growth isClearly, lower except there is for a correlationthe case of betweenSP8. Thus, resources power is(represented not only affected by LEs) by and resource power, utilization, since dynamic but alsopower by isinterconnect proportionals and to design glitch powerarea [13, as]. Althoughstated above there. As is thea relation numbership of betweenrounds is power double andd, the sequentialnumber pow of erLEs, and there sequential are other logi factc decreasors contributinge, but the relatedpower decayto thes following less. The :following results are observed for pipelined implementations as the number of rounds is doubled: • Interconnect power and glitch power [36]. Sensors• •40% 2019 The,growth19, 913way in the combinational synthesis tool LEs routes and the35% connections growth in combinationaland optimizes power.the designs. 22 of 28 • 42% decay in sequential LEs and 20% decay in sequential power.

ScalarClearly, implementations there is a correlation have lower between LEs and resourcesComb. power, Power (represented as shownComb. in LEsby Figure LEs) 20anda,b, power, respectively. since dynamic Scalar implementationspower is proportional consume, to design on average, area [13 45%]. AlthoughSeq. of Power the pipelined there isSeq. a power relationLEs andship 39% between of the power pipelined and LEs.the Regardingnumber of scalarLEs, there implementations, are other fact theors bestcontributing is S (the basicrelated design), to the asfollowing it has the: least power and the least 10 1 number of LEs. Moreover, it consumes only 12% of the power of the best pipelined implementation • Interconnect power and glitch power [36]. (i.e., SP ), and the number of LEs is 10% that of SP . • The2 way the synthesis tool routes5 the connections2 and optimizes the designs.

0 Comb.SP_1 PowerSP_2 SP_4Comb.SP_8 LEsSP_16SP_32 Seq. Power Seq. LEs

Figure 19. Combinational10 power vs. combinational LEs and sequential power vs. sequential LEs for pipelined implementations. 5 When comparing scalar to pipelined implementations with respect to the number of LEs and power consumption, it is found0 that scalar designs are best in terms of power and resource utilization. SP_1 SP_2 SP_4 SP_8 SP_16SP_32 Scalar implementations have lower LEs and power, as shown in Figure s 20a,b, respectively. Scalar implementations consume, on average, 45% of the pipelined power and 39% of the pipelined LEs. Figure 19. Combinational power vs. combinational LEs and sequential power vs. sequential LEs for RegardingFigure 19. Combinational scalar implementations, power vs. combinational the best is S LEs1 (the and basic sequential design), power as it vs. ha sequentials the least LEs power for and the pipelined implementations. leastpipelined number implementations of LEs. Moreover,. it consumes only 12% of the power of the best pipelined implementation (i.e., SP2), and the number of LEs is 10% that of SP2. When comparing scalar to pipelined implementations with respect to the number of LEs and power consumption, itTotal is found LEs (Scalar) that scalar designs are best in Power terms (Scalar) of power andPower resource (Pipelined) utilization. Scalar implementationsTotal ha LEsve (Pipelined)lower LEs and power, 30as shown in Figures 20a,b, respectively. Scalar implementations consume, on average, 45% of the pipelined power and 39% of the pipelined LEs. 4000 25 Regarding scalar implementations, the best is S1 (the basic design), as it has the least power and the least number3000 of LEs. Moreover, it consumes only20 12% of the power of the best pipelined implementation (i.e., SP2), and the number of LEs is 10%15 that of SP2. 2000 Total LEs (Scalar) 10 Power (Scalar) Power (Pipelined) 1000 5 Total LEs (Pipelined) 30 0 0 4000 25 1 2 4 8 16 32 1 2 4 8 16 32

3000 20 (a) (b) 15 2000 FigureFigure 20. Power 20. Power and LEsand for LEs scalar for scalar and pipelined and pipelined10 implementations. implementations (a).Total (a) Total LEs for LEs scalar for scalar and and 1000pipelinedpipelined implementations; implementations (b) total; (b) powertotal power for scalar for scalar and pipelined and pipelined implementations. implementations. 5 5.3. Energy0 0 1 2 4 8 16 32 1 2 4 8 16 32 Energy per block for all implementations (scalar and pipeline) is shown in Figure 21. This curve indicates the(a) energy trend in order to easily identify the maximum(b) and minimum energy implementations. The energy trend is different in scalar and pipelined designs. The combinational and sequentialFigure 20. Power energy and trends LEs contribute for scalar and to this pipelined total result. implementations. (a) Total LEs for scalar and Thepipelined control implementations and round logic; (b) conceptstotal power in for scalar scalar and and pipelined pipelined implementations . are presented as block diagrams in Figures 22 and 23, respectively. It should be noted that control logic includes the control signals and registers that are not part of round and key scheduling logic. Sensors 2018, 18, x FOR PEER REVIEW 23 of 28

5.3. Energy SensorsEnergy 2018, 18 ,per x FOR block PEER for REVIEW all implementations (scalar and pipeline) is shown in Figure 21. This23 curve of 28 indicates the energy trend in order to easily identify the maximum and minimum energy 5.3.implementations. Energy The energy trend is different in scalar and pipelined designs. The combinational and sequentialEnergy per energy block fortrends all implementations contribute to this (scalar total result and pipeline). is shown in Figure 21. This curve indicates the energy trend in order to easily identify the maximum and minimum energy Sensorsimplementations.2019, 19, 913 ThepJ/block energy trend is differentTotal Energy in scalar andComb pipelined designsSeq . The combinational23 of 28 and sequential energy trends contribute to this total result. 6000 5000 pJ/block4000 Total Energy Comb Seq 60003000 50002000 40001000 30000 2000 1000 0 Figure 21. Energy per block for scalar and pipelined implementations.

The control and round logic concepts in scalar and pipelined implementations are presented as Figure 21. Energy per block for scalar and pipelined implementations. block diagrams in FigureFigures 21. 22 Energy and 23, per respectively. block for scala Itr andshould pipelined be noted implementations that control. logic includes the control signals and registers that are not part of round and key scheduling logic. The control and round logic concepts in scalar and pipelined implementations are presented as block diagrams in Figures 22 and 23, respectively. It should be noted that control logic includes the control signals and registers that are not part of round and key scheduling logic.

Figure 22. Control and round logic in scalar designs.designs.

In scalar one-round implementation, energy is computed by multiplying power by 32 cycles. In scalar one-round implementation, energy is computed by multiplying power by 32 cycles. In In two-round implementation, energy is computed by multiplying power by 16 cycles (half the number two-round implementation, energy is computed by multiplying power by 16 cycles (half the number of cycles of the previous implementation).Figure 22. Control Thus, and round instead logic of in executing scalar designs one round. logic and one control of cycles of the previous implementation). Thus, instead of executing one round logic and one control logic (control signals and registers) per cycle as in the case of one-round implementation, two rounds logicIn (control scalar signalsone-round and implementation,registers) per cycle energy as in ithes co casemputed of one by-round multiplying implementation, power by 32 two cycles. rounds In of logic and one control logic are executed. As a result, doubling the number of rounds, r, decreases twoof logic-round and implementation, one control logic energy are executed. is computed As a bresult,y multiplying doubling power the number by 16 cycles of rounds (half, ther, decreas numberes the number of cycles by a factor of two. This process results in the following: ofthe cycles number of the of cyclesprevious by implementation).a factor of two. This Thus, process instead result of executings in the following: one round logic and one control • The control logic, which includes the control signals (e.g., clock, done, start signals, etc.) and logic• The(control control signals logic and, which registers) include pers cyclethe control as in the signals case of (e.g., one- roundclock, done,implementation, start signals, two etc.) rounds and registers (e.g., flip-flops, round counter), is executed 50% less as r is doubled. As the number of of logicregisters and one (e.g. control, flip-flops, logic roundare executed. counter) As, is a executedresult, doubling 50% less the as numberr is double of roundsd. As the, r ,number decreas eofs rounds is increased, the number of hardware iterations decreases. Hence, control logic energy the numberrounds ofis cyclesincrease byd a, thefactor number of two. of This hardware process iterations results in decrease the following:s. Hence, control logic energy decreases. Generally, clock and registers contribute less to energy with a higher number of rounds, • Thedecrease controls. Generally, logic, which clock include and s regis the terscontrol contribute signals less(e.g., to clock, energy done, with start a higher signals, number etc.) and of which leads to energy savings. This is one source of the decreasing trend. registersrounds, which (e.g., fliplead-flops,s to energy round savings counter). This, is executed is one source 50% lessof the as decreasingr is double dtrend.. As the number of • Theoretically, the round logic should not be affected, because, as the number of rounds rounds is increased, the number of hardware iterations decreases. Hence, control logic energy implemented in the hardware is doubled, cycles are halved. Yet, the following factors should also decreases. Generally, clock and registers contribute less to energy with a higher number of be taken into consideration when the number of hardware rounds, r, is doubled: rounds, which leads to energy savings. This is one source of the decreasing trend.

- The synthesis tool can find more opportunity to optimize larger logic, as there is a better chance to reduce the logic. Thus, doubling the number of hardware rounds typically results in an area less than the summation of the two rounds. This is another source for the decreasing trend [37]. Sensors 2018, 18, x FOR PEER REVIEW 24 of 28

• Theoretically, the round logic should not be affected, because, as the number of rounds implemented in the hardware is doubled, cycles are halved. Yet, the following factors should also be taken into consideration when the number of hardware rounds, r, is doubled: Sensors 2019, 19, 913 24 of 28 - The synthesis tool can find more opportunity to optimize larger logic, as there is a better chance to reduce the logic. Thus, doubling the number of hardware rounds typically results in an area -less thanThe the logic summation becomes of the more two complex rounds. This with is a another larger sou numberrce for of the rounds, decreasing due trend to many [37]. - The logicinterconnections becomes more complex and levels. with Thus, a larger glitch number and interconnectof rounds, due power to many and interconnections energy tend to and levels.increase Thus [,36 glitch]. and interconnect power and energy tend to increase [36].

Figure 23. Control and round logic in pipelined designs. Figure 23. Control and round logic in pipelined designs. The same control and round logic concepts are applied to pipelined implementation, as shown The same control and round logic concepts are applied to pipelined implementation, as shown in Figure 23, with the addition of register logic. For one-round implementation, energy is computed byin Figure multiplying 23, with power the addition with time of toregister encrypt logic. one For block one per-round stage. implementation, For example, considerenergy is thecomputed case of by multiplying power with time to encrypt one block per stage. For example, consider the case of four-round implementation, where the cycle time increases as four rounds are processed in each stage. four-round implementation, where the cycle time increases as four rounds are processed in each Hence, as the number of rounds, r, per stage is doubled, the clock period increases. This results in the stage. Hence, as the number of rounds, r, per stage is doubled, the clock period increases. This results following: in the following: • The control logic is executed less as r is doubled. As the number of rounds per stage increases, • The control logic is executed less as r is doubled. As the number of rounds per stage increases, the number of stages and the round counter decreases. Hence, control logic is simplified when the number of stages and the round counter decreases. Hence, control logic is simplified when doubling the number of rounds, and, as a result, power/energy decreases. This is one source of doubling the number of rounds, and, as a result, power/energy decreases. This is one source of the decreasing trend in the pipelined implementation. the decreasing trend in the pipelined implementation. • The round logic is affected by two main factors as follows: • The round logic is affected by two main factors as follows: - -The synthesisThe synthesis tool tend toolstends to optimize to optimize larger larger logic logic better, better, as as opportunities opportunities for for sharing sharing and minimizingminimizing logic cones logic conesincrease increases.s. Therefore, Therefore, as the as thenumber number of rounds of rounds per per stage stage is is increase increased,d, power powerand energy and energydecrease. decrease. This is Thisanother is another source sourcefor the fordecreasing the decreasing trend. trend. - -Stage complexityStage complexity increase increasess as number as numberof rounds of per rounds stage peris double staged is. As doubled. the level As of thecomplexity level of increasecomplexitys, interconnection, increases, routing interconnection,, and glitch power routing, and and energy glitch increase. power This and is energy one source increase. for the increaseThis isin one the sourceenergy for trend. the increase in the energy trend.

• The registers (i.e., flipflip flops)flops) inserted between pipeline stages reduce by a factor of two when the number of rounds per stage is doubled,doubled, andand thethe numbernumber of stagesstages isis halved.halved. Thus,Thus, power and energy decrease. This is another source ofof thethe decreasingdecreasing trend.trend.

This analysis explains the mechanisms behind the energy trend. Energy also depends on the algorithm and the implementation. In general, pipelined designs dissipate lower energy compared to scalar designs. SP2 reports the lowest energy dissipation value closely followed by SP1 across all pipelined designs. For scalar designs, S4 has the lowest energy. Thus, the best implementation in terms of energy consumption is SP2, as it consumes 31% of the best scalar design (i.e., S4). Sensors 2018, 18, x FOR PEER REVIEW 25 of 28

This analysis explains the mechanisms behind the energy trend. Energy also depends on the algorithm and the implementation. In general, pipelined designs dissipate lower energy compared to scalar designs. SP2 reports the lowest energy dissipation value closely followed by SP1 across all Sensors 2019, 19, 913 25 of 28 pipelined designs. For scalar designs, S4 has the lowest energy. Thus, the best implementation in terms of energy consumption is SP2, as it consumes 31% of the best scalar design (i.e., S4). 5.4. The Optimum Design 5.4. The Optimum Design This section determines the best implementation of a lightweight cipher for LRDs. There are many This section determines the best implementation of a lightweight cipher for LRDs. There are performance metrics (i.e., throughput, power, area, and energy) to be considered, as discussed in the many performance metrics (i.e., throughput, power, area, and energy) to be considered, as discussed previous sections. In each performance metric, one implementation is optimum. Thus, the question is in the previous sections. In each performance metric, one implementation is optimum. Thus, the which metric is the most critical to consider. The answer will be the application to be used in lightweight question is which metric is the most critical to consider. The answer will be the application to be used ciphers. As stated before, lightweight block ciphers are suited for LRDs, as constrained environments in lightweight ciphers. As stated before, lightweight block ciphers are suited for LRDs, as constrained supporting secure communication. With the continued minimization of transistor features and the environments supporting secure communication. With the continued minimization of transistor need to increase battery lifetime, energy becomes the most critical issue for LRDs. Additionally, features and the need to increase battery lifetime, energy becomes the most critical issue for LRDs. providing an adequate level of security without exceeding the resource limitation is critical. Therefore, Additionally, providing an adequate level of security without exceeding the resource limitation is energy and area are the most critical factors to be considered, with slightly higher emphasis on energy. critical. Therefore, energy and area are the most critical factors to be considered, with slightly higher The optimum metric is introduced in Equation (14), as in Reference [13], to compare different design emphasis on energy. The optimum metric is introduced in Equation (14), as in Reference [13], to implementations. This metric rewards the design implementations with minimum area and energy compare different design implementations. This metric rewards the design implementations with and emphasizes minimal energy. minimum area and energy and emphasizes minimal energy. Metric = Eµ × LE, (14) Metric = Eµ × LE, (14) where µ is the energy emphasis factor, and is evaluated using following values: 1.0, 1.2, 1.4, 1.6, 1.8, andwhere 2.0. µ is the energy emphasis factor, and is evaluated using following values: 1.0, 1.2, 1.4, 1.6, 1.8, and 2.0.Figure 24 shows the scalar and pipelined implementation trends using the metric in Equation (14) withFigure different 24µ showsvalues. the Changing scalar and the pipelined value of µ implementationshows which design trends performs using the best metric at different in Equation energy levels.(14) with According different to µ thevalues. curves Changing in Figure the 24 value, the following of µ shows which design are perform drawn: s best at different energy levels. According to the curves in Figure 24, the following observations are drawn: • With a higher emphasis on energy (µ > 1.5), the optimum implementation is SP2 followed by SP1. • 2 1 • With aa lowerhigher emphasis emphasis on on energy energy (1 (µ < >µ 1.5),< 1.5), the the optimum optimum implementation implementation is SP is S1followedfollowed by by SP S2.. • With a lower emphasis on energy (1 < µ < 1.5), the optimum implementation is S1 followed by • For scalar implementations, the optimum design is S1 followed by S2 and S4. S2. • In pipelined implementations, the optimum design is SP2 followed by SP1. • For scalar implementations, the optimum design is S1 followed by S2 and S4. • In general,pipelined balanced implementation energy ands, the area optimum requirements design lead is SP to2 followed the following by SP conclusions:1. • PipelinedIn general, implementation balanced energy performs and area betterrequirements with a higher lead to emphasisthe following on energy conclusions: and, as a result, is the best choice for low-resource/constrained devices. • Pipelined implementation performs better with a higher emphasis on energy and, as a result, is • SP is best for the low-energy requirement, and S is the best for the low-resource requirement. the2 best choice for low-resource/constrained devices.1 • The best implementations are SP , SP S ,S , and S . • SP2 is best for the low-energy requirement2 1, 1 , and2 S1 is4 the best for the low-resource requirement. • The abovebest implementations conclusions are combinedare SP2, SP with1, S1, theS2, recommendedand S4. usage for pipelined implementations startedThe earlier above (i.e., conclusionspipeline implementations are combined are awith better fitthe for recommend applicationsed with usage continuous for pipelined blocks of data).implementations Therefore, two-roundstarted earlier pipelined (i.e., pipeline implementation implementations is optimum are fora better applications fit for applications with continuous with streamscontinuous of data, blocks and of one-round data). Therefore, and two-round two-roun scalard pipelined implementations implementation are recommended is optimum for intermittentapplications datawith applications.continuous streams of data, and one-round and two-round scalar implementations are 60 1 1.2 1.4 1.6 1.8 2 40

20

0

µ × µ FigureFigure 24. 24.EEµ × LELE,, where where µ isis the the energy energy emphasis emphasis factor factor. 5.5. Implementations and Security The presented implementations of the SIMON cipher do not alter the SIMON algorithm. The implementations exercise different design options to improve the performance metrics. Therefore, Sensors 2019, 19, 913 26 of 28 it is expected all implementations provide the same level of confidentiality security service. The only exception, however, is side-channel attack, which is implementation-dependent. Side-channel attack exploits the relationship between analog leakage (e.g., power) and the data manipulated by implemented design. Bhasin et al. were successful in mounting side-channel attack and retrieval of the secret key in FPGA implementation of the SIMON cipher [38]. Bhasin et al. recommended loop unrolling to achieve higher side-channel attack resistance at minimal overhead [38]. For the presented pipelined implementations, all rounds are implemented in hardware, which provides optimum protection from side-channel attacks. For scalar implementations, S32 unrolls all rounds and implements them in hardware, which is similar to pipelined implementations. S1 does not unroll rounds as it implements one round in hardware. Hence, S1 is most vulnerable to side-channel attacks amongst all implementations. Other scalar implementations unroll and implement a different number of rounds in hardware. Thus, when side-channel attack is a concern, the optimum scalar implementations might be slightly modified to be S2 and S4.

6. Conclusions This paper discussed the hardware implementations of SIMON lightweight cipher algorithm targeting secure communication in LRDs. Several scalar and pipelined FPGA implementations of the SIMON32/64 lightweight cipher were designed and examined with different numbers of hardware rounds per cycle. Pipelined implementations performed better than scalar designs in terms of throughput performance by a factor of 12. The best pipelined implementations were SP1 and SP2, while S16 followed by S8 were the best scalar implementations. Additionally, the number of LEs and the measured power consumption trends were very similar. The best implementations in terms of LEs and power were scalar. Scalar implementations consumed, on average, 45% of the pipelined implementations power. Scalar LEs utilized 39% of pipelined LEs. As for scalar implementations, S1 was best; it consumed only 12% of the power of the best pipelined implementation (i.e., SP2) and the number of LEs for S1 was 10% that of SP2. In terms of energy dissipation, SP2 was best followed by SP1. Pipelined designs reported the lowest values for energy consumption compared to scalar designs. The best pipelined design (i.e., SP2) consumed only 31% of the best scalar design (i.e., S4). Balancing energy and area, the optimum pipelined implementations were SP2 and SP1, while the best scalar implementations were S1,S2, and S4. The SP2 implementation is optimum for continuous streams of data, whereas S1 and S2 are recommended for intermittent data applications. When considering channel-side attack, S1 is not favorable. This paper contributed to deriving accurate models for lightweight cipher performance metrics and providing general guidelines for future lightweight ciphers. This study also discussed opportunities to better implement future cipher designs with optimized energy and area, which is critical for LRDs targeting lightweight ciphers for the security purposes. Future cipher designs should carefully examine small-round logic versus large-round logic. Small-round logic requires many rounds, T. Large-round logic implies few rounds. Historically, lightweight cipher designers choose small logic for better area and implement one round in hardware. However, this study showed that the one-round design was not optimal for energy. S4 and SP2 were best in terms of energy consumption. Our recommendation is to consider “fewer” larger rounds for future cipher design. This was also noted in Reference [37]. Future research could extend the work with different SIMON configurations and derive more general performance models for the SIMON cipher, depending on algorithm parameters (e.g., block size, key size, key words, and number of rounds). This would allow for the prediction of various performance metrics (including throughput, power, area, and energy) for each SIMON Sensors 2019, 19, 913 27 of 28 configuration. Extending the derived models to other platforms, such as ASIC technology, is also being considered.

Author Contributions: Conceptualization, S.A. and B.J.M.; methodology, S.A. and B.J.M.; software, S.A., R.J. and B.J.M.; validation, S.A., R.J. and B.J.M.; formal analysis, S.A., R.J., B.J.M. and M.A.; investigation, S.A., R.J. and B.J.M.; resources, S.A. and R.J.; data curation, S.A., R.J. and M.A.; writing—original draft preparation, S.A., R.J. and B.J.M.; writing—review and editing, S.A., R.J., B.J.M. and M.A.; visualization, S.A., R.J., B.J.M. and M.A. Funding: This research received no external funding. Conflicts of Interest: The authors declare no conflict of interest.

References

1. Mohd, B.J.; Hayajneh, T.; Vasilakos, A.V. A survey on lightweight block ciphers for low-resource devices: Comparative study and open issues. J. Netw. Comput. Appl. 2015, 58, 73–93. [CrossRef] 2. Law, Y.W.; Doumen, J.; Hartel, P. Survey and benchmark of block ciphers for wireless sensor networks. ACM Trans. Sens. Netw. 2006, 2, 65–93. [CrossRef] 3. Symmetric, vs Asymmetric Ciphers. Available online: http://windowsitpro.com/security/symmetric-vs- asymmetric-ciphers (accessed on 12 July 2018). 4. Mohd, B.J.; Hayajneh, T.; Yousef, K.M.A.; Khalaf, Z.A.; Bhuiyan, M.Z.A. Hardware design and modeling of lightweight block ciphers for secure communications. Future Gener. Comput. Syst. 2017, 83, 510–521. [CrossRef] 5. Katz, J.; Menezes, A.J.; Van Oorschot, P.C.; Vanstone, S.A. Handbook of Applied Cryptography, 1st ed.; CRC Press: Boca Raton, FL, USA, 1996. 6. An Introduction to Stream Ciphers and Block Ciphers. Available online: http://www.jscape.com/blog/ stream-cipher-vs-block-cipher (accessed on 10 August 2018). 7. Bernstein, D.J. The Salsa20 family of stream ciphers. In New Stream Cipher Designs; Springer: Berlin/Heidelberg, Germany, 2008; Volume 4986, pp. 84–97. 8. Hell, M.; Johansson, T.; Meier, W. Grain: A stream cipher for constrained environments. Int. J. Wirel. Mob. Comput. 2007, 2, 86–93. [CrossRef] 9. De Canniere, C. Trivium: A stream cipher construction inspired by block cipher design principles. In International Conference on Information Security; Springer: Berlin/Heidelberg, Germany, 2006; pp. 171–186. 10. Fan, X.; Mandal, K.; Gong, G. Wg-8: A lightweight stream cipher for resource-constrained smart devices. In International Conference on Heterogeneous Networking for Quality, Reliability, Security and Robustness; Springer: Berlin/Heidelberg, Germany, 2013; Volume 115, pp. 617–632. 11. Cazorla, M.; Marquet, K.; Minier, M. Survey and benchmark of lightweight block ciphers for wireless sensor networks. In Proceedings of the 2013 International Conference on Security and Cryptography (SECRYPT), Reykjavik, Iceland, 29–31 July 2013; pp. 1–6. 12. Wollinger, T.; Guajardo, J.; Paar, C. Security on FPGAs: State-of-the-art implementations and attacks. ACM Trans. Embed. Comput. Syst. 2004, 3, 534–574. [CrossRef] 13. Mohd, B.J.; Hayajneh, T.; Khalaf, Z.A.; Ahmad Yousef, K.M. Modeling and optimization of the lightweight HIGHT block cipher design with FPGA implementation. Secur. Commun. Netw. 2016, 9, 2200–2216. [CrossRef] 14. Beaulieu, R.; Treatman-Clark, S.; Shors, D.; , B.; Smith, J.; Wingers, L. The SIMON and SPECK families of lightweight block ciphers. Available online: https://eprint.iacr.org/2013/404 (accessed on 22 April 2018). 15. Aysu, A.; Gulcan, E.; Schaumont, P. SIMON Says, Break Area Records of Block Ciphers on FPGAs. IEEE Embed. Syst. Lett. 2014, 6, 37–40. [CrossRef] 16. Beaulieu, R.; Shors, D.; Smith, J.; Treatman-Clark, S.; Weeks, B.; Wingers, L. The SIMON and SPECK block ciphers on AVR 8-bit microcontrollers. In International Workshop on Lightweight Cryptography for Security and Privacy; Springer: Cham, Switzerland, 2015; pp. 3–20. 17. Hosseinzadeh, J.; Bafghi, A.G. Software Implementation and Evaluation of Lightweight Symmetric Block Ciphers of the Energy Perspectives and Memory. Int. J. Eng. Educ. 2017, 9, 1–6. 18. Beaulieu, R.; Shors, D.; Smith, J.; Treatman-Clark, S.; Weeks, B.; Wingers, L. Implementation and Performance of the Simon and Speck Lightweight Block Ciphers on ASICs. Unpublished work. Sensors 2019, 19, 913 28 of 28

19. Beaulieu, R.; Shors, D.; Smith, J.; Treatman-Clark, S.; Weeks, B.; Wingers, L. SIMON and SPECK: Block Ciphers for the Internet of Things. Available online: https://eprint.iacr.org/2015/585 (accessed on 10 March 2018). 20. Wetzels, J.; Bokslag, W. Simple SIMON: FPGA implementations of the SIMON 64/128 Block Cipher. Cryptogr. Eng. Kerckhoffs Inst. 2015, 1, 1–20. 21. Feizi, S.; Ahmadi, A.; Nemati, A. A hardware implementation of SIMON cryptography algorithm. In Proceedings of the 2014 4th International eConference on Computer and Knowledge Engineering (ICCKE), Mashhad, Iran, 29–30 October 2014; pp. 245–250. 22. Gulcan, E.; Aysu, A.; Schaumont, P. A flexible and compact hardware architecture for the SIMON block cipher. In International Workshop on Lightweight Cryptography for Security and Privacy; Springer: Cham, Switzerland, 2014; Volume 8898, pp. 34–50. 23. Wan, T.; Salman, H. Ultra Low Power SIMON Core for Lightweight Encryption. In Proceedings of the 2018 IEEE International Symposium on Circuits and Systems (ISCAS), Florence, Italy, 27–30 May 2018; pp. 1–5. 24. Yang, G.; Zhu, B.; Suder, V.; Aagaard, M.D.; Gong, G. The simeck family of lightweight block ciphers. In International Workshop on Cryptographic Hardware and Embedded Systems; Springer: Berlin/Heidelberg, Germany, 2015; pp. 307–329. 25. Ryabko, B.; Soskov, A. Application of the to lightweight block ciphers. In Proceedings of the 2017 International Multi-Conference on Engineering, Computer and Information (SIBIRCON), Novosibirsk, Russia, 18–22 September 2017; pp. 338–341. 26. Kölbl, S.; Roy, A. A brief comparison of Simon and Simeck. In International Workshop on Lightweight Cryptography for Security and Privacy; Springer: Cham, Switzerland, 2016. 27. Zhang, X.; Heys, H.M.; Li, C. Fpga implementation and energy cost analysis of two light-weight involutional block ciphers targeted to wireless sensor networks. Mob. Netw. Appl. 2013, 18, 222–234. [CrossRef] 28. Abed, S.; Mohd, B.J.; Al-bayati, Z.; Alouneh, S. Low power Wallace multiplier design based on wide counters. Int. J. Circuit Theory Appl. 2012, 40, 1175–1185. [CrossRef] 29. Hayajneh, T.; Ullah, S.; Mohd, B.; Balagani, K. An Enhanced WLAN Security System with FPGA Implementation for Multimedia Applications. IEEE Syst. J. 2015, 11, 2536–2545. [CrossRef] 30. Mohd, B.J.; Hayajneh, T.; Abed, S.; Itradat, A. Analysis and modeling of FPGA implementations of spatial steganography methods. J. Circuits Syst. Comput. 2014, 23, 1450018. [CrossRef] 31. Mohd, B.J.; Hayajneh, T. Wavelet-transform steganography: Algorithm and hardware implementation. Int. J. Electron. Secur. Digit. Forensics 2013, 5, 241–256. [CrossRef] 32. Mohd, B.J.; Hayajneh, T.; Khalaf, Z.A.; Vasilakos, A.V. A comparative study of steganography designs based on multiple FPGA platforms. Int. J. Electron. Secur. Digit. Forensics 2016, 8, 164–190. [CrossRef] 33. Altera Cyclone II Device Handbook. Available online: http://www.altera.com/products/devices/cyclone2/ cy2-index.jsp (accessed on 12 August 2018). 34. Menezes, A.J.; van Oorschot, P.C.; Vanstone, S.A. Handbook of Applied Cryptography, 5th ed.; CRC Press: Boca Raton, FL, USA, 2001; p. 251. 35. Mohd, B.; Hayajneh, T.; Shakir, M.; Qaraqe, K.; Vasilakos, A. Energy model for light-weight block ciphers for WBAN applications. In Proceedings of the 2014 EAI 4th International Conference on Wireless Mobile Communication and Healthcare (Mobihealth), Athens, Greece, 3–5 November 2014; pp. 1–4. 36. Kolay, S.; Mukhopadhyay, D. Khudra: A new lightweight block cipher for FPGAs. In International Conference on Security, Privacy, and Applied Cryptography Engineering; Chakraborty, R.S., Matyas, V., Schaumont, P., Eds.; Springer: Cham, Switzerland, 2014; pp. 126–145. 37. Mohd, B.J.; Hayajneh, T. Lightweight Block Ciphers for IoT: Energy Optimization and Survivability Techniques. IEEE Access 2018, 6, 35966–35978. [CrossRef] 38. Bhasin, S.; Graba, T.; Danger, J.L.; Najm, Z. A look into SIMON from a side-channel perspective. In Proceedings of the 2014 IEEE International Symposium on Hardware-Oriented Security and Trust (HOST), Arlington, VA, USA, 6–7 May 2014; pp. 56–59.

© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).