<<

ISSN 2319-8885 Vol.04,Issue.31, August-2015,

Pages:6098-6102

www.ijsetr.com

Design and Hardware Implementation for RC4 by using Verilog HDL 1 2 A. LAKSHMI DEVI , VINUSHA MANDAVA 1PG Scholar, DRK Institute of Science and Technology, Bowrampet, Hyderabad, TS, India, E-mail: [email protected]. 2Asst Prof, DRK Institute of Science and Technology, Bowrampet, Hyderabad, TS, India, E-mail: [email protected].

Abstract: is the only practical method for protecting information transmitted through communication networks. The hardware implementation of cryptographic algorithms plays an important role because of growing requirements of high speed and high level of secure communications. Implementation of cryptographic algorithms on hardware runs faster than on software and at the same time offering more intrinsic security. The main feature of cryptography is to work out with problems, which are associated with secrecy, authentication and integrity. Cryptography also, is related with the meaning of protocol. A protocol is sequences of actions, which are concern two or more sides, designed to fulfill a goal. Thus, a is a protocol that uses cryptography. This protocol uses a cryptographic algorithm and its intention is to prevent attempts of thefts and invasions.RC4 is the most popular stream cipher in the domain of cryptology, In this project we propose the fastest known architecture for the cipher with the help of loop unrolling and hardware pipeline which is produce the one RC4 stream byte per clock cycle.

Keywords: Cryptography, High Throughput, Pipelining, RC4, Stream Cipher.

I. INTRODUCTION bytes in 2N + 259 clock cycles, counting the initial delay of 1 Stream Ciphers are broadly classified into two parts cycle for KSA and 2 cycles for PRGA. The hardware depending on the platform most suited to their implementation of RC4 provides an output of N bytes in 3N implementation; namely software stream ciphers and + 768 clock cycles. One can easily observe that for large N, hardware stream ciphers. RC4 is one of the widely used the throughput of our RC4 architecture is three times stream ciphers that is mostly implemented in software. This compared to that of the designs. In terms of the area, exact cipher is used in network protocols such as SSL, TLS, WEP, comparison with [10] is not possible since, we do not have and WPA. The cipher also finds applications in Microsoft access to the FPGA board for which the area figures of [10] Windows, Lotus Notes, Apple AOCE, Oracle Secure SQL, is reported. Considering the design idea, both [10] and [16] etc. Though several other efficient and secure stream ciphers modeled their storage using block RAMs. This have been discovered after RC4, it is still the most popular implementation restricts the number of port accesses per stream cipher algorithm due to its simplicity, ease of cycle. To overcome that, three 256-byte dual-port RAM implementation, and speed. In spite of several blocks are used in [10]. Even then, the design requires three attempts on RC4 the cipher stands secure if used properly. In cycles to produce 1 byte of data. An improved design is this project, we study several aspects of the hardware reported in [16] where only two 256-byte dual-port RAM implementation of RC4, with respect to its efficient blocks are used. It may be noted that we have utilized implementation, and present two new hardware designs register-based storage for the S and K arrays instead of RAM. which allow fast generation of RC4 key stream. The better of This is because a RAM-based storage would incorporate the two is the fastest known hardware implementation of the port-access restrictions and latency issues, resulting in a cipher till date. To motivate our contribution, we would first compromise of throughput. An alternative technique to like to discuss the basic framework of the cipher. A short maintain the high throughput with RAM-based note on RC4 follows. implementation may be partitioning the arrays according to the accesses, and optimize accordingly. Matthews Jr. [17]. II. LITERATURE REVIEW Let us compare the proposed design with the ones that III. SYSTEM DESIGN existed for RC4 hardware till date. We only consider existing A. Proposed System designs that are focused toward improved throughput, and Design: One may note that Design, based on loop unrolling, not any other hardware considerations. Combining our KSA is completely independent of the design idea of hardware and PRGA architectures, we can obtain 2N output stream pipelining in case of RC4. Thus, we propose a completely

Copyright @ 2015 IJSETR. All rights reserved. A. LAKSHMI DEVI, VINUSHA MANDAVA new design of RC4 hardware using efficient hardware Design: One Byte Per CLOC: We consider the generation pipeline and loop unrolling simultaneously. This model of two consecutive values of Z together, for the two provides a throughput of 2 bytes per cycle in RC4 PRGA, consecutive plaintext bytes to be encrypted. Assume that the without losing the clock performance. A detailed account of initial values of the variables i, j, and S are i0, j0, and S0, the design strategy and circuit analysis is presented in below respectively. After the first execution of the PRGA loop, section. these values will be i1, j1, and S1, respectively, and the output byte is Z1, say. Similarly, after the second execution of the B. Block Diagrams Of PRGA And KSA PRGA loop, these will be i2, j2, S2, and Z2, respectively. The schematic diagrams for PRGA and KSA circuits in the Thus, for the first two loops of execution to complete, we proposed design are shown in Figs.1 and 2, respectively. The have to perform the operations shown in Table .1. PRGA circuit operates as per the 2-stage pipeline structure, where the increment of indices take place in the first stage, TABLE I: Two Consecutive Loops of RC4 Stream and so does the double-swap operation for the S-box. In the Generation same stage, the addresses for the two consecutive output bytes Zn and Zn+1 are calculated as the swap does not change the outcomes of the additions S[in]+S[jn] or S[in+1]+ S[jn+1]. In the second stage of the pipeline, the output addresses zn_addr and zn+1_addr are used to read the appropriate key stream bytes from the updated S-box.

Fig.3. Circuit to compute i1 and i2.

C. Modules Fig.1. PRGA circuit structure for proposed architecture. Design: One Byte per Clock: The circuit for KSA operates similarly, but has no pipeline  Design of individual components feature as the operation happens in a single stage. Here, the  The complete circuit increment of indices and swap are done for two consecutive  PRGA stage of the proposed RC4 architecture rounds of KSA in a single clock cycle, thereby producing a  Access sharing of KSA and PRGA speed of 2-rounds-per-cycle.Based on this schematic diagram for the circuits, and the port-sharing logic described earlier, Complete Circuit: we now attempt the hardware implementation of our new The complete circuit diagram for the PRGA algorithm of design. Design 1 is shown in Fig.3. We shall henceforth denote the clock by and its cycles numbered as 1, 2, etc., where

0 refers to the clock pulse that initiates PRGA. In Fig.3. Li denote the latches operated by the trailing edge of 2n+ i, i.e., the (2n + i) th cycle of the master clock where n ≥ 0.

For example, the latches labeled L1 (four of them) are released at the trailing edge of 1, 3, 5, . . . and the

latches labeled L2 (eight of them) are released at the trailing edge of 2, 4, 6, . . . etc. In the final implementation, these latches have been replaced by edge-triggered flip-flops which operate at the trailing edge of the clock. Now, we generalize our previous observation to state the following Efficiency of PRGA in Design 1. The hardware proposed for the PRGA stage of RC4 in Design 1, as shown in Fig.3, produces “one byte per clock” after an initial delay of two clock cycles. Fig.2. KSA circuit structure for proposed architecture. International Journal of Scientific Engineering and Technology Research Volume.04, IssueNo.31, August-2015, Pages: 6098--6102 Design and Hardware Implementation for RC4 Stream Cipher by Using Verilog HDL feedback cycle by one clock pulse in every iteration (e.g., in case of nth iteration). Therefore, our model practically produces 2N output bytes in 2N clock cycles, that is “one byte per clock,” after an initial lag of two clock cycles.

Issues with KSA: Note that the general KSA routine runs for 256 iterations to produce the initial permutation of the S-box. Moreover, the steps of KSA are quite similar to the steps of PRGA, apart from the following:  Calculation of j involves key K along with S and i.  Computing Z1; Z2 is neither required nor advised. Fig.4. Circuit to compute z1.

Let us call the stage of the PRGA circuit shown in Fig.4 the nth stage. This actually denotes the nth iteration of our model, which produces the output bytes Zn+1 and Zn+2. The first block (Circuit 1) operates at the trailing edge of n, and increments in to in+1, in+2. During cycle , the combinational part of Circuit 2 operates to produce jn+1, jn+2.

The trailing edge releases the latches of type L1, and activates the swap circuit (Circuit 3). The combinational logic of the swap circuit functions during cycle and the actual swap operation takes place at the trailing edge of

to produce from Sn. Simultaneously; the latch of type L2 is released to activate the Circuits 4 and 5. The combinational logic of these two circuits operate during , and we get the outputs and at the trailing edge of . This complete block of architecture performs in a cascaded pipeline fashion, as the indices i , j and the 2 2 state Sn+2 are fed back into the system at the end of Fig.6 Circuit for PRGA stage of the proposed RC4

(actually, in+2 is fed back at the end of to allow for the architecture (Design 1). increments at the trailing edge of ) as shown in Fig.5. We propose the use of our loop-unrolled PRGA The operational gap between two iterations (e.g., nth and (n + architecture (Fig.7) for the KSA as well, with some minor 2th) of the system is thus two clock cycles (e.g., to modifications, as follows: ), and we obtain two output bytes per iteration.

Fig.7. Pipeline Structure for the Proposed Design 1. Fig.5. Circuit to compute z2. D. Structure Of PRGA And KSA Circuits Hence, the PRGA architecture of Design 1, as shown in The schematic diagrams for PRGA and KSA circuits in Fig.6, produces 2N bytes of output stream in N iterations, the proposed design are shown in Figs.8 and 9 respectively. over 2N clock cycles. Note that the initial clock pulse 0 is The PRGA circuit operates as per the 2-stage pipeline an extra one, and the production of the output bytes lag the structure, where the increment of indices take place in the International Journal of Scientific Engineering and Technology Research Volume.04, IssueNo.31, August-2015, Pages: 6098-6102 A. LAKSHMI DEVI, VINUSHA MANDAVA first stage, and so does the double-swap operation for the S- described earlier, we now attempt the hardware box. In the same stage, the addresses for the two consecutive implementation of our new design. output bytes Zn and Zn+1 are calculated as the swap does not change the outcomes of the additions S[in]+S[jn] or S[in+1]+ Advantages: S[jn+1]. In the second stage of the pipeline, the output  RC4 is reduced complexity addresses zn_addr and zn+1_addr are used to read the appropriate  RC4 is reduces the area key stream bytes from the updated S-box.  Higher throughput rate  RC4 is fast Computation  RC4 is secure data  Simplicity

Applications:  It is used in transport layer  It is used in Secure socket layer  Efficient implement in both software and hardware

IV. RESULTS Results of this paper is shown in bellow Figs.10 to 12. A. RTL Schematic KSA Sample:

Fig. 8. PRGA circuit structure for proposed architecture.

The circuit for KSA operates similarly, but has no pipeline feature as the operation happens in a single stage. Here, the increment of indices and swap are done for two consecutive rounds of KSA in a single clock cycle, thereby producing a speed of 2-rounds-per-cycle.Based on this schematic diagram for the circuits, and the port-sharing logic described earlier, we now attempt the hardware implementation of our new design. Fig.10. KSA RTL Schematic.

Fig. 9. KSA circuit structure for proposed architecture.

The circuit for KSA operates similarly, but has no pipeline feature as the operation happens in a single stage. Here, the increment of indices and swap are done for two consecutive rounds of KSA in a single clock cycle, thereby producing a speed of 2-rounds-per-cycle.Based on this schematic diagram for the circuits, and the port-sharing logic

Fig. 11. Snapshot of PRNG schematic. International Journal of Scientific Engineering and Technology Research Volume.04, IssueNo.31, August-2015, Pages: 6098--6102 Design and Hardware Implementation for RC4 Stream Cipher by Using Verilog HDL RC4 TOP: Future Scope: In this project by using this functionality now we can implemented spartan3E board of FPGA .In future we can use same functionality and high specification method developed vertex board. In these reduce area, and delay at the same time increase the performance. In future, we will have further look into the hardware architecture for optimizing the area without compromising the runtime performance.

VI. REFERENCES [1] Software Performance Results from the eSTREAM Project,eSTREAM, the ECRYPT Stream Cipher Project, http://www.ecrypt.eu.org/stream/perf/#results, 2012. [2] The Current eSTREAM Portfolio, eSTREAM, the ECRYPT StreamCipher Project, http://www.ecrypt.eu.org/ stream/index.html,2012. [3] S.R. Fluhrer and D.A. McGrew, “Statistical Analysis of the AllegedRC4 Generator,” Proc. Seventh Int’l

Fig.12. RC4 top. Workshop FastSoftware (FSE ’00), vol. 1978, pp. 19-30, 2000. B. Device Utilization [4] S.R. Fluhrer, I. Mantin, and A. Shamir, “Weaknesses in TABLE II: Device Utilization the KeyScheduling Algorithm of RC4,” Proc. Eighth Ann. Int’l WorkshopSelected Areas in Cryptography (SAC ’01), vol. 2259, pp. 1-24, 2001. [5] M.D. Galanis, P. Kitsos, G. Kostopoulos, N. Sklavos, and C.E.Goutis, “Comparison of the Hardware Implementation of StreamCiphers,” Int’l Arab J. Information Technology, vol. 2, no. 4, pp. 267-274, 2005. [6] J. Golic, “Linear Statistical Weakness of Alleged RC4 KeystreamGenerator,” Proc. Advances in Cryptology EUROCRYPT, vol. 1233,pp. 226-238, 1997. [7] T. Good and M. Benaissa, “Hardware Results for Selected StreamCipher Candidates,” eSTREAM, ECRYPT Stream Cipher Project,SASC, Report 2007/023, 2007. [8] F.K. Gurkaynak, P. Luethi, N. Bernold, R. Blattmann, V. Goode,M. Marghitola, H. Kaeslin, N. Felber, and W. Fichtner, “HardwareEvaluation of eSTREAM Candidates: , , MICKEY,MOSQUITO, SFINKS, , VEST, ZK-Crypt,” eSTREAM,ECRYPT Stream Cipher Project, Report 2006/015, 2006. [9] P. Hamalainen, M. Hannikainen, T. Hamalainen, and J. Saarinen,“Hardware Implementation of the Improved WEP and RC4Encryption Algorithms for Wireless Terminals,” Proc. EuropeanSignal Processing Conf., pp. 2289-2292, 2000. The device utilization of ksa sample Contains the number of slices used,number of LUTS and the number of IOBs are used.

V. CONCLUSION The project “Design and hardware implementation for RC4 stream cipher by using verilog HDL” hardware implementation has been successfully simulated. In the process we have two new designs for RC4 hardware. The 1st one produces one key stream byte per cycle using the idea of loop unrolling. Our second design tops the tops the first one by combining the idea of loop unrolling with that of efficient hardware pipelining to obtain two key stream bytes per cycle. This is the fastest known RC4hardware architecture providing a key stream throughput of 30.72 Gpbs in its optimized form using 65 nm technology. It is a more intrinsic security. International Journal of Scientific Engineering and Technology Research Volume.04, IssueNo.31, August-2015, Pages: 6098-6102