Radix-8 Design Alternatives of Fast Two Operands Interleaved

International Journal of Advanced Network, Monitoring and Controls Volume 04, No.02, 2019 Radix-8 Design Alternatives of Fast Two Operands Interleaved Multiplication with Enhanced Architecture With FPGA implementation & synthesize of 64-bit Wallace Tree CSA based Radix-8 Booth Multiplier Mohammad M. Asad Qasem Abu Al-Haija King Faisal University, Department of Electrical Department of Computer Information and Systems Engineering, Ahsa 31982, Saudi Arabia Engineering e-mail: [email protected] Tennessee State University, Nashville, USA e-mail: [email protected] Ibrahim Marouf King Faisal University, Department of Electrical Engineering, Ahsa 31982, Saudi Arabia e-mail: [email protected] Abstract—In this paper, we proposed different comparable researches to propose different solutions to ensure the reconfigurable hardware implementations for the radix-8 fast safe access and store of private and sensitive data by two operands multiplier coprocessor using Karatsuba method employing different cryptographic algorithms and Booth recording method by employing carry save (CSA) and kogge stone adders (KSA) on Wallace tree organization. especially the public key algorithms [1] which proved The proposed designs utilized robust security resistance against most of the attacks family with target chip device along and security halls. Public key cryptography is with simulation package. Also, the proposed significantly based on the use of number theory and designs were synthesized and benchmarked in terms of the digital arithmetic algorithms. maximum operational frequency, the total path delay, the total design area and the total thermal power dissipation. The Indeed, wide range of public key cryptographic experimental results revealed that the best multiplication systems were developed and embedded using hardware architecture was belonging to Wallace Tree CSA based Radix- modules due to its better performance and security. 8 Booth multiplier ( ) which recorded: critical path This increased the demand on the embedded and delay of , maximum operational frequency of , hardware design area (number of logic elements) System-on Chip ( ) [2] technologies employing of , and total thermal power dissipation estimated several computers aided ( ) tools along with the as . Consequently, method can be configurable hardware processing units such as field efficiently employed to enhance the speed of computation for programmable gate array ( ) and application many multiplication based applications such embedded system specific integrated circuits ( ). Therefore, designs for public key cryptography. considerable number of embedded coprocessors design Keywords-Cryptography; Computer Arithmetic; FPGA were used to replace software based (i.e. programming Design; Hardware Synthesis; Kogge-Stone Adder (KSA); Radix- based) solutions of different applications such as image 8 Booth Recording; Karatsuba Multiplier; Wallace Tree processors, cryptographic processors, digital filters, low power application such as [3] and others. The major part of designing such processors significantly I. INTRODUCTION encompasses the use computer arithmetic techniques in Recently, the vast promotion in the field of the underlying layers of processing. information and communication technology (ICT) such Computer arithmetic [4] or digital arithmetic is the as grid and fog computing has increased the inclination science that combines mathematics with computer of having secret data sharing over the existing non- engineering and deals with representing integers and secure communication networks. This encouraged the real values in digital systems and efficient algorithms DOI: 10.21307/ijanmc-2019-043 15 International Journal of Advanced Network, Monitoring and Controls Volume 04, No.02, 2019 for manipulating such numbers by means hardware using scalable and efficient modules. Thus, the circuitry and software routines. Arithmetic operations following subsections reviews the core design on pairs of numbers x and y include addition ( components for the proposed multiplier y), subtraction ( – ), multiplication implementation unit. ( ), and division ( ). Subtraction and division can be viewed as operations that undo the effects of addition and multiplication, respectively. Multiplication operation is considered as a core operation that affect the performance of any embedded system. Therefore, the use of fast multiplier units will result in enhancements in the overall performance of the system. Recently, several solutions were proposed for multiplication algorithms while few of them were efficient [5]. A multiplication algorithm [6] is method to find the product of two numbers, i.e. Multiplication is an essential building block for several digital processors as it requires a considerable amount of processing time and hardware resources. Depending on the size of the numbers, different algorithms are in use. Figure 1. Carry save Adder: (a) Top View Design (b) Internal Elementary-school grade algorithm was multiplying Architecture each number digit by digit producing partial sum with complexity of ). [5]For larger numbers, more A. Carry save Adder (CSA) efficient algorithms are needed. For example, let CSA [4] is a fast-redundant adder with constant integers to be multiplied with equal to carry path delay regardless of the number of operands’ 1k bits, thus multiplications. bits. It produces the result as two-dimensional vectors: However, more efficient and practical multiplication sum vector (or the partial sum) and carry vector (or algorithms will be discussed in the following partial carry). The advantage of CSA is that the speed subsections. is constant regardless the number of bits. However, its In this paper, we report on several fast alternative area increases linearly with the number of bits. The top designs for Radix-8 based multiplier unit including: view of the CSA unit along with its internal logic Radix-8 CSA Based Booth Multiplier, CSA Based design architecture are provided in Fig.1 below. Radix-8 Booth, Wallace Tree Karatsuba Multiplier, In this work, we have implemented the CSA adder CSA Based Radix-8 Booth, KSA Based Karatsuba using VHDL code for different bit sizes ranges from 8- Multiplier, CSA Based Radix-8 Booth, With bits through 64-bits [8]. The synthesize results of total Comparator Karatsuba Multiplier, Sequential 64-Bit delay in ( ) and area in Logic Elements (LEs) were CSA Based Radix-8 Booth Multiplier, 64-bit Wallace analyzed and reported in [8] and they are illustrated in Tree CSA based Radix-8 Booth multiplier (WCBM). Fig.2. These results were generated using The remaining of this paper is organized as follows: software [9], simulated for Section 2m discusses the core components of efficient model [10] and they multiplier design. Section 3,provides the proposed highly conform theoretical evaluation of CSA design alternatives of Radix-8 based multiplier, Section operation since the delay time is almost equal for all 4, presents the synthesizing results and analysis, and, bits. However, the area is almost double for each finally, Section 5 concludes the paper. number of bits. Also, the timing estimation of CSA was generated via Time II. CORE DESIGN COMPONENTS-REVIEW Analyzer tool provided in the package. Two operands-multiplication is a substantial Accordingly, the critical path delay is which arithmetic operation since it plays a major role in the is data arrival time while the data delay is only 2.866 design of many embedded and digital signal processors ns which provide a frequency of .Finally, to [7]. Therefore, the efficient design and implementation verify the performance of CSA, we have compared it of a fast multiplier unit is on demand. In this paper, we with the well-known Carry LockAhead Adder (CLA) propose a competitive reconfigurable multiplier design in terms of area and delay. CLA is a carry propagation 16 International Journal of Advanced Network, Monitoring and Controls Volume 04, No.02, 2019 adder (CPA) with logarithmic relation between the To verify the performance of all PPAs, we have carry propagation delay and the number of bits in implemented them on FPGA and the experimental operands. results [6] showed that KSA utilizes larger area size to achieve higher performance comparing among all other five PPAs. Thus, we decided to consider KSA as our basic carry propagation adder (CPA) to finalize the redundant results and to build up many other units that are in-need for conventional adder. In short, the simulation results of [6] showed that KSA leading the other adders as it has the smallest time delay with only 4.504. This result is very useful and conforms the theatrical modeling of KSA which has the least number of logic levels. Like all PPAs, KSA functionality consists of three computational stages as illustrated in Fig.3, as follows: Pre-processing stage: The computation of generate and propagate of each bit from A and B are done in this step. These signals are given by the logic equations: and Carry generation network: PPA differentiates from each other by the connections of the network. It computes carries of each bit by using generate and propagate signals from previous block. In this block two blocks are defined group generation and propagation (GGP), in addition to group generation only (GGO), as shown in Fig.3. Logic blocks used for the calculation of generate and propagate bits can be describe by the following logic equations: and ), Where the generation group have only logic equation for Figure 2. Delay-Area analysis of CSA vs

Radix-8 Design Alternatives of Fast Two Operands Interleaved

Primality Testing for Beginners

Advanced Synthesis Cookbook

Obtaining More Karatsuba-Like Formulae Over the Binary Field

Cs51 Problem Set 3: Bignums and Rsa

Quadratic Frobenius Probable Prime Tests Costing Two Selfridges

Modern Computer Arithmetic

Modern Computer Arithmetic (Version 0.5. 1)

New Arithmetic Algorithms for Hereditarily Binary Natural Numbers

3 Finite Field Arithmetic

Multiplying Large Integers Karatsuba Algorithm

Even Faster Integer Multiplication David Harvey, Joris Van Der Hoeven, Grégoire Lecerf

Fast Rational Function Reconstruction