International Journal of Advancements in Research & Technology, Volume 2, Issue 9, September-2013, ISSN 2278-7763

16-Bit RISC Processor Design for Convolution Application

Anand Nandakumar Shardul
E & TC Department, Acropolis Institute of Research & Technology, RGPV University, Bhopal


Abstract: In this project, we propose a 16-bit non-pipelined RISC processor intended for signal processing applications. The processor consists of the following blocks: program counter, clock control unit, ALU, IDU and registers. Advantageous architectural modifications have been made to the incrementer circuit used in the program counter and to the carry select adder unit of the ALU in the RISC CPU core. Furthermore, a high-speed, low-power modified multiplier has been designed and introduced into the ALU. The RISC processor has been designed to execute a 27-instruction set, which is expandable up to 32 instructions based on user requirements.

Keywords: Arithmetic and Logical Unit (ALU), Program Counter (PC), Register file (REG), Instruction Decoder Unit (IDU) and Clock Control Unit (CCU).

INTRODUCTION

The trend in the recent past shows RISC processors clearly outperforming the earlier CISC processor architectures. The reasons are advantages such as a simple, flexible and fixed instruction format and hardwired control logic, which paves the way for higher clock speeds by eliminating the need for microprogramming. The combined advantages of high speed, low power, area efficiency and operation-specific design possibilities have made the RISC processor ubiquitous. The main feature of the RISC processor is its ability to support single-cycle operation, meaning that the instruction is fetched from the instruction memory at the maximum speed of the memory. RISC processors in general are designed to achieve this by pipelining, where clock cycles may be stalled because of a wrong instruction fetch when jump-type instructions are encountered. This reduces the efficiency of the processor. This paper describes a RISC architecture in which single-cycle operation is obtained without using a pipelined design. In effect, it averts the possible stalling of clock cycles. The development of CMOS technology provides very high density and high performance integrated circuits. The performance provided by existing devices has created a never-ending demand for increasingly better performing devices. This predicts the use of a whole RISC processor as a basic device by the year 2020.

Incitements and benefits

Before the RISC philosophy became prominent, many computer architects tried to bridge the so-called semantic gap, i.e. to design instruction sets that directly supported high-level programming constructs such as procedure calls, loop control, and complex addressing modes, allowing data structure and array accesses to be combined into single instructions. Instructions are also typically highly encoded in order to further enhance the code density. The compact nature of such instruction sets results in smaller program sizes and fewer (slow) main memory accesses, which at the time (early 1960s and onwards) resulted in tremendous savings on the cost of computer memory and disc storage, as well as faster execution. It also meant good programming productivity even in assembly language, as high-level languages such as FORTRAN were not always available or appropriate.

In the 1970s, analysis of high-level languages indicated some complex machine language implementations, and it was determined that new instructions could improve performance. Some instructions were added that were never intended to be used in assembly language but fit well with compiled high-level languages. Compilers were updated to take advantage of these instructions. The benefits of semantically rich instructions with compact encodings can be seen in modern processors as well, particularly in the high performance segment where caches are a central component (as opposed to most embedded systems). This is because these fast, but complex and expensive, memories are inherently limited in size, making compact code beneficial. Of course, the fundamental reason they are needed is that main memories (i.e. dynamic RAM today) remain slow compared to a (high performance) CPU core.

The RISC idea

The circuitry that performs the actions defined by the microcode in many (but not all) CISC processors is, in itself, a processor which in many ways is reminiscent in structure of very early CPU designs. In the early 1970s, this gave rise to ideas to return to simpler processor designs in order to make it more feasible to cope without (then relatively large and expensive) ROM tables and/or PLA structures for sequencing and/or decoding. The first (retroactively) RISC-labelled processor (IBM 801, Watson Research Centre, mid-1970s) was a tightly pipelined simple machine originally intended to be used as an internal microcode kernel, or engine, in CISC designs, but it also became the processor that introduced the RISC idea to a somewhat larger public. Simplicity and regularity in the visible instruction set as well would make it easier to implement overlapping processor stages (pipelining) at the machine code level (i.e. the level seen by compilers). However, pipelining at that level was already used in some high performance CISC "supercomputers" in order to reduce the instruction cycle time (despite the complications of implementing it within the limited component count and wiring complexity feasible at the time). Internal microcode execution in CISC processors, on the other hand, could be more or less pipelined depending on the particular design, and therefore more or less akin to the basic structure of RISC processors.

Why RISC?

Since the earliest days of computing, various attempts have been made to increase instruction execution rates by overlapping the execution of more than one instruction. The most common ways of overlapping are pre-fetching, pipelining and superscalar operation.

1) Pre-fetching: The process of fetching the next instruction or instructions into an event queue before the current instruction is complete is called pre-fetching. The earliest 16-bit microprocessor, the Intel 8086/8, pre-fetches into an on-board queue up to six bytes following the byte currently being executed, thereby making them immediately available for decoding and execution without latency.

2) Pipelining: Pipelining instructions means starting or issuing an instruction prior to the completion of the currently executing one. The current generation of machines carries this to a considerable extent. The PowerPC 601 has 20 separate pipeline stages in which various portions of various instructions are executing simultaneously.

3) Superscalar operation: Superscalar operation refers to a processor that can issue more than one instruction simultaneously. The PPC 601 has independent integer, floating-point and branch units, each of which can be executing an instruction simultaneously.

CISC machine designers incorporated pre-fetching, pipelining and superscalar operation in their designs, but with instructions that were long and complex and operand access that depended on complex address arithmetic, it was difficult to make efficient use of these new speed-up techniques. Furthermore, complex instructions and addressing modes hold down clock speed compared to simple instructions. RISC machines were designed to efficiently exploit the caching, pre-fetching, pipelining and superscalar methods that were invented in the days of CISC machines.

Detail of Logical Blocks

Fig. 1 shows the block diagram of the 16-bit RISC CPU. The proposed RISC CPU consists of five blocks, namely, Arithmetic and Logical Unit (ALU), Program Counter (PC), Register file (REG), Instruction Decoder Unit (IDU) and Clock Control Unit (CCU).

Architecture

The proposed RISC CPU is a single-cycle, non-pipelined processor with a uniform 16-bit instruction format. It has a load/store architecture, where operations are performed only on registers and not on memory locations. It follows the classical von Neumann architecture with just one common memory bus for both instructions and data. A total of 27 instructions have been designed as a first step in the development of the processor. The instruction set consists of Logical, Immediate, Jump, Load, Store and HALT type instructions. The HALT instruction acts as a border line between the instruction and data memory. This offers flexibility to the programmer who uses this processor core, who can define their own instruction and data memory within the allotted 64 memory registers. Each of the registers is 16 bits wide.

1) Program Counter: The Program Counter (PC) is a 16-bit latch that holds the memory address of the location from which the next machine language instruction will be fetched by the processor. The proposed PC is the largest sub-block and second only to the control unit in complexity. It controls the flow of instruction execution and ensures the logical operation flow of the processor. It performs two operations, namely, incrementing and loading. For most instructions, the PC is simply incremented in preparation for the following instruction or the following instruction nibbles. In general, a normal conventional adder circuit would be used for the incrementing action. However, this leads to increased hardware use along with more power dissipation. Hence, this work strives for a novel, low-power incrementer circuit design. In this design, we employ a 6-bit pointer to indicate the instruction memory. It additionally uses a 6-bit pointer to point to the data memory, which is used only when a Load/Store instruction is encountered for execution.

2) Arithmetic and Logic Unit: The arithmetic and logic unit (ALU) performs arithmetic and logic operations. It also performs bit operations such as rotate and shift by a defined number of bit positions. The proposed ALU contains three sub-modules, viz. arithmetic, logic and shift modules. The arithmetic unit executes addition operations and generates the Sign flag and Zero flag according to the result. In order to reduce the complexity of the adder circuits used in the arithmetic unit of the RISC CPU, a very fast and low-power carry select adder circuit has been introduced. The ALU also contains a modified multiplier, which uses compressor circuits to achieve low power and improved speed of operation. The multiplier is designed to execute in a single cycle. Hence, it satisfies the requirement of the RISC design, to execute single cycle


instructions. The shift module is used for executing instructions such as rotation and shift operations. The shift module is mandatory for signal processing applications, which need division by 2. This is achieved by a single right shift operation. The logic unit is used to perform logical operations such as XOR, OR and AND. The data out of each ALU operation is written back into the corresponding destination register, along with the updated flags. In order to maintain simplicity of the design, the carry out of the ALU is not taken into consideration.

3) Register File: The register file consists of 8 general purpose registers of 16 bits each. These registers are used during the execution of arithmetic and data-centric instructions. The register file is fully visible to the programmer. Each register can be addressed as both source and destination using a 3-bit identifier in the range 000 to 111. The load instruction is used to load values into the registers, and the store instruction is used to write the values back to memory so that the processed outputs can be obtained from the processor. The Link register is used to hold the addresses of the corresponding memory locations.

4) Instruction Decoder Unit: Our instruction set is limited yet comprehensive. Since the op-code field is only 5 bits wide, it was decided to keep the number of supported instructions within 32 for easier implementation. At present, only 27 instructions have been implemented. The rest have been reserved for porting digital processing applications to our processor. The decoder unit decodes the instruction and gives out the 3-bit source and destination addresses, depending on the operation specified by the op-code. It also decides whether the write-back circuit has to be enabled or not.

5) Clock Control Unit: It fetches the code of all of the instructions in the program. It directs the operation of the other units by providing timing and control signals. All computer resources are managed by the control unit. It directs the flow of data between the Central Processing Unit (CPU) and the other devices. The control unit is the circuitry that controls the flow of data through the processor and coordinates the activities of the other units within it. In a way, it is the "brain within the brain", as it controls what happens inside the processor, which in turn controls the rest of the computer.

Fig. 1 Block Diagram of the Processor.

A. Figures and Flow chart. The figures show the block diagram of the processor and its flow chart. As mentioned earlier, the system is organised as shown in Fig. 1. All the registers are assigned their tasks at once: the registers and the other required peripheral blocks take up the task at the same time and process it simultaneously. The flow chart shows how control flows through the processor. Initially all the registers and the memory are made ready; then the data is fetched and processed with the help of the same registers and memory registers. This also occurs in a simultaneous manner. The complete processor flow can be understood from the flow chart (Fig. 2).
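As a rough illustration of the ALU behaviour described in the block descriptions above, the following Python sketch models it at the functional level only: 16-bit results, a Sign flag and a Zero flag, no carry out, and a single right shift for division by 2. It is not the Verilog design of this work. The function name `alu`, the operation labels and the choice of a logical (unsigned) right shift are assumptions made for illustration, and the multiplier and the carry select adder structure are not modelled.

```python
MASK16 = 0xFFFF  # results are truncated to 16 bits, so the carry out is discarded


def alu(op, a, b=0):
    """Return (result, sign_flag, zero_flag) for one 16-bit ALU operation.

    The operation labels are hypothetical names, not the op-codes of the paper.
    """
    a &= MASK16
    b &= MASK16
    if op == "ADD":
        result = a + b
    elif op == "AND":
        result = a & b
    elif op == "OR":
        result = a | b
    elif op == "XOR":
        result = a ^ b
    elif op == "SHR":
        # Assumption: logical shift, which halves an unsigned operand.
        result = a >> 1
    else:
        raise ValueError(f"unknown operation: {op}")
    result &= MASK16
    sign_flag = (result >> 15) & 1  # most significant bit of the result
    zero_flag = int(result == 0)
    return result, sign_flag, zero_flag
```

For example, `alu("SHR", 10)` returns `(5, 0, 0)`, and `alu("ADD", 0xFFFF, 1)` wraps around to `(0, 0, 1)` because the carry out is ignored.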

Fig. 2 Flow Chart
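The flow in Fig. 2 can be approximated in software as a fetch, decode and execute loop over a single 64-word memory, with HALT marking the border between instructions and data. The Python sketch below is a minimal behavioural model under stated assumptions, not the processor's actual instruction set: the op-code values, the bit layout (op-code in bits 15..11, register in bits 10..8, a 6-bit address or 3-bit source register in the low bits) and the names `encode` and `run` are all hypothetical, and only four instruction types are modelled.

```python
HALT, LOAD, STORE, ADD = 0, 1, 2, 3  # hypothetical op-code values
MASK16 = 0xFFFF


def encode(op, rd=0, low=0):
    """Pack one 16-bit word: op in bits 15..11, rd in 10..8, low in 5..0."""
    return ((op & 0x1F) << 11) | ((rd & 0x7) << 8) | (low & 0x3F)


def run(memory):
    """Fetch, decode and execute from address 0 until HALT is reached."""
    regs = [0] * 8  # eight 16-bit general purpose registers
    pc = 0  # 6-bit instruction pointer
    while True:
        word = memory[pc]  # fetch
        op = (word >> 11) & 0x1F  # decode the assumed 5-bit op-code
        rd = (word >> 8) & 0x7  # 3-bit register identifier
        low = word & 0x3F  # 6-bit data address, or source register in bits 2..0
        pc = (pc + 1) & 0x3F  # increment the program counter
        if op == HALT:
            return regs
        if op == LOAD:
            regs[rd] = memory[low] & MASK16
        elif op == STORE:
            memory[low] = regs[rd]
        elif op == ADD:
            regs[rd] = (regs[rd] + regs[low & 0x7]) & MASK16
        else:
            raise ValueError(f"op-code {op} is not modelled in this sketch")


# Program in words 0..4, HALT as the border, data in words 5..7.
memory = [0] * 64
memory[0] = encode(LOAD, rd=0, low=5)
memory[1] = encode(LOAD, rd=1, low=6)
memory[2] = encode(ADD, rd=0, low=1)
memory[3] = encode(STORE, rd=0, low=7)
memory[4] = encode(HALT)
memory[5], memory[6] = 7, 35
registers = run(memory)
```

After the run, word 7 of `memory` holds the sum 42. Sharing one memory array between program and data mirrors the single common memory bus described in the Architecture section; the loop models one instruction per iteration and says nothing about the timing of the real hardware.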


Result

The RISC processor described above is designed using Verilog HDL (mixed language) and simulated using Xilinx ISE 14.2. The proper functioning of the processor has been validated. The simulation result shows that the processor is capable of executing a given instruction in a single clock cycle, thereby satisfying the basic requirement of a RISC processor. In order to evaluate the system performance, synthesis software was used to map the proposed processor onto a target library.

Conclusion

The design of a single-cycle, 16-bit, non-pipelined RISC processor for convolution applications has been presented. Novel adder and multiplier structures have been employed in the RISC architecture. The processor has been designed to execute an instruction set comprising 27 instructions in total. It is expandable up to 32 instructions, based on user requirements. The processor design is promising for use in signal processing applications in general.

Acknowledgment

We place on record our gratitude to the Department of Electronics and Communication Engineering, Acropolis Institute of Research & Technology, Indore, for the support rendered to us in carrying out this work.

Copyright © 2013 SciResPub. IJOART