RUN-TIME CUSTOMIZATION OF A SOFT-CORE CPU ON AN FPGA

A DISSERTATION SUBMITTED TO THE UNIVERSITY OF MANCHESTER

FOR THE DEGREE OF MASTER OF SCIENCE

IN THE FACULTY OF ENGINEERING AND PHYSICAL SCIENCES

2015

By Rehab Abdullah Shendi School of Computer Science

Contents

Abstract ...... 8

Declaration ...... 9

Copyright ...... 10

Acknowledgements ...... 11

Dedication ...... 12

1 Introduction ...... 13

1.1 Aim and Objectives ...... 14

1.2 Report Outline ...... 15

2 Background ...... 16

2.1 Reconfigurable Computing ...... 16

2.1.1 History ...... 16

2.1.2 FPGA ...... 17

2.1.3 Reconfiguration Hardware ...... 20

2.1.4 Partial Reconfiguration ...... 21

2.2 Architecture ...... 26

2.2.1 RISC Microprocessor ...... 26

2.2.2 Soft-Core Microprocessor ...... 27

2.2.3 MIPS Architecture...... 28

2.3 Reconfigurable CPU Instruction Set Extensions ...... 30

2.3.1 Custom Instructions in Hardware ...... 31

2.3.2 Custom Instructions in Software ...... 32

2.4 Design Considerations ...... 33

2.5 Previous Work ...... 35

2.5.1 Instruction Set Extension ...... 35

2.5.2 Partial Reconfiguration ...... 35

3 System Design and Methodology ...... 37

3.1 System Development Methodology ...... 37

3.2 Implementation Tools ...... 44

3.2.1 Hardware Description Language ...... 44

3.2.2 ISE (Xilinx, 2013) : ...... 45

3.2.3 Cross compiler: ...... 46

3.2.4 FPGA Platform: ...... 46

3.2.5 GoAhead ...... 47

3.3 System Design...... 47

3.3.1 System Definition and Scope ...... 48

3.3.2 System Architecture and Components ...... 48

4 Implementation ...... 54

4.1 Baseline MIPS Soft-Core ...... 54

4.2 Custom Instruction in Software ...... 57

4.3 Configuration Controller Modules ...... 59

4.4 Custom Instruction in Hardware ...... 61

4.5 Challenges During Implementation ...... 69

5 Testing, Results and Evaluation ...... 70

5.1 Testing ...... 70

5.2 Results ...... 76

5.3 Evaluation ...... 77

6 Conclusions and Future Work ...... 81

6.1 Conclusions ...... 81

6.2 Future Work ...... 81

Works Cited ...... 83

Appendix A - MIPS CPU ...... 87

Appendix B - Trap handler based on MUX ...... 96

Appendix C - Trap handler based on ICAP ...... 98


(Word count 16033)


List of Tables

Table 1 Configuration speeds with ICAP achievement (Hansen, Koch and Torresen, 2011) ...... 25

Table 2 Type of MIPS instructions (Fritzell, 2013) ...... 30

Table 3 Descriptions of using ICAP_SPARTAN6 Port (Xilinx Inc, 2015) ...... 52

Table 4 Custom instructions’ address and ID ...... 60

Table 5 An example of bitstream for the IPROG command using ICAP (Xilinx Inc, 2015) ...... 61

Table 6 Resource requirements for Configuration controller ...... 76

Table 7 Resource requirements for Custom modules ...... 77

Table 8: comparison between Xilinx Embedded processors with our soft-core and their Performance ...... 79

Table 9 Software requirements ...... 80


List of Figures

Figure 1 Classification of FPGAs (Koch, 2013) ...... 21

Figure 2 Baseline model of partial reconfiguration (Koch, 2013) ...... 22

Figure 3 Styles of reconfigurable modules placement. (a) Island style. (b) Slot style. (c) Grid style (Koch, 2013) ...... 23

Figure 4 a) a typical CPU b) extensions CPU with Reconfigurable Instructions (Koch, 2013) ...... 31

Figure 5 Design and development tools (Minev and Kukenska, 2007) ...... 34

Figure 6: The general approach of the system development stages (Soft, 2013) ...... 37

Figure 7: A step-by-step design and implementation method ...... 38

Figure 8: First step, system overview ...... 39

Figure 9 Third Step, system overview ...... 40

Figure 10: Four step, system overview ...... 40

Figure 11 Five step: system overview of the first approach ...... 42

Figure 12 Five step: system overview of the second approach. (Xilinx, 2012) ...... 43

Figure 13 Xilinx Spartan-6 LX16 FPGA platform (Nexys3™ Board Reference Manual, 2013) ...... 46

Figure 14 The final system design ...... 49

Figure 15 The non-pipelined MIPS shows the most important signals and logics (Fritzell, 2013) ...... 50

Figure 16: ICAP Primitive (Xilinx Inc, 2015) ...... 52

Figure 17 Custom Module Logic ...... 53

Figure 18 The Program Counter process overview that consists of extra logic and flip-flops to handle branch and jump instructions. (Fritzell, 2013) ...... 55

Figure 19 Datapath for the multiplication, allowing two clock cycles for execution. (Fritzell, 2013) ...... 57

Figure 20 Adding Custom instruction in the compiler ...... 58

Figure 21 Trap Handler State Machine ...... 59

Figure 22 Custom Instruction (CI) act as extension of the ALU ...... 61

Figure 23 On-FPGA Communication for Custom Instructions ...... 62

Figure 24 Static implementation ...... 64

Figure 25 Partial Part: the example shows the implementation CRC instruction ...... 66

Figure 26 GoAhead GUI. The graphical user interface of the GoAhead ...... 68

Figure 27 GoAhead Script ...... 68

Figure 28 Test bench of the MIPS CPU and ROM. Panels (a), (b) and (c) present one test bench that shows different signals: (a) instruction encoding, decoding and ALU functionalities, (b) Program counter functionality and (c) branch delay and ROM functionalities ...... 71

Figure 29 ModelSim Simulation of CRC-32 Module ...... 72

Figure 30 ModelSim simulation of One Counter Module ...... 72

Figure 31 ModelSim Simulation of Parity generation module ...... 73

Figure 32 ModelSim simulation of Leading Zero Counter Module ...... 73

Figure 33 ModelSim Simulation of Mux Based TrapHandler ...... 74

Figure 34 ModelSim simulation of ICAP based Trap Handler ...... 74


Abstract

RUN-TIME CUSTOMIZATION OF A SOFT-CORE CPU ON AN FPGA Rehab Abdullah Shendi A dissertation submitted to the University of Manchester For the degree of Master of Science, 2015

The use of customised soft-core processors, in which application-specific instructions are integrated into the system in hardware, is increasing in the Field Programmable Gate Array (FPGA) field. Specifically, the partial run-time reconfiguration of FPGAs in specialised processors for a particular domain can be very beneficial. In this report, the design and implementation for the customisation of a soft-core MIPS processor using an FPGA and partial reconfiguration (PR) of FPGA technology will be addressed to achieve efficient resource use. This can be achieved using a PR design flow that helps the design fit into a smaller device. Moreover, the impact of static power consumption could be reduced due to run-time reconfiguration. This will be done by configurable custom instructions implemented in the hardware as an extension on the MIPS CPU. The aim of this project is to investigate the PR of FPGAs for run-time adaptations of the instruction set of a soft-core CPU, including the integration of custom instructions and the exploration of the potential to use the MultiBoot feature available in Xilinx FPGAs to carry out the PR process. The system will be evaluated and tested on a Nexys 3 development board featuring a Xilinx Spartan-6 FPGA. The system will be able to load reconfigurable custom instructions dynamically into user programs with the help of the trap handler when the custom instruction is called by the MIPS CPU. The results of this experiment demonstrate that custom instructions in hardware can speed up a certain function and that many instructions can be saved when compared to a software implementation of the same function. Implementing custom instructions in hardware is feasible and worth exploring.


Declaration

No portion of the work referred to in this dissertation has been submitted in support of an application for another degree or qualification of this or any other university or other institute of learning.


Copyright

i. The author of this thesis (including any appendices and/or schedules to this thesis) owns certain copyright or related rights in it (the “Copyright”) and s/he has given The University of Manchester certain rights to use such Copyright, including for administrative purposes.

ii. Copies of this thesis, either in full or in extracts and whether in hard or electronic copy, may be made only in accordance with the Copyright, Designs and Patents Act 1988 (as amended) and regulations issued under it or, where appropriate, in accordance with licensing agreements which the University has from time to time. This page must form part of any such copies made.

iii. The ownership of certain Copyright, patents, designs, trademarks and other intellectual property (the “Intellectual Property”) and any reproductions of copyright works in the thesis, for example graphs and tables (“Reproductions”), which may be described in this thesis, may not be owned by the author and may be owned by third parties. Such Intellectual Property and Reproductions cannot and must not be made available for use without the prior written permission of the owner(s) of the relevant Intellectual Property and/or Reproductions.

iv. Further information on the conditions under which disclosure, publication and commercialisation of this thesis, the Copyright and any Intellectual Property and/or Reproductions described in it may take place is available in the University IP Policy (see http://documents.manchester.ac.uk/DocuInfo.aspx?DocID=487), in any relevant Thesis restriction declarations deposited in the University Library, The University Library’s regulations (see http://www.manchester.ac.uk/library/aboutus/regulations) and in The University’s policy on presentation of Theses.


Acknowledgements

I would like to thank my supervisor, Dirk Koch, for giving me the opportunity to work in my favourite, and dream field in Computer Sciences: Computer System Engineering. His remarkable teaching and coaching strategies enabled me to give my best from the first day; without him this dream would not have been realised. Special thanks also go to my parents, sisters and my small family - Fahad, Qusai and Retal - for their help and encouragement during my studies. Thanks also to my friends for their support.


Dedication

From my heart to my brother—you are still here in my heart and mind. I miss you always, my best friend.


Chapter 1

1 Introduction

Field Programmable Gate Arrays (FPGAs) have become popular over the last decade as they allow designers to create complex digital designs at a low implementation cost. Application Specific Integrated Circuits (ASICs), in contrast, introduce a high initial cost and require a large amount of resources to create complex designs.

Modern FPGAs now occupy a central position in industry because they offer over 1000 multipliers, megabytes of on-chip memory, hundreds of thousands of logic cells and clock speeds of up to half a gigahertz. Moreover, the cost per function in FPGAs decreases significantly over time (Koch, 2013). Partial Reconfiguration (PR) is one of the most important features of modern FPGAs provided by the FPGA vendor Xilinx. It allows modules running on an FPGA to be dynamically reconfigured and swapped during execution while the remaining modules continue operating. PR is an active research topic in the Reconfigurable Computing and Adaptive Hardware field. FPGAs are less efficient in area, power and speed than ASICs; however, a system in which all or parts of the hardware are reconfigured at run-time can be more efficient than a static FPGA system. Extending a soft-core instruction set with user-defined instructions that speed up the execution of an application in a specific domain can benefit greatly from PR. Such benefits include integrating reconfigurable modules of different sizes into the system, placing them on the FPGA at run-time, and letting them communicate efficiently with the rest of the system without additional delay. In this project, the extension of a MIPS soft-core with a user-defined instruction set will be introduced with the help of PR. The aim of this project is to explore the efficient use of partial run-time reconfiguration with a library of CPU instruction set extensions.


This chapter presents basic information about the project. Section 1.1 describes the aims and objectives of the project, and section 1.2 presents the report outline.

1.1 Aim and Objectives

The aim of this project is to investigate Partial Reconfiguration (PR) of FPGAs for run-time adaptation of the instruction set of a soft-core CPU, including the integration of custom instructions. The report presents a practical introduction to the design of a soft-core processor with extensions, built up through step-by-step integration of the system for partial reconfiguration using the GoAhead tool flow. GoAhead supports all recent Xilinx FPGAs and includes some features that are not available in the other PR tools provided by the FPGA vendor Xilinx (Beckhoff, et al., 2012), as will be introduced in chapter 3.

The objective of this project is to investigate a custom instruction module library that offers low latency, low implementation cost in terms of logic resources, and high CPU clock cycle savings compared to software-only implementations.

• Learning Objectives

– Investigate and understand the concept of reconfiguration hardware.
– Review how custom instructions can be applied as an extension of the soft-core.
– Investigate and understand the concept of PR.
– Investigate and understand the topic of reconfiguration MultiBoot and its potential for use with PR.

• Deliverable Objectives

– Develop and implement custom instructions as an extension of a given soft-core on an FPGA.
– Understand and implement reconfigurable custom instructions for a soft-core on an FPGA.
– Analyse previous results and establish a performance concept.


1.2 Report Outline

Chapter 2: Background
This chapter will provide an overview of the relevant literature and related works as an introduction to reconfigurable hardware and FPGA architecture. PR concepts and details regarding the reconfiguration of FPGA devices will be included. Finally, microprocessor architecture, with a focus on MIPS and reconfigurable instruction set extensions, will be introduced.

Chapter 3: System design and methodology
This chapter introduces the system methodology considered for this project. The whole system used in the project, including the MIPS CPU and the peripheral components (memory, GPIO, ROM, and trap handler) connected by the system bus, will be presented.

Chapter 4: System implementation
This chapter discusses the implementation of the final system’s components and all related technical issues.

Chapter 5: Testing, results and evaluation
This chapter presents the tests conducted in this study, the results of these tests and an overall evaluation of the system.

Chapter 6: Conclusion and further work
This chapter summarises the report and presents recommendations for further improvement of the implemented system.

Appendix
Three appendices have been included:
Appendix A contains the VHDL code for the MIPS CPU.
Appendix B contains the VHDL code for the trap multiplexer.
Appendix C contains the VHDL code for the trap handler.


Chapter 2

2 Background

Three areas are dealt with in the background research. Firstly, the general area of reconfigurable computing, including FPGA architecture, is discussed. Then, in the second part, microprocessor architectures are discussed. Finally, the third part looks at the specific area of this project.

2.1 Reconfigurable Computing

Reconfigurable computing is a computing paradigm that combines the flexibility of software with high hardware processing performance through the use of flexible high speed fabrics such as FPGAs. Reconfigurable computing provides the ability to make substantial changes to the data path in addition to the control flow. Additionally, reconfigurable computing is able to adapt the underlying hardware during run-time by providing the option to load a new circuit on the reconfigurable fabric (Koch, 2013).

2.1.1 History

According to Bobda (2008), the history of reconfigurable computing can be traced back to the 1960s, when Gerald Estrin proposed a computer that was made up of a standard processor combined with an array of reconfigurable hardware. The core processor was used to control the behaviour of the reconfigurable hardware. Such a design was later adjusted to perform other tasks such as image processing (Lysaght & Subrahmanyam, 2005). These adjustments could be performed whenever the need arose and led to the development of a hybrid computer structure that possessed both the flexibility of software and the speed of hardware.

Since then, the design of reconfigurable computing has improved as many architectures have been developed by industry. Some of the designs that have been introduced to the market include Copacobana, Elixent, Silicon Hive, PiCoGA etc. The first reconfigurable architecture based computer for the commercial market was released in 1991 by Algotronix. This architecture was later adopted by Xilinx, which acquired Algotronix to improve it for commercial purposes (Algotronix.com, 2015).

2.1.2 FPGA

Field Programmable Gate Array (FPGA) technology has recently gained a lot of popularity for the production and prototyping of products in both small and moderate quantities. FPGAs are a special kind of Programmable Logic Device (PLD) that allows the implementation of general digital circuits, limited only by the circuit size. The circuit to be implemented is defined by programming the device. The capabilities of FPGAs have grown over the years and today a whole multiprocessor system can fit on a single device. The complex circuit designs needed for such devices are normally specified with the help of Hardware Description Languages (HDLs). HDLs are preferred for this type of application because they support circuit description with high-level language constructs.

FPGAs are comprised of a chip full of digital logic which allows for programmable connections between components. FPGA design tools are used to generate configuration files that contain the initial values and the required connections which can then be downloaded to the FPGA. The key feature of FPGAs lies in the fact that their design is completely soft and that it can be reprogrammed. However, this also means that if power is removed from them, they will lose their configuration. As such, they will require reprogramming in order to create another working design (Balwaik, et al., 2013).

The history of FPGAs dates from the late 1980s, with the increasing interest in extending the functionality of the large Programmable Logic Arrays (PLAs) that were being developed at the time (Bobda, 2008). The early 1990s witnessed the increased use of FPGAs in the networking and telecommunication industry due to their increased flexibility. At that time, they were preferred because it was possible to separate the development stage and hardware design from the logic design stage. As such, they were seen as helping vendors to engineer solutions without spending a lot of time designing the logic, as was the case with Application Specific Integrated Circuits (ASICs) (Parvez and Mehrez, 2011).

A part of the background to this study is soft-core processors. A soft-core processor is a microprocessor that is completely described using a Hardware Description Language and is synthesized for FPGAs. A soft-core processor designed for an FPGA is considered flexible because it can be readjusted by reprogramming the device. This is not possible with many other kinds of programmable hardware. Traditionally, such systems could be developed using ASIC technology. However, ASICs are not designed to allow reconfiguration. FPGAs have been demonstrated to create very powerful and highly performing systems because of their reprogramming feature (Musoll, 2010).

One limitation of FPGAs is that very few details of the low level implementation process are available to the end users (e.g. the encoding of the configuration data). Sufficient information about the choices made during the development process of FPGA technology is not often provided.

FPGA Architecture

FPGA technology can implement arbitrary user logic. There are three main resources available in FPGAs: 1) logic blocks, 2) I/O blocks and 3) a programmable interconnection.

Logic blocks

FPGA logic blocks consist of look-up tables (LUTs) and flip-flops (FFs). Each logic block can implement small functions of several variables.

The FPGA implements the Boolean logic with the help of the LUTs, which are the basic elements in FPGA architecture, providing the capability of programming whenever given a logic function (as long as it fits into the LUTs). A Boolean function is normally represented by a truth table stored in static random access memory (SRAM) cells. A LUT is normally linked to specific inputs; those with n inputs are referred to as n-LUTs (Munden, 2005). As such, an n-LUT is essentially a multiplexer that takes input signals from the configuration storage memory and forwards the selected one into an output signal line.
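The view of an n-LUT as a multiplexer over stored truth-table bits can be made concrete with a short software model. The following Python sketch is only an illustration and not part of the design described in this report; the function names `make_lut` and `lut_eval` are hypothetical. It stores the truth table of a Boolean function as a list of 2^n configuration bits and uses the inputs as select lines.

```python
def make_lut(n, func):
    """Build the 2**n configuration bits of an n-input LUT from a Boolean function."""
    table = []
    for index in range(2 ** n):
        # Bit i of the row index is the value of input i for this truth-table row.
        inputs = [(index >> i) & 1 for i in range(n)]
        table.append(func(*inputs) & 1)
    return table


def lut_eval(table, inputs):
    """Evaluate the LUT: the inputs act as multiplexer select lines into the table."""
    index = 0
    for i, bit in enumerate(inputs):
        index |= (bit & 1) << i
    return table[index]


# Example: a 3-input majority function stored in a 3-LUT.
majority = make_lut(3, lambda a, b, c: (a & b) | (a & c) | (b & c))
```

Reconfiguring the LUT in this model amounts to replacing the contents of `table`; the evaluation logic, which corresponds to the multiplexer, stays the same.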

LUT outputs are normally linked to a state flip-flop, which stores the current state of the synchronous circuit. Practical look-up tables provide additional features that vary among different families of FPGAs. Some of the features found on different FPGAs include distributed memory modes, the potential to combine adjacent LUTs into larger LUTs that have more inputs, and fast-carry ripple chain logic (Pedroni, 2010). In other words, LUTs are combined in the implementation with configurable registers and multiplexers in order to produce a logic cell. A logic cell is the main building block of the FPGA fabric, in that all logic not mapped to special blocks like DSPs, CPUs or BRAMs is implemented in logic cells. Xilinx FPGAs, for example, combine four logic cells into a slice and two slices into a configurable logic block (CLB). Slices can contain logic beyond the basic logic cells to implement fast carry chains, shift registers and distributed RAM by adding dedicated signals and logic between slices in the same column to propagate signals through many slices. This removes the need for routing through the interconnect.

I/O blocks

I/O blocks are used to connect the internal logic to the outside pins. I/O blocks are bidirectional, meaning they can either be used as inputs or outputs depending on the actual configuration. Different pins may be configured to different standards if the underlying device can support more than one I/O standard (Munden, 2005).

Programmable interconnection

Programmable interconnections are used to connect different logic blocks. The interconnections between FPGA logic blocks may be programmed in three ways: via SRAM cells, FLASH/electrically erasable programmable read-only memory (EEPROM) or antifuses. These hold the configurations defining the Boolean function and control the configured routing.

The majority of FPGAs use SRAM-based programmable interconnects. The SRAM cells drive pass transistors, tri-state buffers and multiplexers. SRAM is a volatile memory technology and needs to be programmed from an external memory each time power is applied to the device. During reconfiguration, these SRAM cells will be overwritten with new functions. FLASH, which is based on EEPROM technology, is non-volatile and will retain configuration data when power is removed from the device. Antifuse-based programmable interconnects create permanent connections in the configuration cells. Unlike FLASH, these interconnects may only be programmed a single time, after which the configuration cannot be changed.


The above architectures indicate the complex programming capabilities of FPGAs and may account for some of the problems involved in FPGA use. These problems include the fact that FPGAs consume a lot of power during programming and they also require a large amount of space which results in latency of routing and functional blocks. FPGAs also consume a significant amount of power and configuration memory during operation. Compared to ASICs, FPGAs also exhibit longer circuit delays (Lin et al., 2008).

Configuration details

FPGA configuration occurs when a bitstream is written to a device’s configuration port. The bitstream contains data for the SRAM cells that hold the device’s configuration. There are two types of configuration ports: external and internal. They have different interfaces to accommodate specific protocols and connections. Xilinx FPGA devices support reconfiguration of regions of the device during run-time. The smallest reconfigurable region is a configuration frame, whose size varies depending on the device.

2.1.3 Reconfiguration Hardware

The processors used in computing may be classified into three types (Bobda, 2007). The first, a general purpose processor (GPP), employs data, a control path and a data path to conduct computation, and does not necessarily alter the existing hardware. The second, a domain-specific processor (DSP), is used in situations in which a processor is only employed in one particular computation area. DSP data paths and operations are fitted to a set of algorithms, which reduces flexibility but boosts performance for the underlying domain. The third type, an application-specific processor (ASIP), achieves the best performance by executing the algorithm directly in hardware. Moreover, it does not employ instructions, which implies that, unlike the other processors, it is not restricted by the need for sequential execution.

The ideal processor would be one that combines the flexibility of GPPs with the performance power of an ASIP. Modern FPGA technology makes this possible as they can adapt to different problems in a form called reconfigurable hardware, in which all or parts of the hardware structure can be changed during execution. Despite the high static power consumption of modern FPGA devices, run-time reconfiguration can create flexible hardware by increasing device utilisation through device reconfiguration.

The architecture of FPGAs can be viewed from the perspective of their configuration capabilities: at the highest level, FPGAs can be separated into one-time configurable devices and reconfigurable devices. Figure 1 illustrates the major classifications of FPGAs in regard to their configuration capabilities.

Figure 1 Classification of FPGAs (Koch, 2013).

A globally reconfigurable device allows complete device configuration exchange, while partially configurable devices permit the exchange of only a fraction of the FPGA resources. PR can be accomplished either with active or passive operations (i.e. if the FPGA continues or stops operation during configuration).

2.1.4 Partial Reconfiguration

PR is associated with the ability of a reconfigurable device to change a portion of the reconfigurable hardware circuitry while the other portion is still running. Such reconfigurable designs require modular circuits created by different subcomponents. It is possible to swap out some sections of these subcomponents even when the FPGA is still running (Koch, 2013).

A full reconfiguration operation is normally done when the FPGA is in the reset mode, at which time an external controller is employed to reload the design into the chip; this improves functionality to critical parts of the design. In addition, PR can be used to create space for multiple modules at run-time by storing the partially reconfigurable modules expected to be changed. Figure 2 illustrates the baseline model of PR.


Figure 2 Baseline model of partial reconfiguration (Koch, 2013).

Figure 2 shows how active modules are exclusively placed within the reconfigurable region and how the swapping between the modules is accomplished by writing a partial configuration bitstream to the configuration port, as seen by the configuration data stream on the right hand side of Figure 2. PR is available in most modern FPGAs and allows a subset of the logic fabric to be dynamically reconfigured while the remaining logic continues to operate undisturbed. FPGA vendors, including Xilinx, offer this capability on their high-end FPGAs. PR is not only necessary for general purpose reconfigurable systems but is also preferred due to its extensibility and flexibility (Koch, 2013).

To undertake partial run-time reconfiguration, it must be supported by the hardware of the devices mentioned above. Reconfiguration in one section of the device must not stop operation in other sections. PR may be classified according to the frequency of reconfiguration applicable within an operation clock cycle. These classifications are: single-cycle reconfiguration (frequently applicable), sub-cycle reconfiguration and multi-cycle reconfiguration (seldom applicable). In multi-cycle reconfiguration, reconfiguration requires more than a single system clock cycle because the reconfiguration data is transferred from memory to configuration cells in a serial fashion. Single-cycle reconfiguration occurs when a redesign involves a change of logic on the device within a single cycle of the system clock. Context switching may not be undertaken in run-to-completion modules, as the module’s internal state would not be stored.

The reconfigurable system is divided into two important parts. The part of the system that is always present is called the static region, and can include a memory controller, a soft CPU or configuration port interface logic. The second part, which contains run-time reconfigurable modules, is typically provided as one or more partial regions. Different methods of conducting PR exist, including small changes in net lists, routing and LUT functions, or even large module replacement (Koch, 2013).

Style of module placement

There are various methods available for PR; for example, the manner in which the area set aside for PR is used divides PR into different styles of configuration. One method of conducting PR is substituting larger portions of logic, known as modules, at every reconfiguration. This is termed module-based reconfiguration. The area where PR modules are placed could hold: a) only one module in a reconfigurable region, b) modules arranged in a one-dimensional fashion or c) modules arranged in a two-dimensional fashion. Figure 3 shows the partial region and the different styles in which modules can be arranged in it.

Figure 3 Styles of reconfigurable modules placement. (a) Island style. (b) Slot style. (c) Grid style (Koch, 2013).

Island styles are supported by the Xilinx PR flow. In the "island style", only one module will be present in the PR region, while switching between other modules could be carried out in the static part of the system. A PR region has to accommodate all modules that the system will need. The design could be a single or multi island style. With the latter, the developer should consider that the same resources will be shared by all of the islands. On the other hand, in the "slot style", PR regions will be divided into slots that have the same size, so a region will not be limited to one module as in the "island style". Varying slot requirements for different modules could cause fragmentation challenges inside the PR region. As a result, replacing modules in the "slot style" will not be as straightforward as in the "island style", in which there is only the matter of choosing between the islands (Koch, 2013).
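The fragmentation problem of the slot style can be illustrated with a small software model. The Python sketch below is not taken from any PR tool; the first-fit policy and the helper names `place_module` and `remove_module` are assumptions made purely for illustration. It shows a region in which enough slots are free in total, but a wider module still cannot be placed because the free slots are not adjacent.

```python
def place_module(slots, name, width):
    """First-fit placement of a module that needs `width` adjacent free slots.

    Returns the index of the first slot used, or None if no contiguous
    run of free slots is wide enough.
    """
    run = 0
    for i, owner in enumerate(slots):
        run = run + 1 if owner is None else 0
        if run == width:
            start = i - width + 1
            for j in range(start, i + 1):
                slots[j] = name
            return start
    return None


def remove_module(slots, name):
    """Free every slot currently occupied by the named module."""
    for i, owner in enumerate(slots):
        if owner == name:
            slots[i] = None


# A PR region with six equally sized slots (None marks a free slot).
slots = [None] * 6
place_module(slots, "A", 2)   # occupies slots 0-1
place_module(slots, "B", 1)   # occupies slot 2
place_module(slots, "C", 2)   # occupies slots 3-4
remove_module(slots, "A")     # slots 0, 1 and 5 are now free

# Three slots are free, but they are not adjacent, so a three-slot
# module cannot be placed without first moving module B or C.
result = place_module(slots, "D", 3)
```

In the island style this situation cannot arise, since each island holds at most one module and the only decision is which island to use.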

Module footprint

Interchanging modules between the various islands or slots on the device requires the designer to consider the resources required by the module, the existing FPGA fabric and the manner in which resources are laid out on the device. A PR module has a resource footprint which has to fit the resource footprint of the FPGA. Therefore, when a module is moved to a new group of slots, the slots have to fit the module footprint exactly. Permitting module relocation brings challenges. One challenge is the change in signal timing, which adds a timing footprint: timing can change depending on the position to which the module is relocated. Some sections of the FPGA could have longer routing delays due to hidden features, for instance the configuration logic.
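The resource footprint constraint can be sketched as a simple matching problem. In the Python example below, the fabric is modelled as a string of column types, where the letters C (logic), B (Block RAM) and D (DSP) are a hypothetical encoding chosen for illustration and do not describe the layout of any real device. A module can only be relocated to an offset at which the sequence of column types matches its footprint exactly. Timing effects are not modelled.

```python
def valid_positions(fabric, footprint):
    """Return every column offset at which the module footprint matches the fabric."""
    width = len(footprint)
    return [x for x in range(len(fabric) - width + 1)
            if fabric[x:x + width] == footprint]


# Hypothetical fabric: one character per resource column.
fabric = "CCBCCDCCBCC"

# A module that needs a logic column, a Block RAM column and another logic column.
module = "CBC"

positions = valid_positions(fabric, module)
```

A module whose footprint contains a resource type that does not occur in the fabric has no valid position at all, which is why the footprint has to be considered when the partial region is planned.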

Spartan-6 configuration Configuration frames are central to the configuration of the Spartan-6. The configuration frames of Spartan-6 devices are classified into three types, each holding data for a different part of the device (Xilinx Inc, 2013): Type 0; Type 1, or the Block RAM; and Type 2, or the IOB. Configuration is carried out using three operations offered by the configuration logic, each identified by a 2-bit code: "00": NOP; "01": READ; and "10": WRITE. A configuration command is executed when a configuration register is written with data (Xilinx Inc, 2011). Every configuration register is described in the Spartan-6 configuration user guide (Xilinx Inc, 2015). Configuration data is organised into two kinds of packets: Type 1, which carries short blocks of 16-bit data words; and Type 2, which can carry long blocks of 16-bit data words.
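The way an operation code and a register address are combined into a packet header can be sketched in a few lines of Python. The field layout used here (a 3-bit packet type, the 2-bit operation code, a 6-bit register address and a 5-bit word count) and the address 0x05 for the command register are assumptions based on the Spartan-6 configuration user guide rather than details given in this section, so they should be checked against that guide.

```python
# Sketch: packing a Spartan-6 Type 1 packet header into one 16-bit word.
# Assumed layout: [15:13] packet type, [12:11] opcode, [10:5] register, [4:0] word count.

NOP, READ, WRITE = 0b00, 0b01, 0b10   # 2-bit operation codes
TYPE1 = 0b001                          # assumed header type value for Type 1 packets
CMD_REGISTER = 0x05                    # assumed address of the command register


def type1_header(opcode: int, register: int, word_count: int) -> int:
    """Return the 16-bit Type 1 header for the given operation."""
    if not 0 <= opcode <= 0b11:
        raise ValueError("opcode must fit in 2 bits")
    if not 0 <= register < 64:
        raise ValueError("register address must fit in 6 bits")
    if not 0 <= word_count < 32:
        raise ValueError("word count must fit in 5 bits")
    return (TYPE1 << 13) | (opcode << 11) | (register << 5) | word_count


# Header announcing one data word to be written to the command register.
header = type1_header(WRITE, CMD_REGISTER, 1)
print(f"0x{header:04X}")  # prints 0x30A1
```

Under these assumptions, a write of one word to the command register is announced by the header 0x30A1, which is then followed by the 16-bit command itself.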

Spartan-6 bitstream In order to configure a Xilinx device, a bitstream needs to be applied to one of the configuration interfaces. The bitstream, as mentioned before, is an encapsulation of the configuration data packets. The format of the bitstream in Spartan-6 devices is as follows (Xilinx Inc, 2015):

• Dummy words: prepare the pipeline of the configuration interface for data.
• Synchronisation words: two 16-bit words used for synchronisation (0xAA99 and 0x5566).
• Header.
• Configuration body.
• Header2.
• De-synchronisation word: one 16-bit word signalling the end of the bitstream (0x000D).

During reconfiguration, the header is used to set up the configuration registers, whereas the configuration body carries the data that is written to the configuration frames of the device. Header2 can also be used for setting further configuration registers.

Internal Configuration Access Port (ICAP) During run-time reconfiguration, the system has to write the configuration data into the configuration cells. On Xilinx devices, this is done by writing the data to the Internal Configuration Access Port (ICAP). ICAP can be considered the internal version of the SelectMAP port, one of the external configuration ports on Spartan-6. The following table shows the configuration speeds that can be achieved.

Bit width    Frequency (MHz)    Configuration speed (Mb/s / MB/s)
8 bit        100                800 / 100
16 bit       100                1600 / 200

Table 1 Configuration speeds achieved with ICAP (Hansen, Koch and Torresen, 2011).
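The figures in Table 1 follow from multiplying the port width by the clock frequency. As a rough illustration, the Python sketch below reproduces them and estimates the time needed to write a partial bitstream. The 50 000-byte bitstream size is a hypothetical value chosen for the example, and the calculation ignores any command or memory-access overhead.

```python
def icap_throughput_mbit(width_bits: int, clock_mhz: float) -> float:
    """Peak throughput in Mb/s, assuming one word is written per clock cycle."""
    return width_bits * clock_mhz


def reconfiguration_time_ms(bitstream_bytes: int, width_bits: int,
                            clock_mhz: float) -> float:
    """Best-case time in milliseconds to stream a bitstream through the port."""
    bits = bitstream_bytes * 8
    bits_per_second = icap_throughput_mbit(width_bits, clock_mhz) * 1e6
    return bits / bits_per_second * 1e3


for width in (8, 16):
    mbit = icap_throughput_mbit(width, 100)
    print(width, "bit:", mbit, "Mb/s =", mbit / 8, "MB/s")

# Hypothetical 50 000-byte partial bitstream over the 16-bit port at 100 MHz.
print(reconfiguration_time_ms(50_000, 16, 100), "ms")
```

The estimate is a lower bound: a real system also has to fetch the bitstream from memory, which may be slower than the port itself.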


On Spartan-6 devices (Xilinx Inc, 2015), the ICAP_SPARTAN6 primitive has an input (I) data port that accepts 8- or 16-bit words of configuration data and an output (O) port which is used for read-back of configuration data already present on the device. The primitive is controlled by setting the write enable (WRITE) and clock enable (CE) signals, and data is read or written by the primitive on the rising edge of the clock (CLK).

Relocation of partial module bitstreams Module relocation occurs when the system is able to move modules between various slots, as opposed to fitting a module to one particular slot in the PR area. The benefit of module relocation is the flexibility gained in module placement. Challenges such as external fragmentation can be handled more easily because modules can be moved between slots. In addition, placement and module scheduling become much easier, because every module fits more than a single slot. There are various methods of implementing module relocation. One such method is to create a separate bitstream for every slot in which a module may be placed. An important way of reducing the storage required in a system that supports module relocation is to keep position-independent bitstream data separate from position-dependent data. With this separation, only the position-dependent data must be stored for every position.

2.2 Microprocessor Architecture

2.2.1 RISC Microprocessor

RISC, or Reduced Instruction Set Computer, is a type of microprocessor architecture whose instruction set consists of small, simple, fixed-size instructions, with the aim of making the whole architecture faster by executing them within one cycle. Moreover, RISC CPUs make less use of the memory because they are designed with a larger number of registers and only two kinds of dedicated instructions, load and store, that access the memory. In contrast, CISC, or Complex Instruction Set Computer, can perform memory accesses from many different instructions. Examples of well-known RISC processors that are widely used in different hardware devices around the world are DEC Alpha, AMD Am29000, ARC, ARM, AVR, i860 and i960, MIPS, PA-RISC, Power (including PowerPC), RISC-V, SuperH, and SPARC.

2.2.2 Soft-Core Microprocessor

Soft-core processors are implemented wholly through logic synthesis and can be realised on different semiconductor devices that contain programmable logic. Many soft-core processors have been targeted at FPGA implementation. A typical soft-core CPU includes an instruction set, a register file, an arithmetic-logic unit and other features. The performance of soft-core CPUs implemented on FPGAs is generally considered to be lower than that of CPUs implemented as ASICs, because an FPGA implementation carries the overhead of a reprogramming capability that is not found in the ASIC architecture. However, a soft-core CPU can be improved if a problem with the design is found. This is one of the advantages of FPGA technology over ASIC technology. For example, a new performance requirement of the CPU can be met by adjusting the parameters of the system on the FPGA.

As mentioned above, there are many types of soft-core CPUs and corresponding development tools. Some popular soft-core CPUs include Xilinx MicroBlaze, Altera Nios/Nios II and LatticeMico32. These CPUs are offered together with logic and memory elements and several intellectual property peripherals, which are required for the rapid development of a System-on-Programmable-Chip.

A number of the soft-core processors that have been developed using FPGA technology are discussed below, and their functional details and performance provided (Levy and Conte, 2009).

MicroBlaze soft-core processor One of the most popular soft-core processors is the MicroBlaze soft-core processor from the FPGA vendor Xilinx. It has a 32-bit Reduced Instruction Set Computer (RISC) architecture and can be customised with a number of memory and peripheral configurations. It has three pipeline stages, and instruction latencies vary. The Xilinx Platform Studio software can be used in the design process; it provides a user-friendly environment that is able to generate a MicroBlaze system. MicroBlaze adopts the Harvard memory architecture, which consists of two local memory buses: one that connects the data memories and one that is used for the instructions. The number and size of memory peripherals can be selected by the user. The processor is capable of operating at up to 200 MHz in Virtex-4 devices (Le Gal and Jego, 2013).

Nios II soft-core processor This soft-core CPU has a load-store RISC architecture. The processor has many architectural parameters that can be configured easily at design time. For example, the user can choose between a 32- or 16-bit datapath width, as well as the cache size and register file size. Custom instructions allow the user to customise the hardware, which can be used to accelerate the CPU. Off-the-shelf intellectual property is readily integrated, which reduces the time required to design and set up a SoC (Microelectronics International, 2012).

LatticeMico32 soft-core processor This is another example of a soft-core processor, but one that is in many ways unlike the two examples discussed above. Although it employs a RISC architecture just like them, it is completely open. Additionally, it uses a smaller number of LUTs on the FPGA, which makes it cheaper than the others, and it is easy to configure with the options required by the application (Chu, 2008).

2.2.3 MIPS Architecture

MIPS Overview In this project, the MIPS architecture, shown in Figure 15 below, is used as a demonstrator for the custom instruction implementation in hardware. It is used to implement a 32-bit processor. Moreover, it is an example of a RISC architecture and one of the most widely supported processors, and it has been used in research on efficient processor organisations which can deliver high performance and high power efficiency.

The original MIPS architecture consists of the following functional blocks:


Instruction decoder: It will decode the simple MIPS instructions since all instructions are the same size with only three different formats.

Programme Counter (PC): It contains the address of the currently executed instruction and is incremented by 4 to point to the next instruction. When there is a branch or jump instruction, a delayed branch occurs, which means that one more instruction is executed before control transfers to the target address determined by the branch or jump instruction.

Arithmetic Logic Unit (ALU): It is a fundamental block of the CPU that performs arithmetic and logical operations on its operands, the data inputs to the ALU, in support of register-to-register, memory-to-register and register-to-memory instructions.

Registers: The MIPS processor has 32 general purpose registers, including register 0, which holds a constant zero. The other registers are used by the compiler as outlined in the "MIPS32® Instruction Set Quick Reference".

Memory: It will be only accessed via load and store instructions.

Pipeline registers are often placed between the functional blocks in order to allow the processor to run at high clock speeds by shortening the delay of each stage. The MIPS processor has been designed to use pipelining to improve throughput and performance. It includes a 5-stage pipeline: Instruction Fetch, Instruction Decode, Execute, Memory Access and Register Write Back.

MIPS Instruction Set The MIPS instruction set is divided into three core groups of instructions. Each one of them has its own encoding, as illustrated in the following table.

Instruction type    Bits 31-26    25-21    20-16    15-11    10-6     5-0
R-type              opcode        rs       rt       rd       shamt    funct
I-type              opcode        rs       rt       immediate (bits 15-0)
J-type              opcode        address (bits 25-0)

Table 2 Type of MIPS instructions (Fritzell, 2013).

Table 2 shows that each type has a 6-bit main opcode that is used by the decoder to determine the instruction, while the other fields, rs, rt and rd, are register addresses into the register file. The instruction types are used as follows:

• R-type instructions are arithmetic instructions that use two operands from the register file, rs and rt, and return the result of the operation to the register rd. R-type instructions can share their opcode with other instructions, in which case the funct code determines the operation.
• I-type instructions are load/store instructions that use a register, rs, with a constant value, coded as the immediate; the result is returned to the register rt. I-type instructions are also used for branches, in which case the immediate is added to the current PC to perform the branch.
• J-type instructions are jump instructions that provide a new address for the programme counter. This means moving the execution to a new code block.
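The field boundaries in Table 2 can be illustrated with a short Python sketch that splits a 32-bit instruction word into its fields. The classification rule used here (opcode 0 for R-type, opcodes 2 and 3 for J-type, everything else treated as I-type) is a simplification assumed for the example and does not cover every MIPS encoding.

```python
def decode(word: int) -> dict:
    """Split a 32-bit MIPS instruction word into the fields of Table 2."""
    opcode = (word >> 26) & 0x3F
    if opcode == 0:                        # R-type: funct selects the operation
        return {"type": "R", "opcode": opcode,
                "rs": (word >> 21) & 0x1F, "rt": (word >> 16) & 0x1F,
                "rd": (word >> 11) & 0x1F, "shamt": (word >> 6) & 0x1F,
                "funct": word & 0x3F}
    if opcode in (2, 3):                   # J-type: j and jal
        return {"type": "J", "opcode": opcode, "address": word & 0x03FFFFFF}
    return {"type": "I", "opcode": opcode,   # simplification: all others as I-type
            "rs": (word >> 21) & 0x1F, "rt": (word >> 16) & 0x1F,
            "immediate": word & 0xFFFF}


# add $t0, $t1, $t2 is encoded as 0x012A4020 (rs = 9, rt = 10, rd = 8, funct = 0x20).
print(decode(0x012A4020))
# lw $t0, 4($sp) is encoded as 0x8FA80004 (opcode = 0x23, rs = 29, rt = 8, immediate = 4).
print(decode(0x8FA80004))
```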

2.3 Reconfigurable CPU Instruction Set Extensions

Many different applications can be handled using only general purpose processors (GPPs). However, most applications use only a small subset of the instructions available in the GPP. Therefore, adding a small amount of dedicated hardware for an application can give a large improvement in execution time. A compression algorithm, for example, might need to count the number of one-bits in a vector. Adding a dedicated hardware instruction for this operation speeds up the algorithm.
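To make the one-bit counting example concrete, the following Python sketch shows the kind of loop that a processor without a dedicated instruction has to execute. It is only an illustration of the software cost; the function name and the returned step count are hypothetical and are not taken from this project.

```python
def count_ones_software(value: int, width: int = 32):
    """Count one-bits by testing each bit in turn.

    Returns (count, steps), where steps is the number of loop iterations,
    each of which costs several standard instructions on a real CPU.
    """
    count = 0
    steps = 0
    for _ in range(width):
        count += value & 1   # test the lowest bit
        value >>= 1          # move to the next bit
        steps += 1
    return count, steps


count, steps = count_ones_software(0xF0F0)
print(count, steps)  # prints 8 32
```

A dedicated hardware instruction can produce the same count for a 32-bit operand in a single instruction, which is where the gain in execution time comes from.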

Extending the instruction set of a CPU is one way to do this, allowing for the acceleration of small parts of an application. The MicroBlaze and Nios soft-core CPUs from Xilinx and Altera are good examples of CPUs that allow custom instructions with the benefits of a fast RISC machine. The next section highlights the main points regarding custom instructions.


2.3.1 Custom Instructions in Hardware

Custom instructions enable a designer to replace a complex sequence of standard instructions with a single, simpler instruction built in hardware. A simple way of implementing such a custom instruction in a MIPS CPU, so that it can access the register file in the same way as the ALU, is shown in Figure 4.

Figure 4 a) a typical CPU b) extensions CPU with Reconfigurable Instructions (Koch, 2013).

Figure 4 shows that the CPU can be extended with exchangeable instructions by decoding instructions that are unused in the original CPU ISA. A multiplexer is then used to select between the normal ALU operation and one or more user-defined instructions. In this way, the configurable instruction is integrated into the CPU (Koch, 2013).

The custom instruction logic block has two input ports and one result output, as shown in Figure 4. Often, custom instructions operate in a single clock cycle. However, a multi-cycle operation can be considered for longer combinational paths. Through the use of custom instructions, it becomes possible to tailor the processor core to a certain application.
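The selection between the ALU and a custom instruction block described for Figure 4 can be modelled behaviourally in Python. This is a sketch under stated assumptions: the operation names, the select signal and the count_ones unit are hypothetical stand-ins, and the model ignores timing altogether.

```python
MASK32 = 0xFFFFFFFF


def alu(operation: str, a: int, b: int) -> int:
    """A few standard ALU operations on two 32-bit register operands."""
    if operation == "add":
        return (a + b) & MASK32
    if operation == "sub":
        return (a - b) & MASK32
    if operation == "and":
        return a & b
    if operation == "or":
        return a | b
    raise ValueError(f"unsupported ALU operation: {operation}")


def count_ones(a: int, b: int) -> int:
    """Hypothetical custom unit: two inputs, one result (b is unused here)."""
    return bin(a & MASK32).count("1")


def execute(select_custom: bool, operation: str, a: int, b: int, custom_unit) -> int:
    """Model of the result multiplexer: custom unit or normal ALU."""
    return custom_unit(a, b) if select_custom else alu(operation, a, b)


print(execute(False, "add", 2, 3, count_ones))   # normal ALU path
print(execute(True, "add", 0xFF, 0, count_ones))  # custom instruction path
```

The decoder would drive select_custom whenever it recognises one of the otherwise unused instruction encodings.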

One way to emulate configurable instructions is to add large reconfigurable accelerator modules, selected through a multiplexer, that can be placed outside the CPU on the system bus. However, this approach involves an additional cost.

Another way to configure such a custom instruction in hardware is by using run-time partial reconfiguration. The custom instruction can be placed in small slots/islands close to the MIPS CPU, which could cause routing congestion because a high number of signals need to be routed into a small area. Devices from Altera or Xilinx are supported by design flow tools such as the PlanAhead, OpenPR and GoAhead flows. Such a design flow connects the static system, which includes the MIPS CPU, with the custom instructions by implementing the interface between the static and partial systems. This is done using proxy logic, bus macros or a direct wire mapping technique, which are provided by the PlanAhead, OpenPR and GoAhead flow tools respectively.

Fritzell (2013), who proposed a fast dynamic partial reconfiguration system using GoAhead, argued that with a high number of signals and small islands/slots, design flows using bus macros or proxy logic could not give good results because of the communication overhead. He showed that by using GoAhead with the direct wire approach, custom instructions can be implemented very efficiently in small islands/slots and that the modules can be relocated. The benefits of allowing a custom instruction to be relocated to more than one slot are flexibility of slot utilisation, reduction of external fragmentation and removal of unnecessary reconfiguration calls, as mentioned by Koch et al. (2010). As a result, the processor needs a look-up table that stores the location of the slot holding each custom instruction, so that the decoder knows from which custom instruction slot the result should be routed (Fritzell, 2013).
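The look-up table that records which slot holds which custom instruction can be sketched as a small Python class. The class name, the instruction names and the round-robin replacement policy are all hypothetical choices made for illustration; the source does not specify how a slot is chosen when none is free.

```python
from typing import List, Optional


class SlotTable:
    """Records which custom instruction currently occupies each slot."""

    def __init__(self, slot_count: int) -> None:
        self.slots: List[Optional[str]] = [None] * slot_count
        self.next_victim = 0  # round-robin replacement, an illustrative policy

    def lookup(self, instruction: str) -> Optional[int]:
        """Return the slot holding the instruction, or None if it is not loaded."""
        return self.slots.index(instruction) if instruction in self.slots else None

    def load(self, instruction: str) -> int:
        """Place the instruction in a free slot, or replace the next victim."""
        if None in self.slots:
            slot = self.slots.index(None)
        else:
            slot = self.next_victim
            self.next_victim = (self.next_victim + 1) % len(self.slots)
        self.slots[slot] = instruction
        return slot


table = SlotTable(2)
table.load("count_ones")
table.load("crc")
print(table.lookup("crc"))  # slot from which the decoder should route the result
```

Because a relocatable module fits more than one slot, load() is free to use whichever slot is available, and lookup() avoids a reconfiguration call when the instruction is already present.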

2.3.2 Custom Instructions in Software

Reconfiguration of custom modules can be done either by run-time partial reconfiguration or by a multiplexer that emulates the configuration process, as already mentioned above, and the reconfiguration time could be the biggest overhead. There are two fundamental options for triggering the configuration process:

Explicit approach: the custom instruction is configured during execution, at the request of the user or the program, before the processor needs it. Hauck (1998) proposed this method in the form of configuration pre-fetch instructions that are issued before the custom instruction is called. It can be fast. However, the speed of the configuration controller and the size of the bitstream affect the time that the reconfiguration of the custom instruction takes. Consequently, the processor must be stalled if the configuration of the custom instruction has not finished before the processor calls it.


Implicit approach: an exception trap is triggered when the processor detects that the custom instruction is not in hardware. The trap handler handles the configuration of the custom instruction that the processor needs. The trap handler could also run a program (Lynch, Forin and Pittman, 2006) so that a software function is executed in place of the custom hardware while it is not configured. This approach could remove a lot of overhead by not stalling the CPU while the configuration is in progress. However, it could take time to handle the trap.
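The difference between the two triggering options can be illustrated with a toy Python model. All names here are hypothetical, and the model makes the simplifying assumption that a configuration started by a trap has completed by the time the instruction is called again.

```python
def count_ones(a: int, b: int) -> int:
    """Reference behaviour of a hypothetical 'count ones' custom instruction."""
    return bin(a & 0xFFFFFFFF).count("1")


class CustomInstructionUnit:
    """Toy model of how a call is served under the two triggering options."""

    def __init__(self, behaviours):
        self.behaviours = behaviours   # name -> function modelling the instruction
        self.configured = set()        # instructions currently present in hardware
        self.traps = 0

    def prefetch(self, name: str) -> None:
        """Explicit approach: configure the instruction before it is called."""
        self.configured.add(name)

    def call(self, name: str, a: int, b: int):
        """Return (result, path), where path says how the call was served."""
        if name in self.configured:
            return self.behaviours[name](a, b), "hardware"
        # Implicit approach: the missing instruction raises a trap; the handler
        # runs the software version and starts the configuration.
        self.traps += 1
        result = self.behaviours[name](a, b)
        self.configured.add(name)      # simplification: ready by the next call
        return result, "software"


unit = CustomInstructionUnit({"count_ones": count_ones})
print(unit.call("count_ones", 0xF, 0))  # first call is served by the trap handler
print(unit.call("count_ones", 0xF, 0))  # later calls are served by the hardware
```

With prefetch() called early enough, even the first call takes the hardware path, which is the case the explicit approach aims for.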

2.4 Design Considerations

The development of a customisable CPU on FPGAs requires the consideration of critical system factors in order to attain the desired performance. Some of the critical objectives that are normally taken into consideration include the speed of the CPU, the memory, the power required and the speed with which the CPU can access other components of the system. There is usually a trade-off between the performance and the power required to attain such performance (Kulkarni, 2006).

The additional design considerations of a customisable configuration include the architecture of the processor and its suitability for the targeted application. This implies that the designer has to take into consideration the size and type of memory and peripheral bus. In addition, the designer has to decide on the model and size of the address space available to the CPU and on the size and type of the instruction and data caches. It is also important to give consideration to the type of controllers that are used in the architecture. Optional accelerators might be used to speed up the CPU (Deschamps, Sutter and Cantó, 2012).

It should also be mentioned that the design and development tools are part of the considerations that have to be evaluated by the designer. The biggest advantage of implementing the soft-core CPU on an FPGA lies in the fact that if a mistake is made during the development phase, the process can be repeated and the parameters reconfigured afresh. There are no limits to the number of times the processor can be reconfigured. This provides designers with a degree of design flexibility (Kozyrakis and Patterson, 2004).


The designer will have to take into consideration the development and design tools that will be used to develop the soft-core. The following figure provides an illustration of the design and development tools. The design and development tools are considered to be responsible for the parameterisation of the soft-core and also the associated implementation of the peripherals (Kilts, 2007).

Figure 5 Design and development tools (Minev and Kukenska, 2007).

FPGAs allow extensive customisation alternatives that are not found in other platforms such as ASICs. Additionally, an FPGA is also considered to have optimisation techniques that help a designer to work towards achieving performance metrics faster (Gebotys, 2002). The benefits of using an FPGA platform in customising soft-core CPUs have also been reviewed. The development of a customisable CPU on an FPGA requires the consideration of critical system factors in order to attain the desired performance (Gebotys, 2012).

Evaluation of the design and development tools will help the designer to attain the design requirements easily and quickly. It should also be noted that the wrong choice of design and development tools can lead to system inefficiencies. The design and development tools are considered to be responsible for the parameterisation of the soft-core and also the associated implementation of the peripherals (, 2010).

2.5 Previous Work

Related work that is relevant to this project can be categorized into two parts: instruction set extension and partial reconfiguration.

2.5.1 Instruction Set Extension

An example study regarding instruction set extension is that of Altera (2011). This study demonstrates the ability to extend the Nios II CPU with custom instructions using the SOPC Builder wizard of the Quartus design tool. Integrating custom instructions with a soft-core instruction set is a feasible way of speeding up application execution in specific domains such as cryptography (Majzoub and Diab, 2007). Some of the issues involved in the customisation of an instruction set were analysed in detail by Galuzzi and Bertels (2011), who provided a comprehensive overview of instruction-set extensions.

2.5.2 Partial Reconfiguration

A fair amount of literature has been published on partial run-time reconfiguration of soft-core CPUs on FPGAs. These studies have shown that PR reduces the size, weight, power and cost of an FPGA system. The use of design techniques to increase the performance and resource utilisation of reconfigurable soft CPUs was studied by Wold et al. (2012). They investigated the appropriate instruction implementation technique for a soft CPU, one which can achieve a performance improvement while at the same time reducing the resource requirement. It is a different task but fairly closely related to what this project is aiming at: their goal is to improve soft CPUs for FPGAs using partial reconfiguration. For example, they presented a classification method that determined the parameters for selecting the most suitable instruction based on profiling. Instruction Set Extensions, Software Emulation, Reconfigurable Instructions and ISA Subsetting are the optimisation techniques used in their methodology.

Reconfigurable instructions could have a critical side effect in terms of the configuration time. For example, stalling programme execution while waiting for the reconfiguration process to complete could cause an overhead (Wold et al., 2012).

Another study by Koch, Beckhoff and Torresen (2010) involved an approach to reduce this overhead. They examined the problem which occurs when the communication needs extra logic, or when the placement of reconfigurable modules is restricted by the static system, which causes an additional logic overhead. They presented a novel tool called ReCoBus-Builder. In a case study, modules of different sizes and latencies were integrated with soft CPUs without causing any logic overhead by using partial run-time reconfiguration. For this project, the newer tool GoAhead, which is a fully re-implemented version of ReCoBus-Builder, will be used. However, this study will focus on a library of dynamic instruction set extensions.


Chapter 3

3 System Design and Methodology

This chapter presents the methodology that has been adopted in this project, the implementation tools and the system design.

3.1 System Development Methodology

Designing and developing an effective customizable soft-core processor is a challenging task, especially with little experience in processor and system design. Therefore, a system development lifecycle method and a step-by-step design approach are appropriate. This progressively develops a researcher’s learning experience in this important computer engineering field and in developing an effective system using partial reconfiguration.

Figure 6: The general approach of the system development stages (Soft, 2013).

Figure 6 shows the general lifecycle stages that were used in this project in order to develop a processor. The requirement analysis stage has already been introduced in the objectives section of the Introduction chapter on page 13. The design and implementation stages used a step-by-step design and implementation method (Elkateeb, 2011), as shown in Figure 7, which is discussed below in this section. The testing and evaluation stages are introduced in Chapter 5 and use an appropriate approach for FPGA embedded processor design and evaluation (Fletcher, 2005), such as comparing the system against a software implementation and comparing it with the benchmark system and other real-world systems. Finally, some techniques for optimizing the performance and cost of an FPGA MIPS processor system will be discussed.

Figure 7: A step-by-step design and implementation method.

When using such a step-by-step design and implementation method, the customizable soft-core processor is built by gradually integrating the processor module with other system modules and developing further modules, until the final customizable soft-core MIPS processor design is obtained with the help of partial reconfiguration. Each of the steps is briefly described below.

First step: MIPS CPU: The soft-core is the brain of the system. A MIPS CPU has been implemented in one module, with an XOR gate in the top level in order to synthesise it, as shown in Figure 8. The reason for the XOR gate is that the MIPS CPU has more interface wires than there are I/O pins available on the FPGA board. By XORing some of the CPU outputs, the CPU can be synthesised for test purposes (e.g. for determining clock frequency and resource utilisation). The MIPS instruction encoding and the implementation module were tested using a test bench in the Xilinx ISE package, as illustrated in the testing section in chapter 5.


Figure 8: First step, system overview.

Second step: Custom instruction in software: A GCC cross compiler is used in order to compile the MIPS C code. This compiler is modified to include the custom instructions by assigning the custom instructions to unused opcodes. Accordingly, this will be used in the instruction decoder to select the instructions from the binary code. Installing the compiler was done using a virtual machine that was installed on the Windows operating system.
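The effect of assigning a custom instruction to an unused opcode can be sketched as follows. The opcode value 0x3F and the helper names are hypothetical and chosen only for illustration; the opcodes actually used in this project are not stated in this section.

```python
CUSTOM_OPCODE = 0x3F  # hypothetical unused opcode, for illustration only


def encode_r_type(opcode: int, rs: int, rt: int, rd: int,
                  shamt: int = 0, funct: int = 0) -> int:
    """Pack the R-type fields into one 32-bit instruction word."""
    for name, value, bits in (("opcode", opcode, 6), ("rs", rs, 5), ("rt", rt, 5),
                              ("rd", rd, 5), ("shamt", shamt, 5), ("funct", funct, 6)):
        if not 0 <= value < (1 << bits):
            raise ValueError(f"{name} does not fit in {bits} bits")
    return (opcode << 26) | (rs << 21) | (rt << 16) | (rd << 11) | (shamt << 6) | funct


def is_custom(word: int) -> bool:
    """Decoder-side check: does the word carry the custom opcode?"""
    return ((word >> 26) & 0x3F) == CUSTOM_OPCODE


# Custom instruction reading register 9 and writing register 8.
word = encode_r_type(CUSTOM_OPCODE, rs=9, rt=0, rd=8)
print(f"0x{word:08X}", is_custom(word))
```

The compiler side emits such a word for the custom operation, and the instruction decoder uses a check like is_custom() to route the operands to the custom module instead of the ALU.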

Third step: One custom instruction in hardware: A custom module to be connected to the MIPS CPU is chosen. A “Counting One” custom module is then added as a component in the MIPS CPU. The MIPS CPU detects the custom instruction and returns the result from the custom module. Moreover, the MIPS CPU module is connected with other modules such as ROM, RAM and GPIO via the system bus.


[Figure 9 diagram: MIPS CPU module with custom module, instructions ROM module, general I/O module and memory module, connected by the system bus.]

Figure 9 Third step, system overview.

Fourth step: Custom Instructions library in hardware: Four custom modules are implemented. In addition, a trap handler that is based on a multiplexer (MUX) is developed (Appendix B) in order to select the custom instruction that is called by the MIPS CPU. This approach has a logic overhead cost, as shown in the results in chapter 5.

[Figure 10 diagram: four custom modules (CM) selected by a MUX and attached to the MIPS CPU module, with instructions ROM module, general I/O module and memory module on the system bus.]

Figure 10: Fourth step, system overview.

Fifth step: Reconfiguration Custom instruction: There are different methods for implementing reconfigurable custom modules in hardware as already mentioned in the background chapter. In this project the following approaches have been implemented.

First approach step: Improving the Trap handler: The trap handler based on a MUX is improved to handle the configuration process. In this approach, the trap handler is based on ICAP (Appendix C). It is implemented as a state machine which includes a table that stores the addresses of the configuration bitstreams for the different custom instructions, as will be introduced later in section 4.3, and which then uses the ICAP primitive in order to load the bitstreams into the device. We exploit the fact that all academic boards come with serial SPI memory that is often not used. The MultiBoot feature is applied in this project; this allows the FPGA to load one of several configuration revisions. Spartan-6 FPGAs support two different configuration modes for this: BPI and SPI. The functionality of this feature is described in detail in the Spartan-6 FPGA Configuration User Guide. The iMPACT tool is used to supply the starting address for each configuration revision in order to generate the MultiBoot SPI file (Xilinx Inc, 2015). The SPI PROM is used to store the configuration bitstreams for the different custom modules. Consequently, if a custom instruction is needed by the MIPS CPU, the trap handler checks whether the custom instruction is already configured; if it is not, a different bitstream is loaded from the attached external memory (SPI PROM) into the FPGA. As a result, the FPGA is reconfigured with a different configuration bitstream. The testbench in the test section in chapter 5 shows the functionality of this module.
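The behaviour of such a trap handler can be sketched as a small state machine in Python. The state names, the instruction identifiers and the start addresses in the table are hypothetical; the actual table contents and states are those of the implementation described in section 4.3 and Appendix C, which this sketch does not attempt to reproduce.

```python
from enum import Enum, auto


class State(Enum):
    IDLE = auto()
    CHECK = auto()
    LOAD = auto()
    DONE = auto()


class TrapHandler:
    """Toy model: check whether an instruction is configured, else load it."""

    def __init__(self, bitstream_addresses):
        self.addresses = bitstream_addresses  # instruction id -> start address
        self.state = State.IDLE
        self.configured = None                # instruction currently on the device
        self.requested = None
        self.loads = []                       # start addresses sent for loading

    def step(self) -> None:
        """Advance the state machine by one transition."""
        if self.state is State.CHECK:
            hit = self.requested == self.configured
            self.state = State.DONE if hit else State.LOAD
        elif self.state is State.LOAD:
            self.loads.append(self.addresses[self.requested])
            self.configured = self.requested
            self.state = State.DONE
        elif self.state is State.DONE:
            self.state = State.IDLE

    def run(self, instruction_id: int) -> None:
        """Serve one request from the CPU until the handler is idle again."""
        self.requested = instruction_id
        self.state = State.CHECK
        while self.state is not State.IDLE:
            self.step()


# Hypothetical start addresses of two configuration revisions in the SPI PROM.
handler = TrapHandler({1: 0x000000, 2: 0x080000})
handler.run(2)
handler.run(2)   # already configured, so no second load is issued
print([hex(address) for address in handler.loads])
```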

The whole process works with full reconfiguration of both the MIPS CPU and the extension. However, reconfiguration only makes sense with partially reconfigurable custom instructions, because rebooting the whole system each time a different custom instruction is called is not a good idea. A different approach therefore comes from investigating whether the MultiBoot feature can be used for partial reconfiguration.


[Figure 11 diagram: custom modules (CM), MUX, Reg, MIPS CPU module, ICAP, trap handler and instructions ROM module, with general I/O module and memory module on the system bus.]

Figure 11 Fifth step: system overview of the first approach.

Second approach step: Exploiting the MultiBoot feature for partial reconfiguration. As stated in Xilinx’s Partial Reconfiguration User Guide (2012), PR is a technique for modifying the operation of the FPGA by loading a different bitstream while it is performing its normal operation. In this technique, the whole design is translated into different bitstreams or files, where each one defines a separate function and is loaded when required. Application-Specific Integrated Circuits (ASICs) are fabricated in the fab and are designed to perform a fixed function. FPGAs, on the other hand, offer the flexibility of being reprogrammed, and most modern FPGAs offer the capability of on-site programming. In PR, the operation of the FPGA is modified by programming a partial bitstream (also called a partial bit file), which defines the operation of a subset of the programmable blocks, so that the whole FPGA fabric is not reprogrammed. In such a scenario, a full bit file is first programmed into the FPGA, which defines the operation of the whole FPGA. Afterwards, depending on the requirements of the operation, a partial bit file can be downloaded to modify the reconfigurable parts of the FPGA while the other parts continue to perform their operation without being affected. The conceptual diagram of the partially reconfigurable system is shown below in Figure 12.


Figure 12 Fifth step: system overview of the second approach (Xilinx, 2012).

It can be seen that there is a Reconfigurable Block A in the system, which can be loaded with one of the possible configurations defined by several BIT files, A1.bit, A2.bit, A3.bit, and A4.bit. The logic in the FPGA design is divided into two different regions: reconfigurable region and static region. The dark area of the FPGA block represents reconfigurable regions and the lighter area shows the static region. The functionality of the reconfigurable region is defined by the partial bit files and can be re-programmed by loading one of the partial configurations, while the static region continues to perform its operation and is not affected by the reprogramming of the reconfigurable region.

The method of partial reconfiguration offers several advantages, which include:

– This approach helps to reduce the area or size of the FPGA device required to implement a given function, which means fewer logic blocks are consumed; as a result, it also reduces the cost and power consumption of the device.
– This approach helps to implement and test multiple algorithms or methods to perform a specific functionality. In such a case, multiple implementations can be loaded turn by turn and can be compared against each other.
– This technique enhances the design security, as specific user-dependent keywords or codes can be included in the reconfigurable region and reprogrammed by the end user.
– This approach enhances the fault tolerance of the FPGA design, where any malfunctioning regions or parts can be reprogrammed by the user and can be debugged.
– This approach enables the designer to divide the complete design into multiple regions or blocks, and these blocks can be added to the FPGA design incrementally; hence, it speeds up the FPGA design and verification process.

In our partially reconfigurable system, there is a partial reconfiguration controller implemented in the static region. This controller retrieves the partial bitstreams from any memory connected to the FPGA and forwards them to a configuration port. There are two possibilities for the partial reconfiguration controller: either it is implemented in an external device such as a separate processor, or it is implemented in the static region of the FPGA design. When the controller is located inside the static region of the FPGA, the partial bit files are loaded using the ICAP interface. Like the other logic in the static region of the FPGA, the partial reconfiguration controller logic functions without being affected by the programming of partial bit files.

The fundamentals and concepts of partial reconfiguration for any system design are discussed above. However, the documentation provides no information on using the ICAP primitive to send the command sequence that loads configuration bitstreams through the MultiBoot feature for partial reconfiguration. From this point, partial reconfiguration is applied. The code will later be changed to include a black box that represents the custom instruction wrapper in order to perform bottom-up synthesis, which is an important concept when implementing partial reconfiguration. Figure 14 in Section 3.3 illustrates this approach.

3.2 Implementation Tools

3.2.1 Hardware Description Language

The circuit for an FPGA is developed using a Hardware Description Language (HDL). The two most popular hardware description languages used for FPGAs are Verilog and VHDL. They are used to design circuits and to capture the complexity of large circuits, and they can significantly increase the productivity of the design process (Wold, et al., 2012). In short, a hardware description language can be compared to an imperative programming language. However, there are many fundamental differences between the two kinds of language. Normal programming languages are used to create programmes that are executed by a processor, whereas hardware description languages are designed to produce hardware circuits. They are capable of describing circuit hierarchy and connectivity, providing a built-in mechanism for simulating circuit behaviour in software, and expressing the inherent parallelism of separate circuit components (Hauck and Wilson, 1999).

3.2.2 Xilinx ISE (Xilinx, 2013)

This is an Integrated Synthesis Environment software tool provided by the FPGA vendor Xilinx. It is used for the synthesis and analysis of HDL designs and enables a designer to compile HDL designs (such as VHDL and Verilog files), to perform timing analysis, to view RTL schematics, to simulate a behavioural model, and to generate bitstreams to configure the target FPGA device.

Hardware description languages such as VHDL support different levels of abstraction. The commonly applied abstraction levels include behavioural and structural modelling. A module encapsulates a circuit by defining its interface, so that the circuit communicates with the outside world through its input/output ports. Modules are comparable to classes in object-oriented programming: they are normally defined once and then instantiated several times. Different instances of a module operate simultaneously, and they can be connected, mapped and routed using the signals that link their inputs and outputs.

ISim simulator: Hardware description languages are normally associated with simulation features, which provide an insight into the functionality the circuit will have when fabricated. This helps to reduce the risks and costs associated with real fabrication processes. Simulation is normally considered crucial in the design and implementation of hardware circuits, being both economical and practical. Different levels of granularity are supported for the simulation of a circuit. The initial stage of simulation seeks to determine the behavioural correctness of the circuit. In this case, an appropriate test bench is generated and applied to the circuit. The expected results of such a test bench are known before the simulation. The simulation results obtained are compared to the expected results, and the comparison is used to assess the correctness of the designed circuit.

3.2.3 Cross compiler

A cross compiler is a compiler which generates code that can be run on a different system, for example, compiling C code for the MIPS architecture (Gnu.org, 2015). For this project, a GNU cross compiler will be adapted to use reconfigurable instructions through inline assembly calls.

3.2.4 FPGA Platform

A Nexys3 digital circuit development platform, which is based on the Xilinx Spartan-6 LX16 FPGA, was used and is shown in Figure 13. The Spartan-6 FPGA will be used for implementing the reconfigurable ISA extension. It provides high performance at low resource cost and includes the following features (Digilent, 2013):

– 2,278 slices, each containing four 6-input LUTs and eight flip-flops
– 576 Kbits of fast block RAM
– two clock tiles (four DCMs and two PLLs)
– 32 DSP slices
– 500 MHz+ clock speeds

Figure 13 Xilinx Spartan-6 LX16 FPGA platform (Nexys3™ Board Reference Manual, 2013).


3.2.5 GoAhead

GoAhead is a tool for implementing partially reconfigurable systems. It supports all recent Xilinx FPGAs and provides some features that the Xilinx PR tool chain does not offer, including (Beckhoff, et al., 2012):

– Partial modules are implemented completely independently of the static design.
– Modules can be relocated, and multiple instances of a module can be instantiated.
– Modules can be integrated without any logic overhead ("no bus macro or proxy logic required").
– Hierarchical reconfiguration is provided, which allows the implementation of a PR module inside a PR module.
– Communication architecture generation enables multiple PR modules to be hosted simultaneously in the same PR region.

3.3 System Design

In this project, focus is put on embedded systems that have different requirements in various application domains such as cryptography, network control systems and image processing. This is due to the fact that an FPGA platform is the most suitable device to adapt to changes in application requirements (Koch D, 2013). There are four custom instructions that have been considered as extensions for the MIPS processor and they are described below:

I. Count ones: Counts the number of ones in a 32-bit vector.

Counting the set bits in a vector is a common operation, known as the Hamming weight, and it is used in the cryptography and network domains. For example, the Hamming distance algorithm detects the number of bit errors between two binary numbers by XORing them and then counting the ones in the result; this count is the number of bit errors (Schiller, 2003).

II. 32-bit CRC: Takes two 32-bit operands and computes a CRC.

A cyclic redundancy check (CRC) is one of the most popular error detection methods used in networks and in storage systems. It is very useful for detecting errors caused by noise in the transmission channel of a network. For example, the transmitter and the receiver use the same number to detect errors: the CRC calculation is done at both ends, and the result should be zero if there is no error. The CRC can be computed sequentially with a shift register and XOR gates, or in parallel with XOR gates only (Schiller, 2003).

III. Leading zero: Counts the zero bits that precede the first one bit, starting from the MSB, in a 32-bit vector.

This computes the number of zero bits in the most significant positions of the vector. Leading zeros are often used in electronic digital display devices such as seven-segment displays, for sorting numbers in ascending order, or for preventing fraud in financial documents (Miller, 2004).

IV. Parity: Counts the ones in a 32-bit vector to generate the parity bit.

This is one of the simplest and most popular error detection methods. It can be seen as a special case of CRC, namely a 1-bit CRC, or it can be used with other methods such as the Hamming weight to calculate the Hamming distance as mentioned above, because it needs only a number of XOR gates. The output vector carries the parity bit in its least significant bit (LSB); this bit is generated with XOR gates and indicates whether the number of one bits in the 32-bit input vector is even or odd (Schiller, 2003).
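The four operations above can be summarised as a small software reference model, which is useful for checking the hardware modules against known values. This is only a sketch: the function names are illustrative, and for the CRC the generator polynomial 0x04C11DB7, the MSB-first bit order and the meaning of the two operands (running CRC and next data word) are assumptions, since the text does not fix them.

```c
#include <stdint.h>

/* I. Count ones (Hamming weight) of a 32-bit vector. */
uint32_t count_ones(uint32_t x)
{
    uint32_t n = 0;
    while (x != 0) {
        x &= x - 1;          /* clear the lowest set bit */
        n++;
    }
    return n;
}

/* Hamming distance: XOR the two words, then count the ones. */
uint32_t hamming_distance(uint32_t a, uint32_t b)
{
    return count_ones(a ^ b);
}

/* II. One CRC-32 step over a 32-bit data word, bit-serial, MSB first.
 * Assumptions: polynomial 0x04C11DB7; operand 1 is the running CRC,
 * operand 2 is the next data word. */
uint32_t crc32_step(uint32_t crc, uint32_t data)
{
    crc ^= data;
    for (int i = 0; i < 32; i++) {
        if (crc & 0x80000000u)
            crc = (crc << 1) ^ 0x04C11DB7u;
        else
            crc <<= 1;
    }
    return crc;
}

/* III. Number of zero bits before the first one bit, from the MSB.
 * Returns 32 for an all-zero input. */
uint32_t leading_zeros(uint32_t x)
{
    uint32_t n = 0;
    for (uint32_t mask = 0x80000000u; mask != 0 && (x & mask) == 0; mask >>= 1)
        n++;
    return n;
}

/* IV. Parity bit in the LSB: 1 if the number of one bits is odd. */
uint32_t parity(uint32_t x)
{
    x ^= x >> 16;            /* fold the word with XOR, as an XOR tree would */
    x ^= x >> 8;
    x ^= x >> 4;
    x ^= x >> 2;
    x ^= x >> 1;
    return x & 1u;
}
```

With this model, feeding a data word followed by its own CRC through crc32_step gives zero, which matches the zero-remainder check described for the receiver.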

3.3.1 System Definition and Scope

The overall project comprises two parts. One is the implementation of a custom instruction module library, in which custom modules are implemented for different operations such as CRC, ones counter and parity. The other part is the implementation of the PR region of the FPGA, which is reconfigured according to the requirements.

3.3.2 System Architecture and Components

The overall system is divided into two main regions, the static region and the reconfigurable region, as shown in Figure 14. The static part includes all of the major logic, and the reconfigurable region includes only the custom module. The MIPS CPU is the main controller of the system and fetches instructions from the instruction ROM. The MIPS CPU decodes the instructions and performs the desired operations. When it encounters an instruction that is not implemented in its datapath, it starts a hardware trap handler and sends it the opcode of the desired operation. The trap handler checks whether the desired instruction is already loaded into the custom module; if so, the operation is performed. If not, the configuration manager inside the trap handler loads the corresponding partial bit file into the reconfigurable region using the ICAP primitive, and then the operation is performed. The whole process is carried out in hardware to achieve the lowest latency for reconfiguration.

[Figure 14: block diagram. Static region: MIPS CPU, trap handler (controller state machine, ICAP), instruction ROM. Reconfigurable region: custom module (reconfigurable). External memory lies outside the FPGA.]

Figure 14 The final system design.

The system operates on a 50 MHz clock, derived internally from a top-level clock through global buffers (BUFG). These distribute the clock at high speed and with the least possible skew between the MIPS and the peripherals connected to the bus, which may be physically located far apart.

MIPS Soft-Core Processor

The CPU core is based on the MIPS I instruction set and is built into the system as a soft-core processor. It is used as a platform demonstrator for reconfigurable instruction extensions. Moreover, it is the main module that controls all the other modules, and it raises a trap when a custom instruction exception occurs. Figure 15 gives an overview of the MIPS.

Figure 15 The non-pipelined MIPS, showing the most important signals and logic (Fritzell, 2013).

Peripheral component modules:

– Memory RAM: A static memory that provides write-before-read behaviour. In other words, the data returned during a write cycle is the same as the data being written. The memory module is synthesised into internal block memories in the Spartan-6 FPGA architecture (Doulos.com, 2015).
– GPIO: General-purpose input/output (GPIO) covers any connection to an input or output pin. The user can control these pins at run-time. GPIO pins such as LEDs and switches are OFF by default (Fritzell, 2013).
– ROM: This module contains the machine code of the instructions, using the ROM’s address as an index into this memory. The machine code is generated with the help of a GCC cross compiler, which compiles the C code and runs the assembler to produce the binary code that can be used in this array.
– UART: Universal asynchronous receiver/transmitter (UART). A UART module can be added to the system. This unit allows the user to control the operation of the MIPS CPU, the trap handler and other modules and to check the status of the system. Additionally, the UART module can also be used to load the configuration required by the ICAP module.
– System bus: All modules are connected via a baseline bus protocol consisting of a chip select (CS) input signal, a write enable (WR_en) input signal, an address input signal, a Writedata input signal and a Readdata output signal, with the MIPS as the only master module (Fritzell, 2013).
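The bus signals and the write-before-read RAM described above can be modelled in a few lines of C. This is a behavioural sketch under stated assumptions, not the VHDL used in the project: the memory depth, the byte-to-word address conversion and the value returned when the slave is not selected are all hypothetical.

```c
#include <stdint.h>

#define RAM_WORDS 256u   /* hypothetical depth */

typedef struct {
    uint32_t mem[RAM_WORDS];
} ram_t;

/* One bus cycle seen by the RAM slave. The parameters mirror the bus
 * signals CS, WR_en, Address and Writedata; the return value models
 * Readdata. */
uint32_t ram_cycle(ram_t *ram, int cs, int wr_en,
                   uint32_t address, uint32_t writedata)
{
    uint32_t index = (address >> 2) % RAM_WORDS;  /* byte address to word index (assumption) */

    if (!cs)
        return 0;                       /* slave not selected (assumption: reads as zero) */
    if (wr_en)
        ram->mem[index] = writedata;    /* the write happens first */
    return ram->mem[index];             /* so a write cycle returns the data being written */
}
```

Because the write is applied before the read, a write cycle returns the new data, which is the behaviour described for the memory module.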

Configuration controller module:

– The Trap handler

The trap handler is a core module located in the static region of the FPGA design. It is directly connected to the MIPS CPU by a bus, and it can easily be modified so that multiple CPUs can use it to load configurations at the desired places and run the operations. Whenever the MIPS CPU encounters an instruction that is not implemented in its datapath, there are two options: either to stall or to trigger the trap handler. The trap handler is implemented to prevent the CPU from malfunctioning because of a non-implemented instruction.

– The ICAP primitive

As we are using a Spartan-6 FPGA, the ICAP primitive (called ICAP_SPARTAN6) is used to initiate the configuration process. It is implemented in the FPGA's fixed logic. This primitive can be used to program the FPGA logic under user control. Figure 16 shows the interface diagram of the ICAP Spartan-6 primitive, and Table 3 gives the detailed description of the input and output ports of the primitive.


[Figure 16: ICAP_SPARTAN6 block with inputs Clk, CE, WRITE and I[15:0], and outputs O[15:0] and Busy.]

Figure 16: ICAP Primitive (Xilinx Inc, 2015).

Table 3 Descriptions of using ICAP_SPARTAN6 Port (Xilinx Inc, 2015).

– Custom modules

There are four custom instructions implemented in the design: CRC-32, Ones Counter, Parity flag and Leading zero counter. The concept of each custom instruction is taken from different sources; for example, the CRC generator was used to generate the CRC-32 custom instruction module (Outputlogic.com, 2015). Each implemented module is assigned a CUSTOM ID, which distinguishes it from the others. More custom instructions can be added to the system by assigning a unique CUSTOM ID to each of them, as in Figure 17.


The CUSTOM ID is evaluated by the instruction decoder of the MIPS CPU in order to run the corresponding module or to trigger the configuration process through the hardware trap handler.

Figure 17 Custom Module.


Chapter 4

4 Implementation

This chapter discusses the implementation aspects, the technical issues and the challenges faced.

4.1 Baseline MIPS Soft-Core

Because the implementation of a soft-core is often sophisticated and comes with many design files, the CPU core in this system follows the implementation idea of the MIPS proposed by Fritzell (2013), which is contained in one HDL file. It is modified to support a dynamically reconfigurable module.

Pros and Cons

There are several advantages of the simple implementation style for the MIPS CPU in the system. The MIPS CPU is small, runs at 50 MHz and delivers 50 M instructions per second, which could be more than many micro-controllers. Moreover, the CPU traps a custom instruction if it is not available and returns the correct result automatically to the register file; that is, it is combined with a trap handler that handles the configuration process in a smart way.

One disadvantage is that the CPU will become an application-specific processor if the customisation extensions are considered, but the MIPS CPU itself will be just used as an advanced state machine for the configuration controller. In addition, if the MIPS is still too large for the application or the application needs to increase the execution speed then reducing the memory size and removing unused instructions could be a solution (Yiannacouras, et al., 2006).

One instruction per cycle

Most RISC CPUs are five-stage pipelined designs, but completing one instruction per cycle in a pipeline requires handling hazards in each stage by adding the corresponding control logic.

The non-pipelined MIPS in this project executes one instruction per cycle without the need for any hazard detection and handling. The VHDL code of the non-pipelined MIPS, in Appendix A, highlights the main blocks, which are the instruction decoder, register file, ALU and program counter. These handle the execution of one instruction in one clock cycle. As a consequence, the MIPS code has long propagation delay paths between flip-flops, which need to be minimised in order to achieve a high clock frequency.

In order to allow the execution of one instruction per cycle, we use a ‘trick’ to avoid waiting one clock cycle for the instruction memory output. The instruction memory receives the address of the next instruction just before the next clock edge. As a result, the instruction word is available at the beginning of the clock cycle in which it is executed. As shown in Figure 18, the next PC address, instead of the PC, is passed to the instruction memory, because a BRAM is read synchronously and reading with the PC itself would add one clock cycle of delay. The timing could be affected in this case, since the address of the next instruction is produced at the end of a long combinatorial path and still has to meet the setup requirements at the input of the instruction memory.
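The effect of feeding the next PC rather than the PC into a synchronous memory can be seen in a small cycle-level model. This is a sketch only: the ROM contents are hypothetical, the program counter is word-indexed for simplicity, and branches are left out.

```c
#include <stdint.h>

#define ROM_WORDS 8u

/* Hypothetical ROM contents; every word is distinct so that a stale
 * fetch is visible. */
static const uint32_t rom[ROM_WORDS] = {
    0x100, 0x101, 0x102, 0x103, 0x104, 0x105, 0x106, 0x107
};

/* Simulate `cycles` clock cycles and count how often the word at the ROM
 * output is the instruction addressed by the current PC.
 * use_next_pc != 0: the ROM address input is driven by the next PC.
 * use_next_pc == 0: the ROM address input is driven by the PC itself. */
int fetch_matches(int use_next_pc, int cycles)
{
    uint32_t pc = 0;        /* word-indexed program counter (assumption) */
    uint32_t addr_reg = 0;  /* address register inside the synchronous ROM */
    int matches = 0;

    for (int c = 0; c < cycles; c++) {
        /* the ROM output during this cycle depends on the registered address */
        uint32_t instr = rom[addr_reg % ROM_WORDS];
        if (instr == rom[pc % ROM_WORDS])
            matches++;

        uint32_t next_pc = pc + 1;   /* sequential execution only */

        /* rising clock edge: the ROM samples its address, the PC updates */
        addr_reg = use_next_pc ? next_pc : pc;
        pc = next_pc;
    }
    return matches;
}
```

In this model, driving the ROM with the next PC keeps the fetched word aligned with the PC in every cycle, whereas driving it with the PC leaves the fetched word one cycle behind after the first cycle.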

Figure 18 The Program Counter process overview that consists of extra logic and flip-flops to handle branch and jump instructions. (Fritzell, 2013).


Delayed Branch

Delayed branch is a technique applied to avoid the effect of control dependencies (“hazards”) in a pipelined MIPS, and it is used in the non-pipelined MIPS to handle branch and jump instructions, as already shown in Figure 18. If the branch is taken, the instruction that follows the branch instruction (the delay slot) is executed before branching or jumping to the new address. By adding extra logic and flip-flops, the branch address is held while the delay slot executes. Because of that, the MIPS code often has a NOP instruction after a branch or jump instruction.

Instruction encoding

The MIPS VHDL code starts by decoding the instruction word, following the instruction set encoding given in the MIPS32 instruction set reference manual (MIPS Technologies, 2003), in order to provide the data to be operated on by the ALU. The result is stored in the register file.

Multi-cycle instructions

Most instructions are implemented in a straightforward manner; that is, they are executed in one clock cycle. However, some instructions lie on a critical path and could affect the timing and performance. Therefore, they should be implemented as multi-cycle instructions. Examples are the signed and unsigned multiplication and division instructions. Because division instructions are resource expensive and seldom used, the div instruction is treated as an undefined instruction. This is, however, not a problem, as it can be added as a custom instruction, software function or multi-cycle instruction if needed.

The multiplication instruction is implemented by enabling the DSP blocks in the synthesis tool. This takes full advantage of the device resources and increases performance by implementing the multiplication in DSP blocks. Figure 19 illustrates the multiplication, which operates on extra registers called HI and LO (a div instruction would use the HI and LO registers too). This was achieved by using the constraints editor in ISE to constrain the combinatorial paths between the instruction memory output and the HI and LO register inputs as multi-cycle paths in hardware. Also, a stall signal is used during multi-cycle execution in order to stall the MIPS CPU. As a result, a multiplication instruction takes two clock cycles (or more if needed). To allow this, the PC and the register file must be prevented from being updated for one cycle (i.e. the CPU has to be stalled).

Figure 19 Datapath for the multiplication, allowing two clock cycles for execution. (Fritzell, 2013).
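The program counter behaviour described in this section, that is the branch delay slot and the stall for multi-cycle instructions, can be sketched as a small C model. The structure and names below are illustrative and are not taken from the VHDL in Appendix A; a byte-addressed PC with 4-byte instructions is assumed.

```c
#include <stdint.h>

typedef struct {
    uint32_t pc;       /* address of the instruction being executed */
    int      pending;  /* 1 while the delay slot of a taken branch executes */
    uint32_t target;   /* branch address held by the extra flip-flops */
} pc_unit_t;

/* Advance the PC by one clock cycle.
 * taken:  the current instruction is a taken branch or a jump
 * target: its destination address
 * stall:  a multi-cycle instruction needs another cycle */
void pc_step(pc_unit_t *u, int taken, uint32_t target, int stall)
{
    if (stall)
        return;                  /* hold the PC while the CPU is stalled */

    if (u->pending) {            /* the delay slot has just executed */
        u->pc = u->target;
        u->pending = 0;
    } else {
        u->pc += 4;              /* next sequential instruction (the delay slot
                                    when the current instruction branches) */
        if (taken) {
            u->pending = 1;      /* remember the branch for one more cycle */
            u->target = target;
        }
    }
}
```

A taken branch at address 4 with target 0x40 therefore produces the PC sequence 4, 8, 0x40, and a stall cycle leaves the PC unchanged.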

Trap instructions

When the instruction is available, the result is returned to the register file. However, when the instruction is not available but is defined as a custom instruction, the MIPS CPU traps this instruction so that it is processed by the trap handler.

4.2 Custom Instruction in Software

Custom instructions are supported in software by changing the GCC cross compiler for the MIPS architecture. The encoding of the MIPS I instruction set can be found in C code inside binutils, in the opcodes folder of the compiler.

The mips-opc.c source file defines all the assembly instructions of the MIPS I instruction set in addition to the range of UDIs (user defined instructions). The format of the UDIs is similar to the format of the R-TYPE instructions defined in section 2.2.3. The UDIs therefore share the same opcode and are distinguished by the function field, which covers the instruction word range from 0x70000010 to 0x7000001f. In total, 16 individual user instructions are unused, so designers can add 16 additional instructions directly to a system.


In order to implement the custom instructions in software, only instruction encodings from the UDI range should be used. By exploiting the similarity with the R-TYPE format, one of the user defined instructions can be given the same operand format as an R-TYPE instruction such as XOR, in the following steps (Fritzell, 2013):

– Copying the XOR instruction:

{"xor", "d,v,t", 0x00000026, 0xfc0007ff, WR_d|RD_s|RD_t, 0, I1 },

– Choosing any UDI instruction, such as the following:

{"udi0", "s,t,d,+1", 0x70000010, 0xfc00003f, WR_d|RD_s|RD_t, 0, I33 },

– Renaming the UDI instruction to CUSTOM and changing its format to match that of XOR:

{"custom", "d,v,t", 0x70000010, 0xfc0007ff, WR_d|RD_s|RD_t, 0, I1 },

– Recompiling the GCC cross-compiler with the new custom instruction, as shown in Figure 20.

Figure 20 Adding Custom instruction in the compiler.

– Using the following inline assembly inside the C code in order to call the software implementation of the custom instruction:

__asm__ ("nop\n\t"
         "custom %0, %1, %2\n\t"
         : "=r" (z)
         : "r" (x), "r" (y));

Note: x and y are the input operands and z is the result.

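The match and mask values in the opcode table entries of this section determine the 32-bit instruction word that the assembler emits for the custom instruction. The following sketch shows how such a word could be built and recognised. It assumes the standard R-TYPE field positions (rs in bits 25 to 21, rt in bits 20 to 16, rd in bits 15 to 11), with the "d,v,t" operands mapped to rd, rs and rt; the function names are illustrative.

```c
#include <stdint.h>

#define CUSTOM_MATCH 0x70000010u   /* match value of the "custom" table entry */
#define CUSTOM_MASK  0xfc0007ffu   /* mask value: opcode, shamt and function fields */

/* Build the instruction word for "custom rd, rs, rt". */
uint32_t encode_custom(unsigned rd, unsigned rs, unsigned rt)
{
    return CUSTOM_MATCH
         | ((uint32_t)(rs & 31u) << 21)
         | ((uint32_t)(rt & 31u) << 16)
         | ((uint32_t)(rd & 31u) << 11);
}

/* An instruction word belongs to a table entry when the bits selected
 * by the mask are equal to the match value. */
int is_custom(uint32_t insn)
{
    return (insn & CUSTOM_MASK) == CUSTOM_MATCH;
}
```

For example, "custom $2, $4, $5" would be encoded as 0x70851010 under these assumptions, while the word for "xor $2, $4, $5" (0x00851026) is not recognised as a custom instruction.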

4.3 Configuration Controller Modules

I. Trap handler

The trap handler is the module that handles the exceptions encountered by the MIPS CPU. The MIPS CPU reads the instructions from the instruction ROM, decodes them and then executes them. If the instruction received is not implemented in the MIPS CPU, an exception is generated, and the MIPS CPU requests the trap handler to handle it. The operation of the trap handler is controlled by a state machine. Figure 21 shows the state machine diagram for the trap handler.

[Figure 21: state diagram with states ST0, ST1, ST2 and ST3 and the transition labels Trap_start = 0, Trap_start = 1, Trap_start = 1 & Opcode = CUSTOM_ID, Count < 13, Count = 13, Custom_done = 0 and Custom_done = 1.]

Figure 21 Trap Handler State Machine.

There are four states in the state machine. ST0 is the reset state, and the system is normally in this state. Here it waits for the trap start signal, which comes from the MIPS CPU. When an exception occurs inside the MIPS CPU, it sends the trap start signal to the trap handler. On reception of this signal, the state machine moves to either ST1 or ST2. If the requested opcode is equal to the currently loaded CUSTOM ID, there is no need to load the partial bit file, so the state machine moves to ST2. Otherwise, the state machine moves to ST1, where it sends the command to the ICAP primitive to load the partial bit file into the custom module. The configuration process typically takes thousands of cycles, so a counter is used to monitor the configuration reading signal from the ICAP before going to ST2. In ST2, the trap handler sends a start signal to the custom module, and in ST3 it waits for the module to complete the operation.
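The state machine described above can be written down as a C model with one function call per clock cycle. This is a sketch based on the description and the labels of Figure 21, not the VHDL of the project. Two points are assumptions: the counter is taken to run from 0 to 13 while the state machine is in ST1, and the loaded CUSTOM ID is updated when ST1 is left.

```c
#include <stdint.h>

typedef enum { ST0, ST1, ST2, ST3 } trap_state_t;

typedef struct {
    trap_state_t state;
    unsigned     count;      /* counter used while in ST1 */
    uint32_t     loaded_id;  /* CUSTOM ID of the module currently loaded */
} trap_fsm_t;

/* Advance the trap handler by one clock cycle. */
void trap_step(trap_fsm_t *f, int trap_start, uint32_t opcode, int custom_done)
{
    switch (f->state) {
    case ST0:                               /* idle: wait for the trap start signal */
        if (trap_start) {
            if (opcode == f->loaded_id) {
                f->state = ST2;             /* module already loaded */
            } else {
                f->count = 0;
                f->state = ST1;             /* reconfiguration needed */
            }
        }
        break;
    case ST1:                               /* send the commands to the ICAP */
        if (f->count == 13) {
            f->loaded_id = opcode;          /* assumption: ID updated here */
            f->state = ST2;
        } else {
            f->count++;
        }
        break;
    case ST2:                               /* start the custom module */
        f->state = ST3;
        break;
    case ST3:                               /* wait for the custom module */
        if (custom_done)
            f->state = ST0;
        break;
    }
}
```

In this model a request for the module that is already loaded goes from ST0 directly to ST2, while any other request spends fourteen cycles in ST1 first.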


Each custom module is assigned a unique opcode and address, which are given in Table 4 below.

Custom Module Name      Opcode    Address
CRC-32                  010000    X"100000"
Ones Counter            100001    X"200000"
Parity                  010001    X"300000"
Leading Zero Counter    100000    X"400000"

Table 4 Custom instructions’ address and ID
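As an illustration of how Table 4 might be used, the following sketch stores the table as a constant array and looks up the address that belongs to a requested opcode. The opcodes and addresses are those of Table 4, written in hexadecimal; the type and function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

typedef struct {
    const char *name;
    uint8_t     opcode;    /* 6-bit CUSTOM ID from Table 4 */
    uint32_t    address;   /* address from Table 4 */
} custom_entry_t;

static const custom_entry_t custom_table[] = {
    { "CRC-32",               0x10, 0x100000 },  /* 010000 */
    { "Ones Counter",         0x21, 0x200000 },  /* 100001 */
    { "Parity",               0x11, 0x300000 },  /* 010001 */
    { "Leading Zero Counter", 0x20, 0x400000 },  /* 100000 */
};

/* Return the table entry for an opcode, or NULL if the opcode does not
 * belong to any custom module. */
const custom_entry_t *find_custom(uint8_t opcode)
{
    size_t n = sizeof custom_table / sizeof custom_table[0];
    for (size_t i = 0; i < n; i++) {
        if (custom_table[i].opcode == opcode)
            return &custom_table[i];
    }
    return NULL;
}
```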

II. ICAP primitive

There are multiple ways to use the ICAP controller. If a UART connection to a host machine is used, the ICAP controller depends on the user to initiate a UART transaction for reconfiguring the FPGA. However, the more automatic way is to store the configurations in SPI flash and fetch them from there. In this project, the latter option is used. An ICAP primitive is instantiated inside the trap handler to load the configuration files, so that the reconfigurable region is reprogrammed with the desired logic. As described by Xilinx Inc:

“Spartan-6 FPGAs have dedicated MultiBoot logic, which is used for both fallback and MultiBoot (IPROG) reconfiguration. When fallback or IPROG happens, an internally generated pulse resets the entire configuration logic, except for the dedicated MultiBoot logic. The IPROG (internal PROGRAM_B) command can be sent through ICAP_SPARTAN6 or the bitstream” (2015).


Table 5 An example of bitstream for the IPROG command using ICAP (Xilinx Inc, 2015).

The sequence of commands illustrated in the table above is described in detail in the Spartan-6 FPGA configuration user guide (Xilinx Inc, 2015). After the IPROG command is sent to the configuration logic, the FPGA resets everything except the dedicated reconfiguration logic. Then the bitstream at the starting address is loaded. Thus, the static region is not affected by this operation.
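A sketch of how the 16-bit command words for the IPROG sequence could be assembled in software is given below. The constant words are recalled from the IPROG example in the Spartan-6 configuration user guide cited above and should be checked against Table 5 before use. The function name and the parameters are hypothetical, and the SPI read opcode is left as a parameter because it depends on the flash device.

```c
#include <stddef.h>
#include <stdint.h>

#define IPROG_WORDS 14

/* Fill seq[] with the command words that are written to the ICAP, one
 * word per cycle. start_addr and fallback_addr are 24-bit flash
 * addresses (for example an address from Table 4). */
void build_iprog(uint16_t seq[IPROG_WORDS], uint32_t start_addr,
                 uint32_t fallback_addr, uint8_t spi_read_opcode)
{
    size_t i = 0;
    seq[i++] = 0xFFFF;                               /* dummy word */
    seq[i++] = 0xAA99;                               /* sync word */
    seq[i++] = 0x5566;                               /* sync word */
    seq[i++] = 0x3261;                               /* write 1 word to GENERAL1 */
    seq[i++] = (uint16_t)(start_addr & 0xFFFFu);     /* start address [15:0] */
    seq[i++] = 0x3281;                               /* write 1 word to GENERAL2 */
    seq[i++] = (uint16_t)(((uint32_t)spi_read_opcode << 8)
                          | ((start_addr >> 16) & 0xFFu));    /* opcode, address [23:16] */
    seq[i++] = 0x32A1;                               /* write 1 word to GENERAL3 */
    seq[i++] = (uint16_t)(fallback_addr & 0xFFFFu);  /* fallback address [15:0] */
    seq[i++] = 0x32C1;                               /* write 1 word to GENERAL4 */
    seq[i++] = (uint16_t)(((uint32_t)spi_read_opcode << 8)
                          | ((fallback_addr >> 16) & 0xFFu)); /* opcode, fallback [23:16] */
    seq[i++] = 0x30A1;                               /* write 1 word to CMD */
    seq[i++] = 0x000E;                               /* IPROG command */
    seq[i++] = 0x2000;                               /* no operation */
}
```

The sequence has fourteen words, indexed 0 to 13, which is consistent with the Count = 13 condition in the trap handler state machine.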

4.4 Custom Instruction in Hardware

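[Figure 22: block diagram with the operands OP_A and OP_B, the CI and ALU blocks, the CI result RES, the Instruction input and the output ALU_out.]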

Figure 22 Custom Instruction (CI) act as extension of the ALU


Figure 22 shows the MIPS CPU with custom instructions as extensions to the original ALU. A custom instruction takes one or two 32-bit input operands and computes one 32-bit output. Adding custom instructions to the system can speed up the execution time of an application, as mentioned above. Run-time reconfigurable accelerator modules in a PR region, with a proxy logic approach for the communication, have been implemented using the GoAhead tools.

Figure 23 illustrates the communication between the static system and the partial reconfiguration module. Proxy logic is used as the connection primitive, which is nothing more than a look-up table in route-through mode. It acts as a placeholder for the non-existing part of the system; that is, it replaces the partial module when implementing the static system, and it replaces the static system when implementing a reconfigurable custom instruction accelerator. The same wires are used for the communication between the static system and the reconfigurable area.

Figure 23 also shows that the different custom instruction modules use different logic, but have exactly the same interface to the CPU (including the routing).

[Figure 23: block diagram. The static part (CPU) and the partial reconfiguration region are joined by proxy logic carrying the signals OP_A and OP_B from the CPU to the CI and RES from the CI to the CPU; several interchangeable custom instruction modules share this interface.]

Figure 23 On-FPGA Communication for Custom Instructions.


Static System Implementation

A screenshot of the static system is shown in Figure 24. It shows the operand signals (OP_A, OP_B) on the left side, and the result signal is collected on the right. Each connection primitive carries four wires from the static part of the system into the PR region. Consequently, 8 connection primitives (32 / 4) are needed for each of the 32-bit interface signals (OP_A, OP_B and RES).


Figure 24 Static implementation


Reconfigurable Instructions

The reconfigurable modules are implemented in the absence of the static system, as can be seen in the screenshot in Figure 25. For the partial module implementation, the same connection primitives are used, but with their other side, which was not yet connected (OP_A to CI, OP_B to CI and RES from CI). Figure 25 shows the CRC module connected, through the proxy logic, at the point where the static design ends. The custom instruction wrapper was auto-generated by the GoAhead tool.

As the result output is not connected to the outside world (i.e. the path ends at the connection primitives), the FPGA tools would typically remove all logic and routing leading to the output primitive, which would eventually result in an empty design. To overcome this, all interface signals were given a keep attribute (which is specific to the Xilinx vendor tools).


Figure 25 Partial Part: the example shows the implementation of the CRC instruction.


Using the GoAhead tool

GoAhead provides a GUI as well as a scripting interface. A screenshot of the tool is shown in Figure 26. The GUI is typically used to create scripts, and a script then generates all the constraints needed in the system. The generated constraints serve two important purposes. The first is to prevent the use of resources in the PR region; in other words, routing is blocked inside the PR region and no logic primitives there are used. The second is to create the placement constraints for the connection primitives.

The following steps are used for both implementations (Static and Partial) with GoAhead as illustrated in the screenshot of the Figure 26:

1. Load the device description.
2. Define the region in GoAhead. Selecting the elements between 72 and 79 gives exactly 8 elements, which is 8 routes.
3. Place connection macros inside the PR region by using the macro placer in GoAhead.
4. Create the connection primitives in this area. With 4 input wires per connection primitive, this creates an area with 8 tiles (i.e. CLBs).
5. Block all routing inside the PR region, except for the operand and result vectors. The blocker is then exported to XDL, a Xilinx-specific netlist format that is not further investigated in this project.
6. Instantiate the connection macros as in Figure 27. The name of the primitive is "OP_A connect", and it has the input "OP_A from CPU" as its VHDL name.
7. Update the constraint file of the design (UCF file) with the placement constraints for the PR region generated by GoAhead.

In order to generate the bitstream, the static and the partial implementations must be merged. This can be done by copying the text descriptions of the two XDL netlists and merging them together.


Figure 26 GoAhead GUI. The graphical user interface of the GoAhead.

Figure 27 GoAhead Script.


4.5 Challenges during Implementation

– The three-month duration of the project was a major challenge, as was working across different phases and tools and spending a couple of weeks learning each one.
– Setting up the GCC cross-compiler for MIPS on Windows is not a straightforward task and takes time.
– The Nexys3 platform that hosts the system does not have external interfaces such as audio and video, which limits the usability of the device. Moreover, testing the system was difficult because of the high clock speed, while GPIO and UART are very slow.
– Because the Nexys3 SPI model is not clearly documented, testing the reconfiguration was difficult. I spent a couple of days trying to run the code on that board, but eventually had to switch to a different board and change the I/O configuration to run the code there.
– Combining the MultiBoot feature with partial reconfiguration. This is a new approach that had not been implemented before; I had to go through the literature and spent a couple of weeks learning this feature.
– When implementing partial reconfiguration, each design has different names for the primitives, and for that reason it is not completed.


Chapter 5

5 Testing, Results and Evaluation

This chapter presents the simulation and testing of the system, the results and the evaluation.

5.1 Testing

The whole system is simulated using a test bench in the Xilinx ISE package. Figure 28 shows the functionality of the MIPS CPU: reading the instructions from ROM and decoding them, incrementing the address in the program counter, and executing the branch delay.

Figure 28 Test bench of the MIPS CPU and ROM. Panels a, b and c present one test bench that shows different signals: a) instruction encoding, decoding and ALU functionality; b) program counter functionality; c) branch delay and ROM functionality.

ModelSim Simulation of the custom instruction modules

Simulations of the custom instruction modules were created and the functionality of each module was verified. Figure 29 shows the simulation results of the CRC-32 module. It can be seen that when crc_en is high, the CRC-32 is generated and output on the crc_out bus.

Figure 29 ModelSim Simulation of CRC-32 Module.

Figure 30 shows the simulation results of the one counter module. The data is applied to the data_in bus and toggled at clock intervals, and the corresponding output is generated on the output bus.

Figure 30 ModelSim Simulation of One Counter Module.

Figure 31 shows the simulation output of the parity generation module. The data is applied to the data_in bus and changed at clock intervals, and the resulting output is generated on the output bus.

Figure 31 ModelSim Simulation of Parity Generation Module.

Figure 32 shows the simulation results of the leading zero counter module. The data is applied to the data_in bus and changed at clock intervals, and the resulting output is generated on the output bus.

Figure 32 ModelSim Simulation of Leading Zero Counter Module.

ModelSim Simulation of the Trap handler

Two simulations were performed for the trap handler. The first is for the MUX-based trap handler and the second is for the ICAP-based trap handler. Figure 33 shows the simulation output of the MUX-based trap handler and how this module behaves when the opcode and data change on the input.

Figure 33 ModelSim Simulation of Mux Based Trap Handler.

Figure 34 shows the simulation of the ICAP-based trap handler module. The state machine starts moving after the trap_start signal, then sends a command to the ICAP primitive and, when that is complete, starts the custom module.

Figure 34 ModelSim Simulation of ICAP Based Trap Handler.
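The sequence described for the ICAP-based trap handler can be sketched as a small state machine. The C model below is only an illustration of that sequence: the state names and the cmd_sent and cfg_done flags are hypothetical, since the text names only the trap_start signal and does not list the actual states of the VHDL module.

```c
#include <assert.h>

/* Hypothetical state names; the actual states of the module are not given. */
typedef enum { TH_IDLE, TH_SEND_CMD, TH_WAIT_DONE, TH_RUN_CUSTOM } th_state;

/* One clock step of the sketched trap handler.
 * trap_start: trap raised by the CPU for a custom instruction
 * cmd_sent:   assumed flag, last command word has been written to ICAP
 * cfg_done:   assumed flag, the reconfiguration has finished */
static th_state th_step(th_state s, int trap_start, int cmd_sent, int cfg_done)
{
    switch (s) {
    case TH_IDLE:       return trap_start ? TH_SEND_CMD : TH_IDLE;
    case TH_SEND_CMD:   return cmd_sent ? TH_WAIT_DONE : TH_SEND_CMD;
    case TH_WAIT_DONE:  return cfg_done ? TH_RUN_CUSTOM : TH_WAIT_DONE;
    case TH_RUN_CUSTOM: return TH_IDLE;
    }
    assert(0 && "unreachable state");
    return TH_IDLE;
}
```

Calling th_step once per simulated clock cycle reproduces the order of events in the description: nothing happens until trap_start is asserted, the command phase precedes the wait phase, and the custom module is started last.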


Software and Testing

The following C code is compiled for MIPS using the cross-compiler, and the resulting machine code is placed in the ROM.

/* read switches and write to leds */
#define LEDS_BASE_ADDERSS  0x10001000
#define SWS_BASE_ADDERSS   0x10000010
#define RESET_BASE_ADDRESS 0xBFC00000

int main()
{
    int temp = 0;
    /* volatile: memory-mapped I/O accesses must not be optimised away */
    volatile int *RED_LED  = (volatile int *)LEDS_BASE_ADDERSS;
    volatile int *SWITCHES = (volatile int *)SWS_BASE_ADDERSS;

    while (1) {
        temp = *SWITCHES;
        if (temp == 8)      *RED_LED = ~0x80;
        else if (temp == 7) *RED_LED = ~0x40;
        else if (temp == 6) *RED_LED = ~0x20;
        else if (temp == 5) *RED_LED = ~0x10;
        else if (temp == 4) *RED_LED = ~0x08;
        else if (temp == 3) *RED_LED = ~0x04;
        else if (temp == 2) *RED_LED = ~0x02;
        else if (temp == 1) *RED_LED = ~0x01;
        else                *RED_LED = ~0x00;
    }
    return 0;
}
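The mapping in the listing above can also be checked on a host machine before it is compiled for the soft-core. The sketch below isolates the switch-to-LED mapping in a pure function; the name led_pattern is hypothetical and the function is not part of the original program.

```c
#include <stdint.h>

/* Map a switch value to the inverted LED pattern used in the listing above.
 * Values 1..8 clear exactly one bit (bit 0 for 1, bit 7 for 8);
 * any other value returns all ones, matching the final else branch. */
static uint32_t led_pattern(uint32_t switches)
{
    if (switches >= 1u && switches <= 8u) {
        return ~(UINT32_C(1) << (switches - 1u));
    }
    return ~UINT32_C(0);
}
```

Because the function has no memory-mapped accesses, it can be tested with an ordinary native compiler; the value it returns is what the original loop writes to the LED register for the same switch reading.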


Test reconfigurable modules:

Testing the reconfiguration process is done by selectively uploading the configuration bitstreams for the four custom instructions and the difference bitstream into the SPI storage using iMPACT, as illustrated in Tapp (2010). That Xilinx document (Configuring Xilinx FPGAs with SPI Serial Flash) shows the steps in detail. Each configuration bitstream is assigned to a specific region. The Nexys3 development board would normally be used for this test, as it hosts our system. However, because the Nexys3 SPI port was difficult to read and is not clearly documented, an Atlys development board was used for this test instead.

Because the Nexys3 board does not have external interfaces such as audio and video, the only input and output for this system is the GPIO module; a UART module was not considered. Testing with only the GPIO module was therefore not convenient, due to the high clock speed. One way to do the test is to use the ICAP_SPARTAN6 primitive with the MultiBoot feature in a simple system built from basic logic gates (AND, XOR, NOR, etc.) that reads the switches as inputs and displays the output on the LEDs, so that the difference is visible when one logic module is swapped for another.

5.2 Results

The cost of the system resources for the first approach and the cost of the system resources for the final system approach are outlined in Table 6.

Approach                  Nr. of LUTs   Nr. of Slices   Latency
MUX based trap handler    246           1798            20.011 ns
ICAP based trap handler   438           1370            18.125 ns

Table 6 Resource requirements for Configuration controller.

The results reveal that the system with the MUX-based trap handler uses fewer look-up tables than the system with the ICAP-based trap handler (246 versus 438), which is consistent with the simpler control logic of the MUX variant. However, the MUX-based system uses more slices (1798 versus 1370) because it uses more logic for the custom modules. Finally, the latency is higher in the MUX-based system because of the trap overhead. In the ICAP-based system, only one custom instruction is configured at a time, and reconfiguration is triggered only when the configured custom instruction is not the requested one. Note that the delay in this table is for the whole implementation without considering the reconfiguration overhead. Only by introducing the ICAP-based trap handler were we able to run the system at the target 50 MHz clock frequency.
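The relation between the delays in Table 6 and the 50 MHz target can be made explicit with a short calculation. The sketch below assumes that the Latency column is the minimum clock period of the whole design (the text calls it the delay of the whole implementation); the function names are hypothetical.

```c
#include <assert.h>

/* Maximum clock frequency in MHz for a minimum clock period given in ns. */
static double max_freq_mhz(double period_ns)
{
    assert(period_ns > 0.0);
    return 1000.0 / period_ns;
}

/* Returns 1 if a design with the given period reaches the target clock. */
static int meets_target(double period_ns, double target_mhz)
{
    return max_freq_mhz(period_ns) >= target_mhz;
}
```

Under this assumption a 50 MHz clock requires a period of at most 20 ns, so the 20.011 ns of the MUX-based system falls just short of the target, while the 18.125 ns of the ICAP-based system leaves some margin.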

For the custom modules, the following table shows the cost of the resources.

Custom Module   Nr. of LUTs   Nr. of Slices   Latency (max/avg) ns   Bitstream size (KB)
CRC 32          43            18              8.038/3.597            282
Counting One    39            19              15.717/14.35           263
Leading Zero    19            15              9.723/3.597            293
Parity (XOR)    7             6               3.618/3.597            282

Table 7 Resource requirements for Custom modules.

The results in Table 7 show the implementation costs for the custom instructions. For the progress report, manual code optimization was performed to see whether the tools would find the same optimizations by themselves, and the result showed that they do not. This was taken into account when the custom modules were implemented. The figures in Table 7 therefore reflect the resource use, delay and bitstream size of each custom module after manual optimisation.

5.3 Evaluation

System performance

The whole system, including the configuration controller, can run at a system clock of 50 MHz. The first of the two biggest limiting factors is that the MIPS CPU takes a trap when a custom instruction exception occurs, and traps incur an additional overhead that would not occur in a baseline MIPS implementation. The second factor is that the trap handler, which acts as the configuration controller, uses external flash memory.

For partial reconfiguration, one important benchmark is the response time that has to be considered for the reconfiguration process. Swapping instructions will obviously take a significant amount of time for loading the corresponding partial bitstream from an external SPI memory to the device. Moreover, the bitstream size would affect the speed of the configuration module.
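To see how the bitstream size affects the loading time, a rough lower bound can be computed from the size and the transfer rate. The sketch below is only an estimate under stated assumptions: the flash clock frequency and bus width are hypothetical parameters that are not given in the text, 1 KB is taken as 1024 bytes, and command overhead and the trap handler itself are ignored.

```c
#include <assert.h>

/* Estimated time in milliseconds to read a bitstream from serial flash.
 * size_kb:  bitstream size in KB (1 KB assumed to be 1024 bytes)
 * clock_hz: flash clock frequency (hypothetical, not stated in the text)
 * bus_bits: data bits transferred per clock (1 for single-bit SPI) */
static double load_time_ms(double size_kb, double clock_hz, int bus_bits)
{
    assert(clock_hz > 0.0 && bus_bits > 0);
    double bits = size_kb * 1024.0 * 8.0;
    return bits / (clock_hz * bus_bits) * 1000.0;
}
```

For example, a 282 KB bitstream read one bit per cycle at an assumed 50 MHz would need roughly 46 ms for the transfer alone, and the time scales linearly with the bitstream size.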

In Fritzell (2013), the configuration controller for module relocation was designed to use two clocks, one running at 50 MHz for the part connected to the bus and the other running at 100 MHz for the part that handles the configuration process. In our system, the trap handler runs at 50 MHz, which could slow down the configuration speed. Moreover, Fritzell (2013) uses a decompression module to decompress the configuration data on the FPGA for faster reconfiguration. We therefore expect the reconfiguration in our system to take longer than what is achieved in that work.

However, there are some techniques that could be applied to optimize the performance and cost of the system on the FPGA device. In this project, we used the FPGA MultiBoot feature, which is slow but uses a serial configuration memory chip that is underutilized in most FPGA prototyping systems. This also separates the configuration bitstream storage from other memory, which improves the security of the system.

Performance Enhancing Techniques

Generally speaking, performance techniques can be divided into techniques that are not FPGA specific (compiler and memory usage, to name a few) and techniques that are FPGA specific, such as increasing the operating frequency. As a rule of thumb, since optimizing configuration speed is a typical goal, an entire program should rarely be targeted at external memory (Fletcher, 2005). If it is, then the use of another clock should be considered in order to handle the process faster.


Comparing the system to a real-world system:

Processor       Processor Type   Device Family used   Speed (MHz) Achieved
PowerPC™ 405    hard             Virtex-4             450
MicroBlaze      soft             Virtex-II Pro        150
MicroBlaze      soft             Spartan-3            85
MIPS            soft             Spartan-6            50

Table 8: Comparison between Xilinx embedded processors and our soft-core, with their performance.

Table 8 summarizes the available embedded processors with the manufacturer's quoted maximum frequencies, together with our soft-core, including the extension, and its maximum frequency. Although the MIPS processor is the slowest in that table, it might outperform the others on workloads that benefit from the custom instructions.

Hardware acceleration

A soft-core on the FPGA allows the designer to make a trade-off between hardware and software in order to maximize efficiency and performance. If a software function is identified as a bottleneck, a custom module can be designed for this function in the FPGA. The module then acts as a co-processor or, as in our case, as a custom instruction extension to the soft-core processor.

One way to evaluate custom instructions implemented in hardware is to compare them against software implementations of the same functions running on the standard ISA of the MIPS CPU. The software functions used as a reference can be found in (Andersen, 2005). For the software evaluation, those four functions, which are written in C, were compiled for MIPS using a GCC cross-compiler, and the code was then disassembled in order to count how many instructions each function consumes. Table 9 shows how many CPU instructions each software function requires, and hence how many are saved by using a custom instruction.


Software function   Instructions
CRC                 262
Hamming weight      262
Leading Zero        294
Parity (XOR)        263

Table 9 Software requirements.
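To illustrate what such software reference functions look like, the sketch below gives simple loop-based C versions of the four operations. These are illustrative only: they are not the code that was compiled for Table 9, and the CRC variant shown (reflected polynomial 0xEDB88320 with initial value and final inversion of all ones) is an assumption, since the text does not state which CRC-32 parameters the hardware module uses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hamming weight: count the set bits one at a time. */
static unsigned count_ones(uint32_t v)
{
    unsigned c = 0;
    for (; v != 0; v >>= 1) {
        c += v & 1u;
    }
    return c;
}

/* Parity: 1 if the number of set bits is odd, otherwise 0. */
static unsigned parity(uint32_t v)
{
    unsigned p = 0;
    while (v != 0) {
        p ^= 1u;
        v &= v - 1u;            /* clear the lowest set bit */
    }
    return p;
}

/* Leading zeros: zero bits above the highest set bit (32 for v == 0). */
static unsigned leading_zeros(uint32_t v)
{
    unsigned n = 0;
    for (uint32_t mask = 0x80000000u; mask != 0 && (v & mask) == 0; mask >>= 1) {
        n++;
    }
    return n;
}

/* Bitwise CRC-32 over a byte buffer (assumed parameters, see lead-in). */
static uint32_t crc32_bitwise(const uint8_t *data, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;
    assert(data != NULL || len == 0);
    for (size_t i = 0; i < len; i++) {
        crc ^= data[i];
        for (int b = 0; b < 8; b++) {
            /* shift right; apply the polynomial when the dropped bit was 1 */
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
        }
    }
    return ~crc;
}
```

Each of these functions contains a data-dependent loop, which is why a software version costs many instructions per call while the corresponding custom instruction produces its result through a single instruction.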


Chapter 6

6 Conclusions and Future Work

6.1 Conclusions

The system was improved through the lifecycle presented in the methodology. The final system, after all improvements, meets the objectives outlined in the introduction chapter. Moreover, learning the concepts and fundamental features of FPGAs step by step was the biggest achievement. The previous chapters described those concepts in detail, together with the necessary components and tools and the implementation of a fully functional PR system. The run-time reconfigurable custom instruction set extension of the MIPS CPU can be replaced dynamically in the system. The most important parts of the implemented system are:

1. MIPS CPU.
2. Trap handler, including the ICAP primitive.
3. The exploitation of the MultiBoot feature for full and partial reconfiguration.

6.2 Future Work

There are some improvements that can be made to the final implemented system, and together these could be considered as the requirements analysis stage for the next lifecycle.

– In this project the Nexys3 was used as a platform. However, the lack of external interfaces limited the usability of this device. Another academic board that includes audio and video could show the input and output of the system and would allow a complete digital system to be built around the soft-core processor.
– The MIPS CPU used as the soft-core is a very simple processor: it is non-pipelined and uses BRAM as both program memory and data memory. This could be improved by implementing a pipelined processor and a simple cache controller connected to DDR memory. As a result, executing larger programs and storing large data structures such as frame buffers would become possible.
– The system uses the MultiBoot feature and the command sequence sent through the ICAP primitive; this could be extended to support the read-back of configuration data from ICAP. There are two different ways of reading and writing configuration data through the ICAP interface. As illustrated in (Fritzell, 2013): “Either clock is left toggling and clock enable is used to control throughput, or clock enable is kept high and the clock signal is controlled to achieve wanted throughput”.
– Adding more advanced modules for communication over a COM port.
– Measuring the duration of the reconfiguration by logging a counter in the trap handler, so that it reflects the number of clock cycles from the time the counter starts until it is stopped.
– The Nexys3 board has a seven-segment display, which could be exploited for testing.
– Different benchmarks could be used to evaluate the soft-core on the FPGA. The most standard benchmark is Dhrystone MIPS (DMIPS), and its result could then be compared with the results achieved with our system.


Works Cited

Andersen, S. E., 2005. Bit Twiddling Hacks. [Online] Available at: http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetNaive [Accessed 31 August 2015].

Beckhoff, C., Koch, D. & Torresen, J., 2012. Go ahead: A partial reconfiguration framework. Field-Programmable Custom Computing Machines (FCCM), 2012 IEEE 20th Annual International Symposium, pp. 37-44.


Bobda, C., 2007. Introduction to Reconfigurable Computing: Architectures, Algorithms, and Applications. s.l.:Springer.

Bobda, C., 2008. Introduction to Reconfigurable Computing. Netherlands: Springer .

Digilent, 2013. Nexys3™ Board Reference Manual. [Online] Available at: https://www.digilentinc.com/Data/Products/NEXYS3/Nexys3_rm.pdf

Doulos.com, 2015. Simple Ram Model. [Online] Available at: https://www.doulos.com/knowhow/vhdl_designers_guide/models/simple_ram_model/ [Accessed 7 August 2015].

Elkateeb, A., 2011. A Processor Design Course Project: Creating Soft-Core MIPS Processor Using Step-by-Step Components' Integration Approach. International Journal of Information and Education Technology, 1(5), pp. 432-440.

Fletcher, B., 2005. FPGA Embedded Processors Revealing True System Performance. In: Embedded Training Program Embedded Systems Conference. [Online] Available at: http://www.xilinx.com/products/design_resources/proc_central/resource/ETP-367paper.pdf [Accessed 14 August 2015].


Fritzell, A., 2013. A System for Fast Dynamic Partial Reconfiguration using GoAhead Design and Implementation.. Masters Thesis: University of Oslo.

Galuzzi, C. & Bertels, K., 2011. The Instruction-Set Extension Problem: A Survey. ACM Transactions on Reconfigurable Technology and Systems. article 18, 4(2).

Gebotys, C. H., 2012. A network flow approach to memory bandwidth utilization in embedded DSP core processors. IEEE Transactions On Very Large Scale Integration (Vlsi) Systems, 10(4), pp. 390-398.

Hansen, S. G., Koch, D. & Torresen, J., 2011. High speed partial runtime reconfiguration using enhanced icap hard macro. In: Parallel and Distributed Processing Workshops and Icap hard macro. Shanghai: IEEE, pp. 174-180.

Hauck, S., 1998. Configuration prefetch for single context reconfigurable coprocessors. In: Proceedings of the 1998 ACM/SIGDA sixth international symposium on Field programmable gate arrays. New York: ACM, pp. 65-74.

Hauck, S. & Wilson, W. D., 1999. Run Length Compression Techniques for FPGA Configurations. Napa Valley, IEEE.

Jo, J., 2013. 6 Basic Phases of Software Development Life Cycle (SDLC). [Online] Available at: http://www.techknol.net/2013/04/software-development-life-cycle.html [Accessed 15 August 2015].

Koch, D., 2013. Partial Reconfiguration on FPGAs: Architectures, Tools and Applications. New York: Springer.

Koch, D., Beckhoff, C. & Torresen, J., 2010. Zero logic overhead integration of partially reconfigurable modules. Proceedings of the 23rd symposium on Integrated circuits and system design, pp. 103-108.

Kozyrakis, C. E. & Patterson, D. A., 2004. Scalable, vector processors for embedded systems. Micro, IEEE, 23(6), pp. 36-45.

Kuon, I. & Rose, J., 2007. Measuring the Gap Between FPGAs and ASICs.. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 26(2), pp. 203-215.

Lysaght, P. & Subrahmanyam, P. A., 2005. Guest Editors’ Introduction: Advances in Configurable Computing. IEEE Design & Test of Computers, 22(2), pp. 85-89.


Miller, J., 2004. The Chicago guide to writing about numbers. Chicago: University of Chicago Press.

Minev, P. B. & Kukenska, V. S., 2007. Implementation of Soft-core Processors in FPGAs. Gabrovo, International Scientific Conference.

MIPS Technologies, 2003. MIPS32™ Architecture For Programmers Volume II: The MIPS32™ Instruction Set. [Online] Available at: http://www.cs.cornell.edu/courses/cs3410/2008fa/mips_vol2.pdf [Accessed 3 August 2015].

OutputLogic.com, 2013. OutputLogic.com. [Online] Available at: http://outputlogic.com/ [Accessed 30 August 2015].

Pittman, R. N., Lynch, N. L. & Forin, A., 2006. eMIPS, A Dynamically Extensible Processor, Redmond: Microsoft Research.

Synopsys, 2010. SiliconBlue Selects Synopsys as FPGA Synthesis Partner for Its iCE65 mobileFPGA Family. [Online] Available at: http://news.synopsys.com/index.php?s=20295&item=123144 [Accessed 30 March 2015].

Tapp, S., 2010. Configuring Xilinx FPGAs with SPI Serial Flash. 1st ed. [ebook] Xilinx.Inc.. [Online] Available at: http://www.xilinx.com/support/documentation/application_notes/xapp951.pdf [Accessed 1 September 2015].

Wold, A., Koch, D. & Torresen, J., 2012. Design techniques for increasing performance and resource utilization of reconfigurable soft CPUs. s.l., IEEE, pp. 50-55.

Xilinx Inc, 2011. Spartan-6 FPGA Block RAM Resources User Guide. [Online] Available at: http://www.xilinx.com/support/documentation/user_guides/ug383.pdf [Accessed 1 August 2015].

Xilinx Inc, 2015. Spartan-6 FPGA Configuration User Guide. [Online] Available at: http://www.xilinx.com/support/documentation/user_guides/ug380.pdf [Accessed 11 August 2015].


Xilinx, 2012. Partial Configuration User Guide. [Online] Available at: http://www.xilinx.com/support/documentation/sw_manuals/xilinx14_1/ug702.pdf [Accessed 1 August 2015].

Xilinx, 2013. ISE Design Suite. [Online] Available at: http://www.xilinx.com/products/design-tools/ise-design-suite.html [Accessed 1 May 2015].

Yiannacouras, P., Steffan, J. G. & Rose, J., 2006. Application-Specific Customization of Soft Processor Microarchitecture. Proceedings of the 2006 ACM/SIGDA 14th international symposium on Field programmable gate arrays, pp. 201-210.


Appendix A - MIPS CPU

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;
use std.textio.all;

entity MIPS_CPU is
  port (
    clk         : in  std_logic;
    reset       : in  std_logic;
    WaitRequest : in  std_logic;
    D_write_en  : out std_logic;
    D_read_en   : out std_logic;
    I_ADR       : out std_logic_vector(31 downto 0);
    I_DATA      : in  std_logic_vector(31 downto 0);
    D_ADR       : out std_logic_vector(31 downto 0);
    D_W_DATA    : out std_logic_vector(31 downto 0);
    D_R_DATA    : in  std_logic_vector(31 downto 0);
    RES_0       : in  std_logic_vector(31 downto 0);
    opCode      : out std_logic_vector(5 downto 0);
    OP_A_c      : out std_logic_vector(31 downto 0);
    OP_B_c      : out std_logic_vector(31 downto 0);
    trap_start  : out std_logic;
    OP_A        : out std_logic_vector(31 downto 0);
    OP_B        : out std_logic_vector(31 downto 0));
end MIPS_CPU;

architecture a_MIPS_CPU of MIPS_CPU is

  type Instruction_type_type is (Undefined, R_type, ADDI, ADDIU, SLTI, SLTIU,
                                 ANDI, ORI, XORI, LUI, J, BNE, BEQ, load, store,
                                 JAL, BRANCHES, BGTZ, BLEZ, I_type_special_2);
  type PC_type is (Normal, branchs, Jumbs);

  signal Instruction_type : Instruction_type_type;
  signal PCstate : PC_type;
  signal local_D_write_en : std_logic;
  signal rs, rt, rd, sa : std_logic_vector(4 downto 0);
  signal W_ADR : std_logic_vector(4 downto 0);
  signal R_DATA_A, R_DATA_B : std_logic_vector(31 downto 0);
  signal ALU_out : std_logic_vector(31 downto 0);
  signal ALU_out64 : std_logic_vector(63 downto 0);
  signal PC, PC4, nextPC, branchPC, jumbpc : std_logic_vector(31 downto 0);
  signal RegFile_en : std_logic;
  signal instr : std_logic_vector(5 downto 0);
  signal funct : std_logic_vector(5 downto 0);
  signal immediate, SL2immediate : std_logic_vector(31 downto 0);
  signal immediateU : std_logic_vector(31 downto 0);
  signal immediateJ : std_logic_vector(27 downto 0);
  signal BranchTaken, branching : std_logic;
  signal JumpTaken, JumpTakenJR : std_logic;
  signal idata : std_logic_vector(31 downto 0);
  signal HI, LO : std_logic_vector(31 downto 0);
  signal WaitRequest_i : std_logic;
  signal WaitRequest_comb : std_logic;
  signal mul_wait : std_logic;
  signal MTHI, MTLO : std_logic;
  signal mul_taken : std_logic;

  type memtype is array (31 downto 0) of std_logic_vector(31 downto 0);
  signal RegFile : memtype := (others => (others => '0'));

begin

  -- operands and function code exported to the custom instruction logic
  OP_A_c <= R_DATA_A;
  OP_B_c <= R_DATA_B;
  opCode <= funct;

------REGISTER_FILE ------

  -- synchronous write port; register 0 is never written
  p_write : process (clk)
  begin
    if clk'event and clk = '1' then
      if WaitRequest_comb = '1' then
        if RegFile_en = '1' and (W_ADR /= (W_ADR'range => '0')) then
          if (Instruction_type = load) then
            RegFile(to_integer(unsigned(W_ADR))) <= D_R_DATA;
          else
            RegFile(to_integer(unsigned(W_ADR))) <= ALU_out;
          end if;
        end if;  -- RegFileEnable
      end if;  -- Waitrequest
    end if;  -- clk
  end process;

  -- asynchronous read ports
  R_DATA_A <= RegFile(to_integer(unsigned(rs)));
  R_DATA_B <= RegFile(to_integer(unsigned(rt)));

  OP_A <= R_DATA_A;
  OP_B <= R_DATA_B;

------INSTRUCTION_DECODER------

  --setting idata to correct signals:
  idata <= I_DATA;
  funct <= idata(5 downto 0);
  instr <= idata(31 downto 26);
  rs    <= idata(25 downto 21);
  rd    <= idata(15 downto 11);
  rt    <= idata(20 downto 16);
  sa    <= idata(10 downto 6);
  --Immediate sign extended:
  immediate(31 downto 16) <= (others => idata(15));
  immediate(15 downto 0)  <= idata(15 downto 0);
  --Immediate unsigned:
  immediateU <= x"0000" & idata(15 downto 0);
  --Jump offset:
  immediateJ <= idata(25 downto 0) & "00";
  --Immediate sign extended and leftshift 2:
  SL2immediate <= immediate(29 downto 0) & "00";

--Decoding instructions:

  p_INS_DECOER : process (instr)
  begin
    ------R_TYPE
    if (std_match(instr, "000000")) then Instruction_type <= R_type;
    elsif (std_match(instr, "011100")) then Instruction_type <= I_type_special_2;  -- I-type instruction SPECIAL 2 custom instruction
    ------I_TYPE
    elsif (std_match(instr, "001001")) then Instruction_type <= ADDIU;
    elsif (std_match(instr, "001000")) then Instruction_type <= ADDI;
    elsif (std_match(instr, "001011")) then Instruction_type <= SLTIU;
    elsif (std_match(instr, "001100")) then Instruction_type <= ANDI;
    elsif (std_match(instr, "001101")) then Instruction_type <= ORI;
    elsif (std_match(instr, "001110")) then Instruction_type <= XORI;
    elsif (std_match(instr, "001111")) then Instruction_type <= load;   -- LUI
    elsif (std_match(instr, "001010")) then Instruction_type <= SLTI;   -- slti
    elsif (std_match(instr, "101011")) then Instruction_type <= store;  -- store instruction
    elsif (std_match(instr, "101000")) then Instruction_type <= store;  -- store byte instruction
    elsif (std_match(instr, "100011")) then Instruction_type <= load;   -- load instruction
    -- elsif (std_match(instr, "100000")) then Instruction_type <= load;  -- load byte instruction
    ------BRANCHES
    elsif (std_match(instr, "000100")) then Instruction_type <= BEQ;
    elsif (std_match(instr, "000101")) then Instruction_type <= BNE;
    elsif (std_match(instr, "000111")) then Instruction_type <= BGTZ;
    elsif (std_match(instr, "000110")) then Instruction_type <= BLEZ;
    elsif (std_match(instr, "000001")) then Instruction_type <= BRANCHES;  -- BLTZ,BGEZ,BGEZAL,BLTZAL
    ------J_TYPE
    elsif (std_match(instr, "000010")) then Instruction_type <= J;    -- jump instruction
    elsif (std_match(instr, "000011")) then Instruction_type <= JAL;  -- jal (jump and link)
    else
      Instruction_type <= Undefined;
      report " +++ unimplemented instruction type !! ";
    end if;
  end process;
------

  WaitRequest_i    <= '0' when (mul_taken = '1' and mul_wait = '0') else '1';
  WaitRequest_comb <= WaitRequest and WaitRequest_i;

------for multiplication instructions 2 cycle
  InstMulreg : process (clk)
  begin
    if rising_edge(clk) then
      if MTHI = '1' then
        HI <= R_DATA_A;
      elsif mul_wait = '1' then
        HI <= ALU_out64(63 downto 32);
      end if;
      if MTLO = '1' then
        LO <= R_DATA_A;
      elsif mul_wait = '1' then  -- comparison; the original "<=" here was not an equality test
        LO <= ALU_out64(31 downto 0);
      end if;
      if mul_taken = '1' and mul_wait = '0' then
        mul_wait <= '1';
      else
        mul_wait <= '0';
      end if;
    end if;
  end process;

------for sending the trap in case of custom instructions
  process (Instruction_type, funct)
  begin
    if (Instruction_type = I_type_special_2) then
      if (funct = "010000" or funct = "010001" or funct = "100000" or funct = "100001") then  --I:CUST
        trap_start <= '1';
      else
        trap_start <= '0';
      end if;
    else
      trap_start <= '0';
    end if;
  end process;

------ALU ------

  D_write_en <= local_D_write_en;
  D_ADR      <= ALU_out;
  D_W_DATA   <= R_DATA_B;

------
  p_ALU : process (PC, hi, lo, WaitRequest_comb, RES_0, Instruction_type, funct,
                   instr, rt, rd, rs, sa, immediate, immediateU, SL2immediate,
                   R_DATA_A, R_DATA_B, ALU_out, ALU_out64, W_ADR)
  begin

    --initialising values:
    ALU_out          <= (others => '0');
    ALU_out64        <= (others => '0');
    JumpTaken        <= '0';
    BranchTaken      <= '0';
    W_ADR            <= (others => '0');
    RegFile_en       <= '0';
    D_read_en        <= '0';
    local_D_write_en <= '0';
    JumpTakenJR      <= '0';
    MTHI             <= '0';
    MTLO             <= '0';
    mul_taken        <= '0';

    case Instruction_type is
      when R_type =>
        RegFile_en <= '1';
        W_ADR <= rd;
        case funct is


          when B"00_00_00" => ALU_out <= std_logic_vector(unsigned(R_DATA_B) SLL to_integer(unsigned(sa)));        --I:SLL
          when B"00_00_10" => ALU_out <= std_logic_vector(unsigned(R_DATA_B) SRL to_integer(unsigned(sa)));        --I:SRL
          when B"00_01_10" => ALU_out <= std_logic_vector(unsigned(R_DATA_B) SRL to_integer(unsigned(R_DATA_A)));  --I:SRLV
          when B"00_01_00" => ALU_out <= std_logic_vector(unsigned(R_DATA_B) SLL to_integer(unsigned(R_DATA_A)));  --I:SLLV
          -- shift_right on signed replicates the sign bit; SRL on signed would shift in zeros
          when B"00_00_11" => ALU_out <= std_logic_vector(shift_right(signed(R_DATA_B), to_integer(unsigned(sa))));        --I:SRA
          when B"00_01_11" => ALU_out <= std_logic_vector(shift_right(signed(R_DATA_B), to_integer(unsigned(R_DATA_A))));  --I:SRAV
          when B"10_10_10" => if signed(R_DATA_A) < signed(R_DATA_B) then      --I:SLT
                                ALU_out <= x"00000001";                        --I:SLT
                              else                                             --I:SLT
                                ALU_out <= (others => '0');                    --I:SLT
                              end if;                                          --I:SLT
          when B"10_10_11" => if unsigned(R_DATA_A) < unsigned(R_DATA_B) then  --I:SLTU
                                ALU_out <= x"00000001";                        --I:SLTU
                              else                                             --I:SLTU
                                ALU_out <= (others => '0');                    --I:SLTU
                              end if;                                          --I:SLTU
          when B"10_00_01" => ALU_out <= std_logic_vector(unsigned(R_DATA_A) + unsigned(R_DATA_B));  --I:ADDU
          when B"10_00_00" => ALU_out <= std_logic_vector(signed(R_DATA_A) + signed(R_DATA_B));      --I:ADD
          when B"10_00_10" => ALU_out <= std_logic_vector(signed(R_DATA_A) - signed(R_DATA_B));      --I:SUB
          when B"10_00_11" => ALU_out <= std_logic_vector(unsigned(R_DATA_A) - unsigned(R_DATA_B));  --I:SUBU
          when B"10_01_00" => ALU_out <= R_DATA_A and R_DATA_B;  --I:AND
          when B"10_01_01" => ALU_out <= R_DATA_A or R_DATA_B;   --I:OR
          when B"10_01_10" => ALU_out <= R_DATA_A xor R_DATA_B;  --I:XOR
          when B"10_01_11" => ALU_out <= R_DATA_A nor R_DATA_B;  --I:NOR
          when B"01_00_00" => ALU_out <= HI;      --I:MFHI
          when B"01_00_10" => ALU_out <= LO;      --I:MFLO
          when B"01_00_01" => MTHI <= '1';        --I:MTHI
          when B"01_00_11" => MTLO <= '1';        --I:MTLO
          when B"00_10_00" => JumpTakenJR <= '1'; --I:JR
                              RegFile_en <= '0';  --I:JR
          when B"00_10_01" => ALU_out <= std_logic_vector(unsigned(PC) + 8);  --I:JALR
                              JumpTakenJR <= '1';                             --I:JALR
          when B"00_10_11" => ALU_out <= R_DATA_A;               --I:MOVN
                              if R_DATA_B = x"00000000" then     --I:MOVN
                                RegFile_en <= '0';               --I:MOVN
                              end if;                            --I:MOVN
          when B"00_10_10" => ALU_out <= R_DATA_A;               --I:MOVZ
                              if R_DATA_B /= x"00000000" then    --I:MOVZ
                                RegFile_en <= '0';               --I:MOVZ
                              end if;                            --I:MOVZ
          when B"01_10_00" => mul_taken <= '1';
                              ALU_out64 <= std_logic_vector(signed(R_DATA_A) * signed(R_DATA_B));


          when B"01_10_01" => mul_taken <= '1';
                              ALU_out64 <= std_logic_vector(unsigned(R_DATA_A) * unsigned(R_DATA_B));
          when others => report " +++ unimplemented instruction type !! ";
        end case;

      ------I_TYPE
      when ADDIU => RegFile_en <= '1';                                                        --I:ADDIU
                    W_ADR <= rt;                                                              --I:ADDIU
                    ALU_out <= std_logic_vector(unsigned(R_DATA_A) + unsigned(immediate));    --I:ADDIU
      when ADDI  => RegFile_en <= '1';                                                        --I:ADDI
                    W_ADR <= rt;                                                              --I:ADDI
                    ALU_out <= std_logic_vector(signed(R_DATA_A) + signed(immediate));        --I:ADDI
      when SLTIU => RegFile_en <= '1';                                   --I:SLTIU
                    W_ADR <= rt;                                         --I:SLTIU
                    if unsigned(R_DATA_A) < unsigned(immediateU) then    --I:SLTIU
                      ALU_out <= (0 => '1', others => '0');              --I:SLTIU
                    else                                                 --I:SLTIU
                      ALU_out <= (others => '0');                        --I:SLTIU
                    end if;                                              --I:SLTIU
      when SLTI  => RegFile_en <= '1';                                   --I:SLTI
                    W_ADR <= rt;                                         --I:SLTI
                    if signed(R_DATA_A) < signed(immediate) then         --I:SLTI
                      ALU_out <= (0 => '1', others => '0');              --I:SLTI
                    else                                                 --I:SLTI
                      ALU_out <= (others => '0');                        --I:SLTI
                    end if;                                              --I:SLTI
      when ANDI  => RegFile_en <= '1';                   --I:ANDI
                    W_ADR <= rt;                         --I:ANDI
                    ALU_out <= R_DATA_A and immediateU;  --I:ANDI
      when ORI   => RegFile_en <= '1';                   --I:ORI
                    W_ADR <= rt;                         --I:ORI
                    ALU_out <= R_DATA_A or immediateU;   --I:ORI
      when XORI  => RegFile_en <= '1';                   --I:XORI
                    W_ADR <= rt;                         --I:XORI
                    ALU_out <= R_DATA_A xor immediateU;  --I:XORI

      ------load instruction
      when load =>
        RegFile_en <= '1';
        W_ADR <= rt;
        case instr is
          when B"10_00_11" =>
            ALU_out <= std_logic_vector(signed(immediate) + signed(R_DATA_A));  --I:LW
            local_D_write_en <= '0';
            D_read_en <= '1';                                                   --I:LW
          when B"10_00_00" =>
            ALU_out <= std_logic_vector(signed(immediate) + signed(R_DATA_A));  --I
            local_D_write_en <= '0';
            D_read_en <= '1';                                                   --I
          when B"00_11_11" =>
            ALU_out <= immediate(15 downto 0) & X"0000";                        --I:LUI
            local_D_write_en <= '0';
          when others =>
            report " +++ unimplemented load instruction !! ";
        end case;

    end case;
------store instruction
when store =>
    case instr is
        -- ALU_out == address
        -- address = base + offset (base register: bits 25-21, offset: bits 15-0)
        when B"10_10_11" =>                                  --I:SW
            ALU_out <= std_logic_vector(signed(immediate) + signed(R_DATA_A));
            local_D_write_en <= '1';
        when B"10_10_00" =>                                  --I:SB
            ALU_out <= std_logic_vector(signed(immediate) + signed(R_DATA_A));
            local_D_write_en <= '1';
            report " +++ store byte executed as store word !! ";
        when others =>
            report " +++ unimplemented store instruction !! ";
    end case;

------JUMP
when J =>                                                    --I:J
    JumpTaken <= '1';
when JAL =>                                                  --I:JAL
    RegFile_en <= '1';
    W_ADR <= "11111";
    ALU_out <= std_logic_vector(unsigned(PC) + 8);
    JumpTaken <= '1';

------CUSTOMS
when I_type_special_2 =>
    RegFile_en <= '1';                                       --I:?
    W_ADR <= rd;                                             --I:?
    if (funct = "010000" or funct = "010001" or funct = "100000" or funct = "100001") then    --I:CUST
        ALU_out <= RES_0;                                    --I:CUST
    else
        report " +++ not custom instruction type !! ";
    end if;

------BRANCHES
when BNE =>                                                  --I:BNE
    if R_DATA_A /= R_DATA_B then
        BranchTaken <= '1';
    end if;
when BEQ =>                                                  --I:BEQ
    if R_DATA_A = R_DATA_B then
        BranchTaken <= '1';
    end if;
when BGTZ =>                                                 --I:BGTZ
    if signed(R_DATA_A) > x"00000000" then
        BranchTaken <= '1';
    end if;
when BLEZ =>                                                 --I:BLEZ
    if signed(R_DATA_A) <= x"00000000" then
        BranchTaken <= '1';
    end if;
when BRANCHES =>
    if rt = "00000" then                                     --I:BLTZ
        if signed(R_DATA_A) < x"00000000" then
            BranchTaken <= '1';
        end if;
    elsif rt = "00001" then                                  --I:BGEZ
        if signed(R_DATA_A) >= x"00000000" then
            BranchTaken <= '1';
        end if;
    elsif rt = "10001" then                                  --I:BGEZAL
        W_ADR <= "11111";
        ALU_out <= std_logic_vector(unsigned(PC) + 8);
        if signed(R_DATA_A) >= x"00000000" then
            BranchTaken <= '1';
        end if;
    elsif rt = "10000" then                                  --I:BLTZAL
        W_ADR <= "11111";
        ALU_out <= std_logic_vector(unsigned(PC) + 8);
        -- BLTZAL branches only when rs is strictly negative
        if signed(R_DATA_A) < x"00000000" then
            BranchTaken <= '1';
        end if;
    end if;
------
when Undefined =>
    report " +++ undefined instruction !! ";
when others =>
    report " +++ unimplemented instruction type !! ";
end case;
end process;

------PROGRAM-COUNTER------

--Immediate sign extended and left-shifted by 2:
SL2immediate <= immediate(29 downto 0) & "00";

nextPC <= PC4      when PCstate = Normal  else
          branchPC when PCstate = branchs else
          JumbPC   when PCstate = Jumbs   else
          PC4;
------
I_ADR <= nextPC when WaitRequest_comb = '1' else PC;
PC4   <= std_logic_vector(unsigned(PC) + 4);

process (clk)
begin
    if clk'event and clk = '1' then
        if WaitRequest_comb = '1' then
            if reset = '1' then
                PCstate  <= Normal;
                PC       <= X"BFC00000";    ---MIPS reset address
                branchPC <= X"BFC00000";    ---MIPS reset address
                JumbPC   <= X"BFC00000";    ---MIPS reset address
            else
                PC <= nextPC;
                ------
                case PCstate is
                    -- "Normal" state of PC:
                    when Normal =>
                        --If a branch is taken:
                        if BranchTaken = '1' then
                            PCstate  <= branchs;
                            branchPC <= std_logic_vector(signed(PC4) + signed(SL2immediate));
                        --If a jump is taken:
                        elsif JumpTaken = '1' then
                            PCstate <= Jumbs;
                            JumbPC  <= PC4(31 downto 28) & immediateJ;
                        -- If a jump from register is taken:
                        elsif JumpTakenJR = '1' then
                            PCstate <= Jumbs;
                            JumbPC  <= R_DATA_A;
                        else
                            PCstate <= Normal;
                        end if;
                    -- branch and jump states of PC:
                    when branchs =>
                        PCstate <= Normal;
                    when Jumbs =>
                        PCstate <= Normal;
                    when others =>
                        PCstate <= Normal;
                end case;-- case PCstate
                ------
            end if;--reset
        end if;--wait
    end if;--clk
end process;
------
end;
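
A note on the SRA and SRAV cases of the ALU listing: in the IEEE numeric_std package, the srl operator performs a logical shift even when its operand is signed, whereas shift_right applied to a signed operand replicates the sign bit, which is the behaviour MIPS requires for arithmetic right shifts. The following simulation-only sketch is not part of the project sources; the entity name sra_check_tb and the operand value are illustrative assumptions chosen to make the difference visible.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical stand-alone testbench (not part of the soft-core sources).
entity sra_check_tb is
end entity sra_check_tb;

architecture sim of sra_check_tb is
begin
    process
        -- Illustrative operand with the sign bit set.
        constant operand   : std_logic_vector(31 downto 0) := x"80000000";
        variable logical_r : std_logic_vector(31 downto 0);
        variable arith_r   : std_logic_vector(31 downto 0);
    begin
        -- Logical shift: zeros are shifted in from the left.
        logical_r := std_logic_vector(shift_right(unsigned(operand), 4));
        -- Arithmetic shift: the sign bit is replicated.
        arith_r   := std_logic_vector(shift_right(signed(operand), 4));

        assert logical_r = x"08000000"
            report "unexpected logical shift result" severity failure;
        assert arith_r = x"F8000000"
            report "unexpected arithmetic shift result" severity failure;

        report "shift check passed";
        wait;
    end process;
end architecture sim;
```

Running this in any VHDL simulator should end with the "shift check passed" report; a failing assertion would indicate that the shift in question does not preserve the sign.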


Appendix B - Trap handler based on MUX

library IEEE;
use IEEE.STD_LOGIC_1164.ALL;
------
entity trapHandler is
    Port (
        clk         : IN  std_logic;
        -- clk100   : in  std_logic;
        reset       : IN  std_logic;
        address     : IN  std_logic_vector(31 DOWNTO 0);
        opcode      : in  std_logic_vector (5 downto 0);
        writedata   : IN  std_logic_vector(31 DOWNTO 0);
        commandIn   : in  STD_LOGIC_VECTOR (31 downto 0);
        readdata    : OUT std_logic_vector(31 DOWNTO 0);
        WaitRequest : in  std_logic
    );
end trapHandler;

architecture Behavioral of trapHandler is

    component custom1_module is
        port (
            data_in            : in  std_logic_vector (31 downto 0);
            crc_en, reset, clk : in  std_logic;
            crc_out            : out std_logic_vector (31 downto 0));
    end component;
    ------
    component custom2_module is
        port (
            data_in : in  std_logic_vector (31 downto 0);
            one_out : out std_logic_vector (31 downto 0));
    end component;
    ------
    component custom3_module is
        port (
            data_in    : in  std_logic_vector (31 downto 0);
            parity_out : out std_logic_vector (31 downto 0));
    end component;
    ------
    component custom4_module is
        port (
            data_in  : in  std_logic_vector (31 downto 0);
            zero_out : out std_logic_vector (31 downto 0));
    end component;
    ------
    signal reg       : std_logic_vector(31 downto 0);
    signal sel1      : std_logic;
    signal sel2      : std_logic;
    signal sel3      : std_logic;
    signal sel4      : std_logic;
    signal readdata1 : std_logic_vector(31 downto 0);
    signal readdata2 : std_logic_vector(31 downto 0);
    signal readdata3 : std_logic_vector(31 downto 0);
    signal readdata4 : std_logic_vector(31 downto 0);
    ------
begin

    reg <= writedata; --op_A and writedata

    inst1: custom1_module PORT MAP(
        data_in => reg,
        crc_en  => sel1,
        reset   => reset,
        clk     => clk,
        crc_out => readdata1
    );
    ------
    inst2: custom2_module PORT MAP(
        data_in => reg,
        one_out => readdata2
    );
    ------
    inst3: custom3_module PORT MAP(
        data_in    => reg,
        parity_out => readdata3
    );
    ------
    inst4: custom4_module PORT MAP(
        data_in  => reg,
        zero_out => readdata4
    );
    ------
    sel1 <= '1' when (opcode = "010000") else '0';
    sel2 <= '1' when (opcode = "100001") else '0';
    sel3 <= '1' when (opcode = "010001") else '0';
    sel4 <= '1' when (opcode = "100000") else '0';
    ------
    process(readdata1, readdata2, readdata3, readdata4, sel1, sel2, sel3, sel4)
    begin
        if(sel1 = '1')then
            readdata <= readdata1;
        elsif(sel2 = '1')then
            readdata <= readdata2;
        elsif(sel3 = '1')then
            readdata <= readdata3;
        elsif(sel4 = '1')then
            readdata <= readdata4;
        else
            readdata <= (others => '0');
        end if;
    end process;

end Behavioral;


Appendix C - Trap handler based on ICAP

------
library IEEE;
use IEEE.STD_LOGIC_1164.ALL;
use ieee.std_logic_unsigned.all;

entity trapHandler is
    Port (
        opcode     : in  std_logic_vector (5 downto 0);
        dataIn     : in  STD_LOGIC_VECTOR (31 downto 0);
        commandIn  : in  STD_LOGIC_VECTOR (31 downto 0);
        trap_start : in  std_logic;
        clk        : in  STD_LOGIC;
        rst        : in  STD_LOGIC;
        dataOut    : out STD_LOGIC_VECTOR (31 downto 0));
end trapHandler;

architecture Behavioral of trapHandler is

    component custom_module is
        port (
            data_in         : in  std_logic_vector (31 downto 0);
            start, rst, clk : in  std_logic;
            CustomInstID    : out std_logic_vector (5 downto 0);
            done            : out std_logic;
            data_out        : out std_logic_vector (31 downto 0));
    end component;

    component ICAP_SPARTAN6 is
        port (
            clk   : in  std_logic;
            ce    : in  std_logic;
            WRITE : in  std_logic;
            I     : in  std_logic_vector(15 downto 0);
            O     : out std_logic_vector(15 downto 0);
            busy  : out std_logic
        );
    end component;

    signal start       : std_logic;
    signal custom_done : std_logic;

    TYPE st IS (st0, st1, st2, st3);
    SIGNAL currentState, nextState : st;

    signal command_register     : std_logic_vector(223 downto 0);
    signal command_register_reg : std_logic_vector(223 downto 0);

    signal MB_StartAddr : std_logic_vector(23 downto 0);
    signal FB_StartAddr : std_logic_vector(23 downto 0);

    constant MB_StartAddr1 : std_logic_vector(23 downto 0) := X"100000";
    constant MB_StartAddr2 : std_logic_vector(23 downto 0) := X"200000";
    constant MB_StartAddr3 : std_logic_vector(23 downto 0) := X"300000";
    constant MB_StartAddr4 : std_logic_vector(23 downto 0) := X"400000";

    constant FB_StartAddr1 : std_logic_vector(23 downto 0) := X"100000";
    constant FB_StartAddr2 : std_logic_vector(23 downto 0) := X"200000";
    constant FB_StartAddr3 : std_logic_vector(23 downto 0) := X"300000";
    constant FB_StartAddr4 : std_logic_vector(23 downto 0) := X"400000";

    signal custom_start : std_logic;
    signal icap_datain  : std_logic_vector(15 downto 0);
    signal icap_dataout : std_logic_vector(15 downto 0);
    signal icap_busy    : std_logic;
    signal icap_write   : std_logic;
    signal count        : std_logic_vector(3 downto 0);
    signal opcode1      : std_logic_vector(7 downto 0) := X"00";
    signal opcode2      : std_logic_vector(7 downto 0) := X"00";
    signal CustomInstID : std_logic_vector(5 downto 0);

begin

    --instantiate the ICAP module
    ICAP_inst: ICAP_SPARTAN6 port map(
        clk   => clk,
        ce    => (not rst),
        WRITE => icap_write,
        I     => icap_datain,
        O     => icap_dataout,
        busy  => icap_busy
    );

    --select ICAP write or read command
    icap_write <= '1' when (currentState = st1) else '0';

    --send the data to the ICAP module from the command_register_reg
    icap_datain <= command_register_reg(223 downto 208);

    --implement a shift register to hold the command that needs to be sent to the ICAP module
    process(clk, rst)
    begin
        if(rst = '0') then
            command_register_reg <= (others => '0');
        elsif(rising_edge(clk))then
            if(currentState = st1)then
                --rotate left by 16 places so that the next word reaches bits 223 downto 208
                command_register_reg <= command_register_reg(207 downto 0) & command_register_reg(223 downto 208);
            else
                command_register_reg <= command_register;
            end if;
        end if;
    end process;

    --command that is to be sent to the ICAP module (fourteen 16-bit words)
    command_register <= X"FFFF" & X"AA99" & X"5566" &
                        X"3261" & MB_StartAddr(15 downto 0) &
                        X"3281" & opcode1 & MB_StartAddr(23 downto 16) &
                        X"32A1" & FB_StartAddr(15 downto 0) &
                        X"32C1" & opcode2 & FB_StartAddr(23 downto 16) &
                        X"30A1" & X"000E" & X"2000";


    --Master bitstream address selection on the basis of the opcode
    MB_StartAddr <= MB_StartAddr1 when (opcode = "010000") else
                    MB_StartAddr2 when (opcode = "100001") else
                    MB_StartAddr3 when (opcode = "010001") else
                    MB_StartAddr4 when (opcode = "100000") else
                    (others => '0');

    --Feedback bitstream address selection on the basis of the opcode
    FB_StartAddr <= FB_StartAddr1 when (opcode = "010000") else
                    FB_StartAddr2 when (opcode = "100001") else
                    FB_StartAddr3 when (opcode = "010001") else
                    FB_StartAddr4 when (opcode = "100000") else
                    (others => '0');

    --assign nextState to the currentState on the clock edge
    process(clk, rst)
    begin
        if(rst = '0') then
            currentState <= st0;
        elsif(rising_edge(clk))then
            currentState <= nextState;
        end if;
    end process;

    --decide nextState on the basis of currentState, count, trap_start, custom_done,
    --opcode and CustomInstID (all signals read below are in the sensitivity list)
    process(currentState, count, trap_start, custom_done, opcode, CustomInstID)
    begin
        case (currentState) is
            --st0 is the reset state; it waits here for the trap_start signal
            when st0 =>
                if(trap_start = '1')then
                    --if the currently loaded custom instruction is the required one, go to st2;
                    --otherwise go to st1
                    if(opcode = CustomInstID)then
                        nextState <= st2;
                    else
                        nextState <= st1;
                    end if;
                else
                    nextState <= st0;
                end if;
            when st1 =>
                --in st1, the command is sent to the ICAP module over 14 clock cycles;
                --when the counter equals 13, move to st2
                if(count = "1101")then
                    nextState <= st2;
                else
                    nextState <= st1;
                end if;
            when st2 =>
                nextState <= st3;
            when st3 =>
                --now start the custom module to run the custom command
                if(custom_done = '1')then
                    nextState <= st0;
                else
                    nextState <= st3;
                end if;
            when others =>
                nextState <= st0;
        end case;
    end process;

    --implement a counter, which is used while sending the command to the ICAP module
    process(clk, rst)
    begin
        if(rst = '0') then
            count <= (others => '0');
        elsif(rising_edge(clk))then
            -- if currentState is st1, then count
            if(currentState = st1)then
                count <= count + '1';
            else
                count <= (others => '0');
            end if;
        end if;
    end process;

    --instantiate the custom instruction module
    inst1: custom_module PORT MAP(
        CustomInstID => CustomInstID,
        data_in      => dataIn,
        start        => start,
        rst          => rst,
        clk          => clk,
        done         => custom_done,
        data_out     => dataOut
    );

    --start the custom module when state = st3
    custom_start <= '1' when (currentState = st3) else '0';
    start <= custom_start when ((opcode = "010000") or (opcode = "100001") or
                                (opcode = "010001") or (opcode = "100000")) else '0';

end Behavioral;
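
The 224-bit command register in this listing holds fourteen 16-bit words, and the shift register presents one word per clock cycle on bits 223 downto 208 while the state machine is in st1 (counter values 0 to 13). The simulation-only sketch below is not part of the project sources; it rebuilds the same word sequence and rotates it fourteen times to check the order in which words would reach the ICAP input. The entity name icap_cmd_order_tb, the address values and the zero opcode bytes are illustrative assumptions.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical stand-alone testbench (not part of the soft-core sources).
entity icap_cmd_order_tb is
end entity icap_cmd_order_tb;

architecture sim of icap_cmd_order_tb is
    -- Example values only; the listing selects the addresses from the opcode.
    constant MB_ADDR : std_logic_vector(23 downto 0) := x"100000";
    constant FB_ADDR : std_logic_vector(23 downto 0) := x"100000";
    constant OPC     : std_logic_vector(7 downto 0)  := x"00";
begin
    process
        variable cmd  : std_logic_vector(223 downto 0);
        variable word : std_logic_vector(15 downto 0);
    begin
        -- Same fourteen-word layout as command_register in the listing.
        cmd := x"FFFF" & x"AA99" & x"5566" &
               x"3261" & MB_ADDR(15 downto 0) &
               x"3281" & OPC & MB_ADDR(23 downto 16) &
               x"32A1" & FB_ADDR(15 downto 0) &
               x"32C1" & OPC & FB_ADDR(23 downto 16) &
               x"30A1" & x"000E" & x"2000";

        for i in 0 to 13 loop
            -- Word that would be driven onto the ICAP input in cycle i.
            word := cmd(223 downto 208);
            if i = 0 then
                assert word = x"FFFF" report "first word mismatch" severity failure;
            elsif i = 13 then
                assert word = x"2000" report "last word mismatch" severity failure;
            end if;
            -- Rotate left by 16 bits, as the shift register does in st1.
            cmd := cmd(207 downto 0) & cmd(223 downto 208);
        end loop;

        -- After fourteen rotations the register is back at its starting position.
        assert cmd(223 downto 208) = x"FFFF"
            report "rotation did not wrap after 14 words" severity failure;

        report "ICAP command order check passed";
        wait;
    end process;
end architecture sim;
```

The check only covers word ordering and count; it does not model the ICAP primitive itself or its busy signal.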
