IT 14 008 Degree project 30 credits January 2014

Adaptation of an ARM compatible System on chip as an IP-module in a FPGA

Emanuel Wahlqvist

Department of Information Technology

Abstract

Adaptation of an ARM compatible System on chip as an IP-module in a FPGA

Emanuel Wahlqvist

In today's world, fast prototyping and a short time to market are very important factors when developing products. Any effort to minimize these parameters, as well as to make systems easier to maintain, is effort well placed. Syntronic is a consultant company dealing in electronic and software development, testing and maintenance. They see the soft core processor, implemented in a Field Programmable Gate Array, as a step towards more versatile platforms. As a first effort this thesis presents the specification, implementation and testing of a System on Chip based on an open source ARMv2a compatible processor designed in Verilog. The system aims at applications where performance is not the highest priority but rather small FPGA area and the possibility to connect many different sensor types. The final result is a system that is able to execute both assembler and C code in simulations. There was no hardware available for testing but the synthesis procedure shows promising results. The final system includes interfaces for UART, SPI and I2C along with support for up to 32 General Purpose Input Output pins. All steps required for modifying and customizing the system are also presented along with the tools used.

Supervisor: Stig Silver
Subject reviewer: Leif Gustafsson
Examiner: Philipp Rümmer
IT 14 008
Printed by: Reprocentralen ITC

Faculty of Science and Technology (Teknisk-naturvetenskaplig fakultet), UTH-enheten. Visiting address: Ångströmlaboratoriet, Lägerhyddsvägen 1, Hus 4, Plan 0. Postal address: Box 536, 751 21 Uppsala. Telephone: 018 – 471 30 03. Fax: 018 – 471 30 00. Website: http://www.teknat.uu.se/student

Acknowledgements

I would like to thank:

• Stig Silver at Syntronic for trusting in me to solve this problem.

• Lars Johansson, my supervisor at Syntronic, for guidance, knowledge and help along the way.

• Robert Adenmark who, despite being on parental leave, pointed me in the right direction on more detailed FPGA issues.

• Leif Gustafsson, my supervisor at Uppsala University, for reading, correcting and giving knowledgeable input to this report and for directing my attention to related studies in this subject.

And all others at Syntronic who in some way aided me in this work.

Emanuel Wahlqvist

Contents

1. Introduction
  1.1. FPGA fundamentals
    1.1.1. CLB
    1.1.2. RAM blocks
    1.1.3. Routing net
  1.2. FPGAs and processors
  1.3. Why ARM
  1.4. HDL design with Verilog
    1.4.1. Example design

2. Related work
  2.1. Optimizations

3. Specification
  3.1. Core alternatives
    3.1.1. Considerations

4. HDL Design
  4.1. Target
  4.2. Amber project
  4.3. Clock and reset manager
  4.4. The Amber core
    4.4.1. ARMv2a Instruction Set Architecture
    4.4.2. Pipeline
    4.4.3. Pipeline hazards
  4.5. Wishbone bus
    4.5.1. Wishbone signals
    4.5.2. Protocol
    4.5.3. De-multiplexer (Demux)
  4.6. Memory
    4.6.1. Cache
    4.6.2. Boot memory
    4.6.3. Main
    4.6.4. Flash memory
  4.7. I2C
    4.7.1. I2C protocol
    4.7.2. I2C controller
  4.8. SPI
    4.8.1. SPI protocol
    4.8.2. SPI controller
  4.9. UART
    4.9.1. UART protocol
    4.9.2. UART controller
  4.10. Ethernet MAC
  4.11. GPIO
    4.11.1. GPIO controller
  4.12. Timers
    4.12.1. Registers
    4.12.2. Setting up a timer
  4.13. Interrupt controller
    4.13.1. Registers
  4.14. Test module
  4.15. Verilog test bench
    4.15.1. UART
    4.15.2. SPI
    4.15.3. I2C
    4.15.4. GPIO

5. Configuration
  5.1. Parameters
  5.2. Adding or removing a peripheral

6. Tools
  6.1. ISE 14.5
    6.1.1. Synthesis
    6.1.2. Simulation
    6.1.3. Bulk simulation
    6.1.4. Debug switches
  6.2. Sourcery CodeBench for ARM processors
  6.3. Amber specific tools
    6.3.1. amber-elfsplitter
    6.3.2. amber-elfsplitter-memcontents
    6.3.3. check_mem_size
  6.4. Installation

7. Testing
  7.1. Assembler tests
    7.1.1. SPI test (spi.S)
    7.1.2. I2C test (i2c.S)
    7.1.3. UART test (uart_tx.S)
    7.1.4. GPIO (gpio.S)
  7.2. C tests
    7.2.1. Libraries
    7.2.2. boot-loader-serial
    7.2.3. dhry
    7.2.4. hello-world
    7.2.5. spi-timer
  7.3. Linux test

8. Result

9. Conclusion
  9.1. Specification
  9.2. Implementation
    9.2.1. Target independence
    9.2.2. Peripheral integration
  9.3. Documentation
    9.3.1. Peripheral controllers
  9.4. Testing
  9.5. Compiler optimization issue

10. Discussion
  10.1. Pros and cons with the Amber SoC
  10.2. Peripherals

11. Future work

12. Bibliography

A. I2C test output

B. test output

List of Figures

1.1. Simple sketch of FPGA layout

4.1. Diagram showing the complete system design
4.2. Overview of the a23 Verilog structure.
4.3. Example of pipelined execution.
4.4. Example of control hazard handling.
4.5. Wishbone handshake
4.6. Wishbone single read cycle
4.7. Wishbone single write cycle
4.8. Wishbone synchronous burst cycle
4.9. Schematic of the Wishbone demultiplexer
4.10. Tri-state buffers on SDA and SCL.
4.11. I2C transfer.
4.12. SPI transfer timing diagram.
4.13. UART in half duplex mode.
4.14. UART in full duplex mode with RTS and CTS.
4.15. A UART transfer.
4.16. Structural schematic of the UART controller.
4.17. GPIO pin tri-state buffer connection.
4.18. Interrupt vectors and masks.
4.19. Fast interrupt vectors and masks.

6.1. Xilinx ISE design flow.
6.2. Simulation script organization

7.1. SPI transfer of first word.
7.2. SPI transfer of second word.
7.3. Start condition and sending slave address plus write bit (0x20)
7.4. Sending register address 0x01
7.5. Sending data 0xa5
7.6. Sending data 0x5a and a stop condition
7.7. Start condition and sending slave address plus write bit (0x20)
7.8. Sending register address 0x01
7.9. Start condition and sending slave address plus read bit (0x21)
7.10. Reading data 0xa5
7.11. Reading data 0x5a and stop condition
7.12. Start condition and sending slave address plus write bit (0x20)
7.13. Sending invalid register address 0x10 and receiving NACK
7.14. Send character "H"
7.15. Send character "i", receive character "H"
7.16. Send character "!", receive character "i"
7.17. Send character " ", receive character "!"
7.18. Pins [8:1] is "0xDA" and mirrored on pins [16:9]
7.19. Pins [8:1] is "0xBE" and mirrored on pins [16:9]

List of Tables

3.1. Comparison between ARM cores

4.1. ARMv2a instructions supported by the Amber core.
4.2. Some of the control signals for the execute stage.
4.3. Wishbone signals, direction is seen from a master perspective
4.4. Slave numbering in the Wishbone demultiplexer.
4.5. Coprocessor registers. All registers are 32 bits wide.
4.6. Layout of coprocessor register CR0
4.7. I2C registers. All registers are 8 bits wide
4.8. SPI modes.
4.9. SPI core registers. All registers are 32 bits wide.
4.10. UART core registers. All registers are 8 bits wide.
4.11. Flag register bits. Bits 2 and 1 are always high.
4.12. GPIO core registers.
4.13. Timer core registers.
4.14. Control register bits. Unused bits are always low.
4.15. Timer prescaler value
4.16. Interrupt vector outline. The unused bits (NA) are initialized to zero.
4.17. Timer core registers.
4.18. Test module registers.

5.1. Parameters to configure the system

6.1. Simulation script options
6.2. Simulation script options
6.3. Required environmental variables

7.1. Files required for a C test

Abbreviations

CISC Complex Instruction Set Computer

CLB Configurable Logic Block

CTS Clear To Send

DMIPS Dhrystone Million Instructions Per Second

DSP Digital Signal Processors

ELF Executable and Linkable Format

FIFO First In First Out

FIRQ Fast Interrupt ReQuest

GPL General Public License

GUI Graphical User Interface

HDL Hardware Description Language

I2C Inter-Integrated Circuit

IP Intellectual Property

IRQ Interrupt ReQuest

ISA Instruction Set Architecture

LGPL Lesser (or Library if old) General Public License

LSB Least Significant Byte

LUT Look-up table

MAC Media Access Control

PC Program Counter

PCB Printed Circuit Board

PLL Phase Locked Loop

RAM Random Access Memory

RISC Reduced Instruction Set Computer

RTS Ready To Send

Rx Receive

SPI Serial Peripheral Interface

Tx Transmit

UART Universal Asynchronous Receiver/Transmitter

1. Introduction

Syntronic is a global consultant company dealing in product development, testing and maintenance. They are active in several areas such as telecommunication, medical and automotive. Their idea is to cover all areas from a design idea to a finished product applied in the field. In doing this they see a clear advantage in keeping a design flow that not only gets products out on the market quickly but also makes them easy to maintain and upgrade. As a step in optimizing that design flow they want to take a closer look at soft processors. The reason for this is that several of their earlier designs have involved both FPGAs and processors. By integrating the processor into the FPGA there is great potential for shortening the development time while making the system easier to tailor for future needs. This thesis includes the specification and implementation of a SoC design around an ARM core in an FPGA. The purpose of the system is to be used in FPGA applications where a small control processor is needed. This could include collecting data from sensors, receiving commands from a controlling system or user interface, performing smaller control loops, coordinating other signal processing algorithms in the FPGA, etc.

1.1. FPGA fundamentals

There are several manufacturers of FPGAs that all use their own architecture, but the main structure is very similar. A general FPGA mainly consists of Configurable Logic Blocks (CLBs) but also contains memory and DSP blocks. All blocks are connected through a configurable routing net that can connect any blocks with each other regardless of their physical location on the FPGA, as shown in Figure 1.1. This makes it possible to create any logic function, ranging from a simple AND gate to extremely complex digital circuits such as processors. Historically these functions were described by creating a schematic on a drawing board. When the designs grew in size and complexity, the use of a Hardware Description Language (HDL), such as VHDL or Verilog, followed by a synthesis process became common.

Figure 1.1.: Simple sketch of FPGA layout

1.1.1. CLB

The CLB consists of several Look-Up Tables (LUTs) that work as logic function generators and at least one flip-flop per LUT. A typical LUT has four, five or six inputs, one or two outputs and contains 2^n bits (where n is the number of inputs). The CLB usually also contains multiplexers and additional flip-flops or latches. The flip-flops are used to synchronize an output from the LUT with a clock signal. Instead of using the LUT as a logic function it can also be used as a memory block. This memory type is often referred to as Distributed Random Access Memory (Distributed RAM).
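The idea that a LUT is simply a small memory addressed by its inputs can be illustrated with a minimal behavioural sketch. The module name lut4, the parameter INIT and the demo module are assumptions made for this illustration; they are not taken from the Amber project or from any vendor library.

```verilog
// Behavioural sketch of one 4-input LUT holding 2^4 = 16 configuration bits.
// The names lut4 and INIT are hypothetical, chosen for this example only.
module lut4 #(
    parameter [15:0] INIT = 16'h8000   // bit i is the output for input value i
) (
    input  wire [3:0] in,
    output wire       out
);
    wire [15:0] table_bits = INIT;     // the stored truth table
    assign out = table_bits[in];       // the inputs act as an address
endmodule

// Simulation-only demo: the same LUT structure implements AND and XOR,
// depending only on the configuration bits.
module lut4_demo;
    reg  [3:0] in;
    wire       and_out, xor_out;
    integer    i, errors;

    lut4 #(.INIT(16'h8000)) and4 (.in(in), .out(and_out)); // only entry 15 is 1
    lut4 #(.INIT(16'h6996)) xor4 (.in(in), .out(xor_out)); // odd parity entries are 1

    initial begin
        errors = 0;
        for (i = 0; i < 16; i = i + 1) begin
            in = i[3:0];
            #1;
            if ((and_out !== (&in)) || (xor_out !== (^in)))
                errors = errors + 1;
        end
        $display("%0d mismatches", errors);
        $finish;
    end
endmodule
```

Changing only the 16 configuration bits turns the same structure into a different logic function, which is what the synthesis tools exploit when mapping a design onto CLBs.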

1.1.2. RAM blocks

Another type of FPGA memory is the RAM block. RAM blocks interface with the rest of the FPGA through input and output data buses, an address bus, write enable inputs, a clock input and a reset input. The internal memory array can be very large, up to at least 36K bits. Since a RAM block needs no multiplexers and relatively few flip-flops compared to distributed RAM, it is the preferred way of implementing large memories as it uses less FPGA area.
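As a rough illustration of how such a memory is usually described in an HDL, the sketch below shows a single-port synchronous RAM. Synthesis tools commonly map this coding pattern onto RAM blocks, but whether a given tool does so depends on the tool and the target device. The module name, port names and default sizes are assumptions made for this example; the reset input mentioned above is left out to keep the sketch short.

```verilog
// Single-port synchronous RAM sketch (hypothetical names and sizes).
// With the defaults below the array holds 1024 x 32 = 32K bits.
module block_ram_sketch #(
    parameter ADDR_WIDTH = 10,
    parameter DATA_WIDTH = 32
) (
    input  wire                  clk,
    input  wire                  we,     // write enable
    input  wire [ADDR_WIDTH-1:0] addr,
    input  wire [DATA_WIDTH-1:0] din,
    output reg  [DATA_WIDTH-1:0] dout
);
    // The memory array itself
    reg [DATA_WIDTH-1:0] mem [0:(1 << ADDR_WIDTH) - 1];

    always @(posedge clk) begin
        if (we)
            mem[addr] <= din;
        // Registered read: on a write, dout returns the old contents
        dout <= mem[addr];
    end
endmodule
```

The registered read is the important detail here: an asynchronous read of the array would typically have to be built from LUTs, that is, as distributed RAM.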

1.1.3. Routing net

The routing nets cover the whole FPGA and can be configured to connect all blocks in many different ways. At every point where nets cross each other there is a configurable switch matrix that is used to connect the nets with each other. There is also a special type of net, called a clock net, that is used to distribute clock signals through the FPGA with minimal delay and skew.

1.2. FPGAs and processors

The fact that an FPGA can be programmed to perform any number of tasks in parallel (limited, of course, by the size of the device) makes it very suitable for digital signal processing. Earlier, FPGAs were often coupled with a separate microprocessor that took care of communication interfaces, task management and other small organizational tasks. This has led to FPGAs with an integrated hard processor. Examples of this are the Xilinx Zynq[1] and Altera SoC[2] product lines, which combine different FPGAs with an ARM Cortex A9 processor, and Microsemi's SmartFusion[3], which uses an ARM Cortex-M3 processor. This is a solution for those who need to combine a high capacity FPGA with a very capable processor. Compared to the solution with a separate processor this has the following benefits:

• Smaller total Printed Circuit Board (PCB) footprint

• No hardware interface needed between the processor and FPGA modules

For someone with lower demands on performance, this is probably not the optimal solution. Since the cheaper FPGAs have also grown in size, it is possible to implement a soft processor core inside these FPGAs along with the desired parallel logic. This has several benefits over the hard core solution, such as:

• Possibility to change/upgrade the processor in the finished product

• Companies can hide their designs better

• Easier to implement customized multi-core platforms

The fastest way to implement a soft processor core in an FPGA is to use one of the vendor specific cores. For Altera the processor is called NIOS II[4] while Xilinx call theirs MicroBlaze[5]. Both are 32-bit Reduced Instruction Set Computer (RISC) processors with a variety of configuration parameters. They are, however, not the only players in the market. There are several RISC processors in the open source community to choose from, along with ARM's own proprietary soft processor called Cortex M1.

1.3. Why ARM

Since ARM has seized a firm grip on the embedded processor market and is likely to keep its position, companies, including Syntronic, see an advantage in learning and using processors based on ARM architectures. Even though it is not a very big step to move from ARM to a NIOS or MicroBlaze processor, Syntronic wanted to investigate the possibilities of using a soft ARM core in an FPGA.

1.4. HDL design with Verilog

To understand the description of the final system no deep knowledge of HDLs is needed. It is, however, necessary to be familiar with the basic structure of a Verilog design. The following concepts are enough to follow the reasoning:

• Module

• Top level module

• Port

• Wire

Module: A module is a block of logic that can be used once or several times in a design.

Top level module: The top level module has the same code structure as a regular module, but it is where all the regular modules are instantiated and organized to create the final design. Thus there can only be one top level module in a design.

Port: The port is always defined at the beginning of each module and contains the interface of the module, in other words, the module's inputs and outputs.

Wire: A wire is used to connect modules or logic together.

1.4.1. Example design

An example of a module implementing a simple AND gate is shown below. Everything between the two keywords "module" and "endmodule" defines the content of the module, while "and_gate" is the name of the module, which will be used to reference it later. The code between the parentheses is the port, in this case two inputs and one output, and between the port and the "endmodule" keyword is where the implementation is written.

/*
 * The port of the and_gate module
 */
module and_gate (
    input  x,
    input  y,
    output z
);

/*
 * The logic of the and_gate module
 */
assign z = x & y;

endmodule

To put several modules together one can instantiate the desired modules and wire them together as shown in the example below. There, two instances of the "and_gate" module shown above are used together with an OR gate. Worth noting is that there can also be logic in the top level module, and in some cases a design consists of only a top level module.

/*
 * Make the and_gate module definition available. The include is placed
 * before the module declaration, since a module cannot be defined
 * inside another module.
 */
`include "and_gate.v"

/*
 * The port of the top level module
 */
module top_level (
    input  a,
    input  b,
    input  c,
    input  d,
    output result
);

/*
 * To connect the result with the outcome of the two AND gates
 * one can use a "wire"
 */
wire a_AND_b;
wire c_AND_d;

/*
 * Creating two AND gates by instantiating the and_gate module twice
 */
and_gate gate1 (
    .x (a),
    .y (b),
    .z (a_AND_b)
);

and_gate gate2 (
    .x (c),
    .y (d),
    .z (c_AND_d)
);

/*
 * It is also possible to use logic in the top level module
 * So let's create an OR gate for the result
 */
assign result = a_AND_b | c_AND_d;

endmodule
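A design like this is normally checked in simulation with a test bench. The sketch below is one possible self-checking test bench for the example. It assumes that the top level module is named top_level with ports a, b, c, d and result, and that it is compiled together with the source of that module. The test bench name and the 10 time unit delay are arbitrary choices made for this illustration.

```verilog
`timescale 1ns / 1ps

// Hypothetical test bench for the example design; not part of the original code.
module top_level_tb;
    reg     a, b, c, d;
    wire    result;
    integer i, errors;

    // Device under test
    top_level dut (
        .a      (a),
        .b      (b),
        .c      (c),
        .d      (d),
        .result (result)
    );

    initial begin
        errors = 0;
        // Apply all 16 input combinations and compare with the expected function
        for (i = 0; i < 16; i = i + 1) begin
            {a, b, c, d} = i[3:0];
            #10;
            if (result !== ((a & b) | (c & d))) begin
                errors = errors + 1;
                $display("Mismatch for a=%b b=%b c=%b d=%b: result=%b",
                         a, b, c, d, result);
            end
        end
        $display("%0d mismatches", errors);
        $finish;
    end
endmodule
```

Because the design has only four inputs, exhaustive testing is cheap. For larger designs a test bench usually applies selected or randomized stimuli instead.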

2. Related work

The idea of using a soft core processor in an FPGA is not new. One example is the free LEON processor, written in VHDL and based on the SPARC architecture. The development of LEON started in 1997[6] and the first version was released under the Library General Public License¹ (LGPL) in 2000[7] by the European Space Agency. After that, the second version of LEON, called LEON2, has had several successful implementations in space[8]. For the third version the development was moved to the Swedish company Aeroflex Gaisler, and it is now on its fourth version. Another example of an open source processor is the OpenRISC 1000, or OR1K, that was released in 2001[9] and marked the beginning of the OpenCores community. Along with the community the OR1K grew, and it now has a complete toolchain, several compatible operating systems and at least two available SoCs developed around it[10]. A commercial example is the NIOS processor developed by Altera, which was released in 2001[11].

2.1. Optimizations

Since the subject has become very popular there have been several studies with the goal of making soft processors more efficient in area utilization and better in performance. Sheldon et al.[12] analyse the sharing of resources such as floating point units and multipliers between soft cores. They managed to decrease the area utilization of a dual-core platform by 16% while only introducing a cycle count overhead of 1%. Another interesting article was written by Lysecky and Vahid[13], in which a so-called warp processor based on a MicroBlaze soft core is implemented. The warp processor analyses the software at runtime and uses a dynamic partitioning scheme to implement important software functions as circuits in the FPGA at runtime. When comparing the warp processor to a fully equipped MicroBlaze processor they find that the performance increased 5.8 times while the power consumption decreased by 57%. Another approach is to optimize the processor for specific software before synthesis. In an article written by Sheldon et al.[14] a method for this based on a MicroBlaze processor is shown. They gain a speedup of at most 200%, and a 20% speedup when using tight size constraints. Yet another approach to application specific optimization is taken by Yiannacouras, Steffan, and Rose[15]. They use a Verilog-generating tool called SPREE to generate application specific processors. By first optimizing away unused features and then removing unused parts of the instruction set they achieve a performance per area increase of 25% compared to a NIOS II processor.

¹ Now called Lesser General Public License

3. Specification

A major task of the system will be to read data from different sensors and control a variety of chips. This is done through different communication protocols, of which the most commonly used today are SPI and I2C. For communicating with a PC in a simple way, UARTs have been used for a very long time, and one will also be included. In discussion with the supervisor about Syntronic's needs we agreed to also include a GPIO controller and a flash memory. A complete list of the system's peripherals is shown below.

• I2C controller

• SPI controller

• UART controller

• GPIO

• Flash memory for storage

• Main memory

• Boot mechanism

• At least one user interrupt

• At least one user configurable timer

3.1. Core alternatives

There are several processor cores available for developers to choose from. As mentioned earlier, ARM has released its own FPGA-targeted design called Cortex M1. The major benefit of using this processor is that ARM itself has verified its function, and the core comes with ARM's warranty. The downside is of course the cost. A free evaluation version exists, but with a fixed configuration and no visibility into the code. Alternatives to the Cortex M1 can be found on the OpenCores website www.opencores.org. For this thesis, two projects have been considered, Amber and Storm SoC. Table 3.1 shows a comparison of the cores.[16][17][18]

Core                   Amber a23/a25¹   Cortex M1             Storm SoC
Pipeline stages        3/5              3                     8
Cache size (kb)        8-32             D=0-1, I=0-1          D=1, I=1
Interrupts             16               1-32                  32
Frequency (MHz)        40-80            70-200                80
DMIPS/MHz              0.75/1.05        0.8                   NA
Occupied area (LUTs)   9000²            2600³                 9000⁴
License                LGPL             Commercial            GPL
Cost                   Free             1$/unit, min 1000$    Free

Table 3.1.: Comparison between ARM cores

3.1.1. Considerations

The most important properties to consider are licensing, cost, performance and area utilization. Cortex M1 is the most expensive option while also providing the highest performance. However, if there are high demands on system performance one should instead consider the MicroBlaze and NIOS II mentioned earlier, since they provide more features and higher performance at a lower cost[19]. The Xilinx Zynq or Altera SoC are other high performance options, as mentioned earlier. That leaves the two open source projects. Apart from the cost, the major benefit these have over the Cortex M1 is that they are already complete systems. With the Cortex M1 one has to add a bus architecture, find suitable peripherals for that bus and create an arbitration scheme between them, and that takes time. When looking at the included peripherals, the Storm SoC has everything listed in the beginning of this chapter. It would then seem to be the best choice for this project. However, since the core is to be used in commercial applications, the biggest difference between the two open source cores is the license. The General Public License (GPL) states that any product containing GPL licensed software needs to be shipped along with the source code of the whole system itself, while this demand is left out of the LGPL. Since some components in a system may be very high-tech and secret, giving out the code is not an option. Therefore the system implemented in this thesis will be built around the Amber project.

¹ The Amber project includes two different cores called a23 and a25
² Core a23 and 16 KB cache
³ Minimal config, no cache
⁴ Core and 2 KB cache

4. HDL Design

In this chapter the Amber project is presented in more detail along with the changes that were made to it in order to obtain the system specified in Chapter 3. An overview of the complete design can be seen in Figure 4.1, which shows how the core connects to the peripherals over the Wishbone bus[20]. The Wishbone bus is a competitor to ARM's open bus standard AMBA, and how it works is shown in more detail in Section 4.5. The peripherals that came with the Amber project (UART, interrupt controller, timer controller and test module) were not included in the Amber user guide[21]. The information presented about these was obtained by analyzing the code and simulations of it. All the configuration parameters mentioned in this chapter are detailed in Table 5.1.

Figure 4.1.: Diagram showing the complete system design

4.1. Target

Different FPGAs were discussed as a platform for the project. The supervisor suggested some sort of Xilinx FPGA since that was going to be used in another project and the hardware could then be shared. Unfortunately the business arrangements were not completed in time, so there was no hardware available for actual testing. For simulation and synthesis the FPGA targeted in the Amber system, a Spartan 6 LX45T, was used. The synthesis results were only verified by reading the synthesis reports. These do not replace hardware testing, but for device utilization and basic timing analysis we considered them sufficient.

4.2. Amber project

The Amber project was designed by Conor Santifort, a member of OpenCores, who tested it on a Xilinx SP605 development board[21]. The complete specification of the system is shown in the list below.

• ARMv2a compatible core

• Configurable cache size

• 8kB boot memory

• Two UART controllers

• Ethernet MAC

• Interrupts

• Timers

• Spartan-6 DDR3 memory controller

The peripherals connect to the core over a Wishbone bus interface. The boot memory contains a boot loader that uses one of the UART ports to receive programs to be run on the system. The project also contains an extensive suite of hardware test programs written in assembler along with a bootable Linux image that can be run in a simulator.

4.3. Clock and reset manager

In this module the system-wide clock is generated by a Phase Locked Loop (PLL) and a synchronous reset signal is generated. Originally the Amber project had three clock generators: one PLL for a Spartan 6 FPGA, one for a Virtex 6 FPGA and a third, non-synthesizable, clock generator used for simulations. Since this thesis uses the Spartan 6 as target we decided to remove the Virtex 6 PLL and the non-synthesizable clock generator. This makes the simulations more realistic and cleans up the code. The remaining PLL uses a differential clock input of 200 MHz to generate an 800 MHz clock. This clock is then divided by the AMBER_CLOCK_DIVIDER parameter to get the system clock.
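The relation between the PLL output and the system clock is a simple integer division, which the simulation-only sketch below works through. The divider value 20 and the factor 4 between the PLL input and output are assumptions for this illustration: the factor follows from the 200 MHz and 800 MHz figures above, and the divider is chosen so that the result falls within the 40 to 80 MHz range that Table 3.1 lists for the Amber cores. The vendor PLL primitive itself is not shown.

```verilog
// Worked example of the clock arithmetic (hypothetical module, simulation only).
module clock_divider_example;
    localparam integer PLL_INPUT_MHZ       = 200;  // differential input clock
    localparam integer PLL_MULTIPLIER      = 4;    // assumed: 200 MHz * 4 = 800 MHz
    localparam integer PLL_OUTPUT_MHZ      = PLL_INPUT_MHZ * PLL_MULTIPLIER;
    localparam integer AMBER_CLOCK_DIVIDER = 20;   // assumed example value
    localparam integer SYSTEM_CLOCK_MHZ    = PLL_OUTPUT_MHZ / AMBER_CLOCK_DIVIDER;

    initial begin
        // 800 / 20 gives a 40 MHz system clock
        $display("system clock = %0d MHz", SYSTEM_CLOCK_MHZ);
        $finish;
    end
endmodule
```

With the same assumptions a divider of 10 would give 80 MHz, the upper end of the range in Table 3.1.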

4.4. The Amber core

In the Amber project there are two different cores available, called a23 and a25. They are fully software compatible but have some differences, as shown in Table 3.1. Since small area is preferred over performance, the a23 core will be used. Figure 4.2 below shows an overview of the core's Verilog structure. The picture is taken from the Amber core specification[22], where it is called Figure 5.

Figure 4.2.: Overview of the a23 verilog structure.

The core has a unified data and instruction cache and executes instructions in a three stage pipeline. It also supports two interrupts with different priorities, where the Fast Interrupt ReQuest (FIRQ) is prioritized over the normal Interrupt ReQuest (IRQ). A more detailed description of the core and a schematic diagram of the processor architecture can be found in the Amber core specification[22].

4.4.1. ARMv2a Instruction Set Architecture

The core was built to be compatible with the ARMv2a Instruction Set Architecture (ISA), which consists of a small set of 32-bit wide instructions. The ones supported by the Amber core can be divided into categories depending on their purpose, as shown in Table 4.1. For descriptions of the syntax and operation of the individual instructions, see Table 4 in the Amber core specification[22].

Category                    Instructions                     Description
Data processing             ADC, ADD, AND, BIC, CMN, CMP,    Performs operations on data already in registers
                            EOR, MOV, MVN, ORR, RSB, RSC,
                            SBC, SUB, TEQ, TST
Multiply                    MLA, MUL                         Used to perform multiplying operations
Single data swap            SWP, SWPB                        Swaps data in a register with data in memory
Single data transfer        LDR, LDRB, STR, STRB             Used to move data between memory and registers
Block data transfer         LDM, STM                         Moves a series of words between memory and registers
Branch                      B, BL                            Branches the execution to other places in the program
Coprocessor data transfer   MCR, MRC                         Used to move data to and from a coprocessor register
Software interrupt          SWI                              Used to throw a software interrupt exception

Table 4.1.: ARMv2a instructions supported by the Amber core.

Registers and modes

The ARMv2a ISA is a load/store architecture, which means that all operations on data occur in the processor's internal registers. In the Amber core, and in ARM cores in general, there are 16 internal registers, of which 13 can be utilized for data operations. Which registers are accessible depends on which mode the processor is in. For the Amber core four different modes are available:

User: Non-privileged mode. Most user code is executed here

IRQ: Privileged mode that the processor enters when an interrupt occurs

FIRQ: Privileged mode that the processor enters when a fast interrupt occurs

Supervisor: Privileged mode that the processor enters when a software interrupt occurs

The current mode is indicated in the two least significant bits of register 15. This register is referred to as the program counter, or PC, as it also points to where in the program the processor is or, more correctly, to the next instruction it will execute. The other reserved registers are registers 14 and 13. Register 14 is called the "Link register" and contains the address the processor will jump to when the current function call is completed. Register 13 is called the "Stack pointer", or SP, and is used as a pointer to the end of the stack. A graphical overview of the registers in the respective modes is shown in Tables 14 and 15 in the Amber core specification[22].
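How the mode and the program counter share register 15 can be sketched as a small decoder. Only the position of the mode bits is taken from the text above. The mode encoding (00 User, 01 FIRQ, 10 IRQ, 11 Supervisor) and the placement of the word address in bits 25 to 2 follow the 26-bit program counter layout of the ARMv2 architecture and are assumptions as far as the Amber implementation is concerned. The module and signal names are hypothetical.

```verilog
// Sketch of splitting register 15 into mode and program counter fields.
// Hypothetical module; encodings follow the ARMv2 26-bit PC layout.
module r15_decode_sketch (
    input  wire [31:0] r15,
    output wire [1:0]  mode,
    output wire [31:0] pc_address,
    output wire        privileged
);
    localparam [1:0] MODE_USER       = 2'b00,
                     MODE_FIRQ       = 2'b01,
                     MODE_IRQ        = 2'b10,
                     MODE_SUPERVISOR = 2'b11;

    // The two least significant bits hold the processor mode
    assign mode = r15[1:0];

    // Instructions are word aligned, so bits 25:2 hold the word address
    assign pc_address = {6'd0, r15[25:2], 2'b00};

    // Every mode except User is privileged
    assign privileged = (mode != MODE_USER);
endmodule
```

Since all instructions are 32 bits wide and word aligned, the two lowest address bits are always zero, which is what frees them to carry the mode.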

Comparison with other ISAs

Compared to a general Complex Instruction Set Computer (CISC) architecture the RISC architecture is simpler and contains fewer instructions. This makes it less complicated to implement in digital logic, with the nice side effect that it uses fewer resources in the FPGA. The ARMv2a ISA in particular holds no significant implementation-specific benefits over other RISC ISAs, for example the MIPS ISA. They both require a pipeline to be really efficient and both contain a relatively small number of instructions.

4.4.2. Pipeline

Using a pipeline means that the execution of an instruction is divided into smaller steps, much like the famous assembly line invented by Henry Ford. In the a23 core the execution is divided into three steps, or stages, called fetch, decode and execute. In Figure 4.3 an example of the execution on a pipelined processor is compared with a processor without a pipeline. The example is only for basic understanding and does not take into account the hazards discussed later in this section or other delays that occur when dividing the execution into several stages.

Figure 4.3.: Example of pipelined execution.
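The same idealized comparison can be reproduced with a few lines of simulation-only Verilog. The sketch below is not a model of the a23 core. It assumes that every stage takes exactly one clock cycle and that there are no hazards or cache misses, and all names in it are hypothetical.

```verilog
// Idealized three stage pipeline occupancy (simulation only, hypothetical names).
// Instructions are numbered 1..N_INSTR; 0 means that the stage is empty.
module pipeline_sketch;
    localparam integer N_INSTR = 4;
    integer fetch, decode, execute, cycle;

    initial begin
        fetch   = 0;
        decode  = 0;
        execute = 0;
        for (cycle = 1; cycle <= N_INSTR + 2; cycle = cycle + 1) begin
            // Each instruction advances one stage per clock cycle
            execute = decode;
            decode  = fetch;
            fetch   = (cycle <= N_INSTR) ? cycle : 0;
            $display("cycle %0d: fetch=%0d decode=%0d execute=%0d",
                     cycle, fetch, decode, execute);
        end
        // Without a pipeline each instruction would occupy all three steps alone
        $display("pipelined: %0d cycles, unpipelined: %0d cycles",
                 N_INSTR + 2, 3 * N_INSTR);
        $finish;
    end
endmodule
```

For four instructions the pipelined case needs 6 cycles against 12 without a pipeline. In general N instructions take N + 2 cycles instead of 3N under these assumptions, so the gain approaches a factor of three for long instruction sequences.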

Fetch

In this stage the instruction, or data, is fetched from the cache. If the cache misses, i.e. the instruction or data is not there, the whole pipeline is stalled while it is fetched from memory over the memory bus. A fetched instruction is passed along to the decode stage, whereas data is passed directly to the execute stage.

Decode

This is the most complicated pipeline stage in the core. Here, the fetched instruction is decoded according to the tables in Chapter 4 of the Amber core specification[22]. The decoded instruction is converted into control signals for the execute stage. Some examples of these control signals are presented in Table 4.2.

| Name | Size (b) | Description |
| --- | --- | --- |
| instruction_execute | 1 | If cleared the instruction passes through the execute stage without being executed. See Section 4.4.3 for why this is necessary. |
| rn_sel | 4 | Selects which of the 15 CPU registers is used as the rn register in the current instruction. |
| rm_sel | 4 | Selects which of the 15 CPU registers is used as the rm register in the current instruction. |
| rds_sel | 4 | Selects which of the 15 CPU registers is used as the rd and rs register in the current instruction¹. |
| status_bits_mode | 2 | Shows what mode the processor is in and thus which registers are to be accessed. |

Table 4.2.: Some of the control signals for the execute stage.

Execute

In this stage the control signals from the decode stage are registered and combined with data from the fetch stage. The data passes through the ALU and the result is written back to the cache. Additionally, the next address for the fetch stage is generated.

4.4.3. Pipeline hazards

A pipeline generally improves the performance of a processor but it also introduces some problems, called hazards. The first arises when two consecutive instructions access the same register and the first one writes to it. This is often referred to as a ”data hazard”. Another problem occurs when an instruction is to be executed only if a certain condition is met. The condition is determined by the execution of an earlier instruction, but by then the conditional instruction is already in the fetch or decode stage, scheduled for execution. This is often referred to as a ”control hazard” or, in the case of a conditional branch instruction, a ”branch hazard”.

¹ In, for example, the MUL instruction, the Rd register specified in the Amber core specification[22] is actually the Rn register.

Data hazard

In the a23 core this is dealt with by keeping the second instruction in the decode stage for two extra cycles and preventing the execute stage from executing it during those two cycles. Two examples of this are shown in Section 2.2 of the Amber core specification[22]. The method of disabling the execute stage can be compared with the method where the compiler inserts NOP instructions in the code to avoid this type of hazard. However, the decode stage in the a23 core stores the ”stalled” instruction in a register so that it can be decoded directly after the execution is resumed. This saves one clock cycle, so where the NOP method would waste three clock cycles the a23 core only wastes two.

Control hazard

This problem is not documented in the specification but simulations show that it is solved in a similar manner. The condition flags of the instruction are compared with the status bits of the Program Counter (PC) in the decode stage. If the condition is found not to be met, the execute stage is disabled when that instruction passes through, as shown in the following example. Consider the following assembler code snippet:

    mov  r0, #0x0      @ Loading value 0x0 into register 0
    mov  r1, #0x1      @ Loading value 0x1 into register 1
    subs r2, r1, r0    @ Compare the registers and update condition flags
    beq  1f            @ This branch will not execute, r1 != r0

Figure 4.4 shows what happens in the pipeline, which is also described below, tick by tick.

Figure 4.4.: Example of control hazard handling.

1. The ”mov r0” instruction enters the fetch stage.

2. ”mov r0” is decoded while ”mov r1” enters fetch.

3. ”mov r0” is executed, ”mov r1” is decoded and the ”subs” instruction enters fetch.

4. ”mov r1” is executed, ”subs” is decoded and the ”beq” instruction enters fetch.

5. ”subs” is executed and ”beq” is decoded. The decode stage detects a conditional execution and starts to read the status flags of the program counter. The execute stage updates these flags after the execution is done.

6. The decode stage has detected that the condition is not met, so it disables the execute stage. The ”beq” instruction passes through without being executed.
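Why the branch condition fails in this example can be checked with a small software model of the flag update. The sketch below is not taken from the core; it only mimics the standard ARM behaviour of a flag-setting subtract and the EQ condition, which is true when the Z flag is set. The type and function names are hypothetical.

```c
#include <stdint.h>
#include <stdbool.h>

/* The four condition flags updated by a flag-setting instruction. */
struct flags { bool n, z, c, v; };

/* Model of "subs rd, rn, rm": returns rn - rm and updates the flags. */
uint32_t subs(uint32_t rn, uint32_t rm, struct flags *f)
{
    uint32_t result = rn - rm;
    f->n = (result >> 31) & 1u;                        /* negative         */
    f->z = (result == 0);                              /* zero             */
    f->c = (rn >= rm);                                 /* set if no borrow */
    f->v = (((rn ^ rm) & (rn ^ result)) >> 31) & 1u;   /* signed overflow  */
    return result;
}

/* EQ condition: the instruction is executed only when Z is set. */
bool cond_eq(const struct flags *f)
{
    return f->z;
}
```

With r1 = 1 and r0 = 0 the result is 1, the Z flag is cleared and the EQ condition evaluates to false, which is the situation in which the execute stage is disabled for the branch.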

4.5. Wishbone bus

The Wishbone bus is an open standard designed for interconnection between Intellectual Property (IP) cores. It is widely used in the open source community and is the official interconnect fabric for the cores at opencores.org. In a comparison with the AMBA bus used by ARM, Wishbone is praised for its simplicity and ease of use[23]. There are several different options for implementing the Wishbone bus. The topology, for example, can be implemented in four different ways:

Point-to-point: Connects one master to one slave.

Pipelined: The IP cores are connected sequentially and thus act as both master and slave, forwarding the data.

Shared bus: Connects one or several masters to one or several slaves with a common bus medium. An arbiter is needed to direct all data traffic.

Crossbar switch: Similar to the shared bus with the addition that several masters can communicate at the same time, as long as they do not try to access the same slave.

There are also two different bus cycle definitions, called classic and registered feedback. Registered feedback includes the classic cycle but adds improvements for sending data in bursts. This improvement comes at the cost of a more complicated interface and the need for three additional signals.

The Amber 23 system uses a 32 bit wide classic Wishbone bus with the standard protocol and a shared bus topology. The only exception is that the reset signal RST is not used. The bus supports the classic read and write cycles along with the simplest burst type, called ”Synchronous cycle terminated burst”. As seen in the standard, there would be a performance gain in implementing another burst type, for example the ”Advanced synchronous cycle terminated burst”, which is also mentioned in Chapter 11, Future Work.

4.5.1. Wishbone signals

The Wishbone signals have a naming scheme where each signal carries a prefix or suffix according to its direction, where I means In and O means Out. An output signal, for example strobe (called STB), could therefore be implemented as either O_STB or STB_O. In the Amber system the signal direction is relative to each module, so an output in the master module is called an input in the slave module and vice versa. The signals used by the Amber system are:

| Signal | Description |
| --- | --- |
| ADR_O | Target address for the current bus cycle. |
| SEL_O | Four bit signal that indicates which bytes of the 32 bit data are valid for the current cycle. |
| WE_O | Indicates if the current cycle is a write or a read. 1 indicates a write. |
| DAT_I | 32 bit wide data input line. |
| DAT_O | 32 bit wide data output line. |
| CYC_O | Asserted at the slave targeted by the transfer. If not asserted, other signals are not valid. |
| STB_O | Strobe line. Asserted to the slave targeted in the current bus cycle. |
| ACK_I | Set by the slave to indicate that the strobe and cyc signals are detected. In the case of a read cycle the data must be available at the next positive edge of the Wishbone clock after ack is asserted. |
| ERR_I | Indicates that the slave cannot perform the requested action. No error handling is implemented in the Amber core but the signal is still present. |

Table 4.3.: Wishbone signals, direction is seen from a master perspective

4.5.2. Protocol

The handshake between master and slave is clearly shown in Illustration 3-3 of the Wishbone standard[20], which is reproduced below in Figure 4.5. The master asserts CYC_O and STB_O at the positive edge of CLK_I. When the slave is ready to respond it asserts ACK_I at a following positive clock edge. The master terminates the cycle by negating CYC_O and STB_O.

Figure 4.5.: Wishbone handshake

Figure 4.6 and Figure 4.7 below show a single read cycle and a single write cycle respectively. The pictures are taken from the Wishbone standard[20] where they are named Illustration 3-5 and Illustration 3-7. A read cycle is performed as follows. At the first clock edge the master asserts CYC_O and STB_O to indicate a valid transfer. It also presents an address on ADR_O and asserts SEL_O accordingly. WE_O is kept low to indicate a read cycle. When the slave is ready to present data (here at the next clock edge) it asserts ACK_I and presents the data on DAT_I.

Figure 4.6.: Wishbone single read cycle

The write cycle is similar, with the difference that the WE_O signal is asserted to indicate a write and the data is presented on DAT_O at the first clock edge. The slave still asserts ACK_I when it is ready, which in the write cycle in most cases is at the next clock edge.

Figure 4.7.: Wishbone single write cycle

In Figure 4.8 a burst access is shown. The picture is taken from the Wishbone standard[20] where it is named Illustration 4-2. Here the master initiates a transfer but instead of negating STB_O after an ack it keeps it high. By doing this the master can keep owning the bus for several word transfers even if a master with higher priority is requesting bus access. This also speeds up the transfer since one clock cycle is saved between every transferred word, where a new transfer would otherwise have had to be initiated.

Figure 4.8.: Wishbone synchronous burst cycle

4.5.3. De-multiplexer (Demux)

The Wishbone demux connects the Wishbone master (the Amber core) to the slaves (peripherals). It treats the Wishbone signals ADR_O, DAT_O and SEL_O as general and branches them out to all slaves, while the other signals are directed to the currently addressed slave. It determines which slave is currently addressed by using its base address, which is converted to a number as shown in Table 4.4. A schematic of the demux is shown in Figure 4.9. The Verilog file is named wishbone_arbiter.v although it is actually a demux. The name is derived from the original Amber project where this component also arbitrated between two Wishbone masters. To not lose the reference to the original code the filename has been kept, but for correctness the component is referred to here as a demux.

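| Number | Base address (hex) | Slave |
| --- | --- | --- |
| 0 | 2000 | I2C |
| 1 | NA² | Boot memory |
| 2 | NA³ | Main memory |
| 3 | 1600 | UART0 |
| 4 | 1700 | UART1 |
| 5 | F000 | Test module |
| 6 | 1300 | Timer module |
| 7 | 1400 | Interrupt controller |
| 8 | 1800 | SPI controller |
| 9 | 1900 | GPIO |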

Table 4.4.: Slave numbering in the Wishbone demultiplexer.

² Depends on the BOOT_MSB parameter, see Section 4.6.2.

³ Depends on the BOOT_MSB and MAIN_MSB parameters, see Section 4.6.3.
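The address decoding of the demux can be sketched in C as follows. This is a software model, not a translation of wishbone_arbiter.v. It assumes that the base addresses in Table 4.4 are the upper 16 bits of the 32 bit address, that the boot and main memory ranges follow equations 4.1 to 4.3, and that the memories take priority over the peripherals. The function name and the return value -1 for an unmapped address are hypothetical.

```c
#include <stdint.h>

/* Return the slave number from Table 4.4 for a bus address,
 * or -1 if no slave is mapped there (assumed behaviour).
 * Requires boot_msb and main_msb to be smaller than 31. */
int wb_select_slave(uint32_t addr, unsigned boot_msb, unsigned main_msb)
{
    uint32_t boot_size = (uint32_t)1 << (boot_msb + 1);   /* equation 4.1 */
    uint32_t main_size = (uint32_t)1 << (main_msb + 1);   /* equation 4.2 */

    if (addr < boot_size)
        return 1;                       /* boot memory */
    if (addr - boot_size < main_size)
        return 2;                       /* main memory */

    switch (addr >> 16) {               /* upper 16 bits select the peripheral */
    case 0x2000: return 0;              /* I2C                  */
    case 0x1600: return 3;              /* UART0                */
    case 0x1700: return 4;              /* UART1                */
    case 0xF000: return 5;              /* test module          */
    case 0x1300: return 6;              /* timer module         */
    case 0x1400: return 7;              /* interrupt controller */
    case 0x1800: return 8;              /* SPI controller       */
    case 0x1900: return 9;              /* GPIO                 */
    default:     return -1;
    }
}
```

As a usage example with the hypothetical values BOOT_MSB = 13 and MAIN_MSB = 24, address 0x4000 falls in main memory while 0x16000018 selects UART0.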

Figure 4.9.: Schematic of the Wishbone demultiplexer

4.6. Memory

The ARM architecture uses byte addressing, meaning that the smallest addressable unit in any memory is one byte. ARMv2a also supports word access, meaning that a chunk of four bytes is addressed at the same time. An example of this is the ”str” and ”strb” assembler instructions, which store a word or a byte from a register to memory respectively. A couple of different memories were supplied with the Amber project and are listed below.

• Generic RAM with variable size, byte-wide write enable
  – Used as boot and cache data memory
  – Synthesizes as flip-flops

• Generic RAM with variable size, line-wide write enable
  – Used as cache tag memory
  – Synthesizes as RAM blocks in Spartan 6

• Spartan 6 specific block RAM implementations of different fixed sizes
  – Used as boot and cache (data and tag) memory
  – Synthesizes as RAM blocks
  – Sizes: 256x21, 256x32, 256x128, 512x128, 1024x128, 2048x32 and 4096x32
  – Useful only on 6 series FPGAs

• Wishbone to Spartan 6 memory controller bridge with DDR3 model
  – Used as main memory along with a DDR3 model generated by Coregen
  – Useful only on Spartan 6 designs

• A non-synthesizable memory model of variable size, 32 and 128 MB
  – Used as main memory in simulations only

All Spartan 6 specific RAMs were removed since the code has to be usable for all kinds of FPGAs. As for the generically coded memories, it is desirable that they synthesize as RAM blocks when these are available. Of the ones supplied, only the one with line-wide write enable achieved this. The byte-wide write enable memories were therefore replaced by following the template found in Xilinx UG687 [24]. The line-wide write enable memory was also replaced to keep a consistent coding style. These memories will synthesize in any FPGA and in a 6 series FPGA from Xilinx (and most probably others as well) they will utilize RAM blocks.

4.6.1. Cache

The system uses a unified cache, meaning that data and instructions share cache space. Its size is configurable through the parameter A23_CACHE_WAYS and can be either 2, 4, 8 or 16 kB. In the FPGA the cache is built from two different RAMs, one for the tags and one for the actual data. The cache is controlled by a coprocessor, which in turn is controlled through the registers shown in Table 4.5. These registers are accessed with the assembler instructions mcr and mrc.

| Name | Access | Description |
| --- | --- | --- |
| CR0 | R | ID register |
| CR2 | R/W | Cache control register |
| CR3 | R/W | Cachable area register |
| CR4 | R/W | Updateable area register |
| CR5 | R/W | Disruptive area |
| CR6 | R | Fault status register |
| CR7 | R | Fault address register |

Table 4.5.: Coprocessor registers. All registers are 32 bits wide.

ID register (CR0)

This register returns an ID tag of the processor. It has the layout shown in Table 4.6.

| Bit | 31:24 | 23:16 | 15:8 | 7:0 |
| --- | --- | --- | --- | --- |
| Name | Company ID | Manufacturer ID | Part type | Revision |
| Value (hex) | 41 | 56 | 03 | 00 |

Table 4.6.: Layout of coprocessor register CR0
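As an illustration of the layout, the following C sketch splits a value read from CR0 into its four fields. How the value is read (an mrc instruction on the target) is left out. The struct and function names are hypothetical.

```c
#include <stdint.h>

/* The four 8 bit fields of coprocessor register CR0. */
struct cr0_fields {
    uint8_t company_id;       /* bits 31:24 */
    uint8_t manufacturer_id;  /* bits 23:16 */
    uint8_t part_type;        /* bits 15:8  */
    uint8_t revision;         /* bits 7:0   */
};

/* Split a raw CR0 value into its fields. */
struct cr0_fields cr0_decode(uint32_t cr0)
{
    struct cr0_fields f;
    f.company_id      = (uint8_t)(cr0 >> 24);
    f.manufacturer_id = (uint8_t)(cr0 >> 16);
    f.part_type       = (uint8_t)(cr0 >> 8);
    f.revision        = (uint8_t)cr0;
    return f;
}
```

With the values in Table 4.6 the complete register reads 0x41560300.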

Cache control register (CR2)

This register is used to enable and disable the cache memory. By setting bit 0 the cache is enabled, otherwise it is disabled. The other bits are unused.

Cachable area register (CR3)

The areas of the boot and main memory that can be cached are defined in this register. Every bit represents a 2 MB region, where bit 0 represents the lowest 2 MB.

Updateable area register (CR4)

This register marks 2 MB regions as read only, with bit 0 representing the lowest 2 MB. Writes to a read only region are ignored.

Disruptive area (CR5)

Writing to areas marked by this register flushes the cache. Bit 0 represents the lowest 2 MB.
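The three area registers (CR3, CR4 and CR5) use the same scheme of one bit per 2 MB region. The mapping from an address to a bit can be sketched as below. The sketch assumes that addresses above the 64 MB covered by the 32 bits are treated as not marked; this and the function names are assumptions made for illustration.

```c
#include <stdint.h>
#include <stdbool.h>

#define REGION_SHIFT 21u   /* 2 MB = 2^21 bytes */

/* Index of the 2 MB region containing addr; region 0 is the lowest 2 MB. */
unsigned region_index(uint32_t addr)
{
    return addr >> REGION_SHIFT;
}

/* True if addr lies in a region whose bit is set in the area register value. */
bool region_marked(uint32_t area_reg, uint32_t addr)
{
    unsigned idx = region_index(addr);
    return idx < 32u && ((area_reg >> idx) & 1u);
}
```

For example, a register value of 0x00000002 marks only the region from 0x00200000 to 0x003FFFFF.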

Fault status register (CR6)

If a cache miss occurs, a fault status can be read from this register.

Fault address register (CR7)

If a cache miss occurs, the faulty address is stored in this register.

4.6.2. Boot memory

The boot memory uses a per-byte write enable and is variable in size. The size is changed with the parameter BOOT_MSB and the resulting size can be calculated using equation 4.1.

$$\text{Size}(b) = 2^{\text{BOOT\_MSB}+1} \tag{4.1}$$

The boot address space starts at address 0 and the highest address is found by subtracting 1 from the result of equation 4.1.

Originally the Amber system loaded the boot memory content in the test bench. At synthesis, the specified block RAM component was loaded through the makefile. Since all Xilinx specific code was removed this is no longer possible. Instead the Verilog system task ”$readmemh” is used as shown below. It loads content into the boot memory in both simulation and synthesis.

    initial begin
        $readmemh("boot_mem_contents.data", mem, 0, 2**(ADDRESS_WIDTH-2)-1);
    end

The command takes as arguments a file containing the data, the array that is to be loaded and the index boundaries of the array. The file ”boot_mem_contents.data” is extracted from an ELF⁴ file using the tool amber-elfsplitter-memcontents described in Section 6.3.2 and contains only data values.

4.6.3. Main

The main memory is implemented as a 32 bit wide array whose depth is changed with the parameter MAIN_MSB. The size (in bytes) of the memory can be calculated with equation 4.2.

⁴ An ELF file is the executable that results from compiling a program.

$$\text{Size}(b) = 2^{\text{MAIN\_MSB}+1} \tag{4.2}$$

The memory address space starts where the boot memory ends, which means that the lowest address can be calculated with equation 4.1. The highest address is calculated by summing the base and the size and subtracting one, as shown in equation 4.3.

$$\text{Address} = 2^{\text{BOOT\_MSB}+1} + 2^{\text{MAIN\_MSB}+1} - 1 \tag{4.3}$$
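Equations 4.1 to 4.3 can be combined into a small helper that computes the resulting memory map. The parameter values in the usage note are hypothetical and only serve as a worked example. The struct and function names are likewise hypothetical.

```c
#include <stdint.h>

/* First and last byte address of the two memories. */
struct mem_map {
    uint32_t boot_base, boot_top;
    uint32_t main_base, main_top;
};

/* Compute the memory map; both parameters must be smaller than 31. */
struct mem_map mem_map_from_msb(unsigned boot_msb, unsigned main_msb)
{
    struct mem_map m;
    uint32_t boot_size = (uint32_t)1 << (boot_msb + 1);   /* equation 4.1 */
    uint32_t main_size = (uint32_t)1 << (main_msb + 1);   /* equation 4.2 */

    m.boot_base = 0;
    m.boot_top  = boot_size - 1;
    m.main_base = boot_size;
    m.main_top  = boot_size + main_size - 1;              /* equation 4.3 */
    return m;
}
```

Assuming BOOT_MSB = 13 and MAIN_MSB = 24, the boot memory is 16 kB and occupies 0x0000 to 0x3FFF, while the 32 MB main memory occupies 0x00004000 to 0x02003FFF.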

In the original memory controllers there was a signal called i_mem_ctrl that, if set, wrapped the memory address at bit 24. The purpose was to simulate a 32 MB memory even if the physical memory was bigger, like the 128 MB RAM on the SP605 development board. In the current implementation it has been left out since the size is variable through the MAIN_MSB parameter.

4.6.4. Flash memory

There is no controller for a flash memory in the system. In those cases where it is needed, a serial flash memory can be directly connected to the SPI controller as a slave.

4.7. I2C

I2C is a very common serial communication protocol. In this section the protocol is first described and then followed by an introduction to the core that was used in this system.

4.7.1. I2C protocol

I2C uses two signals for communication:

SCL: Serial Clock

SDA: Serial Data

SCL is a clock line that determines the speed of the transfer. It is controlled by the master but the slave can force it low to pause the transfer temporarily (this is called clock stretching). SDA is the data line and control of it is shared between the master and the slave. In order to share a common line there have to be tri-state buffers at both ends along with an output enable (oe) signal, as shown in Figure 4.10.

Figure 4.10.: Tri-state buffers on SDA and SCL.

A typical I2C transfer is shown in Figure 4.11 below. The picture is taken from the I2C controller specification[25].

Figure 4.11.: I2C transfer.

Figure 4.11 step by step:

1. The master generates a start condition (the SDA line is pulled low before SCL). All slaves start to listen.

2. The master sends the address of the desired slave.

3. The master tri-states SDA and the addressed slave confirms that it detected the ”call” by pulling SDA low. This is called an ACK.

4. The master sends the data.

5. The slave responds to the data by leaving SDA high. This is called a NACK (a slave that acknowledges the data would instead pull SDA low, as in step 3).

6. The master generates a stop condition (SCL is released high before SDA).

4.7.2. I2C controller

The I2C controller used in this project was written by Richard Herveille and published on opencores in 2001[26]. The version used in this system was uploaded to opencores on the 6th of June 2010. Some of the key features, taken from the specification[25], are:

• Multi master operation

• Clock stretching and wait generation

• Supports 7 and 10 bit addressing mode

• Arbitration lost interrupt

• 8 bit wishbone interface

Since the Wishbone interface is only 8 bits wide it only supports byte access. To avoid unpredictable behavior caused by undefined values, the Least Significant Byte (LSB) of the output signal DAT_O is wired to the controller while the other bytes are set to zero in system.v. All the other signals are also wired to the controller using the LSB. The core is configured and controlled by a set of registers shown in Table 4.7.

| Name | Address | Access | Description |
| --- | --- | --- | --- |
| PRERlo | 0x20000000 | R/W | Low byte of clock prescaler |
| PRERhi | 0x20000004 | R/W | High byte of clock prescaler |
| CTR | 0x20000008 | R/W | Control register |
| TXR | 0x2000000C | W | Transmit register |
| RXR | 0x2000000C | R | Receive register |
| CR | 0x20000010 | W | Command register |
| SR | 0x20000010 | R | Status register |

Table 4.7.: I2C registers. All registers are 8 bits wide

For information on how to set up the registers see the I2C controller specification[25].
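As an example of such a setup, the value for the two prescaler registers can be computed as sketched below. The formula prescale = clk / (5 · SCL) − 1 is quoted from memory of the I2C controller specification[25] and should be verified against it before use. The struct and function names are hypothetical.

```c
#include <stdint.h>

/* Values for the PRERlo and PRERhi registers. */
struct i2c_prescale { uint8_t lo, hi; };

/* Compute the prescaler for a given system clock and SCL frequency.
 * Assumed formula: prescale = clk / (5 * scl) - 1.
 * Returns 0 on success, -1 if the request cannot be represented. */
int i2c_prescale_for(uint32_t clk_hz, uint32_t scl_hz, struct i2c_prescale *out)
{
    uint64_t divisor = (uint64_t)5 * scl_hz;
    uint64_t value;

    if (scl_hz == 0 || clk_hz < divisor)
        return -1;
    value = clk_hz / divisor - 1;
    if (value > 0xFFFFu)
        return -1;                       /* does not fit in 16 bits */
    out->lo = (uint8_t)(value & 0xFFu);
    out->hi = (uint8_t)(value >> 8);
    return 0;
}
```

Under this assumption a 32 MHz system clock and a 100 kHz SCL give a prescaler of 63, i.e. PRERlo = 0x3F and PRERhi = 0x00.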

4.8. SPI

SPI is a full duplex capable serial protocol used in a wide variety of applications ranging from small sensors to transfers of large amounts of data. In this section the SPI protocol is described followed by an introduction to the core used in the system.

4.8.1. SPI protocol

SPI uses four signals for communication:

SS: Slave Select. It is used to select the slave the master currently wants to address. This eliminates the need to send an address as in I2C, at the cost of some extra hardware.

SCK: Serial Clock. This is controlled completely by the master and sets the speed of the transfer.

MISO: Master In Slave Out. Data line from slave to master.

MOSI: Master Out Slave In. Data line from master to slave.

There are four different modes of SPI communication, called 0, 1, 2 and 3, as shown in Table 4.8. The parameters that determine the modes are:

CPOL: Level of SCK in idle state. CPOL = 0 means SCK = low.

CPHA: Phase of SCK. Determines whether the data is sampled on the first (leading) or second (trailing) edge of SCK. CPHA = 0 means sampling on the leading edge, which is the rising edge when CPOL = 0.

| Mode | CPOL | CPHA |
| --- | --- | --- |
| 0 | 0 | 0 |
| 1 | 0 | 1 |
| 2 | 1 | 0 |
| 3 | 1 | 1 |

Table 4.8.: SPI modes
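The table shows that the mode number is simply the two parameters read as a two bit number with CPOL as the most significant bit. A minimal C sketch of this relation, with hypothetical function names:

```c
/* Combine CPOL and CPHA into the SPI mode number of Table 4.8. */
unsigned spi_mode(unsigned cpol, unsigned cpha)
{
    return ((cpol & 1u) << 1) | (cpha & 1u);
}

/* Extract CPOL (idle level of SCK) from a mode number. */
unsigned spi_mode_cpol(unsigned mode)
{
    return (mode >> 1) & 1u;
}

/* Extract CPHA (sampling edge) from a mode number. */
unsigned spi_mode_cpha(unsigned mode)
{
    return mode & 1u;
}
```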

A timing diagram of a transfer in mode 0 is shown below in Figure 4.12, which is taken from the SPI controller specification[27].

Figure 4.12.: SPI transfer timing diagram.

4.8.2. SPI controller

The SPI core used in this system was written by Simon Srot and published on opencores in 2002[28]. The version used in this project was uploaded on the 10th of March 2009. Some of the key features of the core are:

• Full duplex

• Variable length of transfer word up to 128 bits

• MSB or LSB first data transfer

• Supports mode 0 and 1

• Eight slave select lines

• 32 bit Wishbone slave interface

The core is configured and controlled by a set of registers shown in Table 4.9. For information on how to set up the registers for a specific configuration see the SPI core specification[27]. However, two things are worth extra notice. The first is bit 8 of the control register, called GO_BSY in the specification. When this bit is set, the transfer starts. It is important that all registers are set up before this, including the control register itself. Therefore, to start a transfer, two writes to the control register have to be made. The second is that the receive and transmit registers are implemented in the same flip-flops. A write to a transmit register during a transfer will therefore overwrite the received data in the corresponding receive register.

| Name | Address | Access | Description |
| --- | --- | --- | --- |
| RX0 | 0x18000000 | R | Receive register bits [31:0] |
| RX1 | 0x18000004 | R | Receive register bits [63:32] |
| RX2 | 0x18000008 | R | Receive register bits [95:64] |
| RX3 | 0x1800000c | R | Receive register bits [127:96] |
| TX0 | 0x18000000 | R/W | Transmit register bits [31:0] |
| TX1 | 0x18000004 | R/W | Transmit register bits [63:32] |
| TX2 | 0x18000008 | R/W | Transmit register bits [95:64] |
| TX3 | 0x1800000c | R/W | Transmit register bits [127:96] |
| CTRL | 0x18000010 | R/W | Control and status register |
| DIVIDER | 0x18000014 | R/W | SCK divider value |
| SS | 0x18000018 | R/W | Slave select register |

Table 4.9.: SPI core registers. All registers are 32 bits wide.
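The need for two writes to the control register can be illustrated with the following C sketch. It takes a pointer to the CTRL register so that it can be tried on a host computer with an ordinary variable; on the target the pointer would refer to address 0x18000010. Only bit 8 (GO_BSY) is taken from the text above. The meaning of the remaining configuration bits is defined in the SPI core specification[27] and is not modelled here, and the function name is hypothetical.

```c
#include <stdint.h>

#define SPI_CTRL_GO_BSY (1u << 8)   /* bit 8 of CTRL starts the transfer */

/* Start a transfer with the given control register settings.
 * All other registers (TXn, DIVIDER, SS) must be set up beforehand. */
void spi_start_transfer(volatile uint32_t *ctrl, uint32_t config)
{
    *ctrl = config & ~SPI_CTRL_GO_BSY;   /* first write: settings only          */
    *ctrl = config | SPI_CTRL_GO_BSY;    /* second write: same settings, start  */
}
```

On the target a call could look like spi_start_transfer((volatile uint32_t *)0x18000010, config), where config is a value chosen according to the specification.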

4.9. UART

UART is one of the most common serial protocols used to interface between different systems. Its simplicity makes it ideal for sending commands and instructions from a computer or terminal to a smaller system such as this. It is also used to convert parallel data transmissions to serial or to interface with RS-232 and RS-485 drivers. The Amber project had two UART controllers already implemented and both of them were kept. One of them is used by the included boot-loader to interface with a computer and initialize program downloads.

4.9.1. UART protocol

UART is a point-to-point transmission and can be used in either simplex, half duplex or full duplex mode. The transmission speed is called baud rate and is configured separately at both ends. There is therefore no need for a clock signal, and in simplex mode only a single data line is needed. In half duplex mode the data line is shared but an additional signal controls the direction of the data, as shown in Figure 4.13. The direction signal is controlled by one of the UART controllers and is usually called Ready To Send (RTS) or Clear To Send (CTS).

Figure 4.13.: UART in half duplex mode.

For full duplex there are naturally two data lines, and in the smallest configuration this is all that is needed. However, since UART controllers usually use small First In First Out (FIFO) buffers they often implement two more signals, RTS and CTS. They are used to tell the other end if the receive buffer is full so that it can pause the transmission. The RTS signal from one controller is wired to the CTS input on the other, as shown in Figure 4.14.

Figure 4.14.: UART in full duplex mode with RTS and CTS.

A transfer with the UART protocol is somewhat flexible. It always starts with a start bit, but the data that follows can range from 5 to 9 bits. If the data is 8 bits or smaller it is followed by an optional parity bit, and the transmission ends with one or two stop bits. The start bit is low, the data is always sent LSB first and the stop bits are high. A schematic of this is shown in Figure 4.15.

Figure 4.15.: A UART transfer.
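The frame format can be illustrated with a small C function that builds the bit sequence for the case of 8 data bits, no parity and one stop bit. The result is returned as a 10 bit number where bit 0 is the first bit on the line; this representation and the function name are chosen for the example only.

```c
#include <stdint.h>

/* Build an 8N1 frame. Bit 0 of the result is sent first. */
uint16_t uart_frame_8n1(uint8_t data)
{
    uint16_t frame = 0;                       /* bit 0: start bit, low         */
    frame |= (uint16_t)((uint16_t)data << 1); /* bits 1..8: data, LSB first    */
    frame |= (uint16_t)(1u << 9);             /* bit 9: stop bit, high         */
    return frame;
}
```

For the character 'A' (0x41) the frame is 0x282, i.e. the line sees 0 (start), then 1, 0, 0, 0, 0, 0, 1, 0 (data, LSB first) and finally 1 (stop).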

4.9.2. UART controller

The controller was included in the Amber project and its main features are:

• Fixed setting of:
  – 8 data bits
  – No parity bit
  – 1 stop bit

• Hardware configurable baudrate, synchronous with system clock

• 1 byte buffer or 16 byte FIFO, enabled in software

• Transmit and receive interrupts

Having a fixed configuration and baud rate makes the controller quite inflexible but very small in terms of FPGA utilization. Also, the fact that the UART runs synchronously with the system clock means that the baud rate is not exact. However, it is well within the 10% offset allowed in the standard according to the author of the controller[29].

In Figure 4.16 the structure of the UART controller is shown. The data register is drawn with dotted lines since it is not actually a register, just a common address for the transmit and receive FIFOs. In the system two UARTs are instantiated, called UART0 and UART1. They are differentiated by the upper 16 bits of the address, where UART0 has 0x1600 and UART1 has 0x1700. These are shown as XXXX in Table 4.10, where the UART configuration registers are listed.

Figure 4.16.: Structural schematic of the UART controller.

Interrupts

The transmit and receive interrupts share the same output. Thus, when an interrupt occurs one needs to read the interrupt status register to determine which kind of interrupt occurred.

Receive interrupt

If the FIFO is enabled the receive interrupt will trigger when there are 8 bytes or more in the FIFO. Thus it can be cleared by reading bytes through the data register until fewer than 8 bytes remain. If the FIFO is disabled the interrupt will trigger when there is a byte ready in the transmission buffer and reset when the byte is read.

Transmit interrupt

If the FIFO is enabled the transmit interrupt will trigger when there are 8 bytes or fewer left in the FIFO. Thus it can be cleared by pushing bytes through the data register until there are more than 8 bytes in the FIFO. If the FIFO is disabled the interrupt will trigger when the transmission buffer is empty and reset when a byte is pushed to it. The transmit interrupt can also be cleared by a write to the interrupt clear register.

Configuration registers

| Name | Address | Access | Description |
| --- | --- | --- | --- |
| PID0 | 0xXXXX0fe0 | R | Constant value of 0x00000010 |
| PID1 | 0xXXXX0fe4 | R | Constant value of 0x00000010 |
| PID2 | 0xXXXX0fe8 | R | Constant value of 0x00000004 |
| PID3 | 0xXXXX0fec | R | Constant value of 0x00000000 |
| CID0 | 0xXXXX0ff0 | R | Constant value of 0x0000000d |
| CID1 | 0xXXXX0ff4 | R | Constant value of 0x000000f0 |
| CID2 | 0xXXXX0ff8 | R | Constant value of 0x00000005 |
| CID3 | 0xXXXX0ffc | R | Constant value of 0x000000b1 |
| DR | 0xXXXX0000 | R/W | Data register |
| RSR | 0xXXXX0004 | R/W | Receive Status Register |
| LCRH | 0xXXXX0008 | R/W | Line Control Register High |
| LCRM | 0xXXXX000c | R/W | Line Control Register Middle |
| LCRL | 0xXXXX0010 | R/W | Line Control Register Low |
| CR | 0xXXXX0014 | R/W | Control Register |
| FR | 0xXXXX0018 | R | Flag Register |
| IIR | 0xXXXX001c | R | Interrupt status register |
| ICR | 0xXXXX001c | W | Interrupt Clear Register |

Table 4.10.: UART core registers. All registers are 8 bits wide.

DR, Data register

A write to this register pushes a byte into the FIFO if it is enabled, and otherwise puts it directly in the 1 byte buffer for transmission. When reading this register either the oldest byte from the FIFO or the byte in the transmission buffer is retrieved. The controller will initiate a transmission as soon as there is data in the FIFO or transmission buffer and the receiving UART signals that it is ready by pulling the CTS input low. Thus, a write to this register will implicitly start a transmission.

RSR, Receive status register

Not used but initialised to zero. A write to this register will store whatever value was written and a read will return the same value, or zeroes if no write has occurred.

Line Control Registers

The Line Control Register consists of three bytes, called H (High), M (Middle) and L (Low). Of these only bit 4 in the high byte is used, which enables the FIFOs when set.

Control Register

In this register bit 4 is used to enable the receive interrupt and bit 5 to enable the transmit interrupt (high = enabled). The other bits are unused.

FR (Flag Register)

The 8 bits of the flag register are used as shown in Table 4.11.

| Bit | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Name | TxE | RxF | TxF | RxE | Busy | Not used | Not used | CTS |

Table 4.11.: Flag register bits. Bits 2 and 1 are always high.

TxE: Transmit FIFO empty. When no data is present in the FIFO or the buffer is empty this bit is high.

RxF: Receive FIFO full. When the FIFO is full or there is data in the buffer this bit is high.

TxF: Transmit FIFO full. When the FIFO is full or there is data in the buffer this bit is high.

RxE: Receive FIFO empty. When no data is present in the FIFO or the buffer is empty this bit is high.

Busy: UART busy flag. When there is data in the buffer or FIFO, this bit is high.

CTS: Clear To Send. When the device the UART is communicating with is ready to receive data this bit is high.
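A typical use of the flag register is to poll it before accessing the data register. The helpers below only decode a value that has already been read from FR, so that they can be tried without hardware. The bit positions are taken from Table 4.11; the macro and function names are hypothetical.

```c
#include <stdint.h>
#include <stdbool.h>

/* Bit positions of the flag register according to Table 4.11. */
#define UART_FR_TXE  (1u << 7)   /* transmit FIFO empty */
#define UART_FR_RXF  (1u << 6)   /* receive FIFO full   */
#define UART_FR_TXF  (1u << 5)   /* transmit FIFO full  */
#define UART_FR_RXE  (1u << 4)   /* receive FIFO empty  */
#define UART_FR_BUSY (1u << 3)   /* UART busy           */
#define UART_FR_CTS  (1u << 0)   /* clear to send       */

/* A byte may be written to DR when the transmit FIFO is not full. */
bool uart_can_write(uint8_t fr)
{
    return (fr & UART_FR_TXF) == 0;
}

/* A byte may be read from DR when the receive FIFO is not empty. */
bool uart_can_read(uint8_t fr)
{
    return (fr & UART_FR_RXE) == 0;
}
```

A transmit routine would typically loop until uart_can_write() returns true for a freshly read FR value and then write the byte to DR.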

IIR, Interrupt status register

This register is used to read the status of the interrupts. Bit 2 is the transmit interrupt status and bit 1 the receive interrupt status (high means interrupt active).

ICR, Interrupt clear register

A write with any data to this register clears the transmit interrupt.

4.10. Ethernet MAC

The Ethernet Media Access Control (MAC) provided in the Amber project was removed. The decision was based upon the fact that it represented around 25% of the system’s FPGA utilization while not implementing a function Syntronic saw any future need of in this system. If this function is needed sometime in the future, the work needed to reinsert the controller is not overwhelming, see Section 5.2 for further information. Before doing that though, one should also consider the use of an external controller since that would further simplify the implementation and save a considerable amount of FPGA resources.

4.11. GPIO

GPIOs are exactly what the name states: a set of pins that can be configured individually in software to act as either inputs or outputs. They are very useful for reading buttons or driving LEDs but can also be used to implement communication protocols such as the I2C and SPI presented earlier.

4.11.1. GPIO controller

The GPIO controller used in this project was written by Richard Herveille and uploaded to opencores in 2002[30]. The version used in this system was uploaded on the 10th of March 2009. The original version had support for 8 GPIO pins and an 8 bit Wishbone interface. If more pins were wanted, several components could be instantiated to achieve that. Since that solution would be a bit cumbersome in this case, we modified the controller to support up to 32 GPIO pins and a 32 bit Wishbone interface in one instance. The number of usable pins is configured with the GPIO_PINS parameter. Since a pin can be used as both input and output it has to utilize a tri-state buffer, as shown in Figure 4.17. The registers CTRL, WRITE and READ are explained below.

Figure 4.17.: GPIO pin tri-state buffer connection.

Configuration registers

The GPIO pins are controlled through two register addresses called CTRL and LINE, where LINE actually points to two different internal registers. They vary in size with the GPIO_PINS parameter and every bit in the registers represents a pin. Due to an issue when accessing the line register from a C program it has been split into two addresses in register_addresses.h, called WRITE and READ. The software registers are presented in Table 4.12. More details about the access issue are given in Section 9.5.

| Name | Address | Access | Description |
| --- | --- | --- | --- |
| CTRL | 0x19000000 | R/W | Control the direction of the pins |
| WRITE | 0x19000004 | R/W | Write output pin values, points to the LINE register |
| READ | 0x19000014 | R/W | Read input pin values, points to the LINE register |

Table 4.12.: GPIO core registers.

CTRL This register controls whether the respective pin is used as output or input. Setting a bit in this register to 1 makes its respective pin an output.

WRITE This register is used to set a value on the output pins. Nothing in the hardware prevents a read access, but using this register to read input pins might return faulty values, so use the READ register instead. Writing to an input pin has no effect.

READ This register is used to read the values of the input pins. It also reports the state of the output pins, a consequence of them sharing a hardware register.
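The register usage described above can be illustrated with a small C sketch. It is not taken from the thesis software: the helper names are hypothetical, and only the three addresses come from Table 4.12. The helpers take a pointer to the register so that the same code can be exercised on a host with an ordinary variable standing in for the hardware.

```c
#include <assert.h>
#include <stdint.h>

/* Register addresses from Table 4.12. */
#define GPIO_CTRL_ADDR  0x19000000u
#define GPIO_WRITE_ADDR 0x19000004u
#define GPIO_READ_ADDR  0x19000014u

/* Hypothetical helper: mask with the lowest `pins` bits set (1..32),
   matching the valid range of the GPIO_PINS parameter. */
static inline uint32_t gpio_pin_mask(unsigned pins)
{
    assert(pins >= 1 && pins <= 32);
    return (pins == 32) ? 0xFFFFFFFFu : ((1u << pins) - 1u);
}

/* Make the pins in `mask` outputs (CTRL bit = 1), leaving the rest unchanged. */
static inline void gpio_set_outputs(volatile uint32_t *ctrl, uint32_t mask)
{
    *ctrl |= mask;
}

/* Make the pins in `mask` inputs (CTRL bit = 0), leaving the rest unchanged. */
static inline void gpio_set_inputs(volatile uint32_t *ctrl, uint32_t mask)
{
    *ctrl &= ~mask;
}

/* Drive the output pins through the WRITE address. */
static inline void gpio_write(volatile uint32_t *write_reg, uint32_t value)
{
    *write_reg = value;
}

/* Sample the pins through the READ address. */
static inline uint32_t gpio_read(const volatile uint32_t *read_reg)
{
    return *read_reg;
}
```

On the target, a call would pass the fixed address cast to a pointer, for example gpio_set_outputs((volatile uint32_t *)GPIO_CTRL_ADDR, 0xFFu).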

4.12. Timers

The timer core was included in the Amber project. There are three timers available that are configurable through a set of registers shown in Table 4.13. The timers are identical and have the following features:

• Individual interrupts

• Either periodic or one-shot

• Three different prescalers

4.12.1. Registers

Name          Address     Access  Description
TIMER0_LOAD   0x13000000  R/W     Value set to timer 0
TIMER0_VALUE  0x13000004  R       Current value of timer 0
TIMER0_CTRL   0x13000008  R/W     Timer 0 control register
TIMER0_CLR    0x1300000c  W       Timer 0 interrupt clear
TIMER1_LOAD   0x13000010  R/W     Value set to timer 1
TIMER1_VALUE  0x13000014  R       Current value of timer 1
TIMER1_CTRL   0x13000018  R/W     Timer 1 control register
TIMER1_CLR    0x1300001c  W       Timer 1 interrupt clear
TIMER2_LOAD   0x13000020  R/W     Value set to timer 2
TIMER2_VALUE  0x13000024  R       Current value of timer 2
TIMER2_CTRL   0x13000028  R/W     Timer 2 control register
TIMER2_CLR    0x1300002c  W       Timer 2 interrupt clear

Table 4.13.: Timer core registers.

LOAD The LOAD register is used to load a value that the timer will count down from. The register is two bytes wide and thus the biggest value that can be loaded is 0xFFFF. The value is stored in the timer until it is overwritten.

VALUE This register holds the current value of the timer and can be read. A write to this register has no effect.

CTRL The control register is 8 bits wide and the bits are used as shown in Table 4.14.

Bit   7       6     5:4       3:2        1:0
Name  Enable  Mode  Not used  Prescaler  Not used

Table 4.14.: Control register bits. Unused bits are always low.

Enable When set the timer is enabled.

Mode This bit controls whether the timer acts in one-shot or periodic mode. Periodic mode is entered when the bit is high and works as expected: when the timer has counted down the value in the LOAD register, an interrupt is fired and the timer restarts the counting. One-shot mode, however, is not exactly what one would expect. Instead of disabling the timer after it has counted down the LOAD value, it loads 0xFFFF and starts counting again. To avoid this the IRQ handler has to disable the timer at the first interrupt.

Prescaler The prescaler bits determine how many system ticks pass for each timer count. How much the prescaler affects the counting is shown in Table 4.15.

CTRL[3:2]  PS value
00         1
01         16
10         256
11         Not used

Table 4.15.: Timer prescaler value

CLR Writing any value to this register clears the timer interrupt.

4.12.2. Setting up a timer

The timer is set up by first writing a value to the LOAD register and then writing the control register with the enable bit set and the desired prescaler and mode. The time in seconds that the timer will count is given by Equation 4.4, where LOAD is the value in the LOAD register and PS value is the value of the prescaler as shown in Table 4.15.

\[ \mathrm{Time\ (s)} = \frac{\mathrm{LOAD} \times \mathrm{PS\ value}}{\mathrm{Freq\ (Hz)}} \tag{4.4} \]

When the time expires the timer will fire an interrupt and restart.
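As an illustration of Table 4.14, Table 4.15 and Equation 4.4, the sketch below builds a CTRL value and computes the resulting timer period. The function and enum names are hypothetical and not part of the Amber software; the bit positions and prescaler values are taken from the tables above. For example, with LOAD = 0xFFFF, prescaler 256 and the 40 MHz system clock used in the tests, Equation 4.4 gives roughly 0.42 s.

```c
#include <assert.h>
#include <stdint.h>

/* Prescaler selector values for CTRL[3:2], from Table 4.15. */
enum timer_prescaler { TIMER_PS_1 = 0, TIMER_PS_16 = 1, TIMER_PS_256 = 2 };

/* Build the 8 bit CTRL value: bit 7 enable, bit 6 mode (1 = periodic),
   bits 3:2 prescaler selector. Unused bits stay low. */
static inline uint8_t timer_ctrl_value(int enable, int periodic,
                                       enum timer_prescaler ps)
{
    uint8_t ctrl = (uint8_t)((unsigned)ps << 2);
    if (enable)
        ctrl |= 0x80u;
    if (periodic)
        ctrl |= 0x40u;
    return ctrl;
}

/* Translate the selector into the PS value of Table 4.15. */
static inline unsigned timer_ps_value(enum timer_prescaler ps)
{
    switch (ps) {
    case TIMER_PS_1:   return 1;
    case TIMER_PS_16:  return 16;
    case TIMER_PS_256: return 256;
    }
    assert(!"invalid prescaler selector");
    return 0;
}

/* Equation 4.4: Time(s) = LOAD * PS value / Freq(Hz). */
static inline double timer_period_s(uint16_t load, enum timer_prescaler ps,
                                    double freq_hz)
{
    assert(freq_hz > 0.0);
    return ((double)load * timer_ps_value(ps)) / freq_hz;
}
```

Taking the load value as a uint16_t mirrors the two byte width of the LOAD register, so values above 0xFFFF cannot be passed by accident.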

4.13. Interrupt controller

The interrupt controller was included in the Amber project and basically combines all hardware and software interrupts into two interrupt request signals, IRQ and F(ast)IRQ. The interrupt request signals are then fed into the processor core. The IRQ signal is generated by combining the hardware and software interrupts with two enable masks. The FIRQ signal only contains the hardware interrupts and has two enable masks of its own. The relation between the interrupts and masks is shown in Figure 4.18 for the IRQ signal and Figure 4.19 for the FIRQ signal. These figures also show the interrupt requests from the test module described in Section 4.14. They are not maskable and their status cannot be read, so they are not discussed further in this section.

Figure 4.18.: Interrupt vectors and masks.

Figure 4.19.: Fast interrupt vectors and masks.

4.13.1. Registers

The registers in Table 4.17 handle the different vectors and masks shown in Figures 4.18 and 4.19 above. The vectors that include the software interrupt are laid out as presented in Table 4.16. For the others the software interrupt bit is unused.

Bit   31:10  9    8    7       6       5       4:3  2      1      0
Name  NA     SPI  I2C  Timer2  Timer1  Timer0  NA   UART1  UART0  SW

Table 4.16.: Interrupt vector outline. The unused bits (NA) are initialized to zero.

Name             Address     Access  Description
IRQ0_STATUS      0x14000000  R       IRQ0 status
IRQ0_RAWSTAT     0x14000004  R       IRQ0 hw IRQ status
IRQ0_ENABLESET   0x14000008  R/W     IRQ0 mask enable
IRQ0_ENABLECLR   0x1400000c  W       IRQ0 mask disable
INT_SOFTSET_0    0x14000010  R/W     Set software interrupt 0
INT_SOFTCLEAR_0  0x14000014  R/W     Clear software interrupt 0
FIRQ0_STATUS     0x14000020  R       FIRQ0 status
FIRQ0_RAWSTAT    0x14000024  R       FIRQ0 hw IRQ status
FIRQ0_ENABLESET  0x14000028  R/W     FIRQ0 mask enable
FIRQ0_ENABLECLR  0x1400002c  W       FIRQ0 mask disable
IRQ1_STATUS      0x14000040  R       IRQ1 status
IRQ1_RAWSTAT     0x14000044  R       IRQ1 hw IRQ status
IRQ1_ENABLESET   0x14000048  R/W     IRQ1 mask enable
IRQ1_ENABLECLR   0x1400004c  W       IRQ1 mask disable
INT_SOFTSET_1    0x14000050  R/W     Set software interrupt 1
INT_SOFTCLEAR_1  0x14000054  R/W     Clear software interrupt 1
FIRQ1_STATUS     0x14000060  R       FIRQ1 status
FIRQ1_RAWSTAT    0x14000064  R       FIRQ1 hw IRQ status
FIRQ1_ENABLESET  0x14000068  R/W     FIRQ1 mask enable
FIRQ1_ENABLECLR  0x1400006c  W       FIRQ1 mask disable
INT_SOFTSET_2    0x14000090  None    Defined but unused
INT_SOFTCLEAR_2  0x14000094  None    Defined but unused
INT_SOFTSET_3    0x140000d0  None    Defined but unused
INT_SOFTCLEAR_3  0x140000d4  None    Defined but unused

Table 4.17.: Interrupt controller registers.

In Table 4.17 above there are six types of registers: STATUS, RAWSTAT, ENABLESET, ENABLECLR, SOFTSET and SOFTCLEAR. They are divided among the six different interrupt types IRQ0, IRQ1, FIRQ0, FIRQ1, SOFT0 and SOFT1. The registers have exactly the same function for their respective interrupts and masks, so only a general description of them follows.

STATUS Reading from this register will return the value of the masked interrupt vector.

RAWSTAT Reading from this register will return the value of the (unmasked) hardware interrupt vector.

ENABLESET Writing 1 to bits in this register will enable the corresponding interrupt. Reading from it will return the current enable vector.

ENABLECLR Writing 1 to bits in this register will disable the corresponding interrupt.

SOFTSET Writing 1 to bit zero of this register will set the corresponding software interrupt. Reading from it will return the software interrupt status.

SOFTCLEAR Writing 1 to bit zero of this register will clear the corresponding software interrupt. Reading from it will return the software interrupt status.
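The interplay between the raw vector, the enable mask and the STATUS register can be modelled on a host as sketched below. This is a simplified model under stated assumptions, not the controller's Verilog: it covers a single vector with a single mask, it assumes that masking is a bitwise AND and that the request line is raised when any masked bit is set, and all names are hypothetical. The bit positions are those of Table 4.16.

```c
#include <assert.h>
#include <stdint.h>

/* Bit positions from Table 4.16. */
enum {
    IRQ_BIT_SW     = 0,
    IRQ_BIT_UART0  = 1,
    IRQ_BIT_UART1  = 2,
    IRQ_BIT_TIMER0 = 5,
    IRQ_BIT_TIMER1 = 6,
    IRQ_BIT_TIMER2 = 7,
    IRQ_BIT_I2C    = 8,
    IRQ_BIT_SPI    = 9
};

/* Hypothetical host-side model of one interrupt vector and its mask. */
struct irq_model {
    uint32_t raw;     /* unmasked interrupt vector */
    uint32_t enable;  /* current enable mask */
};

/* ENABLESET: writing 1 to a bit enables the corresponding interrupt. */
static inline void irq_enableset_write(struct irq_model *m, uint32_t bits)
{
    m->enable |= bits;
}

/* ENABLECLR: writing 1 to a bit disables the corresponding interrupt. */
static inline void irq_enableclr_write(struct irq_model *m, uint32_t bits)
{
    m->enable &= ~bits;
}

/* STATUS: the masked interrupt vector (assumed to be raw AND enable). */
static inline uint32_t irq_status_read(const struct irq_model *m)
{
    return m->raw & m->enable;
}

/* Request line towards the core (assumed high if any masked bit is set). */
static inline int irq_line(const struct irq_model *m)
{
    return irq_status_read(m) != 0;
}
```

Separate set and clear addresses let software change one enable bit with a single write, without a read-modify-write sequence that an interrupt could disturb.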

4.14. Test module

This module was included in the Amber project and is used to interface with the Verilog test bench, to test the interrupt functionality, to control the test bench UART and to provide a set of random numbers to the system. It is controlled by a set of registers shown in Table 4.18.

Name          Address     Access  Description
STATUS        0xf0000000  R/W     Test status register
FIRQ_TIMER    0xf0000004  R/W     FIRQ interrupt test register
IRQ_TIMER     0xf0000008  R/W     IRQ interrupt test register
UART_CONTROL  0xf0000010  R/W     Controls the test bench UART
UART_STATUS   0xf0000014  R       Test bench UART status register
UART_TXD      0xf0000018  R/W     Test bench UART data feed
SIM_CTRL      0xf000001c  R       Simulation register
MEM_CTRL      0xf0000020  R/W     Not used
CYCLES        0xf0000024  R       Counts system ticks
LED           0xf0000028  R/W     Control LEDs on SP605 board
PHY_RST       0xf000002c  R/W     Not used
RANDOM_NUM    0xf0000100  R/W     Provides a random number
RANDOM_NUM00  0xf0000100  R/W     Provides a random number
RANDOM_NUM01  0xf0000104  R/W     Provides a random number
RANDOM_NUM02  0xf0000108  R/W     Provides a random number
RANDOM_NUM03  0xf000010c  R/W     Provides a random number
RANDOM_NUM04  0xf0000110  R/W     Provides a random number
RANDOM_NUM05  0xf0000114  R/W     Provides a random number
RANDOM_NUM06  0xf0000118  R/W     Provides a random number
RANDOM_NUM07  0xf000011c  R/W     Provides a random number
RANDOM_NUM08  0xf0000120  R/W     Provides a random number
RANDOM_NUM09  0xf0000124  R/W     Provides a random number
RANDOM_NUM10  0xf0000128  R/W     Provides a random number
RANDOM_NUM11  0xf000012c  R/W     Provides a random number
RANDOM_NUM12  0xf0000130  R/W     Provides a random number
RANDOM_NUM13  0xf0000134  R/W     Provides a random number
RANDOM_NUM14  0xf0000138  R/W     Provides a random number
RANDOM_NUM15  0xf000013c  R/W     Provides a random number

Table 4.18.: Test module registers.

STATUS Used to terminate tests in simulation through the Verilog test bench. A write to this register with value ”32’d17” will terminate the test and generate a test pass message. A write with any other data will generate a test fail. If the data is equal to or greater than ”32’h8000” the message will be: ”Failed ’testname’ - with error 0x’data’” otherwise it will be: ”Failed ’testname’ - with error on line ’data’”. Reading will return the value written to it or zeroes. This is useful in hardware, and only if the testpass() or testfail() functions are not used, since they put the processor in an infinite loop.
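The decision rule for a write to the STATUS register can be summarised in a few lines of C. The sketch below is only a host-side restatement of the rule described above, with hypothetical names; the actual check is done in the Verilog test bench.

```c
#include <assert.h>
#include <stdint.h>

/* Possible outcomes of a write to the test STATUS register. */
enum test_result {
    TEST_PASSED,                 /* data == 17 */
    TEST_FAILED_WITH_ERROR_CODE, /* data >= 0x8000, reported as a hex error */
    TEST_FAILED_ON_LINE          /* any other data, reported as a line number */
};

#define TEST_PASS_VALUE     17u
#define TEST_ERROR_CODE_MIN 0x8000u

/* Classify the written value in the order the text describes:
   the pass value first, then the error code range, then line numbers. */
static inline enum test_result classify_test_status(uint32_t data)
{
    if (data == TEST_PASS_VALUE)
        return TEST_PASSED;
    if (data >= TEST_ERROR_CODE_MIN)
        return TEST_FAILED_WITH_ERROR_CODE;
    return TEST_FAILED_ON_LINE;
}
```

One consequence of this rule is that a failure on source line 17 cannot be reported as a line number, since that value is reserved for the pass message.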

FIRQ_TIMER and IRQ_TIMER Used to test the interrupt functionality of the processor core. When writing to these registers only the least significant byte is used and the data has the following effect:

8’h00 Clears the interrupt

8’h01 Sets the interrupt

8’h02 to 8’hff Initiate a countdown that decreases every system clock tick. When it reaches 8’h01 it fires the interrupt and stops.

Note that these interrupts will not be shown in any of the interrupt controller vectors. However, reading from these registers will return the interrupt timer value and can therefore be used to check whether an interrupt has been set from here.
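The countdown behaviour of the FIRQ and IRQ test timers can be sketched as a small host-side model. The names are hypothetical, and the model assumes that the interrupt is asserted exactly while the stored value is 1, which matches the three cases listed above but has not been checked against the Verilog source.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of one interrupt test timer register. */
struct test_irq_timer {
    uint8_t value;  /* value returned when the register is read */
};

/* A write keeps only the least significant byte:
   0 clears the interrupt, 1 sets it, 2..255 starts a countdown. */
static inline void test_irq_timer_write(struct test_irq_timer *t, uint32_t data)
{
    t->value = (uint8_t)(data & 0xFFu);
}

/* One system clock tick: count down and stop at 1. */
static inline void test_irq_timer_tick(struct test_irq_timer *t)
{
    if (t->value > 1u)
        t->value--;
}

/* Assumption: the interrupt is pending exactly when the value is 1. */
static inline int test_irq_timer_pending(const struct test_irq_timer *t)
{
    return t->value == 1u;
}
```

Under this model a write of N (with N at least 2) raises the interrupt N - 1 ticks later, and the interrupt stays set until software writes 0.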

UART_CONTROL This register controls the UART interface in the Verilog test bench. Only the two lowest bits are used. They have the following effect:

Bit 0 When set it enables transmission in the test bench UART.

Bit 1 When set the test bench UART is in loopback mode.

UART_STATUS Returns the status of the test bench UART. Only bits 1 and 0 are used and they have the following meaning:

Bit 0 High if the UART transmit FIFO is empty.

Bit 1 High if the UART transmit FIFO is full.

UART_TXD This register is used to push a byte into the test bench UART's transmit FIFO if it is not in loopback mode. If the FIFO is full the byte will be discarded and a warning message generated (in simulation).

SIM_CTRL This register is used by software to determine if it runs in simulation or in hardware. If the register is zero then it is on the FPGA, otherwise it is a simulation. This register is controlled by the run.sh script and a define in the code.

MEM_CTRL This register has no effect any more. Its purpose was to wrap addresses going to the main memory and this feature was used by some of the tests.

CYCLES This 32 bit register stores the number of system ticks since startup.

LED This register is used to control the LEDs on the SP605 development board. Bits 3:0 are used for this.

PHY_RST This register is not used any more. Its purpose was to reset the Ethernet controller on the SP605 development board.

RANDOM_NUM registers These are a set of one byte wide registers containing a random number. A new random number is retrieved by reading the least significant byte from any of these registers. Writing to any of these registers will give the generator a new seed.

4.15. Verilog test bench

The test bench is the top level entity when running simulations and was included in the Amber project. It instantiates the whole system along with slave modules for the following functions:

• UART

• SPI

• I2C

• GPIO

Additionally the test bench generates a 200 MHz clock signal, loads the main memory with content and reads the STATUS register of the test module described in Section 4.14 to terminate tests. When a test is terminated a message is printed that shows the current system status and a Passed or Failed message as shown below. In the following sections the slave modules will be described in more detail.

    ----------------------------------------------------------------------------
    Amber Core
               User        FIRQ        IRQ       > SVC
    r0         0x20000010
    r1         0x000000c1
    r2         0x00000002
    r3         0x00000080
    r4         0xdeadbeef
    r5         0x000000a5
    r6         0x0000005a
    r7         0xdeadbeef
    r8         0xdeadbeef  0xdeadbeef
    r9         0xdeadbeef  0xdeadbeef
    r10        0x00000011  0xdeadbeef
    r11        0xf0000000  0xdeadbeef
    r12        0xdeadbeef  0xdeadbeef
    r13        0xdeadbeef  0xdeadbeef  0xdeadbeef  0xdeadbeef
    r14 (lr)   0xdeadbeef  0xdeadbeef  0xdeadbeef  0xdeadbeef
    r15 (pc)   0x00000268

    Status Bits: N=0, Z=0, C=1, V=0, IRQ Mask 1, FIRQ Mask 1, Mode = Supervisor
    ----------------------------------------------------------------------------

    ++++++++++++++++++++
    Passed i2c 12364 ticks
    ++++++++++++++++++++
    Stopped at time : 309577500 ps : File ”/home/emanuel/workspace/amber_SoC/trunk/hw/vlog/tb/tb.v” Line 462

4.15.1. UART

This UART controller has two modes, loopback and transmission. In loopback mode a received byte is put in the transmission buffer and sent back. In transmission mode it utilises a 16 byte transmission buffer that can be filled with data using the TXD register. The registers are controlled from the test module described in Section 4.14.

4.15.2. SPI

The SPI slave model is a simple loopback model that was included in the SPI project[28], where it was a part of that system's test bench. The only modification made was to read the CTRL register in order to automatically set the same mode as the Amber SPI controller.

4.15.3. I2C

This I2C slave was included in the I2C project where it was part of its test bench. It is interfaced as a real I2C device with address 7’b0010000 and contains 16 registers with both read and write access. Since both the slave and the actual controller tri-state SDA and SCL, they are pulled up in the test bench top level with the Verilog ”pullup” command.

4.15.4. GPIO

There was no suitable test model included in the GPIO project[] so a simple loopback model was written. It divides the GPIO signals in two equally sized sections. The lower half of the signals are mirrored on the upper half.
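The mirroring done by the loopback model can be expressed as a small C function, shown here as a host-side sketch rather than the Verilog that was actually written. The function name is hypothetical, and the sketch assumes an even number of pins so that the two halves are equally sized. With 16 pins and the value 0xDA driven on the lower half, which is the case exercised by the GPIO test in Section 7.1.4, the line reads back as 0xDADA.

```c
#include <assert.h>
#include <stdint.h>

/* Return the line value seen by the controller when the lower half of
   `pins` GPIO signals is mirrored onto the upper half.
   Assumes `pins` is even and in the range 2..32. */
static inline uint32_t gpio_loopback(uint32_t driven, unsigned pins)
{
    assert(pins >= 2 && pins <= 32 && pins % 2 == 0);
    unsigned half = pins / 2;                 /* at most 16 */
    uint32_t low_mask = (1u << half) - 1u;    /* selects the lower half */
    uint32_t low = driven & low_mask;
    return low | (low << half);               /* copy lower half upwards */
}
```

Anything driven on the upper half is ignored by the model, which corresponds to configuring those pins as inputs in the test.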

5. Configuration

In this chapter the parameters for configuring the system are presented. The steps that were taken to remove and add modules to the system are also shown.

5.1. Parameters

There are a number of parameters for configuring the system presented in Table 5.1.

Parameter              Location                 Description
A23_CACHE_WAYS         a23_config_defines       Defines the size of the cache. The size can be either 2, 4, 8 or 16 kB.
A23_RAM_REGISTER_BANK  a23_config_defines       If set the register bank is implemented in a RAM block, otherwise in flip-flops.
AMBER_CLOCK_DIVIDER    system_config_defines    The PLL output is divided by this value to get the system clock.
AMBER_UART_BAUD        system_config_defines    Specifies the baud rate for both UARTs.
BOOT_MSB               memory_configuration.v   Specifies the size of the boot memory.
GPIO_PINS              system_config_defines.v  The number of available GPIO pins. Any number from 1 to 32 is valid.
MAIN_MSB               memory_configuration.v   Specifies the size of the main memory.
SPI_DIVIDER_LEN        spi_defines.v            Sets the bit length for the SPI clock divider.
SPI_MAX_CHAR           spi_defines.v            Sets the maximum transmission data block size.
SPI_SS_NB              spi_defines.v            Sets the number of slave select signals.
WB_SLAVES              system.v                 Sets the number of slaves on the Wishbone bus.

Table 5.1.: Parameters to configure the system

5.2. Adding or removing a peripheral

This section contains a general description of the modifications that were made when adding the SPI, GPIO and I2C controllers. For removing a peripheral the modifications are similar. The files that need to be changed are:

• amber_isim.prj

• amber_registers.h

• global_defines.v

• interrupt_controller.v

• memory_configuration.v

• registry_defines.v

• system.v

• tb.v

• wishbone_arbiter.v

• xs6_source_files.prj

amber_isim.prj

All HDL files that are needed for simulations are listed here so all new files should be added. This includes any test bench files if they exist.

amber_registers.h

This header file contains defines for all registers in the system. It should be kept synchronized with register_addresses.v.

global_defines.v

Here a reference to the peripheral is defined. It is not mandatory for the peripheral to work but it is a neat way to access signals and components during simulation.

interrupt_controller.v

The interrupts are defined in a 32 bit vector called raw_interrupts. If the new controller contains an interrupt output it should be defined here and wired to a new input.

memory_configuration.v

The Wishbone demux uses a function (called in_controllername) defined in this file to identify the currently addressed slave. The peripherals' base addresses are also defined in this file.

registry_defines.v

All registers that are accessible through the Wishbone bus should be defined here.

system.v

In this file the following changes have to be made:

• Increase the WB SLAVES parameter

• Add any inputs and outputs

• Instantiate the component

• Modify instantiation of the interrupt controller if an interrupt signal is used

• Modify the Wishbone demux instantiation

tb.v

If any inputs or outputs are added to system.v they have to be added in the system instantiation. Also instantiate any slave module or bench for testing the peripheral.

wishbone_arbiter.v

In this file the following has to be changed:

• Add inputs and outputs for the peripheral's Wishbone interface

• Add the peripheral to the slave arbitration (assignment to signal current_slave)

• Add assignments to the peripheral's Wishbone signals

xs6_source_files

All HDL files that are needed for synthesis should be added to this file. Note that test bench files should not be listed here.

6. Tools

Included in the Amber project are Linux scripts for running simulations and makefiles for both synthesis and software compiling. They are all introduced in this chapter, along with the compiler, simulator and synthesis software they depend on.

6.1. Xilinx ISE 14.5

Both the simulation script and synthesis makefile use the Xilinx ISE development suite. It is available in a basic version called Webpack[31] that is free and contains all the necessary functionality for this project. Since the simulations and synthesis are run from scripts the built in Graphical User Interface (GUI) and project manager are not used.

6.1.1. Synthesis

The makefile for synthesis uses the command line design flow described in the Xilinx Command Line Tools User Guide[32] and the XST User Guide[33]. Its overall procedure is shown in Figure 6.1. Before the synthesis step is entered the serial boot loader software is compiled and the .data file used to load the boot memory is placed in the work folder. The input to the synthesis step is a project file (file extension .prj) containing a list of all Verilog source files and a text file containing a seed from which the placement algorithms generate their starting point. At step two, called the NGDBuild step, a user constraints file (UCF) is also used. That file contains information about which pins on the FPGA to use and also timing constraints on different nets. More information on the UCF and constraints can be found in the Xilinx constraints file[34]. The result of the design process is a bitfile that can be downloaded to the target FPGA.

Figure 6.1.: Xilinx ISE design flow.

Synthesis, XST This application converts the HDL source files to a list of logic circuits and their connections. This file is called a netlist (file extension .ngc) and is the first step towards the goal of flashing the FPGA. XST can take a large number of options and the ones used can be viewed in the makefile. ISE also includes a viewer for the netlist called RTL viewer. It displays all the circuits and the connections as a schematic diagram. This can be used as a step in verifying the source code.

NGDBuild This application uses the netlist file created by XST and the ucf to generate a Native Generic Database(NGD) file. The ngd file basically contains the same things as the netlist file except that the logic circuits have been transformed into FPGA primitives such as AND gates, OR gates, flip-flops and look-up tables (LUTs).

MAP In the mapping step the primitives in the NGD file are mapped to specific places in the targeted FPGA. The file that describes the physical placement is called a Native Circuit Description (NCD) file, but since it only describes the placement of the primitives, not the connections, this step outputs an intermediate file called map.ncd. It is also possible to do the placement in the next step but that is not done in this case.

Place and Route, PAR In this step the connections between the primitives are routed in the FPGA and output as an NCD file. This is a very complicated process since there is often a huge number of connections to make and they can be made in many different ways.

Timing analysis, trce This is the final check of the circuit. It checks the design against timing constraints that were given in the user constraints file. If timing fails the place and route program is rerun in an effort to fix the errors and a new timing analysis is done. This is iterated until the design passes the test or a certain threshold of iterations is passed. If the place and route fails to produce an error free routing one can rerun the map with another seed to get a new placement that allows for other routes to be made.

BitGen The BitGen, or bit generator, takes the NCD file and converts it to a bitstream that can be used by the Xilinx flashing program Impact to program the FPGA.

6.1.2. Simulation

The simulation script uses the Xilinx ISIM simulator that is included in the ISE design suite to run a behavioural simulation. ISIM allows for the use of a test bench and viewing signals in a configurable waveform viewer. It also supports the Tool Command Language (TCL) for controlling the simulator.

The script invokes the gcc-arm compiler found in the CodeSourcery compiler suite discussed in Section 6.2 to compile the software used as stimulus for the simulation. Running the script with a test named ”test_my_SoC” is done from the terminal with the following command:

    user@computer:/../amber_dir/trunk/hw/isim$ ./run.sh [-options] test_my_SoC

The output from running a test called i2c.S is included in Appendix A. The script can be used with a set of options described in Table 6.1.

Option  Description
-h      Bring up a help message
-g      Launch simulation in ISIM graphical interface
-w      Used with -g option to specify a wave configuration file (wcfg) located in the wcfg directory
-b      Specify the size of the boot memory. It is needed to generate the .data¹ file properly. Defaults to 8192 bytes.

Table 6.1.: Simulation script options

When the script is invoked it follows the execution order shown in Figure 6.2. Explanations of the actions are given below.

¹ Boot memory contents file, see Section 4.6.2

Figure 6.2.: Simulation script organization

Checking test type There are three different types of tests that are supported. They are described in Table 6.2 below. If the test does not exist the script will exit here and give the error message ”Test ’testName’ not found”.

Test type  Location      Description
Hardware   hw/tests/     Test written in assembler. Used to test specific hardware functions of the system.
Software   sw/testName/  Test written in C. This can either be a stand-alone application or it can use the boot loader that jumps to address 0x8000.
vmlinux    sw/vmlinux    Boots a Linux kernel. This is an extensive test that is used to further verify the correctness of the core. It requires the MAIN_MSB parameter to be 24, i.e. 32 MBytes.

Table 6.2.: Test types supported by the simulation script

Generate boot .data file As described in Section 4.6.2 the boot memory is loaded with data from a file called boot_mem_contents.data. This step is where that file is generated for simulations. It uses the tool amber-elfsplitter-memcontents described in Section 6.3.2. For the file to be correct the boot memory size needs to be specified correctly.

Launch FUSE FUSE is part of the ISIM simulator. It is used to generate an executable file from a Verilog design specified in a project file with file extension .prj. Note that this is not the same project file as the one used for synthesis since the simulation project also contains the Verilog test bench.

Launch ISIM ISIM is launched by running the executable that was created by FUSE. It can be run with or without a GUI. If the GUI is used a wave configuration file can be specified. This file configures the wave viewer to show some specific signals and is very useful when debugging, since one does not have to add the signals manually every time. The file should be located in the ”wcfg” subdirectory of the ”isim” folder.

6.1.3. Bulk simulation

In the Amber project there was also a script included to run several assembler tests automatically. The script is called all.sh and is invoked with the following command:

    user@computer:/../amber_dir/trunk/hw/isim$ ./all.sh [-b xxxx]

It contains a list with test names and uses the run.sh script to sequentially run them all. The -b option is used to pass a boot memory size to the run.sh script. It can be omitted if the boot memory size is 8192 bytes, otherwise it has to be specified.

6.1.4. Debug switches

A set of debug switches was present in the Amber project and has been kept. They are located in the files a23_config_defines.v and system_config_defines.v. Enabling them will print a set of debug messages in the terminal when running simulations, except for the AMBER_WISHBONE_DEBUG parameter which will add jitter to the Wishbone interface.

6.2. Sourcery CodeBench for ARM processors

For compilation of the software that is to be used in the system the GNU Cross Compiler (gcc) is used. It can be found in a ready-made package called Sourcery CodeBench Lite Edition[35]. Downloading it requires registration; it is completely free but no support is included. To use it one has to specify the following options to the compiler:

-march=armv2a Use instructions for the ARMv2a architecture.

-mno-thumb-interwork Use ARM instructions only, no Thumb instructions.

and this to the linker:

--fix-v4bx Changes all ”bx” instructions to ”mov pc, lr”.

The reason for this is that the Amber core supports neither the Thumb instruction set nor the assembler instruction ”bx”. This information was found in the Amber user guide[21].

6.3. Amber specific tools

There were a number of tools included in the Amber project and the ones used in the thesis are presented here. One of them was modified slightly for use with the new boot memory and named amber-elfsplitter-memcontents. Not presented here are some tools for looking at disassembled files and tools for generating different memory content files. These can prove useful in the future but, as said above, have not been used during the thesis. If one is interested they are located in ”$AMBER_BASE/trunk/sw/tools/”.

6.3.1. amber-elfsplitter

This tool analyses a linker file (.elf) and generates a .mem file used to infuse specific RAM blocks with content. Since no such memories are used any more this tool is only used to load main memory content in the Verilog test bench, which still uses the old infusion scheme.

6.3.2. amber-elfsplitter-memcontents

This tool is a modified version of the amber-elfsplitter described above. It still uses the elf file as input but instead outputs a file with content for use with the $readmemh command. For that command to work properly the whole memory array needs to be filled. Therefore the array is padded with zeros after all valid data. This has the positive side effect that the boot memory will not contain any uninitialized memory slots. To know how many zeros should be added the tool has an additional input that specifies the size of the boot memory in bytes. A typical usage with the linker file test.elf and a memory size of 8kB is shown below.

    amber-elfsplitter-memcontents test.elf 8192 > mem_contents.data
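The zero padding described above can be sketched in C as follows. This is not the source of amber-elfsplitter-memcontents: the function names are hypothetical, the ELF parsing is left out, and the output format (one 32 bit word per line in hexadecimal) is an assumption that $readmemh accepts, not necessarily what the tool emits. The size check plays the same role as the check_mem_size tool.

```c
#include <assert.h>
#include <inttypes.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Copy `n_words` program words into `mem` and zero-fill the remainder so
   that every slot of a memory of `mem_bytes` bytes is initialized.
   Returns 0 on success and -1 if the program does not fit. */
static inline int pad_boot_image(const uint32_t *words, size_t n_words,
                                 uint32_t *mem, size_t mem_bytes)
{
    size_t total = mem_bytes / sizeof(uint32_t);
    if (n_words > total)
        return -1;
    for (size_t i = 0; i < total; i++)
        mem[i] = (i < n_words) ? words[i] : 0u;
    return 0;
}

/* Write the padded image as one hexadecimal word per line
   (assumed layout, readable with $readmemh). */
static inline void dump_readmemh(FILE *out, const uint32_t *mem, size_t n_words)
{
    for (size_t i = 0; i < n_words; i++)
        fprintf(out, "%08" PRIx32 "\n", mem[i]);
}
```

For the default boot memory of 8192 bytes the padded image holds 2048 words, regardless of how short the program is.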

6.3.3. check_mem_size

Used by the boot loader makefile to ensure that the compiled boot loader program fits in the boot memory.

6.4. Installation

In addition to using each software package's installer one has to configure a couple of environment variables shown in Table 6.3. These are most conveniently put in the .bashrc file or similar.

Variable         Description
AMBER_BASE       Absolute path to the trunk of the Amber project
AMBER_CROSSTOOL  Name of the gcc compiler
XILINX           Path to the Xilinx libraries
PATH             Both the Sourcery CodeBench and the Xilinx ISE should be available through the system variable PATH

Table 6.3.: Required environment variables

A snapshot of the .bashrc file used for this thesis is presented below. It was produced from an example in the Amber user guide[21].

    # Change /proj/amber to where you saved the amber package on your system
    export AMBER_BASE=/home/emanuel/workspace/amber_SoC/trunk
    # Change /opt/Sourcery to where the package is installed on your system
    PATH=/opt/sourcery/bin:${PATH}
    # Also need to add Xilinx ISE directory to PATH
    PATH=/opt/Xilinx/14.5/ISE_DS/ISE/bin/lin64:${PATH}
    # AMBER_CROSSTOOL is the name added to the start of each GNU tool in
    # the Code Sourcery bin directory.
    # This variable is used in various makefiles to set
    # the correct tool to compile code for the Amber core
    export AMBER_CROSSTOOL=arm-none-linux-gnueabi
    # Xilinx ISE installation directory
    # This should be configured for you when you install ISE.
    # But check that it has the correct value
    # It is used in the run script to locate the Xilinx library elements.
    export XILINX=/opt/Xilinx/14.5/ISE_DS/ISE

Additionally one has to give the scripts in ”AMBER_BASE/hw/isim” and ”AMBER_BASE/hw/fpga/bin” permission to execute. This is most easily done with chmod as follows:

    user@computer:/directory$ chmod +x *.sh

7. Testing

Three main test types are used to test this system: assembler, C and Linux tests. They are all discussed in this chapter. Section 7.1, where the assembler tests are discussed, also shows waveforms of the SPI, I2C, UART and GPIO pins.

7.1. Assembler tests

An extensive suite of hardware tests written in assembler was included in the Amber project. They are listed in Table 2 of the Amber user guide[21]. In addition to these, three more tests were written to test the SPI, I2C and GPIO controllers. Since the Ethernet MAC was removed the tests concerning it were also removed. Worth noting is that the test ”addr_ex” requires a main memory of 128MB to complete successfully, and some other test requires 32MB. This has no real practical influence so they have been left unattended.

7.1.1. SPI test (spi.S)

The SPI hardware test performs the following actions:

1. Set up controller with the following settings:

• Mode 1

• 40MHz SCK

• 64 bit transfer

• Interrupt enabled

• Send to (testbench) slave 0

2. Start transfer

3. Wait for interrupt

4. Verify the GO_BSY bit is 0

5. Verify loopback of the first transferred word

In Figure 7.1 the transfer of the first word is shown. Since the slave module's buffer is initialized to zero the first received word contains only zeroes. Figure 7.2 shows the second word transfer where the first word is looped back. The pictures have a slight overlap.

Figure 7.1.: SPI transfer of first word.

Figure 7.2.: SPI transfer of second word.

7.1.2. I2C test (i2c.S)

Since the I2C controller only uses eight bits of the Wishbone interface the I2C test uses both word and byte access. This ensures that the padding was done successfully. The test performs the following actions:

1. Set a prescaler that gives 365 kHz SCL at 40 MHz system clock

2. Enable the core

3. Send data 0xa5 to slave address 0x20 register 1

4. Send data 0x5a to register 2 using auto incrementation

5. Read back data and verify it

6. Write to an invalid register and check NACK

Figures 7.3 to 7.13 show SDA and SCL during the complete sequence. The I2C slave model uses the 7 bit address 0010 000.
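The values 0x20 and 0x21 that appear in the figure captions below follow from the standard I2C framing, where the first byte on the bus is the 7 bit slave address followed by the read/write bit. The short C sketch below makes this explicit for the test bench slave; the function and macro names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* 7 bit address of the test bench slave, 7'b0010000. */
#define I2C_SLAVE_ADDR7 0x10u

/* Build the first byte of an I2C transfer: the 7 bit address in
   bits 7:1 and the direction in bit 0 (0 = write, 1 = read). */
static inline uint8_t i2c_address_byte(uint8_t addr7, int read)
{
    assert(addr7 <= 0x7Fu);
    return (uint8_t)((addr7 << 1) | (read ? 1u : 0u));
}
```

With the slave address above, a write transfer starts with the byte 0x20 and a read transfer with 0x21.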

Figure 7.3.: Start condition and sending slave address plus write bit (0x20)

Figure 7.4.: Sending register address 0x01

Figure 7.5.: Sending data 0xa5

Figure 7.6.: Sending data 0x5a and a stop condition

Figure 7.7.: Start condition and sending slave address plus write bit (0x20)

Figure 7.8.: Sending register address 0x01

Figure 7.9.: Start condition and sending slave address plus read bit (0x21)

Figure 7.10.: Reading data 0xa5

Figure 7.11.: Reading data 0x5a and stop condition

Figure 7.12.: Start condition and sending slave address plus write bit (0x20)

Figure 7.13.: Sending invalid register address 0x10 and receiving NACK

7.1.3. UART test (uart_tx.S)

The UART transmission test performs the following actions:

1. Enable loopback in the test UART

2. Send message ”Hi!”, encoded in ASCII

3. Send white space character to flush the loopback buffer

Figures 7.14 to 7.17 show the UART communication lines during the test.

Figure 7.14.: Send character ”H”

Figure 7.15.: Send character ”i”, receive character ”H”

Figure 7.16.: Send character ”!”, receive character ”i”

Figure 7.17.: Send character ” ”, receive character ”!”

7.1.4. GPIO (gpio.S)

The GPIO test uses 16 pins and performs the following actions:

1. Configure pins [16:9] as inputs and pins [8:1] as outputs

2. Set the outputs to hex value ”0xDA”

3. Check the mirrored inputs

4. Set the outputs to hex value ”0xBE”

5. Check the mirrored inputs

In Figure 7.18 and Figure 7.19 the GPIO pin values are shown.

Figure 7.18.: Pins [8:1] are ”0xDA” and mirrored on pins [16:9]

Figure 7.19.: Pins [8:1] are ”0xBE” and mirrored on pins [16:9]

7.2. C tests

The test software written in C is located in the $AMBER_BASE/sw/ directory, where each test has its own folder. The folder must have the same name as the test software file and must at least contain the files presented in Table 7.1.

Filename      Description
Makefile      A makefile that specifies the files and dependencies and any extra variables, and then calls common.mk, a makefile that is common for all C tests.
sections.lds  Linker file that specifies the different memory sections of the test program.
start.S       Assembler start routine. Contains exception and interrupt handlers, stack initiation, etc. Used when the program is run without the boot-loader.
test.c        The actual test program.

Table 7.1.: Files required for a C test

7.2.1. Libraries

There is a library included in the Amber project called mini-libc. It contains, among others, malloc and a printf version that uses the UART0 controller. It can be used instead of the standard ”stdio.h” library since it runs stand-alone on this system. To use it, define the parameter ”USE_MINI_LIBC” in the makefile and set it to ”1”. For an example see the Hello World sample program discussed in Section 7.2.4. The mini-libc library also contains two functions, testpass() and testfail(), for terminating the tests in a similar manner to the assembler tests. For cases where one does not want to use the mini-libc library but still needs these functions, they were extracted into a new library called ”test_func”. The common makefile was modified so that the test_func library is used if the USE_MINI_LIBC parameter is undefined.

7.2.2. boot-loader-serial

This test launches the boot-loader that is also used in hardware. The boot-loader displays a help message and then the test is terminated.

7.2.3. dhry

This test launches the Dhrystone benchmark program version 2.1. The accuracy of the test result is unknown, but differences can be seen when implementing various hardware changes. For example, when the main memory was changed, an increase in performance of 0.03 DMIPS could be noted. This can most likely be attributed to the new memory using one Wishbone cycle less per access than the one it replaced.

7.2.4. hello-world

There is a simple ”Hello-World” sample program included in the Amber project that prints the message ”Hello, World!” using the printf function in mini-libc.

7.2.5. spi-timer

This sample program was written to show how the timers and interrupts are set up. It uses a timer to periodically send a message using the SPI controller. The message contains a number that tells how many messages have been sent. The test uses the UART0 controller to print a $ when an SPI transfer starts and a ! at every SPI interrupt, followed by the content of the received message.

7.3. Linux test

This test boots a precompiled Linux 2.4 kernel. It is set up for 32MB of main memory, so the parameter MAIN_MSB must be set to 24 before launching this test. Once the kernel has booted it prints a ”hello world” message, as shown in Appendix B.

8. Result

The Verilog code along with this document serves as the final product of this thesis. The final SoC not only fulfills the specification given in Chapter 3, it is also written in a generic manner that should require minimal effort to synthesize on an arbitrary FPGA. This is important since there might be specific customer demands on hardware in any future product. The SoC is also very general in terms of peripheral functionality. Most sensors today use either SPI or I2C, and where something else is used the GPIO pins can most likely be utilized. They can also be used to interface with other FPGA components such as Digital Signal Processors, memories or other project-specific components.

When comparing performance, the finished SoC’s 0.78 DMIPS/MHz is not far from the ARM7 family’s maximum of 0.9 DMIPS/MHz[36]. The ARM7 architecture has for example been used in the LEGO Mindstorms NXT brick, which has been used in a wide variety of applications including several control theory projects at Uppsala University[37][38]. In these projects data from several sensors is collected and processed. Since the ARM7 in the NXT runs at 48MHz, this indicates that the a23 processor should be powerful enough to handle similar situations.

9. Conclusion

The goal of this thesis was to specify and implement an ARM-based SoC in an FPGA. In the end, no hardware was available for testing the final system. However, simulations show that the SoC executes both assembler and C code correctly and that the peripheral interfaces behave as expected. As a further verification the Verilog code was synthesized for a Spartan 6 LX45T. Timing reports from the synthesis tools indicate that the synthesis was successful and that the code is ready for hardware testing.

9.1. Specification

The ARM Cortex M1 processor core and two open source SoCs were compared, and one of the SoCs, called Amber, was selected due to its advantages in cost and licensing. A specification for the whole system was also set that included communication controllers able to interface with most sensors on today’s market.

9.2. Implementation

9.2.1. Target independence

The complete design was written in Verilog HDL, but several functions such as memories, an adder and the multiplication unit were instantiated as Xilinx-specific FPGA blocks. It is usually desirable to use these blocks, but not by forcing them through code instantiation. Instead it is common to write generic code that the synthesis tool translates into FPGA blocks when they are available. This makes it much easier to migrate the design to different FPGAs. To achieve this, the block instantiations were replaced by generic code that can be synthesized in an FPGA from any vendor with a tool chain that supports Verilog code. For the adder and the multiplication unit, good generic replacement code was already included in the project. The memories also had generic variants supplied, but they either did not use the RAM blocks at synthesis or, like the main memory, were not synthesizable at all due to their size. They were instead replaced by new code written to suit Xilinx RAM blocks but that will synthesize in any FPGA. The functionality of all the replaced code was verified with the included tests.

9.2.2. Peripheral integration

Three peripheral projects were integrated into the system to add SPI, I2C and GPIO functionality. The original Amber project was studied in order to integrate the peripherals while maintaining the project’s code structure as much as possible. All peripheral projects had a Wishbone bus interface, but in some cases they had to be modified slightly. Since an Ethernet controller was not desired, the Ethernet MAC included in the Amber project was removed. Besides not providing a desired function, it would have accounted for almost a third of the system’s utilized FPGA area, so removing it made routing the design much easier for the synthesis tool.

9.3. Documentation

Included in the Amber project were several scripts and makefiles that used the Xilinx ISE and CodeSourcery CodeBench Lite programs. The scripts and makefiles were studied, and their function is presented along with the installation procedure and work flow of the tools.

9.3.1. Peripheral controllers

The SPI, I2C and GPIO peripherals were only described briefly, and their documentation is referenced in their respective sections. The peripherals included in the Amber project came without any documentation. Their functionality was documented by analysing the code and running simulation tests. All registers were investigated and are presented along with the functions they control. The resulting information presented in this thesis should suffice for a developer who is using or modifying the system. The peripheral integration process was also documented to ease any future expansion of the system.

9.4. Testing

A suite of numerous tests written in assembler and C was included in the Amber project, along with a Verilog test bench for the whole system. The assembler tests were used to verify different hardware functions such as adding, UART transmit/receive, main memory access, cache access etc. They were used throughout the whole project to verify that the changes made did not interfere with the rest of the design. When a new peripheral was added, a new assembler test was written to verify its function. As it is impossible to check signal levels through code, the peripheral’s output and input pins were also checked in a waveform graph to verify their correctness. To further test the system and to provide a more complex sample program than ”Hello World!”, a program called ”spi-timer” was written that uses a timer to periodically send an SPI message. It shows the usage of the timer, interrupt and SPI controllers as well as the new test_func library.

9.5. Compiler optimization issue

An issue arose when compiling a C program that writes to and then reads from a register at the same address in the following manner:

• Writing 0x0000FEED to register 0x19000004

• Reading from register 0x19000004 and storing the result in a variable called ”readout”

This illustrates a write and a read to the GPIO controller’s LINE register. In this case the 16 upper bits are used as inputs and could have any value. The compiler optimized away the read and instead copied the written value (0x0000FEED) to readout. This was discovered by watching the gpio_wb_stb signal, which showed that the GPIO controller was never accessed for a read cycle. When reading from register 0x19000014 instead, the value is read correctly. This works because the controller only decodes the third address bit, so the two addresses are equivalent.

In order to solve this issue all compiler optimizations were turned off in the common makefile common.mk. With optimization off the read was executed correctly. After some testing it was clear that a combination of several optimizations produced the error, and no further troubleshooting was performed. As a result of this discovery, the GPIO controller got three entries in the register_addresses.h file: one for the CTRL register, one for reading the LINE register and one for writing to it.

10. Discussion

At the beginning of the project the work to be done was hard to specify due to the variance in the available alternatives. If the Cortex M1 had been chosen the focus would have been put on getting a structured code base and a working bus architecture. Or, if the Storm SoC had been chosen, all focus would have been put on testing and software writing to integrate the SoC in some existing application. Now, since the Amber SoC was chosen, the focus instead lay on adding peripherals, generalizing code and improving the documentation of the original Amber system.

10.1. Pros and cons with the Amber SoC

The Amber SoC is a tightly integrated system with scripts that were tailored to be used with the Xilinx tools in a Linux environment. That needed quite some study to understand. Luckily this was not a problem since Syntronic planned to test the final system on a Xilinx FPGA and the Linux environment is quite easy to set up. There were however other obstacles to be tackled coupled to the Amber SoC, two of them being:

• No earlier experience in Verilog, Bash, Assembler and Xilinx tools

• Lack of documentation of the system architecture, peripherals and scripts

None of these are uncommon in the engineering business, and they are not impossible to deal with. They do, however, take up quite some time. With that said, the Amber system also had several upsides, such as:

• Being a complete working SoC from the beginning

• Well structured code

• An extensive test framework

The structure of the code was easy to follow, and in many cases the code comments compensated for the lack of documentation. That eased the job of extending the documentation, even when the comments were not consistent with the code. In those cases the code was easy enough to follow to spot errors. Since the Amber SoC was already a fully working SoC, it was already a usable product. All implementation work could be focused on adapting it to suit Syntronic’s specific needs. Since there was an extensive test suite included in the project, every change could be tested directly. When there was no suitable test to execute, the existing ones could be used as templates. This way bugs could be found at an early stage, which shortened the total time spent on debugging.

10.2. Peripherals

There were a couple of factors that limited the available alternatives for peripheral controllers. There was no time to write new controllers, and generating them with the Xilinx core generator was not an alternative since they had to be vendor independent. The only alternative was to find open source IP cores that suited the specification. This, however, was not a problem since the OpenCores community has had the same need for a long time, and there were a couple of controllers to choose from for every function that was needed. Since time was a limiting factor, IP cores without a Wishbone interface were ruled out. This greatly limited the available choices, and as presented in Chapter 11 there was a trade-off in features in the GPIO and SPI controllers that might have been avoided otherwise. On the other hand, it might turn out that integrating the missing features in the chosen cores takes less time than creating a Wishbone interface for another core.

11. Future work

Since there was no hardware available, the next step is to verify the synthesis and to prove that the SoC actually works. All simulations seem to verify that it does, but this is not certain until a real hardware test is done. To further improve the system, the following options should be considered:

Test SPI flash memory: To ensure that a permanent storage alternative is available.

SPI modes: Adding support for SPI modes 2 and 3 would make the SPI controller even more useful, as it would widen the range of sensors it can interface with.

Central configuration file: Moving all parameters to one file would make it easier to set up the system. If the system is going to be used in many different projects, one would also benefit from a mechanism where the peripherals could be enabled/disabled from a file like this.

Improving the Wishbone bus: Adding support for the ”Advanced burst scheme” could improve the performance of the whole system.

Main memory content infusion: If one wants to load the main memory with data at synthesis, the ”readmemh” function should be added. This has not been done since there has been no need for it. Doing so would demand an adaptation of the Linux boot simulation.

UART baud rate defines: Creating individual defines for the UART baud rate would make the UART controllers a little more flexible than they are now.

Replace UART controllers: An alternative to the item above is to replace the UART controllers with a more flexible one.

GPIO interrupt: Adding interrupt support for the GPIO controller would greatly increase its usability.

JTAG: For larger software designs a JTAG controller could prove useful for debugging.

Which improvements and modifications to perform depends very much on the situation, but in a general sense I consider these three the most important:

• Test SPI flash memory

• Central configuration file

• GPIO interrupt

These will all enhance the flexibility of the system and widen its possible uses. The SPI flash memory would provide the system with permanent storage where one could store large programs, measurement data, logs etc. It would also provide a potential customer with an easy way to feed data and configurations to the system. Adding interrupt functionality to the GPIO controller would remove the need to poll the controller at certain intervals. This would increase the available CPU time as well as free up at least one timer. It would also make it possible to extend the system to receive external timer and clock signals. The central configuration file would mean an extensive change in the code, but it would shorten the implementation time of the system dramatically in those cases where the peripheral functionality needs to be customized. It would be very useful if one could disable unused peripherals and add several instances of others just by changing some defines. The FPGA footprint of the system would also be optimized by this, but there would most probably be no major size benefits since the core is by far the most area-demanding component. On the other hand, there could be some improvement in timing since the routing would be simpler. This could potentially increase the maximum frequency and thus the performance of the system.

12. Bibliography

[1] Xilinx. Zynq-7000 All Programmable SoC. 2013. url: http://www.xilinx.com/content/xilinx/en/products/silicon-devices/soc/zynq-7000.html (visited on 07/18/2013).
[2] Altera. SoC Overview. 2013. url: http://www.altera.com/devices/processor/soc-fpga/overview/proc-soc-fpga.html (visited on 07/18/2013).
[3] Microsemi. SoC FPGAs. 2013. url: http://www.microsemi.com/products/fpga-soc/soc-fpgas (visited on 11/12/2013).
[4] Altera. Nios II Processor: The World’s Most Versatile Embedded Processor. 2013. url: http://www.altera.com/devices/processor/nios2/ni2-index.html (visited on 07/18/2013).
[5] Xilinx. MicroBlaze Soft Processor Core. 2013. url: http://www.xilinx.com/tools/microblaze.htm (visited on 07/18/2013).
[6] Jan Andersson, Jiri Gaisler, and Roland Weigand. NEXT GENERATION MULTIPURPOSE MICROPROCESSOR. Article. 2010.
[7] Peter Clarke. European Space Agency launches free Sparc-like core. 2000. url: http://www.eetimes.com/document.asp?doc_id=1214267 (visited on 08/30/2013).
[8] Europe Space Agency. LEON’S FIRST FLIGHTS. 2013. url: http://www.esa.int/Our_Activities/Space_Engineering/LEON_s_first_flights (visited on 08/30/2013).
[9] Opencores community. Main Page. 2013. url: http://opencores.org/or1k/Main_Page (visited on 08/30/2013).
[10] Opencores community. OR1K:Community portal. 2013. url: http://opencores.org/or1k/OR1K:Community_Portal (visited on 08/30/2013).
[11] Altera. Nios Embedded Processor. 2013. url: http://www.altera.com/products/ip/processors/nios/nio-index.html (visited on 07/18/2013).
[12] David Sheldon et al. Conjoining Soft-Core FPGA Processors. Report. 2006.
[13] Roman Lysecky and Frank Vahid. A Study of the Speedups and Competitiveness of FPGA Soft Processor Cores using Dynamic Hardware/Software Partitioning. Report. 2005.
[14] David Sheldon et al. Application-Specific Customization of Parameterized FPGA Soft-Core Processors. Report. 2006.
[15] Peter Yiannacouras, Gregory Steffan, and Jonathan Rose. Exploration and Customization of FPGA-Based Soft Processors. Report. 2007.
[16] Conor Santifort. Amber ARM-compatible core :: Overview. 2013. url: http://opencores.com/project,amber (visited on 07/18/2013).
[17] ARM Ltd. Cortex-M1 Processor. 2013. url: http://www.arm.com/products/processors/cortex-m/cortex-m1.php (visited on 05/08/2013).
[18] Stephan Nolting. Storm Core (ARM7 compatible) :: Overview. 2012. url: http://opencores.org/project,storm_core (visited on 05/08/2013).
[19] Xilinx. Xcelljournal - Solutions for a programmable world. Table 3. 2008. url: http://www.nxtbook.com/nxtbooks/xilinx/xcell64/index.php?startid=58 (visited on 05/20/2013).
[20] OpenCores. Wishbone B4. 2010.
[21] Conor Santifort. Amber Project User Guide. 2013.
[22] Conor Santifort. Amber 2 Core Specification. 2013.
[23] Rudolf Usselmann. OpenCores SoC Bus Review. Report. 2001.
[24] Xilinx. XST User Guide for Virtex-6, Spartan-6, and 7 Series Devices. 2011.
[25] Richard Harveille. I2C-Master Core Specification. 2003.
[26] Richard Harveille. I2C controller core :: Overview. 2013. url: http://opencores.org/project,i2c (visited on 07/01/2013).
[27] Simon Srot. SPI Master Core Specification. 2004.
[28] Simons. SPI controller core :: Overview. 2013. url: http://opencores.org/project,spi (visited on 05/31/2013).
[29] Conor Santifort. uart.v. 2013.
[30] Richard Harveille. Simple General Purpose IO :: Overview. 2009. url: http://opencores.org/project,simple_gpio (visited on 07/01/2013).
[31] Xilinx. ISE WebPACK Design Software. 2013. url: http://www.xilinx.com/products/design-tools/ise-design-suite/ise-webpack.htm (visited on 03/27/2013).
[32] Xilinx. Command Line Tools User Guide. 2009.
[33] Xilinx. XST User Guide. 2009.
[34] Xilinx. Constraints Guide. 2009.
[35] Mentor Graphics. Sourcery CodeBench Lite Edition. 2013. url: http://www.mentor.com/embedded-software/sourcery-tools/sourcery-codebench/editions/lite-edition/ (visited on 05/23/2013).
[36] ARM Limited. Dhrystone and MIPs performance of ARM processors. 2011. url: http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.faqs/ka3885.html (visited on 09/25/2013).
[37] Egi Hidayat Uppsala University. Embedded Control Systems Project Groups. 2012. url: http://www.it.uu.se/edu/course/homepage/projektsystek/ht11/Nyheter/Grupper (visited on 09/25/2013).
[38] Egi Hidayat Uppsala University. Embedded Control Systems Project Groups. 2013. url: http://www.it.uu.se/edu/course/homepage/projektsystek/ht12/Nyheter/Grupper (visited on 09/25/2013).

A. I2C test output

No boot memory size specified. Defaulting to 8192 bytes Test i2c , type 1 Compile ../ tests/i2c.S Running: /opt/Xilinx/14.5/ISE DS/ISE/bin/lin64/unwrapped/fuse work.tb work. glbl −o amber−test . exe −prj amber−isim.prj −L unisims ver −d BOOT MEM FILE=” . . / t e s t s / i 2 c .mem” −d MAIN MEM FILE=”” −d AMBER LOG FILE= ”tests.log” −d AMBER TEST NAME=” i 2 c ” −d AMBER SIM CTRL=1 −d AMBER TIMEOUT=0 −incremental −i ../vlog/lib −i ../vlog/system −i −−/ vlog/system/spi −i −−/vlog/system/ram −i −−/vlog/system/i2c −i ../vlog/ amber23 −i ../vlog/tb ISim P.58f (signature 0xfbc00daa) Number of CPUs detected in this system: 2 Turning on mult−threading , number of parallel sub−compilation jobs : 4 Determining compilation order of HDL files Analyzing Verilog file ”/opt/Xilinx/14.5/ISE DS/ISE/./verilog/src/glbl .v” into library work Analyzing Verilog file ”../vlog/system/boot mem32.v” into library work Analyzing Verilog file ”../vlog/system/clocks resets.v” into library work Analyzing Verilog file ”../vlog/system/interrupt controller.v” into library work Analyzing Verilog file ”../vlog/system/system.v” into li brary work Analyzing Verilog file ”../vlog/system/test module.v” into library work Analyzing Verilog file ”../vlog/system/timer module.v” into library work Analyzing Verilog file ”../vlog/system/uart.v” into libr ary work Analyzing Verilog file ”../vlog/system/gpio.v” into libr ary work Analyzing Verilog file ”../vlog/system/wishbone arbiter.v” into library work Analyzing Verilog file ”../vlog/system/afifo.v” into lib rary work Analyzing Verilog file ”../vlog/system/main mem.v” into library work Analyzing Verilog file ”../vlog/system/spi/spi top.v” into library work Analyzing Verilog file ”../vlog/system/spi/spi clgen.v” into library work Analyzing Verilog file ”../vlog/system/spi/spi shift.v” into library work Analyzing Verilog file ”../vlog/system/ram/wb mem.v” into library work Analyzing Verilog file ”../vlog/system/ram/ram array.v” into library work Analyzing Verilog 
file ”../vlog/system/i2c/i2c master bit ctrl.v” into library work Analyzing Verilog file ”../vlog/system/i2c/i2c master byte ctrl.v” into library work Analyzing Verilog file ”../vlog/system/i2c/i2c master top.v” into library work Analyzing Verilog file ”../vlog/amber23/a23 alu.v” into library work Analyzing Verilog file ”../vlog/amber23/a23 barrel shift.v” into library work Analyzing Verilog file ”../vlog/amber23/a23 barrel shift fpga.v” into library work

I Analyzing Verilog file ”../vlog/amber23/a23 cache.v” into library work Analyzing Verilog file ”../vlog/amber23/a23 coprocessor.v” into library work Analyzing Verilog file ”../vlog/amber23/a23 core.v” into library work Analyzing Verilog file ”../vlog/amber23/a23 decode.v” into library work Analyzing Verilog file ”../vlog/amber23/a23 decompile.v” into library work Analyzing Verilog file ”../vlog/amber23/a23 execute.v” into library work Analyzing Verilog file ”../vlog/amber23/a23 fetch.v” into library work Analyzing Verilog file ”../vlog/amber23/a23 multiply.v” into library work Analyzing Verilog file ”../vlog/amber23/a23 register bank.v” into library work Analyzing Verilog file ”../vlog/amber23/a23 ram register bank.v” into library work Analyzing Verilog file ”../vlog/amber23/a23 wishbone.v” into library work Analyzing Verilog file ”../vlog/lib/generic iobuf.v” into library work Analyzing Verilog file ”../vlog/lib/boot ram byte en.v” into library work Analyzing Verilog file ”../vlog/lib/gen ram byte en.v” into library work Analyzing Verilog file ”../vlog/lib/gen ram line en.v” into library work Analyzing Verilog file ”../vlog/tb/i2c slave model.v” into library work Analyzing Verilog file ”../vlog/tb/spi slave model.v” into library work Analyzing Verilog file ”../vlog/tb/tb uart.v” into library work Analyzing Verilog file ”../vlog/tb/tb gpio.v” into library work Analyzing Verilog file ”../vlog/tb/dumpvcd.v” into libra ry work Analyzing Verilog file ”../vlog/tb/tb.v” into library work Starting static elaboration Completed static elaboration Fuse Memory Usage: 102528 KB Fuse CPU Usage: 1760 ms Compiling module IBUFGDS(DIFF TERM=”TRUE” ,IOSTAND . . . Compiling module PLL ADV(CLKFBOUT MULT=4,CLKIN1 P... Compiling module BUFG Compiling module clocks resets Compiling module gen ram line e n (DATA WIDTH=32’b0 . . . Compiling module gen ram byte e n (DATA WIDTH=128,A . . . 
Compiling module a23 cache default Compiling module a23 wishbone Compiling module a23 fetch default Compiling module a23 decompile 2 Compiling module a23 decode Compiling module a23 barrel shift Compiling module a23 alu Compiling module a23 multiply Compiling module a23 register bank Compiling module a23 execute Compiling module a23 coprocessor Compiling module a23 core Compiling module i2c master bit ctrl Compiling module i2c master byte ctrl Compiling module i2c master top Compiling module boot ram byte e n (DATA WIDTH=32,A . . . Compiling module boot mem32 default Compiling module uart(WB DWIDTH=32,WB SWIDTH=4) Compiling module test module (WB DWIDTH=32,WB SWID . . . Compiling module timer module(WB DWIDTH=32,WB SWI . . .

II Compiling module interrupt controller(WB DWIDTH= 3 . . . Compiling module spi clgen Compiling module spi shift Compiling module spi top Compiling module gpio Compiling module ram array default Compiling module wb mem default Compiling module wishbone arbiter (WB DWIDTH=32,WB. . . Compiling module system Compiling module tb uart default Compiling module spi slave model Compiling module i2c slave model Compiling module tb gpio Compiling module dumpvcd Compiling module tb Compiling module glbl Time Resolution for simulation is 1 ps . Waiting for 1 sub−compilation(s) to finish ... Compiled 42 Verilog Units Built simulation executable amber−test . exe Fuse Memory Usage: 414448 KB Fuse CPU Usage: 2930 ms GCC CPU Usage: 270 ms ISim P.58f (signature 0xfbc00daa) WARNING: A WEBPACK l i c e n s e was found . WARNING: Please use Xilinx License Configuration Manager to check out a full ISim license. WARNING: ISim will run in Lite mode. Please refer to the ISim documentation for more information on the differences between the Lite and the Full version . This is a Lite version of ISim. Time resolution is 1 ps Simulator is doing circuit initialization process. log file tests.log, timeout 0, test name i2c Finished circuit initialization process.

−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−

                              Amber Core
           User        FIRQ        IRQ         > SVC
r0         0x20000010
r1         0x000000c1
r2         0x00000002
r3         0x00000080
r4         0xdeadbeef
r5         0x000000a5
r6         0x0000005a
r7         0xdeadbeef
r8         0xdeadbeef  0xdeadbeef
r9         0xdeadbeef  0xdeadbeef
r10        0x00000011  0xdeadbeef
r11        0xf0000000  0xdeadbeef
r12        0xdeadbeef  0xdeadbeef
r13        0xdeadbeef  0xdeadbeef  0xdeadbeef  0xdeadbeef
r14 (lr)   0xdeadbeef  0xdeadbeef  0xdeadbeef  0xdeadbeef
r15 (pc)   0x00000268

Status Bits: N=0, Z=0, C=1, V=0, IRQ Mask 1, FIRQ Mask 1, Mode = Supervisor
−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−

++++++++++++++++++++
Passed i2c 12391 ticks
++++++++++++++++++++
Stopped at time : 309887500 ps : File ”/home/emanuel/workspace/amber SoC/trunk/hw/vlog/tb/tb.v” Line 386

B. Linux test output

No boot memory size specified. Defaulting to 8192 bytes
Test vmlinux, type 3
make -s -C ../mini-libc MIN_SIZE=1
arm-none-linux-gnueabi-ld -Bstatic -Map boot-loader-serial.map --strip-debug --fix-v4bx -o boot-loader-serial.elf -T sections.lds boot-loader-serial.o start.o crc16.o xmodem.o elfsplitter.o ../mini-libc/printf.o ../mini-libc/libc_asm.o ../mini-libc/memcpy.o
arm-none-linux-gnueabi-objcopy -R .comment -R .note boot-loader-serial.elf
../tools/amber-elfsplitter boot-loader-serial.elf > boot-loader-serial.mem
../tools/amber-memparams32.sh boot-loader-serial.mem boot-loader-serial_memparams32.v
../tools/amber-memparams128.sh boot-loader-serial.mem boot-loader-serial_memparams128.v
arm-none-linux-gnueabi-objdump -C -S -EL boot-loader-serial.elf > boot-loader-serial.dis
../tools/check_mem_size.sh boot-loader-serial.mem "@000020"
Running: /opt/Xilinx/14.5/ISE_DS/ISE/bin/lin64/unwrapped/fuse tb -o amber-test.exe -prj amber-isim.prj -d BOOT_MEM_FILE="../../sw/boot-loader-serial/boot-loader-serial.mem" -d MAIN_MEM_FILE="../../sw/vmlinux/vmlinux.mem" -d AMBER_LOG_FILE="tests.log" -d AMBER_TEST_NAME="vmlinux" -d AMBER_SIM_CTRL=3 -d AMBER_TIMEOUT=0 -d AMBER_LOAD_MAIN_MEM -incremental -i ../vlog/lib -i ../vlog/system -i --/vlog/system/spi -i --/vlog/system/ram -i --/vlog/system/i2c -i ../vlog/amber23 -i ../vlog/tb
ISim P.58f (signature 0xfbc00daa)
Number of CPUs detected in this system: 4
Turning on mult-threading, number of parallel sub-compilation jobs: 8
Determining compilation order of HDL files
Analyzing Verilog file "../vlog/system/boot_mem32.v" into library work
Analyzing Verilog file "../vlog/system/clocks_resets.v" into library work
Analyzing Verilog file "../vlog/system/interrupt_controller.v" into library work
Analyzing Verilog file "../vlog/system/system.v" into library work
Analyzing Verilog file "../vlog/system/test_module.v" into library work
Analyzing Verilog file "../vlog/system/timer_module.v" into library work
Analyzing Verilog file "../vlog/system/uart.v" into library work
Analyzing Verilog file "../vlog/system/gpio.v" into library work
Analyzing Verilog file "../vlog/system/wishbone_arbiter.v" into library work
Analyzing Verilog file "../vlog/system/afifo.v" into library work
Analyzing Verilog file "../vlog/system/main_mem.v" into library work
Analyzing Verilog file "../vlog/system/spi/spi_top.v" into library work
Analyzing Verilog file "../vlog/system/spi/spi_clgen.v" into library work
Analyzing Verilog file "../vlog/system/spi/spi_shift.v" into library work
Analyzing Verilog file "../vlog/system/ram/wb_mem.v" into library work
Analyzing Verilog file "../vlog/system/ram/ram_array.v" into library work
Analyzing Verilog file "../vlog/system/i2c/i2c_master_bit_ctrl.v" into library work
Analyzing Verilog file "../vlog/system/i2c/i2c_master_byte_ctrl.v" into library work
Analyzing Verilog file "../vlog/system/i2c/i2c_master_top.v" into library work
Analyzing Verilog file "../vlog/amber23/a23_alu.v" into library work
Analyzing Verilog file "../vlog/amber23/a23_barrel_shift.v" into library work
Analyzing Verilog file "../vlog/amber23/a23_barrel_shift_fpga.v" into library work
Analyzing Verilog file "../vlog/amber23/a23_cache.v" into library work
Analyzing Verilog file "../vlog/amber23/a23_coprocessor.v" into library work
Analyzing Verilog file "../vlog/amber23/a23_core.v" into library work
Analyzing Verilog file "../vlog/amber23/a23_decode.v" into library work
Analyzing Verilog file "../vlog/amber23/a23_decompile.v" into library work
Analyzing Verilog file "../vlog/amber23/a23_execute.v" into library work
Analyzing Verilog file "../vlog/amber23/a23_fetch.v" into library work
Analyzing Verilog file "../vlog/amber23/a23_multiply.v" into library work
Analyzing Verilog file "../vlog/amber23/a23_register_bank.v" into library work
Analyzing Verilog file "../vlog/amber23/a23_ram_register_bank.v" into library work
Analyzing Verilog file "../vlog/amber23/a23_wishbone.v" into library work
Analyzing Verilog file "../vlog/lib/generic_iobuf.v" into library work
Analyzing Verilog file "../vlog/lib/boot_ram_byte_en.v" into library work
Analyzing Verilog file "../vlog/lib/gen_ram_byte_en.v" into library work
Analyzing Verilog file "../vlog/lib/gen_ram_line_en.v" into library work
Analyzing Verilog file "../vlog/tb/i2c_slave_model.v" into library work
Analyzing Verilog file "../vlog/tb/spi_slave_model.v" into library work
Analyzing Verilog file "../vlog/tb/tb_uart.v" into library work
Analyzing Verilog file "../vlog/tb/tb_gpio.v" into library work
Analyzing Verilog file "../vlog/tb/dumpvcd.v" into library work
Analyzing Verilog file "../vlog/tb/tb.v" into library work
Starting static elaboration
Completed static elaboration
Fuse Memory Usage: 100400 KB
Fuse CPU Usage: 1290 ms
Compiling module clocks_resets
Compiling module gen_ram_line_en(DATA_WIDTH=32'b0...
Compiling module gen_ram_byte_en(DATA_WIDTH=128,A...
Compiling module a23_cache_default
Compiling module a23_wishbone
Compiling module a23_fetch_default
Compiling module a23_decompile_2
Compiling module a23_decode
Compiling module a23_barrel_shift
Compiling module a23_alu
Compiling module a23_multiply
Compiling module a23_register_bank
Compiling module a23_execute
Compiling module a23_coprocessor
Compiling module a23_core
Compiling module i2c_master_bit_ctrl
Compiling module i2c_master_byte_ctrl
Compiling module i2c_master_top
Compiling module boot_ram_byte_en(DATA_WIDTH=32,A...
Compiling module boot_mem32_default
Compiling module uart(WB_DWIDTH=32,WB_SWIDTH=4)
Compiling module test_module(WB_DWIDTH=32,WB_SWID...
Compiling module timer_module(WB_DWIDTH=32,WB_SWI...
Compiling module interrupt_controller(WB_DWIDTH=3...
Compiling module spi_clgen
Compiling module spi_shift
Compiling module spi_top
Compiling module gpio
Compiling module ram_array_default
Compiling module wb_mem_default
Compiling module wishbone_arbiter(WB_DWIDTH=32,WB...
Compiling module system
Compiling module tb_uart_default
Compiling module spi_slave_model
Compiling module i2c_slave_model
Compiling module tb_gpio
Compiling module dumpvcd
Compiling module tb
Time Resolution for simulation is 1 ps.
Waiting for 22 sub-compilation(s) to finish...
Compiled 38 Verilog Units
Built simulation executable amber-test.exe
Fuse Memory Usage: 671904 KB
Fuse CPU Usage: 2670 ms
GCC CPU Usage: 27950 ms
ISim P.58f (signature 0xfbc00daa)
WARNING: A WEBPACK license was found.
WARNING: Please use Xilinx License Configuration Manager to check out a full ISim license.
WARNING: ISim will run in Lite mode. Please refer to the ISim documentation for more information on the differences between the Lite and the Full version.
This is a Lite version of ISim.
Time resolution is 1 ps
Simulator is doing circuit initialization process.
log file tests.log, timeout 0, test name vmlinux
Load main memory from ../../sw/vmlinux/vmlinux.mem
Finished circuit initialization process.
Amber Boot Loader v20130822150540
j 0x00080000

Linux version 2.4.27-vrs1 (conor@server) (gcc version 4.5.1 (Sourcery G++ Lite 2010.09-58)) #446 Mon Dec 21 14:04:42 GMT 2009
CPU: Amber 2 revision 0
Machine: Amber-FPGA-System
On node 0 totalpages: 1024
zone(0): 1024 pages.
zone(1): 0 pages.
zone(2): 0 pages.
Kernel command line: console=ttyAM0 mem=32M root=/dev/ram
19.91 BogoMIPS (preset value used)
Memory: 32MB = 32MB total
Memory: 31136KB available (496K code, 195K data, 32K init)
Dentry cache hash table entries: 4096 (order: 0, 32768 bytes)
Inode cache hash table entries: 4096 (order: 0, 32768 bytes)
Mount cache hash table entries: 4096 (order: 0, 32768 bytes)
Buffer cache hash table entries: 8192 (order: 0, 32768 bytes)
Page-cache hash table entries: 8192 (order: 0, 32768 bytes)
POSIX conformance testing by UNIFIX
Linux NET4.0 for Linux 2.4
Based upon Swansea University Computer Society NET3.039
Starting kswapd
ttyAM0 at MMIO 0x16000000 (irq = 1) is a WSBN
pty: 256 Unix98 ptys configured
RAMDISK driver initialized: 16 RAM disks of 208K size 1024 blocksize
NetWinder Floating Point Emulator V0.97 (double precision)
RAMDISK: ext2 filesystem found at block 8388608
RAMDISK: Loading 200 blocks [1 disk] into ram disk... done.
Freeing initrd memory: 200K
VFS: Mounted root (ext2 filesystem) readonly.
Freeing init memory: 32K
BINFMT_FLAT: Loading file: /sbin/init
Mapping is 8b0000, Entry point is 8068, data_start is 8dd0
Load /sbin/init: TEXT=8b0040-8b8dd0 DATA=8b8dd4-8b8dff BSS=8b8dff-8b8e04
Hello, World!

----------------------------------------------------------------------------

Amber Core
           > User       FIRQ         IRQ          SVC
r0         0x00000010
r1         0x008b8de4
r2         0x00000000
r3         0x00000000
r4         0x00000000
r5         0x00000000
r6         0x00000000
r7         0x00000000
r8         0x00000000   0xdeadbeef
r9         0x00000000   0xdeadbeef
r10        0x00000011   0xdeadbeef
r11        0xf0000000   0xdeadbeef
r12        0x00000000   0xdeadbeef
r13        0x008affb4   0xdeadbeef   0x0210cc24   0x02161fe8
r14 (lr)   0x00000000   0xdeadbeef   0x6209620b   0x008b8068
r15 (pc)   0x008b84bc

Status Bits: N=0, Z=1, C=1, V=0, IRQ Mask 0, FIRQ Mask 0, Mode = User
----------------------------------------------------------------------------

++++++++++++++++++++

Passed vmlinux 12120135 ticks
++++++++++++++++++++
Stopped at time : 303003852500 ps : File "/home/emanuel/workspace/amber_SoC/trunk/hw/vlog/tb/tb.v" Line 395
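
As a rough cross-check of the two figures that close the log, the reported stop time (303003852500 ps) can be divided by the reported tick count (12120135). The sketch below is illustrative only: it assumes that one tick corresponds to one system clock cycle in the test bench, which the log itself does not state, and the variable names are chosen here for illustration.

```python
# Figures copied from the end of the simulation log.
ticks = 12_120_135              # "Passed vmlinux 12120135 ticks"
stop_time_ps = 303_003_852_500  # "Stopped at time : 303003852500 ps"

# Assumption (not stated in the log): one tick is one system clock cycle.
period_ps = stop_time_ps / ticks

# 1e12 ps per second divided by the period gives Hz; scaled here to MHz.
freq_mhz = 1e6 / period_ps

print(f"implied period: {period_ps:.2f} ps per tick")
print(f"implied clock:  {freq_mhz:.2f} MHz")
```

Under that assumption the quotient is close to 25 ns per tick, which would correspond to a simulated clock of about 40 MHz. The small remainder may be simulation time that elapsed outside the counted interval, but the log alone does not confirm this.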
