Design and VHDL Implementation of an Application-Specific Instruction Set Processor

Lauri Isola

School of Electrical Engineering

Thesis submitted for examination for the degree of Master of Science in Technology. Espoo 19.12.2019

Supervisor

Prof. Jussi Ryynänen

Advisor

D.Sc. (Tech.) Marko Kosunen

Copyright © 2019 Lauri Isola

Aalto University, P.O. BOX 11000, 00076 AALTO
www.aalto.fi
Abstract of the master’s thesis

Author: Lauri Isola
Title: Design and VHDL Implementation of an Application-Specific Instruction Set Processor
Degree programme: Computer, Communication and Information Sciences
Major: Signal, Speech and Language Processing
Code of major: ELEC3031
Supervisor: Prof. Jussi Ryynänen
Advisor: D.Sc. (Tech.) Marko Kosunen
Date: 19.12.2019
Number of pages: 66+45
Language: English

Abstract

Open source processors are becoming more popular. They are a cost-effective option in hardware designs, because using the processor does not require an expensive license. However, the number of available open source processors is still limited. This is especially true for Application-Specific Instruction Set Processors (ASIPs).

In this work, an ASIP was designed and implemented in the VHDL hardware description language. The design goals were to make the processor easily customizable and to keep its resource consumption low in a System-on-Chip (SoC) design. Finally, the processor was implemented on an FPGA circuit, where it was tested with a specially designed VGA graphics controller.

Necessary software tools, such as an assembler, were also implemented for the processor. The assembler was used to write comprehensive test programs for testing and verifying the functionality of the processor.

This work also examined some future upgrades of the designed processor. The upgrades include improvements to hardware, software tools and usability.

The source codes for the processor, graphics controller and test programs are published under the MIT license, and are available at: http://www.iki.fi/lauri.isola/asip38.

Keywords: ASIP, CPU, RTL, HDL, FPGA, SoC, programmable logic, embedded systems

Aalto University, P.O. BOX 11000, 00076 AALTO
www.aalto.fi
Abstract of the master’s thesis (in Finnish)

Author: Lauri Isola
Title: Sovelluskohtaisen käskykantaprosessorin suunnittelu ja toteutus VHDL:llä (Design and VHDL Implementation of an Application-Specific Instruction Set Processor)
Degree programme: Computer, Communication and Information Sciences
Major: Signal, Speech and Language Processing
Code of major: ELEC3031
Supervisor: Prof. Jussi Ryynänen
Advisor: D.Sc. (Tech.) Marko Kosunen
Date: 19.12.2019
Number of pages: 66+45
Language: English

Abstract (translated from Finnish)

Open source processors are becoming more common. From a hardware design perspective they are a cost-effective option, because using the processor does not require an expensive license. However, the number of open source processors on offer is still quite limited. This applies especially to application-specific instruction set processors (ASIPs).

In this work, an application-specific instruction set processor was designed and implemented in the VHDL hardware description language. The starting points of the design were effortless modifiability of the processor and implementation on a System-on-Chip (SoC) with as few resources as possible. The processor was implemented on an FPGA, where it was tested with a separately designed VGA graphics controller.

The necessary software tools, such as an assembler, were also implemented for the processor. The assembler was used to write comprehensive test programs, which were used to test and verify the operation of the processor.

This work also looked into further development of the processor. Development ideas related to hardware, software tools and usability were examined.

The source codes of the processor, graphics controller and test programs are published under the MIT license and are available at: http://www.iki.fi/lauri.isola/asip38.

Keywords: ASIP, CPU, RTL, HDL, FPGA, SoC, programmable circuits, embedded systems

Preface

I have done a number of electronics projects using a variety of microcontrollers. Although microcontrollers have generally served their job well, they have sometimes lacked features that would have been useful in those projects. The idea of making my own ASIP started to seriously interest me when small FPGAs became affordable enough to be used in enthusiast-level projects. When using an ASIP, the instruction set of the processor, as well as other features, can be designed specifically for the project. This brings the projects to a whole new level. This thesis documents the results of my ASIP project.

I would like to thank Aalto University especially for and Digital Microelectronics I & II courses, which contained the basic knowledge of the topic. Thanks also to Nokia Networks for the interest in my ASIP processor. Finally, I would like to thank my family for supporting me during my studies.

Espoo, 31.8.2019

Lauri Isola

Contents

Abstract
Abstract (in Finnish)
Preface
Contents
Abbreviations

1 Introduction
    1.1 Thesis goals
    1.2 Thesis organization

2 Embedded processor technology
    2.1 Single-purpose processors
    2.2 General-purpose processors
    2.3 Application-specific processors

3 Processor design
    3.1 Design objectives
    3.2 Memory architecture
    3.3 Datapath design
        3.3.1 Program counter
        3.3.2 Program stack
        3.3.3 Registers
        3.3.4 Data bus
        3.3.5 Arithmetic logic unit
        3.3.6 Datapath organization
    3.4 Instruction set architecture
        3.4.1 Overview
        3.4.2 Instruction word
        3.4.3 Memory operations
        3.4.4 Accumulator operations
        3.4.5 Input/Output operations
        3.4.6 Control operations
    3.5
        3.5.1 Single-cycle approach
        3.5.2 Multi-cycle approach
        3.5.3 Pipelined approach
    3.6
        3.6.1 Structure
        3.6.2 Control signals
        3.6.3 PC control

4 VHDL implementation
    4.1 FPGA design flow
    4.2 Artix-7 FPGA resources
    4.3 Top level system
    4.4 Required VHDL packages
    4.5 ASIP38
        4.5.1 Block RAM
        4.5.2
        4.5.3 Program stack
        4.5.4 ALU
        4.5.5 Instruction decode and control
    4.6 Graphics controller
    4.7 Inputs
    4.8 Top level entity
    4.9 Behavioral simulation
    4.10 Synthesis
    4.11 Implementation

5 Verification and testing
    5.1 Hardware verification
    5.2 Assembler
    5.3 Test programs
    5.4 Board-level testing

6 Analysis of results
    6.1 Results of the design
    6.2 Processor comparison

7 Future upgrades
    7.1 Additional hardware
    7.2 Software tools
    7.3 Bus protocols

8 Conclusion

References

A Source codes
    A.1 asip38.vhd
    A.2 display_control.vhd
    A.3 vga_sync.vhd
    A.4 rgb_gen.vhd
    A.5 line_draw.vhd
    A.6 ellipse_draw.vhd
    A.7 area_paint.vhd
    A.8 input.vhd
    A.9 top.vhd
    A.10 assembler.py

Abbreviations

3D      Three-dimensional
AC      Accumulator
ALU     Arithmetic Logic Unit
ASIC    Application-Specific Integrated Circuit
ASIP    Application-Specific Instruction Set Processor
BRAM    Block RAM
CISC    Complex Instruction Set Computer
CPI     Cycles Per Instruction
CPU     Central Processing Unit
DSP     Digital Signal Processing
FF      Flip-Flop
FPGA    Field-Programmable Gate Array
FSM     Finite State Machine
HDL     Hardware Description Language
HLL     High-Level Language
I/O     Input/Output
IC      Integrated Circuit
IR      Instruction Register
ISA     Instruction Set Architecture
ISR     Interrupt Service Routine
IP      Intellectual Property
LUT     Look-up Table
NRE     Non-Recurring Engineering
PC      Program Counter
PLD     Programmable Logic Device
RAM     Random Access Memory
ROM     Read-Only Memory
RISC    Reduced Instruction Set Computer
RTL     Register Transfer Level
RTOS    Real-Time Operating System
SoC     System-on-Chip
SP      Stack Pointer
TOS     Top Of Stack
uC      Microcontroller
uP      Microprocessor
VHDL    VHSIC Hardware Description Language
VHSIC   Very High Speed Integrated Circuit
VGA     Video Graphics Array
WNS     Worst Negative Slack

1 Introduction

A soft processor core is usually described with a Hardware Description Language (HDL), which is converted into an optimized gate-level representation using logic synthesis [1]. The synthesis result can then implement the functionality of the processor inside a Field-Programmable Gate Array (FPGA), or some other Programmable Logic Device (PLD) [2].

Recent development has been towards soft processors which are completely open source. For example, processor architectures such as RISC-V are getting into large-scale production due to their open source license [3]. Even the previously closed MIPS architecture has now been released without any license fees or royalties [4].

At the same time, the demand for integrated circuits is increasing due to the growing market of Internet of Things (IoT) devices. According to data from IoT Analytics, about 8.3 billion internet-connected devices exist in 2019 [5]. In the future, IoT devices are expected to be everywhere and to touch every aspect of life. The number of IoT devices is estimated to reach 41.6 billion by 2025 [6]. This ongoing development presumably increases the demand for different types of open source processors.

With the growing IoT market, the focus is on small, portable and self-powered devices [7,8]. This brings the challenge of making computations more energy efficient. Instead of using a general-purpose processor, a processor can be customized for the specific application to achieve the best combination of performance, power and size. This type of processor is called an Application-Specific Instruction Set Processor (ASIP) [2].

In computing, hardware acceleration is used to perform functions more efficiently than software running on a general-purpose processor. An ASIP often provides the much needed flexibility between hardware accelerators and general-purpose processors. [2] For example, an ASIP can be used as an alternative to hardware accelerators, or it can serve as a main processor for a number of accelerators [9, 10].

The list of open source ASIPs is still relatively short. Open source websites such as OpenCores [11] or GitHub [12] do not seem to have a processor core with the characteristics of a small and easily customizable ASIP. However, this type of processor could be useful for providing some flexibility for system designers.

1.1 Thesis goals

The aim of this thesis is to study the core knowledge and skills required to implement an ASIP. Furthermore, the goal is to design and implement an open source ASIP which is small and easily customizable. For the processor to be considered successful, its whole datapath and instruction set must be easily modifiable. In this way, the processor can be customized for many different applications.

Because an FPGA is reprogrammable, it is the primary target device for the ASIP, but other platforms are also considered during the design process. To make the processor compatible with FPGAs, the processor’s instruction cycle must support different memory configurations inside an FPGA. This is an important requirement, as the processor cannot be tied to a single type of memory.

The implementation size of the processor should be minimized in order to save logic resources, and to reduce the complexity and the power consumption of the design. The processor architecture must also be optimized for the best possible performance. Finally, the necessary tools for software development need to be implemented. Below is a summary of the general design goals which need to be met.

• Customizable. The structure of the processor needs to be easily modifiable.

• Compatible. The processor must support different memory configurations inside an FPGA.

• Small implementation size. Resource usage must be low.

• High performance. The processor must be optimized for performance.

• Usable. Necessary software development tools must be implemented.

For testing the ASIP in an FPGA, a complete prototype of an ASIP controlled system is implemented. Therefore, some extra hardware needs to be implemented for testing purposes. A VGA graphics controller is a good choice for testing hardware, as it displays the outputs of the ASIP in real-time on a VGA monitor. For connecting with the graphics controller, the ASIP requires some application-specific instructions which are added to its instruction set during the design phase.

1.2 Thesis organization

The thesis is organized as follows. Section 2 introduces different processor types which are used in embedded systems. Section 3 focuses on the processor design, which provides a basis for the rest of the thesis. Section 4 presents the VHDL implementation including an FPGA design flow. Here the designed processor is also synthesized for board-level testing. Section 5 shows experiments with test software. Section 6 analyzes the results of the design phase and compares the ASIP with other open source processors. Section 7 presents the future upgrades of the designed processor. Finally, Section 8 wraps up the whole work and concludes the thesis.

2 Embedded processor technology

A Central Processing Unit (CPU) is the key component of most embedded systems. Embedded processors can be broken into two main categories. [13] A microprocessor (uP) uses separate Integrated Circuits (ICs) for the CPU, memory and peripherals, whereas a microcontroller (uC) has them in a single chip. A microcontroller can ease the hardware architecture design, especially when it contains all the required peripherals. [14]

Processors can also be categorized by how they are physically implemented. A soft processor is implemented on a general-purpose logic device such as an FPGA, while a hard processor is fabricated directly in the silicon of an IC [15]. Soft processors are typically licensed and distributed as Intellectual Property (IP) cores. In this way, they can be easily used as a part of a broader system. When this type of system is placed on a single IC, it is called a System-on-Chip (SoC). [15]

Processors can also be categorized by their datapath, and by whether their software is programmable or non-programmable. This section describes processor designs which are commonly used in embedded systems. It provides background information on different processor types, including an ASIP.

2.1 Single-purpose processors

A single-purpose processor is designed to execute only one program. It is simply a custom digital circuit used for one purpose. Another name for this type of circuit is an accelerator, or just "hardware". The program is hardwired directly into the control logic and cannot be changed. [2] Figure 2.1 illustrates the architecture of a single-purpose processor.

Accelerators can execute computational tasks more efficiently than general-purpose processors. A good example of such a task is a video codec, which uses an algorithm to compress and decompress the frames of a video. For a specific task, an accelerator has better performance, smaller size and lower power consumption compared to a general-purpose processor. However, its design time and Non-Recurring Engineering (NRE) costs are higher for small quantities, while the flexibility remains low. [2]

Figure 2.1: Single-purpose processor. Adapted from [2].

2.2 General-purpose processors

A general-purpose processor is designed for a variety of applications to maximize the number of sold devices. It executes user-written programs that are stored in the program memory. A Program Counter (PC) is used to point to the program memory location that stores the current or future instruction. An Instruction Register (IR) is then used to hold the instruction which is being executed. This type of processor presents a software approach for solving computational tasks. Because those tasks are not predetermined, the datapath of the processor needs a large enough register file, and a general-purpose Arithmetic Logic Unit (ALU). [2] The structure of a general-purpose processor is presented in Figure 2.2a.


Figure 2.2: The functionality of different processor types: (a) general-purpose, (b) application-specific. Adapted from [2].

General-purpose processors can be used whenever they can meet the requirements of the application. However, a general-purpose processor also has its drawbacks. For example, the performance may not be sufficient for some applications, or the datapath of the processor cannot process certain tasks without separate hardware. The processor may also be too large or consume too much power, making it unsuitable for certain applications. [2] In those cases, additional time is needed for designing an application-specific processor.

2.3 Application-specific processors

An Application-Specific Instruction Set Processor (ASIP) can be a compromise between general-purpose and single-purpose processors. This type of processor provides a good combination of performance, power and size. These benefits can be achieved by using a custom instruction set and datapath, which are optimized for the desired application. [2] Application-specific instructions are used to increase efficiency by replacing common code sequences [10]. The custom datapath includes only the registers and the ALU operations which are needed by the application [2]. Figure 2.2b illustrates the architecture of an application-specific processor.

While this type of processor offers flexibility, it has a longer design time, which increases the NRE costs. The software development can be slower, as the unique instruction set often prevents the use of high-level language compilers. [2] This forces software developers to write programs in assembly language, which is usually slower and more difficult.

A common example of an ASIP is a Digital Signal Processor (DSP). It can handle signal processing tasks efficiently with a custom instruction set. Special instructions may combine multiple arithmetic operations to form one complex instruction, which performs more efficiently. [2]

3 Processor design

3.1 Design objectives

The general design goals for the processor are listed in Section 1.1. They are used to define the characteristics which the processor should have. However, an application-specific processor is designed for a special application or task. Therefore, the intended task defines the instruction set of the processor.

For testing purposes, the task for the ASIP is to serve as a main processor for a VGA graphics controller. Together they form a design that can be tested on an FPGA development board. The graphics controller uses hardware accelerators to draw lines, circles or ellipses, and to paint certain areas. A line can be drawn by knowing the x and y coordinates of its start and end points. To create three-dimensional (3D) objects, the lines need to be calculated in three-dimensional space, which involves computations of trigonometric functions. To calculate those functions accurately enough, the processor must be able to do addition, subtraction and multiplication with 32-bit signed fixed-point numbers. Additionally, the multiplication has to include a built-in result selection for selecting the correct result from the output of the multiplier.

The graphics controller is operated directly with control signals from the processor. To generate the control signals, some special instructions are designed for the instruction set of the ASIP. In this way, the graphics controller serves as application-specific testing hardware for the ASIP. The implementation of the graphics controller is examined in Section 4.6.

3.2 Memory architecture

The design process starts by defining the memory architecture of the processor. We know that an ASIP needs memories for both program and data. The memory architecture of a computer system can be solved in two basic ways. In von Neumann architecture, both the program and data are placed in the same memory space. In Harvard architecture, they are located in separate memories. A modified Harvard architecture is a common solution in modern microprocessor designs. It contains aspects of both Harvard and von Neumann architectures. [14]

Since the goal is to design a soft ASIP processor, it is justifiable to choose the Harvard architecture. In this way, the processor does not have to share one address space or arbitrate bus access between the program and data memories. Options for the memory architecture are shown in Figure 3.1.

Figure 3.1: Two memory architectures: (a) Harvard, (b) von Neumann.

The program memory needs to be a Read-Only Memory (ROM), as it is non-volatile. The ASIP has a requirement of compatibility, thus it must operate with different types of memory configurations. Therefore, the program memory can be either synchronous or asynchronous. The data memory is a normal Random Access Memory (RAM). Unlike the program memory, the RAM is volatile. To save the contents of the RAM, an additional non-volatile memory such as an Electrically Erasable Programmable ROM (EEPROM) could be used to store important data while the processor is powered off.

3.3 Datapath design

3.3.1 Program counter

A generic computer datapath includes a program counter (PC), which is a register pointing to the program memory [2]. The program counter holds the address of the current instruction. The next instruction can be fetched by incrementing the PC by one. Similarly, control operations can be implemented by manipulating the contents of the PC. The program counter is directly connected to the address lines of the program memory. Therefore, the bit width of the program counter defines the size of the program memory.

3.3.2 Program stack A subroutine contains a sequence of instructions that perform a given task. Subrou- tines can be called multiple times in different locations of the program. They area powerful programming tool for reducing the amount of maintainable code. [16] Subroutines are normally implemented using subroutine calls. A subroutine call includes a branching to the address of the subroutine. At the same time, the return address of the next instruction (PC + 1) is stored on top of a stack memory. This type of memory allows subroutines to call other subroutines (nested calls), and to also call themselves (recursive calls). [17] 17

In general-purpose processors, a stack memory can also be accessible by the programmer to ease the programming of some complex tasks. This is done with separate PUSH and POP instructions. [16] However, the ASIP needs only the ability to execute subroutines. Therefore, the processor needs a stack memory which is connected directly to the program counter. This type of implementation is often called a program stack [13].

Subroutine calls are made with the CAL instruction, which pushes the return address on the Top Of the Stack (TOS) and increases the Stack Pointer (SP) by one. Returning is done with the RET instruction, which decreases the stack pointer by one and moves the TOS to the program counter. The program stack cannot be accessed by any other instruction, which makes it secure.

The depth of the program stack can be an issue if the number of nested or recursive calls is high. The program complexity defines the suitable depth of the stack. As the ASIP is going to support a very large program memory, the stack needs to be deep enough to minimize the risk of a stack overflow. A 32-level program stack can be considered large enough for our test software. However, it is the programmer’s responsibility to ensure that the stack does not overflow.
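The CAL and RET behaviour described above can be sketched as a small behavioural model. This is only an illustration in Python, not the VHDL of the thesis: the class name `ProgramStack`, its method names and the decision to raise an exception on overflow are assumptions made for the example (real hardware would not raise an error, which is why the text leaves overflow to the programmer).

```python
class ProgramStack:
    """Behavioural sketch of a return-address stack used only by CAL and RET."""

    def __init__(self, depth=32):
        self.depth = depth          # 32 levels, as chosen in the text
        self.mem = [0] * depth      # storage for return addresses
        self.sp = 0                 # stack pointer: index of the next free slot

    def call(self, pc, target):
        """CAL: push the return address (PC + 1), increase SP, branch to target."""
        if self.sp == self.depth:
            # Assumption for the model only: signal the overflow explicitly.
            raise OverflowError("program stack overflow")
        self.mem[self.sp] = pc + 1
        self.sp += 1
        return target               # value loaded into the PC

    def ret(self):
        """RET: decrease SP and move the top of stack to the PC."""
        if self.sp == 0:
            raise IndexError("RET without a matching CAL")
        self.sp -= 1
        return self.mem[self.sp]    # value loaded into the PC


if __name__ == "__main__":
    stack = ProgramStack()
    pc = stack.call(pc=10, target=100)   # main program calls a subroutine
    pc = stack.call(pc=pc, target=200)   # nested call from address 100
    print(stack.ret(), stack.ret())      # return addresses in reverse order
```

Because only `call` and `ret` touch `mem`, the model also reflects the point that no other instruction can read or modify the return addresses.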

3.3.3 Registers

To store the results of the ALU, at least one accumulator (AC) register is required. The accumulator is also physically connected to one of the ALU inputs. Another way of storing the ALU result is to use a register file. This is common practice, for example, in a MIPS datapath [18]. However, as explained in Section 2, an ASIP architecture includes only the registers that are necessary for the purpose of the application.

For addressing data memory indirectly, the content of a memory location can be used to point to a memory address. Typically, the contents of a register file can be used as pointers for indirect addressing. [16] To save resources, the ASIP does not need a register file, as it needs only the registers that are necessary for the intended task. Because of that, a single register is enough to implement indirect addressing. This so-called file register (F) can be loaded from the accumulator with its own instruction. To load or store data indirectly, two separate control signals are needed. They are used to control the addressing mode multiplexer which is shown in Figure 3.2.

The rest of the implemented registers are application-specific. The graphics controller requires three output registers (X, Y and OUTPUT). They could be loaded through the accumulator, but it is more efficient to load them directly from the data memory.

Figure 3.2: Addressing mode multiplexer.

3.3.4 Data bus

A data bus connects all the registers, the memories and the Arithmetic Logic Unit (ALU). A design objective for the ASIP is to handle 32-bit signed fixed-point numbers. If the width of the bus were, for example, 8 bits, adding two 32-bit integers would take a number of instructions. The ALU would also have to use a carry bit to produce the correct results. As the ASIP does not use a register file, processing large numbers would result in very complex programs with a lot of load and store operations. For these reasons, it is reasonable to use a 32-bit data bus. Accordingly, all the registers, memories and the ALU have the same width as the data bus. However, the address lines of the program and data memories use a different bus width, because these lines also determine the sizes of those memories.
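The cost of a narrow bus mentioned above can be illustrated with a short Python sketch. It is a hypothetical model written for this comparison (the function names are not from the thesis): the 8-bit version needs four byte-wide additions chained through a carry bit, while the 32-bit version needs a single addition and no carry.

```python
MASK8 = 0xFF
MASK32 = 0xFFFFFFFF


def add32_on_8bit_bus(a, b):
    """Add two 32-bit words one byte at a time, propagating the carry bit."""
    result, carry, steps = 0, 0, 0
    for byte in range(4):
        shift = 8 * byte
        partial = ((a >> shift) & MASK8) + ((b >> shift) & MASK8) + carry
        result |= (partial & MASK8) << shift
        carry = partial >> 8        # carry into the next byte
        steps += 1                  # one ALU operation per byte
    return result, steps


def add32_on_32bit_bus(a, b):
    """Add two 32-bit words in one ALU operation; the carry out is dropped."""
    return (a + b) & MASK32, 1


if __name__ == "__main__":
    print(add32_on_8bit_bus(0x0001FFFF, 0x00000001))   # four steps
    print(add32_on_32bit_bus(0x0001FFFF, 0x00000001))  # one step
```

On a real accumulator machine each of the four byte-wide steps would also need its own load and store, so the step count here understates the difference.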

3.3.5 Arithmetic logic unit

An Arithmetic Logic Unit (ALU) performs all the arithmetic operations of a CPU. The structure of an ALU consists of two inputs and an output for the data. There could also be an input and an output for a carry bit. In processors, the carry bit is typically saved in a carry flag of a status register. [19] However, storing the carry bit is not always needed if the data bus is wide enough for executing all arithmetic operations directly. In addition, some processors even have instructions to execute ALU operations without the carry [18].

The ASIP needs only normal arithmetic and bitwise operations, because many operations, such as negation or two’s complement, can be done as a combination of those instructions. Any operation that is not absolutely necessary is better left out to save logic resources. The ALU also has one application-specific operation, which is the 32-bit multiplication with built-in result selection. This operation is performed using a 32-bit multiplier with a 64-bit output. The result selection is needed to select the 32-bit signed fixed-point number from the multiplier output.

The designed ALU operations for the ASIP are shown in Table 3.1.

Table 3.1: 32-bit ALU operations.

Operation   Formula               Description
BYPASS      F = B                 ALU bypass
ADD         F = A + B             Addition without carry
SUB         F = A − B             Subtraction without carry
MUL         F = (A ∗ B)[55:24]    Multiplication with result selection
AND         F = A ∧ B             Bitwise conjunction
OR          F = A ∨ B             Bitwise disjunction
XOR         F = A ⊕ B             Bitwise exclusion
INC         F = A + 1             Increment
DEC         F = A − 1             Decrement
CIL         F = A ≪ 1             Circulate left
CIR         F = A ≫ 1             Circulate right
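The operations of Table 3.1 can be mimicked with a short Python model, which may help when checking test programs by hand. Several points are assumptions rather than facts from the thesis: the multiplier is treated as signed, the selection of bits [55:24] is read as a fixed-point format with 24 fractional bits, and CIL/CIR are modelled as one-bit rotations because of the name "circulate" (the formulas in the table would also fit plain shifts). The helper names `to_fixed` and `to_signed` are hypothetical.

```python
MASK32 = 0xFFFFFFFF
FRACTION_BITS = 24   # assumption: implied by selecting product bits [55:24]


def to_signed(x):
    """Interpret the low 32 bits of x as a two's complement number."""
    x &= MASK32
    return x - (1 << 32) if x & 0x80000000 else x


def to_fixed(value):
    """Convert a Python number to a 32-bit fixed-point word."""
    return int(round(value * (1 << FRACTION_BITS))) & MASK32


def alu(op, a, b=0):
    """Return the 32-bit result F of one operation from Table 3.1."""
    a &= MASK32
    b &= MASK32
    if op == "BYPASS":
        f = b
    elif op == "ADD":
        f = a + b                              # carry out is discarded
    elif op == "SUB":
        f = a - b                              # borrow is discarded
    elif op == "MUL":
        # Signed 32 x 32 -> 64-bit product, then keep bits [55:24].
        f = (to_signed(a) * to_signed(b)) >> FRACTION_BITS
    elif op == "AND":
        f = a & b
    elif op == "OR":
        f = a | b
    elif op == "XOR":
        f = a ^ b
    elif op == "INC":
        f = a + 1
    elif op == "DEC":
        f = a - 1
    elif op == "CIL":
        f = (a << 1) | (a >> 31)               # assumed 1-bit rotate left
    elif op == "CIR":
        f = (a >> 1) | ((a & 1) << 31)         # assumed 1-bit rotate right
    else:
        raise ValueError("unknown ALU operation: " + op)
    return f & MASK32


if __name__ == "__main__":
    product = alu("MUL", to_fixed(1.5), to_fixed(-2.0))
    print(to_signed(product) / (1 << FRACTION_BITS))   # fixed-point 1.5 * -2.0
```

Under these assumptions the product of two numbers with 24 fractional bits has 48 fractional bits, so taking bits [55:24] returns the result to the original format.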

One design goal for the ASIP is to have high performance. This is achieved by using a clock frequency which is as high as possible. Figure 3.3 shows a digital system for timing analysis of the ALU. It consists of two Flip-Flops (FFs) which are connected together through the combinational logic of the ALU.

Figure 3.3: Timing model of the ALU.

The minimum clock period, $T_{min}$, of a digital system is defined as

$$T_{min} = T_{clk-q} + T_{logic} + T_{setup} - T_{skew} \qquad (3.1)$$

where $T_{clk-q}$ is the clock-to-Q delay, or the delay from clock arrival until data arrives at the Q of the flip-flop; $T_{logic}$ is the propagation delay of the ALU; $T_{setup}$ is the minimum time data must arrive at D before the next clock edge occurs (setup time); and $T_{skew}$ is the propagation delay of the clock between the two flip-flops [20]. The maximum allowable clock frequency [20], $F_{max}$, is then defined as

$$F_{max} = \frac{1}{T_{min}}. \qquad (3.2)$$
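Equations 3.1 and 3.2 are easy to evaluate numerically. The sketch below does so in Python with purely hypothetical delay values in nanoseconds; they are not measurements of the ASIP38 or of any FPGA, and the function names are chosen for this example only.

```python
def min_clock_period(t_clk_q, t_logic, t_setup, t_skew):
    """Equation 3.1: minimum clock period, in the same unit as the inputs."""
    return t_clk_q + t_logic + t_setup - t_skew


def max_clock_frequency_mhz(t_min_ns):
    """Equation 3.2 with the period given in nanoseconds, result in MHz."""
    return 1000.0 / t_min_ns


if __name__ == "__main__":
    # Hypothetical delays in nanoseconds, for illustration only.
    t_min = min_clock_period(t_clk_q=0.5, t_logic=8.0, t_setup=0.5, t_skew=0.25)
    print(t_min, max_clock_frequency_mhz(t_min))

    # Halving the logic delay shows how strongly it dominates the result.
    t_min_fast = min_clock_period(t_clk_q=0.5, t_logic=4.0, t_setup=0.5, t_skew=0.25)
    print(t_min_fast, max_clock_frequency_mhz(t_min_fast))
```

With these example numbers the logic delay accounts for most of the clock period, so reducing it has a far larger effect on the maximum frequency than any of the other terms.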

From Equation 3.1, it can be seen that $T_{logic}$ is a key factor affecting $F_{max}$ in Equation 3.2, because the propagation delay of many ALU operations is significantly long. The ALU is also the only component of the datapath with large amounts of combinational logic. To keep $T_{logic}$ as short as possible, all ALU operations are executed in parallel. The result is then selected by a multiplexer which has a select line called ALU SEL. The structure of the designed ALU is shown in Figure 3.4. The 32-bit multiplication is presumably the slowest of the ALU operations, as it consists of an array multiplier or another multiplier design that adds delay. However, it is possible to reduce the delay by pipelining the multiplier design, which is presented in Section 4.5.4.

Figure 3.4: Design of the ALU.

3.3.6 Datapath organization

All the declared components can be connected together to create a complete datapath of the ASIP. The ASIP needs enough memory for running complex programs and for storing data. However, the memory sizes cannot be static, as one design goal of the ASIP is to be easily customizable. Given the selected 32-bit data bus, the ASIP directly supports a memory size of $2^{32}$ locations. Therefore, the design of the ASIP does not limit the size of memory for any practical applications.

In Figure 3.5, both address buses are 14 bits wide. The sizes of the program and data memories are therefore exactly $2^{14}$, which is equal to 16 384 memory locations. As the amount of memory is defined by the width of the data bus, the maximum memory size for both of the memories is $2^{32}$. As the width of the program memory is 38 bits, the designed processor can be appropriately called ASIP38. Figure 3.5 shows the register level organization of the ASIP38 which is also known as an architecture. The figure also shows a control unit (Instruction decode and control). Its operation is described in the following subsections.
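The memory sizes above follow directly from the bus widths. A few lines of Python reproduce the arithmetic; the function names are hypothetical, and the bit totals simply multiply the number of locations by the 38-bit program word and the 32-bit data word given in the text.

```python
def memory_locations(address_bits):
    """Number of addressable locations for a given address bus width."""
    return 2 ** address_bits


def memory_bits(address_bits, word_bits):
    """Total storage in bits for a memory of the given geometry."""
    return memory_locations(address_bits) * word_bits


if __name__ == "__main__":
    print(memory_locations(14))    # locations with the 14-bit address buses
    print(memory_bits(14, 38))     # program memory bits (38-bit words)
    print(memory_bits(14, 32))     # data memory bits (32-bit words)
    print(memory_locations(32))    # upper limit with a 32-bit address
```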

3.4 Instruction set architecture

3.4.1 Overview

An Instruction Set Architecture (ISA) describes the design of a computer, and is often referred to as computer architecture. It describes the computer operations which the ASIP38 will support. Instruction set architectures are normally divided into two main categories. In a Reduced Instruction Set Computer (RISC), all instructions fit in a single word. A RISC processor also uses a load/store architecture, where the data memory is accessed only with load and store instructions, and all arithmetic operations must be executed in the registers of the processor. Simple addressing modes are also typical for RISC systems. As a result, a RISC instruction set has fewer instructions compared to more complex ISAs. [16]

The other category of ISA is called Complex Instruction Set Computer (CISC). In a CISC processor, an instruction can use more than one instruction word. In this way, a single instruction supports multiple operations or addressing modes. This allows the execution of complex tasks with only one instruction. A CISC architecture allows applying arithmetic operations to both memory and register operands. Therefore, a CISC does not have to use a load/store architecture to access the memory, which is a key difference compared to RISC systems. [16]

As the ASIP38 is an application-specific processor, it needs to implement only the necessary instructions. As the processor also needs to have a small implementation size, the number of instructions must be low. The processor has a requirement of being easily customizable, which can be achieved by fitting the instructions in a single word and having simple addressing modes. Consequently, the ISA of the ASIP38 will be designed from a RISC perspective.

Figure 3.5: The architecture of the ASIP38.

3.4.2 Instruction word

An instruction word represents a complete instruction which is placed in one memory location of the program memory. It consists of at least two parts. The first part defines an operation code (opcode). The second part defines an operand. Computer systems typically use complex instruction words. In addition to the opcode, the word can contain fields for register access, addressing modes or other features. [19]

The opcode is the identifier of the operation in the instruction, whereas the operand defines the addressing mode. The ASIP38 needs only two addressing modes to operate. In the direct addressing mode, the operand is a direct address of the data memory. In the immediate addressing mode, the operand is a literal value which can be transferred to the accumulator or another register. Bit numbering of the opcode and the operand starts at zero for the Least Significant Bit (LSB). Table 3.2 shows the 38-bit instruction word.

Table 3.2: Instruction format of the ASIP38.

6-bit opcode 32-bit address [A] / 32-bit immediate
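The packing of the two fields can be illustrated with a short behavioral sketch in Python. It is not taken from the VHDL sources. In particular, the placement of the opcode in bits 37..32 and of the operand in bits 31..0 is an assumption made for illustration, and the helper names `encode` and `decode` are hypothetical.

```python
OPCODE_BITS = 6
OPERAND_BITS = 32
OPERAND_MASK = (1 << OPERAND_BITS) - 1

# Opcodes as listed in Tables 3.3 and 3.4.
LDI = 0b000000
LDA = 0b000001


def encode(opcode, operand):
    """Pack an opcode and an operand into one 38-bit instruction word.

    Assumption: opcode in bits 37..32, operand in bits 31..0.
    """
    if not 0 <= opcode < (1 << OPCODE_BITS):
        raise ValueError("opcode does not fit in 6 bits")
    if not 0 <= operand <= OPERAND_MASK:
        raise ValueError("operand does not fit in 32 bits")
    return (opcode << OPERAND_BITS) | operand


def decode(word):
    """Split a 38-bit instruction word into (opcode, operand)."""
    return word >> OPERAND_BITS, word & OPERAND_MASK
```

Under this assumed layout, `encode(LDA, 5)` yields a word whose upper six bits are 000001 and whose lower 32 bits hold the direct address 5.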

3.4.3 Memory operations

Memory operations can be separated into direct and indirect types. Direct operations have the RAM address within the instruction word. All load and store operations are performed through an accumulator (AC) register. The AC is loaded from the RAM with the LDA instruction. Accordingly, the content of the AC is stored to the RAM using the STO instruction. The other memory related instructions are arithmetic or bitwise operations. These operations are performed between the RAM and the AC, and the result is stored back to the AC.

Indirect addressing is necessary as it makes it possible to access the memory with the result of an arithmetic operation. The easiest way of doing this is with a separate address register that points to the memory indirectly. There are many ways to implement indirect addressing, but a straightforward way is to design separate instructions for it. This also simplifies the instruction decoding. To address the RAM indirectly, a file register (F) is needed. To access the RAM using the F register, three new instructions are introduced. Instruction LFR reads the RAM contents indicated by the F register to the AC. Instruction SFR stores the contents of the AC to the RAM location indicated by the F register. Finally, instruction WFR writes the contents of the AC to the F register. Memory related instructions are shown in Table 3.3.

Table 3.3: Memory instructions.

Mnemonic  Opcode  Operation            Description
LDA       000001  AC ← RAM[A]          Load AC from RAM
STO       000010  RAM[A] ← AC          Store AC into RAM
ADD       000100  AC ← AC + RAM[A]     Add RAM to AC
SUB       000101  AC ← AC − RAM[A]     Subtract RAM from AC
MUL       000110  AC ← AC ∗ RAM[A]     Multiply AC by RAM (res. sel.)
AND       000111  AC ← AC ∧ RAM[A]     AND RAM with AC
IOR       001000  AC ← AC ∨ RAM[A]     OR RAM with AC
XOR       001001  AC ← AC ⊕ RAM[A]     XOR RAM with AC
LFR       011110  AC ← RAM[F]          Load AC from RAM indirectly
SFR       011111  RAM[F] ← AC          Store AC into RAM indirectly
WFR       100000  F ← AC               Write AC to F
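The difference between direct and indirect access can be sketched with a small behavioral model in Python. The class `Machine` and the function `copy_via_pointer` are hypothetical names, the RAM is shortened to 16 words, and address-width truncation of the F register is ignored. Only the register transfers follow the Operation column of Table 3.3.

```python
class Machine:
    """Minimal model of the AC, the F register and the data RAM."""

    def __init__(self, ram_size=16):
        self.ram = [0] * ram_size
        self.ac = 0
        self.f = 0

    # Direct operations: the address A comes from the instruction word.
    def lda(self, a):
        self.ac = self.ram[a]      # AC <- RAM[A]

    def sto(self, a):
        self.ram[a] = self.ac      # RAM[A] <- AC

    # Indirect operations: the address comes from the F register.
    def wfr(self):
        self.f = self.ac           # F <- AC

    def lfr(self):
        self.ac = self.ram[self.f]  # AC <- RAM[F]

    def sfr(self):
        self.ram[self.f] = self.ac  # RAM[F] <- AC


def copy_via_pointer(m, pointer_addr, dest_addr):
    """Copy RAM[RAM[pointer_addr]] to RAM[dest_addr] with LDA, WFR, LFR, STO."""
    m.lda(pointer_addr)  # AC now holds the pointer value
    m.wfr()              # F now points at the source location
    m.lfr()              # AC now holds the pointed-to data
    m.sto(dest_addr)
```

The sequence LDA, WFR, LFR, STO shows why WFR is needed: the pointer has to pass through the AC before it can be used as an address.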

3.4.4 Accumulator operations

The second group of operations is related to the contents of the accumulator. Accumulator operations do not include reading or writing to the RAM. The most important instruction is LDI, which is used to move an immediate value to the AC. This operation is vital as the RAM is cleared after every power-up. Without LDI, moving data to the RAM is not possible. Other useful instructions include INC and DEC, which are used to increase or decrease the accumulator by one. Bitwise operations are often required to modify numbers, or for multiplication and division. Thus, the shift instructions CIL and CIR are needed. They are used for shifting the contents of the accumulator bit by bit to the left, or to the right. Sometimes a no operation (NOP) instruction is also necessary to enable an empty operation in time-critical programs. However, it can be replaced by other operations.

An accumulator value can be used for controlling the program flow. The basic operation is to skip the next instruction depending on whether the accumulator value is zero or non-zero. For example, the SNZ instruction skips the next instruction if the value of the accumulator is non-zero. Accumulator related instructions are shown in Table 3.4.
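The skip behavior described above amounts to choosing between PC + 1 and PC + 2. The following Python sketch models only the SNZ and SZA instructions. The function name `next_pc_after` is hypothetical, and the sketch makes no claim about how the VHDL implementation evaluates the condition.

```python
def next_pc_after(pc, mnemonic, ac):
    """Return the program counter value that follows the given instruction.

    SNZ skips the next instruction when AC is non-zero, SZA when AC is zero.
    Every other mnemonic is treated here as a plain increment.
    """
    if mnemonic == "SNZ":
        skip = ac != 0
    elif mnemonic == "SZA":
        skip = ac == 0
    else:
        skip = False
    return pc + 2 if skip else pc + 1
```

A skip followed by an unconditional branch gives a conditional branch, which is how loops can be built from these instructions.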

Table 3.4: Accumulator instructions.

Mnemonic  Opcode  Operation               Description
LDI       000000  AC ← immediate          Load AC with immediate
INC       001010  AC ← AC + 1             Increment AC
DEC       001011  AC ← AC − 1             Decrement AC
CIL       001100  AC ← AC ≪ 1             Circulate AC left
CIR       001101  AC ← AC ≫ 1             Circulate AC right
SNZ       010011  PC + 2 if AC ≠ 0        Skip if AC is not zero
SZA       010100  PC + 2 if AC = 0        Skip if AC is zero
SGT       010101  PC + 2 if AC > RAM[A]   Skip if AC is greater than RAM[A]
SLT       010110  PC + 2 if AC < RAM[A]   Skip if AC is less than RAM[A]
SKI       010111  PC + 2 if INPUT = 0     Skip if INPUT is zero

3.4.5 Input/Output operations

Input/Output (I/O) operations are essential for communicating with external devices. An input instruction INP is needed to obtain data from the INPUT register, and to store it to the RAM address indicated by the instruction word. The processor has three output registers: X, Y and OUTPUT. They are directly connected to the graphics controller. Instructions LDX, LDY and OUT are used to store data into the registers. These registers are loaded directly from the RAM to make programming more efficient.

Two application-specific instructions are implemented for sending commands to the graphics controller. Instruction SET controls the MODE register, which stores the command provided by the instruction word. It also triggers the control signal DISP_SET, which enables the graphics controller to execute the given command. Instruction CLR is required for clearing video memory locations without affecting the contents of the MODE register. All I/O related instructions are shown in Table 3.5.

Table 3.5: I/O instructions.

Mnemonic  Opcode  Operation          Description
INP       010001  RAM[A] ← INPUT     Store INPUT into memory
OUT       010010  OUTPUT ← RAM[A]    Load OUTPUT from memory
LDX       001111  X ← RAM[A]         Load X from memory
LDY       010000  Y ← RAM[A]         Load Y from memory
SET       011000  MODE ← immediate   Command to graphics controller
                  1 → DISP_SET
CLR       011001  1 → DISP_CLR       Clear video memory location

3.4.6 Control operations

Control operations are used to control the flow of a program. The most common control operation is to branch unconditionally or with some condition. The conditions are usually accumulator related. At the minimum, an unconditional branch (JMP) instruction is needed.

The graphics controller includes hardware accelerators which require some processing time. During this time, the graphics controller is unable to receive commands from the processor. Therefore, the processor must wait until the graphics controller has performed its current task. For the wait operation, the WAI instruction is implemented. The WAI instruction halts the processor if the value of the READY flag is 0. The operation of the processor is continued when the flag returns to 1.

A subroutine call has direct control of the program stack. Therefore, instruction CAL is introduced, which stores the return address (PC + 1) to the stack, increases the stack pointer and stores the address of the subroutine to the program counter. Returning from a subroutine also needs its own RET instruction, which decreases the stack pointer, and places the output of the program stack to the program counter. Control related instructions are shown in Table 3.6.

Table 3.6: Control instructions.

Mnemonic  Opcode  Operation             Description
JMP       000011  PC ← address [A]      Unconditional branch
WAI       001110  Wait if READY = 0     Conditional wait
CAL       011010  STACK[SP] ← PC + 1    Call subroutine
                  SP ← SP + 1
                  PC ← address [A]
RET       011100  SP ← SP − 1           Return from subroutine
                  PC ← STACK[SP]
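The register transfers of CAL and RET in Table 3.6 can be traced with a short Python sketch. The class `ControlState` is a hypothetical name and the sketch is only a behavioral illustration. Stack overflow and underflow handling is deliberately left out.

```python
STACK_DEPTH = 32  # program stack size used in the thesis


class ControlState:
    """Program counter, stack pointer and program stack."""

    def __init__(self):
        self.pc = 0
        self.sp = 0
        self.stack = [0] * STACK_DEPTH

    def jmp(self, address):
        self.pc = address                 # PC <- address [A]

    def cal(self, address):
        self.stack[self.sp] = self.pc + 1  # STACK[SP] <- PC + 1
        self.sp += 1                       # SP <- SP + 1
        self.pc = address                  # PC <- address [A]

    def ret(self):
        self.sp -= 1                       # SP <- SP - 1
        self.pc = self.stack[self.sp]      # PC <- STACK[SP]
```

Because the stack pointer is incremented after the write and decremented before the read, nested calls return in the reverse order of the calls.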

3.5 Instruction cycle

3.5.1 Single-cycle approach

In a single-cycle processor, the instruction is fetched, decoded, executed and its results are stored back to the memory in a single clock cycle. The length of the instruction cycle defines the longest possible propagation delay of the processor. The longest possible delay path is called the critical path, which is equal to the worst-case delay for all instructions. [18] It determines the minimum clock period Tmin which can be used.

Single-cycle processors can be difficult to implement on an FPGA if external memory resources are used. For example, many Block RAMs have too much read and write latency to be used in a single-cycle processor [21]. For these reasons, a multi-cycle approach is a better solution for the ASIP38.

3.5.2 Multi-cycle approach

In single-cycle processors, data must propagate through the processor in a single clock cycle, which means that the minimum clock period Tmin needs to be fairly long to accommodate the propagation delay. Multi-cycle processors use multiple clock cycles to complete a single instruction. In this way, signals need to travel a shorter distance in a single clock cycle, which allows the minimum clock period to be shortened [19]. Figure 3.6 demonstrates the gained performance in a multi-cycle design.

Figure 3.6: Comparison of instruction cycles: single-cycle and multi-cycle.

Typically, the execution of an instruction requires at least five clock cycles. Classic RISC is a good example of a five-cycle design with the following stages: Instruction Fetch (IF), Instruction Decode (ID), Execution (EX), Memory access (MEM) and Write-Back (WB) [22]. The stages are shown in Table 3.7.

Table 3.7: The five stages of the classic RISC pipeline.

Stage  Description
IF     Instruction fetch. Increment PC.
ID     Instruction decode. Read registers.
EX     Execution. Calculate effective address.
MEM    Memory access using effective address.
WB     Write the result into the register file.

The five stages of the classic RISC pipeline are a good model for the instruction cycle of the ASIP38, with a few exceptions. Firstly, the ASIP38 has no need to calculate effective addresses, because it does not use a register file to address the data memory. Secondly, the program counter needs to be updated before the instruction is fetched. In this way, the program counter needs to be updated only once per instruction cycle, reducing the complexity of the control unit.

Without the need for effective address calculations, the instruction cycle of the ASIP38 can have only four stages. To keep things simple, every instruction of the ASIP38 will be processed in four clock cycles. This is also a common practice, for example, in PIC microcontrollers [23].

To make the clock speed as fast as possible, it is important to keep the critical path of each stage as short as possible. To achieve this, all datapath operations must be organized into the correct stages. The main principle is to use either multiple inexpensive operations or one expensive operation per clock cycle. Inexpensive datapath operations are usually those where the propagation delay is small. For example, registers have a fairly small delay. In contrast, an ALU operation or a memory access has a much longer delay, and is better executed apart from other operations. This is done by restricting reading and writing of the memory or the accumulator to the corresponding stages.

The following four stages define the instruction cycle of the ASIP38. The Fetch stage is reserved for updating the program counter and fetching a new instruction from the program memory. In the Decode stage, the instruction is decoded and the correct register/memory is selected to the data bus. The Execute stage performs an ALU operation and generates all instruction-specific control signals. Finally, the MemWrite stage writes the result back to the selected register/memory, and selects a new operation for the program counter. The suggested four stages are shown in Table 3.8.

Table 3.8: The four stage instruction cycle of the ASIP38.

Stage     Description
Fetch     Update PC. Fetch instruction from memory.
Decode    Decode instruction. Update RAM address. Select register/memory to data bus.
Execute   Execute ALU operation. Generate rest of the control signals.
MemWrite  Write the result to the register/memory. Select PC operation.

With the above four stages, four clock cycles are used per instruction. Thus, the Cycles Per Instruction (CPI) value for the ASIP38 is always 4. As mentioned in Section 1, the ASIP38 has a design goal for compatibility. An external RAM component has a delay from the time a new address is entered, to the time the contents of the memory are shown at the output of the RAM. This delay is well supported by the instruction cycle as the RAM address is updated in the Decode stage.
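With a fixed CPI, the execution time of a program follows directly from the instruction count and the clock frequency. The Python sketch below shows the arithmetic. The function names are hypothetical, and the 100 MHz clock in the usage note is only an illustrative figure, not a measured clock rate of the ASIP38. Stall cycles, such as those caused by a WAI instruction waiting for the READY flag, are not included.

```python
CPI = 4  # every ASIP38 instruction uses four clock cycles


def execution_time_s(instruction_count, clock_hz):
    """Time in seconds needed to execute the given number of instructions."""
    return instruction_count * CPI / clock_hz


def instructions_per_second(clock_hz):
    """Instruction throughput for a given clock frequency."""
    return clock_hz / CPI
```

For example, at a hypothetical 100 MHz clock, `instructions_per_second(100e6)` gives 25 million instructions per second, and a 1000-instruction routine takes 40 microseconds.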

3.5.3 Pipelined approach

The ASIP38 has a requirement of high performance. An efficient way to improve the performance is to use a pipelined instruction cycle. Pipelining is a technique used to improve the performance of a processor by overlapping the execution of instructions [18].

Overlapping instructions can create what is called a hazard. A structural hazard arises when the same part of the processor's hardware is simultaneously needed by two or more instructions. A data hazard arises when an instruction depends on the result of a previous instruction and the dependency is exposed by the overlapping of instructions. Pipelining branch instructions can create a control hazard if the outcome of the branch is not predicted beforehand. [18]

To handle all the hazards, a pipelined processor needs considerably more logic than a non-pipelined processor. For this reason, it was decided to use a non-pipelined instruction cycle in the design of the ASIP38.

3.6 Control unit

3.6.1 Structure

The control unit generates the control signals which control the datapath. The program counter needs control signals for choosing the next PC operation, whether it is incremented by a value, or loaded with an address for branching. Every register and memory needs at least one control signal for writing. The ALU and the data bus need control signals which are multiple bits wide.

A control unit can be designed in many ways. One option is to use combinational logic, for example logic gates, encoders, decoders and multiplexers. In this way, the control logic can be minimized to use fewer resources, which is one of the design goals in Section 1.1. If an instruction takes more than one cycle, this type of design also needs a synchronous part to store the information of the current cycle. This type of design can also be problematic if later modifications need to be made.

The control unit can also use synchronous logic to generate the control signals. A Moore type Finite State Machine (FSM) can be used to implement a multiple state instruction cycle [18]. It can be implemented with decoders and state registers as seen in Figure 3.7.

Figure 3.7: Control unit implementation with a Moore type FSM.

3.6.2 Control signals

Control signals are used for controlling the components of a datapath. In a single-cycle processor, the instruction is decoded and the required control signals are generated in one clock cycle. In a multi-cycle processor, the control signal generation is distributed over different stages of the instruction cycle.

A key part of the control unit is the control logic for the data bus. As seen from Table 3.9, the control signal BUS SEL is generated in the Decode stage, and it remains the same until the next instruction. It controls a multiplexer, which selects what memory is connected to the data bus. The selected memory can be either the output of the program memory, the data memory, the accumulator or the input register.

Control signals ALU SEL and MEM SEL are also generated in the Decode stage. The ALU SEL selects an ALU operation, and the MEM SEL selects a direct or an indirect addressing mode for the data memory. Both control signals remain the same for the rest of the instruction.

Table 3.9: Control signals generated in each stage of the instruction cycle.

Control signal   Fetch  Decode  Execute  MemWrite
BUS SEL          x      x       x        x
ALU SEL          x      x       x        x
MEM SEL          x      x       x        x
STACK PUSH                      x
STACK POP                       x
X LOAD                          x
Y LOAD                          x
INPUT LOAD                      x
OUTPUT LOAD                     x
AC LOAD                                  x
RAM WRITE                                x
PC SEL                                   x
WRITE F                                  x
Other signals*   (x)    (x)     (x)      (x)

* Application-specific instructions can have control signals at any of the stages.

The Execute stage has most of the control signals. The stack memory is operated in this stage with the control signals STACK PUSH and STACK POP. The signals are generated by the CAL and RET instructions. It would also be possible to operate the stack from other stages, because there are no strict timing requirements for dealing with the return addresses. The write operations of the X, Y, INPUT and OUTPUT registers have to be performed in the Execute stage, because the data memory is automatically addressed by the instruction word in the Decode stage.

The MemWrite stage has the function of writing the results back to the accumulator, or the data memory. Signal AC LOAD is used to load the accumulator, and signal RAM WRITE operates the data memory. The next program counter operation is selected with the PC SEL signal. The control logic of the PC is explained in the next subsection. Control signal WRITE F is used to load the F register from the accumulator. Finally, application-specific control signals can be generated in any of the stages.

3.6.3 PC control

The program counter is controlled by the select line PC SEL. It is generated in the control unit as a result of the decoded instruction. Another factor that may affect the PC SEL signal is the status of the accumulator in the SNZ, SZA, SGT or SLT instructions. The SKI instruction uses the status of the INPUT register. Figure 3.8 shows the logic controlling the program counter. The correct operation for each instruction is defined in Section 3.4.

Figure 3.8: PC control.

Figure 3.8 also shows the select line operating a multiplexer which controls the operation of the program counter. The PC SEL is a 3-bit select line which includes five control modes: PCLatch, PCInc, PCSkip, PCLoad and PCRET. These five modes perform all the necessary tasks for the whole instruction set. Mode PCLatch does not affect the PC, mode PCInc increments the PC by 1, mode PCSkip increments the PC by 2, mode PCLoad loads an address from the data bus into the PC, and mode PCRET loads the PC with the Top Of the Stack (TOS).

The design of the control unit is ready, which also completes the processor architecture of the ASIP38. The next section describes how the architecture is implemented in VHDL.
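The five PC SEL modes behave like a five-input multiplexer in front of the program counter. The Python sketch below is a behavioral illustration only. The names `PCSel` and `pc_mux` are hypothetical, the 3-bit encoding of the modes is not modeled, and the wrap-around of the PC at the end of the program memory is ignored.

```python
from enum import Enum


class PCSel(Enum):
    PC_LATCH = "PCLatch"
    PC_INC = "PCInc"
    PC_SKIP = "PCSkip"
    PC_LOAD = "PCLoad"
    PC_RET = "PCRET"


def pc_mux(pc, sel, data_bus=0, top_of_stack=0):
    """Return the next PC value for the selected control mode."""
    if sel is PCSel.PC_LATCH:
        return pc              # hold the current value
    if sel is PCSel.PC_INC:
        return pc + 1          # normal sequential execution
    if sel is PCSel.PC_SKIP:
        return pc + 2          # skip the next instruction
    if sel is PCSel.PC_LOAD:
        return data_bus        # branch or subroutine call
    if sel is PCSel.PC_RET:
        return top_of_stack    # return from subroutine
    raise ValueError("unknown PC select mode")
```

In this model a skip instruction simply resolves to either `PC_INC` or `PC_SKIP`, depending on the tested condition.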

4 VHDL implementation

4.1 FPGA design flow

The implementation follows a process known as a design flow. The design could also be implemented in Verilog HDL, but VHDL is chosen as it is a common language in FPGA designs [24]. To prove that a design can be implemented physically, the design requires a target device. For prototyping, a Field-Programmable Gate Array (FPGA) is a common option. An FPGA is a digital IC that can be easily programmed to implement the functionality of any digital circuit. FPGAs are generally used in embedded systems and for prototyping. They evolved in the 1980s from earlier PLDs such as Complex Programmable Logic Devices (CPLDs) and Programmable Logic Arrays (PLAs) [25].

Alternatively, the target device could be an Application-Specific Integrated Circuit (ASIC). While ASICs are better at achieving optimal speed and power consumption, the Non-Recurring Engineering (NRE) cost for a small number of ASICs would be remarkably high [2]. In our case, the only cost-effective option is to use an FPGA for testing and proving the final implementation.

The target device we are going to use is a popular Xilinx FPGA. The details about this device are explained in Section 4.2. The design flow for Xilinx FPGAs is shown in Figure 4.1. It consists of the following major steps [26][27].

1. Design the system and produce Hardware Description Language (HDL) and constraint files.

2. Develop a testbench in HDL and perform Register Transfer Level (RTL) simulation.

3. Perform synthesis and implementation. In the synthesis process, software is used to transform the HDL to a generic gate-level representation. It is followed by the implementation process consisting of three mandatory sub-processes:

(a) The Opt Design step optimizes the gate-level representation and makes it easier to fit onto the Xilinx target device.

(b) The Place Design step produces the physical layout inside the FPGA chip. The logic cells are placed in physical locations.

(c) The Route Design step determines which wires should be used to connect the placed logic cells to each other.

In the Xilinx design flow, static timing analysis is performed at the end of the implementation process. It determines various timing parameters, such as the maximum propagation delay and the maximum clock frequency.

4. Generate bitstream and download programming files. This step generates a configuration bitfile according to the final netlist. The bitfile is then downloaded to the FPGA for configuring the logic cells and implementing the circuit.

Figure 4.1: Design flow for Xilinx FPGAs [24].

The design flow is performed using Xilinx's own Integrated Development Environment (IDE) called Vivado. The design flow also includes a few optional simulations, such as functional simulation and timing simulation. The functional simulation checks the correctness of the synthesis process by replacing the RTL description with the synthesized netlist [24]. The timing simulation is used to simulate the final netlist along with detailed timing data [24].

4.2 Artix-7 FPGA resources

For the final implementation, we select the Digilent Basys 3 FPGA development board. The board has an Artix-7 series FPGA produced by Xilinx using a 28 nm manufacturing process. The part number of the FPGA is XC7A35T-1CPG236C, which has the following features [28].

• 33 280 logic cells in 5200 slices

• 1800 kbits of fast Block RAM

• Five clock management tiles, each with a phase-locked loop (PLL)

• 90 DSP slices

• Internal clock speeds exceeding 450 MHz

• On-chip analog-to-digital converter (XADC)

Additionally, the Basys 3 development board has the following features [28].

• 100 MHz oscillator

• 12-bit VGA output

• 16 user switches

• 16 user LEDs

• 5 user push-buttons

• 4-digit 7-segment display

• Four 12-pin Pmod ports

The Artix-7 FPGA provides 1800 kilobits of Block RAM, which is essential for implementing the 16k program and 16k data memories for the ASIP38. The graphics controller, which we discuss later, also uses large amounts of Block RAM. The multiplier of the ASIP38 can be implemented with some of the hardware multipliers included in the DSP slices. From the development board itself, we are going to use the VGA output and the five push-buttons. The Basys 3 development board is shown in Figure 4.2.

Figure 4.2: Basys 3 FPGA development board.

4.3 Top level system

The top level implementation consists of the VHDL code of the ASIP38 processor, the input logic, and the graphics controller. Figure 4.3 describes the top level system. The input block provides a debounced button interface for the ASIP38, while the graphics controller is used to display the output. The operation of the graphics controller is explained briefly in Section 4.6.

Figure 4.3: Top level system.

4.4 Required VHDL packages

To implement all the VHDL modules, it is necessary to use VHDL packages which provide the necessary data types, operators and functions. When writing VHDL for a synthesizable circuit, the package ieee.std_logic_1164 is usually required. It is essential as it provides the std_logic and std_logic_vector data types including their type conversions. Another important package is ieee.numeric_std, which brings two important data types: unsigned and signed. [29] Finally, we need the ieee.std_logic_unsigned package, which allows std_logic_vector to be treated as an unsigned number in arithmetic operations [29].

4.5 ASIP38

The VHDL implementation of the ASIP38 is provided in Appendix A.1. It consists of one design file with multiple processes which implement the design. The entity declaration includes a set of port declarations. It contains all the input and output signals of the processor. For example, the system clock and the input/output register connections are defined here along with some hardware specific control signals. The internal registers of the processor are described as std_logic_vector type signals. They are declared at the beginning of a design unit called architecture.

4.5.1 Block RAM

Block RAM (BRAM) stands for Block Random Access Memory. It is a special memory module inside the FPGA device, apart from the logic cells. BRAMs are used to implement large RAM or ROM memories inside the FPGA. [21] The XC7A35T FPGA has 50 BRAMs. Each Block RAM has two independent ports for simultaneous read and write operations, and all memory operations are controlled by the clock. Each BRAM has a size of 36 kb, and it can be divided into two separate 18 kb BRAMs. A BRAM can be organized for different data widths, for example from 16k x 1 to 512 x 36. [21]

The program memory of the ASIP38 takes (16k x 38) 622 592 bits, and the data memory takes (16k x 32) 524 288 bits. Together they take up 1 146 880 bits out of the available 1800 kb, which is over half of what is available.

An alternative to the BRAM would be to use distributed RAM, which is constructed from Look-Up Tables (LUTs). The LUTs are placed in a larger FPGA resource called a slice. Each slice of the XC7A35T contains four LUTs and eight flip-flops, but only some of the slices can use their LUTs as distributed RAM. The maximum supported amount of distributed RAM is 400 kb, which is not enough for the ASIP38. [30] Therefore, it is better to use it for other purposes. A BRAM or a distributed RAM can be configured using the Core Generator tool in the Xilinx Vivado software suite. Refer to Appendix A.1, page 68 for the declarations of the program and data memories.

The data memory has two options for RAM configuration: single-port or dual-port. A dual-port RAM allows read and write operations at the same time using two separate ports. [21] However, as the instruction cycle of the ASIP38 was designed to be compatible with different memory configurations, it does not require the use of a dual-port RAM. Therefore, a single-port configuration is sufficient for the RAM. The program memory automatically has a single-port configuration as it is a ROM.

Each memory has internal signals that need to be connected to the signals of the design file. In VHDL, this is done by using port maps. All connected memory components must be declared at the beginning of the VHDL code. It is not possible to use them directly, as they are located in a different file.
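The memory budget quoted above can be checked with a few lines of Python. The sketch only repeats the arithmetic of the text, taking 1 kb as 1024 bits. It counts raw bits and does not model how the synthesis tool actually maps the memories onto individual BRAM primitives.

```python
KIB = 1024

DEPTH = 16 * KIB        # 16k words in both memories
PROGRAM_WIDTH = 38      # bits per instruction word
DATA_WIDTH = 32         # bits per data word

BRAM_COUNT = 50         # BRAMs in the XC7A35T
BRAM_BITS = 36 * KIB    # one 36 kb BRAM

program_bits = DEPTH * PROGRAM_WIDTH
data_bits = DEPTH * DATA_WIDTH
total_bits = program_bits + data_bits
available_bits = BRAM_COUNT * BRAM_BITS
utilization = total_bits / available_bits

print(program_bits, data_bits, total_bits, available_bits)
print(f"{utilization:.1%}")
```

The two memories together occupy roughly 62 % of the raw Block RAM capacity, which is consistent with the statement that they take over half of it.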

4.5.2 Program counter

The implementation of the program counter is provided in Appendix A.1, page 70. The program counter is defined as a signal pc. For the control signal PC SEL, we introduce a user-defined enumerated data type pc_type. It is simply a list of the values that the PC SEL can have. A signal pc_sel can now be defined to use the values of the pc_type. The program_counter is a synchronous process, being sensitive only to the clock. The signal pc_sel can have the following values: PCInc, PCLoad, PCSNZ, PCSZA, PCSGT, PCSLT, PCSKI, PCRET, and PCLatch. All skip instructions control the program counter directly, as the skip decisions are made inside the process.

4.5.3 Program stack

The program_stack is a synchronous process. It consists of stack, which is a two-dimensional array of std_logic_vector signals. As defined in Section 3.3.2, the size of the program stack is 32. The signal stack has its own type definition stack_type. The stack has the signal stack_dataout for outputting the data. The program stack also has the signal stack_pointer which points to its address line.

The program_stack process is controlled by two control signals: stack_push and stack_pop. Their operation is to increment or decrement the stack pointer. It is not possible to decrement the stack pointer when it is zero. However, if the stack pointer is incremented over the size of the stack, it goes back to zero. The implementation of the program stack is shown in Appendix A.1, page 71.

4.5.4 ALU

The alu of the ASIP38 is an asynchronous process which implements combinational logic. Because of this, all input signals must be in the sensitivity list of the process. The ALU implements the following operations: bypass, ADD, SUB, MUL, AND, IOR, XOR, INC, DEC, CIL, and CIR. The bypass is used when loading to the accumulator or when the instruction does not use the ALU. The implementation of the ALU can be found in Appendix A.1, page 71.

The XC7A35T FPGA has 90 DSP48 slices that are intended for Digital Signal Processing (DSP) operations. A DSP48 slice contains one 25x18 multiplier. [30] The MUL operation uses four DSP48 slices to implement 32-bit multiplication. It involves a built-in result selection for 32-bit signed fixed-point numbers. The multiplier is introduced in the beginning of the architecture as a component called mult32.

In Section 3.3.5, the multiplier implementation was designed to be asynchronous. In this way, the MUL operation can be executed in one instruction stage, which is practical to implement. However, this can cause too much logic delay in the synthesis phase, which can cause the timing requirements to fail. A good solution is to use enough pipeline stages in the multiplier. The processor must then be forced to wait for the result during the MUL instruction. This needs additional control logic, which is shown in Appendix A.1, page 77.
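The combinational behavior of the simpler ALU operations can be illustrated with a Python sketch that treats the accumulator as a 32-bit word. The function name `alu` and the string operation codes are hypothetical. MUL, CIL and CIR are left out on purpose, because the result selection of the multiplier and the exact shift or rotate behavior are defined only in the VHDL sources.

```python
MASK32 = 0xFFFFFFFF  # results are truncated to 32 bits


def alu(op, ac, operand=0):
    """Return the 32-bit ALU result for one operation.

    ac is the accumulator value and operand is the value on the data bus.
    """
    if op == "BYPASS":
        result = operand          # pass the data bus through unchanged
    elif op == "ADD":
        result = ac + operand
    elif op == "SUB":
        result = ac - operand
    elif op == "AND":
        result = ac & operand
    elif op == "IOR":
        result = ac | operand
    elif op == "XOR":
        result = ac ^ operand
    elif op == "INC":
        result = ac + 1
    elif op == "DEC":
        result = ac - 1
    else:
        raise ValueError("operation not modeled: " + op)
    return result & MASK32
```

The final mask reproduces the wrap-around of a fixed-width register, so incrementing 0xFFFFFFFF gives 0 and decrementing 0 gives 0xFFFFFFFF.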

4.5.5 Instruction decode and control

Bus select

The content of the data bus is selected with the control signal BUS SEL. The data bus has four different sources for data: the program memory output, the accumulator, the data memory output and the input register. The BUS SEL is implemented in its own asynchronous process called bus_select. It is shown in Appendix A.1, page 69.

Memory select

The selection of the data memory address could have been embedded directly into the control logic itself, but this caused some additional delay which made the critical path longer. Therefore, the address for the data memory is selected by a multiplexer, as seen in Figure 3.2. It is operated by the control signal MEM SEL, which selects the program memory output or the F register into the address line of the data memory. The implementation of the multiplexer can be found in Appendix A.1, page 69.

State machine

The state_machine is a synchronous process which updates the signal state with the next state of the instruction cycle. The process also implements the WAI instruction based on the state of the ready signal. The implementation of the state machine can be found in Appendix A.1, page 72.
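The cyclic stepping through the four stages, together with a stall for WAI, can be sketched in Python as follows. The names `Stage` and `next_stage` are hypothetical. The choice to hold the machine in the Execute stage while READY is 0 is an assumption made for illustration, since the stage in which the actual state_machine process stalls is defined in Appendix A.1.

```python
from enum import Enum


class Stage(Enum):
    FETCH = 0
    DECODE = 1
    EXECUTE = 2
    MEMWRITE = 3


def next_stage(stage, is_wai=False, ready=True):
    """Return the stage that follows the current one.

    Assumption: a WAI instruction holds the machine in Execute
    for as long as the READY flag is 0.
    """
    if stage is Stage.EXECUTE and is_wai and not ready:
        return Stage.EXECUTE
    return Stage((stage.value + 1) % 4)
```

Without a stall, four calls of `next_stage` return to the starting stage, which matches the fixed four-cycle instruction length.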

Control logic

The control_logic is an asynchronous process which implements the instruction cycle of Table 3.8. It generates all the control signals of Table 3.9 according to the state signal. Control signals can use user-defined enumerated types or std_logic. At the beginning of the process, initial values are assigned to the control signals that are not updated in the Decode stage. The control logic is constructed using nested case statements, and its implementation is shown in Appendix A.1, page 72.

Resets

An essential VHDL design practice is to use reset signals inside the processes. Both ASIC and FPGA designs need reset signals for registers in order to set an initial condition. However, modern FPGA designs can also use initial values in the signal declarations. It is considered good practice to reset as few flip-flops as possible and to initialize all flip-flops instead, as reset lines take routing resources. The reset lines also increase power consumption and make it harder for the design to meet timing. [31]

As our target device is a Xilinx FPGA, we can benefit from using initial values on signals, and in this way get lower resource usage and better timing. To get the most efficiency out of the design, resets should be coded only when they are necessary for the functionality of the design [32]. Therefore, the use of external reset signals in our design is not necessary. For other target devices, the resets can be added later if needed.

4.6 Graphics controller

The graphics controller is a test unit for the ASIP38. It is able to render vector graphics with a special line drawing algorithm. The main task of the ASIP38 is to act as a control processor by running software that controls the graphics controller. The role of the graphics controller is to act as a platform for verifying and testing the correct operation of the processor. It also helps in testing the customizability and the performance requirements of the ASIP38 as a part of a larger system.

Figure 4.4 shows a block diagram of the graphics controller. The graphics controller consists of three main VHDL modules which are the display controller, VGA synchronization and RGB signal generation. Figure 4.4 shows also other modules, such as hardware accelerators, which are described later.

Figure 4.4: A block diagram of the graphics controller.

Display controller

The display controller is a command interface between the ASIP38 and the rest of the graphics controller. It receives commands delivered by the SET and CLR instructions of the ASIP38. A total of 29 different commands can be executed by the SET instruction. The commands are used, for example, for switching between graphics modes or writing to the video memory. The display controller is connected to a paint memory which keeps track of the contents of the display. This allows the flood fill algorithm to know what has been written to the screen, as the video memory cannot be read directly from the display controller. The flood fill algorithm allows painting selected areas on the screen. The implementation of the display controller is shown in Appendix A.2, page 78.

VGA synchronization

Video Graphics Array (VGA) is a graphics standard for display controllers first introduced by IBM Corporation in 1987. The VGA was designed in the era of Cathode Ray Tubes (CRTs). The color of a pixel is determined by the intensity of three components: Red, Green, and Blue (RGB). Each component can have a voltage level between 0 and 0.7 volts. [24]

The purpose of the module vga_sync is to produce the timing for the standard 640x480 60 Hz video mode. The main clock for this mode is 25 MHz, which requires a clock divider process vga_clk_25MHz. The timing of the electron beam is controlled using signals for the horizontal and vertical synchronization (hsync, vsync). They are generated by counting the x and y pixel positions in processes called horizontal and vertical. The hsync makes the electron beam start a new line, and the vsync makes it start a new frame. The signal video_on is used for switching the electron beam off during a line change. The x and y pixel positions are also used inside the process for the RGB signal generation, where they are needed for reading the contents of a video memory. The implementation of the VGA synchronization is shown in Appendix A.3, page 84.
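The counter-based timing described above can be illustrated with a short software model. The following Python sketch is not taken from the vga_sync module. It assumes the commonly published timing of the 640x480 60 Hz mode (800 pixel clocks per line and 525 lines per frame, with active-low sync pulses), and all constant and function names are hypothetical.

```python
# Sketch of 640x480 @ 60 Hz VGA timing, evaluated per 25 MHz pixel clock tick.
# The constants are the commonly published industry timings (an assumption),
# not values copied from the vga_sync module in Appendix A.3.
H_VISIBLE, H_FRONT, H_SYNC, H_BACK = 640, 16, 96, 48   # 800 clocks per line
V_VISIBLE, V_FRONT, V_SYNC, V_BACK = 480, 10, 2, 33    # 525 lines per frame
H_TOTAL = H_VISIBLE + H_FRONT + H_SYNC + H_BACK
V_TOTAL = V_VISIBLE + V_FRONT + V_SYNC + V_BACK


def vga_signals(pixel_x, pixel_y):
    """Return (hsync, vsync, video_on) for one counter position.

    The sync pulses are modelled as active low (0 during the pulse).
    """
    in_hsync = H_VISIBLE + H_FRONT <= pixel_x < H_VISIBLE + H_FRONT + H_SYNC
    in_vsync = V_VISIBLE + V_FRONT <= pixel_y < V_VISIBLE + V_FRONT + V_SYNC
    video_on = pixel_x < H_VISIBLE and pixel_y < V_VISIBLE
    return (0 if in_hsync else 1, 0 if in_vsync else 1, 1 if video_on else 0)


def next_position(pixel_x, pixel_y):
    """Advance the horizontal counter and wrap into the vertical counter."""
    pixel_x += 1
    if pixel_x == H_TOTAL:
        pixel_x = 0
        pixel_y = (pixel_y + 1) % V_TOTAL
    return pixel_x, pixel_y


def refresh_rate_hz(pixel_clock_hz=25_000_000):
    """Frame rate that follows from the pixel clock and the frame size."""
    return pixel_clock_hz / (H_TOTAL * V_TOTAL)
```

With a 25 MHz pixel clock this frame size gives a refresh rate slightly below 60 Hz, which displays normally accept.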

RGB signal generation

The RGB signal is generated using the contents of a video memory. Its implementation is shown in Appendix A.4, page 86. The module rgb_gen outputs the RGB signal which goes directly to the VGA connector. The graphics controller has five video modes. As seen from Figure 4.4, each mode uses a separate video memory. The current video mode is selected with the signal video_mode. The display consists of dots which can have a resolution from 160x120 down to 40x15. The signals v_mem_x and v_mem_y provide the coordinates of a single dot that can be set or cleared with the signals disp_write and disp_clear.

The RGB signal is produced by using the signals pixel_x and pixel_y as a read address for the selected video memory. The character code of the selected dot can then be read from the output of the video memory. This code is used as a character address for a font ROM. The output of the font ROM is then used to produce pixels in true 640x480 resolution for the selected video mode.

Line drawing

The graphics controller uses the module line_draw to generate lines in the video memory. It is shown in Appendix A.5, page 91. The line_draw works as an accelerator for the display_controller. Therefore, there is no need to use a software algorithm for line generation. This kind of hardware solution increases the overall performance and makes graphics programming easier. The algorithm used is Bresenham's line algorithm, which uses only integer arithmetic [33]. The module takes the line start and end coordinates as inputs, and sends the line coordinates to the rgb_gen. The module is started with the line_start signal, and the video memory is updated with the line_update signal. This module also manipulates the ready signal which is connected to the ASIP38. During line drawing, ready is 0; otherwise it is 1. This makes it possible to use the WAI instruction to suspend the execution of the ASIP38's software while the line is being generated.
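To illustrate why the algorithm needs only integer additions and comparisons, the following Python sketch shows a common all-octant formulation of Bresenham's line algorithm. It is a software model for explanation only; the function name is hypothetical, and the line_draw module in Appendix A.5 may organize the computation differently.

```python
def bresenham_line(x0, y0, x1, y1):
    """Return the integer (x, y) dots of a line from (x0, y0) to (x1, y1)."""
    dx = abs(x1 - x0)
    dy = -abs(y1 - y0)
    step_x = 1 if x0 < x1 else -1
    step_y = 1 if y0 < y1 else -1
    error = dx + dy                  # running decision variable
    dots = []
    while True:
        dots.append((x0, y0))        # corresponds to one video memory write
        if x0 == x1 and y0 == y1:
            break
        doubled = 2 * error
        if doubled >= dy:            # step in x
            error += dy
            x0 += step_x
        if doubled <= dx:            # step in y
            error += dx
            y0 += step_y
    return dots
```

For example, bresenham_line(0, 0, 4, 2) visits five dots and never needs a multiplication or division inside the loop, which is what makes the algorithm attractive for a hardware accelerator.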

Ellipse drawing

The module ellipse_draw generates shapes that include circles and ellipses. The implementation of the module can be found in Appendix A.6, page 94. The module works as an accelerator which is controlled similarly to the line_draw module. The module uses an ellipse drawing algorithm which can be used to generate both ellipses and circles. It is based on Bresenham's circle generating algorithm [34]. The required input parameters are the x and y coordinates of the center of the ellipse, and the constants a and b. The ellipse generation is started using the signal ellipse_start. The signal ellipse_update updates the video memory. Similarly to the line drawing, this module also affects the status of the ready signal.
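As an illustration of the underlying idea, the sketch below shows the classic Bresenham circle algorithm, which corresponds to the special case a = b of the ellipse. It is a simplified software model with hypothetical names, not the ellipse algorithm of the ellipse_draw module in Appendix A.6.

```python
def circle_dots(center_x, center_y, radius):
    """Return the set of integer dots on a circle using integer arithmetic."""
    dots = set()
    x, y = 0, radius
    decision = 3 - 2 * radius
    while x <= y:
        # One computed dot is mirrored into all eight octants.
        for off_x, off_y in ((x, y), (y, x)):
            dots.add((center_x + off_x, center_y + off_y))
            dots.add((center_x - off_x, center_y + off_y))
            dots.add((center_x + off_x, center_y - off_y))
            dots.add((center_x - off_x, center_y - off_y))
        if decision < 0:
            decision += 4 * x + 6
        else:
            decision += 4 * (x - y) + 10
            y -= 1
        x += 1
    return dots
```

An ellipse generator follows the same pattern, but it has to weight the x and y steps with the constants a and b and can exploit only four-way symmetry.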

Flood fill

The module area_paint is a fill tool for the graphics controller. It is an accelerator whose implementation is described in Appendix A.7, page 97. The module uses the flood fill algorithm [35] to fill enclosed areas with the same color. The implementation of the algorithm is stack-based and works recursively. The module takes the start location and the color as its input parameters, and writes its output directly to the video memory. The paint_memory is used for checking the content of the display, as it is not possible to access the video memory outside of the module rgb_gen. Like the line and ellipse drawing, the area paint manipulates the ready signal.
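The principle of a stack-based flood fill can be sketched in a few lines of Python. The sketch below assumes four-connected filling and models the paint memory as a list of rows; both are assumptions made for illustration, and the names are hypothetical rather than taken from the area_paint module.

```python
def flood_fill(paint_memory, start_x, start_y, new_color):
    """Fill the 4-connected area around the start dot in place.

    paint_memory is a list of rows, indexed as paint_memory[y][x].
    Returns the number of dots that were repainted.
    """
    height = len(paint_memory)
    width = len(paint_memory[0])
    old_color = paint_memory[start_y][start_x]
    if old_color == new_color:
        return 0                      # nothing to do, avoids an endless loop
    stack = [(start_x, start_y)]      # explicit stack replaces recursion
    filled = 0
    while stack:
        x, y = stack.pop()
        if not (0 <= x < width and 0 <= y < height):
            continue                  # outside the display
        if paint_memory[y][x] != old_color:
            continue                  # border or already painted
        paint_memory[y][x] = new_color
        filled += 1
        stack.extend(((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)))
    return filled
```

The explicit stack plays the role that a hardware stack memory would play in an accelerator, and the read of paint_memory corresponds to checking the content of the display before writing.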

4.7 Inputs

The Basys 3 FPGA board has five push-buttons. In the test setup, the states of these buttons are used to interact with the ASIP38. However, using mechanical switches introduces a phenomenon called bounce. It occurs when a switch is pushed and begins to make contact. During that time the two contacts separate and reconnect, usually 10 to 100 times over a period of about 1 millisecond [36]. During that moment, the state of the button is difficult to determine. A common solution is to implement switch debouncing, which waits until the button state stabilizes and then registers the state of the button. This method can be implemented in either hardware or software. [16]

We select the hardware method, which uses a counter that increases on every clock cycle while the button is pressed. When a specific count is reached, the button is registered as pressed. Correspondingly, the counter decreases while the button is not pressed, and the button is registered as released when the counter reaches zero.

Push-buttons are asynchronous signal inputs which can cause metastability in a synchronous system. To prevent system failures caused by metastability, synchronizers must be used between the push-button inputs and the rest of the system. A synchronizer is implemented for each of the inputs. It consists of a chain of at least two flip-flops. [18] The implementations of the debouncing and the synchronizers are shown in the input module, which can be found in Appendix A.8, page 101.

The input module also needs to produce a specific code for each button, or for a combination of two buttons. Using the button codes, the software knows which button or combination is pressed. This is done with a synchronous process called button_select.
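The interplay of the synchronizer and the up/down debounce counter can be modelled cycle by cycle in software. The Python sketch below is only an illustration of the principle described above. The class names and the threshold value are hypothetical, and the real input module in Appendix A.8 uses a much larger count at 100 MHz.

```python
class Synchronizer:
    """Model of a two flip-flop chain between an asynchronous input and logic."""

    def __init__(self):
        self.stage1 = 0
        self.stage2 = 0

    def clock(self, async_in):
        # Both flip-flops update on the same clock edge.
        self.stage2 = self.stage1
        self.stage1 = async_in
        return self.stage2


class Debouncer:
    """Up/down counter debouncer: pressed at threshold, released at zero."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.count = 0
        self.pressed = 0

    def clock(self, level):
        if level:
            if self.count < self.threshold:
                self.count += 1
            if self.count == self.threshold:
                self.pressed = 1
        else:
            if self.count > 0:
                self.count -= 1
            if self.count == 0:
                self.pressed = 0
        return self.pressed


def debounce_trace(samples, threshold):
    """Run one button through the synchronizer and debouncer, one sample per clock."""
    synchronizer = Synchronizer()
    debouncer = Debouncer(threshold)
    return [debouncer.clock(synchronizer.clock(s)) for s in samples]
```

In this model a short glitch at the start of a press only moves the counter up and down by one, so the registered button state changes only after the input has been stable for long enough.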

4.8 Top level entity

The module top connects the other VHDL modules together to form the entire test system. It can be found in Appendix A.9, page 103. This system is connected to the resources of the FPGA. The input signals of the system are the 100 MHz system clock (clk) and the five push-buttons: center (btnC), right (btnR), left (btnL), down (btnD) and up (btnU). The outputs of the system are the VGA horizontal synchronization (Hsync), the vertical synchronization (Vsync), and three color intensity signals: vgaRed, vgaBlue and vgaGreen.

4.9 Behavioral simulation

Behavioral simulation is an important part of the design flow. It is needed to verify the logical correctness of the RTL design. Before simulating any of the design files, a VHDL test bench needs to be created. The test bench is a separate file which determines the input values that are used during the simulation. Behavioral simulation is an iterative process. It may take multiple simulations to achieve the desired functionality. However, simulation is not very useful for resolving problems with timing. For checking whether the required timing conditions are met, a static timing analysis is normally used [37].

Xilinx Vivado provides a built-in simulator which was used for verifying the desired functionality. At first, the correct behavior of the ASIP38 was verified in the simulator. This was performed using short test programs which tested the full instruction set. The correct operation of the datapath was then observed from the simulation results.

During the development of the graphics controller, each design file was first simulated separately. However, the complete operation of the graphics controller was not simulated, as verifying a raw RGB signal in a simulator is not practical. After the correct operation was confirmed, the whole system was simulated. This was done to check for any errors during the simulation. These errors prevent the synthesis of the design, and must be corrected before continuing the design flow.

4.10 Synthesis

Synthesis is the part of the design flow where the VHDL code is converted into a gate-level netlist. The produced netlist is used in the implementation phase to produce a placed and routed FPGA design. Xilinx Vivado uses a built-in synthesis tool for synthesizing the design. It generates a synthesis report which contains useful timing information.

One of the design goals of the ASIP38 was high performance. To meet that requirement, the processor should achieve clock speeds which are typical for the selected FPGA. Therefore, we choose to use the 100 MHz oscillator of the Basys 3 as the main clock of the system. At first, the ASIP38 is synthesized alone without the graphics controller. This gives us a timing report that estimates whether the processor passes all the timing constraints. The obtained timing summary is presented in Table 4.1.

Table 4.1: Timing summary after the synthesis.

                             ASIP38
Worst Negative Slack (WNS)   1.143 ns
Worst Hold Slack (WHS)       0.091 ns
Total Negative Slack (TNS)   0.000 ns
Total Hold Slack (THS)       0.000 ns

The term slack indicates the margin by which a timing requirement is met. Worst Negative Slack (WNS) refers to the worst slack of all the timing paths for maximum delay analysis. If it is positive, the path passes. If negative, the path fails. The Worst Hold Slack (WHS) refers to the worst slack of all the timing paths for minimum delay analysis. It must be positive to pass. Total Negative Slack (TNS) is the sum of all negative slack violations. It must be zero for the design to meet timing. Like the TNS, the Total Hold Slack (THS) is the sum of all negative hold slack violations. It must also be zero for the design to pass. [38]

The timing summary, which is shown in Table 4.1, confirmed that the design passes the static timing analysis. The maximum allowable clock frequency for the ASIP38 can now be estimated using the 100 MHz clock period and the WNS, as shown in Equation 4.1 [39]. The result of over 112 MHz is promising, but we also know that the implementation phase can reduce these values. Because the design is not too close to the 100 MHz limit, it is safe to continue to the implementation phase.

$$F_{max} = \frac{1}{T_{clk} - WNS} = \frac{1}{10\,\mathrm{ns} - 1.143\,\mathrm{ns}} = 112.905\,\mathrm{MHz} \tag{4.1}$$

4.11 Implementation

As mentioned in Section 4.1, the implementation consists of three sub-processes before the bitstream for the FPGA can be generated. The processes are the Optimization Design (Opt Design), Place Design and Route Design. It is possible to run optional optimizations between the sub-processes, but they are not necessary as we want to implement the design for testing purposes only. After the implementation has been run successfully, the summary of the timing analysis is presented in Table 4.2.

Table 4.2: Timing summary after the implementation.

                             ASIP38     Complete system
Worst Negative Slack (WNS)   0.680 ns   0.245 ns
Worst Hold Slack (WHS)       0.110 ns   0.108 ns
Total Negative Slack (TNS)   0.000 ns   0.000 ns
Total Hold Slack (THS)       0.000 ns   0.000 ns

As expected, the WNS has been reduced. This has an effect on the maximum allowable clock frequency of the ASIP38, as seen in Equation 4.2.

$$F_{max} = \frac{1}{T_{clk} - WNS} = \frac{1}{10\,\mathrm{ns} - 0.680\,\mathrm{ns}} = 107.296\,\mathrm{MHz} \tag{4.2}$$
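The frequency estimates can be reproduced with a few lines of Python. The sketch below simply evaluates the relation F_max = 1 / (T_clk − WNS) for the WNS values of Tables 4.1 and 4.2; the function name is hypothetical.

```python
def max_clock_mhz(clock_period_ns, wns_ns):
    """Estimate the maximum clock frequency in MHz from period and WNS."""
    critical_delay_ns = clock_period_ns - wns_ns   # delay of the slowest path
    return 1000.0 / critical_delay_ns              # 1/ns equals 1000 MHz


if __name__ == "__main__":
    cases = (
        ("ASIP38 after synthesis", 1.143),
        ("ASIP38 after implementation", 0.680),
        ("Complete system after implementation", 0.245),
    )
    for name, wns in cases:
        print(f"{name}: {max_clock_mhz(10.0, wns):.3f} MHz")
```

A negative WNS would give a result below 100 MHz, which is another way of seeing that such a design fails the 10 ns constraint.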

With the graphics controller and the input module included, the complete test system has a WNS of only 0.245 ns. The maximum allowable clock frequency of the test system is then 102.511 MHz. For the system to work, the clock period must be greater than or equal to the delay of the critical path. The timing report also shows that the critical path of the ASIP38, which has the slack of 0.680 ns, runs between the program memory and the accumulator. It is caused by the Block RAM, as some extra delay accumulates due to the long routing distances. It is often possible to shorten these delays by selecting the retiming option from the settings of the synthesis engine. This moves the registers around while maintaining the original functionality [20].

The ASIP38 has a design goal of small implementation size. Table 4.3 shows the FPGA resource utilization of the ASIP38. From the table, it can be clearly seen that this goal was met, as the number of used LUTs is only 573. The only heavily used resource is the Block RAM, but its amount can be changed according to the needs of the application. During the implementation, a size of 8k was used for the program memory, and 16k for the data memory.

Table 4.3: FPGA resource utilization of the ASIP38. The utilization of the complete test system is given in parentheses.

Resource   Utilization   Available   Utilization %
LUT        573 (7175)    20800       2.75 (34.50)
LUTRAM     22 (3158)     9600        0.23 (32.90)
FF         360 (1094)    41600       0.87 (2.63)
BRAM       24 (44)       50          48.00 (88.00)
DSP        4 (16)        90          4.44 (17.78)
IO         3 (20)        106         2.83 (18.87)
BUFG       1 (2)         32          3.13 (6.25)

Table 4.3 shows that the complete system took considerably more resources than the ASIP38 alone. The larger design needs to use more routing. Because of this, the critical path emerged inside the graphics controller. The implementation is now complete and ready for testing.

5 Verification and testing

This section focuses on testing the board-level behavior of the ASIP38. The purpose of this type of testing is to verify the correct functionality of the ASIP38. This section also covers the software development tools which are needed for programming the ASIP38. A series of test programs is then introduced, and their results are analyzed.

5.1 Hardware verification

Hardware verification is a critical step in any RTL design, as the implementation must perform to its specification. Furthermore, neither the behavioral simulation nor the static timing analysis can guarantee the correct operation of the design inside an FPGA. Hardware verification can involve compliance tests, or other important parameters that must be met. It is typically done by simulating or prototyping. [40] The most practical method for us is FPGA prototyping. Doing otherwise would require using some other hardware verification method, for example, the Universal Verification Methodology (UVM).

Xilinx Vivado provides different IPs for hardware debugging purposes [41]. For example, a logic analyzer core could be used for monitoring internal signals of the FPGA. For our needs, the graphics controller serves as an on-chip debugger by displaying the output signals of the ASIP38. The real-time output can also be quicker to verify than the output of a debugger. The graphics controller has been independently verified earlier, which is an essential requirement for a successful verification of the ASIP38.

The verification by prototyping involves board-level testing with a series of test programs which are executed by the processor. The results are then displayed in real time by the graphics controller. The program complexity should be close to the tasks the processor would normally be executing. If a test program displays the expected output, it can be considered successful. Board-level testing also provides a practical way of proving that the VHDL design is synthesizable on an FPGA.

5.2 Assembler

The processor can be programmed by creating instructions directly in machine language and transferring them to the program memory. In the Block Memory Generator of Xilinx Vivado, this is done by writing the program lines to a text file in a hexadecimal format, and loading the contents of the file directly into the program memory.

However, programming in machine language causes a well-known problem, especially with the target addresses of the branch instructions. For example, every time a new line is added to the program, the target addresses of the JMP or CAL instructions change. The problem gets bigger the longer the program is. A solution is to use a symbolic machine language (Assembly), and an assembler that converts assembly code into machine code [18]. An assembler is a practically mandatory software tool in almost every computer system. Therefore, we must implement a custom assembler program to help the software development of the ASIP38.

In an assembly language, labels are used instead of absolute numeric addresses. Therefore, all branch target addresses become relative. In an assembly instruction, the leftmost label represents the program line address, and the rightmost label represents either a memory address or an immediate. The middle field is reserved for the instruction. If an instruction is not a destination of a branch instruction, the leftmost label is replaced by '-'. Similarly, if an instruction does not contain an address or other number, the rightmost label is replaced with '-'. The dashes are needed to allow the assembler to recognize an empty field. The assembly language format of the processor is shown in Table 5.1.

Table 5.1: ASIP38’s instruction format in assembly language.

Label   Instruction   Address / Immediate
MAIN    LDI           5
-       STO           TEMP
LOOP    LDI           ffffffff
-       XOR           TEMP
-       SNZ           -
-       JMP           MAIN
-       LDA           TEMP
-       INC           -
-       STO           TEMP
-       JMP           LOOP

TEMP    EQU           0

The RAM locations can be addressed using an assembler directive which assigns a numeric constant to a symbolic label. Such a label can be used to point to a single RAM location. A commonly used name for this type of directive is EQU [42]. For example, a constant TEST can be given a value of 2 as follows: TEST EQU 2. Since TEST now corresponds to the number 2, it can be used as a variable for a RAM location. For example, - LDA TEST loads a number from the third data memory location into the accumulator.

The assembler works in two steps. At first, a table is created where the leftmost label of each instruction is given an index number. The table provides the address where each of the labels is located. The second step produces the final hexadecimal code based on the first table. At the same time, the rightmost labels are replaced with their hexadecimal values. The operation codes are also replaced with predefined hexadecimal values.

A hexadecimal instruction has 10 characters. Due to the processor's 32-bit data bus, the eight rightmost characters are reserved for the address/immediate. The two leftmost characters represent the 6-bit operation code. For example, the instruction word 030000001a corresponds to a JMP instruction to the location 1a of the program memory.

The output file of the assembler is a complete file which can be read by the Block Memory Generator of Xilinx Vivado. The file has the correct format and the header fields required by Vivado. This accelerates the software development, as the required format is produced automatically. During testing, the performance of the implemented assembler appeared to be fairly good. The assembly took only a few seconds even when the code was several thousand lines long. The Python source code of the assembler is shown in Appendix A.10, page 107.
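The two-step principle can be illustrated with a minimal Python sketch. It is not the assembler of Appendix A.10: the opcode values are hypothetical except for JMP, whose value 03 is taken from the example word 030000001a, numeric operands are assumed to be hexadecimal, and the Vivado header fields are left out.

```python
# Hypothetical opcode values for illustration. Only JMP = 0x03 is taken from
# the example instruction word 030000001a; the others are assumptions.
OPCODES = {"LDI": 0x01, "STO": 0x02, "JMP": 0x03, "LDA": 0x04,
           "XOR": 0x05, "SNZ": 0x06, "INC": 0x07}


def assemble(source):
    """Translate three-column assembly text into 10-character hex words."""
    rows = [line.split() for line in source.splitlines() if line.strip()]

    # Step 1: build the symbol table (branch labels and EQU constants).
    symbols = {}
    address = 0
    for label, mnemonic, operand in rows:
        if mnemonic == "EQU":
            symbols[label] = int(operand, 16)   # constant, occupies no word
        else:
            if label != "-":
                symbols[label] = address        # address of this program line
            address += 1

    # Step 2: emit one word per instruction: 2 hex digits for the opcode,
    # 8 hex digits for the address or immediate.
    words = []
    for label, mnemonic, operand in rows:
        if mnemonic == "EQU":
            continue
        if operand == "-":
            value = 0
        elif operand in symbols:
            value = symbols[operand]
        else:
            value = int(operand, 16)            # assumed hexadecimal literal
        words.append(f"{OPCODES[mnemonic]:02x}{value & 0xFFFFFFFF:08x}")
    return words
```

Because the symbol table is complete before any word is emitted, a label such as TEMP can be used before the line that defines it, which is the main reason for using two steps.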

5.3 Test programs

The ASIP38 is tested with a series of test programs which are loaded into the program memory. They are designed to test the whole instruction set of the ASIP38. The programs can be divided into two categories: those which read input from the user, and those which do not. We are interested in programs which read user input, as the behavior of this type of program is not completely predetermined. They are more likely to reveal design flaws in the hardware.

The test programs can be either small or large. Smaller tests concentrate on one specific processor component. For example, the correct operation of the program stack can be tested by calling subroutines from another subroutine, printing an output, and returning back to the starting point. These programs perform well when the testing concentrates on individual features of the processor. However, they do not always cover everything, and it is reasonable to also test the processor with programs which use the full instruction set.

A practical approach to hardware testing is to first use smaller test programs, and then start writing larger test programs. Complex software usually combines several smaller tests into one entity. This provides a good alternative for verifying the operation of the processor.

Normal instructions

A program called SNAKE was written in assembly for testing the general operation, subroutine calls, and indirect addressing of the ASIP38. Its functionality is close to a classic snake game. Another program called BRICKS was written in assembly in case something was not covered by the SNAKE. The BRICKS is a typical brick-breaking game, and more complicated than the SNAKE. During testing, both the SNAKE and the BRICKS operated correctly and without bugs. This suggested that the hardware of the ASIP38 was operating normally on the FPGA, and more test programs could be written. Figure 5.1 shows the BRICKS in action.

Figure 5.1: The BRICKS program.

Application-specific instructions

To use the application-specific SET instruction to its full potential, a program called PAINT was written. It allows drawing lines, ellipses and circles, as well as drawing freehand. The line and ellipse drawing use the Bresenham's algorithms as described in Section 4.6. The program also provides an eraser tool, and a paint bucket which uses the flood fill algorithm. As the algorithms are implemented inside the graphics controller, the SET instruction was tested comprehensively. All commands of the SET instruction were tested, which also served as a functionality test for the graphics controller. As a result, the graphics controller operated as intended in the PAINT software. Figure 5.2 shows the PAINT software in operation.

Figure 5.2: The PAINT program.

The MUL is an application-specific instruction of the ASIP38. It makes it possible to multiply signed 32-bit fixed-point numbers. Moreover, the MUL instruction has a built-in result selection which was designed especially for a test program called CUBE. The CUBE program is capable of 3D rendering by producing a rotating cube in the middle of the screen. It allows the cube to be rotated around its x, y and z axes. The shape of the cube is constructed by rendering lines in a three-dimensional Cartesian coordinate space. The direction and the speed of rotation can be changed by the user.

Each corner of the cube represents one point whose x, y and z coordinates need to be updated every time the points are rotated around the origin. For example, the previous x and y coordinates are used to calculate the new coordinates of one corner, as shown in Equation 5.1 [43].

$$\begin{aligned} x' &= x\cos\theta - y\sin\theta \\ y' &= y\cos\theta + x\sin\theta \end{aligned} \tag{5.1}$$

In Equation 5.1, trigonometry is used for calculating the variables x′ and y′, which represent the new coordinate values. The variables x and y represent the previous values, and θ represents the angle of rotation [43]. Figure 5.3 shows the rotation of the x and y coordinates around the origin. Equation 5.1 can then be used to create a program which rotates the cube around its x, y and z axes.

Figure 5.3: Rotating x and y coordinates around the origin by θ.
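Equation 5.1 translates directly into code. The following Python sketch uses floating-point numbers for clarity, whereas the CUBE program works with fixed-point numbers; the function names are hypothetical. Applying the same rotation to another pair of coordinates gives the rotation around another axis.

```python
import math


def rotate_xy(x, y, theta):
    """Rotate the point (x, y) around the origin by theta radians (Eq. 5.1)."""
    cos_t = math.cos(theta)
    sin_t = math.sin(theta)
    return x * cos_t - y * sin_t, y * cos_t + x * sin_t


def rotate_corner_z(corner, theta):
    """Rotate one (x, y, z) corner of the cube around the z axis."""
    x, y, z = corner
    new_x, new_y = rotate_xy(x, y, theta)
    return new_x, new_y, z


# The eight corners of a cube centered at the origin.
CUBE_CORNERS = [(x, y, z) for x in (-1, 1) for y in (-1, 1) for z in (-1, 1)]
```

Because every new position is computed from the previous one, any rounding error in the multiplications is carried over to the next rotation step.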

As the cube is rotated by using its previous coordinates, the precision of the multiplication must be high enough, or the cube will become distorted over time. A working solution is to use 32-bit signed fixed-point numbers where the fractional part is at least 24 bits long. Thus, the multiplication precision is high enough to prevent an accumulating error. To handle a fixed-point multiplication result, the MUL instruction was designed to select the bits 55 to 24 from the 64-bit multiplier output.
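The effect of selecting the bits 55 to 24 can be demonstrated with a small Python model. The sketch below assumes a format with 8 integer bits and 24 fractional bits and two's complement 32-bit words; the helper names are hypothetical and the model does not describe the internal structure of the multiplier.

```python
FRACTION_BITS = 24
MASK32 = 0xFFFFFFFF


def to_fixed(value):
    """Encode a number as a 32-bit word with 24 fractional bits."""
    return int(round(value * (1 << FRACTION_BITS))) & MASK32


def to_signed32(word):
    """Interpret a 32-bit word as a two's complement integer."""
    return word - (1 << 32) if word & 0x80000000 else word


def to_float(word):
    """Decode a 32-bit fixed-point word back to a Python float."""
    return to_signed32(word) / (1 << FRACTION_BITS)


def mul_fixed(a, b):
    """Multiply two fixed-point words and keep bits 55..24 of the product."""
    product = to_signed32(a) * to_signed32(b)      # fits in 64 signed bits
    return (product >> FRACTION_BITS) & MASK32     # shift out 24 extra fraction bits
```

The product of two numbers with 24 fractional bits has 48 fractional bits, so dropping the lowest 24 bits returns the result to the original format without a separate shift instruction.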

Figure 5.4 shows the CUBE program in action. After continuous running of the CUBE program, no visible distortion was detected. This was a clear indicator that the software and the hardware operated correctly. This also confirmed the desired behaviour of the full instruction set.

Figure 5.4: The CUBE program.

5.4 Board-level testing

The instruction set of the ASIP38 was tested thoroughly. Real-time testing with longer test programs was found especially useful. Continuously running software, such as the CUBE program, also tested the overall system stability. As a result, no hardware flaws were detected during any of the test programs. The testing indicates that the functionality of the ASIP38 can be successfully synthesized on an Artix 7 FPGA with a clock speed of 100 MHz. However, further verification with an on-chip logic analyzer is recommended, as the performed tests were limited to the real-time board-level scenario.

6 Analysis of results

After the successful implementation, the features of the ASIP38 can be analyzed. At first, we summarize the results of the design process which was described in Section 3. After that, we compare the results to other open source processors with similar features to get a better overview of the results.

6.1 Results of the design process

The ASIP38 is an application-specific processor which has a 32-bit data bus for efficient operations on large numbers. It uses a Harvard memory model for a straightforward connection of the program and data memories. The ASIP38 can be categorized as a RISC processor, as its instruction set is small, and the instructions fit in a single word. Having only two addressing modes, as explained in Section 3.4.2, also supports the RISC classification.

To save resources, the design of the ASIP38 does not use a register file. The accumulator is the only register for processing data. This forces the arithmetic operations to be performed between the memory and the accumulator, which is a CISC property. As a result, the ASIP38 follows the RISC principles only partially, which is a good compromise in terms of speed and low resource usage.

The ASIP38 uses a multi-cycle ISA, and is directly compatible with the common memory blocks inside an FPGA. With this type of instruction cycle, no additional clock cycles are spent waiting for the output of a memory, such as the Block RAM. As a result, the ASIP38 has a true performance of only four clock cycles per instruction. The features of the ASIP38 are presented in Table 6.1.

6.2 Processor comparison

OpenCores [11] is an online community for the development of digital open source hardware. Its website offers binary compatible clones of commonly used processor architectures. GitHub [12] is another popular website which is used for hosting open source hardware projects. OpenCores provides a comprehensive list [44] of open source processors from the websites above. As shown in Table 6.1, we can select a few well-known and similar-sized processors from the list, and compare their features to the ASIP38. Although they are not ASIPs, the comparison to existing processor architectures helps to form an overview of the tasks the ASIP38 would be suitable for. This also helps to evaluate how well the design goals in Section 1.1 were achieved.

Table 6.1 presents the features of a processor core called T65. The ISA of the T65 is directly compatible with the 6502 microprocessor which was developed by MOS Technology in 1975. This processor has an 8-bit data bus, one accumulator and two index registers. [45] Unlike the ASIP38, the 6502 implements the program stack inside the RAM. It also has 13 addressing modes, which is a CISC feature. The implementation of the T65 uses only 575 LUTs [44], which is almost equal to the ASIP38.

Table 6.1: The features of the ASIP38 compared to small open source processor cores.

                 ASIP38        T65         Light52   AVR Core    PicoRV32
Category:        ASIP core     uP core     uC core   uC core     uP core
Architecture:    Custom        6502        8051      ATmega103   RISC-V
Data bus:        32-bit        8-bit       8-bit     8-bit       32-bit
Memory model:    Harvard       Princeton   Harvard   Modified    Modified
ISA:             RISC-like     RISC-like   CISC      RISC        RISC
CPI:             4             2-7         2-8       1           3-15
Instructions:    31 (64 max)   56          256       121         55
Pipeline:        No            No          No        2-stage     No
Program space:   16k (4G)      64k         64k       128k        4G
Data space:      16k (4G)      64k         64k       64k         4G
Program stack:   32-level      In RAM      In RAM    In RAM      In RAM
Peripherals:     0             0           2         3           0
HLL support:     No            Yes         Yes       Yes         Yes
Used LUTs:       573           575         1022      2135        725

The next processor in the table is an 8051 compatible microcontroller core. The 8051 was developed by Intel in the early 1980s, and has been widely used for small scale embedded systems [16]. The 8051 has an 8-bit data bus and a complex instruction set. This particular 8051 core also has peripheral modules for a timer and a Universal Asynchronous Receiver Transmitter (UART). Its implementation size is 1022 LUTs [44].

Table 6.1 also shows an ATmega103 compatible microcontroller core. It uses the AVR RISC architecture developed in 1996. The ATmega103 uses a 2-stage pipeline which executes an instruction in a single clock cycle [46]. It has three peripheral modules: two for timers and one for UART. With all the features, the resource consumption of this processor core is 2135 LUTs. [44]

The last processor in the table is the PicoRV32. It uses an open source ISA called RISC-V which has been developed at the University of California, Berkeley. The RISC-V was introduced in 2010 and has been increasing in popularity since then. [47] The PicoRV32 is described as a size-optimized RISC-V CPU. It uses a 32-bit data bus and employs a non-pipelined version of the RV32IMC ISA with 55 instructions [48]. As a result, the PicoRV32 uses only 725 LUTs [44].

A significant benefit of the listed processor architectures is their support for High-Level Languages (HLLs). This means that programs can be written not only in Assembly language, but also in high-level programming languages such as C. It makes programming more efficient, and allows using programs which were written for different hardware.

The ASIP38 uses 573 LUTs with three application-specific instructions. Although the ASIP38 is a 32-bit processor, its implementation size is approximately similar to the other processor cores listed in Table 6.1. Moreover, none of the listed processor cores supports Wishbone or any other bus interface, which can increase the LUT count.

In general, the features of the ASIP38 are fairly similar to the processor cores listed in Table 6.1. The key difference is that the listed processors were not originally designed to be customizable. Therefore, any later modifications, such as adding new instructions, can be difficult and time consuming. In the ASIP38, adding new instructions, registers and other hardware can be done easily, because its ISA was designed to be customizable. The comparison indicates that the ASIP38 can be proposed for a role where a small application-specific processor is required. The next section discusses its future upgrades and improvements.

7 Future upgrades

In Section 1.1, the structure of the designed processor was required to be customizable. This makes it possible to develop the ASIP38 further by adding new features. The features can be either internal hardware of the processor or pure software tools. This section presents some suggestions for future work.

7.1 Additional hardware

Interrupts

The ASIP38 could be designed to have interrupts. This would allow the current program to be interrupted by an urgent event which needs to be processed immediately. An interrupt would first push the current PC address to the top of the program stack. A flag would then be set to prevent new interrupts during the execution of the current interrupt. A subroutine called an Interrupt Service Routine (ISR) would then be executed from a fixed location in the program memory [2].

The interrupts could be triggered either internally or externally. An internal interrupt is typically triggered by a peripheral, for example a configurable timer module. An external interrupt is triggered by the state of an input port. [14] Enabling and disabling interrupts could be done by separate instructions, or by using a configurable register. Other peripherals, such as timer modules, would also need their own control registers. For accessing the registers, new instructions would have to be implemented.

Multitasking capability

Sometimes interrupts are not enough to perform many tasks in real time. Furthermore, placing long and complex code inside an ISR can block other parts of the program from being executed. Subroutines can also take a long time to complete, and other subroutines have to wait until one finishes. Therefore, one significant improvement for the ASIP38 would be the ability to perform multitasking. In a typical multitasking scheme, a Real-Time Operating System (RTOS) would share processor time among all runnable tasks. Thus, even the most time-consuming tasks can be run without disturbing the real-time execution of other tasks. This method of sharing processor time is called scheduling. [14]

An important part of any RTOS is a scheduler. It is responsible for deciding which task should be executed at any particular time. A preemptive scheduler is able to switch tasks by using a process called a context switch. Its function is to save the state of the current task, or context. The context switch makes it possible to restore the task, and resume its execution later from the same point. [14]

A context switch can be triggered, for example, by a timer interrupt. The functionality of the context switch is located inside an ISR which saves all register values of the current task to the data memory. The processor’s registers would then be loaded from the RAM with the values of another task. As the PC value is also loaded, the resumed task continues from the location where it was previously suspended. In the case of the ASIP38, the PC, AC, and F registers need to be saved and loaded by the context switch. Thus, these registers must be accessible from the instruction set. Three new instructions would be needed: two for saving and loading the PC, and one for reading the F register.
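One way to make the PC and the F register accessible is to add them as sources of the data bus, so that a store-type instruction can write them to the data memory. The following VHDL sketch illustrates the idea under stated assumptions. The entity name, the 2-bit bus_sel encoding and the SEL_* constants are hypothetical, and the sketch is not taken from the ASIP38 sources.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Sketch only: a data bus multiplexer with the PC and F register
-- added as selectable sources. All names are assumptions.
entity ctx_bus_sketch is
    port (
        bus_sel  : in  std_logic_vector(1 downto 0);
        ac       : in  std_logic_vector(31 downto 0);
        f_reg    : in  std_logic_vector(31 downto 0);
        pc       : in  std_logic_vector(13 downto 0);
        data_bus : out std_logic_vector(31 downto 0)
    );
end ctx_bus_sketch;

architecture rtl of ctx_bus_sketch is
    constant SEL_AC : std_logic_vector(1 downto 0) := "00";
    constant SEL_PC : std_logic_vector(1 downto 0) := "01";
    constant SEL_F  : std_logic_vector(1 downto 0) := "10";
    -- Upper bits used to zero-extend the 14-bit PC to the 32-bit bus
    constant PC_PAD : std_logic_vector(17 downto 0) := (others => '0');
begin
    process(bus_sel, ac, f_reg, pc)
    begin
        case bus_sel is
            when SEL_AC => data_bus <= ac;           -- existing source
            when SEL_PC => data_bus <= PC_PAD & pc;  -- new: save the PC
            when SEL_F  => data_bus <= f_reg;        -- new: save the F register
            when others => data_bus <= (others => '0');
        end case;
    end process;
end rtl;
```

Restoring the PC would similarly require a new PC select option which loads the PC from the data memory output. This is also an assumption about a possible implementation.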

7.2 Software tools

An assembler is an important tool for low-level software development. However, an assembly language is strictly tied to the processor and the instruction set it was designed for. A high-level programming language, such as C, provides a number of benefits. For example, it can be compiled to different microprocessor architectures, and it allows using code which is already available for other systems. [2] The C syntax can also be a lot easier to read, which makes it better for complex programs.

The support for a high-level programming language would be beneficial for the software development of the ASIP38. The first option would be to modify its instruction set to support a suitable high-level language compiler. The second option would be to develop a program which translates the assembly code of another processor to the assembly code of the ASIP38. However, this might be difficult depending on the differences in the instruction sets. Of the two options, modifying the instruction set might be easier. It only requires adding the instructions and registers of another processor architecture, but at the same time it changes the architecture of the ASIP38 to some degree. In both cases, the application-specific instructions of the ASIP38 would still have to be inserted manually into the assembly code.

Suitable compilers would be those whose target assembly language is closest to that of the ASIP38. For example, the C compilers of accumulator-based 8-bit microcontrollers could be used.

7.3 Bus protocols

An on-chip bus is the key component of a SoC. It is used to connect different system components including processors, memory, and peripherals. All communication occurs on a unified bus architecture. [1] This has the benefit of connecting several IP components using one shared communication protocol. This subsection investigates the connectivity of the ASIP38 to the most commonly available on-chip bus standards: AMBA, CoreConnect and Wishbone.

The Advanced Microcontroller Bus Architecture (AMBA) is an open bus specification developed by the ARM Corporation. It was developed in 1995, and has become the de facto standard for interfacing components in a SoC. It has evolved to its third generation and includes three variants: a low-bandwidth general-purpose bus called Advanced Peripheral Bus (APB), a high-speed single-frequency bus called Advanced High-performance Bus (AHB), and a high-speed multifrequency bus called AMBA Advanced Extensible Interface (AXI). [1]

The CoreConnect is a bus system from IBM Corporation. It was developed for IBM’s PowerPC line of processors. Its functionality is similar to the AMBA bus. Its main variants include a low-bandwidth general-purpose bus called On-chip Peripheral Bus (OPB) and a high-speed single-frequency bus called Processor Local Bus (PLB). [1]

The Wishbone is an open source bus system developed by the SiliCore Corporation [1]. It is used by many designs on the OpenCores [11] website. The specification defines master and slave interfaces which can be used to form different bus topologies, such as point-to-point, many-to-many or even crossbar switches [49].

All of these bus systems could provide a communication protocol for the ASIP38 to exchange data with external IP blocks. However, the ASIP38 was developed from an open source perspective, and it is likely to be used in an open source project. In the OpenCores, the IPs are usually made compatible with the Wishbone. The Wishbone is also simpler than AMBA or CoreConnect [1]. The OpenCores website also provides a simplified version of the Wishbone called Simple Bus Architecture (SBA) [50]. The SBA implements a minimum subset of the Wishbone, and can be easily connected to other Wishbone compatible cores.

To connect the ASIP38 through a bus system, a custom bus adapter needs to be designed. An adapter could be easily implemented to support the SBA, as its core functionality is based on an FSM. Because of its reduced amount of logic, the SBA does not consume as many FPGA resources as the other bus systems. This could be useful in the applications of the ASIP38.
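An FSM-based bus adapter of this kind can be sketched in VHDL. The following is a minimal, untested illustration of a master performing single read and write cycles, and it is not part of the ASIP38 sources. The bus-side signal names follow the Wishbone B.3 master interface [49]. The processor-side ports (req, wr, addr, wdata, rdata, done) and the entity name are hypothetical, and the SBA signal set is not reproduced here.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Sketch only: single-cycle Wishbone master with a two-state FSM.
entity wb_master_sketch is
    port (
        clk_i : in  std_logic;
        rst_i : in  std_logic;   -- synchronous reset
        -- Processor side (hypothetical names)
        req   : in  std_logic;   -- start a transfer
        wr    : in  std_logic;   -- '1' = write, '0' = read
        addr  : in  std_logic_vector(31 downto 0);
        wdata : in  std_logic_vector(31 downto 0);
        rdata : out std_logic_vector(31 downto 0);
        done  : out std_logic := '0';   -- one-cycle pulse after the transfer
        -- Wishbone master side
        adr_o : out std_logic_vector(31 downto 0);
        dat_o : out std_logic_vector(31 downto 0);
        dat_i : in  std_logic_vector(31 downto 0);
        we_o  : out std_logic := '0';
        sel_o : out std_logic_vector(3 downto 0);
        stb_o : out std_logic := '0';
        cyc_o : out std_logic := '0';
        ack_i : in  std_logic
    );
end wb_master_sketch;

architecture rtl of wb_master_sketch is
    type state_type is (Idle, Transfer);
    signal state : state_type := Idle;
begin
    sel_o <= "1111";   -- always transfer full 32-bit words

    process(clk_i)
    begin
        if rising_edge(clk_i) then
            done <= '0';
            if rst_i = '1' then
                state <= Idle;
                cyc_o <= '0';
                stb_o <= '0';
                we_o  <= '0';
            else
                case state is
                    when Idle =>
                        if req = '1' then
                            -- Present the address and data, start the bus cycle
                            adr_o <= addr;
                            dat_o <= wdata;
                            we_o  <= wr;
                            cyc_o <= '1';
                            stb_o <= '1';
                            state <= Transfer;
                        end if;

                    when Transfer =>
                        if ack_i = '1' then
                            -- Slave acknowledged: capture data (meaningful
                            -- for reads only) and end the bus cycle
                            rdata <= dat_i;
                            cyc_o <= '0';
                            stb_o <= '0';
                            we_o  <= '0';
                            done  <= '1';
                            state <= Idle;
                        end if;
                end case;
            end if;
        end if;
    end process;
end rtl;
```

In the ASIP38, the processor side of such an adapter would probably be driven by new load and store instructions, with the processor held in a wait state until done is asserted. This is one possible arrangement and not a tested design.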

8 Conclusion

The purpose of this thesis was to study the core knowledge and skills required to implement an ASIP processor. The goal was achieved by designing and implementing an open source ASIP. The designed processor was named ASIP38. It was implemented in VHDL for an FPGA target device. To test the ASIP38 in real time, a VGA graphics controller was implemented in VHDL and connected to the ASIP38 to create a complete test system.

The design work was divided into five main goals. First, the structure of the processor needed to be customizable, as this is an essential feature of an ASIP processor. The Instruction Set Architecture (ISA) of the ASIP38 was designed to have many characteristics of a RISC ISA. For example, instructions had to fit in a single word, and the number of instructions was kept low. The RISC approach allowed the processor to have only two addressing modes, which makes later modifications easier. The testing with the graphics controller also required the design of three application-specific instructions.

Second, the processor was required to be directly compatible with different memory configurations inside an FPGA. This ensured that many types of available memory could be used, and that the processor does not depend on just a single memory configuration. This goal was achieved in the design of the instruction cycle.

Third, the processor’s implementation size needed to be small. The goal was achieved by implementing only the necessary instructions and registers, instead of a complex instruction set and a register file.

Fourth, as the processor was required to be optimized for performance, the possibility of executing each instruction in one clock cycle was investigated. However, the single-cycle approach was not compatible with the Xilinx Block RAM. As a result, an instruction cycle with four clock cycles per instruction was selected. A pipelined instruction cycle was also considered, but it would have made the design too complex for the gained performance. A more feasible option was to increase the maximum clock frequency by keeping the critical path of the RTL design as short as possible. This helped to exceed the target clock speed of 100 MHz in the implementation phase.

The final goal was to make the processor user-friendly by providing the necessary tools for software development. The focus was set on implementing an assembler, as it is a crucial programming tool. The assembler implementation was a success, and the assembler was used for assembling test programs into binary code.

The VHDL implementation of the ASIP38 was first simulated using VHDL test benches, and finally synthesised and implemented using the design flow for Xilinx FPGAs. The implementation was then verified directly on an FPGA prototype using complex test programs. The programs were successfully used for testing the complete instruction set.

The ASIP38’s features were compared to small open source processor cores. As a result, many of its features seemed to be similar to common 8-bit processor architectures. The compared implementation sizes were fairly similar although the ASIP38 has a 32-bit data bus. The comparison indicated that the ASIP38 has potential use where a small application-specific processor is required.

This thesis also investigated future work on the ASIP38. A potential improvement would be the implementation of an interrupt mechanism. Such a feature would improve the usability of the processor by executing tasks which are triggered by some external event. This could be a good improvement as it allows using interrupt-driven I/O. The interrupt mechanism would also allow the use of scheduler software in a multitasking scenario. The support for a high-level programming language was also investigated. A feasible option would be to implement a program which would translate the assembly code of another processor to the assembly code of the ASIP38.

Interfacing with other hardware is often done with a common protocol in a SoC design. As the ASIP38 was designed as open source hardware, a few popular bus protocols were examined for connecting it with other IP components. Consequently, the Wishbone bus stood out due to its popularity in open source designs. Therefore, making the processor Wishbone compliant has the most benefits of all the upgrades.

The ASIP38 offers a solution where a small and customizable 32-bit processor with an application-specific instruction set is needed. Especially with some of the recommended upgrades, it can be effectively used in SoC designs.

References

[1] R. Schaumont. A Practical Introduction to Hardware/Software Codesign. 1st ed. New York: Springer, 2010. isbn: 1441959998.
[2] F. Vahid and T. Givargis. Embedded System Design: A Unified Hardware/Software Introduction. 1st ed. New York: John Wiley & Sons, 2002. isbn: 0471386782.
[3] Western Digital Corporation. Western Digital To Accelerate The Future Of Next-Generation Computing Architectures For Big Data And Fast Data Environments. Accessed: 2019-08-05. url: https://www.westerndigital.com/company/newsroom/press-releases/2017/2017-11-28-western-digital-to-accelerate-the-future-of-next-generation-computing-architectures-for-big-data-and-fast-data-environments.
[4] Wave Computing Inc. Wave Computing Launches the MIPS Open Initiative. Accessed: 2019-08-06. url: https://wavecomp.ai/wave-computing-launches-the-mips-open-initiative.
[5] IoT Analytics. State of the IoT 2018: Number of IoT devices now at 7B – Market accelerating. Accessed: 2019-08-06. url: https://iot-analytics.com/state-of-the-iot-update-q1-q2-2018-number-of-iot-devices-now-7b.
[6] International Data Corporation. The Growth in Connected IoT Devices Is Expected to Generate 79.4ZB of Data in 2025, According to a New IDC Forecast. Accessed: 2019-07-10. url: https://www.idc.com/getdoc.jsp?containerId=prUS45213219.
[7] C. Lundqvist et al. “Key technology choices for optimal massive IoT devices”. In: Ericsson Technology Review 98 (2019), pp. 48–58.
[8] R. A. Kjellby et al. “Self-Powered IoT Device for Indoor Applications”. In: 2018 31st International Conference on VLSI Design and 2018 17th International Conference on Embedded Systems (VLSID) (2018), pp. 455–456.
[9] S. Shahabuddin et al. “Design of a transport triggered for turbo decoding”. In: Analog Integrated Circuits and Signal Processing 78.3 (2014), pp. 611–622.
[10] J. Yu et al. “Vector Processing as a Soft Processor Accelerator”. In: ACM Transactions on Reconfigurable Technology and Systems (TRETS) 2.2 (2009), pp. 1–31.
[11] OpenCores. Accessed: 2019-09-21. url: https://opencores.org.

[12] GitHub. Accessed: 2019-09-21. url: https://github.com.
[13] P. Barry and P. Crowley. Modern Embedded Computing: Designing Connected, Pervasive, Media-Rich Systems. 1st ed. Waltham, MA: Morgan Kaufmann, 2012. isbn: 0123914906.
[14] X. Fan. Real-Time Embedded Systems: Design Principles and Engineering Practices. 1st ed. Oxford: Newnes, 2015. isbn: 0128015071.
[15] S. Dey et al. “Using a soft core in a SoC design: experiences with picoJava”. In: IEEE Design & Test of Computers 17.3 (2000), pp. 60–71.
[16] C. Hamacher et al. Computer Organization and Embedded Systems. 6th ed. New York: McGraw-Hill Education, 2011. isbn: 0073380652.
[17] M. A. Laughton and D. F. Warne. Electrical Engineer’s Reference Book. 16th ed. Oxford: Newnes, 2002. isbn: 0750646373.
[18] D. Patterson and J. Hennessy. Computer Organization and Design MIPS Edition: The Hardware/Software Interface. 5th ed. Oxford: Morgan Kaufmann, 2013. isbn: 978-0124077263.
[19] M. M. Mano and C. R. Kime. Logic and Computer Design Fundamentals: Pearson New International Edition. 4th ed. Harlow, Essex: Pearson Education Limited, 2013. isbn: 1292024682.
[20] S. Kilts. Advanced FPGA Design: Architecture, Implementation, and Optimization. 1st ed. New York: Wiley-IEEE Press, 2007. isbn: 0470054379.
[21] 7 Series FPGAs Memory Resources. 1st ed. Xilinx Inc. July 2019. url: https://www.xilinx.com/support/documentation/user_guides/ug473_7Series_Memory_Resources.pdf.
[22] J. Hennessy and D. Patterson. Computer Architecture: A Quantitative Approach. 5th ed. Waltham, MA: Morgan Kaufmann, 2011. isbn: 012383872X.
[23] PICmicro Mid-Range MCU Family Reference Manual. Microchip Technology Inc. 1997. url: http://ww1.microchip.com/downloads/en/devicedoc/33023a.pdf.
[24] P. P. Chu. FPGA Prototyping by VHDL Examples: Xilinx Spartan-3 Version. 1st ed. New Jersey: Wiley-Interscience, 2008. isbn: 0470185317.
[25] S. Brown and Z. Vranesic. Fundamentals of Digital Logic with VHDL Design. 3rd ed. New York: McGraw-Hill Higher Education, 2011. isbn: 978-0073529530.
[26] Vivado Design Suite User Guide: Design Flows Overview. 2nd ed. Xilinx Inc. June 2018. url: https://www.xilinx.com/support/documentation/sw_manuals/xilinx2018_2/ug892-vivado-design-flows-overview.pdf.

[27] Vivado Design Suite User Guide: Implementation. 1st ed. Xilinx Inc. Apr. 2018. url: https://www.xilinx.com/support/documentation/sw_manuals/xilinx2018_1/ug904-vivado-implementation.pdf.
[28] Basys 3 FPGA Board Reference Manual. 1st ed. Digilent Inc. Apr. 2018. url: https://reference.digilentinc.com/_media/basys3:basys3_rm.pdf.
[29] M. Zwolinski. Digital System Design with VHDL. 2nd ed. Harlow, Essex: Pearson Education Limited, 2004. isbn: 013039985X.
[30] 7 Series FPGAs Data Sheet: Overview. 1st ed. Xilinx Inc. Feb. 2018. url: https://www.xilinx.com/support/documentation/data_sheets/ds180_7Series_Overview.pdf.
[31] K. Chapman. Get Smart About Reset: Think Local, Not Global. Xilinx Inc. Mar. 2008. url: https://www.xilinx.com/support/documentation/white_papers/wp272.pdf.
[32] 7 Series FPGAs Migration: Methodology Guide. Xilinx Inc. Apr. 2018. url: https://www.xilinx.com/support/documentation/sw_manuals/ug429_7Series_Migration.pdf.
[33] J. E. Bresenham. “Algorithm for computer control of a digital plotter”. In: IBM Systems Journal 4.1 (1965), pp. 25–30.
[34] A. Agathos, T. Theoharis, and A. Boehm. “Efficient integer algorithms for the generation of conic sections”. In: Computers & Graphics 22.5 (1998), pp. 621–628.
[35] S. Torbert. Applied Computer Science. 2nd ed. Berlin: Springer, 2016. isbn: 3319308645.
[36] P. Horowitz and W. Hill. The Art of Electronics. 3rd ed. New York: Cambridge University Press, 2015. isbn: 0521809266.
[37] H. Kaeslin. Top-Down Digital VLSI Design: From Architectures to Gate-Level Circuits and FPGAs. 1st ed. San Francisco: Morgan Kaufmann, 2015. isbn: 0128007303.
[38] Vivado Design Suite User Guide: Design Analysis and Closure Techniques. Xilinx Inc. Oct. 2017. url: https://www.xilinx.com/support/documentation/sw_manuals/xilinx2017_3/ug906-vivado-design-analysis.pdf.
[39] Xilinx Inc. AR# 57304. Accessed: 2019-07-29. url: https://www.xilinx.com/support/answers/57304.html.

[40] R. Munden. ASIC and FPGA Verification: A Guide to Component Modeling (Systems on Silicon). 1st ed. San Francisco: Morgan Kaufmann, 2004. isbn: 0125105819.
[41] Xilinx Inc. Vivado Hardware Debug. Accessed: 2019-08-04. url: https://www.xilinx.com/products/design-tools/vivado/debug.html#logic.
[42] D. Salomon. Assemblers and Loaders. 1st ed. Chichester: Ellis Horwood Ltd, 1993. isbn: 0130525642.
[43] P. Collingridge. 3D graphics tutorial. Accessed: 2019-06-14. url: http://petercollingridge.appspot.com/3D-tutorial/rotating-objects.
[44] J. Brakefield. Small soft core uP Inventory. Accessed: 2019-09-21. Feb. 2019. url: https://opencores.org/usercontent/doc/1550810299.
[45] MCS6502 Datasheet. MOS Technology Inc. 1975. url: http://archive.6502.org/datasheets/mos_6501-6505_mpu_preliminary_aug_1975.pdf.
[46] ATmega103 Datasheet. Atmel Corporation. 2007. url: http://ww1.microchip.com/downloads/en/DeviceDoc/doc0945.pdf.
[47] RISC-V Foundation. Accessed: 2019-11-30. url: https://riscv.org.
[48] PicoRV32 - A Size-Optimized RISC-V CPU. Accessed: 2019-11-30. url: https://github.com/cliffordwolf/picorv32.
[49] OpenCores.org. WISHBONE SoC Architecture Specification, Revision B.3. Accessed: 2019-08-14. url: https://cdn.opencores.org/downloads/wbspec_b3.pdf.
[50] OpenCores.org. SBA - Simple Bus Architecture. Accessed: 2019-08-14. url: https://opencores.org/projects/simple_bus_architecture.

A Source codes

A.1 asip38.vhd

--------------------------------------------------------------------
-- ASIP38
--
-- Application-specific instruction set processor
--------------------------------------------------------------------
-- Copyright (c) 2018 Lauri Isola
--
-- Released under the MIT license (see LICENSE.txt)
--------------------------------------------------------------------
-- Instructions: 31
-- Program ROM: 16k (38-bit)
-- Data RAM: 16k (32-bit)
-- Program stack: 32-level
-- Data bus: 32-bit
-- Instruction word:
-- | 6-bit opcode | 32-bit address/immediate |
--------------------------------------------------------------------
-- Processor organization:
--
-- Program memory
-- Data memory
-- Program counter (PC)
-- Program stack
-- ALU
-- Accumulator (AC)
-- Bus select
-- File register (F)
-- X register
-- Y register
-- Input register
-- Output register
-- Mode register
-- Instruction decode and control
--------------------------------------------------------------------

library ieee;
use ieee.std_logic_1164.all;
use ieee.std_logic_unsigned.all;
use ieee.numeric_std.all;

entity asip38 is
    port (
        clk:        in  std_logic;
        x:          out std_logic_vector(31 downto 0);
        y:          out std_logic_vector(31 downto 0);
        disp_data:  out std_logic_vector(31 downto 0);
        disp_cmd:   out std_logic_vector(5 downto 0);
        disp_set:   out std_logic;
        disp_clr:   out std_logic;
        input_data: in  std_logic_vector(31 downto 0);
        input_flag: in  std_logic;
        input_rst:  out std_logic;
        ready:      in  std_logic
    );
end asip38;

architecture Behavioral of asip38 is

    constant xLDI: std_logic_vector(5 downto 0) := "000000";
    constant xLDA: std_logic_vector(5 downto 0) := "000001";
    constant xSTO: std_logic_vector(5 downto 0) := "000010";
    constant xJMP: std_logic_vector(5 downto 0) := "000011";
    constant xADD: std_logic_vector(5 downto 0) := "000100";
    constant xSUB: std_logic_vector(5 downto 0) := "000101";
    constant xMUL: std_logic_vector(5 downto 0) := "000110";
    constant xAND: std_logic_vector(5 downto 0) := "000111";
    constant xIOR: std_logic_vector(5 downto 0) := "001000";
    constant xXOR: std_logic_vector(5 downto 0) := "001001";
    constant xINC: std_logic_vector(5 downto 0) := "001010";
    constant xDEC: std_logic_vector(5 downto 0) := "001011";
    constant xCIL: std_logic_vector(5 downto 0) := "001100";
    constant xCIR: std_logic_vector(5 downto 0) := "001101";
    constant xWAI: std_logic_vector(5 downto 0) := "001110";
    constant xLDX: std_logic_vector(5 downto 0) := "001111";
    constant xLDY: std_logic_vector(5 downto 0) := "010000";
    constant xINP: std_logic_vector(5 downto 0) := "010001";
    constant xOUT: std_logic_vector(5 downto 0) := "010010";
    constant xSNZ: std_logic_vector(5 downto 0) := "010011";
    constant xSZA: std_logic_vector(5 downto 0) := "010100";
    constant xSGT: std_logic_vector(5 downto 0) := "010101";
    constant xSLT: std_logic_vector(5 downto 0) := "010110";
    constant xSKI: std_logic_vector(5 downto 0) := "010111";
    constant xSET: std_logic_vector(5 downto 0) := "011000";
    constant xCLR: std_logic_vector(5 downto 0) := "011001";
    constant xCAL: std_logic_vector(5 downto 0) := "011010";
    constant xRET: std_logic_vector(5 downto 0) := "011100";
    constant xLFR: std_logic_vector(5 downto 0) := "011110";
    constant xSFR: std_logic_vector(5 downto 0) := "011111";
    constant xWFR: std_logic_vector(5 downto 0) := "100000";

    -- Block-ROM, Xilinx
    COMPONENT program_memory
        PORT (
            addra: in  STD_LOGIC_VECTOR(13 downto 0);
            clka:  in  STD_LOGIC;
            douta: out STD_LOGIC_VECTOR(37 downto 0)
        );
    END COMPONENT;

    -- Block-RAM, Xilinx
    COMPONENT ram_memory
        PORT (
            clka:  in  STD_LOGIC;
            wea:   in  STD_LOGIC_VECTOR(0 downto 0);
            addra: in  STD_LOGIC_VECTOR(13 downto 0);
            dina:  in  STD_LOGIC_VECTOR(31 downto 0);
            douta: out STD_LOGIC_VECTOR(31 downto 0)
        );
    END COMPONENT;

    -- 32-bit multiplier, Xilinx DSP48
    COMPONENT mult32
        PORT (
            CLK: in  STD_LOGIC;
            A:   in  STD_LOGIC_VECTOR(31 downto 0);
            B:   in  STD_LOGIC_VECTOR(31 downto 0);
            P:   out STD_LOGIC_VECTOR(63 downto 0)
        );
    END COMPONENT;

    type state_type is (Start, Fetch, Decode, Execute, MemWrite, Multiply, Halt);
    type pc_type is (PCInc, PCLoad, PCSNZ, PCSZA, PCSGT, PCSLT, PCSKI, PCRET, PCLatch);
    type alu_type is (AluBUS, AluADD, AluSUB, AluMUL, AluAND, AluIOR, AluXOR,
                      AluINC, AluDEC, AluCIL, AluCIR);
    type bus_type is (BusROM, BusAC, BusRAM, BusINPR);
    type mul_type is (Calculate, Idle);
    type stack_type is array (0 to 31) of std_logic_vector(13 downto 0);

    signal state:         state_type;                                        -- Finite state machine
    signal state_next:    state_type;
    signal stack:         stack_type := (others => (others => '0'));         -- Program stack
    signal stack_pointer: std_logic_vector(4 downto 0) := (others => '0');   -- Stack pointer
    signal stack_dataout: std_logic_vector(13 downto 0) := (others => '0');  -- Stack output
    signal progmem_out:   std_logic_vector(37 downto 0);                     -- ROM output
    signal datamem_out:   std_logic_vector(31 downto 0);                     -- RAM output
    signal mux_datamem:   std_logic_vector(13 downto 0);                     -- Memory addressing multiplexer
    signal pc:            std_logic_vector(13 downto 0) := (others => '0');  -- PC
    signal pc_sel:        pc_type;                                           -- PC select
    signal alu_result:    std_logic_vector(31 downto 0) := (others => '0');  -- ALU result
    signal mult32_out:    std_logic_vector(63 downto 0) := (others => '0');  -- Multiplier output
    signal alu_sel:       alu_type;                                          -- ALU select
    signal ac:            std_logic_vector(31 downto 0) := (others => '0');  -- AC
    signal data_bus:      std_logic_vector(31 downto 0) := (others => '0');  -- Data bus
    signal bus_sel:       bus_type;                                          -- Bus select
    signal mul_state:     mul_type;                                          -- Multiplier state
    signal ac_load:       std_logic := '0';                                  -- AC load
    signal x_load:        std_logic := '0';                                  -- X register load
    signal y_load:        std_logic := '0';                                  -- Y register load
    signal input_load:    std_logic := '0';                                  -- Input register load
    signal output_load:   std_logic := '0';                                  -- Output register load
    signal mode_load:     std_logic := '0';                                  -- Mode register load
    signal ram_write:     std_logic := '0';                                  -- RAM write
    signal mem_sel:       std_logic := '0';                                  -- Memory select
    signal wfr:           std_logic := '0';                                  -- Write F register
    signal stack_push:    std_logic := '0';                                  -- Stack push
    signal stack_pop:     std_logic := '0';                                  -- Stack pop

    signal mode_reg:      std_logic_vector(31 downto 0) := (others => '0');  -- Display mode
    signal input_reg:     std_logic_vector(31 downto 0) := (others => '0');  -- Input register
    signal output_reg:    std_logic_vector(31 downto 0) := (others => '0');  -- Output register
    signal f_reg:         std_logic_vector(31 downto 0) := (others => '0');  -- F register (indirect RAM addressing)
    signal x_reg:         std_logic_vector(31 downto 0) := (others => '0');  -- X register
    signal y_reg:         std_logic_vector(31 downto 0) := (others => '0');  -- Y register

    signal mul_wait_count: std_logic_vector(2 downto 0) := (others => '0');  -- Multiplier delay register
    signal mul_ready:      std_logic := '0';                                 -- Multiplier ready flag

    signal dispset:   std_logic := '0';  -- Display set
    signal dispclear: std_logic := '0';  -- Display clear
    signal inputrst:  std_logic := '0';  -- Input reset

begin

    progmem: program_memory
        PORT MAP (
            clka  => clk,
            addra => pc,
            douta => progmem_out
        );

    datamem: ram_memory
        PORT MAP (
            clka   => clk,
            wea(0) => ram_write,
            addra  => mux_datamem,
            dina   => data_bus,
            douta  => datamem_out
        );

    multiplier: mult32
        PORT MAP (
            CLK => clk,
            A   => ac,
            B   => datamem_out,
            P   => mult32_out
        );

    -- Memory addressing (direct/indirect)
    mux_datamem <= progmem_out(13 downto 0) when (mem_sel = '0') else f_reg(13 downto 0);

    -- Output signals
    input_rst <= inputrst;
    x         <= x_reg;
    y         <= y_reg;
    disp_data <= output_reg;
    disp_cmd  <= mode_reg(5 downto 0);
    disp_set  <= dispset;
    disp_clr  <= dispclear;

    --------------------------------------------------------------------
    -- BUS SELECT
    --------------------------------------------------------------------
    bus_select: process(bus_sel, progmem_out, ac, datamem_out, input_reg)
    begin
        case bus_sel is
            when BusROM  => data_bus <= progmem_out(31 downto 0);
            when BusAC   => data_bus <= ac;
            when BusRAM  => data_bus <= datamem_out;
            when BusINPR => data_bus <= input_reg;
            when others  => data_bus <= (others => '0');
        end case;
    end process;

    --------------------------------------------------------------------
    -- REGISTERS
    --------------------------------------------------------------------

    -- Accumulator
    reg_ac: process(clk)
    begin
        if (rising_edge(clk)) then
            if (ac_load = '1') then
                ac <= alu_result;
            end if;
        end if;
    end process;

    -- F register
    reg_f: process(clk)
    begin
        if (rising_edge(clk)) then
            if (wfr = '1') then
                f_reg <= data_bus;
            end if;
        end if;
    end process;

    -- X register
    reg_x: process(clk)
    begin
        if (rising_edge(clk)) then
            if (x_load = '1') then
                x_reg <= data_bus;
            end if;
        end if;
    end process;

    -- Y register
    reg_y: process(clk)
    begin
        if (rising_edge(clk)) then
            if (y_load = '1') then
                y_reg <= data_bus;
            end if;
        end if;
    end process;

    -- Input register
    reg_input: process(clk)
    begin
        if (rising_edge(clk)) then
            inputrst <= '0';
            if (input_load = '1') then
                input_reg <= input_data;
                inputrst <= '1';
            end if;
        end if;
    end process;

    -- Output register
    reg_output: process(clk)
    begin
        if (rising_edge(clk)) then
            if (output_load = '1') then
                output_reg <= data_bus;
            end if;
        end if;
    end process;

    -- Mode register
    reg_mode: process(clk)
    begin
        if (rising_edge(clk)) then
            if (mode_load = '1') then
                mode_reg <= data_bus;
            end if;
        end if;
    end process;

    --------------------------------------------------------------------
    -- PC
    --------------------------------------------------------------------
    program_counter: process(clk)
    begin
        if (rising_edge(clk)) then
            case pc_sel is
                when PCInc =>    -- Increment program counter
                    pc <= (pc + 1);

                when PCLoad =>   -- Load address to program counter
                    pc <= progmem_out(13 downto 0);

                when PCSNZ =>    -- Skip if AC != 0
                    if (ac /= 0) then
                        pc <= (pc + 2);
                    else
                        pc <= (pc + 1);
                    end if;

                when PCSZA =>    -- Skip if AC == 0
                    if (ac = 0) then
                        pc <= (pc + 2);
                    else
                        pc <= (pc + 1);
                    end if;

                when PCSGT =>    -- Skip if AC > RAM
                    if (ac > datamem_out) then
                        pc <= (pc + 2);
                    else
                        pc <= (pc + 1);
                    end if;

                when PCSLT =>    -- Skip if AC < RAM
                    if (ac < datamem_out) then
                        pc <= (pc + 2);
                    else
                        pc <= (pc + 1);
                    end if;

                when PCSKI =>    -- Skip if input flag is zero
                    if (input_flag = '0') then
                        pc <= (pc + 2);
                    else
                        pc <= (pc + 1);
                    end if;

                when PCRET =>    -- Return from subroutine
                    pc <= stack_dataout;

                when PCLatch =>  -- Latch PC
                    pc <= pc;

                when others =>
                    null;
            end case;
        end if;
    end process;

    --------------------------------------------------------------------
    -- PROGRAM STACK (32 LEVEL)
    --------------------------------------------------------------------
    stack_dataout <= stack(to_integer(unsigned(stack_pointer)));

    program_stack: process(clk)
    begin
        if (rising_edge(clk)) then
            if (stack_push = '1') then
                stack(to_integer(unsigned(stack_pointer))) <= (pc + 1);  -- Stack push
                stack_pointer <= (stack_pointer + 1);
            elsif (stack_pop = '1') then                                 -- Stack pop
                if (stack_pointer > "00000") then
                    stack_pointer <= (stack_pointer - 1);
                end if;
            end if;
        end if;
    end process;

    --------------------------------------------------------------------
    -- ALU
    --------------------------------------------------------------------
    alu: process(alu_sel, data_bus, datamem_out, ac, mult32_out)
    begin
        case alu_sel is
            when AluBUS =>  -- Bypass
                alu_result <= data_bus;

            when AluADD =>  -- ADD addition
                alu_result <= (ac + datamem_out);

            when AluSUB =>  -- SUB subtraction
                alu_result <= (ac - datamem_out);

            when AluMUL =>  -- MUL multiplication with result selection
                alu_result <= mult32_out(55 downto 24);

            when AluAND =>  -- AND and operation
                alu_result <= (ac and datamem_out);

            when AluIOR =>  -- IOR inclusive or
                alu_result <= (ac or datamem_out);

            when AluXOR =>  -- XOR exclusive or
                alu_result <= (ac xor datamem_out);

            when AluINC =>  -- INC increment
                alu_result <= (ac + 1);

            when AluDEC =>  -- DEC decrement
                alu_result <= (ac - 1);

            when AluCIL =>  -- CIL circulate left
                alu_result(31 downto 1) <= ac(30 downto 0);
                alu_result(0) <= ac(31);

            when AluCIR =>  -- CIR circulate right
                alu_result(30 downto 0) <= ac(31 downto 1);
                alu_result(31) <= ac(0);

            when others =>
                alu_result <= (others => '0');
        end case;
    end process;

    --------------------------------------------------------------------
    -- INSTRUCTION DECODE
    -- AND
    -- CONTROL
    --------------------------------------------------------------------
    state_machine: process(clk)
    begin
        if (rising_edge(clk)) then
            if (progmem_out(37 downto 32) = xWAI and ready = '0') then
                state <= state;
            else
                state <= state_next;
            end if;
        end if;
    end process;

    control_logic: process(state, progmem_out, mul_ready)
    begin
        bus_sel     <= BusROM;
        alu_sel     <= AluBUS;
        mem_sel     <= '0';
        wfr         <= '0';
        stack_push  <= '0';
        stack_pop   <= '0';
        x_load      <= '0';
        y_load      <= '0';
        input_load  <= '0';
        output_load <= '0';
        mode_load   <= '0';
        ac_load     <= '0';
        ram_write   <= '0';
        dispset     <= '0';
        dispclear   <= '0';
        mul_state   <= Idle;
        pc_sel      <= PCLatch;
        state_next  <= Start;

        case state is
            when Start =>  -- Start stage
                state_next <= Fetch;

            when Fetch =>  -- Fetch stage
                state_next <= Decode;
                case progmem_out(37 downto 32) is
                    when xLDA => bus_sel <= BusRAM;
                    when xSTO => bus_sel <= BusAC;
                    when xADD => alu_sel <= AluADD;
                    when xSUB => alu_sel <= AluSUB;
                    when xMUL => alu_sel <= AluMUL;
                    when xAND => alu_sel <= AluAND;
                    when xIOR => alu_sel <= AluIOR;
                    when xXOR => alu_sel <= AluXOR;
                    when xINC => alu_sel <= AluINC;
                    when xDEC => alu_sel <= AluDEC;
                    when xCIL => alu_sel <= AluCIL;
                    when xCIR => alu_sel <= AluCIR;
                    when xLDX => bus_sel <= BusRAM;
                    when xLDY => bus_sel <= BusRAM;
                    when xINP => bus_sel <= BusINPR;
                    when xOUT => bus_sel <= BusRAM;

                    when xLFR =>
                        bus_sel <= BusRAM;
                        mem_sel <= '1';

                    when xSFR =>
                        bus_sel <= BusAC;
                        mem_sel <= '1';

                    when xWFR => bus_sel <= BusAC;

                    when xLDI | xJMP | xWAI | xSNZ | xSZA | xSGT | xSLT | xSKI | xSET | xCLR | xCAL | xRET =>
                        null;

                    when others =>  -- Halt
                        state_next <= Halt;
                end case;

            when Decode =>  -- Decode stage
                state_next <= Execute;
                case progmem_out(37 downto 32) is
                    when xLDA => bus_sel <= BusRAM;
                    when xSTO => bus_sel <= BusAC;
                    when xADD => alu_sel <= AluADD;
                    when xSUB => alu_sel <= AluSUB;
                    when xMUL => alu_sel <= AluMUL;
                    when xAND => alu_sel <= AluAND;
                    when xIOR => alu_sel <= AluIOR;
                    when xXOR => alu_sel <= AluXOR;
                    when xINC => alu_sel <= AluINC;
                    when xDEC => alu_sel <= AluDEC;
                    when xCIL => alu_sel <= AluCIL;
                    when xCIR => alu_sel <= AluCIR;
                    when xLDX => bus_sel <= BusRAM;
                    when xLDY => bus_sel <= BusRAM;
                    when xINP => bus_sel <= BusINPR;
                    when xOUT => bus_sel <= BusRAM;

                    when xLFR =>
                        bus_sel <= BusRAM;
                        mem_sel <= '1';

                    when xSFR =>
                        bus_sel <= BusAC;
                        mem_sel <= '1';

                    when xWFR => bus_sel <= BusAC;

                    when xLDI | xJMP | xWAI | xSNZ | xSZA | xSGT | xSLT | xSKI | xSET | xCLR | xCAL | xRET =>
                        null;

                    when others =>
                        state_next <= Halt;
                end case;

            when Execute =>  -- Execute stage
                state_next <= MemWrite;
                case progmem_out(37 downto 32) is
                    when xLDA => bus_sel <= BusRAM;
                    when xSTO => bus_sel <= BusAC;
                    when xADD => alu_sel <= AluADD;
                    when xSUB => alu_sel <= AluSUB;

                    when xMUL =>
                        alu_sel <= AluMUL;
                        state_next <= Multiply;

                    when xAND => alu_sel <= AluAND;
                    when xIOR => alu_sel <= AluIOR;
                    when xXOR => alu_sel <= AluXOR;
                    when xINC => alu_sel <= AluINC;
                    when xDEC => alu_sel <= AluDEC;
                    when xCIL => alu_sel <= AluCIL;
                    when xCIR => alu_sel <= AluCIR;

                    when xLDX =>
                        bus_sel <= BusRAM;
                        x_load <= '1';

                    when xLDY =>
                        bus_sel <= BusRAM;
                        y_load <= '1';

                    when xINP =>
                        bus_sel <= BusINPR;
                        input_load <= '1';

                    when xOUT =>
                        bus_sel <= BusRAM;
                        output_load <= '1';

                    when xSET => mode_load <= '1';
                    when xCAL => stack_push <= '1';
                    when xRET => stack_pop <= '1';

                    when xLFR =>
                        bus_sel <= BusRAM;
                        mem_sel <= '1';

                    when xSFR =>
                        bus_sel <= BusAC;
                        mem_sel <= '1';

                    when xWFR => bus_sel <= BusAC;

                    when xLDI | xJMP | xWAI | xSNZ | xSZA | xSGT | xSLT | xSKI | xCLR =>
                        null;

when others => state_next <= Halt;
end case;

when MemWrite => -- MemWrite stage
state_next <= Fetch;
case progmem_out(37 downto 32) is
when xLDI => -- LDI AC <= immediate value
    pc_sel <= PCInc;
    ac_load <= '1';

when xLDA => -- LDA AC <= RAM
    bus_sel <= BusRAM;
    pc_sel <= PCInc;
    ac_load <= '1';

when xSTO => -- STO RAM <= AC
    bus_sel <= BusAC;
    pc_sel <= PCInc;
    ram_write <= '1';

when xJMP => -- JMP unconditional branch
    pc_sel <= PCLoad;

when xADD => -- ADD AC + RAM
    alu_sel <= AluADD;
    pc_sel <= PCInc;
    ac_load <= '1';

when xSUB => -- SUB AC - RAM
    alu_sel <= AluSUB;
    pc_sel <= PCInc;
    ac_load <= '1';

when xMUL => -- MUL AC * RAM
    alu_sel <= AluMUL;
    pc_sel <= PCInc;
    ac_load <= '1';

when xAND => -- AND AC and RAM
    alu_sel <= AluAND;
    pc_sel <= PCInc;
    ac_load <= '1';

when xIOR => -- IOR AC or RAM
    alu_sel <= AluIOR;
    pc_sel <= PCInc;
    ac_load <= '1';

when xXOR => -- XOR AC xor RAM
    alu_sel <= AluXOR;
    pc_sel <= PCInc;
    ac_load <= '1';

when xINC => -- INC AC + 1
    alu_sel <= AluINC;
    pc_sel <= PCInc;
    ac_load <= '1';

when xDEC => -- DEC AC - 1
    alu_sel <= AluDEC;
    pc_sel <= PCInc;
    ac_load <= '1';

when xCIL => -- CIL circulate AC left
    alu_sel <= AluCIL;
    pc_sel <= PCInc;
    ac_load <= '1';

when xCIR => -- CIR circulate AC right
    alu_sel <= AluCIR;
    pc_sel <= PCInc;
    ac_load <= '1';

when xWAI => -- WAI wait if ready is 0, continue when 1
    pc_sel <= PCInc;

when xLDX => -- LDX X <= RAM
    bus_sel <= BusRAM;
    pc_sel <= PCInc;

when xLDY => -- LDY Y <= RAM
    bus_sel <= BusRAM;
    pc_sel <= PCInc;

when xINP => -- INP RAM <= INPUT
    bus_sel <= BusINPR;
    pc_sel <= PCInc;
    ram_write <= '1';

when xOUT => -- OUT OUT <= RAM
    bus_sel <= BusRAM;
    pc_sel <= PCInc;

when xSNZ => -- SNZ skip if AC != 0
    pc_sel <= PCSNZ;

when xSZA => -- SZA skip if AC == 0
    pc_sel <= PCSZA;

when xSGT => -- SGT skip if AC > RAM
    pc_sel <= PCSGT;

when xSLT => -- SLT skip if AC < RAM
    pc_sel <= PCSLT;

when xSKI => -- SKI skip if input == 0
    pc_sel <= PCSKI;

when xSET => -- SET send command to display controller
    pc_sel <= PCInc;
    dispset <= '1';

when xCLR => -- CLR clear video memory location
    pc_sel <= PCInc;
    dispclear <= '1';

when xCAL => -- CAL call subroutine
    pc_sel <= PCLoad;

when xRET => -- RET return from subroutine
    pc_sel <= PCRET;

when xLFR => -- LFR AC <= RAM[F], indirect
    bus_sel <= BusRAM;
    mem_sel <= '1';
    pc_sel <= PCInc;
    ac_load <= '1';

when xSFR => -- SFR RAM[F] <= AC, indirect
    bus_sel <= BusAC;
    mem_sel <= '1';
    pc_sel <= PCInc;
    ram_write <= '1';

when xWFR => -- WFR F <= AC
    bus_sel <= BusAC;
    pc_sel <= PCInc;
    wfr <= '1';

when others => state_next <= Halt;
end case;

when Multiply => -- Multiply stage
case progmem_out(37 downto 32) is
when xMUL =>
    alu_sel <= AluMUL;
    case mul_ready is
    when '0' =>
        mul_state <= Calculate;
        state_next <= Multiply;
    when '1' =>
        mul_state <= Idle;
        state_next <= MemWrite;
    when others =>
        mul_state <= Idle;
        state_next <= Halt;
    end case;
when others => state_next <= Halt;
end case;

when others => -- Halt stage
    state_next <= Halt;
end case;
end process;

multiplier_wait: process(clk)
begin
    if (rising_edge(clk)) then
        case mul_state is
        when Calculate =>
            if (mul_wait_count < 5) then -- Wait for multiplier result (6 clk cycles)
                mul_wait_count <= (mul_wait_count + 1);
                mul_ready <= '0';
            else
                mul_wait_count <= (others => '0');
                mul_ready <= '1';
            end if;
        when Idle =>
            mul_wait_count <= (others => '0');
            mul_ready <= '0';
        when others => null;
        end case;
    end if;
end process;

end Behavioral;

A.2 display_control.vhd

------------------------------------------------------------
-- Display controller
------------------------------------------------------------
-- Copyright (c) 2018 Lauri Isola
--
-- Released under the MIT license (see LICENSE.txt)
------------------------------------------------------------
-- Graphics modes: 160x120, 160x120CSR, 80x60, 40x30
-- Character modes: 40x15
------------------------------------------------------------
-- Special function:    Algorithm:
--
-- Line drawing         Bresenham
-- Ellipse drawing      Bresenham
-- Area painting        Flood fill
------------------------------------------------------------

library ieee;
use ieee.std_logic_1164.all;
use ieee.std_logic_unsigned.all;
use ieee.numeric_std.all;

entity display_control is
    Port (
        clk: in std_logic;
        -- Line draw
        x_line_draw: in std_logic_vector(31 downto 0);
        y_line_draw: in std_logic_vector(31 downto 0);
        line_ready: in std_logic;
        line_update: in std_logic;
        line_start: out std_logic;
        x0_start: out std_logic_vector(7 downto 0);
        y0_start: out std_logic_vector(7 downto 0);
        x0_end: out std_logic_vector(7 downto 0);
        y0_end: out std_logic_vector(7 downto 0);
        -- Ellipse draw
        x_coord: out std_logic_vector(31 downto 0);
        y_coord: out std_logic_vector(31 downto 0);
        a: out std_logic_vector(31 downto 0);
        b: out std_logic_vector(31 downto 0);
        x_ellipse_draw: in std_logic_vector(31 downto 0);
        y_ellipse_draw: in std_logic_vector(31 downto 0);
        ellipse_start: out std_logic;
        ellipse_update: in std_logic;
        ellipse_ready: in std_logic;
        -- Area paint
        x_paint: out std_logic_vector(31 downto 0);
        y_paint: out std_logic_vector(31 downto 0);
        x_paint_draw: in std_logic_vector(31 downto 0);
        y_paint_draw: in std_logic_vector(31 downto 0);
        paint_mem_addr: in std_logic_vector(14 downto 0);
        paint_mem_in: in std_logic;
        paint_mem_out: out std_logic;
        paint_mem_write: in std_logic;
        new_color: out std_logic;
        paint_start: out std_logic;
        paint_ready: in std_logic;
        -- From processor
        x: in std_logic_vector(31 downto 0);
        y: in std_logic_vector(31 downto 0);
        disp_cmd: in std_logic_vector(5 downto 0);
        disp_set: in std_logic;
        disp_clr: in std_logic;
        -- To processor
        ready: out std_logic;
        -- To RGB signal generation
        v_mem_x: out std_logic_vector(31 downto 0);
        v_mem_y: out std_logic_vector(31 downto 0);
        cursor_address: out std_logic_vector(14 downto 0);
        video_mode: out std_logic_vector(2 downto 0);
        color_mode: out std_logic;
        cursor_mode: out std_logic;
        eraser_mode: out std_logic;
        disp_clear: out std_logic;
        disp_write: out std_logic
    );
end display_control;

architecture Behavioral of display_control is

COMPONENT paint_memory
    PORT (
        clka: IN STD_LOGIC;
        wea: IN STD_LOGIC_VECTOR(0 DOWNTO 0);
        addra: IN STD_LOGIC_VECTOR(14 DOWNTO 0);
        dina: IN STD_LOGIC_VECTOR(0 DOWNTO 0);
        douta: OUT STD_LOGIC_VECTOR(0 DOWNTO 0)
    );
END COMPONENT;

-- Paint memory signals
signal we: std_logic := '0';
signal address: std_logic_vector(14 downto 0) := (others => '0');
signal datain: std_logic := '0';
signal dataout: std_logic := '0';
signal paint_mem_we: std_logic := '0';
signal paint_mem_address: std_logic_vector(14 downto 0) := (others => '0');
signal paint_mem_datain: std_logic := '0';

-- Line draw signals
signal line_mode: std_logic := '0';
signal x_start: std_logic_vector(31 downto 0) := (others => '0');
signal y_start: std_logic_vector(31 downto 0) := (others => '0');
signal x_end: std_logic_vector(31 downto 0) := (others => '0');
signal y_end: std_logic_vector(31 downto 0) := (others => '0');

-- Ellipse draw signals
signal ellipse_x: std_logic_vector(31 downto 0) := (others => '0');
signal ellipse_y: std_logic_vector(31 downto 0) := (others => '0');
signal ellipse_a: std_logic_vector(31 downto 0) := (others => '0');
signal ellipse_b: std_logic_vector(31 downto 0) := (others => '0');

-- Paint draw signals
signal paint_x: std_logic_vector(31 downto 0) := (others => '0');
signal paint_y: std_logic_vector(31 downto 0) := (others => '0');
signal paint_color: std_logic := '0';

-- Operating mode signals
signal color: std_logic := '0';
signal cursor: std_logic := '0';
signal eraser: std_logic := '0';
signal cursor_addr: std_logic_vector(14 downto 0) := (others => '0');

signal resolution: std_logic_vector(2 downto 0) := (others => '0');

begin

memory_unit_paint: paint_memory
    PORT MAP (
        clka => clk,
        wea(0) => we,
        addra => address,
        dina(0) => datain,
        douta(0) => dataout
    );

-- To RGB signal generation
cursor_address <= cursor_addr;
video_mode <= resolution;
color_mode <= color;
cursor_mode <= cursor;
eraser_mode <= eraser;

-- Ellipse initial values
x_coord <= ellipse_x;
y_coord <= ellipse_y;
a <= ellipse_a;
b <= ellipse_b;

-- Paint initial values
x_paint <= paint_x;
y_paint <= paint_y;
new_color <= paint_color;

-- Paint memory
we <= paint_mem_write when disp_cmd = "110010" else paint_mem_we;
address <= paint_mem_addr when disp_cmd = "110010" else paint_mem_address;
datain <= paint_mem_in when disp_cmd = "110010" else paint_mem_datain;
paint_mem_out <= dataout;

-- Ready signal to processor
ready <= '0' when line_ready = '0' or ellipse_ready = '0' or paint_ready = '0' else '1';

-- Line drawing initial values and mode selection (normal / display origin calculation)
line_mode_sel: process(clk)
begin
    if (rising_edge(clk)) then
        if (line_mode = '0') then
            x0_start <= x_start(7 downto 0);
            y0_start <= y_start(7 downto 0);
            x0_end <= x_end(7 downto 0);
            y0_end <= y_end(7 downto 0);
        else
            if (x_start(23 downto 0) > X"7FFFFF") then
                x0_start <= x_start(31 downto 24) + X"3D";
            else
                x0_start <= x_start(31 downto 24) + X"3C";
            end if;
            if (y_start(23 downto 0) > X"7FFFFF") then
                y0_start <= y_start(31 downto 24) + X"3D";
            else
                y0_start <= y_start(31 downto 24) + X"3C";
            end if;
            if (x_end(23 downto 0) > X"7FFFFF") then
                x0_end <= x_end(31 downto 24) + X"3D";
            else
                x0_end <= x_end(31 downto 24) + X"3C";
            end if;
            if (y_end(23 downto 0) > X"7FFFFF") then
                y0_end <= y_end(31 downto 24) + X"3D";
            else
                y0_end <= y_end(31 downto 24) + X"3C";
            end if;
        end if;
    end if;
end process;

video_mode_sel: process(clk)
begin
    if (rising_edge(clk)) then
        v_mem_x <= x;
        v_mem_y <= y;
        disp_write <= '0';
        disp_clear <= '0';
        paint_mem_we <= '0';
        line_start <= '0';
        ellipse_start <= '0';
        paint_start <= '0';
        if (disp_clr = '1') then
            disp_clear <= '1';
            if (resolution = "011") then
                paint_mem_address <= y(6 downto 0) & x(7 downto 0);
                paint_mem_datain <= '0';
                paint_mem_we <= '1';
            end if;
        end if;
        case disp_cmd(5 downto 4) is
        when "00" => -- Normal mode
            case disp_cmd(3 downto 0) is
            when "0000" => -- Resolution 40x15
                resolution <= "000";

            when "0001" => -- Resolution 40x30
                resolution <= "001";

            when "0010" => -- Resolution 80x60
                resolution <= "010";

            when "0011" => -- Resolution 160x120
                resolution <= "011";

            when "0100" => -- Resolution 160x120 for cursor
                resolution <= "100";

            when "0101" => -- Set dot
                if (disp_set = '1') then
                    disp_write <= '1';
                    if (resolution = "011") then
                        paint_mem_address <= y(6 downto 0) & x(7 downto 0);
                        paint_mem_datain <= '1';
                        paint_mem_we <= '1';
                    end if;
                end if;

            when "0110" => -- Set character
                if (disp_set = '1') then
                    disp_write <= '1';
                end if;

            when "0111" => -- Set cursor
                if (disp_set = '1') then
                    disp_write <= '1';
                end if;

            when "1000" => -- Update cursor address for RGB signal generation
                cursor_addr <= y(6 downto 0) & x(7 downto 0);

            when "1001" => -- Set color mode normal
                color <= '0';

            when "1010" => -- Set color mode special
                color <= '1';

            when "1011" => -- Set cursor mode on
                cursor <= '1';

            when "1100" => -- Set cursor mode off
                cursor <= '0';

            when "1101" => -- Set eraser mode on
                eraser <= '1';

            when "1110" => -- Set eraser mode off
                eraser <= '0';

            when others => null;
            end case;

        when "01" => -- Line draw mode
            case disp_cmd(3 downto 0) is
            when "0000" => -- Line mode normal
                line_mode <= '0';

            when "0001" => -- Line mode origin
                line_mode <= '1';

            when "0010" => -- Set line start point
                if (disp_set = '1') then
                    x_start <= x;
                    y_start <= y;
                end if;

            when "0011" => -- Set line end point
                if (disp_set = '1') then
                    x_end <= x;
                    y_end <= y;
                end if;

            when "0100" => -- Draw line
                if (line_mode = '0') then
                    v_mem_x <= x_line_draw;
                    v_mem_y <= y_line_draw;
                else
                    v_mem_x <= x_line_draw + 20;
                    v_mem_y <= ((not y_line_draw) + 1) - 9;
                end if;
                if (disp_set = '1') then
                    line_start <= '1';
                end if;
                if (line_update = '1') then
                    disp_write <= '1';
                    if (resolution = "011") then
                        paint_mem_address <= y_line_draw(6 downto 0) & x_line_draw(7 downto 0);
                        paint_mem_datain <= '1';
                        paint_mem_we <= '1';
                    end if;
                end if;

            when "0101" => -- Clear line
                if (line_mode = '0') then
                    v_mem_x <= x_line_draw;
                    v_mem_y <= y_line_draw;
                else
                    v_mem_x <= x_line_draw + 20;
                    v_mem_y <= ((not y_line_draw) + 1) - 9;
                end if;
                if (disp_set = '1') then
                    line_start <= '1';
                end if;
                if (line_update = '1') then
                    disp_clear <= '1';
                    if (resolution = "011") then
                        paint_mem_address <= y_line_draw(6 downto 0) & x_line_draw(7 downto 0);
                        paint_mem_datain <= '0';
                        paint_mem_we <= '1';
                    end if;
                end if;

            when others => null;
            end case;

        when "10" => -- Ellipse draw mode
            case disp_cmd(3 downto 0) is
            when "0000" => -- Set ellipse center (x and y)
                if (disp_set = '1') then
                    ellipse_x <= x;
                    ellipse_y <= y;
                end if;

            when "0001" => -- Set a
                if (disp_set = '1') then
                    ellipse_a <= x;
                end if;

            when "0010" => -- Set b
                if (disp_set = '1') then
                    ellipse_b <= x;
                end if;

            when "0011" => -- Draw ellipse
                v_mem_x <= x_ellipse_draw;
                v_mem_y <= y_ellipse_draw;
                if (disp_set = '1') then
                    ellipse_start <= '1';
                end if;
                if (ellipse_update = '1') then
                    disp_write <= '1';
                    if (resolution = "011") then
                        paint_mem_address <= y_ellipse_draw(6 downto 0) & x_ellipse_draw(7 downto 0);
                        paint_mem_datain <= '1';
                        paint_mem_we <= '1';
                    end if;
                end if;

            when "0100" => -- Clear ellipse
                v_mem_x <= x_ellipse_draw;
                v_mem_y <= y_ellipse_draw;
                if (disp_set = '1') then
                    ellipse_start <= '1';
                end if;
                if (ellipse_update = '1') then
                    disp_clear <= '1';
                    if (resolution = "011") then
                        paint_mem_address <= y_ellipse_draw(6 downto 0) & x_ellipse_draw(7 downto 0);
                        paint_mem_datain <= '0';
                        paint_mem_we <= '1';
                    end if;
                end if;

            when others => null;
            end case;

        when "11" => -- Area paint mode
            case disp_cmd(3 downto 0) is
            when "0000" => -- Set paint center (x and y)
                if (disp_set = '1') then
                    paint_x <= x;
                    paint_y <= y;
                end if;

            when "0001" => -- Set color (1 = green, 0 = black)
                if (disp_set = '1') then
                    paint_color <= x(0);
                end if;

            when "0010" => -- Start area paint
                v_mem_x <= x_paint_draw;
                v_mem_y <= y_paint_draw;
                if (disp_set = '1' and resolution = "011") then
                    paint_start <= '1';
                end if;
                if (paint_mem_write = '1') then
                    if (paint_color = '1') then
                        disp_write <= '1';
                    else
                        disp_clear <= '1';
                    end if;
                end if;

            when others => null;
            end case;

        when others => null;
        end case;
    end if;
end process;

end Behavioral;

A.3 vga_sync.vhd

------------------------------------------------------------
-- VGA synchronization
------------------------------------------------------------
-- Copyright (c) 2018 Lauri Isola
--
-- Released under the MIT license (see LICENSE.txt)
------------------------------------------------------------
-- Function: Generates VGA synchronization signals
------------------------------------------------------------

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity vga_sync is
    Port (
        clk: in std_logic;
        hsync: out std_logic;
        vsync: out std_logic;
        pixel_x: out std_logic_vector(9 downto 0);
        pixel_y: out std_logic_vector(9 downto 0);
        vga_clk: out std_logic;
        video_on: out std_logic
    );
end vga_sync;

architecture Behavioral of vga_sync is

constant HD: integer := 640; -- Horizontal display area
constant HF: integer := 16;  -- H. front porch
constant HB: integer := 48;  -- H. back porch
constant HR: integer := 96;  -- H. retrace
constant VD: integer := 480; -- Vertical display area
constant VF: integer := 10;  -- V. front porch
constant VB: integer := 33;  -- V. back porch
constant VR: integer := 2;   -- V. retrace

signal p_tick: std_logic := '0';
signal h_counter: unsigned(9 downto 0) := (others => '0');
signal v_counter: unsigned(9 downto 0) := (others => '0');
signal count: std_logic := '0';
signal h_sync: std_logic := '1';
signal v_sync: std_logic := '1';

begin

vga_clk_25MHz: process(clk)
begin
    if (rising_edge(clk)) then
        if (count = '1') then
            p_tick <= not(p_tick);
            count <= '0';
        else
            count <= '1';
        end if;
    end if;
end process;

horizontal: process(p_tick)
begin
    if (rising_edge(p_tick)) then
        h_counter <= h_counter + 1;
        if (h_counter = HD+HF-1) then
            h_sync <= '0';
        elsif (h_counter = HD+HF+HR-1) then
            h_sync <= '1';
        elsif (h_counter >= HD+HF+HR+HB-1) then
            h_counter <= (others => '0');
        end if;
    end if;
end process;

vertical: process(h_sync)
begin
    if (rising_edge(h_sync)) then
        v_counter <= v_counter + 1;
        if (v_counter = VD+VF-1) then
            v_sync <= '0';
        elsif (v_counter = VD+VF+VR-1) then
            v_sync <= '1';
        elsif (v_counter >= VD+VF+VR+VB-1) then
            v_counter <= (others => '0');
        end if;
    end if;
end process;

-- Output signals
hsync <= h_sync;
vsync <= v_sync;
pixel_x <= std_logic_vector(h_counter);
pixel_y <= std_logic_vector(v_counter);
vga_clk <= p_tick;

video_on <= '1' when (h_counter < HD) and (v_counter < VD) else '0';

end Behavioral;

A.4 rgb_gen.vhd

------------------------------------------------------------
-- RGB signal generation
------------------------------------------------------------
-- Copyright (c) 2018 Lauri Isola
--
-- Released under the MIT license (see LICENSE.txt)
------------------------------------------------------------
-- Function: Display pixel generation
--           Video memory control
------------------------------------------------------------

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity rgb_gen is
    Port (
        clk: in std_logic;
        vga_clk: in std_logic;
        video_on: in std_logic;
        disp_data: in std_logic_vector(31 downto 0);
        disp_clear: in std_logic;
        disp_write: in std_logic;
        video_mode: in std_logic_vector(2 downto 0);
        color_mode: in std_logic;
        cursor_mode: in std_logic;
        eraser_mode: in std_logic;
        cursor_address: in std_logic_vector(14 downto 0);
        v_mem_x: in std_logic_vector(31 downto 0);
        v_mem_y: in std_logic_vector(31 downto 0);
        pixel_x: in std_logic_vector(9 downto 0);
        pixel_y: in std_logic_vector(9 downto 0);
        rgb: out std_logic_vector(2 downto 0)
    );
end rgb_gen;

architecture Behavioral of rgb_gen is

COMPONENT video_memory_160x120
    PORT (
        clk: IN STD_LOGIC;
        we: IN STD_LOGIC;
        a: IN STD_LOGIC_VECTOR(14 DOWNTO 0);
        d: IN STD_LOGIC_VECTOR(0 DOWNTO 0);
        dpra: IN STD_LOGIC_VECTOR(14 DOWNTO 0);
        dpo: OUT STD_LOGIC_VECTOR(0 DOWNTO 0)
    );
END COMPONENT;

COMPONENT video_memory2_160x120
    PORT (
        clk: IN STD_LOGIC;
        we: IN STD_LOGIC;
        a: IN STD_LOGIC_VECTOR(14 DOWNTO 0);
        d: IN STD_LOGIC_VECTOR(0 DOWNTO 0);
        dpra: IN STD_LOGIC_VECTOR(14 DOWNTO 0);
        dpo: OUT STD_LOGIC_VECTOR(0 DOWNTO 0)
    );
END COMPONENT;

COMPONENT video_memory_80x60
    PORT (
        clk: IN STD_LOGIC;
        we: IN STD_LOGIC;
        a: IN STD_LOGIC_VECTOR(12 DOWNTO 0);
        d: IN STD_LOGIC_VECTOR(1 DOWNTO 0);
        dpra: IN STD_LOGIC_VECTOR(12 DOWNTO 0);
        dpo: OUT STD_LOGIC_VECTOR(1 DOWNTO 0)
    );
END COMPONENT;

COMPONENT video_memory_40x30
    PORT (
        clk: IN STD_LOGIC;
        we: IN STD_LOGIC;
        a: IN STD_LOGIC_VECTOR(10 DOWNTO 0);
        d: IN STD_LOGIC_VECTOR(7 DOWNTO 0);
        dpra: IN STD_LOGIC_VECTOR(10 DOWNTO 0);
        dpo: OUT STD_LOGIC_VECTOR(7 DOWNTO 0)
    );
END COMPONENT;

COMPONENT video_memory_40x15
    PORT (
        clk: IN STD_LOGIC;
        we: IN STD_LOGIC;
        a: IN STD_LOGIC_VECTOR(9 DOWNTO 0);
        d: IN STD_LOGIC_VECTOR(7 DOWNTO 0);
        dpra: IN STD_LOGIC_VECTOR(9 DOWNTO 0);
        dpo: OUT STD_LOGIC_VECTOR(7 DOWNTO 0)
    );
END COMPONENT;

signal write_enable_160x120: std_logic;
signal add_read_160x120: std_logic_vector(14 downto 0);
signal add_write_160x120: std_logic_vector(14 downto 0);
signal data_in_160x120: std_logic_vector(0 downto 0);
signal data_out_160x120: std_logic_vector(0 downto 0);

signal write_enable2_160x120: std_logic;
signal add_read2_160x120: std_logic_vector(14 downto 0);
signal add_write2_160x120: std_logic_vector(14 downto 0);
signal data_in2_160x120: std_logic_vector(0 downto 0);
signal data_out2_160x120: std_logic_vector(0 downto 0);

signal write_enable_80x60: std_logic;
signal add_read_80x60: std_logic_vector(12 downto 0);
signal add_write_80x60: std_logic_vector(12 downto 0);
signal data_in_80x60: std_logic_vector(1 downto 0);
signal data_out_80x60: std_logic_vector(1 downto 0);

signal write_enable_40x30: std_logic;
signal add_read_40x30: std_logic_vector(10 downto 0);
signal add_write_40x30: std_logic_vector(10 downto 0);
signal data_in_40x30: std_logic_vector(7 downto 0);
signal data_out_40x30: std_logic_vector(7 downto 0);

signal write_enable_40x15: std_logic;
signal add_read_40x15: std_logic_vector(9 downto 0);
signal add_write_40x15: std_logic_vector(9 downto 0);
signal data_in_40x15: std_logic_vector(7 downto 0);
signal data_out_40x15: std_logic_vector(7 downto 0);

signal char_addr_160x120: std_logic_vector(0 downto 0);
signal char_addr2_160x120: std_logic_vector(0 downto 0);
signal char_addr_40x30: std_logic_vector(7 downto 0);
signal char_addr_80x60: std_logic_vector(1 downto 0);
signal row_addr_80x60: std_logic_vector(2 downto 0);
signal rom_addr_80x60: std_logic_vector(4 downto 0);
signal char_addr_40x15: std_logic_vector(7 downto 0);
signal row_addr_40x15: std_logic_vector(4 downto 0);
signal rom_addr_40x15: std_logic_vector(12 downto 0);

signal bit_add_160x120: unsigned(1 downto 0);
signal bit_add2_160x120: unsigned(1 downto 0);
signal bit_add_80x60: unsigned(2 downto 0);
signal bit_add_40x30: unsigned(3 downto 0);
signal bit_add_40x15: unsigned(3 downto 0);

signal font_word_160x120: std_logic_vector(3 downto 0);
signal font_word2_160x120: std_logic_vector(3 downto 0);
signal font_word_80x60: std_logic_vector(7 downto 0);
signal font_word_40x30: std_logic_vector(15 downto 0);
signal font_word_40x15: std_logic_vector(15 downto 0);

signal font_bit_160x120: std_logic;
signal font_bit2_160x120: std_logic;
signal font_bit_80x60: std_logic;
signal font_bit_40x30: std_logic;
signal font_bit_40x15: std_logic;

signal pixel_x_delay: std_logic_vector(9 downto 0);

signal rgb_reg: std_logic_vector(2 downto 0);
signal rgb_reg_next: std_logic_vector(2 downto 0);
signal rgb_reg_next_1: std_logic_vector(2 downto 0);
signal rgb_reg_next_2: std_logic_vector(2 downto 0);
signal rgb_reg_next_3: std_logic_vector(2 downto 0);
signal rgb_reg_next_4: std_logic_vector(2 downto 0);
signal rgb_reg_next_5: std_logic_vector(2 downto 0);

begin

memory_unit_160x120: video_memory_160x120 PORT MAP ( clk => clk, we => write_enable_160x120, a => add_write_160x120, d => data_in_160x120, dpra => add_read_160x120, dpo => data_out_160x120 );

memory_unit2_160x120: video_memory2_160x120 PORT MAP ( clk => clk, we => write_enable2_160x120, a => add_write2_160x120, d => data_in2_160x120, dpra => add_read2_160x120, dpo => data_out2_160x120 );

memory_unit_80x60: video_memory_80x60 PORT MAP ( clk => clk, we => write_enable_80x60, a => add_write_80x60, d => data_in_80x60, dpra => add_read_80x60, dpo => data_out_80x60 );

memory_unit_40x30: video_memory_40x30 PORT MAP ( clk => clk, we => write_enable_40x30, a => add_write_40x30, d => data_in_40x30, dpra => add_read_40x30, dpo => data_out_40x30 );

memory_unit_40x15: video_memory_40x15 PORT MAP ( clk => clk, we => write_enable_40x15, a => add_write_40x15, d => data_in_40x15, dpra => add_read_40x15, dpo => data_out_40x15 );

add_write_160x120 <= (v_mem_y(6 downto 0) & v_mem_x(7 downto 0));
data_in_160x120 <= "1" when video_mode = "011" and disp_write = '1' else "0";
write_enable_160x120 <= '1' when video_mode = "011" and (disp_write = '1' or disp_clear = '1') else '0';

add_write2_160x120 <= (v_mem_y(6 downto 0) & v_mem_x(7 downto 0));
data_in2_160x120 <= "1" when video_mode = "100" and disp_write = '1' else "0";
write_enable2_160x120 <= '1' when video_mode = "100" and (disp_write = '1' or disp_clear = '1') else '0';

add_write_80x60 <= (v_mem_y(5 downto 0) & v_mem_x(6 downto 0));
data_in_80x60 <= disp_data(1 downto 0) when video_mode = "010" and disp_write = '1' else "00";
write_enable_80x60 <= '1' when video_mode = "010" and (disp_write = '1' or disp_clear = '1') else '0';

add_write_40x30 <= (v_mem_y(4 downto 0) & v_mem_x(5 downto 0));
data_in_40x30 <= "00000001" when video_mode = "001" and disp_write = '1' else "00000000";
write_enable_40x30 <= '1' when video_mode = "001" and (disp_write = '1' or disp_clear = '1') else '0';

add_write_40x15 <= (v_mem_y(3 downto 0) & v_mem_x(5 downto 0));
data_in_40x15 <= disp_data(7 downto 0) when video_mode = "000" and disp_write = '1' else "00000000";
write_enable_40x15 <= '1' when video_mode = "000" and (disp_write = '1' or disp_clear = '1') else '0';

add_read_160x120 <= (pixel_y(8 downto 2) & pixel_x(9 downto 2));
char_addr_160x120 <= data_out_160x120;

add_read2_160x120 <= (pixel_y(8 downto 2) & pixel_x(9 downto 2));
char_addr2_160x120 <= data_out2_160x120;

add_read_80x60 <= (pixel_y(8 downto 3) & pixel_x(9 downto 3));
char_addr_80x60 <= data_out_80x60;

add_read_40x30 <= (pixel_y(8 downto 4) & pixel_x(9 downto 4));
char_addr_40x30 <= data_out_40x30;

add_read_40x15 <= (pixel_y(8 downto 5) & pixel_x(9 downto 4));
char_addr_40x15 <= data_out_40x15;

font_unit80x60: entity work.font_rom80x60
    PORT MAP (
        addr => rom_addr_80x60,
        data => font_word_80x60
    );

font_unit40x15: entity work.font_rom40x15
    PORT MAP (
        addr => rom_addr_40x15,
        data => font_word_40x15
    );

row_addr_80x60 <= pixel_y(2 downto 0);
rom_addr_80x60 <= (char_addr_80x60 & row_addr_80x60);

row_addr_40x15 <= pixel_y(4 downto 0);
rom_addr_40x15 <= (char_addr_40x15 & row_addr_40x15);

font_word_160x120 <= "1111" when char_addr_160x120 = "1" else "0000";
font_word2_160x120 <= "1111" when char_addr2_160x120 = "1" else "0000";
font_word_40x30 <= "1111111111111111" when char_addr_40x30 = "00000001" else "0000000000000000";

pixel_delay_reg: process(clk)
begin
    if (rising_edge(clk)) then
        pixel_x_delay <= pixel_x; -- Delay for Distributed RAM
    end if;
end process;

bit_add_160x120 <= unsigned(pixel_x_delay(1 downto 0));
font_bit_160x120 <= font_word_160x120(to_integer(not bit_add_160x120));

bit_add2_160x120 <= unsigned(pixel_x_delay(1 downto 0));
font_bit2_160x120 <= font_word2_160x120(to_integer(not bit_add2_160x120));

bit_add_80x60 <= unsigned(pixel_x_delay(2 downto 0));
font_bit_80x60 <= font_word_80x60(to_integer(not bit_add_80x60));

bit_add_40x30 <= unsigned(pixel_x_delay(3 downto 0));
font_bit_40x30 <= font_word_40x30(to_integer(not bit_add_40x30));

bit_add_40x15 <= unsigned(pixel_x_delay(3 downto 0));
font_bit_40x15 <= font_word_40x15(to_integer(not bit_add_40x15));

-- Make it possible to draw black over a green area (all cursor memory must be set to green)
rgb_reg_next_1 <= "010" when video_on = '1' and ((font_bit2_160x120 = '0' and font_bit_160x120 = '1') or font_bit_40x15 = '1') else
                  "000";

-- Change the square color of the eraser depending on the background color
rgb_reg_next_2 <= "010" when video_on = '1' and ((font_bit2_160x120 = '0' and font_bit_160x120 = '1') or font_bit_40x15 = '1') else
                  "010" when video_on = '1' and ((font_bit2_160x120 = '1' and font_bit_160x120 = '0') or font_bit_40x15 = '1') else
                  "000";

-- Operate normally
rgb_reg_next_3 <= "010" when video_on = '1' and (font_bit_160x120 = '1' or font_bit2_160x120 = '1' or font_bit_80x60 = '1' or font_bit_40x30 = '1' or font_bit_40x15 = '1') else
                  "000";

rgb_reg_next_4 <= rgb_reg_next_3 when color_mode = '0' and eraser_mode = '0' else
                  rgb_reg_next_1 when color_mode = '1' and eraser_mode = '0' else
                  rgb_reg_next_2 when color_mode = '0' and eraser_mode = '1' else
                  rgb_reg_next_3 when color_mode = '1' and eraser_mode = '1' else
                  "000";

-- Change the cursor color according to the display background color
rgb_reg_next_5 <= rgb_reg_next_4 when add_read_160x120 /= cursor_address or (add_read_160x120 = cursor_address and font_bit_160x120 = '0' and font_bit_40x15 = '0') else
                  "000";

rgb_reg_next <= rgb_reg_next_4 when cursor_mode = '0' else
                rgb_reg_next_5 when cursor_mode = '1' else
                "000";

rgb_buf_25MHz: process(vga_clk)
begin
    if (rising_edge(vga_clk)) then
        rgb_reg <= rgb_reg_next;
    end if;
end process;

rgb <= rgb_reg;

end Behavioral;

A.5 line_draw.vhd

------------------------------------------------------------
-- Line drawing
------------------------------------------------------------
-- Copyright (c) 2018 Lauri Isola
--
-- Released under the MIT license (see LICENSE.txt)
------------------------------------------------------------
-- Function:  Generates line drawing coordinates
--            for display controller
-- Algorithm: Bresenham
------------------------------------------------------------

library ieee;
use ieee.std_logic_1164.all;
use ieee.std_logic_unsigned.all;
use ieee.numeric_std.all;

entity line_draw is
    Port (
        clk: in std_logic;
        x0_start: in std_logic_vector(7 downto 0);
        y0_start: in std_logic_vector(7 downto 0);
        x0_end: in std_logic_vector(7 downto 0);
        y0_end: in std_logic_vector(7 downto 0);
        x_line_draw: out std_logic_vector(31 downto 0);
        y_line_draw: out std_logic_vector(31 downto 0);
        line_start: in std_logic;
        line_update: out std_logic;
        line_ready: out std_logic
    );
end line_draw;

architecture Behavioral of line_draw is

type state_type is (Idle, Check1, Check2, Check3, Check4, Initialize1, Initialize2, Calculate, Count1, Count2);

signal state: state_type;

signal X_start: std_logic_vector(7 downto 0) := (others => '0');
signal X_end: std_logic_vector(7 downto 0) := (others => '0');
signal Y_start: std_logic_vector(7 downto 0) := (others => '0');
signal Y_end: std_logic_vector(7 downto 0) := (others => '0');
signal dy_sig: std_logic_vector(7 downto 0) := (others => '0');
signal dy: std_logic_vector(8 downto 0) := (others => '0');
signal dx: std_logic_vector(8 downto 0) := (others => '0');
signal P: std_logic_vector(8 downto 0) := (others => '0');
signal mul_2dx: unsigned(17 downto 0) := (others => '0');
signal mul_2dy: unsigned(17 downto 0) := (others => '0');
signal C1: std_logic_vector(7 downto 0) := (others => '0');
signal C2: std_logic_vector(7 downto 0) := (others => '0');
signal x: std_logic_vector(7 downto 0) := (others => '0');
signal y: std_logic_vector(7 downto 0) := (others => '0');
signal k0: std_logic := '0';
signal k1: std_logic := '0';
signal k2: std_logic := '0';

begin

-- Display output
x_line_draw <= X"000000" & x;
y_line_draw <= X"000000" & y;

line_draw: process(clk)
begin
    if (rising_edge(clk)) then
        line_ready <= '0';
        line_update <= '0';
        case state is
        when Idle =>
            line_ready <= '1';
            C1 <= (others => '0');
            C2 <= (others => '0');
            P <= (others => '0');
            mul_2dx <= (others => '0');
            mul_2dy <= (others => '0');
            if (line_start = '1') then
                state <= Check1;
            end if;

        when Check1 =>
            if (x0_start > x0_end) then
                X_start <= x0_end;
                X_end <= x0_start;
                Y_start <= y0_end;
                Y_end <= y0_start;
            else
                X_start <= x0_start;
                X_end <= x0_end;
                Y_start <= y0_start;
                Y_end <= y0_end;
            end if;
            state <= Check2;

        when Check2 =>
            dy_sig <= Y_end - Y_start;
            state <= Check3;

        when Check3 =>
            dx <= '0' & (X_end - X_start);
            if (dy_sig > X"7F") then
                dy <= '0' & ((not dy_sig) + 1);
                k2 <= '1';
            else
                dy <= '0' & dy_sig;
                k2 <= '0';
            end if;
            x <= X_start;
            y <= Y_start;
            state <= Check4;

        when Check4 =>
            if (dy > dx) then
                k1 <= '1';
            else
                k1 <= '0';
            end if;
            state <= Initialize1;

        when Initialize1 =>
            mul_2dx <= 2 * unsigned(dx);
            mul_2dy <= 2 * unsigned(dy);
            state <= Initialize2;

        when Initialize2 =>
            if (k1 = '1') then
                P <= std_logic_vector(mul_2dx(8 downto 0)) - dy;
                C1 <= Y_start;
                C2 <= X_start;
            else
                P <= std_logic_vector(mul_2dy(8 downto 0)) - dx;
                C1 <= X_start;
                C2 <= Y_start;
            end if;
            line_update <= '1';
            state <= Calculate;

        when Calculate =>
            if (k1 = '1') then
                if (P < X"FF") then
                    P <= P + std_logic_vector(mul_2dx(8 downto 0)) - std_logic_vector(mul_2dy(8 downto 0));
                else
                    P <= P + std_logic_vector(mul_2dx(8 downto 0));
                end if;
            else
                if (P < X"FF") then
                    P <= P + std_logic_vector(mul_2dy(8 downto 0)) - std_logic_vector(mul_2dx(8 downto 0));
                else
                    P <= P + std_logic_vector(mul_2dy(8 downto 0));
                end if;
            end if;
            state <= Count1;

        when Count1 =>
            if (P < X"FF") then
                if (k1 = '1' and k2 = '1') then
                    C1 <= C1 - 1;
                else
                    C1 <= C1 + 1;
                end if;
                if (k1 = '0' and k2 = '1') then
                    C2 <= C2 - 1;
                else
                    C2 <= C2 + 1;
                end if;
            else
                if (k1 = '1' and k2 = '1') then
                    C1 <= C1 - 1;
                else
                    C1 <= C1 + 1;
                end if;
            end if;

            if (x = X_end and y = Y_end) then
                state <= Idle;
            else
                state <= Count2;
            end if;

        when Count2 =>
            if (k1 = '1') then
                x <= C2;
                y <= C1;
            else
                x <= C1;
                y <= C2;
            end if;
            line_update <= '1';
            state <= Calculate;

        when others => null;
        end case;
    end if;
end process;

end Behavioral;

A.6 ellipse_draw.vhd

------------------------------------------------------------
-- Ellipse drawing
------------------------------------------------------------
-- Copyright (c) 2018 Lauri Isola
--
-- Released under the MIT license (see LICENSE.txt)
------------------------------------------------------------
-- Function:  Generates ellipse coordinates
--            for display controller
-- Algorithm: Bresenham
------------------------------------------------------------

library ieee;
use ieee.std_logic_1164.all;
use ieee.std_logic_unsigned.all;
use ieee.numeric_std.all;

entity ellipse_draw is
    Port (
        clk: in std_logic;
        x_coord: in std_logic_vector(31 downto 0);
        y_coord: in std_logic_vector(31 downto 0);
        a: in std_logic_vector(31 downto 0);
        b: in std_logic_vector(31 downto 0);
        x_ellipse_draw: out std_logic_vector(31 downto 0);
        y_ellipse_draw: out std_logic_vector(31 downto 0);
        ellipse_start: in std_logic;
        ellipse_update: out std_logic;
        ellipse_ready: out std_logic
    );
end ellipse_draw;

architecture Behavioral of ellipse_draw is

type state_type is (Idle, CalcDef1, CalcDef2, CalcDef3, Initialize, Draw1, Draw2, Draw3, Draw4, Increment, Calculate1, Calculate2, Calculate3, Calculate4, Calculate5, Calculate6, Check, ReadyHalf1);

signal state: state_type;

signal x: unsigned(23 downto 0):=( others => '0'); signal y: unsigned(23 downto 0):=( others => '0'); signal sigma: unsigned(23 downto 0):=( others => '0'); signal sum: unsigned(23 downto 0):=( others => '0'); signal mul: unsigned(47 downto 0):=( others => '0'); signal sum_sigma: unsigned(47 downto 0):=( others => '0'); signal mul_sigma: unsigned(71 downto 0):=( others => '0'); signal sigma_def: unsigned(71 downto 0):=( others => '0'); signal a2: unsigned(23 downto 0):=( others => '0'); signal b2: unsigned(23 downto 0):=( others => '0'); signal mul_a2: unsigned(63 downto 0):=( others => '0'); signal mul_b2: unsigned(63 downto 0):=( others => '0'); signal fa2: unsigned(23 downto 0):=( others => '0'); signal fb2: unsigned(23 downto 0):=( others => '0'); signal mul_fa2: unsigned(47 downto 0):=( others => '0'); signal mul_fb2: unsigned(47 downto 0):=( others => '0'); signal x_ellipse: unsigned(23 downto 0):=( others => '0'); signal y_ellipse: unsigned(23 downto 0):=( others => '0'); signal half2: std_logic:= '0'; begin

  -- Output
  x_ellipse_draw <= X"000000" & (std_logic_vector(x_ellipse(7 downto 0)) + x_coord(7 downto 0));
  y_ellipse_draw <= X"000000" & (std_logic_vector(y_ellipse(7 downto 0)) + y_coord(7 downto 0));

  ellipse_draw: process(clk)
  begin
    if (rising_edge(clk)) then
      ellipse_ready <= '0';
      ellipse_update <= '0';
      case state is
        when Idle =>
          ellipse_ready <= '1';
          half2 <= '0';
          if (ellipse_start = '1') then
            state <= CalcDef1;
          end if;

when CalcDef1 => mul_a2 <= unsigned(a)*unsigned(a); mul_b2 <= unsigned(b)*unsigned(b); state <= CalcDef2; when CalcDef2 => a2 <= mul_a2(23 downto 0); b2 <= mul_b2(23 downto 0); if (half2= '1') then mul <=1-(2* unsigned(a(23 downto 0))); else mul <=1-(2* unsigned(b(23 downto 0))); end if; state <= CalcDef3; when CalcDef3 => mul_fa2 <=4* a2; mul_fb2 <=4* b2; if (half2= '1') then sigma_def <=(2*a2)+ (b2* mul); else sigma_def <=(2*b2)+ (a2* mul); end if; state <= Initialize; when Initialize => fa2 <= mul_fa2(23 downto 0); fb2 <= mul_fb2(23 downto 0); if (half2= '1') then x <= unsigned(a(23 downto 0)); y <=( others => '0'); else x <=( others => '0'); y <= unsigned(b(23 downto 0)); end if; sigma <= sigma_def(23 downto 0); state <= Draw1; when Draw1 => x_ellipse <= x; y_ellipse <= y; ellipse_update <= '1'; state <= Draw2; when Draw2 => x_ellipse <= x; y_ellipse <=( not y)+1; ellipse_update <= '1'; state <= Draw3; when Draw3 => x_ellipse <=( not x)+1; y_ellipse <= y; ellipse_update <= '1'; state <= Draw4; when Draw4 => x_ellipse <=( not x)+1; y_ellipse <=( not y)+1; ellipse_update <= '1'; state <= Calculate1; when Calculate1 => if (half2= '1') then sum <=(1- x); mul <=(4*y)+6; else sum <=(1- y); mul <=(4*x)+6; end if; state <= Calculate2; when Calculate2 => if (half2= '1') then sum_sigma <= fb2* sum; mul_sigma <= a2* mul; state <= Calculate5; else sum_sigma <= fa2* sum; mul_sigma <= b2* mul; state <= Calculate3; end if; 96

        when Calculate3 =>
          if (sigma <= X"7FFFFF") then
            sigma <= sigma + sum_sigma(23 downto 0);
            y <= y - 1;
          end if;
          state <= Calculate4;

        when Calculate4 =>
          sigma <= sigma + mul_sigma(23 downto 0);
          x <= x + 1;
          state <= Check;

        when Calculate5 =>
          if (sigma <= X"7FFFFF") then
            sigma <= sigma + sum_sigma(23 downto 0);
            x <= x - 1;
          end if;
          state <= Calculate6;

        when Calculate6 =>
          sigma <= sigma + mul_sigma(23 downto 0);
          y <= y + 1;
          state <= Check;

        when Check =>
          if (half2 = '1') then
            if (a2*y <= b2*x) then
              state <= Draw1;
            else
              state <= Idle;
            end if;
          else
            if (b2*x <= a2*y) then
              state <= Draw1;
            else
              state <= ReadyHalf1;
            end if;
          end if;

        when ReadyHalf1 =>
          half2 <= '1';
          state <= CalcDef2;

        when others =>
          null;
      end case;
    end if;
  end process;

end Behavioral;
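The arithmetic of the ellipse_draw listing above can be followed more easily in software. The sketch below is a Python rendering of the same two-half scheme under stated assumptions: the function name `ellipse_points`, the use of unbounded Python integers instead of 24-bit registers, and the set return value are illustrative choices, not part of the original design. The test `sigma >= 0` corresponds to the comparison against X"7FFFFF" in the Calculate3 and Calculate5 states.

```python
def ellipse_points(xc, yc, a, b):
    """Return the set of pixels of an ellipse centred at (xc, yc) with radii a, b."""
    a2, b2 = a * a, b * b
    fa2, fb2 = 4 * a2, 4 * b2
    points = set()

    def plot4(x, y):
        # Four mirrored points, as in states Draw1..Draw4.
        points.update({(xc + x, yc + y), (xc + x, yc - y),
                       (xc - x, yc + y), (xc - x, yc - y)})

    # First half: start at the top (x = 0, y = b) and step in x.
    x, y = 0, b
    sigma = 2 * b2 + a2 * (1 - 2 * b)
    while b2 * x <= a2 * y:
        plot4(x, y)
        if sigma >= 0:
            sigma += fa2 * (1 - y)
            y -= 1
        sigma += b2 * (4 * x + 6)
        x += 1

    # Second half: start at the side (x = a, y = 0) and step in y.
    x, y = a, 0
    sigma = 2 * a2 + b2 * (1 - 2 * a)
    while a2 * y <= b2 * x:
        plot4(x, y)
        if sigma >= 0:
            sigma += fb2 * (1 - x)
            x -= 1
        sigma += a2 * (4 * y + 6)
        y += 1

    return points
```

For example, `ellipse_points(0, 0, 5, 3)` contains the four extreme points (5, 0), (-5, 0), (0, 3) and (0, -3).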

A.7 area_paint.vhd

--------------------------------------------------
-- Area painting
--------------------------------------------------
-- Copyright (c) 2018 Lauri Isola
--
-- Released under the MIT license (see LICENSE.txt)
--------------------------------------------------
-- Function: Fill tool for display controller
-- Algorithm: Flood fill
--------------------------------------------------

library ieee;
use ieee.std_logic_1164.all;
use ieee.std_logic_unsigned.all;
use ieee.numeric_std.all;

entity area_paint is
  Port (
    clk: in std_logic;
    x_paint: in std_logic_vector(31 downto 0);
    y_paint: in std_logic_vector(31 downto 0);
    x_paint_draw: out std_logic_vector(31 downto 0);
    y_paint_draw: out std_logic_vector(31 downto 0);
    paint_mem_addr: out std_logic_vector(14 downto 0);
    paint_mem_in: out std_logic;
    paint_mem_out: in std_logic;
    paint_mem_write: out std_logic;
    new_color: in std_logic;
    paint_start: in std_logic;
    paint_ready: out std_logic
  );
end area_paint;

architecture Behavioral of area_paint is

COMPONENT paint_stack PORT ( clka: IN STD_LOGIC; wea: IN STD_LOGIC_VECTOR(0 DOWNTO 0); addra: IN STD_LOGIC_VECTOR(14 DOWNTO 0); dina: IN STD_LOGIC_VECTOR(36 DOWNTO 0); douta: OUT STD_LOGIC_VECTOR(36 DOWNTO 0) ); END COMPONENT;

type state_type is (Idle, Initialize1, Initialize2, Fill, Xplus1, Check1, Return1, Xminus1, Check2, Return2, Yplus1, Check3, Return3, Yminus1, Check4, Return4, Pop5, Return5, MemWait, Ready);

signal state: state_type; signal state_next: state_type; signal state_stack: state_type; signal state_5bit: std_logic_vector(4 downto 0);

signal x: std_logic_vector(7 downto 0):=( others => '0'); signal x_old: std_logic_vector(7 downto 0):=( others => '0'); signal y: std_logic_vector(7 downto 0):=( others => '0'); signal y_old: std_logic_vector(7 downto 0):=( others => '0'); signal stack_datain: std_logic_vector(36 downto 0):=( others => '0'); signal stack_dataout: std_logic_vector(36 downto 0):=( others => '0'); signal stack_pointer: std_logic_vector(14 downto 0):=( others => '0'); signal stack_we: std_logic:= '0'; signal color: std_logic:= '0'; signal previous_color: std_logic:= '0'; signal not_prev_color: std_logic:= '0'; signal hw_border: std_logic:= '0'; begin

  stack_memory: paint_stack
    PORT MAP (
      clka => clk,
      wea(0) => stack_we,
      addra => stack_pointer,
      dina => stack_datain,
      douta => stack_dataout
    );

  -- Color selection
  color <= new_color;

  -- Display output
  x_paint_draw <= X"000000" & x;
  y_paint_draw <= X"000000" & y;

  -- Paint memory input
  paint_mem_addr <= y(6 downto 0) & x;
  paint_mem_in <= color;

  -- Next pixel color and border check
  hw_border <= '1' when (x = X"08" and x_old = X"9F") or
                        (x = X"9F" and x_old = X"08") or
                        (y = X"00" and y_old = X"77") or
                        (y = X"77" and y_old = X"00") else '0';
  not_prev_color <= (previous_color xor paint_mem_out) or hw_border;

  -- Paint stack input
  stack_datain <= ('1' & X"300000000") when state = Initialize2 else
                  (state_5bit + 3) & y_old & x_old & y & x;
  stack_we <= '1' when state = Initialize2 else
              '1' when state = Xplus1 else
              '1' when state = Xminus1 else
              '1' when state = Yplus1 else
              '1' when state = Yminus1 else
              '0';

  area_paint: process(clk)
  begin
    if (rising_edge(clk)) then
      paint_mem_write <= '0';
      case state is
        when Idle =>
          paint_ready <= '1';
          if (paint_start = '1') then
            paint_ready <= '0';
            state <= Initialize1;
          end if;

        when Initialize1 =>
          paint_ready <= '0';
          stack_pointer <= (others => '0');
          x <= x_paint(7 downto 0);
          y <= y_paint(7 downto 0);
          x_old <= X"01";
          y_old <= X"01";
          state_next <= Initialize2;
          state <= MemWait;

        when Initialize2 =>
          if (color = paint_mem_out) then
            state <= Ready;
          else
            state <= Fill;
          end if;
          previous_color <= paint_mem_out;
          stack_pointer <= (stack_pointer + 1);

        when Fill =>
          paint_mem_write <= not not_prev_color;
          state <= Xplus1;

        when Xplus1 =>
          if (x = X"9F") then
            x <= X"08";
          else
            x <= x + 1;
          end if;
          x_old <= x;
          stack_pointer <= (stack_pointer + 1);
          state_next <= Check1;
          state <= MemWait;

        when Check1 =>
          if (not_prev_color = '1') then
            stack_pointer <= (stack_pointer - 1);
            state_next <= Return1;
          else
            state_next <= Fill;
          end if;
          state <= MemWait;

when Return1 => x <= stack_dataout(7 downto 0); y <= stack_dataout(15 downto 8); x_old <= stack_dataout(23 downto 16); y_old <= stack_dataout(31 downto 24); state <= state_stack; when Xminus1 => if (x= X"08") then x <= X"9F"; else x <=x-1; end if; x_old <= x; stack_pointer <= (stack_pointer+1); state_next <= Check2; state <= MemWait; when Check2 => if (not_prev_color= '1') then stack_pointer <= (stack_pointer-1); state_next <= Return2; else state_next <= Fill; end if; state <= MemWait; when Return2 => x <= stack_dataout(7 downto 0); y <= stack_dataout(15 downto 8); x_old <= stack_dataout(23 downto 16); y_old <= stack_dataout(31 downto 24); state <= state_stack; when Yplus1 => if (y= X"77") then y <= X"00"; else y <=y+1; end if; y_old <= y; stack_pointer <= (stack_pointer+1); state_next <= Check3; state <= MemWait; when Check3 => if (not_prev_color= '1') then stack_pointer <= (stack_pointer-1); state_next <= Return3; else state_next <= Fill; end if; state <= MemWait; when Return3 => x <= stack_dataout(7 downto 0); y <= stack_dataout(15 downto 8); x_old <= stack_dataout(23 downto 16); y_old <= stack_dataout(31 downto 24); state <= state_stack; when Yminus1 => if (y= X"00") then y <= X"77"; else y <=y-1; end if; y_old <= y; stack_pointer <= (stack_pointer+1); state_next <= Check4; state <= MemWait; when Check4 => if (not_prev_color= '1') then stack_pointer <= (stack_pointer-1); state_next <= Return4; else state_next <= Fill; end if; state <= MemWait; when Return4 => 100

          x <= stack_dataout(7 downto 0);
          y <= stack_dataout(15 downto 8);
          x_old <= stack_dataout(23 downto 16);
          y_old <= stack_dataout(31 downto 24);
          state <= state_stack;

        when Pop5 =>
          stack_pointer <= (stack_pointer - 1);
          state_next <= Return5;
          state <= MemWait;

        when Return5 =>
          x <= stack_dataout(7 downto 0);
          y <= stack_dataout(15 downto 8);
          x_old <= stack_dataout(23 downto 16);
          y_old <= stack_dataout(31 downto 24);
          state <= state_stack;

        when MemWait =>
          state <= state_next;

        when Ready =>
          paint_ready <= '1';
          state <= Idle;

        when others =>
          null;
      end case;
    end if;
  end process;

  state_stack <= Idle        when stack_dataout(36 downto 32) = "00000" else
                 Initialize1 when stack_dataout(36 downto 32) = "00001" else
                 Initialize2 when stack_dataout(36 downto 32) = "00010" else
                 Fill        when stack_dataout(36 downto 32) = "00011" else
                 Xplus1      when stack_dataout(36 downto 32) = "00100" else
                 Check1      when stack_dataout(36 downto 32) = "00101" else
                 Return1     when stack_dataout(36 downto 32) = "00110" else
                 Xminus1     when stack_dataout(36 downto 32) = "00111" else
                 Check2      when stack_dataout(36 downto 32) = "01000" else
                 Return2     when stack_dataout(36 downto 32) = "01001" else
                 Yplus1      when stack_dataout(36 downto 32) = "01010" else
                 Check3      when stack_dataout(36 downto 32) = "01011" else
                 Return3     when stack_dataout(36 downto 32) = "01100" else
                 Yminus1     when stack_dataout(36 downto 32) = "01101" else
                 Check4      when stack_dataout(36 downto 32) = "01110" else
                 Return4     when stack_dataout(36 downto 32) = "01111" else
                 Pop5        when stack_dataout(36 downto 32) = "10000" else
                 Return5     when stack_dataout(36 downto 32) = "10001" else
                 MemWait     when stack_dataout(36 downto 32) = "10010" else
                 Ready       when stack_dataout(36 downto 32) = "10011" else
                 Idle;

  state_5bit <= "00000" when state = Idle else
                "00001" when state = Initialize1 else
                "00010" when state = Initialize2 else
                "00011" when state = Fill else
                "00100" when state = Xplus1 else
                "00101" when state = Check1 else
                "00110" when state = Return1 else
                "00111" when state = Xminus1 else
                "01000" when state = Check2 else
                "01001" when state = Return2 else
                "01010" when state = Yplus1 else
                "01011" when state = Check3 else
                "01100" when state = Return3 else
                "01101" when state = Yminus1 else
                "01110" when state = Check4 else
                "01111" when state = Return4 else
                "10000" when state = Pop5 else
                "10001" when state = Return5 else
                "10010" when state = MemWait else
                "10011" when state = Ready else
                "00000";

end Behavioral;
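The area_paint listing above implements flood fill without recursion by saving coordinates and a return state in the paint_stack memory. The same idea can be sketched in Python with an explicit stack. This is a simplified illustration, not a model of the state machine: the function name `flood_fill`, the list-of-lists `grid`, and the plain bounds check (instead of the wrap-around detection done by hw_border) are assumptions.

```python
def flood_fill(grid, x, y, new_color):
    """Fill the 4-connected region containing (x, y) in place.

    grid is a list of rows; returns the number of repainted pixels.
    """
    height, width = len(grid), len(grid[0])
    old_color = grid[y][x]
    if old_color == new_color:
        return 0  # nothing to do, compare the early exit in Initialize2
    stack = [(x, y)]  # explicit stack instead of recursion
    painted = 0
    while stack:
        cx, cy = stack.pop()
        if not (0 <= cx < width and 0 <= cy < height):
            continue  # outside the drawing area
        if grid[cy][cx] != old_color:
            continue  # border or already painted
        grid[cy][cx] = new_color
        painted += 1
        # Visit the four neighbours: x+1, x-1, y+1, y-1.
        stack.extend([(cx + 1, cy), (cx - 1, cy), (cx, cy + 1), (cx, cy - 1)])
    return painted
```

An explicit stack keeps memory use bounded by the size of the stack storage, which is the same reason the hardware version uses a dedicated block memory.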

A.8 input.vhd

--------------------------------------------------
-- Debounce logic for button inputs
--------------------------------------------------
-- Copyright (c) 2018 Lauri Isola
--
-- Released under the MIT license (see LICENSE.txt)
--------------------------------------------------

library ieee;
use ieee.std_logic_1164.all;
use ieee.std_logic_unsigned.all;
use ieee.numeric_std.all;

entity input is
  Port (
    clk: in std_logic;
    btnC: in std_logic;
    btnR: in std_logic;
    btnL: in std_logic;
    btnD: in std_logic;
    btnU: in std_logic;
    input_rst: in std_logic;
    input_data: out std_logic_vector(31 downto 0);
    input_flag: out std_logic
  );
end input;

architecture Behavioral of input is

  constant debounce_limit: integer := 1048576;
  constant debounce_limit_btn5: integer := 5242880;
  constant flip_flop_count: natural := 3;

  signal btnC_sync_chain: std_logic_vector(flip_flop_count-1 downto 0) := (others => '0');
  signal btnR_sync_chain: std_logic_vector(flip_flop_count-1 downto 0) := (others => '0');
  signal btnL_sync_chain: std_logic_vector(flip_flop_count-1 downto 0) := (others => '0');
  signal btnD_sync_chain: std_logic_vector(flip_flop_count-1 downto 0) := (others => '0');
  signal btnU_sync_chain: std_logic_vector(flip_flop_count-1 downto 0) := (others => '0');

  signal buttons: std_logic_vector(4 downto 0) := (others => '0');
  signal code: std_logic_vector(3 downto 0);
  signal temp: std_logic_vector(3 downto 0) := (others => '0');
  signal debounce_counter: std_logic_vector(23 downto 0) := (others => '0');
  signal inputdata: std_logic_vector(31 downto 0) := (others => '0');
  signal inputflag: std_logic := '0';
  signal pressed: std_logic := '0';

begin

  -- Output signals
  input_data <= inputdata;
  input_flag <= inputflag;

  -- Synchronizers for button inputs
  btnC_sync_chain <= (btnC_sync_chain(btnC_sync_chain'high-1 downto 0) & btnC) when rising_edge(clk);
  btnR_sync_chain <= (btnR_sync_chain(btnR_sync_chain'high-1 downto 0) & btnR) when rising_edge(clk);
  btnL_sync_chain <= (btnL_sync_chain(btnL_sync_chain'high-1 downto 0) & btnL) when rising_edge(clk);
  btnD_sync_chain <= (btnD_sync_chain(btnD_sync_chain'high-1 downto 0) & btnD) when rising_edge(clk);
  btnU_sync_chain <= (btnU_sync_chain(btnU_sync_chain'high-1 downto 0) & btnU) when rising_edge(clk);

  buttons(0) <= btnC_sync_chain(btnC_sync_chain'high);
  buttons(1) <= btnR_sync_chain(btnR_sync_chain'high);
  buttons(2) <= btnL_sync_chain(btnL_sync_chain'high);
  buttons(3) <= btnD_sync_chain(btnD_sync_chain'high);
  buttons(4) <= btnU_sync_chain(btnU_sync_chain'high);

  button_select: process(clk)
  begin
    if (rising_edge(clk)) then
      case buttons is
        when "00010" => -- Right
          code <= "0001";
        when "00100" => -- Left
          code <= "0010";
        when "01000" => -- Down
          code <= "0011";
        when "10000" => -- Up
          code <= "0100";
        when "00001" => -- Center
          code <= "0101";
        when "10010" => -- Up and Right
          code <= "0110";
        when "01010" => -- Down and Right
          code <= "0111";
        when "01100" => -- Down and Left
          code <= "1000";
        when "10100" => -- Up and Left
          code <= "1001";
        when others =>
          code <= "0000";
      end case;
    end if;
  end process;

  debounce: process(clk)
  begin
    if (rising_edge(clk)) then
      if (input_rst = '1') then
        if (temp /= "0101") then
          pressed <= '0';
        end if;
        inputflag <= '0';
        inputdata <= (others => '0');
      else
        if (code = "0000") then
          if (debounce_counter > 0) then
            debounce_counter <= debounce_counter - 1;
          else
            if (pressed = '1') then
              if (temp = "0101") then
                pressed <= '0';
                inputflag <= '1';
                inputdata(3 downto 0) <= temp;
              end if;
            else
              temp <= "0000";
            end if;
          end if;
        else
          if (pressed = '0') then
            if (debounce_counter = 0) then
              temp <= code;
            end if;
            if (temp = "0101") then
              if (debounce_counter < debounce_limit_btn5) then
                debounce_counter <= debounce_counter + 1;
              else
                pressed <= '1';
              end if;
            else
              if (debounce_counter < debounce_limit) then
                debounce_counter <= debounce_counter + 1;
              else
                pressed <= '1';
                inputflag <= '1';
                inputdata(3 downto 0) <= temp;
                debounce_counter <= (others => '0');
              end if;
            end if;
          end if;
        end if;
      end if;
    end if;
  end process;

end Behavioral;
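The button_select process above maps the five synchronized button bits to a 4-bit code. The mapping can be summarized as a lookup table; the Python sketch below restates it for use in test scripts. The names `BUTTON_CODES` and `encode_buttons` are hypothetical, and the debounce timing itself is not modelled.

```python
# Bit order follows the buttons vector: bit 4 = Up, 3 = Down, 2 = Left,
# 1 = Right, 0 = Center.
BUTTON_CODES = {
    0b00010: 0x1,  # Right
    0b00100: 0x2,  # Left
    0b01000: 0x3,  # Down
    0b10000: 0x4,  # Up
    0b00001: 0x5,  # Center
    0b10010: 0x6,  # Up and Right
    0b01010: 0x7,  # Down and Right
    0b01100: 0x8,  # Down and Left
    0b10100: 0x9,  # Up and Left
}


def encode_buttons(up, down, left, right, center):
    """Return the 4-bit code for a button combination (0 if not recognized)."""
    pattern = (up << 4) | (down << 3) | (left << 2) | (right << 1) | center
    return BUTTON_CODES.get(pattern, 0)
```

Any combination that is not listed, such as Left and Right pressed together, yields code 0, matching the others branch of the case statement.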

A.9 top.vhd

--------------------------------------------------
-- Top level
--------------------------------------------------
-- Copyright (c) 2018 Lauri Isola
--
-- Released under the MIT license (see LICENSE.txt)
--------------------------------------------------
-- System modules:
--
-- ASIP38
-- Debounce logic for button inputs
-- Display controller
-- RGB signal generation
-- VGA synchronization
-- Line draw
-- Ellipse draw
-- Area paint
--------------------------------------------------

library ieee;
use ieee.std_logic_1164.all;
use ieee.std_logic_unsigned.all;
use ieee.numeric_std.all;

entity top is
  PORT (
    clk: in std_logic; -- 100 MHz clock
    Hsync: out std_logic;
    Vsync: out std_logic;
    vgaRed: out std_logic_vector(3 downto 0);
    vgaBlue: out std_logic_vector(3 downto 0);
    vgaGreen: out std_logic_vector(3 downto 0);
    btnC: in std_logic;
    btnR: in std_logic;
    btnL: in std_logic;
    btnD: in std_logic;
    btnU: in std_logic
  );
end top;

architecture Behavioral of top is

  -- Processor
  signal x: std_logic_vector(31 downto 0);
  signal y: std_logic_vector(31 downto 0);
  signal disp_data: std_logic_vector(31 downto 0);
  signal disp_cmd: std_logic_vector(5 downto 0);
  signal disp_set: std_logic;
  signal disp_clr: std_logic;
  signal ready: std_logic;

  -- Input
  signal input_rst: std_logic;
  signal input_data: std_logic_vector(31 downto 0);
  signal input_flag: std_logic;

  -- RGB signal generation
  signal v_mem_x: std_logic_vector(31 downto 0);
  signal v_mem_y: std_logic_vector(31 downto 0);
  signal cursor_address: std_logic_vector(14 downto 0);
  signal video_mode: std_logic_vector(2 downto 0);
  signal color_mode: std_logic;
  signal cursor_mode: std_logic;
  signal eraser_mode: std_logic;
  signal disp_write: std_logic;
  signal disp_clear: std_logic;

  -- VGA synchronization
  signal pixel_x: std_logic_vector(9 downto 0);
  signal pixel_y: std_logic_vector(9 downto 0);
  signal video_on: std_logic;
  signal vga_clk: std_logic;
  signal rgb: std_logic_vector(2 downto 0);

  -- Line draw
  signal x0_start: std_logic_vector(7 downto 0);
  signal y0_start: std_logic_vector(7 downto 0);
  signal x0_end: std_logic_vector(7 downto 0);
  signal y0_end: std_logic_vector(7 downto 0);
  signal x_line_draw: std_logic_vector(31 downto 0);
  signal y_line_draw: std_logic_vector(31 downto 0);
  signal line_start: std_logic;
  signal line_update: std_logic;
  signal line_ready: std_logic;

  -- Ellipse draw
  signal x_coord: std_logic_vector(31 downto 0);
  signal y_coord: std_logic_vector(31 downto 0);
  signal a: std_logic_vector(31 downto 0);
  signal b: std_logic_vector(31 downto 0);
  signal x_ellipse_draw: std_logic_vector(31 downto 0);
  signal y_ellipse_draw: std_logic_vector(31 downto 0);
  signal ellipse_start: std_logic;
  signal ellipse_update: std_logic;
  signal ellipse_ready: std_logic;

  -- Area paint
  signal x_paint: std_logic_vector(31 downto 0);
  signal y_paint: std_logic_vector(31 downto 0);
  signal x_paint_draw: std_logic_vector(31 downto 0);
  signal y_paint_draw: std_logic_vector(31 downto 0);
  signal paint_mem_addr: std_logic_vector(14 downto 0);
  signal paint_mem_in: std_logic;
  signal paint_mem_out: std_logic;
  signal paint_mem_write: std_logic;
  signal new_color: std_logic;
  signal paint_start: std_logic;
  signal paint_ready: std_logic;

begin

-- VGA output vgaRed <= "0000"; vgaBlue <= "0000"; vgaGreen <= rgb(1)& rgb(1)& rgb(1)& rgb(1); processor: ENTITY work.asip38 PORT MAP ( clk => clk, x => x, y => y, disp_data => disp_data, disp_cmd => disp_cmd, disp_set => disp_set, disp_clr => disp_clr, input_data => input_data, input_flag => input_flag, input_rst => input_rst, ready => ready ); buttons: ENTITY work.input PORT MAP ( clk => clk, btnC => btnC, btnR => btnR, btnL => btnL, btnD => btnD, btnU => btnU, input_rst => input_rst, input_data => input_data, input_flag => input_flag ); display_controller: ENTITY work.display_control PORT MAP ( clk => clk, x0_start => x0_start, y0_start => y0_start, x0_end => x0_end, y0_end => y0_end, x_line_draw => x_line_draw, y_line_draw => y_line_draw, line_start => line_start, line_update => line_update, line_ready => line_ready, x_coord => x_coord, y_coord => y_coord, a => a, b => b, 105

x_ellipse_draw => x_ellipse_draw, y_ellipse_draw => y_ellipse_draw, ellipse_start => ellipse_start, ellipse_update => ellipse_update, ellipse_ready => ellipse_ready, x_paint => x_paint, y_paint => y_paint, x_paint_draw => x_paint_draw, y_paint_draw => y_paint_draw, paint_mem_addr => paint_mem_addr, paint_mem_in => paint_mem_in, paint_mem_out => paint_mem_out, paint_mem_write => paint_mem_write, new_color => new_color, paint_start => paint_start, paint_ready => paint_ready, x => x, y => y, disp_cmd => disp_cmd, disp_set => disp_set, disp_clr => disp_clr, ready => ready, v_mem_x => v_mem_x, v_mem_y => v_mem_y, video_mode => video_mode, color_mode => color_mode, cursor_mode => cursor_mode, eraser_mode => eraser_mode, cursor_address => cursor_address, disp_clear => disp_clear, disp_write => disp_write ); rgb_signal_generation: ENTITY work.rgb_gen PORT MAP ( clk => clk, vga_clk => vga_clk, video_on => video_on, disp_data => disp_data, disp_clear => disp_clear, disp_write => disp_write, video_mode => video_mode, color_mode => color_mode, cursor_mode => cursor_mode, eraser_mode => eraser_mode, cursor_address => cursor_address, v_mem_x => v_mem_x, v_mem_y => v_mem_y, pixel_x => pixel_x, pixel_y => pixel_y, rgb => rgb ); vga_synchronization: ENTITY work.vga_sync PORT MAP ( clk => clk, hsync => Hsync, vsync => Vsync, pixel_x => pixel_x, pixel_y => pixel_y, vga_clk => vga_clk, video_on => video_on ); line_drawing: ENTITY work.line_draw PORT MAP ( clk => clk, x0_start => x0_start, y0_start => y0_start, x0_end => x0_end, y0_end => y0_end, x_line_draw => x_line_draw, y_line_draw => y_line_draw, line_start => line_start, line_update => line_update, line_ready => line_ready ); ellipse_drawing: ENTITY work.ellipse_draw PORT MAP ( clk => clk, x_coord => x_coord, y_coord => y_coord, 106

      a => a,
      b => b,
      x_ellipse_draw => x_ellipse_draw,
      y_ellipse_draw => y_ellipse_draw,
      ellipse_start => ellipse_start,
      ellipse_update => ellipse_update,
      ellipse_ready => ellipse_ready
    );

  paint: ENTITY work.area_paint
    PORT MAP (
      clk => clk,
      x_paint => x_paint,
      y_paint => y_paint,
      x_paint_draw => x_paint_draw,
      y_paint_draw => y_paint_draw,
      paint_mem_addr => paint_mem_addr,
      paint_mem_in => paint_mem_in,
      paint_mem_out => paint_mem_out,
      paint_mem_write => paint_mem_write,
      new_color => new_color,
      paint_start => paint_start,
      paint_ready => paint_ready
    );

end Behavioral;

A.10 assembler.py

### Assembler for ASIP38

# Copyright (c) 2018 Lauri Isola

# Released under the MIT license (see LICENSE.txt)

filename1 = 'asip38_assembly.txt'
filename2 = 'binary.txt'
filename3 = 'binary_opcode.txt'
filename4 = 'binary_fpga.coe'

RAW = 'v2.0 raw\n'
RADIX = 'memory_initialization_radix=16;\n'
VECTOR = 'memory_initialization_vector=\n'

# Instruction set

LDI = '00'  # AC <= immediate value
LDA = '01'  # AC <= RAM
STO = '02'  # RAM <= AC
JMP = '03'  # unconditional branch
ADD = '04'  # AC + RAM
SUB = '05'  # AC - RAM
MUL = '06'  # AC * RAM
AND = '07'  # AC and RAM
IOR = '08'  # AC or RAM
XOR = '09'  # AC xor RAM
INC = '0a'  # AC + 1
DEC = '0b'  # AC - 1
CIL = '0c'  # circulate AC left
CIR = '0d'  # circulate AC right
WAI = '0e'  # wait if ready is 0, continue when 1
LDX = '0f'  # X <= RAM
LDY = '10'  # Y <= RAM
INP = '11'  # RAM <= INPUT
OUT = '12'  # OUT <= RAM
SNZ = '13'  # skip if AC != 0
SZA = '14'  # skip if AC == 0
SGT = '15'  # skip if AC > RAM
SLT = '16'  # skip if AC < RAM
SKI = '17'  # skip if input == 0
SET = '18'  # send command to display controller
CLR = '19'  # clear video memory location
CAL = '1a'  # call subroutine
RET = '1c'  # return from subroutine
LFR = '1e'  # AC <= RAM(F), indirect
SFR = '1f'  # RAM(F) <= AC, indirect
WFR = '20'  # F <= AC

jump = {}
constant = {}
index = -1
line_number = 0
line_number_hex = 0

file1 = open(filename1, 'r')
while True:
    line = file1.readline()
    if len(line) == 0:
        break
    if line != '\n':
        labels = line.split()
        if labels[1] == 'EQU':
            constant[labels[0]] = labels[2] + '\n'
print(constant)

file1.seek(0)
while True:
    line = file1.readline()

    if len(line) == 0:
        break

    if line != '\n':
        labels = line.split()
        print(labels)
        name1 = labels[0]
        line_number_hex = hex(line_number)
        line = str(line_number_hex)
        line = line + '\n'
        jump[name1] = line[2:]
        line_number = line_number + 1

file1.close()
print(jump)

file2 = open(filename2, 'w')
file2.write(RAW)
file3 = open(filename3, 'w')
file3.write(RAW)
file4 = open(filename4, 'w')
file4.write(RADIX)
file4.write(VECTOR)

file1 = open(filename1, 'r')
while True:
    line = file1.readline()

    if len(line) == 0:
        break

    if line != '\n':
        labels = line.split()

        name2 = labels[1]
        name3 = labels[2]

        index = index + 1

        if name2 == 'LDI':
            operand = '00000000' + name3 + '\n'
            code = operand[-9:]
            file2.write(code)
            file3.write(LDI + '\n')
            file4.write(LDI + code)

        elif name2 == 'LDA':
            operand = '00000000' + constant[name3]
            code = operand[-9:]
            file2.write(code)
            file3.write(LDA + '\n')
            file4.write(LDA + code)

        elif name2 == 'STO':
            operand = '00000000' + constant[name3]
            code = operand[-9:]
            file2.write(code)
            file3.write(STO + '\n')
            file4.write(STO + code)

        elif name2 == 'JMP':
            operand = '00000000' + jump[name3]
            code = operand[-9:]
            file2.write(code)
            file3.write(JMP + '\n')
            file4.write(JMP + code)

        elif name2 == 'ADD':
            operand = '00000000' + constant[name3]
            code = operand[-9:]
            file2.write(code)
            file3.write(ADD + '\n')
            file4.write(ADD + code)

        elif name2 == 'SUB':
            operand = '00000000' + constant[name3]

            code = operand[-9:]
            file2.write(code)
            file3.write(SUB + '\n')
            file4.write(SUB + code)

        elif name2 == 'MUL':
            operand = '00000000' + constant[name3]
            code = operand[-9:]
            file2.write(code)
            file3.write(MUL + '\n')
            file4.write(MUL + code)

        elif name2 == 'AND':
            operand = '00000000' + constant[name3]
            code = operand[-9:]
            file2.write(code)
            file3.write(AND + '\n')
            file4.write(AND + code)

        elif name2 == 'IOR':
            operand = '00000000' + constant[name3]
            code = operand[-9:]
            file2.write(code)
            file3.write(IOR + '\n')
            file4.write(IOR + code)

        elif name2 == 'XOR':
            operand = '00000000' + constant[name3]
            code = operand[-9:]
            file2.write(code)
            file3.write(XOR + '\n')
            file4.write(XOR + code)

        elif name2 == 'INC':
            code = '00000000\n'
            file2.write(code)
            file3.write(INC + '\n')
            file4.write(INC + code)

        elif name2 == 'DEC':
            code = '00000000\n'
            file2.write(code)
            file3.write(DEC + '\n')
            file4.write(DEC + code)

        elif name2 == 'CIL':
            code = '00000000\n'
            file2.write(code)
            file3.write(CIL + '\n')
            file4.write(CIL + code)

        elif name2 == 'CIR':
            code = '00000000\n'
            file2.write(code)
            file3.write(CIR + '\n')
            file4.write(CIR + code)

        elif name2 == 'WAI':
            code = '00000000\n'
            file2.write(code)
            file3.write(WAI + '\n')
            file4.write(WAI + code)

        elif name2 == 'LDX':
            operand = '00000000' + constant[name3]
            code = operand[-9:]
            file2.write(code)
            file3.write(LDX + '\n')
            file4.write(LDX + code)

        elif name2 == 'LDY':
            operand = '00000000' + constant[name3]
            code = operand[-9:]
            file2.write(code)
            file3.write(LDY + '\n')
            file4.write(LDY + code)

        elif name2 == 'INP':
            operand = '00000000' + constant[name3]
            code = operand[-9:]
            file2.write(code)
            file3.write(INP + '\n')
            file4.write(INP + code)

        elif name2 == 'OUT':
            operand = '00000000' + constant[name3]
            code = operand[-9:]
            file2.write(code)
            file3.write(OUT + '\n')
            file4.write(OUT + code)

        elif name2 == 'SNZ':
            code = '00000000\n'
            file2.write(code)
            file3.write(SNZ + '\n')
            file4.write(SNZ + code)

        elif name2 == 'SZA':
            code = '00000000\n'
            file2.write(code)
            file3.write(SZA + '\n')
            file4.write(SZA + code)

        elif name2 == 'SGT':
            operand = '00000000' + constant[name3]
            code = operand[-9:]
            file2.write(code)
            file3.write(SGT + '\n')
            file4.write(SGT + code)

        elif name2 == 'SLT':
            operand = '00000000' + constant[name3]
            code = operand[-9:]
            file2.write(code)
            file3.write(SLT + '\n')
            file4.write(SLT + code)

        elif name2 == 'SKI':
            code = '00000000\n'
            file2.write(code)
            file3.write(SKI + '\n')
            file4.write(SKI + code)

        elif name2 == 'SET':
            operand = '00000000' + constant[name3]
            code = operand[-9:]
            file2.write(code)
            file3.write(SET + '\n')
            file4.write(SET + code)

        elif name2 == 'CLR':
            code = '00000000\n'
            file2.write(code)
            file3.write(CLR + '\n')
            file4.write(CLR + code)

        elif name2 == 'CAL':
            operand = '00000000' + jump[name3]
            code = operand[-9:]
            file2.write(code)
            file3.write(CAL + '\n')
            file4.write(CAL + code)

        elif name2 == 'RET':
            code = '00000000\n'
            file2.write(code)
            file3.write(RET + '\n')
            file4.write(RET + code)

        elif name2 == 'LFR':
            code = '00000000\n'
            file2.write(code)
            file3.write(LFR + '\n')
            file4.write(LFR + code)

        elif name2 == 'SFR':
            code = '00000000\n'
            file2.write(code)
            file3.write(SFR + '\n')
            file4.write(SFR + code)

        elif name2 == 'WFR':
            code = '00000000\n'
            file2.write(code)
            file3.write(WFR + '\n')
            file4.write(WFR + code)

        elif name2 == 'EQU':
            pass

        else:
            print('syntax error at line ' + str(index) + ': ' + name2)

    else:
        file2.write('\n')
        file3.write('\n')

file4.close()
file3.close()
file2.close()
file1.close()

# Post-process the COE file: add separators between the memory words
file4 = open(filename4, 'r')
file_lines = []
index = 0
while True:
    line = file4.readline()
    if len(line) == 0:
        break
    if line != '\n':
        current_index = file4.tell()
        file4.seek(current_index + 1)
        next_line = file4.readline()
        file4.seek(current_index)
        if index < 2:
            file_lines.append(''.join([line.strip(), '\n']))
        else:
            if len(next_line) == 0:
                file_lines.append(''.join([line.strip(), ';']))
            else:
                file_lines.append(''.join([line.strip(), ',', '\n']))
        index = index + 1
file4.close()

file4 = open(filename4, 'w')
file4.writelines(file_lines)
file4.close()
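The long if/elif chain in the assembler above repeats the same few lines for each mnemonic. One possible refactoring, sketched here under stated assumptions, is a table that maps each mnemonic to its opcode and operand kind. The names `INSTRUCTIONS` and `encode` are hypothetical, only a subset of the instruction set is shown, and the function returns one 10-digit hexadecimal word (2-digit opcode followed by an 8-digit operand) instead of writing to the three output files.

```python
# Operand kinds: "imm" = literal value, "const" = EQU symbol,
# "label" = jump target, None = instruction without operand.
INSTRUCTIONS = {
    "LDI": ("00", "imm"),
    "LDA": ("01", "const"),
    "STO": ("02", "const"),
    "JMP": ("03", "label"),
    "ADD": ("04", "const"),
    "INC": ("0a", None),
    "CAL": ("1a", "label"),
    "RET": ("1c", None),
}


def encode(mnemonic, operand, constants, labels):
    """Return opcode + 8-digit operand as a hexadecimal string.

    constants and labels map symbol names to hexadecimal strings,
    as the constant and jump dictionaries do in the listing.
    """
    opcode, kind = INSTRUCTIONS[mnemonic]  # KeyError signals an unknown mnemonic
    if kind == "imm":
        value = operand
    elif kind == "const":
        value = constants[operand]
    elif kind == "label":
        value = labels[operand]
    else:
        value = ""
    # Left-pad with zeros and keep the last 8 digits, like operand[-9:] above.
    return opcode + value.rjust(8, "0")[-8:]
```

With such a table, adding an instruction to the assembler becomes a one-line change, which fits the customizability goal of the processor.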