Universitat Politècnica de Catalunya (UPC) BarcelonaTech Facultat d’Informàtica de Barcelona (FIB)

RTL design and implementation of a framebuffer for a RISC-V processor

Educational Cooperative Agreement with Barcelona Supercomputing Centre (BSC)

Computer Engineering Degree Final Project

Author: Narcís Rodas Quiroga
Supervisor: Miquel Moretó (Computer Architecture Department, DAC)
Co-supervisor: Guillem Cabo
Specialization: Computer Engineering
Date of oral defense: 28th of October 2020

Abstract

The RISC-V instruction set architecture (ISA) and the foundation that supports it continue to grow rapidly as an open-source alternative for hardware designs. Although open-source software is already established as an important part of software solutions, open-source hardware has only recently begun to expand. Until then, the market was controlled entirely by proprietary ISAs, mostly from the US.

This Final Degree Thesis presents the design, implementation and testing of a VGA (Video Graphics Array) framebuffer for the RISC-V processor being developed in the DRAC project by the Barcelona Supercomputing Centre. This document explains the steps taken along the way and the reasoning behind the decisions that were made.

Keywords: RISC-V, VGA, RTL, Verilog, Framebuffer, AXI.

Resumen

El conjunto de instrucciones o ISA (del inglés instruction set architecture) RISC-V y la fundación que lo respalda siguen creciendo rápidamente como una alternativa open-source para los diseños hardware. Aunque el software open-source ya representa una parte importante de todas las soluciones software, el hardware open-source todavía está empezando a expandirse. Antes de esto, el mercado estaba compuesto íntegramente de ISAs propietarias (la gran mayoría provenientes de los E.E. U.U.) que lo controlaban.

Este Trabajo de Final de Grado muestra el diseño, implementación y el testing de un framebuffer VGA para el procesador RISC-V que se está desarrollando en el proyecto DRAC por el Barcelona Supercomputing Centre. En este documento se muestran los diversos pasos seguidos y el razonamiento detrás de las decisiones tomadas.

Palabras clave: RISC-V, VGA, RTL, Verilog, Memoria de video, AXI.


Resum

El conjunt d’instruccions o ISA (de l’anglès instruction set architecture) RISC-V i la fundació que el recolza segueixen creixent ràpidament com una alternativa open-source per als dissenys hardware. Tot i que el software open-source ja representa una part important de totes les solucions software, el hardware open-source encara està començant a expandir-se. Abans d’això, el mercat estava format íntegrament per ISAs propietàries (la gran majoria provinents dels EUA) que el controlaven.

Aquest Treball de Final de Grau mostra el disseny, implementació i el testing d’un framebuffer VGA pel processador RISC-V que s’està desenvolupant en el projecte DRAC del Barcelona Supercomputing Centre. En aquest document s’expliquen els diversos passos seguits i el raonament darrera les decisions preses.

Paraules clau: RISC-V, VGA, RTL, Verilog, Memòria de vídeo, AXI.


Index

1. Context and scope
  1.1. The project in the BSC framework
  1.2. Definition of terms and concepts
  1.3. Identification of the problem and project justification
  1.4. Stakeholders
  1.5. Comparison with alternatives
  1.6. Scope
    1.6.1. General and sub-objectives
    1.6.2. Functional and non-functional requirements
  1.7. Obstacles and risks

2. Initial work plan
  2.1. Work methodology
  2.2. Monitoring tools
  2.3. Description of tasks
    2.3.1. Description, time estimation and dependencies
    2.3.2. Required human and material resources
    2.3.3. Summary table
    2.3.4. Alternative tasks
    2.3.5. Additional resources
    2.3.6. Gantt chart
  2.4. Budget
    2.4.1. Staff costs
    2.4.2. General costs
    2.4.3. Contingency and incidentals
    2.4.4. Cost estimates

3. Final work plan
  3.1. Changes in the tasks and time assignments
  3.2. Changes in resources
  3.3. Changes in the budget

4. Background
  4.1. Open-source hardware
  4.2. DRAC core and SoC
  4.3. VGA protocol
  4.4. AXI protocol
    4.4.1. Full AXI
    4.4.2. AXI-Lite

5. RTL design and testing
  5.1. Initial state of the art research and setup
  5.2. Design of the PCB for the integration
  5.3. Design decisions and comparison between options
  5.4. First implementation on the BlackIce Mx FPGA
  5.5. Tests of other resolutions
  5.6. Mutation tests
  5.7. DRAC environment setup and AXI diagram
  5.8. AXI wrapper design
  5.9. Integration with the DRAC core
  5.10. Tests with the KC705 and the PCB

6. Sustainability report
  6.1. Self-evaluation
  6.2. Economic dimension
    6.2.1. PPP
    6.2.2. Exploitation
    6.2.3. Risks
  6.3. Environmental dimension
    6.3.1. PPP
    6.3.2. Exploitation
    6.3.3. Risks
  6.4. Social dimension
    6.4.1. PPP
    6.4.2. Exploitation
    6.4.3. Risks

7. Conclusions

8. References


1. Context and justification

1.1. The project in the BSC framework

This work is part of the Barcelona Supercomputing Centre’s DRAC (Designing RISC-V-based Accelerators for the next generation Computers) project. The main goal of DRAC is the design of a RISC-V based general-purpose processor with accelerators for future computers [1][2], and it will include the framebuffer developed here to add a VGA controller to the chip. The project targets specific applications in the fields of safety, genomics and autonomous driving.

The first of these processors, the Lagarto, was built at the end of 2019 as the first open-source chip developed in Spain [3]. It was fabricated in TSMC’s 65 nm process by the Taiwanese company, following the design it had been given.

The addition of the framebuffer and VGA controller designed in this project to the DRAC chip will allow it to be used independently from other systems, making it easier to use and also to debug in case of errors.

In parallel with the DRAC project, BSC is also working on the European Processor Initiative (EPI), whose main goal is to design low-power processors for high-performance and embedded systems [4]. These will be made exclusively with European and open-source technology to help reduce the dependence on United States hardware.

Figure 1. The preDRAC printed circuit board with the Lagarto core (left) [2]


1.2. Definition of terms and concepts

● RTL: acronym of Register Transfer Level. It is an abstraction used to define the digital phase of a design [5]. It is one of the first steps in integrated circuit design, right after defining the microarchitecture and the instruction set. RTL design is usually done in a hardware description language like VHDL or Verilog, and it is later synthesized to begin the physical design step.

● RISC-V: free and open RISC (Reduced Instruction Set Computer) type ISA (Instruction Set Architecture) that promotes open collaboration to advance to the next era of processor innovation [6]. It offers more freedom in the software and hardware sides of computer engineering and avoids royalties.

● Framebuffer: memory region that stores the colour of the pixels (usually arranged as a matrix) of an image shown on a screen. Each pixel can store a value that represents the exact shade it will have when displayed.

● FPGA: a Field Programmable Gate Array is a matrix of configurable logic blocks (made of logic gates) that can be programmed, more than once, to provide different functionalities [7]. The programming is done using a hardware description language (like Verilog) that describes the structure and behaviour of digital logic circuits.

● ASIC: stands for Application-Specific Integrated Circuit and is a microchip designed with a certain application in mind [8]. Unlike common processors, it can only perform the limited set of tasks it has been made for. The RTL design can be implemented on an FPGA to verify its behaviour on physical hardware before moving it onto an ASIC.

● AXI: communication interface that is part of the Arm Advanced Microcontroller Bus Architecture (AMBA) specification [9].

● PCB: a Printed Circuit Board is composed of a non-conductive substrate with conductive tracks that connect electronic components [10].

1.3. Identification of the problem and project justification

In the previous tapeout, the DRAC chip communicates with other hosts through the serial port, to which it sends its commands. If we want the DRAC processor to provide a video output, the data has to be sent through this port and the host has to generate the output. The problem with this approach is that it occupies the serial port and makes the chip dependent on the host to display the video output.

To overcome this limitation, we propose to add a stand-alone VGA connector, with its respective controller and framebuffer. This way, the next tapeout of the DRAC chip will have an independent video output that will not need any other device to manage it and the serial port will be freed.

1.4. Stakeholders

This section lists the people or organizations that participate directly in the development of the framebuffer and the VGA controller mentioned or that are interested in the results of this project:

● DRAC’s development team: everyone in the group is working hard on their respective parts to make the final result as good as possible. The DRAC project is an experimental vehicle for academic purposes that can be used as a base platform for research on RISC-V processors or accelerators.

● High-performance computing European community: they are interested in the results of the DRAC project to evaluate the viability of the RISC-V European processor initiative.

● Supervisor and co-supervisor: this thesis’ supervisor, Miquel Moretó, is the coordinator of the DRAC project and a team member in the European Processor Initiative. The co-supervisor, Guillem Cabo, is also a team member in the DRAC project.

● BSC: the centre wants the DRAC project to succeed in order to innovate in research on high-performance computing and emerging applications like Big Data.

1.5. Comparison with alternatives

One alternative is the approach used in the 2019 Lagarto design, where video communication goes through the serial port. Its advantage is simplicity: the serial protocol is relatively easy to implement and, besides the data bits, it only requires start, stop and parity bits [11]. In comparison, the VGA protocol is more complex because it requires horizontal and vertical synchronization signals. The drawbacks of the serial option are that it can only display characters on the screen and that it needs an external host.

Another way to obtain an independent video output, other than the VGA interface, is a digital interface like HDMI or DisplayPort. Here we face a trade-off: these options can output video at a greater resolution than VGA and with fewer losses, but they also need a bigger framebuffer (and therefore more memory) and a faster clock, and their protocols are harder to implement. For example, the HDMI protocol uses 3 data lines plus 2 control signals [12].


Therefore, we believe that the VGA interface is the sweet spot between the simplicity of the communication through the serial port and the performance of the HDMI or DisplayPort solutions.

1.6. Scope

1.6.1. General and sub-objectives

The main objective of this project is the RTL design of a VGA controller that gives the DRAC chip support for independent video output. The other objectives that the final result should meet to be a success are the following:

● Provide specific hardware support to be able to have a terminal window without consuming too much memory. This means that the framebuffer should have an efficient way of storing and displaying the data.

● Provide testbenches for functional and gate-level verification.

● Implement the AXI wrapper to interface the controller with the bus to allow communication with the processor core.

● Help with the design of a PCB for the peripherals if needed.

● Provide an ASIC-ready RTL of the design to include it in the next tapeout.
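The memory objective above can be illustrated with a rough comparison between storing one value per pixel and storing one character code per screen cell. This is only a sketch: the 8 x 16 glyph size, the one byte per character and the bit depths are assumptions chosen for illustration, not values taken from the design, and a character buffer additionally needs a glyph table that is not counted here.

```python
# Sketch: memory needed by a pixel framebuffer vs. a character (text-mode) buffer.
# Glyph size, bytes per character and bits per pixel are illustrative assumptions.

def pixel_buffer_bytes(width: int, height: int, bits_per_pixel: int) -> int:
    """Bytes needed to store one value per pixel (rounded up to whole bytes)."""
    return (width * height * bits_per_pixel + 7) // 8


def text_buffer_bytes(width: int, height: int, glyph_w: int, glyph_h: int,
                      bytes_per_char: int = 1) -> int:
    """Bytes needed to store one character code per screen cell."""
    columns = width // glyph_w
    rows = height // glyph_h
    return columns * rows * bytes_per_char


if __name__ == "__main__":
    print(pixel_buffer_bytes(640, 480, 1))     # monochrome bitmap
    print(pixel_buffer_bytes(640, 480, 8))     # 8 bits per pixel
    print(text_buffer_bytes(640, 480, 8, 16))  # 80 x 30 character cells
```

Under these assumptions the character buffer is more than an order of magnitude smaller than even a monochrome bitmap, which is why a terminal-oriented framebuffer can be stored compactly.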

1.6.2. Functional and non-functional requirements

Below are listed the functional and non-functional requirements that the final solution must fulfil to meet expectations and obtain a good result. The functional ones are:

● The frequency of the VGA controller has to be determined to decide if it can be obtained by multiplying or dividing the processor frequency with the provided integer PLL (Phase-Locked Loop) of the DRAC or if it needs an external crystal instead.

● A minimum resolution of 640 x 480 pixels. This resolution is achieved with a 25 MHz clock.

● The communication interface has to be compliant with either AXI-Lite or full AXI (the microcontroller bus).
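The relation between the resolution and the clock in the requirements above can be checked with a short calculation. The sketch below uses the commonly published timing for 640 x 480 at about 60 Hz (800 x 525 total pixels per frame including blanking); the 100 MHz source clock in the usage lines is a hypothetical value, not the DRAC core frequency.

```python
# Sketch: refresh rate obtained from a pixel clock, and whether that clock
# can be derived from a source clock by integer division.
# Timing values are the widely used 640 x 480 @ ~60 Hz figures.

H_VISIBLE, H_FRONT, H_SYNC, H_BACK = 640, 16, 96, 48
V_VISIBLE, V_FRONT, V_SYNC, V_BACK = 480, 10, 2, 33


def total_pixels_per_frame() -> int:
    """Pixel clock cycles per frame, including the blanking intervals."""
    h_total = H_VISIBLE + H_FRONT + H_SYNC + H_BACK  # 800
    v_total = V_VISIBLE + V_FRONT + V_SYNC + V_BACK  # 525
    return h_total * v_total


def refresh_rate_hz(pixel_clock_hz: float) -> float:
    """Frames per second produced by the given pixel clock."""
    return pixel_clock_hz / total_pixels_per_frame()


def integer_divider(source_hz: int, target_hz: int):
    """Return the integer divider from source to target, or None if there is none."""
    if source_hz % target_hz == 0:
        return source_hz // target_hz
    return None


if __name__ == "__main__":
    print(round(refresh_rate_hz(25_000_000), 2))        # with the 25 MHz clock
    print(integer_divider(100_000_000, 25_000_000))     # hypothetical 100 MHz source
    print(integer_divider(100_000_000, 25_175_000))     # nominal 25.175 MHz clock
```

A 25 MHz clock gives a refresh rate slightly below 60 Hz, which monitors generally tolerate, and it has the practical advantage of being reachable by integer division from round source frequencies.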

The non-functional requirements are the following:

● The design has to be tested on the FPGA before implementing it on the ASIC.

● Determine the number of pins and their direction (input or output) needed early on to communicate it to the physical design team.


● The Verilog code must be properly documented and has to fulfill the code guidelines of the DRAC project.

1.7. Obstacles and risks

Like every other project, there are some obstacles and risks that we kept in mind in order to avoid them. Those are:

● Failure to adapt in time to the new environment and tools: The RTL design of the entire DRAC project is written in Verilog, a hardware description language that I had not used before the start of this thesis because we used VHDL during the bachelor’s degree. Likewise, the hardware used to test my design was also new to me and I had to learn how to use it. Slow adaptation on my part would have consumed time that could have been needed for other parts of the project.

● Hardware differences from simulations: Code that works in simulation is not guaranteed to work when implemented on the hardware. There are many factors that we keep in mind to reduce this variability to a minimum.

● Lack of resources about RISC-V: Even though RISC-V is an open-source ISA and should have more information available than other options, it is also fairly new and the immaturity of the technology might reduce the amount of data about it.

● Bad communication within the team: The DRAC is being developed by a medium-sized team that works on different parts as well as different phases of the design process. Insufficient or deficient communication between the members can hurt the progress of the project or the quality of the result.


2. Initial work plan

This project started on the 17th of February and was expected to be finished by the 19th of June. This amounted to 87 workdays (excluding weekends and holidays) with 6 hours of work per day, resulting in a total of 522 hours. The expected date of the defence of this bachelor’s thesis was during the June 2020 presentation period.
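The workday figure above can be reproduced with a small date calculation. This is a sketch only: the text does not say which holidays were excluded, so the three dates below (Good Friday, Easter Monday and the 1st of May 2020) are an assumption that happens to match the stated total.

```python
# Sketch: counting workdays between the planned start and end dates.
# ASSUMED_HOLIDAYS is an assumption; the thesis does not list the holidays used.
from datetime import date, timedelta

ASSUMED_HOLIDAYS = frozenset({
    date(2020, 4, 10),  # Good Friday
    date(2020, 4, 13),  # Easter Monday
    date(2020, 5, 1),   # Labour Day
})


def count_workdays(start: date, end: date, holidays=frozenset()) -> int:
    """Count Monday-to-Friday days in [start, end] that are not holidays."""
    count = 0
    day = start
    while day <= end:
        if day.weekday() < 5 and day not in holidays:
            count += 1
        day += timedelta(days=1)
    return count


if __name__ == "__main__":
    start, end = date(2020, 2, 17), date(2020, 6, 19)
    workdays = count_workdays(start, end, ASSUMED_HOLIDAYS)
    print(workdays, workdays * 6)  # workdays and total hours at 6 hours per day
```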

2.1. Work methodology

This project follows the Kanban methodology throughout its development. Kanban uses a board split into different categories to organize the project’s tasks in cards [13]. Each category has its own column, and the columns represent the different states or parts of the task process. Some category examples are “To Do” for tasks that have not started yet, “Doing” for tasks currently in progress and “Done” for finished tasks.

Figure 2. Typical Kanban board with a set of tasks (A-F) organized into columns [13].

In our case, besides the usual “To Do”, “Doing” and “Done” categories, we have added the column “Blocked” for tasks that have dependencies with tasks that are assigned to someone else, “Halt” for tasks stopped for other reasons apart from dependencies, and “Archived” for done tasks which have already been talked about or communicated to the other members.

Communication within the DRAC team mostly takes place at the progress meetings held once per week. In these meetings, everyone takes a turn to explain what they have done in the last seven days and to discuss the topics needed to keep the project progressing.

With this methodology, each member of the team knows what the others are currently working on, what they have already finished and which tasks they have planned. This way, the dependencies between the tasks of different people can be properly managed. The result is an iterative workflow where feedback from other members of the development team is taken into account before proceeding to the next step.

For the technical part, the design follows the RTL and testbench coding guidelines of the DRAC project. Everyone in the DRAC project tries to follow these guidelines as closely as possible. They cover comment conventions, the use of hard-coded numbers, indentation and variable naming, initialization of registers and default states, port declarations, macro definitions and usage, file name conventions and module nomenclature.

2.2. Monitoring tools

We use Trello to implement the Kanban methodology. It is free software that, besides the standard functionality, offers many extras to organize a project better and more easily. For example, cards can be assigned to a set of people, can be organized based on labels, and can have comments, file attachments, checklists and due dates. A notification system helps keep track of the changes made to the cards, with the option to receive them by email.

We keep track of the hardware description language code using GitLab. GitLab is a version control application based on Git that allows each of us to work on different branches of the project simultaneously and keeps a history of the changes made to the design with each update. It also helps us solve errors with the “Issues” functionality. When someone finds a part of the project that is not working properly, they open an issue where they explain the problem they ran into and notify the people responsible for that part so they can fix it as soon as possible.

Another communication tool that we use is Slack. With it, we can send short messages to the other members of the team to notify them about something or to send reminders about meeting times, issues to be solved, etc. We have a workspace for the DRAC project so everyone in the team can discuss topics related to the development of the project together.

2.3. Description of tasks

2.3.1. Task description, time estimation and dependencies

This project is composed of the following tasks:

● Task T1: Control meetings. During the project, weekly 1-hour meetings are held with the DRAC project team to explain our recent work to the rest and agree on the things needed to keep progressing. Estimated time: 18 hours. Does not have dependencies.


● Task T2: Thesis report. Throughout the project, the necessary documentation is written to ensure that the final submission of the thesis report is made within the expected deadline. Estimated time: 70 hours. Does not have dependencies.

● Task T3: Project management: context and scope. Explanation of the project’s context, justification of its necessity, definition of the scope and explanation of the methodology and rigour that will be followed. Estimated time: 26 hours. Does not have dependencies.

● Task T4: Research the state of the art and the currently available options to solve the problem at hand. Then, decide on the implementation that is going to be made and compare it with other alternatives. Estimated time: 11 hours. Does not have dependencies.

● Task T5: Project management: time planning. Definition of the project’s various tasks, estimation in hours of the time it will take to complete each of them, explanation of the dependencies between them and definition of the necessary human and material resources. Estimated time: 9 hours. Depends on T4 because the different tasks of the project must be specified.

● Task T6: Set up the environment and tools needed to develop the project. Install the software that will be used later on and manage the licenses. Estimated time: 8 hours. Does not have dependencies.

● Task T7: Project management: budget and sustainability. Identification of costs, definition of cost estimates, management control and sustainability report. Estimated time: 10 hours. Depends on T4.

● Task T8: Project management: final document. Integration of the 3 previous assignments and their revision based on the feedback received. Estimated time: 19 hours. Depends on T7, T5 and T3.

● Task T9: Identify the open-source implementation of the VGA controller that will be used as a base and modify it to make the design. Adapt that base version to include the memory map and the ASCII characters for the framebuffer. This includes:
  ○ Task T9.1: RTL design. Estimated time: 16 hours. Depends on T6 and T4 because it needs specific tools and the choice of design.
  ○ Task T9.2: Implementation of testbenches and debugging. Estimated time: 18 hours. Depends on T9.1 because the implementation needs to be finished to test it.
  ○ Task T9.3: Implementation on the Ice40 FPGA (synthesis, place and route and debugging on the real hardware). Estimated time: 20 hours. Depends on T9.2 to make sure it is working properly.

● Task T10: After the first base version is working properly, evaluate whether to increase the VGA resolution based on the controller slack (spare time) and the memory restrictions. Estimated time: 8 hours. Depends on T9.3 to correctly evaluate the decision.

● Task T11: Co-design a PCB to test PMOD peripherals (the VGA is one of them) alongside the development platform. Estimated time: 19 hours. Depends on T9.3 to check if it works in a more reduced environment.

● Task T12: Implementation on the KC705 evaluation board with a processor to communicate with the VGA controller. Depends on T10 to be able to integrate the components. This includes:
  ○ Task T12.1: Modify the peripherals bus to include the VGA controller. Estimated time: 40 hours.
  ○ Task T12.2: Modify the memory map to account for the access to the VGA registers. Estimated time: 46 hours.
  ○ Task T12.3: Add an AXI (or different) wrapper and design the driver for the communication between the processor and the controller. Estimated time: 47 hours.
  ○ Task T12.4: Implementation of testbenches for the KC705. Estimated time: 60 hours. Depends also on T11 for the connection through the PCB and on T12.1, T12.2 and T12.3 to test the implementation.

● Task T13: Communication with the physical design team of the DRAC project to receive feedback and change the RTL accordingly. Participation in the physical design process if there is spare time. Estimated time: 55 hours. Depends on T12.4 to know that it works as expected on the FPGA.

● Task T14: Oral defence preparation. Estimated time: 22 hours. Depends on T2 and T13.

The post-silicon verification is not included in the project’s scope because the fabricated design will be received after the project’s finish date.

2.3.2. Required human and material resources

This section defines the human and material resources (both software and hardware) that were used during the project:

● Human resources: The project is developed by me, with the help of the supervisor, Miquel Moretó, the co-supervisor, Guillem Cabo, and the rest of the DRAC project team that will collaborate to integrate all of the parts into the final result. These resources are needed for all the tasks.

● Software resources: During the project, the Vivado Design Suite by Xilinx is used for the synthesis and place and route of the design (T9, T12), Verilator for simulating the Verilog hardware description language code (T9.2, T12), Questa Advanced Simulator (referred to as QuestaSim from now on) for the simulation and verification of the FPGA designs (T9.2, T12), Synopsys Spyglass for RTL analysis and verification (T9.3, T12.4), RISC-V tools for the System on Chip (T12), Git for version control (T9, T11, T12, T13), Google Docs for documentation (T2, T3, T5, T7, T8, T14), GanttProject for the Gantt chart (T5), Trello for the work methodology control (all tasks) and Slack for communicating with the DRAC team (all tasks).

● Hardware and physical resources: The work is done in a BSC office (all tasks) and we use a laptop with an internet connection (all tasks except T1), the myStorm BlackIce Mx FPGA development board for the initial tests (T6, T9) and the Xilinx KC705 evaluation board for testing the implementation with a processor (T12, T13).

Figure 3. The myStorm BlackIce Mx [14] (left) and the Xilinx KC705 [15] (right)

2.3.3. Summary table

ID | Name | Time | Dependencies | Resources
T1 | Control meetings | 18 h | - | -
T2 | Thesis report | 70 h | - | Google Docs
T3 | Project management: context and scope | 26 h | - | Google Docs
T4 | State of the art research | 11 h | - | -
T5 | Project management: time planning | 9 h | T4 | Google Docs
T6 | Environment and tools setup | 8 h | - | BlackIce Mx
T7 | Project management: budget and sustainability | 10 h | T4 | Google Docs
T8 | Project management: final document | 19 h | T7, T5, T3 | Google Docs
T9 | VGA controller and framebuffer implementation | | | Vivado, Git, BlackIce Mx
T9.1 | RTL design | 16 h | T6, T4 | -
T9.2 | Testbenches implementation | 18 h | T9.1 | Verilator, QuestaSim
T9.3 | Ice40 FPGA implementation | 20 h | T9.2 | Spyglass
T10 | Evaluate resolution increase | 8 h | T9.3 | -
T11 | Design and integration on the PCB | 19 h | T9.3 | Git
T12 | Implementation on KC705 with a processor | | | Vivado, RISC-V tools, Git, KC705, Verilator, QuestaSim
T12.1 | Peripherals bus modification | 40 h | T10 | -
T12.2 | Memory map modification | 46 h | T10 | -
T12.3 | Wrapper and driver design | 47 h | T10 | -
T12.4 | Testbenches implementation | 60 h | T11, T12.1, T12.2, T12.3 | Spyglass
T13 | Communication with the physical design team | 55 h | T12.4 | KC705
T14 | Oral defence preparation | 22 h | T13, T2 | Google Docs

Figure 4. Task summary table with time and resources needed (resources that apply to all tasks not included, see section 2.3.2) and its dependencies. Made by the author.
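The dependencies in the summary table form a directed acyclic graph, so they can be checked mechanically. The following sketch copies the hours and dependencies from Figure 4 and uses Python's standard `graphlib` module (Python 3.9 or later) to obtain a valid execution order, the total effort and the longest dependency chain. The function names are illustrative, and the longest chain is a lower bound on the duration only under the assumption that independent tasks could run in parallel, which is not the case for a single author.

```python
# Sketch: validating the task plan of Figure 4 as a dependency graph.
from graphlib import TopologicalSorter

# name: (estimated hours, dependencies), copied from the summary table.
TASKS = {
    "T1": (18, []), "T2": (70, []), "T3": (26, []), "T4": (11, []),
    "T5": (9, ["T4"]), "T6": (8, []), "T7": (10, ["T4"]),
    "T8": (19, ["T7", "T5", "T3"]),
    "T9.1": (16, ["T6", "T4"]), "T9.2": (18, ["T9.1"]), "T9.3": (20, ["T9.2"]),
    "T10": (8, ["T9.3"]), "T11": (19, ["T9.3"]),
    "T12.1": (40, ["T10"]), "T12.2": (46, ["T10"]), "T12.3": (47, ["T10"]),
    "T12.4": (60, ["T11", "T12.1", "T12.2", "T12.3"]),
    "T13": (55, ["T12.4"]), "T14": (22, ["T13", "T2"]),
}


def schedule_order() -> list:
    """Return one order in which every task comes after its dependencies."""
    graph = {name: deps for name, (_, deps) in TASKS.items()}
    return list(TopologicalSorter(graph).static_order())


def total_hours() -> int:
    """Sum of all estimated hours."""
    return sum(hours for hours, _ in TASKS.values())


def longest_chain_hours() -> int:
    """Hours along the longest chain of dependent tasks."""
    finish = {}
    for name in schedule_order():
        hours, deps = TASKS[name]
        finish[name] = hours + max((finish[d] for d in deps), default=0)
    return max(finish.values())


if __name__ == "__main__":
    print(schedule_order())
    print(total_hours(), longest_chain_hours())
```

The total matches the 522 hours of the initial plan, which confirms that the table and the plan are consistent with each other.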

2.3.4. Alternative tasks

Since this project is a real implementation of a component that is integrated with other parts in a larger project, it does not have alternative tasks. Instead, the time needed for the tasks that carry more risk and are more likely to present obstacles was overestimated, especially for the tasks that consist of testing and debugging. This ties in with one of the obstacles mentioned in section 1.7: hardware might behave differently than simulations due to many different factors.

Furthermore, some tasks are not essential for the completion of the project and could have been skipped or reduced if the project was not advancing according to the planning. For example, the evaluation of the resolution increase (T10) could have been removed, and I would not have participated in the design of the PCB (T11) if there had been delays in the first tasks, such as those caused by taking too long to adapt to the new tools (also explained in section 1.7).

We also considered shortening the communication with the physical design team (T13) by reducing the number of feedback cycles if needed. The possible risk of bad communication with the DRAC team was also taken into account, meaning that if this was not working properly, the task would have been reduced to avoid further delays.

2.3.5. Additional resources

If the countermeasures mentioned above had not been enough for the project to meet the time planning, additional people would have worked on the project to finish it in time. This would have meant additional human resources plus the subset of the software and hardware resources already listed that they would have required to help me with the project.


2.4. Budget

2.4.1. Staff costs

In this section, the staff costs are identified based on the role of the person responsible for each of the tasks mentioned in the planning of the project. The people working on the project play the following roles:

● Project manager: this also includes control management and writing the non-technical and technical documentation of the project. This role is assigned tasks T1, T2, T3, T5, T6, T7, T8, T10 and T14.

● RTL design: this role is responsible for designing the various hardware components of the project at the register transfer level and their respective testbenches to ensure that each part works properly. It is in charge of tasks T1, T4, T9.1, T9.3, T10, T12.1, T12.2, T12.3, and T13.

● PCB layout engineer: the person with this role designs the PCB layout to integrate the peripherals. They carry out tasks T1 and T11.

● Physical design engineer: their job consists of the layout assembly and connectivity of logic gates for the integrated circuit. Their assigned tasks are T1 and T13.

2.4.2. General costs

These include hardware, software and other costs. For this project, those are:

● Hardware: Laptop (Dell Latitude 7490), myStorm BlackIce Mx FPGA development board and Xilinx KC705 evaluation board.

● Software: Vivado HL System Edition, QuestaSim license and Synopsys Spyglass license. The rest of the software that will be used is either open-source or free.

● Other: electricity, workspace and internet.

2.4.3. Contingency and incidentals

For contingency, an extra 15% was added as a safety margin for unexpected events.

In the case of incidentals, as mentioned in the project planning, there are no alternative tasks for this project. Instead, an additional person would have been recruited if needed. The software licenses can be shared, but the additional person would have needed their own laptop, BlackIce Mx FPGA development board and KC705 evaluation board.


2.4.4. Cost estimates

For the estimation of staff and general costs, we assume 252 working days of 8 hours each in a year and the 522 hours of the project. With that and the yearly salary for each position obtained from Glassdoor (data from the USA), we get the results shown in Figure 6.

Role | Average yearly salary | Average yearly salary + SS | Average cost per hour
Project manager | 59110€ [16] | 79798,50€ | 39,58€
RTL design | 51849€ [17] | 69996,15€ | 34,72€
PCB layout engineer | 60705€ [18] | 81951,75€ | 40,65€
Physical design engineer | 76833€ [19] | 103724,55€ | 51,45€

Figure 6. Staff costs. Made by the author.
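The hourly costs in Figure 6 can be reproduced with a short calculation. In the sketch below, the factor of 1.35 applied to the salary for the "+ SS" column is not stated in the text; it is inferred from the ratio between the two salary columns of the table and should be read as an assumption.

```python
# Sketch: hourly staff cost from a yearly salary.
# SS_FACTOR is inferred from Figure 6 (salary + SS divided by salary), not stated in the text.

WORKING_HOURS_PER_YEAR = 252 * 8  # 252 working days of 8 hours
SS_FACTOR = 1.35


def hourly_cost(yearly_salary: float) -> float:
    """Cost per hour in euros, including the inferred SS overhead."""
    return round(yearly_salary * SS_FACTOR / WORKING_HOURS_PER_YEAR, 2)


if __name__ == "__main__":
    salaries = {
        "Project manager": 59110,
        "RTL design": 51849,
        "PCB layout engineer": 60705,
        "Physical design engineer": 76833,
    }
    for role, salary in salaries.items():
        print(role, hourly_cost(salary))
```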

For hardware and software costs, dividing the price by the working hours in the product’s useful life (252 * 8 * number of years) and multiplying by the 522 project hours gives the values in Figure 7.

Product | Price | Useful life | Cost
Dell Latitude 7490 | 1157€ | 5 years | 59,92€
BlackIce Mx | 52,81€ | 5 years | 2,73€
KC705 | 1509€ | 5 years | 78,14€
Vivado HL System Ed. | 3805€ | 4 years | 246,31€
QuestaSim | 1767€ | 1 year | 457,53€
Spyglass | 1329€ | 1 year | 344,15€

Figure 7. Hardware and software costs. Made by the author.
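The amortisation rule used for Figure 7 can be written as a small function. This is a sketch of the formula described in the text, with prices copied from the table; the function and parameter names are illustrative.

```python
# Sketch: share of a product's price attributed to the project,
# proportional to the project hours over the working hours of its useful life.

WORKING_HOURS_PER_YEAR = 252 * 8
PROJECT_HOURS = 522


def amortised_cost(price: float, useful_life_years: int,
                   project_hours: int = PROJECT_HOURS) -> float:
    """Euros of the product's price charged to the project."""
    life_hours = WORKING_HOURS_PER_YEAR * useful_life_years
    return round(price / life_hours * project_hours, 2)


if __name__ == "__main__":
    print(amortised_cost(1157, 5))   # Dell Latitude 7490
    print(amortised_cost(3805, 4))   # Vivado HL System Ed.
    print(amortised_cost(1767, 1))   # QuestaSim
```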

The estimation of electricity, workspace and internet costs is done using data from Barcelona.

Resource | Cost
Electricity | 1000 kWh * 0.1198€/kWh = 119,80€
Workspace | 27€/month*m² * 5 months * 15 m² = 2025€
Internet | 24,08€/month * 5 months = 120,40€

Figure 8. Indirect costs. Made by the author.


With these calculations, we can now fill in the table with the cost per activity (CPA), the general costs (CG), the contingency and the total costs (Figure 9).

Activity                                              Cost (€)     Comments
T1 - Control meetings                                 2995,22      All roles, 18 hours
T2 - Thesis report                                    2770,74      Project manager, 70 hours
T3 - Project management: context and scope            1029,13      Project manager, 26 hours
T4 - State of the art research                        381,94       RTL designer, 11 hours
T5 - Project management: time planning                356,24       Project manager, 9 hours
T6 - Environment and tools setup                      277,78       RTL designer, 8 hours
T7 - Project management: budget and sustainability    395,82       Project manager, 10 hours
T8 - Project management: final document               752,06       Project manager, 19 hours
T9 - VGA controller and framebuffer implementation
  T9.1 - RTL design                                   555,55       RTL designer, 16 hours
  T9.2 - testbenches implementation                   625,00       RTL designer, 18 hours
  T9.3 - Ice40 FPGA implementation                    694,44       RTL designer, 20 hours
T10 - Evaluate resolution increase                    594,43       P. manager, RTL designer, 8 hours
T11 - Design and Integration on the PCB               772,35       PCB layout engineer, 19 hours
T12 - Implementation on KC705 with a processor
  T12.1 - Peripherals bus modification                1388,88      RTL designer, 40 hours
  T12.2 - Memory map modification                     1597,21      RTL designer, 46 hours
  T12.3 - Wrapper and driver design                   1631,93      RTL designer, 47 hours
  T12.4 - testbenches implementation                  2083,32      RTL designer, 60 hours
T13 - Communication with the physical design team     4739,38      RTL designer, Phys. design engineer, 55 hours
T14 - Oral defence preparation                        870,80       Project manager, 22 hours
Total CPA                                             24.512,22

Dell Latitude 7490                                    59,92        Laptop
BlackIce Mx                                           2,73         FPGA development board
KC705                                                 78,14        Evaluation board
Vivado HL System Ed.                                  246,31       License
QuestaSim                                             457,53       License
Spyglass                                              344,15       License
Electricity                                           119,80       1000 kWh * 0,1198 €/kWh
Workspace                                             2025         27 €/month*m² * 5 months * 15 m²
Internet                                              120,40       24,08 €/month * 5 months
Total CG                                              3.453,98

Total CPA + Total CG                                  27.966,20
Contingency                                           4.194,93     15% of Total CPA + Total CG
Total CPA + CG + Contingency                          32.161,13
Additional person                                     19,05        (59,92+2,73+78,14+119,80+120,4)*0,05
Total incidentals                                     19,05        5% risk x hardware + indirect costs
TOTAL                                                 32.180,18

Figure 9. Total budget. Made by the author.
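The totals in Figure 9 can be rechecked with a few lines of Python. This is only a sketch for verification: the amounts are copied from the table, the variable names are illustrative, and decimal arithmetic is used so that the cents are not affected by floating-point rounding.

```python
from decimal import Decimal, ROUND_HALF_UP

def eur(value):
    """Round a Decimal amount to cents."""
    return value.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)

# Amounts copied from Figure 9 (in euros).
total_cpa = Decimal("24512.22")
hardware = [Decimal("59.92"), Decimal("2.73"), Decimal("78.14")]
software = [Decimal("246.31"), Decimal("457.53"), Decimal("344.15")]
indirect = {
    "electricity": Decimal("119.80"),
    "workspace": Decimal("2025"),
    "internet": Decimal("120.40"),
}

total_cg = sum(hardware) + sum(software) + sum(indirect.values())
subtotal = total_cpa + total_cg

# 15% contingency on CPA + CG.
contingency = eur(subtotal * Decimal("0.15"))

# 5% risk on hardware plus electricity and internet, as in the table comment.
incidentals = eur(
    (sum(hardware) + indirect["electricity"] + indirect["internet"])
    * Decimal("0.05")
)

total = subtotal + contingency + incidentals
print(total_cg, subtotal, contingency, incidentals, total)
```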

3. Final work plan

3.1. Changes in the tasks and time assignments

I have followed the initial work plan as much as possible but there have been a few changes. Despite this, the first couple of tasks and the project management tasks have been done successfully and according to the time estimates made at the start of the project.

The first change happened when the PCB design was rescheduled to an earlier date, because we did not know exactly how long it would take for the finished product to arrive once we ordered the components and we wanted to have it sooner rather than later.

It also took a bit longer than expected because the task was brought forward and therefore there was more work to do. This meant that T11 had to be done before the first VGA implementation on the Ice40 FPGA (T9) but it only affected the overall time planning with a few days of delay due to the extra work.

Before this happened, we had just started T9, but it was put on hold until the design part of T11 was finished. On the other hand, the testbench and FPGA implementation (T9.2 and T9.3) took slightly less time than expected, so a couple of days were gained there.

Another change was in the resolution increase evaluation task (T10). At the time, some characteristics of the processor for the final product of the DRAC project were still being discussed, so we did not have the information necessary to decide the final resolution of the VGA. Instead, different resolution options were tried on the Ice40 FPGA to have a first look at them and the changes they needed.

Finally, the last and most important change to the initial work plan was caused by the global pandemic. Because of it, the components we ordered for the PCB designed in task 11 arrived much later than expected. This meant that the task 12.4 tests could not be done on the KC705 FPGA plus the PCB, and we had to run them only in simulation with Verilator (a Verilog simulator) instead. We also made a change in the design while doing these simulations, so this task was extended by a week.

Even though simulations can detect a lot of errors in a design, they might not detect all of them, and some errors only occur when testing on the hardware itself (this was the case in this project, as will be seen later). Because of this, we decided to wait for the PCB components and postponed the defence of the project to the October 2020 session instead of the June one.

With this change, the new finish date is the 7th of August and this results in a total of 121 workdays or 726 working hours. This also meant that task 12.4 had to be split into 2 parts, the first being the simulations and the other the tests on the hardware, which will be identified as task 12.5 from now on. The time between these 2 tasks will be spent on the documentation on the project (task 2), so it can be finished as soon as possible.

The communication with the physical design team (task 13) was also shortened as mentioned in the alternative tasks section, to allow for more testing and still finish the project during the first week of August.

In summary, the changes in the time estimation for the affected tasks are: minus 11 hours in tasks 9.2 and 9.3 combined (27 hours total), plus 14 hours in task 11 (33 total), plus 54 hours in task 12.4 (114 total), the extra 96 hours of task 12.5, plus 73 hours in task 2 (143 total), plus 5 hours in task 1 (23 total) and minus 27 hours in task 13 (28 total).

The updated Gantt chart is as follows:

3.2. Changes in resources

In terms of resources, no additional hardware or human resources have been needed, even with the pandemic, because I switched to working from home. I only went to the office on specific days to use the KC705, which is shared between a few people in the project.

However, extra software that had not been planned at the start was needed, like Google Meet to attend the progress meetings that were held online and KiCad to contribute more in the design of the PCB (task 11).

3.3. Changes in the budget

Following the changes made in the tasks mentioned before, the calculation of the CPA has also been modified to match them. This results in a total of 3827,20€ for task 1, 5659,94€ for task 2, 937,44€ for tasks 9.2 and 9.3 combined, 1341,45€ for task 11, 3958,08€ for task 12.4, 3333,12€ for task 12.5 and 2412,76€ for task 13. The new total CPA cost is 31303,76€.

With the increase in the hours of the project, the hardware and software costs have been adjusted accordingly for a total of 83,33 + 3,80 + 108,68 + 274,05 + 636,33 + 478,60 = 1584,79€. The estimation of electricity, workspace and internet costs has also been recalculated with a result of 143,76 + 2430 + 144,48 = 2718,24€, making the total CG cost 1584,79 + 2718,24 = 4303,03€.

These modifications bring the total cost of the project to 35606,79€, which is 10% higher than the initial planned cost. This is almost exclusively due to the decision to extend the project and defend it in October instead of June. Even though the initial budget has been exceeded, the extra hours have allowed for more testing to get a better final result and more time to improve the quality of the documentation.

4. Background

4.1 Open-source hardware

This section aims to provide more background information about the current state of open-source hardware. Its community continues to rapidly grow as more designs get published and the vast majority of them use the RISC-V ISA. Some notable examples are:

● Rocket, a dual-core 64-bit RISC-V processor with vector accelerators designed by the University of California, Berkeley and fabricated in 45 nm technology [20].

● BOOM (Berkeley Out-of-Order Machine), also a 64-bit RISC-V core created by the University of California, Berkeley, but unlike the in-order Rocket cores, BOOM is out-of-order [21].

● lowRISC, a 64-bit SoC based on the Rocket core and designed by the University of Cambridge, UK [22].

● Ariane, a 64-bit RISC-V in-order core from ETH Zurich. It is important to note that this core is Linux-capable [23].

● PULP (Parallel Ultra Low Power), a platform organized in clusters of RISC-V cores targeting high energy efficiency [24].

● Riscy, a 32-bit RISC-V in-order core developed at ETH Zurich [25].

● OpenPiton, a general-purpose, multithreaded, manycore processor and framework by the Princeton Parallel Group [26].

4.2 DRAC core and SoC

Since the VGA connects and interfaces with the DRAC core and it is a part of the SoC, it is important to know the basic characteristics of the chip to make a good integration.

The DRAC project processor, the Lagarto, is a single core with a 5-stage single-issue in-order pipeline. The 5 stages of the pipeline are: fetch, decode, read-registers, execution and write-back. It implements a 64-bit scalar RISC-V ISA (the RV64IMA) with a branch predictor, instruction and data L1 caches and an L2 cache [27][28].

The ISA mentioned before is the 64-bit base integer instruction set (RV64I) plus the extension for integer multiplication and division (labeled M) and the one for atomic instructions (labeled A) [29]. The SoC platform is adapted from the lowRISC project [22] and it includes IPs from open-source projects as well as IPs developed internally at BSC.

Figure 11. The preDRAC SoC block diagram [27].

The DDR3 main memory is located on an external FPGA board, the Xilinx Kintex KC705, and the SoC connects to it through an FPGA Mezzanine Card (FMC) cable. The FPGA has the memory controllers to access the DDR3 memory and the chip contains the core and the rest of the system, including both caches. The interface between them is called the Packetizer, which packs/unpacks the data from the AXI interface.

This was done due to the lack of availability of certain technologies and IPs that are needed for the fabrication of the whole SoC in one die [27].

4.3 VGA protocol

Before going into more detail about the development part of the project, and to better understand the inner workings of the controller we worked on, this section explains the VGA signals and how the protocol works.

The VGA connector has a total of 15 pins, mainly composed of the RGB signals, 2 synchronization signals (horizontal and vertical), power, ground and the Display Data Channel signals that are used to communicate the display mode between the video source and the monitor. In this project, however, a PMOD adapter [30] is used to interface between the controller and the VGA connector. As a result, the controller outputs its signals through the PMOD, which has 16 pins but only carries the RGB and the synchronization signals, while the adapter itself handles the rest.

More specifically, the PMOD has 4 pins for each RGB component, which means 12 bits to code each colour for a total of 4096 different ones, 1 for the vertical synchronization, 1 for the horizontal and 2 that are not connected.

The VGA protocol works in the following way: to display a frame, the 2 synchronization signals, both of them separately, have to be driven high (this is called the front porch), then low (the sync pulse) and then high again (the back porch) [31]. The time spent in each of those states will determine a resolution and a refresh rate for the screen, but one thing to note is that during this time, the pixels on the screen are not being updated.

Each pair of resolution and refresh rate fixes the time available to paint one pixel, so that all of the pixels can be covered in each frame. The inverse of this time is known as the pixel clock. Because of this, the timings for the horizontal synchronization are usually given in pixels, while the timings for the vertical one are given in lines (n lines = n x the full line width in pixels, including the non-active video part).

After the back porch of both signals, the active video zone starts, meaning that one pixel is painted each cycle, starting from pixel (0,0) and going from left to right and from top to bottom. When the last pixel on the screen is reached, the current frame ends and the front porches for the next one start.

Figure 12. VGA video timings in pixels (39,72 ns each pixel or a 25,175 MHz clock) for a 640x480 resolution at 60 Hz (not to scale). Made by the author.
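As an illustration of the protocol described above, the following Python sketch models the two counters and the sync signals for 640x480 at 60 Hz. The porch and pulse lengths are the commonly published values for this mode (16/96/48 pixels and 10/2/33 lines); they are stated here as an assumption, since the values of Figure 12 are not reproduced in the text. The counter order (front porch, sync pulse, back porch, active video) follows the description in this section, and all names are illustrative.

```python
# Commonly published 640x480 @ 60 Hz timings (assumed, see lead-in).
H_FRONT, H_SYNC, H_BACK, H_ACTIVE = 16, 96, 48, 640   # in pixels
V_FRONT, V_SYNC, V_BACK, V_ACTIVE = 10, 2, 33, 480    # in lines
PIXEL_CLOCK_HZ = 25_175_000

H_TOTAL = H_FRONT + H_SYNC + H_BACK + H_ACTIVE
V_TOTAL = V_FRONT + V_SYNC + V_BACK + V_ACTIVE

def vga_signals(h_count, v_count):
    """Return (hsync, vsync, active, pixel) for one position of the counters.

    The sync signals are high during the porches and low during the pulse.
    """
    hsync = not (H_FRONT <= h_count < H_FRONT + H_SYNC)
    vsync = not (V_FRONT <= v_count < V_FRONT + V_SYNC)
    h_start = H_FRONT + H_SYNC + H_BACK
    v_start = V_FRONT + V_SYNC + V_BACK
    active = h_count >= h_start and v_count >= v_start
    pixel = (h_count - h_start, v_count - v_start) if active else None
    return hsync, vsync, active, pixel

# One frame needs H_TOTAL * V_TOTAL pixel clock cycles.
refresh_hz = PIXEL_CLOCK_HZ / (H_TOTAL * V_TOTAL)
pixel_period_ns = 1e9 / PIXEL_CLOCK_HZ
print(H_TOTAL, V_TOTAL, round(refresh_hz, 2), round(pixel_period_ns, 2))
```

The refresh rate follows from dividing the pixel clock by the total number of positions per frame, including the non-active ones.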

4.4 AXI protocol

4.4.1. Full AXI

The Advanced eXtensible Interface (AXI) is an interface designed for on-chip communication [32]. This is the protocol used to connect all of the peripherals (or Intellectual Property cores) of the DRAC, so the VGA controller will also use it to connect to the processor.

The AXI protocol is burst-based and has support for unaligned data transfers (using write strobes), multiple outstanding addresses and out-of-order transaction completion. It follows the master-slave paradigm and there are 5 independent transaction channels between the 2 devices: the write address, the write data, the write response, the read address and the read data. Each channel has a different prefix for its signals (AW, W, B, AR and R, respectively) and 2 signals, VALID and READY, that provide a two-way handshake between components. The write response channel and the read response signal indicate the status of the transaction when it ends.

The handshake process works as follows: first, the source of the transaction raises the VALID signal when the information in the respective channel is available and after that, the destination activates the READY signal when it can accept the data. The destination does not have to wait for the VALID to set the READY and in the case of the write transaction, the respective address and data channels do not have to be fully synchronized (the data handshake can happen before the address one).
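The handshake rule can be sketched with a small cycle-based model in Python. It is not taken from the DRAC code base: the function name and the list-based interface are illustrative, and the source is assumed to keep VALID high until its data has been accepted.

```python
def simulate_channel(payloads, ready_pattern):
    """Model one AXI channel: a transfer happens in every cycle in which
    VALID and READY are both high.

    payloads:      items the source wants to send, in order
    ready_pattern: value of READY driven by the destination in each cycle
    Returns a list of (cycle, payload) pairs for the completed transfers.
    """
    pending = list(payloads)
    transfers = []
    for cycle, ready in enumerate(ready_pattern):
        valid = bool(pending)          # VALID stays high while data is waiting
        if valid and ready:            # handshake completes in this cycle
            transfers.append((cycle, pending.pop(0)))
    return transfers

# READY may be raised before VALID (cycle 1 here accepts "A" immediately),
# and the source simply waits while READY is low.
print(simulate_channel(["A", "B"], [False, True, False, False, True]))
```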

The read data and write data channels also have the LAST signal to indicate the final data of a transaction. AXI also has 2 signals that are global, for all of the channels, the clock and reset signals (ACLK and ARESETn). Besides those, there are also other signals, either required or optional, that provide multiple functionalities like bursts, write strobes and protection levels.

The write strobes specify the bytes of the data bus that contain valid information and the slave device only writes the bytes that have their write strobe set to 1, keeping the data unchanged in the rest.
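A minimal Python sketch of how a slave could apply the write strobes to a stored word is shown below. The function name and the 4-byte default bus width are assumptions made for the example.

```python
def apply_wstrb(old_word, wdata, wstrb, bus_bytes=4):
    """Return the stored word after a write with byte strobes.

    Bit i of wstrb enables byte lane i (bits 8*i .. 8*i+7) of wdata;
    lanes whose strobe is 0 keep their previous contents.
    """
    result = old_word
    for lane in range(bus_bytes):
        if (wstrb >> lane) & 1:
            mask = 0xFF << (8 * lane)
            result = (result & ~mask) | (wdata & mask)
    return result

# Only byte lanes 0 and 2 are written.
print(hex(apply_wstrb(0x11223344, 0xAABBCCDD, 0b0101)))
```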

4.4.2. AXI-Lite

AXI-Lite is another version of the AXI interface that uses a subset of the AXI signals to make a simpler protocol for applications that do not require all the functionalities and performance of the full AXI.

In AXI-Lite all data accesses use the full width of the bus, which can be 32 or 64 bits, and there is no burst support, meaning that every transaction needs its own address. The only signals required for each channel are the handshake signals, the data, the response signals and the write strobes.

Despite this, AXI and AXI-Lite interfaces are interoperable and only require conversion in the case of an AXI master and an AXI-Lite slave.

Figure 13. The AXI channel architecture for writes [33]. In AXI-Lite only 1 write data can be transmitted for each address.

5. RTL design and testing

5.1 Initial state of the art research and setup

At the beginning of the project, we started by learning the environment and the various tools that would be used during the project. The first step was learning Verilog, especially its differences from VHDL, which was the hardware description language that I knew before this project. We also researched how to use the BlackIce Mx FPGA development board: the functionalities it has, how to compile the HDL and program the board, and various methods for debugging. Finally, we delved into some of the more advanced functions of Git to make better use of it.

After this, we installed the software needed for the project and set up the environment to access the respective licenses for it.

With the setup done, we moved on to researching the current state of the art. We searched for open-source VGA controllers and found a few, each with different options and configurations. The most relevant one is “Yet another VGA” by Sandro Amato [34], which is a VGA controller in VHDL with a character bitmap containing the 128 ASCII characters in 8x16 pixels. In a 640x480 resolution this results in 80x30 characters on the screen, which is enough for a terminal. The 8x16 pixels are a good size to be seen clearly on almost all screens, so this character bitmap is the one used in the first implementation of the VGA.

The other interesting controller found is “AXI4 to VGA Frame Buffer with Linux Driver” by Jose Rissetto [35]. At the time, it was already decided that the VGA module would communicate with the DRAC core through an AXI interface and this controller had one. Despite this, the first implementation on the BlackIce Mx FPGA board would only consist of the controller, so it was not needed then, and for the final implementation, we decided on an AXI-Lite interface meaning that it is not used at all.

There are other open-source VGA controllers with different configurations for resolution, refresh rates and colours but they did not suit our needs. The lowRISC project (mentioned before as a template for the DRAC SoC) has a video controller but we were unable to find any documentation on it, so we disregarded it [22]. The PULP platform was one of the other projects that were looked into, and although it has peripherals connected through AXI, it does not have one dedicated to video display [36].

Lastly, we looked at 2 VGA designs developed at BSC by Guillem Cabo and Francisco Bas (who adapted the VGA controller from a design by Sergio Cuenca [37]) to test the FPGA setup. One design is a very simple VGA that paints the same pattern of pixels repeated across the screen, but it helped us familiarize ourselves with the colours and with what the VGA module needed to output. The other design implements the classic snake game, so it has some parts that are not relevant for this project, but it helped us learn how to make the FPGA communicate with the PC through the serial port (UART). It also gave us some ideas for how to implement the frame buffer.

5.2 Design of the PCB for the integration

Parallel to the initial management of the project (planning, context, scope…) we designed the PCB that we would later use to integrate our respective peripherals with the DRAC. The PCB connects with the KC705 evaluation board which in turn connects with the rest of the core.

We used the free software KiCad for this task. The schematic design was already done, so after that, we worked on the layout. We placed the footprints of the 4 PMOD connectors where our peripherals would connect to the PCB, placed the FMC connector that would connect to the bring-up board or FPGA devkit, added the necessary resistors and capacitors, routed the tracks to connect all of them and finally we added the power and ground planes. During the routing, we applied track length matching to prevent signal propagation issues when working with high frequencies.

We also placed the voltage converters that enable each of the PMOD connectors to work at a different voltage, so that modules with different supply voltages can be connected. In Figure 14 we can see the layout of the PCB without the tracks and copper zones: one PMOD connector on each side (the one on the left is in a vertical position), the FMC on the left side (on the backside of the board), the converters (labelled U1-6) and the various resistors and capacitors.

Figure 14. Screenshot from KiCad of the PCB layout. Made by the author.

5.3 Design decisions and comparison between options

After the research was done, we had to decide how to implement the frame buffer to keep the data from the current frame being displayed. This decision affects what can be displayed on the screen and how much memory it needs to do so.

To make this decision, we made a comparison between the 2 options we had:

● Full frame buffer: this implementation is the most straightforward of the two. With it, each pixel on the screen can be painted with a different colour, meaning that we need to store the necessary bits to represent a certain colour for each pixel. Depending on the number of different colours we want to be able to display, we will need more or fewer bits per pixel. This implementation allows us to display anything on the screen, with the only limitation being the number of pixels of the screen. For example, in a 640x480 resolution with 2 different colours we would need 640*480*1 = 307200 bits to store the screen.

● Simplified terminal: in this case, we divide the screen into zones or tiles of a fixed size (a similar approach to old 2D video games [38]) and, for each of them, we store the identifier of the pattern of pixels we want to display in that tile. This limits what we can paint to a set of defined bitmaps, and their positions must be aligned to a tile, not to fractional parts of one. What we lose in flexibility when displaying images, we gain in memory savings. This method only requires storing the set of characters with enough bits per pixel to represent all the colours wanted, the bits for the identifier in each tile (based on the number of different patterns defined) and the bits needed to encode the colours themselves. A few bits can be added to each tile to indicate the rotation of the pattern. In this way we do not need 2 different patterns if they are the same but rotated or mirrored.

In Figure 15, we can see the formulas for the number of bits needed to store the bitmaps and the buffer itself. Applying these formulas and following the previous example, in a 640x480 resolution with 128 characters of 8x16 pixels in size and 2 different colours represented with 12 bits each (the parameters used for the first implementation), we would need 16384 bits to store the patterns, 16800 bits for the screen buffer and 24 bits for the 2 colours. With 8 possible rotations, the buffer would need 24000 bits.

$$ m = x \cdot y \cdot n \cdot \log_2(c) \qquad b = \frac{h}{x} \cdot \frac{v}{y} \cdot \log_2(n \cdot r) $$

Figure 15. Formulas for the sizes in bits of the bitmap (m) and buffer (b) memories, where x and y are the horizontal and vertical sizes of the bitmaps in pixels, n is the number of bitmaps, c is the number of different colours, h and v are the horizontal and vertical sizes of the video resolution in pixels and r is the number of rotations (r = 1 when the patterns are never rotated).
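The memory figures quoted in this section can be reproduced with the short Python sketch below. It assumes that every field is rounded up to a whole number of bits and that storing r rotations adds log2(r) bits to each tile identifier, which is the reading that matches the 24000-bit figure given in the text. The function names are illustrative.

```python
def bits_for(values):
    """Number of bits needed to encode `values` distinct values (ceil(log2))."""
    return (values - 1).bit_length()

def bitmap_bits(x, y, n, c):
    """Bits to store n bitmaps of x*y pixels with c colours per pixel."""
    return x * y * n * bits_for(c)

def buffer_bits(h, v, x, y, n, r=1):
    """Bits to store one identifier (plus rotation) per tile of an h*v screen."""
    tiles = (h // x) * (v // y)
    return tiles * (bits_for(n) + bits_for(r))

def full_frame_bits(h, v, c):
    """Bits for a full frame buffer with c colours per pixel."""
    return h * v * bits_for(c)

print(bitmap_bits(8, 16, 128, 2))               # pattern memory
print(buffer_bits(640, 480, 8, 16, 128))        # tile buffer, no rotation
print(buffer_bits(640, 480, 8, 16, 128, r=8))   # tile buffer, 8 rotations
print(full_frame_bits(640, 480, 2))             # full frame buffer
```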

Figure 16. Simplified terminal buffer structure (yellow) contains references to a set of patterns (blue). Made by the author.

The decision between a full frame buffer or a simplified one also has implications for the performance of our solution in terms of how fast the whole frame can be written.

We are going to assume that we have a 32-bit wide AXI-Lite bus at 50 MHz driven by the DRAC core, which runs at 600 MHz with an IPC of 0,7. We also assume that the VGA controller at the other end of the bus runs at 25 MHz. In this scenario, the AXI bus has a raw bandwidth of 50*10⁶ * 32 = 1600 Mb/s, but that does not reflect how much real data we can send through it: it does not take into account the control and address messages and, since the VGA controller is clocked slower than the bus, it takes more than 1 bus cycle to respond.

The AXI-Lite protocol uses a two-way handshake to control each of its channels. In a write transaction, 3 of the channels are used: the address and control channel, the write data channel and the write response channel. If we assume that both the master and the slave are ready for the transaction, the handshake only takes 1 bus cycle. With the proper logic and buffering of the signals, we can achieve a throughput of one data transfer every clock cycle, but since our VGA controller runs at half the bus frequency, we have to settle for one data transfer every 2 bus cycles. This results in a real bus bandwidth of (50/2)*10⁶ * 32 = 800 Mb/s.

With this bandwidth, and assuming that we have 1 write port on the buffer and a data bus of 7 bits, we can write the entire frame buffer of our simplified terminal in (640/8) * (480/16) = 2400 cycles of our VGA controller, that is, in 2400 / (25*10⁶) s = 96 microseconds. This represents 4800 bus cycles and 57600 processor cycles (which translate to 40320 instructions).

In the case of a full frame buffer, assuming it has a data bus of 32 bits, writing every pixel would take 640*480/(32/1) = 9600 cycles, or 9600 / (25*10⁶) s = 384 microseconds. That means 19200 bus cycles and 230400 processor cycles (161280 instructions).
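The estimates above can be reproduced with a small Python sketch. The clock frequencies and the IPC are the assumptions stated in this section; the function and variable names are illustrative, and exact fractions are used to avoid rounding effects.

```python
from fractions import Fraction

BUS_HZ = 50_000_000     # AXI-Lite bus clock
VGA_HZ = 25_000_000     # VGA controller clock
CPU_HZ = 600_000_000    # DRAC core clock
IPC = Fraction(7, 10)   # assumed instructions per cycle

def write_cost(vga_cycles):
    """Translate a number of VGA-side write cycles into time and cycle counts."""
    seconds = Fraction(vga_cycles, VGA_HZ)
    cpu_cycles = seconds * CPU_HZ
    return {
        "microseconds": float(seconds * 1_000_000),
        "bus_cycles": int(seconds * BUS_HZ),
        "cpu_cycles": int(cpu_cycles),
        "instructions": int(cpu_cycles * IPC),
    }

# Simplified terminal: one write per 8x16 tile.
terminal = write_cost((640 // 8) * (480 // 16))
# Full frame buffer: 1 bit per pixel, 32 pixels per write.
full_frame = write_cost(640 * 480 * 1 // 32)

print(terminal)
print(full_frame)
```

Under these assumptions, filling the full frame buffer takes 4 times longer than filling the tile buffer.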

Since the main focus of the VGA functionality for the DRAC processor is a terminal to display characters, we decided to implement the simplified terminal buffer divided in zones or “tiles”. We think this is the best approach to save memory and resources while also making the writes to the VGA take less time and obtain a satisfying result.

Then we also had to decide the other parameters for the simple terminal configuration: the number of unique characters, the size of each one and the number of different colours we would be able to display. We made an Excel spreadsheet comparing the memory requirements of each option to help us with the decision. We decided to go with 8x16 pixels per character because it is neither too big nor too small and, this way, the character bitmap file from Sandro Amato could be reused. For the number of colours, we settled on 2 to save as much memory as possible, since having more colours was not essential for the terminal.

5.4 First implementation on the BlackIce Mx FPGA

Following the research mentioned earlier, we started with the RTL design of the VGA controller to test this early version with the BlackIce Mx FPGA. First, we made a module to generate the VGA signals: the horizontal and vertical synchronization and the horizontal and vertical counters that determine which is the active pixel (the pixel being painted in this cycle) at any time. The various timings for the synchronization signals were obtained from the website TinyVGA [39], which has the signal timings for each resolution and refresh rate. This module does not have a testbench, since it has no inputs and its outputs depend only on pre-set parameters.

After that, the bitmap memory was designed in Verilog to store the pixel patterns for each character; it is initialized by reading the 8x16 characters file mentioned before. It was a single-port memory because we only needed to read from it, with 128 positions of 128 bits each to send a full character with each read. Before moving on to the buffer module, we implemented a testbench for the bitmap memory to ensure that it was working properly. It tested that the initialization went well and that the reads returned the right values for each address.

Up next, we made the frame buffer module, which consists of 2400 (80x30) positions with 7 bits each. These 7 bits store the addresses of the characters that are displayed in each tile and it has 2 ports so we can write to it while reading the output to access the right character in the bitmap memory and display it. The initialization of this module loads 0 to every position so at the start they all point to a blank character. A testbench was also made for the frame buffer to check reading and writing at the same time.
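The way the two memories cooperate to produce a pixel can be sketched with a behavioural model in Python. It is not the RTL of the project: the class name, the dictionary-based font and the bit order inside a bitmap line (most significant bit on the left) are assumptions made for the example.

```python
COLS, ROWS = 80, 30        # tiles on a 640x480 screen
CHAR_W, CHAR_H = 8, 16     # pixels per character

class TileFramebuffer:
    """Behavioural model of the tile buffer plus the character bitmap memory."""

    def __init__(self, font):
        # font[char_addr] is a list of CHAR_H integers, one 8-bit line each.
        self.font = font
        # Every tile starts at address 0, which is expected to be a blank character.
        self.buffer = [0] * (COLS * ROWS)

    def write(self, col, row, char_addr):
        """Write port: store a 7-bit character address in one tile."""
        self.buffer[row * COLS + col] = char_addr & 0x7F

    def pixel(self, x_px, y_px):
        """Read path: 1 if the pixel belongs to the character, 0 for background."""
        col, row = x_px // CHAR_W, y_px // CHAR_H
        char_addr = self.buffer[row * COLS + col]
        line = self.font[char_addr][y_px % CHAR_H]
        return (line >> (CHAR_W - 1 - x_px % CHAR_W)) & 1

# Hypothetical two-character font: 0 is blank, 1 has a single dot in its top-left corner.
font = {0: [0x00] * CHAR_H, 1: [0x80] + [0x00] * (CHAR_H - 1)}
fb = TileFramebuffer(font)
fb.write(col=2, row=1, char_addr=1)
print(fb.pixel(16, 16), fb.pixel(17, 16), fb.pixel(0, 0))
```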

Then, we implemented a top module to connect the other modules accordingly, assign the colours and control the video output. We also added a UART receiver to connect the FPGA to the PC. It is used to write data to the buffer memory in order to change the characters displayed on the screen. In Figure 17 we can see a block diagram of this implementation with the main connections between the modules.

Figure 17. Block diagram of the design for the Black Ice Mx FPGA. Made by the author

The data received through the UART is interpreted as follows. The first 8 bits contain the column identifier (if it is higher than 79, the number wraps around, 80 = 0), the next 8 bits contain the row identifier (which also wraps if it is higher than 29) and the following 8 bits code the character in ASCII, discarding the most significant bit (this number does not wrap). After that, the next 8 bits are ignored to account for end-of-line characters and to facilitate the use of the UART through a terminal.
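The packet format can be illustrated with a small Python decoder. It is a sketch of the behaviour described above, not the receiver's RTL: the function name is hypothetical, and the wrap-around is modelled as a modulo operation, which is an assumption for values beyond the first wrap.

```python
COLS, ROWS = 80, 30

def decode_packet(packet):
    """Decode one 4-byte UART packet into (column, row, character address).

    Byte 0: column (wraps around at 80)
    Byte 1: row (wraps around at 30)
    Byte 2: ASCII character, most significant bit discarded
    Byte 3: ignored (absorbs the end-of-line character sent by a terminal)
    """
    if len(packet) != 4:
        raise ValueError("a packet is exactly 4 bytes long")
    col, row, char, _ignored = packet
    return col % COLS, row % ROWS, char & 0x7F

# Column 80 wraps to 0, row 31 wraps to 1, and 'A' keeps its 7-bit code.
print(decode_packet(bytes([80, 31, ord("A"), ord("\n")])))
```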

Finally, we made testbenches to simulate the UART and the top modules. The testbench for the UART sends a sequence of 1 and 0 and checks that the data is received and interpreted as expected. The top module testbench also reads the data from the UART, writes it to the buffer memory and checks the video output at the right times to confirm that the pixels were sent to the screen with the right colour.

Figure 18 shows the values of the signals during the top module testbench. The most relevant ones are: activevideo indicates when we are in the active video area (as mentioned in section 4.3), pmod is the output to the PMOD connector (12 bits of colour and both sync signals), col and row make up the address for the buffer memory (one pair for the read address and another for the write address), din is the data to write in the buffer if wr_en is active, char_addr is the output from the buffer memory, x_px and y_px show the current pixel, font_in is the address for reading the bitmap memory (composed of char_addr and y_img to know which of the bitmap lines is accessed) and char is the current line of pixels to display on the screen (the output of the bitmap memory).

Figure 18. Screenshot of wave analysis from the top module testbench. Made by the author.

5.5 Tests of other resolutions

When we had to evaluate the possibilities of other resolutions for the display, the various clocks that the DRAC core would provide to the peripherals were still being discussed. Because the resolution of the VGA video depends entirely on the pixel clock (the inverse of the time to output the data for one pixel), we did not have the information necessary to decide the final resolution for the VGA output. Instead, we tried different resolution options on the BlackIce FPGA board to have a first look at them and the changes they required.

Five of them were tested: 800x600, 1024x768, 1280x960, 1280x1024 and 1400x1050. Higher resolutions could not be tested because of the memory limitations on the FPGA, as we were already at a 96% RAM usage with the 1400x1050 resolution.

Of those 5, the 800x600 and 1024x768 resolutions worked perfectly but the rest had different glitches on the screen or did not work at all. After looking at them more closely, the other 3 had negative slack, meaning that the values of the signals were not able to change in time in response to the changes of a positive clock edge before the next positive clock edge (the clock cycle was not long enough). This is because higher resolutions paint more pixels per frame and therefore need a higher clock frequency to do so.

To fix the slack problem, we tried to optimize the critical path (the chain of signals that takes the longest to change in a cycle) of the VGA controller to reduce the time needed per cycle. We managed to reduce it by around 30%, but it still was not enough, and we did not want to spend too much time on this task because the frequency specifications of the DRAC project were not decided at the time.
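The relation between resolution, pixel clock and slack can be sketched numerically in Python. The pixel clocks below are the standard values for each resolution at 60 Hz (the refresh rate is an assumption, as it is not stated for these tests), and the 14 ns critical path is a purely illustrative number, not a measurement from this project.

```python
# Standard pixel clocks in MHz, assuming a 60 Hz refresh rate.
PIXEL_CLOCK_MHZ = {
    "800x600": 40.0,
    "1024x768": 65.0,
    "1280x960": 108.0,
    "1280x1024": 108.0,
    "1400x1050": 121.75,
}

def slack_ns(clock_mhz, critical_path_ns):
    """Slack = clock period - critical path delay (negative means a timing failure)."""
    return 1000.0 / clock_mhz - critical_path_ns

CRITICAL_PATH_NS = 14.0  # hypothetical delay, for illustration only

for name, mhz in PIXEL_CLOCK_MHZ.items():
    before = slack_ns(mhz, CRITICAL_PATH_NS)
    after = slack_ns(mhz, CRITICAL_PATH_NS * 0.7)  # critical path shortened by 30%
    print(f"{name}: slack {before:+.2f} ns, after optimization {after:+.2f} ns")
```

With these illustrative numbers, the two lowest resolutions meet timing and the other three do not, even after the 30% reduction.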

More specifically, one of these modifications was rearranging the memory to replace multiplications with bit shifts when calculating the access address where possible (when multiplying by a power of 2), because Yosys, the synthesis tool we used for this design, does not seem to do it on its own. Another improvement was simplifying the conditional assignments of signals for the cases in which their values do not matter.
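The idea behind this optimization can be shown with a short Python sketch. The first function relies on the fact that a character has 16 lines, so the bitmap address is a concatenation of char_addr and y_img. The second one is a hypothetical rearrangement of the tile buffer, in which each row is padded from 80 to 128 positions; it is not confirmed to be the layout used in the project.

```python
CHAR_H = 16           # lines per character, a power of two
COLS = 80             # tiles per row
PADDED_STRIDE = 128   # hypothetical row stride: 80 rounded up to 2**7

def font_address(char_addr, y_img):
    """char_addr * 16 + y_img written as a shift and an OR (valid while y_img < 16)."""
    return (char_addr << 4) | y_img

def buffer_address_mul(row, col):
    """Compact layout: needs a multiplication by 80."""
    return row * COLS + col

def buffer_address_shift(row, col):
    """Padded layout: row * 128 + col becomes a shift and an OR (valid while col < 128)."""
    return (row << 7) | col

print(font_address(65, 3), buffer_address_mul(29, 79), buffer_address_shift(29, 79))
```

The padded layout trades memory for a shorter path: 30 * 128 = 3840 positions would be addressed instead of 2400.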

Another challenge is the PLL, which is a building block used to generate clocks with various frequencies. The base clock for the myStorm FPGA is 25 MHz and that works fine for a 640x480 resolution at 60 Hz, but for other resolutions, this has to be converted with the PLL and the result may not be exact. Fortunately, most monitors can tolerate this imprecision in the pixel clock and still work as expected.

5.6 Mutation tests

To further test and prove this first implementation, we decided to use an open-source software tool called Mcy to perform mutation tests [40]. These tests combine formal verification with self-checking testbenches to measure the coverage that the testbenches provide, meaning how well the tests cover all the possible cases.

Mcy generates mutations in the hardware design by modifying random individual signals, then these mutations are filtered using formal verification to only keep the ones that impact the output to make it fail. After that, the testbench runs with the mutated designs to check that it detects the error and fails. This way the testbenches can be improved to cover more cases and make them more reliable.

The procedure for the mutation tests is as follows: first, we made a configuration file to specify the number of mutations, the files involved in the test and the tests that are going to be run. Then, we made the formal verification test with the SymbiYosys tool (also open-source) [41], with its own configuration file and an outer module that checks whether the outputs from the original design and the mutated one are the same, given the same inputs.

A certain mutation can give 3 different results depending on the results of the formal verification test and the testbench. If the formal verification passes, the mutation is filtered because it produces no change. If it fails and the testbench also fails it means that case is covered by the testbench, but if the testbench does not fail then that design that changed the outputs was considered correct, so the case is uncovered.

When doing these tests, we realized that they could not be done on the full design (all of the modules) because the VGA output changes slowly relative to other modules and the mutation tests took too long. We reduced the number of mutations to 10 to try to get the results in a reasonable time, but that did not work. Finally, we settled on running the mutation tests on the frame buffer module alone. With 500 mutations, the results were: 386 covered cases, where a change that made the formal verification fail also made the testbench fail; 56 uncovered cases; and 58 mutations that passed the formal test and therefore resulted in no relevant change in the design. This amounts to a coverage of 87.33% of the relevant cases (386 of 442).

Then the buffer testbench was improved by adding a full read at the start to check that the initialization had been done properly. After running the mutation tests again, the results were: 397 covered cases, 44 uncovered and 59 that made no change, resulting in a coverage of 90.02% (397 of 441).
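
The coverage figures follow from the counts once the filtered mutations are left out of the denominator. The short Python check below reproduces them. The function name is illustrative.

```python
def coverage_pct(covered: int, uncovered: int) -> float:
    """Coverage over the mutations that actually change the outputs."""
    return 100 * covered / (covered + uncovered)


# Counts reported before and after improving the buffer testbench.
before = coverage_pct(covered=386, uncovered=56)
after = coverage_pct(covered=397, uncovered=44)

print(f"before: {before:.2f}%  after: {after:.2f}%")
```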

In Figures 19 and 20, in the Results table and graph, Test_eq is the formal verification and test_sim is the testbench. The Count column is the number of tests in that group. It adds up to 1000 because for each mutation the two tests are run. In the Tags table and graph, the Count column represents the percentage of the pair of tests in that category based on the total.

Figures 19 and 20. Results of the mutation tests before the testbench improvement (top) and after (bottom). Made by the author.

5.7 DRAC environment setup and AXI diagram

Before starting the integration with the rest of the DRAC chip and the design of the AXI wrapper to communicate with it, the environment had to be set up to run the simulations later. While setting it up, we studied the existing AXI architecture of DRAC. To understand it better, we drew a schematic of it together with other team members (Figure 21).

In the drawing, we can see the two main modules of the DRAC: the Top_asic and the Chip_top. They are the top modules and instantiate the rest, either directly or through another module instantiated in them. One of the key components is the Top Rocket, where the DRAC core is instantiated, and as we can see, the AXI (called NASTI in the drawing) buses start or end there. NASTI was made by UC Berkeley for the lowRISC project and implements a subset of AXI, but is functionally equivalent in many aspects. The interface will be called AXI throughout this document.

The other ends of the architecture are the peripheral components like the VGA. In between, there are various AXI modules to route the transactions to the right component.

These are the crossbars, which send the data to the right channel using addresses and masks, the channel combiners and slicers, and the bridge that converts between full AXI and AXI-Lite. There are also conditional instances, meaning that certain modules are added to the design only if specific flags have been set when compiling it.
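
As an illustration of routing with addresses and masks, the sketch below selects a target by comparing the masked address with a base value. The address map is entirely hypothetical and does not reflect the real DRAC memory map. It only shows the matching technique.

```python
# Hypothetical address map: (target name, base address, mask).
# These values are made up for the example and are NOT the DRAC address map.
ROUTES = [
    ("memory", 0x8000_0000, 0x8000_0000),
    ("uart",   0x4000_0000, 0xFFFF_0000),
    ("vga",    0x4001_0000, 0xFFFF_0000),
]


def route(addr: int):
    """Return the first target whose base matches the masked address."""
    for name, base, mask in ROUTES:
        if (addr & mask) == base:
            return name
    return None  # no target decodes this address


print(route(0x4001_0004))  # falls in the hypothetical VGA range
```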

Figure 21. AXI architecture of the DRAC chip with the VGA. Made by the author in collaboration with Xavier Carril

By making this drawing, we can also see the differences between the full AXI and AXI-Lite interfaces, and this would later help us decide on the one that we were going to choose for the VGA. We chose the AXI-Lite interface because it is simpler and we do not need the performance of the full AXI for the VGA, since it is a slow module relative to others and we would not be able to make full use of the bus anyway.

5.8 AXI wrapper design

Before designing the AXI wrapper, the bitmap memory was changed to have a data bus of 8 bits, so now it has 2048 positions x 8 bits (instead of 128 x 128). This was done to make the physical design of the memory easier, since the data bus is smaller. It did not need many changes because the full 128 bits of output were never needed at once when reading for the display, so adjusting addresses and signals was enough. This is because the display is drawn row by row, so instead of the full character, we only need the current row of the current character (8 bits).
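
A possible addressing scheme for the narrower memory is sketched below. It assumes 128 characters of 128 bits each, stored as 16 rows of 8 bits, with the rows of one character in consecutive positions. The row count follows from 128 / 8, but the exact layout in the RTL is an assumption, and the names are illustrative.

```python
ROWS_PER_CHAR = 16   # 128 bits per character / 8 bits per row
NUM_CHARS = 128      # 2048 positions / 16 rows per character


def bitmap_address(char_index: int, row: int) -> int:
    """Position of one 8-bit row of a character in the 2048 x 8 memory.

    Assumed layout: rows of the same character are stored consecutively.
    """
    if not 0 <= char_index < NUM_CHARS:
        raise ValueError("character index out of range")
    if not 0 <= row < ROWS_PER_CHAR:
        raise ValueError("row out of range")
    return char_index * ROWS_PER_CHAR + row


print(bitmap_address(char_index=65, row=3))
```

With this layout, the display logic only needs the character index from the frame buffer and the current row within the character to fetch the 8 pixels to draw.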

Then we designed the RTL of the AXI wrapper that encapsulates the VGA module and enables its connection to the rest of the DRAC core. To do so, we studied the requirements of the protocol [32] and the ZipCPU website [42], from which we took part of the code. It is worth mentioning that the original code is licensed under the Apache License, so it can be adapted for this project.

From this website, we also used the formal verification file and code, and adapted them to fit our design. The formal verification file performs k-induction to prove the design assertions in an unbounded manner, as opposed to testing them for only the first N cycles. The assertions check that the AXI requirements are met and that the signaling is handled correctly.

The AXI writes would now go to the buffer to change the characters that were displayed. This meant deleting the UART module, adding the connections from the wrapper to the buffer and adapting the existing testbenches to take the changes into account.

The formal verification only tests the AXI signalling and reads, but that is enough since a testbench was also made for simulating the wrapper. This test makes 6 AXI-Lite transactions: 1 read, 1 write, 2 writes consecutively (without waiting for the VGA to finish between them) and 2 reads consecutively. The writes also test different write strobes and the edge registers based on addresses.
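
For reference, the effect of a write strobe on a 32-bit word can be modelled in a few lines. In AXI, bit n of the strobe enables the byte lane carrying data bits 8n+7 down to 8n. The function below is a behavioural sketch of that rule, not the code of the wrapper, and its name is illustrative.

```python
def apply_write_strobe(old_word: int, new_word: int, wstrb: int) -> int:
    """Merge new_word into old_word, byte lane by byte lane.

    Bit n of wstrb enables the write of bits [8n+7:8n].
    """
    result = old_word
    for lane in range(4):
        if (wstrb >> lane) & 1:
            mask = 0xFF << (8 * lane)
            result = (result & ~mask) | (new_word & mask)
    return result & 0xFFFF_FFFF


# Only byte lanes 0 and 2 are written.
print(hex(apply_write_strobe(0x11223344, 0xAABBCCDD, 0b0101)))
```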

I also wrote scripts to automatically execute the testbenches and configure the signals for the wave analysis in Questasim to see their values throughout the simulation. This way, other people in the project can easily test and review the design and give better feedback.

Figure 22 shows the values of the various signals during the execution of the testbench for the AXI wrapper. There are the AXI-Lite signals (all of them prefixed by s_axi), the video output from the VGA (vga_o, composed of the pixel colour in 4 bits per component and the 2 synchronization signals), the registers that control the writes to the buffer and bitmap memories (wr_ena and wr_en_rom respectively) and the values of each position in these memories (bmem and mem). The other signals indicate the different parts of the testbench.

Figure 22. Screenshot taken from the wave analysis of the AXI wrapper testbench in Questasim. Made by the author.

5.9 Integration with the DRAC processor

For the integration, the VGA AXI wrapper was added as a submodule in the RTL of the DRAC by instantiating it, changing the address space range to include VGA and modifying the AXI crossbar accordingly. The changes were also tested to ensure that new errors were not created in the rest of the design. This was done with the DRAC general tests that were already made by other people in the project.

The current VGA design enables writes from the AXI bus to the bitmap memory, so the DRAC processor can modify the characters that will be displayed. This is accomplished by assigning different address ranges to the 2 memories, so when an AXI transaction is received by the VGA, only 1 of them is accessed, based on the address.

Since the bitmap memory only has a single port, it cannot output the display data while it is being written. This is not a significant problem because it only prevents 1 frame from displaying the right pixels, showing the background colour instead. Moreover, writes to the bitmap memory will not be frequent and the memory can be fully written during 1 frame, so multiple consecutive blank frames will not happen.

Another change was removing the AXI reads to the frame buffer memory because we decided they were not really necessary and could cause problems in the display when reading the address of the character in the bitmap memory.

While running the DRAC general tests, we noticed that the performance of the AXI transactions to the buffer was not optimal because the width of its data bus was only 7 bits, so of the 32 bits of the AXI bus, the other 25 bits were discarded. Due to this, we decided to change the width of the data in the buffer memory to 28 bits by packing 4 bitmap memory character addresses in 1, changing from a 2400 positions x 7 bits memory to a 600 x 28 one. The remaining 4 bits are still discarded, but now we can write 4 times as many positions as before in a write transaction and, in the future, these bits could be used as a checksum to verify data integrity.

This required some changes in the connection of the addresses in the VGA top module to account for this modification, mainly in the number of bits needed for the buffer addresses and in the treatment of the data (discarding the most significant bit in each byte, Figure 23). Figure 24 shows the block diagram of the final design and the changes made with respect to the previous one in Figure 16. Those are: the addition of read and write operations to the bitmap memory (for the AXI protocol), the resizing of the buffer memory datapath and the replacement of the UART module by the AXI wrapper.

Figure 23. The conversion made by the VGA top module between AXI write data and frame buffer data. Made by the author.
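
The conversion can be sketched in Python as follows. The most significant bit of each byte of the 32-bit write data is dropped and the four remaining 7-bit values are packed into one 28-bit buffer word. Placing byte 0 in the lowest 7 bits is an assumption made for the example, and the function names are illustrative.

```python
def axi_to_buffer_word(wdata: int) -> int:
    """Pack the low 7 bits of each byte of a 32-bit word into 28 bits.

    Assumption: byte 0 of the AXI data ends up in bits [6:0] of the result.
    """
    word = 0
    for i in range(4):
        char_addr = (wdata >> (8 * i)) & 0x7F  # drop the MSB of byte i
        word |= char_addr << (7 * i)
    return word


def buffer_word_to_chars(word: int) -> list:
    """Split a 28-bit buffer word back into its four 7-bit values."""
    return [(word >> (7 * i)) & 0x7F for i in range(4)]


packed = axi_to_buffer_word(0xC1C2C3C4)
print([hex(c) for c in buffer_word_to_chars(packed)])
```

With four character addresses per word, the 2400 positions of the original buffer fit in 2400 / 4 = 600 words.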

Figure 24. Block diagram of the final design. Made by the author

The bitmap memory has a similar problem because its data bus width is only 8 bits, but unlike the buffer, the performance was not as important because AXI writes to this memory are not very frequent, so we decided to keep it unchanged. Instead, we decided to add readable and writable colour registers in the VGA to enable the modification of the background and character colour by the DRAC processor. Although the VGA only has these 2 different colours, we thought this would be a good functionality to have. Since the colour is composed of the 3 components Red, Green and Blue (RGB), there are a total of 6 registers to read or write each of the components of the background and character colours in the VGA.

Next, we made a software test to fully ensure that the VGA was working as expected with the rest of the DRAC. This test writes all of the VGA addresses: the bitmap memory, the frame buffer memory and the colour registers and then reads all of them to check that the values have been written.

The test only makes aligned accesses, because misaligned ones are not allowed. The “initial” blocks for register initialization in the RTL were also removed, because they are not synthesizable and they were substituted with asynchronous resets as is required for the DRAC project.

After that, the instance of the FMC LPC connector (Low Pin Count; the other FMC connector in the FPGA is not used) was added to the DRAC core and the VGA output was mapped so that each pin connects with the right one when interfacing with the VGA cable. To do that, we looked at the specification of the KC705 evaluation board to find the name of each pin, and at the schematics of the PCB that we made to connect with the VGA output to find which output was connected to each pin. Then we modified the constraints file of the FPGA to assign these pins to wires in the RTL and connected them to the rest of the system.

Figure 25 shows what is connected to each pin of the FMC connector so we could make the assignments. The pins used for the VGA are G27-37 (skipping those connected to ground) and H25-35 for the 12 bits of colour, the 2 synchronization signals and another 2 that are not used.

Figure 25. FMC (LPC) connector pinout. Taken from the KC705 user guide [43].

5.10 Tests with the KC705 and the PCB

Before testing the design with both boards, we made a simple RTL design in Vivado to program the KC705 with a 16-bit counter and connected its bits to the pins of the PMOD. This test was done to make sure that there were no electrical errors in the PCB that we made and to reduce the possibility of hardware errors when testing the full design.

I used one of the board clocks for the design and a counter that just adds 1 at every cycle. The necessary pin constraints were added to map the ports of the top-level module with the adequate physical pins on the board.

I then checked that the pins changed their voltage correctly with the use of an oscilloscope and probes. It also helped me check that the pins in the FMC connector were correctly mapped with those of the PMOD.

After that, we made a program in C to test the VGA controller with the DRAC processor emulated on the KC705. We also coded a library with the basic functions to manage the controller (read/write the bitmap and buffer memory and the configuration registers). Unlike the testbenches used for simulations, this program prints readable output on the screen, because in this case the values of the internal signals cannot be seen (unless specific changes are made for it, more on that later).

To run the program, a bitstream has to be generated to program the FPGA. This includes the synthesis and implementation of the design, which consist of converting the RTL code into a netlist that describes the connections between the logic elements, followed by the place and route, where these elements are placed in the FPGA and connected. This is a similar procedure to the bitstream generation for the BlackIce Mx in the earlier stages of the project, but now it includes the full DRAC core along with the VGA.

Figure 26. Photo of the setup for the integration tests with the KC705, the custom PCB (top center) and the VGA PMOD (top left). Made by the author.

The program also had to be loaded to the core’s main memory so it can be executed, but the other people in the DRAC team had already made a Makefile to automate this process. This file modifies the previously generated bitstream by writing the program binary to the position of the memory. Thanks to that, we only had to include the VGA program and run it.

Despite having fixed the errors found in the simulation, the VGA still was not working properly and the output displayed was not the expected one (Figure 27). The test printed some characters in each of the corners of the screen and the error was a bit strange because only some of the characters were not displaying properly. Using the UART that is already built in the DRAC to debug was not an option because the bitmap memory is not readable through the AXI bus. Due to this, we tried using the Xilinx ILA (Integrated Logic Analyzer) to debug the design [44]. It is a module that can show you the value of a certain group of internal signals during a specific period.

Figure 27. Photo of the C program running on the DRAC core emulated on the KC705 (previous to debugging). Made by the author.

The ILA works in the following way: first, you configure the module for the number of signals that you want to monitor and instantiate it in your design, connected to those signals. Then, you generate the bitstream and program the FPGA, and this time Vivado detects the ILA. In the new window that pops up, you can configure the triggers, meaning the comparisons on the signal values that will start the capture process, and how much data you want to collect after that. Finally, you run the module and it waits for the trigger to happen to capture the data.

This process is familiar to me because I used the Altera logic analyzer called SignalTap during the computer engineering project in the degree. I was also familiar with the general workflow when working with FPGAs thanks to that course. Despite this, we were not able to get the ILA working and it was producing critical warnings when generating the bitstream. Some were related to parts of the DRAC that were made by other people, but others were from the VGA. The VGA warnings were about timing, caused by dividing the clock from the AXI bus with registers to get the 25 MHz clock for the rest of the modules.

To further check those warnings, we ran the Spyglass tool [45], which is a linting tool that analyzes the RTL code, on the VGA. It reported some warnings about variables that were declared but not used and about initial block statements. These signals and initializations were used in the simulation but are not synthesizable, so to solve the warnings, more defines were added to only declare those signals in the right context. Apart from these, there were no errors or more critical warnings reported.

To solve the timing warnings, the division of the clock was replaced with a PLL that is included in Vivado, similar to what we had done to generate the clock for the BlackIce Mx FPGA. We also thought it was a good idea to add a debug mode that would enable reading from the character memory through the AXI bus because it could make further debugging easier. To do this, another configuration register was added that can be read and written through AXI to enable or disable the debug mode.

The downside of the debug mode is that reads to the bitmap memory disable the video output for 2 cycles, the same as the writes, so the display might not be accurate during this mode. On the other hand, after the addition of the PLL, the critical warnings that were caused by the VGA were solved and in turn, the video output also got fixed and now it was displaying the expected result.

After that we ran more tests, primarily consisting of drawing an image by painting the adequate pixels. With it, we found an error that was caused by the change of the buffer memory datapath from 7 to 28 bits and that the testbenches were unable to detect. It was related to the address selection in the output and it was not updating properly when it had to change bitmap memory positions.

Figure 28. Photo of a C program that paints the BSC logo running on the DRAC core on the KC705. Made by the author.

After fixing this error, we made scripts to convert images to the code needed by the DRAC core to write the patterns in the bitmap memory. The first script is for displaying images normally while the second is used for displaying images in various steps, making an animation. We also had to increase the size of the boot RAM to fit the animation test program because it was too long.
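
The kind of conversion performed by such a script can be sketched as follows. This is not the script used in the project. It assumes tiles that are 8 pixels wide, stored one byte per row with the leftmost pixel in the most significant bit, and the function names and the generated array name are hypothetical.

```python
def tile_to_bytes(tile_rows):
    """Convert rows of '#' (character colour) and '.' (background) to bytes.

    Assumption: 8 pixels per row, leftmost pixel in the most significant bit.
    """
    values = []
    for row in tile_rows:
        if len(row) != 8:
            raise ValueError("each row must be 8 pixels wide")
        value = 0
        for pixel in row:
            value = (value << 1) | (1 if pixel == "#" else 0)
        values.append(value)
    return values


def to_c_array(name, values):
    """Emit the row bytes as a C array that a test program could write out."""
    body = ", ".join(f"0x{v:02X}" for v in values)
    return f"const unsigned char {name}[{len(values)}] = {{{body}}};"


rows = ["#......#", ".#....#.", "..#..#..", "...##..."]
print(to_c_array("tile", tile_to_bytes(rows)))
```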

5.11. Synthesis results

The design was synthesized using the tool Cadence Innovus. In terms of resource utilization, the total area of the VGA module is 71,816.726 μm², which represents an area increase of 2.89% over the previous tapeout of the DRAC, which has a total area of 2.487 mm² [27]. The VGA area is composed of 1,888 μm² for the standard cells, 63,966.288 μm² for the macros (the buffer and bitmap memories) and 5,962 μm² of blockages.

It also reaches the frequency of 600 MHz (the frequency of the DRAC core) with a positive slack of 1180 ps. With an arrival time of 394 ps, the module could work at a maximum theoretical frequency of around 2.5 GHz.

The power report shows a total power estimation of 4.67 mW, with the total internal power being 4.38 mW, the total switching power 0.28 mW and a total leakage power of less than 0.01 mW. For comparison, the DRAC core has a total power estimation of 201.55 mW with default switching information [27].
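
The derived figures in this section can be checked with a few lines of arithmetic. The sketch below takes the area of the previous tapeout as 2.487 mm², which is the value consistent with the stated 2.89% increase, and treats the inverse of the arrival time as the maximum frequency. The power share is computed here only for comparison and is not a figure from the reports.

```python
vga_area_um2 = 71_816.726
drac_area_mm2 = 2.487            # assumed value, consistent with the 2.89% figure
area_increase_pct = 100 * vga_area_um2 / (drac_area_mm2 * 1e6)

arrival_time_ps = 394
max_freq_ghz = 1e3 / arrival_time_ps   # 1000 / (time in ps) gives GHz

vga_power_mw = 4.67
drac_power_mw = 201.55
power_share_pct = 100 * vga_power_mw / drac_power_mw

print(f"area increase: {area_increase_pct:.2f}%")
print(f"max frequency: {max_freq_ghz:.2f} GHz")
print(f"power relative to the DRAC core estimate: {power_share_pct:.1f}%")
```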

Given the functionalities of this design, we consider that its cost is adequate for our implementation.

6. Sustainability report

6.1 Self-evaluation

The survey has helped me realize what I know and what I do not know about the environmental, social and economic aspects of sustainability. Thanks to some of the subjects of the degree I know the importance of it and the theory about these aspects: how to analyze a project from the perspective of each of them, the causes, consequences and solutions for the sustainability problems of a project, its positive or negative impact in society, etc.

I am familiar with creativity and innovation concepts and strategies to develop them: resource reuse, circular economy, social justice, equity, transparency, accessibility, ergonomics and security. I have also read the ethical code of the Col·legi Oficial d’Enginyeria en Informàtica de Catalunya (COEINF) and I know about project management procedures like time planning and economic viability.

Despite this, I still do not know some of the indicators to measure how big the effect of a project is and since I did not have any professional experience before the beginning of this project, I have yet to use any of them or the techniques mentioned before. I also lack experience with collaborative work in projects with more than 4 people working on them.

6.2. Economic dimension

6.2.1. PPP

As described in the budget section of the initial work plan, at the start of the project I estimated the cost of the human resources based on the role they would be performing and the number of hours that they would work, the cost of material resources (both hardware and software) depending on their useful life and the number of hours of use in the project, indirect costs like electricity and internet, and contingency and incidentals based on their risk. All of this amounted to a total of 32,180.18 €. I only used the essential hardware and software and also used free options when possible.

Due to the delays in the project, the final real cost is 35,606.79 €, which is somewhat higher than the initial estimate. The pandemic was a completely unexpected event and, despite adding a 15% contingency to the first estimation, the final cost exceeded the planned budget. On the other hand, I was able to finish the project without additional human resources, so the extra cost of another person working on the project (which was contemplated at the start as the last option) was avoided.

6.2.2. Exploitation

The cost of the project during its exploitation will be the cost of powering it, the cost of making any necessary adjustments and the cost of maintaining the VGA cable and the connector. The energy cost could be reduced if there was an option to not power the VGA module when it is not needed, but this was not implemented due to time constraints. On the other hand, since the VGA connector and the controller are powered together with the rest of the DRAC chip, they should add only a small amount to the overall power consumption of the SoC.

6.2.3. Risks

In the case of an error in the VGA controller, the patch or change required would mean extra hours of work and additional economic costs. To prevent this, multiple tests have been performed at the various stages of development to ensure as much as possible that the controller works as intended. If there is a hardware malfunction, there will be an extra cost for the replacement required.

6.3. Environmental dimension

6.3.1. PPP

The material resources needed for the project have also been identified in the initial work plan and their energy cost has also been estimated. To reduce the environmental impact of the project, both the hardware and software resources can be reused for future activities, since their useful life and license time respectively are longer than the project duration. Also, I followed common habits to reduce the energy used during its realization, like powering off the laptop and other electronics at the end of the day, but neither the overall environmental impact nor the savings have been quantified.

In the case of starting the project again, I think I could do it without the Questasim license using free wave analysis software in its place, but this would also slow me down a bit due to the free software having fewer functionalities and the files format conversions that would have to be made. Nevertheless, the choice between the 2 software options should not have any environmental impact, and the hardware resources were all necessary.

6.3.2. Exploitation

During the project’s life, the resources that will be needed are a VGA cable to connect the DRAC to a display and the energy needed to power the chip. Due to this, the environmental impact of this project is expected to be minimal. It will also have an overall positive effect on the ecological footprint because the resources needed in the production of the external hosts that were used previously to generate the video output will be saved.

6.3.3. Risks

A failure in the VGA cable or the connector will need a replacement that will increase the ecological footprint of the project, but this should be an unusual occurrence with very little environmental impact. A more critical issue would be an error in the controller that would require the chip to be re-made to fix it, because this would incur the environmental impact of the production a second time.

6.4. Social dimension

6.4.1. PPP

This project has taught me how to work and communicate with a medium-sized team where everyone is in charge of a different part of the same project. I also learned that sometimes you have to do things while you still do not know their final requirements and specifications, and you have to change them and adapt along the way.

During the development, I realized the problems that a team this size faces and this made me reflect on why you cannot always speed up a project by assigning more people to it.

6.4.2. Exploitation

The addition of the VGA controller will improve the accessibility of the DRAC chip and make it easier to use. In the future, the DRAC project can help improve the high-performance computing European community and it can also help the fields of safety, genomics and autonomous driving.

I believe that the project solves the problem identified at the start and the objectives have been met. The functional and non-functional requirements have also been fulfilled accordingly.

6.4.3. Risks

In case of a failure in the VGA module, the project could be detrimental to the image of the whole DRAC project and damage its reputation. Despite this, the project is not expected to cause any social problems for the users, since it is aimed at the high-performance community specifically and one of the main goals of the DRAC is to incentivize research in the field.

7. Conclusions

During the development of this project, we have designed in Verilog (RTL) a frame buffer that we later integrated with the rest of the DRAC core to work with the Lagarto core (a RISC-V processor). We reduced as much as possible the amount of memory required for the buffer while still providing the needed functionalities and the integration with the core via the AXI interface has been done successfully. I also contributed to the design of the PCB so we could have it physically as soon as possible.

Throughout the thesis, the design has been tested in various ways: simulations using Questasim to perform wave analysis, hardware tests isolating the buffer in a reduced setting and finally full integration tests with the DRAC processor.

About the technical competences of the thesis, we think they have been developed successfully and with the depth specified at the beginning of the project. Both the VGA framebuffer and the PCB designs have been completed and the DRAC SoC has been modified accordingly to integrate them.

The next step is to see it integrated in the following DRAC processor tapeout at the end of this year. I will be working at BSC, in charge of the bring-up and validation of the VGA controller.

After that, we can decide what could be improved for a future version. One of the aspects that could be improved is the number of colours displayable by increasing the number of bits per pixel (currently only 1). Of course, this would require more memory to store these extra bits. Another thing that could improve it is changing the tiled buffer for a full frame buffer but this would mean adding much more memory and probably would force us to change the connection interface from AXI-Lite to full AXI to be able to read and write the new buffer in a timely manner.

We also have decided to make the VGA frame buffer open-source because we think this is a great way of incentivizing research on open-source hardware and that aligns with the general objectives of the DRAC project. This way, it can be reused in future projects and help save time in favour of innovation. With this design, we hope to contribute to the open-source hardware ecosystem in Barcelona and Europe.

On a more personal level, in this project I learned what it means to work with a moderately big team and how important it is that communication works well. Moreover, not everyone on the team is a computer engineer, and this has helped me learn a bit about other areas that are related to the work I did but that I had no previous knowledge of. One example is the physical design stage in the design of an integrated circuit.

8. References

[1] “BSC coordinates two of the nine RIS3CAT projects and participates in four more | BSC-CNS.” [Online]. Available: https://www.bsc.es/news/bsc-news/bsc-coordinates-two-the-nine-ris3cat-projects-and-participates-four-more. [Accessed: 22-Feb-2020].

[2] “The DRAC project is underway to manufacture a new chip and open source accelerators in Barcelona | BSC-CNS.” [Online]. Available: https://www.bsc.es/news/bsc-news/the-drac-project-underway-manufacture-new-chip-and-open-source-accelerators-barcelona. [Accessed: 25-May-2020].

[3] “The BSC coordinates the manufacture of the first open source chip developed in Spain | BSC-CNS.” [Online]. Available: https://www.bsc.es/news/bsc-news/the-bsc-coordinates-the-manufacture-the-first-open-source-chip-developed-spain. [Accessed: 22-Feb-2020].

[4] “EPI: European Processor Initiative (EPI) | BSC-CNS.” [Online]. Available: https://www.bsc.es/research-and-development/projects/epi-european-processor-initative-epi. [Accessed: 22-Feb-2020].

[5] “Register Transfer Level (RTL) - Semiconductor Engineering.” [Online]. Available: https://semiengineering.com/knowledge_centers/eda-design/definitions/register-transfer-level/. [Accessed: 23-Feb-2020].

[6] “RISC-V Foundation - RISC-V Foundation.” [Online]. Available: https://riscv.org/risc-v-foundation/#. [Accessed: 23-Feb-2020].

[7] “What is an FPGA? Field Programmable Gate Array.” [Online]. Available: https://www.xilinx.com/products/silicon-devices/fpga/what-is-an-fpga.html. [Accessed: 23-Feb-2020].

[8] “What is ASIC (application-specific integrated circuit)? - Definition from WhatIs.com.” [Online]. Available: https://whatis.techtarget.com/definition/ASIC-application-specific-integrated-circuit. [Accessed: 23-Feb-2020].

[9] “Introduction to AXI Protocol: Understanding the AXI interface.” [Online]. Available: https://community.arm.com/developer/ip-products/system/b/soc-design-blog/posts/introduction-to-axi-protocol-understanding-the-axi-interface. [Accessed: 23-Feb-2020].

[10] “What is a PCB or Printed Circuit Board? - Technical Terms by Eurocircuits.” [Online]. Available: https://www.eurocircuits.com/pcb-printed-circuit-board/. [Accessed: 30-Jun-2020].

[11] “Puerto Serial - protocolo y su teoría - HETPRO/TUTORIALES.” [Online]. Available: https://hetpro-store.com/TUTORIALES/puerto-serial/. [Accessed: 24-Feb-2020].

[12] “Inside HDMI (High Definition Multimedia Interface) - How It Works of 4 - Hardware Secrets.” [Online]. Available: https://www.hardwaresecrets.com/inside-hdmi-high-definition-multimedia-interface/2/. [Accessed: 24-Feb-2020].

[13] “What Is Kanban? An Overview Of The Kanban Method.” [Online]. Available: https://www.digite.com/kanban/what-is-kanban/. [Accessed: 24-Feb-2020].

[14] “BlackIce Mx from MyStore on Tindie.” [Online]. Available: https://www.tindie.com/products/Folknology/blackice-mx/. [Accessed: 13-Mar-2020].

[15] “Xilinx Kintex-7 FPGA KC705 Evaluation Kit.” [Online]. Available: https://www.xilinx.com/products/boards-and-kits/ek-k7-kc705-g.html. [Accessed: 13-Mar-2020].

[16] “Salary: Project Manager | Glassdoor.” [Online]. Available: https://www.glassdoor.com/Salaries/project-manager-salary-SRCH_KO0,15.htm. [Accessed: 06-Mar-2020].

[17] “Salary: RTL Design | Glassdoor.” [Online]. Available: https://www.glassdoor.com/Salaries/rtl-design-salary-SRCH_KO0,10.htm. [Accessed: 06-Mar-2020].

[18] “Salary: PCB Layout Engineer | Glassdoor.” [Online]. Available: https://www.glassdoor.com/Salaries/pcb-layout-engineer-salary-SRCH_KO0,19.htm. [Accessed: 06-Mar-2020].

[19] “Salary: Physical Design Engineer | Glassdoor.” [Online]. Available: https://www.glassdoor.com/Salaries/physical-design-engineer-salary-SRCH_KO0,24.htm. [Accessed: 06-Mar-2020].

[20] Y. Lee, A. Waterman, R. Avizienis, H. Cook, C. Sun, V. Stojanović, and K. Asanović, “A 45nm 1.3GHz 16.7 Double-Precision GFLOPS/W RISC-V Processor with Vector Accelerators.”

[21] C. Celio, P.-F. Chiu, B. Nikolić, D. Patterson, and K. Asanović, “BOOM v2: an open-source out-of-order RISC-V core.”

[22] “lowRISC: Collaborative open silicon engineering.” [Online]. Available: https://www.lowrisc.org/. [Accessed: 26-Jun-2020].


[23] F. Zaruba and L. Benini, “The Cost of Application-Class Processing: Energy and Performance Analysis of a Linux-ready 1.7GHz 64bit RISC-V Core in 22nm FDSOI Technology.”

[24] F. Conti, D. Rossi, A. Pullini, I. Loi, and L. Benini, “PULP: A Ultra-Low Power Parallel Accelerator for Energy-Efficient and Flexible Embedded Vision,” J. Signal Process. Syst., vol. 84, no. 3, pp. 339–354, Sep. 2016, doi: 10.1007/s11265-015-1070-9.

[25] M. Gautschi, P. D. Schiavone, A. Traber, I. Loi, A. Pullini, D. Rossi, E. Flamand, F. K. Gürkaynak, and L. Benini, “Near-Threshold RISC-V core with DSP extensions for scalable IoT endpoint devices,” IEEE Trans. Very Large Scale Integr. Syst., vol. 25, no. 10, pp. 2700–2713, Feb. 2017, doi: 10.1109/TVLSI.2017.2654506.

[26] J. Balkind, M. McKeown, Y. Fu, T. Nguyen, Y. Zhou, A. Lavrov, M. Shahrad, A. Fuchs, S. Payne, X. Liang, M. Matl, and D. Wentzlaff, “OpenPiton: An Open Source Manycore Research Framework,” 2016, doi: 10.1145/2872362.2872414.

[27] J. Abella, C. Bulla, G. Cabo, F. J. Cazorla, A. Cristal, M. Doblas, R. Figueras, A. González, C. Hernández, C. Hernández, V. Jiménez, L. Kosmidis, V. Kostalabros, R. Langarita, N. Leyva, G. López-Paradís, J. Marimon, R. Martínez, J. Mendoza, F. Moll, M. Moretó, J. Pavón, C. Ramírez, M. A. Ramírez, C. Rojas, A. Rubio, A. Ruiz, N. Sonmez, V. Soria, L. Terés, O. Unsal, M. Valero, I. Vargas, and L. Villa, “An Academic RISC-V Silicon Implementation Based on Open-Source Components.”

[28] J. Abella, G. Cabo, F. J. Cazorla, A. Cristal, R. Figueras, A. González, C. Hernández, C. Hernández, V. Kostalabros, N. Leyva, J. Marimon, R. Martínez, J. Mendoza, F. Moll, M. Moretó, J. Pavón, C. Ramírez, M. A. Ramírez, C. Rojas, A. Rubio, A. Ruiz, N. Sonmez, L. Terés, O. Unsal, M. Valero, I. Vargas, and L. Villa, “Lagarto: First Silicon RISC-V Academic Processor Developed in Spain.”

[29] A. Waterman and K. Asanović, “The RISC-V Instruction Set Manual,” 2019.

[30] “Pmod VGA Reference Manual [Reference.Digilentinc].” [Online]. Available: https://reference.digilentinc.com/reference/pmod/pmodvga/reference-manual?_ga=2.48806374.1737305034.1593619945-838570198.1593619945. [Accessed: 01-Jul-2020].

[31] “VGA Display Controller [Digilent Documentation].” [Online]. Available: https://reference.digilentinc.com/learn/programmable-logic/tutorials/vga-display-congroller/start. [Accessed: 25-May-2020].

[32] “AMBA ® AXI and ACE Protocol Specification.” [Online]. Available: https://static.docs.arm.com/ihi0022/g/IHI0022G_amba_axi_protocol_spec.pdf. [Accessed: 25-May-2020].

[33] “AXI Reference Guide.” [Online]. Available: https://www.xilinx.com/support/documentation/ip_documentation/ug761_axi_reference_guide.pdf. [Accessed: 01-Jul-2020].


[34] “Overview :: Yet Another VGA :: OpenCores.” [Online]. Available: https://opencores.org/projects/yavga. [Accessed: 26-Jun-2020].

[35] “Overview :: AXI4 to VGA Frame Buffer with Linux driver :: OpenCores.” [Online]. Available: https://opencores.org/projects/axi_vga_fb. [Accessed: 26-Jun-2020].

[36] “PULP platform.” [Online]. Available: https://pulp-platform.org/index.html. [Accessed: 26-Jun-2020].

[37] “GitHub - sergicuen/collection-iPxs: Icestudio Pixel Stream collection.” [Online]. Available: https://github.com/sergicuen/collection-iPxs. [Accessed: 26-Jun-2020].

[38] “Tile-based video game - Wikipedia.” [Online]. Available: https://en.wikipedia.org/wiki/Tile-based_video_game. [Accessed: 25-May-2020].

[39] “VGA Signal Timing.” [Online]. Available: http://tinyvga.com/vga-timing. [Accessed: 26-Jun-2020].

[40] “GitHub - YosysHQ/mcy: Mutation Cover with Yosys (MCY).” [Online]. Available: https://github.com/YosysHQ/mcy. [Accessed: 02-Apr-2020].

[41] “GitHub - YosysHQ/SymbiYosys: SymbiYosys (sby) -- Front-end for Yosys-based formal verification flows.” [Online]. Available: https://github.com/YosysHQ/SymbiYosys. [Accessed: 02-Apr-2020].

[42] “Building an AXI-Lite slave the easy way.” [Online]. Available: https://zipcpu.com/blog/2020/03/08/easyaxil.html. [Accessed: 26-Jun-2020].

[43] “KC705 Evaluation Board for the Kintex-7 FPGA.” [Online]. Available: https://www.xilinx.com/support/documentation/boards_and_kits/kc705/ug810_KC705_Eval_Bd.pdf. [Accessed: 03-Jul-2020].

[44] “Integrated Logic Analyzer v6.1.” [Online]. Available: https://www.xilinx.com/support/documentation/ip_documentation/ila/v6_1/pg172-ila.pdf. [Accessed: 06-Jul-2020].

[45] “SpyGlass Lint.” [Online]. Available: https://www.synopsys.com/verification/static-and-formal-verification/spyglass/spyglass-lint.html. [Accessed: 06-Jul-2020].
