CERN-THESIS-2017-478 // 2017

Contributions to the SIL 2 Radiation Monitoring System CROME (CERN RadiatiOn Monitoring Electronics)

Master Thesis, March 2017

Author: Nicola Joel Gerber
Supervisor (CERN - HSE-RP/IL): Dr. Hamza Boukabache
Supervisor (EPFL - STI | SEL): Dr. Alain Vachoux

Abstract

CERN is developing a new radiation monitoring system called CROME to replace the currently used system, which is at the end of its life cycle. As radiation can pose a threat to people and the environment, CROME has to fulfill several requirements regarding functional safety (SIL 2 for the safety-critical functionalities). This thesis makes several contributions to increase the functionality, reliability and availability of CROME. Floating point computations are needed for the signal processing stages of CROME to cope with the high dynamic range of the measured radiation. Therefore, several IEEE 754-2008 conforming floating point operation IP cores are developed and implemented in the system. In order to fulfill the requirements regarding functional safety, the IP cores are verified rigorously using a custom OSVVM based floating point verification suite. A design methodology for functional safety in SRAM-based FPGA-SoCs is developed. Some parts of the methodology are applied to the newly developed IP cores and other parts are ported back to the existing system. In order to increase the reliability and availability of CROME, a new in-system communication IP core for decoupling the safety-critical from the non-safety-critical parts of the FPGA-SoC-based system is implemented. As the IP core contains mission critical configuration data and will have an uptime of several years, it is equipped with several single event upset mitigation techniques, such as ECC for memory protection and fault-robust FSMs, in order to increase the system's reliability and availability. To make this IP core usable for the overall system, the existing Linux kernel and userspace software stack is adapted to it. Finally, some critical design weaknesses were identified in the current system architecture. In order to remedy those weaknesses, an improved system architecture is proposed for implementation in future design iterations.

Acknowledgements

I would like to thank my supervisor from EPFL, Dr. Alain Vachoux. The numerous courses given by him at EPFL equipped me with all the necessary tools to successfully finish this thesis. His advice during the thesis enabled me to overcome several problems I faced while conducting the project. I would like to express profound gratitude to my supervisor at CERN, Dr. Hamza Boukabache. His guidance through this project was invaluable. He provided assistance when needed and at the same time left me the freedom to explore my own ideas. Many thanks to Ciarán Toner for his advice, the fruitful discussions and all his proofreading. And also a big thank you to all the rest of the HSE-RP/IL section for providing such an inspiring and pleasant work environment. And finally, I would like to thank my family. Thank you for all your love, understanding and support throughout my studies. Without you I would not be where I am.

Contents

1 Introduction 1
1.1 Motivation for the CROME Project ...... 1
1.2 Scope of this Thesis ...... 2
1.3 Organisation of this Report ...... 3

2 System Architecture 5
2.1 System Level Overview ...... 5
2.1.1 CROME Measuring and Processing Unit ...... 6
2.1.2 Avnet PicoZed ...... 8
2.2 Zynq Device Architecture ...... 8
2.2.1 Zynq Boot Sequence ...... 10
2.2.2 Fitness of Zynq for CROME ...... 11
2.2.3 PCAP-ICAP Issue ...... 12
2.3 Proposed Modifications on the Architecture to Increase Reliability and Availability ...... 13
2.3.1 SIL 3 Microcontroller ...... 15

3 Methodologies and Techniques for the Design of Safety-Critical Systems 19
3.1 Development Cycle to Achieve SIL 2 ...... 19
3.2 Overview of Error Correction Schemes ...... 20
3.2.1 Repetition Codes and Parity Bits ...... 21
3.2.2 Checksums ...... 21
3.2.3 Cyclic Redundancy Checks ...... 22
3.2.4 Error-Correcting Codes ...... 23
3.2.5 Cryptographic Hash Functions ...... 24
3.3 Techniques to Increase Reliability and Availability of FPGA Designs ...... 24
3.3.1 Taxonomy of FPGAs ...... 24
3.3.2 Influence of Radiation in Deep Sub-Micron Silicon Devices ...... 25
3.3.3 Single Event Upset Mitigation Techniques ...... 26
3.4 Overview of Verification Methodologies ...... 28
3.4.1 Formal Verification ...... 30
3.4.2 Open Source VHDL Verification Environment Methodology ...... 32
3.4.3 SystemVerilog Direct Programming Interface ...... 33

4 Securing PS/PL Data Transfer 35
4.1 Adaption of the FPGA Firmware ...... 36
4.1.1 Formalisation of the Functional Requirements ...... 37
4.1.2 Functional Description ...... 38
4.1.3 Implementation ...... 40
4.1.4 Verification ...... 42
4.1.5 Discussion of the Design ...... 43
4.2 Adaptions of the Linux Software Stack ...... 44
4.2.1 Proposed Changes in the Linux Software Stack for Future Releases ...... 45
4.3 Current State of the Development ...... 45
4.3.1 Final Remarks on the Verification ...... 46

5 Floating Point Computation 47
5.1 Used Subset of IEEE 754-2008 ...... 47
5.1.1 Binary Representation and Interpretation ...... 47
5.1.2 Rounding of Floating Point Numbers ...... 49
5.1.3 Omitted Functionality ...... 51
5.2 IEEE 754-2008 Verification Suite ...... 52
5.2.1 Implementation of Coverage Points ...... 52
5.2.2 Generic Test Functions ...... 54
5.3 Floating Point Comparison ...... 56
5.3.1 Functional Description ...... 56
5.3.2 RTL Architecture ...... 56
5.3.3 Verification Environment ...... 56
5.3.4 Implementation and Verification Results ...... 57
5.4 Integer to Floating Point Conversion ...... 57
5.4.1 Functional Description ...... 58
5.4.2 RTL Architecture ...... 58
5.4.3 Verification Environment ...... 59
5.4.4 Implementation and Verification Results ...... 61
5.5 Floating Point to Integer Conversion ...... 62
5.5.1 Functional Description ...... 62
5.5.2 RTL Architecture ...... 63
5.5.3 Verification Environment ...... 63
5.5.4 Implementation and Verification Results ...... 65
5.6 Floating Point Addition ...... 65
5.6.1 Functional Description ...... 66
5.6.2 RTL Architecture ...... 67
5.6.3 Verification Environment ...... 68
5.6.4 Implementation and Verification Results ...... 71
5.7 Floating Point Multiplication ...... 73
5.7.1 Functional Description ...... 73
5.7.2 RTL Architecture ...... 73
5.7.3 Verification Environment ...... 74
5.7.4 Implementation and Verification Results ...... 76
5.8 Example Application of the Floating Point Cores: Temperature Compensation ...... 77

6 Conclusion and Outlook 79
6.1 Conclusion on the Completed Work ...... 79
6.1.1 Floating Point Core Design and Verification ...... 79
6.1.2 Securing PS/PL Data Transfer ...... 80
6.1.3 General Methodologies to Increase the Reliability of the Design ...... 80
6.2 Future Work ...... 81
6.2.1 General Methodologies to Increase the Reliability of the Design ...... 81
6.2.2 Architectural Modifications to Increase the Reliability of the Design ...... 81

Glossary 83
Bibliography 89

List of Figures

2.1 Diagram of the complete radiation monitoring system ...... 6
2.2 PCBs inside the submodules of CROME ...... 6
2.3 High level block diagram of a PicoZed SoM ...... 8
2.4 High level block diagram of a Zynq device ...... 9
2.5 Block diagram of the PS of a Zynq device ...... 10
2.6 Block diagram of the ICAP and PCAP interfaces ...... 12
2.7 Simplified functional block diagram of the current state of the system ...... 13
2.8 System architecture ...... 14
2.9 Proposed system architecture for the next iteration ...... 15

3.1 General V-cycle for HDL development ...... 20
3.2 Formal verification flow for floating point units used at Corporation ...... 31

4.1 Block diagram of the current link between the PS and the PL ...... 35
4.2 Block diagram of the secured link between the PS and the PL ...... 38
4.3 State diagram of the FSM for securing the PS/PL transfer ...... 39

5.1 Generic binary representation of a floating point number ...... 47
5.2 Logarithmic to logarithmic plot of the runtime of a result based adder verification procedure against the uniformly spread seed numbers around the target result ...... 55
5.3 RTL architecture of the integer to floating point converter ...... 59
5.4 Testbench architecture for floating point to integer and integer to floating point conversion ...... 61
5.5 RTL architecture of the floating point to integer converter ...... 63
5.6 RTL architecture of the floating point adder ...... 68
5.7 Testbench architecture for floating point addition and multiplication ...... 71
5.8 RTL architecture of the floating point multiplier ...... 75

List of Tables

2.1 Probability of failure per hour in continuous mode ...... 7

4.1 Memory mapping of the configuration parameters ...... 37
4.2 Memory mapping of the results ...... 38

5.1 Floating point classification ...... 49
5.2 Number of stimuli applied to complete the test procedures for the integer to floating point converter ...... 61
5.3 Resource usage of the integer to floating point converter ...... 62
5.4 Resource usage of the floating point to integer converter ...... 65
5.5 Number of stimuli applied to complete the test procedures for the floating point adder ...... 72
5.6 Resource usage of the floating point adder ...... 72
5.7 Number of stimuli applied to complete the test procedures for the floating point multiplier ...... 76
5.8 Resource usage of the floating point multiplier ...... 77

1 Introduction

1.1 Motivation for the CROME Project

The European Organization for Nuclear Research (CERN) in Geneva (Switzerland) is a research institute active in the field of particle physics. It operates the world's largest and highest-energy particle accelerator complex. Its latest addition, the Large Hadron Collider (LHC), has a circumference of 27 km and has reached energies of up to 13 TeV. Attached to the accelerator complex are different experiments where particle collisions take place. These collisions, and also the operation of the accelerator complex itself, produce radiation, which has to be monitored in order to protect CERN personnel and visitors, residents and the environment in general. For this task, the RAdiation Monitoring System for the Environment and Safety (RAMSES) program was put in place by CERN. Its task is to measure, monitor and document all different kinds of radiating emissions and to trigger different alarms if necessary. Prior to putting RAMSES in place, a Preliminary Hazard Analysis (PHA) according to the IEC 61508 standard on Functional Safety of Electrical/Electronic/Programmable Electronic Safety-related Systems was conducted. The output of the PHA was a quantification of the risks and of the influence those risks may have. Based on this, it was possible to define how reliable certain functionalities need to be so that the expected effect of a hazard is in equilibrium with the risk that the hazard happens. According to IEC 61508, the safety-critical functions of the system need to comply with Safety Integrity Level (SIL) 2. The SIL of a functionality defines how reliable this functionality has to be by defining the likelihood that the functionality may fail. SIL 2 is equivalent to having a probability of failure per hour between 10^-6 and 10^-7. [1][2] The monitoring system currently in use in the RAMSES program is called ARea CONtrol (ARCON) and it is approaching its end of life after being in operation for 30 years. To replace the ageing ARCON system, CERN started the CERN RadiatiOn Monitoring Electronics (CROME) project. Certain functionalities of ARCON were safety-critical and therefore had to comply with SIL 2; the corresponding functionalities of CROME need to meet the same reliability requirements and therefore have to comply with SIL 2 as well. Besides the requirements in terms of reliability, CROME had to be designed in a modular way to increase its maintainability and to better cope with the extremely long life cycle. In order to have low-cost and mature components, mainly commercial off-the-shelf (COTS) components were used. The use of COTS components has an important impact on the project, as it has to be assured that replacement parts will be available in the far future in order to allow repair. In case a certain important component becomes unavailable during the life cycle of CROME, only a module would have to be changed and not the complete system. [2] Besides the reliability requirements, there are also several technical challenges, as the overall system has to be able to measure very low currents (current is the physical quantity that is measured; it is equivalent to the radiation level) in the order of magnitude of femtoamperes, which is roughly 6000 electrons per second. This very low current, coupled with an overall dynamic range of 9 decades of the measurements, poses a big challenge to the analogue measurement frontend and the following digital signal processing stages.
The analogue measurement part and following digital signal processing stages are both safety-critical and have to conform to SIL 2. Finally, the

collected data has to be sent back to the supervision through a custom protocol based on TCP/IP. The supervision is the entity that collects and monitors all the measurement results; it also sets the parameters for a device, such as the radiation threshold at which the alarm has to be triggered.

1.2 Scope of this Thesis

Prior to defining the tasks tackled in this thesis, it is necessary to introduce the chip which is used for this project: the Xilinx Zynq 7000 System-on-Chip (SoC). Xilinx is a market leader in the field of Field Programmable Gate Arrays (FPGAs), which are a type of programmable logic device. The particularity of the Xilinx Zynq 7000 device family is that Xilinx put two ARM Cortex-A9 processors on the same die together with a traditional FPGA to form a SoC. The FPGA portion of the SoC is called Programmable Logic (PL) and the portion containing the two ARM Cortex-A9 cores is called Processing System (PS). The communication between the two portions of the SoC happens through several Advanced eXtensible Interface (AXI) buses. This knowledge about the chip should suffice to define the original tasks of the master project:

• Creating a SIL 2 certifiable floating point processing core

• Performing temperature compensation on the measurements using a floating point processing core

• Replacing the AXI General Purpose Input/Output (GPIO) based PS/PL communication channel with a new system based on Block RAM (BRAM)

During the course of this project the tasks were specified in more detail as it became clear what exactly was needed. Furthermore, the work on the tasks resulted in some interesting byproducts, which were also added to the task list as they aid the greater goal of the project: to create a sufficiently reliable design. The initial task list was transformed into:

• Designing a floating point multiplier, floating point adder, floating point to arbitrary length integer converter, arbitrary length integer to floating point converter and a floating point comparator

• Verifying all of the designed floating point cores up to the point that they are certifiable

• Performing temperature compensation on the measurements using the floating point cores

• Designing a system for securing the transfer of mission critical data between the processing system part and the programmable logic part

The floating point processing core was split up into multiple cores which perform different operations. Even though the floating point comparator was designed in an earlier stage of the project, it was removed later on as it was no longer used. In order to make the floating point cores certifiable, they had to undergo rigorous verification; for this purpose, a verification suite for floating point operation cores was developed. The task of creating a secure system for the PS/PL transfer originally consisted of replacing the previously used approach of multiple AXI GPIO cores with an approach that leverages BRAM as storage instead of a large array of AXI GPIO cores. This task was fulfilled within the first few weeks of the project and was subsequently replaced by the task of making the BRAM based approach more reliable and robust.
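As a purely illustrative sketch (not the actual CROME driver, and with a hypothetical base address, size and word offsets), the following C snippet shows the general mechanism by which a Linux userspace process on the PS can reach a BRAM sitting behind an AXI BRAM controller: the physical address window is mapped through /dev/mem and is then accessed like ordinary memory.

    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define BRAM_BASE 0x40000000u   /* hypothetical address of the AXI BRAM controller */
    #define BRAM_SIZE 0x1000u       /* one 4 KiB page */

    int main(void)
    {
        int fd = open("/dev/mem", O_RDWR | O_SYNC);
        if (fd < 0) { perror("open /dev/mem"); return 1; }

        volatile uint32_t *bram = mmap(NULL, BRAM_SIZE, PROT_READ | PROT_WRITE,
                                       MAP_SHARED, fd, BRAM_BASE);
        if (bram == MAP_FAILED) { perror("mmap"); close(fd); return 1; }

        bram[0] = 0x12345678u;       /* hypothetical configuration word for the PL */
        uint32_t result = bram[1];   /* hypothetical result word written by the PL */
        printf("result word: 0x%08x\n", (unsigned)result);

        munmap((void *)bram, BRAM_SIZE);
        close(fd);
        return 0;
    }

In the actual system this kind of access is wrapped by kernel and userspace software, as discussed in Chapter 4; the sketch only illustrates why a single memory-mapped BRAM is a simpler interface than a large array of AXI GPIO registers.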

During the planning phase for the PS/PL communication core, some research into how to make FPGA devices more reliable was conducted. The overview of methods came from literature research and different trainings (at Silkan in Aix-en-Provence and during a seminar by and Microsemi at CERN) and it could be condensed into a methodology to apply to the design of high reliability and high availability systems. Parts of this methodology were implemented in the newly designed cores developed in this thesis and some were ported back into the existing code to increase reliability. However, several crucial methods, which would increase the reliability of the design even further, were out of scope for this thesis or have to be applied at a later stage of the design of the system. The methods that were implemented already increase the reliability of the device, but the reliability can be increased even further once the methods that were out of scope for the thesis are also implemented. During the exploration of the techniques for increasing the reliability and availability of the design, a critical weakness was discovered in how the Xilinx Zynq 7000 device is used. This led to the proposal of a new, improved system architecture which is more robust against this problem. Eventually, it was found that similar suggestions for architectural improvements had already been made earlier, but were not yet implemented; furthermore, they did not fully address the shortcomings of the current system architecture. These two developments, which were byproducts of the main tasks, can be summarised into two further points where significant work was conducted during the course of this thesis:

• Evaluating Single Event Upset (SEU; an SEU occurs when a particle or radiation alters the content of a storage element on a chip) mitigation techniques to increase the reliability and availability of the design and creating a concept for their implementation into the design

• Proposing architectural changes to remedy some architectural problems which were discovered, as they have a severe impact on the reliability and availability of the complete system

1.3 Organisation of this Report

The rest of this report is organised as follows: in Chapter 2, an overview of the complete system will be given, with a special focus on the parts that will be altered by the content of this thesis. Then the general architecture plus some particularities of the used device will be presented. As some details of the device's architecture have a severe impact on the reliability and availability of the complete system, a new architecture is proposed which remedies the shortcomings of the current architecture. Chapter 3 provides a theoretical foundation for later chapters. The overall development cycle used for this project will be presented. Then a general overview of error correction schemes will be given for all kinds of applications, with a special focus on the methods which can be used for this project. This will be followed by a short taxonomy of FPGA devices, combined with details on how error detection and correction can be achieved in FPGA devices to increase their reliability. The chapter ends with an overview of verification methodologies, with a focus on the methodologies and techniques used for verification tasks in this thesis. Chapter 4 is about how to secure the critical communication between the different portions of the SoC used for the design. Using several techniques presented in Chapter 3, a hardened hardware architecture will be presented, designed with the aim of increasing the reliability of the system. Several changes to the software need to be made to make this hardware usable. The essential changes for using the hardware were implemented and will be

presented together with some proposed software modifications which are aimed at increasing the overall system reliability. Chapter 5 presents a subset of the IEEE 754-2008 standard, which defines the floating point number format and operations with floating point numbers. This will be followed by the presentation of a VHDL software suite designed to simplify the verification process of floating point cores. The previously presented subset of floating point numbers will be used to implement certain floating point operations as IP cores. The detailed functionality and hardware architecture of those cores will be presented as well as their verification. At the end of the chapter, an example application in the system, which uses several floating point cores, is presented. Finally, in Chapter 6, a conclusion will be drawn and an outlook on future work to be conducted for the project will be given.

2 System Architecture

In this chapter, a high level overview of the system for which this project is conducted will first be given. Then the electronic parts at which this thesis is aimed are presented as a system. Having gathered an overview of the overall electronic system, the actual SoC used in this thesis will be presented in more detail. Some critical technical details of the SoC will be presented and discussed in terms of the fitness of the SoC for this application. As the internal architecture of the SoC has a critical impact on the device reliability and availability, a new architecture for mitigating these shortcomings will be presented and discussed.

2.1 System Level Overview

Figure 2.1 (from [2]) shows a high level overview of the new radiation monitoring system CROME which is being developed at CERN. While only the CROME part is new (encircled in red in the schema), the new system has to be integrated into an existing architecture. The existing architecture mainly consists of a technical networking infrastructure (denoted TN network in Figure 2.1, realised as a VLAN within a normal network which is also used for other purposes) and some services connected to this network in the form of supervisory control and data acquisition (SCADA) servers and database servers (all together called the supervision for the rest of this thesis). The task of the SCADA servers is to control and monitor each CROME node and to put the measured data from each CROME node into a database. [2] The CROME nodes are the functional part of the system that actually performs the measurements and triggers the alarm units and, most importantly, the interlocks, which can force a shutdown of any experiment if a bad condition (radiation above an acceptable level, for example) is detected. The CROME Alarm Unit (CAU) is a pillar of warning lights (green = no problem, orange = warning and red = error), which receives an alarm command, acts accordingly and rebroadcasts the incoming alarm signal to other CAUs. Furthermore, the CAU also emits an acoustic warning signal. The actual structure of the CAU is irrelevant for the course of the thesis; only the alarm signal is important, as it is created by the CROME Measuring and Processing Unit (CMPU). Besides the alarm signalling connection with the CAU, the CMPU is connected directly to the interlocks and to an ethernet network, through which all the communication with the supervision is handled. The device is connected to an ionisation chamber, which is denoted "IG5 Detector" in Figure 2.1. The name "IG5" denotes the type of the ionisation chamber. Ionisation chambers can differ in their geometry, in whether the internal gas is pressurised and in the substance used for the gas. The IG5 Detector is a sensor which converts the physical quantity of radiation, which is not directly measurable in this configuration, into a current that can then be measured. The CMPU is powered by the CROME Uninterruptible Power Supply (CUPS), which is attached to the standard 230 V AC power network and contains a battery to sustain short-term power outages; it outputs 24 V DC. The CAU is also powered by the CUPS. [2]

Figure 2.1: Diagram of the complete radiation monitoring system

2.1.1 CROME Measuring and Processing Unit

Figure 2.2 (from [2]) shows the different printed circuit boards (PCBs) which were developed in the course of the CROME project. The focus of this thesis is the CMPU. It is enclosed in a metal case to which the ionisation chamber can be connected on one side; on the other side, the power, ethernet and other communication signals as well as the interlock and alarm signals have their connectors.

Figure 2.2: PCBs inside the submodules of CROME

From the electronics point of view, the system consists of four PCBs.

The Connectic Board incorporates the connectors to the outside world. It includes the power supply, interlocks, ethernet, etc. It also incorporates ICs to monitor the supply voltages used for the other boards. It is connected to the Adapting Board, whose purpose is to support the PicoZed board (the graphic is actually outdated, as the MicroZed board was replaced by the PicoZed board) and to connect it to the (analogue) Frontend Board and to the PicoZed System-on-Module (SoM). The Frontend Board's purpose is to provide a high voltage source to bias the ionisation chamber, measure and monitor the various power supply voltages, measure the very low currents which come from the ionisation chamber, measure the temperature and humidity inside the case, measure the ambient temperature of the ionisation chamber and provide a lithium-ion battery charger for the PicoZed. The purpose of the PicoZed is to perform all the logic functionalities, peripheral control and communication. It controls the high voltage source, reads out the measurement circuits, calculates the radiation levels, logs them, and triggers alarm and interlock signals accordingly. Furthermore, it checks for dangerous conditions of the device, such as the interior of the enclosure being past the dew point, the temperature being too high in general, or one of the other boards measuring a bad supply voltage. [2] The PHA mentioned in Section 1.1 has shown that all safety-critical functions of the system (interlock triggering and all the measurement related functionalities) should have at least SIL 2 according to [3]. Table 2.1 shows the probability of failure per hour for a continuous mode function corresponding to a certain SIL according to [3]. For the SIL, a distinction between high demand/continuous mode functions and low demand functions has to be made. As the critical functionalities of the CMPU are all high demand (several times per second), the continuous mode probability of failure per hour numbers have to be used. The requirement to conform to SIL 2 defined by the PHA mentioned in Section 1.1 is equivalent to having a probability of failure per hour between 10^-6 and 10^-7. [2]

Table 2.1: Probability of failure per hour in continuous mode

Safety Integrity Level                     1                2                3                4
Probability of failure per hour [h^-1]     10^-5 - 10^-6    10^-6 - 10^-7    10^-7 - 10^-8    10^-8 - 10^-9
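As a quick illustrative conversion (my own arithmetic, not taken from [3]), the SIL 2 band can be re-expressed as a mean operating time between dangerous failures by inverting the failure rate:

    \[
      \lambda \in [10^{-7},\,10^{-6}]\ \mathrm{h^{-1}}
      \;\Longrightarrow\;
      \frac{1}{\lambda} \in [10^{6},\,10^{7}]\ \mathrm{h}
      \approx 114\ \text{to}\ 1140\ \text{years},
    \]

assuming 8760 operating hours per year. This gives an intuition for how rarely a dangerous failure of a SIL 2 continuous mode function is allowed to occur on average.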

There are functions of the CMPU which are only SIL 1 or which are not critical at all. However, as the SIL 2 critical functions depend on all the IP cores developed for this thesis, SIL 2 is always assumed for the rest of this document. The CMPU measures the current equivalent to the radiation level every 100 ms, i.e. 10 times per second. These 100 ms periods will be referred to as the measurement window, window or cycle in the rest of the report. However, this does not mean that within the 100 ms time frame the CMPU does nothing; in fact, the actual tasks for the measurement are executed all the time. Furthermore, the parameters for the measurement can be updated once per cycle. At most once per second (or every 10 cycles) and at least once per day (this update period is parametrisable), the actual radiation level is calculated and interlocks and alarms are triggered according to the radiation level and the configuration of the device. [4] Another detail specific to the CMPU is related to the fact that the measured current can be very low (6000 electrons per second, as written in the introduction). This necessitates that special cabling is used and that PCB layout techniques are adapted: if there is humidity or dust inside the CMPU case, some of the few electrons which should be measured will travel through the humid air or the air-dust mixture instead of going through the cable or PCB trace. This loss through the air is a normal phenomenon and usually such losses do not significantly contribute to the overall charge, but in this project these losses absolutely need to be mitigated as otherwise the measurement results are inaccurate. [4]
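The small C sketch below only illustrates the relation between the 100 ms measurement cycle and the parametrisable dose-calculation period described above; the names, the chosen example period and the plain counting loop are assumptions and do not reflect the CMPU firmware.

    #include <stdint.h>
    #include <stdio.h>

    #define CYCLES_PER_SECOND 10u        /* one measurement window every 100 ms */
    #define CYCLES_PER_DAY    864000u    /* 86400 s/day x 10 cycles/s           */

    int main(void)
    {
        /* dose-calculation period in cycles; must lie between once per second
         * (10 cycles) and once per day (864000 cycles)                        */
        uint32_t calc_period_cycles = 5u * CYCLES_PER_SECOND;   /* example: 5 s */

        for (uint32_t cycle = 1; cycle <= 100; cycle++) {
            /* every cycle (100 ms): acquire the measurement, accept new parameters */
            if (cycle % calc_period_cycles == 0) {
                /* only here: calculate the radiation level and update alarms/interlocks */
                printf("cycle %u: dose calculation and alarm/interlock update\n",
                       (unsigned)cycle);
            }
        }
        return 0;
    }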

2.1.2 Avnet PicoZed

The Avnet PicoZed SoM contains a chip of the Xilinx Zynq 7000 SoC device family as its core component. The exact chip depends on the SoM variant; in the case of this project, the SoC is an XC7Z020 from Xilinx. The XC7Z020 will be presented in more detail in the next section. As already briefly presented in Chapter 1, the SoC has two major parts sitting on the same die: the Processing System (PS) and the Programmable Logic (PL). The PL can be regarded as an FPGA and the PS as a microprocessor together with its peripherals. Figure 2.3 (from [5]) shows a block diagram of the PicoZed SoM. [5]

Figure 2.3: High level block diagram of a PicoZed SoM

The connectivity of the SoM is provided through three Micro Headers which are connected to the Adapting Board. There is 1 GB of DDR2 DRAM available on this board. For storage, there is a 128 Mb QSPI flash, a 4 GB eMMC and an SD card slot available on the board. The SD card must be added by the user and is therefore not shown in the block diagram. For connectivity, USB 2.0 and Gigabit Ethernet ports are available. Finally, there is a 33.33 MHz crystal oscillator present on the board which provides a clock source to the Zynq. [5]

2.2 Xilinx Zynq Device Architecture

Figure 2.4 (from [6]) shows a highly simplified block diagram of a Zynq 7000 SoC. Its focus is actually on the interfaces between the two main subsystems of the SoC, but it also contains the essential blocks needed to get a high-level overview of the functionality of the device. The PL can be considered as an FPGA, with the only difference that the configuration bitstream is loaded by the PS and not by the PL itself (this will have some important implications, as discussed later). This detail is important because an FPGA normally loads its configuration by itself, but Zynq devices do it through an external "agent". The FPGA can be provided with external clock signals or with clock signals provided by the PS. The same holds for reset signals. There is only one important detail: even if the reset signals of the PL come from an external source, the complete PL will be held in reset for the time that the PS is being initialised. The reason for this is that there could be

Copyright © 2015 Avnet, Inc. AVNET and the AV logo are registered trademarks of Avnet, Inc. All other brands are the property of their respective owners. LIT # PB-AES-Z7PZ-SOM-G-v1 problems otherwise as the PS to PL signals might change asynchronously when the PS is reset and this might cause problems to the PL as timings are not respected any more. [7]


Figure 2.4: High level block diagram of a Zynq device

The processing core of the PS is comprised of two ARM Cortex-A9 cores with a NEON single instruction, multiple data (SIMD) engine, a memory management unit (MMU) and a cache hierarchy. This is directly connected to a DDR2 DRAM controller and to the central interconnect of type ARM NIC-301. There is also a DMA controller from ARM (PL330) present and a large number of peripherals like the QSPI flash controller, SD/SDIO controller, GPIO controller, USB controller, Gigabit Ethernet controller, SPI controller, CAN controller, UART controller and I2C controller. Other peripherals present on the SoC, which are not as important for this project but still worth noting, are the clock management system and the reset system, the Joint Test Action Group (JTAG) and Debug Access Port (DAP) system. Another detail is the XADC interface, which is a 1 Msample/s multichannel ADC present on the SoC. The list of peripherals further contains an interrupt controller and different timers (one watchdog per CPU core, a global watchdog and more). The list can be extended, but these are the most important components present in the PS of the SoC. For the sake of completeness, Figure 2.5 (from [7]) shows the full detail block diagram of the PS of a Zynq device. A first observation shows that there are plenty of clock domain crossing circuits compared to the simplified block diagram. Furthermore, at many places in the design it is possible to use Quality of Service (QoS) signaling, but this is not really needed for this project.

A very important device which is not shown in the simplified graphic but is in the detailed one is the DevC or device configuration unit. It is the device through which it is possible to program the PL and through which the PS is able to read back the configuration of the PL. A final word about the interfaces between the PS and the PL: there are many and some of them are capable of exceptionally high throughput (the AXI HP and ACP ports), but only one interface is actually needed in the design, namely one AXI General Purpose (AXI GP) port where the PS is the master and the PL side is the slave. It is used by the PS to provide configuration data to the PL and to read back results calculated by the PL. AXI stands for Advanced eXtensible Interface and is a high-speed, widely spread bus protocol which is part of the Advanced Microcontroller Bus Architecture (AMBA) interconnect specification of ARM Ltd. [7]


Figure 2.5: Block diagram of the PS of a Zynq device

2.2.1 Zynq Boot Sequence

While loading the bitstream (configuration data) into a normal Xilinx FPGA is not very complex, the bring-up process of a Zynq SoC consists of the following steps:

1. Hard coded boot ROM

2. First Stage Boot Loader (FSBL)

   a) Perform basic configuration of the PS
   b) If a bitstream is present, load it to the PL

3. Two options:
   a) Bare metal code to be run on the ARM Cortex-A9
   b) Second Stage Boot Loader (SSBL) like U-Boot

      i. Load and start Linux or another operating system

After the power supply has settled, the reset is finished and all internal PLLs are locked (there are some more conditions, but these are the most important ones), the Zynq starts executing a hard coded boot ROM. This boot ROM does some very basic device initialisation and at the end of its execution checks the value of some of the input pins to determine from which peripheral (QSPI flash, eMMC, SD card or JTAG) the next boot executable has to be loaded, loads it and starts executing it. This boot ROM is sometimes called the zero stage boot loader (ZSBL), because the executable which is executed after it is called the FSBL. [7][8] The FSBL is loaded into the on-chip memory (OCM, see Figure 2.5), from where it is executed. It finishes the device configuration that was started by the ZSBL and loads the bitstream into the PL if a bitstream is present. The PL is held in reset until the loading of the bitstream is finished (or until it is detected that none is present in the boot executable). At the end of its execution, the FSBL loads the next executable into the DDR2 DRAM and starts executing it. There are three kinds of programs which can be executed at this stage: a bare metal ARM Cortex-A9 application, a real-time operating system (RTOS) or a SSBL. [7][8] The SSBL usually just loads the Linux kernel and the data it requires, like the device tree and other configuration data, into the DDR2 DRAM and starts the Linux kernel with access to both ARM Cortex-A9 cores. However, this is not the only possibility: it is possible to let Linux run in an asymmetric multiprocessor (AMP) configuration, where Linux only runs on the first core, with some restrictions on which address range in the memory space it uses, and bare metal code or an RTOS runs on the second core. [8]

2.2.2 Fitness of Zynq for CROME

The original motivation for using a Zynq device for this project was based on two main ideas. On the one hand, the implementation of safety-critical functions like interlock triggering inside the PL must use the deterministic real-time capabilities of the FPGA. Such capabilities ensure the full predictability of the FPGA behaviour at the clock cycle level. A microprocessor has far weaker real-time capabilities than an FPGA: due to its caching system and the superscalar execution of code, it is usually not possible to predict when exactly a certain portion of code will be finished. On the other hand, the non-critical functionality, like providing access to the configuration and results for the supervision, could be implemented inside a Linux system running on the ARM Cortex-A9 cores inside the PS. Even though this approach looks very nice on paper, it has two technical problems which can have a severe impact on the availability (for the first problem) and, more importantly, on the reliability (for the second problem). The first problem occurs when the Linux system inside the PS encounters a non-recoverable error condition. Although the safety-critical functions inside the PL should continue to run correctly, it is impossible to set new configuration parameters or to access the results, as all of this communication runs through Linux. The only possibility would be to reset the PS to restart Linux. However, during the reset process of the PS, the PL is held in reset as well. This is very bad, as the PL will not be able to perform its safety-critical operation as long as it is being held in reset. Furthermore, it would be technically challenging (but not impossible) to prevent the PS from reflashing the bitstream in the PL. If the bitstream is reloaded into the PL during the reset of the PS, the interlocks would trigger, as the outputs of the PL are held in a high impedance state during reflashing to avoid damaging the device. Although this problem has a severe impact on the design, the second problem (which is actually tightly linked to the first) is even more severe, as presented in the next section. [7]

2.2.3 PCAP-ICAP Issue

Figure 2.6 shows a simplified block diagram of the programming hardware and how it is accessed in a Zynq 7000 device. [7]

Figure 2.6: Block diagram of the ICAP and PCAP interfaces

Although this might not seem very important at first sight, this architecture has a crucial impact on the suitability of the device for high reliability, high availability systems. It implies that the PS decides how the device configuration primitive is accessed. The ARM Cortex-A9 cores control the multiplexer that decides whether the Processor Configuration Access Port (PCAP) or the Internal Configuration Access Port (ICAP) has access to the configuration primitive. The reason why Xilinx decided to put the PS in ultimate control is a feature of ARM called TrustZone. It allows the execution of trusted code in a secure environment. However, CROME tries to do exactly the opposite: it tries to run something insecure on the PS while the behaviour of the PL should remain completely deterministic even when the PS has a failure. However, it cannot be guaranteed that a failure in the PS will not affect the configuration primitive (even though it is highly unlikely, as several registers have to be unlocked in a specific order) and therefore the PL, by overwriting some of its configuration. [7] There are anti-tamper cores developed by Xilinx which observe from within the PL the configuration of the multiplexers which control the access to the device configuration primitive and can provide a warning signal when the PS tries to take control over the device configuration primitive. However, those anti-tamper cores are under the International Traffic in Arms Regulations (ITAR), which prohibits an easy use of them. Furthermore, in order to get through the ITAR process, a delay of half a year up to one year has to be expected, so it is not something that could easily be tested. A final problem of using the Xilinx anti-tamper core is that it uses the ICAP, which could be put to better use for other things (a more detailed discussion follows in Section 3.3.3). As there is very little information publicly available, it is not possible to tell whether sharing the ICAP between the anti-tamper core and another core would easily be possible and what the cost of such resource sharing would be. [9]

2.3 Proposed Modifications on the Architecture to Increase Reliability and Availability

Figure 2.7 shows a very simplified block diagram of the current state of the system. The PS of the Zynq is connected to the internet (through which it is connected to the supervision) and the PL of the Zynq is connected to all the peripherals necessary for the radiation measurement, like the high voltage control for the biasing of the ionisation chamber or the ADC for measuring the current coming from the ionisation chamber. Furthermore, the PL also controls peripherals which monitor its own operation, like the ADCs for measuring the voltage on the power supply lines or the ADC for measuring the ambient temperature and humidity. And finally, the PL of the Zynq is connected to the interlock system, which it has to trigger under certain runtime-definable conditions. All the critical functions are only connected to the PL of the system, while non-critical functionality can run in the PS.

Figure 2.7: Simplified functional block diagram of the current state of the system

Even if it were possible to guarantee SIL 2 by solely using a Zynq, there would still be a problem because the Zynq is not built onto a custom board but comes as a SoM called PicoZed, and the complete SoM is not SIL 2 (however, there are negotiations with Avnet to provide a SIL 2 version of the board). Besides the problem of the PicoZed SoM not being SIL 2, there is the problem that the Zynq cannot detect malfunctions of itself. To solve these kinds of problems, a next generation of the system architecture was proposed by the team about one year ago, as shown in Figure 2.8. The interlock signal in Figure 2.8 is active low. The complete system besides the interlock is duplicated. A microcontroller is added to perform the same functionality as the Zynq. A part of this duplicated functionality is safety-critical, like the correct handling of the peripherals (such as the ADC and the high voltage control) and the calculation of the dose. But the non-safety-critical functionality, like the communication with the supervision through ethernet, is duplicated as well. Furthermore, a Microsemi ProAsic 3 flash-based FPGA (the benefits of using a flash-based FPGA and how their usage differs from the usual SRAM-based FPGAs will be detailed in the next chapter) is added for supervising the correct functionality of the Zynq and the microcontroller. Microsemi ProAsic 3 FPGAs are widely used in safety-critical and high reliability systems like defence and aeronautics applications and are also widely used at CERN for electronics which must work in a radiating environment. The interlock signals of both the Zynq and the microcontroller are connected to an AND gate which then connects to the actual interlock. While this architecture certainly increases the reliability greatly, it still has three problems. The first is that the software stack which has to run on the microcontroller might be too complex to be certifiable. While it would not be too difficult to certify the critical SIL 2 part on its own, it would be very difficult to demonstrate that the rest of the software stack (all the non-critical functionality like the communication with the supervision) has no influence on the reliability of the critical part of the software stack.

Figure 2.8: System architecture as proposed initially

The second problem is that the availability of the system was not improved by the new approach. The availability problem lies in the fact that it is not possible to restart the PS individually while keeping the PL working. This means that any time there is a problem in the PS which requires the PS to be restarted, the interlocks will be triggered, as it is not possible any more to change the configuration of the PL (where the interlock functionality could be disabled). Furthermore, there is no real mechanism foreseen to restart the Zynq remotely. The third problem is that the peripherals might be damaged when their corresponding master is not working correctly, e.g. if the Zynq is reset during operation of the complete system. The reason why the peripherals might be damaged cannot be explained here, as it is inherent to the measurement principle, which might be patented. Finally, the architecture has one more problem which is not directly related to the architecture itself but to the state of the complete system hardware: up to now it is not possible to have multiple ionisation chambers attached to the analogue frontend board, and it is not possible to have one chamber attached to multiple analogue frontends as this defeats the measurement principle. In order to solve these problems, an improved architecture, as shown in Figure 2.9, is proposed. The dotted lines represent connections that are global and are not routed through any multiplexer, like the master in, slave out (MISO) signal of an ADC that is connected through SPI and can be read by multiple components. The new architecture replaces the general purpose microcontroller with a SIL 3 microcontroller (an overview of possible candidates will be given in the next section). The main task of the Microsemi ProAsic 3 is still the same: it monitors whether both the Zynq and the microcontroller are working correctly. The tasks of the Zynq remain the same as well. However, the task of the microcontroller changes considerably: the microcontroller does not have a connection to the supervision any more. The only purpose of its ethernet connection is to provide an interface for the maintenance group, through which the microcontroller can be told to restart the Zynq because the PS is in a bad state. Besides this restarting task, the microcontroller will have to perform the same real-time critical dose calculations as the Zynq and manage the corresponding peripherals as well. The active low interlock signals are again combined with an AND gate. However, the interlock output signal of the Zynq is first input to an OR gate together with a control signal of the ProAsic 3 in order to "gate off" the Zynq's interlock signal while the Zynq is restarting. As it will most likely never be possible to have two ionisation chambers in one system, there will only be

one analogue frontend. The control over all the peripherals needs to be multiplexed between having the Zynq as master and having the microcontroller as master (because the frontend might be damaged if the controlling electronics are not working correctly). This multiplexing can either be done by discrete ICs or by putting the multiplexing function into the ProAsic 3 FPGA. A quantitative simulation will have to show which configuration (multiplexing with discrete ICs or multiplexing with the FPGA) has better reliability. In normal mode, the Zynq is master of all the peripherals and the microcontroller just "listens" to the signals, interprets the data and calculates the dose itself as well. In alternate mode, the microcontroller is master of all the peripherals and does all the controlling and computation itself. The interlock output of the Zynq can be gated off in case the Zynq is not working or is restarting. However, it is only the PL that is effectively gated off, as the PS is still working and communicating with the supervision. As a final remark, it would be possible to perform causality checking by the "inactive" devices, e.g. in normal mode the SIL 3 microcontroller and/or the ProAsic 3 could check whether the output of the Zynq makes sense and that, for example, certain delays for interrupt handling are respected. Similarly, in alternate mode, the SIL 3 microcontroller's output would be checked by the Zynq and/or the ProAsic 3. While it is beneficial in terms of resources to use the ProAsic 3 for causality checking in both cases, it is risky to do so, as another single point of failure is created when the ProAsic 3 has a problem. Therefore, it would make sense to disperse this functionality among the different main components of the system.

Figure 2.9: Proposed system architecture for the next iteration

The PHA identified certain sites at CERN that are so critical that it is planned to have two independent monitors observing them in parallel. For this scenario it might be advantageous to have one system running in normal mode and the second one by default in alternate mode, with the microcontroller in charge and the Zynq only using its PS for communication with the supervision. The reason why this might be a good idea is to diversify the possible sources of error: because the two systems are configured differently, with different code running the critical tasks, the risk of failure due to a common cause (the same erroneous HDL code, for example) is reduced.

2.3.1 SIL 3 Microcontroller

High reliability microcontrollers have been around for some time now, but they were mainly used in the automotive industry and in some industrial applications. However, other industries such as the aerospace industry, which up to now mostly relied on FPGAs and ASICs [10], are now

turning their attention to this kind of microcontroller [11]. Key companies with products specifically targeted at functional safety are Infineon with their TriCore™ device family, Texas Instruments with their Hercules™ device family, and NXP Semiconductor with their Power Architecture for Automotive product lineup (this list is non-exhaustive, as there are various other companies). While the TriCore™ devices are based on a proprietary architecture from Infineon, the Hercules™ devices are based on an ARM Cortex-R architecture (NXP Semiconductor uses the Power Architecture, as already stated). [12][13]

Reliability Features

Although the underlying architectures of the microcontrollers are different, they all have similar features targeted specifically at reliability. The first is the lock-step: two (or more) cores execute exactly the same task, their outputs are compared after each clock cycle, and an error is raised if they differ. Furthermore, the cores are not just sitting side by side. In order to minimise common causes of malfunction, the physical orientation of each core is usually turned by 90° on the die (one with north-south orientation and the other with east-west orientation, for example). A second feature is that most registers, caches and buses (exactly which memory elements depends on the actual microcontroller) are ECC-controlled and/or parity bit protected (ECC stands for error correcting code and will be presented in more detail in Section 3.2.4). This allows for built-in soft-error mitigation. Third, there are usually some built-in boot-time self-test and diagnostic systems. Most of the time there is a Memory Built-In Self-Test (MBIST), which tests the memory at boot-time, and a Logic Built-In Self-Test (LBIST), which checks for problems in the logic at boot-time. This guarantees that, at least at the beginning, the hardware is fully functional (or a malfunction is detected) and any later malfunction can be handled by software or by the surrounding system. [12]
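To illustrate the lock-step principle, the following is a minimal VHDL sketch of an output comparator that flags any clock cycle in which the two replicated cores disagree. The entity and port names are purely illustrative; in the actual devices this comparison logic is built into the silicon and is not user-visible.

library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical lock-step output comparator: flags any clock cycle in which
-- the outputs of the two replicated cores differ.
entity lockstep_compare is
  generic (
    W : positive := 32  -- width of the compared core output bus
  );
  port (
    clk        : in  std_logic;
    core_a_out : in  std_logic_vector(W-1 downto 0);
    core_b_out : in  std_logic_vector(W-1 downto 0);
    ls_error   : out std_logic  -- asserted for one cycle on a mismatch
  );
end entity lockstep_compare;

architecture rtl of lockstep_compare is
begin
  process (clk)
  begin
    if rising_edge(clk) then
      if core_a_out /= core_b_out then
        ls_error <= '1';
      else
        ls_error <= '0';
      end if;
    end if;
  end process;
end architecture rtl;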

Real-Time Features

Most safety-critical systems are also real-time critical: there is hardly a system where it is enough to assure that the correct results or outputs are generated without any time constraints. In most systems these results or outputs must also be generated within a certain time frame. A first step towards more real-time determinism in microcontrollers is to remove caches. As cache behaviour is extremely difficult or even impossible to predict, it poses a threat to a real-time critical system because the latency of the system cannot be predicted accurately (there is no way to predict cache misses at all times). To solve this problem, most of the microcontrollers mentioned above remove non-deterministic cache hierarchies or try to make them more deterministic. As an example, the ARM Cortex-R family at least partially replaces the cache hierarchy with so-called tightly coupled memories (TCM), which allow for more deterministic delays when accessing memory. Another special feature of this kind of microcontroller is that their interrupt systems are modified compared to normal microcontrollers, so that interrupt entry happens in a bounded and deterministic way. Furthermore, the interrupts are entered very quickly (low-latency interrupt mode (LLIM) on ARM Cortex-R systems). [12]

Software Considerations

The two microcontrollers which were found to be particularly interesting for this project are the Texas Instruments Hercules™ RM48 (which is based on the ARM Cortex-R product family) and the Infineon AURIX TC29xT (which is based on the TriCore™). As the task running on the core will require both very low latency interrupt handling (for the analogue frontend of the system) as well as

some networking tasks, an approach which uses an RTOS will have to be chosen, as doing networking on bare metal would be too much work and would induce an additional risk because the code base which has to be verified increases. Different RTOS kernels exist; one that is widely used is FreeRTOS. FreeRTOS is especially interesting as there is a very similar, but not quite API compatible, RTOS kernel called SafeRTOS which is certified up to SIL 3. Furthermore, both FreeRTOS and SafeRTOS support the above mentioned microcontrollers. For future development, FreeRTOS could initially be used, and at a later stage, with only minor modifications in the code base (but high costs), a switch to SafeRTOS could take place to facilitate the certification process. From a software architecture point of view the usage of the two microcontrollers would differ: while the Texas Instruments Hercules™ RM48 only supports lock-step (or no lock-step) mode with its two cores, the Infineon AURIX TC29xT has three cores where the lock-steps can be set dynamically and/or delayed. This feature would be particularly interesting, as on an Infineon AURIX TC29xT it would be possible to let the safety-critical part of the software run on two cores in lock-step mode with each other, while the third core, without lock-step, manages the ethernet communication, which is regarded as not safety-critical. However, this is only a nice-to-have feature and not really necessary, as in [14] an approach is sketched for properly designing and protecting a system where tasks with different safety levels have to run. The basic approach is to use the MMU, which is extremely powerful on such systems, to isolate the different tasks from each other into different memory regions and to let the system trigger a fault if an illegal memory access happens. If the Infineon AURIX TC29xT were chosen over the TI Hercules™ RM48, the same MMU functionality would have to be used in order to prevent the single core running the ethernet stack from writing into address regions which are used by the two other cores in lock-step configuration. [15] One final remark: it will be crucial to make an early case study of how fast the SIL 3 microcontroller is able to respond to interrupts and which response latency is needed by the overall system. While it is possible with an FPGA to react within tens of nanoseconds to a change of the input signals (reacting meaning producing an output according to the input), a typical reaction time of a microcontroller is in the order of hundreds of nanoseconds; depending on the software stack used, for example an RTOS where extensive context switching has to be performed, it might even reach the order of magnitude of microseconds. Depending on how fast the reaction times of the system need to be and the actual reaction time of an interrupt in the RTOS, it might be necessary to move certain low level control tasks from the SIL 3 microcontroller to the ProAsic 3 if the microcontroller cannot perform them fast enough.


3 Methodologies and Techniques for the Design of Safety-Critical Systems

In this chapter the development cycle used to achieve SIL 2 will be presented first. Second, error correcting schemes in digital systems will be presented. Then some general methodologies and techniques used to increase reliability and availability of safety-critical FPGA designs will be presented. Finally, after an overview of verification methodologies, two key technologies used during the verification steps will be presented.

3.1 Development Cycle to Achieve SIL 2

The overall approach to develop a complex system that shall be SIL 2 certifiable is the following: a quantitative approach is chosen if it is possible to do so; if not, a qualitative approach has to be taken. For example, the failure rate of electronic components and other parts of the complete system can be known. On that basis, simulations of the behaviour of the system when a certain component is malfunctioning can be conducted (this kind of analysis is called Failure Mode and Effects Analysis (FMEA) [2]). Based on those simulations, a failure rate of the complete system can be calculated. For the case of software or complex hardware, this is not applicable. A quantitative assessment of the hardware is possible for statistical failure of the hardware due to random hardware faults, but it is not possible for design errors. In the following the focus will be on complex hardware and design errors. First of all, the term complex needs to be defined. Complex is everything that is not simple, and simple is anything for which a comprehensive combination of deterministic tests and analysis can ensure correct functional performance. As complete testing of complex hardware is (in most cases) not feasible, a certain process needs to be put in place which assures the quality of the "product", which can then be certified based on the documentation of the development process. [10] Leaving the exact documents that have to be delivered to certification authorities aside, the rest of this section will describe the overall development process with a special focus on the development of certifiable IP cores. As there is no real standard process for SIL available, the process was heavily based on the well known and tested DO-254 process used in the aeronautics industry. The basis of every certifiable development process is the V-cycle. Figure 3.1 shows a complete V-cycle. The V-cycle descends from planning, high level functional design with requirement capture, conceptual and detailed functional design and detailed technical design down to the actual implementation at the bottom. It is possible to have a V-cycle starting at a later point than the top of the V. For example, when creating an IP core, the conceptual and functional design are usually already done and put in the form of requirements, and only the rest of the way down the V-cycle needs to be done. The way up the V then goes from the implementation to verification, validation, testing (if applicable), qualification (if applicable), integration, integration testing and integration qualification. The number of testing and qualification steps can depend on how much a system needs to be integrated to be able to test certain functionality. The V-cycle can be applied recursively to decompose complex tasks into simpler tasks, where the complete or partial V-cycle then takes place again on the individual "leaves" of the recursion. Similarly, the V-cycle can be applied iteratively

when some design metrics were not met and (partial) redesign is needed, or when specifications change. However, it is always crucial to motivate why something was done, which creates traceability over the whole project. At any time in the development process, the team has to be able to answer "why are you doing this" with a traceable reasoning through the different stages of the cycle(s). For example, it is not detrimental (but not good) to find a bug in a certain component of a design at a later stage, but it has to be documented and the different process steps need to be redone/adapted. [10][16]

Figure 3.1: General V-cycle for HDL development

Besides the standard V-cycle, there are a number of supporting processes which run alongside all the different V-cycles. There is the configuration management process, which describes the tools which are used for development: which code versioning system is used, which tools are used for simulation, synthesis and place-and-route, what impact the reliability of those tools has on the reliability of the "product" (for example, synthesis could introduce errors, while simulation could merely fail to discover certain errors), which tool is used for bug tracking, etc. Then, there is the process assurance process, which is used to assure that the other processes are executed correctly. Finally, there is the certification liaison process, which links the ensemble of processes and documents to the final deliverables to be handed to the certification authorities. [10] One crucial requirement to achieve SIL 2 (the same holds true for SIL 3) is to have independence between the person who implements and the person who verifies. The reason is that it is not acceptable to have the person who writes the design also write the reference model which is used in verification. However, it is acceptable for the designer to implement the testbench if an independent reference model is provided. [10]

3.2 Overview of Error Correction Schemes

In this section a brief overview of common error correction schemes is given. Error correction schemes can be used to increase the reliability of transmission and hardware systems. This section's focus is on error correction in transmission systems, while Section 3.3 is more focused on hardware systems. In Figure 2.9 there is a lot of communication between the different parts of the system over links which resemble transmission systems. In order to find a good way to secure these communication links, a general survey of the different options is very useful.

3.2.1 Repetition Codes and Parity Bits

The easiest and most naive way to assure that the same data has been received as was sent is to send the same data at least twice. Even though this method is extremely easy to implement, it has two major shortcomings. First, it is an extremely inefficient way of using the available bandwidth or memory. Furthermore, if the same data fragment is only sent or read twice and there is a bit error in one of the two fragments, there is no way of knowing which of the two fragments contains the correct data and which one has an error. Therefore, the same data would have to be sent or read at least three times in order to be able to correct a single bit error. Second, this method is very weak against the case where the same error occurs twice in the same location. This is especially bad in the case of a transmission over a parallel bus, because the risk of having two or more bit errors in the same location is high: the same location means the same physical line of the bus, which could be impacted. Another simple way to detect bit errors is to use parity bits. Parity bits are calculated by xoring the payload bit by bit and then sending the payload plus the parity bit. Using this method, any odd number of bit flips can be detected. Yet if there are two (or any other even number of) bit flips, it is not possible to detect them. Furthermore, this method does not provide any means for correcting detected errors. This method is actually used by very simple serial protocols such as UART. The reason why this simple method is not used more frequently is that it cannot detect double bit errors and its software implementation is inefficient (xor-reducing a word is not very efficient on a microprocessor). However, it is very efficiently implementable in hardware for a serial transmission.
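As an illustration, a minimal VHDL sketch of a parity generator is given below. It computes an even parity bit over an 8-bit payload using the VHDL-2008 unary xor reduction; the entity and port names are illustrative.

library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical parity generator: even parity over an 8-bit payload (VHDL-2008).
entity parity_gen is
  port (
    payload : in  std_logic_vector(7 downto 0);
    parity  : out std_logic  -- '1' when the payload contains an odd number of '1's
  );
end entity parity_gen;

architecture rtl of parity_gen is
begin
  -- unary xor reduction: equivalent to xoring the payload bit by bit
  parity <= xor payload;
end architecture rtl;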

3.2.2 Checksums

The basic idea behind checksums is to use modular arithmetic in order to compress data. One very basic (and not very high-performance) example: when N words are transmitted, all N words are added up as (∑_{n=0}^{N−1} w(n)) mod 2^L, where w(n) is the nth word and L is the length in bits of one word. Finally, the N words plus their modular sum are transmitted. By using this kind of scheme, at least single bit errors can be detected, and the algorithms are efficiently implementable on microcontrollers as they operate on words. One major weakness of such a simple checksum algorithm is that if the word order is changed, the checksum remains the same. Another weakness is that if the word width is small compared to the length of the transmission, a collision can be quite likely, meaning that an erroneous data package generates the same checksum as its correct version. There are many different variations of how to calculate the sum in order to increase the efficiency of the algorithm, like using one's complement instead of two's complement. In the following section a specific example of a checksum that is actually used in real systems is presented.

Fletcher's Checksum

A simplified version of the checksum algorithms presented in [17] can be described as follows. First, let L be the word width in bits and w(n) the nth element of the set of N words to be sent. Then

c0(n) = (c0(n − 1) + w(n)) mod (2^L − 1)    (3.1)
c1(n) = (c1(n − 1) + c0(n)) mod (2^L − 1)    (3.2)

where c0(k) and c1(k) are equal to zero for all k < 0. Fletcher's checksum can then be calculated

by concatenating the two sums c1(N) and c0(N). The choice of y in the calculation x mod y is arbitrary; y could be set to a different value than (2^L − 1). Furthermore, an initial value could be set for the sum as well. An example of a checksum algorithm which does both things is Adler32. Adler32 sets a different initial condition on the sum and uses another modulus than Fletcher's algorithm. While there are arguments for taking another modulus, namely the largest prime number smaller than (2^L − 1), to increase the detectability of certain errors, it reduces the number of possible checksums, so that Adler's checksum starts performing worse than Fletcher's checksum when L increases. [18] A key observation about taking (2^L − 1) to calculate the modulo is that the summation actually is a one's complement summation, and one's complement arithmetic can be emulated quite efficiently in two's complement arithmetic (the kind of arithmetic present in the vast majority of recent systems). However, one's complement also means that there are two zeros: one encoded with all zeros and one encoded with all ones. This is important to note as the algorithm is not capable of distinguishing a block of all zeros from a block of all ones. [17] There are different variations of Fletcher's checksum such as Fletcher16, Fletcher32 and Fletcher64, where the numbers signify the bit width of the checksum. While with Fletcher16 there still is some risk of collision, this risk becomes much smaller for 32 and 64 bits. There are two important remarks to make: first, the endianness of the system needs to be respected if the word width of the system is different from half the width of the checksum. Second, the accumulation of the sums has to be done in a data type bigger than L in order to correctly perform the modulo operation. Yet it is possible to first sum up several terms and to perform the modulo operation only after having calculated the sum, in order to gain performance, if it can be assured that the accumulator data type will not overflow. [17]
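As a sketch of how Eqs. (3.1) and (3.2) translate into code, the following VHDL function computes a Fletcher-style 16-bit checksum (L = 8, modulus 2^8 − 1 = 255) over an array of bytes. It is written as a plain behavioural function, e.g. for use in a testbench or reference model; the package, type and function names are illustrative.

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical helper package: Fletcher checksum over a byte array,
-- following Eqs. (3.1) and (3.2) with L = 8 and modulus 2^8 - 1 = 255.
package fletcher_pkg is
  type byte_array_t is array (natural range <>) of std_logic_vector(7 downto 0);
  function fletcher16(words : byte_array_t) return std_logic_vector;
end package fletcher_pkg;

package body fletcher_pkg is
  function fletcher16(words : byte_array_t) return std_logic_vector is
    variable c0, c1 : natural := 0;
  begin
    for n in words'range loop
      c0 := (c0 + to_integer(unsigned(words(n)))) mod 255;  -- Eq. (3.1)
      c1 := (c1 + c0) mod 255;                              -- Eq. (3.2)
    end loop;
    -- the checksum is the concatenation of the two running sums c1 and c0
    return std_logic_vector(to_unsigned(c1, 8)) & std_logic_vector(to_unsigned(c0, 8));
  end function fletcher16;
end package body fletcher_pkg;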

3.2.3 Cyclic Redundancy Checks

Cyclic redundancy checks (CRC) add redundancy to a data set which has to be transmitted by performing polynomial long division of the data to be transmitted by a predefined generator polynomial and appending the remainder of the division to the data. The data is interpreted as a polynomial in the following way: if there are N bits and the bits are enumerated from N − 1 down to 0, the corresponding polynomial is ∑_{n=0}^{N−1} bit(n) · x^n. The receiver will then perform the same polynomial long division with the same predefined generator polynomial, but on the data plus the remainder appended by the transmitter. If this polynomial long division leaves a remainder of zero, then no transmission error occurred; otherwise, there was an error. The vocabulary of transmitter and receiver indicates where this method is used the most: in serial communication, as the polynomial long division can be implemented very efficiently if the data is received bit after bit. [19] CRC is efficiently implementable in hardware, especially if the communication is serial. It essentially consists of a linear feedback shift register (LFSR) whose feedback taps implement the generator polynomial, shift registers for data and result, some xors and some more logic. The choice of the generator polynomial is crucial for the performance of the CRC. Different standards define different generator polynomials to be used. For example, Bluetooth uses the polynomial x^5 + x^4 + x^2 + 1. CRCs are actually tightly coupled with a certain class of Hamming codes, but this link is out of scope for this report. Hamming codes will be presented in more detail in the next section. [19][20] Overall, CRC is a very powerful and efficient tool for error detection, because it detects many different kinds of errors with little hardware overhead. However, it does not provide any means for error correction and it is not very efficiently implementable on microprocessors. Nevertheless, it is still widely used even if the algorithm has to be implemented on a microprocessor.
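A minimal VHDL sketch of such a serial LFSR-based CRC engine is given below, using the Bluetooth generator polynomial x^5 + x^4 + x^2 + 1 mentioned above as an example. Entity and port names are illustrative, and the initialisation value and bit ordering conventions of the real protocol are not modelled.

library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical serial CRC engine for the generator polynomial x^5 + x^4 + x^2 + 1:
-- one payload bit is shifted in per clock cycle while d_valid is asserted.
entity crc5_serial is
  port (
    clk, rst : in  std_logic;
    d_in     : in  std_logic;
    d_valid  : in  std_logic;
    crc      : out std_logic_vector(4 downto 0)  -- remainder after the last payload bit
  );
end entity crc5_serial;

architecture rtl of crc5_serial is
  signal r : std_logic_vector(4 downto 0) := (others => '0');
begin
  process (clk)
    variable fb : std_logic;
  begin
    if rising_edge(clk) then
      if rst = '1' then
        r <= (others => '0');
      elsif d_valid = '1' then
        fb   := d_in xor r(4);
        r(4) <= r(3) xor fb;  -- x^4 term of the generator polynomial
        r(3) <= r(2);
        r(2) <= r(1) xor fb;  -- x^2 term
        r(1) <= r(0);
        r(0) <= fb;           -- x^0 term
      end if;
    end if;
  end process;

  crc <= r;
end architecture rtl;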

3.2.4 Error-Correcting Codes

There are different kinds of ECC algorithms. However, only the Hamming code will be presented here, as the others are not used for storage but rather in telecommunication. The basic idea behind Hamming codes is to increase the Hamming distance between the different words. The Hamming distance is a measure of how many bits need to be flipped in order to get from one word to the other. Example: the bit strings "11001" and "10101" have a Hamming distance of 2, as two bits differ between the two words. For Hamming encoding, the N bit data word is extended with M parity bits, which are placed in the extended word (data bits + parity bits) so that every bit position that is a power of two is a parity bit. All the bits are then enumerated in binary, with the enumeration starting at one. As the parity bits are all at positions whose index is a power of two, each of these indexes has exactly one one and otherwise all zeros in its binary representation. Each parity bit is then calculated by xoring together all data bits whose index (in the extended word) contains a one at the same position as the index of the corresponding parity bit (again the index in the extended word). This algorithm ensures that all different words have a Hamming distance of 3. This is also sometimes called Hamming-3. [21] Hamming decoding (in the case of hard decisions, as no soft decisions are possible in a purely digital system) works as follows: first, the extended word is read and the parity bits are recalculated from the data bits contained in the extended word. Then, two cases need to be distinguished. In the first case, the recalculated parity bits are the same as those stored in the extended word. In that case no bit error occurred. In the second case, the recalculated parity bits are not the same as those which were stored. For further investigation, two sub-cases need to be distinguished: either exactly one parity bit is different, or multiple parity bits are different. If only one parity bit is different, then this parity bit itself had a bit flip. If multiple parity checks fail, then the erroneous bit is the data bit whose index is formed by the addition of the indexes of the parity bits whose checks failed. [21] This simple Hamming code has the capability of detecting and correcting single bit errors. If there are two bit errors, the error will be detected but the correction will reconstruct wrong data bits. In order to be able to distinguish between correctable single bit errors and non-correctable double bit errors, an additional parity bit can be added, which is the parity bit over all data and parity bits. However, it only makes sense to add this additional parity bit when the number of bits of the used code is odd. The reason is that if the number of parity bits were even, the additional parity bit would not contain any additional information. The kind of code with a normal Hamming code plus an additional parity bit is called an extended Hamming code. It can do single error correction, double error detection (SECDED) if the previously mentioned condition is met. The basic idea of an SECDED decoder is to find the number of bit errors as follows: if the simple Hamming code does not find a problem, then no bit error happened (except possibly in the overall parity bit of the extended Hamming code).
If the simple Hamming code finds a bit error and the overall parity bit is coherent with the simple Hamming code, then a double bit error occurred. If the overall parity bit indicates an error as well, then there was a single bit error, which can be corrected. [22] One last detail that is left for discussion is how to construct a code for a fixed length, as M and N usually do not add up to a power of two, whereas in digital electronics word sizes nowadays usually are a power of two or at least a multiple of 8. The first step to construct a code with a fixed length is to find the next bigger complete code (complete meaning that all the bits can have an arbitrary value) into which the fixed length fits. The second step can then be done in different ways, but only the method used by [22] is presented: the extended code word's format is chosen in a way that either all the leading or all the trailing positions after the fixed code length are zero (the trailing positions in [22]'s case). As the complete code word was chosen so that the fixed length

code word fits into it, the trailing bits are all data bits which are set to zero. As they have a fixed value on the encoder side, they do not need to be transmitted, and the decoder can simply assume the same positions to be zero as well. The rest of the encoding and decoding works as if the code had no fixed size. [22]
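To make the position-based construction concrete, the following VHDL sketch shows an extended Hamming (8,4) encoder: the three Hamming(7,4) parity bits are placed at the power-of-two positions, and an overall parity bit is added for SECDED. The entity name and the packing of the code word into a vector are illustrative choices, not taken from the CROME IP cores.

library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical extended Hamming (8,4) SECDED encoder.
-- cw(1) to cw(7) hold the classic Hamming(7,4) code word (index = bit position),
-- cw(0) holds the overall parity bit of the extended code.
entity hamming84_encoder is
  port (
    d  : in  std_logic_vector(3 downto 0);  -- data bits at positions 3, 5, 6 and 7
    cw : out std_logic_vector(7 downto 0)
  );
end entity hamming84_encoder;

architecture rtl of hamming84_encoder is
  signal p1, p2, p4 : std_logic;
begin
  -- parity bit at position 1 covers positions 3, 5, 7 (indexes with bit 0 set)
  p1 <= d(0) xor d(1) xor d(3);
  -- parity bit at position 2 covers positions 3, 6, 7 (indexes with bit 1 set)
  p2 <= d(0) xor d(2) xor d(3);
  -- parity bit at position 4 covers positions 5, 6, 7 (indexes with bit 2 set)
  p4 <= d(1) xor d(2) xor d(3);

  cw(1) <= p1;
  cw(2) <= p2;
  cw(3) <= d(0);
  cw(4) <= p4;
  cw(5) <= d(1);
  cw(6) <= d(2);
  cw(7) <= d(3);
  -- overall parity over all data and parity bits (SECDED extension)
  cw(0) <= p1 xor p2 xor p4 xor d(0) xor d(1) xor d(2) xor d(3);
end architecture rtl;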

3.2.5 Cryptographic Hash Functions

Cryptographic hash functions are checksums designed with two objectives. First, if there is a very small change in the message of which the hash is computed, the output should be completely different. Second, knowing a message A with a certain hash, it should be almost impossible (for a good hash function) to construct a message B ≠ A with the same hash, where the difference between A and B can be chosen arbitrarily. Both properties are desirable for this project as they would allow for the detection of bit errors and other transmission errors, but the second property is not as important as the first one. Moreover, in the design of popular cryptographic hash functions like the Message Digest (MD) family or the Secure Hash Algorithm (SHA) family, the second objective is the key objective as it assures cryptographic security, which is valued over resource usage. [23]

3.3 Techniques to Increase Reliability and Availability of FPGA Designs

3.3.1 Taxonomy of FPGAs

While the basic functionality of all FPGAs is quite similar, the way this functionality is actually implemented in the chip can differ quite a lot. In [16], an overview of different kinds of FPGAs is given and their suitability for use in a radiating environment (a nuclear power plant in the case of the study) is assessed.

Anti-Fuse-Based FPGA

An anti-fuse-based FPGA can be programmed only once. However, once it is programmed, it will never lose its programmed functionality. Anti-fuse-based FPGAs need little power and are very fast. Their biggest advantage is that their programming is rad-hard, meaning that even a very big dose will not alter the functionality of the FPGA permanently. Furthermore, as soon as the device has power it has its functionality; there is no need to load a configuration bitstream. For a long time the density was excellent. However, this type of FPGA has fallen behind considerably as the anti-fuse technology was not able to scale as much as traditional CMOS or flash technology. [24][16]

SRAM-Based FPGA

SRAM-based FPGAs are the most widely used FPGA type nowadays. Their biggest advantage is that they can be reprogrammed indefinitely. However, this comes at the cost that in a radiating environment, the programmed functionality of the FPGA can be altered by the radiation. In order to protect the configuration from alteration due to radiation, most SRAM-based FPGAs incorporate a system which checks their configuration during runtime and corrects eventual errors. Furthermore, it is necessary to put a second chip beside the FPGA which contains the configuration bitstream of the FPGA and from which the FPGA loads it on start-up, because the FPGA loses its configuration every time there is no power. This process takes time and therefore the FPGA needs a few seconds

after power-up until it is fully functional. The power usage of SRAM-based FPGAs can vary greatly depending on the actual type; there are high performance SRAM-based FPGAs which use a lot of power as well as very low power models. Originally, their density was not excellent. However, the recent development of CMOS processes, which scaled more quickly than flash processes for example, means that the density of SRAM-based FPGAs nowadays is good. [16]

Flash-Based FPGA

Flash-based FPGAs are something in between SRAM-based and anti-fuse-based FPGAs. They do not lose their configuration at power down, but they can be reprogrammed several times (although not as many times as SRAM-based FPGAs). Due to their technology, they use very little power. Their density is very good, although flash technology did not scale as much as CMOS lately. Although it is possible that they lose their functionality if they are exposed to strong radiation for a long time, it takes a lot of energy to reprogram a flash-based FPGA, which makes them relatively rad-hard. Furthermore, it is possible to measure how much of their configuration they have already lost, as the flash cells change their charge and with that some electrical properties. As they do not lose their configuration at power-down, they are instantly on at power-on. [24][16]

3.3.2 Influence of Radiation in Deep Sub-Micron Silicon Devices

As structures continue to shrink into the deep sub-micron feature sizes of silicon devices, the influence of noise due to environmental radiation starts to have an increasing impact. The source of this kind of radiation can be natural cosmic radiation (then there are usually neutrons hitting the silicon device), the packaging material (usually alpha particle emitting packaging material) or just the natural environmental background radiation. While the alpha particles are less of an issue nowadays, as the packaging material is chosen so that it contains as few alpha particle emitting substances as possible and other alpha particles are easily shielded by the package, the biggest danger for malfunction is neutrons hitting the silicon. [24][25] A neutron hitting the silicon creates a trace of ionised particles, which can critically change the charge distribution in a depletion region and turn a transistor on/off which should be off/on. This process can physically damage the device, in which case the occurrence of such an event is called a hard-error. If only signals are altered but no permanent damage is done to the device, it is called a soft-error. There is nothing that can be done against hard-errors besides shielding and detecting them, but there are different techniques at our disposal to fight soft-errors. [24][25] First a little taxonomy of soft-errors is needed: the most frequent and treatable ones are single event transients (SET) and single event upsets (SEU). Single event means that one event source affects at most one memory cell/signal/other object. There are other kinds of events which affect multiple bits/signals, but the probability for them to happen is much lower than for a SEU/SET. Depending on the level of expected radiation and how critical the application is, being able to resolve a single SEU at a time is enough; if two SEUs happen at the same time, this should be detectable but does not necessarily need to be recoverable. As our devices are placed at sea level in a low radiation environment, being able to correct a single SEU and being able to detect two SEUs is enough. [25] In a design clocked below the Gigahertz range (the rule of thumb is 2 GHz for a 28 nm process), SETs do not significantly contribute to the failure rate compared to SEUs. The reason is that the duration of an SET is so short that the probability that the bad value is clocked into a register is very small. Furthermore, the SET has to occur exactly at the right place in the logic path so that it is possible to affect the signal at the entrance of the flip flop. As SETs are much less likely to

alter any internal state permanently than a SEU, they are much less critical than the latter. As the critical part of the design in the PL is clocked at 100 MHz, there is no need to use mitigation techniques which actively address SETs. Therefore, only storage elements need to be protected, while the logic itself does not need to be protected. [25][26]

3.3.3 Single Event Upset Mitigation Techniques

In this section, different techniques for SEU mitigation will be presented for all kinds of devices. However, the focus will be on FPGAs and more specifically on SRAM-based FPGAs.

SEU Mitigation on Configuration RAM

This section is solely applicable to SRAM-based FPGAs. The configuration of flash-based FPGAs is altered much more slowly (in the temporal sense) by radiation than the configuration of SRAM-based FPGAs. There are special techniques to monitor the state of the configuration of a flash-based FPGA, but they are out of scope for this thesis. [24] The configuration RAM (CRAM) contains all the configuration used for FPGA operation, from the direction and drive strength/pull-ups/etc. of input/output (IO) pads to how the routing elements inside the FPGA route data and even the content of the look-up tables (LUTs) used for the logic functions. It is not hard to see that maintaining a correct configuration inside the CRAM is crucial for correct operation of the device and for making sure that it is not damaged during operation (an input suddenly becoming an output can create a short circuit which can render a complete IO bank dysfunctional). While the CRAM in flash-based FPGAs consists of flash cells, which are very robust against radiation, the CRAM in SRAM-based FPGAs consists of SRAM cells, which are susceptible to SEUs. There are different systems implemented by the different SRAM-based FPGA vendors which constantly go through the CRAM, check the validity of the content and correct it if necessary. However, those systems usually need to be instantiated in the FPGA by the designer or at least need to be enabled. Needing to instantiate them at least partially inside the FPGA fabric leads to a kind of chicken/egg problem: how should the component which checks the CRAM be able to detect that there is a configuration error if the error affects the configuration which ensures the correct functionality of the component itself? There are some measures to address this with tools like built-in self test (BIST) or watchdog signals. However, usually external logic is needed as well to observe the watchdog signals. [24][26][27] Xilinx provides a soft-error mitigation (SEM) core which is targeted at protecting the CRAM from soft-errors. The CRAM is a memory array of SRAM cells. All those cells are protected with CRC for error detection and with ECC for error correction. The SEM core can even be configured in a way that it is able to reload a certain portion of the bitstream from an external storage element in case the CRC engine detects an error which is not correctable with ECC. In order to read back the configuration bitstream and to correct eventual errors, the SEM core needs to be connected to the ICAP interface. [27] According to Xilinx, there is a small number of memory bits in internal device registers and state elements that are not protected. Soft-errors occurring in these memory regions can result in a regional or even device-wide failure, which is referred to as a single-event functional interrupt (SEFI). Xilinx states that the number of memory bits which are SEFI critical is very small and therefore the frequency of such events is much smaller compared to soft-errors in other kinds of storage, including the rest of the CRAM. [27] Even though the Xilinx SEM core is definitely a feature to be implemented, there are some points left to discuss. First, what impact its usage together with the anti-tamper core has needs

to be checked. A first assumption would be that the anti-tamper core does not directly impact the performance of the SEM core, as it supposedly uses an AXI bus to scan for the relevant registers which control the PCAP and ICAP. However, this is just an educated guess as Xilinx provides very little information about this core. The second and more important point for discussion is the performance of the SEM core. According to [27], in the fastest configuration, which can only correct single bit errors and detect double bit errors in the CRAM, it takes the core 8 ms to scan the complete CRAM for bit errors. If a better error correction and even a classification of the error is desired, the scan time can go up to 27 ms for the complete device. Those numbers are only valid if the ICAP interface runs at its maximum clock speed, which is 100 MHz. If the device usage is not too high, it is possible to achieve this clock speed, but the achievable clock speed might also be lower. Furthermore, in the basic configuration it takes the SEM core 610 µs to correct a bit error at 100 MHz and 25 µs to detect a double bit error. This almost 9 ms period for single bit error correction in the CRAM in the best case might be too much, because one of the requirements of the system is to be able to sense pulsed radiation, which typically has a duration of only 10 ms. If the system is malfunctioning during 9 ms of these 10 ms, it might not be able to detect the radiation pulse. To overcome this, in Section 2.3, where the next iteration of the architecture was presented, causality checking was added as a task for the second processing unit (the SIL 3 microcontroller), precisely to allow the system that performs the causality checking to overwrite the output of the first system when it does not react quickly enough to a certain input, as might happen during a SEU in the CRAM. [27][9]

SEU Mitigation on Flipflops

To increase the reliability in a data flow application, all registers can be duplicated and their contents can be compared. If their contents differ, a SEU has occurred and an error signal can be raised, or anything else can be done to address the error. The downside is that there is no possibility of knowing the correct value of the erroneous register pair, and everything downstream from the SEU location needs to be reset for the time the error propagates. However, one can do better by triplicating the registers (triple modular redundancy (TMR)), because that way the correct value of the register can be found using a majority voter. This, however, comes at the cost of additional registers (3 times higher register usage), increased path delay due to the majority voter and increased usage of logic cells, also due to the majority voter. This form of TMR, where all the local registers are triplicated and armed with a majority voter, is called local TMR (LTMR). If the designer has control over the full HDL code, this is the most efficient way of implementing TMR. [28] If there is a complete block which should be protected (because it contains an IP core where the HDL code is not accessible, or if the functional safety strategy requires triplicating a complete portion of the design), block TMR (BTMR) can be used, where the logic and the registers are triplicated. While this can give a higher reliability of the design, it comes at the cost of a massively higher usage of logic. It is possible to apply floor-planning to the individual blocks of the BTMR in order to separate them physically and thereby increase reliability. While theoretically it is also possible to do this for LTMR, it would be a big effort for the designer to manually place every flipflop in a design, which normally is not done. Furthermore, in an FPGA device the flip flops are usually spaced far enough apart (even within one logic slice) to not both be affected at the same time by the same incoming particle. [28] Another critical point in every SEU hardened design are internal feedback paths, which can all be seen as a sort of finite state machine (FSM). The reason why feedback paths are critical is that they can get stuck in an illegal state which is entered by a SEU in the state register. This is especially critical with incomplete state encodings. There are two ways to remediate this: the first is to include a catch-all clause in the state machine's state transition process and to tell the synthesiser not to

optimise this clause away (in a perfect world with no SEUs, which are the conditions assumed by the synthesiser, the catch-all clause would never be entered as the FSM never gets into an illegal state, and therefore the catch-all clause is optimised away). By doing so, an error signal can be triggered when the FSM enters the catch-all clause and the FSM can be set to a known state. The downside of this approach is that the previous state is not restorable, which means that, depending on the rest of the system, the complete system is in a bad state and needs to be reset. Another approach is to equip the FSM's state register with ECC, usually a Hamming-3 encoding, which can correct single bit errors. This means that a SEU can alter one bit in the state encoding register and the Hamming-3 logic will detect this bad state and restore the previous legal state in the next clock cycle. The downside of this approach is that the FSM is in a bad state for one clock cycle. This needs to be carefully taken care of in all parts which depend directly on the state of the FSM. As an example: there are two states A and B, and there is a signal which is zero when the state is A, one when the state is B and two otherwise. Depending on the coding scheme used for the multiplexer and its actual implementation by the synthesiser, it is possible (for example with a one-hot encoding, when the state is B and the bit which encodes state A has a SEU and is set as well, an overall illegal state) that the output of the signal is zero for one clock cycle, because the A state bit has a higher priority than the B state in the multiplexer. This might become a problem if the signal is clocked into a flip flop and then used further. There are pros and cons for both techniques, but applied properly, and with the rest of the system around the FSM designed accordingly, both approaches are viable. Depending on how the ECC is applied, it might be that no error is signalled and the FSM is only put back into a legal state. In order to detect the rogue condition and to take countermeasures, a combination of the two approaches can be used: the ECC is responsible for putting the FSM back into the last state when a bad state is recognised, and the catch-all clause can be used to generate signals to detect that such a condition happened. [28][29]
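As a sketch of the LTMR idea discussed above, the following VHDL entity triplicates a single flip-flop and votes on the three copies. The dont_touch attribute shown is the Vivado way of keeping the synthesiser from merging the redundant registers; other tools use different attributes, and all names here are illustrative.

library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical LTMR register: one data bit stored in three flip-flops with a majority voter.
entity tmr_reg is
  port (
    clk : in  std_logic;
    d   : in  std_logic;
    q   : out std_logic
  );
end entity tmr_reg;

architecture rtl of tmr_reg is
  signal r0, r1, r2 : std_logic;
  -- tool-specific attribute (Vivado-style) to keep the three equivalent registers
  attribute dont_touch : string;
  attribute dont_touch of r0, r1, r2 : signal is "true";
begin
  process (clk)
  begin
    if rising_edge(clk) then
      r0 <= d;
      r1 <= d;
      r2 <= d;
    end if;
  end process;

  -- 2-out-of-3 majority voter
  q <= (r0 and r1) or (r1 and r2) or (r0 and r2);
end architecture rtl;

The second sketch shows the catch-all approach for FSMs: a tiny two-state machine (VHDL-2008) with an explicit, deliberately incomplete encoding, an error flag driven from the "when others" branch, and a tool-specific (here Vivado-style) attribute to stop the synthesiser from re-encoding the states and optimising the branch away. Again, all names are illustrative.

library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical SEU-aware FSM with a catch-all clause.
entity safe_fsm is
  port (
    clk, rst : in  std_logic;
    start    : in  std_logic;
    busy     : out std_logic;
    fsm_err  : out std_logic  -- asserted when an illegal state was entered
  );
end entity safe_fsm;

architecture rtl of safe_fsm is
  signal state : std_logic_vector(1 downto 0);
  constant S_IDLE : std_logic_vector(1 downto 0) := "01";
  constant S_RUN  : std_logic_vector(1 downto 0) := "10";
  -- keep the user encoding so that the "when others" branch is not optimised away
  attribute fsm_encoding : string;
  attribute fsm_encoding of state : signal is "none";
begin
  process (clk)
  begin
    if rising_edge(clk) then
      fsm_err <= '0';
      if rst = '1' then
        state <= S_IDLE;
      else
        case state is
          when S_IDLE =>
            if start = '1' then
              state <= S_RUN;
            end if;
          when S_RUN =>
            state <= S_IDLE;
          when others =>
            -- reached only after a SEU corrupted the state register
            state   <= S_IDLE;
            fsm_err <= '1';
        end case;
      end if;
    end if;
  end process;

  busy <= '1' when state = S_RUN else '0';
end architecture rtl;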

SEU Mitigation on Block RAM

A final point to secure are large random access storages like BRAMs. Rad-hardening them by applying TMR would be extremely wasteful. Therefore, another technique is used which requires a bit more logic overhead but significantly reduces the storage overhead. The technique is again ECC, using algorithms like Hamming codes. The reason why it is efficient to use ECC on BRAM is that only one single element inside the storage is accessed at a time; therefore, the logic to perform ECC only needs to be present once in the system. Finally, it is important to apply proper layout techniques to BRAM cells because they are extremely small and dense, and therefore a single particle travelling through them could easily affect multiple bits. However, if the different affected bits belong to different words inside the BRAM, it is easier to correct the errors than if they all belonged to the same word. [28]

3.4 Overview of Verification Methodologies

Verification methodologies can roughly be separated into two main groups: simulation-based and formal verification methods. The names already hint at their most notable difference: while it is necessary to create stimuli for a simulation-based verification, a formal verification can be conducted without any stimuli.

Simulation-based Verification

To perform a simulation-based verification, a simulation engine is needed to actually run the simulation. This might sound obvious, but in electronic design the purchase of a simulator usually is a very costly investment (although one that every company in this business will have to make). As it is not possible to exhaustively verify the vast majority of real-life circuits, it is necessary to make a plan of what actually has to be verified in order to be confident that the design has been verified enough and well enough. Exhaustive verification means that every input combination and internal state is tested. Making the link to the development cycle presented in Section 3.1, prior to performing the actual verification, a verification plan based on the functional requirements of the component/design/system has to be elaborated. The verification plan defines tests and acceptance criteria depending on the functional requirements, which will then be implemented in the actual testbench. The decision to make is what exactly and how exactly to test certain requirements. It has to be decided how the stimuli are generated (whether they are directed stimuli or constrained random stimuli) and how the checking of the correctness of the results has to be performed. Furthermore, it can be decided which kinds of technologies and frameworks are used for different tasks, like the Universal Verification Methodology (UVM), the Open Source VHDL Verification Methodology (OSVVM), the Property Specification Language (PSL), etc. Finally, it has to be decided how the behaviour of the device under verification (DUV) is tested. The design can be tested against properties which are expressed in PSL. This approach is very well suited for verifying buses or FSMs, where properties of the behaviour can be described easily. PSL extends linear temporal logic (LTL) expressions with regular expressions, which makes it a powerful tool to describe properties and conditions of a logic system in both an immediate and a temporal manner (if a condition is true, then another condition has to be true at the same time, or if a condition is true, another condition has to become true within a certain time). [30] Another way to perform verification is to compare the behaviour of the DUV against a reference model. To do this, a reference model has to be coded as independently from the actual implementation as possible (in another language, or even by a completely different person to be SIL 2 certifiable). The black-box methodology only tests the signals at the interfaces of the DUV and the reference model. The advantage of this approach is that it is completely independent of the actual implementation of the DUV, e.g. it is easily possible to perform verification of both an RTL implementation and a gate-level implementation of the same design using the same testbench. The downside of this approach is that if a bug is found, further investigation has to be done to actually trace down the bug, as the testbench has no knowledge of the inner states of the DUV. Furthermore, it is even possible that a bug is not detected because it only affects an internal signal which is not propagated to the output. However, a proper verification plan should minimise the probability of this happening. The reverse of this approach is the white-box methodology. In this case, the testbench has full knowledge of and control over the internal signals and states of the DUV and checks them.
This minimises the probability of having an internal bug which remains undetected, and if a bug is detected, it can be located accurately. The actual checking is usually done through assertions (language native or PSL assertions). The downside of this approach is that the overall verification effort is bigger than for a black-box approach. Similar to the black-box approach, the signals have to be checked at the testbench level (through assertions), but the DUV also has to be equipped internally with additional verification statements. Compared to the black-box approach, the overall abstraction level of the verification is raised considerably when using a white-box approach, especially when PSL is used. Another disadvantage of a white-box approach is that the testbench usually is DUV-specific, as not only the interface signals are checked but also internal ones. Using both a white-box and a black-box method at the same time is called the grey-box method. While it adds up

the capabilities of both approaches, like the high level of abstraction of the white-box approach, it also comes with the constraints of both methods. Therefore, the grey-box testbench is DUV-specific in the same way as the white-box testbench. [30] The V-cycle presented in Section 3.1, coupled with the requirement of independence for SIL 2, requires the use of a black-box methodology for functional verification. However, it is possible to apply certain stimuli, which force the internal signals into a certain state, by only knowing the functional specifications. This idea can be used to test the behaviour of the core in corner cases which are only internal and are not really visible at the output as such. An example of this will be presented in Chapter 5 in the form of the test procedure called Test FP B4, which will be used to verify some floating point cores. [30][10]
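As a small example of the kind of temporal property mentioned above, the following PSL snippet (VHDL flavour, written as comment directives so that a PSL-aware simulator picks them up) states that whenever req is asserted, ack must follow within one to three clock cycles. The signal names, the clock expression and the bound are purely illustrative.

-- psl default clock is rising_edge(clk);
-- psl assert always (req -> next_e[1 to 3] (ack));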

3.4.1 Formal Verification

Formal verification is a static verification process which is completely independent of the possible inputs of a circuit. Therefore, it is possible to prove the correctness of a circuit for any possible inputs, something that is usually not practicable with a simulation-based approach. Formal verification techniques of all sorts basically need to solve a mathematical problem, for which some kind of mathematical prover engine is needed. It may have to prove the equivalence of two implementations of the same circuit (for example of the RTL model and its synthesised netlist), or that an implementation has certain properties, which are specified as properties in PSL or are inherent to a model. There are two main difficulties in formal verification: first, formal verification is tightly linked to theoretical informatics and mathematics, which makes it difficult, as sometimes a HDL model needs to be translated or recoded in a particular fashion in some specific language in order to make formal verification possible. The second difficulty is that the underlying mathematical problems, which have to be solved to perform formal verification, are extremely hard to solve (usually in non-polynomial time). However, recent improvements in both the computational power of modern processors as well as in the theoretical foundations, like compact representations of combinational circuits, have made formal verification more viable than it used to be. [30]

Formal Verification of Floating Point Units

There has been a lot of effort over a long time to formally verify floating point units in safety-critical systems such as space applications. Since the divider bug in Intel's Pentium processor in 1994, there have been similar efforts in the commercial sector, too. Most of the approaches use a theorem prover like HOL [31], PVS [32] or ACL2 [33] for mechanically proving properties of floating point operations and subsequently the correctness of the operation according to the standard. While many papers are available which show that it is possible to model the properties of floating point arithmetic in theorem provers and to prove their correctness, the main difficulty is to create a link between the theorem proving language and the actual HDL code. In order to properly verify a system, not only the theorem prover needs to be verified but also the tool chain which is used to create the link between, for example, PVS and the actual HDL code. There are approaches which translate PVS into Verilog [34][35], to mention just one example. As already stated, the PVS-to-Verilog compiler would have to be certified. Furthermore, coding efficient hardware structures in PVS is extremely difficult as PVS was never intended to be used as a HDL. The combinational behaviour can be expressed at a higher level in PVS, but explicitly inferring specific hardware structures is almost impossible (or only possible for the very specific cases the tool was built for). For the sake of completeness, some French universities and NASA use a system based on PVS [36][32], which is similar to the above mentioned approach.

Interestingly, early attempts to formally verify floating point cores at AMD worked in a similar way: they used ACL2 for verifying their floating point cores and built an ACL2-to-Verilog compiler through which they could get a synthesisable model of their mathematical description of the floating point core. As also stated in their papers, the correctness of this compiler is as critical as the rest of the tool chain used for proper verification. For a company like AMD it was possible to extensively test their compiler, but not every organisation has the means that AMD has. [37] One important problem is that it took those teams years to develop their tools. AMD (as well as [35]) made their tool chain publicly available through GitHub or similar services. The problem now was to learn to code something in ACL2 that would be translatable into Verilog, which was clearly not feasible in 6 months. However, a person from the original team at AMD moved to Intel (now, in 2017, he is at ARM), continued his work and still makes the code available on GitHub. Figure 3.2 shows a formal verification flow for floating point units as it was used at Intel Corporation. [33]

Figure 3.2: Formal verification flow for floating point units used at Intel Corporation

The main idea is to have a model of the floating point algorithms in both ACL2, which is used for formal verification, and in VHDL or Verilog, which can be used for synthesis and RTL optimisation. In contrast to AMD, where the ACL2 model was directly translated into Verilog, the two models are now built independently. In order to make the link between the two models, a new intermediary language called MASC (Modelling Algorithms in SystemC) [38] was created. MASC is a C-like language that has properties of both C and C++, augmented by SystemC data types. This allows for high level algorithmic modelling of some hardware processes. MASC can be seen as an intermediary language which serves as documentation for the algorithm design teams, RTL design teams, and verification and formal verification teams, because it is quite readable from the algorithmic point of view but hardware structures can still be identified in the code. There are two compilers for the MASC model: one that compiles it into ACL2, which can then be used for formal verification, and one that compiles it into SystemC. The SystemC representation can then be fed together with the actual HDL to the HECTOR tool from Synopsys, which then performs equivalence checking between the two models. This is how the link between provable ACL2 code and synthesisable HDL code is made. [33] Even though this approach seems very promising, it was too much to be done in the 6 months available for this thesis. Even though a lot of the code necessary for proving the correctness of the model in ACL2 is publicly available, some user interaction is still needed to actually perform the task. ACL2 is a Lisp based language, i.e. functional programming, and that is something that cannot be learned

very quickly. Furthermore, MASC would have to be learned in order to create the "reference design". And finally, the HECTOR tool from Synopsys was not available and its purchase would have been very costly; moreover, as it is a very powerful tool, it would have taken a considerable amount of time to learn how to use it. Therefore, a more traditional, i.e. simulation-based, verification approach was chosen for the floating point cores.

3.4.2 Open Source VHDL Verification Environment Methodology

OSVVM was introduced to equip VHDL with verification capabilities similar to those of UVM and SystemVerilog. While UVM usually requires a proprietary constraint solver, so that the developed code might not be very portable, OSVVM is completely open source and works with most modern HDL simulators (at least the ones which support the latest VHDL standard). [39][40] The main methodology brought to VHDL by OSVVM is functional coverage. A traditional way to measure the coverage of a verification is to measure the code coverage. Code coverage records which portions of the code have been executed during simulation and which parts have not. Although this provides interesting information, it has to be treated carefully. The code coverage will show whether there is code in the DUV that has not been executed. However, code coverage cannot be used to measure the level of verification of the functionality of a component. For example, suppose two portions of the code have to be active simultaneously to trigger a bug. If both portions are exercised independently, they are shown as covered in the code coverage, but they were never active at the same time and the bug therefore remains undiscovered despite a possible 100% code coverage. To overcome this limitation, the concept of functional coverage was introduced. [40]

Functional Coverage

Functional coverage is a metric to measure how much of a design's functionality has been exercised by a certain testbench. The functional coverage model of a design has to be defined by a verification engineer. Having 100% functional coverage does not necessarily mean that the design has been fully tested. It only means that all the functionalities defined in the functional coverage model have been tested. Therefore, it is crucial to carefully define the functional coverage model. [40] While UVM can leverage the object-oriented programming (OOP) capabilities of SystemVerilog, OSVVM has to be purely procedural as VHDL has no OOP capabilities. OSVVM's alternative is to use VHDL's protected types together with the corresponding functions and procedures to emulate OOP in a natural way. Usually the protected types should not be modified directly by the designer but only through the provided functions and procedures. [40][41] A functional coverage model consists of a set of coverage points, which can be seen as checkpoints which have to be reached (usually they have to be reached multiple times). Each coverage point (also called bin) can be represented by an integer or an integer vector that usually has a length of two in OSVVM. An integer is used when the coverage bin contains only one single value and an integer vector is used for a range of values being represented by one bin. This integer/integer vector is encapsulated into a protected type (the actual bin object) together with other information like the number of times this coverage point has to be "hit" or which weight it has (the weight is equivalent to the probability of being picked at random among all the other bins). The choice of using integers comes with one important constraint: VHDL only has an integer width of 32 bits and the bit string "10..00" is excluded from the integer range. This means that values longer than 32 bits or the bit string "10..00" (if the bit string is 32 bits long) cannot be represented easily as a coverage bin in OSVVM. This is especially cumbersome as the bit string "10..00" of length 32 often is a corner case which has to be tested (when the word width of the DUV is 32 bits). Leaving these

constraints aside, OSVVM provides a powerful yet simple-to-use tool to model functional coverage and to update it during simulation. [40][42]

Intelligent Coverage

The intelligent coverage feature of OSVVM helps speed up simulation (five times faster or more than constrained random testing) by returning a random coverage hole from the overall coverage structure whenever a random coverage point is requested. A coverage hole is a coverage point of the coverage structure that has not yet been covered. Hence, the way intelligent coverage speeds up the simulation is not by being particularly fast but by avoiding the generation of useless stimuli which do not improve the actual functional coverage. [40]
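As an illustration of functional coverage bins and the intelligent coverage feature, the following minimal sketch builds a small coverage model with OSVVM's CoveragePkg and uses intelligent coverage to close it. The coverage model itself (three bins over the exponent field of a binary32 operand) and all entity and variable names are made up for this example; the method names (GenBin, AddBins, RandCovPoint, ICover, IsCovered, WriteBin) are the ones documented for CoveragePkg, but their exact signatures may vary between OSVVM releases.

library ieee;
use ieee.std_logic_1164.all;

library osvvm;
use osvvm.CoveragePkg.all;

entity cov_sketch_tb is
end entity cov_sketch_tb;

architecture sim of cov_sketch_tb is
begin
  stim : process
    variable ExpCov : CovPType;  -- coverage object (a VHDL protected type)
    variable ExpVal : integer;
  begin
    -- Hypothetical coverage model: one bin per exponent class of a binary32
    -- operand (0 = zero/subnormal, 1 to 254 = normal, 255 = infinity/NaN).
    ExpCov.AddBins(GenBin(0));
    ExpCov.AddBins(GenBin(1, 254, 1));  -- a single bin spanning the whole range
    ExpCov.AddBins(GenBin(255));

    while not ExpCov.IsCovered loop
      -- Intelligent coverage: RandCovPoint only returns coverage holes, so no
      -- stimulus is spent on functionality that is already covered.
      ExpVal := ExpCov.RandCovPoint;
      -- ... drive the DUV with an operand whose exponent field equals ExpVal ...
      ExpCov.ICover(ExpVal);            -- record the coverage point as hit
    end loop;

    ExpCov.WriteBin;                    -- print the coverage report
    wait;
  end process stim;
end architecture sim;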

Random Types

Another field where OSVVM significantly increases the capabilities of VHDL is random types. Traditional VHDL can only generate random data for a few data types and with only one uniform distribution. Furthermore, this can only be done with a procedure, which prevents the use of random generation in expressions. OSVVM enables the generation of random data for a wider variety of types and also gives control over the distribution of the random data. The available distributions in OSVVM are the uniform, normal and Poisson distributions. Furthermore, the uniform distribution can be specialised to favour small or large numbers. [40] While pure VHDL allows the generation of random data for the data types real and integer, OSVVM also enables the generation of random data for the types std_logic_vector, signed and unsigned. This is an extremely useful feature as these are the types that are usually used for RTL design. This looks really nice on paper, but it has one important limitation: it generates random data with a width of at most 31 bits. Sometimes, 32 bits are possible as well, but only in the case that the random data has to be within a certain range. The reason for this limitation is that internally the integer type is used, which only has 32 bits and which cannot represent the string "one followed by all zeros" (the most negative signed number in two's complement). Due to this limitation, only 31 truly random bits can be generated at once. In order to obtain longer random bit strings, some wrapper functions have to be written, which internally use several calls to the OSVVM random functions; a sketch of such a wrapper is shown below. [40][42]
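As a sketch of such a wrapper, the process below assembles a 64-bit random vector from four independent 16-bit RandInt draws; the 16-bit slice width and all names are arbitrary choices made for this illustration and are not part of OSVVM itself.

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

library osvvm;
use osvvm.RandomPkg.all;

entity rand_sketch_tb is
end entity rand_sketch_tb;

architecture sim of rand_sketch_tb is
begin
  stim : process
    variable RV   : RandomPType;
    variable word : std_logic_vector(63 downto 0);
  begin
    RV.InitSeed(RV'instance_name);
    -- Work around the 31-bit limitation: build the 64-bit word from four
    -- 16-bit draws, each of which stays well inside the integer range.
    for i in 0 to 3 loop
      word(16*i + 15 downto 16*i) :=
        std_logic_vector(to_unsigned(RV.RandInt(0, 2**16 - 1), 16));
    end loop;
    -- ... apply 'word' to the DUV here ...
    report "random 64-bit word generated";
    wait;
  end process stim;
end architecture sim;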

3.4.3 SystemVerilog Direct Programming Interface

The SystemVerilog Direct Programming Interface (DPI) is defined in [41] as part of the SystemVerilog standard and specifies an interface between SystemVerilog and foreign languages such as C and C++. Technically it is irrelevant which foreign language is used as long as it provides calling and linking interfaces compatible with the ones of C or C++. Therefore, instead of C, Python could be used through its C bindings to interface with SystemVerilog, which would allow a much higher level of abstraction in programming than pure C. [41] There exist other interfaces between HDLs and foreign languages such as PLI and VPI for Verilog and FLI and VHPI for VHDL, but they all either have a significant overhead, are difficult to use or are tool specific. For this reason DPI was chosen to interface HDL with a foreign language as the interface is easy to use and creates little overhead during simulation. [41][42] SystemVerilog DPI is only an interface and uses a strict black-box approach, meaning that the actual implementation of the interfaced functions is not visible to the other side. There are two abstraction layers to achieve this, one for interfacing from the foreign language to SystemVerilog and one for interfacing from SystemVerilog to the foreign language. Both abstraction layers can exist

independently, i.e. if only one-way communication is necessary, there is no need to implement the other direction of communication, in contrast to other interfaces. The language that wants to call a function from the other language assumes that the other language is loaded through dynamic linkage. Therefore, a task in SystemVerilog can be exported to C and the C program can link dynamically to the task which exists within the simulator executing the SystemVerilog code. Similarly, a simulator executing SystemVerilog code can be started and linked dynamically against a library from which SystemVerilog can call a function implemented in C. In order to link correctly, the signature of the foreign function/task has to be known by the other side, i.e. the return type, the name and the arguments of a function have to be exported to the other language. Furthermore, it has to be defined whether the function/task is pure or context dependent. Pure means, in a simplified manner, that it returns in the same delta cycle (needs no physical time to execute in the simulation), is non-void, returns the same result for the same arguments at any time (has no state) and does not interact with other functions, e.g. if a function is pure and its result is not used anywhere it can safely be deleted. In contrast, a context function/task can take simulation time to execute, can give different results when being called at different times with the same arguments and can interact with other parts of the system such as global data, sockets or files. There is no problem when a pure function is defined as a context function, but the opposite can give incorrect simulation results. The pure attribute can be used by the SystemVerilog compiler to optimise more aggressively than with a context function. [41] In the course of this thesis only C functions were made callable from SystemVerilog and were called from there. The usage and definition of DPI in SystemVerilog is simple, but the creation of a functional dynamically linked library can pose more problems. First, there are different ways in which a dynamically linked library can be linked at runtime depending on the operating system. While there is only one way in Linux, Windows actually has at least two mechanisms by which a dynamically linked library can be loaded, and the library has to be built in such a way that it employs the mechanism which is expected by the simulator. Second, it has to be carefully considered whether the operating system is a 32 bit or 64 bit operating system and whether the simulator is a 32 bit or 64 bit executable, because the compiler and the linker have to produce a library which is compatible with both the operating system and the simulator. While the operating system can have the same or a bigger bit width than the simulator, the dynamically linked library needs to have the same bit width as the simulator because otherwise the linking cannot be done correctly at runtime. [41][43] When the proper toolchain for compilation and linking is installed and a working Makefile has been created (i.e. the above-described problems have been resolved), SystemVerilog DPI is actually simple to use and any changes on both sides of the interface can be performed very efficiently, which allows fast development, debugging and verification cycles.

4 Securing PS/PL Data Transfer

Figure 4.1 shows the current state of the connection between the PS and PL. The ARM cores in the PS act as AXI masters and can read and write the data inside a BRAM cell in the PL through an AXI BRAM controller. The data written to the BRAM by the ARM cores is used as configuration data in the PL and originates from the supervision, which sets the device parameters through an Ethernet link. The data read from the BRAM are the actual measurement results, which are stored or sent to the supervision by the PS. In order to signal to the PL that new configuration data is available, the PS can use a dedicated signalling line. An FSM inside the PL will receive the signal from the PS and will wait until the current cycle (at most 100 ms) is finished before it updates the configuration registers with the new configuration data from the PS. The reason why it has to wait is to ensure that the configuration during the measurement cycle is consistent, because otherwise the data could be copied from the BRAM into the configuration registers while the registers are being read and used for computations. In a similar fashion, the FSM inside the PL copies the data from some result registers into the BRAM when the measurement cycle is finished and then signals to the PS through an interrupt line that new measurement data is available.

Figure 4.1: Block diagram of the current link between the PS and the PL

The link between PS and PL is crucial for the correct operation of CROME, as the configuration data for the logic inside the PL is sent through this link and incorrect configuration data can render the complete device dysfunctional. In order to assure correct operation of the PL even when transmission errors occur or the PS is sending garbage data, an appropriate methodology has to be developed. Furthermore, while any communication with the supervision is secured using CRC, the PS currently has no means to verify whether the data sent by the PL is valid or not. Therefore, both directions of the PS-PL link have to be secured.

Another problem is that the configuration registers may need to be updated only a few times per year, as the supervision is expected to change the configuration parameters very rarely. This makes these registers particularly exposed to SEUs, as their content is held unchanged for very long periods. Therefore, another methodology has to be developed in order to be able to continue operating without external interaction even in the occurrence of SEUs.

4.1 Adaptation of the FPGA Firmware

The main focus of the new core, besides reliability, was ease of integration. The new system should provide and require only the same signals as the previous version of the core. Besides those required signals, it could provide additional interfaces, which do not need to be connected for now but can provide additional functionality later in the development process. First, the design decision of how to secure the content of the configuration registers had to be made. Using TMR with a majority voter would usually be the first choice; however, this is not a viable solution in this case as the registers are too numerous and implementing it would use too many FPGA resources. ECC would be another option, but applying it to such a wide register bank would require too many resources as well. As not only error detection but also error correction is required, all of the other methodologies previously discussed in this chapter are not applicable. A possible solution for these shortcomings is found as follows: instead of assuring the validity of the content of the configuration registers at all times, the requirement can be lowered to allow erroneous content in the registers for a short period of time, which has to be detected and corrected. By allowing this, it is possible to store a "golden" set of configuration data inside an ECC protected BRAM module, which can be considered as valid at all times as SEUs are detected (double bit errors) and corrected (single bit errors). The content of the configuration registers is then constantly compared against the "golden" version of the configuration data inside the ECC BRAM. When there is any difference, the content of the configuration register is overwritten by the content of the ECC BRAM and an error flag is raised for the rest of the system to indicate that the configuration data was bad. It is then up to the rest of the system to determine whether the error is critical (i.e. whether it occurred while the content of the configuration register was being read) or not, as the time it takes to check all the configuration registers once is known. Something this system does not do is to check the validity of the configuration data sent by the PS. To add this, Fletcher's algorithm was implemented in VHDL to check the checksum which is added by the PS. If the received data does not correspond to the checksum, the received data is discarded, an error flag is raised to indicate that bad configuration data was received and the system continues to work with the old configuration data. The choice of Fletcher's algorithm was made for the following reasons: repetition codes and parity bits were too inefficient in both hardware and software implementation, performed badly in error detection and used too much bandwidth. Although cryptographic hash functions would offer good error detection and are efficiently implementable in software, their hardware implementation would use too many resources. ECC is quite efficiently implementable in hardware, however it is very inefficient to implement in software. So only CRC and Fletcher's sum were left. CRC offers better error detection capabilities than Fletcher's sum, but it is less efficiently implementable in software than Fletcher's algorithm. The logic used to implement both algorithms in hardware is of the same order of magnitude (the CRC needs slightly fewer logic resources).
One difference is that Fletcher's sum runs much quicker than the CRC calculation, as the duration of the CRC calculation scales with the word width while that of Fletcher's sum does not. The reason why Fletcher's algorithm is nevertheless chosen over CRC is the type of resources the

implementations use: while the CRC uses a shift register with XORs and some more logic, Fletcher's algorithm uses arithmetic operations (adders). In the overall system, much more logic is used than DSP slices, and the adders plus the corresponding registers can be put into DSP slices, of which the overall FPGA device has plenty. In a similar fashion the validity of the data sent back to the PS can be guaranteed: first, the results are moved from their respective registers into an ECC BRAM. Second, their Fletcher checksum is calculated and appended to the results inside the ECC BRAM. And finally, this data can be sent back to the PS, which can then check its validity by calculating the checksum itself and request retransmission if desired.
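To make the hardware trade-off concrete, a behavioural sketch of a word-serial Fletcher-32 accumulator is given below: it only needs two adders and two modulo-65535 reductions per input word, which maps well onto DSP slices. The entity name, the port list and the convention of absorbing one 32-bit word (lower half first) per enabled clock cycle are assumptions made for this illustration; the core actually implemented for CROME may differ in its interface and internal structure.

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity fletcher32_acc is
  port (
    clk      : in  std_logic;
    rst      : in  std_logic;                      -- synchronous reset, clears the sums
    en       : in  std_logic;                      -- absorb 'data' when high
    data     : in  std_logic_vector(31 downto 0);  -- one 32-bit input word
    checksum : out std_logic_vector(31 downto 0)   -- sum2 (upper 16 bits) & sum1 (lower 16 bits)
  );
end entity fletcher32_acc;

architecture rtl of fletcher32_acc is
  signal sum1, sum2 : unsigned(31 downto 0) := (others => '0');
begin
  process (clk)
    variable s1, s2 : unsigned(31 downto 0);
  begin
    if rising_edge(clk) then
      if rst = '1' then
        sum1 <= (others => '0');
        sum2 <= (others => '0');
      elsif en = '1' then
        -- Accumulate the lower, then the upper 16-bit half of the input word.
        s1 := (sum1 + unsigned(data(15 downto 0))) mod 65535;
        s2 := (sum2 + s1) mod 65535;
        s1 := (s1 + unsigned(data(31 downto 16))) mod 65535;
        s2 := (s2 + s1) mod 65535;
        sum1 <= s1;
        sum2 <= s2;
      end if;
    end if;
  end process;

  checksum <= std_logic_vector(sum2(15 downto 0)) & std_logic_vector(sum1(15 downto 0));
end architecture rtl;

In an actual implementation the modulo-65535 reductions written behaviourally here would typically be realised with end-around-carry additions, so that only adders (and therefore DSP slices) are inferred.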

4.1.1 Formalisation of the Functional Requirements

The requirements for the PS to PL communication can be formalised as follows:

• Req Func PS2PL 0 The PS shall signal to the PL that a new configuration set is available by toggling the value of one single bit.

• Req Func PS2PL 1 The PL shall be able to verify the validity of a data set provided by the PS.

• Req Func PS2PL 2 If a given data set is validated, it shall be loaded before the following cycle. If it could not be validated, the configuration set shall be discarded and an error flag shall be raised.

• Req Func PS2PL 3 The PL shall be able to detect and correct a single bit SEU and signal if configuration data was erroneous at any time due to an SEU that affected multiple bits.

• Req Func PS2PL 4 The memory mapping of the configuration set inside the BRAM shall conform to the memory mapping shown in Table 4.1, where one address cell is 64 bits wide and there are N − 2 parameters in the system. Depends on Req Func PS2PL 2 (Fletcher checksum added) and Req Func PL2PS 1 (Configuration Serial added).

Table 4.1: Memory mapping of the configuration parameters

Address Offset: | 0           | ... | N − 3           | N − 2                | N − 1
Configuration:  | Parameter 1 | ... | Parameter N − 2 | Configuration Serial | Fletcher Checksum

The requirements for the PL to PS communication can be formalised as follows:

• Req Func PL2PS 0 The PL shall signal to the PS the availability of a new result in the BRAM by toggling an interrupt line (the interrupts in the PS are edge sensitive).

• Req Func PL2PS 1 The PS shall be able to know with which configuration set a certain result set has been calculated.

• Req Func PL2PS 2 The PS shall be able to verify if there were transmission errors during the transfer from the PL.

• Req Func PL2PS 3 The memory mapping of the results inside the BRAM shall conform to the memory mapping shown in Table 4.2, where one address cell is 64 bits wide and there are M − 2 results in the system. Depends on Req Func PL2PS 1 (Configuration Serial added) and Req Func PL2PS 2 (Fletcher checksum added).

Table 4.2: Memory mapping of the results

Address Offset: | 0        | ... | M − 3        | M − 2                | M − 1
Results:        | Result 1 | ... | Result M − 2 | Configuration Serial | Fletcher Checksum

4.1.2 Functional Description

Figure 4.2 shows a high level block diagram of the proposed solution for securing the PS/PL transfer. Compared to the current link given in Figure 4.1, there are now three BRAM instances. There has to be one BRAM instance for the parameters and one for the results. The reasons for adding a third BRAM instance are twofold: first, the AXI BRAM controller from Xilinx is not able to write into ECC BRAM from within Linux, and second (and more importantly), it is not desirable that the Linux system could possibly overwrite the "golden" configuration data by mistake. Therefore, a third non-ECC BRAM instance is added as a buffer which the Linux system can write to and read from; if Linux performs a bad operation, it will not have an impact on the behaviour of the PL.

Figure 4.2: Block diagram of the secured link between the PS and the PL

The behaviour of the system can be described with one FSM as shown in Figure 4.3. The diagram is simplified and should be read together with the detailed explanation of the states and their behaviour. The numbers in the states in the diagram correspond to the numbers of the detailed state explanations. The state transitions are assumed to happen when all the state tasks have completed (more than one clock cycle may be required) or when the transition condition is true, if there is any in the diagram. To save space, the word "move" was shortened to "mv" and "configuration" was shortened to "conf.".

Figure 4.3: State diagram of the FSM for securing the PS/PL transfer

1. Scrub Job This is the default and reset state. In this state the system constantly compares the content of the configuration registers with the reference configuration stored inside one of the ECC BRAMs. If there is any discrepancy between the configuration register and the ECC BRAM, the content of the configuration register is overwritten with the content of the ECC BRAM and an error flag is raised. If a single bit error occurred when reading from the ECC BRAM, a single bit error is signalled to the rest of the system and the corrected value is written back into the ECC BRAM. This has to be done in order to minimize the occurrence frequency of double bit errors in the ECC BRAM, which can happen if an existing single bit error is not corrected. A double bit error occurring in the ECC BRAM is a non-recoverable failure of the system and the complete configuration data has to be sent again by the PS. As long as the new configuration data has not been loaded successfully, the system is in an unsafe state. All this is done repeatedly until another PL function signals that a new result set is available.

2. Move Results to ECC BRAM As soon as the rest of the PL signals that a new result set is available, the results are copied from the registers into an ECC BRAM.

3. Append Configuration Serial Number The configuration serial number is read from the other ECC BRAM and appended to the results residing in the ECC BRAM. The purpose of the configuration serial number is to enable the PS to know which configuration set was used to calculate the current result set. One parameter or result is considered to be 64 bits wide; values are resized when the corresponding register has a smaller bit width, and similarly registers with a smaller bit width are zero-extended when they are copied to the ECC BRAM.

4. Calculate Fletcher Checksum for Results The checksum of this data package can be calculated.

5. Append Checksum to Results When the calculation of the Fletcher checksum is finished, it can be appended to the rest of the results in the ECC BRAM.

6. Move Results from ECC BRAM to Ordinary BRAM The complete result set can be moved from the ECC BRAM to the BRAM which is connected to the AXI bus connected to the PS. The results are always stored in the upper half of the BRAM (the base address for the results in the normal BRAM is "normal BRAM size divided by two"). After the data has been moved, the interrupt line to the PS is triggered to signal that a new result set is available. The next state can now be one of two states: if, since the last time a new result was signalled to the PS, the PS signalled that new configuration data is available, the next state is Move Configuration from BRAM to ECC BRAM; if not, the next state is Scrub Job.

7. Move Configuration from BRAM to ECC BRAM The first step of attempting to load a new configuration set is to move the configuration set provided by the PS from the BRAM into the ECC BRAM. The configuration parameters are stored in the lower half of the normal BRAM. The ECC BRAM is divided into two sections: lower half and upper half. One of the two sections contains the "golden" configuration data and the other is the place where a possible new configuration set is moved to. Information on which of the halves contains the "golden" configuration set is stored in a special register (more on this register later).

8. Calculate Fletcher Checksum for Configuration The Fletcher checksum of the configuration data has to be calculated to check if the configuration set has been transmitted correctly from the PS to the PL.

9. Verify Fletcher Checksum of Configuration Parameters The calculated Fletcher checksum can be compared against the Fletcher checksum stored as one of the configuration parameters. The result of this comparison determines the next state: if the checksums compare equal, the next state is Move Configuration Data to Configuration Registers and the register which stores the information on which half the "golden" configuration set resides is set to the half which was just verified; otherwise, an error flag is raised to signal that bad configuration data was received and the next state is Scrub Job.

10. Move Configuration Data to Configuration Registers If the consistency of the configuration was verified, the configuration data is moved from the ECC BRAM into the configuration registers. After this process has finished, the state is changed to Scrub Job.

4.1.3 Implementation

There were two main concerns for the implementation of the core: first, the different memories (registers, BRAM and ECC BRAM) had to be interfaced correctly as they all have different read/write

latencies and different port widths. Second, the critical parts of the design had to be rad-hardened to withstand soft errors. To interface the different memories, one address generator was created (in the form of a simple up counter with reset), which is followed by shift registers to delay its output. The different latencies could then be resolved by taking the address from the correct place in the shift registers. To interface the BRAM (32 bit word width) with the ECC BRAM (64 bit word width), the content of the address counter has to be shifted right by one in order to count at the same rate for both kinds of BRAMs. When the output of a wider storage had to be put into a storage with a smaller word width, a simple multiplexer is sufficient to choose the right part of the wider word. In the reverse case, an intermediate storage in the form of additional registers has to be used in order to first read from the storage with the smaller word width once, temporarily store the value, and then store the temporary value together with the following word of the smaller word width storage into the storage with the bigger word width. For rad-hardening the core, the critical parts of the core have to be identified. Why some parts of the design were not considered to be critical will be discussed later. There are three parts which need to be rad-hard in the design: the ECC BRAM for the "golden" configuration data, the FSM which controls the system and the register which stores in which half of the ECC BRAM the "golden" configuration data resides. The ECC BRAM for the configuration data is inherently rad-hardened as it is ECC protected. It has to be rad-hard because if it were not and a discrepancy between the configuration register and the ECC BRAM was found, it would not be possible to tell which one (the register or the ECC BRAM) contains the correct data. The FSM needs to be rad-hardened in order not to get stuck in a rogue state due to incomplete state encoding. In order to rad-harden the FSM, Xilinx provides a VHDL attribute "fsm_safe_state", which with the value "auto" adds a Hamming-3 encoding to the state registers of the FSM. Hamming-3 enables the FSM to recover from single bit errors. If there is a SEU in one of the state registers, this will be detected by the Hamming-3 logic which is synthesised into the FSM and which will then put the FSM into its previous state in the next clock cycle. The design can get into a rogue state for one clock cycle, but it can then recover from it and does not get stuck. The results calculated by the rest of the system which reads from this core could be wrong, but if the complete system (this core plus the rest of the current PL) is TMRed with a majority voter, the false result will not propagate to the output and the part of the system which had the SEU will resume working correctly in the next cycle. The downside of not manually implementing the rad-hardening of the FSM is that it is not possible to detect if such an event occurred. It is just known that the system can recover from a single bit error and resume correct functionality, and if the complete system is triplicated and a majority voter is added at the output, errors will not propagate to the output and will be contained within one cycle of the machine. The only risk would be that two bit errors happen in the same cycle. The frequency of such an event is expected to be within the Failure in Time (FIT) rate which has to be achieved.
The register which stores the information on which half the "golden" configuration set resides has to be secured: if it were not and a bit flip happened in this register, the system would start running with old configuration data, because during the Scrub Job state of the FSM the configuration registers are overwritten whenever there is a discrepancy between them and the ECC BRAM. In order to rad-harden this register, it is triplicated and a majority voter is placed at the output. Special care has to be taken so that the tool does not optimise this away, as normally there should be no discrepancy. This can be done by adding the attribute "dont_touch" with value "true" to the corresponding signals. The Xilinx tools will then not perform any logic optimisation and the desired functionality is preserved. [29] TCL commands can be used as an alternative to VHDL attributes. Usually this is even preferred as the VHDL attributes are sometimes not well supported by tools even though they are supposed

to be. However, in this case both attributes were fully supported, and it keeps the source code more compact and readable if such things can be done in HDL instead of having to read a TCL command file as well. A risk of using TCL commands is that some of the commands are applied onto the elaborated netlist, where names of signals can differ from the HDL code, which creates the risk that the TCL commands are not applied in the correct location if no attention is paid to this. [29]
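The following fragment sketches how the two attributes could be attached in VHDL. All entity, state and signal names are placeholders and not those of the actual core, and the value accepted by fsm_safe_state ("auto" here, as used above) may be spelled differently depending on the Vivado version.

library ieee;
use ieee.std_logic_1164.all;

entity attr_sketch is
  port (
    clk, rst    : in  std_logic;
    golden_in   : in  std_logic;   -- hypothetical "which half is golden" input
    golden_half : out std_logic
  );
end entity attr_sketch;

architecture rtl of attr_sketch is
  type state_t is (SCRUB_JOB, MV_RESULTS, APPEND_SERIAL, CALC_FLETCHER);
  signal state : state_t := SCRUB_JOB;

  -- Ask Vivado to add recovery (Hamming-3) logic to the state register.
  attribute fsm_safe_state : string;
  attribute fsm_safe_state of state : signal is "auto";

  -- Triplicated copy of the register that remembers which ECC BRAM half holds
  -- the "golden" configuration; dont_touch keeps the synthesiser from
  -- collapsing the three identical copies into one.
  signal half_a, half_b, half_c : std_logic := '0';
  attribute dont_touch : string;
  attribute dont_touch of half_a, half_b, half_c : signal is "true";
begin
  process (clk)
  begin
    if rising_edge(clk) then
      if rst = '1' then
        state  <= SCRUB_JOB;
        half_a <= '0'; half_b <= '0'; half_c <= '0';
      else
        -- Strongly simplified next-state logic, only to keep the sketch legal.
        case state is
          when SCRUB_JOB     => if golden_in = '1' then state <= MV_RESULTS; end if;
          when MV_RESULTS    => state <= APPEND_SERIAL;
          when APPEND_SERIAL => state <= CALC_FLETCHER;
          when CALC_FLETCHER => state <= SCRUB_JOB;
        end case;
        half_a <= golden_in; half_b <= golden_in; half_c <= golden_in;
      end if;
    end if;
  end process;

  -- Majority voter over the triplicated register.
  golden_half <= (half_a and half_b) or (half_a and half_c) or (half_b and half_c);
end architecture rtl;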

4.1.4 Verification

The core was simulated and worked correctly with sample data, but proper verification of this core could not be done yet as for SIL 2, independence between the RTL designer and the verification engineer has to be assured. However, a verification plan can be defined based on the specifications.

• Test PL2PS Apply the following 1e2 (1 · 10^2 = 100) times (this is a coverage goal): provide random results to the core and signal to the core that the results are ready. Wait until the results ready output signal of the core toggles. Check if the BRAM content is equal to the results in the registers, the configuration serial number is correct and the Fletcher32 checksum is correct. Pass criteria: the memory mapping defined in Table 4.2 is respected, the results in the BRAM are equal to the results in the registers, the configuration serial number in the BRAM is the same as in the configuration ECC BRAM and the Fletcher32 checksum is correct.

• Test PS2PL Apply the following 1e2 times (this is a coverage goal): write random configuration parameters into the BRAM and randomly generate either a correct or an incorrect Fletcher32 checksum. Toggle the signal which indicates that a new configuration is available and indicate that a new result is available. Check the configuration registers for their content the next time the new result available signal goes high. Pass criteria: if the Fletcher32 checksum added to the configuration data in the BRAM is correct, the configuration data shall be available in the configuration registers within 50 ms. If the Fletcher32 checksum added to the configuration data in the BRAM is wrong, the configuration registers shall keep their previous values and an error flag shall be raised indicating the bad configuration data.

• Test PSPL SEU Perform the tests Test PL2PS and Test PS2PL while flipping one bit in every flip flop word or memory array at a random moment. The registers which are protected by special synthesis specific commands have to be tested in the post-synthesis netlist. Pass criteria: no single bit error shall happen undetected. One cycle after the bit flip, the results shall be consistent (meaning they are valid even if they are calculated with an older configuration set). In the cycle with the bit flip, the results may be erroneous, but the error has to be detected.

The choice of 1e2 simulation runs for Test PL2PS and Test PS2PL is rather arbitrary. It could definitely be more as well, but as most of the complex cores used for this core are IP cores from Xilinx, there is no need to extensively verify them. The FSM and its surroundings are more like glue logic around the Xilinx IP cores, and this glue logic is simple to verify as it either works or it does not; there is very little room for undetected buggy behaviour once the design works.

Verification of the Fletcher32 Checksum Core

The verification of the Fletcher32 checksum core could be done as there were reference models for calculating the Fletcher32 checksum available. The following test was performed:

• Test Fletcher32 Apply the following 1e3 times: first, a random number of inputs (within the bounds of the possible number of inputs for the core) is defined and signalled to the core. Then apply a random input bit pattern after a random delay (with clock enable low between the application of different inputs) until the defined number of random inputs is reached. Wait with clock enable low or high at random until the core signals that the result is ready. Check the result for correctness against the reference model and reset the core. Pass criteria: all results of the core are the same as those of the reference model when the core signals the validity of the result.

Test Fletcher32 was executed successfully. The delay times between the application of different inputs were restricted to be within [0, 8] to speed up the simulation, as leaving the delay unbounded and arbitrary could result in very long delays (which would not contribute to the quality of the simulation).
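For illustration, a minimal self-checking stimulus process in the spirit of Test Fletcher32 is sketched below, reusing the hypothetical fletcher32_acc accumulator sketched earlier in this chapter and comparing against a hand-computed value instead of a full reference model; the real testbench randomises the number of inputs, the data and the delays as described above.

library ieee;
use ieee.std_logic_1164.all;

entity fletcher32_acc_tb is
end entity fletcher32_acc_tb;

architecture sim of fletcher32_acc_tb is
  signal clk      : std_logic := '0';
  signal rst      : std_logic := '1';
  signal en       : std_logic := '0';
  signal data     : std_logic_vector(31 downto 0) := (others => '0');
  signal checksum : std_logic_vector(31 downto 0);
begin
  clk <= not clk after 5 ns;

  dut : entity work.fletcher32_acc
    port map (clk => clk, rst => rst, en => en, data => data, checksum => checksum);

  stim : process
  begin
    wait until rising_edge(clk);
    rst  <= '0';
    -- Feed one word: lower half 0x0001, upper half 0x0002.
    data <= x"00020001";
    en   <= '1';
    wait until rising_edge(clk);
    en   <= '0';
    wait until rising_edge(clk);
    -- Hand-computed Fletcher-32 of the halves 0x0001, 0x0002:
    -- sum1 = 1, sum2 = 1; then sum1 = 3, sum2 = 4  =>  checksum = 0x00040003.
    assert checksum = x"00040003"
      report "unexpected Fletcher-32 checksum" severity error;
    wait;
  end process stim;
end architecture sim;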

4.1.5 Discussion of the Design

A first point of discussion can be why a configuration serial number was added as an additional configuration parameter and as an additional "measurement" result of the system. The reason is the following: if the core in the PL fails to detect that the signal coming from the PS indicates that a new configuration is available in the BRAM (which can be caused by an SEU in the edge detector of the core or by something else going wrong in the transmission of the signal from the PS to the PL), the PL will not load anything and the PS has no means of knowing that the new configuration data was not loaded. To overcome this, the configuration serial number was added, which makes the round trip from PS to PL and back to PS; that way the PS knows at any time with which configuration data the current results have been calculated. This can be especially useful keeping in mind that Linux is not an RTOS, so technically within Linux on the PS one has no means to know when exactly data was moved from the PS to the PL. Another delicate point is that until now the PS sent new configuration data to the PL in each and every cycle. This becomes a problem as soon as SEUs are considered, because the storages in the PS (the external DDR2 DRAM and the internal caches) cannot be hardened in an efficient way against SEUs. These problems could be resolved by changing the functionality of the software stack, but more about this in the next section. The register that toggles to indicate to the PS that a new result set is available does not need to be rad-hardened because the interrupt lines of the PS are edge sensitive. The toggling will happen at least once and, if an SEU happens, a second time during the cycle. There are now two cases which need to be differentiated: in the first case, the SEU happens before the FSM is in state Move Results from ECC BRAM to Ordinary BRAM. Then the PS will read the same data twice, but it can check the uniqueness of the result with a time stamp and discard duplicates. The other case is when the PS reads from the BRAM while the FSM is writing data into it. In this case the PS will read two data fragments which should not have the same Fletcher checksum. The PS can verify the validity of a result set by checking the checksum, which will fail in this case. SEUs in the address generator cannot create a failure of the device as any SEU in the address generator is detected and classified as a fault. If there is an SEU in the address generator register chain while copying configuration data from the BRAM to the ECC BRAM, the Fletcher checksum will detect it. If there is an error when calculating the Fletcher checksum, then the checksum check will fail. The same holds true if one of the internal registers of the Fletcher32 core has an SEU: a false result will be calculated, but nothing will ever get stuck in a non-recoverable state. There is an internal

state machine inside the Fletcher32 core, but it is rad-hardened with Hamming-3 in the same way as the FSM inside the main PS/PL communication core, and if an SEU happens there, only a wrong Fletcher checksum will be calculated. If there is an SEU in the address generator chain when the state is Scrub Job, then some bad values might temporarily be moved into the configuration registers. However, the next time the system goes through the erroneous memory location, the error will be detected, corrected and reported. The one instance where an SEU in the address generator can have an undetectable impact is when data is copied from the result registers into the BRAM. However, sending back the results is not considered a safety critical function and it is therefore acceptable to allow errors happening there. Furthermore, assuming there will be TMR of the full system at a later stage, the PS would be able to determine, when two instances write back the same "meaningful" results and the third instance returns nonsense, that the two identical results are the good ones even when all three can be verified with the Fletcher checksum (as the Fletcher checksum is applied after the address generator had the SEU, the badly copied third result set will have a correct sum but will still differ from the other two instances of results). The same reasoning as for copying and checking configuration data applies to the results: if there is an SEU in the address generator when creating the checksum or when copying the results from the ECC BRAM to the BRAM, the checksum will fail in the PS and the error will be detected.

4.2 Adaptations of the Linux Software Stack

In order to get the core running, only a few adaptations to the existing Linux software stack are necessary. Especially as the Linux software stack is scheduled to be redesigned soon, it does not make sense to invest a lot of energy in adding improvements to the existing code and software architecture. First, some code has to be written which implements Fletcher32 to sign and verify data sent to and received from the PL. As performance is no bottleneck in the software stack, the algorithm can be written for maximum readability, maintainability and ease of verification. Second, the actual mapping of the employed data structures has to be adapted to mirror the memory mapping specified in Tables 4.1 and 4.2. Up to now, all memory accesses to a physical memory location from within Linux were done as 32 bit wide accesses. If the written data was wider than 32 bits, it had to be written and read in two steps, which added some additional code. To simplify this and to reduce the code complexity, all memory accesses now happen as 64 bit accesses. This brings some considerable overhead (especially considering that most parameters are smaller than 32 bits wide), but the overhead does not matter as there is by far no performance bottleneck in the PS. Up to now, the parameters were read out from the data structure individually and moved to the physical address. Even though it brings a considerable memory overhead, the new approach is to transform the data structure into an array first and then move the complete array to the physical memory location. As a side effect, this considerably simplifies how the Fletcher32 algorithm has to interact with the data: before, it would have had to read every field of the data structure individually and extend the smaller bit fields if necessary; now it can simply iterate through an array. However, it is important to iterate through the array before it is moved to the physical memory location. The reason is that if the algorithm reads data from a physical location which was made accessible to the Linux userspace through the mmap syscall, every array element read will trigger the underlying generic UIO driver, which would load the data from the physical memory every time. As the physical memory is a BRAM portion connected to an AXI BRAM controller which is attached to an AXI-Lite bus, the latency is fairly high (in the order of several tens of microseconds1)

1The exact research results are property of ABB Corporate Research Switzerland.

for reading a very small chunk (4 or 8 words per burst) of memory. This low throughput combined with a high latency can quickly add up and mean that the PS spends a lot of time waiting for data to be moved through the AXI bus system and then through its cache hierarchy.

4.2.1 Proposed Changes in the Linux Software Stack for Future Releases

As the configuration data changes very rarely (typically a few times a year), each time a new configuration set is received (or some of the parameters are altered), the changes could be saved onto one or multiple external flash based devices which are present on the PicoZed boards (there are eMMC, SD card and QSPI flash). A flash-based storage is much more robust against radiation as it takes a considerable dose to completely charge/discharge a flash cell. Especially as the radiation (and therefore the dose) is measured, one could try to estimate whether the data is still intact given the received dose. If the dose reaches the limits where data retention might become an issue, the data could be read back and rewritten to the device so that it would take the full dose to charge/discharge a flash cell again. As there are three flash based storage devices available, one could even go so far as to use TMR in the PS on the configuration data and store it in all three storages, but this would be overkill. However, using two might be useful: in case one stops working, one could still be sure about the validity of the content of the other storage as long as the critical dose for this storage has not been reached. One possible problem might be finding the critical dose, as it requires expensive testing by the producer of the flash based devices, which is not necessarily done. As an example, Microsemi provides this data (sometimes under an NDA) for its flash-based FPGA devices, but those are put onto satellites or used in other radiation containing environments where the users require this kind of data.

4.3 Current State of the Development

The Fletcher checksum core was designed and could be verified successfully. It was included in the overall design of the PS/PL transfer core. The PS/PL transfer core was designed and all its functionality except the rad-hardening features was tested successfully. Verification was not done yet as it will be Ciarán Toner's task to implement the verification in order to guarantee enough independence to be SIL 2 certifiable. However, the verification plan has already been prepared so that only the implementation of the actual test procedures remains to be done. The reason why the rad-hardening features were not tested is that it takes a lot of effort to do so and therefore it is better to only do it once during the "official" verification process. The core was not yet added to the overall project: on the one hand, it was not yet verified properly, and on the other hand, resources were too limited as there were other, more important tasks in the pipeline. The modifications in the Linux software stack that are necessary to use the newly designed core were all developed and tested. As the core was not yet put into the overall system, it did not make sense to push the changes in the software stack to the full system yet, as they are incompatible with the currently used system (the new core needs the new software and the old core needs the old software). Finally, none of the proposed modifications that are not necessary to use the new core have been implemented yet. Another reason why the already implemented modifications were held back is that, when the software has to be modified anyway, the proposed modifications could be implemented at the same time to save some work, as they all require modifications in the same subsystem.

4.3.1 Final Remarks on the Verification

A major part of the design complexity of the system is due to IP cores from Xilinx (which are used in the form of BRAM and ECC BRAM cores). As they would have to be replicated in the reference model to perform verification using a black-box approach, the black-box approach is maybe not the optimal way to verify this core (it neither makes sense nor is it possible to verify Xilinx's IP cores). Therefore, a white-box verification approach using PSL is proposed. PSL is particularly suited to model specifications of the core like Req Func PS2PL 3. Furthermore, the behaviour of the FSM as well as the interfaces to the BRAM can be modelled efficiently in PSL. Although some stimuli generation would still be needed, as formal verification of such a complex system is not feasible, the need to develop a reference model which would closely resemble the DUV could be eliminated.
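As an indication of what such white-box properties could look like, the fragment below shows two PSL assertions in the VHDL flavour, embedded as "-- psl" comments inside the core's architecture body. The signal names (checksum_bad, error_flag, scrub_mismatch, reg_rewrite) are placeholders invented for this sketch and would have to be replaced by the actual internal signals of the core.

-- psl default clock is rising_edge(clk);

-- A failed Fletcher check must raise the error flag in the following cycle.
-- psl assert always ((checksum_bad = '1') -> next (error_flag = '1'))
--   report "error flag not raised after a failed configuration checksum";

-- A mismatch found during Scrub Job must trigger a rewrite of the register.
-- psl assert always ((scrub_mismatch = '1') -> next (reg_rewrite = '1'))
--   report "configuration register not rewritten after a scrub mismatch";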

5 Floating Point Computation

In order to cope with the large dynamic range of the measured current (nine decades of dynamics), a floating point representation of the measurements is necessary, as a fixed-point representation would require too large bit widths. To start this chapter, a general overview of floating point numbers will be given. Then the different floating point IP cores developed in this project will be presented with a special focus on the different development steps as presented in Section 3.1. Finally, an application which uses most of the developed floating point cores will be presented.

5.1 Used Subset of IEEE 754-2008

5.1.1 Binary Representation and Interpretation

In this section a subset of the IEEE 754-2008 standard described in [44] is presented. The presented subset was judged to be sufficient for CROME and the reasoning why the subset is sufficient will be presented. Figure 5.1 (from [44]) shows a generic representation of a binary floating point number.

Figure 5.1: Generic binary representation of a floating point number: a 1-bit sign S, followed by the w-bit biased exponent E, followed by the t = p − 1 bit trailing significand field T (bits d1 ... dp−1)

The first bit (MSB) of the bit string is the sign bit S. When S = 0 the floating point number is positive and when S = 1, the number is negative. The sign bit is followed by the w bits which form the biased exponent E. After the exponent comes the mantissa T, consisting of the remaining t bits. E has to be interpreted as an unsigned number. This decision was made in order to facilitate floating point operations, because using two's complement would make operations like a simple comparison much more difficult. However, by doing so, no number with a magnitude smaller than one can be represented. To be able to represent such numbers, a constant offset called bias B has to be subtracted from E. The bias of the exponent B is defined as follows:

B = 2^{w−1} − 1 (5.1)

In order to correctly interpret the value of a binary encoded floating point number, several cases need to be distinguished.

1. Denormalised number When E is all zeros, the float is a denormalised number. Denormalised numbers are equally spaced and span around zero. The mantissa of a denormalised number has to be interpreted as 0.T, meaning d0, which is not shown in Figure 5.1, is implicit and given by the value of E. Using this, the numeric value f of a binary encoded denormalised floating point number can be calculated as

f = (−1)^S · 2^{1−B} · T · 2^{−t} (5.2)

2. Normalised number When E is not all zeros and not all ones, the float is a normalised number. Normalised numbers are logarithmically spaced. The mantissa of a normalised number has to be interpreted as 1.T, meaning d0, which is not shown in Figure 5.1, is implicit and given by the value of E (the same principle as for denormalised numbers); a small worked example is given after this list. Using this, the numeric value f of a binary encoded normalised floating point number can be calculated as

f = (−1)^S · 2^{E−B} · (1 + T · 2^{−t}) (5.3)

3. ∞ When E is all ones and T is all zeros, an infinite number is encoded. However, an infinite number according to IEEE 754-2008 is not strictly equivalent to mathematical infinity. An infinite value encoded according to IEEE 754-2008 can represent any value of the interval from the first value which is too big to be represented as a normalised number up to the true mathematical infinity (speaking in terms of the magnitude of the numbers, i.e. the same holds for negative numbers and negative infinity). The sign bit is interpreted in the same way as for normalised and denormalised numbers. Therefore, although not fully correct but serving as an illustration, one could write the value f of the float as

f = (−1)^S · ∞ (5.4)

4. NaN When E is all ones and T is not all zeros, the float encodes a Not-a-Number (NaN) value. There is a distinction between two different types of NaNs: signalling and quiet NaNs (sometimes called sNaN and qNaN). The distinction between sNaN and qNaN is done by using d1.

a) Signalling NaN When the float encodes a NaN and d1 is zero, it is a sNaN.

b) Quiet NaN When the float encodes a NaN and d1 is one, it is a qNaN. The difference between the two NaN values (besides their different encodings) is that sNaN is accompanied by an exception flag while qNaN is not. The exception flag allows simplified debugging on NaN occurrence while a qNaN will simply propagate through the computations. More about exceptions can be read in Section 5.1.3.
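As a small worked example of case 2: consider the binary32 bit string 0xC0200000, i.e. S = 1, an exponent field E = 128 (bit pattern 10000000) and a mantissa field T = 0100...0, which reads as T = 2^{21} when interpreted as an unsigned integer. With B = 127 from Equation 5.1 and t = 23, Equation 5.3 gives

f = (−1)^1 · 2^{128−127} · (1 + 2^{21} · 2^{−23}) = −2 · 1.25 = −2.5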

Equations 5.2 and 5.3 can be combined and generalised to

f = (−1)^S · 2^{max(1;E)−B} · (min(E;1) + T · 2^{−t}) (5.5)

By using Equation 5.5 no distinction between denormalised and normalised numbers is necessary any more. Furthermore, Equation 5.5 shows that the numeric values of floating point numbers are encoded in a sign/magnitude manner. This implies that there are two zero values: +0 and −0. For binary32, which is the standard float type in C and most other languages, w = 8 and t = 23. This is the format which was implemented for CROME. Using the binary representation as shown in Figure 5.1, the binary representations can be used for classifying floating point numbers. This classification is useful for defining verification plans in later stages. Table 5.1 shows a classification of all positive floating point numbers. The table uses a

notation based on VHDL to describe bit strings. The symbol '-' signifies don't care and the exponent and mantissa are referred to as exp and mant respectively. The content of the different rows is not necessarily exclusive, as min subnorm ∈ subnorm for example. The sign bit is omitted because the same table can be created for all negative floating point values, and that table would only differ from this one by changing the sign bit from '0' to '1'. Therefore, when −max norm is written somewhere, the same bit string as in Table 5.1 is meant but with a '1' instead of a '0' as sign bit. When a row of Table 5.1 encodes a range, a from "starting point" to "end point" notation is used, where all the intermediary bit strings between "starting point" and "end point" are included when counting from "starting point" to "end point" and the bit strings are interpreted as unsigned numbers.

Table 5.1: Floating point classification

Name        | Exponent                             | Mantissa
zero        | (others => '0')                      | (others => '0')
min subnorm | (others => '0')                      | (mant'low => '1', others => '0')
subnorm     | (others => '0')                      | (others => '-')
max subnorm | (others => '0')                      | (others => '1')
min norm    | (exp'low => '1', others => '0')      | (others => '0')
one         | (exp'high => '0', others => '1')     | (others => '0')
norm        | from (exp'low => '1', others => '0') to (exp'low => '0', others => '1') | from (others => '0') to (others => '1')
max norm    | (exp'low => '0', others => '1')      | (others => '1')
∞           | (others => '1')                      | (others => '0')
sNaN        | (others => '1')                      | from (mant'low => '1', others => '0') to (mant'high => '0', others => '1')
qNaN        | (others => '1')                      | (mant'high => '1', others => '-')
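A direct way to use this classification in testbench code is a small helper function that decodes the two fields. The sketch below is a hypothetical package written for this purpose; the package, type and function names are illustrative and not part of the actual verification suite.

library ieee;
use ieee.std_logic_1164.all;

package fp_classify_pkg is
  type fp_class_t is (FP_ZERO, FP_SUBNORM, FP_NORM, FP_INF, FP_SNAN, FP_QNAN);
  function classify_binary32 (f : std_logic_vector(31 downto 0)) return fp_class_t;
end package fp_classify_pkg;

package body fp_classify_pkg is
  function classify_binary32 (f : std_logic_vector(31 downto 0)) return fp_class_t is
    constant exp     : std_logic_vector(7 downto 0)  := f(30 downto 23);
    constant mant    : std_logic_vector(22 downto 0) := f(22 downto 0);
    constant zeros_e : std_logic_vector(7 downto 0)  := (others => '0');
    constant ones_e  : std_logic_vector(7 downto 0)  := (others => '1');
    constant zeros_m : std_logic_vector(22 downto 0) := (others => '0');
  begin
    if exp = zeros_e then
      if mant = zeros_m then
        return FP_ZERO;               -- +0 / -0 (the sign bit is ignored here)
      else
        return FP_SUBNORM;            -- covers min subnorm up to max subnorm
      end if;
    elsif exp = ones_e then
      if mant = zeros_m then
        return FP_INF;
      elsif mant(mant'high) = '1' then
        return FP_QNAN;               -- d1 = '1' encodes a quiet NaN
      else
        return FP_SNAN;               -- d1 = '0' and mantissa /= 0: signalling NaN
      end if;
    else
      return FP_NORM;                 -- covers min norm, one and max norm
    end if;
  end function classify_binary32;
end package body fp_classify_pkg;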

5.1.2 Rounding of Floating Point Numbers [44] defines several rounding functions. In the following, the chosen rounding function is presented together with a brief explanation of why it was chosen. In Section 5.1.3, the reasons for not selecting the remaining rounding functions are discussed. First, it is necessary to define exactly what a rounding function does in IEEE 754-2008. Let F be the set of all possible binary representations of a floating point number of a given bit width (binary32, for example), and let F_0 and F_1 be two binary representations of floating point numbers with their respective numeric values f_0 and f_1, i.e. F_0 ∈ F and F_1 ∈ F. If now an operation like an addition or a multiplication is performed (taking the multiplication as an example)

f_2 = f_0 · f_1 (5.6)

it can happen (and does happen most of the time) that it is not possible to make the mapping

f_2 → F_2 (5.7)

so that

F_2 ∈ F (5.8)

In other words, the standard requires us to perform any operation with an infinitely precise intermediary result (f_2 in this case), but this intermediary result may not be representable in binary form, so that F_2 ∉ F because F_2 would require too many bits to accurately represent f_2. As this would lead to an ever growing size of the binary representation when performing multiple operations, rounding is needed so that a mapping

f_2' = round_float(f_2) (5.9)
f_2' → F_2' (5.10)

is always possible, with the condition F_2' ∈ F even if F_2 ∉ F. The algorithm chosen for round_float(x) is called round ties to even. In words: the infinitely precise intermediary result is rounded to the closest representable result. If there are two representable results equally spaced from the infinitely precise intermediary result, the even representable result is chosen, i.e. the one with an LSB in the mantissa equal to zero. In a more mathematical manner: let f_3 with F_3 ∉ F and |f_3'| ≥ |f_3| ≥ |f_3''|, with F_3' ∈ F and F_3'' ∈ F. For F_3' and F_3'' it further holds that there is no f_3''' so that |f_3'| > |f_3'''| > |f_3''| and F_3''' ∈ F; in other words, F_3' and F_3'' are successive/neighbouring elements of F. And finally, let LSB(F) be the function which returns the LSB of the mantissa of F, i.e. returns bit d_(p−1) with p = t + 1 of Figure 5.1. The round_float(f) function for round ties to even can then be written as

round_float(f_3) =                                                          (5.12)
    f_3'   if |f_3'| − |f_3| < |f_3| − |f_3''|
    f_3''  if |f_3'| − |f_3| > |f_3| − |f_3''|
    f_3'   if |f_3'| − |f_3| = |f_3| − |f_3''| and LSB(F_3') = 0
    f_3''  if |f_3'| − |f_3| = |f_3| − |f_3''| and LSB(F_3'') = 0

Second, let us define the term bias (not to be confused with the bias of the exponent): a rounding function round() is called biased if, for a stochastic variable x uniformly distributed over an arbitrary interval,

E[x] ≠ E[round(x)] (5.13)

where E[x] is the expectation value of x, which has infinite precision. When checking whether round ties to nearest even is biased, it turns out that it is not biased over any (arbitrary) interval [f_4; f_5] with F_4 ∈ F and F_5 ∈ F. The reason is that the rounding is always done towards the nearest neighbouring number (which does not introduce any bias), and if there is a tie (two neighbouring numbers have the same distance), the tie-breaking rule does not introduce any bias either: over an interval, half of the ties will be rounded up and the other half will be rounded down. This is a very desirable property, as CROME has to be able to measure very small differences in voltage over a long period of time, and any bias could accumulate into a large error over time. Furthermore, round ties to even is the default rounding mode proposed by IEEE 754-2008.
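The bias argument can be illustrated with the following small C sketch (not part of the project code): it compares the mean rounding error of truncation and of round to nearest even, the latter via rint() under the default FE_TONEAREST mode; the truncation error does not average out, whereas the ties-to-even error does.

/* Illustrative sketch: mean rounding error of truncation versus
 * round-to-nearest-even over uniformly distributed positive values. */
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

int main(void)
{
    const int N = 1000000;
    double err_trunc = 0.0, err_rne = 0.0;
    srand(1);

    for (int i = 0; i < N; ++i) {
        double x = 100.0 * rand() / RAND_MAX;   /* uniform in [0, 100]          */
        err_trunc += trunc(x) - x;              /* round toward zero            */
        err_rne   += rint(x)  - x;              /* ties to even (default mode)  */
    }

    /* Expected: roughly -0.5 for truncation, close to 0 for ties to even. */
    printf("mean error, truncation:   %+.4f\n", err_trunc / N);
    printf("mean error, ties to even: %+.4f\n", err_rne / N);
    return 0;
}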

5.1.3 Omitted Functionality There are several features of IEEE 754-2008 which are described in the standard but were not implemented for the floating point cores. In the following, these features are briefly described and a short explanation of why they can be omitted is given.

Rounding Modes Besides round ties to nearest even, [44] defines several alternative rounding algorithms:

• Round ties to away The nearest floating point number to the infinitely precise result shall be returned. If there is a tie, the rounded result with the larger magnitude shall be returned.

• Round toward positive Any infinitely precise result shall be rounded up to the next biggest floating point number, or stay the same when the result is representable in floating point format.

• Round toward negative Any infinitely precise result shall be rounded down to the next smallest floating point number, or stay the same when the result is representable in floating point format.

• Round toward zero Any infinitely precise result shall be rounded to the number with the smallest magnitude, or stay the same when the result is representable in floating point format.

While round toward positive and round toward negative have a bias over any interval, round ties to away and round toward zero are bias free over R. However, both have an individual bias on R^+ and on R^−. Having a bias in the floating point arithmetic would be bad for CROME, because CROME measures voltage ramps which may have a very low steepness. Furthermore, as the system will be active for a long time, the bias would accumulate and could have a severe impact on the result. An example of why a bias in the rounding mode of the floating point arithmetic is harmful is the index of the Vancouver Stock Exchange, which ended up completely wrong¹ because the small bias of the employed rounding mode accumulated over the years. Therefore, round toward positive and round toward negative are certainly not viable options. For round ties to away and round toward zero a more detailed analysis is needed, but the reasoning remains the same as for round toward positive and round toward negative. The voltage ramp which is typically measured lies in the interval [0; Vdd], with Vdd being a positive voltage. As stated before, round ties to away and round toward zero are biased on R^+ and R^−, and are therefore biased on the interval [0; Vdd] as well. Hence, neither of these rounding modes was suitable, even though round toward zero (also called truncation) would be extremely efficient to implement.
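For illustration, the alternative rounding modes map directly onto the C99 <fenv.h> rounding mode API (round ties to away has no standard C99 mode); the sketch below, which is not part of the project code, rounds the same value under the four available modes.

/* Illustrative C99 sketch: the same value rounded under the four standard
 * rounding modes of <fenv.h>. Compile with e.g. -std=c99 -lm.            */
#include <fenv.h>
#include <math.h>
#include <stdio.h>

#pragma STDC FENV_ACCESS ON

int main(void)
{
    const double x = 2.5;
    const int   modes[] = { FE_TONEAREST, FE_UPWARD, FE_DOWNWARD, FE_TOWARDZERO };
    const char *names[] = { "to nearest (even)", "toward positive",
                            "toward negative",   "toward zero" };

    for (int i = 0; i < 4; ++i) {
        fesetround(modes[i]);
        /* rint() honours the currently selected rounding mode. */
        printf("%-18s rint(%.1f) = %.1f\n", names[i], x, rint(x));
    }
    fesetround(FE_TONEAREST);   /* restore the default mode */
    return 0;
}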

Exceptions [44] defines several exceptions to be raised based on conditions. They are given below with illustrative examples:

• Invalid operation is raised when a mathematically undefined operation like ∞ · 0 or ∞ − ∞ is performed or a NaN value is converted to an integer.

• Division by zero is raised when a non zero number is divided by zero or log(0) is calculated.

¹ https://www.ma.utexas.edu/users/arbogast/misc/disasters.html (last visited on Thursday 16th March, 2017)

• Overflow is raised when the exact result of a valid operation needs to be rounded to infinity in order to be representable in binary format, i.e. when the intermediary result is too large in magnitude.

• Underflow is raised when an intermediary result has to be rounded to a denormal or zero number to be representable.

• Inexact is raised when the rounded result of an operation is inexact, i.e. when an intermediary result of an operation has to be rounded to be representable in binary format.

These exception flags are not implemented in the project as they do not provide useful information in the context of CROME. The exceptions are mainly useful when the floating point unit is used inside a microprocessor, where a running algorithm can be interrupted when an exception occurs and the exception can be handled. For CROME the situation is different: the inexact flag is almost always set and provides little to no useful information. Underflows should never happen, and if one does, the floating point cores support denormal numbers and can therefore continue to work with the best possible precision. The underflow flag is especially useful when denormal numbers are not supported (as for the Xilinx floating point core) or deactivated (when using SSE instructions on an Intel CPU). Division by zero can never occur, as no divisions or logarithms are calculated in floating point format in CROME. Invalid operations produce a NaN value, which then propagates through the calculations; the final internal result of CROME is checked for NaNs. Similarly, an overflow will simply propagate through the calculations, and at the end it can be checked whether one happened. For CROME it is only important whether there is a NaN or ∞ at the output, not when exactly it occurred (because it should never happen, and if it does, an error needs to be signalled).
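The end-of-pipeline check described above can be sketched in C as follows (illustrative only, not the project code): instead of reacting to per-operation exception flags, only the final result is classified.

/* Illustrative sketch: classify the final result of a computation chain
 * instead of tracking per-operation exception flags.                     */
#include <math.h>
#include <stdio.h>

/* Returns 1 if the result is a usable (finite) number, 0 if an error has
 * to be signalled because a NaN or an infinity propagated to the output. */
static int result_is_valid(float result)
{
    if (isnan(result)) return 0;   /* some invalid operation occurred */
    if (isinf(result)) return 0;   /* an overflow occurred somewhere  */
    return 1;
}

int main(void)
{
    float ok  = 1.0f + 2.0f;
    float bad = INFINITY - INFINITY;   /* mathematically undefined, propagates as NaN */
    printf("%d %d\n", result_is_valid(ok), result_is_valid(bad));
    return 0;
}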

5.2 IEEE 754-2008 Verification Suite

For the verification of the floating point cores developed for this project, a verification suite targeted at an arbitrary bit width floating point type was developed. The verification suite is implemented as a generic package, a type of package that was introduced in VHDL 2008. In order to be able to perform the verification, the package needs to be instantiated for a specific bit width of the floating point type (actually for a specific exponent and mantissa width) and a specific bit width of the integer type. For the stimuli generation and the scoreboarding, the OSVVM feature ”Intelligent Coverage” was used for generating test cases based on missing coverage points. This speeds up the simulation process by not wasting time on already covered coverage points. However, the way coverage points are implemented in OSVVM posed several problems for certain tests.

5.2.1 Implementation of Coverage Points In OSVVM, coverage points are implemented as integers or integer vectors, which are constrained by the VHDL standard to be 32 bits wide and to not cover the value that is encoded in two's complement by a leading one followed by all zeros (the most negative number representable in two's complement). This constraint led to the problem that the value of negative zero according to Table 5.1 cannot be represented as an OSVVM coverage point. In order to overcome this problem, a lookup table function and its reverse (value to lookup table index) function were implemented, where all the possible entries of Table 5.1 are enumerated and defined as coverage points. Using the forward lookup table function, the actual range of values encoded by a coverage point can be decoded. Similarly, the reverse function takes a bitstring as input and returns the coverage point to which the bitstring belongs.
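The lookup table idea can be illustrated in C as follows (the actual implementation is part of the VHDL generic package; the enum and function names below are hypothetical and chosen for illustration only).

/* Illustrative C sketch of the reverse lookup: map a binary32 bit pattern
 * (sign ignored) to its most specific classification bin of Table 5.1.    */
#include <stdint.h>

typedef enum {                       /* one bin per row of Table 5.1 */
    FP_ZERO, FP_MIN_SUBNORM, FP_SUBNORM, FP_MAX_SUBNORM,
    FP_MIN_NORM, FP_ONE, FP_NORM, FP_MAX_NORM,
    FP_INF, FP_SNAN, FP_QNAN
} fp_class_t;

static fp_class_t fp_classify_bits(uint32_t bits)
{
    uint32_t exp  = (bits >> 23) & 0xFFu;
    uint32_t mant = bits & 0x7FFFFFu;

    if (exp == 0x00u) {
        if (mant == 0x000000u) return FP_ZERO;
        if (mant == 0x000001u) return FP_MIN_SUBNORM;
        if (mant == 0x7FFFFFu) return FP_MAX_SUBNORM;
        return FP_SUBNORM;
    }
    if (exp == 0xFFu) {
        if (mant == 0x000000u) return FP_INF;
        return (mant & 0x400000u) ? FP_QNAN : FP_SNAN;  /* d1 is the mantissa MSB */
    }
    if (exp == 0x01u && mant == 0x000000u) return FP_MIN_NORM;
    if (exp == 0x7Fu && mant == 0x000000u) return FP_ONE;
    if (exp == 0xFEu && mant == 0x7FFFFFu) return FP_MAX_NORM;
    return FP_NORM;
}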

Another, much more important problem caused by the implementation of coverage points in OSVVM, which was also resolved by using a lookup table and for which Listing 5.1 serves as illustration, is the following: assume that one coverage bin corresponds to one row of Table 5.1. The coverage bin can either be a single point, in which case it is represented by an integer, or a range of bitstrings, in which case it is represented as an integer vector. If now a cross coverage of the set of all possible coverage bins in Table 5.1 with itself is created, a cross coverage point can have as index an integer vector of length two, three or four. If the two initial coverage bins used for the cross coverage are both characterised by a single bitstring, their cross coverage point is an integer vector of length two. If both initial coverage bins characterise a range, their cross coverage point is an integer vector of length four. If one of the two initial coverage bins is a single value and the other a range, their cross coverage point is an integer vector of length three. The problem occurs when a random coverage point has to be returned through the "Intelligent Coverage" mechanism of OSVVM. As the return value of the function that returns the coverage point needs to be constrained, one forces the function to only return coverage points of one particular length. Using three different calls to the function, each with a return length of two, three and four respectively, would in principle resolve the problem. However, the code would be neither very readable nor efficient: once all coverage points of one particular length are fully covered while the others are not, the corresponding call would still keep returning points of the already finished length. Using different return lengths also has another problem, which makes the approach impractical: if the length of the returned integer vector is three, it is impossible to tell whether the first value is a single coverage bin and the second and third values form a coverage range, or the reverse, i.e. the first and second values form a range and the third is a single value coverage bin. As it is not possible to resolve this ambiguity efficiently, the approach with the lookup tables had to be taken.

Listing 5.1: Problematic OSVVM snippet for functional coverage

-- VHDL 2008 standard is used, OSVVM included, entity testbench declared
architecture problematic of testbench is
  shared variable cov: CovPType;
  -- some other declarations
begin
  process
    variable int0, int1, int2, int3: integer;
  begin
    cov.AddCross(GenBin(1,1) & GenBin(2,3), GenBin(4,4) & GenBin(6,8));
    -- cov contains: (1,4), (1,(6,8)), ((2,3),4), ((2,3),(6,8))
    -- cov is flattened to: (1,4), (1,6,8), (2,3,4), (2,3,6,8)
    while not cov.IsCovered loop
      -- GenBin(1,1) x GenBin(4,4) or GenBin(4,4) x GenBin(1,1)
      (int0, int1) := cov.RandCovPoint;

      -- GenBin(1,1) x GenBin(6,8) or GenBin(2,3) x GenBin(4,4)
      -- impossible to tell if correct coverage points are
      -- ((int0, int1), int2) or (int0, (int1, int2))
      (int0, int1, int2) := cov.RandCovPoint;

      -- GenBin(2,3) x GenBin(6,8) or GenBin(6,8) x GenBin(2,3)
      (int0, int1, int2, int3) := cov.RandCovPoint;

      -- apply stimuli, collect coverage, etc
    end loop;
  end process;
  -- some more testbench code
end architecture problematic;

5.2.2 Generic Test Functions There are mainly two types of tests which were developed: the first type has coverage points defined on the input of the floating point core, and the second type has coverage points defined on the output of the floating point core. For the case where the coverage point is defined by the input, it is simple to just create a corresponding input. However, when the coverage point is defined by the output of the floating point operation, it is not easy to construct input values that produce the desired output in a non-trivial manner. If a trivial manner were sufficient, one could just take the desired output of the operation as one input and set the other operand to the neutral element of the corresponding operation (zero for addition, one for multiplication). But the trivial manner was judged not to test the hardware sufficiently, and therefore a more sophisticated approach to constructing a given result at the output was developed.

Parametrisation of Operations for Result Based Tests The result based tests were implemented in such a way that the actual operation to be performed is a parameter of the test function. The operation is abstracted as a function call whose inputs are the two actual input values plus a custom type defining the operation (addition, multiplication, etc.). Inside the function, the custom type is resolved and the actual operation performed. Similarly, functions were implemented which return the inverse of a given operation (subtraction for addition, division for multiplication) or which return an illegal input combination per operation, such as ∞ − ∞ for additions or ∞ · 0 for multiplications. This made the code of the actual test procedures much more modular, compact and readable (a sketch of this parametrisation is given after the next paragraph).

Parametrisation of Stimuli Generation for Result Based Tests Another functionality that has to be parametrised by the operation is the stimuli generation. The reason is that with a multiplication the value of the product can vary much more than with a sum when one input is fixed and the other is calculated to obtain a certain result. The general input calculation to achieve a specific result at the output, corresponding to a certain coverage point, works as follows. Knowing the target result, the first operand is determined either by setting it to a random number (for multiplications, or for additions whose result should have a large magnitude) or by setting it to the target result plus the target result times a random number bounded to a certain range (for additions whose result should have a small magnitude, i.e. any denormalised number or ±1 of Table 5.1). This range was determined by making it as large as possible, which would mean setting it between −126 and 126, but as small as necessary so that the simulation finishes in a reasonable amount of time. As the runtime increases at least linearly with the range of the random number, it was possible to find the point where the increase in runtime is still moderate compared to the gain in randomness obtained by allowing a larger spread of the random number. The formula can be written as T = t + rnd · t, where T is the first operand, t is the target result and rnd is a random number in the range between −126 and 126. The second operand is then simply calculated by applying the inverse operation to the first operand and the target result. This way a non-trivial, random input pair which should produce a certain output is created. Although for additions it is quite likely that the target result is not "hit" on the first few attempts, the additional attempts needed to "hit" the target result help to verify the circuit properly.
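The operation parametrisation and the result based stimuli generation of the last two paragraphs can be sketched in C as follows (illustrative only; the actual implementation consists of VHDL procedures inside the verification package, and the names used here are hypothetical).

/* Illustrative C sketch of the operation parametrisation and the result
 * based stimuli generation (the real code is VHDL; names are hypothetical). */
#include <stdlib.h>

typedef enum { OP_ADD, OP_MUL } fp_op_t;

static float apply_op(fp_op_t op, float a, float b)
{
    return (op == OP_ADD) ? a + b : a * b;
}

/* Inverse operation: subtraction for addition, division for multiplication. */
static float apply_inverse_op(fp_op_t op, float result, float operand)
{
    return (op == OP_ADD) ? result - operand : result / operand;
}

/* Generate a non-trivial operand pair that should produce 'target':
 * T = t + rnd * t for the first operand, the inverse operation for the second. */
static void gen_stimuli(fp_op_t op, float target, float spread,
                        float *op_a, float *op_b)
{
    float rnd = spread * (2.0f * rand() / RAND_MAX - 1.0f); /* in [-spread, spread] */
    *op_a = target + rnd * target;
    *op_b = apply_inverse_op(op, target, *op_a);
}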

Figure 5.2 shows the logarithmic to logarithmic plot of the runtime of a result based verification procedure run on the adder. While for very small spread factors there is no significant increase, the runtime starts increasing significantly after a certain threshold is reached. It makes sense that the runtime of the simulation has a lower bound, which is reached when the spread is so small that there is effectively no randomness in the operand and it is simply set to the target value. The magnitude of the spread was set to 252, which is roughly equivalent to 8 bits. This means that the exponent of the "random" operand stays within a bound of ±4. The cost of this is an increase of slightly more than a factor of 10 in simulation time. This was judged to be a viable compromise, as the gain in randomness is high while the increase in runtime is moderate. While the exponent's range is bounded, the mantissa is still completely random.

Figure 5.2: Logarithmic to logarithmic plot of the runtime of a result based adder verification procedure against the uniformly spread seed numbers around the target result (y-axis: runtime in number of stimuli, 10^4 to 10^7; x-axis: spread factor, 10^-2 to 10^4)

While the VHDL standard only requires the real type to be at least 32 bits wide, the actual implementation of the real type in QuestaSim 10.6 uses 64 bits (i.e. a double precision float). As all the internal computations for the stimuli generation are done with reals, the estimates are computed with a much higher precision than the actual computations inside the cores. Having double precision floats as the real type enables this floating point verification suite to be used for floating point types which are smaller than or equal to 64 bits. There is, however, a limitation due to the integer type in VHDL being 32 bits wide, as integers are omnipresent in the OSVVM package. If the bit width of VHDL's integer type were increased to 64 bits, this floating point verification suite should be able to perform verification of floating point formats up to 64 bits wide with some minor adaptations. If the format to be verified is wider than 32 bits, there is one test in the current implementation (Test FP B2, which will be presented later) which would not work, as its bit width needs to be equal to or smaller than the integer width. The other tests could work with some moderate adaptation if the bit width were increased. To exercise these capabilities, certain functionalities of the floating point verification suite were tested using smaller floating point bit widths such as 16 bits. Another feature of the floating point verification suite is its expandability. The actual implementation of the floating point test suite is coded in a very modular and flexible way, so that the addition of a new operation like a floating point division could be done by adding only about 10 lines of code. This would also work for other operations such as logarithms, exponentials and multiply-accumulate (MAC). The only thing which would have to be checked is the performance of the testbench, i.e. how quickly all the test cases are generated. Creating "good" stimuli for the verification of a floating point operation is most difficult when the operation is an addition (or subtraction). As the stimuli generation was successfully implemented for the floating point addition, the testbench should be powerful enough to support all other operations as well (only MAC should be checked for performance).

5.3 Floating Point Comparison

All floating point numbers can be divided into two categories: ordered and unordered. The ordered floating point numbers contain all the denormalised and normalised numbers plus the infinite values, as all of them can be compared to each other. The unordered floating point numbers contain all the different NaN values, as it is not meaningful to say whether NaN > 1 or sNaN < qNaN [44].

5.3.1 Functional Description The above mentioned properties can be translated into the following requirements.

• Req FP Cmp 0 All denormalised, normalised and infinite floating point values shall be called ordered. All NaN values shall be called unordered.

• Req FP Cmp 1 Any comparison which contains at least one unordered floating point value shall return false. Any comparison among ordered floating point numbers shall return the result of the comparison of their encoded values according to Equations 5.4 and 5.5.

This implies that comparing a NaN value with itself returns false, even though the bitstrings are the same. This can be used to detect a NaN, as NaNs are the only numbers which return false when compared to themselves. Attention has to be paid when comparing infinities with each other, because infinity can encode any real number from those which are too big to be represented in the floating point format up to true mathematical infinity.

5.3.2 RTL Architecture The RTL architecture of the comparator is so simple that no block diagram is needed. To check two floating point numbers for equality, their bitstrings can be compared directly for equality, with the special case that the signed zeros shall compare equal even if their sign bits differ. Furthermore, if any of the inputs encodes a NaN, the output is false.
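The comparator's equality semantics can be modelled in C as follows (an illustrative bit-level sketch, not the project code; the function names are hypothetical). A model of this kind is also what a DPI-C reference implementation for Test FP Cmp could look like.

/* Illustrative bit-level model of the IEEE 754 equality comparison
 * implemented by the comparator core (names are hypothetical).       */
#include <stdint.h>
#include <stdbool.h>

static bool is_nan_bits(uint32_t f)
{
    return ((f >> 23) & 0xFFu) == 0xFFu && (f & 0x7FFFFFu) != 0u;
}

/* Equality on raw binary32 bit patterns:
 *  - any NaN operand makes the comparison return false (unordered),
 *  - +0 and -0 compare equal although their bit patterns differ,
 *  - otherwise bitwise equality decides.                              */
static bool fp_equal_bits(uint32_t a, uint32_t b)
{
    if (is_nan_bits(a) || is_nan_bits(b))
        return false;
    if (((a | b) & 0x7FFFFFFFu) == 0u)   /* both operands are a signed zero */
        return true;
    return a == b;
}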

5.3.3 Verification Environment The following test has to be performed for verification, which was heavily inspired by the tests proposed by [45].

• Test FP Cmp shall create 1e4 random test cases of all possible combinations of the positive and negative versions of the entries of Table 5.1. If the element is a range, a random element from within this range shall be taken. All the different comparisons shall be tested: equal, unequal, less, less or equal, greater, greater or equal. Pass criteria: the output of the DUV and the reference model shall be the same when there is a valid output (after resetting and waiting for the core's latency in clock cycles).

Reference Model The reference model for all kinds of comparisons can be taken directly from the C programming language, compiled into a dynamically linked library (with the correct C wrapper) and then accessed through the DPI-C interface from within SystemVerilog.

Testbench The overall testbench is implemented in VHDL (with some SystemVerilog parts). The actual test procedure was not implemented as such; however, the code later used for Test FP B1.1 (which will be presented later) could be reused to implement Test FP Cmp, as both tests are essentially the same. The stimuli generation and the result checking were not implemented because it became clear before the implementation started that the floating point comparison would not be used. However, if the core were needed, it could be verified quickly as the test procedure essentially already exists. While the RTL code is fully finished and functional, only 95% of the code needed for verification was completed; the remaining 5% was not implemented because the core was not used, which made verification obsolete. The remaining 5% would have consisted of putting already developed parts together.

5.3.4 Implementation and Verification Results Discovered Bugs During Verification Since it was decided, before the verification of all the floating point cores took place, that the floating point comparator will not be used in the final design, no real verification was performed for this core (only the testing during the debugging phase while designing the core).

Resource Usage The actual implementation of the core uses so few resources that the resource usage is negligible in the overall design (a few LUTs plus one register).

5.4 Integer to Floating Point Conversion

Every measurement of an analogue signal returns a raw value which is equivalent to a certain fixed point or integer representation. In order to perform any floating point operation on measurement data, the data has to be transformed from an integer/fixed point representation into a floating point representation. This transformation happens multiple times in the current system architecture, because in some places it is advantageous to return from the floating point representation to a fixed point representation, which is more precise in those specific cases. Another example, where not precision but resource usage is the reason to switch from floating point back to integer representation, is a place where a division has to be performed: the division can be performed with better precision and far less resource usage in integer representation than in a floating point format. A floating point divider would use much more resources and be much harder to verify: even Intel, a huge company, once had a bug in their floating point divider [46]. There is one important constraint for the integer to floating point converter: it has to work for non-standard bit widths of the integer numbers. From the hardware point of view this does not increase the design complexity significantly, but from the verification point of view it is more cumbersome, as the reference model needs to be flexible enough to support the variable bit width yet simple enough that it can easily be checked for correctness. A literature search for similar projects did not yield many interesting results. One worth mentioning is the VFloat project presented in [47], which also contains a fixed point to floating point converter. This core could theoretically have been used for this project. However, other floating point functionalities that are part of VFloat, like the floating point multiplication, do not fulfil the requirements of this project. As the floating point multiplier (and the adder as well) of the VFloat project were not found to be suitable, the other (simpler) cores were not used either. Although it would have been possible to verify and use them, the fact that they do not follow specific coding guidelines would probably have caused problems in the certification process.

5.4.1 Functional Description The functionality of the integer to floating point converter developed for this project can be expressed with the following requirement:

• Req FP Int2Float Any integer shall be rounded to its nearest floating point representation. If there are two floating point representations with the same arithmetic distance to the integer, the round to nearest even scheme is applied; in other words, the floating point representation with a zero in the LSB position of the mantissa is chosen.

This functional description can be translated directly into an RTL architecture that implements the operation in compliance with Req FP Int2Float.

5.4.2 RTL Architecture Figure 5.3 shows a simplified block diagram of the RTL architecture of the integer to floating point conversion core. The dashed lines signify clock cycle boundaries and the arrows indicate the principal data flow, which runs overall from the top (where the input is located) to the bottom (where the output is located). In the first clock cycle, the input integer, which arrives in two's complement format, is transformed into a sign/magnitude representation. The sign/magnitude representation requires one additional bit to be able to represent all possible two's complement integers (actually, only the smallest possible number, a one followed by all zeros, requires the additional bit). As any integer is either a normalised number or too big to be represented in floating point format, the rounding scheme can be adapted accordingly. In the second clock cycle, the position of the leading one in the magnitude of the sign/magnitude representation is determined and used to derive an intermediate exponent, the significant range for the mantissa and the bits needed to perform rounding (sticky and guard bits). In the third and final clock cycle, these intermediate results are put together to form the final result. The rounding bits are applied to the mantissa to form the rounded mantissa, and it is checked whether the intermediate exponent has to be incremented because the mantissa overflowed during rounding. Having calculated the final exponent and the rounded mantissa, the previously calculated sign bit can be added to form the floating point representation of the integer.

Figure 5.3: RTL architecture of the integer to floating point converter
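The conversion algorithm the RTL implements can be sketched behaviourally in C as follows (illustrative only, not the thesis RTL; the 36-bit input width is an assumption made here to match the width used later in the system): two's complement to sign/magnitude, leading-one detection, and round to nearest even.

/* Illustrative behavioural sketch of the 36-bit integer to binary32
 * conversion with round to nearest even (not the thesis RTL).        */
#include <stdint.h>

static uint32_t int36_to_float_bits(int64_t value)   /* value in [-2^35, 2^35 - 1] */
{
    uint32_t sign = (value < 0) ? 1u : 0u;
    uint64_t mag  = (value < 0) ? (uint64_t)(-value) : (uint64_t)value;

    if (mag == 0)
        return sign << 31;                   /* +/-0 */

    /* Leading-one detection: msb is the position of the highest set bit. */
    int msb = 63;
    while (((mag >> msb) & 1u) == 0)
        msb--;

    uint32_t exp = (uint32_t)(msb + 127);    /* biased exponent */
    uint64_t frac;

    if (msb <= 23) {
        frac = mag << (23 - msb);            /* exact, no rounding needed */
    } else {
        int      shift = msb - 23;
        uint64_t kept  = mag >> shift;                  /* 24 bits incl. implicit one */
        uint64_t rest  = mag & ((1ull << shift) - 1);   /* dropped bits               */
        uint64_t half  = 1ull << (shift - 1);           /* guard position             */
        /* Round to nearest, ties to even. */
        if (rest > half || (rest == half && (kept & 1u)))
            kept++;
        if (kept >> 24) {                    /* mantissa overflow during rounding */
            kept >>= 1;
            exp++;
        }
        frac = kept;
    }
    return (sign << 31) | (exp << 23) | ((uint32_t)frac & 0x7FFFFFu);
}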

As all the necessary bits for the round to nearest even mode are already generated, it would be very simple to add additional rounding modes (it would take only a few lines of code). If for any reason additional rounding modes were needed in a future development, it would not take much effort at all. Furthermore, the addition of rounding modes would not have a big influence on the code of the testbench; only the reference model would have to be adapted.

5.4.3 Verification Environment The reference C model is compiled and packed into a dynamically linked library with GCC and loaded into the simulation environment by QuestaSim at simulation time by setting the corresponding switch when starting the simulation. The verification plan consists of an exhaustive verification part and a part which is heavily inspired by the verification plan defined in [45] (in particular, the numbers of stimuli are taken from there). Exhaustive verification means that all possible combinations of input values (when using a black-box or white-box approach) and internal states (only when using a white-box approach) are covered. As the floating point cores are stateless, exhaustive verification means that all possible input combinations are applied to the core. The verification plan contains the following three tests:

• Test FP ExhaustiveInt32ToFloat32 Exhaustive verification shall be performed where the input integer is 32 bits wide. Pass criteria: the converted floating point values shall be bitwise the same for the DUV and the reference model when there is a valid output (after resetting and waiting for the core's latency in clock cycles).

• Test FP B25 The following numbers shall be converted from integer to their floating point representation: ±MaxInt (the biggest positive and smallest negative representable numbers in integer format); zero; ±1; and 2e3 random numbers. Pass criteria: the converted floating point values shall be bitwise the same for the DUV and the reference model when there is a valid output (after resetting and waiting for the core's latency in clock cycles).

• Test FP B26 The borders plus 2e3 random values inside each of the following bitstring intervals, interpreted as two's complement numbers, shall be applied: ["0...0"]; ["0...01"]; ["0...010", "0...011"]; ["0...0100", "0...0111"]; ...; ["10...0", "1...1"]. Pass criteria: the converted floating point values shall be bitwise the same for the DUV and the reference model when there is a valid output (after resetting and waiting for the core's latency in clock cycles).

As previously mentioned, the verification plan was heavily inspired by [45], which is also the source of the seemingly magic numbers such as the amounts of random stimuli in the verification plan. The naming convention with B25 or B26 in the name can be used to directly link a test defined in this document with the corresponding test in [45] that shares the same number, e.g. B25 and B26. This naming convention was used for all tests used for verifying the floating point cores: every time a test name carries a suffix consisting of a capital B together with a number, there exists a test in [45] with the same name.

Reference Model Creating a proper reference model for the integer to floating point conversion core was problematic, as non-standard integer widths are needed. If the integer width were 32 bits, many different libraries would have been available. For the exhaustive simulation with the 32 bit integer, the softfloat library provided by [48] could be used with the appropriate wrappers. However, for the case where the integer bit width is wider than 32 bits, a custom reference model had to be developed. There are two ways of doing so when using the DPI-C interface: the first is to adapt the softfloat library. This would be the proper way, but would take a considerable amount of time to perform all the required testing. The second is to use the built-in floating point instructions of the processor of the machine on which the verification is run. In order to do so, the configuration of the CPU registers which control the rounding modes of the FPU has to be controlled at runtime. The project uses the second way, but in order to have a certifiable model, the reference model would have to be reimplemented by an independent person. That person can choose how it is done, but the specifications have to be fulfilled.
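A minimal sketch of the second approach could look as follows (illustrative only; the actual DPI-C reference model differs). The host FPU performs the conversion, and the rounding mode is pinned to round to nearest even via <fenv.h> so that the result matches the mode implemented in the core.

/* Illustrative sketch of a host-FPU based reference conversion for a
 * non-standard integer width (here 36 bits, sign-extended into int64_t). */
#include <assert.h>
#include <fenv.h>
#include <stdint.h>

#pragma STDC FENV_ACCESS ON

float ref_int36_to_float(int64_t value36)   /* could be exported via DPI-C */
{
    /* Make sure the FPU uses the same rounding mode as the DUV. */
    fesetround(FE_TONEAREST);
    assert(fegetround() == FE_TONEAREST);

    /* On common x86-64/ARM targets the int64 -> float conversion follows
     * the FPU rounding mode, i.e. round to nearest even here.            */
    return (float)value36;
}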

Testbench The simplicity of the core to be verified does not justify building a proper transaction based verification environment, because the overhead of using transactions outweighs the gain in code reusability for other testbenches. Overall, there are two testbenches: one used for the exhaustive simulation and one used for the Test FP B25 and Test FP B26 tests. Figure 5.4 shows the general architecture of the two testbenches used for verification.

Testbench for exhaustive simulation In this case the Scoreboard & Stimgen component does not really exist: it is a process which generates all the bit combinations possible with 32 bits. The reference model is C code encapsulated by a SystemVerilog DPI-C wrapper. The checker process checks for bitwise equality between the reference model and the DUV and delays the output of the reference model by the latency of the DUV.

Testbench for Test FP B25 and Test FP B26 The testbench's Scoreboard & Stimgen component executes the required test procedures, which are implemented as directed tests, because the verification plan only sets conditions on the kind of input that has to be applied, which makes it simple to implement the tests as directed tests. Similar to the other testbench, the checker checks for bitwise equality and delays the output of the reference model to align it with the latency of the DUV.

Figure 5.4: Testbench architecture for floating point to integer and integer to floating point conversion

5.4.4 Implementation and Verification Results Applied Stimuli for Verification Table 5.2 shows the number of stimuli applied to the DUV per test procedure. The exhaustive test is not included, as the number of stimuli needed for an exhaustive verification can easily be determined (for a 32 bit input there are 2^32 possible input combinations). The results for Test FP B25 and Test FP B26 are not very interesting as they are generated by directed testing; they are included for the sake of completeness, since for the constrained random tests (which will be performed for other cores) it will be interesting to see how many stimuli need to be applied to reach full test coverage.

Table 5.2: Number of stimuli applied to complete the test procedures for the integer to floating point converter

Test Procedure      Test FP B25   Test FP B26
Number of Stimuli   2005          136138

Discovered Bugs During Verification The verification phase for this core was executed without the discovery of any bugs in the design.

Resource Usage The core was synthesised, placed and routed successfully with Vivado 2016.3 on a Zynq 7020 device using a clock frequency of 166.67 MHz, i.e. a clock period of 6 ns. Table 5.3 shows the resource usage of the integer to floating point converter after the place-and-route phase. The core uses slightly less than 1% of the available LUTs in the device and far fewer registers. The reason why so many LUTs are used compared to registers is that the addition is performed within the LUTs, which is not very efficient in terms of resource usage. One could try to put the adders inside DSP slices, but the results would not be much better, because implementing the addition in random logic (LUTs) allows for much more optimisation together with the surrounding logic. One can force Vivado to infer DSP slices, but the gain in terms of LUTs saved is marginal as the adders are not very wide, and the logic optimisation around the adders is then no longer possible.

Table 5.3: Resource usage of the integer to floating point converter

Resource        LUTs    Registers   F7 Muxes   F8 Muxes   DSP48E1
Usage [#]       472     107         0          0          0
Available [#]   53200   106400      26600      13300      220
Usage [%]       0.89    0.10        0.00       0.00       0.00

5.5 Floating Point to Integer Conversion

As previously explained in the section on the integer to floating point core, there are some places in the design where data has to be converted from a floating point representation back to an integer representation. Since the dynamic range of these intermediate results can vary, the floating point to integer converter must be able to operate on non-standard integer widths. From the hardware point of view this does not increase design complexity significantly, but from the verification point of view it is a bit more cumbersome, as the reference model needs to be flexible enough to support the variable bit width yet simple enough that it can easily be checked for correctness. Overall, the requirements and problems of the floating point to integer converter are very close to those of the integer to floating point converter. A literature search for similar projects did not yield many interesting results. Similarly to the integer to floating point converter, the VFloat project also contains a floating point to integer converter; the reason not to use it is the same as for the integer to floating point core. As a side note: while conducting the literature search some interesting papers were found, but they were mainly aimed at software implementations of the floating point to integer conversion for CPUs without an FPU. Some software papers on the integer to floating point conversion were found as well, although there were fewer publications on that topic than on the floating point to integer conversion.

5.5.1 Functional Description The functionality of the floating point to integer converter developed for this project can be expressed with the following requirement:

• Req FP Float2Int 0 Any floating point number shall be rounded to its nearest integer two's complement representation. If there are two integer two's complement representations with the same arithmetic distance to the floating point number, the round to nearest even scheme is applied; in other words, the integer two's complement representation with a zero in the LSB position is chosen. If the magnitude of the floating point number is too big to be representable as a two's complement integer, an error flag shall be raised and the integer value shall be set to the closest result to the floating point value.

• Req FP Float2Int 1 Any special value that is ordered (±∞) shall raise an error flag and be treated like a number that is too big to be represented. Any special value that is unordered (all kinds of NaN) shall also raise an error flag, and the integer value shall be set to minus one.

This functional description can be translated directly into an RTL architecture that implements the operation in compliance with Req FP Float2Int 0 and Req FP Float2Int 1.

5.5.2 RTL Architecture Figure 5.5 shows a functional block diagram of the floating point to integer converter. A dashed line in the diagram separates one clock cycle from the next. In a first step, the input float is denormalised, i.e. the mantissa is extended by its implicit bit and the exponent is decoded. If the value encoded by the bitstring contains a fractional part, the bits necessary for the rounding process (sticky, guard, etc.) are generated and multiplexed according to the exponent value of the float. If the value encoded by the bitstring has no fractional part, i.e. the LSB of the mantissa together with the exponent encodes a value bigger than 1, the mantissa needs to be shifted to the left by an amount defined by the exponent. All this takes place in the first clock cycle. In the second clock cycle, the generated rounding bits (sticky, guard, etc.) are combined with the extended (and possibly shifted) mantissa in order to perform the rounding. Finally, having created the rounded sign/magnitude representation, in the third clock cycle the sign/magnitude representation is converted into two's complement. In this stage an error flag is also generated if the input float is too big to be representable as an integer of the given bit width or if the input is a special value like NaN or ∞.

Figure 5.5: RTL architecture of the floating point to integer converter
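The required behaviour can be sketched behaviourally in C as follows (illustrative only, not the thesis RTL; the 36-bit output width is the one used in the later test). The error flag covers both the overflow case of Req FP Float2Int 0 and the special values of Req FP Float2Int 1.

/* Illustrative behavioural sketch of the float -> 36-bit integer conversion
 * required by Req FP Float2Int 0/1 (not the thesis RTL).                    */
#include <math.h>
#include <stdbool.h>
#include <stdint.h>

#define INT36_MAX  ((int64_t)  0x7FFFFFFFFLL)   /*  2^35 - 1 */
#define INT36_MIN  ((int64_t) -0x800000000LL)   /* -2^35     */

static int64_t float_to_int36(float x, bool *error)
{
    *error = false;

    if (isnan(x)) {                      /* unordered special values */
        *error = true;
        return -1;
    }
    /* rint() rounds to the nearest integer value, ties to even under the
     * default rounding mode; double keeps the value exact for the range check. */
    double r = rint((double)x);

    if (r > (double)INT36_MAX) {         /* +inf or too large */
        *error = true;
        return INT36_MAX;
    }
    if (r < (double)INT36_MIN) {         /* -inf or too small */
        *error = true;
        return INT36_MIN;
    }
    return (int64_t)r;
}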

As all the necessary bits for the round to nearest even mode are already generated, it would be very simple to add additional rounding modes (it would take only a few lines of code). If for any reason additional rounding modes were needed in a future development, it would not take much effort at all. Furthermore, the addition of rounding modes would not have any influence on the code of the testbench.

5.5.3 Verification Environment As the bit width of the floating point type is fixed to 32 bits, it is possible to perform an exhaustive verification. It takes a full day to complete, but it simplifies the verification plan and the test implementation significantly. Furthermore, the output bit width can remain variable, as it does not influence the duration of the verification. Exhaustive verification is definitely inefficient in terms of computation, as only about 10% of the 2^32 possible input values are actually converted into an integer; all the other values are either too big, too small or NaNs, none of which can be represented as integers. However, it simplifies the task of the engineer a lot, as no sophisticated verification plans need to be designed and implemented. Nevertheless, a proper verification plan has to be formulated in order to obtain a certifiable model. The overall verification environment is the same as the one used for the integer to floating point converter. The test plan contains the following test:

• Test FP ExhaustiveFloat32ToInt36 An exhaustive verification shall be performed where all possible 32 bit wide floating point numbers are converted into a 36 bit integer. Pass criteria: the converted integer values shall be bitwise the same for the DUV and the reference model when there is a valid output (after resetting and waiting for the core's latency in clock cycles).

The integer has to be 36 bits wide because that is the bit width used by the actual system.

Reference Model Since the reference model of the floating point to integer converter had the same kind of problem as the reference model for the integer to floating point converter, a custom reference model with an arbitrary integer width was developed in C. It relies heavily on the built-in functions of the FPU present in the CPU. Therefore, in addition to the actual floating point to integer conversion, some routines were developed that monitor the settings of the CPU register which controls, for example, the rounding mode of the FPU. Even though the difference in abstraction level between the synthesisable VHDL model and the high level C model using FPU functions is quite big, another reference model will have to be developed according to the requirements by somebody else in order to be SIL 2 certifiable.

Testbench The testbench is fairly simple. It iterates from 0 to 2^32 − 1 and feeds each bitstring to both the reference model and the DUV. The output of the reference model is delayed by the number of clock cycles the DUV needs to compute its output. After that, the outputs of the DUV and the reference model can be checked automatically. At the end of the simulation, the testbench reports whether a discrepancy between the behaviour of the DUV and the reference model was found. Figure 5.4 (the testbench architecture of the integer to floating point converter) can serve as an illustration of the different components/processes of the testbench. The Stimgen & Scoreboard process only creates all the possible stimuli; scoreboarding is not necessary as the stimuli are directed stimuli used for an exhaustive verification. The checker component this time is a simple process which delays the outputs of the DUV and the reference model and then compares them to check that all bits are equal.

Critical Discussion of the Testbench As already discussed, Test FP ExhaustiveFloat32ToInt36 is highly inefficient. In order to speed up the simulation, only the significant 10% of stimuli could be tested in detail while the remaining 90% would only be tested randomly. The reason why this was not implemented (even though it would speed up the simulation considerably) is that for a long time it was not exactly specified what the output bit width of the integer would have to be (although an upper bound was defined quite early in the development process). As it is critical to properly test the border between the region where an overflow occurs and the region where the conversion can be performed properly, and as these borders were not known a priori, it was decided not to perform randomised simulation over the "unusable" inputs. The one-day runtime was also weighed against the increased complexity of the testbench and its own verification: when only performing a "stupid" exhaustive verification, the testbench's internals are extremely simple and therefore very robust against design errors.

5.5.4 Implementation and Verification Results Discovered Bugs During Verification The verification phase for this core was executed without the discovery of any bugs in the design.

Resource Usage The core was synthesised, placed and routed successfully with Vivado 2016.3 on a Zynq 7020 device using a clock frequency of 166.67 MHz, i.e. a period of 6 ns. Table 5.4 shows the resource usage of the floating point to integer converter after the place-and-route step. The core uses roughly 0.5% of both the available LUTs and registers of the device, which is a moderate usage. This can be explained by the fact that only normalised numbers need to be rounded (denormalised numbers are too small to be represented as integers anyway). Furthermore, the rounding only has to be applied to at most the extended mantissa, which is much less than for other cores where a more precise intermediate result with a larger bit width exists. Finally, the sign/magnitude to two's complement conversion is a bitwise NOT followed by an incrementer, which is implemented in the LUTs. This final stage could in principle be put into a DSP slice, but the reason not to do it is the same as for the integer to floating point converter: there is quite a bit of logic around the adder with which Vivado can perform optimisation. This optimisation would no longer be possible if this stage were partially executed inside a DSP slice, and the gains in LUT usage would therefore be marginal.

Table 5.4: Resource usage of the floating point to integer converter

Resource        LUTs    Registers   F7 Muxes   F8 Muxes   DSP48E1
Usage [#]       293     163         0          0          0
Available [#]   53200   106400      26600      13300      220
Usage [%]       0.55    0.51        0.00       0.00       0.00

5.6 Floating Point Addition

In order to perform any digital signal processing task with data that has previously been converted into a floating point format, a floating point adder is needed. The vast majority of signal processing algorithms, like filtering or polynomial fitting, at some point need to perform an addition. As polynomial fitting has to be performed in CROME, a floating point adder was designed during this project. The adder can conveniently also perform subtractions; only the input which has to be subtracted needs to have its sign inverted. The VFloat project could have been used for the floating point addition, but it does not support subnormal numbers and was therefore not suitable, as full support of IEEE 754-2008 was required.

In contrast to the previously presented cores, where the literature search did not yield much, several interesting papers were found about floating point addition. In [49], an adder design based on the VHDL 2008 floating point package is presented. In general, one of the early questions in the project was why not to use the VHDL 2008 floating point package. The reason was that, on the one hand, a few bugs were found in the package and, on the other hand, the synthesis tools do not fully support the package and therefore may not synthesise it. Finally, the package does not allow easy pipelining of its data path to fulfil clock speed and latency requirements, and it is not coded in a way that infers the optimal hardware for the FPGA family used. Therefore the package was not used, and the approach of [49] was discarded as well. In [50] a floating point adder with extensive support of IEEE 754-2008 is presented; the report even comes with some source code. Theoretically, this core could have been used, as its functionality fulfils the requirements, but there were several issues. First, the verification documented in the report was not sufficient for this project and would therefore have had to be redone. Second, the core was not coded following any particular coding guideline, although the code is rather clean. Third and most importantly, the core was not coded in a way that allows optimisation for a particular FPGA architecture with particular hard macros, or that allows an arbitrary clock speed or latency to be reached easily. [51], [52] and [53] all present at least one floating point adder which has been optimised for FPGAs. Unfortunately, all these papers are rather old (15-20 years) and their architectures therefore do not necessarily exploit the capabilities of modern FPGAs. However, one of the ideas presented in [52] was used for this project: the authors first developed a non-pipelined design. This has the advantage that debugging is greatly simplified compared to developing a pipelined design right away. It brings the burden of pipelining the design at a later stage, but this is usually not very time consuming if the purely combinational design is written with pipelining in mind. [54] presents a floating point adder which was implemented on a rather recent FPGA. Even though the cores presented in this paper are probably the closest to those developed for this project, they could not be used for the following reasons: first, they do not support denormalised numbers. Second, the verification process was not detailed enough to allow the cores to be reused. And finally, the pipelining was done through a retiming process in Xilinx ISE: the author defined a combinational model of the circuits he wanted and added several flipflops in front of and behind the combinational path; ISE would then move these flipflops around during the implementation (place-and-route) phase in order to meet the timing requirements. This functionality is now supported in newer versions of Vivado, but it comes with several issues: for example, Vivado has problems inferring hard macros such as wide multipliers or adders when there are no registers around them, and ISE was less restrictive in coding style than Vivado, so some old designs do not synthesise [29]. Because of the above-mentioned reasons, it was necessary to design our own implementation.

5.6.1 Functional Description The functional requirements of the floating point adder can be formalised as follows:

• Req FP Add Any floating point addition shall first add the numeric values of the two inputs to form an infinitely precise intermediate result and then round this intermediate result using the round to nearest even scheme. If one of the inputs is ±∞ and the other is a real number, the result shall remain ±∞. If at least one of the inputs encodes any kind of NaN value, the output shall be a NaN value as well. If the inputs encode an operation which is mathematically undefined (such as ∞ − ∞), the output shall be set to NaN as well.

An architecture which performs the operations necessary to fulfil requirement Req FP Add is presented in the next section.

5.6.2 RTL Architecture

Figure 5.6 shows the simplified RTL architecture of the floating point addition core. As in the previous RTL block diagrams, the dashed lines represent clock cycle boundaries. The computation of the result can be divided into two major parts: the calculation of the infinitely precise intermediate result and the rounding of this intermediate result.

Calculation of the Infinitely Precise Intermediate Result In a first step the input operands are decoded: the mantissa is extended by the implicit bit and the exponent is decoded to its true value, in both cases depending on whether the operand is a normalised or a denormalised number. Then an intermediate exponent is calculated, which is equal to the exponent of the input operand with the bigger magnitude. In parallel, the extended mantissa of the input operand with the smaller magnitude is transformed from a sign/magnitude representation into a two's complement representation when the signs of the two operands differ; otherwise the extended mantissas of the input operands are interpreted as positive integers. All of this is completed in the first clock cycle. In the second clock cycle, the two's complement representations of the two extended mantissas are added (a subtraction is also done by an addition of the two's complement value). Before adding, the smaller magnitude number is shifted right (arithmetic shift) by the absolute value of the difference between the exponents of the two inputs. Concurrently, the exponent of the number with the larger magnitude is used to calculate the significant range of the addition's result for the rounding of a result that is assumed to be a denormalised number.

Rounding and Special Cases The third clock cycle is then used to multiplex the significant portion of the added mantissas for the normalised rounding (where the significant range is determined by the first one in the added mantissas) and for the denormalised rounding (where the significant range has been calculated in the previous clock cycle). The added mantissas are further used to create the bits needed for rounding, such as the sticky and the guard bit. These bits are multiplexed in a similar fashion as the significant ranges of the added mantissas, so that they are available for both the normalised and the denormalised rounding. In the fourth clock cycle the significant ranges are combined with the bits created in the previous clock cycle to perform the rounding step. If the rounding step results in an overflow of either of the two mantissas, the exponent of the corresponding mantissa is incremented in the same clock cycle. In the fifth and final clock cycle, the multiplexing between the normalised and denormalised result as well as the handling of special cases, such as one input operand being NaN or the input operands forming a mathematically undefined combination such as ∞ − ∞, is performed to form the final result. As all the necessary bits for the round to nearest even mode are already generated, it would be very simple to add additional rounding modes (it would take only a few lines of code). If for any reason additional rounding modes were needed in a future development, it would not take much effort at all. However, the change would have to be accounted for in the testbench, as the performance of the actual test implementation would be worse: the tests rely on the real type of VHDL, which uses round to nearest even as its rounding scheme. As long as 32 bit floats are tested, the influence is minor, since the VHDL real type in the 64 bit version of QuestaSim 10.6 is a 64 bit float (although the VHDL standard only requires at least 32 bits for the real type).

Figure 5.6: RTL architecture of the floating point adder

5.6.3 Verification Environment The verification plan for the floating point addition was, like those of the other cores, heavily inspired by [45]. Hence, for the verification of the floating point addition the following verification plan was created:

• Test FP ExhaustiveFaddFmult A first test has to be conducted with small bit widths for the floats in order to allow an exhaustive simulation. The bit widths for the floats are fixed to 8 bits (3 bits for the exponent and 4 bits for the mantissa) and 16 bits (5 bits for the exponent and 10 bits for the mantissa). Pass criteria: the output of the floating point operation shall be bitwise the same for the DUV and the reference model, with the special rule that all NaN values are considered to be the same and +0 and −0 are considered equal as well, when there is a valid output (after resetting and waiting for the latency of the core's number of clock cycles).

• Test FP B1.1 shall create 1e4 random test cases of all possible combinations of the positive and negative versions of the entries of Table 5.1. If an element is a range, a random element from within this range shall be taken (a minimal OSVVM sketch of this kind of input-defined coverage is given after this list). Pass criteria: the output of the floating point operation shall be bitwise the same for the DUV and the reference model, with the special rule that all NaN values are considered to be the same and +0 and −0 are considered equal as well, when there is a valid output (after resetting and waiting for the latency of the core's number of clock cycles).

• Test FP B1.2 shall create 1e4 outputs for each positive and negative element of Table 5.1 in a non-trivial manner.

Pass criteria: the output of the floating point operation shall be bitwise the same for the DUV and the reference model, with the special rule that all NaN values are considered to be the same and +0 and −0 are considered equal as well, when there is a valid output (after resetting and waiting for the latency of the core's number of clock cycles).

• Test FP B2 shall create 1e2 times an output which is equal to taking ±zero, ±min subnorm, ±max subnorm, ±min norm, ±one or ±max norm and flipping one random individual bit of its mantissa. The output shall be obtained in a non-trivial manner. If the element is a range, a random element from within this range shall be taken. Pass criteria: the output of the floating point operation shall be bitwise the same for the DUV and the reference model, with the special rule that all NaN values are considered to be the same and +0 and −0 are considered equal as well, when there is a valid output (after resetting and waiting for the latency of the core's number of clock cycles).

• Test FP B4 shall test the infinitely precise intermediate result, overflowing and rounding by creating 1e2 intermediate results which are in the range of ±max norm ±{1,2,3} ulp or smaller/larger than ±max norm ±3 ulp, where one ulp is equal to one increment/decrement of the mantissa. For clarification: each of the values < max norm −3 ulp, max norm −3 ulp, max norm −2 ulp, max norm −1 ulp, max norm, max norm +1 ulp, max norm +2 ulp, max norm +3 ulp and > max norm +3 ulp on the positive side (and the same on the negative side) represents a coverage bin. For the determination of the bin, the intermediate result shall be rounded using the round to nearest even scheme. Pass criteria: the output of the floating point operation shall be bitwise the same for the DUV and the reference model, with the special rule that all NaN values are considered to be the same and +0 and −0 are considered equal as well, when there is a valid output (after resetting and waiting for the latency of the core's number of clock cycles).
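The input-defined coverage of a test like Test FP B1.1 can be modelled with OSVVM's CoveragePkg roughly as sketched below. This is a hedged illustration only: the entity, the mapping of Table 5.1 entries to the integer identifiers 1 to 20 and the stimulus driving are assumptions and not part of the actual verification suite.

    library osvvm;
    use osvvm.CoveragePkg.all;

    entity cov_sketch is
    end entity;

    architecture sim of cov_sketch is
      shared variable operand_cov : CovPType;
    begin
      stim : process
        variable point : integer_vector(1 to 2);
      begin
        -- one bin per (operand A, operand B) combination of the special entries
        operand_cov.AddCross(GenBin(1, 20), GenBin(1, 20));
        while not operand_cov.IsCovered loop
          point := operand_cov.RandCovPoint;   -- "Intelligent Coverage": pick an uncovered bin
          -- ...map point(1)/point(2) to float bit patterns and drive the DUV here...
          operand_cov.ICover(point);           -- record the combination as applied
        end loop;
        operand_cov.WriteBin;                  -- print the coverage report
        wait;
      end process;
    end architecture;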

[45] proposes many more tests than the ones presented in this document. Most of these other tests are not applicable for this thesis as they are, for example, used to verify a floating point divider. However, there are some tests which could be used. The reason why they were not used is that they usually do not provide additional test cases. It was calculated that, using the above mentioned tests, the stimuli of the other relevant tests of [45] are generated anyway. However, the coverage of these additional tests is not measured. If desired by the certification authorities, the majority of the other tests could be added by just adding their coverage. No other stimuli generation would have to be written, as the already implemented tests create the necessary stimuli for most of the other tests.

Reference Model The reference model for the 32 bit floats uses as its core the softfloat library developed by [48], which is wrapped into custom C wrappers in order to make it more easily callable through the DPI-C interface from within SystemVerilog. The interaction with the SystemVerilog wrapper, as well as the delaying of the output of the reference model when the DUV is pipelined, is a task of the testbench. As the softfloat library is implemented purely in software and does not use the FPU of the processor, there is no need to monitor the settings of the FPU inside the processor as there would be if the FPU were used. The reference model for the exhaustive simulation with smaller bit widths (8 bits and 16 bits) uses the generic floating point type shipped with VHDL 2008, specialised to 8 or 16 bits respectively. Although several problems with the package were discovered during the development process, none

of those problems affected the addition of two numbers and therefore the package was judged sufficient to serve as a reference model. As the test Test FP ExhaustiveFaddFmult is not needed for certification (because the bit width of the core when it is actually used is different), the bugs in the VHDL 2008 floating point library should not affect the certification process. The purpose of Test FP ExhaustiveFaddFmult was more to demonstrate the capabilities of the developed RTL model and to aid debugging than to actually serve the certification process.
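For the 8 bit exhaustive test, the reference addition essentially boils down to the following sketch (hedged: it assumes the to_float/to_slv conversion overloads of the VHDL 2008 float_pkg and uses illustrative names; the real reference model is embedded in the exhaustive testbench):

    library ieee;
    use ieee.std_logic_1164.all;
    use ieee.float_pkg.all;

    entity ref_add8 is
      port (
        a, b : in  std_logic_vector(7 downto 0);
        sum  : out std_logic_vector(7 downto 0));
    end entity;

    architecture behav of ref_add8 is
      subtype float8 is float(3 downto -4);   -- 1 sign, 3 exponent and 4 fraction bits
    begin
      process (a, b)
        variable a_f, b_f, s_f : float8;
      begin
        a_f := to_float(a, a_f);              -- reinterpret the bit pattern as a float8
        b_f := to_float(b, b_f);
        s_f := a_f + b_f;                     -- float_pkg rounds to nearest even by default
        sum <= to_slv(s_f);
      end process;
    end architecture;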

Testbench Testbench for exhaustive simulation The testbenches for exhaustive simulation with 8 and 16 bit floats are fairly simple and very similar to the architecture shown in Figure 5.4. In this case, the Stimgen & Scoreboard component is a simple process which does nothing more than create all the possible input combinations for the given bit width. Scoreboarding for such a simple stimuli generation is not necessary. The checker process takes the outputs of the DUV and the reference model and delays the output of the reference model by the latency of the DUV. Finally, it checks whether the two output bitstrings are equal, whether both encode a NaN value (all NaNs are considered equal), or whether one encodes +0 and the other encodes −0 (the two values are mathematically equal but are encoded by different bitstrings).
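The equality rule of the checker can be written as a small VHDL function (a minimal sketch for the 32 bit format; the package and names are illustrative):

    library ieee;
    use ieee.std_logic_1164.all;
    use ieee.numeric_std.all;

    package checker_util is
      function results_match (duv, ref : std_logic_vector(31 downto 0)) return boolean;
    end package;

    package body checker_util is
      function results_match (duv, ref : std_logic_vector(31 downto 0)) return boolean is
        variable duv_nan, ref_nan, duv_zero, ref_zero : boolean;
      begin
        -- NaN: exponent field all ones and a non-zero mantissa
        duv_nan  := unsigned(duv(30 downto 23)) = 255 and unsigned(duv(22 downto 0)) /= 0;
        ref_nan  := unsigned(ref(30 downto 23)) = 255 and unsigned(ref(22 downto 0)) /= 0;
        -- zero: everything except the sign bit is zero, so +0 matches -0
        duv_zero := unsigned(duv(30 downto 0)) = 0;
        ref_zero := unsigned(ref(30 downto 0)) = 0;
        return duv = ref or (duv_nan and ref_nan) or (duv_zero and ref_zero);
      end function;
    end package body;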

Testbench for 32 bit input width testing Figure 5.7 shows the testbench that is used to perform all the tests defined in the verification plan except the exhaustive test. The core of this testbench is very similar to the simpler one shown in Figure 5.4. Like the simple testbench, it contains a Scoreboard & Stimgen component, a DUV, a reference model and a checker process. Additionally, it contains an integer to floating point and a floating point to integer component. These names are actually quite misleading; they stem from the data types used in the SystemVerilog wrapper in which these functions, which are written in C, are wrapped. The floating point to integer component does not actually perform a floating point to integer conversion: its task is to convert a VHDL real value first into a shortreal (a 32 bit IEEE 754-2008 number) and then into the bit pattern of that shortreal. The integer to floating point component performs the inverse: it takes a bit pattern, converts it into a SystemVerilog shortreal, which is then converted into a VHDL real. The reason why this has to be done is that VHDL does not provide any access to the bit pattern of its real values (the floating point package of VHDL 2008 does provide this functionality, but several bugs were found in this package). These two conversion functions are used by the Scoreboard & Stimgen process for the internal stimuli generation. The real to bit pattern conversion is needed because in the result based tests the two operands are calculated using reals, but they have to be output as bit patterns. Similarly, the coverage points (e.g. target results) in a result based test are equivalent to a bit pattern, but for the generation of the inputs the bit pattern has to be converted into a real. Their actual usage is quite tricky: a component's output signal is used like the return value of a function, and it has to be assured that the calling process stalls, within delta cycles, at the point where the conversion is needed until the converted value can be read. This is done by checking whether the new value to be shown to the component is sufficiently different from the value that is currently driven towards the component, so that the return value is expected to change. If the expected return value is different, the output towards the component is set to the new value and the event attribute of the return value is used to wait until the return value changes. This way, a process can stall within the same simulation time step until the output is ready, and the component can be used like a function. Similar to the simple testbench, the checker is a process which delays the output of the reference model by the latency of the DUV and compares the outputs for equality. Equality in this context means that two results are equal if they both consist of the same bitstring, they both encode a

NaN value or they both encode one of the two zero encodings. The reference model is similar to that of the simple testbench as well: it is a SystemVerilog component which encapsulates the reference model written in C. The Scoreboard & Stimgen component in this testbench is a lot more complex than the one used in the simple testbench. Internally, it calls the test procedures, which are part of the floating point verification suite. The test procedures themselves are mostly specialisations of the generic, operand oblivious result based testing procedures presented in Section 5.2, or directed testing procedures where the input to the DUV defines the test coverage so that simply the "Intelligent Coverage" feature of OSVVM can be used to create the appropriate test cases. Furthermore, at the beginning of the simulation the component calls procedures which create the test coverages for the different tests. While the scoreboarding of the input defined tests can be done easily within the component, the scoreboarding of the result/output based tests necessitates that the output of the DUV or the reference model is fed back in order to internally update the coverage and do the scoreboarding.
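The pattern of using a component like a function can be sketched as follows. In this minimal, hypothetical example the SystemVerilog conversion component is replaced by a trivial dummy process and all names are illustrative; only the handshake mechanism is of interest.

    library ieee;
    use ieee.std_logic_1164.all;
    use ieee.numeric_std.all;

    entity conv_call_sketch is
    end entity;

    architecture sim of conv_call_sketch is
      signal conv_in  : real := 0.0;
      signal conv_out : std_logic_vector(31 downto 0) := (others => '0');
    begin

      -- stand-in for the conversion component: reacts within one delta cycle
      converter : process (conv_in)
      begin
        conv_out <= std_logic_vector(to_unsigned(integer(conv_in), 32));  -- dummy conversion
      end process;

      caller : process
        variable bits_v : std_logic_vector(31 downto 0);
      begin
        -- drive a new value only if it differs from what the component already sees,
        -- so that the output is guaranteed to change; then stall on that change
        if conv_in /= 3.5 then
          conv_in <= 3.5;
          wait on conv_out;
        end if;
        bits_v := conv_out;   -- from here on conv_out is used like a return value
        report "converted bit pattern read";
        wait;
      end process;

    end architecture;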

Figure 5.7: Testbench architecture for floating point addition and multiplication

5.6.4 Implementation and Verification Results

Applied Stimuli for Verification

Table 5.5 shows the number of stimuli that were necessary to complete all the defined tests. As for the other cores, the number of stimuli for the exhaustive simulations is not given here either, as it does not provide any additional information. The first thing that can be seen is that result based testing is less efficient in terms of applied stimuli than input based testing. The only input based test is Test FP B1.1; all the others are result based tests. While Test FP B1.1 has 24 times more coverage points than Test FP B1.2, Test FP B1.2 takes 10% longer to complete than Test FP B1.1. This is due to the fact that it is difficult to exactly "hit" a certain result while keeping the inputs randomised. As explained in Section 5.2, a compromise between runtime and randomisation of the inputs had to be found. Besides the coverage points that are hard to "hit" (all the single values except zero; zero is easy to "hit"), the denormalised numbers are not simple to construct either. This can be seen when observing the performance of the test Test FP B2, where specific values in the denormalised range have to be constructed. Test FP B2 contains considerably fewer outputs to be generated than Test FP B1.2, but it still takes considerably longer to run. The reason is again that it is difficult to construct a certain result with additions, as the mantissas are not

necessarily aligned and therefore the construction of a value is imprecise. Interestingly, Test FP B4 did not take as much time to run even though it is a result based test. The reason is that the result which has to be constructed is an internal result which cannot really be measured; therefore it is difficult to accurately measure the coverage. The way it is done is by estimating the internal result and having a high enough probability that the result has been hit with the given number of test cases. As the intermediate results are all very close to each other, if one is not hit, it is likely that another is hit instead. Technically, one cannot be sure that all the expected intermediate values have really been "hit", however it is very probable (the confidence level was estimated).

Table 5.5: Number of stimuli applied to complete the test procedures for the floating point adder

Test Procedure       Test FP B1.1   Test FP B1.2   Test FP B2   Test FP B4
Number of Stimuli    4840000        5367406        8288479      827794

Discovered Bugs During Verification

During the verification process two bugs were found which both had the same cause: premature optimisation. The bugs had an impact either when both inputs had the same magnitude but different signs, or when the signs were different, the exponents had a certain (large) difference and the mantissas had specific values. The bugs caused the sign/magnitude to two's complement conversion stage, together with the extended mantissa addition stage, to produce erroneous values which caused the final result to have an error of one ULP. The bug fix consisted in removing the manual optimisation and rewriting the code properly. Although the impact of the bugs was of minor severity, it was the kind of bug which should not happen. As a side note: it was not possible to trigger the bugs with 8 or 16 bit floats, they were only expressed when the floats were 32 bits wide. As no simple reason for this behaviour could be found, some further investigation on this topic might be interesting.

Resource Usage

The core was synthesised and placed and routed successfully with Vivado 2016.3 on a Zynq 7020 device using a clock frequency of 166.67 MHz, i.e. a period of 6 ns. Table 5.6 shows the resource usage of the floating point adder after the place-and-route step. The resource usage of this core is significant as it uses more than 2 % of the available LUTs of the device. However, only about 0.5 % of the available registers are used, which is four times lower than the relative LUT usage. The reason for this is that complex combinational functionality is implemented in LUTs, which can also be seen in the fact that F7 multiplexers are used; they combine the outputs of several LUTs to realise more complex functions.

Table 5.6: Resource usage of the floating point adder

Resource        LUTs    Registers   F7 Muxes   F8 Muxes   DSP48E1
Usage [#]       1116    544         3          0          0
Available [#]   53200   106400      26600      13300      220
Usage [%]       2.10    0.51        0.01       0.00       0.00

5.7 Floating Point Multiplication

Besides addition, multiplication is the second most used operation in digital signal processing. In order to perform polynomial fitting, a multiplier for floating point numbers is needed in addition to the adder. Therefore this floating point multiplication core was developed. The literature research for floating point multipliers yielded overall the most interesting results of all the cores developed for this project. As multiplication is a complex and resource intensive operation, there are many optimisations of the actual multiplication. One example is [55], where carry save adders are used to create partial products of the multiplication. In other cases a special multiplication algorithm such as Booth encoding was used, but all of these interesting approaches are more suited for ASIC design than for an FPGA. The reasons are that these multiplication algorithms would have to be implemented in LUTs, which would take a lot of LUTs, and that the FPGA contains hard macro multipliers which lead to an overall much more efficient design when they are used instead of trying to build a Booth multiplier out of LUTs. In [56] a floating point multiplier in VHDL is presented. Although the paper clearly presents the floating point format, the verification of the core was not performed with enough rigour to be able to use it in a safety-critical environment. The papers [49], [51], [52] and [54] (which were all presented in the previous section) all contain a floating point multiplier as well, but they all suffer from the same shortcomings as their floating point adders. Therefore, our own implementation had to be developed.

5.7.1 Functional Description The functional requirements of the floating point multiplier can be formalised as follows:

• Req FP Mult Any floating point multiplication shall first multiply the numeric values of the two inputs to form an infinitely precise intermediate result and then round this intermediate result using the round to nearest even scheme. If one of the inputs is ±∞ and the other is a real number different from zero, the result shall remain ±∞. If at least one of the inputs encodes any kind of NaN value, the output shall be a NaN value as well. If the inputs encode an operation which is mathematically not defined (such as ∞ · 0), the output shall be set to NaN as well.

An architecture which performs the operations necessary to fulfil the requirement Req FP Mult is presented in the next section.

5.7.2 RTL Architecture Figure 5.8 shows a simplified model of the RTL architecture of the floating point multiplier. In the same manner as in the previous block diagrams, the dashed lines in the diagram indicate the boundary of one clock cycle. The computation of the result can be divided into two major parts: the calculation of the infinitely precise intermediate result and the rounding of the intermediate result.

Calculation of the Infinitely Precise Intermediate Result Initially, the two input data bitstrings are decoded in the sense that the true exponent value is calculated and the mantissa is extended by the implicit bit of normalised and denormalised numbers. In the same first clock cycle, the multiplication of the extended mantissas, which creates the infinitely precise intermediate mantissa, is started, and the two exponents are used to calculate the intermediate (infinitely precise) exponent of the product. The actual multiplication of the two extended mantissas takes a second clock cycle, and

in the meantime the intermediate exponent can be used to determine which range of the mantissa product would have to be rounded if the final result were a denormalised number.

Rounding and Special Cases The operation performed in the third clock cycle contains the critical path of the floating point multiplier. The functionality is to OR the bits of the complete multiplication result (bit0 or bit1 or bit2 or . . . ) up to each possible position, which creates an array of single bit values. These bit values are used for the rounding process at a later stage, and the correct bits of the array need to be chosen for rounding an assumed normalised and an assumed denormalised number. In a similar fashion, a range of the multiplication result has to be multiplexed assuming once that the final result will be normalised and once that it will be denormalised. The multiplexing for the assumed normalised result is based on the first set bit in the multiplication result; the multiplexing for the assumed denormalised result is based on the range calculated in the second clock cycle. Besides the high logic depth due to the numerous multiplexers, the main contribution to the path delay is the high fanout of the single multiplication result register. In the fourth clock cycle, the significant ranges from the multiplication result plus the special bits are used together to perform the rounding. At this stage any rounding algorithm defined in [44] could easily be implemented, but for the purpose of this project only the round to nearest even scheme is implemented. The reason why rounding has to be performed simultaneously for a denormalised and a normalised result is the following: there is a corner case in which the normalised rounded result indicates that the result has to be denormalised, but the denormalised rounded result had an overflow and became a normalised result. It cannot be known beforehand whether the result is going to be normalised or denormalised; therefore both have to be calculated individually and it has to be decided afterwards which is the correct one (in the mentioned case it is the denormalised result which became normalised). These rounding steps (like the wide OR-reduce function) are common to most cores, even if this was not explicitly mentioned before. The reason why it is mentioned only here is that the wide OR-reduce function has the biggest impact in this core, as the intermediate result that has to be rounded is the widest for the multiplier. The last and fifth clock cycle performs the multiplexing between the denormalised and the normalised result; in addition, it checks whether the inputs encoded a special value such as qNaN or +∞ and adapts the final result accordingly. This stage also catches errors like 0 · ∞ and, for example, puts a NaN value at the output. [29]
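The array of sticky bits can be generated with a simple prefix OR-reduce, sketched below (a hedged sketch; the package and function names are illustrative and not taken from the thesis code):

    library ieee;
    use ieee.std_logic_1164.all;
    use ieee.numeric_std.all;

    package sticky_util is
      -- sticky(i) is the OR of all product bits at or below position i
      function prefix_or (product : unsigned) return std_logic_vector;
    end package;

    package body sticky_util is
      function prefix_or (product : unsigned) return std_logic_vector is
        variable sticky : std_logic_vector(product'range);
        variable acc    : std_logic := '0';
      begin
        for i in product'low to product'high loop
          acc       := acc or product(i);
          sticky(i) := acc;
        end loop;
        return sticky;
      end function;
    end package body;

The sticky bit for any rounding position chosen later can then be picked from this array with a single multiplexer.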

5.7.3 Verification Environment

The same tests as defined for the floating point addition were applied to the floating point multiplication and therefore the verification plan is also heavily inspired by [45]. To name them all (without explaining again what they do, which can be found in Section 5.6), the following tests were applied to the floating point multiplier:

• Test FP ExhaustiveFaddFmult

• Test FP B1.1

• Test FP B1.2

• Test FP B2

• Test FP B4

Figure 5.8: RTL architecture of the floating point multiplier

Reference Model

The reference model used for the floating point multiplier has exactly the same structure as the one used for the floating point adder (its core is the softfloat library of [48]), which is accessed through a SystemVerilog component from within the VHDL testbench. The SystemVerilog component uses the DPI-C interface to link against the softfloat library’s C code.

Testbench

The testbench used for the verification of the 32 bit multiplier, and the complete environment in general, is the same as the one used for the floating point adder, which is shown in Figure 5.7 and described in Section 5.6.3. The only differences between the testbench for the floating point adder and the one for the floating point multiplier are which DUV and which reference model are actually bound and which operation is given as a parameter to the Scoreboard & Stimgen component. The testbench used for the exhaustive verification of the floating point multiplier with a bit width smaller than 32 bits is also extremely similar to the one used for the floating point adder; in this testbench only the DUV and the reference model differ.

5.7.4 Implementation and Verification Results Applied Stimuli for Verification Table 5.7 shows the number of stimuli that were necessary to successfully run all the tests. The interesting comparison is now with the number of stimuli used for the floating point adder, which is shown in Table 5.5. While the test Test FP B1.1 needs the same number of stimuli in both cases to complete (which makes sense as the coverage is based on the input and not on the output of the cores), the other test procedures complete with significantly fewer stimuli for the floating point multiplier than for the floating point adder. The reduction differs per test procedure: for Test FP B4 the adder takes only about 30% more stimuli than the multiplier, for Test FP B2 it needs almost 4 times more, and Test FP B1.2 is the worst with more than 10 times more stimuli needed for test completion. The reason for this general increase in runtime for the adder (which is equivalent to the number of stimuli) is that it is much harder to find a floating point value B which fulfils the equation C = A + B when C is fixed and A is a random number, than to find the floating point value B which fulfils the equation C = A · B under the same conditions for C and A. It can even be impossible to fulfil the equation in the case of the addition, depending on how much the target result C and the random seed A differ in magnitude. In the case of the multiplication it can also happen that the equation cannot be fulfilled; however, this is much less likely than for the addition, as with a multiplication a much bigger range of numbers can be reached with the given number of bits. The difference in the increase between the different tests can be explained by the number of single value coverage points compared to the number of range coverage points which have to be "hit". Test FP B1.2 contains many more single value coverage points, relative to the overall number of coverage points, than Test FP B2. This explains the difference, as the values in Test FP B2 are all within a very narrow range where a value can easily be "hit" by accident when another value is missed. Test FP B4 is not really comparable as the internal values are not measured directly inside the design but rather estimated from the outside.

Table 5.7: Number of stimuli applied to complete the test procedures for the floating point multiplier

Test Procedure       Test FP B1.1   Test FP B1.2   Test FP B2   Test FP B4
Number of Stimuli    4840000        358951         2214324      614816

Discovered Bugs During Verification During the verification process of the design, one bug was discovered and had to be resolved. The input combination that triggered the bug was the multiplication of a number with a very large magnitude by ±0. The result of such an operation was a very small but non-zero value instead of exactly zero, which would have been correct. The underlying problem was that ±0 values were treated like any other denormalised number; therefore the intermediate exponent calculation returned a non-zero result. This non-zero intermediate exponent entered the rounding stages and finally reached the output. The fix for this problem was implemented in the special case multiplexing stage, where cases like ∞ · 0 are treated. It consists in checking whether one of the input operands is zero while the other is neither a NaN value nor ∞, and then hard-coding the exponent and mantissa of the result to all zeros. This bug would have been of major severity had it stayed undetected. Overall, it was a corner case which was not considered during the design phase. Interestingly, the bug only had an influence on the correctness of the result when the floats were 32 bits wide; when the floats were 8 or 16 bits wide, the bug did not alter the results of the core. As no simple reason for this behaviour could be found, some further investigation on this topic might be interesting.
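The added condition can be condensed into a small predicate, sketched here with illustrative names (the actual fix sits inside the special case multiplexing stage, where the exponent and mantissa of the result are then forced to all zeros):

    -- true exactly when the result has to be forced to a (correctly signed) zero
    function force_zero_result (
      a_is_zero, b_is_zero, a_is_nan, b_is_nan, a_is_inf, b_is_inf : boolean)
      return boolean is
    begin
      return (a_is_zero and not (b_is_nan or b_is_inf)) or
             (b_is_zero and not (a_is_nan or a_is_inf));
    end function;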

Resource Usage The core was synthesised and placed and routed successfully with Vivado 2016.3 on a Zynq 7020 device using a clock frequency of 166.67 MHz, i.e. a period of 6 ns. Table 5.8 shows the resource usage of the floating point multiplier. In contrast to the other cores, it uses DSP slices. They are used to implement the wide multiplication necessary to create the intermediate product of the two input mantissas. If the DSP slices were not used, a very large number of LUTs would be necessary to implement this wide multiplication. Even with the DSP slices, the number of LUTs used is considerable. They are mainly used for the multiplexing of the mantissa product and for the additions necessary for the rounding process. Compared to the LUT usage, only few registers are used. The reason is that considerably more registers are used than shown in the table; however, they were merged into the DSP slices, which contain many registers that would otherwise remain unused. In order to properly infer the wide multiplier with the corresponding registers inside the DSP slice, a proper coding technique needs to be employed (sketched below). The fact that some F7 and F8 multiplexers were used indicates that the logic functions performed by the core are rather complex, because these multiplexers are usually used to aggregate the outputs of several LUTs to create bigger LUTs which can realise more complex functions. The place where the multiplexers are inferred is after the mantissa multiplication, where they are used to create the wide multiplexer for the significant ranges for the normalised and denormalised rounding as well as to create the bits necessary for the rounding, such as the guard and the sticky bit.
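A coding style that lets the synthesiser pull a wide, registered multiplication into the DSP48E1 slices is sketched below (a hedged sketch with illustrative names and widths; since a single DSP48E1 multiplies 25 x 18 bits, a 24 x 24 bit mantissa product presumably explains the two slices reported in Table 5.8):

    library ieee;
    use ieee.std_logic_1164.all;
    use ieee.numeric_std.all;

    entity pipelined_mult is
      port (
        clk  : in  std_logic;
        a, b : in  unsigned(23 downto 0);
        p    : out unsigned(47 downto 0));
    end entity;

    architecture rtl of pipelined_mult is
      signal a_r, b_r : unsigned(23 downto 0);
      signal m_r      : unsigned(47 downto 0);
    begin
      process (clk)
      begin
        if rising_edge(clk) then
          a_r <= a;            -- input registers (absorbed by the DSP slice)
          b_r <= b;
          m_r <= a_r * b_r;    -- multiply register
          p   <= m_r;          -- output register
        end if;
      end process;
    end architecture;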

Table 5.8: Resource usage of the floating point multiplier

Resource        LUTs    Registers   F7 Muxes   F8 Muxes   DSP48E1
Usage [#]       959     379         4          1          2
Available [#]   53200   106400      26600      13300      220
Usage [%]       1.80    0.36        0.02       0.01       0.91

5.8 Example Application of the Floating Point Cores: Temperature Compensation

Temperature compensation serves to eliminate the effect of temperature on the measurement electronics in order to measure the charges generated by radiation accurately. The charges inside the gas chamber are generated by particles (radiation) travelling at high speed through the gas and ionising atoms on their way. The charges (current) are collected and converted into a voltage by something similar to a transimpedance amplifier and are then digitised by an ADC. The relation between the voltage measured by the electronics, the influence of the temperature and the true voltage due to radiation can be modelled as

Vtrue = VADC − Vtemperature (5.14)

where Vtrue is the true voltage corresponding to the level of radiation, VADC is the voltage measured by the ADC and Vtemperature models the effect of the temperature on the voltage. Simulation and testing have shown that Vtemperature can be modelled as a fourth order polynomial of the form

Vtemperature = a4 T^4 + a3 T^3 + a2 T^2 + a1 T + a0 (5.15)

where T is the temperature of the electronics and ax are the polynomial coefficients. A static polynomial by itself would not necessitate floating point arithmetic. The worst term (in terms of resource usage) could be calculated with a fixed point multiplier on the order of 80 bits wide (which is wide, but with pipelining this would be feasible) if designed properly. However, the polynomial depends on the actual electronic parts used and is therefore different for every system. In order to find the correct polynomial, a calibration needs to be performed for every system. This calibration takes two days, as the temperature has to be swept over a large range and the sweep needs to be done sufficiently slowly so that the electronics show no transient effects. As the polynomial coefficients vary greatly between different systems, it is necessary to perform the calculation in floating point to obtain accurate results. First, the measured temperature, which is in fixed point format, is converted to a floating point number. Then the polynomial is calculated using two multipliers and one adder. Finally, the calculated Vtemperature is converted back to fixed point in order to subtract it from VADC, which is also in fixed point. The reason why Vtrue has a fixed point format is that some further calculations are done later which can be performed more conveniently, and with a higher precision, in fixed point than in floating point. In order not to use too many resources, resource sharing is used for calculating the polynomial. The complete polynomial is calculated with two multipliers and one adder, and these components can be used in other parts of the design as well through resource sharing. A FSM schedules which calculation is done when with the floating point cores. The scheduling of the calculation can theoretically be done in an arbitrary way; however, it is important that the operands of a floating point addition are of the same order of magnitude, otherwise a larger rounding error than necessary is induced. Therefore it is better to calculate the polynomial as (a4 T^4 + a3 T^3) + ((a2 T^2 + a1 T) + a0) rather than using a naive approach like (((a4 T^4 + a3 T^3) + a2 T^2) + a1 T) + a0, which induces a much larger error in the system simply because the intermediate operands of the additions differ more in magnitude.
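The effect of the chosen evaluation order can be illustrated with a purely behavioural VHDL sketch (the coefficient and temperature values below are made up, and this is not the RTL scheduler, which time-multiplexes one shared multiplier and one shared adder under FSM control):

    entity poly_order_sketch is
    end entity;

    architecture sim of poly_order_sketch is
    begin
      process
        constant a4 : real := 1.0e-9;   constant a3 : real := -2.0e-7;
        constant a2 : real := 3.0e-5;   constant a1 : real := -1.0e-3;
        constant a0 : real := 0.5;
        variable t, t2, t3, t4     : real;
        variable high, low, v_temp : real;
      begin
        t  := 42.0;                       -- measured temperature (illustrative)
        t2 := t * t;
        t3 := t2 * t;
        t4 := t2 * t2;
        high   := a4 * t4 + a3 * t3;      -- pair terms of similar magnitude
        low    := (a2 * t2 + a1 * t) + a0;
        v_temp := high + low;             -- balanced final addition
        report "Vtemperature = " & real'image(v_temp);
        wait;
      end process;
    end architecture;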

6 Conclusion and Outlook

This chapter is split into two parts: in the first part a conclusion of the work done is drawn, and in the second part an outlook on future work is given.

6.1 Conclusion on the Completed Work

There are three main fields in which significant work was completed. The first and most important field was the design and verification of several floating point cores. The second field was securing the PS/PL data transfer within the SoC. These two fields are both part of the greater objective of the overall project, which is to create a reliable system for a safety-critical application. The various methods that help to accomplish this objective, developed as part of the work on the floating point cores and the PS/PL transfer core, could be assembled into an overall methodology which can also be applied to other parts of the project. The creation of this collection of methods was the third field where a lot of work was done.

6.1.1 Floating Point Core Design and Verification The main task of the thesis was to design several floating point cores and to verify them. Furthermore, the cores should be used to perform temperature compensation for the measurement. All these tasks were completed. While the necessity to design floating point cores on our own became clear very early in the design process (as no certifiable cores existed), how to actually verify them was a question that could not easily be answered. Although formal verification seemed very appealing, it was not possible to perform it within the given time frame and therefore a more traditional verification approach was chosen. During the actual design process there were some issues with the Vivado synthesiser, as it is extremely picky regarding the coding style when proper hardware inference is needed. Unfortunately, there were no comparable cores available, although a comparison in terms of performance and resource usage between the newly designed cores and some existing ones would have been very interesting. Although the RTL design of the floating point cores already took a considerable amount of time, the majority of the workload originated from the verification process. First, a proper verification plan had to be developed. Fortunately, there was a very detailed verification plan developed by IBM available in [45], because most of the other major companies in this field publish very little to nothing of their work. Therefore, the verification plan was heavily based on the experience that IBM has gathered while designing their own FPUs. The actual implementation of the verification software leverages the capabilities of OSVVM together with the generic packages introduced in the VHDL 2008 standard to create a generic floating point verification environment. The verification environment was able to discover two bugs in the floating point cores (one in the floating point adder and one in the floating point multiplier), which otherwise would have stayed undiscovered despite the considerable amount of testing already done during the design phase. Finally, the floating point cores were added to the full system design. Besides adapting the system to the specific latency of the cores, their introduction did not require much effort (most of the work to include the cores in the existing system was done by Ciarán Toner). As soon as the cores were

integrated into the system, it was possible to successfully perform temperature compensation on the measurement. The temperature compensation performed by the cores was compared to the previously used software implementation, and both calculated the same results.

6.1.2 Securing PS/PL Data Transfer

Due to problems with the Linux kernel, the original task of replacing the existing AXI GPIO based PS/PL transfer system with a BRAM based system had already been started by Ciarán Toner before this thesis started. In the beginning, there were some problems with the usage of the BRAM, but those problems were quickly resolved. The task was then changed to securing the transfer between the PS and the PL. Before starting the design of the actual core, some research on how to increase the reliability of safety-critical systems was conducted. The outcome of this research was presented in Chapter 3 and is summarised in the next section. Keeping the different challenges of safety-critical systems in mind, a core was designed for the PS/PL transfer which is robust against SEUs and against malfunctioning of the parts of the system which are not SIL 2, such as the Linux system running on the PS. The correct functionality of the core was tested using a bus functional model (BFM) simulation of the core. A verification plan for the core was written as well, but the actual verification could not be done, as independence between the design and the verification engineer is required to achieve SIL 2. The software stack for Linux that is necessary to use the core was developed and tested. However, the core together with the software has not been pushed to the complete project yet due to the lack of time. There is other testing to be done right now and adding another component to test would unnecessarily increase the workload. Therefore, the replacement of the currently used BRAM based system was postponed. Another reason not to include the core yet is that it has not been fully verified: although the correct functionality has been shown, its robustness has not yet been verified through simulation, as this is part of the proper verification process.

6.1.3 General Methodologies to Increase the Reliability of the Design

When a chip is properly supplied with power, the main reason why it could still encounter problems during runtime are SEUs. As reliability is the key objective for any safety-critical system, a methodology was developed to increase the reliability of the full system. The shortcomings of the current system architecture were discussed in Chapter 2, as well as a way to remediate those issues. One key problem in terms of reliability for the SIL 2 functionality running in the PL is a malfunctioning PS. In Chapter 2 it was discussed how it can influence the CRAM, but it can also influence the SIL 2 functionality running in the PL by sending bad configuration data. To this end, the robust PS/PL communication core was developed. One of its key functionalities is to decouple the functionality in the PL from the PS. From the methodological point of view, it is important to decouple safety-critical functionalities from non-safety-critical functionalities, which was achieved through this core. As Ciarán Toner was cleaning up the project and redoing parts anyway, the occasion was used to unify the coding style of the complete project and to introduce the safe FSM encoding techniques presented in Section 3.3 all over the project.

6.2 Future Work

Although the initially required tasks were fulfilled, more things to do were discovered during the course of this thesis; they are now scheduled to be done in the near future.

6.2.1 General Methodologies to Increase the Reliability of the Design The attentive reader might have noticed that in Chapter 5 no techniques to increase the reliability were mentioned. In order to increase the reliability of the floating point cores, LTMR would have to be used. However, this is not possible as the cores internally use hard macros of the FPGA which contain registers that have to be used and therefore cannot be triplicated. For this reason, and also because it is in general more reliable from the system point of view, BTMR will be applied to the full system, which will be triplicated inside the FPGA to perform exactly the same task three times (at the end the results are "multiplexed" by a majority voter). Using BTMR will even harden the system against SETs (although they are not really a problem at the clock speed of the current system) and not only against SEUs. The SEM IP from Xilinx, which can be used to protect the CRAM, was briefly presented in Section 3.3.3. Although this is definitely something that has to be done for a high reliability and high availability system like CROME, it was out of scope for this thesis. The implementation is especially tricky together with the Linux kernel, which has to be modified to hand the control over the device configuration primitive to the ICAP and not to keep it for itself and the PCAP. Furthermore, it has to be evaluated whether the bundled IP from Xilinx, which incorporates both SEM and anti-tamper functionalities, can be used, as it falls under ITAR. When using BTMR, the individual blocks that are triplicated have to be physically separated in order to achieve maximum independence between the blocks. Xilinx actually provides a design flow for this task which is already certified up to SIL 3. According to this flow, it is not enough to only separate the individual blocks; it is also necessary to leave some space in between the blocks in order to minimise the influence of one on the others. This process is called fencing. Furthermore, the fencing process requires that only a special kind of routing resource is used between the blocks. All of these functionalities are already provided and described by the isolation design flow. [57] Finally, when adapting the software stack so that the robust PS/PL communication core can be used, it makes sense to also integrate the changes proposed in Section 4.2.1, e.g. to use the flash-based off-chip storage present on the PicoZed to store the configuration data over a long time in a rad-hardened way. This would increase the reliability of functionalities that were not initially deemed safety-critical, but which turned out to be able to cause problems in the safety-critical domain due to the expected long uptime of the CROME nodes. A SEU in a memory location of the DDR2 DRAM connected to the PS which contains configuration data can have a severe impact on the reliability of the system, as the next time the configuration is updated, bad configuration data is sent to the PL.

6.2.2 Architectural Modifications to Increase the Reliability of the Design There was always the undisputed intention to add a second FPGA to monitor the correct functioning of the Zynq. This second FPGA would check things like whether the Zynq is provided with a valid clock source and whether it gets out of the reset state, and it could possibly also monitor some watchdog timers provided by the PL of the Zynq to check if everything is okay. However, as shown in Section 2.2.2, the current system architecture, even if a second FPGA were added, has problems achieving a high availability, because it is not possible to remotely restart the system when Linux in the PS crashes. A system architecture that remediates those issues is presented in Section 2.3. In the near future, the

requirements of such a system architecture will have to be formalised in order to outsource the design of a custom PCB which incorporates the proposed architecture. It is important to start this process early, as the external design of the board will take some time and some more time will be needed to actually program the components on the board. Although it is possible to already buy evaluation boards of the main components (which has been done) and to start developing on them, it is definitely better to start working on the real boards as early as possible, as potential problems with the board and the design are then discovered earlier, when the components are used in their future environment.

Glossary

AC Alternating Current.

ACL2 A Computational Logic for Applicative Common Lisp.

ACP Accelerator Coherency Port.

ADC Analog to Digital Converter.

AMBA Advanced Microcontroller Bus Architecture.

AMP Asymmetric-Multi Processor.

API Application Programming Interface.

ARCON ARea CONtrol.

ASIC Application-Specific Integrated Circuit.

AXI Advanced eXtensible Interface.

AXI GP AXI (Advanced eXtensible Interface) General Purpose (Port).

AXI HP AXI (Advanced eXtensible Interface) High Performance (Port).

BFM Bus Functional Model.

BIST Built-In Self-Test.

BRAM Block RAM (Random Access Memory).

BTMR Block TMR (Triple Modular Redundancy).

CAN Controller Area Network.

CAU CROME Alarm Unit.

CERN Conseil Europ´eenpour la Recherche Nucl´eaire.

CMOS Complementary Metal-Oxide-Semiconductor.

CMPU CROME Measuring and Processing Unit.

COTS Commercial Off-The-Shelf.

CPU Central Processing Unit.

CRAM Configuration RAM (Random Access Memory).

CRC Cyclic Redundancy Checks.

CROME CERN RadiatiOn Monitoring Electronics.

CUPS CROME Uninterruptible Power Supply.

DAP Debug Access Port.

DC Direct Current.

DDR2 Double Data Rate 2.

DevC Device Configuration (Device).

DMA Direct Memory Access.

DO-254 Design Assurance Guidance for Airborne Electronic Hardware.

DPI Direct Programming Interface.

DRAM Dynamic RAM (Random Access Memory).

DSP Digital Signal Processing.

DUV Device Under Verification.

ECC Error Correcting Code.

eMMC embedded Multi Media Card.

FIT Failure In Time.

FLI Foreign Language Interface.

FMEA Failure Mode and Effects Analysis.

FPGA Field Programmable Gate Array.

FPU Floating Point Unit.

FSBL First Stage Boot Loader.

FSM Finite State Machine.

GCC GNU C Compiler.

GPIO General Purpose Input/Output.

HDL Hardware Description Language.

HOL Higher Order Logic.

I2C Inter-Integrated Circuit.

IC Integrated Circuit.

ICAP Internal Configuration Access Port.

IEC International Electrotechnical Commission.

IEEE Institute of Electrical and Electronics Engineers.

IO Input/Output.

IP Intellectual Property.

ITAR International Traffic in Arms Regulations.

JTAG Joint Test Action Group.

LBIST Logic Built-In Self-Test.

LFSR Linear-Feedback Shift Register.

LHC Large Hadron Collider.

LLIM Low-Latency Interrupt Mode.

LSB Least Significant Bit.

LTL Linear Temporal Logic.

LTMR Local TMR (Triple Modular Redundancy).

LUT Lookup Table.

MAC Multiply-Accumulate.

MASC Modeling Algorithms in SystemC.

MBIST Memory Built-In Self-Test.

MD Message-Digest.

MISO Master In Slave Out.

MMU Memory Management Unit.

MSB Most Significant Bit.

NaN Not a Number.

NASA National Aeronautics and Space Administration.

NDA Non-Disclosure Agreement.

OCM On-Chip Memory.

OOP Object-Oriented Programming.

OSVVM Open Source VHDL Verification Methodology.

PCAP Processor Configuration Access Port.

PCB Printed Circuit Board.

PHA Preliminary Hazard Analysis.

PL Programmable Logic.

PLI Programming Language Interface.

PLL Phase-Locked Loop.

PS Processing System.

PSL Property Specification Language.

PVS Prototype Verification System.

qNaN quiet NaN (Not a Number).

QoS Quality of Service.

QSPI Quad-SPI (Serial Peripheral Interface).

RAMSES RAdiation Monitoring System for the Environment and Safety.

ROM Read-Only Memory.

RTL Register Transfer Level.

RTOS Real-Time Operating System.

SCADA Supervisory Control and Data Acquisition.

SD Secure Digital (Memory Card).

SDIO Secure Digital Input Output.

SECDED Single-Error Correcting, Double-Error Detecting.

SEFI Single Event Functional Interrupt.

SEM Soft Error Mitigation.

SET Single Event Transient.

SEU Single Event Upset.

SHA Secure Hash Algorithm.

SIL Safety Integrity Level.

SIMD Single Instruction, Multiple Data.

sNaN signalling NaN (Not a Number).

SoC System on Chip.

SoM System-on-Module.

SPI Serial Peripheral Interface.

SRAM Static RAM (Random Access Memory).

SSBL Second Stage Boot Loader.

SSE Streaming SIMD (Single Instruction, Multiple Data) Extensions.

TCL Tool Command Language.

TCM Tightly Coupled Memory.

TCP/IP Transmission Control Protocol/Internet Protocol.

TMR Triple Modular Redundancy.

TN Technical Network.

UART Universal Asynchronous Receiver Transmitter.

UIO Userspace Input/Output.

ULP Unit in the Last Place.

USB Universal Serial Bus.

UVM Universal Verification Methodology.

VHDL VHSIC (Very High Speed Integrated Circuit) Hardware Description Language.

VHPI VHDL Procedural Interface.

VLAN Virtual Local Area Network.

VPI Verilog Procedural Interface.

ZSBL Zero Stage Boot Loader.


Bibliography

[1] G. S. Millan, D. Perrin, and L. Scibile, "RAMSES: The LHC Radiation Monitoring System for the Environment and Safety," in Proceedings of the 10th ICALEPCS Int. Conf. on Accelerator & Large Expt. Physics Control Systems, 2005.

[2] S. Hurst, “Reliability Analysis of CERN Radiation Monitoring Electronics System CROME,” Master’s thesis, University of Stuttgart, 2017.

[3] International Electrotechnical Commission, "IEC 61508-2:2010: Functional Safety of Electrical/Electronic/Programmable Electronic Safety-Related Systems," 2010.

[4] M. Widorksi, "CMPU Functional Requirements," Organisation Européenne pour la Recherche Nucléaire - Occupational Health and Safety and Environmental Protection Unit, Tech. Rep., 8 2016.

[5] PicoZed Zynq 7010/7020 SoM: Hardware User Guide, Avnet, 7 2016, v1.7.

[6] M. Sadri, C. Weis, N. Wehn, and L. Benini, “Energy and Performance Exploration of Accelerator Coherency Port Using Xilinx ZYNQ,” in Proceedings of the 10th FPGAworld Conference. ACM, 2013, p. 5.

[7] Zynq-7000 All Programmable SoC Technical Reference Manual, Xilinx Inc., 9 2016, v1.11.

[8] Zynq-7000 All Programmable SoC Software Developers Guide, Xilinx Inc., 9 2015, v12.0.

[9] Security Monitor IP Core, Xilinx Inc., 2015.

[10] P. Aristote, “DO-254 Training,” November 2016.

[11] G. Stylianos, "Lockstep Analysis for Safety-Critical Embedded Systems," Master's thesis, Aalborg University, 2015.

[12] RM48L952 16- and 32-Bit RISC Flash Microcontroller, Texas Instruments, 6 2015.

[13] Highly Integrated and Performance Optimized 32-bit Microcontrollers for Automotive and Industrial Applications, Infineon, 9 2016.

[14] H. I. S. WITTENSTEIN, "Embedded Architectures Supporting Mixed Safety Integrity Software," Tech. Rep., 2016.

[15] Product Brief: AURIX™ - TC275T/TC277T, Infineon, 8 2016.

[16] C. Menon and S. Guerra, “Field Programmable Gate Arrays in Safety Related Instrumentation and Control Applications,” 2015.

[17] J. Fletcher, “An Arithmetic Checksum for Serial Transmissions,” IEEE Transactions on Communications, vol. 30, no. 1, pp. 247–252, 1982.

[18] T. C. Maxino and P. J. Koopman, "The Effectiveness of Checksums for Embedded Control Networks," IEEE Transactions on Dependable and Secure Computing, vol. 6, no. 1, pp. 59–72, 2009.

[19] T. K. Moon, "Error Correction Coding," Mathematical Methods and Algorithms. John Wiley and Sons, 2005.

[20] "IEEE Standard for Information technology - Local and metropolitan area networks - Specific requirements - Part 15.1a: Wireless Medium Access Control (MAC) and Physical Layer (PHY) Specifications for Wireless Personal Area Networks (WPAN)," IEEE Std 802.15.1-2005 (Revision of IEEE Std 802.15.1-2002), pp. 1–700, June 2005.

[21] R. W. Hamming, “Error Detecting and Error Correcting Codes,” Bell Labs Technical Journal, vol. 29, no. 2, pp. 147–160, 1950.

[22] S. Tam, Single Error Correction and Double Error Detection, Xilinx Inc., 8 2006, v2.2.

[23] C. Paar and J. Pelzl, Understanding cryptography: a textbook for students and practitioners. Springer Science & Business Media, 2009.

[24] Neutron-Induced Single Event Upset (SEU) FAQ, Microsemi Inc., 8 2011.

[25] Introduction to Single-Event Upsets, Corporation, 9 2013, 1.0.

[26] Enhancing Robust SEU Mitigation with 28-nm FPGAs, Altera Corporation, 7 2010, 1.0.

[27] Product Guide: Soft Error Mitigation Controller, Xilinx Inc., 9 2015, v4.1.

[28] A. Sutton, Achieve Functional Safety & High Uptime Using TMR, Synopsys, 12 2015.

[29] Vivado Design Suite User Guide - Synthesis, Xilinx Inc., 11 2016, v2016.4.

[30] A. Vachoux, "Lecture Notes for Hardware Systems Modelling I," 2016, École Polytechnique Fédérale de Lausanne.

[31] J. Harrison, “Floating Point Verification in HOL Light: the Exponential Function,” in International Conference on Algebraic Methodology and Software Technology. Springer, 1997, pp. 246–260.

[32] S. Boldo and C. Munoz, “A High-Level Formalization of Floating-Point Number in PVS,” 2006.

[33] D. Russinoff, A Formal Theory of Register-Transfer Logic and Floating-Point Arithmetic. http://www.russinoff.com/libman/text/, 2017.

[34] S. Beyer, C. Jacobi, D. Kroening, and D. Leinenbach, “Correct hardware by synthesis from pvs,” Submitted for publication, 2002.

[35] D. Leinenbach and P. P. D. W. Paul, "Implementierung eines maschinell verifizierten Prozessors," Ph.D. dissertation, Saarland University, Germany, 2002.

[36] P. S. Miner, “Defining the IEEE-854 Floating-Point Standard in PVS,” 1995.

[37] D. M. Russinoff, “A Mechanically Verified Commercial SRT Divider,” 2010.

[38] J. W. O'Leary and D. M. Russinoff, "Modeling Algorithms in SystemC and ACL2," arXiv preprint arXiv:1406.1565, 2014.

[39] J. Lewis. (2016) Open Source VHDL Verification Methodology. [Online]. Available: http://osvvm.org/

[40] ——, “Functional Coverage Using CoveragePkg,” SynthWorks Design Inc., 2013.

[41] “IEEE Standard for SystemVerilog–Unified Hardware Design, Specification, and Verification Language,” IEEE Std 1800-2012 (Revision of IEEE Std 1800-2009), pp. 1–1315, Feb 2013.

[42] “IEEE Standard VHDL Language Reference Manual,” IEEE Std 1076-2008 (Revision of IEEE Std 1076-2002), pp. 1–620, Jan 2009.

[43] Questa® SIM User's Manual, Mentor Graphics®, 2016, Software Version 10.6.

[44] D. Zuras, M. Cowlishaw, A. Aiken, M. Applegate, D. Bailey, S. Bass, D. Bhandarkar, M. Bhat, D. Bindel, S. Boldo et al., “IEEE Standard for Floating-Point Arithmetic,” IEEE Std 754-2008, pp. 1–70, 2008.

[45] FPgen Team at IBM Labs in Haifa, “Floating-Point Test-Suite for IEEE,” 2008.

[46] B. Ljepoja, T. Pfennig, and S. Rosenegger. (2003) Analyse von Softwarefehlern: Der Pentiumbug. [Online]. Available: https://www5.in.tum.de/lehre/seminare/semsoft/ unterlagen 02/pentiumbug/website/

[47] X. Wang and M. Leeser, “Vfloat: A Variable Precision Fixed-and Floating-Point Library for Reconfigurable Hardware,” ACM Transactions on Reconfigurable Technology and Systems (TRETS), vol. 3, no. 3, p. 16, 2010.

[48] J. Hauser, “The SoftFloat and TestFloat Validation Suite for Binary Floating-Point Arithmetic,” University of California, Berkeley, Tech. Rep, 1999.

[49] A. Youssef and M. Bogdan, "Synaptisches Rauschen auf Grundlage der Poissonverteilung: Systementwurf und Integration in FPGA," Master's thesis, Universität Leipzig - Fakultät für Mathematik und Informatik - Institut für Informatik - Abteilung Technische Informatik.

[50] A. B. Castillo, R. Zalusky, and V. Stopjakova, "Design of Single Precision Float Adder (32-Bit Numbers) According to IEEE 754 Standard using VHDL," M. Tech. thesis, Slovenská Technical University, 2012.

[51] L. Louca, T. A. Cook, and W. H. Johnson, “Implementation of IEEE Single Precision Floating Point Addition and Multiplication on FPGAs,” in FCCM, 1996, pp. 107–116.

[52] B. Lee and N. Burgess, “Parameterisable Floating-Point Operations on FPGA,” in Signals, Systems and Computers, 2002. Conference Record of the Thirty-Sixth Asilomar Conference on, vol. 2. IEEE, 2002, pp. 1064–1068.

[53] P.-M. Seidel and G. Even, “Delay-Optimized Implementation of IEEE Floating-point Addition,” IEEE Transactions on computers, vol. 53, no. 2, pp. 97–113, 2004.

[54] D. Tecu, "FPGA-Based Vector Floating-Point Unit with Software-Implemented Division," Master's thesis, ETH Zurich - Department of Computer Science - Native Systems Group, 2010.

[55] R. Thompson and J. E. Stine, "An IEEE 754 Double-Precision Floating-Point Multiplier for Denormalized and Normalized Floating-Point Numbers," in Application-specific Systems, Architectures and Processors (ASAP), 2015 IEEE 26th International Conference on. IEEE, 2015, pp. 62–63.

[56] M. Sumi and S. Daniel, “Multiplication of Floating Point Numbers using VHDL,” International Journal of Engineering and Innovative Technology, 2014.

[57] E. Hallett, Isolation Design Flow for Xilinx 7 Series FPGAs or Zynq-7000 AP SoCs (Vivado Tools), Xilinx Inc., 9 2016, v1.3.
