Mälardalen University

School of Innovation, Design and Engineering

Master thesis in Electronics

Evaluation of partial reconfiguration in FPGA-based high-performance video systems

Author: Emil Segerblad
Supervisor: Dr. Mikael Ekström, [email protected]
Examiner: Prof. Lars Asplund

June 5, 2013

Abstract

The use of reconfigurable logic within the field of computing has increased during the last decades. The ability to change the hardware during the design process enables developers to lower the time to market and to reuse designs in several different products. Many different architectures for reconfigurable logic exist today, with one of the most commonly used being the Field-Programmable Gate Array (FPGA). So-called dynamic reconfiguration, or partial reconfiguration, of FPGAs has recently been introduced by several leading vendors, although the concept has existed for several decades. Partial reconfiguration is a technique where a specific part of the FPGA can be reprogrammed during run-time. In this report an evaluation of partial reconfiguration is presented, with focus on the Xilinx Zynq System-on-Chip and the GIMME2 vision platform developed at Mälardalen University. Special focus has been given to the use of partial reconfiguration in high-performance video systems such as the GIMME2 platform. The results show that the current state of the technology is capable of performing reconfigurations within strict timing constraints, but that the associated software tools are still lacking in both performance and usability.

Sammanfattning

The use of reconfigurable logic within the field of computing has increased during the last decades. The ability to change the hardware during the design process can help developers shorten development times and reuse designs in several different products. Many different architectures for reconfigurable logic exist today, and one of the most common is the Field-Programmable Gate Array (FPGA). The use of so-called dynamic reconfiguration, or partial reconfiguration, in FPGAs has recently been introduced by several leading vendors, but the concept has existed for several decades. Partial reconfiguration is used to change a specific part of the hardware during run-time. In this report an evaluation of partial reconfiguration of FPGAs is presented, with focus on Xilinx's Zynq System-on-Chip and the GIMME2 platform developed at Mälardalen University. Special focus has been placed on the use of partial reconfiguration in high-performance video systems such as the GIMME2 platform. The results show that the current technology is capable of performing partial reconfigurations within strict timing constraints, but that the associated software tools still have clear shortcomings in both performance and usability.

Acknowledgements

The path of the righteous man is beset on all sides by the iniquities of the selfish and the tyranny of evil men. Blessed is he who, in the name of charity and good will, shepherds the weak through the valley of darkness, for he is truly his brother’s keeper and the finder of lost children. And I will strike down upon thee with great vengeance and furious anger those who would attempt to poison and destroy My brothers. And you will know My name is the Lord when I lay My vengeance upon thee.

- Jules Winnfield, ”Pulp Fiction”, 1994

I chose this quote not because I am a religious person, on the contrary, I chose it because it sounds cool and also due to the fact that it is something that you would not expect a hitman to say just before killing a man. It makes you think, does it not? Now, before I get all philosophical, there are some people I would like to thank for making this master thesis possible. First of all I would like to thank Sara, my fiancée, for her great support during these past 4 years. Furthermore, I would like to thank my family and friends for their interest and support of my work. I would also like to thank my roommates at MDH: Fredrik, Carl, Ralf and Batu for not throwing me out when I was annoying and also for their good ideas and feedback. Lastly, I would like to thank my supervisor Mikael Ekström at MDH for his good support and "can-do" attitude, and my examiner Lars Asplund at MDH for good feedback and challenging tasks.

Emil Segerblad - Västerås - June 5, 2013

Glossary

API Application Programming Interface 46, 48

ARM Advanced Reduced Instruction Set Computer (RISC) Machine 17

ASIC Application-Specific Integrated Circuit 1, 10

AXI Advanced eXtensible Interface 17, 31–33, 36, 37, 41, 46

BIOS Basic Input/Output System 39

BLE Basic Logic Element 10

CAN Controller Area Network 41

CCD Charge-Coupled Device 7

CPLD Complex Programmable Logic Device 9, 10

CPU Central Processing Unit 3, 10, 13, 50

DDR Double Data Rate 36, 41, 46, 47, 58

DMA Direct Memory Access vi, 18, 35, 46

DSP Digital Signal Processor 14

EDK Embedded Development Kit 33, 34

EMIO Extended Multiplexed I/O 18

EPP Extensible Processing Platform iv, 3, 17, 19

FPGA Field Programmable Gate Array 1–3, 5, 6, 10–17, 22–25, 27, 28, 31–33, 39, 40, 48–53

FPS Frames Per Second 27, 28, 46

FSBL First-Stage Boot-Loader 40

GIMME General Image Multiview Manipulation Engine 5

GPIO General Purpose Input/Output 41

GPS Global Positioning System 7

GPU Graphics Processing Unit 23

HDL Hardware Description Language 1, 3, 5, 11, 52

HLS High-Level Synthesis 1, 2, 6, 37, 51, 53

I/O Input/Output 33

IC Integrated Circuit 10, 13, 16

ICAP Internal Configuration Access Port v, 15, 25–28, 33–36, 46, 49, 50, 52

IEEE Institute of Electrical and Electronics Engineers 31

IP Intellectual Property 6, 15, 16, 39, 48, 50, 58

ISE Integrated Software Environment 14–16, 34, 36, 38, 45, 51

LIDAR LIght Detection and Ranging 7

LUT Look-Up Table 10, 14

MDH Mälardalen University 41, 51, 53

MIG Memory Interface Generator 41

MIO Multiplexed Input/Output 40

MIPS Microprocessor without Interlocked Pipeline Stages 8

PCAP Processor Configuration Access Port 16, 17, 31–34, 36, 45, 46, 49–52

PCI Peripheral Component Interconnect 14, 23

PL Programmable Logic 1, 9, 10, 17, 18, 31, 35, 39–41, 45, 46, 49, 50

PS Processing System 17, 18, 31–33, 35, 36, 40, 41, 46, 49, 50

RAM Random Access Memory 14

RISC Reduced Instruction Set Computer ii, 17

RTL Register Transfer Level 1

SD Secure Digital 40, 48

SDK Software Development Kit 40

SoC System on Chip 5, 17, 18, 31, 39, 40, 49

SRAM Static Random Access Memory 27

USB Universal Serial Bus 39

VDMA Video Direct Memory Access 36, 37, 47, 50

VHDL VHSIC Hardware Description Language v, 37, 47, 50

WSN Wireless Sensor Networks 27

XMD Xilinx Microprocessor Debugger 40

List of Figures

1.1 Figure showing a comparison between various hardware architectures. Picture is from the work of Flynn and Luk. [25]
1.2 Figure showing the concept behind dynamic reconfiguration.
1.3 Figure showing a wrongly performed partial reconfiguration.
2.1 Figure showing the concept of a stereo-camera setup, taken from the work of Ohmura et al. [55]
2.2 Video flow example.
2.3 Filtering example from http://rsbweb.nih.gov/ij/plugins/sigma-filter.html
2.4 Feature extraction example from http://www2.cvl.isy.liu.se/Research/Robot/WITAS/operations.html
2.5 MIPS pipeline.
2.6 Figure showing the mobile robot used by Weiss and Biber (left) and the output 3D map (right). The image is from the work of Weiss et al. [64]
2.7 Figure showing the general concept of an FPGA device. The figure is from the article by Kuon et al. [48]
2.8 Figure showing a typical Look-Up Table (LUT). [49]
2.9 Picture from lecture slides from NTU. [8]
2.10 Figure showing how a LUT can be "programmed" to perform logic operations. [49]
2.11 Figure showing the general concept of an FPGA device. The figure is from the article by Kuon et al. [48]
2.12 Figure showing a typical island style partition location strategy.
2.13 Figure showing the outline of the Xilinx RP. [35]
2.14 The layout of the DevC module. [38]
2.15 Figure showing the general idea on how to utilize the Xilinx Extensible Processing Platform (EPP) family. [41]
2.16 Figure showing the outline of the Xilinx ZynQ SoC. Image from Xilinx document UG585. [38]
2.17 Figure showing the ZC702 board peripherals. [41]
2.18 Figure showing the outline of the GIMME2 platform. [3]
2.19 Figure showing the GIMME2 board's front (right) and back side (left). Notice the two image sensors on the back side of the PCB (encircled in red). Also notice the Zynq SoC (encircled in yellow), the PS DDR memory (encircled in blue) and the PL DDR memory (encircled in purple).
3.1 Figure showing the system from the article by Hosseini and Hu. [29] To the left is the logic implementation and to the right is the CPU implementation using the Nios II.
3.2 System developed by Ohmura and Takauji. Picture is retrieved from the article by Ohmura and Takauji. [55]
3.3 System developed by Ohmura and Takauji. Picture is retrieved from the article by Ohmura and Takauji. [55]
3.4 Komuro et al.'s architecture. [47]
3.5 Blair et al.'s performance. [14]
3.6 Figure showing Ming et al.'s results, taken from their article. [51]
3.7 Koch et al.'s implemented system. Picture is taken from the related article. [45]
3.8 Ackermann et al.'s system. [2]
4.1 Figure showing the partial reconfiguration flow from WP374. [23]
4.2 Excluded partial reconfiguration steps from XAPP1159 shown in red. [46]
4.3 Reference design with Microblaze and Internal Configuration Access Port (ICAP) added.
4.4 Reference design from XAPP1159.
4.5 Implemented video design.
4.6 Second implemented video design.
4.7 Red filter code.
4.8 Color filter code.
4.9 Work flow for partial reconfiguration in Xilinx ISE, from UG702. [35]
4.10 Example boot image format for standalone software. Picture from UG821. [37]
4.11 Example boot image format for Linux. Picture from UG873. [36]
5.1 Excerpt from VHSIC Hardware Description Language (VHDL) code generated by Vivado HLS.
7.1 Proposed video pipeline.
A.1 Macro definitions.
A.2 Memory access code.
B.1 Overview of the ZynQ family. Image from DS190. [42]

List of Tables

3.1 Table showing the results from the work by Hosseini and Hu. [29] The first four rows are for the filtering of a 64 x 64 pixel image while the last two are for the filtering of a 256 x 256 pixel image.
3.2 Komuro et al.'s performance. [47]
3.3 Meyer et al.'s results. [53]
3.4 Table showing Ming et al.'s results, taken from their article. [51]
3.5 Ackermann et al.'s results. [2]
3.6 Ackermann et al.'s results. [2]
3.7 Bhandari et al.'s results. [11]
3.8 Perschke et al.'s results. Picture from the related article. [58]
3.9 Perschke et al.'s results. Picture from the related article. [58]
5.1 Table showing the time needed to finish just the "write" function call on the ZC702 board running Linux.
5.2 Table showing the time needed for the full reconfiguration flow on the ZC702 board running Linux.
5.3 Table showing the reconfiguration time needed to finish the Direct Memory Access (DMA) transfer on the ZC702 board using standalone software.
5.4 Table showing summarized test results.

Contents

List of Figures

List of Tables

Introduction
    Thesis description
    Scope of this report
    Outline of this report

Background
    Introduction to computer vision and its applications
    Introduction to Programmable Logic (PL)
    Field Programmable Gate Array-technology
    Run-time Reconfigurability of FPGAs with focus on the Xilinx FPGAs
    Heterogeneous systems
    Computer Vision on FPGAs and heterogeneous systems
    Xilinx ZynQ-7000 and Xilinx ZC702
    GIMME 2

Related work
    Implementations of computer vision on FPGAs and heterogeneous systems
    Implementations of reconfigurable FPGA-systems
    Implementations of reconfigurable FPGA-systems running computer vision algorithms

Method
    Early work
    Design considerations
    Method of reconfiguration
    Implemented vision components
    System design in Xilinx ISE/PlanAhead
    Interface from Linux
    GIMME2

Results
    Performance of reconfiguration methods
    Vision component implementation
    Software implementation evaluation
    GIMME2

Discussion

Future work

Conclusions

Bibliography

Appendix A Device interface from Linux

Appendix B Overview of the Xilinx ZynQ-family

Introduction

Within the world of computation, one of the largest open questions is which hardware platform, or architecture, to use for a certain task. Some platforms can achieve high clock frequencies but low degrees of parallelism, while others can achieve a high degree of parallelism but are limited to lower clock frequencies. Each architecture has its own inherent strengths and weaknesses. Designers are often faced with the problem of choosing the right platform for the right task, which is not always a simple problem to solve as cost, availability and design support also play a major role during development and production. Hence, performance is not always the most important aspect to consider when choosing a platform.

Today many applications exist where both performance and cost are critical aspects. The industry standard for such applications has for many years been to design and implement Application-Specific Integrated Circuits (ASICs). ASICs have the advantage of being cheap to produce and generally allow for high performance. ASICs are not without problems, however: long development times, high cost in small to medium-scale production and no reprogrammability after production are some of their inherent disadvantages. Other technologies, such as Field Programmable Gate Arrays (FPGAs), allow designers to shorten the design phase and allow for a high degree of reprogrammability after production. FPGAs belong to a much larger family known as Programmable Logic (PL), where the focus is on the programmability of the hardware; PL can jokingly be referred to as "Lego logic" due to its high degree of customizability. The use of PL in embedded systems has increased over the years due to the good flexibility and decent performance that the various PL architectures offer. However, the reconfigurable nature of PL devices also has some drawbacks, such as much lower speed (in terms of clock frequency) than other processing architectures. What the PL architectures lack in speed they gain in parallelism; the ability to run tasks in parallel on PL borders on the extreme. A comparison between various hardware architectures with respect to performance and programmability can be found in figure 1.1.

Some PL systems are reconfigurable during run-time. That means that one or several sections of the logic fabric can be reprogrammed with or without affecting the operation of other sections, depending on which technique is used. Run-time reconfiguration enables developers to dynamically use hardware acceleration in their applications by changing the behaviour of the logic. Furthermore, by swapping functions in and out of the FPGA, chip area can be saved. An example of this concept, known as partial reconfiguration, can be seen in figure 1.2. At first the FPGA fabric contains functions A-D. At some point the user wants to put function E onto the FPGA, and hence functions B and D, in this example, must be overwritten in order to make function E fit properly. Some important aspects seen in figure 1.2 need further explanation. All bit streams (a stream of data containing the new configuration of the PL) for the functions to be used in the system need to be generated prior to run-time and stored in non-volatile memory in order to minimize the reconfiguration delay and ensure deterministic system behaviour (generating bit streams during run-time is currently infeasible as the time needed is vast).
In most cases the tasks put on the FPGA are written manually by the developer using a Hardware Description Language (HDL), but lately tools for automatic HDL-code generation from high-level languages have appeared on the market. One example of such a tool is Xilinx Vivado High-Level Synthesis (HLS). Vivado HLS is able to generate Register Transfer Level (RTL) code from high-level languages such as C/C++ and Matlab.

Figure 1.1: Figure showing a comparison between various hardware architectures. Picture is from the work of Flynn and Luk. [25]

RTL, not to be confused with Resistor-Transistor Logic, is a common abstraction mechanism used in hardware design where synchronous circuits are modelled as the data flow between registers and the manipulation performed on that data flow. For more information about Xilinx Vivado HLS, please refer to the Vivado HLS User Guide. [33]

When doing partial reconfiguration there exists a risk of corrupting previously existing functions. An example can be seen in figure 1.3. As can be clearly seen in figure 1.3, if a component is placed badly onto the FPGA it can destroy or interfere with other pre-existing components. Therefore, it is of utmost importance that the user keeps track of the boundaries of all components on the FPGA in some manner. This is, however, tiresome and error prone to do "by hand", so a better option to ensure component integrity is to implement a resource manager that keeps track of where the functions are placed and, during partial reconfiguration, makes sure that no pre-existing component is affected negatively. This task is highly complex and not without problems. A third option also exists, and that is to divide the area of the chip into different regions with different properties. Most common is to have one region being static, that is non-reconfigurable, and then several regions that can be reconfigured. Using high-level design tools, the user can then verify that none of the tasks to be put into one of these regions violates any design rules or other limitations. After this is verified, the functionality of each region can be changed dynamically during run-time without any concern that the reconfiguration would affect any other region. However, there exist some limitations on partial reconfiguration, both software-wise and hardware-wise, as will be presented later in this report. Furthermore, some different ways of partitioning the FPGA logic between static and reconfigurable areas will be briefly explained in the Background section.

The applications of this technology are many. The ability to put software tasks on hardware is appealing, especially for systems with a strong degree of possible parallelism in them. An example here would be computer vision systems, where large chunks of data need to be processed in real-time. Vision systems also have a high degree of parallelism, making them ideal for implementation on FPGAs. However, the software used in modern vision systems is often complex and written in high-level languages such as C++ or Java.

Figure 1.2: Figure showing the concept behind dynamic reconfiguration.

Porting such complex tasks to an FPGA would be extremely hard, as many of the constructs used in high-level languages have no direct representations in HDL. However, some tools exist, such as Vivado HLS that was presented earlier in this text, for high-level-language-to-HDL conversion. These tools are not without limitations though, as will be seen later in this report. Instead of creating new specialized functions for each implementation, a library of vision algorithms and tools is often used in software design. One popular library for computer vision is OpenCV. OpenCV contains several high-level functions for image processing such as Hough transforms, color conversion algorithms and feature extraction methods. The OpenCV library is compatible with a wide range of operating systems, such as Linux and Microsoft Windows. In recent years FPGAs with hard Central Processing Unit (CPU) cores embedded in them have been introduced to the market. This enables programmers to mix the high clock frequency of the CPU cores with the high parallelism and programmability of the FPGA in order to create high-performing systems within a wide range of applications. Platforms that contain several different processing elements are known as heterogeneous platforms and will be discussed more extensively later in this report. Some of these heterogeneous FPGAs even feature the possibility to reprogram the FPGA from the hard CPU cores during run-time. This is an interesting feature as it could prove to be useful in certain types of high-performance systems, where a wide range of functions can be offloaded onto the FPGA during run-time, hence increasing performance and decreasing power usage. The newly released Xilinx ZynQ EPP is a heterogeneous system that features a dual-core ARM-processor and a large FPGA-area. The Xilinx Zynq has the capability to reconfigure the FPGA during run-time.
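To make the region-based bookkeeping described above more concrete, the sketch below shows, in plain C, the kind of record keeping a minimal resource manager could perform before loading a module into a predefined reconfigurable partition, so that the badly placed component of figure 1.3 is rejected up front. The partition names, capacities and bitstream file names are invented for the example and do not correspond to any design in this thesis.

    #include <stdio.h>

    /* Conceptual sketch: the chip is split into a static region plus a few
     * reconfigurable partitions, and a small manager records which
     * (pre-verified) module currently occupies each partition before a
     * partial bitstream is loaded. */
    #define NUM_PARTITIONS 2

    struct partition {
        const char *name;
        int         logic_cells;     /* capacity of the partition         */
        const char *current_module;  /* NULL while the partition is empty */
    };

    struct module {
        const char *name;
        int         logic_cells;     /* resources the module needs        */
        const char *bitstream_path;  /* pre-generated partial bitstream   */
    };

    static int load_module(struct partition *p, const struct module *m)
    {
        if (m->logic_cells > p->logic_cells) {
            fprintf(stderr, "%s does not fit in %s\n", m->name, p->name);
            return -1;
        }
        /* Here the pre-generated partial bitstream would be written to the
         * configuration port (see the Background section). */
        printf("loading %s into %s from %s\n", m->name, p->name, m->bitstream_path);
        p->current_module = m->name;
        return 0;
    }

    int main(void)
    {
        struct partition parts[NUM_PARTITIONS] = {
            { "pr_region_0", 2000, NULL },
            { "pr_region_1", 1000, NULL },
        };
        struct module sobel  = { "sobel_filter", 1500, "sobel_pr0.bin" };
        struct module thresh = { "threshold",     400, "thresh_pr1.bin" };

        load_module(&parts[0], &sobel);
        load_module(&parts[1], &thresh);
        return 0;
    }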

Figure 1.3: Figure showing a wrongly performed partial reconfiguration.

Thesis Description

As stated earlier, the work presented in this report was done as part of a master thesis in electronics at Mälardalen University. The original thesis description was produced by Professor Lars Asplund at Mälardalen University [7]:

Reconfigurable Systems The FPGA-area in the Zynq is reconfigurable, which means that parts of the area can be reloaded during run-time. This can be of great use for robotics vision system, e.g. where different algorithms can be loaded for navigation first and later algorithms for object recognition i.e a robot moving in an apartment first and then in the kitchen it can find the coffee . . . This master thesis work aims to design a framework for loading a set of OpenCV components in the FPGA area at the same time as the camera-part is continuously running and with the requirement that some other vision components that are resident can continue to work. The work will use already defined blocks for the camera input, and the work is also connected to the research project Ralf-3, and will be used in a tool for allocation of software components on the FPGA. High level programming Xilinx has released a system called Vivado, which makes it possible to use high level languages such as Matlab , C or C++ for implementation in an FPGA. This master thesis work aims at evaluating Vivado. Special attention should be put on components from the OpenCV-library. In the thesis work the suitability of Vivado for this type of computation will be evaluated. Special attention should be put on speed and FPGA-area allocation. The possibility of using LabView should be included.

Cooperation These two thesis works should be run simultaneously since they benefit from each others results. They will also be connected to two research groups. The robotics group in terms of hardware and IP-components for the cameras and communication with the dual core ARM and the Software Engineering group in the Ralf-3 project.

This work focuses on the "Reconfigurable Systems" part. However, modifications to the thesis description were needed, as no student applied for the "High level programming" part and hence no OpenCV functions were generated for the partial reconfiguration framework to use. Instead this thesis will focus on the use of partial reconfiguration and the properties of the Xilinx Zynq FPGA and its associated tools. A more extensive description of the thesis and this report can be found in the next section.

Scope of this report

In this report the Xilinx ZynQ System on Chip (SoC), which features a dual-core ARM processor unit and an FPGA, will be used to demonstrate the capabilities and limitations of heterogeneous systems in high-performance applications such as image or video processing, with emphasis on the partial reconfiguration possibilities found in SoCs such as the ZynQ. The Xilinx ZynQ SoC and the Xilinx tool suite associated with the ZynQ support run-time partial reconfiguration of the FPGA both from the hard ARM processor and from within the FPGA, and will hence be evaluated with respect to usability and speed of the reconfiguration process. Furthermore, the usability of partial reconfiguration in video processing systems will be evaluated using the GIMME2 stereo-vision platform. GIMME2 is a hardware platform featuring a Xilinx ZynQ FPGA, developed at Mälardalen University as part of the research project VARUM (Vision Aided Robots for Underwater Minehunting). The General Image Multiview Manipulation Engine (GIMME) platform was first introduced in the work of Ekstrand et al. [4] The possibility of generating hardware components from OpenCV-like functions will also be explored and evaluated, to some degree, using the Xilinx Vivado HLS tool suite. The motivation for this is to provide programmers with tools and techniques to easily place vision components onto the ZynQ FPGA in order to offload the embedded ARM cores in the SoC during run-time. The work is done at Mälardalen University as part of a master thesis in electronics. In order to limit the scope of the thesis work and to produce clear goals for the thesis itself, a set of questions that this report intends to answer can be seen below.

• What are the possible advantages of using heterogeneous systems in video processing instead of homogeneous systems?

• What is the performance of Xilinx’s current partial reconfiguration methods?

• How well developed are the available software tools for partial reconfiguration in Xilinx FPGAs?

• What are the technological limitations of partial reconfiguration in its current state?

• How can partial reconfiguration be utilized in high-performance video systems such as stereo-vision systems and what are the implications of this technology for machine vision in general?

• What types of components can be partially reconfigured?

• How can M¨alardalenUniversity use partial reconfiguration in their current research projects such as GIMME2?

• What is the status of high-level language to HDL-tools such as Xilinx Vivado HLS, in terms of efficiency and performance?

These questions are answered throughout this report, and a summarized version of the answers can be found in the Conclusions section (page 52).

Outline of this report

In the Background-section, starting on page 7, an overview of FPGAs, heterogeneous platforms, programmable logic and computer vision in general will be presented to the reader. Furthermore, the Xilinx ZynQ FPGA, the GIMME2 stereo-vision platform, run-time reconfiguration of the ZynQ FPGA and the open-source image processing library OpenCV will be discussed. In the Related work-section, page 22, state-of-the-art solutions and implementations in related fields will be presented to the reader in the form of short summaries of research papers and technical reports. In the Method-section, page 31, the implementation done by the author of this report is presented: a reference system based around Xilinx Intellectual Property (IP) cores running in the Zynq's FPGA, image processing components generated with Vivado HLS, and the final implementation on the GIMME2 board are each presented and evaluated. In the Results-section, starting at page 45, the performance and other important features of the implementation are evaluated and demonstrated. In the Discussion-section, page 49, future work and possible improvements are presented to the reader. The conclusions of this report can be found in the Conclusions-section, page 52. Finally, some appendices containing useful information about the configuration and set-up of the GIMME2 are presented starting at page 57.

Background

Introduction to computer vision and its applications

Computer vision, or machine vision, is becoming more widespread these days as more and more robotic and control applications require some kind of vision to function properly. An image on a digital system is constructed from small elements called pixels. The number of pixels in an image indicates the so-called resolution; a large number of pixels indicates a high resolution. As the resolution of an image is always finite, regardless of the camera, the image is a discrete representation of a continuous environment. Using image acquisition devices such as a Charge-Coupled Device (CCD), the three-dimensional world is converted into a two-dimensional representation expressed by the linear combination of a set of base colours. An example of such a set is the Red, Green and Blue (RGB) used in the RGB colorspace. By expressing the color of a pixel as a linear combination of red, green and blue, a human-interpretable representation is constructed. Other colorspaces that utilize other aspects of the environment to create an image, which may or may not be fully human interpretable, exist as well, but these are outside the scope of this report. The number of unique colors that a colorspace can create is called the color depth. For example, if using the RGB colorspace and each component is expressed as 1 byte (8 bits), a color depth of 24 bits is achieved and the number of possible colors is 2^24 = 16777216. As said earlier, a single camera will generate a two-dimensional representation of the world. This might be hard to believe, as a human viewing an image can "perceive" the missing dimension, depth. This is due to the human ability to interpolate by using visual features in the image. This is something a computer cannot do (yet), and hence it is unable to perceive depth from a single image. By using more than one camera, depth images can be created. The concept is presented in figure 2.1 using a stereo-camera system. The distance, z, to an object can be calculated by the formula seen in equation 2.1

z = (fL)/(pd)    (2.1)

by a method called triangulation. L is the distance between the two cameras, f is the focal length of the cameras, p is the size of one pixel in the cameras and d is the so-called disparity. The disparity is the absolute value of the difference, in pixels, between the position of the center of the object (or some other visual feature) as seen by the two cameras. The same concept of triangulation used here is used in, for example, the Global Positioning System (GPS), where the position of a device is calculated by measuring the time it takes for signals to travel from a number of different satellites to the device. An example of stereo vision can be found in the work of Chen et al. [17]
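As a worked example of equation 2.1, the small C program below computes the distance to an object from a measured disparity. The focal length, baseline, pixel size and disparity values are illustrative assumptions, not parameters of any camera used in this work.

    #include <stdio.h>

    /* Depth from disparity following equation 2.1: z = (f * L) / (p * d). */
    static double depth_from_disparity(double f_m, double L_m, double p_m, int d_px)
    {
        return (f_m * L_m) / (p_m * (double)d_px);
    }

    int main(void)
    {
        double f = 0.006;   /* 6 mm focal length (assumed)       */
        double L = 0.10;    /* 10 cm camera baseline (assumed)   */
        double p = 6.0e-6;  /* 6 um pixel size (assumed)         */
        int    d = 25;      /* measured disparity in pixels      */

        printf("estimated distance: %.2f m\n", depth_from_disparity(f, L, p, d));
        return 0;
    }

With these example numbers the program reports a distance of 4.00 m.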

Figure 2.1: Figure showing the concept of a stereo-camera setup, taken from the work of Ohmura et al. [55]

Some main ideas are common to all computer vision systems. Firstly, a representation of the environment must be acquired. Image acquisition can be done in many different ways, with the most common method being the use of a regular camera. Other ways of creating representations of the environment include LIght Detection and Ranging (LIDAR) and ultrasonic range sensors. After an image has been acquired, some kind of filtering known as preprocessing is often performed in order to remove noise or unwanted parts of the image. For example, if one wants to find green objects it would be unnecessary to keep anything but green pixels in the image. After the filtering is completed, an algorithm for feature extraction is often applied. After relevant features are extracted, the detection of some parameter is performed and the results are used to control the system in some fashion. An example of such a video processing flow can be seen in figure 2.2. Other steps can be added as needed to the video flow in order to perform more complex operations.

Figure 2.2: Video flow example.

One example of filtering can be seen in figure 2.3, while one example of feature extraction can be seen in figure 2.4. A common method for performing these image operations efficiently is to use a so-called pipeline. Pipelining is a well-known technique within the field of computer architecture and is used to increase throughput and parallelism in computer systems. The concept presented in figure 2.2 can be implemented as a pipeline. The essential idea behind pipelining is to allow multiple instructions to overlap in execution, i.e. instead of performing all necessary operations in one stage, the operations on data can be split between several stages that are interconnected. One good example of a pipeline implementation is the one found in the Microprocessor without Interlocked Pipeline Stages (MIPS) pipeline, where instructions have 5 distinct steps [32]:

1. Fetch instruction (IF)
2. Decode instruction (ID)
3. Execute instruction operation (EX)
4. Access operands (MEM)
5. Write result to register (WB)

An example of this pipeline can be seen in figure 2.5. For more information about the use of pipelines in modern computer architecture and the MIPS pipeline, please refer to the works of Patterson and Hennessy. [32][31]
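The overlap that pipelining provides can be illustrated with the toy C program below, where three stages modelled after figure 2.2 (acquire, filter, extract) work on different frames during the same cycle. The stage names and frame count are arbitrary choices made for the illustration.

    #include <stdio.h>

    #define STAGES 3   /* acquire, filter, extract (mirrors figure 2.2) */
    #define FRAMES 6

    /* Illustrative only: each "stage" just prints which frame it works on.
     * In cycle t, stage s processes frame (t - s), so several frames are
     * in flight at once, which is the essence of pipelining. */
    int main(void)
    {
        const char *stage_name[STAGES] = { "acquire", "filter", "extract" };

        for (int t = 0; t < FRAMES + STAGES - 1; t++) {
            printf("cycle %d:", t);
            for (int s = 0; s < STAGES; s++) {
                int frame = t - s;
                if (frame >= 0 && frame < FRAMES)
                    printf("  %s(frame %d)", stage_name[s], frame);
            }
            printf("\n");
        }
        return 0;
    }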

Figure 2.3: Filtering example from http://rsbweb.nih.gov/ij/plugins/sigma-filter.html

Figure 2.4: Feature extraction example from http://www2.cvl.isy.liu.se/Research/Robot/WITAS/operations.html

A lot of pre-developed libraries that contain high-level image processing functions exist today. One example of such a library is the Open Computer Vision library, better known as OpenCV. OpenCV is an open-source project released under the BSD license, meaning that it is free to use even in commercial applications. OpenCV contains a vast range of image processing algorithms and can be run on both Windows and *nix-based operating systems (Linux, Mac and Android). For more information about OpenCV, please refer to the official homepage http://opencv.org/. One possible application of computer vision is within agriculture, as can be seen in the research article written by Segerblad and Delight. [21] The most interesting example from the work of Segerblad and Delight [21] is, according to the author of this report at least, the work of Weiss and Biber [64]. Using a LIDAR device, the two scientists detected and mapped vegetation in fields onto a global 3D map. This LIDAR device was mounted onto a mobile robot that traversed the fields. The solution showed promising results, with a successful detection rate for maize plants in a field of 60 percent. An image showing the mobile robot used and the generated 3D map can be seen in figure 2.6. More information about computer vision and its applications can be found in the book written by Szeliski. [61]

Introduction to Programmable Logic (PL)

Logic circuits can either be fixed or programmable. The behaviour of programmable logic can be changed after manufacturing is completed, while fixed logic has a static behaviour. PL has existed for over 50 years and is extensively used in both industry and academia. It offers developers and researchers a flexible and fast way to design and implement hardware.

Figure 2.5: MIPS pipeline

Figure 2.6: Figure showing the mobile robot used by Weiss and Biber (left) and the output 3D map (right). The image is from the work of Weiss et al. [64]

The Complex Programmable Logic Device (CPLD) is one of the most commonly used architectures together with the FPGA. Both technologies have their unique properties and different applications. In this report the focus will be on FPGAs and their particular applications. An overview of the CPLD and FPGA technologies can be found in the article by Brown et al. [16] One of the most interesting aspects of the various types of existing PL architectures is the possibility to perform tasks in true parallel, unlike regular CPU-based systems that must run tasks in series. This means that some tasks can be performed much faster on PL devices than on CPU devices. Another appealing aspect is the possibility to reprogram these devices without loss of performance. This implies a lower development cost compared to regular ASIC devices and also the possibility to correct potential hardware bugs after the devices are released to the market.

Field Programmable Gate Array-technology

Modern FPGAs were first introduced in the middle of the 1980's by the American company Xilinx [26], but the concept can be traced back to the 1960's. The first Xilinx FPGAs only contained a few thousand logic cells, while modern FPGA Integrated Circuits (ICs) can contain several million. The basic concept of the FPGA is to pack large amounts of logic blocks, memory blocks and other low-level hardware peripherals onto one IC and then use a large network of interconnections to "glue" all components together. [24] In figure 2.7 this concept is demonstrated. The high degree of interconnectivity is what makes the FPGA so versatile, but it is also one of the big drawbacks of FPGAs. The high degree of interconnectivity implies that a large area of the FPGA IC must be dedicated to this task; this increases the physical size of the packaging and also lowers the highest possible clock frequency, as the clock signals must travel longer distances. Configurable Logic Blocks (CLBs) provide the core logic and storage capabilities of the FPGA. In figure 2.7 these are labelled just "logic". Today, most commercial CLBs are LUT-based. Each CLB consists of several Basic Logic Elements (BLEs) arranged in a special fashion. In Xilinx's FPGAs a CLB consists of a number of so-called slices, which in turn consist of several BLEs. A BLE contains an N-input LUT and a D-type flip-flop.

Figure 2.7: Figure showing the general concept of an FPGA device. The figure is from the article by Kuon et al. [48]

Using an N-input LUT makes it possible to implement any logic function of N input bits. This concept is seen in figure 2.8. By connecting the output from the LUT to a D-type flip-flop, the behaviour of the circuit can be made synchronous. More complex logic functions are implemented by connecting several BLEs together. An example of this can be seen in figure 2.11 and also in figure 2.9. Most basic digital electronics concepts are explained in the book by Kuphaldt. [49]
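The idea that an N-input LUT can realise any N-input logic function is easy to mimic in software: a 4-input LUT is nothing more than a 16-entry truth table indexed by the inputs. The C sketch below "programs" such a table from an example function; the chosen function is only an illustration and not related to any design in this thesis.

    #include <stdint.h>
    #include <stdio.h>

    /* "Program" a 4-input LUT by evaluating an arbitrary boolean function
     * for all 16 input combinations and storing the results as a bitmask. */
    static uint16_t program_lut4(int (*f)(int, int, int, int))
    {
        uint16_t table = 0;
        for (int i = 0; i < 16; i++) {
            int a = (i >> 0) & 1, b = (i >> 1) & 1;
            int c = (i >> 2) & 1, d = (i >> 3) & 1;
            if (f(a, b, c, d))
                table |= (uint16_t)(1u << i);
        }
        return table;
    }

    /* Example function: f = (a AND b) XOR (c OR d). */
    static int example_function(int a, int b, int c, int d)
    {
        return (a & b) ^ (c | d);
    }

    /* "Run" the LUT: the inputs form the index, the stored bit is the output. */
    static int lut4_lookup(uint16_t table, int a, int b, int c, int d)
    {
        int index = (a << 0) | (b << 1) | (c << 2) | (d << 3);
        return (table >> index) & 1;
    }

    int main(void)
    {
        uint16_t lut = program_lut4(example_function);
        printf("f(1,1,0,0) = %d\n", lut4_lookup(lut, 1, 1, 0, 0));
        return 0;
    }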

Figure 2.8: Figure showing a typical LUT. [49]

The programming of the FPGA is basically just connecting the CLBs in the right fashion. Several different programming methods exist, with some being static while others are changeable. The most common one today is the use of some kind of static memory to hold the configuration, while older technologies used fuses and anti-fuses to create permanent connections. HDLs were created in order to increase the development speed of implementations on FPGAs. Using a synthesis tool, the HDL code is translated into a bit stream containing the configuration of the FPGA. This bit stream can then be uploaded to the FPGA from a computer or a dedicated programmer. The two most popular HDLs are Verilog and VHDL. Both are commonly used within both academia and industry. In later years graphical development tools for embedded systems on FPGAs have been released by most FPGA manufacturers, enabling developers to rapidly develop complex systems. A survey of various FPGA architectures can be found in the work by Kuon et al. [48]

Figure 2.9: Picture from lecture slides from NTU. [8]

Figure 2.10: Figure showing how a LUT can be ”programmed” to perform logic operations. [49]

Run-time Reconfigurability of FPGAs with focus on the Xilinx FPGAs

Many of the modern FPGAs support run-time reconfigurability to some extent, with partial reconfiguration being the most common form. Partial reconfiguration is a term commonly used when referring to reconfiguration of a specific part of the FPGA without interfering with other components located on the FPGA. As can be seen in the Introduction, partial reconfiguration can be dangerous to overall system performance or stability if performed wrongly. The potential benefits of run-time reconfiguration are many; for example, the ability to dynamically move demanding functionality from software to hardware would improve the performance of many applications. However, one must consider the time it takes to reconfigure a section of the FPGA and weigh that against the potential speed-up. Even though the possibility of run-time reconfiguration has existed for over two decades, most FPGA manufacturers have yet to provide a complete design flow including design tools and paradigms. Several reasons behind this can be found: the main reason is that the number of logic blocks on FPGAs has increased rapidly during this time, and hence no direct need for partial reconfiguration has existed, due to the much easier process of implementing a fully static system instead of using function swapping. Another reason is the added development time needed for the implementation and verification of systems that feature partial reconfiguration. However, as the ICs grow larger, so does the time needed to program them.

Figure 2.11: Figure showing the general concept of an FPGA device. The figure is from the article by Kuon et al. [48]

This proves to be troublesome in applications where start-up timing is crucial, as will be seen later in this section. Another implication is high static power consumption due to the increased number of transistors in each package. By utilizing smaller FPGAs in combination with partial reconfiguration, lower power consumption and, in some cases, higher performance can be achieved. In his book on the subject [44], Koch presents some crucial ideas behind the concept of partial reconfiguration. Koch distinguishes between active and passive partial reconfiguration, where active reconfiguration is when the FPGA is reconfigured during run-time without disturbing the rest of the FPGA, and passive reconfiguration is when the entire FPGA is stopped (stopped in this case meaning that all the clocks in the FPGA are halted for a period of time) during reconfiguration. In this report only active partial reconfiguration is considered, and the term will hence be used synonymously with partial reconfiguration. Furthermore, Koch presents three open questions on the subject of partial reconfiguration that can be seen in the quote below. [44]

1. Methodologies: How can hardware modules be efficiently integrated into a system at runtime? How can this be implemented with present FPGA technology? And, how can runtime reconfigurable systems be managed at runtime?
2. Tools: How to implement reconfigurable systems at a high level of abstraction for increasing design productivity?
3. Applications: What are the applications that can considerably benefit from runtime reconfiguration?

Furthermore, Koch identifies three possible benefits of partial reconfiguration: performance improvement, area and power reduction, and fast system start-up. To summarize: as has been stated before in this report, algorithms that are highly parallel in nature can easily achieve speed-up by running on an FPGA compared with a CPU. By swapping functions in and out of the FPGA dynamically, the power and area used can be reduced. Lastly, fast system start-up refers to systems where the device must have low start-up times. Partial reconfiguration can be used here to load only the crucial components onto the FPGA at start-up in order to minimize the start time, and then at a later stage load the rest of the functionality onto the FPGA.

An example of this can be found in Xilinx Application Note 883, where an FPGA connected to the Peripheral Component Interconnect (PCI) Express bus is partially configured at boot-up in order to meet the strict timing constraints of the bus. [62] The partial reconfiguration methods can generally be divided into two categories: difference-based reconfiguration for small net list changes and module reconfiguration for large module-based changes. This report will focus on module reconfiguration, but according to Xilinx their latest FPGAs "[..] support reconfiguration of CLBs (flip flops, look-up tables, distributed Random Access Memory (RAM), multiplexers, etc.), block RAM, and Digital Signal Processor (DSP) blocks, plus all associated routing resources." [23]. This would imply that a high level of granularity can be achieved during difference-based partial reconfiguration. Partial reconfiguration can be seen as a specific implementation of context switching. However, as Koch points out, one must consider the entire system state before explicitly labelling partial reconfiguration as context switching. [44] Some key words used in the context of partial reconfiguration need to be explained and elaborated on before a more technical discussion of partial reconfiguration can occur. A short summary of commonly used terms is presented below, based on the terms introduced and described both by Koch [44] and by Xilinx [35].

Reconfigurable Partition A physical part of the FPGA constrained by the user to host reconfigurable modules.

Reconfigurable Module A net list that is set to reside inside a reconfigurable partition at some point. Several modules can share the same reconfigurable partition.

Reconfigurable Logic Logic elements that make up the reconfigurable module.

Static Logic Logic implemented in such a way that it is not reconfigurable.

Proxy Logic Logic inserted by design software in order to provide the system with a known communication path between static logic and reconfigurable partitions.

The techniques for placement of, and interaction with, reconfigurable modules within reconfigurable partitions differ between manufacturers, but three of the most common ones are: island style, grid style and slot style. Island style is the simplest model for partial reconfiguration and is the only one supported by Xilinx so far; hence it will be the focus of this report. A figure showing the concept behind island style placement can be seen in figure 2.12. Notice the static region around the reconfigurable module. This is needed in order to provide a safe and efficient way of routing signals in and out of the reconfigurable module. The static region is extra important when the reconfigurable module is connected to a bus, as it makes sure that the bus is not disturbed during reconfiguration. In Xilinx FPGAs the most common implementation of the static region used to be so-called bus macros, which needed to be added manually by the user in the Integrated Software Environment (ISE) tool suite during the design phase. These have since been replaced by proxy LUTs that are automatically inserted during synthesis. Island style placement only allows for one reconfigurable module per island. This means that a certain degree of fragmentation will occur when swapping modules, as resources within the island will be wasted if not all resources are used by the new module. This is made worse by the fact that reconfigurable partitions must be predefined by the user, and hence finding a perfect partition size in order to avoid fragmentation may be hard if not impossible. Furthermore, the current tools on the market require net lists to be generated for each unique pair of module and island. This means that even if the same module can be placed in two different islands, two net lists and bit streams must still be generated. For example, a system that has 6 modules and 3 islands, where all modules can reside in any island, must have 18 unique net lists and bit streams. This is clearly time consuming for the designer.

Figure 2.12: Figure showing a typical island style partition location strategy.

Xilinx states in their user guide for Partial Reconfiguration [35] that all logic can be reconfigured during run-time except: "Clocks and Clock Modifying Logic [...], I/O and I/O related components [...], Serial transceivers (MGTs) and related components [...] and Individual architecture feature components (such as BSCAN, STARTUP, etc.) must remain in the static region of the design". Further, it is stated that bidirectional interfaces between static logic and reconfigurable logic are not allowed unless explicit routes exist. Also, some specific IP components might function erratically if used in combination with partial reconfiguration. Another design consideration to notice is that the interface between the reconfigurable partition and the static logic must be static; as stated earlier, this implies that all reconfigurable modules that are to reside within a reconfigurable partition must have the same interface "out towards" the rest of the FPGA. Ports or bus connections cannot be created on the fly. An extensive list of design considerations for partial reconfiguration can be found in Xilinx UG702. [35] In order to use partial reconfiguration on a Xilinx FPGA one must use their design suite ISE. The work flow to generate and use partial reconfiguration on a Xilinx FPGA using the Xilinx ISE will be presented in the Method chapter of this report. A general idea of the partial reconfiguration concept on a Xilinx FPGA can be seen in figure 2.13. The concept is explained in depth in the work by Khalaf et al. [43] Designers are forced to use so-called "Bottom-Up Synthesis" in order to successfully implement a reconfigurable system. This implies that all modules must have separate net lists for each possible instantiation and that no optimization is allowed for the interface between the module and the rest of the FPGA. Bottom-Up Synthesis is explained in the Partial Reconfiguration Guide by Xilinx [35] and the Hierarchical Design Methodology Guide by Xilinx. [34] Xilinx states the following design performance in UG702 [35]:

• Performance will vary between designs, but in general one should expect a 10% degradation in clock frequency and should not expect to exceed 80% packing density.

• Longer runtimes during synthesis and implementation due to added constraints.

• Too small reconfigurable partitions may result in routing problems.

From the user's side, reconfiguration of the FPGA during run-time is only a matter of writing a partial bit stream to the associated reconfiguration port. The most commonly used reconfiguration port in Xilinx-based systems is the ICAP interface, which can be instantiated as a soft IP core in the FPGA fabric.

Figure 2.13: Figure showing the outline of the Xilinx RP. [35]

Other reconfiguration interfaces, such as the ZynQ's Processor Configuration Access Port (PCAP), exist as well. This report focuses on the actual usability of the partial reconfiguration flow; the technical low-level reconfiguration process will hence not be discussed here. A good introduction to the partial reconfiguration work flow and its limitations for Xilinx FPGAs can be found in the Partial Reconfiguration User Guide by Xilinx. [35] Xilinx claims the following in the Partial Reconfiguration Reference Design for the Zynq [46]:

The configuration time scales fairly linearly as the bitstream size grows with the number of reconfigurable frames with small variances depending on the location and the contents of the frames.

If the reconfiguration time is linear with respect to the bitstream size it would imply that partial reconfiguration could be used in time-critical systems as the worst case scenario could be calculated and verified with a high degree of certainty. This property could be useful in high-speed applications such as video processing or other streaming data applications. An article from 2006 written by Xilinx employees describes the general work-flow in their ISE-tool-suite and much of the information found there still applies to the current versions. [52]
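To make the "writing a partial bit stream to the reconfiguration port" step concrete, the sketch below shows how this could look from Linux on the Zynq, assuming a Xilinx kernel that exposes the PCAP through the devcfg driver as /dev/xdevcfg and a partial bitstream already converted to the binary format expected by the PS. The device name, file format and buffer size are assumptions made for the example and are not a description of the implementation evaluated later in this report.

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    /* Stream a partial bitstream file into the PCAP via the devcfg driver. */
    static int load_partial_bitstream(const char *path)
    {
        FILE *bit = fopen(path, "rb");
        if (!bit) { perror("fopen bitstream"); return -1; }

        int dev = open("/dev/xdevcfg", O_WRONLY);
        if (dev < 0) { perror("open /dev/xdevcfg"); fclose(bit); return -1; }

        char buf[4096];
        size_t n;
        while ((n = fread(buf, 1, sizeof buf, bit)) > 0) {
            if (write(dev, buf, n) != (ssize_t)n) {
                perror("write /dev/xdevcfg");
                close(dev); fclose(bit); return -1;
            }
        }

        close(dev);
        fclose(bit);
        return 0;
    }

    int main(int argc, char **argv)
    {
        if (argc != 2) {
            fprintf(stderr, "usage: %s <partial.bin>\n", argv[0]);
            return 1;
        }
        return load_partial_bitstream(argv[1]) ? 1 : 0;
    }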

Heterogeneous systems

A general trend in both research and industry is to use more and more heterogeneous systems. Heterogeneous systems are composed of several different processing architectures. An example of this is the Xilinx ZynQ platform, which features two hard ARM cores and a large FPGA section in one IC. The Xilinx ZynQ family of FPGAs will be discussed later in this chapter. Another example of a heterogeneous system can be seen in the Related work-section of this report, more precisely the work of Blair et al. [14]. Heterogeneous systems enable programmers to utilize the different properties of different processing architectures for different tasks. For instance, a task that can be run in parallel can be put on an FPGA, while a strongly serial task can be run on the much faster CPU. However, these systems are not without drawbacks. Different processing architectures use different methods of execution, and tasks must be adapted to fit these methods in order to work correctly.

Computer Vision on FPGAs and heterogeneous systems

The parallel nature of FPGAs makes them ideal for running image processing algorithms due to the structure of these algorithms. An example of this is the conversion between the YUV422 color format and the RGB color format. The equation for the conversion can be seen in equation 2.2.

R 1 0 1.13983  Y  G 1 −0.39465 −0.58060 U (2.2) B 1 2.03211 0 V If the simple conversion example seen in equation 2.2 would be run in series it would require at least P ictureW idth ∗ P ictureHeight iterations to finish. For an image in the VGA-format (640x480 pixels) 307200 iterations are then needed. If this conversion would be implemented on an FPGA all pixels could be converted at the same time in parallel, that is only a few iterations would be required (assuming that the entire picture is available in the FPGA component at the start of the conversion process, which rarely is the case). One example if this can be found in the work of Hamid et al. [29] where a filtering algorithm that took 17 iterations to finish on a CPU only took 5 iterations on an FPGA. Integrating hard processor cores in to FPGAs is no new idea, however. For example, previous Xilinx FPGAs have featured PowerPC-processors integrated into them (Virtex-IV) and a wide range of soft CPU-cores exist for integrating in the FPGA-fabric, such as Microblaze from Xilinx and NIOS from Altera. Using heterogeneous systems for video processing have several positive implications for the overall system performance and usability as will be seen later in this report.

Xilinx ZynQ-7000 and Xilinx ZC702

This thesis focuses on using partial reconfiguration of the Xilinx ZynQ-7000 FPGA family. Hence, it is important to discuss the features and properties of these devices and especially to present the development boards used: the Xilinx ZC702 and the GIMME2 board, which is presented in the next section. Xilinx calls the ZynQ family an EPP family due to the fact that it features both an ARM processor and an FPGA block in the same package. The general idea of using the EPP is demonstrated in figure 2.15. An outline of the Xilinx ZynQ SoC can be seen in figure 2.16, and a table showing some of the basic characteristics of the different devices in the ZynQ family can be found in Appendix B. The ZynQ can generally be divided into two different regions: the PL, featuring the FPGA fabric, and the Processing System (PS), featuring the ARM processor. The Xilinx ZynQ SoC features a wide range of embedded peripherals. The most interesting ones for this report are the DevC interface (where the PCAP interface is located) and the Advanced eXtensible Interface (AXI) bus between the PL and the PS. The AXI bus is a high-performance bus developed and specified by ARM. It provides developers with an easy way of interfacing between the PL and the PS sections. The principal layout of the DevC module of the Advanced RISC Machine (ARM) processor on the Zynq can be seen in figure 2.14. The latest version of the AXI bus, version 4, has three distinct implementations: AXI4, AXI4-Lite and AXI4-Stream. The standard AXI4 bus is a burst-based master-slave bus with independent channels for read addressing, write addressing, data reception, data transmission and transmission response. The width of the data channels can range from 8 up to 1024 bits. Interconnects are used to connect masters to slaves, and vice versa, and several masters can be connected to the same interconnect. AXI4-Lite is a reduced version of the standard AXI4 bus designed for simpler communication between masters and slaves; it is not capable of burst reads or writes. The width of its buses is also limited to either 32 or 64 bits. AXI4-Stream is a stream-based communication interface that is designed for high-performance applications such as video processing. As AXI4-Stream is stream based, it lacks the address channels present in regular AXI4.

Figure 2.14: The layout of the DevC module. [38]

AXI4-Stream can either be used as a direct protocol, where a master unit writes directly to a slave unit, or together with an interconnect in order to perform operations on the data stream such as routing or resizing. In order to pass stream-based data into a memory an extra component is needed, and a common technique is to utilize a DMA device to perform such operations without the direct involvement of the processing unit. This is common in video applications, where frame buffers are used to store video frames between the various stages of the video pipeline. More information about the regular AXI4 protocol and AXI4-Lite can be found in the specification supplied by ARM. [6] More information about the AXI4-Stream protocol can be found in the specification released by ARM. [5] The Xilinx ZynQ has integrated AXI-based high-performance ports for PL access to the memory connected to the PS. This allows the two different sections to share memory. Furthermore, other AXI-based ports are available for communication and peripheral sharing between the PS and PL sections. Interrupts and Extended Multiplexed I/Os (EMIOs) are also routed between the PL and the PS. More extensive information about the Xilinx Zynq SoC can be found in the Technical Reference Manual. [38] The Xilinx ZC702 is a development board featuring a Xilinx ZynQ XC7Z020 SoC and a wide range of on-board peripherals. A picture showing these various peripherals can be seen in figure 2.17.
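As an illustration of how software on the PS can reach an AXI4-Lite slave placed in the PL, the sketch below maps a physical address range through /dev/mem and accesses two 32-bit registers. The base address 0x43C00000 and the register layout are placeholders; the real values come from the address map of the actual FPGA design.

    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    /* Hypothetical base address of an AXI4-Lite slave in the PL. */
    #define PL_SLAVE_BASE 0x43C00000u
    #define MAP_SIZE      0x1000u

    int main(void)
    {
        int fd = open("/dev/mem", O_RDWR | O_SYNC);
        if (fd < 0) { perror("open /dev/mem"); return 1; }

        volatile uint32_t *regs = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE,
                                       MAP_SHARED, fd, PL_SLAVE_BASE);
        if (regs == MAP_FAILED) { perror("mmap"); close(fd); return 1; }

        regs[0] = 0x1;                        /* write control register 0  */
        printf("status: 0x%08x\n", regs[1]);  /* read back status register */

        munmap((void *)regs, MAP_SIZE);
        close(fd);
        return 0;
    }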

GIMME2

GIMME2 is a computer vision platform developed by Professor Lars Asplund at Mälardalen University together with AF Inventions GmbH. The board features, for example, a Xilinx ZynQ XC7Z020-2 SoC, two Omnivision OV10810 10-megapixel image sensors and separate DDR memories for the PL and the PS. A short technical summary of the board can be seen in figure 2.18, and the board itself can be seen in figure 2.19. The GIMME2 platform is intended to function as a research platform for researchers at Mälardalen University as well as at other research institutes.

Figure 2.15: Figure showing the general idea of how to utilize the Xilinx EPP-family. [41]

Figure 2.16: Figure showing the outline of the Xilinx ZynQ-SoC. Image from Xilinx document UG585. [38]

Figure 2.17: Figure showing the ZC702 board peripherals. [41]

Figure 2.18: Figure showing the outline of the GIMME2 platform. [3]

Figure 2.19: Figure showing the GIMME2 board's front (right) and back side (left). Notice the two image sensors on the back side of the PCB (encircled in red). Also notice the Zynq SoC (encircled in yellow), the PS DDR memory (encircled in blue) and the PL DDR memory (encircled in purple).

Related work

Implementations of computer vision on FPGAs and heterogeneous systems

Utilising the computational power of an FPGA in a mono or stereo vision system is not a new concept. It has been the target of many researchers' work throughout the years, for reasons already accounted for. One implementation of computer vision on an FPGA that is interesting and relevant to this report is the work of Hosseini and Hu. [29] Hosseini and Hu compared the performance of a hard-logic solution against a soft CPU core (Altera Nios II) implemented on an FPGA when given the task of filtering a 64 x 64 pixel or 256 x 256 pixel grey-scale image using an n x n coefficient matrix. It was found that the logic FPGA implementation was immensely faster than a similar implementation on a CPU; the authors found that the logic-based solution could perform up to 80 times faster. The results of this article can be seen in table 3.1.

Table 3.1: Table showing the results from the work by Hosseini and Hu. [29] The first four rows are for the filtering of a 64 x 64 pixel image while the last two are for the filtering of a 256 x 256 pixel image.

An overview of the algorithm implemented in hard logic and the algorithm implemented on the soft CPU core by Hosseini and Hu can be seen in figure 3.1. Another implementation of a computer vision system on an FPGA can be found in the article by Ohmura and Takauji. [55] Ohmura and Takauji used a stereo-vision system with the Orientation Code Matching (OCM) algorithm running on an Altera FPGA. In short, the OCM algorithm is designed to find similarities between pictures, and one possible application is stereo matching. An extensive explanation of the OCM algorithm can be found in the same paper. The stereo-vision module used was capable of supplying images of size 752 x 480 pixels at 60 FPS. The FPGA on the module was an Altera Cyclone III running at 53 MHz. A block diagram of the developed system can be seen in figure 3.2. The implemented system performed with minimal delay but used 82% of the available Logic Elements on the FPGA. The system used a 16 x 16 pixel template size with a maximum disparity of 127 pixels. An overview of the algorithm implemented on the FPGA can be seen in figure 3.3.

Figure 3.1: Figure showing the system from the article by Hosseini and Hu. [29] To the left is the logic implementation and to the right is the CPU implementation using the Altera Nios II.

Figure 3.2: System developed by Ohmura and Takauji. Picture retrieved from the article by Ohmura and Takauji. [55]

The work of Komuro et al. [47] is another example of a vision system implemented on a heterogeneous system featuring a microprocessor and an FPGA. They developed a high-speed system capable of retrieving 1000 FPS using a single-camera setup. The architecture used can be seen in figure 3.4. The camera used was capable of outputting 1280 x 512 pixel frames at 1000 FPS. By splitting functionality between the CPU and the FPGA the team managed to get acceptable performance out of the system. The implemented system, with the CPU running at 266 MHz and the FPGA at 200 MHz, was compared to a PC with a dual-core processor running at 1.86 GHz and 3 GB of RAM. On the PC, OpenCV was used to provide high-level image processing capabilities. In table 3.2 a comparison between similar functions running on the heterogeneous system and on the PC can be seen. It is quite clear that even though the clock frequency of the heterogeneous platform is much lower, it can still outperform the dual-core PC in most cases due to the increased level of parallelism achieved.

Blair et al. [14] present a vision system implemented on a heterogeneous platform (GPU, FPGA and CPU) for detection of pedestrians in real time using an algorithm called HOG (Histogram of Oriented Gradients). In short, HOG works by looking at the intensity gradients of image pixels and then uses this data to detect objects. Blair et al.'s system was based around a dual-core Intel processor running at 2.4 GHz, a Xilinx Virtex-6 FPGA and an Nvidia 560Ti Graphics Processing Unit (GPU). The devices were connected together using the PCI Express bus. Each stage of the implementation can run on any of the processing units and mixing between the different architectures is allowed. Figure 3.5 shows the possible paths data can take. The performance of the system was evaluated by sending two 1024 x 768 pixel images through all the different data paths. One of the images was single scale and the other had 13 scales. The fastest data path, for both images, was the one mainly using the GPU, 6.8 ms and 47.0 ms.

Figure 3.3: System developed by Ohmura and Takauji. Picture is retrieved from the article by Ohmura and Takauji. [55]

Figure 3.4: Komuro et al.'s architecture. [47]

The slowest data path, for both images, was the one mainly using the CPU, 174.3 ms and 1376 ms. Using mainly the FPGA gave the third best times, 10.1 ms and 124.5 ms. The authors conclude that communication delay is a major problem, and combining two processing units in one data flow is therefore not preferred. Furthermore, Blair et al. state that it would be better to dedicate each processing unit to a specific task and then combine the results of each task at the end, instead of splitting the algorithm between processing units. For more information about the implementation and design of computer vision systems on FPGAs, please refer to the comprehensive book on the subject by Bailey. [9] An overview of the state of the art in heterogeneous computing can be found in the work by Brodtkorb. [15]

Implementations of reconfigurable FPGA-systems

Thoma et al. [63] discuss a method for dynamic pipeline reconfiguration of a soft-core processor implemented on an FPGA. They present a novel processor core whose pipeline depth can be reconfigured in order to increase performance or decrease power consumption. The processor core is based upon an already existing one called LEON3. LEON3 is implemented in VHDL and can hence be used on an FPGA; for more information about LEON3, please refer to its technical paper. [1] By cleverly joining or splitting adjacent pipeline stages dynamically, Thoma et al. demonstrated a relative saving of 3.8% in cycle count over the execution of a test program. In the master thesis by Hamre [30], a framework for dynamic reconfiguration of FPGAs is presented. Hamre's work is closely related to the work presented in this report, as the same principal ideas are shared. Hamre thoroughly presents the concept of partial reconfiguration and also the difficulties of implementing it. Hamre's framework is designed to work with Linux and uses the Xilinx ICAP port for reconfiguration.

Table 3.2: Komuro et al.'s performance. [47]

Figure 3.5: Blair et al.'s performance. [14]

The results show that partial reconfiguration of an FPGA from Linux is possible with the available tools. However, no real performance evaluation is made in Hamre's report. In the article by Lesau et al. [50] the usage of real-time reconfiguration in combination with embedded Linux is discussed, and furthermore a set of tools for easier handling of reconfiguration is presented. Lesau et al. have successfully implemented tools like mailboxes for Linux-to-hardware-module communication and also a hardware administrator that handles reconfiguration. This system was implemented on a Xilinx Virtex-5 FPGA using MicroBlaze cores and PetaLinux. Lesau et al. successfully proved that this kind of hardware and software layout can be used for handling dynamic reconfiguration. Meyer et al. present a new configuration method for FPGAs, which they have named "Fast start-up", in their article on the subject. [53] By manipulating the bit streams generated by the regular development tools, Meyer et al. significantly decreased the configuration time of a Xilinx Spartan-6 FPGA. The results from these tests can be found in table 3.3. In Ming et al. [51] a comparison between different ICAP components is made and the reconfiguration speed versus the bit-file size is compared.

Table 3.3: Meyer et al.'s results. [53]

They found that by making radical changes to the standard ICAP, the reconfiguration time could be lowered by an order of magnitude. Ming et al.'s results (reconfiguration time versus file size) can be seen in table 3.4. In figure 3.6 the almost linear relationship between bit-file size and reconfiguration time can be seen.

Table 3.4: Table showing Ming et al.’s results taken from their article. [51]

In their work, Koch et al. [45] present the concept of partial reconfiguration and also demonstrate some tools commonly used for achieving it. Furthermore, some possible applications of partial reconfiguration are presented to the reader, one of them being a "self-adaptive reconfigurable video-processing system". This implemented system can be seen in figure 3.7. The modules seen in the picture can be dynamically loaded and unloaded during run-time. The system was implemented on a Xilinx Virtex-II FPGA. No performance figures are given, but it would appear that the performance of the system is acceptable for non-real-time applications, thus proving that the concept is implementable. Koch has also published a book on the subject named "Partial Reconfiguration on FPGAs" which may be of interest to the reader. [44] Koch is also a co-author of the article where the GoAhead tool for partial reconfiguration is presented. [10] GoAhead aims to provide developers with a simple user interface for creating systems containing reconfigurable modules. Several other tools and work flows for creating run-time reconfigurable systems have emerged lately, for example Dreams [56], the work by Ruiz et al. [54] and the work by Dondo et al. [22] Gantel et al. [27] discuss a possible algorithm for module relocation during run-time. The work was performed on a Xilinx FPGA, mostly using the Xilinx tool Isolation Design Flow. The main problem when dealing with relocation of modules on an FPGA is that each module must have a unique bit-stream for each possible module location on the FPGA. Gantel et al. want to make it possible to have only one bit-stream per module and then, using a bit-stream parser and a bit-stream relocator, adapt it so that the module can be placed anywhere on the FPGA. This would allow for a more efficient design. Gantel et al. succeeded with module relocation on the FPGA with their proposed method.

Figure 3.6: Figure showing Ming et al.'s results taken from their article. [51]

Garcia et al. [28] presented a possible application of reconfigurable FPGAs within Wireless Sensor Networks (WSNs). In the work by Papadimitriou et al. [57] the authors present their solution for providing fast reconfigurations and also propose a model for cost estimation of the reconfiguration process.

Implementations of reconfigurable FPGA-systems running computer vision algorithms

Ackermann et al. [2] implemented a self-reconfigurable video processing system on an FPGA. They present two different implementations of the system. In the first one a Xilinx Virtex-IV FPGA is used and reconfiguration is performed with the help of the ICAP and a MicroBlaze processor. Some common image processing algorithms were implemented as modules used for the reconfiguration testing: the first was binarization, the second was edge detection using 2-D gradients and the third was edge detection using a horizontal derivative. The bit-streams are stored on a CompactFlash card accessed from the MicroBlaze processor. Image data is retrieved from a camera with a resolution of 2048 x 2048 pixels at 16 Frames Per Second (FPS). For both implementations only one image processing algorithm was allowed on the FPGA at a time. An overview of the first implementation can be seen in figure 3.8. In the second implementation, the MicroBlaze processor has been removed and replaced by a small bit-stream controller connected to a Static Random Access Memory (SRAM) where all bit-streams are located. The results of the two different implementations, i.e. the time needed to reconfigure the FPGA, can be found in table 3.6, and the size of each implemented processing algorithm can be found in table 3.5. This clearly shows that run-time reconfiguration is possible even in such demanding systems as video processing systems.

Another implementation of dynamic reconfiguration in a multimedia application was presented in the work by Bhandari et al. [11] Bhandari et al. used a Xilinx Virtex-4 FPGA to implement a test platform where a VGA camera stream was fed into the FPGA, past a reconfigurable filter module and then output onto a monitor. Also available on the FPGA was an audio module that also had a reconfigurable filter slot. Three programming techniques utilizing the ICAP were tested.

Figure 3.7: Koch et al.'s implemented system. Picture is taken from the related article. [45]

Table 3.5: Ackermann et al.'s results. [2]

The first two, OPB-HWICAP and XPS-HWICAP, are made by Xilinx while the last one, SEDPRC, was designed by Bhandari et al. The results of these tests can be seen in table 3.7. As can be seen, the novel ICAP interface performs much better than the Xilinx alternatives; hence the performance of the ICAP interface can be improved by utilizing custom-made components. A run-time reconfigurable "multi-object tracker" was implemented by Perschke et al. [58] The implemented system is based on a Xilinx Virtex-4 FPGA with a PowerPC CPU, and video is retrieved from a camera running at 384 x 286 pixels and 25 FPS. Just like Bhandari et al., Perschke et al. implemented a new ICAP controller in order to improve the performance of the system, as the standard Xilinx ICAP controller was deemed too slow to be used. In table 3.8 the total resource usage of each implemented component can be seen, and in table 3.9 the results from the article can be found. Perschke et al. conclude that components can be switched between frames and hence no delay in the object-tracking algorithm is produced. Looking at the resource utilization of the components used and the speed of the switching, it can be concluded that this method can be used in high-speed vision systems. An application of partial reconfiguration within the automotive industry can be found in the work by Claus et al. [18][19] These studies show that FPGAs and real-time reconfiguration can be useful within the automotive industry as well, where time constraints are strict and the overall reliability of the system is a critical factor.

Figure 3.8: Ackermann et al.'s system. [2]

Table 3.6: Ackermann et al.'s results. [2]

Table 3.7: Bhandari et al.'s results. [11]

Table 3.8: Perschke et al.’s results. Picture from the related article. [58]

Table 3.9: Perschke et al.’s results. Picture from the related article. [58]

Method

Early work

In order to evaluate partial reconfiguration on the Xilinx ZynQ using its associated tool suite, several steps were taken to prepare and measure the aspects of interest. At first the general functionality of the different development boards was tested, and a basic system featuring partial reconfiguration of one slot was designed and implemented in order to learn the work flow in the different Xilinx tools. This procedure was completed both on the Xilinx ZC702 board and on the GIMME2 board. The first run-time reconfiguration designs were implemented on the Xilinx ZC702 development board; in later stages the designs were moved onto the GIMME2 board. GIMME2 [4] is a stereo-vision platform developed at Mälardalen University by the Intelligent Sensor System Group. The latest version of the GIMME board, version 2, features a Xilinx Zynq SoC, dual high-resolution cameras etc. GIMME2 was described in the Background section, as was the ZC702 development board. After getting to know the tools better, the state of the art was researched by looking into articles and research papers from various scientific databases such as Institute of Electrical and Electronics Engineers (IEEE) Xplore. Linux was downloaded from the Xilinx git server in the form of binary files. The possibility to adapt the Linux kernel and file system exists and extensive guides are available on the Xilinx Wiki page. [65]

Design considerations

The technical limitations of partial reconfiguration in its current state, as implemented by Xilinx, were accounted for in the Background section of this report. From those limitations some conclusions can be drawn:

• Modules residing within partitions should have similar functionality and communication interfaces in order to minimize fragmentation and design problems.

• The size of reconfigurable partitions should be kept as small as possible in order to minimize fragmentation and free resources for other components residing in the FPGA.

• Partial reconfiguration should only be used in systems that could potentially benefit from it drastically. Small performance improvements are not worth the added complexity during the design process.

Before starting the implementation of the final reconfigurable system, some technical considerations need to be dealt with. As can be seen in the Xilinx documents UG470 [39] and WP374 [23], several different methods exist for configuration and reconfiguration of Xilinx's 7-series FPGAs. Further, the technical reference manual of the Zynq-7000 SoC, UG585 [38], gives a clear overview of the various techniques used to program the PL from the PS. Looking at chapter 6.4.5 in UG585 [38], it is clear that reconfiguration from the PS using the PCAP is not desirable in all cases, as it requires the AXI interface to be turned off during reconfiguration and hence separates the two areas for some time. This means that no data can be passed from the PS to any component in the FPGA during that period.

However, the PCAP supports a throughput of 400 MB/s, which means a low configuration/reconfiguration time. The entire (power-on) configuration flow for the ZynQ using the PCAP [38] is quoted first below, and the entire reconfiguration flow for the ZynQ using the PCAP [38] is quoted second.

1. Wait for PCFG_INIT to be set High by the PL (STATUS bit [4]).
2. Set internal loopback to 0 (MCTRL bit [4]).
3. Set PCAP_PR and PCAP_MODE to 1 (CTRL bits [27 and 26]).
4. Initiate a DevC DMA transfer:
   (a) Source address: location of the PL bitstream
   (b) Destination address: 0xFFFF_FFFF
   (c) Source length: total number of 32-bit words in the PL bitstream
   (d) Destination length: total number of 32-bit words in the PL bitstream
5. Wait for PCFG_DONE to be set High by the PL (INT_STS bit [2]).

1. Disable the AXI interface to the PL.
2. Disable the PL level shifters by writing 0xA to the SLCR LVL_SHFTR_EN register.
3. Set PCAP_MODE and PCAP_PR High.
4. Clear the previous configuration from the PL (optional):
   (a) Set PCFG_PROG_B High.
   (b) Set PCFG_PROG_B Low.
   (c) Check for PCFG_INIT = 0 (STATUS bit [4]).
   (d) Set PCFG_PROG_B High.
5. Check for PCFG_INIT = 1 (STATUS bit [4]).
6. Set INT_PCAP_LPBK Low (MCTRL bit [4]).
7. Initiate a DevC DMA transfer:
   (a) Source address: location of the new PL bitstream.
   (b) Destination address: 0xFFFF_FFFF.
   (c) Source length: total number of 32-bit words in the new PL bitstream.
   (d) Destination length: total number of 32-bit words in the new PL bitstream.
8. Clear PCFG_DONE_INT by writing a 1 to INT_STS[2].
9. Wait for PCFG_DONE_INT to be set High.
10. Enable the PL level shifters by writing 0xF to the SLCR LVL_SHFTR_EN register.
11. Enable the AXI interface to the PL.
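From Linux, the DMA-transfer part of this flow is wrapped by the Xilinx devcfg kernel driver behind the character device /dev/xdevcfg, so a partial bitstream can be loaded by simply writing the (typically header-less .bin) file to that device. The sketch below shows this, assuming the Xilinx Zynq Linux kernel with the xdevcfg driver enabled; the sysfs attribute path used to mark the bitstream as partial is an assumption and may differ between kernel versions, and steps such as disabling the AXI interface and resetting the module are omitted here.

/* Minimal sketch: partial reconfiguration over the PCAP from Linux.
 * Assumes the Xilinx Zynq kernel with the xdevcfg driver enabled; the
 * is_partial_bitstream sysfs path below is an assumption and may differ
 * between kernel versions. Error handling is kept short for clarity. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
#include <unistd.h>

static long elapsed_us(const struct timeval *a, const struct timeval *b)
{
    return (b->tv_sec - a->tv_sec) * 1000000L + (b->tv_usec - a->tv_usec);
}

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s partial_bitstream.bin\n", argv[0]);
        return 1;
    }

    /* Read the partial bitstream into memory first so that file I/O does
     * not pollute the reconfiguration time measurement. */
    FILE *bit = fopen(argv[1], "rb");
    if (!bit) { perror("fopen"); return 1; }
    fseek(bit, 0, SEEK_END);
    long size = ftell(bit);
    rewind(bit);
    char *buf = malloc(size);
    if (!buf || fread(buf, 1, (size_t)size, bit) != (size_t)size) {
        fprintf(stderr, "could not read bitstream\n");
        return 1;
    }
    fclose(bit);

    /* Tell the driver that the next write is a partial bitstream
     * (the sysfs path is an assumption, check the kernel version used). */
    FILE *flag = fopen("/sys/devices/amba.0/f8007000.devcfg/is_partial_bitstream", "w");
    if (flag) { fputs("1", flag); fclose(flag); }

    int devcfg = open("/dev/xdevcfg", O_WRONLY);
    if (devcfg < 0) { perror("open /dev/xdevcfg"); return 1; }

    /* Time only the write() call itself. */
    struct timeval t0, t1;
    gettimeofday(&t0, NULL);
    if (write(devcfg, buf, size) != size)
        perror("write");
    gettimeofday(&t1, NULL);
    close(devcfg);

    printf("PCAP write of %ld bytes took %ld us\n", size, elapsed_us(&t0, &t1));
    free(buf);
    return 0;
}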

A more flexible solution can be found in WP374 [23], where a logic component in the FPGA takes care of the reconfiguration mechanism and hence efficiently hides it from the PS. A simple micro-controller, such as a Xilinx MicroBlaze, or a state machine could be in charge of reconfiguration and accept reconfiguration requests from the PS over the AXI bus. Partial bit-files would then be stored at known memory locations, and as each bit-file is tied to a specific area, no reconfiguration could affect any logic on the FPGA other than the logic meant to be affected. This system would resemble the one presented in the work by Hamre. [30] The only time the PCAP would be used is during boot-up, when the base-system bit-file is loaded onto the FPGA; after the initialisation of the FPGA is completed, all reconfigurations would be handled by the logic in the FPGA. A sketch of this concept can be seen in figure 4.1. However, little to no documentation exists for using the ICAP component on the Zynq, and developing such a system turned out to be troublesome, as will be seen later in this text.

Figure 4.1: Figure showing the Partial Reconfiguration flow from WP374. [23]

The main issue lies in the fact that the Zynq already has a built-in configuration/reconfiguration port, the PCAP, and that the configuration interfaces are mutually exclusive. Which configuration interface is active is controlled via a register in the PS, and it must be set correctly before the ICAP or the PCAP can be used. [38] If one consults the Xilinx application note XAPP1159 [46], especially table 2, some typical reconfiguration times using the PCAP interface can be found. From Linux, a partial reconfiguration time of 2 milliseconds is specified. If such a low reconfiguration time can be achieved using the PCAP, then it could be acceptable to turn off the AXI bus interface for that period of time. However, Xilinx writes in the text that the 2 ms only covers the "beginning and end of the DevC DMA transfer driver function call"; hence some steps of the reconfiguration flow are not included in this time. These excluded steps can be seen in figure 4.2. Still, it is interesting to see the performance of the partial reconfiguration flow both from Linux and from so-called stand-alone software.

Another interesting problem one is faced with when dealing with partial reconfiguration is how to synthesize each reconfigurable module in an efficient way. Each module needs to be synthesized independently in order for partial reconfiguration to work properly. Xilinx's method for achieving this is to ensure that all components that shall be able to reside inside a reconfigurable partition have the same interfaces towards the rest of the FPGA and then to import each component into the Embedded Development Kit (EDK) environment. By changing the component instantiation in the mhs-file associated with the project between synthesis runs, a net-list is generated for each component. This set of net-lists is then exported to the partial reconfiguration project in Planahead. This methodology may seem clumsy at best, but it has proven to work without any major malfunctions. Another approach is to use component instantiations in the HDL files and then directly edit these instantiations between synthesis runs. As Xilinx enforces the bottom-up synthesis methodology, users can synthesise modules using any synthesis tool. However, Input/Output (I/O) insertion must be disabled during synthesis, as the module pins will not be connected to package pins but to the static logic of each partition. [35]

Figure 4.2: Excluded partial reconfiguration steps from XAPP1159 shown in red. [46]

For this report the EDK method was mostly used, as it seemed smoother, but the work flow using a separate synthesis tool was also tested and confirmed to work. As stated earlier, a reconfigurable partition consists of two parts: (1) the static logic and (2) the reconfigurable module itself. [35] In order to guarantee successful partial reconfiguration, the static logic for all modules residing in a partition must be identical after implementation. In the Xilinx ISE tool suite this is checked with the Verify tool in Planahead. After reconfiguration each module must be reset in order to guarantee the functionality of the module. In earlier versions of the Xilinx ISE this had to be done by the user via an external reset signal, but later versions support automatic resets if extra constraints are added to the design during the implementation phase. For the designs made on the ZC702 an external reset was used, and for the design on the GIMME2 board the extra reset constraints were added.

Method of reconfiguration

In order to test the different properties of the ICAP and the PCAP, the speed of reconfiguration and the usability of partial reconfiguration, a number of reference systems were created. For the first reference system, the reference design provided by Xilinx [46] was used for initial testing on the Xilinx ZC702 board. A schematic of this design can be seen in figure 4.4. A lot of the functionality provided by the original XAPP1159 [46] design was scaled away, and some new functionality, such as more test patterns and support for new filters, was added. This was done in order to simplify the system and provide more test cases. The reconfiguration time of one reconfigurable module was then measured when using the PCAP. Both reconfiguration from Linux and reconfiguration from so-called stand-alone software were performed and evaluated, and both methods were verified to work correctly.

In order to evaluate the ICAP component, another system was designed featuring a MicroBlaze processor and an ICAP interface. This system was mainly tested on the GIMME2 platform, as the possibility of storing bit-files in either the PS-DDR memory or the PL-DDR memory is appealing there. The PS initiates a reconfiguration by writing a set of messages to the shared mailbox found in the system. The MicroBlaze processor is responsible for passing on the arguments sent from the PS to the ICAP. The MicroBlaze/PL can access the PS-DDR memory using one of the high-performance master ports found in the ZynQ. After reconfiguration is complete, the PL returns the time needed to perform the reconfiguration through the mailbox. This system can be seen in figure 4.3.
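A sketch of how the PS side of this mailbox protocol could look is given below. The two-word request (bit-file address and length), the single-word reply carrying the measured time and the register layout are hypothetical illustrations of the protocol described above, not the register map of the actual mailbox IP used; the register pointers are assumed to have been memory-mapped beforehand, for example via /dev/mem from Linux.

/* Sketch of the PS side of a mailbox-driven reconfiguration request.
 * Message layout and register names are hypothetical illustrations. */
#include <stdint.h>

struct mailbox_regs {
    volatile uint32_t *write_fifo;    /* PS -> MicroBlaze                 */
    volatile uint32_t *read_fifo;     /* MicroBlaze -> PS                 */
    volatile uint32_t *read_status;   /* non-zero when a reply is waiting */
};

/* Ask the MicroBlaze to reconfigure the partition with the partial bit-file
 * at 'bitstream_addr' (physical address) and return the reported time. */
uint32_t request_reconfiguration(const struct mailbox_regs *mb,
                                 uint32_t bitstream_addr,
                                 uint32_t bitstream_bytes)
{
    *mb->write_fifo = bitstream_addr;
    *mb->write_fifo = bitstream_bytes;

    while (*mb->read_status == 0)
        ;                              /* busy-wait for the completion reply */

    return *mb->read_fifo;             /* e.g. reconfiguration time in us    */
}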

Figure 4.3: Reference design with Microblaze and ICAP added.

This system was designed mainly to test the possibility of reconfiguring the PL using the ICAP on the Zynq; no actual verification of the correctness of the reconfiguration was possible in this design. A simple video system was therefore also implemented in order to verify the correctness of the reconfiguration process. This is a simpler version of the design seen in figure 4.3, but with a Video DMA added to the system in order to shuffle data back and forth to the filter component. This design can be seen in figure 4.5.

Figure 4.4: Reference design from XAPP1159.

By setting up the Video Direct Memory Access (VDMA) correctly, one can inspect the input and output of the filter and determine whether the reconfiguration was successful or not. The picture to be filtered was stored in the PS Dual Data Rate (DDR) memory and the filtered output was stored at another address, allowing for easy comparison between the two. The last design tested was similar to the one seen in figure 4.5, but with the ICAP connected directly to the PS instead. This system can be seen in figure 4.6. The reason for implementing this design was that the earlier ICAP designs, seen in figure 4.3 and figure 4.5, failed to complete the reconfiguration properly. Using the Xilinx tool Chipscope, data could be seen going into the ICAP and all functions returned non-error return codes. However, when data was transmitted to the reconfigurable module under test, no change in the output could be seen, and hence no reconfiguration actually took place. The reason for this is still unknown, but a Xilinx employee suggested that it might have something to do with timing issues. [59] The implementation with the ICAP connected directly to the PS worked with the same source code and configuration, which indicates that this suggestion might be true. No solution to this problem has yet been found. All designs with the ICAP used a 100 MHz clock for the ICAP core, and the memory frequency for both memories used was 533 MHz. The 100 MHz clock was also used as the clock source for the PCAP.
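The comparison step of this verification can be reduced to a simple pixel-wise check between the source frame and the frame written back by the VDMA, against a software model of whichever filter is expected to be loaded. The sketch below illustrates this; the buffer mapping, image size and the reference model are placeholders and not the exact code used.

/* Sketch: verifying a reconfiguration by comparing the frame written back
 * by the VDMA with what a software model of the currently loaded filter
 * would have produced. WIDTH, HEIGHT and the model are placeholders. */
#include <stddef.h>
#include <stdint.h>

#define WIDTH  1920
#define HEIGHT 1080

/* Software model of the filter expected to be loaded in the partition. */
typedef uint32_t (*filter_model_t)(uint32_t pixel);

size_t count_mismatches(const uint32_t *in, const uint32_t *out,
                        filter_model_t model)
{
    size_t mismatches = 0;
    for (size_t i = 0; i < (size_t)WIDTH * HEIGHT; i++)
        if (out[i] != model(in[i]))
            mismatches++;
    return mismatches;    /* zero mismatches => the expected filter is active */
}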

Implemented vision components

In order to test the functionality of the Xilinx ISE tool suite, several different vision components were implemented. The first component was a simple filter that marks pixels with a red component over 127 as white (255,255,255) and all other pixels as black (0,0,0). The short code snippet used to represent this filter can be seen in figure 4.7. This filter was designed to utilize the AXI-Stream interface in order to provide high performance and low latency. The generated module was added to the modified XAPP1159 [46] reference design and also to the design seen in figure 4.5.

Figure 4.5: Implemented video design.

The performance of the filters was evaluated by sending a video stream to them and simultaneously reading back the resulting video stream. This was done using the Xilinx VDMA. The second component was a color filter with the ability to set both which color component to filter on (red, green or blue) and the cut-off value (0-255). This data is sent to the module via the AXI4-Lite control bus. The code for the main filtering algorithm used in the color filter can be seen in figure 4.8. Using the template given by Xilinx in XAPP1159 [46], making more stream-based components is easy in Vivado HLS, and the components are not limited to filters; other algorithms such as blob detection and segmentation can also be implemented. There is also some support for using OpenCV functions directly in Vivado HLS, but at the current time this is limited to OpenCV 1.0, an old and deprecated version that lacks much of the functionality and data management found in later versions. As the focus of this thesis is not OpenCV itself, no time was spent on trying to port OpenCV code to VHDL using Vivado HLS. Furthermore, most OpenCV functions have different input and output arguments and are hence hard to match to reconfigurable partitions, where hard constraints exist on the similarity between reconfigurable modules. In order to convert OpenCV functions to working reconfigurable modules, a lot of work would be needed.

Figure 4.6: Second implemented video design.
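As an illustration of the kind of per-pixel kernel described above (not the actual code shown in figure 4.7), a plain-C version of the red-threshold filter could look as follows. The pixel layout (red in the lowest byte) and the buffer-based interface are simplifying assumptions; the real component operates on one pixel at a time over the AXI4-Stream interface generated by Vivado HLS.

/* Illustrative per-pixel threshold kernel in plain C. Assumes 32-bit
 * RGBA-style pixels with the red component in the lowest byte; the actual
 * byte ordering and the streaming interface directives required by Vivado
 * HLS depend on the concrete design. */
#include <stddef.h>
#include <stdint.h>

void red_threshold(const uint32_t *in, uint32_t *out, size_t n_pixels)
{
    for (size_t i = 0; i < n_pixels; i++) {
        uint8_t red = in[i] & 0xFF;          /* extract the red component  */
        /* Pixels with red > 127 become white, all others become black.   */
        out[i] = (red > 127) ? 0x00FFFFFFu : 0x00000000u;
    }
}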

System design in Xilinx ISE/Planahead

In order to successfully implement a reconfigurable system, one must follow the design flow associated with the Xilinx ISE tool suite. This work flow can be seen in figure 4.9. In short, the design process is made up of two parts: synthesis and implementation. Both are performed in Xilinx Planahead, but the current versions of Planahead require the two phases to be conducted in different projects due to the bottom-up synthesis demanded by Xilinx. This means that first one project is made, in which the general system design and all reconfigurable modules are synthesised. After all net-lists and constraints have been generated, these are exported into a new Planahead project where the actual implementation takes place. After successful implementation, bit-files are generated, two for each possible configuration: one full configuration containing the intended module and one partial bit-file made for reconfiguration.

Figure 4.7: Red filter code.

Interface from Linux

As mentioned earlier, interfacing with custom or Xilinx processor peripherals located in the PL is done through some kind of bus. The latest versions of the Xilinx software support the AXI bus, derived from the AMBA bus specified by ARM. [6] This report focuses on components based around the AXI bus and hence on processor peripherals only. In order to write to and read from components connected to the bus, one must first map the appropriate physical memory area into the virtual memory of Linux. This is done by memory-mapping the physical address region (e.g. with the mmap system call), either in userspace from a user application or in kernelspace from a kernel module. After successful memory mapping, regular write and read operations can be performed on the returned address. However, several software components can map the same address space without problem, and hence it is up to the programmer to make sure that mutual exclusion is achieved. Example code for interfacing with IP components located in the FPGA from userspace can be found in Appendix A. The details of the hardware interface can also be hidden from the user by means of drivers embedded in the operating system. However, as there is no Basic Input/Output System (BIOS) present in the ZynQ SoC, no support for so-called "Plug-and-Play" exists. Instead, a static tree-like structure called a devicetree must be built and passed to the kernel during boot in order for the Linux kernel to load the correct drivers. This makes the system somewhat static, as no new devices can be added during run-time with support from the operating system (the only exception being Universal Serial Bus (USB) devices), but it is an efficient solution to a complex problem.
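A minimal sketch of this userspace interfacing, in the spirit of the code in Appendix A, is shown below: an AXI4-Lite peripheral is mapped into the process with mmap on /dev/mem. The base address and register offsets are hypothetical placeholders; the actual values depend on the addresses assigned in the hardware design.

/* Sketch: userspace access to an AXI4-Lite peripheral via /dev/mem.
 * BASE_ADDR and the register offsets are hypothetical placeholders. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define BASE_ADDR   0x43C00000u   /* placeholder AXI4-Lite base address */
#define MAP_SIZE    0x1000u       /* one 4 kB page                      */
#define REG_CTRL    0x00          /* hypothetical control register      */
#define REG_STATUS  0x04          /* hypothetical status register       */

int main(void)
{
    int fd = open("/dev/mem", O_RDWR | O_SYNC);
    if (fd < 0) { perror("open /dev/mem"); return 1; }

    void *map = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE,
                     MAP_SHARED, fd, BASE_ADDR);
    if (map == MAP_FAILED) { perror("mmap"); return 1; }
    volatile uint32_t *regs = (volatile uint32_t *)map;

    regs[REG_CTRL / 4] = 0x1;                          /* write a control word */
    printf("status = 0x%08x\n", regs[REG_STATUS / 4]); /* read a status word   */

    munmap(map, MAP_SIZE);
    close(fd);
    return 0;
}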

GIMME2

The GIMME2 platform was presented to the reader in the Background section of this report. The first GIMME2 boards were delivered in mid-February 2013. Many of the features available on the platform were then untested and undocumented, such as Linux booting and camera interfacing.

An example Xilinx ISE project was delivered with the boards; it was used to generate programming files for the FPGA and also hardware description files for the Xilinx Software Development Kit (SDK). The example project basically maps all the pins to the right locations and configures the PS of the ZynQ. In order to get the platform up and running Linux, some modifications to various configuration files had to be made. First off, the boot-loader used, U-Boot, did not work with GIMME2 using the same configuration set as for the ZC702 board. A new configuration set was therefore created, based on the Xilinx ZC702 configuration set. The major differences between the configurations are that the GIMME2 board only has 512 MB of memory, that the UART frequency is raised from 50 MHz to 100 MHz and that the addresses used for finding boot files need to be updated to fit within the 16 MB flash memory on the GIMME2 board. When these changes had been made, the boot-loader was recompiled. The output was then uploaded to the GIMME2 board via JTAG using the Xilinx Microprocessor Debugger (XMD) tool. More information about porting U-Boot to new applications can be found in the article on the subject by STMicroelectronics. [60] However, as most of the Multiplexed Input/Output (MIO) pins on the ZynQ SoC are already allocated to other peripherals, the pins dedicated to UART operation are routed through the FPGA. This implies that the FPGA must be programmed when the Linux kernel boots, otherwise the system might hang. Booting the boot-loader from the Secure Digital (SD) card was tried but failed due to a pin-mapping error on the board; this will be fixed in later board revisions.

Once the boot-loader was tested and deemed to work, a valid boot image needed to be created. This was done from the graphical tool in the Xilinx Software Development Kit. The format of the boot image is described in UG821 [37], and an example boot image can be seen in figure 4.10. It is important both that the ordering of the files is correct and that the offsets for the Linux boot files, uImage.bin (the kernel image), devicetree.dtb (hardware description file) and uramdisk.image.gz (compressed file system), are correct with respect to the offsets specified in the boot-loader. An example of this can be seen in figure 4.11. The uImage.bin and uramdisk.image.gz files are retrieved from the 14.4 Xilinx reference design. A programming file for the FPGA can be included in the boot image; it should then be placed just after the First-Stage Boot-Loader (FSBL) file, and the FPGA will then be programmed during the execution of the FSBL. The FSBL file is generated in the Xilinx SDK after the hardware platform description has been exported from the example project provided by AF Inventions GmbH. The devicetree file is created using the Xilinx Device Tree Generator plug-in for the SDK. It contains hardware description information used by the Linux kernel during boot. The devicetree file must be converted into a binary file using the kernel sources and a special conversion script; conversion is possible in both directions. Once the boot image has been created, it is important to verify that its size is less than 16 MB, as the GIMME2 board can only access 16 MB of its 32 MB flash memory. If the size is larger than 16 MB, one can try to change the address offsets in order to make the file smaller. In order to program the boot image to the flash memory on the GIMME2, XMD was used.
First the U-Boot executable was loaded into program memory, then the boot image was loaded into program memory as well. After starting U-Boot, the boot image can be copied from program memory to the flash memory using the built-in tools. A detailed guide for programming flash memory can be found in UG873. [36] Once the flash memory has been successfully programmed, the board can be restarted; the boot image will then be executed and Linux will boot. Other methods of booting are also available, such as booting via TFTP (network boot) using a TFTP server on a remote machine and an Ethernet cable connecting the GIMME2 board and the remote machine. An example of this boot method can be found in UG873. [36]

After a successful Linux boot had been achieved, the Fast Ethernet port available on the GIMME2 board was introduced into the design. The AXI Ethernet Lite component available in the Xilinx IP library was added to the PL and the pin mappings between the IP component and the physical interface were made. In order to access the Fast Ethernet port from Linux, the devicetree needs to be regenerated in the SDK. The boot image then needs to be updated with the new devicetree and uploaded to the flash memory on the GIMME2 board. The Fast Ethernet port will then appear as a third Ethernet port in Linux.

The Fast Ethernet port was then tested using udhcpc and ping. The port could successfully acquire an IP address from a DHCP server and both respond to and transmit PING messages successfully.

Using the Xilinx Memory Interface Generator (MIG) tool, an interface for the PL-DDR memory was made. The interface features an AXI slave connection, which allows the memory to be connected to on-board peripherals simply via the AXI bus. The interface was tested by connecting it to one of the AXI master ports on the Zynq and writing and reading randomized data. The current version of Xilinx's tools does not support multiple memory maps, and hence the extra PL memory cannot be used as system memory for operating systems such as Linux. However, this is not the intention for this memory; a more likely usage is as a frame buffer for the video pipeline. As only one 16-bit memory module is connected to the PL, while there are two 16-bit memory modules with shared addressing buses connected to the PS, only a data width of 16 bits can be achieved, which must be remembered when storing and reading data to/from it. By adding an AXI Interconnect module to the slave port of the memory interface, a two-ported memory controller can be made, where one channel is dedicated to writes while the other is dedicated to reads. This can be useful if data shall be passed from the video pipeline to the PS.

Also, pins from the Controller Area Network (CAN) bus controller in the PS were brought out onto a General Purpose Input/Output (GPIO) header. No drivers for CAN support in Linux exist from Xilinx, but open-source drivers are available. [12][40][13] However, it is unclear to what extent these work, as they have not been tested by the author. A userspace driver for CAN was written but is still untested, as the necessary hardware to test it is currently lacking. Interfacing with the two cameras on the GIMME2 board was not attempted in this work, as it was deemed too time-consuming and somewhat outside the scope of this report; researchers at Mälardalen University (MDH) are currently working on the task.
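A sketch of the kind of randomized write/read test used for the PL-DDR interface is shown below: a seeded pseudo-random pattern is written, the generator is re-seeded and the memory is read back for comparison, so any mismatch points at an addressing or data problem. The memory pointer, test length and seed are placeholders; the memory is assumed to be visible through an AXI master port.

/* Sketch: randomized memory test of the kind used to verify the PL-DDR
 * interface. 'mem' is assumed to be mapped to the PL-DDR address window. */
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

int memory_test(volatile uint16_t *mem, size_t n_halfwords, unsigned seed)
{
    /* Write a reproducible pseudo-random pattern... */
    srand(seed);
    for (size_t i = 0; i < n_halfwords; i++)
        mem[i] = (uint16_t)rand();

    /* ...then re-seed and read it back for comparison. */
    srand(seed);
    for (size_t i = 0; i < n_halfwords; i++) {
        uint16_t expected = (uint16_t)rand();
        if (mem[i] != expected) {
            printf("mismatch at %zu: got 0x%04x, expected 0x%04x\n",
                   i, mem[i], expected);
            return -1;
        }
    }
    return 0;   /* all locations read back correctly */
}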

Figure 4.8: Color filter code.

Figure 4.9: Work flow for Partial Reconfiguration in Xilinx ISE from UG702. [35]

Figure 4.10: Example boot image format for Linux. Picture from UG821. [37]

Figure 4.11: Example boot image format for Linux. Picture from UG873. [36]

Results

Performance of reconfiguration methods

The implemented system, based on the example design found in Xilinx's application note 1159 [46] and featuring three basic image filters, was tested with respect to reconfiguration speed. All partial bit-streams have the same size (160 kB) and no hardware modifications were made to the example design except for adding the simple red filter. The filter module on the PL was reconfigured 20 times in a random fashion and the results were noted with respect to average, best-case and worst-case reconfiguration time. The time needed for reconfiguration was measured using the built-in "gettimeofday" function in Linux. Two data sets were extracted: one where only the time needed to finish the "write" function call was measured, and one where the entire reconfiguration flow is considered: disabling the component to be reconfigured, performing the necessary steps prior to reconfiguration, writing the partial bit-stream, performing the necessary steps after reconfiguration and lastly enabling the component again. These tests were conducted on the Xilinx ZC702 evaluation board that was described earlier in this report. The results can be seen in table 5.1 and table 5.2 respectively.

Average time   4.21 ms
Best case      2.4 ms
Worst case     9.9 ms

Table 5.1: Table showing the time needed to just finish the "write" function call on the ZC702 board running Linux.

Average time   18.56 ms
Best case      13.3 ms
Worst case     22.7 ms

Table 5.2: Table showing the time needed for the full reconfiguration flow on the ZC702 board running Linux.
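A sketch of how such a measurement loop could look is given below; reconfigure_module() is a hypothetical placeholder standing in for either the bare write call or the full reconfiguration flow, depending on which data set is being collected.

/* Sketch: measuring average, best-case and worst-case reconfiguration time
 * over a number of runs with gettimeofday(). reconfigure_module() is a
 * hypothetical placeholder provided elsewhere. */
#include <stdio.h>
#include <sys/time.h>

#define RUNS 20

extern void reconfigure_module(int module_index);   /* placeholder */

int main(void)
{
    double best = 1e9, worst = 0.0, total = 0.0;

    for (int i = 0; i < RUNS; i++) {
        struct timeval t0, t1;
        gettimeofday(&t0, NULL);
        reconfigure_module(i % 3);   /* three filters; the actual tests picked
                                        the module at random                  */
        gettimeofday(&t1, NULL);

        double ms = (t1.tv_sec - t0.tv_sec) * 1000.0 +
                    (t1.tv_usec - t0.tv_usec) / 1000.0;
        if (ms < best)  best = ms;
        if (ms > worst) worst = ms;
        total += ms;
    }

    printf("average %.2f ms, best %.2f ms, worst %.2f ms\n",
           total / RUNS, best, worst);
    return 0;
}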

It shall be noted that the built-in Linux timers are not that precise, and these results were only used to give the author a sense of what to expect, performance wise, from the PCAP interface. However, it is clear that the difference between the worst-case and the best-case time is quite large in both data sets. Hence a certain degree of uncertainty with respect to the time needed for the procedure must be considered when reconfiguration is performed from Linux; from low-level software the timing can be calculated and guaranteed more precisely, as less overhead is present in the system. In order to produce a more accurate reconfiguration time, a standalone version of the software was therefore tested as well, and the results of these tests can be seen in table 5.3. This coincides well with the reconfiguration time stated in XAPP1159. [46] The size of the partial bit-file in this case was 160 kilobytes.

Another interesting property that can be observed is the maximum frequency of the PL reported by the Xilinx ISE before and after the insertion of the reconfigurable module. After static synthesis the reported maximum frequency of the reference system was 148 MHz; after implementation this had dropped to 92 MHz.

Average time   2.0 ms
Best case      1.8 ms
Worst case     2.1 ms

Table 5.3: Table showing the reconfiguration time needed to finish the DMA transfer on the ZC702 board using standalone software.

This is very close to the expected frequency drop after the implementation of a non-partial-reconfiguration design. For example, the maximum frequency of the system seen in figure 4.6 was 101 MHz after implementation when partial reconfiguration was disabled. Hence it seems that Xilinx's design tools are capable of maintaining a high frequency even for partial reconfiguration design runs.

When using the simple base system shown in the Method section (figure 4.3), the time required for the AXI-ICAP to reconfigure the same reconfigurable module was found to be 13.5 ms. This time was confirmed to be consistent over 10 test runs. A large part of this time is memory access time, as the partial bit-file was stored in the PS-DDR memory. An improvement would be to use a separate memory for bit-files which can be accessed more rapidly from the PL. Such a design is feasible on the GIMME2 platform, as it has dual DDR memories available, one for each hardware section. Once again the size of the bit-file was 160 kilobytes. Similar times were achieved for the system seen in figure 4.5. Neither of these designs was, as stated earlier, confirmed to work, due to possible timing errors [59]; hence the results from these tests might not reflect the real performance of the ICAP. When the separate DDR memory on the GIMME2 platform was utilized, the reconfiguration time for the bit-file was lowered to an average of 10.8 ms over 10 test runs. However, using the PL-DDR memory is somewhat problematic, as it is only 16 bits wide while the Application Programming Interface (API) provided by Xilinx is designed for a 32-bit memory width. This was worked around by moving data from the PL-DDR memory in chunks of 200 16-bit words, converting them to 100 32-bit words in the MicroBlaze, and then calling the reconfiguration functions in series. Once again the design was not verified to work correctly, so the results from these tests are not fully valid. The design that can be seen in figure 4.6 had an average time of 6 ms when using the larger (230 kB) bit-file. This design was confirmed to work, unlike the other ICAP designs. A summary of the reconfiguration times can be found in table 5.4.

Reconfiguration method                                                                              Average time
PCAP from Linux using gettimeofday()                                                                4.21 ms
ICAP system seen in figure 4.3 (unverified)                                                         13.5 ms
PCAP from standalone software with the smaller bit-file                                             0.98 ms
PCAP from standalone software with the larger bit-file (230 kB)                                     1 ms
ICAP connected to the PS from standalone software, using PS-DDR memory with the 230 kB bit-file     6.0 ms
ICAP using PL-DDR memory (unverified)                                                               10.8 ms
PCAP using the XAPP1159 system from Linux with the smaller bit-file (full reconfiguration flow)     18.56 ms

Table 5.4: Table showing the summarized test results.
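Returning to the 16-bit PL-DDR work-around described above, the repacking of half-words into 32-bit words before handing them to the reconfiguration routine could look roughly like the sketch below. The chunk size of 200 half-words follows the description in the text; the function and buffer names are illustrative only.

/* Sketch of the word-repacking work-around for the 16-bit wide PL-DDR:
 * half-words are read in chunks of 200 and packed into 100 32-bit words
 * before being passed on to the reconfiguration routine. */
#include <stddef.h>
#include <stdint.h>

#define CHUNK_HALFWORDS 200
#define CHUNK_WORDS     (CHUNK_HALFWORDS / 2)

/* Pack 'count' 16-bit half-words (little-endian order assumed) into 32-bit
 * words. Returns the number of 32-bit words produced. */
size_t pack_halfwords(const uint16_t *src, uint32_t *dst, size_t count)
{
    size_t words = count / 2;
    for (size_t i = 0; i < words; i++)
        dst[i] = (uint32_t)src[2 * i] | ((uint32_t)src[2 * i + 1] << 16);
    return words;
}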

As can be seen in Xilinx XAPP1159 [46], the blanking period for a 1080p60 video stream is 2.1 ms. This is above the time needed for the PCAP to finish the reconfiguration sequence for both tested bit-files from standalone software. Lowering the frame rate to 30 FPS would allow more time, about 4.9 ms, to be used by the reconfiguration engine without affecting the video stream. As the speed of the reconfiguration is fairly linear with respect to the bit-file size, smaller changes could be made at 60 FPS without affecting the video stream, while larger changes could not.

As the reconfiguration times from Linux are hard to predict, a lower frame rate might be needed in order to guarantee that the reconfiguration does not interfere with the video stream; it is hard to calculate an exact time when reconfiguring from Linux, as the operating system adds unwanted latency to the system.

Vision component implementation

Using the Xilinx tool Vivado HLS, some vision components were generated as described in the Method section of this report. The components, written in C, were translated to VHDL using Vivado HLS, and their functionality was confirmed both through a C test bench and through testing on the development platforms. However, the efficiency of the C-to-VHDL conversion can be discussed. For the simple color filter presented in the Method section, the actual conversion function of around 60 lines of C code is transformed into over 800 lines of VHDL code. This is a drawback, as it increases resource utilization and may affect the overall performance of the component. The generated VHDL code is also very hard, if not impossible, for a human to understand, as signals and constants are given unintuitive names; this also makes it hard for a human to optimize the code. An example of the VHDL code generated by Vivado HLS can be seen in figure 5.1.

Figure 5.1: Excerpt from VHDL-code generated by Vivado HLS.

The performance of the generated filter modules was tested as described in the Method section. The time needed to stream a 1920 x 1080 pixel image to the filter and back using the Xilinx VDMA was on average 10.4 ms, which means that a frame rate of 96 FPS could be sustained. The average transfer speed was (4 * 1920 * 1080) / 0.0104 ≈ 797.5 MB/s. Xilinx demonstrates in the XAPP1159 document [46] that the ZC702 board, which is more or less similar to the GIMME2 board, has a theoretical memory throughput of slightly more than 4 GB/s. Hence several data streams could be active simultaneously to/from the DDR memory without affecting overall system performance. The clock driving the filter modules was set to 100 MHz in these test cases.

Software implementation evaluation

Xilinx provides developers with basic driver APIs for most of their IP components, both for stand-alone software and for Linux. However, when designing and implementing customised components, new drivers must be written. By using the earlier described devicetree structure, users can make Linux load drivers at boot. This greatly improves the usability of components, as APIs can be constructed to utilize kernel-module functionality instead of relying on simple userspace control of IP components. The use of Linux instead of stand-alone software also increases the usability of the system, as it enables programmers to hide much of the complexity from the user through various design methods. Developing drivers, both for standalone software and for Linux, is made easy thanks to the tool-suite support from Xilinx. Several userspace drivers were successfully developed for the various components used in this report; development was made easy thanks to the sample code and user guides provided by Xilinx.

GIMME2

When the GIMME2 board first arrived it was only partially tested and no operating system had been verified to work on it. As can be seen in the Method section of this report, much work and time went into making Linux and U-Boot work smoothly on the GIMME2 board. Furthermore, much time was spent testing and implementing device support for the various peripherals on the GIMME2 board, such as the FPGA-based Ethernet controller and SD-card support. The end result is a fully working embedded system with support for almost every peripheral present on the board; only the cameras are not yet fully implemented and tested.

Discussion

In this report the concept of partial reconfiguration on FPGAs has been evaluated with respect to the newly released Xilinx ZynQ SoC, which features both a dual-core ARM processor and an FPGA. The results have shown that partial reconfiguration can be used in most systems as a way of changing the functionality of the components implemented on the FPGA during run-time. Furthermore, the tools associated with the Xilinx Zynq have been evaluated and found to be capable of supplying developers with a high-level interface and methodology for run-time reconfiguration. However, several issues exist with the development environment supplied by Xilinx. The tool suite from Xilinx is vast and complex, which implies a high entry threshold, and a lot of time was spent on learning it. The Xilinx tool suite is not without bugs either, and a considerable amount of time was spent dealing with small issues related to it. Furthermore, the time needed to fully synthesize, implement and generate bit-files in Xilinx Planahead and related tools is considerable; a complete design run can take up to 40 minutes in Xilinx Planahead, making changes to the hardware costly in time. Implementing reconfigurable modules was also time-consuming. It is clear that this fairly new technique of partial reconfiguration has potential, but the tools available when this report was written are restraining and time-consuming to use. More work is needed from the developer of these tools in order to make them more stable and intuitive to use. The author of this report would like to conclude that the hardware is fully capable of partial reconfiguration, but that the tools provided by Xilinx are not yet at the same level. However, as can be seen in the Related Work section of this report, there are other tools to use, such as the one developed by Koch et al.; these tools were not available for the author to evaluate at the time and also lie slightly outside the scope of this report.

The performance of the developed systems is very close to the figures claimed by both Xilinx and some of the researchers whose work can be seen in the Related Work section of this report. It was no surprise that the PCAP would be the fastest of the reconfiguration interfaces, due to the fact that it has a higher operating frequency and is spatially closer to the memory. However, some researchers, such as Bhandari et al. [11], have claimed to achieve much higher reconfiguration speeds using custom-made ICAP cores. As developing such a custom component is a very complex and lengthy task, this was never tried. It is clear that in the technology's current state the use of the PCAP interface is to be recommended. It has several advantages besides better performance: by not instantiating reconfiguration components in the FPGA, resources are freed to be used by other components, and no custom interface for reconfiguration is needed from Linux, as the PCAP drivers are included in the kernel and reconfiguration is hence only a matter of opening and writing to a "file". It is still unclear how the internal reconfiguration engines of the ICAP and the PCAP differ, and why exactly the PCAP is faster, as little to no extensive documentation exists. However, it is possible that the PCAP has a wider interface towards the configuration logic and hence is able to perform reconfiguration operations more in parallel, while the ICAP is bound to perform reconfiguration in sequence.
Unfortunately, the author of this report was unable to get any of the fully PL-based reconfiguration (ICAP) designs to work. As stated earlier, no direct error could be found in either hardware or software, and the same approach worked when the ICAP was connected to the PS. A Xilinx employee suggested that timing issues might be the cause of the problems [59]; this is however not confirmed, and no direct solution could be found.

However, the results show that even when the ICAP was connected to the PS it was significantly slower than the PCAP, which suggests that a fully PL-based reconfiguration solution could not outperform the PCAP.

The usability of Vivado HLS can also be discussed. The tool does produce functioning VHDL code from C code, but the question is: at what cost? As could be seen in the Results section of this report, the code produced was clearly obfuscated and unoptimized. Signal names do not coincide with the variable names used in the C code, and the complexity of the generated code is considerable. The Vivado HLS tool seems suitable for users who lack knowledge of FPGA technology and VHDL, such as high-level programmers. For more advanced users it seems that writing components directly in VHDL is still to be preferred, as such code can be more easily understood and redesigned. However, if the Vivado HLS tool is redesigned in the future to produce clearer VHDL code and provide integrated levels of optimization, it could truly be useful for hardware designers. For computer vision purposes Vivado HLS does not provide the designer with high-level library support. The current version (14.4) only has support for the old, deprecated OpenCV version 1.0. This means that if someone would like to port image processing code written in C/C++ using a newer version of OpenCV to the Vivado environment, a large amount of work and effort is needed. It shall be made clear that it is possible to do so and that several benefits exist, but from a dynamic reconfiguration perspective it would be hard to make the functions accept similar inputs and produce similar outputs, due to the complexity and wide range of different OpenCV functions.

Partial reconfiguration on Xilinx FPGAs currently only supports island-style reconfiguration, as was presented in the Background section. This is a clear drawback, as it is less flexible than other types of reconfiguration and unfortunately leads to a higher degree of fragmentation when placing components inside reconfigurable areas. The fact that components within a partition must be homogeneous to a high degree limits the usability of partial reconfiguration in many cases, as most normal operations differ in their operands and outputs. However, for systems where components have, more or less, the same operands and outputs, such as streaming video systems, partial reconfiguration can still prove to be useful. Overall, partial reconfiguration can be suitable for high-performance video systems, as it supplies developers with a means of saving chip area and provides performance increases by moving functionality from a CPU to the FPGA, and the reconfiguration time needed is often well below the period of the image flow. However, the process of migrating tasks from high-level programming languages to hardware description languages is not yet fully streamlined, and much work is still needed in order to provide developers with good tools. Although documentation from Xilinx exists for most of their tools and IP components, it is hard for designers to get started with designing and implementing even simple designs. As stated before, many bugs still exist within the tool suite provided by Xilinx, even for work flows that lack the partial reconfiguration capability.
Interfacing with complex components such as the Xilinx VDMA, which was used during the testing and implementation of the test systems, is hard, especially from Linux, and the manuals for such components are often vast and highly technical in their language. As partial reconfiguration is case-dependent, it is hard to say anything about its direct use in systems in general. A complete system analysis must first be made before a decision can be made on whether partial reconfiguration is suitable for the system at hand.

Future work

As partial reconfiguration has proven to be feasible in high-performance video processing systems, it should be possible to integrate reconfigurable modules into the pre-existing GIMME1 video pipeline. However, this pipeline is currently implemented on an FPGA that does not support partial reconfiguration natively, and hence the entire system must be moved onto an FPGA that does support it. The GIMME2 board features a Xilinx Zynq XC7Z020 chip, which is capable of partial reconfiguration, but to this date the pipeline has not been ported and hence no testing of it has been made by the author of this report. It is still unclear if the video pipeline at MDH could benefit from using partial reconfiguration, but the results have shown that the technique can successfully be implemented in high-performance video systems. One possible implementation for MDH to use in their research is a fully reconfigurable video pipeline. This concept can be seen in figure 7.1.

Figure 7.1: Proposed video pipeline.

By utilizing the short reconfiguration time demonstrated by the PCAP, pipeline stages can be reconfigured dynamically, and if the frame rate and image resolution are chosen correctly the reconfiguration can be performed without affecting the video flow, i.e. without causing so-called "tearing" (a short illustrative calculation is given at the end of this section). As only a few video components were designed and implemented using the Xilinx Vivado HLS tool, one future work is to extend this library of video components. Components such as a blob detector or a segmentation tool could be added with little effort. Another aspect that might be interesting to look at is the possibility to create a customised tool for partial reconfiguration on the Xilinx Zynq in order to replace Xilinx ISE. This tool could possibly be more flexible and intuitive to use than the already existing tool suite from Xilinx. Such tools already exist, for example the GoAhead tool developed by Beckhoff et al. [10], but it might be interesting to design and implement a partial reconfiguration tool that has closer connections to the Xilinx ISE tool suite and also uses the latest partial reconfiguration methods recommended by Xilinx.
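To illustrate the timing margin mentioned above, a rough check can be made by comparing the frame period with the PCAP reconfiguration time measured in this work. A frame rate of 30 frames per second is assumed purely for illustration and is not taken from any measurement in this report:

\[
T_{\mathrm{frame}} = \frac{1}{f_{\mathrm{frame}}} = \frac{1}{30\,\mathrm{Hz}} \approx 33\,\mathrm{ms} \gg t_{\mathrm{PCAP}} \approx 2\,\mathrm{ms}
\]

The reconfiguration time is thus more than an order of magnitude shorter than the frame period, which leaves room to exchange a pipeline stage between two consecutive frames, provided that the reconfiguration is triggered at the correct point in the frame cycle.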

Conclusions

In the Introduction to this report a number of questions were asked:

1. What are the possible advantages of using heterogeneous systems in video processing instead of homogeneous systems?

2. What is the performance of Xilinx's current partial reconfiguration methods?

3. How well developed are the available software tools for partial reconfiguration in Xilinx FPGAs?

4. What are the technological limitations of partial reconfiguration in its current state?

5. How can partial reconfiguration be utilized in high-performance video systems such as stereo-vision systems and what are the implications of this technology for machine vision in general?

6. What types of components can be partially reconfigured?

7. How can Mälardalen University use partial reconfiguration in their current research projects such as GIMME2?

8. What is the status of high-level language to HDL tools such as Xilinx Vivado HLS, in terms of efficiency and performance?

These questions have hopefully been answered throughout the report but a summarized version will be given here.

1. Heterogeneous systems are more complex to work with than homogeneous systems but can provide users with performance increments if used correctly.

2. The two main reconfiguration methods, the PCAP and the ICAP interface, have been tested in this report, and for a programming file of 160 kilobytes the best performing interface was the PCAP, with an average reconfiguration time of 2 ms from Linux and circa 1 ms from standalone software.

3. The tools supplied by Xilinx have been found to be lacking in both user-friendliness and performance. Major bugs still exist as this report is written.

4. The major technological limitations of partial reconfiguration are: components residing in a partition must have the same interface towards the FPGA, certain logic cannot be reconfigured, and a certain performance decrease is to be expected when using partial reconfiguration in a design.

5. Assuming a pipeline implementation (which is common within video processing applications), components in the pipeline can be reconfigured during run-time without frame corruptions (if reconfigured in a correct and timely fashion). For machine vision in general this implies that smaller FPGAs can be used in systems to host video processing logic, as components can be reconfigured during run-time. This also implies that the total energy consumption can be lowered and that video processing systems can be adapted for new, more demanding applications.

6. Most components can be reconfigured; however, as seen earlier in this report, some exceptions exist.

7. This remains an open question, as the video pipeline designed at MDH has still not been ported to a platform that supports partial reconfiguration, but partial reconfiguration could allow a smaller FPGA to be used and hence save power.

8. The status of the HLS tool is that it is fully working, but the generated files are severely obfuscated and unoptimized.

To summarize the report, one can clearly say that in its current state partial reconfiguration is a fully functional technique that allows designers to use smaller FPGAs and achieve higher system performance by dynamically reprogramming the FPGA during run-time. However, the tools associated with the partial reconfiguration work flow are still under-developed and much work is still needed in order to make them more flexible and intuitive for users. Furthermore, it has been shown in this report that partial reconfiguration can be used in high-performance video processing systems, as the reconfiguration time is low enough to avoid so-called frame tearing if used correctly. An application of partial reconfiguration in video processing, the GIMME2 board and the pseudo-video stream filtering, has been presented and evaluated in this report, further demonstrating the capabilities of partial reconfiguration.

Bibliography

[1] Aeroflex Gaisler AB. Companioncore data sheet. http://www.actel.com/ipdocs/LEON3_DS.pdf, 2010.

[2] K.F. Ackermann, B. Hoffmann, L.S. Indrusiak, and M. Glesner. Enabling self- reconfiguration on a video processing platform. In Industrial Embedded Systems, 2008. SIES 2008. International Symposium on, pages 19 –26, june 2008.

[3] af inventions. GIMME2 User Guide. af inventions, 2 edition, 2013.

[4] C. Ahlberg, J. Lidholm, F. Ekstrand, G. Spampinato, M. Ekström, and L. Asplund. Gimme - a general image multiview manipulation engine. In Reconfigurable Computing and FPGAs (ReConFig), 2011 International Conference on, pages 129–134, nov. 30 2011 – dec. 2 2011.

[5] ARM. AMBA 4 AXI4-Stream Protocol, 2010.

[6] ARM. AMBA AXI and ACE Protocol Specification, 2011.

[7] Lars Asplund and Mikael Ekström. Thesis description 2012. http://www.idt.mdh.se/rc/education/Thesis2012.pdf, 2012.

[8] Guan-Lin Wu at NTU. Introduction to fpga - lecture slides. http://cc.ee.ntu.edu.tw/~jhjiang/instruction/courses/fall11-cvsd/LN13-FPGA.pdf, 2013.

[9] D.G. Bailey. Design for Embedded Image Processing on FPGAs. Wiley, 2011.

[10] C. Beckhoff, D. Koch, and J. Torresen. Go ahead: A partial reconfiguration framework. In Field-Programmable Custom Computing Machines (FCCM), 2012 IEEE 20th Annual International Symposium on, pages 37–44, 2012.

[11] S. Bhandari, S. Subbaraman, S. Pujari, F. Cancare, F. Bruschi, M.D. Santambrogio, and P.R. Grassi. High speed dynamic partial reconfiguration for real time multimedia signal processing. In Digital System Design (DSD), 2012 15th Euromicro Conference on, pages 319 –326, sept. 2012.

[12] Torsten Bitterlich and Heinz-Jürgen Oertel. can4linux - can bus device driver. http://sourceforge.net/projects/can4linux/, 2013.

[13] Torsten Bitterlich and Heinz-Jürgen Oertel. can4linux - can bus device driver. http://www.can-wiki.info/can4linux/man/index.html, 2013.

[14] C. Blair, N.M. Robertson, and D. Hume. Characterising pedestrian detection on a heterogeneous platform. http://home.eps.hw.ac.uk/~cgb7/papers/scabot_12_cb.pdf, 2012.

[15] Andre R. Brodtkorb, Christopher Dyken, Trond R. Hagen, Jon M. Hjelmervik, and Olaf O. Storaasli. State-of-the-art in heterogeneous computing. Sci. Program., 18(1):1–33, January 2010.

[16] S. Brown and J. Rose. Fpga and cpld architectures: a tutorial. Design Test of Computers, IEEE, 13(2):42–57, 1996.

[17] Meng Chen, Zhihao Cai, and Yingxun Wang. A method for mobile robot obstacle avoidance based on stereo vision. In Industrial Informatics (INDIN), 2012 10th IEEE International Conference on, pages 94–98, july 2012.

[18] C. Claus, J. Zeppenfeld, F. Muller, and W. Stechele. Using partial-run-time reconfigurable hardware to accelerate video processing in driver assistance system. In Design, Automation Test in Europe Conference Exhibition, 2007. DATE ’07, pages 1 –6, april 2007.

[19] Christopher Claus, Rehan Ahmed, Florian Altenried, and Walter Stechele. Towards rapid dynamic partial reconfiguration in video-based driver assistance systems. Reconfigurable Computing: Architectures, Tools and Applications, pages 55–67, 2010.

[20] Jonathan Corbet, Alessandro Rubini, and Greg Kroah-Hartman. Linux Device Drivers, 3rd Edition. O’Reilly Media, Inc., 2005.

[21] B. Delight and E. Segerblad. Machine vision in agricultural robotics – a short overview. http://www.idt.mdh.se/kurser/ct3340/ht11/MINICONFERENCE/FinalPapers/ircse11_submission_2.pdf, 2011.

[22] Julio Daniel Dondo, Jesús Barba, Fernando Rincón, Francisco Moya, and Juan Carlos López. Dynamic objects: Supporting fast and easy run-time reconfiguration in fpgas. Journal of Systems Architecture, 59(1):1–15, 2013.

[23] David Dye. Partial Reconfiguration of Xilinx FPGAs Using ISE Design Suite. Xilinx Inc., 1.2 edition, 2012. Available at: http://www.xilinx.com/support/documentation/white_papers/wp374_Partial_Reconfig_Xilinx_FPGAs.pdf.

[24] U. Farooq, Z. Marrakchi, and H. Mehrez. Tree-Based Heterogeneous FPGA Architectures: Application Specific Exploration and Optimization. SpringerLink: Bücher. Springer New York, 2012.

[25] M.J. Flynn and W. Luk. Computer System Design: System-on-Chip. Wiley, 2011.

[26] R. Freeman. User-programmable gate arrays. Spectrum, IEEE, 25(13):32 –35, dec. 1988.

[27] L. Gantel, M.E.A. Benkhelifa, F. Lemonnier, and F. Verdier. Module relocation in heterogeneous reconfigurable systems-on-chip using the xilinx isolation design flow. In Reconfigurable Computing and FPGAs (ReConFig), 2012 International Conference on, pages 1–6, 2012.

[28] R. Garcia, A. Gordon-Ross, and A.D. George. Exploiting partially reconfigurable fpgas for situation-based reconfiguration in wireless sensor networks. In Field Programmable Custom Computing Machines, 2009. FCCM ’09. 17th IEEE Symposium on, pages 243 –246, april 2009.

[29] H. Gholamhosseini and Shuying Hu. A high speed vision system for robots using fpga technology. In Mechatronics and Machine Vision in Practice, 2008. M2VIP 2008. 15th International Conference on, pages 81 –84, dec. 2008.

[30] Sverre Hamre. Framework for self reconfigurable system on a xilinx fpga, 2009.

[31] J.L. Hennessy and D.A. Patterson. Computer Architecture: A Quantitative Approach. The Morgan Kaufmann Series in Computer Architecture and Design. Elsevier Science, 2006.

[32] John L. Hennessy and David A. Patterson. Computer Architecture, Fourth Edition: A Quantitative Approach. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2006.

[33] Xilinx Inc. Vivado design suite user guide. http://www.xilinx.com, 2013.

[34] Xilinx Inc. Hierarchical Design Methodology Guide. Xilinx Inc., 14.1 edition, 2012. Xilinx UG748.

[35] Xilinx Inc. Partial Reconfiguration User Guide. Xilinx Inc., 14.4 edition, 2012. Xilinx UG702.

[36] Xilinx Inc. Zynq-7000 All Programmable SoC: Concepts, Tools, and Techniques (CTT), 2012.

[37] Xilinx Inc. Zynq-7000 All Programmable SoC Software Developers Guide. Xilinx Inc., 3.0 edition, 2012. Xilinx UG821.

[38] Xilinx Inc. Zynq-7000 All Programmable SoC Technical Reference Manual. Xilinx Inc., 1.4 edition, 2012. Xilinx UG585.

[39] Xilinx Inc. 7 series fpgas configuration user guide, 2013.

[40] Xilinx Inc. Can4linux. http://www.wiki.xilinx.com/CAN4Linux, 2013.

[41] Xilinx Inc. ZC702 Evaluation Board for the Zynq-7000 XC7Z020 All Programmable SoC. Xilinx Inc., 1.2 edition, 2013. Xilinx UG850.

[42] Xilinx Inc. Zynq-7000 all programmable soc overview, 2013.

[43] Ajay Khalaf, Mario Jagtiani. Framework for self reconfigurable system on a xilinx fpga. Embedded Systems Design, 24(5):22–27, 2011.

[44] Dirk Koch. Partial Reconfiguration on FPGAs. Springer-Verlag New York Inc., 2012.

[45] Dirk Koch, Jim Torresen, Christian Beckhoff, Daniel Ziener, Christopher Dennl, Volker Breuer, Jürgen Teich, Michael Feilen, and Walter Stechele. Partial reconfiguration on fpgas in practice – tools and applications. In ARCS Workshops (ARCS), 2012, pages 1–12, feb. 2012.

[46] Christian Kohn. Partial Reconfiguration of a Hardware Accelerator on Zynq-7000 All Pro- grammable SoC Devices. Xilinx Inc., 1.0 edition, 2013. Xilinx UG702.

[47] T. Komuro, T. Tabata, and M. Ishikawa. A reconfigurable embedded system for 1000 f/s real-time vision. Circuits and Systems for Video Technology, IEEE Transactions on, 20(4):496 –504, april 2010.

[48] Ian Kuon, Russel Tessier, and Jonathan Rose. Fpga architecture: Survey and challenges. Foundations and Trends in Electronic Design Automation, 2(2):135–253, 2007.

[49] T. Kuphaldt. Lessons In Electric Circuits, Volume IV – Digital. Open Book Project, 2007.

[50] V.G. Lesau, E. Chen, W.A. Gruver, and D. Sabaz. Embedded linux for concurrent dynamic partially reconfigurable fpga systems. In Adaptive Hardware and Systems (AHS), 2012 NASA/ESA Conference on, pages 99 –106, june 2012.

[51] Ming Liu, Wolfgang Kuehn, Zhonghai Lu, and Axel Jantsch. Run-time partial reconfiguration speed investigation and architectural design space exploration. In Field Programmable Logic and Applications, 2009. FPL 2009. International Conference on, pages 498–502. IEEE, 2009.

[52] Patrick Lysaght, Brandon Blodget, Jeff Mason, Jay Young, and Brendan Bridgford. Invited paper: Enhanced architectures, design methodologies and cad tools for dynamic reconfiguration of xilinx fpgas. In Field Programmable Logic and Applications, 2006. FPL'06. International Conference on, pages 1–6. IEEE, 2006.

[53] J. Meyer, J. Noguera, M. Hubner, L. Braun, O. Sander, R.M. Gil, R. Stewart, and J. Becker. Fast start-up for spartan-6 fpgas using dynamic partial reconfiguration. In Design, Automation Test in Europe Conference Exhibition (DATE), 2011, pages 1–6, march 2011.

[54] G. Ochoa-Ruiz, O. Labbani-Narsis, E. Bourennane, S. Cherif, S. Meftali, and J. Dekeyser. Facilitating ip deployment in a marte-based mde methodology using ip-xact: A xilinx edk case study. In Reconfigurable Computing and FPGAs (ReConFig), 2012 International Conference on, pages 1–8, 2012.

[55] I. Ohmura, T. Mitamura, H. Takauji, and S. Kaneko. A real-time stereo vision sensor based on fpga realization of orientation code matching. In Optomechatronic Technologies (ISOT), 2010 International Symposium on, pages 1 –5, oct. 2010.

[56] A. Otero, E. de la Torre, and T. Riesgo. Dreams: A tool for the design of dynamically reconfigurable embedded and modular systems. In Reconfigurable Computing and FPGAs (ReConFig), 2012 International Conference on, pages 1–8, Dec. 2012.

[57] Kyprianos Papadimitriou, Apostolos Dollas, and Scott Hauck. Performance of partial recon- figuration in fpga systems: A survey and a cost model. ACM Transactions on Reconfigurable Technology and Systems (TRETS), 4(4):36, 2011.

[58] M. Rummele-Werner, T. Perschke, L. Braun, M. Hubner, and J. Becker. A fpga based fast runtime reconfigurable real-time multi-object-tracker. In Circuits and Systems (ISCAS), 2011 IEEE International Symposium on, pages 853 –856, may 2011.

[59] Emil Segerblad and Xilinx Inc. Forum post regarding icap problem. http://forums.xilinx.com/t5/Embedded-Processors-and/Zynq-ICAP-problem/td-p/317275, 2013.

[60] STMicroelectronics. Porting u-boot to a new board. http://www.stlinux.com/u-boot/porting, 2013.

[61] R. Szeliski. Computer Vision: Algorithms and Applications. Texts in Computer Science. Springer, 2010.

[62] Simon Tam and Martin Kellermann. Fast configuration of pci express technology through partial reconfiguration, 2010.

[63] C. Tradowsky, F. Thoma, M. Hubner, and J. Becker. On dynamic run-time processor pipeline reconfiguration. In Parallel and Distributed Processing Symposium Workshops PhD Forum (IPDPSW), 2012 IEEE 26th International, pages 419 –424, may 2012.

[64] Ulrich Weiss and Peter Biber. Plant detection and mapping for agricultural robots using a 3d lidar sensor. Robotics and Autonomous Systems, 59(5):265 – 273, 2011.

[65] Xilinx. Xilinx wiki. http://www.wiki.xilinx.com/, 2013.

Appendix A

Device interface from Linux

Interfacing with custom IP-devices from Linux (in userspace) is simple as long as the user knows the register mapping associated with the device. An example can be seen below. Assume that the constants/macros are defined as in figure A.1; accessing the registers associated with the device (in this case the DDR memory on the ZC702 board) can then be done as in figure A.2. The mmap and munmap functions are declared in the sys/mman.h header file and are used to map and unmap files into memory.

Figure A.1: Macro definitions.

Figure A.2: Memory access code.
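Since figures A.1 and A.2 are only available as images, a minimal sketch of the same approach is given below. The base address, map size and register offset are placeholders chosen for illustration and do not correspond to the actual register map of the ZC702 DDR memory or any specific IP-device; the pattern of interest is opening /dev/mem, mapping the physical address range with mmap and accessing the returned pointer as ordinary memory.

/* Minimal userspace register-access sketch. All addresses are illustrative placeholders. */
#include <stdio.h>
#include <stdint.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>

#define DEV_BASE_ADDR 0x40000000u   /* placeholder physical base address of the device */
#define DEV_MAP_SIZE  0x1000u       /* placeholder size of the mapped region (4 KiB)   */
#define REG_OFFSET    0x0u          /* placeholder offset of the register of interest  */

int main(void)
{
    /* Open the physical memory device; this normally requires root privileges. */
    int fd = open("/dev/mem", O_RDWR | O_SYNC);
    if (fd < 0) {
        perror("open /dev/mem");
        return 1;
    }

    /* Map the register region of the device into the process address space. */
    volatile uint32_t *regs = mmap(NULL, DEV_MAP_SIZE, PROT_READ | PROT_WRITE,
                                   MAP_SHARED, fd, DEV_BASE_ADDR);
    if (regs == MAP_FAILED) {
        perror("mmap");
        close(fd);
        return 1;
    }

    /* The registers can now be read and written as ordinary memory. */
    uint32_t value = regs[REG_OFFSET / sizeof(uint32_t)];
    printf("Register at offset 0x%x reads 0x%08x\n", REG_OFFSET, (unsigned int)value);

    /* Unmap the region and close the file descriptor when done. */
    munmap((void *)regs, DEV_MAP_SIZE);
    close(fd);
    return 0;
}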

It is also possible to create so-called kernel modules and run the drivers in kernel mode. This is recommended for IP-components that are finalized. During development, though, it is often simpler to use userspace drivers, as no compilation against the kernel in use is needed. For more extensive information about the various driver types and about kernel module development, please refer to the excellent book on the subject, "Linux Device Drivers" by Corbet et al. [20]
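As a complement to the userspace approach, the skeleton below shows roughly what the kernel-module alternative looks like. It is only a generic sketch of the standard Linux module entry and exit points; the module name and messages are made up for illustration and it does not implement a driver for any of the IP-components used in this report.

/* Generic kernel-module skeleton, illustrative only. */
#include <linux/init.h>
#include <linux/module.h>
#include <linux/kernel.h>

MODULE_LICENSE("GPL");
MODULE_DESCRIPTION("Skeleton illustrating the structure of a kernel-mode driver");

/* Called when the module is loaded with insmod or modprobe. */
static int __init example_init(void)
{
    printk(KERN_INFO "example: module loaded\n");
    /* A real driver would register a character device or platform driver here. */
    return 0;
}

/* Called when the module is removed with rmmod. */
static void __exit example_exit(void)
{
    printk(KERN_INFO "example: module unloaded\n");
}

module_init(example_init);
module_exit(example_exit);

Such a module must be built against the headers of the kernel running on the board (typically via a small Makefile containing obj-m := example.o and a call into the kernel build system), which is exactly the extra step that the userspace approach above avoids.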

Appendix B

Overview of Xilinx ZynQ-family

Figure B.1: Overview of ZynQ-family. Image from DS190. [42]
