HotStream: Heterogeneous Many-Core Data Streaming Framework with Complex Pattern Support

Sérgio Micael Ferreira Paiaguá

Thesis to obtain the Master of Science Degree in Electrical and Computer Engineering

Examination Committee
Chairperson: Doutor Nuno Cavaco Gomes Horta
Supervisor: Doutor Ricardo Jorge Fernandes Chaves
Co-supervisor: Doutor Nuno Filipe Valentim Roma
Members of the Committee: Doutor Horácio Cláudio de Campos Neto, Doutor Paulo Ferreira Godinho Flores

October 2013

Abstract

The work herein presented proposes a data streaming accelerator framework that provides efficient data management facilities that can be easily tailored to any application and data pattern. This is achieved through an innovative and fully programmable data management structure, implemented with two granularity levels and complemented by a complete software layer, ranging from a device driver to a high-level API that provides easy access to every feature offered by the framework. The fine-grained data movements are made possible by an innovative Data Fetch Controller, powered by a custom microcontroller, which can be programmed to generate arbitrarily complex access patterns with minimal performance overhead. The obtained results show that the proposed framework is capable of achieving virtually zero-latency address generation and data fetch, even for the most complex streaming data patterns, while significantly reducing the size occupied by the pattern description code. In order to validate the proposed framework, two distinct case studies were considered. The first deals with the block-based multiplication of large matrices, while the second consists of a full image-processing application in the frequency domain. The experimental results obtained for the first case study demonstrate that, by enabling data reuse, the proposed framework increases the available bandwidth by 4.2×, resulting in a speed-up of 2.1× when compared to the existing related state of the art. Furthermore, it reduces the Host memory requirements and its intervention in the acceleration by more than 40×. The signal-processing case study revealed that an accelerator based on the proposed framework can achieve a linear relationship between the execution time and the size of the input image, which highly contrasts with CPU or GPU-based alternatives. Frame rates of 40 and 2.5 FPS were obtained for 1024 × 1024 and 4096 × 4096 images, respectively.

Keywords: Stream computing, Many-Core Heterogeneous Architectures, Programmable Data Access Patterns, Data Reuse, Reconfigurable Devices, High-Speed Interconnections.


Resumo

No presente trabalho é proposta uma plataforma de aceleração baseada em computação de fluxo de dados, que proporciona uma gestão de dados eficiente, facilmente adaptável a qualquer aplicação ou padrão de acesso de dados. Isto é conseguido através de uma inovadora estrutura de gestão de dados completamente programável, composta por dois níveis de granularidade e complementada por uma extensa camada de software, que abarca desde o driver do dispositivo a uma interface de alto nível que garante o fácil acesso a todos os elementos da plataforma. O controlo de dados a um nível de granularidade mais fino é garantido por um inovador Data Fetch Controller, comandado por um microcontrolador especialmente desenhado, capaz de gerar padrões de acesso arbitrariamente complexos. Os resultados obtidos revelam que a plataforma proposta é capaz de gerar endereços e aceder a dados de forma quase imediata, qualquer que seja o padrão de dados em questão, reduzindo ainda o espaço necessário para alojar a descrição do padrão. Por forma a validar a plataforma proposta, dois estudos de caso distintos foram utilizados. O primeiro baseia-se na multiplicação de matrizes de grandes dimensões, enquanto que o segundo consiste numa aplicação de processamento de imagem no domínio da frequência. Os resultados obtidos para o primeiro caso de estudo demonstram que, ao explorar extensivamente a reutilização de dados, a plataforma proposta aumenta a largura de banda fornecida às unidades de computação em 4.2×, o que resulta num aumento de desempenho de 2.1×, quando comparada com implementações convencionais. Mais, os requisitos de memória impostos à máquina anfitriã são reduzidos em mais de 40×. O segundo caso de estudo revela que um acelerador baseado na plataforma proposta garante uma relação linear entre o tempo de execução e a dimensão da imagem a ser processada, algo que o estado da arte não permite.

Keywords: Computação de fluxos de dados, Arquitecturas Heterogéneas com múltiplos núcleos, Padrões de Acesso Programáveis, Reutilização de Dados, Dispositivos Reconfiguráveis.

Acknowledgments

Within the next 80 pages, a lot more than a master thesis is contained. It obviously represents my hard work, dedication and effort over the last 8 months but is actually much more than that. This is the final step in a journey that I started back in 2008. A journey that has only been successful due to the invaluable help and companionship of a number of people that more than deserve to be mentioned in the following paragraphs.

First of all, I would like to express my deepest gratitude to the exceptional team of advisors I had the pleasure to work with. Ricardo Chaves, Nuno Roma, Pedro Tomás and Frederico Pratas, I really couldn't have hoped for a better supervision over the last months. From the lengthy but enlightening meetings, always accompanied by good humour and plenty of laughs, to your tireless effort in reviewing all of my work, I have no doubt that the quality of this thesis is, in great part, owed to all of you.

To all the amazing friends I made during these last five years, in particular, Rui Coelho, Joana Marinhas, José Santos, Filipe Morais, João Carvalho, Rita Pereira, a big thank you for all your support throughout all the (mostly) good and bad times. A special thanks to my great friend José Leitão, who had a special impact in this thesis by keeping me company during the long work nights at INESC and for always having the time to share a laugh, or to happily engage in endless technical debates.

Finally, I thank my parents and my sister for, well, everything. Not exaggerating in the slightest, without them, this moment would simply not have happened. I am very grateful for all the wonderful guidance, patience and love they have so selflessly given me over the years.

Contents

1 Introduction 2 1.1 Motivation ...... 3 1.2 Objectives ...... 4 1.3 Main contributions ...... 5 1.4 Dissertation outline ...... 6

2 Technology Overview 9 2.1 Stream Computing Platforms and Address Generation ...... 10 2.2 PCI Express Interfaces ...... 11 2.3 Shared Buses and Crossbars ...... 12 2.3.1 Shared ...... 12 2.3.2 Crossbar ...... 12 2.4 Networks On Chip ...... 13 2.5 NoC Survey ...... 14 2.6 Crossbar Survey ...... 15 2.7 Summary ...... 15

3 HotStream Framework Architecture 17 3.1 Host Interface Bridge ...... 19 3.2 Multi-Core Processing Engine ...... 20 3.3 The HotStream API ...... 21 3.4 Data Fetch Controllers, Shared Memory and Auxiliary Units ...... 22 3.4.1 Address Generation Core (AGC) ...... 23 3.4.2 Micro16 microcontroller ...... 24 3.4.3 Access to the Shared Memory ...... 27 3.5 Data Stream Switch (DSS) and Core Management Unit (CMU) ...... 28 3.6 Summary ...... 30

4 Host Interface Bridge 31 4.1 PCI Express Infrastructure ...... 32 4.2 Address Spaces and DMA ...... 33 4.3 2D DMA Transfers ...... 34


4.4 Device Driver and User Interface ...... 37 4.4.1 Modifications to the MPRACE device driver ...... 38 4.4.2 Configuring a data transfer ...... 38 4.5 Summary ...... 40

5 Framework Prototype 41 5.1 AXI Interfaces ...... 42 5.2 HIB Implementation and Performance ...... 43 5.3 Backplane Implementation and Performance ...... 47 5.3.1 Hermes NoC ...... 48 5.3.1.A Modified packet structure ...... 48 5.3.2 AXI Stream Interconnect ...... 49 5.3.3 Backplane Performance Evaluation ...... 50 5.3.3.A Core Emulator and Stream Wrapper ...... 50 5.3.3.B Testbench and Python script ...... 51 5.3.3.C Results ...... 51 5.3.4 Crossbar and NoC Comparative Evaluation ...... 53 5.4 Shared Memory Performance ...... 55 5.4.1 Cycle-Accurate Simulator ...... 55 5.5 Summary ...... 56

6 Framework Evaluation 58 6.1 General Evaluation ...... 59 6.1.1 Resources Overhead ...... 59 6.1.2 Stream Generation Efficiency ...... 61 6.2 Case Study 1: Matrix Multiplication ...... 63 6.2.1 Computing Cores ...... 64 6.2.2 Roofline Model ...... 66 6.2.3 Performance and Memory Usage ...... 67 6.3 Case Study 2: Image processing chain in the frequency domain ...... 69 6.3.1 Computing Cores ...... 71 6.3.2 Performance and Scalability ...... 72 6.4 Summary ...... 75

7 Conclusions and Future Work 77 7.1 Conclusions ...... 78 7.2 Future work ...... 80

A Appendix A 85 A.1 Micro16 Instruction Set Architecture ...... 86 A.2 HotStream Register Interface ...... 86

A.3 HotStream API ...... 89

B Appendix B 95 B.1 Pattern Description Examples ...... 96 B.1.1 Linear and Tiled access pattern ...... 96 B.1.2 Diagonal access pattern ...... 96 B.1.3 Cross access pattern ...... 98


List of Figures

2.1 Structure of a 2D Mesh and 2D Torus NoC ...... 14

3.1 Structure and organization overview of the HotStream framework ...... 18 3.2 AGC in a 3-level nested loop configuration...... 23 3.3 Architecture of the Micro16 microcontroller...... 25 3.4 Core internal structure, comprising the PE (e.g., an application specific IP Core) and the co-located BMC ...... 28 3.5 Internal structure of the BMC, consisting of Write and Read control units, a Channel Arbiter and a Synchronizer block ...... 28

4.1 Address spaces and translation mechanisms on a x86-like architecture ...... 34 4.2 Mapping of an user-buffer to the bus address space and subsequent creation of the corresponding SG descriptors. The size and number of the physical data chunks can vary considerably according to the size of the original buffer and the state of the phyisical memory ...... 34 4.3 Application of a 2D pattern to the mapping of Fig. 4.2 ...... 35 4.4 Application of a 2D pattern to a more realistic mapping between virtual and physical address space. This example highlights the descriptor savings that are possible by utilizing a DMA with 2D capabilities ...... 36 4.5 Flowchart of the algorithm that converts a list of SG DMA descriptors into a list featuring 2D transfers ...... 36 4.6 Operation of the HotStream gather() function, which gathers the various sub-blocks defined by a 2D pattern and places them linearly in a new user-space buffer . . . 37

5.1 Basic handshake principle utilized by the AXI4-Stream and other similar stream- based protocols. Retrieved from [2] ...... 43 5.2 Aggregate throughput of various PCI Express configurations. The dashed line ac- counts for protocol overhead as per [17] ...... 44 5.3 Measured aggregate throughput for back-to-back transfers with varying buffer size 45 5.4 Chipscope waveforms obtained during a back-to-back transfer of 4 KB ...... 46 5.5 Time elapsed during the configuration of the send and receive transactions . . . . 46


5.6 Aggregate throughput for a back-to-back transfer including the time taken for the transaction set-up ...... 47 5.7 Traffic patterns for the NoC simulation ...... 52 5.8 Data delivery throughput in various traffic configurations. In every one, the input throughput is reached asymptotically ...... 52 5.9 Source to destination latency when using best case or worst case routing. Dupli- cating the inserted data throughput does not affect latency ...... 53

6.1 Access patterns, with varying complexity degrees, adopted for the DFC evaluation 61 6.2 HotStream-based implementation of the block-based multiplication algorithm, con- sisting of 3 Kernels to process multiple and concurrent data streams, where double buffering is used on the shared memory to overlap communication with computation 64 6.3 8:1 binary reduction tree based on Matrix Accumulators. The structure of the 16:1 reduction core follows the same architecture but with double the number of basic accumulators ...... 65 6.4 Internal structure of the multiplication core utilized in the HotStream implementation of the matrix multiplication. Sub-blocks from matrix A are stored and re-used during the computation of a full sub-block line from matrix B ...... 65 6.5 Roofline model for the matrix multiplication example: Cx and Hx denote the actual performance for each implementation (Conventional and HotStream, respectively); while the conventional solutions C2× and C4×, with 2× and 4× parallelism, re- spectively, are limited by the PCIe link (communication-bounded), all other imple- mentations are computation-bounded ...... 66 6.6 Processing time taken on each step of the matrix multiplication algorithm for the considered implementations ...... 68 6.7 Core scalability of the three matrix multiplication implementations ...... 68 6.8 Host memory requirements for matrix multiplication implementations ...... 69 6.9 Image processing chain in the frequency domain, mapped to the HotStream frame- work...... 71 6.10 Execution time and bus utilization for various image sizes and read and write burst sizes. Both transient (single frame) and steady-state (streaming) operation condi- tions are depicted ...... 73 6.11 FFT Execution time with CUFFT, a CUDA-based FFT library, for various image sizes [33] ...... 74 6.12 Execution time of a 2D FFT on a NVIDIA QUADRO FX5600 using CUFFT and on an Intel Dual Core Processor (6600) @ 2.4 GHz using FFTW [10] ...... 75

B.1 Pattern description code of a simple linear access with 1024 positions ...... 97 B.2 Pattern description code for a tiled 128×72 access ...... 97 B.3 Pattern description code for a diagonal access on a 1024×1024 matrix ...... 98

B.4 Pattern description code for a greek cross access pattern ...... 99


List of Tables

2.1 Examples of architecture combinations supported by the ATLAS environment . . . 15

5.1 PCI Express Gen1 and Gen2 support on the AXI Bridge for PCI Express IP Core 44 5.2 Hardware utilization of the Hermes NoC configured in a 2×2 mesh and the AXI Stream Interconnect Crossbar for a varying number of independent cores . . . . . 53 5.3 Throughput and latency of the Hermes NoC and AXI Stream Crossbar when inter- connecting 4 Cores, under different traffic conditions ...... 54

6.1 Resource usage for each component in the MCPE and HIB (hardware platform: XCV7VX485T Virtex-7 FPGA) ...... 60 6.2 Individual resource usage of the DFCs and BMCs (hardware platform: XCV7VX485T Virtex-7 FPGA) ...... 61 6.3 Address generation rate and descriptor size of the considered access patterns (the adopted length of each pattern results from the parameterization depicted in Fig. 6.1) 62 6.4 Resource usage of the cores utilized in the various implementations of the 4096×4096 matrix multiplication (hardware platform: XCV7VX485T Virtex-7 FPGA) ...... 66 6.5 Resource usage of the cores utilized in the frequency domain processing case study (hardware platform: XCV7VX485T Virtex-7 FPGA) ...... 72

A.1 ALU Register-Register operations ...... 86 A.2 Constant loading operations ...... 86 A.3 Low and High constant loading and miscellaneous operations ...... 87 A.4 Flow control operations ...... 87 A.5 CMU Address Mapping ...... 87 A.6 CMU Register Details ...... 88



Acronyms

AGC Address Generation Core

API Application Programming Interface

ASIC Application Specific Integrated Circuits

BMC Bus Master Controller

CMU Core Management Unit

DFC Data Fetch Controller

DMA Direct Memory Access

DRAM Dynamic Random Access Memory

DSS Data Stream Switch

ERF External Register File

FFT Fast Fourier Transform

FPGA Field-Programmable Gate Array

FPS Frames Per Second

GPP General Purpose Processor

HIB Host Interface Bridge

IDE Integrated Development Environment

IOMMU Input/Output Memory Management Unit

IRF Internal Register File

ISA Instruction Set Architecture

MCPE Multi-Core Processing Engine

MMU Memory Management Unit

MSI Message Signalled Interrupts

NoC Network-On-Chip

PE Processing Element

PSR Program Status Register

RAM Random Access Memory


RTL Register Transfer Level

SG Scatter Gather

TLP Transaction Layer Packet

VLSI Very Large Scale Integration

1 Introduction

Contents 1.1 Motivation ...... 3 1.2 Objectives ...... 4 1.3 Main contributions ...... 5 1.4 Dissertation outline ...... 6


1.1 Motivation

One of the most critical aspects to be considered during the development of multi-core hardware accelerators is how to efficiently handle data transfers between the various Processing Elements (PEs) of the system. The architecture of the memory subsystem and of the communication data channels has a significant impact on the effective memory bandwidth that is made available to the PEs, and therefore on the overall system performance. In fact, an efficient and coordinated management of the data transfers is important not only because the PEs have different processing characteristics and capabilities (e.g., general purpose processors, application specific processors, or custom-designed accelerating cores), but also because applications often present distinct memory footprints and bandwidth requirements.

While traditional solutions (such as cache hierarchy structures) try to reduce the latency of accessing the data, they do not allow exploiting all levels of available data parallelism. Therefore, recent advances have encouraged researchers to exploit other models that are able to deal with the intrinsic constraints of the underlying Very Large Scale Integration (VLSI) technology and with the inherent parallelism of emerging applications. As a result, an increasing interest in data stream computation models has been observed; these models focus on decoupling communication from computation by exposing an additional level of concurrency. This type of concurrency is especially important in hardware accelerators, given the slow communication channels (e.g., buses) typically used to connect with the Host device.

Nevertheless, while regular streaming patterns are easy to handle, complex memory accesses require more radical strategies to avoid long memory access times and to keep a high overall system performance. Such accesses can either occur due to the intrinsic complexity of the underlying application or because multiple kernels concurrently access different memory regions. Moreover, when data streams produced by a given kernel are consumed by several other kernels and at different paces, intermediate buffering is also required, thus further increasing the pressure on the memory subsystem. In an attempt to minimize the impact of these problems, dedicated Address Generation Units (AGUs) can be employed, which (pre-)fetch the data with the specific pattern required by the target application. Moreover, data reuse mechanisms can reduce the number of effective memory accesses, by sharing some of the streams through alternative channels or by rearranging a stream before it is consumed by the next kernel.

The work presented herein describes the HotStream framework, a platform for the development of stream-based computing tasks that provides easy means to implement advanced features such as data (pre-)fetching, stream sharing and support for arbitrarily complex data access patterns, generated by fully-programmable AGUs. These units, which are an integral part of the framework, differ from most comparable solutions by providing a Pattern Description Language (PDL) that enables any pattern to be described in a compact and scalable format. This greatly contrasts with the approaches of competing solutions, where periodic repetitions within the pattern cannot be explored with ease.

The aforementioned stream-related features of the framework are supported on standard communication channels with added pattern-based data addressing mechanisms. Such addressing is structured with two levels of granularity, namely, a coarse-grained data access from the Host to the accelerator, to maximize the transmission efficiency, and a fine-grained data access within the shared memory of the device, made available to all the cores within the accelerator, to maximize data reuse. This is very important, as it allows the designer to focus on the accelerator architecture and not on the surrounding infrastructures, which have the potential of severely limiting the achievable performance if not carefully designed. It is also important to note that the HotStream framework is designed to be equally efficient regardless of the final implementation technology, be it a reconfigurable device, an ASIC, or a SoC combining the two. As such, the description of each component is accompanied by a number of requirements that must be met in order to guarantee a good overall performance.

Finally, a comprehensive software API was developed, which conveniently abstracts all the low level interactions between the Host machine and the accelerator. This greatly reduces the development time of stream accelerators based on the HotStream framework. Furthermore, the open-source nature of the code and its extensive documentation promote further adjustments and modifications towards increasing the performance for a particular platform/application pair.

1.2 Objectives

This master thesis is primarily focused on the development of all the necessary components required for the realization of the HotStream framework, namely: i) a Host Interface Bridge (HIB), which handles all the communications between the accelerator and the Host; ii) a Multi-Core Processing Engine (MCPE), capable of simultaneously hosting an arbitrary number of stream-based kernels; iii) auxiliary structures, such as a Data Stream Switch (DSS) and a Core Management Unit (CMU), to enable the cores to be interfaced on an individual basis, directly from the Host; iv) a backplane interconnection capable of providing full connectivity between all the cores in the MCPE without compromising the communication bandwidth; and v) a C-based API, providing high-level access to all the facilities offered by the framework.

Within the MCPE lies an element that warrants, by itself, a separate discussion, namely the Data Fetch Controller (DFC). The DFCs are fully-programmable AGUs associated with each core in the MCPE and enable the fine-grained access to the shared memory. These units feature a 16-bit microcontroller, tightly coupled with an address generation unit, that enables the description of arbitrarily complex access patterns through the compact (but still rich) instruction set provided by the microcontroller. The Pattern Code development is facilitated by a purposely designed assembler, which supports all the features commonly required by this type of program.

Furthermore, in order to properly characterize the various communication mechanisms and interfaces that can be used in the framework, several technologies are analyzed in terms of latency, bandwidth, and area occupation. In the particular case of the backplane interconnection, two very distinct implementations are explored, in order to select the one that better suits the characteristics of the target implementation: Networks On Chip (NoC) or Crossbars.

1.3 Main contributions

The conducted evaluation of the proposed framework on a reconfigurable device proved that the embedded DFCs offer significant memory savings on the pattern description code. Compared to the existing related art, the proposed solution, based on the HotStream framework, achieves code size reductions above 1500×, with identical address generation rates. As an example, considering the block-based matrix multiplication case study, experimental results suggest that, given the extensive data reuse offered by the proposed HotStream framework, it is easy to achieve a 2× speed-up relative to the state-of-the-art implementation. Moreover, the proposed solution is able to reduce the Host intervention in the process by up to 45×, while requiring significantly less buffering from the Host. Consequently, the proposed framework allows for larger data scalability.

To properly characterize the communication channels considered in the framework, a detailed experimental analysis of the performance of PCI Express interfaces, one of the technologies that may be used to connect the HotStream-based accelerator to a Host, was conducted. Moreover, a simulation-based performance assessment of the specific DDR3/controller pair that was used in the prototyping phase of this work is also performed. These two subjects are notably poorly documented in the literature, despite being a fundamental part of the vast number of Field-Programmable Gate Array (FPGA)-based accelerators that have been proposed over the years. Likewise, the complete software infrastructure that accompanies the HotStream framework also deals with the issue of handling the communication between the Host and the accelerator, each with its own address space. While some of the concepts explored in this context are specific to PCI Express interfaces, most of them can be adapted to any other communication interface that makes use of a Direct Memory Access (DMA) engine to handle data transfers.

Moreover, the extensive comparison between Crossbar-based buses and Networks-On-Chip (NoCs) offers system designers an additional source of information when trying to decide between the two. In particular, the implementation results presented for a state-of-the-art reconfigurable device, such as the Virtex 7, provide new insights regarding the future of NoC structures on current-generation FPGAs. Comparable information is only available for older devices, which lag the capacity of modern offerings by an order of magnitude or more.

The preliminary insights of the work presented in this thesis were published in a paper presented at a national conference:

• Sérgio Paiaguá, Adrian Matoga, Ricardo Chaves, Pedro Tomás, Nuno Roma, Evaluation and integration of a DCT core with a PCI Express Interface using an Avalon interconnection, in IX Jornadas sobre Sistemas Reconfiguráveis (REC2013), University of Coimbra, pages 93-99, February 2013.

This paper focused on an exploratory study based on a DCT accelerator communicating with a Host machine through a PCI Express interface.

More recently, the HotStream framework and, in particular, its innovative data-fetching and stream management mechanisms were extensively discussed in another paper, presented at an international conference:

• Sérgio Paiaguá, Frederico Pratas, Ricardo Chaves, Pedro Tomás, Nuno Roma, HotStream: Efficient Data Streaming of Complex Patterns to Multiple Accelerating Kernels, in 25th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD'2013), October 2013.

An extended version of this paper is currently under preparation, to be submitted to the International Journal of Parallel Programming. Meanwhile, the main ideas explored in the context of this thesis also motivated the elaboration of an R&D Project proposal:

• Streaming Complex Data Patterns on Heterogeneous Systems, submitted to the Portuguese Foundation for Science and Technology (FCT), in July 2013.

The ideas opened by the present thesis envisage multiple future research directions, which will be further explored by a PhD candidate in the SIPS group at INESC-ID.

1.4 Dissertation outline

The work presented in this dissertation is organized in seven chapters. In Chapter 2, the related work in the area of stream computing, address generation units, and heterogeneous multi-core architectures is presented. A brief description of the PCI Express standard is provided, along with the rationale for its development. Different on-chip interconnection technologies are also described in this chapter, as these play a very important role in the internal architecture of the proposed framework. Chapter 3 provides an in-depth discussion of the various hardware and software elements that compose the framework. Naturally, the DFC is described in more detail, given its key importance within the framework. The two elements that make up the HIB are discussed in Chapter 4 for the particular case of a PCI Express interface between the Host and the accelerator. A particular emphasis is given to the PCI Express interface, since it is the most complex of the interfaces supported by the HotStream framework. This chapter also discusses the low level components of the HotStream API, which handle the various address spaces that usually exist in a modern operating system and create the necessary descriptors for the operation of the DMA controller. Chapter 5 discusses the implementation of the evaluation prototype that was used to validate the platform, including a thorough characterization of all the communication mechanisms that were utilized. Experimental results are discussed in Chapter 6, where a detailed evaluation of the framework (as a whole) is provided, firstly by assessing the address generation efficiency and resource occupation of the HotStream infrastructure, and then by testing the framework with two case studies: the first based on the multiplication of very large matrices, and the second dealing with the processing of very large images in the frequency domain. Finally, Chapter 7 closes this thesis with some concluding remarks and future work directions.


2 Technology Overview

Contents 2.1 Stream Computing Platforms and Address Generation ...... 10 2.2 PCI Express Interfaces ...... 11 2.3 Shared Buses and Crossbars ...... 12 2.4 Networks On Chip ...... 13 2.5 NoC Survey ...... 14 2.6 Crossbar Survey ...... 15 2.7 Summary ...... 15


Given the broad scope of subjects covered by the HotStream framework, the comparison with the state of the art is not trivial. In fact, while a wide range of streaming architectures have been proposed over the years, most are completely self-contained, in the sense that the communication with an external general purpose processor, including all the necessary software layers from the low-level device drivers to the user-accessible APIs, is not considered. Section 2.1 discusses the related state of the art that shares the most points in common with the proposed HotStream framework, with an intentional bias towards architectures that tackle stream management and data access patterns, as these are the key elements of the proposed framework. Section 2.2, on the other hand, aims to provide the reader with the needed background to better illustrate the choice of PCI Express as the main interface supported by the HIB.

The efficient and high-throughput communication between multiple streaming cores is a key feature of the MCPE, the part of the framework that hosts all the processing elements. Two main communication mechanisms exist within this component of the HotStream framework: i) a backplane interconnection, used to allow for stream reuse among the multiple kernels by providing a network that is able to establish a full-duplex connection between any two cores with minimal latency and maximum throughput; and ii) a large-capacity shared memory, which enables stream buffering and rearrangement. Despite the requirement for a large storage capacity, this shared memory should still be capable of high data access throughput, so as not to represent a significant bottleneck to the performance of the framework. Naturally, these requirements dictate that an off-chip DDR memory be used, as these are the only components that are able to couple large storage capacity with high data access throughputs. The backplane interconnection, on the other hand, can equally be implemented with different technologies, i.e., shared buses, Crossbars or Networks-On-Chip. Naturally, each solution implies different area/performance trade-offs, which also depend on the target technology, i.e., Application Specific Integrated Circuits (ASIC) or FPGA. Sections 2.3, 2.4, 2.5 and 2.6 describe the key features of these interconnection technologies.

2.1 Stream Computing Platforms and Address Generation

The increasing popularity of stream-computing models has led to the development of many specialized architectures that tackle the efficient fetching and management of data streams. Examples such as the IMAGINE stream processor [23][24] and the Merrimac stream-based supercomputer [11][14], which are based on clusters of PEs and Stream Register Files (SRF), offer simple data pre-fetch mechanisms which, in the case of the IMAGINE processor, transfer entire streams between the SRF and an off-chip SDRAM. As stated by the authors, only 50% of the optimal performance is achieved, which motivates the development of more efficient data management structures.

Given this bottleneck, other researchers have focused on improving the generation of data streams. After demonstrating that dataflow computing can lead to significant performance improvements in a wide range of applications, Pell et al. [32] developed the MaxCompiler, which maps the computing kernels to an FPGA. To generate the streams to be fed to the computing Kernels, a set of commands is provided, instructing the developed tool-chain to automatically generate simple 1D, 2D or 3D data patterns. More complex patterns can only be described by using multiple commands [7], which is both time-consuming and results in a large configuration overhead. The same shortcomings are experienced by the Programmable Pattern-based Memory Controller (PPMC) [20], which eases the programming of regular 1D, 2D or 3D patterns through a set of function calls integrated in an API. Again, this solution falls short when long and/or complex patterns must be described.

The above pattern-generation solutions are actually very similar to the functionalities offered by modern DMA engines. For example, the Xilinx AXI DMA controller offers independent read and write channels, which provide high-bandwidth DMA between memory and stream-type peripherals [5]. With its scatter-gather capabilities and the support for 2D transfers, this controller can actually be used as a pattern generator. Its configuration is done by setting up a chain of descriptors that are then read by the engine, making this a rather similar solution to the one adopted by the PPMC [20]. Moreover, multichannel support is also offered through stream identifiers that accompany the data. This enables the multiplexing of the two available data channels, so that multiple masters and slaves can connect to a single DMA engine.

Both the PPMC and the AXI DMA solutions are driven towards moving large and regular chunks of data, and fall short when even more complex access patterns are considered, such as the one used in the Smith-Waterman algorithm [25]. In contrast, the DFC herein proposed is capable of handling arbitrary patterns of varying complexity without significant penalties. Furthermore, since the pattern description is not descriptor-based, there is essentially no limit to the length of the pattern to be generated.

2.2 PCI Express Interfaces

In the past, the PCI (Peripheral Component Interconnect), defined by the PCI Local Bus standard, was the most commonly used solution for connecting hardware devices in a computer system [29]. It experienced wide adoption over a long period of time, with components such as network cards, sound and graphics cards making extensive use of this solution, and can still be found in many modern motherboards. However, the growing requirements for communication bandwidth quickly made clear that PCI would not be a scalable solution.

To overcome the limitations of the original PCI standard, as well as of other standards such as PCI-X and AGP, the PCI-SIG (PCI Special Interest Group) developed a high-speed serial alternative, the PCI Express, officially abbreviated as PCIe. This new standard has become the de facto standard for high speed interfaces with computer peripherals [22] due to its higher throughput, scalability, lower I/O pin count and native hot-plug capabilities, among other features. To date, two main revisions have been made to the PCIe specification, each increasing the maximum transfer rates by a factor of 2, while maintaining backwards compatibility with previous versions. PCIe is a serial interface that achieves significant data rates by utilizing multiple lanes that operate in full-duplex. Lane widths of 2x to 16x are widely used, whereas 32x slots are very uncommon. Unlike the original PCI Local Bus standard, PCIe is a point-to-point connection. In order to make multiple interfaces available, root complex devices are used, connecting a processor and memory subsystem through a local bus to multiple ports, which can be further expanded by using special switches.
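As an illustration of how the raw lane rates translate into usable bandwidth, the short C sketch below computes the theoretical aggregate throughput for a given generation and lane count, assuming the 8b/10b line encoding used by PCIe Gen1 and Gen2. The figures are upper bounds only; transaction-layer overheads (TLP headers, DLLPs, flow control) reduce them further, as discussed in Chapter 5.

```c
#include <stdio.h>

/* Per-lane signalling rate in GT/s for PCIe Gen1 and Gen2 (both use 8b/10b encoding). */
static double lane_rate_gts(int gen) { return gen == 1 ? 2.5 : 5.0; }

/* Theoretical aggregate throughput in GB/s, before transaction-layer overheads. */
static double pcie_throughput_gbs(int gen, int lanes)
{
    double gbps = lane_rate_gts(gen) * lanes * (8.0 / 10.0); /* 8b/10b: 80% efficiency */
    return gbps / 8.0;                                       /* bits -> bytes */
}

int main(void)
{
    for (int gen = 1; gen <= 2; gen++)
        for (int lanes = 1; lanes <= 16; lanes *= 2)
            printf("Gen%d x%-2d : %.2f GB/s\n", gen, lanes, pcie_throughput_gbs(gen, lanes));
    return 0;
}
```

For example, a Gen2 x8 link yields 4 GB/s of raw aggregate throughput per direction before protocol overhead.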

2.3 Shared Buses and Crossbars

The trend in modern digital design is to increase component reuse by resorting to verified IP Cores that implement the desired functionality with a certain interface. This greatly improves time-to-market and significantly reduces design complexity. The interconnection between these blocks is, nevertheless, very important, as it very often determines the overall performance of the architecture. With the goal of further easing the interaction between these off-the-shelf components, the industry has moved to standard interconnection solutions, such as the AMBA (Advanced Microcontroller Bus Architecture) developed by ARM, of which the AXI (Advanced eXtensible Interface) family of interconnections is the most well known, or the IBM CoreConnect.

These standard interconnection solutions usually implement a shared bus or Crossbar to enable the communication between the master and slave interfaces attached to the bus.

2.3.1 Shared Bus

A shared bus is, for the most part, a collection of wires interconnecting the various interfaces, while a central arbiter grants the masters exclusive access to the bus. The arbitration is done according to a statically defined arbitration rule, such as round-robin, or by using a set of priorities attributed to each element. In this configuration, whenever a master takes control of the shared bus, all attached slaves have access to the information being transmitted and will act on it if their ID is specified in an appropriate control signal. Apart from the obvious observation that the peak bandwidth is limited to the maximum bandwidth achievable between any master-slave pair, it is further compromised by the difficulty of maintaining high clock frequencies as the number of interfaces, and thus the length of the interconnecting wires between them, increases [8].
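For illustration, the following minimal C sketch models a work-conserving round-robin arbiter of the kind described above; the request/grant representation is hypothetical and is only meant to show how the index of the last granted master determines the next grant, so that every requester is eventually served.

```c
#include <stdint.h>

#define NUM_MASTERS 4

/* Round-robin arbiter model: given a bitmask of pending requests and the index of the
 * master that was granted last, return the next master to grant (-1 if none is pending).
 * The search starts just after the previous grant, so no requester can be starved. */
int rr_arbitrate(uint32_t request_mask, int last_grant)
{
    for (int offset = 1; offset <= NUM_MASTERS; offset++) {
        int candidate = (last_grant + offset) % NUM_MASTERS;
        if (request_mask & (1u << candidate))
            return candidate;
    }
    return -1; /* no pending requests */
}
```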

2.3.2 Crossbar

As an evolution of the standard shared bus, Crossbar architectures greatly increase the aggregate bandwidth by allowing simultaneous transactions between independent master-slave pairs. In this topology, the number of interconnecting wires is greatly increased, as each master now has a dedicated route to each slave. Naturally, an arbiter circuit is still required to ensure that any master can communicate with any slave within a reasonable waiting time. Although the peak aggregate bandwidth is now increased by a factor equal to the number of simultaneous connections, the hardware complexity is considerably higher than that of the shared-bus solution, meaning that interconnecting an increasing number of interfaces will require extra effort if the same clock frequency is to be maintained. As such, it is common to resort to a hierarchical use of Crossbar interconnections in order to keep the interconnection density low, while slightly sacrificing the aggregate throughput, given that the number of connections that can be established simultaneously will be lower [34].

2.4 Networks On Chip

While Crossbars have proved to be able to fulfil the bandwidth requirements of modern IP Cores and SoCs, their increasing number and heterogeneity is still a challenge when designing SoCs that must adhere to strict area budgets and operating frequency targets. The key to addressing these issues is to decouple the Transport Layer from the Physical Layer, by utilizing a packet-based Transport Protocol [8]. Packets are usually composed of a header, payload and trailer, and are routed across the network according to a certain routing algorithm.

Networks on Chip are inherently more scalable than their bus-based counterparts, as the specific bandwidth and interface requirements of each IP Core can be met on a per-node basis. In fact, when interconnecting multiple cores with a crossbar, the width of the internal connection buses is determined by the bandwidth requirements of the fastest master-slave combination. Thus, when connections with inferior throughputs are established, the available bandwidth is under-utilized and resources are wasted. On the other hand, NoCs provide the ability to optimize the data links between the various switches that compose the network, in order to maximize overall throughput and quality of service (QoS), while minimizing circuit area [8].

Inspired by traditional computer networks, NoCs are usually characterized by the communication mechanism, switching mode and routing algorithm. All these parameters are functions of the network topology, i.e., the way in which the switching elements are arranged. Common topologies are 2D mesh, 2D torus, folded 2D torus and bi-directional ring, although many others exist. Figure 2.1 depicts the structure of the first two topologies.

(a) 2D Mesh (b) 2D Torus
Figure 2.1: Structure of a 2D Mesh and 2D Torus NoC

The communication mechanism defines how messages traverse the network and usually falls within two categories, circuit switching and packet switching. In circuit switching, a connection between a source and a destination is established before any packet is sent. During the lifetime of the connection, the links involved cannot be used by packets with any other origin, which may lead to under-utilization. On the other hand, in packet switching a connection is never established, and the routing decisions are made at run-time and on a per-packet basis, which leads to more efficient network usage, albeit with a slight increase in logic complexity.

Packet switching requires the use of a switching mode, which defines how packets move through the switches. While many techniques exist, the most common are store-and-forward, virtual cut-through and wormhole [30]. The first two operate on full packets. In store-and-forward mode, a switch buffers a packet completely before it is sent to the next switch, which increases transmission latency and hardware requirements. Virtual cut-through is similar, but a switch can start forwarding a packet as soon as the next switch indicates that it has a buffer that is big enough to hold the full packet, which slightly reduces communication latency. Finally, wormhole switching reduces buffering requirements by splitting a full packet into various sub-packets of fixed size, designated as flits. Only the header flit possesses routing information and, therefore, the payload flits must follow the same path reserved by the header.

The path taken by a packet from source to destination is defined by the routing algorithm. The most common algorithms are distributed and perform the routing decisions on a per-node and per-packet basis. In addition, these can be either deterministic or adaptive, depending on whether the routing decision takes into account the current network traffic [30]. Naturally, deterministic routing algorithms, such as XY routing, lead to lower resource usage.
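As a concrete illustration of deterministic routing, the C sketch below implements the XY decision rule for a 2D mesh: a packet is first routed along the X dimension until the destination column is reached, and only then along Y. The port names and coordinate convention are illustrative; the point is that the decision depends only on the current and destination coordinates, never on the network load.

```c
typedef enum { PORT_LOCAL, PORT_EAST, PORT_WEST, PORT_NORTH, PORT_SOUTH } port_t;

/* XY routing for a 2D mesh: route along X first, then along Y.
 * (cx, cy) are the coordinates of the current switch, (dx, dy) those of the destination;
 * Y is assumed to grow towards the north port. */
port_t xy_route(int cx, int cy, int dx, int dy)
{
    if (dx > cx) return PORT_EAST;
    if (dx < cx) return PORT_WEST;
    if (dy > cy) return PORT_NORTH;
    if (dy < cy) return PORT_SOUTH;
    return PORT_LOCAL; /* packet has arrived: deliver to the attached core */
}
```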

2.5 NoC Survey

While much research has been done on the subject of Networks on Chip, most of this body of work focuses on the influence of the various routing and arbitration algorithms on the flow of data within the network. Software models are typically used to ease the testing of the proposed solutions under different traffic loads. Thus, there is an evident shortage of publicly available NoC implementations. The following details the NoCs available in the state of the art.

NOCem [35] is a configurable NoC architecture which implements packet switching with optional virtual channels and supports three of the most common topologies, namely Mesh, Torus and Double Torus. A more comprehensive alternative is proposed by the European Space Agency (ESA) in the form of the SOCWire [31]. This modular solution is composed of the SOCWire Switch, which provides wormhole routing and round-robin arbitration, and the SOCWire CODEC, responsible for enabling the communication between the nodes and the routing elements. To ensure proper operation in hazardous environments, the design is fault-tolerant and includes hot-plug abilities for dynamically reconfigurable modules.

The ATLAS project [19] is an ambitious Java-based environment for the generation of different NoC architectures, which are configurable according to a large set of configuration parameters, such as the topology, number of virtual channels and routing algorithm. Table 2.1 presents some of the supported parameters for two of the architectures generated by the framework.

Table 2.1: Examples of architecture combinations supported by the ATLAS environment

Parameter          Hermes             Mercury
Topology           2D Mesh            2D Torus
Virtual Channels   1, 2, 4            1
Routing Alg.       XY or West-first   Adaptive
Scheduling Alg.    Round Robin        Round Robin

The support for virtual channels and the simpler routing algorithm, which reduces hardware utilization, led to the adoption of the Hermes architecture as the Network on Chip to which the Crossbar solution is compared. The SOCWire, with its large array of features, proved to utilize too many resources to be competitive with Crossbar solutions, while NOCem utilizes store-and-forward instead of wormhole switching, thus forgoing the reductions in latency and buffering requirements (and, therefore, in overall hardware resources) that wormhole switching provides.

2.6 Crossbar Survey

Unlike NoCs, Crossbars benefit from a greater standardization, as they are central elements of the multiple SoC interfaces offered by the various IP Core vendors. Widely used interfaces such as Altera Avalon, AMBA AXI, AMBA AHB and IBM CoreConnect, as well as open-source alternatives, all make extensive use of crossbar modules in their infrastructure. Moreover, as the goal of such standards is to promote interoperability and design reuse across multiple vendors, they are all offered with no fees or royalties associated. This results in a significant availability of high-quality crossbar implementations, which are usually very close in terms of performance and area requirements. Thus, the deciding factor is usually dictated by the target technology or by the bus architecture used by the rest of the system, i.e., if the rest of the SoC is interconnected by an AXI infrastructure, it is only natural that the need for a stream-based Crossbar is fulfilled by the AXI Stream Interconnect.

2.7 Summary

Most streaming architectures described in the literature do not encompass the hardware modules and associated software that are required to communicate with an external host.

Several authors have identified data management structures as being the main bottleneck when developing such co-processors. Some of these authors report a performance degradation, attributable to these elements, of up to 50%. Thus, much research is being directed to the development of efficient stream management and, as a consequence, pattern generation mechanisms. However, most approaches focus on the creation of address generation units which are optimized to perform large data transfers with regular patterns. When finer control over the data is required, large configuration overheads are incurred by such solutions.

Like any streaming architecture, the performance of the HotStream framework is largely influenced by the communication channels it utilizes. Thus, it is important to use state-of-the-art off-chip and on-chip communication solutions. As far as off-chip interconnections are concerned, PCI Express is the most widely used interface in modern computational systems. It offers a significant aggregate bandwidth by leveraging multiple parallel lanes of full-duplex serial connections.

On-chip interconnections benefit from a broader array of solutions. While Crossbars have become the de facto standard for the interconnection of high performance IP Cores in modern SoCs, Networks on Chip (NoCs) are becoming an important alternative. In comparison to the former, NoCs offer increased flexibility, scalability and performance. However, there is a clear shortage of publicly available NoC implementations, which further complicates their evaluation and application to real designs. In the case of the HotStream framework, the decision to employ either one of these solutions is dictated by their performance vs. area trade-offs and, most importantly, by their capability to scale.

3 HotStream Framework Architecture

Contents 3.1 Host Interface Bridge ...... 19 3.2 Multi-Core Processing Engine ...... 20 3.3 The HotStream API ...... 21 3.4 Data Fetch Controllers, Shared Memory and Auxiliary Units ...... 22 3.5 Data Stream Switch (DSS) and Core Management Unit (CMU) ...... 28 3.6 Summary ...... 30



Figure 3.1: Structure and organization overview of the HotStream framework

The proposed HotStream framework, depicted in Fig. 3.1, is a comprehensive solution for the development of stream-based architectures, composed of a software layer and a hardware layer. The software layer integrates: i) a convenient API that allows the programmer to specify any arbitrarily complex streaming pattern; as well as ii) a Device Driver to map the user-specified memory buffers, allocated in user space, to the physical address space. This device driver also serves as the data transfer peer, on the software side, that assures appropriate integration mechanisms with the hardware layer.

The hardware layer is composed of: i) the Host Interface Bridge (HIB), responsible for handling the data management between the Host processor and the accelerator; and ii) the Multi-Core Processing Engine (MCPE), responsible for managing the data streams between the PEs within the accelerator. The proposed architecture is designed to be fully scalable and adaptable (by supporting a variable number of data streams and PEs), as well as flexible enough to support applications with different and arbitrarily complex streaming patterns with minimal effort. Moreover, one of the main features that sets it apart from the current state-of-the-art is that it offers support for efficient data fetch and reuse within the accelerator architecture.

The above mentioned characteristics are mainly achieved by providing two distinct levels of data access patterns, with different intrinsic granularities and complexity degrees. The first level is implemented within the HIB and supports simpler patterns of a more coarse-grained nature, as this type of communication channel typically benefits from transfers involving larger data chunks.


This first level of granularity is identical to what is provided by the PPMC [20]. The second and more fine-grained level of granularity is implemented within the MCPE, supporting more complex streaming patterns.

3.1 Host Interface Bridge

Copying data from the Host processor memory system to the MCPE is a complex procedure. It requires the intervention of: i) the device driver, on the Host side; and ii) a specific hardware structure, the HIB, that is able to autonomously issue data transfers (read/write requests) between the main memory of the Host and the MCPE.

The Host Interface Bridge (HIB) mainly consists of two modules: the Data Stream Bridge (DSB) and the Direct Memory Access (DMA) controller. The DSB is responsible for interfacing the hardware accelerator with the Host General Purpose Processor (GPP). The adopted interfacing standard and corresponding communication structure should be adapted to the data transfer facilities and infrastructures that are offered by each specific Host. Some supported standards are PCIe, AMBA AXI, CoreConnect, etc. As a consequence, the implementation of this bridge must be adapted to the specific requisites of each application. The co-located DMA controller ensures the management of the coarse-grained data transfers between the Host GPP and the hardware accelerator.

Despite the implementation efficiency of the different entities involved in a single data transaction, an unavoidable amount of overhead is expected, given the various operations that are involved. Fortunately, this overhead bears a weak dependence on the size of the data to be transferred. Thus, transferring large data chunks guarantees a more efficient utilization of the available bandwidth in the data channel. However, complex streaming patterns often require accessing data that is not laid out linearly in the Host memory but, instead, spread over a regular pattern of contiguous blocks separated by non-unit strides. In such situations, transferring the smallest data chunk that encompasses each set of contiguous blocks inevitably results in a waste of bandwidth. The solution herein proposed to tackle this issue is to implement coarse-grained patterned data transfers between the Host and the MCPE, such that the HIB transfers only useful data, but in large data chunks.

These coarse-grained data access patterns can be easily accomplished by setting up the DMA controller to transfer each of the contiguous data blocks that constitute the pattern from their physical locations in the Host memory, according to the regions mapped by the device driver. The set-up phase consists of creating scatter-gather descriptors for configuring the DMA engine, arranged in a chain, and defining the starting position and size of the memory blocks to be read from or written to. As the number of contiguous blocks described by a given pattern increases, the number of required descriptors also increases in the same proportion. Hence, in order to minimize the impact of this increase, the simple and traditional DMA engine can be replaced by a more efficient alternative capable of performing, at least, the most regular 2D memory accesses, i.e., each memory transaction can be described by the tuple {OFFSET, HSIZE, STRIDE, VSIZE}, specifying the starting address of the first memory block, the size of each contiguous block, the starting position of the next contiguous block in relation to the previous one, and the number of repetitions of the two previous parameters, respectively. This reduces the total number of descriptors needed to describe a given pattern. Thus, the better the patterns fit within a 2D description, the more significant the observed reduction will be.

While the size and nature of the patterns applied to the data transfers between the Host and the MCPE are technically only limited by the available space to store the descriptors, these patterns should not be too fine-grained, in order to avoid a detrimental impact on throughput. Therefore, in the proposed HotStream framework, the API provided to the programmer includes a special call to gather an arbitrary sequence of data segments stored across the memory space into a single, larger and contiguous buffer that can then be transferred at once. This gathering operation takes a non-negligible time to complete, thus it is only useful when the incurred penalty does not exceed the overheads of transferring the smaller non-contiguous individual data chunks.
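To make the 2D transfer description concrete, the C sketch below shows one possible representation of such a descriptor and how a tile of a larger row-major matrix collapses into a single entry; the struct layout and field names are illustrative only, not the actual descriptor format consumed by the DMA engine used in the prototype.

```c
#include <stdint.h>

/* Hypothetical 2D scatter-gather descriptor, following the {OFFSET, HSIZE, STRIDE, VSIZE}
 * tuple described in the text. A plain 1D engine would need VSIZE separate descriptors
 * to express the same access. */
struct sg2d_desc {
    uint64_t offset;  /* address of the first contiguous block                 */
    uint32_t hsize;   /* bytes in each contiguous block                        */
    uint32_t stride;  /* distance, in bytes, between consecutive blocks        */
    uint32_t vsize;   /* number of blocks (repetitions of hsize/stride)        */
};

/* Describe the transfer of a tile of 'rows' x 'cols' elements located at (row0, col0)
 * inside a row-major matrix with 'row_pitch' elements per line. */
static struct sg2d_desc describe_tile(uint64_t base, uint32_t elem_size,
                                      uint32_t row_pitch, uint32_t row0, uint32_t col0,
                                      uint32_t rows, uint32_t cols)
{
    struct sg2d_desc d;
    d.offset = base + ((uint64_t)row0 * row_pitch + col0) * elem_size;
    d.hsize  = cols * elem_size;       /* one contiguous tile line                */
    d.stride = row_pitch * elem_size;  /* jump to the same columns, next row      */
    d.vsize  = rows;                   /* repeat for every line of the tile       */
    return d;
}
```

In this formulation, the better a pattern decomposes into such tiles, the fewer descriptors need to be stored and parsed, which is precisely the reduction discussed above.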

3.2 Multi-Core Processing Engine

The MCPE is where the actual computation takes place and is designed to support multiple independent and heterogeneous cores that collaboratively execute the multiple streaming kernels. Moreover, each kernel can span several Cores, to further exploit data parallelism. As depicted in Fig. 3.1, it consists of: i) multiple Cores, each composed of a PE, responsible for the computation, and one or more Data Fetch Controllers (DFCs), responsible for data management; ii) a high-speed Backplane Interconnection, able to dynamically route the data streams between the Cores, promoting the required data reuse schemes; iii) a shared memory, which allows rearranging the stream access patterns and reusing data; iv) a Data Stream Switch (DSS), to route the data streams coming from the Host to either the backplane or to the shared memory. Accordingly, the data streams transferred from the Host via the DSS, or those produced by an individual Core, can be routed to other Cores in the MCPE via the backplane interconnection or stored in the shared memory for later reuse; and v) a Core Management Unit (CMU), a register-based interface for (re)starting the Cores, configuring their instruction memories or configuring interrupt generation on a per-Core basis.

The high-speed stream-oriented Backplane Interconnection must ensure that each Core can communicate with any other with a minimum routing delay. In addition, multiple connections may need to be active at any given time. These requirements call for a high-speed interconnection network. Sophisticated Network-on-Chip (NoC) solutions are likely to provide higher system scalability and better support for heterogeneity among the Cores in terms of data interfaces. However, one must also ensure that the amount of hardware resources required by the interconnection network is minimal, saving space for extra computing Cores.

The shared memory, on the other hand, is particularly important for applications that exploit

different types of access patterns or when data reuse is exploited between the PEs. Whenever a stream needs to be rearranged before it is consumed by another Core (or even streamed back to the Host machine), it can be buffered on the shared memory. As such, this element must be accessible by all the Cores, employing a simple work-conserving round-robin arbitration mechanism, which makes sure that all read and write requests are served with equal priority and no starvation occurs. In addition, being an address-based element in an otherwise stream-oriented architecture, reading and writing operations require the inbound or outbound data streams to be accompanied by a stream of addresses. The generation of these addresses is carried out by the DFC unit, within each Core (further detailed in Section 3.4), and allows the implementation of fine-grained streaming patterns of variable complexity, ranging from simple linear accesses to more exotic and complex ones, such as diagonal or cross-shaped patterns.

Accordingly, the DFCs, which are responsible for these fine-grained data access patterns, are implemented through small programmable units directly coupled with the PEs (one for each outbound or inbound stream) that generate the addresses for each data element within the stream (or for groups of data elements, if incremental bursts are used, i.e., multiple data elements are stored or retrieved from sequential memory locations). Unlike the coarse-grained patterns supported by the DMA engine, these units are able to generate address patterns with a resolution down to the single-address level. In fact, since the DFC can be programmed to describe long-running complex patterns with common loop structures, there is effectively no limit to the type of patterns that the programmer can describe. Additionally, the pattern specification takes virtually no space, thus avoiding the penalties resulting from the descriptor-based data-fetching mechanisms used in state-of-the-art approaches, such as the PPMC [20].
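To illustrate the kind of fine-grained pattern that the DFC handles naturally but a descriptor-based DMA handles poorly, the following C sketch (a software model for illustration only, not DFC code) generates the element addresses of the main diagonal of an N×N row-major matrix:

```c
#include <stdint.h>
#include <stdio.h>

/* Software model of a diagonal access pattern over an N x N row-major
 * matrix stored at bus address `base`, with elements of `elem` bytes.
 * Element (k, k) lives at base + k * (N + 1) * elem. */
static void diagonal_addresses(uint64_t base, uint32_t n, uint32_t elem)
{
    for (uint32_t k = 0; k < n; k++)
        printf("0x%08llx\n",
               (unsigned long long)(base + (uint64_t)k * (n + 1) * elem));
}

int main(void)
{
    diagonal_addresses(0x0, 8, sizeof(float));  /* 8x8 matrix of floats */
    return 0;
}
```

Described with plain scatter-gather descriptors, such a pattern would need one descriptor per element; expressed as a loop in the DFC instruction memory, it occupies only a handful of instructions, regardless of N.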

3.3 The HotStream API

The seamless integration between the acceleration hardware and the Host is accomplished by the HotStream API. This software layer abstracts the low-level interactions between the device driver and the HIB and provides the user with a convenient and well-documented set of calls that give easy access to the various features of the HotStream framework. The API (see Appendix A.3), written in C, is further subdivided into four logical groups, each fulfilling a special set of tasks within the framework: i) Core Management; ii) Framework Management; iii) Data Management; and iv) Pattern Definition. The Core Management group provides calls to configure the Instruction Memory of individual Cores, as well as to issue individual or global resets and manage interrupts. These procedures can be performed on a per-Core basis or applied to multiple Cores at once by utilizing vector variants of the same calls. The Data Management and Pattern Definition groups of functions allow the creation of data streams with different coarse-grained patterns in order to meet the requirements of particular applications. These streams can either target or be sourced from the shared memory or the high-speed backplane without additional configuration. Finally, the Framework


Management set of calls is responsible for initializing and gracefully terminating the operation of the framework.

Two pattern creation calls, HotStream_2D() and HotStream_Block(), leverage the 2D capabilities of the DMA Engine and are complemented by the gather function, HotStream_gather(), introduced in Section 3.1. While the first function configures a stream with the basic parameters outlined in Section 3.1, i.e. {OFFSET, HSIZE, STRIDE, VSIZE}, providing the maximum flexibility for the definition of a stream, the second call provides an additional level of abstraction when generating tiled patterns. The recurrence of such patterns in streaming applications motivated the development of this custom call, which only requires the original matrix and tile size, along with the size, in bytes, of each matrix entry, to be specified in order to configure the data stream.

The HotStream_Linear() call can be used when no complex patterns are required; thus, only the OFFSET within the user-provided buffer and its TOTAL_SIZE need to be specified.
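Although the exact prototypes are listed in Appendix A.3, a typical usage of these calls might look as follows. The argument lists and ordering shown here are assumptions made purely for illustration, not the actual HotStream API signatures:

```c
#include <stdlib.h>

/* Assumed prototypes, for illustration only; the real signatures are
 * those defined by the HotStream API (Appendix A.3). */
extern int HotStream_2D(void *buf, size_t offset, size_t hsize,
                        size_t stride, size_t vsize);
extern int HotStream_Block(void *buf, size_t matrix_dim, size_t tile_dim,
                           size_t elem_size);
extern int HotStream_Linear(void *buf, size_t offset, size_t total_size);

int configure_streams(float *matrix)            /* 1024 x 1024 matrix */
{
    /* One 64-element segment at the start of every row. */
    HotStream_2D(matrix, 0, 64 * sizeof(float),
                 1024 * sizeof(float), 1024);

    /* The same matrix described as 64 x 64 tiles. */
    HotStream_Block(matrix, 1024, 64, sizeof(float));

    /* A plain linear stream over the whole buffer. */
    return HotStream_Linear(matrix, 0, 1024 * 1024 * sizeof(float));
}
```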

3.4 Data Fetch Controllers, Shared Memory and Auxiliary Units

The DFCs are undoubtedly the central and most important elements of the MCPE. These units are responsible for single-handedly extracting the data from the (address-based) shared memory with arbitrarily complex patterns, and for forming the data streams that are presented to each Core’s PE, while the latter remains completely oblivious as to the origin of the data that it is consuming. In particular, each DFC is responsible for generating the corresponding read and write data transactions, according to the defined streaming pattern. Each DFC has its own instruction memory that is programmed from the Host machine through a compact but complete ISA, by using a custom assembler with syntax validation. This instruction memory can be dynamically updated and its size represents the only limitation to the complexity of the considered pattern. However, as long-running patterns can be described by loops, the size of the instruction data is kept relatively small and independent of the extension of the pattern. In addition, the DFC is optimized to take advantage of features provided by the considered bus protocol (e.g., AMBA AXI), such as burst commands, to minimize the existing overheads.

In order to handle these tasks, the DFCs incorporate two fundamental blocks that operate together: i) the Address Generation Core (AGC); and ii) a custom small-footprint 16-bit microcontroller (Micro16). The AGC is a small specialized processor that autonomously generates addresses in a linear, 2D or 3D fashion. On the other hand, the Micro16 microcontroller is capable of generating combinations of linear, 2D and 3D patterns that are sequentially requested to the AGC in order to construct more complex stream patterns. The two units interact via a small and shared register file, the External Register File (ERF). This way, while the AGC is generating a sequence of addresses, the microcontroller concurrently modifies the ERF with all the required parameters for the next regular pattern. The combination of these two units makes it possible to describe patterns as complex as required, without ever compromising the address generation rate.
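The division of labour between the two units can be pictured with the following simplified C model (a conceptual sketch only, not the hardware behaviour): one bank of parameters drives the current address sequence while the other bank is being rewritten for the next segment of the pattern.

```c
#include <stdint.h>

/* One regular pattern segment: `count` addresses starting at `start`,
 * separated by `incr` bytes (parameter names are illustrative). */
typedef struct { uint64_t start, incr; uint32_t count; } segment_t;

/* Double-buffered generation: while the "AGC" consumes bank[cur], the
 * "Micro16" prepares the parameters of the next segment in the idle bank. */
static void run_pattern(const segment_t *seg, int nseg,
                        void (*emit)(uint64_t addr))
{
    segment_t bank[2];
    if (nseg <= 0)
        return;
    bank[0] = seg[0];
    for (int s = 0; s < nseg; s++) {
        int cur = s & 1;
        if (s + 1 < nseg)
            bank[(s + 1) & 1] = seg[s + 1];      /* prepare the next segment */
        for (uint32_t i = 0; i < bank[cur].count; i++)
            emit(bank[cur].start + (uint64_t)i * bank[cur].incr);
    }
}
```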


In Appendix B, examples of patterns with fundamentally different characteristics are provided, along with the Pattern Description Code that instructs the DFC to generate such access patterns.

3.4.1 Address Generation Core (AGC)

The AGC effectively emulates traditional nested loops, as found in most programming languages, by specifying, for each loop level, the number of iterations to be executed. Each level is implemented through a Loopcontrol unit that independently counts down from a starting value to zero, generating an interrupt upon completion. In addition, each of these Loopcontrol units holds the necessary parameters to determine the starting address of the AGC during the next iteration of the loop level. By combining multiple Loopcontrol units in a daisy-chain structure and by routing the interrupt signal of the innermost levels to the enable input of the outermost ones, an N-level nested loop can be designed. Figure 3.2 illustrates the required configuration to implement a 3-level loop. It should be noted that, regardless of the number of Loopcontrol units desired, the AGC is able to automatically configure all the necessary internal connections between the various elements, as its hardware description is based on generic VHDL parameters that selectively add the necessary blocks and interconnections.


Figure 3.2: AGC in a 3-level nested loop configuration.

The body of the loop is emulated by the Loopbody unit, which generates one address per clock cycle based on three basic configuration parameters: increment, multiplication, and initial value. This trio makes it possible to generate any affine linear access pattern, i.e., patterns of the type y_n = y_{n-1} × m + i, which represent the great majority of the indexing needed by most scientific applications [16]. Hence, the Loopbody address generation is controlled by the associated Loopcontrol units, which interrupt the former whenever the iteration limit in any of the nested loop levels is reached. This results in a two clock cycle delay to compute the next starting address. Considering that any delay introduced in the data fetching procedure may potentially slow down multiple PEs, it is of the utmost importance for the address generation to be essentially continuous. While this is a reasonable requirement when the supported patterns are restricted to 2D or even 3D sequential accesses, supporting arbitrarily irregular patterns with changing starting positions requires a more sophisticated approach. Therefore, for more complex patterns the configuration of the AGC relies on a double-buffered scheme, which is accomplished by duplicating the configuration registers used by the Loopbody and Loopcontrol units. With this architecture,

the Micro16 is able to compute and configure the loop parameters for the following portions of the pattern, concurrently with the AGC execution, i.e., without interrupting the address generation.
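A software analogue of a 3-level configuration such as the one of Fig. 3.2 is sketched below in C. The field names are illustrative, not the actual register map: each Loopcontrol level contributes an iteration count and an adjustment to the Loopbody starting value, while the Loopbody applies the affine update y = y × m + i and emits one address per inner iteration.

```c
#include <stdint.h>

/* Illustrative model of a 3-level AGC configuration (not the hardware). */
typedef struct { uint32_t count; int64_t step; } loopctrl_t;
typedef struct { uint64_t init; uint64_t mult; int64_t incr; uint32_t count; } loopbody_t;

static void agc_3level(loopbody_t body,
                       loopctrl_t l1, loopctrl_t l2, loopctrl_t l3,
                       void (*emit)(uint64_t addr))
{
    for (uint32_t k = 0; k < l3.count; k++)          /* outermost level  */
        for (uint32_t j = 0; j < l2.count; j++)
            for (uint32_t i = 0; i < l1.count; i++) { /* innermost level  */
                /* Starting value for this pass of the Loopbody. */
                uint64_t y = body.init + (uint64_t)((int64_t)k * l3.step
                                                  + (int64_t)j * l2.step
                                                  + (int64_t)i * l1.step);
                for (uint32_t n = 0; n < body.count; n++) {
                    emit(y);
                    y = y * body.mult + (uint64_t)body.incr;  /* y = y*m + i */
                }
            }
}
```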

3.4.2 Micro16 microcontroller

The Micro16 is a custom microcontroller designed to configure and control the AGC. While the overall architecture closely follows a traditional single-cycle RISC, it also comprises customized features aimed at easing the integration with the AGC. One such feature is the incorporation of an ERF, which is composed of the local register files from the Loopcontrol and Loopbody units within the AGC. Despite its external nature, any of the registers can be used as sources and destinations during ALU-based operations, with no additional latency. The microcontroller encompasses a simple interrupt controller which makes it possible to trigger and wait for events on one of the multiple interrupt lines available. Notably, one of the interrupt lines is used exclusively to communicate with the AGC. An interrupt sent from the Micro16 to the AGC lets the latter know that new parameters have been set and may be read in. Next, the microcontroller must wait on an interrupt from the AGC indicating that the parameters have been stored and the ERF may be modified again. This procedure is easily coded in assembly by first writing to the interrupt line selector register and subsequently triggering and waiting on an interrupt, through two dedicated instructions. The remaining lines are particularly useful when synchronization between multiple AGCs is needed. This is the case, for example, in applications where multiple heterogeneous cores that depend on one another are running in parallel.

The architecture of the Micro16, depicted in Fig. 3.3, was designed to be as compact as possible, since the DFC is to be replicated for each core in the MCPE. This promotes the scalability of the framework, while keeping its overall resource requirements low. With these constraints in mind, the datapath is 16-bit wide and the internal register file is composed of a reduced set of 4 registers, one of which is tied to zero and doubles as the interrupt line selector. However, in order to overcome this limited number of storage elements, a stack with a depth of 32 words was incorporated, which greatly enhances the programmability of the microcontroller, without compromising its resource usage. In order to ensure that the Micro16 is capable of addressing large shared memories, a base address register is included. By setting this register through a specific instruction, a 32-bit address bus is created, capable of addressing 4G words of memory. Despite its name, the Micro16 can also be configured in 32-bit mode, which effectively doubles the width of all the buses in the datapath and inevitably increases the hardware requirements. This configuration is particularly useful when a given application frequently addresses data that is scattered across the shared memory and the repeated modification of the base address register results in a decrease in the address generation rate. In this mode of operation, the instruction width is still maintained at 16 bits, in order to leverage the highly compact instruction set and to keep the instruction memory usage to a minimum.

The adoption of a 16-bit wide instruction word required the Instruction Set Architecture (ISA) to be carefully tailored in order to accommodate a set of operations large enough so as not to limit

the pattern description ability of the microcontroller. In particular, all supported instructions were encoded with a minimal number of bits, although at the cost of a more complex decoding logic. These optimizations resulted in an ISA containing 14 instructions, divided into four main groups:

1. Register-Register operations involving the ALU
2. Immediate constant loading operations
3. Low and High constant loading and miscellaneous operations
4. Flow control operations

ALU Operations

The ALU supports 4 distinct operations on 16-bit integers: addition, subtraction, multiplication and decrement. The operations were selected based on the particular characteristics of most pattern description codes, where arithmetic operations are more abundant than logical ones. In addition, given that the flags available for flow control are active on zero, not-zero or negative, the decrement operation is more useful than an increment operation would be when describing loops. When using any of these operations, all flags are updated. It is worth noting that this group of operations permits any combination of source and destination registers, i.e., both can refer to either the Internal Register File (IRF) or ERF. This can be easily specified in the assembly code by prepending the register number with the internal or external register file identifier (IRF or ERF), followed by the targeted unit within the AGC, if the external register is selected, and the register number. For instance, register 1 of the loopbody unit of the AGC is referred to as erf.0.r1. Conversely, when using a register from the IRF, the same reference would take the form irf.r1.

Figure 3.3: Architecture of the Micro16 microcontroller.

The organization of ALU control words is depicted in Tab. A.1 in Appendix A, along with the assembly syntax supported by the purposely-built assembler.

Constant Loading Operations

Due to the limitations imposed by the 16-bit instruction set, the constant loading operations are divided into two instruction groups: Immediate constant loading operations and Low and High constant loading and miscellaneous operations. The existence of three constant loading instructions aims to minimize code size. In fact, the first group enables the loading of constants of up to 12 bits in a single instruction, while the two instructions of the latter group allow the loading of any 16-bit constant, albeit in two sequential steps: first the lower 8 bits are loaded into a register and then the upper 8 bits are loaded into that same register. Due to instruction word size limitations, the destination register must be part of the IRF. The two instruction word groups along with the constant loading instructions are presented in Tab. A.2.

Miscellaneous operations

This group, in addition to encompassing the loading of 8-bit constants, enables the seamless integration between the Micro16 microcontroller and the AGC through two custom instructions, Wait and Done. The Done instruction must be issued whenever the registers in the ERF are modified. This indicates to the AGC that the set of constants currently stored in the ERF is ready to be read and can be copied to the internal registers of the Loopbody and Loopcontrol units. Once this copy is complete, the AGC asserts a signal indicating that the constants in the ERF have been parsed and can now be modified again. The microcontroller can wait on the assertion of this signal through the Wait instruction. Again, due to word-size constraints, this word group is further subdivided through an additional bit to define stack operations. These consist of the conventional push and pop operations, which place/retrieve any register from the IRF or ERF on/from the stack, thus greatly easing the programmability of the Micro16. This instruction word group and the associated assembly mnemonics are represented in Tab. A.3.

Flow control operations

Finally, flow control is ensured by four different jump instructions, three of which are conditional. The conditional jumps are supported by three status flags, namely Zero, Negative and Not Zero. These are stored in the Program Status Register (PSR), depicted in Fig. 3.3, so that their state does not have to be polled immediately after the operation that updated the flags. Only absolute jumps are supported, through a destination field of 12 bits. This means that a single jump instruction can traverse up to 4096 instructions, which is more than enough considering the size of usual pattern description codes. To facilitate the coding task, the assembler supports instruction labelling, automatically converting these labels into absolute addresses. These four operations are supported by the word group described in Tab. A.4.


Assembler

In light of the custom nature of the machine code defined by the Micro16 ISA, an assembler was developed to assist its programming. This tool offers syntax validation, label support for jump operations, and overflow checking when loading constants or specifying a jump destination. In addition, single and multi-line comments are supported, and so are decimal or hexadecimal numeric bases when specifying constants. The output of the assembler is a binary file ready to be placed in the instruction memory of the microcontroller. While the performance of the DFC is not directly influenced by these features, they greatly facilitate the pattern specification process, thereby increasing the ease-of-use of the HotStream framework, which is undoubtedly a key aspect of the proposed system. By taking advantage of such user-friendly assembly language, configuring a pattern is just a matter of populating the ERF with the relevant parameters and using the custom interface instructions to start and stop the address generation.

3.4.3 Access to the Shared Memory

The buffering capabilities of the HotStream framework are ensured by a single large-capacity shared memory, usually implemented by an external Dynamic Random Access Memory (DRAM). While this may be regarded as a limitation to the system performance, as arbitrated accesses affect the effective memory bandwidth for each core, it is one that cannot be avoided. Furthermore, DRAM devices exhibit complex timing characteristics, as the need for periodic refreshing and the costly charging of data lines means that the access time for an arbitrary memory position is not constant. In this respect, modern Double Data Rate (DDR) memories offer special access modes to maximize the achievable data throughput. These are essentially burst-based accesses that retrieve a fixed number of sequential data beats from the DDR with one single command. Exploiting burst-based accesses to the DDR memory is, therefore, paramount to get the most out of the available memory bandwidth. Accordingly, the address stream that is generated by the developed AGC was shaped with this particular objective in mind.

As an arbitrary number of cores may be requesting data from the shared memory, a scalable and robust arbitration solution is also required. This may be achieved by adopting any of the currently available industry-standard interfaces, specifically targeted at high-performance systems. Some examples are the AXI4 specification, maintained by ARM [2], or the CoreConnect bus architecture, defined by IBM [1], which are widely supported by most FPGA makers, e.g., Xilinx, and can also be deployed in CMOS technology.

To comply with the specific bus architecture adopted for each particular application, a Bus Master Controller (BMC) was developed and integrated in the streaming framework. The purpose of this unit is to perform the conversion between the stream-based interface used by the DFC and the Memory-Mapped (MM) protocol used by the bus interface (see Fig. 3.4). To avoid placing additional pressure on the shared memory, each bus controller features a Stream-to-MM and an MM-to-Stream interface. These independent channels are arbitrated internally so that only one

stream accesses the bus at a time.


Figure 3.4: Core internal structure, comprising the PE (e.g., an application specific IP Core) and the co-located BMC


Figure 3.5: Internal structure of the BMC, consisting of Write and Read control units, a Channel Arbiter and a Synchronizer block

The BMC (Fig. 3.5) comprises write and read control units, featuring internal buffers that enable the issuing of incremental burst-based transactions and, in the case of the latter, perform data pre-fetching by requesting data that will be buffered until the Core is ready to consume it. To accomplish this, both the write and read control units consume the address stream up to the point where an increment pattern is interrupted or the burst limit, which can be selected when instantiating the component, is reached. In the particular case of the write channel, no data can be output until it is actually marked valid by the Core. Thus, a synchronizer block is also used, which forces the data and address streams to flow at the same pace.
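The way a control unit can turn an address stream into incremental bursts is illustrated by the following C sketch (a functional model under the assumption of a fixed burst limit; it is not the BMC implementation itself):

```c
#include <stdint.h>
#include <stddef.h>

#define MAX_BURST_LEN 256   /* burst limit, selectable at instantiation */

typedef struct { uint64_t start; uint32_t len; } burst_t;

/* Consume addresses from addr[] and coalesce runs of consecutive words
 * into bursts. word_bytes is the size of one data beat; `out` must be
 * able to hold up to n entries. Returns the number of bursts produced. */
static size_t coalesce(const uint64_t *addr, size_t n, uint32_t word_bytes,
                       burst_t *out)
{
    size_t nbursts = 0, i = 0;
    while (i < n) {
        burst_t b = { .start = addr[i], .len = 1 };
        /* Extend the burst while the increment pattern holds and the
         * burst limit has not been reached. */
        while (i + b.len < n &&
               b.len < MAX_BURST_LEN &&
               addr[i + b.len] == b.start + (uint64_t)b.len * word_bytes)
            b.len++;
        out[nbursts++] = b;
        i += b.len;
    }
    return nbursts;
}
```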

3.5 Data Stream Switch (DSS) and Core Management Unit (CMU)

Since all DFCs operate independently of one another, separate instruction memories are required for each. To avoid the potential resource under-utilization that would result from using one entire Random Access Memory (RAM) for each unit, the MCPE offers the possibility of sharing a dual-port RAM, if available, between two DFCs. In addition, because the program code for each may have fundamentally different sizes, it is possible to dynamically adjust the partition of

the memory space between the two. This procedure may be performed statically, i.e., included in the hardware implementation, or dynamically and in real time, directly from the Host machine. The latter functionality is supported by the combination of the Data Stream Switch and the Core Management Unit (see Fig. 3.1).

The Data Stream Switch is a simple switching element that is able to route the data stream coming from the Host to three different targets: the shared memory, the high-speed backplane and the array of instruction memories. While the first is suited for situations where a large amount of data is to be transferred to the MCPE for further processing, the second makes it possible to stream data directly to one of the kernels, thus avoiding any intermediate buffering. This mode of operation is particularly suited for applications that do not exploit fine-grained access patterns and data re-use. Finally, the last target is used for programming the various instruction memories, with the help of the Core Management Unit, which defines the RAM partitions and provides feedback to the Host on the success of the programming operations.

The Core Management Unit, which is controlled by the Host through a set of hardware registers (see Tab. A.5 and A.6), can be used to interact with the kernels on the MCPE. It is possible to define interrupt masks and acknowledge the interrupts received on an individual basis. In addition, the kernels can be reset individually and two registers with no previously defined function, UR1 and UR2, are available for satisfying the needs of particular applications.


3.6 Summary

The HotStream framework, more than a platform for the development of stream-based hardware accelerators, is a comprehensive hardware and software solution that handles all the interactions between a Host and the existing accelerators. This is achieved by a modular hardware architecture, consisting of a Host Interface Bridge (HIB) and a Multi-Core Processing Engine (MCPE), supported by the combination of a device driver and the purposely designed HotStream API on the software side. The proposed organization results in a framework that is generic enough to be mapped to virtually any chip technology. In fact, the HIB can be equally implemented by a PCI Express interface, an AXI high-speed bus, or IBM's CoreConnect. The data transfers are initiated from the accelerator side by a DMA engine capable of performing 2D transactions. This is essential to equip the presented framework with one of its most distinctive features: the two-level patterned access to all the data shared between the Host and the accelerator.

While coarse-grained access patterns are made possible by the DMA engine in the HIB, fine-grained pattern description of multiple parallel streams is possible within the MCPE. The Data Fetch Controllers (DFCs), which are composed of a tightly-coupled combination of a custom 16-bit microcontroller and an autonomous address generation unit, parse user-provided pattern descriptions, written in a custom-designed assembly, and generate and manage the data streams of the associated Cores. These units, which share various similarities with conventional descriptor-based DMAs, such as the PPMC proposed in [20], offer significant advantages when data must be accessed with complex and long patterns.

The HotStream API complements the hardware side of the framework by providing a set of four logical groups of methods that give access to all the features of the platform: i) Framework Management; ii) Core Management; iii) Data Management; and iv) Pattern Definition. In addition to the multiple heterogeneous Cores and associated DFCs that the MCPE supports, a high-speed backplane interconnection and a large-capacity shared memory are also available. These two units enable point-to-point, low-latency communication between the multiple cores and the reutilization and rearrangement of data streams, respectively. Finally, two auxiliary units, the Data Stream Switch (DSS) and the Core Management Unit (CMU), provide additional control over most features of the framework directly from the Host.

4 Host Interface Bridge

Contents
4.1 PCI Express Infrastructure 32
4.2 Address Spaces and DMA 33
4.3 2D DMA Transfers 34
4.4 Device Driver and User Interface 37
4.5 Summary 40


At the conceptual level, the HotStream framework does not depend on any particular interface technology for any of its communication links. In particular, the High Speed Interconnect depicted in Fig. 3.1 may be as easily implemented by a serial PCI Express link as by an AXI or CoreConnect bus interface, depending on the target platform. Using a parallel bus interface usually requires the accelerator and the GPP to be co-located on the same SoC, which can be fully implemented as an ASIC or as a combination of hard-wired hardware and reconfigurable fabric, as in the case of the Xilinx Zynq [12]. While this option gives access to a higher communication bandwidth and greatly eases the design, it is not always available. This is especially the case when the Host is a powerful GPP, thus requiring an off-chip interface technology, such as PCI Express, which requires special structures to be present in both communication endpoints. In addition, the logical separation between the Host and the accelerator means that (at least) two address spaces exist, thus requiring more complex software management, including the development of custom device drivers. Therefore, the HIB and accompanying software of the HotStream framework were developed with this more complex case in mind. Nevertheless, SoC-based implementations can still be easily targeted by simplifying the software management and removing the PCIe endpoint altogether.

The use of PCI Express requires a PCIe endpoint bridge to replace the Data Stream Bridge of Fig. 3.1. This bridge must be capable of mapping the address space of the accelerator, supported on a local high-speed bus interface, to the Host address space. In addition, to make a more efficient use of the available bandwidth, it supports multiple lanes operating in full duplex. The following sections describe the elements required to support the efficient bidirectional transmission of data between the Host and the board, as well as the device driver that makes it possible to control the system at a higher abstraction level.

4.1 PCI Express Infrastructure

The PCIe endpoint on the accelerator side communicates with the root complex device on the Host. It is not necessary that both feature the same number of lanes, as the PCI Express specification supports a link negotiation stage, known as link training, which co-operatively determines the maximum width supported by the interconnected pair. In addition, for the accelerator to be properly recognized and configured during the device enumeration performed by the BIOS during the Host computer boot, a suitable set of ID values and Class codes must be set on the PCIe endpoint. These contain information regarding its function and manufacturer and are also fundamental in its identification by the device driver. The device enumeration step is completed by reserving a block on the bus address space of the Host for the PCIe device. The size of this block is determined by the contents of the mandatory register set that the endpoint provides, which must conform with the PCIe specification. Moreover, this set of registers also indicates the available apertures. These correspond to address ranges that are later mapped to the Host memory space by the device driver. Each aperture is

characterized by a starting position, indicated by a Base Address Register (BAR), and a size, and gives access to the address space of the sub-system attached to the PCIe endpoint. Interrupts may be delivered by the accelerator to the Host by making use of Message Signalled Interrupts (MSI). In the past, PCI devices used an individual interrupt line connected directly to the Host. This out-of-band method posed several problems, the most striking being the possibility of the interrupt arriving at the Host before the Transaction Layer Packets (TLPs) of the corresponding transfer had been received. The MSI mechanism eliminates this synchronization problem by utilizing a conventional memory write TLP directed at a reserved address in the Host bus map. In addition, this solution also reduces pin count and greatly improves interoperability between PCIe-based devices.

4.2 Address Spaces and DMA

Gaining access to the address space of the accelerator from the Host is complicated by the multiple levels of address spaces that usually exist in the system (see Fig. 4.1). In fact, while the apertures provided by the PCIe endpoint are mapped to the bus address space, which is the same as the physical address space in architectures which do not possess an Input/Output Memory Management Unit (IOMMU) (x86 architectures fall within this category), user applications run in user space, where a contiguous and very large set of virtual addresses is available. Since this address range is not limited by the capacity of the main system memory, and to enforce address space separation between multiple concurrent processes, there is not a one-to-one correspondence between the physical memory and the virtual memory. In fact, the mapping between virtual pages and the frames in the physical memory is performed by a dedicated hardware unit known as the Memory Management Unit (MMU). Furthermore, the kernel benefits from a special region of virtual memory where the physical memory, or part of it, is linearly mapped, since no address violation prevention between processes is required. This region can be used whenever a contiguous data buffer needs to be allocated. However, the system calls available for allocating such buffers provide only a best-effort service, meaning that there is no guarantee that an arbitrarily-sized buffer can always be allocated in a contiguous region of the physical memory [9]. Therefore, two options exist when transferring a data block to or from the accelerator through the PCI Express interface: i) the user-space application requests the allocation of a kernel buffer that is large enough to contain the data to transfer and maps it to the process space, through suitable system calls; or ii) the buffer is allocated in the process space and then mapped to the physical space, where it will likely span multiple physical frames scattered across the physical memory. While the first scenario is more convenient, and greatly simplifies DMA transfers, it is not always possible to allocate buffers with the desired size, as explained above. In addition, requiring the buffer to be pre-allocated can complicate the adaptation of existing applications to the framework, thus reducing its usability. It becomes clear that the second approach is more flexible, despite requiring a more complex management of the DMA transfers, since multiple smaller blocks



have to be transferred, instead of just one. In order to facilitate this management, a DMA engine with Scatter Gather (SG) capabilities can be employed. Unlike simple DMA implementations, which can only be instructed to perform one data transfer at a time, SG DMA engines read a singly linked list of descriptors, each specifying the starting address and size of the data block to be transferred. This allows the engine to perform a large sequence of data transfers from multiple locations without further CPU intervention. Figure 4.2 depicts the process of mapping a user buffer to the bus address space, through a dedicated call contained in the HotStream API, and the subsequent creation of the corresponding SG descriptors.

Figure 4.1: Address spaces and translation mechanisms on an x86-like architecture


Figure 4.2: Mapping of a user buffer to the bus address space and subsequent creation of the corresponding SG descriptors. The size and number of the physical data chunks can vary considerably according to the size of the original buffer and the state of the physical memory
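For illustration, the descriptor chain sketched in Fig. 4.2 can be pictured with the following C structure. The field names are assumptions made for this sketch; the actual layout is dictated by the descriptor format of the DMA engine:

```c
#include <stdint.h>

/* Illustrative scatter-gather descriptor chain (field names are assumed). */
struct sg_descriptor {
    uint64_t buffer_addr;          /* bus address of this physical chunk  */
    uint32_t length;               /* number of bytes to transfer         */
    uint32_t status;               /* bytes actually transferred, flags   */
    struct sg_descriptor *next;    /* next descriptor in the chain        */
};

/* Build a chain from the physical chunks returned by the mapping step. */
static void build_chain(struct sg_descriptor *desc,
                        const uint64_t *chunk_addr,
                        const uint32_t *chunk_len, int nchunks)
{
    for (int i = 0; i < nchunks; i++) {
        desc[i].buffer_addr = chunk_addr[i];
        desc[i].length      = chunk_len[i];
        desc[i].status      = 0;
        desc[i].next        = (i + 1 < nchunks) ? &desc[i + 1] : 0;
    }
}
```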

4.3 2D DMA Transfers

As discussed in section 3.1, the use of a DMA engine capable of performing 2D memory accesses, i.e., data transactions described by the tuple (OFFSET, HSIZE, STRIDE, VSIZE) can

significantly reduce the total number of descriptors needed to implement several coarse-grained patterns. Unfortunately, this complicates the creation of the SG descriptors even further, as seen in Fig. 4.3. These complications arise because the 2D pattern must be applied to the physical data chunks that result from the mapping of the user-space buffer to the bus address space. Since these chunks can vary in size and starting position, each contiguous block defined by the HSIZE parameter may or may not fit entirely within a physical data chunk. Thus, whenever an HSIZE block exceeds the capacity of the starting physical chunk, an additional SG descriptor is required.


Figure 4.3: Application of a 2D pattern to the mapping of Fig. 4.2

Figure 4.3 represents the worst possible case, as the 2D pattern must be described by 6 SG descriptors, thus providing no descriptor savings relative to a DMA engine with no 2D support. Fortunately, in most real application scenarios, the size of the physical data chunks is such that a considerable portion of the 2D pattern is able to fit within them without further segmentation, as depicted in Fig. 4.4.

The application of the 2D pattern defined by the tuple (OFFSET, HSIZE, STRIDE, VSIZE) to the array of physical memory chunks is performed by the HotStream_2D() API call, which replaces the previous descriptor list with a new one that exploits the specification of 2D transfers. The underlying algorithm for this method is presented in Fig. 4.5. Finally, while the HIB was designed to primarily support coarse-grained data patterns, the specification of sparse patterns based on blocks with a small HSIZE is still possible. However, the nature of the PCI Express link dictates that large data transfers make a more efficient use of the link, as the overheads introduced by the protocol are largely independent of the data transfer size and, as such, become less relevant as it increases. Thus, it is expected that the use of patterns that are too sparse or fine-grained will result in a significant penalty to the achievable throughput of the interconnection. To circumvent this problem, the HotStream API includes the gather function, already introduced in Section 3.1. This call, whose operation is depicted in Fig. 4.6, gathers the various sub-blocks of size HSIZE within the provided user buffer and populates a new user-space buffer linearly with these blocks.



Figure 4.4: Application of a 2D pattern to a more realistic mapping between virtual and physical address space. This example highlights the descriptor savings that are possible by utilizing a DMA with 2D capabilities


Figure 4.5: Flowchart of the algorithm that converts a list of SG DMA descriptors into a list featuring 2D transfers
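A simplified reading of the algorithm behind Fig. 4.5 is sketched below in C. It is a reconstruction for illustration only: it assumes that STRIDE is at least as large as HSIZE, treats a row that straddles a chunk boundary in the simplest possible way, and relies on an assumed helper sg_push(addr, hsize, rows, stride) that emits one 2D descriptor; the actual algorithm handles further corner cases.

```c
#include <stdint.h>

struct chunk { uint64_t addr; uint64_t size; };  /* one physical memory chunk */

/* Assumed helper: emits one 2D descriptor covering `rows` blocks of `hsize`
 * bytes, spaced `stride` bytes apart, starting at bus address `addr`. */
void sg_push(uint64_t addr, uint64_t hsize, uint64_t rows, uint64_t stride);

/* Simplified reconstruction of the conversion in Fig. 4.5 (not exhaustive). */
static void pattern2d(const struct chunk *chunks, int nchunks,
                      uint64_t offset, uint64_t hsize,
                      uint64_t stride, uint64_t vsize)
{
    uint64_t pos  = offset;  /* position of the next row in the virtual buffer  */
    uint64_t base = 0;       /* virtual position where the current chunk starts */

    for (int c = 0; c < nchunks && vsize > 0; c++) {
        const struct chunk *d = &chunks[c];
        uint64_t end = base + d->size;

        /* Rows that fit entirely inside this chunk become one 2D descriptor. */
        uint64_t rows = 0;
        while (rows < vsize && pos + rows * stride + hsize <= end)
            rows++;
        if (rows > 0) {
            sg_push(d->addr + (pos - base), hsize, rows, stride);
            vsize -= rows;
            pos   += rows * stride;
        }

        /* A row straddling the boundary to the next chunk is split into two
         * plain (single-row) descriptors: its tail here, its head in the next
         * chunk (further corner cases omitted in this sketch). */
        if (vsize > 0 && pos < end) {
            uint64_t first = end - pos;
            sg_push(d->addr + (pos - base), first, 1, stride);
            if (c + 1 < nchunks)
                sg_push(chunks[c + 1].addr, hsize - first, 1, stride);
            vsize -= 1;
            pos   += stride;
        }
        base = end;
    }
}
```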

This not only significantly reduces the number of SG descriptors needed to complete the DMA data transfer but also potentially increases the achievable data throughput over the PCI Express link.



Figure 4.6: Operation of the HotStream_gather() function, which gathers the various sub-blocks defined by a 2D pattern and places them linearly in a new user-space buffer

Naturally, this operation must be performed after the initial mapping of the user-provided buffer from the virtual to the physical address space, thus introducing an additional delay in the sequence of steps needed to complete a data transfer from the Host to the accelerator. However, this increased setup time can be effectively hidden by the increased data throughput that can be achieved over the PCIe link, which results in a shorter transfer time. Thus, for sparse/fine-grained data patterns up to a certain break-even point, utilizing the gather operation actually reduces the overall transfer time.
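Functionally, the gather step amounts to the copy loop sketched below (a behavioural model of Fig. 4.6, not the actual HotStream_gather() implementation):

```c
#include <stdint.h>
#include <string.h>

/* Behavioural model of the gather operation: copy VSIZE sub-blocks of
 * HSIZE bytes, spaced STRIDE bytes apart starting at OFFSET, into a new
 * contiguous buffer of VSIZE * HSIZE bytes. */
static void gather_2d(const uint8_t *src, uint8_t *dst,
                      size_t offset, size_t hsize,
                      size_t stride, size_t vsize)
{
    for (size_t v = 0; v < vsize; v++)
        memcpy(dst + v * hsize, src + offset + v * stride, hsize);
}
```

The break-even point mentioned above is reached when the time spent in this copy exceeds the time saved by transferring one large contiguous buffer instead of many small, non-contiguous chunks.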

4.4 Device Driver and User Interface

The HotStream API provides a high-level abstraction for the user to interface with the accelerator and, implicitly, the DMA engine and PCI Express interface. However, this is only made possible by the underlying software layers, which convert these high-level commands into simpler low-level instructions that directly control the configuration registers of both the DMA engine and the PCI Express endpoint. The device driver is the key element within these layers, as it is responsible for: i) device detection and initialization; ii) creation of the corresponding device node; iii) mapping of the PCIe apertures into the bus address space; and iv) providing direct access to the multiple device registers through ioctl() calls. Developing such a driver requires a comprehensive understanding of the APIs provided by the kernel for memory allocation, PCI devices and interrupt handling, which is far beyond the scope of the present thesis. Thus, an open-source and generic driver from the MPRACE framework [28] was employed, greatly accelerating the development process. This framework is an open-source stack, primarily aimed at the development of custom FPGA boards with PCI Express interfaces. In addition to the conventional device driver that provides access to the accelerator through read(), write() and ioctl() system calls, it also features a comprehensive user-space API in C and C++, which abstracts complex procedures, such as the mapping of the PCI Express apertures to user space, the mapping of user-space buffers to the physical address space and the allocation of contiguous kernel buffers. This was used as the starting point for the development of the HotStream API, which utilizes some of the calls provided by the MPRACE API to configure the DMA transfers and to dynamically set up the address translation performed by the PCIe endpoint.

In addition, some modifications were introduced to the kernel driver in order to include support for MSI and Bus Mastering. These modifications, as well as the necessary steps for configuring a data transfer by making use of the MPRACE API, are described in the following subsections.

4.4.1 Modifications to the MPRACE device driver

The two modifications made to the MPRACE device driver aim to provide support for MSI and Bus Mastering. The importance of the former has already been described in Section 4.1, but ultimately depends on whether the kernel of the Host operating system was compiled with support for this type of mechanism (determined by the CONFIG_PCI_MSI configuration parameter in Linux kernels). If this form of interrupt handling is not available, the traditional out-of-band method must be utilized. Bus Mastering refers to the capability of the PCI Express endpoint to take ownership of the Host memory bus and issue read and write requests to the main memory system. This is indispensable for the accelerator to be capable of accessing the Host memory with minimal intervention of the latter, thus leading to higher data transfer throughputs.

Since not every Host supports MSI, the PCIe endpoint must be informed of which interrupt mechanism to use. Since the out-of-band mechanism is selected by default, MSI must be enabled explicitly on the endpoint. This is done by setting a bit on a capability register defined by the PCI Express specification. Such an action must be directly performed by the kernel, as the device driver does not have permissions to do so. Instead, the latter does this indirectly by resorting to the pci_enable_msi() call of the Linux PCI API, after which the associated interrupt line is registered through the well-known request_irq() kernel call. To preserve the compatibility with conventional interrupt methods, an msi_enabled flag was added to the pcidriver_privdata_t structure, which overrides the legacy interrupt routines defined in the driver source code. Bus Mastering capabilities are simply activated by utilizing the pci_set_master() call of the previously mentioned API.
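In kernel code, the two modifications boil down to calls of the standard Linux PCI API, roughly as sketched below (a probe-time excerpt for illustration; the handler, device name and private-data handling are placeholders rather than the actual MPRACE sources):

```c
#include <linux/pci.h>
#include <linux/interrupt.h>

/* Placeholder interrupt handler: acknowledge the device and wake up the
 * process waiting for the end-of-transfer notification. */
static irqreturn_t hib_isr(int irq, void *dev_id)
{
    return IRQ_HANDLED;
}

static int hib_enable_msi_and_mastering(struct pci_dev *pdev, void *privdata)
{
    int err;

    /* Allow the endpoint to issue read/write requests to Host memory. */
    pci_set_master(pdev);

    /* Prefer MSI when available (kernel built with CONFIG_PCI_MSI). */
    if (pci_enable_msi(pdev) == 0) {
        err = request_irq(pdev->irq, hib_isr, 0, "hotstream-hib", privdata);
        /* on success, a flag such as msi_enabled would be set in privdata */
    } else {
        /* Fall back to legacy (out-of-band) interrupts. */
        err = request_irq(pdev->irq, hib_isr, IRQF_SHARED, "hotstream-hib",
                          privdata);
    }
    return err;
}
```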

4.4.2 Configuring a data transfer

While the MPRACE API greatly simplifies the task of configuring data transfers to and from the accelerator, the existence of independent address spaces and a DMA engine requires a number of steps to be taken before the data begins to flow in either or both directions, even if no 2D patterns are applied. After the device file is opened, which maps the PCIe endpoint configuration registers to the device space, the pd_mapBAR() call of the MPRACE API is used to map one of the PCIe apertures to user-space. This gives immediate access to the address space of the accelerator and makes it possible to access the configuration registers of the various peripherals whose address ranges are contained in this address space. After the user has provided a pointer to a data buffer, the pd_mapUserMemory() call locks it into physical memory and maps it to device space.


The resulting Scatter-Gather (SG) list indicates the location and size of each of the physical memory chunks in the Host memory, as described in Section 4.2. Since the PCIe endpoint is only able to access a limited memory range, which is determined by the size of the configured aperture, it is important that this group of data chunks fits within this addressable range. Once this is verified by the software, a base address that satisfies this condition is calculated and written to a configuration register in the PCIe endpoint, which performs the address conversion from the accelerator to the Host. Thus, whenever a peripheral within the accelerator issues a write or read request to a memory position of the aperture defined by the PCIe endpoint, it is propagated to the Host with an additional offset, the aforementioned base address. Naturally, this also means that the SG list that is parsed by the DMA engine must utilize relative addresses and not absolute Host memory addresses, which is accomplished by subtracting the same base address from each entry of the list.

After applying the aforementioned address conversion, the SG descriptors are written to a static memory on the HIB, according to the descriptor structure specified by the DMA engine. Once the complete list has been written, the configuration registers of the DMA engine are populated with the base addresses of the start and tail descriptors, which usually prompts the beginning of the data transfer operation. Once the interrupt that signals the end of the transfer is received by the Host, each SG descriptor on the static memory will hold the number of bytes that were successfully transferred for that particular descriptor. The resulting sum should equal the size of the specified user buffer. Finally, in the case of a data transfer from the accelerator to the Host, the explicit synchronization of the device and user buffers may be required. This task, which is greatly simplified by the pd_syncUserMemory() call, may be necessary to ensure cache coherency when the same user buffer is used for successive data transfers. Such a step may be unnecessary if the Host supports PCI bus snooping.


4.5 Summary

Despite the support for various Host-accelerator interfaces that the HotStream framework provides, implementing an off-chip interface is typically more difficult than resorting to an intra-chip solution, such as AXI or CoreConnect. Thus, the design of the HIB and accompanying device drivers was specifically targeted at a PCI Express connection, as this represents the worst case in terms of design complexity. Adapting the developed hardware/software combination to an intra-chip solution would require significantly less effort.

Independent of the chosen Host-to-accelerator communication technology is the potential involvement of multiple address spaces in all data transfers. In fact, if the Host runs a full-fledged operating system, the existence of physical and virtual address spaces greatly complicates the task of transmitting data buffers to and from the accelerator. In such a scenario, the device driver is responsible for mapping the user-provided memory buffer to the bus device space, which is then accessible from the accelerator. However, while the virtual memory space is laid out linearly, the mapping of a contiguous memory buffer to a physical space often results in a collection of non-correlated, arbitrarily located physical data chunks. Thus, for the efficient transfer of such chunks to the accelerator, a DMA engine with Scatter-Gather (SG) capabilities is required, as it greatly reduces the Host's CPU intervention in the data transfer process.

The use of a DMA engine capable of 2D transfers can greatly reduce the number of descriptors needed to complete a data transfer, in addition to the advantages outlined in Section 3.1. However, the 2D pattern must be applied to the chunks of physical memory that resulted from the initial mapping from the virtual address space to the bus device space, and not to the original, user-provided, contiguous buffer. Such a task is nontrivial and can even result in no descriptor savings, as in the case depicted in Fig. 4.3. However, most real scenarios resemble the situation depicted in Fig. 4.4, where a considerable portion of the 2D pattern is able to fit within a single physical memory chunk, thus resulting in a significant descriptor saving. The task of applying a 2D pattern to a list of physical memory chunks is accomplished by a purposely designed algorithm that takes this initial list and produces a new set of descriptors ready to be parsed by the DMA engine.

The device driver for the PCI Express based accelerator was not developed from scratch. Instead, the open-source and generic driver from the MPRACE framework [28] was used. Modifications to the interrupt handling procedures were introduced and bus mastering capabilities were added, thus enabling the DMA engine to effectively take ownership of the Host memory bus.

5 Framework Prototype

Contents
5.1 AXI Interfaces 42
5.2 HIB Implementation and Performance 43
5.3 Backplane Implementation and Performance 47
5.4 Shared Memory Performance 55
5.5 Summary 56


The proposed HotStream framework represents a generic design that can be easily mapped to different targets, i.e., a reconfigurable device, an ASIC or even a combination of the two, using, for example, a Zynq SoC. However, while the conceptual design and software layers can be developed with no particular target technology in mind, it is not possible to fully characterize the platform and the performance of its various communication links without considering a particular implementation. Moreover, to properly evaluate the proposed platform, the test environment must provide a high-speed interconnection between the accelerator and the Host, such as a PCI Express interface, and a large-capacity off-chip memory providing significant bandwidth. The nature of these requirements means that the prototyping of the framework must be done on an FPGA, which greatly simplifies the process of interfacing with the off-chip memory and the PCI Express interconnection. Thus, the Xilinx VC707 development board was selected, which is powered by a state-of-the-art Virtex 7 FPGA [40], coupled with a high-performance DDR3 memory, offering 512 MB of capacity and a peak bandwidth of 12.8 GB/s. Host connectivity is ensured by a PCI Express interface with 8 lanes, capable of supporting Gen2 speeds. On the Host side, a powerful Intel Core i7 3770K processor clocked at 3.5 GHz is at the heart of a machine with 16 GB of DDR3 memory running at 1.866 GHz. These components, in addition to the VC707 development board, are hosted by an Asus P8Z77-V LX motherboard.

The following sections present a thorough characterization of the communication channels that compose the HotStream framework when mapped to the aforementioned development board, as well as an evaluation in terms of performance and area of key elements of the framework, such as the DFCs. Finally, a case study based on the multiplication of very large matrices is presented, which aims to highlight not only the capability of the framework to support multiple concurrent streams being consumed and produced by various heterogeneous kernels, but also to give an insight into the performance gains that can be expected when using the proposed HotStream framework instead of other conventional accelerator architectures.

5.1 AXI Interfaces

Given the choice of a Xilinx development board for prototyping the framework and, since most of the IP Cores available for its devices make use of the AMBA AXI family of interfaces [2], this was selected as the interconnection solution to be used across the HotStream framework. The multiple interface variants contained in this specification allow for a tight correspondence between the requirements of each interconnection and the capabilities of the corresponding interface. Thus, for high-performance links within the system, such as the interface between the HIB and the MCPE and the access to the shared memory, the AXI4 protocol was used, as it provides bi-directional data transfers with burst support of up to 256 data beats. For register-mapped interfaces, on the other hand, the simpler AXI4-Lite protocol, with no support for bursting, was preferred in order to keep hardware resource usage to a minimum. Finally, in order to ease the development of accelerator solutions based on the HotStream

framework, the streaming cores must comply with the AXI4-Stream protocol. This protocol is targeted at low-resource, high-bandwidth unidirectional data transfers. Flow control is implemented through two signals, TVALID and TREADY, and support for bursts of arbitrary length is ensured by a TLAST signal. Additional control signals exist, such as null-beat indicators and routing information, but these are optional and depend on the particular requirements of the application at hand. The protocol consists of completely symmetric master and slave interfaces that can be connected directly. Moreover, no restriction to the width of the data channel exists, which further increases the flexibility of the solution.

It is important to note that, while the prototype version of the proposed framework was designed to host cores with one or more AXI4-Stream interfaces, other stream-based interfaces can be easily adapted, provided they utilize the same two-way flow control mechanism, also known as handshake, depicted in Fig. 5.1. This is the case, for instance, with the Avalon Stream interface, widely adopted in Altera-based designs.

Figure 5.1: Basic handshake principle utilized by the AXI4-Stream and other similar stream-based protocols. Retrieved from [2]

5.2 HIB Implementation and Performance

The two main components of the HIB, the DSB and the DMA engine, were both implemented by off-the-shelf Xilinx IP Cores. The DSB was implemented by the AXI Bridge for PCI Express [39], which handles the low-level interactions with the hard silicon PCI Express endpoint available in the VC707 development board and provides a convenient AXI4 interface, allowing easy integration with the remaining components of the framework. Similarly, the DMA Engine was implemented by the AXI DMA engine [5], which meets all the requirements outlined for this unit during the framework description, such as Scatter-Gather capabilities and 2D data accesses. Even with off-the-shelf components, making efficient use of the available bandwidth on the PCI Express link is far from trivial and several factors have to be taken into account. These range from the inevitable protocol overhead introduced by the communication mechanism, based on Transaction Layer Packets (TLPs), to the symbol encoding used to reduce transmission errors on the physical layer and the remaining system components [17]. However, since these factors greatly vary with the characteristics of the transmitted data, the quoted figures for PCI Express performance usually refer to the raw aggregate bandwidth that can be sustained by both directions simultaneously (Fig. 5.2). Nevertheless, if the characteristics of the PCI Express protocol are taken

into consideration, the actual achievable throughput can be reduced by as much as 20% in some situations, as determined by Goldhammer and Ayer in [17] (see dashed line in Fig. 5.2).


Figure 5.2: Aggregate throughput of various PCI Express configurations. The dashed line accounts for protocol overhead as per [17]

While the hard silicon PCI Express controller on the VC707 development board supports lane widths of up to 8× at Gen1 and Gen2 speeds, the AXI Bridge for PCI Express IP Core is more limited and only supports a subset of the possible configurations. In addition, the width of the AXI interface on the accelerator side is also a key parameter as it may impose a bottleneck for data transfer performance. This happens if the throughput on the accelerator side does not at least equal the aggregate throughput achieved by the PCI Express link. Table 5.1 summarizes the configurations supported by this IP Core.

Table 5.1: PCI Express Gen1 and Gen2 support on the AXI Bridge for PCI Express IP Core

No. of Lanes    AXI Data Width    Gen1    Gen2
x1              64                Yes     Yes
x2              64                Yes     Yes
x4              64                Yes     No
x4              128               Yes     Yes
x8              128               Yes     No

At this point, it is important to note that the PCI Express bridge and DMA engine cannot operate at frequencies higher than 100 MHz. This constraint arises from the need to clock the add-in card with the central motherboard clock, in order to ensure the accurate frequency-lock (as discussed in Xilinx’s Answer Record AR# 18329). Thus, the highest bandwidth the AXI interface can provide is 1.6 GB/s (16B × 100 MHz) in each direction, given that the AXI4 protocol provides independent read and write channels. Again, this does not account for arbitration latencies nor idle periods between successive bursts. This creates a throughput ceiling that effectively limits the achievable data transfer performance to 3.2 GB/s, meaning that both the x8 @ Gen1 and x4 @ Gen2 configurations cannot be fully exploited and will yield the same results. Given that

the Gen2 implementation of the PCI Express bridge uses slightly more resources than Gen1, the x8 @ Gen1 combination was preferred.

The HIB was tested by setting up back-to-back transfers with varying buffer sizes and measuring the elapsed time between writing the address of the tail descriptor to the AXI DMA registers and the moment the interrupt signalling the end of the receive operation was received. For each buffer size, the procedure was repeated 100 times and the minimum elapsed time registered, in order to minimize as much as possible the impact of context switching and other non-deterministic behaviour on the host machine. In order to guarantee enough precision on the interval measurement, the gettimeofday() Linux call was employed, which provides resolutions as high as 1 µs. The results from this experiment are represented in Fig. 5.3.
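The measurement procedure can be summarised by the minimal sketch below. It is not the actual driver code (the measurements described above rely on gettimeofday()); the transfer primitives are hypothetical placeholders, and the aggregate figure is assumed to count both directions of the back-to-back transfer.

```python
# Minimal sketch of the min-of-repeats timing methodology (not the thesis code).
# start_transfer() and wait_for_interrupt() are hypothetical placeholders for the
# tail-descriptor write to the AXI DMA and the receive-completion interrupt.
import time

def measure_aggregate_throughput(start_transfer, wait_for_interrupt,
                                 chunk_bytes, repeats=100):
    best = float("inf")
    for _ in range(repeats):
        t0 = time.monotonic()              # the real setup uses gettimeofday() (~1 us)
        start_transfer(chunk_bytes)        # kick off the send + receive transfer
        wait_for_interrupt()               # block until the receive interrupt arrives
        best = min(best, time.monotonic() - t0)
    # back-to-back transfer: the chunk travels in both directions
    return 2 * chunk_bytes / best          # aggregate bytes per second
```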


Figure 5.3: Measured aggregate throughput for back-to-back transfers with varying buffer size

As expected, the achievable throughput rises with the increase in buffer size, since this masks the various overheads present in the system. From 256 KB onwards, the transfer rate saturates at around 2.3 GB/s, 72% of the theoretical limit of the AXI interface, which may be justified by the aforementioned arbitration phases and the latency between bursts, among other factors. Nevertheless, the reduced performance for smaller sizes can only be attributed to either the PCI Express bridge or the DMA engine. This can be confirmed by attaching a Chipscope monitor to the PCI Express bridge AXI Slave port, where the read and write requests from the DMA engine are received. Figure 5.4 depicts two waveforms obtained with the Chipscope Analyzer software, which correspond, respectively, to the AXI transaction of a 4 KB buffer from the PCI Express bridge to the DMA engine and vice-versa. By taking note of the elapsed time between the start and end of the successive bursts, a send throughput of 870 MB/s and a receive throughput of 1097.26 MB/s can be calculated, resulting in an aggregate bandwidth of 1967.26 MB/s. This is significantly larger than the 53.51 MB/s obtained experimentally (Fig. 5.3), which confirms that the transfer time is being limited by one of the two structures of the HIB and not by the interconnection.


(a) AXI transaction from the PCI Express bridge to the DMA engine

(b) AXI transaction from the DMA engine to the PCI Express bridge

Figure 5.4: Chipscope waveforms obtained during a back-to-back transfer of 4 KB

It should be noted that the numbers presented above do not include the time required for the HotStream API to set up both transfers, which inevitably reduces the effective throughput. Because of this, the methods that handle this task within the API were carefully optimised so as to reduce unnecessary operations and branching. Nevertheless, this time can be significant, mainly when dealing with the transfer of small data chunks, and can be decomposed into two components: i) compulsory operations, which result in a constant execution time; and ii) a variable component, which depends on the number of physical chunks that result from mapping the user-provided buffer to physical memory. The time elapsed during the configuration of both the send and receive transactions for a varying buffer size is depicted in Fig. 5.5. By including this initial latency in the numbers of Fig. 5.3, it can be observed that, as expected, the performance drop is more pronounced when smaller chunks are transferred (Fig. 5.6).


Figure 5.5: Time elapsed during the configuration of the send and receive transactions



Figure 5.6: Aggregate throughput for a back-to-back transfer including the time taken for the transaction set-up

Again, the results presented in this section refer to an x8 @ Gen1 PCI Express link and an AXI interface with a data width of 128 bits. Similarly, the AXI DMA engine, instantiated with 2D data access support, also utilizes 128-bit interfaces, in order to avoid bottlenecks due to bandwidth mismatches. It should be noted that, during these experiments, the support for multiple streaming destinations was not activated, as it significantly reduces the data copying performance, since it is incompatible with descriptor queueing [5]. Descriptor queueing essentially refers to a double-buffering mechanism which automatically fetches the subsequent SG descriptors in a chain, while the current ones are being processed, and stores them in a FIFO. However, in order to fully comply with the HotStream framework specification, the HIB must be capable of streaming data equally to the shared memory, to any individual core, or to one of the associated instruction memories. This requires the support for multiple streaming destinations to be enabled and, therefore, a significant performance drop relative to the results presented in this section should be expected.

5.3 Backplane Implementation and Performance

As discussed in Sec. 2.4 and Sec. 2.3, the high-speed, high-connectivity backplane can be implemented either by a NoC or by a more conventional Crossbar-based solution. Following the survey of publicly available NoC implementations, the Hermes network [30] was selected as the best balance between performance and resource usage. On the opposite side of the comparison, and given the adoption of the AXI family of interfaces for prototyping the HotStream framework, the AXI-Stream Interconnect [38] was naturally selected as the Crossbar against which the NoC is compared. The following subsections briefly discuss the main characteristics of both the Hermes NoC and the AXI Stream Interconnect Crossbar to determine which leads to the best implementation of

the high-speed backplane for the specific case of an FPGA-based design (the trade-offs involved when targeting ASICs, for instance, are fundamentally different [36]).

5.3.1 Hermes NoC

The central element of the Hermes NoC is the Hermes Switch [30]. It encompasses five bi-directional ports, four of which are used to establish connections to the neighbouring switches, while the fifth ensures the communication with the local IP Core. The first four ports can be expanded through the use of virtual channels, which have the ability to reduce traffic congestion at the cost of an increased resource usage. After the round-robin based arbitration step is complete, the XY routing algorithm is used to connect the input port to the correct output port. Since wormhole switching is utilized, the routing decision is performed upon the reception of the header flit, while the second flit indicates the size of the payload. After all payload flits have been routed, the input to output mapping is marked as free. The network structure is of the 2D Mesh type, meaning that the outer switches only possess three ports instead of five. The selection of this type of arrangement is justified by the easier placement and simpler routing algorithm. Thus, different switches have different peak performances, depending on their location within the mesh. In fact, while inner switches can theoretically maintain five simultaneous connections, outer switches see this number reduced to three. Since each flit takes two clock cycles to be sent, the aggregate peak bandwidth of a network with N × N nodes is given by:

PeakThroughput = \sum_{i=1}^{N \times 4 - 4} 3 \times flit_{width} \times \frac{f}{2} \; + \; \sum_{i=1}^{N(N-4)+4} 5 \times flit_{width} \times \frac{f}{2}    (5.1)

For a 3×3 network with 32-bit flits and a frequency of 100 MHz, the peak throughput is thus 46,400 Mbit/s or 5,800 MB/s. According to the authors, the minimal latency, i.e., in the absence of network contention, to transfer a packet from a source to a target switch can be expressed in clock cycles as:

MinLatency = \left( \sum_{i=1}^{n} R_i \right) + P \times 2,    (5.2)

where n is the number of switches in the communication path, otherwise known as hops, R_i is the execution time of the routing algorithm at each switch (at least 10 clock cycles) and P is the packet size, which is multiplied by 2, since a single flit is sent in two clock cycles.
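As a quick sanity check of Eq. 5.1 and Eq. 5.2, the following sketch evaluates both expressions; the switch and port counts are the ones derived in the text, while the helper names are illustrative only.

```python
# Illustrative evaluation of Eq. 5.1 and Eq. 5.2 (helper names are not from the thesis).

def peak_throughput_mbit(n, flit_width_bits, freq_mhz):
    """Aggregate peak throughput of an n x n Hermes mesh, in Mbit/s (Eq. 5.1)."""
    outer_ports = 3 * (n * 4 - 4)              # border switches expose 3 usable ports each
    inner_ports = 5 * (n * (n - 4) + 4)        # (n-2)^2 inner switches expose 5 ports each
    per_port = flit_width_bits * freq_mhz / 2  # one flit is sent every two clock cycles
    return (outer_ports + inner_ports) * per_port

def min_latency_cycles(routing_cycles_per_hop, packet_flits):
    """Contention-free source-to-target latency, in clock cycles (Eq. 5.2)."""
    return sum(routing_cycles_per_hop) + packet_flits * 2

print(peak_throughput_mbit(3, 32, 100))   # 46400.0 Mbit/s (5,800 MB/s), as in the text
print(min_latency_cycles([10], 1))        # 12 cycles: one 10-cycle hop plus a 1-flit payload
```

Re-evaluating the same expression for a 2 × 2 mesh at 200 MHz reproduces the 4,800 MB/s figure used later in section 5.3.4.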

5.3.1.A Modified packet structure

As proposed by the authors, the Hermes Switch requires the number of flits in a packet to be known before a transfer is initiated. While this is not a problem if the payloads are fixed in size, if the size of the messages varies over time, it is necessary to buffer an entire packet before populating the size flit. This naturally increases both the hardware resource utilization and the

latency of the network. To circumvent this limitation, an end-of-packet (EOP) bit was added to each flit, so that the Switch can determine when to terminate the ongoing connection.

One drawback of this solution is that the routing information overhead is no longer fixed but, instead, dependent on the payload size. For instance, and considering 32-bit flits, the overhead of the proposed modification is smaller than that of the original implementation up to a payload size of 32 flits, after which it continues to increase while the original remains fixed.
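Under the assumption that the only difference in overhead is the removed size flit versus the added per-flit EOP bit, the break-even point follows directly:

Overhead_{EOP}(P) = P \times 1\,\mathrm{bit}, \qquad Overhead_{orig} = flit_{width} = 32\,\mathrm{bit} \;\Rightarrow\; Overhead_{EOP}(P) \le Overhead_{orig} \iff P \le 32\ \mathrm{flits}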

5.3.2 AXI Stream Interconnect

The AXI4-Stream protocol is part of the AXI4 specification and is targeted at low-resource, high-bandwidth unidirectional data transfers. Flow control is implemented through two signals only, TVALID and TREADY, and support for bursts of undetermined length is ensured by a TLAST signal.

The AXI Stream Interconnect is an IP core developed by Xilinx to be used with its standard design tools. Its main functionality is to provide an efficient routing mechanism to allow the com- munication between multiple AXI4-Stream masters and slaves. In addition, it includes a collection of modules that further improve the IP functionality, such as bus width and clock conversion, pipelining and data buffering.

At the heart of the AXI Stream Interconnect is an arbitrated Crossbar, capable of interconnecting up to 16 masters and slaves with varying degrees of connectivity, i.e., a programmable connectivity map lets the user specify full or sparse Crossbar connectivity. The arbitration can either be round-robin or priority-based, with statically assigned priorities. In addition, it is possible to define when the arbitration mechanism is applied: at TLAST boundaries, after a set number of transfers and/or after a certain number of idle cycles.

Naturally, as the degree of connectivity of the Crossbar is increased, its area, throughput and latency will be affected. As such, combinations of N masters × 1 slave and 1 master × N slaves are preferable to M×N interconnects. The relevant user guide even goes so far as to suggest that, when M×N interconnects are absolutely necessary, the number of endpoints should be kept low or sparse connectivity should be specified [38].

The latency involved in each transfer depends on the particular configuration of the IP for each interface. The Crossbar switch itself inserts 2 clock cycles of latency, but the addition of a register slice or FIFO buffer adds an additional 1 and 3 clock cycles of latency, respectively.

On the other hand, the throughput of a datapath through the interconnect depends only on the data width and clock frequency used along the path. Therefore, its maximum throughput is limited by the slowest component. The peak aggregate throughput will be simply given by the peak throughput between each master-slave pair, multiplied by the number of pairs that can communicate simultaneously.


5.3.3 Backplane Performance Evaluation

While the performance of a bus-based design can be easily evaluated by understanding its internal architecture, arbitration algorithm and overall latencies, determining the realistically achievable throughput and latency of a Network-on-Chip is considerably more complicated, as it greatly depends on the nature of the traffic it is subject to, even more so than on the parameters of the network itself [13]. Moreover, even if a complete analytical description of the traffic behaviour were performed, it would result in a very complicated analysis, as the routing and arbitration decisions at each node must be taken into account, which becomes increasingly difficult as the size of the network increases. Therefore, it is common practice to evaluate the performance of such networks by resorting to software models or even to behavioral simulation of the actual Register Transfer Level (RTL) description of the circuit. In order to perform the Backplane evaluation, a behavioural description of a core emulator was developed to easily simulate the network under various real-use conditions. Performance measures were achieved by building a VHDL testbench that logs every event across the network, along with its timestamp, to a text file. This text file is then parsed by a Python script which calculates the latency incurred by each packet transfer and presents a final summary of the overall latency and throughput in the delivery of the injected data to the various nodes. According to [13], these are the most important performance metrics of an interconnection network. The following sections describe in greater detail the various features of the core emulator and the structure of the testbench utilized to monitor the network activity.

5.3.3.A Core Emulator and Stream Wrapper

In order to keep the core emulator as generic as possible, thus facilitating its reuse in other circumstances, it was designed to have an interface that is fully compatible with the AXI-Stream specification. In addition, extensive use of generic parameters was made to allow for the configuration of basic parameters, such as the width of the data ports, but also of emulation parameters, described in greater detail below. The module permits two distinct operation modes: pipelined and non-pipelined. The first mode is considerably simpler and reduces the emulator to a one-slot FIFO, which starts by receiving one data beat and immediately tries to output it in the next clock cycle. Meanwhile, no further data is accepted. This is in contrast with the non-pipelined mode, which includes a non-zero latency before sending or receiving new data, effectively simulating a non-pipelined computation. Most of the remaining configuration parameters of the core emulator are based on this mode of operation and allow defining whether the core starts in a sending or receiving state and the waiting period between these two states. It is also possible to configure the core in half or full duplex, i.e., whether data is sent and received in an independent and simultaneous manner or, on the other hand, whether the two phases are serialized. Regardless of the configuration chosen, the module behaves in a cyclical manner with a period that is also user-defined. Given that the core emulator was designed with a conventional stream interface in mind, a

wrapper is needed to make it compatible with the Hermes IP core interface. Fortunately, by selecting the credit-based flow control mechanism for the NoC, the wrapper's task is greatly simplified, as this is quite similar to the stream interface used by the emulator. Having said that, its job is reduced to adding the header flit, indicating the destination of the payload that follows, and asserting the end-of-packet bit on the last flit of the payload. The number of flits per packet is determined by the burst size parameter.
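The wrapper behaviour reduces to the packetisation step sketched below; the flit layout (destination carried in the header flit, EOP modelled as a side-band bit) is assumed for illustration and does not reproduce the exact Hermes field encoding.

```python
# Illustrative packetisation performed by the stream wrapper (assumed flit layout).
def wrap_burst(dest_node, payload_flits, flit_width=32):
    eop = 1 << flit_width                    # EOP modelled as one extra bit per flit
    packet = [dest_node]                     # header flit: routing (XY) destination
    for i, flit in enumerate(payload_flits):
        last = i == len(payload_flits) - 1
        packet.append(flit | (eop if last else 0))
    return packet

# A 4-flit burst destined to node 3 becomes a 5-flit packet (header + payload).
print(len(wrap_burst(3, [0xA, 0xB, 0xC, 0xD])))   # 5
```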

5.3.3.B Testbench and Python script

In order to monitor the activity over the network, a small set of signals was routed from the Hermes network to the output of the unit under test (UUT), which consisted of the network with all the emulators properly attached. These signals were then evaluated by the testbench, which registered an entry in a text log whenever an event occurred, i.e., a payload flit was sent or received by any of the attached cores. In order to allow the computation of latency data, the core emulators were modified to include a timestamp and the number of the node currently emitting a flit in the generated payload. For each entry in the log file, the Python script compares the current time to the timestamp carried by the flit, thus computing the average latency over all transactions. The average throughput of the data delivered to the multiple cores is obtained by dividing the number of received bytes by the time taken to complete all transactions. This is then compared to the throughput generated by all the core emulators combined.
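The post-processing step can be summarised by the sketch below. The actual script is not reproduced in the text, so the log format (one event per line with the event time, the source node and the injection timestamp, all in clock cycles) is an assumption used only for illustration.

```python
# Minimal sketch of the log post-processing (assumed log format, not the original script).
def summarise(log_path, bytes_per_flit=4):
    latencies, received_bytes, first, last = [], 0, None, None
    with open(log_path) as log:
        for line in log:
            event_time, _node, injected_time = (int(x) for x in line.split())
            latencies.append(event_time - injected_time)   # cycles spent in the network
            received_bytes += bytes_per_flit
            first = event_time if first is None else first
            last = event_time
    avg_latency = sum(latencies) / len(latencies)          # average latency [c.c.]
    throughput = received_bytes / (last - first)           # delivered bytes per clock cycle
    return avg_latency, throughput
```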

5.3.3.C Results

For simulation purposes, the Hermes NoC was configured as a 2 × 2 mesh with 32-bit-wide flits and 32-flit-long buffers at each of the two Virtual Channels available on each switch. The performance tests performed on the network mainly focused on two key aspects: the effect of the size of the payload of each packet on the overall throughput, and the way in which the traversal of more or fewer routers affects the average latency. To accomplish these objectives, the test cases utilize worst and best-case routing, as well as a variable burst length, from 8 to 1000 flits per packet. Worst-case routing corresponds to the situation where each node is instructed to send traffic to the node that sits diagonally across from it, as this leads to the largest number of hops until the destination is reached. On the other hand, in the best-case routing scenario, each core sends information to its neighbour, reducing the number of hops to the minimum possible value. When best-case routing is selected, nodes 0 and 1 first send 5,000 data flits to nodes 2 and 3, respectively, after which their roles are reversed. This scenario is depicted in Fig. 5.7a. When using worst-case routing, Fig. 5.7b, the procedure is analogous to the previous one, but now nodes 0 and 1 start by sending packets to nodes 3 and 2 which, as a consequence of XY routing, forces the packets to traverse the longest possible path, which, in this simple case of a 2×2 network, corresponds to one intermediate hop. It is worth noting that, while nodes 0 and 1 are simultaneously sending and receiving data, the performance of the network will not be affected, since each router contains independent send and receive channels on each of its virtual channels.


(a) Best-case routing (b) Worst-case routing

Figure 5.7: Traffic patterns for the NoC simulation

The results presented in Fig. 5.8 show that, as the burst size increases, so does the throughput of the data delivered to the IP Cores, rising asymptotically to 8 bytes per clock cycle, which corresponds to the amount of data the two active cores inject per clock cycle. This is an expected result because, as the burst size increases, the impact of the overhead introduced by the routing decision is reduced, since wormhole switching is employed. It can be concluded that IP Cores with controller-like behaviour, occasionally issuing small data bursts, will suffer the largest penalty. Likewise, and for the exact same reason, the latency decreases when increasing the burst size, as depicted in Fig. 5.9, down to a minimum of 10 clock cycles, which corresponds to the latency required to traverse a single router, as described in section 5.3.1.


Figure 5.8: Data delivery throughput in various traffic configurations. In every one, the input throughput is reached asymptotically

Finally, and maintaining the conditions of the previous experiment, the core emulators were configured to work in full-duplex mode. This results in all 4 nodes injecting 4 bytes per cycle simultaneously, leading to an aggregate injected bandwidth of 16 bytes per cycle. Again, the existence of independent send and receive interfaces in each router results in the same behaviour as in the previous two cases with regard to throughput as well as latency, as depicted by the dotted lines with triangular markers in Fig. 5.9 and Fig. 5.8.



Figure 5.9: Source to destination latency when using best case or worst case routing. Duplicating the in- serted data throughput does not affect latency

5.3.4 Crossbar and NoC Comparative Evaluation

The rationale behind the adoption of Networks-on-Chip for very large SoCs is mostly the increasing cost, both in terms of area and communication delay, of the interconnection "wires". This is especially true for ASICs [18], but FPGA designs are also increasingly constrained by this issue, as new technology nodes are making transistors smaller and faster, while "wires" get comparatively slower [27]. In this section, the NoC and Crossbar solutions discussed above are compared to determine whether the former is advantageous for FPGA-based designs. Table 5.2 summarizes the resource utilization of the Hermes NoC using the configuration described in section 5.3.3, i.e., a 2 × 2 mesh with 4 byte-wide flits and 32-flit-long buffers with 2 Virtual Channels per port. In the same table, the hardware requirements for interconnecting a varying number of cores with the AXI Stream Interconnect Crossbar are presented. To guarantee a fair comparison between the two, the Crossbar was configured with a 32-bit datapath, round-robin arbitration and a data FIFO with a depth of 32 elements in each interface, which is analogous to the flit buffer in the Hermes NoC. Moreover, full connectivity was enabled in the switching element, so that any master interface can establish a connection with any other slave port.

Table 5.2: Hardware utilization of the Hermes NoC configured in a 2×2 mesh and the AXI Stream Intercon- nect Crossbar for a varying number of independent cores

Resources    Available   Hermes NoC    AXI Stream Interconnect Crossbar
                         2 × 2 Mesh    4 Cores    8 Cores    16 Cores    4 Cores (@ 128 bit)
Regs         607,200     15,935        1,773      4,584      8,656       5,229
LUTs         303,600     7,886         1,125      2,981      8,882       2,448
Slices       75,900      5,587         849        1,969      4,875       1,989
Max. Freq.   -           200 MHz       144 MHz    133 MHz    133 MHz     146 MHz

The results presented in table 5.2 clearly show a significant gap between the Crossbar and


NoC solutions in terms of hardware requirements. In fact, for the same number of interconnected cores, the Crossbar requires 6.6× fewer slices than its NoC counterpart. As expected, the Hermes NoC offers better performance in terms of attainable clock frequency, which confirms the reductions in the average length of the interconnecting "wires", but the hardware overhead introduced by its multiple switches means that its scalability is very limited in FPGA designs. Moreover, the smaller area footprint of the Crossbar solution leaves room for increasing the interconnection width, thus potentially increasing the peak throughput, while still keeping the resource utilization lower than that of the equivalent NoC configuration.

For equivalent "wire" widths, however, the peak throughput of the NoC will be superior to that of the Crossbar due to the higher clock frequency. In fact, by replacing the variables in equation 5.1, derived in section 5.3.1, with a frequency of 200 MHz and a 2×2 mesh, a peak throughput of 4,800 MB/s is obtained. On the other hand, at a clock frequency of 144 MHz, the Crossbar yields a peak throughput of 2,304 MB/s.

Since the Crossbar is never subject to traffic contention, its bandwidth does not vary with traffic behaviour as in the NoC. Table 5.3 provides a comparison between the throughput and latency of the Crossbar and NoC for the worst-case routing scenario depicted in Fig. 5.9 and Fig. 5.8. Even for controller-like behaviour, which is characterized by small data bursts, the NoC provides superior performance due to its higher operating frequency, even though the average transmission latency is longer in every situation.

Table 5.3: Throughput and latency of the Hermes NoC and AXI Stream Crossbar when interconnecting 4 Cores, under different traffic conditions

              NoC                                                          Crossbar
Burst Size    Latency (Avg.) [c.c]   Peak [MB/s]   Measured [MB/s]         Latency (Avg.) [c.c]   Peak [MB/s]   Measured [MB/s]
8             63.3                   6,400         4,800                   6                      4,608         4,608
100           40.7                   6,400         6,160                   6                      4,608         4,608
1000          15                     6,400         6,390                   6                      4,608         4,608

In conclusion, the NoC is undeniably more scalable in terms of operating frequency, as the addition of extra nodes does not increase the average length of the interconnections between the switches. However, the gains obtained in performance are overshadowed by the considerable resource utilization figures. Moreover, although NoCs are usually targeted at SoCs with a large number of IP Cores, the analysis above shows that these large numbers, for which NoCs are supposedly the best solution, are simply not feasible in modern FPGA devices, given the large resource overhead introduced by the network switches. Finally, it is interesting to note that, although in ASIC designs the circuit is evaluated in terms of its total area, which comprises both transistors and wires, in FPGA designs resource usage is, generally speaking, only evaluated in terms of occupied LUTs, slices and registers, but not in terms of the number or length of the interconnection resources. This further reduces the attractiveness of FPGA-based NoC implementations.


5.4 Shared Memory Performance

The shared memory access time is another critical factor for the performance of the HotStream framework, since stream re-use within the MCPE is entirely supported by this element. Despite providing a considerable peak throughput of 12.8 GB/s, under normal operating conditions the bandwidth utilization of the off-chip DDR3 memory available in the VC707 development board may be significantly reduced. While the overall read and write latency of the DDR module is dependent on how the memory controller is configured, the most influential factor is the behaviour of the traffic and access patterns (Xilinx Answer Record AR# 45644). Thus, it is not possible to evaluate the performance of the shared memory from a purely theoretical standpoint. Fortunately, the Memory Interface Generator (MIG) tool provided by Xilinx to configure and instantiate memory controllers also generates a timing-accurate simulation model. This model eases the profiling of the memory controller and memory module combination under varying access patterns. By performing simulations with the various access patterns available in the test bench, the read access latency was determined to be between 23 and 30 clock cycles. In continuous operation, i.e., when multiple read requests are issued sequentially, the latency between bursts of 8 data beats is reduced to 5 clock cycles. This corresponds to an overall access efficiency of 8/(8+5) ≈ 0.615 and, on average, a read throughput of 7.87 GB/s. The same procedure, repeated for write accesses, produced similar results, justifying the adoption of the 7.87 GB/s mark as the overall DDR average access throughput. However, it should be noted that these values refer to the user interface provided by the memory controller. In the context of the HotStream framework, an AXI Slave controller is added in order to facilitate multiple arbitrated accesses to the shared memory. This naturally reduces the throughput due to the various overheads of the AXI4 protocol, as well as the latencies introduced by the AXI Slave controller attached to the native memory controller.
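The average figure quoted above follows directly from the simulated burst behaviour:

\eta = \frac{8}{8 + 5} \approx 0.615, \qquad Throughput_{avg} = \eta \times 12.8\ \mathrm{GB/s} \approx 7.87\ \mathrm{GB/s}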

5.4.1 Cycle-Accurate Simulator

In a framework such as the one presented in this thesis, it is important to be able to accurately predict the expected performance of a given application without having to go through the full design, testing and integration cycle. This makes it possible to iteratively adapt a particular design to the framework and to mitigate existing bottlenecks in the early stages of the development process, resulting in a significantly decreased time-to-market and lower engineering effort. The extensive profiling effort presented in this chapter is key to making this possible, as it provides the necessary information to simulate the behaviour of the various communication channels that compose the HotStream framework. To combine all this information into usable performance metrics, a Python-based simulator was developed that takes an arbitrary number of Core configuration files as input and performs a cycle-accurate simulation of all the transactions that take place between the multiple Cores, the backplane, and the shared memory. The simulator is highly parametrizable, as it allows defining

the size of the burst requests that are issued to the shared memory, as well as the arbitration latency for both read and write accesses and the latency incurred between successive burst accesses from the same Core. Each Core configuration file is composed of a sequence of instructions that allow any stream-based Core to be easily emulated, by taking into account its data processing latency, the address generation behaviour of the associated DFC and its relationship with other cores, by issuing synchronization requests. The output of the simulator is a comprehensive set of statistics that indicates the overall shared memory utilization of its read and write channels and the number of unused cycles. Similar information is provided for each emulated Core and complemented with the total number of reads and writes it performed, as well as its period of activity in clock cycles. As it is an event-driven simulator, simulation halts when no more read or write requests are received. The duration of the simulation, in clock cycles, provides an estimate of the execution time of the application after it has been mapped to the HotStream framework.
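The core of such an event-driven simulation can be captured in a few lines. The sketch below is a strong simplification (single shared-memory channel, fixed arbitration latency, no DFC modelling), and the Core interface it assumes, next_request() returning the cycle at which a burst becomes ready and its length in beats, is purely illustrative.

```python
# Highly simplified sketch of the simulator's event loop (illustrative Core interface).
def simulate(cores, arbitration_latency=4):
    memory_free_at, clock = 0, 0
    pending = {core: core.next_request() for core in cores}   # (ready_cycle, beats) or None
    while any(req is not None for req in pending.values()):
        # serve the oldest ready request (stand-in for the real round-robin arbiter)
        core, (ready, beats) = min(
            ((c, r) for c, r in pending.items() if r is not None),
            key=lambda item: item[1][0])
        start = max(ready, memory_free_at) + arbitration_latency
        memory_free_at = start + beats        # one data beat per cycle on the memory bus
        clock = memory_free_at
        pending[core] = core.next_request()
    return clock                              # estimated execution time in clock cycles
```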

5.5 Summary

The prototyping of the framework was done on a Xilinx VC707 development board, powered by a state-of-the-art Virtex 7 FPGA and accompanied by a high-performance DDR3 memory and an 8-lane PCI Express interface capable of Gen2 speeds. The Host machine was equipped with an Intel Core i7 3770K processor at 3.5 GHz and 16 GB of DDR3 memory. Being in a Xilinx environment, the HIB was naturally implemented through the combination of the AXI DMA engine and the AXI Bridge for PCI Express. The subsequent experimental evaluation of the PCI Express connection confirms that the achievable throughput over this type of interface greatly depends on the size of the data chunks being exchanged. Small-sized chunks can lead to aggregate throughputs that are 160× lower than what is possible under ideal conditions. By taking into account the time elapsed during the configuration steps performed by the HotStream Application Programming Interface (API) when a transfer is configured, aggregate throughputs as high as 2.1 GB/s were measured for an x8 Gen1 configuration. For the implementation of the backplane interconnection, two opposing solutions were compared: the Hermes NoC [30] and the AXI Stream Interconnect [38] Crossbar. Taking data delivery throughput and latency as the two main performance metrics, the NoC solution proves to be slightly superior to its counterpart. However, such an advantage is achieved at a hardware-utilization cost that seriously hinders the scalability objectives defined for the HotStream framework. In fact, the obtained results can be extrapolated to most FPGA designs, as the considerable amount of hardware required to implement the switches and routers that form the "wires" in a NoC implementation makes it unattractive for designs based on reconfigurable hardware. For the shared memory, an off-chip DDR3 memory was used. To evaluate this component, a timing-accurate simulation model was used, easing the characterization of the memory access time when different read and write patterns are utilized. While the peak bandwidth of the module is


12.8 GB/s, the various dynamic mechanisms present in this type of device reduce this value to an average access throughput of 7.87 GB/s for the considered read and write patterns. By taking advantage of the results obtained during the characterization of the various communication channels, a Cycle-Accurate Simulator was developed that makes it possible to obtain performance estimates of HotStream-based accelerators early in the development cycle. An arbitrary number of Cores can be emulated, each described by its own configuration file, where its data processing latency, synchronization with other Cores and the address generation rate of the associated DFC are specified. Accesses to the shared memory are also accurately simulated, based on the various latency figures obtained in this chapter, which may be changed by the user to reflect different configurations of the data bus or the shared memory itself.

6 Framework Evaluation

Contents
6.1 General Evaluation ...... 59
6.2 Case Study 1: Matrix Multiplication ...... 63
6.3 Case Study 2: Image processing chain in the frequency domain ...... 69
6.4 Summary ...... 75


The HotStream framework is a comprehensive solution for the development of stream-based hardware accelerators. It aims at handling all the communications between the accelerators and the Host machine, as well as at facilitating the management of the intra-accelerator communications. This effectively allows the hardware designer to focus on the development of highly efficient computation Cores, as the powerful, yet easy-to-use, HotStream API and the programmable DFCs guarantee that the final product will provide the best possible performance. To accomplish this, the HotStream framework is essentially divided into 3 components: i) the software layer; ii) the HIB; and iii) the MCPE. The first two work in conjunction to allow the streaming of data between the accelerator and the Host, while the extensive support for coarse-grained patterns makes it possible to fully exploit the available bandwidth between the two. The third component is where the actual computation takes place, by hosting an arbitrary number of computation Cores that may be interconnected with each other via the high-speed and low-latency Backplane interface. This backplane also allows the cores to access the shared memory through individual DFCs. These DFCs are a key element of the framework, as they provide fine-grained access to the large-capacity shared memory, which confers on the HotStream framework its unique data-reuse capabilities. For this purpose, this unit was custom designed to support arbitrary access patterns, which can be easily programmed using a custom assembly language, without compromising the address generation efficiency or the hardware resources.

6.1 General Evaluation

The following sections provide a comprehensive evaluation of the HotStream framework in two major steps. First, its fine-grained memory access capabilities are tested through a series of commonly used access patterns, where address generation rates and the specific memory requirements for the storage of the pattern descriptions are compared to the most relevant related art, namely the PPMC [20]. However, since the PPMC implementation was not publicly available at the time of this work, the Xilinx AXI DMA engine was used for this purpose, as it features functionalities identical to the PPMC [20]. This DMA engine IP Core was also used in the prototyping framework, to implement the HIB module. The second step demonstrates how the HotStream framework can be used to develop real applications, and the levels of performance that can be expected. This is achieved through two distinct case-studies: i) a block-based multiplication of very large matrices; and ii) a full signal processing chain, where high-resolution images, in the range of 1024 × 1024 to 4096 × 4096 pixels, are filtered in the frequency domain using 2D FFTs.

6.1.1 Resource Overhead

Considering the strong focus on scalability of the proposed streaming framework, it is paramount that its core elements do not significantly impact the overall resource usage. This goal was achieved by designing the DFC with a low area footprint in mind, as this is bound to be the most replicated unit in this framework.


Table 6.1 summarizes the resource utilization of the key elements composing the framework. It is important to note that the DFC is fully configurable with respect to the number of Loopcontrol units that are used. Thus, its resource occupation varies with the chosen configuration. Table 6.1 presents the results for a DFC configuration with 1 to 3 Loopcontrol units. All resource utilization and performance figures refer to the implementation on a XC7VX485T Virtex-7 FPGA.

Table 6.1: Resource usage for each component in the MCPE and HIB (hardware platform: XCV7VX485T Virtex-7 FPGA)

Resources    Available   DFC (1-3 Loopcontrol units) + BMC   Streaming Bus (AXI)   Backplane Interconnect   DSB       DMA
Slices       75,900      1,014 - 1,216                       3,273                 4,875                    5,300     1,548
LUTs         303,600     1,743 - 2,225                       5,305                 8,882                    12,620    3,588
Regs         607,200     1,553 - 2,141                       4,922                 8,656                    9,160     4,128
DSPs         2,800       4                                   0                     0                        0         0
BRAM36       1,030       1                                   0                     0                        0         6
Max. Freq.   -           160 MHz                             167 MHz               146 MHz                  200 MHz   136 MHz

It is important to note that, while the resource utilization of the Backplane Interconnect (implemented in a Crossbar topology) seems rather high, it represents a worst-case scenario, configured to support full connectivity to a maximum of 16 independent nodes. On the other hand, each DFC/BMC pair accounts for only 1.6% of the total resources available in the device, which ensures the targeted scalability goal. While competitive for FPGA-based designs, the maximum operating frequency of the DFC is limited by the simple pipelined nature of the used microcontroller. By adopting a more aggressively optimized architecture, higher processing frequencies can be achieved. This solution was not pursued because, as explained in section 5.2, for an add-in card to work in any environment, its PCI Express interface and DMA engine must be clocked with the 100 MHz clock provided by the motherboard.

In this particular embodiment of the HotStream framework, the DSB, which is implemented by the PCI Express bridge, accounts for the largest fraction of resource usage, at roughly 7% of the available resources of the target FPGA. Naturally, this balance would change if the implementation platform featured other means of communication between the Host processor and the accelerator module, such as the ones available on the Zynq All Programmable SoC.

Regarding the relationship between the DFC and the BMC (as depicted in Fig. 3.4), it should be recalled that a BMC can be shared by two independent DFCs. Incidentally, the BMC is nearly two times larger than each DFC, as depicted in Tab. 6.2. This means that the DFC + BMC column in Tab. 6.1 corresponds to a worst-case scenario, where the full-duplex capabilities of the BMC are not exploited. In the case of a core arrangement where DFCs and BMCs are perfectly paired, the hardware cost of each DFC + ½ BMC is effectively halved. In the grand scheme of the framework, this corresponds to just 0.8% of the total resources available in the device.


Table 6.2: Individual resource usage of the DFCs and BMCs (hardware platform: XCV7VX485T Virtex-7 FPGA)

Resources   Available   DFC         BMC
Slices      75,900      236 - 337   542
LUTs        303,600     371 - 612   1,001
Regs        607,200     390 - 684   773
DSPs        2,800       2           0
BRAM36      1,030       0           0

6.1.2 Stream Generation Efficiency

Given the relation between the complexity and nature of the considered patterns and the resulting address generation rate and size of the pattern descriptor, a proper evaluation of the proposed DFC and of the overall framework can only be achieved through a representative benchmark. Therefore, five distinct patterns of varying complexity will be herein considered: Linear; Tiled; Diagonal; Zig-Zag; and Greek Cross. While the first two are usually found in a wide range of applications, the remaining three are somewhat more exotic in nature. Nevertheless, they are still of great importance in the context of stream computing. As an example, the Diagonal access pattern is extensively used by the Smith-Waterman algorithm for DNA sequence alignment [25]; the Zig-Zag scanning is a key element in the entropy encoding of the AC coefficients in the JPEG and MPEG standards [37]; and the Greek Cross is often used by the vast class of diamond search motion estimation algorithms adopted in video encoding [41]. Figure 6.1 depicts the access patterns being considered, including their size and evolution over time, as well as the pseudo-code for their generation using the API proposed for this framework.

(a) Linear (b) Tiled (c) Diagonal (d) Zig-Zag (e) Greek Cross

Figure 6.1: Access patterns, with varying complexity degrees, adopted for the DFC evaluation
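For reference, the sketch below expresses the Tiled pattern of Fig. 6.1(b) as the nested loops that the DFC's Loopcontrol units realise in hardware; row-major storage and the 512-wide memory block of Table 6.3 are assumed, and the function is a software reference model rather than the DFC assembly itself.

```python
# Software reference model of the Tiled access pattern (not the DFC assembly code).
def tiled_pattern(base, tile_w=128, tile_h=72, row_stride=512):
    for row in range(tile_h):                      # outer loop: rows of the tile
        for col in range(tile_w):                  # inner loop: elements within a row
            yield base + row * row_stride + col    # address produced by the AGC each cycle
```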

The metrics considered for this evaluation are the code size, required to describe each pattern,

and the address generation rate, defined as the average number of addresses generated per clock cycle. As stated above, the AXI DMA engine is used as the baseline for this comparison, representing the characteristics of most descriptor-based pattern generation mechanisms that have been proposed in the related art. Table 6.3 depicts the obtained results.

Table 6.3: Address generation rate and descriptor size of the considered access patterns (the adopted length of each pattern results from the parameterization depicted in Fig. 6.1)

                           DFC                          DMA
Pattern     Length         Code Size    Addr/cycle      Code Size   Addr/cycle
Linear      1024           24           1               32          0.96
Tiled       128×72¹        40           0.99            32          1
Diagonal    1024×1024      44           1               65k         1
Zig-Zag     8×8            48 (132*)    0.36 (0.71*)    480         0.63
Cross       1024×1024      132          0.89            228k        1
* Values obtained after loop unrolling
¹ For a memory block of 512×512

By analyzing the pattern generation results obtained with the proposed DFC, when compared with the traditional descriptor-based data-fetch DMA mechanisms, it can be concluded that the proposed controller achieves a similar address generation rate but with significantly lower code-memory requirements. Moreover, the related state of the art does not offer any form of scalability, as the size of the descriptor increases with the length of the pattern. This is particularly emphasized for the Diagonal and Cross patterns: for a 1024×1024 pattern, the descriptors for the conventional DMA occupy about 1500× more memory to store the access patterns than the proposed DFC approach. For larger matrices this discrepancy will be even greater. As a consequence of the larger code size, the conventional DMA approach would require a significantly larger internal memory and eventually need an external processor to dynamically generate the patterns, which would further increase the required hardware resources and reduce the attainable performance. Nevertheless, certain cases still exist where the execution time of the pattern description code in the DFC cannot be entirely overlapped with the address generation, with a consequent reduction of the attained rate. This is the case of the Zig-Zag access pattern. To circumvent this problem, the loop that sets the AGC parameters for each diagonal can be unrolled, thus improving the address generation rate at the cost of a slightly larger code size. This technique effectively doubles the rate of the Zig-Zag pattern generation (values marked with an * in the table), allowing for an address generation rate above the one reported in the related state of the art, still with smaller memory requirements. Naturally, the actual performance gains that can be achieved by accelerating a given data streaming application with the proposed framework depend not only on the amount of parallelism that can be exploited, but also on the involved computational complexity (i.e., the number of operations performed on a single data element). The latter is especially important, as it effectively defines the amount of data reuse that can take place within the framework. In order to provide an

insight into the speed-up magnitudes that can be expected by utilizing the HotStream framework, the following sections present two case studies. The first case study considers the block-based multiplication of very large matrices, which is able to take full advantage of the advanced data-reuse capabilities of the proposed framework, as well as its inherent support for a high degree of data parallelism. The second case study consists of a full image processing chain in the frequency domain utilizing 2D FFTs, representing a complete and self-contained final product that also highlights several features of the HotStream framework, such as its extensive support for heterogeneity among the computing Cores.

6.2 Case Study 1: Matrix Multiplication

In this section, a block-based matrix multiplication example is used to evaluate and compare the proposed framework with other usual approaches based on hardware accelerators. As a comprehensive data management solution for streaming applications, the proposed framework provides efficient data streaming mechanisms between the Host and the accelerating hardware, as well as extensive data (re-)usage and (pre-)fetching capabilities within the MCPE. The outcome is a significant increase in the attained input/output data bandwidth within each processing core, as well as the consequent maximization of the resulting data processing throughput. This approach dramatically contrasts with traditional implementations, where the Host GPP or a conventional DMA engine centralizes the whole data management, at the detrimental cost of being subjected to the (often rather limited) data bandwidth of the underlying communication interface between the Host and the accelerating hardware. While this block-based matrix multiplication case study allows demonstrating the potential of the proposed framework, additional advantages are expected as the application complexity grows. Block-based matrix multiplication is typically used for improving data locality, and allows the implementation of multiplication operations with matrix sizes much greater than would be possible if the multiplication was performed in one single step. The considered implementation is divided into two steps: i) the multiplication of the sub-blocks; and ii) the accumulation (reduction) step, to compose the final matrix with the computed partial sub-matrices [26]. Equation 6.1 depicts a simple partitioning example of an N×N matrix multiplication operation, by considering sub-blocks of size N/2×N/2.

\begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix} \cdot \begin{bmatrix} B_{11} & B_{12} \\ B_{21} & B_{22} \end{bmatrix} = \begin{bmatrix} A_{11} \cdot B_{11} + A_{12} \cdot B_{21} & A_{11} \cdot B_{12} + A_{12} \cdot B_{22} \\ A_{21} \cdot B_{11} + A_{22} \cdot B_{21} & A_{21} \cdot B_{12} + A_{22} \cdot B_{22} \end{bmatrix}    (6.1)
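A software reference of the two-step scheme, used here only to make the dataflow of Eq. 6.1 concrete, can be written as follows (NumPy is used for the sub-block products; the block size matches the 32×32 sub-blocks supported by the adopted cores).

```python
# Reference model of the block-based multiplication: sub-block products followed by
# their accumulation (reduction) into the final matrix, as expressed by Eq. 6.1.
import numpy as np

def block_matmul(A, B, block=32):
    n = A.shape[0]
    C = np.zeros((n, n), dtype=np.int64)
    for i in range(0, n, block):
        for j in range(0, n, block):
            for k in range(0, n, block):                      # one sub-block product
                C[i:i+block, j:j+block] += (                  # accumulation (reduction) step
                    A[i:i+block, k:k+block] @ B[k:k+block, j:j+block])
    return C
```

For a 4096×4096 multiplication with 32×32 sub-blocks, the innermost loop contributes 128 partial products per output block, which is precisely the 128:1 reduction mentioned below.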

Three different approaches were considered for this evaluation. The first, hereinafter denoted as conventional, simply streams the matrix data over the PCI Express link into the accelerator, where a single matrix multiplication core is consuming the incoming data and producing the results that are streamed back to the Host. The second approach, referred to as conventional+buffering, features an additional memory in the accelerator, which is large enough to buffer one of the input matrices, so that it can be reused over the course of the entire computation. Finally, the third



approach makes use of the HotStream framework, which maximizes the data re-usage by including reduction (accumulation) modules on the MCPE that run concurrently with the multiplication cores (see Figure 6.2), as well as by overlapping the data communication with the computation through double-buffering techniques. It should be noted that, to implement a 4096×4096 matrix multiplication with 32×32 sub-blocks, a 128:1 reduction step is required, in which all the intermediate results from the sub-block multiplications are combined, through simple additions, into the final matrix. This is achieved by using two addition cores: one to perform a 16:1 reduction and the other to perform an 8:1 reduction. The result is a self-contained accelerator that only streams back the final matrix to the Host, contrasting with the former solutions, which completely rely on the Host to perform the reduction. In addition to these three basic implementations, corresponding parallel versions were also considered, by replicating the structure depicted in Fig. 6.2. The level of exploited parallelism is limited either by the available hardware resources or by the data bandwidth capacity of the communication channels.

Figure 6.2: HotStream-based implementation of the block-based multiplication algorithm, consisting of 3 Kernels to process multiple and concurrent data streams, where double buffering is used on the shared memory to overlap communication with computation

6.2.1 Computing Cores

It is important to recall that the main focus of this case study is not on the adopted matrix multiplication cores but instead on the framework itself. Accordingly, it was decided to adopt off-the-shelf Xilinx IP Cores [4] to implement both the matrix multiplication and the accumulation cores. It is worth noting that the same multiplication units are also used in the conventional solutions. These soft cores support matrices of up to 32 × 32 entries of 2 bytes each. While this limits the proposed solution in terms of data width, it is very suitable for evaluating the considered framework in terms of scalability. In order to implement the accumulation cores, A1 and A2, depicted in Fig. 6.2, the off-the-shelf Xilinx IP Cores were adequately arranged in a binary tree, as depicted in Fig. 6.3. Since the input is in serial form, i.e., one matrix entry is received per clock cycle, a convenient set of input buffers was added to the first level of accumulators. The reduction operation only starts when the first level is ready to be processed. Regarding the multiplication core, it uses a dedicated structure that offers a higher degree of data re-use. In fact, given that each 32×32 sub-block of matrix A is involved in a



Figure 6.3: 8:1 binary reduction tree based on Xilinx Matrix Accumulators. The structure of the 16:1 reduc- tion core follows the same architecture but with double the number of basic accumulators

number of multiplications that is equal to the number of columns of matrix B, its re-utilization allows greatly reducing the number of accesses to the shared memory. Figure 6.4 depicts the structure of this multiplication core, where a sub-block from matrix A is stored and re-used during the multiplication with a full line from matrix B. Once a line is finished, the input buffer is flushed and a new matrix A sub-block is pushed in.


Figure 6.4: Internal structure of the multiplication core utilized in the HotStream implementation of the matrix multiplication. Sub-blocks from matrix A are stored and re-used during the computation of a full sub-block line from matrix B

These cores run at a conservative frequency of 100 MHz and output a 2-byte matrix element per clock cycle, after a significant initial latency, which is effectively hidden by the large data set that composes the stream. Therefore, a constant rate of 100 MOps (Million Operations per Second) is maintained by each core. Table 6.4 summarizes the resource utilization of each of the basic cores utilized in the various implementations of the matrix multiplication, as well as of the more complex structures featured in the HotStream version.


Table 6.4: Resource usage of the cores utilized in the various implementations of the 4096×4096 matrix multiplication (hardware platform: XCV7VX485T Virtex-7 FPGA)

Resources   Available   Xilinx Mat. Mult.   Xilinx Mat. Add.   HotStream Mult.   HotStream Acc. Tree 16   HotStream Acc. Tree 8
Slices      75,900      10,588              117                10,650            1,541                    792
LUTs        303,600     9,238               144                12,230            2,141                    1,134
DSPs        2,800       32                  1                  32                15                       7
RAMB36      1,030       2                   0                  3                 17                       9
RAMB18      2,060       64                  0                  64                0                        0

6.2.2 Roofline Model

To evaluate the available design space in terms of the offered performance, the Roofline model [21] was applied to the considered case study, in order to correlate the exploited processing performance with the throughput of the involved communication channels.

Figure 6.5 depicts the peak performance of the conventional implementations, using parallelism levels of 1×, 2× and 4×, which result from using 1, 2 and 4 Multiplication Cores, respectively, to implement the matrix multiplication kernel (using data parallelism). For a 1× parallelism level, the conventional solution is limited by the performance of the Xilinx IP Core that performs the matrix multiplication. By increasing the number of cores to 2 or 4, 2× or 4× parallelism can be achieved, respectively. However, these implementations become limited by the PCIe link, thus resulting in a performance of only 190 MOps, i.e., a speed-up of only 1.9 with 4× parallelism.
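In its basic form, the Roofline model bounds the attainable performance of a given implementation by

    P(OI) = min(P_peak, OI × BW),

where OI is the operational intensity (operations performed per byte transferred) and BW is the sustained bandwidth of the limiting communication channel (in this case, either the PCIe link or the shared memory). Implementations whose operating point falls on the sloped portion of this bound are communication-bounded; the remaining ones are computation-bounded.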


Figure 6.5: Roofline model for the matrix multiplication example: Cx and Hx denote the actual performance for each implementation (Conventional and HotStream, respectively); while the conventional solutions C2× and C4×, with 2× and 4× parallelism, respectively, are limited by the PCIe link (communication-bounded), all other implementations are computation-bounded

At this point, it is worth noting that the communication roofline can be raised, and the previous limitation overcome, by simply enabling data re-use within the accelerator. Accordingly, by just applying the proposed HotStream framework to implement the same kernels, a speed-up of about 2.1 is achieved, corresponding to a performance of 400 MOps. However, the HotStream framework also allows more aggressive solutions to be easily developed, which use addition kernels to directly perform the sub-block accumulation on the accelerator. This has the advantage of increasing the accelerator operational intensity from 0.5 operations per byte (OPS/Byte) to 1.5 OPS/Byte. By using the Xilinx IP Cores to implement the sub-block matrix multiplication and addition, a peak performance of 300 MOps is observed for the dataflow illustrated in Fig. 6.2. Furthermore, this value can still be increased if data parallelism is added to this implementation: by replicating each of the kernels depicted in Fig. 6.2 four times, a peak performance of 1200 MOps is observed (see the HotStream roofline with 4× parallelism in Fig. 6.5).

6.2.3 Performance and Memory Usage

Another key advantage of the HotStream framework, when compared to conventional solutions, is the reduction of the data traffic on the PCI Express link. These savings can be rather significant for large data sets, not only providing the means for the application to execute faster, but also reducing the intervention of the Host GPP. In fact, as the amount of data transferred over the PCI Express connection decreases, so does the number of DMA descriptors that must be set up by the Host. While this operation can be overlapped with the actual data transfer, it wastes valuable Host processing time. Figure 6.6 depicts the execution times, split into the different communication/computation domains, for the considered matrix multiplication implementations, using a 4096×4096 matrix. In fact, to achieve a performance comparable to what is possible with the proposed HotStream framework, the conventional implementations must be able to completely overlap communication with computation. However, this comes at the cost of a higher intervention of the Host GPP, since part of the GPP computation power must be used to manage the communication with the accelerator, instead of being used for other useful computations. Figure 6.6 clearly shows that the HotStream implementation is able to reduce the Host processor occupation by 45× for the case of a 4096×4096 matrix.

The three bars on the left of Fig. 6.7 represent single-kernel configurations of the three considered implementations. In this particular setup, the total execution time of the conventional+buffering implementation is very similar to the one provided by the HotStream implementation (whose Host processing time is not considered, since it can be overlapped with the computation). The slight performance gap that is observed in the conventional+buffering implementation is mainly due to the overhead introduced by the need to perform the reduction step in the Host. For the considered 4096×4096 matrix multiplication, this step corresponds to more than 2×10^9 operations, which take about 486 ms to execute on a state-of-the-art Intel Core i7-3770K processor (@3.5 GHz, with 8 MB of cache). If a less powerful processor were considered, the observed gap between these two implementations would become significantly more pronounced, as would the gains provided by the proposed framework.


Figure 6.6: Processing time taken on each step of the matrix multiplication algorithm for the considered implementations

Furthermore, by taking into account the performed Roofline analysis (see Fig. 6.5), it is worth recalling that the HotStream implementation is much less communication-bound than the conventional approaches, which allows an additional level of parallelism to be exploited. This is demonstrated in the remaining columns of the graph, where an almost linear speed-up is achieved with a parallelism of 2× and 4×, respectively. This contrasts with the conventional implementations, which presented a considerably lower scalability.


Figure 6.7: Core scalability of the three matrix multiplication implementations

The last fundamental parameter considered in this evaluation is the amount of memory required to store the matrices and the intermediate results in the Host. This is a very important scalability measure, as it effectively determines the maximum matrix size that can be handled by the accelerator. It is expected that all implementations (except the HotStream solution) fall short in this domain. In particular, the conventional solutions require a Host buffer that is several times larger than the whole matrix size, in order to hold all the block-based intermediate results of the multiplication operation.

As depicted in Fig. 6.8, for 4096 × 4096 matrices, the conventional implementations require storing the full input and output matrices, as well as an additional 4 GB buffer to store the intermediate results. In contrast, with the HotStream implementation, the matrix size is only limited by the total capacity of the Host memory, which must be capable of holding the three full matrices, i.e., 96 MB, leading to a 42× reduction in memory requirements. Naturally, to enable data re-use within the accelerator, the shared memory must also be large enough to hold three sub-blocks of the input and output matrices, as well as a small set of extra sub-blocks that guarantee the overlapping of communication and computation. This is easily achieved with the 512 MB of DDR memory available on the Virtex-7 FPGA board.
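These figures can be reconstructed as follows, assuming 2-byte matrix entries (as stated above) and that the intermediate buffer of the conventional implementations holds one partial result matrix per sub-block of the inner dimension: a single 4096 × 4096 matrix occupies 4096 × 4096 × 2 B = 32 MB, so the two input matrices and the output matrix amount to 3 × 32 MB = 96 MB, while the 4096/32 = 128 partial result matrices of the conventional approach add up to 128 × 32 MB = 4 GB of intermediate storage.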

(Axes: Host Memory Requirements [B] versus Square Matrix Size (number of columns); the plot also marks the allocation limit imposed by 16 GB of Host RAM.)

Figure 6.8: Host memory requirements for matrix multiplication implementations

Finally, the scaling of the multiplication operation to larger matrices can be considered. The conventional implementations are not able to compute an 8192×8192 matrix multiplication, since a buffering capacity of about 32 GB would be needed. In contrast, less than 1 GB is required with the proposed framework (to store the input and output matrices). This clearly demonstrates the ability of the HotStream framework to scale and deal with very large data sets.

6.3 Case Study 2: Image processing chain in the frequency domain

While the matrix multiplication case study successfully outlines the key features of the HotStream framework, it is not enough to attest to the capability of the proposed framework to support any kind of stream-based application, independently of its complexity. Thus, in order to demonstrate the adaptability of the proposed framework and its extensive support for multiple heterogeneous cores, a second case study is considered, which applies frequency domain filtering to a stream of large images. Additionally, it is also the purpose of this case study to demonstrate the advantages of utilizing the Cycle-Accurate Simulator to predict the resulting application performance under a variety of configurations.

Unlike spatial domain filtering, which is the most prevalent in image processing applications, frequency domain filtering takes the frequency representation of an image and multiplies each of its frequency components by a specific coefficient.

This procedure is significantly less computationally intensive than performing a conventional 2D convolution, but the cost of transforming the image to and from the frequency domain must be taken into account. Basically, this means that, for sufficiently small images and convolution kernels, applying a direct convolution is faster than taking the frequency-based approach. However, as images get larger and/or the convolution kernels become bigger, the overhead introduced by the calculation of the two frequency transformations becomes comparatively smaller relative to the convolution operation, which means that the frequency-based approach results in a higher processing throughput [15]. To properly evaluate the framework for this case study, images of up to 4096 × 4096 pixels are considered. Such images are commonly found in medical imaging applications, where various types of filters are applied to very high-resolution images. For such dimensions, utilizing frequency domain filtering is largely justified, as convolution-based solutions become notably slow in these conditions.

The transformation between the spatial and the frequency domains can be accomplished by a 2D Fast Fourier Transform (FFT), which can be performed in two steps by a conventional linear FFT: first applying a 1D FFT to each line of the input image and then performing a 1D FFT to each column of the resulting matrix. Naturally, this is equivalent to performing the first step twice, as long as a matrix transposition operation is performed between the two. In a streaming application, pipelined operation is preferred over resource sharing, meaning that two linear FFTs are applied in series, interposed by a transposition step. It is worth emphasising that the transposition and, subsequently, the second FFT can only be applied once the first stage is completely finished, which means that intermediate buffering is required. Given the significant dimensions of the images being considered, it becomes obvious that resorting to on-chip memory is completely unfeasible. For instance, considering a single-precision floating-point FFT, with magnitude and phase components, and a 2048 × 2048 grayscale image (i.e., with only one colour component), this leads to a memory footprint of 32 MB. Such an amount of memory can only be found in off-chip memory solutions, such as the one offered by the HotStream framework. Moreover, the advanced data-fetching capabilities of the DFCs employed in the framework allow the stored data to be accessed in transposed order without any additional hardware or delay.
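The row-column decomposition just described can be summarised by the C sketch below (illustrative only: fft1d() is a hypothetical in-place 1D FFT routine with a stride argument, standing in for one pass through the pipelined FFT core, and complex float models the single-precision frequency samples):

#include <complex.h>

/* hypothetical 1D FFT over n samples spaced 'stride' elements apart */
extern void fft1d(complex float *data, int stride, int n);

void fft2d(complex float *img, int rows, int cols)
{
    /* 1st stage: 1D FFT over every line of the image */
    for (int r = 0; r < rows; r++)
        fft1d(&img[r * cols], 1, cols);

    /* 2nd stage: 1D FFT over every column of the intermediate result.
     * In the HotStream mapping this corresponds to the second FFT core
     * reading the shared memory in transposed order through its DFC,
     * so no explicit transposition of the data is ever performed.     */
    for (int c = 0; c < cols; c++)
        fft1d(&img[c], cols, rows);
}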

To avoid the loss of precision that can result from using fixed-point arithmetic, the entire signal processing chain operates in single-precision floating-point. This creates the need for auxiliary cores that perform the conversion from grayscale values to fixed-point and, subsequently, to floating-point. Furthermore, to improve the data throughput, a double-buffering scheme is employed between each FFT pair, meaning that, after the initial latency, both FFTs are always producing and reading frequency samples. Figure 6.9 represents the complete digital signal processing chain, mapped to the HotStream framework.

It is also interesting to note that the same approach that was used to implement the 2D FFT can be used to create an ultra-long FFT solution. This particular case of the FFT operates on very large sample sets, usually in the order of 1M points. To manage such a large transform, the input is usually arranged in a 2D matrix and the computation is separated into two sequential steps, as in the 2D FFT.


Figure 6.9: Image processing chain in the frequency domain, mapped to the HotStream framework

An additional multiplication step is required, in which corrective twiddle factors (trigonometric constant coefficients used in the course of FFT algorithms) adjust the phase of each of the individual FFTs applied to each line of the matrix. Due to the considerable memory requirements, off-chip memory solutions are also typically employed.
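For reference, this two-step organisation follows the standard Cooley-Tukey decomposition of an N-point DFT with N = N_1 N_2 (a textbook result, not specific to this implementation). Writing n = n_1 + N_1 n_2 and k = N_2 k_1 + k_2,

\[
X[N_2 k_1 + k_2] \;=\; \sum_{n_1=0}^{N_1-1} W_{N_1}^{\,n_1 k_1}\, W_{N}^{\,n_1 k_2} \left( \sum_{n_2=0}^{N_2-1} x[n_1 + N_1 n_2]\, W_{N_2}^{\,n_2 k_2} \right), \qquad W_M = e^{-j 2\pi / M}.
\]

The inner sums are the shorter FFTs applied to each line, the factors W_N^{n_1 k_2} are the corrective twiddle factors mentioned above, and the outer sums are the FFTs applied to each column of the (conceptually transposed) intermediate result.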

6.3.1 Computing Cores

As in the previous case study, the focus of this work is not on the development of high-efficiency processing cores, but rather on the proper evaluation of the proposed framework. As such, off-the-shelf IP cores were used wherever possible, and custom hardware was only developed when strictly needed. In this particular case study, the floating-point infrastructure is based on the Xilinx Floating-Point Operator [3], which supports traditional arithmetic operations on floating-point values, in addition to conversion from and to fixed-point representation. The 2FixedPoint and 2GrayScale blocks, depicted in Fig. 6.9, are simple asymmetrical FIFOs which buffer the stream to and from the PCI Express interface, presenting 32-bit fixed-point pixel values, with no fractional part, to the floating-point converters. The FFT cores were implemented with the Xilinx Fast Fourier Transform IP Core [6], configured in pipelined mode with single-precision floating-point representation.

The frequency domain filter (FDF) required a more customized approach, as no equivalent off-the-shelf solutions were available. The core of the FDF is a floating-point multiplier, implemented by the previously mentioned Xilinx floating-point solution, which applies a factor to each sample received. These factors depend on the type of filtering that is being applied and must all be stored in local memory, in order to maintain a processing throughput of one sample per clock cycle. However, for a 2048 × 2048 image, the same number of floating-point coefficients is required, which corresponds to 16 MB of memory.

To circumvent this problem, run-length encoding was used to store the coefficients in less than 64 kB. In this type of encoding, each set of equally-valued entries is grouped and represented by a [VAL,NUM] pair, in which VAL corresponds to the repeated value and NUM to the number of repetitions. The FDF core was designed to read an encoded pair from its local memory and to apply that coefficient to the next NUM incoming samples. In addition, its pipelined design allows it to hide the delay between the parsing of encoded pairs and the utilization of the resulting coefficients, effectively achieving a throughput of one filtered sample per clock cycle. Table 6.5 summarizes the resource usage of each of the processing blocks utilized in this application.
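The decoding performed by the FDF can be modelled by the C sketch below (illustrative only: type and function names are hypothetical, and the complex nature of the frequency samples as well as the core's pipelining are abstracted into a simple per-sample multiplication):

typedef struct {
    float        val;   /* repeated coefficient value                  */
    unsigned int num;   /* number of consecutive samples it applies to */
} rle_pair_t;

/* Expand the [VAL,NUM] table and scale the incoming sample stream. */
void fdf_apply(const rle_pair_t *table, int pairs,
               const float *in, float *out)
{
    int s = 0;                                   /* running sample index */
    for (int p = 0; p < pairs; p++)
        for (unsigned int r = 0; r < table[p].num; r++) {
            out[s] = in[s] * table[p].val;       /* one filtered sample  */
            s++;
        }
}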

Table 6.5: Resource usage of the cores utilized in the frequency domain processing case study (hardware platform: XC7VX485T Virtex-7 FPGA)

         Available   Xilinx   Xilinx                     HotStream     HotStream    HotStream
                     FFT      Floating-Point Converter   2FixedPoint   2GrayScale   FDF
Slices   75,900      2,445    117                        71            62           239
LUTs     303,600     6,913    226                        90            68           391
DSPs     2,800       46       0                          0             0            2
RAMB36   1,030       2        1                          1             1            2
RAMB18   2,060       36       0                          0             0            0

6.3.2 Performance and Scalability

Having implemented all the individual units described above, and after performing a complete functional verification of the system with the aid of a Matlab model, the simulation results obtained for all the Cores were used to create a model of the proposed signal processing chain suited to the implemented Cycle-Accurate Simulator. The impact of the adopted configuration of the communication channels present in the HotStream framework is illustrated by varying the size of the data access bursts between 64 and 128 data beats. It is important to note that, while it may seem that using the largest available burst size is always advantageous, in practice this is not always true. In fact, the burst size has a significant impact on the bus arbitration dynamics, which directly affects the overall performance.

After running the simulation for the first time, it became clear that the performance of the DFC associated with the second forward FFT was being severely affected by the need to repeatedly modify the base address register, as a consequence of accessing the output of the first forward FFT in transposed order. Due to the large image size, skipping full lines of this data structure implies very large strides that frequently exceed the addressable range of the DFC configured in 16-bit mode. By utilizing the 32-bit datapath width, the shared memory read utilization increased by a factor of 4.4 for bursts of 64 data beats.

To allow the considered application to process images of varying sizes and aspect ratios, each of the FFT cores was configured to support all the required transform sizes. Since two independent Xilinx FFT cores are used to compute the 2D FFT in each direction, the theoretical upper limit on the size of the input image is as high as 65536 × 65536 pixels.

However, the actual configuration that is supported by the HotStream framework is solely determined by the capacity of the shared memory. The memory requirements of this application, in bytes, are given by NCOLS × NROWS × 8 × 4, where NROWS and NCOLS correspond to the number of rows and columns of the input image, respectively. Hence, the 512 MB of memory available on the test prototype effectively limits the image size to 4096 × 4096 pixels.

The results depicted in Figs. 6.10a and 6.10b represent the simulated execution time and bus utilization for five different image sizes, from 1024 × 1024 to 4096 × 4096 pixels, including rectangular aspect ratios such as 4096 × 2048. As stated above, these results were obtained for burst sizes of 64 and 128 data beats. Finally, two data sets are presented for each image and burst size pair, corresponding to the transient behaviour of the accelerator and to its steady-state operation. In the former, the double-buffering mechanisms are not yet in effect and the full data processing latency of each Core is incurred. The latter corresponds to the steady-state of the operation, where a stream of images is continuously provided to the accelerator without interruptions.
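As a quick check of the stated limit, substituting the largest supported configuration into the expression above gives 4096 × 4096 × 8 × 4 B = 512 MB, i.e., exactly the capacity of the DDR memory available on the prototype, which is why 4096 × 4096 is the largest image size considered below.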


(a) Execution time


(b) Overall bus utilization

Figure 6.10: Execution time and bus utilization for various image sizes and read and write burst sizes. Both transient (single frame) and steady-state (streaming) operation conditions are depicted

As expected, the time required to process a full frame is significantly decreased when a continuous stream of frames is provided.



Figure 6.11: FFT execution time with CUFFT, a CUDA-based FFT library, for various image sizes [33]

This performance improvement was determined to be in the order of 30% and results from the full utilization of the double-buffering mechanisms, as well as from the elimination of the initial latencies of each core, effectively creating a pipeline where data is output in every clock cycle. Moreover, Fig. 6.10a clearly shows that the frame processing time increases linearly with the image size. This is the case because the HotStream implementation of this application was designed from the start not to exceed the available bandwidth of the shared memory and, as long as the pressure on the memory bus is not increased by adding more Cores, this behaviour holds for any image size. This linear characteristic greatly contrasts with CPU and GPU-based implementations of 2D FFTs, in which non-linear behaviours are more common, as demonstrated in Figs. 6.11 [33] and 6.12 [10], which represent the execution times of similarly sized 2D FFTs on GPU and CPU platforms (note that, in Fig. 6.12, the time axis is not linear, in order to more closely resemble a linear behaviour). For the considered image sizes, and in an application where frames are continuously provided to the accelerator, frame rates between 40 and 2.5 Frames Per Second (FPS) can be expected for the smallest and largest image sizes, respectively. Again, if individual frames are to be processed, a performance decrease of around 30% may be experienced, reducing the equivalent frame rates to 30.3 and 1.9 FPS, respectively.

It is important to note that absolute execution times are not compared in this case study, as the experimental results in [33] and [10] were obtained with six-year-old hardware. In addition, it is not the aim of this case study to provide a high-performance implementation of the described DSP chain, as some adjustments could significantly increase its performance, such as raising the clock frequency from 100 to 200 MHz (or more), which is easily achievable with the FPGA device used in the prototype. Furthermore, implementing such an application in the HotStream framework, rather than in a CPU or GPU solution, is very advantageous when multiple filters are to be applied to the incoming frames. This type of scalability is unmatched by the other solutions, as the introduction of additional filters does not affect the steady-state performance of the accelerator, provided that the operating frequency can be maintained.
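A quick check of the claimed linearity using the steady-state figures above: between the smallest (1024 × 1024) and the largest (4096 × 4096) images the pixel count grows by a factor of 16, while the corresponding frame period grows from 1/40 s = 25 ms to 1/2.5 s = 400 ms, i.e., also by a factor of 16.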


Figure 6.12: Execution time of a 2D FFT on a NVIDIA QUADRO FX5600 using CUFFT and on an Intel Dual Core Processor (6600) @ 2.4 GHz using FFTW [10]

Moreover, the complexity of the filters can greatly influence the execution time in the GPU and CPU-based approaches [10] [15], while hardware-based implementations are largely impervious to this issue, as long as the units are fully pipelined. The variation of the data burst size reveals that maximizing this value can, in this case, lead to a higher utilization of the shared bus. As depicted in Fig. 6.10b, the efficiency gain is more pronounced in a full-streaming application, where a 10% increase is observed. Naturally, when processing a single frame, the increased burst capabilities are not fully exploited and the efficiency increase is much smaller. Furthermore, it should be noted that a bus utilization of 100% is not possible, given the various latencies intrinsic to the AXI protocol and to the memory element itself.

6.4 Summary

The evaluation of the framework itself was accomplished in three parts. First, the resource occupation overhead that it introduces was analysed, followed by an assessment of the stream generation efficiency that can be achieved with each DFC. Finally, two distinct case studies were considered, in order to test and validate the usefulness of the described streaming platform. The first case study focused on the multiplication of large matrices, an operation where a high degree of parallelism can be exploited. A complete signal processing chain for the frequency-domain filtering of large images was developed as the second case study, where the framework's support for Core heterogeneity is fully exercised.

Obtaining a truly scalable platform, capable of hosting several heterogeneous cores, was one of the goals when designing the HotStream framework. Fulfilling such a requirement meant that the units that are most often replicated should focus, first and foremost, on a low resource usage. The results from the resource utilization analysis confirm that this was achieved.

In fact, the DFC/BMC pair, which must be replicated for each Core that is added to the framework, accounts for as little as 0.8% of the total resources available in the reconfigurable device, allowing for the desired scalability.

A comparison between the DFC and a conventional DMA engine demonstrates that the proposed solution offers significant reductions in the pattern description size, especially for long and complex patterns. Such a result is achieved without compromising the address generation efficiency, i.e., the number of addresses issued per clock cycle, which was always very close to that of competing solutions for all the data patterns considered.

Applying the HotStream framework to two real use-cases, namely the block-based multiplication of large matrices and the full image processing chain in the frequency domain, provided insight into the gains that can be achieved by using the proposed framework. The comprehensive performance analysis conducted for the first case study shows that the 4.2× increase in the bandwidth made available to the multiple Cores leads to a performance speed-up of 2.1, when compared to conventional approaches. Furthermore, the Host memory requirements and the Host's intervention in the communication are reduced by as much as 42×.

The results from the second case study suggest that the implementation of a frequency-domain filtering application in the HotStream framework benefits from a linear relationship between the size of the image being processed and the execution time, unlike what happens with state-of-the-art CPU or GPU-based approaches. Moreover, when multiple filters are used in sequence, it is expected that the HotStream-based solution will largely outperform CPU or even GPU-based solutions. Nevertheless, frame rates of up to 40 FPS when processing 1024 × 1024 images and of 2.5 FPS for the largest supported image size (4096 × 4096) already attest to the significant processing advantages offered by this test prototype.

7 Conclusions and Future Work

Contents 7.1 Conclusions ...... 78 7.2 Future work ...... 80


7.1 Conclusions

In the context of this thesis, a full framework for the development of efficient and high-performance stream-based architectures was proposed, implemented and evaluated. This framework consists of three main elements: i) a Multi Core Processing Engine (MCPE), which hosts multiple heterogeneous cores, interconnected by a high-speed, low-latency backplane interconnection, and with common access to a large shared memory that enables advanced stream-sharing and data re-use; ii) a Host Interface Bridge (HIB), responsible for efficiently handling all the communications between the accelerator and the Host machine; and iii) a software layer, composed of the HotStream API and the accompanying low-level device drivers, providing high-level and intuitive access to the multiple features of the framework.

The proposed framework offers an innovative two-level data access mechanism, which provides both coarse and fine-grained access to the data involved in the computation. While the first level is ensured by the HIB, through the use of an efficient DMA engine capable of 2D accesses, the second is accomplished by an innovative Data Fetch Controller. When compared to existing state-of-the-art pattern generation mechanisms [20], the proposed Data Fetch Controller structure provides considerable gains, both in hardware resources and in the storage requirements of the pattern description code. This is achieved without compromising the address issuing rate, while allowing for a very compact and easy-to-use pattern description code.

The HotStream framework was designed to be completely platform independent. Nevertheless, to properly evaluate it, it was prototyped on a Virtex-7 development board. This board was selected since it fulfilled all the requirements necessary for a full implementation of the framework, such as a large-capacity off-chip memory and a high-throughput PCI Express interface, compatible with any conventional PC motherboard.

Given that the performance of the framework is greatly dependent on the performance of the various communication interfaces it encompasses, these were adequately characterized and profiled. In particular, the dependency of the PCI Express link's performance on the size of the transmitted data revealed that small data chunks are highly penalized and can lead to aggregate throughputs more than 160× lower than what can be achieved under ideal conditions. The experiments also demonstrated that aggregate throughputs of up to 2.1 GB/s are possible with a x8 Gen1 PCI Express interface, including the transfer set-up time inherent to the HotStream API.

Timing-accurate simulations of the DDR3 shared memory demonstrated that, while the peak throughput can be calculated to be as high as 12.8 GB/s, the various mechanisms involved in the operation of a dynamic memory reduce this value to an average of 7.87 GB/s for most traffic conditions, provided that burst accesses can be exploited. Note that the available data bandwidth may still be lower, depending on the remaining communication channels or interfaces. In order to perform a more comprehensive analysis of the backplane interconnection, two competing and equally capable solutions were studied: Networks-on-Chip (NoCs) and Crossbars. By comparing the most competitive and freely available NoC implementation with a standard Crossbar

solution provided by Xilinx, it became clear that, while the NoC offers increased scalability, flexibility and performance, it simply cannot compete with its counterpart in terms of resource usage when targeting an FPGA. The main cause is that the "wires" which interconnect the various nodes of a Crossbar are replaced, in a NoC, by switches and routers, introducing an area overhead that even modern reconfigurable devices cannot support. However, the conclusion can be different if the HotStream framework is implemented in ASIC technology, since in modern fabrication processes "wires" are becoming comparatively more expensive than transistors.

The comprehensive profiling of the various communication channels present in the HotStream framework made it possible to develop a Cycle-Accurate Simulator that effectively emulates the interactions between all the elements of the framework. By taking into account the various latencies associated with the Cores, as well as the access latencies of the shared bus and of the backplane, this tool can serve as very useful guidance in the early design stages of the implementation process, as its results are very close to what is observed in the implemented designs. The input of the simulator is a set of configuration files describing each of the instantiated cores, in which processing latencies, read and write behaviours, as well as synchronization mechanisms with other cores, can be configured.

As originally intended, the hardware overhead of the various structures of the HotStream framework was kept to a minimum whenever possible. This is particularly true for the DFC/BMC pairs that are replicated for each core and that support the fine-grained access patterns to the shared memory: each pair corresponds to 0.8% of the total resources available in the FPGA device used in the test prototype. This low hardware requirement is achieved without compromising the address generation efficiency, while offering memory savings of 1500× or more in what concerns descriptor storage. This discrepancy can be even higher for complex and long-running patterns, meaning that conventional DMAs would require an external processor or microcontroller to dynamically generate the SG descriptors, while the proposed DFC can be completely self-contained.

The performance of the HotStream framework as a whole, and the speed-ups that can be achieved by developing an acceleration architecture based on its structure, are greatly dependent on the target application and on the amount of parallelism and data re-use it can exploit. Thus, for a proper evaluation, two case studies were considered: a block-based multiplication of very large matrices, and a full image processing chain in the frequency domain. The experimental results obtained for the first case study show that conventional solutions that do not exploit data re-use are easily constrained by the PCIe communication link. On the other hand, by using the proposed framework, it is possible to increase the bandwidth available to the cores by 4.2×, thus leading to a 2.1× performance speed-up with a 4× data parallelism approach. Furthermore, by easing the implementation of more complex solutions, further speed-ups can still be achieved. The final solution allows the Host memory requirements to be reduced by 42×, consequently allowing the processing of much larger matrices.

For the second case study, images between 1024 × 1024 and 4096 × 4096 pixels were considered. When a continuous stream of frames is provided to the HotStream-based accelerator, frame

rates of 40 and 2.5 FPS can be achieved for the smallest and biggest image sizes, respectively. Most importantly, the HotStream-based accelerator proved to offer a linear relation between the image size and the execution time, which is not possible with either CPU or GPU implementations. Moreover, the proposed implementation is particularly competitive when multiple filters are applied to a stream of images. In these conditions, it is expected that the HotStream-based implementation will outperform state-of-the-art CPU and GPU-based solutions.

In conclusion, the proposed HotStream framework proved to be a comprehensive platform for the development of stream-based accelerators, capable of achieving high levels of performance. In particular, the two carefully selected case studies suggest that a large range of applications can benefit from the unique characteristics of this platform, while taking advantage of the significant reductions in the development cycle that it provides.

7.2 Future work

As mentioned before, the proposed HotStream framework and, in particular, its implementation on an FPGA device was validated through two case studies: the multiplication of large matrices and the processing of large images in the frequency domain. While these two benchmark applications exploited most of the unique capabilities of the platform, more extensive testing should be conducted, in order to fully evaluate the goals that were set out for this concept (i.e., the design of a generic platform for the development of stream-based hardware accelerators with multiple heterogeneous cores). In particular, the framework should ideally be mapped to several target technologies, in order to assess whether it remains competitive with similar offerings in those contexts. In addition, the framework would greatly benefit from being applied to a range of other application domains, such as multimedia or generic digital signal processing. In this way, it would be possible to build a portfolio of stream-based applications that would attest to the full usefulness of the presented work.

Finally, in order to increase the overall usability of the framework and to boost its adoption as a competitive stream-computing architecture for the development of powerful accelerators, a high-level Integrated Development Environment (IDE), capable of streamlining the definition of the access patterns for each processing element, should be created. Ideally, such a tool would be coupled with a graphical component allowing the quick creation of an access pattern. Bound checking on the shared memory would also be performed by this tool, further reducing spurious errors during the development process, which are usually difficult to track.

Bibliography

[1] (1999). The CoreConnect Bus Architecture. Technical report, IBM.

[2] (2010). AMBA AXI protocol. Technical report, ARM.

[3] (2011). LogiCORE IP Floating-Point Operator v5.0. Technical Report DS335, Xilinx.

[4] (2011). LogiCORE IP Linear Algebra Toolkit v1.0. Technical Report ds829, Xilinx.

[5] (2012). LogiCORE IP AXI DMA v6.03a. Technical Report PG021, Xilinx.

[6] (2012). LogiCORE IP Fast Fourier Transform v8.0. Technical Report DS808, Xilinx.

[7] (2012). MaxCompiler: Manager Compiler Tutorial. Maxeler Technologies.

[8] Arteris (2009). From Bus and Crossbar to Network-On-Chip. Technical report, Arteris.

[9] Cao, P., Song, K., and Yang, J. (2012). A real-time data transmission method based on linux for physical experimental readout systems. Fusion Engineering and Design, 87(9):1693 – 1699.

[10] Castaño-Díez, D., Moser, D., Schoenegger, A., Pruggnaller, S., and Frangakis, A. S. (2008). Performance evaluation of image processing algorithms on the GPU. Journal of Structural Biology, 164(1):153–160.

[11] Dally, W. J., Labonte, F., et al. (2003). Merrimac: Supercomputing with streams. In Proceedings of the 2003 ACM/IEEE conference on Supercomputing, SC ’03, pages 35–, New York, NY, USA. ACM.

[12] DS190 (2013). Zynq-7000 All Programmable SoC Overview. Xilinx.

[13] Duato, J., Yalamanchili, S., and Ni, L. (2002). Interconnection Networks: An Engineering Approach. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA.

[14] Erez, M., Ahn, J., et al. (2004). Analysis and performance results of a molecular modeling application on merrimac. In Supercomputing, 2004. Proceedings of the ACM/IEEE SC2004 Conference, pages 42–42.

[15] Fialka, O. and Cadik, M. (2006). FFT and convolution performance in image filtering on GPU. In Information Visualization, 2006. IV 2006. Tenth International Conference on, pages 609–614.


[16] Ghosh, S., Martonosi, M., et al. (1997). Cache miss equations: An analytical represen- tation of cache misses. In In Proceedings of the 1997 ACM International Conference on Supercomputing, pages 317–324. ACM Press.

[17] Goldhammer, A. and Ayer Jr., J. (2008). Understanding Performance of PCI Express Systems. Technical Report WP350, Xilinx.

[18] Gratz, P., Kim, C., Mcdonald, R., Keckler, S. W., and Burger, D. (2006). Implementation and Evaluation of On-Chip Network Architectures. In Proceedings of the 24th International Conference on Computer Design, pages 170–177.

[19] Heinen, I., Moller, L., and Guazzelli, R. (2011). Atlas - An Environment for NoC Generation and Evaluation. https://corfu.pucrs.br/redmine/projects/atlas.

[20] Hussain, T., Shafiq, M., et al. (2012). PPMC: a programmable pattern based memory controller. In Proceedings of the 8th international conference on Reconfigurable Computing: architectures, tools and applications, ARC’12, pages 89–101, Berlin, Heidelberg. Springer- Verlag.

[21] Ilic, A., Pratas, F., and Sousa, L. (2013). Cache-aware roofline model: Upgrading the loft. IEEE Computer Architecture Letters, 99(RapidPosts):1.

[22] Jahangiri, A. (2007). PCI Express 2.0: The Next Frontier in Interconnect Technology. http://www.rtcmagazine.com/articles/view/100846.

[23] Kapasi, U., Dally, W., et al. (2002). The imagine stream processor. In Computer Design: VLSI in Computers and Processors, 2002. Proceedings. 2002 IEEE International Conference on. IEEE.

[24] Kapasi, U., Rixner, S., et al. (2003). Programmable stream processors. Computer, 36(8):54– 62.

[25] Karanam, R. K., Ravindran, A., and Mukherjee, A. (2008). A stream chip-multiprocessor for bioinformatics. SIGARCH Comput. Archit. News, 36(2):2–9.

[26] Lam, M. D., Rothberg, E. E., et al. (1991). The cache performance and optimizations of blocked algorithms. SIGPLAN Not., 26(4):63–74.

[27] Lee, E., Lemieux, G., and Mirabbasi, S. (2006). Interconnect driver design for long wires in field-programmable gate arrays. In Field Programmable Technology, 2006. FPT 2006. IEEE International Conference on, pages 89–96.

[28] Marcus, G., Gao, W., Kugel, A., and Männer, R. (2011). The MPRACE framework: An open source stack for communication with custom FPGA-based accelerators. In Programmable Logic (SPL), 2011 VII Southern Conference on.


[29] Mayhew, D. and Krishnan, V. (2003). PCI express and advanced switching: evolutionary path to building next generation interconnects. In High Performance Interconnects, 2003. Proceedings. 11th Symposium on, pages 21–29.

[30] Moraes, F., Calazans, N., Mello, A., Möller, L., and Ost, L. (2004). Hermes: an infrastructure for low area overhead packet-switching networks on chip. Integration, the VLSI Journal, 38(1):69–93.

[31] Osterloh, B., Michalik, H., Fiethe, B., and Kotarowski, K. (2008). Socwire: A network-on- chip approach for reconfigurable system-on-chip designs in space applications. In Adaptive Hardware and Systems, 2008. AHS ’08. NASA/ESA Conference on, pages 51–56.

[32] Pell, O. and Averbukh, V. (2012). Maximum performance computing with dataflow engines. Computing in Science Engineering, 14(4):98–103.

[33] Podlozhnyuk, V. (2007). FFT-Based 2D Convolution. Technical report, NVIDIA.

[34] Rajeev Balasubramonian, T. P. (2011). Encyclopedia of Parallel Computing. Springer Sci- ence + Business Media.

[35] Schelle, G. and Grunwald, D. (2006). Onchip Interconnect Exploration for Multicore Processors Utilizing FPGAs. In WARFP-2006 2nd Workshop on Architecture Research using FPGA Platforms.

[36] Schelle, G. and Grunwald, D. (2008). Exploring FPGA network on chip implementations across various application and network loads. In Field Programmable Logic and Applications, 2008. FPL 2008. International Conference on, pages 41–46.

[37] Wallace, G. (1992). The JPEG still picture compression standard. Consumer Electronics, IEEE Transactions on, 38(1):xviii–xxxiv.

[38] Xilinx (2012a). AXI4-Stream Interconnect v1.1. Technical Report pg035, Xilinx.

[39] Xilinx (2012b). LogiCORE IP AXI Bridge for PCI Express. Technical Report PG055, Xilinx.

[40] Xilinx (2013). VC707 Evaluation Board for the Virtex-7 FPGA. Technical Report UG885, Xilinx.

[41] Zhu, S. and Ma, K.-K. (2000). A new diamond search algorithm for fast block-matching motion estimation. Image Processing, IEEE Transactions on, 9(2):287–290.


A Appendix A

Contents A.1 Micro16 Instruction Set Architecture ...... 86 A.2 HotStream Register Interface ...... 86 A.3 HotStream API ...... 89


A.1 Micro16 Instruction Set Architecture

The Micro16 microcontroller was entirely custom designed to ensure the programmability of the innovative DFC utilized in the HotStream framework. One of the main requirements for this unit was to keep its hardware resource usage as low as possible. To achieve this, a 16-bit architecture was developed, together with a highly compact instruction set that is, nevertheless, capable of handling any access pattern. Tables A.1 through A.4 provide a detailed description of the Micro16 ISA.

Table A.1: ALU Register-Register operations

15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0 0 0 FSEL RFD AGC# REGD RA RFB AGC# REGB

FSEL   Operation   Assembly Mnemonic
00     D = A + B   add rd, ra, rb
01     D = B - A   sub rd, ra, rb
10     D = B - 1   dec rd, rb
11     D = A × B   mult rd, ra, rb

Table A.2: Constant loading operations

15 14 13 12 11 0 0 1 RD 12-bit Constant 15 14 13 12 11 10 9 8 7 0 1 0 RD OP WAIT/DONE X Const. 8 bit

Word Group   OP   Operation                          Assembly Mnemonic
01           XX   D = Const                          loadlit rD, Const
10           10   D = (D & 0xFF00) | Const8          lcl rd, Const8
10           11   D = (D & 0x00FF) | (Const8 << 8)   lch rd, Const8

A.2 HotStream Register Interface

The HotStream framework makes it possible to manage and configure an arbitrary number of cores directly from the Host machine. Operations such as setting interrupt masks or loading new instruction data are made possible by a custom register interface that may be accessed through the HotStream API. Tables A.5 through A.6 describe the structure of this register interface.


Table A.3: Low and High constant loading and miscellaneous operations

15 14 13 12 11 10 9 8 7 0 1 0 0 RD OP WAIT/DONE Const. 8 bit

OP   Wait/Done   Operation   Assembly Mnemonic
00   0           Wait        wait
00   1           Done        done

15 14 13 12 11 5 4 3 2 1 0 1 0 1 PUSH/POP X RFB AGC# REGB

Push/Pop   Operation   Assembly Mnemonic
0          Push B      push rb
1          Pop B       pop rb

Table A.4: Flow control operations

15 14 13 12 11 0 1 1 OP Absolute Address

OP   Operation            Assembly Mnemonic
00   Jump Unconditional   jmp label
01   Jump if Zero         j.z label
10   Jump if Negative     j.n label
11   Jump if Not Zero     j.nz label

Table A.5: CMU Address Mapping

Offset   Name
00h      Instruction Configuration Register (ICR)
04h      Kernel Control Register (KCR)
08h      Kernel Control Mask Low (KCML)
0Ch      Kernel Control Mask High (KCMH)
10h      Kernel Interrupt Vector Low (KIVL)
14h      Kernel Interrupt Vector High (KIVH)
18h      User Register 1 (UR1)
1Ch      User Register 2 (UR2)


Table A.6: CMU Register Details

(a) ICR Register Details

Bits       Field    Access   Description
0          RS       WO       Run (1) / Stop (0) instruction streaming to the currently selected Core
1          IO       WO       Select the Input (0) or Output (1) Instruction Memory partition of the currently selected Core
2 to 3     Rsv      RO       Reserved
4 to 13    IBA      WO       Instruction base address for the partition defined by the IO field in the currently selected Core
14 to 15   Rsv      RO       Reserved
16 to 21   CS       WO       Select one of the 64 Cores in the MCPE
22 to 23   Rsv      RO       Reserved
24 to 25   Status   RO       Indicates the state of the instruction streaming operation: 00 - Default; 01 - Configuration Complete; 10 - Instruction Overflow; 11 - Reserved
26 to 31   Rsv      RO       Reserved

(b) KCR Register Details

Bits     Field    Access   Description
0        Reset    WO       Reset the Cores selected by the Kernel Control Mask (Active High)
1        SetInt   WO       Activate interrupts from the Cores selected by the Kernel Control Mask (Active High)
2 to 3   Rsv      RO       Reserved
4 to 5   Dest     WO       Configures the destination of Host streams: 00 - Shared Memory; 01 - Backplane; 10 - Instruction Stream; 11 - Reserved

(c) KCML Register Details

Bits      Field   Access   Description
0 to 31   MaskL   R/W      Mask for the first 32 Cores in the MCPE

(d) KCMH Register Details

Bits      Field   Access   Description
0 to 31   MaskH   R/W      Mask for the last 32 Cores in the MCPE

(e) KIVL Register Details

Bits      Field   Access   Description
0 to 31   IntVL   R/W      Interrupt vector for the first 32 Cores in the MCPE. Interrupts are acknowledged by resetting each bit individually

(f) KIVH Register Details

Bits      Field   Access   Description
0 to 31   IntVH   R/W      Interrupt vector for the last 32 Cores in the MCPE. Interrupts are acknowledged by resetting each bit individually


A.3 HotStream API

The HotStream API provides a convenient set of software methods that make it possible to manage and configure every aspect of the HotStream framework directly from the Host. These methods can be divided into four groups, according to their specific functions: i) Core Management; ii) Framework Management; iii) Data Management; and iv) Pattern Definition. The following sections describe each of the methods within these categories in detail.

Core Management

int HotStream_readBin(char *fname, void *idata_buf, int *buf_size)
Description: Reads a binary file produced by the Micro16 assembler into a buffer pointed to by idata_buf
Arguments:
- fname - Name of the binary output by the Micro16 assembler
- idata_buf - Pointer to the buffer that will hold the instruction data
- buf_size - Address of the integer that will hold the instruction buffer size in bytes
Returns: 0 on success; -1 otherwise

int HotStream_copyInstrData(int CoreNum, void *idata0, int idata0_size, void *idata1, int idata1_size)
Description: Populates the instruction memory of the Core CoreNum with the contents of buffers idata0 and idata1. If the instruction memory capacity is exceeded, this call will fail
Arguments:
- CoreNum - Core number within the MCPE
- idata0 - Pointer to a buffer containing the instruction data for the first partition of the instruction memory
- idata0_size - Size, in bytes, of the idata0 buffer
- idata1 - Pointer to a buffer containing the instruction data for the second partition of the instruction memory
- idata1_size - Size, in bytes, of the idata1 buffer
Returns: 0 on success; -1 otherwise

void HotStream_setIntVector(unsigned long intv)
Description: Sets the interrupt vector for the Cores in the MCPE
Arguments:
- intv - 64-bit long interrupt vector bit-mask
Returns: None

void HotStream_resetCore(int CoreNum)
Description: Sends a reset signal to the Core CoreNum


Arguments:
- CoreNum - Core number within the MCPE
Returns: None

void HotStream_resetCoreVector(unsigned long rstv)
Description: Sends a reset signal to the Cores specified in the rstv bit-mask
Arguments:
- rstv - 64-bit long reset vector bit-mask
Returns: None

unsigned long HotStream_getInt()
Description: Obtains a bit-mask indicating the interrupts received from the MCPE
Arguments: None
Returns: 64-bit long bit vector of asserted interrupts

void HotStream_ackInt(int CoreNum)
Description: Acknowledges an interrupt from Core CoreNum
Arguments:
- CoreNum - Core number within the MCPE
Returns: None

void HotStream_ackIntVector(unsigned long rstv)
Description: Acknowledges the interrupts generated by the Cores specified in the rstv bit-mask
Arguments:
- rstv - 64-bit long interrupt acknowledge vector bit-mask
Returns: None

Framework Management

int HotStream_init()
Description: Initializes the HotStream framework by opening the device driver and configuring the DMA engine
Arguments: None
Returns: 0 on success; -1 otherwise

void HotStream_close()
Description: Shuts down the HotStream framework by stopping the DMA engine and closing the device handle
Arguments: None


Returns: 0 on success; -1 otherwise

Data Management

int HotStream_sendShMem(void *uBuf, unsigned int size, unsigned int offset)
Description: Configures the HotStream framework to stream the user-provided buffer pointed to by uBuf to the Shared Memory
Arguments:
- uBuf - Pointer to a pre-allocated user-defined buffer
- size - Size, in bytes, of the buffer to stream
- offset - Offset within the Shared Memory
Returns: 0 on success; -1 otherwise

int HotStream_recvShMem(void *uBuf, unsigned int size, unsigned int offset)
Description: Configures the HotStream framework to retrieve a size-byte block from the Shared Memory at location offset and store it in the user-provided buffer pointed to by uBuf
Arguments:
- uBuf - Pointer to a pre-allocated user-defined buffer
- size - Size, in bytes, of the data to retrieve
- offset - Offset within the Shared Memory
Returns: 0 on success; -1 otherwise

int HotStream_sendBplane(void *uBuf, unsigned int size, int CoreNum)
Description: Configures the HotStream framework to stream the user-provided buffer pointed to by uBuf to the Core CoreNum attached to the high-speed backplane
Arguments:
- uBuf - Pointer to a pre-allocated user-defined buffer
- size - Size, in bytes, of the buffer to stream
- CoreNum - Number of the Core within the backplane
Returns: 0 on success; -1 otherwise

int HotStream_recvBplane(void *uBuf, unsigned int size, int CoreNum)
Description: Configures the HotStream framework to retrieve a size-byte block from the Core CoreNum and store it in the user-provided buffer pointed to by uBuf
Arguments:
- uBuf - Pointer to a pre-allocated user-defined buffer
- size - Size, in bytes, of the data to retrieve
- CoreNum - Number of the Core within the backplane
Returns: 0 on success; -1 otherwise


int HotStream_startSend(int waitInt)
Description: Initiates the configured send operation to the Backplane or Shared Memory
Arguments:
- waitInt - If set to 1, the call will block until the transfer is terminated
Returns: 0 on success; -1 otherwise

int HotStream_startRecv(int waitInt)
Description: Initiates the configured receive operation from the Backplane or Shared Memory
Arguments:
- waitInt - If set to 1, the call will block until the transfer is terminated
Returns: 0 on success; -1 otherwise

int HotStream_checkSend(int closeMapping)
Description: Checks if the data buffer was streamed without errors and prepares the next send operation. This method should always follow HotStream_startSend(). If closeMapping is set to 0, the user buffer can be reused
Arguments:
- closeMapping - Indicates whether the current buffer mapping is to be terminated
Returns: 0 on success; -1 otherwise

int HotStream_checkRecv(int closeMapping)
Description: Checks if the data buffer was streamed without errors and prepares the next receive operation. This method should always follow HotStream_startRecv(). If closeMapping is set to 0, the user buffer can be reused
Arguments:
- closeMapping - Indicates whether the current buffer mapping is to be terminated
Returns: 0 on success; -1 otherwise

Pattern Definition

int HotStream_linear_send(unsigned int offset, unsigned int size)
Description: Creates a linear stream of size size starting at offset offset, based on the user-provided buffer specified in the HotStream_sendShMem or HotStream_sendBplane calls
Arguments:
- offset - Starting position of the stream within the user-provided buffer
- size - Total size of the stream
Returns: 0 on success; -1 otherwise


int HotStream_linear_recv(unsigned int offset, unsigned int size)
Description: Creates a linear stream of size size starting at offset offset, based on the user-provided buffer specified in the HotStream_recvShMem or HotStream_recvBplane calls
Arguments:
- offset - Starting position of the stream within the user-provided buffer
- size - Total size of the stream
Returns: 0 on success; -1 otherwise

int HotStream_2d_send(int offset, int hsize, int stride, int vsize)
Description: Creates a 2D stream based on the user-provided buffer specified in the HotStream_sendShMem or HotStream_sendBplane calls
Arguments:
- offset - Starting position of the stream within the user-provided buffer
- hsize - Size, in bytes, of the contiguous blocks within the 2D pattern
- stride - Skip, in bytes, between the starting positions of two contiguous blocks
- vsize - Number of repetitions of the contiguous block of size hsize
Returns: 0 on success; -1 otherwise

int HotStream_2d_recv(int offset, int hsize, int stride, int vsize)
Description: Creates a 2D stream based on the user-provided buffer specified in the HotStream_recvShMem or HotStream_recvBplane calls
Arguments:
- offset - Starting position of the stream within the user-provided buffer
- hsize - Size, in bytes, of the contiguous blocks within the 2D pattern
- stride - Skip, in bytes, between the starting positions of two contiguous blocks
- vsize - Number of repetitions of the contiguous block of size hsize
Returns: 0 on success; -1 otherwise

int HotStream_block_send(int bsize, int mat_size, int elem_size)
Description: Creates a Tiled stream based on the user-provided buffer specified in the HotStream_sendShMem or HotStream_sendBplane calls
Arguments:
- bsize - Size of the tile/block within the matrix
- mat_size - Size of the matrix to which the tiled pattern is to be applied
- elem_size - Size, in bytes, of each matrix entry
Returns: 0 on success; -1 otherwise

int HotStream_block_recv(int bsize, int mat_size, int elem_size)
Description: Creates a Tiled stream based on the user-provided buffer specified in the HotStream_recvShMem or HotStream_recvBplane calls
Arguments:
- bsize - Size of the tile/block within the matrix
- mat_size - Size of the matrix to which the tiled pattern is to be applied
- elem_size - Size, in bytes, of each matrix entry
Returns: 0 on success; -1 otherwise

int HotStream_gather(void *ubuf, void *new_ubuf, int *buf_size, unsigned int offset, unsigned int hsize, unsigned int stride, unsigned int vsize)
Description: Gathers the data blocks specified by offset, hsize, stride and vsize and creates a new contiguous buffer that can be streamed to the Shared Memory or Backplane.
Arguments:
- ubuf - Pointer to a pre-allocated user-defined buffer
- new_ubuf - Pointer to the newly allocated contiguous buffer
- buf_size - The size, in bytes, of the newly allocated buffer
- offset - Starting position of the stream within the user-provided buffer
- hsize - Size, in bytes, of the contiguous blocks within the 2D pattern
- stride - Skip, in bytes, between the starting positions of two consecutive blocks
- vsize - Number of repetitions of the contiguous block of size hsize
Returns: 0 on success; -1 otherwise
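A possible invocation is sketched below. Since this reference does not state whether new_ubuf is allocated by the call or must be provided by the caller, the example assumes a caller-allocated buffer, and the pattern parameters are arbitrary illustration values.

    #include <stdlib.h>

    extern int HotStream_gather(void *ubuf, void *new_ubuf, int *buf_size,
                                unsigned int offset, unsigned int hsize,
                                unsigned int stride, unsigned int vsize);

    /* Gather 256 blocks of 64 bytes, spaced 1024 bytes apart, into a single
     * contiguous buffer that can then be streamed as a linear pattern.
     * new_ubuf is assumed here to be caller-allocated. */
    int gather_strided_blocks(void *ubuf)
    {
        int   new_size = 0;
        void *new_ubuf = malloc(256 * 64);
        if (new_ubuf == NULL)
            return -1;
        if (HotStream_gather(ubuf, new_ubuf, &new_size, 0, 64, 1024, 256) < 0) {
            free(new_ubuf);
            return -1;
        }
        /* new_size now holds the size, in bytes, of the gathered buffer. */
        free(new_ubuf);
        return 0;
    }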

Appendix B



The combination of an autonomous and parameterized address generation unit, the AGC, with a simple microcontroller, the Micro16, enables the efficient description of data transfers with various access patterns, without resorting to the creation of long chains of scatter-gather descriptors, as is the case with typical DMA engines, such as the PPMC [20]. Instead, the Micro16 modifies the generation parameters of the AGC at run-time, which may be compared to a dynamic generation of SG descriptors. As stated before, this greatly reduces the memory requirements for the storage of large and complex data patterns, without compromising the address generation rate, due to the use of a double-buffering mechanism.

The combination of the purposely-designed assembly language and assembler ensures that the task of describing an arbitrarily complex pattern is both intuitive and as compact as possible. The general structure of a pattern description code may be subdivided into three key elements: i) the copying of the pattern generation parameters to each unit of the AGC (it is important to restate that the AGC can be configured with a varying number of loopcontrol units); ii) the activation of the address generation through the done instruction, which should always be followed by the blocking wait instruction; and iii) one or more software loops that compute and copy to the AGC the address generation parameters for the next section of the pattern, after which step ii) is repeated.

The following section highlights some of the coding techniques used for describing the access patterns allowed by the AGC/Micro16 combination. These techniques are exemplified through commented pattern description codes and pseudo-codes for patterns with fundamentally different characteristics and lengths.

B.1 Pattern Description Examples

B.1.1 Linear and Tiled access pattern

The simplest pattern that can be described by the DFC is a linear access, like the one depicted in Fig. B.1. Such a pattern only requires one loopcontrol unit, and the AGC parameters only need to be copied once, as is made clear by the assembly code that accompanies the figure. Describing a regular tiled access naturally increases the number of required loopcontrol units to two. Given the regular nature of this access, minimal intervention from the Micro16 is required, apart from the initial copying of the address generation parameters to the AGC. It is important to note that the code presented on the right-hand side of Fig. B.2 can be used for tiled patterns of any size with minimal modifications to the stop conditions. This greatly increases code reuse and facilitates the configuration of commonly used data access patterns.

B.1.2 Diagonal access pattern

Unlike the previous examples, the diagonal data access pattern, which is an integral part of the Smith Waterman algorithm for DNA sequence alignment [25], requires additional computation to be performed on the microcontroller.


    ; Initialize AGC
    add erf.0.r0, r0, irf.r0      ; Lpbdy.initval=0
    add erf.1.r0, r0, irf.r0      ; Lpctrl.initval=0

    loadlit r1, 1
    add erf.0.r1, r0, irf.r1      ; Lpbdy.multval=1
    add erf.1.r1, r0, irf.r1      ; Lpctrl.multval=1
    add erf.0.r2, r0, irf.r1      ; Lpbdy.incval=1
    add erf.1.r2, r0, irf.r1      ; Lpctrl.incval=1

    loadlit r1, 1024
    add erf.1.r3, r0, irf.r1      ; Lpctrl.resetval=1024

    done                          ; Start AGC
    wait
end: jmp end                      ; wait forever

Figure B.1: Pattern description code of a simple linear access with 1024 positions

    ; Initialize AGC
    add erf.0.r0, r0, irf.r0      ; Lpbdy.initval=0
    add erf.1.r0, r0, irf.r0      ; Lpctrl1.initval=0
    add erf.2.r0, r0, irf.r0      ; Lpctrl2.initval=0

    loadlit r1, 1
    add erf.0.r1, r0, irf.r1      ; Lpbdy.multval=1
    add erf.1.r1, r0, irf.r1      ; Lpctrl1.multval=1
    add erf.2.r1, r0, irf.r1      ; Lpctrl2.multval=1
    add erf.0.r2, r0, irf.r1      ; Lpbdy.incval=1

    loadlit r1, 512
    add erf.1.r2, r0, irf.r1      ; Lpctrl1.incval=512

    loadlit r1, 128
    add erf.1.r3, r0, irf.r1      ; Lpctrl1.resetval=128

    loadlit r1, 72
    add erf.2.r3, r0, irf.r1      ; Lpctrl2.resetval=72

    done                          ; Start AGC
    wait
end: jmp end                      ; wait forever

Equivalent pseudo-code:

    // Initialize AGC
    set_loopbody(mult=1, inc=1, init=0)
    set_loopcontrol1(resetval=128, inc=512)
    set_loopcontrol2(resetval=72)
    done                          // Start AGC

Figure B.2: Pattern description code for a tiled 128×72 access

Some implementations of this algorithm, which is applied to a bi-dimensional matrix formed by the two sequences to be aligned, require the traversal of the entries that are perpendicular to each entry of the principal diagonal in a sequential manner, from the top-left to the bottom-right. Thus, the first six steps of the access sequence are the following (where the notation (i,j) refers to the entry on line i and column j): (1,1), (1,2), (2,1), (1,3), (2,2), (3,1). The pattern can be easily described by noting that, until the minor diagonal of the matrix is reached, the number of perpendicular elements that are accessed for each entry of the principal diagonal increases by one. After this point, the access pattern is symmetric. Thus, the code tests whether the minor diagonal has been reached, by comparing the current value of the loop iterator with 1024, and updates the number of repetitions performed by the AGC accordingly.

The increment value of 1023 selected for the loopbody unit results in a diagonal sequence from right to left, as required for the considered implementation of the Smith Waterman algorithm.

    ; Initialize AGC
    add erf.0.r0, r0, irf.r0      ; Lpbdy.initval=0
    add erf.1.r0, r0, irf.r0      ; Lpctrl.initval=0

    loadlit r1, 1
    add erf.0.r1, r0, irf.r1      ; Lpbdy.multval=1
    add erf.1.r1, r0, irf.r1      ; Lpctrl.multval=1

    loadlit r1, 1023
    add erf.0.r2, r0, irf.r1      ; Lpbdy.incval=1023
    loadlit r1, 1
    add erf.1.r3, r0, irf.r1      ; Lpctrl.resetval=1

    loadlit r3, 1                 ; init increment value
    loadlit r2, 2048              ; loop iterator
loop:
    done                          ; Start AGC
    wait

    loadlit r1, 1024
    sub irf.r1, r1, irf.r2        ; r2 - 1024
    j.n up                        ; if r2 < 1024
    ; else
    loadlit r1, -1
    loadlit r3, 1024
    jmp cnt
up: loadlit r1, 1
cnt:
    add erf.1.r3, r0, irf.r1      ; Lpctrl.resetval=1/-1
    add erf.0.r0, r3, erf.0.r0    ; Lpbdy.initval+=r3
    add erf.1.r0, r3, erf.0.r0    ; Lpctrl.initval+=r3

    dec irf.r2, irf.r2            ; r2--
    j.z end                       ; if r2 == 0
    jmp loop                      ; continue loop
end: jmp end                      ; wait forever

Equivalent pseudo-code:

    // Initialize AGC
    set_loopbody(mult=1, inc=1023, init=0)
    set_loopcontrol(resetval=1)
    for i = 1 to 2048 do
        done                      // Start AGC
        wait                      // Wait to modify parameters
        if i < 1024:
            set_loopcontrol(resetval+=1, init+=1)
            set_loopbody(init+=1)
        else:
            set_loopcontrol(resetval-=1, init+=1024)
            set_loopbody(init+=1024)

Figure B.3: Pattern description code for a diagonal access on a 1024×1024 matrix

B.1.3 Cross access pattern

Greek Cross access patterns are common in the vast class of diamond-search motion estimation algorithms adopted in video encoding [41]. The pattern stands as an interesting data access example, as it makes use of the maximum number of loopcontrol units supported by the Micro16/AGC combination. In addition, despite its apparent complexity, the fact that the access sequence can be broken down into simpler and regular patterns results in a compact and straightforward pattern description code. Interestingly, the pattern depicted on the left-hand side of Fig. B.4 can be equally implemented by using two or three loopcontrol units without compromising the address generation rate, which is not true of most patterns. Thus, the decision of whether to use the maximum number of loopcontrol units or, instead, settle for an approach where the lack of the third unit is compensated by an increase in the pattern description code, comes down to an area occupation vs. code size tradeoff. The example code provided on the right of Fig. B.4 is based on an AGC with three loopcontrol units. Therefore, after a proper configuration of the AGC’s parameters, it is capable of autonomously generating the addresses that correspond to two 16×16 blocks.

The intervention of the microcontroller is thus reduced to switching between the two horizontally-disposed or the two vertically-disposed blocks. This is easily done by adding or subtracting an offset at the end of each section. Naturally, this offset depends on the total size of the memory buffer under consideration, as well as on the size of the sub-blocks that compose the cross. Similarly, applying this same pattern to larger matrices is simply a matter of modifying the starting value of the loop iterator.

    ; Initialize AGC
    lcl r1, 0
    lch r1, 64                    ; load 16384
    add erf.0.r0, r0, irf.r1      ; Lpbdy.initval=16384
    add erf.1.r0, r0, irf.r1      ; Lpctrl1.initval=16384
    add erf.2.r0, r0, irf.r1      ; Lpctrl2.initval=16384

    loadlit r1, 1
    add erf.0.r1, r0, irf.r1      ; Lpbdy.multval=1
    add erf.1.r1, r0, irf.r1      ; Lpctrl1.multval=1
    add erf.2.r1, r0, irf.r1      ; Lpctrl2.multval=1
    add erf.0.r2, r0, irf.r1      ; Lpbdy.incval=1

    loadlit r1, 1056
    add erf.1.r2, r0, irf.r1      ; Lpctrl1.incval=1056

    loadlit r1, 32
    add erf.2.r2, r0, irf.r1      ; Lpctrl2.incval=32

    loadlit r1, 16
    add erf.1.r3, r0, irf.r1      ; Lpctrl1.resetval=16
    add erf.2.r3, r0, irf.r1      ; Lpctrl2.resetval=16

    loadlit r1, 2
    add erf.3.r3, r0, irf.r1      ; Lpctrl3.resetval=2

    loadlit r2, 484               ; loop iterator
                                  ; start horizontal section
loop:
    done                          ; Start AGC
    wait                          ; move to vertical section

    lcl r1, 16
    lch r1, 64                    ; load 16400
    sub erf.0.r0, r1, erf.0.r0    ; Lpbdy.initval-=16400
    sub erf.1.r0, r1, erf.1.r0    ; Lpctrl1.initval-=16400
    sub erf.2.r0, r1, erf.2.r0    ; Lpctrl2.initval-=16400

    done                          ; Start AGC
    wait                          ; move to horizontal section

    lcl r1, 32
    lch r1, 64                    ; load 16416
    add erf.0.r0, r1, erf.0.r0    ; Lpbdy.initval+=16416
    add erf.1.r0, r1, erf.1.r0    ; Lpctrl1.initval+=16416
    add erf.2.r0, r1, erf.2.r0    ; Lpctrl2.initval+=16416

    dec irf.r2, irf.r2            ; r2--
    j.z end                       ; if r2 == 0
    jmp loop                      ; continue loop
end: jmp end                      ; wait forever

Equivalent pseudo-code:

    // Initialize AGC
    set_loopbody(mult=1, inc=1, init=16384)
    set_loopcontrol1(resetval=16, inc=1056)
    set_loopcontrol2(resetval=16, inc=32)
    set_loopcontrol3(resetval=2)
    for i = 1 to 484 do
        done                      // Start AGC
        wait                      // Wait to modify parameters
        set_loopbody(init-=16400) // Vertical
        done
        wait
        set_loopbody(init+=16416) // Horizontal

Figure B.4: Pattern description code for a Greek cross access pattern
