HotStream: Heterogeneous Many-Core Data Streaming Framework with Complex Pattern Support
Sérgio Micael Ferreira Paiaguá
Thesis to obtain the Master of Science Degree in Electrical and Computer Engineering
Examination Committee
Chairperson: Doutor Nuno Cavaco Gomes Horta
Supervisor: Doutor Ricardo Jorge Fernandes Chaves
Co-supervisor: Doutor Nuno Filipe Valentim Roma
Members of the Committee: Doutor Horácio Cláudio de Campos Neto, Doutor Paulo Ferreira Godinho Flores
October 2013
Abstract
The work herein presented proposes a data streaming accelerator framework that provides efficient data management facilities that can be easily tailored to any application and data pattern. This is achieved through an innovative and fully programmable data management structure, implemented with two granularity levels and complemented by a complete software layer, ranging from a device driver to a high-level API that provides easy access to every feature of the framework. Fine-grained data movements are made possible by an innovative Data Fetch Controller, powered by a custom microcontroller, which can be programmed to generate arbitrarily complex access patterns with minimal performance overhead. The obtained results show that the proposed framework is capable of achieving virtually zero-latency address generation and data fetch, even for the most complex streaming data patterns, while significantly reducing the size occupied by the pattern description code. To validate the proposed framework, two distinct case studies were considered. The first deals with the block-based multiplication of large matrices, while the second consists of a full image-processing application in the frequency domain. The experimental results obtained for the first case study demonstrate that, by enabling data reuse, the proposed framework increases the available bandwidth by 4.2×, resulting in a speed-up of 2.1× when compared to the existing related state of the art. Furthermore, it reduces the Host memory requirements and its intervention in the acceleration by more than 40×. The signal-processing case study revealed that an accelerator based on the proposed framework can achieve a linear relationship between the execution time and the size of the input image, which highly contrasts with CPU- or GPU-based alternatives. Frame rates of 40 and 2.5 FPS were obtained for 1024 × 1024 and 4096 × 4096 images, respectively.
Keywords: Stream computing, Many-Core Heterogeneous Architectures, Programmable Data Access Patterns, Data Reuse, Reconfigurable Devices, High-Speed Interconnections.
Resumo
This work proposes a data-stream-based acceleration platform that provides efficient data management, easily adaptable to any application or data access pattern. This is achieved through an innovative, fully programmable data management structure, composed of two granularity levels and complemented by an extensive software layer, which spans from the device driver to a high-level interface that grants easy access to every element of the platform. Fine-grained data control is guaranteed by an innovative Data Fetch Controller, commanded by a purpose-designed microcontroller capable of generating arbitrarily complex access patterns. The obtained results show that the proposed platform can generate addresses and access data almost immediately, whatever the data pattern in question, while also reducing the space required to store the pattern description. To validate the proposed platform, two distinct case studies were used. The first is based on the multiplication of large matrices, while the second consists of an image-processing application in the frequency domain. The results obtained for the first case study demonstrate that, by extensively exploiting data reuse, the proposed platform increases the bandwidth delivered to the computing units by 4.2×, resulting in a performance increase of 2.1× when compared with conventional implementations. Furthermore, the memory requirements imposed on the host machine are reduced by more than 40×. The second case study reveals that an accelerator based on the proposed platform guarantees a linear relationship between the execution time and the size of the image being processed, something the state of the art does not allow.
Keywords: Data stream computing, Heterogeneous Multi-core Architectures, Programmable Access Patterns, Data Reuse, Reconfigurable Devices.
Acknowledgments
Within the next 80 pages, a lot more than a master's thesis is contained. It obviously represents my hard work, dedication and effort over the last 8 months, but it is actually much more than that. This is the final step in a journey that I started back in 2008, a journey that has only been successful due to the invaluable help and companionship of a number of people who more than deserve to be mentioned in the following paragraphs.

First of all, I would like to express my deepest gratitude to the exceptional team of advisors I had the pleasure to work with. Ricardo Chaves, Nuno Roma, Pedro Tomás and Frederico Pratas, I really couldn't have hoped for a better supervision over the last months. From the lengthy but enlightening meetings, always accompanied by good humour and plenty of laughs, to your tireless effort in reviewing all of my work, I have no doubt that the quality of this thesis is, in great part, owed to all of you.

To all the amazing friends I made during these last five years, in particular Rui Coelho, Joana Marinhas, José Santos, Filipe Morais, João Carvalho and Rita Pereira, a big thank you for all your support throughout all the (mostly) good and bad times. A special thanks to my great friend José Leitão, who had a special impact on this thesis by keeping me company during the long work nights at INESC and for always having the time to share a laugh, or to happily engage in endless technical debates.

Finally, I thank my parents and my sister for, well, everything. Not exaggerating in the slightest, without them, this moment would simply not have happened. I am very grateful for all the wonderful guidance, patience and love they have so selflessly given me over the years.
Contents
1 Introduction 2
  1.1 Motivation ...... 3
  1.2 Objectives ...... 4
  1.3 Main contributions ...... 5
  1.4 Dissertation outline ...... 6

2 Technology Overview 9
  2.1 Stream Computing Platforms and Address Generation ...... 10
  2.2 PCI Express Interfaces ...... 11
  2.3 Shared Buses and Crossbars ...... 12
    2.3.1 Shared Bus ...... 12
    2.3.2 Crossbar ...... 12
  2.4 Networks On Chip ...... 13
  2.5 NoC Survey ...... 14
  2.6 Crossbar Survey ...... 15
  2.7 Summary ...... 15

3 HotStream Framework Architecture 17
  3.1 Host Interface Bridge ...... 19
  3.2 Multi-Core Processing Engine ...... 20
  3.3 The HotStream API ...... 21
  3.4 Data Fetch Controllers, Shared Memory and Auxiliary Units ...... 22
    3.4.1 Address Generation Core (AGC) ...... 23
    3.4.2 Micro16 microcontroller ...... 24
    3.4.3 Access to the Shared Memory ...... 27
  3.5 Data Stream Switch (DSS) and Core Management Unit (CMU) ...... 28
  3.6 Summary ...... 30
4 Host Interface Bridge 31
  4.1 PCI Express Infrastructure ...... 32
  4.2 Address Spaces and DMA ...... 33
  4.3 2D DMA Transfers ...... 34
  4.4 Device Driver and User Interface ...... 37
    4.4.1 Modifications to the MPRACE device driver ...... 38
    4.4.2 Configuring a data transfer ...... 38
  4.5 Summary ...... 40
5 Framework Prototype 41
  5.1 AXI Interfaces ...... 42
  5.2 HIB Implementation and Performance ...... 43
  5.3 Backplane Implementation and Performance ...... 47
    5.3.1 Hermes NoC ...... 48
      5.3.1.A Modified packet structure ...... 48
    5.3.2 AXI Stream Interconnect ...... 49
    5.3.3 Backplane Performance Evaluation ...... 50
      5.3.3.A Core Emulator and Stream Wrapper ...... 50
      5.3.3.B Testbench and Python script ...... 51
      5.3.3.C Results ...... 51
    5.3.4 Crossbar and NoC Comparative Evaluation ...... 53
  5.4 Shared Memory Performance ...... 55
    5.4.1 Cycle-Accurate Simulator ...... 55
  5.5 Summary ...... 56

6 Framework Evaluation 58
  6.1 General Evaluation ...... 59
    6.1.1 Resources Overhead ...... 59
    6.1.2 Stream Generation Efficiency ...... 61
  6.2 Case Study 1: Matrix Multiplication ...... 63
    6.2.1 Computing Cores ...... 64
    6.2.2 Roofline Model ...... 66
    6.2.3 Performance and Memory Usage ...... 67
  6.3 Case Study 2: Image processing chain in the frequency domain ...... 69
    6.3.1 Computing Cores ...... 71
    6.3.2 Performance and Scalability ...... 72
  6.4 Summary ...... 75

7 Conclusions and Future Work 77
  7.1 Conclusions ...... 78
  7.2 Future work ...... 80
A Appendix A 85
  A.1 Micro16 Instruction Set Architecture ...... 86
  A.2 HotStream Register Interface ...... 86
  A.3 HotStream API ...... 89

B Appendix B 95
  B.1 Pattern Description Examples ...... 96
    B.1.1 Linear and Tiled access pattern ...... 96
    B.1.2 Diagonal access pattern ...... 96
    B.1.3 Cross access pattern ...... 98
List of Figures
2.1 Structure of a 2D Mesh and 2D Torus NoC ...... 14
3.1 Structure and organization overview of the HotStream framework ...... 18
3.2 AGC in a 3-level nested loop configuration ...... 23
3.3 Architecture of the Micro16 microcontroller ...... 25
3.4 Core internal structure, comprising the PE (e.g., an application specific IP Core) and the co-located BMC ...... 28
3.5 Internal structure of the BMC, consisting of Write and Read control units, a Channel Arbiter and a Synchronizer block ...... 28
4.1 Address spaces and translation mechanisms on an x86-like architecture ...... 34
4.2 Mapping of a user buffer to the bus address space and subsequent creation of the corresponding SG descriptors. The size and number of the physical data chunks can vary considerably according to the size of the original buffer and the state of the physical memory ...... 34
4.3 Application of a 2D pattern to the mapping of Fig. 4.2 ...... 35
4.4 Application of a 2D pattern to a more realistic mapping between virtual and physical address space. This example highlights the descriptor savings that are possible by utilizing a DMA with 2D capabilities ...... 36
4.5 Flowchart of the algorithm that converts a list of SG DMA descriptors into a list featuring 2D transfers ...... 36
4.6 Operation of the HotStream gather() function, which gathers the various sub-blocks defined by a 2D pattern and places them linearly in a new user-space buffer ...... 37
5.1 Basic handshake principle utilized by the AXI4-Stream and other similar stream-based protocols. Retrieved from [2] ...... 43
5.2 Aggregate throughput of various PCI Express configurations. The dashed line accounts for protocol overhead as per [17] ...... 44
5.3 Measured aggregate throughput for back-to-back transfers with varying buffer size ...... 45
5.4 Chipscope waveforms obtained during a back-to-back transfer of 4 KB ...... 46
5.5 Time elapsed during the configuration of the send and receive transactions ...... 46
5.6 Aggregate throughput for a back-to-back transfer including the time taken for the transaction set-up ...... 47
5.7 Traffic patterns for the NoC simulation ...... 52
5.8 Data delivery throughput in various traffic configurations. In every one, the input throughput is reached asymptotically ...... 52
5.9 Source to destination latency when using best case or worst case routing. Duplicating the inserted data throughput does not affect latency ...... 53
6.1 Access patterns, with varying complexity degrees, adopted for the DFC evaluation ...... 61
6.2 HotStream-based implementation of the block-based multiplication algorithm, consisting of 3 Kernels to process multiple and concurrent data streams, where double buffering is used on the shared memory to overlap communication with computation ...... 64
6.3 8:1 binary reduction tree based on Xilinx Matrix Accumulators. The structure of the 16:1 reduction core follows the same architecture but with double the number of basic accumulators ...... 65
6.4 Internal structure of the multiplication core utilized in the HotStream implementation of the matrix multiplication. Sub-blocks from matrix A are stored and re-used during the computation of a full sub-block line from matrix B ...... 65
6.5 Roofline model for the matrix multiplication example: Cx and Hx denote the actual performance for each implementation (Conventional and HotStream, respectively); while the conventional solutions C2× and C4×, with 2× and 4× parallelism, respectively, are limited by the PCIe link (communication-bounded), all other implementations are computation-bounded ...... 66
6.6 Processing time taken on each step of the matrix multiplication algorithm for the considered implementations ...... 68
6.7 Core scalability of the three matrix multiplication implementations ...... 68
6.8 Host memory requirements for matrix multiplication implementations ...... 69
6.9 Image processing chain in the frequency domain, mapped to the HotStream framework ...... 71
6.10 Execution time and bus utilization for various image sizes and read and write burst sizes. Both transient (single frame) and steady-state (streaming) operation conditions are depicted ...... 73
6.11 FFT Execution time with CUFFT, a CUDA-based FFT library, for various image sizes [33] ...... 74
6.12 Execution time of a 2D FFT on a NVIDIA QUADRO FX5600 using CUFFT and on an Intel Dual Core Processor (6600) @ 2.4 GHz using FFTW [10] ...... 75
B.1 Pattern description code of a simple linear access with 1024 positions ...... 97
B.2 Pattern description code for a tiled 128×72 access ...... 97
B.3 Pattern description code for a diagonal access on a 1024×1024 matrix ...... 98
B.4 Pattern description code for a Greek cross access pattern ...... 99
List of Tables
2.1 Examples of architecture combinations supported by the ATLAS environment . . . 15
5.1 PCI Express Gen1 and Gen2 support on the AXI Bridge for PCI Express IP Core ...... 44
5.2 Hardware utilization of the Hermes NoC configured in a 2×2 mesh and the AXI Stream Interconnect Crossbar for a varying number of independent cores ...... 53
5.3 Throughput and latency of the Hermes NoC and AXI Stream Crossbar when interconnecting 4 Cores, under different traffic conditions ...... 54
6.1 Resource usage for each component in the MCPE and HIB (hardware platform: XC7VX485T Virtex-7 FPGA) ...... 60
6.2 Individual resource usage of the DFCs and BMCs (hardware platform: XC7VX485T Virtex-7 FPGA) ...... 61
6.3 Address generation rate and descriptor size of the considered access patterns (the adopted length of each pattern results from the parameterization depicted in Fig. 6.1) ...... 62
6.4 Resource usage of the cores utilized in the various implementations of the 4096×4096 matrix multiplication (hardware platform: XC7VX485T Virtex-7 FPGA) ...... 66
6.5 Resource usage of the cores utilized in the frequency domain processing case study (hardware platform: XC7VX485T Virtex-7 FPGA) ...... 72
A.1 ALU Register-Register operations ...... 86
A.2 Constant loading operations ...... 86
A.3 Low and High constant loading and miscellaneous operations ...... 87
A.4 Flow control operations ...... 87
A.5 CMU Address Mapping ...... 87
A.6 CMU Register Details ...... 88
Acronyms
AGC Address Generation Core
API Application Programming Interface
ASIC Application-Specific Integrated Circuit
BMC Bus Master Controller
CMU Core Management Unit
DFC Data Fetch Controller
DRAM Dynamic Random Access Memory
DSS Data Stream Switch
ERF External Register File
FFT Fast Fourier Transform
FPGA Field-Programmable Gate Array
FPS Frames Per Second
GPP General Purpose Processor
HIB Host Interface Bridge
IDE Integrated Development Environment
IOMMU Input/Output Memory Management Unit
IRF Internal Register File
ISA Instruction Set Architecture
MCPE Multi-Core Processing Engine
MMU Memory Management Unit
MSI Message Signalled Interrupts
NoC Network-On-Chip
PE Processing Element
PSR Program Status Register
RAM Random Access Memory
RTL Register Transfer Level
SG Scatter Gather
TLP Transaction Layer Packet
VLSI Very Large Scale Integration
1 Introduction
Contents
  1.1 Motivation ...... 3
  1.2 Objectives ...... 4
  1.3 Main contributions ...... 5
  1.4 Dissertation outline ...... 6
1.1 Motivation
One of the most critical aspects to be considered during the development of multi-core hardware accelerators is how to efficiently handle data transfers between the various Processing Elements (PEs) of the system. The architecture of the memory subsystem and of the communication data channels has a significant impact on the effective memory bandwidth that is made available to the PEs, and therefore on the overall system performance. In fact, an efficient and coordinated management of the data transfers is important not only because the PEs have different processing characteristics and capabilities (e.g., general purpose processors, application specific processors, or custom-designed accelerating cores), but also because applications often present distinct memory footprints and bandwidth requirements.

While traditional solutions (such as cache hierarchies) try to reduce the latency of accessing the data, they do not allow exploiting all levels of available data parallelism. Therefore, recent advances have encouraged researchers to exploit other models that are able to deal with the intrinsic constraints of the underlying Very Large Scale Integration (VLSI) technology and with the inherent parallelism of emerging applications. As a result, there has been an increasing interest in data stream computation models, which focus on decoupling communication from computation by exposing an additional level of concurrency. This type of concurrency is especially important in hardware accelerators, given the slow communication channels (e.g., buses) typically used to connect with the Host device. Nevertheless, while regular streaming patterns are easy to handle, complex memory accesses require more radical strategies to avoid long memory access times and to keep a high overall system performance.
Such accesses can occur either due to the intrinsic complexity of the underlying application or because multiple kernels concurrently access different memory regions. Moreover, when data streams produced by a given kernel are consumed by several other kernels at different paces, intermediate buffering is also required, further increasing the pressure on the memory subsystem. To minimize the impact of these problems, dedicated Address Generation Units (AGUs) can be employed, which (pre-)fetch the data with the specific pattern required by the target application. Moreover, data reuse mechanisms can reduce the number of effective memory accesses, by sharing some of the streams through alternative channels or by rearranging a stream before it is consumed by the next kernel.

The work presented herein describes the HotStream framework, a platform for the development of stream-based computing tasks that provides an easy means of implementing advanced features such as data (pre-)fetching, stream sharing, and support for arbitrarily complex data access patterns, generated by fully programmable AGUs. These units, which are an integral part of the framework, differ from most comparable solutions by providing a Pattern Description Language (PDL) that enables any pattern to be described in a compact and scalable format. This greatly contrasts with the approaches of competing solutions, where periodic repetitions within the pattern cannot be easily exploited.

The aforementioned stream-related features of the framework are supported on standard communication channels with added pattern-based data addressing mechanisms. Such addressing is structured with two levels of granularity: a coarse-grained data access from the Host to the accelerator, to maximize the transmission efficiency, and a fine-grained data access within the shared memory of the device, made available to all the cores within the accelerator, to maximize data reuse. This is very important, as it allows the designer to focus on the accelerator architecture and not on the surrounding infrastructures, which have the potential to severely limit the achievable performance if not carefully designed. It is also important to note that the HotStream framework is designed to be equally efficient regardless of the final implementation technology, be it a reconfigurable device, an ASIC, or a SoC combining the two. As such, the description of each component is accompanied by a number of requirements that must be met in order to guarantee a good overall performance.

Finally, a comprehensive software API was developed, which conveniently abstracts all the low-level interactions between the Host machine and the accelerator. This greatly reduces the development time of stream accelerators based on the HotStream framework. Furthermore, the open-source nature of the code and its extensive documentation promote further adjustments and modifications towards increasing the performance for a particular platform/application pair.
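To make the role of these programmable AGUs concrete, the following C sketch expands a compact nested-loop descriptor into the address stream it denotes. It is an illustration only: the function name, the three-level limit and the parameter layout are assumptions made for this example, mirroring the nested-loop style of the AGC described in Chapter 3, not the actual DFC interface.

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative 3-level nested-loop address generator: level i repeats
 * count[i] times and advances the address by stride[i] per iteration.
 * A tiled 2D access, for example, uses stride[0] = 1 (elements within a
 * tile row), stride[1] = the matrix row pitch (next tile row), and
 * stride[2] = the tile-to-tile hop. */
size_t agu_generate(uint32_t base,
                    const uint32_t count[3], const int32_t stride[3],
                    uint32_t *out, size_t max_out)
{
    size_t n = 0;
    for (uint32_t k = 0; k < count[2]; k++)           /* outer loop  */
        for (uint32_t j = 0; j < count[1]; j++)       /* middle loop */
            for (uint32_t i = 0; i < count[0]; i++) { /* inner loop  */
                if (n == max_out)
                    return n;
                out[n++] = base + (uint32_t)((int64_t)k * stride[2]
                                           + (int64_t)j * stride[1]
                                           + (int64_t)i * stride[0]);
            }
    return n;
}
```

With count = {4, 4, 2} and stride = {1, 1024, 4}, for instance, the generator walks two 4×4 tiles of a matrix with a 1024-element row pitch. The hardware AGC produces one such address per cycle instead of iterating in software, which is what enables the near-zero-latency fetch reported in Chapter 6.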
1.2 Objectives
This master thesis is primarily focused on the development of all the components required for the realization of the HotStream framework, namely: i) a Host Interface Bridge (HIB), which handles all the communications between the accelerator and the Host; ii) a Multi-Core Processing Engine (MCPE), capable of simultaneously hosting an arbitrary number of stream-based kernels; iii) auxiliary structures, such as a Data Stream Switch (DSS) and a Core Management Unit (CMU), to enable the cores to be interfaced on an individual basis, directly from the Host; iv) a backplane interconnection capable of providing full connectivity between all the cores in the MCPE without compromising the communication bandwidth; and v) a C-based API, providing high-level access to all the facilities offered by the framework.

Within the MCPE lies an element that warrants a separate discussion by itself: the Data Fetch Controller (DFC). The DFCs are fully programmable AGUs, associated with each core in the MCPE, that enable fine-grained access to the shared memory. These units feature a 16-bit microcontroller, tightly coupled with an address generation unit, which enables the description of arbitrarily complex access patterns through the compact (but still rich) instruction set provided by the microcontroller. The development of the Pattern Code is facilitated by a purposely designed assembler, which supports all the features commonly required by this type of program.

Furthermore, in order to properly characterize the various communication mechanisms and interfaces that can be used in the framework, several technologies are analyzed in terms of latency, bandwidth, and area occupation. In the particular case of the backplane interconnection, two very distinct implementations are explored, in order to select the one that better suits the characteristics of the target implementation: Networks-on-Chip (NoCs) or Crossbars.
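To illustrate why a microcoded approach keeps pattern descriptions compact, the toy interpreter below runs a hypothetical three-operation mini-ISA, invented purely for this sketch (the actual Micro16 instruction set is listed in Appendix A.1). A diagonal walk over an N×N matrix needs only a three-instruction loop body regardless of N, whereas an explicit address list would grow linearly with N.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mini-ISA for pattern description (NOT the real Micro16
 * ISA): EMIT outputs the current address, ADD offsets it by arg1, LOOP
 * branches back to arg1 a total of arg2 times (a single loop level only
 * in this sketch), and END stops execution. */
typedef enum { P_EMIT, P_ADD, P_LOOP, P_END } p_op;
typedef struct { p_op op; int32_t arg1, arg2; } p_insn;

size_t pattern_run(const p_insn *prog, uint32_t *out, size_t max_out)
{
    uint32_t addr = 0;
    int32_t remaining = -1;   /* iterations left in the active loop */
    size_t pc = 0, n = 0;
    for (;;) {
        p_insn in = prog[pc];
        switch (in.op) {
        case P_EMIT:
            if (n == max_out)
                return n;
            out[n++] = addr;
            pc++;
            break;
        case P_ADD:
            addr += (uint32_t)in.arg1;
            pc++;
            break;
        case P_LOOP:
            if (remaining < 0)
                remaining = in.arg2;      /* first arrival at the loop */
            if (remaining-- > 0)
                pc = (size_t)in.arg1;     /* branch back */
            else {
                remaining = -1;           /* loop finished */
                pc++;
            }
            break;
        case P_END:
            return n;
        }
    }
}
```

For the main diagonal of a 4×4 row-major matrix, the whole pattern program is {EMIT; ADD 5; LOOP back 3 times; END}: four instructions producing the addresses 0, 5, 10 and 15. The same four instructions with different operands cover any matrix size, which is the property exploited by the DFC to achieve the code-size reductions reported in Section 1.3.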
1.3 Main contributions
The evaluation of the proposed framework on a reconfigurable device showed that the embedded DFCs offer significant memory savings on the pattern description code. Compared to the existing related art, the proposed solution, based on the HotStream framework, achieves code size reductions above 1500×, with identical address generation rates. As an example, considering the block-based matrix multiplication case study, experimental results suggest that, given the extensive data reuse offered by the proposed HotStream framework, it is easy to achieve a 2× speed-up relative to the state-of-the-art implementation. Moreover, the proposed solution is able to reduce the Host intervention in the process by up to 45×, while requiring significantly less buffering from the Host. Consequently, the proposed framework allows for larger data scalability.

To properly characterize the communication channels considered in the framework, a detailed experimental analysis was conducted of the performance of PCI Express interfaces, one of the technologies that may be used to connect the HotStream-based accelerator to a Host. Moreover, a simulation-based performance assessment of the specific DDR3 memory/controller pair that was used in the prototyping phase of this work is also performed. These two subjects are notably poorly documented in the literature, despite being a fundamental part of the vast number of Field-Programmable Gate Array (FPGA)-based accelerators that have been proposed over the years. Likewise, the complete software infrastructure that accompanies the HotStream framework also deals with the issue of handling the communication between the Host and the accelerator, each with its own address space. While some of the concepts explored in this context are specific to PCI Express interfaces, most of them can be adapted to any other communication interface that makes use of a Direct Memory Access (DMA) engine to handle data transfers.
Moreover, the extensive comparison between Crossbar-based buses and Networks-on-Chip (NoCs) offers system designers an additional source of information when deciding between the two. In particular, the implementation results presented for a state-of-the-art reconfigurable device, such as the Virtex-7, provide new insights regarding the future of NoC structures on current-generation FPGAs. Comparable information is only available for older devices, which lag the capacity of modern offerings by an order of magnitude or more.

The preliminary insights of the work presented in this thesis were published in a paper presented at a national conference:
• Sérgio Paiaguá, Adrian Matoga, Ricardo Chaves, Pedro Tomás, Nuno Roma, Evaluation and integration of a DCT core with a PCI Express Interface using an Avalon interconnection, in IX Jornadas sobre Sistemas Reconfiguráveis (REC 2013), University of Coimbra, pages 93-99, February 2013.
This paper focused on an exploratory study based on a DCT accelerator communicating with a Host machine through a PCI Express interface. More recently, the HotStream framework and, in particular, its innovative data-fetching and stream management mechanisms were extensively discussed in another paper, presented at an international conference:
• Sérgio Paiaguá, Frederico Pratas, Ricardo Chaves, Pedro Tomás, Nuno Roma, HotStream: Efficient Data Streaming of Complex Patterns to Multiple Accelerating Kernels, in 25th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD 2013), October 2013.
An extended version of this paper is currently under preparation, to be submitted to the International Journal of Parallel Programming. Meanwhile, the main ideas explored in the context of this thesis also motivated the elaboration of an R&D project proposal:
• Streaming Complex Data Patterns on Heterogeneous Systems, submitted to the Portuguese Foundation for Science and Technology (FCT) in July 2013.
In the future, the ideas opened by the present thesis suggest multiple research directions, which will be further explored by a PhD candidate in the SIPS group at INESC-ID.
1.4 Dissertation outline
The work presented in this dissertation is organized in seven chapters. In Chapter 2, the related work in the area of stream computing, address generation units, and heterogeneous multi-core architectures is presented. A brief description of the PCI Express standard is provided, along with the rationale for its development. Different on-chip interconnection technologies are also described in this chapter, as they play a very important role in the internal architecture of the proposed framework. Chapter 3 provides an in-depth discussion of the various hardware and software elements that compose the framework. Naturally, the DFC is described in more detail, given its key importance within the framework. The two elements that make up the HIB are discussed in Chapter 4 for the particular case of a PCI Express interface between the Host and the accelerator. A particular emphasis is given to the PCI Express interface, since it is the most complex of the interfaces supported by the HotStream framework. This chapter also discusses the low-level components of the HotStream API, which handle the various address spaces that usually exist in a modern operating system and create the necessary descriptors for the operation of the DMA controller. Chapter 5 discusses the implementation of the evaluation prototype that was used to validate the platform, including a thorough characterization of all the communication mechanisms that were utilized. Experimental results are discussed in Chapter 6, where a detailed evaluation of the framework as a whole is provided, firstly by assessing the address generation efficiency and resource occupation of the HotStream infrastructure, and then by testing the framework with two case studies: the first based on the multiplication of very large matrices, and the second dealing with the processing of very large images in the frequency domain. Finally, Chapter 7 closes this dissertation with the main conclusions and directions for future work.
2 Technology Overview
Contents
  2.1 Stream Computing Platforms and Address Generation ...... 10
  2.2 PCI Express Interfaces ...... 11
  2.3 Shared Buses and Crossbars ...... 12
  2.4 Networks On Chip ...... 13
  2.5 NoC Survey ...... 14
  2.6 Crossbar Survey ...... 15
  2.7 Summary ...... 15
Given the broad scope of subjects covered by the HotStream framework, the comparison with the state of the art is not trivial. In fact, while a wide range of streaming architectures has been proposed over the years, most are completely self-contained, in the sense that the communication with an external general purpose processor, including all the necessary software layers from the low-level device drivers to the user-accessible APIs, is not considered. Section 2.1 discusses the related state of the art that shares the most points in common with the proposed HotStream framework, with an intentional bias towards architectures that tackle stream management and data access patterns, as these are the key elements of the proposed framework. Section 2.2, on the other hand, aims to provide the reader with the necessary background to better illustrate the choice of PCI Express as the main interface supported by the HIB.

Efficient and high-throughput communication between multiple streaming cores is a key feature of the MCPE, the part of the framework that hosts all the processing elements. Two main communication mechanisms exist within this component of the HotStream framework: i) a backplane interconnection, used to allow for stream reuse among the multiple kernels, by providing a network that is able to establish a full-duplex connection between any two cores with minimal latency and maximum throughput; and ii) a large-capacity shared memory, which enables stream buffering and rearrangement. Despite the requirement for a large storage capacity, this shared memory should still be capable of a high data access throughput, so as not to represent a significant bottleneck to the performance of the framework. Naturally, these requirements dictate that an off-chip DDR memory be used, as these are the only components able to couple a large storage capacity with high data access throughputs.
On the other hand, the backplane interconnection can be implemented with different technologies, namely shared buses, Crossbars or Networks-on-Chip. Naturally, each solution implies different area/performance trade-offs, which also depend on the target technology, i.e., Application-Specific Integrated Circuit (ASIC) or FPGA. Sections 2.3, 2.4, 2.5 and 2.6 describe the key features of these interconnection technologies.
2.1 Stream Computing Platforms and Address Generation
The increasing popularity of stream-computing models has led to the development of many specialized architectures that tackle the efficient fetching and management of data streams. Examples such as the IMAGINE stream processor [23][24] and the Merrimac stream-based supercomputer [11][14], which are based on clusters of PEs and Stream Register Files (SRF), offer simple data pre-fetch mechanisms which, in the case of the IMAGINE processor, transfer entire streams between the SRF and an off-chip SDRAM. As stated by the authors, only 50% of the optimal performance is achieved, which motivates the development of more efficient data management structures.
Given this bottleneck, other researchers have focused on improving the generation of data streams. After demonstrating that dataflow computing can lead to significant performance improvements in a wide range of applications, Pell et al. [32] developed the MaxCompiler, which maps the computing kernels to an FPGA. To generate the streams to be fed to the computing kernels, a set of commands is provided, instructing the developed tool-chain to automatically generate simple 1D, 2D or 3D data patterns. More complex patterns can only be described by using multiple commands [7], which is both time-consuming and results in a large configuration overhead. The same shortcomings are experienced by the Programmable Pattern-based Memory Controller (PPMC) [20], which eases the programming of regular 1D, 2D or 3D patterns through a set of function calls integrated in an API. Again, this solution falls short when long and/or complex patterns must be described.
The above pattern-generation solutions are actually very similar to the functionalities offered by modern DMA engines. For example, the Xilinx AXI DMA controller offers independent read and write channels, which provide high-bandwidth DMA between memory and stream-type peripherals [5]. With its scatter-gather capabilities and the support for 2D transfers, this controller can actually be used as a pattern generator. Its configuration is done by setting up a chain of descriptors that are then read by the engine, making this a rather similar solution to the one adopted by the PPMC [20]. Moreover, multichannel support is also offered through stream identifiers that accompany the data. This enables the multiplexing of the two available data channels, so that multiple masters and slaves can connect to a single DMA engine. Both the PPMC and the AXI DMA solutions are geared towards moving large and regular chunks of data, and fall short when even more complex access patterns are considered, such as the one used in the Smith-Waterman algorithm [25]. In contrast, the DFC herein proposed is capable of handling arbitrary patterns of varying complexity without significant penalties.
Furthermore, since the pattern description is not descriptor-based, there is essentially no limit to the length of the pattern to be generated.
2.2 PCI Express Interfaces
In the past, the PCI (Peripheral Component Interconnect), defined by the PCI Local Bus standard, was the most commonly used solution for connecting hardware devices in a computer system [29]. It experienced wide adoption over a long period of time, with components such as network, sound and graphics cards making extensive use of this solution, and can still be found in many modern motherboards. However, the growing requirements for communication bandwidth quickly made clear that PCI would not be a scalable solution.
To overcome the limitations of the original PCI standard, as well as of other standards such as PCI-X and AGP, the PCI-SIG (PCI Special Interest Group) jointly developed a high-speed serial alternative, PCI Express, officially abbreviated as PCIe. This new standard has become the de facto standard for high-speed interfaces with computer peripherals [22] due to its higher throughput, scalability, lower I/O pin count and native hot-plug capabilities, among other features. To date, two main revisions have been made to the PCIe specification, which have increased the
maximum transfer rates by a factor of 2 in each iteration, while maintaining backwards compatibility with previous versions. PCIe is a serial interface that achieves significant data rates by utilizing multiple lanes that operate in full-duplex. Lane widths of x2 to x16 are widely used, whereas x32 slots are very uncommon. Unlike the original PCI Local Bus standard, PCIe is a point-to-point connection. In order to make multiple interfaces available, root complex devices are used, which connect a processor and memory subsystem through a local bus to multiple ports, which can be further expanded by using special switches.
2.3 Shared Buses and Crossbars
The trend in modern digital design is to increase component reuse by resorting to verified IP Cores that implement the desired functionality behind a well-defined interface. This greatly improves time-to-market and significantly reduces design complexity. The interconnection between these blocks is, nevertheless, very important, as it very often determines the overall performance of the architecture. With the goal of further easing the interaction between these off-the-shelf components, the industry has moved to standard interconnection solutions, such as the AMBA (Advanced Microcontroller Bus Architecture) developed by ARM, of which the AXI (Advanced eXtensible Interface) family of interconnections is the most well known, or the IBM CoreConnect. These standard interconnection solutions usually implement a shared bus or Crossbar to enable the communication between the master and slave interfaces attached to the bus.
2.3.1 Shared Bus
A shared bus is, for the most part, a collection of wires interconnecting the various interfaces, while a central arbiter grants the masters exclusive access to the bus. The arbitration is done according to a statically defined rule, such as round-robin, or using a set of priorities attributed to each element. In this configuration, whenever a master takes control of the shared bus, all attached slaves have access to the information being transmitted and will act on it if their ID is specified in an appropriate control signal. Apart from the obvious observation that the peak bandwidth is limited to the maximum bandwidth achievable between any master-slave pair, it is further compromised by the difficult task of maintaining high clock frequencies as the number of interfaces, and thus the length of the interconnecting wires between them, increases [8].
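As an illustration, the work-conserving round-robin policy described above can be modeled in a few lines of C. The helper below is a hypothetical sketch, not part of any bus standard: starting from the master after the last grant, it selects the first one with a pending request.

```c
#include <assert.h>

/* Model of a work-conserving round-robin arbiter for n masters.
 * req[m] is non-zero when master m has a pending request; `last` is
 * the index granted in the previous cycle.  Returns the index of the
 * granted master, or -1 when no request is pending. */
static int rr_grant(const int *req, int n, int last)
{
    for (int i = 1; i <= n; i++) {
        int m = (last + i) % n;   /* scan in circular order after `last` */
        if (req[m])
            return m;
    }
    return -1;                    /* bus stays idle this cycle */
}
```

Because the search restarts just past the previously granted master, no requester can be starved: every master is reconsidered at most n cycles after its last grant.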
2.3.2 Crossbar
As an evolution of the standard shared-bus, Crossbar architectures greatly increase the aggregate bandwidth by allowing simultaneous transactions between independent master-slave pairs. In this topology, the number of interconnecting wires is greatly increased, as each master now has a dedicated route to each slave. Naturally, an arbiter circuit is still required to ensure that any
master can communicate with any slave within a reasonable waiting time. Although the peak aggregate bandwidth is now increased by a factor equal to the number of simultaneous connections, the hardware complexity is considerably higher than that of the shared-bus solution, meaning that interconnecting an increasing number of interfaces will require extra effort if the same clock frequency is to be maintained. As such, it is common to resort to a hierarchical use of Crossbar interconnections in order to keep the interconnection density low, while slightly sacrificing the aggregate throughput, given that the number of connections that can be established simultaneously will be lower [34].
2.4 Networks On Chip
While Crossbars have proved able to fulfil the bandwidth requirements of modern IP Cores and SoCs, their increasing number and heterogeneity is still a challenge when designing SoCs that must adhere to strict area budgets and operating frequency targets. The key to addressing these issues is to decouple the Transport Layer from the Physical Layer, by utilizing a packet-based Transport Protocol [8]. Packets are usually composed of a header, payload and trailer, and are routed across the network according to a certain routing algorithm.
Networks on Chip are inherently more scalable than their bus-based counterparts, as the specific bandwidth and interface requirements of each IP Core can be met on a per-node basis. In fact, when interconnecting multiple cores with a crossbar, the width of the internal connection buses is determined by the bandwidth requirements of the fastest master-slave combination. Thus, when connections with inferior throughputs are established, the available bandwidth is under-utilized and resources are wasted. On the other hand, NoCs provide the ability to optimize the data links between the various switches that compose the network, in order to maximize overall throughput and quality of service (QoS), while minimizing circuit area [8].
Inspired by traditional computer networks, NoCs are usually characterized by the communication mechanism, switching mode and routing algorithm. All these parameters are functions of the network topology, i.e., the way in which the switching elements are arranged. Common topologies are the 2D mesh, 2D torus, folded 2D torus and bi-directional ring, although many others exist. Figure 2.1 depicts the structure of the first two topologies.
The communication mechanism defines how messages traverse the network and usually falls within two categories: circuit switching and packet switching. In circuit switching, a connection between a source and a destination is established before any packet is sent.
During the lifetime of the connection, the links involved cannot be used by packets with any other origin, which may lead to under-utilization. On the other hand, in packet switching a connection is never established; the routing decisions are made at run time and on a per-packet basis, which leads to more efficient network usage, albeit with a slight increase in logic complexity. Packet switching requires the use of a switching mode, which defines how packets move through the switches. While many techniques exist, the most common are store-and-forward,
Figure 2.1: Structure of a 2D Mesh and a 2D Torus NoC
virtual cut-through and wormhole [30]. The first two operate on full packets. In store-and-forward mode, a switch buffers a packet completely before it is sent to the next switch, which increases transmission latency and hardware requirements. Virtual cut-through is similar, but a switch can start forwarding a packet as soon as the next switch indicates that it has a buffer that is big enough to hold the full packet, which slightly reduces communication latency. Finally, wormhole switching reduces buffering requirements by splitting a full packet into various sub-packets of fixed size, designated as flits. Only the header flit possesses routing information and, therefore, the payload flits must follow the same path reserved by the header.
The path taken by a packet from source to destination is defined by the routing algorithm. The most common algorithms are distributed and perform the routing decisions on a per-node and per-packet basis. In addition, these can be either deterministic or adaptive, depending on whether the routing decision takes into account the current network traffic [30]. Naturally, deterministic routing algorithms, such as XY routing, lead to lower resource usage.
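To make the simplicity of deterministic routing concrete, one hop of XY routing on a 2D mesh reduces to two comparisons: travel along the X dimension until the packet is in the destination column, then along Y. The direction encoding below is an illustrative assumption for this sketch, not taken from any specific NoC.

```c
#include <assert.h>

/* Possible outputs of one routing decision (illustrative encoding). */
enum dir { GO_EAST, GO_WEST, GO_NORTH, GO_SOUTH, ARRIVED };

/* One hop of deterministic XY routing on a 2D mesh: first resolve the
 * X coordinate, then the Y coordinate.  (cur_x, cur_y) is the current
 * switch, (dst_x, dst_y) the destination node. */
static enum dir xy_route(int cur_x, int cur_y, int dst_x, int dst_y)
{
    if (cur_x < dst_x) return GO_EAST;
    if (cur_x > dst_x) return GO_WEST;
    if (cur_y < dst_y) return GO_SOUTH;  /* y assumed to grow downwards */
    if (cur_y > dst_y) return GO_NORTH;
    return ARRIVED;
}
```

Since the decision depends only on the current and destination coordinates, and never on network traffic, the hardware is minimal and the algorithm is trivially deadlock-free on a mesh, which is why XY routing leads to lower resource usage than adaptive alternatives.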
2.5 NoC Survey
While much research has been done on the subject of Networks on Chip, most of this body of work focuses on the influence of the various routing and arbitration algorithms on the flow of data within the network. Software models are typically used to ease the testing of the proposed solutions under different traffic loads. Thus, there is an evident shortage of publicly available NoC implementations. The following details the NoCs available in the state of the art.
NOCem [35] is a configurable NoC architecture which implements packet switching with optional virtual channels and supports three of the most common topologies, namely Mesh, Torus and Double Torus. A more comprehensive alternative is proposed by the European Space Agency (ESA) in the form of the SOCWire [31]. This modular solution is composed of the SOCWire Switch, which provides wormhole routing and round-robin arbitration, and the SOCWire CODEC, responsible for enabling the communication between the nodes and the routing elements. To ensure
proper operation in hazardous environments, the design is fault-tolerant and includes hot-plug abilities for dynamically reconfigurable modules. The ATLAS project [19] is an ambitious Java-based environment for the generation of different NoC architectures, which are configurable according to a large set of parameters, such as the topology, the number of virtual channels and the routing algorithm. Table 2.1 presents some of the supported parameters for two of the architectures generated by the framework.
Table 2.1: Examples of architecture combinations supported by the ATLAS environment
Parameter          Hermes             Mercury
Topology           2D Mesh            2D Torus
Virtual Channels   1, 2, 4            1
Routing Alg.       XY or West-first   Adaptive
Scheduling Alg.    Round Robin        Round Robin
The support for virtual channels and the simpler routing algorithm, which reduces hardware utilization, led to the adoption of the Hermes architecture as the Network on Chip against which the Crossbar solution is compared. The SOCWire, with its large array of features, proved to utilize too many resources to be competitive with Crossbar solutions, while NOCem employs store-and-forward instead of wormhole switching, thus forgoing the reductions in latency, buffering requirements and, therefore, overall hardware resources that the latter provides.
2.6 Crossbar Survey
Unlike NoCs, Crossbars benefit from greater standardization, as they are central elements of the multiple SoC interfaces offered by the various IP Core vendors. Widely used interfaces such as Altera Avalon, AMBA AXI, AMBA AHB, IBM CoreConnect and the open-source Wishbone all make extensive use of crossbar modules in their infrastructure. Moreover, as the goal of such standards is to promote interoperability and design reuse across multiple vendors, they are all offered with no associated fees or royalties. This results in a significant availability of high-quality crossbar implementations, which are usually very close in terms of performance and area requirements. Thus, the deciding factor is usually dictated by the target technology or by the bus architecture used by the rest of the system, i.e., if the rest of the SoC is interconnected by an AXI infrastructure, it is only natural that the need for a stream-based Crossbar is fulfilled by the AXI Stream Interconnect.
2.7 Summary
Most streaming architectures described in the literature do not encompass the hardware modules and associated software that are required to communicate with an external host. Several authors have identified data management structures as being the main bottleneck
when developing such co-processors. Some of these authors report a performance degradation, attributable to these elements, of up to 50%. Thus, much research is being directed to the development of efficient stream management and, as a consequence, pattern generation mechanisms. However, most approaches focus on the creation of address generation units which are optimized to perform large data transfers with regular patterns. When finer control over the data is required, large configuration overheads are incurred by such solutions.
Like any streaming architecture, the performance of the HotStream framework is largely influenced by the communication channels it utilizes. Thus, it is important to use state-of-the-art off-chip and on-chip communication solutions. As far as off-chip interconnections are concerned, PCI Express is the most widely used interface in modern computational systems. It offers a significant aggregate bandwidth by leveraging multiple parallel lanes of full-duplex serial connections.
On-chip interconnections benefit from a broader array of solutions. While Crossbars have become the de facto standard for the interconnection of high-performance IP Cores in modern SoCs, Networks on Chip (NoCs) are becoming an important alternative. In comparison to the former, NoCs offer increased flexibility, scalability and performance. However, there is a clear shortage of publicly available NoC implementations, which further complicates their evaluation and application to real designs. In the case of the HotStream framework, the decision to employ either one of these solutions is dictated by their performance vs. area trade-offs and, most importantly, by their capability to scale.
3 HotStream Framework Architecture
Contents
3.1 Host Interface Bridge ...... 19
3.2 Multi-Core Processing Engine ...... 20
3.3 The HotStream API ...... 21
3.4 Data Fetch Controllers, Shared Memory and Auxiliary Units ...... 22
3.5 Data Stream Switch (DSS) and Core Management Unit (CMU) ...... 28
3.6 Summary ...... 30
[Figure 3.1 (block diagram): a General Purpose Processor (GPP) running the source code, the HotStream API and the device driver connects, through a high-speed interconnect (e.g., PCIe, AXI, CoreConnect), to the hardware accelerator. Within the accelerator, the Host Interface Bridge (HIB), comprising a DMA controller and a Data Stream Bridge, feeds the Multi-Core Processing Engine (MCPE), which contains the Data Stream Switch, the CMU, the backplane interconnect, a set of Cores (each a PE coupled with a DFC) and the shared memory.]
Figure 3.1: Structure and organization overview of the HotStream framework
The proposed HotStream framework, depicted in Fig. 3.1, is a comprehensive solution for the development of stream-based architectures, composed of a software layer and a hardware layer. The software layer integrates: i) a convenient API that allows the programmer to specify any arbitrarily complex streaming pattern; and ii) a Device Driver to map the user-specified memory buffers, allocated in user space, to the physical address space. This device driver also serves as the data transfer peer, on the software side, that assures appropriate integration with the hardware layer.
The hardware layer is composed of: i) the Host Interface Bridge (HIB), responsible for handling the data management between the Host processor and the accelerator; and ii) the Multi-Core Processing Engine (MCPE), responsible for managing the data streams between the PEs within the accelerator. The proposed architecture is designed to be fully scalable and adaptable (by supporting a variable number of data streams and PEs), as well as flexible enough to support applications with different and arbitrarily complex streaming patterns with minimal effort. Moreover, one of the main features that sets it apart from the current state of the art is its support for efficient data fetch and reuse within the accelerator architecture.
The above-mentioned characteristics are mainly achieved by providing two distinct levels of data access patterns, with different intrinsic granularities and complexity degrees. The first level is implemented within the HIB and supports simpler patterns of a more coarse-grained nature, as communication channels of this type typically benefit from transfers involving larger data chunks.
This first level of granularity is identical to what is provided by the PPMC [20]. The second and more fine-grained level of granularity is implemented within the MCPE, supporting more complex streaming patterns.
3.1 Host Interface Bridge
Copying data from the Host processor memory system to the MCPE is a complex procedure. It requires the intervention of: i) the device driver, on the Host side; and ii) a specific hardware structure, the HIB, that is able to autonomously issue data transfers (read/write requests) between the main memory of the Host and the MCPE. The Host Interface Bridge (HIB) mainly consists of two modules: the Data Stream Bridge (DSB) and the Direct Memory Access (DMA) controller. The DSB is responsible for interfacing the hardware accelerator with the Host General Purpose Processor (GPP). The adopted interfacing standard and corresponding communication structure should be adapted to the data transfer facilities and infrastructures that are offered by each specific Host. Some supported standards are PCIe, AMBA AXI, CoreConnect, etc. As a consequence, the implementation of this bridge must be adapted to the specific requisites of each application. The co-located DMA controller ensures the management of the coarse-grained data transfers between the Host GPP and the hardware accelerator.
Despite the implementation efficiency of the different entities involved in a single data transaction, an unavoidable amount of overhead is expected, given the various operations that are involved. Fortunately, this overhead bears a weak dependence on the size of the data to be transferred. Thus, transferring large data chunks guarantees a more efficient utilization of the available bandwidth in the data channel. However, complex streaming patterns often require accessing data that is not laid out linearly in the Host memory but, instead, spread over a regular pattern of contiguous blocks separated by non-unit strides. In such situations, transferring the smallest data chunk that encompasses each set of contiguous blocks inevitably results in a waste of bandwidth.
The solution herein proposed to tackle this issue is to implement coarse-grained patterned data transfers between the Host and the MCPE, such that the HIB transfers only useful data, but in large chunks. These coarse-grained data access patterns can be easily accomplished by setting up the DMA controller to transfer each of the contiguous data blocks that constitute the pattern from their physical locations in the Host memory, according to the regions mapped by the device driver. The set-up phase consists of creating scatter-gather descriptors that configure the DMA engine, arranged in a chain, each defining the starting position and size of a memory block to be read or written. As the number of contiguous blocks described by a given pattern increases, the number of required descriptors increases in the same proportion. Hence, in order to minimize the impact of this increase, the simple and traditional DMA engine can be replaced by a more efficient alternative capable of performing, at least, the most regular 2D memory accesses, i.e., each
memory transaction can be described by the tuple {OFFSET, HSIZE, STRIDE, VSIZE}, specifying the starting address of the first memory block, the size of each contiguous block, the starting position of the next contiguous block with relation to the previous one, and the number of repetitions of the two previous parameters, respectively. This reduces the total number of descriptors needed to describe a given pattern. Thus, the better a pattern fits within a 2D description, the greater the observed reduction.
While the size and nature of the patterns applied to the data transfers between the Host and the MCPE are technically only limited by the space available to store the descriptors, these should not be too fine-grained, in order to avoid a detrimental impact on throughput. Therefore, in the proposed HotStream framework, the API provided to the programmer includes a special call to gather an arbitrary sequence of data segments stored across the memory space into a single, larger, contiguous buffer that can then be transferred at once. This gathering operation takes a non-negligible time to complete; thus, it is only useful when the incurred penalty does not exceed the overheads of transferring the smaller non-contiguous individual data chunks.
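To make the {OFFSET, HSIZE, STRIDE, VSIZE} tuple concrete, the sketch below expands one such descriptor into the sequence of byte addresses it covers. The struct and function names are illustrative only, not part of the framework's actual interface.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative 2D transfer descriptor: VSIZE contiguous blocks of
 * HSIZE bytes, the first starting at OFFSET, with consecutive block
 * start addresses spaced STRIDE bytes apart. */
struct desc2d {
    size_t offset;  /* start address of the first block */
    size_t hsize;   /* bytes in each contiguous block   */
    size_t stride;  /* distance between block starts    */
    size_t vsize;   /* number of block repetitions      */
};

/* Write every byte address touched by the descriptor into out[];
 * out[] must hold hsize * vsize entries.  Returns the count. */
static size_t expand2d(const struct desc2d *d, size_t *out)
{
    size_t n = 0;
    for (size_t v = 0; v < d->vsize; v++)        /* VSIZE repetitions  */
        for (size_t h = 0; h < d->hsize; h++)    /* one HSIZE block    */
            out[n++] = d->offset + v * d->stride + h;
    return n;
}
```

A single descriptor thus replaces the VSIZE plain scatter-gather descriptors that a traditional DMA engine would need for the same strided pattern, which is exactly the reduction argued above.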
3.2 Multi-Core Processing Engine
The MCPE is where the actual computation takes place and is designed to support multiple independent and heterogeneous cores that collaboratively execute the multiple streaming kernels. Moreover, each kernel can span several Cores, to further exploit data parallelism. As depicted in Fig. 3.1, it consists of: i) multiple Cores, each composed of a PE, responsible for the computation, and one or more Data Fetch Controllers (DFCs), responsible for data management; ii) a high-speed Backplane Interconnection, able to dynamically route the data streams between the Cores, promoting the required data reuse schemes; iii) a shared memory, which allows rearranging the stream access patterns and reusing data; iv) a Data Stream Switch (DSS), to route the data streams coming from the Host to either the backplane or the shared memory. Accordingly, the data streams transferred from the Host via the DSS, or those produced by an individual Core, can be routed to other Cores in the MCPE via the backplane interconnection or stored in the shared memory for later reuse; and v) a Core Management Unit (CMU), a register-based interface for (re)starting the Cores, configuring their instruction memories or configuring interrupt generation on a per-Core basis.
The high-speed stream-oriented Backplane Interconnection must ensure that each Core can communicate with any other with a minimum routing delay. In addition, multiple connections may need to be active at any given time. Taking these requirements into account, a high-speed interconnection network is required. Sophisticated Network-on-Chip (NoC) solutions are likely to provide higher system scalability and better support for heterogeneity among the Cores in terms of data interfaces. However, one must also ensure that the amount of hardware resources required by the interconnection network is minimal, saving space for extra computing Cores.
The shared memory, on the other hand, is particularly important for applications that exploit
different types of access patterns or when data reuse is exploited between the PEs. Whenever a stream needs to be rearranged before it is consumed by another Core (or even streamed back to the Host machine), it can be buffered in the shared memory. As such, this element must be accessible by all the Cores, employing a simple work-conserving round-robin arbitration mechanism, which makes sure that all read and write requests are served with equal priority and no starvation occurs. In addition, since this is an address-based element in an otherwise stream-oriented architecture, reading and writing operations require the inbound or outbound data streams to be accompanied by a stream of addresses. The generation of these addresses is carried out by the DFC unit within each Core (further detailed in Section 3.4), and allows the implementation of fine-grained streaming patterns of variable complexity, ranging from simple linear accesses to more exotic and complex ones, such as diagonal or cross-shaped patterns.
Accordingly, the DFCs, which are responsible for these fine-grained data access patterns, are implemented through small programmable units directly coupled with the PEs (one for each outbound or inbound stream) that generate the addresses for each data element within the stream (or for groups of data elements, if incremental bursts are used, i.e., multiple data elements are stored or retrieved from sequential memory locations). Unlike the coarse-grained patterns supported by the DMA engine, these units are able to generate address patterns with a resolution down to the single-address level. In fact, since the DFC can be programmed to describe long-running complex patterns with common loop structures, there is effectively no limit to the type of patterns that the programmer can describe.
Additionally, the pattern specification takes virtually no space, thus avoiding the penalties resulting from the descriptor-based data-fetching mechanisms used in the state-of-the-art approaches, such as the PPMC [20].
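As a simple instance of such a fine-grained pattern, the diagonal access mentioned above reduces, for an N × N row-major matrix, to a single linear rule: each step skips one full row plus one column. The helper name and layout assumptions below are illustrative only.

```c
#include <assert.h>
#include <stddef.h>

/* Generate the element addresses of the main diagonal of an n x n
 * row-major matrix stored at `base`, with `esize` bytes per element.
 * Each diagonal step advances by one row (n elements) plus one column,
 * i.e. (n + 1) elements; out[] must hold n entries. */
static void diag_pattern(size_t base, size_t n, size_t esize, size_t *out)
{
    for (size_t k = 0; k < n; k++)
        out[k] = base + k * (n + 1) * esize;
}
```

A descriptor-based engine would need one descriptor per diagonal element for this pattern, whereas a programmable address generator expresses it as one short loop, which is the point made above about single-address resolution.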
3.3 The HotStream API
The seamless integration between the acceleration hardware and the Host is accomplished by the HotStream API. This software layer abstracts the low-level interactions between the device driver and the HIB and provides the user with a convenient and well-documented set of calls that give easy access to the various features of the HotStream framework. The API (see Appendix A.3), written in C, is further subdivided into four logical groups, each fulfilling a special set of tasks within the framework: i) Core Management; ii) Framework Management; iii) Data Management; and iv) Pattern Definition. The Core Management group provides calls to configure the Instruction Memory of individual Cores, as well as to issue individual or global resets and manage interrupts. These procedures can be performed on a per-Core basis or applied to multiple Cores at once by utilizing vector variants of the same calls. The Data Management and Pattern Definition groups of functions allow the creation of data streams with different coarse-grained patterns, in order to meet the requirements of particular applications. These streams can either target or be sourced from the shared memory or the high-speed backplane without additional configuration. Finally, the Framework
Management set of calls is responsible for initializing and gracefully terminating the operation of the framework.
Two pattern creation calls, HotStream_2D() and HotStream_Block(), leverage the 2D capabilities of the DMA engine and are complemented by the gather function, HotStream_gather(), introduced in Section 3.1. While the first function configures a stream with the basic parameters outlined in Section 3.1, i.e., {OFFSET, HSIZE, STRIDE, VSIZE}, providing the maximum flexibility for the definition of a stream, the second call provides an additional level of abstraction when generating tiled patterns. The recurrence of such patterns in streaming applications motivated the development of this custom call, which only requires the original matrix and tile sizes, along with the size, in bytes, of each matrix entry, to be specified in order to configure the data stream. The HotStream_Linear() call can be used when no complex patterns are required; thus, only the OFFSET within the user-provided buffer and its TOTAL_SIZE need to be specified.
3.4 Data Fetch Controllers, Shared Memory and Auxiliary Units
The DFCs are undoubtedly the central and most important elements of the MCPE. These units are responsible for single-handedly extracting the data from the (address-based) shared memory with arbitrarily complex patterns, and for forming the data streams that are presented to each Core's PE, while the latter remains completely oblivious as to the origin of the data that it is consuming. In particular, each DFC is responsible for generating the corresponding read and write data transactions, according to the defined streaming pattern. Each DFC has its own instruction memory that is programmed from the Host machine through a compact but complete ISA, by using a custom assembler with syntax validation. This instruction memory can be dynamically updated and its size represents the only limitation to the complexity of the considered pattern. However, as long-running patterns can be described by loops, the size of the instruction data is kept relatively small and independent of the extension of the pattern. In addition, the DFC is optimized to take advantage of features provided by the considered bus protocol (e.g., AMBA AXI), such as burst commands, to minimize the existing overheads.
In order to handle these tasks, the DFCs incorporate two fundamental blocks that operate together: i) the Address Generation Core (AGC); and ii) a custom small-footprint 16-bit microcontroller (Micro16). The AGC is a small specialized processor that autonomously generates addresses in a linear, 2D or 3D fashion. On the other hand, the Micro16 microcontroller is capable of generating combinations of linear, 2D and 3D patterns that are sequentially requested to the AGC in order to construct more complex stream patterns. The two units interact via a small shared register file, the External Register File (ERF). This way, while the AGC is generating a sequence of addresses, the microcontroller concurrently modifies the ERF with all the required parameters for the next regular pattern.
The combination of these two units makes it possible to describe patterns as complex as required, without ever compromising the address generation rate.
In Appendix B, examples of patterns with fundamentally different characteristics are provided, along with the Pattern Description Code that instructs the DFC to generate such access patterns.
3.4.1 Address Generation Core (AGC)
The AGC effectively emulates traditional nested loops, as found in most programming languages, by specifying, for each loop level, the number of iterations to be executed. Each level is implemented through a Loopcontrol unit that independently counts down from a starting value to zero, generating an interrupt upon completion. In addition, each of these Loopcontrol units holds the necessary parameters to determine the starting address of the AGC during the next iteration of the loop level. By combining multiple Loopcontrol units in a daisy-chain structure and by routing the interrupt signal of the innermost levels to the enable input of the outermost ones, an N-level nested loop can be designed. Figure 3.2 illustrates the required configuration to implement a 3-level loop. It should be noted that, regardless of the number of Loopcontrol units desired, the AGC is able to automatically configure all the necessary internal connections between the various elements, as its hardware description is based on generic VHDL parameters that selectively add the necessary blocks and interconnections.
Figure 3.2: AGC in a 3-level nested loop configuration.
The body of the loop is emulated by the Loopbody unit, which generates one address per clock cycle based on three basic configuration parameters: increment, multiplication, and initial value. This trio makes it possible to generate any affine linear access pattern, i.e., patterns of the type y_n = y_{n-1} × m + i, which represent the great majority of the indexing needed by most scientific applications [16]. Hence, the Loopbody address generation is controlled by the associated Loopcontrol units, which interrupt the former whenever the iteration limit in any of the nested loop levels is reached. This results in a two clock cycle delay to compute the next starting address. Considering that any delay introduced in the data fetching procedure may potentially slow down multiple PEs, it is of the utmost importance for the address generation to be essentially continuous. While this is a reasonable requirement when the supported patterns are restricted to 2D or even 3D sequential accesses, supporting arbitrarily irregular patterns with changing starting positions requires a more sophisticated approach. Therefore, for more complex patterns the configuration of the AGC relies on a double-buffered scheme, which is accomplished by duplicating the configuration registers used by the Loopbody and Loopcontrol units. With this architecture, the Micro16 is able to compute and configure the loop parameters for the following portions of the pattern, concurrently with the AGC execution, i.e., without interrupting the address generation.
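The interplay between the nested Loopcontrol levels and the affine Loopbody can be illustrated with a small software model (a sketch for a fixed 3-level configuration; the parameter names are illustrative, not the actual register names):

```python
def agc_stream(start, incr, mult, counts, steps):
    """Toy model of a 3-level AGC configuration.

    counts: iterations per level, outermost to innermost.
    steps:  address offset contributed by each of the two outer
            Loopcontrol levels when they iterate (illustrative
            restart parameters).
    The innermost level emulates the Loopbody, emitting one address
    per "cycle" according to y_n = y_{n-1} * mult + incr.
    """
    addresses = []
    for i in range(counts[0]):            # outermost Loopcontrol
        for j in range(counts[1]):        # middle Loopcontrol
            y = start + i * steps[0] + j * steps[1]
            for _ in range(counts[2]):    # Loopbody iterations
                addresses.append(y)
                y = y * mult + incr
    return addresses

# Row-major scan of a 4x4 matrix stored linearly:
row_major = agc_stream(0, 1, 1, (1, 4, 4), (0, 4))
# Column-major (transposed) scan of the same matrix:
col_major = agc_stream(0, 4, 1, (1, 4, 4), (0, 1))
```

Changing only the increment and the per-level step parameters switches between a row-major and a transposed scan, which is precisely the kind of reconfiguration the Micro16 performs on the ERF between pattern segments.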
3.4.2 Micro16 microcontroller
The Micro16 is a custom microcontroller designed to configure and control the AGC. While the overall architecture closely follows a traditional single-cycle RISC, it also comprises customized features aimed at easing the integration with the AGC. One such feature is the incorporation of an ERF, which is composed of the local register files from the Loopcontrol and Loopbody units within the AGC. Despite its external nature, any of these registers can be used as a source or destination in ALU-based operations, with no additional latency. The microcontroller encompasses a simple interrupt controller, which makes it possible to trigger and wait for events on one of the multiple interrupt lines available. Notably, one of the interrupt lines is used exclusively to communicate with the AGC. An interrupt sent from the Micro16 to the AGC lets the latter know that new parameters have been set and may be read in. Next, the microcontroller must wait on an interrupt from the AGC indicating that the parameters have been stored and the ERF may be modified again. This procedure is easily coded in assembly by first writing to the interrupt line selector register and subsequently triggering and waiting on an interrupt, through two dedicated instructions. The remaining lines are particularly useful when synchronization between multiple AGUs is needed. This is the case, for example, in applications where multiple heterogeneous cores that depend on one another are running in parallel.

The architecture of the Micro16, depicted in Fig. 3.3, was designed to be as compact as possible, since the DFC is replicated for each core in the MCPE. This promotes the scalability of the framework, while keeping its overall resource requirements low. With these constraints in mind, the datapath is 16 bits wide and the internal register file is composed of a reduced set of 4 registers, one of which is tied to zero and doubles as the interrupt line selector.
However, in order to overcome this limited number of storage elements, a stack with a depth of 32 words was incorporated, which greatly enhances the programmability of the microcontroller without compromising its resource usage. In order to ensure that the Micro16 is capable of addressing large shared memories, a base address register is included. By setting this register through a specific instruction, a 32-bit address bus is created, capable of addressing 4G words of memory. Despite its name, the Micro16 can also be configured in 32-bit mode, which effectively doubles the width of all the buses in the datapath and inevitably increases the hardware requirements. This configuration is particularly useful when a given application frequently addresses data that is scattered across the shared memory and the repeated modification of the base address register results in a decrease of the address generation rate. In this mode of operation, the instruction width is still maintained at 16 bits, in order to leverage the highly compact instruction set and to keep the instruction memory usage to a minimum.

The adoption of a 16-bit wide instruction word required the Instruction Set Architecture (ISA) to be carefully tailored in order to accommodate a set of operations large enough so as not to limit the pattern description ability of the microcontroller. In particular, all supported instructions were encoded with a minimal number of bits, although at the cost of more complex decoding logic. These optimizations resulted in an ISA containing 14 instructions, divided into four main groups:
1. Register-Register operations involving the ALU
2. Immediate constant loading operations
3. Low and High constant loading and miscellaneous operations
4. Flow control operations
ALU Operations
The ALU supports 4 distinct operations on 16-bit integers: addition, subtraction, multiplication and decrement. The operations were selected based on the particular characteristics of most pattern description codes, where arithmetic operations are more abundant than logical ones. In addition, given that the flags available for flow control are active on zero, not zero or negative, the decrement operation is more useful than an increment operation would be when describing loops. When using any of these operations, all flags are updated. It is worth noting that this group of operations permits any combination of source and destination registers, i.e., both can refer to either the Internal Register File (IRF) or the ERF. This can be easily specified in the assembly code by prepending the register number with the internal or external register file identifier (irf or erf), followed by the targeted unit within the AGC, if the external register file is selected, and the register number. For instance, register 1 of the Loopbody unit of the AGC is referred to as erf.0.r1. Conversely, when using a register from the IRF, the same reference would take the form irf.r1. The organization of the ALU control words is depicted in Tab. A.1 in Appendix A, along with the assembly syntax supported by the purposely-built assembler.

Figure 3.3: Architecture of the Micro16 microcontroller.
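This register naming convention can be captured by a small helper (a sketch of the syntax described above, not the actual assembler code; the tuple layout is illustrative):

```python
def parse_reg(ref):
    """Split a register reference into (file, unit, number).

    'irf.r1'   -> ('irf', None, 1)   register 1 of the internal file
    'erf.0.r1' -> ('erf', 0, 1)      register 1 of AGC unit 0 (Loopbody)
    """
    parts = ref.split(".")
    if parts[0] == "irf" and len(parts) == 2 and parts[1].startswith("r"):
        return ("irf", None, int(parts[1][1:]))
    if parts[0] == "erf" and len(parts) == 3 and parts[2].startswith("r"):
        return ("erf", int(parts[1]), int(parts[2][1:]))
    raise ValueError(f"malformed register reference: {ref}")
```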
Constant Loading Operations
Due to the limitations imposed by the 16-bit instruction set, the constant loading operations are divided into two instruction groups: Immediate constant loading operations and Low and High constant loading and miscellaneous operations. The existence of three constant loading instructions aims to minimize code size. In fact, the first group enables the loading of constants of up to 12 bits in a single instruction, while the two instructions of the latter group allow the loading of any 16-bit constant, albeit in two sequential steps: first the lower 8 bits are loaded into a register and then the upper 8 bits are loaded into that same register. Due to instruction word size limitations, the destination register must be part of the IRF. The two instruction word groups, along with the constant loading instructions, are presented in Tab. A.2.
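The choice between the single-instruction and two-instruction loading schemes could be sketched as follows (the mnemonics LDI, LDL and LDH are hypothetical placeholders; the actual encodings are those listed in Tab. A.2):

```python
def load_constant(value):
    """Choose the cheapest instruction sequence to load `value`.

    Constants up to 12 bits fit a single immediate load; wider 16-bit
    constants are split into a low-byte load followed by a high-byte
    load, both targeting the same IRF register.
    """
    if not 0 <= value <= 0xFFFF:
        raise ValueError("constant exceeds 16 bits")
    if value < (1 << 12):
        return [("LDI", value)]                   # one instruction
    return [("LDL", value & 0xFF),                # lower 8 bits first
            ("LDH", (value >> 8) & 0xFF)]         # then the upper 8 bits
```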
Miscellaneous operations
This group, in addition to encompassing the loading of 8-bit constants, enables the seamless integration between the Micro16 microcontroller and the AGC through two custom instructions, Wait and Done. The Done instruction must be issued whenever the registers in the ERF are modified. This indicates to the AGC that the set of constants currently stored in the ERF is ready to be read and can be copied to the internal registers of the Loopbody and Loopcontrol units. Once this copy is complete, the AGC asserts a signal indicating that the constants in the ERF have been parsed and can now be modified again. The microcontroller can wait on the assertion of this signal through the Wait instruction. Again, due to word-size constraints, this word group is further subdivided through an additional bit to define stack operations. These consist of the conventional push and pop operations, which place/retrieve any register from the IRF or ERF on/from the stack, thus greatly easing the programmability of the Micro16. This instruction word group and the associated assembly mnemonics are presented in Tab. A.3.
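The Done/Wait exchange effectively turns the ERF into a single-entry mailbox between the two units. A minimal behavioural model (a sketch only; the real blocks are hardware units synchronized by interrupt lines, not method calls):

```python
class Handshake:
    """Model of the Micro16/AGC parameter exchange through the ERF."""

    def __init__(self):
        self.erf = None        # parameters currently held in the ERF
        self.latched = []      # sets already copied into the AGC

    def done(self, params):
        """Micro16 side: write the ERF, then issue Done."""
        if self.erf is not None:
            raise RuntimeError("must Wait before rewriting the ERF")
        self.erf = params

    def agc_copy(self):
        """AGC side: copy the ERF and assert the acknowledge signal."""
        self.latched.append(self.erf)
        self.erf = None        # a pending Wait now succeeds
```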
Flow control operations
Finally, flow control is ensured by four different jump instructions, three of which are conditional. The conditional jumps are supported by three status flags, namely Zero, Negative and Not Zero. These are stored in the Program Status Register (PSR), depicted in Fig. 3.3, so that their state does not have to be polled immediately after the operation that updated the flags. Only absolute jumps are supported, through a 12-bit destination field. This means that a single jump instruction can target any of up to 4096 instructions, which is more than enough considering the size of usual pattern description codes. To facilitate the coding task, the assembler supports instruction labelling, automatically converting these labels into absolute addresses. These four operations are supported by the word group described in Tab. A.4.
Assembler
In light of the custom nature of the machine code defined by the Micro16 ISA, an assembler was developed to assist its programming. This tool offers syntax validation, label support for jump operations, and overflow checking when loading constants or specifying a jump destination. In addition, single and multi-line comments are supported, and so are decimal or hexadecimal numeric bases when specifying constants. The output of the assembler is a binary file ready to be placed in the instruction memory of the microcontroller. While the performance of the DFC is not directly influenced by these features, they greatly facilitate the pattern specification process, thereby increasing the ease-of-use of the HotStream framework, which is undoubtedly a key aspect of the proposed system. By taking advantage of such user-friendly assembly language, configuring a pattern is just a matter of populating the ERF with the relevant parameters and using the custom interface instructions to start and stop the address generation.
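Label support is typically implemented as a two-pass scheme; the following sketch assumes a 'name:' label syntax (the mnemonics shown are illustrative, not the official ones):

```python
def resolve_labels(lines):
    """Two-pass label resolution with 12-bit range checking.

    Pass 1 records the absolute address of every 'name:' definition;
    pass 2 rewrites jump operands, which must fit the 12-bit
    destination field of the jump instructions.
    """
    labels, code = {}, []
    for line in lines:                       # pass 1: collect labels
        if line.endswith(":"):
            labels[line[:-1]] = len(code)
        else:
            code.append(line)
    resolved = []
    for instr in code:                       # pass 2: substitute targets
        op, _, arg = instr.partition(" ")
        if arg in labels:
            target = labels[arg]
            if target >= (1 << 12):
                raise OverflowError("jump destination exceeds 12 bits")
            instr = f"{op} {target}"
        resolved.append(instr)
    return resolved
```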
3.4.3 Access to the Shared Memory
The buffering capabilities of the HotStream framework are ensured by a single large-capacity shared memory, usually implemented by an external Dynamic Random Access Memory (DRAM). While this may be regarded as a limitation to the system performance, as arbitrated accesses affect the effective memory bandwidth available to each core, it is one that cannot be avoided. Furthermore, DRAM devices are characterized by complex timing behaviour, as the need for periodic refreshing and the costly charging of data lines means that the access time for an arbitrary memory position is not constant. In this respect, modern Double Data Rate (DDR) memories offer special access modes to maximize the achievable data throughput. These are essentially burst-based accesses that retrieve a fixed number of sequential data beats from the DDR with one single command. Exploiting burst-based accesses to the DDR memory is, therefore, paramount to get the most out of the available memory bandwidth. Accordingly, the address stream that is generated by the developed AGC was shaped with this particular objective in mind.

As an arbitrary number of cores may be requesting data from the shared memory, a scalable and robust arbitration solution is also required. This may be achieved by adopting any of the currently available industry-standard interfaces, specifically targeted at high-performance systems. Some examples are the AXI4 specification, maintained by ARM [2], or the CoreConnect bus architecture, defined by IBM [1], which are widely supported by most FPGA makers, e.g., Xilinx, and can also be deployed in CMOS technology. To comply with the specific bus architecture adopted for each particular application, a Bus Master Controller (BMC) was developed and integrated in the streaming framework. The purpose of this unit is to perform the conversion between the stream-based interface used by the DFC and the Memory-Mapped (MM) protocol used by the bus interface (see Fig. 3.4).
To avoid placing additional pressure on the shared memory, each bus controller features a Stream-to-MM and an MM-to-Stream interface. These independent channels are arbitrated internally so that only one stream accesses the bus at a time.
Figure 3.4: Core internal structure, comprising the PE (e.g., an application specific IP Core) and the co- located BMC
Figure 3.5: Internal structure of the BMC, consisting of Write and Read control units, a Channel Arbiter and a Synchronizer block
The BMC (Fig. 3.5) comprises write and read control units, featuring appropriate internal buffers that enable the issuing of incremental burst-based transactions and, in the case of the read unit, perform data pre-fetching by requesting data that will be buffered until the Core is ready to consume it. To accomplish this, both the write and read control units consume the address stream up until the point that an increment pattern is interrupted or the burst limit, which can be selected when instantiating the component, is reached. In the particular case of the write channel, no data can be output until it is actually marked valid by the Core. Thus, a synchronizer block is also used, which forces the data and address streams to flow at the same pace.
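The burst-forming policy of the read and write masters can be sketched as follows (word-granular addresses and the burst-limit parameter are assumptions for illustration):

```python
def coalesce_bursts(addr_stream, burst_limit):
    """Group an address stream into (start, length) bursts.

    Addresses are consumed while they increment by one word; a burst
    is cut when the incremental run breaks or the configurable burst
    limit is reached, mirroring the policy described for the BMC.
    """
    bursts, start, prev, length = [], None, None, 0
    for a in addr_stream:
        if start is not None and a == prev + 1 and length < burst_limit:
            prev, length = a, length + 1
        else:
            if start is not None:
                bursts.append((start, length))
            start, prev, length = a, a, 1
    if start is not None:
        bursts.append((start, length))
    return bursts
```

A fully sequential stream thus collapses into a handful of long bursts, while an irregular stream degenerates into single-beat transactions, which is why shaping the AGC output for burst-friendliness matters.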
3.5 Data Stream Switch (DSS) and Core Management Unit (CMU)
Since all DFCs operate independently from one another, separate instruction memories are required for each. To avoid the eventual resource under-utilization that would result from using one entire Random Access Memory (RAM) for each unit, the MCPE offers the possibility of sharing a dual-port RAM, if available, between two DFCs. In addition, because the program code for each may have fundamentally different sizes, it is possible to adjust the partition of the memory space between the two. This procedure may be performed statically, i.e., included in the hardware implementation, or dynamically and in real time, directly from the Host machine. The latter functionality is supported by the combination of the Data Stream Switch and the Core Management Unit (see Fig. 3.1).

The Data Stream Switch is a simple switching element that is able to route the data stream coming from the Host to three different targets: the shared memory, the high-speed backplane and the array of instruction memories. While the first is suited for situations where a large amount of data is to be transferred to the MCPE for further processing, the second makes it possible to stream data directly to one of the kernels, thus avoiding any intermediate buffering. This mode of operation is particularly suited for applications that do not explore fine-grained access patterns and data re-use. Finally, the last target is used for programming the various instruction memories, with the help of the Core Management Unit, which defines the RAM partitions and provides feedback to the Host on the success of the programming operations.

The Core Management Unit, which is controlled by the Host through a set of hardware registers (see Tabs. A.5 and A.6), can be used to interact with the kernels on the MCPE. It is possible to define interrupt masks and acknowledge the interrupts received on an individual basis. In addition, the kernels can be reset individually and two registers with no previously defined function, UR1 and UR2, are available for satisfying the needs of particular applications.
3.6 Summary
The HotStream framework, more than a platform for the development of stream-based hardware accelerators, is a comprehensive hardware and software solution that handles all the interactions between a Host and the existing accelerators. This is achieved by a modular hardware architecture, consisting of a Host Interface Bridge (HIB) and a Multi Core Processing Engine (MCPE), supported by the combination of a device driver and the purposely designed HotStream API on the software side. The proposed organization results in a framework that is generic enough to be mapped to virtually any chip technology. In fact, the HIB can be equally implemented by a PCI Express interface, the AXI high-speed bus, or IBM's CoreConnect. The data transfers are initiated from the accelerator side by a DMA engine capable of performing 2D transactions. This is essential to equip the presented framework with one of its most distinctive features: the two-level patterned access to all the data shared between the Host and the accelerator. While coarse-grained access patterns are made possible by the DMA engine in the HIB, fine-grained pattern description of multiple parallel streams is possible within the MCPE. The Data Fetch Controllers (DFCs), which are composed of a tightly-coupled combination of a custom 16-bit microcontroller and an autonomous address generation unit, parse user-provided pattern descriptions, written in a custom-designed assembly language, and generate and manage the data streams of the associated Cores. This unit, which shares various similarities with conventional descriptor-based DMAs, such as the PPMC proposed in [20], offers significant advantages when data must be accessed with complex and long patterns.
The HotStream API complements the hardware side of the framework by providing four logical groups of methods that give access to all the features of the platform: i) Framework Management; ii) Core Management; iii) Data Management; and iv) Pattern Definition. In addition to the multiple heterogeneous Cores and associated DFCs that the MCPE supports, a high-speed backplane interconnection and a large-capacity shared memory are also available. These two units enable point-to-point, low-latency communication between the multiple cores and the reutilization and rearrangement of data streams, respectively. Finally, two auxiliary units, the Data Stream Switch (DSS) and the Core Management Unit (CMU), provide additional control over most features of the framework directly from the Host.
4 Host Interface Bridge
Contents
4.1 PCI Express Infrastructure
4.2 Address Spaces and DMA
4.3 2D DMA Transfers
4.4 Device Driver and User Interface
4.5 Summary
At the conceptual level, the HotStream framework does not depend on any particular interface technology for any of its communication links. In particular, the High Speed Interconnect depicted in Fig. 3.1 may be as easily implemented by a serial PCI Express link as by an AXI or CoreConnect bus interface, depending on the target platform. Using a parallel bus interface usually requires the accelerator and the GPP to be co-located on the same SoC, which can be fully implemented as an ASIC or as a combination of hard-wired hardware and reconfigurable fabric, as in the case of the Xilinx Zynq [12]. While this option gives access to a higher communication bandwidth and greatly eases the design, it is not always available. This is especially the case when the Host is a powerful GPP, thus requiring an off-chip interface technology, such as PCI Express, which requires special structures to be present in both communication endpoints. In addition, the logical separation between the Host and the accelerator means that (at least) two address spaces exist, thus requiring more complex software management, including the development of custom device drivers. Therefore, the HIB and accompanying software of the HotStream framework were developed with this more complex case in mind. Nevertheless, SoC-based implementations can still be easily targeted by simplifying the software management and removing the PCIe endpoint altogether.

The use of PCI Express requires a PCIe endpoint bridge to replace the Data Stream Bridge of Fig. 3.1. This bridge must be capable of mapping the address space of the accelerator, supported on a local high-speed bus interface, to the Host address space. In addition, to make a more efficient use of the available bandwidth, it supports multiple lanes operating in full duplex.
The following sections describe the elements required to support the efficient bidirectional transmission of data between the Host and the board, as well as the device driver that makes it possible to control the system at a higher abstraction level.
4.1 PCI Express Infrastructure
The PCIe endpoint on the accelerator side communicates with the root complex device on the Host. It is not necessary that both feature the same number of lanes, as the PCI Express specification supports a link negotiation stage, known as link training, which co-operatively determines the maximum width supported by the interconnected pair. In addition, for the accelerator to be properly recognized and configured during the device enumeration performed by the BIOS during the Host computer boot, a suitable set of ID values and Class codes must be set on the PCIe endpoint. These contain information regarding its function and manufacturer and are also fundamental for its identification by the device driver.

The device enumeration step is completed by reserving a block in the bus address space of the Host for the PCIe device. The size of this block is determined by the contents of the mandatory register set that the endpoint provides, which must conform with the PCIe specification. Moreover, this set of registers also indicates the available apertures. These correspond to address ranges that are later mapped to the Host memory space by the device driver. Each aperture is characterized by a starting position, indicated by a Base Address Register (BAR), and a size, and gives access to the address space of the sub-system attached to the PCIe endpoint.

Interrupts may be delivered by the accelerator to the Host by making use of Message Signalled Interrupts (MSI). In the past, PCI devices used an individual interrupt line connected directly to the Host. This out-of-band method posed several problems, the most striking of which being the possibility of the interrupt arriving at the Host before the Transaction Layer Packets (TLPs) of the corresponding transfer had been received. The MSI mechanism eliminates this synchronization problem by utilizing a conventional memory write TLP directed at a reserved address in the Host bus map. In addition, this solution also reduces pin count and greatly improves interoperability between PCIe-based devices.
4.2 Address Spaces and DMA
Gaining access to the address space of the accelerator from the Host is complicated by the multiple levels of address spaces that usually exist in the system (see Fig. 4.1). In fact, while the apertures provided by the PCIe endpoint are mapped to the bus address space, which is the same as the physical address space in architectures that do not possess an Input/Output Memory Management Unit (IOMMU) (x86 architectures fall within this category), user applications run in user space, where a contiguous and very large set of virtual addresses is available. Since this address range is not limited by the capacity of the main system memory, and to enforce address space separation between multiple concurrent processes, there is not a one-to-one correspondence between the physical memory and the virtual memory. In fact, the mapping between virtual pages and the frames in the physical memory is performed by a dedicated hardware unit known as the Memory Management Unit (MMU). Furthermore, the kernel benefits from a special region of virtual memory where the physical memory, or part of it, is linearly mapped, since no address violation prevention between processes is required. This region can be used whenever a contiguous data buffer needs to be allocated. However, the system calls available for allocating such buffers provide only a best-effort service, meaning that there is no guarantee that an arbitrarily-sized buffer can always be allocated in a contiguous region of the physical memory [9].
Therefore, two options exist when transferring a data block to or from the accelerator through the PCI Express interface: i) the user-space application requests the allocation of a kernel buffer that is large enough to contain the data to transfer and maps it to the process space, through suitable system calls; or ii) the buffer is allocated in the process space and then mapped to the physical space, where it will likely span multiple physical frames scattered across the physical memory. While the first scenario is more convenient, and greatly simplifies DMA transfers, it is not always possible to allocate buffers with the desired size, as explained above. In addition, requiring the buffer to be pre-allocated can complicate the adaptation of existing applications to the framework, thus reducing its usability. It becomes clear that the second approach is more flexible, despite requiring a more complex management of the DMA transfers, since multiple smaller blocks
Figure 4.1: Address spaces and translation mechanisms on an x86-like architecture

have to be transferred, instead of just one. In order to facilitate this management, a DMA engine with Scatter Gather (SG) capabilities can be employed. Unlike simple DMA implementations, which can only be instructed to perform one data transfer at a time, SG DMA engines read a singly linked list of descriptors, each specifying the starting address and size of the data block to be transferred. This allows the engine to perform a large sequence of data transfers from multiple locations without further CPU intervention. Figure 4.2 depicts the process of mapping a user buffer to the bus address space, through a dedicated call contained in the HotStream API, and the subsequent creation of the corresponding SG descriptors.
Figure 4.2: Mapping of a user buffer to the bus address space and subsequent creation of the corresponding SG descriptors. The size and number of the physical data chunks can vary considerably according to the size of the original buffer and the state of the physical memory
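The construction of the descriptor chain from the pinned physical chunks can be sketched as follows (the field names are illustrative; the real descriptor layout is specific to the DMA engine):

```python
def build_sg_chain(chunks):
    """Build a singly linked SG descriptor list from (bus_addr, size)
    chunks, so the DMA engine can walk the entire transfer without
    further CPU intervention."""
    descriptors = [{"addr": a, "size": s, "next": None} for a, s in chunks]
    for d, nxt in zip(descriptors, descriptors[1:]):
        d["next"] = nxt                  # link to the next descriptor
    return descriptors
```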
4.3 2D DMA Transfers
As discussed in section 3.1, the use of a DMA engine capable of performing 2D memory accesses, i.e., data transactions described by the tuple (OFFSET, HSIZE, STRIDE, VSIZE), can significantly reduce the total number of descriptors needed to implement several coarse-grained patterns. Unfortunately, this complicates the creation of the SG descriptors even further, as seen in Fig. 4.3. These complications arise because the 2D pattern must be applied to the physical data chunks that result from the mapping of the user-space buffer to the bus address space. Since these chunks can vary in size and starting position, each contiguous block defined by the HSIZE parameter may or may not fit entirely within a physical data chunk. Thus, whenever an HSIZE block exceeds the capacity of the starting physical chunk, an additional SG descriptor is required.
Figure 4.3: Application of a 2D pattern to the mapping of Fig. 4.2
Figure 4.3 represents the worst possible case, as the 2D pattern must be described by 6 SG descriptors, thus providing no descriptor savings relative to a DMA engine with no 2D support. Fortunately, in most real application scenarios, the size of the physical data chunks is such that a considerable portion of the 2D pattern is able to fit within them without further segmentation, as depicted in Fig. 4.4.
The application of the 2D pattern defined by the tuple (OFFSET, HSIZE, STRIDE, VSIZE) to the array of physical memory chunks is performed by the HotStream_2D() API call, which substitutes the previous descriptor list with a new one that exploits the specification of 2D transfers. The underlying algorithm for this method is presented in Fig. 4.5. Finally, while the HIB was designed to primarily support coarse-grained data patterns, the specification of sparse patterns based on blocks with a small HSIZE is still possible. However, the nature of the PCI Express link dictates that large data transfers make a more efficient use of the link, as the overheads introduced by the protocol are largely independent of the data transfer size and, as such, become less relevant as it increases. Thus, it is expected that the use of patterns that are too sparse or fine-grained will result in a significant penalty to the achievable throughput of the interconnection. To circumvent this problem, the HotStream API includes the gather function, already introduced in section 3.1. This call, whose operation is depicted in Fig. 4.6, gathers the various sub-blocks of size HSIZE within the provided user buffer and populates a new user-space buffer linearly with these blocks. This not only significantly reduces the number of SG descriptors
Figure 4.4: Application of a 2D pattern to a more realistic mapping between virtual and physical address space. This example highlights the descriptor savings that are possible by utilizing a DMA with 2D capabilities
pattern2d(offset, hsize, stride, vsize):
    bar = offset
    while vsize > 0:
        d = next descriptor
        if bar > d.size:                  # pattern has not reached this chunk yet
            bar -= d.size
            continue
        N = (d.size - bar) / stride       # full rows that fit in this chunk
        F = (d.size - bar) % stride       # leftover bytes after the last full row
        if N >= vsize:                    # all remaining rows fit in this chunk
            sg_push(d.addr + bar, hsize, vsize, stride)
            end
        else if F == 0:                   # chunk ends exactly on a row boundary
            sg_push(d.addr + bar, hsize, N, stride)
            vsize -= N
            bar = 0
        else if F >= hsize:               # leftover still holds one complete row
            sg_push(d.addr + bar, hsize, N + 1, stride)
            vsize -= (N + 1)
            bar = stride - F
        else:                             # last row is split across two chunks
            if N != 0:
                sg_push(d.addr + bar, hsize, N, stride)
                vsize -= N
                bar += N * stride
            sg_push(d.addr + bar, F, 1, stride)
            d = next descriptor
            sg_push(d.addr, hsize - F, 1, stride)
            vsize -= 1
            bar = stride - F

Figure 4.5: The algorithm that converts a list of SG DMA descriptors into a list featuring 2D transfers (the original flowchart, rendered here as pseudocode)
needed to complete the DMA data transfer but also potentially increases the achievable data throughput over the PCI Express link.
Figure 4.6: Operation of the HotStream gather() function, which gathers the various sub-blocks defined by a 2D pattern and places them linearly in a new user-space buffer
Naturally, this operation must be performed after the initial mapping of the user-provided buffer from the virtual to the physical address space, thus introducing an additional delay in the sequence of steps needed to complete a data transfer from the Host to the accelerator. However, this increased setup time can be effectively hidden by the increased data throughput that can be achieved over the PCIe link, which results in a shorter transfer time. Thus, for sparse/fine-grained data patterns up to a certain break-even point, utilizing the gather operation actually reduces the overall transfer time.
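The core of the gather operation amounts to a strided copy into a freshly allocated linear buffer. The following C sketch illustrates its semantics only; the function name and signature are illustrative and do not reflect the actual HotStream_gather() implementation:

```c
#include <stdlib.h>
#include <string.h>

/* Illustrative sketch (not the actual HotStream implementation): copy the
 * VSIZE sub-blocks of HSIZE bytes each, spaced STRIDE bytes apart starting
 * at OFFSET in the user buffer, into a new buffer where they lie linearly. */
unsigned char *gather_sketch(const unsigned char *src, size_t offset,
                             size_t hsize, size_t stride, size_t vsize)
{
    unsigned char *dst = malloc(hsize * vsize);
    if (dst == NULL)
        return NULL;
    for (size_t row = 0; row < vsize; row++)
        memcpy(dst + row * hsize, src + offset + row * stride, hsize);
    return dst;
}
```

The new buffer then maps to far fewer physical chunks, which is precisely what reduces the SG descriptor count.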
4.4 Device Driver and User Interface
The HotStream API provides a high-level abstraction for the user to interface with the accelerator and, implicitly, the DMA engine and PCI Express interface. However, this is only made possible by the underlying software layers, which convert these high-level commands into simpler low-level instructions that directly control the configuration registers of both the DMA engine and PCI Express endpoint. The device driver is the key element within these layers, as it is responsible for: i) device detection and initialization; ii) creation of the corresponding device node; iii) mapping of the PCIe apertures into the bus address space and; iv) providing direct access to the multiple device registers through ioctl() calls. Developing such a driver requires a comprehensive understanding of the APIs provided by the kernel for memory allocation, PCI devices and interrupt handling, which is far beyond the scope of the present thesis. Thus, an open-source and generic driver from the MPRACE framework [28] was employed, greatly accelerating the development process. This framework is an open-source stack, primarily aimed at the development of custom FPGA boards with PCI Express interfaces. In addition to the conventional device driver that provides access to the accelerator through read(), write() and ioctl() system calls, it also features a comprehensive user-space API in C and C++, which abstracts complex procedures, such as the mapping of the PCI Express apertures to user space, the mapping of user-space buffers to the physical address space and the allocation of contiguous kernel buffers. This was used as the starting point for the development of the HotStream API, which utilizes some of the calls provided by the MPRACE API to configure the DMA transfers and to dynamically set up the address translation performed by the PCIe endpoint. In addition, some modifications were introduced to the kernel driver in order to include support for MSI and Bus Mastering. These modifications, as well as the necessary steps for configuring a data transfer by making use of the MPRACE API, are described in the following subsections.
4.4.1 Modifications to the MPRACE device driver
The two modifications made to the MPRACE device driver aim to provide support for MSI and Bus Mastering. The importance of the former has already been described in section 4.1 but ultimately depends on whether the kernel of the Host operating system was compiled with support for this type of mechanism (determined by the CONFIG_PCI_MSI config parameter in Linux kernels). If this form of interrupt handling is not available, the traditional out-of-band method must be utilized. Bus Mastering refers to the capability of the PCI Express endpoint to take ownership of the Host memory bus and issue read and write requests to the main memory system. This is indispensable for the accelerator to be capable of accessing the Host memory with minimal intervention of the latter, thus leading to higher data transfer throughputs. Since not every Host supports MSI, the PCIe endpoint must be informed of which interrupt mechanism to use. Since the out-of-band mechanism is selected by default, MSI must be enabled explicitly on the endpoint. This is done by setting a bit on a capability register defined by the PCI Express specification. Such an action must be directly performed by the kernel, as the device driver does not have permissions to do so. Instead, the latter does this indirectly by resorting to the pci_enable_msi() call of the Linux PCI API, after which the associated interrupt line is registered through the well-known request_irq() kernel call. To preserve compatibility with conventional interrupt methods, a msi_enabled flag was added to the pcidriver_privdata_t structure which overrides the legacy interrupt routines defined in the driver source code. Bus Mastering capabilities are simply activated by utilizing the pci_set_master() call of the previously mentioned API.
4.4.2 Configuring a data transfer
While the MPRACE API greatly simplifies the task of configuring data transfers to and from the accelerator, the existence of independent address spaces and a DMA engine requires a number of steps to be taken before the data begins to flow in either or both directions, even if no 2D patterns are applied. After the device file is opened, which maps the PCIe endpoint configuration registers to the device space, the pd_mapBAR() call of the MPRACE API is used to map one of the PCIe apertures to user space. This gives immediate access to the address space of the accelerator and makes it possible to access the configuration registers of the various peripherals whose address ranges are contained in this address space. After the user has provided a pointer to a data buffer, the pd_mapUserMemory() call locks it into physical memory and maps it to device space.
The resulting Scatter-Gather (SG) list indicates the location and size of each of the physical memory chunks in the Host memory, as described in section 4.2. Since the PCIe endpoint is only able to access a limited memory range, which is determined by the size of the configured aperture, it is important that this group of data chunks fits within this addressable range. Once this is verified by the software, a base address that satisfies this condition is calculated and written to a configuration register in the PCIe endpoint, which performs the address conversion from the accelerator to the Host. Thus, whenever a peripheral within the accelerator issues a write or read request to a memory position of the aperture defined by the PCIe endpoint, it is propagated to the Host with an additional offset, the aforementioned base address. Naturally, this also means that the SG list that is parsed by the DMA engine must utilize relative addresses and not absolute Host memory addresses, which is accomplished by subtracting the same base address from each entry of the list. After applying the aforementioned address conversion, the SG descriptors are written to a static memory on the HIB, according to the descriptor structure specified by the DMA engine. Once the complete list has been written, the configuration registers of the DMA engine are populated with the base addresses of the start and tail descriptors, which usually prompts the beginning of the data transfer operation. Once the interrupt that signals the end of the transfer is received by the Host, each SG descriptor on the static memory will hold the number of bytes that were successfully transferred for that particular descriptor. The resulting sum should equal the size of the specified user buffer. Finally, in the case of a data transfer from the accelerator to the Host, the explicit synchronization of the device and user buffers may be required.
This task, which is greatly simplified by the pd_syncUserMemory() call, may be necessary to ensure cache coherency when the same user buffer is used for successive data transfers. Such a step may be unnecessary if the Host supports PCI bus snooping.
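The base-address selection and SG-list rebasing described above can be sketched in a few lines of C. The structure and function names below are illustrative, not those of the actual HotStream software:

```c
#include <stdint.h>
#include <stddef.h>

/* Illustrative SG entry: absolute Host bus address and chunk size in bytes. */
typedef struct { uint64_t addr; uint32_t size; } sg_entry_t;

/* Pick the lowest chunk address as the aperture base and verify that every
 * chunk falls within the addressable range of the configured aperture. */
int sg_fit_aperture(const sg_entry_t *list, size_t n,
                    uint64_t aperture_size, uint64_t *base_out)
{
    uint64_t lo = list[0].addr;
    uint64_t hi = list[0].addr + list[0].size;
    for (size_t i = 1; i < n; i++) {
        if (list[i].addr < lo)
            lo = list[i].addr;
        if (list[i].addr + list[i].size > hi)
            hi = list[i].addr + list[i].size;
    }
    *base_out = lo;
    return (hi - lo) <= aperture_size;
}

/* Convert the absolute Host addresses into aperture-relative addresses by
 * subtracting the base that was programmed into the PCIe endpoint. */
void sg_rebase(sg_entry_t *list, size_t n, uint64_t base)
{
    for (size_t i = 0; i < n; i++)
        list[i].addr -= base;
}
```

The endpoint later adds the same base back when it propagates each request to the Host, so the two translations cancel out.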
4.5 Summary
Despite the support for various Host-accelerator interfaces that the HotStream framework provides, implementing an off-chip interconnect is typically more difficult than resorting to an intra-chip solution, such as AXI or CoreConnect. Thus, the design of the HIB and accompanying device drivers was specifically targeted at a PCI Express connection, as this represents the worst case in terms of design complexity. Adapting the developed hardware/software combination to an intra-chip solution would require significantly less effort.
Independent of the chosen Host-to-accelerator communication technology is the potential involvement of multiple address spaces in all data transfers. In fact, if the Host runs a full-fledged operating system, the existence of physical and virtual address spaces greatly complicates the task of transmitting data buffers to and from the accelerator. In such a scenario, the device driver is responsible for mapping the user-provided memory buffer to the bus device space, which is then accessible from the accelerator. However, while the virtual memory space is laid out linearly, the mapping of a contiguous memory buffer to a physical space often results in a collection of non-correlated, arbitrarily located physical data chunks. Thus, for the efficient transfer of such chunks to the accelerator, a DMA engine with Scatter-Gather (SG) capabilities is required, as it greatly reduces the Host's CPU intervention in the data transfer process.
The use of a DMA engine capable of 2D transfers can greatly reduce the number of descriptors needed to complete a data transfer, in addition to the advantages outlined in Section 3.1. However, the 2D pattern must be applied to the chunks of physical memory that resulted from the initial mapping from the virtual address space to the bus device space and not to the original, user-provided, contiguous buffer. Such a task is nontrivial and can even result in no descriptor savings, as in the case depicted in Fig. 4.3.
However, most real scenarios resemble the situation depicted in Fig. 4.4, where a considerable portion of the 2D pattern fits within a single physical memory chunk, thus resulting in a significant descriptor saving. The task of applying a 2D pattern to a list of physical memory chunks is accomplished by a purposely designed algorithm that takes this initial list and produces a new set of descriptors ready to be parsed by the DMA engine.
The device driver for the PCI Express based accelerator was not developed from scratch. Instead, the open-source and generic driver from the MPRACE framework [28] was used. Modifications to the interrupt handling procedures were introduced and bus mastering capabilities were added, thus enabling the DMA engine to effectively take ownership of the Host memory bus.
5 Framework Prototype
Contents
5.1 AXI Interfaces
5.2 HIB Implementation and Performance
5.3 Backplane Implementation and Performance
5.4 Shared Memory Performance
5.5 Summary
The proposed HotStream framework represents a generic design that can be easily mapped to different targets, i.e., a reconfigurable device, an ASIC or even a combination of the two, using, for example, a Zynq SoC. However, while the conceptual design and software layers can be developed with no particular target technology in mind, it is not possible to fully characterize the platform and the performance of its various communication links without considering a particular implementation. Moreover, to properly evaluate the proposed platform, the test environment must provide a high-speed interconnection between the accelerator and the Host, such as a PCI Express interface, and a large-capacity off-chip memory providing significant bandwidth. The nature of these requirements means that the prototyping of the framework must be done on an FPGA, which greatly simplifies the process of interfacing with the off-chip memory and the PCI Express interconnection. Thus, the Xilinx VC707 development board was selected, which is powered by a state-of-the-art Virtex 7 FPGA [40], coupled with a high-performance DDR3 memory, offering 512 MB of capacity and a peak bandwidth of 12.8 GB/s. Host connectivity is ensured by a PCI Express interface with 8 lanes, capable of supporting Gen2 speeds. On the Host side, a powerful Intel Core i7 3770K processor clocked at 3.5 GHz is at the heart of a machine with 16 GB of DDR3 memory running at 1.866 GHz. These components, in addition to the VC707 development board, are hosted by an Asus P8Z77-V LX motherboard. The following sections present a thorough characterization of the communication channels that compose the HotStream framework when mapped to the aforementioned development board, as well as an evaluation in terms of performance and area of key elements of the framework, such as the DFCs.
Finally, a case study based on the multiplication of very large matrices is presented, which aims not only to highlight the capability of the framework to support multiple concurrent streams being consumed and produced by various heterogeneous kernels, but also to give an insight into the performance gains that can be expected when using the proposed HotStream framework instead of other conventional accelerator architectures.
5.1 AXI Interfaces
Given the choice of a Xilinx development board for prototyping the framework, and since most of the IP Cores available for its devices make use of the AMBA AXI family of interfaces [2], this was selected as the interconnection solution to be used across the HotStream framework. The multiple interface variants contained in this specification allow for a tight correspondence between the requirements of each interconnection and the capabilities of the corresponding interface. Thus, for high-performance links within the system, such as the interface between the HIB and MCPE and the access to the shared memory, the AXI4 protocol was used, as it provides bi-directional data transfers with burst support of up to 256 data beats. For register-mapped interfaces, on the other hand, the simpler AXI4-Lite protocol, with no support for bursting, was preferred in order to keep hardware resource usage to a minimum. Finally, in order to ease the development of accelerator solutions based on the HotStream
framework, the streaming cores must comply with the AXI4-Stream protocol. This protocol is targeted at low-resource, high-bandwidth unidirectional data transfers. Flow control is implemented through two signals, TVALID and TREADY, and support for bursts of arbitrary length is ensured by a TLAST signal. Additional control signals exist, such as null-beat indicators and routing information, but these are optional and depend on the particular requirements of the application at hand. The protocol consists of completely symmetric master and slave interfaces that can be connected directly. Moreover, no restriction to the width of the data channel exists, which further increases the flexibility of the solution. It is important to note that, while the prototype version of the proposed framework was designed to host cores with one or more AXI4-Stream interfaces, other stream-based interfaces can be easily adapted, provided they utilize the same two-way flow control mechanism, also known as handshake, depicted in Fig. 5.1. This is the case, for instance, with the Avalon Stream interface, widely adopted in Altera-based designs.
Figure 5.1: Basic handshake principle utilized by the AXI4-Stream and other similar stream-based protocols. Retrieved from [2]
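The two-way handshake of Fig. 5.1 can be summarised in a few lines of behavioural C: a data beat is transferred in a given clock cycle if and only if TVALID and TREADY are both asserted in that cycle. This is a sketch of the protocol rule, not RTL:

```c
#include <stdbool.h>

/* A beat crosses the interface only when the master drives TVALID and the
 * slave drives TREADY in the same cycle; either side may stall the other. */
bool beat_transferred(bool tvalid, bool tready)
{
    return tvalid && tready;
}

/* Count the beats transferred over a window of clock cycles. */
int count_beats(const bool *tvalid, const bool *tready, int cycles)
{
    int beats = 0;
    for (int i = 0; i < cycles; i++)
        if (beat_transferred(tvalid[i], tready[i]))
            beats++;
    return beats;
}
```

Because the rule is symmetric, any stream interface with the same two-signal handshake, such as Avalon Stream, can be adapted with a thin wrapper.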
5.2 HIB Implementation and Performance
The two main components of the HIB, the DSB and the DMA engine, were both implemented by off-the-shelf Xilinx IP Cores. The DSB was implemented by the AXI Bridge for PCI Express [39], which handles the low-level interactions with the hard silicon PCI Express endpoint available in the VC707 development board and provides a convenient AXI4 interface, allowing easy integration with the remaining components of the framework. Similarly, the DMA Engine was implemented by the AXI DMA engine [5], which meets all the requirements outlined for this unit during the framework description, such as Scatter-Gather capabilities and 2D data accesses. Even with off-the-shelf components, making efficient use of the available bandwidth on the PCI Express link is far from trivial and several factors have to be taken into account. These include the inevitable protocol overhead introduced by the communication mechanism, based on Transaction Layer Packets (TLPs), the symbol encoding used to reduce transmission errors on the physical layer, and the remaining system components [17]. However, since these factors greatly vary with the characteristics of the transmitted data, the quoted figures for PCI Express performance usually refer to the raw aggregate bandwidth that can be sustained by both directions simultaneously (Fig. 5.2). However, if the characteristics of the PCI Express protocol are taken into consideration, the actual achievable throughput can be reduced by as much as 20% in some situations, as determined by Goldhammer and Ayer in [17] (see dashed line in Fig. 5.2).
Figure 5.2: Aggregate throughput of various PCI Express configurations. The dashed line accounts for protocol overhead as per [17]
While the hard silicon PCI Express controller on the VC707 development board supports lane widths of up to 8× at Gen1 and Gen2 speeds, the AXI Bridge for PCI Express IP Core is more limited and only supports a subset of the possible configurations. In addition, the width of the AXI interface on the accelerator side is also a key parameter as it may impose a bottleneck for data transfer performance. This happens if the throughput on the accelerator side does not at least equal the aggregate throughput achieved by the PCI Express link. Table 5.1 summarizes the configurations supported by this IP Core.
Table 5.1: PCI Express Gen1 and Gen2 support on the AXI Bridge for PCI Express IP Core

No. of Lanes   AXI Data Width   Gen1   Gen2
x1             64               Yes    Yes
x2             64               Yes    Yes
x4             64               Yes    No
x4             128              Yes    Yes
x8             128              Yes    No
At this point, it is important to note that the PCI Express bridge and DMA engine cannot operate at frequencies higher than 100 MHz. This constraint arises from the need to clock the add-in card with the central motherboard clock, in order to ensure an accurate frequency lock (as discussed in Xilinx's Answer Record AR# 18329). Thus, the highest bandwidth the AXI interface can provide is 1.6 GB/s (16 B × 100 MHz) in each direction, given that the AXI4 protocol provides independent read and write channels. Again, this does not account for arbitration latencies or idle periods between successive bursts. This creates a throughput ceiling that effectively limits the achievable data transfer performance to 3.2 GB/s, meaning that both the x8 @ Gen1 and x4 @ Gen2 configurations cannot be fully exploited and will yield the same results. Given that the Gen2 implementation of the PCI Express bridge uses slightly more resources than Gen1, the x8 @ Gen1 combination was preferred.
The HIB was tested by setting up back-to-back transfers with varying buffer sizes and measuring the elapsed time between writing the address of the tail descriptor to the AXI DMA registers and the moment the interrupt signalling the end of the receive operation was received. For each buffer size, the procedure was repeated 100 times and the minimum elapsed time registered, in order to minimize as much as possible the impact of context switching and other non-deterministic behaviour on the host machine. In order to guarantee enough precision in the interval measurement, the gettimeofday() Linux call was employed, which provides resolutions as high as 1 µs. The results from this experiment are represented in Fig. 5.3.
Figure 5.3: Measured aggregate throughput for back-to-back transfers with varying buffer size
As expected, the achievable throughput rises with the increase in buffer size, since this masks the various overheads present on the system. From 256 KB onwards, the transfer rate saturates at around 2.3 GB/s, 72% of the theoretical limit of the AXI interface, which may be justified by the aforementioned arbitration phases and the latency between bursts, among other factors. Nevertheless, the reduced performance for smaller sizes can only be attributed to either the PCI Express bridge or the DMA engine. This can be confirmed by attaching a Chipscope monitor to the PCI Express bridge AXI Slave port, where the read and write requests from the DMA engine are received. Figure 5.4 depicts two waveforms obtained with the Chipscope Analyzer software, which correspond, respectively, to the AXI transaction of a 4 KB buffer from the PCI Express bridge to the DMA engine and vice-versa. By taking note of the elapsed time between the start and end of the successive bursts, a send throughput of 870 MB/s and a receive throughput of 1097.26 MB/s can be calculated, which results in an aggregate bandwidth of 1967.26 MB/s. This is significantly larger than the 53.51 MB/s obtained experimentally (Fig. 5.3) and confirms that the transfer time is being limited by one of the two structures of the HIB and not by the interconnection.
Figure 5.4: Chipscope waveforms obtained during a back-to-back transfer of 4 KB: (a) AXI transaction from the PCI Express bridge to the DMA engine; (b) AXI transaction from the DMA engine to the PCI Express bridge
It should be noted that the numbers presented above do not include the time required for the HotStream API to set up both transfers, which inevitably reduces the effective throughput. Because of this, the methods that handle this task within the API were carefully optimised so as to reduce unnecessary operations and branching. Nevertheless, this time can be significant, mainly when dealing with the transfer of small data chunks, and can be decomposed into two components: i) compulsory operations that result in a constant execution time and; ii) a variable execution time that is dependent on the number of physical chunks that resulted from mapping the user-provided buffer to the physical memory. The time elapsed during the configuration of both the send and receive transactions for a varying buffer size is depicted in Fig. 5.5. By including this initial latency in the numbers of Fig. 5.3, it can be observed that, as expected, the performance drop is more pronounced when smaller chunks are transferred (Fig. 5.6).
Figure 5.5: Time elapsed during the configuration of the send and receive transactions
Figure 5.6: Aggregate throughput for a back-to-back transfer including the time taken for the transaction set-up
Again, the results presented in this section refer to a x8 @ Gen1 PCI Express link and an AXI interface with a data width of 128 bits. Similarly, the AXI DMA engine, instantiated with 2D data access support, also utilizes 128-bit interfaces, in order to avoid bottlenecks due to bandwidth mismatches. It should be noted that, during these experiments, the support for multiple streaming destinations was not activated, as it significantly reduces the data copying performance, since it is incompatible with descriptor queueing [5]. Descriptor queueing essentially refers to a double-buffering mechanism which automatically fetches the subsequent SG descriptors in a chain, while the current ones are being processed, and stores them in a FIFO. However, in order to fully comply with the HotStream framework specification, the HIB must be capable of equally streaming data to the shared memory, any individual core or one of the associated instruction memories. This requires the support for multiple streaming destinations to be enabled and, therefore, a significant performance drop relative to the results presented in this section should be expected.
5.3 Backplane Implementation and Performance
As discussed in Sec. 2.4 and Sec. 2.3, the high-speed, high-connectivity backplane can be equally implemented by a NoC or by a more conventional Crossbar-based solution. Following the survey of publicly available NoC implementations, the Hermes network [30] was selected as the best balance between performance and resource usage. On the opposite side of the comparison, and given the adoption of the AXI family of interfaces for prototyping the HotStream framework, the AXI-Stream Interconnect [38] was naturally selected as the Crossbar against which the NoC must be compared.
The following subsections briefly discuss the main characteristics of both the Hermes NoC and the AXI Stream Interconnect Crossbar to determine which leads to the best implementation of the high-speed backplane for the specific case of an FPGA-based design (the trade-offs involved when targeting ASICs, for instance, are fundamentally different [36]).
5.3.1 Hermes NoC
The central element of the Hermes NoC is the Hermes Switch [30]. It encompasses five bi-directional ports, four of which are used to establish connections to the neighbouring switches, while the fifth ensures the communication with the local IP Core. The first four ports can be expanded through the use of virtual channels, which have the ability to reduce traffic congestion at the cost of an increased resource usage. After the round-robin based arbitration step is complete, the XY routing algorithm is used to connect the input port to the correct output port. Since wormhole switching is utilized, the routing decision is performed upon the reception of the header flit, while the second flit indicates the size of the payload. After all payload flits have been routed, the input-to-output mapping is marked as free.
The network structure is of the 2D Mesh type, meaning that the outer switches only possess three ports instead of five. The selection of this type of arrangement is justified by the easier placement and simpler routing algorithm. Thus, different switches have different peak performances, depending on their location within the mesh. In fact, while inner switches can theoretically maintain five simultaneous connections, outer switches see this number reduced to three. Since each flit takes two clock cycles to be sent, the aggregate peak bandwidth of a network with N × N nodes is given by:
$$\mathit{PeakThroughput} \;=\; 3 \times \mathit{flit}_{width} \times \sum_{i=1}^{4N-4} \frac{f}{2} \;+\; 5 \times \mathit{flit}_{width} \times \sum_{i=1}^{N(N-4)+4} \frac{f}{2} \qquad (5.1)$$
For a 3×3 network with 32-bit flits and a frequency of 100 MHz, the peak throughput is thus 46,400 Mbit/s, or 5,800 MB/s. According to the authors, the minimal latency, i.e., in the absence of network contention, to transfer a packet from a source to a target switch can be expressed in clock cycles as:
$$\mathit{MinLatency} = \left( \sum_{i=1}^{n} R_i \right) + P \times 2, \qquad (5.2)$$
where n is the number of switches in the communication path (otherwise known as hops), $R_i$ is the execution time of the routing algorithm at each switch (at least 10 clock cycles) and P is the packet size, which is multiplied by 2, since a single flit is sent in two clock cycles.
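Both expressions are easy to check numerically. The helpers below (names are illustrative) reproduce the 3×3 figure quoted above and give a worked example of Eq. (5.2):

```c
/* Peak aggregate throughput (bit/s) of an N x N Hermes mesh, per Eq. (5.1):
 * the 4N-4 border switches sustain 3 simultaneous connections, the
 * N(N-4)+4 inner switches sustain 5, and each connection delivers one flit
 * every two clock cycles. */
double hermes_peak_bps(int n, int flit_width, double freq_hz)
{
    double per_link = (double)flit_width * freq_hz / 2.0;
    return 3.0 * (4 * n - 4) * per_link + 5.0 * (n * (n - 4) + 4) * per_link;
}

/* Minimal contention-free latency in clock cycles, per Eq. (5.2): the sum
 * of the routing times along the n-switch path plus two cycles per flit. */
int hermes_min_latency(const int *routing_cycles, int n, int p_flits)
{
    int cycles = 0;
    for (int i = 0; i < n; i++)
        cycles += routing_cycles[i];
    return cycles + 2 * p_flits;
}
```

For n = 3, 32-bit flits and f = 100 MHz, hermes_peak_bps() yields 46.4 Gbit/s, matching the 46,400 Mbit/s above; a three-hop path with the minimum 10 routing cycles per switch and a 16-flit packet costs 3 × 10 + 16 × 2 = 62 cycles.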
5.3.1.A Modified packet structure
As it is proposed by the authors, the Hermes Switch requires the number of flits in a packet to be known before a transfer is initiated. While this is not a problem if the payloads are fixed in size, if the size of the messages varies over time, it is necessary to buffer an entire packet before populating the size flit. This naturally increases both the hardware resource utilization and the latency of the network. To circumvent this limitation, an end-of-packet (EOP) bit was added to each flit, so that the Switch can determine when to terminate the ongoing connection.
One drawback of this solution is that the routing-information overhead is no longer fixed but, instead, dependent on the payload size. For instance, considering 32-bit flits, the overhead of the proposed modification is smaller than that of the original implementation up to a payload size of 32 flits, after which it continues to increase while the original remains fixed.
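The break-even point follows from a simple comparison: the original format pays a fixed one-flit (here 32-bit) size field per packet, while the modified format pays one EOP bit per flit. A sketch of that comparison, under the simplification of counting only payload flits:

```c
/* Fixed overhead of the original Hermes packet: one full flit holding the
 * payload size. */
int overhead_original_bits(int flit_width)
{
    return flit_width;
}

/* Overhead of the modified packet: one EOP bit per payload flit, so it
 * grows linearly with the payload size (payload-only simplification). */
int overhead_eop_bits(int payload_flits)
{
    return payload_flits;
}
```

With 32-bit flits, the EOP scheme is cheaper for payloads below 32 flits, breaks even at 32 and becomes more expensive beyond that, matching the behaviour described above.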
5.3.2 AXI Stream Interconnect
The AXI4-Stream protocol is part of the AXI4 specification and is targeted at low-resource, high-bandwidth unidirectional data transfers. Flow control is implemented through two signals only, TVALID and TREADY, and support for bursts of undetermined length is ensured by a TLAST signal.
The AXI Stream Interconnect is an IP core developed by Xilinx to be used with its standard design tools. Its main functionality is to provide an efficient routing mechanism to allow the communication between multiple AXI4-Stream masters and slaves. In addition, it includes a collection of modules that further improve the IP functionality, such as bus width and clock conversion, pipelining and data buffering.
At the heart of the AXI Stream Interconnect is an arbitrated Crossbar, capable of interconnecting up to 16 masters and slaves with varying degrees of connectivity, i.e., a programmable connectivity map lets the user specify full or sparse Crossbar connectivity. The arbitration can either be round-robin or priority-based, with statically assigned priorities. In addition, it is possible to define when the arbitration mechanism is applied: at TLAST boundaries, after a set number of transfers and/or after a certain number of idle cycles.
Naturally, as the degree of connectivity of the Crossbar is increased, its area, throughput and latency will be affected. As such, combinations of N-master × 1-slave and 1-master × N-slave are preferable to M×N interconnects. The relevant user guide even goes so far as to suggest that, when M×N interconnects are absolutely necessary, the number of endpoints should be kept low or sparse connectivity should be specified [38].
The latency involved in each transfer depends on the particular configuration of the IP for each interface. The Crossbar switch itself inserts 2 clock cycles of latency, but the addition of a register slice or FIFO buffer adds an additional 1 and 3 clock cycles of latency, respectively.
On the other hand, the throughput of a datapath through the interconnect depends only on the data width and clock frequency used along the path. Therefore, its maximum throughput is limited by the slowest component. The peak aggregate throughput will be simply given by the peak throughput between each master-slave pair, multiplied by the number of pairs that can communicate simultaneously.
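This rule reduces to a one-line calculation. The sketch below applies it to the 32-bit datapath and the 144 MHz clock reported later in section 5.3.4, assuming 4 master-slave pairs active simultaneously:

```python
def peak_aggregate_throughput(clock_hz, data_width_bytes, active_pairs):
    """Peak aggregate throughput of the interconnect: the per-pair peak
    (one data beat per cycle) multiplied by the number of master-slave
    pairs that can communicate simultaneously."""
    return clock_hz * data_width_bytes * active_pairs

# Example: 32-bit (4-byte) datapath at 144 MHz with 4 concurrent pairs.
mb_s = peak_aggregate_throughput(144e6, 4, 4) / 1e6  # -> 2304.0 MB/s
```

The result matches the 2,304 MB/s Crossbar figure derived in section 5.3.4.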
49 5. Framework Prototype
5.3.3 Backplane Performance Evaluation
While the performance of a bus-based design can be easily evaluated by understanding its internal architecture, arbitration algorithm and overall latencies, determining the realistically achievable throughput and latency of a Network on Chip is considerably more complicated, as it greatly depends on the nature of the traffic it is subjected to, even more so than on the parameters of the network itself [13]. Moreover, even if a complete analytical description of the traffic behaviour were available, the resulting analysis would be very complicated, as the routing and arbitration decisions at each node must be taken into account, which becomes increasingly difficult as the size of the network grows. Therefore, it is common practice to evaluate the performance of such networks by resorting to software models or even to behavioural simulation of the actual Register Transfer Level (RTL) description of the circuit. In order to perform the Backplane evaluation, a behavioural description of a core emulator was developed to easily simulate the network under various real-use conditions. Performance measurements were obtained by building a VHDL testbench that logs every event across the network, along with its timestamp, to a text file. This text file is then parsed by a Python script, which calculates the latency incurred by each packet transfer and presents a final summary of the overall latency and throughput in the delivery of the injected data to the various nodes. According to [13], these are the most important performance metrics of an interconnection network. The following sections describe in greater detail the various features of the core emulator and the structure of the testbench used to monitor the network activity.
5.3.3.A Core Emulator and Stream Wrapper
In order to keep the core emulator as generic as possible, thus facilitating its reuse in other circumstances, it was designed with an interface that is fully compatible with the AXI-Stream specification. In addition, extensive use of generic parameters was made, allowing the configuration not only of basic parameters, such as the width of the data ports, but also of the emulation parameters described in greater detail below. The module supports two distinct operation modes: pipelined and non-pipelined. The first mode is considerably simpler and reduces the emulator to a one-slot FIFO, which receives one data beat and immediately tries to output it in the next clock cycle; meanwhile, no further data is accepted. This contrasts with the non-pipelined mode, which includes a non-zero latency before sending or receiving new data, effectively simulating a non-pipelined computation. Most of the remaining configuration parameters of the core emulator apply to this mode of operation and define whether the core starts in a sending or a receiving state, as well as the waiting period between these two states. It is also possible to configure the core in half or full duplex, i.e., to specify whether data is sent and received independently and simultaneously or, on the other hand, whether the two phases are serialized. Regardless of the chosen configuration, the module behaves in a cyclical manner, with a period that is also user-defined. Given that the core emulator was designed with a conventional stream interface in mind, a
wrapper is needed to make it compatible with the Hermes IP core interface. Fortunately, by selecting the credit-based flow control mechanism for the NoC, the wrapper's task is greatly simplified, as this mechanism is quite similar to the stream interface used by the emulator. Its job is thus reduced to adding the header flit, which indicates the destination of the payload that follows, and asserting the end-of-packet bit on the last flit of the payload. The number of flits per packet is determined by the burst size parameter.
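The wrapper's packetization step can be sketched as follows. The flit layout used here, a tuple of an end-of-packet bit and a 32-bit data word, is an assumption for illustration only; the actual Hermes flit format differs in its field encoding:

```python
def wrap_packet(dest, payload_flits):
    """Prepend a header flit carrying the destination address and tag the
    last payload flit with the end-of-packet (EOP) marker.
    Assumed flit layout for illustration: (eop_bit, 32-bit word)."""
    packet = [(False, dest)]  # header flit: routing information only
    for i, flit in enumerate(payload_flits):
        is_last = (i == len(payload_flits) - 1)
        packet.append((is_last, flit))
    return packet

# A 3-flit burst addressed to node 3 becomes a 4-flit packet.
pkt = wrap_packet(dest=3, payload_flits=[0x11, 0x22, 0x33])
```

This also makes the overhead discussed in section 5.3.1 explicit: one header flit per packet, regardless of payload length.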
5.3.3.B Testbench and Python script
In order to monitor the activity over the network, a small set of signals was routed from the Hermes network to the output of the unit under test (UUT), which consisted of the network with all the emulators properly attached. These signals were then evaluated by the testbench, which registered an entry in a text log whenever an event occurred, i.e., whenever a payload flit was sent or received by any of the attached cores. To allow the computation of latency data, the core emulators were modified to include, in the generated payload, a timestamp and the number of the node currently emitting the flit. For each entry in the log file, the Python script compares the current time to the timestamp carried by the flit, thus computing the average latency over all transactions. The average throughput of the data delivered to the multiple cores is obtained by dividing the number of received bytes by the time taken to complete all transactions. This is then compared to the throughput generated by all the core emulators combined.
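The post-processing step can be sketched as below. The log format (one receive event per line, with an assumed column order of receive time, node number and injection timestamp) is hypothetical; the thesis' actual log layout is not specified here:

```python
def parse_log(lines, bytes_per_flit=4):
    """Compute average latency and delivered throughput from a receive log.
    Assumed line format: '<recv_cycle> <node> <inject_cycle>'."""
    latencies, total_bytes, first, last = [], 0, None, None
    for line in lines:
        recv, _node, sent = line.split()
        recv, sent = int(recv), int(sent)
        latencies.append(recv - sent)          # per-flit transfer latency
        total_bytes += bytes_per_flit
        first = recv if first is None else min(first, recv)
        last = recv if last is None else max(last, recv)
    avg_latency = sum(latencies) / len(latencies)
    throughput = total_bytes / (last - first + 1)  # bytes per clock cycle
    return avg_latency, throughput

log = ["12 0 2", "13 1 2", "14 0 4"]
avg, thr = parse_log(log)
```

The delivered throughput can then be compared against the combined injection rate of the emulators, as described above.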
5.3.3.C Results
For simulation purposes, the Hermes NoC was configured as a 2 × 2 mesh with 32-bit-wide flits and 32-flit-long buffers at each of the two Virtual Channels available on each switch. The performance tests mainly focused on two key aspects: the effect of the payload size of each packet on the overall throughput, and the way in which traversing more or fewer routers affects the average latency. To accomplish these objectives, the test cases use worst- and best-case routing, as well as a variable burst length, from 8 to 1000 flits per packet. Worst-case routing corresponds to the situation where each node is instructed to send traffic to the node that sits diagonally across from it, as this leads to the largest number of hops until the destination is reached. On the other hand, in the best-case routing scenario, each core sends information to its neighbour, reducing the number of hops to the minimum possible value. When best-case routing is selected, nodes 0 and 1 first send 5,000 data flits to nodes 2 and 3, respectively, after which their roles are reversed. This scenario is depicted in Fig. 5.7a. When using worst-case routing, Fig. 5.7b, the procedure is analogous, but now nodes 0 and 1 start by sending packets to nodes 3 and 2, respectively, which, as a consequence of XY routing, forces the packets to traverse the longest possible path; in this simple case of a 2×2 network, this corresponds to one intermediate hop. It is worth noting that, while nodes 0 and 1 are simultaneously sending and receiving data, the performance of the network will not be affected
51 5. Framework Prototype
(a) Best-case routing (b) Worst-case routing

Figure 5.7: Traffic patterns for the NoC simulation

since each router contains independent send and receive channels on each of its virtual channels. The results presented in Fig. 5.8 show that, as the burst size increases, so does the throughput of the data delivered to the IP Cores, growing asymptotically to 8 bytes per clock cycle, which corresponds to the amount of data the two active cores inject per clock cycle. This is an expected result because, as the burst size increases, the impact of the overhead introduced by the routing decision is reduced, since wormhole switching is employed. It can be concluded that IP Cores with controller-like behaviour, occasionally issuing small data bursts, will suffer the biggest penalty. Likewise, and for the exact same reason, the latency decreases with increasing burst size, as depicted in Fig. 5.9, down to a minimum of 10 clock cycles, which corresponds to the latency required to traverse a single router, as described in section 5.3.1.
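This asymptotic behaviour follows from a simple first-order model: under wormhole switching, the per-packet routing overhead is paid once per packet, so the delivered throughput approaches the injection rate as the burst grows. The 10-cycle overhead used below is illustrative (taken from the single-router latency mentioned above), not a measured per-packet figure:

```python
def delivered_throughput(inject_rate, burst_flits, overhead_cycles):
    """First-order wormhole model: a packet of `burst_flits` flits occupies
    the path for `burst_flits + overhead_cycles` cycles, so the effective
    rate is the injection rate scaled by that efficiency."""
    return inject_rate * burst_flits / (burst_flits + overhead_cycles)

# With 8 bytes/cycle injected and an assumed 10-cycle per-packet overhead:
small = delivered_throughput(8.0, 8, 10)     # short bursts pay heavily
large = delivered_throughput(8.0, 1000, 10)  # approaches 8 bytes/cycle
```

The model reproduces the qualitative trend of Fig. 5.8: throughput rises monotonically with burst size and saturates at the injection rate.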
[Plot: delivered throughput, in bytes per clock cycle (2 to 16), versus burst size, in bytes (8 to 800), for three traffic configurations: Best-Case Routing, Worst-Case Routing, and Best-Case Routing Full Duplex]
Figure 5.8: Data delivery throughput in various traffic configurations. In every configuration, the input throughput is reached asymptotically
Finally, while maintaining the conditions of the previous experiment, the core emulators were configured to work in full-duplex mode. This results in all 4 nodes simultaneously inserting 4 bytes per cycle, leading to an aggregate injected bandwidth of 16 bytes per cycle. Again, the existence of independent send and receive interfaces in each router results in the same behaviour as in the previous two cases, with regard to both throughput and latency, as depicted by the dotted lines with triangular markers in Fig. 5.8 and Fig. 5.9.
[Plot: average latency, in clock cycles (0 to 70), versus burst size, in bytes (8 to 800), for Best-Case Routing, Worst-Case Routing, and Best-Case Routing Full Duplex]
Figure 5.9: Source-to-destination latency when using best-case or worst-case routing. Duplicating the inserted data throughput does not affect latency
5.3.4 Crossbar and NoC Comparative Evaluation
The rationale behind the adoption of Networks on Chip for very large SoCs is mostly the increasing cost, both in terms of area and communication delay, of the interconnection "wires". This is especially true for ASICs [18], but FPGA designs are also increasingly constrained by this issue, as new technology nodes make transistors smaller and faster, while "wires" become comparatively slower [27]. In this section, the NoC and Crossbar solutions discussed above are compared, to determine whether the former is advantageous for FPGA-based designs. Table 5.2 summarizes the resource utilization of the Hermes NoC in the configuration used in section 5.3.3, i.e., a 2 × 2 mesh with 4-byte-wide flits and 32-flit-long buffers with 2 Virtual Channels per port. In the same table, the hardware requirements for interconnecting a varying number of cores with the AXI Stream Interconnect Crossbar are presented. To guarantee a fair comparison between the two, the Crossbar was configured with a 32-bit datapath, round-robin arbitration and a data FIFO with a depth of 32 elements in each interface, which is analogous to the flit buffer in the Hermes NoC. Moreover, full connectivity was enabled in the switching element, so that any master interface can establish a connection with any slave port.
Table 5.2: Hardware utilization of the Hermes NoC configured in a 2×2 mesh and the AXI Stream Interconnect Crossbar for a varying number of independent cores
                Available    Hermes NoC    AXI Stream Interconnect Crossbar
    Resources                2 × 2 Mesh    4 Cores    8 Cores    16 Cores    4 Cores (@ 128 bit)
    Regs        607,200      15,935        1,773      4,584      8,656       5,229
    LUTs        303,600      7,886         1,125      2,981      8,882       2,448
    Slices      75,900       5,587         849        1,969      4,875       1,989
    Max. Freq.  -            200 MHz       144 MHz    133 MHz    133 MHz     146 MHz
The results presented in table 5.2 clearly show a significant gap between the Crossbar and
NoC solution in terms of hardware requirements. In fact, for the same number of interconnected cores, the Crossbar requires 6.6× fewer slices than its NoC counterpart. As expected, the Hermes NoC offers better performance in terms of attainable clock frequency, which confirms the reduction in the average length of the interconnecting "wires"; however, the hardware overhead introduced by its multiple switches means that its scalability is very limited in FPGA designs. Moreover, the smaller area footprint of the Crossbar solution leaves room for increasing the interconnection width, thus potentially increasing the peak throughput, while still keeping the resource utilization lower than that of the equivalent NoC configuration.
For equivalent ”wire” widths, however, the peak throughput of the NoC will be superior to that of the Crossbar due to the higher clock frequency. In fact, by replacing the variables in equation 5.1, derived in section 5.3.1, with a frequency of 200 MHz, and a 2×2 mesh, a peak throughput of 4,800 MB/s is obtained. On the other hand, at a clock frequency of 144 MHz, the crossbar yields a peak throughput of 2,304 MB/s.
Since the Crossbar is never subject to traffic contention, its bandwidth does not vary with traffic behaviour as in the NoC. Table 5.3 provides a comparison between the throughput and latency of the Crossbar and NoC for the worst-case routing scenario depicted in Fig. 5.9 and Fig. 5.8. Even for controller-like behaviour, which is characterized by small data bursts, the NoC provides superior performance due to its higher operating frequency, even though the average transmission latency is longer in every situation.
Table 5.3: Throughput and latency of the Hermes NoC and AXI Stream Crossbar when interconnecting 4 Cores, under different traffic conditions
    Burst    NoC                                                Crossbar
    Size     Latency (Avg.)  Peak Thr.   Measured Thr.          Latency (Avg.)  Peak Thr.   Measured Thr.
             [c.c.]          [MB/s]      [MB/s]                 [c.c.]          [MB/s]      [MB/s]
    8        63.3            6,400       4,800                  6               4,608       4,608
    100      40.7            6,400       6,160                  6               4,608       4,608
    1000     15              6,400       6,390                  6               4,608       4,608
In conclusion, the NoC is undeniably more scalable in terms of operating frequency, as the addition of extra nodes does not increase the average length of the interconnections between the switches. However, the gains obtained in performance are overshadowed by the considerable resource utilization. Moreover, although NoCs are usually targeted at SoCs with a large number of IP Cores, the analysis above shows that such large numbers, for which NoCs are supposedly the best solution, are simply not feasible in modern FPGA devices, given the large resource overhead introduced by the network switches. Finally, it is interesting to note that, although in ASIC designs the circuit is evaluated in terms of its total area, which comprises both transistors and wires, in FPGA designs resource usage is, generally speaking, evaluated only in terms of occupied LUTs, slices and registers, and not in terms of the number or length of the interconnection resources. This further reduces the attractiveness of FPGA-based NoC implementations.
5.4 Shared Memory Performance
The shared memory access time is another critical factor for the performance of the HotStream framework, since stream re-use within the MCPE is entirely supported by this element. Despite providing a considerable peak throughput of 12.8 GB/s, under normal operating conditions the bandwidth utilization of the off-chip DDR3 memory available on the VC707 development board may be significantly reduced. While the overall read and write latency of the DDR module depends on how the memory controller is configured, the most influential factor is the behaviour of the traffic and access patterns (Xilinx Answer Record AR# 45644). Thus, it is not possible to evaluate the performance of the shared memory from a purely theoretical standpoint. Fortunately, the Memory Interface Generator (MIG) tool, provided by Xilinx to configure and instantiate memory controllers, also generates a timing-accurate simulation model. This model eases the profiling of the memory controller and memory module combination under varying access patterns. By performing simulations with the various access patterns available in the testbench, the read access latency was determined to be between 23 and 30 clock cycles. In continuous operation, i.e., when multiple read requests are issued sequentially, the latency between bursts of 8 data beats is reduced to 5 clock cycles. This corresponds to an overall access efficiency of 8 / (8 + 5) = 0.615 and, on average, a read throughput of 7.87 GB/s. The same procedure, repeated for write accesses, produced similar results, justifying the adoption of the 7.87 GB/s mark as the overall average DDR access throughput. However, it should be noted that these values refer to the native user interface provided by the memory controller. In the context of the HotStream framework, an AXI Slave controller is added in order to facilitate multiple arbitrated accesses to the shared memory.
This naturally reduces the throughput due to the various overheads of the AXI4 protocol, as well as the latencies introduced by the AXI Slave controller attached to the native memory controller.
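The efficiency figure above follows directly from the simulated burst timing. A minimal check of the arithmetic:

```python
# Reproduces the efficiency arithmetic of section 5.4 from the simulated
# timings: bursts of 8 data beats separated by 5 idle cycles, on a memory
# interface with a 12.8 GB/s peak throughput.
burst_beats = 8
gap_cycles = 5
peak_gb_s = 12.8

efficiency = burst_beats / (burst_beats + gap_cycles)  # ~0.615
avg_throughput = peak_gb_s * efficiency                # ~7.87 GB/s
```

Any additional AXI protocol overhead would lengthen the effective gap between bursts, lowering the efficiency term accordingly.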
5.4.1 Cycle-Accurate Simulator
In a framework such as the one presented in this thesis, it is important to be able to accurately predict the expected performance of a given application without having to go through the full design, testing and integration cycle. This makes it possible to iteratively adapt a particular design to the framework and to mitigate existing bottlenecks in the early stages of the development process, resulting in a significantly decreased time-to-market and a lower engineering effort. The extensive profiling effort presented in this chapter is key to making this possible, as it provides the necessary information to simulate the behaviour of the various communication channels that compose the HotStream framework. To combine all this information into usable performance metrics, a Python-based simulator was developed that takes an arbitrary number of Core configuration files as input and performs a cycle-accurate simulation of all the transactions that take place between the multiple Cores, the backplane, and the shared memory. The simulator is highly parametrizable, as it allows the user to define
the size of the burst requests that are issued to the shared memory, the arbitration latency for both read and write accesses, and the latency incurred between successive burst accesses from the same Core. Each Core configuration file is composed of a sequence of instructions that make it easy to emulate any stream-based Core, by taking into account its data processing latency, the address generation behaviour of the associated DFC and its relationship with other Cores, expressed by issuing synchronization requests. The output of the simulator is a comprehensive set of statistics that indicates the overall utilization of the shared memory's read and write channels and the number of unused cycles. Similar information is provided for each emulated Core, complemented with the total number of reads and writes it performed, as well as its period of activity in clock cycles. As it is an event-driven simulator, the simulation halts when no more read or write requests are received. The duration of the simulation, in clock cycles, provides an estimate of the execution time of the application after it has been mapped to the HotStream framework.
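The core of such an event-driven simulator can be sketched as a priority queue of timestamped events. The event kinds and the 23-cycle read latency below are illustrative placeholders (the latency is borrowed from the profiling in section 5.4), not the actual implementation:

```python
import heapq

def simulate(events):
    """Minimal event-driven loop: pop the earliest event, process it, and
    possibly schedule follow-up events. The simulation halts when the
    queue is empty. Event = (cycle, kind); `kind` names are illustrative."""
    queue = list(events)
    heapq.heapify(queue)
    last_cycle = 0
    trace = []
    while queue:  # halts when no more requests are pending
        cycle, kind = heapq.heappop(queue)
        last_cycle = cycle
        trace.append(kind)
        if kind == "read_request":
            # assumed 23-cycle read latency, as profiled in section 5.4
            heapq.heappush(queue, (cycle + 23, "read_complete"))
    return last_cycle, trace

cycles, trace = simulate([(0, "read_request"), (5, "read_request")])
```

As in the simulator described above, the final value of the clock (here, `cycles`) estimates the total execution time of the emulated workload.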
5.5 Summary
The prototyping of the framework was done on a Xilinx VC707 development board, powered by a state-of-the-art Virtex-7 FPGA and complemented by a high-performance DDR3 memory and an 8-lane PCI Express interface capable of Gen2 speeds. The Host machine was equipped with an Intel Core i7 3770K processor at 3.5 GHz and 16 GB of DDR3 memory. Being in a Xilinx environment, the HIB was naturally implemented through the combination of the AXI DMA engine and the AXI Bridge for PCI Express. The subsequent experimental evaluation of the PCI Express connection confirmed that the achievable throughput over this type of interface greatly depends on the size of the data chunks being exchanged: small-sized chunks can lead to aggregate throughputs that are 160× lower than what is possible under ideal conditions. By taking into account the time elapsed during the configuration steps performed by the HotStream Application Programming Interface (API) when a transfer is configured, aggregate throughputs as high as 2.1 GB/s were measured for a x8 Gen1 configuration. For the implementation of the backplane interconnection, two opposing solutions were compared: the Hermes NoC [30] and the AXI Stream Interconnect [38] Crossbar. Taking data delivery throughput and latency as the two main performance metrics, the NoC solution proved to be slightly superior to its counterpart. However, such an advantage comes at a hardware-utilization cost that seriously hinders the scalability objectives defined for the HotStream framework. In fact, the obtained results can be extrapolated to most FPGA designs: the considerable amount of hardware required to implement the switches and routers that form the "wires" of a NoC implementation makes it unattractive for designs based on reconfigurable hardware. For the shared memory, an off-chip DDR3 memory was used.
To evaluate this component, a timing-accurate simulation model was used, easing the characterization of the memory access time when different read and write patterns are utilized. While the peak bandwidth of the module is
12.8 GB/s, the various dynamic mechanisms present in this type of device reduce this value to an average access throughput of 7.87 GB/s for the considered read and write patterns. By taking advantage of the results obtained during the characterization of the various communication channels, a Cycle-Accurate Simulator was developed that makes it possible to obtain performance estimates of HotStream-based accelerators early in the development cycle. An arbitrary number of Cores can be emulated, each described by its own configuration file, where its data processing latency, its synchronization with other Cores and the address generation rate of the associated DFC are specified. Accesses to the shared memory are also accurately simulated, based on the various latency figures obtained in this chapter, which may be changed by the user to reflect different configurations of the data bus or of the shared memory itself.
6 Framework Evaluation
Contents
6.1 General Evaluation ...... 59
6.2 Case Study 1: Matrix Multiplication ...... 63
6.3 Case Study 2: Image processing chain in the frequency domain ...... 69
6.4 Summary ...... 75
The HotStream framework is a comprehensive solution for the development of stream-based hardware accelerators. It aims at handling all the communications between the accelerator and the Host machine, as well as at facilitating the management of the intra-accelerator communications. This effectively allows the hardware designer to focus on the development of highly efficient computation Cores, as the powerful, yet easy-to-use, HotStream API and the programmable DFCs guarantee that the final product will provide the best possible performance. To accomplish this, the HotStream framework is essentially divided into 3 components: i) the software layer; ii) the HIB; and iii) the MCPE. The first two work in conjunction to allow the streaming of data between the accelerator and the Host, while the extensive support for coarse-grained patterns makes it possible to fully exploit the available bandwidth between the two. The third component is where the actual computation takes place, hosting an arbitrary number of computation Cores that may be interconnected with each other via the high-speed, low-latency Backplane interface. This backplane also allows the Cores to access the shared memory through individual DFCs. These DFCs are a key element of the framework, as they provide fine-grained access to the large-capacity shared memory, which confers on the HotStream framework its unique data-reuse capabilities. For this purpose, this unit was custom-designed to support arbitrary access patterns, which can be easily programmed through a custom assembly language, without compromising the address generation efficiency or the hardware resources.
6.1 General Evaluation
The following sections provide a comprehensive evaluation of the HotStream framework in two major steps. First, its fine-grained memory access capabilities are tested through a series of commonly used access patterns, where the address generation rates and the specific memory requirements for the storage of the pattern descriptions are compared with the most relevant related art, namely the PPMC [20]. However, since the PPMC implementation was not publicly available at the time of this work, the Xilinx AXI DMA engine was used for this purpose, as it features functionalities identical to those of the PPMC [20]. This DMA engine IP Core was also used in the prototyping framework, to implement the HIB module. The second step demonstrates how the HotStream framework can be used to develop real applications, and the levels of performance that can be expected. This is achieved through two distinct case-studies: i) a block-based multiplication of very large matrices; and ii) a full signal processing chain, where high-resolution images, in the range of 1024 × 1024 to 4096 × 4096 pixels, are filtered in the frequency domain using 2D FFTs.
6.1.1 Resources Overhead
Considering the strong focus on scalability of the proposed streaming framework, it is paramount that its core elements do not significantly impact the overall resource usage. This goal was achieved by designing the DFC with a low area footprint in mind, as this is bound to be the most replicated unit in this framework.
Table 6.1 summarizes the resource utilization of the key elements composing the framework. It is important to note that the DFC is fully configurable with respect to the number of Loopcontrol units that are used; thus, its resource occupation varies with the chosen configuration. Table 6.1 presents the results for a DFC configuration with 1 to 3 Loopcontrol units. All resource utilization and performance figures refer to an implementation on a XC7VX485T Virtex-7 FPGA.
Table 6.1: Resource usage for each component in the MCPE and HIB (hardware platform: XC7VX485T Virtex-7 FPGA)
                Available   DFC (1-3 Loopcontrol   Streaming   Backplane      DSB       DMA
    Resources               units) + BMC           Bus (AXI)   Interconnect
    Slices      75,900      1,014 - 1,216          3,273       4,875          5,300     1,548
    LUTs        303,600     1,743 - 2,225          5,305       8,882          12,620    3,588
    Regs        607,200     1,553 - 2,141          4,922       8,656          9,160     4,128
    DSPs        2,800       4                      0           0              0         0
    BRAM36      1,030       1                      0           0              0         6
    Max. Freq.  -           160 MHz                167 MHz     146 MHz        200 MHz   136 MHz
It is important to note that, while the resource utilization of the Backplane Interconnect (implemented in a Crossbar topology) seems rather high, it represents a worst-case scenario, configured to support full connectivity among a maximum of 16 independent nodes. On the other hand, each DFC/BMC pair accounts for only 1.6% of the total resources available in the device, which ensures the stated scalability goal. While competitive for FPGA-based designs, the maximum operating frequency of the DFC is limited by the simple pipelined nature of the adopted microcontroller. By adopting a more aggressively optimized architecture, higher processing frequencies could be achieved. This solution was not pursued because, as explained in section 5.2, for an add-in card to work in any environment, its PCI Express interface and DMA engine must be clocked at the 100 MHz clock provided by the motherboard.
In this particular embodiment of the HotStream framework, the DSB, which is implemented by the PCI Express bridge, accounts for the largest fraction of resource usage, at roughly 7% of the available resources of the target FPGA. Naturally, this balance would change if the implementation platform featured other means of communication between the Host processor and the accelerator module, such as the ones available on the Zynq All Programmable SoC.
Regarding the relationship between the DFC and the BMC (as depicted in Fig. 3.4), it should be recalled that a BMC can be shared by two independent DFCs. Incidentally, the BMC is nearly two times larger than each DFC, as depicted in Tab. 6.2. This means that the DFC + BMC column in Tab. 6.1 corresponds to a worst-case scenario, where the full-duplex capabilities of the BMC are not exploited. In the case of a Core arrangement where DFCs and BMCs are perfectly paired, the hardware cost of each DFC + 1/2 BMC combination is effectively halved. In the grand scheme of the framework, this corresponds to just 0.8% of the total resources available in the device.
Table 6.2: Individual resource usage of the DFCs and BMCs (hardware platform: XC7VX485T Virtex-7 FPGA)
              Available   DFC         BMC
    Slices    75,900      236 - 337   542
    LUTs      303,600     371 - 612   1,001
    Regs      607,200     390 - 684   773
    DSPs      2,800       2           0
    BRAM36    1,030       0           0
6.1.2 Stream Generation Efficiency
Given the relation between the complexity and nature of the considered patterns and the resulting address generation rate and size of the pattern descriptor, a proper evaluation of the proposed DFC and of the overall framework can only be achieved through a representative benchmark. Therefore, five distinct patterns of varying complexity are herein considered: Linear; Tiled; Diagonal; Zig-Zag; and Greek Cross. While the first two are usually found in a wide range of applications, the remaining three are somewhat more exotic in nature. Nevertheless, they are still of great importance in the context of stream-computing. As an example, the Diagonal access pattern is extensively used by the Smith-Waterman algorithm for DNA sequence alignment [25]; the Zig-Zag scanning is a key element in the entropy encoding of the AC coefficients in the JPEG and MPEG standards [37]; and the Greek Cross is often used by the vast class of diamond-search motion estimation algorithms adopted in video encoding [41]. Figure 6.1 depicts the access patterns being considered, including their size and evolution over time, as well as the pseudo-code for their generation using the API proposed for this framework.
(a) Linear (b) Tiled (c) Diagonal (d) Zig-Zag (e) Greek Cross
Figure 6.1: Access patterns, with varying complexity degrees, adopted for the DFC evaluation
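As a point of reference for the patterns above, the 8×8 Zig-Zag scan used by JPEG entropy coding can be expressed in a few lines of Python. This is a plain software sketch of the access order only, not the DFC assembly used by the framework:

```python
def zigzag_addresses(n=8):
    """Return the linear addresses of an n x n block in Zig-Zag scan order,
    walking each anti-diagonal in alternating directions (JPEG-style)."""
    addrs = []
    for d in range(2 * n - 1):                  # one pass per anti-diagonal
        cells = [(i, d - i) for i in range(n) if 0 <= d - i < n]
        if d % 2 == 0:
            cells.reverse()                     # even diagonals run upwards
        addrs.extend(row * n + col for row, col in cells)
    return addrs

order = zigzag_addresses(8)
# order begins 0, 1, 8, 16, 9, 2, ... as in the JPEG standard
```

The per-diagonal loop in this sketch mirrors the structure that the DFC's pattern description code must encode, which is why unrolling that loop (discussed below for Table 6.3) trades code size for generation rate.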
The metrics considered for this evaluation are the code size required to describe each pattern,
and the address generation rate, defined as the average number of addresses generated per clock cycle. As stated above, the AXI DMA engine is used as the baseline for this comparison, representing the characteristics of most descriptor-based pattern generation mechanisms that have been proposed in the related art. Table 6.3 depicts the obtained results.
Table 6.3: Address generation rate and descriptor size of the considered access patterns (the adopted length of each pattern results from the parameterization depicted in Fig. 6.1)
                              DFC                        DMA
    Pattern    Length         Code Size    Addr/cycle    Code Size    Addr/cycle
    Linear     1024           24           1             32           0.96
    Tiled      128×72 (1)     40           0.99          32           1
    Diagonal   1024×1024      44           1             65k          1
    Zig-Zag    8×8            48 (132*)    0.36 (0.71*)  480          0.63
    Cross      1024×1024      132          0.89          228k         1
    * Values obtained after loop unrolling
    (1) For a memory block of 512×512
By analyzing the pattern generation results obtained with the proposed DFC, when compared with traditional descriptor-based data-fetch DMA mechanisms, it can be concluded that the proposed controller achieves a similar address generation rate but with significantly lower code-memory requirements. Moreover, the related state of the art does not offer any form of scalability, as the size of the descriptor increases with the length of the pattern. This is particularly emphasized for the Diagonal and Cross patterns: for a 1024×1024 pattern, the descriptors for the conventional DMA occupy about 1500× more memory to store the access patterns than the proposed DFC approach. For larger matrices, this discrepancy is even greater. As a consequence of the larger code size, the conventional DMA approach would require a significantly larger internal memory and, eventually, an external processor to dynamically generate the patterns, which would further increase the required hardware resources and reduce the attainable performance. Nevertheless, certain cases still exist where the execution time of the pattern description code in the DFC cannot be entirely overlapped with the address generation, with a consequent reduction of the attained rate. This is the case of the Zig-Zag access pattern. To circumvent this problem, the loop that sets the AGC parameters for each diagonal can be unrolled, thus improving the address generation rate at the cost of a slightly larger code size. This technique effectively doubles the rate of the Zig-Zag pattern generation (values marked with an * in the table), allowing for an address generation rate above the one reported in the related state of the art, still with smaller memory requirements.
Naturally, the actual performance gains that can be achieved by accelerating a given data streaming application with the proposed framework depend not only on the amount of parallelism that can be exploited, but also on the involved computational complexity (i.e., the number of operations performed on a single data element). The latter is especially important, as it effectively defines the amount of data reuse that can take place within the framework. In order to provide an
insight into the speed-up magnitudes that can be expected from the HotStream framework, the following sections present two case studies. The first considers the block-based multiplication of very large matrices, which is able to take full advantage of the advanced data-reuse capabilities of the proposed framework, as well as its inherent support for a high degree of data parallelism. The second consists of a full image processing chain in the frequency domain, using 2D FFTs, representing a complete and self-contained final product that also highlights several features of the HotStream framework, such as its extensive support for heterogeneity among the computing Cores.
6.2 Case Study 1: Matrix Multiplication
In this section, a block-based matrix multiplication example is used to evaluate and compare the proposed framework against other usual approaches based on hardware accelerators. As a comprehensive data management solution for streaming applications, the proposed framework provides efficient data streaming mechanisms between the Host and the accelerating hardware, as well as extensive data (re-)usage and (pre-)fetching capabilities within the MCPE. The outcome is a significant increase in the attained input/output data bandwidth within each processing core, with the consequent maximization of the resulting data processing throughput. This approach contrasts sharply with traditional implementations, where the Host GPP or a conventional DMA engine centralizes the whole data management, at the cost of being subjected to the (often rather limited) data bandwidth of the underlying communication interface between the Host and the accelerating hardware. While this block-based matrix multiplication case study demonstrates the potential of the proposed framework, additional advantages are expected as the application complexity grows.

Block-based matrix multiplication is typically used to improve data locality, and allows the implementation of multiplications with matrix sizes much greater than would be possible if the multiplication were performed in one single step. The considered implementation is divided into two steps: i) the multiplication of the sub-blocks; and ii) the accumulation (reduction) step, which composes the final matrix from the computed partial sub-matrices [26]. Equation 6.1 depicts a simple partitioning example of an N×N matrix multiplication, considering sub-blocks of size N/2×N/2.

\begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix}
\cdot
\begin{bmatrix} B_{11} & B_{12} \\ B_{21} & B_{22} \end{bmatrix}
=
\begin{bmatrix}
A_{11} \cdot B_{11} + A_{12} \cdot B_{21} & A_{11} \cdot B_{12} + A_{12} \cdot B_{22} \\
A_{21} \cdot B_{11} + A_{22} \cdot B_{21} & A_{21} \cdot B_{12} + A_{22} \cdot B_{22}
\end{bmatrix}
\quad (6.1)
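The two-step decomposition of Equation 6.1 generalizes to any block size that divides N. As a functional reference (plain Python, not the hardware implementation; the function name and block size are illustrative), block-based multiplication can be sketched as:

```python
def block_matmul(A, B, bs):
    """Multiply square matrices A and B using bs x bs sub-blocks.

    Step (i): multiply the sub-blocks A[bi:bi+bs, bk:bk+bs] and
    B[bk:bk+bs, bj:bj+bs]; step (ii): accumulate (reduce) the partial
    products into the output block C[bi:bi+bs, bj:bj+bs].
    """
    n = len(A)
    assert n % bs == 0, "matrix size must be a multiple of the block size"
    C = [[0] * n for _ in range(n)]
    for bi in range(0, n, bs):            # row index of the output block
        for bj in range(0, n, bs):        # column index of the output block
            for bk in range(0, n, bs):    # reduction over the inner dimension
                for i in range(bi, bi + bs):
                    for j in range(bj, bj + bs):
                        acc = C[i][j]     # running partial sum for C[i][j]
                        for k in range(bk, bk + bs):
                            acc += A[i][k] * B[k][j]
                        C[i][j] = acc
    return C
```

Only the loop nesting differs from a naive triple loop, but each bs×bs working set now fits in fast local storage, which is precisely the locality the accelerator exploits.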
Three different approaches were considered for this evaluation. The first, hereinafter denoted as conventional, simply streams the matrix data over the PCI Express link into the accelerator, where a single matrix multiplication core is consuming the incoming data and producing the results that are streamed back to the Host. The second approach, referred to as conventional+buffering, features an additional memory in the accelerator, which is large enough to buffer one of the input matrices, so that it can be reused over the course of the entire computation. Finally, the third
[Figure 6.2: block diagram, with PCIe streaming into the sub-block matrix multiplication core (M), followed by the 16:1 (A1) and 8:1 (A2) matrix addition cores, connected through shared memories on the backplane, and PCIe streaming the result out]
Figure 6.2: HotStream-based implementation of the block-based multiplication algorithm, consisting of 3 Kernels to process multiple and concurrent data streams, where double buffering is used on the shared memory to overlap communication with computation

approach makes use of the HotStream framework, which maximizes the data re-usage by including reduction (accumulation) modules on the MCPE that run concurrently with the multiplication cores (see Figure 6.2), as well as by overlapping the data communication with the computation through double-buffering techniques. It should be noted that, to implement a 4096×4096 matrix multiplication with 32×32 sub-blocks, a 128:1 reduction step is required, in which all the intermediate results from the sub-block multiplications are combined, through simple additions, into the final matrix. This is achieved by using two addition cores: one to perform a 16:1 reduction and the other to perform an 8:1 reduction. The result is a self-contained accelerator that only streams the final matrix back to the Host, contrasting with the former solutions, which completely rely on the Host to perform the reduction. In addition to these three basic implementations, corresponding parallel versions were also considered, by replicating the structure depicted in Fig. 6.2. The level of exploited parallelism is limited either by the available hardware resources or by the data bandwidth capacity of the communication channels.
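The 128:1 reduction performed by the two addition cores can be modeled as two successive grouping stages, a 16:1 stage followed by an 8:1 stage (16 × 8 = 128). The sketch below is an illustrative software model of that composition, not the RTL of the accumulation cores; all names are ours.

```python
def reduce_stage(blocks, factor):
    """One accumulation stage: sum every group of `factor` consecutive
    partial sub-matrices into a single sub-matrix (element-wise addition)."""
    assert len(blocks) % factor == 0
    out = []
    for g in range(0, len(blocks), factor):
        acc = [row[:] for row in blocks[g]]        # copy the first block
        for blk in blocks[g + 1:g + factor]:
            for i, row in enumerate(blk):
                for j, v in enumerate(row):
                    acc[i][j] += v
        out.append(acc)
    return out

def reduce_128_to_1(partials):
    """Compose the 16:1 and 8:1 stages of Fig. 6.2 (128 -> 8 -> 1)."""
    assert len(partials) == 128
    return reduce_stage(reduce_stage(partials, 16), 8)[0]
```

In the accelerator the two stages run concurrently with the multiplication cores, so the reduction cost is hidden behind the sub-block products rather than serialized after them.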
6.2.1 Computing Cores
It is important to recall that the main focus of this case study is not on the adopted matrix multiplication cores, but on the framework itself. Accordingly, it was decided to adopt off-the-shelf Xilinx IP Cores [4] to implement both the matrix multiplication and the accumulation cores. It is worth noting that the same multiplication units are also used in the conventional solutions. These soft cores support matrices of up to 32×32 entries of 2 bytes each. While this limits the proposed solution in terms of data width, it is very suitable for evaluating the considered framework in terms of scalability. To implement the accumulation cores A1 and A2, depicted in Fig. 6.2, the off-the-shelf Xilinx IP Cores were arranged in a binary tree, as depicted in Fig. 6.3. Since the input is in serial form, i.e., one matrix entry is received per clock cycle, a set of input buffers was added to the first level of accumulators. The reduction operation only starts when the first level is ready to be processed. As for the multiplication core, it uses a dedicated structure that offers a higher degree of data re-use. In fact, given that each 32×32 sub-block of matrix A is involved in a
[Figure 6.3: binary tree of A1 accumulator cores, with four accumulators at the input level, two at the intermediate level, and one producing the output]
Figure 6.3: 8:1 binary reduction tree based on Xilinx Matrix Accumulators. The structure of the 16:1 reduction core follows the same architecture but with double the number of basic accumulators
number of multiplications that is equal to the number of columns of matrix B, its re-utilization greatly reduces the number of accesses to the shared memory. Figure 6.4 depicts the structure of this multiplication core, where a sub-block from matrix A is stored and re-used during the multiplication with a full line of sub-blocks from matrix B. Once a line is finished, the input buffer is flushed and a new matrix A sub-block is pushed in.
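This re-use can be modeled as follows (illustrative Python; `fetch_a` stands in for a shared-memory read and is not an actual framework call): the A sub-block is fetched once per line of B sub-blocks and kept in the input buffer, so the number of shared-memory reads of A drops from one per product to one per line.

```python
def matmul(a, b):
    """Plain dense sub-block product (software stand-in for core M)."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def line_products(fetch_a, b_blocks):
    """Fetch the A sub-block once, then re-use it for every B sub-block
    in the line, as the buffered core of Fig. 6.4 does."""
    a = fetch_a()                        # single shared-memory read of A
    return [matmul(a, b) for b in b_blocks]
```

For a 4096×4096 multiplication with 32×32 sub-blocks, each line holds 128 B sub-blocks, so this buffering cuts the A-traffic to the shared memory by a factor of 128.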
[Figure 6.4: multiplication core M, with an input buffer holding Sub-Block A that is re-used against the incoming stream of Sub-Blocks B before the results reach the OUTPUT]
Figure 6.4: Internal structure of the multiplication core utilized in the HotStream implementation of the matrix multiplication. Sub-blocks from matrix A are stored and re-used during the computation of a full sub-block line from matrix B
These cores run at a conservative frequency of 100 MHz and output one 2-byte matrix element per clock cycle, after a significant initial latency that is effectively hidden by the large data set that composes the stream. Therefore, a constant rate of 100 MOps (Million Operations per Second) is maintained by each core. Table 6.4 summarizes the resource utilization of each of the basic cores used in the various implementations of the matrix multiplication, as well as of the more complex structures featured in the HotStream version.
Table 6.4: Resource usage of the cores utilized in the various implementations of the 4096×4096 matrix multiplication (hardware platform: Xilinx Virtex-7 XC7VX485T FPGA)
          Available   Xilinx       Xilinx      HotStream   HotStream      HotStream
                      Mat. Mult.   Mat. Add.   Mult.       Acc. Tree 16   Acc. Tree 8
Slices    75,900      10,588       117         10,650      1,541          792
LUTs      303,600     9,238        144         12,230      2,141          1,134
DSPs      2,800       32           1           32          15             7
RAMB36    1,030       2            0           3           17             9
RAMB18    2,060       64           0           64          0              0
6.2.2 Roofline Model
To evaluate the available design space in terms of the offered performance, the Roofline model [21] was applied to the considered case study, in order to correlate the exploited processing performance with the throughput of the involved communication channels.

Figure 6.5 depicts the peak performance of the conventional implementations, using parallelism levels of 1×, 2× and 4×, which result from using 1, 2 and 4 Multiplication Cores, respectively, to implement the matrix multiplication kernel (exploiting data parallelism). For a 1× parallelism level, the conventional solution is limited by the performance of the Xilinx IP Core that performs the matrix multiplication. By increasing the number of cores to 2 or 4, 2× or 4× parallelism can be achieved, respectively. However, these implementations become limited by the PCIe link, resulting in a performance of only 190 MOps, i.e., a speed-up of only 1.9 with 4× parallelism.
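The two ceilings of the Roofline model can be summarized in one line: attainable performance is the minimum of the compute peak and the product of link bandwidth and operational intensity. The sketch below uses hypothetical numbers chosen only to reproduce the 190 MOps figure quoted above (380 MB/s at 0.5 operations per byte is one consistent pair); the actual measured values are those reported in the text and in Figure 6.5.

```python
def roofline(peak_mops, link_mb_per_s, ops_per_byte):
    """Attainable performance (MOps): the minimum of the compute ceiling
    and the bandwidth ceiling (link throughput times operational intensity)."""
    return min(peak_mops, link_mb_per_s * ops_per_byte)

# Illustrative numbers only: four 100-MOps cores behind a bandwidth-limited
# link are capped at 190 MOps, while a single core remains compute-bound.
four_cores = roofline(4 * 100, 380, 0.5)   # bandwidth-bound
one_core = roofline(1 * 100, 380, 0.5)     # compute-bound
```

This is exactly why the HotStream version gains from data re-use: buffering and on-accelerator reduction raise the operational intensity, moving the design point off the bandwidth roof.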