PGAS Communication for Heterogeneous Clusters with FPGAs

by

Varun Sharma

A thesis submitted in conformity with the requirements for the degree of Master of Applied Science
Graduate Department of Electrical and Computer Engineering
University of Toronto

© Copyright 2020 by Varun Sharma

Abstract

PGAS Communication for Heterogeneous Clusters with FPGAs

Varun Sharma
Master of Applied Science
Graduate Department of Electrical and Computer Engineering
University of Toronto
2020

This work presents a heterogeneous communication library for generic clusters of processors and FPGAs.

This library, Shoal, supports the partitioned global address space (PGAS) memory model for applications.

PGAS is a shared memory model for clusters that creates a distinction between local and remote memory accesses. Through Shoal and its common application programming interface for hardware and software, applications can be more freely migrated to the optimal platform and deployed onto dynamic cluster topologies.

The library is tested using a thorough suite of microbenchmarks to establish latency and throughput performance. We also show an implementation of the Jacobi method that demonstrates the ease with which applications can be moved between platforms to yield faster run times.

Acknowledgements

It takes a village to raise a child and about as many people to raise a thesis as well. Foremost, I would like to thank my supervisor, Paul Chow, for all that he’s done for me over the course of the four years that I’ve known him thus far. Thank you for accepting me as your student for my undergraduate thesis and not immediately rejecting me when I asked to pursue a Master’s under your supervision as well. The freedom and independence you provided over the course of this work was intimidating at times but also comforting as a testament to your confidence in me. I’d like to thank all my colleagues in Paul’s group. Our weekly meetings of Vivado Anonymous were a constant reminder that we were in this together. This work was also made easier by the many great grad students, especially my fellow denizens of PT477, of whom there are too many to name. Among them, I’d like to mention in particular Rose Li, Naif Tarafdar, Daniel Rozhko and Daniel Ly-Ma, Thomas Lin and Marco Merlini. Thanks for everything, both in and outside of work. Thank you to Ruediger Willenberg and Sanket Pandit for the inspiration, prior work and help that you provided. Ruedi, the casual confidence you exuded as a graduating PhD was a revelation to me as an undergrad and I hope to be able to emulate that someday. Finally, I’d like to thank my family for their endless support, concern and love. My parents took a chance on coming to Canada with a small family in tow. I would not be here if not for them.

Contents

1 Introduction
  1.1 Motivation
  1.2 Research Contributions
  1.3 Thesis Organization

2 Background
  2.1 Field-Programmable Gate Arrays
  2.2 Hardware vs. Software
  2.3 Memory Models
    2.3.1 Shared
    2.3.2 Distributed
    2.3.3 Partitioned Global Address Space
  2.4 AXI Interfaces
    2.4.1 AXI-Stream
    2.4.2 AXI-Full and AXI-Lite
  2.5 Galapagos
    2.5.1 A Layered Approach
    2.5.2 The Galapagos Model
    2.5.3 Why Galapagos?
  2.6 Related Work
    2.6.1 SHMEM and its Successors
    2.6.2 GASNet
    2.6.3 HUMboldt

3 Shoal
  3.1 Previous Work: THeGASNets
    3.1.1 THeGASNet
    3.1.2 THe_GASNet Extended
    3.1.3 Limitations
  3.2 Rationale for Shoal
    3.2.1 Compatibility
    3.2.2 Scalability
    3.2.3 Freedom
    3.2.4 Maintainability and Extensibility
    3.2.5 Usability
  3.3 Communication API
    3.3.1 Packet Format
  3.4 Software Implementation
    3.4.1 libGalapagos
    3.4.2 Making a Node in Shoal
    3.4.3 Handler Thread
  3.5 Hardware Implementation
    3.5.1 GAScore: a remote DMA engine
    3.5.2 Integration with Galapagos
  3.6 Shoal Kernels

4 Sonar
  4.1 Background
    4.1.1 Forms of Testing
    4.1.2 Related Work
  4.2 Motivation
    4.2.1 Difficulties with Hardware Testbenches
    4.2.2 Simulation in HLS
  4.3 Introducing Sonar
  4.4 Writing a Testbench
    4.4.1 DUT
    4.4.2 Test Vectors
  4.5 Case Studies
    4.5.1 UMass RCG HDL Benchmark Collection
    4.5.2 cocotb
  4.6 Comparing Different Tools
    4.6.1 Controllability
    4.6.2 Ease of Use
    4.6.3 Capability
    4.6.4 Readability
    4.6.5 Compatibility
    4.6.6 Heterogeneity
    4.6.7 Summary
  4.7 Future Work

5 Evaluation
  5.1 Experimental Setup
    5.1.1 Hardware
    5.1.2 Software
  5.2 Hardware Usage
  5.3 Microbenchmarks
    5.3.1 libGalapagos
    5.3.2 Shoal
  5.4 Stencil Codes
    5.4.1 Baseline
    5.4.2 Porting from THe_GASNet Extended
    5.4.3 Software Performance
    5.4.4 Hardware Performance

6 Conclusions
  6.1 Future Work
    6.1.1 Quick Improvements
    6.1.2 Handler Functions
    6.1.3 Supporting Partial Words and Unaligned Access
    6.1.4 API Refinement
    6.1.5 Automated Flows

Bibliography

Appendix A Shoal Packet Formats
  A.1 Universal Header
  A.2 Network Packets
    A.2.1 Short, Medium and Long Messages
    A.2.2 Strided Messages
    A.2.3 Vectored Messages
  A.3 Request Packets
    A.3.1 Short, Medium and Long Messages
    A.3.2 Strided Messages
    A.3.3 Vectored Messages
  A.4 Kernel Packets

Appendix B Source Code Documentation
  B.1 Initialization
  B.2 Makefile
  B.3 data/
    B.3.1 Shoal Microbenchmarks
    B.3.2 libGalapagos Microbenchmarks
    B.3.3 Jacobi
    B.3.4 Other Scripts
  B.4 GAScore/
  B.5 helper/
  B.6 include/
  B.7 repo/
  B.8 src/
  B.9 tests/
    B.9.1 Benchmark
    B.9.2 Jacobi

List of Tables

3.1 Selected methods of the shoal::kernel class
4.1 Comparison between different tools for writing testbenches
5.1 Available FPGA resources on select boards
5.2 Hardware utilization of the full Galapagos Shell for the 8K5
5.3 Hardware utilization of the GAScore (with one kernel) on the 8K5
5.4 Hardware utilization of the Application Region (with one Benchmark kernel) on the 8K5

List of Figures

2.1 Comparing flexibility and efficiency in different types of processors
2.2 Memory models with different threads and processes
2.3 The Galapagos stack
3.1 MapleHoney cluster
3.2 Internal structure of an FPGA in the MapleHoney cluster
3.3 Internal structure of an FPGA in THe_GASNet Extended
3.4 Short, Medium and Long network packet formats
3.5 Kernel threads and the local router in libGalapagos after automatic function wrapping
3.6 System diagram of the FPGA in Shoal
3.7 The new GAScore in Shoal
5.1 Throughput of messages with replies in libGalapagos
5.2 Comparison of throughput with different thread schedule and delays
5.3 Average median latency of communication methods with TCP in different topologies
5.4 Speedup of median latency using UDP instead of TCP
5.5 Average throughput of communication methods with TCP in different topologies
5.6 Speedup of throughput using UDP instead of TCP
5.7 Speedup of throughput between libGalapagos and Shoal with two kernels on the same node
5.8 Speedup of throughput between libGalapagos and Shoal with two kernels on different nodes
5.9 Run time of the Jacobi application in software
5.10 Run time of the Jacobi application in hardware
A.1 Short, Medium and Long network packet formats
A.2 Strided network packet format
A.3 Vectored network packet format
A.4 Medium and Long request packet format
A.5 Strided request packet format
A.6 Vectored request packet format
A.7 Kernel packet format

The beginning is the most important part of the work.
Plato

Chapter 1

Introduction

Heterogeneous computing refers to a system where different types of processors are used to do work. The advantage of such a setup is that it allows each of its constituent processors to do work that may be most suitable for it. The result then (ideally) is an application that performs better—in whatever the target metrics may be—than a homogeneous system. The heterogeneous system considered in this work specifically targets clusters of x86 or ARM processors and field-programmable gate arrays (FPGAs). FPGAs are logic chips that may be repeatedly “reprogrammed” by the user. Unlike conventional processors, FPGAs offer parallelism through these programmed digital circuits. Their reprogrammability makes them more dynamic than using custom application-specific integrated circuits (ASICs) and an asset to heterogeneous platforms. Unfortunately, FPGAs are not trivial to use—a problem that only grows in these heterogeneous environments. This thesis examines one approach to simplify their incorporation into workloads: a common communication library between FPGAs and processors.

1.1 Motivation

In 2014, Microsoft presented Catapult [1] and later showed that the inclusion of FPGAs in their data center dramatically accelerated Bing search [2]. FPGAs have since been included broadly in their Azure platform where they are used to compress network traffic [3]. Amazon has deployed FPGAs-as-a-service as part of Amazon Web Services (AWS) that users can rent [4]. These two implementations represent different heterogeneous compute configurations. In the Microsoft version, FPGAs are a “bump in the wire” and sit between the processors and the network. In contrast, in AWS, they are closely coupled to a processor. We see the Microsoft model as more productive1 for heterogeneous computing. Here, FPGAs are equal citizens in the network and can independently compute and communicate without a host processor. In Catapult, processors and FPGAs are connected together using standard high-speed Ethernet cables to top-of-rack switches and communicate using standard or custom protocols on top of Ethernet. Given this sea of available interconnected hardware in the data center, the question and challenge is how to write an application that can run on it. This work focuses on a major difficulty facing applications in a heterogeneous system: effective communication between devices. With only conventional processors, there are mature software ecosystems to write and scale applications across many servers. The inclusion of FPGAs complicates this process. The standard tool flows around these devices require experienced digital hardware experts and result in long development times when compared to software. Furthermore, integrating these devices into heterogeneous platforms requires developing a communication scheme that can work across processors and FPGAs as well as providing the low-level infrastructure for basic I/O. The chosen method of communication must be convenient, easy to use, scalable, and adaptable for different scenarios. Prior works at the University of Toronto identified the partitioned global address space (PGAS) programming model as a good fit to meet these requirements and developed a communication model [5], [6]. Due to limitations of those implementations, their model cannot be applied in its original state in the data center environment envisioned here. In this work, we distill elements of these past implementations and combine them with more recent developments in managing heterogeneous clusters to more fully address the problem of communication in environments like the data center.

1 However, this model is also more vulnerable as the FPGA has direct access to the network that errant applications can abuse.

1.2 Research Contributions

The goal of this research is to develop an application programming interface (API) for a PGAS programming model that can be used with high-performance heterogeneous applications on flexible compute topologies such as those in the data center. We leverage prior work in PGAS communication and cluster deployment to provide a reliable foundation. Our major contributions are:

• A new heterogeneous PGAS communication library, Shoal, and the associated hardware IPs to facilitate communication between hardware and software.

• Hardware and software infrastructure that is compatible with Galapagos [7], an open-source framework that provides deployment and connectivity between nodes and enables usage of the library in dynamic clusters.

• A characterization of the performance of Galapagos and the Shoal API through microbenchmarks.

• An adaptation of the Jacobi method application from prior work to demonstrate the functionality of Shoal in a realistic use case.

• An open-source Python library, Sonar [8], to simplify writing comprehensive hardware testbenches.

Shoal is not merely a port of prior work to a different environment. Our implementation simplifies and streamlines the API for users by removing archaic and unnecessary elements and taking advantage of new paradigms. As a result, it is more portable, scalable and easier to use than previous works. We discuss these metrics in Section 3.2.

1.3 Thesis Organization

The remainder of the thesis is organized as follows. Chapter 2 presents basic information on FPGAs, memory models and AXI interfaces. This chapter also covers Galapagos [7], an open-source framework for heterogeneous clusters that this work uses, and some prior work.

Chapter 3 describes the main body of this work: the Shoal platform and its API. We also review two projects that directly preceded this work, THeGASNet [5] and THe_GASNet Extended [6]. Chapter 4 introduces Sonar, an open-source Python library that makes writing testbenches easier. This project arose during the development of Shoal as a means to test and maintain the associated hardware IPs. Chapter 5 first evaluates the performance of Galapagos and the Shoal platform through microbenchmarks before concluding with experimental results from the Jacobi application adapted from previous work. Chapter 6 concludes the thesis and examines future work. Appendix A shows the Shoal packet formats in detail and Appendix B is a guide to the Shoal source code repository [9].

If I have seen further, it is by standing on the shoulders of Giants.
Isaac Newton

Chapter 2

Background

In this chapter, we provide an overview of the necessary information that forms a foundation for our work. First, we introduce FPGAs and explain what it means to “run on hardware” to further motivate the advantages of heterogeneous computing. Then, we compare three memory models used in programming. In particular, the partitioned global address space (PGAS) model used in this work is defined here. Next, we present Galapagos, a platform for creating and deploying heterogeneous clusters. The use of Galapagos in this work greatly simplifies many of the low-level details that would otherwise be developed from scratch. Finally, we summarize some related work in the area of communication libraries and PGAS programming.

2.1 Field-Programmable Gate Arrays

Field-programmable gate arrays (FPGAs) are integrated circuits that can be “programmed” by the user to implement custom digital circuits [10]. This property allows FPGAs to be dynamically configured for the task at hand in contrast to application-specific integrated circuits (ASICs), which are static after manufacture. Programming an FPGA can be done through its vendor’s computer-aided design (CAD) tools that map a user’s specifications to a bitstream that gets loaded on to the FPGA. The bitstream defines which of the individual subcomponents of an FPGA, such as look-up tables (LUTs), flip-flops (FFs) and memory cells (block RAMs or BRAMs), are used and how they are connected together to create the intended function. The design to implement is conventionally described using hardware description languages (HDLs). Two common HDLs are Verilog (and SystemVerilog) and VHDL. More recently, other languages such as C/C++ and OpenCL have also been used to define hardware blocks for FPGAs. For these languages, high-level synthesis (HLS) tools are used to convert the design from its original source language to an HDL. A number of commercial HLS tools exist such as Xilinx’s Vivado HLS [11] and the Intel HLS Compiler [12] as well as open-source ones such as LegUp [13]. One of the goals of these tools is to make development for FPGAs easier since implementing designs in HDLs can be difficult and requires specialized skills. The use of HLS helps non-traditional FPGA users to take advantage of these devices and veteran users to quickly explore the design space.
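To give a flavor of what HLS input looks like, the following is a minimal sketch (not taken from the thesis) of C++ that a tool such as Vivado HLS can synthesize into a hardware block. The pragma asks the tool to pipeline the loop so that one result is produced per clock cycle; the function and names are illustrative only.

#include <cstdint>

// Illustrative HLS-ready C++: the loop body becomes a datapath and the pragma
// requests an initiation interval of one cycle per iteration.
void vector_add(const int32_t *a, const int32_t *b, int32_t *out, int n) {
    for (int i = 0; i < n; i++) {
#pragma HLS PIPELINE II=1
        out[i] = a[i] + b[i];
    }
}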


2.2 Hardware vs. Software

We reserve the bare term “processors” to specifically refer to the central processing unit (CPU) in computers or embedded platforms. To run applications on processors, users write software that is eventually transformed into the processor’s instruction set and executed. Software implementations are flexible and take advantage of decades of refinement in compilers and tools. If we consider the scale in Figure 2.1 with flexibility increasing on the left and efficiency increasing on the right, processors lie firmly to the left.


Figure 2.1: Comparing flexibility and efficiency in different types of processors (adapted from [14], [15])

In contrast, applications running on FPGAs and ASICs are said to be running on hardware. The “processor” in this case is the silicon chip that has been specifically designed to run a particular application. With ASICs, this design is determined at manufacture while FPGAs can be reprogrammed by the user. Thus, ASICs lie on the far right of the flexibility-efficiency scale with FPGAs somewhere in the center-right. Using custom hardware solutions, applications can run faster, more efficiently, and/or with higher throughput depending on the metrics of interest. ASICs, though highly efficient, are expensive to produce and have a long lead time. Therefore, they are not suited to the data center environment where the application is subject to quick change [3]. Instead, FPGAs can provide a middle ground between flexible and efficient computation. Both hardware and software have their strengths and weaknesses. These differences between the two systems justify the importance of heterogeneous computing platforms to take advantage of both.

2.3 Memory Models

A memory model describes the developer’s abstract view of system memory for parallel computations. There are three models that we describe here: shared, distributed and the partitioned global address space (PGAS), illustrated in Figure 2.2. In this section, we use the term “node” to refer to one parallel computing element such as a thread or a process.

2.3.1 Shared

In a shared memory model, all parallel parts of an application have access to the same memory, which can be accessed using the same addresses. This model is similar to traditional non-parallel code running on a single machine. Data allocated in one parallel compute element are accessible in others without explicit communication.

Figure 2.2: Memory models with different threads and processes (adapted from [16]). The figure contrasts a global address space shared by all threads, a partitioned global address space, and the separate address spaces of the distributed model.

Multi-threaded programs inherently use shared memory between different threads. For multi-process or multi-core programs, there needs to be support through an interconnect or the runtime to create shared memory. The advantage of using a shared memory model is its ease of use. There is no need for explicit communication between different nodes since all nodes can access all data. Parallel code can be written as though it were sequential, which is very familiar for application developers. However, this illusion that parallel code looks serial can lead to poor performance. With non-uniform memory access (NUMA) architectures (for example, with clusters spanning multiple machines or even among different memory channels), performance can vary wildly if poor data placement leads to frequent expensive memory accesses. There are tools such as the Intel Memory Latency Checker [17] that can help identify these latencies on a particular system. Another issue that can arise in shared memory systems is false sharing [18]. Processors fetch data from system memory and cache it in multi-word units called cache lines. Even if multiple nodes are each reading and writing to separate words within a single cache line, the line will be constantly invalidated. Even though each node is not actually sharing the same data, the cached data will not be used, which leads to longer access times. OpenMP [19] and Intel’s Threading Building Blocks [20] are two examples of models for parallel programming that use the shared memory model.
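To make the false-sharing problem concrete, the following is an illustrative C++ sketch (not code from the thesis): two threads increment logically independent counters, and only the padded layout keeps the counters on separate cache lines. The 64-byte line size is an assumption typical of x86 processors.

#include <cstdint>
#include <thread>

// In the packed layout both counters likely share one cache line, so a write
// from one thread invalidates the other core's cached copy of the line.
struct PackedCounters {
    uint64_t a = 0;
    uint64_t b = 0;
};

// Padding each counter to its own (assumed 64-byte) cache line removes the
// contention even though the program logic is identical.
struct PaddedCounters {
    alignas(64) uint64_t a = 0;
    alignas(64) uint64_t b = 0;
};

template <typename Counters>
void increment_in_parallel(Counters &c) {
    std::thread t1([&c] { for (int i = 0; i < 10000000; i++) c.a++; });
    std::thread t2([&c] { for (int i = 0; i < 10000000; i++) c.b++; });
    t1.join();
    t2.join();
}

int main() {
    PackedCounters packed;
    PaddedCounters padded;
    increment_in_parallel(packed);  // typically slower: false sharing
    increment_in_parallel(padded);  // typically faster: no shared line
    return 0;
}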

2.3.2 Distributed

The distributed memory model is the traditional one used in multi-node high-performance computing clusters. Each node in the cluster has its own separate memory address space. To share data between nodes, explicit communication between them must take place. The message passing paradigm is frequently used in this space to facilitate nodes exchanging data. The de facto implementation of this paradigm is the aptly named Message Passing Interface (MPI) [21]. While the MPI specification has grown to include more exotic features such as one-sided remote memory access [22], the core of the protocol is as follows. Communication is between two nodes and considered two-sided, which is to say that both nodes must participate in the exchange. The simplest use of MPI is where one node sends data while the other node blocks execution until it receives the message.
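As a point of comparison for the PGAS discussion that follows, the snippet below sketches this simplest two-sided exchange using standard MPI calls (a minimal example, not from the thesis): rank 0 sends an array and rank 1 blocks in MPI_Recv until it arrives.

#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int data[4] = {0, 1, 2, 3};
    if (rank == 0) {
        // Two-sided: the send only completes usefully because rank 1 receives
        MPI_Send(data, 4, MPI_INT, /*dest=*/1, /*tag=*/0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        // Blocks until the matching message from rank 0 arrives
        MPI_Recv(data, 4, MPI_INT, /*source=*/0, /*tag=*/0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
    }

    MPI_Finalize();
    return 0;
}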

A distributed memory model allows the developer to have precise control over data movement and to optimize transfers. This control comes at the expense of programming difficulty in synchronizing the nodes’ communication patterns. Two-sided communication also forces the communicating parties to stop potentially useful work, perform handshaking and wait for the data transfer.

2.3.3 Partitioned Global Address Space

The PGAS model is a hybrid of the shared and distributed models. The memories of different nodes are physically separate as in the distributed model but logically contiguous in that any node can access the shared space [23]. Bridging the disparate memories together requires support from hardware or software. PGAS attempts to combine the best of both worlds: to gain the benefits of easier application writing and to scale across many machines in a cluster. Each node in a PGAS cluster has one partition of the global address space. While all nodes can access the shared space, this locality information is known to the programmer and so accessing data stored in other partitions is implemented as a remote access. While the line between shared NUMA memory and PGAS can be thin, this explicit separation of local and remote access is a hallmark of PGAS. In PGAS languages, memory locations with relatively uniform access times are grouped into places with an associated cost of access where the place with the lowest cost is considered local [24]. If the notions of memory hierarchy and access times are exposed and considered by developers targeting NUMA architectures, it may be considered a form of PGAS. PGAS promotes the concept of one-sided communication in which only the application on the participating node is explicitly involved in the communication, which makes it easier to overlap communication and computation as compared to MPI [25]. Incoming memory requests still need to be handled by an agent, which may be with hardware support through remote direct memory access (RDMA) NICs, using separate threads, or another method. As in all memory models, poorly optimized code in PGAS results in performance penalties. Repeatedly accessing remote data incurs a higher cost than accessing local data, so the layout of data is critical.

2.4 AXI Interfaces

This work is built on Xilinx FPGAs using Xilinx’s CAD tool Vivado [26]. The standard vendor IPs in Vivado use AXI interfaces, part of the ARM AMBA specification, to communicate. This standardization allows different IPs to be easily connected together. A brief overview of these interfaces is provided here with full details available in the specifications for AXI-Stream [27], AXI-Full and AXI-Lite [28].

2.4.1 AXI-Stream

AXI-Stream (AXIS) is a point-to-point streaming interface where data can be transmitted every cycle, subject to optional back-pressure. Here, the IP that is sending data is the master (it has an AXIS-M interface) and the receiver is the slave (AXIS-S). There are only two required signals in the standard: TDATA and TVALID. TDATA must be an integer multiple of bytes wide and carries the data payload while TVALID is a 1-bit signal indicating that the data is valid. Other signals are optional and a subset of them is described below, followed by a short HLS sketch of an AXIS interface:

• TREADY: asserts back-pressure from the slave IP to the master. When TREADY is deasserted, it indicates the slave cannot accept data. Thus, data is only sent when TVALID and TREADY are both asserted if this signal is present.

• TLAST: indicates the last flit (or “beat” in AXI nomenclature) of a packet when asserted.

• TKEEP: indicates which bytes of TDATA are valid if transmitting data that is smaller than the data width. Most IPs only allow partial words on the last flit of a packet.

• TDEST: indicates the destination of the flit. AXIS switches and interconnects use the TDEST field to route incoming flits.

• TID: indicates different streams of data.

• TUSER: carries any custom sideband information that is transferred with the same timing rules as for TDATA.
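The sketch below shows how these signals typically appear to a Vivado HLS kernel, assuming the standard ap_axi_sdata.h stream types; the widths, names and routing decision are illustrative. The TVALID/TREADY handshake is implied by the blocking stream reads and writes, while TKEEP, TLAST and TDEST are explicit fields of each flit.

#include <ap_axi_sdata.h>
#include <hls_stream.h>

// 64-bit TDATA with 1-bit TUSER/TID and a 16-bit TDEST (illustrative widths)
typedef ap_axiu<64, 1, 1, 16> axis_word;

void forward(hls::stream<axis_word> &in, hls::stream<axis_word> &out) {
#pragma HLS INTERFACE axis port=in
#pragma HLS INTERFACE axis port=out
    axis_word word;
    do {
#pragma HLS PIPELINE II=1
        word = in.read();     // blocks until TVALID; asserts TREADY
        word.dest = 1;        // retarget the flit to TDEST 1 (illustrative)
        out.write(word);      // TDATA, TKEEP and TLAST forwarded unchanged
    } while (word.last == 0); // TLAST marks the final flit of the packet
}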

2.4.2 AXI-Full and AXI-Lite

AXI-Full is a memory-mapped interface that allows an AXI-Full master to read/write multiple words (bursts) of data from an AXI-Full slave. Each AXI-Full master is assigned a base address and an address range that defines the region of memory it can access. AXI-Full slaves generally expose contiguous memory that a master may access. This high-performance interface is frequently used to connect to on-chip and off-chip RAM. AXI-Lite is a subset of AXI-Full and, as its name suggests, it is a simpler memory-mapped interface. Among its restrictions, AXI-Lite enforces a burst length of one to disallow burst reads/writes and requires the data to be 32 or 64 bits wide. As with AXI-Full, AXI-Lite masters are assigned a base address and an address range while AXI-Lite slaves typically expose discrete registers at particular addresses. AXI-Lite is typically used as the interface to expose memory-mapped control registers in a slave IP.
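As a rough illustration (assuming Vivado HLS pragma syntax; the function, port and bundle names are made up for this sketch), the kernel below exposes an AXI-Full master for burst access to RAM and an AXI-Lite slave for its control registers and start/done handshake.

#include <cstdint>

void scale(uint32_t *mem, uint32_t offset, uint32_t length, uint32_t factor) {
    // AXI-Full master to external memory; "offset=slave" places the base
    // address in an AXI-Lite register so software can point it at a buffer
#pragma HLS INTERFACE m_axi port=mem offset=slave bundle=gmem
    // AXI-Lite slave exposing the scalar arguments and block-level control
#pragma HLS INTERFACE s_axilite port=offset bundle=control
#pragma HLS INTERFACE s_axilite port=length bundle=control
#pragma HLS INTERFACE s_axilite port=factor bundle=control
#pragma HLS INTERFACE s_axilite port=return bundle=control
    for (uint32_t i = 0; i < length; i++) {
#pragma HLS PIPELINE II=1
        mem[offset + i] *= factor;  // reads/writes are issued as AXI bursts
    }
}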

2.5 Galapagos

Galapagos is an open-source stack and middleware platform to support the creation of heterogeneous clusters through a set of user-provided configuration files [29]. As our work is built on top of and relies on Galapagos, we provide a description of this framework and explain why it is a good fit for this project. Galapagos is the source of the definitions of “node” and “kernel” used in the majority of this work. A node, unless otherwise stated, refers to a processor, FPGA or another device in a cluster that has a unique network address such as an IP address. Each node may have one or more kernels on it, where each kernel is an independent computing element (for example a thread or a hardware IP) that has a globally-unique kernel ID assigned to it.

2.5.1 A Layered Approach

One of the central tenets of Galapagos is the importance of layers. This idea is not original to Galapagos. In the OSI model, different layers of the network stack hide implementation-specific details of lower layers through interfaces [30]. Different protocols may be used within a layer provided they present the same interfaces to the layers above and below. The layers of Galapagos serve the same purpose and make hardware and software nodes more similar to each other. Through heterogeneous communication, distributed applications can exist on either hardware or software transparently.

Figure 2.3: The Galapagos stack (adapted from [29])

The layers of the heterogeneous Galapagos stack are shown in Figure 2.3 and the hardware stack is described below. At the base of the stack is the Physical layer, which refers to how the nodes are connected together and how they communicate. When there is no abstraction like Galapagos, everything must be done from scratch at this layer such as setting up network interfaces and accessing off-chip memory.

The first abstraction layer in the stack is the Hypervisor layer (also called the Shell). Just as an operating system simplifies access to hardware for software applications, the Shell provides a standardized way to access interfaces such as the network and off-chip memory from user applications. The introduction of the Shell divides the FPGA into two regions: the Shell and the Application Region. Since the Shell provides the controllers to interact with the external interfaces, the user does not need to be concerned with how they are implemented and instead can focus on how to interface with them. By exposing a consistent interface back to the user, Shells on different FPGAs—which might have very different controllers for external devices—can be made to appear the same. Through the use of the Shell, user applications become FPGA-agnostic.

Next, the Cloud Provisioning layer is responsible for acquiring machines and managing their physical locations and network addresses. Currently, this layer is not being actively used in Galapagos and so machines have to be manually set up.

The Middleware layer orchestrates the creation of clusters of heterogeneous nodes. The user provides two configuration files to enable this functionality. The first is the logical file, which defines the user kernels, their interfaces, and the packet structure used. The second is the map file, which lists the available nodes, their network addresses and what kernels should exist on them. While these files are currently written by hand, there is ongoing work to generate them automatically through a placement tool. Galapagos provides Python scripts that use this data to write TCL [31] scripts for Vivado. TCL is a scripting language supported by Vivado [32] that is used here to create the projects, instantiate and connect IPs, and eventually generate bitstreams.

Finally, the Communication layer provides heterogeneous communication between kernels in hardware and software. It defines a simple Galapagos header that includes the message size and routing information followed by the payload sent over a streaming interface. Users can write applications that communicate using this protocol and can run them on a heterogeneous cluster. In software, Galapagos includes a low-level communication library called libGalapagos [33]. This library provides stream-like interfaces for software kernels to use that emulate the behavior of the hardware counterparts to support migration between the two platforms. It is further discussed in Section 3.4.1.

Above the Communication layer rests the less-rigidly defined Application layer. This layer is not a monolith but may be broken down into further sublayers as shown in dashed lines in Figure 2.3. This work lies in this layer as one such sublayer right above the Communication layer. As further discussed in Chapter 3, we define a communication API on top of the native one provided in libGalapagos to create an expressive messaging system under the PGAS model.
An application for Galapagos could use our work directly or it could be used to underpin further abstractions to simplify writing applications.

2.5.2 The Galapagos Model

When creating a cluster, Galapagos assigns each kernel an ID and each node an address. Galapagos also stores an additional mapping between a kernel ID and the node that it occupies. When one kernel sends data to another, the destination kernel ID is used to get the node address and a network packet is constructed and sent to the destination node. From there, the destination node routes the data to the appropriate kernel using the ID. Currently, all communication between nodes in Galapagos occurs over the network but additional communication methods such as over PCIe may be added in the future. Another important aspect of using Galapagos is understanding the assumptions and constraints it creates in hardware. Each kernel in Galapagos has at least three interfaces: a pair of streaming interfaces to communicate over and a constant input to pass in the kernel’s ID. Hardware kernels may also have AXI-Full interfaces to memory. These kernels are all placed within the Application Region on the FPGA. The external interfaces exposed by the shell to this region on the tested hardware are:

• AXIS-M: for outbound data (1 available)

• AXIS-S: for incoming data (1 available)

• AXI-Full Master: to external memory (2 available)

• AXI-Lite Slave: for control (1 available)

Note that other Shells may expose different interfaces to the Application Region.
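The skeleton below sketches the shape such a kernel takes in HLS: one constant kernel-ID input and a pair of AXIS interfaces toward the on-FPGA router. The type and port names are illustrative assumptions and are not the exact Galapagos definitions.

#include <ap_axi_sdata.h>
#include <ap_int.h>
#include <hls_stream.h>

// Illustrative flit type; the real Galapagos packet/interface types differ
typedef ap_axiu<64, 1, 1, 16> flit_t;

void echo_kernel(ap_uint<16> kernel_id,
                 hls::stream<flit_t> &in,
                 hls::stream<flit_t> &out) {
#pragma HLS INTERFACE ap_none port=kernel_id   // constant ID supplied at build time
#pragma HLS INTERFACE axis    port=in          // data arriving from the router
#pragma HLS INTERFACE axis    port=out         // data heading back to the router
    flit_t flit;
    do {
#pragma HLS PIPELINE II=1
        flit = in.read();
        flit.dest = 0;          // reply to kernel 0 (illustrative routing choice)
        out.write(flit);
    } while (flit.last == 0);
}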

2.5.3 Why Galapagos?

The notable advantage that Galapagos provides is portability across networks and simplified access to FPGA resources. Through its dynamic cluster creation, Galapagos supports scalability over the network where FPGAs can exist as first-class computation nodes. The incorporation of layers also allows transparent changes under the hood with no user impact. For example, Galapagos currently supports the TCP, UDP and Ethernet protocols for communication, which can be chosen in the Middleware layer and determines which hardware cores are instantiated in the design. New protocols can be added and leveraged with little change in user code.

Many of Galapagos’ functionalities are critical for any heterogeneous cluster application and having them codified into a framework simplifies usage and encourages reuse rather than reinvention. In hardware, access to the network and off-chip memory would have to be set up regardless but would otherwise be done manually. In software, libGalapagos presents a higher-level stream-based protocol that can be used to send data between kernels instead of setting up sockets or interacting with a network library directly. With this protocol, hardware and software interfaces look the same, which supports functional portability between different platforms. Finally, routing data to the correct destination is essential for both hardware and software kernels in a cluster environment, which Galapagos manages instead of requiring the user to contrive a scheme.

2.6 Related Work

This section reviews some prior work in this area. First, we introduce one of the first implementations of PGAS ideas in SHMEM in Section 2.6.1. Then, Section 2.6.2 discusses GASNet: a communication standard that influenced a number of later efforts in PGAS development, including those that inspired this work. Finally, the HUMboldt protocol is presented in Section 2.6.3. This lightweight implementation of MPI was the first language abstraction built on top of Galapagos. Note that the pertinent discussion of the prior THeGASNet and THe_GASNet Extended projects is presented in Section 3.1 to compare them more closely with this work.

2.6.1 SHMEM and its Successors

SHMEM was originally introduced in 1993 as a library for shared memory operations on Cray’s T3D architecture [34], [35]. The library provided one-sided communication functions such as shmem_get() and shmem_put() that were tightly coupled to the Cray hardware. As one of the earliest implementations of distributed shared memory, SHMEM influenced what users expected similar libraries to provide in their APIs. To support non-Cray platforms, other vendors implemented their own versions of SHMEM, leading to efforts such as GPSHMEM [34] and OpenSHMEM [36] to create a more portable standard version. SHMEM is aimed at application developers and can use lower-level communication libraries such as GASNet [37] and ARMCI [38] under the hood [36].
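For illustration, the example below uses the standard OpenSHMEM API (not code from the thesis) to show the one-sided style these libraries popularized: PE 0 writes directly into PE 1's symmetric memory without PE 1 taking part in the transfer.

#include <shmem.h>

int main() {
    shmem_init();
    int me = shmem_my_pe();
    int npes = shmem_n_pes();

    // Symmetric allocation: every PE gets a corresponding block
    long *data = (long *)shmem_malloc(4 * sizeof(long));
    for (int i = 0; i < 4; i++) data[i] = me;

    if (me == 0 && npes > 1) {
        long src[4] = {10, 11, 12, 13};
        shmem_long_put(data, src, 4, /*target PE=*/1);  // one-sided write
    }
    shmem_barrier_all();  // ensure the put is complete and visible everywhere

    shmem_free(data);
    shmem_finalize();
    return 0;
}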

2.6.2 GASNet

GASNet [37] is a specification for a low-level communication layer that can be used as part of higher-level PGAS languages. The specification defines a language- and platform-independent interface for implementations to use. The API is intentionally verbose as it is not intended to be directly included in user code. Instead, GASNet implementations are used under languages like UPC [39], Co-Array Fortran [40], and Chapel [41] to seamlessly provide communication.

2.6.2.1 Core API

The Core API defines methods to initialize a GASNet program and for nodes1 to query information about themselves and other nodes. It also defines the active messaging interface described in Section 2.6.2.2. A subset of these methods are described below:

• gasnet_init(): parses command-line arguments and performs system setup

• gasnet_attach(): accepts an array of active message handler functions, which are further described in Section 2.6.2.2, to register along with the size and offset of the local shared memory

• gasnet_getSegmentInfo(): allows a node to query for remote addresses (including its own)

• gasnet_mynode(): returns a node’s own ID

• gasnet_nodes(): returns the number of nodes

The first three methods are called in the order listed to initialize the node before messaging can occur.
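A minimal sketch of this initialization order is shown below, following the GASNet-1 core API; the segment size and array bound are illustrative, and no handlers are registered in this example.

#include <gasnet.h>
#include <stdio.h>

int main(int argc, char **argv) {
    gasnet_init(&argc, &argv);                        // 1. bootstrap the job
    gasnet_attach(NULL, 0,                            // 2. no AM handlers here;
                  GASNET_PAGESIZE * 256, 0);          //    attach a shared segment
    static gasnet_seginfo_t seginfo[256];             // assumes at most 256 nodes
    gasnet_getSegmentInfo(seginfo, gasnet_nodes());   // 3. query all segments

    printf("node %d of %d, local segment at %p\n",
           (int)gasnet_mynode(), (int)gasnet_nodes(),
           seginfo[gasnet_mynode()].addr);
    gasnet_exit(0);
    return 0;
}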

2.6.2.2 Active Messages

Active Messages (AMs) are the mechanism through which nodes in GASNet communicate. Originally proposed in [42], AMs differ from conventional messaging in that they can trigger computation upon receipt through the use of handler functions. Handler functions are user-defined functions that may accept arguments and perform work. They are initialized through the gasnet_attach() function that specifies the mapping from numeric handler function IDs to their function pointers. Handler functions work as follows. Node A includes a particular handler ID when sending a message to node B. After the received message is managed appropriately, the particular handler function will be called on node B to take additional action. GASNet bases the AM definitions in its Core API on a modified version of the Active Messages 2.0 specification [43]. There are three main classes of AMs: Short, Medium and Long. Despite these names, the messages do not differ solely in size; in fact, the length of a Short AM can potentially exceed that of a Long AM. Instead, these message types differ in behavior and contents2:

• Short messages carry no payload

• Medium messages carry payload to a temporary buffer on the remote node

• Long messages carry payload to the allocated shared memory on the remote node

In addition to payload, AMs carry optional arguments for handler functions. When a node receives an AM, the handler function will be passed any handler-specific arguments and payload data, if available. After processing the handler function, the remote node responds with a reply message back to the source node.

1 Node, in GASNet jargon, refers to a single compute element, whether that is a thread or a processor, that runs an instance of the application and therefore has an address space and an ID. This definition of node is used in the context of GASNet.
2 Technically, these classes are special cases of a unified AM format as noted in [37, Appendix A.3].
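The sketch below illustrates this flow with the GASNet-1 core API (handler indices, payload and the node-count check are illustrative): node 0 sends a Medium AM carrying a small payload and one handler argument, the handler runs on node 1 and issues a Short reply, and node 0 polls until the reply handler runs.

#include <gasnet.h>
#include <stdio.h>

#define HIDX_PING 200   // application-chosen handler indices
#define HIDX_DONE 201

static volatile int got_reply = 0;

// Medium AM handler: the payload arrives in a temporary buffer on the target
void ping_handler(gasnet_token_t token, void *buf, size_t nbytes,
                  gasnet_handlerarg_t arg0) {
    printf("received %zu bytes, arg0=%d\n", nbytes, (int)arg0);
    gasnet_AMReplyShort0(token, HIDX_DONE);   // reply back to the requester
}

void done_handler(gasnet_token_t token) { got_reply = 1; }

gasnet_handlerentry_t handlers[] = {
    { HIDX_PING, (void (*)())ping_handler },
    { HIDX_DONE, (void (*)())done_handler },
};

int main(int argc, char **argv) {
    gasnet_init(&argc, &argv);
    gasnet_attach(handlers, 2, GASNET_PAGESIZE * 256, 0);

    if (gasnet_mynode() == 0 && gasnet_nodes() > 1) {
        int payload[4] = {1, 2, 3, 4};
        // Medium AM to node 1 with one handler argument (42)
        gasnet_AMRequestMedium1(1, HIDX_PING, payload, sizeof(payload), 42);
        GASNET_BLOCKUNTIL(got_reply);   // service the network until the reply
    }
    gasnet_exit(0);
    return 0;
}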

2.6.2.3 Extended API

The Extended API of GASNet defines some higher-level functions such as blocking and non-blocking get/put messages to exchange data. Blocking memory transfers wait until a reply is received before continuing. Non-blocking messages return immediately and come in two classes: explicit and implicit. The explicit variant returns a handle that represents the status of the current operation and offers fine-grain control over the message. In contrast, implicit non-blocking messages return no such handle and can only be collectively waited on. For example, a node could initiate non-blocking implicit get requests and call the appropriate synchronization function that will wait for all the outstanding get requests to finish before returning. Similar functions can be used to wait on outstanding put requests or to wait for all unfinished requests. The Extended API also specifies barrier functionality. Barriers in GASNet are split-phase where nodes call a notify function to indicate that they have reached a certain point in the code and a wait function that blocks until all other nodes have notified as well. While the Extended API can be composed using functions purely from the Core API, the GASNet specification encourages native implementations of the Extended API (or a subset thereof) in the interest of performance. However, the Core API is still required as the basic level of functionality that a GASNet implementation must provide.
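The fragment below sketches these Extended API calls (GASNet-1 names; the addresses and sizes are placeholders and the function is assumed to be called after initialization): a non-blocking explicit put returns a handle to wait on, followed by a split-phase anonymous barrier.

#include <gasnet.h>
#include <stddef.h>

// Assumes gasnet_init()/gasnet_attach() have already run and that remote_addr
// points into node 1's attached segment.
void put_and_sync(void *remote_addr, void *local_buf, size_t nbytes) {
    if (gasnet_mynode() == 0 && gasnet_nodes() > 1) {
        gasnet_handle_t h = gasnet_put_nb(1, remote_addr, local_buf, nbytes);
        /* ...overlap independent computation here... */
        gasnet_wait_syncnb(h);               // block until this put completes
    }
    // Split-phase barrier: notify, then wait for all other nodes
    gasnet_barrier_notify(0, GASNET_BARRIERFLAG_ANONYMOUS);
    gasnet_barrier_wait(0, GASNET_BARRIERFLAG_ANONYMOUS);
}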

2.6.3 HUMboldt

The HUMboldt protocol [7] is a minimal implementation of MPI built on Galapagos for heterogeneous communication. In particular, the library defines two functions HUM_Send() and HUM_Recv() to send and receive data to/from kernels. These functions are provided in a C++ header file that can be used in both software and hardware kernels (the latter using HLS). HUMboldt, as in MPI, uses two-sided communication where all participating kernels must be involved in the data exchange. Communication is initiated by the sender with a request that the receiver acknowledges. At this point, the sender is cleared to send data. The receiver sends a final message back to the sender to complete the transaction.

Research is what I am doing when I don’t know what I am doing.
Wernher von Braun

Chapter 3

Shoal

Shoal is the heterogeneous PGAS library that has been implemented as a spiritual successor to the THeGASNet [5] and THe_GASNet Extended [6] projects, which are briefly described in Section 3.1. In keeping with the aquatic theme of Galapagos-related projects, Shoal is named for large groups of fish swimming together. It defines an API for Active Message (AM) communication between kernels as well as barriers for synchronization. The AMs used in Shoal are based on the AMs used in GASNet, shown in Section 2.6.2.2. It supports widely-distributed workloads spread out over both hardware and software kernels over a generic network. The network-agnosticism comes from the use of Galapagos: a heterogeneous stack that eases multi-FPGA deployment [7]. Building on this platform allows Shoal to take advantage of the abstraction provided. In software, Shoal includes a C++ library and headers for users to compile applications. These applications can be run through an HLS tool or rewritten in Verilog and moved to an FPGA. Hardware support of the PGAS model is provided primarily through a new GAScore: a direct memory access (DMA) engine to facilitate remote memory access. This chapter first reviews the prior work that greatly influenced Shoal, THeGASNets, in Section 3.1. Next, we present the rationale for Shoal and why it improves on prior work in Section 3.2. Then, we describe the communication API used in Shoal and the packet formats used for data in Section 3.3. Sections 3.4 and 3.5 present the software and hardware implementations, respectively. Finally, Section 3.6 shows how a Shoal kernel can be made heterogeneous with a code sample.

3.1 Previous Work: THeGASNets

This section presents an overview of the two direct ancestors of the Shoal API: THeGASNet [5] and THe_GASNet Extended [6] (referred to collectively as THeGASNets) in Sections 3.1.1 and 3.1.2 respectively. It concludes with a discussion on the limitations of these works in Section 3.1.3.

3.1.1 THeGASNet

THeGASNet is a framework providing runtime support for heterogeneous PGAS applications developed by Willenberg [5]. This work includes both software and hardware components. The primary software contribution is the THeGASNet Core API, which is an implementation of a subset of the GASNet Core API, described in Section 2.6.2. Due to the direct relationship between these two libraries, they share the same basic function calls used to initialize an application (albeit with some minor differences) such as gasnet_init() and gasnet_attach(). One notable difference from GASNet is that addressing has been changed to offsets from base addresses instead of absolute addresses. Communication between kernels in both libraries occurs through active messages (AMs) defined in the GASNet Core API. In addition to this set of AMs, THeGASNet adds Strided and Vectored AMs to its implementation. Strided AMs pack constant-sized blocks of data separated by a constant stride into one Long AM. Vectored AMs allow grouping multiple Long AM payloads of different sizes and varying destination addresses into a single Long AM.

The same software API of THeGASNet is supported in hardware with custom IPs, namely the GAScore and the PAMS. Both of these IPs are described in great detail in [16] and are only briefly summarized here. The GAScore (global address space core) serves as the remote DMA engine by sending and receiving AMs. Outgoing AMs (relative to the user application), depending on type, may be passed through to the external network or augmented with data from memory. Similarly, incoming AMs may be passed to the application directly or written to shared memory. Handler requests are passed on through a separate channel for processing and replying.

The PAMS (programmable active message sequencer) is a custom processor intended to simplify the interaction of custom user application hardware with the GAScore. Placing the PAMS between the application and the GAScore allows it to coordinate communication and provide synchronizing capabilities. To work correctly for a particular application, the PAMS must be programmed by the user with a special Medium message. This message contains a handler ID whose corresponding handler function directs the payload of the message to an on-chip memory that the PAMS will read as instructions. The PAMS has several capabilities. For example, it contains internal counters that can be programmed to wait until a target value is reached as a means to implement barriers. The PAMS can also initiate AMs by sending the request to the GAScore. However, all of this functionality must be explicitly programmed, including when to send messages and when to execute a particular handler function. A side effect of this behavior is that handler functions are not executed when the message triggering them arrives. Rather, the handler function is performed when the PAMS reads the instruction to do so, which can potentially leave messages unacknowledged for long periods of time. In this model, a user application on hardware may run on a soft processor such as a MicroBlaze or on a custom IP and can choose whether to interface with the GAScore directly or through the PAMS. One set of these cores (the application, the GAScore and potentially the PAMS) constitutes one kernel in Galapagos nomenclature.

This work has been demonstrated on the MapleHoney cluster shown in Figure 3.1. This cluster uses the BEE4 [44] platform to host four Virtex-6 LX550T FPGAs arranged in a ring topology. There are also four Intel i5-4440 x86 processors connected over a 1 Gb/s link to an Ethernet switch. Each processor is paired with one FPGA over PCIe. Each kernel on an FPGA has access to off-chip memory and to the on-chip network. This network connects each kernel to all other kernels on the same FPGA and provides connections to adjacent FPGAs in the ring and to the host processor. The internal structure of one FPGA is shown in Figure 3.2.

3.1.2 THe_GASNet Extended

THe_GASNet Extended [6], as its name suggests, is an extension by Pandit to the work in THeGASNet. It is an implementation of a subset of the GASNet Extended API, which is summarized in Section 2.6.2.3.


Figure 3.1: MapleHoney cluster (adapted from [16])

In particular, the THe_GASNet Extended API provides get and put functions in blocking and non-blocking variants. It also supports both explicit and implicit handles for non-blocking messages and includes the synchronization functions needed for implicit requests. In addition to these direct implementations from the GASNet specification, THe_GASNet Extended carries forward support for Strided messages from THeGASNet and provides an additional initialization function to supplement the regular gasnet_init(). In software, the extended API makes calls to the same low-level AM sending calls as in THeGASNet. While THeGASNet only ran on x86 processors, THe_GASNet Extended has additional support to run on ARM processors such as those found in system-on-chip (SoC) FPGA devices that have an onboard processor. Hardware support of the Extended API comes through the GAScore and the xPAMS. The GAScore is reused from THeGASNet since the underlying AM calls are the same in both platforms. The xPAMS (Extended PAMS) module is an updated version of the PAMS fulfilling the same purpose: to simplify communication between user code and the GAScore. Notably, the xPAMS adds interrupt-based handler processing in contrast to the original PAMS so that handlers are run when the message arrives. THe_GASNet Extended has been implemented on an x86 Intel i5-4440 quad-core processor and on a cluster of Zedboards [45]. This cluster was made of up to eight Zedboards connected to a simple 1G Ethernet switch. Each Zedboard has a Zynq 7000 FPGA, which includes a dual-core ARM Cortex-A9 processor, programmable logic, and 512 MB of DDR3 RAM. In this FPGA, the processor has access to the network connection and the off-chip RAM. If the FPGA fabric needs to access the network, it must transfer the data to the ARM using a DMA. The internal structure of one FPGA is shown in Figure 3.3. In the programmable logic, bidirectional connections with arrows indicate AXIS interfaces while connections without arrows are also bidirectional but indicate Fast-Simplex Link (FSL) interfaces. FSL is discussed further in the next section.


Figure 3.2: Internal structure of an FPGA in the MapleHoney cluster (adapted from [16])

3.1.3 Limitations

Both the THeGASNet and THe_GASNet Extended projects have some notable strengths. As implementations of GASNet, these works benefit from being adaptations of an existing mature specification. Furthermore, as GASNet is in active use, other platforms built on top of it could theoretically be implemented on the heterogeneous THeGASNets instead. However, there are also limitations of these works. Foremost, they are written and developed using Xilinx’s now-defunct ISE platform [46], which has since been replaced with Vivado. This change means that all the old ISE projects, scripts, and packaged IPs used during development are no longer applicable and need to be redone. Porting IPs to Vivado is not trivial. One prominent change between ISE and Vivado is the deprecation of the FSL interface and its replacement by AXIS. FSL is similar in principle to AXIS as a point-to-point streaming bus that uses back-pressure to control the flow of data. Unlike AXIS, it does not provide optional side channels and includes a generic control signal. In AXIS, the TLAST signal is used to indicate the last word of a packet. Since the specification allows the control signal to be freely used for any purpose, it may be used in the same way as TLAST. However, the hardware IPs in THeGASNet use the control signal to indicate the first word of a packet. Replacing the FSL interfaces with AXIS is needed for the IP to connect to other IPs in the modern Xilinx ecosystem in Vivado. But this change requires careful examination of the source code so that the change to TLAST does not break the flow. It may also require modifying the logic of the IP to better reflect the new control signal semantics. Another constraint with these works is scalability. THeGASNets have been implemented on statically-defined clusters and make a number of assumptions. In THeGASNet, if code running on a processor needed to send data to an FPGA, this data transfer would always occur over PCIe to the processor’s attached FPGA. From the FPGA, data would be routed to a different FPGA (if needed) and eventually to the destination kernel. Thus, the software and hardware were closely coupled together.


Figure 3.3: Internal structure of an FPGA in THe_GASNet Extended (adapted from [6])

THe_GASNet Extended makes a similar assumption that it would run on SoC devices and so would have access to both a processor and the FPGA fabric. The use of Zedboards forced all network communication to go through the processor, which negatively affected the maximum performance. Finally, these works cannot be directly used as a drop-in replacement for GASNet because THeGASNet (and by extension, THe_GASNet Extended) intentionally departs from GASNet’s specifications at certain points. While the initial goal had been to closely follow the specifications, the addition of heterogeneous hardware components made it impossible to do so. Therefore, existing GASNet software code cannot be directly linked against THeGASNets. As such, any further extension of the projects to adapt higher-level PGAS libraries built on GASNet would drift even farther from their existing software implementations. Since full compatibility is not possible, these works’ effort to adhere closely to the GASNet specification is not necessarily optimal for a heterogeneous library.

3.2 Rationale for Shoal

Considering the prior work and its limitations discussed in the previous section, this section explains how and why Shoal improves on THeGASNets across a number of metrics.

3.2.1 Compatibility

Compatibility is used here to refer to the platforms and ecosystems to which a particular project is suited. A highly compatible project adheres to any relevant standards and can work with a broad set of modern tools. As a result, the project is up-to-date with recent work and has longevity for the future. The prior works’ dependence on deprecated tools limits compatibility. In Shoal, we implement our hardware IPs from scratch rather than attempting to port old hardware. In doing so, we take advantage of new tools like Vivado and Vivado HLS that are a part of modern hardware development in the Xilinx environment. We also update the source code to use C++ rather than pure C as it was in previous works. Using C++ supports object-oriented programming, and the evolving C++ standards add powerful features that improve the capabilities of the language. Finally, we build on and extend Galapagos to be compatible with this new standard of heterogeneous cluster development, unlike past work that is more isolated and independent.

3.2.2 Scalability

Scalability here measures how easily and effectively a project can take advantage of an increasing number of nodes and kernels. Since moving an application to a cluster is intended to improve parallelization and increase computation power, having the ability to scale out as far as needed is a desirable quality. THeGASNets are implemented on statically-defined small clusters and make assumptions about communication between nodes. The decision to use Galapagos as the foundation of Shoal was made in part to address the scalability of our overall solution. Through its cluster description files, users are free to define an arbitrarily-connected cluster that may consist of just a few nodes or hundreds. The Shoal API makes no assumptions about the protocols used to communicate. Instead, we simply create a Galapagos packet and then let Galapagos forward it to the correct destination over the protocol of the user’s choice.

3.2.3 Freedom

Depending on the project, external restrictions on the design space can limit exploration. In prior work, the constraint of adapting GASNet limits the scope of development. Even though close adherence to the specification was desired, total compatibility was ultimately not achieved. By allowing divergence from GASNet more openly, Shoal is more supportive of research into what makes a good heterogeneous API. We are free to cherry-pick the parts of GASNet and the THeGASNets that we find useful and exclude those that are not. Some of these changes directly improve the usability of our platform as compared to prior work. Other exploratory changes are discussed as future work in Section 6.1.

3.2.4 Maintainability and Extensibility

Maintainability and extensibility are related metrics that measure how easy a project is to manage and expand in the future. A maintainable project should be documented and logically organized to help new readers understand its structure. It should also maximize the use of scripts to automate processes and reduce the risk of human error. Through this guidance, the reader can more fully understand the technical implementation details and the justifications behind them. The process of patching newly discovered bugs or extending the project in new directions becomes more feasible when this reasoning is clear. The hardware implementations of the THeGASNets rely on the GAScore and PAMS, both of which are written in VHDL. While hand-optimized HDL IPs are high-performing, they are less suitable for exploratory research because they take significant effort to develop and debug. Substantial modifications to the design result in iterating through this development cycle again. As a result, these projects are difficult to extend for additional work. In addition, any scripts associated with these projects are also no longer valid in Vivado, as discussed in Section 3.2.1. In Shoal, the new GAScore IP, discussed in Section 3.5.1, is composed of smaller IPs written in a mix of C++ and Verilog. These IPs are then stitched together into the GAScore by a script. Using C++ for IPs yields more readable source code than Verilog. It also allows for fast experimentation with alternative designs. Where possible, configurable C macros and typedefs are used to define variables and types instead of fixed values to support easy modifications to our implementation.

3.2.5 Usability

Usability is a broad term reflecting the level of difficulty a user faces in using the design. Depending on the intended audience, there may be significant overlap between usability and the maintainability discussed in the previous section. For this reason, we restrict our definition of usability to users who are not interested in the technical details of the implementation. Instead, they depend on an easy-to-use model coupled with scripts and examples. Usable designs are more productive for users. In prior work, we believe that a lack of abstraction and the PAMS hinder usability. In hardware, the THeGASNets expose all the implementation details to the user, such as access to the network and memory. While these details can offer control, they harm usability. Clever abstractions can hide these details and present a simplified interface for users. In Shoal, we do not use the PAMS as a tool to connect the GAScore to user applications. Using the PAMS requires users to write out the PAMS instructions explicitly and creates an unnecessary burden on the user to correctly orchestrate the instructions. For example, in the original version of the Jacobi application that we present in Section 5.4, almost 100 assembly-like instructions for the PAMS are interspersed throughout the source code. These instructions are then sent to the computation kernel using a Medium message. In Shoal, we use HLS to allow users to arbitrarily connect their applications to the GAScore and the overall platform. We incorporate basic handler functions into the new GAScore rather than leaving them to be implemented per kernel. Furthermore, the incorporation of Galapagos provides a clean abstraction for users. This decision has wide-ranging effects and is discussed throughout this chapter.

3.3 Communication API

The basic methods that the Shoal API provides for messaging are low-level AMs. These functions have verbose argument lists and require some knowledge of the Shoal header format, which is discussed in Section 3.3.1, to pass the correct message type. They fulfill the same purpose as the Core API of GASNet. To make them easier to use, these AM functions are wrapped in a more convenient form in the shoal::kernel class in the spirit of the GASNet Extended API. The AM-related member functions of this class hide some implementation-specific details of AMs, though they are still fairly verbose. In practice, many of the existing arguments of these functions can be made optional with good default values to further simplify usage in the common case, but this task is left as future work. The Shoal API is heterogeneous: it uses the same (or very similar) function prototypes for software and hardware targets and has platform-specific function implementations where needed. Some functions from the API are described in Table 3.1 and discussed in this section. As in the THeGASNets, there are three classes of AMs: Short, Medium and Long. Short message types are primarily used for signaling and reply messages. Medium message types serve as point-to-point communication for one kernel to send data directly to another kernel. Finally, Long message types contain a payload that is written to remote memory. We carry forward support for Strided and Vectored Long messages as well.

Table 3.1: Selected methods of the shoal::kernel class

Function                                  Description
wait_reply(int num)                       Block until num replies have been received
barrier_send(int id) & barrier_wait()     Used to initiate barriers
sendShortAM_normal(...)                   Send a normal Short AM to a kernel
sendShortAM_async(...)                    Send an asynchronous Short AM to a kernel
getLongAM_normal(...)                     Get data from a remote kernel's memory and save it to this kernel's local memory
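As a rough illustration of how these methods fit together, the sketch below shows a software kernel that sends some AMs, waits for their replies and then synchronizes at a barrier. It is not taken from the Shoal source: the AM send calls are left as a comment because their full argument lists are not reproduced in Table 3.1, the message count is a hypothetical value, the barrier_send argument is assumed to be the kernel's own ID, and the constructor, init(), attach() and end() calls plus the KERNEL_NUM_TOTAL and SEGMENT_SIZE macros follow the kernel example in Section 3.6. The word_t template parameter follows the typedef in Listing 1.

#include "am_gasnet.hpp"
#include "shoal_node.hpp"

typedef long long word_t;

// Sketch only: a software Shoal kernel composed from the methods in Table 3.1.
void kern_0(short id, galapagos::interface<word_t>* in,
            galapagos::interface<word_t>* out) {
    shoal::kernel kernel(id, KERNEL_NUM_TOTAL, in, out);
    kernel.init();
    kernel.attach(nullptr, 0, SEGMENT_SIZE);  // software-only: set up the shared segment

    // ... send some number of AMs to remote kernels here (arguments elided) ...
    const int messages_sent = 4;        // hypothetical count
    kernel.wait_reply(messages_sent);   // block until every AM has been acknowledged

    kernel.barrier_send(id);            // announce arrival at the barrier
    kernel.barrier_wait();              // block until all kernels have arrived

    kernel.end();                       // terminate and signal the handler thread
}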

The Medium and Long message types are further divided into two cases depending on the source of their payload data. "Medium/Long FIFO" messages denote messages whose payload originates from the kernel, while "Medium/Long" messages denote messages whose payload comes from shared memory and is added by the runtime. As described, these messages are the put variants. The API also supports get requests for Medium and Long message types that bring data from remote memory. For example, a Medium get request would bring remote data from a particular address in the destination kernel to the source kernel. Following the semantics of previous APIs, each received packet triggers a reply unless the initial message is marked as asynchronous. Reply messages are Short messages that trigger a handler function that increments a variable and thus keeps track of the number of reply messages received at each kernel. Kernels can therefore send several messages and then collectively wait for the same number of replies to know that all have been received. Runtime support for AMs is managed by the handler thread in software and by the new GAScore in hardware, which are described in Sections 3.4.3 and 3.5.1 respectively. These components are responsible for parsing incoming AMs and directing them appropriately, calling handler functions, and sending out AMs from local kernels.

One notable change from the previous work to Shoal is the removal of the PAMS. From examining the THeGASNets, we can see that one of the PAMS's primary objectives is maintaining a set of counters that the kernel can read and that incoming packets can increment. This basic functionality has been folded into the runtime instead of using a separate component: built-in handler functions currently provide three counters, two of which are used to implement barriers while the third can be freely used. The API provides methods to check their values and wait until they reach a certain target. For the use cases observed in this work and prior work, this set of handler functions has been sufficient, though more may be added to the API if the need arises. This change is also tied to the general reduction of handler function usage. In GASNet and the THeGASNets, custom user-defined handler functions can be defined in software kernels. While this functionality has been maintained in Shoal software kernels as well, it is not as applicable in hardware. Supporting similar behavior in hardware requires either a custom handler IP external to the other Shoal IPs that the user can modify as needed, or adding a PAMS-like core internally that defines basic operations that the user can string together into the desired functionality. In practice, this broad freedom is rarely needed and so it is removed to simplify the hardware implementation. However, custom handler support can be added to hardware through existing workarounds or as future work, which is further discussed in Section 6.1.2.

The PAMS is also used in the THeGASNets to handle reply messages and initiate AMs to offload AM packet construction from the kernel.


Figure 3.4: Short, Medium and Long network packet formats

In Shoal, management of reply messages has also been absorbed into the runtime and is managed without kernel intervention. HLS can be used to simplify creating Shoal packets in the right format through the API. In this way, a simple controller can be developed to send AMs based on custom control signals from the user IP.

3.3.1 Packet Format

The AMs are encapsulated in a custom packet format extended from the one used in THeGASNets. An example is shown in Figure 3.4 for the basic network packet format and all packet formats are detailed in Appendix A. The format includes a small header that defines the type of AM, the IDs of the source and destination kernel, the size of the payload (if any), and the handler function (H in Figure 3.4) to call on the remote kernel, along with the function's arguments (if any). The Token field may be used to include an identifier for a particular message. There are a few differences between the packet format in THeGASNet and in Shoal. Most significantly, the size of each word is increased to 64 bits instead of remaining at 32 bits. This change is made to match the default data widths in Galapagos. Conveniently, it also allows for data that had been previously split over two words to be sent in one word instead. Some data fields' widths have been adjusted in Shoal. These changes have been largely made to better pack fields together into as few words as possible to reduce header overhead. In particular, the payload size has been reduced to 2^16 bytes from 2^32 as packets need to traverse the network and are bound by the maximum size of TCP and UDP packets. The Token field has been added to messages to identify a particular message or class of messages.
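To make the header description concrete, the sketch below packs one plausible 64-bit Shoal-style header word in C++. The field widths and ordering here are assumptions chosen only so that the fields sum to 64 bits; the authoritative layouts are those in Appendix A.

#include <cstdint>

// Illustrative only: field widths and ordering are assumed for this sketch
// (8 + 16 + 16 + 16 + 8 = 64 bits) and do not reproduce Appendix A exactly.
struct am_header {
    uint8_t  type;          // AM type (Short, Medium, Long and their variants)
    uint16_t src;           // source kernel ID
    uint16_t dst;           // destination kernel ID
    uint16_t payload_size;  // in bytes, bounded by 2^16
    uint8_t  handler;       // handler function to invoke at the destination
};

// Pack the header into a single 64-bit flit, matching Galapagos' word width.
static inline uint64_t pack_header(const am_header& h) {
    return  static_cast<uint64_t>(h.type)
         | (static_cast<uint64_t>(h.src)          << 8)
         | (static_cast<uint64_t>(h.dst)          << 24)
         | (static_cast<uint64_t>(h.payload_size) << 40)
         | (static_cast<uint64_t>(h.handler)      << 56);
}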

3.4 Software Implementation

This section describes the software implementation of Shoal. First, the software support provided by Galapagos, libGalapagos [33], is described. Shoal relies on this library for communication between software kernels. Then, the code required to make a software node in Shoal is shown and compared to the process in native libGalapagos. Finally, the handler thread is introduced as the method by which PGAS memory semantics are implemented in software and connected to libGalapagos.

3.4.1 libGalapagos

libGalapagos [33] is a C++ library in Galapagos that provides hardware-like, stream-based communication between different software kernels. The use of this library is one method of improving usability in Shoal. A brief overview is provided here through an example in Listing 1 of how to create a software node. Lines 19-20 define the network addresses associated with the kernels in this cluster, which are stored in a vector. The order of addresses in the vector also determines the kernel IDs. Line 24 creates a vector of galapagos::external_driver objects. Internally, libGalapagos will route data to local kernels on the node through a round-robin router. However, for messages intended for remote kernels, data must be sent to an "external driver", which, as of writing, may be a galapagos::net::tcp or galapagos::net::udp object. Each driver uses Boost's Asio [47] networking library to manage network connections and starts the service in separate threads. Lines 25-30 create a UDP driver at port 7 and add it to the vector of external drivers. Line 34 creates the galapagos::node object, which accepts the list of kernel addresses, the node's own address and any external drivers as arguments. Line 36 adds any local kernels that may exist on this node, which in this case is defined by a function pointer to kern_0(...). When adding a function, a galapagos::kernel object is created within libGalapagos using this function pointer and two new galapagos::interface objects that are connected to the local router linking all kernels on a given node. These Galapagos Interfaces (GIs) are used throughout Shoal for communication because they are heterogeneous and thread-safe; they are simply AXIS interfaces in hardware and behave similarly in software. On both platforms, kernels can read and write data from these interfaces on a flit-by-flit basis. However, GIs have the additional capability to read or write entire packets in bulk in software, which has no equivalent in hardware. Line 38 starts the node: libGalapagos creates threads for local kernels and runs them. Kernel functions in libGalapagos, following the Galapagos kernel definitions, have three arguments: an integer ID and a pair of pointers to GIs. Pointers to the GIs created when the kernel was added in line 36 are passed to the kernel as arguments and used to send and receive data. Finally, line 39 attempts to close the node and clean up, but it will block until all local kernels terminate.

3.4.2 Making a Node in Shoal

In Shoal, the process to create a software node is similar to the method in libGalapagos. The same example from Listing 1 is repeated in Listing 2 using the Shoal API. The galapagos::node class is extended to create the shoal::node class, shown in line 17. Notably, this class internally creates external drivers based on an optional Boolean flag (defaulting to true) that selects whether to use TCP. If the flag is marked false, then UDP is used. This process can be updated to allow additional selections if more drivers are added to libGalapagos. Line 23 shows a new method that has been added to initialize the Shoal node. It dynamically allocates enough space for some global variables that Shoal uses based on the number of kernels on this node.

1 #include "galapagos_interface.hpp"

2 #include "galapagos_kernel.hpp"

3 #include "galapagos_local_router.hpp"

4 #include "galapagos_external_driver.hpp"

5 #include "galapagos_net_udp.hpp"

6 #include "galapagos_node.hpp"

7

8 typedef long long word_t;

9

10 void kern_0(short id, galapagos::interface*in,

11 galapagos::interface*out){

12 // body of kernel function

13 }

14

15 int main(){

16 // define the IP addresses of the kernels. The index of the addresses

17 // corresponds to the kernel ID

18 std::vector kern_info_table;

19 kern_info_table.push_back("10.1.2.101");

20 kern_info_table.push_back("10.1.2.102");

21

22 // define the external protocol, in this case just UDP

23 // Creates a Galapagos UDP driver and adds to to the vector

24 std::vector*> ext_drivers;

25 galapagos::net::udp my_udp(

26 7, // UDP port to use

27 kern_info_table, // addresses of kernels

28 "10.1.2.101" // my address

29 );

30 ext_drivers.push_back(&my_udp);

31

32 // create the node by passing it the kernel addresses, its address,

33 // and a vector of external drivers

34 galapagos::node node0(kern_info_table, "10.1.2.101", ext_drivers);

35

36 node0.add_kernel(0, kern_0);

37

38 node0.start();

39 node0.end();

40 }

Listing 1: Making a software node in libGalapagos. Each line is further explained in Section 3.4.1. Chapter 3. Shoal 25

1 #include "shoal_node.hpp"

2

3 void kern_0(...);

4

5 int main(){

6 std::string my_address= "10.1.2.101";

7 std::string remote_address= "10.1.2.102";

8

9 // define the IP addresses of the kernels. The index of the addresses

10 // corresponds to the kernel ID

11 std::vector kern_info_table;

12 kern_info_table.push_back(my_address);

13 kern_info_table.push_back(remote_address);

14

15 // create the node by passing it the kernel addresses, its address,

16 // and a Boolean value on whether to use TCP for remote communication

17 shoal::node node0(kern_info_table, my_address, false);

18

19 // add local kernels using a numeric ID and a function pointer

20 node0.add_kernel(0, kern_0);

21

22 // initialize the node with the number of local kernels

23 node0.init(1);

24

25 node0.start();

26 node0.end();

27 }

Listing 2: Making a software node in Shoal. Each line is further explained in Sections 3.4.1 and 3.4.2. Chapter 3. Shoal 26

3.4.3 Handler Thread

Adding the PGAS memory semantics requires additional code to parse headers, redirect data to memory or to kernels, and call handler functions. This functionality should be implemented as a new thread, here called the handler thread, so as not to block kernel execution. There are multiple places where the handler thread could be inserted. While modifying libGalapagos itself is an option, this step violates the separation of the layers of Galapagos and needlessly complicates the implementation of libGalapagos. As described in Section 3.4.1, each kernel function has at least three arguments: an integer ID and a pair of GIs to send and receive data from other kernels. Since the kernel function is directly started as a new thread by libGalapagos, a second approach is to create the handler thread within each kernel function. Incoming data to the kernel would need to be passed to the handler thread first and then potentially back to the kernel thread. Similarly, the kernel would not use the external GI pointers directly to send data out. It would pass the data to the handler thread instead, which would actually send the data out after processing.

The implemented solution is a more automated version of this idea. Instead of the kernel function creating the handler thread directly, the kernel function itself is wrapped with a new function by the linker during compilation1. This new wrapper function behaves as the handler thread, implements the needed PGAS semantics and is visualized in Figure 3.5. Since it replaces the original user kernel function, the handler thread behaves as though the user added this function as the kernel: it is passed to libGalapagos, connected to the local router and started in a separate thread when the node starts. This thread is marked as the Galapagos kernel thread in the figure to denote that libGalapagos considers it to be the kernel function. The kernel function pointer that the user originally passed has not been lost; instead, it is started as a new thread by the handler thread, along with new galapagos::interface objects to communicate data between the user kernel thread and the handler thread. There is no visible change internal to the user kernel thread as it has the same set of arguments and uses these GIs to communicate. However, instead of connecting to the local router directly, the GIs now transparently connect to logic in the handler thread that serves as the gatekeeper to and from the wider network. Incoming messages may be redirected to shared memory, passed on to the kernel, or used to trigger the appropriate handler function that an AM specifies. Outgoing message requests from the kernel are read and converted into Shoal network packets to be sent out. Since each kernel function is wrapped by the handler thread in this manner, each kernel has its own handler thread. This use case matches the multi-threaded model in the THeGASNets and in libGalapagos.
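As a minimal sketch of the wrapping mechanism itself, linking with -Wl,--wrap=kern_0 makes every call to kern_0 resolve to __wrap_kern_0 while the original definition remains reachable as __real_kern_0. The signatures below follow the kernel prototype from Listing 1 (including the assumed word_t template parameter), and the actual handler-thread duties are reduced to comments; this is not the Shoal wrapper's real source.

#include "galapagos_interface.hpp"

typedef long long word_t;

// Provided by the linker under --wrap: the user's original kernel function.
extern "C" void __real_kern_0(short id, galapagos::interface<word_t>* in,
                              galapagos::interface<word_t>* out);

// Acts as the handler thread: libGalapagos starts this function as the
// "kernel" thread in place of kern_0.
extern "C" void __wrap_kern_0(short id, galapagos::interface<word_t>* in,
                              galapagos::interface<word_t>* out) {
    // In the real wrapper:
    //   1. create a fresh pair of galapagos::interface objects for the user kernel
    //   2. start __real_kern_0 in a new std::thread with those interfaces
    //   3. loop: parse incoming AMs from *in, access shared memory, call handler
    //      functions, and convert kernel requests into Shoal packets on *out
    __real_kern_0(id, in, out);  // simplified: call the user kernel directly
}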

3.5 Hardware Implementation

This section presents the hardware implementation of Shoal. We start with the main component that enables remote memory access on the FPGA, the new GAScore, in Section 3.5.1. Then, Section 3.5.2 describes how Shoal is integrated with Galapagos.

3.5.1 GAScore: a remote DMA engine

The GAScore in Shoal fulfills the same purpose as the old GAScore in the THeGASNets: to facilitate access to remote memory. As discussed in the limitations of previous work, the old GAScore would not be directly compatible with Vivado, so overhauling it offers more flexibility of design. Its software counterpart is the handler thread discussed in Section 3.4.3.

1Function wrapping at link-time was also applied in [16] and relies on the --wrap option in GNU's ld.


Figure 3.5: Kernel threads and the local router in libGalapagos after automatic function wrapping

Unlike the handler threads, which are created per kernel, the GAScore is shared among all kernels on a node. A simplified system diagram for the FPGA is shown in Figure 3.6, showing how the GAScore fits into the Application Region.


Figure 3.6: System diagram of the FPGA in Shoal

The GAScore is composed of IPs written in a mix of C++ and Verilog and its structure is shown in Figure 3.7. Unless otherwise indicated, the connections between these blocks are AXIS interfaces. It has two pairs of master-slave AXIS interfaces: to/from the external network and to/from local kernels. While the old GAScore had an additional set of streaming interfaces for handler calls, this set has been removed in the Shoal GAScore. This change is tied to the removal of the PAMS2, which was discussed earlier in Section 3.3. Instead of using the PAMS for handler functions, the GAScore provides a set of AXI-Lite slave interfaces—one for each local kernel on the FPGA—that are used to access the registers associated with the built-in handler functions. Finally, the GAScore has one AXI-Full master interface to the shared memory. The Xilinx AXI DataMover IP [48] is used to facilitate access to memory by providing a conversion between AXIS and AXI-Full. A simplified explanation is provided here. The DataMover exposes two parallel interfaces for reading and writing data. To read, a read command is sent to the IP that specifies the source address and the number of bytes to read. The read data is streamed back over AXIS and a status indicator is available.

2We pay homage to this history by naming two of the GAScore's internal IPs after the xPAMS.


Figure 3.7: The new GAScore in Shoal

Similarly, writing data to memory first requires a write command to the IP that specifies the destination address and the number of bytes to write. After the command is sent, data can be streamed into the IP over AXIS and is written to memory as instructed in the command. The am_rx and am_tx modules perform these actions when data must be written to or read from memory. The general path of egress packets is described below, with some special cases excluded here for simplicity.

1. A Shoal kernel packet arrives at the “From Kernels” interface.

2. Based on the header, the type of message and the destination are decoded in xpams_tx. For the special cases of Short messages and Medium FIFO messages intended for local kernels, this module routes data to the handler internally and passes any payload directly to the "To Kernels" interface. Other message types, whether they are destined for local or remote kernels, need access to memory and so proceed unaltered to am_tx.

3. am_tx determines the type of message based on the header and parses the rest of the command packet. Depending on the message type, additional data may be added to the outgoing message. In particular, for messages with a payload, requests for data are sent over the DataMover's command interface and the read data from the IP is appended to the end of the outgoing packet (a sketch of this command-and-stream pattern follows the ingress steps below).

4. The add_size block adds the metadata needed for Galapagos to handle the data correctly. The final message size is counted and this size (in words) is added to the TUSER side channel of the AXIS interface.

5. Messages intended for local kernels will be routed back to the “From Network” interface of the GAScore by Galapagos. Other packets will proceed through and eventually leave the FPGA.

Ingress packets follow a similar path in the opposite direction.

1. A Shoal packet arrives at the "From Network" interface.

2. am_rx parses the header and forwards it. For Long message types, the payload gets written to memory. The hold_buffer is a special FIFO that buffers the forwarded data in the case of Long AMs. While the payload is being written to memory, the AM’s header is held at the buffer. After it has been written, the message is allowed to proceed.

3. Using the forwarded message, xpams_rx will pass the handler function data on to the handlers. It will also forward the Medium AM payload to the kernels through the “To Kernels” interface. Finally, this module creates a reply packet and sends it to am_tx to be sent back to the source kernel.
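The sketch below illustrates the command-and-stream pattern referenced in egress step 3, under assumed names: the command struct is a simplified stand-in for the actual AXI DataMover command word defined in [48], and fetch_payload is a hypothetical helper rather than the real am_tx source.

#include "ap_int.h"
#include "hls_stream.h"

struct dm_read_cmd {        // simplified stand-in for the DataMover read command
    ap_uint<32> addr;       // source address in shared memory
    ap_uint<23> bytes;      // number of bytes to read
};

// Fetch an AM's payload from memory and append it to an outgoing packet.
void fetch_payload(
    ap_uint<32> addr, ap_uint<23> bytes,
    hls::stream<dm_read_cmd>&  dm_cmd,   // to the DataMover command port
    hls::stream<ap_uint<64> >& dm_data,  // read data streamed back by the DataMover
    hls::stream<ap_uint<64> >& out       // outgoing packet (header already sent)
) {
#pragma HLS INLINE off
    dm_read_cmd cmd = {addr, bytes};
    dm_cmd.write(cmd);                        // 1. issue the read command
    ap_uint<23> words = (bytes + 7) >> 3;     // 2. payload length in 64-bit words
    for (ap_uint<23> i = 0; i < words; i++) {
#pragma HLS PIPELINE II=1
        out.write(dm_data.read());            // 3. append the payload to the packet
    }
}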

3.5.2 Integration with Galapagos

As described in Section 2.5, Galapagos provides scripts to automate the creation of bitstreams. We show the major change we made to this flow to support Shoal in Section 3.5.2.1. In Section 3.5.2.2, we highlight some fixes we made to a Galapagos IP to improve TCP support in Galapagos.

3.5.2.1 TCL Flow

We add a special “custom” tag into the map configuration file used in the Galapagos Middleware layer. The presence of this tag makes two modifications to the old Galapagos flow. First, at points during the TCL file generation, certain variable values are saved to a new TCL script called custom.tcl. These values provide general information about the names and numbers of some IP interfaces. Second, this tag adds the custom.tcl and a user-provided TCL file—selected based on the tag’s value in the configuration file—as the last step of creating the user Application Region on the FPGA. The TCL script that Shoal provides for this purpose uses the variables from the custom.tcl and modifies the Application Region on the FPGA to add the GAScore and reconnect IPs appropriately. Other user extensions to Galapagos are possible through this general-purpose hook in the Galapagos Middleware.

3.5.2.2 TCP Network Bridge

The TCP Network Bridge in Galapagos is an HLS IP that interfaces with the TCP/IP core. It ensures that data is sent to the right destination and passes incoming data on to the Application Region. We greatly improve the robustness of this IP by fixing some bugs and adding new features. First, we add a kernel-to-node mapping to hardware. In software, libGalapagos maintains a mapping from a kernel ID to the IP address of its node, but there was no equivalent in hardware. For the scenario where there is only one kernel per node, there is no issue, as a message to a new kernel will require opening a new TCP connection to a new node. With multiple kernels per node, a message to a new kernel should reuse an existing TCP connection if the right one exists3, but this information was not discoverable. We modify the Galapagos Middleware scripts to additionally keep track of the node IDs that different kernels belong to and save that information in a file. This file is loaded into a BRAM on the FPGA and the TCP Network Bridge is modified to use this mapping. With this change, TCP connections are reused and hardware kernels can now communicate with other kernels even when multiple kernels exist on one node. Second, we address some bugs in handling large packets, one of which we detail here. The issue arises because the hardware TCP/IP core does not support IP fragmentation. Thus, messages exceeding one Ethernet frame must be sent as discrete packets.

3While TCP allows opening multiple connections to the same address, the TCP/IP core in the FPGA does not.

To prevent data corruption, the TCP Network Bridge parses Galapagos packet headers and uses the embedded packet size information to determine packet boundaries. If a TCP message ends with an incomplete Galapagos packet, the Bridge is aware of this and assumes that the next message to arrive will contain the remaining data. However, this solution is incomplete. With many nodes communicating, it is possible for data from a different node to arrive and be read as a continuation of the previous packet instead of as a new one, resulting in corrupted data. We address this problem by making the TCP Network Bridge more aware of active TCP sessions and keeping track of the session to which a packet belongs. With an incomplete packet, the Bridge buffers received-data notifications from other sessions until data from the same session arrives to complete the packet. After finishing the packet, the Bridge continues with the saved notifications from other sessions in the order of receipt.
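A sketch of the lookup that the kernel-to-node mapping enables is shown below; the function and table names are assumptions rather than the Bridge's actual source, and the table size is arbitrary.

#include "ap_int.h"

#define MAX_KERNELS 256  // assumed table size

// Map a destination kernel ID to its node's IP address using a BRAM-resident
// table initialized from the file written by the Galapagos Middleware.
ap_uint<32> kernel_to_node_ip(
    ap_uint<16> dest_kernel,
    const ap_uint<32> node_ip_table[MAX_KERNELS]
) {
#pragma HLS INLINE
    // Two kernels on the same node map to the same IP, so an existing TCP
    // session to that node can be reused instead of opening a new one.
    return node_ip_table[dest_kernel];
}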

3.6 Shoal Kernels

Through HLS, kernels can be truly heterogeneous with the use of C preprocessor directives. An example of such a kernel is shown in Listing 3. As with basic Galapagos kernels, Shoal kernels have an ID and communicate over a pair of GIs. One additional interface is an AXI-Full master4 to the handler interface of the GAScore. Lines 15-19 define the interface types for HLS using pragmas, which are ignored when compiling the kernel for software. The kernel object instantiation is shown on lines 21-25. All interfaces are passed as arguments, as well as the total number of kernels, which is used for implementing barriers. After initializing the kernel object (line 27), software kernels can additionally add custom handler functions (none in this case) and define the size of the shared memory segment. Finally, calling kernel.end() is only required for software kernels and is a no-op in hardware kernels. This function terminates the kernel and signals this status to the kernel's handler thread. After the handler thread also shuts down, the shoal::node will mark this kernel as finished. Note that marking the kernel function with extern "C" as in line 4 is necessary for the function wrapping described in Section 3.4.3 because the name of the kernel function to wrap needs to be known. This declaration prevents the name mangling that the C++ compiler performs to support overloading and instead directs it to perform C-style linking. An alternative approach is to identify the mangled name and use that to specify which function to wrap, but this name is subject to change depending on the source files.

4While the GAScore uses AXI-Lite for this interface, Vivado HLS does not provide an AXI-Lite master interface. However, as AXI-Lite is a subset of AXI-Full, the two can still be connected through an interconnect.

1 #include "am_gasnet.hpp"

2 #include "shoal_node.hpp"

3

4 extern "C"{

5 void benchmark(

6 short id,

7 galapagos::interface* in,

8 #ifdef __HLS__

9 galapagos::interface* out,

10 volatile int* handler_ctrl

11 #else

12 galapagos::interface* out

13 #endif

14 ){

15 #pragma HLS INTERFACE axis port=in

16 #pragma HLS INTERFACE axis port=out

17 #pragma HLS INTERFACE ap_ctrl_none port=return

18 #pragma HLS INTERFACE ap_stable port=id

19 #pragma HLS INTERFACE m_axi port=handler_ctrl depth=4096 offset=0

20

21 #ifdef __HLS__

22 shoal::kernel kernel(id, KERNEL_NUM_TOTAL, in, out, handler_ctrl);

23 #else

24 shoal::kernel kernel(id, KERNEL_NUM_TOTAL, in, out);

25 #endif

26

27 kernel.init();

28 #ifndef __HLS__

29 kernel.attach(nullptr,0, SEGMENT_SIZE);

30 #endif

31

32 ...

33

34 kernel.end();

35 }

36 }

Listing 3: A heterogeneous Shoal kernel. Each line is further explained in Section 3.6.

We only see what we know.
Johann Wolfgang von Goethe

Chapter 4

Sonar

Verification of IPs is essential to validate the intended functionality of a hardware design. The standard technique to check the behavior of an IP, which is known as the device-under-test (DUT), is to use a testbench that provides various stimuli to its inputs. Then, the DUT's output signals can be examined to determine whether the response matches expectations. A good testbench should make it easy to implement tests and to determine which tests are being performed. However, for complex designs, creating and maintaining effective testbenches can take inordinate amounts of time away from actual design. A problem we see is that conventional testbenches are unnecessarily written in a hardware description language (HDL) like Verilog when they can be more conveniently described using higher-level languages such as Python. Python is a good candidate language to use as an abstraction for testbench writing for several reasons. First, it is a popular cross-platform language used by a large community with an extensive set of packages. In addition, Python programs are highly readable and the language itself is powerful and easy to write.

Here, we introduce Sonar1 as a tool to simplify writing testbenches. It was developed during the implementation of the new GAScore in Shoal to improve the verification process. Sonar is an open-source Python library [8] through which the user can define a device-under-test (DUT), test stimuli and the DUT's expected response. Using this information, Sonar creates a testbench in C++ and/or in SystemVerilog (SV). These two testbenches can be used to test HLS designs at different levels of fidelity. In the first case, HLS source code is tested as a regular C++ program using a Sonar-generated testbench. For SV simulation, the HDL source code and the output files of Sonar are used in a simulator such as Vivado's xsim [49] or ModelSim [50], which are known as register-transfer level (RTL) simulators. We show that this approach results in much simpler testbenches than the Verilog equivalents. The use of Python and Sonar also allows for automation, preprocessing and access to external resources, all of which are normally inaccessible to Verilog testbenches.

The remainder of this chapter is organized as follows. First, we provide some background information on verification and related work in Section 4.1. Then, we describe the particular motivation for Sonar in Section 4.2. After presenting Sonar in Section 4.3, we walk through an example Sonar testbench in Section 4.4. Note that our descriptions and code examples in this chapter use Sonar 2.1. In Section 4.5, we evaluate Sonar through a series of case studies that adapt existing testbenches to Sonar. Next, in Section 4.6, we compare Sonar with other tools across a number of metrics to help identify when it may be the right tool.

Finally, we conclude with future work in Section 4.7.

1Just as real sonar can spot enemy submarines in the vicinity, Sonar helps find hidden bugs. The name also fits with the prevailing theme.

4.1 Background

In this section, we review some background information for Sonar. We review the general forms of tests in Section 4.1.1 and survey the state of testing hardware in Section 4.1.2.

4.1.1 Forms of Testing

There are two main types of tests: unit and system. Unit tests are intended to verify the smallest discrete component of code. In doing so, they validate that a subcomponent behaves correctly by itself before its integration into a larger project. In contrast, system tests check whether the different components of code work together. Both types of tests are essential to have. Unit tests help catch bugs early and provide confidence that the internal operations are valid. System tests ultimately need to pass for the project to work but debugging them when they fail can be arduous. Together, these practices help form modern software design processes such as continuous integration [51], agile [52] and test-driven development [53].

4.1.2 Related Work

There are a number of specialized tools and frameworks to support software-like testing and verification for hardware. The industry standard is UVM, which is a standardized methodology to ensure common verification capabilities across vendors [54]. Its broad industry support and powerful feature set are huge advantages, though it can also be intimidating to learn and needlessly complex relative to the DUT. To support unit testing, SVUnit [55] and VUnit [56] are two open-source frameworks for testing HDL designs. Both of these tools provide sets of libraries, macros and functions for users to add to their own testbenches. The open-source tool in [57] converts HDL code into a cycle-accurate C++/SystemC model that can be simulated as an executable. Unlike standard event-driven RTL simulators, cycle-accurate software simulation of hardware IP can be significantly faster at the expense of some inaccuracy.

Finally, there are two Python libraries that effectively provide an overhaul to traditional verification: MyHDL [58] and cocotb [59]. MyHDL allows users to design hardware in Python that can be converted to HDL for implementation through vendor tool flows. For verification, a testbench in Python can communicate with an RTL simulator for cosimulation using the Verilog Procedural Interface (VPI). The testbench can also be simulated natively through Python if the DUT is written with MyHDL. Similar to MyHDL, cocotb is another verification framework for writing testbenches in Python. It is a mature project with numerous contributors and has been in development since at least 2014. Through cocotb's library, the user can define a testbench, which is executed in an RTL simulator using VPI and Makefiles. MyHDL and cocotb have the restriction that they can only run on compatible simulators, as they need to connect to the simulator for cosimulation. In particular, they are not compatible with the Vivado simulator2.

There are two common elements among these tools that we want to focus on. First, they emphasize the importance of automated testing rather than verification by inspecting waveforms. Through automated verification, it is immediately obvious what tests were performed and therefore passed.

2Though xsim has a similar API to VPI called XSI [60, Appendix I] that may allow xsim to be used with cocotb, there is no direct support in cocotb for this use case.

This record can be committed to version control as evidence of this success and can be referred to in the future if the design breaks. Second, they present some kind of framework for structuring the testbench, whether that is through an HDL or a software language. This model is useful to adopt because there are inherent similarities in many tests that can be shared across them. For example, performing a transaction over a bus like AXIS is universally applicable; it can be implemented once and then reused in all testbenches.

4.2 Motivation

The constituent modules of the GAScore, as described in Section 3.5.1, are written through a mix of C++ and Verilog. These blocks generally contain one or more master/slave AXIS interfaces and perhaps a master AXI-Full interface, a slave AXI-Lite interface and some bare wires. Testing this design effectively creates some difficulties and motivates Sonar. In Section 4.2.1, we explore why testing hardware is difficult. Then, we discuss the shortcomings of the standard HLS verification process in Section 4.2.2.

4.2.1 Difficulties with Hardware Testbenches

Applying the same principles of good software testing to hardware is not trivial. Foremost, hardware is far more difficult to verify, as the tests run longer and are more complex to set up than for software. Traditional event-based simulation requires a relatively powerful system to compile the simulation objects and run the simulation. Complex timing relationships between signals and parameterized modules can quickly create a seemingly endless number of potential failure scenarios. To improve the feasibility of actually running the tests, the tester needs to intelligently consider these cases and select a representative set to run. Finally, moving beyond simulation and testing on actual hardware means contending with long implementation times through the CAD flow and building the infrastructure to automatically run tests and to collect and analyze results. It is no surprise that these difficulties can lead to verification by manual waveform inspection and checking only very specific scenarios. These methods do not adequately test the design. There are some characteristics that we would like a testbench to have:

• A testbench must be reusable and therefore must be readable; it should be clear what is being tested. Unfortunately, testbenches, especially in SV, can quickly become difficult to follow as length increases.

• IPs may need to respond to interactions over their interfaces. Frequent and constant reads/writes to complex interfaces become messy without the appropriate abstractions, which a testbench should provide.

• It is desirable to be able to express parallelism in a testbench to better model what happens on the hardware. However, the parallelism can also make the logic of a testbench more difficult to follow.

Flexible tool options encourage wider adoption of testing. While cocotb and MyHDL are capable solutions, they are not the right tool in all circumstances. The same can be said for UVM and other more formal verification techniques. We explore these trade-offs in Section 4.6.

4.2.2 Simulation in HLS

For C++-based HLS IPs, it can be convenient to perform design verification in C++ as an initial step, since it is faster to verify than RTL. C++ simulation requires writing a testbench in C++ and compiling it as an executable. However, this form of simulation is not complete, as the test executes as regular single-threaded C++ code and does not reflect the parallel nature of the eventual hardware. The traditional next step in Vivado HLS development is to perform RTL cosimulation. Using the C++ testbench, the tool creates an RTL equivalent and uses it to test the synthesized RTL. However, RTL cosimulation in this form is predicated on the assumption that the core has the standard control signals such as "start" and "stop". These signals are usually omitted on IPs that are intended to always be active, as is the case for the IPs tested for the GAScore. When a DUT does not have these signals, the tool requires a few other conditions, one of which is that the initiation interval is one. The initiation interval refers to how frequently data can be pushed into an IP, where an initiation interval of one means that the module can accept new data every clock cycle. This requirement is rarely met in unoptimized designs and can be difficult to achieve even in optimized ones. RTL simulation is crucial to verify the timing between signals and to ensure that the synthesized RTL matches the intended specification. Since cosimulation fails to run in this scenario, the alternative is to write a separate RTL testbench manually. However, this approach violates the principle of "single source of truth", a practice to minimize repetition and prevent stale versions of the same data. It results in two similar testbenches for C++ and RTL that send the same stimuli and expect the same responses, but that differ in syntax and must now be manually written and kept in sync.
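To make the contrast concrete, a hand-written C++ testbench for an always-active streaming IP might look like the sketch below: it drives the DUT as ordinary sequential code and says nothing about cycle-level timing. The DUT, its ports and the expected value are hypothetical and chosen only for illustration.

#include <cstdio>
#include "ap_int.h"
#include "hls_stream.h"

// Hypothetical DUT: bit-reverse each 32-bit flit (normally in its own source file).
void bit_reverse(hls::stream<ap_uint<32> >& in, hls::stream<ap_uint<32> >& out) {
#pragma HLS INTERFACE axis port=in
#pragma HLS INTERFACE axis port=out
    ap_uint<32> word = in.read();
    ap_uint<32> rev = 0;
    for (int i = 0; i < 32; i++) {
#pragma HLS UNROLL
        rev[i] = word[31 - i];
    }
    out.write(rev);
}

int main() {
    hls::stream<ap_uint<32> > in("in"), out("out");

    in.write(0x1);           // stimulus: a single input flit
    bit_reverse(in, out);    // the DUT runs as a plain C++ function call

    ap_uint<32> result = out.read();
    if (result != 0x80000000) {   // expected: 0x1 with its bits reversed
        std::printf("FAIL: got 0x%08x\n", result.to_uint());
        return 1;
    }
    std::printf("PASS\n");
    return 0;
}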

4.3 Introducing Sonar

To address these concerns, we developed Sonar: an open-source Python library through which the user can define a device-under-test (DUT), test stimuli and the DUT's expected response. This high-level description is written as a Python script and its execution creates testbenches in C++ and/or SystemVerilog (SV). Our use of Python ends once the testbench is generated. The C++ testbench can be used to simulate an HLS design as an initial step by compiling and running an executable using standard software tools. Our SV testbench approach does not use VPI, so the resulting testbench can be run in any RTL simulator.

One of the key design principles behind Sonar is the separation of data and structure in a testbench. There is an enormous amount of repetitive, boilerplate code within any testbench that constitutes its structure. For example, all SV testbenches need to declare the wires of the DUT, initialize them to known values, instantiate the DUT and pass these wires to it. In different testbenches, only the number, names and directions of the wires change. Other structural elements include functions for reading and writing over interfaces, clock creation, and initial blocks to assign values to registers. These parts of a testbench needlessly complicate it because the tester needs to actively look past the structural details to the relevant part: the data. The implementation of a function to write to an AXIS interface is less important than the data being written at a given point. Sonar hides these implementation details and allows the user to focus on the data in the Python script, as seen in Section 4.4. The structure of the testbench is created through a template and as a function of the data. Separate templates exist for C++ and SV, and templates could be developed for any other target language in the future. The template defines a universal structure within which all user testbenches can fit. Based on the data provided in the Sonar testbench, the C++/SV testbench itself is populated through simple string replacements from Python. The data describing the testbench's actions are written to a separate data file. The testbench reads this file and uses the instructions to execute the commands as the user intended.

4.4 Writing a Testbench

Writing a testbench in Sonar consists of two parts: specifying the DUT and adding one or more test vectors. Test vectors represent independent sets of stimuli to send to the DUT for different tests. These sections are shown in Listings 4 and 5, respectively, and are described in the remainder of this section. For simplicity, some details specific to C++ testbenches are omitted in this example.

4.4.1 DUT

The DUT specification in Listing 4 starts with creating a Testbench object using the default constructor on line 4. This object serves as the container for the other elements of the testbench. The DUT in this case is a module named “reverse” that does the following:

1. Waits until the “enable” register is set to 1.

2. Reads one flit on the AXIS-S interface.

3. Sets ack_V.

4. Reads another flit on the AXIS-S interface.

5. Writes this flit with its bits reversed on the AXIS-M interface and clears ack_V.

Next, we create the DUT, add some signals to it in lines 6-9 and add the module to the testbench on line 10. Clocks and resets are assumed to be 1-bit wide inputs to the module. Other ports’ directions (relative to the DUT) and widths (if not 1-bit) must be specified. Lines 12-14 and 16-18 add a pair of AXIS interfaces named “axis_output” and “axis_input” with 32-bit data widths. These AXIS objects are predefined in Sonar to represent the behavior of the AXIS protocol. For example, these objects offer read() and write() methods that correspond to performing one read and write transaction over the interface in simulation. Other similar objects may be defined for other interfaces and functions. The AXIS protocol allows many optional side channels and here the testbench uses a default set of side channels consisting of TDATA, TREADY, TVALID, and TLAST. Finally, lines 20-24 create an AXI-Lite slave interface on the DUT. We define a control register named “enable” at address 0x10 and define the memory-mapped range of the interface to be 4K starting from offset 0.

4.4.2 Test Vectors

Continuing with the same DUT and testbench from Listing 4, we now write one test vector for testing the DUT, which is shown in Listing 5. In Sonar, test vectors are sent to the DUT serially and therefore each test vector is independent of the others. Each one is made up of one or more Thread objects. Each Thread in a test vector runs in parallel with the other Threads in the same test vector3, while the contents of each individual Thread run serially.

3This is only true for SV testbenches. In C++ testbenches, the Threads are run serially in the same order as in the Python file.

1  from sonar.testbench import Testbench, Module, TestVector, Thread
2  from sonar.interfaces import AXIS, SAXILite
3
4  reverse_tb = Testbench.default("reverse")
5
6  dut = Module.default("DUT")
7  dut.add_clock_port("ap_clk", "20ns")
8  dut.add_reset_port("ap_rst_n")
9  dut.add_port("ack_V", size=1, direction="output")
10 reverse_tb.add_module(dut)
11
12 axis_out = AXIS("axis_output", "master", "ap_clk")
13 axis_out.port.init_channels("default", 32)
14 dut.add_interface(axis_out)
15
16 axis_in = AXIS("axis_input", "slave", "ap_clk")
17 axis_in.port.init_channels("default", 32)
18 dut.add_interface(axis_in)
19
20 ctrl_bus = SAXILite("s_axi_ctrl_bus", "ap_clk", "ap_rst_n")
21 ctrl_bus.add_register("enable", 0x10)
22 ctrl_bus.set_address("4K", 0)
23 ctrl_bus.port.init_channels(mode="default", dataWidth=32, addrWidth=5)
24 dut.add_interface(ctrl_bus)

Listing 4: Defining a DUT in Sonar

Using Threads, we capture the parallel nature of hardware to allow multiple activities to occur independently of each other within a test vector. The first Thread is created in lines 4-10. This thread initializes all the signals in the testbench to zero (line 6). After a small delay, it disables reset and sets the DUT's master AXIS interface's TREADY input high (lines 8-9). The second Thread, in lines 12-21, sets the input data to the DUT. After waiting longer than the reset time in the first Thread (line 13), this Thread starts a timer to measure how long the test takes (line 14). We write 1 to the "enable" register and then send the first flit (lines 15-16). This thread then waits until ack_V is 1. In this case, this change happens instantly after the first flit is written, so it proceeds to write the second flit (line 18). At some point, the acknowledgment signal should go down again. We wait for that to occur and add an arbitrary-length delay after (lines 19-20). Line 21 sets flag 0, which is a mechanism provided in Sonar to synchronize between Threads. The third Thread checks the output behavior of the DUT. In line 24, it attempts to read a value from the DUT's AXIS-M interface. This read blocks in the testbench until it completes and checks whether the read data matches the argument provided here. Line 25 waits for flag 0 to be set. When this Thread continues, it is the only active Thread left as no other Threads have any additional actions remaining.

As a sanity check, we read the state of the “enable” register to make sure it is still set (line 26). Finally, we terminate the test vector by printing the time elapsed for the test, printing a message and ending the vector (lines 27-29).

In this way, many test vectors can be added for the DUT and threads can be reused as needed by adding them to multiple test vectors. The tests can be selectively enabled by adding them to the Testbench object. The last method (line 32) generates the testbench and places the files at the path provided.

1  # Listing 4...
2  test_vector_0 = TestVector()
3
4  initT = Thread()
5  initT.wait_negedge("ap_clk")
6  initT.init_signals()
7  initT.add_delay("40ns")
8  initT.set_signal("ap_rst_n", 1)
9  initT.set_signal("axis_output_tready", 1)
10 test_vector_0.add_thread(initT)
11
12 inputT = test_vector_0.add_thread()
13 inputT.add_delay("100ns")
14 inputT.init_timer()
15 ctrl_bus.write(inputT, "enable", 1)
16 axis_in.write(inputT, 0xABCD)
17 inputT.wait_level("ack_V == $value", value=1)
18 axis_in.write(inputT, 1)
19 inputT.wait_level("ack_V == 0")
20 inputT.add_delay("110ns")
21 inputT.set_flag(0)
22
23 outputT = test_vector_0.add_thread()
24 axis_out.read(outputT, 0x80000000)
25 outputT.wait_flag(0)
26 ctrl_bus.read(outputT, "enable", 1)
27 outputT.print_elapsed_time("End")
28 outputT.display("The_simulation_is_finished")
29 outputT.end_vector()
30
31 reverse_tb.add_test_vector(test_vector_0)
32 reverse_tb.generateTB(FILEPATH, "sv")

Listing 5: Defining a test vector in Sonar

4.5 Case Studies

As Sonar is a productivity tool intended to simplify describing testbenches and to improve usability, it is difficult to evaluate quantitatively. For FPGA CAD algorithms, there are a number of benchmarks that can be used to measure their performance against competing approaches [61]. Unfortunately, there is no equivalent set of standard comparison metrics for testbenches. In this section, we walk through some examples of converting existing testbenches to Sonar to highlight the strengths and areas of future improvement of our design. While we use the number of lines in a testbench as a rough estimate of its complexity, this metric is incomplete. Line counts are inherently imprecise, as the number of lines can be shrunk by aggressively forcing lines together. The count alone also ignores the relative difference in complexity between a line of Verilog and a line of Python. To enforce some standardization in our line counts, we use the following guidelines. The line counts in the reference testbenches are counted without making any modifications to the designs. In the Sonar testbench, lines are counted with each Python command on a new line. In both cases, comments and whitespace are excluded and each line is restricted to 80 characters. These case studies are divided in two: Section 4.5.1 discusses rewriting existing Verilog testbenches in Sonar and Section 4.5.2 adapts some testbenches from cocotb to Sonar. In Section 4.6, we continue our evaluation of Sonar by comparing it with other tools.

4.5.1 UMass RCG HDL Benchmark Collection

The RCG Benchmark Collection [62] is an old set of Verilog IPs and their testbenches compiled by the Reconfigurable Computing Group at the University of Massachusetts Amherst. There are six IPs included in the collection that we discuss in the following sections.

4.5.1.1 FIR, ASOVA, AVA and RS

The FIR, ASOVA, AVA and RS IPs represent a finite impulse response filter, two communication error-correction decoders and a Reed-Solomon decoder, respectively. Their testbenches are generated from the Quartus Waveform Editor [63] and have the following general structure. The clock is fixed to a period of 20ns and reset is disabled after a period of time. Each bit of the input ports is set in a separate initial block with timed waits. Individual bits of the expected data output from the DUT are set in the same manner. Then, this expected data is compared against the true output from the DUT and the number of mismatches is tracked. The success or failure of the testbench—measured by the number of mismatches—is reported after a finite time has passed. This testbench style is different from more typical manually written testbenches in that it modifies each bit of the signals separately over time. While this information can be generated from a waveform easily, it is difficult for humans to write from scratch and makes interpreting the testbench challenging. This approach can also result in very long testbenches; for example, the AVA testbench is over 450 thousand lines long. We exclude these testbenches from general consideration as they are in a style that is difficult to understand and write from scratch with any tool.

4.5.1.2 FDCT

The FDCT IP is a forward discrete cosine transform with parameterizable coefficient and data widths. It has the following ports: clk, ena, rst, dstrb, din, dout and douten. After raising dstrb for one cycle, an array of numbers is sent over din to the DUT. Once douten is raised, dout is read, compared against the golden data, and the number of mismatches is reported. This testbench can be fully rewritten with Sonar and greatly simplified as a result. We observe a 72% decrease in line count in the Sonar Python script as compared to the original Verilog testbench. The bulk of the savings comes from the succinct array initialization in Python as compared to Verilog. However, similar savings can also be gained using the tick notation in SystemVerilog. If we exclude the array initialization from the line counts of both testbenches, we still see a 40% decrease in line count with Sonar. One difference in the Sonar testbench is that it does not track the number of mismatches: any failed match results in the simulation ending. For future work, Sonar should support soft failures that can be counted without stopping the simulation.

4.5.1.3 JPEG

The JPEG IP is an image compression encoder and has a similar interface to the FDCT IP. In addition to the ports on that IP, JPEG adds qnt_val and qnt_cnt. The testbench is similar to FDCT as well: an array of numbers is sent to the DUT and the output is compared against golden data. However, there is an additional feedback path to the qnt_val input port: the qnt_cnt output value is used to index a second array and that value is driven on the qnt_val port every clock cycle. This testbench can be partially rewritten with Sonar to yield an 83% decrease in line count, primarily due to the savings from array initialization as in FDCT. However, the qnt_cnt and qnt_val relationship, in which the value of an output port of the DUT is used to drive one of its inputs, is not currently supported. Additionally, this assignment, in which qnt_val is set every clock cycle until the testbench terminates, differs from other Sonar commands, which are inherently countable: the number of times a particular command will run is known when the Python script is executed. A workaround for this problem would be to determine the length of the testbench and repeat the assignment the appropriate number of times. These restrictions can be addressed by expanding Sonar’s support for such continuing assignments. A simple continuing assignment is already performed in Sonar testbenches for clock creation. That model can be expanded to allow more general user-defined continuous assignments separate from the standard set of sequential commands. This model should also allow feedback paths such as those observed in this testbench.

4.5.2 cocotb

The cocotb repository [59] includes a set of example testbenches for a variety of use cases. Here, we mainly discuss the axi_lite_slave testbench and touch on some others.

4.5.2.1 AXI-Lite Slave

This testbench performs some bus transactions over the AXI-Lite interface to confirm that reads and writes to registers succeed. The register data is accessed in two ways: as a bus transaction or by directly accessing the register. Finally, some reads and writes to invalid addresses are tested as well.

In Sonar, only the bus transactions over AXI-Lite can be tested as the testbench has no access to the internal registers of the DUT unless they are exposed as ports. While transactions to invalid addresses are possible in Sonar, there is no confirmed mechanism to detect this case. Currently, Sonar uses the Xilinx Verification IP (VIP) [64] to manage bus transactions over AXI-Lite. We have not tested how it behaves in this circumstance. It is possible that the error would only be discovered when reads from the erroneous address fail to match the expected value.

4.5.2.2 Other Testbenches

Some cocotb testbenches, such as the adder testbench, can be fully duplicated in Sonar with comparable complexity. Other testbenches, such as ping_tun_tap, cannot be directly implemented in Sonar because they rely on VPI to communicate live with the simulator. However, these types of testbenches may be implemented statically by defining the data explicitly rather than sending it through a pipe as in the original testbench. Since cocotb is more fully featured than Sonar, not all cocotb testbenches can currently be implemented in Sonar. Considering the subset that can be implemented, we expect the Sonar testbenches to be of a comparable length to the originals. For example, the Sonar equivalent of the AXI-Lite slave testbench from the previous section has three fewer lines. The relative complexity of testbenches implemented in Sonar or cocotb is subjective. In both cases, a user must first learn the tool. We compare these tools more fully in the next section.

4.6 Comparing Different Tools

The abundance of tools and frameworks for writing testbenches demonstrates that there is no one universally applicable choice. An easy analogy can be drawn to programming languages. There are many languages—each with their own syntax, quirks, strengths and weaknesses—that a developer can choose for a new project. One language (or a set of languages) may be more convenient to use than others based on the metrics the developer considers important for the project. Similarly, the “best” testbench-writing tool is circumstantial as well. In the absence of widely-accepted criteria to determine the appropriate tool, we propose the following metrics:

• Controllability: how much control does the developer have to specify what the testbench should do?

• Ease of use: how hard is it to learn (one-time cost) and use (recurring cost)?

• Capability: what can the developer do with it?

• Readability: how easy is the testbench to understand?

• Compatibility: where can the testbench be run?

• Heterogeneity: what languages are supported?

We evaluate a subset of the tools discussed in this work—VUnit, UVM, cocotb and Sonar—using these criteria; the comparison is summarized in Table 4.1. We choose these tools because they represent a wide range of the available options. In addition, we add a hypothetical tool named Sonar+ to this table to show what is possible in Sonar though not currently implemented.

Table 4.1: Comparison between different tools for writing testbenches

Feature          VUnit     UVM     cocotb     Sonar    Sonar+
Controllability  High      High    Med-High   Low      Med
Ease of Use      Low       Low     Medium     High     Med-High
Capability       High      Med     High       Low      Med-High
Readability      Low       Low     High       High     High
Compatibility    Low       High*   Medium     High     High*
Heterogeneity    Med-Low   Low     Med-Low    Med      High

4.6.1 Controllability

VUnit and UVM demonstrate high controllability since the testbenches are written in an HDL directly. Since both cocotb and Sonar are written in Python, the appropriate functions must exist in these libraries to perform an action in the testbench. In this respect, cocotb is fairly controllable as a mature project where the user more directly writes the testbench. In contrast, Sonar takes a more abstracted approach in describing the testbench that limits controllability. However, support for different testbench behaviors can be added over time to Sonar to increase the user’s control.

4.6.2 Ease of Use

Sonar excels in usability as an abstracted tool with a restricted but useful set of functions. As the project grows more complex in Sonar+, it will be more difficult for beginners to learn about all its features. This problem is already present in cocotb. However, both of these tools are arguably easier to use than VUnit and UVM where users must write HDL testbenches and learn about many classes to implement even a simple testbench.

4.6.3 Capability

UVM, VUnit and cocotb are all highly capable tools. However, the latter two can enable additional functionality in testbenches through their use of Python. In addition, cocotb can connect external utilities to a hardware testbench over VPI. Sonar is currently not as fully featured as these more mature tools; however, there is considerable scope for new features, some of which are discussed in Section 4.7.

4.6.4 Readability

Python is generally a more readable language than SV; Sonar and cocotb testbenches are consequently easier for the average reader to parse. Sonar testbenches are structured to improve understanding and are neatly partitioned into separate sections for the DUT specification and the test vectors. Whether Sonar or cocotb testbenches are easier to read is subjective and depends on user familiarity.

4.6.5 Compatibility

VUnit is the most restricted tool and explicitly supports only four simulators. The next most restricted tool, cocotb, depends on simulator support for VPI as well as its own simulator backends to run testbenches, which limits its compatibility. UVM, as a set of SV classes, should be broadly supported in simulators. Unfortunately, RTL simulators have varying levels of support for the complete SV specification, and so not all of UVM may be supported though subsets may be. Similarly, Sonar produces SV testbenches that are restricted only by the simulator’s support for SV. For this reason, these entries are marked with asterisks in Table 4.1. Existing Sonar testbenches should work in most simulators as they do not use exotic SV language features but it is possible that future extensions of the project introduce caveats for simulator support for certain functionalities.

4.6.6 Heterogeneity

With the exception of UVM, all the listed tools support some form of heterogeneity in testbenches. In addition to SV, VUnit and cocotb also support VHDL for some simulators, but their heterogeneous support is restricted to these two HDLs. Sonar supports heterogeneous testbenches more broadly through any number of language backends that may be implemented. While it currently supports C++ and SV testbenches, other languages such as VHDL or SystemC can be added in the future. In particular, Sonar is the only tool to directly address the issues of HLS cosimulation described in Section 4.2.2 by enabling direct simulation of HLS source files as well as the synthesized HDL from the same testbench.

4.6.7 Summary

From Section 4.4 and Table 4.1, it can be seen that Sonar is already an easy-to-use and broadly-applicable tool to write powerful and succinct testbenches. With continued development, the characteristics of Sonar+ will make the Sonar approach even more robust in terms of control and capability, especially with HLS flows. HLS is becoming more popular in the FPGA community where there is a strong movement to make FPGAs more accessible to application developers. For that community, it is then also important to enable testing with higher-level abstractions, which Sonar provides.

4.7 Future Work

Continued development on Sonar is driven by two factors: need and automation. Currently, Sonar cannot fully replicate all the behaviors that may exist in an HDL testbench. Thus, there are gaps that can be filled as those functionalities are needed. For example, while Sonar currently does not support differential clock ports in DUTs, this feature can be added if required. Some other gaps have been discussed in the preceding sections. Increased automation is intended to provide powerful new features through a convenient user interface. In particular, we are interested in adding support for test vector generation and examining functional coverage of the DUT. These new capabilities are broadly useful: the former improves how thoroughly and easily a design is tested while the latter returns a useful metric to evaluate the quality of the testbench. With Sonar, the goal is to add these features with the implementation details hidden from the user to seamlessly improve the testbenching infrastructure.

When you cannot express it in numbers, your knowledge is... meagre.
Lord Kelvin

Chapter 5

Evaluation

This chapter evaluates and characterizes the Shoal platform. First, the test environment is defined in Section 5.1. Then, Section 5.2 discusses the hardware utilization of Galapagos, the GAScore and the addition of Shoal API calls to HLS IPs. Section 5.3 reports the performance of the API through microbenchmarks on a custom Benchmark IP. Some baseline microbenchmarks on the underlying libGalapagos library are also presented here. Finally, an implementation of the Jacobi method is tested in Section 5.4 as a real-world application and compared against previous work.

5.1 Experimental Setup

This section aggregates the information about the hardware and software used to perform the tests in Sections 5.3 and 5.4 for reproducibility.

5.1.1 Hardware

Most tests make use of a single server with an Intel Xeon E5-2650 x86 processor (with 12 physical cores and 24 threads) and 64 GB of 2400 MHz DDR4 RAM. The server is connected through an SFP+ Ethernet port to a Dell S4048-ON 10G switch. While the NIC on the server supports TCP and UDP offloading, this offloading has been disabled to guarantee that no IP fragmentation occurs because fragmentation is unsupported by the TCP/IP core used in Galapagos. This core is maintained by Daniel Ly-Ma and is based on the work in [65]. Two Alpha Data 8K5 boards are attached to the server over PCIe 3.0 x8. Each board has a Xilinx Kintex Ultrascale FPGA (XCKU115-FLVA1517-2-E), 16 GB of 2400 MHz DDR4 RAM and a pair of SFP+ Ethernet ports [66], one of which is connected to the same 10G switch as the server. Tests that use an FPGA are mainly run on one or both of these boards. If more FPGAs are used, their network connections are made to the same switch so that all FPGAs are one hop away from each other and the server. For isolation, the ports on the switch corresponding to the active devices for a particular test are added to a port-based VLAN on the switch. Another server with the same hardware configuration is used for tests that require two software nodes. While these tests are run on this hardware configuration, there are no inherent restrictions preventing usage on other platforms. This freedom is one of the advantages of Galapagos, and so other boards and FPGAs can be used if corresponding Galapagos Shells exist. The resources available in the programmable region of the FPGA on several boards are listed in Table 5.1 to give a relative scale of resource availability on different-sized FPGAs.

In Section 5.2, the percent usage of the top entity in the hierarchy is given relative to the FPGA on the 8K5.

Table 5.1: Available FPGA resources on select boards

Board                        FPGA Family         LUTs      FFs         BRAMs   DSPs
Alpha Data 8K5 [66], [67]    Kintex Ultrascale   663 360   1 326 720   2160    5520
Fidus Sidewinder [68], [69]  Zynq Ultrascale+    552 720   1 045 440   984     1968
BEEcube BEE4 [44], [70]      Virtex-6            343 680   687 360     632     864
Zedboard [45], [71]          Zynq 7000           53 200    106 400     140     220

5.1.2 Software

The x86 servers above run Ubuntu 16.04 with kernel 4.4. Software applications are compiled with gcc 7.4 and linked against Boost 1.66. Xilinx’s Vivado 2018.1 is used to synthesize Vivado HLS IPs and for the full CAD flow to generate bitstreams. The updatemem utility, included as part of the Vivado installation, is used for updating BRAM contents in these bitstreams as needed. Live versions of Galapagos and libGalapagos are used, based on the active “dev” branch of these tools at the time of development, and some changes from this project are merged upstream into Galapagos. Hardware IPs are tested with Sonar 2.1, though 3.0 is also compatible.

5.2 Hardware Usage

On its own, using the Shoal API on an FPGA is relatively lightweight in terms of hardware resource usage. Of course, its usage is predicated on the presence of some basic Galapagos infrastructure in the Shell and the Application Region. However, as noted in [72], the use of Galapagos components does not necessarily contribute to higher utilization cost since they would have to be replaced with equivalent functional blocks. Table 5.2 shows the resource utilization of the full Galapagos Shell on the 8K5. The “Parent %” column reports the percent utilization of the component relative to the resources of its parent entity. For the top entity, this percentage is relative to the resources available on the FPGA on the 8K5. For example, the Shell consumes 12.4% of the 8K5’s available LUT resources and the PCIe interface uses 45.2% of the Shell’s LUT count. These numbers are similar to those reported in [72] for the same board, though slightly smaller across all metrics. This difference may be due to run-to-run variance in implementation quality, tool version, and time allocated to the tool. Depending on the usage, this Shell may be modified to remove unneeded components. In these tests, the PCIe hierarchy has been removed to reduce congestion and utilization.

Shoal hardware is found within the Application Region. One GAScore IP must exist in this region and is shared among all local kernels. The utilization of this IP is shown in Table 5.3 when there is one kernel present on the FPGA. With more kernels, the handler_wrapper grows approximately linearly in usage, as a handler is added for each kernel. In addition, the cost of a larger interconnect between the different handlers grows as well. The other subcomponents of the GAScore are shared between any additional kernels and remain constant in usage. Outside the GAScore, the interconnects between the rest of the Application Region and the kernels must scale with the number of kernels. These components are shown in Table 5.4, which details the utilization of the Application Region for the microbenchmark tests in Section 5.3. Though the Benchmark IP dominates utilization here, the Application Region as a whole uses only a small fraction of the resources on the 8K5. The relatively high utilization of the Benchmark IP is due in part to limitations of the HLS tools. For comprehensive testing, a wide range of Shoal API functions are included in the Benchmark IP. While each function uses on the order of hundreds of LUTs and FFs, it was necessary to inline certain function calls using HLS pragmas for Vivado HLS to generate hardware with the correct behavior. Inlining functions can yield higher performance through replicating hardware, as opposed to sharing it, for each function call. Without inlining, under the assumption that the HLS tool works correctly, using any one Shoal function once in user code increases the utilization by the aforementioned hundreds of LUTs and FFs. This issue is further discussed in Section 5.3.2.1.

Table 5.2: Hardware utilization of the full Galapagos Shell for the 8K5

Component            LUTs                 FFs                  BRAMs
                     Count     Parent %   Count     Parent %   Count   Parent %
Shell                82 068    12.4       98 981    7.5        176.5   8.2
DDR4 Controller      38 930    47.4       55 449    56.0       99.0    56.1
PCIe Controller      37 058    45.2       35 148    35.5       69.0    39.1
Ethernet Controller  3973      4.8        5001      5.1        5.5     3.1
Other                2107      2.6        3383      3.4        3.0     1.7

Table 5.3: Hardware utilization of the GAScore (with one kernel) on the 8K5

Component          LUTs               FFs                BRAMs
                   Count   Parent %   Count   Parent %   Count   Parent %
GAScore            3595    0.5        4634    0.3        28.0    1.3
am_rx              274     7.6        377     8.1        0.0     0.0
am_tx              274     7.6        380     8.2        0.0     0.0
AXI DataMover      1381    38.4       1465    31.6       8.5     30.4
FIFOs              99      2.8        166     3.6        2.5     8.9
Interconnects      600     16.7       703     15.2       0.0     0.0
hold_buffer        423     11.8       881     19.0       8.5     30.4
xpams_rx           70      1.9        80      1.7        0.0     0.0
xpams_tx           73      2.0        72      1.6        0.0     0.0
add_size           171     4.8        157     3.4        8.5     30.4
handler_wrapper    229     6.4        353     7.6        0.0     0.0
handler_0          228     99.6       345     97.7       0.0     0.0

5.3 Microbenchmarks

This section presents the results of several microbenchmarks that establish the latency and throughput of the communication methods in the Shoal API. These microbenchmarks exercise Shoal in different hardware topologies and over two network protocols. First, some baseline results of the underlying libGalapagos library are shown in Section 5.3.1 to provide some context of performance. Then, the main Shoal microbenchmark results are presented and discussed in Section 5.3.2.

Table 5.4: Hardware utilization of the Application Region (with one Benchmark kernel) on the 8K5

Component           LUTs                FFs                 BRAMs
                    Count    Parent %   Count    Parent %   Count   Parent %
applicationRegion   18 830   2.8        23 382   1.8        33.0    1.5
GAScore             3595     19.1       4634     19.8       28.0    84.8
Galapagos IPs       2463     13.1       4816     20.6       0.5     1.5
Benchmark           12 772   67.8       13 932   59.6       4.5     13.6
Benchmark IP        11 811   92.5       12 959   93.0       3.5     77.8
Interconnects       538      4.2        606      4.3        0.0     0.0
AXI Timer           291      2.3        240      1.7        0.0     0.0
Instruction Memory  132      1.0        127      0.9        1.0     22.2

5.3.1 libGalapagos

As the communication of Shoal software kernels is built on top of libGalapagos [33], their performance is bound by what this library can provide. In this section, two situations are evaluated. Section 5.3.1.1 determines how to communicate efficiently through libGalapagos by looking at the performance of the different styles of communication. The best of these methods is used in Section 5.3.1.2 to measure communication performance between two software kernels on the same node through tests similar to those used to measure Shoal in Section 5.3.2. In both of these tests, there are two participating kernels: the Sender and the Receiver.

5.3.1.1 Baseline

There are two classes of read and write functions in libGalapagos: flit-based and packet-based. Each call to a flit-based read or write function behaves like hardware, where an AXIS interface transfers one flit per clock cycle; as such, flit-based libGalapagos function calls are directly portable to hardware kernels. Packet-based read and write functions are specific to software kernels and take advantage of memcpy() to move entire packets instead of copying them one flit at a time. Between software kernels on the same node, data may not even need to be copied and can instead be moved through pointer manipulation. In this initial test, run time is recorded similarly to the default libGalapagos unit tests and is measured as follows. For latency, the time for an individual message starts when the Sender sends the message and ends when the Receiver receives it. This time is then averaged over a number of iterations. This test confirms that, for small payload sizes, any combination of flit- and packet-based functions can be chosen for reading and writing and yields comparable latency. However, the efficiency gained from performing bulk reads and writes compounds at larger payload sizes and makes a dramatic difference. Therefore, to generically handle all cases, packet-based reads and writes are preferred over flit-based ones in software kernels.
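The distinction between the two styles can be illustrated with the following C++ sketch. The mock_interface type and its methods are placeholders for illustration only and are not the actual galapagos::interface API.

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical stand-in for a libGalapagos-style stream interface; the real
// galapagos::interface API differs. This only illustrates flit vs. packet moves.
struct mock_interface {
    std::vector<uint64_t> buffer;                        // backing storage
    void write_flit(uint64_t flit) {                     // one word per call,
        buffer.push_back(flit);                          // mirrors an AXIS beat
    }
    void write_packet(const uint64_t* data, size_t n) {  // bulk copy, software only
        size_t old = buffer.size();
        buffer.resize(old + n);
        std::memcpy(buffer.data() + old, data, n * sizeof(uint64_t));
    }
};

int main() {
    std::vector<uint64_t> payload(512, 0xABCD);
    mock_interface flit_style, packet_style;

    // Flit-based: portable to hardware kernels, one flit per "cycle".
    for (uint64_t flit : payload)
        flit_style.write_flit(flit);

    // Packet-based: software-only, moves the whole payload with one memcpy.
    packet_style.write_packet(payload.data(), payload.size());
    return 0;
}
```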

5.3.1.2 Baseline with Replies

The measurement of time in the baseline test in the previous section differs from the way time is measured in Shoal. To get a more accurate comparison, this second baseline test changes the measurement of time to more closely resemble that of the Shoal microbenchmarks. This test also focuses on the communication methods that most closely match Shoal AMs and uses packet-based messaging. Throughput is measured using two different methods. In the first, the Sender sends all the messages in a loop and then waits for all the replies; this is the non-blocking case. In the second, the Sender sends one message and waits for a reply before sending the next message; this is the blocking case. In both methods, the time taken for all the messages to be sent and acknowledged is recorded. These throughput tests are intended to model different communication patterns. The summarized results are shown in Figure 5.1. With two nodes, non-blocking mode exhibits a modest performance decrease as compared to local communication in the same mode. Blocking communication has even lower performance with two nodes. In both scenarios, the data must now actually be sent over the network rather than just moved through pointer manipulation as with local communication on a single node.
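The two measurement loops can be sketched as follows; send_message() and wait_for_reply() are hypothetical stand-ins for the actual libGalapagos calls.

```cpp
#include <chrono>
#include <cstddef>

// Stand-ins for the real Sender-side calls; the real code issues packet-based
// libGalapagos writes and blocks on the corresponding reply messages.
static void send_message(std::size_t /*payload_bytes*/) {}
static void wait_for_reply() {}

using clk = std::chrono::steady_clock;

// Non-blocking: issue every message first, then collect all replies.
double non_blocking_throughput(std::size_t n, std::size_t bytes) {
    auto start = clk::now();
    for (std::size_t i = 0; i < n; ++i) send_message(bytes);
    for (std::size_t i = 0; i < n; ++i) wait_for_reply();
    std::chrono::duration<double> t = clk::now() - start;
    return (n * bytes * 8) / t.count();   // bits per second
}

// Blocking: wait for the reply to each message before sending the next one.
double blocking_throughput(std::size_t n, std::size_t bytes) {
    auto start = clk::now();
    for (std::size_t i = 0; i < n; ++i) {
        send_message(bytes);
        wait_for_reply();
    }
    std::chrono::duration<double> t = clk::now() - start;
    return (n * bytes * 8) / t.count();
}
```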

[Plot: throughput (Mb/s) versus payload size (bytes) for blocking and non-blocking messaging on one and two nodes]

Figure 5.1: Throughput of messages with replies in libGalapagos

Somewhat unexpectedly, Figure 5.1 shows higher throughput under blocking calls than under non-blocking calls within a single node. This discrepancy is suspected to be due to the tests’ performance depending on the kernel workload and the OS thread scheduling. To validate this theory, some additional tests are performed to control for these factors. A differing kernel workload can be modeled through a finite busy loop that consumes processor cycles. The sleep() function cannot be used to model workload accurately here because it would also affect the thread scheduling. Instead, the busy loop is set to count up to an arbitrarily large value (chosen to be 10^5). To prevent compiler optimizations from removing or otherwise altering this loop, this counter and a dependency are added through inline assembly. Controlling the thread schedule requires restricting the cores that the application runs on. Three variations are tested: one-core, affinity and unrestricted. The one-core test constrains the entire application to run on a single core. In affinity, each of the two kernel threads is constrained to run on one of two hyperthreaded cores on the same physical core. Finally, unrestricted imposes no restrictions on core usage. The results of this test are shown in Figure 5.2.
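The following C++ sketch illustrates the two mechanisms under the assumption of a GCC/Clang-style empty inline-assembly dependency and Linux pthread affinity calls; it is a reconstruction of the technique, not the thesis code itself.

```cpp
#ifndef _GNU_SOURCE
#define _GNU_SOURCE      // for CPU_SET and pthread_setaffinity_np on Linux
#endif
#include <pthread.h>
#include <sched.h>
#include <cstdint>

// Busy loop modelling kernel workload. The empty asm statement creates a
// dependency on `i` so the compiler cannot delete or collapse the loop.
// (The thesis uses the same idea; the exact assembly is not shown there.)
inline void busy_work(std::uint64_t iterations = 100000) {
    for (std::uint64_t i = 0; i < iterations; ++i) {
        asm volatile("" : "+r"(i));   // opaque dependency, no real instructions
    }
}

// Pin the calling thread to one logical core (Linux-specific), as in the
// "1 core" and "affinity" variants of the test.
bool pin_to_core(int core_id) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core_id, &set);
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set) == 0;
}

int main() {
    pin_to_core(0);   // "1 core" variant; omit for the unrestricted case
    busy_work();      // inserted between writes in the throughput loop
    return 0;
}
```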

[Plots: throughput (Mb/s) versus payload size (bytes) with and without the busy loop for the unrestricted, core-affinity and single-core schedules]

(a) Non-blocking (b) Blocking

Figure 5.2: Comparison of throughput with different thread schedule and delays

In Figure 5.2a, adding the dummy workload between writes dramatically improves throughput when no scheduling restrictions are placed (the “With busy loop (unrestricted)” and the “No busy loop (unrestricted)” cases). When the application is restricted to one core, adding the busy loop slightly lowers throughput, as would be expected. This core restriction lowers throughput because parallelization cannot occur as effectively. The opposite result can be seen in Figure 5.2b, where the application has the highest throughput when running on a single core and the performance with and without the busy loop is equivalent. Data locality and reuse are the likely reasons for this improvement. Based on this data, we can see that tuning an application for maximum performance depends on the workload, communication patterns and thread scheduling. For more general implementation results, no dummy work or constrained thread scheduling is used in the remaining tests.

5.3.2 Shoal

Measuring the results of microbenchmarks for the Shoal API requires sweeping many hardware configurations through different types of communication methods. Since Galapagos kernels may exist in both software and hardware, performance between all combinations of these placements must be measured. There are six combinations: software-to-software (same node), software-to-software (different nodes), software-to-hardware, hardware-to-software, hardware-to-hardware (same node) and hardware-to-hardware (different nodes). For each configuration, five different message types from the Shoal API are sent: Short, Medium FIFO, Medium, Long FIFO, and Long. For all Medium and Long message types, payload sizes from 8 bytes to 4096 bytes are tested. These tests are all run in one of three modes: latency, throughput (non-blocking), and throughput (blocking). For latency, time is measured from when the Sender sends the message to when it receives the reply from the Receiver. Time in the throughput tests is measured in the same way as described in the libGalapagos benchmark in Section 5.3.1.2. No core restrictions or affinity are enforced in software. This section first describes the application used to gather the microbenchmark data in Section 5.3.2.1. Then, Sections 5.3.2.2 and 5.3.2.3 present and discuss the latency and throughput results, respectively.

5.3.2.1 Benchmark Application

All benchmark data is acquired through a single benchmarking application on both software and hardware; it can be compiled to run in software and synthesized through Vivado HLS for hardware. The application is effectively a small custom processor. It consists of a loop that reads data from memory and a large switch statement that selects between the different operations based on this data. These operations represent all the individual test cases that may be run. For example, one instruction may encode the command “short_latency”, indicating that it performs the latency test for short messages. Each such instruction reads additional data from memory to define the number of times to send the message. Depending on the type of test, a timer is started and stopped at different times to measure the elapsed time. Time is measured through the std::chrono library in software and through the Xilinx AXI Timer [73] in hardware.
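The structure of this instruction loop can be sketched in C++ as follows. The opcodes, memory layout and helper calls are invented for illustration; the real Benchmark application defines its own encoding, issues actual Shoal API calls and, in hardware, reads the AXI Timer instead of std::chrono.

```cpp
#include <chrono>
#include <cstddef>
#include <cstdint>

// Hypothetical opcodes and layout; the real Benchmark IP defines its own encoding.
enum : std::uint64_t { OP_SHORT_LATENCY = 0, OP_MEDIUM_TPUT = 1, OP_HALT = 2 };

// Stand-ins for the Shoal API calls issued by the real tests.
static void send_short_am() {}
static void send_medium_am(std::uint64_t /*payload_bytes*/) {}

void benchmark(const std::uint64_t* program) {
    std::size_t pc = 0;                              // index into instruction memory
    for (;;) {
        const std::uint64_t op = program[pc++];      // fetch
        if (op == OP_HALT) return;
        const std::uint64_t reps = program[pc++];    // every test reads its repeat count
        const auto start = std::chrono::steady_clock::now();
        switch (op) {                                // decode and execute
        case OP_SHORT_LATENCY:
            for (std::uint64_t i = 0; i < reps; ++i) send_short_am();
            break;
        case OP_MEDIUM_TPUT: {
            const std::uint64_t bytes = program[pc++];
            for (std::uint64_t i = 0; i < reps; ++i) send_medium_am(bytes);
            break;
        }
        }
        const auto elapsed = std::chrono::steady_clock::now() - start;
        (void)elapsed;   // the real application forwards timing data to kernel 0
    }
}
```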

There are three kernels active during the microbenchmarks: kernels 0, 1 and 2, where kernel 1 is the Sender and kernel 2 is the Receiver. The two kernels may independently be on hardware or software depending on the test configuration. Kernel 0 is fixed to be a software kernel and it receives timing data from the Sender. The instructions for each kernel are generated in the correct format using a Python script.

Performing benchmarking with this application has the distinct advantage that it allows for rapid testing of different configurations. For hardware, this separation between instruction and execution means that the hardware kernel is as generic as a processor and only the instructions need to be updated to change which tests are run. The Xilinx tool flow allows for direct bitstream manipulation through the updatemem utility, which is used to write initial data to block RAM. To overwrite memory in a bitstream, the utility requires the site of the block memory, its address range and the type of memory, which can all be extracted through a TCL script from the implemented design in Vivado; the result is a new bitstream with updated memory. This process takes approximately a minute instead of needing to go through implementation again to make a small change such as changing the number of iterations to run. The process of generating instructions and updating bitstreams is described in more detail in Appendix B.9.1. In software, the instruction-generation script accepts arguments to define which tests are going to be run, allowing software-only kernels to iterate through different configurations in a script without needing to recompile the application. However, this configurability comes at the cost of optimization. In hardware, the Benchmark kernel is not optimized to send and absorb data at line rate, which sets an upper bound on the measurable performance.

The development of this Benchmark application also exposed some limitations of the Vivado HLS tool flow. At numerous points, the tool failed to synthesize the correct IP. In these cases, the behavior of the generated RTL was inconsistent with what the source code specified, with no warning reported by HLS. The only way to observe a problem was to run the generated IP through a testbench and evaluate the results. The resolution to these problems was always selective inlining or outlining of functions through the INLINE HLS pragma or a superficial source code change. As mentioned before, inlining a function can improve performance through hardware duplication instead of sharing. However, whether a function is inlined or not should have no bearing on the final behavior, and yet it appears to affect how the HLS tool synthesizes source code.

[Plot: median latency versus payload size (bytes) for the SW-SW (same), SW-SW (diff), SW-HW, HW-SW, HW-HW (same) and HW-HW (diff) topologies, log-scale time axis]

Figure 5.3: Average median latency of communication methods with TCP in different topologies

5.3.2.2 Latency

The latency of communication is important to benchmark for a low-level communication API. For the test cases described above, we perform the tests using both TCP and UDP, where possible. Figure 5.3 shows the median latencies in different hardware topologies using TCP to communicate between kernels on different nodes. Kernels on the same node use internal routing within the FPGA (in the hardware case) or within libGalapagos (in the software case). For simplicity, this figure shows the average of the different types of AMs in each topology. As expected, communication between kernels in hardware occurs much faster than communication in software. Even two hardware kernels on different nodes can use the whole TCP/IP stack faster than software can internally route data in libGalapagos. For most cases, the latencies increase with increasing payload size. Notably, SW-SW (same) shows a relatively constant trend, indicating that there are other overheads beyond the payload size. Using UDP—a lighter protocol than TCP—offers even shorter latencies. Figure 5.4 shows the speedup when using UDP instead. This figure excludes the communication between kernels on the same node as no network protocol is used. In most cases, messages sent with UDP are faster. No data was collected for topologies including hardware for UDP messages with 2048- and 4096-byte payload sizes. Large UDP packets sent from software are marked as IP fragmented, which is unsupported by the hardware UDP core on the FPGA. The inverse case—transmitting large UDP packets from the hardware—fails as well because the UDP core does not send them out. These packets may have been dropped by the core or are otherwise unsupported. There are workarounds for this problem. Data could be pre-fragmented to fit within the Ethernet frame and reconstructed using the Galapagos header. However, this preprocessing has not been implemented.

[Plot: speedup of median latency (UDP versus TCP) against payload size (bytes) for the SW-SW (diff), SW-HW, HW-SW and HW-HW (diff) topologies]

Figure 5.4: Speedup of median latency using UDP instead of TCP

5.3.2.3 Throughput

The throughput of communication measures the sustained data transmission rate. For the test cases described above, we perform the tests using both TCP and UDP (where possible) and measure throughput in both non-blocking and blocking modes. Figure 5.5 shows the throughput in different hardware topologies using TCP to communicate between kernels on different nodes. As in the latency case above, kernels on the same node use internal routing within the FPGA (in the hardware case) or within libGalapagos (in the software case). For simplicity, this figure shows the average of the different types of AMs in each topology. Throughput between hardware nodes is significantly higher than between software nodes and it generally increases with payload size. With 4096 bytes of payload, the throughput between hardware kernels on different FPGAs in non-blocking communication is close to that between kernels on the same FPGA. In Figure 5.5a, the HW-HW (diff) case with a 2048-byte payload shows an abnormally low throughput relative to adjacent payload sizes for the same configuration. This discrepancy is due to Long messages exhibiting extremely poor performance at this particular payload size, which drags down the average throughput. The cause of this behavior is undetermined at this time. In the blocking communication case (Figure 5.5b), throughput falls drastically as compared to the non-blocking case since kernels have to wait for replies before continuing. The notable exception is the SW-SW (same) topology, which benefits from local communication.

Figure 5.6 shows how throughput improves when using UDP instead of TCP. Again, this figure excludes the communication between kernels on the same node as no network protocol is used. Measuring throughput in the HW-SW and SW-SW (diff) cases fails using UDP since it is not a reliable protocol. In the HW-SW case, hardware is capable of sending data much faster than software can receive it, resulting in dropped packets. The topology with two different software nodes fails as well, which indicates that software can transmit these packets faster than it can receive them. In the two remaining topologies where it does work, UDP mostly improves throughput. This result does not hold with software sending to hardware in non-blocking mode. Further investigation is needed to determine why this is the case but it may be related to Boost’s Asio library, which is used for networking in libGalapagos.

We can also compare the measured throughput in Shoal against that measured in libGalapagos in Section 5.3.1.2. The speedup of throughput is shown in Figures 5.7 and 5.8 for kernels communicating on the same node and on separate nodes (with TCP), respectively. As these graphs show, throughput measured in Shoal can be as much as half of that measured in the libGalapagos tests. However, this result is somewhat expected. Reduced performance is often the price to pay for more user-friendly programming models. In this case, the magnitude of the slowdown is due in part to the additional work being done in these tests as compared to those for libGalapagos. In Shoal, the kernel initiates the data transfers by sending a packet from its own thread to the handler thread. The handler thread receives this data, parses it to determine the message to be sent, constructs this new packet, and then sends it out. In the receiver’s handler thread, the sent data is parsed to read the Shoal header and perform additional work based on the type of message.
This work may be to store the data to memory as in the libGalapagos measurement (Long AMs) or to forward it to the kernel thread (Medium AMs). In the libGalapagos tests, by contrast, the data is simply sent to the destination kernel and not looked at further. Interestingly, there are some outlying scenarios that do not exhibit this general trend, particularly the FIFO-type AMs in Figure 5.7a. As discussed in Section 5.3.1.2, on the same node, performance is highly susceptible to the workload and the resulting thread scheduling. Here, the different workload that each kernel is doing in this benchmark as compared to the libGalapagos reference tests tends to result in a more favorable schedule. Figure 5.8a shows a similar trend in the two-node case but the maximum speedups are lower.
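The receive-side work described above can be pictured with the following C++ sketch. The header layout, field names and helpers are illustrative assumptions only and do not reflect the actual Shoal implementation.

```cpp
#include <cstdint>
#include <cstring>

// Illustrative header and helpers only; the real Shoal header format differs.
struct am_header {
    std::uint8_t  type;         // e.g. Medium or Long
    std::uint32_t payload_len;  // bytes of payload following the header
    std::uint64_t dest_addr;    // target offset in the shared segment (Long AMs)
};

enum : std::uint8_t { AM_MEDIUM = 1, AM_LONG = 2 };

static std::uint8_t shared_segment[1 << 20];                            // stand-in PGAS partition
static void forward_to_kernel(const std::uint8_t*, std::uint32_t) {}    // stand-in kernel FIFO

// Receive-side handler: parse the header and act on the message type.
void handle_packet(const std::uint8_t* pkt) {
    am_header hdr;
    std::memcpy(&hdr, pkt, sizeof(hdr));              // parse the Shoal header
    const std::uint8_t* payload = pkt + sizeof(hdr);
    switch (hdr.type) {
    case AM_LONG:    // store payload into the shared memory segment
        std::memcpy(shared_segment + hdr.dest_addr, payload, hdr.payload_len);
        break;
    case AM_MEDIUM:  // hand payload to the kernel thread
        forward_to_kernel(payload, hdr.payload_len);
        break;
    }
}
```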

[Plots: average throughput (Mb/s) versus payload size (bytes) for the SW-SW (same), SW-SW (diff), SW-HW, HW-SW, HW-HW (same) and HW-HW (diff) topologies, log-scale throughput axis]

(a) Non-Blocking (b) Blocking

Figure 5.5: Average throughput of communication methods with TCP in different topologies

[Plots: speedup of throughput (UDP versus TCP) against payload size (bytes) for the SW-HW and HW-HW (diff) topologies]

(a) Non-Blocking (b) Blocking

Figure 5.6: Speedup of throughput using UDP instead of TCP

[Plots: speedup of throughput against payload size (bytes) for the Medium FIFO, Medium, Long FIFO and Long message types]

(a) Non-blocking (b) Blocking

Figure 5.7: Speedup of throughput between libGalapagos and Shoal with two kernels on the same node

[Plots: speedup of throughput against payload size (bytes) for the Medium FIFO, Medium, Long FIFO and Long message types]

(a) Non-blocking (b) Blocking

Figure 5.8: Speedup of throughput between libGalapagos and Shoal with two kernels on different nodes

5.4 Stencil Codes

Stencil codes represent iterative simulations of a system that is composed of simulation units, or cells, arranged in a regular matrix. The state of each cell at t = 0 represents the initial conditions of the system, which then changes at discrete time steps. At t = n, the state of each cell can be computed as a function of both the cell’s state and that of its neighbors at t = n − 1. Which cells are considered neighbors depends on the application. In this way, the iterative algorithm continues for some defined number of iterations or until a convergence condition is reached. The Jacobi method is one algorithm in this class where the system is a 2D matrix, which in the simplest case is a square of size N. Many real-world phenomena fit this model, such as Poisson’s equation (as in electrostatics) and the heat equation. Through the finite differences method, these partial differential equations can be transformed into matrices to be solved through the Jacobi method. This use case is a good candidate to test for several reasons. First, due to the nature of the algorithm, it is easy to parallelize across many computing kernels by assigning each kernel a region of the whole system. Each kernel can then maintain the state and update the cells in its range. However, the cells at the boundaries belong to ranges of different kernels, so their state needs to be shared at each iteration. This need for data exchange necessitates a capable communication protocol with barriers to ensure iterations advance together. Second, this algorithm has been tested by previous work in [5] and [6], making it useful to compare results across these related works. The von Neumann stencil is used in both works and is carried forward here. This neighborhood only considers the cells in the cardinal directions as neighbors and excludes diagonals.
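For the Laplace-type problems mentioned above, one Jacobi sweep with the von Neumann neighborhood reduces to averaging a cell's four cardinal neighbors. The sketch below is a generic single-node reference in C++, not the stencil_core or the Shoal kernel used later in this section.

```cpp
#include <cstddef>
#include <vector>

// One Jacobi sweep over an N x N grid (von Neumann neighborhood: N, S, E, W).
// Boundary cells are held fixed, as is typical for the heat/Laplace problem.
void jacobi_sweep(const std::vector<double>& cur, std::vector<double>& next, std::size_t N) {
    for (std::size_t i = 1; i + 1 < N; ++i) {
        for (std::size_t j = 1; j + 1 < N; ++j) {
            next[i * N + j] = 0.25 * (cur[(i - 1) * N + j] +   // north
                                      cur[(i + 1) * N + j] +   // south
                                      cur[i * N + (j - 1)] +   // west
                                      cur[i * N + (j + 1)]);   // east
        }
    }
}

// In the distributed version, each kernel owns a band of rows and, before each
// sweep, fetches its neighbors' boundary rows (in Shoal, via get-style AMs).
```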

5.4.1 Baseline

This section summarizes some results from previous works as a reference. As the raw data used by the authors is inaccessible, these conclusions are derived from written descriptions and figures. In [16], tests are performed on the MapleHoney cluster (Figure 3.1) and performance is reported in iterations per second. For N = 4096, one software node can compute approximately 10-14 iterations per second depending on the number of kernels. Meanwhile, one hardware node computes about 5-9 iterations per second with up to eight kernels per node. This iteration rate holds as the number of nodes increases proportionally to N, indicating high scalability. In [6], tests are performed on a Zedboard cluster and performance is reported in run time, which has been converted here to iterations per second for comparison. For N = 4096, one software node can compute approximately 6 iterations per second with two kernels. Each hardware node has up to two kernels and can compute 5-22 iterations per second as the number of kernels increases from two to sixteen. We do not provide a direct comparison among THeGASNet, THe_GASNet Extended and Shoal because the numerical performance is misleading due to the significant differences of the platforms used.

5.4.2 Porting from THe_GASNet Extended

This section describes the process of porting the Jacobi method application from THe_GASNet Extended to Shoal. It also provides a code comparison between previous work and Shoal and some lessons on how changes in design affect usability. Listing 6 shows a high-level look at the Jacobi method implementation in THe_GASNet Extended. Between THeGASNet and THe_GASNet Extended, the code structure looks relatively similar. The major highlighted differences are the addition of some predefined handler functions in line 5 and the higher-level messaging function calls as in line 32.

In Shoal, there are a number of significant changes, as seen in Listing 7. First, this particular application does not make use of any user-defined handler functions. The handler functions used to implement synchronization in the original code have been folded into the core Shoal platform, and so they are not needed. Shoal discards the NODE_GLOBALS struct used in THeGASNet as a means to hold shared data accessible to both handlers and the application. This change is in line with the general reduction of dependence on custom handlers. As handlers get embedded into Shoal itself, the API provides accessor methods for the application to use. Second, the Jacobi method is no longer the main entry point into the program. Instead, it is a function that matches the libGalapagos definition of a software kernel with a numerical ID and a pair of streaming interfaces (lines 5-7). This function is now added as a function pointer to a separate main function that initializes a Galapagos node to run the kernels. Lines 9-13 initialize a shoal::kernel object and allocate the memory to be used as part of the shared partition. They serve a similar purpose to their analogues in THeGASNet. The addresses of individual segments are determined through the kernel IDs (under the assumption that they are sized equally for all kernels) rather than through a dedicated function. As mentioned above, barriers are intrinsically a part of Shoal and so do not need to be defined in the user application (lines 20-24). At each iteration, the values are computed in the same way as before. The only difference here is that communication of data occurs through the Shoal API instead (lines 31-32).

With the addition of compiler directives to add hardware-specific code and to remove software-specific parts, this C++ code can be run through the HLS tool to produce a working Jacobi method kernel. Unfortunately, there are a few problems with this approach. First, the synthesis of the kernel is not guaranteed to be correct. Similar issues to those discussed in Section 5.3.2.1 arise where the synthesized HDL does not match the source code. Second, the performance of the HLS kernel without adding pragmas is about 50x worse than software in one test case. Instead of using this kernel for computation, the computation core from THeGASNet (stencil_core) is used. It was written in VHDL by Willenberg. Using this core takes advantage of an optimized implementation for the computation of the Jacobi method. In THeGASNet, the stencil_core was connected through PAMS to the GAScore. In Shoal, a stripped-down version of the code in Listing 7 is run through HLS and used for communication and synchronization as a controller. Some additional ports are added to it to interface with a custom Verilog wrapper that allows the controller to communicate with the stencil_core.
Together, the controller and the wrapper expose the same outward interfaces as a normal Galapagos kernel.

1 #include "the_gasnet_core.h"

2 #include "the_gasnet_extended.h"

3

4 static gasnet_handlerentry_t handlers[]={

5 THE_GASNET_EXTENDED_HANDLERS

6 {handler_id, (void(*)())handler_func_ptr}...

7 }

8

9 START_NODE_GLOBALS

10 unsigned int barrier_cnt, mynode, nodes;

11 ...

12 END_NODE_GLOBALS

13

14 int main(int argc, char**argv){

15 gasnet_init(&argc,&argv);

16 gasnet_extended_init();

17 ...

18 // ... parse command-line arguments

19

20 gasnet_attach(handlers, num_handlers, segment_size,0);

21 ...

22 gasnet_getSegmentInfo(segment_table_app, gasnet_nodes());

23

24 if (shared->mynode == shared->ctrlnode){

25 // ... send initial pole data

26 }

27 // ... wait for data distribution

28 for(i=0; i< ITERATIONS; i++){

29 if (shared->mynode == shared->ctrlnode){

30 // ... wait for stats

31 } else{

32 if (sendN) gasnet_get_nbi(...);

33 ...

34 barrier();

35 // ... compute iteration

36 // ... send stats

37 }

38 }

39 }

40 // ... handler functions

Listing 6: Structure of the SW Jacobi application in THe_GASNet Extended. The highlighting shows the differences from the same application in THeGASNet. The lines are explained in Section 5.4.2.

1 #include "jacobi.hpp"

2

3 extern "C"{

4 void jacobi(

5 short id,

6 galapagos::interface* in,

7 galapagos::interface* out

8 ){

9 shoal::kernel kernel(id, KERNEL_NUM_TOTAL, in, out);

10 kernel.init();

11 ...

12

13 kernel.attach(nullptr,0, mem_available);

14 ...

15

16 if (id == ctrlnode){

17 // ... send initial pole data

18 }

19

20 if(id == ctrlnode){

21 kernel.barrier_wait();

22 } else{

23 kernel.barrier_send(ctrlnode);

24 }

25

26 for(i=0; i< ITERATIONS; i++){

27 if (id == ctrlnode){

28 // ... wait for stats

29 } else{

30 if (sendN){

31 kernel.getLongAM_normal(...);

32 kernel.wait_mem(1);

33 }

34 ...

35 // ... compute iteration

36 // ... send stats

37 }

38 }

39 }

40 }

Listing 7: Structure of the SW Jacobi application in Shoal. The lines are explained in Section 5.4.2.

5.4.3 Software Performance

Selected results from running the Jacobi application on a single software node are shown in Figure 5.9, covering a breadth of grid sizes. The application is run for 1024 iterations and the elapsed time is recorded. Several interesting trends can be observed here that also apply to the hardware implementations. For small grid sizes, the overhead of communication, synchronization and memory contention dominates and results in longer execution times as the number of kernels is increased. At a grid size of 1024, this trend changes and increasing the number of kernels improves the run time to a point. With 16 kernels on one node, the computation time is almost one second faster than the 8-kernel case but the significantly increased time spent in synchronization offsets this saving. At the largest tested grid size, we again see improvement when increasing the number of kernels, though not with 16 kernels. We expect the 16-kernel case to be more viable if the kernels are split over more nodes instead of concentrated on one, as we see in the hardware cases. Note that with a grid size of 4096, using two and four kernels does not currently work. Due to the size of the grid allocated to each kernel in this case, the amount of edge data that must be exchanged at each iteration is too large to send in a single AM.¹ The resolution to this limitation is to detect whether the message size exceeds the limit and request the data in smaller sections, but this has not been implemented. With a single kernel, there is no communication between adjacent kernels so size is not an issue. With more than eight kernels, the data to be exchanged is small enough to fit within one AM.
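One possible shape for that fix is sketched below: an oversized get is split into chunks that each fit within the maximum AM payload. The helper names are illustrative stand-ins; the real call would be a Shoal get such as getLongAM_normal.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>

// Stand-in for a Shoal get call (e.g. kernel.getLongAM_normal in the real API).
static void get_long_am(std::uint64_t /*remote_offset*/, std::uint64_t /*local_offset*/,
                        std::size_t /*bytes*/) {}

// Request `total` bytes of edge data in chunks that each fit within one AM.
void get_in_chunks(std::uint64_t remote, std::uint64_t local,
                   std::size_t total, std::size_t max_payload) {
    std::size_t done = 0;
    while (done < total) {
        const std::size_t chunk = std::min(max_payload, total - done);
        get_long_am(remote + done, local + done, chunk);   // one AM per chunk
        done += chunk;
    }
    // The caller would then wait for all outstanding replies (e.g. wait_mem).
}
```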

5.4.4 Hardware Performance

In the hardware tests, the control kernel remains in software but all computation kernels are moved into one or more FPGAs. Communication between nodes is performed over TCP to ensure reliability. Hardware experiments show similar trends to those in software. For small grid sizes, the cost of communication dominates overall run time. Until at least a grid size of 2048, it is better to use a single FPGA and a reduced number of kernels. Having many kernels on a single FPGA creates contention for RAM and decreases performance for these grid sizes. We focus here on the larger sizes where increased kernels and nodes benefit run time. Figure 5.10 shows a comparison between hardware and software for the case where the grid size is 4096, iterations are again fixed to 1024 and there are either eight or sixteen kernels in total. In hardware, holding the total number of kernels constant but spreading them out over multiple nodes improves performance as it decreases contention of local resources. Increasing the number of kernels also improves run time but not necessarily as dramatically. With more than one FPGA, the hardware is markedly faster than a single software node.

5.4.4.1 Issue of Correctness

While the ported Jacobi application runs in Shoal, it does not yield correct results in hardware. Time did not permit resolving this issue, which is the result of some addressing inconsistencies that are described below. Despite this issue, the communication pattern and memory accesses are consistent with those of a correct application and the performance results in Section 5.4.4 are still valid.

¹Currently, libGalapagos enforces a maximum packet size of 9000 bytes—the size of an Ethernet jumbo frame—due to limitations imposed by the hardware TCP/IP core.

[Plot: run time (s) versus grid size for 1, 2, 4, 8 and 16 kernels, log-scale time axis]

Figure 5.9: Run time of the Jacobi application in software. The iteration count is fixed to 1024.

The original Jacobi application in THeGASNet and THe_GASNet Extended uses 32-bit integers to hold the state of each cell. This width matches the width of the data and control paths in their hardware implementation. In Shoal, the native width is 64 bits, as Galapagos uses the same word size. While Shoal has some limited support for payload sizes that are not an integer number of whole words, this feature has largely been untested as Galapagos currently does not support sending partial words. In software, this difference can be addressed by scaling up the data from 32-bit to 64-bit, yielding correct results. In hardware, the stencil_core has been optimized for 32-bit data and changing it is not as trivial as in software. Thus, 64-bit data is stored in memory from the get requests issued by the controller but the data is incorrectly read in 32-bit words during computation. Note that the size of the data does not affect how the run time is measured. In all our experiments, we run for a fixed iteration count of 1024 rather than towards a convergence condition that may vary with the width of the data.

Another problem pertains to address offsetting in hardware. All kernels and the GAScore use memory-mapped AXI-Full interfaces to access memory. For HLS kernels, the address offset for a memory-mapped interface can be set in the source code. However, to avoid multiple versions of the same kernel differing only by address offsets, this offset is fixed in the source code to be zero. To provide each kernel with a distinct address space, this offset needs to be provided externally but this application does not do so. Accesses to memory from the GAScore do provide a fixed offset using the kernel ID; this offset is currently hard coded to 256 MB to allow sixteen kernels per FPGA, which is also the maximum number supported by a single Xilinx AXIS switch. Initially, these memory offsets were to be provided within the GAScore by routing in all AXI-Full interfaces from kernels, adding offsets internally, and then routing the interfaces out to off-chip memory. Due to some errors in RTL simulation after packaging the IP with this approach, it was abandoned.

Finally, the software application begins by clearing all shared memory by setting it to zero. The hardware implementation does not perform this initial step. Its exclusion does not affect the timing results in hardware as the time taken for this step is measured separately in software.

[Bar chart: run time (s) for 8 and 16 kernels on one, two and four hardware nodes and on one software node]

Figure 5.10: Run time of the Jacobi application in hardware. The iteration count is fixed to 1024 and the grid size is set to 4096. The bar labels show the total number of computation kernels in the test.

And to make an end is to make a beginning.
T.S. Eliot

Chapter 6

Conclusions

The rise of hardware-as-a-service in the cloud and the strategic acceleration of data center applications with FPGAs motivate the need to develop applications that can take advantage of a heterogeneous cluster. Through HLS, the problem can be opened up to the numerous domain experts who have potential applications but lack the hardware skills to implement their vision. Note that this is not an endorsement of current HLS solutions. As the Jacobi example and the benchmarking in this thesis show, writing performant HLS code is difficult and the tool may implement something else entirely. However, it can be a valuable first step to explore the design space and implement the scaffolding in which hardware experts can insert optimizations. Writing applications in a distributed environment requires the use of a robust communication API, which is the main goal of this work. We took advantage of the previous research in the THeGASNet and THe_GASNet Extended projects to provide a firm foundation and made use of Galapagos to simplify deployment. As a result, we have developed a cross-platform API that provides a mix of message types. We demonstrate this library using a Jacobi method application to highlight the ease of adding FPGAs to a high-compute task. Instead of allocating time to set up the network and access to memory on the FPGA, time can be devoted to developing the application.

Through this work, we also sought to re-evaluate PGAS as a viable programming model for heterogeneous clusters. The use of PGAS does not inherently result in higher performance (though it can) when compared to the distributed memory model but it can be more productive to write PGAS applications [25], [74]. This performance-usability trade-off is a typical problem for developers. The prioritization of one over the other depends on the particular use case and can change over the course of a project. Usability is more critical at the beginning of a new project. For example, we leverage the Galapagos stack and HLS in this work. Both tools can decrease the maximum achievable performance, but they provide an essential foundation without which this work would have taken much longer. By lowering the barrier of entry, usability encourages adoption of tools and platforms. More users spur the continued development and refinement of tools, leading to improved efficiency for the tool itself and other tools dependent on it. PGAS promotes usability through simpler communication semantics. In this work, we provide a comprehensive heterogeneous API for kernels to use. These features share the same vision as Galapagos for mixed-platform application design by making it easier to move between software and hardware. Future exploration into automated placement and program partitioning can be used to guide the mapping of applications onto heterogeneous clusters.


Unlike usability, performance is easier to measure quantitatively. In Section 5.3.2.3, we compared the throughput performance of libGalapagos and Shoal. This microbenchmark measurement does not conclusively answer whether the overheads imposed by Shoal justify the usability gains. For a more accurate test, we should implement one or more real applications, once using only native libGalapagos methods and then with Shoal, and compare the results. Of course, since Shoal needs to handle more generic data, more processing of the data is needed in Shoal than in a custom libGalapagos approach.

This discussion may feel familiar as it is similar to that surrounding Figure 2.1 in the trade-offs between flexibility and efficiency for different processors. The resolution proposed then was heterogeneity: take advantage of different types of processors by using the appropriate one for the task to improve performance. For similar reasons, we suggest defining Shoal and other such APIs using a modular specification to balance between flexibility and performance as needed. Shoal is implemented as a monolith based on the specification in THeGASNets, which is in turn based on GASNet. As such, Shoal must be fully flexible to handle all message types. However, it is likely that a given application only uses a subset of the specification, resulting in a constant cost of evaluating conditions that will never be true and unnecessary hardware usage on the FPGA. With a modular API specification, we can define discrete components of the API that can be selectively enabled. In doing so, Shoal comes closer to a custom implementation at the cost of generic flexibility. The freedom to choose can also be used to implement different communication models. For example, enabling only barriers and Medium messages creates a simple point-to-point communication protocol that can be used as a thin layer on top of libGalapagos or used to implement message passing.

6.1 Future Work

There are a number of areas in this work where further improvements can be made. Broadly, there are five main areas discussed here:

1. Quick improvements (Section 6.1.1)

2. Managing handler functions (Section 6.1.2)

3. Supporting partial-word data (Section 6.1.3)

4. Modularization of the Shoal API (Section 6.1.4)

5. Long-term improvements through automation (Section 6.1.5)

6.1.1 Quick Improvements

The most immediate issue to address is the adjustment of memory addresses for kernels as mentioned in Section 5.4.4.1. Since the original implementation of the address adjustment did not quite work in simulation, an alternate approach is to expose all the kernel memory interfaces directly instead of embedding AXI-Full interconnects inside the GAScore. This change requires modifying the top-level Verilog of the GAScore to add these interfaces and adding constraints in the IP packaging in Vivado to selectively hide these interfaces based on the number of active kernels.

Currently, Shoal software nodes may need to allocate more memory than actually required. The number of kernels on a particular node is used to determine the size of some arrays of global objects in Shoal.

The Galapagos-provided kernel ID of each kernel is then used to index into these arrays. However, this ID is globally assigned and there is no accessible information in Galapagos about relative IDs. Consider the following scenario. Kernels 0 and 1 are on software Node A and Kernel 2 is on software Node B. Therefore, Node A needs to allocate space for two kernels and Node B needs to allocate space for one kernel. On Node A, accessing the global arrays through the indices 0 and 1 works. On Node B, attempting to access index 2 results in a segmentation fault. Since kernels may be deployed in a variety of topologies, we cannot assume that Kernel 2 will always be alone on its node and should therefore access index 0. The workaround used at present is to allocate space on each software node for the total number of kernels so all indices are reachable. The formal fix in Galapagos would be to additionally expose a relative ID for each kernel, based on the number of kernels on a node, to allow for proper indexing. This information is computed in libGalapagos as kernels are added to a node but it is not accessible.
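A minimal sketch of what such relative indexing could look like on a software node is shown below; the class and method names are hypothetical and do not exist in Galapagos or Shoal today.

#include <cstddef>
#include <map>

// Hypothetical helper: the node records the global IDs of the kernels it
// hosts and hands out dense local indices so that per-kernel arrays can be
// sized to kernels_on_node() instead of the total kernel count.
class LocalKernelIndex {
public:
    void add_kernel(int global_id) {
        local_index_[global_id] = static_cast<int>(local_index_.size());
    }
    int to_local(int global_id) const { return local_index_.at(global_id); }
    std::size_t kernels_on_node() const { return local_index_.size(); }

private:
    std::map<int, int> local_index_;  // global kernel ID -> local index
};

// In the scenario above, Node B registers only Kernel 2, so its per-kernel
// arrays hold one element and Kernel 2 maps to index 0:
//   LocalKernelIndex idx;
//   idx.add_kernel(2);
//   idx.to_local(2);  // == 0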

6.1.2 Handler Functions

As described in Section 3.3, the removal of the PAMS ended the direct support of custom handler functions in hardware. However, this functionality persists in software as it did in THeGASNets. At this time, we do not see a clear need for applications to require this feature but it may be desired for hardware-software feature parity or for another application. There are a few methods by which additional handler function support can be added in hardware.

The suggested approach would be to add more handler functions within Shoal itself. Barriers and synchronization are universally used across applications so moving the associated counters internal to Shoal simplifies kernels. Similarly, other broadly applicable handler functions could be added to the API for all applications to use. The downside of this approach is that these handlers then persist for all kernels, whether they are needed for a particular application or not. For niche and resource-intensive handlers, this method is not ideal.

A second approach is to pull the handler modules out of the GAScore. Currently, data for handler functions is streamed to the handler_wrapper IP that directs it to the handlers of the correct kernel. It is internal to the GAScore for simplicity of the design and packaging but it can be made into an external module. In this case, the user is free to replace the handlers with any custom IP while adhering to the packet formats.

Custom handler functions can also be implemented without changing the GAScore through workarounds. Since Medium messages go to the kernel, their payload can contain handler function information and arguments. The user can place an IP between the kernel and the GAScore that executes handler functions in a user-defined implementation and forwards true Medium messages to the kernel. Alternatively, the kernel itself can process Medium messages that it receives (a sketch of this workaround follows at the end of this subsection).

One side effect of the de-emphasis of handler functions is that the reply message for a received message in hardware is no longer gated by the handler function. In THeGASNets, the completion of the handler function triggered the reply message to be sent. In Shoal, the reply message is triggered after simply passing the handler data to the handler. With the current set of built-in handler functions, this change does not break ordering assumptions because the handler functions are short and are resolved quickly. Any kernel receiving the reply message can be certain that the handler function has finished. However, with generic functions, this assumption may no longer hold. In this case, reply messages need to be held until the function completes.
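As an illustration of the Medium-message workaround described above, the sketch below has a hardware kernel (or an IP placed in front of it) interpret the first payload word as a handler ID before forwarding the rest of the payload. The encoding, the function name and the interfaces are assumptions for this example only and are not part of the Shoal API.

#include "ap_axi_sdata.h"
#include "hls_stream.h"

// Sketch only: treat the first word of a received Medium payload as a handler
// ID, act on it, and forward the remaining words as the "true" Medium message.
typedef ap_axiu<64, 1, 1, 1> word_t;

void medium_handler_dispatch(hls::stream<word_t> &from_gascore,
                             hls::stream<word_t> &to_kernel,
                             ap_uint<64> &barrier_count) {
    word_t w = from_gascore.read();
    const ap_uint<8> handler_id = w.data(7, 0); // hypothetical encoding

    if (handler_id == 1) {
        barrier_count++; // e.g. a barrier-style counter kept by the kernel
    }
    // Other handler IDs would be dispatched here.

    // Forward any remaining payload words unchanged to the kernel proper.
    while (!w.last) {
        w = from_gascore.read();
        to_kernel.write(w);
    }
}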

6.1.3 Supporting Partial Words and Unaligned Access

One of the issues highlighted through the Jacobi application in Section 5.4.4.1 was the importance of sending data that is not necessarily an integer number of words and supporting unaligned accesses in memory. Allowing partial words requires some changes in both Shoal and Galapagos. In Shoal packets, payload sizes are already defined in bytes so no change is needed to the header. Galapagos packet headers count words so this size would need to change to bytes as well. In both Shoal and Galapagos, allowing partial words triggers a cascade of required changes: many hardware IPs and code in both projects would need to be updated to handle partial last words by paying attention to the TKEEP signal of the stream. Unlike partial words, unaligned memory accesses are more straightforward as only Shoal needs to change. No change is needed in software as memcpy() can already perform this task. In hardware, the AXI DataMover natively supports unaligned accesses and only slight adjustments to the DataMover commands are needed.
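As a rough illustration of the kind of change involved, the HLS sketch below counts the valid bytes of a packet by inspecting TKEEP on every word rather than assuming full words; the function and its interfaces are hypothetical.

#include "ap_axi_sdata.h"
#include "hls_stream.h"

// Illustrative only: forward a 64-bit AXI-Stream packet while counting its
// payload bytes. TKEEP has one bit per byte lane, so a partial last word is
// counted correctly instead of being rounded up to a full word.
typedef ap_axiu<64, 1, 1, 1> word_t;

void count_payload_bytes(hls::stream<word_t> &in, hls::stream<word_t> &out,
                         ap_uint<32> &byte_count) {
    ap_uint<32> bytes = 0;
    while (true) {
        #pragma HLS PIPELINE II=1
        word_t w = in.read();
        for (int i = 0; i < 8; i++) {
            #pragma HLS UNROLL
            if (w.keep[i]) {
                bytes++; // count only the valid byte lanes
            }
        }
        out.write(w);
        if (w.last) {
            break;
        }
    }
    byte_count = bytes;
}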

6.1.4 API Refinement

The available methods in the shoal::kernel class do not cover all possibilities of Shoal methods that may be of interest. For example, not all asynchronous variants of different message types are present. This omission is made only in the interest of time. Completing the class for all methods largely requires duplicating existing code and making slight modifications. With a complete function set, the next step is to further refine the API through useful shortcut methods that simplify function arguments and improve usability. Of course, the original functions should remain available if full control is needed. These shortcuts can be implemented through macros or functions. If using functions, care must be taken to ensure that HLS does not complain about deeply-nested functions. It is possible that INLINE pragmas need to be added for HLS to synthesize kernels with these shortcuts. As discussed earlier in this chapter, one avenue of future work is to change the Shoal API to be more modular by defining discrete components that can be selectively enabled on a particular Shoal application. One proposed split is as follows:

• Barriers (and Short messages): Critical in any distributed platform. Short messages are needed for barriers and so the two need to be enabled together.

• Medium messages: allow point-to-point communication.

• Long messages: allow automatic data storage to memory.

• Strided messages: allow the Strided class of Long messages and are therefore dependent on Long messages also being enabled.

• Vectored messages: allow the Vectored class of Long messages and are therefore dependent on Long messages also being enabled.

• Handler functions: If disabled, all handler functions can default to the empty handler function.
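One possible way to realize this split, sketched below under the assumption of compile-time selection, is a set of preprocessor flags that gate each component; the macro names are illustrative only. With only barriers and Medium messages enabled, this reduces to the thin point-to-point layer described earlier in this chapter.

// Hypothetical compile-time selection of Shoal API components. Disabled
// components cost nothing at run time and synthesize to no hardware.
#define SHOAL_ENABLE_BARRIERS 1   // also enables Short messages
#define SHOAL_ENABLE_MEDIUM   1
#define SHOAL_ENABLE_LONG     0
#define SHOAL_ENABLE_STRIDED  0   // requires SHOAL_ENABLE_LONG
#define SHOAL_ENABLE_VECTORED 0   // requires SHOAL_ENABLE_LONG

#if SHOAL_ENABLE_STRIDED && !SHOAL_ENABLE_LONG
#error "Strided messages require Long messages to be enabled"
#endif
#if SHOAL_ENABLE_VECTORED && !SHOAL_ENABLE_LONG
#error "Vectored messages require Long messages to be enabled"
#endif

// In the packet-handling code, disabled branches then disappear entirely:
// #if SHOAL_ENABLE_LONG
//     case LONG_MESSAGE: handle_long(packet); break;
// #endif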

6.1.5 Automated Flows

There are several areas where automation can further improve the usability of Shoal and the deployment of applications onto heterogeneous clusters in the long term. Source-to-source compilers and other forms of preprocessing of source files can be leveraged to automatically partition source code and add Shoal API calls for communication and synchronization. These techniques can also be applied to modify data accesses and addresses at compile time to reflect the distributed nature of the application without the developer specifying it explicitly. For deployment, we rely on Galapagos. Programming many FPGAs is currently a very manual process, and so we support the ongoing work in Galapagos to pass bitstreams to Docker [75] containers for programming. This work will directly benefit Shoal. The Shoal-specific scripts for Galapagos integration should be generalized to directly support different kernel topologies rather than the more restricted assumptions used today for bitstream generation. These restrictions are described in Appendix B.9. In general, we can investigate closer integration with libGalapagos to improve performance and reduce overhead. It is possible that some features from Shoal can be merged into libGalapagos for broader use.

Bibliography

[1] A. Putnam et al., “A reconfigurable fabric for accelerating large-scale datacenter services,” in 2014 ACM/IEEE 41st International Symposium on Computer Architecture (ISCA), ISSN: 1063-6897, IEEE, Jun. 2014, pp. 13–24. doi: 10.1109/ISCA.2014.6853195.

[2] A. M. Caulfield et al., “A Cloud-Scale Acceleration Architecture,” in 49th Annual IEEE/ACM International Symposium on Microarchitecture, Taipei: IEEE, Oct. 2016, pp. 1–13. doi: 10.1109/MICRO.2016.7783710. [Online]. Available: https://ieeexplore.ieee.org/document/7783710/ (visited on 06/14/2018).

[3] D. Firestone et al., “Azure accelerated networking: SmartNICs in the public cloud,” in Proceedings of the 15th USENIX Conference on Networked Systems Design and Implementation, ser. NSDI’18, USA: USENIX Association, Apr. 2018, pp. 51–64, isbn: 978-1-931971-43-0. (visited on 08/16/2020).

[4] Amazon, Amazon EC2 F1 Instances, 2020. [Online]. Available: https://aws.amazon.com/ec2/instance-types/f1/ (visited on 08/17/2020).

[5] R. Willenberg and P. Chow, “A Heterogeneous GASNet Implementation for FPGA-accelerated Computing,” in Proceedings of the 8th International Conference on Partitioned Global Address Space Programming Models, ser. PGAS ’14, Eugene, OR, USA: Association for Computing Machinery, Oct. 2014, pp. 1–9, isbn: 978-1-4503-3247-7. doi: 10.1145/2676870.2676885. [Online]. Available: http://doi.org/10.1145/2676870.2676885 (visited on 01/25/2020).

[6] S. Pandit, “An Extended GASNet API for PGAS Programming on a Zynq SoC Cluster,” M.A.Sc., University of Toronto, Canada, 2016. [Online]. Available: http://search.proquest.com/docview/1817916737/abstract/F07743EC0A794CC6PQ/1 (visited on 06/29/2018).

[7] N. Eskandari, N. Tarafdar, D. Ly-Ma, and P. Chow, “A Modular Heterogeneous Stack for Deploying FPGAs and CPUs in the Data Center,” in Proceedings of the 2019 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, ser. FPGA ’19, New York, NY, USA: Association for Computing Machinery, Feb. 2019, pp. 262–271, isbn: 978-1-4503-6137-8. doi: 10.1145/3289602.3293909. [Online]. Available: http://doi.org/10.1145/3289602.3293909 (visited on 08/23/2020).

[8] V. Sharma, sonar, version 3.0, Jun. 2020. [Online]. Available: https://github.com/sharm294/sonar.

[9] ——, Shoal, Sep. 2020. doi: 10.5281/zenodo.4023239. [Online]. Available: https://github.com/UofT-HPRC/shoal.


[10] H. Yang, J. Zhang, J. Sun, and L. Yu, “Review of advanced FPGA architectures and technologies,” Journal of Electronics (China), vol. 31, no. 5, pp. 371–393, Oct. 2014, issn: 0217-9822, 1993-0615. doi: 10.1007/s11767-014-4090-x. [Online]. Available: http://link.springer.com/10.1007/s11767-014-4090-x (visited on 09/02/2019).

[11] Xilinx Inc. (2019). Vivado High-Level Synthesis, [Online]. Available: https://www.xilinx.com/products/design-tools/vivado/integration/esl-design.html (visited on 09/02/2019).

[12] Intel Inc. (2019). Intel High Level Synthesis Compiler, [Online]. Available: https://www.intel.com/content/www/us/en/software/programmable/quartus-prime/hls-compiler.html (visited on 09/02/2019).

[13] A. Canis et al., “LegUp: High-level synthesis for FPGA-based processor/accelerator systems,” in Proceedings of the 19th ACM/SIGDA international symposium on Field programmable gate arrays, ser. FPGA ’11, New York, NY, USA: Association for Computing Machinery, Feb. 2011, pp. 33–36, isbn: 978-1-4503-0554-9. doi: 10.1145/1950413.1950423. [Online]. Available: http://doi.org/10.1145/1950413.1950423 (visited on 08/17/2020).

[14] A. Shimoni, A gentle introduction to hardware accelerated data processing, Aug. 2018. [Online]. Available: https://hackernoon.com/a-gentle-introduction-to-hardware-accelerated-data-processing-81ac79c2105 (visited on 08/23/2020).

[15] Microsoft Inc., What are field-programmable gate arrays (FPGA) and how to deploy. [Online]. Available: https://docs.microsoft.com/en-us/azure/machine-learning/how-to-deploy-fpga-web-service (visited on 08/23/2020).

[16] R. Willenberg, “Heterogeneous Runtime Support for Partitioned Global Address Space Programming on FPGAs,” Ph.D., University of Toronto, Toronto, 2016.

[17] V. Krishnaswamy, Intel® Memory Latency Checker v3.9, Jun. 2020. [Online]. Available: https://www.intel.com/content/www/us/en/develop/articles/intelr-memory-latency-checker.html (visited on 09/28/2020).

[18] J. Torrellas, H. Lam, and J. Hennessy, “False sharing and spatial locality in multiprocessor caches,” IEEE Transactions on Computers, vol. 43, no. 6, pp. 651–663, Jun. 1994, issn: 1557-9956. doi: 10.1109/12.286299.

[19] L. Dagum and R. Menon, “OpenMP: an industry standard API for shared-memory programming,” Computational Science & Engineering, IEEE, vol. 5, no. 1, pp. 46–55, 1998.

[20] J. Reinders, Intel Threading Building Blocks: Outfitting C++ for Multi-core Processor Parallelism. O’Reilly Media, Inc., Jul. 2007, isbn: 978-1-4493-9086-0.

[21] L. Clarke, I. Glendinning, and R. Hempel, “The MPI Message Passing Interface Standard,” in Programming Environments for Massively Parallel Distributed Systems, K. M. Decker and R. M. Rehmann, Eds., ser. Monte Verità, Basel: Birkhäuser, 1994, pp. 213–218, isbn: 978-3-0348-8534-8. doi: 10.1007/978-3-0348-8534-8_21.

[22] A. Geist et al., “MPI-2: Extending the message-passing interface,” in Euro-Par’96 Parallel Processing, L. Bougé, P. Fraigniaud, A. Mignotte, and Y. Robert, Eds., ser. Lecture Notes in Computer Science, Berlin, Heidelberg: Springer, 1996, pp. 128–135, isbn: 978-3-540-70633-5. doi: 10.1007/3-540-61626-8_16.

[23] G. Almasi, “PGAS (Partitioned Global Address Space) Languages,” in Encyclopedia of Parallel Computing, D. Padua, Ed., Boston, MA: Springer US, 2011, pp. 1539–1545, isbn: 978-0-387-09766-4. doi: 10.1007/978-0-387-09766-4_210. [Online]. Available: https://doi.org/10.1007/978-0-387-09766-4_210 (visited on 08/17/2020).

[24] M. De Wael, S. Marr, B. De Fraine, T. Van Cutsem, and W. De Meuter, “Partitioned Global Address Space Languages,” ACM Computing Surveys, vol. 47, no. 4, 62:1–62:27, May 2015, issn: 0360-0300. doi: 10.1145/2716320. [Online]. Available: http://doi.org/10.1145/2716320 (visited on 09/28/2020).

[25] H. Shan, N. J. Wright, J. Shalf, K. Yelick, M. Wagner, and N. Wichmann, “A preliminary evaluation of the hardware acceleration of the Cray Gemini interconnect for PGAS languages and comparison with MPI,” ACM SIGMETRICS Performance Evaluation Review, vol. 40, no. 2, pp. 92–98, Oct. 2012, issn: 0163-5999. doi: 10.1145/2381056.2381077. [Online]. Available: http://doi.org/10.1145/2381056.2381077 (visited on 08/17/2020).

[26] Xilinx Inc., Vivado Design Suite. [Online]. Available: https://www.xilinx.com/products/design-tools/vivado.html (visited on 08/17/2020).

[27] ARM, AMBA 4 AXI4-Stream Protocol Specification, 2010. [Online]. Available: https://developer.arm.com/documentation/ihi0051/a/ (visited on 08/17/2020).

[28] ——, AMBA AXI and ACE Protocol Specification, 2010. [Online]. Available: https://static.docs.arm.com/ihi0022/g/IHI0022G_amba_axi_protocol_spec.pdf (visited on 08/17/2020).

[29] N. Tarafdar, N. Eskandari, V. Sharma, C. Lo, and P. Chow, “Galapagos: A Full Stack Approach to FPGA Integration in the Cloud,” IEEE Micro, vol. 38, no. 6, pp. 18–24, Nov. 2018, issn: 1937-4143. doi: 10.1109/MM.2018.2877290.

[30] N. Briscoe, “Understanding the OSI 7-layer model,” PC Network Advisor, vol. 120, no. 2, pp. 13–16, Jul. 2000. [Online]. Available: https://www.os3.nl/_media/2014-2015/info/5_osi_model.pdf (visited on 08/17/2020).

[31] J. K. Ousterhout, “Tcl: An embeddable command language,” in 1990 Winter USENIX Conference Proceedings, 1990, pp. 133–146.

[32] Xilinx Inc., Vivado Design Suite Tcl Command Reference Guide (UG835), 2018. [Online]. Available: https://www.xilinx.com/support/documentation/sw_manuals/xilinx2018_1/ug835-vivado-tcl-commands.pdf.

[33] N. Tarafdar and P. Chow, “libGalapagos: A Software Environment for Prototyping and Creating Heterogeneous FPGA and CPU Applications,” in FSP Workshop 2019; Sixth International Workshop on FPGAs for Software Programmers, Barcelona, Spain: VDE, Sep. 2019, pp. 1–7.

[34] K. Parzyszek, J. Nieplocha, and R. Kendall, “A Generalized Portable SHMEM Library for High Performance Computing,” Las Vegas, Nevada, Sep. 2000. [Online]. Available: https://www.osti.gov/servlets/purl/764612.

[35] R. Barriuso and A. Knies, “SHMEM User’s Guide for C,” Cray Research Inc., Tech. Rep., 1994.

[36] B. Chapman et al., “Introducing OpenSHMEM: SHMEM for the PGAS Community,” in Proceedings of the Fourth Conference on Partitioned Global Address Space Programming Model, ser. PGAS ’10, New York, NY, USA: Association for Computing Machinery, 2010, isbn: 978-1-4503-0461-0. doi: 10.1145/2020373.2020375. [Online]. Available: https://doi-org.myaccess.library.utoronto.ca/10.1145/2020373.2020375.

[37] D. Bonachea and P. Hargrove, “GASNet Specification, v1.8.1,” Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States), Tech. Rep., Aug. 2017. doi: 10.2172/1398512. [Online]. Available: https://www.osti.gov/biblio/1398512 (visited on 07/31/2020).

[38] J. Nieplocha and B. Carpenter, “ARMCI: A portable remote memory copy library for distributed array libraries and compiler run-time systems,” in Parallel and Distributed Processing, J. Rolim et al., Eds., ser. Lecture Notes in Computer Science, Berlin, Heidelberg: Springer, 1999, pp. 533–546, isbn: 978-3-540-48932-0. doi: 10.1007/BFb0097937.

[39] T. El-Ghazawi, W. Carlson, T. Sterling, and K. Yelick, UPC: distributed shared memory programming. John Wiley & Sons, 2005, vol. 40.

[40] R. W. Numrich and J. Reid, “Co-array Fortran for parallel programming,” ACM SIGPLAN Fortran Forum, vol. 17, no. 2, pp. 1–31, Aug. 1998, issn: 1061-7264. doi: 10.1145/289918.289920. [Online]. Available: http://doi.org/10.1145/289918.289920 (visited on 08/17/2020).

[41] B. Chamberlain, D. Callahan, and H. Zima, “Parallel Programmability and the Chapel Language,” The International Journal of High Performance Computing Applications, vol. 21, no. 3, pp. 291–312, Aug. 2007, issn: 1094-3420. doi: 10.1177/1094342007078442. [Online]. Available: https://doi.org/10.1177/1094342007078442 (visited on 08/17/2020).

[42] T. von Eicken, D. E. Culler, S. C. Goldstein, and K. E. Schauser, “Active messages: A mechanism for integrated communication and computation,” ACM SIGARCH Computer Architecture News, vol. 20, no. 2, pp. 256–266, Apr. 1992, issn: 0163-5964. doi: 10.1145/146628.140382. [Online]. Available: https://doi.org/10.1145/146628.140382 (visited on 07/31/2020).

[43] A. Mainwaring and D. Culler, “Active Message Applications Programming Interface and Communication Subsystem Organization,” Tech. Rep., 1995.

[44] CMC Microsystems, BEE4, 2018. [Online]. Available: https://account.cmc.ca/en/WhatWeOffer/Test/Prototyping/HighPerformance/BEE4.aspx.

[45] Avnet, ZedBoard. [Online]. Available: http://zedboard.org/product/zedboard (visited on 08/18/2020).

[46] Xilinx Inc., ISE Design Suite, 2020. [Online]. Available: https://www.xilinx.com/products/design-tools/ise-design-suite.html (visited on 08/18/2020).

[47] C. Kohlhoff, Boost.Asio, 2017. [Online]. Available: https://www.boost.org/doc/libs/1_66_0/doc/html/boost_asio.html (visited on 08/18/2020).

[48] Xilinx Inc., AXI DataMover v5.1, 2017. [Online]. Available: https://www.xilinx.com/support/documentation/ip_documentation/axi_datamover/v5_1/pg022_axi_datamover.pdf.

[49] ——, Vivado Simulator. [Online]. Available: https://www.xilinx.com/products/design-tools/vivado/simulator.html (visited on 08/19/2020).

[50] Mentor Graphics, ModelSim. [Online]. Available: https://www.mentor.com/products/fpga/verification-simulation/modelsim/ (visited on 08/19/2020).

[51] P. M. Duvall, S. Matyas, and A. Glover, Continuous Integration: Improving Software Quality and Reducing Risk. Pearson Education, Jun. 2007, isbn: 978-0-321-63014-8.

[52] R. C. Martin, Agile Software Development: Principles, Patterns, and Practices. USA: Prentice Hall PTR, 2003, isbn: 978-0-13-597444-5.

[53] K. Beck, Test-driven Development: By Example. Addison-Wesley Professional, 2003, isbn: 978-0-321-14653-3.

[54] UVM Class Reference Manual 1.2, 2014. [Online]. Available: https://accellera.org/images/downloads/standards/uvm/UVM_Class_Reference_Manual_1.2.pdf (visited on 01/28/2020).

[55] L. Asplund, VUnit, Aug. 2020. [Online]. Available: https://github.com/VUnit/vunit (visited on 01/28/2020).

[56] T. Timisescu, SVUnit, Aug. 2020. [Online]. Available: https://github.com/tudortimi/svunit (visited on 01/28/2020).

[57] W. Snyder, Verilator. [Online]. Available: https://www.veripool.org/wiki/verilator (visited on 08/19/2020).

[58] J. Decaluwe, “MyHDL: A python-based hardware description language,” Linux Journal, vol. 2004, no. 127, p. 5, Nov. 2004, issn: 1075-3583.

[59] Cocotb, Jul. 2020. [Online]. Available: https://github.com/cocotb/cocotb (visited on 02/02/2020).

[60] Xilinx Inc., Vivado Design Suite User Guide: Logic Simulation, 2018. [Online]. Available: https://www.xilinx.com/support/documentation/sw_manuals/xilinx2018_3/ug900-vivado-logic-simulation.pdf.

[61] K. E. Murray, S. Whitty, S. Liu, J. Luu, and V. Betz, “Titan: Enabling large and complex benchmarks in academic CAD,” in 2013 23rd International Conference on Field programmable Logic and Applications, ISSN: 1946-1488, Sep. 2013, pp. 1–8. doi: 10.1109/FPL.2013.6645503.

[62] J. Allen, UMass RCG HDL Benchmark Collection, Sep. 2006. [Online]. Available: http://www.ecs.umass.edu/ece/tessier/rcg/benchmarks/ (visited on 09/07/2020).

[63] Altera Inc., Quartus II Handbook, May 2015. [Online]. Available: https://www.intel.ca/content/dam/www/programmable/us/en/pdfs/literature/hb/qts/quartusii_handbook.pdf (visited on 09/29/2020).

[64] Xilinx Inc., AXI Verification IP, Oct. 2019. [Online]. Available: https://www.xilinx.com/support/documentation/ip_documentation/axi_vip/v1_1/pg267-axi-vip.pdf.

[65] D. Sidler, G. Alonso, M. Blott, K. Karras, K. Vissers, and R. Carley, “Scalable 10Gbps TCP/IP Stack Architecture for Reconfigurable Hardware,” in 2015 IEEE 23rd Annual International Symposium on Field-Programmable Custom Computing Machines, May 2015, pp. 36–43. doi: 10.1109/FCCM.2015.12.

[66] Alpha Data. (2019). ADM-PCIE-8K5, [Online]. Available: https://www.alpha-data.com/pdfs/adm-pcie-8k5.pdf (visited on 07/09/2020).

[67] Xilinx Inc., UltraScale FPGA Product Tables and Product Selection Guide, 2016. [Online]. Available: https://www.xilinx.com/support/documentation/selection-guides/ultrascale-fpga-product-selection-guide.pdf#KU.

[68] Fidus Systems, Sidewinder-100 Datasheet, 2018. [Online]. Available: https://fidus.com/wp-content/uploads/2019/01/Sidewinder_Data_Sheet.pdf.

[69] Xilinx Inc., Zynq UltraScale+ MPSoC Data Sheet: Overview (DS891), 2019. [Online]. Available: https://www.xilinx.com/support/documentation/data_sheets/ds891-zynq-ultrascale-plus-overview.pdf.

[70] ——, Virtex-6 Family Overview (DS150), 2015. [Online]. Available: https://www.xilinx.com/support/documentation/data_sheets/ds150.pdf.

[71] ——, Zynq-7000 SoC Data Sheet: Overview (DS190), 2018. [Online]. Available: https://www.xilinx.com/support/documentation/data_sheets/ds190-Zynq-7000-Overview.pdf.

[72] N. Eskandari, “A Modular Heterogeneous Communication Layer for a Cluster of FPGAs and CPUs,” Thesis, Nov. 2018. [Online]. Available: https://tspace.library.utoronto.ca/handle/1807/91684 (visited on 01/26/2020).

[73] Xilinx Inc., AXI Timer v2.0, 2016. [Online]. Available: https://www.xilinx.com/support/documentation/ip_documentation/axi_timer/v2_0/pg079-axi-timer.pdf.

[74] H. Jin, R. Hood, and P. Mehrotra, “A practical study of UPC using the NAS Parallel Benchmarks,” in Proceedings of the Third Conference on Partitioned Global Address Space Programing Models, ser. PGAS ’09, New York, NY, USA: Association for Computing Machinery, Oct. 2009, pp. 1–7, isbn: 978-1-60558-836-0. doi: 10.1145/1809961.1809973. [Online]. Available: http://doi.org/10.1145/1809961.1809973 (visited on 08/25/2020).

[75] D. Merkel, “Docker: Lightweight Linux containers for consistent development and deployment,” Linux Journal, vol. 2014, no. 239, 2:2, Mar. 2014, issn: 1075-3583.

[76] N. Tarafdar, Galapagos, Aug. 2020. [Online]. Available: https://github.com/UofT-HPRC/galapagos.

Appendix A

Shoal Packet Formats

This appendix describes the different packet formats used in Shoal. Most packet formats start with a common header, shown in Section A.1. There are three classes of packets: network, request and kernel. They are presented in Sections A.2, A.3 and A.4, respectively.

A.1 Universal Header

Most packets (except for kernel packets) share a common header, which can be seen as the first two words in Figure A.1. The first word is made up of the following bit-fields:

• Type: defines the type of AM. Its interpretation is defined below.

• SRC: the numeric ID associated with the source kernel of the message.

• DST: the numeric ID associated with the destination kernel of the message.

• Payload: the size of the payload of this message in bytes.

• H: the numeric ID of the handler function to trigger on the destination kernel.

• Args: the number of arguments for the handler function.

The AM type is indicated through the Type field in the first word of the header. Each bit has significance and may be set (to 1) or not to encode the following meanings. Note that Bit 0 is the least significant bit.

• Bit 0: set if the message is Short or Long Strided.

• Bit 1: set if the message is Medium or Long Vectored.

• Bit 2: set if the message is any type of Long message.

• Bit 3: reserved.

• Bit 4: set in request packets to indicate if the payload data comes from the kernel.

• Bit 5: set if a message is asynchronous so its receipt triggers no reply in the remote kernel.


• Bit 6: this bit serves two purposes. If the message is Short, this bit indicates that it is a reply message. All reply messages in Shoal are Short AMs. If the message is not a Short message, this bit is used to indicate a get request from the remote node as opposed to a put.

• Bit 7: reserved.

The other universal field in the header is the Token. The Token may be used as a message identifier for the receiving kernel. Currently, it serves no critical purpose in Shoal but it may be applied in different use cases in the future.
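For illustration, the helper below packs the first header word using the bit positions read off Figure A.1; the actual widths are defined as macros in config.hpp (Appendix B.6), so treat these constants as assumptions.

#include <cstdint>

// Illustrative sketch of packing the first universal header word, assuming
// Type [7:0], SRC [23:8], DST [39:24], Payload [55:40], H [59:56] and
// Args [63:60] as read off Figure A.1.
static inline uint64_t pack_header(uint8_t type, uint16_t src, uint16_t dst,
                                   uint16_t payload_bytes, uint8_t handler,
                                   uint8_t num_args) {
    return (static_cast<uint64_t>(type)) |
           (static_cast<uint64_t>(src) << 8) |
           (static_cast<uint64_t>(dst) << 24) |
           (static_cast<uint64_t>(payload_bytes) << 40) |
           (static_cast<uint64_t>(handler & 0xF) << 56) |
           (static_cast<uint64_t>(num_args & 0xF) << 60);
}

// Example: a Medium message (Type Bit 1 set) from kernel 0 to kernel 2 with a
// 64-byte payload, handler 0 and no handler arguments:
//   uint64_t hdr = pack_header(0x2, 0, 2, 64, 0, 0);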

A.2 Network Packets

Network packets are the AMs that are sent over the network. Thus, egress packets from a kernel and ingress packets to a kernel take one of these forms. The message types are shown in Figures A.1, A.2 and A.3 for the different AM types.

[Figure A.1 diagram: 64-bit words with field boundaries at bits 0, 8, 24, 40, 48, 56, 60 and 64. The header comprises a word with the Type, SRC, DST, Payload, H and Args fields and a second word holding the Token. A Destination Addr word follows (Long only), then the handler arguments (Args 0 onward) and the payload (Medium and Long only).]

Figure A.1: Short, Medium and Long network packet formats

A.2.1 Short, Medium and Long Messages

Following the header, Long messages add a destination field to indicate the offset address at which to store the payload in the remote kernel. If the number of handler arguments in the header is non-zero, then the handler arguments follow in the packet. Finally, the payload is attached at the end for Medium and Long messages, while Short messages have no payload.

A.2.2 Strided Messages

The header in Strided messages adds additional fields in the second word:

• Stride: stride (or separation) between the blocks in bytes

[Figure A.2 diagram: header word (Type, SRC, DST, Payload, H, Args), second header word (Stride, Block Size, Block Count, Token), Destination Addr word, handler arguments (Args 0 onward) and payload.]

Figure A.2: Strided network packet format

• Block Size: size of each block in bytes

• Block Count: number of blocks

Note that there is no error checking to ensure that the Payload matches the product of the Block Size and Block Count; this is left to the user. The header is followed by the Destination, which indicates the starting address where the first block of the payload will be written. Subsequent blocks are written to addresses computed by advancing the base address by the specified stride. Finally, handler arguments (if any) and the payload complete the Strided message.
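The following sketch shows how a receiver could scatter a Strided payload according to these fields; the function and the flat byte-addressable view of shared memory are assumptions for illustration, not Shoal's implementation.

#include <cstdint>
#include <cstring>

// Illustrative only: write Block Count blocks of Block Size bytes, starting at
// the Destination address and advancing by Stride bytes between blocks. The
// payload itself is contiguous, so consecutive blocks are read back to back.
void scatter_strided(uint8_t *shared_mem, uint64_t dest_addr,
                     const uint8_t *payload, uint32_t stride,
                     uint32_t block_size, uint32_t block_count) {
    for (uint32_t i = 0; i < block_count; i++) {
        std::memcpy(shared_mem + dest_addr + static_cast<uint64_t>(i) * stride,
                    payload + static_cast<uint64_t>(i) * block_size,
                    block_size);
    }
}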

A.2.3 Vectored Messages

The Dst field in the header of Vectored messages specifies the number of data vectors in this message, which must be at least one. Each vector has an associated size (in bytes) and destination address. Note that there is no error checking to ensure that the Payload matches the sum of all vector sizes; this is left to the user. Finally, handler arguments (if any) and the payload complete the Vectored message. The first Size 1 bytes are sent to Destination 1, followed by the next Size 2 bytes to Destination 2, and so on. For future work, the sizes in Vectored packets should change from 12 bits to 16 bits to be in line with the other message types.

A.3 Request Packets

Request packets are sent by kernels to the handler thread (in software) or to the GAScore (in hardware). The network packet is constructed using the request packet and sent out. Generally, the request packets are similar to the network packets so only the differences are highlighted here.

[Figure A.3 diagram: header word (Type, SRC, DST, Payload, H, Args), second word (Dst, Size 1, Token), Destination 1 Addr, further vector metadata (Size 2, Destination 2 Addr, and so on), handler arguments and payload.]

Figure A.3: Vectored network packet format

A.3.1 Short, Medium and Long Messages

Short request packets are sent without modification over the network as network packets. The Medium and Long request packet format is shown in Figure A.4. These messages sent from the kernel may set Bit 4 in the Type field in the header to indicate that their payload also comes from the kernel. In this case, the packet omits the source address and adds the payload. These are Medium FIFO and Long FIFO messages. The other option is to clear Bit 4 and instead pass the Source Address to indicate to the handler thread or GAScore to read the payload from that address and append it to the message when sending it out. These messages are Medium and Long messages.

A.3.2 Strided Messages

A Strided request packet adds two words that are used to specify the source addresses of the data and is shown in Figure A.5. These words are used by the handler thread or the GAScore to get the data for the payload in the Strided network packet. The bit-fields have the same meanings as defined in Section A.2.2. This setup allows data rearrangement. For example, one large contiguous block from the source kernel can be written in strided blocks on the destination kernel.

A.3.3 Vectored Messages

The Vectored request packet adds additional data to specify the source addresses and sizes and is shown in Figure A.6. At least one source vector and one destination vector are required and others are optionally specified after the first destination address. The handler thread or the GAScore will append the data from the source addresses as payload in the network packet.

[Figure A.4 diagram: header word (Type, SRC, DST, Payload, H, Args), Token word, Source Address (only when Bit 4 is not set in Type), Destination (Long only), handler arguments and payload (only when Bit 4 is set in Type).]

Figure A.4: Medium and Long request packet format

[Figure A.5 diagram: header word (Type, SRC, DST, Payload, H, Args), source word (Stride, Block Size and Block Count for the source), Source Addr, destination word (Stride, Block Size and Block Count for the destination, plus the Token), Destination Addr and handler arguments.]

Figure A.5: Strided request packet format

A.4 Kernel Packets

Kernel packets are Medium messages that are forwarded to the kernel from the handler thread (in software) or from the GAScore (in hardware). They consist of a single header word followed by the Medium message’s payload, shown in Figure A.7. This header is very similar to the universal header used in network and request packets except that it replaces the old SRC field with the Token and the old DST field with SRC. The Token can be used by the kernel to identify a particular received message. The exact locations of the bit-fields are arbitrary as long as the kernel and Shoal agree where the different fields are. For future work, this header can be made more similar to the universal header by keeping the SRC field as it is and replacing the DST field with the Token instead.

[Figure A.6 diagram: header word (Type, SRC, DST, Payload, H, Args), second word (Src, Dst, Src Size 1, Dst Size 1, Token), Source 1 Addr, Destination 1 Addr, further source metadata (Src Size 2, Source 2 Addr, and so on), further destination metadata (Dst Size 2, Destination 2 Addr, and so on) and handler arguments.]

Figure A.6: Vectored request packet format

[Figure A.7 diagram: a single header word (Type, Token, SRC, Payload, H, Args at bit offsets 0, 8, 24, 40, 56 and 60) followed by the payload.]

Figure A.7: Kernel packet format

Appendix B

Source Code Documentation

This appendix is intended as low-level documentation for the source code for Shoal [9]. We present the major directories, files and scripts in Shoal to explain some design decisions and workflows. The top level of the repository is organized as follows:

Shoal
  data: Measured data and data parsing scripts (Section B.3)
  GAScore: Source files and tests for the GAScore (Section B.4)
  helper: Helpful utilities and Wireshark dissector (Section B.5)
  include: C++ include directory for Shoal applications (Section B.6)
  repo: Repository of packaged IPs for Vivado (Section B.7)
  src: C++ source directory for Shoal applications (Section B.8)
  tests: Shoal test applications (Section B.9)
  init.sh: Section B.1
  Makefile: Section B.2
  README.md

B.1 Initialization

init.sh is used to initialize the environment for Shoal. There is a similar script for Galapagos that must also be run.

$ source init.sh /path/to/shoal /path/to/Vivado /path/to/Vivado_HLS vivado_version hls_version

The initialization script creates a hidden file (.shoal) in the user’s home directory that is sourced by .bashrc. This file defines some environment variables that Shoal scripts use such as the path to the Shoal repository and the versions of the tools being used. It also defines some Bash functions such as shoal-update-board.

$ shoal-update-board BOARD_NAME


.shoal includes some definitions of board names that name the FPGA part they use. The board name and FPGA part are used in Shoal scripts and in Vivado to create bitstreams for the correct FPGA. The initialization functionality used currently in Shoal has since been superseded by a more automated approach in Sonar 3.0 but those changes are not applied here.

B.2 Makefile

The Makefile in the Shoal directory defines many targets, the most important of which are described here:

• make lib DEBUG=0: The lib target compiles the Shoal software library. The object files are used to create the static library libTHeGASNet.a (this name is a holdover from previous work and should eventually be updated to libShoal.a), which is placed in the build/ directory under Shoal. A library compiled from libGalapagos is also copied over to this directory. The DEBUG argument is set to zero for the optimized library and to one (by default) for the debug version. A Shoal software application must link against this library.

• make galapagos-...: These targets are used to compile or synthesize applications for software and hardware. They are further described in Section B.9.

B.3 data/

This directory holds the measured data used during the evaluation in Shoal and Python scripts to parse the data, create summary texts and generate graphs. build/ is created here to hold all the generated data.

B.3.1 Shoal Microbenchmarks

Files of the format ubench_{topology}_{test_type}_{communication}_{iterations}.txt contain the Shoal microbenchmark data used in Section 5.3.2. This data was collected by running the Benchmark IP in tests/ in different configurations. Some bitstreams and memory configuration files for these tests are saved here as well for the 8K5 board. The instructions of how to use the bitstreams and memory configuration files are provided in Section B.9.1. Some microbenchmarks are in two variants where one is marked as “optimized”. This distinction is made to mark the tests run after some optimizations were performed in software and hardware. In software, the primary optimization was a switch to using packet-based communication instead of flit-based. For hardware, the HLS code for GAScore IPs was restructured and HLS pragmas were added to improve performance. Originally, we were interested in comparing how the performance improves in the optimized case to show the effects of packet-based communication and/or HLS optimizations. Due to the extensive time required to run tests in many topologies, this plan was abandoned and only the optimized data was considered. This data is analyzed with two Python scripts:

• benchmark.py: Main parsing script for the microbenchmarks. Use the optional flags to control its behavior.

• cross_analyze.py: Compares the Shoal microbenchmarks against the libGalapagos microbenchmarks.

B.3.2 libGalapagos Microbenchmarks

Files of the format libGalapagos_{configuration}.txt contain the libGalapagos microbenchmark data used in the evaluation. It is collected through tests written in the Galapagos repository [76] based on the original libGalapagos unit tests. This data is analyzed with libGalapagos.py. An additional set of libGalapagos tests was the “detail” tests. These tests track the times at which the different reads and writes were performed during a libGalapagos benchmark, as opposed to the aggregate data computed normally. A similar test was also performed for some initial Shoal microbenchmarks to debug some slowdowns. Both sets of data can be parsed with benchmark_detail.py but this data is not used in this work.

B.3.3 Jacobi

The data from the hardware and software Jacobi tests are in jacobi_shoal_{hw/sw}.txt. It is collected through the Jacobi application and its usage is described in Section B.9.2. This data is analyzed with extract_jacobi.py.

B.3.4 Other Scripts

This directory also contains some other scripts, some of which are deprecated:

• export.sh: The source files for this thesis were also locally available on the same machine as the Shoal repository. This Bash script copies some of the relevant graphs generated by the Python scripts over to the thesis repository for use in the thesis. While not all of these figures were eventually used in the final document, the script does highlight the most important sets of figures.

• extract_data.py: This script serves two purposes and is controllable through command-line arguments. First, given a path to a Vivado HLS project, it extracts the synthesis report for all solutions in the project and creates a CSV file (saved with the date and the current Git commit tag of the repository) that has the performance and hardware usage estimates from HLS. The second capability of this script is that it can compare two such CSV files that have been extracted by the first method. This comparison only highlights the differences and shows how metrics like latency and used LUTs changed between the two versions for the same IPs. If new IPs are added or removed, they are marked as such. The goal of this script was to guide HLS optimization by showing the effects of different optimizations.

• extract_utilization.py: Given the full utilization report from an implemented Vivado design, this script extracts the usage of a particular hierarchy of the design, computes the percent utilization of modules relative to their parent module and prints the results as the body of a LaTeX table. It is configured through global constants in the body of the script.

• GAScore_latency.py: Using a similarly methodical approach to extract_data.py, this script was an attempt to rigorously measure latency in the GAScore. The Sonar testbenches were modified to add a timestamp after every action. Then, this script could analyze the data and use it to compute the number of cycles taken to complete the tasks in the testbenches. Unfortunately, this granularity was difficult to manage. Each test vector needed its own script to parse the timings and interpret what the times represented. This interpretation was also fragile, as one small change in the testbench would break it. While it should be functional, this script was abandoned after parsing two test vectors.

• timing.py: This script was used to run the Benchmark microbenchmark in software by repeating it a configurable number of times for averaging. It is deprecated.

B.4 GAScore/

This directory contains most of the source code to package the GAScore IP. It is organized as follows:

GAScore
  include: C++ header files
  src: C++ and Verilog source files
  testbench: Sonar testbenches
  vivado: Scripts to make Vivado projects for testing and packaging IPs
  vivado_hls: Scripts to make Vivado HLS projects
  Makefile
  run.sh

The Makefile in the GAScore directory defines targets to synthesize, test, and package the IPs that make up the GAScore. Targets often have a base name followed by a hyphenated module name, which must be one of the names listed in the Makefile under the c_modules, hdl_modules, or custom_modules variables. Calling these targets without the module name recursively calls the target for all appropriate modules. Some of the most important Make targets are:

1. make config-{module_name}: Makes the Sonar testbench for the particular module.

2. make hw-{module_name}: Synthesizes the particular HLS IP and adds it to the local IP repository.

Other targets like make sim and make package are more complex to run directly and are instead run through run.sh, which simplifies calling these targets. For make sim, the Bash script provides a shorthand notation to pass the arguments that the target needs. The target creates a Vivado project for the particular IP, adds the Sonar testbench and can optionally simulate it, based on the arguments provided to the Bash script. Packaging the GAScore from scratch—if packaging the IP for a new FPGA for example—can be done by calling:

$ ./run.sh export 1 1

The two numerical arguments specify that all the HLS IPs should be remade and that the Vivado project for the GAScore should be created from scratch. For stability, this second argument should always be 1 though the first can be 0 if the HLS IPs are already up-to-date.

B.5 helper/

This directory contains some short programs and scripts for small tasks.

• eval_macro.cpp: Shoal uses a great deal of macros in C++ to define sizes of various variables and bit-fields. These sizes are also used in Python scripts, especially in the Sonar testbenches to help construct Shoal packets for simulation. This program was intended as a way for Python to learn the values of numerical macros from C++ source code. This file would be compiled using share.py and the C++ executable would return the macro value back to Python. Ultimately, this script was never integrated into Shoal and the widths of Shoal bit-fields in the Sonar testbenches are hard coded.

• parse_header.cpp: This application is compiled using the Makefile. It accepts a command-line argument of a 64-bit number and parses it as a Shoal header and returns the bit-fields.

• parse_packet.py: Interpreting network packets in Vivado simulation is difficult. This script requires the user to copy out each word of the network packet in standard little-endian ordering and concatenate it as one string. Then, the script byte-reverses it into network order and prints it out so that it may be entered into an online network packet parser to parse the Ethernet and IP headers.

• parse_wireshark.py: This script parses the JSON output of a captured Wireshark session. Due to the potentially large size of the JSON file, the file is read iteratively. This script was used to search for missing packets in a test.

• print_macro.cpp: Similar to eval_macro.cpp, this file simply prints the value of a macro to the terminal as a way to check the final resolved value of a complex macro.

• Wireshark dissector: Dissectors in Wireshark are used to analyze packets. For example, the “echo” dissector parses all TCP packets using port 7 as ECHO packets and marks them in the Wireshark GUI. Users can also define their own dissectors through C++ or Lua. Lua dissectors are slower but much easier to integrate into Wireshark. Here, the C++ dissector was done first as a way to make analyzing Shoal packets in Wireshark easier. It was eventually replaced by the Lua dissector, which also parses the Galapagos header in the network data. While functional enough, the dissector is not robust enough to handle every case such as when multiple Shoal packets are concatenated into one TCP packet.

B.6 include/

There are many header files in Shoal, each of which could merit discussion. In the interest of brevity, only some comments about these files are presented here:

• am_gasnet.hpp: Defines the PGAS_METHOD and DECLARE_METHOD macros that are used in Shoal applications for the function wrapping for software kernels. The second argument of the PGAS_METHOD macro is no longer used but is left in the macro definition for backward-compatibility.

• config.hpp: Defines the bit-fields and widths as macros, along with some helpful macros derived from these widths.

• hls_types.hpp: Provides the typedefs for the different bit-fields. The data types used here must be compatible with the widths used in config.hpp.
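As a small illustration of this compatibility requirement, a width macro and its matching typedef might relate as below; the names are invented and are not the ones used in config.hpp or hls_types.hpp.

#include "ap_int.h"

// Hypothetical example: each typedef must be declared with the same width
// macro that the software side uses when constructing packets.
#define SHOAL_SRC_WIDTH 16
#define SHOAL_DST_WIDTH 16

typedef ap_uint<SHOAL_SRC_WIDTH> src_t;
typedef ap_uint<SHOAL_DST_WIDTH> dst_t;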

B.7 repo/

This directory holds the packaged GAScore and any test IPs built in Shoal. The IPs are sorted by Vivado version and FPGA family during creation so multiple versions of the GAScore can exist simultaneously. Vivado projects must include the appropriate directory as one of the IP repositories to add the GAScore to the IP Catalog.

B.8 src/

This directory contains the source files for the Shoal API. Most files have their corresponding header files in the include/ directory. Once again, in the interest of brevity, only some commentary about these files is included here:

• active_messages_x86.cpp: As its name suggests, this file defines the active message functions for processors (the x86 in the name is a misnomer, as the file should be compatible with ARM as well, though that has not been tested). In particular, one caveat must be made about the _writeWord() function. This function is eventually used in all flit writes in Shoal kernels running on a processor and it defines a constant value for the TID of all written flits. While the TID is largely ignored in software, it is of critical importance for the hardware TCP Network Bridge. The TID value in this function serves the same purpose as the node ID for the Bridge: it identifies the node ID of the source kernel. The ID is therefore application-dependent and should be set for each node separately. Currently, this value is set in the Shoal library itself. This implementation means that, for a topology that includes multiple software nodes and at least one hardware node, each software node needs to be linked against a different version of libTHeGASNet.a in which the TID set in this function matches the node ID of that particular software node.

• am_gasnet.cpp: Defines the handler thread that wraps each software kernel in Shoal.

B.9 tests/

There are three active applications: Benchmark, Jacobi, and stencil_ctrl. These tests are located in this directory but are made using targets in the main Shoal Makefile. The general syntax of these make commands is similar and some sample commands to make applications are:

$ make galapagos-benchmark K_START=0 K_END=2 MODE=x86
$ make galapagos-jacobi K_START=0 K_END=1 KERN_BUILD=1 MODE=x86 DEBUG=1
$ make galapagos-stencil_ctrl MODE=HLS KERNEL=benchmark

The K_START and K_END define the inclusive range of kernels that are on the software node. For each kernel, an additional --wrap flag is needed for the linker and the Makefile adds it using these arguments. Another assumption made during this linking process, and in the helpful macros in am_gasnet.hpp, is that the kernel function names are of the form kernX, where X is a number in the range between K_START and K_END. The KERN_BUILD argument is used here to switch between different topologies. The interpretation of this value is application-specific. For example, in Benchmark, omitting this value implies a value of -1, which adds all three kernels in software.

Setting it to a value of 0 only adds kernel 0 in software while the other two kernels are assumed to be in hardware. The MODE argument is used to switch between compiling the kernel for software and synthesizing it through HLS. For HLS kernels, the make command takes the KERNEL argument that names the function that should serve as the top-level of the HLS IP. Finally, these commands can all optionally set the DEBUG argument to 1 to set the kernel in debug mode. This mode is only relevant for software kernels and its value must match the value used to make libTHeGASNet.a with make lib. One caveat with compiling software kernels and switching between DEBUG modes is that the Makefile does not recognize that object files need to be recompiled when only a Make argument has changed. For correct usage, the object files need to be manually deleted so that they are remade with the specified DEBUG argument.

Another quirk of the wrapping process is that the kernel definition and the main node definition need to be in separate source files. Since the kernel function is wrapped, it needs to exist as an unresolved function call when linking the application. If the kernel function is available when the Shoal node is compiled, it is compiled into the object file at this step and then the linker fails to wrap it (a generic illustration of the linker’s wrapping mechanism is sketched at the end of this section).

This directory also contains an example map and logical configuration file for Galapagos. The syntax for these files is consistent with normal Galapagos configuration files. However, in map.json, we have added the “custom” field for the hardware kernel that calls GAScore.tcl during the Galapagos flow to add the GAScore into the Application Region. Currently, GAScore.tcl is in the Galapagos repository [76]. It supports adding one hardware kernel in the purely automated flow. The only user adjustment that must be made is setting the name of the kernel IP instance at the top of this script. In the future, this information should be exported from the previous Galapagos middleware flow and used automatically without user intervention. The script can also be made more robust to handle adding multiple kernels.
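For reference, the sketch below illustrates the generic GNU ld --wrap mechanism that this process relies on; the wrapper body and the argument-free kernel signature are hypothetical and simplified, not Shoal's actual code.

// Linking with, for example,
//   g++ node.o kern0.o -Wl,--wrap=kern0 ...
// redirects calls to kern0() made from node.o to __wrap_kern0(), while
// __real_kern0() resolves to the original definition compiled in kern0.o.
// extern "C" keeps the symbol names unmangled for the linker.

extern "C" void __real_kern0(); // the user's kernel, defined in its own file
extern "C" void kern0();        // referenced, but not defined, by the node code

extern "C" void __wrap_kern0() {
    // Set up any per-kernel state or communication context here (illustrative).
    __real_kern0();
    // Tear down or signal completion here (illustrative).
}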

B.9.1 Benchmark

The Benchmark IP is a custom processor and as such needs to read instructions to operate. These instructions are written using benchmark_data.py. It uses the same instruction enumerations as in the Benchmark’s source files to identify instructions. Based on the command-line arguments, it can write any subset of the instructions required and change the number of times to run each test. The instructions are written in three formats. The software version of the Benchmark IP reads benchmark_X_sw.mem while hardware kernels need benchmark_X.coe for manual loading in the Vivado GUI and benchmark_x.mem for loading the instructions into the bitstream. Adding the memory file into the bitstreams is accomplished through the following steps:

1. Open the Vivado project with the kernel and open the implemented design.

2. Source the write_mmi.tcl script from this directory and run it. This script defines one Tcl command called write_mmi that accepts a single argument naming the Xilinx Block Memory Generator that serves as the instruction memory. In this work, the name “instr_blk_mem” was used. This name (suffixed appropriately if needed) should be reserved only for these instruction memories to avoid confusion.

3. Running the script generates an MMI file in the working directory of Vivado. This MMI file needs some manual modification:

(a) Change the “InstPath” key to be “dummy”. Normally, MMI files are used in the context of Microblaze soft processors and this field is auto-populated using the Microblaze that exists in the Galapagos Shell. The value used does not have to be “dummy” but it is a convenient one.

(b) Change the memory “End” keys from -1 to 4095 (there should be at least two locations). This process assumes that the instruction memory is using 32-bit words and can therefore hold 1024 words. These values are important as they ensure that the memory can fit within one BRAM. For memories that span multiple BRAMs, the data is split in some way, which makes this process more difficult. This restriction also enforces a strict instruction limit of 1024.

(c) Change “RAMB36E2” to “RAMB36”.

4. Call the updatemem utility and pass in the MMI file, the MEM file containing the instructions, the bitstream produced from the implementation results used to generate the MMI file, and the output bitstream file name. The --proc argument should be “dummy”. If there are multiple Benchmark kernels, there will be more than one address space in the MMI file, assuming the instruction memories are named with the same pattern. In this case, the user must selectively comment out all address spaces save one and update the bitstream one BRAM at a time.

B.9.2 Jacobi

An example of how to compile the software Jacobi application is in run_jacobi.sh. This script can be used to sweep through different configurations and save all the data. While jacobi_comp.cpp can be run through HLS so that it also runs on the FPGA, the HLS implementation in this case is about 50x worse than software. Instead, we use the stencil_ctrl.cpp application as an HLS controller for the stencil_core VHDL code written by Willenberg. This core, along with a Verilog wrapper, is provided in the jacobi directory here. There are two variants of the stencil_core, where one has the suffix “other” (admittedly, a poorly chosen name). This file is the stencil core that directly communicates with Willenberg’s GAScore, while the file without the suffix is intended to communicate with the PAMS module. In this work, we use this latter version of the core and our wrapper, stencil_wrapper.v, provides the interface between the core and the Shoal GAScore.
