An Automated Flow to Generate Hardware Computing Nodes for an FPGA-Based MPI Computing Network

by D.Y. Wang

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF BACHELOR OF APPLIED SCIENCE

DIVISION OF ENGINEERING SCIENCE

FACULTY OF APPLIED SCIENCE AND ENGINEERING UNIVERSITY OF TORONTO

Supervisor: Paul Chow

April 2008

Abstract

Recently there have been initiatives from both industry and academia to explore the use of FPGA-based application-specific hardware in high-performance computing platforms, as traditional platforms based on clusters of generic CPUs fail to scale to meet the growing demands of computation-intensive applications due to limitations in power consumption and costs. Research has shown that a heterogeneous system built exclusively on FPGAs, using a combination of different types of computing nodes including embedded processors and application-specific hardware accelerators, is a scalable way to use FPGAs for high-performance computing. An example of such a system is the TMD [11], which also uses a message-passing network to connect the computing nodes. However, the difficulty of efficiently designing high-speed hardware modules from software descriptions is preventing FPGA-based systems from being widely adopted by software developers. In this project, an automated tool flow is proposed to fill this gap. The AUTO flow is developed to automatically generate, from a C program, a hardware computing node that can be used directly in the TMD system. As an example application, a Jacobi heat-equation solver is implemented in a TMD system where a soft processor is replaced by a hardware computing node generated using the AUTO flow. The AUTO-generated hardware module shows equivalent functionality and some improvement in performance over the soft processor. The AUTO flow demonstrates the feasibility of incorporating automatic hardware generation into the design flow of FPGA-based systems so that such systems can become more accessible to software developers.

Acknowledgment

I acknowledge Synfora for hardware, tools and technical support, and my supervisor, Professor Paul Chow, for his guidance, patience, and insights, all of which were invaluable to the completion of this project. Thanks to Chris Madill and Arun Patel for their help in setting up the development environment, and to Manuel Saldaña for help with the MPE network and scripts, and for patiently answering all my questions during the many unscheduled drop-by visits. Also many thanks to Henry Wong for discussions, suggestions and debugging tips, and to Ryan Fung for proofreading the final report. Finally, I would like to thank my mother for her love and support as always.

Contents

1 Introduction

2 Related Work
  2.1 FPGA-Based Computing
  2.2 The TMD-MPI Approach
  2.3 Behavioral Synthesis

3 System Setup
  3.1 TMD Platform Architecture
  3.2 Design Flow
  3.3 C-to-HDL Using PICO

4 Implementation of the Tool Flow
  4.1 Flow Overview
  4.2 MPI Library Implementation
  4.3 Control Block
  4.4 Scripts
    4.4.1 Preprocessing Script
    4.4.2 Packaging Script
  4.5 Test Generation
  4.6 Limitations
    4.6.1 Floating-Point Support
    4.6.2 Looping Structure
    4.6.3 Pointer Support
    4.6.4 Division Support
    4.6.5 Performance Specification
    4.6.6 Hardware Debugging
    4.6.7 Exploitable Parallelism

5 The Heat-Equation Application
  5.1 Implementation
  5.2 Experiment Methodology
  5.3 Results

6 Conclusion and Future Work

Appendix

A Hardware Controller for PICO PPA
  A.1 Control FSM
  A.2 Stream Interface Translation

B Using PICO: Tips and Workarounds
  B.1 Stream Ordering
  B.2 Improving Performance

Bibliography

Glossary

The glossary contains the acronyms that are used in this report.

• CAD – Computer Aided Design

• CPE – Cycles Per Element

• DCM – Digital Clock Manager

• FSL – Fast Simplex Link. Xilinx’s FIFO stream IP block.

• FSM – Finite State Machine

• HDL – Hardware Description Language

• HPC – High-Performance Computing

• IP – Internet Protocol

• IP – Intellectual Property

• MHS – Microprocessor Hardware Specification

• MSS – Microprocessor Software Specification

• MPI – Message Passing Interface

• MPE – Message Passing Engine. Provides MPI functionality to hardware accelerators in a TMD system.

• NetIf – Network Interface used in the TMD network

• PICO – Program-In Chip-Out. An algorithmic synthesis tool from Synfora, Inc.

• PPA – Pipeline of Processing Arrays. The top-level hardware block generated from a function by the PICO flow.

• TCAB – Tightly Coupled Accelerator Block. A hardware module generated by PICO from a C procedure that can be used as a black box when generating a higher-level hardware block.

• TMD – Originally the Toronto Molecular Dynamics machine; now refers to the exclusively FPGA-based HPC platform developed at the University of Toronto.

• VLSI – Very Large Scale Integrated Circuit

• XPS – Xilinx Platform Studio. Xilinx’s embedded processor system design tool.

• XST – Xilinx Synthesis Technology. Xilinx’s synthesis tool.

• XUP – Xilinx University Program

List of Figures

3.1 TMD platform architecture ([13])
3.2 Network configuration for different node types ([13])
3.3 TMD design flow ([13])
3.4 PICO design flow ([15], p.5)

4.1 The AUTO tool flow
4.2 Stream operations required to implement MPI behaviour
4.3 TMD system testbed

5.1 A simple two-node TMD implementation of a Jacobi heat-equation solver
5.2 Main loop execution time per element with different iteration lengths

A.1 Design of the control block
A.2 State transition diagram of the control block FSM
A.3 The PICO stream interface
A.4 The FSL bus interface

List of Tables

4.1 Implemented families of MPI functions

5.1 Normalized computing power of the reference and test systems

A.1 I/O ports exported by the control block
A.2 Raw control ports on the PPA module

Chapter 1

Introduction

Much of today's scientific research relies heavily on numerical computations and demands high performance. Computational fluid dynamics, molecular simulation, finite-element structural analysis and financial trading algorithms are examples of computation-intensive applications that would not have been possible without the advances in computing infrastructure. Since the 1960s, generations of supercomputers have been built to address the growing needs of the scientific community for more computing power. With the improved performance and availability of microprocessors, clusters of conventional CPUs connected in a network using commercially available interconnects became the dominant architecture used to build modern supercomputers. As of November 2007, 409 of the top 500 supercomputers were cluster based [18]. However, as the computing throughput requirements of new applications continue to increase, supercomputers based on clusters of generic CPUs become increasingly limited by power budgets and escalating costs and cannot scale further to keep up with the demand. As a result, specialized hardware accelerators have become popular. In recent years there have been significant developments in both GPU-based and FPGA-based computing models. While GPUs have demonstrated remarkable performance improvements in highly data-parallel stream-based applications [1], FPGAs, with the flexibility they offer, are good candidates for specialized hardware acceleration systems. In order to leverage FPGAs in high-performance computing systems, hardware

accelerators need to be built from software specifications. The primary challenge is that hardware design is intricate, and software developers typically do not have the expertise to design high-performance hardware. On the other hand, having both software and hardware designers working on the same project is costly and inefficient. As a result, hardware acceleration has not been adopted widely among software developers. A tool flow that allows software designers to easily harness the power of hardware acceleration is hence essential to make hardware acceleration feasible in non-high-end applications. To address this need, we present in this project an automated tool flow, AUTO, that generates a hardware accelerator directly from a C program. This work builds on previous work on TMD, a scalable multi-FPGA high-performance computing system [11] that consists of a collection of computing nodes, where each node can be a soft processor or a hardware engine. The TMD uses the TMD-MPI message-passing programming model [12] for inter-node communication. The AUTO flow takes an MPI program written in C as input and produces a hardware computing node that can be used directly in the TMD system. As a proof-of-concept prototype, the main objective of this project is to explore the possibility of algorithmic synthesis targeting an FPGA-based system, with a focus on the feasibility of an automated tool flow. A Jacobi heat-equation solver is implemented on TMD as an example application to demonstrate the functionality of the AUTO flow. With little designer intervention, we are able to automatically generate a functional hardware block that performs better than the soft processor node it replaces. Our eventual goal is to completely automate a design flow that generates a system of hardware accelerators from a parallel program, as opposed to a single hardware computing node at a time. The rest of the report is organized as follows.
Chapter 2 reviews existing research work in FPGA-based computing and algorithmic synthesis, which provides context to our work. Chapter 3 describes the TMD platform, the TMD-MPI design flow and AUTO’s role in it. Chapter 4 explains the implementation of the AUTO flow. The limitations of the implementation are also outlined. In Chapter 5, a sample

application is presented with some performance results. Finally, in Chapter 6, we summarize our findings and give suggestions for future work.

Chapter 2

Related Work

Recent research has shown that FPGA-based high-performance computing models have the potential to speed up certain computing tasks significantly using application-specific hardware acceleration. The disadvantage is that this sacrifices the generality offered by CPUs. This is remedied by the reconfigurability of FPGAs, which allows them to be reprogrammed for different computing tasks. Consequently, the success of FPGA-based systems hinges on an efficient underlying computing infrastructure and a flexible design flow, the latter of which the AUTO flow tries to address. This section presents an overview of existing research in the areas of FPGA-based computing and algorithmic synthesis. It provides context for our work on the AUTO flow.

2.1 FPGA-Based Computing

With power consumption becoming an increasingly critical design constraint for high-performance computing systems, many vendors of traditional cluster-based systems have started to incorporate hardware acceleration using FPGAs. Examples include the Cray XD1 [5] and the HP ProLiant DL145 servers using Celoxica's RCHTX FPGA acceleration boards [3]. These systems use FPGAs as coprocessors to exploit fine-grained parallelism in algorithms to improve overall performance. The fork-join control flow is most natural to these systems due to the master-slave relationship between the processor and the FPGA. The master processor is responsible for the coordination and synchronization among the computing slaves. It executes instructions sequentially, and opportunistically farms out computation-intensive tasks to hardware accelerators on the FPGA coprocessors. Since the inherently parallel hardware structures on the FPGA are controlled by the sequential processor, maximizing efficiency to amortize the overhead of data transfer and synchronization requires significant effort from both hardware and software designers.

Starting from the 90nm process node, FPGAs have been built with high enough density and speed to make them possible contenders for high-performance computing platforms. The BEE/BEE2 system [4], TMD [11] and the one presented in [2] are examples of high-performance computing platforms built exclusively on FPGA-based technologies. Some of these systems use only application-specific hardware modules; others use soft processors that are embedded in the FPGA fabric and provide low-latency communication channels between the embedded processors and the hardware computing nodes. The latter heterogeneous architecture enables more efficient use of on-chip resources to improve the throughput-to-area ratio under a given power constraint for specific applications. The advantage of using FPGAs to build high-performance computing platforms is that the application-specific portions can be easily reconfigured to suit the needs of a variety of applications.

2.2 The TMD-MPI Approach

The Toronto Molecular Dynamics (TMD) system is a scalable multi-FPGA high-performance computing platform developed at the University of Toronto. The original motivation for the system was to address the increasing demand for computing power in molecular dynamics simulation. However, as the platform developed, it was no longer limited to molecular dynamics. Today it is a testbed for FPGA-based high-performance computing systems and design flows for such systems. As mentioned earlier, the TMD is a heterogeneous system that consists of computing nodes, which can be soft processors or application-specific hardware modules (hardware engines). A TMD system uses a distributed-memory architecture where

each computing node has its own memory and address space. This architecture is simple and scalable for highly parallel systems since it does not need to consider memory coherency issues or memory bus congestion, which would be critical for a shared-memory system. For distributed-memory systems, message passing has been proven to be an efficient programming model. The de facto message-passing API used in the high-performance computing community is the Message Passing Interface (MPI) [10]. The MPI API provides a generic platform-independent interface by specifying only the functionality and syntax of the interface; the actual implementation depends completely on the host platform. MPICH is a popular C implementation of MPI for computer clusters using Linux or Windows [6]. TMD-MPI is a lightweight subset of MPI designed for embedded systems on the TMD. It contains two components: a software library for use with soft processors, and the Message-Passing Engine (MPE), which is a hardware implementation of the MPI API that can be used with hardware engines [12]. The software component does not require an operating system and has a very small memory footprint. With TMD-MPI, C programs written using MPI can be ported to embedded processors in a TMD system with minimal modification [13]. TMD-MPI provides an abstraction of inter-node communication so that software developers do not need to be aware of the details of the communication infrastructure. For the TMD, the underlying network is realized using point-to-point unidirectional links implemented as FIFOs. Each hardware node is connected to a dedicated MPE, which in turn connects to the rest of the network. Soft processors can connect directly to the network or through an MPE; both are supported in the TMD-MPI implementation. When the MPE is used, the type of the computing node behind the MPE is hidden from the rest of the system.
The benefit of such a modular network setup is twofold: multi-threaded programs developed for a computer cluster can be easily ported to the TMD by instantiating a soft processor for each thread of the cluster version, and soft processors hosting computation-intensive programs can then be identified and replaced by hardware engines without the rest of the system noticing. If the generation of the hardware engines can be automated, this TMD architecture will enable software developers to easily leverage hardware acceleration. This is the motivation

behind the AUTO flow.
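The point-to-point FIFO links that underlie this network can be pictured with a small software model. The sketch below is illustrative only: fsl_t, fsl_write and fsl_read are invented names standing in for the behaviour of Xilinx's FSL FIFOs (32-bit words, 16 entries deep), not a real API.

```c
/* A toy software model of one TMD tier-1 link: a unidirectional FIFO,
 * as the FSLs provide in hardware (32-bit words, 16 entries deep).
 * fsl_t, fsl_write and fsl_read are invented names, not Xilinx's API. */
#include <assert.h>

#define FSL_DEPTH 16

typedef struct {
    unsigned data[FSL_DEPTH];
    int head, tail, count;
} fsl_t;

/* non-blocking write: 0 on success, -1 if the FIFO is full */
static int fsl_write(fsl_t *f, unsigned word)
{
    if (f->count == FSL_DEPTH) return -1;
    f->data[f->tail] = word;
    f->tail = (f->tail + 1) % FSL_DEPTH;
    f->count++;
    return 0;
}

/* non-blocking read: 0 on success, -1 if the FIFO is empty */
static int fsl_read(fsl_t *f, unsigned *word)
{
    if (f->count == 0) return -1;
    *word = f->data[f->head];
    f->head = (f->head + 1) % FSL_DEPTH;
    f->count--;
    return 0;
}
```

Two such channels, one per direction, model the FSL pair used between each connected pair of network endpoints.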

2.3 Behavioral Synthesis

Designing a hardware accelerator from a software description often requires hardware designers to work closely with software designers in order to arrive at an efficient design. The engineering effort required often deters software developers from using hardware acceleration. In order to make FPGA-based high-performance computing systems more accessible to software developers, the conversion of software into hardware needs to be automated. This is the field of behavioral synthesis, sometimes also known as high-level synthesis. It refers to the generation of a logic circuit, often in the form of a Hardware Description Language (HDL) such as Verilog or VHDL, from high-level functional descriptions of the desired system. Behavioral synthesis is not a new problem. Exponential growth in the number of transistors in integrated circuits has led to increased complexity of VLSI systems and of the engineering effort required to design them. As a result, a great deal of research effort has been spent in the past three decades on developing high-performing and robust CAD tools that create logic circuits based on a description of the desired functionality of the circuit, which is often specified in a high-level software language such as C/C++ or Matlab. The challenge in behavioral synthesis comes from the inherent difference between the software and hardware design paradigms. A software developer is more familiar with a data-centric view, where a program is seen as a sequence of tasks performed on a set of data. On the other hand, the hardware designer uses a time-centric view and thinks about the hardware resources used in each clock cycle [16]. The data-centric view does not easily expose the parallelism contained in the algorithm.
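As a concrete example of this gap, the loop below is written in the data-centric style: it states what happens to each element and says nothing about clock cycles. Because no iteration depends on another, a synthesis tool is free to pipeline or replicate the datapath, which is precisely the concurrency the sequential view leaves implicit. The function name and constants are illustrative only.

```c
/* A data-centric loop: it describes what happens to each element, with
 * no notion of clock cycles. No iteration depends on another, so a
 * behavioral synthesis tool may pipeline or replicate the datapath. */
#include <assert.h>

static void scale_offset(const int in[], int out[], int n)
{
    for (int i = 0; i < n; i++)
        out[i] = 3 * in[i] + 7;   /* no loop-carried dependence */
}
```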
A behavioral synthesis tool needs to understand the data-centric view described by the software, and then schedule the operations, allocate hardware resources, and generate the necessary control logic to provide the functionality of the software, while exploiting concurrency in the algorithm. This process involves many optimization decisions and

trade-offs that cannot be easily automated. Recently there have been some commercial realizations of behavioral synthesis tools for applications in specific domains. Impulse C [7], Handel C [3], Catapult C [9] and PICO [17] are a few examples. Impulse C is a subset of ANSI C that can be given to an Impulse C compiler to generate HDL output. It allows an application to be partitioned into software processes and hardware processes. A C-compatible library is supplied to support a parallel, stream-based programming model. The library functions are used to facilitate stream operations, such as open, close, read, and write, as well as communication of control messages between the software and hardware processes. The compiler generates hardware from Impulse C library functions and other C statements. Handel C is very similar to Impulse C. It also has library functions to support floating-point operations. Catapult C is based on ANSI C++ instead of C. All three perform best with data-oriented stream-based parallel applications. By making use of non-standard extensions to specify parallelism, the compiler can be better guided to produce more efficient hardware. However, the disadvantage is that programs ported to one of these languages can no longer be compiled or debugged using standard C compilers or debuggers. The Synfora PICO (Program-In Chip-Out) flow is slightly different from the three other tools mentioned above. Instead of non-standard extensions, it uses pragmas to specify parameters related to hardware generation. Because a standard C compiler ignores these pragmas, it can still compile a C program prepared for PICO synthesis. The minimal deviation between the original C version and the PICO-compliant version of the software makes PICO a suitable tool for this project.
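To illustrate why the pragma approach keeps code portable, consider the sketch below. The pragma name is invented, not a real PICO directive; the point is that a standard C compiler skips unrecognized pragmas, so the same source still compiles and runs as ordinary software.

```c
/* PICO-style guidance is given through pragmas rather than language
 * extensions. The pragma below is invented for illustration; a
 * standard C compiler skips unrecognized pragmas, so this file still
 * compiles and runs as plain software. */
#include <assert.h>

#pragma hypothetical_synthesis_tool unroll_factor(2)   /* ignored by a C compiler */

/* integer-only kernel, in the pointer- and float-free style PICO expects */
static int saxpy_int(int a, const int x[], const int y[], int out[], int n)
{
    for (int i = 0; i < n; i++)
        out[i] = a * x[i] + y[i];
    return n;
}
```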

Chapter 3

System Setup

The AUTO flow has been developed to address the need for an automated CAD flow to convert a soft-processor computing node in a TMD system to an equivalent hardware computing node. This section describes the architecture of the TMD testbed used for this project and the design flow, which incorporates AUTO.

3.1 TMD Platform Architecture

The TMD system can contain multiple FPGAs connected in a 3-tier network as shown in Figure 3.1. Tier 1 is the on-chip network used for intra-FPGA communication among computing nodes located on a single FPGA. This network is based on unidirectional point-to-point links, which are implemented using Xilinx FSLs [21]. Each FSL is a 32-bit wide, 16-word deep FIFO. Network interface blocks (NetIfs) are used in the tier-1 network to route packets from source to destination through a multi-hop path based on a unique ID of each node, called the rank. The routing table is contained in the NetIf. Each FPGA also contains several gateway nodes, one for each neighbouring FPGA. The NetIfs forward packets whose destination ranks reside outside of the current FPGA to the appropriate gateway node. The tier-2 network consists of several FPGAs in a cluster placed on the same printed circuit board. High-speed serial I/O links, such as the RocketIO MGT (Multi-Gigabit Transceiver), are used for inter-FPGA communication. There are no network routing components

in this tier.

Figure 3.1: TMD platform architecture ([13])

The tier-3 network facilitates communication between FPGA clusters and can be implemented using standardized high-speed switches such as InfiniBand switches [8]. This 3-tier system masks the implementation details of the network and provides a uniform view to individual computing nodes. It also allows the TMD to be easily scaled up to meet the demands of the application. Because of the network abstraction provided by the 3-tier system, a simple system with only the tier-1 network can be used as the testbed for the development of the AUTO flow without loss of generality. A Xilinx University Program (XUP) Development Board [20] with a single Virtex-II Pro XC2VP30 is used to implement the TMD testbed for this project. There are two types of computing nodes that are of interest to the AUTO flow: processor nodes, which are implemented using the Xilinx MicroBlaze soft processor core, and hardware engine nodes, which can be either hand-designed or generated by the AUTO flow. The network configuration of each type is shown in Figure 3.2. The TMD-MPI library implementation allows a MicroBlaze to be connected to the NetIf directly, or through an MPE. A hardware engine always uses an MPE. When an MPE is used, two sets of FSLs connect the computing node to the MPE, where one set carries the MPE command traffic and the other carries the MPE data traffic. There are two individual FSLs within each set: one for the inbound traffic and the other

for the outbound traffic.

Figure 3.2: Network configuration for different node types ([13])

The NetIfs are connected to each other in a partial-mesh topology, where two FSLs, one for each direction, are used for each pair of NetIfs that are connected. With this setup, an MPI program written in C can be ported to a soft processor node using the TMD-MPI library. The network abstraction allows this node to be replaced by a functionally equivalent hardware engine to improve performance without any impact on the rest of the system. The objective of the AUTO flow is to automate this conversion process. The AUTO flow analyzes the C program on the MicroBlaze to prepare it for C-to-HDL conversion using Synfora's PICO algorithmic synthesis tool. When the hardware is generated, the AUTO flow packages it into a peripheral that can be used as a hardware engine node directly in the TMD system.
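As a rough illustration of the rank-based routing a NetIf performs, the sketch below looks up a destination rank in a small table and falls back to a gateway port for ranks that live on another FPGA. The table contents, port numbering, and function names are all invented for this example.

```c
/* Invented illustration of NetIf routing: a routing table maps a
 * destination rank to an output port; ranks not on this FPGA leave
 * through a gateway node. Ports and table contents are made up. */
#include <assert.h>

#define NUM_RANKS    8
#define GATEWAY_PORT 0   /* hypothetical port towards the gateway node */

/* route_port[r]: output port for rank r; -1 marks an off-chip rank */
static const int route_port[NUM_RANKS] = { 1, 2, 3, -1, -1, -1, -1, -1 };

static int netif_route(int dest_rank)
{
    if (dest_rank < 0 || dest_rank >= NUM_RANKS)
        return GATEWAY_PORT;           /* unknown ranks also go off-chip */
    int p = route_port[dest_rank];
    return (p < 0) ? GATEWAY_PORT : p;
}
```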

3.2 Design Flow

A four-stage design flow is proposed for the TMD system in [13], illustrated in Figure 3.3. In stage 1, the user prototypes the application in C on a workstation. In stage 2, the application is parallelized using a well-known MPI distribution, such as MPICH, and tested on a cluster of workstations. In stage 3, a TMD system is created by mapping each MPI process in the parallel version of the application from stage 2 to a soft processor node. The TMD-MPI library is used instead of MPICH starting

from this step.

Figure 3.3: TMD design flow ([13])

Because both TMD-MPI and MPICH implement the same API, the porting process requires minimal changes to the original C source code. In stage 4, the soft processor nodes that are executing the most computation-intensive portions of the application are identified for hardware acceleration and subsequently replaced by faster hardware blocks. The decision of which soft processors to replace requires detailed profiling of the system performance and an understanding of the trade-offs between different system resources; hence it will remain the job of the system designer for now. However, the laborious process of designing hardware from a functional description in software will be automated by AUTO. The TMD system in stages 3 and 4 is designed using the Xilinx EDK/ISE 9.1 suite of tools. Design entry can be done manually using the EDK GUI for simple systems. However, for large systems, this manual process is error-prone, and an automated tool such as the System Generator [14] is recommended. All hardware engines, including the MPE and the NetIf, are available as custom peripherals that can be imported into EDK. The hardware components of the system are specified by the Microprocessor

Hardware Specification (MHS) file, while the software components are specified by the Microprocessor Software Specification (MSS) file. The system is completely defined by the MHS and MSS, and is implemented using the Xilinx tool flow. After this, a custom script compiles the software using a specified version of TMD-MPI and initializes the routing tables in the NetIfs. The output is the final bitstream that is used to program the FPGA. Synfora's PICO Express design suite is used to generate hardware from C programs. The PICO flow is explained in the next section.

3.3 C-to-HDL Using PICO

The PICO C-to-HDL flow is illustrated in Figure 3.4. The input to PICO Express is a C program that contains the module to be converted to hardware, which is App.c in the figure, and driver code that calls the target procedure. Both App.c and the driver code can be written in ANSI C, with some restrictions; for example, no floating-point numbers or pointers are supported. A complete list can be found in [16]. At the start of the flow, PICO performs the Golden Simulation, where the input C source code is compiled with the driver code and executed on the workstation on which PICO is running. The outputs of the golden simulation are checked against reference outputs provided by the user. This step is useful if the original C program was modified to make it PICO-compliant: the user-supplied reference inputs and outputs guard against bugs introduced in that process. If the golden simulation results agree with the reference, the golden results are saved for future reference. After the golden simulation, PICO proceeds through several steps, transforming the source code into the final output HDL. Simulation is run at each step and the results are checked against the golden results to ensure correctness. In order to incorporate PICO into the TMD design flow, several additional processing steps and components are needed. These are provided in the AUTO flow, and described in the next chapter.
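The golden-simulation idea of standing in for hardware streams with file I/O can be sketched as follows. The "module" and driver below are hypothetical stand-ins, not PICO's actual driver API: the module consumes values from its input stream, and the driver replays reference input from a temporary file so the whole thing runs as ordinary C.

```c
/* Sketch of the driver-code idea used in the golden simulation: the
 * hardware input stream is stood in for by file I/O so the module can
 * run and be checked as ordinary C. accumulate() and run_golden() are
 * invented stand-ins, not PICO's driver API. */
#include <assert.h>
#include <stdio.h>

/* the "module": consumes count values from its input stream, returns the sum */
static int accumulate(FILE *stream_in, int count)
{
    int sum = 0, v;
    for (int i = 0; i < count; i++) {
        if (fscanf(stream_in, "%d", &v) != 1) break;
        sum += v;
    }
    return sum;
}

/* driver: writes reference input to a temp file and replays it as the stream */
static int run_golden(const int *ref_in, int count)
{
    FILE *f = tmpfile();
    if (!f) return -1;
    for (int i = 0; i < count; i++)
        fprintf(f, "%d\n", ref_in[i]);
    rewind(f);
    int result = accumulate(f, count);
    fclose(f);
    return result;
}
```

The result returned by run_golden would then be compared against the user-supplied reference output.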

Figure 3.4: PICO design flow ([15], p.5)

Chapter 4

Implementation of the Tool Flow

This section describes the implementation of the AUTO flow. The objective of the AUTO flow is to take a soft processor in the TMD system and generate a functionally equivalent hardware engine. Efficient conversion from software to hardware is tricky and has traditionally been done manually. The AUTO flow automates this process by providing a set of tools, a library and hardware components.

4.1 Flow Overview

The AUTO flow contains three components: an MPI implementation that is compliant with PICO (PICO MPI), a lightweight hardware control block, and the scripts that perform the operations involved in the flow. The actual flow consists of three steps, as illustrated in Figure 4.1. Step 1 is the preprocessing of the user source file, which contains the top-level function to be converted to hardware. This should be the main function from the soft processor that is to be replaced by the hardware engine. The output of step 1 includes a PICO source file, the driver code, and a flow script. The PICO source file is created from the user input file by adding some pragma settings and the PICO MPI library implementation. The driver code is the testing code; it contains functions that redirect the stream I/O to file I/O so that the module in the PICO source file can be tested as a standard C program. In step 2, PICO Express uses the

15 Figure 4.1: The AUTO tool flow

flow script and files from step 1 to generate the PPA core hardware. An iterative method may be used in this step to explore the design space for the best performance. Verilog files that describe the generated hardware core are produced at the end of step 2. In step 3, the control block is custom-fitted to the generated core. The resultant Verilog files are then packaged into a custom peripheral core that can be used in the EDK flow. Since step 2 is not guaranteed to succeed on the first pass, the user may need to manually adjust performance parameters when running step 2 iteratively. As a result, these three steps are not integrated into a single push-button flow. However, all transformations done in steps 1 and 3 are encapsulated into two scripts, which are provided as part of the AUTO flow. The next few sections describe each component of the AUTO flow.

4.2 MPI Library Implementation

As described in Section 3.1, the MPI API is supported in the TMD system through the TMD-MPI library and the MPE, where the underlying communication network is built on FSLs. PICO does not support the FSL directly as a native interface for the hardware it generates. Instead, it provides a generic FIFO stream interface. Given an MPI program that runs on a MicroBlaze node, we need a way to tell PICO to

interpret calls to MPI interface functions in the source code as a set of operations on the streams. Therefore, a PICO-compliant MPI implementation is needed. Currently, only MPI_Send and MPI_Recv are implemented in this PICO MPI library. However, all other MPI functions build on these two functions and can be easily added to the PICO MPI implementation. Figure 4.2 shows the stream operations involved in the MPI_Recv and MPI_Send operations, where cmd_in is the input command stream, cmd_out is the output command stream, data_in is the input data stream, and data_out is the output data stream. The numbers in the parentheses indicate the ordering of the stream operations.

(a) MPI_Send (b) MPI_Recv

Figure 4.2: Stream operations required to implement MPI behaviour
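A hedged sketch of how an MPI call might be lowered onto the command and data streams is shown below. The command-word packing and the exact operation ordering are invented for illustration; the real ordering is the one given in Figure 4.2. Streams are modeled as simple append/consume buffers.

```c
/* Invented sketch of lowering MPI calls onto the MPE streams. The
 * command-word packing and operation order are illustrative only;
 * Figure 4.2 defines the real ordering. */
#include <assert.h>

typedef struct { int buf[64]; int n, pos; } stream_t;

static void stream_put(stream_t *s, int w) { if (s->n < 64) s->buf[s->n++] = w; }
static int  stream_get(stream_t *s)        { return s->buf[s->pos++]; }

/* send: (1) command word on cmd_out, then (2) payload on data_out */
static void mpi_send_sketch(stream_t *cmd_out, stream_t *data_out,
                            const int *buf, int count, int dest)
{
    stream_put(cmd_out, (dest << 16) | count);   /* invented packing */
    for (int i = 0; i < count; i++)
        stream_put(data_out, buf[i]);
}

/* recv: (1) request on cmd_out, then (2) payload from data_in */
static int mpi_recv_sketch(stream_t *cmd_out, stream_t *data_in,
                           int *buf, int count, int source)
{
    stream_put(cmd_out, (source << 16) | count);
    for (int i = 0; i < count; i++)
        buf[i] = stream_get(data_in);
    return count;
}
```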

Due to the restrictions on the input C code imposed by PICO, two modifications to the MPI API had to be made. First, the concept of a family of MPI functions is introduced. A family of MPI functions refers to a collection of MPI functions that provide the same functionality but operate on different data arrangements. This is needed because PICO does not support pointers. In a C implementation of the MPI API, the function prototypes for MPI_Recv and MPI_Send look like the following:

int MPI_Send (void *buffer, int count, MPI_Datatype type, int dest, ...);
int MPI_Recv (void *buffer, int count, MPI_Datatype type, int source, ...);

The generic pointer buffer specifies the starting location in memory. Together with count, they specify a vector. In general, any memory within the address space

of the calling program is acceptable. For example, buffer may point to the middle of a 2D array. However, PICO does not support pointer access to memory. Hence alternatives are needed. There are two approaches to accommodate the usage of MPI functions on different data arrangements. One is to introduce another wrapper layer on top of the basic PICO MPI implementation. For operation on 1D arrays, the default PICO MPI implementation is used. For other data arrangements, the operation is performed on a temporary buffer, and the values are copied between the temporary buffer and the actual data location. Clearly this is slow, as every MPI operation on a buffer of size N would require 2*N memory accesses. Alternatively, a different “flavour” can be designed to target a particular data arrangement for each MPI function. All variants of the same MPI function form a family. The three most common data arrangements used in software are scalars, 1-dimensional vectors, and 2-dimensional arrays. The PICO MPI implementation includes variants of MPI functions that target each of these data arrangements, as shown in Table 4.1.
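The rejected wrapper-layer approach can be sketched as follows. Here mpi_recv_1d is a stub standing in for the basic PICO MPI receive, and all names are hypothetical; the sketch only shows where the extra N reads and N writes come from.

```c
#include <assert.h>
#include <string.h>

#define MAXC 8

/* Base 1D receive (stub for illustration: fills the vector with a ramp
 * instead of reading from a stream). */
static void mpi_recv_1d(int buf[], int count)
{
    for (int i = 0; i < count; i++)
        buf[i] = i;
}

/* Wrapper-layer approach: to receive into row r of a 2D array, receive
 * into a temporary 1D buffer and copy it over. The copy costs an extra
 * N reads and N writes per call, which is why the per-variant "family"
 * approach was chosen instead. */
static void mpi_recv_row_via_temp(int a[][MAXC], int r, int count)
{
    int tmp[MAXC];
    mpi_recv_1d(tmp, count);                    /* N writes into tmp   */
    memcpy(&a[r][0], tmp, count * sizeof(int)); /* N reads + N writes  */
}
```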

Table 4.1: Implemented families of MPI functions

    MPI Function Family                                    Operand
    MPI_OP (int buf[], int count, ...)                     1D vector
    MPI_OP_Scalar (int *buf, int count, ...)               A single scalar variable
    MPI_OP2D (int buf[][maxc], int row, int count, ...)    A row in a 2D array

    * OP can be Send or Recv.

The second modification to the MPI API is particular to MPI_Recv. In the API specification, MPI_Recv takes in an MPI_Status data structure as an argument and populates it with information related to the receive operation. In C distributions of MPI, MPI_Status is implemented as a struct, which is not supported in PICO. Since the most often queried fields of MPI_Status are the source rank and tag, as a workaround, the function prototypes of the MPI_Recv family of functions are modified to take in two int pointers instead of an MPI_Status to pass back the source rank and tag information. Even though the PICO MPI implementation is designed for hardware synthesis,

it still follows the ANSI C standard. The AUTO flow generates the driver code that redirects the stream I/O to file I/O when the target program is compiled and executed as a standard C program. The user is still responsible for providing the input file and verifying the correctness of the output file. The generation of test stimuli and reference output is described in Section 4.5.
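As a sketch of the MPI_Status workaround described above, a struct-free receive prototype might look like the following. The function body is a stand-in (it fills the buffer with dummy data instead of reading a stream); only the two-int-pointer signature reflects the actual modification.

```c
#include <assert.h>

/* Hypothetical sketch of the struct-free receive prototype: the
 * MPI_Status argument is replaced by two int pointers that return the
 * source rank and tag of the matched message. */
static void mpi_recv_no_status(int buf[], int count,
                               int source, int tag,
                               int *status_source, int *status_tag)
{
    for (int i = 0; i < count; i++)  /* stand-in for the stream reads */
        buf[i] = 100 + i;
    *status_source = source;         /* the two fields a caller would */
    *status_tag    = tag;            /* normally query via MPI_Status */
}
```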

4.3 Control Block

After PICO generates the hardware core (referred to hereafter as the PPA, which stands for Pipeline of Processing Arrays and is PICO's term for the hardware core), a control block is needed to initialize and start the PPA on system startup, and to translate the PICO stream interface to the FSL interface used to communicate with the MPE. A simple finite-state-machine-based hardware controller is designed to provide this functionality. The control block module is a wrapper around the PPA hardware. It operates on the system clock and reset, and provides four FSL bus interfaces that can be connected to the corresponding FSL bus interfaces on the MPE. On system startup, the rank 0 node in the TMD system sends rank information to all hardware nodes in the system. The control block receives the rank from the network, initializes the PPA, and enables the translation between the FSL interfaces and the PICO stream interfaces. All activity on the FSL interfaces is translated into the corresponding control and data signals on the stream interface of the PPA automatically. The control block sits idle until the PPA completes its current task. Then it restarts the PPA to process the next task. A more detailed discussion of the operation of the control block can be found in Appendix A.

4.4 Scripts

Two scripts are used to encapsulate the operations in the AUTO flow. They are described in this section.

4.4.1 Preprocessing Script

The preprocessing script (auto_pico.pl) processes the user C program and generates the files that are needed for the PICO Express flow. The preprocessing script is invoked using the following command:

auto_pico.pl <src_c_file> <mpi_rank> <mpi_size> <transcript_file> [Optional: <mem_option_file>]

Here src_c_file is the user C program. The mpi_rank and mpi_size parameters are the rank of the hardware node to be generated and the size of the TMD system. These two parameters are for simulation purposes only. When the hardware is implemented in the TMD system, they will be supplied as part of the system initialization procedure. The transcript_file contains the expected input and output on the FSL interfaces of the hardware block for a test application. This is used by the AUTO flow to generate input stimuli and reference outputs. Section 4.5 describes how this transcript can be obtained. The optional mem_option_file provides information about how the arrays in the program should be mapped to hardware; more details on this are given later in this section. There are five tasks involved in the preprocessing step. They are listed below. The first two tasks result in a new C file that is based on the user input file, with added pragma settings and the PICO MPI implementations.

1. Parse through the user program and add #pragma settings

2. Attach PICO MPI implementations

3. Generate the driver code, input and reference output files

4. Generate Makefile

5. Generate run script for PICO
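Task 1, the pragma-insertion pass, might look like the following minimal sketch. The real script is written in Perl and would need a proper C parser; here, matching "int " plus a bracket is enough to show the idea, and the 32-bit word size is an assumed default.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Sketch of task 1: if a source line declares an array, append the two
 * pragma lines right after it. The detection heuristic and the default
 * bit size are illustrative only. Returns 1 if the line was annotated. */
static int annotate_line(const char *line, char *out, size_t outsz)
{
    int is_array = strncmp(line, "int ", 4) == 0 && strchr(line, '[') != NULL;
    if (is_array)
        snprintf(out, outsz,
                 "%s\n#pragma bitsize 32\n#pragma host_access none\n", line);
    else
        snprintf(out, outsz, "%s\n", line);
    return is_array;
}
```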

The input to the preprocessing script is a C program that contains one top level function to be converted into PPA hardware, and any number of helper functions, all

contained in a single C file. The input program has to follow the constraints imposed by PICO. The preprocessing script does not check for these constraints. However, if there is a violation, the PICO Express flow will error out when it is run in step 2. The user can fix the source program and restart from this step. The first task in this step is to parse out all array declarations and add the following pragma settings immediately after the line in which the array is declared:

#pragma bitsize <n>
#pragma host_access none

The first line indicates the word size when the array is mapped to a memory. The second line tells PICO not to expose any access ports for this memory as external ports on the PPA, because all communication with other modules in the system for a TMD hardware engine is through the stream interfaces. Another setting associated with arrays is the type of memory that the array maps to. PICO supports three types of memories: internal fast, internal block RAM, and user supplied. Internal fast memories are register-based. Internal block RAM, as its name suggests, is a block RAM instantiated within the PPA hardware. User-supplied memories are placed outside the PPA. PICO will generate a standard SRAM interface to interact with the external memory. The user is responsible for supplying the external SRAM. To explicitly map an array to a particular type of memory, the following pragma setting can be used:

#pragma (internal_fast|internal_blockram|user_supplied_fpga)

When not specified, PICO chooses the memory type based on the size of the array. However, the user can supply a memory option file to tell the preprocessing script to set the type explicitly. A memory option file may contain lines that look like the following, where a # at the beginning of a line indicates a comment:

# Specifying the memory type for the following arrays
array_1 blockram
array_2 fast
array_3 user_supplied
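Parsing one line of this file is straightforward; a sketch (the actual script is Perl, and the buffer sizes here are arbitrary):

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Sketch of parsing one memory-option line: '#' lines are comments,
 * otherwise the line is "<array_name> <type>". Returns 1 when a
 * mapping was parsed, 0 for comments, blanks, and malformed lines. */
static int parse_mem_option(const char *line, char *name, char *type)
{
    if (line[0] == '#' || line[0] == '\0')
        return 0;                        /* comment or blank: skip */
    return sscanf(line, "%63s %63s", name, type) == 2;
}
```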

After the annotation of the arrays, the PICO MPI library implementations are attached to the user code. Recall from Section 4.2 that each MPI function has a different variant that targets a particular data arrangement. In the C program, all MPI calls use the standard prototype. It is the preprocessing script's job to replace each call with the appropriate variant according to the type of the operand. This is illustrated below for MPI_Recv; the same type of transformation is done for MPI_Send.

/* Original Program */
void ppa_function (void)
{
    int buf[BUFSIZE];
    int buf2[NUM_ROWS][NUM_COLS];
    int buf3;
    ...
    // (1) Operand is a 1D vector
    MPI_Recv (buf, count, MPI_INT, source, tag, MPI_COMM_WORLD, &status);

    // (2) Operand is a 2D array
    MPI_Recv (buf2[i], count, MPI_INT, source, tag, MPI_COMM_WORLD, &status);

    // (3) Operand is a scalar
    MPI_Recv (&buf3, 1, MPI_INT, source, tag, MPI_COMM_WORLD, &status);
    ...
}

/* PICO-Ready Program */
void ppa_function (void)
{
    int buf[BUFSIZE];
    int buf2[maxr][maxc];
    int buf3;
    ...
    // (1) 1D variant of MPI_Recv
    MPI_Recv (buf, count, MPI_INT, source, tag, MPI_COMM_WORLD, &status);

    // (2) 2D variant of MPI_Recv; row = i
    MPI_Recv2D (buf2, i, count, MPI_INT, source, tag, MPI_COMM_WORLD, &status);

    // (3) Scalar variant of MPI_Recv
    MPI_Recv_Scalar (&buf3, 1, MPI_INT, source, tag, MPI_COMM_WORLD, &status);
    ...
}

With all MPI calls updated, the PPA program is ready to be used with the PICO Express flow. The preprocessing script then generates the driver code that is used by PICO for simulation and testing purposes during various stages of the PICO flow. The driver code is generic and the same template can be used for any PPA function. A Makefile is also generated so that the driver and the PPA program can be compiled as a standard C application and tested prior to submitting to the PICO Express flow. The final item generated by the preprocessing script is a flow script (run.tcl) that contains the commands to set up the PICO project and invoke the flow. The flow script tells PICO to generate the PPA core in a hierarchical fashion. Each MPI function is generated as a submodule and instantiated in the PPA module. This hierarchical approach provides better modularity and area efficiency, as opposed to inlining all the modules. Finally, with all the files generated, the user can start step 2 using the following command:

prompt> pico_extreme_fpga run.tcl

4.4.2 Packaging Script

The packaging script (pack_pico.pl) takes the Verilog files generated in the PICO Express flow and produces a custom peripheral core that can be directly imported into Xilinx's EDK. The packaging script is invoked using the following command:

pack_pico.pl <pico_experiment_directory>

Here pico_experiment_directory specifies where to retrieve the Verilog generated by PICO. There are two steps involved in the packaging script: generating the control logic and packaging the HDL into an EDK-importable peripheral. The control logic is generated from a pre-designed template by setting a few parameters, including the peripheral name and the PPA module name. Making a custom peripheral core involves putting the source Verilog files in a specific directory structure, as shown below:

mpi_jacobi_v1_00_a/
    data/
        mpi_jacobi_v2_1_0.mpd
        mpi_jacobi_v2_1_0.pao
    hdl/
        verilog/
            mpi_jacobi.v        (Control logic)
            mpi_jacobi_ppa.v    (PPA top-level wrapper)
            ...                 (Other PPA files)

The packaging script creates the mpi_jacobi_v1_00_a/ directory for version 1.00.a of a peripheral named mpi_jacobi. This directory can be placed in any custom IP repository to make it available for use with the EDK tools. All source code files are collected under the hdl/verilog/ sub-directory. The data/ sub-directory contains the Microprocessor Peripheral Description file (.mpd) and the Peripheral Analysis Order file (.pao). The MPD describes the interface exported by the peripheral as well as the parameters that can be set in the MHS file of an embedded system. The PAO contains the list of all source files required to build this peripheral. Both files are generated automatically by the packaging script.

4.5 Test Generation

As mentioned earlier, in order to run the PICO flow, the user needs to provide input stimuli for the input streams and reference output for the output streams. These can be generated from the TMD system directly. Figure 4.3 shows a TMD prototyping system where one of the two MicroBlaze processors is to be replaced by a hardware engine. Since the replacement of the MicroBlaze by a hardware engine is transparent to the rest of the system, the node behind the MPE can be viewed as a black box that can be implemented either as a soft processor node or a hardware engine. Since all communication with the black box is through the FSL interfaces, by snooping the FSL interfaces and capturing the activities, both the input stimuli and the corresponding reference output can be obtained. To do this, the TMD-MPI library is instrumented

so that it prints out the raw data sent to and received from the FSL interfaces when a compiler flag is turned on. A transcript obtained this way is then given to the AUTO flow to produce the input and reference output files that are used by PICO during hardware generation.

Figure 4.3: TMD system testbed
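The instrumentation idea can be sketched as a thin tracing wrapper around an FSL access. The function names, the flag name, and the log format below are invented for this sketch; only the pattern (print the raw word when the flag is on, then do the real access) reflects the approach described above.

```c
#include <assert.h>
#include <stdio.h>

#define TMD_TRACE 1   /* stand-in for the compiler flag */

static int trace_words;   /* number of words logged so far */

/* Stub for the real FSL hardware access. */
static void fsl_write(int word) { (void)word; }

/* Sketch of snooping one FSL write: when tracing is enabled, every raw
 * word is also printed, building up the transcript used for test
 * generation. */
static void traced_fsl_write(int word)
{
#if TMD_TRACE
    printf("FSL_OUT %08x\n", word);
    trace_words++;
#endif
    fsl_write(word);
}
```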

4.6 Limitations

The AUTO flow was developed as a proof-of-concept prototype to demonstrate the feasibility of automatically generating a hardware accelerator from software descriptions. Consequently, there are still some limitations in the current version. Most of the issues described in this section are due to limitations of the various tools integrated into the AUTO flow. Many of them are likely to be resolved in the future. The others are inherent to the shift in design strategies from a software paradigm to a hardware paradigm. Hopefully, more sophisticated automation tool flows will help bridge this gap.

4.6.1 Floating-Point Support

A major limitation of the current version of the PICO tool is the lack of support for floating-point numbers. Currently, only fixed-point numbers (char, int, long) are accepted. Floating-point numbers are more complicated to support in hardware, hence hardware-based applications have traditionally used fixed-point approaches. On the other hand, all modern microprocessors have floating-point support. It is rare to see software, especially for scientific computing, that does not use floating-point numbers. As a result of this limitation, the range of applications that can be converted to hardware using the PICO flow is severely limited. In order to use the PICO flow as it is now, the designer has to rewrite the original software application using fixed-point numbers. There are algorithmic synthesis tools commercially available now that support floating-point operations. An example is Handel C [3]. Perhaps in the future PICO will also incorporate this feature.

4.6.2 Looping Structure

Using more than one loop nest inside another loop is not allowed. This limits the complexity of the software that can be put through PICO. An alternative is to turn each internal loop nest into a Tightly Coupled Accelerator Block (TCAB). This is a cleaner approach when working with large designs because of the hardware design hierarchy it provides, although programs with complicated looping structures should probably be avoided anyway because they make timing and performance analysis harder.

4.6.3 Pointer Support

The pointer in C is a powerful concept that allows flexible access to memories. Many C programs make extensive use of pointers. However, general pointer accesses are currently not supported in PICO, possibly to avoid complex pointer-aliasing analyses. Consequently, programs that use pointers need to be manually inspected to replace those instances of pointer usage that are illegal in PICO with equivalent alternatives.

This process may involve rewriting portions of the code. Due to the dynamic nature of such analysis, AUTO does not attempt to do this automatically. It is the user's responsibility to provide AUTO with a source program that follows PICO's guidelines in terms of pointer usage.

4.6.4 Division Support

Division with an arbitrary divisor is not supported when using the Xilinx XST synthesis tool. XST can only synthesize a divider when the divisor is a power of 2. However, PICO instantiates a general-purpose divider even when the divisor can be determined to be a power of 2 at compile time. PICO recommends using Synplify Pro to synthesize generic dividers. However, we do not have access to a Synplify Pro license, so we cannot verify this recommendation. As a result of this limitation, applications with general division are not supported by the AUTO flow. A workaround exists for fixed-point division when the divisor is a power of 2: replace the division with a bitwise right shift (>>) by the appropriate number of bits.
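The workaround amounts to computing the base-2 logarithm of the divisor and shifting by that amount, as in this small sketch (valid for non-negative dividends, where right shift and division agree):

```c
#include <assert.h>

/* Workaround sketch: when the divisor is known at compile time to be a
 * power of two, replace the division by a right shift so that XST can
 * synthesize it. The loop computes log2 of the divisor, i.e. the shift
 * amount. */
static int div_pow2(int x, int divisor)
{
    int shift = 0;
    while ((1 << shift) < divisor)
        shift++;
    return x >> shift;   /* equals x / divisor for non-negative x */
}
```

Note that for negative dividends an arithmetic right shift rounds toward negative infinity while C integer division truncates toward zero, so the substitution is only exact for non-negative values.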

4.6.5 Performance Specification

The AUTO flow currently does not optimize for performance. It is up to the user to decide what the appropriate performance parameters should be. One reason is that the user should be more familiar with the required performance constraints for the target hardware. Secondly, for complex software, the PICO flow may need to be invoked iteratively to find a good set of performance constraints that delivers the best area-delay-power tradeoff. Performance constraints can be specified in terms of the MITI (Minimum Inter-Task Interval) to control the amount of task overlap, or the II (Initiation Interval) of a loop to affect how tightly loop iterations can be scheduled. When a constraint is not specified, PICO tries aggressively to obtain a compact schedule. Sometimes this schedule may be impossible to meet during synthesis. Therefore, an iterative approach might be needed to manually find the sweet spot.

Another parameter that has been shown to improve performance is the trip count for loops, which is the expected number of iterations of the loop. When this is specified, PICO can better optimize for both area and speed. However, the trip count cannot in general be determined from static analysis of the code. Hence, the AUTO flow does not set this option for loops. It is up to the developer to tune this option manually if desired.

4.6.6 Hardware Debugging

In the current flow, once the software is converted to hardware, it is very difficult to debug if the hardware is nonfunctional. There are two main causes of a malfunctioning hardware module: bugs in the user C program and bugs in the PICO flow. For bugs in the PICO flow, the user needs to use standard hardware debugging tools to track down the problem. This requires knowledge of Verilog and of hardware design and debugging in general, which an average software developer may not possess. On the other hand, if the bug is in the user program, standard C debuggers can be used. This is likely to be the common case in the long run, when the PICO flow is expected to be relatively stable. In this case, it would be helpful if the live input and output of the hardware module in its working environment could be captured and used to generate test stimuli for the C program. This way, the software developer can test the program with real hardware data in a C debugging environment with which they are familiar.

4.6.7 Exploitable Parallelism

The amount of parallelism that can be exploited by PICO is limited to that exposed in the software. Badly designed software will result in inefficient hardware. Software optimization and automatic parallelism extraction are a separate research field. The AUTO flow only prepares a piece of software for use in the PICO C-to-HDL flow. It does not optimize the source code. The software designer should therefore use discretion when writing the program. PICO provides a number of options to automate some basic optimizations such as full loop unrolling and multi-buffering of memory

through the use of pragmas [16]. However, these automated options have limited applicability, and cannot replace intelligent code design by the programmer. That said, not all software optimization techniques apply when the final goal is hardware. For example, any cache-related optimizations such as loop reordering are unlikely to see performance upside in the final hardware. Therefore, the designer should also have some high-level understanding of the hardware architecture that PICO produces in order to get the best result.

Chapter 5

The Heat-Equation Application

This section presents an example application that is built to demonstrate the functionality of the AUTO flow. The application chosen is a heat-equation solver. The heat equation is a partial differential equation that describes the temperature variation in a given region over time, given the initial temperature distribution and boundary conditions. The steady-state thermal distribution is determined by the Laplace equation ∇²T(x, y) = 0. The solution to this equation can be found by the Jacobi iteration method [22], which is a numerical method for solving a system of linear equations. This application was chosen because a TMD system has been previously implemented for this method [13], so a working MPI program already exists and can be used in the AUTO flow directly. The main objective of this project is not to produce a high-performance hardware accelerator for the heat-equation application. Rather, the goal is to demonstrate the feasibility of an automated flow. Therefore, it is expected that the resultant hardware may not deliver the best performance in comparison to hand-designed hardware modules.

5.1 Implementation

Because floating point is not supported in PICO, in this implementation of the heat-equation solver, the temperatures are represented as fixed-point numbers. This is

acceptable because the precision of the computation is not important for this project, as the objective is not to build a high-performance heat-equation solver. We will accept the hardware as long as it produces the same result as the software running on a soft processor. In the original TMD implementation of the heat-equation application, nine computing nodes were used in the system. For simplicity, only two computing nodes are used for the test application in this project. Rank 0 is a MicroBlaze and is the root node. It generates the initial temperature map and sends the working data to rank 1. Rank 1 is a computing node. The two nodes solve the heat equation together. In the reference system, rank 1 is a MicroBlaze running the software implementation of the Jacobi iterative solver. Rank 1 in the test system is the hardware generated from the software version using the AUTO flow. The original software implementation of the Jacobi iterative solver contains a single program that is run on all soft-processor computing nodes. Depending on the rank of the computing node, which is defined at compile time through a compiler directive, sections of the program are selectively exercised. All non-root computing nodes perform essentially the same operations. In addition to the computation loop performed by all computing nodes, the root node is also responsible for data initialization and finalization, as well as synchronization at the end of each Jacobi iteration. The goal for the example application is to generate a hardware implementation for a non-root node. The first step in the implementation is to extract the parts of the program that pertain only to non-root nodes. The resultant program is a concise version of the one originally running on the non-root nodes. This program is then passed through the three stages of the AUTO flow, as described in Section 4.4.
After obtaining a peripheral from the AUTO flow, the reference system and test system are implemented using the design flow described in Section 3.2. The two systems are implemented on a Xilinx Virtex II Pro FPGA. The performance is measured, and the results are documented in the next section.

5.2 Experiment Methodology

A few simple experiments are conducted to compare the performance of the Jacobi hardware engine produced by the AUTO flow to that of the reference software implementation. The setup of the reference and test systems is shown in Figure 5.1. The two-node implementation is chosen for simplicity. It can be easily scaled up to include more computing nodes by duplicating the rank 1 MicroBlaze in the reference system or the Jacobi hardware engine in the test system. Since each node only communicates with two of its neighbours, the individual behaviour of each node is not affected by increased system size.

(a) Reference System

(b) Test System

Figure 5.1: A simple two-node TMD implementation of a Jacobi heat-equation solver

Two different programs are run on the rank 0 MicroBlaze to conduct two tests. The first test verifies the correctness of the test system against the reference system. The original TMD C implementation of the heat-equation solver is used on rank 0. Recall that the program on the reference system rank 1, which is used to generate the hardware, is originally extracted from the C implementation. In this setup, the root node collects the results from the computing element after the system converges and prints out the results to the UART, which is then captured on a PC connected to the FPGA board through a serial link. A problem size of 40x40 is used. Each node hence operates on a section of 20x40 elements. The outputs from both systems are identical. The second test measures the performance of the Jacobi hardware engine compared against the reference system. In this case, rank 0 only performs data initialization, finalization and synchronization between the computing nodes. It does not run the computation loop. In each iteration, rank 0 exchanges the boundary rows with rank 1, and proceeds directly to the synchronization barrier and waits for rank 1 to finish its computation loop. The number of cycles from the end of the row exchange to the end of the iteration synchronization is measured for each iteration and accumulated over the entire program. The cycles-per-element (CPE) is the total number of cycles divided by the problem size (20 × 40 × N_iterations). Normally, the Jacobi iteration algorithm stops when the results converge. In this test, in order to emulate a different problem size (i.e. total number of elements to process), the program on rank 0 is changed slightly so the total number of Jacobi iterations to run before stopping the computation can be controlled. Using this hack, the CPE of the reference and test systems is measured for several cases with a different number of Jacobi iterations executed in each. The results are documented in the next section.
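The CPE metric defined above is a simple ratio; as a sanity check, it can be written out directly (the 20 x 40 per-node section size is taken from the text):

```c
#include <assert.h>

/* CPE as defined above: total cycles accumulated over all iterations,
 * divided by the per-node problem size (20 x 40 elements) times the
 * number of iterations run. */
static long cpe(long total_cycles, long n_iterations)
{
    return total_cycles / (20L * 40L * n_iterations);
}
```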

5.3 Results

Both the test and reference systems are implemented using Xilinx XPS 9.1. The reference system uses a 100MHz clock, the maximum clock frequency provided on the Xilinx University Program Development Board [20]. The test system uses a 50MHz clock. This is because the highest speed that could be achieved during hardware generation in the PICO flow was 80MHz, and the Digital Clock Manager (DCM) block in the embedded system does not allow fractional division. The only available clock frequencies below 80MHz are therefore 50MHz and 66MHz, and 50MHz is chosen for convenience. The designs are tested on a Xilinx Virtex-II Pro XC2VP30 FPGA and the performance is measured. A fair comparison between the hardware Jacobi solver and the MicroBlaze solution is the number of elements computed per second per LUT. The area cost of the hardware Jacobi solver was obtained by synthesizing it alone using ISE 9.1. This includes a small overhead of about 9 LUTs introduced by the control block, which

is negligible for most designs. Since the hardware solver provides equivalent functionality to a complete MicroBlaze system, which includes a MicroBlaze soft processor, 2 Local Memory Busses (LMB), 2 LMB BRAM controllers, and 1 BRAM block, the equivalent area cost is that of such a system. This is obtained by implementing a single MicroBlaze system containing the above components using the EDK flow. The total LUT count in each case is obtained from the Xilinx Mapping Report (.mrp). Figure 5.2 shows the raw CPE measurements for the reference and test systems. The high CPE observed in the reference system for short-running experiments that perform a small number of iterations may be due to initialization overhead in the MicroBlaze that is not fully amortized. The performance of the two systems is compared using the asymptotic CPE when the number of iterations is large. The normalized computing power in terms of the number of elements processed per second per LUT is shown in Table 5.1.

Figure 5.2: Main loop execution time per element with different iteration lengths

The hardware Jacobi solver is actually 22% slower than the soft processor implementation when running at 50MHz. However, theoretically the hardware Jacobi solver can operate at 80MHz. This is shown in the third column of Table 5.1, in which case we get a 1.25x runtime improvement over the reference system. It is clear that a speedup like this is not enough for a hardware acceleration system. However, we have achieved the objective of developing a working flow. Going forward, more effort

will be spent in the optimization of the AUTO flow to provide more performance improvement.

Table 5.1: Normalized computing power of the reference and test systems

                                 MicroBlaze   HW Jacobi   HW Jacobi (Speculated)
    Clock Frequency              100 MHz      50 MHz      80 MHz
    Asymptotic CPE               41           17          17
    Total LUTs                   2771         4267        4267
    N elements per sec per LUT   880          689         1103
    Speed-Up                     1x           0.78x       1.25x

Chapter 6

Conclusion and Future Work

In this project, we have presented an automated flow that generates hardware computing nodes directly from C that can be used in a TMD system. As a result, soft processor nodes in a TMD system can be easily converted and replaced by functionally equivalent hardware engines to achieve better performance. A working hardware Jacobi heat-equation solver is produced as an example to demonstrate the feasibility of the AUTO flow. With little designer intervention, the hardware Jacobi solver is generated automatically and shows some performance improvement over the software implementation on a soft processor. The AUTO flow shows that FPGA-based high-performance computing platforms such as the TMD can be made more accessible to software developers who are unfamiliar with hardware design. We are now one step closer to the ultimate goal of a completely automated flow that converts a parallel program into a complete TMD system. Nevertheless, more work still needs to be done to address the limitations of the current AUTO flow, which include the lack of support for the full MPI API and the requirement to run the iterative flow through PICO manually. Hence, the next step is to complete these features and focus on robustness and performance optimization. The AUTO flow will be used with a wider range of applications to test its robustness. Further optimization of the flow will be done to provide more of a performance advantage for the hardware engines.

Appendix A

Hardware Controller for PICO PPA

To integrate the hardware module generated by PICO into the TMD system, a lightweight hardware control block is used. The control block has two responsibilities. It generates the control signals to initialize and start the PPA module upon system startup. This is done using a finite-state machine (FSM). It also translates the PICO stream interface to the FSL interface, which is done through combinational logic. Figure A.1 shows the design of the control block. Table A.1 lists the I/O ports exported by the control block, which include the clock, the reset, and four sets of FSL ports for the four FSL bus interfaces. These ports are visible to the rest of the TMD system.

A.1 Control FSM

A finite state machine (FSM) is used to generate the control signals for the PPA module. Table A.2 lists the raw control ports of the PPA that need to be controlled by the control block. The control block drives the input ports with the appropriate values to initialize and start the PPA. Upon system reset, the FSM receives the rank of the current hardware node from the MPE, and stores it in an 8-bit word, which is connected to the

37 Figure A.1: Design of the control block

Table A.1: I/O ports exported by the control block Signal Direction (I/O) Bus Width Bus Interface clk I rst I To mpe cmd data O [31:0] FSL To mpe cmd ctrl O To mpe cmd To mpe cmd write O To mpe cmd full I From mpe cmd data I [31:0] FSL From mpe cmd ctrl I From mpe cmd From mpe cmd exists I From mpe cmd read O To mpe data data O [31:0] FSL To mpe data ctrl O To mpe data To mpe data write O To mpe data full O From mpe data data I [31:0] FSL From mpe data ctrl I From mpe data From mpe data exists I From mpe data read O

38 Table A.2: Raw control ports on the PPA module Signal Direction (I/O) Bus Width clk I reset I enable I start task init I start task final I clear init done I clear task done I psw task done O rawdatain self rank 0 I/O [0:7]

datain self rank 0 port on the PPA. This port is generated when the original C function declares and uses the global variable self rank. The PPA is enabled after the rank is obtained. The necessary sequence for each control signal is described in Appendex C.3 of [15]. Figure A.2 shows the state transition diagram of the FSM. When the FSM reaches the PPA READY state, the translation between the PICO stream interface and the FSL interface is enabled. The FSM remains idle in this state until psw task done is raised high by the PPA, at which point the PPA is restarted to wait for the next task.
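Although the control block is implemented in hardware, its startup behavior can be sketched as a small C model. This is an illustrative sketch only: the state names GET_RANK and INIT_PPA are invented for clarity (only PPA_READY corresponds to a state named above), and the authoritative control-signal sequence is the one in Appendix C.3 of [15].

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative C model of the control FSM. State names other than
 * PPA_READY are invented; see Appendix C.3 of [15] for the real
 * control sequence. */
typedef enum { GET_RANK, INIT_PPA, PPA_READY } ctrl_state_t;

typedef struct {
    ctrl_state_t state;
    uint8_t self_rank;  /* drives the rawdatain_self_rank_0 port */
    bool enable;        /* drives the PPA enable port */
} ctrl_fsm_t;

void fsm_reset(ctrl_fsm_t *f) {
    f->state = GET_RANK;
    f->self_rank = 0;
    f->enable = false;
}

/* One clock step. rank_valid/rank model the rank word arriving from
 * the MPE; task_done models the PPA's psw_task_done output. */
void fsm_step(ctrl_fsm_t *f, bool rank_valid, uint8_t rank, bool task_done) {
    switch (f->state) {
    case GET_RANK:              /* after reset: latch the node's rank */
        if (rank_valid) { f->self_rank = rank; f->state = INIT_PPA; }
        break;
    case INIT_PPA:              /* drive the init/start sequence, enable PPA */
        f->enable = true;
        f->state = PPA_READY;
        break;
    case PPA_READY:             /* stream translation active; wait for done */
        if (task_done) f->state = INIT_PPA;  /* restart for the next task */
        break;
    }
}
```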

A.2 Stream Interface Translation

The second task of the control block is to translate the PICO stream interface to the FSL interface. A detailed description of the PICO stream interface can be found in Appendix C.5 of [15]. Information on the FSL interface can be found in [19]. Figure A.3 shows a simple illustration of the PICO input and output stream interfaces. In both cases, the req signal is raised high by the PPA when it wants to interact with the stream. The ready signal is an input to the PPA; a high ready signal indicates that the other end of the stream is ready to receive or send data. The FSL interface is shown in Figure A.4. To read from an FSL, the data consumer raises FSL_S_Read. If there is data in the FSL, the FSL_S_Exists signal will also be high. In this case, the data word and control bit can be read through FSL_S_Data and FSL_S_Control in the next clock cycle. Similarly, to write to an FSL, the data producer raises FSL_M_Write and puts the data word and control bit on FSL_M_Data and FSL_M_Control respectively. In the first clock cycle after FSL_M_Full becomes low, the data is pushed into the FIFO.

Figure A.2: State transition diagram of the control block FSM

Figure A.3: The PICO stream interface ((a) input stream; (b) output stream)

Figure A.4: The FSL bus interface ((a) read from an FSL; (b) write to an FSL)

Comparing the operations of the PICO stream interface and the FSL bus interface, it is not difficult to see the functional similarity between instream_req and FSL_S_Read, instream_ready and FSL_S_Exists, outstream_req and FSL_M_Write, and outstream_ready and the complement of FSL_M_Full. Since the PICO stream interface does not contain a control bit, a 33-bit data bus is used, where the highest bit is interpreted as the control bit in the FSL interface and the lower 32 bits as the data word. The translation between the PICO stream interface and the FSL bus interface is summarized in the Verilog excerpt shown in Program 1. This translation relationship is used for all four PICO streams (cmd_in, cmd_out, data_in, data_out) and the corresponding FSL interfaces.

Program 1 Code excerpt to illustrate the stream interface translations

/* For outbound FSL */
FSL_M_Data      = outstream_data[31:0];
FSL_M_Control   = outstream_data[32];
FSL_M_Write     = outstream_req & ~FSL_M_Full;
outstream_ready = ~FSL_M_Full;

/* For inbound FSL */
instream_data  = {FSL_S_Control, FSL_S_Data};
FSL_S_Read     = instream_req & FSL_S_Exists;
instream_ready = FSL_S_Exists;
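For a software-oriented sanity check, the outbound half of this mapping can be mirrored in plain C. This is a sketch: the struct and function names are invented for illustration, and the 33-bit PICO stream word is modeled as the low 33 bits of a 64-bit integer.

```c
#include <stdbool.h>
#include <stdint.h>

/* C model of the outbound translation in Program 1: bit 32 of the
 * PICO stream word is the FSL control bit, bits 31:0 the data word.
 * A write is asserted only when the PPA requests one and the FSL
 * FIFO is not full. */
typedef struct {
    uint32_t fsl_m_data;
    bool     fsl_m_control;
    bool     fsl_m_write;
    bool     outstream_ready;
} outbound_t;

outbound_t map_outbound(uint64_t outstream_data, bool outstream_req,
                        bool fsl_m_full) {
    outbound_t o;
    o.fsl_m_data      = (uint32_t)(outstream_data & 0xFFFFFFFFu);
    o.fsl_m_control   = (outstream_data >> 32) & 1u;
    o.fsl_m_write     = outstream_req && !fsl_m_full;
    o.outstream_ready = !fsl_m_full;
    return o;
}
```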

Appendix B

Using PICO: Tips and Workarounds

This appendix describes some of the PICO-related workarounds and performance-improving tips that were discovered during the development of the AUTO flow. Refer to [16] for details on the PICO options mentioned here.

B.1 Stream Ordering

As described in Section 3.1, MPI operations are realized in hardware as a series of stream operations. The order of these stream operations is defined by the MPE protocol, and hence must be enforced for correct interaction with the MPE block. However, stream ordering is tricky in PICO. Because PICO always tries to produce a schedule that is as compact as possible, it will schedule two operations in the same phase as long as it does not detect a data dependency between them. As there is no direct data dependency between the streams, it is hard to make PICO understand the sequential ordering constraint on the stream operations imposed by the semantics of the MPE protocol. A workaround that allows stream operations to be ordered is to wrap each stream in a function and generate the function as a Tightly Coupled Accelerator Block (TCAB). Since each TCAB is generated as an independent submodule, dummy data dependencies can be introduced between the TCABs to force a sequential schedule.

Program 2 shows an excerpt of the code for the MPI_Recv function that illustrates the techniques used to establish ordering of the stream operations. In the code excerpt, the write operation on the output command stream is encapsulated in a TCAB called recv_cmd_out. The return value of the TCAB is used in conditional statements to force the instructions enclosed in the conditional statement to be scheduled at least one time unit after the execution of recv_cmd_out. Program 2 also illustrates the technique of loop sinking, where sequential instructions before a loop are sunk into the loop to provide hints about the ordering of the operations. Writing the code this way tells PICO clearly that a single instruction is to be executed every iteration, so an instruction in a later iteration will not be scheduled before one in an earlier iteration.
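The essence of the dummy-dependency trick can be shown in plain C, independent of PICO. The wrapper below stands in for a TCAB; because its return value feeds the conditional, any scheduler that honors data dependencies must order the second operation after the first. All names here are illustrative, not PICO API.

```c
/* Stand-in for a TCAB-wrapped stream write: records the operation
 * and returns a dummy value for downstream code to depend on. */
static int op_log[4];
static int op_count = 0;

static unsigned char send_op(int op_id) {
    op_log[op_count++] = op_id;
    return 1; /* dummy value used only to create a dependency */
}

/* Two stream operations with no real data dependency between them.
 * Routing 'done' through the if-statement creates an artificial one,
 * forcing the second send_op to be scheduled after the first. */
void ordered_ops(void) {
    unsigned char done = send_op(1); /* e.g. command stream write */
    if (done) {
        send_op(2);                  /* e.g. data stream write */
    }
}
```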

B.2 Improving Performance

With PICO the performance of the generated hardware can sometimes be improved by turning on certain options manually. This section describes a few of them. More details regarding the usage of these options can be found in [16].

Memory Type

There are three types of memories supported by PICO: internal fast register-based memory, block RAM and user-supplied external memory, such as DDR. The memory type can be specified using the following command in the C source file after the array declaration:

#pragma (internal_fast | internal_blockram | user_supplied_fpga)

When the memory type is not explicitly specified, PICO chooses the type based on the size of the array. However, some automatic optimizations are only available for the register-based memories. Also, block RAMs can only have a maximum of 2 read/write ports, which may limit the amount of parallelism that can be exploited.

Program 2 Code excerpt to illustrate stream ordering

unsigned char recv_cmd_out(unsigned long long cmd_out)
{
#pragma bitsize recv_cmd_out 1
#pragma bitsize cmd_out 33
    pico_stream_output_mpe_cmd_out(cmd_out);
    return 1;
}

int MPI_Recv(int buffer[], int count, MPI_Datatype datatype,
             unsigned int source, unsigned int tag, MPI_Comm comm,
             unsigned int *in_tag_ptr, unsigned int *in_src_ptr)
{
    unsigned char done = 0;
#pragma bitsize done 1
    unsigned int i;

#pragma num_iterations (3,,)
    for (i = 0; i < count + 3; i++) {
        if (i == 0) {
            unsigned long long cmd_out = MPE_RECV_OPCODE | (source << 22)
                                         | count | FSL_CTRL_MASK;
#pragma bitsize cmd_out 33
            recv_cmd_out(cmd_out);
        } else if (i == 1) {
            done = recv_cmd_out(tag & FSL_DATA_MASK);
            if (done) {
                *(in_src_ptr) = (pico_stream_input_mpe_cmd_in()
                                 & NET_SOURCE_MASK) >> 24;
            }
        } else if (i == 2) {
            *(in_tag_ptr) = pico_stream_input_mpe_cmd_in() & FSL_DATA_MASK;
        } else {
            /* Standard receive: data from stream */
            if (done) {
                buffer[i - 3] = pico_stream_input_mpe_data_in() & FSL_DATA_MASK;
            }
        }
    }
    return MPI_SUCCESS;
}

Loop Trip Counts

Specifying the trip counts for loops saves area and improves speed, because PICO can optimize better when it knows the expected number of iterations of a loop. The loop trip count can be specified using the following pragma immediately before the start of the loop:

#pragma num_iterations (, , )
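For example, a loop known to run exactly 16 times might be annotated as follows. This is a sketch: the pragma syntax follows [16], and an ordinary C compiler simply ignores the unrecognized pragma, so the function still compiles and runs normally.

```c
/* The num_iterations pragma supplies (min, expected, max) trip
 * counts so PICO can size the hardware for the loop. An ordinary
 * C compiler ignores the unknown pragma. */
int sum16(const int a[16]) {
    int s = 0;
#pragma num_iterations (16, 16, 16)
    for (int i = 0; i < 16; i++)
        s += a[i];
    return s;
}
```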

Hand Perfectization

Perfectization in the PICO lingo is the process of transforming the original source code into a sequence of loop nests, without any sequential code between the loops. This is done automatically by PICO during the scheduling phase. In certain cases, the user can do a better job perfectizing the code than PICO could. For example, in Program 2, hand perfectization is used to sink the initial sequential setup code into the loop. The resultant hardware can start a new iteration at each cycle since it is clear that only one instruction is executed per iteration. However, without hand perfectization, PICO could choose to sink all initial sequential instructions to iteration 1 and as a result, the scheduler will not be able to schedule an iteration per clock cycle.
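A minimal sketch of hand perfectization, assuming a simple reduction with one sequential setup statement: the setup is sunk into iteration 0 of the loop, so exactly one operation executes per iteration and no sequential code remains outside the loop nest. The function is illustrative and not taken from the AUTO flow.

```c
/* Hand-perfectized form of:
 *     s = init;                 // sequential setup before the loop
 *     for (i = 0; i < n; i++)
 *         s += a[i];
 * The setup is sunk into iteration 0 and the trip count grows by 1,
 * so each iteration performs exactly one operation. */
int sum_with_setup(const int *a, int n, int init) {
    int s = 0;
    for (int i = 0; i < n + 1; i++) {
        if (i == 0)
            s = init;      /* former sequential setup code */
        else
            s += a[i - 1]; /* steady-state loop body */
    }
    return s;
}
```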

Memory Porting Arbitration

Sometimes PICO will produce a schedule that results in the need for a memory with more than 2 ports. Such memories are not synthesizable in FPGAs. A remedy for this is to specify the following pragma on the array in question and re-synthesize the design:

#pragma forward_boundary_register always

This results in additional logic that arbitrates accesses to the memory, reducing the actual port requirement. However, it does not always work: sometimes arbitration results in mismatches against the golden results in the post-synthesis C simulation or the final RTL simulation, indicating erroneous hardware. Therefore, this option should be used with caution, and both simulations are strongly recommended when it is enabled.

Bibliography

[1] I. Buck, T. Foley, D. Horn, J. Sugerman, K. Fatahalian, M. Houston, and P. Hanrahan. Brook for GPUs: stream computing on graphics hardware. ACM Transactions on Graphics (TOG), 23(3):777–786, 2004.

[2] C. Cathey, J. Bakos, and D. Buell. A Reconfigurable Fabric Exploiting Multilevel Parallelism. In Proceedings of the 14th Annual IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM'06), 2006.

[3] Celoxica, low-latency and accelerated computing solutions for capital markets, Curr. April 2008. http://www.celoxica.com.

[4] C. Chang, J. Wawrzynek, and R. Brodersen. BEE2: a high-end reconfigurable computing system. Design & Test of Computers, IEEE, 22(2):114–125, 2005.

[5] Cray XD1 for Reconfigurable Computing. Technical report, Cray, Inc., 2005. http://www.cray.com/downloads/FPGADatasheet.pdf.

[6] W. Gropp and E. Lusk. User’s Guide for MPICH, a Portable Implementation of MPI. Argonne National Laboratory, 1994.

[7] Impulse Accelerated Technologies. Software Tools for an Accelerated World, Curr. April 2008. http://www.impulsec.com.

[8] InfiniBand Trade Association. The InfiniBand Architecture Specification R1.2, Technical Report, October 2004. http://www.infinibandta.org.

[9] Mentor Graphics. The EDA Technology Leader, Curr. April 2008. http://www.mentor.com.

[10] The Message Passing Interface (MPI) Standard, Curr. April 2008. http://www-unix.mcs.anl.gov/mpi/.

[11] A. Patel, C. Madill, M. Saldana, C. Comis, R. Pomes, and P. Chow. A Scalable FPGA-based Multiprocessor. Proceedings of the 13th Annual IEEE Symposium on Field-Programmable Custom Computing Machines, 71:72, 2006.

[12] M. Saldana and P. Chow. TMD-MPI: An MPI implementation for multiple processors across multiple FPGAs. IEEE 16th International Conference on Field Programmable Logic and Applications, 2006.

[13] M. Saldana, D. Nunes, E. Ramalho, and P. Chow. Configuration and Program- ming of Heterogeneous Multiprocessors on a Multi-FPGA System Using TMD- MPI. In the Proceedings of the 3rd International Conference on Reconfigurable Computing and FPGAs, September 2006.

[14] L. Shannon and P. Chow. Maximizing system performance: using reconfigurability to monitor system communications. In Proceedings of the 2004 IEEE International Conference on Field-Programmable Technology, pages 231–238, 2004.

[15] Synfora, Inc. PICO Express FPGA - PICO RTL: Synthesis, Verification and Integration Guide, 8.01 edition, 2008.

[16] Synfora, Inc. PICO Express FPGA - Writing C Applications: Developer’s Guide, 8.01 edition, 2008.

[17] Synfora, Inc., Curr. April 2008. http://www.synfora.com.

[18] Top 500 Supercomputer Sites, November 2007. http://www.top500.org.

[19] Xilinx, Inc. Fast Simplex Link (FSL) Bus (v2.00a) Product Specification, DS449 edition, December 2005.

[20] Xilinx, Inc. Xilinx University Program Virtex-II Pro Development System Hardware Reference Manual, ug069 (v1.0) edition, March 2005.

[21] Xilinx, Inc., Curr. April 2008. http://www.xilinx.com.

[22] J. Zhu. Solving Partial Differential Equations on Parallel Computers. World Scientific, 1994.
